Rocksolid Light

Welcome to novaBBS (click a section below)

mail  files  register  newsreader  groups  login

Message-ID:  

Brain damage is all in your head. -- Karl Lehenbauer


devel / comp.lang.python / Re: Best practice for caching hash

SubjectAuthor
o Re: Best practice for caching hashChris Angelico

1
Re: Best practice for caching hash

<mailman.316.1647418172.2329.python-list@python.org>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=17532&group=comp.lang.python#17532

  copy link   Newsgroups: comp.lang.python
Path: i2pn2.org!i2pn.org!news.swapon.de!fu-berlin.de!uni-berlin.de!not-for-mail
From: ros...@gmail.com (Chris Angelico)
Newsgroups: comp.lang.python
Subject: Re: Best practice for caching hash
Date: Wed, 16 Mar 2022 19:09:19 +1100
Lines: 108
Message-ID: <mailman.316.1647418172.2329.python-list@python.org>
References: <CAPTjJmo9K9odB1PCKdUc3f4F_noWQb43dRHFhPOE1Rq=3ckXqg@mail.gmail.com>
<YjGV+UvIxF4Z7L7Y@cskk.homeip.net>
<CAPTjJmrwMjiZcu7zMtR31dF2ekFReJhsCMa4zVpQ0RXELafGOQ@mail.gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
X-Trace: news.uni-berlin.de 8ZWOy1A25GLkMeTcB+qdHw4bR45+bvJAkbdY70sn8pDg==
Return-Path: <rosuav@gmail.com>
X-Original-To: python-list@python.org
Delivered-To: python-list@mail.python.org
Authentication-Results: mail.python.org; dkim=pass
reason="2048-bit key; unprotected key"
header.d=gmail.com header.i=@gmail.com header.b=BNfNrNgW;
dkim-adsp=pass; dkim-atps=neutral
X-Spam-Status: OK 0.000
X-Spam-Evidence: '*H*': 1.00; '*S*': 0.00; 'url-ip:140.82/16': 0.03;
'comments': 0.03; '(which': 0.04; '(for': 0.05; '2022': 0.05;
'bunch': 0.05; "python's": 0.05; 'demands': 0.07; 'have,': 0.07;
'mar': 0.07; 'string': 0.07; 'tests': 0.07; '"""': 0.09;
'angelico': 0.09; 'children.': 0.09; 'comparison': 0.09;
'identical': 0.09; 'int': 0.09; 'likewise': 0.09; 'practice.':
0.09; 'tags': 0.09; 'test,': 0.09; 'unlike': 0.09; 'values.':
0.09; 'url:github': 0.14; 'that.': 0.15; 'url-ip:140/8': 0.15;
'(there': 0.16; '_not_': 0.16; 'bits': 0.16; 'both.': 0.16;
'caching': 0.16; 'cameron': 0.16; 'cases,': 0.16; 'chrisa': 0.16;
'collisions': 0.16; 'cpython': 0.16; 'dict': 0.16; 'dicts': 0.16;
'fast,': 0.16; 'flexibility.': 0.16; 'from:addr:rosuav': 0.16;
'from:name:chris angelico': 0.16; 'hash': 0.16; 'inherently':
0.16; 'nested': 0.16; 'objective': 0.16; 'purely': 0.16;
'simpson': 0.16; 'strings,': 0.16; 'tuples': 0.16; 'url:cpython':
0.16; 'use-cases': 0.16; 'varied': 0.16; 'wrote:': 0.16; 'values':
0.17; 'uses': 0.19; 'to:addr:python-list': 0.20; 'all,': 0.20;
'basically': 0.22; 'maybe': 0.22; 'skip:_ 10': 0.22; "what's":
0.22; "i'd": 0.24; 'url-ip:188.166/16': 0.25; 'seems': 0.26;
'behavior': 0.26; 'object': 0.26; 'so.': 0.26; 'suspect': 0.26;
"isn't": 0.27; 'function': 0.27; 'chris': 0.28; 'url-
ip:188.166.48.69/32': 0.28; 'url-ip:188.166.48/24': 0.28;
'url:bugs': 0.28; 'it,': 0.29; 'deep': 0.31; 'url-ip:188/8': 0.31;
'think': 0.32; "doesn't": 0.32; 'objects': 0.32; 'raw': 0.32;
'suitable': 0.32; 'message-id:@mail.gmail.com': 0.32; 'unless':
0.32; 'but': 0.32; 'subject:for': 0.33; 'there': 0.33; 'same':
0.34; 'requires': 0.34; 'header:In-Reply-To:1': 0.34;
'received:google.com': 0.34; 'majority': 0.35; 'one.': 0.35;
'yes,': 0.35; 'from:addr:gmail.com': 0.35; "we're": 0.35; 'cases':
0.36; 'functions': 0.36; 'work,': 0.36; 'people': 0.36; 'source':
0.36; 'necessarily': 0.37; "it's": 0.37; 'received:209.85': 0.37;
'file': 0.38; 'internal': 0.63; 'involved': 0.63; 'skip:o 20':
0.63; 'simply': 0.63; 'everything': 0.63; 'skip:b 10': 0.63;
'range': 0.64; 'key': 0.64; 'down': 0.64; 'remains': 0.64;
'widely': 0.64; 'top': 0.65; 'agreement': 0.65; 'benefit': 0.65;
'similar': 0.65; 'look': 0.65; 'well': 0.65; 'less': 0.65;
'primary': 0.67; 'management': 0.68; 'items': 0.68; 'further':
0.69; 'latter': 0.69; 'slots': 0.69; 'vast': 0.69; 'took': 0.69;
'interesting': 0.71; 'relevant': 0.73; 'effective': 0.78;
'significant': 0.78; 'database': 0.80; 'accurate.': 0.84;
'cheap,': 0.84; 'contrary,': 0.84; 'indexed': 0.84; 'retains':
0.84; 'sure.': 0.84; 'type,': 0.84; 'violent': 0.84; 'badly':
0.91; 'meets': 0.91; 'former': 0.93; 'stable': 0.93
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112;
h=mime-version:references:in-reply-to:from:date:message-id:subject:to;
bh=EGhva8BF8MALVVVIz5o/w84RI8zzcBwPQoNfO411pws=;
b=BNfNrNgWsFbEDtgY8Z25hOeOg4nofVfTVpMxPEGYdVfirqB0c8qcWpNNcXcQS83eaO
RKhQ7xqiPvncGcEK7k+4xYJCdoP8JkTOJ/DddzDlEt3nv6L/8UJ58nW7cWDXfUQCqAFS
OKbhD/qMJtW1K0Hh4UgUQAwdwRsYLQq4aKJ2D8ovKJQ+4wGQ6EPNk7se/DhHH7od+REv
ZwUPJ+dwBH3QgMMryHx3Nq3syimqp/5OfqWnPOotGC08CZTly1ivqg7kEQfGHdk0lgE/
yvJUDrNlvX7gq7LUDw6AyAR4a7aCtB6chfOAf+TLXbofKmVOntThwSOLjQAmkaU04+PD
maSA==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
d=1e100.net; s=20210112;
h=x-gm-message-state:mime-version:references:in-reply-to:from:date
:message-id:subject:to;
bh=EGhva8BF8MALVVVIz5o/w84RI8zzcBwPQoNfO411pws=;
b=zUklNoZV/ONadXoSY1lagJmTgLUzZWxcvGR5K3ZN7IsIksyBdNvA+6+S8sPNkkGuVZ
QMrxA6Kxi1nKRmZ7btseD/tF1LiM4VjXsHkS60FZ4iGhfAlTQrTmEMo2QnF/f2uNOTKg
3JfWw+7CXoFtas6WKdKSwITS3X/qlwQoPqH3Wh48MH8VuyW+O5UOkpT0hsIEOP/MwB9s
BRlnlUk4/2AcnN3Ap2OtJN4e1pBOj3yyRLdm+U5ODjQyoeDLVVUG1uZLYOxfVBWg58O/
vPNuHuQ+9+P++srGPRmC5BKjRnPeG/a3Z1AVjEXYj3/0z7sOwRrcDhT1UOgRYxJw5VJ6
mmvg==
X-Gm-Message-State: AOAM533OwfxyXOx75W8BmTegMv0dwQUAJW8eJnACL6o1XEhF8Qvt+ni0
GmOvOcsexU2s9yZMk6vnmH1UQumLxFT/luvnuP6hdiHpuew=
X-Google-Smtp-Source: ABdhPJzGkfI4XGiIjmLg3Ux5oHqaoajN46j4YXcKQRoU1qvUrYQFOBdFy8R2OtOi0gPhuSCJVGlkGUyndBxma0L8IE0=
X-Received: by 2002:a05:600c:34d4:b0:38c:5ec7:1e38 with SMTP id
d20-20020a05600c34d400b0038c5ec71e38mr3239755wmq.184.1647418170380; Wed, 16
Mar 2022 01:09:30 -0700 (PDT)
In-Reply-To: <YjGV+UvIxF4Z7L7Y@cskk.homeip.net>
X-BeenThere: python-list@python.org
X-Mailman-Version: 2.1.39
Precedence: list
List-Id: General discussion list for the Python programming language
<python-list.python.org>
List-Unsubscribe: <https://mail.python.org/mailman/options/python-list>,
<mailto:python-list-request@python.org?subject=unsubscribe>
List-Archive: <https://mail.python.org/pipermail/python-list/>
List-Post: <mailto:python-list@python.org>
List-Help: <mailto:python-list-request@python.org?subject=help>
List-Subscribe: <https://mail.python.org/mailman/listinfo/python-list>,
<mailto:python-list-request@python.org?subject=subscribe>
X-Mailman-Original-Message-ID: <CAPTjJmrwMjiZcu7zMtR31dF2ekFReJhsCMa4zVpQ0RXELafGOQ@mail.gmail.com>
X-Mailman-Original-References: <CAPTjJmo9K9odB1PCKdUc3f4F_noWQb43dRHFhPOE1Rq=3ckXqg@mail.gmail.com>
<YjGV+UvIxF4Z7L7Y@cskk.homeip.net>
 by: Chris Angelico - Wed, 16 Mar 2022 08:09 UTC

On Wed, 16 Mar 2022 at 18:48, Cameron Simpson <cs@cskk.id.au> wrote:
>
> On 16Mar2022 16:58, Chris Angelico <rosuav@gmail.com> wrote:
> >On Wed, 16 Mar 2022 at 14:00, Cameron Simpson <cs@cskk.id.au> wrote:
> >> On 16Mar2022 10:57, Chris Angelico <rosuav@gmail.com> wrote:
> >> A significant difference is that tuples have no keys, unlike a dict.
> >>
> >> A hash does not have to hash all the internal state, ony the relevant
> >> state, and not even all of that. The objective of the hash is twofold to
> >> my mind:
> >> - that "equal" objects (in the `==` sense) have the same hash, so that
> >> they hash to the same backet in dicts and can therefore be found
> >> - that hash values are widely distributed, to maximise the evenness of
> >> the object distribution in buckets
> >>
> >> For dicts to work, the former has to hold.
> >
> >Python only demands the first one.
>
> I think we're in violent agreement here.
>
> >And Python's own integers hash to
> >themselves, which isn't what I'd call "widely distributed", but which
> >works well in practice.
>
> This may be because the "raw" hash (in this case the int value) is
> itself further hashed (giving more evenness) and then moduloed into the
> finite number of slots in the dict.

Not in a CPython dict, no. (There is some collision management that
uses upper bits, but the initial selection of bucket simply uses the
raw hash.) See comments at the top of dictobject.c.

"""
This isn't necessarily bad! To the contrary, in a table of size 2**i, taking
the low-order i bits as the initial table index is extremely fast, and there
are no collisions at all for dicts indexed by a contiguous range of ints. So
this gives better-than-random behavior in common cases, and that's very
desirable.
"""

> >> The latter has more flexibility. A dict has keys. If the dicts are quite
> >> varied (sets of tags for example), it may be effective to just hash the
> >> keys. But if the dict keys are similar (labelled CSV-rows-as-dicts, or
> >> likewise with database rows) this will go badly because the hashes will
> >> all (or maybe mostly) collide.
> >
> >The general recommendation is to use all the same data for hashing as
> >you use for equality checking. What's the point of just hashing the
> >keys, if the values also contribute to equality? I'd just keep it
> >simple, and hash both.
>
> Well, hash functions want to be fast. The case where you look up a key
> in a dict (which needs both the hash and a comparison) is not the only
> situation. When a dict's bucket array is resized, the hash functions of
> all stored objects require recomputation, but the equality test is _not_
> required. Is the hash is very cheap, that is a large benefit here.

Fast is less important than accurate. Hashing functions are usually
*fast enough*, and since the vast majority of the work of hashing is
usually going to be strings, every other data type can just
recalculate its hash based on its children. Caching the hash of a
string is very useful; caching the hash of a tuple, not so much; again
quoting from the CPython source code:

/* Tests have shown that it's not worth to cache the hash value, see
https://bugs.python.org/issue9685 */

> I just took a look at the CPython hashtable.c, you can see the rehashing
> process here:
>
> https://github.com/python/cpython/blob/7c776521418676c074a483266339d31c950f516e/Python/hashtable.c#L280

Not sure where that file is used; it's purely internal to CPython and
seems to be used for tracemalloc. It's not the dictionary type -
that's in Objects/dictobject.c.

In any case, the rehash operation is basically for after a resize on
the hashtable, and it doesn't rely on the objects caching their hashes
- it retains them in the hashtable.

> Just an aside, that's got a bunch of interesting opimitsations in it,
> people has put serious work into making this fast and it remains very
> readable!
>
> Suppose I have a nested tree of (hashable) frozen dicts of whatever
> flavour. An equality test requires a deep comparison all the way down
> including the values. You could make a much cheaper hash function which
> hashed just keys, yea, even only unto the upper layer or 2, and still
> meet the primary criterion: all equal items have the same hash value.

Yes, but "def __hash__(self): return 42" also meets the primary
criterion. I don't know what use-cases frozendicts have, but I would
suspect that if they are used at all, they'll often be used in cases
where their keys are identical (for instance, the __dict__ of an
immutable object type, where the keys are extremely stable across
objects of the same type).

> My recommendation is not inherently to hash everything involved in the
> equality test. If that's cheap, then sure. But it isn't necessary and
> (given a suitable use domain) it may even be very detrimental to hash
> everything involved in equality, as in the example above.
>

My recommendation is to hash everything involved in the equality test,
unless there's demonstrable benefit from not doing so.

ChrisA

1
server_pubkey.txt

rocksolid light 0.9.8
clearnet tor