Message-ID:

The sooner all the animals are extinct, the sooner we'll find their money. -- Ed Bluestone

devel / comp.lang.python / Re: Best practice for caching hash

Re: Best practice for caching hash

<mailman.315.1647416830.2329.python-list@python.org>

https://www.novabbs.com/devel/article-flat.php?id=17531&group=comp.lang.python#17531

copy link Newsgroups: comp.lang.python

Path: i2pn2.org!i2pn.org!news.swapon.de!fu-berlin.de!uni-berlin.de!not-for-mail
From: cs...@cskk.id.au (Cameron Simpson)
Newsgroups: comp.lang.python
Subject: Re: Best practice for caching hash
Date: Wed, 16 Mar 2022 18:47:05 +1100
Lines: 66
Message-ID: <mailman.315.1647416830.2329.python-list@python.org>
References: <CAPTjJmo9K9odB1PCKdUc3f4F_noWQb43dRHFhPOE1Rq=3ckXqg@mail.gmail.com>
<YjGV+UvIxF4Z7L7Y@cskk.homeip.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Trace: news.uni-berlin.de GHnE2B5/Sir9SqutxAcu3AZVItuc79Cj58ww+I8b15uA==
Return-Path: <cameron@cskk.id.au>
X-Original-To: python-list@python.org
Delivered-To: python-list@mail.python.org
Authentication-Results: mail.python.org; dkim=none reason="no signature";
dkim-adsp=none (unprotected policy); dkim-atps=neutral
X-Spam-Status: OK 0.001
X-Spam-Evidence: '*H*': 1.00; '*S*': 0.00; 'url-ip:140.82/16': 0.03;
'(which': 0.04; '2022': 0.05; 'bunch': 0.05; "python's": 0.05;
'demands': 0.07; 'mar': 0.07; 'angelico': 0.09; 'comparison':
0.09; 'int': 0.09; 'likewise': 0.09; 'practice.': 0.09; 'tags':
0.09; 'unlike': 0.09; 'values.': 0.09; 'cheers,': 0.11;
'url:github': 0.14; 'that.': 0.15; 'url-ip:140/8': 0.15; '_not_':
0.16; 'both.': 0.16; 'cameron': 0.16; 'cpython': 0.16; 'dict':
0.16; 'dicts': 0.16; 'flexibility.': 0.16; 'from:addr:cs': 0.16;
'from:addr:cskk.id.au': 0.16; 'from:name:cameron simpson': 0.16;
'hash': 0.16; 'inherently': 0.16; 'message-id:@cskk.homeip.net':
0.16; 'nested': 0.16; 'objective': 0.16; 'received:13.237': 0.16;
'received:13.237.201': 0.16; 'received:13.237.201.189': 0.16;
'received:cskk.id.au': 0.16; 'received:id.au': 0.16;
'received:mail.cskk.id.au': 0.16; 'simpson': 0.16; 'tuples': 0.16;
'url:cpython': 0.16; 'varied': 0.16; 'wrote:': 0.16; 'values':
0.17; 'to:addr:python-list': 0.20; 'maybe': 0.22; "what's": 0.22;
"i'd": 0.24; 'object': 0.26; "isn't": 0.27; 'function': 0.27;
'chris': 0.28; 'it,': 0.29; 'header:User-Agent:1': 0.30; 'deep':
0.31; 'think': 0.32; 'objects': 0.32; 'suitable': 0.32; 'but':
0.32; 'subject:for': 0.33; 'same': 0.34; 'requires': 0.34; 'header
:In-Reply-To:1': 0.34; 'one.': 0.35; 'received:au': 0.35; "we're":
0.35; 'functions': 0.36; 'work,': 0.36; 'people': 0.36; 'way':
0.38; 'could': 0.38; 'put': 0.38; 'quite': 0.39; 'necessary':
0.39; 'use': 0.39; 'wed,': 0.39; 'still': 0.40; 'serious': 0.40;
'both': 0.40; 'want': 0.40; 'including': 0.60; 'here.': 0.61;
'skip:o 10': 0.61; 'internal': 0.63; 'involved': 0.63;
'everything': 0.63; 'key': 0.64; 'down': 0.64; 'received:13':
0.64; 'remains': 0.64; 'widely': 0.64; 'agreement': 0.65;
'benefit': 0.65; 'similar': 0.65; 'look': 0.65; 'well': 0.65;
'received:userid': 0.66; 'primary': 0.67; 'items': 0.68;
'further': 0.69; 'latter': 0.69; 'slots': 0.69; 'took': 0.69;
'interesting': 0.71; 'relevant': 0.73; 'effective': 0.78;
'significant': 0.78; 'database': 0.80; 'cheap,': 0.84; 'sure.':
0.84; 'violent': 0.84; 'badly': 0.91; 'former': 0.93
Mail-Followup-To: Python List <python-list@python.org>
Content-Disposition: inline
In-Reply-To: <CAPTjJmo9K9odB1PCKdUc3f4F_noWQb43dRHFhPOE1Rq=3ckXqg@mail.gmail.com>
User-Agent: Mutt/2.2.1 (2022-02-19)
X-BeenThere: python-list@python.org
X-Mailman-Version: 2.1.39
Precedence: list
List-Id: General discussion list for the Python programming language
<python-list.python.org>
List-Unsubscribe: <https://mail.python.org/mailman/options/python-list>,
<mailto:python-list-request@python.org?subject=unsubscribe>
List-Archive: <https://mail.python.org/pipermail/python-list/>
List-Post: <mailto:python-list@python.org>
List-Help: <mailto:python-list-request@python.org?subject=help>
List-Subscribe: <https://mail.python.org/mailman/listinfo/python-list>,
<mailto:python-list-request@python.org?subject=subscribe>
X-Mailman-Original-Message-ID: <YjGV+UvIxF4Z7L7Y@cskk.homeip.net>
X-Mailman-Original-References: <CAPTjJmo9K9odB1PCKdUc3f4F_noWQb43dRHFhPOE1Rq=3ckXqg@mail.gmail.com>

by: Cameron Simpson - Wed, 16 Mar 2022 07:47 UTC

On 16Mar2022 16:58, Chris Angelico <rosuav@gmail.com> wrote:
>On Wed, 16 Mar 2022 at 14:00, Cameron Simpson <cs@cskk.id.au> wrote:
>> On 16Mar2022 10:57, Chris Angelico <rosuav@gmail.com> wrote:
>> A significant difference is that tuples have no keys, unlike a dict.
>>
>> A hash does not have to hash all the internal state, ony the relevant
>> state, and not even all of that. The objective of the hash is twofold to
>> my mind:
>> - that "equal" objects (in the `==` sense) have the same hash, so that
>> they hash to the same backet in dicts and can therefore be found
>> - that hash values are widely distributed, to maximise the evenness of
>> the object distribution in buckets
>>
>> For dicts to work, the former has to hold.
>
>Python only demands the first one.

I think we're in violent agreement here.

>And Python's own integers hash to
>themselves, which isn't what I'd call "widely distributed", but which
>works well in practice.

This may be because the "raw" hash (in this case the int value) is
itself further hashed (giving more evenness) and then moduloed into the
finite number of slots in the dict.

>> The latter has more flexibility. A dict has keys. If the dicts are quite
>> varied (sets of tags for example), it may be effective to just hash the
>> keys. But if the dict keys are similar (labelled CSV-rows-as-dicts, or
>> likewise with database rows) this will go badly because the hashes will
>> all (or maybe mostly) collide.
>
>The general recommendation is to use all the same data for hashing as
>you use for equality checking. What's the point of just hashing the
>keys, if the values also contribute to equality? I'd just keep it
>simple, and hash both.

Well, hash functions want to be fast. The case where you look up a key
in a dict (which needs both the hash and a comparison) is not the only
situation. When a dict's bucket array is resized, the hash functions of
all stored objects require recomputation, but the equality test is _not_
required. Is the hash is very cheap, that is a large benefit here.

I just took a look at the CPython hashtable.c, you can see the rehashing
process here:

https://github.com/python/cpython/blob/7c776521418676c074a483266339d31c950f516e/Python/hashtable.c#L280

Just an aside, that's got a bunch of interesting opimitsations in it,
people has put serious work into making this fast and it remains very
readable!

Suppose I have a nested tree of (hashable) frozen dicts of whatever
flavour. An equality test requires a deep comparison all the way down
including the values. You could make a much cheaper hash function which
hashed just keys, yea, even only unto the upper layer or 2, and still
meet the primary criterion: all equal items have the same hash value.

My recommendation is not inherently to hash everything involved in the
equality test. If that's cheap, then sure. But it isn't necessary and
(given a suitable use domain) it may even be very detrimental to hash
everything involved in equality, as in the example above.

Cheers,
Cameron Simpson <cs@cskk.id.au>

Subject	Author
Re: Best practice for caching hash	Cameron Simpson