Rocksolid Light

Welcome to novaBBS (click a section below)

mail  files  register  newsreader  groups  login

Message-ID:  

The sooner all the animals are extinct, the sooner we'll find their money. -- Ed Bluestone


devel / comp.lang.python / Re: Best practice for caching hash

SubjectAuthor
o Re: Best practice for caching hashCameron Simpson

1
Re: Best practice for caching hash

<mailman.315.1647416830.2329.python-list@python.org>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=17531&group=comp.lang.python#17531

  copy link   Newsgroups: comp.lang.python
Path: i2pn2.org!i2pn.org!news.swapon.de!fu-berlin.de!uni-berlin.de!not-for-mail
From: cs...@cskk.id.au (Cameron Simpson)
Newsgroups: comp.lang.python
Subject: Re: Best practice for caching hash
Date: Wed, 16 Mar 2022 18:47:05 +1100
Lines: 66
Message-ID: <mailman.315.1647416830.2329.python-list@python.org>
References: <CAPTjJmo9K9odB1PCKdUc3f4F_noWQb43dRHFhPOE1Rq=3ckXqg@mail.gmail.com>
<YjGV+UvIxF4Z7L7Y@cskk.homeip.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Trace: news.uni-berlin.de GHnE2B5/Sir9SqutxAcu3AZVItuc79Cj58ww+I8b15uA==
Return-Path: <cameron@cskk.id.au>
X-Original-To: python-list@python.org
Delivered-To: python-list@mail.python.org
Authentication-Results: mail.python.org; dkim=none reason="no signature";
dkim-adsp=none (unprotected policy); dkim-atps=neutral
X-Spam-Status: OK 0.001
X-Spam-Evidence: '*H*': 1.00; '*S*': 0.00; 'url-ip:140.82/16': 0.03;
'(which': 0.04; '2022': 0.05; 'bunch': 0.05; "python's": 0.05;
'demands': 0.07; 'mar': 0.07; 'angelico': 0.09; 'comparison':
0.09; 'int': 0.09; 'likewise': 0.09; 'practice.': 0.09; 'tags':
0.09; 'unlike': 0.09; 'values.': 0.09; 'cheers,': 0.11;
'url:github': 0.14; 'that.': 0.15; 'url-ip:140/8': 0.15; '_not_':
0.16; 'both.': 0.16; 'cameron': 0.16; 'cpython': 0.16; 'dict':
0.16; 'dicts': 0.16; 'flexibility.': 0.16; 'from:addr:cs': 0.16;
'from:addr:cskk.id.au': 0.16; 'from:name:cameron simpson': 0.16;
'hash': 0.16; 'inherently': 0.16; 'message-id:@cskk.homeip.net':
0.16; 'nested': 0.16; 'objective': 0.16; 'received:13.237': 0.16;
'received:13.237.201': 0.16; 'received:13.237.201.189': 0.16;
'received:cskk.id.au': 0.16; 'received:id.au': 0.16;
'received:mail.cskk.id.au': 0.16; 'simpson': 0.16; 'tuples': 0.16;
'url:cpython': 0.16; 'varied': 0.16; 'wrote:': 0.16; 'values':
0.17; 'to:addr:python-list': 0.20; 'maybe': 0.22; "what's": 0.22;
"i'd": 0.24; 'object': 0.26; "isn't": 0.27; 'function': 0.27;
'chris': 0.28; 'it,': 0.29; 'header:User-Agent:1': 0.30; 'deep':
0.31; 'think': 0.32; 'objects': 0.32; 'suitable': 0.32; 'but':
0.32; 'subject:for': 0.33; 'same': 0.34; 'requires': 0.34; 'header
:In-Reply-To:1': 0.34; 'one.': 0.35; 'received:au': 0.35; "we're":
0.35; 'functions': 0.36; 'work,': 0.36; 'people': 0.36; 'way':
0.38; 'could': 0.38; 'put': 0.38; 'quite': 0.39; 'necessary':
0.39; 'use': 0.39; 'wed,': 0.39; 'still': 0.40; 'serious': 0.40;
'both': 0.40; 'want': 0.40; 'including': 0.60; 'here.': 0.61;
'skip:o 10': 0.61; 'internal': 0.63; 'involved': 0.63;
'everything': 0.63; 'key': 0.64; 'down': 0.64; 'received:13':
0.64; 'remains': 0.64; 'widely': 0.64; 'agreement': 0.65;
'benefit': 0.65; 'similar': 0.65; 'look': 0.65; 'well': 0.65;
'received:userid': 0.66; 'primary': 0.67; 'items': 0.68;
'further': 0.69; 'latter': 0.69; 'slots': 0.69; 'took': 0.69;
'interesting': 0.71; 'relevant': 0.73; 'effective': 0.78;
'significant': 0.78; 'database': 0.80; 'cheap,': 0.84; 'sure.':
0.84; 'violent': 0.84; 'badly': 0.91; 'former': 0.93
Mail-Followup-To: Python List <python-list@python.org>
Content-Disposition: inline
In-Reply-To: <CAPTjJmo9K9odB1PCKdUc3f4F_noWQb43dRHFhPOE1Rq=3ckXqg@mail.gmail.com>
User-Agent: Mutt/2.2.1 (2022-02-19)
X-BeenThere: python-list@python.org
X-Mailman-Version: 2.1.39
Precedence: list
List-Id: General discussion list for the Python programming language
<python-list.python.org>
List-Unsubscribe: <https://mail.python.org/mailman/options/python-list>,
<mailto:python-list-request@python.org?subject=unsubscribe>
List-Archive: <https://mail.python.org/pipermail/python-list/>
List-Post: <mailto:python-list@python.org>
List-Help: <mailto:python-list-request@python.org?subject=help>
List-Subscribe: <https://mail.python.org/mailman/listinfo/python-list>,
<mailto:python-list-request@python.org?subject=subscribe>
X-Mailman-Original-Message-ID: <YjGV+UvIxF4Z7L7Y@cskk.homeip.net>
X-Mailman-Original-References: <CAPTjJmo9K9odB1PCKdUc3f4F_noWQb43dRHFhPOE1Rq=3ckXqg@mail.gmail.com>
 by: Cameron Simpson - Wed, 16 Mar 2022 07:47 UTC

On 16Mar2022 16:58, Chris Angelico <rosuav@gmail.com> wrote:
>On Wed, 16 Mar 2022 at 14:00, Cameron Simpson <cs@cskk.id.au> wrote:
>> On 16Mar2022 10:57, Chris Angelico <rosuav@gmail.com> wrote:
>> A significant difference is that tuples have no keys, unlike a dict.
>>
>> A hash does not have to hash all the internal state, ony the relevant
>> state, and not even all of that. The objective of the hash is twofold to
>> my mind:
>> - that "equal" objects (in the `==` sense) have the same hash, so that
>> they hash to the same backet in dicts and can therefore be found
>> - that hash values are widely distributed, to maximise the evenness of
>> the object distribution in buckets
>>
>> For dicts to work, the former has to hold.
>
>Python only demands the first one.

I think we're in violent agreement here.

>And Python's own integers hash to
>themselves, which isn't what I'd call "widely distributed", but which
>works well in practice.

This may be because the "raw" hash (in this case the int value) is
itself further hashed (giving more evenness) and then moduloed into the
finite number of slots in the dict.

>> The latter has more flexibility. A dict has keys. If the dicts are quite
>> varied (sets of tags for example), it may be effective to just hash the
>> keys. But if the dict keys are similar (labelled CSV-rows-as-dicts, or
>> likewise with database rows) this will go badly because the hashes will
>> all (or maybe mostly) collide.
>
>The general recommendation is to use all the same data for hashing as
>you use for equality checking. What's the point of just hashing the
>keys, if the values also contribute to equality? I'd just keep it
>simple, and hash both.

Well, hash functions want to be fast. The case where you look up a key
in a dict (which needs both the hash and a comparison) is not the only
situation. When a dict's bucket array is resized, the hash functions of
all stored objects require recomputation, but the equality test is _not_
required. Is the hash is very cheap, that is a large benefit here.

I just took a look at the CPython hashtable.c, you can see the rehashing
process here:

https://github.com/python/cpython/blob/7c776521418676c074a483266339d31c950f516e/Python/hashtable.c#L280

Just an aside, that's got a bunch of interesting opimitsations in it,
people has put serious work into making this fast and it remains very
readable!

Suppose I have a nested tree of (hashable) frozen dicts of whatever
flavour. An equality test requires a deep comparison all the way down
including the values. You could make a much cheaper hash function which
hashed just keys, yea, even only unto the upper layer or 2, and still
meet the primary criterion: all equal items have the same hash value.

My recommendation is not inherently to hash everything involved in the
equality test. If that's cheap, then sure. But it isn't necessary and
(given a suitable use domain) it may even be very detrimental to hash
everything involved in equality, as in the example above.

Cheers,
Cameron Simpson <cs@cskk.id.au>

1
server_pubkey.txt

rocksolid light 0.9.8
clearnet tor