Skip to content

gh-149807: Fix hash(frozendict): compute (key, value) pair hash#149841

Merged
vstinner merged 7 commits into
python:mainfrom
vstinner:frozendict_pair_hash
May 20, 2026
Merged

gh-149807: Fix hash(frozendict): compute (key, value) pair hash#149841
vstinner merged 7 commits into
python:mainfrom
vstinner:frozendict_pair_hash

Conversation

@vstinner
Copy link
Copy Markdown
Member

@vstinner vstinner commented May 14, 2026

@tim-one
Copy link
Copy Markdown
Member

tim-one commented May 14, 2026

It's good to avoid the frozenset hash code. It's not a good hash function. You can check this by constructing subsets of [i/2**n for i in range(2**n)[. The hashes of those elements vary in only the high-order bits, and the frozenset hash function is poor at "avalanching" high-order changes to lower-order bits. It's good in the other direction, though. For frozensets of low-precision floats, collisions are far too common.

This showed up when trying to construct "bad cases" for the xxHash-based tuple hashing. Raymond was made aware of it, but never got around to "doing something" about it.

No idea how the Boost-inspired scheme would work. Its scrambler does do some high-to-low propagation (via right shifts), but xxHash's rotate is best-of-all (and we took care to ensure that all major compilers did emit a "rotate" instruction instead of the longer-winded portable C spelling we use).

Long story short: properly validating a compound hash function in the context of how it plays with CPython's hash results for primitive types (which, apart from string hashes, make no attempt at creating "random-looking" results) can require weeks of work.

I can't make time for that, and have less than no interest in doing it again anyway ;-) I do have confidence in the tuple hashing approach - which was hard won.

@lambda-abstraction
Copy link
Copy Markdown

what about changing the starting hash to avoid the collision with frozenset(frozendict.items())? im not sure if its realistic for a single set/dict to use both frozensets and frozendicts as keys, but i dont think making the hash of a frozendict just exactly identical to the frozenset of entries is perfect

@tim-one
Copy link
Copy Markdown
Member

tim-one commented May 15, 2026

I don't expect it matters. The context is unlikely, and it wouldn't make much difference if it crops up. Collisions are actually pretty cheap on their own! Comparing objects of very different types for equality typically returns False at once.

What does matter is "pileup": the number of distinct objects that all have the same hashcode. That leads to long collision chains, which kill hash-based performance. In the absence of that, no number of "just pairs" that collide can slow things down much.

OTOH, I have no objection either to starting with different seeds.

@vstinner
Copy link
Copy Markdown
Member Author

This change basically implements hash(frozendict) as hash(frozenset(frozendict.items())) without having to create a concrete frozenset object. It reuses hash(tuple) and hash(frozenset) code, it doesn't invent a new hashing function.

hash(frozenset(frozendict.items())) is used by other frozendict implementation such as https://pypi.org/project/frozendict/ : see the __hash__() method.

It's good to avoid the frozenset hash code. It's not a good hash function.

I reused hash(frozenset) code since it fits well frozendict needs: hash an unordered set of items.

If someone proposes a better hash function for frozenset, the frozendict can be update to reuse it. For me, that's out of the scope of the issue gh-149807 bug report.

@vstinner
Copy link
Copy Markdown
Member Author

cc @corona10

@tim-one
Copy link
Copy Markdown
Member

tim-one commented May 15, 2026

Ya, but the code is getting ever more cryptic and mysterious as bit-fiddling tricks got copied from one module to another.

The cardinal sin of frozendict's frozenset's hashing is its feeble _shuffle_bits function., which is also cop;ied. Multiplication and left shift can only propagate bit changes "to the left". High-order bit changes never propagate to "the right", Which is why it's a disaster for sets of simple floats, whose hashes differ only in the high-order bits.

Comments in the original correctly point out that it's aimed at propagating low-order bit changes to higher-order bits, but is blind to that ;propagation in the other direction is also important.

In the forzendict context, that doesn't matter, because xxHash already does a good job of propagating changes in both directions. Indeed, calling _shuffle_bits on top of that is almost certainly a waste of cycles.

Copy link
Copy Markdown
Contributor

@MojoVampire MojoVampire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The include of pycore_tuple should have its comments updated to specify the additional constants you're borrowing from it (since it's not just _PyTuple_Recycle anymore).

Inline comments on performance and spec compliance.

Comment thread Objects/dictobject.c Outdated
Comment thread Objects/dictobject.c Outdated
Comment thread Lib/test/test_dict.py
@vstinner vstinner dismissed MojoVampire’s stale review May 18, 2026 12:46

@MojoVampire:

The include of pycore_tuple should have its comments updated to specify the additional constants you're borrowing from it (since it's not just _PyTuple_Recycle anymore).

#include in C only have to mention one included symbol, so later this symbol can be checked to decide if the include is still useless or if it can be removed. We don't mention all included symbols.

I addressed your other comments.

@vstinner
Copy link
Copy Markdown
Member Author

I plan to merge this change this week. It adds more tests and reduces hash collisions. So it's better than the current code.

@tim-one: I understand that frozenset hash implementation can be enhanced. Would you mind to open an issue for that? frozendict hash can be updated at the same time, since it reuses the same code.

Copy link
Copy Markdown
Contributor

@MojoVampire MojoVampire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside from that sign-compare warning that looks like a fairly trivial fix (just remove the cast, int to Py_hash_t is fine?), looks good.

@vstinner
Copy link
Copy Markdown
Member Author

Aside from that sign-compare warning that looks like a fairly trivial fix (just remove the cast, int to Py_hash_t is fine?), looks good.

I fixed the warning by removing the useless cast to Py_uhash_t.

@tim-one
Copy link
Copy Markdown
Member

tim-one commented May 20, 2026

Would you mind to open an issue for that? frozendict hash can be updated at the same time, since it reuses the same code.

Sure! Overdue. Here it is:

#150134

@vstinner vstinner merged commit 2443001 into python:main May 20, 2026
55 checks passed
@vstinner vstinner deleted the frozendict_pair_hash branch May 20, 2026 11:22
@vstinner vstinner added the needs backport to 3.15 pre-release feature fixes, bugs and security fixes label May 20, 2026
@miss-islington-app
Copy link
Copy Markdown

Thanks @vstinner for the PR 🌮🎉.. I'm working now to backport this PR to: 3.15.
🐍🍒⛏🤖

@bedevere-app
Copy link
Copy Markdown

bedevere-app Bot commented May 20, 2026

GH-150149 is a backport of this pull request to the 3.15 branch.

@bedevere-app bedevere-app Bot removed the needs backport to 3.15 pre-release feature fixes, bugs and security fixes label May 20, 2026
vstinner added a commit that referenced this pull request May 20, 2026
GH-149841) (#150149)

gh-149807: Fix hash(frozendict): compute (key, value) pair hash (GH-149841)
(cherry picked from commit 2443001)

Co-authored-by: Victor Stinner <vstinner@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants