gh-149807: Fix hash(frozendict): compute (key, value) pair hash #149841
|
It's good to avoid the frozenset hash code. It's not a good hash function. You can check this by constructing subsets of This showed up when trying to construct "bad cases" for the xxHash-based tuple hashing. Raymond was made aware of it, but never got around to "doing something" about it. No idea how the Boost-inspired scheme would work. Its scrambler does do some high-to-low propagation (via right shifts), but xxHash's rotate is best-of-all (and we took care to ensure that all major compilers did emit a "rotate" instruction instead of the longer-winded portable C spelling we use). Long story short: properly validating a compound hash function in the context of how it plays with CPython's hash results for primitive types (which, apart from string hashes, make no attempt at creating "random-looking" results) can require weeks of work. I can't make time for that, and have less than no interest in doing it again anyway ;-) I do have confidence in the tuple hashing approach - which was hard won. |
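For readers who have not seen the tuple hashing approach mentioned above, here is a rough Python sketch of an xxHash-style accumulate / rotate / multiply loop. The constants and structure are modeled on CPython's 64-bit tuple hash as I understand it; the names `rotl64` and `xx_combine` are illustrative assumptions, and this is not the C code touched by this PR.

```python
MASK64 = (1 << 64) - 1

# 64-bit xxHash primes (assumed to be the ones used for tuple hashing).
XXPRIME_1 = 11400714785074694791
XXPRIME_2 = 14029467366897019727
XXPRIME_5 = 2870177450012600261


def rotl64(x, r=31):
    """Rotate a 64-bit value left by r bits."""
    x &= MASK64
    return ((x << r) | (x >> (64 - r))) & MASK64


def xx_combine(lanes):
    """Fold a sequence of per-item hashes into one 64-bit value, in order."""
    acc = XXPRIME_5
    count = 0
    for lane in lanes:
        # Multiply spreads low bits upward; the rotate then brings
        # high bits back down before the next multiply.
        acc = (acc + (lane & MASK64) * XXPRIME_2) & MASK64
        acc = rotl64(acc)
        acc = (acc * XXPRIME_1) & MASK64
        count += 1
    # Mix in the length so sequences of different sizes diverge.
    acc = (acc + (count ^ (XXPRIME_5 ^ 3527539))) & MASK64
    return acc


if __name__ == "__main__":
    print(hex(xx_combine(hash(x) for x in (1, 2, 3))))
```

The result here is an unsigned 64-bit integer; a real implementation would still have to map it to a signed hash value and avoid the reserved -1.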
|
what about changing the starting hash to avoid the collision with |
|
I don't expect it matters. The context is unlikely, and it wouldn't make much difference if it crops up. Collisions are actually pretty cheap on their own! Comparing objects of very different types for equality typically returns What does matter is "pileup": the number of distinct objects that all have the same hashcode. That leads to long collision chains, which kill hash-based performance. In the absence of that, no number of "just pairs" that collide can slow things down much. OTOH, I have no objection either to starting with different seeds. |
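To make the difference between isolated collisions and pileup concrete, here is a small Python sketch. The `Keyed` class and the helper names are hypothetical, made up purely for illustration: it builds one population where hash codes collide only in pairs and one where every object shares a single hash code, then counts how many `__eq__` calls building a set costs in each case.

```python
from collections import Counter


class Keyed:
    """Hypothetical object whose hash code is chosen by the caller."""

    eq_calls = 0  # class-wide counter of equality comparisons

    def __init__(self, tag, code):
        self.tag = tag
        self.code = code

    def __hash__(self):
        return self.code

    def __eq__(self, other):
        Keyed.eq_calls += 1
        return isinstance(other, Keyed) and self.tag == other.tag


def pileup(objects):
    """Size of the largest group of objects sharing one hash code."""
    counts = Counter(hash(obj) for obj in objects)
    return max(counts.values(), default=0)


def eq_calls_to_build_set(objects):
    """Number of __eq__ calls made while inserting all objects into a set."""
    Keyed.eq_calls = 0
    set(objects)
    return Keyed.eq_calls


# 150 colliding pairs: each hash code is shared by exactly two objects.
pairs = [Keyed(i, i // 2) for i in range(300)]
# One long chain: all 300 objects share the same hash code.
chain = [Keyed(i, 0) for i in range(300)]

if __name__ == "__main__":
    print("pairs:", pileup(pairs), eq_calls_to_build_set(pairs))
    print("chain:", pileup(chain), eq_calls_to_build_set(chain))
```

Both populations contain many collisions, but only the second one forces each insertion to compare against a growing chain of earlier entries.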
|
This change basically implements
I reused If someone proposes a better hash function for |
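As a rough Python picture of why hashing each (key, value) pair as a unit helps, consider the sketch below. It is an assumption-laden illustration, not the C code in this PR: `naive_hash` is a hypothetical scheme that mixes key hashes and value hashes independently, and `pair_hash` leans on `frozenset` over the items, with plain dicts standing in for frozendict.

```python
from functools import reduce
from operator import xor


def naive_hash(mapping):
    """Hypothetical scheme: XOR all key hashes and value hashes together."""
    return reduce(xor, (hash(k) ^ hash(v) for k, v in mapping.items()), 0)


def pair_hash(mapping):
    """Order-independent hash computed over whole (key, value) pairs."""
    return hash(frozenset(mapping.items()))


a = {"x": 1, "y": 2}
b = {"x": 2, "y": 1}  # same keys as `a`, values swapped
c = {"y": 2, "x": 1}  # same items as `a`, different insertion order

if __name__ == "__main__":
    print("naive collides on swapped values:", naive_hash(a) == naive_hash(b))
    print("pair hash ignores insertion order:", pair_hash(a) == pair_hash(c))
```

The naive scheme cannot tell which value belongs to which key, so any permutation of values among the same keys collides; hashing the pair first binds each value to its key before the order-independent combine.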
|
cc @corona10 |
|
Ya, but the code is getting ever more cryptic and mysterious as bit-fiddling tricks get copied from one module to another. The cardinal sin of Comments in the original correctly point out that it's aimed at propagating low-order bit changes to higher-order bits, but is blind to the fact that propagation in the other direction is also important. In the frozendict context, that doesn't matter, because xxHash already does a good job of propagating changes in both directions. Indeed, calling |
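The point about propagation direction can be checked with a small generic Python sketch. It is not the frozenset code under discussion, and the multiplier and helper names are assumptions chosen for illustration. Multiplying by an odd constant modulo 2**64 only pushes a flipped input bit upward, so output bits below the flipped position never change; rotating first lets a high-order flip reach low-order output bits.

```python
MASK64 = (1 << 64) - 1
ODD_MULTIPLIER = 11400714785074694791  # any odd 64-bit constant would do


def mul64(x):
    """Multiply by an odd constant modulo 2**64."""
    return (x * ODD_MULTIPLIER) & MASK64


def rotl64(x, r):
    """Rotate a 64-bit value left by r bits."""
    x &= MASK64
    return ((x << r) | (x >> (64 - r))) & MASK64


def rotate_then_mul(x):
    """Rotate left by 31 bits, then multiply."""
    return mul64(rotl64(x, 31))


def lowest_changed_bit(f, x, bit):
    """Index of the lowest output bit that changes when input `bit` flips."""
    diff = f(x) ^ f(x ^ (1 << bit))
    return (diff & -diff).bit_length() - 1


if __name__ == "__main__":
    x = 12345
    # Multiply alone: flipping bit 63 can only affect output bit 63.
    print(lowest_changed_bit(mul64, x, 63))
    # Rotate first: the same flip now reaches much lower output bits.
    print(lowest_changed_bit(rotate_then_mul, x, 63))
```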
MojoVampire
left a comment
The include of pycore_tuple should have its comments updated to specify the additional constants you're borrowing from it (since it's not just _PyTuple_Recycle anymore).
Inline comments on performance and spec compliance.
The include of pycore_tuple should have its comments updated to specify the additional constants you're borrowing from it (since it's not just _PyTuple_Recycle anymore).
A comment on an #include in C only has to mention one included symbol, so that this symbol can be checked later to decide whether the include is still needed or can be removed. We don't mention all included symbols.
I addressed your other comments.
|
I plan to merge this change this week. It adds more tests and reduces hash collisions, so it's better than the current code. @tim-one: I understand that the frozenset hash implementation can be enhanced. Would you mind opening an issue for that? The frozendict hash can be updated at the same time, since it reuses the same code. |
MojoVampire
left a comment
Aside from that sign-compare warning that looks like a fairly trivial fix (just remove the cast, int to Py_hash_t is fine?), looks good.
I fixed the warning by removing the useless cast to |
Sure! Overdue. Here it is: |
|
Thanks @vstinner for the PR 🌮🎉.. I'm working now to backport this PR to: 3.15. |
|
GH-150149 is a backport of this pull request to the 3.15 branch. |
frozendict_hash doesn't match the PEP and might have too many collisions #149807