I have a situation where I need to sort the keys of a dictionary so that the resulting list always has the same order. I don't care what that order is; I only need it to be consistent. This is causing difficulties for people using a package I've written: they want to set a random seed and get consistent results for testing purposes, but even when they set a seed, the order in which the dictionary returns its values changes, and this ends up affecting the randomization.
I would like to do something like sorted(D.keys()). However, in principle the keys may not be sortable, and I don't have any control over what those keys are.
I know this is solved in recent versions of Python, but I don't want to restrict users to 3.7+.
More detail here: https://github.com/springer-math/Mathematics-of-Epidemics-on-Networks/issues/22#issuecomment-631219800
Since people think this is a bad question, here's more detail (or see the original link I provided for even more detail):
I have written an algorithm to do stochastic epidemic simulations. For reproducibility purposes people would like to specify a seed and have it run and produce the same outcome on any machine. The algorithm uses a networkx graph (which is built on a dictionary structure).
As part of its steps it needs to perform a weighted selection from the edges of the graph. This requires putting the edges into a list. If the list is in a different order, then different outcomes occur even when the same seed is used.
So I'm left with finding a way to put the list of edges in a consistent order on any machine.
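To make the issue concrete, here is a minimal sketch (a made-up three-edge graph; random.choices stands in for the package's actual weighted-selection step):
import random
import networkx as nx

# Minimal stand-in graph; in the real package the graph and the
# per-edge weights come from the epidemic model.
G = nx.Graph()
G.add_edge("a", "b", weight=1.0)
G.add_edge("b", "c", weight=2.0)
G.add_edge("a", "c", weight=3.0)

random.seed(42)
edges = list(G.edges())                        # order depends on dict iteration
weights = [G[u][v]["weight"] for u, v in edges]
chosen = random.choices(edges, weights=weights, k=1)[0]
# With the same seed, a different ordering of `edges` can yield a
# different chosen edge, so runs on different machines diverge.
print(chosen)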
So if I understand correctly... the task is to consistently select the same random key from a vanilla dict on an old version of Python where insertion order is not preserved, while knowing nothing at all about the key type, and without setting the hash seed explicitly. I believe that's impossible in the general case, because the whole notion of object "identity" does not even exist with such restrictive assumptions.
The only thing that comes to mind is to serialize the keys in some way, and sort their serialized forms. pickle.dumps should work on most key types (although not everything can be pickled). But if the key type does allow sorting, it's probably more robust to simply use that instead.
import pickle

try:
    sorted_keys = sorted(my_dict)
except TypeError:
    sorted_keys = sorted(my_dict, key=lambda x: pickle.dumps(x, protocol=3))
There are some caveats though:
The pickled representation is not the same across Python versions (see Data stream format). That's why I'm setting protocol=3 in the example above; this should work the same for Python 3.0 and newer, although it doesn't support as many object types as protocol 4.
Objects can define their own pickling, so there is no guarantee that it's reproducible. In particular...
The pickled representation of dictionaries is still dependent on the ordering of the dictionary, which depends on the Python version and the hash seed. The same goes for objects, because by default these are pickled via their __dict__ attribute.
If you want to get really tricky, maybe you can create a custom Pickler that sorts dictionaries (using OrderedDict for portability across Python versions) before pickling them... but in the end, it's not going to solve the full problem.
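If you want to experiment with that idea, here is a rough sketch of the normalisation step, without subclassing Pickler: recursively replace every dict with an OrderedDict whose items are sorted by their own pickled form, and only then pickle the whole structure. This is only an illustration of the approach described above, and it inherits all the caveats already mentioned.
import pickle
from collections import OrderedDict

def _normalize(obj):
    # Recursively turn dicts into OrderedDicts with a deterministic item order.
    if isinstance(obj, dict):
        items = [(_normalize(k), _normalize(v)) for k, v in obj.items()]
        items.sort(key=lambda kv: pickle.dumps(kv[0], protocol=3))
        return OrderedDict(items)
    if isinstance(obj, (list, tuple)):
        return type(obj)(_normalize(x) for x in obj)
    return obj

def stable_dumps(obj):
    # Pickle a normalized copy, so dict ordering no longer affects the bytes.
    return pickle.dumps(_normalize(obj), protocol=3)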
If you only need a consistent order, one possibility is to record insertion order as a list of pairs. If you combine that with an OrderedDict you preserve the order and still have dictionary functionality.
>>> import collections
>>> d = collections.OrderedDict([(1,'a'),(3,'2')])
>>> d.keys()
odict_keys([1, 3])
Related
I am trying to migrate a Python 2 dictionary to Python 3.
The dict ordering behaviour has of course changed:
Python 2: Unordered, but in practice it appears to be deterministic: subsequent runs of code that insert the same key-value pairs in the same order on the same machine produce the same ordering. The number of possible reorderings is restricted (though that is computer-science territory from my point of view).
Python 3.7+: Insertion order is guaranteed.
I am looping over the elements of the dictionary and adding the key-value pairs to another structure in a way that is a function of the dict values that is order-dependent. Specifically, I am removing the elements based on their geometric proximity to certain landmarks ('nearest neighbours'). If the ordering changes, the removal order changes and the outcome is not the same.
I am trying to reproduce Python 2's exact mangling of the insertion order. This can be thought of as a python function or 'behaviour'. So, how can I replicate a Python 2 dict ordering in Python 3.7+?
Caveats:
Of course I can use an OrderedDict in Python 2 (or 3.7) and often do (and should have here). But for the purposes of this question please don't suggest it, or that not using one was a bad idea etc.
I could reorder the keys of the original dictionary. Could this be a route to solving the conundrum? In general can Python 2 be "emulated" in Python 3?
I am trying to reproduce Python 2's exact mangling of the insertion order. This can be thought of as a python function or 'behaviour'. So, how can I replicate a Python 2 dict ordering in Python 3.7+?
You can't.
I could reorder the keys of the original dictionary. Could this be a route to solving the conundrum? In general can Python 2 be "emulated" in Python 3?
No. Not without reimplementing Python 2's hashmap anyway (or compiling the "old" dict under a new name).
The Python 2 ordering (and that of hashmaps in general) is a consequence of the hash function, the collision resolution algorithm and the resizing thresholds (the load factor):
a hashmap is fundamentally an array of buckets; without special handling, the iteration order is simply that of walking the buckets array and yielding the entries in the non-empty buckets
an item is stored at buckets[hash(key) % len(buckets)], hence the influence of the hash algorithm and the size of the buckets array (see the toy sketch after this list)
if two items have the same modular hash, the collision must be resolved; for open addressing there are lots of algorithms, all of which end up putting one of the items in the "wrong" bucket, but the way they do so and which item they move can be quite different
different collision resolution algorithms have different sensitivities to the buckets array filling up: some start degrading at 70% while others are fine at 90%. This informs the load factor limits, therefore when you resize, and thus the new locations (and order) of the items
furthermore, different hash tables can have different growth factors; even if all other factors are identical, different growth factors lead to different sizes of buckets arrays, and thus a different distribution (and order) once the modulo is applied
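As a toy illustration of the hash(key) % len(buckets) point (a deliberately simplified open-addressing table with linear probing, not CPython's actual implementation):
def iteration_order(keys, n_buckets):
    # Place keys into a toy open-addressing table and return them in bucket order.
    buckets = [None] * n_buckets
    for key in keys:
        i = hash(key) % n_buckets
        while buckets[i] is not None:      # linear probing on collision
            i = (i + 1) % n_buckets
        buckets[i] = key
    return [k for k in buckets if k is not None]

keys = [1, 9, 17, 4]
print(iteration_order(keys, 8))    # one order...
print(iteration_order(keys, 16))   # ...a different order after a "resize"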
Since Python 3.6, CPython has changed the very implementation of its hashmaps to one that is "naturally ordered"1. The only way to get the not-insertion-order would be to iterate the dk_indices field of the hashmap and explicitly dereference the corresponding dk_entries items. Neither field is part of the API (at the C level, to say nothing of the Python level), and even if they were, you would have to check the implementation details to see whether the new hashmap uses the same hashing and growth thresholds. Which again doesn't matter, because the internal details are not accessible.
[1]: you can see pypy's explanation of the scheme for a primer, though CPython's implementation is different.
I'm writing a specialised unit testing tool that needs to save the results of tests to be compared against in the future. Thus I need to be able to consistently map parameters that were passed to each test to the test result from running the test function with those parameters for each version. I was hoping there was a way to just hash the tuple and use that hash to name the files where I store the test results.
My first impulse was just to call hash() on the tuple of parameters, but of course that won't work since hash is randomized between interpreter instances now.
I'm having a hard time coming up with a way that works for whatever arbitrary elements might be in the tuple (I guess restricting it to a mix of ints, floats, strings, and lists/tuples of those three would be okay). Any ideas?
I've thought of using the repr of the tuple or pickling it, but repr isn't guaranteed to produce byte-for-byte identical output for the same input, and I don't think pickling is either (is it?).
I've seen this already, but the answers are all based on that same assumption that no longer holds, and they don't really translate to this problem anyway; a lot of the discussion was about making the hash independent of the order items come up in, and I do want the hash to depend on order.
Not sure if I understand your question fully, but will just give it a try.
Before you compute the hash, serialize the parameters to a JSON string, and do the hash computation on that JSON string.
import hashlib
import json

params = (1, 3, 2)
hashlib.sha224(json.dumps(params).encode('utf-8')).hexdigest()
# '5f0f7a621e6f420002d54ee28b0c169b8112ef72d8a6b60e6a25171c'
If your params is a dictionary, use sort_keys=True to ensure your keys are sorted.
params = {'b': 123, 'c': 345}
hashlib.sha224(json.dumps(params, sort_keys=True).encode('utf-8')).hexdigest()
# '2e75966ce3f1185cbfb4eccc49d5552c08cfb7502a8765fe1dce9303'
One approach for simple tests would be to disable the hash randomization entirely by setting PYTHONHASHSEED=0 in the environment that launches your script, e.g., in bash, doing:
export PYTHONHASHSEED=0
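A quick way to see the effect is to launch fresh interpreters with and without the variable set (a small sketch; the string 'spam' is arbitrary):
import os
import subprocess
import sys

env = dict(os.environ, PYTHONHASHSEED='0')
cmd = [sys.executable, '-c', "print(hash('spam'))"]
# With PYTHONHASHSEED=0 both child processes print the same hash;
# without it, str hashes usually differ between interpreter runs.
print(subprocess.check_output(cmd, env=env))
print(subprocess.check_output(cmd, env=env))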
I've just spent half a day tracking a bug down to a dict that I forgot to sort when iterating over it. Even though that part of the code is tested, the tests did not pick it up because that dict happened to have a repeatable ordering during the test. Only when I shuffled the dict did the tests fail! (I used an intermediate random.shuffle'd list and built an OrderedDict from it.)
This kinda scares me, because there might be similar bugs all over the place!
Is there any way to globally force all dicts to be unordered during testing?
Update: I think I at least figured out what caused the bug. As is described here, dicts with int keys are usually sorted, but might not always be. My hypothesis: In my tests, the ints are always small (order 10), and thus always in order. In the real run, however, the ints are much bigger (order 10^13), and thus not always ordered. I was able to reproduce this behaviour in an interactive session: list(foo.keys()) == sorted(foo.keys()) was always True for small keys, but not for every dict with very large keys.
As of 3.6, dicts maintain key insertion order. If you would like them to behave in a specific way, you could create a class that inherits from dict and give it the behaviour you want (e.g., ensuring that it shuffles the key list before returning it).
Regardless of the version of Python you use, be sure to check the implementation details if you want to rely on a specific behaviour of the dict implementation. Better yet, for portability, just code a solution that ensures they behave as you expect: either as I mentioned above, or by shuffling a list of the keys, or something else.
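For instance, here is a minimal sketch of a dict subclass that deliberately randomizes its iteration order, which you could swap in during tests to flush out order-dependent code (the class name and approach are just an illustration, not a standard tool):
import random

class ShuffledDict(dict):
    # A dict whose iteration order is deliberately randomized (for tests only).

    def __iter__(self):
        keys = list(super().__iter__())
        random.shuffle(keys)
        return iter(keys)

    def keys(self):
        return list(self)                       # shuffled

    def items(self):
        return [(k, self[k]) for k in self]

    def values(self):
        return [self[k] for k in self]

d = ShuffledDict({'a': 1, 'b': 2, 'c': 3})
print(list(d))   # order varies from call to call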
Can anybody tell me how the internal implementation of set and dict is different in python? Do they use the same data structure in the background?
P.S. In theory, one can use a dict to achieve set functionality.
In CPython, sets and dicts use the same basic data structure. Sets tune it slightly differently, but it is basically a hash table just like dictionaries.
You can take a look at the implementation details in the C code: setobject.c and dictobject.c. The implementations are very close; setobject.c started out as a copy of dictobject.c. dictobject.c has more implementation notes and tracing calls, but the actual implementations of the core functions differ only in details.
The most obvious difference is that the keys in the hash table are not used to reference values, as they are in dictionaries, so a setentry struct only holds a cached hash and a key, while the dictentry struct adds a value pointer.
Before we had the built-in set, we had the sets module, a pure-Python implementation that used dict objects to track the set values as keys. And in Python versions before the sets module was available, we did just that: use dict objects with keys as the set values, to track unique, unordered values.
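A quick illustration of that old idiom (purely for flavour; today you would just use set):
# Using a plain dict to track unique values, the pre-`set` way.
seen = {}
for value in ['a', 'b', 'a', 'c', 'b']:
    seen[value] = True           # only the keys matter

unique_values = list(seen)       # ['a', 'b', 'c'] (order not guaranteed pre-3.7)
print(unique_values)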
Both use the same data structure in the backend. For example, a set cannot store duplicate values, whereas a dict can map different keys to the same value, and you can get set-like behaviour from a dict by using only its keys.
I have a program (PatchDock), which takes its input from a parameters file, and produces an output file. Running this program is time-intensive, and I'd like to cache results of past runs so that I need not run the same parameters twice.
I'm able to parse the input and output files into appropriate data structures. The input file, for example, is parsed into a dictionary-like object. The input keys are all strings, and the values are primitive data types (ints, strings, and floats).
My approach
My first idea was to use the md5 hash of the input file as the key in a shelve database. However, this fails to match cached runs whose inputs are effectively the same but whose input files differ slightly (comments, spacing, order of parameters, et cetera).
Hashing the parsed parameters seems like the best approach to me. But the only way I can think of to get a unique hash from a dictionary is to hash a sorted string representation.
Question
Hashing a string representation of a parameters dictionary seems like a roundabout way of achieving my end goal- keying unique input values to a specified output. Is there a more straightforward way to achieve this caching system?
Ideally, I'm looking to achieve this in Python.
Hashing a sorted representation of the parsed input is actually the most straightforward way of doing this, and the one that makes sense. Your instincts were correct.
Basically, you're normalizing the input (by parsing it and sorting it), and then using that to construct a hash key.
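For example, a minimal sketch of that idea (the parameter names are made up; json.dumps with sort_keys=True is one convenient way to get a canonical string, assuming the parsed values are JSON-serializable):
import hashlib
import json

def cache_key(params):
    # Hash a canonical (sorted) representation of the parsed parameters.
    canonical = json.dumps(params, sort_keys=True)
    return hashlib.md5(canonical.encode('utf-8')).hexdigest()

params = {'receptorPdb': '1abc.pdb', 'ligandPdb': '2xyz.pdb', 'clusterRadius': 4.0}
key = cache_key(params)   # same key regardless of comments/spacing in the file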
Hashing seems a very viable way, but doing this yourself seems a bit of an overkill. Why not use the tuple of inputs as the key for your dictionary? You wouldn't have to worry about hashing and possible collisions yourself. All you have to do is fix an order for the keyword arguments (and, depending on your requirements, add a flag object for keywords that are not set).
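A small sketch of that suggestion (the parameter names and the UNSET sentinel are hypothetical):
UNSET = object()                       # sentinel for parameters left at default
PARAM_ORDER = ('receptorPdb', 'ligandPdb', 'clusterRadius')   # fixed key order

def cache_key(params):
    # Build a hashable key by reading the parameters in a fixed order.
    return tuple(params.get(name, UNSET) for name in PARAM_ORDER)

cache = {}                             # maps input tuples to parsed output
params = {'receptorPdb': '1abc.pdb', 'ligandPdb': '2xyz.pdb'}
if cache_key(params) not in cache:
    cache[cache_key(params)] = 'run PatchDock and parse the output here'
print(cache[cache_key(params)])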
You also might find the functools.lru_cache useful, if you are using Python 3.2+.
This is a decorator that will enable caching for the last n calls of the decorated function.
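For example (keeping in mind that lru_cache only caches within a single process, so it does not by itself persist results between runs):
from functools import lru_cache

@lru_cache(maxsize=128)
def run_simulation(receptor, ligand, radius):
    # stand-in for the expensive PatchDock call
    print('actually running for', receptor, ligand, radius)
    return (receptor, ligand, radius)

run_simulation('1abc.pdb', '2xyz.pdb', 4.0)   # runs
run_simulation('1abc.pdb', '2xyz.pdb', 4.0)   # returned from the cache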
If you are using an older version, there are backports of this functionality out there.
There also seems to be a project with similar goals called FileDict which might be worth looking at.