Repeatably hashing an arbitrary Python tuple

I'm writing a specialised unit testing tool that needs to save the results of tests so they can be compared against in the future. That means I need a consistent way to map the parameters passed to each test to the result of running the test function with those parameters, for each version. I was hoping I could just hash the tuple of parameters and use that hash to name the files where I store the test results.
My first impulse was just to call hash() on the tuple of parameters, but of course that won't work since hash is randomized between interpreter instances now.
I'm having a hard time coming up with a way that works for whatever arbitrary elements might be in the tuple (restricting it to a mix of ints, floats, strings, and lists/tuples of those three would be okay). Any ideas?
I've thought of using the repr of the tuple, or pickling it, but repr isn't guaranteed to produce byte-for-byte identical output for the same input, and I don't think pickling is either (is it?).
I've seen this question already, but its answers all rest on the same assumption that no longer holds, and they don't really translate to this problem anyway: much of that discussion was about making the hash independent of the order in which items come up, whereas I want the hash to depend on order.

Not sure if I understand your question fully, but I'll give it a try.
Before you hash, serialize the parameters to a JSON string, and compute the hash of that JSON string.
import hashlib
import json

params = (1, 3, 2)
hashlib.sha224(json.dumps(params).encode()).hexdigest()
# '5f0f7a621e6f420002d54ee28b0c169b8112ef72d8a6b60e6a25171c'
If your params is a dictionary, pass sort_keys=True so the keys are serialized in a stable order.
params = {'b': 123, 'c': 345}
hashlib.sha224(json.dumps(params, sort_keys=True).encode()).hexdigest()
# '2e75966ce3f1185cbfb4eccc49d5552c08cfb7502a8765fe1dce9303'
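For the original question's mix of ints, floats, strings, and nested lists/tuples, a small helper along these lines may be enough. Note that json.dumps serializes tuples and lists identically, so (1, 2) and [1, 2] would get the same hash; the function name here is just an illustration.

import hashlib
import json

def stable_hash(params):
    # json.dumps produces the same string for the same input on a given
    # Python version; sort_keys makes any embedded dicts order-independent.
    encoded = json.dumps(params, sort_keys=True).encode("utf-8")
    return hashlib.sha224(encoded).hexdigest()

filename = stable_hash((1, 2.5, "spam", [3, 4])) + ".result"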

One approach for simple tests would be to disable hash randomization entirely by setting PYTHONHASHSEED=0 in the environment that launches your script, e.g. in bash:
export PYTHONHASHSEED=0
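The variable has to be set before the interpreter starts, so if your tooling launches the tests from Python, one option (a sketch; run_tests.py is a hypothetical entry point) is to spawn a child interpreter with it set:

import os
import subprocess
import sys

# Run the test script in a child interpreter with a fixed hash seed.
env = dict(os.environ, PYTHONHASHSEED="0")
subprocess.run([sys.executable, "run_tests.py"], env=env, check=True)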

Related

How to sort hashable objects

I have a situation where I need to sort the keys of a dictionary to make sure that the list always has the same order. I don't care what that order is, but I need it to be consistent. This is causing difficulties for people using a package I've written: they want to set a random seed and get consistent results for testing purposes, but even when they set a seed, the order in which the dictionary returns its values changes, and that ends up affecting the randomization.
I would like to do something like sorted(D.keys()). However, in principle the keys may not be sortable, and I don't have any control over what those keys are.
I know this is solved in recent versions of Python, but I don't want to restrict users to 3.7+.
More detail here: https://github.com/springer-math/Mathematics-of-Epidemics-on-Networks/issues/22#issuecomment-631219800
Since people think this is a bad question, here's more detail (or see the original link I provided for even more detail):
I have written an algorithm to do stochastic epidemic simulations. For reproducibility purposes people would like to specify a seed and have it run and produce the same outcome on any machine. The algorithm uses a networkx graph (which is built on a dictionary structure).
As part of the steps it needs to perform a weighted selection from the edges of the graph. This requires that I put the edges into a list. If the list is in a different order, then different outcomes occur regardless of whether the same seed is used.
So I'm left with having to find a way to make the list of edges be in a consistent order on any machine.
So if I understand correctly... the task is to consistently select the same random key from a vanilla dict on an old version of Python where insertion order is not preserved, while knowing nothing at all about the key type, and without setting the hash seed explicitly. I believe that's impossible in the general case, because the whole notion of object "identity" does not even exist with such restrictive assumptions.
The only thing that comes to mind is to serialize the keys in some way, and sort their serialized forms. pickle.dumps should work on most key types (although not everything can be pickled). But if the key type does allow sorting, it's probably more robust to simply use that instead.
import pickle

try:
    sorted_keys = sorted(my_dict)
except TypeError:
    sorted_keys = sorted(my_dict, key=lambda x: pickle.dumps(x, protocol=3))
There are some caveats though:
The pickled representation is not the same across Python versions (see Data stream format). That's why I'm setting protocol=3 in the example above; this should work the same for Python 3.0 and newer, although it doesn't support as many object types as protocol 4.
Objects can define their own pickling, so there is no guarantee that it's reproducible. In particular...
The pickled representation of dictionaries still depends on the dictionary's ordering, which in turn depends on the Python version and the hash seed. The same goes for objects, because by default they are pickled via their __dict__ attribute.
If you want to get really tricky, maybe you can create a custom Pickler that sorts dictionaries (using OrderedDict for portability across Python versions) before pickling them... but in the end, it's not going to solve the full problem.
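A rough illustration of that idea, normalizing dicts recursively before pickling rather than subclassing Pickler (the helper name is made up):

import pickle
from collections import OrderedDict

def normalized(obj):
    # Recursively replace dicts with OrderedDicts sorted by their pickled keys,
    # so the final pickle does not depend on dict iteration order.
    if isinstance(obj, dict):
        items = sorted(obj.items(), key=lambda kv: pickle.dumps(kv[0], protocol=3))
        return OrderedDict((k, normalized(v)) for k, v in items)
    if isinstance(obj, (list, tuple)):
        return type(obj)(normalized(x) for x in obj)
    return obj

stable_bytes = pickle.dumps(normalized({"b": 1, "a": {"d": 2, "c": 3}}), protocol=3)

As noted above, this still doesn't cover objects that define their own pickling, so it narrows the problem rather than solving it.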
If you only need to preserve order, one possibility is to record insertion order with a list of pairs. If you combine that with an OrderedDict you preserve order and keep dictionary functionality.
>>> import collections
>>> d = collections.OrderedDict([(1,'a'),(3,'2')])
>>> d.keys()
odict_keys([1, 3])

How to force dicts to be unordered (for testing)?

I've just spent half a day tracing down a bug to a dict that I forgot to sort when iterating over it. Even though that part of code is tested, the tests did not pick it up because that dict had a repeatable ordering during the test. Only when I shuffled the dict did the tests fail! (I used an intermediate random.shuffle'd list and built an OrderedDict)
This kinda scares me, because there might be similar bugs all over the place!
Is there any way to globally force all dicts to be unordered during testing?
Update: I think I at least figured out what caused the bug. As is described here, dicts with int keys are usually sorted, but might not always be. My hypothesis: In my tests, the ints are always small (order 10), and thus always in order. In the real run, however, the ints are much bigger (order 10^13), and thus not always ordered. I was able to reproduce this behaviour in an interactive session: list(foo.keys()) == sorted(foo.keys()) was always True for small keys, but not for every dict with very large keys.
As of CPython 3.6 (and guaranteed by the language from 3.7), dicts maintain key insertion order. If you would like them to behave a specific way, you could create a class that inherits from dict and give it the behavior you want (e.g. ensuring that it shuffles the key list before returning it).
Regardless of the version of Python you use, be sure to check the implementation details if you want to rely on a specific behavior of the dict implementation. Better yet, for portability, code a solution that ensures dicts behave as you expect, either as mentioned above, or by shuffling a list of the keys, or something else.
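A minimal sketch of such a subclass (assuming Python 3; the class name is just an illustration):

import random

class ShuffledDict(dict):
    # Hypothetical test helper: a dict whose iteration order is randomized
    # on every pass, to flush out code that silently relies on key order.

    def __iter__(self):
        keys = list(dict.__iter__(self))
        random.shuffle(keys)
        return iter(keys)

    def keys(self):
        # Returns a plain list instead of a view object; good enough for tests.
        return list(self)

    def values(self):
        return [self[k] for k in self]

    def items(self):
        return [(k, self[k]) for k in self]

d = ShuffledDict({"a": 1, "b": 2, "c": 3})
print(list(d))  # key order varies from call to call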

Fastest Get Python Data Structures

I am developing an AI to solve an MDP. I am getting states (just integers in this case) and assigning each a value, and I am going to be doing this a lot. So I am looking for a data structure that can hold that information (no need for delete) and has very fast get/update operations. Is there something faster than the regular dictionary? I am open to anything, native Python or open source, I just need fast gets.
Using a Python dictionary is the way to go.
You're saying that all your keys are integers? In that case, it might be faster to use a list and just treat the list indices as the key values. However, you'd have to make sure that you never delete or add list items; just start with as many as you think you'll need, setting them all equal to None, as shown:
mylist = [None for _ in range(totalitems)]
Then, when you need to "add" an item, just set the corresponding value.
Note that this probably won't actually gain you much in terms of actual efficiency, and it might be more confusing than just using a dictionary.
For 10,000 items, it turns out (on my machine, with my particular test case) that accessing each one and assigning it to a variable takes about 334.8 seconds with a list and 565 seconds with a dictionary.
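If you want to check this on your own machine, a quick micro-benchmark along these lines is easy to write (the numbers will vary with your setup):

import timeit

n = 10_000
as_list = [None] * n
as_dict = {i: None for i in range(n)}

list_time = timeit.timeit("for i in range(n): x = as_list[i]",
                          globals={"n": n, "as_list": as_list}, number=1000)
dict_time = timeit.timeit("for i in range(n): x = as_dict[i]",
                          globals={"n": n, "as_dict": as_dict}, number=1000)
print(f"list: {list_time:.3f}s  dict: {dict_time:.3f}s")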
If you want a rapid prototype, use python. And don't worry about speed.
If you want to write fast scientific code (and you can't build on fast native libraries, like LAPACK for linear algebra stuff) write it in C, C++ (maybe only to call from Python). If fast instead of ultra-fast is enough, you can also use Java or Scala.

A random noise function in Python

I'm writing a game in Python in which the environment is generated randomly. Currently, the game's "save" function works by writing out all parts of the environment which the player has explored. The result is that save files are larger than they need to be—why write random data to disk when you can just generate it again?
What I could use is a random noise function: a function noise such that noise(x) returns a random number, and always the same number whenever it's called with the same value of x. Now, for each point (x,y) in the game's environment, instead of generating a random number using random() and storing the result in env[(x,y)], I can generate a random number using noise((x,y)), throw it away, and generate the same number later.
Not quite sure if I'm stating the obvious, but using some variation of a Perlin noise generator is a common way to do this. This post is a nice description of doing exactly that (as mentioned in the comments, it's not exactly Perlin noise).
For a given position, the Perlin function will return a random value (the position can be 2D, 3D or any dimensionality).
There is a noise module, and this page has an implementation of it.
There's a similar thread on gamedev.SE
First, if you need it to be true that noise(x) would always return the same value for the same x, no matter what, even if it's never been called, then you can't really use randomness at all. A good hash function is the only possibility.
However, if you just need to be able to restore a previous state consisting of the values for all of the previously-explored points (never-explored points may turn out different after save and load than if you hadn't quit… but how can anyone tell without access to multiple universes?), and you don't want to store all of those points, then it might be reasonable to regenerate them.
But let's back up a step. You want something that acts like a hash function. Is there a hash function you can use?
I'd imagine the algorithms in hashlib are too slow (md5 is probably the fastest, but test them all), but I wouldn't reject them without actually testing.
It's possible that the "random period" of zlib.adler32 (or zlib.crc32) is too short, but I wouldn't reject it (except maybe hash) without thinking through whether it's good enough. For that matter, even hash plus a decent fixed-size blender function might be good enough (at least on a 64-bit system).
Python doesn't come with anything "between" md5 and adler32 out of the box. But you can find PyPI modules or source recipes for hundreds of other hash algorithms. For that matter, if you're familiar with any particular hash algorithm that sounds good, most of them are trivial to write: you could probably code up, e.g., an FNV hash with xor-folding in less time than it takes to look through the alternatives.
Also, keep in mind that you can generate a bunch of random bytes at "new game" time, store that in the save file, and use it as salt to your hash function.
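A rough sketch of that idea, assuming integer (x, y) coordinates; the salt would be generated once per new game and written into the save file:

import hashlib
import os
import struct

salt = os.urandom(16)  # generate at "new game" time and store in the save file

def noise(point, salt=salt):
    # Mix the per-save salt with the packed coordinates and hash the result.
    data = salt + struct.pack(">qq", *point)
    digest = hashlib.md5(data).digest()
    # Interpret the first 8 bytes as an unsigned integer, scaled to [0, 1).
    return struct.unpack(">Q", digest[:8])[0] / 2**64

print(noise((12, -7)))  # same value every time for the same point and salt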
If you've exhausted the possibilities and you really do need more randomness than a fast-enough hash function with an arbitrary salt can give you, then:
It sounds like you'll already need to store a list of the points the user has explored (how else would you know which points to restore?), and the order doesn't really matter, so you can store them in the order of exploration. That means you can regenerate the values deterministically just by iterating the list, which means you can use the suggestion by #delnan on your own answer.
However, seed is not the way to do that. It isn't guaranteed to put the RNG into the same state each time across runs, Python versions, machines, etc. For that, you need setstate:
To save, call random.getstate(), and pickle and stash the result.
To load, read and unpickle the state, and call random.setstate(state).
See the docs for full details.
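A minimal sketch of that save/load round trip (the file name is just an example):

import pickle
import random

# To save: capture the generator's state and stash it with the rest of the save data.
with open("rng_state.pkl", "wb") as f:
    pickle.dump(random.getstate(), f)

# Later, to load: read the state back and restore it.
with open("rng_state.pkl", "rb") as f:
    random.setstate(pickle.load(f))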
If you're using a random.Random instance, it's exactly the same, except of course that you have to construct a random.Random before you can call setstate on it.
This is guaranteed to work between runs of your program, across machines, etc. Even with a newer version of Python. However, it's not guaranteed to work with an older version of Python. (That is, if the user saves a game with Python 2.6, then tries to load it with 2.5, the state will not be compatible. I believe the only problems come with 2.6->older and 2.3->older, but of course there's no guarantee there won't be additional ones in the future.) I'd suggest stashing the Python version, and if they've downgraded, show a warning saying "This save file requires Python 2.6 or later. You have Python 2.5. The load may fail. Continue anyway?"
This is only guaranteed for random.Random and for the random module itself (since the top-level module functions just use a hidden random.Random). In particular, random.SystemRandom is explicitly documented not to work.
Practically speaking, you can also just pickle a random.Random directly, because the state gets pickled in. It seems like that ought to work, or what would be the sense of pickling a Random object? And it definitely does work. But it isn't actually documented to work, so I'd stick with pickling the getstate, for safety.
One possible implementation of noise is this:
import random

def noise(point):
    gen = random.Random()
    gen.seed(point)
    return gen.random()
I don't know how fast Random.seed() is, though. In addition, Random may change from one version of Python to the next, causing the players of my game to find that the environment changes when they upgrade.

Cache results of a time-intensive operation

I have a program (PatchDock), which takes its input from a parameters file, and produces an output file. Running this program is time-intensive, and I'd like to cache results of past runs so that I need not run the same parameters twice.
I'm able to parse the input and output files into appropriate data structures. The input file, for example, is parsed into a dictionary-like object. The input keys are all strings, and the values are primitive data types (ints, strings, and floats).
My approach
My first idea was to use the md5 hash of each input file as its key in a shelve database. However, this fails to match runs whose inputs are effectively identical but whose input files differ superficially (comments, spacing, order of parameters, et cetera).
Hashing the parsed parameters seems like the best approach to me. But the only way I can think of getting a unique hash from a dictionary is to hash a sorted string representation.
Question
Hashing a string representation of a parameters dictionary seems like a roundabout way of achieving my end goal- keying unique input values to a specified output. Is there a more straightforward way to achieve this caching system?
Ideally, I'm looking to achieve this in Python.
Hashing a sorted representation of the parsed input is actually the most straightforward way of doing this, and the one that makes sense. Your instincts were correct.
Basically, you're normalizing the input (by parsing it and sorting it), and then using that to construct a hash key.
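In code, that normalize-then-hash step can be very small (a sketch; the helper name and parameter names are invented, and it assumes the parsed parameters form a dict of strings to primitives):

import hashlib
import json

def cache_key(params):
    # Sorting the keys means superficial differences in the input file
    # (ordering, spacing, comments) cannot change the key.
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

key = cache_key({"receptorPdb": "a.pdb", "ligandPdb": "b.pdb", "clusterRadius": 4.0})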
Hashing seems a very viable way, but doing it yourself seems a bit overkill. Why not use the tuple of inputs as the key for your dictionary? You wouldn't have to worry about hashing and possible collisions yourself. All you have to do is fix an order for the keyword arguments (and, depending on your requirements, add a flag object for keywords that are not set).
You might also find functools.lru_cache useful if you are using Python 3.2+.
It is a decorator that caches the results of the most recent calls of the decorated function.
If you are using an older version, there are backports of this functionality out there.
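A sketch of how that might look; run_patchdock is a stand-in for the real invocation and the parameter names are invented:

from functools import lru_cache

def run_patchdock(params):
    # Stand-in for the real, expensive PatchDock run.
    ...

@lru_cache(maxsize=128)
def run_cached(param_items):
    # lru_cache needs hashable arguments, so pass the parsed parameters
    # as an order-fixed tuple of (key, value) pairs rather than a dict.
    return run_patchdock(dict(param_items))

params = {"receptorPdb": "a.pdb", "ligandPdb": "b.pdb"}
result = run_cached(tuple(sorted(params.items())))

Note that lru_cache only lives for the lifetime of the process; for caching across runs you would still persist results (for example in a shelve) keyed on a stable hash as discussed above.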
There also seems to be a project with similar goals called FileDict which might be worth looking at.
