I have a program (PatchDock), which takes its input from a parameters file, and produces an output file. Running this program is time-intensive, and I'd like to cache results of past runs so that I need not run the same parameters twice.
I'm able to parse the input and output files into appropriate data structures. The input file, for example, is parsed into a dictionary-like object. The input keys are all strings, and the values are primitive data types (ints, strings, and floats).
My approach
My first idea was to use the md5 hash of the input file as the key in a shelve database. However, this fails to recognize cached runs whose inputs are effectively the same but whose input files differ superficially (comments, spacing, order of parameters, et cetera).
Hashing the parsed parameters seems like the best approach to me. But the only way I can think of to get a unique hash from a dictionary is to hash a sorted string representation of it.
Question
Hashing a string representation of a parameters dictionary seems like a roundabout way of achieving my end goal: keying unique input values to a specified output. Is there a more straightforward way to achieve this caching system?
Ideally, I'm looking to achieve this in Python.
Hashing a sorted representation of the parsed input is actually the most straightforward way of doing this, and the one that makes sense. Your instincts were correct.
Basically, you're normalizing the input (by parsing it and sorting it), and then using that to construct a hash key.
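For example, here is a minimal sketch of that normalize-then-hash idea, assuming the parsed parameters end up in a plain dict (the helper name and the choice of SHA-256 are mine, not from the question):

import hashlib
import json

def cache_key(params):
    # Sorting the keys normalizes away ordering differences in the input file;
    # json.dumps gives a canonical text form for ints, floats, and strings.
    canonical = json.dumps(params, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

params = {"receptorPdb": "1abc.pdb", "clusterRadius": 4.0, "steps": 100}
print(cache_key(params))  # same digest no matter how the original file was laid out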
Hashing seems a very viable way, but doing this yourself seems a bit of an overkill. Why not use the tuple of inputs as the key for your dictionary? You wouldn't have to worry about hashing and possible collisions yourself. All you have to do is fix an order for the keyword arguments (and, depending on your requirements, add a flag object for keywords that are not set).
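As a rough sketch of the tuple-as-key idea (run_patchdock, cached_run, and results_cache are placeholder names, not anything from the question):

def run_patchdock(params):          # stand-in for the slow external program
    return "output for %r" % (params,)

results_cache = {}

def cached_run(params):
    # A sorted tuple of (name, value) pairs is hashable and independent of
    # the order the parameters appeared in the input file.
    key = tuple(sorted(params.items()))
    if key not in results_cache:
        results_cache[key] = run_patchdock(params)
    return results_cache[key]

print(cached_run({"steps": 100, "radius": 4.0}))
print(cached_run({"radius": 4.0, "steps": 100}))   # hits the cache: same key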
You also might find the functools.lru_cache useful, if you are using Python 3.2+.
This is a decorator that will enable caching for the last n calls of the decorated function.
If you are using an older version, there are backports of this functionality out there.
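A small sketch of how that might look here, assuming Python 3.2+; note that lru_cache only accepts hashable arguments, so a parameter dict has to be passed as, say, a sorted tuple of pairs (expensive_call is a made-up name):

from functools import lru_cache

@lru_cache(maxsize=128)              # keeps results of the 128 most recent distinct calls
def expensive_call(param_items):
    params = dict(param_items)       # rebuild the dict inside the function
    # ... run the slow computation with `params` here ...
    return sum(v for v in params.values() if isinstance(v, (int, float)))  # stand-in result

result = expensive_call(tuple(sorted({"steps": 100, "radius": 4.0}.items())))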
There also seems to be a project with similar goals, called FileDict, which might be worth looking at.
Related
I have a situation where I need to sort the keys of a dictionary to make sure that the resulting list always has the same order. I don't care what that order is; I only need it to be consistent. This is causing difficulties for people using a package I've written: they want to set a random seed and get reproducible results for testing, but even when they set a seed, the order in which the dictionary returns its values changes, and this ends up affecting the randomization.
I would like to do something like sorted(D.keys()). However, in principle the keys may not be sortable. I don't have any control over what those keys are.
I know this is solved in recent versions of Python (dicts preserve insertion order from 3.7 on), but I don't want to restrict users to 3.7+.
More detail here: https://github.com/springer-math/Mathematics-of-Epidemics-on-Networks/issues/22#issuecomment-631219800
Since people think this is a bad question, here's more detail (or see the original link I provided for even more detail):
I have written an algorithm to do stochastic epidemic simulations. For reproducibility purposes people would like to specify a seed and have it run and produce the same outcome on any machine. The algorithm uses a networkx graph (which is built on a dictionary structure).
As part of the steps it needs to perform a weighted selection from the edges of the graph. This requires that I put the edges into a list. If the list is in a different order, then different outcomes occur regardless of whether the same seed is used.
So I'm left with having to find a way to make the list of edges be in a consistent order on any machine.
So if I understand correctly... the task is to consistently select the same random key from a vanilla dict on an old version of Python where insertion order is not preserved, while knowing nothing at all about the key type, and without setting the hash seed explicitly. I believe that's impossible in the general case, because the whole notion of object "identity" does not even exist with such restrictive assumptions.
The only thing that comes to mind is to serialize the keys in some way, and sort their serialized forms. pickle.dumps should work on most key types (although not everything can be pickled). But if the key type does allow sorting, it's probably more robust to simply use that instead.
import pickle

try:
    sorted_keys = sorted(my_dict)
except TypeError:
    sorted_keys = sorted(my_dict, key=lambda x: pickle.dumps(x, protocol=3))
There are some caveats though:
The pickled representation is not the same across Python versions (see Data stream format). That's why I'm setting protocol=3 in the example above; this should work the same for Python 3.0 and newer, although it doesn't support as many object types as protocol 4.
Objects can define their own pickling, so there is no guarantee that it's reproducible. In particular...
The pickled representation of dictionaries still depends on the ordering of the dictionary, which in turn depends on the Python version and the hash seed. The same goes for objects, because by default these are pickled via their __dict__ attribute.
If you want to get really tricky, maybe you can create a custom Pickler that sorts dictionaries (using OrderedDict for portability across Python versions) before pickling them... but in the end, it's not going to solve the full problem.
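For illustration, here is a rough sketch of that normalization idea: it recursively rewrites dicts and sets as sorted tuples before pickling, so their serialized form no longer depends on insertion order. It still won't cover arbitrary objects with custom pickling, so it doesn't solve the full problem either.

import pickle

def canonical(obj):
    # Dicts and sets are rebuilt as tuples sorted by their pickled form,
    # which avoids comparing keys of mixed types directly.
    if isinstance(obj, dict):
        items = [(canonical(k), canonical(v)) for k, v in obj.items()]
        return tuple(sorted(items, key=lambda kv: pickle.dumps(kv, protocol=3)))
    if isinstance(obj, (set, frozenset)):
        return tuple(sorted((canonical(x) for x in obj),
                            key=lambda x: pickle.dumps(x, protocol=3)))
    if isinstance(obj, (list, tuple)):
        return tuple(canonical(x) for x in obj)
    return obj

my_dict = {("b", 2): "x", frozenset({3, 1}): "y"}
sorted_keys = sorted(my_dict, key=lambda k: pickle.dumps(canonical(k), protocol=3))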
If you only need to preserve insertion order, one possibility is a list of pairs. If you combine that with an OrderedDict, you preserve the order and still get dictionary operations.
>>> import collections
>>> d = collections.OrderedDict([(1,'a'),(3,'2')])
>>> d.keys()
odict_keys([1, 3])
I'm writing a specialised unit testing tool that needs to save the results of tests so they can be compared against in the future. So I need a consistent way to map the parameters passed to each test to the result of running the test function with those parameters, for each version. I was hoping there was a way to just hash the tuple of parameters and use that hash to name the files where I store the test results.
My first impulse was just to call hash() on the tuple of parameters, but of course that won't work since hash is randomized between interpreter instances now.
I'm having a hard time coming up with a way that works for whatever arbitrary elements might be in the tuple (I guess restricting it to a mix of ints, floats, strings, and lists/tuples of those three would be okay). Any ideas?
I've thought of using the repr of the tuple or pickling it, but repr isn't guaranteed to produce byte-for-byte identical output for the same input, and I don't think pickling is either (is it?).
I've seen this already, but the answers there are all based on the same assumption that no longer holds, and they don't really translate to this problem anyway: much of that discussion was about making the hash independent of the order in which items come up, whereas I do want the hash to depend on order.
Not sure if I understand your question fully, but I'll give it a try.
Before hashing, serialize the parameters to a JSON string, and compute the hash of that JSON string.
import hashlib
import json

params = (1, 3, 2)
# json.dumps returns a str, so encode it to bytes for hashlib
hashlib.sha224(json.dumps(params).encode("utf-8")).hexdigest()
# '5f0f7a621e6f420002d54ee28b0c169b8112ef72d8a6b60e6a25171c'
If your params is a dictionary, use sort_keys=True to ensure your keys are sorted.
params = {'b': 123, 'c': 345}
hashlib.sha224(json.dumps(params, sort_keys=True).encode("utf-8")).hexdigest()
# '2e75966ce3f1185cbfb4eccc49d5552c08cfb7502a8765fe1dce9303'
One approach for simple tests would be to disable the hash randomization entirely by setting PYTHONHASHSEED=0 in the environment that launches your script, e.g., in bash, doing:
export PYTHONHASHSEED=0
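With the seed pinned you can use the built-in hash() again, since a tuple of strings and numbers then hashes to the same value on every run (though the value can still differ between Python versions and 32- vs 64-bit builds). A quick way to check inside the script:

import sys

# 0 means hash randomization is off, i.e. PYTHONHASHSEED was set to 0
print(sys.flags.hash_randomization)

params = ("run-42", 1.5, "fast")
print(hash(params))   # stable across runs once the hash seed is fixed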
I am running some time-consuming calculations in Python and I would like to serialize the intermediate results.
My problem is the following: each calculation is configured by multiple parameters, a couple of numbers and strings. Of course, I can concatenate everything, but the result will be an extremely long string, and I am afraid it will exceed the allowed length of a file name.
Any ideas how to cope with this?
An easy way would be to use md5 (e.g. https://docs.python.org/2/library/md5.html)
import md5

tmp = md5.new()
tmp.update(<parameter1>)
...
filename = tmp.hexdigest()
This should generate filenames that are unique enough. You can add the current timestamp as a parameter to increase uniqueness.
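The md5 module linked above is Python 2 only; here is a minimal Python 3 sketch of the same idea using hashlib (the function name and the .pkl extension are just illustrative, and update() needs bytes):

import hashlib

def result_filename(*params):
    digest = hashlib.md5()
    for p in params:
        digest.update(repr(p).encode("utf-8"))   # hash a text form of each parameter
    return digest.hexdigest() + ".pkl"           # 32 hex characters plus an extension

print(result_filename(0.25, "newton", 1000))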
I am developing an AI to perform MDP: I get states (just integers in this case) and assign each a value, and I am going to be doing this a lot. So I am looking for a data structure that can hold that information (no need for deletion) and has very fast get/update operations. Is there something faster than the regular dictionary? I'm open to anything, native Python or open source; I just need fast lookups.
Using a Python dictionary is the way to go.
You're saying that all your keys are integers? In that case, it might be faster to use a list and just treat the list indices as the key values. However, you'd have to make sure that you never delete or add list items; just start with as many as you think you'll need, setting them all equal to None, as shown:
mylist = [None] * totalitems
Then, when you need to "add" an item, just set the corresponding value.
Note that this probably won't actually gain you much in terms of actual efficiency, and it might be more confusing than just using a dictionary.
For 10,000 items, it turns out (on my machine, with my particular test case) that accessing each one and assigning it to a variable takes about 334.8 seconds with a list and 565 seconds with a dictionary.
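If you want to check this on your own workload, timeit makes the comparison easy; a rough sketch (sizes and repetition counts are arbitrary):

import timeit

setup = """
n = 10000
as_list = [None] * n
as_dict = {i: None for i in range(n)}
"""
# time 1000 passes of reading every item by its integer key
print(timeit.timeit("for i in range(n): x = as_list[i]", setup=setup, number=1000))
print(timeit.timeit("for i in range(n): x = as_dict[i]", setup=setup, number=1000))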
If you want a rapid prototype, use Python, and don't worry about speed.
If you want to write fast scientific code (and you can't build on fast native libraries, like LAPACK for linear algebra), write it in C or C++ (perhaps only as an extension called from Python). If fast instead of ultra-fast is enough, you can also use Java or Scala.
I am trying to implement a data structure which allows rapid look-ups based on keys.
The python dict is great when my look-ups involve an equality
(e.g. key == somevalue translates to datadict[somevalue]).
The problem is that I also need to be able to efficiently look up keys based on a more complex comparison, e.g. key > 50, or key.startswith('abc').
Obviously I can't use the same solution in both cases, but at the moment I can't figure out how to solve either case. Can anyone suggest a way of doing this?
It doesn't sound like you want a hash-based structure; instead, some form of binary tree, or even a sorted list used with the bisect module. It'd be worth looking at: Python's standard library - is there a module for a balanced binary tree?
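For the numeric case, a small sketch with bisect over a sorted list of keys and a parallel list of values (just an illustration, not a full container):

import bisect

keys = [10, 20, 50, 60, 75]                 # kept sorted
values = ["a", "b", "c", "d", "e"]

start = bisect.bisect_right(keys, 50)       # index of the first key > 50
print(list(zip(keys[start:], values[start:])))   # [(60, 'd'), (75, 'e')]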
Another option (depending on your data) would be to use an in-memory sqlite3 database and create appropriate indices for possible lookups -- but you'll trade performance/memory and SQL syntax for flexibility...
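A minimal sketch of the sqlite3 route (the table and column names here are made up):

import sqlite3

conn = sqlite3.connect(":memory:")                       # in-memory database
conn.execute("CREATE TABLE data (key TEXT, value TEXT)")
conn.execute("CREATE INDEX idx_key ON data (key)")       # index on the lookup column
conn.executemany("INSERT INTO data VALUES (?, ?)",
                 [("abc1", "x"), ("abd2", "y"), ("zzz", "z")])

# equality, range, and prefix lookups all become WHERE clauses
print(conn.execute("SELECT * FROM data WHERE key > 'abd'").fetchall())
print(conn.execute("SELECT * FROM data WHERE key LIKE 'abc%'").fetchall())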
Put all data items in a list.
Sort the list on the key.
Use binary search to efficiently find items where key > 50 or where key.startswith('abc').
Of course, this only pays off if you have very many data items. If you have only a few, simply loop through the list and apply your condition to every key.
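A short sketch of that approach for the startswith case, using bisect on a sorted list of (key, value) pairs (the keys are assumed to be ordinary strings):

import bisect

items = sorted([("abcde", 1), ("abx", 2), ("zzz", 3), ("abcf", 4)])
keys = [k for k, _ in items]

def prefix_range(prefix):
    lo = bisect.bisect_left(keys, prefix)
    hi = bisect.bisect_left(keys, prefix + "\uffff")  # just past the last possible match
    return items[lo:hi]

print(prefix_range("abc"))   # [('abcde', 1), ('abcf', 4)]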