I've just spent half a day tracking a bug down to a dict that I forgot to sort when iterating over it. Even though that part of the code is tested, the tests did not pick it up because the dict had a repeatable ordering during the test. Only when I shuffled the dict did the tests fail! (I used an intermediate random.shuffle'd list and built an OrderedDict from it.)
This kinda scares me, because there might be similar bugs all over the place!
Is there any way to globally force all dicts to be unordered during testing?
Update: I think I at least figured out what caused the bug. As is described here, dicts with int keys are usually sorted, but might not always be. My hypothesis: In my tests, the ints are always small (order 10), and thus always in order. In the real run, however, the ints are much bigger (order 10^13), and thus not always ordered. I was able to reproduce this behaviour in an interactive session: list(foo.keys()) == sorted(foo.keys()) was always True for small keys, but not for every dict with very large keys.
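For the record, here's a sketch of the kind of interactive session meant above, assuming a pre-3.6 CPython where iteration order follows the hash table rather than insertion order (the exact outcome depends on the table size and probing):

# Small int keys hash to themselves, so they land in sorted buckets:
foo = {k: None for k in [5, 3, 1, 4, 2]}
list(foo.keys()) == sorted(foo.keys())   # True on such interpreters

# Very large keys wrap modulo the table size, collide, and get probed
# to effectively arbitrary slots, so the sorted-looking order breaks:
bar = {k * 10**13: None for k in [5, 3, 1, 4, 2]}
list(bar.keys()) == sorted(bar.keys())   # frequently False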
As of 3.6, dicts maintain key insertion order. If you would like them to behave a specific way, you could create a class that inherits from dict and give it the behavior you want (e.g. ensuring that it shuffles the key list before returning it).
Regardless of the version of Python you use, be sure to check the implementation details before relying on any specific behavior of the dict implementation. Better yet, for portability, write a solution that ensures dicts behave as you expect, either as I mentioned above, by shuffling a list of the keys, or something else.
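For illustration, a minimal sketch of that subclassing idea (my own example, not a drop-in replacement: __iter__, items(), and friends would also need overriding for full coverage):

import random

class ShuffledDict(dict):
    """A dict whose keys() come back in a fresh random order on every call.

    Handy for flushing out accidental order-dependence in tests.
    """
    def keys(self):
        ks = list(super().keys())
        random.shuffle(ks)  # deliberately destroy any repeatable ordering
        return ks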
Related
Consider a function that looks similar to this one
def get_sorted_values():
    return sorted(get_unsorted_values(), key=some_sort_key)
Basically it fetches an unsorted list from somewhere and sorts it before returning it.
How would I reliably unit-test such a function? Of course I can assert that the resulting list has the expected order, but that would not reliably enforce the sorting because get_unsorted_values() might deliver it sorted in the expected way by coincidence.
So there is a certain chance that the unit test will pass, even when I drop the sorted call.
One possibility is to mock the get_unsorted_values() function so that the order is always wrong, but I would like to avoid mocking.
What other options are there?
P.S. The root cause of the non-deterministic behavior is that the values come from a dict, which has non-deterministic ordering.
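One mock-free option, sketched below under the assumption that you can refactor (sort_values and the test are made-up names, reusing some_sort_key and get_unsorted_values from the snippet above): extract the sorting into a pure function that takes its input explicitly, and unit-test that with a deliberately unsorted literal.

def sort_values(values, key=some_sort_key):
    # Pure function: trivially testable with any input order and key.
    return sorted(values, key=key)

def get_sorted_values():
    return sort_values(get_unsorted_values())

def test_sort_values_orders_input():
    # The input is deliberately out of order, so dropping sorted() fails.
    assert sort_values([3, 1, 2], key=lambda x: x) == [1, 2, 3]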
I have a situation where I need to sort the keys of a dictionary to make sure that the list always has the same order. I don't care what that order is, but I need consistency of order. This is causing difficulties for people using a package I've written, because they want to set a random seed and have consistent results for testing purposes, but even when they set a seed, the order in which the dictionary returns its values changes, and this ends up affecting the randomization.
I would like to do some thing like sorted(D.keys()). However, in principle the keys may not be sortable. I don't have any control over what those keys are.
I know this is solved in recent versions of Python, but I don't want to restrict users to 3.7 and later.
More detail here: https://github.com/springer-math/Mathematics-of-Epidemics-on-Networks/issues/22#issuecomment-631219800
Since people think this is a bad question, here's more detail (or see the original link I provided for even more detail):
I have written an algorithm to do stochastic epidemic simulations. For reproducibility purposes people would like to specify a seed and have it run and produce the same outcome on any machine. The algorithm uses a networkx graph (which is built on a dictionary structure).
As part of the steps it needs to perform a weighted selection from the edges of the graph. This requires that I put the edges into a list. If the list is in a different order, then different outcomes occur regardless of whether the same seed is used.
So I'm left with having to find a way to make the list of edges be in a consistent order on any machine.
So if I understand correctly... the task is to consistently select the same random key from a vanilla dict on an old version of Python where insertion order is not preserved, while knowing nothing at all about the key type, and without setting the hash seed explicitly. I believe that's impossible in the general case, because the whole notion of object "identity" does not even exist with such restrictive assumptions.
The only thing that comes to mind is to serialize the keys in some way, and sort their serialized forms. pickle.dumps should work on most key types (although not everything can be pickled). But if the key type does allow sorting, it's probably more robust to simply use that instead.
import pickle

try:
    sorted_keys = sorted(my_dict)
except TypeError:
    sorted_keys = sorted(my_dict, key=lambda x: pickle.dumps(x, protocol=3))
There are some caveats though:
The pickled representation is not the same across Python versions (see Data stream format). That's why I'm setting protocol=3 in the example above; this should work the same for Python 3.0 and newer, although it doesn't support as many object types as protocol 4.
Objects can define their own pickling, so there is no guarantee that it's reproducible. In particular...
The pickled representation of dictionaries is still dependent on the ordering of the dictionary, which depends on the Python version and the hash seed. The same goes for objects, because by default these are pickled via their __dict__ attribute.
If you want to get really tricky, maybe you can create a custom Pickler that sorts dictionaries (using OrderedDict for portability across Python versions) before pickling them... but in the end, it's not going to solve the full problem.
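For what it's worth, here is a hedged sketch of that trick (canonicalize and stable_key are made-up names). It handles only plain dicts, lists, and tuples and passes everything else through unchanged, so it is by no means a full solution:

import pickle
from collections import OrderedDict

def canonicalize(obj):
    # Replace dicts with OrderedDicts sorted by their pickled keys, so
    # the pickle of the result no longer depends on hash-table order.
    if isinstance(obj, dict):
        items = sorted(((k, canonicalize(v)) for k, v in obj.items()),
                       key=lambda kv: pickle.dumps(kv[0], protocol=3))
        return OrderedDict(items)
    if isinstance(obj, (list, tuple)):
        return type(obj)(canonicalize(x) for x in obj)
    return obj

def stable_key(key):
    return pickle.dumps(canonicalize(key), protocol=3)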
If you only want to preserve order, one possibility is to keep insertion order with a list of pairs. If you combine that with an OrderedDict, you will preserve order and still have the dictionary operations.
>>> import collections
>>> d = collections.OrderedDict([(1,'a'),(3,'2')])
>>> d.keys()
odict_keys([1, 3])
I'm writing a specialised unit testing tool that needs to save the results of tests to be compared against in the future. Thus I need to be able to consistently map parameters that were passed to each test to the test result from running the test function with those parameters for each version. I was hoping there was a way to just hash the tuple and use that hash to name the files where I store the test results.
My first impulse was just to call hash() on the tuple of parameters, but of course that won't work since hash is randomized between interpreter instances now.
I'm having a hard time coming up with a way that works for whatever arbitrary elements might be in the tuple (I guess restricting it to a mix of ints, floats, strings, and lists/tuples of those three would be okay). Any ideas?
I've thought of using the repr of the tuple or pickling it, but repr isn't guaranteed to produce byte-for-byte same output for same input, and I don't think pickling is either (is it?)
I've seen this already, but the answers are all based on that same assumption, which doesn't hold anymore, and they don't really translate to this problem anyway. A lot of the discussion was about making the hash not depend on the order in which items come up, whereas I do want the hash to depend on order.
I'm not sure I understand your question fully, but I'll give it a try.
Before you hash, serialize the parameters to a JSON string, and compute the hash of that JSON string (encoded to bytes).
import hashlib
import json

params = (1, 3, 2)
hashlib.sha224(json.dumps(params).encode("utf-8")).hexdigest()
# '5f0f7a621e6f420002d54ee28b0c169b8112ef72d8a6b60e6a25171c'
If your params is a dictionary, use sort_keys=True to ensure your keys are sorted.
params = {'b': 123, 'c': 345}
hashlib.sha224(json.dumps(params, sort_keys=True).encode("utf-8")).hexdigest()
# '2e75966ce3f1185cbfb4eccc49d5552c08cfb7502a8765fe1dce9303'
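One caveat worth adding (my note, not part of the original answer): json.dumps serialises tuples and lists to the same JSON array, and whitespace conventions can vary, so pinning the separators makes the digest more reproducible. A small sketch wrapping the idea (stable_hash is a made-up name):

import hashlib
import json

def stable_hash(params):
    # Canonical JSON: sorted keys, fixed separators, UTF-8 bytes.
    # Note that a tuple and a list with the same contents collide.
    payload = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha224(payload.encode("utf-8")).hexdigest()

stable_hash((1, 3, 2))  # same digest on every run and on every machine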
One approach for simple tests would be to disable the hash randomization entirely by setting PYTHONHASHSEED=0 in the environment that launches your script, e.g., in bash, doing:
export PYTHONHASHSEED=0
I'm going to store on the order of 10,000 securities × 300 date pairs × 2 types in some caching mechanism.
I'm assuming I'm going to use a dictionary.
Question Part 1:
Which is more efficient or faster? Assume that I'll generally be looking things up knowing a list of security IDs plus the two dates and the type. If there is a big efficiency gain from tweaking my lookup, I'm happy to do that. Also assume I can be wasteful of memory to an extent.
Method 1: store and look up using keys that look like strings "securityID_date1_date2_type"
Method 2: store and look up using keys that look like tuples (securityID, date1, date2, type)
Method 3: store and look up using nested dictionaries of some variation mentioned in methods 1 and 2
Question Part 2:
Is there an easy and better way to do this?
It's going to depend a lot on your use case. Is lookup the only activity, or will you do other things, e.g.:
Iterate all keys/values? For simplicity, you wouldn't want to nest dictionaries if iteration is relatively common.
What about iterating a subset of keys with a given securityID, type, etc.? Nested dictionaries (each keyed on one or more components of your key) would be beneficial if you needed to iterate "keys" with one component having a given value (see the sketch after this list).
What about if you need to iterate based on a different subset of the key components? If that's the case, a plain dict is probably not the best idea; you may want a relational database, either via the built-in sqlite3 module or a third-party module for a more "production grade" DBMS.
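To make the nested option concrete, here is a sketch (the field names and values are invented) of a cache keyed on securityID first, so that all entries for one security can be iterated with a single lookup:

cache = {}

def put(sec_id, date1, date2, typ, value):
    # setdefault creates the inner dict on first use.
    cache.setdefault(sec_id, {})[(date1, date2, typ)] = value

put("SEC1", "2020-01-01", "2020-12-31", "bid", 101.5)
put("SEC1", "2020-01-01", "2020-12-31", "ask", 101.7)

# Iterate everything for one security without scanning the whole cache:
for (d1, d2, typ), value in cache["SEC1"].items():
    print(d1, d2, typ, value)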
Aside from that, it matters quite a bit how you construct and use keys. Strings cache their hash code (and can be interned for even faster comparisons), so if you reuse a string for lookup having stored it elsewhere, it's going to be fast. But tuples are usually safer (strings constructed from multiple pieces can accidentally produce the same string from different keys if the separation between components in the string isn't well maintained). And you can easily recover the original components from a tuple, where a string would need to be parsed to recover the values. Nested dicts aren't likely to win (and require some finesse with methods like setdefault to populate properly) in a simple contest of lookup speed, so it's only when iterating a subset of the data for a single component of the key that they're likely to be beneficial.
If you want to benchmark, I'd suggest populating a dict with sample data, then use the timeit module (or ipython's %timeit magic) to test something approximating your use case. Just make sure it's a fair test, e.g. don't lookup the same key each time (using itertools.cycle to repeat a few hundred keys would work better) since dict optimizes for that scenario, and make sure the key is constructed each time, not just reused (unless reuse would be common in the real scenario) so string's caching of hash codes doesn't interfere.
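A minimal sketch of such a benchmark (the sample data and sizes are made up):

import itertools
import timeit

# Hypothetical sample data: 1,000 (securityID, date1, date2, type) keys.
keys = [(f"SEC{i}", "2020-01-01", "2020-12-31", t)
        for i in range(500) for t in ("bid", "ask")]
tuple_cache = {k: i for i, k in enumerate(keys)}
str_cache = {"_".join(k): v for k, v in tuple_cache.items()}

cycle = itertools.cycle(keys)

def lookup_tuple():
    sec, d1, d2, typ = next(cycle)
    return tuple_cache[(sec, d1, d2, typ)]  # key rebuilt on every lookup

def lookup_str():
    sec, d1, d2, typ = next(cycle)
    return str_cache[f"{sec}_{d1}_{d2}_{typ}"]

print("tuple:", timeit.timeit(lookup_tuple, number=100_000))
print("str:  ", timeit.timeit(lookup_str, number=100_000))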
I have a Python dictionary that uses integers as keys
d[7] = ...
to reference custom objects:
c = Car()
d[7] = c
However, each of these custom objects also has a string identifier (from a third party). I want to be able to access the objects using both an integer or a string. Is the preferred way to use both keys in the same dictionary?
d[7] = c
d["uAbdskmakclsa"] = c
Or should I split it up into two dictionaries? Or is there a better way?
It really depends on what you're doing.
If you get the different kinds of keys from different sources, so you always know which kind you're looking up, it makes more sense, conceptually, to use separate dictionaries.
On the other hand, if you need to be able to handle keys that could be either kind, it's probably simpler to use a single dictionary. Otherwise, you need to write code that uses type-switching, or that tries one dict and then tries the other on KeyError, or something else ugly.
(If you're worried about efficiency, it really won't make much difference either way. It's only a very, very tiny bit faster to look things up in a 5000-key dictionary than in a 10000-key dictionary, and it only costs a very small amount of extra memory to keep two 5000-key dictionaries instead of one 10000-key dictionary. So, don't worry about efficiency; do whichever makes sense. I don't have any reason to believe you are worried about efficiency, but a lot of people who ask questions like this seem to be, so I'm just covering the bases.)
I would use two dicts: one mapping third-party keys to integer keys, and another mapping integer keys to objects. Depending on which one you use more frequently, you can of course swap that around.
It's a fairly specific situation, so I doubt there is any 'official' preference on what to do. I do, however, feel that having keys of multiple types is 'dirty', although I can't really think of a reason why.
But since you state that the string keys come from a third party, that alone might be a good reason to split them off into another dictionary. I would split as well: you never know what the future might bring, and this approach is easier to maintain. It's also less error-prone if you think about type safety.
For setting values in your dictionaries you can then use helper methods. This will make adding easier and prevent you from forgetting to update one of the dictionaries.
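A minimal sketch of such helpers, assuming the two-dict split suggested above (Car stands in for the question's custom object, and the keys are the question's examples):

class Car:  # stand-in for the question's custom object
    pass

by_id = {}    # integer key -> object
by_name = {}  # third-party string key -> integer key

def add(obj, int_key, str_key):
    # A single entry point keeps both mappings in sync.
    by_id[int_key] = obj
    by_name[str_key] = int_key

def lookup(key):
    # Accept either kind of key; strings go through the indirection.
    if isinstance(key, str):
        key = by_name[key]
    return by_id[key]

add(Car(), 7, "uAbdskmakclsa")
assert lookup(7) is lookup("uAbdskmakclsa")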