Can anybody tell me how the internal implementations of set and dict differ in Python? Do they use the same data structure in the background?
In theory, one can use a dict to achieve set functionality.
In CPython, sets and dicts use the same underlying data structure. Sets tune it slightly differently, but it is a hash table just like the one behind dictionaries.
You can take a look at the implementation details in the C code: setobject.c and dictobject.c. The implementations are very close; setobject.c originally started as a copy of dictobject.c. dictobject.c has more implementation notes and tracing calls, but the actual implementations of the core functions differ only in details.
The most obvious difference is that the keys in the hash table are not used to reference values, as they are in dictionaries, so a setentry struct only holds a cached hash and a key, while the dictentry struct adds a value pointer.
Before we had the built-in set, we had the sets module, a pure-Python implementation that used dict objects to track the set values as keys. And in Python versions before the sets module was available, we did just that: use dict objects with keys as the set values, to track unique, unordered values.
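That old idiom is easy to sketch; here is a minimal example of a plain dict standing in for a set (the values are throwaway placeholders, only the keys matter):

# Emulating a set with a dict: keys are the members, values are ignored.
members = {}
for item in ["a", "b", "a", "c"]:
    members[item] = None      # inserting a duplicate key just overwrites

print(list(members))          # the unique items: ['a', 'b', 'c'] on 3.7+
print("b" in members)         # membership test, just like with a set
print("z" in members)         # False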
Both use the same data structure in the backend. For example, a set cannot store duplicate values, whereas a dict can store the same value under multiple keys; you can get set behaviour out of a dict by using only its keys and ignoring the values.
Related
I am trying to migrate a Python 2 dictionary to Python 3.
The dict ordering behaviour has of course changed:
Python 2: unordered, but in practice deterministic: the order appears consistent between subsequent runs of code that inserts the same key-value pairs in the same order on the same machine, etc. The number of possible reorderings is restricted (though that is computer-science territory from my point of view).
Python 3.7+: Insertion order is guaranteed.
I am looping over the elements of the dictionary and adding the key-value pairs to another structure, in a way that is a function of the dict values and is order-dependent. Specifically, I am removing elements based on their geometric proximity to certain landmarks ('nearest neighbours'). If the ordering changes, the removal order changes and the outcome is not the same.
I am trying to reproduce Python 2's exact mangling of the insertion order. This can be thought of as a Python function or 'behaviour'. So, how can I replicate a Python 2 dict ordering in Python 3.7+?
Caveats:
Of course I can use an OrderedDict in Python 2 (or 3.7) and often do (and should have here). But for the purposes of this question please don't suggest it, or that not using one was a bad idea etc.
I could reorder the keys of the original dictionary. Could this be a route to solving the conundrum? In general can Python 2 be "emulated" in Python 3?
I am trying to reproduce Python 2's exact mangling of the insertion order. This can be thought of as a Python function or 'behaviour'. So, how can I replicate a Python 2 dict ordering in Python 3.7+?
You can't.
I could reorder the keys of the original dictionary. Could this be a route to solving the conundrum? In general can Python 2 be "emulated" in Python 3?
No. Not without reimplementing Python 2's hashmap anyway (or compiling the "old" dict under a new name).
The Python 2 ordering (and that of hashmaps in general) is a consequence of the hash function, the collision resolution algorithm and the resizing thresholds (the load factor):
a hashmap is fundamentally an array of buckets; without special handling, the iteration order is simply that of walking the buckets array and yielding the entries found in the non-empty buckets
an item is stored at buckets[hash(key) % len(buckets)], hence the influence of both the hash algorithm and the size of the buckets array
if two items have the same modular hash, the collision must be resolved; for open addressing there are many algorithms (all of which end up putting one of the items in the "wrong" bucket, but how they do so and which item they move can differ quite a bit)
different collision-resolution algorithms have different sensitivities to how full the buckets array is; some start degrading at 70% occupancy while others are fine at 90%. This informs the load-factor limits, hence when you resize, and thus the new locations (and order) of the items
furthermore, different hash tables can have different growth factors; even with all other factors identical, different growth factors lead to different sizes of the buckets array, and thus to a different distribution (and order) once the modulo is applied. A toy sketch below illustrates these effects.
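To make those factors concrete, here is a deliberately tiny open-addressing table in Python. This is only an illustration of the mechanics (linear probing, made-up load factor and growth factor), not CPython's actual algorithm:

class ToyHashMap:
    """Iteration order falls out of hash(key) % table size,
    collision probing, and the resize history."""
    def __init__(self):
        self.slots = [None] * 8                  # the buckets array

    def _probe(self, key):
        i = hash(key) % len(self.slots)          # initial bucket
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % len(self.slots)        # linear probing
        return i

    def put(self, key, value):
        self.slots[self._probe(key)] = (key, value)
        used = sum(s is not None for s in self.slots)
        if used > len(self.slots) * 2 // 3:      # made-up load factor
            old = [s for s in self.slots if s is not None]
            self.slots = [None] * (len(self.slots) * 4)  # growth factor 4
            for k, v in old:                     # re-inserting reshuffles
                self.slots[self._probe(k)] = (k, v)

    def __iter__(self):                          # order = bucket scan order
        return (s[0] for s in self.slots if s is not None)

m = ToyHashMap()
for k in (17, 2, 41, 9, 100, 7):
    m.put(k, None)
print(list(m))   # depends on hashes, probe order, and any resizes

With these particular keys, the sixth insertion triggers a resize and the iteration order ends up nothing like the insertion order.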
Since Python 3.6, CPython has changed the very implementation of its hashmap to one that is "naturally ordered"1. The only way to get the old non-insertion-order iteration would be to iterate the dk_indices field of the hashmap and explicitly dereference the corresponding dk_entries items, and neither field is part of the API (at the C level, to say nothing of the Python level). Even if they were, you would have to check the implementation details to see whether the new hashmap uses the same hashing and growth thresholds. Which, again, doesn't matter, because the internal details are not accessible.
[1]: you can see PyPy's explanation of the scheme for a primer, though CPython's implementation is different.
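For a rough Python-level picture of that scheme (a sketch of the idea only; the real dk_indices/dk_entries are C-level fields with a different layout, and there is no resizing here):

# Toy model of the "compact" ordered dict: a sparse index table points
# into a dense, insertion-ordered entries list.
indices = [None] * 8        # plays the role of dk_indices
entries = []                # plays the role of dk_entries

def insert(key, value):
    i = hash(key) % len(indices)
    while indices[i] is not None and entries[indices[i]][1] != key:
        i = (i + 1) % len(indices)       # resolve collisions by probing
    if indices[i] is None:
        indices[i] = len(entries)        # point at the next dense slot
        entries.append((hash(key), key, value))
    else:                                # existing key: update in place
        h, k, _ = entries[indices[i]]
        entries[indices[i]] = (h, k, value)

for k in ("b", "a", "c"):
    insert(k, 0)

print([k for _, k, _ in entries])   # ['b', 'a', 'c']: insertion order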
I have been reading about these dictionary view objects that are returned by the likes of dict.keys(), including the posts here on the subject. I understand they act as windows onto the dictionary's contents without storing a copy of those contents explicitly, and so are more efficient than dynamically maintaining a list of keys. I have also found they are containers (they support the in operator) but are not sequences (not indexable), although they are iterable.
Overall this sounds to me like a set: since they have access to the dictionary's hash table, they even offer set-like operations such as intersection and difference. One difference I can think of is that a set, while mutable like these view objects, can only store immutable (and therefore hashable) objects.
However, since a dictionary value doesn't have to be immutable, the values and items view objects are essentially sets with mutable contents, which, as expected, do not support set-like operations (subtraction/intersection). This makes me sceptical of considering these view objects as "a set with a reference to the dictionary".
My question is: are these view objects entirely different to sets but happen to have similar properties? Or are they implemented using sets? Are there any other major differences between the two? And most importantly: can it be damaging to consider them as "basically sets"?
The implicit point of your comparison is that dict.keys() and set elements can't have duplicates. However, the set-like dictionary view obtained from the keys still retains order, while the set does not.
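For example (CPython 3.7+; the set's display order below is arbitrary and may differ on your machine):

>>> d = {"b": 1, "a": 2, "c": 3}
>>> list(d.keys())        # the view preserves insertion order
['b', 'a', 'c']
>>> set(d.keys())         # the set does not
{'a', 'c', 'b'}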
Duplicate dictionary keys:
If a key occurs more than once, the last value for that key becomes the corresponding value in the new dictionary.
Duplicate set elements:
A set object is an unordered collection of distinct hashable objects.
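Both behaviours are easy to see in the REPL:

>>> {"a": 1, "b": 2, "a": 3}    # duplicate key: the last value wins
{'a': 3, 'b': 2}
>>> {1, 2, 2, 3}                # duplicate element: silently dropped
{1, 2, 3}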
From the above, sets are unordered, while in current Python versions dictionaries maintain insertion order:
Changed in version 3.7: Dictionary order is guaranteed to be insertion order.
Because dictionaries have an insertion order, they can be reversed, while such an operation on a set would be meaningless:
Dictionaries and dictionary views are reversible.
Finally, a set can have elements added and removed, while a dictionary view object only allows looking at the contents, not changing them.
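A quick demonstration of both points (reversed() on dicts and views needs Python 3.8+):

>>> d = {"a": 1, "b": 2}
>>> list(reversed(d))
['b', 'a']
>>> ks = d.keys()
>>> ks.add("c")                 # views expose no mutating methods
Traceback (most recent call last):
  ...
AttributeError: 'dict_keys' object has no attribute 'add'
>>> d["c"] = 3                  # but they do track changes to the dict
>>> ks
dict_keys(['a', 'b', 'c'])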
My question is, are these view objects entirely different to sets but happen to have similar properties? Or are they implemented using sets?
The documentation makes no claim about implementation details.
Any other major differences between the two?
The documentation states the difference between keys views and items or values views:
Keys views are set-like (...)
If all values are hashable, so that (key, value) pairs are unique and hashable, then the items view is also set-like.
(Values views are not treated as set-like (...))
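Concretely:

>>> d1 = {"a": 1, "b": 2}
>>> d2 = {"b": 2, "c": 3}
>>> d1.keys() & d2.keys()       # keys views are always set-like
{'b'}
>>> d1.items() & d2.items()     # items views: set-like if values hashable
{('b', 2)}
>>> {"x": [1]}.items() & {"x": [1]}.items()
Traceback (most recent call last):
  ...
TypeError: unhashable type: 'list'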
I have a situation where I need to sort the keys of a dictionary to make sure that the resulting list always has the same order. I don't care what that order is, but I need it to be consistent. This is causing difficulties for people using a package I've written: they want to set a random seed and get consistent results for testing purposes, but even when they set a seed, the order in which the dictionary returns its values changes, and this ends up affecting the randomization.
I would like to do something like sorted(D.keys()). However, in principle the keys may not be sortable. I don't have any control over what those keys are.
I know this is solved in recent versions of Python, but I don't want to restrict users to 3.7+.
More detail here: https://github.com/springer-math/Mathematics-of-Epidemics-on-Networks/issues/22#issuecomment-631219800
Since people think this is a bad question, here's more detail (or see the original link I provided for even more detail):
I have written an algorithm to do stochastic epidemic simulations. For reproducibility purposes people would like to specify a seed and have it run and produce the same outcome on any machine. The algorithm uses a networkx graph (which is built on a dictionary structure).
As part of the steps it needs to perform a weighted selection from the edges of the graph. This requires that I put the edges into a list. If the list is in a different order, then different outcomes occur regardless of whether the same seed is used.
So I'm left with having to find a way to make the list of edges be in a consistent order on any machine.
So if I understand correctly... the task is to consistently select the same random key from a vanilla dict on an old version of Python where insertion order is not preserved, while knowing nothing at all about the key type, and without setting the hash seed explicitly. I believe that's impossible in the general case, because the whole notion of object "identity" does not even exist with such restrictive assumptions.
The only thing that comes to mind is to serialize the keys in some way, and sort their serialized forms. pickle.dumps should work on most key types (although not everything can be pickled). But if the key type does allow sorting, it's probably more robust to simply use that instead.
import pickle

# Use the keys' natural sort order when they are mutually comparable;
# otherwise fall back to sorting by their pickled byte representation.
try:
    sorted_keys = sorted(my_dict)
except TypeError:
    sorted_keys = sorted(my_dict, key=lambda x: pickle.dumps(x, protocol=3))
There are some caveats though:
The pickled representation is not the same across Python versions (see Data stream format). That's why I'm setting protocol=3 in the example above; this should work the same for Python 3.0 and newer, although it doesn't support as many object types as protocol 4.
Objects can define their own pickling, so there is no guarantee that it's reproducible. In particular...
The pickled representation of dictionaries still depends on the ordering of the dictionary, which depends on the Python version and the hash seed. The same goes for objects, because by default these are pickled via their __dict__ attribute.
If you want to get really tricky, maybe you can create a custom Pickler that sorts dictionaries (using OrderedDict for portability across Python versions) before pickling them... but in the end, it's not going to solve the full problem.
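Here is a sketch of that idea, written as a canonicalizing step before pickling rather than an actual Pickler subclass. The function names are mine, and it assumes everything reachable from a key is picklable:

import pickle

def _canonical(obj):
    # Recursively replace unordered containers with deterministically
    # ordered tuples, so the pickled bytes no longer depend on hash order.
    if isinstance(obj, dict):
        pairs = [(_canonical(k), _canonical(v)) for k, v in obj.items()]
        return tuple(sorted(pairs, key=lambda p: pickle.dumps(p, protocol=3)))
    if isinstance(obj, (set, frozenset)):
        return tuple(sorted((_canonical(x) for x in obj),
                            key=lambda x: pickle.dumps(x, protocol=3)))
    if isinstance(obj, (list, tuple)):
        return tuple(_canonical(x) for x in obj)
    return obj

def stable_sort_key(key):
    # Usable as: sorted(my_dict, key=stable_sort_key)
    return pickle.dumps(_canonical(key), protocol=3)

It inherits every caveat above: custom objects still pickle however they choose to.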
If you only want to preserve order, one possibility is to record insertion order with a list of pairs. If you combine that with an OrderedDict, you preserve order and keep the dictionary functionality.
>>> import collections
>>> d = collections.OrderedDict([(1,'a'),(3,'2')])
>>> d.keys()
odict_keys([1, 3])
I'm going to store on the order of 10,000 securities × 300 date pairs × 2 types in some caching mechanism.
I'm assuming I'm going to use a dictionary.
Question Part 1:
Which is more efficient or faster? Assume that I'll generally be looking up entries knowing a list of security IDs and the two dates plus the type. If there is a big efficiency gain from tweaking my lookup, I'm happy to do that. Also assume I can be somewhat wasteful of memory.
Method 1: store and look up using keys that look like strings "securityID_date1_date2_type"
Method 2: store and look up using keys that look like tuples (securityID, date1, date2, type)
Method 3: store and look up using nested dictionaries of some variation mentioned in methods 1 and 2
Question Part 2:
Is there an easy and better way to do this?
It's going to depend a lot on your use case. Is lookup the only activity, or will you do other things? For example:
Iterate all keys/values? For simplicity, you wouldn't want to nest dictionaries if iteration is relatively common.
What about iterating a subset of keys with a given securityID, type, etc.? Nested dictionaries (each keyed on one or more components of your key) would be beneficial if you needed to iterate "keys" with one component having a given value.
What if you need to iterate based on a different subset of the key components? If that's the case, a plain dict is probably not the best idea; you may want a relational database, either via the built-in sqlite3 module or a third-party module for a more "production grade" DBMS.
Aside from that, it matters quite a bit how you construct and use keys. Strings cache their hash code (and can be interned for even faster comparisons), so if you reuse a string for lookup having stored it elsewhere, it's going to be fast. But tuples are usually safer (strings constructed from multiple pieces can accidentally produce the same string from different keys if the separation between components in the string isn't well maintained). And you can easily recover the original components from a tuple, where a string would need to be parsed to recover the values. Nested dicts aren't likely to win (and require some finesse with methods like setdefault to populate properly) in a simple contest of lookup speed, so it's only when iterating a subset of the data for a single component of the key that they're likely to be beneficial.
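To illustrate the three methods side by side (the identifiers and values here are invented):

# Hypothetical components of one cache key.
sec, d1, d2, typ = "IBM", "2023-01-02", "2023-06-30", "bid"

# Method 1: string keys. Separator discipline matters: if components
# can themselves contain "_", distinct keys can collide as strings.
cache1 = {"%s_%s_%s_%s" % (sec, d1, d2, typ): 101.5}

# Method 2: tuple keys. No parsing needed; components stay recoverable.
cache2 = {(sec, d1, d2, typ): 101.5}

# Method 3: nested dicts. Populating takes setdefault (or defaultdict).
cache3 = {}
cache3.setdefault(sec, {}).setdefault((d1, d2), {})[typ] = 101.5

print(cache2[(sec, d1, d2, typ)])    # 101.5
print(cache3[sec][(d1, d2)][typ])    # 101.5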
If you want to benchmark, I'd suggest populating a dict with sample data, then use the timeit module (or ipython's %timeit magic) to test something approximating your use case. Just make sure it's a fair test, e.g. don't lookup the same key each time (using itertools.cycle to repeat a few hundred keys would work better) since dict optimizes for that scenario, and make sure the key is constructed each time, not just reused (unless reuse would be common in the real scenario) so string's caching of hash codes doesn't interfere.
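A rough sketch of such a benchmark (the sizes and key shapes are invented; adjust them to your real data):

import timeit
from itertools import cycle

# Invented sample data: many securities, one date pair, two types.
keys = [("SEC%d" % i, "2023-01-02", "2023-06-30", t)
        for i in range(10000) for t in ("bid", "ask")]
cache = dict.fromkeys(keys, 0.0)

probe = cycle(keys[:500])     # cycle a few hundred keys, as suggested above

def lookup():
    sec, d1, d2, typ = next(probe)
    return cache[(sec, d1, d2, typ)]   # rebuild the key tuple each time

print(timeit.timeit(lookup, number=100000))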
I'm porting some Python code to LabVIEW and I've run across the Python set(). Is there a better way of representing this in LabVIEW than with a variant or array?
The closest would be to use variant attributes.
You use a "dummy" variant to store key/value pairs. The Variant Set Attribute function prevents duplicates (it overwrites the existing entry, with an output indicating the replacement), and the Get function will return all key/value pairs if no key is specified.
The underlying functions use a red-black tree, making lookups very fast for large datasets.
http://forums.ni.com/t5/LabVIEW/Darren-s-Weekly-Nugget-10-09-2006/m-p/425269
As I recall, LabVIEW doesn't include an analog of set() out of the box. Therefore you must create a VI to delete duplicate values from an array. I hope the two links below will help you.
Remove duplicate values in an Array
Delete, Collapse, Array Duplicate Elements
Furthermore, you can take a HashSet implementation (one, two, three) and call it from LabVIEW.