How is set() implemented? - python

I've seen people say that set objects in python have O(1) membership-checking. How are they implemented internally to allow this? What sort of data structure does it use? What other implications does that implementation have?
Every answer here was really enlightening, but I can only accept one, so I'll go with the closest answer to my original question. Thanks all for the info!

According to this thread:
Indeed, CPython's sets are implemented as something like dictionaries
with dummy values (the keys being the members of the set), with some
optimization(s) that exploit this lack of values
So basically a set uses a hashtable as its underlying data structure. This explains the O(1) membership checking, since looking up an item in a hashtable is an O(1) operation, on average.
If you are so inclined you can even browse the CPython source code for set which, according to Achim Domma, was originally mostly a cut-and-paste from the dict implementation.
Note: Nowadays, set and dict's implementations have diverged significantly, so the precise behaviors (e.g. arbitrary order vs. insertion order) and performance in various use cases differs; they're still implemented in terms of hashtables, so average case lookup and insertion remains O(1), but set is no longer just "dict, but with dummy/omitted keys".
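To see the practical consequence, here is a small, unscientific timing sketch (illustrative only; absolute numbers will vary by machine): list membership scans linearly, while set membership goes through the hash table.

import timeit

for n in (1_000, 10_000, 100_000):
    s = set(range(n))
    l = list(range(n))
    t_set = timeit.timeit(lambda: n - 1 in s, number=1000)   # hash lookup
    t_list = timeit.timeit(lambda: n - 1 in l, number=1000)  # linear scan
    print(f"n={n}: set {t_set:.5f}s, list {t_list:.5f}s")

The set timings should stay roughly flat as n grows, while the list timings grow with n.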

When people say sets have O(1) membership-checking, they are talking about the average case. In the worst case (when all hashed values collide) membership-checking is O(n). See the Python wiki on time complexity.
The Wikipedia article says the best case time complexity for a hash table that does not resize is O(1 + k/n). This result does not directly apply to Python sets since Python sets use a hash table that resizes.
A little further on the Wikipedia article says that for the average case, and assuming a simple uniform hashing function, the time complexity is O(1/(1-k/n)), where k/n can be bounded by a constant c<1.
Big-O refers only to asymptotic behavior as n → ∞.
Since k/n can be bounded by a constant, c<1, independent of n,
O(1/(1-k/n)) is no bigger than O(1/(1-c)) which is equivalent to O(constant) = O(1).
So assuming uniform simple hashing, on average, membership-checking for Python sets is O(1).
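To make the worst case mentioned above concrete, here is a hedged sketch using a deliberately bad hash: every instance hashes to the same value, so membership testing degenerates into a linear walk over the colliding entries.

import timeit

class BadHash:
    def __init__(self, x):
        self.x = x
    def __hash__(self):
        return 42                 # every instance collides on purpose
    def __eq__(self, other):
        return isinstance(other, BadHash) and self.x == other.x

for n in (100, 1_000):
    s = {BadHash(i) for i in range(n)}
    probe = BadHash(-1)           # absent element: forces a walk over all collisions
    t = timeit.timeit(lambda: probe in s, number=100)
    print(f"n={n}: {t:.5f}s")     # roughly 10x slower for 10x the elements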

I think it's a common mistake: set lookups (or hashtable lookups, for that matter) are not O(1).
From Wikipedia:
In the simplest model, the hash function is completely unspecified and the table does not resize. For the best possible choice of hash function, a table of size n with open addressing has no collisions and holds up to n elements, with a single comparison for successful lookup, and a table of size n with chaining and k keys has the minimum max(0, k-n) collisions and O(1 + k/n) comparisons for lookup. For the worst choice of hash function, every insertion causes a collision, and hash tables degenerate to linear search, with Ω(k) amortized comparisons per insertion and up to k comparisons for a successful lookup.
Related: Is a Java hashmap really O(1)?

We all have easy access to the source, where the comment preceding set_lookkey() says:
/* set object implementation
Written and maintained by Raymond D. Hettinger <python@rcn.com>
Derived from Lib/sets.py and Objects/dictobject.c.
The basic lookup function used by all operations.
This is based on Algorithm D from Knuth Vol. 3, Sec. 6.4.
The initial probe index is computed as hash mod the table size.
Subsequent probe indices are computed as explained in Objects/dictobject.c.
To improve cache locality, each probe inspects a series of consecutive
nearby entries before moving on to probes elsewhere in memory. This leaves
us with a hybrid of linear probing and open addressing. The linear probing
reduces the cost of hash collisions because consecutive memory accesses
tend to be much cheaper than scattered probes. After LINEAR_PROBES steps,
we then use open addressing with the upper bits from the hash value. This
helps break-up long chains of collisions.
All arithmetic on hash should ignore overflow.
Unlike the dictionary implementation, the lookkey function can return
NULL if the rich comparison returns an error.
*/
...
#ifndef LINEAR_PROBES
#define LINEAR_PROBES 9
#endif
/* This must be >= 1 */
#define PERTURB_SHIFT 5
static setentry *
set_lookkey(PySetObject *so, PyObject *key, Py_hash_t hash)
{
...

Sets in Python employ a hash table internally. Let us first talk about hash tables.
Suppose you want to store some elements in a hash table that has 31 slots. Let the elements be: 2.83, 8.23, 9.38, 10.23, 25.58, 0.42, 5.37, 28.10, 32.14, 7.31. To use the hash table, you first determine the index at which each element will be stored. The modulo operation is a popular way of computing these indices, so let us say we take one element at a time, multiply it by 100 and take the result modulo 31. It is important that each such operation yields a unique number, since a slot in a hash table can store only one element unless chaining is allowed. Each element is then stored at the location given by its computed index. Now if you want to search for an element in a set that stores its elements using such a hash table, you find it in O(1) time on average, because the index of the element is computed with the modulo operation in constant time.
To expound on the modulo operation, let me also write some code:
piles = [2.83, 8.23, 9.38, 10.23, 25.58, 0.42, 5.37, 28.10, 32.14, 7.31]

def hash_function(x):
    return int(x * 100 % 31)

[hash_function(pile) for pile in piles]
Output: [4, 17, 8, 0, 16, 11, 10, 20, 21, 18]

To emphasize the difference between sets and dicts a little more, here is an excerpt from the comments in setobject.c, which clarifies the main difference between sets and dicts.
Use cases for sets differ considerably from dictionaries where looked-up
keys are more likely to be present. In contrast, sets are primarily
about membership testing where the presence of an element is not known in
advance. Accordingly, the set implementation needs to optimize for both
the found and not-found case.
source on github

Related

Suitable object type for a 2D look-up table of unknown size created at runtime

I'm writing a Python 3.4 script that does a large calculation for me. This calculation involves calculating many many binomial coefficients, and using each of them many times in sums and multiplications with other numbers. Each time a bc (binomial coefficient) is needed in the calculation, it checks whether the bc has already been calculated. If so, it returns this already calculated value. Otherwise, it calculates it and stores it for later look-up. Currently, my function bc(n,k), which calculates the bc "n choose k", looks as follows:
bcvalues = {}

def bc(n, k):
    k = min(k, n - k)                    # take advantage of symmetry
    if (n, k) in bcvalues:               # check whether value has already been calculated
        return bcvalues[(n, k)]          # if so, return that already calculated value
    if k == 0 or n <= 1:                 # base case
        return 1
    result = bc(n-1, k) + bc(n-1, k-1)   # use the Pascal's triangle recurrence
    bcvalues[(n, k)] = result            # store the value for later look-up
    return result
My look-up table is a dictionary with the (n,k) tuple as the key and bc(n,k) as the value. It satisfies all the
Strict requirements
Can be filled / extended to an arbitrary size at runtime (before the calculation runs, I have no idea how many bc's it needs to calculate, but it's a lot of bc's)
The values can be arbitrarily large (either int (the Python 3 one) or the gmpy2 type mpz, I'm not sure yet). This is important as the values can become very very large
It can be indexed by two natural numbers n and k
The bc's for some tuples (n,k) can be skipped (e.g. there may be an entry for (100,50) but no entry for (100,49))
However, I'm not sure whether it is "the" optimal solution (if there is one) in terms of the
Performance requirements (in the order of importance)
Fast look-up / read-out
Low memory-usage (in tests, my dictionary already occupied several GBs; I may eventually rent computing power on large-memory machines)
Fast writing into the look-up table
In very small input size tests that I've just run, the function bc was called 16 million times, and this number is likely to grow a lot for input sizes that I'm actually interested in. Therefore, performance matters.
My current solution (dictionary) has the advantage that at the end of a computation run, I can serialize the look-up table (using pickle), so that when I perform a new run with higher input values, I can unpickle it and have all the bc's at hand that have been calculated in previous runs. This is a strong bonus point:
Bonus point
The look-up table can easily be serialized
My question
What, besides dictionary, could be a candidate for matching these criteria?
I thought of writing a function that maps tuples (n,k) of the triangle bijectively to natural numbers and then use a list for the look-up table. How promising is this? Other ideas?
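For concreteness, here is a hedged sketch of that bijection idea (the helper names are made up for illustration): map (n, k) with 0 <= k <= n to a unique index via the triangular layout of Pascal's triangle, and grow a plain list on demand.

def pair_index(n, k):
    # rows 0..n-1 of the triangle occupy the first n*(n+1)//2 slots
    return n * (n + 1) // 2 + k

bc_list = []

def store(n, k, value):
    i = pair_index(n, k)
    if i >= len(bc_list):
        bc_list.extend([None] * (i + 1 - len(bc_list)))  # None holes for skipped (n, k)
    bc_list[i] = value

def lookup(n, k):
    i = pair_index(n, k)
    return bc_list[i] if i < len(bc_list) else None

Note that the None holes can waste a lot of memory when many (n, k) pairs are skipped.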
Disclaimer: I maintain gmpy2.
Once you start working with integers longer than 50 to 100 decimal digits, you should be using gmpy2.mpz.
A dictionary seems like the best choice. Indexing a list is slightly faster than a dictionary lookup but the overhead of mapping (n,k) to an index value makes it slower on my system.
There may be a way to decrease the memory usage. You calculate binomial coefficients recursively and save all the intermediate values, so bcvalues will get very large. If you don't need all the binomial coefficients for smaller values of n and k, then you might try using gmpy2.comb to calculate the binomial coefficient without saving all the intermediate values.
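For illustration, a minimal sketch of that suggestion (assuming gmpy2 is installed): gmpy2.comb(n, k) computes the coefficient directly, so only the values actually requested are ever cached.

import gmpy2

bcvalues = {}

def bc(n, k):
    k = min(k, n - k)                        # same symmetry trick as in the question
    if (n, k) not in bcvalues:
        bcvalues[(n, k)] = gmpy2.comb(n, k)  # direct computation, no recursion tree
    return bcvalues[(n, k)]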

Size-efficient dictionary (associative array) implementation

What algorithms are available for a size-efficient dictionary or associative array?
For example, with this key/value set, how can one avoid duplication "Alice" in values?
{
"Pride and Prejudice": "Alice",
"The Brothers Karamazov": "Pat",
"Wuthering Heights": "Alice"
}
I checked Python's implementation of dictionaries, but it seems like the implementation is focused on speed (keeping O(1)), not size.
As mentioned by bennofs in comments, you could use intern() to ensure that identical strings are stored only once:
class InternDict(dict):
    def __setitem__(self, key, value):
        if isinstance(value, str):
            # intern() is a builtin in Python 2; in Python 3 use sys.intern()
            super(InternDict, self).__setitem__(key, intern(value))
        else:
            super(InternDict, self).__setitem__(key, value)
Here's an example of the effect that has:
>>> d = {}
>>> d["a"] = "This string is presumably too long to be auto-interned."
>>> d["b"] = "This string is presumably too long to be auto-interned."
>>> d["a"] is d["b"]
False
>>> di = InternDict()
>>> di["a"] = "This string is presumably too long to be auto-interned."
>>> di["b"] = "This string is presumably too long to be auto-interned."
>>> di["a"] is di["b"]
True
One way to improve space efficiency (in addition to sharing values, which, as bennofs points out in the comments, you can probably accomplish efficiently by using sys.intern) is to use hopscotch hashing, an open addressing scheme (a variant of linear probing) for resolving collisions. Closed addressing schemes use more space because you need to allocate a linked list for each bucket, whereas with an open addressing scheme you just use an open adjacent slot in the backing array without allocating any linked lists. Unlike other open addressing schemes (such as cuckoo hashing or vanilla linear probing), hopscotch hashing performs well under a high load factor (over 90%) and guarantees constant-time lookups.
If your dictionary can fit in memory, then a simple hash table can be used.
Try to insert every key-value pair into the hash table. If the key already exists before inserting, then you have found a duplicate. There are hash table implementations in many languages.
There are basically two approaches: arrays and trees.
Arrays focus on speed at a high memory cost. The main difference between hash table implementations is their behavior on uniqueness: some implementations enforce uniqueness, others do not.
Trees focus on smart memory usage at the cost of O(log(n)) CPU usage. g++'s std::map relies on a very powerful red-black tree.
If size is very, very problematic, then you should look into Huffman and/or Lempel-Ziv compression, but that costs a little more work to adapt to a dictionary.
If your dictionary can't fit in memory, you should look at databases (a concrete sketch follows below).
The database counterpart of the red-black tree is the B-tree (almost). It has branching-factor optimizations to cope with the high latency of hard drives.
I have put many links to Wikipedia, but if you like this subject, I recommend reading further.
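To make the database route concrete, here is a minimal sketch using Python's built-in sqlite3 module as a disk-backed key-value store (the table and file names are made up for the example; SQLite keeps its index in a B-tree, so lookups stay fast even when the data no longer fits in RAM):

import sqlite3

conn = sqlite3.connect("kv.db")
conn.execute("CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)")

def put(key, value):
    conn.execute("INSERT OR REPLACE INTO kv VALUES (?, ?)", (key, value))

def get(key):
    row = conn.execute("SELECT value FROM kv WHERE key = ?", (key,)).fetchone()
    return row[0] if row else None

put("Pride and Prejudice", "Alice")
print(get("Pride and Prejudice"))   # Alice
conn.commit()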

Python unhash value

I am a newbie to Python. Can I unhash, or rather, how can I unhash a value? I am using the standard hash() function. What I would like to do is to first hash a value, send it somewhere, and then unhash it, as such:
#process X
hashedVal = hash(someVal)
#send n receive in process Y
someVal = unhash(hashedVal)
#for example print it
print someVal
Thx in advance
It can't be done.
A hash is not a compressed version of the original value; it is a number (or something similar) derived from the original value. The nature of hash implementations is that it is possible (but statistically unlikely, if the hash algorithm is a good one) that two different objects produce the same hash value.
This is a consequence of the Pigeonhole Principle, which basically states that if you have N different items and want to place them into M different categories, where N is larger than M (i.e. more items than categories), you're going to end up with some categories containing multiple items. Since a hash value is typically much smaller in size than the data it hashes, it follows the same principle.
As such, it is impossible to go back once you have the hash value. You need a different way of transporting data than this.
For instance, an example (but not a very good one) hash algorithm would be to calculate the number modulus 3 (ie. the remainder after dividing by 3). Then you would have the following hash values from numbers:
1 --> 1 <--+- same hash number, but different original values
2 --> 2 |
3 --> 0 |
4 --> 1 <--+
Are you trying to use the hash function in this way in order to:
Save space (you have observed that the hash value is much smaller in size than the original data)
Secure transportation (you have observed that the hash value is difficult to reverse)
Transport data (you have observed that the hash number/string is easier to transport than a complex object hierarchy)
... ?
Knowing why you want to do this might give you a better answer than just "it can't be done".
For instance, for the above 3 different observations, here's a way to do each of them properly:
Compression/Decompression, for instance using gzip or zlib (the two typically available in most programming languages/runtimes)
Encryption/Decryption, for instance using RSA, AES or a similar secure encryption algorithm
Serialization/Deserialization, which is code built to take a complex object hierarchy and produce either a binary or textual representation that later on can be deserialized back into new objects
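For illustration, here is a small sketch of the compression and serialization alternatives using only the standard library; unlike a hash, both are fully reversible:

import pickle
import zlib

data = {"numbers": list(range(100)), "label": "example"}

# Serialization: object -> bytes -> identical object.
blob = pickle.dumps(data)
assert pickle.loads(blob) == data

# Compression: bytes -> smaller bytes -> identical bytes.
packed = zlib.compress(blob)
assert zlib.decompress(packed) == blob
print(len(blob), "->", len(packed))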
Even if I'm almost 8 years late with an answer, I want to say it is possible to unhash data (not with the std hash() function though).
The previous answers all describe cryptographic hash functions, which by design compute hashes that are impossible (or at least very hard) to unhash.
However, this is not the case with all hash functions.
Solution
You can use basehash python lib (pip install basehash) to achieve what you want.
There is an important thing to keep in mind though: in order to be able to unhash the data, you need to hash it without loss of data. This generally means that the bigger the pool of data types and values you would like to hash, the bigger the hash length has to be, so that you won't get hash collisions.
Anyway, here's a simple example of how to hash/unhash data:
import basehash
hash_fn = basehash.base36() # you can initialize a 36, 52, 56, 58, 62 and 94 base fn
hash_value = hash_fn.hash(1) # returns 'M8YZRZ'
unhashed = hash_fn.unhash('M8YZRZ') # returns 1
You can define the hash length on hash function initialization and hash other data types as well.
I leave out the explanation of the necessity for various bases and hash lengths to the readers who would like to find out more about hashing.
You can't "unhash" data, hash functions are irreversible due to the pigeonhole principle
http://en.wikipedia.org/wiki/Hash_function
http://en.wikipedia.org/wiki/Pigeonhole_principle
I think what you are looking for is encryption/decryption. (Or compression or serialization, as mentioned in other answers/comments.)
This is not possible in general. A hash function necessarily loses information, and python's hash is no exception.

What is a hashtable/dictionary implementation for Python that doesn't store the keys?

I'm storing millions, possibly billions of 4 byte values in a hashtable and I don't want to store any of the keys. I expect that only the hashes of the keys and the values will have to be stored. This has to be fast and all kept in RAM. The entries would still be looked up with the key, unlike set()'s.
What is an implementation of this for Python? Is there a name for this?
Yes, collisions are allowed and can be ignored.
(I can make an exception for collisions, the key can be stored for those. Alternatively, collisions can just overwrite the previously stored value.)
Bloomier filters - space-efficient associative array
From Wikipedia:
Chazelle et al. (2004) designed a generalization of Bloom filters that could associate a value with each element that had been inserted, implementing an associative array. Like Bloom filters, these structures achieve a small space overhead by accepting a small probability of false positives. In the case of "Bloomier filters", a false positive is defined as returning a result when the key is not in the map. The map will never return the wrong value for a key that is in the map.
How about using an ordinary dictionary and instead of doing:
d[x]=y
use:
d[hash(x)]=y
To look up:
d[hash(foo)]
Of course, if there is a hash collision, you may get the wrong value back.
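Wrapped up as a tiny class (a sketch of the suggestion above, with the same caveat: a hash collision silently returns the wrong value):

class HashKeyedDict:
    def __init__(self):
        self._d = {}
    def __setitem__(self, key, value):
        self._d[hash(key)] = value      # the key object itself is never stored
    def __getitem__(self, key):
        return self._d[hash(key)]       # may return a colliding key's value!

d = HashKeyedDict()
d["some long key we don't want to keep"] = 42
print(d["some long key we don't want to keep"])   # 42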
It's the good old space vs. runtime tradeoff: you can have constant time with linear space usage for the keys in a hashtable, or you can store the key implicitly and use log n time with a binary tree. The (binary) hash of a value gives you the path in the tree where it will be stored.
Build your own b-tree in RAM.
Memory use:
(4 bytes) comparison hash value
(4 bytes) index of next leaf if hash <= comparison OR if negative index of value
(4 bytes) index of next leaf if hash > comparison OR if negative index of value
12 bytes per b-tree node for the b-tree. More overhead for the values (see below).
How do you structure this in Python? Aren't there "native arrays" of 32-bit integers supported with almost no extra memory overhead? (Python's array module provides these.) Use those.
Separate ordered array of subarrays each containing one or more values. The "indexes of value" above are indexes into this big array, allowing retrieval of all values matching the hash.
This assumes a 32-bit hash. You will need more bytes per b-tree node if you have more than 2^31 - 1 entries or a larger hash.
BUT, a spanner in the works perhaps: note that if you are not storing the key values, you will not be able to verify that a looked-up hash value corresponds only to your key, unless through some algorithmic or organisational mechanism you have guaranteed that no two keys will have the same hash. Quite a serious issue here. Have you considered it? :)
Although python dictionaries are very efficient, I think that if you're going to store billions of items, you may want to create your own C extension with data structures, optimized for the way you are actually using it (sequential access? completely random? etc).
In order to create a C extension, you may want to use SWIG, or something like Pyrex (which I've never used).
A hash table has to store keys, unless you provide a hash function that gives absolutely no collisions, which is nearly impossible.
If your keys are string-like, however, there is a very space-efficient data structure: the directed acyclic word graph (DAWG). I don't know of any Python implementation, though.
It's not what you asked for, but why not consider Tokyo Cabinet or BerkeleyDB for this job? It won't be in memory, but you are trading performance for greater storage capacity. You could still keep your list in memory and use the database only to check existence.
Would you please tell us more about the keys? I'm wondering if there is any regularity in the keys that we could exploit.
If the keys are strings in a small alphabet (example: strings of digits, like phone numbers) you could use a trie data structure:
http://en.wikipedia.org/wiki/Trie
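For illustration, a minimal trie sketch (nested dicts with a sentinel marking end-of-key; note this naive version actually uses more memory than a flat dict, and real implementations pack nodes much more tightly):

_END = object()   # sentinel: "a key ends at this node"

def trie_insert(root, word, value):
    node = root
    for ch in word:
        node = node.setdefault(ch, {})
    node[_END] = value

def trie_lookup(root, word):
    node = root
    for ch in word:
        if ch not in node:
            return None
        node = node[ch]
    return node.get(_END)

root = {}
trie_insert(root, "5551234", "Alice")
print(trie_lookup(root, "5551234"))   # Alice
print(trie_lookup(root, "5559999"))   # None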
If you're actually storing millions of unique values, why not use a dictionary?
Store: d[hash(key) // 32] |= 1 << (hash(key) % 32)
Check: bool(d[hash(key) // 32] & (1 << (hash(key) % 32)))
A runnable version of this idea follows below.
If you have billions of entries, use a numpy array of size (2**32)/32, instead. (Because, after all, you only have 4 billion possible values to store, anyway).
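Here is that runnable sketch of the bit-set idea (note it only records presence, not an associated value, and a hash collision shows up as a false positive):

d = {}

def add(key):
    h = hash(key)
    d[h // 32] = d.get(h // 32, 0) | (1 << (h % 32))

def contains(key):
    h = hash(key)
    return bool(d.get(h // 32, 0) & (1 << (h % 32)))

add("foo")
print(contains("foo"))   # True
print(contains("bar"))   # False (barring a collision)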
Why not a dictionary + hashlib?
>>> import hashlib
>>> hashtable = {}
>>> def myHash(obj):
...     return hashlib.sha224(obj).hexdigest()
>>> hashtable[myHash("foo")] = 'bar'
>>> hashtable
{'0808f64e60d58979fcb676c96ec938270dea42445aeefcd3a4e6f8db': 'bar'}

How are Python's Built In Dictionaries Implemented?

Does anyone know how the built-in dictionary type for Python is implemented? My understanding is that it is some sort of hash table, but I haven't been able to find any sort of definitive answer.
Here is everything about Python dicts that I was able to put together (probably more than anyone would like to know; but the answer is comprehensive).
Python dictionaries are implemented as hash tables.
Hash tables must allow for hash collisions i.e. even if two distinct keys have the same hash value, the table's implementation must have a strategy to insert and retrieve the key and value pairs unambiguously.
Python dict uses open addressing to resolve hash collisions (explained below) (see dictobject.c:296-297).
A Python hash table is just a contiguous block of memory (sort of like an array, so you can do an O(1) lookup by index).
Each slot in the table can store one and only one entry. This is important.
Each entry in the table is actually a combination of the three values: < hash, key, value >. This is implemented as a C struct (see dictobject.h:51-56).
The figure below is a logical representation of a Python hash table. In the figure below, 0, 1, ..., i, ... on the left are indices of the slots in the hash table (they are just for illustrative purposes and are not stored along with the table obviously!).
# Logical model of Python Hash table
-+-----------------+
0| <hash|key|value>|
-+-----------------+
1| ... |
-+-----------------+
.| ... |
-+-----------------+
i| ... |
-+-----------------+
.| ... |
-+-----------------+
n| ... |
-+-----------------+
When a new dict is initialized it starts with 8 slots. (see dictobject.h:49)
When adding entries to the table, we start with some slot, i, that is based on the hash of the key. CPython initially uses i = hash(key) & mask (where mask = PyDict_MINSIZE - 1, but that's not really important). Just note that the initial slot, i, that is checked depends on the hash of the key.
If that slot is empty, the entry is added to the slot (by entry, I mean, <hash|key|value>). But what if that slot is occupied!? Most likely because another entry has the same hash (hash collision!)
If the slot is occupied, CPython (and even PyPy) compares the hash AND the key (by compare I mean == comparison not the is comparison) of the entry in the slot against the hash and key of the current entry to be inserted (dictobject.c:337,344-345) respectively. If both match, then it thinks the entry already exists, gives up and moves on to the next entry to be inserted. If either hash or the key don't match, it starts probing.
Probing just means it searches slot by slot to find an empty slot. Technically we could just go one by one, i+1, i+2, ... and use the first available one (that's linear probing). But for reasons explained beautifully in the comments (see dictobject.c:33-126), CPython uses random probing. In random probing, the next slot is picked in a pseudo-random order. The entry is added to the first empty slot. For this discussion, the actual algorithm used to pick the next slot is not really important (see dictobject.c:33-126 for the probing algorithm). What is important is that the slots are probed until the first empty slot is found.
The same thing happens for lookups: it starts with the initial slot i (where i depends on the hash of the key). If the hash and the key of the entry in that slot do not both match, it starts probing until it finds a slot with a match. If all slots are exhausted, it reports a failure.
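For a feel of the probing described above, here is a sketch of the recurrence documented in the dictobject.c comments (the real code differs in details across CPython versions):

PERTURB_SHIFT = 5

def probe_sequence(hash_value, mask, steps=8):
    perturb = hash_value & 0xFFFFFFFFFFFFFFFF   # treat the hash as unsigned, as C does
    i = perturb & mask
    yield i                                     # initial slot
    for _ in range(steps):
        perturb >>= PERTURB_SHIFT               # mix in higher bits of the hash
        i = (i * 5 + perturb + 1) & mask
        yield i

# First few slots inspected for one key in an 8-slot table (mask = 7):
print(list(probe_sequence(hash("spam"), mask=7)))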
BTW, the dict will be resized if it is two-thirds full. This avoids slowing down lookups. (see dictobject.h:64-65)
NOTE: I did the research on Python Dict implementation in response to my own question about how multiple entries in a dict can have same hash values. I posted a slightly edited version of the response here because all the research is very relevant for this question as well.
How are Python's Built In Dictionaries Implemented?
Here's the short course:
They are hash tables. (See below for the specifics of Python's implementation.)
A new layout and algorithm, as of Python 3.6, makes them
ordered by key insertion, and
take up less space,
at virtually no cost in performance.
Another optimization saves space when dicts share keys (in special cases).
The ordered aspect is unofficial as of Python 3.6 (to give other implementations a chance to keep up), but official in Python 3.7.
Python's Dictionaries are Hash Tables
For a long time, it worked exactly like this. Python would preallocate 8 empty rows and use the hash to determine where to stick the key-value pair. For example, if the hash for the key ended in 001, it would stick it in the 1 (i.e. 2nd) index (like the example below.)
<hash> <key> <value>
null null null
...010001 ffeb678c 633241c4 # addresses of the keys and values
null null null
... ... ...
Each row takes up 24 bytes on a 64 bit architecture, 12 on a 32 bit. (Note that the column headers are just labels for our purposes here - they don't actually exist in memory.)
If the hash ended the same as a preexisting key's hash, this is a collision, and then it would stick the key-value pair in a different location.
After 5 key-values are stored, when adding another key-value pair, the probability of hash collisions is too large, so the dictionary is doubled in size. In a 64 bit process, before the resize, we have 72 bytes empty, and after, we are wasting 240 bytes due to the 10 empty rows.
This takes a lot of space, but the lookup time is fairly constant. The key comparison algorithm is: compute the hash and go to the expected location; compare the keys' ids - if they're the same object, they're equal; if not, compare the hash values - if those differ, the keys are not equal; otherwise, finally compare the keys for equality, and if they are equal, return the value. That final comparison can be quite slow, but the earlier checks usually shortcut it, making lookups very quick.
Collisions slow things down, and an attacker could theoretically use hash collisions to perform a denial of service attack, so we randomized the initialization of the hash function such that it computes different hashes for each new Python process.
The wasted space described above has led us to modify the implementation of dictionaries, with an exciting new feature that dictionaries are now ordered by insertion.
The New Compact Hash Tables
We start, instead, by preallocating an array for the index of the insertion.
Since our first key-value pair goes in the second slot, we index like this:
[null, 0, null, null, null, null, null, null]
And our table just gets populated by insertion order:
<hash> <key> <value>
...010001 ffeb678c 633241c4
... ... ...
So when we do a lookup for a key, we use the hash to check the position we expect (in this case, we go straight to index 1 of the array), then go to that index in the hash-table (e.g. index 0), check that the keys are equal (using the same algorithm described earlier), and if so, return the value.
We retain constant lookup time, with minor speed losses in some cases and gains in others, with the upsides that we save quite a lot of space over the pre-existing implementation and we retain insertion order. The only space wasted are the null bytes in the index array.
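A toy model of this layout may help (a sketch, not CPython's code: it linearly probes, never resizes, and skips deletion):

class CompactDict:
    def __init__(self, size=8):
        self.indices = [None] * size   # sparse: slot -> position in entries
        self.entries = []              # dense: (hash, key, value) in insertion order

    def _slot(self, key):
        mask = len(self.indices) - 1
        i = hash(key) & mask
        while self.indices[i] is not None:
            h, k, _ = self.entries[self.indices[i]]
            if h == hash(key) and k == key:
                break                  # existing key found
            i = (i + 1) & mask         # linear probe (CPython perturbs instead)
        return i

    def __setitem__(self, key, value):
        i = self._slot(key)
        if self.indices[i] is None:
            self.indices[i] = len(self.entries)
            self.entries.append((hash(key), key, value))
        else:
            self.entries[self.indices[i]] = (hash(key), key, value)

    def __getitem__(self, key):
        i = self._slot(key)
        if self.indices[i] is None:
            raise KeyError(key)
        return self.entries[self.indices[i]][2]

    def __iter__(self):                # insertion order falls out of the dense list
        return (k for _, k, _ in self.entries)

d = CompactDict()
d["b"] = 2
d["a"] = 1
print(list(d))   # ['b', 'a'] -- insertion order preserved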
Raymond Hettinger introduced this on python-dev in December of 2012. It finally got into CPython in Python 3.6. Ordering by insertion was considered an implementation detail for 3.6 to allow other implementations of Python a chance to catch up.
Shared Keys
Another optimization to save space is an implementation that shares keys. Thus, instead of having redundant dictionaries that take up all of that space, we have dictionaries that reuse the shared keys and keys' hashes. You can think of it like this:
hash key dict_0 dict_1 dict_2...
...010001 ffeb678c 633241c4 fffad420 ...
... ... ... ... ...
For a 64 bit machine, this could save up to 16 bytes per key per extra dictionary.
Shared Keys for Custom Objects & Alternatives
These shared-key dicts are intended to be used for custom objects' __dict__. To get this behavior, I believe you need to finish populating your __dict__ before you instantiate your next object (see PEP 412). This means you should assign all your attributes in the __init__ or __new__, else you might not get your space savings.
However, if you know all of your attributes at the time your __init__ is executed, you could also provide __slots__ for your object, and guarantee that __dict__ is not created at all (if not available in parents), or even allow __dict__ but guarantee that your foreseen attributes are stored in slots anyways. For more on __slots__, see my answer here.
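A small illustration of the two approaches (key-sharing happens automatically when instances get the same attributes in __init__; __slots__ avoids the per-instance __dict__ entirely):

import sys

class Shared:                          # attributes all set in __init__ -> key-sharing dict
    def __init__(self, x, y):
        self.x = x
        self.y = y

class Slotted:                         # no per-instance __dict__ at all
    __slots__ = ("x", "y")
    def __init__(self, x, y):
        self.x = x
        self.y = y

a, b = Shared(1, 2), Shared(3, 4)
print(sys.getsizeof(a.__dict__))            # small: keys/hashes live in the shared table
print(hasattr(Slotted(1, 2), "__dict__"))   # False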
See also:
PEP 509 -- Add a private version to dict
PEP 468 -- Preserving the order of **kwargs in a function.
PEP 520 -- Preserving Class Attribute Definition Order
PyCon 2010: The Mighty Dictionary - Brandon Rhodes
PyCon 2017: The Dictionary Even Mightier - Brandon Rhodes
PyCon 2017: Modern Python Dictionaries A confluence of a dozen great ideas - Raymond Hettinger
dictobject.c - CPython's actual dict implementation in C.
Python dictionaries use open addressing (reference inside Beautiful Code).
NB! Open addressing, a.k.a. closed hashing, should, as Wikipedia notes, not be confused with its opposite, open hashing!
Open addressing means that the dict uses array slots, and when an object's primary position is taken in the dict, the object's spot is sought at a different index in the same array, using a "perturbation" scheme, where the object's hash value plays part.
