I use a dictionary to store unique keys mapped to unique values.
At first the keys were str and the values int. I used sys.getsizeof(dict) to get the size, and I also printed the dict into a file:
import csv

debugggg = csv.writer(open("debug1.txt", 'w'))
for key, val in diction.items():
    debugggg.writerow([key, val])
I got 296 MB from the file and 805306648 from sys.getsizeof()
Then I stored the same values, but this time I hashed the keys before mapping them:
diction[hash(mykey_1)] = value_1
And I was expecting this to be a bit more compressed than the previous approach.
I ran the same functions to get the size. From the file I got 362 MB (!) and from sys.getsizeof() I got the same result as before (805306648).
Processing time was the same, as expected with O(1) lookups. But I am a bit confused about the sizes.
sys.getsizeof(some_dict) accounts for the size of the internal hash table, which is roughly proportional to the number of keys. But it does not account for the size of the keys and values, partly because that would be tricky to do correctly, and partly because there can be many other references to those objects so it's a bit misleading to include them (their size might be amortized over many different dicts, for instance).
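To see this concretely, here is a small sketch (exact numbers vary by Python version): two dicts with the same number of keys report the same size from sys.getsizeof, no matter how big their values are.

import sys

small = {i: "x" for i in range(1000)}
big = {i: "x" * 10000 for i in range(1000)}

# Same number of entries, so the internal hash table is the same size...
print(sys.getsizeof(small) == sys.getsizeof(big))  # True
# ...even though `big` references roughly 10 MB of string data.
print(sys.getsizeof(big))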
As for the file size: Aside from the fact that this does include the sizes of keys and values, the values (integers) are encoded differently, possibly more inefficiently. Overall this might be balanced by the fact that an int object contains considerable metadata and rounds up the actual data to 4 to 8 bytes. Other factors: A CSV file includes the commas and line breaks, and a hash is often a large number, larger than many short strings (hash("a") is -392375501 on my machine).
Side note: it's likely that the dict you built using diction[hash(mykey_1)] = ... is wrong. You're doing the hashing outside of the eyes of the dictionary, so it can't protect you from hash collisions. You might lose some values because their keys hash to the same integer. Since the internal hash table is always a power of two and only resized at certain thresholds, having a few entries fewer doesn't necessarily show up in sys.getsizeof.
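A tiny demonstration of that failure mode, relying on a CPython detail (hash(-1) and hash(-2) are both -2):

d = {}
d[hash(-1)] = "first"
d[hash(-2)] = "second"
print(len(d))  # 1 -- "second" silently overwrote "first"

safe = {-1: "first", -2: "second"}
print(len(safe))  # 2 -- the dict resolves the collision because it still has the keys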
First, O(1) is the average time for lookups in a dict, but it can be as bad as O(n). Here's the source for that information.
Second, the hash function returns an integer. sys.getsizeof(hash("Some value")) returns 12 on my platform. But if you read up on how Python dictionaries are implemented, you'll find that the size of the key doesn't really have much to do with how many bytes the entire dictionary takes up. That has more to do with how many total items you're storing, not how you access them.
Related
After reading that interning strings can help with performance: do I just store the return value from the sys.intern call in the dictionary as the key, and that's it?
import sys

t = {}
t[sys.intern('key')] = 'val'
Thanks
Yes, that's how you would use it.
To be more specific on the performance, the doc states that:
Interning strings is useful to gain a little performance on dictionary lookup – if the keys in a dictionary are interned, and the lookup key is interned, the key comparisons (after hashing) can be done by a pointer compare instead of a string compare.
There are two steps in a (classic) dict lookup: 1. hash the object into a number that is the index in the array that stores the data; 2. iterate over the array cell at this index to find a (key, value) pair with the correct key.
Usually, the second step is reasonably fast because we choose a hash function that ensures very few collisions (different objects, same hash). But it still has to check the key you are looking for against every stored key having the same hash. It is step 2 that gets faster: string identity is tested before the expensive char-by-char test of string equality.
Step 1 is harder to accelerate: you can store the hash along with the interned string, but you still have to compute the hash to find the interned string itself.
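A small sketch of what interning buys you: two equal interned strings are the same object, so the key comparison after hashing can stop at an identity check.

import sys

a = sys.intern("some fairly long key " * 3)
b = sys.intern("some fairly long key " * 3)
print(a is b)  # True: both names point at the single interned object

t = {}
t[a] = "val"
print(t[b])  # the lookup hashes b, then an identity check replaces the char-by-char compare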
This was theory! If you really need to improve performance, first do some benchmarks.
Then think of the specificity of the domain. You are storing IPv4 addresses as keys. An IPv4 address is a number between 0 and 256^4. If you replace the human-friendly representation of an address by an integer, you'll get a faster hash (hashing small numbers in CPython is almost costless: https://github.com/python/cpython/blob/master/Python/pyhash.c) and a faster lookup. The ipaddress module might be the best choice in your case.
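A sketch of that idea with the standard ipaddress module (the dictionary here is just for illustration):

import ipaddress

key = int(ipaddress.IPv4Address("172.16.254.1"))
print(key)  # 2886794753

counts = {}
counts[key] = 42  # hashing a small int in CPython is essentially free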
If you are sure that addresses are between boundaries (e.g. 172.16.0.0 – 172.31.255.255) you can try to use an array instead of a dict. It should be faster unless your array is huge (disk swap).
Finally, if this is not fast enough, be ready to use a faster language.
I have about 20 million key-value pairs. I need to create two dictionaries.
First dictionary:
The values are ints, from 0 to 20 million. The keys are strings of length 40 characters, for example '36ae99662ec931a3c20cffdecb39b69a8f7f23fd'.
Second dictionary:
Reverse of the first dictionary. The keys are ints, from 0 to 20 million. The values are strings of length 40 characters, for example '36ae99662ec931a3c20cffdecb39b69a8f7f23fd'.
I think for the second dictionary there are more options, since the index can just be used as the key; of those options, sqlite3 looks promising.
Lookup speed is not too important, 1 second look up should be okay. The main concern is I don't have too much space to store the dictionary.
As for my best guess for the first type of dictionary, from this SO post:
*large* python dictionary with persistence storage for quick look-ups
It looks like dbm would be a decent solution for the first type of dictionary, since all the keys and values are stored as bytes, though that answer was given 7 years ago, in 2012. I am not sure if it is still a decent solution today.
Considering your second dictionary is just the reverse of the first, I think you probably want to use a single-table database. You can have a primary key and then an index on the strings as well for fast lookups. Something like sqlite makes sense.
What size of memory are you dealing with? It could still fit in memory for Python, but it all depends on how much memory you have.
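A minimal sketch of that single-table layout (the table and column names are just placeholders):

import sqlite3

conn = sqlite3.connect("mapping.db")
conn.execute("CREATE TABLE IF NOT EXISTS mapping (id INTEGER PRIMARY KEY, digest TEXT NOT NULL)")
conn.execute("CREATE UNIQUE INDEX IF NOT EXISTS idx_digest ON mapping(digest)")
conn.execute("INSERT OR IGNORE INTO mapping (id, digest) VALUES (?, ?)",
             (0, "36ae99662ec931a3c20cffdecb39b69a8f7f23fd"))
conn.commit()

# Both directions are index-backed lookups:
print(conn.execute("SELECT digest FROM mapping WHERE id = ?", (0,)).fetchone())
print(conn.execute("SELECT id FROM mapping WHERE digest = ?",
                   ("36ae99662ec931a3c20cffdecb39b69a8f7f23fd",)).fetchone())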
The strings look hexadecimal. In that case you could start by using binascii.unhexlify to convert them to binary strings. That is a 50% space saving right there.
In [2]: import binascii
In [3]: binascii.unhexlify('36ae99662ec931a3c20cffdecb39b69a8f7f23fd')
Out[3]: b'6\xae\x99f.\xc91\xa3\xc2\x0c\xff\xde\xcb9\xb6\x9a\x8f\x7f#\xfd'
In [4]: len(binascii.unhexlify('36ae99662ec931a3c20cffdecb39b69a8f7f23fd'))
Out[4]: 20
20 million key/value pairs isn't all that much for a modern computer. Looking at the size of the pure data (20 bytes for the string, 4 bytes for the integer) we're talking around half a GB.
In [5]: 20e6 * (20 + 4) / 1e9
Out[5]: 0.48
The most space efficient way is to just make an array of key/value pairs, sorted by the key. Since we know that every pair is 24 bytes, accessing them in a mmapped file is trivial; you can just use slicing. And I would use a binary search for look-up.
This has no storage overhead. But inserting a value would be inefficient.
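A rough sketch of that idea (in practice the blob would be written to a file once and opened with mmap; the helper names here are mine):

import struct
from binascii import unhexlify

RECORD = struct.Struct(">20sI")  # 20-byte raw digest + 4-byte unsigned int = 24 bytes

def build(pairs):
    # pairs: iterable of (hex_digest, int); returns one packed blob sorted by digest
    return b"".join(sorted(RECORD.pack(unhexlify(h), v) for h, v in pairs))

def lookup(blob, hex_digest):
    # binary search over the fixed-width records
    target = unhexlify(hex_digest)
    lo, hi = 0, len(blob) // RECORD.size
    while lo < hi:
        mid = (lo + hi) // 2
        key, value = RECORD.unpack_from(blob, mid * RECORD.size)
        if key == target:
            return value
        if key < target:
            lo = mid + 1
        else:
            hi = mid
    return None

blob = build([("36ae99662ec931a3c20cffdecb39b69a8f7f23fd", 7)])
print(lookup(blob, "36ae99662ec931a3c20cffdecb39b69a8f7f23fd"))  # 7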
Context: I am trying to speed up the execution time of k-means. For that, I pre-compute the means before the k-means execution. These means are stored in a dictionary called means_dict, whose keys are the point ids sorted in ascending order and joined by underscores, and whose values are the means of those points.
When I want to access the mean of a given set of points in the means_dict dictionary during the k-means execution, I have to generate the key for that set of points, i.e. sort the point ids in ascending order and join them with underscores.
The key-generation instruction takes a long time because the key may contain thousands of integers.
So, for each key I have a sequence of integers joined by underscores in a dictionary. I have to sort the sequence of integers before joining them in order to make the key unique; I finally obtain a string key. The problem is that this process is very slow. I want to use another type of key that avoids sorting the sequence, and that key type should be faster than a string in terms of access, comparison and search.
# means_dict is the dictionary containing as a key a string (sequence of
# integers joined by underscores, for example key="3_76_45_78_344")
# points is a dictionary containing for each value a list of integers
for k in keys:
    # this joining instruction is what takes so long
    key = "_".join([str(c) for c in sorted(points[k])])
    if key in means_dict:
        newmu.append(means_dict[key])
Computing the means is cheap.
Did you profile your program? How much of the time is spent recomputing the means? With proper numpy arrays instead of Python boxed arrays, this should be extremely cheap - definitely cheaper than constructing any such key!
The reason why computing the key is expensive is simple: it means constructing an object of varying size. And based on your description, it seems you will be building first a list of boxed integers, then a tuple of boxed integers, then serializing this into a string, and then copying the string again to append the underscore. There is no way this is going to be faster than the simple - vectorizable - aggregation done when computing the actual mean...
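For a sense of scale, a small sketch with made-up sizes: the vectorized mean is a single numpy reduction, while building the key walks every id through Python-level objects and string operations.

import numpy as np

# hypothetical data: 10000 points in 16 dimensions, a cluster of 2000 of them
points = np.random.rand(10000, 16)
member_ids = np.sort(np.random.choice(10000, size=2000, replace=False))

mean = points[member_ids].mean(axis=0)              # one vectorized reduction
key = "_".join(str(i) for i in member_ids)          # thousands of boxed ints and string pieces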
You could even use MacQueen's approach to update the means rather than recomputing them. But even that is often slower than recomputing them.
I wouldn't be surprised if your approach ends up being 10x slower than regular k-means... And probably 1000x slower than the clever kmeans algorithms such as Hartigan and Wong's.
I do understand that querying a non-existent key in a defaultdict the way I do will add items to the defaultdict. That is why it is fair to compare my 2nd code snippet to my first one in terms of performance.
import numpy as num
from collections import defaultdict
topKeys = range(16384)
keys = range(8192)
table = dict((k,defaultdict(int)) for k in topKeys)
dat = num.zeros((16384,8192), dtype="int32")
print "looping begins"
#how much memory should this use? I think it shouldn't use more than a few
#times the memory required to hold (16384*8192) int32's (512 MB), but
#it uses 11 GB!
for k in topKeys:
    for j in keys:
        dat[k,j] = table[k][j]
print "done"
What is going on here? Furthermore, this similar script takes eons to run compared to the first one, and also uses an absurd quantity of memory.
topKeys = range(16384)
keys = range(8192)
table = [(j,0) for k in topKeys for j in keys]
I guess python ints might be 64 bit ints, which would account for some of this, but do these relatively natural and simple constructions really produce such a massive overhead?
I guess these scripts show that they do, so my question is: what exactly is causing the high memory usage in the first script and the long runtime and high memory usage of the second script and is there any way to avoid these costs?
Edit:
Python 2.6.4 on 64 bit machine.
Edit 2: I can see why, to a first approximation, my table should take up 3 GB
16384*8192*(12+12) bytes
and 6 GB with a defaultdict load factor that forces it to reserve double the space.
Then inefficiencies in memory allocation eat up another factor of 2.
So here are my remaining questions:
Is there a way for me to tell it to use 32 bit ints somehow?
And why does my second code snippet take FOREVER to run compared to the first one? The first one takes about a minute and I killed the second one after 80 minutes.
Python ints are internally represented as C longs (it's actually a bit more complicated than that), but that's not really the root of your problem.
The biggest overhead is your usage of dicts. (defaultdicts and dicts are about the same in this description). dicts are implemented using hash tables, which is nice because it gives quick lookup of pretty general keys. (It's not so necessary when you only need to look up sequential numerical keys, since they can be laid out in an easy way to get to them.)
A dict can have many more slots than it has items. Let's say you have a dict with 3x as many slots as items. Each of these slots needs room for a pointer to a key and a pointer serving as the end of a linked list. That's 6x as many pointers as numbers, plus all the pointers to the items you're interested in. Consider that each of these pointers is 8 bytes on your system and that you have 16384 defaultdicts in this situation. As a rough, handwavey look at this, 16384 occurrences * (8192 items/occurrence) * 7 (pointers/item) * 8 (bytes/pointer) = 7 GB. This is before I've gotten to the actual numbers you're storing (each unique number of which is itself a Python int), the outer dict, that numpy array, or the stuff Python's keeping track of to try to optimize some.
Your overhead sounds a little higher than I suspect, and I would be interested in knowing whether that 11 GB was for the whole process or whether you calculated it for just table. In any event, I do expect the size of this dict-of-defaultdicts data structure to be orders of magnitude bigger than the numpy array representation.
As to "is there any way to avoid these costs?" the answer is "use numpy for storing large, fixed-size contiguous numerical arrays, not dicts!" You'll have to be more specific and concrete about why you found such a structure necessary for better advice about what the best solution is.
Well, look at what your code is actually doing:
topKeys = range(16384)
table = dict((k,defaultdict(int)) for k in topKeys)
This creates a dict holding 16384 defaultdict(int)'s. A dict has a certain amount of overhead: the dict object itself is between 60 and 120 bytes (depending on the size of pointers and ssize_t's in your build.) That's just the object itself; unless the dict holds fewer than a couple of items, the data is a separate block of memory, between 12 and 24 bytes per entry, and it's always between 1/2 and 2/3rds filled. And defaultdicts are 4 to 8 bytes bigger because they have this extra thing to store. And ints are 12 bytes each, and although they're reused where possible, that snippet won't reuse most of them. So, realistically, in a 32-bit build, that snippet will take up 60 + (16384*12) * 1.8 (fill factor) bytes for the table dict, 16384 * 64 bytes for the defaultdicts it stores as values, and 16384 * 12 bytes for the integers. So that's just over a megabyte and a half without storing anything in your defaultdicts. And that's in a 32-bit build; a 64-bit build would be twice that size.
Then you create a numpy array, which is actually pretty conservative with memory:
dat = num.zeros((16384,8192), dtype="int32")
This will have some overhead for the array itself, the usual Python object overhead plus the dimensions and type of the array and such, but it wouldn't be much more than 100 bytes, and only for the one array. It does store 16384*8192 int32's in your 512 MB, though.
And then you have this rather peculiar way of filling this numpy array:
for k in topKeys:
    for j in keys:
        dat[k,j] = table[k][j]
The two loops themselves don't use much memory, and they re-use it each iteration. However, table[k][j] creates a new Python integer for each value you request, and stores it in the defaultdict. The integer created is always 0, and it so happens that that always gets reused, but storing the reference to it still uses up space in the defaultdict: the aforementioned 12 bytes per entry, times the fill factor (between 1.66 and 2.) That lands you close to 3 GB of actual data right there, and 6 GB in a 64-bit build.
On top of that, the defaultdicts, because you keep adding data, have to keep growing, which means they have to keep reallocating. Because of Python's malloc frontend (obmalloc) and how it allocates smaller objects in blocks of its own, and how process memory works on most operating systems, this means your process will allocate more and not be able to free it; it won't actually use all of the 11 GB, and Python will re-use the available memory in between the large blocks for the defaultdicts, but the total mapped address space will be that 11 GB.
Mike Graham gives a good explanation of why dictionaries use more memory, but I thought that I'd explain why your table dict of defaultdicts starts to take up so much memory.
The way that the defaultdict (DD) is set up right now, whenever you retrieve an element that isn't in the DD, you get the default value for the DD (0 in your case), but the DD also now stores a key, that previously wasn't in the DD, with the default value of 0. I personally don't like this, but that's how it goes. However, it means that for every iteration of the inner loop, new memory is being allocated, which is why it is taking forever. If you change the lines
for k in topKeys:
    for j in keys:
        dat[k,j] = table[k][j]
to
for k in topKeys:
    for j in keys:
        if j in table[k]:
            dat[k,j] = table[k][j]
        else:
            dat[k,j] = 0
then default values aren't being assigned to keys in the DDs, and so the memory stays around 540 MB for me, which is mostly just the memory allocated for dat. DDs are decent for sparse matrices, though you probably should just use the sparse matrices in SciPy if that's what you want.
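An equivalent, slightly more compact form of the same fix is dict.get, which returns the default without inserting it, so the DDs stay empty:

for k in topKeys:
    for j in keys:
        dat[k,j] = table[k].get(j, 0)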
I'm storing millions, possibly billions, of 4-byte values in a hashtable and I don't want to store any of the keys. I expect that only the hashes of the keys and the values will have to be stored. This has to be fast and all kept in RAM. The entries would still be looked up with the key, unlike with a set().
What is an implementation of this for Python? Is there a name for this?
Yes, collisions are allowed and can be ignored.
(I can make an exception for collisions, the key can be stored for those. Alternatively, collisions can just overwrite the previously stored value.)
Bloomier filters - space-efficient associative array
From the Wikipedia:
Chazelle et al. (2004) designed a generalization of Bloom filters that could associate a value with each element that had been inserted, implementing an associative array. Like Bloom filters, these structures achieve a small space overhead by accepting a small probability of false positives. In the case of "Bloomier filters", a false positive is defined as returning a result when the key is not in the map. The map will never return the wrong value for a key that is in the map.
How about using an ordinary dictionary and instead of doing:
d[x]=y
use:
d[hash(x)]=y
To look up:
d[hash(foo)]
Of course, if there is a hash collision, you may get the wrong value back.
It's the good old space vs. runtime tradeoff: you can have constant time with linear space usage for the keys in a hashtable. Or you can store the key implicitly and use O(log n) time by using a binary tree. The (binary) hash of a value gives you the path in the tree where it will be stored.
Build your own b-tree in RAM.
Memory use:
(4 bytes) comparison hash value
(4 bytes) index of next leaf if hash <= comparison, OR (if negative) index of value
(4 bytes) index of next leaf if hash > comparison, OR (if negative) index of value
12 bytes per b-tree node for the b-tree. More overhead for the values (see below).
How do you structure this in Python? Aren't there "native arrays" of 32-bit integers supported with almost no extra memory overhead...? What are they called... anyway, those.
Separate ordered array of subarrays each containing one or more values. The "indexes of value" above are indexes into this big array, allowing retrieval of all values matching the hash.
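A minimal sketch of that node layout, assuming the "native arrays" in question are Python's array module (typed, compact, with no per-element object overhead):

from array import array

# three parallel arrays hold the three 4-byte fields of each node: 12 bytes per node
node_hash = array('i')  # comparison hash value
node_le = array('i')    # index of next node if hash <= comparison, or -(value index)-1
node_gt = array('i')    # index of next node if hash > comparison, or -(value index)-1

values = array('i')     # the big ordered array of 4-byte values that negative indexes point into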
This assumes a 32-bit hash. You will need more bytes per b-tree node if you have more than 2^31-1 entries or a larger hash.
BUT Spanner in the works perhaps: Note that you will not be able, if you are not storing the key values, to verify that a hash value looked up corresponds only to your key unless through some algorithmic or organisational mechanism you have guaranteed that no two keys will have the same hash. Quite a serious issue here. Have you considered it? :)
Although python dictionaries are very efficient, I think that if you're going to store billions of items, you may want to create your own C extension with data structures, optimized for the way you are actually using it (sequential access? completely random? etc).
In order to create a C extension, you may want to use SWIG, or something like Pyrex (which I've never used).
A hash table has to store keys, unless you provide a hash function that gives absolutely no collisions, which is nearly impossible.
If your keys are string-like, however, there is a very space-efficient data structure: the directed acyclic word graph (DAWG). I don't know of any Python implementation, though.
It's not what you asked for, but why not consider Tokyo Cabinet or BerkeleyDB for this job? It won't be in memory, but you are trading performance for greater storage capacity. You could still keep your list in memory and use the database only to check existence.
Would you please tell us more about the keys? I'm wondering if there is any regularity in the keys that we could exploit.
If the keys are strings in a small alphabet (example: strings of digits, like phone numbers) you could use a trie data structure:
http://en.wikipedia.org/wiki/Trie
If you're actually storing millions of unique values, why not use a dictionary?
Store: d[hash(key)//32] |= 2**(hash(key)%32)
Check: d[hash(key)//32] & 2**(hash(key)%32)
If you have billions of entries, use a numpy array of size (2**32)/32, instead. (Because, after all, you only have 4 billion possible values to store, anyway).
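Here is a runnable sketch of that bit-per-hash idea (note that it only records membership, one bit per folded 32-bit hash; it does not store your 4-byte values):

from collections import defaultdict

bits = defaultdict(int)

def store(key):
    h = hash(key) & 0xFFFFFFFF          # fold the hash to 32 bits
    bits[h // 32] |= 1 << (h % 32)

def check(key):
    h = hash(key) & 0xFFFFFFFF
    return bool(bits.get(h // 32, 0) & (1 << (h % 32)))

store("foo")
print(check("foo"), check("bar"))       # True False (barring hash collisions)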
Why not a dictionary + hashlib?
>>> import hashlib
>>> hashtable = {}
>>> def myHash(obj):
...     return hashlib.sha224(obj).hexdigest()
...
>>> hashtable[myHash("foo")] = 'bar'
>>> hashtable
{'0808f64e60d58979fcb676c96ec938270dea42445aeefcd3a4e6f8db': 'bar'}