Suppose I have 1000 different objects of the same class instantiated, and I assign them to a dictionary whose keys are the integers 1 to 1000 and whose values are those 1000 objects.
Now I create another dictionary whose keys are tuples (obj1, 1), (obj2, 2), etc., the objs being those same 1000 objects, and whose values are 1 to 1000.
Does the existence of those two dictionaries mean that memory usage will double, because the 1000 objects appear in each dict's keys or values?
It should NOT double, right? Because we are not creating new objects, we are merely assigning references to those same objects. So I could have 1000 similar dictionaries with those objects as either values or as keys (as part of a tuple), with no significant increase in memory usage.
Is that right?
Objects are not copied, but referenced.
If your objects are small (e.g. integers), the overhead of a list of tuples or a dict is significant.
If your objects are large (e.g. very long unique strings), the overhead is much smaller relative to the size of the objects, so memory usage will not increase much due to the creation of another dict / list over the same objects.
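A quick way to convince yourself of this (a minimal sketch; Thing is a stand-in for your class):

import sys

class Thing:
    pass

objs = [Thing() for _ in range(1000)]

d1 = {i + 1: obj for i, obj in enumerate(objs)}            # keys 1..1000 -> objects
d2 = {(obj, i + 1): i + 1 for i, obj in enumerate(objs)}   # (object, int) keys -> 1..1000

# Every entry refers to the very same objects; nothing is copied:
assert all(d1[i + 1] is objs[i] for i in range(1000))
assert all(key[0] is objs[key[1] - 1] for key in d2)

# getsizeof reports only each dict's own pointer table, not the objects:
print(sys.getsizeof(d1), sys.getsizeof(d2))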
Related
I am working with many datasets that are of the structure Key|Date|Value.
The Key values can be variable-length strings or integers. The Value can be of any data type. The dates can be non-continuous. An example set might be:
ABC|12-Dec-2021|1.0
DE|21-Dec-2022|5.0
HIJGSDFSDF|13-Dec-2021|1.0
ABC|15-Dec-2021|5.0
In general there can be ~5000 dates and ~20000 identifiers per dataset. I am trying to store this on disk so that it can be loaded efficiently into NumPy arrays in Python. The modes of access could be:
Return all Key, Dates and Values from a file
Return all Dates and Values for a given list of Keys
Return all Values, for an input list of Keys and Dates (maintaining order of inputs). The date lookup can be fuzzy, with lookback and tolerance - e.g. return the most recent value within 10 days
The focus is on fast read speed - writing can be slower.
My idea so far is to lay the file out like:
a) Header information including data types etc
b) Array of Unique Keys, and Offset into the file for Data
c) At each offset, Store (Date, Value) pairs sorted in date order
All reads would be based on a memory map of the file. The three reads would then look like:
Read all keys from b), calculate the size of the required arrays from the offsets and data sizes, then allocate the three Key/Date/Value arrays and iterate through the file, copying across into each array
Same as 1, but filter the array of keys based on input
First sort the Key and Date arrays, then iterate through them and, for each Key, move to its offset and perform a binary search for each date to get the value. Once this is complete, perform another sort to restore the original order. (A sketch of this per-key lookup follows below.)
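For illustration, here is a minimal NumPy sketch of the per-key part of read 3, assuming the (Date, Value) pairs for one key are already loaded as date-sorted arrays (lookup_with_lookback is a hypothetical helper name, not a library function):

import numpy as np

def lookup_with_lookback(dates, values, query_dates, tolerance_days=10):
    # dates must be sorted ascending, as stored at each key's offset
    idx = np.searchsorted(dates, query_dates, side='right') - 1  # latest stored date <= query
    out = np.full(len(query_dates), np.nan)
    ok = idx >= 0                                   # some stored date exists at or before the query
    hit = (query_dates[ok] - dates[idx[ok]]) <= np.timedelta64(tolerance_days, 'D')
    sel = np.flatnonzero(ok)[hit]                   # queries that are both ok and within tolerance
    out[sel] = values[idx[sel]]
    return out

dates = np.array(['2021-12-01', '2021-12-12', '2021-12-15'], dtype='datetime64[D]')
values = np.array([0.5, 1.0, 5.0])
queries = np.array(['2021-12-14', '2021-11-01'], dtype='datetime64[D]')
print(lookup_with_lookback(dates, values, queries))  # [1.0, nan]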
I am wondering if there are better data structures or approaches to this problem.
Edit: I have considered a database, e.g. SQLite, but I do not believe it would be performant for read type 3. E.g. if my input key array was (a,b,a,a,b,b) and date array was (11-Nov, 11-Nov, 13-Nov, 12-Nov, 12-Nov, 15-Nov), the SQL query would need to build a where clause for each unique key/date pair, extract the results, then sort them back into input order.
In addition, the lookback would add even more complexity: if there were no (a, 11-Nov) pair but there was an (a, 5-Nov) pair, the latter should be returned.
I'm no expert, but I use Parquet to improve on-disk storage and read times.
https://www.rstudio.com/blog/speed-up-data-analytics-with-parquet-files/
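For what it's worth, a minimal sketch of the pattern with pandas (assuming pyarrow is installed; the filters argument is passed through to the underlying Parquet reader):

import pandas as pd

# one-off (slower) write
df = pd.DataFrame({
    'Key':   ['ABC', 'DE', 'HIJGSDFSDF', 'ABC'],
    'Date':  pd.to_datetime(['2021-12-12', '2022-12-21', '2021-12-13', '2021-12-15']),
    'Value': [1.0, 5.0, 1.0, 5.0],
})
df.to_parquet('data.parquet', index=False)

# columnar read: load only the needed columns and push the key
# filter down into the reader instead of loading everything
subset = pd.read_parquet('data.parquet', columns=['Date', 'Value'],
                         filters=[('Key', 'in', ['ABC'])])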
I know that in python, dictionaries are implemented by hashing the keys and storing pointers (to the key-value pairs) in an array, the index being determined by the hash.
But how are the key-value pairs themselves stored? Are they stored together (i.e., in contiguous spots in memory)? Are they stored as a tuple or array of pointers, one pointing to the key and one to the value? Or is it something else entirely?
Googling has turned up lots of explanations about hashing and open addressing and the like, but nothing addressing this question.
Roughly speaking, there is a function, let's call it F, which calculates an index, F(h), into an array of values. So the values are stored in an array and looked up as F(h). The reason it's "roughly speaking" is that hashes are computed differently for different objects: for pointers it's p >> 3, while for strings the hash is a digest of all the bytes of the string.
If you want to look at the C code, search for lookdict_index or just look at the dictobject.c file in CPython's source code. It's pretty readable if you are used to reading C code.
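To make the F(h) idea concrete, here is a toy pure-Python model of the lookup; real CPython perturbs the probe sequence with the higher hash bits rather than probing linearly:

# table is a list of None or (key, value) pairs whose length is a power of two
def toy_lookup(table, key):
    mask = len(table) - 1
    i = hash(key) & mask             # F(h): initial slot derived from the hash
    while table[i] is not None:
        k, v = table[i]
        if k is key or k == key:     # identity shortcut, then equality
            return v
        i = (i + 1) & mask           # collision: probe the next slot
    raise KeyError(key)

table = [None] * 8
table[hash('spam') & 7] = ('spam', 42)
print(toy_lookup(table, 'spam'))     # 42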
Edit 1:
From the comment in Python 3.6.1's Include/dictobject.h:
/* If ma_values is NULL, the table is "combined": keys and values
are stored in ma_keys.
If ma_values is not NULL, the table is splitted:
keys are stored in ma_keys and values are stored in ma_values */
And an explanation from dictobject.c:
/*
The DictObject can be in one of two forms.
Either:
A combined table:
ma_values == NULL, dk_refcnt == 1.
Values are stored in the me_value field of the PyDictKeysObject.
Or:
A split table:
ma_values != NULL, dk_refcnt >= 1
Values are stored in the ma_values array.
Only string (unicode) keys are allowed.
All dicts sharing same key must have same insertion order.
....
*/
The values are either stored in a separate array of value pointers that parallels the array of key entries (the split table), or each value's pointer is stored in the me_value field of a PyDictKeyEntry (the combined table). The keys are stored in the me_key fields of PyDictKeyEntry; the array of keys is really an array of PyDictKeyEntry structs.
Just as a reference, PyDictKeyEntry is defined as:
typedef struct {
    /* Cached hash code of me_key. */
    Py_hash_t me_hash;
    PyObject *me_key;
    PyObject *me_value; /* This field is only meaningful for combined tables */
} PyDictKeyEntry;
Relevant files to look at: Objects/dict-common.h, Objects/dictobject.c, and Include/dictobject.h in Python's source code.
Objects/dictobject.c has an extensive write up in the comments in the beginning of the file explaining the whole scheme and historical background.
How is the hash value of a particular string calculated in CPython 2.7?
For instance, this code:
print hash('abcde' * 1000)
returns the same value even after I restart the Python process and try again (I did it many times).
So it seems that the id() (memory address) of the string isn't used in this computation, right? Then how is it done?
Hash values are not dependent on the memory location but the contents of the object itself. From the documentation:
Return the hash value of the object (if it has one). Hash values are integers. They are used to quickly compare dictionary keys during a dictionary lookup. Numeric values that compare equal have the same hash value (even if they are of different types, as is the case for 1 and 1.0).
See CPython's implementation of str.__hash__ in:
Objects/unicodeobject.c (for unicode_hash)
Python/pyhash.c (for _Py_HashBytes)
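You can see that the hash depends on the contents, not the identity (note that since Python 3.3 string hashes are salted per process, so unlike CPython 2.7 the value changes between runs unless PYTHONHASHSEED is fixed; within one process, equal strings still hash equal):

s1 = 'abcde' * 1000
s2 = ''.join(['abcde'] * 1000)   # a distinct object with the same contents

print(s1 is s2)                  # usually False: two separate objects
print(hash(s1) == hash(s2))      # True: the hash is a digest of the bytes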
When using a dictionary in Python, the following is impossible:
d = {}
d[[1,2,3]] = 4
since 'list' is an unhashable type. However, the id function in Python returns an integer for an object that is guaranteed to be unique for the object's lifetime.
Why doesn't Python use id to hash a dictionary? Are there drawbacks?
The reason is right here (Why must dictionary keys be immutable)
Some unacceptable solutions that have been proposed:
Hash lists by their address (object ID). This doesn’t work because if you construct a new list with the same value it won’t be found; e.g.:
mydict = {[1, 2]: '12'}
print mydict[[1, 2]]
would raise a KeyError exception because the id of the [1, 2] used in the second line differs from that in the first line. In other words, dictionary keys should be compared using ==, not using is.
It is a requirement that if a == b, then hash(a) == hash(b). Using the id can break this, because the ID will not change if you mutate the list. Then you might have two lists that have equal contents, but have different hashes.
Another way to look at it is, yes, you could do it, but it would mean that you could not retrieve the dict value with another list with the same contents. You could only retrieve it by using the exact same list object as the key.
In Python dictionaries, keys are compared using ==, and the equality operator on lists does an item-by-item equality check, so two different lists with the same elements compare equal and must behave as the same key in a dictionary.
If you need to keep a dictionary or set of lists by identity instead of equality, you can wrap the list in a user-defined object or, depending on the context, maybe use a dictionary where elements are stored/retrieved using id explicitly.
Note, however, that storing the id of an object doesn't keep the object alive, that there is no way to go from an id back to the object, and that an id may be reused over time for objects that have been garbage collected. A solution that keeps the object alive (and so keeps its id valid) is to use
my_dict[id(x)] = [x, value]
instead of
my_dict[id(x)] = value
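A minimal sketch of that pattern (add_by_identity / get_by_identity are hypothetical helper names):

my_dict = {}

def add_by_identity(d, obj, value):
    # keeping obj inside the entry keeps it alive, so its id
    # cannot be reused by another object while the entry exists
    d[id(obj)] = [obj, value]

def get_by_identity(d, obj):
    return d[id(obj)][1]

a = [1, 2]
b = [1, 2]                        # equal contents, different identity
add_by_identity(my_dict, a, 'value for a')
add_by_identity(my_dict, b, 'value for b')
print(get_by_identity(my_dict, a))   # 'value for a'
print(get_by_identity(my_dict, b))   # 'value for b'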
If I have a dictionary in Python, will .keys() and .values() return the corresponding elements in the same order?
E.g.
foo = {'foobar' : 1, 'foobar2' : 4, 'kittty' : 34743}
For the keys it returns:
>>> foo.keys()
['foobar2', 'foobar', 'kittty']
Now will foo.values() return the elements always in the same order as their corresponding keys?
It's hard to improve on the Python documentation:
Keys and values are listed in an arbitrary order which is non-random, varies across Python implementations, and depends on the dictionary’s history of insertions and deletions. If items(), keys(), values(), iteritems(), iterkeys(), and itervalues() are called with no intervening modifications to the dictionary, the lists will directly correspond. This allows the creation of (value, key) pairs using zip(): pairs = zip(d.values(), d.keys()). The same relationship holds for the iterkeys() and itervalues() methods: pairs = zip(d.itervalues(), d.iterkeys()) provides the same value for pairs. Another way to create the same list is pairs = [(v, k) for (k, v) in d.iteritems()]
So, in short, "yes" with the caveat that you must not modify the dictionary in between your call to keys() and your call to values().
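A quick check of both the correspondence and the items() alternative (runs under Python 2 and 3):

foo = {'foobar': 1, 'foobar2': 4, 'kittty': 34743}

# with no modifications in between, keys() and values() correspond:
pairs = list(zip(foo.values(), foo.keys()))

# items() yields each (key, value) pair together, with no ordering caveat:
assert pairs == [(v, k) for k, v in foo.items()]
print(pairs)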
Yes, they will
Just see the Python documentation:
Keys and values are listed in an arbitrary order which is non-random, varies across Python implementations, and depends on the dictionary’s history of insertions and deletions. If items(), keys(), values(), iteritems(), iterkeys(), and itervalues() are called with no intervening modifications to the dictionary, the lists will directly correspond.
The best thing to do is still to use dict.items().
From the Python 2.6 documentation:
Keys and values are listed in an arbitrary order which is non-random, varies across Python implementations, and depends on the dictionary’s history of insertions and deletions. If items(), keys(), values(), iteritems(), iterkeys(), and itervalues() are called with no intervening modifications to the dictionary, the lists will directly correspond. This allows the creation of (value, key) pairs using zip(): pairs = zip(d.values(), d.keys()). The same relationship holds for the iterkeys() and itervalues() methods: pairs = zip(d.itervalues(), d.iterkeys()) provides the same value for pairs. Another way to create the same list is pairs = [(v, k) for (k, v) in d.iteritems()].
I'm over 99% certain the same will hold true for Python 3.0.