Efficient way to store tuples in the datastore

Efficient way to store tuples in the datastore - python

If I have a pair of floats, is it any more efficient (computationally or storage-wise) to store them as a GeoPtProperty than it would be pickle the tuple and store it as a BlobProperty?
If GeoPt is doing something more clever to keep multiple values in a single property, can it be leveraged for arbitrary data? Can I store the tuple ("Johnny", 5) in a single entity property in a similarly efficient manner?

Here are some empirical answers:
GeoPtProperty uses 31B of storage space.
Using BlobProperty varies based on what exactly you store:
struct.pack('>2f', lat, lon) => 21B.
Using pickle (v2) to packe a 2-tuple containing floats => 37B.
Using pickle (v0) to packe a 2-tuple containing floats => about 30B-32B (v0 uses a variable-length ascii encoding for floats).
In short, it doesn't look like GeoPt is doing anything particularly clever. If you are going to be storing a lot of these, then you could use struct to pack your floats. Packing and unpacking them with struct will probably be unnoticeably different from the CPU cost associated with serializing/deserializing GeoPt.
If you plan on storing lots of floats per entity and space is really important, then you might consider leveraging the CompressedBlobProperty in aetycoon.
Disclaimer: This is the minimum space required. Actual space will be slightly larger per property based on the length of the property's name. The model itself also adds overhead (for its name and key).

GeoPt itself is limited to (-90 - 90, -180 - 180); it can't be used to store any data that won't fit this model.
However, a custom tuple property shouldn't be too difficult to create yourself; take a look at how SetProperty and ArrayProperty are designed in aetycoon.

Related

What is the fastest key type for the dictionaries in python? tuple, frozenset...?

Context: I am trying to speed up the execution time of k-means. For that, I pre-compute the means before the k-means execution. These means are stored in a dictionary called means_dict which has as a key a sequence of the points id ordered in ascending order and then joining by an underscore ,and as a value the mean of these points.
When I want to access to the mean of a given points set in dict_mean dictionary during the k-means execution, I have to generate the key of that points set ie order the id points in ascending order and joining them by an underscore.
The key generation instruction takes a long time because I the key may contain thousands of integers.
I have for each key a sequence of integers separated by an underscore "-" in a dictionary. I have to sort the sequence of integers before joining them by an underscore in order to make the key unique, I finaly obtain a string key. The problem is this process is so long. I want to use an another type of key that permits to avoid sorting the sequence and that key type should be faster than the string type in terms of access, comparison and search.
# means_dict is the dictionary containing as a key a string (sequence of
# integers joined by underscore "-", for example key="3-76-45-78-344")
# points is a dictionary containing for each value a list of integers
for k in keys:
# this joining instruction is so long
key = "_".join([ str(c) for c in sorted(points[k])])
if( key in means_dict ):
newmu.append( means_dict[key] )

Computing the means is cheap.
Did you profile your program? How much of the time is spent recomputing he means? With proper numpy arrays instead of python boxed arrays, this should be extremely cheap - definitely cheaper than constructing any such key!
The reason why computing the key is expensive is simple: it means constructing an object of varying size. And based on your description, it seems you will be building first a list of boxed integers, then a tuple of boxes integers, then serialize this into a string and then copy the string again to append the underscore. There is no way this is going to be faster than the simple - vectorizable - aggregation when computing the actual mean...
You could even use MacQueens approach to update means rather than recomputing them. But even that is often slower than recomputing them.
I wouldn't be surprised if your approach ends up being 10x slower than regular k-means... And probably 1000x slower than the clever kmeans algorithms such as Hartigan and Wong's.

size efficient dictionary(associative array) implementation

What algorithms are available for size efficient A dictionary or associative array?
For example, with this key/value set, how can one avoid duplication "Alice" in values?
{
"Pride and Prejudice": "Alice",
"The Brothers Karamazov": "Pat",
"Wuthering Heights": "Alice"
}
I checked Python's implementation on dictionary, but it seems like the implementation is focused on speed (keeping O(1)) not size.

As mentioned by bennofs in comments, you could use intern() to ensure that identical strings are stored only once:
class InternDict(dict):
def __setitem__(self, key, value):
if isinstance(value, str):
super(InternDict, self).__setitem__(key, intern(value))
else:
super(InternDict, self).__setitem__(key, value)
Here's an example of the effect that has:
>>> d = {}
>>> d["a"] = "This string is presumably too long to be auto-interned."
>>> d["b"] = "This string is presumably too long to be auto-interned."
>>> d["a"] is d["b"]
False
>>> di = InternDict()
>>> di["a"] = "This string is presumably too long to be auto-interned."
>>> di["b"] = "This string is presumably too long to be auto-interned."
>>> di["a"] is di["b"]
True

One way to improve space efficiency (in addition to sharing values, which (as bennofs points out in the comments) you can probably accomplish efficiently by using sys.intern) is to use hopscotch hashing, which is an open addressing scheme (a variant of linear probing) for resolving collisions - closed addressing schemes use more space because you need to allocate a linked list for each bucket, whereas with an open addressing scheme you'll just use an open adjacent slot in the backing array without needing to allocate any linked lists. Unlike other open addressing schemes (such as cuckoo hashing or vanilla linear probing), hopscotch hashing performs well under a high load factor (over 90%) and guarantees constant time lookups.

If your dictionary can fit in memory then, a simple Hashtable can be use.
Try to insert every key-value in an hashtable.
If the key alredy exist before inserting, then you have found a duplication.
There is number of implementation of hashtable in many langage.
There is basically twice way : array & tree.
Array focus on speed at hight memory cost. The main difference between Hashtable implementation is behavior on unicity, some implementation enforce unicity some others no.
Tree focus on memory smart usage at cost of O(log(n)) cpu usage. g++ map relies on very power full red black tree.
If size is very very problematics, then you should search a Huffman compression and/or Lampel Ziv compression, but it cost a little more, to adapt for dictionnary.
If your dictionnary can't fit in memory
You should look at database.
The red black tree for database is know as BTree (almost). It have branch factor optimizations for the low latency hard drive case.
I have put many link to wikipedia but if you like this subject I recommand you :

Python unhash value

I am a newbie to the python. Can I unhash, or rather how can I unhash a value. I am using std hash() function. What I would like to do is to first hash a value send it somewhere and then unhash it as such:
#process X
hashedVal = hash(someVal)
#send n receive in process Y
someVal = unhash(hashedVal)
#for example print it
print someVal
Thx in advance

It can't be done.
A hash is not a compressed version of the original value, it is a number (or something similar ) derived from the original value. The nature of hash implementations is that it is possible (but statistically unlikely if the hash algorithm is a good one) that two different objects produce the same hash value.
This is known as the Pigeonhole Principle which basically states that if you have N different items, and want to place them into M different categories, where the N number is larger than M (ie. more items than categories), you're going to end up with some categories containing multiple items. Since a hash value is typically much smaller in size than the data it hashes, it follows the same principles.
As such, it is impossible to go back once you have the hash value. You need a different way of transporting data than this.
For instance, an example (but not a very good one) hash algorithm would be to calculate the number modulus 3 (ie. the remainder after dividing by 3). Then you would have the following hash values from numbers:
1 --> 1 <--+- same hash number, but different original values
2 --> 2 |
3 --> 0 |
4 --> 1 <--+
Are you trying to use the hash function in this way in order to:
Save space (you have observed that the hash value is much smaller in size than the original data)
Secure transportation (you have observed that the hash value is difficult to reverse)
Transport data (you have observed that the hash number/string is easier to transport than a complex object hierarchy)
... ?
Knowing why you want to do this might give you a better answer than just "it can't be done".
For instance, for the above 3 different observations, here's a way to do each of them properly:
Compression/Decompression, for instance using gzip or zlib (the two typically available in most programming languages/runtimes)
Encryption/Decryption, for instance using RSA, AES or a similar secure encryption algorithm
Serialization/Deserialization, which is code built to take a complex object hierarchy and produce either a binary or textual representation that later on can be deserialized back into new objects

Even if I'm almost 8 years late with an answer, I want to say it is possible to unhash data (not with the std hash() function though).
The previous answers are all describing cryptographic hash functions, which by design should compute hashes that are impossible (or at least very hard to unhash).
However, this is not the case with all hash functions.
Solution
You can use basehash python lib (pip install basehash) to achieve what you want.
There is an important thing to keep in mind though: in order to be able to unhash the data, you need to hash it without loss of data. This generally means that the bigger the pool of data types and values you would like to hash, the bigger the hash length has to be, so that you won't get hash collisions.
Anyway, here's a simple example of how to hash/unhash data:
import basehash
hash_fn = basehash.base36() # you can initialize a 36, 52, 56, 58, 62 and 94 base fn
hash_value = hash_fn.hash(1) # returns 'M8YZRZ'
unhashed = hash_fn.unhash('M8YZRZ') # returns 1
You can define the hash length on hash function initialization and hash other data types as well.
I leave out the explanation of the necessity for various bases and hash lengths to the readers who would like to find out more about hashing.

You can't "unhash" data, hash functions are irreversible due to the pigeonhole principle
http://en.wikipedia.org/wiki/Hash_function
http://en.wikipedia.org/wiki/Pigeonhole_principle
I think what you are looking for encryption/decryption. (Or compression or serialization as mentioned in other answers/comments.)

This is not possible in general. A hash function necessarily loses information, and python's hash is no exception.

What is a hashtable/dictionary implementation for Python that doesn't store the keys?

I'm storing millions, possibly billions of 4 byte values in a hashtable and I don't want to store any of the keys. I expect that only the hashes of the keys and the values will have to be stored. This has to be fast and all kept in RAM. The entries would still be looked up with the key, unlike set()'s.
What is an implementation of this for Python? Is there a name for this?
Yes, collisions are allowed and can be ignored.
(I can make an exception for collisions, the key can be stored for those. Alternatively, collisions can just overwrite the previously stored value.)

Bloomier filters - space-efficient associative array
From the Wikipedia:
Chazelle et al. (2004) designed a
generalization of Bloom filters that
could associate a value with each
element that had been inserted,
implementing an associative array.
Like Bloom filters, these structures
achieve a small space overhead by
accepting a small probability of false
positives. In the case of "Bloomier
filters", a false positive is defined
as returning a result when the key is
not in the map. The map will never
return the wrong value for a key that
is in the map.

How about using an ordinary dictionary and instead of doing:
d[x]=y
use:
d[hash(x)]=y
To look up:
d[hash(foo)]
Of course, if there is a hash collision, you may get the wrong value back.

Its the good old space vs runtime tradeoff: You can have constant time with linear space usage for the keys in a hastable. Or you can store the key implicitly and use log n time by using a binary tree. The (binary) hash of a value gives you the path in the tree where it will be stored.

Build your own b-tree in RAM.
Memory use:
(4 bytes) comparison hash value
(4 bytes) index of next leaf if hash <= comparison OR if negative index of value
(4 bytes) index of next leaf if hash > comparison OR if negative index of value
12 bytes per b-tree node for the b-tree. More overhead for the values (see below).
How you structure this in Python - aren't there "native arrays" of 32bit integers upported with almost no extra memory overhead...? what are they called... anyway those.
Separate ordered array of subarrays each containing one or more values. The "indexes of value" above are indexes into this big array, allowing retrieval of all values matching the hash.
This assumes a 32bit hash. You will need more bytes per b-tree node if you have
greater than 2^31-1 entries or a larger hash.
BUT Spanner in the works perhaps: Note that you will not be able, if you are not storing the key values, to verify that a hash value looked up corresponds only to your key unless through some algorithmic or organisational mechanism you have guaranteed that no two keys will have the same hash. Quite a serious issue here. Have you considered it? :)

Although python dictionaries are very efficient, I think that if you're going to store billions of items, you may want to create your own C extension with data structures, optimized for the way you are actually using it (sequential access? completely random? etc).
In order to create a C extension, you may want to use SWIG, or something like Pyrex (which I've never used).

Hash table has to store keys, unless you provide a hash function that gives absolutely no collisions, which is nearly impossible.
There is, however, if your keys are string-like, there is a very space-efficient data structure - directed acyclic word graph (DAWG). I don't know any Python implementation though.

It's not what you asked for buy why not consider Tokyo Cabinet or BerkleyDB for this job? It won't be in memory but you are trading performance for greater storage capacity. You could still keep your list in memory and use the database only to check existence.

Would you please tell us more about the keys? I'm wondering if there is any regularity in the keys that we could exploit.
If the keys are strings in a small alphabet (example: strings of digits, like phone numbers) you could use a trie data structure:
http://en.wikipedia.org/wiki/Trie

If you're actually storing millions of unique values, why not use a dictionary?
Store: d[hash(key)/32] |= 2**(hash(key)%32)
Check: (d[hash(key)/32] | 2**(hash(key)%32))
If you have billions of entries, use a numpy array of size (2**32)/32, instead. (Because, after all, you only have 4 billion possible values to store, anyway).

Why not a dictionary + hashlib?
>>> import hashlib
>>> hashtable = {}
>>> def myHash(obj):
return hashlib.sha224(obj).hexdigest()
>>> hashtable[myHash("foo")] = 'bar'
>>> hashtable
{'0808f64e60d58979fcb676c96ec938270dea42445aeefcd3a4e6f8db': 'bar'}

What is the best way to store set data in Python?

I have a list of data in the following form:
[(id\__1_, description, id\_type), (id\__2_, description, id\_type), ... , (id\__n_, description, id\_type))
The data are loaded from files that belong to the same group. In each group there could be multiples of the same id, each coming from different files. I don't care about the duplicates, so I thought that a nice way to store all of this would be to throw it into a Set type. But there's a problem.
Sometimes for the same id the descriptions can vary slightly, as follows:
IPI00110753
Tubulin alpha-1A chain
Tubulin alpha-1 chain
Alpha-tubulin 1
Alpha-tubulin isotype M-alpha-1
(Note that this example is taken from the uniprot protein database.)
I don't care if the descriptions vary. I cannot throw them away because there is a chance that the protein database I am using will not contain a listing for a certain identifier. If this happens I will want to be able to display the human readable description to the biologists so they know roughly what protein they are looking at.
I am currently solving this problem by using a dictionary type. However I don't really like this solution because it uses a lot of memory (I have a lot of these ID's). This is only an intermediary listing of them. There is some additional processing the ID's go through before they are placed in the database so I would like to keep my data-structure smaller.
I have two questions really. First, will I get a smaller memory footprint using the Set type (over the dictionary type) for this, or should I use a sorted list where I check every time I insert into the list to see if the ID exists, or is there a third solution that I haven't thought of? Second, if the Set type is the better answer how do I key it to look at just the first element of the tuple instead of the whole thing?
Thank you for reading my question,
Tim
Update
based on some of the comments I received let me clarify a little. Most of what I do with data-structure is insert into it. I only read it twice, once to annotate it with additional information,* and once to do be inserted into the database. However down the line there may be additional annotation that is done before I insert into the database. Unfortunately I don't know if that will happen at this time.
Right now I am looking into storing this data in a structure that is not based on a hash-table (ie. a dictionary). I would like the new structure to be fairly quick on insertion, but reading it can be linear since I only really do it twice. I am trying to move away from the hash table to save space. Is there a better structure or is a hash-table about as good as it gets?
*The information is a list of Swiss-Prot protein identifiers that I get by querying uniprot.

Sets don't have keys. The element is the key.
If you think you want keys, you have a mapping. More-or-less by definition.
Sequential list lookup can be slow, even using a binary search. Mappings use hashes and are fast.
Are you talking about a dictionary like this?
{ 'id1': [ ('description1a', 'type1'), ('description1b','type1') ],
'id2': [ ('description2', 'type2') ],
...
}
This sure seems minimal. ID's are only represented once.
Perhaps you have something like this?
{ 'id1': ( ('description1a', 'description1b' ), 'type1' ),
'id2': ( ('description2',), 'type2' ),
...
}
I'm not sure you can find anything more compact unless you resort to using the struct module.

I'm assuming the problem you try to solve by cutting down on the memory you use is the address space limit of your process. Additionally you search for a data structure that allows you fast insertion and reasonable sequential read out.
Use less structures except strings (str)
The question you ask is how to structure your data in one process to use less memory. The one canonical answer to this is (as long as you still need associative lookups), use as little other structures then python strings (str, not unicode) as possible. A python hash (dictionary) stores the references to your strings fairly efficiently (it is not a b-tree implementation).
However I think that you will not get very far with that approach, since what you face are huge datasets that might eventually just exceed the process address space and the physical memory of the machine you're working with altogether.
Alternative Solution
I would propose a different solution that does not involve changing your data structure to something that is harder to insert or interprete.
Split your information up in multiple processes, each holding whatever datastructure is convinient for that.
Implement inter process communication with sockets such that processes might reside on other machines altogether.
Try to divide your data such as to minimize inter process communication (i/o is glacially slow compared to cpu cycles).
The advantage of the approach I outline is that
You get to use two ore more cores on a machine fully for performance
You are not limited by the address space of one process, or even the physical memory of one machine
There are numerous packages and aproaches to distributed processing, some of which are
linda
processing

If you're doing an n-way merge with removing duplicates, the following may be what you're looking for.
This generator will merge any number of sources. Each source must be a sequence.
The key must be in position 0. It yields the merged sequence one item at a time.
def merge( *sources ):
keyPos= 0
for s in sources:
s.sort()
while any( [len(s)>0 for s in sources] ):
topEnum= enumerate([ s[0][keyPos] if len(s) > 0 else None for s in sources ])
top= [ t for t in topEnum if t[1] is not None ]
top.sort( key=lambda a:a[1] )
src, key = top[0]
#print src, key
yield sources[ src ].pop(0)
This generator removes duplicates from a sequence.
def unique( sequence ):
keyPos= 0
seqIter= iter(sequence)
curr= seqIter.next()
for next in seqIter:
if next[keyPos] == curr[keyPos]:
# might want to create a sub-list of matches
continue
yield curr
curr= next
yield curr
Here's a script which uses these functions to produce a resulting sequence which is the union of all the sources with duplicates removed.
for u in unique( merge( source1, source2, source3, ... ) ):
print u
The complete set of data in each sequence must exist in memory once because we're sorting in memory. However, the resulting sequence does not actually exist in memory. Indeed, it works by consuming the other sequences.

How about using {id: (description, id_type)} dictionary? Or {(id, id_type): description} dictionary if (id,id_type) is the key.

Sets in Python are implemented using hash tables. In earlier versions, they were actually implemented using sets, but that has changed AFAIK. The only thing you save by using a set would then be the size of a pointer for each entry (the pointer to the value).
To use only a part of a tuple for the hashcode, you'd have to subclass tuple and override the hashcode method:
class ProteinTuple(tuple):
def __new__(cls, m1, m2, m3):
return tuple.__new__(cls, (m1, m2, m3))
def __hash__(self):
return hash(self[0])
Keep in mind that you pay for the extra function call to __hash__ in this case, because otherwise it would be a C method.
I'd go for Constantin's suggestions and take out the id from the tuple and see how much that helps.

It's still murky, but it sounds like you have some several lists of [(id, description, type)...]
The id's are unique within a list and consistent between lists.
You want to create a UNION: a single list, where each id occurs once, with possibly multiple descriptions.
For some reason, you think a mapping might be too big. Do you have any evidence of this? Don't over-optimize without actual measurements.
This may be (if I'm guessing correctly) the standard "merge" operation from multiple sources.
source1.sort()
source2.sort()
result= []
while len(source1) > 0 or len(source2) > 0:
if len(source1) == 0:
result.append( source2.pop(0) )
elif len(source2) == 0:
result.append( source1.pop(0) )
elif source1[0][0] < source2[0][0]:
result.append( source1.pop(0) )
elif source2[0][0] < source1[0][0]:
result.append( source2.pop(0) )
else:
# keys are equal
result.append( source1.pop(0) )
# check for source2, to see if the description is different.
This assembles a union of two lists by sorting and merging. No mapping, no hash.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.