Hey. I have a function I want to memoize; however, it has too many possible values to keep in memory. Is there any convenient way to store the values in a text file and make it read from them? For example, something like storing a pre-computed list of primes up to 10^9 in a text file? I know it's slow to read from a text file, but there's no other option if the amount of data is really huge. Thanks!
For a list of primes up to 10**9, why do you need a hash? What would the KEYS be?! Sounds like a perfect opportunity for a simple, straightforward binary file! By the Prime Number Theorem, there are about 10**9/ln(10**9) such primes -- i.e. 50 million or a bit less. At 4 bytes per prime, that's only 200 MB or less -- perfect for an array.array("L") and its methods such as fromfile, etc. (see the docs). In many cases you could actually suck all of the 200 MB into memory, but, worst case, you can get a slice of those (e.g. via mmap and the fromstring method of array.array), do binary searches there (e.g. via bisect), etc., etc.
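For illustration, here is a minimal sketch of that binary-file approach -- the filename primes.bin and the assumption that the primes were written sorted ascending with array.tofile are mine, and note that array("L")'s itemsize can be 4 or 8 bytes depending on the platform:

from array import array
from bisect import bisect_left

primes = array("L")
with open("primes.bin", "rb") as f:
    f.seek(0, 2)                           # jump to the end to learn the file size
    count = f.tell() // primes.itemsize
    f.seek(0)
    primes.fromfile(f, count)              # slurp the whole table into memory

def is_prime(n):
    i = bisect_left(primes, n)             # binary search, O(log n)
    return i < len(primes) and primes[i] == n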
When you DO need a huge key-value store -- gigabytes, not a paltry 200 MB!-) -- I used to recommend shelve, but after unpleasant real-life experience with huge shelves (performance, reliability, etc.) I currently recommend a database engine instead -- sqlite is good and comes with Python, PostgreSQL is even better, non-relational ones such as CouchDB can be better still, and so forth.
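As a rough sketch of what sqlite-backed memoization could look like (the table layout and the pickle round-trip are my choices for the example, not a prescribed recipe):

import functools
import pickle
import sqlite3

def disk_memoize(path):
    # Cache fn(*args) results in an sqlite file that survives between runs.
    conn = sqlite3.connect(path)
    conn.execute('CREATE TABLE IF NOT EXISTS memo (k BLOB PRIMARY KEY, v BLOB)')
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args):
            k = pickle.dumps(args)         # serialized args act as the lookup key
            row = conn.execute('SELECT v FROM memo WHERE k = ?', (k,)).fetchone()
            if row is not None:
                return pickle.loads(row[0])
            v = fn(*args)
            conn.execute('INSERT OR REPLACE INTO memo VALUES (?, ?)',
                         (k, pickle.dumps(v)))
            conn.commit()
            return v
        return wrapper
    return decorator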
You can use the shelve module to store a dictionary-like structure in a file. From the Python documentation:
import shelve

d = shelve.open(filename)  # open -- file may get suffix added by low-level library

d[key] = data              # store data at key (overwrites old data if using an existing key)
data = d[key]              # retrieve a COPY of data at key (raise KeyError if no such key)
del d[key]                 # delete data stored at key (raises KeyError if no such key)
flag = key in d            # true if the key exists
klist = list(d.keys())     # a list of all existing keys (slow!)

# as d was opened WITHOUT writeback=True, beware:
d['xx'] = [0, 1, 2]        # this works as expected, but...
d['xx'].append(3)          # *this doesn't!* -- d['xx'] is STILL [0, 1, 2]!

# having opened d without writeback=True, you need to code carefully:
temp = d['xx']             # extracts the copy
temp.append(5)             # mutates the copy
d['xx'] = temp             # stores the copy right back, to persist it

# or, d = shelve.open(filename, writeback=True) would let you just code
# d['xx'].append(5) and have it work as expected, BUT it would also
# consume more memory and make the d.close() operation slower.

d.close()                  # close it
You could also just go with the ultimate brute force, and create a Python file with just a single statement in it:
seedprimes = [3,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,
79,83,89,97,101,103,107,109,113,127,131,137,139,149,151,157,163,167,173, ...
and then just import it. (Here is a file with the primes up to 1e5: http://python.pastebin.com/f177ec30).
from primes_up_to_1e9 import seedprimes
For Project Euler, I stored a precomputed list of primes up to 10**8 in a text file just by writing them in comma-separated format. It worked well for that size, but it doesn't scale well much beyond that.
If your "huge" is not really that huge, I would use something simple like mine; otherwise I would go with shelve, as the others have said.
Just naively sticking a hash table onto disk will result in about 5 orders of magnitude performance loss compared to an in-memory implementation (or at least 3 if you have an SSD). When dealing with hard disks you'll want to extract every bit of data locality and caching you can get.
The correct choice will depend on details of your use case. How much performance do you need? What kind of operations do you need on data-structure? Do you need to only check if the table contains a key or do you need to fetch a value based on the key? Can you precompute the table or do you need to be able to modify it on the fly? What kind of hit rate are you expecting? Can you filter out a significant amount of the operations using a bloom filter? Are the requests uniformly distributed or do you expect some kind of temporal locality? Can you predict the locality clusters ahead of time?
If you don't need ultimate performance or can parallelize and throw hardware at the problem check out some distributed key-value stores.
You can also go one step down the ladder and use pickle. shelve imports from pickle (link), so if you don't need the added functionality of shelve, this may spare you some clock cycles (although they really don't matter much to you, as you have chosen Python to do large-number storage).
Let's see where the bottleneck is. When you're going to read a file, the hard drive has to spin to the position it can read from; then it reads a big block and caches the results.
So you want some method that will guess exactly what position in the file you're going to read from, and then do it exactly once. I'm fairly sure standard DB modules will work for you, but you can do it yourself -- just open the file in binary mode for reading/writing and store your values as, say, 30-digit (= 100-bit = 13-byte) numbers.
Then use the standard file methods seek() and read().
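A bare-bones sketch of that do-it-yourself route -- the record width, filename, and big-endian byte order are my assumptions, int.from_bytes needs Python 3, and the file must hold fixed-width records in sorted order:

import os

RECORD = 13                                    # 13 bytes ~ 100 bits, as above

def read_record(f, i):
    f.seek(i * RECORD)                         # jump straight to record i
    return int.from_bytes(f.read(RECORD), "big")

def contains(path, x):
    nrecords = os.path.getsize(path) // RECORD
    with open(path, "rb") as f:
        lo, hi = 0, nrecords - 1               # classic binary search over the file
        while lo <= hi:
            mid = (lo + hi) // 2
            v = read_record(f, mid)
            if v == x:
                return True
            elif v < x:
                lo = mid + 1
            else:
                hi = mid - 1
    return False

Each probe touches a single position in the file, which is about as close as you can get to "read exactly once" per guess.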
Related
I'm working on a text-parsing algorithm (open-source side project). I'd be very appreciative of any advice.
I have a tab-delimited txt file which is sorted by the first column (sample dataset below). Duplicate entries exist within this column.
Ultimately, I would like to use a hash to point to all values which have the same key (first-column value). Should a new key come along, the contents of the hash are to be serialized, saved, etc., and then cleared for the new key to populate it. As a result, my goal is to have only one key present at a time. Therefore, if I have N unique keys, I wish to make N hashes, each pointing to their respective entries. Datasets, though, are GBs in size and in-memory heaps won't be much help, hence my reasoning to create a hash per key and process each individually.
SAMPLE DATASET
A ... 23.4421
A ... -23.442
A ... 76.2224
B ... 32.1232
B ... -23.001
C ... 652.123
...
So in the above dataset snippet, I wish to have a hash for 'A' (pointing to its 3x respective items). When 'B' is read, serialize the 'A' hash and clear the hash-contents. Repeat for 'B' until end of dataset.
My pseudocode is as follows:
declare hash
for item in the dataset:
    key, value = item[0], item[1:]
    if key not in hash and hash.size > 0:
        serialize hash     // a new key was read while a diff. key is present
        clear hash
    hash.put(key, value)   // append the value under its key
If any suggestions exist as to how to efficiently implement the above algorithm, I'd be very appreciative. Also, if my hash reasoning/approach is not efficient, or if improvements could be suggested, I'd be very thankful. My goal is ultimately to create in-memory hashes until a new key comes along.
Thank you,
p.
Use itertools.groupby, passing it the file as an iterator:
from itertools import groupby
from cStringIO import StringIO
sourcedata = StringIO("""\
A ... 23.4421
A ... -23.442
A ... 76.2224
B ... 32.1232
B ... -23.001
C ... 652.123""")
# or sourcedata = open("zillion_gigabyte_file.dat")
for key, lines in groupby(sourcedata, key=lambda s: s.split()[0]):
    accum = [float(s.split()[2]) for s in lines]
    print key, accum
groupby is very smart and efficient, keeping very little data in memory at a time, keeping things purely in the form of iterators until the last possible moment. What you describe -- doling out hashes and keeping only one in memory at a time and all that -- is already done for you by groupby.
You could open an anydbm (2.x) or dbm (3.x) for each key in your first column, named by the value of the column. This is pretty trivial - I'm not sure what the question is.
You could also use something like my cachedb module, to let it figure out whether something is "new" or not: http://stromberg.dnsalias.org/~strombrg/cachedb.html I've used it in two projects, both with good results.
Either way, you could probably make your values just lists of ASCII floats separated by newlines or nulls or something.
You don't make it explicit whether the sorted data you give is typical or whether the same key can be interspersed with other keys throughout the file, and that does make a fundamental difference to the algorithm. I deduce from your sample code that they will appear in arbitrary order.
Neither do you say what use you intend to make of the data so extracted. This can matter a lot: there are many different ways to store data, and the application can be a crucial feature in determining access times. So you may want to consider the use of various different storage types. Without knowing how you propose to use the resulting structure, the following suggestion may be inappropriate.
Since the data are floating-point numbers, you may want to consider using the shelve module to maintain simple lists of the floating-point numbers keyed against the alphameric first-column values. This has the advantage that all pickling and unpickling to/from external storage is handled automatically. If you need an increase in speed, consider using a more efficient pickling protocol (one of the optional arguments to shelve.open()).
# Transform the data:
# note it's actually more efficient to process line-by-line
# as you read it from a file - obviously it's best to try
# to avoid reading the whole data set into memory at once.
data = """\
A ... 23.4421
A ... -23.442
A ... 76.2224
B ... 32.1232
B ... -23.001
C ... 652.123"""
data = [(k, float(v))
        for (k, _, v) in
        [_.split() for _ in data.splitlines()]]

# create a shelve
import shelve
shelf = shelve.open("myshelf", "c")

# process data
for (k, v) in data:
    if k in shelf:
        # see note below for rationale
        tmp = shelf[k]
        tmp.append(v)
        shelf[k] = tmp
    else:
        shelf[k] = [v]

# verify results
for k in shelf.keys():
    print k, shelf[k]
You may be wondering why I didn't just use shelf[k].append(v) in the case where a key has already been seen. This is because it's only the operation of key assignment that triggers detection of the value change. You can read the shelve module docs for more detail, and to learn how to use the binary pickle format.
Note also that this program re-creates the shelf each time it is run due to the "c" argument to shelve.open().
I often keep track of duplicates with something like this:
processed = set()
for big_string in strings_generator:
    if big_string not in processed:
        processed.add(big_string)
        process(big_string)
I am dealing with massive amounts of data so don't want to maintain the processed set in memory. I have a version that uses sqlite to store the data on disk, but then this process runs much slower.
To cut down on memory use, what do you think of using hashes like this:
processed = set()
for big_string in strings_generator:
    key = hash(big_string)
    if key not in processed:
        processed.add(key)
        process(big_string)
The drawback is I could lose data through occasional hash collisions.
1 collision in 1 billion hashes would not be a problem for my use.
I tried the md5 hash but found generating the hashes became a bottleneck.
What would you suggest instead?
I'm going to assume you are hashing web pages. You have to hash at most 55 billion web pages (and that measure almost certainly overlooks some overlap).
You are willing to accept a less than one in a billion chance of collision, which means that if we look at a hash function whose number of collisions is close to what we would get if the hash were truly random[^1], we want a hash range of size (55*10^9)*10^9. That is log2((55*10^9)*10^9) = 66 bits.
[^1]: since the hash can be considered to be chosen at random for this purpose,
p(collision) = (occupied range)/(total range)
Since there is a speed issue but no real cryptographic concern, we can use a >66-bit non-cryptographic hash with the nice collision-distribution property outlined above.
It looks like we are looking for the 128-bit version of the Murmur3 hash. People have been reporting speed increases upwards of 12x comparing Murmur3_128 to MD5 on a 64-bit machine. You can use this library to do your speed tests. See also this related answer, which:
shows speed test results in the range of Python's str_hash, whose speed you have already deemed acceptable elsewhere -- though Python's hash is a 32-bit hash, leaving you only 2^32/(10^9) (that is, only 4) values stored with a less than one in a billion chance of collision.
spawned a library of Python bindings that you should be able to use directly.
Finally, I hope to have outlined the reasoning that would allow you to compare with other functions of varied size should you feel the need for it (e.g. if you up your collision tolerance, if the size of your indexed set is smaller than the whole Internet, etc.).
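For example (the mmh3 package is one set of Murmur3 bindings for Python; the module name and exact API are assumptions on my part):

import mmh3                                    # assumed third-party package

processed = set()
for big_string in strings_generator:
    key = mmh3.hash128(big_string)             # 128-bit integer, far cheaper than md5
    if key not in processed:
        processed.add(key)
        process(big_string)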
You have to decide which is more important: space or time.
If time, then you need to create unique representations of your large_item which take as little space as possible (probably some str value), are easy (i.e. quick) to calculate, and will not have collisions, and store them in a set.
If space, find the quickest disk-backed solution you can and store the smallest possible unique value that will identify a large_item.
So either way, you want small unique identifiers -- depending on the nature of large_item this may be a big win, or not possible.
Update
they are strings of html content
Perhaps a hybrid solution then: Keep a set in memory of the normal Python hash, while also keeping the actual html content on disk, keyed by that hash; when you check to see if the current large_item is in the set and get a positive, double-check with the disk-backed solution to see if it's a real hit or not, then skip or process as appropriate. Something like this:
import dbf
on_disk = dbf.Table('/tmp/processed_items', 'hash N(17,0); value M')
index = on_disk.create_index(lambda rec: rec.hash)
fast_check = set()
def slow_check(hashed, item):
    matches = on_disk.search((hashed,))
    for record in matches:
        if item == record.value:
            return True
    return False

for large_item in many_items:
    hashed = hash(large_item)  # only calculate once
    if hashed not in fast_check or not slow_check(hashed, large_item):
        on_disk.append((hashed, large_item))
        fast_check.add(hashed)
        process(large_item)
FYI: dbf is a module I wrote which you can find on PyPI
If many_items already resides in memory, you are not creating another copy of the large_item; you are just storing a reference to it in the processed set.
If many_items is a file or some other generator, you'll have to look at other alternatives.
E.g., if many_items is a file, perhaps you can store a pointer to the item's position in the file instead of the actual item.
As you have already seen, there are a few options, but unfortunately none of them can fully address the situation, partly because of the following:
Memory constraint: storing the entire object in memory is not feasible.
There is no perfect hash function, and for a huge data set the chance of collision is real.
Better hash functions (md5) are slower.
Use of a database like sqlite would actually make things slower.
As I read this following excerpt
I have a version that uses sqlite to store the data on disk, but then this process runs much slower.
I feel that if you work on this, it might help you marginally. Here is how it should be:
Use tmpfs to create a ramdisk. tmpfs has several advantages over other implementations because it supports swapping of less-used space out to swap space.
Store the sqlite database on the ramdisk.
Change the size of the ramdisk and profile your code to check your performance.
I suppose you already have working code to save your data in sqlite. You only need to define a tmpfs and use its path to store your database.
Caveat: this is a Linux-only solution.
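For concreteness, here is roughly what steps 1-3 might look like (the mount point, size, and schema are all invented for the example; the mount itself is a one-off shell step needing root, e.g. mount -t tmpfs -o size=2g tmpfs /mnt/ramdisk):

import sqlite3

conn = sqlite3.connect('/mnt/ramdisk/processed.db')   # DB file lives on the ramdisk
conn.execute('CREATE TABLE IF NOT EXISTS processed (key TEXT PRIMARY KEY)')

def seen(big_string):
    # Returns True if big_string was processed before; records it otherwise.
    cur = conn.execute('SELECT 1 FROM processed WHERE key = ?', (big_string,))
    if cur.fetchone():
        return True
    conn.execute('INSERT INTO processed VALUES (?)', (big_string,))
    conn.commit()
    return False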
A Bloom filter? http://en.wikipedia.org/wiki/Bloom_filter
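To give a feel for what that involves, a toy sketch (the sizes and the double-hashing trick are textbook defaults, not tuned for this workload; false positives mean some strings would be wrongly skipped):

class BloomFilter(object):
    def __init__(self, nbits=2**27, nhashes=7):
        self.nbits, self.nhashes = nbits, nhashes
        self.bits = bytearray(nbits // 8)          # 2**27 bits = 16 MB

    def _positions(self, s):
        h1, h2 = hash(s), hash(s + '#')            # cheap second hash for the sketch
        return [(h1 + i * h2) % self.nbits for i in range(self.nhashes)]

    def add(self, s):
        for p in self._positions(s):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, s):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(s))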
Well, you can always decorate large_item with a processed flag. Or something similar.
You can give the str type's __hash__ function a try.
In [1]: hash('http://stackoverflow.com')
Out[1]: -5768830964305142685
It's definitely not a cryptographic hash function, but with a little luck you won't have too many collisions. It works as described here: http://effbot.org/zone/python-hash.htm.
I suggest you profile standard Python hash functions and choose the fastest: they are all "safe" against collisions enough for your application.
Here are some benchmarks for hash, md5 and sha1:
In [37]: very_long_string = 'x' * 1000000
In [39]: %timeit hash(very_long_string)
10000000 loops, best of 3: 86 ns per loop
In [40]: from hashlib import md5, sha1
In [42]: %timeit md5(very_long_string).hexdigest()
100 loops, best of 3: 2.01 ms per loop
In [43]: %timeit sha1(very_long_string).hexdigest()
100 loops, best of 3: 2.54 ms per loop
md5 and sha1 are comparable in speed. hash is about 20,000 times faster for this string, and it does not seem to depend much on the size of the string itself (most likely because CPython caches a string's hash after the first computation).
How does your sqlite version work? If you insert all your strings into a database table and then run the query "select distinct big_string from table_name", the database should optimize it for you.
Another option for you would be to use Hadoop.
Another option could be to split the strings into partitions such that each partition is small enough to fit in memory. Then you only need to check for duplicates within each partition. The formula you use to decide the partition will choose the same partition for each duplicate. The easiest way is to just look at the first few characters of the string, e.g.:
from collections import defaultdict

d = defaultdict(int)
for big_string in strings_generator:
    d[big_string[:4]] += 1
print d
Now you can decide on your partitions, go through the generator again, and write each big_string to a file that has the start of the big_string in its filename. Then you can just use your original method on each file, looping through all of the files; one hedged variation is sketched below.
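Here, the partition count, the filenames, and pickle as the record format are my choices, and hash(big_string) % NPARTS is just another formula that sends every duplicate to the same partition (it sidesteps prefixes that would make bad filenames):

import pickle

NPARTS = 64
parts = [open('part_%02d.pkl' % i, 'wb') for i in range(NPARTS)]
for big_string in strings_generator:
    # The same string always lands in the same partition file.
    pickle.dump(big_string, parts[hash(big_string) % NPARTS])
for f in parts:
    f.close()

Each part_*.pkl can then be deduplicated on its own with the original in-memory set, reading records back with pickle.load until EOFError.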
This can be achieved much more easily by performing simpler checks first, then investigating those cases with more elaborate checks. The example below contains extracts of your code, but it performs the checks on much smaller sets of data. It does this by first matching on a simple case that is cheap to check. If you find that (filesize, checksum) pairs are not discriminating enough, you can easily swap in another cheap yet more rigorous check.
# Need to define the following functions
def GetFileSize(filename):
    pass

def GenerateChecksum(filename):
    pass

def LoadBigString(filename):
    pass

# Returns a list of duplicate big_strings.
def GetDuplicates(filename_list):
    duplicates = list()
    # Maps a quick check (a cheap hash) to the list of filenames sharing it.
    filename_lists_by_quick_checks = dict()
    for filename in filename_list:
        quickcheck = GetQuickCheck(filename)
        if quickcheck not in filename_lists_by_quick_checks:
            filename_lists_by_quick_checks[quickcheck] = list()
        filename_lists_by_quick_checks[quickcheck].append(filename)
    for quickcheck, candidates in filename_lists_by_quick_checks.iteritems():
        big_strings = GetBigStrings(candidates)
        duplicates.extend(GetBigstringDuplicates(big_strings))
    return duplicates

# Returns the duplicates within one candidate group.
def GetBigstringDuplicates(big_strings):
    processed = set()
    duplicates = list()
    for big_string in big_strings:
        if big_string not in processed:
            processed.add(big_string)
        else:
            duplicates.append(big_string)
    return duplicates

# Returns a tuple containing (filesize, checksum).
def GetQuickCheck(filename):
    return (GetFileSize(filename), GenerateChecksum(filename))

# Returns a list of big_strings from a list of filenames.
def GetBigStrings(file_list):
    big_strings = list()
    for filename in file_list:
        big_strings.append(LoadBigString(filename))
    return big_strings
I got many, many files to be uploaded to the server, and I just want a way to avoid duplicates.
Thus, generating a unique and small key value from a big string seemed something that a checksum was intended to do, and hashing seemed like the evolution of that.
So I was going to use hash md5 to do this. But then I read somewhere that "MD5 are not meant to be unique keys" and I thought that's really weird.
What's the right way of doing this?
edit: by the way, I took two sources to get to the following, which is how I'm currently doing it and it's working just fine, with Python 2.5:
import hashlib

def md5_from_file(fileName, block_size=2**14):
    md5 = hashlib.md5()
    f = open(fileName, 'rb')   # read in binary mode so the digest is exact
    while True:
        data = f.read(block_size)
        if not data:
            break
        md5.update(data)
    f.close()
    return md5.hexdigest()
Sticking with MD5 is a good idea. Just to make sure, I'd append the file length or number of chunks to your file-hash table.
Yes, there is the possibility that you run into two files that have the same MD5 hash, but that's quite unlikely (if your files are of decent size). Thus, adding the number of chunks to your hash may help you reduce that risk, since now you have to find two files of the same size with the same MD5.
# This is the algorithm you described, but it also returns the number of chunks.
new_file_hash, nchunks = hash_for_file(new_file)
store_file(new_file, nchunks, new_file_hash)

def store_file(file, nchunks, hash):
    """Tells you whether there is another file with the same contents
    already, by making a table lookup."""
    # This can be a DB lookup or some way to obtain your hash map
    big_table = ObtainTable()
    # A two-level lookup table might help performance;
    # that will vary with the number of entries and the nature of big_table
    if nchunks in big_table:
        if hash in big_table[nchunks]:
            raise DuplicateFileException(
                'File is dup with %s' % big_table[nchunks][hash])
    else:
        big_table[nchunks] = {}
    big_table[nchunks].update({
        hash: file.filename
    })
    file.save()  # or something
To reduce that possibility, switch to SHA1 and use the same method. Or even use both (concatenating them) if performance is not an issue.
Of course, keep in mind that this will only work with duplicate files at binary level, not images, sounds, video that are "the same" but have different signatures.
The issue with hashing is that it's generating a "small" identifier from a "large" dataset. It's like a lossy compression. While you can't guarantee uniqueness, you can use it to substantially limit the number of other items you need to compare against.
Consider that MD5 yields a 128-bit value (I think that's what it is, although the exact number of bits is irrelevant). If your input data set has 129 bits and you actually use them all, each MD5 value will appear on average twice. For longer datasets (e.g. "all text files of exactly 1024 printable characters") you're still going to run into collisions once you get enough inputs. Contrary to what another answer said, it is a mathematical certainty that you will get collisions.
See http://en.wikipedia.org/wiki/Birthday_Paradox
Granted, you have around a 1% chance of collisions with a 128 bit hash at 2.6*10^18 entries, but it's better to handle the case that you do get collisions than to hope that you never will.
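That figure is easy to sanity-check with the usual birthday approximation p ≈ n^2/(2N), solved for n:

N = 2.0 ** 128              # size of the MD5 output space
p = 0.01                    # 1% collision probability
print((2 * N * p) ** 0.5)   # ~2.6e18, matching the figure above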
The issue with MD5 is that it's broken. For most common uses there's little problem, and people still use both MD5 and SHA1, but I think that if you need a hashing function then you need a strong hashing function. To the best of my knowledge there is still no standard substitute for either of these. There are a number of algorithms that are "supposed" to be strong, but we have the most experience with SHA1 and MD5. That is, we think we know when these two break, whereas we don't really know as much about when the newer algorithms break.
Bottom line: think about the risks. If you wish to walk the extra mile, then you might add extra checks when you find a hash duplicate, at the price of a performance penalty.
I'm storing millions, possibly billions, of 4-byte values in a hashtable, and I don't want to store any of the keys. I expect that only the hashes of the keys and the values will have to be stored. This has to be fast and all kept in RAM. The entries would still be looked up with the key, unlike with a set().
What is an implementation of this for Python? Is there a name for this?
Yes, collisions are allowed and can be ignored.
(I can make an exception for collisions, the key can be stored for those. Alternatively, collisions can just overwrite the previously stored value.)
Bloomier filters - space-efficient associative array
From the Wikipedia article:
Chazelle et al. (2004) designed a generalization of Bloom filters that could associate a value with each element that had been inserted, implementing an associative array. Like Bloom filters, these structures achieve a small space overhead by accepting a small probability of false positives. In the case of "Bloomier filters", a false positive is defined as returning a result when the key is not in the map. The map will never return the wrong value for a key that is in the map.
How about using an ordinary dictionary and instead of doing:
d[x]=y
use:
d[hash(x)]=y
To look up:
d[hash(foo)]
Of course, if there is a hash collision, you may get the wrong value back.
It's the good old space vs. runtime tradeoff: you can have constant time with linear space usage for the keys in a hashtable. Or you can store the key implicitly and use log n time by using a binary tree. The (binary) hash of a value gives you the path in the tree where it will be stored.
Build your own b-tree in RAM.
Memory use:
(4 bytes) comparison hash value
(4 bytes) index of next leaf if hash <= comparison, OR, if negative, index of value
(4 bytes) index of next leaf if hash > comparison, OR, if negative, index of value
That's 12 bytes per b-tree node, with more overhead for the values (see below).
How do you structure this in Python? Aren't there "native arrays" of 32-bit integers supported with almost no extra memory overhead (array.array, presumably)? Anyway, those.
A separate ordered array of subarrays, each containing one or more values. The "indexes of value" above are indexes into this big array, allowing retrieval of all values matching the hash.
This assumes a 32-bit hash. You will need more bytes per b-tree node if you have more than 2^31-1 entries or a larger hash.
BUT, a spanner in the works perhaps: note that if you are not storing the key values, you will not be able to verify that a hash value looked up corresponds only to your key, unless through some algorithmic or organisational mechanism you have guaranteed that no two keys will have the same hash. Quite a serious issue here. Have you considered it? :)
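For what it's worth, the lookup over that 12-byte-per-node layout might be sketched like this (the "native arrays" are taken to be array.array, mentioned earlier in the thread; insertion and the values table are left out):

from array import array

node_hash = array('l')     # comparison hash value per node
node_le   = array('l')     # next node if hash <= comparison; negative means value index
node_gt   = array('l')     # next node if hash >  comparison; negative means value index

def lookup(h):
    i = 0                                  # start at the root node
    while True:
        nxt = node_le[i] if h <= node_hash[i] else node_gt[i]
        if nxt < 0:
            return -nxt - 1                # index into the separate values array
        i = nxt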
Although python dictionaries are very efficient, I think that if you're going to store billions of items, you may want to create your own C extension with data structures, optimized for the way you are actually using it (sequential access? completely random? etc).
In order to create a C extension, you may want to use SWIG, or something like Pyrex (which I've never used).
A hash table has to store keys, unless you provide a hash function that gives absolutely no collisions, which is nearly impossible.
If your keys are string-like, however, there is a very space-efficient data structure: the directed acyclic word graph (DAWG). I don't know of any Python implementation, though.
It's not what you asked for, but why not consider Tokyo Cabinet or BerkeleyDB for this job? It won't be in memory, but you are trading performance for greater storage capacity. You could still keep your list in memory and use the database only to check existence.
Would you please tell us more about the keys? I'm wondering if there is any regularity in the keys that we could exploit.
If the keys are strings in a small alphabet (example: strings of digits, like phone numbers) you could use a trie data structure:
http://en.wikipedia.org/wiki/Trie
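A minimal nested-dict sketch of the idea (the '$' end-of-key marker is an arbitrary convention here, and real space savings would need a more compact node representation than Python dicts):

def trie_insert(root, key):
    node = root
    for ch in key:
        node = node.setdefault(ch, {})     # one dict level per character
    node['$'] = True                       # mark that a key ends here

def trie_contains(root, key):
    node = root
    for ch in key:
        if ch not in node:
            return False
        node = node[ch]
    return '$' in node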
If you're actually storing millions of unique values, why not use a dictionary?
Store: d[hash(key)/32] |= 2**(hash(key)%32)
Check: (d[hash(key)/32] & 2**(hash(key)%32))
If you have billions of entries, use a numpy array of size (2**32)/32, instead. (Because, after all, you only have 4 billion possible values to store, anyway).
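Spelled out as a runnable sketch (defaultdict avoids KeyErrors for empty buckets, and // keeps the division integral on Python 3 as well; note the check uses bitwise AND, since OR would always be non-zero):

from collections import defaultdict

d = defaultdict(int)

def add(key):
    h = hash(key)
    d[h // 32] |= 2 ** (h % 32)            # set one bit in the bucket

def contains(key):
    h = hash(key)
    return bool(d[h // 32] & 2 ** (h % 32))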
Why not a dictionary + hashlib?
>>> import hashlib
>>> hashtable = {}
>>> def myHash(obj):
...     return hashlib.sha224(obj).hexdigest()
>>> hashtable[myHash("foo")] = 'bar'
>>> hashtable
{'0808f64e60d58979fcb676c96ec938270dea42445aeefcd3a4e6f8db': 'bar'}
I have a list of data in the following form:
[(id_1, description, id_type), (id_2, description, id_type), ..., (id_n, description, id_type)]
The data are loaded from files that belong to the same group. In each group there could be multiples of the same id, each coming from different files. I don't care about the duplicates, so I thought that a nice way to store all of this would be to throw it into a Set type. But there's a problem.
Sometimes for the same id the descriptions can vary slightly, as follows:
IPI00110753
Tubulin alpha-1A chain
Tubulin alpha-1 chain
Alpha-tubulin 1
Alpha-tubulin isotype M-alpha-1
(Note that this example is taken from the uniprot protein database.)
I don't care if the descriptions vary. I cannot throw them away because there is a chance that the protein database I am using will not contain a listing for a certain identifier. If this happens I will want to be able to display the human readable description to the biologists so they know roughly what protein they are looking at.
I am currently solving this problem by using a dictionary type. However I don't really like this solution because it uses a lot of memory (I have a lot of these ID's). This is only an intermediary listing of them. There is some additional processing the ID's go through before they are placed in the database so I would like to keep my data-structure smaller.
I have two questions really. First, will I get a smaller memory footprint using the Set type (over the dictionary type) for this, or should I use a sorted list where I check every time I insert into the list to see if the ID exists, or is there a third solution that I haven't thought of? Second, if the Set type is the better answer how do I key it to look at just the first element of the tuple instead of the whole thing?
Thank you for reading my question,
Tim
Update
Based on some of the comments I received, let me clarify a little. Most of what I do with the data structure is insert into it. I only read it twice: once to annotate it with additional information,* and once to insert it into the database. However, down the line there may be additional annotation that is done before I insert into the database. Unfortunately I don't know if that will happen at this time.
Right now I am looking into storing this data in a structure that is not based on a hash table (i.e. a dictionary). I would like the new structure to be fairly quick on insertion, but reading it can be linear since I only really do it twice. I am trying to move away from the hash table to save space. Is there a better structure, or is a hash table about as good as it gets?
*The information is a list of Swiss-Prot protein identifiers that I get by querying uniprot.
Sets don't have keys. The element is the key.
If you think you want keys, you have a mapping. More-or-less by definition.
Sequential list lookup can be slow, even using a binary search. Mappings use hashes and are fast.
Are you talking about a dictionary like this?
{ 'id1': [ ('description1a', 'type1'), ('description1b','type1') ],
'id2': [ ('description2', 'type2') ],
...
}
This sure seems minimal. ID's are only represented once.
Perhaps you have something like this?
{ 'id1': ( ('description1a', 'description1b' ), 'type1' ),
'id2': ( ('description2',), 'type2' ),
...
}
I'm not sure you can find anything more compact unless you resort to using the struct module.
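For scale, a struct sketch (the field widths are invented, and strings longer than their field are silently truncated, so pick widths with care):

import struct

fmt = struct.Struct('12s60sB')             # id, description, numeric type code
packed = fmt.pack(b'IPI00110753', b'Tubulin alpha-1A chain', 1)
print(len(packed))                         # 73 bytes per record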
I'm assuming the problem you are trying to solve by cutting down on memory use is the address-space limit of your process. Additionally, you are searching for a data structure that allows fast insertion and reasonable sequential read-out.
Use fewer structures other than strings (str)
The question you ask is how to structure your data in one process to use less memory. The canonical answer to this (as long as you still need associative lookups) is: use as few structures other than Python strings (str, not unicode) as possible. A Python hash (dictionary) stores the references to your strings fairly efficiently (it is not a b-tree implementation).
However I think that you will not get very far with that approach, since what you face are huge datasets that might eventually just exceed the process address space and the physical memory of the machine you're working with altogether.
Alternative Solution
I would propose a different solution that does not involve changing your data structure to something that is harder to insert or interpret.
Split your information up across multiple processes, each holding whatever data structure is convenient for it.
Implement inter-process communication with sockets, so that processes might reside on other machines altogether.
Try to divide your data so as to minimize inter-process communication (I/O is glacially slow compared to CPU cycles).
The advantages of the approach I outline are:
You get to use two or more cores on a machine fully for performance.
You are not limited by the address space of one process, or even the physical memory of one machine.
There are numerous packages and approaches to distributed processing, some of which are:
linda
processing
If you're doing an n-way merge with duplicate removal, the following may be what you're looking for.
This generator will merge any number of sources. Each source must be a sequence.
The key must be in position 0. It yields the merged sequence one item at a time.
def merge( *sources ):
    keyPos = 0
    for s in sources:
        s.sort()
    while any( [len(s) > 0 for s in sources] ):
        topEnum = enumerate( [s[0][keyPos] if len(s) > 0 else None for s in sources] )
        top = [ t for t in topEnum if t[1] is not None ]
        top.sort( key=lambda a: a[1] )
        src, key = top[0]
        #print src, key
        yield sources[ src ].pop(0)
This generator removes duplicates from a sequence.
def unique( sequence ):
    keyPos = 0
    seqIter = iter(sequence)
    curr = seqIter.next()
    for next in seqIter:
        if next[keyPos] == curr[keyPos]:
            # might want to create a sub-list of matches
            continue
        yield curr
        curr = next
    yield curr
Here's a script which uses these functions to produce a resulting sequence which is the union of all the sources with duplicates removed.
for u in unique( merge( source1, source2, source3, ... ) ):
    print u
The complete set of data in each sequence must exist in memory once because we're sorting in memory. However, the resulting sequence does not actually exist in memory. Indeed, it works by consuming the other sequences.
How about using {id: (description, id_type)} dictionary? Or {(id, id_type): description} dictionary if (id,id_type) is the key.
Sets in Python are implemented using hash tables. In earlier versions, they were actually implemented using dictionaries, but that has changed AFAIK. The only thing you save by using a set would then be the size of a pointer for each entry (the pointer to the value).
To use only a part of a tuple for the hashcode, you'd have to subclass tuple and override the hashcode method:
class ProteinTuple(tuple):
    def __new__(cls, m1, m2, m3):
        return tuple.__new__(cls, (m1, m2, m3))

    def __hash__(self):
        return hash(self[0])
Keep in mind that you pay for the extra function call to __hash__ in this case, because otherwise it would be a C method.
I'd go for Constantin's suggestions and take out the id from the tuple and see how much that helps.
It's still murky, but it sounds like you have several lists of [(id, description, type), ...].
The id's are unique within a list and consistent between lists.
You want to create a UNION: a single list, where each id occurs once, with possibly multiple descriptions.
For some reason, you think a mapping might be too big. Do you have any evidence of this? Don't over-optimize without actual measurements.
This may be (if I'm guessing correctly) the standard "merge" operation from multiple sources.
source1.sort()
source2.sort()
result = []
while len(source1) > 0 or len(source2) > 0:
    if len(source1) == 0:
        result.append(source2.pop(0))
    elif len(source2) == 0:
        result.append(source1.pop(0))
    elif source1[0][0] < source2[0][0]:
        result.append(source1.pop(0))
    elif source2[0][0] < source1[0][0]:
        result.append(source2.pop(0))
    else:
        # keys are equal: keep source1's entry and drop source2's,
        # after checking whether the description is different.
        result.append(source1.pop(0))
        source2.pop(0)
This assembles a union of two lists by sorting and merging. No mapping, no hash.