I'm working on a text-parsing algorithm (an open-source side project). I'd be grateful for any advice.
I have a tab-delimited txt file which is sorted by the first column (sample dataset below). Duplicate entries exist within this column.
Ultimately, I would like to use a hash to collect all values which share the same key (the first-column value). When a new key comes along, the contents of the hash should be serialized, saved, etc., and then cleared so the new key can populate it. My goal is therefore to have only one key present at a time: if there are N unique keys, I want to build N hashes in turn, each holding the entries for its key. The datasets are GBs in size, so keeping everything in memory won't help, hence my reasoning to create one hash per key and process each individually.
SAMPLE DATASET
A ... 23.4421
A ... -23.442
A ... 76.2224
B ... 32.1232
B ... -23.001
C ... 652.123
...
So in the above dataset snippet, I wish to have a hash for 'A' (pointing to its 3 respective items). When 'B' is read, serialize the 'A' hash and clear its contents. Repeat for 'B' until the end of the dataset.
My pseudocode is as follows:
declare hash
for item in the dataset:
    key, value = item[0], item[1:]
    if key not in hash:
        if hash is not empty:          // a new key has arrived
            serialize and clear hash   // flush the finished group
        hash.put(key, [value])         // start collecting the new key
    else:
        hash.append(key, value)        // key already present, so append the value
If any suggestions exist as to how to efficiently implement the above algorithm, I'd be very appreciative. Also, if my hash-based reasoning/approach is not efficient or if improvements could be brought up, I'd be very thankful. My goal is to ultimately create in-memory hashes until a new key comes along.
Thank you,
p.
Use itertools.groupby, passing it the file as an iterator:
from itertools import groupby
from cStringIO import StringIO
sourcedata = StringIO("""\
A ... 23.4421
A ... -23.442
A ... 76.2224
B ... 32.1232
B ... -23.001
C ... 652.123""")
# or sourcedata = open("zillion_gigabyte_file.dat")
for key,lines in groupby(sourcedata, key=lambda s:s.split()[0]):
    accum = [float(s.split()[2]) for s in lines]
    print key, accum
groupby is very smart and efficient, keeping very little data in memory at a time and keeping things purely in the form of iterators until the last possible moment. What you describe (building hashes and keeping only one in memory at a time) is already done for you by groupby.
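If you also want the serialize-as-you-go behaviour described in the question, each group can be written out as soon as it is complete. A minimal sketch; the pickle format and file names are my choice, nothing in the question requires them:

import pickle
from itertools import groupby

with open("zillion_gigabyte_file.dat") as src, open("groups.pkl", "wb") as dst:
    for key, lines in groupby(src, key=lambda s: s.split()[0]):
        accum = [float(s.split()[2]) for s in lines]
        pickle.dump((key, accum), dst)   # one pickled (key, values) record per group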
You could open an anydbm (2.x) or dbm (3.x) for each key in your first column, named by the value of the column. This is pretty trivial - I'm not sure what the question is.
You could also use something like my cachedb module, to let it figure out whether something is "new" or not: http://stromberg.dnsalias.org/~strombrg/cachedb.html I've used it in two projects, both with good results.
Either way, you could probably make your values just lists of ASCII floats separated by newlines or nulls or something.
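For what it's worth, a rough sketch of that dbm idea, with each value stored as a newline-separated list of ASCII floats (2.x module names; untested against real data):

import anydbm   # on Python 3: import dbm

db = anydbm.open("groups.db", "c")
for line in open("zillion_gigabyte_file.dat"):
    fields = line.split()
    key, value = fields[0], fields[-1]
    try:
        db[key] = db[key] + "\n" + value   # append to this key's list
    except KeyError:
        db[key] = value                    # first value seen for this key
db.close()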
You don't make it explicit whether the sorted data you give is typical or whether the same key can be interspersed with other keys throughout the file, and that does make a fundamental difference to the algorithm. I deduce from your sample code that they will appear in arbitrary order.
Neither do you say what use you intend to make of the data so extracted. This can matter a lot: there are many different ways to store data, and the application can be a crucial factor in determining access times. So you may want to consider the use of various different storage types. Without knowing how you propose to use the resulting structure, the following suggestion may be inappropriate.
Since the data are floating-point numbers, you may want to consider using the shelve module to maintain simple lists of the floating-point numbers keyed against the alphameric ids. This has the advantage that all pickling and unpickling to/from external storage is handled automatically. If you need an increase in speed, consider using a more efficient pickling protocol (the optional protocol argument to shelve.open()).
# Transform the data:
# note it's actually more efficient to process line-by-line
# as you read it from a file - obviously it's best to try
# to avoid reading the whole data set into memory at once.
data = """\
A ... 23.4421
A ... -23.442
A ... 76.2224
B ... 32.1232
B ... -23.001
C ... 652.123"""
data = [(k, float(v))
        for (k, _, v) in
        [_.split() for _ in data.splitlines()]]
# create a shelve
import shelve
shelf = shelve.open("myshelf", "c")
# process data
for (k, v) in data:
    if k in shelf:
        # see note below for rationale
        tmp = shelf[k]
        tmp.append(v)
        shelf[k] = tmp
    else:
        shelf[k] = [v]
# verify results
for k in shelf.keys():
    print k, shelf[k]
You may be wondering why I didn't just use shelf[k].append(v) in the case where a key has already been seen. This is because it's only the operation of key assignment that triggers detection of the value change. You can read the shelve module docs for more detail, and to learn how to use the binary pickle format.
Note also that the "c" argument to shelve.open() creates the shelf if it doesn't exist and re-uses it (without emptying it) on subsequent runs; pass "n" instead if you want the shelf re-created from scratch each time the program runs.
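As an aside, switching the shelf to the more efficient binary pickle protocol mentioned above is a one-line change:

shelf = shelve.open("myshelf", "c", protocol=2)   # binary pickle: faster and smaller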
Related
I currently store about 50k hashes in my Redis table, every single one has 5 key/value pairs. Once a day I run a batch job which updates hash values, including setting some key values to the value of the other key in a hash.
Here is my Python code, which iterates through the keys and sets old_code to the new_code value if a new_code value exists for a given hash:
pipe = r.pipeline()
for availability in availabilities:
    pipe.hget(availability["EventId"], "new_code")
for availability, old_code in zip(availabilities, pipe.execute()):
    if old_code:
        availability["old_code"] = old_code.decode("utf-8")
for availability in availabilities:
    if "old_code" in availability:
        pipe.hset(
            availability["EventId"], "old_code", availability["old_code"])
    pipe.hset(availability["EventId"], "new_code", availability["MsgCode"])
pipe.execute()
It's a bit weird to me that I have to iterate through keys twice to achieve the same result, is there a better way to do this?
Another thing I'm trying to figure out is how to get all hash values with the best performance. Here is how I currently do it:
d = []
pipe = r.pipeline()
keys = r.keys('*')
for key in keys:
    pipe.hgetall(key)
for val, key in zip(pipe.execute(), keys):
    e = {"event_id": key}
    e.update(val)
    if "old_key" not in e:
        e["old_key"] = None
    d.append(e)
So basically I do KEYS * and then iterate with HGETALL across all keys to get the values. This is way too slow, especially the iteration. Is there a quicker way to do it?
How about an upside-down change: transpose the way you store the data.
Instead of having 50k hashes, each with 5 values, have 5 hashes, each with 50k values.
For example, your hash currently depends on EventId, and you store new_code, old_code and other fields inside that hash.
Now, instead, keep one hash map per field: the new_code hash contains each EventId as a member with its code as the value. So new_code alone is a hash map containing 50k member/value pairs.
Looping through 5 hashes instead of 50k will be relatively quicker.
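For instance, with redis-py the transposed layout might look like this (the values are illustrative):

import redis
r = redis.StrictRedis()

event_id, code, previous_code = "12345", "NEW", "OLD"   # illustrative values

# before: one hash per event, i.e. r.hset(event_id, "new_code", code)
# after: one hash per field, keyed by event id
r.hset("new_code", event_id, code)
r.hset("old_code", event_id, previous_code)

# reading every event's new_code is now a single call
all_new_codes = r.hgetall("new_code")   # {event_id: code, ...}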
I have done a little experiment, and the numbers are as follows:
50k hashes * 5 elements
Memory: ~12.5 MB
Time to loop through all elements: ~1.8 seconds
5 hashes * 50k elements
Memory: ~35 MB
Time to loop through all elements: ~0.3 seconds
I have tested with simple strings like KEY_i and VALUE_i (where i is an incrementing counter), so memory may be higher in your case. Also, I have only walked through the data without doing any manipulation, so time will also vary in your case.
As you can see, this change can give you roughly a 6x performance boost (1.8 s down to 0.3 s), at the cost of almost 3 times the memory (12.5 MB up to 35 MB).
Redis keeps hashes in a compact, memory-efficient encoding while they stay under a size limit (hash-max-ziplist-entries, 512 by default). Since we are storing far more than that (50k entries), we see this spike in memory.
Basically it's a trade-off, and it's up to you to choose the one that best suits your application.
For your 1st question:
You were fetching new_code from each of the 50k hashes; now everything lives in a single hash, so it's just a single call (e.g. HGETALL or HMGET).
Then, instead of updating old_code and new_code one by one, you can set all of them with a single HMSET call per hash.
Hope this helps.
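An untested sketch of that batched update, reusing the variable names from the question:

# one call fetches every current new_code
event_ids = [a["EventId"] for a in availabilities]
current_codes = r.hmget("new_code", event_ids)

old_mapping = {}
new_mapping = {}
for a, code in zip(availabilities, current_codes):
    if code is not None:
        old_mapping[a["EventId"]] = code    # the current new_code becomes old_code
    new_mapping[a["EventId"]] = a["MsgCode"]

# one HMSET per field-hash writes everything back
pipe = r.pipeline()
if old_mapping:
    pipe.hmset("old_code", old_mapping)
pipe.hmset("new_code", new_mapping)
pipe.execute()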
For your first problem, using a Lua script will definitely improve performance. This is untested, but something like:
update_hash = r.register_script("""
    local key = KEYS[1]
    local new_code = ARGV[1]
    local old_code = redis.call("HGET", key, "new_code")
    if old_code then
        redis.call("HMSET", key, "old_code", old_code, "new_code", new_code)
    else
        redis.call("HSET", key, "new_code", new_code)
    end
""")
# You can use transaction=False here if you don't need all the
# hashes to be updated together as one atomic unit.
pipe = r.pipeline()
for availability in availabilities:
    keys = [availability["EventId"]]
    args = [availability["MsgCode"]]
    update_hash(keys=keys, args=args, client=pipe)
pipe.execute()
For your second problem you could again make it faster by writing a short Lua script. Instead of getting all the keys and returning them to the client, your script would get the keys and the data associated with them and return it in one call.
(Note, though, that calling keys() is inherently slow wherever you do it. And note that in either approach you're essentially pulling your entire Redis dataset into local memory, which might or might not become a problem.)
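For example, an untested sketch of such a script; it runs KEYS and all the HGETALLs server-side and returns everything in one round trip (the KEYS caveat above still applies):

fetch_all = r.register_script("""
    local result = {}
    local keys = redis.call("KEYS", "*")
    for i, key in ipairs(keys) do
        result[i] = {key, redis.call("HGETALL", key)}
    end
    return result
""")

rows = fetch_all(keys=[], args=[])
# each row is [key, [field1, value1, field2, value2, ...]]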
There is no command like that; Redis hash commands work within a single hash, so HMGET works inside one hash and gives you the requested fields of that hash. There is no way to access fields across multiple hashes at once.
There are 2 options:
Using pipeline
Using LUA
However, both of these are workarounds, not a solution to your problem. To see how to do this, check my answer to this question: Is there a command in Redis for HASH data structure similar to MGET?
I have a dict that has unix epoch timestamps for keys, like so:
lookup_dict = {
    1357899: {},   # some dict of data
    1357910: {},   # some other dict of data
}
Except, you know, millions and millions and millions of entries. I'd like to subset this dict, over and over again. Ideally, I'd love to be able to write something like I can in R, like:
lookup_value = 1357900
dict_subset = lookup_dict[key >= lookup_value]
# dict_subset now contains {1357910: {}}
But I confess, I can't find any actual proof that this is something Python can do without, one way or the other, iterating over every key. If I understand Python correctly (and I might not), key lookup of the form key in dict uses hashing and is thus very fast; is there any way to do a binary search on dict keys?
To do this without iterating, you're going to need the keys in sorted order. Then you just need to do a binary search for the first one >= lookup_value, instead of checking each one for >= lookup_value.
If you're willing to use a third-party library, there are plenty out there. The first two that spring to mind are bintrees (which uses a red-black tree, like C++, Java, etc.) and blist (which uses a B+Tree). For example, with bintrees, it's as simple as this:
dict_subset = lookup_dict[lookup_value:]
And this will be as efficient as you'd hope—basically, it adds a single O(log N) search on top of whatever the cost of using that subset. (Of course usually what you want to do with that subset is iterate the whole thing, which ends up being O(N) anyway… but maybe you're doing something different, or maybe the subset is only 10 keys out of 1000000.)
Of course there is a tradeoff. Random access to a tree-based mapping is O(log N) instead of "usually O(1)". Also, your keys obviously need to be fully ordered, instead of hashable (and that's a lot harder to detect automatically and raise nice error messages on).
If you want to build this yourself, you can. You don't even necessarily need a tree; just a sorted list of keys alongside a dict. You can maintain the list with the bisect module in the stdlib, as JonClements suggested. You may want to wrap up bisect to make a sorted list object—or, better, get one of the recipes on ActiveState or PyPI to do it for you. You can then wrap the sorted list and the dict together into a single object, so you don't accidentally update one without updating the other. And then you can extend the interface to be as nice as bintrees, if you want.
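As a rough illustration, a bare-bones version of that wrapper might look like this (the class and method names are mine, not from any library):

import bisect

class SortedKeyDict(object):
    def __init__(self):
        self._keys = []    # always kept in sorted order
        self._data = {}

    def __setitem__(self, key, value):
        if key not in self._data:
            bisect.insort(self._keys, key)    # O(log N) search + O(N) insert
        self._data[key] = value

    def subset_from(self, lookup_value):
        # binary search for the first key >= lookup_value
        i = bisect.bisect_left(self._keys, lookup_value)
        return dict((k, self._data[k]) for k in self._keys[i:])

With that, subset_from(1357900) gives the same result as the R-style slice in the question, at the cost of one binary search plus the size of the subset.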
The following code will do it:
some_time_to_filter_for = # blah unix time
# Create a new sub-dictionary
sub_dict = {key: val for key, val in lookup_dict.items()
            if key >= some_time_to_filter_for}
Basically we just iterate through all the keys in your dictionary and, given a time to filter for, take all the keys that are greater than or equal to that value and place them into a new dictionary. Note that this scans every key, so it is O(N) per subset.
I have a default dict of dicts whose primary key is a timestamp in the string form 'YYYYMMDD HH:MM:SS.' The keys are entered sequentially. How do I access the last entered key or the key with the latest timestamp?
Use an OrderedDict from the collections module if you simply need to access the last item entered. If, however, you need to maintain continuous sorting, you need to use a different data structure entirely, or at least an auxiliary one for the purposes of indexing.
Edit: I would add that, if accessing the final element is an operation that you have to do very rarely, it may be sufficient simply to sort the dict's keys and select the maximum. If you have to do this frequently, however, repeatedly sorting would become prohibitively expensive. Depending on how your code works, the simplest approach would probably be to simply maintain a single variable that, at any given point, contains the last key added and/or the maximum value added (i.e., is updated with each subsequent addition to the dict). If you want to maintain a record of additions that extends beyond just the last item, however, and don't require continuous sorting, an OrderedDict is ideal.
Use OrderedDict rather than a built-in dict
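For example, assuming "last entered" by insertion order is what you need:

from collections import OrderedDict

d = OrderedDict()
d['20120627 21:20:23'] = {'some': 'data'}
d['20120627 21:20:40'] = {'other': 'data'}

last_key = next(reversed(d))   # '20120627 21:20:40'
print d[last_key]              # the most recently entered value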
You can try something like this:
>>> import time
>>> data ={'20120627 21:20:23':'first','20120627 21:20:40':'last'}
>>> latest = lambda d: time.strftime('%Y%m%d %H:%M:%S',max(map(lambda x: time.strptime(x,'%Y%m%d %H:%M:%S'),d.keys())))
>>> data[latest(data)]
'last'
but it probably would be slow on large data sets.
If you want to know which key was entered last (according to its timestamp), see the example below:
import datetime
format='%Y%m%d %H:%M'
Dict={'20010203 12:00':'Dave',
      '20000504 03:00':'Pete',
      '20020825 23:00':'kathy',
      '20030102 01:00':'Ray'}
myDict={}
for key,val in Dict.iteritems():
    TIME= str(datetime.datetime.strptime(key,format))
    myDict[TIME]= val
myDict=sorted(myDict.iteritems(), key=lambda (TIME,v): (TIME))
print myDict[-1]
Hey. I have a function I want to memoize, however, it has too many possible values. Is there any convenient way to store the values in a text file and make it read from them? For example, something like storing a pre-computed list of primes up to 10^9 in a text file? I know it's slow to read from a text file but there's no other option if the amount of data is really huge. Thanks!
For a list of primes up to 10**9, why do you need a hash? What would the KEYS be?! This sounds like a perfect opportunity for a simple, straightforward binary file! By the Prime Number Theorem, there are about 10**9/ln(10**9) such primes, i.e. 50 million or a bit less. At 4 bytes per prime, that's only 200 MB or less, perfect for an array.array("L") and its methods such as fromfile, etc. (see the docs). In many cases you could actually suck all of the 200 MB into memory, but, worst case, you can get a slice of those (e.g. via mmap and the fromstring method of array.array), do binary searches there (e.g. via bisect), etc.
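To make the array.array idea concrete, here is a small sketch of the round trip (the file name and the toy prime list are mine):

from array import array
from bisect import bisect_left
import os

primes = array("L", [2, 3, 5, 7, 11, 13])   # stand-in for ~50 million primes
with open("primes.bin", "wb") as f:
    primes.tofile(f)

loaded = array("L")
with open("primes.bin", "rb") as f:
    loaded.fromfile(f, os.path.getsize("primes.bin") // loaded.itemsize)

def is_prime(n):
    i = bisect_left(loaded, n)   # binary search instead of any hashing
    return i < len(loaded) and loaded[i] == n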
When you DO need a huge key-values store -- gigabytes, not a paltry 200 MB!-) -- I used to recommend shelve but after unpleasant real-life experience with huge shelves (performance, reliability, etc), I currently recommend a database engine instead -- sqlite is good and comes with Python, PostgreSQL is even better, non-relational ones such as CouchDB can be better still, and so forth.
You can use the shelve module to store a dictionary like structure in a file. From the Python documentation:
import shelve
d = shelve.open(filename) # open -- file may get suffix added by low-level
# library
d[key] = data # store data at key (overwrites old data if
# using an existing key)
data = d[key] # retrieve a COPY of data at key (raise KeyError
# if no such key)
del d[key] # delete data stored at key (raises KeyError
# if no such key)
flag = key in d # true if the key exists
klist = list(d.keys()) # a list of all existing keys (slow!)
# as d was opened WITHOUT writeback=True, beware:
d['xx'] = [0, 1, 2] # this works as expected, but...
d['xx'].append(3) # *this doesn't!* -- d['xx'] is STILL [0, 1, 2]!
# having opened d without writeback=True, you need to code carefully:
temp = d['xx'] # extracts the copy
temp.append(5) # mutates the copy
d['xx'] = temp # stores the copy right back, to persist it
# or, d=shelve.open(filename,writeback=True) would let you just code
# d['xx'].append(5) and have it work as expected, BUT it would also
# consume more memory and make the d.close() operation slower.
d.close() # close it
You could also just go with the ultimate brute force, and create a Python file with just a single statement in it:
seedprimes = [3,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,
79,83,89,97,101,103,107,109,113,127,131,137,139,149,151,157,163,167,173, ...
and then just import it. (Here is a file with the primes up to 1e5: http://python.pastebin.com/f177ec30.)
from primes_up_to_1e9 import seedprimes
For Project Euler, I stored a precomputed list of primes up to 10**8 in a text file just by writing them in comma separated format. It worked well for that size, but it doesn't scale well to going much larger.
If your "huge" is not really that huge, I would use something simple like the above; otherwise I would go with shelve, as the others have said.
Just naively sticking a hash table onto disk will result in about 5 orders of magnitude performance loss compared to an in-memory implementation (or at least 3 if you have an SSD). When dealing with hard disks you'll want to extract every bit of data locality and caching you can get.
The correct choice will depend on details of your use case. How much performance do you need? What kind of operations do you need on data-structure? Do you need to only check if the table contains a key or do you need to fetch a value based on the key? Can you precompute the table or do you need to be able to modify it on the fly? What kind of hit rate are you expecting? Can you filter out a significant amount of the operations using a bloom filter? Are the requests uniformly distributed or do you expect some kind of temporal locality? Can you predict the locality clusters ahead of time?
If you don't need ultimate performance or can parallelize and throw hardware at the problem check out some distributed key-value stores.
You can also go one step down the ladder and use pickle directly. shelve is built on top of pickle, so if you don't need the added functionality of shelve, this may spare you some clock cycles (although they really don't matter much, given that you have chosen Python for storing large quantities of numbers).
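A minimal pickle round trip for comparison (cPickle shown since this thread is 2.x-era; on 3.x plain pickle does the same job):

import cPickle as pickle   # Python 3: import pickle

table = {2: True, 3: True, 4: False}   # whatever you precomputed
with open("table.pkl", "wb") as f:
    pickle.dump(table, f, pickle.HIGHEST_PROTOCOL)

with open("table.pkl", "rb") as f:
    table = pickle.load(f)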
Let's see where the bottleneck is. When you read from a file, the hard drive has to spin to the right position before it can read anything; then it reads a big block and caches the results.
So you want some method that will determine exactly what position in the file you're going to read from, and then do it exactly once. I'm pretty sure the standard DB modules will work for you, but you can do it yourself: just open the file in binary mode for reading/writing and store your values as, say, 30-digit (= 100-bit = 13-byte) numbers.
Then use the standard file methods seek() and read() to jump to and fetch exactly the record you need.
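To illustrate, fixed-width binary records make the one-seek-one-read idea concrete (I use 8-byte integers here for simplicity, rather than the 13-byte values suggested above):

import struct

RECORD = struct.calcsize("<Q")   # 8 bytes per value

def write_value(f, index, value):
    f.seek(index * RECORD)
    f.write(struct.pack("<Q", value))

def read_value(f, index):
    f.seek(index * RECORD)   # jump straight to the record: one seek, one read
    return struct.unpack("<Q", f.read(RECORD))[0]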
I have a list of data in the following form:
[(id_1, description, id_type), (id_2, description, id_type), ..., (id_n, description, id_type)]
The data are loaded from files that belong to the same group. In each group there could be multiples of the same id, each coming from different files. I don't care about the duplicates, so I thought that a nice way to store all of this would be to throw it into a Set type. But there's a problem.
Sometimes for the same id the descriptions can vary slightly, as follows:
IPI00110753
Tubulin alpha-1A chain
Tubulin alpha-1 chain
Alpha-tubulin 1
Alpha-tubulin isotype M-alpha-1
(Note that this example is taken from the uniprot protein database.)
I don't care if the descriptions vary. I cannot throw them away because there is a chance that the protein database I am using will not contain a listing for a certain identifier. If this happens I will want to be able to display the human readable description to the biologists so they know roughly what protein they are looking at.
I am currently solving this problem by using a dictionary type. However I don't really like this solution because it uses a lot of memory (I have a lot of these ID's). This is only an intermediary listing of them. There is some additional processing the ID's go through before they are placed in the database so I would like to keep my data-structure smaller.
I have two questions really. First, will I get a smaller memory footprint using the Set type (over the dictionary type) for this, or should I use a sorted list where I check every time I insert into the list to see if the ID exists, or is there a third solution that I haven't thought of? Second, if the Set type is the better answer how do I key it to look at just the first element of the tuple instead of the whole thing?
Thank you for reading my question,
Tim
Update
Based on some of the comments I received, let me clarify a little. Most of what I do with the data-structure is insert into it. I only read it twice: once to annotate it with additional information,* and once to insert it into the database. However, down the line there may be additional annotation that is done before I insert into the database. Unfortunately I don't know if that will happen at this time.
Right now I am looking into storing this data in a structure that is not based on a hash table (i.e. a dictionary). I would like the new structure to be fairly quick on insertion, but reading it can be linear since I only really do it twice. I am trying to move away from the hash table to save space. Is there a better structure, or is a hash table about as good as it gets?
*The information is a list of Swiss-Prot protein identifiers that I get by querying uniprot.
Sets don't have keys. The element is the key.
If you think you want keys, you have a mapping. More-or-less by definition.
Sequential list lookup can be slow, even using a binary search. Mappings use hashes and are fast.
Are you talking about a dictionary like this?
{ 'id1': [ ('description1a', 'type1'), ('description1b','type1') ],
'id2': [ ('description2', 'type2') ],
...
}
This sure seems minimal. ID's are only represented once.
Perhaps you have something like this?
{ 'id1': ( ('description1a', 'description1b' ), 'type1' ),
'id2': ( ('description2',), 'type2' ),
...
}
I'm not sure you can find anything more compact unless you resort to using the struct module.
I'm assuming the problem you're trying to solve by cutting down on memory use is the address-space limit of your process. Additionally, you're looking for a data structure that allows fast insertion and reasonably fast sequential read-out.
Use as few structures other than strings (str) as possible
The question you ask is how to structure your data in one process to use less memory. The one canonical answer to this (as long as you still need associative lookups) is to use as few structures other than Python strings (str, not unicode) as possible. A Python hash (dictionary) stores the references to your strings fairly efficiently (it is not a b-tree implementation).
However I think that you will not get very far with that approach, since what you face are huge datasets that might eventually just exceed the process address space and the physical memory of the machine you're working with altogether.
Alternative Solution
I would propose a different solution that does not involve changing your data structure into something that is harder to insert or interpret.
Split your information up into multiple processes, each holding whatever data structure is convenient for it.
Implement inter-process communication with sockets, so that processes can reside on other machines altogether.
Try to divide your data so as to minimize inter-process communication (I/O is glacially slow compared to CPU cycles).
The advantages of the approach I outline are:
You get to use two or more cores on a machine fully for performance
You are not limited by the address space of one process, or even the physical memory of one machine
There are numerous packages and approaches to distributed processing, some of which are:
linda
processing
If you're doing an n-way merge with removing duplicates, the following may be what you're looking for.
This generator will merge any number of sources. Each source must be a sequence.
The key must be in position 0. It yields the merged sequence one item at a time.
def merge( *sources ):
    keyPos= 0
    for s in sources:
        s.sort()
    while any( [len(s)>0 for s in sources] ):
        topEnum= enumerate([ s[0][keyPos] if len(s) > 0 else None for s in sources ])
        top= [ t for t in topEnum if t[1] is not None ]
        top.sort( key=lambda a:a[1] )
        src, key = top[0]
        #print src, key
        yield sources[ src ].pop(0)
This generator removes duplicates from a sequence.
def unique( sequence ):
    keyPos= 0
    seqIter= iter(sequence)
    curr= seqIter.next()
    for next in seqIter:
        if next[keyPos] == curr[keyPos]:
            # might want to create a sub-list of matches
            continue
        yield curr
        curr= next
    yield curr
Here's a script which uses these functions to produce a resulting sequence which is the union of all the sources with duplicates removed.
for u in unique( merge( source1, source2, source3, ... ) ):
    print u
The complete set of data in each sequence must exist in memory once because we're sorting in memory. However, the resulting sequence does not actually exist in memory. Indeed, it works by consuming the other sequences.
How about using {id: (description, id_type)} dictionary? Or {(id, id_type): description} dictionary if (id,id_type) is the key.
Sets in Python are implemented using hash tables. In earlier versions, they were actually implemented using dictionaries. The only thing you save by using a set would then be the size of a pointer for each entry (the pointer to the value).
To use only a part of a tuple for the hashcode, you'd have to subclass tuple and override the __hash__ method (and __eq__ as well, since the set decides membership by equality, not by the hash alone):
class ProteinTuple(tuple):
    def __new__(cls, m1, m2, m3):
        return tuple.__new__(cls, (m1, m2, m3))

    def __hash__(self):
        return hash(self[0])

    def __eq__(self, other):
        # must be consistent with __hash__, or the set will
        # still keep entries whose first elements match
        return self[0] == other[0]
Keep in mind that you pay for the extra function call to __hash__ in this case, because otherwise it would be a C method.
I'd go for Constantin's suggestions and take out the id from the tuple and see how much that helps.
It's still murky, but it sounds like you have several lists of [(id, description, type), ...].
The id's are unique within a list and consistent between lists.
You want to create a UNION: a single list, where each id occurs once, with possibly multiple descriptions.
For some reason, you think a mapping might be too big. Do you have any evidence of this? Don't over-optimize without actual measurements.
This may be (if I'm guessing correctly) the standard "merge" operation from multiple sources.
source1.sort()
source2.sort()
result= []
while len(source1) > 0 or len(source2) > 0:
    if len(source1) == 0:
        result.append( source2.pop(0) )
    elif len(source2) == 0:
        result.append( source1.pop(0) )
    elif source1[0][0] < source2[0][0]:
        result.append( source1.pop(0) )
    elif source2[0][0] < source1[0][0]:
        result.append( source2.pop(0) )
    else:
        # keys are equal: keep source1's entry and drop the duplicate
        # from source2 (check its description here if you care about
        # differing descriptions).
        result.append( source1.pop(0) )
        source2.pop(0)
This assembles a union of two lists by sorting and merging. No mapping, no hash.