I currently store about 50k hashes in Redis; each one has 5 field/value pairs. Once a day I run a batch job which updates hash values, including setting some fields to the value of another field in the same hash.
Here is my Python code, which iterates through the keys and sets old_code to new_code if a new_code value exists for a given hash:
pipe = r.pipeline()
for availability in availabilities:
    pipe.hget(availability["EventId"], "new_code")
for availability, old_code in zip(availabilities, pipe.execute()):
    if old_code:
        availability["old_code"] = old_code.decode("utf-8")
for availability in availabilities:
    if "old_code" in availability:
        pipe.hset(
            availability["EventId"], "old_code", availability["old_code"])
    pipe.hset(availability["EventId"], "new_code", availability["MsgCode"])
pipe.execute()
It seems a bit odd to me that I have to iterate through the keys twice to achieve this result. Is there a better way to do it?
Another thing I'm trying to figure out is how to get all hash values with the best performance. Here is how I currently do it:
d = []
pipe = r.pipeline()
keys = r.keys('*')
for key in keys:
    pipe.hgetall(key)
for val, key in zip(pipe.execute(), keys):
    e = {"event_id": key}
    e.update(val)
    if "old_key" not in e:
        e["old_key"] = None
    d.append(e)
So basically I run KEYS * and then iterate with HGETALL across all keys to get the values. This is way too slow, especially the iteration. Is there a quicker way to do it?
How about an upside-down change? Transpose the way you store the data.
Instead of having 50k hashes, each with 5 fields, have 5 hashes, each with 50k fields.
For example, your hashes are currently keyed by EventId and store new_code, old_code and a few other fields inside each hash.
Instead, keep one hash per field: the new_code hash would contain EventId as the field and its value as the value, so new_code alone becomes a hash holding 50k field/value pairs.
Looping through 5 hashes instead of 50k will be considerably quicker.
I ran a little experiment; here are the numbers:
50k hashes * 5 fields:
Memory: ~12.5 MB
Time to loop through all elements: ~1.8 seconds

5 hashes * 50k fields:
Memory: ~35 MB
Time to loop through all elements: ~0.3 seconds
I tested with simple strings like KEY_i and VALUE_i (where i is an incrementing counter), so memory use may be higher in your case. I also only walked through the data without doing any manipulation, so your timings will differ too.
As you can see, this change can give you roughly a 6x speedup at the cost of nearly 3x the memory.
Redis uses a compact encoding for hashes up to a configured size (512 entries by default, via hash-max-ziplist-entries). Since we are storing far more fields than that (50k), we lose that optimization, which explains the jump in memory.
Basically it's a trade-off, and it's up to you to choose the option that best suits your application.
For your 1st question:
You are currently fetching new_code from each of the 50k hashes; with the transposed layout everything lives in a single hash, so it becomes one call (HMGET on the new_code hash).
You are then updating old_code and new_code one by one; now each can be written with a single HMSET call on its hash (see the sketch below).
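As a hedged sketch of what this might look like with redis-py (assuming a client r and the availabilities list from the question; the exact names are illustrative only):

# Minimal sketch of the transposed layout.
event_ids = [a["EventId"] for a in availabilities]

# One call fetches the current new_code for every event of interest.
current = dict(zip(event_ids, r.hmget("new_code", event_ids)))

# Build field->value mappings and write each with a single HMSET.
old_map = {eid: code for eid, code in current.items() if code is not None}
new_map = {a["EventId"]: a["MsgCode"] for a in availabilities}
if old_map:
    r.hmset("old_code", old_map)
r.hmset("new_code", new_map)

# Reading the whole dataset back is now just a handful of HGETALL calls.
all_new = r.hgetall("new_code")   # {event_id: new_code, ...}

The whole daily update becomes a few calls instead of tens of thousands.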
Hope this helps.
For your first problem, using a Lua script will definitely improve performance. This is untested, but something like:
update_hash = r.register_script("""
    local key = KEYS[1]
    local new_code = ARGV[1]
    local old_code = redis.call("HGET", key, "new_code")
    if old_code then
        redis.call("HMSET", key, "old_code", old_code, "new_code", new_code)
    else
        redis.call("HSET", key, "new_code", new_code)
    end
""")
# You can use transaction=False here if you don't need all the
# hashes to be updated together as one atomic unit.
pipe = r.pipeline()
for availability in availabilities:
    keys = [availability["EventId"]]
    args = [availability["MsgCode"]]
    update_hash(keys=keys, args=args, client=pipe)
pipe.execute()
For your second problem you could again make it faster by writing a short Lua script. Instead of getting all the keys and returning them to the client, your script would get the keys and the data associated with them and return it in one call.
(Note, though, that calling keys() is inherently slow wherever you do it. And note that in either approach you're essentially pulling your entire Redis dataset into local memory, which might or might not become a problem.)
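A hedged, untested sketch of such a script, following the same register_script pattern as above (it returns everything as one JSON blob; using KEYS inside the script carries the same caveats as calling keys() from the client):

import json

# Fetch every hash and its fields in one round trip.
dump_all = r.register_script("""
    local result = {}
    local keys = redis.call("KEYS", "*")
    for i, key in ipairs(keys) do
        result[key] = redis.call("HGETALL", key)
    end
    return cjson.encode(result)
""")

data = json.loads(dump_all(keys=[], args=[]))
# data is {event_id: [field1, value1, field2, value2, ...], ...}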
There is no such command; Redis hash commands operate within a single hash, so HMGET returns fields from one hash only. There is no way to access fields across multiple hashes at once.
There are 2 options:
Using a pipeline
Using Lua
However, both of these are workarounds, not a real solution to your problem. To see how to do this, check my answer to this question: Is there a command in Redis for HASH data structure similar to MGET?
To work around a MemoryError, I am appending a series of dictionaries to a file like pickle.dump(mydict, open(filename, "a")). As far as I can tell, the dictionary in its entirety can't be constructed in my laptop's memory. As a result I have identical keys in the same pickled file. The file is essentially a hash table of doublets and strings. The data looks like:
{
'abc': [list of strings1],
'efg': [list of strings2],
'abc': [list of strings3]
}
Main Question: When I use pickle.load(open(filename, "r")) is there a way to join the duplicate dictionary keys?
Question 2: Does it matter that there are duplicates? Will calling the duplicate key give me all applicable results?
For example:
mydict = pickle.load(open(filename, "r"))
mydict['abc'] = <<sum of all lists with this key>>
One solution I've considered, but I'm not clear on from a Python-knowledge standpoint:
x = mydict['abc']
if type(x[0]) is list:
    # list.extend returns None, so the reduce needs + to concatenate
    x = reduce(lambda a, b: a + b, x)
<<do processing on list items>>
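One detail that matters here: each pickle.dump call appends an independent pickled object, and pickle.load returns only one object per call, so nothing ever merges the duplicate keys for you. A minimal sketch of merging them at load time, assuming each dumped object is a dict mapping doublets to lists of strings:

import pickle
from collections import defaultdict

# Read every pickled dict from the file and merge duplicate keys
# by concatenating their lists.
merged = defaultdict(list)
with open(filename, "rb") as f:
    while True:
        try:
            chunk = pickle.load(f)   # one pickled dict per dump() call
        except EOFError:
            break
        for key, strings in chunk.items():
            merged[key].extend(strings)
# merged['abc'] now holds list of strings1 + list of strings3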
Edit 1: Here's the flow of data, roughly speaking.
Daily: update table_ownership with 100-500 new records. Each record contains 1 or 2 strings (names of people).
Create a new hash table of 3-letter groups, tied to the strings that contain the doublet. The key is the doublet; the value is a list of strings (actually a tuple containing the string and the primary key for the table_ownership record).
Hourly: update table_people with 10-40 new names to match.
Use the hash table to pull the most likely matches PRIOR to running fuzzy matching. We get the doublets from myString and collect candidates with potential_matches.append(hashTable[doublet]) for doublet in get_doublets(myString).
Sort by shared doublet count.
Apply fuzzy matching to top 5000 potential_matches, storing results of high quality in table_fuzzymatches
So this works very well, and it's 10-100 times faster than fuzzymatching straight away. With only 200k records, I can make the hash table in memory and pickle.dump() but with the full 1.65M records I can't.
Edit 2: I'm looking into 2 things:
x64 Python
'collections.defaultdict'
I'll report back.
Answers:
32bit Python has a 2GB memory limit. x64 fixed my problem right away.
But what if I hadn't had 64bit Python available?
Chunk the input.
When I used a 10**5 chunk size and wrote to a dictionary piecemeal, it worked out.
For timing, my chunking process took 2000 seconds. 64bit Python sped it up to 380 seconds.
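For what it's worth, a rough sketch of the chunking approach (iter_records is a made-up stand-in for however the table_ownership rows are read; get_doublets is the helper mentioned earlier):

import pickle

# Build the doublet table 10**5 records at a time and dump each partial dict.
CHUNK = 10 ** 5
with open("hashtable.pkl", "wb") as out:
    chunk = {}
    for i, (text, pk) in enumerate(iter_records()):   # hypothetical record iterator
        for doublet in get_doublets(text):
            chunk.setdefault(doublet, []).append((text, pk))
        if (i + 1) % CHUNK == 0:
            pickle.dump(chunk, out)
            chunk = {}
    if chunk:
        pickle.dump(chunk, out)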
I'm putting around 4 million different keys into a Python dictionary.
Creating this dictionary takes about 15 minutes and consumes about 4GB of memory on my machine. After the dictionary is fully created, querying the dictionary is fast.
I suspect that dictionary creation is so resource consuming because the dictionary is very often rehashed (as it grows enormously).
Is it possible to create a dictionary in Python with an initial size or bucket count?
My dictionary points from a number to an object.
class MyObject:
    def __init__(self):
        pass  # some fields...

d = {}
d[i] = MyObject()  # 4M times with different keys...
With performance issues it's always best to measure. Here are some timings:
d = {}
for i in xrange(4000000):
    d[i] = None
# 722ms

d = dict(itertools.izip(xrange(4000000), itertools.repeat(None)))
# 634ms

dict.fromkeys(xrange(4000000))
# 558ms

s = set(xrange(4000000))
dict.fromkeys(s)
# Not including set construction: 353ms
The last option doesn't do any resizing, it just copies the hashes from the set and increments references. As you can see, the resizing isn't taking a lot of time. It's probably your object creation that is slow.
I tried :
a = dict.fromkeys((range(4000000)))
It creates a dictionary with 4,000,000 entries in about 3 seconds. After that, setting values is really fast. So I guess dict.fromkeys is definitely the way to go.
If you know C, you can take a look at dictobject.c and the Notes on Optimizing Dictionaries. There you'll notice the parameter PyDict_MINSIZE:
PyDict_MINSIZE. Currently set to 8.
This parameter is defined in dictobject.h. So you could change it when compiling Python but this probably is a bad idea.
You can try to separate key hashing from content filling with the dict.fromkeys classmethod. It creates a dict of a known size with all values defaulting to either None or a value of your choice. After that you can iterate over it and fill in the values. It will let you time the actual hashing of all keys. Not sure whether you'd be able to significantly increase the speed though.
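A minimal sketch of that two-phase approach, reusing MyObject from the question:

# Build the dict of keys first, then fill in the values, so key hashing
# and object creation can be timed separately.
keys = xrange(4000000)
d = dict.fromkeys(keys)      # all keys present, values default to None
for k in keys:
    d[k] = MyObject()        # replace the placeholder values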
If your data needs to be (or can be) stored on disk, perhaps you can store it in a BSDDB database, or use cPickle to load/store your dictionary.
Do you initialize all keys with new "empty" instances of the same type? Is it not possible to write a defaultdict or something that will create the object when it is accessed?
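For example, a lazily-populated variant along the lines suggested above (again using MyObject from the question):

from collections import defaultdict

# The object is only created the first time its key is accessed.
d = defaultdict(MyObject)
obj = d[42]   # creates and stores MyObject() on first access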
I'm working on a text-parsing algorithm (open-source side project). I'd be very appreciative for any advice.
I have a tab-delimited txt file which is sorted by the first column (sample dataset below). Duplicate entries exist within this column.
Ultimately, I would like to use a hash to point to all values that have the same key (first-column value). Should a new key come along, the contents of the hash are to be serialized, saved, etc., and then cleared so the new key can populate it. As a result, my goal is to have only one key present at a time. Therefore, if I have N unique keys, I wish to make N hashes, each pointing to its respective entries. Datasets, though, are GBs in size, so in-memory heaps won't be much help, hence my reasoning to create a hash per key and process each individually.
SAMPLE DATASET
A ... 23.4421
A ... -23.442
A ... 76.2224
B ... 32.1232
B ... -23.001
C ... 652.123
...
So in the above dataset snippet, I wish to have a hash for 'A' (pointing to its 3x respective items). When 'B' is read, serialize the 'A' hash and clear the hash-contents. Repeat for 'B' until end of dataset.
My pseudocode is as follows:
declare hash
for item in the dataset:
    key, value = item[0], item[1:]
    if key not in hash:
        if hash.size is 0:            // pertains to the very first item
            hash.put(key, value)
        else:
            serialize and clear hash  // a new key was read while a different key is present
            hash.put(key, value)
    else:
        hash.put(key, value)          // key already there, so append to it
If any suggestions exist as to how to efficiently implement the above algorithm, I'd be very appreciative. Also, if my hash-based reasoning/approach is not efficient, or if improvements could be suggested, I'd be very thankful. My goal is to ultimately create in-memory hashes until a new key comes along.
Thank you,
p.
Use itertools.groupby, passing it the file as an iterator:
from itertools import groupby
from cStringIO import StringIO

sourcedata = StringIO("""\
A ... 23.4421
A ... -23.442
A ... 76.2224
B ... 32.1232
B ... -23.001
C ... 652.123""")

# or sourcedata = open("zillion_gigabyte_file.dat")

for key, lines in groupby(sourcedata, key=lambda s: s.split()[0]):
    accum = [float(s.split()[2]) for s in lines]
    print key, accum
groupby is very smart and efficient, keeping very little data in memory at a time and working purely with iterators until the last possible moment. What you describe with hashes, keeping only one in memory at a time and so on, is already done for you by groupby.
You could open an anydbm (2.x) or dbm (3.x) for each key in your first column, named by the value of the column. This is pretty trivial - I'm not sure what the question is.
You could also use something like my cachedb module, to let it figure out whether something is "new" or not: http://stromberg.dnsalias.org/~strombrg/cachedb.html I've used it in two projects, both with good results.
Either way, you could probably make your keys just lists of ascii floats separated by newlines or nulls or something.
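As a rough illustration of the newline-separated-values idea (a sketch only, with made-up file names, not necessarily the exact layout intended above):

import anydbm

# Python 2: one dbm keyed by the first column, with each value stored as a
# newline-separated string of ascii floats.
db = anydbm.open("groups.db", "c")
for line in open("dataset.txt"):
    fields = line.split()
    key, value = fields[0], fields[-1]
    try:
        db[key] = db[key] + value + "\n"
    except KeyError:
        db[key] = value + "\n"
db.close()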
You don't make it explicit whether the sorted data you give is typical or whether the same key can be interspersed with other keys throughout the file, and that does make a fundamental difference to the algorithm. I deduce from your sample code that they will appear in arbitrary order.
Neither do you say what use you intend to make of the data so extracted. This can matter a lot: there are many different ways to store data, and the application can be a crucial factor in determining access times. So you may want to consider various different storage types. Without knowing how you propose to use the resulting structure, the following suggestion may be inappropriate.
Since the data are floating-point numbers, you may want to consider using the shelve module to maintain simple lists of floating-point numbers keyed against the alphanumeric keys. This has the advantage that all pickling and unpickling to/from external storage is handled automatically. If you need an increase in speed, consider using a more efficient pickle protocol (one of the optional arguments to shelve.open()).
# Transform the data:
# note it's actually more efficient to process line-by-line
# as you read it from a file - obviously it's best to try
# to avoid reading the whole data set into memory at once.
data = """\
A ... 23.4421
A ... -23.442
A ... 76.2224
B ... 32.1232
B ... -23.001
C ... 652.123"""
data = [(k, float(v))
        for (k, _, v) in
        [_.split() for _ in data.splitlines()]]

# create a shelve
import shelve
shelf = shelve.open("myshelf", "c")

# process data
for (k, v) in data:
    if k in shelf:
        # see note below for rationale
        tmp = shelf[k]
        tmp.append(v)
        shelf[k] = tmp
    else:
        shelf[k] = [v]

# verify results
for k in shelf.keys():
    print k, shelf[k]
You may be wondering why I didn't just use shelf[k].append(v) in the case where a key has already been seen. This is because it's only the operation of key assignment that triggers detection of the value change. You can read the shelve module docs for more detail, and to learn how to use the binary pickle format.
Note also that, because of the "c" argument to shelve.open(), the shelf is created only if it does not already exist, so repeated runs will keep appending to the same shelf. Use the "n" flag if you want a fresh shelf on every run.
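As an aside, shelve also has a writeback=True mode that makes in-place mutation work, at the cost of caching every accessed entry in memory and a slower sync()/close(); a small sketch:

import shelve

# writeback=True caches accessed entries, so in-place mutation is persisted.
shelf = shelve.open("myshelf", "c", writeback=True)
if "A" not in shelf:
    shelf["A"] = []
shelf["A"].append(23.4421)   # works because the entry is cached in memory
shelf.close()                # flushes the cached entries back to disk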
I have a list of data in the following form:
[(id_1, description, id_type), (id_2, description, id_type), ..., (id_n, description, id_type)]
The data are loaded from files that belong to the same group. In each group there could be multiples of the same id, each coming from different files. I don't care about the duplicates, so I thought that a nice way to store all of this would be to throw it into a Set type. But there's a problem.
Sometimes for the same id the descriptions can vary slightly, as follows:
IPI00110753
Tubulin alpha-1A chain
Tubulin alpha-1 chain
Alpha-tubulin 1
Alpha-tubulin isotype M-alpha-1
(Note that this example is taken from the uniprot protein database.)
I don't care if the descriptions vary. I cannot throw them away because there is a chance that the protein database I am using will not contain a listing for a certain identifier. If this happens I will want to be able to display the human readable description to the biologists so they know roughly what protein they are looking at.
I am currently solving this problem by using a dictionary type. However I don't really like this solution because it uses a lot of memory (I have a lot of these ID's). This is only an intermediary listing of them. There is some additional processing the ID's go through before they are placed in the database so I would like to keep my data-structure smaller.
I have two questions really. First, will I get a smaller memory footprint using the Set type (over the dictionary type) for this, or should I use a sorted list where I check every time I insert into the list to see if the ID exists, or is there a third solution that I haven't thought of? Second, if the Set type is the better answer how do I key it to look at just the first element of the tuple instead of the whole thing?
Thank you for reading my question,
Tim
Update
Based on some of the comments I received, let me clarify a little. Most of what I do with the data structure is insert into it. I only read it twice: once to annotate it with additional information,* and once to be inserted into the database. However, down the line there may be additional annotation that is done before I insert into the database. Unfortunately I don't know if that will happen at this time.
Right now I am looking into storing this data in a structure that is not based on a hash-table (ie. a dictionary). I would like the new structure to be fairly quick on insertion, but reading it can be linear since I only really do it twice. I am trying to move away from the hash table to save space. Is there a better structure or is a hash-table about as good as it gets?
*The information is a list of Swiss-Prot protein identifiers that I get by querying uniprot.
Sets don't have keys. The element is the key.
If you think you want keys, you have a mapping. More-or-less by definition.
Sequential list lookup can be slow, even using a binary search. Mappings use hashes and are fast.
Are you talking about a dictionary like this?
{ 'id1': [ ('description1a', 'type1'), ('description1b','type1') ],
'id2': [ ('description2', 'type2') ],
...
}
This sure seems minimal. ID's are only represented once.
Perhaps you have something like this?
{ 'id1': ( ('description1a', 'description1b' ), 'type1' ),
'id2': ( ('description2',), 'type2' ),
...
}
I'm not sure you can find anything more compact unless you resort to using the struct module.
I'm assuming the problem you're trying to solve by cutting down on memory use is the address-space limit of your process. Additionally, you are looking for a data structure that allows fast insertion and reasonable sequential read-out.
Use few structures other than strings (str)
The question you ask is how to structure your data in one process to use less memory. The canonical answer to this (as long as you still need associative lookups) is to use as few structures other than Python strings (str, not unicode) as possible. A Python hash (dictionary) stores the references to your strings fairly efficiently (it is not a b-tree implementation).
However I think that you will not get very far with that approach, since what you face are huge datasets that might eventually just exceed the process address space and the physical memory of the machine you're working with altogether.
Alternative Solution
I would propose a different solution that does not involve changing your data structure into something that is harder to insert or interpret.
Split your information up into multiple processes, each holding whatever data structure is convenient for it.
Implement inter-process communication with sockets, so that processes can reside on other machines altogether.
Try to divide your data so as to minimize inter-process communication (I/O is glacially slow compared to CPU cycles).
The advantages of the approach I outline are that:
You get to use two or more cores of a machine fully for performance
You are not limited by the address space of one process, or even by the physical memory of one machine
There are numerous packages and approaches to distributed processing, some of which are listed below (a toy sketch follows the list):
linda
processing
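Purely as an illustration of the sharding idea above, here is a toy sketch using the standard-library multiprocessing module (the successor of the processing package): each worker process owns one shard of the id mapping and answers requests over a queue. All names here are made up; sockets would replace the queues if the workers lived on other machines.

from multiprocessing import Process, Queue

def worker(requests, replies):
    shard = {}                                # this process's private shard
    for op, key, value in iter(requests.get, None):
        if op == "put":
            shard.setdefault(key, []).append(value)
        elif op == "get":
            replies.put(shard.get(key))

if __name__ == "__main__":
    n_shards = 2
    requests = [Queue() for _ in range(n_shards)]
    replies = Queue()
    workers = [Process(target=worker, args=(q, replies)) for q in requests]
    for w in workers:
        w.start()

    # route each id to a shard by hash
    shard_id = hash("IPI00110753") % n_shards
    requests[shard_id].put(("put", "IPI00110753",
                            ("Tubulin alpha-1A chain", "some_type")))
    requests[shard_id].put(("get", "IPI00110753", None))
    print(replies.get())

    for q in requests:
        q.put(None)                           # tell workers to exit
    for w in workers:
        w.join()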
If you're doing an n-way merge with removing duplicates, the following may be what you're looking for.
This generator will merge any number of sources. Each source must be a sequence.
The key must be in position 0. It yields the merged sequence one item at a time.
def merge( *sources ):
    keyPos = 0
    for s in sources:
        s.sort()
    while any( [len(s) > 0 for s in sources] ):
        topEnum = enumerate([ s[0][keyPos] if len(s) > 0 else None for s in sources ])
        top = [ t for t in topEnum if t[1] is not None ]
        top.sort( key=lambda a: a[1] )
        src, key = top[0]
        #print src, key
        yield sources[ src ].pop(0)
This generator removes duplicates from a sequence.
def unique( sequence ):
    keyPos = 0
    seqIter = iter(sequence)
    curr = seqIter.next()
    for next in seqIter:
        if next[keyPos] == curr[keyPos]:
            # might want to create a sub-list of matches
            continue
        yield curr
        curr = next
    yield curr
Here's a script which uses these functions to produce a resulting sequence which is the union of all the sources with duplicates removed.
for u in unique( merge( source1, source2, source3, ... ) ):
print u
The complete set of data in each sequence must exist in memory once because we're sorting in memory. However, the resulting sequence does not actually exist in memory. Indeed, it works by consuming the other sequences.
How about using {id: (description, id_type)} dictionary? Or {(id, id_type): description} dictionary if (id,id_type) is the key.
Sets in Python are implemented using hash tables. In earlier versions, they were actually implemented on top of dictionaries, but that has changed AFAIK. The only thing you save by using a set would then be the size of a pointer for each entry (the pointer to the value).
To use only a part of a tuple for the hashcode, you'd have to subclass tuple and override the hashcode method:
class ProteinTuple(tuple):
    def __new__(cls, m1, m2, m3):
        return tuple.__new__(cls, (m1, m2, m3))

    def __hash__(self):
        return hash(self[0])
Keep in mind that you pay for the extra function call to __hash__ in this case, because otherwise it would be a C method.
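One caveat worth adding: overriding __hash__ alone will not make a set collapse entries that share an id, because set membership also checks equality, and tuple equality still compares all three fields. For the set to treat same-id entries as duplicates, __eq__ (and, in Python 2, __ne__) would also need to look only at the first element, along these lines:

class ProteinTuple(tuple):
    def __new__(cls, m1, m2, m3):
        return tuple.__new__(cls, (m1, m2, m3))

    def __hash__(self):
        # hash by id only
        return hash(self[0])

    def __eq__(self, other):
        # compare by id only, so a set discards later same-id entries
        return self[0] == other[0]

    def __ne__(self, other):
        return not self == other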
I'd go for Constantin's suggestions and take out the id from the tuple and see how much that helps.
It's still murky, but it sounds like you have several lists of [(id, description, type)...]
The id's are unique within a list and consistent between lists.
You want to create a UNION: a single list, where each id occurs once, with possibly multiple descriptions.
For some reason, you think a mapping might be too big. Do you have any evidence of this? Don't over-optimize without actual measurements.
This may be (if I'm guessing correctly) the standard "merge" operation from multiple sources.
source1.sort()
source2.sort()
result = []
while len(source1) > 0 or len(source2) > 0:
    if len(source1) == 0:
        result.append( source2.pop(0) )
    elif len(source2) == 0:
        result.append( source1.pop(0) )
    elif source1[0][0] < source2[0][0]:
        result.append( source1.pop(0) )
    elif source2[0][0] < source1[0][0]:
        result.append( source2.pop(0) )
    else:
        # keys are equal
        result.append( source1.pop(0) )
        # check source2, to see if the description is different.
This assembles a union of two lists by sorting and merging. No mapping, no hash.