pickle.load() a dictionary with duplicate keys and reduce - python

For reasons concerning MemoryError, I am appending a series of dictionaries to a file like pickle.dump(mydict, open(filename, "a")). The entirety of the dictionary, as far as I can tell, can't be constructed in my laptop's memory. As a result I have identical keys in the same pickled file. The file is essentially a hash table of doublets and strings. The data looks like:
{
    'abc': [list of strings1],
    'efg': [list of strings2],
    'abc': [list of strings3]
}
Main Question: When I use pickle.load(open(filename, "r")) is there a way to join the duplicate dictionary keys?
Question 2: Does it matter that there are duplicates? Will calling the duplicate key give me all applicable results?
For example:
mydict = pickle.load(open(filename, "r"))
mydict['abc'] = <<sum of all lists with this key>>
One solution I've considered, but I'm not clear on from a Python-knowledge standpoint:
x = mydict['abc']
if type(x[0]) is list:
    # a + b rather than a.extend(b): extend() returns None, so the reduce
    # would otherwise fail after the first step
    x = reduce(lambda a, b: a + b, x)
<<do processing on list items>>
Edit 1: Here's the flow of data, roughly speaking.
Daily: update table_ownership with 100-500 new records. Each record contains 1 or 2 strings (names of people).
Create a new hash table of 3-letter groups, tied to the strings that contain the doublet. The key is the doublet; the value is a list of strings (actually of tuples, each containing the string and the primary key of the table_ownership record).
Hourly: update table_people with 10-40 new names to match.
Use the hash table to pull the most likely matches PRIOR to running fuzzy matching (sketched below). We get the doublets from myString and collect candidate strings with potential_matches.append(hashTable[doublet]) for doublet in get_doublets(myString)
Sort by shared doublet count.
Apply fuzzy matching to top 5000 potential_matches, storing results of high quality in table_fuzzymatches
So this works very well, and it's 10-100 times faster than fuzzy matching straight away. With only 200k records I can build the hash table in memory and pickle.dump() it, but with the full 1.65M records I can't.
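For concreteness, a rough sketch of the pre-filter step (get_doublets and the scoring here are illustrative reconstructions, not the actual code; the 3-letter groups above are called "doublets", so the n-gram size is kept as a parameter):

from collections import Counter

def get_doublets(s, n=3):
    # the 3-letter groups described above are called "doublets" here
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def top_candidates(my_string, hash_table, limit=5000):
    counts = Counter()
    for doublet in get_doublets(my_string):
        # each hash_table value is a list of (string, table_ownership_pk) tuples
        for entry in hash_table.get(doublet, []):
            counts[entry] += 1                     # shared-doublet count
    # sort by shared doublet count and keep the top `limit` for fuzzy matching
    return [entry for entry, _ in counts.most_common(limit)]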
Edit 2: I'm looking into 2 things:
x64 Python
collections.defaultdict
I'll report back.

Answers:
32bit Python has a 2GB memory limit. x64 fixed my problem right away.
But what if I hadn't had 64bit Python available?
Chunk the input.
When I used a 10**5 chunk size and wrote to a dictionary piecemeal, it worked out.
For timing, my chunking process took 2000 seconds. 64bit Python sped it up to 380 seconds.
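For reference, a minimal sketch of the chunk-and-merge reading side, assuming the file was built by repeated pickle.dump(chunk_dict, f) calls (the filename and helper name are illustrative):

import pickle
from collections import defaultdict

def load_merged(filename):
    merged = defaultdict(list)
    with open(filename, "rb") as f:          # binary mode for pickle
        while True:
            try:
                chunk = pickle.load(f)       # one dict per dump() call
            except EOFError:
                break                        # end of the appended pickles
            for key, strings in chunk.items():
                merged[key].extend(strings)  # duplicate keys get joined here
    return merged

# mydict = load_merged("doublet_index.pkl")
# mydict['abc'] is now the concatenation of every list dumped under 'abc'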

Related

Quickest and most efficient way to search large sorted text file

I have a large static text/csv file, which contains approx 100k rows (2MB). It's essentially a dictionary, and I need to perform regular lookups on this data in Python.
The format of the file is:
key value1 value2
alpha x1 x2
alpha beta y1 y2
gamma z1 z2
...
The keys can be multi-word strings.
The list is sorted in alphabetical order by the key
The values are strings
This is part of a web application where every user will be looking up 100-300 keys at a time, and will expect to get both value 1 and value 2 for each of those keys. There will be up to 100 users on the application each looking up those 100-300 keys over the same data.
I just need to return the first exact match. For example, if the user searched for the keys [alpha, gamma], I just need to return [('x1','x2'), ('z1','z2')], which represents the first exact match of 'alpha' and 'gamma'.
I've been reading about the options I have, and I'd really love your input on which of the following approaches is best for my use case.
Read the file once into an ordered set, and perform the 200 or so lookups. However, for every user using the application (~100), the file will be loaded into memory.
Read the file once into a list, and use binary search (e.g. bisect). Similar problem as 1.) the file will be loaded into memory for every user who needs to do a search.
Don't read the entire file into memory, and just read the file one line at a time. I can split the .csv into 26 files by each letter (a.csv, b.csv, ...) to speed this up a bit.
Whoosh is a search library that caught my eye since it created an index once. However, I'm not sure if it's applicable for my use case at all as it looks like a full text search and I can't limit to just looking up the first column. If this specific library is not an option, is there any other way I can create a reusable index in Python to support these kinds of lookups?
I'm really open to ideas and I'm in no way restricted to the four options above!
Thank you :)
How about something similar to approach #2? You could still read the file into memory, but instead of storing it in a list and using binary search to look up keys, you could store it in a hash map.
The benefit of doing this is to take advantage of a hash map's average lookup time of O(1), with a worst case of O(n). Since you're only looking up keys, constant lookup time is a great fit here, and it beats binary search's average O(log n) search time.
You could store your file as
table = {
    key1: (value1, value2),
    key2: (value1, value2),
    key3: (value1, value2),
}
Note this method is only viable if your keys are all distinct with no duplicate keys.
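A minimal sketch of that approach (the file name is illustrative; values are assumed to contain no spaces, since keys can be multi-word). Only the first occurrence of each key is kept, matching the "first exact match" requirement, and the table is loaded once and shared rather than re-read per user:

def load_table(path):
    table = {}
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            # keys may be multi-word, so split the two values off the right
            key, value1, value2 = line.rstrip("\n").rsplit(None, 2)
            table.setdefault(key, (value1, value2))   # keep the first exact match
    return table

# Load once at application start-up and share across users/requests,
# instead of re-reading the 2 MB file for every user.
TABLE = load_table("lookup.txt")

def lookup(keys):
    return [TABLE[k] for k in keys if k in TABLE]

# lookup(["alpha", "gamma"])  ->  [('x1', 'x2'), ('z1', 'z2')]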

Size issues with Python shelve module

I want to store a few dictionaries using the shelve module, however, I am running into a problem with the size. I use Python 3.5.2 and the latest shelve module.
I have a list of words and I want to create a map from the bigrams (character level) to the words. The structure will look something like this:
'aa': ['aardvark', 'and', ...],
'ab': ['absolute', 'dab', ...],
...
I read in a large file consisting of approximately 1.3 million words. So the dictionary gets pretty large. This is the code:
self.bicharacters  # part of class

def _create_bicharacters(self):
    '''
    Creates a bicharacter index for calculating Jaccard coefficient.
    '''
    with open('wordlist.txt', encoding='ISO-8859-1') as f:
        for line in f:
            word = line.split('\t')[2]
            for i in range(len(word) - 1):
                bicharacter = (word[i] + word[i+1])
                if bicharacter in self.bicharacters:
                    get = self.bicharacters[bicharacter]
                    get.append(word)
                    self.bicharacters[bicharacter] = get
                else:
                    self.bicharacters[bicharacter] = [word]
When I ran this code using a regular Python dictionary, I did not run into issues, but I can't spare those kinds of memory resources due to the rest of the program also having quite a large memory footprint.
So I tried using the shelve module. However, when I run the code above using shelve, the program stops after a while because it runs out of disk space: the shelve db it created was around 120 GB, and it had still not read even half of the 1.3M-word list from the file. What am I doing wrong here?
The problem here is not so much the number of keys, but that each key references a list of words.
While in memory as one (huge) dictionary, this isn't that big a problem as the words are simply shared between the lists; each list is simply a sequence of references to other objects and here many of those objects are the same, as only one string per word needs to be referenced.
In shelve, however, each value is pickled and stored separately, meaning that a concrete copy of the words in a list will have to be stored for each value. Since your setup ends up adding a given word to a large number of lists, this multiplies your data needs rather drastically.
I'd switch to using a SQL database here. Python comes bundled with sqlite3. If you create one table for individual words, a second table for each possible bigram, and a third that simply links the two (a many-to-many mapping, linking bigram row id to word row id), this can be done very efficiently. You can then do very efficient lookups, as SQLite is quite adept at managing memory and indices for you.
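A rough sketch of that three-table layout with sqlite3 (all table, column and file names here are invented for illustration; each word string is stored only once and linked to its bigrams):

import sqlite3

conn = sqlite3.connect("bigrams.db")
conn.executescript("""
    CREATE TABLE IF NOT EXISTS words (
        id   INTEGER PRIMARY KEY,
        word TEXT UNIQUE
    );
    CREATE TABLE IF NOT EXISTS bigrams (
        id     INTEGER PRIMARY KEY,
        bigram TEXT UNIQUE
    );
    CREATE TABLE IF NOT EXISTS word_bigrams (   -- the many-to-many link
        bigram_id INTEGER REFERENCES bigrams(id),
        word_id   INTEGER REFERENCES words(id),
        PRIMARY KEY (bigram_id, word_id)
    );
""")

def add_word(word):
    conn.execute("INSERT OR IGNORE INTO words (word) VALUES (?)", (word,))
    word_id = conn.execute(
        "SELECT id FROM words WHERE word = ?", (word,)).fetchone()[0]
    for a, b in zip(word, word[1:]):            # each character bigram
        bigram = a + b
        conn.execute("INSERT OR IGNORE INTO bigrams (bigram) VALUES (?)", (bigram,))
        bigram_id = conn.execute(
            "SELECT id FROM bigrams WHERE bigram = ?", (bigram,)).fetchone()[0]
        conn.execute("INSERT OR IGNORE INTO word_bigrams VALUES (?, ?)",
                     (bigram_id, word_id))
    conn.commit()

def words_for_bigram(bigram):
    return [row[0] for row in conn.execute("""
        SELECT w.word
        FROM words w
        JOIN word_bigrams wb ON wb.word_id = w.id
        JOIN bigrams b       ON b.id = wb.bigram_id
        WHERE b.bigram = ?""", (bigram,))]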

Redis: the best way to get all hash values

I currently store about 50k hashes in my Redis table, every single one has 5 key/value pairs. Once a day I run a batch job which updates hash values, including setting some key values to the value of the other key in a hash.
Here is my python code which iterates through the keys and sets old_code to new_code if a new_code value exists for a given hash:
pipe = r.pipeline()
for availability in availabilities:
    pipe.hget(availability["EventId"], "new_code")
for availability, old_code in zip(availabilities, pipe.execute()):
    if old_code:
        availability["old_code"] = old_code.decode("utf-8")
for availability in availabilities:
    if "old_code" in availability:
        pipe.hset(
            availability["EventId"], "old_code", availability["old_code"])
    pipe.hset(availability["EventId"], "new_code", availability["MsgCode"])
pipe.execute()
It's a bit weird to me that I have to iterate through keys twice to achieve the same result, is there a better way to do this?
Another thing I'm trying to figure out is how to get all hash values with the best performance. Here is how I currently do it:
d = []
pipe = r.pipeline()
keys = r.keys('*')
for key in keys:
    pipe.hgetall(key)
for val, key in zip(pipe.execute(), keys):
    e = {"event_id": key}
    e.update(val)
    if "old_key" not in e:
        e["old_key"] = None
    d.append(e)
So basically I do keys * then iterate with HGETALL across all keys to get values. This is way too slow, especially the iteration. Is there a quicker way to do it?
How about an upside-down change: transpose the way you store the data.
Instead of having 50k hashes each with 5 values, have 5 hashes each with 50k values.
For example, your hash currently depends on EventId, and you store new_code, old_code and other fields inside that hash.
Now give new_code its own hash, with EventId as the field and its value as the value. So new_code alone is one hash containing 50k field/value pairs.
Looping through 5 hashes instead of 50k will be considerably quicker.
I have done a little experiment, and here are the numbers:
50k hashes * 5 elements each
Memory: ~12.5 MB
Time to loop through all elements: ~1.8 seconds
5 hashes * 50k elements each
Memory: ~35 MB
Time to loop through all elements: ~0.3 seconds
I tested with simple strings like KEY_i and VALUE_i (where i is an incrementing counter), so memory may be higher in your case; I also only walked through the data without doing any manipulation, so your timings will differ as well.
As you can see, this change can give you roughly a 5x speed-up at the cost of close to three times the memory.
Redis uses a compact, compressed encoding for hashes up to a size threshold (512 by default); since we are storing far more than that (50k), we see this spike in memory.
Basically it's a trade-off, and it's up to you to choose whichever best suits your application.
For your 1st question:
You were fetching the value of new_code from each hash separately; now everything is in a single hash, so it's just one call.
Then you were updating old_code and new_code one by one; now you can set both with a single HMSET call.
Hope this helps.
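As an untested sketch of what the transposed layout looks like with redis-py (the top-level hash names new_code/old_code and the availabilities records follow the question; the rest is assumption):

import redis

r = redis.Redis(decode_responses=True)   # return str instead of bytes

# One HGETALL now fetches every event's current new_code.
current = r.hgetall("new_code")          # {event_id: code, ...}

old_updates, new_updates = {}, {}
for availability in availabilities:      # `availabilities` as in the question
    event_id = availability["EventId"]
    if event_id in current:
        old_updates[event_id] = current[event_id]   # shift new_code -> old_code
    new_updates[event_id] = availability["MsgCode"]

# Two bulk writes instead of one HSET per event
# (hset(..., mapping=...) needs a reasonably recent redis-py).
if old_updates:
    r.hset("old_code", mapping=old_updates)
r.hset("new_code", mapping=new_updates)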
For your first problem, using a Lua script will definitely improve performance. This is untested, but something like:
update_hash = r.register_script("""
    local key = KEYS[1]
    local new_code = ARGV[1]
    local old_code = redis.call("HGET", key, "new_code")
    if old_code then
        redis.call("HMSET", key, "old_code", old_code, "new_code", new_code)
    else
        redis.call("HSET", key, "new_code", new_code)
    end
""")

# You can use transaction=False here if you don't need all the
# hashes to be updated together as one atomic unit.
pipe = r.pipeline()
for availability in availabilities:
    keys = [availability["EventId"]]
    args = [availability["MsgCode"]]
    update_hash(keys=keys, args=args, client=pipe)
pipe.execute()
For your second problem you could again make it faster by writing a short Lua script. Instead of getting all the keys and returning them to the client, your script would get the keys and the data associated with them and return it in one call.
(Note, though, that calling keys() is inherently slow wherever you do it. And note that in either approach you're essentially pulling your entire Redis dataset into local memory, which might or might not become a problem.)
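For illustration, an untested sketch of such a script (the key pattern and return shape are assumptions, and it still relies on KEYS, with the caveat above):

fetch_all = r.register_script("""
    local result = {}
    local keys = redis.call("KEYS", ARGV[1])
    for i, key in ipairs(keys) do
        -- each entry is {key, flat list of field, value, field, value, ...}
        result[i] = {key, redis.call("HGETALL", key)}
    end
    return result
""")

# rows = fetch_all(args=["*"])
# -> [[event_id, [field1, value1, field2, value2, ...]], ...] in one round trip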
There is no command like that; Redis hash commands work within a single hash, so HMGET works inside one hash and returns fields from that hash only. There is no way to access fields across multiple hashes at once.
There are two options:
Using a pipeline
Using Lua
However, both of these are workarounds, not a solution to your problem. To see how to do this, check my answer to this question: Is there a command in Redis for HASH data structure similar to MGET?

Merging two datasets in Python efficiently

What would anyone consider the most efficient way to merge two datasets using Python?
A little background - this code will take 100K+ records in the following format:
{user: aUser, transaction: UsersTransactionNumber}, ...
and using the following data
{transaction: aTransactionNumber, activationNumber: associatedActivationNumber}, ...
to create
{user: aUser, activationNumber: associatedActivationNumber}, ...
N.B These are not Python dictionaries, just the closest thing to portraying record format cleanly.
So in theory, all I am trying to do is create a view of two lists (or tables) joining on a common key - at first this points me towards sets (unions etc), but before I start learning these in depth, are they the way to go? So far I felt this could be implemented as:
Create a list of dictionaries and iterate over the list, comparing the key each time; however, in the worst case this could take up to len(inputDict)*len(outputDict) comparisons <- Not sure?
Manipulate the data as an in-memory SQLite table? Preferably not; although there is no strict requirement for Python 2.4, staying compatible with it would make life easier.
Some kind of Set based magic?
Clarification
The whole purpose of this script is to summarise, the actual data sets are coming from two different sources. The user and transaction numbers are coming in the form of a CSV as an output from a performance test that is testing email activation code throughput. The second dataset comes from parsing the test mailboxes, which contain the transaction id and activation code. The output of this test is then a CSV that will get pumped back into stage 2 of the performance test, activating user accounts using the activation codes that were paired up.
Apologies if my notation for the records was misleading, I have updated them accordingly.
Thanks for the replies, I am going to give two ideas a try:
Sorting the lists first (I don't know how expensive this is)
Creating a dictionary with the transactionCodes as the key, then storing the user and activation code in a list as the value
Performance isn't overly paramount for me, I just want to try and get into good habits with my Python Programming.
Here's a radical approach.
Don't.
You have two CSV files; one (users) is clearly the driver. Leave this alone.
The other -- transaction codes for a user -- can be turned into a simple dictionary.
Don't "combine" or "join" anything except when absolutely necessary. Certainly don't "merge" or "pre-join".
Write your application to simply do lookups in the other collection.
Create a list of dictionaries and iterate over the list comparing the key each time,
Close. It looks like this. Note: No Sort.
import csv
with open('activations.csv', 'rb') as act_data:
    rdr = csv.DictReader(act_data)
    activations = dict((row['user'], row) for row in rdr)
with open('users.csv', 'rb') as user_data:
    rdr = csv.DictReader(user_data)
    with open('users_2.csv', 'wb') as updated_data:
        wtr = csv.DictWriter(updated_data, ['some', 'list', 'of', 'columns'])
        for user in rdr:
            user['some_field'] = activations[user['user_id_column']]['some_field']
            wtr.writerow(user)
This is fast and simple. Save the dictionaries (use shelve or pickle).
however, worst case scenario this could run up to len(inputDict)*len(outputDict) <- Not sure?
False.
One list is the "driving" list. The other is the lookup list. You'll drive by iterating through users and look up the appropriate values for each transaction. This is O(n) on the list of users. Each lookup is O(1) because dictionaries are hashes.
Sort the two data sets by transaction number. That way, you always only need to keep one row of each in memory.
This looks like a typical use for dictionaries with the transaction number as key. But you don't have to create the common structure; just build the lookup dictionaries and use them as needed.
I'd create a map myTransactionNumber -> {transaction: myTransactionNumber, activationNumber: myActivationNumber}, then iterate over the {user: myUser, transaction: myTransactionNumber} entries and look up the needed myTransactionNumber in the map. Each lookup is O(1) on average for a Python dict (O(log N) for a tree-based map, where N is the number of map entries), so the overall complexity is roughly O(M + N), where M is the number of user entries.
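A small sketch of that lookup-map idea, using made-up list names and the record shapes from the question:

# activation_records: [{"transaction": ..., "activationNumber": ...}, ...]
# user_records:       [{"user": ..., "transaction": ...}, ...]
activations = {rec["transaction"]: rec["activationNumber"]
               for rec in activation_records}

merged = [
    {"user": rec["user"], "activationNumber": activations[rec["transaction"]]}
    for rec in user_records
    if rec["transaction"] in activations
]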

What is the best way to store set data in Python?

I have a list of data in the following form:
[(id_1, description, id_type), (id_2, description, id_type), ..., (id_n, description, id_type)]
The data are loaded from files that belong to the same group. In each group there could be multiples of the same id, each coming from different files. I don't care about the duplicates, so I thought that a nice way to store all of this would be to throw it into a Set type. But there's a problem.
Sometimes for the same id the descriptions can vary slightly, as follows:
IPI00110753
Tubulin alpha-1A chain
Tubulin alpha-1 chain
Alpha-tubulin 1
Alpha-tubulin isotype M-alpha-1
(Note that this example is taken from the uniprot protein database.)
I don't care if the descriptions vary. I cannot throw them away because there is a chance that the protein database I am using will not contain a listing for a certain identifier. If this happens I will want to be able to display the human readable description to the biologists so they know roughly what protein they are looking at.
I am currently solving this problem by using a dictionary type. However I don't really like this solution because it uses a lot of memory (I have a lot of these ID's). This is only an intermediary listing of them. There is some additional processing the ID's go through before they are placed in the database so I would like to keep my data-structure smaller.
I have two questions really. First, will I get a smaller memory footprint using the Set type (over the dictionary type) for this, or should I use a sorted list where I check every time I insert into the list to see if the ID exists, or is there a third solution that I haven't thought of? Second, if the Set type is the better answer how do I key it to look at just the first element of the tuple instead of the whole thing?
Thank you for reading my question,
Tim
Update
Based on some of the comments I received, let me clarify a little. Most of what I do with this data structure is insert into it. I only read it twice: once to annotate it with additional information,* and once when it is inserted into the database. However, down the line there may be additional annotation done before I insert into the database. Unfortunately I don't know at this time whether that will happen.
Right now I am looking into storing this data in a structure that is not based on a hash-table (ie. a dictionary). I would like the new structure to be fairly quick on insertion, but reading it can be linear since I only really do it twice. I am trying to move away from the hash table to save space. Is there a better structure or is a hash-table about as good as it gets?
*The information is a list of Swiss-Prot protein identifiers that I get by querying uniprot.
Sets don't have keys. The element is the key.
If you think you want keys, you have a mapping. More-or-less by definition.
Sequential list lookup can be slow, even using a binary search. Mappings use hashes and are fast.
Are you talking about a dictionary like this?
{ 'id1': [ ('description1a', 'type1'), ('description1b', 'type1') ],
  'id2': [ ('description2', 'type2') ],
  ...
}
This sure seems minimal. ID's are only represented once.
Perhaps you have something like this?
{ 'id1': ( ('description1a', 'description1b'), 'type1' ),
  'id2': ( ('description2',), 'type2' ),
  ...
}
I'm not sure you can find anything more compact unless you resort to using the struct module.
I'm assuming the problem you're trying to solve by cutting down on memory use is the address space limit of your process. Additionally, you're looking for a data structure that allows fast insertion and reasonable sequential read-out.
Use as few structures other than strings (str) as possible
The question you ask is how to structure your data in one process so as to use less memory. The canonical answer to this (as long as you still need associative lookups) is to use as few structures other than Python strings (str, not unicode) as possible. A Python hash (dictionary) stores the references to your strings fairly efficiently (it is not a b-tree implementation).
However I think that you will not get very far with that approach, since what you face are huge datasets that might eventually just exceed the process address space and the physical memory of the machine you're working with altogether.
Alternative Solution
I would propose a different solution that does not involve changing your data structure into something that is harder to insert or interpret.
Split your information up across multiple processes, each holding whatever data structure is convenient for it.
Implement inter-process communication with sockets, so that processes may reside on other machines altogether.
Try to divide your data so as to minimize inter-process communication (I/O is glacially slow compared to CPU cycles).
The advantages of the approach I outline are:
You get to use two or more cores of a machine fully for performance
You are not limited by the address space of one process, or even by the physical memory of one machine
There are numerous packages and approaches to distributed processing, some of which are:
linda
processing
If you're doing an n-way merge with duplicate removal, the following may be what you're looking for.
This generator will merge any number of sources. Each source must be a sequence.
The key must be in position 0. It yields the merged sequence one item at a time.
def merge( *sources ):
    keyPos = 0
    for s in sources:
        s.sort()
    while any( [len(s) > 0 for s in sources] ):
        topEnum = enumerate([ s[0][keyPos] if len(s) > 0 else None for s in sources ])
        top = [ t for t in topEnum if t[1] is not None ]
        top.sort( key=lambda a: a[1] )
        src, key = top[0]
        #print src, key
        yield sources[src].pop(0)
This generator removes duplicates from a sequence.
def unique( sequence ):
    keyPos = 0
    seqIter = iter(sequence)
    curr = seqIter.next()
    for next in seqIter:
        if next[keyPos] == curr[keyPos]:
            # might want to create a sub-list of matches
            continue
        yield curr
        curr = next
    yield curr
Here's a script which uses these functions to produce a resulting sequence which is the union of all the sources with duplicates removed.
for u in unique( merge( source1, source2, source3, ... ) ):
    print u
The complete set of data in each sequence must exist in memory once because we're sorting in memory. However, the resulting sequence does not actually exist in memory. Indeed, it works by consuming the other sequences.
How about using {id: (description, id_type)} dictionary? Or {(id, id_type): description} dictionary if (id,id_type) is the key.
Sets in Python are implemented using hash tables. In earlier versions they were actually implemented using dicts, but that has changed AFAIK. The only thing you save by using a set would then be the size of a pointer for each entry (the pointer to the value).
To use only a part of a tuple for the hashcode, you'd have to subclass tuple and override the hashcode method:
class ProteinTuple(tuple):
    def __new__(cls, m1, m2, m3):
        return tuple.__new__(cls, (m1, m2, m3))

    def __hash__(self):
        return hash(self[0])
Keep in mind that you pay for the extra function call to __hash__ in this case, because otherwise it would be a C method.
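One caveat: a set deduplicates using both __hash__ and __eq__, so keying purely on the first element would also require overriding __eq__; a possible (illustrative) variant:

class ProteinTuple(tuple):
    def __new__(cls, m1, m2, m3):
        return tuple.__new__(cls, (m1, m2, m3))

    def __hash__(self):
        return hash(self[0])

    def __eq__(self, other):
        # compare on the id only, so tuples with the same id but
        # different descriptions collapse to one set entry
        return self[0] == other[0]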
I'd go for Constantin's suggestions and take out the id from the tuple and see how much that helps.
It's still murky, but it sounds like you have several lists of [(id, description, type), ...]
The id's are unique within a list and consistent between lists.
You want to create a UNION: a single list, where each id occurs once, with possibly multiple descriptions.
For some reason, you think a mapping might be too big. Do you have any evidence of this? Don't over-optimize without actual measurements.
This may be (if I'm guessing correctly) the standard "merge" operation from multiple sources.
source1.sort()
source2.sort()
result = []
while len(source1) > 0 or len(source2) > 0:
    if len(source1) == 0:
        result.append( source2.pop(0) )
    elif len(source2) == 0:
        result.append( source1.pop(0) )
    elif source1[0][0] < source2[0][0]:
        result.append( source1.pop(0) )
    elif source2[0][0] < source1[0][0]:
        result.append( source2.pop(0) )
    else:
        # keys are equal
        result.append( source1.pop(0) )
        # check source2, to see if the description is different.
This assembles a union of two lists by sorting and merging. No mapping, no hash.
