What would anyone consider the most efficient way to merge two datasets using Python?
A little background - this code will take 100K+ records in the following format:
{user: aUser, transaction: UsersTransactionNumber}, ...
and using the following data
{transaction: aTransactionNumber, activationNumber: associatedActivationNumber}, ...
to create
{user: aUser, activationNumber: associatedActivationNumber}, ...
N.B. These are not Python dictionaries, just the closest notation for portraying the record format cleanly.
So in theory, all I am trying to do is create a view of two lists (or tables), joining on a common key. At first this points me towards sets (unions, etc.), but before I start learning these in depth, are they the way to go? So far I feel this could be implemented as:
Create a list of dictionaries and iterate over the list comparing the key each time, however, worst case scenario this could run up to len(inputDict)*len(outputDict) <- Not sure?
Manipulate the data as an in-memory SQLite table? Preferably not: although there is no strict requirement for Python 2.4, staying compatible with it would make life easier.
Some kind of Set based magic?
Clarification
To summarise the whole purpose of this script: the actual data sets come from two different sources. The user and transaction numbers come in a CSV produced by a performance test that is testing email activation code throughput. The second data set comes from parsing the test mailboxes, which contain the transaction ID and activation code. The output of this test is then a CSV that will get pumped back into stage 2 of the performance test, activating user accounts using the activation codes that were paired up.
Apologies if my notation for the records was misleading, I have updated them accordingly.
Thanks for the replies. I am going to give two ideas a try:
1. Sorting the lists first (I don't know how expensive this is)
2. Creating a dictionary with the transaction codes as the key, then storing the user and activation code in a list as the value
Performance isn't paramount for me; I just want to try to get into good habits with my Python programming.
Here's a radical approach.
Don't.
You have two CSV files; one (users) is clearly the driver. Leave this alone.
The other -- transaction codes for a user -- can be turned into a simple dictionary.
Don't "combine" or "join" anything except when absolutely necessary. Certainly don't "merge" or "pre-join".
Write your application to simply do lookups in the other collection.
Create a list of dictionaries and iterate over the list comparing the key each time,
Close. It looks like this. Note: No Sort.
import csv
with open('activations.csv', 'rb') as act_data:
    rdr = csv.DictReader(act_data)
    activations = dict((row['user'], row) for row in rdr)
with open('users.csv', 'rb') as user_data:
    rdr = csv.DictReader(user_data)
    with open('users_2.csv', 'wb') as updated_data:
        wtr = csv.DictWriter(updated_data, ['some', 'list', 'of', 'columns'])
        for user in rdr:
            user['some_field'] = activations[user['user_id_column']]['some_field']
            wtr.writerow(user)
This is fast and simple. Save the dictionaries (use shelve or pickle).
however, worst case scenario this could run up to len(inputDict)*len(outputDict) <- Not sure?
False.
One list is the "driving" list. The other is the lookup list. You drive by iterating through the users and looking up the appropriate values for each transaction. This is O(n) on the list of users. The lookup is O(1) because dictionaries are hashes.
Sort the two data sets by transaction number. That way, you always only need to keep one row of each in memory.
This looks like a typical use for dictionaries with the transaction number as key. But you don't have to create the common structure; just build the lookup dictionaries and use them as needed.
I'd create a map myTransactionNumber -> {transaction: myTransactionNumber, activationNumber: myActivationNumber}, then iterate over the {user: myUser, transaction: myTransactionNumber} entries and look up the needed myTransactionNumber in the map. Since a Python dict lookup is O(1) on average, the overall complexity is O(M + N), where N is the number of transaction entries and M is the number of user entries.
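A minimal sketch of this idea, assuming the two inputs are lists of dicts shaped like the records in the question (the names activation_records and user_records are illustrative):

transactions = dict(
    (t['transaction'], t['activationNumber']) for t in activation_records
)

result = []
for u in user_records:
    # Skip users whose transaction has no matching activation record.
    if u['transaction'] in transactions:
        result.append({'user': u['user'],
                       'activationNumber': transactions[u['transaction']]})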
Related
I have a table in Hadoop which contains 7 billion strings which can themselves contain anything. I need to remove every name from the column containing the strings. An example string would be 'John went to the park' and I'd need to remove 'John' from that, ideally just replacing with '[name]'.
In the case of 'John and Mary went to market', the output would be '[NAME] and [NAME] went to market'.
To support this I have an ordered list of the most frequently occurring 20k names.
I have access to Hue (Hive, Impala) and Zeppelin (Spark, Python & libraries) to execute this.
I've tried this in the DB, but being unable to update columns or iterate over a variable made it a non-starter, so using Python and PySpark seems to be the best option, especially considering the number of calculations (20k names * 7 billion input strings).
# nameList contains ['John', 'Emma', etc.]
def removeNames(line, nameList):
    str_line = line[0]
    for name in nameList:
        rx = f"(^| |[[:^alpha:]])({name})( |$|[[:^alpha:]])"
        str_line = re.sub(rx, '[NAME]', str_line)
    str_line = [str_line]
    return tuple(str_line)
df = session.sql("select free_text from table")
rdd = df.rdd.map(lambda line: removeNames(line, nameList))
rdd.toDF().show()
The code is executing, but it's taking an hour and a half even if I limit the input text to 1000 lines (which is nothing for Spark), and the lines aren't actually being replaced in the final output.
What I'm wondering is: Why isn't map actually updating the lines of the RDD, and how could I make this more efficient so it executes in a reasonable amount of time?
This is my first time posting so if there's essential info missing, I'll fill in as much as I can.
Thank you!
In case you're still curious about this: by using a Python function as a UDF (your removeNames function), Spark has to serialize every row out of the JVM into the Python worker processes and back, which largely defeats the point of doing this operation natively in a distributed fashion. As suggested in the comments, if you go with the built-in regexp_replace() method, Spark can keep all of the data inside the JVM on the distributed nodes, keeping everything distributed and improving performance.
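A rough sketch of that approach, assuming a Spark session named session, a table named table with a free_text column, and a Python list nameList, all as in the question; with 20k names you may need to chain several smaller patterns instead of one huge alternation:

from pyspark.sql import functions as F

df = session.sql("select free_text from table")

# One alternation pattern for all names; \b keeps 'John' from matching
# inside 'Johnson'. regexp_replace runs inside the JVM on the executors.
pattern = r"\b(" + "|".join(nameList) + r")\b"

cleaned = df.withColumn("free_text",
                        F.regexp_replace(F.col("free_text"), pattern, "[NAME]"))
cleaned.show()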
In Python, I'm trying to merge multiple JSON files obtained from TinyDB.
I was not able to find a way to directly merge two TinyDB JSON files so that the auto-generated keys continue in sequence instead of restarting when the next file is opened.
In code terms, I want to merge a large amount of data like this:
hello1 = {"1": "bye", "2": "good", ..., "20000": "goodbye"}
hello2 = {"1": "dog", "2": "cat", ..., "15000": "monkey"}
as:
hello3 = {"1": "bye", "2": "good", ..., "20000": "goodbye", "20001": "dog", "20002": "cat", ..., "35000": "monkey"}
Because I couldn't find the correct way to do this with TinyDB, I simply treated them as classic JSON files, loading each file and then doing:
Data = Data['_default']
The problem I have is that the code works at the moment, but it has serious memory problems. After a few seconds, the merged DB contains about 28 MB of data, but then (probably) the cache saturates and it starts adding the remaining data really slowly.
So I need to empty the cache after a certain amount of data, or probably I need to change the way to do this!
This is the code that I use:
Try1.purge()
Try1 = TinyDB('FullDB.json')
with open('FirstDataBase.json') as Part1:
    Datapart1 = json.load(Part1)
    Datapart1 = Datapart1['_default']
    for dets in range(1, len(Datapart1)):
        Try1.insert(Datapart1[str(dets)])
with open('SecondDatabase.json') as Part2:
    Datapart2 = json.load(Part2)
    Datapart2 = Datapart2['_default']
    for dets in range(1, len(Datapart2)):
        Try1.insert(Datapart2[str(dets)])
Question: Merge Two TinyDB Databases ... probably I need to change the way to do this!
From TinyDB Documentation
Why Not Use TinyDB?
...
You are really concerned about performance and need a high speed database.
Single-row insertions into a DB are always slow; try db.insert_multiple(...).
The second form, with a generator, gives you the option of keeping the memory footprint down.
# From list
Try1.insert_multiple([{"1": "bye", "2": "good", ..., "20000": "goodbye"}])
or
# From generator function
Try1.insert_multiple(generator())
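A rough sketch of the generator route, assuming the same file names as in the question and that all records live in the '_default' table; each source file is still loaded one at a time, but TinyDB does a single bulk write instead of thousands of single inserts:

import json
from tinydb import TinyDB

def records(*paths):
    for path in paths:
        with open(path) as f:
            table = json.load(f)['_default']
        # Yield the records in their original numeric key order.
        for key in sorted(table, key=int):
            yield table[key]

merged = TinyDB('FullDB.json')
merged.insert_multiple(records('FirstDataBase.json', 'SecondDatabase.json'))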
For reasons concerning MemoryError, I am appending a series of dictionaries to a file like pickle.dump(mydict, open(filename, "a")). The entirety of the dictionary, as far as I can tell, can't be constructed in my laptop's memory. As a result I have identical keys in the same pickled file. The file is essentially a hash table of doublets and strings. The data looks like:
{
    'abc': [list of strings1],
    'efg': [list of strings2],
    'abc': [list of strings3]
}
Main Question: When I use pickle.load(open(filename, "r")) is there a way to join the duplicate dictionary keys?
Question 2: Does it matter that there are duplicates? Will calling the duplicate key give me all applicable results?
For example:
mydict = pickle.load(open(filename, "r"))
mydict['abc'] = <<sum of all lists with this key>>
One solution I've considered, but I'm not clear on from a Python-knowledge standpoint:
x = mydict['abc']
if type(x[0]) is list:
    reduce(lambda a, b: a.extend(b), x)
    <<do processing on list items>>
Edit 1: Here's the flow of data, roughly speaking.
Daily: update table_ownership with 100-500 new records. Each record contains 1 or 2 strings (names of people).
Create a new hash table of 3-letter groups, tied to the strings that contain the doublet. The key is the doublet; the value is a list of strings (actually a tuple containing the string and the primary key for the table_ownership record).
Hourly: update table_people with 10-40 new names to match.
Use the hash table to pull the most likely matches PRIOR to running fuzzy matching. We get the doublets from myString and collect candidates with potential_matches.append(hashTable[doublet]) for doublet in get_doublets(myString).
Sort by shared doublet count.
Apply fuzzy matching to top 5000 potential_matches, storing results of high quality in table_fuzzymatches
So this works very well, and it's 10-100 times faster than fuzzy matching straight away. With only 200k records I can build the hash table in memory and pickle.dump() it, but with the full 1.65M records I can't.
Edit 2: I'm looking into 2 things:
x64 Python
'collections.defaultdict'
I'll report back.
Answers:
32bit Python has a 2GB memory limit. x64 fixed my problem right away.
But what if I hadn't had 64bit Python available?
Chunk the input.
When I used a 10**5 chunk size and wrote to a dictionary piecemeal, it worked out.
For timing, my chunking process took 2000 seconds. 64bit Python sped it up to 380 seconds.
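For the "join the duplicate keys" part, here is a minimal sketch under one assumption: the file holds several dicts written by repeated pickle.dump calls, so it has to be read back with repeated pickle.load calls (a single in-memory dict can never actually hold duplicate keys). collections.defaultdict then does the joining:

import pickle
from collections import defaultdict

merged = defaultdict(list)
with open(filename, "rb") as f:       # filename as in the question
    while True:
        try:
            chunk = pickle.load(f)    # one appended dict per load
        except EOFError:
            break
        for key, strings in chunk.items():
            merged[key].extend(strings)   # duplicate keys are joined here

# merged['abc'] is now the concatenation of every list stored under 'abc'.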
I have a question as to how I can perform this task in Python.
I have an array of entries like:
[IPAddress, connections, policystatus, activity flag, longitude, latitude] (all as strings)
ex.
['172.1.21.26','54','1','2','31.15424','12.54464']
['172.1.21.27','12','2','4','31.15424','12.54464']
['172.1.27.34','40','1','1','-40.15474','-54.21454']
['172.1.2.45','32','1','1','-40.15474','-54.21454']
...
up to about 110,000 entries, with about 4,000 different combinations of longitude-latitude.
I want to compute the average connections, average policy status, and average activity flag for each location,
something like this:
[longitude,latitude,avgConn,avgPoli,avgActi]
['31.15424','12.54464','33','2','3']
['-40.15474','-54.21454','31','1','1']
...
so on
and I have about 195 files with ~110,000 entries each (sort of a big data problem).
My files are in .csv, but I'm treating them as .txt to work with them more easily in Python (not sure if this is the best idea).
I'm still new to Python, so I'm not really sure what the best approach is, but I sincerely appreciate any help or guidance on this problem.
Thanks in advance!
No; if you have the files as .csv, treating them as plain text does not make sense, since Python ships with the excellent csv module.
You could read the CSV rows into a dict to group them, but I'd suggest writing the data into a proper database and using SQL's AVG() and GROUP BY. Python ships with bindings for most databases. If you have none installed, consider using the sqlite3 module.
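A minimal sketch of the sqlite3 route, assuming the rows look exactly like the question's example with no header row (the table and file names here are made up):

import csv
import sqlite3

conn = sqlite3.connect(':memory:')   # use a file path for a persistent DB
conn.execute("""CREATE TABLE entries (
    ip TEXT, connections REAL, policystatus REAL,
    activity REAL, longitude TEXT, latitude TEXT)""")

for path in ['entries_001.csv']:     # loop over all ~195 files here
    with open(path) as f:
        conn.executemany("INSERT INTO entries VALUES (?,?,?,?,?,?)",
                         csv.reader(f))

query = """SELECT longitude, latitude,
                  AVG(connections), AVG(policystatus), AVG(activity)
           FROM entries
           GROUP BY longitude, latitude"""
for row in conn.execute(query):
    print(row)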
I'll only give you the algorithm, you would learn more by writing the actual code yourself.
Use a dictionary, with a (longitude, latitude) pair as the key and a list of the form [ConnectionSum, policystatusSum, ActivityFlagSum] (plus a count) as the value.
1. Loop over the entries once.
   a. For each entry, if the location already exists, add the connection, policy status, and activity values to the existing sums and increment that location's count.
   b. If the location does not exist, first assign [0, 0, 0] (and a count of 0) as the value, then do (a).
2. Do step 1 for all files.
3. After all the entries have been scanned, loop over the dictionary and divide each element of [ConnectionSum, policystatusSum, ActivityFlagSum] by that location's count to get the average values.
As long as duplicate locations are restricted to the same file (or are even close to each other within a file), all you need is the stream-processing paradigm. For example, if you know that duplicate locations only appear within a single file, read each file, calculate the averages, then close the file. As long as you let the old data go out of scope, the garbage collector will get rid of it for you. Basically, do this:
def processFile(pathToFile):
    ...

totalResults = ...
for path in filePaths:
    partialResults = processFile(path)
    totalResults = combine...partialResults...with...totalResults
An even more elegant solution is to calculate the averages "online", in constant memory per location. If, for example, you are averaging 5, 6, 7, you would do 5/1 = 5.0, (5.0*1+6)/2 = 5.5, (5.5*2+7)/3 = 6. At each step, you only keep track of the current average and the number of elements. This solution yields the minimal amount of memory used (no more than the size of your final result!) and doesn't care about the order in which you visit elements. It would go something like this. See http://docs.python.org/library/csv.html for the functions you'll need from the csv module.
import csv

def allTheRecords():
    for path in filePaths:
        for row in csv.somehow_get_rows(path):
            yield SomeStructure(row)

averages = {}  # dict: keys are tuples (lat, long); values are an arbitrary
               # data structure, e.g. a dict representing {avgConn, avgPoli, avgActi, num}
for record in allTheRecords():
    position = (record.lat, record.long)
    currentAverage = averages.get(position, {'avgConn': 0, 'avgPoli': 0, 'avgActi': 0, 'num': 0})
    newAverage = {apply the math I mentioned above}
    averages[position] = newAverage
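Here is one concrete way the pseudocode above could be filled in, as a minimal sketch: it assumes each CSV row has exactly the six columns from the question, in that order, with no header, and that filePaths is the list of files to process.

import csv

averages = {}  # (longitude, latitude) -> [avgConn, avgPoli, avgActi, count]
for path in filePaths:
    with open(path) as f:
        for ip, conn, poli, acti, lon, lat in csv.reader(f):
            entry = averages.setdefault((lon, lat), [0.0, 0.0, 0.0, 0])
            old_n = entry[3]
            # Running average: new_avg = (old_avg * old_n + value) / (old_n + 1)
            for i, value in enumerate((float(conn), float(poli), float(acti))):
                entry[i] = (entry[i] * old_n + value) / (old_n + 1)
            entry[3] = old_n + 1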
(Do note that the notion of an "average at a location" is not well-defined. Well, it is well-defined, but not very useful: if you knew the exact location of every IP event to infinite precision, the average of everything would just be itself. The only reason you can compress your dataset is that your latitude and longitude have finite precision. If you run into this issue when you acquire more precise data, you can choose to round to an appropriate precision; it may be reasonable to round to within 10 meters or so (see latitude and longitude). This requires just a little bit of math/geometry.)
I have a list of data in the following form:
[(id_1, description, id_type), (id_2, description, id_type), ..., (id_n, description, id_type)]
The data are loaded from files that belong to the same group. In each group there could be multiples of the same id, each coming from different files. I don't care about the duplicates, so I thought that a nice way to store all of this would be to throw it into a Set type. But there's a problem.
Sometimes for the same id the descriptions can vary slightly, as follows:
IPI00110753
Tubulin alpha-1A chain
Tubulin alpha-1 chain
Alpha-tubulin 1
Alpha-tubulin isotype M-alpha-1
(Note that this example is taken from the uniprot protein database.)
I don't care if the descriptions vary. I cannot throw them away because there is a chance that the protein database I am using will not contain a listing for a certain identifier. If this happens I will want to be able to display the human readable description to the biologists so they know roughly what protein they are looking at.
I am currently solving this problem by using a dictionary type. However, I don't really like this solution because it uses a lot of memory (I have a lot of these IDs). This is only an intermediary listing of them; there is some additional processing the IDs go through before they are placed in the database, so I would like to keep my data structure smaller.
I have two questions really. First, will I get a smaller memory footprint using the Set type (over the dictionary type) for this, or should I use a sorted list where I check every time I insert into the list to see if the ID exists, or is there a third solution that I haven't thought of? Second, if the Set type is the better answer, how do I key it to look at just the first element of the tuple instead of the whole thing?
Thank you for reading my question,
Tim
Update
Based on some of the comments I received, let me clarify a little. Most of what I do with the data structure is insert into it. I only read it twice: once to annotate it with additional information,* and once when it is inserted into the database. However, down the line there may be additional annotation done before I insert into the database. Unfortunately, I don't know at this time whether that will happen.
Right now I am looking into storing this data in a structure that is not based on a hash table (i.e. a dictionary). I would like the new structure to be fairly quick on insertion, but reading it can be linear since I only really do it twice. I am trying to move away from the hash table to save space. Is there a better structure, or is a hash table about as good as it gets?
*The information is a list of Swiss-Prot protein identifiers that I get by querying uniprot.
Sets don't have keys. The element is the key.
If you think you want keys, you have a mapping. More-or-less by definition.
Sequential list lookup can be slow, even using a binary search. Mappings use hashes and are fast.
Are you talking about a dictionary like this?
{ 'id1': [ ('description1a', 'type1'), ('description1b', 'type1') ],
  'id2': [ ('description2', 'type2') ],
  ...
}
This sure seems minimal. ID's are only represented once.
Perhaps you have something like this?
{ 'id1': ( ('description1a', 'description1b'), 'type1' ),
  'id2': ( ('description2',), 'type2' ),
  ...
}
I'm not sure you can find anything more compact unless you resort to using the struct module.
I'm assuming the problem you're trying to solve by cutting down on memory is the address space limit of your process. Additionally, you're looking for a data structure that allows fast insertion and reasonable sequential read-out.
Use as few structures other than strings (str) as possible
The question you ask is how to structure your data in one process to use less memory. The canonical answer to this (as long as you still need associative lookups) is to use as few structures other than Python strings (str, not unicode) as possible. A Python hash (dictionary) stores the references to your strings fairly efficiently (it is not a b-tree implementation).
However I think that you will not get very far with that approach, since what you face are huge datasets that might eventually just exceed the process address space and the physical memory of the machine you're working with altogether.
Alternative Solution
I would propose a different solution that does not involve changing your data structure to something that is harder to insert or interpret.
Split your information up across multiple processes, each holding whatever data structure is convenient for it.
Implement inter-process communication with sockets, so that processes can reside on other machines altogether.
Try to divide your data so as to minimize inter-process communication (I/O is glacially slow compared to CPU cycles).
The advantages of the approach I outline are:
You get to make full use of two or more cores on a machine for performance
You are not limited by the address space of one process, or even the physical memory of one machine
There are numerous packages and approaches to distributed processing, some of which are
linda
processing
If you're doing an n-way merge with removing duplicates, the following may be what you're looking for.
This generator will merge any number of sources. Each source must be a sequence.
The key must be in position 0. It yields the merged sequence one item at a time.
def merge( *sources ):
    keyPos = 0
    for s in sources:
        s.sort()
    while any( [len(s) > 0 for s in sources] ):
        topEnum = enumerate( [s[0][keyPos] if len(s) > 0 else None for s in sources] )
        top = [ t for t in topEnum if t[1] is not None ]
        top.sort( key=lambda a: a[1] )
        src, key = top[0]
        # print src, key
        yield sources[src].pop(0)
This generator removes duplicates from a sequence.
def unique( sequence ):
    keyPos = 0
    seqIter = iter(sequence)
    curr = seqIter.next()
    for next in seqIter:
        if next[keyPos] == curr[keyPos]:
            # might want to create a sub-list of matches
            continue
        yield curr
        curr = next
    yield curr
Here's a script which uses these functions to produce a resulting sequence which is the union of all the sources with duplicates removed.
for u in unique( merge( source1, source2, source3, ... ) ):
    print u
The complete set of data in each sequence must exist in memory once because we're sorting in memory. However, the resulting sequence does not actually exist in memory. Indeed, it works by consuming the other sequences.
How about using {id: (description, id_type)} dictionary? Or {(id, id_type): description} dictionary if (id,id_type) is the key.
Sets in Python are implemented using hash tables. In earlier versions, they were actually implemented using dictionaries. The only thing you save by using a set would then be the size of a pointer for each entry (the pointer to the value).
To use only a part of a tuple for the hashcode, you'd have to subclass tuple and override the hashcode method:
class ProteinTuple(tuple):
    def __new__(cls, m1, m2, m3):
        return tuple.__new__(cls, (m1, m2, m3))

    def __hash__(self):
        return hash(self[0])
Keep in mind that you pay for the extra function call to __hash__ in this case, because otherwise it would be a C method.
I'd go for Constantin's suggestions and take out the id from the tuple and see how much that helps.
It's still murky, but it sounds like you have several lists of the form [(id, description, type), ...].
The id's are unique within a list and consistent between lists.
You want to create a UNION: a single list, where each id occurs once, with possibly multiple descriptions.
For some reason, you think a mapping might be too big. Do you have any evidence of this? Don't over-optimize without actual measurements.
This may be (if I'm guessing correctly) the standard "merge" operation from multiple sources.
source1.sort()
source2.sort()
result = []
while len(source1) > 0 or len(source2) > 0:
    if len(source1) == 0:
        result.append( source2.pop(0) )
    elif len(source2) == 0:
        result.append( source1.pop(0) )
    elif source1[0][0] < source2[0][0]:
        result.append( source1.pop(0) )
    elif source2[0][0] < source1[0][0]:
        result.append( source2.pop(0) )
    else:
        # keys are equal
        result.append( source1.pop(0) )
        # check source2's entry to see if the description is different,
        # then discard it so the duplicate key is not appended later
        source2.pop(0)
This assembles a union of two lists by sorting and merging. No mapping, no hash.