I am sorry for posting a lot of background in order to illustrate my problem, but here goes: I have created a script in Python 2.7 to compare two large files and output any differences. The script currently takes around nineteen hours to complete, and I would like to potentially reduce this via multi-core processing.
The files are structured like this:
r1 count:3 contained:True - rs692242298 40 ACGCTTTCCGGCCG IIIIIIIIIIIIII 2
r1 count:3 contained:True - rs344292768 73 ACGCTTTCCGGCCG IIIIIIIIIIIIII 2
r1 count:3 contained:True - rs326313795 23 ACGCTTTCCGGCCG IIIIIIIIIIIIII 2
r10 count:592 contained:True + rs690696575 4 CGGCCGGAAAGCGC IIIIIIIIIIIIII 3
r10 count:592 contained:True + rs333942854 30 CGGCCGGAAAGCGC IIIIIIIIIIIIII 3
r10 count:592 contained:True + rs323000429 65 CGGCCGGAAAGCGC IIIIIIIIIIIIII 3
r10 count:592 contained:True + rs341309868 76 CGGCCGGAAAGCGC IIIIIIIIIIIIII 3
r11 count:1 contained:False + rs346130515 43 CTCCGTCCGGCG IIIIIIIIIIII 10
r11 count:1 contained:False + rs336124149 75 CTCCGTCCGGCG IIIIIIIIIIII 10
... and when I say they are large, I mean VERY large. Each file takes up around 30 GB gzipped and around 3 TB unzipped (so I never unzip!). For anyone interested, this is output from a genomic alignment program called bowtie. The comparison is line by line: most lines will have an exact match in the other file, but I need to find and output the lines that are unique (no match in the other file, in both directions).
The files are structured such that all lines with the same id (the leftmost r#) are grouped together. The id numbers (r1, r10, r11 in the example) generally increase, but because bowtie itself runs multi-core, the pattern is not steady (hence the gap between r1 and r10). This also means that the relative position of the block of lines for a particular id differs between the two files. The number of lines for a particular id is frequently several million.
I use a generator to iteratively return a list of all the lines for an id. As soon as I have the same id from both files, I compare the two lists for that id by converting each to a frozenset and taking the differences:
lines_file1 = frozenset(file_1_lines)
lines_file2 = frozenset(file_2_lines)
unique_lines = list(lines_file1.difference(lines_file2)) + list(lines_file2.difference(lines_file1))
I then write the unique lines to output and delete the original lists to save memory.
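For reference, a simplified sketch of the generator (the real one has more bookkeeping, but the idea is the same):

import gzip
from itertools import groupby

def lines_by_id(path):
    # yield (id, list_of_lines) for each block of lines sharing the same leftmost id
    with gzip.open(path) as handle:
        for read_id, block in groupby(handle, key=lambda line: line.split(None, 1)[0]):
            yield read_id, list(block)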
When timing the events, most of the time is spent "reading" the files (approximately 10 hours). As this is primarily CPU-bound (due to the decompression), I think I could potentially cut it in half by splitting the reading into two processes. If I could then parallelize the other tasks as well, I could potentially cut the overall time from 19 to 5 hours.
Use of the word "potentially" here reflects that I may not fully understand the strengths and weaknesses of Python multiprocessing and if what I hope to achieve is even feasible.
From the attempts I have made so far, one of the biggest barriers (if I have understood this correctly) is that everything passed between processes needs to be pickled, which simply is not viable with such large amounts of data. If that is correct, I would say that the usefulness of Python multiprocessing for analysis of genetic data is next to nothing.
So, to formulate a concrete question: would it be possible (perhaps through some kind of mapping to avoid pickling everything) to create one or more producer processes that read both input files in parallel and output 2-tuples, where each tuple contains the two lists of lines for one id?
Could I then make a consumer that accesses the queue of such tuples, compares each pair, and writes the unique lines to the SAME output file?
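To make this concrete, here is a rough sketch of the layout I have in mind (the file names, the queue size and the sentinel handling are placeholders, and every block would still be pickled through the queue, which is exactly the cost I am worried about):

from multiprocessing import Process, Queue

def producer(tag, path, queue):
    # read one file and put (tag, id, list_of_lines) blocks on the shared queue
    for read_id, block in lines_by_id(path):      # the generator sketched above
        queue.put((tag, read_id, block))
    queue.put((tag, None, None))                  # sentinel: this file is finished

def consumer(queue, out_path):
    # pair up blocks with the same id from both files and write the unique lines
    pending = {1: {}, 2: {}}
    finished = set()
    with open(out_path, 'w') as out:
        while len(finished) < 2:
            tag, read_id, block = queue.get()
            if read_id is None:
                finished.add(tag)
                continue
            other = pending[2 if tag == 1 else 1]
            if read_id in other:
                unique = frozenset(block) ^ frozenset(other.pop(read_id))
                out.writelines(unique)            # symmetric difference, both ways
            else:
                pending[tag][read_id] = block
        # ids that never showed up in the other file are unique in their entirety
        for blocks in pending.values():
            for block in blocks.values():
                out.writelines(block)

if __name__ == '__main__':
    queue = Queue(maxsize=4)                      # small buffer keeps memory bounded
    readers = [Process(target=producer, args=(1, 'file1.gz', queue)),
               Process(target=producer, args=(2, 'file2.gz', queue))]
    for r in readers:
        r.start()
    consumer(queue, 'unique_lines.txt')
    for r in readers:
        r.join()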
If the answer to the above is yes/maybe, I would be grateful if you could point me in the right direction. I have to admit that right now my take on the multiprocessing "powers" of Python is that they are useless for something like this and that my time is better spent elsewhere. But I kind of wish that is merely due to my own ignorance. Thank you.
Related
I am processing some output results from a simulation that are reported into ASCII files. The output file is written in a verbose form where each time step of the simulation reports the same tables with updated values.
I use python to process the tables into a pandas dataframe which I use to plot the simulation output variables.
My code is split into 2 parts:
First I make a quick pass over the file to split it into a number of sections equal to the number of time steps. I do this to extract the time steps and mark the portion of the file where each time step is reported, so I can browse the data easily just by calling each time step. I also extract a list of the time steps, because I do not need to plot all of them; I use this list to filter out the time steps I would like to process.
The second part actually puts the data into the dataframe. Using the time steps from the previous list, I call each section of the file and process the tables into a common dataframe. It is here that I notice my code drags. It is strange, because each time step's data section is the same size (same tables, same number of characters). Nevertheless, processing each step gets progressively slower: the first step's tables are read in 1.79 seconds, the second in 2.29, and by the 20th step it already takes 22 seconds. If I need to read 100 steps or more, this becomes really unmanageable.
I could paste some of my code here but it may be unreadable, so I have tried to explain it as best I can. I will try to reproduce it in a simple example:
input_file="Simulation\nStep1:1seconds\n0 0 0 0\nStep2:2seconds\n1 0 1 0\nStep3:3seconds\n3 1 2 0\nStep4:4seconds\n4 5 8 2\n"
From the first part of my code I convert this string into a list where each element is the data for one step, and I get a list of the steps:
data=["0 0 0 0","1 0 1 0","3 1 2 0","4 5 8 2"]
steps=[0,1,2,3]
If I want to use only Steps 1 and 3, I filter them:
filtered_steps=[0,2]
Now I use this short list to call only the first (0) and third (2) elements of the data list, process each string, and put them into a data frame.
On a trivial example like this one it takes no time, but when instead of 4 steps I need to process tens to hundreds, and when each time step has multiple lines of data rather than a single one, time becomes an issue. I would like to at least understand why it is getting progressively slower to read something that in the previous iteration had the same size.
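To make the toy example runnable, here is a rough sketch of the two parts (the real parser handles full tables per step, so this is only illustrative):

import pandas as pd

input_file = "Simulation\nStep1:1seconds\n0 0 0 0\nStep2:2seconds\n1 0 1 0\nStep3:3seconds\n3 1 2 0\nStep4:4seconds\n4 5 8 2\n"

# Part 1: split the text into per-step data blocks and a list of step indexes.
lines = input_file.strip().split('\n')[1:]    # drop the "Simulation" header
data = lines[1::2]                            # ["0 0 0 0", "1 0 1 0", ...]
steps = list(range(len(data)))                # [0, 1, 2, 3]

# Part 2: process only the filtered steps into a single dataframe.
filtered_steps = [0, 2]
rows = [[int(value) for value in data[i].split()] for i in filtered_steps]
df = pd.DataFrame(rows, index=filtered_steps)
print(df)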
I am doing some sort of pattern matching on a dataset (English sentences) of 40 MB. To improve the speed I used multiprocessing. I created a Pool of 4 processes. Initially I kept the content of the file in a variable, then split it to 4 equal parts and sent it to the function.
This is how I am doing:
from multiprocessing import Pool

nl_split = content.split('\n')    # the whole data is kept in content
lenNl = int(len(nl_split) / 4)
part1 = '\n'.join(nl_split[0:lenNl])
part2 = '\n'.join(nl_split[lenNl + 1:2 * lenNl])
part3 = '\n'.join(nl_split[2 * lenNl + 1:3 * lenNl])
part4 = '\n'.join(nl_split[3 * lenNl + 1:4 * lenNl])
pr = Pool(4)
rValue = pr.map(match_some_pattern, [part1, part2, part3, part4])
Taking 90.148968935 Sec.
As a second approach, I divided the data into 4 files and passed the file names to the function.
pr = Pool(4)
rValue=pr.map(match_some_pattern,['part1.txt','part2.txt','part3.txt','part4.txt'])
Taking 48.5109400749 Sec.
When I compared their execution times, I found the second approach far better than the first. I had expected the first approach to be better, since it involves fewer file operations, but the result was the opposite.
Why does the second approach take less time than the first one?
I have an input tab-separated text file:
0 .4
1 .9
2 .2
3 .12
4 .55
5 .98
I analyze it in plain Python as:
lines = open("songs.tsv").readlines()

def extract_hotness(line):
    return float(line.split()[1])

songs_hotness = map(extract_hotness, lines)
max_hotness = max(songs_hotness)
How do I perform the same operation in parallel using mpi4py?
I started implementing this with scatter, but that won't work straight away, because scatter needs the list to have as many elements as there are processes.
Processing a text file in parallel is difficult. Where do you split the file? Are you even reading from a parallel file system? You might consider MPI-IO if you have a large enough input file. If you go that route, these answers, provided in a C context, describe the challenges that still hold in mpi4py: https://stackoverflow.com/a/31726730/1024740 and https://stackoverflow.com/a/12942718/1024740
Another approach is not to scatter the data but to read it all on rank 0 and broadcast it to everyone else. This requires enough memory to stage all the input data at once, or a master-worker scheme where only some of the data is read in each shot.
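A minimal sketch of the broadcast approach (it assumes the whole file fits in memory on rank 0; run with something like mpiexec -n 4 python songs_max.py):

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Rank 0 reads the whole file, then broadcasts the lines to every rank.
if rank == 0:
    lines = open("songs.tsv").readlines()
else:
    lines = None
lines = comm.bcast(lines, root=0)

# Each rank processes a strided slice of the lines.
my_lines = lines[rank::size]
local_max = max(float(line.split()[1]) for line in my_lines) if my_lines else float("-inf")

# Combine the per-rank maxima on rank 0.
max_hotness = comm.reduce(local_max, op=MPI.MAX, root=0)
if rank == 0:
    print(max_hotness)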
Okay, deep breath, this may be a bit verbose, but better to err on the side of detail than lack thereof...
So, in one sentence, my goal is to find the intersection of about 22 files of ~300-400 MB each, based on 3 of their 139 attributes.
Now a bit more background. The files range from ~300-400 MB, with 139 columns and typically 400,000-600,000 rows. I have three particular fields I want to join on: a unique ID, and latitude/longitude (with a bit of tolerance if possible). The goal is to determine which of these records existed across certain ranges of files. In the worst case, that will mean performing a 22-file intersection.
So far, the following approaches have failed:
I tried using MySQL to perform the join. This was back when I was only looking at 7 years. Attempting the join on 7 years (using INNER JOIN about 7 times, e.g. t1 INNER JOIN t2 ON condition INNER JOIN t3 ON condition ... etc.), I let it run for about 48 hours before the timeout ended it. Was it likely still running, or does that seem overly long? Despite all the suggestions I found to enable better multithreading and more RAM usage, I couldn't get the CPU usage above 25%. If this is a good approach to pursue, any tips would be greatly appreciated.
I tried using ArcMap. I converted the CSVs to tables and imported them into a file geodatabase. I ran the intersection tool on two files, which took about 4 days, and the number of records returned was more than twice the number of input features combined: each file had about 600,000 records and the intersection returned over 2,000,000 results. In other cases, not all records were recognized by ArcMap: it says there are 5,000 records when in reality there are 400,000+.
I tried combining them in Python. Firstly, I can immediately tell RAM is going to be an issue: each file takes up roughly 2 GB of RAM in Python when fully loaded. I do this with:
import csv

# uniqueIDIndex, latIndex and longIndex are the column positions of the join fields
f1 = [row for row in csv.reader(open('file1.csv', 'rU'))]
f2 = [row for row in csv.reader(open('file2.csv', 'rU'))]
joinOut = csv.writer(open('Intersect.csv', 'wb'))

uniqueIDs = set([row[uniqueIDIndex] for row in f1] +
                [row[uniqueIDIndex] for row in f2])

for uniqueID in uniqueIDs:
    f1rows = [row for row in f1 if row[uniqueIDIndex] == uniqueID]
    f2rows = [row for row in f2 if row[uniqueIDIndex] == uniqueID]
    if len(f1rows) == 0 or len(f2rows) == 0:
        continue  # not an intersect
    # Strings, split at the decimal point; if the integer part and first 3 places
    # after the decimal are equal, they are spatially close enough
    f1lat = f1rows[0][latIndex].split('.')
    f1long = f1rows[0][longIndex].split('.')
    f2lat = f2rows[0][latIndex].split('.')
    f2long = f2rows[0][longIndex].split('.')
    if f1lat[0] + f1lat[1][:3] == f2lat[0] + f2lat[1][:3] and f1long[0] + f1long[1][:3] == f2long[0] + f2long[1][:3]:
        joinOut.writerows([f1rows[0], f2rows[0]])
Obviously, this approach requires that the files being intersected fit in memory. I only have 16 GB of RAM available, and 22 files would need ~44 GB. I could change it so that instead, as each uniqueID is iterated, each file is opened and parsed for the rows with that uniqueID. That reduces the memory footprint to almost nothing, but with hundreds of thousands of unique IDs it could take an unreasonable amount of time to execute.
So, here I am, asking for suggestions on how best to handle this data. I have an i7-3770K at 4.4 GHz, 16 GB RAM, and a Vertex 4 SSD rated at 560 MB/s read speed. Is this machine even capable of handling this amount of data?
Another avenue I've thought about exploring is an Amazon EC2 cluster and Hadoop. Would that be a better idea to investigate?
Suggestion: pre-process all the files to extract the 3 attributes you're interested in first. You can always keep track of the file/row number as well, so you can reference all the original attributes later if you want.
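A rough sketch of that pre-processing step (the column indexes and the file naming pattern are placeholders):

import csv
import glob

UNIQUE_ID_IDX, LAT_IDX, LONG_IDX = 0, 5, 6    # hypothetical column positions

for path in glob.glob('year_*.csv'):          # hypothetical file naming
    with open(path, 'rU') as src, open(path + '.slim.csv', 'wb') as dst:
        writer = csv.writer(dst)
        for row_number, row in enumerate(csv.reader(src)):
            # keep only the three join attributes plus a pointer back to the source row
            writer.writerow([row[UNIQUE_ID_IDX], row[LAT_IDX], row[LONG_IDX],
                             path, row_number])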
I have a simple text file containing two columns, both integers:
1 5
1 12
2 5
2 341
2 12
and so on..
I need to group the dataset by the second value, such that the output will be:
5 1 2
12 1 2
341 2
Now the problem is that the file is very big, around 34 GB in size. I tried writing a Python script to group the pairs into a dictionary whose values are arrays of integers, but it still takes way too long. (I guess a lot of the time goes into allocating the array('i') objects and extending them on append.)
I am now planning to write a Pig script to run on a pseudo-distributed Hadoop machine (an Amazon EC2 High-Memory Large instance).
data = LOAD 'Net.txt';
gdata = GROUP data BY $1; -- I know it will lead to 5 (1,5) (2,5) but that's okay for this snippet
STORE gdata INTO 'res.txt';
I wanted to know if there was any simpler way of doing this.
Update:
Keeping such a big file in memory is out of the question. For the Python solution, what I planned was to do 4 runs: in the first run only second-column values from 1 to 10 million are considered, in the next run 10 million to 20 million, and so on. But this turned out to be really slow.
The Pig/Hadoop solution is interesting because it keeps everything on disk (well, most of it).
For better understanding: this dataset contains information about the connectivity of ~45 million Twitter users, and the format of the file means that the user id given by the second number is following the first one.
The solution I used:
import array

class AdjDict(dict):
    """
    A special dictionary class to hold an adjacency list
    """
    def __missing__(self, key):
        """
        Missing is changed such that when a key is not found an integer array is initialized
        """
        self.__setitem__(key, array.array('i'))
        return self[key]

Adj = AdjDict()

for line in file("net.txt"):
    entry = line.strip().split('\t')
    node = int(entry[1])
    follower = int(entry[0])
    if node < 10 ** 6:
        Adj[node].append(follower)

# Code for writing the Adj matrix to the file:
Assuming you have ~17 characters per line (a number I picked randomly to make the math easier), you have about 2 billion records in this file. Unless you are running with much physical memory on a 64-bit system, you will thrash your pagefile to death trying to hold all this in memory in a single dict. And that's just to read it in as a data structure - one presumes that after this structure is built, you plan to actually do something with it.
With such a simple data format, I should think you'd be better off doing something in C instead of Python. Cracking this data shouldn't be difficult, and you'll have much less per-value overhead. At minimum, just holding 2 billion 4-byte integers would take 8 GB (unless you can make some simplifying assumptions about the possible range of the values you currently list as 1 and 2; if they fit within a byte or a short, you can use smaller int variables, which will be worth the trouble for a data set of this size).
If I had to solve this on my current hardware, I'd probably write a few small programs:
The first would work on 500-megabyte chunks of the file, swapping columns and writing the result to new files. (You'll get 70 or more.) (This won't take much memory.)
Then I'd call the OS-supplied sort(1) on each small file. (This might take a few gigs of memory.)
Then I'd write a merge-sort program that would merge together the lines from all 70-odd sub-files. (This won't take much memory.)
Then I'd write a program that would run through the large sorted list; you'll have a bunch of lines like:
5 1
5 2
12 1
12 2
and you'll need to return:
5 1 2
12 1 2
(This won't take much memory.)
By breaking it into smaller chunks, hopefully you can keep the RSS down to something that would fit a reasonable machine -- it will take more disk I/O, but on anything but astonishing hardware, swap use would kill attempts to handle this in one big program.
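For that last program, here is a minimal sketch in Python (I suggested C above, but the logic is the same), assuming the merged input is already sorted by its first column:

import sys
from itertools import groupby

# Read "key value" lines that are already sorted by key and collapse each run
# of equal keys into a single "key v1 v2 ..." line.
pairs = (line.split() for line in sys.stdin)
for key, run in groupby(pairs, key=lambda pair: pair[0]):
    sys.stdout.write(key + ' ' + ' '.join(value for _, value in run) + '\n')

# hypothetical usage: sort -k1,1n merged.txt | python collapse.py > grouped.txt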
Maybe you can do a multi-pass through the file.
Process a range of keys on each pass through the file. For example, if you picked a range size of 100:
1st pass - work out all the keys from 0-99
2nd pass - work out all the keys from 100-199
3rd pass - work out all the keys from 200-299
4th pass - work out all the keys from 300-399
..and so on.
For your sample, the 1st pass would output:
5 1 2
12 1 2
and the 4th pass would output
341 2
Choose the range size so that the dict you are creating fits into your RAM.
I wouldn't bother using multiprocessing to try to speed it up with multiple cores: unless you have a very fast hard drive, this should be IO-bound, and you would just end up thrashing the disk.
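A sketch of what one pass might look like (names are placeholders; it assumes whitespace-separated pairs as in your sample):

from collections import defaultdict

def one_pass(path, lo, hi, out):
    """Group the first-column values by the second-column keys in [lo, hi)."""
    groups = defaultdict(list)
    with open(path) as pairs:
        for line in pairs:
            first, second = line.split()
            key = int(second)
            if lo <= key < hi:
                groups[key].append(first)
    for key in sorted(groups):
        out.write('%d %s\n' % (key, ' '.join(groups[key])))

# e.g. a range size of 100, covering keys 0-399 in four passes:
with open('grouped.txt', 'w') as out:
    for lo in range(0, 400, 100):
        one_pass('Net.txt', lo, lo + 100, out)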
If you are working with a 34 GB file, I'm assuming that the hard drive, both in terms of storage and access time, is not a problem. How about reading the pairs sequentially, and when you find pair (x,y), opening file "x", appending " y" and closing file "x"? In the end you will have one file per Twitter user id, each containing all the users that one is connected to. You can then concatenate all those files if you want the result in the output format you specified.
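A naive sketch of that idea (it opens and closes a file for every input line, so it is only meant to show the layout; here I key the files on the second column to match the grouping you asked for):

import os

os.mkdir('by_id')                              # hypothetical output directory
with open('net.txt') as pairs:
    for line in pairs:
        first, second = line.split()
        # one file per grouping key, accumulating all first-column values
        with open(os.path.join('by_id', second), 'a') as out:
            out.write(' ' + first)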
THAT SAID HOWEVER, I really do think that:
(a) for such a large data set, exact resolution is not appropriate and that
(b) there is probably some better way to measure connectivity, so perhaps you'd like to tell us about your end goal.
Indeed, you have a very large graph, and a lot of efficient techniques have been devised to study the shape and properties of huge graphs; most of these techniques are built to work as streaming, online algorithms.
For instance, a technique called triangle counting, coupled with probabilistic cardinality estimation algorithms, efficiently and speedily provides information on the cliques contained in your graph. For a better idea of the triangle counting aspect and how it is relevant to graphs, see for example this (randomly chosen) article.
I had a similar requirement; you just need one more Pig statement to remove the redundancy in 5 (1,5) (2,5).
a = LOAD 'edgelist' USING PigStorage('\t') AS (user:int,following:int);
b = GROUP a BY user;
x = FOREACH b GENERATE group.user, a.following;
store x INTO 'following-list';