multiprocessing speed difference - python

I am doing pattern matching on a dataset (English sentences) of 40 MB. To improve the speed I used multiprocessing and created a Pool of 4 processes. Initially I kept the content of the file in a variable, split it into 4 roughly equal parts, and sent them to the function.
This is how I am doing it:
nl_split = content.split('\n')  # the whole data is kept in content
lenNl = len(nl_split) // 4
part1 = '\n'.join(nl_split[0:lenNl])
part2 = '\n'.join(nl_split[lenNl:2*lenNl])
part3 = '\n'.join(nl_split[2*lenNl:3*lenNl])
part4 = '\n'.join(nl_split[3*lenNl:])
pr = Pool(4)
rValue = pr.map(match_some_pattern, [part1, part2, part3, part4])
This takes 90.148968935 seconds.
As a second approach, I divided the file into 4 separate files and passed the file names to the function:
pr = Pool(4)
rValue = pr.map(match_some_pattern, ['part1.txt', 'part2.txt', 'part3.txt', 'part4.txt'])
This takes 48.5109400749 seconds.
When I compared their execution times, I found that the second approach is far better than the first one. I expected the first approach to be faster, since it involves fewer file operations, but the result was the opposite.
Why does the second approach take less time than the first one?
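For reference, a rough sketch of what the second approach's worker might look like (an assumption: match_some_pattern is adapted to accept a file name and read the data itself, so only the short path strings are pickled and sent to the worker processes instead of the ~10 MB text chunks of the first approach):
from multiprocessing import Pool

def match_some_pattern(part_file):
    # each worker opens and reads its own chunk of the data
    with open(part_file) as f:
        content = f.read()
    # ... run the actual pattern matching on content here ...
    return len(content)  # placeholder result

if __name__ == '__main__':
    pr = Pool(4)
    rValue = pr.map(match_some_pattern, ['part1.txt', 'part2.txt', 'part3.txt', 'part4.txt'])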

Related

Is it faster to do a bulk file write or write to it in smaller parts?

I have a Python script that reads a flat file and writes the records to a JSON file. Would it be faster to write it all at once:
dict_array = []
for record in records:
    dict_array.append(record)
# writes one big array to file
out_file.write(json.dumps(dict_array))
Or write to the file as the iterator yields each record?
for record in records:
    out_file.write(json.dumps(record) + '\n')
The number of records in records is around 81,000.
Also, the format of JSON can be one big array of objects (case 1) or line-separated objects (case 2).
Your two solutions aren't doing the same thing. The first one writes a valid JSON object. The second writes a probably-valid (but you have to be careful) JSONlines (and probably also NDJSON/LDJSON and NDJ) file. So, the way you process the data later is going to be very different. And that's the most important thing here—do you want a JSON file, or a JSONlines file?
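For example, reading the two formats back later looks quite different (a small sketch, assuming the output was written to a file named out.json):
import json

# case 1: one big JSON array, parse the whole file at once
with open('out.json') as f:
    records = json.load(f)

# case 2: JSON lines, parse one record per line
with open('out.json') as f:
    records = [json.loads(line) for line in f]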
But since you asked about performance: It depends.
Python files are buffered by default, so doing a whole bunch of small writes is only a tiny bit slower than doing one big write. But it is a tiny bit slower, not zero.
On the other hand, building a huge list in memory means allocation, and copies, that are otherwise unneeded. This is almost certainly going to be more significant, unless your values are really tiny and your list is also really short.
Without seeing your data, I'd give about 10:1 odds that the iterative solution will turn out faster, but why rely on that barely-educated guess? If it matters, measure with your actual data with timeit.
On the third hand, 81,000 typical JSON records is basically nothing, so unless you're doing this zillions of times, it's probably not even worth measuring. If you spend an hour figuring out how to measure it, running the tests, and interpreting the results (not to mention the time you spent on SO) to save 23 milliseconds per day for about a week and then nothing ever again… well, to a programmer, that's always attractive, but still it's not always a good idea.
import json
import time
out_file = open('out_test.json', 'w')  # assumed; the original snippet did not show how out_file was opened
dict_array = []
records = range(10**5)
start = time.time()
for record in records:
    dict_array.append(record)
out_file.write(json.dumps(dict_array))
end = time.time()
print(end - start)
# 0.07105851173400879
start = time.time()
for record in records:
    out_file.write(json.dumps(record) + '\n')
end = time.time()
print(end - start)
# 1.1138122081756592
start = time.time()
out_file.write(json.dumps([record for record in records]))
end = time.time()
print(end - start)
# 0.051038265228271484
I don't know what records is, but based on these tests, the list comprehension is fastest, followed by constructing a list and writing it all at once, followed by writing one record at a time. Depending on what records is, just doing out_file.write(json.dumps(records)) may be even faster.

Reading an ASCII file in Python gets progressively slow

I am processing output results from a simulation that are reported in ASCII files. The output file is written in a verbose form where each time step of the simulation reports the same tables with updated values.
I use Python to process the tables into a pandas DataFrame, which I use to plot the simulation output variables.
My code is split into 2 parts:
First I make a quick pass over the file to split it into a number of sections equal to the number of time steps. I do this to extract the time steps and mark the portions of the file where each time step is reported, so I can browse the data easily just by calling each time step. I also extract a list of the time steps, because I do not need to plot all of them; I use this list to filter out the time steps I would like to process.
The second part actually puts the data into the DataFrame. Using the time steps from the previous list, I call each section of the file and process the tables into a common DataFrame. It is here that I notice my code drags. It is strange, because each time step's data section is the same size (same tables, same number of characters). Nevertheless, I see that processing each step gets progressively slower: the first step's tables are read in 1.79 seconds, the second in 2.29, and by the 20th step it already takes 22 seconds. If I need to read 100 steps or more, this becomes really unmanageable.
I could paste some of my code here but it may be unreadable, so I have tried to explain it as best as I could. I will try to reproduce it in a simple example:
input_file="Simulation\nStep1:1seconds\n0 0 0 0\nStep2:2seconds\n1 0 1 0\nStep3:3seconds\n3 1 2 0\nStep4:4seconds\n4 5 8 2\n"
From the first part of my code I convert this string into a list where each element is the data of each step, and I get a list of the steps:
data=["0 0 0 0","1 0 1 0","3 1 2 0","4 5 8 2"]
steps=[0,1,2,3]
If I want to use only Steps 1 and 3, I filter them:
filtered_steps=[0,2]
Now I use this short list to call only the first (0) and third (2) elements of the data list, process each string, and put them into a DataFrame.
On a trivial example like the one I used it takes no time, but when instead of 4 steps I need to process tens to hundreds, and when instead of a single line of characters each time step has multiple lines, time becomes an issue. I would like to at least understand why it is getting progressively slower to read something that in the previous iteration had the same size.
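For reference, a minimal sketch of the first part described above (splitting the example string into data and steps), assuming the step header lines always look like StepN:Mseconds:
import re

input_file = "Simulation\nStep1:1seconds\n0 0 0 0\nStep2:2seconds\n1 0 1 0\nStep3:3seconds\n3 1 2 0\nStep4:4seconds\n4 5 8 2\n"

data, steps = [], []
for line in input_file.splitlines():
    m = re.match(r"Step(\d+):\d+seconds", line)
    if m:
        steps.append(int(m.group(1)) - 1)  # zero-based step index
    elif line and line != "Simulation":
        data.append(line)

print(data)   # ['0 0 0 0', '1 0 1 0', '3 1 2 0', '4 5 8 2']
print(steps)  # [0, 1, 2, 3]
filtered_steps = [0, 2]  # then only data[0] and data[2] get processed into the DataFrame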

Looking for a quick way to speed up my code

I am looking for a way to speed up my code. I have managed to speed up most parts, reducing the runtime to about 10 hours, but it's still not fast enough, and since I'm running out of time I'm looking for a quick way to optimize my code.
An example:
text = pd.read_csv(os.path.join(dir,"text.csv"),chunksize = 5000)
new_text = [np.array(chunk)[:,2] for chunk in text]
new_text = list(itertools.chain.from_iterable(new_text))
In the code above I read in about 6 million rows of text documents in chunks and flatten them. This code takes about 3-4 hours to execute and is the main bottleneck of my program. Edit: I realized that I wasn't very clear about the main issue: the flattening is the part which takes the most time.
Also this part of my program takes a long time:
train_dict = dict(izip(text,labels))
result = [train_dict[test[sample]] if test[sample] in train_dict else predictions[sample] for sample in xrange(len(predictions))]
The code above first zips the text documents with their corresponding labels (this is a machine learning task, with train_dict being the training set). Earlier in the program I generated predictions on a test set. There are duplicates between my train and test set, and I need to find those duplicates. Therefore, I need to iterate over my test set row by row (2 million rows in total); when I find a duplicate I don't want to use the predicted label, but the label from the duplicate in train_dict. I assign the result of this iteration to the variable result in the above code.
I have heard there are various libraries in Python that can speed up parts of your code, but I don't know which of them could do the job, and right now I do not have the time to investigate this, which is why I need someone to point me in the right direction. Is there a way to speed up the code snippets above?
Edit 2
I have investigated again, and it is definitely a memory issue. I tried to read the file row by row and after a while the speed declined dramatically; furthermore, my RAM usage is nearly 100% and Python's disk usage increased sharply. How can I decrease the memory footprint? Or should I find a way to make sure that I don't hold everything in memory?
Edit 3
As memory is the main issue, I'll give an outline of part of my program. I have dropped the predictions for the time being, which reduced the complexity of my program significantly; instead I insert a standard sample for every non-duplicate in my test set.
import numpy as np
import pandas as pd
import itertools
import os
train = pd.read_csv(os.path.join(dir,"Train.csv"),chunksize = 5000)
train_2 = pd.read_csv(os.path.join(dir,"Train.csv"),chunksize = 5000)
test = pd.read_csv(os.path.join(dir,"Test.csv"), chunksize = 80000)
sample = list(np.array(pd.read_csv(os.path.join(dir,"Samples.csv"))[:,2]))#this file is only 70mb
sample = sample[1]
test_set = [np.array(chunk)[:,2] for chunk in test]
test_set = list(itertools.chain.from_iterable(test_set))
train_set = [np.array(chunk)[:,2] for chunk in train]
train_set = list(itertools.chain.from_iterable(train_set))
labels = [np.array(chunk)[:,3] for chunk in train_2]
labels = list(itertools.chain.from_iterable(labels))
"""zipping train and labels"""
train_dict = dict(izip(train,labels))
"""finding duplicates"""
results = [train_dict[test[item]] if test[item] in train_dict else sample for item in xrange(len(test))]
Although this isn't my entire program, this is the part of my code that needs optimization. As you can see, I am only using three important modules in this part: pandas, numpy and itertools. The memory issues arise when flattening train_set and test_set. The only thing I am doing is reading in the files, getting the necessary parts, zipping the train documents with the corresponding labels in a dictionary, and then searching for duplicates.
Edit 4
As requested, I'll give an explanation of my data sets. My Train.csv contains 4 columns. The first column contains IDs for every sample, the second column contains titles, and the third column contains the text bodies (varying from 100-700 words). The fourth column contains category labels. Test.csv contains only the IDs, titles and text bodies. The columns are separated by commas.
Could you please post a dummy sample data set of a half dozen rows or so?
I can't quite see what your code is doing and I'm not a Pandas expert, but I think we can greatly speed up this code. It reads all the data into memory and then keeps re-copying the data to various places.
By writing "lazy" code we should be able to avoid all the re-copying. The ideal would be to read one line, transform it as we want, and store it into its final destination. Also this code uses indexing when it should be just iterating over values; we can pick up some speed there too.
Is the code you posted your actual code, or something you made just to post here? It appears to contain some mistakes so I am not sure what it actually does. In particular, train and labels would appear to contain identical data.
I'll check back and see if you have posted sample data. If so I can probably write "lazy" code for you that will have less re-copying of data and will be faster.
EDIT: Based on your new information, here's my dummy data:
id,title,body,category_labels
0,greeting,hello,noun
1,affirm,yes,verb
2,deny,no,verb
Here is the code that reads the above:
def get_train_data(training_file):
    with open(training_file, "rt") as f:
        next(f)  # throw away "headers" in first line
        for line in f:
            lst = line.rstrip('\n').split(',')
            # lst contains: id,title,body,category_labels
            yield (lst[2], lst[3])  # map each text body to its category label
train_dict = dict(get_train_data("data.csv"))
And here is a faster way to build results:
results = [train_dict.get(x, sample) for x in test]
Instead of repeatedly indexing test to find the next item, we just iterate over the values in test. The dict.get() method handles the if x in train_dict test we need.
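A toy illustration with the dummy data above (the maybe row and the unknown fallback are made up for the example):
train_dict = {'hello': 'noun', 'yes': 'verb', 'no': 'verb'}
test = ['hello', 'maybe', 'no']
sample = 'unknown'
results = [train_dict.get(x, sample) for x in test]
# ['noun', 'unknown', 'verb']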
You can try Cython. It supports numpy and can give you a nice speedup.
Here is an introduction and explanation of what needs to be done: http://www.youtube.com/watch?v=Iw9-GckD-gQ
If the order of your rows is not important, you can use sets: take the intersection (trainset & testset) to find the elements that appear in both sets and add those to your result first, and then use the set difference (testset - trainset) to add the elements that are in your test set but not in the train set. This saves you from checking, for every sample, whether it is in the train set.
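A minimal sketch of that idea, assuming test is a list of text bodies, train_dict maps each training text to its label, and sample is the fallback value as above (note that this drops the original row order):
test_texts = set(test)
train_texts = set(train_dict)
results = {}
# duplicates: texts present in both sets keep their training label
for x in test_texts & train_texts:
    results[x] = train_dict[x]
# texts that are only in the test set fall back to the standard sample
for x in test_texts - train_texts:
    results[x] = sample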

Nested Iteration of HDF5 using PyTables

I have a fairly large dataset that I store in HDF5 and access using PyTables. One operation I need to do on this dataset is a pairwise comparison between each of the elements. This requires 2 loops: one to iterate over each element, and an inner loop to iterate over every other element. This operation thus performs N(N-1)/2 comparisons.
For fairly small sets I found it faster to dump the contents into a multidimensional numpy array and then do my iteration. For large sets I run into memory problems and need to access each element of the dataset at run time.
Putting the elements into an array gives me about 600 comparisons per second, while operating on the HDF5 data itself gives me about 300 comparisons per second.
Is there a way to speed this process up?
Example follows (this is not my real code, just an example):
Small Set:
with tb.openFile(h5_file, 'r') as f:
    data = f.root.data
    N_elements = len(data)
    elements = np.empty((N_elements, int(1e5)))
    for ii, d in enumerate(data):
        elements[ii] = d['element']
    D = np.empty((N_elements, N_elements))
    for ii in xrange(N_elements):
        for jj in xrange(ii+1, N_elements):
            D[ii, jj] = compare(elements[ii], elements[jj])
Large Set:
with tb.openFile(h5_file, 'r') as f:
    data = f.root.data
    N_elements = len(data)
    D = np.empty((N_elements, N_elements))
    for ii in xrange(N_elements):
        for jj in xrange(ii+1, N_elements):
            D[ii, jj] = compare(data['element'][ii], data['element'][jj])
Two approaches I'd suggest here:
numpy memmap: create a memory-mapped array, put the data inside it, and then run the code for "Small Set". Memory maps behave almost like in-memory arrays (a short sketch follows below).
Use the multiprocessing module to allow parallel processing: if the compare method consumes at least a noticeable amount of CPU time, you could use more than one process.
Assuming you have more than one CPU core, this will speed things up significantly. Use
one process to read the data from the HDF5 file and put it into a queue,
one process to grab items from the queue, do the comparison and put the results into an "output" queue,
and one process to collect the results again (a rough sketch of this pipeline also follows below).
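A minimal sketch of the memmap idea, assuming each element is a float vector of length 1e5 as in the example above; the memmap file name and dtype are placeholders, and h5_file and compare() come from the question's example:
import numpy as np
import tables as tb

h5_file = 'data.h5'  # placeholder path

with tb.openFile(h5_file, 'r') as f:
    data = f.root.data
    N_elements = len(data)
    # on-disk array that can be filled and indexed almost like an in-memory one
    elements = np.memmap('elements.dat', dtype='float64', mode='w+',
                         shape=(N_elements, int(1e5)))
    for ii, d in enumerate(data):
        elements[ii] = d['element']
    D = np.empty((N_elements, N_elements))
    for ii in xrange(N_elements):
        for jj in xrange(ii + 1, N_elements):
            D[ii, jj] = compare(elements[ii], elements[jj])
And a rough, simplified sketch of the reader/worker/collector pipeline, with the main process acting as the collector. Shipping the raw vectors through the queues has pickling overhead, so this only shows the structure; compare() is again a stand-in for the real comparison:
from multiprocessing import Process, Queue
import numpy as np
import tables as tb

h5_file = 'data.h5'  # placeholder path

def compare(a, b):  # stand-in for the real comparison function
    return float(np.abs(a - b).sum())

def reader(task_q, n_workers):
    # one process streams element pairs out of the HDF5 file
    with tb.openFile(h5_file, 'r') as f:
        data = f.root.data
        n = len(data)
        for ii in xrange(n):
            a = data[ii]['element']
            for jj in xrange(ii + 1, n):
                task_q.put((ii, jj, a, data[jj]['element']))
    for _ in xrange(n_workers):
        task_q.put(None)  # one stop marker per worker

def worker(task_q, result_q):
    # one or more processes do the CPU-heavy comparisons
    for ii, jj, a, b in iter(task_q.get, None):
        result_q.put((ii, jj, compare(a, b)))
    result_q.put(None)  # tell the collector this worker is done

if __name__ == '__main__':
    n_workers = 3  # tune to the number of spare cores
    task_q, result_q = Queue(maxsize=1000), Queue()
    Process(target=reader, args=(task_q, n_workers)).start()
    for _ in xrange(n_workers):
        Process(target=worker, args=(task_q, result_q)).start()

    with tb.openFile(h5_file, 'r') as f:
        N_elements = len(f.root.data)
    D = np.empty((N_elements, N_elements))

    # the main process collects the results
    finished = 0
    while finished < n_workers:
        item = result_q.get()
        if item is None:
            finished += 1
        else:
            ii, jj, value = item
            D[ii, jj] = value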
Before choosing an approach: "Know your enemy", i.e., use profiling! Optimizations are only worth the effort if you improve at the bottlenecks, so first find out which methods consume your precious CPU time.
Your algorithm is O(n^2), which is not good for large data. Don't you see any chance to reduce this, e.g., by applying some logic? This is always the best approach.
Greetings,
Thorsten

Python fast string parsing, manipulation

I am using Python to parse incoming comma-separated strings. I want to do some calculations afterwards on the data.
The length of each string is 800 characters with 120 comma-separated fields.
There are 1.2 million such strings to process.
for v in item.values():
    l.extend(get_fields(v.split(',')))
# process l
get_fields uses operator.itemgetter() to extract around 20 fields out of 120.
This entire operation takes about 4-5 minutes excluding the time to bring in the data.
In a later part of the program I insert these lines into an sqlite in-memory table for further use.
But overall, 4-5 minutes just for parsing and getting a list is not good for my project.
I run this processing in around 6-8 threads.
Would switching to C/C++ help?
Are you loading a dict with your file records? Probably better to process the data directly:
datafile = open("file_with_1point2million_records.dat")
# uncomment the next line to skip over a header record
# datafile.next()
l = sum((get_fields(v.split(',')) for v in datafile), [])
This avoids creating any overall data structure, and only accumulates the desired values as returned by get_fields.
Your program might be slowing down trying to allocate enough memory for 1.2M strings. In other words, the speed problem might not be due to the string parsing/manipulation, but rather to the l.extend. To test this hypothesis, you could put a print statement in the loop:
for v in item.values():
    print('got here')
    l.extend(get_fields(v.split(',')))
If the print statements get slower and slower, you can probably conclude l.extend is the culprit. In this case, you may see significant speed improvement if you can move the processing of each line into the loop.
PS: You probably should be using the csv module to take care of the parsing for you in a more high-level manner, but I don't think that will affect the speed very much.
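For example, a minimal sketch of the csv-module version (the column positions in FIELD_IDX are placeholders standing in for the ~20 fields your get_fields actually extracts):
import csv
from operator import itemgetter

FIELD_IDX = (0, 3, 7, 15)  # placeholder: the wanted column positions
get_fields = itemgetter(*FIELD_IDX)

l = []
with open("file_with_1point2million_records.dat", "rb") as f:
    for row in csv.reader(f):
        l.extend(get_fields(row))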
