This code simulates loading a CSV file, parsing it, and loading it into a pandas DataFrame. I would like to parallelize it so that it runs faster, but my pool.map implementation is actually slower than the serial implementation.
The CSV is read as one big string, first split into lines, and then split into values. It is an irregularly formatted CSV with recurring headers, so I cannot use pandas read_csv. At least, not that I know how.
My idea was simply to read the file in as a string, split the long string into four parts (one for each core), and then process each chunk separately in parallel. This, it turns out, is slower than the serial version.
from multiprocessing import Pool
import datetime
import pandas as pd
def data_proc(raw):
    pre_df_list = list()
    for item in (i for i in raw.split('\n') if i and not i.startswith(',')):
        if ' ' in item and ',' in item:
            key, freq, date_observation = item.split(' ')
            date, observation = date_observation.split(',')
            pre_df_list.append([key, freq, date, observation])
    return pre_df_list
if __name__ == '__main__':
    raw = '\n'.join([f'KEY FREQ DATE,{i}' for i in range(15059071)])  # instead of loading csv

    start = datetime.datetime.now()
    pre_df_list = data_proc(raw)
    df = pd.DataFrame(pre_df_list, columns=['KEY', 'FREQ', 'DATE', 'VAL'])
    end = datetime.datetime.now()
    print(end - start)

    pool = Pool(processes=4)
    start = datetime.datetime.now()
    len(raw.split('\n'))
    number_of_tasks = 4
    chunk_size = int((len(raw) / number_of_tasks))
    beginning = 0
    multi_list = list()
    for i in range(1, number_of_tasks + 1):
        multi_list.append(raw[beginning:chunk_size * i])
        beginning = chunk_size * i

    results = pool.imap(data_proc, multi_list)
    # d = results[0]
    pool.close()
    pool.join()
    # I haven't finished conversion to dataframe since previous part is not working yet
    # df = pd.DataFrame(d, columns=['SERIES_KEY','Frequency','OBS_DATE','val'])
    end = datetime.datetime.now()
    print(end - start)
EDIT: the serial version finishes in 34 seconds and the parallel version after 53 seconds on my laptop. When I started working on this, my initial assumption was that I would be able to get it down to 10-ish seconds on a 4-core machine.
It looks like the parallel version I posted never finishes. I changed the pool.map call to pool.imap and now it works again. Note that it has to be run from the command line, not Spyder.
In general:
Multiprocessing isn't always the best way to do things. It takes overhead to create and manage the new processes, and to synchronize their output. In a relatively simple case, like parsing 15M lines of small text, you may or may not get a big time savings from multiprocessing.
There are a lot of other confounding variables: process load on the machine, number of processors, any I/O access being spread across the processors (that one is not a problem in your specific case), the potential to fill up memory and then deal with page swaps... The list can keep growing. There are times when multiprocessing is ideal, and there are times where it makes matters worse. (In my production code, I have one place where I left a comment: "Using multiprocessing here took 3x longer than not. Just using regular map...")
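To make the overhead concrete, here is a minimal, hedged sketch (my illustration, not from the original post) where the per-item work is so trivial that pool.map is often slower than plain map:

import time
from multiprocessing import Pool

def trivial(x):
    return x * 2

if __name__ == '__main__':
    data = list(range(1000000))

    start = time.time()
    serial = list(map(trivial, data))
    print("plain map: %.3fs" % (time.time() - start))

    start = time.time()
    pool = Pool(processes=4)
    parallel = pool.map(trivial, data)  # pickling/IPC costs usually dominate here
    pool.close()
    pool.join()
    print("pool.map:  %.3fs" % (time.time() - start))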
Your specific case
However, without knowing your exact system specifications, I think you should see a performance improvement from properly done multiprocessing. You may not; this task may be small enough that it's not worth the overhead. That said, there are several issues with your code that will result in your multiprocessing path taking longer. I'll call out the ones that catch my attention.
len(raw.split('\n'))
This line is very expensive and accomplishes nothing. It goes through every line of your raw data, splits it, takes the length of the result, then throws away both the split data and the length. You likely want to do something like:
splitted = raw.split('\n')
splitted_len = len(splitted) # but I'm not sure where you need this.
This would save the split data, so you could make use of it later, in your for loop. As it is right now, your for loop operates on raw, which has not been split. So instead of running on, e.g., [first_part, second_part, third_part, fourth_part], you're running on [all_of_it, all_of_it, all_of_it, all_of_it]. This, of course, is a HUGE part of your performance degradation: you're doing the same work 4 times over!
I expect, if you take care of splitting on \n outside of your processing, that's all you'll need to get an improvement from multiprocessing. (Note, you actually don't need any special processing for 'serial' vs. 'parallel' - you can test it decently by using map instead of pool.map.)
Here's my take at redoing your code. It moves the line splitting out of the data_proc function, so you can focus on whether splitting the array into 4 chunks gives you any improvement. (Other than that, it makes each task into a well-defined function; that's just style, to help clarify what's being tested where.)
from multiprocessing import Pool
import datetime
import pandas as pd

def serial(raw):
    pre_df_list = data_proc(raw)
    return pre_df_list
def parallel(raw):
    pool = Pool(processes=4)
    number_of_tasks = 4
    chunk_size = int((len(raw) / number_of_tasks))
    beginning = 0
    multi_list = list()
    for i in range(1, number_of_tasks + 1):
        multi_list.append(raw[beginning:chunk_size * i])
        beginning = chunk_size * i
    results = pool.map(data_proc, multi_list)
    pool.close()
    pool.join()
    pre_df_list = []
    for r in results:
        pre_df_list.extend(r)  # flatten each chunk's rows into one flat list
    return pre_df_list
def data_proc(raw):
    # assume raw is pre-split (a list of lines) by the time you're here
    pre_df_list = list()
    for item in (i for i in raw if i and not i.startswith(',')):
        if ' ' in item and ',' in item:
            key, freq, date_observation = item.split(' ')
            date, observation = date_observation.split(',')
            pre_df_list.append([key, freq, date, observation])
    return pre_df_list
if __name__ == '__main__':
    # don't bother with the join, since we would need it in either case
    raw = [f'KEY FREQ DATE,{i}' for i in range(15059071)]  # instead of loading csv

    start = datetime.datetime.now()
    pre_df_list = serial(raw)
    end = datetime.datetime.now()
    print("serial time: {}".format(end - start))

    start = datetime.datetime.now()
    pre_df_list = parallel(raw)
    end = datetime.datetime.now()
    print("parallel time: {}".format(end - start))

    # make the dataframe. This would happen in either case
    df = pd.DataFrame(pre_df_list, columns=['KEY', 'FREQ', 'DATE', 'VAL'])
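As a side note (my addition, not part of the answer above): if you would rather keep pool.map's list-of-lists output and flatten it only when building the DataFrame, itertools.chain.from_iterable does the flattening cheaply:

from itertools import chain

# results is a list of per-chunk row lists: [[row, ...], [row, ...], ...]
rows = list(chain.from_iterable(results))
df = pd.DataFrame(rows, columns=['KEY', 'FREQ', 'DATE', 'VAL'])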
I want to split a huge file into smaller files. There are approximately 2 million IDs in the file and I want to sort them by modulo. When you run the program, it should ask for the number of files into which to divide the main file (x = int(input())). I want to separate the file with a modulo function: if ID % x == 1, the ID should be added to q1 and written to f1. But it only adds the first line that matches the requirement.
import multiprocessing

def createlist(x, i, queue_name):
    with open("events.txt") as f:
        next(f)
        for line in f:
            if int(line) % x == i:
                queue_name.put(line)

def createfile(x, i, queue_name):
    for i in range(x):
        file_name = "file{}.txt".format(i+1)
        with open(file_name, "w") as text:
            text.write(queue_name.get())

if __name__ == '__main__':
    x = int(input("number of parts "))
    i = 0
    for i in range(x):
        queue_name = "q{}".format(i+1)
        queue_name = multiprocessing.Queue()
        p0 = multiprocessing.Process(target=createlist, args=(x, i, queue_name,))
        process_name = "p{}".format(i+1)
        process_name = multiprocessing.Process(target=createfile, args=(x, i, queue_name,))
        p0.start()
        process_name.start()
Your createfile has two functional issues.
it only reads from the queue once, then terminates
it iterates over the range of the desired number of subsets a second time; hence, even after fixing the single queue-read issue, you get one written file and parts - 1 empty files.
To fix your approach, make createfile look like this:
def createfile(i, queue_name):  # Note: x has been removed from the args
    file_name = "file{}.txt".format(i + 1)
    with open(file_name, "w") as text:
        while True:
            if queue_name.empty():
                break
            text.write(queue_name.get())
Since x has been removed from createfile's arguments, you'd also remove it from the process instantiation:
process_name = multiprocessing.Process(target=createfile, args=(i, queue_name,))
However ... do not do it like this. The more subsets you want, the more processes and queues you create (two processes and one queue per subset). That is a lot of overhead you create.
Also, while having one responsible process per output file for writing might still make some sense, having multiple processes reading the same (huge) file completely does not.
I did some timing and testing with an input file containing 1000 lines, each consisting of one random integer between 0 and 9999. I created three algorithms and ran each for ten iterations while tracking execution time. I did this for desired subset counts of 1, 2, 3, 4, 5 and 10. For the comparison below I took the mean value of each series.
orwqpp (green): Is one-reader-writer-queue-per-part. Your approach. It saw an average increase in execution time of 0.48 seconds per additional subset.
orpp (blue): Is one-reader-per-part. This one had a common writer process that took care of writing to all files. It saw an average increase in execution time of 0.25 seconds per additional subset.
ofa (yellow): Is one-for-all. One single function, not run in a separate process, reading and writing in one go. It saw an average increase in execution time of 0.0014 seconds per additional subset.
Keep in mind that these figures were created with an input file 1/2000 the size of yours. The processes in what resembles your approach completed so quickly, they barely got in each other's way. Once the input is large enough to make the processes run for a longer amount of time, contention for CPU resources will increase and so will the penalty of having more processes as more subsets are requested.
Here's the one-for-all approach:
def one_for_all(parts):
    # one output file handle per remainder class
    handles = {el: open("file{}.txt".format(el), "w") for el in range(parts)}
    with open("events.txt") as f:
        next(f)
        for line in f:
            fnum = int(line) % parts
            handles[fnum].write(line)
    for h in handles.values():
        h.close()

if __name__ == '__main__':
    x = int(input("number of parts "))
    one_for_all(x)
This currently names the files based on the result of the modulo operation, so numbers where int(line) % parts is 0 will be in file0.txt and so on.
If you don't want that, simply add 1 when formatting the file name:
handles = {el: open("file{}.txt".format(el + 1), "w") for el in range(parts)}
[Image: System Monitor during the process]
I am a novice when it comes to programming. I've worked through the book Practical Computing for Biologists and am playing around with some slightly more advanced concepts.
I've written a Python (2.7) script which reads in a .fasta file and calculates GC-content. The code is provided below.
The file I'm working with is cumbersome (~3.9 GB), and I was wondering if there's a way to take advantage of multiple processors, or whether it would even be worthwhile. I have a four-core (hyperthreaded) Intel i7-2600K processor.
I ran the code and looked at system resources (see the attached screenshot) to see what the load on my CPU is. Is this process CPU-limited? Is it IO-limited? These concepts are pretty new to me. I played around with the multiprocessing module and Pool(), to no avail (probably because my function returns a tuple).
Here's the code:
def GC_calc(InFile):
    Iteration = 0
    GC = 0
    Total = 0
    for Line in InFile:
        if Line[0] != ">":
            GC = GC + Line.count('G') + Line.count('C')
            Total = Total + len(Line)
            Iteration = Iteration + 1
            print Iteration
    GCC = 100 * GC / Total
    return (GC, Total, GCC)

InFileName = "WS_Genome_v1.fasta"
InFile = open(InFileName, 'r')
results = GC_calc(InFile)
print results
Currently, the major bottleneck of your code is print Iteration. Printing to stdout is really, really slow. I would expect a major performance boost if you removed this line, or at least, if you absolutely need it, moved it to another thread. However, thread management is an advanced topic, and I would advise against going into it right now.
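A cheaper middle ground (my suggestion, not part of the original answer) is to throttle the output instead of removing it, printing only every millionth line; this sketch reuses Iteration and InFile from the question's Python 2 code:

Iteration = 0
for Line in InFile:
    if Line[0] != ">":
        Iteration = Iteration + 1
        if Iteration % 1000000 == 0:  # report progress once per million lines
            print Iteration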
Another possible bottleneck is the fact that you read data from a file. File IO can be slow, especially if you have a single HDD in your machine. With a single HDD you won't need multiprocessing at all, because you won't be able to feed the processor cores enough data. Performance-oriented RAIDs and SSDs can help here.
The final comment is to try using grep and similar text-processing programs instead of Python. They have gone through decades of optimization and have a good chance of being much faster. There are a bunch of questions on SO where grep outperforms Python. At the very least, you can filter out the FASTA headers before you pass the data to the script:

$ grep -v "^>" WS_Genome_v1.fasta | python gc_calc.py

In this case you shouldn't open files in the script, but rather read lines from sys.stdin, almost like you do now.
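A hedged sketch of that stdin-reading variant (Python 2, like the question's code; it assumes the ">" header lines were already stripped by grep upstream):

import sys

GC = 0
Total = 0
for Line in sys.stdin:
    Line = Line.rstrip('\n')
    GC += Line.count('G') + Line.count('C')
    Total += len(Line)

if Total > 0:
    print "GC Content = %.3f%%" % (100.0 * GC / Total)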
Basically, you're counting the number of C and G characters in every line, and you're calculating the length of the line. Only at the end do you calculate a total.
Such a process is easy to do in parallel, because the calculation for each line is independent of the others.
Assuming the calculations are done in CPython (the one from python.org), threading won't improve performance much because of the GIL.
These calculations could be done in parallel with multiprocessing.Pool.
Processes don't share data like threads do. And we don't want to send parts of a 3.9 GB file to each worker process!
So you want each worker process to open the file by itself. The operating system's cache should take care that pages from the same file aren't loaded into memory multiple times.
If you have N cores, I would create the worker function so as to process every N-th line, with an offset.
import os

def worker(arguments):
    # number of worker processes plus one; each worker handles every n-th line
    n = os.cpu_count() + 1
    infile, offset = arguments
    with open(infile) as f:
        cg = 0
        totlen = 0
        count = 1
        for line in f:
            if (count % n) - offset == 0:
                if not line.startswith('>'):
                    cg += line.count('C') + line.count('G')
                    totlen += len(line)
            count += 1
    return (cg, totlen)
You could run the pool like this:
import multiprocessing as mp
from os import cpu_count
pool = mp.Pool()
results = pool.map(worker, [('infile', n) for n in range(1, cpu_count()+1)])
By default, a Pool creates as many workers as the CPU has cores.
The result would be a list of (cg, totlen) tuples, which you can easily sum.
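For instance (my sketch; results as returned by pool.map above):

total_cg = sum(cg for cg, totlen in results)
total_len = sum(totlen for cg, totlen in results)
print(100.0 * total_cg / total_len)  # overall GC percentage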
Edit: updated to fix modulo zero error.
I now have working code for the parallelization (special thanks to @Roland Smith). I just had to make two small modifications to the code, and there's a caveat with respect to how .fasta files are structured. The final (working) code is below:
###ONLY WORKS WHEN THERE ARE NO BREAKS IN SEQUENCE LINES###
def GC_calc(arguments):
    n = mp.cpu_count()
    InFile, offset = arguments
    with open(InFile) as f:
        GC = 0
        Total = 0
        count = 0
        for Line in f:
            if (count % n) - offset == 0:
                if Line[0] != ">":
                    Line = Line.strip('\n')
                    GC += Line.count('G') + Line.count('C')
                    Total += len(Line)
            count += 1
    return (GC, Total)

import time
import multiprocessing as mp

startTime = time.time()
pool = mp.Pool()
results = pool.map(GC_calc, [('WS_Genome_v2.fasta', n) for n in range(1, mp.cpu_count()+1)])
endTime = time.time()
workTime = endTime - startTime

#Takes the tuples, parses them out, adds them
GC_List = []
Tot_List = []
# x = GC count, y = total count: results = [(x,y), (x,y), (x,y),...(x,y)]
for x, y in results:
    GC_List.append(x)
    Tot_List.append(y)
GC_Final = sum(GC_List)
Tot_Final = sum(Tot_List)
GCC = 100*float(GC_Final)/float(Tot_Final)

print results
print
print "Number GC = ", GC_Final
print "Total bp = ", Tot_Final
print "GC Content = %.3f%%" % (GCC)
print
endTime = time.time()
workTime = endTime - startTime
print "The job took %.5f seconds to complete" % (workTime)
The caveat is that the .fasta files cannot have line breaks within the sequences themselves. My original code didn't have an issue with that, but this code does not work properly when a sequence is broken across multiple lines. That was simple enough to fix from the command line.
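The actual command isn't shown here; as an illustration, a few lines of Python (my sketch, with assumed file names) can do the same unwrapping, joining each record's sequence onto a single line:

with open('WS_Genome_v1.fasta') as src, open('WS_Genome_v2.fasta', 'w') as dst:
    seq = []
    for line in src:
        if line.startswith('>'):
            if seq:  # flush the previous record's sequence
                dst.write(''.join(seq) + '\n')
                seq = []
            dst.write(line)
        else:
            seq.append(line.strip())
    if seq:  # flush the final record
        dst.write(''.join(seq) + '\n')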
I also had to modify the code in two spots:
n = mp.cpu_count()
and
count = 0
Originally, count was set to 1 and n was set to mp.cpu_count() + 1. This resulted in inaccurate counts, even after the file correction. The downside of the fix is that the original settings allowed all 8 cores (well, threads) to work, while the new code only allows 4 to work at any given time.
But it DID speed up the process from about 23 seconds to about 13 seconds! So I'd say it was a success (except for the amount of time it took to correct the original .fasta file).
I'm trying to accurately determine which method of unpacking binary data into a viewable format is faster, and I'm attempting to use the time module to do so. I'm working with the bitstring module, as I found it the easiest way to unpack the bit-aligned data I'm working with. This is a small test case to see which way is faster, since I'm processing millions of lines. The data needs to be displayed in a specific way, which is why the formatting is there.
from bitstring import BitArray
import time

s = BitArray('0x0081')

start = time.time()
for i in range(100000):
    test = s.unpack('uintle:16')
    temp = hex(test[0]).lstrip('0x').zfill(4)
end = time.time()
ttime = end - start
print("uintle " + str(ttime))

start = time.time()
for i in range(100000):
    hex_val = s.unpack('hex:16')
    temp = hex_val[0][2:] + hex_val[0][0:2]
end = time.time()
ttime = end - start
print("hex " + str(ttime))
When testing with 1 million loops, this is the output:
uintle 32.51800322532654
uintle 46.38693380355835
hex 131.79687571525574
This doesn't seem valid, as it prints one of the outputs twice, and I can't figure out why that happens.
When testing with 100,000 loops this is the output:
uintle 2.705230951309204
hex 6.699380159378052
Only two outputs, just as expected. Any ideas on why it behaves this way?
I have to time my implementation of an algorithm for one of my classes, and I am using the time.time() function to do so. After implementing it, I have to run the algorithm on a number of data files, containing both small and bigger data sets, in order to formally analyse its complexity.
Unfortunately, on the small data sets I get a runtime of 0 seconds, even though the function shows a precision of 0.000000000000000001 when looking at the runtimes of the bigger data sets, and I cannot believe that the small data sets really take less time than that.
My question is: Is there a problem using this function (and if so, is there another function I can use that has a better precision)? Or am I doing something wrong?
Here is my code if ever you need it:
import sys, time
import random
from utility import parseSystemArguments, printResults
...

def main(ville):
    start = time.time()
    solution = dynamique(ville)  # Algorithm implementation
    end = time.time()
    return (end - start, solution)

if __name__ == "__main__":
    sys.argv.insert(1, "-a")
    sys.argv.insert(2, "3")
    (algoNumber, ville, printList) = parseSystemArguments()
    (algoTime, solution) = main(ville)
    printResults(algoTime, solution, printList)
The printResults function:
def printResults(time, solution, printList=True):
    print("Temps d'execution = " + str(time) + "s")
    if printList:
        print(solution)
The solution to my problem was to use the timeit module instead of the time module.
import timeit
...

def main(ville):
    start = timeit.default_timer()
    solution = dynamique(ville)
    end = timeit.default_timer()
    return (end - start, solution)
Don't confuse the resolution of the system time with the resolution of a floating point number. The time resolution on a computer is only as fine as the frequency at which the system clock is updated, and that frequency varies from machine to machine. So to be sure you will see a difference with time, you need to make the code run for a millisecond or more. Try putting it into a loop like this:
start = time.time()
k = 100000
for i in range(k):
    solution = dynamique(ville)
end = time.time()
return ((end - start) / k, solution)
In the final tally, you then need to divide by the number of loop iterations to know how long your code actually runs once through. You may need to increase k to get a good measure of the execution time, or you may need to decrease it if your computer is running in the loop for a very long time.
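Equivalently (my sketch, borrowing dynamique and ville from the question's code), the timeit module can handle the looping and averaging for you:

import timeit

# run the algorithm 1000 times and report the mean per-call time
per_call = timeit.timeit(lambda: dynamique(ville), number=1000) / 1000
print("mean runtime: {}s".format(per_call))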
I am trying to understand how to get children to write to a parent's variables. Maybe I'm doing something wrong here, but I would have imagined that multiprocessing would take a fraction of the time it is actually taking:
import multiprocessing, time

def h(x):
    h.q.put('Doing: ' + str(x))
    return x

def f_init(q):
    h.q = q

def main():
    q = multiprocessing.Queue()
    p = multiprocessing.Pool(None, f_init, [q])
    results = p.imap(h, range(1, 5))
    p.close()
    for i in range(len(range(1, 5))):
        print results.next()  # prints 1, 2, 3, 4

if __name__ == '__main__':
    start = time.time()
    main()
    print "Multiprocessed: %s seconds" % (time.time() - start)
    start = time.time()
    for i in range(1, 5):
        print i
    print "Normal: %s seconds" % (time.time() - start)

-----Results-----:
1
2
3
4
Multiprocessed: 0.0695610046387 seconds
1
2
3
4
Normal: 2.78949737549e-05 seconds  # much shorter
@Blender basically already answered your question, but as a comment. There is some overhead associated with the multiprocessing machinery, so if you incur the overhead without doing any significant work, it will be slower.
Try actually doing some work that parallelizes well. For example, write Python code to open a file, scan it using a regular expression, and pull out matching lines; then make a list of ten big files and time how long it takes to do all ten with multiprocessing vs. plain Python. Or write code to compute an expensive function and try that.
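A rough sketch of that experiment (my illustration; the file names and the regular expression are placeholders, not from the original answer):

import re
import time
from multiprocessing import Pool

PATTERN = re.compile(r'ERROR')  # hypothetical pattern to match

def grep_file(path):
    with open(path) as f:
        return [line for line in f if PATTERN.search(line)]

if __name__ == '__main__':
    files = ['big%d.log' % i for i in range(10)]  # hypothetical files

    start = time.time()
    serial = list(map(grep_file, files))
    print('plain map: %.2fs' % (time.time() - start))

    start = time.time()
    pool = Pool()
    parallel = pool.map(grep_file, files)
    pool.close()
    pool.join()
    print('pool.map:  %.2fs' % (time.time() - start))

With real work per file, the parallel version has a chance to win; with tiny files, the overhead dominates again.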
I have used multiprocessing.Pool() just to run a bunch of instances of an external program. I used subprocess to run an audio encoder, and it ran four instances of the encoder at once for a noticeable speedup.
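A minimal sketch of that pattern (my reconstruction; the flac command line is just an example, not the encoder actually used):

import subprocess
from multiprocessing import Pool

def encode(path):
    # each worker blocks here while its external encoder instance runs
    return subprocess.call(['flac', '--best', path])

if __name__ == '__main__':
    wav_files = ['track1.wav', 'track2.wav', 'track3.wav', 'track4.wav']
    pool = Pool(processes=4)  # four encoder instances at once
    exit_codes = pool.map(encode, wav_files)
    pool.close()
    pool.join()
    print(exit_codes)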