Parallelising Python code - python

I have written a function that returns a Pandas data frame (one sample per row, one descriptor per column) and takes as input a list of peptides (biological sequences as strings). "my_function(pep_list)" takes pep_list as a parameter, iterates over each peptide sequence in pep_list, calculates its descriptors, combines all the data into a Pandas data frame, and returns that data frame.
example:
pep_list = ["DAAAAEF", "DAAAREF", "DAAANEF", "DAAADEF", "DAAACEF", "DAAAEEF", "DAAAQEF", "DAAAGEF", "DAAAHEF", "DAAAIEF", "DAAALEF", "DAAAKEF"]
I want to parallelise this code with the algorithm given below:
1. Get the number of available processors:
n = multiprocessing.cpu_count()
2. Split pep_list into n sub-lists:
sub_list_of_pep_list = [["DAAAAEF", "DAAAREF", "DAAANEF"], ["DAAADEF", "DAAACEF", "DAAAEEF"], ["DAAAQEF", "DAAAGEF", "DAAAHEF"], ["DAAAIEF", "DAAALEF", "DAAAKEF"]]
3. Run "my_function()" on each core (example with 4 cores):
df0 = my_function(sub_list_of_pep_list[0])
df1 = my_function(sub_list_of_pep_list[1])
df2 = my_function(sub_list_of_pep_list[2])
df3 = my_function(sub_list_of_pep_list[3])
4. Join all the results: df = pd.concat([df0, df1, df2, df3])
5. Return df with an approximately n-fold speed-up.
Please suggest the most suitable library to implement this approach.
Thanks and regards.
Update
With some reading I was able to write code that behaves as expected:
1. without parallelising it takes ~10 seconds for 10 peptide sequences
2. with two processes it takes ~6 seconds for 12 peptides
3. with four processes it takes ~4 seconds for 12 peptides
from multiprocessing import Process

def func1():
    structure_gen(pep_seq=["DAAAAEF", "DAAAREF", "DAAANEF"])

def func2():
    structure_gen(pep_seq=["DAAAQEF", "DAAAGEF", "DAAAHEF"])

def func3():
    structure_gen(pep_seq=["DAAADEF", "DAAALEF"])

def func4():
    structure_gen(pep_seq=["DAAAIEF", "DAAALEF"])

if __name__ == '__main__':
    p1 = Process(target=func1)
    p1.start()
    p2 = Process(target=func2)
    p2.start()
    p3 = Process(target=func3)
    p3.start()
    p4 = Process(target=func4)
    p4.start()
    p1.join()
    p2.join()
    p3.join()
    p4.join()
But this code works easily for 10 peptides; I am not able to scale it to a pep_list containing 1 million peptides.
Thanks

multiprocessing.Pool.map is what you're looking for.
Try this:
import numpy as np
import pandas as pd
from multiprocessing import Pool

# I recommend using more partitions than processes,
# this way the work can be balanced.
# Of course this only makes sense if pep_list is bigger than
# the one you provide. If not, change this to 8 or so.
n = 50

# create indices for the partitions
ix = np.linspace(0, len(pep_list), n + 1, endpoint=True, dtype=int)

# create partitions using the indices
sub_lists = [pep_list[i1:i2] for i1, i2 in zip(ix[:-1], ix[1:])]

p = Pool()
try:
    # p.map will return a list of dataframes which are to be
    # concatenated
    df = pd.concat(p.map(my_function, sub_lists))
finally:
    p.close()
The pool will automatically contain as many processes as there are available cores, but you can override this number if you want to; have a look at the docs.
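For example, a minimal variant of the snippet above (reusing my_function and sub_lists from it) that caps the pool at four worker processes:
from multiprocessing import Pool
import pandas as pd

# processes=4 starts exactly four workers instead of one per available core
with Pool(processes=4) as p:
    df = pd.concat(p.map(my_function, sub_lists))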

Related

How to use map function and multiprocessing to simplify this code and reduce time?

Two separate queries.
1. I have 'm' raster files and 'n' vector files. I would like to use a map function (as in R) and iterate through the list of 'n' vector files for each of the 'm' raster files. I got the output by writing a separate for loop for each vector file.
2. As given below, I am using a for loop for each vector file. If I run it in a single script, I will be using only a single processor. Is it possible to use multiprocessing to reduce the time?
Here is the for loop (filenames_dat[i] is the raster input):
df1 = gpd.read_file("input_path")
df2 = gpd.read_file("input_path")

for i in range(len(raster_path)):
    array_name, trans_name = mask(filenames_dat[i], shapes=df1.geometry, crop=True, nodata=np.nan)
    zs = zonal_stats(df1, array_name[0], affine=trans_name, stats=['mean','sum'], nodata=np.nan, all_touched=True)
    df1['amg'+str(filenames[i])] = [x[('mean')] for x in zs]
    df1['mpg'+str(filenames[i])] = [x[('sum')] for x in zs]
    print(i)

df1csv = pd.DataFrame(df1)
df1csv.to_csv(cwd+'/rasteroutput/df1.csv', index=False)

for i in range(len(raster_path)):
    array_name, trans_name = mask(filenames_dat[i], shapes=df2.geometry, crop=True, nodata=np.nan)
    zs = zonal_stats(df2, array_name[0], affine=trans_name, stats=['mean','sum'], nodata=np.nan, all_touched=True)
    df2['amg'+str(filenames[i])] = [x[('mean')] for x in zs]
    df2['mpg'+str(filenames[i])] = [x[('sum')] for x in zs]
    print(i)

df2csv = pd.DataFrame(df2)
df2csv.to_csv(cwd+'/rasteroutput/df2.csv', index=False)
Here is the function which I have not used, as I am not sure how to use map with multiple arguments. 'i' is the index into the raster list. The poly2 function works for a single integer 'i' (e.g. i = 1) but not when I pass a list of indices; list(map(poly2, lst, df)) shows an error. I was looking for something similar to map2df in R.
def poly2(i, df):  ## i is for year
    df = df
    array_name, trans_name = mask(filenames_dat[i], shapes=df.geometry, crop=True, nodata=np.nan)
    zs = zonal_stats(df, array_name[0], affine=trans_name, stats=['mean','sum'], nodata=np.nan, all_touched=True)
    df['amg'+str(filenames[i])] = [x[('mean')] for x in zs]
    df['mpg'+str(filenames[i])] = [x[('sum')] for x in zs]
    print(i)

lst = []
for i in range(len(raster_path)):
    lst.append(i)

poly2(i=1, df=df)          # works for a single index
list(map(poly2, lst, df))  ## shows error.
I also find the description of your problem a bit confusing. However, one possible way to use multiprocessing in Python is shown below.
Pool.starmap takes an iterable of argument tuples and unpacks each tuple into a single call. See the link.
from multiprocessing import Pool

def add(x, y, id_num):
    print(f"\nRunning Process: {id_num} -> result: {x+y}")
    return x + y

tasks_x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
tasks_y = [10] * 10

with Pool(processes=2) as pool:
    results = pool.starmap(add, zip(tasks_x, tasks_y, range(10)))
Which results in:
Running Process: 2 -> result: 13
Running Process: 0 -> result: 11
Running Process: 3 -> result: 14
Running Process: 1 -> result: 12
Running Process: 4 -> result: 15
Running Process: 6 -> result: 17
Running Process: 7 -> result: 18
Running Process: 5 -> result: 16
Running Process: 8 -> result: 19
Running Process: 9 -> result: 20
Note that 'results' is still sorted by the input order:
results
Out[8]: [11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
I also recommend looking into joblib; it is sometimes very handy and allows you to write one-liners.
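For instance, here is a rough joblib equivalent of the Pool.starmap example above (this sketch reuses the add, tasks_x and tasks_y definitions from that snippet):
from joblib import Parallel, delayed

# Run add() over the same inputs with two worker processes;
# like Pool.starmap, the results come back in input order.
results = Parallel(n_jobs=2)(
    delayed(add)(x, y, i) for i, (x, y) in enumerate(zip(tasks_x, tasks_y))
)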
I will answer one of your questions first. Suppose you have a function poly that takes an argument i, which varies, and an argument df, whose value is the same on every call, and you need to call poly as if it were a function of a single argument (as required by the map function). Then:
from functools import partial
df = some_data_frame
modified_poly = partial(poly, df=df)
# Now calling modified_poly(0) is equivalent to calling
# poly(0, df)
You could also use itertools.starmap or, if doing multiprocessing, multiprocessing.pool.Pool.starmap, which can handle the invocation of a function that takes multiple arguments. But in your case, where one argument (df) never varies, I would just use functools.partial. Please read up on this. Moving on ...
You didn't really answer my question in a way that makes things clearer for me. But I will just assume you are trying to parallelize the following code:
def poly2(i, df):  ## i is for year
    df = df
    array_name, trans_name = mask(filenames_dat[i], shapes=df.geometry, crop=True, nodata=np.nan)
    zs = zonal_stats(df, array_name[0], affine=trans_name, stats=['mean','sum'], nodata=np.nan, all_touched=True)
    df['amg'+str(filenames[i])] = [x[('mean')] for x in zs]
    df['mpg'+str(filenames[i])] = [x[('sum')] for x in zs]
    print(i)
My continued source of confusion stems from your saying that you have m raster files and n vector files and use a for loop for each vector file. You further say filenames_dat[i] is "raster input". So the above loop seems to iterate over each raster file, but I do not see where you then iterate over each vector file for a given raster file. Or does the call to zonal_stats iterate through the vector files? Let's move on again ...
First, a few observations:
df = df accomplishes nothing.
('mean') evaluates to 'mean'. It's not wrong per se, but why not just x['mean']?
Your call to map is wrong: map takes a function followed by one or more iterables, and iterating the dataframe df yields its column labels rather than the dataframe itself.
You have a loop that creates and initializes variable lst, which could have been written as lst = [i for i in range(len(raster_path))] (better) or lst = list(range(len(raster_path))) (best).
Moving on ...
The problem is that processes, unlike threads, do not share the same address space. So if you use multiprocessing and pass df to each process, each one will be working on and modifying its own copy of it, which is not very helpful in your case. The problem with Python threads, however, is that they do not execute Python code in parallel, since such code must hold the so-called Global Interpreter Lock (library code written in C or C++ can be a different story). That is, the Python interpreter is not thread-safe when executing Python code. So multiprocessing it is. But the problem with multiprocessing is that, unlike with threads, creating new processes and moving data from one address space to another is expensive. This added expense can be more than compensated for by having code run in parallel that would otherwise run serially, but only if that code is CPU-intensive enough to overcome the additional overhead. So your poly2 function cannot be too trivial in terms of CPU requirements, or performance will be worse under multiprocessing.
Finally, to overcome the separate-address-space issue, I would have your poly2 function not attempt to modify the passed-in df dataframe, but instead return results that the main process then uses to update a single df instance:
from multiprocessing import Pool, cpu_count
from functools import partial

... # Other code omitted for brevity

def poly2(i, df):  ## i is for year
    array_name, trans_name = mask(filenames_dat[i], shapes=df.geometry, crop=True, nodata=np.nan)
    zs = zonal_stats(df, array_name[0], affine=trans_name, stats=['mean','sum'], nodata=np.nan, all_touched=True)
    return (i, [x['mean'] for x in zs], [x['sum'] for x in zs])

# Required by Windows (but okay even if not Windows):
if __name__ == '__main__':
    df = some_dataframe
    # Create a pool size no larger than both the number of CPU cores you have
    # and the number of tasks being submitted:
    pool_size = min(cpu_count(), len(raster_path))
    pool = Pool(pool_size)
    results = pool.map(partial(poly2, df=df), range(len(raster_path)))
    for i, mean, the_sum in results:
        df['amg'+str(filenames[i])] = mean
        df['mpg'+str(filenames[i])] = the_sum
    # shut down the pool:
    pool.close()
    pool.join()
If df is large (and working on the assumption that poly2 no longer modifies the dataframe), then you can avoid passing it with each submitted task by initializing the global storage of each pool process with its value:
from multiprocessing import Pool, cpu_count

... # Other code omitted for brevity

def init_pool_process(the_dataframe):
    global df
    df = the_dataframe  # Initialize global variable df

# Now df is no longer an argument but read from global storage:
def poly2(i):  ## i is for year
    array_name, trans_name = mask(filenames_dat[i], shapes=df.geometry, crop=True, nodata=np.nan)
    zs = zonal_stats(df, array_name[0], affine=trans_name, stats=['mean','sum'], nodata=np.nan, all_touched=True)
    return (i, [x['mean'] for x in zs], [x['sum'] for x in zs])

# Required by Windows (but okay even if not Windows):
if __name__ == '__main__':
    df = some_dataframe
    # Create a pool size no larger than both the number of CPU cores you have
    # and the number of tasks being submitted:
    pool_size = min(cpu_count(), len(raster_path))
    # Initialize the global storage of each pool process with df:
    pool = Pool(pool_size, initializer=init_pool_process, initargs=(df,))
    results = pool.map(poly2, range(len(raster_path)))
    for i, mean, the_sum in results:
        df['amg'+str(filenames[i])] = mean
        df['mpg'+str(filenames[i])] = the_sum
    # shut down the pool:
    pool.close()
    pool.join()
So now df is created once by the main process and copied once for each pool process instead of once for each task submitted by map.

multiprocessing with Process library does not give results

I'd like to do multi-core processing on very long lists (not numpy arrays!), but I can't get it to work. The examples I find do not help much either. My idea is to split the vector into several equal-sized parts, do something with the data, and return the modified data. The example operation is obviously simple; in reality it contains a number of if-statements and for-loops.
ncore = 4
size = 100
vectors = []
for icore in range(ncore):
    vectors.append([vector[ind:ind+size] for ind in range(icore * size, (icore + 1) * size)])
and the functions
from multiprocessing import Process

def some_func(vector):
    return [val*val for val in vector]

if True:
    procs = []
    for vector in vectors:
        # print(name)
        proc = Process(target=some_func, args=(vector,))
        procs.append(proc)
        proc.start()
    # complete the processes
    for proc in procs:
        proc.join()
But there is no output.
Any solutions?
Thanks, Andreas
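A minimal sketch of one way to get results back, following the goal described above (the data and sizes here are made up): multiprocessing.Process discards the return value of its target, whereas multiprocessing.Pool.map collects the return values from the workers.
from multiprocessing import Pool

def some_func(vector):
    return [val * val for val in vector]

if __name__ == '__main__':
    vector = list(range(400))   # example data
    ncore = 4
    size = len(vector) // ncore
    # split the vector into ncore equal-sized chunks
    chunks = [vector[icore * size:(icore + 1) * size] for icore in range(ncore)]
    with Pool(ncore) as pool:
        # one result list per chunk, returned in input order
        results = pool.map(some_func, chunks)
    # flatten the per-chunk results back into a single list
    processed = [val for chunk in results for val in chunk]
    print(len(processed))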

code is faster on single cpu but very slow on multiple processes why?

I have some code that sorts values from a sparse matrix and zips them together with other data. I applied some optimizations, and the code is now 20x faster than it originally was, as shown below.
This code takes 8 s on a single CPU core:
# cosine_sim is a sparse csr matrix
# names is a numpy array of length 400k
cosine_sim_labeled = []
for i in range(0, cosine_sim.shape[0]):
    row = cosine_sim.getrow(i).toarray()[0]
    non_zero_sim_indexes = np.nonzero(row)
    non_zero_sim_values = row[non_zero_sim_indexes]
    non_zero_sim_values = [round(freq, 4) for freq in non_zero_sim_values]
    non_zero_names_values = np.take(names, non_zero_sim_indexes)[0]
    zipped = zip(non_zero_names_values, non_zero_sim_values)
    cosine_sim_labeled.append(sorted(zipped, key=lambda cv: -cv[1])[1:][:top_similar_count])
But if I use the same code with multiple cores (to make it even faster), it takes 300 seconds:
# split is an array of arrays of indexes like [[1,2,3], [4,5,6]]; it is meant to generate
# batches of row indexes to be processed by each parallel process
split = np.array_split(range(0, cosine_sim.shape[0]), cosine_sim.shape[0] / batch)

def sort_rows(split):
    cosine_sim_labeled = []
    for i in split:
        row = cosine_sim.getrow(i).toarray()[0]
        non_zero_sim_indexes = np.nonzero(row)
        non_zero_sim_values = row[non_zero_sim_indexes]
        non_zero_sim_values = [round(freq, 4) for freq in non_zero_sim_values]
        non_zero_names_values = np.take(names, non_zero_sim_indexes)[0]
        zipped = zip(non_zero_names_values, non_zero_sim_values)
        cosine_sim_labeled.append(sorted(zipped, key=lambda cv: -cv[1])[1:][:top_similar_count])
    return cosine_sim_labeled

# this ensures parallel CPU execution
rows = Parallel(n_jobs=CPU_use, verbose=40)(delayed(sort_rows)(x) for x in split)
cosine_sim_labeled = np.vstack(rows).tolist()
Note that your parallel version ships all of the data to every worker: sort_rows references the full cosine_sim matrix and the names array, which (in this setup) have to be serialized and copied out to the worker processes, which takes time, and the per-batch results then have to be serialized and sent back to the main process, which again takes time. That overhead easily outweighs the fairly cheap per-row computation.
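One common mitigation, sketched below under the assumption that cosine_sim, names, batch and top_similar_count exist in the main script, is the same pool-initializer pattern shown in an earlier answer on this page: ship the large objects to each worker once instead of once per task, and send back only the labeled rows.
from multiprocessing import Pool
import numpy as np

def init_worker(sim_matrix, name_array, top_n):
    # runs once per worker process; stores the shared objects as worker globals
    global cosine_sim, names, top_similar_count
    cosine_sim = sim_matrix
    names = name_array
    top_similar_count = top_n

def sort_rows(index_batch):
    labeled = []
    for i in index_batch:
        row = cosine_sim.getrow(i).toarray()[0]
        nz = np.nonzero(row)
        values = [round(v, 4) for v in row[nz]]
        labels = np.take(names, nz)[0]
        labeled.append(sorted(zip(labels, values), key=lambda cv: -cv[1])[1:][:top_similar_count])
    return labeled

if __name__ == '__main__':
    n_batches = max(1, cosine_sim.shape[0] // batch)
    split = np.array_split(range(cosine_sim.shape[0]), n_batches)
    with Pool(initializer=init_worker, initargs=(cosine_sim, names, top_similar_count)) as pool:
        rows = pool.map(sort_rows, split)
    # flatten the per-batch results
    cosine_sim_labeled = [item for row_batch in rows for item in row_batch]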

Use multiprocessing to update cells in google spreadsheet?

I am trying to calculate numbers in parallel and put them into cells in a google spreadsheet. The following is my code:
import multiprocessing, ezsheets

ss = ezsheets.Spreadsheet(spreadsheet_url)
sheet2 = ss[1]

def myfunc(inputs):
    a = sum(inputs)
    sheet2['A1'] = a
    return

processes = []
for i in range(1, 5):
    p = multiprocessing.Process(target=myfunc, args=[[1, 2, 3]])
    p.start()
    processes.append(p)

for process in processes:
    process.join()
But it does not change the cell. What am I doing wrong?
I am calling a function that uses GetHistoryRequest from telethon. Could that be a problem?
The main problem is that with multiprocessing each process has its own memory space and therefore sees its own copy of the variable sheet2.
A secondary issue is that your code invokes myfunc repeatedly with the same argument, updating the same cell with the same value each time, so this is not a realistic use case. A more realistic example would be one where you need to set 5 different cells by invoking myfunc with 5 different arguments. The easiest way to solve this is not to have myfunc attempt to update a shared spreadsheet, but rather to have it return to the main process the value that needs to be set, and to let the main process do the actual cell setting. The easiest way to return a value from a subprocess is to use a process pool:
from concurrent.futures import ProcessPoolExecutor
import ezsheets

def myfunc(inputs):
    return sum(inputs)

if __name__ == '__main__':  # required for Windows
    ss = ezsheets.Spreadsheet(spreadsheet_url)
    sheet2 = ss[1]
    with ProcessPoolExecutor(max_workers=5) as executor:
        a1 = executor.submit(myfunc, [1, 2, 3])
        a2 = executor.submit(myfunc, [4, 5, 6])
        a3 = executor.submit(myfunc, [7, 8, 9])
        a4 = executor.submit(myfunc, [8, 9, 10])
        a5 = executor.submit(myfunc, [11, 12, 13])
        sheet2['A1'] = a1.result()
        sheet2['A2'] = a2.result()
        sheet2['A3'] = a3.result()
        sheet2['A4'] = a4.result()
        sheet2['A5'] = a5.result()
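If there are many cells to fill, executor.map keeps this shorter; a small sketch under the same assumptions as the answer above (myfunc, sheet2 and the spreadsheet setup already defined):
from concurrent.futures import ProcessPoolExecutor

inputs = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [8, 9, 10], [11, 12, 13]]
with ProcessPoolExecutor(max_workers=5) as executor:
    # executor.map preserves input order, so row numbers line up with inputs
    for row, value in enumerate(executor.map(myfunc, inputs), start=1):
        sheet2['A' + str(row)] = value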

Parallelizing a code in Python

I'm working with a commercial analysis software package called Abaqus, which has a Python interface for reading the output values.
I have given a sample of the code (which doesn't run on its own) below:
myOdb contains all the information from which I am extracting the data. The caveat is that I cannot open the file from 2 separate programs.
Code 1 and Code 2 shown below work independently of each other; all they need is myOdb.
Is there a way to parallelize Code 1 and Code 2 after I read the odb?
# Open the odb file
myOdb = session.openOdb(name=odbPath)

# Code 1
for i in range(1, NoofSteps+1):
    frames = myOdb.steps[stepName].frames
    lastframe = frames[-1]
    RFD = lastframe.fieldOutputs['RF']
    sum1 = 0
    for value in RFD.values:
        sum1 = sum1 + value.data[1]

# Code 2
for i in range(1, NoofSteps+1):
    frames = myOdb.steps[stepName].frames
    lastframe = frames[-1]
    for j in range(4, 13):
        file2 = open('Fp'+str(j)+stepName, 'w')
        b = lastframe.fieldOutputs[var+str(j)]
        fieldValues = b.values
        for v in fieldValues:
            file2.write('%d %6.15f\n' % (v.elementLabel, v.data))
If all you're trying to do is achieve a basic level of multiprocessing, this is what you need:
import multiprocessing

# Push the logic of code 1 and code 2 into 2 functions. Pass whatever you need
# these functions to access as arguments.
def code_1(odb_object, NoofSteps):
    for i in range(1, NoofSteps+1):
        frames = odb_object.steps[stepName].frames
        # stepName? Where did this variable come from? Is it "i"?
        lastframe = frames[-1]
        RFD = lastframe.fieldOutputs['RF']
        sum1 = 0
        for value in RFD.values:
            sum1 = sum1 + value.data[1]

def code_2(odb_object, NoofSteps):
    for i in range(1, NoofSteps+1):
        frames = odb_object.steps[stepName].frames
        # stepName? Where did this variable come from? Is it "i"?
        lastframe = frames[-1]
        for j in range(4, 13):
            file2 = open('Fp'+str(j)+stepName, 'w')
            b = lastframe.fieldOutputs[var+str(j)]
            fieldValues = b.values
            for v in fieldValues:
                file2.write('%d %6.15f\n' % (v.elementLabel, v.data))

if __name__ == "__main__":
    # Open the odb file
    myOdb = session.openOdb(name=odbPath)
    # Create process objects that lead to those functions and pass the
    # object as an argument.
    p1 = multiprocessing.Process(target=code_1, args=(myOdb, NoofSteps,))
    p2 = multiprocessing.Process(target=code_2, args=(myOdb, NoofSteps,))
    # start both jobs
    p1.start()
    p2.start()
    # Wait for each to finish.
    p1.join()
    p2.join()
    # Done
Isolate the "main" portion of your code into a main block as I have shown above, and do not, and I mean absolutely do not, use global variables. Be sure that all the variables you're using are available in the namespace of each function.
I recommend learning more about Python and the GIL problem. Read about the multiprocessing module here.
