I am trying to calculate numbers in parallel and put them into cells in a google spreadsheet. The following is my code:
import multiprocessing, ezsheets
ss = ezsheets.Spreadsheet(spreadsheet_url)
sheet2 = ss[1]
def myfunc(inputs):
    a = sum(inputs)
    sheet2['A1'] = a
    return
processes = []
for i in range(1, 5):
    p = multiprocessing.Process(target=myfunc, args=[[1, 2, 3]])
    p.start()
    processes.append(p)
for process in processes:
    process.join()
But it does not change a cell. What am I doing wrong?
I am calling a function that uses GetHistoryRequest from telethon. Does that cause a problem?
The main problem is that with multiprocessing each process has its own memory space and therefore sees its own copy of variable sheet2.
A secondary issue is that your code invokes myfunc 4 times (range(1,5) yields 1 through 4) with the same argument and updates the same cell each time with the same value, so this is not a realistic use case. A more realistic example would be one where you need to set 5 different cells by invoking myfunc with 5 different arguments. The easiest way to solve this is not to have myfunc attempt to update a shared spreadsheet, but rather to have it return the value that needs to be set in the cell to the main process, and to have the main process do the actual cell setting. The easiest way to return a value from a subprocess is to use a process pool:
from concurrent.futures import ProcessPoolExecutor
import ezsheets
def myfunc(inputs):
    return sum(inputs)
if __name__ == '__main__': # required for Windows
    ss = ezsheets.Spreadsheet(spreadsheet_url)
    sheet2 = ss[1]
    with ProcessPoolExecutor(max_workers=5) as executor:
        a1 = executor.submit(myfunc, [1,2,3])
        a2 = executor.submit(myfunc, [4,5,6])
        a3 = executor.submit(myfunc, [7,8,9])
        a4 = executor.submit(myfunc, [8,9,10])
        a5 = executor.submit(myfunc, [11,12,13])
    sheet2['A1'] = a1.result()
    sheet2['A2'] = a2.result()
    sheet2['A3'] = a3.result()
    sheet2['A4'] = a4.result()
    sheet2['A5'] = a5.result()
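With more than a handful of cells, the same idea can be driven from a mapping of cell name to inputs. The following is only a sketch: compute_cells and the jobs dictionary are hypothetical names introduced here, and the print call stands in for the sheet2[cell] = value assignment so the example runs without a spreadsheet.

```python
from concurrent.futures import ProcessPoolExecutor


def myfunc(inputs):
    # Runs in a worker process; it only computes and returns a value.
    return sum(inputs)


def compute_cells(jobs, executor):
    """Return {cell: value} for a {cell: inputs} mapping (hypothetical helper)."""
    cells = list(jobs)
    # executor.map yields results in the same order as its inputs.
    values = executor.map(myfunc, [jobs[cell] for cell in cells])
    return dict(zip(cells, values))


if __name__ == '__main__':  # required for Windows
    jobs = {
        'A1': [1, 2, 3],
        'A2': [4, 5, 6],
        'A3': [7, 8, 9],
    }
    with ProcessPoolExecutor(max_workers=5) as executor:
        results = compute_cells(jobs, executor)
    # Only the main process writes to the sheet.
    for cell, value in results.items():
        print(cell, value)  # stand-in for: sheet2[cell] = value
```

Because compute_cells takes the executor as an argument, it can be tried with a thread pool first, which is easier to debug than separate processes.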
I have a script that loops over a pandas dataframe and outputs GIS data to a geopackage based on some searches and geometry manipulation. It works when I use a for loop, but with over 4k records it takes a while. Since I have it built as its own function that returns what I need based on a row iteration, I tried to run it with multiprocessing:
import pandas as pd, bwe_mapping
from multiprocessing import Pool
#Sample dataframe
bwes = [['id', 7216],['item_id', 3277841], ['Date', '2019-01-04T00:00:00.000Z'], ['start_lat', -56.92], ['start_lon', 45.87], ['End_lat', -59.87], ['End_lon', 44.67]]
bwedf = pd.read_csv(bwes)
geopackage = "datalocation\geopackage.gpkg"
tracklayer = "tracks"
if __name__=='__main__':
    def task(item):
        bwe_mapping.map_bwe(item, geopackage, tracklayer)
    pool = Pool()
    for index, row in bwedf.iterrows():
        task(row)
    with Pool() as pool:
        for results in pool.imap_unordered(task, bwedf.iterrows()):
            print(results)
When I run this, my Task Manager populates with 16 new Python tasks, but there is no sign that anything is being done. Would it be better to use numpy.array_split() to break up my pandas df into 4 or 8 smaller ones and run the for index, row in bwedf.iterrows(): loop for each dataframe on its own processor?
The rows do not need to be processed in any particular order, as long as I can store the outputs, which are GeoPandas dataframes, in a list to concatenate into geopackage layers at the end.
Should I have put the for loop in the function and just passed it the whole dataframe and GIS data to search?
If you are running on Windows or macOS, multiprocessing uses spawn to create the workers, which means that every child MUST be able to find the function it is going to execute when it imports your main script.
Your code has the function definition inside the if __name__=='__main__': block, so the children don't have access to it.
Simply moving the function def to before if __name__=='__main__': will make it work.
What is happening is that each child crashes when it tries to run the function, because it never saw its definition.
Minimal code to reproduce the problem:
from multiprocessing import Pool
if __name__ == '__main__':
    def task(item):
        print(item)
        return item
    pool = Pool()
    with Pool() as pool:
        for results in pool.imap_unordered(task, range(10)):
            print(results)
and the solution is to move the function definition to before the if __name__=='__main__': line.
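Written out, the fix described here looks roughly like the following sketch. The function sits at module level, so a spawned child finds it when it imports the script, and only the pool setup stays under the guard.

```python
from multiprocessing import Pool


def task(item):
    # Defined at import time, so spawned workers can look it up by name.
    return item


if __name__ == '__main__':
    # Only the parent process runs this block.
    with Pool() as pool:
        for result in pool.imap_unordered(task, range(10)):
            print(result)
```

The printed order can vary from run to run, because imap_unordered yields results as workers finish them.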
Edit: to iterate over the rows of a dataframe, this simple example demonstrates how to do it. Note that iterrows yields an (index, row) pair for each row, which is why the item is unpacked inside task.
import os
import pandas as pd
from multiprocessing import Pool
import time
# Sample dataframe
bwes = [['id', 7216], ['item_id', 3277841], ['Date', '2019-01-04T00:00:00.000Z'], ['start_lat', -56.92],
        ['start_lon', 45.87], ['End_lat', -59.87], ['End_lon', 44.67]]
bwef = pd.DataFrame(bwes)
def task(item):
    time.sleep(1)
    index, row = item
    # print(os.getpid(), tuple(row))
    return str(os.getpid()) + " " + str(tuple(row))
if __name__ == '__main__':
    with Pool() as pool:
        for results in pool.imap_unordered(task, bwef.iterrows()):
            print(results)
The time.sleep(1) is only there because there is only a small amount of work and one worker might grab it all, so I am forcing every worker to wait for the others. You should remove it in real code. The result is as follows:
13228 ('id', 7216)
11376 ('item_id', 3277841)
15580 ('Date', '2019-01-04T00:00:00.000Z')
10712 ('start_lat', -56.92)
11376 ('End_lat', -59.87)
13228 ('start_lon', 45.87)
10712 ('End_lon', 44.67)
It seems like your "example" dataframe is transposed, but you just have to construct the dataframe correctly. I'd recommend you first run the code serially with iterrows before running it across multiple cores.
Sending data to the workers and back from them takes time, so make sure each worker is doing a lot of computational work and not just sending data back to the parent process.
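One way to reduce that per-item overhead is the chunksize argument of imap_unordered, which hands each worker a batch of items at a time instead of one. The sketch below uses plain tuples and a made-up track_length function as a stand-in for the real per-row work; with a dataframe, the rows could come from df.itertuples(index=False, name=None) instead. The value 250 is an arbitrary assumption to tune for your data.

```python
from multiprocessing import Pool


def track_length(row):
    """Hypothetical per-row work: rough extent of a track in degrees."""
    start_lat, start_lon, end_lat, end_lon = row
    return abs(end_lat - start_lat) + abs(end_lon - start_lon)


if __name__ == '__main__':
    # Made-up data standing in for about 4k dataframe rows.
    rows = [(-56.92, 45.87, -59.87, 44.67)] * 4000
    with Pool() as pool:
        # Each worker receives 250 rows per round trip instead of 1.
        results = list(pool.imap_unordered(track_length, rows, chunksize=250))
    print(len(results))
```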
I'm working with a commercial analysis software called Abaqus which has a Python interface to read the output values.
I have just given a sample code (which doesn't run) below:
myOdb contains all the information from which I am extracting the data. The caveat is that I cannot open the file using 2 separate programs.
Code 1 and Code 2 shown below work independently of each other; all they need is myOdb.
Is there a way to parallelize Code 1 and Code 2 after I read the odb?
# Open the odb file
myOdb = session.openOdb(name=odbPath)
# Code 1
for i in range(1, NoofSteps+1):
    frames = myOdb.steps[stepName].frames
    lastframe = frames[-1]
    RFD = lastframe.fieldOutputs['RF']
    sum1 = 0
    for value in RFD.values:
        sum1 = sum1 + value.data[1]
# Code 2
for i in range(1, NoofSteps+1):
    frames = myOdb.steps[stepName].frames
    lastframe = frames[-1]
    for j in range(4, 13):
        file2 = open('Fp'+str(j)+stepName, 'w')
        b = lastframe.fieldOutputs[var+str(j)]
        fieldValues = b.values
        for v in fieldValues:
            file2.write('%d %6.15f\n' % (v.elementLabel, v.data))
If all you're trying to do is achieve a basic level of multiprocessing, this is what you need:
import multiprocessing
#Push the logic of code 1 and code 2 into 2 functions. Pass whatever you need
#these functions to access as arguments.
def code_1(odb_object, NoofSteps):
    for i in range(1, NoofSteps+1):
        frames = odb_object.steps[stepName].frames
        #stepName? Where did this variable come from? Is it "i"?
        lastframe = frames[-1]
        RFD = lastframe.fieldOutputs['RF']
        sum1 = 0
        for value in RFD.values:
            sum1 = sum1 + value.data[1]
def code_2(odb_object, NoofSteps):
    for i in range(1, NoofSteps+1):
        frames = odb_object.steps[stepName].frames
        #stepName? Where did this variable come from? Is it "i"?
        lastframe = frames[-1]
        for j in range(4, 13):
            file2 = open('Fp'+str(j)+stepName, 'w')
            b = lastframe.fieldOutputs[var+str(j)]
            fieldValues = b.values
            for v in fieldValues:
                file2.write('%d %6.15f\n' % (v.elementLabel, v.data))
if __name__ == "__main__":
    # Open the odb file
    myOdb = session.openOdb(name=odbPath)
    #Create process objects that lead to those functions and pass the
    #object as an argument.
    p1 = multiprocessing.Process(target=code_1, args=(myOdb, NoofSteps,))
    p2 = multiprocessing.Process(target=code_2, args=(myOdb, NoofSteps,))
    #start both jobs
    p1.start()
    p2.start()
    #Wait for each to finish.
    p1.join()
    p2.join()
    #Done
Isolate the "main" portion of your code into a main block like I have shown above. Do not, and I mean absolutely do not, use global variables. Be sure that all the variables you're using are available in the namespace of each function.
I recommend learning more about Python and the GIL problem, and reading the documentation for the multiprocessing module.
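Note that code_1 computes sum1 but never hands it back, and a value computed in a child process is gone once that process exits. Here is a minimal sketch of returning it through a pool, assuming the reaction-force values can first be copied out of the odb into plain Python lists in the parent. The sketch does not rely on the odb object itself being passable to a child process. The data and the name sum_component are made up for illustration.

```python
from multiprocessing import Pool


def sum_component(values, component=1):
    """Sum one component over a list of vectors (stand-in for RFD.values data)."""
    return sum(v[component] for v in values)


if __name__ == "__main__":
    # Hypothetical data: one list of (RF1, RF2) pairs per step.
    steps = {
        "Step-1": [(0.0, 1.5), (0.2, 2.5)],
        "Step-2": [(0.1, 3.0), (0.3, 4.0)],
    }
    with Pool(processes=2) as pool:
        # Submit every step first so the sums are computed concurrently.
        pending = {name: pool.apply_async(sum_component, (vals,))
                   for name, vals in steps.items()}
        # get() blocks until the worker has returned its value.
        totals = {name: res.get() for name, res in pending.items()}
    print(totals)
```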
This is my first time trying to use multiprocessing in Python. I'm trying to parallelize my function fun over my dataframe df by row. The callback function is just to append results to an empty list that I'll sort through later.
Is this the correct way to use apply_async? Thanks so much.
import multiprocessing as mp
function_results = []
async_results = []
p = mp.Pool() # by default should use number of processors
for row in df.iterrows():
    r = p.apply_async(fun, (row,), callback=function_results.extend)
    async_results.append(r)
for r in async_results:
    r.wait()
p.close()
p.join()
It looks like using map or imap_unordered (depending on whether you need your results to be ordered or not) would better suit your needs:
import multiprocessing as mp
#prepare stuff
if __name__=="__main__":
    p = mp.Pool()
    function_results = list(p.imap_unordered(fun, df.iterrows())) #unordered
    #function_results = p.map(fun, df.iterrows()) #ordered
    p.close()
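If you would rather keep apply_async, the following sketch shows one way to collect the results. The callback receives whatever fun returns. With list.extend that return value must be iterable and its elements are added one by one, so list.append is usually the safer choice when you want one entry per row. The body of fun and the sample rows are hypothetical stand-ins for the (index, row) pairs that df.iterrows() yields.

```python
import multiprocessing as mp


def fun(item):
    """Hypothetical stand-in: item is an (index, row) pair."""
    index, row = item
    return index, sum(row)


if __name__ == "__main__":
    rows = [(0, [1, 2]), (1, [3, 4]), (2, [5, 6])]
    function_results = []
    with mp.Pool() as pool:
        async_results = [
            pool.apply_async(fun, (item,), callback=function_results.append)
            for item in rows
        ]
        for r in async_results:
            # get() waits for the result and re-raises any worker exception,
            # which wait() alone would silently ignore.
            r.get()
    print(sorted(function_results))
```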
I have written a function that takes a list of peptides (biological sequences stored as strings) and returns a pandas data frame (samples as rows and descriptors as columns). "my_function(pep_list)" takes pep_list as a parameter and returns the data frame. It iterates over each peptide sequence in pep_list, calculates the descriptors, combines all the data into a pandas data frame, and returns df:
pep_list = ["DAAAAEF","DAAAREF","DAAANEF","DAAADEF","DAAACEF","DAAAEEF","DAAAQEF","DAAAGEF","DAAAHEF","DAAAIEF","DAAALEF","DAAAKEF"]
example:
I want to parallelize this code with the algorithm given below:
1. get the number of processors available as n.
n = multiprocessing.cpu_count()
2. split pep_list into n sub-lists
sub_list_of_pep_list = pep_list/n
sub_list_of_pep_list = [["DAAAAEF","DAAAREF","DAAANEF"],["DAAADEF","DAAACEF","DAAAEEF"],["DAAAQEF","DAAAGEF","DAAAHEF"],["DAAAIEF","DAAALEF","DAAAKEF"]]
4. run "my_function()" on each core (for example, with 4 cores)
df0 = my_function(sub_list_of_pep_list[0])
df1 = my_function(sub_list_of_pep_list[1])
df2 = my_function(sub_list_of_pep_list[2])
df3 = my_function(sub_list_of_pep_list[3])
5. join all df = concat[df0,df1,df2,df3]
6. return df with nX speed.
Please suggest the most suitable library to implement this method.
thanks and regards.
Updated
With some reading, I was able to write code that works as I expected:
1. without parallelizing, it takes ~10 seconds for 10 peptide sequences
2. with two processes, it takes ~6 seconds for 12 peptides
3. with four processes, it takes ~4 seconds for 12 peptides
from multiprocessing import Process
def func1():
    structure_gen(pep_seq = ["DAAAAEF","DAAAREF","DAAANEF"])
def func2():
    structure_gen(pep_seq = ["DAAAQEF","DAAAGEF","DAAAHEF"])
def func3():
    structure_gen(pep_seq = ["DAAADEF","DAAALEF"])
def func4():
    structure_gen(pep_seq = ["DAAAIEF","DAAALEF"])
if __name__ == '__main__':
    p1 = Process(target=func1)
    p1.start()
    p2 = Process(target=func2)
    p2.start()
    p3 = Process(target=func1)
    p3.start()
    p4 = Process(target=func2)
    p4.start()
    p1.join()
    p2.join()
    p3.join()
    p4.join()
But while this code easily works with 10 peptides, I am not able to apply it to a pep_list that contains 1 million peptides.
thanks
multiprocessing.Pool.map is what you're looking for.
Try this:
import numpy as np
from multiprocessing import Pool
from pandas import concat
# I recommend using more partitions than processes,
# this way the work can be balanced.
# Of course this only makes sense if pep_list is bigger than
# the one you provide. If not, change this to 8 or so.
n = 50
# create indices for the partitions
ix = np.linspace(0, len(pep_list), n+1, endpoint=True, dtype=int)
# create partitions using the indices
sub_lists = [pep_list[i1:i2] for i1, i2 in zip(ix[:-1], ix[1:])]
p = Pool()
try:
    # p.map will return a list of dataframes which are to be
    # concatenated
    df = concat(p.map(my_function, sub_lists))
finally:
    p.close()
The pool will automatically contain as many processes as there are available cores. You can override this number if you want to; just have a look at the docs.
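The partitioning step can also be done without NumPy. The sketch below is one possible version, not the code from the answer: split_into_chunks is a hypothetical helper, and describe is a made-up stand-in for my_function that returns plain tuples instead of a dataframe so the example runs on its own.

```python
from multiprocessing import Pool


def split_into_chunks(items, n):
    """Split items into at most n contiguous chunks of near-equal size."""
    n = max(1, min(n, len(items)))
    size, extra = divmod(len(items), n)
    chunks = []
    start = 0
    for i in range(n):
        # The first `extra` chunks each take one additional item.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def describe(peptides):
    """Hypothetical stand-in for my_function: one result row per peptide."""
    return [(pep, len(pep)) for pep in peptides]


if __name__ == "__main__":
    pep_list = ["DAAAAEF", "DAAAREF", "DAAANEF", "DAAADEF", "DAAACEF"]
    chunks = split_into_chunks(pep_list, 3)
    with Pool() as pool:
        # pool.map returns one result per chunk, in chunk order.
        per_chunk = pool.map(describe, chunks)
    rows = [row for chunk in per_chunk for row in chunk]
    print(rows)
```

With a real my_function that returns dataframes, the final flattening line would be replaced by the concat call shown in the answer.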
I have code that reads data from multiple files named 001.txt, 002.txt, ... , 411.txt. I would like to read the data from each file, plot them, and save the plots as 001.jpg, 002.jpg, ... , 411.jpg.
I can do this by looping through the files, but I would like to use the multiprocessing module to speed things up.
However, when I use the code below, the computer hangs: I can't click on anything, although the mouse moves and the sound continues. I then have to power down the computer.
I'm obviously misusing the multiprocessing module with matplotlib. I have used something very similar to the code below to actually generate the data and save it to text files with no problems. What am I missing?
import multiprocessing
def do_plot(number):
    fig = figure(number)
    a, b = random.sample(range(1,9999),1000), random.sample(range(1,9999),1000)
    # generate random data
    scatter(a, b)
    savefig("%03d" % (number,) + ".jpg")
    print "Done ", number
    close()
for i in (0, 1, 2, 3):
    jobs = []
    # for j in chunk:
    p = multiprocessing.Process(target = do_plot, args = (i,))
    jobs.append(p)
    p.start()
    p.join()
The most important thing in using multiprocessing is to run the main code of the module only for the main process. This can be achieved by testing if __name__ == '__main__' as shown below:
import matplotlib.pyplot as plt
import numpy.random as random
from multiprocessing import Pool
def do_plot(number):
    fig = plt.figure(number)
    a = random.sample(1000)
    b = random.sample(1000)
    # generate random data
    plt.scatter(a, b)
    plt.savefig("%03d.jpg" % (number,))
    plt.close()
    print("Done ", number)
if __name__ == '__main__':
    pool = Pool()
    pool.map(do_plot, range(4))
Note also that I replaced the creation of the separate processes with a process pool (which scales better to many pictures, since it only uses as many processes as you have cores available).
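Applied to the 001.txt ... 411.txt files from the question, the same pattern might look like the sketch below. Several things here are assumptions rather than details from the question: each file is taken to hold two whitespace-separated columns (x and y), paths_for and plot_file are hypothetical names, and the non-interactive Agg backend is selected so that workers only write image files and never try to open a window.

```python
import os
from multiprocessing import Pool


def paths_for(number):
    """Return (input, output) file names, e.g. ('001.txt', '001.jpg')."""
    stem = "%03d" % number
    return stem + ".txt", stem + ".jpg"


def plot_file(number):
    # Imported inside the worker so each process configures matplotlib itself.
    import matplotlib
    matplotlib.use("Agg")  # non-interactive backend: renders to files only
    import matplotlib.pyplot as plt

    src, dst = paths_for(number)
    xs, ys = [], []
    with open(src) as fh:
        for line in fh:
            if not line.strip():
                continue  # skip blank lines
            x, y = line.split()  # assumed layout: two columns per line
            xs.append(float(x))
            ys.append(float(y))
    fig = plt.figure()
    plt.scatter(xs, ys)
    fig.savefig(dst)
    plt.close(fig)
    return dst


if __name__ == '__main__':
    # Only submit the input files that are actually present.
    numbers = [n for n in range(1, 412) if os.path.exists(paths_for(n)[0])]
    with Pool() as pool:
        for done in pool.imap_unordered(plot_file, numbers):
            print("Done", done)
```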