Parallelizing a code in Python

I'm working with a commercial analysis software called Abaqus which has a Python interface to read the output values.
I have given a sample code below (it does not run as-is):
myOdb contains all the information from which I am extracting the data. The caveat is that I cannot open the file with two separate programs.
Code 1 and Code 2 shown below work independently of each other; all they need is myOdb.
Is there a way to parallelize Code 1 and Code 2 after I read the odb?
# Open the odb file
myOdb = session.openOdb(name=odbPath)

# Code 1
for i in range(1, NoofSteps+1):
    frames = myOdb.steps[stepName].frames
    lastframe = frames[-1]
    RFD = lastframe.fieldOutputs['RF']
    sum1 = 0
    for value in RFD.values:
        sum1 = sum1 + value.data[1]

# Code 2
for i in range(1, NoofSteps+1):
    frames = myOdb.steps[stepName].frames
    lastframe = frames[-1]
    for j in range(4, 13):
        file2 = open('Fp' + str(j) + stepName, 'w')
        b = lastframe.fieldOutputs[var + str(j)]
        fieldValues = b.values
        for v in fieldValues:
            file2.write('%d %6.15f\n' % (v.elementLabel, v.data))

If all you're trying to do is achieve a basic level of multiprocessing, this is what you need:
import multiprocessing

# Push the logic of code 1 and code 2 into 2 functions. Pass whatever you need
# these functions to access as arguments.

def code_1(odb_object, NoofSteps):
    for i in range(1, NoofSteps+1):
        frames = odb_object.steps[stepName].frames
        # stepName? Where did this variable come from? Is it "i"?
        lastframe = frames[-1]
        RFD = lastframe.fieldOutputs['RF']
        sum1 = 0
        for value in RFD.values:
            sum1 = sum1 + value.data[1]

def code_2(odb_object, NoofSteps):
    for i in range(1, NoofSteps+1):
        frames = odb_object.steps[stepName].frames
        # stepName? Where did this variable come from? Is it "i"?
        lastframe = frames[-1]
        for j in range(4, 13):
            file2 = open('Fp' + str(j) + stepName, 'w')
            b = lastframe.fieldOutputs[var + str(j)]
            fieldValues = b.values
            for v in fieldValues:
                file2.write('%d %6.15f\n' % (v.elementLabel, v.data))

if __name__ == "__main__":
    # Open the odb file
    myOdb = session.openOdb(name=odbPath)

    # Create process objects that lead to those functions and pass the
    # odb object as an argument.
    p1 = multiprocessing.Process(target=code_1, args=(myOdb, NoofSteps))
    p2 = multiprocessing.Process(target=code_2, args=(myOdb, NoofSteps))

    # Start both jobs
    p1.start()
    p2.start()

    # Wait for each to finish.
    p1.join()
    p2.join()
    # Done
Isolate the "main" portion of your code into a main block as I have shown above, and do not, I mean absolutely do not, use global variables. Make sure that every variable a function uses is available in that function's namespace, for example by passing it in as an argument (see the sketch below).
I recommend learning more about Python and the GIL problem, and reading the documentation for the multiprocessing module.
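A minimal sketch of what that could look like (assuming stepName, var, odbPath and NoofSteps are defined in the main block, as in your script):
import multiprocessing

def code_1(odb_object, NoofSteps, stepName):
    # same body as code_1 above, with stepName passed in explicitly
    pass

def code_2(odb_object, NoofSteps, stepName, var):
    # same body as code_2 above, with stepName and var passed in explicitly
    pass

if __name__ == "__main__":
    myOdb = session.openOdb(name=odbPath)
    p1 = multiprocessing.Process(target=code_1, args=(myOdb, NoofSteps, stepName))
    p2 = multiprocessing.Process(target=code_2, args=(myOdb, NoofSteps, stepName, var))
    p1.start()
    p2.start()
    p1.join()
    p2.join()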

Related

Use multiprocessing to update cells in google spreadsheet?

I am trying to calculate numbers in parallel and put them into cells in a google spreadsheet. The following is my code:
import multiprocessing, ezsheets

ss = ezsheets.Spreadsheet(spreadsheet_url)
sheet2 = ss[1]

def myfunc(inputs):
    a = sum(inputs)
    sheet2['A1'] = a
    return

processes = []
for i in range(1, 5):
    p = multiprocessing.Process(target=myfunc, args=[[1, 2, 3]])
    p.start()
    processes.append(p)

for process in processes:
    process.join()
But it does not change the cell. What am I doing wrong?
I am also calling a function that uses GetHistoryRequest from telethon. Could that be causing a problem?
The main problem is that with multiprocessing each process has its own memory space and therefore sees its own copy of variable sheet2.
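A minimal sketch of that isolation (standard library only; nothing here is specific to ezsheets):
import multiprocessing

counter = 0

def bump():
    global counter
    counter += 1   # changes only the child's copy

if __name__ == '__main__':
    p = multiprocessing.Process(target=bump)
    p.start()
    p.join()
    print(counter)   # still prints 0 in the parent process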
A secondary issue is that your code invokes myfunc five times with the same argument, updating the same cell five times with the same value, so it is not a realistic use case. A more realistic example would set five different cells by invoking myfunc with five different arguments.
The easiest fix is not to have myfunc update a shared spreadsheet at all; instead, have it return to the main process the value that needs to go in the cell, and let the main process do the actual cell setting. The easiest way to return a value from a subprocess is to use a process pool:
from concurrent.futures import ProcessPoolExecutor
import ezsheets

def myfunc(inputs):
    return sum(inputs)

if __name__ == '__main__':  # required for Windows
    ss = ezsheets.Spreadsheet(spreadsheet_url)
    sheet2 = ss[1]
    with ProcessPoolExecutor(max_workers=5) as executor:
        a1 = executor.submit(myfunc, [1, 2, 3])
        a2 = executor.submit(myfunc, [4, 5, 6])
        a3 = executor.submit(myfunc, [7, 8, 9])
        a4 = executor.submit(myfunc, [8, 9, 10])
        a5 = executor.submit(myfunc, [11, 12, 13])
        sheet2['A1'] = a1.result()
        sheet2['A2'] = a2.result()
        sheet2['A3'] = a3.result()
        sheet2['A4'] = a4.result()
        sheet2['A5'] = a5.result()
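If the inputs were in a list, the same idea could be written with executor.map; a sketch (the input lists and cell layout are just made up for illustration):
from concurrent.futures import ProcessPoolExecutor
import ezsheets

def myfunc(inputs):
    return sum(inputs)

if __name__ == '__main__':
    ss = ezsheets.Spreadsheet(spreadsheet_url)
    sheet2 = ss[1]
    all_inputs = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [8, 9, 10], [11, 12, 13]]
    with ProcessPoolExecutor(max_workers=5) as executor:
        results = list(executor.map(myfunc, all_inputs))
    # the main process does all the cell setting
    for row, value in enumerate(results, start=1):
        sheet2['A' + str(row)] = value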

python multiprocessing with concurrent features

I am new to Python and have tried a lot of approaches to multiprocessing in Python without much benefit:
I have the task of implementing three methods, x, y and z. What I have tried so far is:
def foo():
    # iterate over the lines in a text file
    for line in lines:
        x1 = call_method_x()              # result from method x, say x1
        y1 = call_method_y()              # this uses x1; result, say y1
        for i in range(4):
            multiprocessing.Process(target=call_method_z())   # this uses y1
I used multiprocessing here on method_z as it is the most CPU-intensive.
I also tried it another way:
def foo():
    method_x()
    method_y()
    method_z()

def main():
    import concurrent.futures
    with concurrent.futures.ProcessPoolExecutor() as executor:
        executor.map(foo())
Which one is more appropriate? I checked the execution time, but there was not much of a difference. The thing is that method_x(), then method_y(), and then method_z() have to run in that order because each uses the output of the previous one. Both approaches work, but there is no significant benefit from using multiprocessing in either of them.
Please let me know if I am missing something here.
You can use multiprocessing.Pool from Python, something like:
from multiprocessing import Pool

with open(<path-to-file>) as f:
    data = f.readlines()

def method_x(line):
    # do something with the line
    pass

def method_y(line):
    x1 = method_x(line)
    # do something with x1

def method_z(line):
    y1 = method_y(line)
    # do something with y1

def call_home():
    p = Pool(6)
    p.map(method_z, data)
First you read all the lines into the variable data. Then you invoke 6 processes and allow each line to be processed by any of the 6 processes; see the usage sketch below.
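For completeness, a guarded entry point would look something like this (a sketch; the guard matters on Windows):
if __name__ == '__main__':
    call_home()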

Writing to file in Pool multiprocessing (Python 2.7)

I'm doing a lot of calculations and writing the results to a file. Using multiprocessing, I'm trying to parallelise the calculations.
The problem is that I'm writing to one output file, which all the workers are writing to. I'm quite new to multiprocessing and am wondering how I could make it work.
A very simple concept of the code is given below:
from multiprocessing import Pool

fout_ = open('test' + '.txt', 'w')

def f(x):
    fout_.write(str(x) + "\n")

if __name__ == '__main__':
    p = Pool(5)
    p.map(f, [1, 2, 3])
The result I want would be a file with:
1
2
3
However, I now get an empty file. Any suggestions?
I greatly appreciate any help :)!
You shouldn't let all the workers/processes write to a single file. They can all read from one file (which may cause slowdowns due to workers waiting for one of them to finish reading), but writing to the same file will cause conflicts and potentially corruption.
As said in the comments, write to separate files instead and then combine them into one in a single process. This small program illustrates it, based on the program in your post:
from multiprocessing import Pool

def f(args):
    ''' Perform computation and write
        to separate file for each '''
    x = args[0]
    fname = args[1]
    with open(fname, 'w') as fout:
        fout.write(str(x) + "\n")

def fcombine(orig, dest):
    ''' Combine files with names in
        orig into one file named dest '''
    with open(dest, 'w') as fout:
        for o in orig:
            with open(o, 'r') as fin:
                for line in fin:
                    fout.write(line)

if __name__ == '__main__':
    # Each sublist is a combination
    # of arguments - number and temporary output
    # file name
    x = range(1, 4)
    names = ['temp_' + str(y) + '.txt' for y in x]
    args = list(zip(x, names))

    p = Pool(3)
    p.map(f, args)
    p.close()
    p.join()

    fcombine(names, 'final.txt')
It runs f for each argument combination, which in this case is a value of x and a temporary file name. It uses a list of argument pairs because pool.map does not accept more than one argument per call. There are other ways around this, especially on newer Python versions; see the sketch below.
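For example, on Python 3 you could use Pool.starmap, which unpacks each argument tuple for you, so f can take two parameters directly (a sketch based on the code above):
from multiprocessing import Pool

def f(x, fname):
    with open(fname, 'w') as fout:
        fout.write(str(x) + "\n")

if __name__ == '__main__':
    x = range(1, 4)
    names = ['temp_' + str(y) + '.txt' for y in x]
    with Pool(3) as p:
        p.starmap(f, zip(x, names))   # calls f(1, 'temp_1.txt'), f(2, 'temp_2.txt'), ...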
For each argument combination and pool member it creates a separate file to which it writes the output. In practice your output will be longer; you can simply add whatever function computes it and call it from f. Also, there is no need to use Pool(5) for 3 arguments (though I assume that only three workers were active anyway).
The reasons for calling close() and join() are explained well in this post. It turns out (in the comments to the linked post) that map is blocking, so here you don't need them for the original reason (waiting until all workers finish and then writing the combined output file from just one process). I would still use them in case other parallel features are added later.
In the last step, fcombine gathers and copies all the temporary files into one. It's a bit deeply nested; if you decide, for instance, to remove each temporary file after copying it, you may want to pull the body under with open(dest, 'w') or the inner for loop into a separate function, for readability and functionality.
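If you do decide to remove each temporary file after copying it, a variant of fcombine along these lines should work (a sketch using os.remove):
import os

def fcombine_and_cleanup(orig, dest):
    ''' Combine files with names in orig into one
        file named dest, then delete the temporary files '''
    with open(dest, 'w') as fout:
        for o in orig:
            with open(o, 'r') as fin:
                for line in fin:
                    fout.write(line)
            # the temporary file has been copied, so it can be removed
            os.remove(o)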
multiprocessing.Pool spawns processes, and writing to a common file from each process without a lock can cause data loss.
As you said you are trying to parallelise the calculation, multiprocessing.Pool can be used to parallelize the computation.
Below is a solution that does the computation in parallel and writes the results to a file; I hope it helps:
from multiprocessing import Pool
# library for timing
import datetime

# file in which you want to write
fout = open('test.txt', 'wb')

# function for your calculations; written to be time-consuming
def calc(x):
    x = x**2
    sum = 0
    for i in range(0, 1000000):
        sum += i
    return x

# function to write to the txt file; it takes the list of items to write
def f(res):
    global fout
    for x in res:
        fout.write(str(x) + "\n")

if __name__ == '__main__':
    qs = datetime.datetime.now()
    arr = [1, 2, 3, 4, 5, 6, 7]
    p = Pool(5)
    res = p.map(calc, arr)
    # write the calculated list to the file
    f(res)
    qe = datetime.datetime.now()
    print (qe-qs).total_seconds()*1000

    # to compare the improvement from multiprocessing, an iterative solution
    qs = datetime.datetime.now()
    for item in arr:
        x = calc(item)
        fout.write(str(x) + "\n")
    qe = datetime.datetime.now()
    print (qe-qs).total_seconds()*1000

Launching other programs using a multiprocessing pool in Python is slow

My goal:
Use some function in Python to extract data from many datafiles simultaneously using different processors on my computer. I need to extract data from 1300 files and would therefore like Python to start extracting data from a new file once one extraction is complete. The extraction of data from a file is completely independent of the extraction of data from other files. The files are of a form requiring the program which created them (OrcaFlex) to be opened in order to extract data. Due to this, extracting data from a single file can be time-consuming. (I use Windows.)
My attempt:
Using multiprocessing.Pool().map_async to pool my tasks.
Code:
import multiprocessing as mp
import OrcFxAPI   # Package connected to external program

arcs = [1, 2, 5, 7, 9, 13]

# define an example function
def get_results(x):
    # Collects results from the external program:
    model = OrcFxAPI.Model(x[0])
    c = model['line_inner_barrel'].LinkedStatistics(['Ezy-Angle'], 1, objectExtra=OrcFxAPI.oeEndB).TimeSeriesStatistics('Ezy-Angle').Mean
    d = model['line_inner_barrel'].LinkedStatistics(['Ezy-Angle'], 1, objectExtra=OrcFxAPI.oeEndB).Query('Ezy-Angle', 'Ezy-Angle').ValueAtMax
    e = model['line_inner_barrel'].LinkedStatistics(['Ezy-Angle'], 1, objectExtra=OrcFxAPI.oeEndB).Query('Ezy-Angle', 'Ezy-Angle').ValueAtMin
    # Also does many other operations for extraction of results
    return [c, d, e]

if __name__ == '__main__':
    # METHOD WITH POOL - TAKES APPROX 1 HR 28 MIN
    # List of input needed for the get_results function:
    args = ((('CaseD%.3d.sim' % casenumber), arcs, 1) for casenumber in range(1, 25))
    pool = mp.Pool(processes=7)
    results = pool.map_async(get_results, args)
    pool.close()
    pool.join()

    # METHOD WITH FOR-LOOP - TAKES APPROX 1 HR 10 MIN
    # List of input needed for the get_results function:
    args2 = ((('CaseD%.3d.sim' % casenumber), arcs, 1) for casenumber in range(1, 25))
    for arg in args2:
        get_results(arg)
Problem:
Using a for-loop on a reduced set of 24 smaller datafiles took 1 hour 10 min, while using the Pool with 7 processes took 1 hour 28 min. Does anybody know why the run time is so slow, and not close to one seventh of the for-loop time?
Also, is there a way of knowing which processor multiprocessing.Pool() allocates to a given process? (In other words, can I let my process know which processor it is using?)
All help would be very much appreciated!

Saving multiple matplotlib figures with multiprocessing

I have a code which reads data from multiple files named 001.txt, 002.txt, ... , 411.txt. I would like to read the data from each file, plot them, and save as 001.jpg, 002.jpg, ... , 411.jpg.
I can do this by looping through the files, but I would like to use the multiprocessing module to speed things up.
However, when I use the code below, the computer hangs: I can't click on anything, but the mouse moves and the sound continues. I then have to power down the computer.
I'm obviously misusing the multiprocessing module with matplotlib. I have used something very similar to the code below to actually generate the data and save it to text files with no problems. What am I missing?
import multiprocessing

def do_plot(number):
    fig = figure(number)
    a, b = random.sample(range(1, 9999), 1000), random.sample(range(1, 9999), 1000)
    # generate random data
    scatter(a, b)
    savefig("%03d" % (number,) + ".jpg")
    print "Done ", number
    close()

for i in (0, 1, 2, 3):
    jobs = []
    # for j in chunk:
    p = multiprocessing.Process(target=do_plot, args=(i,))
    jobs.append(p)
    p.start()
    p.join()
The most important thing in using multiprocessing is to run the main code of the module only for the main process. This can be achieved by testing if __name__ == '__main__' as shown below:
import matplotlib.pyplot as plt
import numpy.random as random
from multiprocessing import Pool

def do_plot(number):
    fig = plt.figure(number)
    a = random.sample(1000)
    b = random.sample(1000)
    # generate random data
    plt.scatter(a, b)
    plt.savefig("%03d.jpg" % (number,))
    plt.close()
    print("Done ", number)

if __name__ == '__main__':
    pool = Pool()
    pool.map(do_plot, range(4))
Note also that I replaced the creation of the separate processes with a process pool, which scales better to many pictures since it only uses as many processes as you have cores available.
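For the full set of 411 files from the question, the call would just change the range, and the number of workers can be capped explicitly if needed (a sketch):
if __name__ == '__main__':
    pool = Pool(processes=4)           # cap the number of workers if desired
    pool.map(do_plot, range(1, 412))   # produces 001.jpg ... 411.jpg
    pool.close()
    pool.join()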
