I have code that reads data from multiple files named 001.txt, 002.txt, ..., 411.txt. I would like to read the data from each file, plot it, and save the plots as 001.jpg, 002.jpg, ..., 411.jpg.
I can do this by looping through the files, but I would like to use the multiprocessing module to speed things up.
However, when I use the code below, the computer hangs: I can't click on anything, although the mouse still moves and the sound continues. I then have to power down the computer.
I'm obviously misusing the multiprocessing module with matplotlib. I have used something very similar to the code below to actually generate the data and save it to text files with no problems. What am I missing?
import multiprocessing

def do_plot(number):
    fig = figure(number)
    a, b = random.sample(range(1,9999),1000), random.sample(range(1,9999),1000)
    # generate random data
    scatter(a, b)
    savefig("%03d" % (number,) + ".jpg")
    print "Done ", number
    close()

for i in (0, 1, 2, 3):
    jobs = []
    # for j in chunk:
    p = multiprocessing.Process(target = do_plot, args = (i,))
    jobs.append(p)
    p.start()
    p.join()
The most important thing in using multiprocessing is to run the main code of the module only for the main process. This can be achieved by testing if __name__ == '__main__' as shown below:
import matplotlib.pyplot as plt
import numpy.random as random
from multiprocessing import Pool

def do_plot(number):
    fig = plt.figure(number)
    a = random.sample(1000)
    b = random.sample(1000)
    # generate random data
    plt.scatter(a, b)
    plt.savefig("%03d.jpg" % (number,))
    plt.close()
    print("Done ", number)

if __name__ == '__main__':
    pool = Pool()
    pool.map(do_plot, range(4))
Note also that I replaced the creation of separate processes with a process pool, which scales better to many pictures since by default it only uses as many processes as you have cores available.
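To make the pool idea concrete for the full set of 411 files, here is a minimal sketch. It uses only the standard library; `output_name`, `render`, and the `chunksize` value are hypothetical names and numbers chosen for illustration, and `render` is a placeholder where the real read/plot/savefig work would go.

```python
from multiprocessing import Pool


def output_name(number):
    """Build the zero-padded image name from the question, e.g. 7 -> '007.jpg'."""
    return "%03d.jpg" % number


def render(number):
    # Placeholder for the real work (read NNN.txt, plot, savefig).
    # Here it only returns the name of the file it would have written.
    return output_name(number)


if __name__ == "__main__":
    # Pool() defaults to one worker per available core.
    with Pool() as pool:
        # chunksize hands each worker several numbers at a time;
        # 16 is an assumed tuning value, not a recommendation.
        names = pool.map(render, range(1, 412), chunksize=16)
    print(len(names), names[0], names[-1])
```

Because `render` lives at module level and the pool is created under the `__main__` guard, the same script should behave the same way whether workers are forked or spawned.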
I have a script that loops over a pandas dataframe and outputs GIS data to a geopackage based on some searches and geometry manipulation. It works when I use a for loop, but with over 4k records it takes a while. Since I have it built as its own function that returns what I need for each row, I tried to run it with multiprocessing:
import pandas as pd, bwe_mapping
from multiprocessing import Pool
#Sample dataframe
bwes = [['id', 7216],['item_id', 3277841], ['Date', '2019-01-04T00:00:00.000Z'], ['start_lat', -56.92], ['start_lon', 45.87], ['End_lat', -59.87], ['End_lon', 44.67]]
bwedf = pd.read_csv(bwes)
geopackage = "datalocation\geopackage.gpkg"
tracklayer = "tracks"

if __name__=='__main__':
    def task(item):
        bwe_mapping.map_bwe(item, geopackage, tracklayer)
    pool = Pool()
    for index, row in bwedf.iterrows():
        task(row)
    with Pool() as pool:
        for results in pool.imap_unordered(task, bwedf.iterrows()):
            print(results)
When I run this, my Task Manager populates with 16 new python tasks but there is no sign that anything is being done. Would it be better to use numpy.array_split() to break up my pandas df into 4 or 8 smaller ones and run the for index, row in bwedf.iterrows(): loop for each dataframe on its own processor?
The rows do not need to be processed in any particular order, as long as I can store the outputs, which are geopandas dataframes, in a list to concatenate into geopackage layers at the end.
Should I have put the for loop in the function and just passed it the whole dataframe and gis data to search?
If you are running on Windows or macOS, multiprocessing uses spawn to create the workers, which means that every child MUST be able to find the function it is going to execute when it imports your main script.
Your code has the function definition inside your if __name__=='__main__': block, so the children don't have access to it.
Simply moving the function def to before if __name__=='__main__': will make it work.
What is happening is that each child crashes when it tries to run the function, because it never saw its definition.
Minimal code to reproduce the problem:
from multiprocessing import Pool

if __name__ == '__main__':
    def task(item):
        print(item)
        return item
    pool = Pool()
    with Pool() as pool:
        for results in pool.imap_unordered(task, range(10)):
            print(results)
The solution is to move the function definition to before the if __name__=='__main__': line.
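Applied to the minimal reproduction, the fix might look like the sketch below. The only changes are that `task` is defined at module level and the redundant `pool = Pool()` before the `with` statement (which created a second, unused pool) is dropped.

```python
from multiprocessing import Pool


def task(item):
    # Defined at module level, so a spawned worker finds it
    # when it re-imports this script.
    return item


if __name__ == "__main__":
    # Only the parent process runs this block.
    with Pool() as pool:
        for result in pool.imap_unordered(task, range(10)):
            print(result)
```

Since `imap_unordered` yields results as workers finish, the printed order is not guaranteed to match the input order.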
Edit: to iterate over the rows of a dataframe, this simple example demonstrates how to do it. Note that iterrows yields an (index, row) pair, which is why the item is unpacked inside task.
import os
import pandas as pd
from multiprocessing import Pool
import time
# Sample dataframe
bwes = [['id', 7216], ['item_id', 3277841], ['Date', '2019-01-04T00:00:00.000Z'], ['start_lat', -56.92],
        ['start_lon', 45.87], ['End_lat', -59.87], ['End_lon', 44.67]]
bwef = pd.DataFrame(bwes)

def task(item):
    time.sleep(1)
    index, row = item
    # print(os.getpid(), tuple(row))
    return str(os.getpid()) + " " + str(tuple(row))

if __name__ == '__main__':
    with Pool() as pool:
        for results in pool.imap_unordered(task, bwef.iterrows()):
            print(results)
The time.sleep(1) is only there because there is so little work that one worker might grab it all, so I am forcing every worker to wait for the others. You should remove it in real code. The result is as follows:
13228 ('id', 7216)
11376 ('item_id', 3277841)
15580 ('Date', '2019-01-04T00:00:00.000Z')
10712 ('start_lat', -56.92)
11376 ('End_lat', -59.87)
13228 ('start_lon', 45.87)
10712 ('End_lon', 44.67)
It seems like your "example" dataframe is transposed, but you just have to construct the dataframe correctly. I'd recommend you first run the code serially with iterrows before running it across multiple cores.
Sending data to the workers and back from them takes time, so make sure each worker does a lot of computational work per item rather than mostly shipping data back to the parent process.
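One way to give each worker more work per message, in the spirit of the question's idea of splitting the dataframe, is to send chunks of rows instead of single rows. The sketch below is an illustration only: it uses plain tuples instead of a pandas dataframe so that it stays self-contained, and `split_rows`, `process_chunk`, and the toy arithmetic are hypothetical stand-ins for the real GIS work.

```python
from multiprocessing import Pool


def split_rows(rows, n_chunks):
    """Split a list into at most n_chunks contiguous, near-equal pieces."""
    size, extra = divmod(len(rows), n_chunks)
    chunks, start = [], 0
    for k in range(n_chunks):
        # The first `extra` chunks take one additional row.
        stop = start + size + (1 if k < extra else 0)
        if stop > start:
            chunks.append(rows[start:stop])
        start = stop
    return chunks


def process_chunk(chunk):
    # Stand-in for the per-row work: returns one result per row.
    return [(row_id, lat + lon) for row_id, lat, lon in chunk]


if __name__ == "__main__":
    rows = [(i, -56.92 + i, 45.87) for i in range(10)]
    results = []
    with Pool(4) as pool:
        # Each message to a worker now carries a whole chunk of rows.
        for part in pool.imap_unordered(process_chunk, split_rows(rows, 4)):
            results.extend(part)
    print(len(results))
```

A lighter alternative is to keep one row per task and pass a `chunksize` argument to `imap_unordered`, which batches the tasks for you.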
Update: it's working after updating my Spyder to 5.0.5. Thanks everyone!
I am trying to speed up a loop using multiprocessing. The code below aims to generate 10000 random vectors.
My idea is to split the task into 5 processes and store it in result. However, it returned an empty list when I run the code.
But, if I remove result = add_one(result) in the randomize_data function, the code runs perfectly. So, the error must be coming from using functions from other modules (Testing.test) inside multiprocessing.
Here is the add_one function from Testing.test:
def add_one(x):
    return x+1
How can I use functions from other modules inside a process? Thank you.
import multiprocessing
import numpy as np
import pandas as pd

def randomize_data(mean, cov, n_init, proc_num, return_dict):
    result = pd.DataFrame()
    for _ in range(n_init):
        temp = np.random.multivariate_normal(mean, cov)
        result = result.append(pd.Series(temp), ignore_index=True)
    result = add_one(result)
    return_dict[proc_num] = result

if __name__ == "__main__":
    from Testing.test import add_one
    mean = np.arange(0, 1, 0.1)
    cov = np.identity(len(mean))
    manager = multiprocessing.Manager()
    return_dict = manager.dict()
    jobs = []
    for i in range(5):
        p = multiprocessing.Process(target=randomize_data, args=(mean, cov, 2000, i, return_dict, ))
        jobs.append(p)
        p.start()
    for proc in jobs:
        proc.join()
    result = return_dict.values()
The issue here is pretty obvious:
You imported add_one inside the if __name__ == "__main__": block, so the import only runs in the main process. Child processes that re-import the module (as happens with the spawn start method) never execute it, so the name add_one does not exist there.
Move this import statement to the top of your file with the other imports, and your code should work.
import multiprocessing
import numpy as np
import pandas as pd
from Testing.test import add_one
I'm attempting to get python multiprocessing working to speed up a code I've written. The code looks like this:
from multiprocessing import Array, Pool
import numpy as np
#setting up shared memory array
global misfit
misfit = Array('d', np.empty((dim1,dim2,dim3,dim4)).flat)
#looping through some values
for i in xrange(0,1):
    #setting up pool
    pool = Pool()
    p = [pool.apply_async(self.testfunc,args=(somevals,j)) for j in xrange(0,1)]
    pool.close()
    pool.join()
Where self.testfunc looks like:
def testfunc(self,somevals,j):
    #some calculations
    for k in xrange(0,1):
        #some calculations
        for mn in xrange(0,1):
            #some more calculations
            #save results
            result = i*j*k*mn # example
            misfit[i*j*k*mn] = result
My problem is that when I run this, none of the values are saved in the shared Array, and it remains empty. I understand this could have to do with the global variable, but in a simpler program that uses this exact setup, the values are saved to the array. The array is quite large in the full program as well (4561920000 values). Also, if I call this function outside of the Pool, it works and the values are saved.
So my question is: what am I doing wrong here? Am I sending the shared Array incorrectly?
EDIT: Figured I'd add in the code that works:
from multiprocessing import Array, Pool
from numpy import empty, sin
from time import time
import numpy as np

def initarr():
    a = Array('d', empty((5, 50, 80)).flat)
    return a

def testfunc(i, j, k):
    count = (i*50*80) + (j*80) + k
    x = sin(k)
    a[count] = x
    y = np.fft.fft(np.exp(2j*np.pi*np.arange(50000)/50000))

def process(i):
    start = time()
    pool = Pool()
    for j in xrange(0, 50):
        p = [pool.apply_async(testfunc, args=(i, j, k)) for k in xrange(0, 80)]
    pool.close()
    pool.join()
    print time() - start

global a
a = initarr()
for i in xrange(0, 5):
    process(i)
Ok so with the help of someone from our IT department, I finally have a version of this that works, so for anybody in the future viewing this question, I'll post a solution. I haven't really used stack overflow much so sorry if it's bad etiquette to answer my own question.
We got this working using an initializer function, but we had to make sure the initializer function was in the same file (module) as the function being run by the Pool. So in one module (misc) we had:
**misc.py**
def testfunc(self,somevals,j):
    #some calculations
    for k in xrange(0,len(krange)):
        #some calculations
        for mn in xrange(0,len(mnrange)):
            #some more calculations
            #save results
            # parentheses let the expression continue across two lines
            loc = ((i*len(jrange)*len(krange)*len(mnrange))+
                   (j*len(krange)*len(mnrange))+(k*len(mnrange))+mn)
            result = i*j*k*mn # example
            misfit[loc] = result

def initpool(a):
    # runs once in every worker: store the shared Array in a module global
    global misfit
    misfit = a
And in the main file we have:
**main.py**
from multiprocessing import Array, Pool
from misc import initpool, testfunc
import numpy as np
#setting up shared memory array
misfit = Array('d', np.empty((dim1,dim2,dim3,dim4)).flat)
#looping through some values
for i in xrange(0,len(irange)):
    #setting up pool
    pool = Pool(initializer=initpool,initargs=(misfit,),processes=20)
    p = [pool.apply_async(testfunc,args=(somevals,j)) for j in xrange(0,len(jrange))]
    pool.close()
    pool.join()
print(misfit[0])
Note that when we initially set up the Array, it must be named the same as the variable you set in initpool, at least from when I tested it.
This probably isn't the best way to do it but it works and hopefully some other people might find a use for it!
I'm working with a commercial analysis package called Abaqus, which has a Python interface for reading the output values.
I have given some sample code (which doesn't run) below:
myOdb contains all the information from which I am extracting the data. The caveat is that I cannot open the file using 2 separate programs.
Code 1 and Code 2 shown below work independently of each other; all they need is myOdb.
Is there a way to parallelize Code 1 and Code 2 after I read the odb?
# Open the odb file
myOdb = session.openOdb(name=odbPath)
# Code 1
for i in range(1, NoofSteps+1):
    frames = myOdb.steps[stepName].frames
    lastframe=frames[-1]
    RFD = lastframe.fieldOutputs['RF']
    sum1=0
    for value in RFD.values:
        sum1=sum1+value.data[1]
# Code 2
for i in range(1, NoofSteps+1):
    frames = myOdb.steps[stepName].frames
    lastframe=frames[-1]
    for j in range(4,13):
        file2=open('Fp'+str(j)+stepName,'w')
        b=lastframe.fieldOutputs[var+str(j)]
        fieldValues=b.values
        for v in fieldValues:
            file2.write('%d %6.15f\n' % (v.elementLabel, v.data))
If all you're trying to do is achieve a basic level of multiprocessing, this is what you need:
import multiprocessing
#Push the logic of code 1 and code 2 into 2 functions. Pass whatever you need
#these functions to access as arguments.
def code_1(odb_object, NoofSteps):
    for i in range(1, NoofSteps+1):
        frames = odb_object.steps[stepName].frames
        #stepName? Where did this variable come from? Is it "i"?
        lastframe=frames[-1]
        RFD = lastframe.fieldOutputs['RF']
        sum1=0
        for value in RFD.values:
            sum1=sum1+value.data[1]

def code_2(odb_object, NoofSteps):
    for i in range(1, NoofSteps+1):
        frames = odb_object.steps[stepName].frames
        #stepName? Where did this variable come from? Is it "i"?
        lastframe=frames[-1]
        for j in range(4,13):
            file2=open('Fp'+str(j)+stepName,'w')
            b=lastframe.fieldOutputs[var+str(j)]
            fieldValues=b.values
            for v in fieldValues:
                file2.write('%d %6.15f\n' % (v.elementLabel, v.data))

if __name__ == "__main__":
    # Open the odb file
    myOdb = session.openOdb(name=odbPath)
    #Create process objects that lead to those functions and pass the
    #object as an argument.
    p1 = multiprocessing.Process(target=code_1, args=(myOdb,NoofSteps, ))
    p2 = multiprocessing.Process(target=code_2, args=(myOdb,NoofSteps,))
    #start both jobs
    p1.start()
    p2.start()
    #Wait for each to finish.
    p1.join()
    p2.join()
    #Done
Isolate the "main" portion of your code into a main block like I have shown above. Do not, and I mean absolutely do not, use global variables. Be sure that all the variables you're using are available in the namespace of each function.
I recommend learning more about Python and the GIL problem. Read about the multiprocessing module here.
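One thing the two-process layout above does not show is how a value such as sum1 gets back to the parent, since each child works on its own copy of its variables. A minimal sketch, assuming the data handed to each child can be pickled, is to send results through a multiprocessing.Queue. The helper names and the toy data here are hypothetical, and whether an Abaqus odb object can be passed to a child process this way is not something the thread confirms.

```python
import multiprocessing


def total_second_component(values):
    # Analogue of "Code 1": sum the second entry of every record.
    return sum(v[1] for v in values)


def format_lines(values):
    # Analogue of "Code 2": build the lines that would be written to a file.
    return ["%d %6.15f" % (label, data) for label, data in values]


def worker(name, func, values, out_queue):
    # Run one job and send its result back to the parent, tagged by name.
    out_queue.put((name, func(values)))


if __name__ == "__main__":
    data = [(1, 0.5), (2, 1.5), (3, 2.0)]  # toy stand-in for extracted values
    queue = multiprocessing.Queue()
    jobs = [("sum", total_second_component), ("lines", format_lines)]
    procs = [multiprocessing.Process(target=worker, args=(name, func, data, queue))
             for name, func in jobs]
    for p in procs:
        p.start()
    # Drain the queue before joining so a child never blocks on a full pipe.
    results = dict(queue.get() for _ in procs)
    for p in procs:
        p.join()
    print(results["sum"])
```

If the odb object itself cannot cross the process boundary, one option is to extract plain Python values in the parent first and hand only those to the children, as the toy `data` list does here.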
I'm not an expert on Python, but I have managed to write multiprocessing code that uses all the CPUs and cores in my PC. My code loads a very large array, about 1.6 GB, and I need to update the array in every process. Fortunately, the update consists of adding some artificial stars to the image, and every process has a different set of image positions at which to add them.
The image is too large for me to create a new copy every time I call a process. My solution was to create a variable in shared memory, which saves plenty of memory. For some reason it works for 90% of the image, but there are regions where my code adds random numbers at some of the positions I sent to the processes. Is it related to the way I create the shared variable? Are the processes interfering with each other during the execution of my code?
Something weird is that when using a single CPU and a single core, the image is 100% perfect and no random numbers are added to it. Can you suggest a way to share a large array between multiple processes? Here is the relevant part of my code. Please read the lines where I define the variable im_data.
import warnings
warnings.filterwarnings("ignore")
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
import matplotlib.pyplot as plt
import sys,os
import subprocess
import numpy as np
import time
import cv2 as cv
import pyfits
from pyfits import getheader
import multiprocessing, Queue
import ctypes

class Worker(multiprocessing.Process):

    def __init__(self, work_queue, result_queue):
        # base class initialization
        multiprocessing.Process.__init__(self)
        # job management stuff
        self.work_queue = work_queue
        self.result_queue = result_queue
        self.kill_received = False

    def run(self):
        while not self.kill_received:
            # get a task
            try:
                i_range, psf_file = self.work_queue.get_nowait()
            except Queue.Empty:
                break
            # the actual processing
            print "Adding artificial stars - index range=", i_range
            radius=16
            x_c,y_c=( (psf_size[1]-1)/2, (psf_size[2]-1)/2 )
            x,y=np.meshgrid(np.arange(psf_size[1])-x_c,np.arange(psf_size[2])-y_c)
            distance = np.sqrt(x**2 + y**2)
            for i in range(i_range[0],i_range[1]):
                psf_xy=np.zeros(psf_size[1:3], dtype=float)
                j=0
                for i_order in range(psf_order+1):
                    j_order=0
                    while (i_order+j_order < psf_order+1):
                        psf_xy += psf_data[j,:,:] * ((mock_y[i]-psf_offset[1])/psf_scale[1])**i_order * ((mock_x[i]-psf_offset[0])/psf_scale[0])**j_order
                        j_order+=1
                        j+=1
                psf_factor=10.**( (30.-mock_mag[i])/2.5)/np.sum(psf_xy)
                psf_xy *= psf_factor
                npsf_xy=cv.resize(psf_xy,(npsf_size[0],npsf_size[1]),interpolation=cv.INTER_LANCZOS4)
                npsf_factor=10.**( (30.-mock_mag[i])/2.5)/np.sum(npsf_xy)
                npsf_xy *= npsf_factor
                im_rangex=[max(mock_x[i]-npsf_size[1]/2,0), min(mock_x[i]-npsf_size[1]/2+npsf_size[1], im_size[1])]
                im_rangey=[max(mock_y[i]-npsf_size[0]/2,0), min(mock_y[i]-npsf_size[0]/2+npsf_size[0], im_size[0])]
                npsf_rangex=[max(-1*(mock_x[i]-npsf_size[1]/2),0), min(-1*(mock_x[i]-npsf_size[1]/2-im_size[1]),npsf_size[1])]
                npsf_rangey=[max(-1*(mock_y[i]-npsf_size[0]/2),0), min(-1*(mock_y[i]-npsf_size[0]/2-im_size[0]),npsf_size[0])]
                im_data[im_rangey[0]:im_rangey[1], im_rangex[0]:im_rangex[1]] = 10.
            self.result_queue.put(id)

if __name__ == "__main__":
    n_cpu=2
    n_core=6
    n_processes=n_cpu*n_core*1
    input_mock_file=sys.argv[1]
    print "Reading file ", im_file[i]
    hdu=pyfits.open(im_file[i])
    data=hdu[0].data
    im_size=data.shape
    im_data_base = multiprocessing.Array(ctypes.c_float, im_size[0]*im_size[1])
    im_data = np.ctypeslib.as_array(im_data_base.get_obj())
    im_data = im_data.reshape(im_size[0], im_size[1])
    im_data[:] = data
    data=0
    assert im_data.base.base is im_data_base.get_obj()
    # run
    # load up work queue
    tic=time.time()
    j_step=np.int(np.ceil( mock_n*1./n_processes ))
    j_range=range(0,mock_n,j_step)
    j_range.append(mock_n)
    work_queue = multiprocessing.Queue()
    for j in range(np.size(j_range)-1):
        if work_queue.full():
            print "Oh no! Queue is full after only %d iterations" % j
        work_queue.put( (j_range[j:j+2], psf_file[i]) )
    # create a queue to pass to workers to store the results
    result_queue = multiprocessing.Queue()
    # spawn workers
    for j in range(n_processes):
        worker = Worker(work_queue, result_queue)
        worker.start()
    # collect the results off the queue
    while not work_queue.empty():
        result_queue.get()
    print "Writing file ", mock_im_file[i]
    hdu[0].data=im_data
    hdu.writeto(mock_im_file[i])
    print "%f s for parallel computation." % (time.time() - tic)
I think the problem (as you suggested in your question) comes from the fact that you are writing into the same array from multiple processes.
im_data_base = multiprocessing.Array(ctypes.c_float, im_size[0]*im_size[1])
im_data = np.ctypeslib.as_array(im_data_base.get_obj())
im_data = im_data.reshape(im_size[0], im_size[1])
im_data[:] = data
Although I am pretty sure that you could write into im_data_base in a "process-safe" manner (an implicit lock is used by Python to synchronize access to the array), I am not sure you can write into im_data in a process-safe manner, since im_data is a numpy view built from get_obj().
I would therefore (even though I am not sure it will solve your issue) advise you to create an explicit lock around im_data:
# Disable python implicit lock, we are going to use our own
im_data_base = multiprocessing.Array(ctypes.c_float, im_size[0]*im_size[1],
                                     lock=False)
im_data = np.ctypeslib.as_array(im_data_base.get_obj())
im_data = im_data.reshape(im_size[0], im_size[1])
im_data[:] = data
# Create our own lock
im_data_lock = multiprocessing.Lock()
Then in the processes, acquire the lock each time you need to modify im_data:
self.im_data_lock.acquire()
im_data[im_rangey[0]:im_rangey[1], im_rangex[0]:im_rangex[1]] = 10
self.im_data_lock.release()
I omitted the code to pass the lock to the constructor of your process and store it as a member field (self.im_data_lock) for the sake of brevity. You should also pass the im_data array to the constructor of your process and store it as a member field.
The problem occurs in your example when multiple processes write into overlapping regions of the image/array. So indeed you either have to use one lock per image or create a set of locks, one per image section (to reduce lock contention).
Alternatively, you can compute the image modifications in one set of processes and apply them to the image in a single separate process.
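The explicit-lock approach can be tried on a toy buffer before wiring it into the full program. The sketch below is an assumption-laden illustration: it uses a small 1-D shared array from the standard library instead of the numpy view, and `add_patch` and the patch ranges are hypothetical. Two processes update overlapping regions, each holding the lock for its whole read-modify-write.

```python
import multiprocessing


def add_patch(buf, lock, start, stop, value):
    # Hold the lock for the entire read-modify-write so that
    # patches touching the same cells cannot interleave.
    with lock:
        for i in range(start, stop):
            buf[i] += value


if __name__ == "__main__":
    # lock=False gives a raw shared array with no implicit lock of its own.
    buf = multiprocessing.Array("f", 100, lock=False)
    lock = multiprocessing.Lock()
    patches = [(0, 60, 1.0), (40, 100, 2.0)]  # the two ranges overlap on 40..59
    procs = [multiprocessing.Process(target=add_patch, args=(buf, lock) + patch)
             for patch in patches]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(buf[0], buf[50], buf[99])
```

Passing the array and the lock as Process arguments, rather than relying on globals, keeps the sketch independent of whether workers are forked or spawned. A single lock serialises all writers, so for a large image the per-section locks mentioned above may be worth the extra bookkeeping.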