I'm not an expert in Python, but I have managed to write multiprocessing code that uses all the CPUs and cores in my PC. My code loads a very large array, about 1.6 GB, and I need to update the array in every process. Fortunately, the update consists of adding some artificial stars to the image, and every process has a different set of image positions at which to add them.
The image is too large to create a new copy every time I start a process. My solution was to create a variable in shared memory, which saves plenty of memory. For some reason, it works for 90% of the image, but there are regions where my code adds random numbers at some of the positions I sent to the processes. Is it related to the way I create the shared variable? Are the processes interfering with each other during execution?
Something weird is that when using a single CPU and a single core, the image is 100% perfect and no random numbers are added to it. Can you suggest a way to share a large array between multiple processes? Here is the relevant part of my code. Please read the lines where I define the variable im_data.
import warnings
warnings.filterwarnings("ignore")
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
import matplotlib.pyplot as plt
import sys,os
import subprocess
import numpy as np
import time
import cv2 as cv
import pyfits
from pyfits import getheader
import multiprocessing, Queue
import ctypes
class Worker(multiprocessing.Process):
    def __init__(self, work_queue, result_queue):
        # base class initialization
        multiprocessing.Process.__init__(self)
        # job management stuff
        self.work_queue = work_queue
        self.result_queue = result_queue
        self.kill_received = False

    def run(self):
        while not self.kill_received:
            # get a task
            try:
                i_range, psf_file = self.work_queue.get_nowait()
            except Queue.Empty:
                break
            # the actual processing
            print "Adding artificial stars - index range=", i_range
            radius=16
            x_c,y_c=( (psf_size[1]-1)/2, (psf_size[2]-1)/2 )
            x,y=np.meshgrid(np.arange(psf_size[1])-x_c,np.arange(psf_size[2])-y_c)
            distance = np.sqrt(x**2 + y**2)
            for i in range(i_range[0],i_range[1]):
                psf_xy=np.zeros(psf_size[1:3], dtype=float)
                j=0
                for i_order in range(psf_order+1):
                    j_order=0
                    while (i_order+j_order < psf_order+1):
                        psf_xy += psf_data[j,:,:] * ((mock_y[i]-psf_offset[1])/psf_scale[1])**i_order * ((mock_x[i]-psf_offset[0])/psf_scale[0])**j_order
                        j_order+=1
                        j+=1
                psf_factor=10.**( (30.-mock_mag[i])/2.5)/np.sum(psf_xy)
                psf_xy *= psf_factor
                npsf_xy=cv.resize(psf_xy,(npsf_size[0],npsf_size[1]),interpolation=cv.INTER_LANCZOS4)
                npsf_factor=10.**( (30.-mock_mag[i])/2.5)/np.sum(npsf_xy)
                npsf_xy *= npsf_factor
                im_rangex=[max(mock_x[i]-npsf_size[1]/2,0), min(mock_x[i]-npsf_size[1]/2+npsf_size[1], im_size[1])]
                im_rangey=[max(mock_y[i]-npsf_size[0]/2,0), min(mock_y[i]-npsf_size[0]/2+npsf_size[0], im_size[0])]
                npsf_rangex=[max(-1*(mock_x[i]-npsf_size[1]/2),0), min(-1*(mock_x[i]-npsf_size[1]/2-im_size[1]),npsf_size[1])]
                npsf_rangey=[max(-1*(mock_y[i]-npsf_size[0]/2),0), min(-1*(mock_y[i]-npsf_size[0]/2-im_size[0]),npsf_size[0])]
                im_data[im_rangey[0]:im_rangey[1], im_rangex[0]:im_rangex[1]] = 10.
            self.result_queue.put(id)
if __name__ == "__main__":
    n_cpu=2
    n_core=6
    n_processes=n_cpu*n_core*1
    input_mock_file=sys.argv[1]
    print "Reading file ", im_file[i]
    hdu=pyfits.open(im_file[i])
    data=hdu[0].data
    im_size=data.shape
    im_data_base = multiprocessing.Array(ctypes.c_float, im_size[0]*im_size[1])
    im_data = np.ctypeslib.as_array(im_data_base.get_obj())
    im_data = im_data.reshape(im_size[0], im_size[1])
    im_data[:] = data
    data=0
    assert im_data.base.base is im_data_base.get_obj()
    # run
    # load up work queue
    tic=time.time()
    j_step=np.int(np.ceil( mock_n*1./n_processes ))
    j_range=range(0,mock_n,j_step)
    j_range.append(mock_n)
    work_queue = multiprocessing.Queue()
    for j in range(np.size(j_range)-1):
        if work_queue.full():
            print "Oh no! Queue is full after only %d iterations" % j
        work_queue.put( (j_range[j:j+2], psf_file[i]) )
    # create a queue to pass to workers to store the results
    result_queue = multiprocessing.Queue()
    # spawn workers
    for j in range(n_processes):
        worker = Worker(work_queue, result_queue)
        worker.start()
    # collect the results off the queue
    while not work_queue.empty():
        result_queue.get()
    print "Writing file ", mock_im_file[i]
    hdu[0].data=im_data
    hdu.writeto(mock_im_file[i])
    print "%f s for parallel computation." % (time.time() - tic)
I think the problem (as you suggested in your question) comes from the fact that you are writing to the same array from multiple processes.
im_data_base = multiprocessing.Array(ctypes.c_float, im_size[0]*im_size[1])
im_data = np.ctypeslib.as_array(im_data_base.get_obj())
im_data = im_data.reshape(im_size[0], im_size[1])
im_data[:] = data
Although I am pretty sure that you could write into im_data_base in a "process-safe" manner (Python uses an implicit lock to synchronize access to the array), I am not sure you can write into im_data in a process-safe manner.
I would therefore (even though I am not sure it will solve your issue) advise you to create an explicit lock around im_data:
# Disable Python's implicit lock; we are going to use our own
im_data_base = multiprocessing.Array(ctypes.c_float, im_size[0]*im_size[1],
                                     lock=False)
# With lock=False the raw ctypes array is returned directly (there is no get_obj())
im_data = np.ctypeslib.as_array(im_data_base)
im_data = im_data.reshape(im_size[0], im_size[1])
im_data[:] = data
# Create our own lock
im_data_lock = multiprocessing.Lock()
Then in the processes, acquire the lock each time you need to modify im_data
self.im_data_lock.acquire()
im_data[im_rangey[0]:im_rangey[1], im_rangex[0]:im_rangex[1]] = 10
self.im_data_lock.release()
For the sake of brevity, I omitted the code that passes the lock to the constructor of your process and stores it as a member field (self.im_data_lock). You should also pass the im_data array to the constructor of your process and store it as a member field.
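The part omitted above could look roughly like the sketch below. It assumes Python 3, and the image size, the positions and the stamp half-width are made up for illustration; it is not the asker's actual code. Each worker receives the shared buffer and the lock through its constructor and takes the lock around every write.

```python
import ctypes
import multiprocessing as mp

import numpy as np


class Worker(mp.Process):
    """Adds a constant stamp around each position, holding a lock per write."""

    def __init__(self, shared_base, lock, shape, positions, half=2):
        super().__init__()
        self.shared_base = shared_base  # raw shared buffer (no implicit lock)
        self.lock = lock                # explicit lock shared by all workers
        self.shape = shape
        self.positions = positions      # hypothetical (row, col) star positions
        self.half = half                # hypothetical stamp half-width

    def run(self):
        # Build a NumPy view on the shared buffer; no data is copied.
        image = np.frombuffer(self.shared_base, dtype=np.float32).reshape(self.shape)
        for row, col in self.positions:
            r0, r1 = max(row - self.half, 0), min(row + self.half, self.shape[0])
            c0, c1 = max(col - self.half, 0), min(col + self.half, self.shape[1])
            with self.lock:  # serialize the read-modify-write on the image
                image[r0:r1, c0:c1] += 1.0


if __name__ == "__main__":
    shape = (16, 16)
    shared_base = mp.RawArray(ctypes.c_float, shape[0] * shape[1])
    lock = mp.Lock()
    # Every worker writes to the same region on purpose.
    workers = [Worker(shared_base, lock, shape, [(8, 8)] * 10) for _ in range(4)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    image = np.frombuffer(shared_base, dtype=np.float32).reshape(shape)
    print(image[8, 8])
```

A single lock is the simplest choice; it serializes all writes, so it only pays off when computing each stamp costs much more than writing it.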
The problem occurs in your example when multiple processes write into overlapping regions of the image/array. So you do indeed have to either use one lock per image or create a set of locks, one per image section (to reduce lock contention).
Or you can produce the image modifications in one set of processes and apply them to the image in a single, separate process.
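The second option (workers only compute the modifications, one process applies them) might be sketched as follows. This assumes Python 3; make_patch, apply_patches, the image shape and the positions are hypothetical names and values chosen for the example. Since only the parent touches the image, no lock and no shared memory are needed, at the cost of sending each patch back through a pipe.

```python
from multiprocessing import Pool

import numpy as np

SHAPE = (32, 32)  # hypothetical image size


def make_patch(position, half=2):
    """Worker side: compute one modification without touching the image."""
    row, col = position
    r0, r1 = max(row - half, 0), min(row + half, SHAPE[0])
    c0, c1 = max(col - half, 0), min(col + half, SHAPE[1])
    values = np.ones((r1 - r0, c1 - c0), dtype=np.float32)
    return (r0, r1, c0, c1), values


def apply_patches(image, patches):
    """Parent side: the only place where the image is modified."""
    for (r0, r1, c0, c1), values in patches:
        image[r0:r1, c0:c1] += values
    return image


if __name__ == "__main__":
    positions = [(5, 5), (6, 6), (0, 31)]
    image = np.zeros(SHAPE, dtype=np.float32)
    with Pool(2) as pool:
        # Patches arrive in arbitrary order; addition does not care.
        apply_patches(image, pool.imap_unordered(make_patch, positions))
    print(image.sum())
```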
I am trying to run my simulations in a thread pool and store the results of each repetition in a global numpy array. However, I run into problems doing that, and I observe some really interesting behavior with the following (simplified) code (Python 3.7):
import numpy as np
from multiprocessing import Pool, Lock
log_mutex = Lock()
repetition_count = 5
data_array = np.zeros(shape=(repetition_count, 3, 200), dtype=float)
def record_results(repetition_index, data_array, log_mutex):
    log_mutex.acquire()
    print("Start record {}".format(repetition_index))
    # Do some stuff and modify data_array, e.g.:
    data_array[repetition_index, 0, 53] = 12.34
    print("Finish record {}".format(repetition_index))
    log_mutex.release()
def run(repetition_index):
    global log_mutex
    global data_array
    # do some simulation
    record_results(repetition_index, data_array, log_mutex)
if __name__ == "__main__":
    random.seed()
    with Pool(thread_count) as p:
        print(p.map(run, range(repetition_count)))
The issue is: I get the correct "Start record & Finish record" outputs, e.g. Start record 1... Finish record 1. However, the different slices of the numpy array that are modified by each thread are not kept in the global variable. In other words, the elements that have been modified by thread 1 are still zero, and thread 4 overwrites different parts of the array.
One additional remark: the address of the global array, which I retrieve with print(hex(id(data_array))), is the same for all threads inside their log_mutex.acquire() ... log_mutex.release() lines.
Am I missing something? Are there multiple copies of the global data_array, one stored for each thread? I am observing behavior like this, but that should not be the case when I use the global keyword, or am I wrong?
Looks like you're running the run function using multiple processes, not multiple threads. Try something like this instead:
import numpy as np
from threading import Thread, Lock
log_mutex = Lock()
repetition_count = 5
data_array = np.zeros(shape=(repetition_count, 3, 200), dtype=float)
def record_results(repetition_index, data_array, log_mutex):
    log_mutex.acquire()
    print("Start record {}".format(repetition_index))
    # Do some stuff and modify data_array, e.g.:
    data_array[repetition_index, 0, 53] = 12.34
    print("Finish record {}".format(repetition_index))
    log_mutex.release()
def run(repetition_index):
    global log_mutex
    global data_array
    record_results(repetition_index, data_array, log_mutex)
if __name__ == "__main__":
    threads = []
    for i in range(repetition_count):
        t = Thread(target=run, args=[i])
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
Update:
To do this with multiple processes, you would need to use multiprocessing.RawArray to instantiate your array; the size of the array is the product repetition_count * 3 * 200. Within each process, create a view on the array using np.frombuffer, and reshape it accordingly. While this will be very fast, I discourage this style of programming as it relies on global shared memory objects, which are error-prone in larger programs.
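A minimal sketch of that RawArray approach, assuming Python 3; the names init_worker and _shared are mine, and handing the buffer to the workers through a Pool initializer is one possible way to do it, not the only one. Each repetition writes only to its own slice, so this particular example needs no lock.

```python
import ctypes
from multiprocessing import Pool, RawArray

import numpy as np

SHAPE = (5, 3, 200)  # repetition_count x 3 x 200, as in the question
_shared = {}         # per-process holder for the NumPy view (hypothetical name)


def init_worker(raw, shape):
    """Pool initializer: build a view on the shared buffer once per worker."""
    _shared["data"] = np.frombuffer(raw, dtype=np.float64).reshape(shape)


def run(repetition_index):
    # Writes land in shared memory, so the parent sees them as well.
    _shared["data"][repetition_index, 0, 53] = 12.34
    return repetition_index


if __name__ == "__main__":
    raw = RawArray(ctypes.c_double, int(np.prod(SHAPE)))
    with Pool(2, initializer=init_worker, initargs=(raw, SHAPE)) as pool:
        pool.map(run, range(SHAPE[0]))
    data_array = np.frombuffer(raw, dtype=np.float64).reshape(SHAPE)
    print(data_array[:, 0, 53])
```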
If possible, I suggest removing the global data_array and instead instantiate an array in each call to record_results, which you would return in run. The p.map call will return a list of arrays, which you can convert to a numpy array and recover the shape and contents of the global data_array in your original implementation. This will incur a communication cost, but it's a cleaner approach to managing concurrency and eliminates the need for locks.
It's generally a good idea to minimize inter-process communication, but unless performance is critical, I don't think shared memory is the right solution. With p.map, you'll want to avoid returning large objects, but the object sizes in your snippet are very small (600*8 bytes).
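The return-based variant described above might look like this sketch (Python 3 assumed; the value written into each array is made up so that the repetitions can be told apart). Each call builds its own small array, and the parent stacks the list returned by p.map back into the original shape.

```python
from multiprocessing import Pool

import numpy as np

REPETITION_COUNT = 5


def run(repetition_index):
    """Do one repetition and return its results instead of writing to a global."""
    result = np.zeros((3, 200), dtype=float)
    result[0, 53] = 12.34 + repetition_index  # made-up value for illustration
    return result


if __name__ == "__main__":
    with Pool(2) as p:
        per_repetition = p.map(run, range(REPETITION_COUNT))
    # map preserves input order, so row i belongs to repetition i.
    data_array = np.stack(per_repetition)
    print(data_array.shape)
```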
I am new to using parallel processing for data analysis. I have a fairly large array and I want to apply a function to each index of said array.
Here is the code I have so far:
import numpy as np
import statsmodels.api as sm
from statsmodels.regression.quantile_regression import QuantReg
import multiprocessing
from functools import partial
def fit_model(data,q):
    #data is a 1-D array holding precipitation values
    years = np.arange(1895,2018,1)
    res = QuantReg(exog=sm.add_constant(years),endog=data).fit(q=q)
    pointEstimate = res.params[1] #output slope of quantile q
    return pointEstimate
#precipAll is an array of shape (1405*621,123,12) (longitudes*latitudes,years,months)
#find all indices where there is data
nonNaN = np.where(~np.isnan(precipAll[:,0,0]))[0] #481631 indices
month = 4
#holder array for results
asyncResults = np.zeros((precipAll.shape[0])) * np.nan
def saveResult(result,pos):
    asyncResults[pos] = result
if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=20) #my server has 24 CPUs
    for i in nonNaN:
        #use partial so I can also pass the index i so the result is
        #stored in the expected position
        new_callback_function = partial(saveResult, pos=i)
        pool.apply_async(fit_model, args=(precipAll[i,:,month],0.9),callback=new_callback_function)
    pool.close()
    pool.join()
When I ran this, I stopped it after it had taken longer than it would have without multiprocessing at all. A call to fit_model takes on the order of 0.02 seconds, so could the overhead associated with apply_async be causing the slowdown? I need to maintain the order of the results because I am plotting this data onto a map after the processing is done. Any thoughts on where I can improve this are greatly appreciated!
If you need to use the multiprocessing module, you'll probably want to batch more rows together into each task that you give to the worker pool. However, for what you're doing, I'd suggest trying out Ray due to its efficient handling of large numerical data.
import numpy as np
import statsmodels.api as sm
from statsmodels.regression.quantile_regression import QuantReg
import ray
@ray.remote
def fit_model(precip_all, i, month, q):
    data = precip_all[i,:,month]
    years = np.arange(1895, 2018, 1)
    res = QuantReg(exog=sm.add_constant(years), endog=data).fit(q=q)
    pointEstimate = res.params[1]
    return pointEstimate
if __name__ == '__main__':
    ray.init()
    # Create an array and place it in shared memory so that the workers can
    # access it (in a read-only fashion) without creating copies.
    precip_all = np.zeros((100, 123, 12))
    precip_all_id = ray.put(precip_all)
    result_ids = []
    for i in range(precip_all.shape[0]):
        result_ids.append(fit_model.remote(precip_all_id, i, 4, 0.9))
    results = np.array(ray.get(result_ids))
Some Notes
The example above runs out of the box, but note that I simplified the logic a bit. In particular, I removed the handling of NaNs.
On my laptop with 4 physical cores, this takes about 4 seconds. If you use 20 cores instead and make the data 9000 times bigger, I'd expect it to take about 7200 seconds, which is quite a long time. One possible approach to speeding this up is to use more machines or to process multiple rows in each call to fit_model in order to amortize some of the overhead.
The above example actually passes the entire precip_all matrix into each task. This is fine because each fit_model task only has read access to a copy of the matrix stored in shared memory and so doesn't need to create its own local copy. The call to ray.put(precip_all) places the array in shared memory once up front.
There is more to read about the differences between Ray and Python multiprocessing. Note that I'm helping develop Ray.
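If you stay with plain multiprocessing, the batching mentioned above (several rows per task, to amortize the per-task overhead) could be sketched like this. Python 3 is assumed, fit_slope is a hypothetical least-squares stand-in for the QuantReg fit so that the example runs without statsmodels, and the array size, n_batches and processes are made-up values. Because pool.map returns results in input order, the output stays aligned with the rows.

```python
from multiprocessing import Pool

import numpy as np

YEARS = np.arange(1895, 2018)


def fit_slope(series):
    """Hypothetical stand-in for fit_model: ordinary least-squares slope."""
    return np.polyfit(YEARS, series, 1)[0]


def fit_batch(batch):
    """Fit every row of a 2-D batch inside a single task."""
    return np.array([fit_slope(row) for row in batch])


def fit_all(data, n_batches=8, processes=4):
    """Split the rows into batches, fit them in parallel, keep row order."""
    batches = np.array_split(data, n_batches)
    with Pool(processes) as pool:
        per_batch = pool.map(fit_batch, batches)
    return np.concatenate(per_batch)


if __name__ == "__main__":
    data = np.random.rand(40, YEARS.size)  # made-up input
    slopes = fit_all(data)
    print(slopes.shape)
```

With roughly 480,000 rows, a few hundred batches would keep all workers busy while cutting the number of tasks by three orders of magnitude; the right batch count is something to measure rather than assume.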
I'm attempting to get python multiprocessing working to speed up a code I've written. The code looks like this:
from multiprocessing import Array, Pool
import numpy as np
#setting up shared memory array
global misfit
misfit = Array('d', np.empty((dim1,dim2,dim3,dim4)).flat)
#looping through some values
for i in xrange(0,1):
    #setting up pool
    pool = Pool()
    p = [pool.apply_async(self.testfunc,args=(somevals,j)) for j in xrange(0,1)]
    pool.close()
    pool.join()
Where self.testfunc looks like:
def testfunc(self,somevals,j):
    #some calculations
    for k in xrange(0,1):
        #some calculations
        for mn in xrange(0,1):
            #some more calculations
            #save results
            result = i*j*k*mn # example
            misfit[i*j*k*mn] = result
My problem is that when I run this none of the values are saved in the shared Array, and it remains empty. I understand this could be to do with the global variable, but in a simpler program that uses this exact setup, the values are saved to the array. The array is quite large in the full program as well (4561920000 values). Also if I call this function outside of the Pool, it works and the values are saved.
So my question is what I am doing wrong here? Am I sending the shared Array incorrectly?
EDIT: Figured I'd add in the code that works:
from multiprocessing import Array, Pool
from numpy import empty, sin
from time import time
import numpy as np
def initarr():
    a = Array('d', empty((5, 50, 80)).flat)
    return a
def testfunc(i, j, k):
    count = (i*50*80) + (j*80) + k
    x = sin(k)
    a[count] = x
    y = np.fft.fft(np.exp(2j*np.pi*np.arange(50000)/50000))
def process(i):
    start = time()
    pool = Pool()
    for j in xrange(0, 50):
        p = [pool.apply_async(testfunc, args=(i, j, k)) for k in xrange(0, 80)]
    pool.close()
    pool.join()
    print time() - start
global a
a = initarr()
for i in xrange(0, 5):
    process(i)
Ok so with the help of someone from our IT department, I finally have a version of this that works, so for anybody in the future viewing this question, I'll post a solution. I haven't really used stack overflow much so sorry if it's bad etiquette to answer my own question.
We got this working using an initializer function, but we had to make sure the initializer function was in the same file (module) as the function being run by the Pool. So in one module (misc) we had:
**misc.py**
def testfunc(self,somevals,j):
    #some calculations
    for k in xrange(0,len(krange)):
        #some calculations
        for mn in xrange(0,len(mnrange)):
            #some more calculations
            #save results
            # flat (row-major) position of element (i, j, k, mn)
            loc = ((i*len(jrange)*len(krange)*len(mnrange))+
                   (j*len(krange)*len(mnrange))+(k*len(mnrange))+mn)
            result = i*j*k*mn # example
            misfit[loc] = result
def initpool(a):
    global misfit
    misfit = a
And in the main file we have:
**main.py**
from multiprocessing import Array, Pool
from misc import initpool, testfunc
import numpy as np
#setting up shared memory array
misfit = Array('d', np.empty((dim1,dim2,dim3,dim4)).flat)
#looping through some values
for i in xrange(0,len(irange)):
    #setting up pool
    pool = Pool(initializer=initpool,initargs=(misfit,),processes=20)
    p = [pool.apply_async(testfunc,args=(somevals,j)) for j in xrange(0,len(jrange))]
    pool.close()
    pool.join()
print(misfit[0])
Note that when we initially set up the Array, it must be named the same as the variable you set in initpool, at least from when I tested it.
This probably isn't the best way to do it but it works and hopefully some other people might find a use for it!
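As a side note on the loc computation in testfunc: it is the usual row-major flat index of element (i, j, k, mn), which NumPy can also compute. The sketch below only illustrates that equivalence; flat_index and the dimensions are hypothetical and not part of the solution above.

```python
import numpy as np


def flat_index(i, j, k, mn, dims):
    """Row-major offset of element (i, j, k, mn) in an array of shape dims."""
    _, n_j, n_k, n_mn = dims
    return (i * n_j * n_k * n_mn) + (j * n_k * n_mn) + (k * n_mn) + mn


dims = (5, 6, 7, 8)  # hypothetical (dim1, dim2, dim3, dim4)
by_hand = flat_index(1, 2, 3, 4, dims)
by_numpy = int(np.ravel_multi_index((1, 2, 3, 4), dims))
print(by_hand, by_numpy)
```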
I have a code which reads data from multiple files named 001.txt, 002.txt, ... , 411.txt. I would like to read the data from each file, plot them, and save as 001.jpg, 002.jpg, ... , 411.jpg.
I can do this by looping through the files, but I would like to use the multiprocessing module to speed things up.
However, when I use the code below, the computer hangs: I can't click on anything, but the mouse moves and the sound continues. I then have to power down the computer.
I'm obviously misusing the multiprocessing module with matplotlib. I have used something very similar to the code below to actually generate the data and save it to text files with no problems. What am I missing?
import multiprocessing
def do_plot(number):
    fig = figure(number)
    a, b = random.sample(range(1,9999),1000), random.sample(range(1,9999),1000)
    # generate random data
    scatter(a, b)
    savefig("%03d" % (number,) + ".jpg")
    print "Done ", number
    close()
for i in (0, 1, 2, 3):
    jobs = []
    # for j in chunk:
    p = multiprocessing.Process(target = do_plot, args = (i,))
    jobs.append(p)
    p.start()
    p.join()
The most important thing in using multiprocessing is to run the main code of the module only for the main process. This can be achieved by testing if __name__ == '__main__' as shown below:
import matplotlib.pyplot as plt
import numpy.random as random
from multiprocessing import Pool
def do_plot(number):
    fig = plt.figure(number)
    a = random.sample(1000)
    b = random.sample(1000)
    # generate random data
    plt.scatter(a, b)
    plt.savefig("%03d.jpg" % (number,))
    plt.close()
    print("Done ", number)
if __name__ == '__main__':
    pool = Pool()
    pool.map(do_plot, range(4))
Note also that I replaced the creation of the separate processes with a process pool (which scales better to many pictures, since it only uses as many processes as you have cores available).
I wrote a function in Python 2.7 (on 64-bit Windows) to calculate the mean value of the intersection area between a reference polygon (Ref) and one or more segmented polygons (Seg) in ESRI shapefile format. The code is quite slow because I have more than 2000 reference polygons, and for each reference polygon the function loops over all of the segmented polygons (more than 7000). I am sorry, but the function is a prototype.
I would like to know whether multiprocessing can help speed up my loop, or whether there are better-performing solutions. If multiprocessing is a possible solution, I would like to know the best way to optimize the following function:
import os
import numpy as np
import ogr
import osr,gdal
from shapely.geometry import Polygon
from shapely.geometry import Point
import osgeo.gdal
import osgeo.ogr
import osgeo.gdal as gdal
def AreaInter(reference,segmented,outFile):
    # open shapefile
    ref = osgeo.ogr.Open(reference)
    if ref is None:
        raise SystemExit('Unable to open %s' % reference)
    seg = osgeo.ogr.Open(segmented)
    if seg is None:
        raise SystemExit('Unable to open %s' % segmented)
    ref_layer = ref.GetLayer()
    seg_layer = seg.GetLayer()
    # create outfile
    if not os.path.split(outFile)[0]:
        file_path, file_name_ext = os.path.split(os.path.abspath(reference))
        outFile_filename = os.path.splitext(os.path.basename(outFile))[0]
        file_out = open(os.path.abspath("{0}\\{1}.txt".format(file_path, outFile_filename)), "w")
    else:
        file_path_name, file_ext = os.path.splitext(outFile)
        file_out = open(os.path.abspath("{0}.txt".format(file_path_name)), "w")
    # For each reference objects-i
    for index in xrange(ref_layer.GetFeatureCount()):
        ref_feature = ref_layer.GetFeature(index)
        # get FID (=Feature ID)
        FID = str(ref_feature.GetFID())
        ref_geometry = ref_feature.GetGeometryRef()
        pts = ref_geometry.GetGeometryRef(0)
        points = []
        for p in xrange(pts.GetPointCount()):
            points.append((pts.GetX(p), pts.GetY(p)))
        # convert in a shapely polygon
        ref_polygon = Polygon(points)
        # get the area
        ref_Area = ref_polygon.area
        # create two empty lists
        seg_Area, intersect_Area = ([] for _ in range(2))
        # For each segmented objects-j
        for segment in xrange(seg_layer.GetFeatureCount()):
            seg_feature = seg_layer.GetFeature(segment)
            seg_geometry = seg_feature.GetGeometryRef()
            pts = seg_geometry.GetGeometryRef(0)
            points = []
            for p in xrange(pts.GetPointCount()):
                points.append((pts.GetX(p), pts.GetY(p)))
            seg_polygon = Polygon(points)
            seg_Area.append(seg_polygon.area)
            # intersection (overlap) of reference object with the segmented object
            intersect_polygon = ref_polygon.intersection(seg_polygon)
            # area of intersection (= 0, No intersection)
            intersect_Area.append(intersect_polygon.area)
        # Average over all segmented objects (because 1 or more segmented polygons can intersect with reference polygon)
        seg_Area_average = np.average(seg_Area)
        intersect_Area_average = np.average(intersect_Area)
        file_out.write(" ".join(["%s" %i for i in [FID, ref_Area,seg_Area_average,intersect_Area_average]])+ "\n")
    file_out.close()
You can use the multiprocessing package, and especially the Pool class. First create a function that does all the stuff you want to do within the for loop, and that takes as an argument only the index:
def process_reference_object(index):
    ref_feature = ref_layer.GetFeature(index)
    # all your code goes here
    return (" ".join(["%s" %i for i in [FID, ref_Area,seg_Area_average,intersect_Area_average]])+ "\n")
Note that this doesn't write to a file itself; that would be messy because you'd have multiple processes writing to the same file at the same time. Instead, it returns the string that needs to be written. Also note that objects used in this function, like ref_layer or ref_geometry, will need to reach it somehow. How you do that is up to you (you could make process_reference_object a method of a class initialized with them, or it could be as ugly as just defining them globally).
Then, you create a pool of process resources, and run all of your indices using Pool.imap_unordered (which will itself allocate each index to a different process as necessary):
from multiprocessing import Pool
p = Pool() # run multiple processes
for l in p.imap_unordered(process_reference_object, range(ref_layer.GetFeatureCount())):
    file_out.write(l)
This will parallelize the independent processing of your reference objects across multiple processes, and write them to the file (in an arbitrary order, note).
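If the order of the lines in the file matters, one option is to return the index together with the line and sort before writing. This is a sketch with a hypothetical stand-in for process_reference_object, assuming Python 3; using Pool.imap instead of imap_unordered would also keep the input order, at the cost of waiting for slow items.

```python
from multiprocessing import Pool


def process_reference_object(index):
    """Hypothetical stand-in: return the index along with its output line."""
    return index, "%d %d\n" % (index, index * index)


def ordered_lines(indexed_lines):
    """Sort (index, line) pairs by index and keep only the lines."""
    return [line for _, line in sorted(indexed_lines)]


if __name__ == "__main__":
    with Pool(2) as p:
        results = p.imap_unordered(process_reference_object, range(6))
        lines = ordered_lines(results)
    print("".join(lines), end="")
```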
Threading can help to a degree, but first you should make sure you can't simplify the algorithm. If you're checking each of 2000 reference polygons against 7000 segmented polygons (perhaps I misunderstood), then you should start there. Stuff that runs in O(n^2) is going to be slow, so maybe you can prune away things that will definitely not intersect or find some other way to speed things up. Otherwise, running multiple processes or threads will only improve things linearly while the work grows quadratically.
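One cheap way to prune, sketched here in plain Python under the assumption that polygons are given as lists of (x, y) tuples: compare bounding boxes first and run the expensive intersection only for pairs whose boxes overlap. The function names are made up for the example.

```python
def bounds(points):
    """Axis-aligned bounding box (min_x, min_y, max_x, max_y) of a polygon."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


def boxes_overlap(a, b):
    """True if two bounding boxes share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def candidate_pairs(ref_polygons, seg_polygons):
    """Yield (ref_index, seg_index) only for pairs that might intersect."""
    seg_boxes = [bounds(p) for p in seg_polygons]  # computed once, not per pair
    for i, ref in enumerate(ref_polygons):
        ref_box = bounds(ref)
        for j, seg_box in enumerate(seg_boxes):
            if boxes_overlap(ref_box, seg_box):
                yield i, j


refs = [[(0, 0), (2, 0), (2, 2), (0, 2)]]
segs = [[(1, 1), (3, 1), (3, 3), (1, 3)], [(5, 5), (6, 5), (6, 6), (5, 6)]]
print(list(candidate_pairs(refs, segs)))
```

This still does one box test per pair, but a box test is far cheaper than a polygon intersection. A spatial index (for example an R-tree, such as Shapely's STRtree) would avoid most of the pairwise tests as well.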