Using shared array in multiprocessing - python

I am trying to run a parallel process in Python in which I have to extract certain polygons from a large array based on some conditions. The large array has 10k+ indexed polygons.
I pass (array, index) to an extract_polygon function. Depending on the conditions defined, the function either returns the polygon corresponding to that index or returns None. The array is never changed; it is only read to look up the polygon for the given index.
Since the array is very large, I run into an out-of-memory error during parallel processing. How can I avoid that? (In other words, how do I use a shared array effectively in multiprocessing?)
Below is my sample code:
import time
import multiprocessing as mp
import numpy as np
from scipy import ndimage
def extract_polygon(array, index):
    try:
        # bounding-box slices of the region labelled `index`
        islays = ndimage.find_objects(array == index)
        poly = array[islays[0][0], islays[0][1]]
        area = np.count_nonzero(poly)
        minArea = 100
        maxArea = 10000
        if (area > minArea) and (area < maxArea):
            return poly
        else:
            return None
    except Exception:
        return None
start = time.time()
pool = mp.Pool(10)
# indices here is a list of all the indexes we have.
results = pool.starmap(extract_polygon, [(array, index) for index in indices])
pool.close()
pool.join()
Can I use any other library like ray in this case?

You can absolutely use a library like Ray.
The structure would look something like this (simplified to remove your application logic).
import numpy as np
import ray
ray.init()
# Create the array and store it in shared memory once.
array = np.ones(10**6)
array_id = ray.put(array)
@ray.remote
def extract_polygon(array, index):
    # Change this to actually extract the polygon.
    return index
# Start 10 tasks that each take in the ID of the array in shared memory.
# These tasks execute in parallel (assuming there are enough CPU resources).
result_ids = [extract_polygon.remote(array_id, i) for i in range(10)]
# Fetch the results.
results = ray.get(result_ids)
You can read more about Ray in the documentation.
See some related answers below:
Shared-memory objects in multiprocessing
python3 multiprocess shared numpy array(read-only)
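If you would rather stay with the standard library, one option is multiprocessing.shared_memory (Python 3.8+). The following is only a minimal sketch under some assumptions: the array is a labelled integer image, and region_area is a hypothetical stand-in for the real polygon filter. Each worker maps the existing block in a pool initializer instead of receiving a pickled copy of the array with every task.

```python
import numpy as np
from multiprocessing import Pool, shared_memory

_shm = None     # per-process handle, kept alive so the buffer stays mapped
_labels = None  # NumPy view onto the shared buffer


def attach(name, shape, dtype):
    """Pool initializer: map the existing shared block instead of copying it."""
    global _shm, _labels
    _shm = shared_memory.SharedMemory(name=name)
    _labels = np.ndarray(shape, dtype=dtype, buffer=_shm.buf)


def detach():
    """Drop the view and close this process's mapping."""
    global _shm, _labels
    _labels = None
    if _shm is not None:
        _shm.close()
        _shm = None


def region_area(index):
    """Count the pixels carrying label `index` (stand-in for the real filter)."""
    return index, int(np.count_nonzero(_labels == index))


def main():
    labels = np.zeros((200, 200), dtype=np.int32)
    labels[10:20, 10:30] = 1
    labels[50:60, 50:55] = 2

    shm = shared_memory.SharedMemory(create=True, size=labels.nbytes)
    shared = np.ndarray(labels.shape, dtype=labels.dtype, buffer=shm.buf)
    try:
        shared[:] = labels  # the only copy that is made
        with Pool(2, initializer=attach,
                  initargs=(shm.name, labels.shape, labels.dtype)) as pool:
            areas = dict(pool.map(region_area, [1, 2]))
    finally:
        del shared   # the view must be gone before the mapping is closed
        shm.close()
        shm.unlink()
    return areas


if __name__ == "__main__":
    print(main())
```

Only the index travels to the workers; the block name, shape, and dtype are sent once per worker. The view is writable, so treating it as read-only is a convention here rather than something the sketch enforces.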

Related

Apply ufunc to xarray single Dataset variable as delayed operation using dask

I would like to apply a custom function to a variable within an xarray.Dataset modifying only the specified variable. At the same time I am trying to make this part of a dask computation graph so it can be delayed prior to reading out to disk with to_netcdf.
At the moment I can apply the ufunc using xr.apply_ufunc() but only to all variables within the Dataset.
I understand I could probably access the variable directly using its name, like Dataset.var, and pass this to apply_ufunc(), but I don't quite understand how the output of this function (a delayed future) would be recombined with the original dataset prior to output.
Ideally, I want to do something like this (where 'data.nc' has multiple variables and only var1 is squared).
import xarray as xr
from distributed import Client
dask_client = Client()
def square(x):
    return x*x
data = xr.open_dataset('data.nc', chunks={'d1':10})
fut_sq = xr.apply_ufunc(square, data.var1, dask='parallelized', output_dtypes=['float'])
data.var1 = fut_sq.var1
fut_save = data.to_netcdf('new.nc', compute=False)
dask_client.compute(fut_save)
So I played around with this a bit more and decided that the best way to do this was to extract the data from the netCDF4 file, convert it to a dask.array and then rewrite a new file to disk. This involves writing custom functions using the dask.delayed functionality. Using the ufunc approach was probably inappropriate for my problem.
A few drawbacks of this:
You don't seem to be able to modify the file in place. To save the modified variables from the original NetCDF4 file you have to rewrite the whole file to disk.
For me at least, the best way to parallelise the custom square function was to create my own data chunks and pass these chunks individually to square, then reconstitute them using dask.array.concatenate. I know dask has some bagging functionality but I struggled to get it to work the way I wanted.
The reading of the file happens in parallel but it does not appear that dask writes to NetCDF4 in parallel.
It would be great if I can be corrected on these points.
Here is my amended example
import xarray as xr
from distributed import Client
import dask
import dask.array as da
dask_client = Client()
def bag_slices(ind, n=10):
    bag = list()
    prev = 0
    for i in range(len(ind)):
        if (i+1)%n == 0:
            bag.append(slice(prev, i+1, 1))
            prev = i+1
    if prev != i+1:
        bag.append(slice(prev, i+1, 1))
    return bag
@dask.delayed
def square(x):
    return x*x
@dask.delayed
def assign(old_xr_dataset, new_data):
    old_xr_dataset['var1'].values = new_data
    return old_xr_dataset
# for me data.var1 is 3D and I process by splitting the data along the second dimension.
with xr.open_dataset('data.nc', chunks={'d1':10}) as data:
    # create slice bags for distributed processing along preferred axis
    bags = bag_slices(data.coords['dim2'].values, n=10)
    # convert to dask array
    data_da = da.from_array(data.var1.values)
    # create data bags
    bags = [data_da[:, slc, :] for slc in bags]
    future_squared = []
    for data_bag in bags:
        # concatenate doesn't understand delayed objects
        # so must convert them back to delayed arrays
        future_squared.append(da.from_delayed(square(data_bag), data_bag.shape, dtype=float))
    data_new = dask.array.concatenate(future_squared, axis=1)
    fut_dataset = assign(data, data_new)
    fut_nc_save = fut_dataset.to_netcdf('data_squared.nc', compute=False)
    fut_nc_save.compute()
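As a side note, the slice bags built by bag_slices can be produced a little more compactly. This is only a sketch of the same idea; chunk_slices is a hypothetical name, and it takes the axis length rather than the coordinate values.

```python
def chunk_slices(length, n=10):
    """Return slices of at most n items that together cover range(length)."""
    return [slice(start, min(start + n, length)) for start in range(0, length, n)]


# 25 positions in chunks of 10 give two full slices and one remainder slice.
for slc in chunk_slices(25, n=10):
    print(slc.start, slc.stop)
```

Unlike the loop version, this also returns an empty list for an empty axis instead of failing on an undefined loop variable.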

How to correctly implement apply_async for data processing?

I am new to using parallel processing for data analysis. I have a fairly large array and I want to apply a function to each index of said array.
Here is the code I have so far:
import numpy as np
import statsmodels.api as sm
from statsmodels.regression.quantile_regression import QuantReg
import multiprocessing
from functools import partial
def fit_model(data,q):
    #data is a 1-D array holding precipitation values
    years = np.arange(1895,2018,1)
    res = QuantReg(exog=sm.add_constant(years),endog=data).fit(q=q)
    pointEstimate = res.params[1] #output slope of quantile q
    return pointEstimate
#precipAll is an array of shape (1405*621,123,12) (longitudes*latitudes,years,months)
#find all indices where there is data
nonNaN = np.where(~np.isnan(precipAll[:,0,0]))[0] #481631 indices
month = 4
#holder array for results
asyncResults = np.zeros((precipAll.shape[0])) * np.nan
def saveResult(result,pos):
    asyncResults[pos] = result
if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=20) #my server has 24 CPUs
    for i in nonNaN:
        #use partial so I can also pass the index i so the result is
        #stored in the expected position
        new_callback_function = partial(saveResult, pos=i)
        pool.apply_async(fit_model, args=(precipAll[i,:,month],0.9),callback=new_callback_function)
    pool.close()
    pool.join()
When I ran this, I stopped it once it had taken longer than running without multiprocessing at all. A single call to fit_model takes on the order of 0.02 seconds, so could the overhead associated with apply_async be causing the slowdown? I need to maintain the order of the results because I plot this data onto a map after the processing is done. Any thoughts on where I need to improve are greatly appreciated!
If you need to use the multiprocessing module, you'll probably want to batch more rows together into each task that you give to the worker pool. However, for what you're doing, I'd suggest trying out Ray due to its efficient handling of large numerical data.
import numpy as np
import statsmodels.api as sm
from statsmodels.regression.quantile_regression import QuantReg
import ray
@ray.remote
def fit_model(precip_all, i, month, q):
    data = precip_all[i,:,month]
    years = np.arange(1895, 2018, 1)
    res = QuantReg(exog=sm.add_constant(years), endog=data).fit(q=q)
    pointEstimate = res.params[1]
    return pointEstimate
if __name__ == '__main__':
    ray.init()
    # Create an array and place it in shared memory so that the workers can
    # access it (in a read-only fashion) without creating copies.
    precip_all = np.zeros((100, 123, 12))
    precip_all_id = ray.put(precip_all)
    result_ids = []
    for i in range(precip_all.shape[0]):
        result_ids.append(fit_model.remote(precip_all_id, i, 4, 0.9))
    results = np.array(ray.get(result_ids))
Some Notes
The example above runs out of the box, but note that I simplified the logic a bit. In particular, I removed the handling of NaNs.
On my laptop with 4 physical cores, this takes about 4 seconds. If you use 20 cores instead and make the data 9000 times bigger, I'd expect it to take about 7200 seconds, which is quite a long time. One possible approach to speeding this up is to use more machines or to process multiple rows in each call to fit_model in order to amortize some of the overhead.
The above example actually passes the entire precip_all matrix into each task. This is fine because each fit_model task only has read access to a copy of the matrix stored in shared memory and so doesn't need to create its own local copy. The call to ray.put(precip_all) places the array in shared memory once up front.
Ray and Python multiprocessing differ in several ways that are worth reading about. Note that I'm helping develop Ray.
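The batching idea mentioned above (several rows per task) could look roughly like this with plain multiprocessing. It is a sketch, not a drop-in replacement: fit_one is a hypothetical stand-in for the per-row quantile regression, and the batch size is something you would have to tune.

```python
from multiprocessing import Pool


def make_batches(indices, batch_size):
    """Split a list of row indices into consecutive batches."""
    return [indices[i:i + batch_size] for i in range(0, len(indices), batch_size)]


def fit_one(i):
    """Hypothetical stand-in for the per-row model fit."""
    return i * 0.5


def fit_batch(batch):
    """Fit every row of a batch inside one task and keep each row's index."""
    return [(i, fit_one(i)) for i in batch]


def run(indices, batch_size=1000, processes=4):
    results = {}
    with Pool(processes) as pool:
        # One task per batch, so the per-task overhead is paid far less often.
        for pairs in pool.imap_unordered(fit_batch, make_batches(indices, batch_size)):
            results.update(pairs)
    return results


if __name__ == "__main__":
    out = run(list(range(10)), batch_size=3, processes=2)
    print(sorted(out))
```

Because every result is returned together with its row index, the order in which batches finish does not matter, and the values can be written back into the holder array by position afterwards.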

Folding in array mutations using multiprocessing in python

I have 218k+ 33-channel images, and I need to find the mean and variance of each channel. I've tried to use multiprocessing, but this seems unbearably slow. Here's a brief code sample:
def work(aggregates, genput):
    # received (channel, image) from generator
    channel = genput[0]
    image = genput[1]
    for row in image:
        for pixel in row:
            # use welford's to update a list of "aggregates" which will
            # later be finalized as means and variances of each channel
            aggregates[channel] = update(aggregates[channel], pixel)
def data_stream(df, data_root):
    '''Generator that returns the channel and image for each tif file'''
    for index, sample in df.iterrows():
        curr_img_path = data_root
        # read the image with all channels
        tif = imread(curr_img_path) #33x64x64 array
        for channel, image in enumerate(tif):
            yield (channel, image)
# Pass over each image, compute mean/variance for each channel for each image
def preprocess_mv(df, data_root, channels=33, multiprocessing=True):
    '''Calculates mean and variance on the whole image set for use in deep_learn'''
    manager = Manager()
    aggregates = manager.list()
    [aggregates.append(([0,0,0])) for i in range(channels)]
    proxy = partial(work, aggregates)
    pool = Pool(processes=8)
    pool.imap(proxy, data_stream(df, data_root), chunksize=5000)
    pool.close()
    pool.join()
    # finalize data below
My suspicion is that the time it takes to pickle the aggregates array and transfer that back and forth from parent to child processes takes a horrendously long time, and that this is the major bottleneck - I could see this drawback completely eliminating the multi-process advantage since each child is having to wait for other children to pickle and unpickle data. I've read that this is sort of a limitation of the multiprocessing library, and from the pieces I've put together reading other posts here, I've come to realize this may be the best I can do. That said, does anyone have suggestions for how this could be improved?
Additionally, I'm wondering if there are better libraries/tools for this task? A friend actually recommended Scala and I have been investigating that as an option. I'm just very familiar with Python and would like to stay in this domain if possible.
I was able to come to a solution by exploring multiprocessing.Array a little more in depth. I had to figure out how to convert my 2D array to a 1D array and still make indexing work out, but this ended up being pretty simple math. I can now process 1000 samples in 2 minutes instead of 4 hours, so I think that's pretty nice. I also had to write a custom function to print the array, but that's fairly straightforward. This implementation doesn't guard against race conditions, but for my purposes this works fairly well. You could easily add a lock by including it in init and passing it in the same way you do with the array (using global).
def init(arr):
    global aggregates
    aggregates = arr
def work(genput):
    # received (sample, channel, image) from generator
    sample_no = genput[0]
    channel = genput[1]
    image = genput[2]
    currAgg = (aggregates[3*channel], aggregates[3*channel+1],
               aggregates[3*channel+2])
    for row in image:
        for pixel in row:
            # use welford's to compute updated aggregate
            newAgg = update(currAgg, pixel)
            currAgg = newAgg
    # New method of indexing for 1D array ("shaped" as 33x3)
    aggregates[3*channel] = newAgg[0]
    aggregates[(3*channel)+1] = newAgg[1]
    aggregates[(3*channel)+2] = newAgg[2]
def data_stream(df, data_root):
    '''Generator that returns the channel and image for each tif file'''
    ...
    yield (index, channel, image)
if __name__ == '__main__':
    aggs = Array('d', np.zeros(99)) #99 values for all aggrs
    pool = Pool(initializer=init, initargs=(aggs,), processes=8)
    pool.imap(work, data_stream(df, data_root), chunksize=10)
    pool.close()
    pool.join()
    #-----------finalize aggregates below
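The update function is not shown in the post. For readers following along, here is a sketch of what a Welford update could look like, assuming each aggregate is a (count, mean, M2) triple; that layout is my assumption, not something the post confirms. The merge helper is an addition: it combines two aggregates computed on disjoint data, which would let each worker keep private aggregates and sidestep the race condition mentioned above.

```python
def update(agg, value):
    """Welford step: fold one value into a (count, mean, M2) aggregate."""
    count, mean, m2 = agg
    count += 1
    delta = value - mean
    mean += delta / count
    m2 += delta * (value - mean)
    return (count, mean, m2)


def merge(a, b):
    """Combine two aggregates that were computed on disjoint data."""
    count_a, mean_a, m2_a = a
    count_b, mean_b, m2_b = b
    count = count_a + count_b
    if count == 0:
        return (0, 0.0, 0.0)
    delta = mean_b - mean_a
    mean = mean_a + delta * count_b / count
    m2 = m2_a + m2_b + delta * delta * count_a * count_b / count
    return (count, mean, m2)


def finalize(agg):
    """Return (mean, population variance) from an aggregate."""
    count, mean, m2 = agg
    if count == 0:
        raise ValueError("no data")
    return mean, m2 / count


left = (0, 0.0, 0.0)
for v in [1.0, 2.0]:
    left = update(left, v)
right = (0, 0.0, 0.0)
for v in [3.0, 4.0]:
    right = update(right, v)
print(finalize(merge(left, right)))
```

With merge, the workers would return one small triple per channel and the parent would fold them together, so no shared state has to be written concurrently.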

Sparse matrix dot product keeping only N-max values per result row

I've got a huge CSR sparse matrix M. I want to compute the dot product of this matrix with its transpose (M.dot(M.T)) and keep only the N largest values per row in the result matrix R. The problem is that M.dot(M.T) raises MemoryError, so I wrote a modified implementation of the dot function that looks like this:
def dot_with_top(m1, m2, top=None):
    if top is not None and top > 0:
        res_rows = []
        for row_id in xrange(m1.shape[0]):
            row = m1[row_id]
            if row.nnz > 0:
                res_row = m1[row_id].dot(m2)
                if res_row.nnz > top:
                    args_ids = np.argsort(res_row.data)[-top:]
                    data = res_row.data[args_ids]
                    cols = res_row.indices[args_ids]
                    res_rows.append(csr_matrix((data, (np.zeros(top), cols)), shape=res_row.shape))
                else:
                    res_rows.append(res_row)
            else:
                res_rows.append(csr_matrix((1, m1.shape[0])))
        return sparse.vstack(res_rows, 'csr')
    return m1.dot(m2)
It works fine, but it's a bit slow. Is it possible to make this calculation faster, or do you know of an existing method that does it faster?
You can move the body of your loop over the rows into a function and call this function through a multiprocessing.Pool() object.
This will parallelize the execution of your loop and should add a nice speedup.
Example :
from multiprocessing import Pool
from scipy import sparse
def f(row_id):
    # put the body of your loop here and return the trimmed result row
    return res_row
if __name__ == '__main__':
    p = Pool(4) # if you have 4 cores in your processor
    res_rows = p.map(f, xrange(m1.shape[0]))
    result = sparse.vstack(res_rows, 'csr')
source : https://docs.python.org/2/library/multiprocessing.html#using-a-pool-of-workers
Note that some library functions already run in parallel internally (NumPy's BLAS-backed routines are often multithreaded, for example), so check your processor activity while your script is running before implementing this solution.
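Independently of parallelism, the per-row selection itself can be made cheaper. The question sorts every result row with np.argsort and then takes the last N entries; np.argpartition finds the N largest without a full sort. A small sketch, where keep_top_n is a hypothetical helper that works on one row's data and indices arrays:

```python
import numpy as np


def keep_top_n(data, indices, n):
    """Return the n largest values of one sparse row and their column ids."""
    if n <= 0:
        raise ValueError("n must be positive")
    if data.size <= n:
        return data, indices
    # argpartition places the n largest entries in the last n positions
    # without fully sorting the row as argsort does.
    top = np.argpartition(data, -n)[-n:]
    top.sort()  # keep the surviving entries in their original storage order
    return data[top], indices[top]


row_data = np.array([0.5, 3.0, 1.0, 2.0])
row_cols = np.array([10, 20, 30, 40])
print(keep_top_n(row_data, row_cols, 2))
```

Whether this helps noticeably depends on how many non-zeros each result row has; for short rows the difference to argsort will be small.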

Python: Using multiprocessing module as possible solution to increase the speed of my function

I wrote a function in Python 2.7 (on 64-bit Windows) to calculate the mean value of the intersection area between a reference polygon (Ref) and one or more segmented (Seg) polygons in ESRI shapefile format. The code is quite slow because I have more than 2000 reference polygons, and for each Ref polygon the function runs over all Seg polygons (more than 7000). I am sorry, but the function is a prototype.
I would like to know whether multiprocessing can help me speed up my loop or whether there are better-performing solutions. If multiprocessing is a possible solution, I would like to know the best way to optimize the following function:
import os
import numpy as np
import osgeo.ogr
from shapely.geometry import Polygon
def AreaInter(reference,segmented,outFile):
    # open shapefile
    ref = osgeo.ogr.Open(reference)
    if ref is None:
        raise SystemExit('Unable to open %s' % reference)
    seg = osgeo.ogr.Open(segmented)
    if seg is None:
        raise SystemExit('Unable to open %s' % segmented)
    ref_layer = ref.GetLayer()
    seg_layer = seg.GetLayer()
    # create outfile
    if not os.path.split(outFile)[0]:
        file_path, file_name_ext = os.path.split(os.path.abspath(reference))
        outFile_filename = os.path.splitext(os.path.basename(outFile))[0]
        file_out = open(os.path.abspath("{0}\\{1}.txt".format(file_path, outFile_filename)), "w")
    else:
        file_path_name, file_ext = os.path.splitext(outFile)
        file_out = open(os.path.abspath("{0}.txt".format(file_path_name)), "w")
    # For each reference objects-i
    for index in xrange(ref_layer.GetFeatureCount()):
        ref_feature = ref_layer.GetFeature(index)
        # get FID (=Feature ID)
        FID = str(ref_feature.GetFID())
        ref_geometry = ref_feature.GetGeometryRef()
        pts = ref_geometry.GetGeometryRef(0)
        points = []
        for p in xrange(pts.GetPointCount()):
            points.append((pts.GetX(p), pts.GetY(p)))
        # convert in a shapely polygon
        ref_polygon = Polygon(points)
        # get the area
        ref_Area = ref_polygon.area
        # create two empty lists
        seg_Area, intersect_Area = ([] for _ in range(2))
        # For each segmented objects-j
        for segment in xrange(seg_layer.GetFeatureCount()):
            seg_feature = seg_layer.GetFeature(segment)
            seg_geometry = seg_feature.GetGeometryRef()
            pts = seg_geometry.GetGeometryRef(0)
            points = []
            for p in xrange(pts.GetPointCount()):
                points.append((pts.GetX(p), pts.GetY(p)))
            seg_polygon = Polygon(points)
            seg_Area.append(seg_polygon.area)
            # intersection (overlap) of reference object with the segmented object
            intersect_polygon = ref_polygon.intersection(seg_polygon)
            # area of intersection (= 0, No intersection)
            intersect_Area.append(intersect_polygon.area)
        # Average for all segmented objects (because 1 or more segmented polygons can intersect with reference polygon)
        seg_Area_average = np.average(seg_Area)
        intersect_Area_average = np.average(intersect_Area)
        file_out.write(" ".join(["%s" %i for i in [FID, ref_Area,seg_Area_average,intersect_Area_average]])+ "\n")
    file_out.close()
You can use the multiprocessing package, and especially the Pool class. First create a function that does all the stuff you want to do within the for loop, and that takes as an argument only the index:
def process_reference_object(index):
    ref_feature = ref_layer.GetFeature(index)
    # all your code goes here
    return (" ".join(["%s" %i for i in [FID, ref_Area,seg_Area_average,intersect_Area_average]])+ "\n")
Note that this doesn't write to a file itself- that would be messy because you'd have multiple processes writing to the same file at the same time. Instead, it returns the string that needs to be written. Also note that there are objects in this function like ref_layer or ref_geometry that will need to reach it somehow- that's up to you how to do it (you could put process_reference_object as the method in a class initialized with them, or it could be as ugly as just defining them globally).
Then, you create a pool of process resources, and run all of your indices using Pool.imap_unordered (which will itself allocate each index to a different process as necessary):
from multiprocessing import Pool
p = Pool() # run multiple processes
for l in p.imap_unordered(process_reference_object, range(ref_layer.GetFeatureCount())):
    file_out.write(l)
This will parallelize the independent processing of your reference objects across multiple processes, and write them to the file (in an arbitrary order, note).
Threading can help to a degree, but first you should make sure you can't simplify the algorithm. If you're checking each of 2000 reference polygons against 7000 segmented polygons (perhaps I misunderstood), then you should start there. Anything that runs in O(n^2) is going to be slow, so maybe you can prune away pairs that will definitely not intersect or find some other way to speed things up. Otherwise, running multiple processes or threads will only improve things linearly while the work grows quadratically.
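One cheap way to prune, sketched here in plain Python, is to compare axis-aligned bounding boxes first and only run the expensive intersection on pairs whose boxes overlap. The function names are made up for illustration, and polygons are assumed to be lists of (x, y) vertices as in the question's points lists.

```python
def bounding_box(points):
    """Return (min_x, min_y, max_x, max_y) for a list of (x, y) vertices."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


def boxes_overlap(a, b):
    """True when two axis-aligned boxes share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def candidate_pairs(ref_polys, seg_polys):
    """Yield (i, j) only for polygon pairs whose bounding boxes overlap."""
    seg_boxes = [bounding_box(p) for p in seg_polys]
    for i, ref in enumerate(ref_polys):
        ref_box = bounding_box(ref)
        for j, seg_box in enumerate(seg_boxes):
            if boxes_overlap(ref_box, seg_box):
                yield i, j


refs = [[(0, 0), (2, 0), (2, 2), (0, 2)]]
segs = [[(1, 1), (3, 1), (3, 3), (1, 3)], [(5, 5), (6, 5), (6, 6), (5, 6)]]
print(list(candidate_pairs(refs, segs)))
```

This still compares every pair of boxes, but a box test is far cheaper than a polygon intersection, and skipped pairs simply have an intersection area of zero. A spatial index would cut the number of box tests as well.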
