I am running simulations using dask.distributed. My model is defined in a delayed function and I stack several realizations.
A simplified version of what I do is given in this code snippet:
import numpy as np
import xarray as xr
import dask.array as da
import dask
from dask.distributed import Client
from itertools import repeat
@dask.delayed
def run_model(n_time, a, b):
    result = np.array([a * np.random.randn(n_time) + b])
    return result
client = Client()
# Parameters
n_sims = 10000
n_time = 100
a_vals = np.random.randn(n_sims)
b_vals = np.random.randn(n_sims)
output_file = 'out.nc'
# Run simulations
out = da.stack([da.from_delayed(run_model(n_time, a, b), (1, n_time), np.float64)
                for a, b in zip(a_vals, b_vals)])
# Store output in an xarray Dataset
ds = xr.Dataset({'var1': (['realization', 'time'], out[:, 0, :])},
                coords={'realization': np.arange(n_sims),
                        'time': np.arange(n_time) * .1})
# Save to a netcdf file -> at this point, computations will be carried out
ds.to_netcdf(output_file)
If I want to run a lot of simulations, I get the following warning:
/home/user/miniconda3/lib/python3.6/site-packages/distributed/worker.py:840: UserWarning: Large object of size 2.73 MB detected in task graph:
("('getitem-32103d4a23823ad4f97dcb3faed7cf07', 0, ... cd39>]), False)
Consider scattering large objects ahead of time
with client.scatter to reduce scheduler burden and keep data on workers
future = client.submit(func, big_data) # bad
big_future = client.scatter(big_data) # good
future = client.submit(func, big_future) # good
% (format_bytes(len(b)), s))
As far as I understand (from this and this question), the method proposed by the warning helps in getting large data into the function. However, my inputs are all scalar values, so they should not take up nearly 3MB of memory. Even if the function run_model() does not take any argument at all (so no parameters are passed), I get the same warning.
I also had a look at the task graph to see whether there is some step that requires loading lots of data. For three realizations it looks like this:
So it seems to me that every realization is handled separately which should keep the amount of data to deal with low.
I would like to understand what actually is the step that produces a large object and what I need to do to break it down into smaller parts.
The message is, in this case, slightly misleading. The issue is demonstrated by the following:
>>> len(out[:, 0, :].dask)
40000
>>> out[:, 0, :].npartitions
10000
and the pickled size of that graph (whose head is the getitem key in the message) is the ~3 MB. By creating a dask array for each element of the computation, you end up with a stacked array with as many partitions as elements, and the model-run operation, the item selection, and the storage operation are all applied to every single one and stored in the graph. Yes, they are independent, and likely this entire computation would complete, but this is all very wasteful, unless the model function runs for considerable time on each input scalar.
In your real situation, the inner arrays may in fact be bigger than the one-element version you present, but in the general case of doing numpy operations on arrays, it is normal to create the arrays on the workers (with random or some load function) and to operate on partitions with sizes >100 MB.
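For the toy model in the question, a minimal sketch of that approach, creating the random data directly as a dask array with a handful of large partitions instead of one delayed call per realization, might look like this:
import numpy as np
import xarray as xr
import dask.array as da

n_sims, n_time = 10000, 100
a_vals = np.random.randn(n_sims)
b_vals = np.random.randn(n_sims)

# One dask array with 10 partitions of 1000 realizations each,
# rather than 10000 single-realization partitions.
noise = da.random.standard_normal((n_sims, n_time), chunks=(1000, n_time))
# Broadcast the per-realization scalars across the time axis.
out = a_vals[:, None] * noise + b_vals[:, None]

ds = xr.Dataset({'var1': (['realization', 'time'], out)},
                coords={'realization': np.arange(n_sims),
                        'time': np.arange(n_time) * .1})
ds.to_netcdf('out.nc')  # the graph now has a few dozen tasks, not ~40000
The graph stays small because each task produces a (1000, 100) block rather than a single realization.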
Related
Here is the problem. I have thousands of formulas to evaluate and access, such as 'rank(sqrt(v1))' or 'corr(v1, square(v2))'. All values (v1, v2, ...) have been placed in shared memory for multiprocessing. I have one function, and all computations within it use only numpy or scipy functions.
from scipy.stats import rankdata
import numpy as np
import multiprocessing as mp

def myeval(formula):
    ...  # read from shared memory
    ...  # compute
    return value

def eval_access(formula):
    factor = myeval(formula)  # consumes 1 GB of memory
    factor_performance1 = rankdata(factor, axis=1)  # another gigabyte
    factor_performance2 = np.full_like(factor, np.nan)  # another one
    ...  # some computation
    pickle_part  # some I/O to record performance (just some float numbers)
    return

with mp.Pool(30) as p:
    p.map(eval_access, all_formulas)
Running this function in parallel makes good use of all CPUs (almost 100% usage every second), but each process will consume at most 3 GB, so 30 processes may consume up to 60 GB of memory simultaneously. It also seems that numpy operations slow down under multiprocessing. To avoid these problems, I want to use all CPUs within the eval_access function, as follows.
from scipy.stats import rankdata
import numpy as np
import multiprocessing as mp

def myeval(ind1, ind2):
    ...  # alter value[ind1:ind2, :] or value[:, ind1:ind2]

def myeval_multiprocess(formula):
    ...  # read from shared memory
    value = np.zeros(shape)  # prepare space for the return value
    ...  # put value into shared memory
    ...  # split value's index range into 30 parts
    with mp.Pool(30) as p:
        p.starmap(myeval, ind_list)  # ind_list holds (ind1, ind2) pairs
    return value

def eval_access(formula):
    factor = myeval_multiprocess(formula)  # consumes 1 GB of memory
    factor_performance1 = rankdata(factor, axis=1)  # another gigabyte
    factor_performance2 = np.full_like(factor, np.nan)  # another one
    ...  # some computation
    pickle_part  # some I/O to record performance (just some float numbers)
    return

for formula in all_formulas:
    eval_access(formula)
Ideally, these two versions would take the same time, and the latter would need less memory at any point. However, the problem is that it costs a lot of time to put value into shared memory within myeval_multiprocess, and not all methods in eval_access can make good use of all CPUs unless I decorate every function I use so that it supports splitting the whole problem into subsets.
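For context, here is a minimal sketch of how I imagine putting value into shared memory with multiprocessing.shared_memory (Python 3.8+, assuming a fork-based start method on Linux; the sizes and names are just illustrative):
import numpy as np
from multiprocessing import Pool, shared_memory

shape, dtype = (1000, 1000), np.float64  # illustrative size

# Parent: allocate one shared block and view it as a numpy array
shm = shared_memory.SharedMemory(
    create=True, size=int(np.prod(shape)) * np.dtype(dtype).itemsize)
value = np.ndarray(shape, dtype=dtype, buffer=shm.buf)

def myeval(bounds):
    ind1, ind2 = bounds
    # Worker: attach to the existing block by name; no data is copied
    existing = shared_memory.SharedMemory(name=shm.name)
    chunk = np.ndarray(shape, dtype=dtype, buffer=existing.buf)
    chunk[ind1:ind2, :] = 42.0  # placeholder for the real computation
    del chunk  # drop the view before closing the mapping
    existing.close()

ind_list = [(i, i + 100) for i in range(0, shape[0], 100)]
with Pool(30) as p:
    p.map(myeval, ind_list)

del value  # drop the view before releasing the block
shm.close()
shm.unlink()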
So, are there any suggestions on which version is better or how to speed things up, and can the latter version run as intended? Also, is there any method that can easily parallelize every computation?
I have a function that I'd like to apply along the time axis for each combination of month and day of a dataset, e.g. to all time slices that have month=1 and day=1 over all years (and so on).
My current solution is to loop over the combinations, subset the dataset, apply the function and persist the result on the dask cluster:
from distributed import Client
import numpy as np
import xarray as xr

client = Client()

# fake function
def fun_along_time(x):
    return x * 2

# test data
x = xr.tutorial.load_dataset("air_temperature")
# time dimension needs to be a single chunk
x = x.chunk({"time": -1, "lat": 10, "lon": 10})

timeindex = x.time.to_index()
for month in timeindex.month.unique():
    for day in timeindex.day.unique():
        xsel = x.sel(
            time=np.logical_and(x.time.dt.month == month, x.time.dt.day == day)
        )
        xres = xr.apply_ufunc(fun_along_time,
                              xsel,
                              input_core_dims=[["time"]],
                              output_core_dims=[["time"]],
                              dask="parallelized")
        xres = xres.persist()
        # some downstream tasks follow
While this approach works, it does not seem very elegant, and it calls persist repeatedly, which is hard on the scheduler (and bad practice?).
To make things easier on the scheduler, I've tried persisting the arrays together using client.persist, which worked OK but doesn't seem efficient either; I've also tried waiting for each array to finish persisting and computing the downstream tasks before moving on to the next iteration, which works too, with obvious downsides.
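For reference, batching those persists into a single call ("persisting the arrays together") looked roughly like this:
results = []
for month in timeindex.month.unique():
    for day in timeindex.day.unique():
        xsel = x.sel(
            time=np.logical_and(x.time.dt.month == month, x.time.dt.day == day)
        )
        results.append(xr.apply_ufunc(fun_along_time,
                                      xsel,
                                      input_core_dims=[["time"]],
                                      output_core_dims=[["time"]],
                                      dask="parallelized"))
# one call to the scheduler instead of one per iteration
results = client.persist(results)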
So I'm wondering what would be a better way to achieve this, keeping in mind that this needs to scale well with dask.
I've looked into groupby and map but I don't think it works well for me as it generates a ton of dask tasks and messes with the chunksize which generates all kinds of issues itself.
I thought about adding a new dimension/coordinate with month/day, which I could then use for map/apply_ufunc, but I couldn't wrap my head around it.
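A rough sketch of what I was imagining (untested, and it may generate the same flood of tasks as groupby/map did):
# label every time step with its month-day combination, then group on it
monthday = x.time.dt.strftime("%m-%d")
x2 = x.assign_coords(monthday=("time", monthday.data))
xres = x2.groupby("monthday").map(
    lambda g: xr.apply_ufunc(fun_along_time, g,
                             input_core_dims=[["time"]],
                             output_core_dims=[["time"]],
                             dask="parallelized"))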
I'm a bit lost between joblib, multiprocessing, etc.
What's the most effective way to parallelize a for loop, based on your experience?
For example :
for i, p in enumerate(patches[ss_idx]):
    bar.update(i + 1)
    h_features.append(calc_haralick(p))

def calc_haralick(roi):
    feature_vec = []
    texture_features = mt.features.haralick(roi)
    mean_ht = texture_features.mean(axis=0)
    feature_vec.extend(mean_ht[0:9])
    return np.array(feature_vec)
It iterates over the patches of images, then extracts features via haralick.
And this is how I get patches
h_neigh = 11 # haralick neighbourhood
size = h_neigh
shape = (img.shape[0] - size + 1, img.shape[1] - size + 1, size, size)
strides = 2 * img.strides
patches = stride_tricks.as_strided(img, shape=shape, strides=strides)
patches = patches.reshape(-1, size, size)
Sorry if any information is superfluous
Your images appear to be simple two-dimensional NumPy arrays, and patches a list or array of those. I assume ss_idx is an index array (i.e., not an integer), so that patches[ss_idx] remains something that can be iterated over (as in your example).
In that case, simply use multiprocessing.Pool.map:
import multiprocessing as mp

nproc = 10

with mp.Pool(nproc) as pool:
    h_features = pool.map(calc_haralick, patches[ss_idx])
See the first basic example in the multiprocessing documentation.
If you leave out nproc or set it to None, all available cores will be used.
The potential problem with multiprocessing is that it will create nproc identical Python processes and copy all the relevant data to those processes. If your images are large, this will cause considerable overhead.
In such a case, it may be worth splitting your Python program into separate programs, where calculating the features of a single image is one independent program. That program would need to handle reading a single image and writing the features. You'd then wrap everything in e.g. a bash script that loops over all images, taking care to use only a certain number of cores at the same time (e.g., background processes, but waiting every 10 images). The next step/program requires reading the independent feature files into a multi-dimensional array, but from there, you can continue your old program.
While this is more work, it may save some copying overhead (though it introduces extra I/O overhead, in particular writing the separate feature files).
It also has the optional advantage that this is fairly easy to run distributed, should the need ever arise.
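A minimal sketch of what such a per-image program could look like (assuming mt is mahotas and that images are stored as .npy files; the file layout here is hypothetical):
# features_one_image.py -- hypothetical standalone per-image program
import sys
import numpy as np
from numpy.lib import stride_tricks
import mahotas as mt  # assuming mt in the question is mahotas

def calc_haralick(roi):
    # mean of the directional Haralick feature vectors, first 9 values
    return mt.features.haralick(roi).mean(axis=0)[:9]

if __name__ == "__main__":
    infile, outfile = sys.argv[1], sys.argv[2]
    img = np.load(infile)
    size = 11  # haralick neighbourhood, as in the question
    shape = (img.shape[0] - size + 1, img.shape[1] - size + 1, size, size)
    patches = stride_tricks.as_strided(img, shape=shape, strides=2 * img.strides)
    patches = patches.reshape(-1, size, size)
    np.save(outfile, np.array([calc_haralick(p) for p in patches]))
You would then call it as python features_one_image.py img_000.npy features_000.npy from the wrapper script.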
Try multiprocessing first, keeping an eye on memory usage and CPU usage (if nothing happens for a long time, it may be copying overhead). Then, try another method.
I've implemented a genetic search algorithm and tried to parallelise it, but getting terrible performance (worse than single threaded). I suspect this is due to communication overhead.
I have provided pseudo-code below, but in essence the genetic algorithm creates a large pool of "Chromosome" objects, then runs many iterations of:
1. Score each individual chromosome based on how it performs in a 'world'. The world remains static across iterations.
2. Randomly select a new population based on the scores calculated in the previous step.
3. Go to step 1 for n iterations.
The scoring algorithm (step 1) is the major bottleneck, hence it seemed natural to distribute out the processing of this code.
I have run into a couple of issues I hoped I could get help with:
How can I link the calculated score with the object that was passed to the scoring function by map(), i.e. link each Future holding a score back to a Chromosome? I've done this in a very clunky way by having the calculate_scores() method return the object, but in reality all I need is to send a float back if there is a better way to maintain the link.
The parallel processing of the scoring function is working okay, though takes a long time for map() to iterate through all the objects. However, the subsequent calls to draw_chromosome_from_pool() run very slowly compared to the single-threaded version to the point that I've not yet seen it complete. I have no idea what is causing this as the method always completes quickly in the single-threaded version. Is there some IPC going on to pull the chromosomes back to the local process, even after all the futures have completed? Is the local process de-prioritised in some way?
I am worried that the overall iterative nature of building/rebuilding the pool each cycle is going to cause an enormous amount of data transmission to the workers. The question at the root of this concern: what and when does Dask actually send data back and forth to the worker pool. i.e. when does Environment() get distributed out vs. Chromosome(), and how/when do results come back? I've read the docs but either haven't found the right detail, or am too stupid to understand.
Idealistically, I think (but open to correction) what I want is a distributed architecture where each worker holds the Environment() data locally on a 'permanent' basis, then Chromosome() instance data is distributed for scoring with little duplicated back/forth of unchanged Chromosome() data between iterations.
Very long post, so if you have taken the time to read this, thank you already!
class Chromosome(object):  # Small size: several hundred bytes per instance
    def get_score(self):
        # Returns a float
    def set_score(self, i):
        # Stores a float

class Environment(object):  # Large size: 20-50 MB per instance, but only one instance
    def calculate_scores(self, chromosome):
        # Slow calculation using attributes from the chromosome and instance data
        chromosome.set_score(x)
        return chromosome

class Evolver(object):
    def draw_chromosome_from_pool(self, max_score):
        while True:
            individual = np.random.choice(self.chromosome_pool)
            selection_chance = np.random.uniform()
            if selection_chance < individual.get_score() / max_score:
                return individual

    def run_evolution(self):
        self.dask_client = Client()
        self.chromosome_pool = list()
        for i in range(10000):
            self.chromosome_pool.append(Chromosome())

        world_data = LoadWorldData()  # Returns a pandas DataFrame
        self.world = Environment(world_data)

        iterations = 1000
        for i in range(iterations):
            highest_score = 0
            futures = self.dask_client.map(self.world.calculate_scores,
                                           self.chromosome_pool)
            for future in as_completed(futures):
                c = future.result()
                highest_score = max(highest_score, c.get_score())

            new_pool = set()
            while len(new_pool) < self.pool_size:
                mother = self.draw_chromosome_from_pool(highest_score)
                # do stuff to build a new pool
Yes, each time you call the line
futures = self.dask_client.map(self.world.calculate_scores, self.chromosome_pool)
you are serialising self.world, which is large. You could do this just once before the loop with
future_world = client.scatter(self.world, broadcast=True)
and then in the loop
futures = self.dask_client.map(lambda ch: Environment.calculate_scores(future_world, ch), self.chromosome_pool)
will use the copies already on the workers (or a simple function that does the same). The point is that future_world is just a pointer to stuff already distributed, but dask takes care of this for you.
On the issue of which chromosome is which: using as_completed breaks the order in which you submitted them to map, but this is not necessary for your code. You could have used wait to process the results once all the work was done, or simply iterated over each future's result() (which waits for each task to be done), and then you would retain the ordering of chromosome_pool.
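For example, a sketch of the ordered variant (names from your pseudo-code), passing the scattered world as an argument so each result lines up with its chromosome:
future_world = self.dask_client.scatter(self.world, broadcast=True)
for i in range(iterations):
    futures = self.dask_client.map(Environment.calculate_scores,
                                   [future_world] * len(self.chromosome_pool),
                                   self.chromosome_pool)
    # gather() returns results in submission order, so scored[j]
    # corresponds to self.chromosome_pool[j]
    scored = self.dask_client.gather(futures)
    highest_score = max(c.get_score() for c in scored)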
I'm trying to speed up a section of my code using parallel processing in python, but I'm having trouble getting it to work right, or even find examples that are relevant to me.
The code produces a low-polygon version of an image using Delaunay triangulation, and the part that's slowing me down is finding the mean values of each triangle.
I've been able to get a good speed increase by vectorizing my code, but hope to get more using parallelization:
The code I'm having trouble with is an extremely simple for loop:
for tri in tris:
    lopo[tridex == tri, :] = np.mean(hipo[tridex == tri, :], axis=0)
The variables referenced are as follows.
tris - a Python list of the unique triangle indices
lopo - a NumPy array for the final low-polygon version of the image
hipo - a NumPy array of the original image
tridex - a NumPy array the same size as the image; each element represents a pixel and stores the triangle that the pixel lies within
I can't seem to find a good example that uses multiple numpy arrays as input, with one of them shared.
I've tried multiprocessing (with the above snippet wrapped in a function called colorImage):
p = Process(target=colorImage, args=(hipo,lopo,tridex,ppTris))
p.start()
p.join()
But I get a broken pipe error immediately.
So the way that Python's multiprocessing works (for the most part) is that you have to designate the individual threads that you want to run. I made a brief introductory tutorial here: http://will-farmer.com/parallel-python.html
In your case, what I would recommend is split tris into a bunch of different parts, each equally sized, each that represents a "worker". You can split this list with numpy.split() (documentation here: http://docs.scipy.org/doc/numpy/reference/generated/numpy.split.html).
Then, for each list in tri_lists, we use the threading and queue modules to designate 8 workers.
import queue
import threading
import numpy as np

# split into 8 different lists
tri_lists = np.split(tris, 8)

# Queues are threadsafe
return_values = queue.Queue()
threads = []

def color_image(q, tris, hipo, tridex):
    """ This is the function we're parallelizing """
    for tri in tris:
        # store the triangle index with its mean so they can be matched up later
        q.put((tri, np.mean(hipo[tridex == tri, :], axis=0)))

# Now we run the jobs
for i in range(8):
    threads.append(threading.Thread(
        target=color_image,
        args=(return_values, tri_lists[i], hipo, tridex)))
for t in threads:
    t.start()
for t in threads:
    t.join()

# Now we have to clean up our results:
# drain the queue and set the values in lopo
while not return_values.empty():
    tri, mean = return_values.get()
    lopo[tridex == tri, :] = mean
This isn't the cleanest way to do it, and I can't test it, but it's a decent approach. The parallelized part is now the np.mean() calls, while setting the values is not parallelized.
If you want to also parallelize the setting of the values, you'll have to have a shared variable, either using the Queue, or with a global variable.
See this post for a shared global variable: Python Global Variable with thread
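Alternatively, a sketch using a process pool instead of threads (same variable names as the question; note that hipo and tridex get copied to the workers, which may dominate for large images):
import multiprocessing as mp
from functools import partial
import numpy as np

def tri_mean(tri, hipo, tridex):
    # return the triangle index together with its mean colour
    return tri, np.mean(hipo[tridex == tri, :], axis=0)

if __name__ == "__main__":
    # hipo, tridex, lopo and tris defined as in the question
    worker = partial(tri_mean, hipo=hipo, tridex=tridex)
    with mp.Pool() as pool:
        for tri, mean in pool.map(worker, tris):
            lopo[tridex == tri, :] = mean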