Dask parallelize a short task that uses a large np.ndarray

I have a function f that takes as input a variable x, which is a large np.ndarray (length 20000).
Execution of f takes very little (about 5ms).
A for loop over a matrix M with many rows
    for x in M:
        f(x)
takes about 5 times longer than parallelizing using multiprocessing
    import multiprocessing
    with multiprocessing.Pool() as pool:
        pool.map(f, M)
I have tried to parallelize with dask, but it loses even against sequential execution. A related post is here, but the accepted answer doesn't work for me. I have tried many things, such as partitioning the data as the best practices recommend and using dask.bag. I'm running Dask on a local machine with 4 physical cores.
So the question is how to use dask with short tasks that take large data as input?

Firstly, the dask documentation makes clear the following contraindications:
it is a bad idea to create big data in the client and pass it to workers; you should have workers load the data they need
if the data you need fit into memory, the standard python tool (in this case numpy) probably works as well or better than dask
if you want to share memory and are running processes such as numpy that release the GIL, then you should prefer threads over processes.
dask multiprocessing should not generally be used if you can run distributed (i.e., always)
don't use a python loop over an array, you should vectorize
Since we don't know much about what you are doing or your system, I will offer a guess as to why dask is slower than multiprocessing. When you use multiprocessing.Pool, the system probably creates the processes via fork and copies (or copy-on-write duplicates) the array into each process, so they can all access it. Dask requires threads and event loops to run, so it is not safe to use with fork. This means that when you want data in the client to be processed in a worker, it must be serialised and sent over IPC. This is very likely the cause of your slowdown.
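To illustrate the "prefer threads" advice above, here is a minimal sketch that uses the standard library rather than dask. It assumes f spends most of its time inside NumPy calls that release the GIL; the body of f below is a hypothetical stand-in, and the helper names and chunk counts are illustrative choices. Rows are grouped into chunks so that each submitted task does many 5 ms calls instead of one, and the threads read M in place without any serialisation.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def f(x):
    """Hypothetical stand-in for the short task: one cheap NumPy reduction."""
    return float(np.sort(x)[-10:].sum())


def process_chunk(rows):
    """Run f on every row of a chunk so each submitted task does many calls."""
    return [f(x) for x in rows]


def run_threaded(M, n_workers=4, chunks_per_worker=4):
    # Threads share M directly: nothing is pickled or copied between processes.
    chunks = np.array_split(M, n_workers * chunks_per_worker)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = pool.map(process_chunk, chunks)
    # pool.map preserves chunk order, so results line up with the rows of M.
    return [value for part in parts for value in part]
```

Whether this beats the sequential loop depends on how much of f actually runs outside the GIL, so it is worth timing both on the real function.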

Related

ProcessPoolExecutor overhead ?? parallel processing takes more time than single process for large size matrix operation

My python code contains a numpy dot operation of huge size (over 2^(tens...)) matrix and vector.
To reduce the computing time, I applied parallel processing by dividing the matrix suited for the number of cpu cores.
I used concurrent.futures.ProcessPoolExecutor.
My issue is that the parallel processing takes much more time than single processing.
The following is my code.
Single-process code:
    self._vector = np.dot(matrix, self._vector)
Parallel processing code:
    each_worker_coverage = int(self._dimension/self.workers)
    args = []
    for i in range(self.workers):
        if (i+1)*each_worker_coverage < self._dimension:
            arg = [i, gate[i*each_worker_coverage:(i+1)*each_worker_coverage], self._vector]
        else:
            arg = [i, gate[i*each_worker_coverage:self._dimension], self._vector]
        args.append(arg)
    pool = futures.ProcessPoolExecutor(self.workers)
    results = list(pool.map(innerproduct, args, chunksize=1))
    for result in results:
        if (result[0]+1)*each_worker_coverage < self._dimension:
            self._vector[result[0]*each_worker_coverage:(result[0]+1)*each_worker_coverage] = result[1]
        else:
            self._vector[result[0]*each_worker_coverage:self._dimension] = result[1]
The innerproduct function called in parallel is as follows.
    def innerproduct(args):
        answer = np.dot(args[1], args[2])
        return args[0], answer
For a 2^14 x 2^14 matrix and a 2^14 vector, the single process code takes only 0.05 seconds, but the parallel processing code takes 6.2 seconds.
I also checked the time with the `innerproduct` method, and it only takes 0.02~0.03 sec.
I don't understand this situation.
Why does the parallel processing (multi-processing not multi-threading) take more time?

To know exactly what causes the slowdown, you would have to measure how long each step takes, and with multiprocessing and multithreading that can be tricky.
So what follows is my best guess. For multiprocessing to work, the parent process has to transfer the data used in the calculations to the worker processes. The time this takes depends on the amount of data. Transferring a 2^14 x 2^14 matrix is probably going to take a significant amount of time.
I suspect that this data transfer is what is taking the extra time.
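One rough way to test this suspicion is to time the serialisation round trip that a process pool performs for each argument and compare it with the computation itself. The sketch below is only an approximation of the real transfer (it ignores the pipe and process start-up), the function names are hypothetical, and n defaults to a smaller size than the 2^14 in the question so that it runs quickly.

```python
import pickle
import time

import numpy as np


def time_call(func, *args):
    """Return (result, elapsed_seconds) for a single call."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def compare_transfer_and_compute(n=2048):
    # n is kept small here; the question uses n = 2**14.
    rng = np.random.default_rng(0)
    matrix = rng.random((n, n))
    vector = rng.random(n)

    # Cost of the actual work.
    _, compute_s = time_call(np.dot, matrix, vector)
    # Cost of turning the matrix into bytes and back, as a pool must do.
    payload, dumps_s = time_call(pickle.dumps, matrix, pickle.HIGHEST_PROTOCOL)
    restored, loads_s = time_call(pickle.loads, payload)

    assert np.array_equal(restored, matrix)
    return {
        "compute_s": compute_s,
        "pickle_s": dumps_s + loads_s,
        "payload_mib": len(payload) / 2**20,
    }
```

If pickle_s comes out much larger than compute_s on your machine, the data transfer is the likely bottleneck.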
If you are using an operating system that uses the fork start method for multiprocessing/concurrent.futures, there is a way around this data transfer. Such operating systems include Linux and the BSDs, but not macOS or MS Windows.
On the abovementioned operating systems, multiprocessing uses the fork system call to create its workers. This system call creates each child process as a copy of the parent process. So if you create the vectors and matrices before creating the ProcessPoolExecutor, the workers will inherit that data. This is not a very costly or time-consuming operation, because all these operating systems use copy-on-write for managing memory pages. As long as the original matrix isn't changed, all programs using it read from the same memory pages. Inheriting the data means you don't have to pass it explicitly to the worker. You only have to pass a small data structure that says which index ranges a worker has to operate on.
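A minimal sketch of this inheritance approach follows. It assumes a POSIX system where the fork start method is available, and the names MATRIX, VECTOR, row_block_dot and parallel_matvec are hypothetical. The arrays are module-level globals created before the pool, so only the (start, stop) index pairs and the small result blocks travel between processes.

```python
import multiprocessing as mp

import numpy as np

# Module-level data, created BEFORE the pool so forked workers inherit it.
rng = np.random.default_rng(0)
MATRIX = rng.random((512, 512))
VECTOR = rng.random(512)


def row_block_dot(bounds):
    """Multiply one block of rows by the vector; only indices are sent in."""
    start, stop = bounds
    return start, MATRIX[start:stop] @ VECTOR


def parallel_matvec(workers=4):
    n = MATRIX.shape[0]
    step = -(-n // workers)  # ceiling division, so the last block may be shorter
    blocks = [(i, min(i + step, n)) for i in range(0, n, step)]
    out = np.empty(n)
    ctx = mp.get_context("fork")  # assumption: fork is available (not Windows)
    with ctx.Pool(workers) as pool:
        for start, part in pool.map(row_block_dot, blocks):
            out[start:start + len(part)] = part
    return out
```

Calling parallel_matvec() should return the same values as MATRIX @ VECTOR. For a matrix this small the pool overhead will dominate, so the sketch shows the mechanism rather than a speedup.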
Unfortunately, due to technical limitations of the platform, this doesn't work on macOS and MS Windows. What you could do on those systems is store the original matrix and vector in memory-mapped binary files before you create the Executor. If you tag these mappings with a name, the worker processes should be able to map the same data into their memory without having to transfer it. I think it is possible to instruct numpy to use such a raw binary array without recreating it.
On both platforms you could use the same technique to "send data back" to the parent process; save the data in shared memory and return the filename or tagname to the parent process.
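As a sketch of the memory-mapped variant, NumPy can write an array to a .npy file with np.save and later open it with np.load(..., mmap_mode="r"), which maps the file instead of reading it all. The helper names and the file name below are hypothetical, and the vector is passed as a plain argument here on the assumption that it is small compared with the matrix.

```python
import os
import tempfile

import numpy as np


def save_for_sharing(array, directory):
    """Write the array once to a .npy file and return its path."""
    path = os.path.join(directory, "matrix.npy")  # hypothetical file name
    np.save(path, array)
    return path


def row_block_dot(path, start, stop, vector):
    """What a worker would do: map the file, then touch only its own rows."""
    matrix = np.load(path, mmap_mode="r")  # pages are loaded on demand
    return matrix[start:stop] @ vector
```

The parent would call save_for_sharing once, then submit (path, start, stop, vector) tuples to the executor, so each task carries a short string and two integers instead of a block of the matrix.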

If you are using modern versions of NumPy and your OS, it is most likely that
    self._vector = np.dot(matrix, self._vector)
is already optimized and uses all your CPU cores.
If np.show_config() displays OpenBLAS or MKL, you can run a simple test:
    import numpy as np
    a = np.random.rand(7000, 7000)
    b = np.random.rand(7000, 7000)
    np.dot(a, b)
It should use all CPU cores for a couple of seconds.
If it doesn't, you can install OpenBLAS or MKL and reinstall NumPy. See Using MKL to boost Numpy performance on Ubuntu

Threading vs Multiprocessing

Suppose I have a table with 100000 rows and a Python script that performs some operations on each row of this table sequentially. To speed this up, should I create 10 separate scripts that each process a consecutive block of 10000 rows and run them simultaneously, or should I create 10 threads to process the rows?

Threading
Due to the Global Interpreter Lock, Python threads are not truly parallel. In other words, only a single thread can be executing Python bytecode at a time.
If you are performing CPU-bound tasks in pure Python, then dividing the workload amongst threads will not speed up your computations. If anything, it will slow them down because there are more threads for the interpreter to switch between.
Threading is much more useful for IO-bound tasks, for example if you are communicating with a number of different clients/servers at the same time. In this case you can switch between threads while you are waiting for different clients/servers to respond.
Multiprocessing
As Eman Hamed has pointed out, it can be difficult to share objects while multiprocessing.
Vectorization
Libraries like pandas allow you to use vectorized methods on tables. These are highly optimized operations written in C that execute very fast on an entire table or column. Depending on the structure of your table and the operations that you want to perform, you should consider taking advantage of this.
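To make the vectorization point concrete, here is a small sketch using NumPy columns instead of a pandas table. The column names and the per-row operation (price times quantity) are hypothetical. Both functions compute the same values, but the second one does the whole column in a single C-level operation.

```python
import numpy as np


def totals_row_by_row(prices, quantities):
    """Process the table one row at a time in a Python loop."""
    totals = []
    for price, quantity in zip(prices, quantities):
        totals.append(price * quantity)
    return np.array(totals)


def totals_vectorized(prices, quantities):
    """Process the whole column at once."""
    return prices * quantities


rng = np.random.default_rng(0)
prices = rng.random(100_000)
quantities = rng.integers(1, 10, size=100_000)
```

Timing the two functions on these 100000 rows is an easy way to see whether threads or processes are needed at all.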

Threads of a process share a contiguous (virtual) memory block known as the heap; separate processes don't. Threads also consume fewer OS resources than whole processes (separate scripts), and switching between them does not require switching address spaces.
The single biggest performance factor in multithreaded execution when there is no locking or barriers involved is data access locality, e.g. in matrix multiplication kernels.
Suppose data is stored in the heap in a linear fashion, i.e. the 0th row in bytes [0-4095], the 1st row in bytes [4096-8191], etc. Then thread 0 should operate on rows 0, 10, 20, ..., thread 1 on rows 1, 11, 21, ..., etc.
The main idea is to have a set of 4K pages kept in physical RAM and 64-byte blocks kept in L3 cache, and to operate on them repeatedly. Computers usually assume that if you use a particular memory location then you are also going to use adjacent ones, and you should do your best to do so in your program. The worst-case scenario is accessing memory locations that are ~10 MiB apart in a random fashion, so don't do that. E.g. if a single row is 1310720 doubles (64 bits each, 10 MiB in total) in size, then your threads should operate in an intra-row (single row) rather than inter-row (above) fashion.
Benchmark your code. If your algorithm can process rows at around 21.3 GiB/s (DDR3-2666), you have a memory-bound task. If your code processes around 1 GiB/s, you have a compute-bound task, meaning that executing instructions on the data takes more time than fetching the data from RAM, and you need to either optimize your code, reach higher IPC by utilizing AVX instruction sets, or buy a newer processor with more cores or a higher frequency.
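The benchmark suggested above can be sketched as a small helper that reports how many GiB per second a kernel streams through. The function name and the choice of np.sum as the kernel are hypothetical; the number to compare against is the memory bandwidth of your own machine.

```python
import time

import numpy as np


def throughput_gib_per_s(kernel, data, repeats=5):
    """Best-of-N throughput of kernel(data) in GiB/s, based on data.nbytes."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        kernel(data)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading from a coarse timer.
    return data.nbytes / max(best, 1e-12) / 2**30


# One row of 1310720 doubles occupies 1310720 * 8 bytes = 10 MiB.
row = np.ones(1_310_720, dtype=np.float64)
rate = throughput_gib_per_s(np.sum, row)
```

A rate close to the RAM bandwidth suggests a memory-bound kernel, while a much lower rate points to a compute-bound one. Note that a 10 MiB row may fit in a large L3 cache, in which case the measured rate reflects cache bandwidth rather than RAM bandwidth.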

Getting Dask map_blocks to make use of all available resources

I am using Dask to parallelize time series satellite imagery analysis on a cluster with a substantial amount of computational resources.
I have set up a distributed scheduler with many workers (--nprocs = 56) each managing one thread (--nthreads = 1) and 4GB of memory due to the embarrassingly parallel nature of the work.
My data comes in as an xarray that is chunked into a dask array and map_blocks is used to map a function across each chunk in order to generate an output array that will be saved to an image file.
data = inputArray.chunk(chunks={'y':1})
client.persist(data)
future = data.data.map_blocks(timeSeriesTrends.timeSeriesTrends, jd, drop_axis=[1])
future = client.persist(future)
dask.distributed.wait(future)
outputArray = future.compute()
My problem is that Dask does not make use of all the resources I have allocated to it. Instead it begins with very few parallelized tasks and slowly adds more as processes finish without ever reaching capacity.
This dramatically restricts the capabilities of the hardware I have access to as many of my resources spend most of their time sitting idle.
Is my approach appropriate for generating an output array from an input array? How can I best make use of the hardware I have access to in this situation?

faster numpy array copy; multi-threaded memcpy?

Suppose we have two large numpy arrays of the same data type and shape, of size on the order of GB's. What is the fastest way to copy all the values from one into the other?
When I do this using normal notation, e.g. A[:] = B, I see exactly one core on the computer at maximum effort doing the copy for several seconds, while the others are idle. When I launch multiple workers using multiprocessing and have them each copy a distinct slice into the destination array, such that all the data is copied, using multiple workers is faster. This is true regardless of whether the destination array is a shared memory array or one that becomes local to the worker. I can get a 5-10x speedup in some tests on a machine with many cores. As I add more workers, the speed does eventually level off and even slow down, so I think this achieves being memory-performance bound.
I'm not suggesting using multiprocessing for this problem; it was merely to demonstrate the possibility of better hardware utilization.
Does there exist a python interface to some multi-threaded C/C++ memcpy tool?
Update (03 May 2017)
When it is possible, using multiple python processes to move data can give major speedup. I have a scenario in which I already have several small shared memory buffers getting written to by worker processes. Whenever one fills up, the master process collects this data and copies it into a master buffer. But it is much faster to have the master only select the location in the master buffer, and assign a recording worker to actually do the copying (from a large set of recording processes standing by). On my particular computer, several GB can be moved in a small fraction of a second by concurrent workers, as opposed to several seconds by a single process.
Still, this sort of setup is not always (or even usually?) possible, so it would be great to have a single python process able to drop into a multi-threaded memcpy routine...

If you are certain that the types and memory layout of both arrays are identical, this might give you a speedup: memoryview(A)[:] = memoryview(B). This should use memcpy directly and skip any checks for NumPy broadcasting or type conversion rules.
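A hedged sketch of this idea follows. The helper name raw_copy is hypothetical, and it assumes both arrays are C-contiguous with the same dtype and shape. Because memoryview slice assignment is restricted to one-dimensional buffers, the arrays are first flattened with reshape(-1), which returns a view (no copy) for contiguous arrays. This is still a single-threaded copy, so it avoids NumPy's dispatch overhead rather than using more cores.

```python
import numpy as np


def raw_copy(dst, src):
    """Copy src into dst through the buffer protocol."""
    if dst.dtype != src.dtype or dst.shape != src.shape:
        raise ValueError("arrays must have identical dtype and shape")
    if not (dst.flags.c_contiguous and src.flags.c_contiguous):
        raise ValueError("arrays must be C-contiguous")
    # memoryview slice assignment only supports 1-D buffers, so use flat views.
    memoryview(dst.reshape(-1))[:] = memoryview(src.reshape(-1))


A = np.zeros((100, 50))
B = np.random.default_rng(0).random((100, 50))
raw_copy(A, B)
```

It is worth timing this against A[:] = B and np.copyto(A, B) on your own data, since the difference may be small for large arrays.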

How to specify number of workers in Dask.array

Suppose that you want to specify the number of workers in Dask.array. As the Dask documentation shows, you can set:
    dask.set_options(pool=ThreadPool(num_workers))
This works pretty well with some simulations I've run, for example Monte Carlo ones, but with some linear algebra operations it seems that Dask overrides the user-specified configuration, for example:
    import dask.array as da
    import dask
    from multiprocessing.pool import ThreadPool
    dask.set_options(pool=ThreadPool(num_workers))
    mat1 = da.random.random((size, size), chunks=chunk_size)
    mat2 = da.random.random((size, size), chunks=chunk_size)
    mat3 = mat1.dot(mat2)
    mat3.compute()
If I run that program with a small matrix size, it apparently uses only num_workers workers, but if I increase the matrix size, it suddenly creates dozens of workers, as the image shows.
So, how can I request Dask to solve the problem using only num_workers workers?

When using the threaded scheduler, Dask doesn't spawn any new processes. Instead it runs everything within your main process.
However, this doesn't stop your functions from spawning processes or threads themselves. As Mike Graham points out in the comments, you should be careful about mixing parallel solutions like Dask and a parallel BLAS implementation like MKL or OpenBLAS. This can damage performance. It is often best to set one of the two libraries to use a single thread per call.
I am still confused why you're seeing multiple python processes. To the best of my knowledge neither threaded Dask nor MKL create new processes for computation. However given your positive results from limiting the number of MKL threads perhaps MKL has changed since I last checked in with it.
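One common way to restrict the BLAS side to a single thread per call is to set the thread-count environment variables before NumPy is first imported in the process. This is a sketch under the assumption that the installed BLAS honours one of these variables; which one applies depends on the build (OpenMP-based, OpenBLAS or MKL).

```python
import os

# Must run before NumPy (and its BLAS library) is first imported.
for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
    os.environ[var] = "1"

import numpy as np  # noqa: E402  (imported late on purpose)

# With BLAS limited to one thread, the outer scheduler decides the parallelism.
a = np.random.rand(500, 500)
b = a @ a
```

With this in place, the number of busy cores should be governed by the thread pool handed to Dask rather than by the BLAS library.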
