Suppose you want to specify the number of workers in dask.array. As the Dask documentation shows, you can set:
dask.set_options(pool=ThreadPool(num_workers))
This works pretty well with some simulations I've run, for example Monte Carlo simulations, but with some linear algebra operations it seems that Dask overrides the user-specified configuration. For example:
import dask.array as da
import dask
from multiprocessing.pool import ThreadPool
dask.set_options(pool=ThreadPool(num_workers))
mat1 = da.random.random((size, size), chunks=chunk_size)
mat2 = da.random.random((size, size), chunks=chunk_size)
mat3 = mat1.dot(mat2)
mat3.compute()
If I run that program with a small matrix size, it apparently uses only num_workers workers, but if I increase the matrix size, it suddenly creates dozens of workers, as the image shows.
So, how can I request Dask to solve the problem using only num_workers workers?
When using the threaded scheduler, Dask doesn't spawn any new processes. Instead it runs everything within your main process.
However, this doesn't stop your functions from spawning processes themselves. As Mike Graham points out in the comments, you should be careful about mixing parallel solutions like Dask with a parallel BLAS implementation such as MKL or OpenBLAS. The resulting oversubscription can hurt performance. It is often best to set one of the two libraries to use a single thread per call.
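If you go the route of pinning the BLAS library, a common approach is to set its thread-count environment variable before NumPy is imported; which variable takes effect depends on whether your NumPy is linked against MKL, OpenBLAS, or another BLAS, so a sketch covering the usual cases looks like this:

import os

# Must be set before numpy/dask.array are imported; only the variable that
# matches your BLAS build actually takes effect.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"

import dask
import dask.array as da
from multiprocessing.pool import ThreadPool

dask.set_options(pool=ThreadPool(4))  # keep the parallelism at the Dask level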
I am still confused about why you're seeing multiple Python processes. To the best of my knowledge, neither threaded Dask nor MKL creates new processes for computation. However, given your positive results from limiting the number of MKL threads, perhaps MKL has changed since I last checked in with it.
I have a function f that takes as input a variable x, which is a large np.ndarray (length 20000).
Execution of f takes very little time (about 5 ms).
A for loop over a matrix M with many rows
for x in M:
    f(x)
takes about 5 times longer than parallelizing using multiprocessing
import multiprocessing
with multiprocessing.Pool() as pool:
    pool.map(f, M)
I have tried to parallelize with Dask, but it loses even against sequential execution. The related post is here, but the accepted answer doesn't work for me. I have tried many things, like partitioning the data as the best practices suggest or using dask.bag. I'm running Dask on a local machine with 4 physical cores.
So the question is: how do I use Dask with short tasks that take large data as input?
Firstly, the dask documentation makes clear the following contraindications:
it is a bad idea to create big data in the client and pass it to workers; you should have workers load the data they need
if the data you need fit into memory, the standard python tool (in this case numpy) probably works as well or better than dask
if you want to share memory and are running processes such as numpy that release the GIL, then you should prefer threads over processes.
dask multiprocessing should not generally be used if you can run distributed (i.e., always)
don't use a python loop over an array, you should vectorize
Since we don't know much about what you are doing or your system, I will offer a guess as to why Dask is slower than multiprocessing. When you use multiprocessing.Pool, the system probably creates the worker processes via fork and copies (or copy-on-write duplicates) the array into each process, so they can access it. Dask requires threads and event loops to run, so it is not safe to use with fork. This means that when you want data in the client to be processed in a worker, it must be serialised and sent over IPC. This is very likely the cause of your slowdown.
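To illustrate the threads-over-processes advice above, here is a minimal sketch; the array M and the function f are hypothetical stand-ins for the question's data, and the speedup only materialises if f spends its time in NumPy code that releases the GIL:

import numpy as np
import dask.array as da

M = np.random.random((1000, 20000))   # stand-in for the real matrix
def f(x):
    return x.sum()                    # stand-in for the real ~5 ms function

# Chunk M by blocks of rows and apply f to each row inside a block, using the
# threaded scheduler so no data is serialised or copied between processes.
dM = da.from_array(M, chunks=(250, M.shape[1]))
per_row = dM.map_blocks(lambda block: np.apply_along_axis(f, 1, block),
                        drop_axis=1, dtype=float)
out = per_row.compute(scheduler="threads")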
I have a project in Python that requires regressing many variables against many others. I am using a Jupyter Notebook for clarity but am also willing to use another IDE if it's easier. My code looks something like:
for a in dependent_variables:
    for b in independent_variables:
        regress a on b
My current dataset isn't huge, so this whole thing takes maybe 30 seconds, but I will soon have a much larger dataset that will significantly increase time required. I'm curious if this is a situation suitable for parallelization. Specifically, if I have a dual-threaded eight-core processor (meaning 16 CPUs total), is it possible to run simultaneous processes where each process regresses one of the first variables against one of the second variables, allowing me to complete, say, eight of these regressions at a time (if I allocate half of the CPUs to this process)? I am not super familiar with parallelization and most other answers I've found have discussed the parallelization of a single function call, not the simultaneous execution of multiple similar functions. I appreciate the help!
Nominally, this is
import itertools
import multiprocessing as mp

def regress_me(vars):
    ind_var, dep_var = vars
    # your regression may be better than mine...
    result = "{} {}".format(ind_var, dep_var)
    return result

if __name__ == "__main__":
    with mp.Pool(8) as pool:
        analyse_this = list(itertools.product(independent_variables,
                                              dependent_variables))
        result = pool.map(regress_me, analyse_this)
A lot depends on what is being passed between parent and child, and on whether you are using a forking system like Linux or a spawning system like Windows. If the datasets are being fetched from disk, it may be better to do the read in the worker regress_me instead of passing the data from the parent. You can read up on that in the standard Python multiprocessing documentation.
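As a hedged sketch of that read-in-the-worker idea (the file layout, variable names, and the plain least-squares fit are all assumptions, not part of the question):

import itertools
import multiprocessing as mp
import numpy as np

def regress_me(pair):
    ind_var, dep_var = pair
    # Hypothetical layout: one .npy file per variable, loaded by the worker
    # itself so the parent never has to pickle and ship the arrays.
    x = np.load("data/{}.npy".format(ind_var))
    y = np.load("data/{}.npy".format(dep_var))
    slope, intercept = np.polyfit(x, y, 1)   # simple least-squares line
    return ind_var, dep_var, slope, intercept

if __name__ == "__main__":
    independent_variables = ["x1", "x2"]     # hypothetical variable names
    dependent_variables = ["y1", "y2"]
    pairs = list(itertools.product(independent_variables, dependent_variables))
    with mp.Pool(8) as pool:
        results = pool.map(regress_me, pairs)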
My Python code contains a NumPy dot operation on a huge matrix and vector (size over 2^(tens...)).
To reduce the computing time, I applied parallel processing by dividing the matrix according to the number of CPU cores.
I used concurrent.futures.ProcessPoolExecutor.
My issue is that the parallel processing takes much more time than single processing.
The following is my code.
Single-process code:
self._vector = np.dot(matrix, self._vector)
Parallel-processing code:
each_worker_coverage = int(self._dimension/self.workers)
args = []
for i in range(self.workers):
    if (i+1)*each_worker_coverage < self._dimension:
        arg = [i, gate[i*each_worker_coverage:(i+1)*each_worker_coverage], self._vector]
    else:
        arg = [i, gate[i*each_worker_coverage:self._dimension], self._vector]
    args.append(arg)

pool = futures.ProcessPoolExecutor(self.workers)
results = list(pool.map(innerproduct, args, chunksize=1))

for result in results:
    if (result[0]+1)*each_worker_coverage < self._dimension:
        self._vector[result[0]*each_worker_coverage:(result[0]+1)*each_worker_coverage] = result[1]
    else:
        self._vector[result[0]*each_worker_coverage:self._dimension] = result[1]
The `innerproduct` function called in parallel is as follows:
def innerproduct(args):
    answer = np.dot(args[1], args[2])
    return args[0], answer
For a 2^14 x 2^14 matrix and a 2^14 vector, the single process code takes only 0.05 seconds, but the parallel processing code takes 6.2 seconds.
I also checked the time with the `innerproduct` method, and it only takes 0.02~0.03 sec.
I don't understand this situation.
Why does the parallel processing (multiprocessing, not multithreading) take more time?
To know the exact cause of the slowdown, you would have to measure how long everything takes, and with multiprocessing and multithreading that can be tricky.
So what follows is my best guess. For multiprocessing to work, the parent process has to transfer the data used in the calculations to the worker processes. The time this takes depends on the amount of data. Transferring a 2^14 x 2^14 matrix is probably going to take a significant amount of time.
I suspect that this data transfer is what is taking the extra time.
If you are using an operating system that uses the fork start method for multiprocessing/concurrent.futures, there is a way around this data transfer. Such operating systems include Linux and the BSDs, but not macOS or MS Windows.
On the abovementioned operating systems, multiprocessing uses the fork system call to create its workers. This system call creates a copy of the parent process as the child process. So if you create the vectors and matrices before creating the ProcessPoolExecutor, the workers will inherit that data. This is not a very costly or time-consuming operation, because all of these operating systems use copy-on-write for managing memory pages. As long as the original matrix isn't changed, all programs using it are reading from the same memory pages. This inheriting of the data means you don't have to pass the data explicitly to the worker. You just have to pass a small data structure that says which index ranges a worker has to operate on.
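A minimal sketch of that approach, assuming a fork-based platform; the module-level arrays, sizes, and names are illustrative stand-ins for the question's attributes:

import numpy as np
from concurrent import futures

# Created BEFORE the executor, so forked workers inherit them copy-on-write.
MATRIX = np.random.rand(4096, 4096)
VECTOR = np.random.rand(4096)

def partial_dot(bounds):
    # The worker only receives a small (start, stop) tuple; the big arrays
    # are read from the inherited memory pages.
    start, stop = bounds
    return start, np.dot(MATRIX[start:stop], VECTOR)

if __name__ == "__main__":
    workers = 4
    step = MATRIX.shape[0] // workers
    ranges = [(i * step, MATRIX.shape[0] if i == workers - 1 else (i + 1) * step)
              for i in range(workers)]
    result = np.empty_like(VECTOR)
    with futures.ProcessPoolExecutor(workers) as pool:
        for start, chunk in pool.map(partial_dot, ranges):
            result[start:start + len(chunk)] = chunk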
Unfortunately, due to technical limitations of the platform, this doesn't work on macOS and MS Windows. What you could do on those systems is store the original matrix and vector in memory-mapped binary files before you create the Executor. If you tag these mappings with a name (i.e. a filename), the worker processes should be able to map the same data into their memory without having to transfer it. I think it is possible to instruct numpy to use such a raw binary array without recreating it.
On both platforms you could use the same technique to "send data back" to the parent process; save the data in shared memory and return the filename or tagname to the parent process.
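A sketch of that memory-mapped variant using np.memmap; the filenames and sizes are illustrative, and the parent writes the arrays once while each worker re-maps the same files read-only:

import numpy as np
from concurrent import futures

N = 4096  # illustrative size

def partial_dot(bounds):
    start, stop = bounds
    # Each worker re-maps the same files; nothing is pickled between processes.
    matrix = np.memmap("matrix.dat", dtype="float64", mode="r", shape=(N, N))
    vector = np.memmap("vector.dat", dtype="float64", mode="r", shape=(N,))
    return start, matrix[start:stop] @ vector

if __name__ == "__main__":
    # The parent writes the data to file-backed shared memory once.
    matrix = np.memmap("matrix.dat", dtype="float64", mode="w+", shape=(N, N))
    vector = np.memmap("vector.dat", dtype="float64", mode="w+", shape=(N,))
    matrix[:] = np.random.rand(N, N)
    vector[:] = np.random.rand(N)
    matrix.flush()
    vector.flush()

    workers = 4
    step = N // workers
    ranges = [(i * step, (i + 1) * step) for i in range(workers)]
    result = np.empty(N)
    with futures.ProcessPoolExecutor(workers) as pool:
        for start, chunk in pool.map(partial_dot, ranges):
            result[start:start + len(chunk)] = chunk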
If you are using a modern version of NumPy on a modern OS, it is most likely that
self._vector = np.dot(matrix, self._vector)
is already optimized and uses all your CPU cores.
If np.show_config() shows OpenBLAS or MKL, you can run a simple test:
a = np.random.rand(7000, 7000)
b = np.random.rand(7000, 7000)
np.dot(a, b)
It should use all CPU cores for a couple of seconds.
If it doesn't, you can install OpenBLAS or MKL and reinstall NumPy. See Using MKL to boost Numpy performance on Ubuntu.
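To confirm which BLAS NumPy ends up linked against and how many threads it is allowed to use, the third-party threadpoolctl package (an extra dependency, not mentioned above) can report it:

import numpy as np  # importing numpy loads its BLAS so it shows up below
from threadpoolctl import threadpool_info  # pip install threadpoolctl

# Lists the BLAS/OpenMP libraries loaded by NumPy together with their
# current thread counts, e.g. an 'openblas' or 'mkl' entry.
for lib in threadpool_info():
    print(lib["internal_api"], lib["num_threads"])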
We have a Dask compute graph (quite custom, so we use dask delayed instead of collections). I've read in the docs that the current scheduling policy is LIFO, so that a worker process has a good chance of getting the data it has just computed for further steps down the graph. But as far as I understand, task computation results are still (de)serialized to the hard drive even in this case.
So the question is: how much performance gain would I get by trying to keep as few tasks as possible along a single path of independent computations in the graph:
A) many small "map" tasks along each path
t --> t --> t -->...
some reduce stage
t --> t --> t -->...
B) one huge "map" task for each path
T ->
some reduce stage
T ->
Thank you!
The dask multiprocessing scheduler will automatically fuse linear chains of tasks into single tasks, so your case A above will automatically become case B.
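For illustration, here is a small sketch of that linear fusion on a hand-built delayed chain; the inc function is hypothetical, and dask.optimization.fuse is essentially the pass the multiprocessing scheduler applies before executing the graph:

import dask
from dask.optimization import fuse

# A hypothetical linear chain of three delayed tasks (case A above).
inc = dask.delayed(lambda x: x + 1)
chain = inc(inc(inc(1)))

# fuse() collapses linear chains, so the chain executes as (roughly) one task.
dsk = dict(chain.__dask_graph__())
fused, dependencies = fuse(dsk)
print(len(dsk), "tasks before fusion,", len(fused), "after")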
If your workloads are more complex and do require inter-node communication then you might want to try the distributed scheduler on a single computer. It manages data movement between workers more intelligently.
$ pip install dask distributed
>>> from dask.distributed import Client
>>> c = Client() # Starts local "cluster". Becomes the global scheduler
Further reading
http://dask.pydata.org/en/latest/scheduler-choice.html
http://dask.pydata.org/en/latest/optimize.html
Correction
Also, just as a note, Dask doesn't persist intermediate results on disk. Rather it communicates intermediate results directly between processes.
For a class project I am writing a simple matrix multiplier in Python. My professor has asked for it to be threaded. The way I handle this right now is to create a thread for every row and throw the result in another matrix.
What I wanted to know is whether it would be faster if, instead of creating a thread for each row, I created a fixed number of threads that each handle several rows.
For example: given Matrix1 100x100 * Matrix2 100x100 (matrix sizes can vary widely):
4 threads each handling 25 rows
10 threads each handling 10 rows
Maybe this is a matter of fine-tuning, or maybe creating a thread per row, despite the overhead, is still faster than the distribution mechanism above.
You will probably get the best performance if you use one thread for each CPU core available to the machine running your application. You won't get any performance benefit by running more threads than you have processors.
If you are planning to spawn new threads each time you perform a matrix multiplication then there is very little hope of your multi-threaded app ever outperforming the single-threaded version unless you are multiplying really huge matrices. The overhead involved in thread creation is just too high relative to the time required to multiply matrices. However, you could get a significant performance boost if you spawn all the worker threads once when your process starts and then reuse them over and over again to perform many matrix multiplications.
For each pair of matrices you want to multiply you will want to load the multiplicand and multiplier matrices into memory once and then allow all of your worker threads to access the memory simultaneously. This should be safe because those matrices will not be changing during the multiplication.
You should also be able to allow all the worker threads to write their output simultaneously into the same output matrix because (due to the nature of matrix multiplication) each thread will end up writing its output to different elements of the matrix and there will not be any contention.
I think you should distribute the rows between threads by maintaining an integer NextRowToProcess that is shared by all of the threads. Whenever a thread is ready to process another row it calls InterlockedIncrement (or whatever atomic increment operation you have available on your platform) to safely get the next row to process.
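A minimal Python sketch of that shared-counter scheme, using a threading.Lock as the stand-in for InterlockedIncrement (sizes and names are illustrative):

import threading
import numpy as np

def worker(A, B, C, counter, lock):
    n_rows = A.shape[0]
    while True:
        with lock:                    # atomic "get next row" step
            row = counter[0]
            counter[0] += 1
        if row >= n_rows:
            break
        C[row, :] = A[row, :] @ B     # each thread writes a distinct row

A = np.random.rand(100, 100)
B = np.random.rand(100, 100)
C = np.empty((100, 100))
counter, lock = [0], threading.Lock()
threads = [threading.Thread(target=worker, args=(A, B, C, counter, lock))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()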
A CPU-bound task will never be faster in Python in multi-threaded mode. Due to the Global Interpreter Lock, only one thread can execute at a time (unless you write a C extension and release the lock explicitly).
This applies to the standard CPython implementation as well as PyPy. In Jython, try using one thread per core; more does not make sense.
Please also check out the great GIL overview by David Beazley.
On the other hand, if your professor does not mind, you can use multiprocessing.
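For example, a hedged sketch of the same row-wise split done with multiprocessing instead of threads (block count and sizes are illustrative):

import multiprocessing as mp
import numpy as np

def multiply_rows(args):
    rows, B = args
    return rows @ B                # multiply a block of rows of A by B

if __name__ == "__main__":
    A = np.random.rand(100, 100)
    B = np.random.rand(100, 100)
    blocks = np.array_split(A, 4)  # one block of rows per worker
    with mp.Pool(4) as pool:
        C = np.vstack(pool.map(multiply_rows, [(blk, B) for blk in blocks]))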