Multithreading regressions in Python

I have a project in Python that requires regressing many variables against many others. I am using a Jupyter Notebook for clarity but am also willing to use another IDE if it's easier. My code looks something like:
for a in dependent_variables:
    for b in independent_variables:
        regress a on b
My current dataset isn't huge, so the whole thing takes maybe 30 seconds, but I will soon have a much larger dataset that will significantly increase the time required. I'm curious whether this is a situation suited to parallelization. Specifically, if I have a dual-threaded eight-core processor (16 logical CPUs total), is it possible to run simultaneous processes where each process regresses one of the first variables against one of the second variables, allowing me to complete, say, eight of these regressions at a time (if I allocate half of the CPUs to this)? I'm not very familiar with parallelization, and most other answers I've found discuss parallelizing a single function call, not the simultaneous execution of multiple similar functions. I appreciate the help!

Nominally, this is a job for multiprocessing.Pool:
import itertools
import multiprocessing as mp

def regress_me(pair):
    ind_var, dep_var = pair
    # your regression may be better than mine...
    result = "{} {}".format(ind_var, dep_var)
    return result

if __name__ == "__main__":
    with mp.Pool(8) as pool:
        analyse_this = list(itertools.product(independent_variables,
                                              dependent_variables))
        result = pool.map(regress_me, analyse_this)
A lot depends on what is being passed between parent and child, and on whether you are on a forking system like Linux or a spawning system like Windows. If the datasets are being fetched from disk, it may be better to do the read inside the worker regress_me instead of passing the data from the parent. You can read up on that in the standard Python multiprocessing documentation.
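For example, if the variables live in a CSV on disk, a minimal sketch of that idea might look like the following; the file name, column lists, and the pandas/statsmodels calls are assumptions for illustration, not something from the question:

import itertools
import multiprocessing as mp

import pandas as pd
import statsmodels.api as sm

DATA_PATH = "data.csv"                       # hypothetical file, one column per variable
independent_variables = ["x1", "x2", "x3"]   # hypothetical column names
dependent_variables = ["y1", "y2"]

def regress_me(pair):
    ind_var, dep_var = pair
    # each worker reads only the two columns it needs, so no large
    # arrays are pickled and shipped from the parent process
    df = pd.read_csv(DATA_PATH, usecols=[ind_var, dep_var])
    fit = sm.OLS(df[dep_var], sm.add_constant(df[ind_var])).fit()
    return ind_var, dep_var, fit.params.to_dict()

if __name__ == "__main__":
    pairs = list(itertools.product(independent_variables, dependent_variables))
    with mp.Pool(8) as pool:
        results = pool.map(regress_me, pairs)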

Related

How to maximize the parallelization efficiency using Python ray ActorPool?

I recently started using ray ActorPool to parallelize my Python code on my local computer (using the code below), and it's definitely working. Specifically, I used it to process a list of arguments and return a list of results (note that depending on the inputs, the "process" function can take different amounts of time).
However, while testing the script, it seems the processes are somewhat "blocking" each other: if one task takes a long time, the other cores seem to sit more or less idle. It's definitely not complete blocking, since running this way still saves a lot of time compared to a single core, but many of the processors stay idle (more than half the cores at under 20% usage) even though I'm running the script on all 16 cores. This is especially noticeable when there is one long task, in which case only one or two cores are actually active. Also, the total time saved is nowhere near 16x.
pool = ActorPool(actors)
poolmap = pool.map(
    lambda a, v: a.process.remote(v),
    args,
)
result_list = [a for a in tqdm(poolmap, total=length)]
I suspect this is because the way I collect the result values (last line) is not optimal, but I'm not sure how to make it better. Could you help me improve it?
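One thing that may help is ray's ActorPool.map_unordered, which yields each result as soon as it is ready instead of in submission order, so iterating over the results doesn't stall behind one slow task. A minimal sketch, reusing the actors, args, and process method from the question (results arrive out of order, so return an index from process if ordering matters):

from ray.util import ActorPool
from tqdm import tqdm

pool = ActorPool(actors)

# yields results in completion order rather than submission order
poolmap = pool.map_unordered(
    lambda a, v: a.process.remote(v),
    args,
)
result_list = [r for r in tqdm(poolmap, total=len(args))]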

Dask parallelize a short task that uses a large np.ndarray

I have a function f that takes as input a variable x, which is a large np.ndarray (length 20000).
Execution of f takes very little time (about 5 ms).
A for loop over a matrix M with many rows
for x in M:
    f(x)
takes about 5 times longer than parallelizing with multiprocessing:
import multiprocessing

with multiprocessing.Pool() as pool:
    pool.map(f, M)
I have tried to parallelize with dask, but it loses even against sequential execution. A related post is here, but the accepted answer doesn't work for me. I have tried many things, such as partitioning the data as the best practices suggest, or using dask.bag. I'm running Dask on a local machine with 4 physical cores.
So the question is: how do I use dask with short tasks that take large data as input?
Firstly, the dask documentation makes clear the following contraindications:
it is a bad idea to create big data in the client and pass it to workers; you should have workers load the data they need
if the data you need fit into memory, the standard python tool (in this case numpy) probably works as well or better than dask
if you want to share memory and are running code such as numpy that releases the GIL, then you should prefer threads over processes.
dask multiprocessing should not generally be used if you can run distributed (i.e., always)
don't use a Python loop over an array; you should vectorize
Since we don't know much about what you are doing or your system, I will offer a guess as to why dask is slower than multiprocessing. When you use multiprocessing.Pool, the system probably creates processes via fork and copies (or copy-on-write duplicates) the array into each process, so they can access it. Dask requires threads and event loops to run, so it is not safe to use with fork. This means that when you want data in the client to be processed in a worker, it must be serialised and sent over IPC. This is very likely the cause of your slowdown.
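To make the third point above concrete (numpy releases the GIL, so threads can share the array instead of copying it), here is a minimal sketch with the threaded scheduler; f and M are stand-ins for the function and matrix in the question:

import numpy as np
import dask

def f(x):
    # stand-in for the real ~5 ms computation on one row
    return x.sum()

M = np.random.rand(1000, 20000)

# threads share M in memory, so nothing is serialised per task
tasks = [dask.delayed(f)(row) for row in M]
results = dask.compute(*tasks, scheduler="threads")

With tasks this short, the per-task scheduling overhead is still comparable to the work itself, so batching several rows into each delayed call may help more than any choice of scheduler.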

ProcessPoolExecutor overhead? Parallel processing takes more time than a single process for a large matrix operation

My Python code contains a numpy dot operation on a matrix and vector of huge size (over 2^(tens...)).
To reduce the computing time, I applied parallel processing by dividing the matrix according to the number of CPU cores.
I used concurrent.futures.ProcessPoolExecutor.
My issue is that the parallel processing takes much more time than single processing.
The following is my code.
Single-process code:
self._vector = np.dot(matrix, self._vector)
Parallel-processing code:
each_worker_coverage = int(self._dimension / self.workers)
args = []
for i in range(self.workers):
    if (i+1)*each_worker_coverage < self._dimension:
        arg = [i, gate[i*each_worker_coverage:(i+1)*each_worker_coverage], self._vector]
    else:
        arg = [i, gate[i*each_worker_coverage:self._dimension], self._vector]
    args.append(arg)

pool = futures.ProcessPoolExecutor(self.workers)
results = list(pool.map(innerproduct, args, chunksize=1))

for result in results:
    if (result[0]+1)*each_worker_coverage < self._dimension:
        self._vector[result[0]*each_worker_coverage:(result[0]+1)*each_worker_coverage] = result[1]
    else:
        self._vector[result[0]*each_worker_coverage:self._dimension] = result[1]
The innerproduct function called in parallel is as follows.
def innerproduct(args):
    # args = (worker index, matrix block, vector)
    answer = np.dot(args[1], args[2])
    return args[0], answer
For a 2^14 x 2^14 matrix and a 2^14 vector, the single process code takes only 0.05 seconds, but the parallel processing code takes 6.2 seconds.
I also checked the time with the `innerproduct` method, and it only takes 0.02~0.03 sec.
I don't understand this situation.
Why does the parallel processing (multi-processing not multi-threading) take more time?
To know the exact cause of the slowdown, you would have to measure how long everything takes, and with multiprocessing and multithreading that can be tricky.
So what follows is my best guess. For multiprocessing to work, the parent process has to transfer the data used in the calculations to the worker processes. The time this takes depends on the amount of data. Transferring a 2^14 x 2^14 matrix is probably going to take a significant amount of time.
I suspect that this data transfer is what is taking the extra time.
If you are using an operating system that uses the fork start method for multiprocessing/concurrent.futures, there is a way around this data transfer. Such operating systems include Linux and *BSD, but not macOS or MS Windows.
On the abovementioned operating systems, multiprocessing uses the fork system call to create its workers. This system call creates a copy of the parent process as the child processes. So if you create the vectors and matrices before creating the ProcessPoolExecutor, the workers will inherit that data. This is not a very costly or time consuming operation because all these OS's use copy-on-write for managing memory pages. As long as the original matrix isn't changed, all programs using it are reading from the same memory pages. This inheriting of the data means you don't have to pass the data explicitly to the worker. You just have to pass a small data structure that says on which index ranges a worker has to operate.
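A minimal sketch of that idea on a fork platform; the sizes and chunking are illustrative, and only the index ranges (plus the partial results) cross the process boundary:

import numpy as np
from concurrent.futures import ProcessPoolExecutor

# created before the executor, so forked workers inherit them copy-on-write
MATRIX = np.random.rand(2**12, 2**12)
VECTOR = np.random.rand(2**12)

def innerproduct(bounds):
    lo, hi = bounds
    # only two integers are sent to this worker; the data is inherited
    return lo, np.dot(MATRIX[lo:hi], VECTOR)

if __name__ == "__main__":
    workers = 4
    step = MATRIX.shape[0] // workers
    chunks = [(i * step, MATRIX.shape[0] if i == workers - 1 else (i + 1) * step)
              for i in range(workers)]
    result = np.empty(MATRIX.shape[0])
    with ProcessPoolExecutor(workers) as pool:
        for lo, part in pool.map(innerproduct, chunks):
            result[lo:lo + len(part)] = part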
Unfortunately, due to technical limitations of the platform, this doesn't work on macOS and MS Windows. What you could do on those systems is store the original matrix and vector as memory-mapped binary files before you create the Executor. If you tag these mappings with a name, the worker processes should be able to map the same data into their memory without having to transfer it. I think it is possible to instruct numpy to use such a raw binary array without recreating it.
On both platforms you could use the same technique to "send data back" to the parent process; save the data in shared memory and return the filename or tagname to the parent process.
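On macOS and Windows, a minimal sketch of that idea using multiprocessing.shared_memory (Python 3.8+); the shapes are illustrative, and the first matrix row stands in for the vector so only one shared block is needed:

import numpy as np
from concurrent.futures import ProcessPoolExecutor
from multiprocessing import shared_memory

SHAPE, DTYPE = (4096, 4096), np.float64

def innerproduct(task):
    shm_name, lo, hi = task
    # attach to the block the parent created; nothing large is pickled
    shm = shared_memory.SharedMemory(name=shm_name)
    matrix = np.ndarray(SHAPE, dtype=DTYPE, buffer=shm.buf)
    part = np.dot(matrix[lo:hi], matrix[0])   # matrix[0] stands in for the vector
    shm.close()
    return lo, part

if __name__ == "__main__":
    shm = shared_memory.SharedMemory(create=True,
                                     size=int(np.prod(SHAPE)) * np.dtype(DTYPE).itemsize)
    matrix = np.ndarray(SHAPE, dtype=DTYPE, buffer=shm.buf)
    matrix[:] = np.random.rand(*SHAPE)

    workers, step = 4, SHAPE[0] // 4
    tasks = [(shm.name, i * step, (i + 1) * step) for i in range(workers)]
    with ProcessPoolExecutor(workers) as pool:
        results = dict(pool.map(innerproduct, tasks))

    shm.close()
    shm.unlink()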
If you are using a modern version of NumPy and a modern OS, it's most likely that
self._vector = np.dot(matrix, self._vector)
is already optimized and uses all your CPU cores.
If np.show_config() displays openblas or MKL you may run a simple test:
a = np.random.rand(7000, 7000)
b = np.random.rand(7000, 7000)
np.dot(a, b)
It should use all CPU cores for a couple of seconds.
If it doesn't, you may install OpenBLAS or MKL and reinstall NumPy. See "Using MKL to boost Numpy performance on Ubuntu".

Memory issue in parallel Python

I have a Python script like this:
from modules import functions

a = 1
parameters = par_vals
for i in range(large_number):
    # do lots of stuff dependent on a, plot stuff, save plots as png
When I run this for one value of "a" it takes half an hour and uses only 1 core of my 6-core machine.
I want to run this code for 100 different values of "a"
The question is: How can I parallelize this so I use all cores and try all values of "a"?
My first approach, following an online suggestion, was:
from joblib import Parallel, delayed

def repeat(a):
    from modules import functions
    parameters = par_vals
    for i in range(large_number):
        # do lots of stuff dependent on a, plot stuff, save plots as png

A = list_100_a  # list of 100 different a values
Parallel(n_jobs=6, verbose=0)(delayed(repeat)(a) for a in A)
This successfully used all the cores in my computer, but it was computing all 100 values of a at the same time. After 4 hours, my 64 GB of RAM and 64 GB of swap were saturated and performance dropped drastically.
So I tried to manually queue the function, running it 6 values at a time inside a for loop, but the memory was consumed all the same.
I don't know where the problem is. I guess that somehow the program is holding on to memory it doesn't need.
What can I do to avoid this memory problem?
In summary:
When I run this function for a specific value of "a" everything is ok.
When I run this function in parallel for 6 values of "a" everything is ok.
When I sequentially run this function in parallel, the memory gradually increases until the computer can no longer work.
UPDATE: I found a solution for the memory problem even though I don't understand why.
It appears that changing the matplotlib backend to 'Agg' makes the memory problem go away.
Just add this before any other matplotlib import and you should be fine:
from matplotlib import use
use('Agg')
Here's how I would do it with multiprocessing. I'll use your repeat function to do the work for one value of a.
def repeat(a):
    from modules import functions
    parameters = par_vals
    for i in range(large_number):
        # do lots of stuff dependent on a, plot stuff, save plots as png
Then I'd use multiprocessing.Pool like this:
import multiprocessing
pool = multiprocessing.Pool(processes=6) # Create a pool with 6 workers.
A = list_100_a  # list of 100 different a values
# Use the workers in the pool to call repeat on each value of a in A. We
# throw away the result of calling map, since it looks like the point of calling
# repeat(a) is for the side effects (files created, etc).
pool.map(repeat, A)
# Close the pool so no more jobs can be submitted to it, then wait for
# all workers to exit.
pool.close()
pool.join()
If you wanted the result of calling repeat, you could just do result = pool.map(repeat, A).
I don't think you'll run into any issues, but it's also useful to read the programming guidelines for using multiprocessing.

Python multiprocessing design

I have written an algorithm that takes geospatial data and performs a number of steps. The input data are a shapefile of polygons and covariate rasters for a large raster study area (~150 million pixels). The steps are as follows:
1. Sample points from within polygons of the shapefile
2. For each sampling point, extract values from the covariate rasters
3. Build a predictive model on the sampling points
4. Extract covariates for target grid points
5. Apply the predictive model to the target grid
6. Write predictions to a set of output grids
The whole process needs to be iterated a number of times (say 100) but each iteration currently takes more than an hour when processed in series. For each iteration, the most time-consuming parts are step 4 and 5. Because the target grid is so large, I've been processing it a block (say 1000 rows) at a time.
I have a 6-core CPU with 32 GB RAM, so within each iteration I had a go at using Python's multiprocessing module with a Pool object to process a number of blocks simultaneously (steps 4 and 5) and then write the output (the predictions) to the common set of output grids using a callback function that calls a global output-writing function. This seems to work, but is no faster (actually, it's probably slower) than processing each block in series.
So my question is, is there a more efficient way to do it? I'm interested in the multiprocessing module's Queue class, but I'm not really sure how it works. For example, I'm wondering if it's more efficient to have a queue that carries out steps 4 and 5 then passes the results to another queue that carries out step 6. Or is this even what Queue is for?
Any pointers would be appreciated.
The current state of Python's multiprocessing capabilities is not great for CPU-bound processing. I'm afraid to tell you that there is no way to make it run faster using the multiprocessing module, nor is your use of multiprocessing the problem.
The real problem is that Python is still bound by the rules of the Global Interpreter Lock (GIL) (I highly suggest the slides). There have been some exciting theoretical and experimental advances on working around the GIL. Python 3.2 even contains a new GIL which solves some of the issues, but introduces others.
For now, it is faster to execute many Python processes, each with a single thread, than to attempt to run many threads within one process. This lets you avoid the issue of acquiring the GIL between threads (by effectively having more GILs). It is only beneficial, however, if the IPC overhead between your Python processes doesn't eclipse the benefits of the processing.
Eli Bendersky wrote a decent overview article on his experiences with attempting to make a CPU bound process run faster with multiprocessing.
It is worth noting that PEP 371 had the desire to 'side-step' the GIL with the introduction of the multiprocessing module (previously a non-standard package named pyProcessing). However, the GIL still seems to play too large a role in the Python interpreter to make it work well with CPU-bound algorithms. Many different people have worked on removing/rewriting the GIL, but nothing has gained enough traction to make it into a Python release.
Some of the multiprocessing examples at python.org are not very clear IMO, and it's easy to start off with a flawed design. Here's a simplistic example I made to get me started on a project:
import os, time, random, multiprocessing

def busyfunc(runseconds):
    starttime = int(time.perf_counter())
    while 1:
        for randcount in range(0, 100):
            testnum = random.randint(1, 10000000)
            newnum = testnum / 3.256
        newtime = int(time.perf_counter())
        if newtime - starttime > runseconds:
            return

def main(arg):
    print('arg from init:', arg)
    print("I am " + multiprocessing.current_process().name)
    busyfunc(15)

if __name__ == '__main__':
    p = multiprocessing.Process(name="One", target=main, args=('passed_arg1',))
    p.start()
    p = multiprocessing.Process(name="Two", target=main, args=('passed_arg2',))
    p.start()
    p = multiprocessing.Process(name="Three", target=main, args=('passed_arg3',))
    p.start()
    time.sleep(5)
This should exercise 3 processors for 15 seconds. It should be easy to modify it for more. Maybe this will help to debug your current code and ensure you are really generating multiple independent processes.
If you must share data due to RAM limitations, then I suggest this:
http://docs.python.org/library/multiprocessing.html#sharing-state-between-processes
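For example, a minimal sketch of the shared-state approach from that page, where each worker fills a disjoint slice of one multiprocessing.Array instead of returning a large result (the fill itself is a placeholder):

import multiprocessing as mp
import numpy as np

def fill_block(shared_arr, start, stop):
    # wrap the shared buffer as a numpy array; no data is copied
    grid = np.frombuffer(shared_arr.get_obj(), dtype=np.float64)
    grid[start:stop] = np.arange(start, stop)   # placeholder for real predictions

if __name__ == "__main__":
    n, workers = 1_000_000, 4
    # one shared block of doubles visible to every worker process
    shared_arr = mp.Array('d', n)

    step = n // workers
    jobs = [mp.Process(target=fill_block, args=(shared_arr, i * step, (i + 1) * step))
            for i in range(workers)]
    for p in jobs:
        p.start()
    for p in jobs:
        p.join()

    result = np.frombuffer(shared_arr.get_obj(), dtype=np.float64)
    print(result[:3], result[-3:])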
As Python is not really meant for intensive number-crunching, I typically start converting the time-critical parts of a Python program to C/C++, which speeds things up a lot.
Also, Python's multithreading is not very good. Python keeps using a global lock (the GIL) for all kinds of things, so even when you use the threads that Python offers, things won't get faster. Threads are useful for applications where they typically wait for things like IO.
When making a C module, you can manually release that global lock while processing your data (and then, of course, not access Python values anymore).
It takes some practice to use the C API, but it's clearly structured and much easier to use than, for example, the Java native API.
See 'Extending and Embedding' in the Python documentation.
This way you can write the time-critical parts in C/C++ and the less speed-critical parts, with faster development, in Python...
I recommend you first check which parts of your code take the most time, so you're going to have to profile it. I've used http://packages.python.org/line_profiler/#line-profiler with much success, though it does require Cython.
As for Queues, they're mostly used for sharing data and synchronizing workers, though I've rarely used them. I do use multiprocessing all the time.
I mostly follow the map-reduce philosophy, which is simple and clean, but it has some major overhead, since values have to be packed into dictionaries and copied to each process when applying the map function...
You can try segmenting your file and applying your algorithm to different sets.
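As for what a Queue-based pipeline for the question's steps 4-6 could look like, here is a minimal sketch with placeholder work: workers pull block ids from a task queue (steps 4 and 5) and a single writer process drains a result queue (step 6), so only one process ever touches the output grids:

import multiprocessing as mp

def worker(task_q, result_q):
    # steps 4 and 5: process one block at a time until the sentinel arrives
    for block_id in iter(task_q.get, None):
        prediction = block_id * 2              # placeholder for the real model
        result_q.put((block_id, prediction))

def writer(result_q, n_blocks):
    # step 6: a single writer owns the output, so there are no write conflicts
    for _ in range(n_blocks):
        block_id, prediction = result_q.get()
        print("writing block", block_id, prediction)

if __name__ == "__main__":
    n_blocks, n_workers = 20, 4
    task_q, result_q = mp.Queue(), mp.Queue()

    workers = [mp.Process(target=worker, args=(task_q, result_q))
               for _ in range(n_workers)]
    writer_p = mp.Process(target=writer, args=(result_q, n_blocks))
    for p in workers + [writer_p]:
        p.start()

    for block_id in range(n_blocks):
        task_q.put(block_id)
    for _ in workers:
        task_q.put(None)                       # sentinel: tell each worker to exit

    for p in workers + [writer_p]:
        p.join()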
