Python Multiprocessing Slower than Sequential Programming

Python Multiprocessing Slower than Sequential Programming - python

I looked up for a lot of questions concerning slowness in python multiprocessing, but none of them were able to solve my problem.
Inside my algorithm, I have a for instance from 0 to 2, that runs the most important function of the algorithm (and the most time-consumption one). The 3 iterations of the for instance are independent from each other. So, to take advantage of this feature, I was trying to run my algorithm using parallel processing.
The thing is that when I run my algorithm will parallel processing, the simulation time is higher than the sequential programming. Depending on the input data, my original sequential algorithm can take from ~30ms to ~1500ms to run. Even when I run the ~1500ms cases, the multiprocessing is slower. Is it because the multiprocessing have to deal with really computionally expensive problems to make it worth, or is there something I can do to make it work better for me?
For now I won't post my algorithm because it's really long, but just as an example, what I'm doing is this:
from multiprocessing import Pool
def FUNCTION(A,B,C,f):
R1 = A * B * C * f
R2 = A * B / C * f # The function has several operations, i'm just doing an example here.
return R, S
if __name__ == '__main__':
pool = Pool()
while CP[0] or CP[1] or CP[2] or CPVT[0] or CPVT[1] or CPVT[2]:
f=0
result1 = pool.apply_async(FUNCTION, [A0,B0,C0])
f=1
result2 = pool.apply_async(FUNCTION, [A1,B1,C1])
f=2
result3 = pool.apply_async(FUNCTION, [A2,B2,C2])
[R0,S0] = result1.get(timeout=1)
[R1,S1] = result2.get(timeout=1)
[R2,S2] = result3.get(timeout=1)
Any ideas why is it taking longer then the sequential way to do it, or any solutions to that issue?
Thanks! :)

Related

Parallel computing in Python Similar to MATLAB

I have been using parfor in MATLAB to run parallel for loops for quite some time. I need to do something similar in Python but I cannot find any simple solution. This is my code:
t = list(range(1,3,1))
G = list(range(0,3,2))
results = pandas.DataFrame(columns = ['tau', 'p_value','G','t_i'],index=range(0,len(G)*len(t)))
counter = 0
for iteration_G in list(range(0,len(G))):
for iteration_t in list(range(0,len(t))):
matrix_1,matrix_2 = bunch of code
tau, p_value = scipy.stats.kendalltau(matrix_1, matrix_2)
results['tau'][counter] = tau
results['p_value'][counter] = p_value
results['G'][counter] = G[iteration_G]
results['t_i'][counter] = G[iteration_t]
counter = counter + 1
I would like to use the parfor equivalent in the first loop.

I'm not familiar with parfor, but you can use the joblib package to run functions in parallel.
In this simple example there's a function that prints its argument and we use Parallel to execute it multiple times in parallel with a for-loop
import multiprocessing
from joblib import Parallel, delayed
# function that you want to run in parallel
def foo(i):
print(i)
# define the number of cores (this is how many processes wil run)
num_cores = multiprocessing.cpu_count()
# execute the function in parallel - `return_list` is a list of the results of the function
# in this case it will just be a list of None's
return_list = Parallel(n_jobs=num_cores)(delayed(foo)(i) for i in range(20))
If this doesn't work for what you want to do, you can try to use numba - it might be a bit more difficult to set-up, but in theory with numba you can just add #njit(parallel=True) as a decorator to your function and numba will try to parallelise it for you.

I found a solution using parfor. It is still a bit more complicated than MATLAB's parfor but it's pretty close to what I am used to.
t = list(range(1,16,1))
G = list(range(0,62,2))
for iteration_t in list(range(0,len(t))):
#parfor(list(range(0,len(G))))
def fun(iteration_G):
result = pandas.DataFrame(columns = ['tau', 'p_value'],index=range(0,1))
matrix_1,matrix_2 = bunch of code
tau, p_value = scipy.stats.kendalltau(matrix_1, matrix_2)
result['tau'] = tau
result['p_value'] = p_value
fun = numpy.array([tau,p_value])
return fun

Why is multiprocessing running and the CPU usage drops to 0%, but the results are never released?

I have put together what I thought was some simple script to calculate the Moran's Index for each feature in a matrix (Information_Gains_Matrix), with its corresponding weight matrix (Weights_Matrix). I wanted to use pool.map to go along each feature in the Information_Gains_Matrix. The output is just a vector of Moran's Index's for each feature in Information_Gains_Matrix. See code:
import multiprocessing
from functools import partial
def Feature_Moran_Index(Information_Gains_Matrix,Wij,N):
Feature = Information_Gains_Matrix
X_bar = np.mean(Feature)
if X_bar != 0:
Deviance = Feature - X_bar
Outer_Deviance = np.outer(Deviance,Deviance)
Deviance2 = Deviance * Deviance
Denom = np.sum(Deviance2)
Moran_Index_Score = (N/Wij) * (np.sum((Weights_Matrix * Outer_Deviance))/Denom)
else:
Moran_Index_Score = 0
return Moran_Index_Score
def Parallel_Feature_Moran_Index(Information_Gains_Matrix,Use_Cores):
N = Information_Gains_Matrix.shape[0]
Wij = np.sum(Weights_Matrix)
pool = multiprocessing.Pool(processes = Use_Cores)
Result = pool.map(partial(Feature_Moran_Index, Wij=Wij,N=N), Information_Gains_Matrix)
pool.close()
pool.join()
Moran_Index_Scores = np.asarray(Result)
np.save("Moran_Index_Scores.npy",Moran_Index_Scores)
return Result
if __name__ == '__main__':
Moran_Index_Scores = Parallel_Feature_Moran_Index(Information_Gains_Matrix,Use_Cores=(multiprocessing.cpu_count()-2))
I'll openly admit my understanding of multiprocessing in Python isn't ideal, but I've used it a lot in the past with success. So I really don't understand why this simple script won't work. Help greatly appreciated. Note, I am aware the Weights_Matrix is not in the script. Its loaded in as a global so it doesn't have to be serialised to each core because it's very large. I'm sorry if this irritates anyone, I know some people get quite testy about the use of globals.
Edit: I've successfully used this exact code on smaller data sets on my local Ubuntu shell (Windows laptop). The problem occurs on a very large data set in a Ubuntu virtual machine.

Multicore programming

At the company where I am interning, I was told about the use of multi-core programming and, in view of a project I am developing for my thesis (I'm not from the area but I'm working on something that involves coding).
I want to know if this is possible:I have a defined function that will be repeated 3x for 3 different variables. Is it possible to put the 3 running at the same time in different core (because they don't need each other information)? Because the calculation process is the same for all of them and instead of running 1 variable at a time, I would like to run all 3 at once (performing all the calculations at the same time) and in the end returning the results.
Some part of what I would like to optimize:
for v in [obj2_v1, obj2_v2, obj2_v3]:
distancia_final_v, \
pontos_intersecao_final_v = calculo_vertice( obj1_normal,
obj1_v1,
obj1_v2,
obj1_v3,
obj2_normal,
v,
criterio
)
def calculo_vertice( obj1_normal,
obj1_v1,
obj1_v2,
obj1_v3,
obj2_normal,
obj2_v,
criterio
):
i = 0
distancia_final_v = []
pontos_intersecao_final_v = []
while i < len(obj2_v):
distancia_relevante_v = []
pontos_intersecao_v = []
distancia_inicial = 1000
for x in range(len(obj1_v1)):
planeNormal = np.array( [obj1_normal[x][0],
obj1_normal[x][1],
obj1_normal[x][2]
] )
planePoint = np.array( [ obj1_v1[x][0],
obj1_v1[x][1],
obj1_v1[x][2]
] ) # Any point on the plane
rayDirection = np.array([obj2_normal[i][0],
obj2_normal[i][1],
obj2_normal[i][2]
] ) # Define a ray
rayPoint = np.array([ obj2_v[i][0],
obj2_v[i][1],
obj2_v[i][2]
] ) # Any point along the ray
Psi = Calculos.line_plane_collision( planeNormal,
planePoint,
rayDirection,
rayPoint
)
a = Calculos.area_trianglo_3d( obj1_v1[x][0],
obj1_v1[x][1],
obj1_v1[x][2],
obj1_v2[x][0],
obj1_v2[x][1],
obj1_v2[x][2],
obj1_v3[x][0],
obj1_v3[x][1],
obj1_v3[x][2]
)
b = Calculos.area_trianglo_3d( obj1_v1[x][0],
obj1_v1[x][1],
obj1_v1[x][2],
obj1_v2[x][0],
obj1_v2[x][1],
obj1_v2[x][2],
Psi[0][0],
Psi[0][1],
Psi[0][2]
)
c = Calculos.area_trianglo_3d( obj1_v1[x][0],
obj1_v1[x][1],
obj1_v1[x][2],
obj1_v3[x][0],
obj1_v3[x][1],
obj1_v3[x][2],
Psi[0][0],
Psi[0][1],
Psi[0][2]
)
d = Calculos.area_trianglo_3d( obj1_v2[x][0],
obj1_v2[x][1],
obj1_v2[x][2],
obj1_v3[x][0],
obj1_v3[x][1],
obj1_v3[x][2],
Psi[0][0],
Psi[0][1],
Psi[0][2]
)
if float("{:.5f}".format(a)) == float("{:.5f}".format(b + c + d)):
P1 = Ponto( Psi[0][0], Psi[0][1], Psi[0][2] )
P2 = Ponto( obj2_v[i][0], obj2_v[i][1], obj2_v[i][2] )
distancia = Calculos.distancia_pontos( P1, P2 ) * 10
if distancia < distancia_inicial and distancia < criterio:
distancia_inicial = distancia
distancia_relevante_v = []
distancia_relevante_v.append( distancia_inicial )
pontos_intersecao_v = []
pontos_intersecao_v.append( Psi )
x += 1
distancia_final_v.append( distancia_relevante_v )
pontos_intersecao_final_v.append( pontos_intersecao_v )
i += 1
return distancia_final_v, pontos_intersecao_final_v
In this example of my code, I want to make the same process happen for obj2_v1, obj2_v2, obj2_v3.
Is it possible to make them happen at the same time?
Because I will be using a considerable amount of data and it would probably save me some time of processing.

multiprocessing (using processes to avoid the GIL) is the easiest but you're limited to relatively small performance improvements, number of cores speedup is the limit, see Amdahl's law. there's also a bit of latency involved in starting / stopping work which means it's much better for things that take >10ms
in numeric heavy code (like this seems to be) you really want to be moving as much of the it "inside numpy", look at vectorisation and broadcasting. this can give speedups of >50x (just on a single core) while staying easier to understand and reason about
if your algorithm is difficult to express using numpy intrinsics then you could also look at using Cython. this allows you to write Python code that automatically gets compiled down to C, and hence a lot faster. 50x faster is probably also a reasonable speedup, and this is still running on a single core
the numpy and Cython techniques can be combined with multiprocessing (i.e. using multiple cores) to give code that runs hundreds of times faster than naive implementations
Jupyter notebooks have friendly extensions (known affectionately as "magic") that make it easier to get started with this sort of performance work. the %timeit special allows you to easily time parts of the code, and the Cython extension means you can put everything into the same file

It's possible, but use python multiprocessing lib, because the threading lib doesn't delivery parallel execution.
UPDATE
DON'T do something like that (thanks for #user3666197 for pointing the error) :
from multiprocessing.pool import ThreadPool
def calculo_vertice(obj1_normal,obj1_v1,obj1_v2,obj1_v3,obj2_normal,obj2_v,criterio):
#your code
return distancia_final_v,pontos_intersecao_final_v
pool = ThreadPool(processes=3)
async_result1 = pool.apply_async(calculo_vertice, (#your args here))
async_result2 = pool.apply_async(calculo_vertice, (#your args here))
async_result3 = pool.apply_async(calculo_vertice, (#your args here))
result1 = async_result1.get() # result1
result2 = async_result2.get() # result2
result3 = async_result3.get() # result3
Instead, something like this should do the job:
from multiprocessing import Process, Pipe
def calculo_vertice(obj1_normal,obj1_v1,obj1_v2,obj1_v3,obj2_normal,obj2_v,criterio, send_end):
#your code
send_end.send((distancia_final_v,pontos_intersecao_final_v))
numberOfWorkers = 3
jobs = []
pipeList = []
#Start process and build job list
for i in range(numberOfWorkers):
recv_end, send_end = Pipe(False)
process = Process(target=calculo_vertice, args=(#<... your args...>#, send_end))
jobs.append(process)
pipeList.append(recv_end)
process.start()
#Show the results
for job in jobs: job.join()
resultList = [x.recv() for x in pipeList]
print (resultList)
REF.
https://docs.python.org/3/library/multiprocessing.html
https://stackoverflow.com/a/37737985/8738174
This code will create a pool of 3 working process and each of it will async receive the function. It's important to point that in this case you should have 3+ CPU cores, otherwise, your system kernel will just switch between process and things won't real run in parallel.

Q : " Is it possible to make them happen at the same time? "
Yes.
The best results ever will be get if not adding any python ( the multiprocessing module is not necessary at all for doing 3 full-copies ( yes, top-down fully replicated copies ) of the __main__ python process for this so embarrasingly independent processing ).
The reasons for this are explained in detail here and here.
A just-enough tool is GNU's :$ parallel --jobs 3 python job-script.py {} ::: "v1" "v2" "v3"
For all performance-tweaking, read about more configuration details in man parallel.
"Because I will be using a considerable amount of data..."
The Devil is hidden in the details :
The O/P code may be syntactically driving the python interpreter to results, precise (approximate) within some 5 decimal places, yet the core sin is it's ultimately bad chances to demonstrate any reasonably achievable performance in doing that, the worse on "considerable amount of data".
If they, "at the company", expect some "considerable amount of data", you should do at least some elementary research on what is the processing aimed at.
The worst part ( not mentioning the decomposition of once vectorised-ready numpy-arrays back into atomic "float" coordinate values ) is the point-inside-triangle test.
For a brief analysis on how to speed-up this part ( the more if going to pour "considerable amount of data" on doing this ), get inspired from this post and get the job done in fraction of the time it was drafted in the O/P above.
Indirect testing of a point-inside-triangle by comparing an in-equality of a pair of re-float()-ed-strings, received from sums of triangle-areas ( b + c + d ) is just one of the performance blockers, you will find to get removed.

Using Cython to create parallel threads without prange

I have a recursive function that does something similar to the following:
import numpy as np
from copy import copy
shared_data = np.random.randn(6, 5, 3)
def grow(current_data, level):
grown_data = []
if level < shared_data.shape[0] - 1:
nlevel = level + 1
valid = ((shared_data[nlevel] - current_data[-1])**2).sum(axis=-1) < 1
for new_data in shared_data[nlevel, valid]:
continue_data = copy(current_data)
continue_data.append(new_data)
grown_data.extend(grow(continue_data, level+1))
else:
grown_data.append(current_data)
return grown_data
begin_data = np.random.randn(3)
print(grow([begin_data], 0))
I am wondering if there is some way to start a new parallel thread in cython to do the current processing on each entry for the grow function in order to speed this type of recursion up. While the above sample code runs relatively fast, the actual code is slower (a) because it does more than the simple distance calculation included above and (b) because the data it is operating on is more like the size (3000, 10, 3), which even for this simple example is prohibitively slow, at least on my machine.
One thought that I had was to use a list/queue to add recursive jobs to instead of calling them directly, then, on each return from grow, using a prange loop to process the jobs in the list/queue in parallel, but I'm afraid this will result in the recreation of threads all the time and decrease the efficiency.

How to get return value from thread in Python?

I do some computationally expensive tasks in python and found the thread module for parallelization. I have a function which does the computation and returns a ndarray as result. Now I want to know how I can parallize my function and get back the calculated Arrays from each thread.
The followed example is strongly simplified with light functions and calculations.
import numpy as np
def calculate_result(input):
a=np.linspace(1.0, 1000.0, num=10000) # just an example
result = input*a
return(result)
input =[1,2,3,4]
for i in range(0,len(input(i))):
t.Thread(target=calculate_result, args=(input))
t. start()
#Here I want to receive the return value from the thread
I am looking for a way to get the return value from the thread / function for each thread, because in my task each thread calculates different values.
I found an other Question (how to get the return value from a thread in python?) where someone is looking for a similar problem (no ndarrays) and which is handled with ThreadPool and async...
-------------------------------------------------------------------------------
Thanks for your answers !
Due to your help now I am looking for a way to solve my problem with the multiprocessing modul. To give you a better understanding what I do, see my following Explanation.
Explanation:
My 'input_data' is an ndarray with 282240 elements of type uint32
In the 'calculation_function()'I use a for loop to calculate from
every 12 bit a result and put it into the 'output_data'
Because this is very slow, I split my input_data into e.g. 4 or 8
parts and calculate each part in the calculation_function().
Now I am looking for a way, how to parallize the 4 or 8 function
calls
The order of the data is elementary, because the data is in image and
each pixel have to be at the correct Position. So function call no. 1
calculates the first and the last function call the last pixel of the
image.
The calculations work fine and the image can be completly rebuilt
from my algo but I need the parallelization to speed up for time
critical aspects.
Summary:
One input ndarray is devided into 4 or 8 parts. In each part are 70560 or 35280 uint32 values. From each 12 bit I calculate one Pixel with 4 or 8 function calls. Each function returns one ndarray with 188160 or 94080 pixel. All return values will be put together in a row and reshaped into an image.
What allready works:
Calculations are allready working and I can reconstruct my image
Problem:
Function calls are done seriall and in a row but each image reconstruction is very slow
Main Goal:
Speed up the function calls by parallize the function calls.
Code:
def decompress(payload,WIDTH,HEIGHT):
# INPUTS / OUTPUTS
n_threads = 4
img_input = np.fromstring(payload, dtype='uint32')
img_output = np.zeros((WIDTH * HEIGHT), dtype=np.uint32)
n_elements_part = np.int(len(img_input) / n_threads)
input_part=np.zeros((n_threads,n_elements_part)).astype(np.uint32)
output_part =np.zeros((n_threads,np.int(n_elements_part/3*8))).astype(np.uint32)
# DEFINE PARTS (here 4 different ones)
start = np.zeros(n_threads).astype(np.int)
end = np.zeros(n_threads).astype(np.int)
for i in range(0,n_threads):
start[i] = i * n_elements_part
end[i] = (i+1) * n_elements_part -1
# COPY IMAGE DATA
for idx in range(0,n_threads):
input_part [idx,:] = img_input[start[idx]:end[idx]+1]
for idx in range(0,n_threads): # following line is the function_call that should be parallized
output_part[idx,:] = decompress_part2(input_part[idx],output_part[idx])
# COPY PARTS INTO THE IMAGE
img_output[0 : 188160] = output_part[0,:]
img_output[188160: 376320] = output_part[1,:]
img_output[376320: 564480] = output_part[2,:]
img_output[564480: 752640] = output_part[3,:]
# RESHAPE IMAGE
img_output = np.reshape(img_output,(HEIGHT, WIDTH))
return img_output
Please dont take care of my beginner programming style :)
Just looking for a solution how to parallize the function calls with the multiprocessing module and get back the return ndarrays.
Thank you so much for your help !

You can use process pool from the multiprocessing module
def test(a):
return a
from multiprocessing.dummy import Pool
p = Pool(3)
a=p.starmap(test, zip([1,2,3]))
print(a)
p.close()
p.join()

kar's answer works, however keep in mind that he's using the .dummy module which might be limited by the GIL. Heres more info on it:
multiprocessing.dummy in Python is not utilising 100% cpu

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.