How to decrease execution time using multi-threading in Python

I am performing a DCT on a Raspberry Pi. I've broken the image into 8x8 blocks. Initially I performed the DCT in a nested for loop (without multithreading) and observed that it takes about 18 seconds for a 512x512 image.
Here's the code with multiple threads:
#!/usr/bin/env python
from __future__ import print_function, division
import time
start_time = time.time()
import cv2
import numpy as np
import sys
import pylab as plt
import threading
import Queue
from numpy import empty, arange, exp, real, imag, pi
from numpy.fft import rfft, irfft
from pprint import pprint

queue = Queue.Queue()

if len(sys.argv) > 1:
    im = cv2.imread(sys.argv[1])
else:
    im = cv2.imread('baboon.jpg')

im = cv2.cvtColor(im, cv2.COLOR_BGR2GRAY)
h, w = im.shape[:2]
DF = np.zeros((h, w))
Nb = 8

def dct2(y):
    M = y.shape[0]
    N = y.shape[1]
    a = empty([M, N], float)
    b = empty([M, N], float)
    for i in range(M):
        a[i, :] = dct(y[i, :])
    for j in range(N):
        b[:, j] = dct(a[:, j])
    queue.put(b)

def dct(y):
    N = len(y)
    y2 = empty(2*N, float)
    y2[:N] = y[:]
    y2[N:] = y[::-1]
    c = rfft(y2)
    phi = exp(-1j*pi*arange(N)/(2*N))
    return real(phi*c[:N])

def Main():
    jobs = []
    for row in range(0, h, Nb):
        for col in range(0, w, Nb):
            f = im[row:row+Nb, col:col+Nb]
            thread = threading.Thread(target=dct2(f))
            jobs.append(thread)
            df = queue.get()
            DF[row:row+Nb, col:col+Nb] = df
    for j in jobs:
        j.start()
    for j in jobs:
        j.join()

if __name__ == "__main__":
    Main()
    cv2.imwrite('dct_img.jpg', DF)
    print("--- %s seconds ---" % (time.time() - start_time))
    plt.imshow(DF, cmap='Greys')
    plt.show()
    cv2.waitKey(0)
    cv2.destroyAllWindows()
After switching to multiple threads, this code takes about 25 seconds to execute. What's wrong? Have I implemented multithreading incorrectly? I want to reduce the time taken to perform the DCT as much as possible (to 1-5 seconds). Any suggestions?
Is there any other concept or method (I've read posts on multiprocessing) that would significantly reduce my execution and processing time?

Due to the GIL, all your threads execute one at a time rather than in parallel.
So you might want to switch to multiprocessing. Another option is Numba, which can greatly speed up ordinary Python code and can also release the GIL.
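For illustration, here is a minimal sketch of the Numba route (assuming numba can be installed on the Pi; the dct_1d function below is my own example, not the poster's code). Compiling in nopython mode with nogil=True lets threads that call the compiled function run truly in parallel:

import numpy as np
from numba import njit

@njit(nogil=True, cache=True)
def dct_1d(y):
    # Naive O(N^2) DCT-II of a 1-D signal; illustrative only.
    N = y.shape[0]
    out = np.empty(N)
    for k in range(N):
        s = 0.0
        for n in range(N):
            s += y[n] * np.cos(np.pi * k * (2 * n + 1) / (2.0 * N))
        out[k] = s
    return out

The first call pays a one-time compilation cost, but subsequent calls release the GIL, so a small pool of threads can keep all four Pi cores busy.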

In Python, multithreading helps performance only when you are mixing I/O-bound and CPU-bound work.
For a CPU-bound problem like yours, you should use multiprocessing.
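As a sketch of what that could look like for the block DCT above (assuming Python 3, and a dct2 rewritten to return its result instead of pushing it onto a queue; dct is the same 1-D routine as in the question):

import multiprocessing as mp
import numpy as np

def dct2_block(block):
    # Row pass then column pass, as in the question's dct2, but returning b.
    a = np.array([dct(row) for row in block])
    return np.array([dct(col) for col in a.T]).T

if __name__ == "__main__":
    coords = [(r, c) for r in range(0, h, Nb) for c in range(0, w, Nb)]
    blocks = [im[r:r+Nb, c:c+Nb] for r, c in coords]
    with mp.Pool() as pool:                     # one worker per core by default
        results = pool.map(dct2_block, blocks)  # blocks are pickled to workers
    for (r, c), df in zip(coords, results):
        DF[r:r+Nb, c:c+Nb] = df

Note that a 512x512 image yields 4096 tiny tasks, so pickling overhead is nontrivial; chunking the work (e.g. one task per row of blocks) usually pays off.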

Maybe the other posters are right about the GIL, but OpenCV as well as NumPy release the GIL, so I would at least expect some speedup from a multithreaded solution.
I would look at how many threads you are creating simultaneously. It's probably a lot, since you start one for each 8x8-pixel sub-picture. Each time a thread is taken off the CPU and replaced by another, a small overhead is incurred, and that adds up noticeably when you have many threads.
If this is the case, you will probably gain performance by not starting them all at once, but by starting only about as many as you have CPU cores (a few more or a few less; just experiment) and launching the next thread only when one has finished.
Look at the answers to this question for how to do that with minimal effort.
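For instance, with Python 3's concurrent.futures the pool size can be capped without managing threads by hand (a sketch; dct2_return is an assumed variant of the question's dct2 that returns its block instead of using the queue):

from concurrent.futures import ThreadPoolExecutor

def process_block(rc):
    r, c = rc
    # dct2_return is assumed to be dct2 modified to return b directly
    return r, c, dct2_return(im[r:r+Nb, c:c+Nb])

coords = [(r, c) for r in range(0, h, Nb) for c in range(0, w, Nb)]
# Roughly one worker per core; to the extent NumPy's FFT releases the GIL
# while it runs, these threads can make real progress in parallel.
with ThreadPoolExecutor(max_workers=4) as ex:
    for r, c, df in ex.map(process_block, coords):
        DF[r:r+Nb, c:c+Nb] = df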

Related

scipy.fftpack.fft with multiprocessing, how to avoid performance losses?

I would like to use scipy.fftpack.fft (and rfft) inside a multiprocessing structure, but I have observed significant performance losses due to an apparent incompatibility between scipy.fftpack and multiprocessing, which makes the parallelization almost useless.
Although the issue seems well known, I could not find a solution on the web to avoid this performance loss.
Below is a minimal example showing the issue:
import time
import multiprocessing as mp
from scipy.fftpack import fft, ifft
import numpy as np

def costly_function(n_mean: int):
    start = time.time()
    x = np.ones(16385, dtype=float)
    for n in range(n_mean):
        fft(ifft(x))
    return (time.time() - start) * 1000.

n_run = 24

# ===== sequential test
sequential_times = [costly_function(500) for _ in range(n_run)]
print(f"time per run (sequential): {np.mean(sequential_times):.2f}+-{np.std(sequential_times):.2f}ms")

# ===== parallel test
with mp.Pool(12) as pool:
    parallel_times = pool.map(costly_function, [500 for _ in range(n_run)])
print(f"time per run (parallel): {np.mean(parallel_times):.2f}+-{np.std(parallel_times):.2f}ms")
On a 12-core machine running Ubuntu and Python 3.10, I get the following result:
>> time per run (sequential): 510.55+-64.64ms
>> time per run (parallel): 1254.26+-114.47ms
Note: none of the following additions resolved the problem:
import os
os.environ['OPENBLAS_NUM_THREADS'] = '1'
os.environ['MKL_NUM_THREADS'] = '1'
os.environ['OMP_NUM_THREADS'] = '1'
os.environ['NUMEXPR_NUM_THREADS'] = '1'
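(An aside that is not in the original post: since SciPy 1.4, scipy.fftpack is superseded by scipy.fft, whose functions accept a workers argument for internal parallelism. Whether it helps here is untested, and for a single short transform the gain may be modest, but it avoids spawning processes entirely:)

import numpy as np
from scipy.fft import fft, ifft  # newer interface superseding scipy.fftpack

x = np.ones(16385, dtype=float)
# Let SciPy parallelize the transform internally instead of using mp.Pool.
y = fft(ifft(x, workers=4), workers=4)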

Multiprocessing a function's output to a single array

I have been trying this for quite some time now, but my array remains unchanged.
My array here is TC_p_value, and the function I am trying to run is TC_stats. The code runs fine when executed normally, but takes too long (about an hour). To reduce the processing time, I divided the original array (1000x100) into 10 smaller sets of 100x100. Although the code runs without error, I always get back the same array (exactly as it was originally defined). I tried declaring TC_p_value as global so that each run could assign values to its specific part of the array, but it seems I am doing something wrong here (perhaps writing to a single array from multiple processes is simply not possible this way), or there is a flaw in my logic.
Any help is greatly appreciated.
The code is written below.
import numpy as np
from tqdm import tqdm
import pingouin as pg  # A package to do regression

TC_p_value = np.zeros((Treecover.shape[1], Treecover.shape[2]))  # let this array be of size 1000 x 100

def TC_stats(grid_start):
    global TC_p_value
    for lat in tqdm(range(grid_start, grid_start+100)):
        for lon in range(Treecover.shape[2]):
            TC_p_value[lat, lon] = pg.corr(y=Treecover[:, lat, lon].values,
                                           x=np.arange(1, 16, 1))['p-val'].values[0]

# Multiprocessing starts here
from multiprocessing import Pool

if __name__ == '__main__':
    pool = Pool()
    grid = np.arange(0, 1000, 100)  # Running it in groups of 100, 10 times
    pool.map(TC_stats, grid)
    pool.close()
    pool.join()
The problem is that an array defined globally is not shared across processes. Thus, you need to use shared memory.
import ctypes
import multiprocessing as mp
import numpy as np
from tqdm import tqdm
import pingouin as pg  # A package to do regression

N, M = Treecover.shape[1], Treecover.shape[2]
mp_arr = mp.Array(ctypes.c_double, N * M)
TC_p_value = np.frombuffer(mp_arr.get_obj())
TC_p_value = TC_p_value.reshape((N, M))
# let this array be of size 1000 x 100

def TC_stats(grid_start):
    TC_p_value = np.frombuffer(mp_arr.get_obj())
    TC_p_value = TC_p_value.reshape((N, M))
    for lat in tqdm(range(grid_start, grid_start+100)):
        for lon in range(Treecover.shape[2]):
            TC_p_value[lat, lon] = pg.corr(y=Treecover[:, lat, lon].values,
                                           x=np.arange(1, 16, 1))['p-val'].values[0]

def init(shared_arr_):
    global mp_arr
    mp_arr = shared_arr_

# Multiprocessing starts here
from multiprocessing import Pool

if __name__ == '__main__':
    pool = Pool(initializer=init, initargs=(mp_arr,))
    grid = np.arange(0, 1000, 100)  # Running it in groups of 100, 10 times
    pool.map_async(TC_stats, grid)
    pool.close()
    pool.join()
I ran the code above with a modified toy example, and it worked.
Reference: Use numpy array in shared memory for multiprocessing
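On Python 3.8+, a similar pattern is possible without ctypes via multiprocessing.shared_memory (a sketch of the idea, not part of the original answer; the dimensions are example values):

import numpy as np
from multiprocessing import shared_memory

N, M = 1000, 100  # example dimensions
# The parent creates the shared block and an ndarray view onto it.
shm = shared_memory.SharedMemory(create=True, size=N * M * 8)  # 8 bytes per float64
TC_p_value = np.ndarray((N, M), dtype=np.float64, buffer=shm.buf)

# Each worker attaches by name and builds the same view, so its writes
# land in the one shared buffer:
#   existing = shared_memory.SharedMemory(name=shm.name)
#   view = np.ndarray((N, M), dtype=np.float64, buffer=existing.buf)

shm.close()   # when this process is finished with the view
shm.unlink()  # free the segment once every process is done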

Numpy matrix multiplications with multiprocessing suddenly slow down as dimension increases

I want to do some large matrix multiplications using multiprocessing.Pool.
Suddenly, when the dimension exceeds 50, the computation takes an extremely long time.
Is there any easy way to make it faster?
Here, I don't want to use shared memory like RawArray, because my original code randomly generates the matrix each time.
The sample code is as follows.
import numpy as np
from time import time
from multiprocessing import Pool
from functools import partial

def f(d):
    a = int(10*d)
    N = int(10000/d)
    for _ in range(N):
        X = np.random.randn(a, 10)  # np.random.randn(10,10)
    return X

# Dimensions
ds = [1,2,3,4,5,6,8,10,20,35,40,45,50,60,62,64,66,68,70,80,90,100]

# Serial processing
serial = []
for d in ds:
    t1 = time()
    for i in range(20):
        f(d)
    serial.append(time()-t1)

# Parallel processing
parallel = []
for d in ds:
    t1 = time()
    pool = Pool()
    for i in range(20):
        pool.apply_async(partial(f, d), args=())
    pool.close()
    pool.join()
    parallel.append(time()-t1)

# Plot
import matplotlib.pyplot as plt
plt.title('Matrix multiplication time with 10000/d repetitions')
plt.plot(ds, serial, label='serial')
plt.plot(ds, parallel, label='parallel')
plt.xlabel('d (dimension)')
plt.ylabel('Total time (sec)')
plt.legend()
plt.show()
Since the total computation cost of f(d) is the same for every d, the parallel processing times should all be roughly equal, but the actual output shows they are not.
System info:
Linux-4.15.0-47-generic-x86_64-with-debian-stretch-sid
3.6.8 |Anaconda custom (64-bit)| (default, Dec 30 2018, 01:22:34)
[GCC 7.3.0]
Intel(R) Core(TM) i9-7940X CPU @ 3.10GHz
NOTE: I want to use parallel computation for a complicated internal simulation, not for sending data to a child process.
This is for self-reference.
Here is the solution I found.
My NumPy uses MKL as its backend; the problem may be that MKL multithreading collides with multiprocessing.
If I run the code:
import os
os.environ['MKL_NUM_THREADS'] = '1'
before importing numpy, then the problem is solved.
I just found an explanation here: https://github.com/numpy/numpy/issues/10145.
Looks like the CPU caching gets messed up when you have conflicting MKL matrix multiplications going at the same time.
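A runtime alternative to the environment variable (my addition, not from the original answer) is the threadpoolctl package, which can limit BLAS/MKL threads for just a block of code:

import numpy as np
from threadpoolctl import threadpool_limits  # pip install threadpoolctl

with threadpool_limits(limits=1, user_api='blas'):
    # BLAS/MKL calls in here run single-threaded, avoiding the
    # oversubscription that occurs when they run inside multiprocessing workers.
    X = np.random.randn(64, 10)
    G = X.T @ X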

Multiple scipy.integrate.ode instances

I would like to use scipy.integrate.ode (or scipy.integrate.odeint) instances in multiple threads (one for each CPU core) in order to solve multiple IVPs at a time. However the documentation says: "This integrator is not re-entrant. You cannot have two ode instances using the “vode” integrator at the same time."
(Also, odeint causes internal errors if instantiated multiple times, although the documentation does not say so.)
Any idea what can be done?
One option is to use multiprocessing (i.e. use processes instead of threads). Here's an example that uses the map function of the multiprocessing.Pool class.
The function solve takes a set of initial conditions and returns a solution generated by odeint. The "serial" version of the code in the main section calls solve repeatedly, once for each set of initial conditions in ics. The "multiprocessing" version uses the map function of a multiprocessing.Pool instance to run several processes simultaneously, each calling solve. The map function takes care of doling out the arguments to solve.
My computer has four cores, and as I increase num_processes, the speedup maxes out at about 3.6.
from __future__ import division, print_function

import sys
import time
import multiprocessing as mp
import numpy as np
from scipy.integrate import odeint

def lorenz(q, t, sigma, rho, beta):
    x, y, z = q
    return [sigma*(y - x), x*(rho - z) - y, x*y - beta*z]

def solve(ic):
    t = np.linspace(0, 200, 801)
    sigma = 10.0
    rho = 28.0
    beta = 8/3
    sol = odeint(lorenz, ic, t, args=(sigma, rho, beta), rtol=1e-10, atol=1e-12)
    return sol

if __name__ == "__main__":
    ics = np.random.randn(100, 3)

    print("multiprocessing:", end='')
    tstart = time.time()
    num_processes = 5
    p = mp.Pool(num_processes)
    mp_solutions = p.map(solve, ics)
    tend = time.time()
    tmp = tend - tstart
    print(" %8.3f seconds" % tmp)

    print("serial:         ", end='')
    sys.stdout.flush()
    tstart = time.time()
    serial_solutions = [solve(ic) for ic in ics]
    tend = time.time()
    tserial = tend - tstart
    print(" %8.3f seconds" % tserial)

    print("num_processes = %i, speedup = %.2f" % (num_processes, tserial/tmp))

    check = [(sol1 == sol2).all()
             for sol1, sol2 in zip(serial_solutions, mp_solutions)]
    if not all(check):
        print("There was at least one discrepancy in the solutions.")
On my computer, the output is:
multiprocessing: 6.904 seconds
serial: 24.756 seconds
num_processes = 5, speedup = 3.59
SciPy.integrate.ode appears to use the LLNL SUNDIALS solvers, although SciPy doesn't say so explicitly (in my opinion, it should).
The current version of the CVODE ode solver, 3.2.2, is re-entrant, which means that it can be used to solve multiple problems concurrently. The relevant information appears in User Documentation for CVODE v3.2.0 (SUNDIALS v3.2.0).
All state information used by cvode to solve a given problem is saved in a structure, and a pointer to that structure is returned to the user. There is no global data in the cvode package, and so, in this respect, it is reentrant. State information specific to the linear solver is saved in a separate structure, a pointer to which resides in the cvode memory structure. The reentrancy of cvode was motivated by the anticipated multicomputer extension, but is also essential in a uniprocessor setting where two or more problems are solved by intermixed calls to the package from within a single user program.
But I don't know whether SciPy.integrate.ode, or other ode solvers like scikits.odes.ode, support this concurrency.

Python - multiprocessing for matplotlib griddata

Following my former question [1], I would like to apply multiprocessing to matplotlib's griddata function. Is it possible to split griddata into, say, 4 parts, one for each of my 4 cores? I need this to improve performance.
For example, try the code below, experimenting with different values for size:
import numpy as np
import matplotlib.mlab as mlab
import time
size = 500
Y = np.arange(size)
X = np.arange(size)
x, y = np.meshgrid(X, Y)
u = x * np.sin(5) + y * np.cos(5)
v = x * np.cos(5) + y * np.sin(5)
test = x + y
tic = time.clock()
test_d = mlab.griddata(
    x.flatten(), y.flatten(), test.flatten(), x+u, y+v, interp='linear')
toc = time.clock()
print 'Time=', toc-tic
I ran the example code below in Python 3.4.2, with numpy version 1.9.1 and matplotlib version 1.4.2, on a Macbook Pro with 4 physical CPUs (i.e., as opposed to "virtual" CPUs, which the Mac hardware architecture also makes available for some use cases):
import numpy as np
import matplotlib.mlab as mlab
import time
import multiprocessing

# This value should be set much larger than nprocs, defined later below
size = 500

Y = np.arange(size)
X = np.arange(size)
x, y = np.meshgrid(X, Y)
u = x * np.sin(5) + y * np.cos(5)
v = x * np.cos(5) + y * np.sin(5)
test = x + y

tic = time.clock()
test_d = mlab.griddata(
    x.flatten(), y.flatten(), test.flatten(), x+u, y+v, interp='linear')
toc = time.clock()
print('Single Processor Time={0}'.format(toc-tic))

# Put interpolation points into a single array so that we can slice it easily
xi = x + u
yi = y + v

# My example test machine has 4 physical CPUs
nprocs = 4
jump = int(size/nprocs)

# Enclose the griddata function in a wrapper which will communicate its
# output result back to the calling process via a Queue
def wrapper(x, y, z, xi, yi, q):
    test_w = mlab.griddata(x, y, z, xi, yi, interp='linear')
    q.put(test_w)

# Measure the elapsed time for multiprocessing separately
ticm = time.clock()

queue, process = [], []
for n in range(nprocs):
    queue.append(multiprocessing.Queue())
    # Handle the possibility that size is not evenly divisible by nprocs
    if n == (nprocs-1):
        finalidx = size
    else:
        finalidx = (n + 1) * jump
    # Define the arguments, dividing the interpolation variables into
    # nprocs roughly evenly sized slices
    argtuple = (x.flatten(), y.flatten(), test.flatten(),
                xi[:, (n*jump):finalidx], yi[:, (n*jump):finalidx], queue[-1])
    # Create the processes, and launch them
    process.append(multiprocessing.Process(target=wrapper, args=argtuple))
    process[-1].start()

# Initialize an array to hold the return value, and make sure that it is
# null-valued but of the appropriate size
test_m = np.asarray([[] for s in range(size)])

# Read the individual results back from the queues and concatenate them
# into the return array
for q, p in zip(queue, process):
    test_m = np.concatenate((test_m, q.get()), axis=1)
    p.join()

tocm = time.clock()
print('Multiprocessing Time={0}'.format(tocm-ticm))

# Check that the result of both methods is actually the same; should raise
# an AssertionError exception if assertion is not True
assert np.all(test_d == test_m)
and I got the following result:
/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/matplotlib/tri/triangulation.py:110: FutureWarning: comparison to `None` will result in an elementwise object comparison in the future.
  self._neighbors)
Single Processor Time=8.495998
Multiprocessing Time=2.249938
I'm not really sure what is causing the "future warning" from triangulation.py (evidently my version of matplotlib did not like something about the input values originally provided in the question), but regardless, the multiprocessing does appear to achieve the desired speedup of 8.50/2.25 = 3.8 (edit: see comments), which is roughly in the neighborhood of the ~4X we would expect for a machine with 4 CPUs. The assertion at the end also executes successfully, proving that the two methods get the same answer, so in spite of the slightly odd warning message, I believe that the code above is a valid solution.
EDIT: A commenter has pointed out that both my solution and the code snippet posted by the original author are likely using the wrong method, time.clock(), for measuring execution time; he suggests using time.time() instead. I think I'm coming around to his point of view. (Digging into the Python documentation a bit further, I'm still not convinced that even this solution is 100% correct, since newer versions of Python appear to have deprecated time.clock() in favor of time.perf_counter() and time.process_time(). But regardless, whether or not time.time() is absolutely the most correct way to take this measurement, it's probably more correct than what I had been using before, time.clock().)
Assuming the commenter's point is correct, the approximately 4X speedup I thought I had measured is in fact wrong.
However, that does not mean that the underlying code itself wasn't correctly parallelized; rather, it just means that parallelization didn't actually help in this case; splitting up the data and running on multiple processors didn't improve anything. Why would this be? Other users have pointed out that, at least in numpy/scipy, some functions run on multiple cores, and some do not, and it can be a seriously challenging research project for an end-user to try to figure out which ones are which.
Based on the results of this experiment, if my solution correctly achieves parallelization within Python, but no further speedup is observed, then I would suggest the simplest likely explanation is that matplotlib is probably also parallelizing some of its functions "under the hood", so to speak, in compiled C++ libraries, just like numpy/scipy already do. Assuming that's the case, then the correct answer to this question would be that nothing further can be done: further parallelizing in Python will do no good if the underlying C++ libraries are already silently running on multiple cores to begin with.
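For reference, a minimal timing pattern using time.perf_counter(), the modern replacement for time.clock():

import time

tic = time.perf_counter()
# ... the work being measured ...
toc = time.perf_counter()
print('Elapsed wall-clock time = {0:.3f} s'.format(toc - tic))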
