I'm using Python's multiprocessing lib to speed up some code (least squares fitting with scipy).
It works fine on 3 different machines, but it shows a strange behaviour on a 4th machine.
The code:
import numpy as np
from scipy.optimize import least_squares
import time
import parmap
from multiprocessing import Pool
p0 = [1., 1., 0.5]
def f(p, xx):
return p[0]*np.exp(-xx ** 2 / p[1] ** 2) + p[2]
def errorfunc(p, xx, yy):
return f(p, xx) - yy
def do_fit(yy, xx):
return least_squares(errorfunc, p0[:], args=(xx, yy))
if __name__ == '__main__':
# create data
x = np.linspace(-10, 10, 1000)
y = []
np.random.seed(42)
for i in range(1000):
y.append(f([np.random.rand(1) * 10, np.random.rand(1), 0.], x) + np.random.rand(len(x)))
# fit without multiprocessing
t1 = time.time()
for y_data in y:
p1 = least_squares(errorfunc, p0[:], args=(x, y_data))
t2 = time.time()
print t2 - t1
# fit with multiprocessing lib
times = []
for p in range(1,13):
my_pool = Pool(p)
t3 = time.time()
results = parmap.map(do_fit, y, x, pool=my_pool)
t4 = time.time()
times.append(t4-t3)
my_pool.close()
print times
For the 3 machines where it works, it speeds up roughly in the expected way. E.g. on my i7 laptop it gives:
[4.92650294303894, 2.5883090496063232, 1.7945551872253418, 1.629533052444458,
1.4896039962768555, 1.3550388813018799, 1.1796400547027588, 1.1852478981018066,
1.1404039859771729, 1.2239141464233398, 1.1676840782165527, 1.1416618824005127]
I'm running Ubuntu 14.10, Python 2.7.6, numpy 1.11.0 and scipy 0.17.0.
I tested it on another Ubuntu machine, a Dell PowerEdge R210 with similar results and on a MacBook Pro Retina (here with Python 2.7.11, and same numpy and scipy versions).
The computer that causes issues is a PowerEdge R710 (two hexcores) running Ubuntu 15.10, Python 2.7.11 and same numpy and scipy version as above.
However, I don't observe any speedup. Times are around 6 seconds, no matter what poolsize I use. In fact, it is slightly better for a poolsize of 2 and gets worse for more processes.
htop shows that somehow more processes get spawned than I would expect.
E.g. on my laptop htop shows one entry per process (which matches the poolsize) and eventually each process shows 100% CPU load.
On the PowerEdge R710 I see about 8 python processes for a poolsize of 1 and about 20 processes for a poolsize of 2 etc. each of which shows 100% CPU load.
I checked BIOS settings of the R710 and I couldn't find anything unusual.
What should I look for?
EDIT:
Answering to the comment, I used another simple script. Surprisingly this one seems to 'work' for all machines:
from multiprocessing import Pool
import time
import math
import numpy as np
def f_np(x):
return x**np.sin(x)+np.fabs(np.cos(x))**np.arctan(x)
def f(x):
return x**math.sin(x)+math.fabs(math.cos(x))**math.atan(x)
if __name__ == '__main__':
print "#pool", ", numpy", ", pure python"
for p in range(1,9):
pool = Pool(processes=p)
np.random.seed(42)
a = np.random.rand(1000,1000)
t1 = time.time()
for i in range(5):
pool.map(f_np, a)
t2 = time.time()
for i in range(5):
pool.map(f, range(1000000))
print p, t2-t1, time.time()-t2
pool.close()
gives:
#pool , numpy , pure python
1 1.34186911583 5.87641906738
2 0.697530984879 3.16030216217
3 0.470160961151 2.20742988586
4 0.35701417923 1.73128080368
5 0.308979988098 1.47339701653
6 0.286448001862 1.37223601341
7 0.274246931076 1.27663207054
8 0.245123147964 1.24748778343
on the machine that caused the trouble. There are no more threads (or processes?) spawned than I would expect.
It looks like numpy is not the problem, but as soon as I use scipy.optimize.least_squares the issue arises.
Using on htop on the processes shows a lot of sched_yield() calls which I don't see if I don't use scipy.optimize.least_squares and which I also don't see on my laptop even when using least_squares.
According to here, there is an issue when OpenBLAS is used together with joblib.
Similar issues occur when MKL is used (see here).
The solution given here, also worked for me:
Adding
import os
os.environ['MKL_NUM_THREADS'] = '1'
at the beginning of my python script solves the issue.
Related
I want to speed up my code by parallel in Numba (0.55.1 version) as below: As it contain a for loop, I want to speed it up by parallel computing with Numba
from numba import prange
from numba import njit
import numpy as np
from numpy import linalg as LA
import time
#njit(nogil=True)
def func(n):
nprime = n-1
main = np.sqrt(2.) * np.random.normal(0., 1., (nprime))
off = np.random.normal(0., 1., (nprime, nprime))
tril = np.tril(off, -1)
W_n = tril + tril.T
np.fill_diagonal(W_n, main)
eigenvalues = LA.eigvals(W_n)
return np.sort(eigenvalues)[::-1][0:2]
#njit(nogil=True, parallel=True)
def GOE_L12_sim_pa(n=200, rep=500):
for x0 in prange(rep):
func(n)
start = time.time()
GOE_L12_sim_pa()
end = time.time()
print(end-start)
Error:
OMP: Error #131: Thread identifier invalid.
OMP: Error #131: Thread identifier invalid.
Noting that it works when I change the decorator from #njit(nogil=True, parallel=True) to #njit(nogil=True).
I encountered the same OMP #131 error due to a bug in llvm-openmp 12.0.0 running on the M1 Mac processor. Updating to the latest version 14.0.6 resolved the issue.
I want to use cuda streams in order to speed up small calculations on the GPU. My test so far consists of the following:
import cupy as xp
import time
x = xp.random.randn(10003, 20000) + 1j * xp.random.randn(10003, 20000)
y = xp.zeros_like(x)
nStreams = 16
streams = [xp.cuda.stream.Stream() for ii in range(nStreams)]
f = xp.fft.fft(x[:,:200])
t = time.time()
for ii in range(int(x.shape[1]/100)):
ss = streams[ii % nStreams]
with ss:
y[:,ii*200:(ii+1)*200] = xp.fft.fft(x[:,ii*200:(ii+1)*200], axis=0)
for ii,ss in enumerate(streams):
ss.synchronize()
print(time.time()-t)
t = time.time()
for ii in range(int(x.shape[1]/100)):
y[:,ii*200:(ii+1)*200] = xp.fft.fft(x[:,ii*200:(ii+1)*200], axis=0)
xp.cuda.Stream.null.synchronize()
print(time.time()-t)
produces
[user#pc snippets]$ intelpython3 strm.py
0.019365549087524414
0.018717050552368164
which I have trouble believing that I do everything correctly. Additionally, the situation becomes even more severe when replacing the FFT-calls with calls to xp.sum, which yields
[user#pc snippets]$ intelpython3 strm.py
0.002195596694946289
0.001004934310913086
What is the rationale behind cupy streams? How do I use them to my advantage?
I'm using parallel processing to generate a plot of functions using complex numbers. My script allows you to zoom in on an area of the plot using the standard matplotlib controls and then regenerate the plot within the new limits to improve resolution.
This is my first foray into parallel processing and I've got as far as understanding that I need to preface with if __name__ == __main__: to allow the module to be imported properly. When running my script, the first plot is successfully generated and appears as expected. However, when the plotting function is called again from my event handler it instead hangs indefinitely. I assume that the hang is caused by some similar issue to that of requiring if __name__ == __main__:, as the parallel processes are being spawned from outside the main body of the script, but I haven't figured out anything further than this.
import numpy as np
import matplotlib.pyplot as plt
from concurrent.futures import ProcessPoolExecutor
import multiprocessing
res = [1000, 1000]
base_factor = 2.
cpuNum = multiprocessing.cpu_count()
def brot(c, depth=200):
z = complex(0)
for i in range(depth):
z = pow(z, 2) + c
if abs(z) > 2:
return i
return -1
def brot_gen(span):
re_span = span[0]
im_span = span[1]
mset = np.zeros([len(im_span), len(re_span)])
for re in range(len(re_span)):
for im in range(len(im_span)):
mset[im][re] = brot(complex(re_span[re], im_span[im]))
return mset
def brot_gen_parallel(re_lim, im_lim):
re_span = np.linspace(re_lim[0], re_lim[1], res[0])
im_span = np.linspace(im_lim[0], im_lim[1], res[1])
split_re_span = np.array_split(re_span, cpuNum)
packages = [(sec, im_span) for sec in split_re_span]
print("Generating set between", re_lim, "and", im_lim, "...")
with ProcessPoolExecutor(max_workers = cpuNum) as executor:
result = executor.map(brot_gen, packages)
mset = np.concatenate(list(result), axis=1)
print("Set generated")
return mset
def handler(ax):
def action(event):
if event.button == 2:
cur_re_lim = ax.get_xlim()
cur_im_lim = ax.get_ylim()
mset = brot_gen_parallel(cur_re_lim, cur_im_lim)
ax.cla()
ax.imshow(mset, extent=[cur_re_lim[0], cur_re_lim[1], cur_im_lim[0], cur_im_lim[1]], origin="lower", vmin=0, vmax=200, interpolation="bilinear")
plt.draw()
fig = ax.get_figure()
fig.canvas.mpl_connect('button_release_event', action)
return action
if __name__ == "__main__":
re_lim = np.array([-2.5, 2.5])
im_lim = res[1]/res[0] * re_lim
mset = brot_gen_parallel(re_lim, im_lim)
plt.imshow(mset, extent=[re_lim[0], re_lim[1], im_lim[0], im_lim[1]], origin="lower", vmin=0, vmax=200, interpolation="bilinear")
ax = plt.gca()
f = handler(ax)
plt.show()
EDIT: I wondered if there was a bug in the code causing an exception, but that this might not be being successfully passed back to the console, however I tested this by running the same task without splitting it into parallel tasks and it completed successfully.
I have discovered the answer to my own question. The answer lies in the IDE I was using. In my experience, in most IDEs plt.show() blocks execution by default, however in Spyder the default seems to be the equivalent of plt.show(block=False), meaning that the script completed and so whatever was required to successfully start the parallel processes was no longer available, causing the hang. This was solved by simply changing the statement to plt.show(block=True), meaning that the script was still live.
I'm still very new to parallel processing so I'd be very interested in any more information anyone can give on what was lacking to stop the parallel processing from working.
I am working on video editor for raspberry pi, and I have a problem with speed of placing image over image. Currently, using imagemagick it takes up to 10 seconds just to place one image over another, using 1080x1920 png images, on raspberry pi, and that's way too much. With the number of images time goes up as well. Any ideas on how to speed it up?
Imagemagick code:
composite -blend 90 img1.png img2.png new.png
Video editor with yet slow opacity support here
--------EDIT--------
slightly faster way:
import numpy as np
from PIL import Image
size_X, size_Y = 1920, 1080# put images resolution, else output may look wierd
image1 = np.resize(np.asarray(Image.open('img1.png').convert('RGB')), (size_X, size_Y, 3))
image2 = np.resize(np.asarray(Image.open('img2.png').convert('RGB')), (size_X, size_Y, 3))
output = image1*transparency+image2*(1-transparency)
Image.fromarray(np.uint8(output)).save('output.png')
My Raspberry Pi is unavailable at the moment - all I am saying is that there was some smoke involved and I do software, not hardware! As a result, I have only tested this on a Mac. It uses Numba.
First I used your Numpy code on these 2 images:
and
Then I implemented the same thing using Numba. The Numba version runs 5.5x faster on my iMac. As the Raspberry Pi has 4 cores, you could try experimenting with:
#jit(nopython=True,parallel=True)
def method2(image1,image2,transparency):
...
Here is the code:
#!/usr/bin/env python3
import numpy as np
from PIL import Image
import numba
from numba import jit
def method1(image1,image2,transparency):
result = image1*transparency+image2*(1-transparency)
return result
#jit(nopython=True)
def method2(image1,image2,transparency):
h, w, c = image1.shape
for y in range(h):
for x in range(w):
for z in range(c):
image1[y][x][z] = image1[y][x][z] * transparency + (image2[y][x][z]*(1-transparency))
return image1
i1 = np.array(Image.open('image1.jpg').convert('RGB'))
i2 = np.array(Image.open('image2.jpg').convert('RGB'))
res = method1(i1,i2,0.4)
res = method2(i1,i2,0.4)
Image.fromarray(np.uint8(res)).save('result.png')
The result is:
Other thoughts... I did the composite in-place, overwriting the input image1 to try and save cache space. That may help or hinder - please experiment. I may not have processed the pixels in the optimal order - please experiment.
Just as another option, I tried in pyvips (full disclosure: I'm the pyvips maintainer, so I'm not very neutral):
#!/usr/bin/python3
import sys
import time
import pyvips
start = time.time()
a = pyvips.Image.new_from_file(sys.argv[1], access="sequential")
b = pyvips.Image.new_from_file(sys.argv[2], access="sequential")
out = a * 0.2 + b * 0.8
out.write_to_file(sys.argv[3])
print("pyvips took {} milliseconds".format(1000 * (time.time() - start)))
pyvips is a "pipeline" image processing library, so that code will execute the load, processing and save all in parallel.
On this two core, four thread i5 laptop using Mark's two test images I see:
$ ./overlay-vips.py blobs.jpg ships.jpg x.jpg
took 39.156198501586914 milliseconds
So 39ms for two jpg loads, processing and one jpg save.
You can time just the blend part by copying the source images and the result to memory, like this:
a = pyvips.Image.new_from_file(sys.argv[1]).copy_memory()
b = pyvips.Image.new_from_file(sys.argv[2]).copy_memory()
start = time.time()
out = (a * 0.2 + b * 0.8).copy_memory()
print("pyvips between memory buffers took {} milliseconds"
.format(1000 * (time.time() - start)))
I see:
$ ./overlay-vips.py blobs.jpg ships.jpg x.jpg
pyvips between memory buffers took 15.432596206665039 milliseconds
numpy is about 60ms on this same test.
I tried a slight variant of Mark's nice numba example:
#!/usr/bin/python3
import sys
import time
import numpy as np
from PIL import Image
import numba
from numba import jit, prange
#jit(nopython=True, parallel=True)
def method2(image1, image2, transparency):
h, w, c = image1.shape
for y in prange(h):
for x in range(w):
for z in range(c):
image1[y][x][z] = image1[y][x][z] * transparency \
+ (image2[y][x][z] * (1 - transparency))
return image1
# run once to force a compile
i1 = np.array(Image.open(sys.argv[1]).convert('RGB'))
i2 = np.array(Image.open(sys.argv[2]).convert('RGB'))
res = method2(i1, i2, 0.2)
# run again and time it
i1 = np.array(Image.open(sys.argv[1]).convert('RGB'))
i2 = np.array(Image.open(sys.argv[2]).convert('RGB'))
start = time.time()
res = method2(i1, i2, 0.2)
print("numba took {} milliseconds".format(1000 * (time.time() - start)))
Image.fromarray(np.uint8(res)).save(sys.argv[3])
And I see:
$ ./overlay-numba.py blobs.jpg ships.jpg x.jpg
numba took 8.110523223876953 milliseconds
So on this laptop, numba is about 2x faster than pyvips.
If you time load and save as well, it's quite a bit slower:
$ ./overlay-numba.py blobs.jpg ships.jpg x.jpg
numba plus load and save took 272.8157043457031 milliseconds
But that seems unfair, since almost all that time is in PIL load and save.
Why does Anaconda Accelerate compute dot products slower than plain NumPy on Python 3? I'm using accelerate version 2.3.1 with accelerate_cudalib 2.0 installed, Python 3.5.2 Windows 10 64-bit.
import numpy as np
from accelerate.cuda.blas import dot as gpu_dot
import time
def numpydot():
start= time.time()
for i in range(100):
np.dot(np.arange(1000000, dtype=np.float64), np.arange(1000000, dtype=np.float64))
elapsedtime = time.time()-start
return elapsedtime
def acceleratedot():
start= time.time()
for i in range(100):
gpu_dot(np.arange(1000000, dtype=np.float64), np.arange(1000000, dtype=np.float64))
elapsedtime = time.time()-start
return elapsedtime
numpydot()
0.6446375846862793
acceleratedot()
1.33168363571167
I figured out that shared arrays are created with Numba, a separate library. They have the documentation on their site.