Anaconda Accelerate dot product is 2x slower than plain NumPy

Anaconda Accelerate dot product is 2x slower than plain NumPy - python

Why does Anaconda Accelerate compute dot products slower than plain NumPy on Python 3? I'm using accelerate version 2.3.1 with accelerate_cudalib 2.0 installed, Python 3.5.2 Windows 10 64-bit.
import numpy as np
from accelerate.cuda.blas import dot as gpu_dot
import time
def numpydot():
start= time.time()
for i in range(100):
np.dot(np.arange(1000000, dtype=np.float64), np.arange(1000000, dtype=np.float64))
elapsedtime = time.time()-start
return elapsedtime
def acceleratedot():
start= time.time()
for i in range(100):
gpu_dot(np.arange(1000000, dtype=np.float64), np.arange(1000000, dtype=np.float64))
elapsedtime = time.time()-start
return elapsedtime
numpydot()
0.6446375846862793
acceleratedot()
1.33168363571167

I figured out that shared arrays are created with Numba, a separate library. They have the documentation on their site.

Related

OMP: Error #131: Thread identifier invalid. when use Numba #njit(nogil=True, parallel=True)

I want to speed up my code by parallel in Numba (0.55.1 version) as below: As it contain a for loop, I want to speed it up by parallel computing with Numba
from numba import prange
from numba import njit
import numpy as np
from numpy import linalg as LA
import time
#njit(nogil=True)
def func(n):
nprime = n-1
main = np.sqrt(2.) * np.random.normal(0., 1., (nprime))
off = np.random.normal(0., 1., (nprime, nprime))
tril = np.tril(off, -1)
W_n = tril + tril.T
np.fill_diagonal(W_n, main)
eigenvalues = LA.eigvals(W_n)
return np.sort(eigenvalues)[::-1][0:2]
#njit(nogil=True, parallel=True)
def GOE_L12_sim_pa(n=200, rep=500):
for x0 in prange(rep):
func(n)
start = time.time()
GOE_L12_sim_pa()
end = time.time()
print(end-start)
Error:
OMP: Error #131: Thread identifier invalid.
OMP: Error #131: Thread identifier invalid.
Noting that it works when I change the decorator from #njit(nogil=True, parallel=True) to #njit(nogil=True).

I encountered the same OMP #131 error due to a bug in llvm-openmp 12.0.0 running on the M1 Mac processor. Updating to the latest version 14.0.6 resolved the issue.

Error when using numba and jit to run python with my gpu

This code is from geeksforgeeks and is meant to run normally (with a lower time for the gpu):
from numba import jit, cuda, errors
import numpy as np
# to measure exec time
from timeit import default_timer as timer
# normal function to run on cpu
def func(a):
for i in range(10000000):
a[i]+= 1
# function optimized to run on gpu
#jit(target ="cuda")
def func2(a):
for i in range(10000000):
a[i]+= 1
if __name__=="__main__":
n = 10000000
a = np.ones(n, dtype = np.float64)
b = np.ones(n, dtype = np.float32)
start = timer()
func(a)
print("without GPU:", timer()-start)
start = timer()
func2(a)
print("with GPU:", timer()-start)
but i get an error on the 'def func2(a)' line saying:
__init__() got an unexpected keyword argument 'locals'
and in the terminal the error is:
C:\Users\user\AppData\Local\Programs\Python\Python38\lib\site-packages\numba\core\decorators.py:153: NumbaDeprecationWarning: The 'target' keyword argument is deprecated.
warnings.warn("The 'target' keyword argument is deprecated.", NumbaDeprecationWarning)
Why does this happen and how do i fix it?
I have an intel i7 10750H and a 1650ti gpu

To get rid of the deprecation warning: https://numba.pydata.org/numba-doc/dev/reference/deprecation.html#deprecation-of-the-target-kwarg
This is a hack but try updating your CUDA and driver version first, then retry the code. Then finally try this hack if nothing works:
from numba import cuda
import code
code.interact(local=locals)
# function optimized to run on gpu
#cuda.jit(target ="cuda")
def func2(a):
for i in range(10000000):
a[i]+= 1

How to merge images as transparent layers?

I am working on video editor for raspberry pi, and I have a problem with speed of placing image over image. Currently, using imagemagick it takes up to 10 seconds just to place one image over another, using 1080x1920 png images, on raspberry pi, and that's way too much. With the number of images time goes up as well. Any ideas on how to speed it up?
Imagemagick code:
composite -blend 90 img1.png img2.png new.png
Video editor with yet slow opacity support here
--------EDIT--------
slightly faster way:
import numpy as np
from PIL import Image
size_X, size_Y = 1920, 1080# put images resolution, else output may look wierd
image1 = np.resize(np.asarray(Image.open('img1.png').convert('RGB')), (size_X, size_Y, 3))
image2 = np.resize(np.asarray(Image.open('img2.png').convert('RGB')), (size_X, size_Y, 3))
output = image1*transparency+image2*(1-transparency)
Image.fromarray(np.uint8(output)).save('output.png')

My Raspberry Pi is unavailable at the moment - all I am saying is that there was some smoke involved and I do software, not hardware! As a result, I have only tested this on a Mac. It uses Numba.
First I used your Numpy code on these 2 images:
and
Then I implemented the same thing using Numba. The Numba version runs 5.5x faster on my iMac. As the Raspberry Pi has 4 cores, you could try experimenting with:
#jit(nopython=True,parallel=True)
def method2(image1,image2,transparency):
...
Here is the code:
#!/usr/bin/env python3
import numpy as np
from PIL import Image
import numba
from numba import jit
def method1(image1,image2,transparency):
result = image1*transparency+image2*(1-transparency)
return result
#jit(nopython=True)
def method2(image1,image2,transparency):
h, w, c = image1.shape
for y in range(h):
for x in range(w):
for z in range(c):
image1[y][x][z] = image1[y][x][z] * transparency + (image2[y][x][z]*(1-transparency))
return image1
i1 = np.array(Image.open('image1.jpg').convert('RGB'))
i2 = np.array(Image.open('image2.jpg').convert('RGB'))
res = method1(i1,i2,0.4)
res = method2(i1,i2,0.4)
Image.fromarray(np.uint8(res)).save('result.png')
The result is:
Other thoughts... I did the composite in-place, overwriting the input image1 to try and save cache space. That may help or hinder - please experiment. I may not have processed the pixels in the optimal order - please experiment.

Just as another option, I tried in pyvips (full disclosure: I'm the pyvips maintainer, so I'm not very neutral):
#!/usr/bin/python3
import sys
import time
import pyvips
start = time.time()
a = pyvips.Image.new_from_file(sys.argv[1], access="sequential")
b = pyvips.Image.new_from_file(sys.argv[2], access="sequential")
out = a * 0.2 + b * 0.8
out.write_to_file(sys.argv[3])
print("pyvips took {} milliseconds".format(1000 * (time.time() - start)))
pyvips is a "pipeline" image processing library, so that code will execute the load, processing and save all in parallel.
On this two core, four thread i5 laptop using Mark's two test images I see:
$ ./overlay-vips.py blobs.jpg ships.jpg x.jpg
took 39.156198501586914 milliseconds
So 39ms for two jpg loads, processing and one jpg save.
You can time just the blend part by copying the source images and the result to memory, like this:
a = pyvips.Image.new_from_file(sys.argv[1]).copy_memory()
b = pyvips.Image.new_from_file(sys.argv[2]).copy_memory()
start = time.time()
out = (a * 0.2 + b * 0.8).copy_memory()
print("pyvips between memory buffers took {} milliseconds"
.format(1000 * (time.time() - start)))
I see:
$ ./overlay-vips.py blobs.jpg ships.jpg x.jpg
pyvips between memory buffers took 15.432596206665039 milliseconds
numpy is about 60ms on this same test.
I tried a slight variant of Mark's nice numba example:
#!/usr/bin/python3
import sys
import time
import numpy as np
from PIL import Image
import numba
from numba import jit, prange
#jit(nopython=True, parallel=True)
def method2(image1, image2, transparency):
h, w, c = image1.shape
for y in prange(h):
for x in range(w):
for z in range(c):
image1[y][x][z] = image1[y][x][z] * transparency \
+ (image2[y][x][z] * (1 - transparency))
return image1
# run once to force a compile
i1 = np.array(Image.open(sys.argv[1]).convert('RGB'))
i2 = np.array(Image.open(sys.argv[2]).convert('RGB'))
res = method2(i1, i2, 0.2)
# run again and time it
i1 = np.array(Image.open(sys.argv[1]).convert('RGB'))
i2 = np.array(Image.open(sys.argv[2]).convert('RGB'))
start = time.time()
res = method2(i1, i2, 0.2)
print("numba took {} milliseconds".format(1000 * (time.time() - start)))
Image.fromarray(np.uint8(res)).save(sys.argv[3])
And I see:
$ ./overlay-numba.py blobs.jpg ships.jpg x.jpg
numba took 8.110523223876953 milliseconds
So on this laptop, numba is about 2x faster than pyvips.
If you time load and save as well, it's quite a bit slower:
$ ./overlay-numba.py blobs.jpg ships.jpg x.jpg
numba plus load and save took 272.8157043457031 milliseconds
But that seems unfair, since almost all that time is in PIL load and save.

why scipy.spatial.ckdtree runs slower than scipy.spatial.kdtree

Normally,scipy.spatial.ckdtree runs much faster than scipy.spatial.kdtree.
But in my case,scipy.spatial.ckdtree runs slower than scipy.spatial.kdtree.
My code is as follows:
import numpy as np
from laspy.file import File
from scipy import spatial
from timeit import default_timer as timer
inFile = File("Toronto_Strip_01.las")
dataset = np.vstack([inFile.x, inFile.y, inFile.z]).transpose()
print(dataset.shape)
start=timer()
tree = spatial.cKDTree(dataset)
# balanced_tree = False
end=timer()
distance,index=tree.query(dataset[100,:],k=5)
print(distance,index)
print(end-start)
start=timer()
tree = spatial.KDTree(dataset)
end=timer()
dis,indices= tree.query(dataset[100,:],k=5)
print(dis,indices)
print(end-start)
dataset.shape is (2727891, 3),dataset.max() is 4834229.32
But, in a test case, scipy.spatial.ckdtree runs much faster than scipy.spatial.kdtree,the code is as follows:
import numpy as np
from timeit import default_timer as timer
from scipy import spatial
np.random.seed(0)
A = np.random.random((2000000,3))*2000000
start1 = timer()
kdt=spatial.KDTree(A)
end1 = timer()
distance,index = kdt.query(A[100,:],k=5)
print(distance,index)
print(end1-start1)
start2 = timer()
kdt = spatial.cKDTree(A) # cKDTree + outside construction
end2 = timer()
distance,index = kdt.query(A[100,:],k=5)
print(distance,index)
print(end2-start2)
Here is my problem: in my code,Do I need to process the dataset to speedup the cKDTree?
my python version is 3.6.5，scipy version is 1.1.0,cython is 0.28.4

Perhaps more of a long comment; but you should consider how the cKDTree parameters impact performance with your particular dataset.
Especially balanced_tree, and compact_nodes - as pointed out here.

python multiprocessing module: strange behaviour and processor load when using Pool

I'm using Python's multiprocessing lib to speed up some code (least squares fitting with scipy).
It works fine on 3 different machines, but it shows a strange behaviour on a 4th machine.
The code:
import numpy as np
from scipy.optimize import least_squares
import time
import parmap
from multiprocessing import Pool
p0 = [1., 1., 0.5]
def f(p, xx):
return p[0]*np.exp(-xx ** 2 / p[1] ** 2) + p[2]
def errorfunc(p, xx, yy):
return f(p, xx) - yy
def do_fit(yy, xx):
return least_squares(errorfunc, p0[:], args=(xx, yy))
if __name__ == '__main__':
# create data
x = np.linspace(-10, 10, 1000)
y = []
np.random.seed(42)
for i in range(1000):
y.append(f([np.random.rand(1) * 10, np.random.rand(1), 0.], x) + np.random.rand(len(x)))
# fit without multiprocessing
t1 = time.time()
for y_data in y:
p1 = least_squares(errorfunc, p0[:], args=(x, y_data))
t2 = time.time()
print t2 - t1
# fit with multiprocessing lib
times = []
for p in range(1,13):
my_pool = Pool(p)
t3 = time.time()
results = parmap.map(do_fit, y, x, pool=my_pool)
t4 = time.time()
times.append(t4-t3)
my_pool.close()
print times
For the 3 machines where it works, it speeds up roughly in the expected way. E.g. on my i7 laptop it gives:
[4.92650294303894, 2.5883090496063232, 1.7945551872253418, 1.629533052444458,
1.4896039962768555, 1.3550388813018799, 1.1796400547027588, 1.1852478981018066,
1.1404039859771729, 1.2239141464233398, 1.1676840782165527, 1.1416618824005127]
I'm running Ubuntu 14.10, Python 2.7.6, numpy 1.11.0 and scipy 0.17.0.
I tested it on another Ubuntu machine, a Dell PowerEdge R210 with similar results and on a MacBook Pro Retina (here with Python 2.7.11, and same numpy and scipy versions).
The computer that causes issues is a PowerEdge R710 (two hexcores) running Ubuntu 15.10, Python 2.7.11 and same numpy and scipy version as above.
However, I don't observe any speedup. Times are around 6 seconds, no matter what poolsize I use. In fact, it is slightly better for a poolsize of 2 and gets worse for more processes.
htop shows that somehow more processes get spawned than I would expect.
E.g. on my laptop htop shows one entry per process (which matches the poolsize) and eventually each process shows 100% CPU load.
On the PowerEdge R710 I see about 8 python processes for a poolsize of 1 and about 20 processes for a poolsize of 2 etc. each of which shows 100% CPU load.
I checked BIOS settings of the R710 and I couldn't find anything unusual.
What should I look for?
EDIT:
Answering to the comment, I used another simple script. Surprisingly this one seems to 'work' for all machines:
from multiprocessing import Pool
import time
import math
import numpy as np
def f_np(x):
return x**np.sin(x)+np.fabs(np.cos(x))**np.arctan(x)
def f(x):
return x**math.sin(x)+math.fabs(math.cos(x))**math.atan(x)
if __name__ == '__main__':
print "#pool", ", numpy", ", pure python"
for p in range(1,9):
pool = Pool(processes=p)
np.random.seed(42)
a = np.random.rand(1000,1000)
t1 = time.time()
for i in range(5):
pool.map(f_np, a)
t2 = time.time()
for i in range(5):
pool.map(f, range(1000000))
print p, t2-t1, time.time()-t2
pool.close()
gives:
#pool , numpy , pure python
1 1.34186911583 5.87641906738
2 0.697530984879 3.16030216217
3 0.470160961151 2.20742988586
4 0.35701417923 1.73128080368
5 0.308979988098 1.47339701653
6 0.286448001862 1.37223601341
7 0.274246931076 1.27663207054
8 0.245123147964 1.24748778343
on the machine that caused the trouble. There are no more threads (or processes?) spawned than I would expect.
It looks like numpy is not the problem, but as soon as I use scipy.optimize.least_squares the issue arises.
Using on htop on the processes shows a lot of sched_yield() calls which I don't see if I don't use scipy.optimize.least_squares and which I also don't see on my laptop even when using least_squares.

According to here, there is an issue when OpenBLAS is used together with joblib.
Similar issues occur when MKL is used (see here).
The solution given here, also worked for me:
Adding
import os
os.environ['MKL_NUM_THREADS'] = '1'
at the beginning of my python script solves the issue.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Anaconda Accelerate dot product is 2x slower than plain NumPy - python

I figured out that shared arrays are created with Numba, a separate library. They have the documentation on their site.

Related

OMP: Error #131: Thread identifier invalid. when use Numba #njit(nogil=True, parallel=True)

Error when using numba and jit to run python with my gpu

How to merge images as transparent layers?

why scipy.spatial.ckdtree runs slower than scipy.spatial.kdtree

python multiprocessing module: strange behaviour and processor load when using Pool

Categories

Resources