How to speed up Cupy with Streams correctly?


I want to use CUDA streams to speed up small calculations on the GPU. My test so far consists of the following:
import cupy as xp
import time
x = xp.random.randn(10003, 20000) + 1j * xp.random.randn(10003, 20000)
y = xp.zeros_like(x)
nStreams = 16
streams = [xp.cuda.stream.Stream() for ii in range(nStreams)]
f = xp.fft.fft(x[:,:200])
t = time.time()
for ii in range(int(x.shape[1]/100)):
    ss = streams[ii % nStreams]
    with ss:
        y[:,ii*200:(ii+1)*200] = xp.fft.fft(x[:,ii*200:(ii+1)*200], axis=0)
for ii,ss in enumerate(streams):
    ss.synchronize()
print(time.time()-t)
t = time.time()
for ii in range(int(x.shape[1]/100)):
    y[:,ii*200:(ii+1)*200] = xp.fft.fft(x[:,ii*200:(ii+1)*200], axis=0)
xp.cuda.Stream.null.synchronize()
print(time.time()-t)
produces
[user@pc snippets]$ intelpython3 strm.py
0.019365549087524414
0.018717050552368164
which makes me doubt that I am doing everything correctly. The situation becomes even worse when I replace the FFT calls with calls to xp.sum, which yields
[user@pc snippets]$ intelpython3 strm.py
0.002195596694946289
0.001004934310913086
What is the rationale behind cupy streams? How do I use them to my advantage?
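One detail of the benchmark above is worth checking before comparing the two timings: the loops run int(x.shape[1]/100) = 200 iterations but step through the columns in blocks of 200, so block indices from 100 upward start past the last column. Slicing past the end of an array yields an empty slice rather than an error, so those iterations presumably do little useful work. The following minimal sketch reproduces only the index arithmetic, with a plain Python range standing in for the column axis (no GPU needed). The names n_cols, block and count_nonempty_blocks are illustrative and not from the question.

```python
def count_nonempty_blocks(n_cols, block, n_iter):
    """Count how many of the slices [ii*block:(ii+1)*block] select at least one column."""
    columns = range(n_cols)  # stand-in for the column axis of x
    return sum(
        1 for ii in range(n_iter)
        if len(columns[ii * block:(ii + 1) * block]) > 0
    )

n_cols, block = 20000, 200

# Loop bound used in the question: int(x.shape[1] / 100)
n_iter_question = int(n_cols / 100)
# Loop bound that matches the block width
n_iter_matching = n_cols // block

nonempty_question = count_nonempty_blocks(n_cols, block, n_iter_question)
nonempty_matching = count_nonempty_blocks(n_cols, block, n_iter_matching)
print(n_iter_question, nonempty_question)
print(n_iter_matching, nonempty_matching)
```

If that reading is right, only half of the iterations touch any data, which is worth ruling out before drawing conclusions about streams from these numbers.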

Related

How to free up RAM when using Jupyter Notebook?

I have a Jupyter Notebook where I am working with large matrices (20000x20000). I am running multiple iterations, but after every iteration I get an error saying that I do not have enough RAM. If I restart the kernel, I can run the next iteration, so perhaps the Jupyter Notebook is running out of RAM because it stores the variables (which aren't needed for the next iteration). Is there a way to free up RAM?
Edit: I don't know if my guess about the stored variables is correct. In any case, I am looking to free up RAM; any suggestions are welcome.
## Outputs:
two_moons_n_of_samples = [int(_) for _ in np.repeat(20000, 10)]
for i in range(len(two_moons_n_of_samples)):
    # print(f'n: {two_moons_n_of_samples[i]}')
    ## Generate the data and the graph
    X, ground_truth, fid = synthetic_data({'type': 'two_moons', 'n': two_moons_n_of_samples[i], 'fidelity': 60, 'sigma': 0.18})
    N = X.shape[0]
    dist_mat = sqdist(X.T, X.T)
    opt = {
        'graph': 'full',
        'tau': 0.004,
        'type': 's'
    }
    LS = dense_laplacian(dist_mat, opt)
    ## Eigenvalues and eigenvectors
    tic = time.time() ## Time how long to calculate eigenvalues/eigenvectors
    V, E = np.linalg.eigh(LS)
    idx = np.argsort(V)
    V, E = V[idx], E[:, idx]
    V = V / V.max()
    decomposition_time = time.time() - tic
    ## Initialize u0
    u0 = np.zeros(N)
    for j in range(len(fid[0])):
        u0[fid[0][j]] = 1
    for j in range(len(fid[1])):
        u0[fid[1][j]] = -1
    ## Initialize parameters
    dt = 0.05
    gamma = 0.07
    max_iter = 100
    ## Run MAP estimation
    tic = time.time()
    u_eg, _ = probit_optimization_eig(E, V, u0, dt, gamma, fid, max_iter)
    eg_time = time.time() - tic
    ## Run MAP estimation with CG
    tic2 = time.time()
    u_cg, _ = probit_optimization_cg(LS, u0, dt, gamma, fid, max_iter)
    cg_time = time.time() - tic2
    ## Write to file:
    with open('results2_two_moons_egvscg.txt', 'a') as f:
        f.write(f'{i},{two_moons_n_of_samples[i]},{decomposition_time + eg_time},{cg_time}\n')
Error:
MemoryError: Unable to allocate 1.07 GiB for an array with shape (12000, 12000) and data type float64
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
~\AppData\Local\Temp\2/ipykernel_2344/941022539.py in <module>
11 'type': 's'
12 }
---> 13 LS = dense_laplacian(dist_mat, opt)
14
15 ## Eigenvalues and eigenvectors
C:/Users/\util\graph\dense_laplacian.py in dense_laplacian(dist_mat, opt)
69 D_inv_sqrt = 1.0 / np.sqrt(D)
70 D_inv_sqrt = np.diag(D_inv_sqrt)
---> 71 L = np.eye(W.shape[0]) - D_inv_sqrt @ W @ D_inv_sqrt
72 # L = 0.5 * (L + L.T)
73 if opt['type'] == 'rw':
MemoryError: Unable to allocate 1.07 GiB for an array with shape (12000, 12000) and data type float64
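For scale, the size in the error message follows directly from the array shape: a dense float64 matrix needs 8 bytes per entry. The sketch below does that arithmetic for the 12000 x 12000 array in the traceback and for the 20000 x 20000 matrices mentioned in the question. dense_matrix_gib is a hypothetical helper name introduced only for this illustration.

```python
def dense_matrix_gib(n_rows, n_cols, bytes_per_entry=8):
    """Memory needed by a dense matrix, in GiB (float64 uses 8 bytes per entry)."""
    return n_rows * n_cols * bytes_per_entry / 2**30

size_in_error = dense_matrix_gib(12000, 12000)
size_20000 = dense_matrix_gib(20000, 20000)
print(f"{size_in_error:.2f} GiB")
print(f"{size_20000:.2f} GiB")
```

Several matrices of this size (dist_mat, W, the np.diag result, the np.eye result, LS, E) may be alive at the same time inside the loop body, so the peak usage can be a multiple of a single matrix.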

I faced the same problem. The way I solved it was:
Writing functions wherever preprocessing is required, and returning only the preprocessed variables.
Deleting huge variables once they are no longer needed, using del x.
Clearing garbage:
import gc
gc.collect()
Sometimes clearing garbage doesn't help, so I also asked the C allocator to release freed memory (this works with glibc on Linux only):
import ctypes
libc = ctypes.CDLL("libc.so.6") # clearing cache
libc.malloc_trim(0)
Batching my code as far as possible.
I think the best solution for you would be to batch the matrix multiplication. Libraries like TensorFlow and PyTorch do it by default; I am not sure about NumPy. See https://www.tensorflow.org/api_docs/python/tf/linalg/matmul (an API for matrix multiplication in batches). Most modern-day GPU calculations are possible thanks to batching!
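The first three points can be illustrated without any large arrays. The sketch below is a minimal, standard-library-only demonstration: BigBuffer is a hypothetical stand-in for a large matrix, and a weak reference is used to check whether the object has actually been released. It shows that an intermediate created inside a function is freed when the function returns, and that del followed by gc.collect() releases an object once no other references to it remain.

```python
import gc
import weakref

class BigBuffer:
    """Hypothetical stand-in for a large intermediate such as a dense matrix."""
    def __init__(self, n):
        self.data = bytearray(n)

def preprocess(n):
    """Build a large intermediate, return only a small result plus a weak reference."""
    big = BigBuffer(n)
    probe = weakref.ref(big)      # does not keep `big` alive
    summary = len(big.data)       # the only value the caller needs
    return summary, probe         # `big` goes out of scope here

summary, probe = preprocess(10_000_000)
gc.collect()
freed_after_return = probe() is None

big = BigBuffer(10_000_000)
probe2 = weakref.ref(big)
alive_before_del = probe2() is not None
del big                           # drop the last reference
gc.collect()                      # also collects reference cycles, if any
freed_after_del = probe2() is None

print(freed_after_return, alive_before_del, freed_after_del)
```

In a notebook, names such as _ and Out[...] can hold extra references to displayed results, so treat this as a sketch of the idea rather than a guarantee that memory is returned.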

I would suggest adding more swap space, which is really easy and will probably save you more time and headache than redesigning the code to be less wasteful or trying to delete and garbage-collect unnecessary objects. It will of course be slower than using RAM, since it uses the disk to simulate the extra memory needed.
There is an excellent answer on how to do this on Ubuntu.

Avoid memory re-allocation in tensorflow while_loop

In every step of the while_loop, I want to update a 0.5 GB variable. I cannot avoid the loop because each iteration depends on the previous one. My program needs to run the while loop 100 million times.
To test the performance of tf.while_loop in this scenario, I wrote a test. The update here simply adds a constant to the variable.
However, even this simple loop takes 24 seconds and requires 4 times 1 GB of memory. I suspect the loop is constantly reallocating 1 GB chunks of memory, which is horribly slow on a GPU. The GPU has 4 GB of memory; when I set the variable to 2 GB, I get an OOM error.
Is it possible to avoid the re-allocation?
I can use x as a loop variable instead of using tf.control_dependencies, but that uses a bit more memory.
tf.contrib.compiler.jit.experimental_jit_scope leads to OOM.
Thanks.
Test:
import tensorflow as tf
import numpy as np
from functools import partial
from timeit import default_timer as timer
def body1(x, i):
    a = tf.assign(x, x + 0.001)
    with tf.control_dependencies([a]):
        return i + 1
def make_loop1(x, end_ix):
    i = tf.Variable(0, name="i", dtype=np.int32)
    cond = lambda i2: tf.less(i2, end_ix)
    body = partial(body1, x)
    return tf.while_loop(
        cond, body, [i], back_prop=False,
        parallel_iterations=1)
def main():
    N = int(1e9 / 4)
    x = tf.get_variable('x', shape=N, dtype=np.float32,
                        initializer=tf.ones_initializer)
    end_ix = tf.constant(int(1000), dtype=np.int32)
    loop1 = make_loop1(x, end_ix)
    init_op = tf.global_variables_initializer()
    with tf.Session() as sess:
        sess.run(init_op)
        print("running_loop1")
        st = timer()
        sess.run(loop1)
        en = timer()
        print(en - st)  # elapsed seconds
        print(sess.run(x[0]))
main()

How to merge images as transparent layers?

I am working on a video editor for the Raspberry Pi, and I have a problem with the speed of placing one image over another. Currently, using ImageMagick, it takes up to 10 seconds just to place one image over another with 1080x1920 PNG images on the Raspberry Pi, and that's way too much. The time also goes up with the number of images. Any ideas on how to speed it up?
ImageMagick code:
composite -blend 90 img1.png img2.png new.png
My video editor, with its still-slow opacity support, is here
--------EDIT--------
slightly faster way:
import numpy as np
from PIL import Image
size_X, size_Y = 1920, 1080  # set to the images' resolution, else the output may look weird
transparency = 0.9  # assumed value, chosen to match -blend 90 above
image1 = np.resize(np.asarray(Image.open('img1.png').convert('RGB')), (size_X, size_Y, 3))
image2 = np.resize(np.asarray(Image.open('img2.png').convert('RGB')), (size_X, size_Y, 3))
output = image1*transparency+image2*(1-transparency)
Image.fromarray(np.uint8(output)).save('output.png')

My Raspberry Pi is unavailable at the moment - all I am saying is that there was some smoke involved and I do software, not hardware! As a result, I have only tested this on a Mac. It uses Numba.
First I used your NumPy code on two test images (images not available here).
Then I implemented the same thing using Numba. The Numba version runs 5.5x faster on my iMac. As the Raspberry Pi has 4 cores, you could try experimenting with:
@jit(nopython=True,parallel=True)
def method2(image1,image2,transparency):
    ...
Here is the code:
#!/usr/bin/env python3
import numpy as np
from PIL import Image
import numba
from numba import jit
def method1(image1,image2,transparency):
    result = image1*transparency+image2*(1-transparency)
    return result
@jit(nopython=True)
def method2(image1,image2,transparency):
    h, w, c = image1.shape
    for y in range(h):
        for x in range(w):
            for z in range(c):
                image1[y][x][z] = image1[y][x][z] * transparency + (image2[y][x][z]*(1-transparency))
    return image1
i1 = np.array(Image.open('image1.jpg').convert('RGB'))
i2 = np.array(Image.open('image2.jpg').convert('RGB'))
res = method1(i1,i2,0.4)
res = method2(i1,i2,0.4)
Image.fromarray(np.uint8(res)).save('result.png')
The result is: (image not available here)
Other thoughts... I did the composite in-place, overwriting the input image1 to try and save cache space. That may help or hinder - please experiment. I may not have processed the pixels in the optimal order - please experiment.
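Both methods compute the same per-channel weighted average, image1 * transparency + image2 * (1 - transparency). As a small worked example of that formula, here is a pure-Python sketch on a single pair of RGB pixels. blend_pixel is an illustrative name, and the rounding to the nearest integer and clamping to 0..255 are my assumptions about how the result should be stored as 8-bit values. Pure Python like this is far too slow for whole frames; it is only meant to make the arithmetic concrete.

```python
def blend_pixel(p1, p2, transparency):
    """Weighted average of two RGB pixels: p1 * t + p2 * (1 - t), per channel."""
    blended = []
    for c1, c2 in zip(p1, p2):
        value = c1 * transparency + c2 * (1 - transparency)
        # Assumption: round to nearest and clamp so the result fits in 8 bits
        blended.append(max(0, min(255, int(round(value)))))
    return tuple(blended)

pixel_a = (200, 100, 0)
pixel_b = (100, 100, 255)

quarter = blend_pixel(pixel_a, pixel_b, 0.25)   # weighted mostly towards pixel_b
only_a = blend_pixel(pixel_a, pixel_b, 1.0)     # transparency 1.0 keeps image1
only_b = blend_pixel(pixel_a, pixel_b, 0.0)     # transparency 0.0 keeps image2
print(quarter, only_a, only_b)
```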

Just as another option, I tried in pyvips (full disclosure: I'm the pyvips maintainer, so I'm not very neutral):
#!/usr/bin/python3
import sys
import time
import pyvips
start = time.time()
a = pyvips.Image.new_from_file(sys.argv[1], access="sequential")
b = pyvips.Image.new_from_file(sys.argv[2], access="sequential")
out = a * 0.2 + b * 0.8
out.write_to_file(sys.argv[3])
print("pyvips took {} milliseconds".format(1000 * (time.time() - start)))
pyvips is a "pipeline" image processing library, so that code will execute the load, processing and save all in parallel.
On this two core, four thread i5 laptop using Mark's two test images I see:
$ ./overlay-vips.py blobs.jpg ships.jpg x.jpg
took 39.156198501586914 milliseconds
So 39ms for two jpg loads, processing and one jpg save.
You can time just the blend part by copying the source images and the result to memory, like this:
a = pyvips.Image.new_from_file(sys.argv[1]).copy_memory()
b = pyvips.Image.new_from_file(sys.argv[2]).copy_memory()
start = time.time()
out = (a * 0.2 + b * 0.8).copy_memory()
print("pyvips between memory buffers took {} milliseconds"
      .format(1000 * (time.time() - start)))
I see:
$ ./overlay-vips.py blobs.jpg ships.jpg x.jpg
pyvips between memory buffers took 15.432596206665039 milliseconds
numpy is about 60ms on this same test.
I tried a slight variant of Mark's nice numba example:
#!/usr/bin/python3
import sys
import time
import numpy as np
from PIL import Image
import numba
from numba import jit, prange
@jit(nopython=True, parallel=True)
def method2(image1, image2, transparency):
    h, w, c = image1.shape
    for y in prange(h):
        for x in range(w):
            for z in range(c):
                image1[y][x][z] = image1[y][x][z] * transparency \
                    + (image2[y][x][z] * (1 - transparency))
    return image1
# run once to force a compile
i1 = np.array(Image.open(sys.argv[1]).convert('RGB'))
i2 = np.array(Image.open(sys.argv[2]).convert('RGB'))
res = method2(i1, i2, 0.2)
# run again and time it
i1 = np.array(Image.open(sys.argv[1]).convert('RGB'))
i2 = np.array(Image.open(sys.argv[2]).convert('RGB'))
start = time.time()
res = method2(i1, i2, 0.2)
print("numba took {} milliseconds".format(1000 * (time.time() - start)))
Image.fromarray(np.uint8(res)).save(sys.argv[3])
And I see:
$ ./overlay-numba.py blobs.jpg ships.jpg x.jpg
numba took 8.110523223876953 milliseconds
So on this laptop, numba is about 2x faster than pyvips.
If you time load and save as well, it's quite a bit slower:
$ ./overlay-numba.py blobs.jpg ships.jpg x.jpg
numba plus load and save took 272.8157043457031 milliseconds
But that seems unfair, since almost all that time is in PIL load and save.
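The "run once to force a compile" step applies to any JIT-compiled or lazily initialised function: call it once untimed, then time later calls. A minimal sketch of such a helper using only the standard library follows. time_after_warmup and work are hypothetical names, and time.perf_counter is used instead of time.time because it is intended for measuring short intervals.

```python
import time

def time_after_warmup(func, *args, warmup=1, repeats=5):
    """Call func untimed `warmup` times, then return the best of `repeats` timed calls in ms."""
    for _ in range(warmup):
        func(*args)               # absorbs one-off costs such as JIT compilation
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return 1000 * best

calls = []

def work(n):
    """Stand-in workload; records each call so the call count can be checked."""
    calls.append(n)
    return sum(i * i for i in range(n))

elapsed_ms = time_after_warmup(work, 10_000, warmup=1, repeats=5)
print("best of 5 took {:.3f} milliseconds".format(elapsed_ms))
```

Note that method2 above overwrites image1 in place, so with that function the inputs would need to be reloaded between repeats, as the script above does between its two calls.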

python multiprocessing module: strange behaviour and processor load when using Pool

I'm using Python's multiprocessing lib to speed up some code (least squares fitting with scipy).
It works fine on 3 different machines, but it shows a strange behaviour on a 4th machine.
The code:
import numpy as np
from scipy.optimize import least_squares
import time
import parmap
from multiprocessing import Pool
p0 = [1., 1., 0.5]
def f(p, xx):
    return p[0]*np.exp(-xx ** 2 / p[1] ** 2) + p[2]
def errorfunc(p, xx, yy):
    return f(p, xx) - yy
def do_fit(yy, xx):
    return least_squares(errorfunc, p0[:], args=(xx, yy))
if __name__ == '__main__':
    # create data
    x = np.linspace(-10, 10, 1000)
    y = []
    np.random.seed(42)
    for i in range(1000):
        y.append(f([np.random.rand(1) * 10, np.random.rand(1), 0.], x) + np.random.rand(len(x)))
    # fit without multiprocessing
    t1 = time.time()
    for y_data in y:
        p1 = least_squares(errorfunc, p0[:], args=(x, y_data))
    t2 = time.time()
    print t2 - t1
    # fit with multiprocessing lib
    times = []
    for p in range(1,13):
        my_pool = Pool(p)
        t3 = time.time()
        results = parmap.map(do_fit, y, x, pool=my_pool)
        t4 = time.time()
        times.append(t4-t3)
        my_pool.close()
    print times
For the 3 machines where it works, it speeds up roughly in the expected way. E.g. on my i7 laptop it gives:
[4.92650294303894, 2.5883090496063232, 1.7945551872253418, 1.629533052444458,
1.4896039962768555, 1.3550388813018799, 1.1796400547027588, 1.1852478981018066,
1.1404039859771729, 1.2239141464233398, 1.1676840782165527, 1.1416618824005127]
I'm running Ubuntu 14.10, Python 2.7.6, numpy 1.11.0 and scipy 0.17.0.
I tested it on another Ubuntu machine, a Dell PowerEdge R210 with similar results and on a MacBook Pro Retina (here with Python 2.7.11, and same numpy and scipy versions).
The computer that causes issues is a PowerEdge R710 (two hexcores) running Ubuntu 15.10, Python 2.7.11 and same numpy and scipy version as above.
However, I don't observe any speedup. Times are around 6 seconds, no matter what poolsize I use. In fact, it is slightly better for a poolsize of 2 and gets worse for more processes.
htop shows that somehow more processes get spawned than I would expect.
E.g. on my laptop htop shows one entry per process (which matches the poolsize) and eventually each process shows 100% CPU load.
On the PowerEdge R710 I see about 8 python processes for a poolsize of 1 and about 20 processes for a poolsize of 2 etc. each of which shows 100% CPU load.
I checked BIOS settings of the R710 and I couldn't find anything unusual.
What should I look for?
EDIT:
In response to the comment, I tried another simple script. Surprisingly, this one seems to 'work' on all machines:
from multiprocessing import Pool
import time
import math
import numpy as np
def f_np(x):
    return x**np.sin(x)+np.fabs(np.cos(x))**np.arctan(x)
def f(x):
    return x**math.sin(x)+math.fabs(math.cos(x))**math.atan(x)
if __name__ == '__main__':
    print "#pool", ", numpy", ", pure python"
    for p in range(1,9):
        pool = Pool(processes=p)
        np.random.seed(42)
        a = np.random.rand(1000,1000)
        t1 = time.time()
        for i in range(5):
            pool.map(f_np, a)
        t2 = time.time()
        for i in range(5):
            pool.map(f, range(1000000))
        print p, t2-t1, time.time()-t2
        pool.close()
gives:
#pool , numpy , pure python
1 1.34186911583 5.87641906738
2 0.697530984879 3.16030216217
3 0.470160961151 2.20742988586
4 0.35701417923 1.73128080368
5 0.308979988098 1.47339701653
6 0.286448001862 1.37223601341
7 0.274246931076 1.27663207054
8 0.245123147964 1.24748778343
on the machine that caused the trouble. There are no more threads (or processes?) spawned than I would expect.
It looks like numpy is not the problem, but as soon as I use scipy.optimize.least_squares the issue arises.
Inspecting the processes from htop shows a lot of sched_yield() calls, which I don't see when I don't use scipy.optimize.least_squares, and which I also don't see on my laptop even when using least_squares.

There is a reported issue when OpenBLAS is used together with joblib.
Similar issues occur when MKL is used.
The solution given for that case also worked for me:
Adding
import os
os.environ['MKL_NUM_THREADS'] = '1'
at the beginning of my Python script solves the issue.
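The variable has to be set before NumPy and SciPy are imported, since the threading library typically reads it when it is loaded. A slightly more general sketch is shown below. It also sets OPENBLAS_NUM_THREADS and OMP_NUM_THREADS, which I am assuming are the relevant variables for OpenBLAS and OpenMP-based builds respectively. Which one matters depends on the BLAS your NumPy is linked against. limit_blas_threads is a hypothetical helper name.

```python
import os

def limit_blas_threads(n_threads=1):
    """Ask common BLAS/OpenMP backends to use n_threads threads per process.

    Must run before numpy/scipy are imported to take effect reliably.
    """
    for name in ("MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS", "OMP_NUM_THREADS"):
        os.environ[name] = str(n_threads)

limit_blas_threads(1)

# Only now import the numerical libraries and create the Pool, e.g.:
#   import numpy as np
#   from multiprocessing import Pool

# With one BLAS thread per worker, a pool of p workers should keep roughly p cores busy
pool_size = 12
threads_per_worker = int(os.environ["MKL_NUM_THREADS"])
print(pool_size * threads_per_worker)
```

The idea is that each worker process otherwise starts its own set of BLAS threads, so the total thread count can exceed the number of cores. That would be consistent with htop showing more busy entries than the pool size.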

Implementation from MATLAB to Python using numpy and cv2

I am in the process of porting a MATLAB script to Python. I have the following:
% I = im2double(imread('images\image.tif'));
% IC(:,:,1) = imresize(squeeze(I(:,:,1)),[N1 N2]);
% IC(:,:,2) = imresize(squeeze(I(:,:,2)),[N1 N2]);
% IC(:,:,3) = imresize(squeeze(I(:,:,3)),[N1 N2]);
and
[xi yi imv1] = find(squeeze(imagee(:,:,1))+0.1);
imv1 = imv1 - 0.1;
wd1 = (imv1*ones(1,length(imv1)) - ones(length(imv1),1)*imv1').^2;
I understand I can load the image with OpenCV, i.e.
I = cv2.imread('image', 1)
and that I can use np.nonzero for MATLAB's find,
index = (imagee + 0.1).nonzero()
np.outer for the wd1 calculation:
timv1 = np.transpose(imv1)
wd1 = np.abs(np.outer(imv1, holder1) - np.outer(holder1, timv1))
as well as np.squeeze for MATLAB's squeeze function.
However, how can I write these functions in a compact form to get maximum efficiency and speed once the Python script is finalized?
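For the wd1 line specifically, the MATLAB expression (imv1*ones(1,n) - ones(n,1)*imv1').^2 builds the matrix of squared pairwise differences (imv1(i) - imv1(j))^2. In NumPy this can be written with broadcasting, without np.outer or an explicit ones vector. A minimal sketch, assuming imv1 is a 1-D array of the selected pixel values (the sample values below are made up for illustration). Note that the MATLAB code squares the differences, whereas the np.abs version above takes absolute values.

```python
import numpy as np

# Made-up sample values standing in for the pixel values returned by find
imv1 = np.array([0.0, 0.5, 2.0])

# Broadcasting: (n, 1) column minus (1, n) row gives the (n, n) difference matrix
wd1 = (imv1[:, None] - imv1[None, :]) ** 2

# Same result written the MATLAB way, with explicit outer products
ones = np.ones_like(imv1)
wd1_outer = (np.outer(imv1, ones) - np.outer(ones, imv1)) ** 2

print(wd1)
```

One difference to keep in mind: MATLAB's find traverses the array column by column, while np.nonzero returns indices in row-major order, so the ordering of imv1 (and hence of the rows and columns of wd1) may differ between the two versions.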
