How to accelerate numpy array masking?

I am profiling the performance of a piece of Python code with a line profiler.
In the code, I have a numpy array tt of shape (106906,) and dtype=int64. With the help of the profiler, I find that the second line below, mask[tt] = True, is quite slow. Is there any way to accelerate it? I am on Python 3, if that matters.
mask = np.zeros(100000, dtype='bool')
mask[tt] = True

You can use Numba, as @orlevii has suggested:
from numba import njit
import numpy as np

@njit
def f(mask, tt):
    mask[tt] = True

# Test:
mask = np.zeros(1000000, dtype='bool')
tt = np.random.randint(0, 1000000, 106906)
f(mask, tt)
A simple %%timeit check suggests that you should expect roughly 3 times faster execution.
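If you want to verify the speed-up yourself, here is a minimal sketch (not part of the original answer; note that Numba compiles f on its first call, so warm it up before timing):
import time

f(mask, tt)                      # warm-up call triggers JIT compilation

start = time.perf_counter()
f(mask, tt)                      # compiled Numba version
print('numba:', time.perf_counter() - start)

start = time.perf_counter()
mask[tt] = True                  # plain NumPy fancy-index assignment
print('numpy:', time.perf_counter() - start)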
Further speed-up can be achieved by utilizing the GPU. An example of how to do it with PyTorch:
import torch
mask = torch.zeros(1000000).type(torch.cuda.FloatTensor)
tt = torch.randint(0,1000000,torch.Size([106906])).type(torch.cuda.LongTensor)
mask[tt] = True
Note that here we use a torch.Tensor object, PyTorch's equivalent of numpy.ndarray. The code will only run if you have an NVIDIA GPU with CUDA. Expect roughly a 30x speed-up over the original code on a Tesla V100-SXM2.
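As a side note (not part of the original answer), recent PyTorch versions let you allocate a boolean mask directly on the GPU, which matches the NumPy dtype more closely; a minimal sketch, assuming a CUDA device is available:
import torch

# boolean mask allocated on the GPU instead of a FloatTensor
mask = torch.zeros(1000000, dtype=torch.bool, device='cuda')
tt = torch.randint(0, 1000000, (106906,), device='cuda')
mask[tt] = True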

Related

Is there any simple method to parallelize np.einsum?

I would like to know: is there any simple method to parallelize einsum in NumPy?
I found some discussions:
Numpy np.einsum array multiplication using multiple cores
Any chance of making this faster? (numpy.einsum)
numpy.tensordot() only handles binary contractions over given axes, and Numba requires spelling out the loops explicitly. Is there any simple and robust approach to parallelizing einsum (possibly including opt-einsum, tf-einsum, etc.) with arbitrary contractions?
Sample code follows (if necessary I can use a more complicated contraction as the example):
import numpy as np
import timeit
import time

na = nc = 1000
nb = 1000
n_iter = 10

A = np.random.random((na, nb))
B = np.random.random((nb, nc))

t_total = 0.
for i in range(n_iter):
    start = time.time()
    C = np.einsum('ij,jk->ik', A, B)
    end = time.time()
    t_total += end - start

print('AB->C', t_total / n_iter)
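Not part of the original question, but worth noting for this particular contraction: np.einsum can dispatch to the (multi-threaded) BLAS matmul when called with optimize=True, and the opt-einsum package mentioned above applies the same path optimization to arbitrary contractions. A minimal sketch, assuming opt-einsum is installed:
import numpy as np
import opt_einsum as oe

A = np.random.random((1000, 1000))
B = np.random.random((1000, 1000))

# optimize=True lets np.einsum rewrite the contraction into BLAS-backed matmuls
C1 = np.einsum('ij,jk->ik', A, B, optimize=True)

# opt_einsum performs the same kind of path optimization for arbitrary contractions
C2 = oe.contract('ij,jk->ik', A, B)

print(np.allclose(C1, C2))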

How to run NumPy code on the GPU with Numba

I've developed several pieces of code that are heavily matrix-based; they run quite well, but since I spent some extra money on a GPU, I'd like to take advantage of it... ;-) I've tried several configurations from the Numba manual, but clearly something is missing. Why do I receive an error when I try to execute some functions on CUDA (NVIDIA GTX 1050)?
Whether with Numba or another module, how can I execute a portion of code (which requires strong parallelization) on the GPU?
from numba import types
from numba.extending import intrinsic
from numba import jit, cuda
from numba import vectorize
import numpy as np

@jit
def calculate_portfolio_return(returns, weights):
    portfolio_return = np.sum(returns.mean() * weights) * 252
    print("Expected Portfolio Return:", portfolio_return)

@jit
def calculate_portfolio_risk(returns, weights):
    portfolio_variance = np.sqrt(np.dot(weights.T, np.dot(returns.cov() * 252, weights)))
    print("Expected Risk:", portfolio_variance)

@jit
def generate_portfolios(weights, returns):
    preturns = []
    pvariances = []
    for i in range(10000):
        # weights = np.random.random(len(stocks_portfolio))
        weights = np.random.random(data.shape[1])
        weights /= np.sum(weights)
        preturns.append(np.sum(returns.mean() * weights) * 252)
        pvariances.append(np.sqrt(np.dot(weights.T, np.dot(returns.cov() * 252, weights))))
    preturns = np.array(preturns)
    pvariances = np.array(pvariances)
    return preturns, pvariances
......
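No answer is attached to this excerpt, but as a hedged illustration of the general idea (the scale_kernel example below is illustrative, not the questioner's code): with numba.cuda you write an explicit kernel and launch it over a grid of threads, assuming a CUDA-capable NVIDIA GPU with a matching driver and toolkit. Note also that @jit cannot compile pandas methods such as returns.mean() or returns.cov(), which may be the source of the errors mentioned.
import numpy as np
from numba import cuda

@cuda.jit
def scale_kernel(out, x, factor):
    # one thread per element; guard against threads past the end of the array
    i = cuda.grid(1)
    if i < x.size:
        out[i] = x[i] * factor

x = np.random.random(1_000_000)
out = np.zeros_like(x)

threadsperblock = 256
blockspergrid = (x.size + threadsperblock - 1) // threadsperblock
scale_kernel[blockspergrid, threadsperblock](out, x, 252.0)

print(out[:3], x[:3] * 252.0)   # the two should match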

Turn a tf.data.Dataset to a jax.numpy iterator

I am interested in training a neural network using JAX. I had a look at tf.data.Dataset, but it provides exclusively tf tensors. I looked for a way to convert the dataset into JAX numpy arrays and found a lot of implementations that use Dataset.as_numpy_generator() to turn the tf tensors into numpy arrays. However, I wonder whether this is good practice, as numpy arrays are stored in CPU memory, which is not what I want for my training (I use the GPU). So the last idea I found is to manually recast the arrays by calling jnp.array, but that is not really elegant (I am worried about the copy into GPU memory). Does anyone have a better idea?
Quick code to illustrate:
import os
import jax.numpy as jnp
import tensorflow as tf

def generator():
    for _ in range(2):
        yield tf.random.uniform((1,))

ds = tf.data.Dataset.from_generator(generator, output_types=tf.float32,
                                    output_shapes=tf.TensorShape([1]))
ds1 = ds.take(1).as_numpy_iterator()
ds2 = ds.skip(1)

for i, batch in enumerate(ds1):
    print(type(batch))
for i, batch in enumerate(ds2):
    print(type(jnp.array(batch)))

# returns:
# <class 'numpy.ndarray'>                     # not good
# <class 'jaxlib.xla_extension.DeviceArray'>  # good but not elegant
Both tensorflow and JAX have the ability to convert arrays to dlpack tensors without copying memory, so one way you can create a JAX array from a tensorflow array without copying the underlying data buffer is to do it via dlpack:
import numpy as np
import tensorflow as tf
import jax.dlpack
tf_arr = tf.random.uniform((10,))
dl_arr = tf.experimental.dlpack.to_dlpack(tf_arr)
jax_arr = jax.dlpack.from_dlpack(dl_arr)
np.testing.assert_array_equal(tf_arr, jax_arr)
By doing a round-trip (JAX to TensorFlow and back to JAX), you can compare unsafe_buffer_pointer() to confirm that the arrays point at the same buffer, rather than copying the buffer along the way:
import jax.numpy as jnp

def tf_to_jax(arr):
    return jax.dlpack.from_dlpack(tf.experimental.dlpack.to_dlpack(arr))

def jax_to_tf(arr):
    return tf.experimental.dlpack.from_dlpack(jax.dlpack.to_dlpack(arr))

jax_arr = jnp.arange(20.)
tf_arr = jax_to_tf(jax_arr)
jax_arr2 = tf_to_jax(tf_arr)

print(jnp.all(jax_arr == jax_arr2))
# True
print(jax_arr.unsafe_buffer_pointer() == jax_arr2.unsafe_buffer_pointer())
# True
From the Flax ImageNet example:
https://github.com/google/flax/blob/6ae22681ef6f6c004140c3759e7175533bda55bd/examples/imagenet/train.py#L183
import jax
from flax import jax_utils

def prepare_tf_data(xs):
    local_device_count = jax.local_device_count()

    def _prepare(x):
        x = x._numpy()
        return x.reshape((local_device_count, -1) + x.shape[1:])

    return jax.tree_util.tree_map(_prepare, xs)

it = map(prepare_tf_data, ds)
it = jax_utils.prefetch_to_device(it, 2)

Cuda Parallelized Kernel Shared Counter Variable

Is there a way to have an integer counter variable that can be incremented/decremented across all threads in a parallelized CUDA kernel? The code below outputs "[1]", since the modifications to the counter array made by one thread are not visible to the others.
import numpy as np
from numba import cuda

@cuda.jit('void(int32[:])')
def func(counter):
    counter[0] = counter[0] + 1

counter = cuda.to_device(np.zeros(1, dtype=np.int32))
threadsperblock = 64
blockspergrid = 18
func[blockspergrid, threadsperblock](counter)
print(counter.copy_to_host())
One approach would be to use numba cuda atomics:
$ cat t18.py
import numpy as np
from numba import cuda

@cuda.jit('void(int32[:])')
def func(counter):
    cuda.atomic.add(counter, 0, 1)

counter = cuda.to_device(np.zeros(1, dtype=np.int32))
threadsperblock = 64
blockspergrid = 18
print(blockspergrid * threadsperblock)
func[blockspergrid, threadsperblock](counter)
print(counter.copy_to_host())
$ python t18.py
1152
[1152]
$
An atomic operation performs an indivisible read-modify-write operation on the target, so threads do not interfere with each other when they update the target variable.
Certainly other methods are possible, depending on your actual needs, such as a classical parallel reduction. numba provides some reduction sugar also.
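As an illustration of that reduction sugar (a minimal sketch, not part of the original answer), numba.cuda.reduce turns a binary combining function into a GPU reduction:
import numpy as np
from numba import cuda

@cuda.reduce
def sum_reduce(a, b):
    # binary combiner; numba builds the parallel reduction around it
    return a + b

arr = np.ones(1152, dtype=np.float64)
print(sum_reduce(arr))   # 1152.0, computed on the GPU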

Fastest 2D convolution or image filter in Python

Several users have asked about the speed or memory consumption of image convolutions in numpy or scipy [1, 2, 3, 4]. From the responses and my experience using Numpy, I believe this may be a major shortcoming of numpy compared to Matlab or IDL.
None of the answers so far have addressed the overall question, so here it is: "What is the fastest method for computing a 2D convolution in Python?" Common python modules are fair game: numpy, scipy, and PIL (others?). For the sake of a challenging comparison, I'd like to propose the following rules:
Input matrices are 2048x2048 and 32x32, respectively.
Single or double precision floating point are both acceptable.
Time spent converting your input matrix to the appropriate format doesn't count -- just the convolution step.
Replacing the input matrix with your output is acceptable (does any python library support that?)
Direct DLL calls to common C libraries are alright -- lapack or scalapack
PyCUDA is right out. It's not fair to use your custom GPU hardware.
It really depends on what you want to do... A lot of the time you don't need a fully generic (read: slower) 2D convolution; if the filter is separable, you can use two 1D convolutions instead. This is why the various scipy.ndimage.gaussian_filter and scipy.ndimage.uniform_filter functions are much faster than the same thing implemented as a generic n-D convolution.
At any rate, as a point of comparison:
import timeit

t = timeit.timeit(stmt='ndimage.convolve(x, y, output=x)', number=1,
                  setup="""
import numpy as np
from scipy import ndimage
x = np.random.random((2048, 2048)).astype(np.float32)
y = np.random.random((32, 32)).astype(np.float32)
""")
print(t)
This takes 6.9 sec on my machine...
Compare this with fftconvolve:
t = timeit.timeit(stmt="signal.fftconvolve(x, y, mode='same')", number=1,
                  setup="""
import numpy as np
from scipy import signal
x = np.random.random((2048, 2048)).astype(np.float32)
y = np.random.random((32, 32)).astype(np.float32)
""")
print(t)
This takes about 10.8 secs. However, with different input sizes, using FFTs to do a convolution can be considerably faster (though I can't seem to come up with a good example at the moment...).
On my machine, a hand-crafted circular convolution using FFTs seems to be fastest:
import numpy
x = numpy.random.random((2048, 2048)).astype(numpy.float32)
y = numpy.random.random((32, 32)).astype(numpy.float32)
z = numpy.fft.irfft2(numpy.fft.rfft2(x) * numpy.fft.rfft2(y, x.shape))
Note that this might treat the areas close to the edges differently than other ways, because it's a circular convolution.
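If the wrap-around at the edges matters, one option (not from the original answer) is to zero-pad both FFTs up to the full linear-convolution size; a minimal sketch:
import numpy as np

x = np.random.random((2048, 2048)).astype(np.float32)
y = np.random.random((32, 32)).astype(np.float32)

# pad both transforms to the 'full' convolution shape to avoid circular wrap-around
full_shape = (x.shape[0] + y.shape[0] - 1, x.shape[1] + y.shape[1] - 1)
z_full = np.fft.irfft2(np.fft.rfft2(x, full_shape) * np.fft.rfft2(y, full_shape), full_shape)
# z_full should match scipy.signal.fftconvolve(x, y, mode='full') up to floating-point error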
I did some experiments with this too. My guess is that the SciPy convolution does not use the BLAS library to accelerate the computation. Using BLAS, I was able to code a 2D convolution that was comparable in speed to MATLAB's. It's more work, but your best bet is to recode the convolution in C++.
Here is the tight part of the loop (please forgive the weird ()-based array referencing; it is my convenience class for MATLAB arrays). The key part is that you don't iterate over the image: you iterate over the filter and let BLAS iterate over the image, because typically the image is much larger than the filter.
// For each filter tap, accumulate filt_val * image into the output with a single
// BLAS daxpy call, so BLAS runs the inner loop over the (much larger) image.
for (int n = 0; n < filt.numCols; n++)
{
    for (int m = 0; m < filt.numRows; m++)
    {
        const double filt_val = filt(filt.numRows - 1 - m, filt.numCols - 1 - n);
        for (int i = 0; i < diffN; i++)
        {
            double *out_ptr = &outImage(0, i);
            const double *im_ptr = &image(m, i + n);
            cblas_daxpy(diffM, filt_val, im_ptr, 1, out_ptr, 1);
        }
    }
}
I have been trying to improve the convolution speed in my application, and I have been using signal.correlate, which happens to be about 20 times slower than signal.correlate2d since my input matrices are smaller (27x27 and 5x5). As of 2018, this is what I observed on my machine (Dell Inspiron 13, Core i5) for the matrices specified in the actual question.
OpenCV did the best, but the caveat is that it doesn't give "mode" options: input and output are of the same size.
>>> img= np.random.rand(2048,2048)
>>> kernel = np.ones((32,32), dtype=np.float64)
>>> t1= time.time();dst1 = cv2.filter2D(img,-1,kernel);print(time.time()-t1)
0.208490133286
>>> t1= time.time();dst2 = signal.correlate(img,kernel,mode='valid',method='fft');print(time.time()-t1)
0.582989931107
>>> t1= time.time();dst3 = signal.convolve2d(img,kernel,mode='valid');print(time.time()-t1)
11.2672450542
>>> t1= time.time();dst4 = signal.correlate2d(img,kernel,mode='valid');print(time.time()-t1)
11.2443971634
>>> t1= time.time();dst5 = signal.fftconvolve(img,kernel,mode='valid');print(time.time()-t1)
0.581533193588
SciPy has a function fftconvolve that can be used for 1D and 2D signals.
from scipy import signal
from scipy import misc
import numpy as np
import matplotlib.pyplot as plt
face = misc.face(gray=True)
kernel = np.outer(signal.gaussian(70, 8), signal.gaussian(70, 8))
blurred = signal.fftconvolve(face, kernel, mode='same')
fig, (ax_orig, ax_kernel, ax_blurred) = plt.subplots(3, 1, figsize=(6, 15))
ax_orig.imshow(face, cmap='gray')
ax_orig.set_title('Original')
ax_orig.set_axis_off()
ax_kernel.imshow(kernel, cmap='gray')
ax_kernel.set_title('Gaussian kernel')
ax_kernel.set_axis_off()
ax_blurred.imshow(blurred, cmap='gray')
ax_blurred.set_title('Blurred')
ax_blurred.set_axis_off()
fig.show()
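A hedged note, not from the original answer: in newer SciPy releases signal.gaussian has moved to scipy.signal.windows (and scipy.misc.face to scipy.datasets), so the kernel above may need to be built like this instead:
import numpy as np
from scipy.signal.windows import gaussian

kernel = np.outer(gaussian(70, 8), gaussian(70, 8))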
