Vectorizing code to make it faster - Python

I have a small piece of code that I need to vectorize to make it faster. I'm not very experienced with Python, and I suspect the for loop is inefficient.
Is there any way to reduce the time?
import numpy as np
import time
start = time.time()
N = 10000000 #9 seconds
#N = 100000000 #93 seconds
alpha = np.linspace(0.00000000000001, np.pi/2, N)
tmp = 2.47*np.sin(alpha)
for i in range(N):
    if abs(tmp[i]) > 1.0:
        tmp[i] = 1.0*np.sign(tmp[i])
beta = np.arcsin(tmp)
end = time.time()
print("Executed time: ",round(end-start,1),"Seconds")
I have read about some numpy functions, but I haven't found a solution for this.

Clip the array:
tmp = np.clip(2.47 * np.sin(alpha), -1.0, 1.0)
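For reference, here is how it slots into the original script (same setup as the question's code; a sketch):
import numpy as np
import time

start = time.time()
N = 10000000
alpha = np.linspace(0.00000000000001, np.pi/2, N)
# np.clip saturates values outside [-1.0, 1.0] in one vectorized pass,
# which is exactly what the sign-based loop did element by element
tmp = np.clip(2.47*np.sin(alpha), -1.0, 1.0)
beta = np.arcsin(tmp)
end = time.time()
print("Executed time: ", round(end-start, 1), "Seconds")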

Instead of using a loop with a condition, you can select the values to modify by computing a mask. Here is an example:
N = 10000000
alpha = np.linspace(0.00000000000001, np.pi/2, N)
tmp = 2.47*np.sin(alpha)
indices = np.abs(tmp) > 1.0
tmp[indices] = np.sign(tmp[indices])
beta = np.arcsin(tmp)
Results on my setup:
before: 5.66 s ± 30.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each),
after: 182 ms ± 877 µs per loop (mean ± std. dev. of 7 runs, 10 loops each).

Related

Opencv Python: Fastest way to multiply pixel value

I'm trying to change the pixel values of an image.
I have factors r, g and b which will be used to multiply the pixel values of this image.
import cv2
import numpy as np
from matplotlib import pyplot as plt
import time
im = cv2.imread("boat.jpg")
im = cv2.cvtColor(im, cv2.COLOR_BGR2RGB)
im = cv2.resize(im, (4096,4096))
r_factor = 1.10
g_factor = 0.90
b_factor = 1.15
start = time.time()
im[...,0] = cv2.multiply(im[...,0], r_factor)
im[...,1] = cv2.multiply(im[...,1], g_factor)
im[...,2] = cv2.multiply(im[...,2], b_factor)
end = time.time()
This process takes time on large images. Is there any other method to multiply the pixel values?
If I do this on my system, I get 568 ms:
import cv2
import numpy as np
# Known start image
im = np.full((4096,4096,3), [10,20,30], np.uint8)
In [49]: %%timeit
...: im[...,0] = cv2.multiply(im[...,0], r_factor)
...: im[...,1] = cv2.multiply(im[...,1], g_factor)
...: im[...,2] = cv2.multiply(im[...,2], b_factor)
...:
...:
568 ms ± 16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
If I do it like this, it takes 394 ms:
In [42]: %timeit res = cv2.multiply(im,(r_factor, g_factor,b_factor,0))
394 ms ± 12.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You may get faster results doing it in-place, i.e. by specifying dst=im in the call. If I specify the type of the result, it comes out 5x faster at 63 ms - there must be something SIMD going on under the covers:
%timeit _ = cv2.multiply(im,(r_factor, g_factor,b_factor,0), dst=im, dtype=1)
63 ms ± 79.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
If you are really keen on making it even faster, look at some answers tagged with [numba].
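For instance, a numba version could look like the sketch below (scale_channels is a hypothetical name, and this is untimed; the rounding and saturation are meant to mimic cv2.multiply's behavior on uint8):
import numpy as np
from numba import njit

@njit
def scale_channels(im, r, g, b):
    # Scale each channel in place, rounding and saturating to the uint8 range
    h, w = im.shape[0], im.shape[1]
    for y in range(h):
        for x in range(w):
            for c in range(3):
                f = r if c == 0 else (g if c == 1 else b)
                v = im[y, x, c] * f + 0.5
                im[y, x, c] = 255 if v > 255.0 else int(v)
    return im

im = np.full((4096, 4096, 3), [10, 20, 30], np.uint8)
scale_channels(im, 1.10, 0.90, 1.15)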

Single precision rfft

I'm looking for a single-precision rfft to accelerate computation; scipy.fftpack.rfft does this, but it returns a real array that packs the real and imaginary components into the same axis, requiring a post-processing step. I implemented the function below to obtain the standard complex array, but NumPy's rfft ends up being faster for 2D inputs (though slower for 1D). Memory is also a concern: float64 goes OOM.
Does scipy or another library have a single precision rfft implementation that returns the standard complex array? (else, can below be done faster?)
import numpy as np
from numpy.fft import rfft
from scipy.fftpack import rfft as srfft
def rfft_sp(x):  # assumes len(x) is even
    xf = np.zeros((len(x)//2 + 1, x.shape[1]), dtype='complex64')
    h = srfft(x, axis=0)
    xf[0] = h[0]
    xf[1:] = h[1::2]
    xf[:1].imag = 0
    xf[-1:].imag = 0
    xf[1:-1].imag = h[2::2]
    return xf
x = np.random.randn(500, 100000).astype('float32')
%timeit rfft_sp(x)
%timeit rfft(x, axis=0)
>>> 565 ms ± 15.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> 517 ms ± 22.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
On the machine I tested on, using scipy.fft.rfft and casting to complex64 is faster than your implementation:
import numpy as np
from numpy.fft import rfft
from scipy.fft import rfft as srfft
from scipy.fftpack import rfft as srfft2
def rfft_sp(x):  # assumes len(x) is even
    xf = np.zeros((len(x)//2 + 1, x.shape[1]), dtype='complex64')
    h = srfft2(x, axis=0)
    xf[0] = h[0]
    xf[1:] = h[1::2]
    xf[:1].imag = 0
    xf[-1:].imag = 0
    xf[1:-1].imag = h[2::2]
    return xf

def rfft_cast(x):
    h = srfft(x, axis=0)
    return h.astype('complex64')
x = np.random.randn(500, 100000).astype('float32')
%timeit rfft(x, axis=0)
%timeit rfft_sp(x)
%timeit rfft_cast(x)
produces:
1.81 s ± 144 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2.89 s ± 7.58 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2.24 s ± 9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
scipy.fft works with single precision.
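A quick check of the dtype behavior (a minimal sketch): with scipy.fft, a float32 input already yields a complex64 result, so the astype in rfft_cast is mostly a safeguard:
import numpy as np
from scipy.fft import rfft

x = np.random.randn(500, 1000).astype('float32')
print(rfft(x, axis=0).dtype)  # complex64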

Perform sum over different slice of each row for 2D array

I have a 2D array of numbers and would like to average over different indices in each row. Say I have
import numpy as np
data = np.arange(16).reshape(4, 4)
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]
[12 13 14 15]]
I have two lists, specifying the first (inclusive) and last (exclusive) index for each row:
start = [1, 0, 1, 2]
end = [2, 1, 3, 4]
Then I would like to achieve this:
result = []
for i in range(4):
    result.append(np.sum(data[i, start[i]:end[i]]))
which gives
[1, 4, 19, 29]
However, the arrays I use are a lot larger than in this example, so this method is too slow for me. Is there some smart way to avoid this loop?
My first idea was to flatten the array. Then, I guess, one would need to somehow make a list of slices and apply it in parallel on the array, which I don't know how to do.
Otherwise, I was thinking of using np.apply_along_axis but I think this only works for functions?
Let's run with your raveling idea. You can convert the indices of your array into raveled indices like this:
ind = np.stack((start, end), axis=0)
ind += np.arange(data.shape[0]) * data.shape[1]
ind = ind.ravel(order='F')
if ind[-1] == data.size:
    ind = ind[:-1]
Now you can ravel the original array and apply np.add.reduceat to the segments thus defined:
np.add.reduceat(data.ravel(), ind)[::2] / np.subtract(end, start)
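To see what the segments look like, here is the recipe traced on the sample data from the question (the even-numbered segments are the [start, end) slices; the odd-numbered ones are the gaps between them):
ind = np.stack((start, end), axis=0)   # [[1, 0, 1, 2], [2, 1, 3, 4]]
ind += np.arange(4) * 4                # [[1, 4, 9, 14], [2, 5, 11, 16]]
ind = ind.ravel(order='F')             # [1, 2, 4, 5, 9, 11, 14, 16]
ind = ind[:-1]                         # 16 == data.size, so drop it
np.add.reduceat(data.ravel(), ind)     # [1, 5, 4, 26, 19, 36, 29]
# every other element: [1, 4, 19, 29] -- the per-row sums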
TL;DR
def row_mean(data, start, end):
    ind = np.stack((start, end), axis=0)
    ind += np.arange(data.shape[0]) * data.shape[1]
    ind = ind.ravel(order='F')
    if ind[-1] == data.size:
        ind = ind[:-1]
    return np.add.reduceat(data.ravel(), ind)[::2] / np.subtract(end, start)
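Applied to the sample data from the question, this returns the row means directly:
row_mean(data, start, end)
# array([ 1. ,  4. ,  9.5, 14.5])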
Timings
Using the exact same arrays shown in @Divakar's answer, we get the following results (specific to my machine of course):
%timeit einsum_mean(data, start, end)
261 ms ± 2.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit broadcasting_mean(data, start, end)
405 ms ± 1.64 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit ragged_mean(data, start, end)
520 ms ± 3.68 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit row_mean(data, start, end)
45.6 ms ± 708 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Somewhat surprisingly, this method runs 5-10x faster than all the others, despite doing lots of extra work by adding up all the numbers between the regions of interest. The reason is probably that it has extremely low overhead: the arrays of indices are small, and it only makes a single pass over a 1D array.
Let's try to create a mask with broadcasting:
start = np.array([1,0,1,2])
end = np.array([2,1,3,4])
idx = np.arange(data.shape[1])
mask = (start[:,None] <= idx) & (idx < end[:,None])
# this is for sum
(data * mask).sum(1)
Output:
array([ 1, 4, 19, 29])
And if you need average:
(data * mask).sum(1) / mask.sum(1)
which gives:
array([ 1. , 4. , 9.5, 14.5])
Here's one way with broadcasting to create the masking array and then using np.einsum to sum those per row using that mask -
start = np.asarray(start)
end = np.asarray(end)
r = np.arange(data.shape[1])
m = (r>=start[:,None]) & (r<end[:,None])
out = np.einsum('ij,ij->i',data,m)
To get the averages, divide by the mask summations -
avg_out = np.einsum('ij,ij->i',data,m)/np.count_nonzero(m,axis=1)
Or for the last step, use np.matmul/@:
out = (data[:,None] @ m[:,:,None]).ravel()
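The shapes are what make the matmul form work (the same line, annotated as a sketch):
# data[:,None]   -> (n, 1, m): each row as a 1-by-m row vector
# m[:,:,None]    -> (n, m, 1): each row's mask as an m-by-1 column vector
# batched matmul -> (n, 1, 1): one masked row sum per row; ravel gives (n,)
out = (data[:,None] @ m[:,:,None]).ravel()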
Timings
# @Quang Hoang's soln with sum method
def broadcast_sum(data, start, end):
    idx = np.arange(data.shape[1])
    mask = (start[:,None] <= idx) & (idx < end[:,None])
    return (data * mask).sum(1) / mask.sum(1)

# From earlier in this post
def broadcast_einsum(data, start, end):
    r = np.arange(data.shape[1])
    m = (r >= start[:,None]) & (r < end[:,None])
    return np.einsum('ij,ij->i', data, m) / np.count_nonzero(m, axis=1)

# @Paul Panzer's soln
def ragged_mean(data, left, right):
    n, m = data.shape
    ps = np.zeros((n, m+1), data.dtype)
    left, right = map(np.asarray, (left, right))
    rng = np.arange(len(data))
    np.cumsum(data, axis=1, out=ps[:,1:])
    return (ps[rng,right] - ps[rng,left]) / (right - left)

# @Mad Physicist's soln
def row_mean(data, start, end):
    ind = np.stack((start, end), axis=0)
    ind += np.arange(data.shape[0]) * data.shape[1]
    ind = ind.ravel(order='F')
    if ind[-1] == data.size:
        ind = ind[:-1]
    return np.add.reduceat(data.ravel(), ind)[::2] / np.subtract(end, start)
1. Tall array
Using the given sample and tiling it along rows:
In [74]: data = np.arange(16).reshape(4,4)
...: start = [1,0,1,2]
...: end = [2,1,3,4]
...:
...: N = 100000
...: data = np.repeat(data,N,axis=0)
...: start = np.tile(start,N)
...: end = np.tile(end,N)
In [75]: %timeit broadcast_sum(data, start, end)
...: %timeit broadcast_einsum(data, start, end)
...: %timeit ragged_mean(data, start, end)
...: %timeit row_mean(data, start, end)
41.4 ms ± 3.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
38.8 ms ± 996 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
24 ms ± 525 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
22.5 ms ± 1.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
2. Square array
Using a large square array (same as the given sample shape) -
In [76]: np.random.seed(0)
...: data = np.random.rand(10000,10000)
...: start = np.random.randint(0,5000,10000)
...: end = start + np.random.randint(1,5000,10000)
In [77]: %timeit broadcast_sum(data, start, end)
...: %timeit broadcast_einsum(data, start, end)
...: %timeit ragged_mean(data, start, end)
...: %timeit row_mean(data, start, end)
759 ms ± 3.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
514 ms ± 5.25 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
932 ms ± 4.78 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
72.5 ms ± 587 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
We can avoid creating masks at the cost of doing a few more sums:
import numpy as np
def ragged_mean(data, left, right):
    n, m = data.shape
    ps = np.zeros((n, m+1), data.dtype)
    left, right = map(np.asarray, (left, right))
    rng = np.arange(len(data))
    np.cumsum(data, axis=1, out=ps[:,1:])
    return (ps[rng,right] - ps[rng,left]) / (right - left)
data = np.arange(16).reshape(4,4)
start = [1,0,1,2]
end = [2,1,3,4]
print(ragged_mean(data,start,end))
Sample run:
[ 1. 4. 9.5 14.5]
When the slices (or rather the discarded areas) are large, a hybrid method (a numpy core inside one Python loop) is even faster than @Mad Physicist's. This method is, however, quite slow on @Divakar's tall test case.
import operator as op

def hybrid(data, start, end):
    off = np.arange(0, data.size, data.shape[1])
    start = start + off
    end = end + off
    sliced = op.itemgetter(*map(slice, start, end))(data.ravel())
    return np.fromiter(map(op.methodcaller("sum"), sliced), float, len(data)) \
        / (end - start)
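A quick sanity check on the sample data (note that start and end must be arrays here, because the final division subtracts them):
data = np.arange(16).reshape(4, 4)
start = np.array([1, 0, 1, 2])
end = np.array([2, 1, 3, 4])
print(hybrid(data, start, end))  # [ 1.   4.   9.5 14.5]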

Numpy Vectorization And Speedup

I found a small code snippet that used to be a double for loop, and I managed to bring it to a single for loop with vectorization. Having done this resulted in a drastic time improvement, so I was wondering if it is possible to get rid of the second for loop here via vectorization as well, and if it would improve the performance.
import numpy as np
from timeit import default_timer as timer
nlin, npix = 478, 480
bb = np.random.rand(nlin,npix)
slope = -8
fac = 4
offset = 0
barray = np.zeros([2, 2259])
timex = timer()
for y in range(nlin):
    for x in range(npix):
        ling = (np.ceil((x - y/slope)*fac) + 1 - offset).astype(int)
        barray[0, ling] += 1
        barray[1, ling] += bb[y, x]
newVar = np.copy(barray)
print(timer() - timex)
So ling can be taken out of the loops by creating the following matrix:
lingMat = (np.ceil((np.vstack(npixrange) - nlinrange/slope)*fac) + 1 - offset).astype(int)
which satisfies lingMat[x,y] = "ling in the for loop at x and y". And this gives a first step of the vectorization.
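For completeness, the snippet assumes definitions along these lines (my assumption; they are not spelled out above). Stacking the pixel range into a column vector makes broadcasting produce the full (npix, nlin) matrix of bin indices:
npixrange = np.arange(npix)
nlinrange = np.arange(nlin)
# np.vstack(npixrange) is equivalent to npixrange[:,None], a column vector
lingMat = (np.ceil((npixrange[:,None] - nlinrange/slope)*fac) + 1 - offset).astype(int)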
In terms of vectorization, you could potentially use something based on np.add.at:
def yaco_addat(bb, slope, fac, offset):
    barray = np.zeros((2, 2259), dtype=np.float64)
    nlin_range = np.arange(nlin)
    npix_range = np.arange(npix)
    ling_mat = (np.ceil((npix_range - nlin_range[:,None]/slope)*fac) + 1 - offset).astype(int)
    np.add.at(barray[0,:], ling_mat, 1)
    np.add.at(barray[1,:], ling_mat, bb)
    return barray
However, I would suggest optimizing this directly with numba, using the @jit decorator with the option nopython=True, which gives you:
import numpy as np
from numba import jit
nlin, npix = 478, 480
bb = np.random.rand(nlin,npix)
slope = -8
fac = 4
offset = 0

def yaco_plain(bb, slope, fac, offset):
    barray = np.zeros((2, 2259), dtype=np.float64)
    for y in range(nlin):
        for x in range(npix):
            ling = (np.ceil((x - y/slope)*fac) + 1 - offset).astype(int)
            barray[0, ling] += 1
            barray[1, ling] += bb[y, x]
    return barray

@jit(nopython=True)
def yaco_numba(bb, slope, fac, offset):
    barray = np.zeros((2, 2259), dtype=np.float64)
    for y in range(nlin):
        for x in range(npix):
            ling = int(np.ceil((x - y/slope)*fac) + 1 - offset)
            barray[0, ling] += 1
            barray[1, ling] += bb[y, x]
    return barray
Let's check the outputs
np.allclose(yaco_plain(bb,slope,fac,offset),yaco_addat(bb,slope,fac,offset))
>>> True
np.allclose(yaco_plain(bb,slope,fac,offset),yaco_numba(bb,slope,fac,offset))
>>> True
and now time these
%timeit yaco_plain(bb,slope,fac,offset)
>>> 648 ms ± 4.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit yaco_addat(bb,slope,fac,offset)
>>> 27.2 ms ± 92.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit yaco_numba(bb,slope,fac,offset)
>>> 505 µs ± 995 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
resulting in an optimized function that is way quicker than the initial two-loop version and 53x faster than the np.add.at one. Hope this helps.

Numpy - How to remove trailing N*8 zeros

I have a 1d array and I need to remove all trailing blocks of 8 zeros.
[0,1,1,0,1,0,0,0, 0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0]
->
[0,1,1,0,1,0,0,0]
a.shape[0] % 8 == 0 always, so no worries about that.
Is there a better way to do it?
import numpy as np
P = 8
arr1 = np.random.randint(2,size=np.random.randint(5,10) * P)
arr2 = np.random.randint(1,size=np.random.randint(5,10) * P)
arr = np.concatenate((arr1, arr2))
indexes = []
arr = np.flip(arr).reshape(arr.shape[0] // P, P)
for i, f in enumerate(arr):
    if (f == 0).all():
        indexes.append(i)
    else:
        break
arr = np.delete(arr, indexes, axis=0)
arr = np.flip(arr.reshape(arr.shape[0] * P))
You can do it without allocating more space by using views and np.argmax to get the last nonzero element:
index = arr.size - np.argmax(arr[::-1])
Rounding up to the nearest multiple of eight is easy:
index = int(np.ceil(index / 8)) * 8
Now chop off the rest:
arr = arr[:index]
Or as a one-liner:
arr = arr[:int(np.ceil((arr.size - np.argmax(arr[::-1])) / 8)) * 8]
This version is O(n) in time and O(1) in space because it reuses the same buffers for everything (including the output).
This has the additional advantage that it will work correctly even if there are no trailing zeros. Using argmax does rely on the array containing only zeros and ones, though. If that is not the case, you will need to compute a boolean mask first, e.g. with arr.astype(bool).
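For example (a sketch), masking first makes the same trick work for arrays with arbitrary nonzero values:
nz = arr.astype(bool)
index = arr.size - np.argmax(nz[::-1])  # position just past the last nonzero
arr = arr[:int(np.ceil(index / 8)) * 8]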
If you want to use your original approach, you could vectorize that too, although there will be a bit more overhead:
view = arr.reshape(-1, 8)
mask = view.any(axis=1)
index = view.shape[0] - np.argmax(mask[::-1])
arr = arr[:index * 8]
There is a numpy function that does almost what you want, np.trim_zeros. We can use that:
import numpy as np
def trim_mod(a, m=8):
    t = np.trim_zeros(a, 'b')
    return a[:len(a) - (len(a) - len(t))//m*m]

def test(a, t, m=8):
    assert (len(a) - len(t)) % m == 0
    assert len(t) < m or np.any(t[-m:])
    assert not np.any(a[len(t):])

for _ in range(1000):
    a = (np.random.random(np.random.randint(10, 100000)) < 0.002).astype(int)
    m = np.random.randint(4, 20)
    t = trim_mod(a, m)
    test(a, t, m)
print("Looks correct")
print("Looks correct")
Prints:
Looks correct
It seems to scale linearly in the number of trailing zeros (see the plot produced by the code below), but it feels rather slow in absolute terms (units are ms per trial), so np.trim_zeros is probably just a Python loop.
Code for the picture:
from timeit import timeit
A = (np.random.random(1000000)<0.02).astype(int)
m = 8
T = []
for last in range(1, 1000, 9):
    A[-last:] = 0
    A[-last] = 1
    T.append(timeit(lambda: trim_mod(A, m), number=100)*10)
import pylab
pylab.plot(range(1, 1000, 9), T)
pylab.show()
A low-level approach:
import numba

@numba.njit
def trim8(a):
    n = a.size - 1
    while n >= 0 and a[n] == 0:
        n -= 1
    c = (n//8 + 1)*8
    return a[:c]
Some tests:
In [194]: A[-1]=1 # best case
In [196]: %timeit trim_mod(A,8)
5.7 µs ± 323 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [197]: %timeit trim8(A)
714 ns ± 33.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [198]: %timeit A[:A.size - np.argmax(A[::-1]) // 8 * 8]
4.83 ms ± 479 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [202]: A[:]=0 #worst case
In [203]: %timeit trim_mod(A,8)
2.5 s ± 49.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [204]: %timeit trim8(A)
1.14 ms ± 71.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [205]: %timeit A[:A.size - np.argmax(A[::-1]) // 8 * 8]
5.5 ms ± 950 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
It has a short circuit mechanism like trim_zeros, but is much faster.
