Single precision rfft - python

I'm looking for a single precision rfft to accelerate computation; scipy.fftpack.rfft does this, but it returns a real array that packs the real and imaginary components into the same axis, which requires a post-processing step. I implemented the function below to obtain the standard complex array, but NumPy's rfft ends up being faster for 2D inputs (though slower for 1D). Memory is also a concern: I run out of memory with float64.
Does scipy or another library have a single precision rfft implementation that returns the standard complex array? (If not, can the code below be made faster?)
import numpy as np
from numpy.fft import rfft
from scipy.fftpack import rfft as srfft
def rfft_sp(x): # assumes len(x) is even
    xf = np.zeros((len(x)//2 + 1, x.shape[1]), dtype='complex64')
    h = srfft(x, axis=0)
    xf[0] = h[0]
    xf[1:] = h[1::2]
    xf[:1].imag = 0
    xf[-1:].imag = 0
    xf[1:-1].imag = h[2::2]
    return xf
x = np.random.randn(500, 100000).astype('float32')
%timeit rfft_sp(x)
%timeit rfft(x, axis=0)
>>> 565 ms ± 15.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> 517 ms ± 22.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

On the machine on which I tested, using scipy.fft.rfft and casting to complex64 is faster than your implementation:
import numpy as np
from numpy.fft import rfft
from scipy.fft import rfft as srfft
from scipy.fftpack import rfft as srfft2
def rfft_sp(x): # assumes len(x) is even
    xf = np.zeros((len(x)//2 + 1, x.shape[1]), dtype='complex64')
    h = srfft2(x, axis=0)
    xf[0] = h[0]
    xf[1:] = h[1::2]
    xf[:1].imag = 0
    xf[-1:].imag = 0
    xf[1:-1].imag = h[2::2]
    return xf
def rfft_cast(x):
    h = srfft(x, axis=0)
    return h.astype('complex64')
x = np.random.randn(500, 100000).astype('float32')
%timeit rfft(x, axis=0)
%timeit rfft_sp(x)
%timeit rfft_cast(x)
produces:
1.81 s ± 144 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2.89 s ± 7.58 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2.24 s ± 9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

scipy.fft works with single precision.
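Indeed, since scipy.fft preserves single precision, a float32 input should already come back as complex64 without any extra packing step. A quick check (a sketch, assuming a scipy version recent enough to ship the scipy.fft module):
import numpy as np
from scipy.fft import rfft

x = np.random.randn(500, 100000).astype('float32')
xf = rfft(x, axis=0)
print(xf.dtype)  # expected: complex64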


Opencv Python: Fastest way to multiply pixel value

I'm trying to change the pixel values of an image.
I have factors r, g, and b which will be used to multiply the pixel values of this image.
import cv2
import numpy as np
from matplotlib import pyplot as plt
import time
im = cv2.imread("boat.jpg")
im = cv2.cvtColor(im, cv2.COLOR_BGR2RGB)
im = cv2.resize(im, (4096,4096))
r_factor = 1.10
g_factor = 0.90
b_factor = 1.15
start = time.time()
im[...,0] = cv2.multiply(im[...,0], r_factor)
im[...,1] = cv2.multiply(im[...,1], g_factor)
im[...,2] = cv2.multiply(im[...,2], b_factor)
end = time.time()
This process takes a long time on large images. Is there any other method to multiply the pixel values?
If I do this on my system, I get 568 ms:
import cv2
import numpy as np
# Known start image
im = np.full((4096,4096,3), [10,20,30], np.uint8)
In [49]: %%timeit
...: im[...,0] = cv2.multiply(im[...,0], r_factor)
...: im[...,1] = cv2.multiply(im[...,1], g_factor)
...: im[...,2] = cv2.multiply(im[...,2], b_factor)
...:
...:
568 ms ± 16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
If I do it like this, it takes 394 ms:
In [42]: %timeit res = cv2.multiply(im,(r_factor, g_factor,b_factor,0))
394 ms ± 12.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You may get faster results doing it in-place, i.e. by specifying dst=im in the call. If I specify the type of the result, it comes out 5x faster at 63 ms - there must be something SIMD going on under the covers:
%timeit _ = cv2.multiply(im,(r_factor, g_factor,b_factor,0), dst=im, dtype=1)
63 ms ± 79.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
If you are really keen on making it even faster, look at some answers tagged with [numba].
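For example, here is a rough Numba sketch of the same per-channel scaling (my own assumption, not taken from the linked answers; it requires numba and modifies the image in place):
import numba as nb
import numpy as np

@nb.njit(parallel=True)
def scale_channels(im, factors):
    # im: (H, W, 3) uint8 image; factors: length-3 array of floats
    h, w, c = im.shape
    for i in nb.prange(h):
        for j in range(w):
            for k in range(c):
                v = im[i, j, k] * factors[k]
                if v > 255.0:
                    v = 255.0  # clamp to the uint8 range (cv2.multiply also saturates)
                im[i, j, k] = np.uint8(v)

scale_channels(im, np.array([1.10, 0.90, 1.15]))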

Create random integer ndarray sampled from different span per element

I want to generate an ndarray a full of random integers, where each row is sampled from a different range given by another array span. For example:
import numpy as np
span = [5,6,7,8,9]
def get_a(span, count):
    a = np.stack([np.random.choice(i, count) for i in span], axis=0)
    return a
get_a(span,2)
Is there a fast way to do get_a?
Yes. Yours:
import timeit
import numpy as np
span = np.arange(1,100)
def get_a(span, count):
    a = np.stack([np.random.choice(i, count) for i in span], axis=0)
    return a
%timeit get_a(span,2)
2.32 ms ± 254 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
My solution is hundreds of times faster for largish arrays:
def get_b(span, count):
    b = (np.random.rand(len(span), count)*span[:,None]).astype(int)
    return b
%timeit get_b(span,2)
6.91 µs ± 267 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
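Another option (a sketch, assuming NumPy >= 1.17) is the Generator API, whose integers method accepts an array of exclusive upper bounds directly:
import numpy as np

rng = np.random.default_rng()
span = np.arange(1, 100)
count = 2
# Each row i is drawn uniformly from [0, span[i]).
b = rng.integers(0, span[:, None], size=(len(span), count))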

How to vectorize custom algorithms in numpy or pytorch?

Suppose I have two matrices:
A: size k x m
B: size m x n
Using a custom operation, my output will be k x n.
This custom operation is not a dot product between the rows of A and columns of B. Suppose this custom operation is defined as:
For the I-th row of A and the J-th column of B, the (I, J) element of the output is:
sum( (A[I, k] + B[k, J]) ** 20 ), where k runs over the shared dimension m
The only way I can see to implement this is to expand this equation, calculate each term, then sum them.
Is there a way in numpy or pytorch to do this without expanding the equation?
Apart from the method @hpaulj outlines in the comments, you can also use the fact that what you are calculating is essentially a pairwise Minkowski distance: because the exponent 20 is even, (A[I,k] + B[k,J])**20 equals |A[I,k] - (-B[k,J])|**20, so the result is the p=20 Minkowski distance between the rows of A and the rows of -B.T, raised to the 20th power:
import numpy as np
from scipy.spatial.distance import cdist
k,m,n = 10,20,30
A = np.random.random((k,m))
B = np.random.random((m,n))
method1 = ((A[...,None]+B)**20).sum(axis=1)
method2 = cdist(A,-B.T,'m',p=20)**20
np.allclose(method1,method2)
# True
You can also implement it yourself.
The following function generates all kinds of dot-product-like functions, but don't use it to replace np.dot, because it will be quite a lot slower for larger arrays.
Template
import numpy as np
import numba as nb
from scipy.spatial.distance import cdist
def gen_dot_like_func(kernel,parallel=True):
    kernel_nb=nb.njit(kernel,fastmath=True)
    def cust_dot(A,B_in):
        B=np.ascontiguousarray(B_in.T)
        assert B.shape[1]==A.shape[1]
        out=np.empty((A.shape[0],B.shape[0]),dtype=A.dtype)
        for i in nb.prange(A.shape[0]):
            for j in range(B.shape[0]):
                sum=0
                for k in range(A.shape[1]):
                    sum+=kernel_nb(A[i,k],B[j,k])
                out[i,j]=sum
        return out
    if parallel==True:
        return nb.njit(cust_dot,fastmath=True,parallel=True)
    else:
        return nb.njit(cust_dot,fastmath=True,parallel=False)
Generate your function
#This can be useful if you have a lot of matrix-multiplication-like functions
my_func=gen_dot_like_func(lambda A,B:(A+B)**20,parallel=True)
Timings
k,m,n = 10,20,30
%timeit method1 = ((A[...,None]+B)**20).sum(axis=1)
192 µs ± 554 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit method2 = cdist(A,-B.T,'m',p=20)**20
208 µs ± 1.85 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit res=my_func(A,B) #parallel=False
4.01 µs ± 34.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
k,m,n = 500,100,500
%timeit method1 = ((A[...,None]+B)**20).sum(axis=1)
852 ms ± 4.93 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit method2 = cdist(A,-B.T,'m',p=20)**20
714 ms ± 2.12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res=my_func(A,B) #parallel=True
1.81 ms ± 11.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

What's the fastest, most efficient, and pythonic way to perform a mathematical sigma sum?

Let's say that I want to perform a mathematical summation, say the Madhava–Leibniz formula for π (π/4 = Σ (−1)^n / (2n + 1), summing n from 0 to ∞), in Python:
Within a function called Leibniz_pi(), I could create a loop to calculate the nth partial sum, such as:
def Leibniz_pi(n):
    nth_partial_sum = 0 #initialize the variable
    for i in range(n+1):
        nth_partial_sum += ((-1)**i)/(2*i + 1)
    return nth_partial_sum
I'm assuming it would be faster to use something like xrange() instead of range(). Would it be even faster to use numpy and its built in numpy.sum() method? What would such an example look like?
I guess most people would call the numpy-only solution by @zero the most pythonic, but it is certainly not the fastest. With some additional optimizations you can beat the already fast numpy implementation by a factor of 50.
Using only Numpy (@zero)
import numpy as np
import numexpr as ne
import numba as nb
def Leibniz_point(n):
    val = (-1)**n / (2*n + 1)
    return val
%timeit Leibniz_point(np.arange(1000)).sum()
33.8 µs ± 203 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Make use of numexpr
n=np.arange(1000)
%timeit ne.evaluate("sum((-1)**n / (2*n + 1))")
21 µs ± 354 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Compile your function using Numba
# with error_model="numpy", turns off division-by-zero checks
@nb.njit(error_model="numpy",cache=True)
def Leibniz_pi(n):
    nth_partial_sum = 0. #initialize the variable as float64
    for i in range(n+1):
        nth_partial_sum += ((-1)**i)/(2*i + 1)
    return nth_partial_sum
%timeit Leibniz_pi(999)
6.48 µs ± 38.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Edit, optimizing away the costly (-1)**n
import numba as nb
import numpy as np
#replacement for the much more costly (-1)**n
@nb.njit()
def sgn(i):
    if i%2>0:
        return -1.
    else:
        return 1.

# with error_model="numpy", turns off the division-by-zero checks
#
# fastmath=True makes SIMD-vectorization in this case possible
# floating point math is in general not commutative
# e.g. calculating four times sgn(i)/(2*i + 1) at once and then the sum
# is not exactly the same as doing this sequentially, therefore you have to
# explicitly allow the compiler to make the optimizations
@nb.njit(fastmath=True,error_model="numpy",cache=True)
def Leibniz_pi(n):
    nth_partial_sum = 0. #initialize the variable
    for i in range(n+1):
        nth_partial_sum += sgn(i)/(2*i + 1)
    return nth_partial_sum
%timeit Leibniz_pi(999)
777 ns ± 5.36 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
3 suggestions (with speed comparisons):
First, define the Leibniz point, not the cumulative sum:
def Leibniz_point(n):
    val = (-1)**n / (2*n + 1)
    return val
1) sum a list comprehension
%timeit sum([Leibniz_point(n) for n in range(100)])
58.8 µs ± 825 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit sum([Leibniz_point(n) for n in range(1000)])
667 µs ± 3.41 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2) standard for loop
%%timeit
sum = 0
for n in range(100):
    sum += Leibniz_point(n)
61.8 µs ± 4.45 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
sum = 0
for n in range(1000):
    sum += Leibniz_point(n)
729 µs ± 43.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
3) use a numpy array (suggested)
%timeit Leibniz_point(np.arange(100)).sum()
11.5 µs ± 866 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit Leibniz_point(np.arange(1000)).sum()
61.8 µs ± 3.69 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In general, for operations involving collections of more than a few elements, numpy will be faster. A simple numpy implementation could be something like this:
def leibniz(n):
    a = np.arange(n + 1)
    return (((-1.0) ** a) / (2 * a + 1)).sum()
Note that on Python 2 you must make the numerator a float by writing 1.0; on Python 3, plain 1 is fine.
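As a quick sanity check (my own sketch, not part of the answers above), the partial sums should approach π/4:
import numpy as np

a = np.arange(1_000_000)
partial = (((-1.0) ** a) / (2 * a + 1)).sum()
print(partial, np.pi / 4)  # both print roughly 0.785398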

Efficient calculation of vector between sets of 3D points

I'm coding a particular version of raytracing in Python, and I'm trying to calculate the vectors between points on different planes.
I'm working with sets of point light sources, simulating a nonpoint light source. Each source generates one ray for each pixel on the "camera" plane. I managed to compute the vector for each of those rays by iterating with a for loop over each pixel:
for sensor_point in sensor_points:
    sp_min_ro = sensor_point - rayorigins #Vectors between the points
    normalv = normalize(sp_min_ro) #Normalized vector between the points
Here sensor_points is a large numpy array with the [x,y,z] coordinates of the different pixel positions, and rayorigins is a numpy array with the [x,y,z] coordinates of the different point sources.
This for loop approach works, but is extremely slow. I tried to remove the for loop and directly calculate sp_min_ro = sensor_points - rayorigins with the whole arrays, but numpy can't broadcast them:
ValueError: operands could not be broadcast together with shapes (1002001,3) (36,3)
Is there a way to accelerate the process of finding the vectors between all the points?
Edit: Adding the normalize function definition I have been using, because it is also giving problems:
def normalize(v):
    norm = np.linalg.norm(v, axis=1)
    return v / norm[:,None]
When I try to pass the new (1002001, 36, 3) array from @aganders3's solution, it fails, I suppose because of the axis?
Numpy solution
import numpy as np
sensor_points=np.random.randn(1002001,3)#.astype(np.float32)
rayorigins=np.random.rand(36,3)#.astype(np.float32)
sp_min_ro = sensor_points[:, np.newaxis, :] - rayorigins
norm=np.linalg.norm(sp_min_ro,axis=2)
sp_min_ro/=norm[:,:,np.newaxis]
Timings
np.float64: 1.76 s ± 26.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
np.float32: 1.42 s ± 9.83 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Numba solution
import numba as nb
@nb.njit(fastmath=True,error_model="numpy",parallel=True)
def normalized_vec(sensor_points,rayorigins):
    res=np.empty((sensor_points.shape[0],rayorigins.shape[0],3),dtype=sensor_points.dtype)
    for i in nb.prange(sensor_points.shape[0]):
        for j in range(rayorigins.shape[0]):
            vec_x=sensor_points[i,0]-rayorigins[j,0]
            vec_y=sensor_points[i,1]-rayorigins[j,1]
            vec_z=sensor_points[i,2]-rayorigins[j,2]
            dist=np.sqrt(vec_x**2+vec_y**2+vec_z**2)
            res[i,j,0]=vec_x/dist
            res[i,j,1]=vec_y/dist
            res[i,j,2]=vec_z/dist
    return res
Timings
%timeit res=normalized_vec(sensor_points,rayorigins)
np.float64: 208 ms ± 4.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
np.float32: 104 ms ± 515 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Numba solution with preallocated memory
Memory allocation can be very costly. This example shows why it is sometimes a good idea to avoid large temporary arrays if possible.
@nb.njit(fastmath=True,error_model="numpy",parallel=True)
def normalized_vec(sensor_points,rayorigins,res):
    for i in nb.prange(sensor_points.shape[0]):
        for j in range(rayorigins.shape[0]):
            vec_x=sensor_points[i,0]-rayorigins[j,0]
            vec_y=sensor_points[i,1]-rayorigins[j,1]
            vec_z=sensor_points[i,2]-rayorigins[j,2]
            dist=np.sqrt(vec_x**2+vec_y**2+vec_z**2)
            res[i,j,0]=vec_x/dist
            res[i,j,1]=vec_y/dist
            res[i,j,2]=vec_z/dist
    return res
Timings
res=np.empty((sensor_points.shape[0],rayorigins.shape[0],3),dtype=sensor_points.dtype)
%timeit normalized_vec(sensor_points,rayorigins,res)
np.float64: 66.6 ms ± 131 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
np.float32: 33.8 ms ± 375 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Check out the rules for NumPy broadcasting. I think adding a new axis in the middle of your sensor_points array will work:
>> sp_min_ro = sensor_points[:, np.newaxis, :] - rayorigins
>> sp_min_ro.shape
(1002001, 36, 3)
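The normalize helper from the question then fails on the (1002001, 36, 3) result because it hardcodes axis=1; a minimal fix (a sketch) is to normalize along the last axis, which works for both the 2D and 3D cases:
import numpy as np

def normalize(v):
    # Norm over the last axis; keepdims lets the division broadcast back.
    norm = np.linalg.norm(v, axis=-1, keepdims=True)
    return v / norm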
