What would be the fastest way to collect segments of data from a NumPy array at every point in a dataset, normalize them based on the sign (+ve/-ve) at the start of the segment, and average all segments together?
At present I have:
import numpy as np
x0 = np.random.normal(0,1,5000) # Dataset to be analysed
l0 = 100 # Length of segment to be averaged
def average_seg(x, l):
    return np.mean([x[i:i+l] * np.sign(x[i]) for i in range(len(x) - l)], axis=0)

av_seg = average_seg(x0, l0)
Timing for this is as follows:
%timeit average_seg(x0,l0)
22.2 ms ± 362 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
This does the job, but is there a faster way to do this?
The above code suffers when the length of x0 is large, and when the value of l0 is large. We're looking at looping through this code several million times, so even incremental improvements will help!
We can leverage 1D convolution -
np.convolve(x,np.sign(x[:-l+1][::-1]),'valid')/(len(x)-l+1)
The idea is to do the windowed summations with convolution and with a flipped kernel as per the convolution definition.
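As a quick sanity check, here is a minimal sketch on a tiny input (my assumption: every full-length window is used) showing that the 'valid' convolution with the flipped sign kernel reproduces the explicit loop:
import numpy as np

x = np.random.normal(0, 1, 20)
l = 5
n_seg = len(x) - l + 1   # number of full-length segments
loop = np.mean([x[i:i+l] * np.sign(x[i]) for i in range(n_seg)], axis=0)
conv = np.convolve(x, np.sign(x[:n_seg][::-1]), 'valid') / n_seg
print(np.allclose(loop, conv))  # True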
Timings -
In [150]: x = np.random.normal(0,1,5000) # Dataset to be analysed
...: l = 100 # Length of segment to be averaged
In [151]: %timeit average_seg(x,l)
17.2 ms ± 689 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [152]: %timeit np.convolve(x,np.sign(x[:-l+1][::-1]),'valid')/(len(x)-l+1)
149 µs ± 3.12 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [153]: av_seg = average_seg(x,l)
...: out = np.convolve(x,np.sign(x[:-l+1][::-1]),'valid')/(len(x)-l+1)
...: print(np.allclose(out, av_seg))
True
100x+ speedup!
tf.boolean_mask reads much more nicely than the combination of tf.gather and tf.where. However, it seems to be much slower in the 1-D case:
import tensorflow as tf
# use this shape
shape = [5000]
# create random mask m and dummy vector v
m = tf.random.uniform(shape) > 0.5
v = tf.ones(shape)
# apply boolean_mask to select elements from v based on boolean mask m:
%timeit tf.boolean_mask(v, m)
# 1.23 ms ± 1.33 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# do the same with gather and where:
%timeit tf.gather(v, tf.where(m))
# 107 µs ± 349 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Admittedly, the results have slightly different shapes:
tf.boolean_mask(v, m).shape
# TensorShape([2578])
tf.gather(v, tf.where(m)).shape
# TensorShape([2578, 1])
This can be fixed by squeezing out the additional dimension, which makes it 50% slower:
%timeit tf.squeeze(tf.gather(v, tf.where(m)))
# 149 µs ± 343 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Still, this is almost a factor of 10 faster than using boolean_mask. Is there a subtle difference, or is tf.boolean_mask just missing an optimization for the 1-D case?
PS: There seems to be a strong dependence on the size of the tensor, too. For shape = [5000000], the performance is on par.
It seems boolean_mask is a wrapper around gather + where.
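Under that reading, here is a minimal 1-D sketch composed from tf.where + tf.gather + tf.squeeze (my own illustration, not TF's actual source) that behaves like boolean_mask:
import tensorflow as tf

# Illustration only (not TF's implementation): a 1-D boolean_mask equivalent
# built from tf.where + tf.gather + tf.squeeze.
def boolean_mask_1d(v, m):
    idx = tf.where(m)                        # indices of True entries, shape [k, 1]
    return tf.squeeze(tf.gather(v, idx), axis=1)

m = tf.random.uniform([5000]) > 0.5
v = tf.ones([5000])
print(boolean_mask_1d(v, m).shape == tf.boolean_mask(v, m).shape)  # True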
I have a 1D array of integers with D elements (i.e. idx = np.array([i0, i1, ...]), s.t. idx.size = D), where each element corresponds to the index along that dimension of an ND array with D dimensions (i.e. data s.t. data.ndim = D). How can I index the data array using the index array idx?
In plain Python I would do data[tuple(idx)], but tuples aren't supported in Numba's nopython mode.
My current workaround is to use data.ravel() and convert from ND indices to 1D indices of the flattened array, but it seems like there must be an easier (and computationally faster) solution. Is there a take_along_each_axis(data, idx) method somewhere?
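For reference, here is a minimal sketch of that ravel workaround (my own helper, assuming C-contiguous row-major data; flat_index is not an existing NumPy/Numba API):
import numpy as np
import numba as nb

@nb.njit
def flat_index(idx, shape):
    # Row-major (C order) flattening: ((i0*s1 + i1)*s2 + i2)*s3 + ...
    flat = 0
    for d in range(shape.shape[0]):
        flat = flat * shape[d] + idx[d]
    return flat

data = np.ones((10, 10, 10, 10))
idx = np.array([5, 5, 5, 5])
shape = np.asarray(data.shape)
print(data.ravel()[flat_index(idx, shape)] == data[5, 5, 5, 5])  # True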
Let's do a bit of time testing:
In [135]: data = np.ones((100,100,100,100)); idx = (50,50,50,50)
That's nearly a GB of memory - not big enough to cause a memory error, but still a reasonable test. (In fact, I get the same basic-indexing time for much smaller arrays and for other idx values.)
In [136]: timeit data[idx]
212 ns ± 9.25 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
The interpreter translates that into a method call:
In [137]: timeit data.__getitem__(idx)
283 ns ± 4.37 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Indexing the 'flat' array can be done with:
In [138]: timeit data.flat[np.ravel_multi_index(idx,data.shape)]
6.65 µs ± 75.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
or taking the conversion out of the loop:
In [139]: %%timeit x=np.ravel_multi_index(idx,data.shape)
...: data.flat[x]
574 ns ± 23.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [142]: %%timeit x=np.ravel_multi_index(idx,data.shape);df=data.flat
...: df[x]
345 ns ± 6.39 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
I think there are cases where flat indexing is faster, but this isn't one of them.
So, as a standalone operation, I don't see the point of writing an njit version. I suppose it could be worth it if it's part of some larger operation.
I need to calculate values for a lot of angles in degrees. In order to build up the coarse shape fast, and the fine bits in between later, I want to calculate the shape in this order (0°, 180°, 90°, 270°, 45°, 135°, ...).
The following code does what I want. I wonder: is there a way to do this in a more straightforward way? It needs to work for any whole number (e.g. 72 or 7465).
Thanks for your help.
import numpy as np

def evenly_spaced_star_order(number):
    Total = np.linspace(0, 360, number, endpoint=False)
    Res = []
    for divider in [2**_ for _ in range(1000)]:
        for counter in range(divider):
            Number = (counter * len(Total)) // divider
            if np.isfinite(Total[Number]):  # angle not used yet
                Res.append(Total[Number])
                Total[Number] = np.nan      # mark as used
        if np.all(np.isnan(Total)):
            break
    return Res

print(evenly_spaced_star_order(16))
My solution recursively separates the even- and odd-numbered indexes. The odd-numbered entries are then put at the end of the final list (in order), and the even-numbered entries are recursively split apart again.
My order is consistent with your original function, but it is a lot faster (by an order of magnitude or more) and it does indeed work for any whole number.
# recursive evenly_spaced_star_order()
def esso(number):
    def interleave(arr):
        return arr if len(arr) <= 1 else np.append(interleave(arr[0::2]), arr[1::2])
    return interleave(np.linspace(0, 360, number, endpoint=False))

print(esso(16))
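For example, tracing the recursion by hand for 16 points gives the coarse angles first and the finer 22.5° steps last; a quick check:
import numpy as np

# Hand-traced expected order for esso(16): halves, then quarters, eighths, sixteenths.
expected = [0, 180, 90, 270, 45, 135, 225, 315,
            22.5, 67.5, 112.5, 157.5, 202.5, 247.5, 292.5, 337.5]
print(np.allclose(esso(16), expected))  # True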
My timings:
%timeit evenly_spaced_star_order(16)
885 µs ± 8.68 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit esso(16)
60.1 µs ± 998 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit evenly_spaced_star_order(1000)
5.88 ms ± 192 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit esso(1000)
111 µs ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Mine will perform better and better as the number of points increases (as compared to the original code).
Second solution
It's not nearly as pretty, but the order is closer and it is still faster.
def esso2(number):
    def interleave(arr):
        if arr.shape[0] <= 1:
            return arr
        mid = arr.shape[0] // 2
        it1 = iter(interleave(arr[0:mid]))
        it2 = iter(interleave(arr[mid:]))
        # zip pairs elements from both halves; any leftovers from it2 go at the end
        return sum(zip(it1, it2), ()) + tuple(it2)
    return np.array(interleave(np.linspace(0, 360, number, endpoint=False)))

print(esso2(72))
I'm coding a particular version of raytracing in Python, and I'm trying to calculate the vectors between points on different planes.
I'm working with sets of point light sources that simulate a nonpoint light source. Each source generates one ray for each pixel on the "camera" plane. I managed to compute the vector for each of those rays by iterating over each pixel with a for loop:
for sensor_point in sensor_points:
    sp_min_ro = sensor_point - rayorigins  # Vectors between the points
    normalv = normalize(sp_min_ro)         # Normalized vector between the points
Here sensor_points is a large NumPy array with the [x, y, z] coordinates of the different pixel positions, and rayorigins is a NumPy array with the [x, y, z] coordinates of the different point sources.
This for-loop approach works, but it is extremely slow. I tried to remove the for loop and directly calculate sp_min_ro = sensor_points - rayorigins on the whole array, but NumPy can't broadcast it:
ValueError: operands could not be broadcast together with shapes (1002001,3) (36,3)
Is there a way to accelerate the process of finding the vectors between all the points?
Edit: Adding the normalize function definition I have been using, because it is also giving problems:
def normalize(v):
    norm = np.linalg.norm(v, axis=1)
    return v / norm[:, None]
When I try to pass the new (1002001, 36, 3) array from @aganders3's solution, it fails, I suppose because of the axis?
Numpy solution
import numpy as np
sensor_points=np.random.randn(1002001,3)#.astype(np.float32)
rayorigins=np.random.rand(36,3)#.astype(np.float32)
sp_min_ro = sensor_points[:, np.newaxis, :] - rayorigins
norm=np.linalg.norm(sp_min_ro,axis=2)
sp_min_ro/=norm[:,:,np.newaxis]
Timings
np.float64: 1.76 s ± 26.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
np.float32: 1.42 s ± 9.83 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Numba solution
import numba as nb

@nb.njit(fastmath=True, error_model="numpy", parallel=True)
def normalized_vec(sensor_points, rayorigins):
    res = np.empty((sensor_points.shape[0], rayorigins.shape[0], 3), dtype=sensor_points.dtype)
    for i in nb.prange(sensor_points.shape[0]):
        for j in range(rayorigins.shape[0]):
            vec_x = sensor_points[i, 0] - rayorigins[j, 0]
            vec_y = sensor_points[i, 1] - rayorigins[j, 1]
            vec_z = sensor_points[i, 2] - rayorigins[j, 2]
            dist = np.sqrt(vec_x**2 + vec_y**2 + vec_z**2)
            res[i, j, 0] = vec_x / dist
            res[i, j, 1] = vec_y / dist
            res[i, j, 2] = vec_z / dist
    return res
Timings
%timeit res=normalized_vec(sensor_points,rayorigins)
np.float64: 208 ms ± 4.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
np.float32: 104 ms ± 515 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Numba solution with preallocated memory
Memory allocation can be very costly. This example should show why it is sometimes a good idea to avoid large temporary arrays if possible.
@nb.njit(fastmath=True, error_model="numpy", parallel=True)
def normalized_vec(sensor_points, rayorigins, res):
    for i in nb.prange(sensor_points.shape[0]):
        for j in range(rayorigins.shape[0]):
            vec_x = sensor_points[i, 0] - rayorigins[j, 0]
            vec_y = sensor_points[i, 1] - rayorigins[j, 1]
            vec_z = sensor_points[i, 2] - rayorigins[j, 2]
            dist = np.sqrt(vec_x**2 + vec_y**2 + vec_z**2)
            res[i, j, 0] = vec_x / dist
            res[i, j, 1] = vec_y / dist
            res[i, j, 2] = vec_z / dist
    return res
Timings
res = np.empty((sensor_points.shape[0], rayorigins.shape[0], 3), dtype=sensor_points.dtype)
%timeit normalized_vec(sensor_points, rayorigins, res)
np.float64: 66.6 ms ± 131 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
np.float32: 33.8 ms ± 375 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Check out the rules for NumPy broadcasting. I think adding a new axis in the middle of your sensor_points array will work:
>>> sp_min_ro = sensor_points[:, np.newaxis, :] - rayorigins
>>> sp_min_ro.shape
(1002001, 36, 3)
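For the normalize issue mentioned in the edit, a minimal sketch (my assumption: normalizing along the last axis with keepdims) handles both the original (N, 3) array and the new (1002001, 36, 3) array:
import numpy as np

def normalize(v):
    # Normalize along the last axis; keepdims lets the division broadcast
    # for both (N, 3) and (N, M, 3) inputs.
    norm = np.linalg.norm(v, axis=-1, keepdims=True)
    return v / norm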
With NumPy, what's the fastest way to generate an array from -n to n, excluding 0, where n is an integer?
One solution follows, but I am not sure it is the fastest:
n = 100000
np.concatenate((np.arange(-n, 0), np.arange(1, n+1)))
An alternative approach is to create the range -n to n-1 and then add 1 to the elements from the zero position onward.
def non_zero_range(n):
    # The 2nd argument to np.arange is exclusive, so it should be n and not n-1
    a = np.arange(-n, n)
    a[n:] += 1
    return a
n=1000000
%timeit np.concatenate((np.arange(-n,0), np.arange(1,n+1)))
# 4.28 ms ± 9.46 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit non_zero_range(n)
# 2.84 ms ± 13.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
I think the reduced runtime is due to creating only one array, not three as in the concatenate approach.
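As a quick sanity check (a small sketch, separate from the timings), both constructions produce the same array:
import numpy as np

n = 1000
a = non_zero_range(n)
b = np.concatenate((np.arange(-n, 0), np.arange(1, n + 1)))
print(np.array_equal(a, b))  # True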
Edit
Thanks, everyone. I have edited my post and updated the test timings.
Interesting problem.
Experiment
I ran these in my Jupyter notebook; all of them use the NumPy API, and you can reproduce the experiments with the following code yourself.
For time measurement in a Jupyter notebook, see: Simple way to measure cell execution time in ipython notebook
Original np.concatenate
%%timeit
n = 100000
t = np.concatenate((np.arange(-n, 0), np.arange(1, n+1)))
#175 µs ± 2.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Sol 1. np.delete
%%timeit
n = 100000
a = np.arange(-n, n+1)
b = np.delete(a, n)
# 179 µs ± 5.66 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Sol 2. List comprehension + np.array
%%timeit
n = 100000
c = np.array([x for x in range(-n, n+1) if x != 0])
# 16.6 ms ± 693 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Conclusion
There's no big difference between the original and Solution 1, but Solution 2 is by far the worst of the three. I'm looking for faster solutions, too.
Reference
For those who are:
interested in how to initialize and fill a NumPy array: Best way to initialize and fill an numpy array?
confused about is vs ==: The Difference Between “is” and “==” in Python