I'm coding a particular version of raytracing in Python, and I'm trying to calculate the vectors between points on different planes.
I'm working with sets of point light sources, simulating a nonpoint light source. Each source generates one ray for each pixel on the "camera" plane. I managed to compute the vector for each of those rays, by iterating with a for loop for each pixel:
for sensor_point in sensor_points:
    sp_min_ro = sensor_point - rayorigins  # Vectors between the points
    normalv = normalize(sp_min_ro)  # Normalized vectors between the points
Here sensor_points is a large NumPy array with the [x,y,z] coordinates of the pixel positions, and rayorigins is a NumPy array with the [x,y,z] coordinates of the point sources.
This for-loop approach works, but it is extremely slow. I tried to remove the loop and compute sp_min_ro = sensor_points - rayorigins directly on the whole arrays, but NumPy cannot broadcast them:
ValueError: operands could not be broadcast together with shapes (1002001,3) (36,3)
Is there a way to accelerate the process of finding the vectors between all the points?
Edit: Adding the normalize function definition I have been using, because it is also giving problems:
def normalize(v):
    norm = np.linalg.norm(v, axis=1)
    return v / norm[:,None]
When I pass the new (1002001, 36, 3) array from @aganders3's solution to it, it fails, I suppose because of the axis argument?
Numpy solution
import numpy as np
sensor_points=np.random.randn(1002001,3)#.astype(np.float32)
rayorigins=np.random.rand(36,3)#.astype(np.float32)
sp_min_ro = sensor_points[:, np.newaxis, :] - rayorigins
norm=np.linalg.norm(sp_min_ro,axis=2)
sp_min_ro/=norm[:,:,np.newaxis]
Timings
np.float64: 1.76 s ± 26.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
np.float32: 1.42 s ± 9.83 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
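The normalize function from the question can be generalized in the same spirit. The following is a minimal sketch, assuming the vector components always live on the last axis; the small array sizes and the random generator are stand-ins chosen for illustration, not the sizes from the question.

```python
import numpy as np

def normalize(v):
    # Norm along the last axis; keepdims=True keeps a trailing axis of
    # length 1, so the division broadcasts for any number of leading axes.
    norm = np.linalg.norm(v, axis=-1, keepdims=True)
    return v / norm

# Small stand-in shapes (assumed for illustration only).
rng = np.random.default_rng(0)
sensor_points = rng.normal(size=(5, 3))
rayorigins = rng.random(size=(4, 3))

unit_2d = normalize(sensor_points)                                  # shape (5, 3)
unit_3d = normalize(sensor_points[:, np.newaxis, :] - rayorigins)   # shape (5, 4, 3)
```

With axis=-1 the same function handles both the per-pixel (N, 3) input of the original loop and the broadcast (N, M, 3) array.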
Numba solution
import numba as nb
@nb.njit(fastmath=True,error_model="numpy",parallel=True)
def normalized_vec(sensor_points,rayorigins):
    res=np.empty((sensor_points.shape[0],rayorigins.shape[0],3),dtype=sensor_points.dtype)
    for i in nb.prange(sensor_points.shape[0]):
        for j in range(rayorigins.shape[0]):
            vec_x=sensor_points[i,0]-rayorigins[j,0]
            vec_y=sensor_points[i,1]-rayorigins[j,1]
            vec_z=sensor_points[i,2]-rayorigins[j,2]
            dist=np.sqrt(vec_x**2+vec_y**2+vec_z**2)
            res[i,j,0]=vec_x/dist
            res[i,j,1]=vec_y/dist
            res[i,j,2]=vec_z/dist
    return res
Timings
%timeit res=normalized_vec(sensor_points,rayorigins)
np.float64: 208 ms ± 4.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
np.float32: 104 ms ± 515 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Numba solution with preallocated memory
Memory allocation can be very costly. This example shows why it is sometimes a good idea to avoid allocating large temporary arrays, here by passing in a preallocated output array.
@nb.njit(fastmath=True,error_model="numpy",parallel=True)
def normalized_vec(sensor_points,rayorigins,res):
    for i in nb.prange(sensor_points.shape[0]):
        for j in range(rayorigins.shape[0]):
            vec_x=sensor_points[i,0]-rayorigins[j,0]
            vec_y=sensor_points[i,1]-rayorigins[j,1]
            vec_z=sensor_points[i,2]-rayorigins[j,2]
            dist=np.sqrt(vec_x**2+vec_y**2+vec_z**2)
            res[i,j,0]=vec_x/dist
            res[i,j,1]=vec_y/dist
            res[i,j,2]=vec_z/dist
    return res
Timings
res=np.empty((sensor_points.shape[0],rayorigins.shape[0],3),dtype=sensor_points.dtype)
%timeit normalized_vec(sensor_points,rayorigins,res)
np.float64: 66.6 ms ± 131 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
np.float32: 33.8 ms ± 375 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
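The same idea of reusing an output buffer can also be sketched in plain NumPy through the out= argument of ufuncs. This is not the Numba kernel above and no timings are claimed for it; the function name normalized_vec_into and the small array sizes are assumptions made for illustration.

```python
import numpy as np

def normalized_vec_into(sensor_points, rayorigins, out):
    # Write the pairwise differences straight into the caller's buffer
    # instead of allocating a new (N, M, 3) array for them.
    np.subtract(sensor_points[:, np.newaxis, :], rayorigins, out=out)
    # Squared length of every vector, summed over the last axis.
    norms = np.sqrt(np.einsum('ijk,ijk->ij', out, out))
    # In-place division keeps the result in the same buffer.
    out /= norms[:, :, np.newaxis]
    return out

rng = np.random.default_rng(0)
sensor_points = rng.normal(size=(6, 3))
rayorigins = rng.random(size=(4, 3))

# Allocate once, then reuse the buffer across calls.
res = np.empty((sensor_points.shape[0], rayorigins.shape[0], 3),
               dtype=sensor_points.dtype)
result = normalized_vec_into(sensor_points, rayorigins, res)
```

The (N, M) array of norms is still a temporary here, so this only removes the largest allocation, not all of them.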
Check out the rules for NumPy broadcasting. I think adding a new axis in the middle of your sensor_points array will work:
>>> sp_min_ro = sensor_points[:, np.newaxis, :] - rayorigins
>>> sp_min_ro.shape
(1002001, 36, 3)
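To see why the extra axis helps, here is a small sketch with stand-in shapes (5, 3) and (4, 3), which are assumptions for illustration. Broadcasting compares shapes from the trailing axis backwards, so (5, 3) against (4, 3) fails on 5 vs 4, while (5, 1, 3) against (4, 3) stretches the length-1 axis.

```python
import numpy as np

sensor_points = np.zeros((5, 3))
rayorigins = np.ones((4, 3))

# Direct subtraction fails: the leading axes (5 and 4) are incompatible.
try:
    sensor_points - rayorigins
    direct_ok = True
except ValueError:
    direct_ok = False

# With a new middle axis the shapes are (5, 1, 3) and (4, 3), which
# broadcast to (5, 4, 3): one vector per (sensor point, ray origin) pair.
diff = sensor_points[:, np.newaxis, :] - rayorigins
print(direct_ok, diff.shape)
```

Entry diff[i, j] holds sensor_points[i] - rayorigins[j].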
tf.boolean_mask reads much nicer than the combination of tf.gather and tf.where. However, it seems to be much slower in the 1-D case:
import tensorflow as tf
# use this shape
shape = [5000]
# create random mask m and dummy vector v
m = tf.random.uniform(shape) > 0.5
v = tf.ones(shape)
# apply boolean_mask to select elements from v based on boolean mask m:
%timeit tf.boolean_mask(v, m)
# 1.23 ms ± 1.33 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# do the same with gather and where:
%timeit tf.gather(v, tf.where(m))
# 107 µs ± 349 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Admittedly, the results have slightly different shapes:
tf.boolean_mask(v, m).shape
# TensorShape([2578])
tf.gather(v, tf.where(m)).shape
# TensorShape([2578, 1])
This can be fixed by squeezing out the additional dimension, which makes it about 40% slower:
%timeit tf.squeeze(tf.gather(v, tf.where(m)))
# 149 µs ± 343 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Still, this is almost a factor of 10 faster than using boolean_mask. Is there a subtle difference, or is tf.boolean_mask just missing an optimization for the 1-D case?
PS: There seems to be a strong dependence on the size of the tensor, too. For shape = [5000000], the performance is on par.
It seems boolean_mask is a wrapper around gather + where:
I have a 1D array of integers with D elements (i.e. idx = np.array([i0, i1, ...]), s.t. idx.size = D), where each element corresponds to the index along that dimension of an ND array with D dimensions (i.e. data s.t. data.ndim = D). How can I index the data array using the index array idx?
In Python I would do data[tuple(idx)], but this tuple conversion isn't supported in numba nopython mode.
My current workaround is to use data.ravel() and convert from ND indices to 1D indices of the flattened array, but it seems like there must be an easier (and computationally faster) solution. Is there a take_along_each_axis(data, idx) method somewhere?
Let's do a bit of time testing:
In [135]: data = np.ones((100,100,100,100)); idx = (50,50,50,50)
That's nearly a GB of memory: not large enough to cause a memory error, but still a reasonable test. In fact I get the same time for basic indexing with much smaller arrays, and with other idx values.
In [136]: timeit data[idx]
212 ns ± 9.25 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
the interpreter translates that into a method call:
In [137]: timeit data.__getitem__(idx)
283 ns ± 4.37 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Indexing the flat array can be done with:
In [138]: timeit data.flat[np.ravel_multi_index(idx,data.shape)]
6.65 µs ± 75.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
or taking the conversion out of the loop:
In [139]: %%timeit x=np.ravel_multi_index(idx,data.shape)
...: data.flat[x]
574 ns ± 23.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [142]: %%timeit x=np.ravel_multi_index(idx,data.shape);df=data.flat
...: df[x]
345 ns ± 6.39 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
I think there are cases where flat indexing is faster, but this isn't one of them.
So as a standalone operation, I don't see the point of writing an njit version. I suppose it could be worth it if it's part of some larger operation.
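If the lookup does have to live inside a larger compiled routine, the flat-index workaround from the question can be written as a plain loop. The sketch below is pure Python/NumPy and assumes row-major (C) ordering; flat_index is a hypothetical helper name, and whether it compiles unchanged under numba nopython mode is an assumption that is not tested here.

```python
import numpy as np

def flat_index(idx, shape):
    # Horner-style accumulation of the row-major offset:
    # ((i0 * s1 + i1) * s2 + i2) * s3 + ...
    offset = 0
    for axis in range(len(shape)):
        offset = offset * shape[axis] + idx[axis]
    return offset

data = np.arange(2 * 3 * 4).reshape(2, 3, 4)
idx = np.array([1, 2, 3])

offset = flat_index(idx, data.shape)
# ravel() flattens in C order, matching the offset computed above.
value = data.ravel()[offset]
```

For this example the offset is 1*12 + 2*4 + 3 = 23, the same number np.ravel_multi_index produces for these inputs.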
What would be the fastest way to collect segments of data from a NumPy array at every point in a dataset, normalize them based on the sign (+ve/-ve) at the start of the segment, and average all segments together?
At present I have:
import numpy as np
x0 = np.random.normal(0,1,5000) # Dataset to be analysed
l0 = 100 # Length of segment to be averaged
def average_seg(x,l):
    return np.mean([x[i:i+l]*np.sign(x[i]) for i in range(len(x)-l)],axis=0)
av_seg = average_seg(x0,l0)
Timing for this is as follows:
%timeit average_seg(x0,l0)
22.2 ms ± 362 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
This does the job, but is there a faster way to do this?
The above code suffers when the length of x0 is large, and when the value of l0 is large. We're looking at looping through this code several million times, so even incremental improvements will help!
We can leverage 1D convolution -
np.convolve(x,np.sign(x[:-l+1][::-1]),'valid')/(len(x)-l+1)
The idea is to do the windowed summations with convolution and with a flipped kernel as per the convolution definition.
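To make the equivalence concrete, here is a small sketch comparing an explicit loop with the convolution form. Note that the one-liner averages over len(x)-l+1 windows (every start position whose window still has full length l), so the reference loop below uses range(len(x)-l+1), one more window than the range(len(x)-l) in the question. The function names and the test data are assumptions for illustration.

```python
import numpy as np

def average_seg_loop(x, l):
    # Every start index whose window of length l fits inside x.
    n_windows = len(x) - l + 1
    return np.mean([x[i:i + l] * np.sign(x[i]) for i in range(n_windows)], axis=0)

def average_seg_conv(x, l):
    n_windows = len(x) - l + 1
    # Signs of the window start points, flipped because convolution
    # reverses the kernel before sliding it over x.
    kernel = np.sign(x[:n_windows])[::-1]
    # 'valid' keeps only fully overlapping positions, giving l outputs.
    return np.convolve(x, kernel, 'valid') / n_windows

rng = np.random.default_rng(0)
x = rng.normal(0, 1, 200)
l = 20
loop_result = average_seg_loop(x, l)
conv_result = average_seg_conv(x, l)
```

Output element k is the sum over start points i of x[i+k]*sign(x[i]), which is exactly what the loop accumulates at offset k within each segment.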
Timings -
In [150]: x = np.random.normal(0,1,5000) # Dataset to be analysed
...: l = 100 # Length of segment to be averaged
In [151]: %timeit average_seg(x,l)
17.2 ms ± 689 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [152]: %timeit np.convolve(x,np.sign(x[:-l+1][::-1]),'valid')/(len(x)-l+1)
149 µs ± 3.12 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [153]: av_seg = average_seg(x,l)
...: out = np.convolve(x,np.sign(x[:-l+1][::-1]),'valid')/(len(x)-l+1)
...: print(np.allclose(out, av_seg))
True
100x+ speedup!
With NumPy, what's the fastest way to generate an array from -n to n, excluding 0, where n is an integer?
Here is one solution, but I am not sure it is the fastest:
n = 100000
np.concatenate((np.arange(-n, 0), np.arange(1, n+1)))
An alternative approach is to create the range -n to n-1. Then add 1 to the elements from zero.
def non_zero_range(n):
    # The 2nd argument to np.arange is exclusive so it should be n and not n-1
    a=np.arange(-n,n)
    a[n:]+=1
    return a
n=1000000
%timeit np.concatenate((np.arange(-n,0), np.arange(1,n+1)))
# 4.28 ms ± 9.46 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit non_zero_range(n)
# 2.84 ms ± 13.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
I think the reduced response time is due to only creating one array, not three as in the concatenate approach.
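A tiny worked case makes the shift visible. This sketch uses n = 3, a value chosen only for illustration, and shows the intermediate array before and after the in-place increment.

```python
import numpy as np

n = 3
a = np.arange(-n, n)      # [-3, -2, -1, 0, 1, 2]
before = a.copy()
# Index n is where the 0 sits; shifting it and everything after it by
# one turns 0, 1, 2 into 1, 2, 3 without allocating a second array.
a[n:] += 1
reference = np.concatenate((np.arange(-n, 0), np.arange(1, n + 1)))
print(before, a)
```

Both approaches yield the same 2n values; only the number of arrays allocated along the way differs.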
Edit
Thanks, everyone. I edited my post and updated the timings.
Interesting problem.
Experiment
I ran these in a Jupyter notebook. All of them use the NumPy API, and you can repeat the experiment yourself with the code below.
For measuring execution time in a Jupyter notebook, see: Simple way to measure cell execution time in ipython notebook
Original np.concatenate
%%timeit
n = 100000
t = np.concatenate((np.arange(-n, 0), np.arange(1, n+1)))
#175 µs ± 2.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Sol 1. np.delete
%%timeit
n = 100000
a = np.arange(-n, n+1)
b = np.delete(a, n)
# 179 µs ± 5.66 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Sol 2. List comprehension + np.array
%%timeit
n = 100000
c = np.array([x for x in range(-n, n+1) if x != 0])
# 16.6 ms ± 693 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Conclusion
There's no big difference between original and solution 1, but solution 2 is the worst among the three. I'm looking for faster solutions, too.
Reference
For those who are interested in initializing and filling a NumPy array:
Best way to initialize and fill an numpy array?
For those who get confused by is vs ==:
The Difference Between “is” and “==” in Python
It seems numpy.transpose only saves the strides and performs the actual transpose lazily, according to this
So when does the data movement actually happen, and how is the data moved? With many memcpy calls, or some other trick?
I followed the path:
array_reshape,
PyArray_Newshape,
PyArray_NewCopy,
PyArray_NewLikeArray,
PyArray_NewFromDescr,
PyArray_NewFromDescrAndBase,
PyArray_NewFromDescr_int
but saw nothing about axis permutation. When does it actually happen?
Update 2021/1/19
Thanks for the answers. The NumPy array copy with transpose is here; it uses a common macro to implement it. The algorithm is very naive and does not take SIMD acceleration or cache friendliness into account.
The answer to your question is: Numpy doesn't move data.
Did you see PyArray_Transpose on line 688 of your above links? There is a permute in this function,
n = permute->len;
axes = permute->ptr;
...
for (i = 0; i < n; i++) {
    int axis = axes[i];
    ...
    permutation[i] = axis;
}
An array's shape is purely metadata that NumPy uses to interpret the data; the underlying buffer is always a linear block of memory. There is therefore no reason to move or reorder any data for a transpose. From the docs here,
Other operations, such as transpose, don't move data elements
around in the array, but rather change the information about the shape and strides so that the indexing of the array changes, but the data in the array doesn't move.
Typically these new versions of the array metadata but the same data buffer are
new 'views' into the data buffer. There is a different ndarray object, but it
uses the same data buffer. This is why it is necessary to force copies through
use of the .copy() method if one really wants to make a new and independent
copy of the data buffer.
The only reason to copy may be to maximize cache efficiency, although Numpy already considers this,
As it turns out, numpy is smart enough when dealing with ufuncs to determine which index is the most rapidly varying one in memory and uses that for the innermost loop.
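The metadata-only nature of a transpose can be checked directly. The sketch below uses a small int64 array (chosen for illustration) and inspects the strides and the shared buffer.

```python
import numpy as np

A = np.arange(6, dtype=np.int64).reshape(2, 3)
At = A.T

# Transposing only swaps the shape and the strides.
a_strides = A.strides     # (24, 8): 3 items * 8 bytes per row step
at_strides = At.strides   # (8, 24): the same numbers, reversed

# Both objects look at the same buffer ...
shared = np.shares_memory(A, At)

# ... so writing through the transposed view changes the original.
At[0, 1] = 99
print(a_strides, at_strides, shared, A[1, 0])
```

No element is moved at any point; an independent transposed array requires an explicit copy, for example At.copy().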
Tracing through the numpy C code is a slow and tedious process. I prefer to deduce patterns of behavior from timings.
Make a sample array and its transpose:
In [168]: A = np.random.rand(1000,1000)
In [169]: At = A.T
First a fast view, with no copying of the data buffer:
In [171]: timeit B = A.ravel()
262 ns ± 4.39 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
A fast copy (presumably uses some fast block memory copying):
In [172]: timeit B = A.copy()
2.2 ms ± 26.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
A slow copy (presumably requires traversing the source in its strided order, and the target in its own order):
In [173]: timeit B = A.copy(order='F')
6.29 ms ± 2.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Copying At without having to change the order - fast:
In [174]: timeit B = At.copy(order='F')
2.23 ms ± 51.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Like [173] but going from 'F' to 'C':
In [175]: timeit B = At.copy(order='C')
6.29 ms ± 4.16 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [176]: timeit B = At.ravel()
6.54 ms ± 214 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Copies with simpler strided reordering fall somewhere in between:
In [177]: timeit B = A[::-1,::-1].copy()
3.75 ms ± 4.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [178]: timeit B = A[::-1].copy()
3.73 ms ± 6.48 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [179]: timeit B = At[::-1].copy(order='K')
3.98 ms ± 212 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This astype also requires the slower copy:
In [182]: timeit B = A.astype('float128')
6.7 ms ± 8.12 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
PyArray_NewFromDescr_int is described as Generic new array creation routine. While I can't figure out where it copies data from the source to the target, it clearly is checking order and strides and dtype. Presumably it handles all cases where the generic copy is required. The axis permutation isn't a special case.
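Besides timings, np.shares_memory gives a direct way to see at which point a copy is made. The following sketch uses a small stand-in array rather than the 1000x1000 one timed above: flattening the C-contiguous array returns a view, while flattening its transpose in C order has to produce a reordered copy.

```python
import numpy as np

A = np.arange(12.0).reshape(3, 4)
At = A.T

# C-contiguous source, C-order flatten: no data movement needed.
flat_view = A.ravel()
view_shares = np.shares_memory(A, flat_view)

# Transposed source, C-order flatten: elements must be gathered in a
# different order, so NumPy allocates a new buffer and copies.
flat_copy = At.ravel()
copy_shares = np.shares_memory(A, flat_copy)
print(view_shares, copy_shares, flat_copy[:4])
```

This matches the timing pattern: the data is only moved when an operation needs the elements laid out in an order the existing strides cannot express.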