Performance difference between tf.boolean_mask and tf.gather + tf.where - python

tf.boolean_mask reads much nicer than the combination of tf.gather and tf.where. However, it seems to be much slower in the 1-D case:
import tensorflow as tf
# use this shape
shape = [5000]
# create random mask m and dummy vector v
m = tf.random.uniform(shape) > 0.5
v = tf.ones(shape)
# apply boolean_mask to select elements from v based on boolean mask m:
%timeit tf.boolean_mask(v, m)
# 1.23 ms ± 1.33 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# do the same with gather and where:
%timeit tf.gather(v, tf.where(m))
# 107 µs ± 349 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Admittedly, the results have slightly different shapes:
tf.boolean_mask(v, m).shape
# TensorShape([2578])
tf.gather(v, tf.where(m)).shape
# TensorShape([2578, 1])
This can be fixed by squeezing out the additional dimension, which makes it 50% slower:
%timeit tf.squeeze(tf.gather(v, tf.where(m)))
# 149 µs ± 343 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Still, this is almost a factor of 10 faster than using boolean_mask. Is there a subtle difference, or is tf.boolean_mask just missing an optimization for the 1-D case?
PS: There seems to be a strong dependence on the size of the tensor, too. For shape = [5000000], the performance is on par.

It seems boolean_mask is a wrapper around gather + where:
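A simplified sketch of what the 1-D path boils down to (paraphrased from reading the TensorFlow source; the real function also performs shape validation and reshaping, which is where the extra overhead for small tensors comes from):
import tensorflow as tf
def boolean_mask_1d_sketch(tensor, mask):
    # tf.boolean_mask first runs shape checks and reshapes (omitted here),
    # then selects using the same two ops as the hand-written version above:
    indices = tf.squeeze(tf.where(mask), axis=[1])
    return tf.gather(tensor, indices)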

Related

Using vectorization to get rid of for cycle in matrix multiplication

We have two matrices A and B, where
A.shape # Output T,N
B.shape # Output N,N,T
and we are interested in calculating the mult value, which I am currently computing with a for loop:
mult = 0
for t in range(0, T):
    mult += np.matmul(np.matmul(A[t, :].T, B[:, :, t]), A[t, :])
I have already found a similar problem that I am confident can help find the solution, Is there a numpy/scipy dot product, calculating only the diagonal entries of the result?, but I am unable to come up with a clean solution without using the for loop.
Is there a way of using only basic numpy operations to get the desired result?
It looks like you can get away with a call to np.einsum:
mult = np.einsum('ij, jki, ik ->', A, B, A)
What this does is compute the summation
sum_{i, j, k} A[i, j] * B[j, k, i] * A[i, k]
which is what you need.
Note that in your original code A[t, :] is a 1d array, so the transposition is a no-op; A[t, :].T is just A[t, :].
Also note that np.einsum might be concise, but it's often slower than the corresponding calls to np.dot or elementwise operations. So you might get other answers that end up being faster. I haven't timed this approach.
Case in point: there's a direct multiplication alternative making use of array broadcasting:
mult = (A[:, None, :] * B.T * A[..., None]).sum()
Now, there's a lot of legroom here: we could choose to transpose A instead, or only reorder some of the indices of B and so on. The corresponding performance will depend on your use case, because memory access ends up being highly non-trivial (which impacts caching, which is mostly where vectorisation speedup comes from).
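For reference, the three variants timed below can be wrapped like this (the names match the timing labels; the wrappers are reconstructed here from the snippets above and are not part of the original answer):
import numpy as np
def loopy(A, B):
    # Original loop from the question.
    T = A.shape[0]
    mult = 0
    for t in range(T):
        mult += np.matmul(np.matmul(A[t, :].T, B[:, :, t]), A[t, :])
    return mult
def einsummed(A, B):
    # Single contraction: sum_{i,j,k} A[i,j] * B[j,k,i] * A[i,k]
    return np.einsum('ij, jki, ik ->', A, B, A)
def mulled(A, B):
    # Broadcasting product followed by a full sum.
    return (A[:, None, :] * B.T * A[..., None]).sum()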
For example, for T = 10 and N = 100 and uniform random matrices we get these timings:
>>> %timeit loopy(A, B)
... %timeit einsummed(A, B)
... %timeit mulled(A, B)
179 µs ± 591 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
304 µs ± 1.33 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
309 µs ± 2.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
(loopy() is your code, einsummed() is my first solution and mulled() is my second.) So both einsum and the direct summation end up being almost twice as slow. This is because your loop goes over a "small" index of size 10, where looping often outperforms vectorised solutions.
In contrast, if we choose T = 100 and N = 10 we get
>>> %timeit loopy(A, B)
... %timeit einsummed(A, B)
... %timeit mulled(A, B)
231 µs ± 1.75 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
37.9 µs ± 265 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
38.1 µs ± 181 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
Now my two approaches are almost six times faster.
Whether either of these is faster than your loop depends on your real problem. Time them and see.

Scaling columns of a Sparse Tensor by a vector in tensorflow

Assume A is a sparse 2D binary tensor (NxN) and B is a vector (1xN). I want to scale columns of A by B: A_{i, j} <- A_{i, j} x B_{j}
I am looking for an efficient way to do this. I have tried two methods so far, but they are both too slow:
Convert B to the diagonal matrix BD and use tf.sparse_tensor_dense_matmul(A, BD). This returns a dense matrix, so the method is not efficient.
Using map_fn for sparse tensors: tf.SparseTensor(A.indices, tf.map_fn(func, (A.values, A.indices[:, 1]), dtype=tf.float32), A.dense_shape). This returns a sparse tensor, but it seems one of the map_fn or SparseTensor functions is the bottleneck.
Thank you!
A faster solution is probably to call tf.gather on your vector B so you can multiply the values of A directly. Another option is to use tf.sparse.map_values, creating the vector of updates with tf.gather.
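For context, A and B for the benchmark below could be built along these lines (only the 1000x1000 size and ~99% sparsity come from the answer; the exact construction is an assumption):
import tensorflow as tf
N = 1000
# Binary matrix with roughly 1% ones (so ~99% empty), converted to a SparseTensor.
A = tf.sparse.from_dense(tf.cast(tf.random.uniform((N, N)) > 0.99, tf.float32))
B = tf.random.uniform((N,))  # length-N scaling vector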
On a 99% empty sparse matrix of size 1000x1000:
>>> %timeit tf.sparse.SparseTensor(A.indices, A.values*tf.gather(B,A.indices[:,1]), A.dense_shape)
855 µs ± 86 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The map_values approach:
>>> %timeit tf.sparse.map_values(tf.multiply, A, tf.gather(B,A.indices[:,1]))
978 µs ± 73 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Compared to the sparse_dense_matmul approach:
>>> %timeit tf.sparse.from_dense(tf.sparse.sparse_dense_matmul(A,tf.linalg.diag(B)))
26.7 ms ± 2.24 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Fastest way to average sign-normalized segments of data with NumPy?

What would be the fastest way to collect segments of data from a NumPy array at every point in a dataset, normalize them based on the sign (+ve/-ve) at the start of the segment, and average all segments together?
At present I have:
import numpy as np
x0 = np.random.normal(0,1,5000) # Dataset to be analysed
l0 = 100 # Length of segment to be averaged
def average_seg(x,l):
    return np.mean([x[i:i+l]*np.sign(x[i]) for i in range(len(x)-l)],axis=0)
av_seg = average_seg(x0,l0)
Timing for this is as follows:
%timeit average_seg(x0,l0)
22.2 ms ± 362 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
This does the job, but is there a faster way to do this?
The above code suffers when the length of x0 is large, and when the value of l0 is large. We're looking at looping through this code several million times, so even incremental improvements will help!
We can leverage 1D convolution -
np.convolve(x,np.sign(x[:-l+1][::-1]),'valid')/(len(x)-l+1)
The idea is to do the windowed summations with convolution and with a flipped kernel as per the convolution definition.
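Wrapped as a drop-in replacement for average_seg (this is just the one-liner above in function form; element k of the 'valid' convolution is the sum of sign(x[i]) * x[i + k] over all segment starts i):
import numpy as np
def average_seg_conv(x, l):
    # Windowed sums via a single 'valid' convolution with the flipped sign
    # kernel, divided by the number of segments to get the mean.
    kernel = np.sign(x[:-l + 1][::-1])
    return np.convolve(x, kernel, 'valid') / (len(x) - l + 1)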
Timings -
In [150]: x = np.random.normal(0,1,5000) # Dataset to be analysed
...: l = 100 # Length of segment to be averaged
In [151]: %timeit average_seg(x,l)
17.2 ms ± 689 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [152]: %timeit np.convolve(x,np.sign(x[:-l+1][::-1]),'valid')/(len(x)-l+1)
149 µs ± 3.12 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [153]: av_seg = average_seg(x,l)
...: out = np.convolve(x,np.sign(x[:-l+1][::-1]),'valid')/(len(x)-l+1)
...: print(np.allclose(out, av_seg))
True
100x+ speedup!

Efficient calculation of vector between sets of 3D points

I'm coding a particular version of raytracing in Python, and I'm trying to calculate the vectors between points on different planes.
I'm working with sets of point light sources, simulating a non-point light source. Each source generates one ray for each pixel on the "camera" plane. I managed to compute the vector for each of those rays by iterating over the pixels with a for loop:
for sensor_point in sensor_points:
    sp_min_ro = sensor_point - rayorigins  # Vectors between the points
    normalv = normalize(sp_min_ro)  # Normalized vector between the points
Here sensor_points is a large numpy array with the [x,y,z] coordinates of the different pixel positions, and rayorigins is a numpy array with the [x,y,z] coordinates of the different point sources.
This for loop approach works, but is extremely slow. I tried to remove the for loop and directly calculate sp_min_ro = sensor_points - rayorigins with the whole arrays, but numpy can't broadcast them:
ValueError: operands could not be broadcast together with shapes (1002001,3) (36,3)
Is there a way to accelerate the process of finding the vectors between all the points?
Edit: Adding the normalize function definition I have been using, because it is also giving problems:
def normalize(v):
    norm = np.linalg.norm(v, axis=1)
    return v / norm[:,None]
When I try to pass the new (1002001, 36, 3) array from @aganders3's solution, it fails; I suppose because of the axis?
Numpy solution
import numpy as np
sensor_points=np.random.randn(1002001,3)#.astype(np.float32)
rayorigins=np.random.rand(36,3)#.astype(np.float32)
sp_min_ro = sensor_points[:, np.newaxis, :] - rayorigins
norm=np.linalg.norm(sp_min_ro,axis=2)
sp_min_ro/=norm[:,:,np.newaxis]
Timings
np.float64: 1.76 s ± 26.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
np.float32: 1.42 s ± 9.83 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Numba solution
import numba as nb
@nb.njit(fastmath=True,error_model="numpy",parallel=True)
def normalized_vec(sensor_points,rayorigins):
    res=np.empty((sensor_points.shape[0],rayorigins.shape[0],3),dtype=sensor_points.dtype)
    for i in nb.prange(sensor_points.shape[0]):
        for j in range(rayorigins.shape[0]):
            vec_x=sensor_points[i,0]-rayorigins[j,0]
            vec_y=sensor_points[i,1]-rayorigins[j,1]
            vec_z=sensor_points[i,2]-rayorigins[j,2]
            dist=np.sqrt(vec_x**2+vec_y**2+vec_z**2)
            res[i,j,0]=vec_x/dist
            res[i,j,1]=vec_y/dist
            res[i,j,2]=vec_z/dist
    return res
Timings
%timeit res=normalized_vec(sensor_points,rayorigins)
np.float64: 208 ms ± 4.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
np.float32: 104 ms ± 515 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Numba solution with preallocated memory
Memory allocation can be very costly. This example should show why it is sometimes a good idea to avoid allocating large temporary arrays if possible.
@nb.njit(fastmath=True,error_model="numpy",parallel=True)
def normalized_vec(sensor_points,rayorigins,res):
    for i in nb.prange(sensor_points.shape[0]):
        for j in range(rayorigins.shape[0]):
            vec_x=sensor_points[i,0]-rayorigins[j,0]
            vec_y=sensor_points[i,1]-rayorigins[j,1]
            vec_z=sensor_points[i,2]-rayorigins[j,2]
            dist=np.sqrt(vec_x**2+vec_y**2+vec_z**2)
            res[i,j,0]=vec_x/dist
            res[i,j,1]=vec_y/dist
            res[i,j,2]=vec_z/dist
    return res
Timings
res=np.empty((sensor_points.shape[0],rayorigins.shape[0],3),dtype=sensor_points.dtype)
%timeit normalized_vec(sensor_points,rayorigins,res)
np.float64: 66.6 ms ± 131 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
np.float32: 33.8 ms ± 375 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Check out the rules for NumPy broadcasting. I think adding a new axis in the middle of your sensor_points array will work:
>>> sp_min_ro = sensor_points[:, np.newaxis, :] - rayorigins
>>> sp_min_ro.shape
(1002001, 36, 3)
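Regarding the edit about normalize() failing on the (1002001, 36, 3) result: the norm has to be taken along the last axis and kept as a broadcastable dimension. A small adjustment along these lines (not part of either answer) should work:
import numpy as np
def normalize(v):
    # Norm along the last axis, kept as a size-1 dimension so it broadcasts
    # against v regardless of how many leading dimensions v has.
    norm = np.linalg.norm(v, axis=-1, keepdims=True)
    return v / norm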

With numpy, what's the fastest way to generate an array from -n to n, excluding 0, being `n` an integer?

Here is one solution, but I am not sure it is the fastest:
n = 100000
np.concatenate((np.arange(-n, 0), np.arange(1, n+1)))
An alternative approach is to create the range -n to n-1, then add 1 to the elements from zero onward.
def non_zero_range(n):
    # The 2nd argument to np.arange is exclusive, so it should be n and not n-1
    a = np.arange(-n, n)
    a[n:] += 1
    return a
n=1000000
%timeit np.concatenate((np.arange(-n,0), np.arange(1,n+1)))
# 4.28 ms ± 9.46 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit non_zero_range(n)
# 2.84 ms ± 13.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
I think the reduced run time is due to creating only one array, not three as in the concatenate approach.
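A quick sanity check (not in the original post) that both constructions produce the same array:
import numpy as np
n = 1000
a = np.concatenate((np.arange(-n, 0), np.arange(1, n + 1)))
b = non_zero_range(n)
assert np.array_equal(a, b)  # both are -n..-1, 1..n with no zero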
Edit
Thanks, everyone. I edited my post and updated the test times.
Interesting problem.
Experiment
I did it in my Jupyter notebook. All of them use the numpy API. You can run the following code yourself to reproduce the experiment.
For time measurement in a Jupyter notebook, please see: Simple way to measure cell execution time in ipython notebook
Original np.concatenate
%%timeit
n = 100000
t = np.concatenate((np.arange(-n, 0), np.arange(1, n+1)))
#175 µs ± 2.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Sol 1. np.delete
%%timeit
n = 100000
a = np.arange(-n, n+1)
b = np.delete(a, n)
# 179 µs ± 5.66 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Sol 2. List comprehension + np.array
%%timeit
n = 100000
c = np.array([x for x in range(-n, n+1) if x != 0])
# 16.6 ms ± 693 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Conclusion
There's no big difference between the original and solution 1, but solution 2 is the worst of the three. I'm looking for faster solutions, too.
Reference
For those who are:
interested in initializing and filling a numpy array: Best way to initialize and fill an numpy array?
confused about is vs ==: The Difference Between “is” and “==” in Python
