Making a Numpy Operation Vectorised - python

Long story short, I'm applying a function onto multiple different time intervals and then storing the resulting arrays at different indexs in an ndarray. Presently, I'm doing this by using the a for loop with the numpy equivalent of the enumerate function. As I understand it, this eliminates the major advantage of numpy: vectorisation. Is this a particular way my rountine could be implemented that retains this advantage?
Here is my code:
Most of is working parts for the function psi_t
import numpy as np
# Number of Walks and Number of Positions
N = 100
P = 2*N +1
hopping_rate = 0.5
psi_t0 = np.zeros(P)
psi_t0[N] = 1
#creates the line upon which the particle moves
#index N is the central position
def hamiltonian(line_length, hopping_rate):
'''
creates the simple non time dependent hamiltonian for H = γA
where A is the adjancency matrix
'''
return hopping_rate * line_adjacency_matrix(line_length)
def measurement_operator(positions,finished_quantum_state):
'''
Converts the finished quantum state into an array of probabilities for
being in each position.
Uses the measurement operator from Susan Blog
https://susan-stepney.blogspot.com/2014/02/mathjax.html
Improved on by this guy
https://github.com/Driminary/python-randomwalk-project/blob/master/quantum-2D.py
Apart from the fact that the measurement operator drops the extra dimensions of the spin space,
which isn't present in the continuous walk.
'''
probabilities = np.empty(P)
#M_hat = np.zeros((2*P,2*P,2*P))
for k in range(P):
posn = np.zeros(P) # values of positions to nought ..
posn[k] = 1 #except for the value we're interested in
#M_hat = np.kron(np.outer(posn,posn)) #perform measurement at the current pos
M_hat = np.outer(posn,posn)
proj = M_hat.dot(finished_quantum_state) #find the state the system is in
probabilities[k] = proj.dot(proj.conjugate()).real #Calculate Prob of Particle being there
return probabilities
def psi_t(initial_wave_function,positions,hopping_rate,time):
'''
Acts upon the initial state to give the 'position' of the quantum particle at time t. Applies the measurement operator
to return the probability of being at any position at time t.
'''
psi_t = np.matmul((LA.expm(-1j*hamiltonian(positions,hopping_rate)*time)),initial_wave_function) #state after the continuous walk after time evolution
probablities = measurement_operator(P, psi_t)
return probablities
time_evolution = 150 #how many 'seconds' the wavefunction is evolved for
time_interval = 0.5
number_of_intervals =int(time_evolution / time_interval )
number_of_positions = P
probabilities_at_t =np.ndarray((number_of_intervals,number_of_positions)) #creates the empty ndarray ready for the probabilites at time t
array_of_times = np.linspace(0,time_evolution,number_of_intervals) #produces the individual times at which psi_t is calculated,
for idx,time in np.ndenumerate(array_of_times):
probabilities_at_t[idx] = psi_t(psi_t0,P,hopping_rate,time) #the array probabillites_at_t is filled at index idx with the array of probabilities produced by psi_t.
#This is the step I am trying to vectorise
The function psi_t is called on a for loop to act on each of the time(s) in array_of_times individually. Is there way where psi_t could act on the array array_of_times like one can do x**2 for the array x? Can it be done in one fell swoop?
P.S Eagle Eyed Overflowers will note that within the measurement_operator there is a for loop anyway. I don't think there's a way to get rid of this however !

Question is not really reproducible because some of the functions that are being called are missing but here is my vectorised implementation of measurement_operator. This is with the assumption that finished_quantum_state has a shape of (P, ) (Not sure if that's the case, because couldn't reproduce till that part) .
def measurement_operator_vectorized(positions, finished_quantum_state):
M_hat = np.zeros((P, P, P))
M_hat[np.arange(0, P), np.arange(0, P), np.arange(0, P)] = 1
proj = np.tensordot(M_hat, finished_quantum_state, axes=((2), (0)))
probabilities = (proj * proj.conjugate()).sum(axis=1).real
return probabilities
Here is some benchmarkings -
P = 1000
a = np.random.rand(P)
b = np.random.rand(P)
%timeit c1 = measurement_operator(a, b)
%timeit c2 = measurement_operator_vectorized(a, b)
%memit c1 = measurement_operator(a, b)
%memit c2 = measurement_operator_vectorized(a, b)
print(np.allclose(c1, c2))
Gives -
1.18 s ± 46.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
308 ms ± 6.76 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
peak memory: 86.43 MiB, increment: 0.00 MiB
peak memory: 90.34 MiB, increment: 3.91 MiB
True
The vectorised version is faster is comparable memory usage for P~1000.
Note that for really high values of P, the memory usage will increase a lot for the vectorised version.
This isn't exactly what the OP asked for, but to vectorise the other loop, a more complete code would be helpful.
However, this benchmark is valid only if finished_quantum_state is real. For complex values the tensordot operation is very slow and inefficient (in memory) so you might actually be better off with the non-vectorized version.
P = 1000
a = np.random.rand(P) + np.random.rand(P)*1j
b = np.random.rand(P) + np.random.rand(P)*1j
%timeit -n1 -r1 c1 = measurement_operator(a, b)
%timeit -n1 -r1 c2 = measurement_operator_vectorized(a, b)
%memit c1 = measurement_operator(a, b)
%memit c2 = measurement_operator_vectorized(a, b)
np.allclose(c1, c2)
2.97 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
3.49 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
peak memory: 102.69 MiB, increment: 0.03 MiB
peak memory: 15365.38 MiB, increment: 15262.69 MiB
However, if you really want the best performance, you are better off forgetting the physics details about measurement etc temporarily and just doing
def measurement_operator_fastest(positions, finished_quantum_state):
return (finished_quantum_state * finished_quantum_state.conjugate()).real
P = 1000
a = np.random.rand(P) + np.random.rand(P)*1j
b = np.random.rand(P) + np.random.rand(P)*1j
%timeit -n1 -r1 c1 = measurement_operator(a, b)
%timeit -n1 -r1 c2 = measurement_operator_vectorized(a, b)
%timeit -n1 -r1 c3 = measurement_operator_fastest(a, b)
%memit c1 = measurement_operator(a, b)
%memit c2 = measurement_operator_vectorized(a, b)
%memit c3 = measurement_operator_fastest(a, b)
print(np.allclose(c1, c2))
print(np.allclose(c1, c3))
2.87 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
3.48 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
16.6 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
peak memory: 102.70 MiB, increment: 0.00 MiB
peak memory: 15365.39 MiB, increment: 15262.69 MiB
peak memory: 102.69 MiB, increment: -0.01 MiB
True
True
By taking the inner product directly, you can make the function around 10^6 times faster. Of course that assumes the measurement operator as defined.

Related

How to determine indices in array between a set of values (in batch)

I was wondering if anyone has an idea on how to speed up the identification of which indices are between a set of values.
Let's say I have a 1d array of sorted values (~50k) and a large list (>100k) of a pair of min/max values and I want to determine which (if any) indices in the 1d array are present. I must also be able to do this many times where the 1d array changes in size/shape.
My current approach is to use numpy and numba and list comprehension but unfortunately it doesn't really scale. It's okay if I try to look for ~1k values but when the number is much larger, it's too slow to be able to repeat it 1000s of times.
Current code:
import numpy as np
import numba
#numba.njit()
def find_between_batch(array: np.ndarray, min_value: np.ndarray, max_value: np.ndarray):
"""Find indices between specified boundaries for many items."""
res = []
for i in range(len(min_value)):
res.append(np.where(np.logical_and(array >= min_value[i], array <= max_value[i]))[0])
return res
Here is an example of the input:
x = np.linspace(0, 2000, 50000) # input 1d array
# these are the boundaries for which we should find the indices
mins = np.sort(np.random.choice(x, 10000)) - 0.01 # lower values to search for
maxs = mins + 0.02 # upper values to search for
And the current performance
# pre-compile
result = find_between_batch(x, mins, maxs)
%timeit -r 3 -n 10 find_between_batch(x, mins, maxs)
616 ms ± 4.11 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
And example output
result
[array([11]),
array([14]),
array([19]),
array([23]),
...
]
Does anyone have a suggestion on how to speed this up or if there is another approach that could give me the same results?
Thanks for the suggestion to use np.searchsorted - I've come up with a solution that is approx. 10-100x faster than my initial attempt.
#numba.njit()
def find_between_batch2(array: np.ndarray, min_value: np.ndarray, max_value: np.ndarray):
"""Find indices between specified boundaries for many items."""
min_indices = np.searchsorted(array, min_value, side="left")
max_indices = np.searchsorted(array, max_value, side="right")
res = []
for i in range(len(min_value)):
_array = array[min_indices[i]:max_indices[i]]
res.append(min_indices[i] + find_between(_array, min_value[i], max_value[i]))
return res
Original code:
%timeit -r 3 -n 10 find_between_batch(x, mins, maxs)
616 ms ± 4.11 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
Updated code:
%timeit -r 3 -n 10 find_between_batch2(x, mins, maxs)
6.36 ms ± 73.6 µs per loop (mean ± std. dev. of 3 runs, 10 loops each)

Elementwise maximum of sparse Scipy matrix & vector with broadcasting

I need a fast element-wise maximum that compares each row of an n-by-m scipy sparse matrix element-wise to a sparse 1-by-m matrix. This works perfectly in Numpy using np.maximum(mat, vec) via Numpy's broadcasting.
However, Scipy's .maximum() does not have broadcasting. My matrix is large, so I cannot cast it to a numpy array.
My current workaround is to loop over the many rows of mat with mat[row,:].maximum(vec). This big loop is ruining my code efficiency (it has to be done many times). My slow solution is in the second code snippet below -- Is there a better solution?
# Example
import numpy as np
from scipy import sparse
mat = sparse.csc_matrix(np.arange(12).reshape((4,3)))
vec = sparse.csc_matrix([-1, 5, 100])
# Numpy's np.maximum() gives the **desired result** using broadcasting (but it can't handle sparse matrices):
numpy_result = np.maximum( mat.toarray(), vec.toarray() )
print( numpy_result )
# [[ 0 5 100]
# [ 3 5 100]
# [ 6 7 100]
# [ 9 10 100]]
# Scipy only compares the top row of mat to vec (no broadcasting!):
scipy_result = mat.maximum(vec)
print( scipy_result.toarray() )
# [[ 0 5 100]
# [ 3 4 5]
# [ 6 7 8]
# [ 9 10 11]]
#Reversing the order of mat and vec in the call to vec.maximum(mat) results in a single row output, and also frequently seg faults (!):
Larger example & current solution for speed testing
import numpy as np
from scipy import sparse
import timeit
mat = sparse.csc_matrix( sparse.random(20000, 4000, density=.01, data_rvs=lambda s: np.random.randint(0, 5000, size=s)) )
vec = sparse.csc_matrix( sparse.random(1, 4000, density=.01, data_rvs=lambda s: np.random.randint(0, 5000, size=s)) )
def sparse_elementwise_maximum(mat, vec):
output = sparse.lil_matrix(mat.shape)
for row_idx in range( mat.shape[0] ):
output[row_idx] = mat[row_idx,:].maximum(vec)
return output
# Time it
num_timing_loops = 3.0
starttime = timeit.default_timer()
for _ in range(int(num_timing_loops)):
sparse_elementwise_maximum(mat, vec)
print('time per call is:', (timeit.default_timer() - starttime)/num_timing_loops, 'seconds')
# 15 seconds per call (way too slow!)
EDIT
I'm accepting Max's answer, as the question was specifically about a high performance solution, and Max's solution offers huge 1000x-2500x speedups on various inputs I tried at the expense of more lines of code and Numba compiling. However, for general use, Daniel F's one-liner is a great solution offers 10x-50x speedups on examples I tried--I will probably use for many other things.
A low level approach
As always you can think on how a proper sparse matrix format for this operation is built up, for csr-matrices the main components are shape, data_arr,indices and ind_ptr.
With these parts of the scipy.sparse.csr object it is quite straight forward but maybe a bit time consuming to implement an efficient algorithm in a compiled language (C,C++,Cython, Python-Numba). Int his implemenation I used Numba, but porting it to C++ should be easily possible (syntax changes) and maybe avoiding the slicing.
Implementation (first try)
import numpy as np
import numba as nb
# get all needed components of the csr object and create a resulting csr object at the end
def sparse_elementwise_maximum_wrap(mat,vec):
mat_csr=mat.tocsr()
vec_csr=vec.tocsr()
shape_mat=mat_csr.shape
indices_mat=mat_csr.indices
indptr_mat=mat_csr.indptr
data_mat=mat_csr.data
indices_vec=vec_csr.indices
data_vec=vec_csr.data
res=sparse_elementwise_maximum_nb(indices_mat,indptr_mat,data_mat,shape_mat,indices_vec,data_vec)
res=sparse.csr_matrix(res, shape=shape_mat)
return res
#nb.njit(cache=True)
def sparse_elementwise_maximum_nb(indices_mat,indptr_mat,data_mat,shape_mat,vec_row_ind,vec_row_data):
data_res=[]
indices_res=[]
indptr_mat_res=[]
indptr_mat_=0
indptr_mat_res.append(indptr_mat_)
for row_idx in range(shape_mat[0]):
mat_row_ind=indices_mat[indptr_mat[row_idx]:indptr_mat[row_idx+1]]
mat_row_data=data_mat[indptr_mat[row_idx]:indptr_mat[row_idx+1]]
mat_ptr=0
vec_ptr=0
while mat_ptr<mat_row_ind.shape[0] and vec_ptr<vec_row_ind.shape[0]:
ind_mat=mat_row_ind[mat_ptr]
ind_vec=vec_row_ind[vec_ptr]
#value for both matrix and vector is present
if ind_mat==ind_vec:
data_res.append(max(mat_row_data[mat_ptr],vec_row_data[vec_ptr]))
indices_res.append(ind_mat)
mat_ptr+=1
vec_ptr+=1
indptr_mat_+=1
#only value for the matrix is present vector is assumed 0
elif ind_mat<ind_vec:
if mat_row_data[mat_ptr] >0:
data_res.append(mat_row_data[mat_ptr])
indices_res.append(ind_mat)
indptr_mat_+=1
mat_ptr+=1
#only value for the vector is present matrix is assumed 0
else:
if vec_row_data[vec_ptr] >0:
data_res.append(vec_row_data[vec_ptr])
indices_res.append(ind_vec)
indptr_mat_+=1
vec_ptr+=1
for i in range(mat_ptr,mat_row_ind.shape[0]):
if mat_row_data[i] >0:
data_res.append(mat_row_data[i])
indices_res.append(mat_row_ind[i])
indptr_mat_+=1
for i in range(vec_ptr,vec_row_ind.shape[0]):
if vec_row_data[i] >0:
data_res.append(vec_row_data[i])
indices_res.append(vec_row_ind[i])
indptr_mat_+=1
indptr_mat_res.append(indptr_mat_)
return np.array(data_res),np.array(indices_res),np.array(indptr_mat_res)
Implementation (optimized)
In this approach the lists are replaced by a dynamically resized array. I increased the size of the output in 60 MB steps. On creation of the csr-object, there is also no copy of the data made, just references. If you want avoid a memory overhead you have to copy the arrays in the end.
#nb.njit(cache=True)
def sparse_elementwise_maximum_nb(indices_mat,indptr_mat,data_mat,shape_mat,vec_row_ind,vec_row_data):
mem_step=5_000_000
#preallocate memory for 5M non-zero elements (60 MB in this example)
data_res=np.empty(mem_step,dtype=data_mat.dtype)
indices_res=np.empty(mem_step,dtype=np.int32)
data_res_p=0
indptr_mat_res=np.empty((shape_mat[0]+1),dtype=np.int32)
indptr_mat_res[0]=0
indptr_mat_res_p=1
indptr_mat_=0
for row_idx in range(shape_mat[0]):
mat_row_ind=indices_mat[indptr_mat[row_idx]:indptr_mat[row_idx+1]]
mat_row_data=data_mat[indptr_mat[row_idx]:indptr_mat[row_idx+1]]
#check if resizing is necessary
if data_res.shape[0]<data_res_p+shape_mat[1]:
#add at least memory for another mem_step elements
size_to_add=mem_step
if shape_mat[1] >size_to_add:
size_to_add=shape_mat[1]
data_res_2 =np.empty(data_res.shape[0] +size_to_add,data_res.dtype)
indices_res_2=np.empty(indices_res.shape[0]+size_to_add,indices_res.dtype)
for i in range(data_res_p):
data_res_2[i]=data_res[i]
indices_res_2[i]=indices_res[i]
data_res=data_res_2
indices_res=indices_res_2
mat_ptr=0
vec_ptr=0
while mat_ptr<mat_row_ind.shape[0] and vec_ptr<vec_row_ind.shape[0]:
ind_mat=mat_row_ind[mat_ptr]
ind_vec=vec_row_ind[vec_ptr]
#value for both matrix and vector is present
if ind_mat==ind_vec:
data_res[data_res_p]=max(mat_row_data[mat_ptr],vec_row_data[vec_ptr])
indices_res[data_res_p]=ind_mat
data_res_p+=1
mat_ptr+=1
vec_ptr+=1
indptr_mat_+=1
#only value for the matrix is present vector is assumed 0
elif ind_mat<ind_vec:
if mat_row_data[mat_ptr] >0:
data_res[data_res_p]=mat_row_data[mat_ptr]
indices_res[data_res_p]=ind_mat
data_res_p+=1
indptr_mat_+=1
mat_ptr+=1
#only value for the vector is present matrix is assumed 0
else:
if vec_row_data[vec_ptr] >0:
data_res[data_res_p]=vec_row_data[vec_ptr]
indices_res[data_res_p]=ind_vec
data_res_p+=1
indptr_mat_+=1
vec_ptr+=1
for i in range(mat_ptr,mat_row_ind.shape[0]):
if mat_row_data[i] >0:
data_res[data_res_p]=mat_row_data[i]
indices_res[data_res_p]=mat_row_ind[i]
data_res_p+=1
indptr_mat_+=1
for i in range(vec_ptr,vec_row_ind.shape[0]):
if vec_row_data[i] >0:
data_res[data_res_p]=vec_row_data[i]
indices_res[data_res_p]=vec_row_ind[i]
data_res_p+=1
indptr_mat_+=1
indptr_mat_res[indptr_mat_res_p]=indptr_mat_
indptr_mat_res_p+=1
return data_res[:data_res_p],indices_res[:data_res_p],indptr_mat_res
Maximum memory allocated in the beginning
The performance and usability of this approach heavily depends on the inputs. In this approach the maximal memory is allocated (this could easily cause out of memory errors).
#nb.njit(cache=True)
def sparse_elementwise_maximum_nb(indices_mat,indptr_mat,data_mat,shape_mat,vec_row_ind,vec_row_data,shrink_to_fit):
max_non_zero=shape_mat[0]*vec_row_data.shape[0]+data_mat.shape[0]
data_res=np.empty(max_non_zero,dtype=data_mat.dtype)
indices_res=np.empty(max_non_zero,dtype=np.int32)
data_res_p=0
indptr_mat_res=np.empty((shape_mat[0]+1),dtype=np.int32)
indptr_mat_res[0]=0
indptr_mat_res_p=1
indptr_mat_=0
for row_idx in range(shape_mat[0]):
mat_row_ind=indices_mat[indptr_mat[row_idx]:indptr_mat[row_idx+1]]
mat_row_data=data_mat[indptr_mat[row_idx]:indptr_mat[row_idx+1]]
mat_ptr=0
vec_ptr=0
while mat_ptr<mat_row_ind.shape[0] and vec_ptr<vec_row_ind.shape[0]:
ind_mat=mat_row_ind[mat_ptr]
ind_vec=vec_row_ind[vec_ptr]
#value for both matrix and vector is present
if ind_mat==ind_vec:
data_res[data_res_p]=max(mat_row_data[mat_ptr],vec_row_data[vec_ptr])
indices_res[data_res_p]=ind_mat
data_res_p+=1
mat_ptr+=1
vec_ptr+=1
indptr_mat_+=1
#only value for the matrix is present vector is assumed 0
elif ind_mat<ind_vec:
if mat_row_data[mat_ptr] >0:
data_res[data_res_p]=mat_row_data[mat_ptr]
indices_res[data_res_p]=ind_mat
data_res_p+=1
indptr_mat_+=1
mat_ptr+=1
#only value for the vector is present matrix is assumed 0
else:
if vec_row_data[vec_ptr] >0:
data_res[data_res_p]=vec_row_data[vec_ptr]
indices_res[data_res_p]=ind_vec
data_res_p+=1
indptr_mat_+=1
vec_ptr+=1
for i in range(mat_ptr,mat_row_ind.shape[0]):
if mat_row_data[i] >0:
data_res[data_res_p]=mat_row_data[i]
indices_res[data_res_p]=mat_row_ind[i]
data_res_p+=1
indptr_mat_+=1
for i in range(vec_ptr,vec_row_ind.shape[0]):
if vec_row_data[i] >0:
data_res[data_res_p]=vec_row_data[i]
indices_res[data_res_p]=vec_row_ind[i]
data_res_p+=1
indptr_mat_+=1
indptr_mat_res[indptr_mat_res_p]=indptr_mat_
indptr_mat_res_p+=1
if shrink_to_fit==True:
data_res=np.copy(data_res[:data_res_p])
indices_res=np.copy(indices_res[:data_res_p])
else:
data_res=data_res[:data_res_p]
indices_res=indices_res[:data_res_p]
return data_res,indices_res,indptr_mat_res
# get all needed components of the csr object and create a resulting csr object at the end
def sparse_elementwise_maximum_wrap(mat,vec,shrink_to_fit=True):
mat_csr=mat.tocsr()
vec_csr=vec.tocsr()
shape_mat=mat_csr.shape
indices_mat=mat_csr.indices
indptr_mat=mat_csr.indptr
data_mat=mat_csr.data
indices_vec=vec_csr.indices
data_vec=vec_csr.data
res=sparse_elementwise_maximum_nb(indices_mat,indptr_mat,data_mat,shape_mat,indices_vec,data_vec,shrink_to_fit)
res=sparse.csr_matrix(res, shape=shape_mat)
return res
Timings
Numba has a compilation overhead or some overhead to load the function from cache. Don't consider the first call if you want to get the runtime and not compilation+runtime.
import numpy as np
from scipy import sparse
mat = sparse.csr_matrix( sparse.random(20000, 4000, density=.01, data_rvs=lambda s: np.random.randint(0, 5000, size=s)) )
vec = sparse.csr_matrix( sparse.random(1, 4000, density=.01, data_rvs=lambda s: np.random.randint(0, 5000, size=s)) )
%timeit output=sparse_elementwise_maximum(mat, vec)
#for csc input
37.9 s ± 224 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#for csr input
10.7 s ± 90.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#Daniel F
%timeit sparse_maximum(mat, vec)
164 ms ± 1.74 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
#low level implementation (first try)
%timeit res=sparse_elementwise_maximum_wrap(mat,vec)
89.7 ms ± 2.51 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#low level implementation (optimized, csr)
%timeit res=sparse_elementwise_maximum_wrap(mat,vec)
16.5 ms ± 122 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#low level implementation (preallocation, without copying at the end)
%timeit res=sparse_elementwise_maximum_wrap(mat,vec)
16.5 ms ± 122 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#low level implementation (preallocation, with copying at the end)
%timeit res=sparse_elementwise_maximum_wrap(mat,vec)
16.5 ms ± 122 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit res=sparse_elementwise_maximum_wrap(mat,vec,shrink_to_fit=False)
14.9 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit res=sparse_elementwise_maximum_wrap(mat,vec,shrink_to_fit=True)
21.7 ms ± 399 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
#For comparison, copying the result takes
%%timeit
np.copy(res.data)
np.copy(res.indices)
np.copy(res.indptr)
7.8 ms ± 47.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
scipy.sparse matrices don't broadcast. At all. So unless you can figure out some way to operate on the indices and inpts (I haven't), you're stuck stacking. Best I can figure out is just to vstack your vecs until they're the same shape as mat. It seems to give a good speedup, although it doesn't explain the segfault weirdness with csr.
#using `mat` and `vec` from the speed test
def sparse_maximum(mat, vec):
vec1 = sparse.vstack([vec for _ in range(mat.shape[0])])
return mat.maximum(vec1)
# Time it
num_timing_loops = 3.0
starttime = timeit.default_timer()
sparse_maximum(mat, vec)
print('time per call is:', (timeit.default_timer() - starttime)/num_timing_loops, 'seconds')
# I was getting 11-12 seconds on your original code
time per call is: 0.514533479333295 seconds
Proof that it works on original matrices:
vec = sparse.vstack([vec for _ in range(4)])
print(mat.maximum(vec).todense())
[[ 0 5 100]
[ 3 5 100]
[ 6 7 100]
[ 9 10 100]]
Looking at the maximum method, and code, especially the _binopt method in /scipy/sparse/compressed.py it's apparent that it can work with a scalar other. For a sparse other it constructs a new sparse matrix (of the same format and shape) using indptr, etc values. If other has the same shape, it works:
In [55]: mat = sparse.csr_matrix(np.arange(12).reshape((4,3)))
In [64]: mat.maximum(mat)
Out[64]:
<4x3 sparse matrix of type '<class 'numpy.int64'>'
with 11 stored elements in Compressed Sparse Row format>
It fails is the other is a 1d sparse matrix:
In [65]: mat.maximum(mat[0,:])
Segmentation fault (core dumped)
mat.maximum(mat[:,0]) runs without error, though I'm not sure about the values. mat[:,0] will have the same size indptr.
I thought the mat.maximum(mat[:,0]) would give same fault if mat was csc, but it doesn't.
Let's be honest, this kind of operation is not a strong point for sparse matrices. The core of its math is matrix multiplication. That's what they were originally developed for - sparse linear algebra problems such as finite difference and finite element.

Fastest way to compute large number of 3x3 dot product

I have to compute a large number of 3x3 linear transformations (eg. rotations). This is what I have so far:
import numpy as np
from scipy import sparse
from numba import jit
n = 100000 # number of transformations
k = 100 # number of vectors for each transformation
A = np.random.rand(n, 3, k) # vectors
Op = np.random.rand(n, 3, 3) # operators
sOp = sparse.bsr_matrix((Op, np.arange(n), np.arange(n+1))) # same as Op but as block-diag
def dot1():
""" naive approach: many times np.dot """
return np.stack([np.dot(o, a) for o, a in zip(Op, A)])
#jit(nopython=True)
def dot2():
""" same as above, but jitted """
new = np.empty_like(A)
for i in range(Op.shape[0]):
new[i] = np.dot(Op[i], A[i])
return new
def dot3():
""" using einsum """
return np.einsum("ijk,ikl->ijl", Op, A)
def dot4():
""" using sparse block diag matrix """
return sOp.dot(A.reshape(3 * n, -1)).reshape(n, 3, -1)
On a macbook pro 2012, this gives me:
In [62]: %timeit dot1()
783 ms ± 20.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [63]: %timeit dot2()
261 ms ± 1.93 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [64]: %timeit dot3()
293 ms ± 2.89 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [65]: %timeit dot4()
281 ms ± 6.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Appart from the naive approach, all approaches are similar. Is there a way to accelerate this significantly?
Edit
(The cuda approach is the best when available. The following is comparing the non-cuda versions)
Following the various suggestions, I modified dot2, added the Op#A method, and a version based on #59356461.
#njit(fastmath=True, parallel=True)
def dot2(Op, A):
""" same as above, but jitted """
new = np.empty_like(A)
for i in prange(Op.shape[0]):
new[i] = np.dot(Op[i], A[i])
return new
def dot5(Op, A):
""" using matmul """
return Op#A
#njit(fastmath=True, parallel=True)
def dot6(Op, A):
""" another numba.jit with parallel (based on #59356461) """
new = np.empty_like(A)
for i_n in prange(A.shape[0]):
for i_k in range(A.shape[2]):
for i_x in range(3):
acc = 0.0j
for i_y in range(3):
acc += Op[i_n, i_x, i_y] * A[i_n, i_y, i_k]
new[i_n, i_x, i_k] = acc
return new
This is what I get (on a different machine) with benchit:
def gen(n, k):
Op = np.random.rand(n, 3, 3) + 1j * np.random.rand(n, 3, 3)
A = np.random.rand(n, 3, k) + 1j * np.random.rand(n, 3, k)
return Op, A
# benchit
import benchit
funcs = [dot1, dot2, dot3, dot4, dot5, dot6]
inputs = {n: gen(n, 100) for n in [100,1000,10000,100000,1000000]}
t = benchit.timings(funcs, inputs, multivar=True, input_name='Number of operators')
t.plot(logy=True, logx=True)
You've gotten some great suggestions, but I wanted to add one more due to this specific goal:
Is there a way to accelerate this significantly?
Realistically, if you need these operations to be significantly faster (which often means > 10x) you probably would want to use a GPU for the matrix multiplication. As a quick example:
import numpy as np
import cupy as cp
n = 100000 # number of transformations
k = 100 # number of vectors for each transformation
# CPU version
A = np.random.rand(n, 3, k) # vectors
Op = np.random.rand(n, 3, 3) # operators
def dot5(): # the suggested, best CPU approach
return Op#A
# GPU version using a V100
gA = cp.asarray(A)
gOp = cp.asarray(Op)
# run once to ignore JIT overhead before benchmarking
gOp#gA;
%timeit dot5()
%timeit gOp#gA; cp.cuda.Device().synchronize() # need to sync for a fair benchmark
112 ms ± 546 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
1.19 ms ± 1.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Use Op#A like suggested by #hpaulj in comments.
Here is a comparison using benchit:
def dot1(A,Op):
""" naive approach: many times np.dot """
return np.stack([np.dot(o, a) for o, a in zip(Op, A)])
#jit(nopython=True)
def dot2(A,Op):
""" same as above, but jitted """
new = np.empty_like(A)
for i in range(Op.shape[0]):
new[i] = np.dot(Op[i], A[i])
return new
def dot3(A,Op):
""" using einsum """
return np.einsum("ijk,ikl->ijl", Op, A)
def dot4(A,Op):
n = A.shape[0]
sOp = sparse.bsr_matrix((Op, np.arange(n), np.arange(n+1))) # same as Op but as block-diag
""" using sparse block diag matrix """
return sOp.dot(A.reshape(3 * n, -1)).reshape(n, 3, -1)
def dot5(A,Op):
return Op#A
in_ = {n:[np.random.rand(n, 3, k), np.random.rand(n, 3, 3)] for n in [100,1000,10000,100000,1000000]}
They seem to be close in performance for larger scale with dot5 being slightly faster.
In one answer Nick mentioned using the GPU - which is the best solution of course.
But - as a general rule - what you're doing is likely CPU limited. Therefore (with the exception to the GPU approach), the best bang you can get is if you make use of all the cores on your machine to work in parallel.
So for that you would want to use multiprocessing (not python's multithreading!), to split the job up into pieces running on each core in parallel.
This is not trivial, but also not too hard, and there are many good examples/guides online.
But if you had an 8-core machine, it would likely give you an almost 8x speed increase as long as you're careful to avoid memory bottlenecks by trying to pass many small objects between processes, but pass them all in a group at the start

Vectorization in a loop slower than a nested loop in numba jitted function

So I am experimenting on the performance boost of combining vectorization and for-loop powered by #njit in numba(I am currently using numba 0.45.1). Disappointingly, I found out it is actually slower than the pure nested-loop implementation in my code.
This is my code:
import numpy as np
from numba import njit
#njit
def func3(arr_in, win_arr):
n = arr_in.shape[0]
win_len = len(win_arr)
result = np.full((n, win_len), np.nan)
alpha_arr = 2 / (win_arr + 1)
e = np.full(win_len, arr_in[0])
w = np.ones(win_len)
two_index = np.nonzero(win_arr <= 2)[0][-1]+1
result[0, :two_index] = arr_in[0]
for i in range(1, n):
w = w + (1-alpha_arr)**i
e = e*(1-alpha_arr) + arr_in[i]
result[i,:] = e /w
return result
#njit
def func4(arr_in, win_arr):
n = arr_in.shape[0]
win_len = len(win_arr)
result = np.full((n, win_len), np.nan)
alpha_arr = 2 / (win_arr + 1)
e = np.full(win_len, arr_in[0])
w = np.ones(win_len)
two_index = np.nonzero(win_arr <= 2)[0][-1]+1
result[0, :two_index] = arr_in[0]
for i in range(1, n):
for col in range(len(win_arr)):
w[col] = w[col] + (1-alpha_arr[col])**i
e[col] = e[col]*(1-alpha_arr[col]) + arr_in[i]
result[i,col] = e[col] /w[col]
return result
if __name__ == '__main__':
np.random.seed(0)
data_size = 200000
winarr_size = 1000
data = np.random.uniform(0,1000, size = data_size)+29000
win_array = np.arange(1, winarr_size+1)
abc_test3= func3(data, win_array)
abc_test4= func4(data, win_array)
print(np.allclose(abc_test3, abc_test4, equal_nan = True))
I benchmarked the two functions using the following configurations:
(data_size,winarr_size) = (200000,100), (200000,200),(200000,1000), (200000,2000), (20000,10000), (2000,100000).
And found that the pure nested-for-loop implementation(func4) is consistently faster (about 2-5% faster) than the implementation with a for-loop mixed with vectorization (func3).
My questions are the following:
1) what needs to be changed to further improve the speed of the code?
2) why is it that the computation time of the vectorized version of the function grows linearly with the size of the win_arr? I thought the vectorization should make it so that the operation speed is constant no matter how big/small the vector is, but apparently this does not hold true in this case.
3) Are there any general conditions under which the computation time of the vectorized operation will still grow linearly with the input size?
It seems you misunderstood what "vectorized" means. Vectorized means that you write code that operates on arrays as-if they were scalars - but that's just how the code looks like, not related to performance.
In the Python/NumPy world vectorized also carries the meaning that the overhead of the loop in vectorized operations is (often) much smaller compared to loopy code. However the vectorized code still has to do the loop (even if it's hidden in a library)!
Also, if you write a loop with numba, numba will compile it and create fast code that performs (generally) as fast as vectorized NumPy code. That means inside a numba function there's no significant performance difference between vectorized and non-vectorized code.
So that should answer your questions:
2) why is it that the computation time of the vectorized version of the function grows linearly with the size of the win_arr? I thought the vectorization should make it so that the operation speed is constant no matter how big/small the vector is, but apparently this does not hold true in this case.
It grows linearly because it still has to iterate. In vectorized code the loop is just hidden inside a library routine.
3) Are there any general conditions under which the computation time of the vectorized operation will still grow linearly with the input size?
No.
You also asked what could be done to make it faster.
The comments already mentioned that you could parallelize it:
import numpy as np
import numba as nb
#nb.njit(parallel=True)
def func6(arr_in, win_arr):
n = arr_in.shape[0]
win_len = len(win_arr)
result = np.full((n, win_len), np.nan)
alpha_arr = 2 / (win_arr + 1)
e = np.full(win_len, arr_in[0])
w = np.ones(win_len)
two_index = np.nonzero(win_arr <= 2)[0][-1]+1
result[0, :two_index] = arr_in[0]
for i in range(1, n):
for col in nb.prange(len(win_arr)):
w[col] = w[col] + (1-alpha_arr[col])**i
e[col] = e[col] * (1-alpha_arr[col]) + arr_in[i]
result[i,col] = e[col] /w[col]
return result
This makes the code a bit faster on my machine (4cores).
However there's also a problem that your algorithm may be numerically unstable. The (1-alpha_arr[col])**i will underflow at some point when you raise it to powers of hundred-thousands:
>>> alpha = 0.01
>>> for i in [1, 10, 100, 1_000, 10_000, 50_000, 100_000, 200_000]:
... print((1-alpha)**i)
0.99
0.9043820750088044
0.3660323412732292
4.317124741065786e-05
2.2487748498162805e-44
5.750821364590612e-219
0.0 # <-- underflow
0.0
Always think twice about complicated mathematical operations like (pow, divisions,...). If you can replace them by easy operations like multiplications, additions and subtractions it is always worth a try.
Please note that multiplying alpha repeatedly with itself is only algebraically the same as directly calculating with exponentiation. Since this is numerical math the results can differ.
Also avoid unnecessary temporary arrays.
First try
#nb.njit(error_model="numpy",parallel=True)
def func5(arr_in, win_arr):
#filling the whole array with NaNs isn't necessary
result = np.empty((win_arr.shape[0],arr_in.shape[0]))
for col in range(win_arr.shape[0]):
result[col,0]=np.nan
two_index = np.nonzero(win_arr <= 2)[0][-1]+1
result[:two_index,0] = arr_in[0]
for col in nb.prange(win_arr.shape[0]):
alpha=1.-(2./ (win_arr[col] + 1.))
alpha_exp=alpha
w=1.
e=arr_in[0]
for i in range(1, arr_in.shape[0]):
w+= alpha_exp
e = e*alpha + arr_in[i]
result[col,i] = e/w
alpha_exp*=alpha
return result.T
Second try (avoiding underflow)
#nb.njit(error_model="numpy",parallel=True)
def func7(arr_in, win_arr):
#filling the whole array with NaNs isn't necessary
result = np.empty((win_arr.shape[0],arr_in.shape[0]))
for col in range(win_arr.shape[0]):
result[col,0]=np.nan
two_index = np.nonzero(win_arr <= 2)[0][-1]+1
result[:two_index,0] = arr_in[0]
for col in nb.prange(win_arr.shape[0]):
alpha=1.-(2./ (win_arr[col] + 1.))
alpha_exp=alpha
w=1.
e=arr_in[0]
for i in range(1, arr_in.shape[0]):
w+= alpha_exp
e = e*alpha + arr_in[i]
result[col,i] = e/w
if np.abs(alpha_exp)>=1e-308:
alpha_exp*=alpha
else:
alpha_exp=0.
return result.T
Timings
%timeit abc_test3= func3(data, win_array)
7.17 s ± 45.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit abc_test4= func4(data, win_array)
7.13 s ± 13.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#from MSeifert answer (parallelized)
%timeit abc_test6= func6(data, win_array)
3.42 s ± 153 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit abc_test5= func5(data, win_array)
1.22 s ± 22.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit abc_test7= func7(data, win_array)
238 ms ± 5.55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

High performance weighted random choice for python 2?

I have the following python method, which selects a weighted random element from the sequence "seq" randomly weighted by other sequence, which contains the weights for each element in seq:
def weighted_choice(seq, weights):
assert len(seq) == len(weights)
total = sum(weights)
r = random.uniform(0, total)
upto = 0
for i in range(len(seq)):
if upto + weights[i] >= r:
return seq[i]
upto += weights[i]
assert False, "Shouldn't get here"
If I call the above a million times with a 1000 element sequence, like this:
seq = range(1000)
weights = []
for i in range(1000):
weights.append(random.randint(1,100))
st=time.time()
for i in range(1000000):
r=weighted_choice(seq, weights)
print (time.time()-st)
it runs for approximately 45 seconds in cpython 2.7 and for 70 seconds in cpython 3.6.
It finishes in around 2.3 seconds in pypy 5.10, which would be fine for me, sadly I can't use pypy for some reasons.
Any ideas on how to speed up this function on cpython? I'm interested in other implementations (algorithmically, or via external libraries, like numpy) as well if they perform better.
ps: python3 has random.choices with weights, it runs for around 23 seconds, which is better than the above function, but still exactly ten times slower than pypy can run.
I've tried it with numpy this way:
weights=[1./1000]*1000
st=time.time()
for i in range(1000000):
#r=weighted_choice(seq, weights)
#r=random.choices(seq, weights)
r=numpy.random.choice(seq, p=weights)
print (time.time()-st)
It ran for 70 seconds.
You can use numpy.random.choice (the p parameter is the weights). Normally numpy functions are vectorized and so run at near-C speed.
Implement as:
def weighted_choice(seq, weights):
w = np.asarray(weights)
p = w / w.sum() # can skip if weights always sum to 1
return np.random.choice(seq, p=w)
Edit:
Timings:
%timeit np.random.choice(x, p=w) # len(x) == 1_000_000
13 ms ± 238 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np.random.choice(y, p=w) # len(y) == 100_000_000
1.28 s ± 18.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
you could take this approach with numpy. If you emlimiate the for loop, you can get the true power of numpy by indexing the positions you need
#Untimed since you did not
seq = np.arange(1000)
weights = np.random.randint(1,100,(1000,1))
def weights_numpy(seq,weights,iterations):
"""
:param seq: Input sequence
:param weights: Input Weights
:param iterations: Iterations to run
:return:
"""
r = np.random.uniform(0,weights.sum(0),(1,iterations)) #create array of choices
ar = weights.cumsum(0) # get cumulative sum
return seq[(ar >= r).argmax(0)] #get indeces of seq that meet your condition
And the timing (python 3,numpy '1.14.0')
%timeit weights_numpy(seq,weights,1000000)
4.05 s ± 256 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Which is a bit slower than PyPy, but hardly...

Categories