I need a fast element-wise maximum that compares each row of an n-by-m scipy sparse matrix element-wise to a sparse 1-by-m matrix. This works perfectly in Numpy using np.maximum(mat, vec) via Numpy's broadcasting.
However, Scipy's .maximum() does not have broadcasting. My matrix is large, so I cannot cast it to a numpy array.
My current workaround is to loop over the many rows of mat with mat[row,:].maximum(vec). This big loop is ruining my code efficiency (it has to be done many times). My slow solution is in the second code snippet below -- Is there a better solution?
# Example
import numpy as np
from scipy import sparse
mat = sparse.csc_matrix(np.arange(12).reshape((4,3)))
vec = sparse.csc_matrix([-1, 5, 100])
# Numpy's np.maximum() gives the **desired result** using broadcasting (but it can't handle sparse matrices):
numpy_result = np.maximum( mat.toarray(), vec.toarray() )
print( numpy_result )
# [[ 0 5 100]
# [ 3 5 100]
# [ 6 7 100]
# [ 9 10 100]]
# Scipy only compares the top row of mat to vec (no broadcasting!):
scipy_result = mat.maximum(vec)
print( scipy_result.toarray() )
# [[ 0 5 100]
# [ 3 4 5]
# [ 6 7 8]
# [ 9 10 11]]
#Reversing the order of mat and vec in the call to vec.maximum(mat) results in a single row output, and also frequently seg faults (!):
Larger example & current solution for speed testing
import numpy as np
from scipy import sparse
import timeit
mat = sparse.csc_matrix( sparse.random(20000, 4000, density=.01, data_rvs=lambda s: np.random.randint(0, 5000, size=s)) )
vec = sparse.csc_matrix( sparse.random(1, 4000, density=.01, data_rvs=lambda s: np.random.randint(0, 5000, size=s)) )
def sparse_elementwise_maximum(mat, vec):
output = sparse.lil_matrix(mat.shape)
for row_idx in range( mat.shape[0] ):
output[row_idx] = mat[row_idx,:].maximum(vec)
return output
# Time it
num_timing_loops = 3.0
starttime = timeit.default_timer()
for _ in range(int(num_timing_loops)):
sparse_elementwise_maximum(mat, vec)
print('time per call is:', (timeit.default_timer() - starttime)/num_timing_loops, 'seconds')
# 15 seconds per call (way too slow!)
EDIT
I'm accepting Max's answer, as the question was specifically about a high performance solution, and Max's solution offers huge 1000x-2500x speedups on various inputs I tried at the expense of more lines of code and Numba compiling. However, for general use, Daniel F's one-liner is a great solution offers 10x-50x speedups on examples I tried--I will probably use for many other things.
A low level approach
As always you can think on how a proper sparse matrix format for this operation is built up, for csr-matrices the main components are shape, data_arr,indices and ind_ptr.
With these parts of the scipy.sparse.csr object it is quite straight forward but maybe a bit time consuming to implement an efficient algorithm in a compiled language (C,C++,Cython, Python-Numba). Int his implemenation I used Numba, but porting it to C++ should be easily possible (syntax changes) and maybe avoiding the slicing.
Implementation (first try)
import numpy as np
import numba as nb
# get all needed components of the csr object and create a resulting csr object at the end
def sparse_elementwise_maximum_wrap(mat,vec):
mat_csr=mat.tocsr()
vec_csr=vec.tocsr()
shape_mat=mat_csr.shape
indices_mat=mat_csr.indices
indptr_mat=mat_csr.indptr
data_mat=mat_csr.data
indices_vec=vec_csr.indices
data_vec=vec_csr.data
res=sparse_elementwise_maximum_nb(indices_mat,indptr_mat,data_mat,shape_mat,indices_vec,data_vec)
res=sparse.csr_matrix(res, shape=shape_mat)
return res
#nb.njit(cache=True)
def sparse_elementwise_maximum_nb(indices_mat,indptr_mat,data_mat,shape_mat,vec_row_ind,vec_row_data):
data_res=[]
indices_res=[]
indptr_mat_res=[]
indptr_mat_=0
indptr_mat_res.append(indptr_mat_)
for row_idx in range(shape_mat[0]):
mat_row_ind=indices_mat[indptr_mat[row_idx]:indptr_mat[row_idx+1]]
mat_row_data=data_mat[indptr_mat[row_idx]:indptr_mat[row_idx+1]]
mat_ptr=0
vec_ptr=0
while mat_ptr<mat_row_ind.shape[0] and vec_ptr<vec_row_ind.shape[0]:
ind_mat=mat_row_ind[mat_ptr]
ind_vec=vec_row_ind[vec_ptr]
#value for both matrix and vector is present
if ind_mat==ind_vec:
data_res.append(max(mat_row_data[mat_ptr],vec_row_data[vec_ptr]))
indices_res.append(ind_mat)
mat_ptr+=1
vec_ptr+=1
indptr_mat_+=1
#only value for the matrix is present vector is assumed 0
elif ind_mat<ind_vec:
if mat_row_data[mat_ptr] >0:
data_res.append(mat_row_data[mat_ptr])
indices_res.append(ind_mat)
indptr_mat_+=1
mat_ptr+=1
#only value for the vector is present matrix is assumed 0
else:
if vec_row_data[vec_ptr] >0:
data_res.append(vec_row_data[vec_ptr])
indices_res.append(ind_vec)
indptr_mat_+=1
vec_ptr+=1
for i in range(mat_ptr,mat_row_ind.shape[0]):
if mat_row_data[i] >0:
data_res.append(mat_row_data[i])
indices_res.append(mat_row_ind[i])
indptr_mat_+=1
for i in range(vec_ptr,vec_row_ind.shape[0]):
if vec_row_data[i] >0:
data_res.append(vec_row_data[i])
indices_res.append(vec_row_ind[i])
indptr_mat_+=1
indptr_mat_res.append(indptr_mat_)
return np.array(data_res),np.array(indices_res),np.array(indptr_mat_res)
Implementation (optimized)
In this approach the lists are replaced by a dynamically resized array. I increased the size of the output in 60 MB steps. On creation of the csr-object, there is also no copy of the data made, just references. If you want avoid a memory overhead you have to copy the arrays in the end.
#nb.njit(cache=True)
def sparse_elementwise_maximum_nb(indices_mat,indptr_mat,data_mat,shape_mat,vec_row_ind,vec_row_data):
mem_step=5_000_000
#preallocate memory for 5M non-zero elements (60 MB in this example)
data_res=np.empty(mem_step,dtype=data_mat.dtype)
indices_res=np.empty(mem_step,dtype=np.int32)
data_res_p=0
indptr_mat_res=np.empty((shape_mat[0]+1),dtype=np.int32)
indptr_mat_res[0]=0
indptr_mat_res_p=1
indptr_mat_=0
for row_idx in range(shape_mat[0]):
mat_row_ind=indices_mat[indptr_mat[row_idx]:indptr_mat[row_idx+1]]
mat_row_data=data_mat[indptr_mat[row_idx]:indptr_mat[row_idx+1]]
#check if resizing is necessary
if data_res.shape[0]<data_res_p+shape_mat[1]:
#add at least memory for another mem_step elements
size_to_add=mem_step
if shape_mat[1] >size_to_add:
size_to_add=shape_mat[1]
data_res_2 =np.empty(data_res.shape[0] +size_to_add,data_res.dtype)
indices_res_2=np.empty(indices_res.shape[0]+size_to_add,indices_res.dtype)
for i in range(data_res_p):
data_res_2[i]=data_res[i]
indices_res_2[i]=indices_res[i]
data_res=data_res_2
indices_res=indices_res_2
mat_ptr=0
vec_ptr=0
while mat_ptr<mat_row_ind.shape[0] and vec_ptr<vec_row_ind.shape[0]:
ind_mat=mat_row_ind[mat_ptr]
ind_vec=vec_row_ind[vec_ptr]
#value for both matrix and vector is present
if ind_mat==ind_vec:
data_res[data_res_p]=max(mat_row_data[mat_ptr],vec_row_data[vec_ptr])
indices_res[data_res_p]=ind_mat
data_res_p+=1
mat_ptr+=1
vec_ptr+=1
indptr_mat_+=1
#only value for the matrix is present vector is assumed 0
elif ind_mat<ind_vec:
if mat_row_data[mat_ptr] >0:
data_res[data_res_p]=mat_row_data[mat_ptr]
indices_res[data_res_p]=ind_mat
data_res_p+=1
indptr_mat_+=1
mat_ptr+=1
#only value for the vector is present matrix is assumed 0
else:
if vec_row_data[vec_ptr] >0:
data_res[data_res_p]=vec_row_data[vec_ptr]
indices_res[data_res_p]=ind_vec
data_res_p+=1
indptr_mat_+=1
vec_ptr+=1
for i in range(mat_ptr,mat_row_ind.shape[0]):
if mat_row_data[i] >0:
data_res[data_res_p]=mat_row_data[i]
indices_res[data_res_p]=mat_row_ind[i]
data_res_p+=1
indptr_mat_+=1
for i in range(vec_ptr,vec_row_ind.shape[0]):
if vec_row_data[i] >0:
data_res[data_res_p]=vec_row_data[i]
indices_res[data_res_p]=vec_row_ind[i]
data_res_p+=1
indptr_mat_+=1
indptr_mat_res[indptr_mat_res_p]=indptr_mat_
indptr_mat_res_p+=1
return data_res[:data_res_p],indices_res[:data_res_p],indptr_mat_res
Maximum memory allocated in the beginning
The performance and usability of this approach heavily depends on the inputs. In this approach the maximal memory is allocated (this could easily cause out of memory errors).
#nb.njit(cache=True)
def sparse_elementwise_maximum_nb(indices_mat,indptr_mat,data_mat,shape_mat,vec_row_ind,vec_row_data,shrink_to_fit):
max_non_zero=shape_mat[0]*vec_row_data.shape[0]+data_mat.shape[0]
data_res=np.empty(max_non_zero,dtype=data_mat.dtype)
indices_res=np.empty(max_non_zero,dtype=np.int32)
data_res_p=0
indptr_mat_res=np.empty((shape_mat[0]+1),dtype=np.int32)
indptr_mat_res[0]=0
indptr_mat_res_p=1
indptr_mat_=0
for row_idx in range(shape_mat[0]):
mat_row_ind=indices_mat[indptr_mat[row_idx]:indptr_mat[row_idx+1]]
mat_row_data=data_mat[indptr_mat[row_idx]:indptr_mat[row_idx+1]]
mat_ptr=0
vec_ptr=0
while mat_ptr<mat_row_ind.shape[0] and vec_ptr<vec_row_ind.shape[0]:
ind_mat=mat_row_ind[mat_ptr]
ind_vec=vec_row_ind[vec_ptr]
#value for both matrix and vector is present
if ind_mat==ind_vec:
data_res[data_res_p]=max(mat_row_data[mat_ptr],vec_row_data[vec_ptr])
indices_res[data_res_p]=ind_mat
data_res_p+=1
mat_ptr+=1
vec_ptr+=1
indptr_mat_+=1
#only value for the matrix is present vector is assumed 0
elif ind_mat<ind_vec:
if mat_row_data[mat_ptr] >0:
data_res[data_res_p]=mat_row_data[mat_ptr]
indices_res[data_res_p]=ind_mat
data_res_p+=1
indptr_mat_+=1
mat_ptr+=1
#only value for the vector is present matrix is assumed 0
else:
if vec_row_data[vec_ptr] >0:
data_res[data_res_p]=vec_row_data[vec_ptr]
indices_res[data_res_p]=ind_vec
data_res_p+=1
indptr_mat_+=1
vec_ptr+=1
for i in range(mat_ptr,mat_row_ind.shape[0]):
if mat_row_data[i] >0:
data_res[data_res_p]=mat_row_data[i]
indices_res[data_res_p]=mat_row_ind[i]
data_res_p+=1
indptr_mat_+=1
for i in range(vec_ptr,vec_row_ind.shape[0]):
if vec_row_data[i] >0:
data_res[data_res_p]=vec_row_data[i]
indices_res[data_res_p]=vec_row_ind[i]
data_res_p+=1
indptr_mat_+=1
indptr_mat_res[indptr_mat_res_p]=indptr_mat_
indptr_mat_res_p+=1
if shrink_to_fit==True:
data_res=np.copy(data_res[:data_res_p])
indices_res=np.copy(indices_res[:data_res_p])
else:
data_res=data_res[:data_res_p]
indices_res=indices_res[:data_res_p]
return data_res,indices_res,indptr_mat_res
# get all needed components of the csr object and create a resulting csr object at the end
def sparse_elementwise_maximum_wrap(mat,vec,shrink_to_fit=True):
mat_csr=mat.tocsr()
vec_csr=vec.tocsr()
shape_mat=mat_csr.shape
indices_mat=mat_csr.indices
indptr_mat=mat_csr.indptr
data_mat=mat_csr.data
indices_vec=vec_csr.indices
data_vec=vec_csr.data
res=sparse_elementwise_maximum_nb(indices_mat,indptr_mat,data_mat,shape_mat,indices_vec,data_vec,shrink_to_fit)
res=sparse.csr_matrix(res, shape=shape_mat)
return res
Timings
Numba has a compilation overhead or some overhead to load the function from cache. Don't consider the first call if you want to get the runtime and not compilation+runtime.
import numpy as np
from scipy import sparse
mat = sparse.csr_matrix( sparse.random(20000, 4000, density=.01, data_rvs=lambda s: np.random.randint(0, 5000, size=s)) )
vec = sparse.csr_matrix( sparse.random(1, 4000, density=.01, data_rvs=lambda s: np.random.randint(0, 5000, size=s)) )
%timeit output=sparse_elementwise_maximum(mat, vec)
#for csc input
37.9 s ± 224 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#for csr input
10.7 s ± 90.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#Daniel F
%timeit sparse_maximum(mat, vec)
164 ms ± 1.74 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
#low level implementation (first try)
%timeit res=sparse_elementwise_maximum_wrap(mat,vec)
89.7 ms ± 2.51 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#low level implementation (optimized, csr)
%timeit res=sparse_elementwise_maximum_wrap(mat,vec)
16.5 ms ± 122 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#low level implementation (preallocation, without copying at the end)
%timeit res=sparse_elementwise_maximum_wrap(mat,vec)
16.5 ms ± 122 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#low level implementation (preallocation, with copying at the end)
%timeit res=sparse_elementwise_maximum_wrap(mat,vec)
16.5 ms ± 122 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit res=sparse_elementwise_maximum_wrap(mat,vec,shrink_to_fit=False)
14.9 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit res=sparse_elementwise_maximum_wrap(mat,vec,shrink_to_fit=True)
21.7 ms ± 399 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
#For comparison, copying the result takes
%%timeit
np.copy(res.data)
np.copy(res.indices)
np.copy(res.indptr)
7.8 ms ± 47.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
scipy.sparse matrices don't broadcast. At all. So unless you can figure out some way to operate on the indices and inpts (I haven't), you're stuck stacking. Best I can figure out is just to vstack your vecs until they're the same shape as mat. It seems to give a good speedup, although it doesn't explain the segfault weirdness with csr.
#using `mat` and `vec` from the speed test
def sparse_maximum(mat, vec):
vec1 = sparse.vstack([vec for _ in range(mat.shape[0])])
return mat.maximum(vec1)
# Time it
num_timing_loops = 3.0
starttime = timeit.default_timer()
sparse_maximum(mat, vec)
print('time per call is:', (timeit.default_timer() - starttime)/num_timing_loops, 'seconds')
# I was getting 11-12 seconds on your original code
time per call is: 0.514533479333295 seconds
Proof that it works on original matrices:
vec = sparse.vstack([vec for _ in range(4)])
print(mat.maximum(vec).todense())
[[ 0 5 100]
[ 3 5 100]
[ 6 7 100]
[ 9 10 100]]
Looking at the maximum method, and code, especially the _binopt method in /scipy/sparse/compressed.py it's apparent that it can work with a scalar other. For a sparse other it constructs a new sparse matrix (of the same format and shape) using indptr, etc values. If other has the same shape, it works:
In [55]: mat = sparse.csr_matrix(np.arange(12).reshape((4,3)))
In [64]: mat.maximum(mat)
Out[64]:
<4x3 sparse matrix of type '<class 'numpy.int64'>'
with 11 stored elements in Compressed Sparse Row format>
It fails is the other is a 1d sparse matrix:
In [65]: mat.maximum(mat[0,:])
Segmentation fault (core dumped)
mat.maximum(mat[:,0]) runs without error, though I'm not sure about the values. mat[:,0] will have the same size indptr.
I thought the mat.maximum(mat[:,0]) would give same fault if mat was csc, but it doesn't.
Let's be honest, this kind of operation is not a strong point for sparse matrices. The core of its math is matrix multiplication. That's what they were originally developed for - sparse linear algebra problems such as finite difference and finite element.
Related
I was wondering if anyone has an idea on how to speed up the identification of which indices are between a set of values.
Let's say I have a 1d array of sorted values (~50k) and a large list (>100k) of a pair of min/max values and I want to determine which (if any) indices in the 1d array are present. I must also be able to do this many times where the 1d array changes in size/shape.
My current approach is to use numpy and numba and list comprehension but unfortunately it doesn't really scale. It's okay if I try to look for ~1k values but when the number is much larger, it's too slow to be able to repeat it 1000s of times.
Current code:
import numpy as np
import numba
#numba.njit()
def find_between_batch(array: np.ndarray, min_value: np.ndarray, max_value: np.ndarray):
"""Find indices between specified boundaries for many items."""
res = []
for i in range(len(min_value)):
res.append(np.where(np.logical_and(array >= min_value[i], array <= max_value[i]))[0])
return res
Here is an example of the input:
x = np.linspace(0, 2000, 50000) # input 1d array
# these are the boundaries for which we should find the indices
mins = np.sort(np.random.choice(x, 10000)) - 0.01 # lower values to search for
maxs = mins + 0.02 # upper values to search for
And the current performance
# pre-compile
result = find_between_batch(x, mins, maxs)
%timeit -r 3 -n 10 find_between_batch(x, mins, maxs)
616 ms ± 4.11 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
And example output
result
[array([11]),
array([14]),
array([19]),
array([23]),
...
]
Does anyone have a suggestion on how to speed this up or if there is another approach that could give me the same results?
Thanks for the suggestion to use np.searchsorted - I've come up with a solution that is approx. 10-100x faster than my initial attempt.
#numba.njit()
def find_between_batch2(array: np.ndarray, min_value: np.ndarray, max_value: np.ndarray):
"""Find indices between specified boundaries for many items."""
min_indices = np.searchsorted(array, min_value, side="left")
max_indices = np.searchsorted(array, max_value, side="right")
res = []
for i in range(len(min_value)):
_array = array[min_indices[i]:max_indices[i]]
res.append(min_indices[i] + find_between(_array, min_value[i], max_value[i]))
return res
Original code:
%timeit -r 3 -n 10 find_between_batch(x, mins, maxs)
616 ms ± 4.11 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
Updated code:
%timeit -r 3 -n 10 find_between_batch2(x, mins, maxs)
6.36 ms ± 73.6 µs per loop (mean ± std. dev. of 3 runs, 10 loops each)
Long story short, I'm applying a function onto multiple different time intervals and then storing the resulting arrays at different indexs in an ndarray. Presently, I'm doing this by using the a for loop with the numpy equivalent of the enumerate function. As I understand it, this eliminates the major advantage of numpy: vectorisation. Is this a particular way my rountine could be implemented that retains this advantage?
Here is my code:
Most of is working parts for the function psi_t
import numpy as np
# Number of Walks and Number of Positions
N = 100
P = 2*N +1
hopping_rate = 0.5
psi_t0 = np.zeros(P)
psi_t0[N] = 1
#creates the line upon which the particle moves
#index N is the central position
def hamiltonian(line_length, hopping_rate):
'''
creates the simple non time dependent hamiltonian for H = γA
where A is the adjancency matrix
'''
return hopping_rate * line_adjacency_matrix(line_length)
def measurement_operator(positions,finished_quantum_state):
'''
Converts the finished quantum state into an array of probabilities for
being in each position.
Uses the measurement operator from Susan Blog
https://susan-stepney.blogspot.com/2014/02/mathjax.html
Improved on by this guy
https://github.com/Driminary/python-randomwalk-project/blob/master/quantum-2D.py
Apart from the fact that the measurement operator drops the extra dimensions of the spin space,
which isn't present in the continuous walk.
'''
probabilities = np.empty(P)
#M_hat = np.zeros((2*P,2*P,2*P))
for k in range(P):
posn = np.zeros(P) # values of positions to nought ..
posn[k] = 1 #except for the value we're interested in
#M_hat = np.kron(np.outer(posn,posn)) #perform measurement at the current pos
M_hat = np.outer(posn,posn)
proj = M_hat.dot(finished_quantum_state) #find the state the system is in
probabilities[k] = proj.dot(proj.conjugate()).real #Calculate Prob of Particle being there
return probabilities
def psi_t(initial_wave_function,positions,hopping_rate,time):
'''
Acts upon the initial state to give the 'position' of the quantum particle at time t. Applies the measurement operator
to return the probability of being at any position at time t.
'''
psi_t = np.matmul((LA.expm(-1j*hamiltonian(positions,hopping_rate)*time)),initial_wave_function) #state after the continuous walk after time evolution
probablities = measurement_operator(P, psi_t)
return probablities
time_evolution = 150 #how many 'seconds' the wavefunction is evolved for
time_interval = 0.5
number_of_intervals =int(time_evolution / time_interval )
number_of_positions = P
probabilities_at_t =np.ndarray((number_of_intervals,number_of_positions)) #creates the empty ndarray ready for the probabilites at time t
array_of_times = np.linspace(0,time_evolution,number_of_intervals) #produces the individual times at which psi_t is calculated,
for idx,time in np.ndenumerate(array_of_times):
probabilities_at_t[idx] = psi_t(psi_t0,P,hopping_rate,time) #the array probabillites_at_t is filled at index idx with the array of probabilities produced by psi_t.
#This is the step I am trying to vectorise
The function psi_t is called on a for loop to act on each of the time(s) in array_of_times individually. Is there way where psi_t could act on the array array_of_times like one can do x**2 for the array x? Can it be done in one fell swoop?
P.S Eagle Eyed Overflowers will note that within the measurement_operator there is a for loop anyway. I don't think there's a way to get rid of this however !
Question is not really reproducible because some of the functions that are being called are missing but here is my vectorised implementation of measurement_operator. This is with the assumption that finished_quantum_state has a shape of (P, ) (Not sure if that's the case, because couldn't reproduce till that part) .
def measurement_operator_vectorized(positions, finished_quantum_state):
M_hat = np.zeros((P, P, P))
M_hat[np.arange(0, P), np.arange(0, P), np.arange(0, P)] = 1
proj = np.tensordot(M_hat, finished_quantum_state, axes=((2), (0)))
probabilities = (proj * proj.conjugate()).sum(axis=1).real
return probabilities
Here is some benchmarkings -
P = 1000
a = np.random.rand(P)
b = np.random.rand(P)
%timeit c1 = measurement_operator(a, b)
%timeit c2 = measurement_operator_vectorized(a, b)
%memit c1 = measurement_operator(a, b)
%memit c2 = measurement_operator_vectorized(a, b)
print(np.allclose(c1, c2))
Gives -
1.18 s ± 46.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
308 ms ± 6.76 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
peak memory: 86.43 MiB, increment: 0.00 MiB
peak memory: 90.34 MiB, increment: 3.91 MiB
True
The vectorised version is faster is comparable memory usage for P~1000.
Note that for really high values of P, the memory usage will increase a lot for the vectorised version.
This isn't exactly what the OP asked for, but to vectorise the other loop, a more complete code would be helpful.
However, this benchmark is valid only if finished_quantum_state is real. For complex values the tensordot operation is very slow and inefficient (in memory) so you might actually be better off with the non-vectorized version.
P = 1000
a = np.random.rand(P) + np.random.rand(P)*1j
b = np.random.rand(P) + np.random.rand(P)*1j
%timeit -n1 -r1 c1 = measurement_operator(a, b)
%timeit -n1 -r1 c2 = measurement_operator_vectorized(a, b)
%memit c1 = measurement_operator(a, b)
%memit c2 = measurement_operator_vectorized(a, b)
np.allclose(c1, c2)
2.97 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
3.49 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
peak memory: 102.69 MiB, increment: 0.03 MiB
peak memory: 15365.38 MiB, increment: 15262.69 MiB
However, if you really want the best performance, you are better off forgetting the physics details about measurement etc temporarily and just doing
def measurement_operator_fastest(positions, finished_quantum_state):
return (finished_quantum_state * finished_quantum_state.conjugate()).real
P = 1000
a = np.random.rand(P) + np.random.rand(P)*1j
b = np.random.rand(P) + np.random.rand(P)*1j
%timeit -n1 -r1 c1 = measurement_operator(a, b)
%timeit -n1 -r1 c2 = measurement_operator_vectorized(a, b)
%timeit -n1 -r1 c3 = measurement_operator_fastest(a, b)
%memit c1 = measurement_operator(a, b)
%memit c2 = measurement_operator_vectorized(a, b)
%memit c3 = measurement_operator_fastest(a, b)
print(np.allclose(c1, c2))
print(np.allclose(c1, c3))
2.87 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
3.48 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
16.6 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
peak memory: 102.70 MiB, increment: 0.00 MiB
peak memory: 15365.39 MiB, increment: 15262.69 MiB
peak memory: 102.69 MiB, increment: -0.01 MiB
True
True
By taking the inner product directly, you can make the function around 10^6 times faster. Of course that assumes the measurement operator as defined.
I have to compute a large number of 3x3 linear transformations (eg. rotations). This is what I have so far:
import numpy as np
from scipy import sparse
from numba import jit
n = 100000 # number of transformations
k = 100 # number of vectors for each transformation
A = np.random.rand(n, 3, k) # vectors
Op = np.random.rand(n, 3, 3) # operators
sOp = sparse.bsr_matrix((Op, np.arange(n), np.arange(n+1))) # same as Op but as block-diag
def dot1():
""" naive approach: many times np.dot """
return np.stack([np.dot(o, a) for o, a in zip(Op, A)])
#jit(nopython=True)
def dot2():
""" same as above, but jitted """
new = np.empty_like(A)
for i in range(Op.shape[0]):
new[i] = np.dot(Op[i], A[i])
return new
def dot3():
""" using einsum """
return np.einsum("ijk,ikl->ijl", Op, A)
def dot4():
""" using sparse block diag matrix """
return sOp.dot(A.reshape(3 * n, -1)).reshape(n, 3, -1)
On a macbook pro 2012, this gives me:
In [62]: %timeit dot1()
783 ms ± 20.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [63]: %timeit dot2()
261 ms ± 1.93 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [64]: %timeit dot3()
293 ms ± 2.89 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [65]: %timeit dot4()
281 ms ± 6.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Appart from the naive approach, all approaches are similar. Is there a way to accelerate this significantly?
Edit
(The cuda approach is the best when available. The following is comparing the non-cuda versions)
Following the various suggestions, I modified dot2, added the Op#A method, and a version based on #59356461.
#njit(fastmath=True, parallel=True)
def dot2(Op, A):
""" same as above, but jitted """
new = np.empty_like(A)
for i in prange(Op.shape[0]):
new[i] = np.dot(Op[i], A[i])
return new
def dot5(Op, A):
""" using matmul """
return Op#A
#njit(fastmath=True, parallel=True)
def dot6(Op, A):
""" another numba.jit with parallel (based on #59356461) """
new = np.empty_like(A)
for i_n in prange(A.shape[0]):
for i_k in range(A.shape[2]):
for i_x in range(3):
acc = 0.0j
for i_y in range(3):
acc += Op[i_n, i_x, i_y] * A[i_n, i_y, i_k]
new[i_n, i_x, i_k] = acc
return new
This is what I get (on a different machine) with benchit:
def gen(n, k):
Op = np.random.rand(n, 3, 3) + 1j * np.random.rand(n, 3, 3)
A = np.random.rand(n, 3, k) + 1j * np.random.rand(n, 3, k)
return Op, A
# benchit
import benchit
funcs = [dot1, dot2, dot3, dot4, dot5, dot6]
inputs = {n: gen(n, 100) for n in [100,1000,10000,100000,1000000]}
t = benchit.timings(funcs, inputs, multivar=True, input_name='Number of operators')
t.plot(logy=True, logx=True)
You've gotten some great suggestions, but I wanted to add one more due to this specific goal:
Is there a way to accelerate this significantly?
Realistically, if you need these operations to be significantly faster (which often means > 10x) you probably would want to use a GPU for the matrix multiplication. As a quick example:
import numpy as np
import cupy as cp
n = 100000 # number of transformations
k = 100 # number of vectors for each transformation
# CPU version
A = np.random.rand(n, 3, k) # vectors
Op = np.random.rand(n, 3, 3) # operators
def dot5(): # the suggested, best CPU approach
return Op#A
# GPU version using a V100
gA = cp.asarray(A)
gOp = cp.asarray(Op)
# run once to ignore JIT overhead before benchmarking
gOp#gA;
%timeit dot5()
%timeit gOp#gA; cp.cuda.Device().synchronize() # need to sync for a fair benchmark
112 ms ± 546 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
1.19 ms ± 1.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Use Op#A like suggested by #hpaulj in comments.
Here is a comparison using benchit:
def dot1(A,Op):
""" naive approach: many times np.dot """
return np.stack([np.dot(o, a) for o, a in zip(Op, A)])
#jit(nopython=True)
def dot2(A,Op):
""" same as above, but jitted """
new = np.empty_like(A)
for i in range(Op.shape[0]):
new[i] = np.dot(Op[i], A[i])
return new
def dot3(A,Op):
""" using einsum """
return np.einsum("ijk,ikl->ijl", Op, A)
def dot4(A,Op):
n = A.shape[0]
sOp = sparse.bsr_matrix((Op, np.arange(n), np.arange(n+1))) # same as Op but as block-diag
""" using sparse block diag matrix """
return sOp.dot(A.reshape(3 * n, -1)).reshape(n, 3, -1)
def dot5(A,Op):
return Op#A
in_ = {n:[np.random.rand(n, 3, k), np.random.rand(n, 3, 3)] for n in [100,1000,10000,100000,1000000]}
They seem to be close in performance for larger scale with dot5 being slightly faster.
In one answer Nick mentioned using the GPU - which is the best solution of course.
But - as a general rule - what you're doing is likely CPU limited. Therefore (with the exception to the GPU approach), the best bang you can get is if you make use of all the cores on your machine to work in parallel.
So for that you would want to use multiprocessing (not python's multithreading!), to split the job up into pieces running on each core in parallel.
This is not trivial, but also not too hard, and there are many good examples/guides online.
But if you had an 8-core machine, it would likely give you an almost 8x speed increase as long as you're careful to avoid memory bottlenecks by trying to pass many small objects between processes, but pass them all in a group at the start
Right now I am just looping through using np.nditer() and comparing to the previous element. Is there a (vectorised) approach which is faster?
Added bonus is the fact that I don't always have to go to the end of the array; as soon as a sequence of max_len has been found I am done searching.
import numpy as np
max_len = 3
streak = 0
prev = np.nan
a = np.array([0, 3, 4, 3, 0, 2, 2, 2, 0, 2, 1])
for c in np.nditer(a):
if c == prev:
streak += 1
if streak == max_len:
print(c)
break
else:
prev = c
streak = 1
Alternative I thought about is using np.diff() but this just shifts the problem; we are now looking for a sequence of zeroes in its result. Also I doubt it will be faster since it will have to calculate the difference for every integer whereas in practice the sequence will occur before reaching the end of the list more often than not.
I developed a numpy-only version that works, but after testing, I found that it performs quite poorly because it can't take advantage of short-circuiting. Since that's what you asked for, I describe it below. However, there is a much better approach using numba with a lightly modified version of your code. (Note that all of these return the index of the first match in a, rather than the value itself. I find that approach more flexible.)
#numba.jit(nopython=True)
def find_reps_numba(a, max_len):
streak = 1
val = a[0]
for i in range(1, len(a)):
if a[i] == val:
streak += 1
if streak >= max_len:
return i - max_len + 1
else:
streak = 1
val = a[i]
return -1
This turns out to be ~100x faster than the pure Python version.
The numpy version uses the rolling window trick and the argmax trick. But again, this turns out to be far slower than even the pure Python version, by a substantial ~30x.
def rolling_window(a, window):
a = numpy.ascontiguousarray(a) # This approach requires a C-ordered array
shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
strides = a.strides + (a.strides[-1],)
return numpy.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
def find_reps_numpy(a, max_len):
windows = rolling_window(a, max_len)
return (windows == windows[:, 0:1]).sum(axis=1).argmax()
I tested both of these against a non-jitted version of the first function. (I used Jupyter's %%timeit feature for testing.)
a = numpy.random.randint(0, 100, 1000000)
%%timeit
find_reps_numpy(a, 3)
28.6 ms ± 553 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
find_reps_orig(a, 3)
4.04 ms ± 40.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
find_reps_numba(a, 3)
8.29 µs ± 89.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Note that these numbers can vary dramatically depending on how deep into a the functions have to search. For a better estimate of expected performance, we can regenerate a new set of random numbers each time, but it's difficult to do so without including that step in the timing. So for comparison here, I include the time required to generate the random array without running anything else:
a = numpy.random.randint(0, 100, 1000000)
9.91 ms ± 129 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
a = numpy.random.randint(0, 100, 1000000)
find_reps_numpy(a, 3)
38.2 ms ± 453 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
a = numpy.random.randint(0, 100, 1000000)
find_reps_orig(a, 3)
13.7 ms ± 404 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
a = numpy.random.randint(0, 100, 1000000)
find_reps_numba(a, 3)
9.87 ms ± 124 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
As you can see, find_reps_numba is so fast that the variance in the time it takes to run numpy.random.randint(0, 100, 1000000) is much larger — hence the illusory speedup between the first and last tests.
So the big moral of the story is that numpy solutions aren't always best. Sometimes even pure Python is faster. In those cases, numba in nopython mode may be the best option by far.
You can use groupby from the itertools package.
import numpy as np
from itertools import groupby
max_len = 3
best = ()
a = np.array([0, 3, 4, 3, 0, 2, 2, 2, 0, 2, 1])
for k, g in groupby(a):
tup_g = tuple(g)
if tup_g==max_len:
best = tup_g
break
if len(tup_g) > len(best):
best = tup_g
best
# returns:
(2, 2, 2)
You could create sub-arrays of length max_length, moving one position to the right each time (like ngrams), and check if the sum of one sub_array divided by max_length is equal to the first element of that sub-array.
If that's True, then you have found your consecutive sequence of integers of length max_length.
def get_conseq(array, max_length):
sub_arrays = zip(*[array[i:] for i in range(max_length)])
for e in sub_arrays:
if sum(e) / len(e) == e[0]:
print("Found : {}".format(e))
return e
print("Nothing found")
return []
For example, this array [1,2,2,3,4,5], with max_length = 2, will be 'split' like this:
[1,2]
[2,2]
[2,3]
[3,4]
[4,5]
On the second element, [2,2], the sum is 4, divided by max_length gives 2, and that matches the first element of that subgroup, and the function returns.
You can break if that's what you prefer to do, instead of returning like I do.
You could also add a few rules to capture edge cases, to make things clean (empty array, max_length superior to the length of the array, etc).
Here are a few example calls:
>>> splits([1,2,3,4,5,6], 2)
Nothing found
>>> splits([1,2,2,3,4,5,6], 3)
Nothing found
>>> splits([1,2,3,3,3], 3)
Found : [3, 3, 3]
>>> splits([1,2,2,3,3], 2)
Found : [2, 2]
Hope this helps !
Assuming you are looking for the element that appears for at least max_len times consecutively, here's one NumPy based way -
m = np.r_[True,a[:-1]!=a[1:],True]
idx0 = np.flatnonzero(m)
m2 = np.diff(idx0)>=max_len
out = None # None for no such streak found case
if m2.any():
out = a[idx0[m2.argmax()]]
Another with binary-dilation -
from scipy.ndimage.morphology import binary_erosion
m = np.r_[False,a[:-1]==a[1:]]
m2 = binary_erosion(m, np.ones(max_len-1, dtype=bool))
out = None
if m2.any():
out = a[m2.argmax()]
Finally, for completeness, you can also look into numba. Your existing code would work as it is, with a direct-looping over a, i.e. for c in a:.
I have the following python method, which selects a weighted random element from the sequence "seq" randomly weighted by other sequence, which contains the weights for each element in seq:
def weighted_choice(seq, weights):
assert len(seq) == len(weights)
total = sum(weights)
r = random.uniform(0, total)
upto = 0
for i in range(len(seq)):
if upto + weights[i] >= r:
return seq[i]
upto += weights[i]
assert False, "Shouldn't get here"
If I call the above a million times with a 1000 element sequence, like this:
seq = range(1000)
weights = []
for i in range(1000):
weights.append(random.randint(1,100))
st=time.time()
for i in range(1000000):
r=weighted_choice(seq, weights)
print (time.time()-st)
it runs for approximately 45 seconds in cpython 2.7 and for 70 seconds in cpython 3.6.
It finishes in around 2.3 seconds in pypy 5.10, which would be fine for me, sadly I can't use pypy for some reasons.
Any ideas on how to speed up this function on cpython? I'm interested in other implementations (algorithmically, or via external libraries, like numpy) as well if they perform better.
ps: python3 has random.choices with weights, it runs for around 23 seconds, which is better than the above function, but still exactly ten times slower than pypy can run.
I've tried it with numpy this way:
weights=[1./1000]*1000
st=time.time()
for i in range(1000000):
#r=weighted_choice(seq, weights)
#r=random.choices(seq, weights)
r=numpy.random.choice(seq, p=weights)
print (time.time()-st)
It ran for 70 seconds.
You can use numpy.random.choice (the p parameter is the weights). Normally numpy functions are vectorized and so run at near-C speed.
Implement as:
def weighted_choice(seq, weights):
w = np.asarray(weights)
p = w / w.sum() # can skip if weights always sum to 1
return np.random.choice(seq, p=w)
Edit:
Timings:
%timeit np.random.choice(x, p=w) # len(x) == 1_000_000
13 ms ± 238 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np.random.choice(y, p=w) # len(y) == 100_000_000
1.28 s ± 18.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
you could take this approach with numpy. If you emlimiate the for loop, you can get the true power of numpy by indexing the positions you need
#Untimed since you did not
seq = np.arange(1000)
weights = np.random.randint(1,100,(1000,1))
def weights_numpy(seq,weights,iterations):
"""
:param seq: Input sequence
:param weights: Input Weights
:param iterations: Iterations to run
:return:
"""
r = np.random.uniform(0,weights.sum(0),(1,iterations)) #create array of choices
ar = weights.cumsum(0) # get cumulative sum
return seq[(ar >= r).argmax(0)] #get indeces of seq that meet your condition
And the timing (python 3,numpy '1.14.0')
%timeit weights_numpy(seq,weights,1000000)
4.05 s ± 256 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Which is a bit slower than PyPy, but hardly...