The question is simple: here is my current algorithm. It is terribly slow because of the loops over the arrays. Is there a way to change it to avoid the loops and take advantage of NumPy array types?
import numpy as np

def loopingFunction(listOfVector1, listOfVector2):
    resultArray = []
    for vector1 in listOfVector1:
        result = 0
        for vector2 in listOfVector2:
            result += np.dot(vector1, vector2) * vector2[2]
        resultArray.append(result)
    return np.array(resultArray)
listOfVector1x = np.linspace(0,0.33,1000)
listOfVector1y = np.linspace(0.33,0.66,1000)
listOfVector1z = np.linspace(0.66,1,1000)
listOfVector1 = np.column_stack((listOfVector1x, listOfVector1y, listOfVector1z))
listOfVector2x = np.linspace(0.33,0.66,1000)
listOfVector2y = np.linspace(0.66,1,1000)
listOfVector2z = np.linspace(0, 0.33, 1000)
listOfVector2 = np.column_stack((listOfVector2x, listOfVector2y, listOfVector2z))
result = loopingFunction(listOfVector1, listOfVector2)
I am supposed to deal with really big arrays, that have way more than 1000 vectors in each. So if you have any advice, I'll take it.
The obligatory np.einsum benchmark
r2 = np.einsum('ij, kj, k->i', listOfVector1, listOfVector2, listOfVector2[:,2], optimize=['einsum_path', (1, 2), (0, 1)])
#%timeit result: 10000 loops, best of 5: 116 µs per loop
np.testing.assert_allclose(result, r2)
Just for fun, I wrote an optimized Numba implementation that outperforms all the others. It is based on the einsum optimization in @MichaelSzczesny's answer.
import numpy as np
import numba as nb
# This decorator asks Numba to eagerly compile the code using
# the provided signature string (containing the parameter types).
@nb.njit('(float64[:,::1], float64[:,::1])')
def loopingFunction_numba(listOfVector1, listOfVector2):
    n, m = listOfVector1.shape
    assert m == 3
    result = np.empty(n)
    s1 = s2 = s3 = 0.0
    # n is also used as the row count of listOfVector2 (both arrays have 1000 rows here)
    for i in range(n):
        factor = listOfVector2[i, 2]
        s1 += listOfVector2[i, 0] * factor
        s2 += listOfVector2[i, 1] * factor
        s3 += listOfVector2[i, 2] * factor
    for i in range(n):
        result[i] = listOfVector1[i, 0] * s1 + listOfVector1[i, 1] * s2 + listOfVector1[i, 2] * s3
    return result
result = loopingFunction_numba(listOfVector1, listOfVector2)
Here are timings on my i5-9600KF processor:
Initial: 1052.0 ms
ymmx: 5.121 ms
MichaelSzczesny: 75.40 us
MechanicPig: 3.36 us
Numba: 2.74 us
Optimal lower bound: 0.66 us
This solution is ~384_000 times faster than the original one. Note that it does not even use the SIMD instructions of the processor, which would give a ~4x speed-up on my machine. That is only possible with transposed inputs, which are much more SIMD-friendly than the current layout. Transposition may also speed up other answers, like MechanicPig's, since BLAS can often benefit from it. The resulting code would reach the symbolic 1_000_000 speed-up factor!
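To make the "transposed input" idea concrete, here is a minimal NumPy sketch, assuming the vectors are stored one component per row, i.e. with shape (3, n). The function name weighted_dot_transposed and the demo arrays are hypothetical, and this is not the Numba code from the answer. It has not been benchmarked, so whether it is faster in practice is an open question.

```python
import numpy as np

def weighted_dot_transposed(v1_t, v2_t):
    """Same result as the loops, for inputs stored as (3, n) arrays.

    v1_t and v2_t are hypothetical transposed inputs: row 0 holds all x
    components, row 1 all y components, row 2 all z components.
    """
    factor = v2_t[2]                     # z components of the second list
    s = v2_t @ factor                    # three weighted sums, shape (3,)
    # Each term works on one contiguous row, which is the SIMD-friendly part.
    return s[0] * v1_t[0] + s[1] * v1_t[1] + s[2] * v1_t[2]

# Small demo with made-up data
rng = np.random.default_rng(0)
demo_v1 = rng.random((6, 3))
demo_v2 = rng.random((4, 3))
demo_out = weighted_dot_transposed(np.ascontiguousarray(demo_v1.T),
                                   np.ascontiguousarray(demo_v2.T))
```

np.ascontiguousarray is used so that each component row is actually contiguous in memory; a plain .T would only return a strided view.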
You can at least remove the two for loops to save a lot of time by using matrix computation directly:
import time
import numpy as np

def loopingFunction(listOfVector1, listOfVector2):
    resultArray = []
    for vector1 in listOfVector1:
        result = 0
        for vector2 in listOfVector2:
            result += np.dot(vector1, vector2) * vector2[2]
        resultArray.append(result)
    return np.array(resultArray)

def loopingFunction2(listOfVector1, listOfVector2):
    # All pairwise dot products at once, weighted by the z component, summed per row
    resultArray = np.sum(np.dot(listOfVector1, listOfVector2.T) * listOfVector2[:,2], axis=1)
    return resultArray

listOfVector1x = np.linspace(0,0.33,1000)
listOfVector1y = np.linspace(0.33,0.66,1000)
listOfVector1z = np.linspace(0.66,1,1000)
listOfVector1 = np.column_stack((listOfVector1x, listOfVector1y, listOfVector1z))
listOfVector2x = np.linspace(0.33,0.66,1000)
listOfVector2y = np.linspace(0.66,1,1000)
listOfVector2z = np.linspace(0, 0.33, 1000)
listOfVector2 = np.column_stack((listOfVector2x, listOfVector2y, listOfVector2z))
t0 = time.time()
result = loopingFunction(listOfVector1, listOfVector2)
print('time old version',time.time() - t0)
t0 = time.time()
result2 = loopingFunction2(listOfVector1, listOfVector2)
print('time matrix computation version',time.time() - t0)
print('Are results are the same',np.allclose(result,result2))
Which gives
time old version 1.174513578414917
time matrix computation version 0.011968612670898438
Are results are the same True
Basically, the fewer loops the better.
Avoid nested loops and adjust the calculation order. This is about 20 times faster than the optimized np.einsum and nearly 400_000 times faster than the original program:
>>> out = listOfVector1.dot(listOfVector2[:, 2].dot(listOfVector2))
>>> np.allclose(out, loopingFunction(listOfVector1, listOfVector2))
>>> timeit(lambda: loopingFunction(listOfVector1, listOfVector2), number=1)
>>> timeit(lambda: listOfVector1.dot(listOfVector2[:, 2].dot(listOfVector2)), number=400_000)
>>> timeit(lambda: np.einsum('ij, kj, k->i', listOfVector1, listOfVector2, listOfVector2[:, 2], optimize=['einsum_path', (1, 2), (0, 1)]), number=18_000)
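Why the reordering works can be checked on a small case. The sketch below uses made-up random inputs (v1, v2 and the seed are assumptions, not data from the question) and compares the "build every dot product, then sum" order against the "collapse the second list first" order.

```python
import numpy as np

rng = np.random.default_rng(0)   # hypothetical small inputs
v1 = rng.random((5, 3))
v2 = rng.random((7, 3))
w = v2[:, 2]                     # the weight applied to each dot product

# Original order: full (5, 7) matrix of dot products, weighted, then summed per row.
slow = ((v1 @ v2.T) * w).sum(axis=1)

# Reordered: reduce v2 to a single 3-vector first, then one matrix-vector product.
s = w @ v2                       # shape (3,)
fast = v1 @ s                    # shape (5,)

assert np.allclose(slow, fast)
```

The first form does work proportional to len(v1) * len(v2), while the second is proportional to len(v1) + len(v2), which is where the large speed-up comes from.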
I have an image mask stored as a 2D numpy array where the values indicate the presence of objects that have been segmented in the image (0 = no object, 1..n = object 1 through n). I want to get a single coordinate for each object representing the center of the object. It doesn't have to be a perfectly accurate centroid or center of gravity. I'm just taking the mean of the x and y indices of all cells in the array that contain each object. I'm wondering if there's a faster way to do this than my current method:
for obj in np.unique(mask):
    if obj == 0:
        continue
    x, y = np.mean(np.where(mask == obj), axis=1)
Here is a reproducible example:
import numpy as np
mask = np.array([
points = []
for obj in np.unique(mask):
    if obj == 0:
        continue
    points.append(np.mean(np.where(mask == obj), axis=1))
This outputs:
[array([1.33333333, 1.66666667]),
array([1.28571429, 5. ]),
array([4., 2.]),
array([5., 6.])]
I came up with another way to do it that seems to be about 3x faster:
import numpy as np
mask = np.array([
flat = mask.flatten()
split = np.unique(np.sort(flat), return_index=True)[1]
points = []
for inds in np.split(flat.argsort(), split)[2:]:
    points.append(np.array(np.unravel_index(inds, mask.shape)).mean(axis=1))
I wonder if the for loop can be replaced with a numpy operation which would likely be even faster.
You can adapt this answer (give it an upvote too if it works for you) and use sparse matrices instead of NumPy arrays. However, this only proves quicker for large arrays, with the speed-up growing as the array gets larger:
import numpy as np, time
from scipy.sparse import csr_matrix

def compute_M(data):
    cols = np.arange(data.size)
    return csr_matrix((cols, (np.ravel(data), cols)),
                      shape=(data.max() + 1, data.size))

def get_indices_sparse(data, M):
    #M = compute_M(data)
    # Row R of M stores the flat indices of all cells labelled R; skip the background row 0
    return [np.mean(np.unravel_index(row.data, data.shape), 1) for R, row in enumerate(M) if R > 0]

def gen_random_mask(C, n, m):
    mask = np.zeros([n, m], int)
    for i in range(C):
        x = np.random.randint(n)
        y = np.random.randint(m)
        mask[x:x+np.random.randint(n-x), y:y+np.random.randint(m-y)] = i
    return mask

N = 100
C = 4
for S in [10, 100, 1000, 10000]:
    mask = gen_random_mask(C, S, S)
    print('Time for size {:d}x{:d}:'.format(S, S))
    s = time.time()
    for _ in range(N):
        points = []
        for obj in np.unique(mask):
            if obj == 0:
                continue
            points.append(np.mean(np.where(mask == obj), axis=1))
        points_np = np.array(points)
    print('NP: {:f}'.format((time.time() - s)/N))
    mask_s = compute_M(mask)
    s = time.time()
    for _ in range(100):
        points = get_indices_sparse(mask, mask_s)
    print('Sparse: {:f}'.format((time.time() - s)/N))
Which results in the timings of:
Time for size 10x10:
NP: 0.000066
Sparse: 0.000226
Time for size 100x100:
NP: 0.000207
Sparse: 0.000253
Time for size 1000x1000:
NP: 0.018662
Sparse: 0.004472
Time for size 10000x10000:
NP: 2.545973
Sparse: 0.501061
The problem likely comes from np.where(mask == obj), which iterates over the whole mask array again for every object. This becomes costly when there are many objects. You can solve it efficiently with a group-by strategy, although NumPy does not yet provide such an operation. You can implement one with a sort followed by a split, but a sort is generally not efficient. An alternative is to ask NumPy to return the inverse index in the np.unique call so that you can accumulate the coordinates per object (like a reduce-by-key where the reduction operator is an addition and the keys are the object integers). The mean is then obtained with a simple division at the end.
objects, inverts, counts = np.unique(mask, return_counts=True, return_inverse=True)
# Flatten the inverse index so it lines up with xPos/yPos (it may come back with the shape of mask)
inverts = inverts.ravel()
# Reduction by object
x = np.full(len(objects), 0.0)
y = np.full(len(objects), 0.0)
xPos = np.repeat(np.arange(mask.shape[0]), mask.shape[1])
yPos = np.tile(np.arange(mask.shape[1]), reps=mask.shape[0])
np.add.at(x, inverts, xPos)
np.add.at(y, inverts, yPos)
# Compute the final mean from the sum
x /= counts
y /= counts
# Discard the first item (when obj == 0)
x = x[1:]
y = y[1:]
If you need something faster, you could use Numba and perform the reduction manually (and possibly in parallel).
EDIT: if you really need a list as output, you can use points = list(np.stack([x, y]).T), but using lists instead of NumPy arrays is rather slow (and not memory efficient either).
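The same reduce-by-key idea can also be sketched with np.bincount, which sums weights per integer label in a single pass. This is a swapped-in alternative to np.add.at, not the code from the answer; the function name centroids_bincount and the demo mask are hypothetical, and the sketch assumes the mask holds non-negative integers with 0 as background.

```python
import numpy as np

def centroids_bincount(mask):
    """Mean (row, col) of every non-zero label in an integer mask."""
    labels = mask.ravel()
    rows, cols = np.indices(mask.shape)          # row/col index of every cell
    counts = np.bincount(labels)                 # number of cells per label
    sum_rows = np.bincount(labels, weights=rows.ravel())
    sum_cols = np.bincount(labels, weights=cols.ravel())
    keep = counts > 0                            # labels that actually occur
    keep[0] = False                              # drop the background label 0
    return np.column_stack((sum_rows[keep] / counts[keep],
                            sum_cols[keep] / counts[keep]))

# Hypothetical small mask with two objects
demo_mask = np.array([[0, 1, 1],
                      [0, 1, 0],
                      [2, 2, 0]])
demo_points = centroids_bincount(demo_mask)
```

The result is a single (number_of_objects, 2) array rather than a list of small arrays, which keeps everything in NumPy.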
Because the mask values number the segments they can be directly used as indices into numpy arrays. Combined with Cython this can be used to achieve a strong speed-up.
In Jupyter start with loading Cython:
%load_ext Cython
then use the Cython cell magic and a single pass over the whole array to calculate the means:
%%cython -a
import cython
import numpy as np
cimport numpy as np
@cython.boundscheck(False)  # turn off bounds-checking for entire function
@cython.wraparound(False)   # turn off negative index wrapping for entire function
def calc_xy_mean4(int[:,:] mask, int number_of_maskvalues):
    cdef int[:] sum_x = np.zeros(number_of_maskvalues, dtype='int')
    cdef int[:] sum_y = np.zeros(number_of_maskvalues, dtype='int')
    n = np.zeros(number_of_maskvalues, dtype='int')
    cdef int[:] n_mv = n
    mean_x = np.zeros(number_of_maskvalues, dtype='float')
    mean_y = np.zeros(number_of_maskvalues, dtype='float')
    cdef double[:] mean_x_mv = mean_x
    cdef double[:] mean_y_mv = mean_y
    cdef int x_max = mask.shape[0]
    cdef int y_max = mask.shape[1]
    cdef int segment_index
    cdef int x
    cdef int y
    # Single pass over the mask: accumulate count and coordinate sums per segment
    for x in range(x_max):
        for y in range(y_max):
            segment_index = mask[x,y]
            n_mv[segment_index] += 1
            sum_x[segment_index] += x
            sum_y[segment_index] += y
    for segment_index in range(number_of_maskvalues):
        mean_x_mv[segment_index] = sum_x[segment_index]/n[segment_index]
        mean_y_mv[segment_index] = sum_y[segment_index]/n[segment_index]
    return mean_x, mean_y, n
and call it with timeit magic
mask = np.array([
%timeit calc_xy_mean4(mask, 5)
This Cython solution is on my machine 9 times faster than the original code.
6.32 µs ± 18.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
and if we run the same instruction without the timeit magic:
calc_xy_mean4(mask, 5)
we obtain as output:
(array([3.07692308, 1.33333333, 1.28571429, 4. , 5. ]),
array([4.59615385, 1.66666667, 5. , 2. , 6. ]),
array([52, 3, 7, 3, 5]))
I want to generate random numbers like from np.random.exponential but clipped / truncated at values a,b. For example, if a=100, b=500 then I want the function to generate random numbers following e^(-x) in the range [100, 500].
An inefficient way would be:
rands = np.random.exponential(size=10**7)
rands = rands[(rands>a) & (rands<b)]
Is there an existing package that can do this for me? Ideally for various distributions, not just exponential.
If we clip the values after using the exponential generator, there are two problems with the approach proposed in the question.
First, we lose values (for example, if we wanted 10**7 values, we might only get 10**6 values).
Second, np.random.exponential() with its default scale of 1.0 produces values that are mostly small (the probability of a sample exceeding 100 is about e^-100), so we can't simply use 100 and 500 as the lower and upper bounds; the filter would return an essentially empty array. The generated numbers would have to be rescaled before filtering.
I wrote the workaround using exp(uniform). I tested your solution using smaller values of a and b (so that we don't get empty arrays). Timing both shows this workaround is faster by around 50%:
import time
import numpy as np
import matplotlib.pyplot as plt
def truncated_exp_OP(a,b, how_many):
    rands = np.random.exponential(size=how_many)
    rands = rands[(rands>a) & (rands<b)]
    return rands

def truncated_exp_NK(a,b, how_many):
    a = -np.log(a)
    b = -np.log(b)
    rands = np.exp(-(np.random.rand(how_many)*(b-a) + a))
    return rands

timeTakenOP = []
for i in range(20):
    startTime = time.time()
    r = truncated_exp_OP(0.001,0.39, 10**7)
    endTime = time.time()
    timeTakenOP.append(endTime - startTime)
print ("OP solution: ", np.mean(timeTakenOP))
plt.hist(r.flatten(), 300);

timeTakenNK = []
for i in range(20):
    startTime = time.time()
    r = truncated_exp_NK(100,500, 10**7)
    endTime = time.time()
    timeTakenNK.append(endTime - startTime)
print ("NK solution: ", np.mean(timeTakenNK))
plt.hist(r.flatten(), 300);
Average run time :
OP solution: 0.28491891622543336 vs
NK solution: 0.1437338709831238
[Figures not included: histograms of the generated random numbers, one for OP's approach and one for this approach.]
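One caveat worth checking before using the exp(uniform) workaround: a * (b/a)**U, which is what it computes, is uniform in log space on [a, b], and that is not the same density as e^(-x) restricted to [a, b]. A minimal sketch of inverse-CDF sampling for the truncated exponential follows, assuming rate 1; the function name truncated_exponential and its rng parameter are hypothetical, not part of NumPy.

```python
import numpy as np

def truncated_exponential(a, b, size, rng=None):
    """Draw samples with density proportional to exp(-x) on [a, b] (rate 1 assumed)."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.random(size)                          # uniform on [0, 1)
    # Inverse CDF: x = a - log(1 - u * (1 - exp(-(b - a)))),
    # written with expm1/log1p to stay accurate for large a and b.
    return a - np.log1p(u * np.expm1(-(b - a)))

samples = truncated_exponential(100.0, 500.0, 100_000, rng=np.random.default_rng(0))
```

Every uniform draw produces exactly one sample, so no values are lost to filtering, and large bounds such as a=100, b=500 do not underflow because only b - a enters the exponential.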
I want to slice the same numpy array (data_ar) multiple times, each time finding the values in a different range.
data_ar shape: (203,)
range_ar shape: (1000,)
I implemented it with a for loop, but it takes way too long since I have a lot of data arrays:
#create results array
results_ar = np.zeros(shape=(1000),dtype=object)
for i, center in enumerate(range_ar):
    results_ar[i] = data_ar[( (data_ar>=(center-delta)) & (data_ar<(center+delta)) )]
so for example:
data_ar = [1,3,4,6,10,12]
range_ar = [7,4,2]
delta= 3
expected output:
results_ar = [array([4, 6]), array([1, 3, 4, 6]), array([1, 3, 4])]
(note results_ar shape=(3,), dtype=object; each element is an array)
Any ideas on how to tackle this?
You can use numba to speed up the computations.
import numpy as np
import numba
from numba.typed import List
import timeit
data_ar = np.array([1,3,4,6,10,12])
range_ar = np.array([7,4,2])
delta = 3
def foo(data_ar, range_ar):
    results_ar = list()
    for i in range_ar:
        results_ar.append(data_ar[( (data_ar>=(i-delta)) & (data_ar<(i+delta)) )])
    return results_ar

print(timeit.timeit(lambda :foo(data_ar, range_ar)))

@numba.njit(parallel=True, fastmath=True)
def foo(data_ar, range_ar):
    results_ar = List()
    for i in range_ar:
        results_ar.append(data_ar[( (data_ar>=(i-delta)) & (data_ar<(i+delta)) )])
    return results_ar

print(timeit.timeit(lambda :foo(data_ar, range_ar)))
An almost 9.8 times speedup.
If data_ar is sorted, you could use np.searchsorted like this:
data_ar = np.array([1, 3, 4, 6, 10, 12])
range_ar = np.array([7, 4, 2])
delta = 3
bounds = range_ar[:, None] + delta * np.array([-1, 1])
result = [data_ar[slice(*row)] for row in np.searchsorted(data_ar, bounds)]
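np.searchsorted only gives correct slice bounds on sorted data, so a wrapper that sorts first may be safer when the order of data_ar is not guaranteed. A minimal sketch, where the function name values_in_windows and the shuffled demo input are hypothetical:

```python
import numpy as np

def values_in_windows(data, centers, delta):
    """For each center c, return the values v of data with c - delta <= v < c + delta."""
    data_sorted = np.sort(data)                   # searchsorted needs sorted input
    # First index whose value is >= the lower bound
    lo = np.searchsorted(data_sorted, centers - delta, side="left")
    # First index whose value is >= the upper bound, so the slice excludes it
    hi = np.searchsorted(data_sorted, centers + delta, side="left")
    return [data_sorted[i:j] for i, j in zip(lo, hi)]

# Same numbers as the example in the question, deliberately shuffled
demo = values_in_windows(np.array([12, 1, 6, 3, 10, 4]), np.array([7, 4, 2]), 3)
```

Each returned piece is a view into the sorted copy, so values come back in ascending order rather than in their original positions.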
I have a 2-D numpy array like x = array([[ 1., 5.],[ 3., 4.]]). I have to compare each row with every other row in the matrix, create a new array of the element-wise minimum values of the two rows, take the sum of that minimum row, and save it in a new matrix. In the end I get a symmetric matrix.
Eg: I compare array [1,5] with itself. New 2-D array is array([[ 1., 5.],[ 1., 5.]]), I create a minimum array along axis=0 i.e [ 1., 5.] then take the sum of array which will be 6. Similarly I repeat the operation for all the rows and I end up with a 2*2 matrix array([[ 6, 5.],[ 5, 7.]]).
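import numpy as np
x = np.array([[1., 5.], [3., 4.]])
y = np.zeros((len(x), len(x)))
for i in range(len(x)):
    for j in range(len(x)):
        # element-wise minimum of rows i and j, then summed (as described above)
        y[i, j] = np.sum(np.minimum(x[i], x[j]))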
My 2-D array is very big, and performing the above operations takes a lot of time. I am new to Python, so any suggestion to improve the performance would be really helpful.
The obvious improvement, which saves roughly half the time, is to run only over the indices with j >= i and mirror the result. For elegance and a small extra saving you can also use fewer variables.
import numpy as np
import time

x = np.random.randint(0, 10, (500, 500))

# OP version (same computation as in the question)
y = np.zeros((len(x), len(x)))
t0 = time.time()
for i in range(len(x)):
    for j in range(len(x)):
        y[i, j] = np.sum(np.min([x[i], x[j]], axis=0))
print(time.time() - t0)

# modified version
z = np.zeros((len(x), len(x)))
t0 = time.time()
for i in range(len(x)):
    for j in range(i, len(x)):
        z[i, j] = np.sum(np.min([x[i], x[j]], axis=0))
        z[j, i] = z[i, j]
print(time.time() - t0)

# verify that the results are the same
print(np.all(z == y))
The results on my machine:
The obvious way to speed up your code would be to do all the looping in numpy. I had a first solution (f2 in the code below), which would generate a matrix that contained all the combinations that need to be compared and then reduced that matrix into the final result performing the np.min and np.sum commands. Unfortunately that method is quite memory consuming and therefore becomes slow when the matrices are big, because the intermediate matrix is NxNx2xN for a NxN input matrix.
However, I found a different solution that uses one for loop (f3 below) and appears to be reasonably fast. The speed-up over the original code posted by the OP is about 4 times for a 1000x1000 matrix. Here is the code with some tests:
import numpy as np
import timeit
def f(x):
    y = np.zeros_like(x)
    for i in range(x.shape[0]):
        a = x[i]
        for j in range(x.shape[1]):
            b = x[j]
            y[i,j] = np.sum(np.min([a,b], axis=0))
    return y

def f2(x):
    y = np.empty((x.shape[0],1,2,x.shape[0]))
    y[:,0,0,:] = x[:,:]
    y = np.repeat(y, x.shape[0],axis=1)
    y[:,:,1,:] = x[:,:]
    return np.sum(np.min(y,axis=2),axis=2)

def f3(x):
    y = np.empty_like(x)
    for i in range(x.shape[1]):
        y[:,i] = np.sum(np.minimum(x[i,:],x[:,:]),axis=1)
    return y

##some testing that the functions work
x = np.array([[1,5],[3,4]])
x = np.array([[1,7,5],[2,3,8],[5,2,4]])
x = np.random.randint(0,10,(100,100))

##some speed testing (number=1 is an assumed repetition count):
print("speed test small")
x = np.random.randint(0,100,(100,100))
print(timeit.timeit('f(x)', setup='from __main__ import f,x', number=1))
print("using np.repeat")
print(timeit.timeit('f2(x)', setup='from __main__ import f2,x', number=1))
print("one for loop")
print(timeit.timeit('f3(x)', setup='from __main__ import f3,x', number=1))

print("speed test big")
x = np.random.randint(0,100,(1000,1000))
print(timeit.timeit('f(x)', setup='from __main__ import f,x', number=1))
print("one for loop")
print(timeit.timeit('f3(x)', setup='from __main__ import f3,x', number=1))
And here the output:
speed test small
using np.repeat
one for loop
speed test big
one for loop
In other words, f2 is pretty fast for matrices that don't exhaust your memory, but especially for big matrices, f3 is the fastest that I could find.
Inspired by @Aguy's answer and this post, here is one more modification that only computes the lower triangle of the matrix and then copies the results to the upper triangle:
def f4(x):
    y = np.empty_like(x)
    for i in range(x.shape[1]):
        y[i:,i] = np.sum(np.minimum(x[i,:],x[i:,:]),axis=1)
    i_upper = np.triu_indices(x.shape[1],1)
    y[i_upper] = y.T[i_upper]
    return y
The speed test for the 1000x1000 matrix now gives
speed test big
one for loop over lower triangle
Here is one more version that uses numba for a speed-up. According to this post, it is better to write the loops explicitly in this case:
import numba as nb

@nb.njit
def f_nb(x):
    res = np.empty_like(x)
    for j in range(res.shape[1]):
        for i in range(j,res.shape[0]):
            res[j,i] = res[i,j] = np.sum(np.minimum(x[i,:], x[j,:]))
    return res
And the relevant speed tests give:
0.015975199989043176 for a 100x100 matrix
0.37946902704425156 for a 1000x1000 matrix
467.06363476096885 for a 10000x10000 matrix
The 10000x10000 speed test for f4 didn't seem to want to finish at all, so I left it out. If your matrices get much bigger than that, you might actually run into memory problems -- did you consider this?
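On the memory question, one middle ground between the fully broadcast version and the explicit loops is to broadcast a block of rows at a time, so the temporary array stays bounded. The sketch below is an assumption-laden illustration, not one of the benchmarked functions: the name min_sum_matrix and the default block size of 64 are hypothetical and untuned.

```python
import numpy as np

def min_sum_matrix(x, block=64):
    """out[i, j] = sum(minimum(x[i], x[j])), computed block rows at a time."""
    pieces = []
    for start in range(0, x.shape[0], block):
        chunk = x[start:start + block]            # at most `block` rows
        # (block, 1, m) against (1, n, m) broadcasts to (block, n, m);
        # summing the last axis leaves a (block, n) slab of the result.
        pieces.append(np.minimum(chunk[:, None, :], x[None, :, :]).sum(axis=2))
    return np.concatenate(pieces, axis=0)

demo = min_sum_matrix(np.array([[1., 5.], [3., 4.]]))
```

The temporary holds block * n * m elements instead of n * n * m, so block can be lowered until it fits in memory. The full n x n result still has to fit, which is the limit the answer warns about.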