I currently have the following double loop in my Python code:
for i in range(a):
    for j in range(b):
        A[:,i] *= B[j][:,C[i,j]]
(A is a float matrix, B is a list of float matrices, and C is a matrix of integers; by matrices I mean m x n np.arrays. To be precise, the sizes are: A is m x a; B is b matrices of size m x l (with l different for each matrix); C is a x b. Here m is very large, a is very large, b is small, and the l's are even smaller than b.)
I tried to speed it up by doing
for j in range(b):
    A[:,:] *= B[j][:,C[:,j]]
but surprisingly to me this performed worse.
More precisely, this did improve performance for small values of m and a (the "large" numbers), but from m=7000, a=700 onwards the first approach is roughly twice as fast.
Is there anything else I can do?
Maybe I could parallelize? But I don't really know how.
(I am not committed to either Python 2 or 3)
Here's a vectorized approach, assuming B is a list of arrays that all have the same shape -
# Convert B to a 3D array
B_arr = np.asarray(B)
# Use advanced indexing to index into the last axis of B array with C
# and then do product-reduction along the second axis.
# Finally, we perform elementwise multiplication with A
A *= B_arr[np.arange(B_arr.shape[0]),:,C].prod(1).T
For cases with a smaller a, we could instead run a loop that iterates along the length of a. Also, for more performance, it might be a better idea to store those products in a separate 2D array and perform the elementwise multiplication with A only once, after we get out of the loop.
Thus, we would have an alternative implementation like so -
range_arr = np.arange(B_arr.shape[0])
out = np.empty_like(A)
for i in range(a):
    out[:,i] = B_arr[range_arr,:,C[i,:]].prod(0)
A *= out
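As a quick sanity check, here is a small self-contained comparison of the vectorized expression against the original double loop, using made-up toy sizes (and assuming every B[j] has the same number of columns l, so the list can be stacked):

import numpy as np

# Toy sizes, for illustration only
m, a, b, l = 50, 10, 3, 4
rng = np.random.default_rng(0)
A = rng.random((m, a))
B = [rng.random((m, l)) for _ in range(b)]
C = rng.integers(0, l, size=(a, b))

# Reference: the original double loop
A_ref = A.copy()
for i in range(a):
    for j in range(b):
        A_ref[:, i] *= B[j][:, C[i, j]]

# Vectorized version from above
B_arr = np.asarray(B)
A_vec = A * B_arr[np.arange(B_arr.shape[0]), :, C].prod(1).T

print(np.allclose(A_ref, A_vec))  # expected: True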
I have a very long 2-D numpy array and want to count values inside intervals. I can do it using a double loop, but it is very time consuming. Can anyone suggest a faster alternative? I guess it would be better with no loops.
Below is a simple piece of code exemplifying what I want to do faster.
a = np.random.random([10000, 2])
a[:, 1] += 2 # So we have the first column with values between 0. and 1.,
# and the 2nd column with values between 2. and 3.
for i in range(10):
    for j in range(5):
        s0 = a[a[:, 0] >= i * 0.1]
        s1 = s0[s0[:, 0] < (i+1) * 0.1]
        s2 = s1[s1[:, 1] >= 2 + j * 0.2]
        s3 = s2[s2[:, 1] < 2 + (j+1) * 0.2]
        print(len(s3))
Additional information: I tried using masked arrays, but it did not work because I need to compare an array with lower and higher limits. As far as I know, masked arrays only allow comparing the values inside a numpy array with floats, but not with another array.
The operation is inefficient because it creates a lot of temporary arrays and reads/writes relatively large arrays over and over: 4 boolean arrays + 4 floating-point arrays per iteration, and there are 50 iterations. That means 400 arrays, not to mention that the array needs to be read/written completely over and over. Additionally, creating an array just to count items is not efficient either; you can just use np.count_nonzero instead. Note that printing is slow too, but I guess you will not use it in real-world code.
Additionally, the memory access pattern is not efficient: a[:,0] and a[:,1] are strided views that prevent Numpy from vectorizing the code. They also cause twice as much data to be read from the memory hierarchy. The transposed version should be preferred (with a copy, so as to avoid strided views). The transposed array can be precomputed once.
Here is an improved version:
a = np.random.random([10000, 2])
a[:, 1] += 2
x, y = a.T.copy()
for i in range(10):
    for j in range(5):
        cond = x >= i * 0.1
        cond &= x < (i+1) * 0.1
        cond &= y >= 2 + j * 0.2
        cond &= y < 2 + (j+1) * 0.2
        len_s3 = np.count_nonzero(cond)
        #print(len_s3)
This is about 6 times faster on my machine. Note that boolean arrays are still created, but they are much faster to create and fill since they are 8 times smaller in memory than double-precision floating-point ones. You can use functions like np.logical_and combined with the out parameter to speed up the computation a bit, but the impact is pretty small (most of the cost comes from the internal copy and Numpy's internal overheads).
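For completeness, a minimal sketch of that out-parameter idea, reusing two preallocated boolean buffers so the comparisons write in place instead of allocating a new array each time (the buffer names are mine):

cond = np.empty(x.shape, dtype=bool)
tmp = np.empty(x.shape, dtype=bool)
for i in range(10):
    for j in range(5):
        np.greater_equal(x, i * 0.1, out=cond)
        np.less(x, (i+1) * 0.1, out=tmp)
        np.logical_and(cond, tmp, out=cond)
        np.greater_equal(y, 2 + j * 0.2, out=tmp)
        np.logical_and(cond, tmp, out=cond)
        np.less(y, 2 + (j+1) * 0.2, out=tmp)
        np.logical_and(cond, tmp, out=cond)
        len_s3 = np.count_nonzero(cond)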
If this is not enough, you can use Numba to speed this up significantly. An alternative solution is to sort the array and then perform a fast binary search on the sub-parts, though it is a bit trickier to do.
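As a rough illustration of the Numba route, here is a sketch that fuses the binning and counting into a single pass over the data (note that it computes the bin indices arithmetically, which in rare floating-point boundary cases may classify a value differently from the chained >=/< comparisons):

import numba as nb
import numpy as np

@nb.njit
def count_in_intervals(x, y):
    # 10 bins of width 0.1 on x (starting at 0), 5 bins of width 0.2 on y (starting at 2)
    counts = np.zeros((10, 5), dtype=np.int64)
    for k in range(x.size):
        i = int(x[k] / 0.1)
        j = int((y[k] - 2.0) / 0.2)
        if 0 <= i and i < 10 and 0 <= j and j < 5:
            counts[i, j] += 1
    return counts

a = np.random.random([10000, 2])
a[:, 1] += 2
x, y = a.T.copy()
counts = count_in_intervals(x, y)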
I want to add two numpy arrays of different sizes starting at a specific index. As I need to do this a couple of thousand times with large arrays, it needs to be efficient, and I am not sure how to do it without iterating through each cell.
a = [5,10,15]
b = [0,0,10,10,10,0,0]
res = add_arrays(b,a,2)
print(res) => [0,0,15,20,25,0,0]
naive approach:
# b is the bigger array
def add_arrays(b, a, i):
    for j in range(len(a)):
        b[i+j] += a[j]
You might assign the smaller array into an array of zeros and then add; I would do it the following way:
import numpy as np
a = np.array([5,10,15])
b = np.array([0,0,10,10,10,0,0])
z = np.zeros(b.shape,dtype=int)
z[2:2+len(a)] = a # 2 is offset
res = z+b
print(res)
output
[ 0 0 15 20 25 0 0]
Disclaimer: I assume that offset + len(a) is always less than or equal to len(b).
Nothing wrong with your approach. You cannot get better asymptotic time or space complexity. If you want to reduce code lines (which is not an end in itself), you could use slice assignment and some other utils:
def add_arrays(b, a, i):
    b[i:i+len(a)] = list(map(sum, zip(b[i:i+len(a)], a)))
But the functional overhead should make this less performant, if anything.
Some docs:
map
sum
zip
It should be faster than Daweo's answer, 1.5-5x (depending on the size ratio between a and b).
result = b.copy()
result[offset: offset+len(a)] += a
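Wrapped into a function and checked against the example from the question (the add_arrays name just mirrors the question; this is only an illustrative sketch):

import numpy as np

def add_arrays(b, a, offset):
    # Copy so the input is not modified, then add a onto the matching slice
    result = b.copy()
    result[offset: offset + len(a)] += a
    return result

a = np.array([5, 10, 15])
b = np.array([0, 0, 10, 10, 10, 0, 0])
print(add_arrays(b, a, 2))  # [ 0  0 15 20 25  0  0]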
Is there any efficient way of doing the following:
Assume I have a vector A of length n. I want to calculate a second vector B, where
B[i] = A[0] * A[1] * ... * A[i-1] * A[i+1] * ... * A[n-1]
i.e., B[i] is the product of all elements in A except for the i-th element.
Initially, I thought of doing something like:
C = np.prod(A)
B = C/A
But then I have a problem when an element of A is equal to zero. Of course, I can check whether there is exactly one zero, in which case I set B to the all-zero vector except at that single position, where I put the product of the rest of A, and in the case of more than one zero I zero out B completely. But this becomes a little cumbersome when I want to do that operation for every row of a matrix and not just for a single vector.
Of course, I can do it in a loop but I was wondering if there is a more efficient way?
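For reference, a minimal sketch of the zero-handling idea just described, applied row-wise to a matrix (the function name and variable names are mine, and this is just one way to express it):

import numpy as np

def prod_except_self_rows(M):
    M = np.asarray(M, dtype=float)
    zero_mask = (M == 0)
    n_zeros = zero_mask.sum(axis=1, keepdims=True)

    # Product of the nonzero entries of each row
    prod_nonzero = np.where(zero_mask, 1.0, M).prod(axis=1, keepdims=True)

    # Rows without zeros: total product divided by each element
    B = prod_nonzero / np.where(zero_mask, 1.0, M)

    # Rows with exactly one zero: nonzero only at the zero's position;
    # rows with two or more zeros: all zeros
    B = np.where(n_zeros >= 1,
                 np.where(zero_mask & (n_zeros == 1), prod_nonzero, 0.0),
                 B)
    return B

A = np.array([[1., 2., 3., 4.],
              [1., 0., 3., 4.],
              [0., 2., 0., 4.]])
print(prod_except_self_rows(A))
# [[24. 12.  8.  6.]
#  [ 0. 12.  0.  0.]
#  [ 0.  0.  0.  0.]]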
You could slice up to (but not including) i, then from i+1 and on. Concatenate those slices together and multiply.
np.prod(np.concatenate([a[:i], a[i+1:]]))
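To get the whole vector B, that expression would just be evaluated for every i, e.g. (a sketch):

B = np.array([np.prod(np.concatenate([a[:i], a[i+1:]])) for i in range(a.size)])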
A possible one liner using np.eye, np.tile and np.prod:
np.prod(np.tile(A, (A.size, 1))[(1 - np.eye(A.size)).astype(bool)].reshape(A.size, -1), axis=1)
I have a numpy.ndarray variable A of size MxN. I wish to take each row and multiply it with its conjugate transpose. For the first row we will get:
np.matmul(np.expand_dims(A[0,:],axis=1),np.expand_dims(A[0,:].conj(),axis=0))
we get an NxN sized result. I want the final result for the total operation to be of size MxNxN.
I can do this with a simple loop which iterates over the rows of A and concatenates the results, but I wish to avoid a for loop for faster run time with SIMD operations. Is there a way to do this in a single line of code with broadcasting?
Otherwise, can I do something else and somehow reshape the results into my requirement?
The next code does the same as your code snippet but without a for-loop. On the other hand, it uses np.repeat twice, so you will need to benchmark both versions and compare them to test their memory/time performance.
import numpy as np
m, n = A.shape
x, y = A.conj().repeat(n, axis=0), A.reshape([-1, 1]).repeat(n, axis=1)
B = (x * y).reshape([m, n, n])
How it works
Basically, x repeats each row of the conjugate of A n consecutive times along the row axis (its shape is m*n by n).
y holds the values of A flattened into a single column and then repeated n times along the column axis (its final shape is also m*n by n).
x and y are multiplied element-wise and the result is reshaped to an array of shape m by n by n, stored in B.
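A quick sanity check against the original per-row expression, with a small made-up complex A:

import numpy as np

rng = np.random.default_rng(0)
A = rng.random((3, 4)) + 1j * rng.random((3, 4))
m, n = A.shape

x = A.conj().repeat(n, axis=0)
y = A.reshape([-1, 1]).repeat(n, axis=1)
B = (x * y).reshape([m, n, n])

first = np.matmul(np.expand_dims(A[0, :], axis=1),
                  np.expand_dims(A[0, :].conj(), axis=0))
print(np.allclose(B[0], first))  # expected: True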
A list comprehension could do the trick:
result = np.array([np.matmul(np.expand_dims(A[i,:],axis=1), np.expand_dims(A[i,:].conj(),axis=0)) for i in range(A.shape[0])])
The "vectorizing" of fancy indexing by Python's numpy library sometimes gives unexpected results. For example:
import numpy
a = numpy.zeros((1000,4), dtype='uint32')
b = numpy.zeros((1000,4), dtype='uint32')
i = numpy.random.randint(0, 1000, 1000)
j = numpy.random.randint(0, 4, 1000)
a[i,j] += 1
for k in range(1000):
    b[i[k],j[k]] += 1
Gives different results in the arrays 'a' and 'b' (i.e. in 'a' each (i,j) pair that occurs appears as 1 regardless of repeats, whereas repeats are counted in 'b'). This is easily verified as follows:
>>> numpy.sum(a)
883
>>> numpy.sum(b)
1000
It is also notable that the fancy indexing version is almost two orders of magnitude faster than the for loop. My question is: "Is there an efficient way for numpy to compute the repeat counts as implemented using the for loop in the provided example?"
This should do what you want:
np.bincount(np.ravel_multi_index((i, j), (1000, 4)), minlength=4000).reshape(1000, 4)
As a breakdown, ravel_multi_index converts the index pairs specified by i and j into integer indices into a C-flattened array; bincount counts the number of times each value 0..3999 appears in that list of indices; and reshape converts the C-flattened counts back to a 2d array.
In terms of performance, I measure it at 200 times faster than "b", and 5 times faster than "a"; your mileage may vary.
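Here is a small self-contained check of the bincount approach against the explicit loop, with toy sizes, just to illustrate the equivalence:

import numpy as np

rows, cols, n = 6, 4, 50
i = np.random.randint(0, rows, n)
j = np.random.randint(0, cols, n)

# Loop-based counting (the semantics of the original "b" array)
b = np.zeros((rows, cols), dtype='uint32')
for k in range(n):
    b[i[k], j[k]] += 1

# bincount-based counting
counts = np.bincount(np.ravel_multi_index((i, j), (rows, cols)),
                     minlength=rows * cols).reshape(rows, cols)

print(np.array_equal(b, counts))  # expected: True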
Since you need to write the counts to an existing array a, try this:
u, inv = np.unique(np.ravel_multi_index((i, j), (1000, 4)), return_inverse=True)
a.flat[u] += np.bincount(inv)
I measure this second method as a little slower (2x) than "a", which isn't too surprising since the unique stage is going to be slow.