I want to optimize my numpy code. I'm working with large arrays, so efficiency is required; I tried to avoid for-loops where possible.
Let's assume a simple 2-D array:
1 3 5
2 0 1
5 6 2
My task is to take values from each column until their cumulative sum reaches a certain value, clipping the value that crosses the limit if needed. Let's name this value clip. With clip = 3, after this operation I'll have an array like this:
1 3 3
2 0 0
0 0 0
I came up with a rather naive idea to calculate it with simple transformations:
array_clipped = np.clip(array, 0, clip)                       # cap individual values at clip
array_clipped_cumsum = np.cumsum(array_clipped, axis=0)       # running totals per column
difference = clip - array_clipped_cumsum                      # budget left after each element
difference_trimmed = np.where(difference < 0, difference, 0)  # keep only the overshoot
final = array_clipped + difference_trimmed                    # subtract the overshoot
final_clean = np.where(final >= 0, final, 0)                  # zero out fully spent elements
While this code works, it looks very dirty and un-numpy-like.
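For reference, here is the pipeline above run end to end on the sample array; clip = 3 is an assumption inferred from the expected output:
import numpy as np

array = np.array([[1, 3, 5],
                  [2, 0, 1],
                  [5, 6, 2]])
clip = 3  # assumed: inferred from the expected output above

array_clipped = np.clip(array, 0, clip)
array_clipped_cumsum = np.cumsum(array_clipped, axis=0)
difference = clip - array_clipped_cumsum
difference_trimmed = np.where(difference < 0, difference, 0)
final = array_clipped + difference_trimmed
final_clean = np.where(final >= 0, final, 0)
print(final_clean)
# [[1 3 3]
#  [2 0 0]
#  [0 0 0]]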
Here is a one-liner:
A = np.random.randint(0,10,(6,4))
A
# array([[0, 8, 7, 6],
# [3, 2, 0, 4],
# [5, 6, 6, 4],
# [4, 5, 0, 3],
# [7, 9, 6, 8],
# [0, 9, 8, 3]])
cap = 15
np.diff(np.minimum(A.cumsum(0),cap),axis=0,prepend=0)
# array([[0, 8, 7, 6],
# [3, 2, 0, 4],
# [5, 5, 6, 4],
# [4, 0, 0, 1],
# [3, 0, 2, 0],
# [0, 0, 0, 0]])
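Why this works: np.minimum(A.cumsum(0), cap) is the running total actually kept under the cap, and its first difference (hence the diff with prepend=0) is the amount each element contributes before the cap is hit.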
Or in two lines avoiding the slow prepend:
out = np.minimum(A.cumsum(0),cap)
out[1:] -= out[:-1]
out
# array([[0, 8, 7, 6],
# [3, 2, 0, 4],
# [5, 5, 6, 4],
# [4, 0, 0, 1],
# [3, 0, 2, 0],
# [0, 0, 0, 0]])
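The in-place subtraction computes the same first difference as the prepend version, but reuses the out buffer instead of allocating the padded copy that prepend requires.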
A cleaner way would be -
# a is input array and clip is the clipping value
c = a.cumsum(0)
out = (a-c+c.clip(max=clip)).clip(min=0)
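A quick sanity check on the question's 3x3 example (a sketch; clip = 3 is inferred from the question's expected output):
import numpy as np

a = np.array([[1, 3, 5],
              [2, 0, 1],
              [5, 6, 2]])
clip = 3  # assumed from the question's expected output

c = a.cumsum(0)
out = (a - c + c.clip(max=clip)).clip(min=0)
print(out)
# [[1 3 3]
#  [2 0 0]
#  [0 0 0]]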
I have a 3D numpy array like this:
>>> a
array([[[0, 1, 2],
[0, 1, 2],
[6, 7, 8]],
[[6, 7, 8],
[0, 1, 2],
[6, 7, 8]],
[[0, 1, 2],
[3, 4, 5],
[6, 7, 8]]])
I want to remove only those rows which contain duplicates within themselves. For instance, the output should look like this:
>>> remove_row_duplicates(a)
array([[[0, 1, 2],
[3, 4, 5],
[6, 7, 8]]])
This is the function that I am using:
delindices = np.empty(0, dtype=int)
for i in range(len(a)):
    _, indices = np.unique(np.around(a[i], decimals=10), axis=0, return_index=True)
    if len(indices) < len(a[i]):
        delindices = np.append(delindices, i)
a = np.delete(a, delindices, 0)
This works perfectly, but the problem is that my array now has shape (1000000, 7, 3). The for loop is pretty slow in Python and this takes a lot of time. Also, my original array contains floating-point numbers. Does anyone have a better solution, or can anyone help me vectorize this function?
Sort it along the rows of each 2D block, i.e. along axis=1, then look for matches between successive rows, and finally check for any match along that same axis=1 -
b = np.sort(a,axis=1)
out = a[~((b[:,1:] == b[:,:-1]).all(-1)).any(1)]
Sample run with explanation
Input array:
In [51]: a
Out[51]:
array([[[0, 1, 2],
[0, 1, 2],
[6, 7, 8]],
[[6, 7, 8],
[0, 1, 2],
[6, 7, 8]],
[[0, 1, 2],
[3, 4, 5],
[6, 7, 8]]])
Code steps:
# Sort along axis=1, i.e rows in each 2D block
In [52]: b = np.sort(a,axis=1)
In [53]: b
Out[53]:
array([[[0, 1, 2],
[0, 1, 2],
[6, 7, 8]],
[[0, 1, 2],
[6, 7, 8],
[6, 7, 8]],
[[0, 1, 2],
[3, 4, 5],
[6, 7, 8]]])
In [54]: (b[:,1:] == b[:,:-1]).all(-1) # Look for successive matching rows
Out[54]:
array([[ True, False],
[False, True],
[False, False]])
# Look for matches along each row, which indicates the presence
# of duplicate rows within each 2D block of the original 3D array
In [55]: ((b[:,1:] == b[:,:-1]).all(-1)).any(1)
Out[55]: array([ True, True, False])
# Invert those as we need to remove those cases
# Finally index with boolean indexing and get the output
In [57]: a[~((b[:,1:] == b[:,:-1]).all(-1)).any(1)]
Out[57]:
array([[[0, 1, 2],
[3, 4, 5],
[6, 7, 8]]])
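Since the question mentions floating-point data, the exact == comparison can be replaced by a tolerance test. A sketch of the same approach using np.isclose (in the spirit of the question's rounding to 10 decimals):
b = np.sort(a, axis=1)
# rows count as duplicates when every entry matches within tolerance
dup = np.isclose(b[:, 1:], b[:, :-1]).all(-1).any(1)
out = a[~dup]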
You can probably do this easily using broadcasting, but since you're dealing with more than 2D arrays it won't be as optimized as you expect, and in some cases it will be very slow. Instead you can use the following approach, inspired by Jaime's answer:
In [28]: u = np.unique(arr.view(np.dtype((np.void, arr.dtype.itemsize*arr.shape[1])))).view(arr.dtype).reshape(-1, arr.shape[1])
In [29]: inds = np.where((arr == u).all(2).sum(0) == u.shape[1])
In [30]: arr[inds]
Out[30]:
array([[[0, 1, 2],
[3, 4, 5],
[6, 7, 8]]])
I just came across a showstopper in a part of my code and I am not sure what I am doing wrong...
I simply have a large data cube and want to change the maximum values along the z axis to some other number:
import numpy as np
from time import time
x, y, z = 100, 100, 10
a = np.arange(x*y*z).reshape((z, y, x))
t = time()
a[np.argmax(a, axis=0)] = 1
print(time() - t)
This takes about 0.02 s, which is a bit slow for such a small array, but OK. My problem is that I need to do this with arrays as large as (32, 4096, 4096), and I have not had the patience to let the above code finish... it's just too inefficient, but it should actually be very fast! Am I doing something wrong with setting the array elements?
You are basically indexing your numpy array with a numpy array containing numbers. I think that is the reason why it is so slow (and I'm not sure it really does what you want it to do).
If you create a boolean numpy array and use it as the index instead, it's orders of magnitude faster.
For example:
pos_max = np.expand_dims(np.argmax(a, axis=0), axis=0)
pos_max_indices = np.arange(a.shape[0]).reshape(10,1,1) == pos_max
a[pos_max_indices] = 1
is 20 times faster than the original and does the same.
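The hard-coded 10 in reshape(10,1,1) is just the z size of the example; a shape-agnostic sketch of the same idea:
z = a.shape[0]
pos_max = np.argmax(a, axis=0)                 # (y, x): plane index of each column's max
mask = np.arange(z)[:, None, None] == pos_max  # broadcasts to a (z, y, x) boolean mask
a[mask] = 1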
I don't think it is the indexing with numbers that's slowing it down. Usually indexing a single dimension with a boolean vector is slower than indexing with the corresponding np.where.
Something else is going on here. Look at these shapes:
In [14]: a.shape
Out[14]: (10, 100, 100)
In [15]: np.argmax(a,axis=0).shape
Out[15]: (100, 100)
In [16]: a[np.argmax(a,axis=0)].shape
Out[16]: (100, 100, 100, 100)
The indexed a is 1000x larger than the original: integer fancy indexing on the first axis replaces that axis with the shape of the index array, so indexing a (10, 100, 100) array with a (100, 100) index gives (100, 100, 100, 100).
@MSeifert's solution is faster, but I can't help feeling it is more complex than needed.
In [35]: %%timeit
....: a=np.arange(x*y*z).reshape((z,y,x))
....: pos_max = np.expand_dims(np.argmax(a, axis=0), axis=0)
....: pos_max_indices = np.arange(a.shape[0]).reshape(10,1,1) == pos_max
....: a[pos_max_indices]=1
....:
1000 loops, best of 3: 1.28 ms per loop
I'm still working on an improvement.
The sample array isn't a good one - it's too big to display, and all the max values are in the last z plane:
In [46]: x,y,z=4,2,3
In [47]: a=np.arange(x*y*z).reshape((z,y,x))
In [48]: a
Out[48]:
array([[[ 0, 1, 2, 3],
[ 4, 5, 6, 7]],
[[ 8, 9, 10, 11],
[12, 13, 14, 15]],
[[16, 17, 18, 19],
[20, 21, 22, 23]]])
In [49]: a[np.argmax(a,axis=0)]=1
In [50]: a
Out[50]:
array([[[ 0, 1, 2, 3],
[ 4, 5, 6, 7]],
[[ 8, 9, 10, 11],
[12, 13, 14, 15]],
[[ 1, 1, 1, 1],
[ 1, 1, 1, 1]]])
I could access those same argmax values with:
In [51]: a[-1,...]
Out[51]:
array([[1, 1, 1, 1],
[1, 1, 1, 1]])
Let's try a random array, where the argmax can be in any plane:
In [57]: a=np.random.randint(2,10,(z,y,x))
In [58]: a
Out[58]:
array([[[9, 7, 6, 5],
[6, 3, 5, 2]],
[[5, 6, 2, 3],
[7, 9, 6, 9]],
[[7, 7, 8, 9],
[2, 4, 9, 7]]])
In [59]: a[np.argmax(a,axis=0)]=0
In [60]: a
Out[60]:
array([[[0, 0, 0, 0],
[0, 0, 0, 0]],
[[0, 0, 0, 0],
[0, 0, 0, 0]],
[[0, 0, 0, 0],
[0, 0, 0, 0]]])
Oops - I turned everything to 0. Is that what you want?
Let's try the pos_max method:
In [61]: a=np.random.randint(0,10,(z,y,x))
In [62]: a
Out[62]:
array([[[9, 3, 9, 0],
[6, 6, 2, 4]],
[[9, 9, 4, 9],
[5, 9, 7, 9]],
[[1, 8, 1, 7],
[1, 0, 2, 3]]])
In [63]: pos_max = np.expand_dims(np.argmax(a, axis=0), axis=0)
In [64]: pos_max
Out[64]:
array([[[0, 1, 0, 1],
[0, 1, 1, 1]]], dtype=int32)
In [66]: pos_max_indices = np.arange(a.shape[0]).reshape(z,1,1) == pos_max
In [67]: pos_max_indices
Out[67]:
array([[[ True, False, True, False],
[ True, False, False, False]],
[[False, True, False, True],
[False, True, True, True]],
[[False, False, False, False],
[False, False, False, False]]], dtype=bool)
In [68]: a[pos_max_indices]=0
In [69]: a
Out[69]:
array([[[0, 3, 0, 0],
[0, 6, 2, 4]],
[[9, 0, 4, 0],
[5, 0, 0, 0]],
[[1, 8, 1, 7],
[1, 0, 2, 3]]])
That looks more reasonable. There still is a 9 in the 2nd plane, but that's because there was also a 9 in the 1st.
This still needs to be cleaned up, but here's a non-boolean mask solution:
In [98]: a=np.random.randint(0,10,(z,y,x))
In [99]: a1=a.reshape(z,-1) # it's easier to work with a 2d view
In [100]: ind=np.argmax(a1,axis=0)
In [101]: ind
Out[101]: array([2, 2, 1, 0, 2, 0, 1, 2], dtype=int32)
In [102]: a1[ind,np.arange(a1.shape[1])] # the largest values
Out[102]: array([9, 8, 7, 4, 9, 7, 9, 6])
In [104]: a1
Out[104]:
array([[3, 1, 5, 4, 2, 7, 4, 5],
[4, 4, 7, 1, 3, 7, 9, 4],
[9, 8, 3, 3, 9, 1, 2, 6]])
In [105]: a1[ind,np.arange(a1.shape[1])]=0
In [106]: a
Out[106]:
array([[[3, 1, 5, 0],
[2, 0, 4, 5]],
[[4, 4, 0, 1],
[3, 7, 0, 4]],
[[0, 0, 3, 3],
[0, 1, 2, 0]]])
Working with the 2D view a1 is easier; the exact shape of the x and y dimensions is not important to this problem. We are changing individual values, not columns or planes. Still, I'd like to get it working without a1.
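One way to do that (a sketch, not part of the original timing test) is to index all three axes explicitly:
ind = np.argmax(a, axis=0)      # (y, x): plane of each column's max
j, k = np.indices(a.shape[1:])  # row and column coordinates, each (y, x)
a[ind, j, k] = 0                # sets the first maximum along z in every column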
Here are two functions that replace the maximum value (the first occurrence along z, if there are ties). I use copy since it makes repeated time testing easier.
def setmax0(a, value=-1):
    # @MSeifert's boolean-mask method
    a = a.copy()
    z = a.shape[0]
    pos_max = np.expand_dims(np.argmax(a, axis=0), axis=0)
    pos_max_indices = np.arange(z).reshape(z, 1, 1) == pos_max
    a[pos_max_indices] = value
    return a
def setmax1(a, value=-2):
    # 2D-view method
    a = a.copy()
    z = a.shape[0]
    a1 = a.reshape(z, -1)
    ind = np.argmax(a1, axis=0)
    a1[ind, np.arange(a1.shape[1])] = value
    return a
They produce the same result in a test like:
ab = np.random.randint(0,100,(20,1000,1000))
test = np.allclose(setmax1(ab,-1),setmax0(ab,-1))
Timings (using ipython timeit) are basically the same.
They do assign values in different orders, so setmax0(ab,-np.arange(...)) will be different.
I have a numpy array x (with shape (n, 4)) of integers like:
[[0 1 2 3],
[1 2 7 9],
[2 1 5 2],
...]
I want to transform the array into an array of pairs:
[0,1]
[0,2]
[0,3]
[1,2]
...
so the first element makes a pair with each of the other elements in the same sub-array. I already have a for-loop solution:
y = np.array([[x[j, 0], x[j, i]] for i in range(1, 4) for j in range(0, n)], dtype=int)
but since looping over a numpy array is not efficient, I tried slicing as a solution. I can do the slicing for every column as:
y[1]=np.array([x[:,0],x[:,1]]).T
# [[0,1],[1,2],[2,1],...]
I can repeat this for all columns. My questions are:
How can I append y[2] to y[1], ..., such that the shape is (N, 2)?
If the number of columns is not small (in this example, 4), how can I find y[i] elegantly?
What are the alternative ways to achieve the final array?
The cleanest way of doing this I can think of would be:
>>> x = np.arange(12).reshape(3, 4)
>>> x
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
>>> n = x.shape[1] - 1
>>> y = np.repeat(x, (n,)+(1,)*n, axis=1)
>>> y
array([[ 0, 0, 0, 1, 2, 3],
[ 4, 4, 4, 5, 6, 7],
[ 8, 8, 8, 9, 10, 11]])
>>> y.reshape(-1, 2, n).transpose(0, 2, 1).reshape(-1, 2)
array([[ 0, 1],
[ 0, 2],
[ 0, 3],
[ 4, 5],
[ 4, 6],
[ 4, 7],
[ 8, 9],
[ 8, 10],
[ 8, 11]])
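The repeats tuple (n,)+(1,)*n repeats the first column n times and keeps each of the remaining n columns once; the reshape and transpose then pair each repeated first element with one of the other columns.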
This will make two copies of the data, so it will not be the most efficient method. That would probably be something like:
>>> y = np.empty((x.shape[0], n, 2), dtype=x.dtype)
>>> y[..., 0] = x[:, 0, None]
>>> y[..., 1] = x[:, 1:]
>>> y.shape = (-1, 2)
>>> y
array([[ 0, 1],
[ 0, 2],
[ 0, 3],
[ 4, 5],
[ 4, 6],
[ 4, 7],
[ 8, 9],
[ 8, 10],
[ 8, 11]])
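Preallocating y with np.empty and filling the two output columns by broadcasting writes each element exactly once, so only a single output-sized buffer is ever allocated.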
Like Jaime, I first tried a repeat of the 1st column followed by reshaping, but then decided it was simpler to make two intermediate arrays and hstack them:
x=np.array([[0,1,2,3],[1,2,7,9],[2,1,5,2]])
m,n=x.shape
x1=x[:,0].repeat(n-1)[:,None]
x2=x[:,1:].reshape(-1,1)
np.hstack([x1,x2])
producing
array([[0, 1],
[0, 2],
[0, 3],
[1, 2],
[1, 7],
[1, 9],
[2, 1],
[2, 5],
[2, 2]])
There probably are other ways of doing this sort of rearrangement. The result will copy the original data one way or another. My guess is that as long as you are using compiled functions like reshape and repeat, the time differences won't be significant.
Suppose the numpy array is
arr = np.array([[0, 1, 2, 3],
[1, 2, 7, 9],
[2, 1, 5, 2]])
You can get the array of pairs as
import itertools
m, n = arr.shape
new_arr = np.array([pair for i in range(m)
                    for pair in itertools.product(arr[i, 0:1], arr[i, 1:n])])
The output would be
array([[0, 1],
[0, 2],
[0, 3],
[1, 2],
[1, 7],
[1, 9],
[2, 1],
[2, 5],
[2, 2]])
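Note that this still runs a Python-level loop over the rows, so for large m the vectorized answers above should be considerably faster.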
I have a large numpy array (8 by 30000) and I want to delete some rows according to a criterion. The criterion applies to only one column.
Example:
>>> p = np.array([[0, 1, 3], [1 , 5, 6], [4, 3, 56], [1, 34, 4]])
>>> p
array([[ 0, 1, 3],
[ 1, 5, 6],
[ 4, 3, 56],
[ 1, 34, 4]])
Here I would like to remove every row in which the value of the 3rd column is > 30, i.e. row 3 in this example.
As the array is pretty large, I'd like to avoid for-loops. I thought of this:
>>> p[~(p > 30).any(1), :]
array([[0, 1, 3],
[1, 5, 6]])
But that obviously removes the two last rows, since it tests every column instead of just the 3rd. Any ideas on how to do this in an efficient way?
p = p[~(p[:,2] > 30)]
or (if your condition is easily invertible):
p = p[p[:,2] <= 30]
returns
array([[ 0, 1, 3],
[ 1, 5, 6],
[ 1, 34, 4]])
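For clarity, the boolean mask can also be built as a separate step; this is equivalent to the one-liners above:
mask = p[:, 2] <= 30  # True for rows whose 3rd column is <= 30
p = p[mask]           # boolean indexing keeps only those rows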