I want to modify block elements of a 3D array without a for loop, because the loop is the bottleneck of my code.
To illustrate what I want: within each 4x4 slice, the 2x2 grid of blocks should be transposed, i.e. the off-diagonal blocks swapped.
The code with the for loop:
import numpy as np
# Create 3d array with 2x4x4 elements
a = np.arange(2*4*4).reshape(2,4,4)
b = np.zeros(np.shape(a))
# Change Block Elements
for it1 in range(2):
    b[it1] = np.block([[a[it1, 0:2, 0:2], a[it1, 2:4, 0:2]],
                       [a[it1, 0:2, 2:4], a[it1, 2:4, 2:4]]])
First let's see if there's a way to do what you want for a 2D array using only indexing, reshape, and transpose operations. If there is, then there's a good chance that you can extend it to a larger number of dimensions.
x = np.arange(2 * 3 * 2 * 5).reshape(2 * 3, 2 * 5)
Clearly you can reshape this into an array that has the blocks along a separate dimension:
x.reshape(2, 3, 2, 5)
Then you can transpose the resulting blocks:
x.reshape(2, 3, 2, 5).transpose(2, 1, 0, 3)
So far, none of the data has been copied. To make the copy happen, reshape back into the original shape:
x.reshape(2, 3, 2, 5).transpose(2, 1, 0, 3).reshape(2 * 3, 2 * 5)
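You can confirm the copy behaviour with np.shares_memory (a small check I'm adding for illustration; it is not part of the original answer):
y = x.reshape(2, 3, 2, 5).transpose(2, 1, 0, 3)
np.shares_memory(x, y)                        # True: still a view of x, no data copied yet
np.shares_memory(x, y.reshape(2 * 3, 2 * 5))  # False: the final reshape makes the copy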
Adding extra leading dimensions is then just a matter of shifting up the indices of the axes you swap:
b = a.reshape(a.shape[0], 2, a.shape[1] // 2, 2, a.shape[2] // 2).transpose(0, 3, 2, 1, 4).reshape(a.shape)
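As a sanity check (a snippet I'm adding for illustration; it simply re-runs the question's loop), the one-liner produces the same result as the loop-based np.block construction:
a = np.arange(2 * 4 * 4).reshape(2, 4, 4)
b = a.reshape(a.shape[0], 2, a.shape[1] // 2, 2, a.shape[2] // 2).transpose(0, 3, 2, 1, 4).reshape(a.shape)
expected = np.zeros(a.shape, dtype=a.dtype)
for it1 in range(2):
    expected[it1] = np.block([[a[it1, 0:2, 0:2], a[it1, 2:4, 0:2]],
                              [a[it1, 0:2, 2:4], a[it1, 2:4, 2:4]]])
assert np.array_equal(b, expected)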
Here is a quick benchmark of the different implementations with your original array:
a = np.arange(2*4*4).reshape(2,4,4)
%%timeit
b = np.zeros(np.shape(a))
for it1 in range(2):
    b[it1] = np.block([[a[it1, 0:2, 0:2], a[it1, 2:4, 0:2]],
                       [a[it1, 0:2, 2:4], a[it1, 2:4, 2:4]]])
27.7 µs ± 107 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
b = a.copy()
b[:,0:2,2:4], b[:,2:4,0:2] = b[:,2:4,0:2].copy(), b[:,0:2,2:4].copy()
2.22 µs ± 3.89 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit b = np.block([[a[:,0:2,0:2], a[:,2:4,0:2]],[a[:,0:2,2:4], a[:,2:4,2:4]]])
13.6 µs ± 217 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit b = a.reshape(a.shape[0], 2, a.shape[1] // 2, 2, a.shape[2] // 2).transpose(0, 3, 2, 1, 4).reshape(a.shape)
1.27 µs ± 14.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
For small arrays, the differences can sometimes be attributed to overhead. Here is a more meaningful comparison with arrays of size 10x1000x1000, split into 10 500x500 blocks:
a = np.arange(10*1000*1000).reshape(10, 1000, 1000)
%%timeit
b = np.zeros(np.shape(a))
for it1 in range(10):
    b[it1] = np.block([[a[it1, 0:500, 0:500], a[it1, 500:1000, 0:500]],
                       [a[it1, 0:500, 500:1000], a[it1, 500:1000, 500:1000]]])
58 ms ± 904 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
b = a.copy()
b[:,0:500,500:1000], b[:,500:1000,0:500] = b[:,500:1000,0:500].copy(), b[:,0:500,500:1000].copy()
41.2 ms ± 688 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit b = np.block([[a[:,0:500,0:500], a[:,500:1000,0:500]],[a[:,0:500,500:1000], a[:,500:1000,500:1000]]])
27.5 ms ± 569 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit b = a.reshape(a.shape[0], 2, a.shape[1] // 2, 2, a.shape[2] // 2).transpose(0, 3, 2, 1, 4).reshape(a.shape)
20 ms ± 161 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
So it seems that using numpy's own reshaping and transposition mechanism is fastest on my computer. Also, notice that the overhead of np.block becomes less important than copying the temporary arrays as size gets bigger, so the other two implementations change places.
You can directly replace it1 with a slice over the whole first dimension:
b = np.block([[a[:,0:2,0:2], a[:,2:4,0:2]],[a[:,0:2,2:4], a[:,2:4,2:4]]])
Will it make it faster?
import numpy as np
a = np.arange(2*4*4).reshape(2,4,4)
b = a.copy()
b[:,0:2,2:4], b[:,2:4,0:2] = b[:,2:4,0:2].copy(), b[:,0:2,2:4].copy()
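A quick check (added here purely as an illustration) that the swap matches the np.block() construction used in the other answers:
assert np.array_equal(
    b,
    np.block([[a[:, 0:2, 0:2], a[:, 2:4, 0:2]],
              [a[:, 0:2, 2:4], a[:, 2:4, 2:4]]]))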
Comparison with the np.block() alternative from another answer.
Option 1:
%timeit b = a.copy(); b[:,0:2,2:4], b[:,2:4,0:2] = b[:,2:4,0:2].copy(), b[:,0:2,2:4].copy()
Output:
5.44 µs ± 134 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
Option 2:
%timeit b = np.block([[a[:,0:2,0:2], a[:,2:4,0:2]],[a[:,0:2,2:4], a[:,2:4,2:4]]])
Output:
30.6 µs ± 1.75 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
I'm not sure why a comparison on a numpy array sliced with a comma (a[1:99, 1:99]) is a lot slower than with chained brackets (a[1:99][1:99]). For example:
import time
import numpy as np

start = time.time()
a = np.zeros((100, 100))
for _ in range(1000000):
    a[1:99][1:99] == 1
print(time.time() - start)

start = time.time()
a = np.zeros((100, 100))
for _ in range(1000000):
    a[1:99, 1:99] == 1
print(time.time() - start)
3.2756259441375732
11.044903039932251
That's over 3 times worse.
The measurements are approximately the same when using timeit.
I'm working on a recursive algorithm (that is intentional), and this slowdown makes my program take about 10 seconds instead of 1. I just want to know the reason behind it; maybe it is a bug. I'm using Python 3.9.9. Thanks.
The first is the same as a[2:99] == 1: a (98,100) slice followed by a (97,100) slice, and then the == test.
In [177]: timeit (a[1:99][1:99]==1)
8.51 µs ± 16.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [178]: timeit (a[1:99][1:99])
383 ns ± 5.73 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [179]: timeit (a[1:99])
208 ns ± 10.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
The bulk of the time is the test, not the slicing.
In [180]: a[1:99,1:99].shape
Out[180]: (98, 98)
In [181]: timeit a[1:99,1:99]==1
32.2 µs ± 12.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [182]: timeit a[1:99,1:99]
301 ns ± 3.61 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Again the slicing is a minor part of the timing, but the == test is significantly slower. In the first case we select a subset of the rows, so the test runs on a contiguous block of the data buffer. In the second we select a subset of rows and columns, so iterating through the data buffer is more complicated.
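One way to see this difference directly (a small illustration I'm adding, not part of the original answer) is to check the contiguity flag of each slice:
a = np.zeros((100, 100))
a[2:99].flags['C_CONTIGUOUS']        # True: a pure row slice is one contiguous chunk
a[1:99, 1:99].flags['C_CONTIGUOUS']  # False: the column slice breaks contiguity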
We can simplify the comparison by testing a slice of columns versus a slice of rows:
In [183]: timeit a[:,2:99]==1
32.3 µs ± 13.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [184]: timeit a[2:99,:]==1
8.58 µs ± 10.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
As a further test, make a new array in 'F' (column-major) order. Now the "row" slice is the slow one:
In [189]: b = np.array(a, order='F')
In [190]: timeit b[:,2:99]==1
8.83 µs ± 20.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [191]: timeit b[2:99,:]==1
32.8 µs ± 31.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
===
But why are you comparing these two slices at all? One makes a (97,100) array and the other a (98,98) array; they pick different parts of a.
I wonder if you really meant to test a row slice followed by a column slice, not two row slices.
In [193]: timeit (a[1:99][:,1:99]==1)
32.6 µs ± 92.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Comparing just the slicing, we see that the sequential version is slower, but only by a bit.
In [194]: timeit (a[1:99][:,1:99])
472 ns ± 3.76 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [195]: timeit (a[1:99,1:99])
306 ns ± 3.19 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
===
The data for a is actually stored in a 1D C array. The numpy code uses the strides and shape to iterate through it when doing something like a[...] == 1.
So imagine a (3,6) array whose data buffer looks like
[0 1 2 3 4 5 0 1 2 3 4 5 0 1 2 3 4 5]
Sliced with [1:3], it will use
[_ _ _ _ _ _ 0 1 2 3 4 5 0 1 2 3 4 5]
Sliced with [:,1:4], it will use
[_ 1 2 3 _ _ _ 1 2 3 _ _ _ 1 2 3 _ _]
Regardless of processor caching details, iterating through the second is more complex.
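Since the values of an arange array coincide with their positions in the data buffer, you can make the touched offsets visible (a rough sketch I'm adding, not from the original answer):
c = np.arange(3 * 6).reshape(3, 6)   # each value equals its offset in the 1d data buffer
print(c[1:3].ravel())     # offsets 6 through 17: one contiguous run
print(c[:, 1:4].ravel())  # offsets 1 2 3, 7 8 9, 13 14 15: three short, separated runs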
I have a large numpy 1D array with over 100 million elements, and I am applying np.unique to it:
import numpy as np
x = np.random.randint(0,10000, size=100_000_000)
_, index = np.unique(x, return_inverse=True)
What I actually need is the index that is returned from np.unique but I do not need the unique array at all (i.e., it is throwaway). Since, in my real use case, I need to call np.unique many times on different arrays (all with the same length), this becomes the bottleneck. I'm guessing that a lot of the time is spent on sorting the unique array.
What is the fastest way to obtain the index for a large 1D array (it may be over a billion elements in length)?
Is there a parallelized option?
Here's a way with array assignment + masking + indexing trickery, specific to the case of non-negative integers in the input array x:
def return_inverse_only(x, maxnum=None):
    if maxnum is None:
        maxnum = x.max() + 1              # determines extent of the indexing arrays
    p = np.zeros(maxnum, dtype=bool)      # mark which values occur in x
    p[x] = 1
    p2 = np.empty(maxnum, dtype=np.uint64)
    c = p.sum()                           # number of unique values
    p2[p] = np.arange(c)                  # rank of each occurring value in sorted order
    out = p2[x]                           # map every element to its rank
    return out
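A quick correctness check against np.unique (a snippet I'm adding for illustration; it assumes non-negative integer input, as the function requires):
np.random.seed(0)
x = np.random.randint(0, 10000, size=1_000_000)
_, index = np.unique(x, return_inverse=True)
assert np.array_equal(return_inverse_only(x, maxnum=10000), index)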
If the maximum number in the input array is known beforehand, feed in that number plus one as maxnum to boost performance further.
Timings on large arrays:
In [146]: np.random.seed(0)
...: x = np.random.randint(0,10000, size=100000)
In [147]: %timeit np.unique(x, return_inverse=True)
...: %timeit return_inverse_only(x)
...: %timeit return_inverse_only(x, maxnum=10000)
10.9 ms ± 229 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
539 µs ± 10.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
446 µs ± 30 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [148]: np.random.seed(0)
...: x = np.random.randint(0,10000, size=1000000)
In [149]: %timeit np.unique(x, return_inverse=True)
...: %timeit return_inverse_only(x)
...: %timeit return_inverse_only(x, maxnum=10000)
149 ms ± 5.92 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
6.1 ms ± 106 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.3 ms ± 504 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [150]: np.random.seed(0)
...: x = np.random.randint(0,10000, size=10000000)
In [151]: %timeit np.unique(x, return_inverse=True)
...: %timeit return_inverse_only(x)
...: %timeit return_inverse_only(x, maxnum=10000)
1.88 s ± 11.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
67.9 ms ± 1.66 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
55.8 ms ± 1.62 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
30x+ speedup!
I have a dataframe with a column whose values are lists. I want to extract the individual elements of every list in the column. So given this input dataframe:
              A
0  [5, 4, 3, 6]
1  [7, 8, 9, 6]
The intended output should be a list:
[5, 4, 3, 6, 7, 8, 9, 6]
You can use a list comprehension to flatten the column:
a = [y for x in df.A for y in x]
Or use itertools.chain:
from itertools import chain
a = list(chain.from_iterable(df.A))
Or use numpy.concatenate:
a = np.concatenate(df.A).tolist()
Or use Series.explode, which works in pandas 0.25+:
a = df.A.explode().tolist()
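All four options give the same flat list for the sample frame (a quick check added here for illustration):
import numpy as np
import pandas as pd
from itertools import chain

df = pd.DataFrame({'A': [[5, 4, 3, 6], [7, 8, 9, 6]]})
expected = [5, 4, 3, 6, 7, 8, 9, 6]
assert [y for x in df.A for y in x] == expected
assert list(chain.from_iterable(df.A)) == expected
assert np.concatenate(df.A).tolist() == expected
assert df.A.explode().tolist() == expected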
Performance with sample data for 100k rows:
df = pd.DataFrame({
    'A': [[5, 4, 3, 6], [7, 8, 9, 6]] * 50000})
print(df)
In [263]: %timeit [y for x in df.A for y in x]
37.7 ms ± 3.93 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [264]: %timeit list(chain.from_iterable(df.A))
27.3 ms ± 1.34 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [265]: %timeit np.concatenate(df.A).tolist()
1.71 s ± 86.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [266]: %timeit df.A.explode().tolist()
207 ms ± 3.48 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# np.hstack solution from another answer
In [267]: %timeit np.hstack(df['A']).tolist()
328 ms ± 6.22 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I'm trying to convert a piece of MATLAB code to Python:
a = [1 2 3; 4 5 6]
b = sum(a < 5)
% output:
% b =
%      2     1     1
It returns, for every column, the number of elements that satisfy the condition.
Is there an equivalent function in numpy (Python) to do this?
It's the same:
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.sum(a < 5, axis=0)  # the only difference is that you need to set the axis explicitly
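For the example above this reproduces the MATLAB output (a quick check added for illustration):
print(np.sum(a < 5, axis=0))   # [2 1 1]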
Although not made for this purpose, an alternate solution would be:
a = np.array([[1, 2, 3], [4, 5, 6]])
np.count_nonzero(a<5, axis=0)
# array([2, 1, 1])
Performance
For small arrays, np.sum seems to be slightly faster
x = np.repeat([1, 2, 3], 100)
y = np.repeat([4, 5, 6], 100)
a = np.array([x, y])
%timeit np.sum(a<5, axis=0)
# 7.18 µs ± 669 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit np.count_nonzero(a<5, axis=0)
# 11.8 µs ± 386 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
For very large arrays, np.count_nonzero seems to be slightly faster
x = np.repeat([1, 2, 3], 5000000)
y = np.repeat([4, 5, 6], 5000000)
a = np.array([x, y])
%timeit np.sum(a<5, axis=0)
# 126 ms ± 6.92 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit np.count_nonzero(a<5, axis=0)
# 100 ms ± 6.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)