I have a function which processes an input array of dimension (h,w,200) (the number 200 can vary) and returns an array of dimension (h,w,50,3). The function takes ~0.8 seconds for an input array of size 512,512,200.
import numpy as np

def myfunc(arr, n=50):
    # shape of arr is (h, w, 200)
    # output shape is (h, w, 50, 3)
    # a1 is an array of length 50; I get it from a different
    # function, which doesn't take much time. For simplicity, I fix it
    # as np.arange(0, 50)
    a1 = np.arange(0, 50)
    output = np.stack((arr[:, :, a1],) * 3, axis=-1)
    return output
This preprocessing step is done to ~8 arrays in a single batch, due to which loading a batch of data takes 8*0.8 = 6.4 seconds. Is there a way to speed up the computation of myfunc? Can I use libraries like numba for this?
I get about the same time:
In [14]: arr = np.ones((512,512,200))
In [15]: timeit output = np.stack((arr[:,:,np.arange(50)],)*3, axis=-1)
681 ms ± 5.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [16]: np.stack((arr[:,:,np.arange(50)],)*3, axis=-1).shape
Out[16]: (512, 512, 50, 3)
Looking at the timings in more detail.
First, the index/copy step takes about 1/3 of the time:
In [17]: timeit arr[:,:,np.arange(50)]
249 ms ± 306 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
And the stack:
In [18]: %%timeit temp = arr[:,:,np.arange(50)]
...: output = np.stack([temp,temp,temp], axis=-1)
...:
...:
426 ms ± 367 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
stack expands dimensions and then concatenates, so let's call concatenate directly:
In [19]: %%timeit temp = arr[:,:,np.arange(50),None]
...: output = np.concatenate([temp,temp,temp], axis=-1)
...:
...:
430 ms ± 8.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Another approach is to use repeat:
In [20]: %%timeit temp = arr[:,:,np.arange(50),None]
...: output = np.repeat(temp, 3, axis=-1)
...:
...:
531 ms ± 155 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
So it looks like your code is about as good as it gets for this output layout.
Indexing and concatenate already use compiled code, so I don't expect numba to help much (not that I have much experience with it).
Stacking on a new front axis is faster (making a (3, 512, 512, 50) array):
In [21]: %%timeit temp = arr[:,:,np.arange(50)]
...: output = np.stack([temp,temp,temp])
...:
...:
254 ms ± 1.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
That could then be transposed (cheaply), though subsequent operations might be slower (if they require a copy and/or reordering). A plain copy of the full output array times at around 350 ms.
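A small sketch of the front-axis variant followed by the transpose, using a reduced stand-in array (the 8x8x20 shape and the variable names are illustrative assumptions, not taken from the question). It suggests that the transpose yields the requested trailing-axis layout without copying, at the price of a non-contiguous result:

```python
import numpy as np

# Small stand-in for the (512, 512, 200) input.
arr = np.arange(8 * 8 * 20, dtype=float).reshape(8, 8, 20)
temp = arr[:, :, np.arange(5)]            # index/copy step, shape (8, 8, 5)

front = np.stack([temp, temp, temp])      # new leading axis, shape (3, 8, 8, 5)
moved = front.transpose(1, 2, 3, 0)       # view with shape (8, 8, 5, 3)

reference = np.stack([temp, temp, temp], axis=-1)

print(moved.shape)                        # (8, 8, 5, 3)
print(np.array_equal(moved, reference))   # True
print(np.shares_memory(moved, front))     # True: the transpose is only a view
print(moved.flags['C_CONTIGUOUS'])        # False: later code may trigger a copy
```

Whether this helps in practice depends on what the downstream code does with the array; anything that needs a contiguous buffer will pay for the copy at that point.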
Inspired by comments, I tried broadcasted assignment:
In [101]: %%timeit temp = arr[:,:,np.arange(50)]
...: res = np.empty(temp.shape + (3,), temp.dtype)
...: res[...] = temp[...,None]
...:
...:
...:
337 ms ± 1.73 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Same ball park.
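For reference, the broadcasted assignment can be wrapped in a function with the same result as the original np.stack call. This is only a sketch: the name myfunc_broadcast and the reps parameter are hypothetical, and the test array is deliberately small.

```python
import numpy as np

def myfunc_broadcast(arr, a1, reps=3):
    """Select channels a1 on the last axis and repeat them on a new trailing axis."""
    temp = arr[:, :, a1]                                   # shape (h, w, len(a1))
    out = np.empty(temp.shape + (reps,), dtype=temp.dtype)
    out[...] = temp[..., None]                             # broadcast along the new axis
    return out

arr = np.arange(4 * 4 * 10, dtype=float).reshape(4, 4, 10)
a1 = np.arange(5)

out = myfunc_broadcast(arr, a1)
reference = np.stack((arr[:, :, a1],) * 3, axis=-1)

print(out.shape)                        # (4, 4, 5, 3)
print(np.array_equal(out, reference))   # True
```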
Another trick is to play with strides to make a 'virtual' copy:
In [74]: res1 = np.broadcast_to(arr, (3,)+arr.shape)
In [75]: res1.shape
Out[75]: (3, 512, 512, 200)
In [76]: res1.strides
Out[76]: (0, 819200, 1600, 8)
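This does not work directly with a target shape of (512,512,200,3): broadcasting aligns trailing dimensions, so the last axis of arr (200) would have to match 3. Adding a trailing axis first, as in np.broadcast_to(arr[..., None], arr.shape + (3,)), does work and gives a zero stride on the new last axis.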
Though I can transpose this just fine:
np.broadcast_to(arr, (3,)+arr.shape).transpose(1,2,3,0)
In any case this is much faster:
In [82]: timeit res1 = np.broadcast_to(arr, (3,)+arr.shape)
10.4 µs ± 188 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
(but making a copy brings time back up.)
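A minimal sketch of the 'virtual' copy with the repeated axis at the end, assuming a read-only result is acceptable. The array size is reduced for illustration, and np.broadcast_to needs an explicit trailing axis of length 1 before it can expand to 3:

```python
import numpy as np

arr = np.arange(8 * 8 * 20, dtype=float).reshape(8, 8, 20)
temp = arr[:, :, np.arange(5)]                     # shape (8, 8, 5)

# Add a length-1 trailing axis, then broadcast it to length 3 without copying.
view = np.broadcast_to(temp[..., None], temp.shape + (3,))

print(view.shape)             # (8, 8, 5, 3)
print(view.strides[-1])       # 0: the three "copies" share the same memory
print(view.flags.writeable)   # False: broadcast_to returns a read-only view

expected = np.stack((temp,) * 3, axis=-1)
print(np.array_equal(view, expected))   # True
```

As with the front-axis version, materializing this view (for example with np.ascontiguousarray) costs a full copy again.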
I want to modify block elements of 3d array without for loop. Without loop because it is the bottleneck of my code.
To illustrate what I want, I draw a figure:
The code with for loop:
import numpy as np
# Create 3d array with 2x4x4 elements
a = np.arange(2*4*4).reshape(2,4,4)
b = np.zeros(np.shape(a))
# Change Block Elements
for it1 in range(2):
    b[it1] = np.block([[a[it1,0:2,0:2], a[it1,2:4,0:2]],[a[it1,0:2,2:4], a[it1,2:4,2:4]]])
First let's see if there's a way to do what you want for a 2D array using only indexing, reshape, and transpose operations. If there is, then there's a good chance that you can extend it to a larger number of dimensions.
x = np.arange(2 * 3 * 2 * 5).reshape(2 * 3, 2 * 5)
Clearly you can reshape this into an array that has the blocks along a separate dimension:
x.reshape(2, 3, 2, 5)
Then you can transpose the resulting blocks:
x.reshape(2, 3, 2, 5).transpose(2, 1, 0, 3)
So far, none of the data has been copied. To make the copy happen, reshape back into the original shape:
x.reshape(2, 3, 2, 5).transpose(2, 1, 0, 3).reshape(2 * 3, 2 * 5)
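The claim about copies can be checked with np.shares_memory. A short sketch (variable names are illustrative):

```python
import numpy as np

x = np.arange(2 * 3 * 2 * 5).reshape(2 * 3, 2 * 5)

blocks = x.reshape(2, 3, 2, 5)            # view: blocks on separate axes
swapped = blocks.transpose(2, 1, 0, 3)    # still a view, only strides change
result = swapped.reshape(2 * 3, 2 * 5)    # strides no longer fit, so NumPy copies

print(np.shares_memory(x, blocks))    # True
print(np.shares_memory(x, swapped))   # True
print(np.shares_memory(x, result))    # False
```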
Adding additional leading dimensions is as simple as increasing the number of the dimensions you want to swap:
b = a.reshape(a.shape[0], 2, a.shape[1] // 2, 2, a.shape[2] // 2).transpose(0, 3, 2, 1, 4).reshape(a.shape)
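A quick sanity check that the one-liner reproduces the loop from the question on the original 2x4x4 array (b_loop and b_fast are names chosen here for illustration):

```python
import numpy as np

a = np.arange(2 * 4 * 4).reshape(2, 4, 4)

# Reference result: the loop from the question.
b_loop = np.zeros(a.shape, dtype=a.dtype)
for k in range(a.shape[0]):
    b_loop[k] = np.block([[a[k, 0:2, 0:2], a[k, 2:4, 0:2]],
                          [a[k, 0:2, 2:4], a[k, 2:4, 2:4]]])

# Loop-free version: split rows and columns into 2 blocks each, swap the
# two block axes, then merge back into the original shape.
n, h, w = a.shape
b_fast = a.reshape(n, 2, h // 2, 2, w // 2).transpose(0, 3, 2, 1, 4).reshape(a.shape)

print(np.array_equal(b_loop, b_fast))   # True
```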
Here is a quick benchmark of the other implementations with your original array:
a = np.arange(2*4*4).reshape(2,4,4)
%%timeit
b = np.zeros(np.shape(a))
for it1 in range(2):
    b[it1] = np.block([[a[it1, 0:2, 0:2], a[it1, 2:4, 0:2]], [a[it1, 0:2, 2:4], a[it1, 2:4, 2:4]]])
27.7 µs ± 107 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
b = a.copy()
b[:,0:2,2:4], b[:,2:4,0:2] = b[:,2:4,0:2].copy(), b[:,0:2,2:4].copy()
2.22 µs ± 3.89 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit b = np.block([[a[:,0:2,0:2], a[:,2:4,0:2]],[a[:,0:2,2:4], a[:,2:4,2:4]]])
13.6 µs ± 217 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit b = a.reshape(a.shape[0], 2, a.shape[1] // 2, 2, a.shape[2] // 2).transpose(0, 3, 2, 1, 4).reshape(a.shape)
1.27 µs ± 14.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
For small arrays, the differences can sometimes be attributed to overhead. Here is a more meaningful comparison with an array of size 10x1000x1000, where each of the 10 slices is split into four 500x500 blocks:
a = np.arange(10*1000*1000).reshape(10, 1000, 1000)
%%timeit
b = np.zeros(np.shape(a))
for it1 in range(10):
    b[it1] = np.block([[a[it1,0:500,0:500], a[it1,500:1000,0:500]],[a[it1,0:500,500:1000], a[it1,500:1000,500:1000]]])
58 ms ± 904 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
b = a.copy()
b[:,0:500,500:1000], b[:,500:1000,0:500] = b[:,500:1000,0:500].copy(), b[:,0:500,500:1000].copy()
41.2 ms ± 688 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit b = np.block([[a[:,0:500,0:500], a[:,500:1000,0:500]],[a[:,0:500,500:1000], a[:,500:1000,500:1000]]])
27.5 ms ± 569 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit b = a.reshape(a.shape[0], 2, a.shape[1] // 2, 2, a.shape[2] // 2).transpose(0, 3, 2, 1, 4).reshape(a.shape)
20 ms ± 161 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
So it seems that using numpy's own reshaping and transposition mechanism is fastest on my computer. Also, notice that the overhead of np.block becomes less important than copying the temporary arrays as size gets bigger, so the other two implementations change places.
You can directly replace the it1 by a slice of the whole dimension:
b = np.block([[a[:,0:2,0:2], a[:,2:4,0:2]],[a[:,0:2,2:4], a[:,2:4,2:4]]])
Will it make it faster?
import numpy as np
a = np.arange(2*4*4).reshape(2,4,4)
b = a.copy()
b[:,0:2,2:4], b[:,2:4,0:2] = b[:,2:4,0:2].copy(), b[:,0:2,2:4].copy()
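The copies on the right-hand side matter because slices are views. A small sketch of what happens without them (good, bad and expected are illustrative names):

```python
import numpy as np

a = np.arange(2 * 4 * 4).reshape(2, 4, 4)
expected = np.block([[a[:, 0:2, 0:2], a[:, 2:4, 0:2]],
                     [a[:, 0:2, 2:4], a[:, 2:4, 2:4]]])

# With copies: both blocks are read before either one is overwritten.
good = a.copy()
good[:, 0:2, 2:4], good[:, 2:4, 0:2] = good[:, 2:4, 0:2].copy(), good[:, 0:2, 2:4].copy()

# Without copies: the second right-hand side is a view of a block that the
# first assignment has already overwritten.
bad = a.copy()
bad[:, 0:2, 2:4], bad[:, 2:4, 0:2] = bad[:, 2:4, 0:2], bad[:, 0:2, 2:4]

print(np.array_equal(good, expected))   # True
print(np.array_equal(bad, expected))    # False
```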
Comparison with np.block() alternative from another answer.
Option 1:
%timeit b = a.copy(); b[:,0:2,2:4], b[:,2:4,0:2] = b[:,2:4,0:2].copy(), b[:,0:2,2:4].copy()
Output:
5.44 µs ± 134 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
Option 2
%timeit b = np.block([[a[:,0:2,0:2], a[:,2:4,0:2]],[a[:,0:2,2:4], a[:,2:4,2:4]]])
Output:
30.6 µs ± 1.75 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
I have a large numpy 1D array with over 100 million elements and am applying np.unique to it
import numpy as np
x = np.random.randint(0,10000, size=100_000_000)
_, index = np.unique(x, return_inverse=True)
What I actually need is the index that is returned from np.unique but I do not need the unique array at all (i.e., it is throwaway). Since, in my real use case, I need to call np.unique many times on different arrays (all with the same length), this becomes the bottleneck. I'm guessing that a lot of the time is spent on sorting the unique array.
What is the fastest way to obtain the index for a large 1D array (it may be over a billion elements in length)?
Is there a parallelized option?
Here's a way with array-assignment + masking + indexing trickery, specific to the case where the input array x contains only non-negative integers -
def return_inverse_only(x, maxnum=None):
    if maxnum is None:
        maxnum = x.max()+1 # Determines extent of indexing array
    p = np.zeros(maxnum, dtype=bool)
    p[x] = 1
    p2 = np.empty(maxnum, dtype=np.uint64)
    c = p.sum()
    p2[p] = np.arange(c)
    out = p2[x]
    return out
If the max number in the input array is known beforehand, pass that number plus one as maxnum to boost performance further.
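A small correctness check against np.unique, restating the function with more descriptive variable names (present and ranks are renamed here for readability; the logic is meant to be the same). It assumes x holds non-negative integers small enough that an array of length max(x)+1 fits in memory:

```python
import numpy as np

def return_inverse_only(x, maxnum=None):
    if maxnum is None:
        maxnum = x.max() + 1                     # size of the lookup table
    present = np.zeros(maxnum, dtype=bool)
    present[x] = True                            # mark every value that occurs
    ranks = np.empty(maxnum, dtype=np.uint64)
    ranks[present] = np.arange(present.sum())    # rank of each value in sorted order
    return ranks[x]                              # look up the rank of every element

rng = np.random.default_rng(0)
x = rng.integers(0, 100, size=10_000)

_, expected = np.unique(x, return_inverse=True)
result = return_inverse_only(x)

print(np.array_equal(result, expected))   # True
```

No sort of x is involved: the boolean mask is already ordered by value, so the ranks come out in sorted order for free.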
Timings on large arrays -
In [146]: np.random.seed(0)
...: x = np.random.randint(0,10000, size=100000)
In [147]: %timeit np.unique(x, return_inverse=True)
...: %timeit return_inverse_only(x)
...: %timeit return_inverse_only(x, maxnum=10000)
10.9 ms ± 229 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
539 µs ± 10.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
446 µs ± 30 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [148]: np.random.seed(0)
...: x = np.random.randint(0,10000, size=1000000)
In [149]: %timeit np.unique(x, return_inverse=True)
...: %timeit return_inverse_only(x)
...: %timeit return_inverse_only(x, maxnum=10000)
149 ms ± 5.92 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
6.1 ms ± 106 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.3 ms ± 504 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [150]: np.random.seed(0)
...: x = np.random.randint(0,10000, size=10000000)
In [151]: %timeit np.unique(x, return_inverse=True)
...: %timeit return_inverse_only(x)
...: %timeit return_inverse_only(x, maxnum=10000)
1.88 s ± 11.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
67.9 ms ± 1.66 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
55.8 ms ± 1.62 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
30x+ speedup!
I am curious about the fact that, when applying a function to each element of a pd.Series inside a for loop, the execution time seems to grow significantly more slowly than O(N).
Consider the function below, which rotates a number bit-wise (the code itself is not important here).
def rotate(x: np.uint32) -> np.uint32:
    return np.uint32(x >> 1) | np.uint32((x & 1) << 31)
When executing this code 1000 times in a for loop, it takes on the order of 1000 times as long, as expected.
x = np.random.randint(2 ** 32 - 1, dtype=np.uint32)
%timeit rotate(x)
# 13 µs ± 807 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
for i in range(1000):
    rotate(x)
# 9.61 ms ± 255 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
However, when I apply this code inside a for loop over a Series of size 1000, it gets significantly faster.
s = pd.Series(np.random.randint(2 ** 32 - 1, size=1000, dtype=np.uint32))
%%timeit
for x in s:
    rotate(x)
# 2.08 ms ± 113 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
What is the mechanism that makes this happen?
Note in your first loop you're not actually using the next value of the iterator. The following is a better comparison:
...: %%timeit
...: for i in range(1000):
...:     rotate(i)
...:
1.46 ms ± 71.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
...: %%timeit
...: for x in s:
...:     rotate(x)
...:
1.6 ms ± 66.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Not surprisingly, they perform more or less the same.
In your original example, because the variable x is declared outside the loop, the interpreter has to load it with LOAD_GLOBAL 2 (x), whereas the loop variable i can be loaded with LOAD_FAST 0 (i), which, as the name hints, is faster.
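The two opcodes can be seen with the standard dis module. This sketch uses two hypothetical functions, uses_global and uses_local, and only illustrates the bytecode difference; it does not measure how much of the timing gap that difference accounts for. Exact opcode names vary slightly between CPython versions.

```python
import dis

x = 12345  # module-level (global) name

def uses_global():
    return x        # x is looked up in the module namespace

def uses_local(i):
    return i        # i is a local variable of the function

global_ops = [ins.opname for ins in dis.get_instructions(uses_global)]
local_ops = [ins.opname for ins in dis.get_instructions(uses_local)]

print(global_ops)   # contains LOAD_GLOBAL
print(local_ops)    # contains LOAD_FAST (or a LOAD_FAST_* variant on newer CPython)
```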
There is a list of tuples l = [(x,y,z), (x,y,z), (x,y,z)]
The idea is to find the fastest way to create a separate np.array for the x-s, the y-s, and the z-s. I need help finding the fastest solution. To compare speed, I use the code below:
import time

def myfast():
    pass  # code under test goes here

n = 1000000
t0 = time.time()
for i in range(n): myfast()
t1 = time.time()
total_n = t1-t0
1. np.array([i[0] for i in l])
np.array([i[1] for i in l])
np.array([i[2] for i in l])
output: 0.9980638027191162
2. array_x = np.zeros((len(l), 1), dtype="float")
array_y = np.zeros((len(l), 1), dtype="float")
array_z = np.zeros((len(l), 1), dtype="float")
for i, zxc in enumerate(l):
    array_x[i] = zxc[0]
    array_y[i] = zxc[1]
    array_z[i] = zxc[2]
output 5.5509934425354
3. [np.array(x) for x in zip(*l)]
output 2.5070037841796875
5. array_x, array_y, array_z = np.array(list(zip(*l)))
output 2.725318431854248
There are some really good options in here, so I summarized them and compared their speed:
import numpy as np
def f1(input_data):
    array_x = np.array([elem[0] for elem in input_data])
    array_y = np.array([elem[1] for elem in input_data])
    array_z = np.array([elem[2] for elem in input_data])
    return array_x, array_y, array_z

def f2(input_data):
    array_x = np.zeros((len(input_data), ), dtype="float")
    array_y = np.zeros((len(input_data), ), dtype="float")
    array_z = np.zeros((len(input_data), ), dtype="float")
    for i, elem in enumerate(input_data):
        array_x[i] = elem[0]
        array_y[i] = elem[1]
        array_z[i] = elem[2]
    return array_x, array_y, array_z

def f3(input_data):
    return [np.array(elem) for elem in zip(*input_data)]

def f4(input_data):
    return np.array(list(zip(*input_data)))

def f5(input_data):
    return np.array(input_data).transpose()

def f6(input_data):
    array_all = np.array(input_data)
    array_x = array_all[:, 0]
    array_y = array_all[:, 1]
    array_z = array_all[:, 2]
    return array_x, array_y, array_z
First I asserted that all of them return the same data (using np.array_equal()):
data = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
for array_list in zip(f1(data), f2(data), f3(data), f4(data), f5(data), f6(data)):
    # print()
    # for i, arr in enumerate(array_list):
    #     print('array from function', i+1)
    #     print(arr)
    for i, arr in enumerate(array_list[:-1]):
        assert np.array_equal(arr, array_list[i+1])
And the time comparison:
import timeit
for f in [f1, f2, f3, f4, f5, f6]:
    t = timeit.timeit('f(data)', 'from __main__ import data, f', number=100000)
    print('{:5s} {:10.4f} seconds'.format(f.__name__, t))
gives these results:
data = [(1, 2, 3), (4, 5, 6), (7, 8, 9)] # 3 tuples
timeit number=100000
f1 0.3184 seconds
f2 0.4013 seconds
f3 0.2826 seconds
f4 0.2091 seconds
f5 0.1732 seconds
f6 0.2159 seconds
data = [(1, 2, 3) for _ in range(10**6)] # 1 million tuples
timeit number=10
f1 2.2168 seconds
f2 2.8657 seconds
f3 2.0150 seconds
f4 1.9790 seconds
f5 2.6380 seconds
f6 2.6586 seconds
making f5() the fastest option for short input and f4() the fastest option for big input.
If the number of elements in each tuple will be more than 3, then only 3 functions apply to that case (the others are hardcoded for 3 elements in each tuple):
data = [tuple(range(10**4)) for _ in range(10**3)]
timeit number=10
f3 11.8396 seconds
f4 13.4672 seconds
f5 4.6251 seconds
making f5() again the fastest option for these criteria.
You could try:
import numpy
array_x, array_y, array_z = numpy.array(list(zip(*l)))
or just:
numpy.array(list(zip(*l)))
and a more elegant way:
numpy.array(l).transpose()
Maybe I am missing something, but why not just pass the list of tuples directly to np.array? Say:
n = 100
l = [(0, 1, 2) for _ in range(n)]
arr = np.array(l)
x = arr[:, 0]
y = arr[:, 1]
z = arr[:, 2]
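The three column slices can also be unpacked in one step from the transposed array. A minimal sketch with made-up sample data:

```python
import numpy as np

l = [(1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, 12)]

arr = np.array(l)     # shape (4, 3): one row per tuple
x, y, z = arr.T       # iterating the transpose yields one array per coordinate

print(x)              # [ 1  4  7 10]
print(np.shares_memory(x, arr))   # True: x, y, z are views into arr
```

Because x, y and z are views, writing to one of them also changes arr; call .copy() on them if independent arrays are needed.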
Btw, I prefer to use the following to time code:
from timeit import default_timer as timer
t0 = timer()
do_heavy_calculation()
print("Time taken [sec]:", timer() - t0)
I believe most (but not all) of the ingredients of this answer are actually present in the other answers, but in the answers so far I have not seen an apples-to-apples comparison, in the sense that some approaches do not return a list of np.ndarray objects, but rather a (convenient, in my opinion) single np.ndarray().
It is not clear whether this is acceptable to you, so I am adding proper code for this.
Besides that, the performance may differ because in some cases you are adding an extra step, while in others you may not need to create large objects (that could reside in different memory pages).
In the end, for smaller inputs (3 x 10), producing the list of np.ndarray()s is just an additional burden that adds significantly to the timing.
For larger inputs (3 x 1000 and above), the extra computation is no longer significant, but an approach involving comprehensions and avoiding the creation of a large numpy array can become as fast as (or even faster than) the methods that are fastest for smaller inputs.
Also, all the code I present works for arbitrary sizes of the tuples/list (as long as the inner tuples all have the same size, of course).
(EDIT: added a comment on the final results)
The tested methods are:
import numpy as np
def to_arrays_zip(items):
    return np.array(list(zip(*items)))

def to_arrays_transpose(items):
    return np.array(items).transpose()

def to_arrays_zip_split(items):
    return [arr for arr in np.array(list(zip(*items)))]

def to_arrays_transpose_split(items):
    return [arr for arr in np.array(items).transpose()]

def to_arrays_comprehension(items):
    return [np.array([items[i][j] for i in range(len(items))]) for j in range(len(items[0]))]

def to_arrays_comprehension2(items):
    return [np.array([item[j] for item in items]) for j in range(len(items[0]))]
(This is a convenient function to check that the results are the same.)
def test_equal(items1, items2):
    return all(np.all(x == y) for x, y in zip(items1, items2))
For small inputs:
N = 3
M = 10
ll = [tuple(range(N)) for _ in range(M)]
print(to_arrays_comprehension2(ll))
print('Returning `np.ndarray()`')
%timeit to_arrays_zip(ll)
# 2.82 µs ± 28 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit to_arrays_transpose(ll)
# 3.18 µs ± 30 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
print('Returning a list')
%timeit to_arrays_zip_split(ll)
# 3.71 µs ± 47 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit to_arrays_transpose_split(ll)
# 3.97 µs ± 42.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit to_arrays_comprehension(ll)
# 5.91 µs ± 96.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit to_arrays_comprehension2(ll)
# 5.14 µs ± 109 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Where the podium is:
1. to_arrays_zip_split() (the non-_split variant if you are OK with a single array)
2. to_arrays_transpose_split() (the non-_split variant if you are OK with a single array)
3. to_arrays_comprehension2()
For somewhat larger inputs:
N = 3
M = 1000
ll = [tuple(range(N)) for _ in range(M)]
print('Returning `np.ndarray()`')
%timeit to_arrays_zip(ll)
# 146 µs ± 2.3 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit to_arrays_transpose(ll)
# 222 µs ± 2.01 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
print('Returning a list')
%timeit to_arrays_zip_split(ll)
# 147 µs ± 1.68 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit to_arrays_transpose_split(ll)
# 221 µs ± 2.58 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit to_arrays_comprehension(ll)
# 261 µs ± 2.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit to_arrays_comprehension2(ll)
# 212 µs ± 1.68 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The podium becomes:
1. to_arrays_zip_split() (whether you use the _split or non-_split variant does not make much difference)
2. to_arrays_comprehension2()
3. to_arrays_transpose_split() (whether you use the _split or non-_split variant does not make much difference)
For even larger inputs:
N = 3
M = 1000000
ll = [tuple(range(N)) for _ in range(M)]
print('Returning `np.ndarray()`')
%timeit to_arrays_zip(ll)
# 215 ms ± 4.27 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit to_arrays_transpose(ll)
# 220 ms ± 4.62 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
print('Returning a list')
%timeit to_arrays_zip_split(ll)
# 218 ms ± 6.21 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit to_arrays_transpose_split(ll)
# 222 ms ± 3.48 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit to_arrays_comprehension(ll)
# 248 ms ± 3.55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit to_arrays_comprehension2(ll)
# 186 ms ± 481 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
The podium becomes:
1. to_arrays_comprehension2()
2. to_arrays_zip_split() (whether you use the _split or non-_split variant does not make much difference)
3. to_arrays_transpose_split() (whether you use the _split or non-_split variant does not make much difference)
and the _zip and _transpose variants are pretty close to each other.
(I also tried to speed things up with Numba, but that didn't go well.)
I've found that len(arr) is almost twice as fast as arr.shape[0] and am wondering why.
I am using Python 3.5.2, Numpy 1.14.2, IPython 6.3.1
The below code demonstrates this:
arr = np.random.randint(1, 11, size=(3, 4, 5))
%timeit len(arr)
# 62.6 ns ± 0.239 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit arr.shape[0]
# 102 ns ± 0.163 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
I've also done some more tests for comparison:
class Foo():
    def __init__(self):
        self.shape = (3, 4, 5)

foo = Foo()
%timeit arr.shape
# 75.6 ns ± 0.107 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit foo.shape
# 61.2 ns ± 0.281 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit foo.shape[0]
# 78.6 ns ± 1.03 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
So I have two questions:
1) Why does len(arr) work faster than arr.shape[0]? (I would have thought len would be slower because of the function call)
2) Why does foo.shape[0] work faster than arr.shape[0]? (In other words, what overhead do numpy arrays incur in this case?)
The numpy array data structure is implemented in C. The dimensions of the array are stored in a C structure. They are not stored in a Python tuple. So each time you read the shape attribute, a new Python tuple of new Python integer objects is created. When you use arr.shape[0], that tuple is then indexed to pull out the first element, which adds a little more overhead. len(arr) only has to create a Python integer.
You can easily verify that arr.shape creates a new tuple each time it is read:
In [126]: arr = np.random.randint(1, 11, size=(3, 4, 5))
In [127]: s1 = arr.shape
In [128]: id(s1)
Out[128]: 4916019848
In [129]: s2 = arr.shape
In [130]: id(s2)
Out[130]: 4909905024
s1 and s2 have different ids; they are different tuple objects.
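The same check can be written with the is operator, alongside the plain Python class from the question for contrast. A small sketch (the tuple-creation behaviour is as described above for NumPy arrays with at least one dimension):

```python
import numpy as np

arr = np.random.randint(1, 11, size=(3, 4, 5))

class Foo:
    def __init__(self):
        self.shape = (3, 4, 5)

foo = Foo()

s1, s2 = arr.shape, arr.shape   # two reads of the ndarray attribute
t1, t2 = foo.shape, foo.shape   # two reads of a stored Python tuple

print(s1 == s2, s1 is s2)   # True False: equal values, but a new tuple per read
print(t1 is t2)             # True: the same stored tuple is returned each time
print(len(arr) == arr.shape[0])   # True: both give the first dimension
```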