I am trying to implement code that retrieves the millionth Fibonacci number or beyond. I am using matrix exponentiation with Numpy for faster calculation.
According to my understanding it should take O(log N) time, and the worst case for a million should be nearly 6 seconds, which should be alright.
Following is my implementation:
def fib(n):
    import numpy as np
    matrix = np.matrix([[1, 1], [1, 0]]) ** abs(n)
    if n % 2 == 0 and n < 0:
        return -matrix[0, 1]
    return matrix[0, 1]
However, let alone 1 million, it is not even generating the correct result for 1000.
Actual response:
43466557686937456435688527675040625802564660517371780402481729089536555417949051890403879840079255169295922593080322634775209689623239873322471161642996440906533187938298969649928516003704476137795166849228875
My Response:
817770325994397771
Why is Python truncating the results? From the docs it should be capable of calculating even values like 10**1000. Where did I go wrong?
Numpy handles numbers and calculations in a high-performance way (both memory efficiency and computing time) by using fixed-size machine integers. So while Python itself can process arbitrarily large integers, Numpy can't. You can let Python do the calculation and get the exact result, in exchange for some performance.
Sample code:
import numpy as np
def fib(n):
    # the difference is dtype=object: it lets Python do the arithmetic
    matrix = np.matrix([[1, 1], [1, 0]], dtype=object) ** abs(n)
    if n % 2 == 0 and n < 0:
        return -matrix[0, 1]
    return matrix[0, 1]
print(fib(1000))
Output:
43466557686937456435688527675040625802564660517371780402481729089536555417949051890403879840079255169295922593080322634775209689623239873322471161642996440906533187938298969649928516003704476137795166849228875
PS: Warning
The millionth Fibonacci number is extremely large; you should make sure that Python can handle it. If not, you will have to implement or find a module to handle these large numbers.
I'm not convinced that numpy is much help here, as it does not directly support Python's arbitrary-precision integers in vectorized operations. A basic Python implementation of an O(log N) algorithm gets the 1 millionth Fibonacci number in 0.15 sec on my laptop. An iterative (slow) approach gets it in 12 seconds:
# Nth Fibonacci number (linear iteration) O(N) time (N >= 0)
def slowFibo(N):
    a, b = 0, 1
    for _ in range(N):
        a, b = b, a + b
    return a
# Nth Fibonacci number via exponentiation by squaring, O(log(N)) time (N >= 0)
def fastFibo(N):
    a, b = 1, 1
    f0, f1 = 0, 1
    r, s = (1, 1) if N & 1 else (0, 1)
    N //= 2
    while N > 0:
        a, b = f0*a + f1*b, f0*b + f1*(a + b)
        f0, f1 = b - a, a
        if N & 1:
            r, s = f0*r + f1*s, f0*s + f1*(r + s)
        N //= 2
    return r
output:
f1K = slowFibo(1000) # 0.00009 sec
f1K = fib(1000) # 0.00011 sec (tandat's)
f1K = fastFibo(1000) # 0.00002 sec
43466557686937456435688527675040625802564660517371780402481729089536555417949051890403879840079255169295922593080322634775209689623239873322471161642996440906533187938298969649928516003704476137795166849228875
f1M = slowFibo(1_000_000) # 12.52 sec
f1M = fib(1_000_000) # 0.2769 sec (tandat's)
f1M = fastFibo(1_000_000) # 0.14718 sec
19532821287077577316...68996526838242546875
len(str(f1M)) # 208988 digits
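As a sanity check, the object-dtype matrix approach and a plain iterative loop can be compared directly. This is a minimal standalone sketch (the helper names fib_matrix and fib_iter are mine):

```python
import numpy as np

def fib_matrix(n):
    # dtype=object keeps Python ints, so no int64 overflow
    m = np.linalg.matrix_power(np.array([[1, 1], [1, 0]], dtype=object), n)
    return m[0, 1]  # M**n = [[F(n+1), F(n)], [F(n), F(n-1)]]

def fib_iter(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a  # F(n)

# both methods agree for every n tested
for n in (1, 10, 100, 1000):
    assert fib_matrix(n) == fib_iter(n)
print(fib_matrix(100))  # 354224848179261915075
```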
The core of your function is
np.matrix([[1, 1], [1, 0]]) ** abs(n)
which is discussed in the Wiki article
np.matrix implements ** as __pow__, which in turn uses np.linalg.matrix_power. Essentially that's a repeated dot matrix multiplication, with a modest enhancement of grouping the products by powers of 2.
In [319]: M=np.matrix([[1, 1], [1, 0]])
In [320]: M**10
Out[320]:
matrix([[89, 55],
[55, 34]])
The use of np.matrix is discouraged, so I can do the same with
In [321]: A = np.array(M)
In [322]: A
Out[322]:
array([[1, 1],
[1, 0]])
In [323]: np.linalg.matrix_power(A,10)
Out[323]:
array([[89, 55],
[55, 34]])
Using the (newish) @ matrix multiplication operator, that's the same as:
In [324]: A@A@A@A@A@A@A@A@A@A
Out[324]:
array([[89, 55],
[55, 34]])
matrix_power does something more like:
In [325]: A2=A@A; A4=A2@A2; A8=A4@A4; A8@A2
Out[325]:
array([[89, 55],
[55, 34]])
And some comparative times:
In [326]: timeit np.linalg.matrix_power(A,10)
16.2 µs ± 58.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [327]: timeit M**10
33.5 µs ± 38.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [328]: timeit A@A@A@A@A@A@A@A@A@A
25.6 µs ± 914 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [329]: timeit A2=A@A; A4=A2@A2; A8=A4@A4; A8@A2
10.2 µs ± 97.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
numpy integers are implemented as int64, in C, so they are limited in size. Thus we get overflow with a modest power of 100:
In [330]: np.linalg.matrix_power(A,100)
Out[330]:
array([[ 1298777728820984005, 3736710778780434371],
[ 3736710778780434371, -2437933049959450366]])
We can get around this by changing the dtype to object. The values are then Python ints, and can grow indefinitely:
In [331]: Ao = A.astype(object)
In [332]: Ao
Out[332]:
array([[1, 1],
[1, 0]], dtype=object)
Fortunately matrix_power can cleanly handle object dtype:
In [333]: np.linalg.matrix_power(Ao,100)
Out[333]:
array([[573147844013817084101, 354224848179261915075],
[354224848179261915075, 218922995834555169026]], dtype=object)
Usually math on object dtype is slower, but not in this case:
In [334]: timeit np.linalg.matrix_power(Ao,10)
14.9 µs ± 198 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
I'm guessing it's because of the small (2,2) size of the array, where fast compiled methods aren't useful. This is basically an iterative task, where numpy doesn't have any advantages.
Scaling isn't bad: increase n by a factor of 10, and the time only grows 3-4x.
In [337]: np.linalg.matrix_power(Ao,1000)
Out[337]:
array([[70330367711422815821835254877183549770181269836358732742604905087154537118196933579742249494562611733487750449241765991088186363265450223647106012053374121273867339111198139373125598767690091902245245323403501,
43466557686937456435688527675040625802564660517371780402481729089536555417949051890403879840079255169295922593080322634775209689623239873322471161642996440906533187938298969649928516003704476137795166849228875],
[43466557686937456435688527675040625802564660517371780402481729089536555417949051890403879840079255169295922593080322634775209689623239873322471161642996440906533187938298969649928516003704476137795166849228875,
26863810024485359386146727202142923967616609318986952340123175997617981700247881689338369654483356564191827856161443356312976673642210350324634850410377680367334151172899169723197082763985615764450078474174626]],
dtype=object)
In [338]: timeit np.linalg.matrix_power(Ao,1000)
53.8 µs ± 83 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
With object dtype np.matrix:
In [340]: Mo = M.astype(object)
In [344]: timeit Mo**1000
86.1 µs ± 164 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
And for a million, the times aren't as bad as I anticipated:
In [352]: timeit np.linalg.matrix_power(Ao,1_000_000)
423 ms ± 1.92 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
For comparison, the fastFibo times on my machine are:
In [354]: fastFibo(100)
Out[354]: 354224848179261915075
In [355]: timeit fastFibo(100)
3.91 µs ± 154 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [356]: timeit fastFibo(1000)
9.37 µs ± 23.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [357]: timeit fastFibo(1_000_000)
226 ms ± 12.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
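For completeness, the same O(log N) behaviour is available in pure Python via the standard fast-doubling identities F(2k) = F(k)(2F(k+1) - F(k)) and F(2k+1) = F(k)^2 + F(k+1)^2. A minimal sketch (mine, independent of the fastFibo above):

```python
def fib_doubling(n):
    """Return (F(n), F(n+1)) by fast doubling: O(log n) big-int multiplies."""
    if n == 0:
        return (0, 1)
    a, b = fib_doubling(n // 2)   # a = F(k), b = F(k+1), with k = n // 2
    c = a * (2 * b - a)           # F(2k)
    d = a * a + b * b             # F(2k+1)
    return (d, c + d) if n & 1 else (c, d)

print(fib_doubling(100)[0])  # 354224848179261915075
```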
Related
I want to modify block elements of 3d array without for loop. Without loop because it is the bottleneck of my code.
To illustrate what I want, I drew a figure:
The code with for loop:
import numpy as np
# Create 3d array with 2x4x4 elements
a = np.arange(2*4*4).reshape(2,4,4)
b = np.zeros(np.shape(a))
# Change Block Elements
for it1 in range(2):
    b[it1] = np.block([[a[it1, 0:2, 0:2], a[it1, 2:4, 0:2]], [a[it1, 0:2, 2:4], a[it1, 2:4, 2:4]]])
First let's see if there's a way to do what you want for a 2D array using only indexing, reshape, and transpose operations. If there is, then there's a good chance that you can extend it to a larger number of dimensions.
x = np.arange(2 * 3 * 2 * 5).reshape(2 * 3, 2 * 5)
Clearly you can reshape this into an array that has the blocks along a separate dimension:
x.reshape(2, 3, 2, 5)
Then you can transpose the resulting blocks:
x.reshape(2, 3, 2, 5).transpose(2, 1, 0, 3)
So far, none of the data has been copied. To make the copy happen, reshape back into the original shape:
x.reshape(2, 3, 2, 5).transpose(2, 1, 0, 3).reshape(2 * 3, 2 * 5)
Adding additional leading dimensions is as simple as increasing the number of the dimensions you want to swap:
b = a.reshape(a.shape[0], 2, a.shape[1] // 2, 2, a.shape[2] // 2).transpose(0, 3, 2, 1, 4).reshape(a.shape)
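To confirm that the one-liner really reproduces the loop-based np.block version from the question, a quick equality check (a small sketch):

```python
import numpy as np

a = np.arange(2 * 4 * 4).reshape(2, 4, 4)

# loop-based reference from the question
ref = np.zeros(a.shape, dtype=a.dtype)
for it1 in range(2):
    ref[it1] = np.block([[a[it1, 0:2, 0:2], a[it1, 2:4, 0:2]],
                         [a[it1, 0:2, 2:4], a[it1, 2:4, 2:4]]])

# reshape/transpose one-liner
b = (a.reshape(a.shape[0], 2, a.shape[1] // 2, 2, a.shape[2] // 2)
      .transpose(0, 3, 2, 1, 4)
      .reshape(a.shape))

assert np.array_equal(ref, b)
```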
Here is a quick benchmark of the other implementations with your original array:
a = np.arange(2*4*4).reshape(2,4,4)
%%timeit
b = np.zeros(np.shape(a))
for it1 in range(2):
b[it1] = np.block([[a[it1, 0:2, 0:2], a[it1, 2:4, 0:2]], [a[it1, 0:2, 2:4], a[it1, 2:4, 2:4]]])
27.7 µs ± 107 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
b = a.copy()
b[:,0:2,2:4], b[:,2:4,0:2] = b[:,2:4,0:2].copy(), b[:,0:2,2:4].copy()
2.22 µs ± 3.89 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit b = np.block([[a[:,0:2,0:2], a[:,2:4,0:2]],[a[:,0:2,2:4], a[:,2:4,2:4]]])
13.6 µs ± 217 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit b = a.reshape(a.shape[0], 2, a.shape[1] // 2, 2, a.shape[2] // 2).transpose(0, 3, 2, 1, 4).reshape(a.shape)
1.27 µs ± 14.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
For small arrays, the differences can sometimes be attributed to overhead. Here is a more meaningful comparison with arrays of size 10x1000x1000, split into 10 500x500 blocks:
a = np.arange(10*1000*1000).reshape(10, 1000, 1000)
%%timeit
b = np.zeros(np.shape(a))
for it1 in range(10):
b[it1]= np.block([[a[it1,0:500,0:500], a[it1,500:1000,0:500]],[a[it1,0:500,500:1000], a[it1,500:1000,500:1000]]])
58 ms ± 904 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
b = a.copy()
b[:,0:500,500:1000], b[:,500:1000,0:500] = b[:,500:1000,0:500].copy(), b[:,0:500,500:1000].copy()
41.2 ms ± 688 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit b = np.block([[a[:,0:500,0:500], a[:,500:1000,0:500]],[a[:,0:500,500:1000], a[:,500:1000,500:1000]]])
27.5 ms ± 569 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit b = a.reshape(a.shape[0], 2, a.shape[1] // 2, 2, a.shape[2] // 2).transpose(0, 3, 2, 1, 4).reshape(a.shape)
20 ms ± 161 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
So it seems that using numpy's own reshaping and transposition mechanism is fastest on my computer. Also, notice that the overhead of np.block becomes less important than copying the temporary arrays as size gets bigger, so the other two implementations change places.
You can directly replace the it1 by a slice of the whole dimension:
b = np.block([[a[:,0:2,0:2], a[:,2:4,0:2]],[a[:,0:2,2:4], a[:,2:4,2:4]]])
Will it make it faster?
import numpy as np
a = np.arange(2*4*4).reshape(2,4,4)
b = a.copy()
b[:,0:2,2:4], b[:,2:4,0:2] = b[:,2:4,0:2].copy(), b[:,0:2,2:4].copy()
Comparison with np.block() alternative from another answer.
Option 1:
%timeit b = a.copy(); b[:,0:2,2:4], b[:,2:4,0:2] = b[:,2:4,0:2].copy(), b[:,0:2,2:4].copy()
Output:
5.44 µs ± 134 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
Option 2
%timeit b = np.block([[a[:,0:2,0:2], a[:,2:4,0:2]],[a[:,0:2,2:4], a[:,2:4,2:4]]])
Output:
30.6 µs ± 1.75 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
I have an odd case where I can see numpy.einsum speeding up a computation but can't see the same in einsum_path. I'd like to quantify/explain this possible speed-up but am missing something somewhere...
In short, I have a matrix multiplication where only the diagonal of the final product is needed.
a = np.arange(9).reshape(3,3)
print('input array')
print(a)
print('normal method')
print(np.diag(a.dot(a)))
print('einsum method')
print(np.einsum('ij,ji->i', a, a))
which produces the output:
input array
[[0 1 2]
[3 4 5]
[6 7 8]]
normal method
[ 15 54 111]
einsum method
[ 15 54 111]
When running on a large matrix, numpy.einsum is substantially faster.
A = np.random.randn(2000, 300)
B = np.random.randn(300, 2000)
print('normal method')
%timeit np.diag(A.dot(B))
print('einsum method')
%timeit np.einsum('ij,ji->i', A, B)
which produces:
normal method
17.2 ms ± 131 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
einsum method
1.02 ms ± 7.82 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
My intuition is that this speed up is possible as numpy.einsum is able to drop computations that would eventually be dropped by taking the diagonal - but, if I'm reading it correctly, the output of numpy.einsum_path is showing no speed up at all.
print(np.einsum_path('ij,ji->i',A,B,optimize=True)[1])
Complete contraction: ij,ji->i
Naive scaling: 2
Optimized scaling: 2
Naive FLOP count: 1.200e+06
Optimized FLOP count: 1.200e+06
Theoretical speedup: 1.000
Largest intermediate: 2.000e+03 elements
--------------------------------------------------------------------------
scaling current remaining
--------------------------------------------------------------------------
2 ji,ij->i i->i
Questions:
Why can I see a practical speed-up that isn't reflected in the computational path?
Is there a way to quantify the speed up by ij,ji->i path in numpy.einsum?
That path just looks at alternative orders when working with more than 2 arguments. With just 2 arguments, that analysis does nothing. Your diag(dot):
In [113]: np.diag(a.dot(a))
Out[113]: array([ 15, 54, 111])
The equivalent using einsum is:
In [115]: np.einsum('ii->i',np.einsum('ij,jk->ik',a,a))
Out[115]: array([ 15, 54, 111])
But we can skip the intermediate step:
In [116]: np.einsum('ij,ji->i',a,a)
Out[116]: array([ 15, 54, 111])
The indexed notation is flexible enough that it doesn't need to go through the full dot calculation.
Another way to get the same result is:
In [117]: (a*a.T).sum(axis=1)
Out[117]: array([ 15, 54, 111])
With matmul we can do the calculation without the diag, treating the first dimension as a 'batch'. But it requires some reshaping first:
In [121]: a[:,None,:]@a.T[:,:,None]
Out[121]:
array([[[ 15]],
[[ 54]],
[[111]]])
In [122]: np.squeeze(a[:,None,:]@a.T[:,:,None])
Out[122]: array([ 15, 54, 111])
My times
normal method
135 ms ± 3.87 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
einsum method
3.09 ms ± 46.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [130]: timeit (A*B.T).sum(axis=0)
8.06 ms ± 78.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [131]: timeit np.squeeze(A[:,None,:]@B.T[:,:,None])
3.52 ms ± 195 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
dot creates a (2000,2000) array, and then extracts 2000 elements.
In terms of element wise multiplication, the dot is:
In [136]: (a[:,:,None]*a[None,:,:]).sum(axis=1)
Out[136]:
array([[ 15, 18, 21],
[ 42, 54, 66],
[ 69, 90, 111]])
With A and B, the intermediate product would be (2000,300,2000), which is summed down to (2000,2000). The einsum does (effectively) 2000 calculations of size (300,300) reduced to (1,).
The einsum is closer to this calculation than the diag/dot, treating the size 2000 dimension as a 'batch' for 1d dot calculations:
In [140]: timeit np.array([A[i,:].dot(B[:,i]) for i in range(2000)])
9.46 ms ± 270 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Besides the main aim of the question, if performance is important, numba will be very fast and can be used in a parallel scheme, limiting the calculations to the diagonal entries only:
import numpy as np
import numba as nb
@nb.njit(parallel=True)  # optionally fastmath=True
def diag_dot(a, b):
    res = np.zeros(a.shape[0])
    for i in nb.prange(a.shape[0]):
        for j in range(a.shape[1]):
            res[i] += a[i, j] * b[j, i]
    return res
This took around half the time of np.einsum for A and B in my tests.
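If numba isn't available, the same diagonal-only computation can be expressed, and the identity checked, with plain numpy. A sketch with a made-up helper name diag_dot_ref:

```python
import numpy as np

def diag_dot_ref(a, b):
    # diagonal of a @ b without forming the full (n, n) product
    return np.array([a[i, :] @ b[:, i] for i in range(a.shape[0])])

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 30))
B = rng.standard_normal((30, 50))

# all three formulations agree
assert np.allclose(diag_dot_ref(A, B), np.einsum('ij,ji->i', A, B))
assert np.allclose(diag_dot_ref(A, B), np.diag(A @ B))
```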
I'm currently working with generators and factorials in python.
As an example:
itertools.permutations(range(100))
Meaning, I receive a generator object that represents 100! values.
In reality, this code does look a bit more complicated; I'm using a list of sublists instead of range(100), with the goal to find a combination of those sublists meeting my conditions.
This is the code:
mylist = [[0, 0, 1], ..., [5, 7, 3]]  # random numbers
x = True in (combination for combination in itertools.permutations(mylist)
             if compare(combination))
# compare() returns True for one or a few combinations in that generator
I realized this is very time-consuming. Is there a more efficient way to do this and, moreover, a way to estimate how much time it is going to take?
I've done a few %timeit using ipython:
%timeit (combination for combination in itertools.permutations(mylist) if compare(combination))
--> 697 ns
%timeit (combination for combination in itertools.permutations(range(100)) if compare(combination))
--> 572 ns
Note: I do understand that only the generator object is created here; the values are produced lazily, when the generator is consumed.
I've seen a lot of tutorials explaining how generators do work, but I've found nothing about the execution time.
Moreover, I don't need an exact value such as one measured with the time module inside my program; I need a rough estimate before execution.
Edit:
I've also tested this for a smaller amount of values, for a list containing 24 sublists, 10 sublists and 5 sublists. Doing this, I receive an instant output.
This means the program does work; it is just a matter of time.
My problem, stated more clearly: how much time is this going to take, and is there a less time-consuming way to do it?
A comparison of generators, generator expression, lists and list comprehensions:
In [182]: range(5)
Out[182]: range(0, 5)
In [183]: list(range(5))
Out[183]: [0, 1, 2, 3, 4]
In [184]: (x for x in range(5))
Out[184]: <generator object <genexpr> at 0x7fc18cd88a98>
In [186]: [x for x in range(5)]
Out[186]: [0, 1, 2, 3, 4]
Some timings:
In [187]: timeit range(1000)
248 ns ± 2.79 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [188]: timeit (x for x in range(1000))
802 ns ± 6.97 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [189]: timeit [x for x in range(1000)]
43.4 µs ± 27.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [190]: timeit list(range(1000))
23.6 µs ± 1.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
The time for setting up a generator is (practically) independent of the parameter. Populating a list scales roughly with the size.
In [193]: timeit range(100000)
252 ns ± 1.57 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [194]: timeit list(range(100000))
4.41 ms ± 103 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
edit
Timings show that an in test on a generator is somewhat faster than on a list, but it still scales with the length:
In [264]: timeit True in (True for x in itertools.permutations(range(15),2) if x==(14,4))
17.1 µs ± 17.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [265]: timeit list (True for x in itertools.permutations(range(15),2) if x==(14,4))
18.5 µs ± 158 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [266]: timeit (14,4) in itertools.permutations(range(15),2)
8.85 µs ± 8.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [267]: timeit list(itertools.permutations(range(15),2))
11.3 µs ± 21.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
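Note that the question's True in (combination for ...) compares each yielded combination against the value True, which is rarely what is intended; the usual idiom is any(), which also short-circuits at the first match. A small sketch with stand-in data and a hypothetical compare():

```python
import itertools

mylist = [(0, 0, 1), (5, 7, 3), (2, 2, 2)]  # stand-in data

def compare(combination):
    # hypothetical predicate: first element of the permutation is (5, 7, 3)
    return combination[0] == (5, 7, 3)

# any() stops consuming the generator as soon as compare() returns True
found = any(compare(c) for c in itertools.permutations(mylist))
print(found)  # True
```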
With numpy, what's the fastest way to generate an array from -n to n, excluding 0, where n is an integer?
Here is one solution, but I am not sure it is the fastest:
n = 100000
np.concatenate((np.arange(-n, 0), np.arange(1, n+1)))
An alternative approach is to create the range -n to n-1, then add 1 to the elements from zero onward.
def non_zero_range(n):
    # the 2nd argument to np.arange is exclusive, so it should be n, not n-1
    a = np.arange(-n, n)
    a[n:] += 1
    return a
n=1000000
%timeit np.concatenate((np.arange(-n,0), np.arange(1,n+1)))
# 4.28 ms ± 9.46 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit non_zero_range(n)
# 2.84 ms ± 13.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
I think the reduced run time is due to creating only one array, not three as in the concatenate approach.
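A quick way to confirm the two constructions agree (a minimal sketch):

```python
import numpy as np

def non_zero_range(n):
    a = np.arange(-n, n)  # values -n .. n-1; index n holds the value 0
    a[n:] += 1            # shift 0 .. n-1 up to 1 .. n
    return a

n = 1000
expected = np.concatenate((np.arange(-n, 0), np.arange(1, n + 1)))
assert np.array_equal(non_zero_range(n), expected)
assert 0 not in non_zero_range(n)
```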
Edit
Thanks, everyone. I edited my post and updated new test time.
Interesting problem.
Experiment
I did it in my jupyter-notebook. All of them used numpy API. You can conduct the experiment of the following code by yourself.
About time measurement in jupyter-notebook, please see: Simple way to measure cell execution time in ipython notebook
Original np.concatenate
%%timeit
n = 100000
t = np.concatenate((np.arange(-n, 0), np.arange(1, n+1)))
#175 µs ± 2.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Sol 1. np.delete
%%timeit
n = 100000
a = np.arange(-n, n+1)
b = np.delete(a, n)
# 179 µs ± 5.66 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Sol 2. List comprehension + np.array
%%timeit
n = 100000
c = np.array([x for x in range(-n, n+1) if x != 0])
# 16.6 ms ± 693 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Conclusion
There's no big difference between the original and solution 1, but solution 2 is the worst of the three. I'm looking for faster solutions, too.
Reference
For those who are:
interested in initializing and filling a numpy array:
Best way to initialize and fill an numpy array?
confused about is vs ==:
The Difference Between “is” and “==” in Python
I've found that len(arr) is almost twice as fast as arr.shape[0] and am wondering why.
I am using Python 3.5.2, Numpy 1.14.2, IPython 6.3.1
The below code demonstrates this:
arr = np.random.randint(1, 11, size=(3, 4, 5))
%timeit len(arr)
# 62.6 ns ± 0.239 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit arr.shape[0]
# 102 ns ± 0.163 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
I've also done some more tests for comparison:
class Foo():
    def __init__(self):
        self.shape = (3, 4, 5)

foo = Foo()
%timeit arr.shape
# 75.6 ns ± 0.107 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit foo.shape
# 61.2 ns ± 0.281 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit foo.shape[0]
# 78.6 ns ± 1.03 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
So I have two questions:
1) Why does len(arr) work faster than arr.shape[0]? (I would have thought len would be slower because of the function call.)
2) Why does foo.shape[0] work faster than arr.shape[0]? (In other words, what overhead do numpy arrays incur in this case?)
The numpy array data structure is implemented in C. The dimensions of the array are stored in a C structure. They are not stored in a Python tuple. So each time you read the shape attribute, a new Python tuple of new Python integer objects is created. When you use arr.shape[0], that tuple is then indexed to pull out the first element, which adds a little more overhead. len(arr) only has to create a Python integer.
You can easily verify that arr.shape creates a new tuple each time it is read:
In [126]: arr = np.random.randint(1, 11, size=(3, 4, 5))
In [127]: s1 = arr.shape
In [128]: id(s1)
Out[128]: 4916019848
In [129]: s2 = arr.shape
In [130]: id(s2)
Out[130]: 4909905024
s1 and s2 have different ids; they are different tuple objects.
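A practical consequence: since every read of .shape builds a fresh tuple, code that touches the shape inside a hot loop can hoist the attribute access once. A small sketch:

```python
import numpy as np

arr = np.random.randint(1, 11, size=(3, 4, 5))

# each read of .shape builds a new tuple object
s1, s2 = arr.shape, arr.shape
assert s1 == s2 and s1 is not s2

# hoist the attribute access once, outside the loop
n_rows = arr.shape[0]
total = 0
for i in range(n_rows):
    total += int(arr[i].sum())
assert total == int(arr.sum())
```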