How do I maximize efficiency with numpy arrays? - python

I am just getting to know numpy, and I am impressed by its claims of C-like efficiency with memory access in its ndarrays. I wanted to see the differences between these and plain Python lists for myself, so I ran a quick timing test, performing a few of the same simple tasks both with and without numpy. Numpy outclassed regular lists by an order of magnitude in the allocation of and arithmetic operations on arrays, as expected. But this segment of code, identical in both tests, took about 1/8 of a second with a regular list, and slightly over 2.5 seconds with numpy:
file = open('timing.log','w')
for num in a2:
    if num % 1000 == 0:
        file.write("Multiple of 1000!\r\n")
file.close()
Does anyone know why this might be, and if there is some other syntax I should be using for operations like this to take better advantage of what the ndarray can do?
Thanks...
EDIT: To answer Wayne's comment... I timed them both repeatedly and in different orders and got pretty much identical results each time, so I doubt it's another process. I put start = time() at the top of the file after the numpy import and then I have statements like print 'Time after traversal:\t',(time() - start) throughout.

a2 is a NumPy array, right? One possible reason it might be taking so long in NumPy (if other processes' activity doesn't account for it, as Wayne Werner suggested) is that you're iterating over the array using a Python loop. At every step of the iteration, Python has to fetch a single value out of the NumPy array and convert it to a Python integer, which is not a particularly fast operation.
NumPy works much better when you are able to perform operations on the whole array as a unit. In your case, one option (maybe not even the fastest) would be
file.write("Multiple of 1000!\r\n" * (a2 % 1000 == 0).sum())
Try comparing that to the pure-Python equivalent,
file.write("Multiple of 1000!\r\n" * sum(filter(lambda i: i % 1000 == 0, a2)))
or
file.write("Multiple of 1000!\r\n" * sum(1 for i in a2 if i % 1000 == 0))

I'm not surprised that NumPy does poorly w/r/t Python built-ins when using your snippet. A large fraction of the performance benefit in NumPy arises from avoiding the loops and instead accessing the array by indexing:
In NumPy, it's more common to do something like this:
import numpy as NP
A = NP.random.randint(10, 100, 100).reshape(10, 10)
w = A[A % 2 == 0]
NP.save("test_file.npy", w)

Per-element access is very slow for numpy arrays. Use vector operations:
$ python -mtimeit -s 'import numpy as np; a2=np.arange(10**6)' '
> sum(1 for i in a2 if i % 1000 == 0)'
10 loops, best of 3: 1.53 sec per loop
$ python -mtimeit -s 'import numpy as np; a2=np.arange(10**6)' '
> (a2 % 1000 == 0).sum()'
10 loops, best of 3: 22.6 msec per loop
$ python -mtimeit -s 'import numpy as np; a2= range(10**6)' '
> sum(1 for i in a2 if i % 1000 == 0)'
10 loops, best of 3: 90.9 msec per loop

Related

Numpy: get array where index greater than value and condition is true

I have the following array:
a = np.array([6,5,4,3,4,5,6])
Now I want to get all elements which are greater than 4 but also have an index value greater than 2.
The way that I have found to do that was the following:
a[2:][a[2:]>4]
Is there a better or more readable way to accomplish this?
UPDATE:
This is a simplified version. In reality the indexing is done with arithmetic operation over several variables like this:
a[len(trainPredict)+(look_back*2)+1:][a[len(trainPredict)+(look_back*2)+1:]>4]
trainPredict is a numpy array, look_back an integer.
I wanted to see if there is an established way or how others do that.
If you're worried about the complexity of the slice and/or the number of conditions, you can always separate them:
a = np.array([6,5,4,3,4,5,6])
a_slice = a[2:]
cond_1 = a_slice > 4
res = a_slice[cond_1]
Is your example very simplified? There might be better solutions for more complex manipulations.
@AlexanderCécile's answer is not only more legible than the one-liner you posted, but it also removes the redundant computation of a temporary array. Despite that, it does not appear to be any faster than your original approach.
The timings below are all run with a preliminary setup of
import numpy as np
np.random.seed(0xDEADBEEF)
a = np.random.randint(8, size=N)
N varies from 1e3 to 1e8 in factors of 10. I tried four variants of the code:
CodePope: result = a[2:][a[2:] > 4]
AlexanderCécile: s = a[2:]; result = s[s > 4]
MadPhysicist1: result = a[np.flatnonzero(a[2:] > 4) + 2]
MadPhysicist2: result = a[(a > 4) & (np.arange(a.size) >= 2)]
In all cases, the timing was obtained on the command line by running
python -m timeit -s 'import numpy as np; np.random.seed(0xDEADBEEF); a = np.random.randint(8, size=N)' '<X>'
Here, N was a power of 10 between 3 and 8, and <X> one of the expressions above. Timings are as follows:
Methods #1 and #2 are virtually indistinguishable. What is surprising is that in the range between ~5e3 and ~1e6 elements, method #3 seems to be slightly, but noticeably faster. I would not normally expect that from fancy indexing. Method #4 is of course going to be the slowest.
Here is the data, for completeness:
N           CodePope    AlexanderCécile    MadPhysicist1    MadPhysicist2
1000        3.77e-06    3.69e-06           5.48e-06         6.52e-06
10000       4.6e-05     4.59e-05           3.97e-05         5.93e-05
100000      0.000484    0.000483           0.0004           0.000592
1000000     0.00513     0.00515            0.00503          0.00675
10000000    0.0529      0.0525             0.0617           0.102
100000000   0.657       0.658              0.782            1.09
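If you would rather rerun the comparison from inside Python than from the shell, here is a rough sketch with the timeit module (the array size and repeat counts are arbitrary choices, not the ones used above):
import timeit
setup = "import numpy as np; np.random.seed(0xDEADBEEF); a = np.random.randint(8, size=10**6)"
variants = {
    "CodePope": "a[2:][a[2:] > 4]",
    "AlexanderCecile": "s = a[2:]; r = s[s > 4]",
    "MadPhysicist1": "a[np.flatnonzero(a[2:] > 4) + 2]",
    "MadPhysicist2": "a[(a > 4) & (np.arange(a.size) >= 2)]",
}
for name, stmt in variants.items():
    # best of 3 repeats, 100 calls each, reported as seconds per call
    best = min(timeit.repeat(stmt, setup, repeat=3, number=100)) / 100
    print("%-16s %.3e s" % (name, best))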

pandas.Series.div() vs /=

I'm curious why pandas.Series.div() is slower than /= when applying to a pandas Series of numbers. For example:
python3 -m timeit -s 'import pandas as pd; ser = pd.Series(list(range(99999)))' 'ser /= 7'
1000 loops, best of 3: 584 usec per loop
python3 -m timeit -s 'import pandas as pd; ser = pd.Series(list(range(99999)))' 'ser.div(7)'
1000 loops, best of 3: 746 usec per loop
I assume that it's because the former changes the series in place whereas the latter returns a new Series. But if that's the case, then why bother implementing div() and mul() at all if they're not as fast as /= and *=?
Even if you don't want to change the series in place, ser / 7 is still faster than .div():
python3 -m timeit -s 'import pandas as pd; ser = pd.Series(list(range(99999)))' 'ser / 7'
1000 loops, best of 3: 656 usec per loop
So what is the use of pd.Series.div() and what about it makes it slower?
Pandas' .div() obviously implements division similarly to / and /=.
The main reason to have a separate .div() is that Pandas embraces a syntax model where operations on dataframes are described as the application of consecutive operations, e.g. .div(), .str, etc., which allows for simple chaining:
ser.div(7).apply(lambda x: 'text: ' + str(x)).str.upper()
as well as simpler support for multiple arguments (cf. .func(a, b, c), which could not be written with a binary operator).
By contrast, the same would have been written without div as:
(ser / 7).apply(lambda x: 'text: ' + str(x)).str.upper()
The / operation may be faster because there is less Python overhead associated with the / operator compared to .div().
The x /= y operator, in turn, replaces the construct x = x / y.
For vectorized containers based on NumPy (like Pandas), it goes a little beyond that: it uses an in-place operation instead of creating a (potentially time- and memory-consuming) copy of x. This is the reason why /= is faster than both / and .div().
Note that, while in most cases this is equivalent, sometimes (like in this case) it may still require conversion to a different data type, which is done automatically in Pandas (but not in NumPy).
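A small sketch to make the distinction concrete (the series is arbitrary; only the behaviour matters, not the values):
import pandas as pd
ser = pd.Series(range(99999))
out_div = ser.div(7)   # method form: returns a new Series, chains nicely
out_op = ser / 7       # operator form: also returns a new Series
ser /= 7               # augmented assignment: updates ser itself, avoiding a copy where possible
print(out_div.equals(out_op))   # True
print(out_div.equals(ser))      # also True; only the allocation strategy differs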

Problems in implementing Horner's method in Python

So I have written code for evaluating a polynomial using three different methods. Horner's method should be the fastest, while the naive method should be the slowest, right? But how come the times I measure are not what I expect? And the measured time sometimes turns out to be exactly the same for the itera and naive methods. What's wrong with it?
import numpy.random as npr
import time
def Horner(c,x):
    p=0
    for i in c[-1::-1]:
        p = p*x+i
    return p
def naive(c,x):
    n = len(c)
    p = 0
    for i in range(len(c)):
        p += c[i]*x**i
    return p
def itera(c,x):
    p = 0
    xi = 1
    for i in range(len(c)):
        p += c[i]*xi
        xi *= x
    return p
c=npr.uniform(size=(500,1))
x=-1.34
start_time=time.time()
print Horner(c,x)
print time.time()-start_time
start_time=time.time()
print itera(c,x)
print time.time()-start_time
start_time=time.time()
print naive(c,x)
print time.time()-start_time
here are some of the results:
[ 2.58646959e+69]
0.00699996948242
[ 2.58646959e+69]
0.00600004196167
[ 2.58646959e+69]
0.00600004196167
[ -3.30717922e+69]
0.00899982452393
[ -3.30717922e+69]
0.00600004196167
[ -3.30717922e+69]
0.00600004196167
[ -2.83469309e+69]
0.00999999046326
[ -2.83469309e+69]
0.00999999046326
[ -2.83469309e+69]
0.0120000839233
Your profiling can be much improved. Plus, we can make your code run 200-500x faster.
(1) Rinse and repeat
You can't run just one iteration of a performance test, for two reasons.
Your time resolution might not be good enough. This is why you sometimes got the same time for two implementations: the time for one run was near the resolution of your timing mechanism, so you recorded only one "tick".
There are all sorts of factors that affect performance. Your best bet for a meaningful comparison will be a lot of iterations.
You don't need gazillions of runs (though, of course, that doesn't hurt), but you should estimate and adjust the number of iterations until the variance is within a level acceptable for your purpose.
timeit is a nice little module for profiling Python code.
I added this to the bottom of your script.
import timeit
n = 1000
print 'Horner', timeit.timeit(
    number=n,
    setup='from __main__ import Horner, c, x',
    stmt='Horner(c,x)'
)
print 'naive', timeit.timeit(
    number=n,
    setup='from __main__ import naive, c, x',
    stmt='naive(c,x)',
)
print 'itera', timeit.timeit(
    number=n,
    setup='from __main__ import itera, c, x',
    stmt='itera(c,x)',
)
Which produces
Horner 1.8656351566314697
naive 2.2408010959625244
itera 1.9751169681549072
Horner is the fastest, but it's not exactly blowing the doors off the other two.
(2) Look at what is happening...very carefully
Python has operator overloading, so it's easy to miss seeing this.
npr.uniform(size=(500,1)) is giving you a 500 x 1 numpy structure of random numbers.
So what?
Well, c[i] isn't a number. It's a numpy array with one element. Numpy overloads the operators so you can do things like multiply an array by a scalar.
That's fine, but using an array for every element is a lot of overhead, so it's harder to see the difference between the algorithms.
Instead, let's try a simple Python list:
import random
c = [random.random() for _ in range(500)]
And now,
Horner 0.034661054611206055
naive 0.12771987915039062
itera 0.07331395149230957
Whoa! All the times just dropped (by 10-60x). Proportionally, the Horner implementation got even faster than the other two. We removed the overhead on all three, and can now see the "bare bones" difference.
Horner is 4x faster than naive and 2x faster than itera.
(3) Alternate runtimes
You're using Python 2. I assume 2.7.
Let's see how Python 3.4 fares. (Syntax adjustment: you'll need to put parentheses around the argument list to print.)
Horner 0.03298933599944576
naive 0.13706714100044337
itera 0.06771054599812487
About the same.
Let's try PyPy, a JIT implementation of Python. (The "normal" Python implementation is called CPython.)
Horner 0.006507158279418945
naive 0.07541298866271973
itera 0.005059003829956055
Nice! Each implementation is now running 2-5x faster. Horner is now 10x the speed of naive, but slightly slower than itera.
JIT runtimes are more difficult to profile than interpreters. Let's increase the number of iterations to 50000, and try it just to make sure.
Horner 0.12749004364013672
naive 3.2823100090026855
itera 0.06546688079833984
(Note that we have 50x the iterations, but only 20x the time...the JIT hadn't taken full effect for many of the first 1000 runs.) Same conclusions, but the differences are even more pronounced.
Granted, the idea of JIT is to profile, analyze, and rewrite the program at runtime, so if your goal is to compare algorithms, this is going to add a lot of non-obvious implementation detail.
Nonetheless, comparing runtimes can be useful in giving a broader perspective.
There are a few more things. For example, your naive implementation computes a variable it never uses. You use range instead of xrange. You could try iterating backwards with an index rather than a reverse slice. Etc.
None of these changed the results much for me, but they were worth considering.
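For what it's worth, here is a sketch of the "iterate backwards with an index" variant mentioned above (Python 2 style to match the rest of the post; the name Horner_idx is made up):
def Horner_idx(c, x):
    # walk from the highest-order coefficient down to c[0]
    p = 0
    for i in xrange(len(c) - 1, -1, -1):
        p = p * x + c[i]
    return p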
You cannot obtain accurate results by measuring things like that:
start_time=time.time()
print Horner(c,x)
print time.time()-start_time
Presumably most of the time is spent in the I/O performed by the print statement. In addition, to get something significant, you should perform the measurement over a large number of iterations in order to smooth out errors. In the general case, you might want to run your test on various input data as well -- depending on your algorithm, some cases might coincidentally be solved more efficiently than others.
You should definitely take a look at the timeit module. Something like that, maybe:
import timeit
print 'Horner', timeit.timeit(stmt='Horner(c,x)',
        setup='from __main__ import Horner, c, x',
        number=10000)
        #      ^^^^^
        # probably not enough. Increase it once you
        # are confident.
print 'naive', timeit.timeit(stmt='naive(c,x)',
        setup='from __main__ import naive, c, x',
        number=10000)
print 'itera', timeit.timeit(stmt='itera(c,x)',
        setup='from __main__ import itera, c, x',
        number=10000)
Producing this on my system:
Horner 23.3317809105
naive 28.305519104
itera 24.385917902
But still with variable results from one run to the other:
Horner 21.1151690483
naive 23.4374330044
itera 21.305426836
As I said before, to obtain more meaningful results, you should definitely increase the number of tests, and run them on several test cases in order to smooth the results.
If you are doing a lot of benchmarking, scientific computing, numpy-related work and so on, IPython will be an extremely useful tool.
To benchmark, you can time the code with the timeit IPython magic, which gives you more consistent results on each run; it is simply a matter of writing timeit followed by the function or code to time:
In [28]: timeit Horner(c,x)
1000 loops, best of 3: 670 µs per loop
In [29]: timeit naive(c,x)
1000 loops, best of 3: 983 µs per loop
In [30]: timeit itera(c,x)
1000 loops, best of 3: 804 µs per loop
To time code spanning more than one line you simply use %%timeit:
In [35]: %%timeit
....: for i in range(100):
....: i ** i
....:
10000 loops, best of 3: 110 µs per loop
IPython can compile Cython code and f2py code, and do numerous other very helpful tasks using different plugins and IPython magic commands.
builtin magic commands
Using cython and some very basic improvements we can improve the efficiency of Horner by about 25 percent:
In [166]: %%cython
import numpy as np
cimport numpy as np
cimport cython
ctypedef np.float_t DTYPE_t
def C_Horner(c, DTYPE_t x):
    cdef DTYPE_t p = 0
    for i in reversed(c):
        p = p * x + i
    return p
In [28]: c=npr.uniform(size=(2000,1))
In [29]: timeit Horner(c,-1.34)
100 loops, best of 3: 3.93 ms per loop
In [30]: timeit C_Horner(c,-1.34)
100 loops, best of 3: 2.21 ms per loop
In [31]: timeit itera(c,x)
100 loops, best of 3: 4.10 ms per loop
In [32]: timeit naive(c,x)
100 loops, best of 3: 4.95 ms per loop
Using the list from @Paul Draper's answer, our cythonised version runs twice as fast as the original function and much faster than itera and naive:
In [214]: import random
In [215]: c = [random.random() for _ in range(500)]
In [44]: timeit C_Horner(c, -1.34)
10000 loops, best of 3: 18.9 µs per loop
In [45]: timeit Horner(c, -1.34)
10000 loops, best of 3: 44.6 µs per loop
In [46]: timeit naive(c, -1.34)
10000 loops, best of 3: 167 µs per loop
In [47]: timeit itera(c,-1.34)
10000 loops, best of 3: 75.8 µs per loop

NumPy vs Cython - nested loop so slow?

I am confused about why a NumPy nested loop over a 3D array is so slow in comparison with Cython. I wrote a trivial example.
Python/NumPy version:
import numpy as np
def my_func(a,b,c):
    s=0
    for z in xrange(401):
        for y in xrange(401):
            for x in xrange(401):
                if a[z,y,x] == 0 and b[x,y,z] >= 0:
                    c[z,y,x] = 1
                    b[z,y,x] = z*y*x
                    s+=1
    return s
a = np.zeros((401,401,401), dtype=np.float32)
b = np.zeros((401,401,401), dtype=np.uint32)
c = np.zeros((401,401,401), dtype=np.uint8)
s = my_func(a,b,c)
Cythonized version:
cimport numpy as np
cimport cython
@cython.boundscheck(False)
@cython.wraparound(False)
def my_func(np.float32_t[:,:,::1] a, np.uint32_t[:,:,::1] b, np.uint8_t[:,:,::1] c):
    cdef np.uint16_t z,y,x
    cdef np.uint32_t s = 0
    for z in range(401):
        for y in range(401):
            for x in range(401):
                if a[z,y,x] == 0 and b[x,y,z] >= 0:
                    c[z,y,x] = 1
                    b[z,y,x] = z*y*x
                    s = s+1
    return s
The Cythonized version of my_func() runs approx. 6500x faster. A simpler function with only the if-statement and array access can be even 10000x faster. The Python version of my_func() takes 500.651 sec. to finish. Is iterating over a relatively small 3D array really this slow, or have I made some mistake in the code?
Cython version 0.21.1, Python 2.7.5, GCC 4.8.1, Xubuntu 13.10.
Python is an interpreted language. One of the benefits of compiling to machine code is the huge speedup you get, especially with things like nested loops.
I don't know what your expectations are, but all interpreted languages will be terribly slow at the things you are trying to do (JIT compiling may help to some extent though).
The trick of getting good performance out of Numpy (or MATLAB or anything similar) is to avoid looping altogether and instead try to refactor your code into a few operations on large matrices. This way, the looping will take place in the (heavily optimized) machine code libraries instead of your Python code.
As mentioned by Krumelur, python loops are definitely slow. You can, however, use numpy to your advantage. Operations on entire arrays are quite fast, although you need a little ingenuity sometimes.
For instance, in your code, since your loop never reads the value in b after you modify it (I think? My head is a little fuzzy at the moment, so you'll definitely want to go through this), the following should be equivalent:
# Precalculate an array holding z*y*x at each index
tmp = np.indices(a.shape)
prod = tmp[0] * tmp[1] * tmp[2]
# Use array-wide logical operations to compute c using a and the transpose of b
condition = np.logical_and(a == 0, b.T >= 0)
# Use condition to alter b and c only where condition is true
b[condition] = prod[condition]
c[condition] = 1
s = condition.sum()
So this does calculate x*y*z even in cases where the condition is false. You could probably avoid that if it turns out that is using lots of time, but it's likely not to be a significant factor.
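One possible way to avoid computing the product everywhere, sketched as an untested variant of the block above: take the indices where the condition holds and compute z*y*x only there.
# indices (z, y, x) at which the condition is true
zz, yy, xx = np.nonzero(condition)
b[zz, yy, xx] = zz * yy * xx
c[zz, yy, xx] = 1
s = zz.size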
For loops over numpy arrays in Python are slow, so you should use vectorized calculations where possible. If the algorithm needs a for loop over every element of the array, here are some hints to speed it up.
a[z,y,x] is a numpy scalar value, and calculations with numpy scalar values are very slow:
x = 3.0
%timeit x > 0
x = np.float64(3.0)
%timeit x > 0
the output on my pc with numpy 1.8.2, windows 7:
10000000 loops, best of 3: 64.3 ns per loop
1000000 loops, best of 3: 657 ns per loop
you can use the item() method to get the plain Python value directly:
if a.item(z, y, x) == 0 and b.item(x, y, z) >= 0:
...
It can speed up the for loop by about 8x.
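A minimal sketch of how that hint fits into the original loop (variable names follow the question; treat it as illustrative rather than benchmarked):
s = 0
for z in xrange(401):
    for y in xrange(401):
        for x in xrange(401):
            # .item() returns plain Python scalars, avoiding NumPy scalar overhead on reads
            if a.item(z, y, x) == 0 and b.item(x, y, z) >= 0:
                c[z, y, x] = 1
                b[z, y, x] = z * y * x
                s += 1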

Column wise sum V row wise sum: Why don't I see a difference using NumPy?

I've tested an example demonstrated in this talk [pytables] using numpy (page 20/57).
It is stated, that a[:,1].sum() takes 9.3 ms, whereas a[1,:].sum() takes only 72 us.
I tried to reproduce it, but failed to do so. Am I measuring wrongly? Or have things changed in NumPy since 2010?
$ python2 -m timeit -n1000 --setup \
'import numpy as np; a = np.random.randn(4000,4000);' 'a[:,1].sum()'
1000 loops, best of 3: 16.5 usec per loop
$ python2 -m timeit -n1000 --setup \
'import numpy as np; a = np.random.randn(4000,4000);' 'a[1,:].sum()'
1000 loops, best of 3: 13.8 usec per loop
$ python2 --version
Python 2.7.7
$ python2 -c 'import numpy; print numpy.version.version'
1.8.1
While I can measure a benefit of the second version (supposedly fewer cache misses because numpy uses C-style row ordering), I don't see that drastic difference as stated by the pytables contributor.
Also, it seems I cannot see more cache misses when using column V row summation.
EDIT
So far the insight for me was that I was using the timeit module in the wrong way. Repeated runs with the same array (or row/column of an array) will almost certainly be cached (I've got 32 KiB of L1 data cache, and one row or column of the 4000-element float64 array is 4000 * 8 bytes ≈ 31 KiB, so it still fits).
Using the script in @alim's answer with a single loop (nloop=1) and ten trials (nrep=10), and varying the size of the random array (n x n), I am measuring:
n      row/us   col/us   column penalty
1k     90       100      1
4k     100      210      2
10k*   110      350      3.5
20k*   120      1200     10
* n=10k and higher doesn't fit into the L1d cache anymore.
I'm still not sure about tracing down the cause of this as perf shows about the same rate of cache misses (sometimes even a higher rate) for the faster row sum.
Perf data:
nloop = 2 and nrep=2, so I expect some of the data still in the cache... for the second run.
Row sum n=10k
perf stat -B -e cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses,L1-dcache-prefetches,cycles,instructions,branches,faults,migrations ./answer1.py 2>&1 | sed 's/^/ /g'
row sum: 103.593 us
Performance counter stats for './answer1.py':
25850670 cache-references [30.04%]
1321945 cache-misses # 5.114 % of all cache refs [20.04%]
5706371393 L1-dcache-loads [20.00%]
11733777 L1-dcache-load-misses # 0.21% of all L1-dcache hits [19.97%]
2401264190 L1-dcache-stores [20.04%]
131964213 L1-dcache-store-misses [20.03%]
2007640 L1-dcache-prefetches [20.04%]
21894150686 cycles [20.02%]
24582770606 instructions # 1.12 insns per cycle [30.06%]
3534308182 branches [30.01%]
3767 faults
6 migrations
7.331092823 seconds time elapsed
Column sum n=10k
perf stat -B -e cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses,L1-dcache-prefetches,cycles,instructions,branches,faults,migrations ./answer1.py 2>&1 | sed 's/^/ /g'
column sum: 377.059 us
Performance counter stats for './answer1.py':
26673628 cache-references [30.02%]
1409989 cache-misses # 5.286 % of all cache refs [20.07%]
5676222625 L1-dcache-loads [20.06%]
11050999 L1-dcache-load-misses # 0.19% of all L1-dcache hits [19.99%]
2405281776 L1-dcache-stores [20.01%]
126425747 L1-dcache-store-misses [20.02%]
2128076 L1-dcache-prefetches [20.04%]
21876671763 cycles [20.00%]
24607897857 instructions # 1.12 insns per cycle [30.00%]
3536753654 branches [29.98%]
3763 faults
9 migrations
7.327833360 seconds time elapsed
EDIT2
I think I have understood some aspects, but the question has not been fully answered yet. At the moment I think this summation example doesn't reveal anything about CPU caches at all. In order to eliminate uncertainty from numpy/python, I tried to use perf on the summation done in C, and the results are in an answer below.
I don't see anything wrong with your attempt at replication, but bear in mind that those slides are from 2010, and numpy has changed rather a lot since then. Based on the dates of numpy releases, I would guess that Francesc was probably using v1.5.
Using this script to benchmark row v column sums:
#!python
import numpy as np
import timeit
print "numpy version == " + str(np.__version__)
setup = "import numpy as np; a = np.random.randn(4000, 4000)"
rsum = "a[1, :].sum()"
csum = "a[:, 1].sum()"
nloop = 1000
nrep = 3
print "row sum:\t%.3f us" % (
min(timeit.repeat(rsum, setup, repeat=nrep, number=nloop)) / nloop * 1E6)
print "column sum:\t%.3f us" % (
min(timeit.repeat(csum, setup, repeat=nrep, number=nloop)) / nloop * 1E6)
I detect about a 50% slowdown for column sums with numpy v1.5:
$ python sum_benchmark.py
numpy version == 1.5.0
row sum: 8.472 us
column sum: 12.759 us
Compared with about a 30% slowdown with v1.8.1, which you're using:
$ python sum_benchmark.py
numpy version == 1.8.1
row sum: 12.108 us
column sum: 15.768 us
It's interesting to note that both types of reduction have actually gotten a bit slower in the more recent numpy versions. I would have to delve a lot deeper into numpy's source code to understand exactly why this is the case.
Update
For the record, I'm running Ubuntu 14.04 (kernel v3.13.0-30) on a quad-core i7-2630QM CPU @ 2.0GHz. Both versions of numpy were pip-installed and compiled using GCC-4.8.1.
I realize my original benchmarking script wasn't totally self-explanatory - you need to divide the total time by the number of loops (1000) in order to get the time per call.
It also probably makes more sense to take the minimum across repeats rather than the average, since this is more likely to represent the lower bound on the execution time (on top of which you'd get variability due to background processes etc.).
I've updated my script and results above accordingly
We can also negate any effect of caching across calls (temporal locality) by creating a brand-new random array for every call - just set nloop to 1 and nrep to a reasonably small number (unless you really enjoy watching paint dry), say 10.
nloop=1, nreps=10 on a 4000x4000 array:
numpy version == 1.5.0
row sum: 47.922 us
column sum: 103.235 us
numpy version == 1.8.1
row sum: 66.996 us
column sum: 125.885 us
That's a bit more like it, but I still can't really replicate the massive effect that Francesc's slides show. Perhaps this isn't that surprising, though - the effect may be very compiler-, architecture-, and/or kernel-dependent.
Interesting. I can reproduce Sebastian's performance:
In [21]: np.__version__
Out[21]: '1.8.1'
In [22]: a = np.random.randn(4000, 4000)
In [23]: %timeit a[:, 1].sum()
100000 loops, best of 3: 12.4 µs per loop
In [24]: %timeit a[1, :].sum()
100000 loops, best of 3: 10.6 µs per loop
However, if I try with a larger array:
In [25]: a = np.random.randn(10000, 10000)
In [26]: %timeit a[:, 1].sum()
10000 loops, best of 3: 21.8 µs per loop
In [27]: %timeit a[1, :].sum()
100000 loops, best of 3: 15.8 µs per loop
but, if I try again:
In [28]: a = np.random.randn(10000, 10000)
In [29]: %timeit a[:, 1].sum()
10000 loops, best of 3: 64.4 µs per loop
In [30]: %timeit a[1, :].sum()
100000 loops, best of 3: 15.9 µs per loop
So, I'm not sure what's going on here, but this jitter is probably due to cache effects. Perhaps newer architectures are wiser in predicting access patterns and hence do better prefetching?
At any rate, and for comparison matters, I am using NumPy 1.8.1, Linux Ubuntu 14.04 and a laptop with an i5-3380M CPU @ 2.90GHz.
EDIT: After thinking a bit about this, yes, I would say that the first time timeit executes the sum, the column (or the row) is fetched from RAM, but the second time the operation runs, the data is in cache (for both the row-wise and column-wise versions), so it executes fast. As timeit takes the minimum of the runs, this is why we don't see a big difference in times.
Another question is why we sometimes do see a difference (using timeit). But caches are weird beasts, especially in multicore machines executing multiple processes at a time.
I wrote the summation example in C. The results are shown as CPU time measurements, and I always compiled with gcc -O1 using-c.c (gcc version 4.9.0 20140604). The source code is below.
I chose the matrix size to be n x n. For n<2k the row and column summation do not have any measurable difference (6-7 us per run for n=2k).
Row summation
n     first/us   converged/us
1k    5          4
4k    19         12
10k   35         31
20k   70         61
30k   130        90
e.g. n=20k
Run 0 taken 70 cycles. 0 ms 70 us
Run 1 taken 61 cycles. 0 ms 60 us # this is the minimum I've seen in all tests
Run 1 taken 61 cycles. 0 ms 61 us
<snip> (always 60/61 cycles)
Column
n     first/us   converged/us
1k    5          4
4k    112        14
10k   228        32
20k   550        246
30k   1000       300
e.g. n=20k
Run 0 taken 552 cycles. 0 ms 552 us
Run 1 taken 358 cycles. 0 ms 358 us
Run 2 taken 291 cycles. 0 ms 291 us
Run 3 taken 264 cycles. 0 ms 264 us
Run 4 taken 252 cycles. 0 ms 252 us
Run 5 taken 275 cycles. 0 ms 275 us
Run 6 taken 262 cycles. 0 ms 262 us
Run 7 taken 249 cycles. 0 ms 249 us
Run 8 taken 249 cycles. 0 ms 249 us
Run 9 taken 246 cycles. 0 ms 246 us
Discussion
Row summation is faster. It doesn't benefit much from any caching, i.e. repeated sums are not much faster than the initial sum. Column summation is much slower, but the speed steadily increases over the first 5-8 iterations. The increase is most pronounced from n=4k to n=10k, where caching helps to increase the speed about tenfold. For larger arrays, the speedup is only about a factor of 2. I also observe that while row summation converges very quickly (after one or two trials), column summation takes many more iterations (5 or more) to converge.
Take away lesson for me:
For large arrays (more than 2k elements) there is a difference in summation speed. I believe this is because data is fetched from RAM into the L1d cache in whole cache lines. Although I don't know the line size of one read, I assume it is larger than 8 bytes, so the next element to sum up is already in the cache.
Column sum speed is first and foremost limited by memory bandwidth. The CPU seems to be starved for data as the spread-out chunks are read from RAM.
When performing the summation repeatedly, one expects that some data doesn't need to be fetched from RAM and is already present in L2/L1d cache. For row summation, this is noticeable only for n>30k, for column summation it becomes apparent already at n>2k.
Using perf, I don't see a large difference though. But the bulk of the work in the C program is filling the array with random data. I don't know how I can eliminate this "setup" data...
Here Is the C code for this example:
#include <stdio.h>
#include <stdlib.h> // see `man random`
#include <time.h>   // man time.h, info clock
int
main (void)
{
    // seed
    srandom(62);
    //printf ("test %g\n", (double)random()/(double)RAND_MAX);
    const size_t SIZE = 20E3;
    const size_t RUNS = 10;
    double (*b)[SIZE];
    printf ("Array size: %zux%zu, each %zu bytes. slice = %f KiB\n", SIZE, SIZE,
            sizeof(double), ((double)SIZE)*sizeof(double)/1024);
    b = malloc(sizeof *b * SIZE);
    //double a[SIZE][SIZE]; // too large!
    int i,j;
    for (i = 0; i < SIZE; i++) {
        for (j = 0; j < SIZE; j++) {
            b[i][j] = (double)random()/(double)RAND_MAX;
        }
    }
    double sum = 0;
    int run = 0;
    clock_t start, diff;
    int usec;
    for (run = 0; run < RUNS; run++) {
        start = clock();
        for (i = 0; i < SIZE; i++) {
            // column wise (slower?)
            sum += b[i][1];
            // row wise (faster?)
            //sum += b[1][i];
        }
        diff = clock() - start;
        usec = ((double) diff*1e6) / CLOCKS_PER_SEC; // https://stackoverflow.com/a/459704/543411
        printf("Run %d taken %ld cycles. %d ms %d us\n", run, (long)diff, usec/1000, usec%1000);
    }
    printf("Sum: %g\n", sum);
    return 0;
}
I'm using Numpy 1.9.0.def-ff7d5f9, and I see a 10x difference when executing the two test lines you posted. I wouldn't be surprised if your machine and the compiler you've used to build Numpy are as important to the speedup as the Numpy version is.
In practice though, I don't think it's too common to want to do a reduction of a single column or row like this. I think a better test would be to compare reducing across all rows
a.sum(axis=0)
with reducing across all columns
a.sum(axis=1)
For me, these two operations have only a small difference in speed (reducing across columns takes about 95% of the time of reducing across rows).
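A rough sketch of that comparison with timeit, if you want to try it yourself (the 4000x4000 size is simply carried over from the question):
import timeit
setup = "import numpy as np; a = np.random.randn(4000, 4000)"
t_axis0 = min(timeit.repeat("a.sum(axis=0)", setup, repeat=3, number=100)) / 100
t_axis1 = min(timeit.repeat("a.sum(axis=1)", setup, repeat=3, number=100)) / 100
print("reducing across all rows (axis=0):    %.3f ms" % (t_axis0 * 1e3))
print("reducing across all columns (axis=1): %.3f ms" % (t_axis1 * 1e3))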
EDIT: In general, I'm very wary about comparing the speed of operations that take on the order of microseconds. When installing Numpy, it's very important to have a good BLAS library linked with it, since this is what does the heavy lifting when it comes to most large matrix operations (such as matrix-matrix multiplies). When comparing BLAS libraries, you definitely want to use intensive operations like matrix-matrix dot products as the point of comparison, since this is where you'll spend the vast majority of your time. I've found that sometimes, a worse BLAS library will actually have a slightly faster vector-vector dot operation than a better BLAS library. What makes it worse, though, is that operations like matrix-matrix dot products and eigenvalue decompositions take tens of times longer, and these matter much more than a cheap vector-vector dot. I think these differences often appear because you can write a reasonably fast vector-vector dot in C without much thought, but writing a good matrix-matrix dot takes a lot of thought and optimization, and is the more costly operation, so this is where good BLAS packages put their effort.
The same is true in Numpy: any optimization is going to be done on larger operations, not small ones, so don't get hung up on speed differences between small operations. Furthermore, it's hard to tell if any speed difference with a small operation is really due the time of the computation, or is just due to overhead that is put in to optimize more costly operations.
