I want to assign values to a large array from short arrays using indexing. A simple version of the code is as follows:
import numpy as np
def assign_x():
    a = np.zeros((int(3e6), 20))
    index_a = np.random.randint(int(3e6), size=(int(3e6), 20))
    b = np.random.randn(1000, 20)
    for i in range(20):
        index_b = np.random.randint(1000, size=int(3e6))
        a[index_a[:, i], i] = b[index_b, i]
    return a
%timeit x = assign_x()
# 2.79 s ± 18.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I have tried other approaches that may be relevant, for example np.take and Numba's JIT, but the above seems to be the fastest way. It could possibly also be sped up using multiprocessing. I have profiled the code; most of the time is spent on the line below, as it runs many times (20 here):
a[index_a[:, i], i] = b[index_b, i]
Any chance I can make this faster before using multiprocessing?
Why this is slow
This is slow because the memory access pattern is very inefficient. Random accesses are slow because the processor cannot predict them. As a result, they cause expensive cache misses (if the array does not fit in the L1/L2 cache) that cannot be avoided by prefetching data ahead of time. The problem is that the arrays are too big to fit in the caches: index_a and a each take 457 MiB and b takes 156 KiB. Consequently, accesses to b are typically served by the L2 cache with a higher latency, while accesses to the two other arrays go to RAM. This is slow because current DDR RAM has a huge latency of 60-100 ns on a typical PC. Even worse, this latency is unlikely to get much smaller in the near future: RAM latency has barely changed over the last two decades. This is called the memory wall. Note also that modern processors fetch a full cache line of usually 64 bytes from RAM when a value at a random location is requested, so 56/64 = 87.5% of the fetched bandwidth is wasted on an 8-byte access. Finally, generating random numbers is a fairly expensive process, especially for large integers, and np.random.randint generates either 32-bit or 64-bit integers depending on the target platform.
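A quick way to see this effect in isolation (an illustrative sketch, not part of the original code; the numbers and the few GiB of memory used depend on your machine) is to gather the same number of elements with sequential and with shuffled indices, so that only the access pattern differs:

import numpy as np
import timeit

a = np.random.randn(60_000_000)             # ~457 MiB, far larger than the caches
idx_seq = np.arange(a.size)                  # contiguous, prefetch-friendly order
idx_rand = np.random.permutation(a.size)     # same amount of work, random order

t_seq = timeit.timeit(lambda: a[idx_seq], number=3) / 3
t_rand = timeit.timeit(lambda: a[idx_rand], number=3) / 3
print(f"sequential gather: {t_seq:.3f} s, random gather: {t_rand:.3f} s")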
How to improve this
The first improvement is to perform the indirection on the most contiguous dimension, which is generally the last one, since a[:,i] is slower than a[i,:]. You can transpose the arrays and swap the indexed values. However, the Numpy transposition function only returns a view and does not actually transpose the array in memory, so an explicit copy is currently required. The best option here is simply to generate the arrays directly in the layout that makes accesses efficient (rather than paying for expensive transpositions). Note that you can also use single precision so the arrays fit better in the caches, at the expense of lower precision.
Here is an example that returns a transposed array:
import numpy as np
def assign_x():
    a = np.zeros((20, int(3e6)))
    index_a = np.random.randint(int(3e6), size=(20, int(3e6)))
    b = np.random.randn(20, 1000)
    for i in range(20):
        index_b = np.random.randint(1000, size=int(3e6))
        a[i, index_a[i, :]] = b[i, index_b]
    return a
%timeit x = assign_x()
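As noted above, single precision can also help. Here is a minimal sketch of the same transposed version with float32 arrays (assuming the reduced precision is acceptable; the function name is only for illustration, and the index arrays are unchanged):

import numpy as np

def assign_x_f32():
    # same layout as above, but single precision: a and b are half the size,
    # so more of the randomly accessed data stays in the caches
    a = np.zeros((20, int(3e6)), dtype=np.float32)
    index_a = np.random.randint(int(3e6), size=(20, int(3e6)))
    b = np.random.randn(20, 1000).astype(np.float32)
    for i in range(20):
        index_b = np.random.randint(1000, size=int(3e6))
        a[i, index_a[i, :]] = b[i, index_b]
    return a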
The code can be improved further using Numba so as to run it in parallel (one core is not enough to saturate the memory because of the RAM latency, but many cores can make better use of it since multiple fetches can be done concurrently). Moreover, Numba can help avoid the creation of big temporary arrays.
Here is an optimized Numba version:
import numpy as np
import numba as nb
import random
@nb.njit('float64[:,:]()', parallel=True)
def assign_x():
    a = np.zeros((20, int(3e6)))
    b = np.random.randn(20, 1000)
    for i in nb.prange(20):
        for j in range(3_000_000):
            # random.randint is inclusive on both ends, so subtract 1 to stay in bounds
            index_a = random.randint(0, 3_000_000 - 1)
            index_b = random.randint(0, 1000 - 1)
            a[i, index_a] = b[i, index_b]
    return a
%timeit x = assign_x()
Here are results on a 10-core Skylake Xeon processor:
Initial code: 2798 ms
Better memory access pattern: 1741 ms
With Numba: 318 ms
Note that parallelizing the inner-most loop would theoretically be faster because one row of a is more likely to fit in the last-level cache. However, doing this would cause a race condition that can only be fixed efficiently with atomic stores, which are not yet available in Numba (on CPU).
Note that the final code does not scale well because it is memory-bound. This is due to 87.5% of the memory throughput being wasted, as explained before. Additionally, on many processors (like all Intel and AMD Zen processors) the write-allocate cache policy forces data to be read from memory for each store in this case. This makes the computation even more inefficient, raising the wasted throughput to 93.7%... AFAIK, there is no way to prevent this in Python. In C/C++, the write-allocate issue can be fixed using low-level instructions. The rule of thumb is: avoid random memory access patterns on big arrays like the plague.
Related
I have the following function which accepts an indicator matrix of shape (20,000 x 20,000), and I have to run the function 20,000 x 20,000 = 400,000,000 times. Note that the indicator_Matrix has to be in the form of a pandas dataframe when passed as a parameter to the function, as my actual problem's dataframe has a timeIndex and integer columns, but I have simplified this a bit for the sake of understanding the problem.
Pandas Implementation
indicator_Matrix = pd.DataFrame(np.random.randint(0,2,[20000,20000]))
def operations(indicator_Matrix):
    s = indicator_Matrix.sum(axis=1)
    d = indicator_Matrix.div(s, axis=0)
    res = d[d>0].mean(axis=0)
    return res.iloc[-1]
I tried to improve it by using numpy, but it is still taking ages to run. I also tried concurrent.futures.ThreadPoolExecutor, but it still takes a long time and there was not much improvement over the list comprehension.
Numpy Implementation
indicator_Matrix = pd.DataFrame(np.random.randint(0,2,[20000,20000]))
def operations(indicator_Matrix):
    s = indicator_Matrix.to_numpy().sum(axis=1)
    d = (indicator_Matrix.to_numpy().T / s).T
    d = pd.DataFrame(d, index = indicator_Matrix.index, columns = indicator_Matrix.columns)
    res = d[d>0].mean(axis=0)
    return res.iloc[-1]
output = [operations(indicator_Matrix) for i in range(0,20000**2)]
Note that the reason I convert d back to a dataframe is that I need to obtain the column means and retain only the last column mean using .iloc[-1]. d[d>0].mean(axis=0) returns the column means, i.e.
2478 1.0
0 1.0
Update: I am still stuck on this problem. I wonder if using GPU packages like cudf and CuPy on my local desktop would make any difference.
Assuming the answer of @CrazyChucky is correct, one can implement a faster parallel Numba implementation. The idea is to use plain loops and to read data contiguously, which makes the computation cache-friendly/memory-efficient. Here is an implementation:
import numpy as np
import numba as nb

@nb.njit(['(int_[:,:],)', '(int_[:,::1],)', '(int_[::1,:],)'], parallel=True)
def compute_fastest(matrix):
    n, m = matrix.shape
    sum_by_row = np.zeros(n, matrix.dtype)
    is_row_major = matrix.strides[0] >= matrix.strides[1]
    if is_row_major:
        for i in nb.prange(n):
            s = 0
            for j in range(m):
                s += matrix[i, j]
            sum_by_row[i] = s
    else:
        for chunk_id in nb.prange(0, (n+63)//64):
            start = chunk_id * 64
            end = min(start+64, n)
            for j in range(m):
                for i2 in range(start, end):
                    sum_by_row[i2] += matrix[i2, j]
    count = 0
    s = 0.0
    for i in range(n):
        value = matrix[i, -1] / sum_by_row[i]
        if value > 0:
            s += value
            count += 1
    return s / count
# output = [compute_fastest(indicator_Matrix.to_numpy()) for i in range(0,20000**2)]
Pandas dataframes can contain both row-major and column-major arrays. Depending on the memory layout, it is better to iterate either over the rows or over the columns; this is why there are two implementations of the sum based on is_row_major. There are also 3 Numba signatures: one for row-major contiguous arrays, one for column-major contiguous arrays, and one for non-contiguous arrays. Numba compiles the 3 function variants and automatically picks the best one at runtime. The JIT compiler of Numba can generate a faster implementation (e.g. using SIMD instructions) when the input 2D array is known to be contiguous.
Experimental Results
This computation is about 14.5 times faster than operations_simpler on my i5-9600KF processor (6 cores). It still takes a lot of time, but the computation is memory-bound and nearly optimal on my machine: it is bounded by the main memory, which has to be read:
On a 2000x2000 dataframe with 32-bit integers:
- operations: 86.310 ms/iter
- operations_simpler: 5.450 ms/iter
- compute_fastest: 0.375 ms/iter
- optimal: 0.345-0.370 ms/iter
If you want faster code, then you need to use more compact data types. For example, a uint8 data type is large enough to contain the values 0 and 1, and it is 4 times smaller in memory than the default integer type on Windows. This means the code can be up to 4 times faster in this case. The smaller the data type, the faster the program. One could even try to pack 8 columns into 1 using bit tricks, though that is generally significantly slower with Numba unless you have a lot of available cores.
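As a sketch of what that could look like (a hypothetical uint8 variant, row-major case only for brevity; note the row-sum accumulator is widened to int64 so that sums of up to 20,000 ones cannot overflow the uint8 dtype):

import numpy as np
import numba as nb

@nb.njit('(uint8[:,::1],)', parallel=True)
def compute_fastest_u8(matrix):
    n, m = matrix.shape
    sum_by_row = np.zeros(n, np.int64)   # widened accumulator to avoid uint8 overflow
    for i in nb.prange(n):
        s = 0
        for j in range(m):
            s += matrix[i, j]
        sum_by_row[i] = s
    total = 0.0
    count = 0
    for i in range(n):
        value = matrix[i, -1] / sum_by_row[i]
        if value > 0:
            total += value
            count += 1
    return total / count

# usage (hypothetical): compute_fastest_u8(indicator_Matrix.to_numpy().astype(np.uint8))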
Notes & Discussion
The above code works only with uniformly-typed columns. If this is not the case, you can split the dataframe into multiple groups, convert each column group to a Numpy array, and then call the Numba function (modified to support groups). Note that @CrazyChucky's code has a similar issue: a dataframe column with mixed datatypes converted to a Numpy array results in an object-based Numpy array, which is very inefficient (especially as a row-major Numpy array).
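For example, a minimal sketch of such a grouping (df is a hypothetical mixed-dtype dataframe; each resulting array is homogeneous and can then be passed to a suitably typed Numba function):

import pandas as pd

def split_by_dtype(df: pd.DataFrame) -> dict:
    # group columns sharing a dtype and convert each group to a homogeneous Numpy array
    return {
        str(dtype): df.select_dtypes(include=[dtype]).to_numpy()
        for dtype in df.dtypes.unique()
    }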
Note that using a GPU will not make the computation faster unless the input dataframe is already stored in the GPU memory. Indeed, CPU-GPU data transfers are more expensive than just reading the RAM (due to the interconnect overhead, generally a quite slow PCIe link). Note also that GPU memory is quite limited compared to CPU memory. If the target dataframe(s) do not need to be transferred, then using cudf is relatively simple and should give a small speedup. For faster code, one needs to write a fast CUDA implementation, but this is clearly far from easy for dataframes with mixed datatypes. In the end, the resulting speedup should be about gpu_ram_throughput / main_ram_throughput, assuming there is no data transfer; this factor is generally 5-12. Note also that CUDA and cudf require an Nvidia GPU.
Finally, reducing the input data size or simply the amount of computation is certainly the best solution (as indicated in the comment by @zvone), since the problem is very computationally intensive.
You're doing some extra math you don't have to. In plain English, what you're doing is:
Summing each column
Turning the list of sums "sideways" and dividing each column by it
Taking the mean of each column, ignoring values ≤ 0
Returning only the rightmost mean
After step one, you no longer need anything but the rightmost column; you can ignore the other columns, only dividing and averaging the one whose result you care about. Changing your code accordingly:
def operations_simpler(indicator_matrix):
    sums = indicator_matrix.sum(axis=1)
    last_column = indicator_matrix.iloc[:, -1]
    divided = last_column / sums
    return divided[divided > 0].mean()
...yields the same result, and takes about a hundredth of the time. Extrapolating from shorter test runs, this cuts the time for 400,000,000 runs on my machine from about 114 years down to... about 324 days. Still not great. So far I've not managed to get it to run any faster by converting to NumPy, compiling with Numba, or employing multiprocessing, but I'll go ahead and post this for now in case it's helpful.
Note: You're unlikely to see any improvements with compute-heavy work like this from threading; if anything, you'd want to use multiprocessing. concurrent.futures offers executors for both. Threads are mostly useful to avoid waiting around for I/O.
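For completeness, here is a minimal process-based sketch (a hypothetical helper, not a drop-in solution: the function and its arguments must be picklable and are copied to each worker, which is itself costly for a 20,000 x 20,000 matrix):

from concurrent.futures import ProcessPoolExecutor

def run_batch(func, args_list, max_workers=None):
    # processes sidestep the GIL for CPU-bound work, unlike ThreadPoolExecutor
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(func, args_list))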
As per the previous answer you can use Numba, or you can use two other alternatives. One is Dask, a distributed computing package: to parallelize your function's execution it can divide your data into smaller chunks and distribute the computation across many CPU cores or even multiple machines.
import dask.array as da
def operations(indicator_matrix):
    s = indicator_matrix.sum(axis=1)
    d = indicator_matrix.div(s, axis=0)
    res = d[d > 0].mean(axis=0)
    return res.iloc[-1]
indicator_matrix_dask = da.from_array(indicator_matrix, chunks=(1000, 1000))
output_dask = indicator_matrix_dask.map_blocks(operations, dtype=float)
output = output_dask.compute()
or you can use CuPy, which uses the GPU to speed up your function's execution:
import cupy as cp
def operations(indicator_matrix):
    s = cp.sum(indicator_matrix, axis=1)
    d = cp.divide(indicator_matrix.T, s).T
    d = pd.DataFrame(d, index = indicator_matrix.index, columns = indicator_matrix.columns)
    res = d[d > 0].mean(axis=0)
    return res.iloc[-1]
indicator_matrix_cupy = cp.asarray(indicator_matrix)
output_cupy = operations(indicator_matrix_cupy)
output = cp.asnumpy(output_cupy)
jax.numpy.split can be used to segment an array into equal-length segments, with any remainder in the last segment. e.g. splitting an array of 5000 elements into segments of 10:
import jax.numpy as jnp

array = jnp.ones(5000)
segment_size = 10
split_indices = jnp.arange(segment_size, array.shape[0], segment_size)
segments = jnp.split(array, split_indices)
This takes around 10 seconds to execute on Google Colab and on my local machine. This seems unreasonable for such a simple task on a small array. Am I doing something wrong to make this slow?
Further Details (JIT caching, maybe?)
Subsequent calls to .split are very fast, provided an array of the same shape and the same split indices. e.g. the first iteration of the following loop is extremely slow, but all the others are fast (11 seconds vs 40 milliseconds).
from timeit import default_timer as timer
import jax.numpy as jnp
array = jnp.ones(5000)
segment_size = 10
split_indices = jnp.arange(segment_size, array.shape[0], segment_size)
for k in range(5):
    start = timer()
    segments = jnp.split(array, split_indices)
    end = timer()
    print(f'call {k}: {end - start:0.2f} s')
Output:
call 0: 11.79 s
call 1: 0.04 s
call 2: 0.04 s
call 3: 0.05 s
call 4: 0.04 s
I assume that the subsequent calls are faster because JAX is caching jitted versions of split for each combination of arguments. If that's the case, then I assume split is slow (on its first such call) because of compilation overhead.
Is that true? If yes, how should I split a JAX array without incurring the performance hit?
This is slow because there are tradeoffs in the implementation of split(), and your function happens to be on the wrong side of the tradeoff.
There are several ways to compute slices in XLA, including XLA:Slice (i.e. lax.slice), XLA:DynamicSlice (i.e. lax.dynamic_slice), and XLA:Gather (i.e. lax.gather).
The main difference between these concerns whether the start and ending indices are static or dynamic. Static indices essentially mean you're specializing your computation for specific index values: this incurs some small compilation overhead on the first call, but subsequent calls can be very fast. Dynamic indices, on the other hand, don't include such specialization, so there is less compilation overhead, but each execution takes slightly longer. You may be able to guess where this is going...
jnp.split currently is implemented in terms of lax.slice (see code), meaning it uses static indices. This means that the first use of jnp.split will incur compilation cost proportional to the number of outputs, but repeated calls will execute very quickly. This seemed like the best approach for common uses of split, where a handful of arrays are produced.
In your case, you're generating hundreds of arrays, so the compilation cost far dominates over the execution.
To illustrate this, here are some timings for three approaches to the same array split, based on gather, slice, and dynamic_slice. You might wish to use one of these directly rather than jnp.split if your program benefits from a different implementation:
from timeit import default_timer as timer
from jax import lax
import jax.numpy as jnp
import jax
def f_slice(x, step=10):
    return [lax.slice(x, (N,), (N + step,)) for N in range(0, x.shape[0], step)]

def f_dynamic_slice(x, step=10):
    return [lax.dynamic_slice(x, (N,), (step,)) for N in range(0, x.shape[0], step)]

def f_gather(x, step=10):
    step = jnp.asarray(step)
    return [x[N: N + step] for N in range(0, x.shape[0], step)]

def time(f, x):
    print(f.__name__)
    for k in range(5):
        start = timer()
        segments = jax.block_until_ready(f(x))
        end = timer()
        print(f' call {k}: {end - start:0.2f} s')
x = jnp.ones(5000)
time(f_slice, x)
time(f_dynamic_slice, x)
time(f_gather, x)
Here's the output on a Colab CPU runtime:
f_slice
call 0: 7.78 s
call 1: 0.05 s
call 2: 0.04 s
call 3: 0.04 s
call 4: 0.04 s
f_dynamic_slice
call 0: 0.15 s
call 1: 0.12 s
call 2: 0.14 s
call 3: 0.13 s
call 4: 0.16 s
f_gather
call 0: 0.55 s
call 1: 0.54 s
call 2: 0.51 s
call 3: 0.58 s
call 4: 0.59 s
You can see here that static indices (lax.slice) lead to the fastest execution after compilation. However, for generating many slices, dynamic_slice and gather avoid repeated compilations. It may be that we should re-implement jnp.split in terms of dynamic_slice, but that wouldn't come without tradeoffs: for example, it would lead to a slowdown in the (possibly more common?) case of few splits, where lax.slice would be faster on both initial and subsequent runs. Also, dynamic_slice only avoids recompilation if each slice is the same size, so generating many slices of varying sizes would incur a large compilation overhead similar to lax.slice.
These kinds of tradeoffs are actively discussed in JAX development channels; a recent example very similar to this can be found in PR #12219. If you wish to weigh in on this particular issue, I'd invite you to file a new jax issue on the topic.
A final note: if you're truly just interested in generating equal-length sequential slices of an array, you would be much better off just calling reshape:
out = x.reshape(len(x) // 10, 10)
The result is now a 2D array where each row corresponds to a slice from the above functions, and this will far out-perform anything that's generating a list of array slices.
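For example, a quick sanity check (assuming, as here, that the array length is an exact multiple of the segment size) that the rows of the reshaped array match the output of jnp.split:

import jax.numpy as jnp

array = jnp.ones(5000)
segment_size = 10
split_indices = jnp.arange(segment_size, array.shape[0], segment_size)

segments = jnp.split(array, split_indices)    # list of 500 arrays of length 10
rows = array.reshape(-1, segment_size)         # single (500, 10) array
assert all(jnp.array_equal(s, r) for s, r in zip(segments, rows))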
JAX's built-in functions are also JIT compiled
Benchmarking JAX code
JAX code is Just-In-Time (JIT) compiled. Most code written in JAX can
be written in such a way that it supports JIT compilation, which can
make it run much faster (see To JIT or not to JIT). To get maximum
performance from JAX, you should apply jax.jit() on your outer-most
function calls.
Keep in mind that the first time you run JAX code, it will be slower
because it is being compiled. This is true even if you don’t use jit
in your own code, because JAX’s builtin functions are also JIT
compiled.
So the first time you run it, it is compiling jnp.split (or at least compiling some of the functions used within jnp.split).
%%timeit -n1 -r1
jnp.split(array, split_indices)
1min 15s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
The second time, it is calling the compiled function
%%timeit -n1 -r1
jnp.split(array, split_indices)
131 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
It is fairly complicated, calling other jax.numpy functions, so I assume it can take quite a while to compile (1 minute on my machine!)
I am indexing vectors and using JAX, but I have noticed a considerable slow-down compared to numpy when simply indexing arrays. For example, consider making a basic array in JAX numpy and ordinary numpy:
import jax.numpy as jnp
import numpy as onp
jax_array = jnp.ones((1000,))
numpy_array = onp.ones(1000)
Then simply indexing between two integers, for JAX (on GPU) this gives a time of:
%timeit jax_array[435:852]
1000 loops, best of 5: 1.38 ms per loop
And for numpy this gives a time of:
%timeit numpy_array[435:852]
1000000 loops, best of 5: 271 ns per loop
So numpy is 5000 times faster than JAX. When JAX is on a CPU, then
%timeit jax_array[435:852]
1000 loops, best of 5: 577 µs per loop
So faster, but still 2000 times slower than numpy. I am using Google Colab notebooks for this, so there should not be a problem with the installation/CUDA.
Am I missing something? I realise that indexing is different for JAX and numpy, as given by the JAX 'sharp edges' documentation, but I cannot find any way to perform assignment such as
new_array = jax_array[435:852]
without a considerable slowdown. I cannot avoid indexing the arrays as it is necessary in my program.
The short answer: to speed things up in JAX, use jit.
The long answer:
You should generally expect single operations using JAX in op-by-op mode to be slower than similar operations in numpy. This is because JAX execution has a fixed amount of per-Python-function-call overhead involved in dispatching operations to XLA.
Even seemingly simple operations like indexing are implemented in terms of multiple XLA operations, which (outside JIT) will each add their own call overhead. You can see this sequence using the make_jaxpr transform to inspect how the function is expressed in terms of primitive operations:
from jax import jit, make_jaxpr
f = lambda x: x[435:852]
make_jaxpr(f)(jax_array)
# { lambda ; a.
# let b = broadcast_in_dim[ broadcast_dimensions=( )
# shape=(1,) ] 435
# c = gather[ dimension_numbers=GatherDimensionNumbers(offset_dims=(0,), collapsed_slice_dims=(), start_index_map=(0,))
# indices_are_sorted=True
# slice_sizes=(417,)
# unique_indices=True ] a b
# d = broadcast_in_dim[ broadcast_dimensions=(0,)
# shape=(417,) ] c
# in (d,) }
(See Understanding Jaxprs for info on how to read this).
Where JAX outperforms numpy is not in single small operations (in which JAX dispatch overhead dominates), but rather in sequences of operations compiled via the jit transform. So, for example, compare the JIT-compiled versus not-JIT-compiled version of the indexing:
%timeit f(jax_array).block_until_ready()
# 1000 loops, best of 5: 612 µs per loop
f_jit = jit(f)
f_jit(jax_array) # trigger compilation
%timeit f_jit(jax_array).block_until_ready()
# 100000 loops, best of 5: 4.34 µs per loop
(note that block_until_ready() is required for accurate micro-benchmarks because of JAX's asynchronous dispatch)
JIT-compiling this code gives a 150x speedup. It's still not as fast as numpy because of JAX's per-call dispatch overhead, but with JIT that overhead is incurred only once per call rather than once per operation. And when you move past microbenchmarks to more complicated sequences of real-world computations, that overhead will no longer dominate, and the optimization provided by the XLA compiler can make JAX far faster than the equivalent numpy computation.
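As a rough illustration of that last point (a toy sketch; the function and timings are only illustrative), a short chain of operations ending in a reduction is fused by jit into a single XLA computation, so the dispatch cost is paid once per call rather than once per operation:

import jax
import jax.numpy as jnp

def chain(x):
    # slice, elementwise math, and a reduction, all compiled into one XLA computation
    return jnp.tanh(x[435:852] * 2.0 + 1.0).sum()

chain_jit = jax.jit(chain)
x = jnp.ones((1000,))
chain_jit(x).block_until_ready()   # first call triggers compilation
# %timeit chain_jit(x).block_until_ready()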
So I am calculating Poisson distributions using large amounts of data. I have an array of shape (2666667, 19) - "spikes", and an array of shape (19, 100) - "placefields". I used to have a for loop that iterated through the 2666667 dimension, which took around 60 seconds to complete. Then I learned that if I vectorize for loops, they become much faster, so I tried to do so. The vectorized form works and outputs the same results; however, it now takes 120 seconds :/
Here is the original loop (60s):
import numpy as np
from scipy import stats

def compute_probability(spikes, placefields):
    nTimeBins = len(spikes[0])
    probability = np.empty((nTimeBins, 99))  # empty probability matrix
    for i in range(nTimeBins):
        nspikes = np.tile(spikes[:, i], (99, 1))
        nspikes = np.swapaxes(nspikes, 0, 1)
        maxL = stats.poisson.pmf(nspikes, placefields)
        maxL = maxL.prod(axis=0)
        probability[i, :] = maxL
    return probability
And here is the vectorised form (120s)
def compute_probability(spikes, placefields):
    placefields = np.reshape(placefields, (19, 99, 1))
    # prepared placefields
    nspikes = np.tile(spikes, (99, 1, 1))
    nspikes = np.swapaxes(nspikes, 0, 1)
    # prepared nspikes
    probability = stats.poisson.pmf(nspikes, placefields)
    probability = np.swapaxes(probability.prod(axis=0), 0, 1)
    return probability
Why is it SO SLOW? I think it might be that the tiled arrays created by the vectorized form are so gigantic that they take up a huge amount of memory. How can I make it go faster?
download samplespikes and sampleplacefields (as suggested by the comments)- https://mega.nz/file/lpRF1IKI#YHq1HtkZ9EzYvaUdlrMtBwMg-0KEwmhFMYswxpaozXc
EDIT:
The issue was that although it was vectorized, the huge array was taking up too much RAM. I have split the calculation into chunks, and it does better now:
placefields = np.reshape(placefields, (len(placefields), 99, 1))
nspikes = np.swapaxes(np.tile(spikes, (xybins, 1, 1)), 0, 1)
probability = np.empty((len(spikes[0]), xybins))
chunks = len(spikes[0]) // 20
n = int(len(spikes[0]) / chunks)
for i in range(0, len(nspikes[0][0]), n):
    nspikes_chunk = nspikes[:, :, i:i+n]
    probability_chunk = stats.poisson.pmf(nspikes_chunk, placefields)
    probability_chunk = np.swapaxes(probability_chunk.prod(axis=0), 0, 1)
    if len(probability_chunk) < (len(spikes) // chunks):
        probability[i:] = probability_chunk
    else:
        probability[i:i+len(probability_chunk)] = probability_chunk
This is likely due to memory/cache effects.
The first code works on small arrays that fit in the CPU caches, but it is not great because each Numpy function call takes some time. The second code fixes that issue; however, it allocates and fills huge arrays of several GiB in memory. It is much faster to work in the CPU caches than in main memory (RAM). This is especially true when the working arrays are used only once (because of expensive OS page faults), which seems to be the case in your code. If you do not have enough memory, the OS will read/write temporary data to SSD/HDD storage devices, which are very slow compared to the RAM and the CPU caches.
The best solution is probably to work on chunks so that the operation is both vectorized (reducing the overhead of the Numpy function calls) and fits in the CPU caches (reducing the cost of RAM reads/writes). Note that the size of the last-level cache is typically a few MiB nowadays on mainstream PC processors.
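A minimal sketch of that idea (assuming, as in the vectorized version above, that spikes has shape (n_cells, n_time_bins) and placefields has shape (n_cells, n_place_bins); broadcasting also removes the need for np.tile entirely):

import numpy as np
from scipy import stats

def compute_probability_chunked(spikes, placefields, chunk_size=4096):
    n_cells, n_bins = spikes.shape
    n_place = placefields.shape[1]
    pf = placefields.reshape(n_cells, n_place, 1)
    probability = np.empty((n_bins, n_place))
    for start in range(0, n_bins, chunk_size):
        end = min(start + chunk_size, n_bins)
        # broadcasting builds only a small (n_cells, n_place, chunk) temporary per chunk
        chunk = spikes[:, start:end].reshape(n_cells, 1, end - start)
        pmf = stats.poisson.pmf(chunk, pf)
        probability[start:end, :] = pmf.prod(axis=0).T
    return probability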
The takeaway message is that vectorization does not always make things faster. For better performance, one should care about the size of the manipulated data chunks so that they fit in the CPU caches.
PS: note that if you do not care too much about precision, you can use single precision (np.float32) instead of double precision (np.float64) to speed up the computation a bit.
I've tested an example demonstrated in this talk [pytables] using numpy (page 20/57).
It is stated that a[:,1].sum() takes 9.3 ms, whereas a[1,:].sum() takes only 72 us.
I tried to reproduce it, but failed to do so. Am I measuring wrongly? Or have things changed in NumPy since 2010?
$ python2 -m timeit -n1000 --setup \
'import numpy as np; a = np.random.randn(4000,4000);' 'a[:,1].sum()'
1000 loops, best of 3: 16.5 usec per loop
$ python2 -m timeit -n1000 --setup \
'import numpy as np; a = np.random.randn(4000,4000);' 'a[1,:].sum()'
1000 loops, best of 3: 13.8 usec per loop
$ python2 --version
Python 2.7.7
$ python2 -c 'import numpy; print numpy.version.version'
1.8.1
While I can measure a benefit of the second version (supposedly fewer cache misses because numpy uses C-style row ordering), I don't see that drastic difference as stated by the pytables contributor.
Also, it seems I cannot see more cache misses when using column vs. row summation.
EDIT
So far the insight for me was that I was using the timeit module in the wrong way. Repeated runs with the same array (or row/column of an array) will almost certainly be cached (I've got 32 KiB of L1 data cache, and a row of 4000 float64 values is 4000 * 8 bytes = 31.25 KiB, so it just fits).
Using the script in the answer of @alim with a single loop (nloop=1) and ten trials (nrep=10), and varying the size of the random array (n x n), I am measuring:
n      row/us   col/us   col penalty
1k     90       100      1
4k     100      210      2
10k*   110      350      3.5
20k*   120      1200     10
* n=10k and higher doesn't fit into the L1d cache anymore.
I'm still not sure about tracing down the cause of this as perf shows about the same rate of cache misses (sometimes even a higher rate) for the faster row sum.
Perf data:
nloop = 2 and nrep=2, so I expect some of the data still in the cache... for the second run.
Row sum n=10k
perf stat -B -e cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses,L1-dcache-prefetches,cycles,instructions,branches,faults,migrations ./answer1.py 2>&1 | sed 's/^/ /g'
row sum: 103.593 us
Performance counter stats for './answer1.py':
25850670 cache-references [30.04%]
1321945 cache-misses # 5.114 % of all cache refs [20.04%]
5706371393 L1-dcache-loads [20.00%]
11733777 L1-dcache-load-misses # 0.21% of all L1-dcache hits [19.97%]
2401264190 L1-dcache-stores [20.04%]
131964213 L1-dcache-store-misses [20.03%]
2007640 L1-dcache-prefetches [20.04%]
21894150686 cycles [20.02%]
24582770606 instructions # 1.12 insns per cycle [30.06%]
3534308182 branches [30.01%]
3767 faults
6 migrations
7.331092823 seconds time elapsed
Column sum n=10k
perf stat -B -e cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses,L1-dcache-prefetches,cycles,instructions,branches,faults,migrations ./answer1.py 2>&1 | sed 's/^/ /g'
column sum: 377.059 us
Performance counter stats for './answer1.py':
26673628 cache-references [30.02%]
1409989 cache-misses # 5.286 % of all cache refs [20.07%]
5676222625 L1-dcache-loads [20.06%]
11050999 L1-dcache-load-misses # 0.19% of all L1-dcache hits [19.99%]
2405281776 L1-dcache-stores [20.01%]
126425747 L1-dcache-store-misses [20.02%]
2128076 L1-dcache-prefetches [20.04%]
21876671763 cycles [20.00%]
24607897857 instructions # 1.12 insns per cycle [30.00%]
3536753654 branches [29.98%]
3763 faults
9 migrations
7.327833360 seconds time elapsed
EDIT2
I think I have understood some aspects, but I don't think the question has been answered yet. At the moment I think this summation example doesn't reveal anything about CPU caches at all. In order to eliminate uncertainty introduced by numpy/python, I tried using perf on the summation done in C, and the results are in an answer below.
I don't see anything wrong with your attempt at replication, but bear in mind that those slides are from 2010, and numpy has changed rather a lot since then. Based on the dates of numpy releases, I would guess that Francesc was probably using v1.5.
Using this script to benchmark row v column sums:
#!python
import numpy as np
import timeit
print "numpy version == " + str(np.__version__)
setup = "import numpy as np; a = np.random.randn(4000, 4000)"
rsum = "a[1, :].sum()"
csum = "a[:, 1].sum()"
nloop = 1000
nrep = 3
print "row sum:\t%.3f us" % (
min(timeit.repeat(rsum, setup, repeat=nrep, number=nloop)) / nloop * 1E6)
print "column sum:\t%.3f us" % (
min(timeit.repeat(csum, setup, repeat=nrep, number=nloop)) / nloop * 1E6)
I detect about a 50% slowdown for column sums with numpy v1.5:
$ python sum_benchmark.py
numpy version == 1.5.0
row sum: 8.472 us
column sum: 12.759 us
Compared with about a 30% slowdown with v1.8.1, which you're using:
$ python sum_benchmark.py
numpy version == 1.8.1
row sum: 12.108 us
column sum: 15.768 us
It's interesting to note that both types of reduction have actually gotten a bit slower in the more recent numpy versions. I would have to delve a lot deeper into numpy's source code to understand exactly why this is the case.
Update
For the record, I'm running Ubuntu 14.04 (kernel v3.13.0-30) on a quad-core i7-2630QM CPU @ 2.0 GHz. Both versions of numpy were pip-installed and compiled using GCC 4.8.1.
I realize my original benchmarking script wasn't totally self-explanatory - you need to divide the total time by the number of loops (1000) in order to get the time per call.
It also probably makes more sense to take the minimum across repeats rather than the average, since this is more likely to represent the lower bound on the execution time (on top of which you'd get variability due to background processes etc.).
I've updated my script and results above accordingly.
We can also negate any effect of caching across calls (temporal locality) by creating a brand-new random array for every call - just set nloop to 1 and nrep to a reasonably small number (unless you really enjoy watching paint dry), say 10.
nloop=1, nreps=10 on a 4000x4000 array:
numpy version == 1.5.0
row sum: 47.922 us
column sum: 103.235 us
numpy version == 1.8.1
row sum: 66.996 us
column sum: 125.885 us
That's a bit more like it, but I still can't really replicate the massive effect that Francesc's slides show. Perhaps this isn't that surprising, though - the effect may be very compiler-, architecture-, and/or kernel-dependent.
Interesting. I can reproduce Sebastian's performance:
In [21]: np.__version__
Out[21]: '1.8.1'
In [22]: a = np.random.randn(4000, 4000)
In [23]: %timeit a[:, 1].sum()
100000 loops, best of 3: 12.4 µs per loop
In [24]: %timeit a[1, :].sum()
100000 loops, best of 3: 10.6 µs per loop
However, if I try with a larger array:
In [25]: a = np.random.randn(10000, 10000)
In [26]: %timeit a[:, 1].sum()
10000 loops, best of 3: 21.8 µs per loop
In [27]: %timeit a[1, :].sum()
100000 loops, best of 3: 15.8 µs per loop
but, if I try again:
In [28]: a = np.random.randn(10000, 10000)
In [29]: %timeit a[:, 1].sum()
10000 loops, best of 3: 64.4 µs per loop
In [30]: %timeit a[1, :].sum()
100000 loops, best of 3: 15.9 µs per loop
So, I'm not sure what's going on here, but this jitter is probably due to cache effects. Perhaps newer architectures are wiser in predicting access patterns and hence do better prefetching?
At any rate, and for comparison purposes, I am using NumPy 1.8.1, Linux Ubuntu 14.04 and a laptop with an i5-3380M CPU @ 2.90 GHz.
EDIT: After thinking a bit about this, yes, I would say that the first time timeit executes the sum, the column (or the row) is fetched from RAM, but the second time the operation runs, the data is in cache (for both the row-wise and column-wise versions), so it executes fast. Since timeit takes the minimum of the runs, this is why we don't see a big difference in times.
Another question is why we see the difference sometimes (using timeit). But caches are weird beasts, most especially on multicore machines running multiple processes at a time.
I wrote the summation example in C. The results are CPU time measurements, and I always used gcc -O1 using-c.c to compile (gcc version 4.9.0 20140604). The source code is below.
I chose the matrix size to be n x n. For n<2k the row and column summation do not have any measurable difference (6-7 us per run for n=2k).
Row summation
n first/us converged/us
1k 5 4
4k 19 12
10k 35 31
20k 70 61
30k 130 90
e.g. n=20k
Run 0 taken 70 cycles. 0 ms 70 us
Run 1 taken 61 cycles. 0 ms 60 us # this is the minimum I've seen in all tests
Run 1 taken 61 cycles. 0 ms 61 us
<snip> (always 60/61 cycles)
Column summation
n first/us converged/us
1k 5 4
4k 112 14
10k 228 32
20k 550 246
30k 1000 300
e.g. n=20k
Run 0 taken 552 cycles. 0 ms 552 us
Run 1 taken 358 cycles. 0 ms 358 us
Run 2 taken 291 cycles. 0 ms 291 us
Run 3 taken 264 cycles. 0 ms 264 us
Run 4 taken 252 cycles. 0 ms 252 us
Run 5 taken 275 cycles. 0 ms 275 us
Run 6 taken 262 cycles. 0 ms 262 us
Run 7 taken 249 cycles. 0 ms 249 us
Run 8 taken 249 cycles. 0 ms 249 us
Run 9 taken 246 cycles. 0 ms 246 us
Discussion
Row summation is faster. It doesn't benefit much from caching, i.e. repeated sums are not much faster than the initial sum. Column summation is much slower, but it steadily speeds up over 5-8 iterations. The speedup is most pronounced from n=4k to n=10k, where caching increases the speed about tenfold. For larger arrays, the speedup is only about a factor of 2. I also observe that while row summation converges very quickly (after one or two trials), column summation takes many more iterations to converge (5 or more).
Takeaway lessons for me:
For large arrays (more than 2k elements) there is a difference in summation speed. I believe this is due to spatial locality when fetching data from RAM into the L1d cache. Although I don't know the block/line size of one read, I assume it is larger than 8 bytes, so the next element to sum is already in the cache.
Column sum speed is first and foremost limited by memory bandwidth. The CPU seems to be starved for data as the spread-out elements are read from RAM.
When performing the summation repeatedly, one expects that some data doesn't need to be fetched from RAM and is already present in L2/L1d cache. For row summation, this is noticeable only for n>30k, for column summation it becomes apparent already at n>2k.
Using perf, I don't see a large difference though. But the bulk work of the C program is filling the array with random data. I don't know how I can eliminate this "setup" data...
Here is the C code for this example:
#include <stdio.h>
#include <stdlib.h> // see `man random`
#include <time.h>   // man time.h, info clock

int
main (void)
{
    // seed
    srandom(62);
    //printf ("test %g\n", (double)random()/(double)RAND_MAX);
    const size_t SIZE = 20E3;
    const size_t RUNS = 10;
    double (*b)[SIZE];
    printf ("Array size: %zux%zu, each %zu bytes. slice = %f KiB\n", SIZE, SIZE,
            sizeof(double), ((double)SIZE)*sizeof(double)/1024);
    b = malloc(sizeof *b * SIZE);
    //double a[SIZE][SIZE]; // too large!

    int i, j;
    for (i = 0; i < SIZE; i++) {
        for (j = 0; j < SIZE; j++) {
            b[i][j] = (double)random()/(double)RAND_MAX;
        }
    }

    double sum = 0;
    int run = 0;
    clock_t start, diff;
    int usec;
    for (run = 0; run < RUNS; run++) {
        start = clock();
        for (i = 0; i < SIZE; i++) {
            // column wise (slower?)
            sum += b[i][1];
            // row wise (faster?)
            //sum += b[1][i];
        }
        diff = clock() - start;
        usec = ((double) diff*1e6) / CLOCKS_PER_SEC; // https://stackoverflow.com/a/459704/543411
        printf("Run %d taken %d cycles. %d ms %d us\n", run, (int)diff, usec/1000, usec%1000);
    }
    printf("Sum: %g\n", sum);
    return 0;
}
I'm using Numpy 1.9.0.def-ff7d5f9, and I see a 10x difference when executing the two test lines you posted. I wouldn't be surprised if your machine and the compiler used to build Numpy are as important to the speedup as the Numpy version is.
In practice though, I don't think it's too common to want to do a reduction of a single column or row like this. I think a better test would be to compare reducing across all rows
a.sum(axis=0)
with reducing across all columns
a.sum(axis=1)
For me, these two operations have only a small difference in speed (reducing across columns takes about 95% of the time of reducing across rows).
EDIT: In general, I'm very wary about comparing the speed of operations that take on the order of microseconds. When installing Numpy, it's very important to have a good BLAS library linked with it, since this is what does the heavy lifting when it comes to most large matrix operations (such as matrix-matrix multiplies). When comparing BLAS libraries, you definitely want to use intensive operations like matrix-matrix dot products as the point of comparison, since this is where you'll spend the vast majority of your time. I've found that sometimes, a worse BLAS library will actually have a slightly faster vector-vector dot operation than a better BLAS library. What makes it worse, though, is that operations like matrix-matrix dot products and eigenvalue decompositions take tens of times longer, and these matter much more than a cheap vector-vector dot. I think these differences often appear because you can write a reasonably fast vector-vector dot in C without much thought, but writing a good matrix-matrix dot takes a lot of thought and optimization, and is the more costly operation, so this is where good BLAS packages put their effort.
The same is true in Numpy: any optimization is going to be done on larger operations, not small ones, so don't get hung up on speed differences between small operations. Furthermore, it's hard to tell if any speed difference in a small operation is really due to the time of the computation, or is just due to overhead added to optimize more costly operations.
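For instance, here is a sketch of the kind of benchmark the above suggests focusing on (sizes are arbitrary and timings depend heavily on the linked BLAS):

import numpy as np
import timeit

a = np.random.randn(2000, 2000)
b = np.random.randn(2000, 2000)

# a BLAS-bound matrix-matrix product is a better proxy for overall performance
# than a microsecond-scale row/column reduction
t = timeit.timeit(lambda: np.dot(a, b), number=10) / 10
print("matrix-matrix dot: %.1f ms per call" % (t * 1e3))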