Rolling statistics performance: pandas vs. numpy strides

I am interested in calculating statistics in rolling windows on large, 1D numpy arrays. For small window sizes, using numpy strides (à la numpy.lib.stride_tricks.sliding_window_view) is faster than pandas' rolling window implementation, but the opposite is true for large window sizes.
Consider the following:
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view
import pandas as pd
data = np.random.randn(10**6)
data_pandas = pd.Series(data)
window = 2
%timeit np.mean(sliding_window_view(data, window), axis=1)
# 19.3 ms ± 255 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit data_pandas.rolling(window).mean()
# 34.3 ms ± 688 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
window = 1000
%timeit np.mean(sliding_window_view(data, window), axis=1)
# 302 ms ± 8.01 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit data_pandas.rolling(window).mean()
# 31.7 ms ± 958 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
result_numpy = np.mean(sliding_window_view(data, window), axis=1)
result_pandas = data_pandas.rolling(window).mean()[window-1:]
np.allclose(result_numpy, result_pandas)
# True
For the larger window size the pandas implementation is actually slightly faster than before, whereas the numpy implementation is an order of magnitude slower.
What is going on under the hood with pandas, and how can I get comparable performance on large windows with numpy?

TL;DR: The two versions use very different algorithms.
The sliding_window_view trick works for computing a rolling average with a small window, but it is neither a clean nor an efficient way to do it, especially with a big window. NumPy computes a plain mean over each materialized window and has no information that the user is using strides to compute a rolling statistic, so the provided numpy implementation runs in O(n * w), where n is the array size and w the window size. Pandas does have the information that a rolling average is being computed, so it uses a much more efficient algorithm that runs in O(n) time. For more information about it, please read this post.
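For intuition: a rolling mean can be maintained in O(1) per step by adding the element that enters the window and subtracting the one that leaves it. A minimal pure-Python sketch of this idea (illustrative only; pandas' actual implementation is compiled and also deals with NaNs and numerical error):
def rolling_mean_online(data, window):
    # O(n): update the window sum incrementally instead of
    # re-summing `window` elements at every output position
    out = np.empty(len(data) - window + 1)
    s = data[:window].sum()
    out[0] = s / window
    for i in range(window, len(data)):
        s += data[i] - data[i - window]  # add entering element, drop leaving one
        out[i - window + 1] = s / window
    return out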
Here is a much faster Numpy implementation:
# O(n): rolling mean as a difference of shifted cumulative sums
cumsum = np.cumsum(data)
invSize = 1. / window
(cumsum[window-1:] - np.concatenate([[0], cumsum[:-window]])) * invSize
Here are the performance results on my machine:
Naive Numpy version: 193.2 ms
Pandas version: 33.1 ms
Fast Numpy version: 8.5 ms
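Packaged as a reusable function, here is a sketch of the same cumsum idea (note that a single cumulative sum can accumulate floating-point error over very long arrays, something the pandas implementation mitigates internally):
def rolling_mean_cumsum(data, window):
    # O(n) rolling mean as a difference of shifted cumulative sums
    cumsum = np.cumsum(data)
    cumsum[window:] = cumsum[window:] - cumsum[:-window]
    return cumsum[window - 1:] / window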

Related

Numpy speed for computing metrics (like np.mean) over multiple axes vs. single axis

Currently, I am working with video data, so I perform statistical operations on multiple frames at once. During a debugging session I observed that computing numpy statistics (mean computation, in this case) over multiple axes takes longer when done directly over the desired axes than when computed over each axis separately, one after the other. I created a simple example to explain my observations.
from timeit import default_timer as timer
import numpy as np
rnd_frames = np.random.randn(100, 128, 128, 3)
n_reps = 1000
# -----------------------------------
# mean computation over multiple axes
# -----------------------------------
# all axes at once
ts = timer()
for i in range(n_reps):
    mean_1 = np.mean(rnd_frames, axis=(1, 2))
print('Mean all at once: ', (timer()-ts)/n_reps)
# one after the other
ts = timer()
for i in range(n_reps):
    mean_2 = np.mean(rnd_frames, axis=1)
    mean_2 = np.mean(mean_2, axis=1)
print('Mean one after the other: ', (timer()-ts)/n_reps)
print('Difference in means: ', np.sum(np.abs(mean_1-mean_2)))
The difference is very small and results from float64 precision.
Does someone have an explanation for this? The time differences are quite significant: one after the other is about 10x faster.
Is this some kind of bug, or can it be explained?
The time for 2 axes is the same as for a single axis on the equivalent reshape:
In [7]: timeit mean_1 = np.mean(rnd_frames, axis=(1, 2))
54.2 ms ± 202 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [11]: timeit mean_3 = np.mean(rnd_frames.reshape(100,-1,3), axis=1)
54.5 ms ± 142 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [12]: rnd_frames.reshape(100,-1,3).shape
Out[12]: (100, 16384, 3)
As you note, this is quite a bit slower than the sequential calculation:
In [13]: %%timeit
...: mean_2 = np.mean(rnd_frames, axis=1)
...: mean_2 = np.mean(mean_2, axis=1)
7.63 ms ± 49.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Without getting deep into the woods of compiled code, it's hard to say why there is this difference. While "avoiding loops" is a common performance strategy in numpy, that applies mostly to "many loops on a simple task". A few loops on a complex task can be faster. I'm not sure that applies here, but I'm not surprised that there are differences like this.
We could also explore whether putting those 2 axes at the end (innermost) or at the beginning of the dimensions shows this difference or not.
edit
If I move the 2 axes to either the beginning or the end, the time difference is much smaller. There's something about having that small size-3 dimension at the end (innermost) that's making your example unusually slow.
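For instance, a quick sketch of that experiment (assuming we simply generate the permuted shapes directly and reduce the two 128-sized axes each time):
import numpy as np
from timeit import default_timer as timer
for shape, axes in [((100, 128, 128, 3), (1, 2)),   # size-3 axis last (the slow case)
                    ((100, 3, 128, 128), (2, 3)),   # size-3 axis in the middle
                    ((3, 100, 128, 128), (2, 3))]:  # size-3 axis first
    arr = np.random.randn(*shape)
    ts = timer()
    for _ in range(100):
        np.mean(arr, axis=axes)
    print(shape, axes, (timer() - ts) / 100)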

Is there a better way of implementing a histogram?

I have a 2D uint16 numpy array, I want to calculate the histogram for this array.
The function I use is:
def calc_hist(source):
    hist = np.zeros(2**16, dtype='uint16')
    for i in range(source.shape[0]):
        for j in range(source.shape[1]):
            hist[source[i, j]] = hist[source[i, j]] + 1
    return hist
This function takes too much time to execute.
As I understand, there's a histogram function in the numpy module, but I can't figure out how to use it.
I've tried:
hist,_ = np.histogram(source.flatten(), bins=range(2**16))
But I get different results than with my own function.
How can I call numpy.histogram to achieve the same result? Or are there any other options?
For an input with data type uint16, numpy.bincount should work well:
hist = np.bincount(source.ravel(), minlength=2**16)
Your function is doing almost exactly what bincount does, but bincount is implemented in C.
For example, the following checks that this use of bincount gives the same result as your calc_hist function:
In [159]: rng = np.random.default_rng()
In [160]: x = rng.integers(0, 2**16, size=(1000, 1000))
In [161]: h1 = calc_hist(x)
In [162]: h2 = np.bincount(x.ravel(), minlength=2**16)
In [163]: (h1 == h2).all() # Verify that they are the same.
Out[163]: True
Check the performance with IPython's %timeit command. You can see that using bincount is much faster.
In [164]: %timeit calc_hist(x)
2.66 s ± 21.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [165]: %timeit np.bincount(x.ravel(), minlength=2**16)
3.13 ms ± 100 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
As Corley Brigman pointed out, passing bins=range(x) determines the bin edges [1]. So you will end up with x-1 bins, with corresponding edges [0, 1), [1, 2), ..., [x-2, x-1] (the last bin includes its right edge).
In your case, you will have 2^16-1 bins, so the two largest values share the last bin. To fix it, simply use range(2**16 + 1).
[1] https://numpy.org/doc/stable/reference/generated/numpy.histogram.html?highlight=histogram#numpy.histogram
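Putting that together, a sketch of the corrected np.histogram call, which should then agree with both calc_hist and bincount:
# one extra edge gives 2**16 bins, one per possible uint16 value
hist, _ = np.histogram(source, bins=range(2**16 + 1))
# same counts as: np.bincount(source.ravel(), minlength=2**16)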

Tensorflow `map_fn` takes long time to execute

Given tensors a of shape (n, f) and b of shape (m, f), I have created a function to calculate the euclidean distances between these two tensors:
import tensorflow as tf
nr = tf.reduce_sum(tf.square(a), 1)
nw = tf.reduce_sum(tf.square(b), 1)
nr = tf.reshape(nr, [-1, 1])
nw = tf.reshape(nw, [1, -1])
res = nr - 2*tf.matmul(a, b, False, True) + nw
res = tf.argmin(res, axis=1)
So far so good; the code runs reasonably fast (I got better performance with cKDTree when n=1000, m=1600, f=4, but that is not the issue now). I will check the performance against different input sizes later.
In this example the b tensor is a rank-2, flattened version of a rank-3 tensor. I do that to be able to evaluate the euclidean distances using two tensors of the same rank (which is simpler). But after evaluating the distances I need to know where in the original tensor each of the nearest elements is. For that I have created the custom lambda function fn to convert back to the rank-3 tensor coordinates.
fn = lambda x: (x//N, x%N)
# this map takes an enormous amount of time
out = tf.map_fn(fn, res, dtype=(tf.int64, tf.int64))
return tf.stack(out, axis=1)
But sadly this tf.map_fn takes a huge amount of time to run, around 300 ms.
Just for comparison, if I perform np.apply_along_axis on exactly the same data (but as a numpy array), the footprint is barely noticeable: around 50 microseconds vs. 300 ms for the tensorflow equivalent.
Are there better approaches in tensorflow for this mapping?
TF version 2.1.0 and CUDA is enabled and working.
Just to add some timings
%timeit eucl_dist_tf_vecmap(R_tf, W_tf)
28.1 ms ± 128 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit eucl_dist_tf_nomap(R_tf, W_tf)
2.07 ms ± 122 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit eucl_dist_ckdtree_applyaxis(R, W)
878 µs ± 2.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit eucl_dist_ckdtree_noapplyaxis(R, W)
817 µs ± 51 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The first two timings use the custom function shown here, the first with vectorized_map and the second without vectorized_map and the stack (the overhead is in vectorized_map; tested).
The last two timings are implementations based on scipy's cKDTree. The first uses np.apply_along_axis exactly as in the vectorized map. We can see that the overhead is much smaller with the numpy array.
You could try tf.vectorized_map. https://www.tensorflow.org/api_docs/python/tf/vectorized_map
If you need to change the data type, you can try changing the parallel_iterations value in the map_fn parameters, which is set to 1 by default in eager mode.
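For this particular mapping, though, floor division and modulo operate elementwise on whole tensors, so (a sketch worth trying, assuming res and N as in the question) the per-element map can be dropped entirely:
# // and % broadcast over the whole tensor, no per-element map needed
out = tf.stack([res // N, res % N], axis=1)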

numpy fftn very inefficient for 2d fft of several images

I wanted to compute the fourier transform of several images.
I was therefore benchmarking numpy's fft.fftn against a brute force for loop.
This is the code I used to benchmark the 2 approaches (in a jupyter notebook):
import numpy as np
x = np.random.rand(32, 256, 256)
def iterate_fft(arr):
    k = np.empty_like(arr, dtype=np.complex64)
    for i, a in enumerate(arr):
        k[i] = np.fft.fft2(a)
    return k
k_it = iterate_fft(x)
k_np = np.fft.fftn(x, axes=(1, 2))
np.testing.assert_allclose(k_it.real, k_np.real)
np.testing.assert_allclose(k_it.imag, k_np.imag)
%%timeit
k_it = iterate_fft(x)
Output: 63.6 ms ± 1.23 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
k_np = np.fft.fftn(x, axes=(1, 2))
Output: 122 ms ± 1.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Why is there such a huge difference?
A person involved in numpy's fft development has answered this question on GitHub: the slowdown is most likely coming from some multi-dimensional array rearrangement used by pocketfft.
It should become a thing of the past when numpy switches to the scipy 1.4 implementation, which my benchmark shows does not have these drawbacks.
These routines in numpy seem to currently assume that the last dimension will always be the smallest. When this is actually true, fftn is faster, sometimes by a lot.
That said, I get a much smaller difference in performance between these two methods than you do (with Python 3.7.4, numpy 1.17.2). For your example, iterate_fft takes 46 ms while fftn takes 50 ms. But if I flip the axes around to (256, 256, 32), I get 55 ms and 40 ms respectively. Pushing even further with a shape of (256, 256, 2), I get 21 ms and 4 ms respectively.
Note that if performance is really an issue, there are other FFT libraries available that perform better in some situations. Also the full fftpack in scipy can have very different performance than the more limited code in numpy.
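For example, a sketch using scipy.fft (assuming scipy >= 1.4 is installed), whose fftn also accepts a workers argument for multithreading:
import scipy.fft
# same transform as np.fft.fftn(x, axes=(1, 2)); workers=-1 uses all cores
k_sp = scipy.fft.fftn(x, axes=(1, 2), workers=-1)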
Note that your usage of fftn basically does:
x = np.random.rand(32, 256, 256)
a = np.fft.fft(x, n=256, axis=2)
a = np.fft.fft(a, n=256, axis=1)
np.testing.assert_allclose(np.fft.fftn(x, axes=(1, 2)), a)

The necessity of LU decomposition (using numpy as an example)

I am trying to understand the necessity of LU decomposition using the numpy and scipy libraries. From what I understand, to solve Ax = b we first factorize A into two triangular matrices L and U, then solve LUx = b by solving Ly = b followed by Ux = y. By solving triangular systems, we can reduce the time compared to Gaussian elimination.
So, I tried this idea in python using numpy and scipy.
I first construct A and b using toy examples:
A = np.array([[2, 1, 0, 5], [1, 2, 1, 2], [0, 1, 2, 4], [1, 3, 6, 4.5]])
b = np.array([9, 10, -2, 3])
Then I first solve this toy example with np.linalg.solve:
%timeit np.linalg.solve(A, b)
The time is
9.76 µs ± 782 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Then I use the factorization to solve this system:
from scipy import linalg
lu, piv = linalg.lu_factor(A)
%timeit linalg.lu_solve((lu, piv), b)
I saw the output
18.8 µs ± 213 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
which is quite slow compared to np.linalg.solve.
So, my question is: why is np.linalg.solve faster than the lu_factor/lu_solve approach? My guess is that numpy.linalg.solve does not use Gaussian elimination to solve the equations? I am a little confused by the result here.
Edit
Now, I use a much larger matrix to do the experiment (10000 x 10000).
Here is the result:
for np.linalg.solve
8.64 s ± 180 ms per loop (mean ± std. dev. of 7 runs, 1 loop each);
for scipy.linalg.lu_solve
121 ms ± 3.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each).
For lu_solve, I only count the time spent on solving; the decomposition part is not counted. It is now much faster!
Here is a partial answer, since I dispute one of your premises.
You write that "LU solve should be faster than Gaussian Elimination." You seem to misunderstand the purpose of LU decomposition. If you are solving just one such problem (Ax=b where matrix A and vector b are given), LU decomp is no faster than Gaussian elimination. Indeed, the decomposition's algorithm is very similar to the elimination and is no faster.
The advantage of LU decomposition comes when you are given matrix A and you want to solve the equation Ax=b for multiple different given vectors b. Gaussian elimination needs to start over from scratch, and each solution will take the same amount of time. In LU decomposition you can store the resulting matrices L and U from the first calculation, and that greatly speeds up the solutions to the succeeding equations that use different vectors b.
You can read more about this at the section in Numerical Recipes in C about LU Decomposition and Its Applications.
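As a concrete sketch of that reuse pattern (with made-up sizes), factor A once and solve against many right-hand sides:
import numpy as np
from scipy import linalg
A = np.random.rand(1000, 1000)
bs = [np.random.rand(1000) for _ in range(100)]
lu, piv = linalg.lu_factor(A)        # O(n^3) factorization, paid once
xs = [linalg.lu_solve((lu, piv), b)  # O(n^2) per right-hand side
      for b in bs]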
Look at the docstring for numpy.linalg.solve. It says in the "Notes" section "The solutions are computed using LAPACK routine _gesv". (The underscore is a place-holder for a character that corresponds to a data type. For example, dgesv uses double precision.)
The documentation for dgesv explains that it uses the LU decomposition. So you are more-or-less replicating the calculation, but you are doing more steps in Python, so your code is slower.
