How to speed up pandas.rolling weighted_average or other function?

How to speed up pandas.rolling weighted_average or other function? - python

I want to calculate the weighted average for a long array.
which rolling in a fixed-size window.
For example:
In [1]: import pandas as pd
In [2]: a = pd.Series(range(int(1e8)))
In [5]: import numpy as np; w = np.array(list(range(10)));
In [6]: a.rolling(10).apply(lambda x: (x * w).sum())
as i tried, this is very slow, I read some blog, sometimes it can be speeded up by:
a.rolling(10).apply(np.argmax, engine='numba', raw=True)
but this can only be used in build-in function, for some customed function, seems not work.
do you know how to make it standable in costing time?

Solution with np.convolve
We can convolve the weights w over the series a using the valid mode convolution operation, essentially this has the same effect as calculating the rolling weighted sum.
s = np.convolve(a, w[::-1], 'valid')
s = [np.nan] * (len(w) - 1) + list(s)
Solution with sliding_window_view
Alternatively, we can also use sliding_window_view to speed up the rolling weighted sum computation
from numpy.lib.stride_tricks import sliding_window_view
s = (sliding_window_view(a, len(w)) * w).sum(1)
s = [np.nan] * (len(w) - 1) + list(s)
Timings
a = pd.Series(range(int(1e4)))
%%timeit
s = np.convolve(a, w[::-1], 'valid')
s = [np.nan] * (len(w) - 1) + list(s)
# 1000 loops, best of 5: 626 µs per loop
%%timeit
s = (sliding_window_view(a, len(w)) * w).sum(1)
s = [np.nan] * (len(w) - 1) + list(s)
# 1000 loops, best of 5: 1.2 ms per loop
%%timeit
s = a.rolling(10).apply(lambda x: (x * w).sum())
# 1 loop, best of 5: 3.6 s per loop
As it is evident from the performance test using np.convolve is about 5750x faster, while using sliding_window_view is around 2880x faster compared to the pandas rolling + apply method.

Related

How I can improve a loop for in a Python code?

I am translating this code from Matlab to Python. The code function fine but it is painfully slow in python. In Matlab, the code runs in way less then a minute, in python it took 30 min!!! Someone with mode experience in python could help me?
# P({ai})
somai = 0
for i in range(1, n):
somaj = 0
for j in range(1, n):
exponencial = math.exp(-((a[i] - a[j]) * (a[i] - a[j])) / dev_a2 - ((b[i] - b[j]) * (b[i] - b[j])) / dev_b2)
somaj = somaj + exponencial
somai = somai + somaj

As with MATLAB, I'd recommend you vectorize your code. Iterating by for-loops can be much slower than the lower level implementation of MATLAB and numpy.
Your operations (a[i] - a[j])*(a[i] - a[j]) are pairwise squared-Euclidean distance for all N data points. You can calculate a pairwise distance matrix using scipy's pdist and squareform functions -- pdist, squareform.
Then you calculate the difference between pairwise distance matrices A and B, and sum the exponential decay. So you could get a vectorized code like:
import numpy as np
from scipy.spatial.distance import pdist
from scipy.spatial.distance import squareform
# Example data
N = 1000
a = np.random.rand(N,1)
b = np.random.rand(N,1)
dev_a2 = np.random.rand()
dev_b2 = np.random.rand()
# `a` is an [N,1] matrix (i.e. column vector)
A = pdist(a, 'sqeuclidean')
# Change to pairwise distance matrix
A = squareform(A)
# Divide all elements by same divisor
A = A / dev_a2
# Then do the same for `b`'s
# `b` is an [N,1] matrix (i.e. column vector)
B = pdist(b, 'sqeuclidean')
B = squareform(B)
B = B / dev_b2
# Calculate exponential decay
expo = np.exp(-(A-B))
# Sum all elements
total = np.sum(expo)
Here's a quick timing comparison between the iterative method and this vectorized code.
N: 1000 | Iter Output: 2729989.851117 | Vect Output: 2732194.924364
Iter time: 6.759 secs | Vect time: 0.031 secs
N: 5000 | Iter Output: 24855530.997400 | Vect Output: 24864471.007726
Iter time: 171.795 secs | Vect time: 0.784 secs
Note that the final results are not exactly the same. I'm not sure why this is, it might be rounding error or math error on my part, but I'll leave that to you.

TLDR
Use numpy
Why Numpy?
Python, by default, is slow. One of the powers of python is that it plays nicely with C and has tons of libraries. The one that will help you hear is numpy. Numpy is mostly implemented in C and, when used properly, is blazing fast. The trick is to phrase the code in such a way that you keep the execution inside numpy and outside of python proper.
Code and Results
import math
import numpy as np
n = 1000
np_a = np.random.rand(n)
a = list(np_a)
np_b = np.random.rand(n)
b = list(np_b)
dev_a2, dev_b2 = (1, 1)
def old():
somai = 0.0
for i in range(0, n):
somaj = 0.0
for j in range(0, n):
tmp_1 = -((a[i] - a[j]) * (a[i] - a[j])) / dev_a2
tmp_2 = -((b[i] - b[j]) * (b[i] - b[j])) / dev_b2
exponencial = math.exp(tmp_1 + tmp_2)
somaj += exponencial
somai += somaj
return somai
def new():
tmp_1 = -np.square(np.subtract.outer(np_a, np_a)) / dev_a2
tmp_2 = -np.square(np.subtract.outer(np_b, np_b)) / dev_a2
exponential = np.exp(tmp_1 + tmp_2)
somai = np.sum(exponential)
return somai
old = 1.76 s ± 48.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
new = 24.6 ms ± 66.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
This is about a 70x improvement
old yields 740919.6020840995
new yields 740919.602084099
Explanation
You'll notice I broke up your code with the tmp_1 and tmp_2 a bit for clarity.
np.random.rand(n): This creates an array of length n that has random floats going from 0 to 1 (excluding 1) (documented here).
np.subtract.outer(a, b): Numpy has modules for all the operators that allow you do various things with them. Lets say you had np_a = [1, 2, 3], np.subtract.outer(np_a, np_a) would yield
array([[ 0, -1, -2],
[ 1, 0, -1],
[ 2, 1, 0]])
Here's a stackoverflow link if you want to go deeper on this. (also the word "outer" comes from "outer product" like from linear algebra)
np.square: simply squares every element in the matrix.
/: In numpy when you do arithmetic operators between scalars and matrices it does the appropriate thing and applies that operation to every element in the matrix.
np.exp: like np.square
np.sum: sums every element together and returns a scalar.

Fast inner product of two 2-d masked arrays in numpy

My problem is the following. I have two arrays X and Y of shape n, p where p >> n (e.g. n = 50, p = 10000).
I also have a mask mask (1-d array of booleans of size p) with respect to p, of small density (e.g. np.mean(mask) is 0.05).
I try to compute, as fast as possible, the inner product of X and Y with respect to mask: the output inner is an array of shape n, n, and is such that inner[i, j] = np.sum(X[i, np.logical_not(mask)] * Y[j, np.logical_not(mask)]).
I have tried using the numpy.ma library, but it is quite slow for my use:
import numpy as np
import numpy.ma as ma
n, p = 50, 10000
density = 0.05
mask = np.array(np.random.binomial(1, density, size=p), dtype=np.bool_)
mask_big = np.ones(n)[:, None] * mask[None, :]
X = np.random.randn(n, p)
Y = np.random.randn(n, p)
X_ma = ma.array(X, mask=mask_big)
Y_ma = ma.array(Y, mask=mask_big)
But then, on my machine, X_ma.dot(Y_ma.T) is about 5 times slower than X.dot(Y.T)...
To begin with, I think it is a problem that .dot does not know that the mask is only with respect to p but I don't if its possible to use this information.
I'm looking for a way to perform the computation without being much slower than the naive dot.
Thanks a lot !

We can use matrix-multiplication with and without the masked versions as the masked subtraction from the full version yields to us the desired output -
inner = X.dot(Y.T)-X[:,mask].dot(Y[:,mask].T)
Or simply use the reversed mask, would be slower though for a sparsey mask -
inner = X[:,~mask].dot(Y[:,~mask].T)
Timings -
In [34]: np.random.seed(0)
...: p,n = 10000,50
...: X = np.random.rand(n,p)
...: Y = np.random.rand(n,p)
...: mask = np.random.rand(p)>0.95
In [35]: mask.mean()
Out[35]: 0.0507
In [36]: %timeit X.dot(Y.T)-X[:,mask].dot(Y[:,mask].T)
100 loops, best of 3: 2.54 ms per loop
In [37]: %timeit X[:,~mask].dot(Y[:,~mask].T)
100 loops, best of 3: 4.1 ms per loop
In [39]: %%timeit
...: inner = np.empty((n,n))
...: for i in range(X.shape[0]):
...: for j in range(X.shape[0]):
...: inner[i, j] = np.sum(X[i, ~mask] * Y[j, ~mask])
1 loop, best of 3: 302 ms per loop

How to find the pairwise differences between rows of two very large matrices using numpy?

Given two matrices, I want to compute the pairwise differences between all rows. Each matrix has 1000 rows and 100 columns so they are fairly large. I tried using a for loop and pure broadcasting but the for loop seem to be working faster. Am I doing something wrong? Here is the code:
from numpy import *
A = random.randn(1000,100)
B = random.randn(1000,100)
start = time.time()
for a in A:
sum((a - B)**2,1)
print time.time() - start
# pure broadcasting
start = time.time()
((A[:,newaxis,:] - B)**2).sum(-1)
print time.time() - start
The broadcasting method takes about 1 second longer and it's even longer for large matrices. Any idea how to speed this up purely using numpy?

Here's another way to perform :
(a-b)^2 = a^2 + b^2 - 2ab
with np.einsum for the first two terms and dot-product for the third one -
import numpy as np
np.einsum('ij,ij->i',A,A)[:,None] + np.einsum('ij,ij->i',B,B) - 2*np.dot(A,B.T)
Runtime test
Approaches -
def loopy_app(A,B):
m,n = A.shape[0], B.shape[0]
out = np.empty((m,n))
for i,a in enumerate(A):
out[i] = np.sum((a - B)**2,1)
return out
def broadcasting_app(A,B):
return ((A[:,np.newaxis,:] - B)**2).sum(-1)
# #Paul Panzer's soln
def outer_sum_dot_app(A,B):
return np.add.outer((A*A).sum(axis=-1), (B*B).sum(axis=-1)) - 2*np.dot(A,B.T)
# #Daniel Forsman's soln
def einsum_all_app(A,B):
return np.einsum('ijk,ijk->ij', A[:,None,:] - B[None,:,:], \
A[:,None,:] - B[None,:,:])
# Proposed in this post
def outer_einsum_dot_app(A,B):
return np.einsum('ij,ij->i',A,A)[:,None] + np.einsum('ij,ij->i',B,B) - \
2*np.dot(A,B.T)
Timings -
In [51]: A = np.random.randn(1000,100)
...: B = np.random.randn(1000,100)
...:
In [52]: %timeit loopy_app(A,B)
...: %timeit broadcasting_app(A,B)
...: %timeit outer_sum_dot_app(A,B)
...: %timeit einsum_all_app(A,B)
...: %timeit outer_einsum_dot_app(A,B)
...:
10 loops, best of 3: 136 ms per loop
1 loops, best of 3: 302 ms per loop
100 loops, best of 3: 8.51 ms per loop
1 loops, best of 3: 341 ms per loop
100 loops, best of 3: 8.38 ms per loop

Here is a solution which avoids both the loop and the large intermediates:
from numpy import *
import time
A = random.randn(1000,100)
B = random.randn(1000,100)
start = time.time()
for a in A:
sum((a - B)**2,1)
print time.time() - start
# pure broadcasting
start = time.time()
((A[:,newaxis,:] - B)**2).sum(-1)
print time.time() - start
#matmul
start = time.time()
add.outer((A*A).sum(axis=-1), (B*B).sum(axis=-1)) - 2*dot(A,B.T)
print time.time() - start
Prints:
0.546781778336
0.674743175507
0.10723400116

Another job for np.einsum
np.einsum('ijk,ijk->ij', A[:,None,:] - B[None,:,:], A[:,None,:] - B[None,:,:])

Similar to #paul-panzer, a general way to compute pairwise differences of arrays of arbitrary dimension is broadcasting as follows:
Let v be a NumPy array of size (n, d):
import numpy as np
v_tiled_across = np.tile(v[:, np.newaxis, :], (1, v.shape[0], 1))
v_tiled_down = np.tile(v[np.newaxis, :, :], (v.shape[0], 1, 1))
result = v_tiled_across - v_tiled_down
For a better picture what's happening, imagine each d-dimensional row of v being propped up like a flagpole, and copied across and down. Now when you do component-wise subtraction, you're getting each pairwise combination.
__
There's also scipy.spatial.distance.pdist, which computes a metric in a pairwise fashion.
from scipy.spatial.distance import pdist, squareform
pairwise_L2_norms = squareform(pdist(v))

Summation Evaluation in python

Given
z = np.linspace(1,10,100)
Calculate Summation over all values of z in z^k * exp((-z^2)/ 2)
import numpy as np
import math
def calc_Summation1(z, k):
ans = 0.0
for i in range(0, len(z)):`
ans += math.pow(z[i], k) * math.exp(math.pow(-z[i], 2) / 2)
return ans
def calc_Summation2(z,k):
part1 = z**k
part2 = math.exp(-z**2 / 2)
return np.dot(part1, part2.transpose())
Can someone tell me what is wrong with both calc_Summation1 and calc_Summation2?

I think this might be what you're looking for:
sum(z_i**k * math.exp(-z_i**2 / 2) for z_i in z)

If you want to vectorize calculations with numpy, you need to use numpy's ufuncs. Also, the usual way of doing you calculation would be:
import numpy as np
calc = np.sum(z**k * np.exp(-z*z / 2))
although you can keep your approach using np.dot if you call np.exp instead of math.exp:
calc = np.dot(z**k, np.exp(-z*z / 2))
It does run faster with dot:
In [1]: z = np.random.rand(1000)
In [2]: %timeit np.sum(z**5 * np.exp(-z*z / 2))
10000 loops, best of 3: 142 µs per loop
In [3]: %timeit np.dot(z**5, np.exp(-z*z / 2))
1000 loops, best of 3: 129 µs per loop
In [4]: np.allclose(np.sum(z**5 * np.exp(-z*z / 2)),
... np.dot(z**5, np.exp(-z*z / 2)))
Out[4]: True

k=1
def myfun(z_i):
return z_i**k * math.exp(-z_i**2 / 2)
sum(map(myfun,z))
We define a function for the thing we want to sum, use the map function to apply it to each value in the list and then sum all these values. Having to use an external variable k is slightly niggling.
A refinement would be to define a two argument function
def myfun2(z_i,k):
return z_i**k * math.exp(-z_i**2 / 2)
and use a lambda expression to evaluate it
sum(map(lambda x:myfun2(x,1), z))

Slow computation: could itertools.product be the culprit?

A numerical integration is taking exponentially longer than I expect it to. I would like to know if the way that I implement the iteration over the mesh could be a contributing factor. My code looks like this:
import numpy as np
import itertools as it
U = np.linspace(0, 2*np.pi)
V = np.linspace(0, np.pi)
for (u, v) in it.product(U,V):
# values = computation on each grid point, does not call any outside functions
# solution = sum(values)
return solution
I left out the computations because they are long and my question is specifically about the way that I have implemented the computation over the parameter space (u, v). I know of alternatives such as numpy.meshgrid; however, these all seem to create instances of (very large) matrices, and I would guess that storing them in memory would slow things down.
Is there an alternative to it.product that would speed up my program, or should I be looking elsewhere for the bottleneck?
Edit: Here is the for loop in question (to see if it can be vectorized).
import random
import numpy as np
import itertools as it
##########################################################################
# Initialize the inputs with random (to save space)
##########################################################################
mat1 = np.array([[random.random() for i in range(3)] for i in range(3)])
mat2 = np.array([[random.random() for i in range(3)] for i in range(3)])
a1, a2, a3 = np.array([random.random() for i in range(3)])
plane_normal = np.array([random.random() for i in range(3)])
plane_point = np.array([random.random() for i in range(3)])
d = np.dot(plane_normal, plane_point)
truthval = True
##########################################################################
# Initialize the loop
##########################################################################
N = 100
U = np.linspace(0, 2*np.pi, N + 1, endpoint = False)
V = np.linspace(0, np.pi, N + 1, endpoint = False)
U = U[1:N+1] V = V[1:N+1]
Vsum = 0
Usum = 0
##########################################################################
# The for loops starts here
##########################################################################
for (u, v) in it.product(U,V):
cart_point = np.array([a1*np.cos(u)*np.sin(v),
a2*np.sin(u)*np.sin(v),
a3*np.cos(v)])
surf_normal = np.array(
[2*x / a**2 for (x, a) in zip(cart_point, [a1,a2,a3])])
differential_area = \
np.sqrt((a1*a2*np.cos(v)*np.sin(v))**2 + \
a3**2*np.sin(v)**4 * \
((a2*np.cos(u))**2 + (a1*np.sin(u))**2)) * \
(np.pi**2 / (2*N**2))
if (np.dot(plane_normal, cart_point) - d > 0) == truthval:
perp_normal = plane_normal
f = np.dot(np.dot(mat2, surf_normal), perp_normal)
Vsum += f*differential_area
else:
perp_normal = - plane_normal
f = np.dot(np.dot(mat2, surf_normal), perp_normal)
Usum += f*differential_area
integral = abs(Vsum) + abs(Usum)

If U.shape == (nu,) and (V.shape == (nv,), then the following arrays vectorize most of your calculations. With numpy you get the best speed by using arrays for the largest dimensions, and looping on the small ones (e.g. 3x3).
Corrected version
A = np.cos(U)[:,None]*np.sin(V)
B = np.sin(U)[:,None]*np.sin(V)
C = np.repeat(np.cos(V)[None,:],U.size,0)
CP = np.dstack([a1*A, a2*B, a3*C])
SN = np.dstack([2*A/a1, 2*B/a2, 2*C/a3])
DA1 = (a1*a2*np.cos(V)*np.sin(V))**2
DA2 = a3*a3*np.sin(V)**4
DA3 = (a2*np.cos(U))**2 + (a1*np.sin(U))**2
DA = DA1 + DA2 * DA3[:,None]
DA = np.sqrt(DA)*(np.pi**2 / (2*Nu*Nv))
D = np.dot(CP, plane_normal)
S = np.sign(D-d)
F1 = np.dot(np.dot(SN, mat2.T), plane_normal)
F = F1 * DA
#F = F * S # apply sign
Vsum = F[S>0].sum()
Usum = F[S<=0].sum()
With the same random values, this produces the same values. On a 100x100 case, it is 10x faster. It's been fun playing with these matrices after a year.

In ipython I did simple sum calculations on your 50 x 50 gridspace
In [31]: sum(u*v for (u,v) in it.product(U,V))
Out[31]: 12337.005501361698
In [33]: UU,VV = np.meshgrid(U,V); sum(sum(UU*VV))
Out[33]: 12337.005501361693
In [34]: timeit UU,VV = np.meshgrid(U,V); sum(sum(UU*VV))
1000 loops, best of 3: 293 us per loop
In [35]: timeit sum(u*v for (u,v) in it.product(U,V))
100 loops, best of 3: 2.95 ms per loop
In [38]: timeit list(it.product(U,V))
1000 loops, best of 3: 213 us per loop
In [45]: timeit UU,VV = np.meshgrid(U,V); (UU*VV).sum().sum()
10000 loops, best of 3: 70.3 us per loop
# using numpy's own sum is even better
product is slower (by factor 10), not because product itself is slow, but because of the point by point calculation. If you can vectorize your calculations so they use the 2 (50,50) arrays (without any sort of looping) it should speed up the overall time. That's the main reason for using numpy.

[k for k in it.product(U,V)] runs in 2ms for me, and the itertool package is made to be efficient, e.g. it does not create a long array first (http://docs.python.org/2/library/itertools.html).
The culprit seems to be your code inside the iteration, or your using a lot of points in linspace.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to speed up pandas.rolling weighted_average or other function? - python

Related

How I can improve a loop for in a Python code?

Fast inner product of two 2-d masked arrays in numpy

How to find the pairwise differences between rows of two very large matrices using numpy?

Summation Evaluation in python

Slow computation: could itertools.product be the culprit?

Categories

Resources