Compute rolling z-score in pandas dataframe - python

Is there an open source function to compute a moving z-score, like https://turi.com/products/create/docs/generated/graphlab.toolkits.anomaly_detection.moving_zscore.create.html? I have access to pandas rolling_std for computing the std, but I want to see if it can be extended to compute rolling z-scores.

rolling.apply with a custom function is significantly slower than using the built-in rolling functions (such as mean and std). Therefore, compute the rolling z-score from the rolling mean and rolling std:
def zscore(x, window):
    r = x.rolling(window=window)
    m = r.mean().shift(1)
    s = r.std(ddof=0).shift(1)
    z = (x - m) / s
    return z
According to the definition given on that page, the rolling z-score depends on the rolling mean and std just prior to the current point. The shift(1) is used above to achieve this effect.
Below, even for a small Series (of length 100), zscore is over 5x faster than using rolling.apply. Since rolling.apply(zscore_func) calls zscore_func once for each rolling window in essentially a Python loop, the advantage of using the Cythonized r.mean() and r.std() functions becomes even more apparent as the size of the loop increases.
Thus, as the length of the Series increases, the speed advantage of zscore increases.
In [58]: %timeit zscore(x, N)
1000 loops, best of 3: 903 µs per loop
In [59]: %timeit zscore_using_apply(x, N)
100 loops, best of 3: 4.84 ms per loop
This is the setup used for the benchmark:
import numpy as np
import pandas as pd

np.random.seed(2017)

def zscore(x, window):
    r = x.rolling(window=window)
    m = r.mean().shift(1)
    s = r.std(ddof=0).shift(1)
    z = (x - m) / s
    return z

def zscore_using_apply(x, window):
    def zscore_func(x):
        return (x[-1] - x[:-1].mean()) / x[:-1].std(ddof=0)
    return x.rolling(window=window+1).apply(zscore_func)

N = 5
x = pd.Series((np.random.random(100) - 0.5).cumsum())
result = zscore(x, N)
alt = zscore_using_apply(x, N)
assert not ((result - alt).abs() > 1e-8).any()

You should use native functions of pandas:
# Compute rolling zscore for column ="COL" and window=window
col_mean = df["COL"].rolling(window=window).mean()
col_std = df["COL"].rolling(window=window).std()
df["COL_ZSCORE"] = (df["COL"] - col_mean)/col_std

def zscore(arr, window):
    x = arr.rolling(window=1).mean()       # rolling mean over a window of 1 is just the value itself
    u = arr.rolling(window=window).mean()  # rolling mean
    o = arr.rolling(window=window).std()   # rolling standard deviation
    return (x - u) / o

df['zscore'] = zscore(df['value'], window)

Let us say you have a DataFrame called data with one or more numeric columns. Then run the following code:
data_zscore = data.apply(lambda x: (x - x.expanding().mean()) / x.expanding().std())
Please note that the first row will always contain NaN values, since a single observation has no standard deviation.
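A minimal, self-contained illustration (with made-up column names and values, since the original screenshots are not reproduced here):

import numpy as np
import pandas as pd

# hypothetical stand-in for the 'data' frame shown in the original screenshots
data = pd.DataFrame({
    "A": np.random.randn(6).cumsum(),
    "B": np.random.randn(6).cumsum(),
})

data_zscore = data.apply(lambda x: (x - x.expanding().mean()) / x.expanding().std())
print(data_zscore)  # first row is NaN; later rows are expanding z-scores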

This can be solved in a single line of code. Given that s is the input series and wlen is the window length:
zscore = s.sub(s.rolling(wlen).mean()).div(s.rolling(wlen).std())
If you need to shift the mean and std (so the statistics exclude the current point), it can still be done:
zscore = s.sub(s.rolling(wlen).mean().shift()).div(s.rolling(wlen).std().shift())
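For example, with a made-up series s and window length wlen:

import numpy as np
import pandas as pd

s = pd.Series(np.random.randn(20).cumsum())  # example input series (assumed)
wlen = 5

zscore = s.sub(s.rolling(wlen).mean().shift()).div(s.rolling(wlen).std().shift())
print(zscore.tail())  # the first wlen values are NaN while the window fills up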


Is there a way to get Pandas ewm to function on fixed windows?

I am trying to use the Pandas ewm function to calculate exponentially weighted moving averages. However, I've noticed that information seems to carry through the entire time series. This means that every data point's moving average depends on a different number of previous data points, so the ewm calculation at every data point is mathematically different.
I think someone here had a similar question:
Does Pandas calculate ewm wrong?
But I did try their method, and I am not getting the functionality I want.
def EMA(arr, window):
    sma = arr.rolling(window=window, min_periods=window).mean()[:window]
    rest = arr[window:]
    return pd.concat([sma, rest]).ewm(com=window, adjust=False).mean()

a = pd.DataFrame([x for x in range(100)])
print(list(EMA(a, 10)[0])[-1])
print(list(EMA(a[50:], 10)[0])[-1])
In this example, I have an array of 100 increasing values. I calculate the moving average on the full array and on its last 50 values. The last moving average should be the same in both cases, since I am only using a window of 10. But when I run this code I get two different values, indicating that ewm is indeed dependent on the entire series.
IIUC, you are asking for ewm over a fixed rolling window, which means every window of 10 rows should return a single number. If that is the case, then we can use a stride trick.
Edit: updated the function so it works on a Series only:
def EMA(arr, window=10, alpha=0.5):
    ret = pd.Series(index=arr.index, name=arr.name)
    arr = np.array(arr)
    l = len(arr)
    stride = arr.strides[0]
    ret.iloc[window-1:] = (pd.DataFrame(np.lib.stride_tricks.as_strided(arr,
                                                                        (l-window+1, window),
                                                                        (stride, stride)))
                             .T.ewm(alpha)
                             .mean()
                             .iloc[-1]
                             .values
                          )
    return ret
Test:
a = pd.Series([x for x in range(100)])

EMA(a).tail(2)
# 98    97.500169
# 99    98.500169
# Name: 9, dtype: float64

EMA(a[50:]).tail(2)
# 98    97.500169
# 99    98.500169
# Name: 9, dtype: float64

EMA(a, 2).tail(2)
# 98    97.75
# 99    98.75
# dtype: float64
Test on random data:
import matplotlib.pyplot as plt

a = pd.Series(np.random.uniform(0, 1, 10000))
fig, ax = plt.subplots(figsize=(12, 6))
a.plot(ax=ax)
EMA(a, alpha=0.99, window=2).plot(ax=ax)
EMA(a, alpha=0.99, window=1500).plot(ax=ax)
plt.show()
Output: we can see that the larger window (green) is less volatile than the smaller window (orange).
This can be achieved by working with the formula for exponential smoothing and cancelling the lagged terms. The formula can be found on the ewm page of the pandas documentation.
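A sketch of the cancellation, derived from the recursive adjust=False formula (alpha is the smoothing factor and L the window length; this derivation is mine, not part of the original answer):

\begin{align}
y_t &= (1-\alpha)\,y_{t-1} + \alpha x_t
     = \alpha \sum_{k=0}^{L-1} (1-\alpha)^k x_{t-k} + (1-\alpha)^L y_{t-L},\\
y_t - (1-\alpha)^L y_{t-L} &= \alpha \sum_{k=0}^{L-1} (1-\alpha)^k x_{t-k}.
\end{align}

Dividing by \(f_t = 1 - (1-\alpha)^{\min(t,\,L)}\), the sum of the remaining weights, renormalizes the result; this is the factor f used for the adjust=True equivalent in the code below.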
The following code demonstrates that no memory is left after the adjustment. For every point, the fixed window of information used is L = 1000. The factor f should be included if you want the equivalent of the adjust=True version (for adjust=False, just drop the f factor).
srs1 = pd.Series(np.random.normal(size=100000))
alpha = 0.02
em1 = srs1.ewm(alpha=alpha, adjust=False).mean()

L = 1000
f = 1 - (1 - alpha) ** np.clip(np.arange(em1.shape[0]), 0, L)
em1_ = (em1 - em1.shift(L) * (1 - alpha) ** L) / f

S = 1001
em2 = srs1[S:].ewm(alpha=alpha, adjust=False).mean()
f = 1 - (1 - alpha) ** np.clip(np.arange(em2.shape[0]), 0, L)
em2_ = (em2 - em2.shift(L) * (1 - alpha) ** L) / f

print((em2_[:10000] - em1_[S:S+10000]).abs().max())
This seems to be possible in pandas 1.5 with a mix of rolling and win_type:
pd.Series.rolling(window=10, win_type='exponential').mean(tau=0.5, center=10, sym=False)
I use a non-symmetric exponential window centered at the size of the window in order to have an exponential function decaying towards the past.
This yields the same results as the EMA function provided by Quang Hoang.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

def EMA(arr, window=10, alpha=0.5):
    ret = pd.Series(index=arr.index, name=arr.name, dtype='float64')
    arr = np.array(arr)
    l = len(arr)
    stride = arr.strides[0]
    ret.iloc[window-1:] = (pd.DataFrame(np.lib.stride_tricks.as_strided(arr,
                                                                        (l-window+1, window),
                                                                        (stride, stride)))
                             .T.ewm(alpha)
                             .mean()
                             .iloc[-1]
                             .values
                          )
    return ret

a = pd.Series([x for x in range(100)])
custom = EMA(a)
builtin = a.rolling(window=10, win_type='exponential').mean(tau=0.5, center=10, sym=False)

custom.plot.line(label="Custom EMA")
builtin.plot.line(label="Built-in EMA")
plt.legend()

How to discretize a signal?

If I have a function like below:
G(s) = C/(s - p), where s = jw, and C and p are constant numbers.
Also, the available frequency is wa = 100000 rad/s. How can I discretize the signal at Δw = 0.0001*wa in Python?
Use numpy.arange to accomplish this:
import numpy as np

wa = 100000
# np.arange will generate every discrete value given the start, end, and step values
discrete_wa = np.arange(0, wa, 0.0001 * wa)

# let's say you have previously defined your function
g_s = [your_function(value) for value in discrete_wa]
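For completeness, a sketch of what your_function could look like for this transfer function; C and p are unspecified constants in the question, so the values below are made up:

import numpy as np

C = 2.0   # assumed gain constant
p = -5.0  # assumed pole location

def your_function(w):
    """Evaluate G(s) = C / (s - p) at s = jw."""
    s = 1j * w
    return C / (s - p)

wa = 100000
discrete_wa = np.arange(0, wa, 0.0001 * wa)
g_s = np.array([your_function(w) for w in discrete_wa])  # complex frequency response samples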

how to code vector to matrix hamming distance in Python?

I'd like to develop a query system that finds the most similar items to a given one, based on a binary signature extracted from the data. I'm looking for the most efficient way since I have runtime constraints. I tried scipy's distance functions, but they were too slow. Do you know any other useful library or trick to do this faster?
As an example scenario:
I have a binary query vector of length 68, and a dataset matrix of size 3000K x 68. I'd like to find the most similar item in this matrix to the given query using Hamming distance.
Thanks for any comments.
Nice problem. I liked the answers of Alex and Piotr. My first naive attempt also resulted in a solution time of around 800 ms (on my system). My second attempt, using numpy's (un)packbits, resulted in a 4x speed increase.
import numpy as np

LENGTH = 68
K = 1024
DATASIZE = 3000 * K
DATA = np.random.randint(0, 2, (DATASIZE, LENGTH)).astype(bool)

def RandomVect():
    return np.random.randint(0, 2, (LENGTH)).astype(bool)

def HammingDist(vec1, vec2):
    return np.sum(np.logical_xor(vec1, vec2))

def SmallestHamming(vec):
    XorData = np.logical_xor(DATA, vec[np.newaxis, :])
    Lengths = np.sum(XorData, axis=1)
    return DATA[np.argmin(Lengths)]  # returns first smallest

def main():
    v1 = RandomVect()
    v2 = SmallestHamming(v1)
    print(HammingDist(v1, v2))

# OK, let's try to make it faster... (using numpy.(un)packbits)
DATA2 = np.packbits(DATA, axis=1)
NBYTES = DATA2.shape[-1]
BYTE2ONES = np.zeros((256), dtype=np.uint8)
for i in range(0, 256):
    BYTE2ONES[i] = np.sum(np.unpackbits(np.uint8(i)))

def RandomVect2():
    return np.packbits(RandomVect())

def HammingDist2(vec1, vec2):
    v1 = np.unpackbits(vec1)
    v2 = np.unpackbits(vec2)
    return np.sum(np.logical_xor(v1, v2))

def SmallestHamming2(vec):
    XorData = DATA2 ^ vec[np.newaxis, :]
    Lengths = np.sum(BYTE2ONES[XorData], axis=1)
    return DATA2[np.argmin(Lengths)]  # returns first smallest

def main2():
    v1 = RandomVect2()
    v2 = SmallestHamming2(v1)
    print(HammingDist2(v1, v2))
Use cdist from SciPy:
from scipy.spatial.distance import cdist
Y = cdist(XA, XB, 'hamming')
Computes the normalized Hamming distance, or the proportion of those vector elements between two n-vectors u and v which disagree. To save memory, the matrix X can be of type boolean
Reference: http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html
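To get the single most similar row for one query, something like the following should work (a sketch with made-up stand-in data; XA is the 1x68 query and XB the dataset from the question):

import numpy as np
from scipy.spatial.distance import cdist

data = np.random.randint(0, 2, (3000 * 1024, 68)).astype(bool)  # stand-in dataset
query = np.random.randint(0, 2, (1, 68)).astype(bool)           # stand-in query vector

dists = cdist(query, data, 'hamming')  # shape (1, n_rows), normalized distances
best = dists.argmin()                  # index of the most similar row
print(best, dists[0, best] * 68)       # multiply by 68 to recover the raw Hamming distance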
I would be surprised if there were a significantly faster way than this: put your data into a pandas DataFrame (M), one vector per column, and your target vector into a pandas Series (x),
import numpy as np
import pandas as pd

rows = 68
columns = 3000
M = pd.DataFrame(np.random.rand(rows, columns) > 0.5)
x = pd.Series(np.random.rand(rows) > 0.5)
then do the following
%timeit M.apply(lambda y: x==y).astype(int).sum().idxmax()
1 loop, best of 3: 746 ms per loop
Edit: Actually, I am surprised that this is a much faster way:
%timeit M.eq(x, axis=0).astype(int).sum().idxmax()
100 loops, best of 3: 2.68 ms per loop

Improve performance of the np.irr function through vectorization

Is it possible to improve the performance of the np.irr function so that it can be applied to a 2-dimensional array of cash flows without using a for-loop, either through vectorizing the np.irr function or through an alternative algorithm?
The irr function in the numpy library calculates the periodically compounded rate of return that gives a net present value of 0 for an array of cash flows. This function can only be applied to a 1-dimensional array:
x = np.array([-100,50,50,50])
r = np.irr(x)
np.irr will not work against a 2-dimensional array of cash flows, such as:
cfs = np.zeros((10000,4))
cfs[:,0] = -100
cfs[:,1:] = 50
where each row represents a series of cash flows, and columns represent time periods. Therefore a slow implementation would be to loop over each row and apply np.irr to individual rows:
out = []
for x in cfs:
    out.append(np.irr(x))
For large arrays, this is an optimization barrier. Looking at the source code of the np.irr function, I believe the main obstacle is vectorizing the np.roots function:
def irr(values):
    res = np.roots(values[::-1])
    mask = (res.imag == 0) & (res.real > 0)
    if res.size == 0:
        return np.nan
    res = res[mask].real
    # NPV(rate) = 0 can have more than one solution so we return
    # only the solution closest to zero.
    rate = 1.0/res - 1
    rate = rate.item(np.argmin(np.abs(rate)))
    return rate
I have found a similar implementation in R: Fast loan rate calculation for a big number of loans, but don't know how to port this into Python. Also, I don't consider np.apply_along_axis or np.vectorize to be solutions to this issue since my main concern is performance, and I understand both are wrappers for a for-loop.
Thanks!
Looking at the source of np.roots,
import inspect
print(inspect.getsource(np.roots))
We see that it works by finding the eigenvalues of the "companion matrix". It also does some special handling of coefficients that are zero. I really don't understand the mathematical background, but I do know that np.linalg.eigvals can calculate eigenvalues for multiple matrices in a vectorized way.
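For reference, a brief sketch of the idea (not from the original answer): the roots of a polynomial are the eigenvalues of its companion matrix, which for a cubic c_0 + c_1 x + c_2 x^2 + c_3 x^3 looks like

\[
A =
\begin{pmatrix}
-c_2/c_3 & -c_1/c_3 & -c_0/c_3 \\
1 & 0 & 0 \\
0 & 1 & 0
\end{pmatrix},
\qquad \det(A - xI) = 0 \iff c_0 + c_1 x + c_2 x^2 + c_3 x^3 = 0 .
\]

Building one such matrix per row of cash flows and handing the whole stack to np.linalg.eigvals is what makes the vectorization below possible.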
Merging it with the source of np.irr has resulted in the following "Frankencode":
def irr_vec(cfs):
    # Create companion matrix for every row in `cfs`
    M, N = cfs.shape
    A = np.zeros((M, (N-1)**2))
    A[:, N-1::N] = 1
    A = A.reshape((M, N-1, N-1))
    A[:, 0, :] = cfs[:, -2::-1] / -cfs[:, -1:]  # slice [-1:] to keep dims
    # Calculate roots; `eigvals` is a gufunc
    res = np.linalg.eigvals(A)
    # Find the solution that makes the most sense...
    mask = (res.imag == 0) & (res.real > 0)
    res = np.ma.array(res.real, mask=~mask, fill_value=np.nan)
    rate = 1.0/res - 1
    idx = np.argmin(np.abs(rate), axis=1)
    irr = rate[np.arange(M), idx].filled()
    return irr
This does not do handling of zero coefficients and surely fails when any(cfs[:,-1] == 0). Also some input argument checking wouldn't hurt. And some other problems maybe? But for the supplied example data it achieves what we wanted (at the cost of increased memory use):
In [487]: cfs = np.zeros((10000,4))
...: cfs[:,0] = -100
...: cfs[:,1:] = 50
In [488]: %timeit [np.irr(x) for x in cfs]
1 loops, best of 3: 2.96 s per loop
In [489]: %timeit irr_vec(cfs)
10 loops, best of 3: 77.8 ms per loop
If you have the special case of loans with a fixed payback amount (like in the question you linked), you may be able to do it faster using interpolation...
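The answer does not spell the interpolation idea out, but one possible sketch for the fixed-payment case (all names and numbers below are made up): the NPV of such a loan depends only on the number of periods and the payment-to-principal ratio, so you can tabulate the annuity factor on a rate grid and invert it with np.interp.

import numpy as np

# hypothetical loans: a principal paid out at t=0, repaid in n equal payments
n = 36
principal = np.random.uniform(5000, 20000, 10000)
payment = principal / n * np.random.uniform(1.0, 1.3, 10000)

rates = np.linspace(1e-6, 0.05, 2000)        # per-period rate grid (assumed range)
annuity = (1 - (1 + rates) ** -n) / rates    # present value of 1 paid per period

# IRR is the rate where principal == payment * annuity, i.e. annuity == principal/payment.
# annuity decreases with the rate, so reverse both arrays for np.interp's increasing x.
irr = np.interp(principal / payment, annuity[::-1], rates[::-1])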
After posting this question, I kept working on it and came up with a vectorized solution that uses a different algorithm:
def virr(cfs, precision=0.005, rmin=0, rmax1=0.3, rmax2=0.5):
    '''
    Vectorized IRR calculator. First calculate a 3D array of the discounted
    cash flows along cash flow series, time period, and discount rate. Sum over
    time to collapse to a 2D array which gives the NPV along a range of discount
    rates for each cash flow series. Next, find the crossover where NPV is zero,
    which corresponds to the lowest real IRR value. For performance, negative
    IRRs are not calculated (they return "-1"), and values are only calculated
    to an acceptable precision.

    IN:
        cfs       - numpy 2d array - rows are cash flow series, cols are time periods
        precision - level of accuracy for the inner IRR band, e.g. 0.005%
        rmin      - lower bound of the inner IRR band, e.g. 0%
        rmax1     - upper bound of the inner IRR band, e.g. 30%
        rmax2     - upper bound of the outer IRR band, e.g. 50%. Values in the outer
                    band are calculated to 1% precision; IRRs outside the upper band
                    return the rmax2 value
    OUT:
        r - numpy column array of IRRs for cash flow series
    '''
    if cfs.ndim == 1:
        cfs = cfs.reshape(1, len(cfs))
    # Range of time periods
    years = np.arange(0, cfs.shape[1])
    # Range of the discount rates
    rates_length1 = int((rmax1 - rmin) / precision) + 1
    rates_length2 = int((rmax2 - rmax1) / 0.01)
    rates = np.zeros((rates_length1 + rates_length2,))
    rates[:rates_length1] = np.linspace(0, 0.3, rates_length1)
    rates[rates_length1:] = np.linspace(0.31, 0.5, rates_length2)
    # Discount rate multiplier: rows are years, cols are rates
    drm = (1 + rates) ** -years[:, np.newaxis]
    # Calculate discounted cfs
    discounted_cfs = cfs[:, :, np.newaxis] * drm
    # Calculate NPV array by summing over discounted cash flows
    npv = discounted_cfs.sum(axis=1)
    # Find where the NPV changes sign, which implies an IRR solution
    signs = npv < 0
    # When the sign crosses over, the pairwise diff of the boolean values is True
    crossovers = np.diff(signs, 1, 1)
    # Extract the IRR from the first crossover for each row
    irr = np.min(np.ma.masked_equal(rates[1:] * crossovers, 0), 1)
    # Error handling: negative IRRs are returned as "-1", IRRs greater than
    # rmax2 are returned as rmax2
    negative_irrs = cfs.sum(1) < 0
    r = np.where(negative_irrs, -1, irr)
    r = np.where(irr.mask * (negative_irrs == False), 0.5, r)
    return r
Performance:
import numpy as np
cfs = np.zeros((10000,4))
cfs[:,0] = -100
cfs[:,1:] = 50
%timeit [np.irr(x) for x in cfs]
10 loops, best of 3: 1.06 s per loop
%timeit virr(cfs)
10 loops, best of 3: 29.5 ms per loop
pyxirr is super fast, and np.irr is deprecated, so I'd use this now:
https://pypi.org/project/pyxirr/
import numpy as np
import pandas as pd
import pyxirr

cfs = np.zeros((10000, 4))
cfs[:, 0] = -100
cfs[:, 1:] = 50
df = pd.DataFrame(cfs).T
df.apply(pyxirr.irr)

Numpy: averaging many datapoints at each time step

This question is probably answered somewhere, but I cannot find where, so I will ask here:
I have a set of data consisting of several samples per timestep. So, I basically have two arrays, "times", which looks something like: (0,0,0,1,1,1,1,1,2,2,3,4,4,4,4,...) and my data which is the value for each time. Each timestep has a random number of samples. I would like to get the average value of the data at each timestep in an efficient manner.
I have prepared the following sample code to show what my data looks like. Basically, I am wondering if there is a more efficient way to write the "average_values" function.
import numpy as np
import matplotlib.pyplot as plt

def average_values(x, y):
    unique_x = np.unique(x)
    averaged_y = [np.mean(y[x == ux]) for ux in unique_x]
    return unique_x, averaged_y

# generate our data
times = []
samples = []
# we have some timesteps:
for time in np.linspace(0, 10, 101):
    # and a random number of samples (1 to 10) at each timestep:
    num_samples = np.random.randint(1, 11)
    for i in range(0, num_samples):
        times.append(time)
        samples.append(np.sin(time) + np.random.random() * 0.5)

times = np.array(times)
samples = np.array(samples)

plt.plot(times, samples, 'bo', ms=3, mec=None, alpha=0.5)
plt.plot(*average_values(times, samples), color='r')
plt.show()
The resulting plot shows the raw samples as blue dots with the averaged values drawn as a red line.
A generic code to do this would do something as follows:
def average_values_bis(x, y):
    unq_x, idx = np.unique(x, return_inverse=True)
    count_x = np.bincount(idx)
    sum_y = np.bincount(idx, weights=y)
    return unq_x, sum_y / count_x
Adding the function above and the following plotting line to your script
plt.plot(*average_values_bis(times, samples), color='g')
produces the same output, with the red line hidden behind the green one.
But timing both approaches reveals the benefits of using bincount, a 30x speed-up:
%timeit average_values(times, samples)
100 loops, best of 3: 2.83 ms per loop
%timeit average_values_bis(times, samples)
10000 loops, best of 3: 85.9 us per loop
May I propose a pandas solution? It is highly recommended if you are going to be working with time series.
Create test data
import pandas as pd
import numpy as np

times = np.random.randint(0, 10, size=50)
values = np.sin(times) + np.random.random_sample((len(times),))
s = pd.Series(values, index=times)
s.plot(linestyle='None', marker='o')  # markers only, no connecting line
Calculate averages
avs = s.groupby(level=0).mean()
avs.plot()
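A possible follow-up: if you also need each sample's deviation from the average of its own timestep, groupby().transform keeps the original index, so the result aligns with s:

# per-sample deviation from the mean of its timestep
deviation = s - s.groupby(level=0).transform('mean')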
