What's the difference between pandas ACF and statsmodel ACF? - python

I'm calculating the Autocorrelation Function for a stock's returns. To do so I tested two functions, the autocorr function built into Pandas, and the acf function supplied by statsmodels.tsa. This is done in the following MWE:
import pandas as pd
from pandas_datareader import data
import matplotlib.pyplot as plt
import datetime
from dateutil.relativedelta import relativedelta
from statsmodels.tsa.stattools import acf, pacf
ticker = 'AAPL'
time_ago = datetime.datetime.today().date() - relativedelta(months = 6)
ticker_data = data.get_data_yahoo(ticker, time_ago)['Adj Close'].pct_change().dropna()
ticker_data_len = len(ticker_data)
ticker_data_acf_1 = acf(ticker_data)[1:32]
ticker_data_acf_2 = [ticker_data.autocorr(i) for i in range(1,32)]
test_df = pd.DataFrame([ticker_data_acf_1, ticker_data_acf_2]).T
test_df.columns = ['Statsmodels Autocorr', 'Pandas Autocorr']  # acf_1 comes from statsmodels, acf_2 from pandas
test_df.index += 1
test_df.plot(kind='bar')
What I noticed was that the values they returned weren't identical.
What accounts for this difference, and which values should be used?

The difference between the Pandas and Statsmodels versions lies in the mean subtraction and the normalization (variance division):
autocorr does nothing more than pass subseries of the original series to np.corrcoef. Inside this method, the sample mean and sample variance of these subseries are used to determine the correlation coefficient.
acf, in contrast, uses the sample mean and sample variance of the overall series to determine the correlation coefficient.
The differences may get smaller for longer time series but are quite large for short ones.
Compared to Matlab, the Pandas autocorr function probably corresponds to calling Matlab's xcorr (cross-correlation) on the series and its lagged copy, rather than Matlab's autocorr, which calculates the sample autocorrelation (I am guessing from the docs; I cannot validate this because I have no access to Matlab).
See this MWE for clarification:
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import acf
import matplotlib.pyplot as plt
plt.style.use("seaborn-colorblind")
def autocorr_by_hand(x, lag):
    # Slice the relevant subseries based on the lag
    y1 = x[:(len(x)-lag)]
    y2 = x[lag:]
    # Subtract the subseries means
    sum_product = np.sum((y1-np.mean(y1))*(y2-np.mean(y2)))
    # Normalize with the subseries stds
    return sum_product / ((len(x) - lag) * np.std(y1) * np.std(y2))
def acf_by_hand(x, lag):
    # Slice the relevant subseries based on the lag
    y1 = x[:(len(x)-lag)]
    y2 = x[lag:]
    # Subtract the mean of the whole series x to calculate Cov
    sum_product = np.sum((y1-np.mean(x))*(y2-np.mean(x)))
    # Normalize with var of whole series
    return sum_product / ((len(x) - lag) * np.var(x))
x = np.linspace(0,100,101)
results = {}
nlags=10
results["acf_by_hand"] = [acf_by_hand(x, lag) for lag in range(nlags)]
results["autocorr_by_hand"] = [autocorr_by_hand(x, lag) for lag in range(nlags)]
results["autocorr"] = [pd.Series(x).autocorr(lag) for lag in range(nlags)]
results["acf"] = acf(x, adjusted=True, nlags=nlags-1)  # 'adjusted' was named 'unbiased' in older statsmodels
pd.DataFrame(results).plot(kind="bar", figsize=(10,5), grid=True)
plt.xlabel("lag")
plt.ylim([-1.2, 1.2])
plt.ylabel("value")
plt.show()
Statsmodels uses np.correlate to optimize this, but this is basically how it works.
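The np.correlate route mentioned above can be sketched roughly as follows. This is a NumPy-only illustration, not the statsmodels source: the function name acf_via_correlate and its adjusted flag are assumptions chosen here to mirror the arguments of acf.

```python
import numpy as np

def acf_via_correlate(x, nlags, adjusted=False):
    """Sample ACF for lags 0..nlags, using the overall mean and variance."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    centered = x - x.mean()
    # Full correlation of the centered series with itself:
    # index n - 1 holds lag 0, index n - 1 + k holds lag k.
    sums = np.correlate(centered, centered, mode="full")[n - 1:n + nlags]
    # Divide each lagged sum by n (default) or by the number of
    # overlapping terms n - k (adjusted).
    denom = n - np.arange(nlags + 1) if adjusted else n
    acov = sums / denom
    # Normalize by the lag-0 autocovariance, i.e. the overall variance.
    return acov / acov[0]

print(acf_via_correlate(np.arange(10), nlags=3))
print(acf_via_correlate(np.arange(10), nlags=3, adjusted=True))
```

With adjusted=True the result should agree with acf_by_hand, since both divide the lagged sum by the number of overlapping terms and normalize by the variance of the whole series.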

As suggested in the comments, the discrepancy can be reduced, but not completely removed, by passing unbiased=True (renamed adjusted=True in newer statsmodels releases) to the statsmodels function. Using a random input:
import statistics
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import acf
DATA_LEN = 100
N_TESTS = 100
N_LAGS = 32
def test(unbiased):
    data = pd.Series(np.random.random(DATA_LEN))
    # 'adjusted' is the current name of the deprecated 'unbiased' argument
    data_acf_1 = acf(data, adjusted=unbiased, nlags=N_LAGS)
    data_acf_2 = [data.autocorr(i) for i in range(N_LAGS+1)]
    # return the summed absolute difference between the two results
    return sum(abs(data_acf_1 - data_acf_2))
for value in (False, True):
    diffs = [test(value) for _ in range(N_TESTS)]
    print(value, statistics.mean(diffs))
Output:
False 0.464562410987
True 0.0820847168593

In the following example, the Pandas autocorr() function gives the expected results but the statsmodels acf() function does not.
Consider the following series:
import pandas as pd
s = pd.Series(range(10))
We expect perfect correlation between this series and any of its lagged copies, and this is exactly what we get with the autocorr() function:
[ s.autocorr(lag=i) for i in range(10) ]
# [0.9999999999999999, 1.0, 1.0, 1.0, 1.0, 0.9999999999999999, 1.0, 1.0, 0.9999999999999999, nan]
But using acf() we get a different result:
from statsmodels.tsa.stattools import acf
acf(s)
# [ 1. 0.7 0.41212121 0.14848485 -0.07878788
# -0.25757576 -0.37575758 -0.42121212 -0.38181818 -0.24545455]
If we try acf with adjusted=True, the result is even more unexpected, because for some lags it is less than -1 (note that a correlation has to lie in [-1, 1]):
acf(s, adjusted=True) # 'unbiased' is deprecated and 'adjusted' should be used instead
# [ 1. 0.77777778 0.51515152 0.21212121 -0.13131313
# -0.51515152 -0.93939394 -1.4040404 -1.90909091 -2.45454545]

Related

How to get only positive & de-trended Autocorrelation values?

I'm trying to obtain only positive autocorrelation values from a timeseries waveform using scipy.signal.correlate() which should look like the following:
But I am ending up getting the following - which has both positive and negative values and also a trend present:
Can anyone please tell how to get only positive & de-trended Autocorrelation values?
The dataset for which I'm finding the autocorrelation, is generated using the following code (which you could use as it is for your reference):
import json
import sys, os
import numpy as np
import pandas as pd
import glob
import pickle
from statsmodels.tsa.stattools import adfuller, acf, pacf
from scipy.signal import find_peaks, square
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import matplotlib.pyplot as plt
#GENERATION OF A FUNCTION WITH DUAL SEASONALITY & NOISE
def white_noise(mu, sigma, num_pts):
    """ Function to generate Gaussian Normal Noise
    Args:
        sigma: std value
        num_pts: no of points
        mu: mean value
    Returns:
        generated Gaussian Normal Noise
    """
    noise = np.random.normal(mu, sigma, num_pts)
    return noise
def signal_line_plot(input_signal: pd.Series, title: str = "", y_label: str = "Signal"):
    """ Function to plot a time series signal
    Args:
        input_signal: time series signal that you want to plot
        title: title on plot
        y_label: label of the signal being plotted
    Returns:
        signal plot
    """
    plt.plot(input_signal)
    plt.title(title)
    plt.ylabel(y_label)
    plt.show()
t_week = np.linspace(1,480, 480)
t_weekend=np.linspace(1,192,192)
T=96 #Time Period
x_weekday = 10*square(2*np.pi*t_week/T, duty=0.7)+10 + white_noise(0, 1,480)
x_weekend = 2*square(2*np.pi*t_weekend/T, duty=0.7)+2 + white_noise(0,1,192)
x_daily_weekly = np.concatenate((x_weekday, x_weekend))
x_daily_weekly_long = np.concatenate((x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly))
signal_line_plot(x_daily_weekly_long)
signal_line_plot(x_daily_weekly_long[0:1000])
#x_daily_weekly_long is the final waveform on which I'm carrying out Autocorrelation
I'm performing the autocorrelation as follows (this produces the output shown above, which I'm not satisfied with):
#DETERMINING AUTOCORRELATION AND LAG VALUES:
import scipy.signal as signal
autocorr = signal.correlate(x_daily_weekly_long, x_daily_weekly_long, mode = "same")
lags = signal.correlation_lags(len(x_daily_weekly_long), len(x_daily_weekly_long), mode = "same")
#VISUALIZATION:
f = plt.figure()
f.set_figwidth(40)
f.set_figheight(10)
plt.plot(lags, autocorr)
Could anyone please help?

Identifying extremes in pandas timeseries

I'm looking for a way to identify the local extremes in a pandas timeseries.
A MWE would be
import math
import matplotlib.pyplot as plt
import pandas as pd
sin_list = []
for i in range(200):
    sin_list.append(math.sin(i / 10) + i / 100)
idx = pd.date_range('2018-01-01', periods=200, freq='H')
ts = pd.Series(sin_list, index=idx)
ts.plot(style='.')
plt.show()
and the red lines would mark the timestamps which I want to identify. Note that there are, of course, finite steps in this series.
A possible solution could be to fit a curve to it, derive it and then identify the exact place where the gradient is 0. This does seem like a big effort to program myself, and I assume such an implementation exists somewhere.
I developed a solution for the problem based on the .diff() method. The crucial parameter here is the p factor of the get_percentile function. Since the series has only a finite number of values, the gradient will not reach exactly 0, so the solution space must be blurred a bit. This means that the fewer values there are, the higher the p factor must be. In my solution, 0.05 proved large enough to identify the extremes but small enough to locate them with reasonable accuracy.
This is the code:
import copy
import math
import matplotlib.pyplot as plt
import pandas as pd
def get_percentile(data: list, p: float):
    _data = copy.copy(data)
    _data.sort()
    result = _data[math.floor(len(_data) * p) - 1]
    return result
sin_list = []
for i in range(200):
    sin_list.append(math.sin(i / 10) + i / 100)
idx = pd.date_range('2018-01-01', periods=200, freq='H')
ts = pd.Series(sin_list, index=idx)
gradient_ts = abs(ts.diff())
percentile = get_percentile(gradient_ts.values, p=0.05)
binary_ts = gradient_ts.where(gradient_ts > percentile, 1).where(gradient_ts < percentile, 0)
fig, ax = plt.subplots()
binary_ts.plot(drawstyle="steps", ax=ax)
ax.fill_between(binary_ts.index, binary_ts, facecolor='green', alpha=0.5, step='pre')
ts.plot(secondary_y=True, style='.')
plt.show()
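A different and simpler route than the percentile threshold is to look for sign changes in the first difference: a local maximum is a point where the series stops rising and starts falling, and a local minimum is the reverse. The sketch below is only an illustration of that idea; find_local_extremes is a hypothetical helper name, and it assumes that no two neighbouring samples are exactly equal (no flat plateaus).

```python
import math
import numpy as np
import pandas as pd

def find_local_extremes(ts):
    """Return (maxima, minima) index labels of a Series without flat runs."""
    # Sign of the step from each sample to the next: +1 rising, -1 falling.
    slope_sign = np.sign(np.diff(ts.to_numpy()))
    # A change from +1 to -1 marks a maximum, from -1 to +1 a minimum.
    turn = np.diff(slope_sign)
    inner = ts.index[1:-1]  # the first and last samples cannot be classified
    return inner[turn < 0], inner[turn > 0]

# Same series as in the question, with an hourly index.
idx = pd.date_range("2018-01-01", periods=200, freq=pd.Timedelta(hours=1))
ts = pd.Series([math.sin(i / 10) + i / 100 for i in range(200)], index=idx)

maxima, minima = find_local_extremes(ts)
print(maxima)
print(minima)
```

Because this works on raw neighbouring differences, it would also flag every small wiggle in noisy data, so some smoothing beforehand may be needed there.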

Custom binning in sklearn.preprocessing?

I have a list of continuous variables called size_array. I've been scaling them to [0, 1] like this:
max_abs_scaler = preprocessing.MinMaxScaler()
scaled = max_abs_scaler.fit_transform(size_array)
Is there a way to scale them to the range [-1, 1] so that the median (or a percentile) maps to 0? My data is right skewed, so the values above the median are spread out a lot and the values below the median are not. I tried scaling them with this method:
def using_median():
    if x >= median:
        return (x - median)/(max - median)
    else:
        return (median - x)/(median - min)
But that didn't work. Is there any other way to do this with sklearn.preprocessing?
I would recommend using PowerTransformer(). It can work very well for skewed distributions.
Check out this example:
from sklearn import preprocessing
import matplotlib.pyplot as plt
import numpy as np
pt = preprocessing.PowerTransformer()
X_lognormal = np.random.RandomState(616)\
.lognormal(size=(300, 2))
_,ax = plt.subplots(1,2,sharey=True)
ax[0].hist(X_lognormal)
ax[1].hist(pt.fit_transform(X_lognormal))
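For comparison, the piecewise scaling described in the question (minimum to -1, median to 0, maximum to 1) can be sketched directly in NumPy. This is not a sklearn.preprocessing transformer; scale_around_median is a hypothetical helper, and it assumes the minimum, median and maximum are all distinct. Note that the lower branch uses x - median, so values below the median come out negative.

```python
import numpy as np

def scale_around_median(values):
    """Map min -> -1, median -> 0, max -> 1, linearly on each side."""
    x = np.asarray(values, dtype=float)
    lo, med, hi = x.min(), np.median(x), x.max()
    scaled = np.empty_like(x)
    upper = x >= med
    # Values at or above the median are stretched onto [0, 1].
    scaled[upper] = (x[upper] - med) / (hi - med)
    # Values below the median are stretched onto [-1, 0).
    scaled[~upper] = (x[~upper] - med) / (med - lo)
    return scaled

print(scale_around_median([1, 2, 3, 10, 100]))
```

Replacing np.median with np.percentile(x, q) would centre the scale on another percentile instead.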

How to calculate covariance on 2 columns out of multiple columns in python?

I've provided sample data below. It is a matrix with 10 rows and 8 columns holding two-dimensional normal distributions: col1 and col2 form one set, col3 and col4 another, and so on. I'm trying to calculate the covariance of each individual set in Python. So far I've been unsuccessful, and I'm new to Python. Below is what I've tried:
import pandas
import numpy
import matplotlib.pyplot as plg
data = pandas.read_excel("testfile.xlsx", header=None)
dataNpy = pandas.DataFrame.to_numpy(data)
mean = numpy.mean(dataNpy, axis=0)
dataAWithoutMean = dataNpy - mean
covB = numpy.cov(dataAWithoutMean)
print("cov is: " + str(covB))
I've been tasked to calculate 4 separate covariance matrices and plot the covariance value for each set. In addition, plot the variance of each set.
dataset:
5.583566716 -0.441667252 -0.663300181 -1.249623134 -6.530464227 -4.984165997 2.594874802 2.646629654
6.129721509 2.374902708 -2.583949571 -2.224729817 0.279965502 -0.850298098 -1.542499771 -2.686894831
5.793226266 1.133844629 -1.939493549 1.570726544 -2.125423302 -1.33966397 -0.42901856 -0.09814741
3.413049714 -0.1133744 -0.032092831 -0.122147373 2.063549449 0.685517481 5.887909556 4.056242954
-2.639701885 -0.716557389 -0.851273969 -0.522784614 -7.347432606 -2.653482175 1.043389849 0.774192416
-1.84827484 -0.636893709 -2.223488277 -1.227420764 0.253999505 0.540299783 -1.593071594 -0.70980532
0.754029441 1.427571018 5.486147486 2.956320758 2.054346142 1.939929175 -3.559875405 -3.074861749
2.009806308 1.916796155 7.820990369 2.953681659 2.071682641 0.105056782 -1.120995825 -0.036335483
1.875128481 1.785216268 -2.607698929 0.244415372 -0.793431956 -1.598343481 -2.120852679 -2.777871862
0.168442246 0.324606905 0.53741174 0.274617158 -2.99037756 -3.323958514 -3.288399345 -2.482277047
Thanks for helping in advance :)
Is this what you need?
import pandas
import numpy
import matplotlib.pyplot as plt
data = pandas.read_excel("Book1.xlsx", header=None)
mean = data.mean(axis=0)
dataAWithoutMean = data - mean
# Variance of each set
dataAWithoutMean.var()
# Covariance matrix
cov = dataAWithoutMean.cov()
plt.matshow(cov)
plt.show()
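If the four sets are needed as separate 2x2 covariance matrices rather than one 8x8 matrix, the columns can be taken two at a time. The sketch below uses random stand-in data of the same shape instead of the spreadsheet from the question; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Stand-in data: 10 rows, 8 columns of random numbers
# (an assumption, not the values from the question's file).
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(10, 8)))

pair_covs = []
for first in range(0, data.shape[1], 2):
    # Columns (0, 1), (2, 3), (4, 5), (6, 7) form the four sets.
    pair = data.iloc[:, [first, first + 1]]
    pair_covs.append(pair.cov())

for i, cov in enumerate(pair_covs, start=1):
    values = cov.to_numpy()
    # The diagonal holds the two variances, the off-diagonal the covariance.
    print(f"set {i}: variances = {np.diag(values)}, covariance = {values[0, 1]:.4f}")
```

Subtracting the column means first is not required, because variance and covariance do not change when a constant is subtracted from a column.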

sklearn's KMeans: Cluster centers and cluster means differ. Numerical Imprecision?

I've noticed that when using sklearn.cluster.KMeans to obtain clusters, the cluster centers from the attribute .cluster_centers_ and the means computed manually for each cluster don't seem to be exactly the same.
For small sample sizes the difference is very small and probably within float imprecision. But for larger samples:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
np.random.seed(0)
x = np.random.normal(size=5000)
x_z = (x - x.mean() / x.std()).reshape(5000,1)
cluster=KMeans(n_clusters=2).fit(x_z)
df = pd.DataFrame(x_z)
df['label'] = cluster.labels_
difference = np.abs(df.groupby('label').mean() - cluster.cluster_centers_)
print(difference)
[[ 0.00217333]
[ 0.00223798]]
Doing the same thing for different sample sizes:
This seems too much to be floating point imprecision. Are cluster centers not means, or what's going on here?
I think it may be related to the tolerance of KMeans. The default value is 1e-4, so setting a lower value, e.g. tol=1e-8, gives:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
np.random.seed(0)
x = np.random.normal(size=5000)
x_z = (x - x.mean() / x.std()).reshape(5000,1)
cluster=KMeans(n_clusters=2, tol=1e-8).fit(x_z)
df = pd.DataFrame(x_z)
df['label'] = cluster.labels_
difference = np.abs(df.groupby('label').mean() - cluster.cluster_centers_)
print(difference)
                    0
label
0      9.99200722e-16
1      1.11022302e-16
Hope it helps.
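The effect of a stopping tolerance can be illustrated with a plain Lloyd iteration in one dimension. This is a sketch under stated assumptions and not scikit-learn's implementation: lloyd_1d, assign and cluster_means are hypothetical helpers, the tolerance is applied to the raw squared shift of the centers, and scikit-learn's own stopping rule and final labelling step may differ in detail.

```python
import numpy as np

def assign(x, centers):
    """Label each point with the index of its nearest center."""
    return np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)

def cluster_means(x, labels, k):
    """Mean of the points carrying each label."""
    return np.array([x[labels == j].mean() for j in range(k)])

def lloyd_1d(x, init_centers, tol, max_iter=1000):
    centers = np.asarray(init_centers, dtype=float)
    for _ in range(max_iter):
        labels = assign(x, centers)
        new_centers = cluster_means(x, labels, len(centers))
        shift = np.sum((new_centers - centers) ** 2)
        centers = new_centers
        if shift <= tol:  # stop once the centers barely move
            break
    # The returned labels are recomputed from the final centers, so with an
    # early stop the centers are the means of the previous assignment.
    return centers, assign(x, centers)

rng = np.random.default_rng(0)
x = rng.normal(size=5000)
for tol in (1e-4, 0.0):
    centers, labels = lloyd_1d(x, [x.min(), x.max()], tol)
    gap = np.abs(cluster_means(x, labels, 2) - centers).max()
    print(tol, gap)
```

With tol=0.0 the loop only stops once the assignment no longer changes, so the centers coincide with the cluster means; with a looser tolerance a small gap can remain.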
