I am using pandas version 0.23.4.
While debugging my code I realized that std and skew are not giving correct results with a rolling window.
Check the code below:
import pandas as pd
import numpy as np
import scipy.stats as sp
df = pd.DataFrame(np.random.randint(1,10,(5)))
df_w = df.rolling(window=3, min_periods=1)
m1 = df_w.apply(lambda x: np.mean(x))
m2 = df_w.mean()
s1 = df_w.apply(lambda x: np.std(x))
s2 = df_w.std()
sk1 = df_w.apply(lambda x: sp.skew(x))
sk2 = df_w.skew()
The results for mean are the same, but not for std and skew.
Is this expected behavior, or am I missing something?
The difference is in the specified delta degrees of freedom.
NumPy uses ddof=0 by default, while pandas uses ddof=1 by default. This value affects how your std is calculated (specifically, how it is normalized; e.g. refer here).
If you set it to 0 in both, the results are the same:
s1 = df_w.apply(lambda k: np.std(k, ddof=0), raw=True)
s2 = df_w.std(ddof=0)
>>> (s1==s2).all()
True
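For intuition, here is a rough sketch of what ddof changes (my own illustration, not part of the original answer): the sum of squared deviations is divided by n - ddof before taking the square root.
import numpy as np
x = np.random.randn(10)
n = len(x)
for ddof in (0, 1):
    # the squared deviations are normalized by (n - ddof)
    manual = np.sqrt(((x - x.mean()) ** 2).sum() / (n - ddof))
    assert np.isclose(manual, np.std(x, ddof=ddof))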
Similarly, for skew, pandas calculates the unbiased skewness, while scipy calculates the biased one.
Therefore, to get the same results, just specify bias=False in scipy:
sk1 = df_w.apply(lambda x: sp.skew(x, bias=False))
sk2 = df_w.skew()
>>> (sk1==sk2).all()
True
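The two conventions also differ only by a deterministic factor; as a sketch (my own illustration), bias=False applies the standard adjustment G1 = g1 * sqrt(n(n-1)) / (n-2) to the biased estimate g1:
import numpy as np
import scipy.stats as sp
x = np.random.randn(50)
n = len(x)
g1 = sp.skew(x)                           # biased sample skewness
G1 = g1 * np.sqrt(n * (n - 1)) / (n - 2)  # adjusted (unbiased) skewness
assert np.isclose(G1, sp.skew(x, bias=False))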
Sorry for bothering you with this. I have a serious issue and I'm on the clock to solve it, so here is my question.
I have an issue where I lambdify a quantity, but the result of the lambdified function differs from the .subs result, and sometimes it's way off, or it's a NaN where in reality there is a real number (found by subs).
Here is a small MWE where you can see the issue. Thanks in advance for your time!
import sympy as sy
import numpy as np
# some quantities needed before you see the problem
r = sy.Symbol('r', real=True)
th = sy.Symbol('th', real=True)
e_c = 1e51
lf0 = 100
A = 1.6726e-24
# here are some quantities I define to get to the problem
lfac = lf0+2
rd = 4*3.14/4/sy.pi/A/lfac**2
xi = r/rd #rescaled r
#now to the problem:
#QUANTITY
lfxi = xi**(-3)*(lfac+1)/2*(sy.sqrt( 1 + 4*lfac/(lfac+1)*xi**(3) + (2*xi**(3)/(lfac+1))**2) -1)
#RESULT WITH SUBS
print(lfxi.subs({th:1.00,r:1.00}).evalf())
#RESULT WITH LAMBDIFY
lfxi_l = sy.lambdify((r,th),lfxi)
lfxi_l(0.01,1.00)
##gives 0
The issue is that your mpmath precision needs to be set higher!
By default mpmath uses prec=53 and dps=15, but your expression requires much higher resolution than that.
# print(lfxi)
3.0256512324559e+62*(sqrt(1.09235114769539e-125*pi**6*r**6 + 6.74235013645028e-61*pi**3*r**3 + 1) - 1)/(pi**3*r**3)
...
from mpmath import mp
lfxi_l = sy.lambdify((r,th),lfxi, modules=["mpmath"])
mp.dps = 125
print(lfxi_l(1.00,1.00))
# 101.999... result
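For intuition (my own illustration, not from the original answer), the failure mode is catastrophic cancellation: for small r the square root's argument is 1 plus terms far below float64's roughly 16 significant digits, so sqrt(...) - 1 evaluates to exactly 0 in double precision, while mpmath at a higher dps resolves it:
import math
from mpmath import mp, mpf
eps = 1e-60
print(math.sqrt(1 + eps) - 1)         # 0.0: eps vanishes in double precision
mp.dps = 80
print(mp.sqrt(1 + mpf('1e-60')) - 1)  # ~5.0e-61, matching sqrt(1+x) ≈ 1 + x/2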
Changing a couple of the constants to "modest" values:
In [89]: e_c=1; A=1
The different methods produce essentially the same thing:
In [91]: lfxi.subs({th:1.00,r:1.00}).evalf()
Out[91]: 1.00000000461176
In [92]: lfxi_l = sy.lambdify((r,th),lfxi)
In [93]: lfxi_l(1.0,1.00)
Out[93]: 1.000000004611762
In [94]: lfxi_m = sy.lambdify((r,th),lfxi, modules=["mpmath"])
In [95]: lfxi_m(1.0,1.00)
Out[95]: mpf('1.0000000046117619')
What is the most straightforward way to perform a t-test in Python and to include CIs of the difference? I've seen various posts, but everything is different, and when I tried to calculate the CIs myself it seemed slightly wrong... Here:
import numpy as np
from scipy import stats
g1 = np.array([48.7107107,
36.8587287,
67.7129929,
39.5538852,
35.8622661])
g2 = np.array([62.4993857,
49.7434833,
67.7516511,
54.3585559,
71.0933957])
m1, m2 = np.mean(g1), np.mean(g2)
dof = (len(g1)-1) + (len(g2)-1)
MSE = (np.var(g1) + np.var(g2)) / 2
stderr_diffs = np.sqrt((2 * MSE)/len(g1))
tcl = stats.t.ppf([.975], dof)
lower_limit = (m1-m2) - (tcl) * (stderr_diffs)
upper_limit = (m1-m2) + (tcl) * (stderr_diffs)
print(lower_limit, upper_limit)
returns:
[-30.12845447] [-0.57070077]
However, when I run the same test in SPSS, although I have the same t and p values, the CIs are -31.87286 and 1.17371; this is also the case in R. I can't seem to find the correct way to do this and would appreciate some help.
You're subtracting 1 when you compute dof, but when you compute the variance you're not using the sample variance:
MSE = (np.var(g1) + np.var(g2)) / 2
should be
MSE = (np.var(g1, ddof=1) + np.var(g2, ddof=1)) / 2
which gives me
[-31.87286426] [ 1.17370902]
That said, instead of doing the manual implementation, I'd probably use statsmodels' CompareMeans:
In [105]: import statsmodels.stats.api as sms
In [106]: r = sms.CompareMeans(sms.DescrStatsW(g1), sms.DescrStatsW(g2))
In [107]: r.tconfint_diff()
Out[107]: (-31.872864255548553, 1.1737090155485568)
(really we should be using a DataFrame here, not an ndarray, but I'm lazy).
Remember though that you're going to want to consider what assumption you want to make about the variance:
In [110]: r.tconfint_diff(usevar='pooled')
Out[110]: (-31.872864255548553, 1.1737090155485568)
In [111]: r.tconfint_diff(usevar='unequal')
Out[111]: (-32.28794665832114, 1.5887914183211436)
and if your g1 and g2 are representative, the assumption of equal variance might not be a good one.
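As a quick cross-check of the t and p values under either variance assumption, scipy's ttest_ind works too (a sketch reusing g1 and g2 from above; note it does not report the CI itself in older SciPy versions):
from scipy import stats
# classic Student t-test with pooled variance
print(stats.ttest_ind(g1, g2, equal_var=True))
# Welch's t-test, dropping the equal-variance assumption
print(stats.ttest_ind(g1, g2, equal_var=False))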
How is one intended to use the output of the pandas ewm.cov function? I would presume that there are functions that allow you to use it directly in the returned form for multiplication, but nothing I try seems to work.
For example, take a minimal use case: stock X and Y return timeseries in DF1, from which we estimate an ewma covariance matrix. To get the variance estimate for a portfolio with positions A and B (given in DF2), I need to compute $x^T C x$, but I can't find the command to do this without writing a for loop.
# Python 3.6, pandas 0.20
import pandas as pd
import numpy as np
np.random.seed(100)
DF1 = pd.DataFrame(dict(X = np.random.normal(size = 100), Y = np.random.normal(size = 100)))
DF2 = pd.DataFrame(dict(A = np.random.normal(size = 100), B = np.random.normal(size = 100)))
COV = DF1.ewm(10).cov()
print(DF1)
print(COV)
# All of the following are invalid
print(COV.dot(DF2))
print(DF2.dot(COV))
print(COV.multiply(DF2))
The best I can figure out is this ugly piece of code
COV = COV.reset_index().rename(columns = dict(level_0 = "index", level_1 = "variable"))  # assign back: an inplace rename on a chained temporary is discarded
DF2m = pd.melt(DF2.reset_index(), id_vars = "index").sort_values("index")
MDF = pd.merge(COV, DF2m, on=["index", "variable"])
VAR = MDF.groupby("index").apply(lambda x: np.dot(np.dot(x["value"], np.matrix([x["X"], x["Y"]])), x["value"])[0,0])
I hold out hope that there is a nice way to do this...
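One possible loop-free route (my own sketch, not an established pandas API, and it assumes DF2's columns A, B line up with X, Y): reshape the MultiIndexed covariance into per-date 2x2 blocks and evaluate the quadratic form with einsum:
import numpy as np
w = DF2[["A", "B"]].values                 # one weight vector per date
C = COV.values.reshape(len(DF2), 2, 2)     # one 2x2 covariance block per date
VAR2 = np.einsum("ij,ijk,ik->i", w, C, w)  # x^T C x for every date at once
# (early entries may be NaN where the ewm covariance is not yet defined)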
As the title suggests, where has the rolling function option in the ols command in pandas migrated to in statsmodels? I can't seem to find it.
Pandas tells me doom is in the works:
FutureWarning: The pandas.stats.ols module is deprecated and will be removed in a future version. We refer to external packages like statsmodels, see some examples here: http://statsmodels.sourceforge.net/stable/regression.html
model = pd.ols(y=series_1, x=mmmm, window=50)
In fact, if you do something like:
import statsmodels.api as sm
model = sm.OLS(series_1, mmmm, window=50).fit()
print(model.summary())
you get results (window does not impair the running of the code), but you get only the parameters of the regression run on the entire period, not the series of parameters for each of the rolling periods it is supposed to work on.
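(Side note: recent statsmodels releases, 0.11 and later, ship a built-in rolling regression, statsmodels.regression.rolling.RollingOLS. A minimal sketch reusing series_1 and mmmm from the question:)
import statsmodels.api as sm
from statsmodels.regression.rolling import RollingOLS
X = sm.add_constant(mmmm)                       # add an intercept column
res = RollingOLS(series_1, X, window=50).fit()  # one fit per 50-obs window
print(res.params.head())                        # one row of coefficients per window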
I created an ols module designed to mimic pandas' deprecated MovingOLS; it is here.
It has three core classes:
OLS : static (single-window) ordinary least-squares regression. The outputs are NumPy arrays.
RollingOLS : rolling (multi-window) ordinary least-squares regression. The outputs are higher-dimensional NumPy arrays.
PandasRollingOLS : wraps the results of RollingOLS in pandas Series & DataFrames. Designed to mimic the look of the deprecated pandas module.
Note that the module is part of a package (which I'm currently in the process of uploading to PyPI) and it requires one inter-package import.
The first two classes above are implemented entirely in NumPy and primarily use matrix algebra. RollingOLS also takes extensive advantage of broadcasting. Attributes largely mimic statsmodels' OLS RegressionResultsWrapper.
An example:
import urllib.parse
import pandas as pd
from pyfinance.ols import PandasRollingOLS
# You can also do this with pandas-datareader; here's the hard way
url = "https://fred.stlouisfed.org/graph/fredgraph.csv"
syms = {
"TWEXBMTH" : "usd",
"T10Y2YM" : "term_spread",
"GOLDAMGBD228NLBM" : "gold",
}
params = {
"fq": "Monthly,Monthly,Monthly",
"id": ",".join(syms.keys()),
"cosd": "2000-01-01",
"coed": "2019-02-01",
}
data = pd.read_csv(
url + "?" + urllib.parse.urlencode(params, safe=","),
na_values={"."},
parse_dates=["DATE"],
index_col=0
).pct_change().dropna().rename(columns=syms)
print(data.head())
# usd term_spread gold
# DATE
# 2000-02-01 0.012580 -1.409091 0.057152
# 2000-03-01 -0.000113 2.000000 -0.047034
# 2000-04-01 0.005634 0.518519 -0.023520
# 2000-05-01 0.022017 -0.097561 -0.016675
# 2000-06-01 -0.010116 0.027027 0.036599
y = data.usd
x = data.drop('usd', axis=1)
window = 12 # months
model = PandasRollingOLS(y=y, x=x, window=window)
print(model.beta.head()) # Coefficients excluding the intercept
# term_spread gold
# DATE
# 2001-01-01 0.000033 -0.054261
# 2001-02-01 0.000277 -0.188556
# 2001-03-01 0.002432 -0.294865
# 2001-04-01 0.002796 -0.334880
# 2001-05-01 0.002448 -0.241902
print(model.fstat.head())
# DATE
# 2001-01-01 0.136991
# 2001-02-01 1.233794
# 2001-03-01 3.053000
# 2001-04-01 3.997486
# 2001-05-01 3.855118
# Name: fstat, dtype: float64
print(model.rsq.head()) # R-squared
# DATE
# 2001-01-01 0.029543
# 2001-02-01 0.215179
# 2001-03-01 0.404210
# 2001-04-01 0.470432
# 2001-05-01 0.461408
# Name: rsq, dtype: float64
Rolling beta with sklearn
import pandas as pd
from sklearn import linear_model
def rolling_beta(X, y, idx, window=255):
    assert len(X) == len(y)
    out_dates = []
    out_beta = []
    model_ols = linear_model.LinearRegression()
    for iStart in range(0, len(X) - window):
        iEnd = iStart + window
        model_ols.fit(X[iStart:iEnd], y[iStart:iEnd])
        # store output
        out_dates.append(idx[iEnd])
        out_beta.append(model_ols.coef_[0][0])
    return pd.DataFrame({'beta': out_beta}, index=out_dates)
df_beta = rolling_beta(df_rtn_stocks['NDX'].values.reshape(-1, 1), df_rtn_stocks['CRM'].values.reshape(-1, 1), df_rtn_stocks.index.values, 255)
Adding for completeness a speedier numpy-only solution which limits calculations only to the regression coefficients and the final estimate
Numpy rolling regression function
import numpy as np
import pandas as pd

def rolling_regression(y, x, window=60):
    """
    y and x must be pandas.Series
    """
    # === Clean-up ============================================================
    x = x.dropna()
    y = y.dropna()
    # === Trim acc to shortest ================================================
    if x.index.size > y.index.size:
        x = x[y.index]
    else:
        y = y[x.index]
    # === Verify enough space =================================================
    if x.index.size < window:
        return None
    else:
        # === Add a constant if needed ========================================
        X = x.to_frame()
        X['c'] = 1
        # === Loop... this can be improved ====================================
        estimate_data = []
        for i in range(window, x.index.size+1):
            X_slice = X.values[i-window:i, :]  # always index in np as opposed to pandas, much faster
            y_slice = y.values[i-window:i]
            coeff = np.dot(np.dot(np.linalg.inv(np.dot(X_slice.T, X_slice)), X_slice.T), y_slice)
            estimate_data.append(coeff[0] * x.values[i-1] + coeff[1])  # estimate at each window's endpoint
        # === Assemble ========================================================
        estimate = pd.Series(data=estimate_data, index=x.index[window-1:])
        return estimate
Notes
In some specific use cases, which only require the final estimate of the regression, x.rolling(window=60).apply(my_ols) appears to be somewhat slow.
As a reminder, the coefficients for a regression can be calculated as a matrix product, $\hat{\beta} = (X^T X)^{-1} X^T y$, as you can read on wikipedia's least squares page. This approach via NumPy's matrix multiplication can speed up the process somewhat versus using the ols in statsmodels. This product is expressed in the line starting with coeff = ...
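For what it's worth, the per-window loop above can itself be vectorized; a rough sketch of the coefficient step (my own addition, assuming 1-D float arrays and numpy >= 1.20 for sliding_window_view):
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view
def rolling_coeffs(y, x, window=60):
    xw = sliding_window_view(x, window)           # shape (n - window + 1, window)
    yw = sliding_window_view(y, window)
    X = np.stack([xw, np.ones_like(xw)], axis=2)  # per-window design matrix
    XtX = np.einsum("nwi,nwj->nij", X, X)         # batched X'X
    Xty = np.einsum("nwi,nw->ni", X, yw)          # batched X'y
    # solve the normal equations for every window at once
    return np.linalg.solve(XtX, Xty[..., None])[..., 0]  # columns: slope, intercept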
For rolling trend in one column, one can just use:
import numpy as np
def calc_trend(window: int = 30):
    # slope of a 1st-order polyfit over each rolling window of the column
    df['trend'] = df.rolling(window=window)['column_name'].apply(
        lambda x: np.polyfit(np.arange(window), x, 1)[0], raw=True)
However, in my case I wanted to find a trend with respect to date, where the date was in another column. I had to create the functionality manually, but it is easy. First, convert from datetime to int64 representing days from t_0:
xdays = (df['Date'].values.astype('int64') - df['Date'][0].value) / (1e9*86400)
Then:
def calc_trend(window: int = 30):
    for t in range(len(df)):
        if t < window//2:
            continue
        i0 = t - window//2  # Start window
        i1 = i0 + window    # End window
        if i1 > len(df):
            break  # the centered window would run past the end of the data
        xvec = xdays[i0:i1]
        yvec = df['column_name'][i0:i1].values
        df.loc[t, 'trend'] = np.polyfit(xvec, yvec, 1)[0]
With numpy or scipy, is there any existing method that will return the endpoints of an interval which contains a specified percent of the values in a 1D array? I realize that this is simple to write myself, but it seems like the kind of thing that might be built in, although I can't find it.
E.g.:
>>> import numpy as np
>>> x = np.random.randn(100000)
>>> print(np.bounding_interval(x, 0.68))
Would give approximately (-1, 1)
You can use np.percentile:
In [29]: x = np.random.randn(100000)
In [30]: p = 0.68
In [31]: lo = 50*(1 - p)
In [32]: hi = 50*(1 + p)
In [33]: np.percentile(x, [lo, hi])
Out[33]: array([-0.99206523, 1.0006089 ])
There is also scipy.stats.scoreatpercentile:
In [34]: scoreatpercentile(x, [lo, hi])
Out[34]: array([-0.99206523, 1.0006089 ])
I don't know of a built-in function to do it, but you can write one using the math package to specify approximate indices like this:
from __future__ import division
import math
import numpy as np
def bound_interval(arr_in, interval):
    lhs = (1 - interval) / 2  # specify left-hand side chunk to exclude
    rhs = 1 - lhs             # and the right-hand side
    arr_sorted = np.sort(arr_in)  # avoid shadowing the built-in sorted
    lower = arr_sorted[math.floor(lhs * len(arr_in))]  # use floor to get an index
    upper = arr_sorted[math.floor(rhs * len(arr_in))]
    return (lower, upper)
On your specified array, I got the interval (-0.99072237819851039, 0.98691691784955549). Pretty close to (-1, 1)!