Rolling linear regression on large DataFrames - python

I have two huge dataframes df_y and df_x.
df_y has columns ['date','ids','Y']. Basically each 'ids' has data for all the 'date'.
df_x has columns ['date','X1','X2','X3','X4','X5','X6'].
df_x has all the dates that are in df_y. However, some ids might cover a shorter period, i.e., either starting at a later date or ending at an earlier date.
I want to run a rolling linear regression (OLS) Y ~ X1 + X2 + X3 + X4 + X5 + X6 + intercept for each 'ids' in df_y with a lookback of 200 days.
Sample dataframes:
import string, random, pandas as pd, numpy as np
ids = [''.join(random.choice(string.ascii_uppercase) for _ in range(3)) for _ in range(200)]
dates = pd.date_range('2000-01-01', '2017-07-02')
df_dates = pd.DataFrame({'date':dates, 'joinC':len(dates)*[2]})
df_ids = pd.DataFrame({'ids':ids, 'joinC':len(ids)*[2]})
df_values = pd.DataFrame({'Y':np.random.normal(size=len(dates)*len(ids))})
df_y = df_dates.merge(df_ids, on='joinC', how="outer")
df_y = df_y[['date', 'ids']].merge(df_values, left_index=True, right_index=True, how="inner")
df_y = df_y.sort_values(['date', 'ids'], ascending=[True, True])
df_x = pd.DataFrame({'date':dates, 'X1':np.random.normal(size = len(dates)), 'X2':np.random.normal(size = len(dates)), 'X3':np.random.normal(size = len(dates)), 'X4':np.random.normal(size = len(dates)), 'X5':np.random.normal(size = len(dates)), 'X6':np.random.normal(size = len(dates))})
My attempt:
import statsmodels.api as sm
dates = list(df_y['date'].unique())
ids = list(df_y['ids'].unique())
for i in range(200, len(dates) + 1):
    for id in ids:
        # 200-date window ending at dates[i - 1]
        s_date = dates[i - 200]
        e_date = dates[i - 1]
        Y = df_y[(df_y['date'] >= s_date) & (df_y['date'] <= e_date) & (df_y['ids'] == id)]['Y']
        Y = Y.reset_index()['Y']
        X = df_x[(df_x['date'] >= s_date) & (df_x['date'] <= e_date)]
        X = X.reset_index()[['X1','X2','X3','X4','X5','X6']]
        X = sm.add_constant(X)
        # Skip ids that do not cover the full window
        if len(X) != len(Y):
            continue
        regr = sm.OLS(Y, X).fit() #Hangs here after 2 years.
        X_pr = X.tail(1)
        Y_hat = regr.predict(X_pr)
        # Store the prediction for the last date of the window in df_y
        df_y.loc[(df_y['date'] == e_date) & (df_y['ids'] == id), 'Y_hat'] = Y_hat.tolist()[0]
My attempt above seems to work fine up to the point where it hangs (most likely at the fitting step) after getting through approx. 2 years of dates. I am inclined to use statsmodels since it supports regularization (planned for future work). However, if another library makes it faster or more elegant, I am fine with that too. Could someone please help me find the fastest solution that doesn't hang midway? Thanks a lot.

I was able to get this workaround using pandas MovingOLS:
import pandas as pd
# Note: pd.stats.ols.MovingOLS only exists in older pandas releases (it was removed in pandas 0.20)
dates = list(df_y['date'].unique())
ids = list(df_y['ids'].unique())
# All regressor columns, i.e. everything in df_x except 'date'
X_cols = [c for c in df_x.columns if c != 'date']
Y_hats = []
for id in ids:
    Y = df_y[(df_y['ids'] == id)][['date', 'ids', 'Y']]
    Y = Y.merge(df_x, how='left', on=['date'])
    model = pd.stats.ols.MovingOLS(y=Y['Y'], x=Y[X_cols], window_type='rolling', window=250, intercept=True)
    Y['intercept'] = 1
    # Multiply each rolling coefficient by its regressor and sum across columns to get the fitted value
    betas = model.beta
    betas = betas.multiply(Y[betas.columns], axis='index')
    betas = betas.sum(axis=1)
    betas = betas[betas > 0]
    betas = betas.to_frame()
    betas.columns = ['Y_hat']
    betas = betas.merge(Y[['date', 'ids']], how='left', left_index=True, right_index=True)
    Y_hats.append(betas)
Y_hats = pd.concat(Y_hats)
# Attach the fitted values for all ids back onto df_y
df_y = df_y.merge(Y_hats[['date', 'ids', 'Y_hat']], how='left', on=['date', 'ids'])
There is a straightforward way to use Y['Y_hat'] = model.y_predict if, let's say, one wants to fit Y ~ X on (y_1, y_2, ... y_n) and (x_1, x_2, ... x_n) but only wants to predict Y_(n+1) using X_(n+1).
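Since MovingOLS is tied to old pandas releases, here is a minimal sketch of the same rolling fit using only NumPy. It is not the method from the answer above, just one possible alternative: it builds X'X and X'y for every trailing window at once and solves the normal equations in a batch. The function name rolling_ols and the demo arrays are made up for illustration, and sliding_window_view needs NumPy 1.20 or later. Solving the normal equations is less numerically robust than a QR-based fit, so this assumes the regressors are not close to collinear.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def rolling_ols(X, y, window=200):
    """Return (betas, y_hat) for OLS with intercept on each trailing window.

    Row t of the outputs uses observations t-window+1 .. t; earlier rows are NaN.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    Xc = np.column_stack([np.ones(n), X])            # prepend the intercept column
    betas = np.full((n, k + 1), np.nan)
    y_hat = np.full(n, np.nan)
    if n < window:
        return betas, y_hat
    Xw = sliding_window_view(Xc, window, axis=0)     # shape (n-window+1, k+1, window)
    yw = sliding_window_view(y, window)              # shape (n-window+1, window)
    XtX = np.einsum('nkw,njw->nkj', Xw, Xw)          # X'X for every window
    Xty = np.einsum('nkw,nw->nk', Xw, yw)            # X'y for every window
    b = np.linalg.solve(XtX, Xty[..., None])[..., 0] # batched solve, one beta per window
    betas[window - 1:] = b
    # Fitted value for the last row of each window (in-sample, as in the question)
    y_hat[window - 1:] = np.einsum('nk,nk->n', Xc[window - 1:], b)
    return betas, y_hat

# Synthetic demo data (hypothetical, not the df_x / df_y from the question)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 6))
y_demo = 0.5 + X_demo @ np.arange(1.0, 7.0) + rng.normal(scale=0.1, size=500)
betas, y_hat = rolling_ols(X_demo, y_demo, window=200)
```

For the data in the question, one would presumably call this once per id, after merging that id's rows with df_x on 'date' and sorting by date. To predict Y_(n+1) from X_(n+1) instead, the coefficients in row t of betas would be applied to the regressors in row t+1.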

Related

How do I create a linear regression model for a file that has about 500 columns as y variables? Working with Python

This code manually selects a column from the y table and then joins it to the X table. The program then performs linear regression. Any idea how to do this for every single column from the y table?
import pandas as pd
from sklearn import linear_model
yDF = pd.read_csv('ytable.csv')
yDF.drop('Dates', axis = 1, inplace = True)
XDF = pd.read_csv('Xtable.csv')
ycolumnDF = yDF.iloc[:,0].to_frame()
regressionDF = pd.concat([XDF,ycolumnDF], axis=1)
X = regressionDF.iloc[:,1:20]
y = regressionDF.iloc[:,20:].squeeze()
lm = linear_model.LinearRegression()
lm.fit(X,y)
cf = lm.coef_
print(cf)
You can regress multiple y's on the same X's at the same time. Something like this should work:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
df_X = pd.DataFrame(columns = ['x1','x2','x3'], data = np.random.normal(size = (10,3)))
df_y = pd.DataFrame(columns = ['y1','y2'], data = np.random.normal(size = (10,2)))
X = df_X.iloc[:,:]
y = df_y.iloc[:,:]
lm = LinearRegression().fit(X,y)
print(lm.coef_)
produces
[[ 0.16115884 0.08471495 0.39169592]
[-0.51929011 0.29160846 -0.62106353]]
The first row here ([ 0.16115884 0.08471495 0.39169592]) contains the regression coefs of y1 on the xs, and the second row contains the regression coefs of y2 on the xs.
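The same idea can also be checked without scikit-learn, as a rough sketch: np.linalg.lstsq accepts a two-dimensional right-hand side and solves for every y column in one call. The data and the true_coef matrix below are invented for illustration. The result is transposed at the end so that it has one row per y column, which is the layout lm.coef_ uses.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
# Hypothetical "true" coefficients: one row per y column
true_coef = np.array([[1.0, -2.0, 0.5],
                      [0.0, 3.0, -1.0]])
Y = X @ true_coef.T + 0.01 * rng.normal(size=(50, 2))

A = np.column_stack([np.ones(len(X)), X])    # intercept column + regressors
B, *_ = np.linalg.lstsq(A, Y, rcond=None)    # shape (4, 2): one column per y
intercepts = B[0]                            # one intercept per y column
coefs = B[1:].T                              # shape (2, 3): one row per y column
print(coefs)
```

With about 500 y columns, Y would simply have 500 columns and coefs 500 rows, with no loop over columns.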

How to get feature importance in RF

I am trying to get RF feature importance. I fit the random forest on the data like this:
model = RandomForestRegressor()
n = model.fit(self.X_train,self.y_train)
if n is not None:
    df = pd.DataFrame(data = n , columns = ["Feature","Importance_Score"])
    df["Feature_Name"] = np.array(self.X_Headers)
    df = df.drop(["Feature"], axis = 1)
    df[["Feature_Name","Importance_Score"]].to_csv("RF_Importances.csv", index = False)
    del df
However, the n variable is None. Why is this happening?
Not very sure how model.fit(self.X_train,self.y_train) is supposed to work. Need more information about how you set up the model.
If we set this up using simulated data, it works:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
np.random.seed(111)
X = pd.DataFrame(np.random.normal(0,1,(100,5)),columns=['A','B','C','D','E'])
y = np.random.normal(0,1,100)
model = RandomForestRegressor()
# fit() returns the fitted estimator itself
n = model.fit(X,y)
if n is not None:
    df = pd.DataFrame({'features':X.columns,'importance':n.feature_importances_})
df
  features  importance
0        A    0.176091
1        B    0.183817
2        C    0.169927
3        D    0.267574
4        E    0.202591

Error calculating r squared with statsmodels for multiple yfinance data in a DataFrame

I recently began learning Python, starting with a fairly complex project I had already begun in Excel. I have followed different guides for the code I have written so far, tweaking it to my needs.
I am using 'yfinance' to gather data from Yahoo! Finance for multiple cryptocurrencies over a specific time period, and 'statsmodels' to obtain alpha, beta and r squared from a DataFrame containing all cryptocurrencies plus an additional column with the market return (the x variable).
I am getting the following error: ValueError: endog and exog matrices are different sizes. I saw another question/answer regarding this error, but it did not seem to relate to my issue.
The error takes place in line 87 [model = sm.OLS(Y2,X_)] of the following code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import datetime
from pandas_datareader import data as pdr
import yfinance as yf
yf.pdr_override()
df1 = pdr.get_data_yahoo("BTC-USD", start="2015-01-01", end="2020-01-01")
df2 = pdr.get_data_yahoo("ETH-USD", start="2015-01-01", end="2020-01-01")
df3 = pdr.get_data_yahoo("XRP-USD", start="2015-01-01", end="2020-01-01")
df4 = pdr.get_data_yahoo("BCH-USD", start="2015-01-01", end="2020-01-01")
df5 = pdr.get_data_yahoo("USDT-USD", start="2015-01-01", end="2020-01-01")
df6 = pdr.get_data_yahoo("BSV-USD", start="2015-01-01", end="2020-01-01")
df7 = pdr.get_data_yahoo("LTC-USD", start="2015-01-01", end="2020-01-01")
df8 = pdr.get_data_yahoo("BNB-USD", start="2015-01-01", end="2020-01-01")
df9 = pdr.get_data_yahoo("EOS-USD", start="2015-01-01", end="2020-01-01")
df10 = pdr.get_data_yahoo("LINK-USD", start="2015-01-01", end="2020-01-01")
df11 = pdr.get_data_yahoo("XMR-USD", start="2015-01-01", end="2020-01-01")
df12 = pdr.get_data_yahoo("BTG-USD", start="2015-01-01", end="2020-01-01")
return_btc = df1.Close.pct_change()[1:]
return_eth = df2.Close.pct_change()[1:]
return_xrp = df3.Close.pct_change()[1:]
return_bch = df4.Close.pct_change()[1:]
return_usdt = df5.Close.pct_change()[1:]
return_bsv = df6.Close.pct_change()[1:]
return_ltc = df7.Close.pct_change()[1:]
return_bnb = df8.Close.pct_change()[1:]
return_eos = df9.Close.pct_change()[1:]
return_link = df10.Close.pct_change()[1:]
return_xmr = df11.Close.pct_change()[1:]
return_btg = df12.Close.pct_change()[1:]
d = {"BTC Return":return_btc, "ETH Return":return_eth, "XRP Return":return_xrp, "BCH Return":return_bch,
"USDT Return":return_usdt, "BSV Return":return_bsv, "LTC Return":return_ltc, "BNB Return":return_bnb,
"EOS Return":return_eos, "LINK Return":return_link, "XMR Return":return_xmr, "BTG Return":return_btg}
df = pd.DataFrame(d) # new data frame with all returns data
df = pd.DataFrame(d, columns=["Date", "BTC Return", "ETH Return", "XRP Return", "BCH Return", "USDT Return", "BSV Return",
"LTC Return", "BNB Return", "EOS Return", "LINK Return", "XMR Return", "BTG Return"])
avg_row = df.mean(axis=1)
return_mkt = avg_row
d1 = {"BTC Return":return_btc, "ETH Return":return_eth, "XRP Return":return_xrp, "BCH Return":return_bch,
"USDT Return":return_usdt, "BSV Return":return_bsv, "LTC Return":return_ltc, "BNB Return":return_bnb,
"EOS Return":return_eos, "LINK Return":return_link, "XMR Return":return_xmr, "BTG Return":return_btg, "MKT Return":return_mkt}
df = pd.DataFrame(d1)
print(df)
import statsmodels.api as sm
from statsmodels import regression
X = return_mkt.values
Y1 = return_btc
Y2 = return_eth
#Y3 = return_xrp
def linreg(x,y):
    x = sm.add_constant(x)
    model = regression.linear_model.OLS(y,x).fit()
    # we are removing the constant
    x = x[:, 1]
    return model.params[0], model.params[1]
X_ = sm.add_constant(X) # artificially add intercept to x, as advised in the docs
model = sm.OLS(Y1,X_)
results = model.fit()
rsquared = results.rsquared
alpha, beta = linreg(X,Y1)
def linreg(x,y):
    x = sm.add_constant(x)
    model = regression.linear_model.OLS(y,x).fit()
    # we are removing the constant
    x = x[:, 1]
    return model.params[0], model.params[1]
X_ = sm.add_constant(X) # artificially add intercept to x, as advised in the docs
model = sm.OLS(Y2,X_)
results = model.fit()
rsquared = results.rsquared
alpha, beta = linreg(X,Y2)
The error is located in the second def, as I am trying to compute the previously mentioned statistics for each cryptocurrency. Thus, the 1st def is for BTC (Y1), the 2nd def is for ETH (Y2), and so on (Y3,...).
The entire code was fine when I had only the function for BTC at the end, the error occurred when I tried to add more of the same function for the others.
Fundamentally, the problem is that because Ethereum (and all other cryptos) started later than bitcoin, there are null values for the price every day for the first few years, which can't be handled. So you have to take just the values where they are not null.
However, there are many things in your code which you could factor out so that you don't repeat yourself unnecessarily. You made an attempt at that with the linreg function, but then you re-defined it for the second crypto, which shouldn't be necessary.
Here is a quick re-write which addresses both the fundamental problem and hopefully illustrates what I mean above. The output is a dataframe with the statistics you're looking for, by cryptocurrency. The goal is to write as much of the code 'generically', and then just provide a list of cryptos that you are interested in.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from pandas_datareader import data as pdr
import datetime
import yfinance as yf
import statsmodels.api as sm
from statsmodels import regression
yf.pdr_override()
cryptos = ["BTC", "ETH", "XRP"] # Here you can specify the cryptos you want. I just used 3 for demonstration
# The rest of the code is not specific to any one crypto
def get_and_process_data(c):
    raw_data = pdr.get_data_yahoo(c + '-USD', start="2015-01-01", end="2020-01-01")
    return raw_data.Close.pct_change()[1:]
df = pd.DataFrame({c: get_and_process_data(c) for c in cryptos})
df['avg_return'] = df.mean(axis=1) # avg market return
print(df)
def model(x, y):
    # Fit y = alpha + beta * x once; the same fit gives r-squared, alpha and beta
    X = sm.add_constant(x) # artificially add intercept to x, as advised in the docs
    fit = sm.OLS(y, X).fit()
    return fit.rsquared, fit.params.iloc[0], fit.params.iloc[1]
# Use only the rows where the crypto has data, so that x and y have the same length
results = pd.DataFrame({c: model(df[df[c].notnull()]['avg_return'], df[df[c].notnull()][c]) for c in cryptos}).transpose()
results.columns = ['rsquared', 'alpha', 'beta']
print(results)
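The null-value problem described in this answer can be reproduced without downloading anything. The following is only a sketch on synthetic data: the series names, the start row of 100 and the helper alpha_beta_r2 are all invented for illustration, and NumPy's least squares stands in for statsmodels. It shows that masking to the rows where both series are present gives inputs of equal length.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 300
market = pd.Series(rng.normal(0, 0.02, n), name="market")
# Hypothetical asset that only starts trading at row 100: earlier returns are NaN
asset = 0.001 + 1.5 * market + rng.normal(0, 0.001, n)
asset.iloc[:100] = np.nan

def alpha_beta_r2(x, y):
    # Keep only the rows where both series have a value
    mask = x.notna() & y.notna()
    xv, yv = x[mask].to_numpy(), y[mask].to_numpy()
    A = np.column_stack([np.ones(len(xv)), xv])   # intercept + market return
    coef, *_ = np.linalg.lstsq(A, yv, rcond=None)
    resid = yv - A @ coef
    r2 = 1.0 - resid @ resid / np.sum((yv - yv.mean()) ** 2)
    return coef[0], coef[1], r2, int(mask.sum())

alpha, beta, r2, n_used = alpha_beta_r2(market, asset)
print(alpha, beta, r2, n_used)
```

Only the 200 rows where the asset has data enter the fit, which is what the notnull() filter in the rewrite above does for each crypto.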

How to resolve Boolean value error in linear regression model in python?

I am trying to run a Fama-MacBeth regression in Python. As a first step I am running the time-series regression for every asset in my portfolio, but I am unable to run it because I am getting an error:
'ValueError: Must pass DataFrame with boolean values only'
I am relatively new to Python and have relied heavily on this forum to help me out. I hope you can help me with this issue.
Please let me know how I can resolve this. I will be very grateful to you!
I assume this line is producing the error, because when I run the function without the for loop, it works perfectly.
for i in range(cols):
    df_beta = RegressionRoll(df=data_set, subset = 0, dependent = data_set.iloc[:,i], independent = data_set.iloc[:,30:], const = True, parameters = 'beta',
                             win = 12)
The dimension of my matrix is 108x35: 30 stocks and 5 factors over 108 points. Hence I want to run a regression for every stock against the 5 factors and store the resulting coefficients in a dataframe. Sample dataframe:
Date        BAS GY   AI FP    SGL GY   LNA GY   AKZA NA   Market Factor
1/29/2010   -5.28%   -7.55%   -1.23%   -5.82%   -7.09%    -5.82%
2/26/2010    0.04%   13.04%   -1.84%    4.06%   -14.62%   -14.62%
3/31/2010   10.75%    1.32%    7.33%    6.61%   12.21%    12.21%
The following is the entire code:
import pandas as pd
import statsmodels.api as sm
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.stats.outliers_influence import variance_inflation_factor
data_set = pd.read_excel(r'C:\XXX\Research Project\Data\Regression.xlsx', sheet_name = 'Fama Macbeth')
data_set.set_index(data_set['Date'], inplace=True)
data_set.drop('Date', axis=1, inplace=True)
X = data_set.iloc[:,30:]
y = data_set.iloc[:,:30]
def RegressionRoll(df, subset, dependent, independent, const, win, parameters):
    # Data subset
    if subset != 0:
        df = df.tail(subset)
    else:
        df = df
    # Loopinfo
    end = df.shape[0]
    win = win
    rng = np.arange(start = win, stop = end, step = 1)
    # Subset and store dataframes
    frames = {}
    n = 1
    for i in rng:
        df_temp = df.iloc[:i].tail(win)
        newname = 'df' + str(n)
        frames.update({newname: df_temp})
        n += 1
    # Analysis on subsets
    df_results = pd.DataFrame()
    for frame in frames:
        #print(frames[frame])
        # Rolling data frames
        dfr = frames[frame]
        y = dependent
        x = independent
        if const == True:
            x = sm.add_constant(dfr[x])
            model = sm.OLS(dfr[y], x).fit()
        else:
            model = sm.OLS(dfr[y], dfr[x]).fit()
        if parameters == 'beta':
            theParams = model.params[0:]
            coefs = theParams.to_frame()
            df_temp = pd.DataFrame(coefs.T)
            indx = dfr.tail(1).index[-1]
            df_temp['Date'] = indx
            df_temp = df_temp.set_index(['Date'])
            df_results = pd.concat([df_results, df_temp], axis = 0)
        if parameters == 'R2':
            theParams = model.rsquared
            df_temp = pd.DataFrame([theParams])
            indx = dfr.tail(1).index[-1]
            df_temp['Date'] = indx
            df_temp = df_temp.set_index(['Date'])
            df_temp.columns = [', '.join(independent)]
            df_results = pd.concat([df_results, df_temp], axis = 0)
    return(df_results)
cols = len(y.columns)
for i in range(cols):
    df_beta = RegressionRoll(df=data_set, subset = 0, dependent = data_set.iloc[:,i], independent = data_set.iloc[:,30:], const = True, parameters = 'beta',
                             win = 12)
ValueError: Must pass DataFrame with boolean values only
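One possible explanation, sketched under assumptions: inside RegressionRoll the arguments are used as keys (dfr[x], dfr[y]), but the loop passes data_set.iloc[...] objects rather than column names. Indexing a DataFrame with another DataFrame is treated as a boolean mask, so a DataFrame of floats raises a ValueError (the exact message depends on the pandas version). The tiny frame and its column names below are invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical stand-in for data_set: 2 stocks followed by 2 factors
data = pd.DataFrame(rng.normal(size=(24, 4)),
                    columns=["stock_a", "stock_b", "factor_1", "factor_2"])
window = data.iloc[:12]

# Using a DataFrame of floats as the key is interpreted as a boolean mask and fails
try:
    window[data.iloc[:, 2:]]
    mask_error = None
except ValueError as exc:
    mask_error = exc
print(type(mask_error).__name__)

# Using column labels as the key selects those columns, which is what dfr[x] needs
factor_cols = list(data.columns[2:])
selected = window[factor_cols]
print(selected.shape)
```

If this is indeed the cause, passing names such as dependent = data_set.columns[i] and independent = list(data_set.columns[30:]) would be the corresponding change in the loop. That is an untested suggestion, since the original spreadsheet is not available.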

Python add 2 multidimensional numpy arrays

I'm trying to collect/concat multiple numpy arrays in a single numpy array. I can do this with a pandas data frame as:
df_train = pd.DataFrame()
... loop ...:
    df_temp = pd.read_json(file)
    df_train = pd.concat([df_train, df_temp], ignore_index=True, axis=0, sort=False)
in a loop. With this I'm able to combine various data in a single data frame.
I want to do the same thing with numpy arrays. I tried:
nump_train = np.nan
... loop ...:
    nump = df_temp.values
    nump_train = np.concatenate((nump_train, nump))
but I cannot concatenate zero-dimensional arrays, as the error message says (ValueError: zero-dimensional arrays cannot be concatenated).
How can I do this like in pandas?
PS: I can solve this with badly written, hard-coded code such as:
w=1
for loop:
    if w == 1:
        nump1 = sc.transform(df_temp.drop(['time'], axis=1))
    elif w == 2:
        nump2 = sc.transform(df_temp.drop(['time','trend'], axis=1))
    elif w == 3:
        nump3 = sc.transform(df_temp.drop(['time'], axis=1))
    w += 1
X_train = np.concatenate((nump1, nump2, nump3), axis = 0)
But this is bad coding and I cannot scale it in a loop.
EDIT 1:
Actual code is this:
w = 1
for i in range(1, loop_size+1):
    df_train = pd.DataFrame()
    nump_train = np.nan
    random_list = random.sample(file_list, selection)
    for json in random_list:
        json_name = json[:json.index('_')]
        df_temp = pd.read_json(filedir + json)
        train_period_mask = (df_temp['time'] > train_start_date) & (df_temp['time'] < train_end_date)
        df_temp = df_temp.loc[train_period_mask]
        df_temp.index = np.arange(0, len(df_temp))
        df_temp = calc_(df_temp)
        df_temp['trend'] = zg(df_temp, zg_ratio)
        df_temp['trend_shifted'] = df_temp.trend.shift(-1)
        df_temp = df_temp.dropna()
        nump = sc.fit_transform(df_temp.drop(['time','trend_shifted','trend'], axis=1))
        if w == 1:
            nump1 = sc.transform(df_temp.drop(['time','trend_shifted','trend'], axis=1))
        elif w == 2:
            nump2 = sc.transform(df_temp.drop(['time','trend_shifted','trend'], axis=1))
        elif w == 3:
            nump3 = sc.transform(df_temp.drop(['time_period_start','trend_shifted','trend'], axis=1))
        df_train = pd.concat([df_train, df_temp], ignore_index=True, axis=0, sort=False)
        nump_train.append(nump)
        w += 1
    drop_list = ['time_period_start']
    df_train.drop(drop_list, 1, inplace = True )
    start = timeit.default_timer()
    sc = MinMaxScaler()
    X_train = sc.fit_transform(df_train.drop(['trend','trend_shifted'], axis=1))
    X_train2 = np.concatenate((nump1, nump2, nump3), axis = 0)
    y_train = df_train['trend_shifted'].values
I want X_train and X_train2 to have the same shape.
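A common way to get the pandas-like pattern with NumPy, shown here as a minimal sketch: collect the per-file arrays in a plain Python list inside the loop and call np.concatenate once at the end. The random arrays below are stand-ins for the scaled df_temp values, and all of them are assumed to have the same number of columns.

```python
import numpy as np

rng = np.random.default_rng(0)
chunks = []                                   # start from an empty list, not np.nan
for n_rows in (3, 5, 2):                      # stand-in for looping over the files
    nump = rng.normal(size=(n_rows, 4))       # stand-in for the per-file 2-D array
    chunks.append(nump)                       # appending to a list copies no array data
X_train2 = np.concatenate(chunks, axis=0)     # one concatenation after the loop
print(X_train2.shape)
```

This removes the need for the numbered variables nump1, nump2, nump3 and works for any number of files.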
