Regression across multiple columns in Python

I'm new to coding and am having a hard time regressing multiple columns on one column.
The dataframe consists of ~200 securities. I want to regress each security on a specific column (stock1 regressed on stock4, stock2 regressed on stock4, stock3 regressed on stock4, etc.)
Then, I want a new dataframe of the regression coefficients and the securities.
import numpy as np
from sklearn.linear_model import LinearRegression

regr = LinearRegression()
y_regression = np.array(df.y).reshape(-1, 1)
beta = lambda x: list(regr.fit(np.array(x).reshape(-1, 1), y_regression).coef_)
beta = df.apply(beta)
The code will correctly perform the calculations across all columns, but each result comes out as an np.ndarray and looks like this: [[1.25678]]. The only way I can get the code to work on multiple columns is if I create a list of the arrays. My new dataframe has this format:
Stock1 [[1.25678]]
Stock2 [[0.96782]]
etc.
How can I change the type so that it only gives me the inner number (1.25678, etc.)?

Does this work:
from sklearn.linear_model import LinearRegression

regr = LinearRegression()
y_regression = np.array(df.y).reshape(-1, 1)
# coef_ has shape (1, 1) because y was reshaped to a column vector,
# so .item() pulls out the scalar slope
beta = lambda x: regr.fit(np.array(x).reshape(-1, 1), y_regression).coef_.item()
beta = df.apply(beta)
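Note that df.apply will also hit the target column itself. A fuller sketch (assuming the target column is literally named y, as in the question) that drops it first and collects the scalar slopes into a new dataframe:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

regr = LinearRegression()
y_regression = df['y'].to_numpy().reshape(-1, 1)

def slope(col):
    # one univariate fit per security; .item() unwraps the (1, 1) coef_ array
    regr.fit(col.to_numpy().reshape(-1, 1), y_regression)
    return regr.coef_.item()

betas = df.drop(columns='y').apply(slope)  # Series mapping security -> beta
betas = betas.rename('beta').to_frame()    # dataframe of securities and their betas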

Related

Question about Yfinance, IndexError, and Numpy arrays

I am trying to use linear regression on data pulled from yfinance to predict future stock prices, but I am having trouble running the regression after transposing my data's shape.
Here I create a normalization function
def normalize_data(df):
    # df on input should contain only one column with the price data (plus dataframe index)
    min = df.min()
    max = df.max()
    x = df
    # time series normalization part
    # y will be a column in a dataframe
    y = (x - min) / (max - min)
    return y
And another function to pull stock prices from Yfinance that calls the normalization function
def closing_price(ticker):
    # Asset = pd.DataFrame(yf.download(ticker, start=Start, end=End)['Adj Close'])
    Asset = pd.DataFrame(yf.download(ticker, start='2022-07-13', end='2022-09-16')['Adj Close'])
    Asset = normalize_data(Asset)
    return Asset.to_numpy()
I then pull 11 different stocks using the function
MRO = closing_price('MRO')
HES = closing_price('HES')
FANG = closing_price('FANG')
DVN = closing_price('DVN')
PXD = closing_price('PXD')
COP = closing_price('COP')
CVX = closing_price('CVX')
APA = closing_price('APA')
EOG = closing_price('EOG')
HAL = closing_price('HAL')
BLK = closing_price('BLK')
This works so far. But when I try to merge the first 10 numpy arrays together,
X = np.array([MRO, HES, FANG, DVN, PXD, COP, CVX, APA, EOG, HAL])[:, :, 0]
X = np.transpose(X)
the first line raises this warning:
<ipython-input-53-a30faf3e4390>:1: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
Have you tried passing an explicit dtype, along the lines of what your error message suggests?
X = np.array([MRO, HES, FANG, DVN, PXD, COP, CVX, APA, EOG, HAL], dtype=float)[:, :, 0]
Alternatively, what are you trying to do with your data afterwards, run a linear regression? Does the data have to be an np array? Often working with data is a lot easier using pandas.DataFrame, and basically all machine learning libraries such as sklearn or statsmodels or any other you might want to use will have pandas support.
To create one big dataset out of these you could try the following:
data = pd.DataFrame()  # empty dataframe to collect every ticker
# keep the names next to the frames, since str(df) won't give you the ticker name
# (this assumes closing_price returns the DataFrame, i.e. without the .to_numpy())
list_of_tickers = {'MRO': MRO, 'HES': HES, 'FANG': FANG, 'DVN': DVN, 'PXD': PXD,
                   'COP': COP, 'CVX': CVX, 'APA': APA, 'EOG': EOG, 'HAL': HAL, 'BLK': BLK}
for name, ticker in list_of_tickers.items():
    # each column is just labelled "Adj Close", and you can't name multiple columns the
    # same way, so prefix it: columns become "MRO_Adj Close", "HES_Adj Close", etc.
    ticker = ticker.add_prefix(name + '_')
    data = pd.concat([data, ticker], axis=1)
Additionally, concatenating on the date index neatly prevents problems that might arise when different stock tickers have or lack different dates in their datasets, as Kevin Choon Liang Yew correctly pointed out in the comments above.
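As a side note, yf.download also accepts a list of tickers, which sidesteps the manual merge entirely. A minimal sketch, assuming the same yfinance defaults and date range as in the question:

import yfinance as yf

symbols = ['MRO', 'HES', 'FANG', 'DVN', 'PXD', 'COP', 'CVX', 'APA', 'EOG', 'HAL', 'BLK']
# one call returns a frame with one 'Adj Close' column per ticker,
# aligned on a shared date index (missing dates show up as NaN)
data = yf.download(symbols, start='2022-07-13', end='2022-09-16')['Adj Close']
data = (data - data.min()) / (data.max() - data.min())  # column-wise min-max normalization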

LOOP univariate rolling window regression on entire DF Python

I have a dataframe of 24 variables (24 columns x 4580 rows) from 2008 to 2020.
My independent variable is the first one in the DF, and the dependent variables are the 23 others.
I've done a test for one rolling window regression and it works well; here is my code:
import pandas as pd
import statsmodels.api as sm
from statsmodels.regression.rolling import RollingOLS
import seaborn

seaborn.set_style('darkgrid')
pd.plotting.register_matplotlib_converters()

x = sm.add_constant(df[['DIFFSWAP']])
y = df[['CADUSD']]
rols = RollingOLS(y, x, window=60)
rres = rols.fit()
params = rres.params
r_sq = rres.rsquared
Now, what I want to do is loop a rolling window regression of each dependent variable in the DF (columns 2:24) on the independent variable (column 1) and store the coefficients and the R-squareds.
My ultimate goal is to extract the R-squareds and coefficients, put them in dataframes (or lists, or whatever), and then graph them.
I'm new to Python, so I'd be very grateful for any help.
Thank you!
Can you throw it all in a loop and store the results in some other object like a dict?
Potential solution:
data = {}
for column in df.columns[1:]:  # iterate over the 23 dependent columns (skip the first one)
    x = sm.add_constant(df[column])
    y = df[['CADUSD']]  # this never changes from CADUSD, right?
    rols = RollingOLS(y, x, window=60)
    rres = rols.fit()
    params = rres.params
    r_sq = rres.rsquared
    # store the results from each column's fit as a new dict entry
    data[column] = {'params': params, 'r_sq': r_sq}
results_df = pd.DataFrame(data).T
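From there, pulling the slopes and R-squareds into plottable frames could look like the sketch below (it assumes the data dict built above; each params frame holds a 'const' column plus one column named after the regressor):

# one column of rolling slopes per dependent variable, and one of rolling R-squared
slopes = pd.DataFrame({col: res['params'][col] for col, res in data.items()})
r_squared = pd.DataFrame({col: res['r_sq'] for col, res in data.items()})
slopes.plot(title='Rolling slope coefficients')
r_squared.plot(title='Rolling R-squared')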

Filling results of linear regression into a dataframe

I'm running a regression between two stocks: y = bank_matrix['EXO.MI'] and x = bank_matrix['LDO.MI'].
My task is to update the slope coefficient every 20 days (lookback). In short, I want to have a list of slope coefficients starting from day 20 (my lookback). So I run this regression model called reg.
In the meantime, I create:
A) 3 empty lists: Intercetta=[], Hedge=[], Residuals=[]
B) 1 dataframe called Regressione, where I want to copy the results of my regression (intercept, slope and residuals) into its columns (['Intercetta','Hedge','Residuals']).
Now the whole code:
import datetime

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pandas_datareader as pdr
from sklearn.linear_model import LinearRegression
from pandas.plotting import register_matplotlib_converters

register_matplotlib_converters()

tickers = ['EXO.MI', 'LDO.MI']
end = datetime.date.today()
gap = datetime.timedelta(days=650)
start = end - gap

Bank = pdr.get_data_yahoo(tickers, start=start, end=end)
bank_matrix = Bank['Adj Close']
bank_matrix = bank_matrix.dropna()
exor = bank_matrix['EXO.MI']
leonardo = bank_matrix['LDO.MI']

Regressione = pd.DataFrame(data=np.zeros((len(exor), 3)),
                           columns=['Intercetta', 'Hedge', 'Residuals'],
                           index=bank_matrix['EXO.MI'].index)

lookback = 20
Hedge = []
Intercetta = []
Residuals = []
for i in range(lookback, len(exor)):
    reg = LinearRegression().fit(bank_matrix[['LDO.MI']][i-lookback+1:i],
                                 bank_matrix[['EXO.MI']][i-lookback+1:i])
    # Regressione.iloc[Regressione[i,'Hedge']] = reg.coef_[0]
    Hedge.append(reg.coef_[0])
    Intercetta.append(reg.intercept_)
    y_pred = reg.predict(bank_matrix[['LDO.MI']][lookback:])
    Residuals.append(bank_matrix[['EXO.MI']][lookback:].to_numpy() - y_pred)

Regressione = pd.DataFrame(list(zip(Intercetta, Hedge, Residuals)),
                           columns=['Intercetta', 'Hedge', 'Residuals'])
Regressione.set_index(bank_matrix[['EXO.MI']].index[lookback:], inplace=True)
NOW THE FINAL QUESTION: why, in my final dataframe Regressione, is the third column ('Residuals') a horizontal array?
Firstly, I think these two lines are doing something completely wrong:
y_pred=reg.predict(bank_matrix[['LDO.MI']][lookback:])
Residuals.append(bank_matrix[['EXO.MI']][lookback:].to_numpy()-y_pred)
You basically run a linear regression on points 1 to 20, then 2 to 21, then 3 to 22, and so on. But then you apply each of those models to the data from observation 20 onward. So you get the model for, e.g., points 5 to 24, and based on it you predict observations 20 through the end, then take the difference between that prediction and the actuals (note that bank_matrix[['EXO.MI']][lookback:].to_numpy() never changes during your for loop).
I suppose what would make more sense here would be:
y_pred = reg.predict(bank_matrix[['LDO.MI']][i-lookback+1:i])
Residuals.append(bank_matrix[['EXO.MI']][i-lookback+1:i].to_numpy() - y_pred)
So you would take the in-sample error of the model. Or:
y_pred = reg.predict(bank_matrix[['LDO.MI']][i:])
Residuals.append(bank_matrix[['EXO.MI']][i:].to_numpy() - y_pred)
So you would test how a prediction based on the current time span fits the data going forward.
Now, the first option will produce a list of 19 elements per row, while the second will produce a list of 430 elements, decreasing by 1 per row, down to 1 in the last row. That's because these are residuals: you have one line, one slope, one hedge per time span, but a number of observations within that span, each producing a different result. So depending on how you want to express it, you can reduce each list to a single number, e.g. the sum of squared residuals or the mean residual, by applying a further transformation.
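A minimal sketch of that reduction, using the in-sample residuals from the first option inside the loop:

y_pred = reg.predict(bank_matrix[['LDO.MI']][i-lookback+1:i])
resid = bank_matrix[['EXO.MI']][i-lookback+1:i].to_numpy() - y_pred
# reduce each window's residual vector to one number: here the sum of squares,
# or use resid.mean() for the mean residual
Residuals.append(float((resid ** 2).sum()))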
Hope this helps...
From the doc:
If a list of dict/series is passed and the keys are all contained in the DataFrame’s index, the order of the columns in the resulting DataFrame will be unchanged.
Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.
You need to use df.loc to modify the data in your dataframe in place...
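For instance, a minimal sketch of writing each window's results straight into the pre-allocated Regressione inside the loop, instead of building separate lists:

# reg.intercept_ and reg.coef_ are arrays here because y is two-dimensional
Regressione.loc[Regressione.index[i], 'Intercetta'] = reg.intercept_[0]
Regressione.loc[Regressione.index[i], 'Hedge'] = reg.coef_[0][0]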

How to save predicted regression values inside a for loop?

I'm trying to use statsmodels to run separate logistic regressions for each "group" in a pandas dataframe and save the predicted probabilities for each observation (row). Each "group" represents about 2500 respondents or observations; I would like to get the predicted probability for each respondent, similar to how SPSS lets you "save" predicted probabilities when running a logistic regression.
I've read what others have attempted, but nothing seems to work. I'm using SPSS to check that the looping operation in Python is working correctly; the predicted probabilities should be the same (SPSS has a split function which makes this really easy).
import pandas as pd
import numpy as np
from statsmodels.formula.api import logit

df = pd.read_csv('test_data.csv')
for cat in df['Brand'].unique():
    df_slice = df[df.Brand == cat]
    est = logit('binary ~ var_1', df_slice)
    est_result = est.fit()
    pred = est_result.predict(df)
    print(est_result.summary())
    df['pred'] = pred
The model summaries are correct (est_result.summary()) and match SPSS exactly. However, the saved predicted values do not match at all. I cannot seem to understand how to get it to work correctly.
Any advice is appreciated.
I solved it in a really un-Pythonic way. I hope someone can improve this code. The probabilities now match exactly what SPSS produces when you split the file by group and run individual regressions per group.
results = []
for cat in df['Brand'].unique():
    df_slice = df[df.Brand == cat]
    est = logit('binary ~ var_1', df_slice)
    est_result = est.fit()
    pred = est_result.predict(df_slice)
    results.append(pred)
    # print(est_result.summary())

n = len(df['Brand'].unique())
r = pd.DataFrame(results)                 # put the results into a dataframe
rt = r.T                                  # transpose the dataframe
r_small = rt[rt.columns[-n:]]             # remove all but the last n columns, n = number of categories
r_new = r_small.bfill(axis=1).iloc[:, 0]  # merge the n columns and remove the NaNs
r_new                                     # show us
df['predicted'] = r_new                   # combine the r_new series with the original dataframe
df                                        # show us
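A more idiomatic sketch of the same idea: since each per-group prediction keeps the row index of its slice, pd.concat can stitch them straight back onto the original dataframe, with no transposing or backfilling needed:

results = []
for cat in df['Brand'].unique():
    df_slice = df[df.Brand == cat]
    est_result = logit('binary ~ var_1', df_slice).fit()
    # predict only on the rows the model was fitted on; the slice keeps its index
    results.append(est_result.predict(df_slice))

# concatenating the per-group predictions aligns them back by row index
df['predicted'] = pd.concat(results)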

How to fix .predict() function in statsmodels?

I'm trying to predict the temperature at 12 UTC tomorrow at one location. To forecast, I use a basic linear regression model from the statsmodels module. My code is below:
import statsmodels.api as sm

x = ds_main
X = sm.add_constant(x)
y = ds_target_t
model = sm.OLS(y, X, missing='drop')
results = model.fit()
The summary shows that the fit is "good".
But the problem appears when I try to predict values on a new dataset that I use as my test set. It has the same number of columns and the same variable names, but .predict() returns an array of NaN, even though my test set has values...
xnew = ts_main
Xnew = sm.add_constant(xnew)
ynewpred = results.predict(Xnew)
I really don't understand where the problem is...
UPDATE: I think I have an explanation: my Xnew dataframe contains NaN values. The missing='drop' option in statsmodels drops missing values (NaN) when fitting, but the .predict() function does not, so it returns an array of NaN values...
But that is the "why"; I still don't know how to fix it...
statsmodels.api.OLS by default will not accept data with NA values, so if you use it, you need to drop your NA values first.
However, if you use statsmodels.formula.api.ols, it will automatically drop the NA values to run the regression and make predictions for you.
So you can try this:
import statsmodels.formula.api as smf

# note: the formula should reference the actual column names in the concatenated dataframe
lm = smf.ols(formula="y ~ X", data=pd.concat([y, X], axis=1)).fit()
lm.predict(Xnew)
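Alternatively, if you want to keep the statsmodels.api version, one option is to drop the incomplete rows from the test set yourself before predicting, e.g.:

Xnew_complete = Xnew.dropna()              # keep only the rows with no missing values
ynewpred = results.predict(Xnew_complete)  # predictions indexed by the surviving rows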
