Loop through DataFrame columns to do simple linear regression? - Python

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
df = pd.read_excel("Book1.xlsx")
for column in df:
    X = df["Row Labels"]
    Y = df[column]
    y1 = Y.values.reshape(-1, 1)
    x1 = X.values.reshape(-1, 1)
    regressor = LinearRegression()
    regressor.fit(x1, y1)
    y_new = []
    y_i = []
    for i in range(12, 24):
        y_new.append(regressor.predict([[i]]))
        y_i.append(i)
    df2 = pd.DataFrame({'column':y_new})
I wrote this code to loop through the DataFrame columns, run a simple linear regression on each one, and put all the predicted values in a DataFrame. But it only keeps the predictions for the last column.

df2 = pd.DataFrame({'column':y_new}) creates a column literally named 'column' (not the name stored in the variable column). Moreover, df2 is recreated in every iteration, so after the loop it holds only the y_new from the last iteration.
I think what you want is to create df2 once, before the loop, and add a new column to it in each iteration:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
df = pd.read_excel("Book1.xlsx")
df2 = pd.DataFrame()  # created once, before the loop
for column in df:
    X = df["Row Labels"]
    Y = df[column]
    y1 = Y.values.reshape(-1, 1)
    x1 = X.values.reshape(-1, 1)
    regressor = LinearRegression()
    regressor.fit(x1, y1)
    y_new = []
    y_i = []
    for i in range(12, 24):
        # predict returns a (1, 1) array here; keep only the scalar
        y_new.append(regressor.predict([[i]])[0, 0])
        y_i.append(i)
    df2[column] = y_new  # one column per regression, named after the source column
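The inner loop over i is not strictly needed, since predict accepts all the new x values at once. Here is a minimal sketch of that idea, assuming a small synthetic table in place of Book1.xlsx (the columns "A" and "B" and their values are made up for illustration) and skipping the "Row Labels" column itself:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for Book1.xlsx (hypothetical data, not the asker's file)
df = pd.DataFrame({
    "Row Labels": np.arange(12),
    "A": 2.0 * np.arange(12) + 1.0,
    "B": -0.5 * np.arange(12) + 4.0,
})

x_train = df[["Row Labels"]]  # 2-D predictor, shape (12, 1)
x_future = pd.DataFrame({"Row Labels": np.arange(12, 24)})

df2 = pd.DataFrame(index=x_future["Row Labels"])
for column in df.columns.drop("Row Labels"):
    regressor = LinearRegression().fit(x_train, df[column])
    # predict handles all 12 future x values in one call
    df2[column] = regressor.predict(x_future)

print(df2.head())
```

Indexing df2 by the future x values keeps each prediction next to the x it belongs to, which replaces the separate y_i list.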

Related

LinearRegression TypeError

The above screenshot is referred to as sample.xlsx. I've been having trouble getting the beta for each stock using the LinearRegression() function.
Input:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
df = pd.read_excel('sample.xlsx')
mean = df['ChangePercent'].mean()
for index, row in df.iterrows():
    symbol = row['stock']
    perc = row['ChangePercent']
    x = np.array(perc).reshape((-1, 1))
    y = np.array(mean)
    model = LinearRegression().fit(x, y)
    print(model.coef_)
Output:
Line 16: model = LinearRegression().fit(x, y)
"Singleton array %r cannot be considered a valid collection." % x
TypeError: Singleton array array(3.34) cannot be considered a valid collection.
How can I make the collection valid so that I can get a beta value (model.coef_) for each stock?
X and y must contain the same number of samples, and scikit-learn rejects a 0-dimensional (scalar) array, so you need to reshape both x and y to 1 row and 1 column. In this case it comes down to the following:
np.array(mean).reshape(-1,1) or np.array(mean).reshape(1,1)
Given that you are training 5 regression models, each one on just one value, it is not surprising that all 5 models will "learn" that the coefficient of the linear regression is 0 and the intercept is 3.34 (y).
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
df = pd.DataFrame({
    "stock": ["ABCD", "XYZ", "JK", "OPQ", "GHI"],
    "ChangePercent": [-1.7, 30, 3.7, -15.3, 0]
})
mean = df['ChangePercent'].mean()
for index, row in df.iterrows():
    symbol = row['stock']
    perc = row['ChangePercent']
    x = np.array(perc).reshape(-1,1)
    y = np.array(mean).reshape(-1,1)
    model = LinearRegression().fit(x, y)
    print(f"{model.intercept_} + {model.coef_}*{x} = {y}")
This is correct from an algorithmic point of view, but it makes no practical sense, because each model is trained on a single example.
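For a meaningful beta, each stock needs many observations, typically a series of its returns regressed against a series of market returns. The following is only a sketch under that assumption: the market and stock series are randomly generated, and true_beta is a made-up value used to build them, none of which comes from sample.xlsx.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical daily percent changes: one market series, one stock series
market = rng.normal(0.0, 1.0, size=250)
true_beta = 1.5  # assumed value, used only to generate the synthetic stock
stock = true_beta * market + rng.normal(0.0, 0.1, size=250)

# x must be 2-D (n_samples, n_features); y has one entry per sample
model = LinearRegression().fit(market.reshape(-1, 1), stock)
beta = model.coef_[0]
print(f"estimated beta: {beta:.3f}")
```

With 250 samples per fit instead of one, the slope is well determined and comes out close to the value used to generate the data.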

How do I create a linear regression model for a file that has about 500 columns as y variables? Working with Python

This code manually selects a column from the y table and then joins it to the X table. The program then performs linear regression. Any idea how to do this for every single column from the y table?
import pandas as pd
from sklearn import linear_model
yDF = pd.read_csv('ytable.csv')
yDF.drop('Dates', axis = 1, inplace = True)
XDF = pd.read_csv('Xtable.csv')
ycolumnDF = yDF.iloc[:,0].to_frame()
regressionDF = pd.concat([XDF,ycolumnDF], axis=1)
X = regressionDF.iloc[:,1:20]
y = regressionDF.iloc[:,20:].squeeze()
lm = linear_model.LinearRegression()
lm.fit(X,y)
cf = lm.coef_
print(cf)
You can regress multiple y's on the same X's at the same time. Something like this should work:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
df_X = pd.DataFrame(columns = ['x1','x2','x3'], data = np.random.normal(size = (10,3)))
df_y = pd.DataFrame(columns = ['y1','y2'], data = np.random.normal(size = (10,2)))
X = df_X.iloc[:,:]
y = df_y.iloc[:,:]
lm = LinearRegression().fit(X,y)
print(lm.coef_)
produces
[[ 0.16115884 0.08471495 0.39169592]
[-0.51929011 0.29160846 -0.62106353]]
The first row here ([ 0.16115884 0.08471495 0.39169592]) holds the regression coefficients of y1 on the xs, and the second row holds the regression coefficients of y2 on the xs.
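With about 500 y columns, the raw coef_ array gets hard to read, so it may help to label it. A minimal sketch, assuming the same kind of random data as above (the seed and column names are arbitrary), that wraps the coefficients in a DataFrame and cross-checks one target against a separate single-output fit:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
df_X = pd.DataFrame(rng.normal(size=(10, 3)), columns=["x1", "x2", "x3"])
df_y = pd.DataFrame(rng.normal(size=(10, 2)), columns=["y1", "y2"])

# One multi-output fit: coef_ has shape (n_targets, n_features)
lm = LinearRegression().fit(df_X, df_y)
coefs = pd.DataFrame(lm.coef_, index=df_y.columns, columns=df_X.columns)

# Cross-check against a separate fit for a single target
single = LinearRegression().fit(df_X, df_y["y2"])
matches = np.allclose(coefs.loc["y2"].to_numpy(), single.coef_)

print(coefs)
print("y2 matches single fit:", matches)
```

Each row of coefs is one y column and each column is one x variable, so coefs.loc["y2"] gives the same numbers a per-column loop would produce for y2.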

Summarize Loop Results in Pandas Table

I have code that downloads tickers and runs a linear regression for each stock in the downloaded list. I am stuck on the last step: showing the Prediction and Residual values for each stock, for the last date in the data.
import pandas as pd
import numpy as np
import yfinance as yf
import datetime as dt
from sklearn import linear_model
tickers = ['EXPE','MSFT']
data = yf.download(tickers, start="2012-04-03", end="2017-07-07")['Close']
data = data.reset_index()
data = data.dropna()
df = pd.DataFrame(data, columns = ["Date"])
df["Date"]=df["Date"].apply(lambda x: x.toordinal())
for ticker in tickers:
    data[ticker] = pd.DataFrame(data, columns = [ticker])
    X = df
    y = data[ticker]
    lm = linear_model.LinearRegression()
    model = lm.fit(X,y)
    predictions = lm.predict(X)
    residuals = y-lm.predict(X)
    print (predictions[-1:])
    print(residuals[-1:])
The current output looks like this:
[136.28856636]
1323 13.491432
Name: EXPE, dtype: float64
[64.19943648]
1323 5.260563
Name: MSFT, dtype: float64
But I would like it to look like this (as a pandas table):
      Predictions  Residuals
EXPE       136.29      13.49
MSFT        64.20       5.26
You could do something like this, where you store the values in lists:
import pandas as pd
import numpy as np
import yfinance as yf
import datetime as dt
from sklearn import linear_model
tickers = ['EXPE','MSFT']
data = yf.download(tickers, start="2012-04-03", end="2017-07-07")['Close']
data = data.reset_index()
data = data.dropna()
df = pd.DataFrame(data, columns = ["Date"])
df["Date"]=df["Date"].apply(lambda x: x.toordinal())
predictions_output = []
residuals_output = []
for ticker in tickers:
    data[ticker] = pd.DataFrame(data, columns = [ticker])
    X = df
    y = data[ticker]
    lm = linear_model.LinearRegression()
    model = lm.fit(X,y)
    predictions = lm.predict(X)
    residuals = y-lm.predict(X)
    # take the last element as a scalar rather than a length-1 slice
    predictions_output.append(float(predictions[-1]))
    residuals_output.append(float(residuals.iloc[-1]))
expectation_df = pd.DataFrame(list(zip(predictions_output, residuals_output)),
                              columns =['Predictions', 'Residuals']).set_index([tickers])
print(expectation_df)
with the output being:
      Predictions  Residuals
EXPE   136.288566  13.491432
MSFT    64.199436   5.260563
EDIT: I went too quickly; looking back, I realized tickers was already defined, so you can use it to set the index here and drop the Tickers index heading to match your desired output.
Also, if you want those values rounded, replace the two append lines in your loop with:
predictions_output.append(round(float(predictions[-1]), 2))
residuals_output.append(round(float(residuals.iloc[-1]), 2))
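An alternative to two parallel lists is to collect one small dict per ticker and build the table from that. This is just a sketch, assuming synthetic closing prices in place of the yfinance download (the price series below are generated, not real quotes):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic closing prices standing in for the yfinance download
dates = pd.date_range("2012-04-03", periods=50, freq="D")
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "EXPE": 30 + 0.5 * np.arange(50) + rng.normal(0, 1, 50),
    "MSFT": 25 + 0.2 * np.arange(50) + rng.normal(0, 1, 50),
}, index=dates)

# Dates as ordinals, the same predictor the question uses
X = pd.DataFrame({"Date": [d.toordinal() for d in data.index]}, index=data.index)

rows = {}
for ticker in data.columns:
    y = data[ticker]
    lm = LinearRegression().fit(X, y)
    last_pred = lm.predict(X.iloc[[-1]])[0]  # prediction for the last date only
    rows[ticker] = {"Predictions": last_pred, "Residuals": y.iloc[-1] - last_pred}

# One row per ticker, one column per metric
summary = pd.DataFrame.from_dict(rows, orient="index").round(2)
print(summary)
```

Keying the dict by ticker means the index comes for free, so there is no separate set_index step.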

How to pass a pandas dataframe to `scipy.optimize.curve_fit` or `scipy.stats.linregress`

Similar question here: Pass Pandas DataFrame to Scipy.optimize.curve_fit
I now have a dataframe with shape=(100, 4), i.e. four dependent vars Y1 to Y4. With another independent array m = [1, 2, 3, 4]. I need to make a linear model out of Ys and m, generating a predicted Y value.
How can I do it for the whole dataframe, without doing it in a for loop with each row of the dataframe?
import numpy as np
import pandas as pd
from scipy.optimize import curve_fit
from scipy.stats import linregress
Y = np.random.randn(100, 4)
m = np.array([1, 2, 3, 4])
df = pd.DataFrame(Y, columns=['y1', 'y2', 'y3', 'y4'])
for index, row in df.iterrows():
    slope, intercept, r_value, p_value, std_err = linregress(m, row.values)
    print(slope, intercept)
First, it's good practice to format the data with the observations on the rows. That is, each row is one observation, described by the variables in the columns (x1 to x4). You can then pass your explanatory variables to the model function along with the response (y), which can be one column of your data frame or a separate array with the same number of rows.
The linregress function fits only a single explanatory variable to a response variable.
For models with more than one explanatory variable, I would suggest using other packages such as statsmodels or sklearn.linear_model.LinearRegression.
Below I go on with the former suggestion:
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
data = np.random.randn(100, 4)
y = np.random.randn(100)
df = pd.DataFrame(data, columns=['x1', 'x2', 'x3', 'x4'])
df['y'] = y  # keep the response in the same frame as the explanatory variables
model = ols("y ~ x1 + x2 + x3 + x4", df).fit()
print(model.summary())
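If the goal really is the question's original layout (one straight-line fit per row of df against m, without an explicit Python loop), np.polyfit may be enough, because it fits every column of a 2-D y in one call. This is a different technique from the statsmodels model above, sketched here on random data like the question's:

```python
import numpy as np
import pandas as pd
from scipy.stats import linregress

rng = np.random.default_rng(0)
m = np.array([1, 2, 3, 4])
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=["y1", "y2", "y3", "y4"])

# polyfit fits each column of its y argument, so pass the rows as columns
slopes, intercepts = np.polyfit(m, df.to_numpy().T, deg=1)

# Predicted Y for every row: slope * m + intercept, broadcast over all rows
predicted = pd.DataFrame(np.outer(slopes, m) + intercepts[:, None],
                         index=df.index, columns=df.columns)

# Spot-check the first row against linregress
check = linregress(m, df.iloc[0].to_numpy())
print(slopes[0], check.slope)
print(intercepts[0], check.intercept)
```

This returns only slopes and intercepts; if r-values or p-values per row are also needed, the linregress loop (or another approach) is still required.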

Python Index column doesn't freeze while scrolling to the right

My problem is that I have a DataFrame with 200 rows and 200 columns. When I scroll to the right, the index column stays fixed (I can still see it), as it should.
However, when I select a column or value in the DataFrame (for example, to sort the values in ascending or descending order), the index column changes and becomes the same as the column I selected.
I would like to keep seeing the index column.
I am using Spyder 3.3.0 and Python 3.6.
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import operator
# Importing the dataset
dataset = pd.read_csv('1992_2014.csv', index_col =0)
nations_all = dataset.iloc[:, 0].values
nations = [nations_all[0]]
for i in range(0, len(nations_all)):
    if nations_all[i] not in nations:
        nations.append(nations_all[i])
Year = dataset.iloc[:, 1].values
CO2 = dataset.iloc[:, 8].values
# Creating the Trend Matrix between two nations
trend_matrix = pd.DataFrame(index = nations, columns = nations)
for i in nations:
    n = dataset[dataset["Nation"] == i].index.values.astype(int)
    for k in nations:
        kn = dataset[dataset["Nation"] == k].index.values.astype(int)
        div_n = CO2[n[0]]
        div_kn = CO2[kn[0]]
        CO2_n = (CO2[n]/div_n)
        CO2_kn = (CO2[kn]/div_kn)
        trend_matrix.loc[i, k] = sum(list(map(abs,list(map(operator.sub, CO2_n, CO2_kn)))))
Thanks!
