Multiple linear regression in Python
I can't seem to find any Python libraries that do multiple regression. Everything I find only does simple regression. I need to regress my dependent variable (y) against several independent variables (x1, x2, x3, etc.).
For example, with this data:
print 'y x1 x2 x3 x4 x5 x6 x7'
for t in texts:
    print "{:>7.1f}{:>10.2f}{:>9.2f}{:>9.2f}{:>10.2f}{:>7.2f}{:>7.2f}{:>9.2f}" \
        .format(t.y, t.x1, t.x2, t.x3, t.x4, t.x5, t.x6, t.x7)
(output for above:)
y x1 x2 x3 x4 x5 x6 x7
-6.0 -4.95 -5.87 -0.76 14.73 4.02 0.20 0.45
-5.0 -4.55 -4.52 -0.71 13.74 4.47 0.16 0.50
-10.0 -10.96 -11.64 -0.98 15.49 4.18 0.19 0.53
-5.0 -1.08 -3.36 0.75 24.72 4.96 0.16 0.60
-8.0 -6.52 -7.45 -0.86 16.59 4.29 0.10 0.48
-3.0 -0.81 -2.36 -0.50 22.44 4.81 0.15 0.53
-6.0 -7.01 -7.33 -0.33 13.93 4.32 0.21 0.50
-8.0 -4.46 -7.65 -0.94 11.40 4.43 0.16 0.49
-8.0 -11.54 -10.03 -1.03 18.18 4.28 0.21 0.55
How would I regress these in python, to get the linear regression formula:
Y = a1x1 + a2x2 + a3x3 + a4x4 + a5x5 + a6x6 + a7x7 + c
sklearn.linear_model.LinearRegression will do it:
from sklearn import linear_model
clf = linear_model.LinearRegression()
clf.fit([[getattr(t, 'x%d' % i) for i in range(1, 8)] for t in texts],
        [t.y for t in texts])
Then clf.coef_ will have the regression coefficients.
sklearn.linear_model also has similar interfaces to do various kinds of regularizations on the regression.
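As a self-contained sketch (my addition, using the data rows from the question rather than the `texts` objects), the fitted attributes map directly onto the formula Y = a1x1 + ... + a7x7 + c:
import numpy as np
from sklearn.linear_model import LinearRegression

# rows are observations, columns are x1..x7 (values copied from the question)
X = np.array([
    [-4.95,  -5.87,  -0.76, 14.73, 4.02, 0.20, 0.45],
    [-4.55,  -4.52,  -0.71, 13.74, 4.47, 0.16, 0.50],
    [-10.96, -11.64, -0.98, 15.49, 4.18, 0.19, 0.53],
    [-1.08,  -3.36,   0.75, 24.72, 4.96, 0.16, 0.60],
    [-6.52,  -7.45,  -0.86, 16.59, 4.29, 0.10, 0.48],
    [-0.81,  -2.36,  -0.50, 22.44, 4.81, 0.15, 0.53],
    [-7.01,  -7.33,  -0.33, 13.93, 4.32, 0.21, 0.50],
    [-4.46,  -7.65,  -0.94, 11.40, 4.43, 0.16, 0.49],
    [-11.54, -10.03, -1.03, 18.18, 4.28, 0.21, 0.55],
])
y = np.array([-6, -5, -10, -5, -8, -3, -6, -8, -8])

reg = LinearRegression().fit(X, y)
print(reg.coef_)       # a1..a7
print(reg.intercept_)  # c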
Here is a little workaround that I created. I checked it against R and it works correctly.
import numpy as np
import statsmodels.api as sm
y = [1,2,3,4,3,4,5,4,5,5,4,5,4,5,4,5,6,5,4,5,4,3,4]
x = [
[4,2,3,4,5,4,5,6,7,4,8,9,8,8,6,6,5,5,5,5,5,5,5],
[4,1,2,3,4,5,6,7,5,8,7,8,7,8,7,8,7,7,7,7,7,6,5],
[4,1,2,5,6,7,8,9,7,8,7,8,7,7,7,7,7,7,6,6,4,4,4]
]
def reg_m(y, x):
    ones = np.ones(len(x[0]))
    X = sm.add_constant(np.column_stack((x[0], ones)))
    for ele in x[1:]:
        # each new predictor is stacked in front, so in the summary below
        # x1 corresponds to the last list in x and x3 to the first
        X = sm.add_constant(np.column_stack((ele, X)))
    results = sm.OLS(y, X).fit()
    return results
Result:
print reg_m(y, x).summary()
Output:
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.535
Model: OLS Adj. R-squared: 0.461
Method: Least Squares F-statistic: 7.281
Date: Tue, 19 Feb 2013 Prob (F-statistic): 0.00191
Time: 21:51:28 Log-Likelihood: -26.025
No. Observations: 23 AIC: 60.05
Df Residuals: 19 BIC: 64.59
Df Model: 3
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1 0.2424 0.139 1.739 0.098 -0.049 0.534
x2 0.2360 0.149 1.587 0.129 -0.075 0.547
x3 -0.0618 0.145 -0.427 0.674 -0.365 0.241
const 1.5704 0.633 2.481 0.023 0.245 2.895
==============================================================================
Omnibus: 6.904 Durbin-Watson: 1.905
Prob(Omnibus): 0.032 Jarque-Bera (JB): 4.708
Skew: -0.849 Prob(JB): 0.0950
Kurtosis: 4.426 Cond. No. 38.6
pandas provides a convenient way to run OLS as given in this answer:
Run an OLS regression with Pandas Data Frame
Just to clarify, the example you gave is multiple linear regression, not multivariate linear regression (see this reference). The difference:
The very simplest case of a single scalar predictor variable x and a single scalar response variable y is known as simple linear regression. The extension to multiple and/or vector-valued predictor variables (denoted with a capital X) is known as multiple linear regression, also known as multivariable linear regression. Nearly all real-world regression models involve multiple predictors, and basic descriptions of linear regression are often phrased in terms of the multiple regression model. Note, however, that in these cases the response variable y is still a scalar. Another term multivariate linear regression refers to cases where y is a vector, i.e., the same as general linear regression. The difference between multivariate linear regression and multivariable linear regression should be emphasized as it causes much confusion and misunderstanding in the literature.
In short:
multiple linear regression: the response y is a scalar.
multivariate linear regression: the response y is a vector.
(Another source.)
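To make the distinction concrete, here is a small illustrative sketch (my addition, not part of the quoted explanation): scikit-learn's LinearRegression accepts a response with several columns, which is the multivariate case.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                          # three predictors
W = np.array([[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]])   # true weights, one column per response
Y = X @ W + np.array([3.0, -2.0])                     # two response columns -> multivariate

multi = LinearRegression().fit(X, Y)
print(multi.coef_.shape)   # (2, 3): one row of coefficients per response column
print(multi.intercept_)    # two intercepts, approximately [3.0, -2.0]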
You can use numpy.linalg.lstsq:
import numpy as np
y = np.array([-6, -5, -10, -5, -8, -3, -6, -8, -8])
X = np.array(
[
[-4.95, -4.55, -10.96, -1.08, -6.52, -0.81, -7.01, -4.46, -11.54],
[-5.87, -4.52, -11.64, -3.36, -7.45, -2.36, -7.33, -7.65, -10.03],
[-0.76, -0.71, -0.98, 0.75, -0.86, -0.50, -0.33, -0.94, -1.03],
[14.73, 13.74, 15.49, 24.72, 16.59, 22.44, 13.93, 11.40, 18.18],
[4.02, 4.47, 4.18, 4.96, 4.29, 4.81, 4.32, 4.43, 4.28],
[0.20, 0.16, 0.19, 0.16, 0.10, 0.15, 0.21, 0.16, 0.21],
[0.45, 0.50, 0.53, 0.60, 0.48, 0.53, 0.50, 0.49, 0.55],
]
)
X = X.T # transpose so input vectors are along the rows
X = np.c_[X, np.ones(X.shape[0])] # add bias term
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta_hat)
Result:
[ -0.49104607 0.83271938 0.0860167 0.1326091 6.85681762 22.98163883 -41.08437805 -19.08085066]
You can see the estimated output with:
print(np.dot(X,beta_hat))
Result:
[ -5.97751163, -5.06465759, -10.16873217, -4.96959788, -7.96356915, -3.06176313, -6.01818435, -7.90878145, -7.86720264]
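If you also want a goodness-of-fit number, a small follow-up (my addition, continuing the variables defined above) computes R² from the residuals:
y_pred = X @ beta_hat
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
print(1 - ss_res / ss_tot)  # coefficient of determination R^2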
Use scipy.optimize.curve_fit. It works not only for linear fits.
from scipy.optimize import curve_fit
import numpy as np  # use numpy arrays; the old scipy.array alias is no longer available in recent SciPy

def fn(x, a, b, c):
    return a + b*x[0] + c*x[1]

# y(x0,x1) data:
#        x0=0  1  2
#       ___________
# x1=0 |    0  1  2
# x1=1 |    1  2  3
# x1=2 |    2  3  4
x = np.array([[0, 1, 2, 0, 1, 2, 0, 1, 2], [0, 0, 0, 1, 1, 1, 2, 2, 2]])
y = np.array([0, 1, 2, 1, 2, 3, 2, 3, 4])
popt, pcov = curve_fit(fn, x, y)
print(popt)
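To illustrate the "not only linear" point, here is a hypothetical nonlinear model (my own sketch, not from the original answer) fitted the same way:
import numpy as np
from scipy.optimize import curve_fit

def fn_nonlinear(x, a, b, c):
    # exponential term in x0 plus a linear term in x1
    return a * np.exp(b * x[0]) + c * x[1]

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=(2, 50))
y = 2.0 * np.exp(0.5 * x[0]) + 3.0 * x[1] + rng.normal(0, 0.01, 50)

popt, pcov = curve_fit(fn_nonlinear, x, y)
print(popt)  # should be close to [2.0, 0.5, 3.0]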
Once you convert your data to a pandas dataframe (df),
import statsmodels.formula.api as smf
lm = smf.ols(formula='y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7', data=df).fit()
print(lm.params)
The intercept term is included by default.
See this notebook for more examples.
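A possible follow-up (my addition; the input values below are made up purely for illustration): the fitted formula model can score new rows that use the same column names.
import pandas as pd

# hypothetical new observation with columns x1..x7
new_rows = pd.DataFrame({'x1': [-5.0], 'x2': [-6.0], 'x3': [-0.8],
                         'x4': [15.0], 'x5': [4.3], 'x6': [0.18], 'x7': [0.50]})
print(lm.predict(new_rows))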
I think this may be the easiest way to do this:
from random import random
from pandas import DataFrame
from statsmodels.api import OLS
lr = lambda : [random() for i in range(100)]
x = DataFrame({'x1': lr(), 'x2':lr(), 'x3':lr()})
x['b'] = 1
y = x.x1 + x.x2 * 2 + x.x3 * 3 + 4
print x.head()
x1 x2 x3 b
0 0.433681 0.946723 0.103422 1
1 0.400423 0.527179 0.131674 1
2 0.992441 0.900678 0.360140 1
3 0.413757 0.099319 0.825181 1
4 0.796491 0.862593 0.193554 1
print y.head()
0 6.637392
1 5.849802
2 7.874218
3 7.087938
4 7.102337
dtype: float64
model = OLS(y, x)
result = model.fit()
print result.summary()
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 1.000
Model: OLS Adj. R-squared: 1.000
Method: Least Squares F-statistic: 5.859e+30
Date: Wed, 09 Dec 2015 Prob (F-statistic): 0.00
Time: 15:17:32 Log-Likelihood: 3224.9
No. Observations: 100 AIC: -6442.
Df Residuals: 96 BIC: -6431.
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1 1.0000 8.98e-16 1.11e+15 0.000 1.000 1.000
x2 2.0000 8.28e-16 2.41e+15 0.000 2.000 2.000
x3 3.0000 8.34e-16 3.6e+15 0.000 3.000 3.000
b 4.0000 8.51e-16 4.7e+15 0.000 4.000 4.000
==============================================================================
Omnibus: 7.675 Durbin-Watson: 1.614
Prob(Omnibus): 0.022 Jarque-Bera (JB): 3.118
Skew: 0.045 Prob(JB): 0.210
Kurtosis: 2.140 Cond. No. 6.89
==============================================================================
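As a side note (my addition), statsmodels' add_constant does the same job as the manual column of ones above:
import statsmodels.api as sm

# equivalent to the manual x['b'] = 1 column
X_const = sm.add_constant(x[['x1', 'x2', 'x3']])
result2 = sm.OLS(y, X_const).fit()
print(result2.params)  # const ~4, x1 ~1, x2 ~2, x3 ~3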
Multiple Linear Regression can be handled using the sklearn library as referenced above. I'm using the Anaconda install of Python 3.6.
Create your model as follows:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X, y)
# display coefficients
print(regressor.coef_)
You can use numpy.linalg.lstsq
You can use the function below and pass it a DataFrame:
def linear(x, y=None, show=True):
    """
    :param x: pd.DataFrame
    :param y: pd.DataFrame or pd.Series or None
        if None, the last column of x is used as y
    :param show: whether to print the regression summary
    """
    import pandas as pd
    import statsmodels.api as sm
    xy = sm.add_constant(x if y is None else pd.concat([x, y], axis=1))
    # .ix was removed from pandas; .iloc gives the same positional slicing
    res = sm.OLS(xy.iloc[:, -1], xy.iloc[:, :-1], missing='drop').fit()
    if show:
        print(res.summary())
    return res
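A minimal usage sketch (my addition, with synthetic data generated only for illustration): when y is None, the last column of the frame is treated as the response.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 2)), columns=['x1', 'x2'])
df['y'] = 1.0 + 2.0 * df['x1'] - 3.0 * df['x2'] + rng.normal(0, 0.1, 50)

res = linear(df, show=False)
print(res.params)  # const ~1, x1 ~2, x2 ~-3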
Scikit-learn is a machine learning library for Python which can do this job for you.
Just import the sklearn.linear_model module into your script.
Here is a code template for multiple linear regression using sklearn in Python:
import numpy as np
import matplotlib.pyplot as plt #to plot visualizations
import pandas as pd
# Importing the dataset
df = pd.read_csv(<Your-dataset-path>)
# Assigning feature and target variables
X = df.iloc[:,:-1]
y = df.iloc[:,-1]
# Use label encoders, if you have any categorical variable
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
X['<column-name>'] = labelencoder.fit_transform(X['<column-name>'])
# Note: OneHotEncoder's categorical_features argument was removed in newer
# scikit-learn releases; ColumnTransformer selects the columns to encode instead
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
onehotencoder = ColumnTransformer([('onehot', OneHotEncoder(), ['<column-name>'])],
                                  remainder='passthrough', sparse_threshold=0)
X = onehotencoder.fit_transform(X)
# Avoiding the dummy variable trap
X = X[:,1:] # Usually done by the algorithm itself
# Splitting the data into test and train sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state = 0, test_size = 0.2)
# Fitting the model
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
# Predicting the test set results
y_pred = regressor.predict(X_test)
That's it. You can use this code as a template for implementing Multiple Linear Regression in any dataset.
For a better understanding with an example, Visit: Linear Regression with an example
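Continuing the template above, an optional addition (mine, not part of the original answer) to inspect the fit and score the held-out predictions:
from sklearn.metrics import r2_score

print(regressor.coef_)           # one coefficient per feature column
print(regressor.intercept_)      # the constant term
print(r2_score(y_test, y_pred))  # R^2 on the test set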
Here is an alternative and basic method:
from patsy import dmatrices
import statsmodels.api as sm
y,x = dmatrices("y_data ~ x_1 + x_2 ", data = my_data)
### y_data is the name of the dependent variable in your data ###
model_fit = sm.OLS(y,x)
results = model_fit.fit()
print(results.summary())
Instead of sm.OLS you can also use sm.Logit, sm.Probit, etc.
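For a runnable illustration (my addition; `my_data` here is a synthetic stand-in whose column names match the formula above):
import numpy as np
import pandas as pd
import statsmodels.api as sm
from patsy import dmatrices

rng = np.random.default_rng(0)
my_data = pd.DataFrame({'x_1': rng.normal(size=40), 'x_2': rng.normal(size=40)})
my_data['y_data'] = 1.0 + 2.0 * my_data['x_1'] - 0.5 * my_data['x_2'] + rng.normal(0, 0.1, 40)

y, x = dmatrices("y_data ~ x_1 + x_2", data=my_data)
results = sm.OLS(y, x).fit()
print(results.params)  # Intercept ~1, x_1 ~2, x_2 ~-0.5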
Finding a linear model such as this one can be handled with OpenTURNS.
In OpenTURNS this is done with the LinearModelAlgorithm class, which creates a linear model from numerical samples. To be more specific, it builds the following linear model:
Y = a0 + a1.X1 + ... + an.Xn + epsilon,
where the error epsilon is Gaussian with zero mean and unit variance. Assuming your data is in a CSV file, here is a simple script to get the regression coefficients ai:
from __future__ import print_function
import pandas as pd
import openturns as ot
# Assuming the data is a csv file with the given structure
# Y X1 X2 .. X7
df = pd.read_csv("./data.csv", sep="\s+")
# Build a sample from the pandas dataframe
sample = ot.Sample(df.values)
# The observation points are in the first column (dimension 1)
Y = sample[:, 0]
# The input vector (X1,..,X7) of dimension 7
X = sample[:, 1::]
# Build a Linear model approximation
result = ot.LinearModelAlgorithm(X, Y).getResult()
# Get the coefficients ai
print("coefficients of the linear regression model = ", result.getCoefficients())
You can then easily get the confidence intervals with the following call:
# Get the confidence intervals at 90% of the ai coefficients
print(
"confidence intervals of the coefficients = ",
ot.LinearModelAnalysis(result).getCoefficientsConfidenceInterval(0.9),
)
You may find a more detailed example in the OpenTURNS examples.
Try a generalized linear model with a Gaussian family:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import glm

y = np.array([-6, -5, -10, -5, -8, -3, -6, -8, -8])
X = np.array([
[-4.95, -4.55, -10.96, -1.08, -6.52, -0.81, -7.01, -4.46, -11.54],
[-5.87, -4.52, -11.64, -3.36, -7.45, -2.36, -7.33, -7.65, -10.03],
[-0.76, -0.71, -0.98, 0.75, -0.86, -0.50, -0.33, -0.94, -1.03],
[14.73, 13.74, 15.49, 24.72, 16.59, 22.44, 13.93, 11.40, 18.18],
[4.02, 4.47, 4.18, 4.96, 4.29, 4.81, 4.32, 4.43, 4.28],
[0.20, 0.16, 0.19, 0.16, 0.10, 0.15, 0.21, 0.16, 0.21],
[0.45, 0.50, 0.53, 0.60, 0.48, 0.53, 0.50, 0.49, 0.55],
])
X = list(zip(*reversed(X)))  # transpose so each row holds one observation's features
df=pd.DataFrame({'X':X,'y':y})
columns=7
for i in range(0, columns):
    df['X'+str(i)] = df.apply(lambda row: row['X'][i], axis=1)
df=df.drop('X',axis=1)
print(df)
#model_formula='y ~ X0+X1+X2+X3+X4+X5+X6'
model_formula='y ~ X0'
model_family = sm.families.Gaussian()
model_fit = glm(formula=model_formula,
                data=df,
                family=model_family).fit()
print(model_fit.summary())
# Extract coefficients from the fitted model
#print(model_fit.params)
intercept, slope = model_fit.params
# Print coefficients
print('Intercept =', intercept)
print('Slope =', slope)
# Extract and print confidence intervals
print(model_fit.conf_int())
df2=pd.DataFrame()
df2['X0']=np.linspace(0.50,0.70,50)
df3=pd.DataFrame()
df3['X1']=np.linspace(0.20,0.60,50)
prediction0=model_fit.predict(df2)
#prediction1=model_fit.predict(df3)
plt.plot(df2['X0'],prediction0,label='X0')
plt.ylabel("y")
plt.xlabel("X0")
plt.show()
Linear regression is a good example to start with in artificial intelligence.
Here is a good example of a machine learning algorithm for multiple linear regression using Python:
##### Predicting House Prices Using Multiple Linear Regression - #Y_T_Akademi
#### In this project we see how machine learning algorithms help us predict house prices. Linear regression is a model that predicts new, future data by using the correlation in the existing data. Here, machine learning helps us identify the relationship between the feature data and the output, so we can predict future values.
import pandas as pd
##### we use sklearn library in many machine learning calculations..
from sklearn import linear_model
##### we import our dataset: housepricesdataset.csv
df = pd.read_csv("housepricesdataset.csv", sep=";")
##### the feature columns are area, roomcount and buildingage; the output (result) column is price
##### we define a linear regression model here:
reg = linear_model.LinearRegression()
reg.fit(df[['area', 'roomcount', 'buildingage']], df['price'])
# Since our model is ready, we can make predictions now:
# let's predict a house with 230 square meters, 4 rooms and a 10-year-old building..
reg.predict([[230, 4, 10]])
# Now let's predict a house with 230 square meters, 6 rooms and a 0-year-old building - it's a new building..
reg.predict([[230, 6, 0]])
# Now let's predict a house with 355 square meters, 3 rooms and a 20-year-old building
reg.predict([[355, 3, 20]])
# You can make as many predictions as you want..
reg.predict([[230, 4, 10], [230, 6, 0], [355, 3, 20], [275, 5, 17]])
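One small addition that is not in the original answer: the fitted attributes spell out the explicit price formula for this model.
# price ≈ a*area + b*roomcount + c*buildingage + d
print(reg.coef_)       # a, b, c
print(reg.intercept_)  # d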