Multiple linear regression in Python
I can't seem to find any Python libraries that do multiple regression. Everything I find only does simple regression. I need to regress my dependent variable (y) against several independent variables (x1, x2, x3, etc.).
For example, with this data:
print 'y x1 x2 x3 x4 x5 x6 x7'
for t in texts:
    print "{:>7.1f}{:>10.2f}{:>9.2f}{:>9.2f}{:>10.2f}{:>7.2f}{:>7.2f}{:>9.2f}" \
        .format(t.y, t.x1, t.x2, t.x3, t.x4, t.x5, t.x6, t.x7)
(output for above:)
y x1 x2 x3 x4 x5 x6 x7
-6.0 -4.95 -5.87 -0.76 14.73 4.02 0.20 0.45
-5.0 -4.55 -4.52 -0.71 13.74 4.47 0.16 0.50
-10.0 -10.96 -11.64 -0.98 15.49 4.18 0.19 0.53
-5.0 -1.08 -3.36 0.75 24.72 4.96 0.16 0.60
-8.0 -6.52 -7.45 -0.86 16.59 4.29 0.10 0.48
-3.0 -0.81 -2.36 -0.50 22.44 4.81 0.15 0.53
-6.0 -7.01 -7.33 -0.33 13.93 4.32 0.21 0.50
-8.0 -4.46 -7.65 -0.94 11.40 4.43 0.16 0.49
-8.0 -11.54 -10.03 -1.03 18.18 4.28 0.21 0.55
How would I regress these in python, to get the linear regression formula:
Y = a1x1 + a2x2 + a3x3 + a4x4 + a5x5 + a6x6 + a7x7 + c
sklearn.linear_model.LinearRegression will do it:
from sklearn import linear_model
clf = linear_model.LinearRegression()
clf.fit([[getattr(t, 'x%d' % i) for i in range(1, 8)] for t in texts],
        [t.y for t in texts])
Then clf.coef_ will have the regression coefficients.
sklearn.linear_model also has similar interfaces to do various kinds of regularizations on the regression.
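As a self-contained sketch (my addition, using the data rows from the question rather than the `texts` objects), the fitted attributes map directly onto the formula Y = a1x1 + ... + a7x7 + c:
import numpy as np
from sklearn.linear_model import LinearRegression

# rows are observations, columns are x1..x7 (values copied from the question)
X = np.array([
    [-4.95,  -5.87,  -0.76, 14.73, 4.02, 0.20, 0.45],
    [-4.55,  -4.52,  -0.71, 13.74, 4.47, 0.16, 0.50],
    [-10.96, -11.64, -0.98, 15.49, 4.18, 0.19, 0.53],
    [-1.08,  -3.36,   0.75, 24.72, 4.96, 0.16, 0.60],
    [-6.52,  -7.45,  -0.86, 16.59, 4.29, 0.10, 0.48],
    [-0.81,  -2.36,  -0.50, 22.44, 4.81, 0.15, 0.53],
    [-7.01,  -7.33,  -0.33, 13.93, 4.32, 0.21, 0.50],
    [-4.46,  -7.65,  -0.94, 11.40, 4.43, 0.16, 0.49],
    [-11.54, -10.03, -1.03, 18.18, 4.28, 0.21, 0.55],
])
y = np.array([-6, -5, -10, -5, -8, -3, -6, -8, -8])

reg = LinearRegression().fit(X, y)
print(reg.coef_)       # a1..a7
print(reg.intercept_)  # c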
Here is a little workaround that I created. I checked it against R and it works correctly.
import numpy as np
import statsmodels.api as sm
y = [1,2,3,4,3,4,5,4,5,5,4,5,4,5,4,5,6,5,4,5,4,3,4]
x = [
[4,2,3,4,5,4,5,6,7,4,8,9,8,8,6,6,5,5,5,5,5,5,5],
[4,1,2,3,4,5,6,7,5,8,7,8,7,8,7,8,7,7,7,7,7,6,5],
[4,1,2,5,6,7,8,9,7,8,7,8,7,7,7,7,7,7,6,6,4,4,4]
]
def reg_m(y, x):
    ones = np.ones(len(x[0]))
    X = sm.add_constant(np.column_stack((x[0], ones)))
    for ele in x[1:]:
        # each new predictor is stacked in front, so in the summary below
        # x1 corresponds to the last list in x and x3 to the first
        X = sm.add_constant(np.column_stack((ele, X)))
    results = sm.OLS(y, X).fit()
    return results
Result:
print reg_m(y, x).summary()
Output:
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.535
Model: OLS Adj. R-squared: 0.461
Method: Least Squares F-statistic: 7.281
Date: Tue, 19 Feb 2013 Prob (F-statistic): 0.00191
Time: 21:51:28 Log-Likelihood: -26.025
No. Observations: 23 AIC: 60.05
Df Residuals: 19 BIC: 64.59
Df Model: 3
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1 0.2424 0.139 1.739 0.098 -0.049 0.534
x2 0.2360 0.149 1.587 0.129 -0.075 0.547
x3 -0.0618 0.145 -0.427 0.674 -0.365 0.241
const 1.5704 0.633 2.481 0.023 0.245 2.895
==============================================================================
Omnibus: 6.904 Durbin-Watson: 1.905
Prob(Omnibus): 0.032 Jarque-Bera (JB): 4.708
Skew: -0.849 Prob(JB): 0.0950
Kurtosis: 4.426 Cond. No. 38.6
pandas provides a convenient way to run OLS as given in this answer:
Run an OLS regression with Pandas Data Frame
Just to clarify, the example you gave is multiple linear regression, not multivariate linear regression (see this reference). The difference:
The very simplest case of a single scalar predictor variable x and a single scalar response variable y is known as simple linear regression. The extension to multiple and/or vector-valued predictor variables (denoted with a capital X) is known as multiple linear regression, also known as multivariable linear regression. Nearly all real-world regression models involve multiple predictors, and basic descriptions of linear regression are often phrased in terms of the multiple regression model. Note, however, that in these cases the response variable y is still a scalar. Another term multivariate linear regression refers to cases where y is a vector, i.e., the same as general linear regression. The difference between multivariate linear regression and multivariable linear regression should be emphasized as it causes much confusion and misunderstanding in the literature.
In short:
multiple linear regression: the response y is a scalar.
multivariate linear regression: the response y is a vector.
(Another source.)
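To make the distinction concrete, here is a small illustrative sketch (my addition, not part of the quoted explanation): scikit-learn's LinearRegression accepts a response with several columns, which is the multivariate case.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                          # three predictors
W = np.array([[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]])   # true weights, one column per response
Y = X @ W + np.array([3.0, -2.0])                     # two response columns -> multivariate

multi = LinearRegression().fit(X, Y)
print(multi.coef_.shape)   # (2, 3): one row of coefficients per response column
print(multi.intercept_)    # two intercepts, approximately [3.0, -2.0]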
You can use numpy.linalg.lstsq:
import numpy as np
y = np.array([-6, -5, -10, -5, -8, -3, -6, -8, -8])
X = np.array(
[
[-4.95, -4.55, -10.96, -1.08, -6.52, -0.81, -7.01, -4.46, -11.54],
[-5.87, -4.52, -11.64, -3.36, -7.45, -2.36, -7.33, -7.65, -10.03],
[-0.76, -0.71, -0.98, 0.75, -0.86, -0.50, -0.33, -0.94, -1.03],
[14.73, 13.74, 15.49, 24.72, 16.59, 22.44, 13.93, 11.40, 18.18],
[4.02, 4.47, 4.18, 4.96, 4.29, 4.81, 4.32, 4.43, 4.28],
[0.20, 0.16, 0.19, 0.16, 0.10, 0.15, 0.21, 0.16, 0.21],
[0.45, 0.50, 0.53, 0.60, 0.48, 0.53, 0.50, 0.49, 0.55],
]
)
X = X.T # transpose so input vectors are along the rows
X = np.c_[X, np.ones(X.shape[0])] # add bias term
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta_hat)
Result:
[ -0.49104607 0.83271938 0.0860167 0.1326091 6.85681762 22.98163883 -41.08437805 -19.08085066]
You can see the estimated output with:
print(np.dot(X,beta_hat))
Result:
[ -5.97751163, -5.06465759, -10.16873217, -4.96959788, -7.96356915, -3.06176313, -6.01818435, -7.90878145, -7.86720264]
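If you also want a goodness-of-fit number, a small follow-up (my addition, continuing the variables defined above) computes R² from the residuals:
y_pred = X @ beta_hat
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
print(1 - ss_res / ss_tot)  # coefficient of determination R^2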
Use scipy.optimize.curve_fit. It works not only for linear fits.
from scipy.optimize import curve_fit
import numpy as np  # use numpy arrays; the old scipy.array alias is no longer available in recent SciPy

def fn(x, a, b, c):
    return a + b*x[0] + c*x[1]

# y(x0,x1) data:
#        x0=0  1  2
#       ___________
# x1=0 |    0  1  2
# x1=1 |    1  2  3
# x1=2 |    2  3  4
x = np.array([[0, 1, 2, 0, 1, 2, 0, 1, 2], [0, 0, 0, 1, 1, 1, 2, 2, 2]])
y = np.array([0, 1, 2, 1, 2, 3, 2, 3, 4])
popt, pcov = curve_fit(fn, x, y)
print(popt)
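To illustrate the "not only linear" point, here is a hypothetical nonlinear model (my own sketch, not from the original answer) fitted the same way:
import numpy as np
from scipy.optimize import curve_fit

def fn_nonlinear(x, a, b, c):
    # exponential term in x0 plus a linear term in x1
    return a * np.exp(b * x[0]) + c * x[1]

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=(2, 50))
y = 2.0 * np.exp(0.5 * x[0]) + 3.0 * x[1] + rng.normal(0, 0.01, 50)

popt, pcov = curve_fit(fn_nonlinear, x, y)
print(popt)  # should be close to [2.0, 0.5, 3.0]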
Once you convert your data to a pandas dataframe (df),
import statsmodels.formula.api as smf
lm = smf.ols(formula='y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7', data=df).fit()
print(lm.params)
The intercept term is included by default.
See this notebook for more examples.
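A possible follow-up (my addition; the input values below are made up purely for illustration): the fitted formula model can score new rows that use the same column names.
import pandas as pd

# hypothetical new observation with columns x1..x7
new_rows = pd.DataFrame({'x1': [-5.0], 'x2': [-6.0], 'x3': [-0.8],
                         'x4': [15.0], 'x5': [4.3], 'x6': [0.18], 'x7': [0.50]})
print(lm.predict(new_rows))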
I think this may be the easiest way to do this:
from random import random
from pandas import DataFrame
from statsmodels.api import OLS
lr = lambda : [random() for i in range(100)]
x = DataFrame({'x1': lr(), 'x2':lr(), 'x3':lr()})
x['b'] = 1
y = x.x1 + x.x2 * 2 + x.x3 * 3 + 4
print x.head()
x1 x2 x3 b
0 0.433681 0.946723 0.103422 1
1 0.400423 0.527179 0.131674 1
2 0.992441 0.900678 0.360140 1
3 0.413757 0.099319 0.825181 1
4 0.796491 0.862593 0.193554 1
print y.head()
0 6.637392
1 5.849802
2 7.874218
3 7.087938
4 7.102337
dtype: float64
model = OLS(y, x)
result = model.fit()
print result.summary()
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 1.000
Model: OLS Adj. R-squared: 1.000
Method: Least Squares F-statistic: 5.859e+30
Date: Wed, 09 Dec 2015 Prob (F-statistic): 0.00
Time: 15:17:32 Log-Likelihood: 3224.9
No. Observations: 100 AIC: -6442.
Df Residuals: 96 BIC: -6431.
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1 1.0000 8.98e-16 1.11e+15 0.000 1.000 1.000
x2 2.0000 8.28e-16 2.41e+15 0.000 2.000 2.000
x3 3.0000 8.34e-16 3.6e+15 0.000 3.000 3.000
b 4.0000 8.51e-16 4.7e+15 0.000 4.000 4.000
==============================================================================
Omnibus: 7.675 Durbin-Watson: 1.614
Prob(Omnibus): 0.022 Jarque-Bera (JB): 3.118
Skew: 0.045 Prob(JB): 0.210
Kurtosis: 2.140 Cond. No. 6.89
==============================================================================
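As a side note (my addition), statsmodels' add_constant does the same job as the manual column of ones above:
import statsmodels.api as sm

# equivalent to the manual x['b'] = 1 column
X_const = sm.add_constant(x[['x1', 'x2', 'x3']])
result2 = sm.OLS(y, X_const).fit()
print(result2.params)  # const ~4, x1 ~1, x2 ~2, x3 ~3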
Multiple Linear Regression can be handled using the sklearn library as referenced above. I'm using the Anaconda install of Python 3.6.
Create your model as follows:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X, y)
# display coefficients
print(regressor.coef_)
You can use numpy.linalg.lstsq
You can use the function below and pass it a DataFrame:
def linear(x, y=None, show=True):
    """
    :param x: pd.DataFrame
    :param y: pd.DataFrame or pd.Series or None
        if None, the last column of x is used as y
    :param show: whether to print the regression summary
    """
    import pandas as pd
    import statsmodels.api as sm
    xy = sm.add_constant(x if y is None else pd.concat([x, y], axis=1))
    # .ix was removed from pandas; .iloc gives the same positional slicing
    res = sm.OLS(xy.iloc[:, -1], xy.iloc[:, :-1], missing='drop').fit()
    if show:
        print(res.summary())
    return res
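A minimal usage sketch (my addition, with synthetic data generated only for illustration): when y is None, the last column of the frame is treated as the response.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 2)), columns=['x1', 'x2'])
df['y'] = 1.0 + 2.0 * df['x1'] - 3.0 * df['x2'] + rng.normal(0, 0.1, 50)

res = linear(df, show=False)
print(res.params)  # const ~1, x1 ~2, x2 ~-3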
Scikit-learn is a machine learning library for Python which can do this job for you.
Just import the sklearn.linear_model module into your script.
Here is a code template for multiple linear regression using sklearn in Python:
import numpy as np
import matplotlib.pyplot as plt #to plot visualizations
import pandas as pd
# Importing the dataset
df = pd.read_csv(<Your-dataset-path>)
# Assigning feature and target variables
X = df.iloc[:,:-1]
y = df.iloc[:,-1]
# Use label encoders, if you have any categorical variable
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
X['<column-name>'] = labelencoder.fit_transform(X['<column-name>'])
# Note: OneHotEncoder's categorical_features argument was removed in newer
# scikit-learn releases; ColumnTransformer selects the columns to encode instead
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
onehotencoder = ColumnTransformer([('onehot', OneHotEncoder(), ['<column-name>'])],
                                  remainder='passthrough', sparse_threshold=0)
X = onehotencoder.fit_transform(X)
# Avoiding the dummy variable trap
X = X[:,1:] # Usually done by the algorithm itself
# Splitting the data into test and train sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state = 0, test_size = 0.2)
# Fitting the model
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
# Predicting the test set results
y_pred = regressor.predict(X_test)
That's it. You can use this code as a template for implementing Multiple Linear Regression in any dataset.
For a better understanding with an example, Visit: Linear Regression with an example
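Continuing the template above, an optional addition (mine, not part of the original answer) to inspect the fit and score the held-out predictions:
from sklearn.metrics import r2_score

print(regressor.coef_)           # one coefficient per feature column
print(regressor.intercept_)      # the constant term
print(r2_score(y_test, y_pred))  # R^2 on the test set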
Here is an alternative and basic method:
from patsy import dmatrices
import statsmodels.api as sm
y,x = dmatrices("y_data ~ x_1 + x_2 ", data = my_data)
### y_data is the name of the dependent variable in your data ###
model_fit = sm.OLS(y,x)
results = model_fit.fit()
print(results.summary())
Instead of sm.OLS you can also use sm.Logit, sm.Probit, etc.
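For a runnable illustration (my addition; `my_data` here is a synthetic stand-in whose column names match the formula above):
import numpy as np
import pandas as pd
import statsmodels.api as sm
from patsy import dmatrices

rng = np.random.default_rng(0)
my_data = pd.DataFrame({'x_1': rng.normal(size=40), 'x_2': rng.normal(size=40)})
my_data['y_data'] = 1.0 + 2.0 * my_data['x_1'] - 0.5 * my_data['x_2'] + rng.normal(0, 0.1, 40)

y, x = dmatrices("y_data ~ x_1 + x_2", data=my_data)
results = sm.OLS(y, x).fit()
print(results.params)  # Intercept ~1, x_1 ~2, x_2 ~-0.5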
Finding a linear model such as this one can be handled with OpenTURNS.
In OpenTURNS this is done with the LinearModelAlgorithm class, which creates a linear model from numerical samples. To be more specific, it builds the following linear model:
Y = a0 + a1.X1 + ... + an.Xn + epsilon,
where the error epsilon is Gaussian with zero mean and unit variance. Assuming your data is in a CSV file, here is a simple script to get the regression coefficients ai:
from __future__ import print_function
import pandas as pd
import openturns as ot
# Assuming the data is a csv file with the given structure
# Y X1 X2 .. X7
df = pd.read_csv("./data.csv", sep="\s+")
# Build a sample from the pandas dataframe
sample = ot.Sample(df.values)
# The observation points are in the first column (dimension 1)
Y = sample[:, 0]
# The input vector (X1,..,X7) of dimension 7
X = sample[:, 1::]
# Build a Linear model approximation
result = ot.LinearModelAlgorithm(X, Y).getResult()
# Get the coefficients ai
print("coefficients of the linear regression model = ", result.getCoefficients())
You can then easily get the confidence intervals with the following call:
# Get the confidence intervals at 90% of the ai coefficients
print(
"confidence intervals of the coefficients = ",
ot.LinearModelAnalysis(result).getCoefficientsConfidenceInterval(0.9),
)
You may find a more detailed example in the OpenTURNS examples.
Try a generalized linear model with a Gaussian family:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import glm

y = np.array([-6, -5, -10, -5, -8, -3, -6, -8, -8])
X = np.array([
[-4.95, -4.55, -10.96, -1.08, -6.52, -0.81, -7.01, -4.46, -11.54],
[-5.87, -4.52, -11.64, -3.36, -7.45, -2.36, -7.33, -7.65, -10.03],
[-0.76, -0.71, -0.98, 0.75, -0.86, -0.50, -0.33, -0.94, -1.03],
[14.73, 13.74, 15.49, 24.72, 16.59, 22.44, 13.93, 11.40, 18.18],
[4.02, 4.47, 4.18, 4.96, 4.29, 4.81, 4.32, 4.43, 4.28],
[0.20, 0.16, 0.19, 0.16, 0.10, 0.15, 0.21, 0.16, 0.21],
[0.45, 0.50, 0.53, 0.60, 0.48, 0.53, 0.50, 0.49, 0.55],
])
X = list(zip(*reversed(X)))  # transpose so each row holds one observation's features
df=pd.DataFrame({'X':X,'y':y})
columns=7
for i in range(0, columns):
    df['X'+str(i)] = df.apply(lambda row: row['X'][i], axis=1)
df=df.drop('X',axis=1)
print(df)
#model_formula='y ~ X0+X1+X2+X3+X4+X5+X6'
model_formula='y ~ X0'
model_family = sm.families.Gaussian()
model_fit = glm(formula=model_formula,
                data=df,
                family=model_family).fit()
print(model_fit.summary())
# Extract coefficients from the fitted model
#print(model_fit.params)
intercept, slope = model_fit.params
# Print coefficients
print('Intercept =', intercept)
print('Slope =', slope)
# Extract and print confidence intervals
print(model_fit.conf_int())
df2=pd.DataFrame()
df2['X0']=np.linspace(0.50,0.70,50)
df3=pd.DataFrame()
df3['X1']=np.linspace(0.20,0.60,50)
prediction0=model_fit.predict(df2)
#prediction1=model_fit.predict(df3)
plt.plot(df2['X0'],prediction0,label='X0')
plt.ylabel("y")
plt.xlabel("X0")
plt.show()
Linear regression is a good example to start with in artificial intelligence.
Here is a good example of a machine learning algorithm for multiple linear regression using Python:
##### Predicting House Prices Using Multiple Linear Regression - #Y_T_Akademi
#### In this project we see how machine learning algorithms help us predict house prices. Linear regression is a model that predicts new, future data by using the correlation in the existing data. Here, machine learning helps us identify the relationship between the feature data and the output, so we can predict future values.
import pandas as pd
##### we use sklearn library in many machine learning calculations..
from sklearn import linear_model
##### we import our dataset: housepricesdataset.csv
df = pd.read_csv("housepricesdataset.csv", sep=";")
##### the feature columns are area, roomcount and buildingage; the output (result) column is price
##### we define a linear regression model here:
reg = linear_model.LinearRegression()
reg.fit(df[['area', 'roomcount', 'buildingage']], df['price'])
# Since our model is ready, we can make predictions now:
# let's predict a house with 230 square meters, 4 rooms and a 10-year-old building..
reg.predict([[230, 4, 10]])
# Now let's predict a house with 230 square meters, 6 rooms and a 0-year-old building - it's a new building..
reg.predict([[230, 6, 0]])
# Now let's predict a house with 355 square meters, 3 rooms and a 20-year-old building
reg.predict([[355, 3, 20]])
# You can make as many predictions as you want..
reg.predict([[230, 4, 10], [230, 6, 0], [355, 3, 20], [275, 5, 17]])
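One small addition that is not in the original answer: the fitted attributes spell out the explicit price formula for this model.
# price ≈ a*area + b*roomcount + c*buildingage + d
print(reg.coef_)       # a, b, c
print(reg.intercept_)  # d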