So I have a curve fit, and I'm wondering how to include the weighting based on the standard error.
Here's the df (defined as df_altered):
Temperature Growth_rate Standard_Error Result Final_results Weight
14.0 0.363 0.110 0.000 0.363 9.091
18.0 0.677 0.043 0.767 0.673 23.256
22.0 0.822 0.044 0.975 0.832 22.727
26.0 0.936 0.073 0.975 0.920 13.699
30.0 0.897 0.051 0.767 0.911 19.608
And here's the curve fit setup (I don't really know if this qualifies as a curve fit, though):
import numpy as np
import matplotlib.pyplot as plt
x = df_altered['Temperature']
y1 = df_altered['Growth_rate']
y2 = df_altered['Final_results']
fig, ax1 = plt.subplots()
ax2 = ax1.twinx()
ax1.plot(x, y1, 'r-')
ax2.plot(x, y2, 'g-')
ax1.set_xlabel('Temperature (°C)')
ax1.set_ylabel('Observed growth rate', color='r')
ax2.set_ylabel('Optimised modelled growth rate', color='g')
plt.show()
So essentially, I'd like to use the Weight and Standard_Error columns as a set of parameters to determine the weighting of the plots (the smaller the standard error, the greater the weighting). I've already set this up with popt2:
popt2, pcov2 = curve_fit(boatman_temperature_function_optimised, xdata = df_altered.Temperature, ydata = df_altered.Growth_rate, p0 = array_of_maxima, sigma=df_altered.Weight, bounds = (array_of_minima, array_of_maxima), absolute_sigma=True)
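For reference, here is the alternative call I'm also considering: passing the standard errors themselves as sigma (curve_fit treats a 1-D sigma as the standard deviation of each ydata point, so smaller errors automatically get more weight). I'm not sure which of the two calls is correct, hence the question:
# Hypothetical alternative: weight by the measured standard errors directly
popt3, pcov3 = curve_fit(boatman_temperature_function_optimised, xdata = df_altered.Temperature, ydata = df_altered.Growth_rate, p0 = array_of_maxima, sigma=df_altered.Standard_Error, bounds = (array_of_minima, array_of_maxima), absolute_sigma=True)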
Any thoughts?
I am attempting to reproduce the filters used in an ARIMA model using statsmodels' recursive_filter and convolution_filter. (My end objective is to use these filters for prewhitening an exogenous series.)
I begin by working with just an AR model and recursive filter. Here is the simplified experimental setup:
import numpy as np
import statsmodels.api as sm
np.random.seed(42)
# sample data
series = sm.tsa.arma_generate_sample(ar=(1, -0.2, -0.5), ma=(1,), nsample=100)
model = sm.tsa.arima.ARIMA(series, order=(2, 0, 0)).fit()
print(model.summary())
This gracefully produces the following output, which seems fair enough:
SARIMAX Results
==============================================================================
Dep. Variable: y No. Observations: 100
Model: ARIMA(2, 0, 0) Log Likelihood -131.991
Date: Wed, 07 Apr 2021 AIC 271.982
Time: 12:58:39 BIC 282.403
Sample: 0 HQIC 276.200
- 100
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const -0.3136 0.266 -1.179 0.238 -0.835 0.208
ar.L1 0.2135 0.084 2.550 0.011 0.049 0.378
ar.L2 0.4467 0.101 4.427 0.000 0.249 0.645
sigma2 0.8154 0.126 6.482 0.000 0.569 1.062
===================================================================================
Ljung-Box (L1) (Q): 0.10 Jarque-Bera (JB): 0.53
Prob(Q): 0.75 Prob(JB): 0.77
Heteroskedasticity (H): 0.98 Skew: -0.16
Prob(H) (two-sided): 0.96 Kurtosis: 2.85
===================================================================================
I fit an AR(2) and obtain coefficients for lag 1 and 2 based on the SARIMAX results. My intuition for reproducing this model using statsmodels.tsa.filters.filtertools.recursive_filter is like so:
filtered = sm.tsa.filters.filtertools.recursive_filter(series, ar_coeff=(-0.2135, -0.4467))
(And maybe adding in the constant from the regression results as well). However, a direct comparison of the results shows the recursive filter doesn't replicate the AR model:
import matplotlib.pyplot as plt
# ARIMA residuals
plt.plot(model.resid)
# Calculated residuals using recursive filter outcome
plt.plot(filtered)
Am I approaching this incorrectly? Should I be using a different filter function? The next step for me is to perform the same task but on an MA model so that I can add (?) the results together to obtain a full ARMA filter for prewhitening.
Note: this question may be valuable to somebody searching for "how can I prewhiten timeseries data?" particularly in Python using statsmodels.
You should use convolution_filter for the AR part and recursive_filter for the MA part. Combining these sequentially will work for ARMA models. Alternatively, you can use arma_innovations for an exact approach that handles both the AR and MA parts simultaneously. Here are some examples:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.tsa.innovations import arma_innovations
AR(2)
np.random.seed(42)
series = sm.tsa.arma_generate_sample(ar=(1, -0.2, -0.5), ma=(1,), nsample=100)
res = sm.tsa.arima.ARIMA(series, order=(2, 0, 0), trend='n').fit()
print(pd.DataFrame({
'ARIMA resid': res.resid,
'arma_innovations': arma_innovations.arma_innovations(
series, ar_params=res.params[:-1])[0],
'convolution filter': sm.tsa.filters.convolution_filter(
series, np.r_[1, -res.params[:-1]], nsides=1)}))
gives:
ARIMA resid arma_innovations convolution filter
0 0.496714 0.496714 NaN
1 -0.254235 -0.254235 NaN
2 0.666326 0.666326 0.666326
3 1.493315 1.493315 1.493315
4 -0.256708 -0.256708 -0.256708
.. ... ... ...
95 -1.438670 -1.438670 -1.438670
96 0.323470 0.323470 0.323470
97 0.218243 0.218243 0.218243
98 0.012264 0.012264 0.012264
99 -0.245584 -0.245584 -0.245584
MA(1)
np.random.seed(42)
series = sm.tsa.arma_generate_sample(ar=(1,), ma=(1, 0.2), nsample=100)
res = sm.tsa.arima.ARIMA(series, order=(0, 0, 1), trend='n').fit()
print(pd.DataFrame({
'ARIMA resid': res.resid,
'arma_innovations': arma_innovations.arma_innovations(
series, ma_params=res.params[:-1])[0],
'recursive filter': sm.tsa.filters.recursive_filter(series, -res.params[:-1])}))
gives:
ARIMA resid arma_innovations recursive filter
0 0.496714 0.496714 0.496714
1 -0.132893 -0.132893 -0.136521
2 0.646110 0.646110 0.646861
3 1.525620 1.525620 1.525466
4 -0.229316 -0.229316 -0.229286
.. ... ... ...
95 -1.464786 -1.464786 -1.464786
96 0.291233 0.291233 0.291233
97 0.263055 0.263055 0.263055
98 0.005637 0.005637 0.005637
99 -0.234672 -0.234672 -0.234672
ARMA(1, 1)
np.random.seed(42)
series = sm.tsa.arma_generate_sample(ar=(1, 0.5), ma=(1, 0.2), nsample=100)
res = sm.tsa.arima.ARIMA(series, order=(1, 0, 1), trend='n').fit()
a = res.resid
# Apply the recursive then convolution filter
tmp = sm.tsa.filters.recursive_filter(series, -res.params[1:2])
filtered = sm.tsa.filters.convolution_filter(tmp, np.r_[1, -res.params[:1]], nsides=1)
print(pd.DataFrame({
'ARIMA resid': res.resid,
'arma_innovations': arma_innovations.arma_innovations(
series, ar_params=res.params[:1], ma_params=res.params[1:2])[0],
'combined filters': filtered}))
gives:
ARIMA resid arma_innovations combined filters
0 0.496714 0.496714 NaN
1 -0.134253 -0.134253 -0.136915
2 0.668094 0.668094 0.668246
3 1.507288 1.507288 1.507279
4 -0.193560 -0.193560 -0.193559
.. ... ... ...
95 -1.448784 -1.448784 -1.448784
96 0.268421 0.268421 0.268421
97 0.212966 0.212966 0.212966
98 0.046281 0.046281 0.046281
99 -0.244725 -0.244725 -0.244725
SARIMA(1, 0, 1)x(1, 0, 0, 3)
The seasonal model is a little more complicated, because it requires multiplying lag polynomials. For additional details, see this example notebook from the Statsmodels documentation.
np.random.seed(42)
ar_poly = [1, -0.5]
sar_poly = [1, 0, 0, -0.1]
ar = np.polymul(ar_poly, sar_poly)
series = sm.tsa.arma_generate_sample(ar=ar, ma=(1, 0.2), nsample=100)
res = sm.tsa.arima.ARIMA(series, order=(1, 0, 1), seasonal_order=(1, 0, 0, 3), trend='n').fit()
a = res.resid
# Apply the recursive then convolution filter
tmp = sm.tsa.filters.recursive_filter(series, -res.polynomial_reduced_ma[1:])
filtered = sm.tsa.filters.convolution_filter(tmp, res.polynomial_reduced_ar, nsides=1)
print(pd.DataFrame({
'ARIMA resid': res.resid,
'arma_innovations': arma_innovations.arma_innovations(
series, ar_params=-res.polynomial_reduced_ar[1:],
ma_params=res.polynomial_reduced_ma[1:])[0],
'combined filters': filtered}))
gives:
ARIMA resid arma_innovations combined filters
0 0.496714 0.496714 NaN
1 -0.100303 -0.100303 NaN
2 0.625066 0.625066 NaN
3 1.557418 1.557418 NaN
4 -0.209256 -0.209256 -0.205201
.. ... ... ...
95 -1.476702 -1.476702 -1.476702
96 0.269118 0.269118 0.269118
97 0.230697 0.230697 0.230697
98 -0.004561 -0.004561 -0.004561
99 -0.233466 -0.233466 -0.233466
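Coming back to the prewhitening goal from the question: once the ARMA filter has been estimated on the input series, the same pair of filters can be applied to an exogenous series. A minimal sketch, reusing res and the imports from the seasonal example above and a made-up exogenous series exog:
# Hypothetical exogenous series of the same length as `series`
exog = np.random.normal(size=100)
# Prewhiten exog with the reduced ARMA polynomials estimated from `series`
tmp_exog = sm.tsa.filters.recursive_filter(exog, -res.polynomial_reduced_ma[1:])
exog_prewhitened = sm.tsa.filters.convolution_filter(tmp_exog, res.polynomial_reduced_ar, nsides=1)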
My goal: extracting the formula (not only the coefficients) after a linear regression done with statsmodels.
Context:
I have a pandas DataFrame:
df
x y z
0 0.0 2.0 54.200
1 0.0 2.2 70.160
2 0.0 2.4 89.000
3 0.0 2.6 110.960
I am doing a linear regression using statsmodels.api (2 variables, polynomial degree 3), and I am happy with this regression.
OLS Regression Results
==============================================================================
Dep. Variable: z R-squared: 1.000
Model: OLS Adj. R-squared: 1.000
Method: Least Squares F-statistic: 2.193e+29
Date: Sun, 31 May 2020 Prob (F-statistic): 0.00
Time: 22:04:49 Log-Likelihood: 9444.6
No. Observations: 400 AIC: -1.887e+04
Df Residuals: 390 BIC: -1.883e+04
Df Model: 9
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 0.2000 3.33e-11 6.01e+09 0.000 0.200 0.200
x1 2.0000 1.16e-11 1.72e+11 0.000 2.000 2.000
x2 1.0000 2.63e-11 3.8e+10 0.000 1.000 1.000
x3 4.0000 3.85e-12 1.04e+12 0.000 4.000 4.000
x4 12.0000 4.36e-12 2.75e+12 0.000 12.000 12.000
x5 3.0000 6.81e-12 4.41e+11 0.000 3.000 3.000
x6 6.0000 5.74e-13 1.05e+13 0.000 6.000 6.000
x7 13.0000 4.99e-13 2.6e+13 0.000 13.000 13.000
x8 14.0000 4.99e-13 2.81e+13 0.000 14.000 14.000
x9 5.0000 5.74e-13 8.71e+12 0.000 5.000 5.000
==============================================================================
Omnibus: 25.163 Durbin-Watson: 0.038
Prob(Omnibus): 0.000 Jarque-Bera (JB): 28.834
Skew: -0.655 Prob(JB): 5.48e-07
Kurtosis: 2.872 Cond. No. 6.66e+03
==============================================================================
I need to implement this outside of Python (in MS Excel), so I would like to know the formula.
I know it is a degree-3 polynomial, but I am wondering how to tell which coefficient applies to which term of the equation.
For example: is the x7 coefficient the one for x³, y², x²y, ...?
Note: this is a simplified version of my problem; in reality I have 3 variables and degree 3, so 20 coefficients.
Here is a simpler example of code to get started with my case:
# %% Question extract formula from linear regresion coeff
#Import
import numpy as np # version : '1.18.1'
import pandas as pd # version'1.0.0'
import statsmodels.api as sm # version : '0.10.1'
from sklearn.preprocessing import PolynomialFeatures
from itertools import product
#%% Creating the dummy data
def function_for_df(row):
    x = row['x']
    y = row['y']
    return unknow_function(x, y)
def unknow_function(x, y):
    """
    This is used to generate the data; of course, in reality I do not know the formula
    """
    r = 0.2 + \
        6*x**3 + 4*x**2 + 2*x + \
        5*y**3 + 3*y**2 + 1*y + \
        12*x*y + 13*x**2*y + 14*x*y**2
    return r
# input data
x_input = np.arange(0, 4 , 0.2)
y_input = np.arange(2, 6 , 0.2)
# create a simple dataframe with dummies datas
df = pd.DataFrame(list(product(x_input, y_input)), columns=['x', 'y'])
df['z'] = df.apply(function_for_df, axis=1)
# In reality, I start from here!
#%% creating the model
X = df[['x','y']].astype(float) #
Y = df['z'].astype(float)
polynomial_features_final= PolynomialFeatures(degree=3)
X3 = polynomial_features_final.fit_transform(X)
model = sm.OLS(Y, X3).fit()
predictions = model.predict(X3)
print_model = model.summary()
print(print_model)
#%% using the model to make predictions, no problem
def model_predict(x_sample, y_samples):
    df_input = pd.DataFrame({ "x":x_sample, "y":y_samples }, index=[0])
    X_input = polynomial_features_final.fit_transform(df_input)
    prediction = model.predict(X_input)
    return prediction
print("prediction for x=2, y=3.2 :" ,model_predict(2 ,3.2))
# How to extract the formula for the "model" ?
#Thanks
Side notes:
A description like the one given by patsy's ModelDesc would be fine:
from patsy import ModelDesc
ModelDesc.from_formula("y ~ x")
# or even better :
desc = ModelDesc.from_formula("y ~ (a + b + c + d) ** 2")
desc.describe()
But I am not able to make the bridge between my model and patsy.ModelDesc.
Thanks for your help.
As Josef said in the comment, I had to look at sklearn's PolynomialFeatures. Then I found this answer:
PolynomialFeatures(degree=3).get_feature_names()
(in recent scikit-learn versions this method is named get_feature_names_out())
In context:
#%% creating the model
X = df[['x','y']].astype(float) #
Y = df['z'].astype(float)
polynomial_features_final= PolynomialFeatures(degree=3)
#X3 = polynomial_features_final.fit_transform(X)
X3 = polynomial_features_final.fit_transform(df[['x', 'y']].to_numpy())
model = sm.OLS(Y, X3).fit()
predictions = model.predict(X3)
print_model = model.summary()
print(print_model)
print("\n-- ONE SOLUTION --\n Coef and Term name :")
results = list(zip(model.params, polynomial_features_final.get_feature_names()))
print(results)
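To go one step further toward an Excel-ready expression, the coefficient/term pairs can be joined into a single formula string. A minimal sketch, reusing model and polynomial_features_final from above and assuming the two input columns are named x and y:
# Name the terms after the original columns; newer scikit-learn uses get_feature_names_out()
terms = polynomial_features_final.get_feature_names(['x', 'y'])
formula = " + ".join("{:.4g}*{}".format(coef, name.replace(" ", "*"))
                     for coef, name in zip(model.params, terms))
print("z =", formula)
# expected to look roughly like: z = 0.2*1 + 2*x + 1*y + 4*x^2 + 12*x*y + 3*y^2 + 6*x^3 + 13*x^2*y + 14*x*y^2 + 5*y^3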
You can fit the model using the Patsy formula language via statsmodels.formula.api.
For instance, you can have:
import statsmodels.formula.api as smf
# You do not need fit_transform to generate poly features in df
# You can specify the model using vectorized functions, many transformations are supported
model = smf.ols(formula='z ~ x*y + I(x**2)*I(y**2) + I(x**3)*I(y**3)', data=df).fit()
print_model = model.summary()
print(print_model)
OLS Regression Results
==============================================================================
Dep. Variable: z R-squared: 1.000
Model: OLS Adj. R-squared: 1.000
Method: Least Squares F-statistic: 4.193e+05
Date: Tue, 15 Feb 2022 Prob (F-statistic): 0.00
Time: 16:53:19 Log-Likelihood: -1478.1
No. Observations: 400 AIC: 2976.
Df Residuals: 390 BIC: 3016.
Df Model: 9
Covariance Type: nonrobust
=======================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------------
Intercept 229.3425 23.994 9.558 0.000 182.168 276.517
x -180.0081 11.822 -15.226 0.000 -203.251 -156.765
y -136.2619 19.039 -7.157 0.000 -173.694 -98.830
x:y 118.0431 2.864 41.210 0.000 112.411 123.675
I(x ** 2) 31.2537 4.257 7.341 0.000 22.884 39.624
I(y ** 2) 22.5973 4.986 4.532 0.000 12.795 32.399
I(x ** 2):I(y ** 2) 1.4176 0.213 6.671 0.000 1.000 1.835
I(x ** 3) 4.7562 0.595 7.991 0.000 3.586 5.926
I(y ** 3) 4.7601 0.423 11.250 0.000 3.928 5.592
I(x ** 3):I(y ** 3) 0.0166 0.006 2.915 0.004 0.005 0.028
==============================================================================
Omnibus: 28.012 Durbin-Watson: 0.182
Prob(Omnibus): 0.000 Jarque-Bera (JB): 71.076
Skew: -0.311 Prob(JB): 3.68e-16
Kurtosis: 4.969 Cond. No. 1.32e+05
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.32e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
print(model.params)
Intercept 229.342451
x -180.008082
y -136.261886
x:y 118.043098
I(x ** 2) 31.253705
I(y ** 2) 22.597298
I(x ** 2):I(y ** 2) 1.417551
I(x ** 3) 4.756205
I(y ** 3) 4.760144
I(x ** 3):I(y ** 3) 0.016611
dtype: float64
Then a simple print(model.params) gives you a natural bridge between the model and patsy.ModelDesc (here the parameter index plays the role of the formula definition).
Note: raw polynomials are used in this demo. You may switch to orthogonal polynomials, which help explain the contribution of each term to the variance in the outcome; see this StackExchange reference.
I can't seem to find any python libraries that do multiple regression. The only things I find only do simple regression. I need to regress my dependent variable (y) against several independent variables (x1, x2, x3, etc.).
For example, with this data:
print 'y x1 x2 x3 x4 x5 x6 x7'
for t in texts:
    print "{:>7.1f}{:>10.2f}{:>9.2f}{:>9.2f}{:>10.2f}{:>7.2f}{:>7.2f}{:>9.2f}" \
        .format(t.y, t.x1, t.x2, t.x3, t.x4, t.x5, t.x6, t.x7)
(output for above:)
y x1 x2 x3 x4 x5 x6 x7
-6.0 -4.95 -5.87 -0.76 14.73 4.02 0.20 0.45
-5.0 -4.55 -4.52 -0.71 13.74 4.47 0.16 0.50
-10.0 -10.96 -11.64 -0.98 15.49 4.18 0.19 0.53
-5.0 -1.08 -3.36 0.75 24.72 4.96 0.16 0.60
-8.0 -6.52 -7.45 -0.86 16.59 4.29 0.10 0.48
-3.0 -0.81 -2.36 -0.50 22.44 4.81 0.15 0.53
-6.0 -7.01 -7.33 -0.33 13.93 4.32 0.21 0.50
-8.0 -4.46 -7.65 -0.94 11.40 4.43 0.16 0.49
-8.0 -11.54 -10.03 -1.03 18.18 4.28 0.21 0.55
How would I regress these in python, to get the linear regression formula:
Y = a1x1 + a2x2 + a3x3 + a4x4 + a5x5 + a6x6 + a7x7 + c
sklearn.linear_model.LinearRegression will do it:
from sklearn import linear_model
clf = linear_model.LinearRegression()
clf.fit([[getattr(t, 'x%d' % i) for i in range(1, 8)] for t in texts],
[t.y for t in texts])
Then clf.coef_ will have the regression coefficients.
sklearn.linear_model also has similar interfaces to do various kinds of regularizations on the regression.
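For example, to print the fitted formula Y = a1*x1 + ... + a7*x7 + c (a small sketch, assuming clf has been fitted as above):
# clf.coef_ holds a1..a7 in the order of the input columns; clf.intercept_ is c
terms = " + ".join("{:.3f}*x{}".format(a, i + 1) for i, a in enumerate(clf.coef_))
print("Y = {} + {:.3f}".format(terms, clf.intercept_))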
Here is a little workaround that I created. I checked it against R and it works correctly.
import numpy as np
import statsmodels.api as sm
y = [1,2,3,4,3,4,5,4,5,5,4,5,4,5,4,5,6,5,4,5,4,3,4]
x = [
[4,2,3,4,5,4,5,6,7,4,8,9,8,8,6,6,5,5,5,5,5,5,5],
[4,1,2,3,4,5,6,7,5,8,7,8,7,8,7,8,7,7,7,7,7,6,5],
[4,1,2,5,6,7,8,9,7,8,7,8,7,7,7,7,7,7,6,6,4,4,4]
]
def reg_m(y, x):
    # Stack the regressors column by column; note that each new regressor is
    # prepended, so x1 in the summary corresponds to the last list in x
    ones = np.ones(len(x[0]))
    X = sm.add_constant(np.column_stack((x[0], ones)))
    for ele in x[1:]:
        X = sm.add_constant(np.column_stack((ele, X)))
    results = sm.OLS(y, X).fit()
    return results
Result:
print reg_m(y, x).summary()
Output:
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.535
Model: OLS Adj. R-squared: 0.461
Method: Least Squares F-statistic: 7.281
Date: Tue, 19 Feb 2013 Prob (F-statistic): 0.00191
Time: 21:51:28 Log-Likelihood: -26.025
No. Observations: 23 AIC: 60.05
Df Residuals: 19 BIC: 64.59
Df Model: 3
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1 0.2424 0.139 1.739 0.098 -0.049 0.534
x2 0.2360 0.149 1.587 0.129 -0.075 0.547
x3 -0.0618 0.145 -0.427 0.674 -0.365 0.241
const 1.5704 0.633 2.481 0.023 0.245 2.895
==============================================================================
Omnibus: 6.904 Durbin-Watson: 1.905
Prob(Omnibus): 0.032 Jarque-Bera (JB): 4.708
Skew: -0.849 Prob(JB): 0.0950
Kurtosis: 4.426 Cond. No. 38.6
pandas provides a convenient way to run OLS as given in this answer:
Run an OLS regression with Pandas Data Frame
Just to clarify, the example you gave is multiple linear regression, not multivariate linear regression (see the reference). The difference:
The very simplest case of a single scalar predictor variable x and a single scalar response variable y is known as simple linear regression. The extension to multiple and/or vector-valued predictor variables (denoted with a capital X) is known as multiple linear regression, also known as multivariable linear regression. Nearly all real-world regression models involve multiple predictors, and basic descriptions of linear regression are often phrased in terms of the multiple regression model. Note, however, that in these cases the response variable y is still a scalar. Another term multivariate linear regression refers to cases where y is a vector, i.e., the same as general linear regression. The difference between multivariate linear regression and multivariable linear regression should be emphasized as it causes much confusion and misunderstanding in the literature.
In short:
multiple linear regression: the response y is a scalar.
multivariate linear regression: the response y is a vector.
(Another source.)
You can use numpy.linalg.lstsq:
import numpy as np
y = np.array([-6, -5, -10, -5, -8, -3, -6, -8, -8])
X = np.array(
[
[-4.95, -4.55, -10.96, -1.08, -6.52, -0.81, -7.01, -4.46, -11.54],
[-5.87, -4.52, -11.64, -3.36, -7.45, -2.36, -7.33, -7.65, -10.03],
[-0.76, -0.71, -0.98, 0.75, -0.86, -0.50, -0.33, -0.94, -1.03],
[14.73, 13.74, 15.49, 24.72, 16.59, 22.44, 13.93, 11.40, 18.18],
[4.02, 4.47, 4.18, 4.96, 4.29, 4.81, 4.32, 4.43, 4.28],
[0.20, 0.16, 0.19, 0.16, 0.10, 0.15, 0.21, 0.16, 0.21],
[0.45, 0.50, 0.53, 0.60, 0.48, 0.53, 0.50, 0.49, 0.55],
]
)
X = X.T # transpose so input vectors are along the rows
X = np.c_[X, np.ones(X.shape[0])] # add bias term
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta_hat)
Result:
[ -0.49104607 0.83271938 0.0860167 0.1326091 6.85681762 22.98163883 -41.08437805 -19.08085066]
You can see the estimated output with:
print(np.dot(X,beta_hat))
Result:
[ -5.97751163, -5.06465759, -10.16873217, -4.96959788, -7.96356915, -3.06176313, -6.01818435, -7.90878145, -7.86720264]
Use scipy.optimize.curve_fit. It works not only for linear fits.
from scipy.optimize import curve_fit
import numpy as np
def fn(x, a, b, c):
    return a + b*x[0] + c*x[1]
# y(x0,x1) data:
# x0=0 1 2
# ___________
# x1=0 |0 1 2
# x1=1 |1 2 3
# x1=2 |2 3 4
x = np.array([[0,1,2,0,1,2,0,1,2],[0,0,0,1,1,1,2,2,2]])
y = np.array([0,1,2,1,2,3,2,3,4])
popt, pcov = curve_fit(fn, x, y)
print(popt)
Once you convert your data to a pandas dataframe (df),
import statsmodels.formula.api as smf
lm = smf.ols(formula='y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7', data=df).fit()
print(lm.params)
The intercept term is included by default.
See this notebook for more examples.
I think this may be the easiest way to do this:
from random import random
from pandas import DataFrame
from statsmodels.api import OLS
lr = lambda : [random() for i in range(100)]
x = DataFrame({'x1': lr(), 'x2':lr(), 'x3':lr()})
x['b'] = 1
y = x.x1 + x.x2 * 2 + x.x3 * 3 + 4
print x.head()
x1 x2 x3 b
0 0.433681 0.946723 0.103422 1
1 0.400423 0.527179 0.131674 1
2 0.992441 0.900678 0.360140 1
3 0.413757 0.099319 0.825181 1
4 0.796491 0.862593 0.193554 1
print y.head()
0 6.637392
1 5.849802
2 7.874218
3 7.087938
4 7.102337
dtype: float64
model = OLS(y, x)
result = model.fit()
print result.summary()
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 1.000
Model: OLS Adj. R-squared: 1.000
Method: Least Squares F-statistic: 5.859e+30
Date: Wed, 09 Dec 2015 Prob (F-statistic): 0.00
Time: 15:17:32 Log-Likelihood: 3224.9
No. Observations: 100 AIC: -6442.
Df Residuals: 96 BIC: -6431.
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1 1.0000 8.98e-16 1.11e+15 0.000 1.000 1.000
x2 2.0000 8.28e-16 2.41e+15 0.000 2.000 2.000
x3 3.0000 8.34e-16 3.6e+15 0.000 3.000 3.000
b 4.0000 8.51e-16 4.7e+15 0.000 4.000 4.000
==============================================================================
Omnibus: 7.675 Durbin-Watson: 1.614
Prob(Omnibus): 0.022 Jarque-Bera (JB): 3.118
Skew: 0.045 Prob(JB): 0.210
Kurtosis: 2.140 Cond. No. 6.89
==============================================================================
Multiple Linear Regression can be handled using the sklearn library as referenced above. I'm using the Anaconda install of Python 3.6.
Create your model as follows:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X, y)
# display coefficients
print(regressor.coef_)
You can use numpy.linalg.lstsq
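A minimal sketch of that approach, assuming X is an (n, k) array of regressors and y a length-n response vector; the appended column of ones provides the intercept:
import numpy as np
A = np.c_[X, np.ones(len(y))]              # add intercept column
coefs, residuals, rank, sv = np.linalg.lstsq(A, y, rcond=None)
print(coefs)                               # a1..ak followed by the constant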
You can use the function below and pass it a DataFrame:
def linear(x, y=None, show=True):
    """
    :param x: pd.DataFrame
    :param y: pd.DataFrame or pd.Series or None
        if None, then use the last column of x as y
    :param show: whether to print the regression summary
    """
    import pandas as pd
    import statsmodels.api as sm
    # Add an intercept and regress the last column on all the others
    xy = sm.add_constant(x if y is None else pd.concat([x, y], axis=1))
    res = sm.OLS(xy.iloc[:, -1], xy.iloc[:, :-1], missing='drop').fit()
    if show:
        print(res.summary())
    return res
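Usage might look like this (a sketch with a small made-up DataFrame whose last column is the response):
import pandas as pd
df = pd.DataFrame({'x1': [1, 2, 3, 4], 'x2': [2, 1, 4, 3], 'y': [3.1, 4.0, 8.2, 8.9]})
res = linear(df)        # regresses the last column (y) on x1, x2 and a constant
print(res.params)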
Scikit-learn is a machine learning library for Python which can do this job for you.
Just import the sklearn.linear_model module into your script.
Here is a code template for Multiple Linear Regression using sklearn in Python:
import numpy as np
import matplotlib.pyplot as plt #to plot visualizations
import pandas as pd
# Importing the dataset
df = pd.read_csv(<Your-dataset-path>)
# Assigning feature and target variables
X = df.iloc[:,:-1]
y = df.iloc[:,-1]
# Use label encoders, if you have any categorical variable
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
X['<column-name>'] = labelencoder.fit_transform(X['<column-name>'])
from sklearn.preprocessing import OneHotEncoder
# Note: categorical_features was removed in newer scikit-learn releases;
# there, use sklearn.compose.ColumnTransformer with OneHotEncoder instead
onehotencoder = OneHotEncoder(categorical_features = ['<index-value>'])
X = onehotencoder.fit_transform(X).toarray()
# Avoiding the dummy variable trap
X = X[:,1:] # Usually done by the algorithm itself
# Splitting the data into test and train sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state = 0, test_size = 0.2)
# Fitting the model
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
# Predicting the test set results
y_pred = regressor.predict(X_test)
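To inspect the fitted coefficients and how well the model does on the held-out set (a small sketch using the objects defined above):
from sklearn.metrics import r2_score
print(regressor.intercept_, regressor.coef_)   # constant term and a1..an
print(r2_score(y_test, y_pred))                # goodness of fit on the test set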
That's it. You can use this code as a template for implementing Multiple Linear Regression in any dataset.
For a better understanding with an example, Visit: Linear Regression with an example
Here is an alternative and basic method:
from patsy import dmatrices
import statsmodels.api as sm
y,x = dmatrices("y_data ~ x_1 + x_2 ", data = my_data)
### y_data is the name of the dependent variable in your data ###
model_fit = sm.OLS(y,x)
results = model_fit.fit()
print(results.summary())
Instead of sm.OLS you can also use sm.Logit, sm.Probit, etc.
Finding a linear model such as this one can be handled with OpenTURNS.
In OpenTURNS this is done with the LinearModelAlgorithm class, which creates a linear model from numerical samples. More specifically, it builds the following linear model:
Y = a0 + a1*X1 + ... + an*Xn + epsilon,
where the error epsilon is Gaussian with zero mean and unit variance. Assuming your data is in a CSV file, here is a simple script to get the regression coefficients ai:
from __future__ import print_function
import pandas as pd
import openturns as ot
# Assuming the data is a csv file with the given structure
# Y X1 X2 .. X7
df = pd.read_csv("./data.csv", sep="\s+")
# Build a sample from the pandas dataframe
sample = ot.Sample(df.values)
# The observation points are in the first column (dimension 1)
Y = sample[:, 0]
# The input vector (X1,..,X7) of dimension 7
X = sample[:, 1::]
# Build a Linear model approximation
result = ot.LinearModelAlgorithm(X, Y).getResult()
# Get the coefficients ai
print("coefficients of the linear regression model = ", result.getCoefficients())
You can then easily get the confidence intervals with the following call:
# Get the confidence intervals at 90% of the ai coefficients
print(
"confidence intervals of the coefficients = ",
ot.LinearModelAnalysis(result).getCoefficientsConfidenceInterval(0.9),
)
You may find a more detailed example in the OpenTURNS examples.
Try a generalized linear model with a Gaussian family:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import glm
y = np.array([-6, -5, -10, -5, -8, -3, -6, -8, -8])
X = np.array([
    [-4.95, -4.55, -10.96, -1.08, -6.52, -0.81, -7.01, -4.46, -11.54],
    [-5.87, -4.52, -11.64, -3.36, -7.45, -2.36, -7.33, -7.65, -10.03],
    [-0.76, -0.71, -0.98, 0.75, -0.86, -0.50, -0.33, -0.94, -1.03],
    [14.73, 13.74, 15.49, 24.72, 16.59, 22.44, 13.93, 11.40, 18.18],
    [4.02, 4.47, 4.18, 4.96, 4.29, 4.81, 4.32, 4.43, 4.28],
    [0.20, 0.16, 0.19, 0.16, 0.10, 0.15, 0.21, 0.16, 0.21],
    [0.45, 0.50, 0.53, 0.60, 0.48, 0.53, 0.50, 0.49, 0.55],
])
# Transpose the reversed rows so that X0 corresponds to x7, X1 to x6, etc.
X = list(zip(*reversed(X)))
df = pd.DataFrame({'X': X, 'y': y})
columns = 7
for i in range(0, columns):
    df['X' + str(i)] = df.apply(lambda row: row['X'][i], axis=1)
df = df.drop('X', axis=1)
print(df)
#model_formula='y ~ X0+X1+X2+X3+X4+X5+X6'
model_formula='y ~ X0'
model_family = sm.families.Gaussian()
model_fit = glm(formula = model_formula,
data = df,
family = model_family).fit()
print(model_fit.summary())
# Extract coefficients from the fitted model
#print(model_fit.params)
intercept, slope = model_fit.params
# Print coefficients
print('Intercept =', intercept)
print('Slope =', slope)
# Extract and print confidence intervals
print(model_fit.conf_int())
df2=pd.DataFrame()
df2['X0']=np.linspace(0.50,0.70,50)
df3=pd.DataFrame()
df3['X1']=np.linspace(0.20,0.60,50)
prediction0=model_fit.predict(df2)
#prediction1=model_fit.predict(df3)
plt.plot(df2['X0'],prediction0,label='X0')
plt.ylabel("y")
plt.xlabel("X0")
plt.show()
Linear regression is a good example to start with in artificial intelligence.
Here is a good example of a machine learning algorithm, multiple linear regression, using Python:
##### Predicting House Prices Using Multiple Linear Regression - #Y_T_Akademi
#### In this project we are going to see how machine learning algorithms help us predict house prices. Linear regression is a model that predicts new, future data by using the existing correlation in the old data. Here, machine learning helps us identify the relationship between the feature data and the output, so we can predict future values.
import pandas as pd
##### We use the sklearn library for many machine learning calculations.
from sklearn import linear_model
##### We import our dataset: housepricesdataset.csv
df = pd.read_csv("housepricesdataset.csv",sep = ";")
##### The following is our feature set:
##### The following is the output(result) data:
##### we define a linear regression model here:
reg = linear_model.LinearRegression()
reg.fit(df[['area', 'roomcount', 'buildingage']], df['price'])
# Since our model is ready, we can make predictions now:
# lets predict a house with 230 square meters, 4 rooms and 10 years old building..
reg.predict([[230,4,10]])
# Now lets predict a house with 230 square meters, 6 rooms and 0 years old building - its new building..
reg.predict([[230,6,0]])
# Now lets predict a house with 355 square meters, 3 rooms and 20 years old building
reg.predict([[355,3,20]])
# You can make as many prediction as you want..
reg.predict([[230,4,10], [230,6,0], [355,3,20], [275, 5, 17]])
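If you also want the regression formula itself, the fitted coefficients and intercept are available on the model (a sketch using the reg object above):
print(reg.coef_)       # coefficients for area, roomcount and buildingage
print(reg.intercept_)  # constant term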
And my dataset is below: