I wanted to use multivariate quantile regression with spline to analyze the data. The data contains three independent variables and one dependent variable. I divided the data into training set and validation set, and fitted the model on the training set and the validation set to verify the model. I used quantreg()from statsmodels.formula.api and thebs() from the patsy to achieve this. But quickly an error occurred using predict().
1.I don't know if this is the right way to implement my idea.
2.How to use predict in the above situation.
import pandas as pd
import statsmodels.formula.api as smf
import patsy
from sklearn.model_selection import train_test_split
train_x, valid_x, train_y, valid_y = train_test_split(data.iloc[:,:3],
data.total, test_size=0.1, random_state = 1)
train=train_x.join(train_y)
vel = train['vel']
salmean = train['salmean']
em = train['em']
total = train['total']
model = smf.quantreg('total ~ bs(vel, df=3, degree=3) + bs(salmean, df=3,
degree=3) + bs(em, df=3, degree=3) ', train).fit(0.9)
y_pre =model.predict(valid_x)
The information of the error:
PatsyError: predict requires that you use a DataFrame when predicting from a model that was created using the formula api.
The original error message returned by patsy is:
Error evaluating factor: NotImplementedError: some data points fall outside the outermost knots, and I'm not sure how to handle them. (Patches accepted!)
total ~ bs(vel, df=3, degree=3) + bs(salmean, df=3, degree=3) + bs(em, df=3, degree=3)
Related
I tried to practice linear regression model with iris dataset.
from sklearn import datasets
import seaborn as sns
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression
# load iris data
train = sns.load_dataset('iris')
train
# one-hot-encoding
species_encoded = pd.get_dummies(train["species"], prefix = "speceis")
species_encoded
train = pd.concat([train, species_encoded], axis = 1)
train
# Split by feature and target
feature = ["sepal_length", "petal_length", "speceis_setosa", "speceis_versicolor", "speceis_virginica"]
target = ["petal_width"]
X_train = train[feature]
y_train = train[target]
case 1 : statsmodels
# model
X_train_constant = sm.add_constant(X_train)
model = sm.OLS(y_train, X_train_constant).fit()
print("const : {:.6f}".format(model.params[0]))
print(model.params[1:])
result :
const : 0.253251
sepal_length -0.001693
petal_length 0.231921
speceis_setosa -0.337843
speceis_versicolor 0.094816
speceis_virginica 0.496278
case 2 : scikit-learn
# model
model = LinearRegression()
model.fit(X_train, y_train)
print("const : {:.6f}".format(model.intercept_[0]))
print(pd.Series(model.coef_[0], model.feature_names_in_))
result :
const : 0.337668
sepal_length -0.001693
petal_length 0.231921
speceis_setosa -0.422260
speceis_versicolor 0.010399
speceis_virginica 0.411861
Why are the results of statsmodels and sklearn different?
Additionally, the results of the two models are the same except for all or part of the one-hot-encoded feature.
You included a full set of one-hot encoded dummies as regressors, which results in a linear combination that is equal to the constant, therefore you have perfect multicollinearity: your covariance matrix is singular and you can't take its inverse.
Under the hood both statsmodels and sklearn rely on Moore-Penrose pseudoinverse and can invert singular matrices just fine, the problem is that the coefficients obtained in the singular covariance matrix case don't mean anything in any physical sense. The implementations differ a bit between packages (sklearn relies on scipy.stats.lstsq, statsmodels has some custom procedure statsmodels.tools.pinv_extended, which is basically numpy.linalg.svd with minimal changes), so at the end of the day they both display «nonsense» (since no meaningful coefficients can be obtained), it's just a design choice of what kind of «nonsense» to display.
If you take the sum of coefficients of one-hot encoded dummies, you can see that for statsmodels it is equal to the constant, and for sklearn it is equal to 0, while the constant differs from statsmodels constant. The coefficients of variables that are not «responsible» for perfect multicollinearity are unaffected.
So I have this small dataset and ı want to perform multiple linear regression on it.
first I drop the deliveries column for it's high correlation with miles. Although gasprice is supposed to be removed, I don't remove it so that I can perform multiple linear regression and not simple linear regression.
finally I removed the outliers and did the following:
Dataset
import math
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats import diagnostic as diag
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn import linear_model
%matplotlib inline
X = dfafter
Y = dfafter[['hours']]
# Split X and y into X_
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=1)
# create a Linear Regression model object
regression_model = LinearRegression()
# pass through the X_train & y_train data set
regression_model.fit(X_train, y_train)
y_predict = regression_model.predict(X_train)
#lets find out what are our coeffs of the multiple linear regression and olso find intercept
intercept = regression_model.intercept_[0]
coefficent = regression_model.coef_[0][0]
print("The intercept for our model is {}".format(intercept))
print('-'*100)
# loop through the dictionary and print the data
for coef in zip(X.columns, regression_model.coef_[0]):
print("The Coefficient for {} is {}".format(coef[0],coef[1]))
#Coeffs here don't match the ones that will appear later
#Rebuild the model using Statsmodel for easier analysis
X2 = sm.add_constant(X)
# create a OLS model
model = sm.OLS(Y, X2)
# fit the data
est = model.fit()
# calculate the mean squared error
odel_mse = mean_squared_error(y_train, y_predict)
# calculate the mean absolute error
model_mae = mean_absolute_error(y_train, y_predict)
# calulcate the root mean squared error
model_rmse = math.sqrt(model_mse)
# display the output
print("MSE {:.3}".format(model_mse))
print("MAE {:.3}".format(model_mae))
print("RMSE {:.3}".format(model_rmse))
print(est.summary())
#????????? something is wrong
X = df[['miles', 'gasprice']]
y = df['hours']
regr = linear_model.LinearRegression()
regr.fit(X, y)
print(regr.coef_)
So the code ends here. I found different coeffs every time I printed them out. what did I do wrong and is any of them correct?
I see you are trying 3 different things here, so let me summarize:
sklearn.linear_model.LinearRegression() with train_test_split(X, Y, test_size=0.2, random_state=1), so only using 80% of the data (but the split should be the same every time you run it since you fixed the random state)
statsmodels.api.OLS with the full dataset (you're passing X2 and Y, which are not cut up into train-test)
sklearn.linear_model.LinearRegression() with the full dataset, as in n2.
I tried to reproduce with the iris dataset, and I am getting identical results for cases #2 and #3 (which are trained on the same exact data), and only slightly different coefficients for case 1.
In order to evaluate if any of them are "correct", you will need to evaluate the model on unseen data and look at adjusted R^2 score, etc (hence you need the holdout (test) set). If you want to further improve the model you can try to understand better the interactions of the features in the linear model. Statsmodels has a neat "R-like" formula way to specify your model: https://www.statsmodels.org/dev/example_formulas.html
I am trying to fit a regression model to a time series data in Python (basically to predict the trend). I have applied seasonal decomposition using statsmodels earlier which extracts data to its three components including the data trend.
However, I would like to know how I can come up with the best fit to my data using statistical-based regressions (by defining any functions) and check the sum of squares to compare various models and select the best one which fits my data. I should mention that I am not looking for learning-based regressions which rely on training/testing data.
I would appreciate if anyone can help me with this or even introduces a tutorial for this issue.
Since you mentioned:
I would like to know how I can come up with the best fit to my data using statistical-based regressions (by defining any functions) and check the sum of squares to compare various models and select the best one which fits my data. I should mention that I am not looking for learning-based regressions which rely on training/testing data.
Maybe ARIMA (Auto Regressive Integrated Moving Average) model with given setup (P,D,Q), which can learn on history and predict()/forecast(). Please notice that split data into train and test are for sake of evaluation with approach of walk-forward validation:
from pandas import read_csv
from pandas import datetime
from matplotlib import pyplot
from statsmodels.tsa.arima_model import ARIMA
from sklearn.metrics import mean_squared_error
from math import sqrt
# load dataset
def parser(x):
return datetime.strptime('190'+x, '%Y-%m')
series = read_csv('/content/shampoo.txt', header=0, index_col=0, parse_dates=True, squeeze=True, date_parser=parser)
series.index = series.index.to_period('M')
# split into train and test sets
X = series.values
size = int(len(X) * 0.66)
train, test = X[0:size], X[size:len(X)]
history = [x for x in train]
predictions = list()
# walk-forward validation
for t in range(len(test)):
model = ARIMA(history, order=(5,1,0))
model_fit = model.fit()
output = model_fit.forecast()
yhat = output[0]
predictions.append(yhat)
obs = test[t]
history.append(obs)
print('predicted=%f, expected=%f' % (yhat, obs))
# evaluate forecasts
rmse = sqrt(mean_squared_error(test, predictions))
rmse_ = 'Test RMSE: %.3f' % rmse
# plot forecasts against actual outcomes
pyplot.plot(test, label='test')
pyplot.plot(predictions, color='red', label='predict')
pyplot.xlabel('Months')
pyplot.ylabel('Sale')
pyplot.title(f'ARIMA model performance with {rmse_}')
pyplot.legend()
pyplot.show()
I used the same library package you mentioned with following outputs including Root Mean Square Error (RMSE) evaluation:
import statsmodels as sm
sm.__version__ # '0.10.2'
Please see other post1 & post2 for further info. Maybe you can add trend line too
I'm currently using TensorFlow and SkLearn to to try to make a model that can predict the amount of sales for a certain product, X, based on the outdoor temperature in celcius.
I took my datasets for the temperature and set it equal to the x variable, and the amount of sales to as a y variable. As seen on the picture below, there is some sort of correlation between the temperature and the amount of sales:
First and foremost, I tried to do linear regression to see how well it'd fit. This is the code for that:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(x_train, y_train) #fit tries to fit the x variable and y variable.
#Let's try to plot it out.
y_pred = model.predict(x_train)
plt.scatter(x_train,y_train)
plt.plot(x_train,y_pred,'r')
plt.legend(['Predicted Line', 'Observed data'])
plt.show()
This resulted in a predicted line that had a pretty poor fit:
A very nice feature from sklearn however is that you can try to predict an value based on a temperature, so if I were to write
model.predict(15)
i'd get the output
array([6949.05567873])
This is exactly what I want, I just wanted to line to fit better so instead I tried polynoimal regression with sklearn by doing following:
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=8, include_bias=False) #the bias is avoiding the need to intercept
x_new = poly.fit_transform(x_train)
new_model = LinearRegression()
new_model.fit(x_new,y_train)
#plotting
y_prediction = new_model.predict(x_new) #this actually predicts x...?
plt.scatter(x_train,y_train)
plt.plot(x_new[:,0], y_prediction, 'r')
plt.legend(['Predicted line', 'Observed data'])
plt.show()
The line seems to fit better now:
My problem is not that I can't use new_model.predict(x) since it'll result in "ValueError: shapes (1,1) and (8,) not aligned: 1 (dim 1) != 8 (dim 0)". I understand that this is because I'm using a 8-degree polynomium, but is there any way for me to predict the y-axsis based on ONE temperature using the polynomial regression model?
Try using new_model.predict([x**a for a in range(1,9)])
or according to your previously used code, you can do new_model.predict(poly.fit_transform(x))
Since you fit a line
y = ax^1 + bx^2 + ... + h*x^8
you, need to transform your input in the same manner i.e. turn it into a polynomial without the intercept and slope terms. This was what you passed into Linear Regression training function. It learns the slope terms for that polynomial. The plot you've shown only contains the x^1 term you indexed into (x_new[:,0]) which means that the data you're using has more columns.
One last note: always make sure your training data and future/validation data undergo the same preprocessing steps to ensure your model works.
Here's some detail :
Let's start by running your code, on synthetic data.
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from numpy.random import rand
x_train = rand(1000,1)
y_train = rand(1000,1)
poly = PolynomialFeatures(degree=8, include_bias=False) #the bias is avoiding the need to intercept
x_new = poly.fit_transform(x_train)
new_model = LinearRegression()
new_model.fit(x_new,y_train)
#plotting
y_prediction = new_model.predict(x_new) #this predicts y
plt.scatter(x_train,y_train)
plt.plot(x_new[:,0], y_prediction, 'r')
plt.legend(['Predicted line', 'Observed data'])
plt.show()
Now we can predict y value by transforming an x-value into a polynomial of degree 8 without an intercept
print(new_model.predict(poly.fit_transform(0.25)))
[[0.47974408]]
I am reading calibration methods for two days but did not actually make it that how it works. Two types of calibration are there;
Platt scaling - prediction space parted into bins & for each bin mean predicted value is plotted against true fraction of positive cases
Isotonic regression - Mathematically it tries to fit a weighted least-squares via Quadratic Programming, subject to next observation is always non-decreasing with respect to previous observation.
I have written a python module on calibration based on logistic regression (though I know LogisticRegression returns well calibrated predictions by default as it directly optimizes log-loss, I built it to check my understanding)
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss
from pandas import DataFrame
class logistic_Calibration:
def __init__(self, data, response):
self.data = data
self.response = response
def Calibration(self):
Xtrain, Xtest, ytrain, ytest = train_test_split(self.data, self.response, test_size=0.20, random_state=36)
logreg = linear_model.LogisticRegression()
logreg.fit(Xtrain, np.array(ytrain).flatten())
PredWO_calibration = logreg.predict_proba(Xtest)
lossWO_calibration = log_loss(ytest, PredWO_calibration)
clf_sigmoid = CalibratedClassifierCV(logreg, cv=5, method='sigmoid')
clf_sigmoid.fit(Xtrain, np.array(ytrain).flatten())
PredWITH_calibration = clf_sigmoid.predict_proba(Xtest)
lossWITH_calibration = log_loss(ytest, PredWITH_calibration)
Loss_difference_WO_minus_W = lossWO_calibration - lossWITH_calibration
return [lossWO_calibration, lossWITH_calibration, Loss_difference_WO_minus_W]
But still I am unclear on the following parts,
How isotonic regression maps the scores to probabilities?
Platt scaling does not work for real time data as for that we do not have any class assigned, that means Brier score can not be calculated. If after fitting a model, i have predicted probability & actual outcome on training data, how can i use calibration using only these two inputs to assign classes for test data? That is the most important part i would like to know.
Please guide.