I want to compute AIC for linear models to compare their complexity. I did it as follows:
regr = linear_model.LinearRegression()
regr.fit(X, y)
aic_intercept_slope = aic(y, regr.coef_[0] * X.as_matrix() + regr.intercept_, k=1)
def aic(y, y_pred, k):
resid = y - y_pred.ravel()
sse = sum(resid ** 2)
AIC = 2*k - 2*np.log(sse)
return AIC
But I receive a divide by zero encountered in log error.
sklearn's LinearRegression is good for prediction but pretty barebones as you've discovered. (It's often said that sklearn stays away from all things statistical inference.)
statsmodels.regression.linear_model.OLS has a property attribute AIC and a number of other pre-canned attributes.
However, note that you'll need to manually add a unit vector to your X matrix to include an intercept in your model.
from statsmodels.regression.linear_model import OLS
from statsmodels.tools import add_constant
regr = OLS(y, add_constant(X)).fit()
print(regr.aic)
Source is here if you are looking for an alternative way to write manually while still using sklearn.
Related
(I leave my dataset at the bottom line). I'm trying to use Linear Regression on a dataset where predictors are the product ID, weight, type, Outlet_Establishment_Year, etc and target variable is the Item_Outlet_Sales. I use R-squared as the metric. I think the predictors have different units so I'll need to scale them. If I do so:
X = cleaned_data.iloc[:, :-1] # predictors
X = pd.get_dummies(data = X, drop_first = True) # convert categorical variables to numerical variables
Y = cleaned_data.iloc[:, -1] # target
Then I scale the data, perform Linear Regression and calculate R-squared which yield 0.57 as a result:
from sklearn.preprocessing import StandardScaler
concat_data = pd.concat([X, Y], axis = 1)
scaled_data = StandardScaler().fit_transform(concat_data)
X_scaled = scaled_data[:, :-1]
Y_scaled = scaled_data[:, -1]
print(X_scaled.shape, Y_scaled.shape)
from sklearn.linear_model import LinearRegression
LR_scaled_model = LinearRegression()
LR_scaled_model.fit(X_scaled, Y_scaled)
from sklearn.metrics import *
predicted_sales = LR_scaled_model.predict(X_scaled)
print('R-squared:', r2_score(Y_scaled, predicted_sales))
And if I just implement Linear Regression without scaling, the R-squared is 0.67
LR_non_scaling_model = LinearRegression()
LR_non_scaling_model.fit(X, Y)
predicted_sales = LR_non_scaling_model.predict(X)
print('R-squared:', r2_score(Y, predicted_sales))
How would you explain this? And, in linear regression tasks, when should I and when should not I scale my data?
Dataset: https://drive.google.com/file/d/1AeK2aCnKtr0xMHz1B_Vfq4HnIkd2pxW_/view?usp=share_link
It seems like the scaling is also applied to the one-hot-encoded dummy variable which IMO should not happen. If you only scale continuous variables, does that change the behavior?
Generally, scaling only affects the interpretation of the coefficients and not the quality of the model. After standard scaling, a coefficient $\beta_1$ can be interpreted as:
A one standard deviation change in the independent variable is associated with a $\beta_1$ change in the dependent variable
I'm having trouble getting the formula that python use for linear predictions. I did a linear regression using:
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_tr_pre_close,Y_tr_pre_close)
then I made predictions using:
predictions=lm.predict(X_te_pre_close)
I had great results with this model but now the problem is that I can't figure out how the lm.predict() formula works, the model should be ordinary least squares as I read in the documentation
in this case, the predictions formula supposes to be x'b (vector of coefficients * vector of explanatory variables) but it doesn't fit my results.
LinearRegression doesn't store the intercept as one of the coefficients, but as intercept_.
So you can reproduce the predict function like that:
# using sklearn
pred_sklearn = lm.predict(X_te_pre_close)
# using coefficients directly:
pred_coef = X_te_pre_close # lm.coef_.T + lm.intercept_
assert all(pred_coef == pred_sklearn)
I have performed a ridge regression model on a data set
(link to the dataset: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data)
as below:
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
y = train['SalePrice']
X = train.drop("SalePrice", axis = 1)
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.30)
ridge = Ridge(alpha=0.1, normalize=True)
ridge.fit(X_train,y_train)
pred = ridge.predict(X_test)
I calculated the MSE using the metrics library from sklearn as
from sklearn.metrics import mean_squared_error
mean = mean_squared_error(y_test, pred)
rmse = np.sqrt(mean_squared_error(y_test,pred)
I am getting a very large value of MSE = 554084039.54321 and RMSE = 21821.8, I am trying to understand if my implementation is correct.
RMSE implementation
Your RMSE implementation is correct which is easily verifiable when you take the sqaure root of sklearn's mean_squared_error.
I think you are missing a closing parentheses though, here to be exact:
rmse = np.sqrt(mean_squared_error(y_test,pred)) # the last one was missing
High error problem
Your MSE is high due to model not being able to model relationships between your variables and target very well. Bear in mind each error is taken to the power of 2, so being 1000 off in price sky-rockets the value to 1000000.
You may want to modify the price with natural logarithm (numpy.log) and transform it to log-scale, it is a common practice especially for this problem (I assume you are doing House Prices: Advanced Regression Techniques), see available kernels for guidance. With this approach, you will not get such big values.
Last but not least, check Mean Absolute Error in order to see your predictions are not as terrible as they seem.
According to the documentation it is possible to specify different loss functions to SGDClassifier. And as far as I understand log loss is a cross-entropy loss function which theoretically can handle soft labels, i.e. labels given as some probabilities [0,1].
The question is: is it possible to use SGDClassifier with log loss function out the box for classification problems with soft labels? And if not - how this task (linear classification on soft labels) can be solved using scikit-learn?
UPDATE:
The way target is labeled and by the nature of the problem hard labels don't give good results. But it is still a classification problem (not regression) and I wan't to keep probabilistic interpretation of the prediction so regression doesn't work out of the box too. Cross-entropy loss function can handle soft labels in target naturally. It seems that all loss functions for linear classifiers in scikit-learn can only handle hard labels.
So the question is probably:
How to specify my own loss function for SGDClassifier, for example. It seems scikit-learn doesn't stick to the modular approach here and changes need to be done somewhere inside it's sources
I recently had this problem and came up with a nice fix that seems to work.
Basically, transform your targets to log-odds-ratio space using the inverse sigmoid function. Then fit a linear regression. Then, to do inference, take the sigmoid of the predictions from the linear regression model.
So say we have soft targets/labels y ∈ (0, 1) (make sure to clamp the targets to say [1e-8, 1 - 1e-8] to avoid instability issues when we take logs).
We take the inverse sigmoid, then we fit a linear regression (assuming predictor variables are in matrix X):
y = np.clip(y, 1e-8, 1 - 1e-8) # numerical stability
inv_sig_y = np.log(y / (1 - y)) # transform to log-odds-ratio space
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X, inv_sig_y)
Then to make predictions:
def sigmoid(x):
ex = np.exp(x)
return ex / (1 + ex)
preds = sigmoid(lr.predict(X_new))
This seems to work, at least for my use case. My guess is that it's not far off what happens behind the scenes for LogisticRegression anyway.
Bonus: this also seems to work with other regression models in sklearn, e.g. RandomForestRegressor.
According to the docs,
The ‘log’ loss gives logistic regression, a probabilistic classifier.
In general a loss function is of the form Loss( prediction, target ), where prediction is the model's output, and target is the ground-truth value. In the case of logistic regression, prediction is a value on (0,1) (i.e., a "soft label"), while target is 0 or 1 (i.e., a "hard label").
So in answer to your question, it depends on if you are referring to the prediction or target. Generally speaking, the form of the labels ("hard" or "soft") is given by the algorithm chosen for prediction and by the data on hand for target.
If your data has "hard" labels, and you desire a "soft" label output by your model (which can be thresholded to give a "hard" label), then yes, logistic regression is in this category.
If your data has "soft" labels, then you would have to choose a threshold to convert them to "hard" labels before using typical classification methods (i.e., logistic regression). Otherwise, you could use a regression method where the model is fit to predict the "soft" target. In this latter approach, your model could give values outside of (0,1), and this would have to be handled.
for those interested, i've implemented a custom class that behaves like a normal classifier, but takes a any regressor in the cosntructor to perform the transformation suggested by #nlml:
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils.validation import check_array
from scipy.special import softmax
import numpy as np
def _log_odds_ratio_scale(X):
X = np.clip(X, 1e-8, 1 - 1e-8) # numerical stability
X = np.log(X / (1 - X)) # transform to log-odds-ratio space
return X
class FuzzyTargetClassifier(ClassifierMixin, BaseEstimator):
def __init__(self, regressor):
'''
Fits regressor in the log odds ratio space (inverse crossentropy) of target variable.
during transform, rescales back to probability space with softmax function
Parameters
---------
regressor: Sklearn Regressor
base regressor to fit log odds ratio space. Any valid sklearn regressor can be used here.
'''
self.regressor = regressor
return
def fit(self, X, y=None, **kwargs):
#ensure passed y is onehotencoded-like
y = check_array(y, accept_sparse=True, dtype = 'numeric', ensure_min_features=1)
self.regressors_ = [clone(self.regressor) for _ in range(y.shape[1])]
for i in range(y.shape[1]):
self._fit_single_regressor(self.regressors_[i], X, y[:,i], **kwargs)
return self
def _fit_single_regressor(self, regressor, X, ysub, **kwargs):
ysub = _log_odds_ratio_scale(ysub)
regressor.fit(X, ysub, **kwargs)
return regressor
def decision_function(self,X):
all_results = []
for reg in self.regressors_:
results = reg.predict(X)
if results.ndim < 2:
results = results.reshape(-1,1)
all_results.append(results)
results = np.hstack(all_results)
return results
def predict_proba(self, X):
results = self.decision_function(X)
results = softmax(results, axis = 1)
return results
def predict(self, X):
results = self.decision_function(X)
results = results.argmax(1)
return results
I'm doing a simple linear model. I have
fire = load_data()
regr = linear_model.LinearRegression()
scores = cross_validation.cross_val_score(regr, fire.data, fire.target, cv=10, scoring='r2')
print scores
which yields
[ 0.00000000e+00 0.00000000e+00 -8.27299054e+02 -5.80431382e+00
-1.04444147e-01 -1.19367785e+00 -1.24843536e+00 -3.39950443e-01
1.95018287e-02 -9.73940970e-02]
How is this possible? When I do the same thing with the built in diabetes data, it works perfectly fine, but for my data, it returns these seemingly absurd results. Have I done something wrong?
There is no reason r^2 shouldn't be negative (despite the ^2 in its name). This is also stated in the doc. You can see r^2 as the comparison of your model fit (in the context of linear regression, e.g a model of order 1 (affine)) to a model of order 0 (just fitting a constant), both by minimizing a squared loss. The constant minimizing the squared error is the mean. Since you are doing cross validation with left out data, it can happen that the mean of your test set is wildly different from the mean of your training set. This alone can induce a much higher incurred squared error in your prediction versus just predicting the mean of the test data, which results in a negative r^2 score.
In worst case, if your data do not explain your target at all, these scores can become very strongly negative. Try
import numpy as np
rng = np.random.RandomState(42)
X = rng.randn(100, 80)
y = rng.randn(100) # y has nothing to do with X whatsoever
from sklearn.linear_model import LinearRegression
from sklearn.cross_validation import cross_val_score
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2')
This should result in negative r^2 values.
In [23]: scores
Out[23]:
array([-240.17927358, -5.51819556, -14.06815196, -67.87003867,
-64.14367035])
The important question now is whether this is due to the fact that linear models just do not find anything in your data, or to something else that may be fixed in the preprocessing of your data. Have you tried scaling your columns to have mean 0 and variance 1? You can do this using sklearn.preprocessing.StandardScaler. As a matter of fact, you should create a new estimator by concatenating a StandardScaler and the LinearRegression into a pipeline using sklearn.pipeline.Pipeline.
Next you may want to try Ridge regression.
Just because R^2 can be negative does not mean it should be.
Possibility 1: a bug in your code.
A common bug that you should double check is that you are passing in parameters correctly:
r2_score(y_true, y_pred) # Correct!
r2_score(y_pred, y_true) # Incorrect!!!!
Possibility 2: small datasets
If you are getting a negative R^2, you could also check for over fitting. Keep in mind that cross_validation.cross_val_score() does not randomly shuffle your inputs, so if your sample are inadvertently sorted (by date for example) then you might build models on each fold that are not predictive for the other folds.
Try reducing the number of features, increasing the number samples, and decreasing the number of folds (if you are using cross_validation). While there is no official rule here, your m x n dataset (where m is the number of samples and n is the number of features) should be of a shape where
m > n^2
and when you using cross validation with f as the number of folds, you should aim for
m/f > n^2
R² = 1 - RSS / TSS, where RSS is the residual sum of squares ∑(y - f(x))² and TSS is the total sum of squares ∑(y - mean(y))². Now for R² ≥ -1, it is required that RSS/TSS ≤ 2, but it's easy to construct a model and dataset for which this is not true:
>>> x = np.arange(50, dtype=float)
>>> y = x
>>> def f(x): return -100
...
>>> rss = np.sum((y - f(x)) ** 2)
>>> tss = np.sum((y - y.mean()) ** 2)
>>> 1 - rss / tss
-74.430972388955581
If you are getting negative regression r^2 scores, make sure to remove any unique identifier (e.g. "id" or "rownum") from your dataset before fitting/scoring the model. Simple check, but it'll save you some headache time.