In extension of: scikit learn coefficients polynomialfeatures
What is a straightforward way of doing multivariate polynomial regression in Python?
Say we have N samples with 3 features each, and for each sample we have 40 response variables (it could be any number, but it is 40 in my case). We want to fit a function that relates the 3 independent variables to the 40 response variables. For this, we train a polynomial model on N-1 of our samples and estimate the 40 response variables of the one remaining sample. The dimensions of the independent-variable (X) and response-variable (y) training and test data are:
X_train = [(N-1) * 3], y_train = [(N-1) * 40], X_test = [1 * 3], y_test = [1 * 40]
As I would expect, such an approach should yield:
y = intercept + a x1 + b x1^2 + c x2 + d x2^2 + e x3 + f x3^2 + g x1 x2 + h x1 x3 + i x2 x3
Which is a total of 9 coefficients plus one intercept for every response variable to describe the polynomial. If I use the method proposed earlier by David Maust in 2015:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
model = make_pipeline(PolynomialFeatures(degree=2),LinearRegression())
y_poly = model.fit(X_train,y_train)
coefficients = model.steps[1][1].coef_
intercepts = model.steps[1][1].intercept_
coefficients.shape
[Output: (40, 10)]
For every response variable, it appears we end up with 10 coefficients + one intercept, which is one more coefficient than I would expect. Therefore it is unclear to me what these coefficients mean and how to make up the polynomial that describes our response variable. I really hope StackOverflow could help me out! Hopefully I defined my problem well enough.
As you pointed out, there are 9 coefficients and a bias term after the polynomial transformation. However, when you pass this N by 10 matrix to sklearn's LinearRegression, it is interpreted as a 10-dimensional dataset. In addition, by default, sklearn fits the regression with an intercept, so you end up with 10 coefficients and one intercept. I think the first coefficient (the one for the constant bias column) will most likely be 0 though (at least that is what I obtained after testing my answers below with the data from here).
To get your expected behaviour I think you have two options:
disable the bias term in PolynomialFeatures.
model = make_pipeline(PolynomialFeatures(degree=2,include_bias=False), LinearRegression())
tell LinearRegression not to fit an intercept, and instead your first coefficient (coefficient of the bias term) will be the intercept. In this case your intercept is model.steps[1][1].coef_[0].
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression(fit_intercept=False))
I hope this helps! Out of curiosity what is the value you get for model.steps[1][1].coef_[0]?
(My dataset is linked at the bottom.) I'm trying to use Linear Regression on a dataset where the predictors are the product ID, weight, type, Outlet_Establishment_Year, etc., and the target variable is Item_Outlet_Sales. I use R-squared as the metric. I think the predictors have different units, so I'll need to scale them. First I prepare the data:
X = cleaned_data.iloc[:, :-1] # predictors
X = pd.get_dummies(data = X, drop_first = True) # convert categorical variables to numerical variables
Y = cleaned_data.iloc[:, -1] # target
Then I scale the data, perform Linear Regression and calculate R-squared, which yields 0.57:
from sklearn.preprocessing import StandardScaler
concat_data = pd.concat([X, Y], axis = 1)
scaled_data = StandardScaler().fit_transform(concat_data)
X_scaled = scaled_data[:, :-1]
Y_scaled = scaled_data[:, -1]
print(X_scaled.shape, Y_scaled.shape)
from sklearn.linear_model import LinearRegression
LR_scaled_model = LinearRegression()
LR_scaled_model.fit(X_scaled, Y_scaled)
from sklearn.metrics import r2_score
predicted_sales = LR_scaled_model.predict(X_scaled)
print('R-squared:', r2_score(Y_scaled, predicted_sales))
And if I just implement Linear Regression without scaling, the R-squared is 0.67
LR_non_scaling_model = LinearRegression()
LR_non_scaling_model.fit(X, Y)
predicted_sales = LR_non_scaling_model.predict(X)
print('R-squared:', r2_score(Y, predicted_sales))
How would you explain this? And, in linear regression tasks, when should I scale my data and when should I not?
Dataset: https://drive.google.com/file/d/1AeK2aCnKtr0xMHz1B_Vfq4HnIkd2pxW_/view?usp=share_link
It seems like the scaling is also applied to the one-hot-encoded dummy variable which IMO should not happen. If you only scale continuous variables, does that change the behavior?
Generally, scaling only affects the interpretation of the coefficients and not the quality of the model. After standard scaling, a coefficient $\beta_1$ can be interpreted as:
A one standard deviation change in the independent variable is associated with a $\beta_1$ change in the dependent variable
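The claim that scaling does not change the quality of an ordinary least squares fit can be checked on a small example. This is a sketch on synthetic data (the feature scales and true coefficients are made up for illustration), and it assumes a full-rank design matrix; it does not reproduce the asker's dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): three features on very different scales.
rng = np.random.RandomState(0)
X = rng.randn(200, 3) * np.array([1.0, 50.0, 0.01])
y = X @ np.array([2.0, 0.1, 30.0]) + rng.randn(200)

# R-squared on the raw data.
r2_raw = r2_score(y, LinearRegression().fit(X, y).predict(X))

# R-squared after standardizing both the features and the target.
X_s = StandardScaler().fit_transform(X)
y_s = StandardScaler().fit_transform(y.reshape(-1, 1)).ravel()
r2_scaled = r2_score(y_s, LinearRegression().fit(X_s, y_s).predict(X_s))

print(r2_raw, r2_scaled)  # equal up to floating-point error
```

If the two numbers differ noticeably on real data, as in the question, the cause is probably something other than scaling as such, for example a rank-deficient or badly conditioned set of dummy columns. That is a guess, not something the question's output confirms.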
I need to fit a logistic regression with sklearn, but with no x vector: just the model with an intercept. How can it be done? I cannot find any working solution.
Thanks
Edit: I want to find an alternative in sklearn to R's regression y ~ 1.
I did not find a way to run a logit on the intercept only, so I created one constant column and ran the model without the intercept.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
### Create the data
a = np.array([1] * 20 + [0] * 180)
df = pd.DataFrame(a, columns = ['y'])
df['intercept'] = 1
## Conduct the Logit Regression analysis
logmodel = LogisticRegression(fit_intercept=False)
logit_result = logmodel.fit(df.loc[:, ~df.columns.isin(['y'])],df['y'])
#### Print the coefficient
print(logit_result.intercept_)
print(logit_result.coef_)
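As a sanity check on this workaround: for an intercept-only logistic model, the maximum-likelihood intercept is the log-odds of the positive class, log(p / (1 - p)). The sketch below assumes that a very large C (chosen here as 1e6) makes sklearn's default L2 penalty negligible, so the fitted coefficient of the constant column should land close to that value.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Same data as above: 20 positives, 180 negatives.
y = np.array([1] * 20 + [0] * 180)
X = np.ones((len(y), 1))  # a single constant column acts as the intercept

# Large C (assumption) so the default L2 penalty barely shrinks the coefficient.
model = LogisticRegression(fit_intercept=False, C=1e6).fit(X, y)
fitted = model.coef_[0, 0]

# Closed-form answer for an intercept-only model: the log-odds of y == 1.
p = y.mean()
expected = np.log(p / (1 - p))

print(fitted, expected)
print(model.predict_proba(X[:1])[0, 1])  # predicted probability of class 1
```

With the default C=1.0 the constant column's coefficient is penalized like any other, so it will be pulled somewhat toward zero compared with R's y ~ 1.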
I'm currently using TensorFlow and sklearn to try to build a model that can predict the sales of a certain product, X, based on the outdoor temperature in Celsius.
I used the temperature dataset as the x variable and the sales as the y variable. As seen in the picture below, there is some sort of correlation between the temperature and the sales:
First and foremost, I tried to do linear regression to see how well it'd fit. This is the code for that:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(x_train, y_train) #fit tries to fit the x variable and y variable.
#Let's try to plot it out.
y_pred = model.predict(x_train)
plt.scatter(x_train,y_train)
plt.plot(x_train,y_pred,'r')
plt.legend(['Predicted Line', 'Observed data'])
plt.show()
This resulted in a predicted line that had a pretty poor fit:
A very nice feature of sklearn, however, is that you can predict a value based on a temperature, so if I were to write
model.predict(15)
I'd get the output
array([6949.05567873])
This is exactly what I want. I just wanted the line to fit better, so I tried polynomial regression with sklearn instead by doing the following:
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=8, include_bias=False) # no bias column; LinearRegression fits the intercept itself
x_new = poly.fit_transform(x_train)
new_model = LinearRegression()
new_model.fit(x_new,y_train)
#plotting
y_prediction = new_model.predict(x_new) #this actually predicts x...?
plt.scatter(x_train,y_train)
plt.plot(x_new[:,0], y_prediction, 'r')
plt.legend(['Predicted line', 'Observed data'])
plt.show()
The line seems to fit better now:
My problem now is that I can't use new_model.predict(x), since it results in "ValueError: shapes (1,1) and (8,) not aligned: 1 (dim 1) != 8 (dim 0)". I understand that this is because I'm using an 8th-degree polynomial, but is there any way for me to predict the y-value for ONE temperature using the polynomial regression model?
Try using new_model.predict([[x**a for a in range(1,9)]])
or, following the code you already have, new_model.predict(poly.transform([[x]])). Note that predict expects a 2D array of shape (n_samples, n_features), and that poly is already fitted, so transform is enough.
Since you fit a polynomial
y = intercept + a*x^1 + b*x^2 + ... + h*x^8
you need to transform your input in the same manner, i.e. expand it into the powers x^1 to x^8 without the bias column. This is what you passed into the LinearRegression training function, which learned one coefficient per power. The plot you've shown only uses the x^1 column you indexed into (x_new[:,0]), which shows that the data you're fitting on has more columns.
One last note: always make sure your training data and future/validation data undergo the same preprocessing steps to ensure your model works.
Here's some more detail:
Let's start by running your code, on synthetic data.
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from numpy.random import rand
x_train = rand(1000,1)
y_train = rand(1000,1)
poly = PolynomialFeatures(degree=8, include_bias=False) # no bias column; LinearRegression fits the intercept itself
x_new = poly.fit_transform(x_train)
new_model = LinearRegression()
new_model.fit(x_new,y_train)
#plotting
y_prediction = new_model.predict(x_new) #this predicts y
plt.scatter(x_train,y_train)
plt.plot(x_new[:,0], y_prediction, 'r')
plt.legend(['Predicted line', 'Observed data'])
plt.show()
Now we can predict y value by transforming an x-value into a polynomial of degree 8 without an intercept
print(new_model.predict(poly.transform([[0.25]])))
[[0.47974408]]
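One way to avoid transforming by hand at prediction time is to bundle both steps into a pipeline, so that predict applies the same preprocessing automatically. This is a minimal sketch on made-up data: the quadratic relationship and degree=2 are chosen for illustration, not taken from the question's temperature data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Made-up, noise-free data (assumption): y = 2 + 3x - 0.5x^2.
x_train = np.linspace(0, 10, 50).reshape(-1, 1)
y_train = 2 + 3 * x_train[:, 0] - 0.5 * x_train[:, 0] ** 2

# The pipeline expands x into polynomial terms, then fits the linear model.
model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
model.fit(x_train, y_train)

# Predict for ONE raw x value; it must still be 2D: one sample, one feature.
pred = model.predict([[4.0]])
print(pred)  # close to 2 + 3*4 - 0.5*16 = 6
```

Because the expansion happens inside the pipeline, training data and new inputs always go through identical preprocessing.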
I have a binary prediction model trained with logistic regression. I want to know which features (predictors) are more important for deciding between the positive and the negative class. I know scikit-learn exposes a coef_ attribute, but I don't know whether it is enough to measure importance. Another question is how I can evaluate the coef_ values in terms of importance for the negative and positive classes. I have also read about standardized regression coefficients, and I don't know what they are.
Let's say there are features like size of tumor, weight of tumor, etc., used to decide whether a test case is malignant or not malignant. I want to know which of the features are more important for the malignant and not malignant predictions. Does that make sense?
One of the simplest options to get a feeling for the "influence" of a given parameter in a linear classification model (logistic being one of those), is to consider the magnitude of its coefficient times the standard deviation of the corresponding parameter in the data.
Consider this example:
import numpy as np
from sklearn.linear_model import LogisticRegression
x1 = np.random.randn(100)
x2 = 4*np.random.randn(100)
x3 = 0.5*np.random.randn(100)
y = (3 + x1 + x2 + x3 + 0.2*np.random.randn(100)) > 0  # per-sample noise
X = np.column_stack([x1, x2, x3])
m = LogisticRegression()
m.fit(X, y)
# The estimated coefficients will all be around 1:
print(m.coef_)
# Those values, however, will show that the second parameter
# is more influential
print(np.std(X, 0)*m.coef_)
An alternative way to get a similar result is to examine the coefficients of the model fit on standardized parameters:
m.fit(X / np.std(X, 0), y)
print(m.coef_)
Note that this is the most basic approach and a number of other techniques for finding feature importance or parameter influence exist (using p-values, bootstrap scores, various "discriminative indices", etc).
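The standardized-coefficient idea can also be written as a pipeline, which keeps the scaling and the model together. This is a sketch under assumptions: the data is synthetic with the same feature scales as in the example above, the sample size and seed are arbitrary, and ranking by absolute standardized coefficient is only the basic heuristic described here, not a definitive importance measure.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): equal true coefficients, different feature spreads.
rng = np.random.RandomState(0)
n = 2000
X = np.column_stack([rng.randn(n), 4 * rng.randn(n), 0.5 * rng.randn(n)])
y = (3 + X.sum(axis=1) + rng.randn(n)) > 0

# Standardize the features, then fit the classifier on the scaled data.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, y)

# Coefficients are now per standard deviation of each feature.
std_coefs = model.named_steps["logisticregression"].coef_[0]
ranking = np.argsort(-np.abs(std_coefs))  # most influential feature first

print(std_coefs)
print(ranking)
```

A positive standardized coefficient pushes predictions toward the positive class and a negative one toward the negative class; the magnitude is what the ranking uses.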
I am pretty sure you would get more interesting answers at https://stats.stackexchange.com/.
I'm doing a simple linear model. I have
from sklearn import linear_model
from sklearn.model_selection import cross_val_score
fire = load_data()
regr = linear_model.LinearRegression()
scores = cross_val_score(regr, fire.data, fire.target, cv=10, scoring='r2')
print(scores)
which yields
[ 0.00000000e+00 0.00000000e+00 -8.27299054e+02 -5.80431382e+00
-1.04444147e-01 -1.19367785e+00 -1.24843536e+00 -3.39950443e-01
1.95018287e-02 -9.73940970e-02]
How is this possible? When I do the same thing with the built in diabetes data, it works perfectly fine, but for my data, it returns these seemingly absurd results. Have I done something wrong?
There is no reason r^2 shouldn't be negative (despite the ^2 in its name). This is also stated in the doc. You can see r^2 as the comparison of your model fit (in the context of linear regression, e.g a model of order 1 (affine)) to a model of order 0 (just fitting a constant), both by minimizing a squared loss. The constant minimizing the squared error is the mean. Since you are doing cross validation with left out data, it can happen that the mean of your test set is wildly different from the mean of your training set. This alone can induce a much higher incurred squared error in your prediction versus just predicting the mean of the test data, which results in a negative r^2 score.
In the worst case, if your data do not explain your target at all, these scores can become very strongly negative. Try
import numpy as np
rng = np.random.RandomState(42)
X = rng.randn(100, 80)
y = rng.randn(100) # y has nothing to do with X whatsoever
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2')
This should result in negative r^2 values.
In [23]: scores
Out[23]:
array([-240.17927358, -5.51819556, -14.06815196, -67.87003867,
-64.14367035])
The important question now is whether this is due to the fact that linear models just do not find anything in your data, or to something else that may be fixed in the preprocessing of your data. Have you tried scaling your columns to have mean 0 and variance 1? You can do this using sklearn.preprocessing.StandardScaler. As a matter of fact, you should create a new estimator by concatenating a StandardScaler and the LinearRegression into a pipeline using sklearn.pipeline.Pipeline.
Next you may want to try Ridge regression.
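The two suggestions above (a StandardScaler plus LinearRegression pipeline, then Ridge) can be sketched as follows. It reuses the random data from the example above, where y is unrelated to X; the ridge strength alpha=100.0 is an arbitrary value picked for illustration, not a recommendation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same kind of data as above: y has nothing to do with X.
rng = np.random.RandomState(42)
X = rng.randn(100, 80)
y = rng.randn(100)

# The scaler is refit on each training fold, so no test-fold statistics leak in.
ols = make_pipeline(StandardScaler(), LinearRegression())
ridge = make_pipeline(StandardScaler(), Ridge(alpha=100.0))  # alpha is an assumption

ols_scores = cross_val_score(ols, X, y, cv=5, scoring="r2")
ridge_scores = cross_val_score(ridge, X, y, cv=5, scoring="r2")

print(ols_scores.mean(), ridge_scores.mean())
```

On pure noise neither model can be good, but the regularized model's predictions stay closer to the mean, so its scores should be far less extreme than those of the unregularized fit.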
Just because R^2 can be negative does not mean it should be.
Possibility 1: a bug in your code.
A common bug that you should double check is that you are passing in parameters correctly:
r2_score(y_true, y_pred) # Correct!
r2_score(y_pred, y_true) # Incorrect!!!!
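The argument order matters because the total sum of squares is computed from the first argument only. A tiny hand-checkable example (the numbers are made up for illustration):

```python
from sklearn.metrics import r2_score

y_true = [1, 2, 3, 4]
y_pred = [2, 2, 3, 3]

# Residual sum of squares is 2 in both cases; only the denominator changes.
correct = r2_score(y_true, y_pred)  # TSS of y_true is 5, so 1 - 2/5 = 0.6
swapped = r2_score(y_pred, y_true)  # TSS of y_pred is 1, so 1 - 2/1 = -1.0

print(correct, swapped)
```

Swapping the arguments turns a reasonable score into a negative one here, without any change to the model.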
Possibility 2: small datasets
If you are getting a negative R^2, you could also check for overfitting. Keep in mind that cross_val_score() does not randomly shuffle your inputs, so if your samples are inadvertently sorted (by date, for example), you might build models on each fold that are not predictive for the other folds.
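The effect of sorted inputs can be reproduced with a small sketch. The data here is made up (a curved target sorted by x, fitted with a straight line), and passing a KFold with shuffle=True is one way to shuffle; the random_state value is arbitrary.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Made-up data (assumption), sorted by x, with a curved relationship.
x = np.linspace(0, 10, 200)
X = x.reshape(-1, 1)
y = x ** 2

model = LinearRegression()

# Contiguous folds: each test fold covers an x-range the model never saw.
unshuffled = cross_val_score(model, X, y, cv=KFold(n_splits=5), scoring="r2")

# Shuffled folds: every fold samples the whole x-range.
shuffled = cross_val_score(
    model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0), scoring="r2"
)

print(unshuffled)
print(shuffled)
```

Without shuffling, the model has to extrapolate into each held-out block and at least some folds score below zero; with shuffling, the same model scores consistently well.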
Try reducing the number of features, increasing the number of samples, and decreasing the number of folds (if you are using cross-validation). While there is no official rule here, your m x n dataset (where m is the number of samples and n is the number of features) should be of a shape where
m > n^2
and when you are using cross-validation with f as the number of folds, you should aim for
m/f > n^2
R² = 1 - RSS / TSS, where RSS is the residual sum of squares ∑(y - f(x))² and TSS is the total sum of squares ∑(y - mean(y))². Now for R² ≥ -1, it is required that RSS/TSS ≤ 2, but it's easy to construct a model and dataset for which this is not true:
>>> x = np.arange(50, dtype=float)
>>> y = x
>>> def f(x): return -100
...
>>> rss = np.sum((y - f(x)) ** 2)
>>> tss = np.sum((y - y.mean()) ** 2)
>>> 1 - rss / tss
-74.430972388955581
If you are getting negative regression r^2 scores, make sure to remove any unique identifier (e.g. "id" or "rownum") from your dataset before fitting/scoring the model. Simple check, but it'll save you some headache time.