I've been able to calculate the coefficients of a linear regression. But is there a way to get the associated errors of the coefficients? My code is shown below.
import numpy as np
x = np.array([4, 12, 56, 58.6, 67, 89])
y = np.array([5, 6, 7, 16, 18, 19])
degrees = [0, 1]  # list of degrees of x to use
matrix = np.stack([x**d for d in degrees], axis=-1)
coeff = np.linalg.lstsq(matrix, y, rcond=None)[0]  # coeff[0] is the intercept a, coeff[1] the slope b
print("Coefficients", coeff)
fit = np.dot(matrix, coeff)
print("Linear regression", fit)
p1 = np.polyfit(x, y, 1)  # same fit, returned highest degree first: [b, a]
Output:
Coefficients for y=a +bx [3.70720668 0.17012128]
Linear fit [ 4.38769182 5.74866209 13.23399857 13.67631391 15.10533269 18.84800093]
Errors are not shown! How to calculate the errors?
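You can generate the "predicted" values for y, let's call them y_pred, and compare them to y to get the errors (the residuals of the fit).
predicted_line = np.poly1d(coeff[::-1])  # poly1d expects the highest-degree coefficient first, so reverse [a, b]
y_pred = predicted_line(x)
errors = y - y_pred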
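If "errors" is meant as the uncertainty of the fitted coefficients rather than the residuals, one common approach is to estimate the coefficient covariance as s² (XᵀX)⁻¹. The following is a minimal sketch, assuming ordinary least squares with independent, constant-variance noise; the names design, sigma2, cov and std_err are introduced here for illustration only.

```python
import numpy as np

x = np.array([4, 12, 56, 58.6, 67, 89])
y = np.array([5, 6, 7, 16, 18, 19])

# Design matrix with columns [1, x], matching degrees = [0, 1] in the question
design = np.stack([x**d for d in (0, 1)], axis=-1)
coeff, *_ = np.linalg.lstsq(design, y, rcond=None)

residuals = y - design @ coeff
dof = len(y) - design.shape[1]            # observations minus fitted parameters
sigma2 = residuals @ residuals / dof      # estimate of the noise variance
cov = sigma2 * np.linalg.inv(design.T @ design)
std_err = np.sqrt(np.diag(cov))           # standard errors of [a, b]

print("Coefficients:   ", coeff)
print("Standard errors:", std_err)
```

The diagonal of cov gives the variance of each coefficient; the off-diagonal entries describe how the intercept and slope estimates co-vary.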
Although I like the solution of David Moseler, if you want an error to evaluate the goodness of your regression, you could use the R2 score (which uses the squared error), already implemented in sklearn:
from sklearn.linear_model import LinearRegression
import numpy as np
x = np.array([4, 12, 56, 58.6,67, 89]).reshape(-1, 1)
y = np.array([5, 6, 7, 16,18, 19])
reg = LinearRegression().fit(x, y)
reg.score(x, y) # R2 score
# 0.7481301984276703
If the R2 value is near 1, the model explains most of the variance in y and fits the data well.
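To make the link between R2 and the squared error explicit, here is a small sketch that computes the same score by hand as 1 - SS_res / SS_tot. It uses np.polyfit for the fit instead of sklearn; the variable names are illustrative.

```python
import numpy as np

x = np.array([4, 12, 56, 58.6, 67, 89])
y = np.array([5, 6, 7, 16, 18, 19])

slope, intercept = np.polyfit(x, y, 1)   # highest degree first
y_pred = intercept + slope * x

ss_res = np.sum((y - y_pred) ** 2)       # squared error left after the fit
ss_tot = np.sum((y - y.mean()) ** 2)     # squared error of a constant (mean) model
r2 = 1 - ss_res / ss_tot
print(r2)
```

The value should agree with reg.score(x, y) above, since both describe the same least-squares line.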
The deviance of Y and its expected value E(Y), estimated by the model constructed in c), measures the goodness of fit of the model. The lower the deviance, the better is the model. Below is the equation of how it should be calculated.
$$D = 2\sum_{i=1}^{n}\left\{ Y \log\left[\frac{Y}{\mathbb{E}(Y)}\right] - \left[Y - \mathbb{E}(Y)\right] \right\}$$
If Y = 0, the expression log[Y/exp(E(Y))] will be taken as zero. Employ your own Python program to compute D without using the score() function of the scikit-learn package.
How do I go about answering this question? Please help!
What you have is the deviance for a model fitted assuming a Poisson distribution; you can check Wikipedia for how this definition is derived. Using the example from the PoissonRegressor documentation:
from sklearn import linear_model
import numpy as np
clf = linear_model.PoissonRegressor()
X = [[1, 2], [2, 3], [3, 4], [4, 3]]
y = [12, 17, 22, 21]
clf.fit(X, y)
The deviance is:
def calculate_dev(y_true, y_pred):
    return (2*(y_true * np.log(y_true/y_pred) - (y_true-y_pred))).sum()
pred = clf.predict(X)  # fitted means E(Y) from the model
D = calculate_dev(y, pred)
D
0.03453083031027196
Compare with the score() function, where it is defined as 1 - dev(model)/ dev(null), as in the documentation:
clf.score(X, y)
0.99048551488916
nullD = calculate_dev(y,np.mean(y))
1 - D / nullD
0.99048551488916
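The question also asks for the log term to be taken as zero when Y = 0, which the one-line function above does not handle (0 * log(0) evaluates to nan). A possible sketch of that convention, using a mask so the logarithm is only evaluated for positive observations; poisson_deviance is a hypothetical helper name, not part of scikit-learn.

```python
import numpy as np

def poisson_deviance(y_true, y_pred):
    """D = 2 * sum(y * log(y / mu) - (y - mu)), with y * log(y / mu) taken as 0 when y == 0."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.broadcast_to(np.asarray(y_pred, dtype=float), y_true.shape)
    log_term = np.zeros_like(y_true)
    positive = y_true > 0                      # only these entries need the logarithm
    log_term[positive] = y_true[positive] * np.log(y_true[positive] / y_pred[positive])
    return 2.0 * np.sum(log_term - (y_true - y_pred))

# Small made-up example containing a zero count
y_obs = [0, 2, 3]
mu = [1.0, 2.0, 2.0]
print(poisson_deviance(y_obs, mu))
```

Passing a scalar such as np.mean(y) as y_pred also works, which gives the null deviance used in the comparison with score().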
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
data=pd.DataFrame(
{"input":
[0.001,0.015,0.066,0.151,0.266,0.402,0.45,0.499,0.598,0.646,0.738,0.782,0.86,0.894,0.924,0.95],
"output":[0.5263157894736842,0.5789473684210524,0.6315789473684206,0.6842105263157897,
0.6315789473684206, 0.7894736842105263, 0.8421052631578945, 0.7894736842105263, 0.736842105263158,
0.6842105263157897, 0.736842105263158, 0.736842105263158,0.6842105263157897, 0.6842105263157897,
0.6315789473684206,0.5789473684210524]})
I have the above data, which includes input and output values, and I want to fit a curve that properly describes it. First, here is a plot of the input and output values:
I have made this code:
X=data.iloc[:,0].to_numpy()
X=X.reshape(-1,1)
y=data.iloc[:,1].to_numpy()
y=y.reshape(-1,1)
poly=PolynomialFeatures(degree=2)
poly.fit(X,y)
X_poly=poly.transform(X)
reg=LinearRegression().fit(X_poly,y)
plt.scatter(X,y,color="blue")
plt.plot(X,reg.predict(X_poly),color="orange",label="Polynomial Linear Regression")
plt.xlabel("Temperature")
plt.ylabel("Pressure")
plt.legend(loc="upper left")
The plot is:
But I cannot find the equation of the above curve (the orange one). How can I find it?
Your plot actually corresponds to your code run with
poly=PolynomialFeatures(degree=7)
and not to degree=2. Indeed, running your code with the above change, we get:
Now, your polynomial features are:
poly.get_feature_names_out()  # get_feature_names() in scikit-learn versions before 1.0
# array(['1', 'x0', 'x0^2', 'x0^3', 'x0^4', 'x0^5', 'x0^6', 'x0^7'], dtype=object)
and the respective coefficients of your linear regression are:
reg.coef_
# array([[ 0. , 5.43894411, -68.14277256, 364.28508827,
# -941.70924401, 1254.89358662, -831.27091422, 216.43304954]])
plus the intercept:
reg.intercept_
# array([0.51228593])
Given the above, and setting
coef = reg.coef_[0]
since here we have a single feature in the initial data, your regression equation is:
y = reg.intercept_ + coef[0] + coef[1]*x + coef[2]*x**2 + coef[3]*x**3 + coef[4]*x**4 + coef[5]*x**5 + coef[6]*x**6 + coef[7]*x**7
For visual verification, we can plot the above function with some x data in [0, 1]
x = np.linspace(0, 1, 15)
Running the above expression for y and
plt.plot(x, y)
gives:
Using some randomly generated data x, we can verify that the results of the equation y_eq are indeed equal to the results produced by the regression model y_reg within the limits of numerical precision:
x = np.random.rand(1,10)
y_eq = reg.intercept_ + coef[0] + coef[1]*x + coef[2]*x**2 + coef[3]*x**3 + coef[4]*x**4 + coef[5]*x**5 + coef[6]*x**6 + coef[7]*x**7
y_reg = np.concatenate(reg.predict(poly.transform(x.reshape(-1,1))))
y_eq
# array([[0.72452703, 0.64106819, 0.67394222, 0.71756648, 0.71102853,
# 0.63582055, 0.54243177, 0.71104983, 0.71287962, 0.6311952 ]])
y_reg
# array([0.72452703, 0.64106819, 0.67394222, 0.71756648, 0.71102853,
# 0.63582055, 0.54243177, 0.71104983, 0.71287962, 0.6311952 ])
np.allclose(y_reg, y_eq)
# True
Unrelated to the question: I guess you already know that fitting such a high-order polynomial to so few data points is not a good idea, and you should probably stick to a low degree such as 2 or 3.
Not sure how you produced the plot shown in the question. When I ran your code I got the following (degree=2) polynomial fitted to the data as expected:
Now that you have fitted the data you can see the coefficients of the model thus:
print(reg.coef_)
print(reg.intercept_)
# [[ 0. 0.85962436 -0.83796885]]
# [0.5523586]
Note that the data that was used to fit this model is equivalent to the following:
X_poly = np.concatenate([np.ones((16,1)), X, X**2], axis=1)
Therefore a single data point is a vector created as follows:
temp = 0.5
x = np.array([1, temp, temp**2]).reshape((1,3))
Your polynomial model is simply a linear model of the polynomial features:
y = A.x + B
or
y = reg.coef_.dot(x.T) + reg.intercept_
print(y) # [[0.77267856]]
Verification:
print(reg.predict(x)) # array([[0.77267856]])
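As a compact way to get the curve's equation as a reusable object, the intercept and coefficients can be packed into a numpy.polynomial.Polynomial, which stores coefficients in increasing degree. This is only a sketch for the single-feature case; it uses a 1-D y (so coef_ is 1-D) and the name equation is illustrative.

```python
import numpy as np
from numpy.polynomial import Polynomial
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

X = np.array([0.001, 0.015, 0.066, 0.151, 0.266, 0.402, 0.45, 0.499,
              0.598, 0.646, 0.738, 0.782, 0.86, 0.894, 0.924, 0.95]).reshape(-1, 1)
y = np.array([0.5263157894736842, 0.5789473684210524, 0.6315789473684206, 0.6842105263157897,
              0.6315789473684206, 0.7894736842105263, 0.8421052631578945, 0.7894736842105263,
              0.736842105263158, 0.6842105263157897, 0.736842105263158, 0.736842105263158,
              0.6842105263157897, 0.6842105263157897, 0.6315789473684206, 0.5789473684210524])

poly = PolynomialFeatures(degree=2)
reg = LinearRegression().fit(poly.fit_transform(X), y)

# coef_[0] belongs to the constant column '1'; the fitted constant lives in intercept_
coefs = reg.coef_.copy()
coefs[0] += reg.intercept_
equation = Polynomial(coefs)          # c0 + c1*x + c2*x**2
print(equation)

# The polynomial should reproduce the model's predictions on any grid of x values
grid = np.linspace(0, 1, 5)
print(np.allclose(equation(grid), reg.predict(poly.transform(grid.reshape(-1, 1)))))
```

Changing degree in PolynomialFeatures changes the length of coefs, but the construction stays the same.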
Is it possible to inverse transform the intercept and coefficients in LASSO regression, after fitting the model on scaled data using Robust Scaler?
I'm using LASSO regression to predict values on data that is not normalized and doesn't perform well with LASSO unless it's scaled beforehand. After scaling the data and fitting the LASSO model, I ideally want to be able to see what the model intercept and coefficients are but in the original units (not the scaled versions). I asked a similar question here and it doesn't appear this is possible. If not, why? Can someone explain this to me? I'm trying to broaden my understanding of how LASSO and Robust Scaler work.
Below is the code I was using. Here I was trying to inverse transform the coefficients using transformer_x and the intercept using transformer_y. However, it sounds like this is incorrect.
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler
from sklearn.linear_model import Lasso
df = pd.DataFrame({'Y':[5, -10, 10, .5, 2.5, 15], 'X1':[1., -2., 2., .1, .5, 3], 'X2':[1, 1, 2, 1, 1, 1],
'X3':[6, 6, 6, 5, 6, 4], 'X4':[6, 5, 4, 3, 2, 1]})
X = df[['X1','X2', 'X3' ,'X4']]
y = df[['Y']]
#Scaling
transformer_x = RobustScaler().fit(X)
transformer_y = RobustScaler().fit(y)
X_scal = transformer_x.transform(X)
y_scal = transformer_y.transform(y)
#LASSO
lasso = Lasso()
lasso = lasso.fit(X_scal, y_scal)
def pred_val(X1,X2,X3,X4):
    print('X1 entered: ', X1)
    #Scale X value that user entered - by hand
    med_X = X.median()
    Q1_X = X.quantile(0.25)
    Q3_X = X.quantile(0.75)
    IQR_X = Q3_X - Q1_X
    X_scaled = (X1 - med_X)/IQR_X
    print('X1 scaled by hand: ', X_scaled[0].round(2))
    #Scale X value that user entered - by function
    X_scaled2 = transformer_x.transform(np.array([[X1,X2]]))
    print('X1 scaled by function: ', X_scaled2[0][0].round(2))
    #Intercept by hand
    med_y = y.median()
    Q1_y = y.quantile(0.25)
    Q3_y = y.quantile(0.75)
    IQR_y = Q3_y - Q1_y
    inv_int = med_y + IQR_y*lasso.intercept_[0]
    #Intercept by function
    inv_int2 = transformer_y.inverse_transform(lasso.intercept_.reshape(-1, 1))[0][0]
    #Coefficient by hand
    inv_coef = lasso.coef_[0]*IQR_y
    #Coefficient by function
    inv_coef2 = transformer_x.inverse_transform(reg.coef_.reshape(1,-1))[0]
    #Prediction by hand
    preds = inv_int + inv_coef*X_scaled[0]
    #Prediction by function
    preds_inner = lasso.predict(X_scaled2)
    preds_f = transformer_y.inverse_transform(preds_inner.reshape(-1, 1))[0][0]
    print('\nIntercept by hand: ', inv_int[0].round(2))
    print('Intercept by function: ', inv_int2.round(2))
    print('\nCoefficients by hand: ', inv_coef[0].round(2))
    print('Coefficients by function: ', inv_coef2[0].round(2))
    print('\nYour predicted value by hand is: ', preds[0].round(2))
    print('Your predicted value by function is: ', preds_f.round(2))
    print('Perfect Prediction would be 80')
pred_val(10,1,1,1)
Update: I've updated my code to show the type of prediction function I'm trying to create. I'm just trying to create a function that does exactly what .predict does, but also shows the intercept and coefficients in their unscaled units.
Current output:
Out[1]:
X1 entered: 10
X1 scaled by hand: 5.97
X1 scaled by function: 5.97
Intercept by hand: 34.19
Intercept by function: 34.19
Coefficients by hand: 7.6
Coefficients by function: 8.5
Your predicted value by hand is: 79.54
Your predicted value by function is: 79.54
Perfect Prediction would be 80
Ideal output:
Out[1]:
X1 entered: 10
X1 scaled by hand: 5.97
X1 scaled by function: 5.97
Intercept by hand: 34.19
Intercept by function: 34.19
Coefficients by hand: 7.6
Coefficients by function: 7.6
Your predicted value by hand is: 79.54
Your predicted value by function is: 79.54
Perfect Prediction would be 80
Based on the linked SO thread, all you want to do is to get the unscaled prediction value. Is that right?
If yes, then all you need to do is:
# Scale the test dataset
X_test_scaled = transformer_x.transform(X_test)
# Predict with the trained model
prediction = lasso.predict(X_test_scaled)
# Inverse transform the prediction
prediction_in_dollars = transformer_y.inverse_transform(prediction.reshape(-1, 1))  # inverse_transform expects a 2D array
UPDATE:
Suppose the train data contain just a single feature named X. Here is what the RobustScaler will do:
X_scaled = (X - median(X))/IQR(X)
y_scaled = (y - median(y))/IQR(y)
Then, the lasso regression will give a prediction like this:
a * X_scaled + b = y_scaled
You have to work out the equations to see what the model coefficients are on the unscaled data:
# Substitute X_scaled and y_scaled from the first two equations.
# Here median(X), IQR(X), median(y) and IQR(y) are plain numbers you already know from the training phase.
a * (X - median(X))/IQR(X) + b = (y - median(y))/IQR(y)
If you try to make a a_new * x + b_new = y-like equation out of this, you end up with:
a_new = (a * (X - median(X)) / (X * IQR(X))) * IQR(y)
b_new = b * IQR(y) + median(y)
a_new * X + b_new = y
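Written this way, the unscaled coefficient (a_new) depends on X, because the median(X) term stays inside it. If you instead expand the product and move the constant term -a * median(X) * IQR(y) / IQR(X) into the intercept, you get a_new = a * IQR(y) / IQR(X) and b_new = median(y) + IQR(y) * (b - a * median(X) / IQR(X)), neither of which depends on X. Either way, you can use the unscaled X to make predictions directly, while the scaling is applied implicitly through the coefficients.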
UPDATE 2
I've adapted your code and it now shows how you can get the coefficients in the original scale. The script is just the implementation of the formulas I'm showing above.
import pandas as pd
import numpy as np
from sklearn.preprocessing import RobustScaler
from sklearn.linear_model import Lasso
df = pd.DataFrame({'Y':[5, -10, 10, .5, 2.5, 15], 'X1':[1., -2., 2., .1, .5, 3], 'X2':[1, 1, 2, 1, 1, 1],
'X3':[6, 6, 6, 5, 6, 4], 'X4':[6, 5, 4, 3, 2, 1]})
X = df[['X1','X2','X3','X4']]
y = df[['Y']]
#Scaling
transformer_x = RobustScaler().fit(X)
transformer_y = RobustScaler().fit(y)
X_scal = transformer_x.transform(X)
y_scal = transformer_y.transform(y)
#LASSO
lasso = Lasso()
lasso = lasso.fit(X_scal, y_scal)
def pred_val(X_test):
    print('X entered: ',)
    print (X_test.values[0])
    #Scale X value that user entered - by hand
    med_X = X.median()
    Q1_X = X.quantile(0.25)
    Q3_X = X.quantile(0.75)
    IQR_X = Q3_X - Q1_X
    X_scaled = ((X_test - med_X)/IQR_X).fillna(0).values
    print('X_test scaled by hand: ',)
    print (X_scaled[0])
    #Scale X value that user entered - by function
    X_scaled2 = transformer_x.transform(X_test)
    print('X_test scaled by function: ',)
    print (X_scaled2[0])
    #Intercept by hand
    med_y = y.median()
    Q1_y = y.quantile(0.25)
    Q3_y = y.quantile(0.75)
    IQR_y = Q3_y - Q1_y
    a = lasso.coef_
    coef_new = ((a * (X_test - med_X).values) / (X_test * IQR_X).values) * float(IQR_y)
    coef_new = np.nan_to_num(coef_new)[0]
    b = lasso.intercept_[0]
    intercept_new = b * float(IQR_y) + float(med_y)
    custom_pred = sum((coef_new * X_test.values)[0]) + intercept_new
    pred = lasso.predict(X_scaled2)
    final_pred = transformer_y.inverse_transform(pred.reshape(-1, 1))[0][0]
    print('Original intercept: ', lasso.intercept_[0].round(2))
    print('New intercept: ', intercept_new.round(2))
    print('Original coefficients: ', lasso.coef_.round(2))
    print('New coefficients: ', coef_new.round(2))
    print('Your predicted value by function is: ', final_pred.round(2))
    print('Your predicted value by hand is: ', custom_pred.round(2))
X_test = pd.DataFrame([10,1,1,1]).T
X_test.columns = ['X1', 'X2', 'X3', 'X4']
pred_val(X_test)
You can see that the custom prediction uses the original values (X_test.values).
Result:
X entered:
[10 1 1 1]
X_test scaled by hand:
[ 5.96774194 0. -6.66666667 -1. ]
X_test scaled by function:
[ 5.96774194 0. -6.66666667 -1. ]
Original intercept: 0.01
New intercept: 3.83
Original coefficients: [ 0.02 0. -0. -0. ]
New coefficients: [0.1 0. 0. 0. ]
Your predicted value by function is: 4.83
Your predicted value by hand is: 4.83
As I explained above, the new coefficients computed this way depend on X_test. This means that you cannot reuse their current values with another test sample: their values will be different for different inputs.
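For comparison, a different rearrangement of the same equation, a * (X - median(X))/IQR(X) + b = (y - median(y))/IQR(y), folds the median(X) term into the intercept and gives coefficients that are the same for every input. The sketch below is not the script above but a variant under these assumptions: y is kept 1-D, the medians and IQRs are read from the scalers' center_ and scale_ attributes (RobustScaler replaces a zero IQR by 1), and the names coef_orig and intercept_orig are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.preprocessing import RobustScaler

df = pd.DataFrame({'Y': [5, -10, 10, .5, 2.5, 15],
                   'X1': [1., -2., 2., .1, .5, 3],
                   'X2': [1, 1, 2, 1, 1, 1],
                   'X3': [6, 6, 6, 5, 6, 4],
                   'X4': [6, 5, 4, 3, 2, 1]})
X = df[['X1', 'X2', 'X3', 'X4']].to_numpy(dtype=float)
y = df['Y'].to_numpy(dtype=float)

scaler_x = RobustScaler().fit(X)
scaler_y = RobustScaler().fit(y.reshape(-1, 1))
lasso = Lasso().fit(scaler_x.transform(X),
                    scaler_y.transform(y.reshape(-1, 1)).ravel())

# Medians and IQRs as stored by the fitted scalers
med_x, iqr_x = scaler_x.center_, scaler_x.scale_
med_y, iqr_y = scaler_y.center_[0], scaler_y.scale_[0]

# y = sum(coef_orig * X) + intercept_orig, with both terms independent of X
coef_orig = lasso.coef_ * iqr_y / iqr_x
intercept_orig = med_y + iqr_y * (lasso.intercept_ - np.sum(lasso.coef_ * med_x / iqr_x))

# Check on two arbitrary samples: direct formula vs. scale -> predict -> inverse-scale
X_new = np.array([[10., 1., 1., 1.], [0.5, 2., 5., 3.]])
direct = X_new @ coef_orig + intercept_orig
via_scalers = scaler_y.inverse_transform(
    lasso.predict(scaler_x.transform(X_new)).reshape(-1, 1)).ravel()
print(coef_orig, intercept_orig)
print(np.allclose(direct, via_scalers))
```

Because coef_orig and intercept_orig do not involve the test sample, they can be reported once in the original units and reused for any input.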
I am doing linear regression with two dimensional variables:
filtered[['p_tag_x', 'p_tag_y', 's_tag_x', 's_tag_y']].head()
p_tag_x p_tag_y s_tag_x s_tag_y
35 589.665646 1405.580171 517.5 1636.5
36 589.665646 1405.580171 679.5 1665.5
100 610.546851 2425.303250 569.5 2722.0
101 610.546851 2425.303250 728.0 2710.0
102 717.237730 1411.842428 820.0 1616.5
clt = linear_model.LinearRegression()
clt.fit(filtered[['p_tag_x', 'p_tag_y']], filtered[['s_tag_x', 's_tag_y']])
I am getting following coefficients of the regression:
clt.coef_
array([[ 0.4529769 , -0.22406594],
[-0.00859452, -0.00816968]])
And the residues (X_0, and Y_0)
clt.residues_
array([ 1452.97816371, 69.12754694])
How should I understand the above coefficient matrix in terms of the regression line?
As I already explained in the comments, you get an extra dimension in your coef_ as well as intercept_ because you have 2 targets (y has shape (n_samples, n_targets)). In this case sklearn will fit 2 independent regressors, one for each target.
You can then take those n regressors apart and handle each one on its own.
The formula of your regression line is still:
y(w, x) = intercept_ + coef_[0] * x[0] + coef_[1] * x[1] ...
Sadly your example is a bit harder to visualize because of the dimensionality.
Consider this a demo, with a lot of ugly hard-coding for this specific case (and bad example data!):
Code:
# Warning: ugly demo-like code using a lot of hard-coding!!!!!
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn import linear_model
X = np.array([[589.665646, 1405.580171],
[589.665646, 1405.580171],
[610.546851, 2425.303250],
[610.546851, 2425.303250],
[717.237730, 1411.842428]])
y = np.array([[517.5, 1636.5],
[679.5, 1665.5],
[569.5, 2722.0],
[728.0, 2710.0],
[820.0, 1616.5]])
clt = linear_model.LinearRegression()
clt.fit(X, y)
print(clt.coef_)
print(((y - clt.predict(X)) ** 2).sum(axis=0))  # residual sum of squares per target; clt.residues_ is no longer available in recent scikit-learn
def curve_0(x, y): # target 0; single-point evaluation hardcoded for 2 features!
    return clt.intercept_[0] + x * clt.coef_[0, 0] + y * clt.coef_[0, 1]
def curve_1(x, y): # target 1; single-point evaluation hardcoded for 2 features!
    return clt.intercept_[1] + x * clt.coef_[1, 0] + y * clt.coef_[1, 1]
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
xs = [np.amin(X[:, 0]), np.amax(X[:, 0])]
ys = [np.amin(X[:, 1]), np.amax(X[:, 1])]
# regressor 0
ax.scatter(X[:, 0], X[:, 1], y[:, 0], c='blue')
ax.plot([xs[0], xs[1]], [ys[0], ys[1]], [curve_0(xs[0], ys[0]), curve_0(xs[1], ys[1])], c='cyan')
# regressor 1
ax.scatter(X[:, 0], X[:, 1], y[:, 1], c='red')
ax.plot([xs[0], xs[1]], [ys[0], ys[1]], [curve_1(xs[0], ys[0]), curve_1(xs[1], ys[1])], c='magenta')
ax.set_xlabel('X[:, 0] feature 0')
ax.set_ylabel('X[:, 1] feature 1')
ax.set_zlabel('Y')
plt.show()
Output:
Remarks:
You don't have to calculate the formula by yourself: clt.predict() will do that!
The code-lines involving ax.plot(...) use the assumption, that our line is defined by just 2 points (linear)!
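To make the reading of the coefficient matrix concrete, here is a short sketch using the five rows shown in the question: row k of coef_ together with intercept_[k] defines the regressor for target k, so the whole prediction is X @ coef_.T + intercept_. The names manual and rss are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[589.665646, 1405.580171],
              [589.665646, 1405.580171],
              [610.546851, 2425.303250],
              [610.546851, 2425.303250],
              [717.237730, 1411.842428]])
y = np.array([[517.5, 1636.5],
              [679.5, 1665.5],
              [569.5, 2722.0],
              [728.0, 2710.0],
              [820.0, 1616.5]])

clt = LinearRegression().fit(X, y)
print(clt.coef_.shape)       # (2, 2): one row per target, one column per feature
print(clt.intercept_.shape)  # (2,): one intercept per target

# Apply the regression formula to all samples and both targets at once
manual = X @ clt.coef_.T + clt.intercept_
print(np.allclose(manual, clt.predict(X)))

# Residual sum of squares, one value per target
rss = ((y - clt.predict(X)) ** 2).sum(axis=0)
print(rss)
```

Column 0 of manual is the s_tag_x regressor and column 1 the s_tag_y regressor, each a plane over the two input features.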
I have implemented LinearSVC and SVC from the sklearn-framework for text classification.
I am using TfidfVectorizer to get a sparse representation of the input data, which consists of two different classes (benign data and malicious data). This part is working fine, but now I want to implement some kind of anomaly detection by using the OneClassSVM classifier and training a model with only one class (outlier detection). Unfortunately, it does not work with sparse data. Some developers are working on a patch (https://github.com/scikit-learn/scikit-learn/pull/1586), but there are some bugs, so there is no solution yet for using the OneClassSVM implementation.
Are there any other methods in the sklearn-framework for doing something like that? I am looking over the examples but nothing seems to fit.
Thanks!
A bit late, but in case anyone else is looking for information on this... There's a third-party anomaly detection module for sklearn here: http://www.cit.mak.ac.ug/staff/jquinn/software/lsanomaly.html, based on least-squares methods. It should be a plug-in replacement for OneClassSVM.
Unfortunately, scikit-learn currently implements only one-class SVM and a robust covariance estimator for outlier detection.
You can try a comparison of these methods (as provided in the docs) by examining the differences on 2D data:
import numpy as np
import pylab as pl
import matplotlib.font_manager
from scipy import stats
from sklearn import svm
from sklearn.covariance import EllipticEnvelope
# Example settings
n_samples = 200
outliers_fraction = 0.25
clusters_separation = [0, 1, 2]
# define two outlier detection tools to be compared
classifiers = {
"One-Class SVM": svm.OneClassSVM(nu=0.95 * outliers_fraction + 0.05,
kernel="rbf", gamma=0.1),
"robust covariance estimator": EllipticEnvelope(contamination=.1)}
# Compare given classifiers under given settings
xx, yy = np.meshgrid(np.linspace(-7, 7, 500), np.linspace(-7, 7, 500))
n_inliers = int((1. - outliers_fraction) * n_samples)
n_outliers = int(outliers_fraction * n_samples)
ground_truth = np.ones(n_samples, dtype=int)
ground_truth[-n_outliers:] = 0
# Fit the problem with varying cluster separation
for i, offset in enumerate(clusters_separation):
    np.random.seed(42)
    # Data generation (randn needs integer sizes, hence the floor division)
    X1 = 0.3 * np.random.randn(n_inliers // 2, 2) - offset
    X2 = 0.3 * np.random.randn(n_inliers // 2, 2) + offset
    X = np.r_[X1, X2]
    # Add outliers
    X = np.r_[X, np.random.uniform(low=-6, high=6, size=(n_outliers, 2))]
    # Fit each model and draw it in its own subplot
    pl.figure(figsize=(10, 5))
    # .items() replaces Python 2's .iteritems(); j avoids shadowing the outer loop index
    for j, (clf_name, clf) in enumerate(classifiers.items()):
        # fit the data and tag outliers
        clf.fit(X)
        y_pred = clf.decision_function(X).ravel()
        threshold = stats.scoreatpercentile(y_pred,
                                            100 * outliers_fraction)
        y_pred = y_pred > threshold
        n_errors = (y_pred != ground_truth).sum()
        # plot the levels lines and the points
        Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
        Z = Z.reshape(xx.shape)
        subplot = pl.subplot(1, 2, j + 1)
        subplot.set_title("Outlier detection")
        subplot.contourf(xx, yy, Z, levels=np.linspace(Z.min(), threshold, 7),
                         cmap=pl.cm.Blues_r)
        a = subplot.contour(xx, yy, Z, levels=[threshold],
                            linewidths=2, colors='red')
        subplot.contourf(xx, yy, Z, levels=[threshold, Z.max()],
                         colors='orange')
        b = subplot.scatter(X[:-n_outliers, 0], X[:-n_outliers, 1], c='white')
        c = subplot.scatter(X[-n_outliers:, 0], X[-n_outliers:, 1], c='black')
        subplot.axis('tight')
        # a.collections is deprecated in recent Matplotlib releases;
        # a.legend_elements()[0][0] should be usable there instead
        subplot.legend(
            [a.collections[0], b, c],
            ['learned decision function', 'true inliers', 'true outliers'],
            prop=matplotlib.font_manager.FontProperties(size=11))
        subplot.set_xlabel("%d. %s (errors: %d)" % (j + 1, clf_name, n_errors))
        subplot.set_xlim((-7, 7))
        subplot.set_ylim((-7, 7))
    pl.subplots_adjust(0.04, 0.1, 0.96, 0.94, 0.1, 0.26)
pl.show()