Linear regression with outliers for Machine Learning - python

Python (a Jupyter notebook, to be exact), using numpy and sklearn only:
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(16)
x = np.arange(100)
yp = 3*x + 3 + 2*(np.random.poisson(3*x+3, 100) - (3*x+3))

np.random.seed(12)
# Choose how many outliers
out = np.random.choice(100, 15)
yp_wo = np.copy(yp)
np.random.seed(12)  # set again
yp_wo[out] = yp_wo[out] + 5*np.random.rand(15)*yp[out]

# With outliers
plt.scatter(x, yp_wo)
# Without outliers
plt.scatter(x, yp)
For the data above (wo means "with outliers"), I need to find:
The best coefficients for two more losses: the MAE and the MAPE (Median Absolute Percentage Error)
Plot the best fit line for the MSE loss, the MAE loss, and the MAPE loss.
Apply Ridge Regression to the same data, and use cross validation to choose the optimal parameter alpha (you can use values of alpha = 10^-5, 10^-4, 10^-3, ... 10^3). Which value gives you the lowest MSE?
What confuses me is having to plot the best-fit line for two or more losses.
I can follow the code from class and try to get the values, but I don't know what's meant by coefficients.
Any help / guidance?
This is for a homework assignment I am trying to figure out (no, I am not asking for the solutions).
Please excuse any formatting errors, I am very new to Stack Overflow.
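For anyone puzzling over the same thing, here is a minimal sketch of what "coefficients" means here. This is not from the assignment; it reuses x, yp_wo and plt from the setup above and pulls in scipy.optimize.minimize, which the post itself does not mention. The coefficients are just the slope w1 and intercept w0 of the line y = w1*x + w0, and each loss is refit separately:

from scipy.optimize import minimize

def fit_line(x, y, loss):
    """Return (w0, w1) minimizing the given loss for the model y_hat = w1*x + w0."""
    objective = lambda w: loss(y, w[1] * x + w[0])
    return minimize(objective, x0=np.array([0.0, 1.0]), method="Nelder-Mead").x

mse = lambda y, y_hat: np.mean((y - y_hat) ** 2)
mae = lambda y, y_hat: np.mean(np.abs(y - y_hat))
# MAPE as worded in the assignment (median absolute percentage error);
# the small constant guards against division by zero near x = 0
mape = lambda y, y_hat: np.median(np.abs((y - y_hat) / (y + 1e-8)))

for name, loss in [("MSE", mse), ("MAE", mae), ("MAPE", mape)]:
    w0, w1 = fit_line(x, yp_wo, loss)
    plt.plot(x, w1 * x + w0, label="%s fit" % name)
plt.scatter(x, yp_wo, s=10)
plt.legend()

"Plot the best fit line for each loss" then just means: each loss gives its own (w0, w1), so you draw three different lines over the same scatter plot.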

Related

Can a good model have a low R square value?

I made a linear regression model using scikit-learn. The mean squared error on the test data is very low (0.09), but the r2_score on the test data is also very low (0.05). As far as I know, a low mean squared error means the model is good, yet such a low r2_score suggests the model is not good, so I can't tell whether my regression model is good or not.
Can a good model have a low R-squared value, or can a bad model have a low mean squared error?
R^2 is a measure of how well your fit represents the data.
Let's say your data has a linear trend plus some noise. We can construct such data and see how R^2 changes:
Data
I'm going to create some data using numpy:
import numpy as np

xs = np.random.randint(10, 1000, 2000)
ys = (3 * xs + 8) + np.random.randint(5, 10, 2000)
Fit
Now we can create a fit object using scikit-learn:
from sklearn.linear_model import LinearRegression

reg = LinearRegression().fit(xs.reshape(-1, 1), ys.reshape(-1, 1))
And we can get the score from this fit.
reg.score(xs.reshape(-1, 1), ys.reshape(-1, 1))
My R^2 was: 0.9999971914416896
Bad data
Let's say we have more scattered data (with more noise on it):
ys2 = (3 * xs + 8) + np.random.randint(500, 1000, 2000)
Now we can calculate the score on ys2 to see how well our fit represents the (xs, ys2) data:
reg.score(xs.reshape(-1, 1), ys2.reshape(-1, 1))
My R^2 was: 0.2377175028951054
The score is low, even though we know the trend of the data did not change: it is still 3x + 8 + (noise). The ys2 values are simply further away from the fit.
So R^2 is an indicator of how well your fit represents the data, but the condition of the data itself matters. Even with a low score, the fit you get may be the best possible one, because the data is scattered by noise.
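Relating this back to the numbers in the question (MSE of 0.09 but R^2 of 0.05): for a least-squares fit, R^2 compares the mean squared error against the variance of the targets, so a small-looking MSE still yields an R^2 near zero whenever the targets themselves barely vary. A minimal illustration (the data here is made up, not yours):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 500).reshape(-1, 1)
y = 0.05 * x.ravel() + rng.normal(0, 0.3, 500)   # targets with a small spread

model = LinearRegression().fit(x, y)
mse = np.mean((y - model.predict(x)) ** 2)
r2 = 1.0 - mse / np.var(y)   # same quantity model.score(x, y) reports on this data
print("MSE:", round(mse, 3), " R^2:", round(r2, 3))   # small MSE, R^2 near zero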

Interpreting logistic regression coefficients of scaled features

I'm using a logistic regression to estimate the probability of scoring a goal in soccer/football. I've got 5 features. My target values are 1 (goal) or 0 (no goal).
As is always a must, I've scaled my features before fitting my model. I've used the MinMaxScaler, which scales all features to the range [0, 1] as follows:
X_scaled = (x - x_min)/(x_max - x_min)
The coefficients of my logistic regression model are the following:
coef = [[-2.26286643 4.05722387 0.74869811 0.20538172 -0.49969841]]
My first thought is that the second feature is the most important, followed by the first. Is this always true?
I read on one site that "for a one-unit increase in the second feature, the expected change in log odds is 4.05722387", but there the features were normalized to a mean of 50 with some standard deviation.
If I do not scale my features, the coefficients of the model are the following:
coef = [[-0.04743728 0.04394143 -0.00247654 0.23769469 -0.55051824]]
And now it seems that the first feature is more important than the second one. I read in the literature on my topic that this is indeed true, so this confuses me, of course.
My questions are:
Which of my features is the most important and what/why is the best methodology to find it?
How can I interpret the meaning of the scaled coefficients? E.g. what does an increase of 1 meter in feature 1 mean? Can I put 1 meter into the MinMaxScaler, see what comes out, and use that as 'the one-unit increase'?
Is it true that the final probability will be computed as y = 1/(1 + exp(-fx)) with fx = intercept + feature1*coef1 + feature2*coef2 + ... (with all features scaled)?
Which of my features is the most important and what/why is the best methodology to find it?
Look at the several versions of marginal effects calculations; for an overview and discussion, see for example the blog "Stata's example resources for R".
How can I interpret the meaning of the scaled coefficients? E.g. what does an increase of 1 meter in feature 1 mean? Can I put 1 meter into the MinMaxScaler, see what comes out, and use that as 'the one-unit increase'?
The interpretation depends on which marginal effects you calculate. You just need to account for scaling when you talk about one unit of X increasing/decreasing the change in probability or odds ratio etc.
Is it true that the final probability will be computed as y = 1/(1 + exp(-fx)) with fx = intercept + feature1*coef1 + feature2*coef2 + ... (with all features scaled)?
Yes; it's just that the features x are in scaled units.
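As a small addition, here is a hedged sketch of the back-transformation for the second point (the data and names below are made up for illustration): with MinMaxScaler, one original unit of a feature corresponds to 1/(x_max - x_min) scaled units, so dividing each scaled coefficient by that feature's range gives the change in log odds per original unit (e.g. per meter):

import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) * [5.0, 2.0, 30.0, 1.0, 10.0]   # 5 fake features
y = (rng.random(500) < 1 / (1 + np.exp(-(0.3 * X[:, 0] - 0.1 * X[:, 1])))).astype(int)

scaler = MinMaxScaler().fit(X)
clf = LogisticRegression().fit(scaler.transform(X), y)

feature_range = scaler.data_max_ - scaler.data_min_    # x_max - x_min per feature
coef_scaled = clf.coef_[0]                              # log-odds change per scaled unit
coef_per_unit = coef_scaled / feature_range             # log-odds change per original unit
print(coef_per_unit)

Note that comparing coef_per_unit across features only makes sense when their units are comparable; the scaled coefficients instead answer "what does moving this feature across its full observed range do", which is why the two orderings can disagree.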

Understanding differences in Coefficients between Multivariate OLS, PLS and Ridge-Regression when having Multicollinearity

Hello guys, I'm new to Python and I have a problem with multicollinearity in a multivariate regression model.
I have hourly data for two conveyor belts, one after the other, with 'load', 'speed', 'energy' and so on, and I want to understand the energy performance. First, I tried an ordinary least squares model to get coefficients, but I could see that the coefficients differ between the two conveyors. The point is that one of the belts is a few meters shorter and has to carry the load a few meters upwards; I calculated a slope of 0.09 and want to capture that in the model. So I added a separate slope column to each belt's data and appended the data sets. I then ran a ridge regression on the combined data, knowing that with alpha equal to zero I get the OLS regression back. But the coefficients I get now are surprising: as before, load has a big influence, and the new slope behaves as expected, but the speed of the belt now has a negative impact on the energy. That would be nice, but it cannot be right that the energy decreases when the speed of the motor increases...
In my opinion this could be a result of multicollinearity, so I looked at a correlation matrix, but there is no correlation between slope and speed. I then tried partial least squares, but the coefficients I get there are near zero; on the other hand, the PLS model gives me x- and y-loadings with values that look like what I would expect the coefficients to be.
I know that PLS estimates the coefficients via y = X*coef + ERR.
Is there a way to get the ERR values? Can an ERR value be so big that you no longer get "good" coefficients?
Is it possible for PLS to give much smaller coefficients than OLS? What are the y_loadings values inside the PLS model?
And is there another model you would use to check energy performance?
Thanks for your help.
########## Partial Least Squares Regression ##########
from math import sqrt
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

PLSRegr = PLSRegression(n_components=2)
pls = PLSRegr.fit(X_train, Y_train)
pls_pred = pls.predict(X_test)

pls_meanSquaredError = mean_squared_error(Y_test, pls_pred)
print("PLS MSE:", pls_meanSquaredError)
pls_rootMeanSquaredError = sqrt(pls_meanSquaredError)
print("PLS RMSE:", pls_rootMeanSquaredError)
pls_mean = mean_absolute_error(Y_test, pls_pred)
print("PLS mean absolute error:", pls_mean)
pls_r2 = r2_score(Y_test, pls_pred)
print("PLS R²:", pls_r2)

print('PLS coefficients: \n', PLSRegr.coef_)
print('PLS y_loadings: \n', PLSRegr.y_loadings_)
print('PLS x_loadings: \n', PLSRegr.x_loadings_)
########## Ridge Regression ##########
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import mean_squared_error, r2_score

n_alphas = 10
alphas = np.logspace(-1.5, 2.5, n_alphas)
coefs = []
errors = []
errors2 = []          # was missing in the original snippet
error_pred = []
Rsquared = []
Rsquared_pred = []
scores = []
p = 6       # number of predictors
N = 14266   # total sample size

for a in alphas:
    ridge = KernelRidge(alpha=a, kernel='linear', coef0=0)
    ridge.fit(X_train, Y_train)
    KRR_pred = ridge.predict(X_train)   # prediction on the training set
    rgr_pred = ridge.predict(X_test)    # prediction on the test set
    print(KRR_pred)
    print(ridge.dual_coef_)
    print(np.dot(X_train.transpose(), ridge.dual_coef_))
    coefs.append(np.dot(X_train.transpose(), ridge.dual_coef_))
    Rsquared.append(ridge.score(X_train, Y_train))
    print("R² of train set:", Rsquared)
    Rsquared_pred.append(r2_score(Y_test, rgr_pred))
    print("R² of prediction:", Rsquared_pred)
    Rsquaredadj = 1 - (((1 - (r2_score(Y_test, rgr_pred))) * (N - 1)) / (N - p - 1))
    print("Adj R²:", Rsquaredadj)
    errors.append(mean_squared_error(ridge.dual_coef_, KRR_pred))
    errors2.append(mean_squared_error(ridge.dual_coef_, rgr_pred))
    print('MSE of bias:', errors)
    error_pred.append(mean_squared_error(Y_test, rgr_pred))
    print("RGR MSE:", error_pred)
    mse = np.mean((rgr_pred - Y_test) ** 2)
    print("MSE check", mse)

coefs = np.array(coefs)
coefs = coefs.reshape(n_alphas, 6)
print('Coefficients: \n', coefs)
print('Alphas: \n', alphas)
print(KRR_pred)
print(ridge.dual_coef_)
                               Temp     Load    Tension   Speed   Slope
PLS results:                   0.00     0.11    -0.01      0.02   0.04
OLS/Ridge (alpha=0) results:  -0.038    1.37    -0.067    -0.11   0.33
OLS results without slope:    -0.011    1.11    -0.33      0.40   (not included)
I expected the ridge coefficients to look like the no-slope model, only "smaller", with a positive speed coefficient, and I expected larger (non-zero) coefficients from PLS.
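No full answer here, but a rough sketch that may help with two of the sub-questions (it reuses X_train, X_test, Y_train, Y_test and the fitted pls object from the code above, and treats array shapes loosely): the ERR term in y = X*coef + ERR is just the residual of the fitted model, and variance inflation factors can reveal multicollinearity that a pairwise correlation matrix misses:

import numpy as np

# The ERR term is the residual y minus the prediction of the fitted PLS model
resid_train = np.ravel(Y_train) - np.ravel(pls.predict(X_train))
resid_test = np.ravel(Y_test) - np.ravel(pls.predict(X_test))
print("train residual std:", resid_train.std(), " test residual std:", resid_test.std())

# Variance inflation factors: diagonal of the inverse correlation matrix of the
# predictors; values well above ~5-10 hint at multicollinearity even when no
# single pairwise correlation looks alarming
R = np.corrcoef(np.asarray(X_train, dtype=float), rowvar=False)
vif = np.diag(np.linalg.inv(R))
print("VIF per predictor:", vif)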

How to determine which regression curve fits better? PYTHON

Well, community:
Recently I asked how to do exponential regression (Exponential regression function Python), thinking that for that data set the optimal regression was the hyperbolic one.
import numpy as np

x_data = np.arange(0, 51)
y_data = np.array([0.001, 0.199, 0.394, 0.556, 0.797, 0.891, 1.171, 1.128, 1.437,
                   1.525, 1.720, 1.703, 1.895, 2.003, 2.108, 2.408, 2.424, 2.537,
                   2.647, 2.740, 2.957, 2.58, 3.156, 3.051, 3.043, 3.353, 3.400,
                   3.606, 3.659, 3.671, 3.750, 3.827, 3.902, 3.976, 4.048, 4.018,
                   4.286, 4.353, 4.418, 4.382, 4.444, 4.485, 4.465, 4.600, 4.681,
                   4.737, 4.792, 4.845, 4.909, 4.919, 5.100])
Now I'm in doubt between two fits: the first is an exponential fit and the second is hyperbolic. I don't know which is better... How do I determine that? Which criteria should I follow? Is there some Python function for this?
Thanks in advance!
One common fit statistic is R-squared (R²), which can be calculated as R2 = 1.0 - (absolute_error_variance / dependent_data_variance); it tells you what fraction of the dependent-data variance is explained by your model. For example, if the R-squared value is 0.95, your model explains 95% of the dependent-data variance.
Since you are using numpy, the R-squared value is trivially calculated as R2 = 1.0 - (abs_err.var() / dep_data.var()), because numpy arrays have a var() method to calculate variance.
When fitting your data to the Michaelis-Menten equation y = a*x / (b + x) with parameter values a = 1.0232217656373191E+01 and b = 5.2016057362771100E+01, I calculate an R-squared value of 0.9967, which means that 99.67 percent of the variance in the y data is explained by this model. However, there is no silver bullet, and it is always good to verify other fit statistics and visually inspect the model against the data.
You can take the 2-norm between the function and the line of fit; numpy provides np.linalg.norm for this. The R-squared value is for linear regression.
Well, you should calculate an error function which measures how good your fit actually is. There are many different error functions you could use but for the start the mean-squared-error should work (if you're interested in further metrics, have a look at http://scikit-learn.org/stable/modules/model_evaluation.html).
You can manually implement the mean squared error once you have determined the coefficients for your regression problem:
import numpy as np
from sklearn.metrics import mean_squared_error

# a, b, c are the coefficients you obtained from your exponential fit
f = lambda x: a * np.exp(b * x) + c
mse = mean_squared_error(y_data, f(x_data))
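Building on the answers above, here is a minimal comparison sketch. It assumes the x_data and y_data from the question and uses scipy.optimize.curve_fit, which none of the answers mention; the starting guesses are rough assumptions and may need adjusting if the optimizer fails to converge. It fits both candidate models and compares their R² on the same data:

import numpy as np
from scipy.optimize import curve_fit

def hyperbolic(x, a, b):        # Michaelis-Menten form: y = a*x / (b + x)
    return a * x / (b + x)

def exponential(x, a, b, c):    # saturating exponential: y = a*exp(b*x) + c
    return a * np.exp(b * x) + c

def r_squared(y, y_pred):
    return 1.0 - np.sum((y - y_pred) ** 2) / np.sum((y - y.mean()) ** 2)

p_hyp, _ = curve_fit(hyperbolic, x_data, y_data, p0=(10.0, 50.0))
p_exp, _ = curve_fit(exponential, x_data, y_data, p0=(-7.0, -0.02, 7.0), maxfev=10000)

print("hyperbolic  R^2:", r_squared(y_data, hyperbolic(x_data, *p_hyp)))
print("exponential R^2:", r_squared(y_data, exponential(x_data, *p_exp)))

The model with the higher R² (or lower MSE) on the same data fits better numerically, but as noted above it is still worth plotting both curves over the data.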

Standard errors for multivariate regression coefficients

I've done a multivariate regression using sklearn.linear_model.LinearRegression and obtained the regression coefficients doing this:
import numpy as np
from sklearn import linear_model
clf = linear_model.LinearRegression()
TST = np.vstack([x1,x2,x3,x4])
TST = TST.transpose()
clf.fit(TST, y)
clf.coef_
Now, I need the standard errors for these same coefficients. How can I do that?
Thanks a lot.
Based on this stats question and wikipedia, my best guess is:
MSE = np.mean((y - clf.predict(TST).T)**2)
var_est = MSE * np.diag(np.linalg.pinv(np.dot(TST.T,TST)))
SE_est = np.sqrt(var_est)
However, my linear algebra and stats are both quite poor, so I could be missing something important. Another option might be to bootstrap the variance estimate.
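To make the bootstrap option concrete, here is a short sketch (reusing TST and y from the question; the number of resamples is arbitrary): refit on rows resampled with replacement and take the spread of the resulting coefficients as the standard error estimate:

import numpy as np
from sklearn import linear_model

rng = np.random.default_rng(0)
boot_coefs = []
for _ in range(1000):
    idx = rng.integers(0, len(y), size=len(y))           # resample rows with replacement
    b = linear_model.LinearRegression().fit(TST[idx], np.asarray(y)[idx])
    boot_coefs.append(np.r_[b.intercept_, b.coef_])      # intercept first, then slopes
SE_boot = np.std(boot_coefs, axis=0)                     # bootstrap standard errors
print(SE_boot)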
MSE = np.mean((y - clf.predict(TST).T)**2)
var_est = MSE * np.diag(np.linalg.pinv(np.dot(TST.T,TST)))
SE_est = np.sqrt(var_est)
I think this answer is not entirely correct.
In particular, if I am not wrong, according to your code sklearn adds the constant term by default when computing your coefficients.
You therefore need to include a column of ones in your matrix TST. Then the code above is correct and it will give you an array with all the standard errors.
This code has been tested on real data; it is correct.
# Build the X matrix for your data set; n is the number of observations,
# m is the number of predictor variables:
X, n, m = arrays(data)
y = ***.reshape((n, 1))

linear = linear_model.LinearRegression(n_jobs=-1)  # n_jobs belongs in the constructor
linear.fit(X, y)

# Residual sum of squares divided by the degrees of freedom:
s = np.sum((linear.predict(X) - y) ** 2) / (n - (m - 1) - 1)

# Standard deviations: square root of the diagonal of the variance-covariance
# matrix (pseudo-inverse, i.e. singular value decomposition):
sd_alpha = np.sqrt(s * (np.diag(np.linalg.pinv(np.dot(X.T, X)))))

# t-statistic (use linear.intercept_ for a single variable):
t_stat_alpha = linear.intercept_[0] / sd_alpha[0]
I found that the accepted answer had some mathematical glitches that in total would require edits beyond the recommended etiquette for modifying posts. So here is a solution to compute the standard error estimate for the coefficients obtained through the linear model (using an unbiased estimate as suggested here):
# preparation
X = np.concatenate((np.ones((TST.shape[0], 1)), TST), axis=1)
y_hat = clf.predict(TST).T
m, n = X.shape

# computation
MSE = np.sum((y_hat - y)**2) / (m - n)
coef_var_est = MSE * np.diag(np.linalg.pinv(np.dot(X.T, X)))
coef_SE_est = np.sqrt(coef_var_est)
Note that we have to add a column of ones to TST as the original post used the linear_model.LinearRegression in a way that will fit the intercept term. Furthermore, we need to compute the mean squared error (MSE) as in ANOVA. That is, we need to divide the sum of squared errors (SSE) by the degrees of freedom for the error, i.e., df_error = df_observations - df_features.
The resulting array coef_SE_est contains the standard error estimates of the intercept and all other coefficients in coef_SE_est[0] and coef_SE_est[1:], respectively. To print them out you could use:
print('intercept: coef={:.4f} / std_err={:.4f}'.format(clf.intercept_[0], coef_SE_est[0]))
for i, coef in enumerate(clf.coef_[0, :]):
    print('x{}: coef={:.4f} / std_err={:.4f}'.format(i + 1, coef, coef_SE_est[i + 1]))
The example from the documentation shows how to get the mean square error and explained variance score:
import numpy as np
from sklearn import datasets, linear_model

# Load the diabetes dataset and split it, as in the documentation example
diabetes = datasets.load_diabetes()
diabetes_X = diabetes.data[:, np.newaxis, 2]
diabetes_X_train, diabetes_X_test = diabetes_X[:-20], diabetes_X[-20:]
diabetes_y_train, diabetes_y_test = diabetes.target[:-20], diabetes.target[-20:]

# Create linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)
# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print("Residual sum of squares: %.2f"
      % np.mean((regr.predict(diabetes_X_test) - diabetes_y_test) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % regr.score(diabetes_X_test, diabetes_y_test))
Does this cover what you need?
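For completeness, if you can add a dependency, statsmodels reports these standard errors directly, which is handy for cross-checking the manual computations above (a sketch reusing TST and y from the question):

import statsmodels.api as sm

X_const = sm.add_constant(TST)          # prepend a column of ones for the intercept
results = sm.OLS(y, X_const).fit()
print(results.params)                    # intercept and coefficients
print(results.bse)                       # their standard errors
print(results.summary())                 # full table with t-statistics and p-values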
