I made a linear regression model using scikit-learn.
When I look at the mean squared error on the test data, it is very low (0.09).
When I look at the r2_score on the test data, it is also very low (0.05).
As far as I know, a low mean squared error means the model is good, but the very low r2_score suggests the model is not good.
I don't understand whether my regression model is good or not.
Can a good model have a low R-squared value, or can a bad model have a low mean squared error value?
R^2 is a measure of how well your fit represents the data.
Let's say your data has a linear trend with some noise on it. We can construct such data and see how R^2 changes:
Data
I'm going to create some data using numpy:
import numpy as np

xs = np.random.randint(10, 1000, 2000)
ys = (3 * xs + 8) + np.random.randint(5, 10, 2000)
Fit
Now we can create a fit object using scikit-learn:
from sklearn.linear_model import LinearRegression

reg = LinearRegression().fit(xs.reshape(-1, 1), ys.reshape(-1, 1))
And we can get the score from this fit.
reg.score(xs.reshape(-1, 1), ys.reshape(-1, 1))
My R^2 was: 0.9999971914416896
Bad data
Let's say we have a set of more scattered data (with more noise on it).
ys2 = (3 * xs + 8) + np.random.randint(500, 1000, 2000)
Now we can calculate the score of ys2 to see how well our fit represents the xs, ys2 data:
reg.score(xs.reshape(-1, 1), ys2.reshape(-1, 1))
My R^2 was: 0.2377175028951054
The score is low, yet we know the trend of the data did not change. It is still 3x + 8 + (noise), but the ys2 values lie further away from the fit.
So, R^2 is an indicator of how well your fit represents the data, but the condition of the data itself matters. Even with a low score, the fit you get may be the best possible one, since the data is scattered due to noise.
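To connect this back to the question: MSE is an absolute, scale-dependent error, while R^2 compares that error to the target's own variance. Here is a minimal sketch (my own construction, not the data above) where the MSE is small in absolute terms yet R^2 stays near zero, because the target has little variance beyond the noise:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.RandomState(0)
x = rng.rand(2000, 1)
y = 0.01 * x.ravel() + rng.normal(scale=0.3, size=2000)  # tiny signal, small-scale noise

reg = LinearRegression().fit(x, y)
pred = reg.predict(x)
print("MSE:", mean_squared_error(y, pred))  # small in absolute terms (about 0.09)
print("R^2:", r2_score(y, pred))            # close to 0: almost no variance explained

So a low MSE alone does not guarantee that the model explains much of the data; that is exactly what a low R^2 is telling you.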
Related
I am using SGDRegressor with a constant learning rate and the default loss function. I am curious to know how changing the alpha parameter from 0.0001 to 100 will change the regressor's behavior. Below is the sample code I have:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import SGDRegressor

# a, b, phi and the abline() helper are defined earlier in the notebook (not shown here)
out = [(0, 2), (21, 13), (-23, -15), (22, 14), (23, 14)]
alpha = [0.0001, 1, 100]
N = len(out)
plt.figure(figsize=(20, 15))
j = 1
for i in alpha:
    # Since for every alpha we want to start with the original dataset, X and Y are reset here
    X = b * np.sin(phi)
    Y = a * np.cos(phi)
    for num in range(N):
        plt.subplot(3, N, j)
        X = np.append(X, out[num][0])  # appending outlier to main X
        Y = np.append(Y, out[num][1])  # appending outlier to main Y
        j = j + 1                      # increasing j so we move on to the next plot
        model = SGDRegressor(alpha=i, eta0=0.001, learning_rate='constant', random_state=0)
        model.fit(X.reshape(-1, 1), Y)  # fitting the model
        plt.scatter(X, Y)
        plt.title("alpha = " + str(i) + " | Slope: " + str(round(model.coef_[0], 4)))  # adding a title to each plot
        abline(model.coef_[0], model.intercept_)  # plotting the line using the abline function
plt.show()
As shown above, I have a main dataset of X and Y, and in each iteration I add one point as an outlier to it, train the model, and plot the regression line (hyperplane). Below you can see the results for different values of alpha:
Looking at the results, I am still confused and can't draw a solid conclusion about how the alpha parameter changes the model. What is the effect of alpha? Does it cause overfitting? Underfitting?
From scikit-learn:
alpha : float, default=0.0001
Constant that multiplies the regularization term. The higher the value, the stronger the regularization. Also used to compute the learning rate when learning_rate is set to 'optimal'.
As for regularization, this technique discourages learning an overly complex or flexible model, so as to avoid the risk of overfitting. If the model fits the noise (rather than the "true" signal) in the training data, its estimated coefficients won't generalize well to future (test) data. This is where regularization comes in: it shrinks, or regularizes, these learned estimates towards zero.
From Towards Data Science (paraphrased):
A standard least squares model tends to have some variance in it, i.e. it won't generalize well to a data set different from its training data. Regularization significantly reduces the variance of the model without a substantial increase in its bias. The tuning parameter alpha controls this bias-variance trade-off: as the value of alpha rises, it reduces the magnitude of the coefficients and thus the variance. Up to a point, this increase in alpha is beneficial, since it only reduces the variance (hence avoiding overfitting) without losing any important properties of the data. But after a certain value, the model starts losing important properties, which introduces bias and thus underfitting.
In your example, comparing the rows within the third column highlights this effect on the slope.
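If it helps, here is a minimal sketch of that effect on a toy dataset (my own construction, not the notebook above): with the default L2 penalty, the fitted slope stays close to the true value for small alpha and is pulled toward zero as alpha grows.

import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.RandomState(0)
X = np.linspace(-3, 3, 200).reshape(-1, 1)
y = 2.0 * X.ravel() + rng.normal(scale=0.5, size=200)  # true slope is 2

for a in [0.0001, 1, 100]:
    model = SGDRegressor(alpha=a, eta0=0.001,
                         learning_rate='constant', random_state=0)
    model.fit(X, y)
    print("alpha = {:>7}: slope = {:.4f}".format(a, model.coef_[0]))

Large alpha therefore tends toward underfitting (slope pushed toward zero), while very small alpha behaves almost like unregularized least squares.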
Python (a Jupyter notebook, to be exact), using numpy and sklearn only.
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(16)
x = np.arange(100)
yp = 3*x + 3 + 2*(np.random.poisson(3*x + 3, 100) - (3*x + 3))

np.random.seed(12)
# Choose how many outliers
out = np.random.choice(100, 15)
yp_wo = np.copy(yp)

np.random.seed(12)  # set the seed again
yp_wo[out] = yp_wo[out] + 5*np.random.rand(15)*yp[out]

# With outliers
plt.scatter(x, yp_wo)
# Without outliers
plt.scatter(x, yp)
For the data above (wo means "with outliers"), I need to find:
The best coefficients for two more losses: the MAE and the MAPE (Median Absolute Percentage Error)
Plot the best fit line for the MSE loss, the MAE loss, and the MAPE loss.
Apply Ridge Regression to the same data, and use cross validation to choose the optimal parameter alpha (you can use values of alpha = 10^-5, 10^-4, 10^-3, ... 10^3). Which value gives you the lowest MSE?
What confuses me is having to plot the best-fit line for two or more losses.
I can follow the code from class and try to get the values, but I don't know what's meant by coefficients.
Any help / guidance?
This is for a homework I am trying to figure out (no I am not asking for the solutions)
Please excuse any formatting errors, I am very new to Stack Overflow.
Well, community:
Recently I asked how to do exponential regression (Exponential regression function Python), thinking that for that data set the optimal regression was hyperbolic.
import numpy as np

x_data = np.arange(0, 51)
y_data = np.array([0.001, 0.199, 0.394, 0.556, 0.797, 0.891, 1.171, 1.128, 1.437,
1.525, 1.720, 1.703, 1.895, 2.003, 2.108, 2.408, 2.424,2.537,
2.647, 2.740, 2.957, 2.58, 3.156, 3.051, 3.043, 3.353, 3.400,
3.606, 3.659, 3.671, 3.750, 3.827, 3.902, 3.976, 4.048, 4.018,
4.286, 4.353, 4.418, 4.382, 4.444, 4.485, 4.465, 4.600, 4.681,
4.737, 4.792, 4.845, 4.909, 4.919, 5.100])
Now, I'm in doubt:
The first is an exponential fit. The second is hyperbolic. I don't know which is better... How do I determine that? Which criteria should I follow? Is there some Python function?
Thanks in advance!
One common fit statistic is R-squared (R2), which can be calculated as R2 = 1.0 - (absolute_error_variance / dependent_data_variance). It tells you what fraction of the dependent data variance is explained by your model: if the R-squared value is 0.95, your model explains 95% of the dependent data variance. Since you are using numpy, the R-squared value is trivially calculated as R2 = 1.0 - (abs_err.var() / dep_data.var()), because numpy arrays have a var() method to compute variance.

When fitting your data to the Michaelis-Menten equation y = ax / (b + x) with parameter values of a = 1.0232217656373191E+01 and b = 5.2016057362771100E+01, I calculate an R-squared value of 0.9967, which means that 99.67 percent of the variance in the y data is explained by this model. However, there is no silver bullet; it is always good to check other fit statistics and visually inspect the model. Here is my plot for the example I used:
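For reference, a minimal sketch of that calculation on the question's data, using the Michaelis-Menten parameter values quoted above (treating the model form and these values as given; curve_fit could also re-estimate them):

import numpy as np

# x_data and y_data as defined in the question
a = 1.0232217656373191e+01
b = 5.2016057362771100e+01

predicted = a * x_data / (b + x_data)        # Michaelis-Menten: y = ax / (b + x)
abs_err = y_data - predicted                 # residuals
R2 = 1.0 - (abs_err.var() / y_data.var())    # fraction of variance explained
print(R2)                                    # roughly 0.9967, per the text above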
You can take the 2-norm of the residuals between your data and the fitted function; NumPy provides np.linalg.norm for this. Note that the R-squared value is intended for linear regression.
Well, you should calculate an error metric that measures how good your fit actually is. There are many different error functions you could use, but to start, the mean squared error should work (if you're interested in further metrics, have a look at http://scikit-learn.org/stable/modules/model_evaluation.html).
You can compute the mean squared error once you have determined the coefficients for your regression problem:
import numpy as np
from sklearn.metrics import mean_squared_error

# a, b, c are the coefficients obtained from your exponential fit
f = lambda x: a * np.exp(b * x) + c
mse = mean_squared_error(y_data, f(x_data))
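If you want to compare the two candidate models head to head, one option (a sketch under my own assumptions: the functional forms and initial guesses below are illustrative, not taken from your post) is to fit both with scipy.optimize.curve_fit and compare the same metric:

import numpy as np
from scipy.optimize import curve_fit
from sklearn.metrics import mean_squared_error

def exponential(x, a, b, c):
    return a * np.exp(b * x) + c

def hyperbolic(x, a, b):
    return a * x / (b + x)

# Rough initial guesses (assumed) so both fits converge on this data
p_exp, _ = curve_fit(exponential, x_data, y_data, p0=(-7.0, -0.02, 7.0))
p_hyp, _ = curve_fit(hyperbolic, x_data, y_data, p0=(10.0, 50.0))

print("exponential MSE:", mean_squared_error(y_data, exponential(x_data, *p_exp)))
print("hyperbolic  MSE:", mean_squared_error(y_data, hyperbolic(x_data, *p_hyp)))

The model with the lower error, ideally evaluated on held-out data, is the better fit by this criterion.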
I have two arrays: X (382 samples x 37 features) and Y (382 samples x 8 values). I fit a scikit-learn Gaussian process on them.
from sklearn import gaussian_process
gp = gaussian_process.GaussianProcess()
gp.fit(X_part1, Y_part1)
Then, I want to predict the Y values corresponding to other X values. I'm particularly interested in the 95% confidence interval, in order to get a plot like in this example.
y_pred, sigma2_pred = gp.predict(X_part2, eval_MSE=True)
sigma = np.sqrt(sigma2_pred)
print X_part2.shape
print y_pred.shape
print sigma.shape
(382, 37)
(382, 8)
(382,)
But the MSE (and therefore sigma) is a 1-D array, and I don't understand why. Since my Y is a 2-D array, I expected the MSE to have the same shape, as stated in the documentation:
An array with shape (n_eval, ) or (n_eval, n_targets) as with y, with the Mean Squared Error at x.
And as a result, I don't know how to deal with this MSE to get the filled 95% confidence interval...
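For a single target, the filled band in the linked example is essentially the predicted mean plus or minus 1.96 times sigma; a minimal sketch of that (assuming 1-D y_pred and sigma) looks like:

import numpy as np
import matplotlib.pyplot as plt

idx = np.arange(len(y_pred))
plt.fill_between(idx, y_pred - 1.96 * sigma, y_pred + 1.96 * sigma, alpha=0.3)  # 95% band
plt.plot(idx, y_pred)
plt.show()

What I am missing is a per-target sigma (one value per column of Y) so that a band like this can be drawn for each of the 8 outputs.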
I've done a multivariate regression using sklearn.linear_model.LinearRegression and obtained the regression coefficients doing this:
import numpy as np
from sklearn import linear_model
clf = linear_model.LinearRegression()
TST = np.vstack([x1,x2,x3,x4])
TST = TST.transpose()
clf.fit (TST,y)
clf.coef_
Now, I need the standard errors for these same coefficients. How can I do that?
Thanks a lot.
Based on this stats question and wikipedia, my best guess is:
MSE = np.mean((y - clf.predict(TST).T) ** 2)                 # mean squared error of the fit
var_est = MSE * np.diag(np.linalg.pinv(np.dot(TST.T, TST)))  # variance estimates: MSE * diag((X'X)^-1)
SE_est = np.sqrt(var_est)                                    # standard errors of the coefficients
However, my linear algebra and stats are both quite poor, so I could be missing something important. Another option might be to bootstrap the variance estimate.
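A quick sketch of that bootstrap idea (my own, reusing only the TST and y from the question): refit on resampled rows and take the spread of the fitted coefficients as the standard error.

import numpy as np
from sklearn import linear_model

rng = np.random.RandomState(0)
n_boot = 1000
coefs = []
for _ in range(n_boot):
    idx = rng.randint(0, TST.shape[0], TST.shape[0])  # resample rows with replacement
    boot_clf = linear_model.LinearRegression().fit(TST[idx], y[idx])
    coefs.append(boot_clf.coef_)

SE_boot = np.std(coefs, axis=0)  # bootstrap standard errors of the coefficients
print(SE_boot)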
MSE = np.mean((y - clf.predict(TST).T)**2)
var_est = MSE * np.diag(np.linalg.pinv(np.dot(TST.T,TST)))
SE_est = np.sqrt(var_est)
I guess that this answer is not entirely correct.
In particular, if I am not wrong, according to your code sklearn adds the constant (intercept) term by default when computing your coefficients.
So you need to include a column of ones in your matrix TST. Then the code is correct, and it will give you an array with all the standard errors.
This code has been tested with data; it is correct.

# Build the X matrix for the dataset; n is the number of observations, m is the number of variables
X, n, m = arrays(data)
y = ***.reshape((n, 1))

linear = linear_model.LinearRegression(n_jobs=-1)  # drop n_jobs=-1 if there is only one variable
linear.fit(X, y)

# Sum of squared residuals divided by the error degrees of freedom
s = np.sum((linear.predict(X) - y) ** 2) / (n - (m - 1) - 1)

# Standard deviation: square root of the diagonal of the variance-covariance matrix
# (np.linalg.pinv uses singular value decomposition)
sd_alpha = np.sqrt(s * (np.diag(np.linalg.pinv(np.dot(X.T, X)))))

# t-statistic (use linear.intercept_ for a single variable)
t_stat_alpha = linear.intercept_[0] / sd_alpha[0]
I found that the accepted answer had some mathematical glitches that in total would require edits beyond the recommended etiquette for modifying posts. So here is a solution to compute the standard error estimate for the coefficients obtained through the linear model (using an unbiased estimate as suggested here):
# preparation
X = np.concatenate((np.ones((TST.shape[0], 1)), TST), axis=1)  # prepend a column of ones for the intercept
y_hat = clf.predict(TST).T
m, n = X.shape

# computation
MSE = np.sum((y_hat - y) ** 2) / (m - n)  # SSE divided by the error degrees of freedom
coef_var_est = MSE * np.diag(np.linalg.pinv(np.dot(X.T, X)))
coef_SE_est = np.sqrt(coef_var_est)
Note that we have to add a column of ones to TST as the original post used the linear_model.LinearRegression in a way that will fit the intercept term. Furthermore, we need to compute the mean squared error (MSE) as in ANOVA. That is, we need to divide the sum of squared errors (SSE) by the degrees of freedom for the error, i.e., df_error = df_observations - df_features.
The resulting array coef_SE_est contains the standard error estimates of the intercept and of all other coefficients in coef_SE_est[0] and coef_SE_est[1:], respectively. To print them out you could use:
print('intercept: coef={:.4f} / std_err={:.4f}'.format(clf.intercept_[0], coef_SE_est[0]))
for i, coef in enumerate(clf.coef_[0, :]):
    print('x{}: coef={:.4f} / std_err={:.4f}'.format(i + 1, coef, coef_SE_est[i + 1]))
The example from the documentation shows how to get the mean square error and explained variance score:
import numpy as np
from sklearn import datasets, linear_model

# Setup assumed from the scikit-learn diabetes example: one feature, last 20 rows held out
diabetes = datasets.load_diabetes()
diabetes_X = diabetes.data[:, np.newaxis, 2]
diabetes_X_train, diabetes_X_test = diabetes_X[:-20], diabetes_X[-20:]
diabetes_y_train, diabetes_y_test = diabetes.target[:-20], diabetes.target[-20:]

# Create linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)
# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print("Residual sum of squares: %.2f"
      % np.mean((regr.predict(diabetes_X_test) - diabetes_y_test) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % regr.score(diabetes_X_test, diabetes_y_test))
Does this cover what you need?