Gaussian Process with scikitlearn - 95% confidence interval - python

I have two arrays : X (382 samples x 37 features) and Y (382 samples x 8 values). I fit a sklearn gaussian process on it.
from sklearn import gaussian_process
gp = gaussian_process.GaussianProcess()
gp.fit(X_part1, Y_part1)
Then I want to predict the Y values for other X values. I'm particularly interested in the 95% confidence interval, in order to produce a plot like the one in this example.
y_pred, sigma2_pred = gp.predict(X_part2, eval_MSE=True)
sigma = np.sqrt(sigma2_pred)
print X_part2.shape
print y_pred.shape
print sigma.shape
(382, 37)
(382, 8)
(382,)
The problem is that the MSE (and therefore sigma) is a 1-D array, and I don't understand why. Since my Y is a 2-D array, I expected the MSE to have the same shape, as stated in the documentation:
An array with shape (n_eval, ) or (n_eval, n_targets) as with y, with the Mean Squared Error at x.
As a result, I don't know how to use this MSE to draw the filled 95% confidence interval...
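For later readers, here is a minimal sketch of how such an interval can be built, assuming the predictive distribution at each point is Gaussian with mean y_pred and standard deviation sigma. The per-sample sigma is simply broadcast over the 8 target columns; the x-axis variable and target index below are placeholders, not part of the question:
import numpy as np
import matplotlib.pyplot as plt

# y_pred: (n_eval, 8) predicted means; sigma: (n_eval,) predictive std per sample
lower = y_pred - 1.96 * sigma[:, np.newaxis]  # 95% interval: mean -/+ 1.96 * std
upper = y_pred + 1.96 * sigma[:, np.newaxis]

target = 0                       # which of the 8 outputs to plot (placeholder)
x_axis = np.arange(len(y_pred))  # placeholder x-axis; use your own ordering variable
plt.plot(x_axis, y_pred[:, target], 'b-', label='prediction')
plt.fill_between(x_axis, lower[:, target], upper[:, target],
                 alpha=0.3, label='95% confidence interval')
plt.legend()
plt.show()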

Related

Getting a negative R-squared value with curve_fit()

I've read a related post on manually calculating R-squared values after using scipy.optimize.curve_fit(). However, that post calculates an R-squared value for a function that follows a power law (f(x) = a*x^b). I'm trying to do the same but I get negative R-squared values.
Here is my code:
import numpy as np
from scipy.optimize import curve_fit

def powerlaw(x, a, b):
    '''Generic power law function.'''
    return a * x**b
X = s_lt[4:] # independent variable (Pandas series)
Y = s_lm[4:] # dependent variable (Pandas series)
popt, pcov = curve_fit(powerlaw, X, Y)
residuals = Y - powerlaw(X, *popt)
ss_res = np.sum(residuals**2) # residual sum of squares
ss_tot = np.sum((Y-np.mean(Y))**2) # total sum of squares
r_squared = 1 - (ss_res / ss_tot) # r-squared value
print("R-squared of power-law fit = ", str(r_squared))
I got an R-squared value of -0.057....
From my understanding, it's not good to use R-squared values for non-linear functions, but I expected to get a much higher R-squared value than a linear model due to overfitting. Did something else go wrong?
See "The R-squared and nonlinear regression: a difficult marriage?" and also "When is R squared negative?".
Basically, we have two problems:
nonlinear models do not have an intercept term, at least, not in the usual sense;
the equality SS_tot = SS_reg + SS_res may not hold.
The first reference above calls your statistic a "pseudo-R-squared" (in the case of non-linear models), and notes that it may be lower than 0.
To further understand what's going on you probably want to plot your data Y as a function of X, the predicted values from the power law as a function of X, and the residuals as a function of X.
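For example, a quick diagnostic plot along those lines could look like the sketch below (assuming X, Y, popt and powerlaw from the question are already defined):
import numpy as np
import matplotlib.pyplot as plt

x_arr = np.asarray(X)
y_arr = np.asarray(Y)
pred = powerlaw(x_arr, *popt)   # fitted power-law predictions
residuals = y_arr - pred
order = np.argsort(x_arr)       # sort by X so the fitted curve plots cleanly

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(x_arr, y_arr, label='data')
axes[0].plot(x_arr[order], pred[order], 'r-', label='power-law fit')
axes[0].set_xlabel('X')
axes[0].set_ylabel('Y')
axes[0].legend()
axes[1].scatter(x_arr, residuals)
axes[1].axhline(0, color='k', linewidth=1)
axes[1].set_xlabel('X')
axes[1].set_ylabel('residuals')
plt.tight_layout()
plt.show()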
For non-linear models I have sometimes calculated the sum of squared deviations from zero, to examine how much of that is explained by the model. Something like this:
pred = powerlaw(X, *popt)
ss_total = np.sum(Y**2) # Not deviation from mean.
ss_resid = np.sum((Y - pred)**2)
pseudo_r_squared = 1 - ss_resid/ss_total
Calculated this way, pseudo_r_squared can potentially be negative (if the model is really bad, worse than just guessing the data are all 0), but if pseudo_r_squared is positive I interpret it as the amount of "variation from 0" explained by the model.

Can a good model have a low R square value?

I made a linear regression model using scikit-learn.
The mean squared error on the test data is very low (0.09),
but the r2_score on the test data is also very low (0.05).
As far as I know, a low mean squared error means the model is good, but such a low r2_score suggests the model is not good.
So I can't tell whether my regression model is good or not.
Can a good model have a low R-squared value, or can a bad model have a low mean squared error?
R^2 is a measure of how well your fit represents the data.
Let's say your data has a linear trend plus some noise. We can construct such data and see how R^2 changes:
Data
I'm going to create some data using numpy:
import numpy as np

xs = np.random.randint(10, 1000, 2000)
ys = (3 * xs + 8) + np.random.randint(5, 10, 2000)
Fit
Now we can create a fit object using scikit-learn:
from sklearn.linear_model import LinearRegression

reg = LinearRegression().fit(xs.reshape(-1, 1), ys.reshape(-1, 1))
And we can get the score from this fit.
reg.score(xs.reshape(-1, 1), ys.reshape(-1, 1))
My R^2 was: 0.9999971914416896
Bad data
Let's say we have a more scattered data set (with more noise on it).
ys2 = (3 * xs + 8) + np.random.randint(500, 1000, 2000)
Now we can calculate the score on ys2 to see how well our existing fit represents the xs, ys2 data:
reg.score(xs.reshape(-1, 1), ys2.reshape(-1, 1))
My R^2 was: 0.2377175028951054
The score is low. We know the trend of the data did not change; it is still 3x + 8 + (noise). But ys2 is further away from the fit.
So, R^2 is an indicator of how well your fit represents the data, but the condition of the data itself is also important. Even with a low score, the fit you get may be the best possible one, simply because the data is scattered due to noise; see the sketch below.
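To make that last point concrete, here is a small sketch with my own toy data (not from the question above): when the noise variance dominates the signal variance, even a correctly specified model fitted directly to the data gets a low R^2, and yet it is the best fit available.
import numpy as np
from sklearn.linear_model import LinearRegression

xs3 = np.random.rand(2000)                          # small x range => small signal variance
ys3 = 3 * xs3 + 8 + np.random.normal(0, 10, 2000)   # large noise compared to the signal

reg3 = LinearRegression().fit(xs3.reshape(-1, 1), ys3.reshape(-1, 1))
print(reg3.score(xs3.reshape(-1, 1), ys3.reshape(-1, 1)))  # close to 0, although the model form is correct
print(reg3.coef_, reg3.intercept_)                         # still unbiased estimates of the true 3 and 8, just noisy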

Calculating Mean Squared Error with Sample Mean

I was given this assignment, and I'm not sure if I understand the question correctly.
We considered the sample-mean estimator for the distribution mean. Another estimator for the distribution mean is the min-max-mean estimator that takes the mean (average) of the smallest and largest observed values. For example, for the sample {1, 2, 1, 5, 1}, the sample mean is (1+2+1+5+1)/5=2 while the min-max-mean is (1+5)/2=3. In this problem we ask you to run a simulation that approximates the mean squared error (MSE) of the two estimators for a uniform distribution.
Take a continuous uniform distribution between a and b - given as parameters. Draw a 10-observation sample from this distribution, and calculate the sample-mean and the min-max-mean. Repeat the experiment 100,000 times, and for each estimator calculate its average bias as your MSE estimates.
Sample Input: Sample_Mean_MSE(1, 5)
Sample Output: 0.1343368663225577
The code below is my attempt to:
Draw a sample of size 10 from a uniform distribution between a and b,
calculate the MSE, with the mean calculated using the sample-mean method,
repeat this 100,000 times, storing the resulting MSEs in an array, and
return the mean of that MSE array as the final result.
However, the result I get is quite far from the sample output above.
Can someone clarify the assignment, around the part "Repeat the experiment 100,000 times, and for each estimator calculate its average bias as your MSE estimates"? Thanks
import numpy as np

def Sample_Mean_MSE(a, b):
    # inputs: bounds for uniform distribution a and b
    # sample size is 10
    # number of experiments is 100,000
    # output: MSE for sample mean estimator with sample size 10
    mse_s = np.array([])
    k = 0
    while k in range(100000):
        sample = np.random.randint(low=a, high=b, size=10)
        squared_errors = np.array([])
        for i, value in enumerate(sample):
            error = value - sample.mean()
            squared_errors = np.append(squared_errors, error ** 2)
        k += 1
        mse_s = np.append(mse_s, squared_errors.mean())
    return mse_s.mean()

print(Sample_Mean_MSE(1, 5))
To get the expected result, we first need to understand what the mean squared error (MSE) of an estimator is. Take the sample-mean estimator as an example (the min-max-mean estimator works the same way):
The MSE measures the average squared difference between the estimated and true values - in this case, between the sample mean and the distribution mean. We can break it down as below (see the sketch after the steps):
Draw a sample of size 10 and calculate the sample mean (ŷ), repeat 100,000 times
Calculate the distribution mean: y = (a + b)/2
Calculate and return the MSE: MSE = 1/n * Σ(y - ŷ)^2
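A minimal sketch of those steps (the function name and vectorized layout are my own; it also covers the min-max-mean estimator mentioned in the assignment):
import numpy as np

def estimator_mse(a, b, n_samples=10, n_experiments=100000):
    # true mean of the continuous uniform distribution on [a, b]
    true_mean = (a + b) / 2
    # one row per experiment, one column per observation
    samples = np.random.uniform(a, b, size=(n_experiments, n_samples))
    # the two estimators, evaluated for every experiment at once
    sample_means = samples.mean(axis=1)
    minmax_means = (samples.min(axis=1) + samples.max(axis=1)) / 2
    # MSE = average squared deviation from the true distribution mean
    mse_sample_mean = np.mean((sample_means - true_mean) ** 2)
    mse_minmax_mean = np.mean((minmax_means - true_mean) ** 2)
    return mse_sample_mean, mse_minmax_mean

print(estimator_mse(1, 5))  # the first value should land close to the sample output of about 0.134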

Return an array containing the squared errors between all predicted_prices and the actual prices (from the dataset)

I'm doing an exercise and I'm stuck.
Here's what I have to do:
I've been given a function to implement which has 4 arguments.
def squared_errors(slope, intercept, surfaces, prices):
A friend and I tried to get that function to work, but neither of us found a solution.
Basically, I have been given a dataset, and to make sure our estimator line is the best possible one, we need to compute the Mean Squared Error between price and predicted_price (slope * surface + intercept). The dataset is a vector of shape (1000, 1).
For each row, we should evaluate the squared error (predicted_price - price)**2.
But my brain is just numb and I can't come up with a solution; any help would be greatly appreciated!
Given the slope and the intercept, for any data point x you can get its prediction as slope*x + intercept (or, more generally, as slope^T.X + intercept when it is vectorized).
Now that we have the predictions, and since we also have the actual ground truth, we can measure how good or bad the predictions are using the root mean squared error (RMSE): the square root of the mean of the squared differences between each prediction and the corresponding ground truth.
Sample (documented inline)
import numpy as np
# Actual slope
slope = 2
# Actual intercept
intercept = 0
# Some data
X = np.random.rand(10,1)
# The ground truth
prices = slope*X + intercept
# Loss
def squared_errors(slope, intercept, surfaces, prices):
    y_hat = slope * surfaces + intercept
    return np.sqrt(np.mean((y_hat - prices)**2))

# perfect prediction
print(squared_errors(2, 0, X, prices))
# imperfect predictions
print(squared_errors(2, 0.5, X, prices))
print(squared_errors(1, 0, X, prices))
Output:
0.0
0.5
0.6286343914881158
As you can see, for the perfect prediction the error is 0, and it is nonzero for the rest, depending on how far away, on average, the predictions are from the ground truth.
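Note that the exercise title asks for an array containing the squared errors for every row rather than a single aggregated number; if that is what the grader expects, a variant along these lines (my assumption about the expected return value) would do it:
def squared_errors(slope, intercept, surfaces, prices):
    # per-row squared errors, same shape as prices (e.g. (1000, 1))
    predicted_prices = slope * surfaces + intercept
    return (predicted_prices - prices) ** 2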

Standard errors for multivariate regression coefficients

I've done a multivariate regression using sklearn.linear_model.LinearRegression and obtained the regression coefficients doing this:
import numpy as np
from sklearn import linear_model
clf = linear_model.LinearRegression()
TST = np.vstack([x1,x2,x3,x4])
TST = TST.transpose()
clf.fit (TST,y)
clf.coef_
Now, I need the standard errors for these same coefficients. How can I do that?
Thanks a lot.
Based on this stats question and wikipedia, my best guess is:
MSE = np.mean((y - clf.predict(TST).T)**2)
var_est = MSE * np.diag(np.linalg.pinv(np.dot(TST.T,TST)))
SE_est = np.sqrt(var_est)
However, my linear algebra and stats are both quite poor, so I could be missing something important. Another option might be to bootstrap the variance estimate.
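For what it's worth, here is a rough sketch of that bootstrap option (resampling rows with replacement and refitting; the number of resamples is an arbitrary choice, and TST and y are assumed to be NumPy arrays as above):
import numpy as np
from sklearn import linear_model

n_boot = 1000
rng = np.random.default_rng(0)
boot_coefs = []
for _ in range(n_boot):
    idx = rng.integers(0, TST.shape[0], size=TST.shape[0])  # resample rows with replacement
    boot_clf = linear_model.LinearRegression().fit(TST[idx], y[idx])
    boot_coefs.append(boot_clf.coef_)
SE_boot = np.std(np.array(boot_coefs), axis=0)  # bootstrap standard errors of the coefficients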
Regarding this computation from the answer above:
MSE = np.mean((y - clf.predict(TST).T)**2)
var_est = MSE * np.diag(np.linalg.pinv(np.dot(TST.T,TST)))
SE_est = np.sqrt(var_est)
I think it is not entirely correct.
In particular, if I am not wrong, according to your code sklearn adds the constant (intercept) term by default when computing your coefficients.
You therefore need to include a column of ones in your matrix TST. Then the code is correct and will give you an array with all the standard errors.
The following code has been tested with data; it is correct.
Build the X matrix for each data set; n is the length of the dataset, m is the number of variables:
X, n, m = arrays(data)
y = ***.reshape((n, 1))
linear = linear_model.LinearRegression()
linear.fit(X, y, n_jobs=-1)  # delete n_jobs=-1 if there is only one variable
Sum of squared residuals divided by the degrees of freedom:
s = np.sum((linear.predict(X) - y) ** 2) / (n - (m - 1) - 1)
Standard deviations: the square root of the diagonal of the variance-covariance matrix (computed via the pseudo-inverse, i.e. singular value decomposition):
sd_alpha = np.sqrt(s * (np.diag(np.linalg.pinv(np.dot(X.T, X)))))
t-statistic (using linear.intercept_ for the one-variable case):
t_stat_alpha = linear.intercept_[0] / sd_alpha[0]  # use linear.intercept_ for one variable
I found that the accepted answer had some mathematical glitches that in total would require edits beyond the recommended etiquette for modifying posts. So here is a solution to compute the standard error estimate for the coefficients obtained through the linear model (using an unbiased estimate as suggested here):
# preparation
X = np.concatenate((np.ones((TST.shape[0], 1)), TST), axis=1)
y_hat = clf.predict(TST).T
m, n = X.shape
# computation
MSE = np.sum((y_hat - y)**2)/(m - n)
coef_var_est = MSE * np.diag(np.linalg.pinv(np.dot(X.T,X)))
coef_SE_est = np.sqrt(coef_var_est)
Note that we have to add a column of ones to TST as the original post used the linear_model.LinearRegression in a way that will fit the intercept term. Furthermore, we need to compute the mean squared error (MSE) as in ANOVA. That is, we need to divide the sum of squared errors (SSE) by the degrees of freedom for the error, i.e., df_error = df_observations - df_features.
The resulting array coef_SE_est contains the standard error estimates of the intercept and of all other coefficients in coef_SE_est[0] and coef_SE_est[1:], respectively. To print them out you could use
print('intercept: coef={:.4f} / std_err={:.4f}'.format(clf.intercept_[0], coef_SE_est[0]))
for i, coef in enumerate(clf.coef_[0,:]):
    print('x{}: coef={:.4f} / std_err={:.4f}'.format(i+1, coef, coef_SE_est[i+1]))
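If you want to sanity-check these numbers, one option (assuming the statsmodels package is available) is to compare them against its OLS results, which report the same standard errors:
import statsmodels.api as sm

# add_constant prepends the intercept column, matching the column of ones above
ols_fit = sm.OLS(y, sm.add_constant(TST)).fit()
print(ols_fit.bse)       # standard errors: intercept first, then the other coefficients
print(ols_fit.summary())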
The example from the documentation shows how to get the mean square error and explained variance score:
# Create linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)
# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean square error
print("Residual sum of squares: %.2f"
% np.mean((regr.predict(diabetes_X_test) - diabetes_y_test) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % regr.score(diabetes_X_test, diabetes_y_test))
Does this cover what you need?
