I was given this assignment, and I'm not sure if I understand the question correctly.
We considered the sample-mean estimator for the distribution mean. Another estimator for the distribution mean is the min-max-mean estimator that takes the mean (average) of the smallest and largest observed values. For example, for the sample {1, 2, 1, 5, 1}, the sample mean is (1+2+1+5+1)/5=2 while the min-max-mean is (1+5)/2=3. In this problem we ask you to run a simulation that approximates the mean squared error (MSE) of the two estimators for a uniform distribution.
Take a continuous uniform distribution between a and b - given as parameters. Draw a 10-observation sample from this distribution, and calculate the sample-mean and the min-max-mean. Repeat the experiment 100,000 times, and for each estimator calculate its average bias as your MSE estimates.
Sample Input: Sample_Mean_MSE(1, 5)
Sample Output: 0.1343368663225577
The code below is my attempt to:
Draw a sample of size 10 from a uniform distribution between a and b
Calculate the MSE, with the mean computed by the sample-mean method
Repeat 100,000 times, and store the resulting MSEs in an array
Return the mean of the MSE array as the final result
However, the result I get is quite far from the sample output above.
Can someone clarify the assignment, around the part "Repeat the experiment 100,000 times, and for each estimator calculate its average bias as your MSE estimates"? Thanks
import numpy as np
def Sample_Mean_MSE(a, b):
    # inputs: bounds for uniform distribution a and b
    # sample size is 10
    # number of experiments is 100,000
    # output: MSE for sample mean estimator with sample size 10
    mse_s = np.array([])
    k = 0
    while k in range(100000):
        sample = np.random.randint(low=a, high=b, size=10)
        squared_errors = np.array([])
        for i, value in enumerate(sample):
            error = value - sample.mean()
            squared_errors = np.append(squared_errors, error ** 2)
        k += 1
        mse_s = np.append(mse_s, squared_errors.mean())
    return mse_s.mean()
print(Sample_Mean_MSE(1, 5))
To get the expected result we first need to understand what the Mean squared error (MSE) of an estimator is. Take the sample-mean estimator for example (min-max-mean estimator is basically the same):
MSE assesses the average squared difference between the observed and predicted values - in this case, between the distribution mean and the sample mean. We can break it down as below:
Draw a sample of size 10 and calculate the sample mean (ŷ), repeat 100,000 times
Calculate the distribution mean: y = (a + b)/2
Calculate and return the MSE: MSE = (1/N) * Σ(y - ŷᵢ)², where N = 100,000 is the number of experiments (see the sketch below)
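For instance, a minimal vectorised sketch of those steps for the sample-mean estimator (the min-max-mean version only changes how ŷ is computed; the n and trials parameters are illustrative defaults):
import numpy as np

def Sample_Mean_MSE(a, b, n=10, trials=100_000):
    y = (a + b) / 2                                       # distribution mean of Uniform(a, b)
    samples = np.random.uniform(a, b, size=(trials, n))   # continuous uniform, not randint
    y_hat = samples.mean(axis=1)                          # sample mean of each experiment
    return np.mean((y - y_hat) ** 2)                      # average squared error over all experiments

print(Sample_Mean_MSE(1, 5))  # ≈ 0.133, i.e. Var(U(1, 5)) / 10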
Suppose I draw randomly from a normal distribution with mean zero and standard deviation represented by a vector of, say, dimension 3 with
scale_rng=np.array([1,2,3])
eps=np.random.normal(0,scale_rng)
I need to compute a weighted average based on some simulations for which I draw the above-mentioned eps. The weights of this average are "the probability of eps" (hence I will have a vector with 3 weights). By weighted average I simply mean an arithmetic sum where each component is multiplied by a weight, i.e. a number between 0 and 1, with all the weights summing to one.
The weighted average is to be calculated as follows: I have a time series of observations of one variable, x. I calculate an expanding rolling standard deviation of x (say these are the values in scale). Then, for each time observation in x, I draw a random variable eps from a normal distribution as explained above and add it, obtaining y = x + eps. Finally, I need to compute the weighted average of y, where each value of y is weighted by the "probability of drawing each value of eps from a normal distribution with mean zero and standard deviation equal to scale".
Now, I know that I cannot think of these weights as the points on the pdf at the randomly drawn values, because a normal random variable is continuous and the probability of any single point is zero. Hence, the only solution I found is to discretize the normal distribution into a certain number of bins and then find the probability that a value drawn with the code above falls in its bin. How could I do this in Python?
EDIT: the solution I found is to use
norm.cdf(eps_it+0.5, loc=0, scale=scale_rng)-norm.cdf(eps_it-0.5, loc=0, scale=scale_rng)
which is not really based on the discretization but at least it seems feasible to me "probability-wise".
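As a minimal sketch of that EDIT (the ±0.5 bin half-width is the asker's choice; eps and scale_rng follow the definitions above):
import numpy as np
from scipy.stats import norm

scale_rng = np.array([1, 2, 3])        # per-component standard deviations
eps = np.random.normal(0, scale_rng)   # one draw per component

# probability mass of a bin of width 1 centred on each drawn value
w = norm.cdf(eps + 0.5, loc=0, scale=scale_rng) - norm.cdf(eps - 0.5, loc=0, scale=scale_rng)
w /= w.sum()                           # normalise so the weights sum to one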
Here's an example leaving everything continuous.
import numpy as np
from scipy import stats
# some function we want a monte carlo estimate of
def fn(eps):
    return np.sum(np.abs(eps), axis=1)
# define distribution of eps
sd = np.array([1,2,3])
d_eps = stats.norm(0, sd)
# draw uniform samples so we don't double apply the normal density
eps = np.random.uniform(-6*sd, 6*sd, size=(10000, 3))
# calculate weights (working with log-likelihood is better for numerical stability)
w = np.prod(d_eps.pdf(eps), axis=1)
# normalise so weights sum to 1
w /= np.sum(w)
# get estimate
np.sum(fn(eps) * w)
which gives me 4.71, 4.74, 4.70, 4.78 if I run it a few times. We can verify this is correct by just using a mean when eps is drawn from a normal directly:
np.mean(fn(d_eps.rvs(size=(10000, 3))))
which gives me essentially the same values, but with lower variance, as expected: e.g. 4.79, 4.76, 4.77, 4.82, 4.80.
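Picking up the comment in the code about log-likelihoods being better for numerical stability, a small sketch of the same weights computed in log space (reusing d_eps and eps defined above):
# sum of per-component log-densities instead of a product of densities
log_w = np.sum(d_eps.logpdf(eps), axis=1)
log_w -= log_w.max()        # shift by the maximum to avoid underflow in exp
w = np.exp(log_w)
w /= w.sum()                # normalise so the weights sum to 1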
I'm sampling values from a distribution. Here it is shown how to create a confidence interval for given data.
I want to continue sampling until the confidence interval is smaller than a given interval max_error. Is there a way to estimate how many more samples I will need?
sample_list = []
max_error = 10
while True:
    sample_list.append(get_sample())
    # See https://stackoverflow.com/a/34474255
    interval = scipy.stats.t.interval(0.95, len(sample_list) - 1, loc=np.mean(sample_list), scale=scipy.stats.sem(sample_list))
    estimated_error = abs(interval[1] - interval[0]) / 2
    estimated_required_samples = ???  # How to calculate this?
    print(f"{len(sample_list)}/{estimated_required_samples} measurements, mean: {np.mean(sample_list)} +/- {estimated_error}")
    if estimated_error <= max_error:
        return np.mean(sample_list)
Some formula is given on Wikipedia, but it requires knowledge of the variance, which is still being estimated while sampling.
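Not from the original post, but one common back-of-the-envelope estimate is to solve the CI half-width formula t·s/√n ≤ max_error for n using the current standard-deviation estimate; a minimal sketch (the function name is illustrative):
import numpy as np
import scipy.stats

def estimate_required_samples(sample_list, max_error, confidence=0.95):
    # rough total sample size needed for the half-width to drop below max_error,
    # based on the current (still noisy) standard-deviation estimate
    n = len(sample_list)
    s = np.std(sample_list, ddof=1)
    t = scipy.stats.t.ppf((1 + confidence) / 2, df=n - 1)
    return int(np.ceil((t * s / max_error) ** 2))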
I am trying to compute weighted mean squared error for my regression problem. I have y_true, y_predicted, and y_wts numpy arrays. Each array is shaped (N,1) where N is the number of samples. I don't understand why the following 2 pieces of code give different answers:
1st code segment
import numpy as np
sq_error = (y_true-y_predicted)**2
wtd_sq_error = np.multiply(sq_error,y_wts)
wtd_mse = np.mean(wtd_sq_error)
2nd code segment taken from sklearn metrics mean_squared_error function
wtd_mse_sklearn = np.average((y_true - y_predicted)**2, axis=0,
                             weights=y_wts)
I came to test this owing to a mismatch between TensorFlow's weighted mean squared error and sklearn metrics' mean squared error (with a weight column specified). Note that this mismatch doesn't occur when I don't specify a weight column.
Thanks for your help!
Because you forgot to divide by the weights:
np.mean(sq_error * w) = sum(error_i * weight_i for all i) / N
while
np.average(sq_error, weights=w) = sum(error_i * weight_i for all i) / sum(weight_i for all i)
The formula for the weighted average in your 1st code segment is wrong; it should be:
wtd_mse = np.sum(sq_error * y_wts) / np.sum(y_wts)
instead of:
wtd_mse = np.mean(wtd_sq_error)
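A small sketch on made-up arrays illustrating the difference (names and sizes are arbitrary):
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.normal(size=(5, 1))
y_predicted = rng.normal(size=(5, 1))
y_wts = rng.uniform(0.1, 1.0, size=(5, 1))

sq_error = (y_true - y_predicted) ** 2

wtd_mse = np.mean(sq_error * y_wts)                                      # divides by N
wtd_mse_sklearn = np.average(sq_error, axis=0, weights=y_wts[:, 0])[0]   # divides by the weight sum
wtd_mse_manual = np.sum(sq_error * y_wts) / np.sum(y_wts)

print(wtd_mse, wtd_mse_sklearn, wtd_mse_manual)  # the last two agree; the first generally does not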
I have two arrays: X (382 samples x 37 features) and Y (382 samples x 8 values). I fit an sklearn Gaussian process to them.
from sklearn import gaussian_process
gp = gaussian_process.GaussianProcess()
gp.fit(X_part1, Y_part1)
In a second step, I want to predict Y values for other X values. I'm particularly interested in the 95% confidence interval, in order to produce a plot like in this example.
y_pred, sigma2_pred = gp.predict(X_part2, eval_MSE=True)
sigma = np.sqrt(sigma2_pred)
print(X_part2.shape)
print(y_pred.shape)
print(sigma.shape)
(382, 37)
(382, 8)
(382,)
But the MSE (and therefore sigma) is a 1-D array, and I don't understand why. Since my Y is a 2-D array, I thought the MSE would have the same shape, as stated in the documentation:
An array with shape (n_eval, ) or (n_eval, n_targets) as with y, with the Mean Squared Error at x.
And as a result, I don't know how to deal with this MSE to get the filled 95% confidence interval...
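For reference, a 95% interval is typically built as roughly y_pred ± 1.96·sigma; with the 1-D sigma returned here one would have to broadcast it across the 8 targets, which assumes the single per-point MSE applies to every target (an assumption, not something the documentation states):
# broadcast the per-point sigma across all 8 target columns (assumed valid here)
lower = y_pred - 1.96 * sigma[:, np.newaxis]
upper = y_pred + 1.96 * sigma[:, np.newaxis]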
I've done a multivariate regression using sklearn.linear_model.LinearRegression and obtained the regression coefficients doing this:
import numpy as np
from sklearn import linear_model
clf = linear_model.LinearRegression()
TST = np.vstack([x1,x2,x3,x4])
TST = TST.transpose()
clf.fit(TST, y)
clf.coef_
Now, I need the standard errors for these same coefficients. How can I do that?
Thanks a lot.
Based on this stats question and wikipedia, my best guess is:
MSE = np.mean((y - clf.predict(TST).T)**2)
var_est = MSE * np.diag(np.linalg.pinv(np.dot(TST.T,TST)))
SE_est = np.sqrt(var_est)
However, my linear algebra and stats are both quite poor, so I could be missing something important. Another option might be to bootstrap the variance estimate.
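Since bootstrapping is mentioned as an alternative, here is a minimal sketch of that idea (resampling rows with replacement; the helper name and the 1000 iterations are arbitrary choices):
import numpy as np
from sklearn import linear_model

def bootstrap_coef_se(TST, y, n_boot=1000, seed=0):
    # standard errors of the coefficients estimated from their spread across resamples
    rng = np.random.default_rng(seed)
    n = TST.shape[0]
    coefs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)   # sample rows with replacement
        clf = linear_model.LinearRegression().fit(TST[idx], y[idx])
        coefs.append(clf.coef_)
    return np.std(coefs, axis=0, ddof=1)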
MSE = np.mean((y - clf.predict(TST).T)**2)
var_est = MSE * np.diag(np.linalg.pinv(np.dot(TST.T,TST)))
SE_est = np.sqrt(var_est)
I think this answer is not entirely correct.
In particular, if I am not mistaken, with your code sklearn adds the constant (intercept) term by default when computing the coefficients.
You therefore need to include a column of ones in your matrix TST. Then the code is correct and will give you an array with all the standard errors.
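For example, the column of ones can be added like this before applying the formula above (a small sketch, reusing TST, y, and clf from the question):
import numpy as np

# prepend an intercept column of ones so the constant term gets a standard error too
TST_const = np.column_stack([np.ones(TST.shape[0]), TST])
MSE = np.mean((y - clf.predict(TST).T) ** 2)
var_est = MSE * np.diag(np.linalg.pinv(np.dot(TST_const.T, TST_const)))
SE_est = np.sqrt(var_est)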
This code has been tested with data; it is correct.
Find the X matrix for each data set; n is the length of the dataset, m is the number of variables.
X, n, m = arrays(data)
y = ***.reshape((n, 1))
linear = linear_model.LinearRegression(n_jobs=-1)  # drop n_jobs=-1 if there is only one variable
linear.fit(X, y)
Sum of squared residuals divided by the error degrees of freedom:
s = np.sum((linear.predict(X) - y) ** 2) / (n - (m - 1) - 1)
Standard deviation: square root of the diagonal of the variance-covariance matrix (via the pseudo-inverse / singular value decomposition):
sd_alpha = np.sqrt(s * np.diag(np.linalg.pinv(np.dot(X.T, X))))
t-statistic (use linear.intercept_ if there is only one variable):
t_stat_alpha = linear.intercept_[0] / sd_alpha[0]
I found that the accepted answer had some mathematical glitches that in total would require edits beyond the recommended etiquette for modifying posts. So here is a solution to compute the standard error estimate for the coefficients obtained through the linear model (using an unbiased estimate as suggested here):
# preparation
X = np.concatenate((np.ones((TST.shape[0], 1)), TST), axis=1)
y_hat = clf.predict(TST).T
m, n = X.shape
# computation
MSE = np.sum((y_hat - y)**2)/(m - n)
coef_var_est = MSE * np.diag(np.linalg.pinv(np.dot(X.T,X)))
coef_SE_est = np.sqrt(coef_var_est)
Note that we have to add a column of ones to TST as the original post used the linear_model.LinearRegression in a way that will fit the intercept term. Furthermore, we need to compute the mean squared error (MSE) as in ANOVA. That is, we need to divide the sum of squared errors (SSE) by the degrees of freedom for the error, i.e., df_error = df_observations - df_features.
The resulting array coef_SE_est contains the standard error estimates of the intercept and all other coefficients in coef_SE_est[0] and coef_SE_est[1:] resp. To print them out you could use
print('intercept: coef={:.4f} / std_err={:.4f}'.format(clf.intercept_[0], coef_SE_est[0]))
for i, coef in enumerate(clf.coef_[0, :]):
    print('x{}: coef={:.4f} / std_err={:.4f}'.format(i + 1, coef, coef_SE_est[i + 1]))
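As a cross-check (not mentioned in the thread, just an alternative): statsmodels computes the same standard errors directly:
import statsmodels.api as sm

# OLS on TST with an explicit intercept column; bse holds the standard errors
ols = sm.OLS(y, sm.add_constant(TST)).fit()
print(ols.bse)   # intercept first, then the coefficients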
The example from the documentation shows how to get the mean square error and explained variance score:
# Create linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)
# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean square error
print("Residual sum of squares: %.2f"
% np.mean((regr.predict(diabetes_X_test) - diabetes_y_test) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % regr.score(diabetes_X_test, diabetes_y_test))
Does this cover what you need?