numpy weighted average for calculating weighted mean squared error - python

I am trying to compute weighted mean squared error for my regression problem. I have y_true, y_predicted, and y_wts numpy arrays. Each array is shaped (N,1) where N is the number of samples. I don't understand why the following 2 pieces of code give different answers:
1st code segment
import numpy as np
sq_error = (y_true-y_predicted)**2
wtd_sq_error = np.multiply(sq_error,y_wts)
wtd_mse = np.mean(wtd_sq_error)
2nd code segment taken from sklearn metrics mean_squared_error function
wtd_mse_sklearn = np.average((y_true - y_predicted)**2, axis =0,
weights=y_wts)
I came to test this owing to a mismatch between the TensorFlow weighted mean squared error and the sklearn metrics mean squared error (with a weight column specified). Note that this mismatch doesn't occur when I don't specify a weight column.
Thanks for your help!

Because np.mean divides by the number of samples, not by the sum of the weights:
np.mean = sum(error_i * weight_i ∀ i) / len(error_i ∀ i)
while
np.average = sum(error_i * weight_i ∀ i) / sum(weight_i ∀ i)

The formula for the weighted average in your 1st code segment is wrong. It should be:
wtd_mse = np.sum(sq_error * y_wts) / np.sum(y_wts)
instead of:
wtd_mse = np.mean(wtd_sq_error)
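A minimal sketch of the difference, using made-up data (the arrays, the seed, and the sizes are assumptions for illustration, not the asker's data). It shows that dividing by the sum of the weights reproduces np.average, while np.mean of the weighted errors is off by a factor equal to the mean weight:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed; synthetic data for illustration only
n = 1000
y_true = rng.normal(size=(n, 1))
y_predicted = y_true + rng.normal(scale=0.5, size=(n, 1))
y_wts = rng.uniform(0.1, 3.0, size=(n, 1))

sq_error = (y_true - y_predicted) ** 2

# 1st code segment: divides by the number of samples
mean_of_products = np.mean(sq_error * y_wts)

# Corrected formula: divides by the sum of the weights
manual_weighted = np.sum(sq_error * y_wts) / np.sum(y_wts)

# 2nd code segment: np.average normalizes by the sum of the weights
numpy_weighted = np.average(sq_error, axis=0, weights=y_wts)[0]

print(mean_of_products, manual_weighted, numpy_weighted)
```

The two results only coincide when the weights average to 1, which is why the mismatch disappears when no weights are given.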

Related

Calculating Mean Squared Error with Sample Mean

I was given this assignment, and I'm not sure if i understand the question correctly.
We considered the sample-mean estimator for the distribution mean. Another estimator for the distribution mean is the min-max-mean estimator that takes the mean (average) of the smallest and largest observed values. For example, for the sample {1, 2, 1, 5, 1}, the sample mean is (1+2+1+5+1)/5=2 while the min-max-mean is (1+5)/2=3. In this problem we ask you to run a simulation that approximates the mean squared error (MSE) of the two estimators for a uniform distribution.
Take a continuous uniform distribution between a and b - given as parameters. Draw a 10-observation sample from this distribution, and calculate the sample-mean and the min-max-mean. Repeat the experiment 100,000 times, and for each estimator calculate its average bias as your MSE estimates.
Sample Input: Sample_Mean_MSE(1, 5)
Sample Output: 0.1343368663225577
This code below is me trying to:
Draw a sample of size 10 from a uniform distribution of a and b
calculate MSE, with mean calculated with Sample Mean method.
Repeat 100,000 times, and store the result MSEs in an array
Return the mean of the MSEs array, as the final result
However, the result I get was quite far from the sample output above.
Can someone clarify the assignment, around the part "Repeat the experiment 100,000 times, and for each estimator calculate its average bias as your MSE estimates"? Thanks
import numpy as np
def Sample_Mean_MSE(a, b):
    # inputs: bounds for uniform distribution a and b
    # sample size is 10
    # number of experiments is 100,000
    # output: MSE for sample mean estimator with sample size 10
    mse_s = np.array([])
    k = 0
    while k in range(100000):
        sample = np.random.randint(low=a, high=b, size=10)
        squared_errors = np.array([])
        for i, value in enumerate(sample):
            error = value - sample.mean()
            squared_errors = np.append(squared_errors, error ** 2)
        k += 1
        mse_s = np.append(mse_s, squared_errors.mean())
    return mse_s.mean()
print(Sample_Mean_MSE(1, 5))
To get the expected result, we first need to understand what the mean squared error (MSE) of an estimator is. Take the sample-mean estimator as an example (the min-max-mean estimator works the same way):
The MSE of an estimator is the average squared difference between the estimate and the true value being estimated; in this case, between the sample mean and the distribution mean. Your code instead measures the spread of the observations around their own sample mean. We can break the calculation down as follows:
Draw a sample of size 10 and calculate the sample mean (ŷ), repeat 100,000 times
Calculate the distribution mean: y = (a + b)/2
Calculate and return the MSE: MSE = 1/n * Σ(y - ŷ)^2
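The steps above can be sketched as follows. The function names, the seed parameter, and the vectorized layout are my own choices, not part of the assignment. Note that the sketch draws from a continuous uniform distribution with rng.uniform, whereas np.random.randint in the question draws integers. For the sample mean, the theoretical value is (b - a)^2 / (12 n), i.e. about 0.1333 for a=1, b=5, n=10, which is close to the sample output.

```python
import numpy as np

def sample_mean_mse(a, b, sample_size=10, n_experiments=100_000, seed=None):
    """Approximate the MSE of the sample-mean estimator by simulation."""
    rng = np.random.default_rng(seed)
    # One row per experiment, one column per observation
    samples = rng.uniform(a, b, size=(n_experiments, sample_size))
    estimates = samples.mean(axis=1)
    true_mean = (a + b) / 2
    return np.mean((estimates - true_mean) ** 2)

def min_max_mean_mse(a, b, sample_size=10, n_experiments=100_000, seed=None):
    """Approximate the MSE of the min-max-mean estimator by simulation."""
    rng = np.random.default_rng(seed)
    samples = rng.uniform(a, b, size=(n_experiments, sample_size))
    estimates = (samples.min(axis=1) + samples.max(axis=1)) / 2
    true_mean = (a + b) / 2
    return np.mean((estimates - true_mean) ** 2)

sample_mean_result = sample_mean_mse(1, 5, seed=0)
min_max_result = min_max_mean_mse(1, 5, seed=0)
print(sample_mean_result, min_max_result)
```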

How to pre-process the data to calculate Root Mean Squared Logarithmic Error?

I'm trying to calculate the Root Mean Squared Logarithmic Error for which I have found few options, one is to use the sklearn metric: mean_squared_log_error and take its square root
np.sqrt(mean_squared_log_error( target, predicted_y ))
But I get the following error:
Mean Squared Logarithmic Error cannot be used when targets contain negative values
I have also tried a solution from a Kaggle post:
import math
#A function to calculate Root Mean Squared Logarithmic Error (RMSLE)
def rmsle(y, y_pred):
assert len(y) == len(y_pred)
terms_to_sum = [(math.log(y_pred[i] + 1) - math.log(y[i] + 1)) ** 2.0 for i,pred in enumerate(y_pred)]
return (sum(terms_to_sum) * (1.0/len(y))) ** 0.5
Same issue, this time I get a domain error.
In the same post they comment the following regarding the negative log issue:
You're right. You have to transform y_pred and y_test to make sure they don't carry negative values.
In my case, when predicting weather temperature (originally in Celsius degrees), the solution was to convert them to Kelvin degrees before calculating the RMSLE:
rmsle(data.temp_pred + 273.15, data.temp_real + 273.15)
Is there a standard way to use this metric that works with negative values?
Normalize both arrays to the range 0 to 1.
If you're using scikit-learn, you can use sklearn.preprocessing.minmax_scale:
minmax_scale(arr, feature_range=(0,1))
Before you do this, save the max and min values of arr so that you can recover the original values later. Apply the same min and max to both arrays; otherwise the targets and predictions end up on different scales.
E.g.:
normalized = (value - arr.min()) / (arr.max() - arr.min()) # Illustration
There is no standard form that allows negative values because the log of a negative number is undefined. You either have to transform your data like the temperature example (set your lowest value to 0 and scale), or consider why you are using RMSLE and if it really is the right metric.
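A minimal sketch of the shift-then-score approach, assuming temperatures in Celsius as in the Kaggle comment quoted in the question. The rmsle helper and the sample values below are hypothetical. Keep in mind that the shift changes the value of the metric, so scores are only comparable when the same transformation is applied throughout.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error for non-negative inputs."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")
    if (y_true < 0).any() or (y_pred < 0).any():
        raise ValueError("RMSLE is undefined for negative values; shift or rescale first")
    # log1p(x) computes log(1 + x)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# Hypothetical temperatures in degrees Celsius (illustrative values only)
temp_real_c = np.array([-5.0, 0.0, 12.5, 30.0])
temp_pred_c = np.array([-3.0, 1.0, 10.0, 28.0])

# Shift both arrays by the same constant (Celsius to Kelvin) so nothing is negative
score = rmsle(temp_real_c + 273.15, temp_pred_c + 273.15)
print(score)
```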
I had a similar problem: one of the predictions was negative, although all of the training target values were positive. I narrowed this down to outliers and solved it by using the RobustScaler from sklearn, which not only scales the data but also copes with outliers. From its documentation:
Scale features using statistics that are robust to outliers.
Feature scaling should be a good option here, such that the minimum value is >= 0.
Use a min-max scaler to scale your values into the range [0, x], where x is anything you choose, and compute the metric on the scaled values.

Can't figure out how to print the least squares error

I wrote some code to find the best fitting line for a couple of data points using the analytical solution to least squares. Now I would like to print the error between the actual data and my estimated line, but I have no idea how to compute it. Here is my code:
import numpy as np
import matplotlib.pyplot as plt
A = np.array(((0,1),
(1,1),
(2,1),
(3,1)))
b = np.array((1,2,0,3), ndmin = 2 ).T
xstar = np.matmul( np.matmul( np.linalg.inv( np.matmul(A.T, A) ), A.T), b)
print(xstar)
plt.scatter(A.T[0], b)
u = np.linspace(0,3,20)
plt.plot(u, u * xstar[0] + xstar[1], 'b-')
You have already computed the fitted line. To measure the error, evaluate it at the data points (the first column of A) rather than on the plotting grid u, and then calculate the "sum of squared errors (SSE)" or the "mean squared error (MSE)" as follows:
y_prediction = np.matmul(A, xstar)  # predictions at the observed x values, shape (4, 1)
SSE = np.sum(np.square(y_prediction - b))
MSE = np.mean(np.square(y_prediction - b))
print(SSE)
print(MSE)
As an aside, you might want to use np.linalg.pinv, as it is numerically more stable than forming the inverse explicitly.
Note that numpy has a function for this, called lstsq (i.e. least squares), which returns the sum of squared residuals along with the solution, so you don't have to implement it yourself:
xstar, residuals, rank, sing_vals = np.linalg.lstsq(A, b, rcond=None)
SSE = residuals[0]  # lstsq returns the sum of squared residuals, not the individual residuals
MSE = SSE / len(b)
try it!
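As a quick check, here is a self-contained sketch using the data from the question. It compares the error computed by hand at the data points with the value reported by np.linalg.lstsq. The variable names other than A, b, and xstar are my own.

```python
import numpy as np

A = np.array([[0, 1],
              [1, 1],
              [2, 1],
              [3, 1]], dtype=float)
b = np.array([1, 2, 0, 3], dtype=float).reshape(-1, 1)

# Solve the least-squares problem; residuals holds the sum of squared errors
xstar, residuals, rank, sing_vals = np.linalg.lstsq(A, b, rcond=None)

# Errors evaluated at the observed data points
errors = A @ xstar - b
sse_manual = float(np.sum(errors ** 2))
mse_manual = float(np.mean(errors ** 2))
sse_lstsq = float(residuals[0])

print(xstar.ravel())           # slope and intercept
print(sse_manual, sse_lstsq)   # both are the sum of squared errors
print(mse_manual)
```

For this data the fitted line has slope 0.4 and intercept 0.9, giving SSE = 4.2 and MSE = 1.05.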

Standard errors for multivariate regression coefficients

I've done a multivariate regression using sklearn.linear_model.LinearRegression and obtained the regression coefficients doing this:
import numpy as np
from sklearn import linear_model
clf = linear_model.LinearRegression()
TST = np.vstack([x1,x2,x3,x4])
TST = TST.transpose()
clf.fit (TST,y)
clf.coef_
Now, I need the standard errors for these same coefficients. How can I do that?
Thanks a lot.
Based on this stats question and wikipedia, my best guess is:
MSE = np.mean((y - clf.predict(TST).T)**2)
var_est = MSE * np.diag(np.linalg.pinv(np.dot(TST.T,TST)))
SE_est = np.sqrt(var_est)
However, my linear algebra and stats are both quite poor, so I could be missing something important. Another option might be to bootstrap the variance estimate.
MSE = np.mean((y - clf.predict(TST).T)**2)
var_est = MSE * np.diag(np.linalg.pinv(np.dot(TST.T,TST)))
SE_est = np.sqrt(var_est)
I think this answer is not entirely correct.
In particular, if I am not mistaken, sklearn adds the constant (intercept) term by default when it computes your coefficients.
So you need to include a column of ones in your matrix TST. With that change, the code is correct and gives you an array with all the standard errors.
This code has been tested with data and is correct.
Find the X matrix for each data set; n is the length of the data set, m is the number of variables
X, n, m=arrays(data)
y=***.reshape((n,1))
linear = linear_model.LinearRegression(n_jobs=-1) ## drop n_jobs=-1 if there is only one variable
linear.fit(X, y)
Residual sum of squares divided by the degrees of freedom
s=np.sum((linear.predict(X) - y) ** 2)/(n-(m-1)-1)
Standard deviation: square root of the diagonal of the variance-covariance matrix (pinv uses the singular value decomposition)
sd_alpha=np.sqrt(s*(np.diag(np.linalg.pinv(np.dot(X.T,X)))))
t-statistic (use linear.intercept_ for one variable)
t_stat_alpha=linear.intercept_[0]/sd_alpha[0]
I found that the accepted answer had some mathematical glitches that in total would require edits beyond the recommended etiquette for modifying posts. So here is a solution to compute the standard error estimate for the coefficients obtained through the linear model (using an unbiased estimate as suggested here):
# preparation
X = np.concatenate((np.ones((TST.shape[0], 1)), TST), axis=1)
y_hat = clf.predict(TST).T
m, n = X.shape
# computation
MSE = np.sum((y_hat - y)**2)/(m - n)
coef_var_est = MSE * np.diag(np.linalg.pinv(np.dot(X.T,X)))
coef_SE_est = np.sqrt(coef_var_est)
Note that we have to add a column of ones to TST as the original post used the linear_model.LinearRegression in a way that will fit the intercept term. Furthermore, we need to compute the mean squared error (MSE) as in ANOVA. That is, we need to divide the sum of squared errors (SSE) by the degrees of freedom for the error, i.e., df_error = df_observations - df_features.
The resulting array coef_SE_est contains the standard error estimates of the intercept and all other coefficients in coef_SE_est[0] and coef_SE_est[1:] resp. To print them out you could use
print('intercept: coef={:.4f} / std_err={:.4f}'.format(clf.intercept_[0], coef_SE_est[0]))
for i, coef in enumerate(clf.coef_[0,:]):
    print('x{}: coef={:.4f} / std_err={:.4f}'.format(i+1, coef, coef_SE_est[i+1]))
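For reference, here is a self-contained sketch of the same computation using only NumPy. The synthetic data, the true coefficients, and the use of np.linalg.lstsq in place of sklearn's LinearRegression are assumptions made so the example runs on its own; the standard-error formula is the one described above (residual sum of squares divided by the error degrees of freedom, times the diagonal of the inverse of X^T X).

```python
import numpy as np

rng = np.random.default_rng(42)  # synthetic data for illustration only
n_obs, n_features = 200, 4
TST = rng.normal(size=(n_obs, n_features))
true_coef = np.array([1.5, -2.0, 0.0, 0.7])
y = 3.0 + TST @ true_coef + rng.normal(scale=0.5, size=n_obs)

# Design matrix with an explicit column of ones for the intercept
X = np.column_stack([np.ones(n_obs), TST])

# Ordinary least squares fit (stand-in for LinearRegression with an intercept)
beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)

# Unbiased estimate of the error variance: SSE / (observations - parameters)
resid = y - X @ beta
dof = n_obs - X.shape[1]
sigma2 = resid @ resid / dof

# Covariance of the coefficient estimates and their standard errors
cov_beta = sigma2 * np.linalg.pinv(X.T @ X)
coef_se = np.sqrt(np.diag(cov_beta))

names = ["intercept"] + ["x{}".format(i + 1) for i in range(n_features)]
for name, coef, se in zip(names, beta, coef_se):
    print("{}: coef={:.4f} / std_err={:.4f}".format(name, coef, se))
```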
The example from the documentation shows how to get the mean square error and explained variance score:
# Create linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)
# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean square error
print("Mean squared error: %.2f"
      % np.mean((regr.predict(diabetes_X_test) - diabetes_y_test) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % regr.score(diabetes_X_test, diabetes_y_test))
Does this cover what you need?

NumPy reports an invalid value while calculating normalized Mahalanobis distance

Note:
This is for a homework assignment in my data mining class.
I'm going to put relevant code snippets on this SO post, but you can find my entire program at http://pastebin.com/CzNFbLJ2
The dataset I'm using for this program can be found at http://archive.ics.uci.edu/ml/datasets/Iris
So I'm getting: RuntimeWarning: invalid value encountered in sqrt
return np.sqrt(m)
I am attempting to find the average Mahalanobis distance of the given iris dataset (for both raw and normalized datasets). The error is only happening on the normalized version of the dataset which is making me wonder if I have incorrectly understood what normalization means (both in code and mathematically).
I thought that normalization means that each component of a vector is divided by its vector length (causing the vector to add up to 1). I found this SO question, "How to normalize a 2-dimensional numpy array in python less verbose?", and thought it matched my concept of normalization. But now my code reports that the Mahalanobis distance over the normalized dataset is NaN.
def mahalanobis(data):
    import numpy as np;
    import scipy.spatial.distance;
    avg = 0
    count = 0
    covar = np.cov(data, rowvar=0);
    invcovar = np.linalg.inv(covar)
    for i in range(len(data)):
        for j in range(i + 1, len(data)):
            if(j == len(data)):
                break
            avg += scipy.spatial.distance.mahalanobis(data[i], data[j], invcovar)
            count += 1
    return avg / count
def normalize(data):
    import numpy as np
    row_sums = data.sum(axis=1)
    norm_data = np.zeros((50, 4))
    for i, (row, row_sum) in enumerate(zip(data, row_sums)):
        norm_data[i,:] = row / row_sum
    return norm_data
Probably too late, but check out pages 64-65 in our textbook "Introduction to Data Mining". There's a section called "Normalization or Standardization", which explains the concept of normalized data that Hearne is looking for.
Basically, standardized data set x' = (x - mean(x)) / standardDeviation(x)
Since I see you're using Python, here's how to do it using NumPy:
normalizedData = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
Source: http://mail.scipy.org/pipermail/numpy-discussion/2011-April/056023.html
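A minimal sketch of column-wise standardization followed by the average pairwise Mahalanobis distance, written with NumPy only. The random data is a synthetic stand-in for the iris measurements, and the helper names are my own. One thing worth knowing: because the covariance matrix is recomputed from the standardized data, the Mahalanobis distances come out the same as for the raw data.

```python
import numpy as np

def standardize(data):
    """Subtract each column's mean and divide by its sample standard deviation."""
    return (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

def mean_pairwise_mahalanobis(data):
    """Average Mahalanobis distance over all distinct pairs of rows."""
    inv_cov = np.linalg.inv(np.cov(data, rowvar=False))
    i, j = np.triu_indices(len(data), k=1)   # every pair with i < j
    diffs = data[i] - data[j]
    # Quadratic form diff^T * inv_cov * diff for each pair
    sq = np.einsum("ij,jk,ik->i", diffs, inv_cov, diffs)
    return np.sqrt(np.maximum(sq, 0.0)).mean()

rng = np.random.default_rng(0)
# Synthetic stand-in for a 150 x 4 dataset (values are illustrative only)
data = rng.normal(loc=[5.8, 3.0, 3.7, 1.2], scale=[0.8, 0.4, 1.7, 0.7], size=(150, 4))

raw_mean = mean_pairwise_mahalanobis(data)
std_mean = mean_pairwise_mahalanobis(standardize(data))
print(raw_mean, std_mean)
```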
You can use pdist() to do the distance calculation without for loop:
from sklearn import datasets
iris = datasets.load_iris()
from scipy.spatial.distance import pdist, squareform
print(squareform(pdist(iris.data, 'mahalanobis')))
Normalization in this context probably does mean subtracting the mean and scaling so the data has a unit covariance matrix.
However, to scale every vector in your dataset to unit norm use: norm_data=data/np.sqrt(np.sum(data*data,1))[:,None].
You need to divide by the L2 norm of each vector, which means squaring the value of each element, then taking the square root of the sum. Broadcasting allows you to avoid explicitly coding the loop (see the answer to the question you cited: https://stackoverflow.com/a/8904762/1149913).
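One plausible explanation for the NaN (an assumption on my part, not confirmed in the thread): dividing each row by its sum forces every row to add up to 1, so the columns become linearly dependent and the covariance matrix is singular. Inverting it then yields unreliable values, which can make the squared distance negative before the square root. The sketch below uses random data in place of the iris values to show the dependency, alongside the unit-norm scaling described above.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic positive data standing in for a 50 x 4 slice of the dataset
data = rng.uniform(0.5, 8.0, size=(50, 4))

# Row-sum normalization, as in the question's normalize() function
row_sum_normalized = data / data.sum(axis=1, keepdims=True)

# Unit-norm scaling: divide each row by its L2 norm
unit_norm = data / np.linalg.norm(data, axis=1, keepdims=True)

# Rows that sum to 1 make the covariance matrix map the all-ones vector to zero
cov = np.cov(row_sum_normalized, rowvar=False)
ones = np.ones(data.shape[1])
print(cov @ ones)            # numerically zero: the covariance is singular
print(np.linalg.cond(cov))   # very large condition number
```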
