RMSE for Multidimensional Data - python

For data that are one-dimensional or consist of a single column, calculating the error (RMSE) is simple. We can use a Python library, for instance:
from sklearn.metrics import mean_squared_error
RMSE = mean_squared_error(y_actual, y_predicted, squared=False)
Models can have multiple output columns: 2, 10, or even 100.
How do we calculate RMSE if the data has several columns?
For example:
import numpy as np

y_act = np.array([1.022, 0.94, 1.278, 2.096, 1.404,
                  2.035, 1.622, 2.348, 1.909, 1.678,
                  1.638, 1.742, 2.279, 1.878, 2.045])
y_actual = y_act.reshape((5, 3))
y_pred = np.array([1.021, 0.84, 1.111, 2.091, 1.314,
                   2.131, 1.622, 2.348, 1.888, 1.178,
                   1.238, 1.632, 2.119, 1.677, 2.145])
y_predicted = y_pred.reshape((5, 3))
RMSE(y_actual - y_predicted)?
How does the formula of the error change?

The formula remains the same; how you use it depends on your use case.
In your case you have 5 samples with 3 outputs (columns). Perhaps you ran a model with three different algorithms and got these results from them.
The difference from the 1D version is how you want to treat each output. In the 1D version you have just one value; here you have 3, and you can do three things with them.
Leave them as they are: in this case you get one MSE value for each of the 3 outputs.
Uniformly average them: take the average of the three MSE values to get a single MSE value.
Weighted average: take a weighted average of the three MSE values.
These functionalities are available in sklearn under the 'multioutput' parameter.
Here is an example comparing manual computation with the package computation:
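(A minimal sketch, reusing the y_actual and y_predicted arrays defined above; the weights below are arbitrary illustration values, and the square root is applied at the end to turn the MSE variants into RMSE.)
import numpy as np
from sklearn.metrics import mean_squared_error

# 1. Leave them as they are: one MSE per output column
mse_manual = np.mean((y_actual - y_predicted) ** 2, axis=0)
mse_per_output = mean_squared_error(y_actual, y_predicted, multioutput='raw_values')

# 2. Uniform average: a single MSE value
mse_uniform = mean_squared_error(y_actual, y_predicted, multioutput='uniform_average')

# 3. Weighted average of the per-output MSEs (weights chosen arbitrarily)
mse_weighted = mean_squared_error(y_actual, y_predicted, multioutput=[0.5, 0.3, 0.2])

# RMSE is just the square root of the corresponding MSE
rmse_per_output = np.sqrt(mse_per_output)
rmse_uniform = np.sqrt(mse_uniform)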

Related

How to calculate mean absolute error horizontally (row-wise) on a 2D numpy.array with sklearn.metrics?

I've tried to calculate the mean absolute error of all rows of a 2-D array. Here is my code:
import numpy as np
from sklearn.metrics import mean_absolute_error as mae

arr = np.array([[1.7, 3.1], [2.1, 2.7], [0.9, 0.7], [0.3, 0.8]])
result_arr = np.apply_along_axis(mae, 0, arr[:, 0], arr[:, 1])
However, I got a result like this:
array(0.675)
I want to get the mae values like this (row-wise):
array([[mae_value1],
       [mae_value2],
       [mae_value3],
       [mae_value4]])
By the way, I have to calculate mae with sklearn.metrics and without a loop. Is there any efficient way to do this?
I think that you may be confused about the mean absolute error metric. mae calculates the mean error over two sets of items (predictions and the y values of a test set), and this is why it returns one number.
Your desired result indicates that you are looking for some other metric on each row of your array, not the mean over all these rows. What do you want to see there? The difference between the two numbers, the absolute difference between them, etc.?
If you really want to get mae, you should understand that this metric is used in sklearn in the context of machine learning. When you feed data to sklearn's mae, you should feed data from two sources: the predictions from your machine learning algorithm, and the y values from your test set.
If machine learning is not your context, then you need to provide more information on what you are actually doing, in order for someone to help you.
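If the per-row absolute difference between the two columns is what is actually wanted, a sketch like the one below would produce it without a loop; the reshaped sklearn call is just one way to phrase it through sklearn.metrics, and this is an assumption about the asker's intent rather than part of the original answer.
import numpy as np
from sklearn.metrics import mean_absolute_error as mae

arr = np.array([[1.7, 3.1], [2.1, 2.7], [0.9, 0.7], [0.3, 0.8]])

# Plain numpy: absolute difference per row, shaped (4, 1) like the desired output
row_wise = np.abs(arr[:, 0] - arr[:, 1]).reshape(-1, 1)

# Through sklearn: reshape so that each row becomes a separate output,
# then ask for the per-output ("raw") values; result has shape (4,)
row_wise_sk = mae(arr[:, 0].reshape(1, -1), arr[:, 1].reshape(1, -1), multioutput='raw_values')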

Autoencoder Custom Error Metric not working as intended

I have created a custom error metric to measure the success of an autoencoder implemented in Keras and sklearn. At a high level, it computes the loss for each of the output neurons separately (num_features = num_input_neurons = num_output_neurons), and returns the highest and lowest loss of the features over the whole testing dataset, as well as the names of the two features themselves.
The implementation of the custom loss functions looks like this:
import numpy as np
import keras.backend as K

def get_argmax_max(features, diff_list):
    return [features[K.argmax(diff_list)], K.max(diff_list).numpy()]

def get_argmin_min(features, diff_list):
    return [features[K.argmin(diff_list)], K.min(diff_list).numpy()]

# features = [list of feature names]
decoded_prediction = test_autoencoder.predict(testing_data)
individual_diff_list = [abs(x - y) for x, y in zip(decoded_prediction, testing_data)]
error_list = np.mean(individual_diff_list, axis=0)
max_err = get_argmax_max(features, error_list)
min_err = get_argmin_min(features, error_list)
I have added two columns to train and test - one completely random, and one uniform (e.g. all 1s) - in order to test the success of this custom metric. My hypothesis is that the custom metric will identify the uniform column as having the minimum error (as it is very simple to learn) and the random column as having the maximum error (as it is not possible to learn well). In practice, however, this is not the case, and the metric often selects other random columns instead of the expected uniform and random columns. I have ensured that the range of the random column exceeds that of all columns in the dataset, tested with a sufficiently large encoding dimension, and have also tried running for a large number of epochs, to no avail.

MinMax scaling the target

I applied a linear regression on some features to predict the target, with 10-fold cross validation.
MinMax scaling was applied to both the features and the target.
Then the features were standardized.
When I run the model, the r2 equals 0.65 and the MSE is 0.02.
But when I use the targets as they are, without MinMax scaling, I get the same r2 but the MSE increases a lot, to 18.
My question is: do we have to treat targets the same way we treat features in terms of data preprocessing? And which of the values above is correct? Because the MSE gets much bigger without scaling the target.
Some people say we have to scale the targets too, while others say no.
Thanks in advance.
Whether you scale your target or not will change the 'meaning' of your error. For example, consider two different targets, one ranging over [0, 100] and another over [0, 10000]. If you run models against them (with no scaling), an MSE of 20 would mean different things for the two models: in the former case it would be disastrous, while in the latter case it would be pretty decent.
So the fact that you get lower MSE with target range [0, 1] than the original is not surprising.
At the same time, r2 value is independent of the range since it is calculated using variances.
Scaling allows you to compare model performance for different targets, among other things.
Also for some model types (like NNs) scaling would be more important.
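A small sketch with made-up numbers illustrates the point: applying the same MinMax transform to the target and the predictions shrinks the MSE dramatically while the r2 is unchanged.
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
y_true = rng.uniform(0, 100, size=200)        # target on its original scale (made up)
y_pred = y_true + rng.normal(0, 5, size=200)  # hypothetical predictions with some error

scaler = MinMaxScaler().fit(y_true.reshape(-1, 1))
y_true_s = scaler.transform(y_true.reshape(-1, 1)).ravel()
y_pred_s = scaler.transform(y_pred.reshape(-1, 1)).ravel()

print(mean_squared_error(y_true, y_pred), r2_score(y_true, y_pred))          # MSE is large
print(mean_squared_error(y_true_s, y_pred_s), r2_score(y_true_s, y_pred_s))  # MSE is tiny, r2 identical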
Hope it helps!

How to calculate the Kullback-Leibler divergence of two datasets

I have two datasets that contain 40000 samples each. I want to calculate the Kullback-Leibler divergence between these two datasets. Is there any efficient way of doing this in Python?
Edit:
OK. I figured out that it doesn't work in the input space. So the old explanation is probably wrong, but I'll keep it anyway.
Here are my new thoughts:
In my senior project I'm using an algorithm called AugMix. In this algorithm they calculate the Jensen-Shannon divergence between two augmented images, which is the symmetrical form of the KL divergence.
They used the model output as the probability distribution of the dataset. The idea is to fit a model to a dataset, then interpret the output of the model as a probability density function.
For example, suppose you fitted a model to a dataset without overfitting. Then (assuming this is a classification problem) you feed your logits (the output of the last layer) to the softmax function to get a probability for each class (sometimes the softmax is added as a layer at the end of the network, so be careful). The output of the softmax function (or layer) can be interpreted as P(Y|X_{1}), where X_{1} is the input sample and Y is the ground-truth class. Then you make a prediction for another sample X_{2}, P(Y|X_{2}), where X_{1} and X_{2} come from different datasets (say dataset_1 and dataset_2) and the model is not trained on either of them.
Then the KL divergence between dataset_1 and dataset_2 can be calculated as KL(dataset_1 || dataset_2) = sum over classes y of P(y|X_{1}) * log(P(y|X_{1}) / P(y|X_{2}))
Make sure that X_{1} and X_{2} belong to the same class.
I'm not sure if this is the correct way. Alternatively, you can train two different models (model_1 and model_2) using different datasets (dataset_1 and dataset_2) and then calculate the KL divergence on the predictions of those two models using the samples of another dataset called dataset_3. In other words:
KL(dataset_1 || dataset_2) = sum x in dataset_3 model_1(x) * log(model_1(x) / model_2(x))
where model_1(x) is the softmax output of model_1, which is trained using dataset_1 without overfitting, for the correct label.
The latter sounds more reasonable to me, but I'm not sure about either of them. I could not find a proper answer on my own.
The things I'm going to explain are adapted from Jason Brownlee's blog post on KL divergence at machinelearningmastery.com.
As far as I understood, first you have to convert your datasets into probability distributions so that you can calculate the probability of each of the samples from the union (or intersection?) of the two datasets:
KL(P || Q) = sum x in X P(x) * log(P(x) / Q(x))
However, most of the time the intersection of the datasets is empty. For example, if you want to measure the divergence between CIFAR10 and ImageNet, there are no samples in common. The only way you can calculate this metric is to sample from the same dataset to create two different datasets; then you have samples that are present in both, and you can calculate the KL divergence.
Lastly, maybe you want to check the Wasserstein distance, which is used in GANs to compare the source distribution and the target distribution.
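As a rough sketch of the second approach above, assume model_1 and model_2 are hypothetical classifiers already trained (without overfitting) on dataset_1 and dataset_2, and x_3 is an array of samples from a third dataset; the sum in the formula could then be estimated like this:
import numpy as np

# Hypothetical softmax outputs on the samples of dataset_3;
# each row is a probability distribution over the classes.
p = model_1.predict(x_3)   # shape (n_samples, n_classes)
q = model_2.predict(x_3)   # shape (n_samples, n_classes)

eps = 1e-12                # guard against log(0)
kl_per_sample = np.sum(p * np.log((p + eps) / (q + eps)), axis=1)
kl_estimate = kl_per_sample.mean()   # average over the samples of dataset_3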

Intuition behind nloglikelihood value in xgboost poisson run

When I use count:poisson instead of rmse I see nloglikelihood values. Now I am not sure how to compare those numbers with rmse or mae.
Definitely, the lower the value the better, but I am not getting the intuitive sense of the actual error that rmse or mae gives.
For example -> train-poisson-nloglik:2.01885 val-poisson-nloglik:2.02898
Here, can we say the actual values differ by an error of 2.02?
Can someone explain with a small example?
Thanks.
There is a good post on the computation of the value here
Just to be more exhaustive, the value is:
mean(log(factorial(label)) + preds - label*log(preds))
If you compare with the true formula of the negative log-likelihood, it should be the sum instead of the mean. I guess they chose to take the mean so that the train and test values are more comparable.
Finally, to answer the question: the likelihood is the probability that the data came from the distribution with a specific parameter. In the Poisson model, the parameters are just the set of predictions. So the better your prediction, the greater that probability, and the smaller the associated negative log-likelihood.
rmse and mae are based on the expectation of the difference between the prediction and the truth, whereas the negative log-likelihood is looking at a probability.
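A tiny sketch of the formula above, with made-up labels and predictions; gammaln(label + 1) is just log(factorial(label)) and avoids overflow for large counts:
import numpy as np
from scipy.special import gammaln

label = np.array([2.0, 0.0, 5.0, 1.0])   # observed counts (made up)
preds = np.array([1.8, 0.3, 4.5, 1.2])   # predicted Poisson means (made up)

# mean(log(factorial(label)) + preds - label*log(preds))
nloglik = np.mean(gammaln(label + 1.0) + preds - label * np.log(preds))
print(nloglik)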
