R-squared results of test and validation differ by a huge margin

R-squared results of test and validation differ by a huge margin - python

I am working on a regression problem with keras and tensorflow using a neural network. The data is split, so that 282774 datasets are for training, 70694 are for validation and 88367 are for testing. To evaluate my models I am printing out the mean squared error (MSE), the mean absolute error (MAE) and the R-squared score. These are some examples from the results I get:
MSE MAE R-squared
Training 1.562072899 0.958128839 0.849787137
Validation 0.687871457 0.62066941 0.935365564
Test 0.683918759 0.618674863 -16.22829222
I do not understand the value for R-squared on test data. I know that R-squared can be negative, but how can it be that there is such a big difference between validation and test if both fall into the category of unseen data. Can someone give me a hint?
Some background information:
Since keras does not have the R-squared metric built in, I implemented it with some code I found on the web and which seems logical for me:
def r2_keras(y_true, y_pred):
SS_res = K.sum(K.square(y_true - y_pred))
SS_tot = K.sum(K.square(y_true - K.mean(y_true)))
return ( 1 - SS_res/(SS_tot + K.epsilon()) )
And if it helps: this is my model:
model = Sequential()
model.add(Dense(75, input_shape=(7,)))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='linear'))
adam = optimizers.Adam(lr=0.001)
model.compile(loss='mse',
optimizer=adam,
metrics=['mse', 'mae', r2_keras])
history = model.fit(x_train, y_train,
epochs=50,
batch_size=32,
validation_split=0.2)
score = model.evaluate(x_test, y_test, batch_size=32)
One strange thing I noticed is, that not all testing data seems to be considered. The console prints out the following:
86304/88367 [============================>.] - ETA: 0s-----
Maybe this leads to a miscalculation for R-squared?
I am thankful for any help/hint I can get on understanding this issue.
Update:
I checked for outliers, but could not find any significant one. Min and max-values for test and train are close by, considering the standard deviation. Also the histograms look very much alike.
So in the next step I let my model predict the values for test data again and used pandas + numpy to calculate the r2_score. This time I got a value which is approximately equal to the r2_score for validation.
Below is how I did it. Do you see any flaws in the way I performed the calculation? (I just want to be sure that the old r2_score for "test" was indeed a calculation error)
# "test" is a dataframe with input data and the real outputs
# "inputs" is a list of the input column names
# The real/true outputs are contained in the column "output"
test['output_pred'] = model.predict(x=np.array(test[inputs]))
output_mean = test['output'].mean() # Is this the correct mean value for r2 here?
test['SSres'] = np.square(test['output']-test['output_pred'])
test['SStot'] = np.square(test['output']-output_mean)
r2 = 1-(test['SSres'].sum()/(test['SStot'].sum()))

Tensorflow's built-in evaluate method evaluates your test set batch by batch and hence calculates r2 at each batch. The metrics produced from model.evaluate() is then simple average of all r2 from each batch. While in model.fit(), r2 (and all metrics on validation set) are calculated per epoch (instead of per batch and then take avg.)
You may slice your output and output_pred into batches of the same batch size you used in model.evaluate() and calculate r2 on each batch. I guess the model produces high r2 on batches with high total sum of squares (SS_tot) and bad r2 on lower ones. So when taken average, result would be poor (however when calculate r2 on entire dataset, samples with higher ss_tot usually dominate the result).

Related

Why do I need a very high learning rate for this model to converge?

I have a simple model in tensorflow which is being trained on the first 1000 images in the MNIST datset. From my previous experience the learning rates which I used were of the order of around 0.001, however for my model to converge the learning rate needs to be far heigher, at least larger than 1. The model is shown below.
def gen_model():
return tf.keras.models.Sequential([
tf.keras.Input(shape=(28,28,)),
tf.keras.layers.Flatten(),
tf.keras.layers.Dense(128, activation='sigmoid'),
tf.keras.layers.Dense(10, activation='softmax')
])
model = gen_model()
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=5), loss='mean_squared_error')
model.summary()
model.fit(x_train, y_train, batch_size=1000, epochs=10000)
Is it expected for models of this form to require an extremely high learning rate, or is there something I have missed? When I use a learning rate of around 0.001 the loss changes incredibly slowly.
The dataset was created with the following code:
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.astype("float32") / 255.0
x_train = x_train.reshape(60000,28,28)[:1000];
y_train = y_train[:1000];
y_train = tf.one_hot(y_train, 10)

Generally speaking, models that require learning rates larger than 1 raise a red flag for me. It seems like your model is a vanilla multilayer perceptron, so there's nothing overly complicated about that, but there are a couple things about your setup that stand out:
The output from your model uses a softmax, which is normally used to represent values from a categorical distribution (i.e., 1-of-k) -- this is typical for a classification model. But the loss you're using is typically used for optimizing Gaussian or regression outputs. You might want to try using a cross-entropy loss to see if that helps.
The output from your model is in probability space, so the values you get out from your model are in [0, 1]. The loss you're using is averaging the squared differences between the model output and the target 1-hot vector (whose values are in {0, 1}). The value you'll get for this loss is always smaller than 1, so with a learning rate less than 1, and multiplying by the existing model weights, the delta that you'll apply to your model weights is always going to be small. Sometimes that's a good thing, but my guess is that in this case -- and particularly at the start of training when the model weights aren't near their optimal values -- this is going to be quite slow.
Related to the above point, you might try initializing your model weights with a larger range of values than the default. This would help make the gradient values larger, but could also make the model more likely to diverge.
You could also try to replace your softmax output activation with a plain linear activation, in effect converting your model's output to (unnormalized) log-probability space. Then you'd need to change your dataset labels to also represent target log-probability values, which isn't possible exactly, but could get close with something like 1e8 * (1 - one_hot). But if you wanted to go this route, you'd effectively be implementing a cross-entropy loss yourself; see the first point.

Accuracy Difference in Deep Learning - epochs and final accuracy

I am trying to train a deep learning model for my project. The code for the same is as given below:
model = Sequential()
# Add an input layer
model.add(Dense(22, activation='relu', input_shape=(22,)))
# Add one hidden layer
model.add(Dense(12, activation='relu'))
# Add an output layer
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
optimizer='adam',
metrics=['accuracy'])
model.fit(x_train, gender_train,epochs=90, batch_size=1, verbose=1)
gender_pred = model.predict(x_test)
print(confusion_matrix(gender_test,gender_pred.round()))
print(classification_report(gender_test,gender_pred.round()))
print(accuracy_score(gender_pred.round(),gender_test))
The accuracy for each epoch is coming in 90s. (eg. 0.97). But, When I predict the gender and print the accuracy with the help of accuracy_score, it's just coming in 80s. How are these accuracies related? I am new to machine learning, any help would be appreciated.

The training and test accuracy scores are used to identify over-fitting. A training accuracy that is far higher than test accuracy indicates over-fitting.
Accuracy can be misleading because of the nature of how it is calculated. In your case you are predicting the gender which has two possible values 0 and 1. If the data is imbalanced, e.g. if a large portion of the test data was of one class then predicting that class would provide a high accuracy score but it would not necessarily be true when the classifier is applied to other data.
Accuracy = ((True Positive) + (True Negative)) / (Total Population)
The Confusion Matrix enables an accurate assessment of the model performance.
True Positives (TP) − The case when both actual class & predicted class of data point is 1.
True Negatives (TN) − The case when both actual class & predicted class of data point is 0.
False Positives (FP) − The case when actual class of data point is 0 & predicted class of data point is 1.
False Negatives (FN) − The case when actual class of data point is 1 & predicted class of data point is 0.
Looking at your console output:
TP = 93,
FP = 27,
FN = 24,
TN = 101
Accuracy score : (93 + 101) / (93 + 27 + 24 + 101) = 0.7918367347
Precision : (TP) / (TP + FP) : 93 / (93 + 27) = 0.775
Recall : (TP) / (TP + FN) : 93 / (93 + 24) = 0.7948
Support is the number of samples of the true label in each class of target values. (120 are class 0, 125 are class 1)
F1-score : 2 * (Precision * Recall) / (Precision + Recall)
F1 score is the weighted average of the precision and recall. Its best value would be 1 and the worst would be 0.
Being able to interpret these variables will help you to better understand your model.
If you are only beginning your Machine Learning adventure, I'd highly recommend Andrew Ng's Machine Learning course on Coursera. He explains everything you need to understand quite simply without too much mathematics.

You wrongly mix up two things. There are two sets general, train set: user for training and you got above 90+ accuracy and another set you used for test and you got 79 - which by the way a clear indication of model overfitting.
All your confusion matrix, classification report, and lastly accuracy measurements are done using the test set. In your code,
model = Sequential()
# Add an input layer
model.add(Dense(22, activation='relu', input_shape=(22,)))
# Add one hidden layer
model.add(Dense(12, activation='relu'))
# Add an output layer
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
optimizer='adam',
metrics=['accuracy'])
model.fit(x_train, gender_train,epochs=90, batch_size=1, verbose=1)
This is the training part, and it properly .fit the training pairs - x_train -> gender_train. Now, after some fitting or some epoch later when we test the model on unseen data which never NEVER used for training, that eventually reveals the model's true performance. In your case, this unseen data is x_test.
gender_pred = model.predict(x_test)
print(confusion_matrix(gender_test,gender_pred.round()))
print(classification_report(gender_test,gender_pred.round()))
print(accuracy_score(gender_pred.round(),gender_test))
If performance on x_test set or Unseen data is low or very low compare to the training data (i.e. x_train), it's simply called overfitting and that's not expected. In your case, your model is suffering from overfitting.
How to overcome
To avoid overfitting, people generally do as follows:
make a validation set for choosing hyper-parameter in training time
strong data augmentation
I would recommend these two article to read to get some more idea. Hope that helps.
Overfitting and Underfitting With Machine Learning Algorithms
Underfitting vs. Overfitting (vs. Best Fitting) in Machine Learning

Selecting validation metric for `categorical_crossentropy` in Keras

I am looking at these two questions and documentation:
Whats the output for Keras categorical_accuracy metrics?
Categorical crossentropy need to use categorical_accuracy or accuracy as the metrics in keras?
https://keras.io/api/metrics/probabilistic_metrics/#categoricalcrossentropy-class
For classification of X-Rays images I (15 classes) I do:
# Compile a model
model1.compile(optimizer = 'adam', loss = 'categorical_crossentropy',
metrics = ['accuracy'])
# Fit the model
history1 = model1.fit_generator(train_generator, epochs = 10,
steps_per_epoch = 10, verbose = 1, validation_data = valid_generator)
My model works and I have an output:
But I am not sure how to add validation accuracy here to compare results and avoid over/underfitting.

I hope the following can help you:
The use of "categorical_crossentropy" tells me that your labels are a one hot encoding over different classes.
Let's say you have 15 classes, the correct prediction would be a vector with 14 zeros, and a one at the corresponding index. In this context "accuracy" will be very high as your model will be correctly predicting mostly zero everywhere, so the accuracy should easily be at least 13/15 = 0.86.
A more suitable metric would be "categorical_accuracy" which will give you 1 if the model predicts the correct index, and else 0.
If you have a validation "categorical_accuracy" better than 1/15 = 0.067 (assuming your class are correctly balanced), your model is better than random.
You can find a list of metrics at keras metrics.

Keras RNN accuracy doesn't improve

I'm trying to improve my model so it can become a bit more accurate. Right now I'm training the model and get this as my training and validation accuracy.
For every epoch I get an training accuracy of 0.0003 and an validation accuracy of 0. I know this isn't good but I don't know how I can fix this.
Data is normalized with the minmax scaler. 4 of the 8 features are normalized (other 4 are hour, day, day_of_week and month)
Update:
I've also tried to normalize the entire dataset and it doesn't make a differance
scaling = MinMaxScaler(feature_range=(0,1)).fit(df[cols])
df[[cols]] = scaling.transform(df[[cols]])
My model: The shape is (5351, 1, 8)
and the input_shape is (1, 8)
model = keras.Sequential()
model.add(keras.layers.Bidirectional(keras.layers.LSTM(2,input_shape=(X_train.shape[1], X_train.shape[2]), return_sequences=True, activation='linear')))
model.add(keras.layers.Dense(1))
model.compile(loss='mean_squared_error', optimizer='Adamax', metrics=['acc'])
history = model.fit(
X_train, y_train,
epochs=200,
batch_size=24,
validation_split=0.35,
shuffle=False,
)
i tried using the answer of this question:
Keras model accuracy not improving
but it didn't work

A mean_sqared_error loss is for regression tasks while a acc metric is for classification problems. So it makes no sense to use them together.
If you work on a classification problem, use binary_crossentropy or categorical_crossentropy as loss and keep the metric parameter as you did.
If it is a regression tasks, change the metric to [mse] for mean squares error instead of [acc].
Your model "works" and you have applied the standard formula for backpropagation by using the mean squares error loss. But measuring the accuracy will make Keras check if your model's output is EXACTLY equals to the expected values. Since the loss function is for regression, it will hardly ever be equal.
Three last points because that little change won't correct everything.
Firstly, your last dense layer should have an activation function. (It's safier)
Secondly, I'm pretty sure a Bidirectional+LSTM layer placed before a Dense layer should have a return_sequences=False. A LSTM layer (with or without Bidirectional) can return thé full séquence of vector (like a matrix) but a dense layer takes vectors as input. But in this case it will work because of the third point.
The last point is about the shape of your data. You have 5351 examples of shape (1, 8) each which a vector of size 8. But a LSTM layer takes a sequence of vectors still thé size of your séquence is one. I don't know if it is relevent to use an RNN type layer here.

What Does Accuracy Metrics Mean in Keras' Sample Denoising Autoencoder?

I am working with Keras' sample denoising autoencoder;
https://keras.io/examples/mnist_denoising_autoencoder/
As I compile it, I use the following options:
autoencoder.compile(loss='mse', optimizer= Adadelta, metrics=['accuracy'])
Followed by training. I did training deliberately WITHOUT using noisy training data(x_train_noisy), but merely tried to recover x_train.
autoencoder.fit(x_train, x_train, epochs=30, batch_size=128)
After training 60,000 inputs of MNIST digits, it gives me an accuracy of 81.25%. Does it mean there are 60000*81.25% images are PERFECTLY recovered (equaling to the original input pixel by pixel), that is, 81.25% output images from the autoencoder are IDENTICAL to their input counterparts, or something else?
Furthermore, I also conducted a manual check by comparing output and the original data (60000 28X28 matrices) pixel by pixel--counting non-zeros elements from their differences:
x_decoded = autoencoder.predict(x_train)
temp = x_train*255
x_train_uint8 = temp.astype('uint8')
temp = x_decoded*255
x_decoded_uint8 = temp.astype('uint8')
c = np.count_nonzero(x_train_uint8 - x_decoded_uint8)
cp = 1-c /60000/28/28
Yet cp is only about 71%. Could any tell me why there is a difference?

Accuracy doesn't make sense for a regression problem, hence the keras sample doesn't use that metric during autoencoder.compile.
In this case, keras calculates the accuracy as per this metric.
binary_accuracy
def binary_accuracy(y_true, y_pred):
return K.mean(K.equal(y_true, K.round(y_pred)), axis=-1)
Using this numpy implementation, you should get the same value as output by Keras for validation accuracy at the end of training.
x_decoded = autoencoder.predict(x_test_noisy)
acc = np.mean(np.equal(x_test, np.round(x_decoded)))
print(acc)
Refer this answer for more details:
What function defines accuracy in Keras when the loss is mean squared error (MSE)?

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.