I'm training a text classification model where the input data consists of 4096 term frequency–inverse document frequency (TF-IDF) features.
There are 416 possible output categories. Each piece of data belongs to 3 categories, so the label is a vector of length 416 with 3 ones and 413 zeros (a multi-hot encoding).
My model looks like this:
model = Sequential()
model.add(Dense(2048, activation="relu", input_dim=X.shape[1]))
model.add(Dense(512, activation="relu"))
model.add(Dense(416, activation="sigmoid"))
When I train it with the binary_crossentropy loss, it has a loss of 0.185 and an accuracy of 96% after one epoch. After 5 epochs, the loss is at 0.037 and the accuracy at 99.3%. I suspect this is misleading, since my labels contain a lot of 0s, which it classifies correctly.
When I train it with the categorical_crossentropy loss, it has a loss of 15.0 and an accuracy of below 5% in the first few epochs, before it gets stuck at a loss of 5.0 and an accuracy of 12% after several (over 50) epochs.
Which one of those would be right for my situation (large one-hot-encodings with multiple 1s)? What do these scores tell me?
EDIT: These are the model.compile() statements:
model.compile(loss='categorical_crossentropy',
              optimizer=keras.optimizers.Adam(),
              metrics=['accuracy'])
and
model.compile(loss='binary_crossentropy',
              optimizer=keras.optimizers.Adam(),
              metrics=['accuracy'])
In short: the (high) accuracy reported when you use loss='binary_crossentropy' is not the correct one, as you have already guessed. For your problem, the recommended loss is categorical_crossentropy.
In long:
The underlying reason for this behavior is a rather subtle and undocumented issue in how Keras actually guesses which accuracy to use, depending on the loss function you have selected, when you simply include metrics=['accuracy'] in your model compilation, as you have. In other words, while your first compilation option
model.compile(loss='categorical_crossentropy',
              optimizer=keras.optimizers.Adam(),
              metrics=['accuracy'])
is valid, your second one:
model.compile(loss='binary_crossentropy',
              optimizer=keras.optimizers.Adam(),
              metrics=['accuracy'])
will not produce what you expect, but the reason is not the use of binary cross entropy (which, at least in principle, is an absolutely valid loss function).
Why is that? If you check the metrics source code, Keras does not define a single accuracy metric, but several different ones, among them binary_accuracy and categorical_accuracy. What happens under the hood is that, since you have selected loss='binary_crossentropy' and have not specified a particular accuracy metric, Keras (wrongly...) infers that you are interested in the binary_accuracy, and this is what it returns - while in fact you are interested in the categorical_accuracy.
Let's verify that this is the case, using the MNIST CNN example in Keras, with the following modification:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy']) # WRONG way
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=2,  # only 2 epochs, for demonstration purposes
          verbose=1,
          validation_data=(x_test, y_test))
# Keras reported accuracy:
score = model.evaluate(x_test, y_test, verbose=0)
score[1]
# 0.9975801164627075
# Actual accuracy calculated manually:
import numpy as np
y_pred = model.predict(x_test)
acc = sum([np.argmax(y_test[i])==np.argmax(y_pred[i]) for i in range(10000)])/10000
acc
# 0.98780000000000001
score[1]==acc
# False
Verifying the above behavior with your own data should be straightforward.
And just for the completeness of the discussion, if, for whatever reason, you insist on using binary cross entropy as your loss function (as I said, nothing wrong with this, at least in principle) while still getting the categorical accuracy required by the problem at hand, you should explicitly ask for categorical_accuracy in the model compilation as follows:
from keras.metrics import categorical_accuracy
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=[categorical_accuracy])
In the MNIST example, after training, scoring, and predicting the test set as I show above, the two metrics now are the same, as they should be:
# Keras reported accuracy:
score = model.evaluate(x_test, y_test, verbose=0)
score[1]
# 0.98580000000000001
# Actual accuracy calculated manually:
y_pred = model.predict(x_test)
acc = sum([np.argmax(y_test[i])==np.argmax(y_pred[i]) for i in range(10000)])/10000
acc
# 0.98580000000000001
score[1]==acc
# True
System setup:
Python version 3.5.3
Tensorflow version 1.2.1
Keras version 2.0.4
Related
I am trying to build a regression model, but the MSE and MAE are very high. I filter and normalize the data (both the input and output, for both the test and train sets). I think the problem comes from having very high values in one column: the minimum is 1 and the maximum is 9,100,000 (before normalizing), but I actually need to predict these high values.
The model looks like this; I have 6 input columns and 800,000 rows. I have tried more neurons and layers, and changing the sigmoid activation, but the loss and the error stay around 0.8 for MSE and 0.3 for MAE. The predictions are also much lower than they should be, never reaching the high values.
model = Sequential()
model.add(Dense(7, input_dim=num_input, activation='relu'))
model.add(Dense(7, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='mse', optimizer='rmsprop', metrics=['mse', 'mae'])
history = model.fit(x_train, y_train, epochs=epochs, batch_size=batch_size, validation_data=(x_val, y_val))
A few remarks and suggestions:
RMSProp is generally not used with fully connected layers; I recommend switching to Adam or SGD.
If you have a skewed distribution with many large values, you might consider using the log of these values instead.
First try a shallow model with few neurons, then gradually increase the number of neurons in order to overfit the dataset. You should be able to reach a perfect score on the train set. At that point you can start decreasing the number of neurons and adding layers with dropout to improve generalisation.
As already mentioned in the comments, the output activation for regression should be "linear"; sigmoid is for binary classification. A minimal sketch combining this with the log transform is shown after this list.
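As a rough illustration of the last two points, here is a sketch only, reusing the variable names from the question; the input dimension of 6 and the training hyperparameters are assumptions, not recommendations:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Train on log-transformed targets so the huge range (1 to ~9,100,000) is compressed
y_train_log = np.log1p(y_train)
y_val_log = np.log1p(y_val)

model = Sequential()
model.add(Dense(7, input_dim=6, activation='relu'))
model.add(Dense(7, activation='relu'))
model.add(Dense(1, activation='linear'))  # linear output for regression, not sigmoid
model.compile(loss='mse', optimizer='adam', metrics=['mse', 'mae'])
model.fit(x_train, y_train_log, epochs=50, batch_size=256,
          validation_data=(x_val, y_val_log))

# Undo the log transform to bring predictions back to the original scale
y_pred = np.expm1(model.predict(x_val)).ravel()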
I am doing text classification; my dataset size is 16000 KB. My problem is that I get 95% accuracy in training and 90% in testing. Can I increase the testing accuracy, and how?
Here is my code:
model = Sequential()
model.add(Conv1D(filters=256, kernel_size=5, activation='relu', input_shape=(7, 1)))
model.add(layers.GlobalMaxPooling1D())
model.add(layers.Dense(128, activation='relu'))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(64, activation='relu'))
model.add(Dense(11, activation='softmax'))
model.summary()
model.compile(Adam(lr=0.001),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
history = model.fit(X_train, y_train,
                    epochs=200,
                    verbose=True,
                    validation_data=(X_test, y_test),
                    batch_size=128)
loss, accuracy = model.evaluate(X_train, y_train, verbose=True)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(X_test, y_test, verbose=False)
print("Testing Accuracy: {:.4f}".format(accuracy))
The first step in debugging the model is to plot the training/validation curves, like the example below (a plotting sketch follows after the list).
Typical training/validation curves (figure)
Based on how the curves behave, here are the possible inferences and solutions.
The two curves diverge as the model is trained: training keeps improving while testing either gets worse or saturates much earlier than training.
Cause: the model is overfitting the training data and needs regularisation, e.g. dropout, weight decay, etc.
The two curves stick close together at the end and no further improvements happen.
Cause: the model is saturated or stuck in a local minimum. Try increasing the learning rate to push it out of the minimum; if there are still no major improvements, try adding more complexity to the model.
The two curves have saturated at the end, but sit a small distance apart, and no major changes happen with further training.
Cause: the model has learned what it could from the available data and will not improve any further. Try data transformations to generate new data, or get more data.
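For reference, a minimal plotting sketch, assuming the History object returned by model.fit is stored in history as in the code above (older Keras versions use the keys 'acc'/'val_acc' instead of 'accuracy'/'val_accuracy'):
import matplotlib.pyplot as plt

acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs_range = range(1, len(acc) + 1)

plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.plot(epochs_range, acc, label='train accuracy')
plt.plot(epochs_range, val_acc, label='validation accuracy')
plt.xlabel('epoch')
plt.legend()
plt.subplot(1, 2, 2)
plt.plot(epochs_range, loss, label='train loss')
plt.plot(epochs_range, val_loss, label='validation loss')
plt.xlabel('epoch')
plt.legend()
plt.show()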
I am using TensorFlow and Keras for a binary classification problem.
I have only 121 samples, but 20,000 features. I know that is too few samples and too many features, but it is a biological problem (gene-expression data), so I have to deal with it.
My question: why does accuracy (train and test) go up to 100%, then down, and then increase again, while the loss decreases the whole time?
Accuracy plot:
Validation plot:
Since my dataset is only 118 samples big, I have only 24 test data points. See the confusion matrix:
This is my neural network architecture:
with current settings:
{'ann__dropout_rate': 0.4, 'ann__learning_rate': 0.01, 'ann__n_neurons': 16, 'ann__num_hidden': 1, 'ann__regularization_rate': 0.6}
model = Sequential()
model.add(Dense(input_shape, activation="relu",
input_dim=input_shape)) # First Layer
model.add(Dense(n_neurons, activation="relu",
kernel_regularizer=tf.keras.regularizers.l1(regularization_rate)))
model.add(Dropout(dropout_rate))
model.add(Dense(1, activation="sigmoid"))
optimizer = keras.optimizers.Adam(learning_rate=learning_rate)
model.compile(loss="binary_crossentropy",
optimizer=optimizer, metrics=['accuracy'])
return model
Thank you!
Try shuffling your training data if you are not doing so already. You might also try a larger batch size. I also recommend using the ReduceLROnPlateau callback in model.fit; documentation is here. Set it up to monitor the validation loss and to reduce the learning rate by a factor < 1 if the loss fails to improve after patience epochs.
I implemented your ideas, @Gerry P: shuffle=True and ReduceLROnPlateau (batch size is 64). My callbacks are now:
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=5, min_lr=1e-6, verbose=1)
early_stop = EarlyStopping(monitor='val_loss', min_delta=0, patience=20, mode='auto')
My accuracy and loss curves now look like this:
I would say it is still overfitting.
Confusion-Matrix:
I am training and fitting a keras model using validation split:
self.model = Sequential()
self.model.add(LSTM(hidden_units, input_shape=(1, n_features), dropout=drp))
self.model.add(Dense(n_classes, activation='sigmoid'))
self.model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
self.model.build(X_train.shape)
self.history = self.model.fit(X, y.values, epochs=epochs, batch_size=32, validation_split=0.33, shuffle=True, callbacks=cb_list)
After fitting, I want to access the test set that was used.
How can I do this?
As per the FAQ in the documentation, the validation_split argument makes the last x% of samples your validation data.
In your case, you set the value of validation_split to be 0.33. This means that the last 33% of the samples in X get used as validation data.
So, you can just directly slice off the last 33% of X and use it, as in the sketch below.
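A minimal sketch of that slicing, assuming X and y are the same objects passed to fit (note that shuffle=True only shuffles the training portion; the validation samples are taken from the tail of the data before shuffling):
# This mirrors the split point Keras computes internally: int(n * (1 - validation_split))
split_at = int(len(X) * (1.0 - 0.33))

X_train_part, X_val = X[:split_at], X[split_at:]
y_train_part, y_val = y.values[:split_at], y.values[split_at:]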
For a school project, I'm trying to predict data using the Keras framework, but it returns 'nan' for the loss and for the values when I try to get predictions.
Source code:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=5)
# create model
model = Sequential()
model.add(Dense(950, input_shape=(425,), activation='relu'))
model.add(Dense(425, activation='relu'))
model.add(Dense(200, activation='relu'))
model.add(Dense(50, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# Compile model
sgd = optimizers.SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='mean_squared_error', optimizer='sgd')
# Fit the model
model.fit(X_train, y_train, epochs=20, batch_size=1, verbose=1)
#evaluate the model
y_pred = model.predict(X_test)
score = model.evaluate(X_test, y_test,verbose=1)
print(score)
# calculate predictions
predictions = model.predict(X_pred)
Data:
X_train and X_test are (pandas) DataFrames of 5000 rows (number of samples) × 425 columns (number of dimensions).
y_train and y_test look like:
array([ 1.17899644, 1.46080518, 0.9662137 , ..., 2.40157461,
0.53870386, 1.3192718 ])
Can you help me with that? Thank you for your help!
Usually, this means that something is diverging to infinity. As @desertnaut pointed out in the comments, reducing the learning rate might help.
But the root of the issue is your input data. What do these 425 features mean? Are they from different sources, different features, different parameters? Finding outliers or normalizing the data could help; a minimal scaling sketch follows below.
Your code looks fine otherwise.
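For instance, a sketch of per-feature standardization with scikit-learn's StandardScaler (my choice here; any comparable scaling would do), fitted on the training data only:
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training features only, then reuse it everywhere,
# so no information from the test set leaks into training
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# A batch size larger than 1 also tends to make training more stable
model.fit(X_train_scaled, y_train, epochs=20, batch_size=32, verbose=1)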
Make sure your target output is in the range (0, 1), since you have a sigmoid in the last layer.
Sigmoid has an output between zero and one, so if the target output is not in this range, either (a) change the activation function or (b) normalize the outputs into the required range.
Make sure the purpose of this model really is regression.
After considering the above three points, play around with the learning rate (decrease it) and the optimiser (replace it with any other). A sketch of option (b) is shown below.
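As an illustration of option (b), a sketch using scikit-learn's MinMaxScaler (an assumption on my part) to squeeze the targets into [0, 1] and to map the predictions back afterwards:
from sklearn.preprocessing import MinMaxScaler

# Scale the targets into [0, 1] so they are compatible with the sigmoid output
y_scaler = MinMaxScaler()
y_train_scaled = y_scaler.fit_transform(y_train.reshape(-1, 1))
y_test_scaled = y_scaler.transform(y_test.reshape(-1, 1))

model.fit(X_train, y_train_scaled, epochs=20, batch_size=32, verbose=1)

# Bring the predictions back to the original target scale
predictions = y_scaler.inverse_transform(model.predict(X_test))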
Try changing your optimizer to 'Adam' instead of SGD.
You initialized your SGD optimizer in the variable sgd, but you're not using it in compile: passing the string 'sgd' creates a fresh SGD with default settings, so your learning rate, decay, momentum, and nesterov arguments are ignored.
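For example, reusing the optimizer object that the question already builds:
from keras import optimizers

sgd = optimizers.SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='mean_squared_error', optimizer=sgd)  # pass the object, not the string 'sgd'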