Keras Binary Classification - Sigmoid activation function - python

I've implemented a basic MLP in Keras with tensorflow and I'm trying to solve a binary classification problem. For binary classification, it seems that sigmoid is the recommended activation function and I'm not quite understanding why, and how Keras deals with this.
I understand the sigmoid function produces values in a range between 0 and 1. My understanding is that for classification problems using sigmoid, there will be a certain threshold used to determine the class of an input (typically 0.5). In Keras, I'm not seeing any way to specify this threshold, so I assume it's done implicitly in the back-end? If this is the case, how does Keras distinguish between the use of sigmoid in a binary classification problem and in a regression problem? With binary classification we want a binary value, but with regression we need a continuous value. All I can see that could be indicating this is the loss function. Is that what informs Keras how to handle the data?
Additionally, assuming Keras is implicitly applying a threshold, why does it output continuous values when I use my model to predict on new data?
For example:
y_pred = model.predict(x_test)
print(y_pred)
gives:
[7.4706882e-02] [8.3481872e-01] [2.9314638e-04] [5.2297767e-03]
[2.1608515e-01] ... [4.4894204e-03] [5.1120580e-05] [7.0263929e-04]
I can apply a threshold myself when predicting to get a binary output; however, surely Keras must be doing that anyway in order to classify correctly? Perhaps Keras applies a threshold when training the model, but when I use it to predict new values no threshold is applied, since the loss function isn't used when predicting? Or is it not applying a threshold at all, and the continuous values output just happen to work well with my model? I've checked that this also happens with the Keras example for binary classification, so I don't think I've made any errors in my code, especially as it predicts accurately.
If anyone could explain how this is working, I would greatly appreciate it.
Here's my model as a point of reference:
model = Sequential()
model.add(Dense(124, activation='relu', input_shape = (2,)))
model.add(Dropout(0.5))
model.add(Dense(124, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(1, activation='sigmoid'))
model.summary()
model.compile(loss='binary_crossentropy',
              optimizer=SGD(lr=0.1, momentum=0.003),
              metrics=['acc'])
history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    validation_data=(x_test, y_test))
score = model.evaluate(x_test, y_test, verbose=0)

The output of a binary classifier with a sigmoid activation is the probability of the sample belonging to the positive class.
how is Keras distinguishing between the use of sigmoid in a binary classification problem, or a regression problem?
It does not need to. It uses the loss function to calculate the loss, then the derivatives, and updates the weights.
In other words:
During training the framework minimizes the loss. The user must specify the loss function (provided by the framework) or supply their own. The network only cares about the scalar value this function outputs, and its two arguments are the predicted ŷ and the actual y.
Each activation function implements the forward-propagation and back-propagation functions. The framework is only interested in these two functions. It does not care what the function does exactly, as long as it is differentiable so that gradient descent can work.
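So during training no threshold is involved at all: the sigmoid output is treated as a probability and fed directly into the binary cross-entropy loss. A threshold only enters when you want hard class labels, and you apply it yourself. A minimal sketch, assuming the model above and the usual 0.5 cut-off (the cut-off is your choice, not something Keras fixes):
# model.predict returns the raw sigmoid outputs: probabilities in [0, 1], one per sample
probs = model.predict(x_test)
# apply the decision threshold yourself to get hard 0/1 class labels
y_pred_labels = (probs > 0.5).astype(int)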

You can set the threshold used by the accuracy metric explicitly in compile() by using
tf.keras.metrics.BinaryAccuracy(
    name="binary_accuracy", dtype=None, threshold=0.5
)
like the following:
model.compile(optimizer='sgd',
              loss='binary_crossentropy',
              metrics=[tf.keras.metrics.BinaryAccuracy(threshold=0.5)])
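Note that this threshold only affects how the binary_accuracy metric is computed and reported; it does not change the loss, the training, or the raw probabilities returned by model.predict().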

Related

Basic Regression Neural Network unable to learn

I am trying to train a basic neural network for regression on a dataset to predict the price range of a car. A linear regression model doesn't perform very well on this dataset, so I am building a neural network model instead.
Here are the layers I used.
tf.keras.backend.clear_session()
nmodel = Sequential()
nmodel.add(tf.keras.layers.Dense(10, activation='relu', input_shape=[28,]))
nmodel.add(tf.keras.layers.Dense(units=1))
Here is how I compiled it:
opt = tf.keras.optimizers.Adam(learning_rate=.2)
nmodel.compile(loss='mean_squared_error' , optimizer=opt, metrics=['accuracy'])
And this is the final function I used to fit it.
keras_history = nmodel.fit(X_train, Y_train , batch_size=32 ,epochs=100 , validation_data=(X_test, Y_test))
The loss for the training set usually starts to stagnate after 5 epochs and the model stops learning after that. The accuracy of the model is also very low, i.e. ~0.0015.
I have tried a couple of ways to resolve this. I initially thought that maybe the learning rate was too small but when I increased the learning rate, it would not learn either.
I thought of removing the activation function as maybe the relu was causing the neurons to die after it reached a certain loss. That too had no results.
I have tested out different numbers of layers and different numbers of neurons. In the end, all of them end up having no effect on the model learning.
I used to make classification neural networks and this is my first time making a regression neural network. I feel like I am missing something quite basic.
I took a look at your notebook and noted two things:
You shouldn't fit_transform the test set
We use fit_transform() on the train data to learn the parameters of the scaling from the train data and scale it at the same time. However, we only use transform() on the test data, because we need to keep the scaling parameters learned from the train data in order to scale the test data consistently.
You need to have
X_test = scaler.transform(X_test)
Also, you don't need to scale the target value.
You can use this instead:
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

My LSTM model overfits over validation data

Here's my LSTM model to classify hand gestures. Initially, I had 1960 training samples of shape (num_sequences, num_joints, 3) that I reshaped to shape (num_sequences, num_joints*3).
Here's my model:
input_shape = (trainx.shape[1], trainx.shape[2])
print("Build LSTM RNN model ...")
model = Sequential()
model.add(Masking(mask_value=0., input_shape=(171, 66)))
model.add(Bidirectional(LSTM(units=256, activation='tanh', return_sequences=True, input_shape=input_shape)))
model.add(Dropout(0.5))
model.add(BatchNormalization())
model.add(Bidirectional(LSTM(units=128, activation='tanh', return_sequences=True)))
model.add(Dropout(0.5))
model.add(BatchNormalization())
model.add(Bidirectional(LSTM(units=128, activation='tanh', return_sequences=False)))
model.add(Dropout(0.5))
model.add(BatchNormalization())
model.add(Dense(units=trainy.shape[1], activation="softmax"))
print("Compiling ...")
# Keras optimizer defaults:
# Adam : lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-8, decay=0.
# RMSprop: lr=0.001, rho=0.9, epsilon=1e-8, decay=0.
# SGD : lr=0.01, momentum=0., decay=0.
opt = Adam()
model.compile(loss="categorical_crossentropy", optimizer=opt, metrics=["accuracy"])
I get a 90% accuracy on train and 50% on test
Overfitting is quite common in deep learning.
To circumvent overfitting with your LSTM architecture, try the following things in this order:
Decrease the learning rate from 0.1 or 0.01 to 0.001, 0.0001, or 0.00001.
Reduce the number of epochs. You can plot the training and validation accuracy as a function of the number of epochs and see when the training accuracy becomes larger than the validation accuracy; that is the number of epochs you should use. Combine this with the first step (decreasing the learning rate).
Then you can try to modify the architecture of the LSTM. You have already added dropout (at the maximum value of 0.5); I would suggest trying 0.2 or 0.3. You have 3 cells, which is better than 2, and the sizes of the layers look reasonable. What embedding dimension are you currently using? Since you are overfitting, it is worth trying to reduce the number of cells from 3 to 2 while keeping the same number of nodes.
The batch size might matter as well, as might the distribution of subclasses in your dataset. Is the dataset equally distributed and equally balanced between the training and validation sets? What I mean is that if one hand gesture is over-represented in the training set compared to the validation set, that might be a problem. A good strategy to overcome this is to keep some part of the data as a test set, then do a 5-fold train/validation cross-validation using sklearn, train your architecture on each split separately, and compare the training and validation accuracy (see the sketch after this list). If there is a big bias in the split or among the sets, you will be able to identify it this way.
Last, you can try augmentation, specifically rotation and horizontal/vertical flip. This library might help https://github.com/aleju/imgaug
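For the cross-validation suggestion above, a rough sketch with sklearn, assuming trainx and trainy are NumPy arrays and trainy is one-hot encoded (both assumptions on my part):
import numpy as np
from sklearn.model_selection import StratifiedKFold

labels = np.argmax(trainy, axis=1)  # integer class labels, assuming one-hot trainy
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for fold, (train_idx, val_idx) in enumerate(skf.split(trainx, labels)):
    x_tr, x_val = trainx[train_idx], trainx[val_idx]
    y_tr, y_val = trainy[train_idx], trainy[val_idx]
    # rebuild and train the model on (x_tr, y_tr), evaluate on (x_val, y_val),
    # then compare training vs. validation accuracy across the 5 folds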
Hope this helps!
How do you know the network is overfitting versus there being some kind of error in your data set? Does the validation loss improve initially up to some epoch, then plateau or start to increase? Then it is overfitting. If it starts at 50% and stays there, it is not an overfitting problem. With the amount of dropout you have, overfitting does not look very likely. How did you select your validation set? Was it randomly selected from the overall data set, or did you do the selection yourself? It is always better to select the data randomly so that its probability distribution mirrors that of the training data. As said in the comments, please show your code for model.fit; there could be a problem there. How do you input the data? Did you use generators? A 50% validation accuracy leads me to suspect some error in how your validation data is provided to the network, or some error in the labeling of the validation data.

I would also recommend dynamically adjusting your learning rate based on monitoring of the validation loss. Keras has a callback for this called ReduceLROnPlateau; see the Keras documentation. Set it up to monitor validation loss. I set the parameters patience=3 and factor=.5, which seems to work well. You can think of training as descending into a valley. As you descend, the valley gets narrower. If the learning rate is too large and remains fixed, you won't be able to reach further down toward the minimum. This should improve your training accuracy, which should in turn improve validation accuracy. As I said, with the level of dropout you have I do not think it is overfitting, but if it is, you can also use Keras regularizers to help avoid overtraining; see the Keras documentation.
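A rough sketch of the ReduceLROnPlateau setup described above, using placeholder names (x_val, y_val, epochs, batch_size) for your own validation data and settings:
from tensorflow.keras.callbacks import ReduceLROnPlateau

# halve the learning rate whenever val_loss has not improved for 3 epochs
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=3, verbose=1)

history = model.fit(trainx, trainy,
                    epochs=epochs, batch_size=batch_size,
                    validation_data=(x_val, y_val),
                    callbacks=[reduce_lr])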

LSTM accuracy too low

I'm a new learner, and I'm just trying to get the accuracy and validation accuracy using the code below.
model = Sequential()
model.add(LSTM(10, input_shape=(train_X.shape[1], train_X.shape[2])))
#model.add(Dropout(0.2))
#model.add(LSTM(30, input_shape=(train_X.shape[1], train_X.shape[2])))
model.add(Dense(1))
model.compile(loss='mae', optimizer='adam', metrics=['accuracy'])
# fit network
history = model.fit(train_X, train_y, epochs=50, batch_size=120, validation_data=(test_X, test_y), verbose=2, shuffle=False)
# plot history
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
pyplot.show()
print(history.history['acc'])
The loss value is very low (around 0.0136), but in spite of that the accuracy is 6.9% and the validation accuracy is 2.3%, which is very low.
That is because accuracy is meaningful only for classification problems; for regression (i.e. numeric prediction) ones, such as yours, accuracy is meaningless.
What's more, the fact is that Keras unfortunately will not "protect" you or any other user from putting such meaningless requests in your code, i.e. you will not get any error, or even a warning, that you are attempting something that does not make sense, such as requesting the accuracy in a regression setting; see my answer in What function defines accuracy in Keras when the loss is mean squared error (MSE)? for more details and a practical demonstration (the argument is identical in the case of MAE instead of MSE, since both loss functions signify regression problems).
In regression settings, the performance metric is usually the same as the loss (here MAE), so you should just remove the metrics=['accuracy'] argument from your model compilation and worry only about your loss (which, as you say, is indeed low).
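Concretely, for your code that just means dropping the accuracy metric and watching the loss instead; a minimal sketch based on your snippet:
model.compile(loss='mae', optimizer='adam')  # no accuracy metric in a regression setting
history = model.fit(train_X, train_y, epochs=50, batch_size=120,
                    validation_data=(test_X, test_y), verbose=2, shuffle=False)
print(history.history['loss'][-1], history.history['val_loss'][-1])  # monitor MAE instead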

How exactly does fitting with SGD work in Keras?

I'm a newbie in neural networks. I'm doing my university NN project in Keras. I assembled and trained a one-layer sequential model using the SGD optimizer:
[...]
nn_model = Sequential()
nn_model.add(Dense(32, input_dim=X_train.shape[1], activation='tanh'))
nn_model.add(Dense(1, activation='tanh'))
sgd = keras.optimizers.SGD(lr=0.001, momentum=0.25)
nn_model.compile(loss='mean_squared_error', optimizer=sgd, metrics=['accuracy'])
history = nn_model.fit(X_train, Y_train, epochs=2000, verbose=2, validation_data=(X_test, Y_test))
[...]
I've tried various learning rates, momenta, and numbers of neurons, and I get satisfying accuracy and error results. But I need to know how Keras works, so could you please explain to me how exactly fitting works in Keras? I can't find it in the Keras documentation.
How does Keras update weights? Does it use a backpropagation algorithm? (I'm 95% sure.)
How is the SGD algorithm implemented in Keras? Is it similar to the Wikipedia explanation?
How exactly does Keras calculate a gradient?
Thank you kindly for any information.
Let's try to break it down and I'll cover only Keras specific bits:
How does Keras update weights? Using an Optimizer, which is the base class for the different optimizers. Each optimizer calculates the new weights in a get_updates function, which returns a list of update operations that, when run, apply the updates.
Back-propagation? Yes, but Keras doesn't implement it directly; it leaves it to the backend tensor libraries to perform automatic differentiation. For example, K.gradients calls tf.gradients in the TensorFlow backend.
SGD algorithm? It is implemented as described on Wikipedia, in the SGD class, with basic extensions such as momentum. You can follow the code easily and see how it calculates the updates.
How is the gradient calculated? Using back-propagation.
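If it helps to see the idea in code, here is a rough sketch of what a single training step boils down to, written with TensorFlow 2's GradientTape rather than the older K.gradients/get_updates machinery, so it illustrates the mechanism, not Keras' literal implementation:
import tensorflow as tf

loss_fn = tf.keras.losses.MeanSquaredError()
optimizer = tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.25)

def train_step(x_batch, y_batch):
    with tf.GradientTape() as tape:
        y_pred = nn_model(x_batch, training=True)   # forward pass
        loss = loss_fn(y_batch, y_pred)              # scalar loss value
    # automatic differentiation: back-propagate the loss to every trainable weight
    grads = tape.gradient(loss, nn_model.trainable_variables)
    # SGD with momentum applies the weight updates
    optimizer.apply_gradients(zip(grads, nn_model.trainable_variables))
    return loss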

Multi-class multi-label classification in Keras

I am trying to train a multi-task multi-label classifier using Keras. The output layer is a fork with two outputs, and each output head predicts the categories for its own task. The y vectors are one-hot encoded.
I am using a custom generator for my data that yields the y arrays in a list to the fit_generator function
I am using a categorical_crossentropy loss function at each output:
fork1.compile(loss={'O1': 'categorical_crossentropy', 'O2': 'categorical_crossentropy'},
              optimizer=optimizers.Adam(lr=0.001),
              metrics=['accuracy'])
The problem: the loss doesn't decrease with this setup. However, if I train each task separately, I get low loss and high accuracy. So what could be the problem?
To perform multilabel categorical classification (where each sample can have several classes), end your stack of layers with a Dense layer with a number of units equal to the number of classes and a sigmoid activation, and use binary_crossentropy as the loss. Your targets should be k-hot encoded.
Regarding the multi-output model: training such a model requires the ability to specify different loss functions for the different heads of the network, which in turn requires a different training procedure.
You should provide more info in order to give a clear indication of what you want to achieve.
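A minimal sketch of the multilabel recipe above on a single-head model (feature_dim and num_classes are placeholders, not taken from your code):
from tensorflow.keras import layers, models

inputs = layers.Input(shape=(feature_dim,))
x = layers.Dense(128, activation='relu')(inputs)
# one independent sigmoid per class; targets are k-hot vectors
outputs = layers.Dense(num_classes, activation='sigmoid')(x)

model = models.Model(inputs, outputs)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])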
