Usage of sigmoid activation function in Keras - python

I have a big dataset of 18260 input fields with 4 output classes. I am using Keras and TensorFlow to build a neural network that can detect the correct output class.
However, I have tried many configurations and the accuracy does not get above 55% unless I use the sigmoid activation function in all model layers except the first one, as below:
from keras.models import Sequential
from keras.layers import Dense

def baseline_model(optimizer='adam', init='random_uniform'):
    # create model
    model = Sequential()
    model.add(Dense(40, input_dim=18260, activation="relu", kernel_initializer=init))
    model.add(Dense(40, activation="sigmoid", kernel_initializer=init))
    model.add(Dense(40, activation="sigmoid", kernel_initializer=init))
    model.add(Dense(10, activation="sigmoid", kernel_initializer=init))
    model.add(Dense(4, activation="sigmoid", kernel_initializer=init))
    model.summary()
    # Compile model
    model.compile(loss='sparse_categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    return model
Is using sigmoid as the activation correct for all layers? The accuracy reaches 99.9% when using sigmoid as shown above, so I was wondering if there is something wrong with the model implementation.

Sigmoid might work, but I suggest using relu activation for the hidden layers. The problem is that your output layer's activation is sigmoid but it should be softmax (because you are using the sparse_categorical_crossentropy loss):
model.add(Dense(4, activation="softmax", kernel_initializer=init))
Edit after discussion in the comments
Your outputs are integer class labels. The sigmoid (logistic) function outputs values in the range (0, 1). The outputs of softmax are also in the range (0, 1), but the softmax function adds another constraint on the outputs: they must sum to 1. Therefore the outputs of softmax can be interpreted as the probability of the input belonging to each class.
E.g.
import numpy as np

def sigmoid(x):
    return 1.0 / (1 + np.exp(-x))

def softmax(a):
    return np.exp(a - max(a)) / np.sum(np.exp(a - max(a)))

a = np.array([0.6, 10, -5, 4, 7])
print(sigmoid(a))
# [0.64565631, 0.9999546 , 0.00669285, 0.98201379, 0.99908895]
print(softmax(a))
# [7.86089760e-05, 9.50255231e-01, 2.90685280e-07, 2.35544722e-03, 4.73104222e-02]
print(sum(softmax(a)))
# 1.0

You have to use one activation or the other, because activations are what bring non-linearity into the model. If the model doesn't have any activation, it basically behaves like a single-layer network. You can read more about why activations are needed and browse the various available activations in the Keras documentation.
That said, it looks like your model is overfitting when using sigmoid, so try techniques to overcome it, such as creating proper train/dev/test splits, reducing the complexity of the model, adding dropout, etc.

Neural networks require non-linearity at each layer to work. Without non-linear activations, no matter how many layers you have, you could write the same thing with only one layer.
Linear functions are limited in complexity: if g and f are linear functions, then g(f(x)) can be written as z(x), where z is also a linear function. It is pointless to stack them without adding non-linearity.
That's why we use non-linear activation functions: sigmoid(g(f(x))) cannot be written as a linear function.
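A small numerical sketch of that argument: stacking two linear maps collapses into a single linear map, while putting a sigmoid in between does not (the matrices below are arbitrary examples):
import numpy as np

W1, b1 = np.array([[1.0, 2.0], [0.5, -1.0]]), np.array([0.1, 0.2])   # f(x) = W1 @ x + b1
W2, b2 = np.array([[0.3, -0.7], [2.0, 1.5]]), np.array([-0.4, 0.6])  # g(h) = W2 @ h + b2

x = np.array([1.0, -2.0])

# g(f(x)) is again linear: z(x) = (W2 W1) x + (W2 b1 + b2)
stacked = W2 @ (W1 @ x + b1) + b2
collapsed = (W2 @ W1) @ x + (W2 @ b1 + b2)
print(np.allclose(stacked, collapsed))  # True, so the two layers act like one

# With a sigmoid in between, no single (W, b) reproduces the mapping for all x
sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
nonlinear = W2 @ sigmoid(W1 @ x + b1) + b2
print(nonlinear)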

Related

Can we have a dense layer between Conv layers in 1D CNN Architecture?

I have a question regarding one-dimensional convolutional neural networks (1D CNNs).
Can we have a dense layer between Conv layers in the architecture, just like I have done in the following example?
Note: It is working correctly with CSV files for classification problems.
from keras.models import Sequential
from keras.layers import Conv1D, MaxPooling1D, Dense, Flatten

model = Sequential()
# First Convolutional Layer
model.add(Conv1D(128, 5, input_shape=(20, 1), strides=2, padding='same'))
model.add(Dense(256, activation="relu"))
model.add(MaxPooling1D())
# Second Convolutional Layer
model.add(Conv1D(128, 3, strides=1, padding='same'))
model.add(Dense(64, activation="relu"))
model.add(MaxPooling1D())
# Passing to Fully Connected Layers
model.add(Flatten())
model.add(Dense(32, activation='relu'))
# model.add(Dropout(0.02))
# Output Layer
model.add(Dense(2, activation='sigmoid'))
# Model Compilation
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam', metrics=['accuracy'])
# Summary of The Model
model.summary()
Thank you very much!
Yes, you can certainly do that. It is not usual at all and not very advisable from a theoretical perspective, but it is possible.
Why is it not advisable? (theory) With convolutions one tries to capture spatial features (i.e. information). Values next to each other should have an influence, but values far away from this point (in time, in the case of time series data) should have less influence. That is the whole idea of CNNs. To a fully connected NN the order in which the input is presented is not important; it looks at all inputs at the same time since it is equally connected to all of them. So you lose spatial information. BTW, that is also the reason why it is plausible to do a global pooling before feeding the output of the CNN part of a model to the fully connected part (i.e. the dense layers).
Now if you do convolution, you care about spatial information. If you then apply a dense layer, you kind of say "I cared enough about the spatial info". If you then apply convolution again on the output vector of a dense layer it becomes totally irrational.
Feasibility
Nonetheless, such a network would be feasible. You would just need to make sure that the dense layer outputs a vector (or matrix) again, on which you can apply convolution.
However, your code lacks a proper adapter from the output of the convolutional layers to the dense part. You should apply some kind of global pooling operation to create a vector that serves as input to the dense layer; that would also save you the Flatten() step. Again, it should work your way anyway; it is just a matter of style, but right now you are sending mixed signals: Flatten concatenates all spatial positions, yet the fully connected network ignores the spatial information anyway.
I also don't get the point of applying MaxPooling1D after the Dense layer; one could simply reduce the number of outputs of the Dense layer. And you definitely don't need a second Flatten after a Dense layer, as it returns a vector by definition (and pooling won't add a dimension to it).
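To make the pooling-adapter suggestion concrete, here is a minimal sketch of the same kind of 1D CNN where a GlobalAveragePooling1D layer bridges the convolutional part and the dense part (layer sizes are taken from the question and are only illustrative, not tuned):
from keras.models import Sequential
from keras.layers import Conv1D, MaxPooling1D, GlobalAveragePooling1D, Dense

model = Sequential()
model.add(Conv1D(128, 5, input_shape=(20, 1), strides=2, padding='same', activation='relu'))
model.add(MaxPooling1D())
model.add(Conv1D(128, 3, strides=1, padding='same', activation='relu'))
# GlobalAveragePooling1D collapses the time dimension into a single vector,
# so no Flatten is needed before the fully connected layers
model.add(GlobalAveragePooling1D())
model.add(Dense(32, activation='relu'))
model.add(Dense(2, activation='softmax'))
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])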

softmax and sigmoid are giving same results in multiclass classification

I am building an LSTM model. I tested my model using both the softmax and the sigmoid activation function. In the documentation, sigmoid is used for binary classification and softmax for multiclass classification, but in my case both are giving the same results. Why is that?
Here is my code:
embedding_vecor_length = 128
max_length = 700
model = Sequential()
model.add(Embedding(len(tokenizer.word_index)+1, embedding_vecor_length, input_length=max_length))
model.add(Conv1D(filters=32, kernel_size=5, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Conv1D(filters=16, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Bidirectional(LSTM(64)))
model.add(Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
Here are the predicted results:
[[2.72062905e-02 1.47979835e-03 4.44446778e-04 1.60833297e-05
4.15672457e-06 3.20438482e-02 9.38653767e-01 1.41544719e-04
5.55426550e-06 4.47654566e-06]
[2.31099591e-01 1.71699154e-03 1.32052042e-02 4.70457249e-04
8.86382014e-02 2.65704724e-03 6.54215395e-01 7.50611164e-03
4.89178114e-04 1.89376965e-06]
[1.24909900e-01 8.73659015e-01 9.71468398e-06 1.66079029e-04
1.05203628e-06 4.14116839e-05 3.97000113e-05 6.98190925e-05
1.10231712e-03 9.84829512e-07]
The sigmoid allows you to have high probability for all of your classes, some of them, or none of them. Example: classifying diseases in a chest x-ray image. The image might contain pneumonia, emphysema, and/or cancer, or none of those findings.
The softmax enforces that the sum of the probabilities of your output classes are equal to one, so in order to increase the probability of a particular class, your model must correspondingly decrease the probability of at least one of the other classes. Example: classifying images from the MNIST data set of handwritten digits. A single picture of a digit has only one true identity - the picture cannot be a 7 and an 8 at the same time.
So in your case, if the model is good, the predictions will not differ a lot whether you use sigmoid or softmax; softmax forces the sum of the predictions to be 1, while sigmoid doesn't.
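A small numpy sketch of why the predicted class often comes out the same: both functions are monotonic in the logits, so the largest raw output stays the largest after either transformation (the logits below are made up):
import numpy as np

logits = np.array([1.2, -0.3, 3.1, 0.4])        # hypothetical raw outputs for 4 classes

sig = 1.0 / (1.0 + np.exp(-logits))             # independent per-class scores
soft = np.exp(logits) / np.sum(np.exp(logits))  # scores normalised to sum to 1

print(np.argmax(sig), np.argmax(soft))          # same class index for both
print(sig.sum(), soft.sum())                    # only the softmax scores sum to 1.0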

relu as a parameter in Dense() (or any other layer) vs ReLU as a layer in Keras

I was just wondering if there is any significant difference between the use and speciality of
Dense(activation='relu')
and
keras.layers.ReLU
How and where can the latter be used? My best guess is in a Functional API use case, but I don't know how.
Creating a layer instance and passing the activation as a parameter, i.e. activation='relu', is the same as creating the layer instance and then adding a ReLU activation layer after it. ReLU() is a layer that applies the K.relu() function to its inputs:
class ReLU(Layer):
    ...
    def call(self, inputs):
        return K.relu(inputs,
                      alpha=self.negative_slope,
                      max_value=self.max_value,
                      threshold=self.threshold)
From the Keras documentation:
Usage of activations
Activations can either be used through an Activation layer, or through
the activation argument supported by all
forward layers:
from keras.layers import Activation, Dense
model.add(Dense(64))
model.add(Activation('tanh'))
This is equivalent to:
model.add(Dense(64, activation='tanh'))
You can also pass an element-wise TensorFlow/Theano/CNTK function as
an activation:
from keras import backend as K
model.add(Dense(64, activation=K.tanh))
Update:
Answering OP's additional question: How and where can the latter be used?
You can use it when you use a layer which doesn't accept an activation parameter, e.g. tf.keras.layers.Add, tf.keras.layers.Subtract, etc., but you want a rectified output of such a layer as a result:
added = tf.keras.layers.Add()([x1, x2])
relu = tf.keras.layers.ReLU()(added)
The most obvious use case is when you need to put a ReLU without a Dense layer. For example, when implementing ResNet, the design requires a ReLU activation after summing the residual connection, as shown here:
x = layers.add([x, shortcut])
x = layers.Activation('relu')(x)
return x
It is also useful when you want to put a BatchNormalization layer between the pre-activation output of a Dense layer and the ReLU activation. When using a GlobalAveragePooling classifier (such as in the SqueezeNet architecture), you need to put a softmax activation after the GAP using Activation("softmax"), and there are no Dense layers in the network.
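As a sketch of the BatchNormalization case: the Dense layer is created without an activation, the normalization acts on its pre-activations, and the ReLU follows as its own layer (the layer sizes and input shape here are arbitrary placeholders):
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Dense(64, activation=None, input_shape=(100,)),  # linear pre-activation output
    layers.BatchNormalization(),                            # normalise before the non-linearity
    layers.ReLU(),
    layers.Dense(10, activation='softmax'),
])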
There are probably more cases, these are just samples.

Loss Returns NaN When Using Regularizers On LSTM

When I train my LSTM model, it returns nan for the loss
I am using a single layer LSTM with a Dense softmax layer at the end for classification output
Adam optimizer
Categorical crossentropy loss function
Relu activation
For some reason, when I use regularizers of any kind on the LSTM layer, I get NaN for the loss.
Also, this only occurs when I make the LSTM layer larger than 128 units, so the problem goes away if I remove the regularization or keep the network at 128 units or fewer.
I have already confirmed that there are no NaNs in the input.
I'm wondering why this happens, and how it is possible to regularize larger LSTM layers.
Here's my code:
from keras.models import Sequential
from keras.layers import LSTM, Dense, Activation
from keras import regularizers

def build_model():
    model = Sequential()
    model.add(LSTM(130, batch_input_shape=(None, 90, 5), return_sequences=False,
                   recurrent_dropout=0.1, kernel_regularizer=regularizers.l2(0.01)))
    model.add(Activation("relu"))
    model.add(Dense(2))
    model.add(Activation("softmax"))
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["categorical_accuracy"])
    return model
Thanks

Where do the parameters in keras layers apply?

I'm trying to get to grips with the basics of neural networks and am struggling to understand keras layers.
Take the following code from tensorflow's tutorials:
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(128, activation=tf.nn.relu),
    keras.layers.Dense(10, activation=tf.nn.softmax)
])
So this network has 3 layers? The first is just the 28*28 nodes representing the pixel values. The second is a hidden layer which takes weighted sums from the first, applies relu, and then sends these to the 10 output nodes, which are softmaxed?
but then this model seems to require different inputs to the layers:
model = keras.Sequential([
    layers.Dense(64, activation=tf.nn.relu, input_shape=[len(train_dataset.keys())]),
    layers.Dense(64, activation=tf.nn.relu),
    layers.Dense(1)
])
Why does the input layer now have both an input_shape and a value 64? I read that the first parameter specifies the number of nodes in the second layer, but that doesn't seem to fit with the code in the first example. Also, why does the input layer have an activation? Is this just relu-ing the values before they enter the network?
Also, with regard to activation functions, why are softmax and relu treated as alternatives? I thought relu applied to all the inputs of a single node, whereas softmax acted on the outputs of all the nodes across a layer?
Any help is really appreciated!
First example is from: https://www.tensorflow.org/tutorials/keras/basic_classification
Second example is from: https://www.tensorflow.org/tutorials/keras/basic_regression
Basically, you have two types of API in Keras: the Sequential API and the Functional API (https://keras.io/getting-started/sequential-model-guide/).
In the Sequential API you don't explicitly refer to an Input layer (https://keras.io/layers/core/#input).
That is why you need to add an input_shape argument to specify the input dimensions to the first layer.
More information: https://jovianlin.io/keras-models-sequential-vs-functional/
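For comparison, a minimal sketch of the same small regression model written both ways; in the Sequential API the input shape is attached to the first layer, while in the Functional API the input is an explicit Input object (the feature count of 10 is just a placeholder):
from tensorflow.keras import Input, Model, Sequential, layers

# Sequential API: the first Dense layer carries input_shape
seq_model = Sequential([
    layers.Dense(64, activation='relu', input_shape=(10,)),
    layers.Dense(64, activation='relu'),
    layers.Dense(1),
])

# Functional API: the input layer is explicit
inputs = Input(shape=(10,))
x = layers.Dense(64, activation='relu')(inputs)
x = layers.Dense(64, activation='relu')(x)
outputs = layers.Dense(1)(x)
func_model = Model(inputs, outputs)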
