I am trying to train a multi-task multi-label classifier using Keras. The output layer is a fork of two outputs. The task of each output layer is to predict the categories of its task. The y vectors are OneHot encoded.
I am using a custom generator for my data that yields the y arrays in a list to the fit_generator function.
I am using a categorical_crossentropy loss function at each output layer:
fork1.compile(loss={'O1': 'categorical_crossentropy', 'O2': 'categorical_crossentropy'},
              optimizer=optimizers.Adam(lr=0.001),
              metrics=['accuracy'])
The problem: the loss doesn't decrease with this setup. However, if I train each task separately, I get low loss and high accuracy. What could be the problem?
To perform multilabel categorical classification (where each sample can have several classes), end your stack of layers with a Dense layer with a number of units equal to the number of classes and a sigmoid activation, and use binary_crossentropy as the loss. Your targets should be k-hot encoded.
Regarding the multi-output model: training it requires the ability to specify a separate loss function for each head of the network, since different tasks may call for different training procedures.
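As a rough illustration of both points (not the asker's actual model), the NumPy sketch below scores two heads with sigmoid plus binary cross-entropy against k-hot targets and combines the two losses as a weighted sum. The head names O1 and O2 come from the question; the logits, the targets and the `loss_weights` values are made up for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_crossentropy(y_true, logits, eps=1e-7):
    """Mean binary cross-entropy over samples and classes."""
    p = np.clip(sigmoid(logits), eps, 1.0 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))))

# Hypothetical k-hot targets: each row may contain several 1s.
y_o1 = np.array([[1, 0, 1], [0, 1, 1]], dtype=float)
y_o2 = np.array([[1, 1], [0, 1]], dtype=float)

# Hypothetical raw outputs (logits) of the two heads.
logits_o1 = np.array([[2.0, -1.5, 0.5], [-2.0, 1.0, 3.0]])
logits_o2 = np.array([[0.2, 1.0], [-0.7, 2.5]])

# One loss per head, then a weighted sum as the quantity to minimize.
losses = {
    "O1": binary_crossentropy(y_o1, logits_o1),
    "O2": binary_crossentropy(y_o2, logits_o2),
}
loss_weights = {"O1": 1.0, "O2": 0.5}  # assumed values, tune per task

total_loss = sum(loss_weights[name] * losses[name] for name in losses)
print(losses, total_loss)
```

Keras exposes the same weighting through the `loss_weights` argument of `compile`; whether reweighting helps in this particular case is something only experimentation can tell.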
You should provide more information so that it is clear what you want to achieve.
Related
I'm trying to build a model that uses an MLP for feature extraction and dimensionality reduction. The model should transform the data from 204 dimensions to 80 dimensions. The proposed model is as follows:
A 512-unit dense layer that takes the original data (204 dimensions) as input
A 256-unit dense layer that takes the 512-dimensional output as input
An 80-unit dense layer that takes the 256-dimensional output as input
The proposed number of training epochs is 1, and the output of the MLP is used as the input to further models (such as LR, SVM, etc.).
My question is: when training the MLP, what loss function should I use? Is MSE loss OK, or should I use another loss function? Thanks!
What would you be training this MLP on? (what would be the target 80-dimensional "Y"?)
MLPs are used to learn features at the same time as the model. For example, if you wanted an MLP that does linear regression and learns a set of 80-dimensional features, you could create something like this:
import keras
from keras import layers

# MY_ACTIVATION is a placeholder for the activation of your choice
model = keras.models.Sequential()
model.add(layers.Dense(80, input_dim=512, activation=MY_ACTIVATION))  # 80 learned features
model.add(layers.Dense(1))  # linear regression on top of the features
model.compile(loss="mean_squared_error", optimizer="adam")
In the last layer, the network will learn to find the "best" weights and biases to capture Y as a function of the 80 features extracted. These features are in turn a function of X - a function the network learns by adjusting for how well these features are able to capture Y (this is backpropagation).
So creating an MLP just to learn features doesn't make sense without a problem statement for what these features are supposed to do.
As such I would recommend using something like Principal Component Analysis or Singular Value Decomposition. These project the data onto the k-dimensional space that captures the most variance (information) in the data.
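A minimal sketch of that suggestion, using plain NumPy and random data as a stand-in for the real 204-dimensional inputs. The function name `pca_project` and the sample count are assumptions for the example; k=80 mirrors the target dimensionality in the question.

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X onto the top-k principal components."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, sorted by decreasing singular value.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:k]
    # Fraction of the total variance retained by the first k components.
    explained = float((S[:k] ** 2).sum() / (S ** 2).sum())
    return X_centered @ components.T, explained

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 204))  # stand-in for the real data
Z, explained = pca_project(X, k=80)
print(Z.shape, explained)
```

The 80-dimensional `Z` can then be fed to LR, SVM and so on. Unlike the MLP, this needs no target and no loss function, which is why it fits an unsupervised feature-extraction step.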
I'm trying to train a neural network with supervised learning. The input x_train is a list of 100 lists, each containing 2000 columns, and the target output y_train is also a list of 100 lists, each containing 20 columns.
This is what x_train and y_train look like:
Here is the neural network that I created:
dnnmodel = tf.keras.models.Sequential()
dnnmodel.add(tf.keras.layers.Dense(40, input_dim = len(id2word), activation='relu'))
dnnmodel.add(tf.keras.layers.Dense(20, activation='relu'))
dnnmodel.compile(loss=tf.keras.losses.MeanSquaredLogarithmicError(), optimizer='adam', metrics=['accuracy'])
During training I cannot find the right number of neurons and layers, or the right activation and loss functions, since the accuracy and loss values are not at all reasonable. Can someone help me, please?
Here is the display after the execution:
There is no correct method or formula for deciding the number of layers or neurons, or any other function you use in your model. It all comes down to experimentation and what works best for your data and the problem you are trying to solve.
Here are some tips:
sigmoid, tanh - These activations are generally not used in hidden layers because their gradients become very small when they saturate, so the model can take a long time to converge.
ReLU, ELU, leaky ReLU - These activations can be used in hidden layers because their slope does not saturate for positive inputs, so training is faster. ReLU is commonly used.
Layers - The more layers you add, the deeper your neural network becomes. Deeper networks can learn complex features of your data, but they are prone to overfitting and can suffer from vanishing or exploding gradients. Fewer layers mean fewer parameters to learn, but the model is then more prone to underfitting.
Loss function - The loss function depends on the problem you are trying to solve.
For classification
If y_label is one-hot encoded, use categorical_crossentropy
If y_label is integer-encoded, use sparse_categorical_crossentropy
For regression problems
Use RMSE or MSE
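To make the loss choices concrete, here is a small NumPy sketch (hand-written functions, not the Keras implementations). It shows that the one-hot and the integer-label versions of cross-entropy compute the same value on the same predictions, and how RMSE relates to MSE. All numbers are made up.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def categorical_crossentropy(y_onehot, probs):
    """Cross-entropy when targets are one-hot rows."""
    return float(np.mean(-np.sum(y_onehot * np.log(probs), axis=1)))

def sparse_categorical_crossentropy(y_int, probs):
    """Cross-entropy when targets are integer class indices."""
    return float(np.mean(-np.log(probs[np.arange(len(y_int)), y_int])))

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

def rmse(y_true, y_pred):
    return mse(y_true, y_pred) ** 0.5

# Classification: same labels in two encodings.
probs = softmax(np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]))
y_int = np.array([0, 2])
y_onehot = np.eye(3)[y_int]
print(categorical_crossentropy(y_onehot, probs))
print(sparse_categorical_crossentropy(y_int, probs))

# Regression: continuous targets.
y_reg = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.5, 1.5, 3.0])
print(mse(y_reg, y_hat), rmse(y_reg, y_hat))
```

So the choice between the two cross-entropy variants is only about how the labels are stored, not about what is being optimized.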
Coming to the training logs: your model is training, as you can see from the loss at each epoch being lower than at the previous one. You should train for more epochs to see improvements in your accuracy.
I'm training an LSTM model for time series forecasting. This is the training loss plot.
This is a one-step-ahead forecasting case, so I'm training the model using a rolling window. Here we have 26 forecasting steps (at every step, I train the model again). As you can see, after epoch 25-27 the training loss suddenly becomes very noisy. Why do we see this behaviour?
P.S. I'm using an LSTM with tanh activation. I also used L1 and L2 regularization, but the behaviour is the same. The layer after the LSTM is a Dense layer with linear activation, MinMaxScaler is applied to the input data, and the optimizer is Adam. I also see the same behaviour on the validation dataset.
Are you using gradient clipping? If not, it could help, since gradient values can become extremely small or extremely large, which makes it very difficult for the model to make further progress. The recurrent layer may have created a valley in the loss surface that you keep overshooting because the gradient is too large.
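For reference, this is roughly what clipping by global norm does, written in plain NumPy. The gradient values and the `max_norm` threshold are invented for the example, and `clip_by_global_norm` here is a hand-written helper, not a library call.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    global_norm = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
    if global_norm <= max_norm:
        # Small gradients pass through unchanged.
        return [g.copy() for g in grads], global_norm
    scale = max_norm / global_norm
    # Every array is scaled by the same factor, so the direction is preserved.
    return [g * scale for g in grads], global_norm

grads = [np.array([[30.0, -40.0]]), np.array([120.0])]  # made-up exploding gradients
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = float(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))
print(norm_before, norm_after)
```

Keras optimizers expose clipping through arguments such as `clipnorm` and `clipvalue`, so with Adam it is a one-argument change to try.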
I am working on a classification task that uses byte sequences as samples. A byte sequence can be normalized as input to a neural network by applying x/255 to each byte x. In this way, I trained a simple MLP and the accuracy is about 80%. Then I trained an autoencoder with 'mse' loss on the whole dataset to see if it works well for the task. I froze the weights of the encoder's layers and added a softmax dense layer on top for classification. I then retrained the new model (training only the last layer) and, to my surprise, the result was much worse than the MLP: merely 60% accuracy.
Can't the autoencoder learn good features from all the data? Why the result is so bad?
Possible actions to take:
Check the autoencoder's reconstruction error: can it really reconstruct its input?
Visualize the autoencoder results (dimensionality reduction): is the variance explained with fewer dimensions?
A more complex model does not necessarily outperform a simpler one. Did you plot the validation MSE versus epoch? Is there a global minimum after a number of steps?
Did you train for enough epochs?
How many units do you have in your autoencoder? There may be too few (in which case the model underfits) or too many, depending on the behaviour of your data and its volume.
Did you compare with other dimensionality reduction methods such as PCA or NMF?
Last but not least, is an autoencoder the best way to engineer your features for this task?
"Why the result is so bad?" This is not actually a surprise. You've trained one model to be good at compressing the information. The transformations it learns at each layer do not need to be good for any other type of task at all. In fact, it could be throwing away a lot of information that is perfectly helpful for whatever auxiliary classification task you have, but which is not needed for a task purely of compressing and reconstructing the sequence.
Instead of training a separate autoencoder, you might have better luck just adding sparsity penalty terms from the MLP layers to the loss function, or using some other type of regularization such as dropout. Finally, you could consider more advanced network architectures, like ResNet / ODE layers or Inception layers, modified for a 1D sequence.
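As a hedged sketch of what those two regularizers amount to, here they are in NumPy: an L1 sparsity penalty on hidden activations added to the task loss, and inverted dropout. The layer sizes, the `l1_weight` value and the placeholder `task_loss` are assumptions for the example, not values from the question.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def loss_with_sparsity(task_loss, hidden_activations, l1_weight):
    """Task loss plus an L1 penalty on the hidden-layer activations."""
    penalty = l1_weight * float(np.mean(np.abs(hidden_activations)))
    return task_loss + penalty, penalty

def dropout(activations, rate, rng):
    """Inverted dropout: zero roughly a fraction `rate` of units, rescale the rest."""
    keep = rng.random(activations.shape) >= rate
    return activations * keep / (1.0 - rate)

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=(4, 16))   # stand-in for bytes scaled by 1/255
W = rng.normal(scale=0.5, size=(16, 8))   # hypothetical hidden-layer weights
hidden = relu(x @ W)

task_loss = 0.7  # placeholder for the classification loss on this batch
total, penalty = loss_with_sparsity(task_loss, hidden, l1_weight=1e-3)
hidden_train = dropout(hidden, rate=0.5, rng=rng)  # applied during training only
print(total, penalty, hidden_train.shape)
```

Both act directly on the classifier being trained, so the features stay tuned to the classification task rather than to reconstruction.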
I am trying to build a multi-task CNN in TensorFlow which has two dense layers in parallel, one for age prediction and the other for gender prediction. How can I train each dense layer for a different number of epochs, since one can converge before the other and training both for the same number of epochs would overfit one of them?
Also, if I propagate the gradients of both age and gender to the CNN, would it overfit, since its weights are being updated at twice the rate of the dense layers?
I asked a similar question and finally found the answer: LINK
SOLUTION: You can define two different train_step operations, each with its own learning rate. Each train_step can be called a chosen number of times. In addition, you can define dependencies if you want some variables to be trainable only for a selected train_step. (See the documentation.)
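The idea can be sketched without TensorFlow. The toy below is a framework-free analogy, not the code from the linked answer: a linear "trunk" stands in for the CNN, both heads use MSE for brevity, and every name, learning rate and step count is hypothetical. Each step function has its own learning rate and its own list of variables it is allowed to update.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 5))
y_age = x @ rng.normal(size=5)      # synthetic target for the age head
y_gender = x @ rng.normal(size=5)   # synthetic target, treated as regression here

params = {
    "shared": rng.normal(scale=0.1, size=(5, 4)),  # stands in for the CNN trunk
    "age": rng.normal(scale=0.1, size=4),
    "gender": rng.normal(scale=0.1, size=4),
}

def mse_and_grads(head, y):
    """MSE of one head and its gradients w.r.t. the trunk and that head."""
    h = x @ params["shared"]
    err = h @ params[head] - y
    loss = float(np.mean(err ** 2))
    g_head = 2.0 * h.T @ err / len(y)
    g_shared = 2.0 * np.outer(x.T @ err, params[head]) / len(y)
    return loss, {"shared": g_shared, head: g_head}

def make_train_step(head, y, lr, trainable):
    """Build a step that updates only the variables named in `trainable`."""
    def step():
        loss, grads = mse_and_grads(head, y)
        for name in trainable:
            params[name] -= lr * grads[name]
        return loss
    return step

# Age step trains trunk + age head; gender step trains the gender head only.
age_step = make_train_step("age", y_age, lr=0.05, trainable=["shared", "age"])
gender_step = make_train_step("gender", y_gender, lr=0.01, trainable=["gender"])

age_losses = [age_step() for _ in range(200)]
gender_losses = [gender_step() for _ in range(50)]
print(age_losses[0], age_losses[-1])
print(gender_losses[0], gender_losses[-1])
```

Because the gender step leaves the trunk untouched, the shared weights are not updated twice per iteration, which addresses the second part of the question as well.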