I am currently using the code from https://keras.io/examples/vision/handwriting_recognition/, a tutorial on text recognition, and testing the model on a local dataset. During my experiments I ran into two things that I have questions about.
1.) Is it normal for the loss at the start of an epoch to be higher than the loss at the end of the previous epoch? If not, what could be the cause of this and how can I prevent it?
2.) Is a val_loss of 1 good enough for bi-LSTM networks? If not, how can I reduce the loss?
Here is the snippet of two consecutive epochs.
1520/1520 [==============================] - 735s 484ms/step - loss: 2.5462 - val_loss: 2.7302
Epoch 12/100
443/1520 [=======>......................] - ETA: 8:18 - loss: 3.9221
Below is the summary of the current model
Layer (type) Output Shape Param # Connected to
==================================================================================================
image (InputLayer) [(None, 128, 32, 1) 0 []
]
Conv1 (Conv2D) (None, 128, 32, 32) 320 ['image[0][0]']
batchnorm1 (BatchNormalization (None, 128, 32, 32) 128 ['Conv1[0][0]']
)
pool1 (MaxPooling2D) (None, 64, 16, 32) 0 ['batchnorm1[0][0]']
Conv2 (Conv2D) (None, 64, 16, 64) 18496 ['pool1[0][0]']
Conv3 (Conv2D) (None, 64, 16, 64) 36928 ['Conv2[0][0]']
batchnorm2 (BatchNormalization (None, 64, 16, 64) 256 ['Conv3[0][0]']
)
pool2 (MaxPooling2D) (None, 32, 8, 64) 0 ['batchnorm2[0][0]']
reshape (Reshape) (None, 32, 512) 0 ['pool2[0][0]']
dense1 (Dense) (None, 32, 64) 32832 ['reshape[0][0]']
dropout_3 (Dropout) (None, 32, 64) 0 ['dense1[0][0]']
bidirectional_9 (Bidirectional (None, 32, 256) 197632 ['dropout_3[0][0]']
)
bidirectional_10 (Bidirectiona (None, 32, 256) 394240 ['bidirectional_9[0][0]']
l)
bidirectional_11 (Bidirectiona (None, 32, 128) 164352 ['bidirectional_10[0][0]']
l)
label (InputLayer) [(None, None)] 0 []
dense2 (Dense) (None, 32, 85) 10965 ['bidirectional_11[0][0]']
ctc_loss (CTCLayer) (None, 32, 85) 0 ['label[0][0]',
'dense2[0][0]']
==================================================================================================
Total params: 856,149
Trainable params: 855,957
Non-trainable params: 192
__________________________________________________________________________________________________
optimizer = Adam
batch_size = 64
total_dataset = 100,000+
activation = relu
To answer the first query:
Yes, it is common for the loss to start an epoch higher than it ended the previous one. During an epoch the model is trained on successive batches, and the loss Keras displays is the running average over the batches seen so far in that epoch. At the end of the epoch that average covers the entire dataset. At the start of the next epoch the average is reset, so the first values you see reflect only the first batches the model is training on.
Your dataset (ideally) follows a general pattern that you want your model to learn, and a single batch will likely contain only a sub-pattern of it. Having been exposed to the entire dataset, the model is better optimized for the general pattern than for any one sub-pattern, so the loss on a batch containing a sub-pattern can be higher than the average over the whole epoch.
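To make the averaging effect concrete, here is a minimal sketch of a progress display that reports the running mean of the batch losses and resets it at every epoch. The per-batch numbers are made up for illustration and are not taken from the model above.

```python
def running_means(batch_losses):
    """Return the mean of the losses seen so far after each batch."""
    means, total = [], 0.0
    for step, loss in enumerate(batch_losses, start=1):
        total += loss
        means.append(total / step)
    return means

# Hypothetical per-batch losses for two consecutive epochs.
epoch_11 = [3.1, 2.4, 2.6, 2.2, 2.5]
epoch_12 = [3.9, 2.3, 2.1, 2.4, 2.0]

end_of_11 = running_means(epoch_11)[-1]    # averaged over every batch
start_of_12 = running_means(epoch_12)[0]   # only one batch averaged so far
end_of_12 = running_means(epoch_12)[-1]

print(f"epoch 11 final:         {end_of_11:.2f}")
print(f"epoch 12 after 1 batch: {start_of_12:.2f}")
print(f"epoch 12 final:         {end_of_12:.2f}")
```

In this toy run the second epoch starts well above where the first one ended, yet finishes slightly lower once all batches are averaged in.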
For the second question:
It's hard to say if a certain numerical value of loss will be good or bad for a network, since your validation loss will depend on many factors. These include what loss function you are using, how many data points were used to compute the loss, and so on. The numerical value of your loss should not matter as long as your model meets the performance criteria you define in your evaluation metric.
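For text recognition, one commonly used evaluation metric is the character error rate, which is based on edit distance. The sketch below is a plain-Python illustration with made-up strings; the function names are hypothetical and it is not the metric code of the tutorial.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def character_error_rate(truths, predictions):
    """Total edit distance divided by the total number of reference characters."""
    errors = sum(edit_distance(t, p) for t, p in zip(truths, predictions))
    chars = sum(len(t) for t in truths)
    return errors / chars

truths = ["hello", "world"]          # made-up ground-truth labels
predictions = ["hallo", "word"]      # made-up model outputs
cer = character_error_rate(truths, predictions)
print(f"CER = {cer:.2f}")
```

A metric like this tells you how many characters are actually wrong, which is easier to judge than a raw CTC loss value.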
I have a custom model that was initially trained on VGG16 using transfer learning. However, it was trained on images with a smaller input size. Now I am using larger images, so I'd like to take the first model and reuse what it has learned with a new dataset.
More specifically:
Layer (type) Output Shape Param #
=================================================================
block1_conv1 (Conv2D) (None, 128, 160, 64) 1792
block1_conv2 (Conv2D) (None, 128, 160, 64) 36928
block1_pool (MaxPooling2D) (None, 64, 80, 64) 0
block2_conv1 (Conv2D) (None, 64, 80, 128) 73856
block2_conv2 (Conv2D) (None, 64, 80, 128) 147584
block2_pool (MaxPooling2D) (None, 32, 40, 128) 0
block3_conv1 (Conv2D) (None, 32, 40, 256) 295168
block3_conv2 (Conv2D) (None, 32, 40, 256) 590080
block3_conv3 (Conv2D) (None, 32, 40, 256) 590080
block3_pool (MaxPooling2D) (None, 16, 20, 256) 0
block4_conv1 (Conv2D) (None, 16, 20, 512) 1180160
block4_conv2 (Conv2D) (None, 16, 20, 512) 2359808
block4_conv3 (Conv2D) (None, 16, 20, 512) 2359808
block4_pool (MaxPooling2D) (None, 8, 10, 512) 0
block5_conv1 (Conv2D) (None, 8, 10, 512) 2359808
block5_conv2 (Conv2D) (None, 8, 10, 512) 2359808
block5_conv3 (Conv2D) (None, 8, 10, 512) 2359808
block5_pool (MaxPooling2D) (None, 4, 5, 512) 0
flatten (Flatten) (None, 10240) 0
dense (Dense) (None, 16) 163856
output (Dense) (None, 1) 17
The problem is that this model already includes an input layer of 128x160, and I'd like to change it to 384x288 for transfer learning.
The above is my first model. I would now like to do transfer learning again, but with a different dataset that has an input size of 384x288, and I'd like to use a softmax for two classes instead.
So what I want to do is transfer learning from the custom model to a different dataset, which means I need to change the input size and retrain the new model with my own data.
How can I do a transfer learning on the model above but with a new dataset and different classification layer in the output?
You can follow these steps:
Build another instance of the model, and don't forget to change its input shape.
Copy the weights of the shared convolutional layers from the loaded model, and set them to be non-trainable:
for new_layer, layer in zip(new_model.layers[0:-4], model.layers[0:-4]):
    new_layer.set_weights(layer.get_weights())
    new_layer.trainable = False
Add new dense layers and train the whole model.
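The reason only the convolutional weights can be copied is that their shapes do not depend on the image size, while the Dense layer after Flatten does. A small arithmetic sketch, assuming 3x3 kernels and five 2x2 max-pools as in the summary above (the helper names are made up for illustration):

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Weights plus one bias per filter; no dependence on image size."""
    return (kernel_h * kernel_w * in_channels + 1) * filters

def dense_params(inputs, units):
    """One weight per input per unit, plus one bias per unit."""
    return (inputs + 1) * units

def flatten_size(height, width, channels=512, n_pools=5):
    """Size after five 2x2 max-pools (as in the summary) and Flatten."""
    for _ in range(n_pools):
        height, width = height // 2, width // 2
    return height * width * channels

# Conv layers: same counts whatever the input resolution.
print(conv2d_params(3, 3, 3, 64))    # block1_conv1 in the summary: 1792
print(conv2d_params(3, 3, 64, 64))   # block1_conv2 in the summary: 36928

old_flat = flatten_size(128, 160)    # 10240, as in the summary
new_flat = flatten_size(384, 288)    # larger, so the old Dense weights no longer fit
print(old_flat, dense_params(old_flat, 16))
print(new_flat, dense_params(new_flat, 16))
```

With the larger input the flattened vector grows, so the head after the last pooling layer has to be created and trained anew.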
Further reading:
This answer and this question explain how you can change the input shape.
The Keras guides show how you can do transfer learning with Keras. Under this question are some useful code snippets.
There are many possible solutions.
A very simple one, as many have suggested:
Downscale the image to the input size of pretrained model
Change the final layer of pretrained model and freeze the rest of the layers
Train the model [transfer learning]
Once the model converges you can unfreeze the full model and train the full model again at a very low learning rate [finetuning]
However, in the above approach you are not able to take advantage of higher resolution images you have.
Using pretrained model as feature extractor
Another approach is to use the pretrained model purely as a feature extractor and train a separate model on the high-resolution images. The final predictions then use the features from both the pretrained model and your newly trained model.
Sample code:
import numpy as np
import tensorflow as tf
from tensorflow import keras
low_res_image_size = (150, 150, 3)
hig_res_image_size = (320, 240, 3)
n_classes = 4
# Load your pretrained model, trained on low-resolution images
base_model = tf.keras.applications.VGG16(
    include_top=False, weights='imagenet', input_shape=low_res_image_size)
# Freeze the pretrained model
base_model.trainable = False
# Unfrozen model to be trained on high-resolution images
model = tf.keras.applications.VGG19(
    include_top=False, weights='imagenet', input_shape=hig_res_image_size)
model.trainable = True
# Downscale images
downscale_layer = tf.keras.layers.Resizing(
    low_res_image_size[0], low_res_image_size[1],
    interpolation='bilinear', crop_to_aspect_ratio=False)
# Create model
inputs = keras.Input(shape=hig_res_image_size)
downscaled_inputs = downscale_layer(inputs)
features = base_model(downscaled_inputs, training=False)
features = keras.layers.GlobalAveragePooling2D()(features)
x = model(inputs, training=True)
x = keras.layers.GlobalAveragePooling2D()(x)
concatted = tf.keras.layers.Concatenate()([features, x])
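outputs = keras.layers.Dense(n_classes)(concatted)  # raw logits, no softmax
model = keras.Model(inputs, outputs)
# The Dense head emits logits, so the loss has to be told so
model.compile(
    optimizer="adam",
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True))
# Train on some random data
model.fit(
    np.random.random((100, *hig_res_image_size)),
    np.random.randint(0, n_classes, 100), epochs=3)
Output (the data is random, so the exact values will differ from run to run):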
Epoch 1/3
4/4 [==============================] - 4s 553ms/step - loss: 8.7033
Epoch 2/3
4/4 [==============================] - 2s 554ms/step - loss: 9.0746
Epoch 3/3
4/4 [==============================] - 2s 553ms/step - loss: 9.0746
<keras.callbacks.History at 0x7f559a104650>
As an added step, after the model converges you can also unfreeze all the layers and train the full model again using a very low learning rate. Just keep an eye on overfitting.
I found a very simple solution to my problem, and I am now able to train with different data and different classification layers:
import tensorflow as tf
from keras.models import load_model
from keras.models import Model
from keras.models import Sequential
old_model = load_model("/content/drive/MyDrive/old_model.h5")
# Remove the classification, dense and flatten layers
old_model = Model(old_model.input, old_model.layers[-4].output)
# Create a new model from the 2nd layer onward (all the convolutional blocks)
base_model = Sequential()
for layer in old_model.layers[1:]:
    base_model.add(layer)
for layer_number, layer in enumerate(base_model.layers):
    print(layer_number, layer.name, layer.trainable)
# Perform transfer learning
model = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(384, 288, 3)),
    base_model,
    tf.keras.layers.Conv2D(filters=32, kernel_size=3, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(units=2, activation='softmax')
])
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
Copy your model into another model (transfer learning), and then update the new model however you want to use it: change the input size, change the activation functions, whatever you need.
I'm a beginner in the development of CNNs, and for a university assignment I've been tasked with creating an image classifier for food items. The dataset I'm using is Recipes5k. It has 101 classes of foods.
I'm using Google Colab with TensorFlow to achieve this and have been following TensorFlow's image classification beginner tutorial.
So far, everything has been clear and easy to understand, but I've run into a problem when training my model: the validation accuracy is outrageously low (10-11%) compared to the training accuracy (90%+). I suspect this may be due to overfitting. So far, I've tried image augmentation and applying dropout to the model. This did not work as expected and only boosted the accuracy by about 5%. I have posted the relevant code snippets below:
Data Augmentation layer:
data_augmentation = keras.Sequential(
    [
        layers.experimental.preprocessing.RandomFlip("horizontal",
                                                     input_shape=(img_height,
                                                                  img_width,
                                                                  3)),
        layers.experimental.preprocessing.RandomRotation(0.1),
        layers.experimental.preprocessing.RandomZoom(0.1),
    ]
)
Model:
model = Sequential([
    data_augmentation,
    layers.experimental.preprocessing.Rescaling(1./255),
    layers.Conv2D(16, 3, padding='same', activation='relu'),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, padding='same', activation='relu'),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, padding='same', activation='relu'),
    layers.MaxPooling2D(),
    layers.Dropout(0.3),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dense(num_classes)
])
Model Summary:
Model: "sequential_2"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
sequential_1 (Sequential) (None, 224, 224, 3) 0
_________________________________________________________________
rescaling_2 (Rescaling) (None, 224, 224, 3) 0
_________________________________________________________________
conv2d_3 (Conv2D) (None, 224, 224, 16) 448
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 112, 112, 16) 0
_________________________________________________________________
conv2d_4 (Conv2D) (None, 112, 112, 32) 4640
_________________________________________________________________
max_pooling2d_4 (MaxPooling2 (None, 56, 56, 32) 0
_________________________________________________________________
conv2d_5 (Conv2D) (None, 56, 56, 64) 18496
_________________________________________________________________
max_pooling2d_5 (MaxPooling2 (None, 28, 28, 64) 0
_________________________________________________________________
dropout (Dropout) (None, 28, 28, 64) 0
_________________________________________________________________
flatten_1 (Flatten) (None, 50176) 0
_________________________________________________________________
dense_2 (Dense) (None, 128) 6422656
_________________________________________________________________
dense_3 (Dense) (None, 101) 13029
=================================================================
Total params: 6,459,269
Trainable params: 6,459,269
Non-trainable params: 0
_________________________________________________________________
Results after training with 250 epochs
Epoch 250/250
121/121 [==============================] - 3s 25ms/step - loss: 0.2564 - accuracy: 0.9270 - val_loss: 17.6184 - val_accuracy: 0.1202
What other techniques can I use to improve the accuracy of my model?
Update: I followed Gerry P's suggestion and edited my last dense layer to work with softmax activation. The results of 1250 epochs of training presented a slower increase in training accuracy and around 5-6% more validation accuracy. This improved my model but it is still a very low accuracy.
Change your last dense layer to
layers.Dense(num_classes, activation='softmax')
In model.compile(), use
loss='categorical_crossentropy'
if your labels are one-hot encoded. If they are integers, use
loss='sparse_categorical_crossentropy'
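The two losses compute the same quantity and differ only in how the label is encoded. A plain-Python sketch with made-up scores for three classes (these helper functions are illustrative re-implementations, not the Keras ones):

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = [z - max(logits) for z in logits]   # for numerical stability
    exps = [math.exp(z) for z in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot, probs):
    """Loss for a one-hot encoded label."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

def sparse_categorical_crossentropy(class_index, probs):
    """Same loss, but the label is given as an integer class index."""
    return -math.log(probs[class_index])

probs = softmax([2.0, 1.0, 0.1])   # made-up scores for 3 classes
loss_one_hot = categorical_crossentropy([1, 0, 0], probs)
loss_sparse = sparse_categorical_crossentropy(0, probs)
print(round(loss_one_hot, 4), round(loss_sparse, 4))
```

Both calls print the same value, so the choice between them depends only on the label format of your dataset.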
I am going through this link to understand Multi-channel CNN Model for Text Classification.
The code is based on this tutorial.
I have understood most of the things, however I can't understand how Keras defines the output shapes of certain layers.
Here is the code:
The code defines a model with three input channels for processing 4-grams, 6-grams, and 8-grams of movie review text.
#Skipped keras imports
# load a clean dataset
def load_dataset(filename):
    return load(open(filename, 'rb'))

# fit a tokenizer
def create_tokenizer(lines):
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer

# calculate the maximum document length
def max_length(lines):
    return max([len(s.split()) for s in lines])

# encode a list of lines
def encode_text(tokenizer, lines, length):
    # integer encode
    encoded = tokenizer.texts_to_sequences(lines)
    # pad encoded sequences
    padded = pad_sequences(encoded, maxlen=length, padding='post')
    return padded

# define the model
def define_model(length, vocab_size):
    # channel 1
    inputs1 = Input(shape=(length,))
    embedding1 = Embedding(vocab_size, 100)(inputs1)
    conv1 = Conv1D(filters=32, kernel_size=4, activation='relu')(embedding1)
    drop1 = Dropout(0.5)(conv1)
    pool1 = MaxPooling1D(pool_size=2)(drop1)
    flat1 = Flatten()(pool1)
    # channel 2
    inputs2 = Input(shape=(length,))
    embedding2 = Embedding(vocab_size, 100)(inputs2)
    conv2 = Conv1D(filters=32, kernel_size=6, activation='relu')(embedding2)
    drop2 = Dropout(0.5)(conv2)
    pool2 = MaxPooling1D(pool_size=2)(drop2)
    flat2 = Flatten()(pool2)
    # channel 3
    inputs3 = Input(shape=(length,))
    embedding3 = Embedding(vocab_size, 100)(inputs3)
    conv3 = Conv1D(filters=32, kernel_size=8, activation='relu')(embedding3)
    drop3 = Dropout(0.5)(conv3)
    pool3 = MaxPooling1D(pool_size=2)(drop3)
    flat3 = Flatten()(pool3)
    # merge
    merged = concatenate([flat1, flat2, flat3])
    # interpretation
    dense1 = Dense(10, activation='relu')(merged)
    outputs = Dense(1, activation='sigmoid')(dense1)
    model = Model(inputs=[inputs1, inputs2, inputs3], outputs=outputs)
    # compile
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    # summarize
    print(model.summary())
    plot_model(model, show_shapes=True, to_file='multichannel.png')
    return model

# load training dataset
trainLines, trainLabels = load_dataset('train.pkl')
# create tokenizer
tokenizer = create_tokenizer(trainLines)
# calculate max document length
length = max_length(trainLines)
# calculate vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print('Max document length: %d' % length)
print('Vocabulary size: %d' % vocab_size)
# encode data
trainX = encode_text(tokenizer, trainLines, length)
print(trainX.shape)
# define model
model = define_model(length, vocab_size)
# fit model
model.fit([trainX,trainX,trainX], array(trainLabels), epochs=10, batch_size=16)
# save the model
model.save('model.h5')
Running the code:
Running the example first prints a summary of the prepared training dataset.
Max document length: 1380
Vocabulary size: 44277
(1800, 1380)
____________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
====================================================================================================
input_1 (InputLayer) (None, 1380) 0
____________________________________________________________________________________________________
input_2 (InputLayer) (None, 1380) 0
____________________________________________________________________________________________________
input_3 (InputLayer) (None, 1380) 0
____________________________________________________________________________________________________
embedding_1 (Embedding) (None, 1380, 100) 4427700 input_1[0][0]
____________________________________________________________________________________________________
embedding_2 (Embedding) (None, 1380, 100) 4427700 input_2[0][0]
____________________________________________________________________________________________________
embedding_3 (Embedding) (None, 1380, 100) 4427700 input_3[0][0]
____________________________________________________________________________________________________
conv1d_1 (Conv1D) (None, 1377, 32) 12832 embedding_1[0][0]
____________________________________________________________________________________________________
conv1d_2 (Conv1D) (None, 1375, 32) 19232 embedding_2[0][0]
____________________________________________________________________________________________________
conv1d_3 (Conv1D) (None, 1373, 32) 25632 embedding_3[0][0]
____________________________________________________________________________________________________
dropout_1 (Dropout) (None, 1377, 32) 0 conv1d_1[0][0]
____________________________________________________________________________________________________
dropout_2 (Dropout) (None, 1375, 32) 0 conv1d_2[0][0]
____________________________________________________________________________________________________
dropout_3 (Dropout) (None, 1373, 32) 0 conv1d_3[0][0]
____________________________________________________________________________________________________
max_pooling1d_1 (MaxPooling1D) (None, 688, 32) 0 dropout_1[0][0]
____________________________________________________________________________________________________
max_pooling1d_2 (MaxPooling1D) (None, 687, 32) 0 dropout_2[0][0]
____________________________________________________________________________________________________
max_pooling1d_3 (MaxPooling1D) (None, 686, 32) 0 dropout_3[0][0]
____________________________________________________________________________________________________
flatten_1 (Flatten) (None, 22016) 0 max_pooling1d_1[0][0]
____________________________________________________________________________________________________
flatten_2 (Flatten) (None, 21984) 0 max_pooling1d_2[0][0]
____________________________________________________________________________________________________
flatten_3 (Flatten) (None, 21952) 0 max_pooling1d_3[0][0]
____________________________________________________________________________________________________
concatenate_1 (Concatenate) (None, 65952) 0 flatten_1[0][0]
flatten_2[0][0]
flatten_3[0][0]
____________________________________________________________________________________________________
dense_1 (Dense) (None, 10) 659530 concatenate_1[0][0]
____________________________________________________________________________________________________
dense_2 (Dense) (None, 1) 11 dense_1[0][0]
====================================================================================================
Total params: 14,000,337
Trainable params: 14,000,337
Non-trainable params: 0
____________________________________________________________________________________________________
And
Epoch 6/10
1800/1800 [==============================] - 30s - loss: 9.9093e-04 - acc: 1.0000
Epoch 7/10
1800/1800 [==============================] - 29s - loss: 5.1899e-04 - acc: 1.0000
Epoch 8/10
1800/1800 [==============================] - 28s - loss: 3.7958e-04 - acc: 1.0000
Epoch 9/10
1800/1800 [==============================] - 29s - loss: 3.0534e-04 - acc: 1.0000
Epoch 10/10
1800/1800 [==============================] - 29s - loss: 2.6234e-04 - acc: 1.0000
My interpretation of the layers and output shapes is as follows.
Please help me understand whether it's correct, as I am lost in the multiple dimensions.
input_1 (InputLayer) (None, 1380) : ---> 1380 is the total number of features ( that is 1380 input neurons) per data point. 1800 is the total number of documents or data points.
embedding_1 (Embedding) (None, 1380, 100) 4427700 ----> Embedding layer is : 1380 as features(words) and each feature is a vector of dimension 100.
How the number of parameters here is 4427700??
conv1d_1 (Conv1D) (None, 1377, 32) 12832 ------> Conv1d is of kernel size=4. Is it 1*4 filter which is used 32 times. Then how the dimension became (None, 1377, 32) with 12832 parameters?
max_pooling1d_1 (MaxPooling1D) (None, 688, 32) with MaxPooling1D(pool_size=2) how the dimension became (None, 688, 32)?
flatten_1 (Flatten) (None, 22016) This is just multiplication of 688, 32?
Does every epoch train 1800 data points at once?
Please let me know how the output dimensions are calculated. Any reference or help would be appreciated.
Please see the answers below:
input_1 (InputLayer) (None, 1380) : ---> 1380 is the total number of features ( that is 1380 input neurons) per data point. 1800 is the total number of documents or data points.
Yes. model.fit([trainX,trainX,trainX], array(trainLabels), epochs=10, batch_size=16) says that you want the network to train 10 times (for 10 epochs) on the whole training dataset in batches of size 16.
This means that after every 16 data points, the backpropagation algorithm runs and the weights are updated. This happens ceil(1800/16) = 113 times per pass over the data (the last batch holds only 8 samples), and one such pass is called an epoch.
1380 is the size of the input layer: each document is a padded sequence of 1380 word indices.
embedding_1 (Embedding) (None, 1380, 100) | 4427700 ----> Embedding layer is : 1380 as features(words) and each feature is a vector of dimension 100.
1380 is the length of the input sequence (the size of the previous layer) and 100 is the size (length) of each embedding vector.
The number of parameters here is vocabulary_size * 100 = 44277 * 100 = 4427700, because for each word in the vocabulary you need to train 100 parameters. The Embedding layer is in fact a matrix of vocabulary_size rows of size 100, where each row is the vector representation of one word from the vocabulary.
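The lookup nature of an embedding layer can be sketched in a few lines of plain Python. The tiny sizes below are stand-ins chosen for readability, and the random values stand in for trained weights.

```python
import random

vocab_size, embedding_dim = 6, 4    # tiny stand-ins for 44277 and 100

# One trainable row of `embedding_dim` numbers per vocabulary index.
random.seed(0)
embedding_matrix = [[random.uniform(-1, 1) for _ in range(embedding_dim)]
                    for _ in range(vocab_size)]

def embed(token_ids):
    """Replace each integer token id by its row of the matrix."""
    return [embedding_matrix[i] for i in token_ids]

document = [3, 1, 5, 0, 0]          # a padded sequence of word indices
embedded = embed(document)

n_params = vocab_size * embedding_dim
print(len(embedded), len(embedded[0]))   # sequence length, embedding_dim
print(n_params)                          # parameters of the toy layer
print(44277 * 100)                       # 4427700, as in the summary
```

The output shape depends on the sequence length, while the parameter count depends only on the vocabulary size and the embedding dimension.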
conv1d_1 (Conv1D) (None, 1377, 32) | 12832 ------> Conv1d is of kernel size=4. Is it 1*4 filter which is used 32 times. Then how the dimension became (None, 1377, 32) with 12832 parameters?
1380 becomes 1377 because of the kernel size. Imagine the following input (of size 10 to simplify) with a kernel of size 4:
0123456789 #input
KKKK456789
0KKKK56789
01KKKK6789
012KKKK789
0123KKKK89
01234KKKK9
012345KKKK
The kernel can't move any further to the right, so for input size 10 and kernel size 4 the output length is 7.
In general, for an input of length n and a kernel of size k (stride 1, no padding), the output length is n - k + 1, so for n=1380, k=4 the result is 1377.
The number of parameters is 12832 because it equals output_channels * (input_channels * window_size + 1). In your case that is 32 * (100 * 4 + 1) = 12832.
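The two formulas can be checked against all three convolutional branches of the summary. This is a small arithmetic sketch assuming stride 1 and no padding; the function names are made up for illustration.

```python
def conv1d_output_length(n, kernel_size):
    """Output length for stride 1 and no padding ('valid')."""
    return n - kernel_size + 1

def conv1d_params(in_channels, kernel_size, filters):
    """Each filter has in_channels * kernel_size weights plus one bias."""
    return filters * (in_channels * kernel_size + 1)

# Kernel sizes of the three channels in the model above.
for k in (4, 6, 8):
    print(k, conv1d_output_length(1380, k), conv1d_params(100, k, 32))
```

The printed lengths and parameter counts can be compared line by line with conv1d_1, conv1d_2 and conv1d_3 in the summary.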
max_pooling1d_1 (MaxPooling1D) (None, 688, 32) with MaxPooling1D(pool_size=2) how the dimension became (None, 688, 32)?
The max pooling takes every two consecutive numbers and replaces them with their maximum, so you end up with floor(original_size / pool_size) values. Here that is floor(1377 / 2) = 688; the last, unpaired value is dropped.
flatten_1 (Flatten) (None, 22016) This is just multiplication of 688, 32?
Yes, this is just a multiplication of 688 and 32. It's because, the flatten operation does the following:
1234
5678 -> 123456789012
9012
so it takes all values from all dimensions and puts them into a one-dimensional vector.
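Both operations are easy to imitate in plain Python. The sketch below assumes non-overlapping pooling windows (stride equal to pool_size) and uses made-up helper names.

```python
def max_pool1d(values, pool_size=2):
    """Non-overlapping max pooling; a leftover tail shorter than pool_size is dropped."""
    n_windows = len(values) // pool_size
    return [max(values[i * pool_size:(i + 1) * pool_size])
            for i in range(n_windows)]

def flatten(rows):
    """Concatenate all rows into one flat list."""
    return [v for row in rows for v in row]

print(max_pool1d([1, 5, 2, 4, 9]))          # 5 values -> 2 windows
print(flatten([[1, 2, 3, 4], [5, 6, 7, 8], [9, 0, 1, 2]]))

pooled_length = 1377 // 2                   # 688, matching the summary
flat_size = pooled_length * 32              # 22016
print(pooled_length, flat_size)
```

The odd input length is why 1377 halves to 688 rather than 688.5: the final value has no partner and is discarded.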
Does every epoch trains 1800 data points at once?
No. It takes them in batches of 16, as pointed out in the first answer. Each epoch goes through the 1800 data points in a random order, 16 data points at a time. An epoch is one full pass over the training data, after which we start reading the data again.
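A minimal sketch of how one epoch is split into shuffled batches, using index lists in place of real data (the shuffling here only imitates what a training loop typically does each epoch):

```python
import math
import random

n_samples, batch_size = 1800, 16
steps_per_epoch = math.ceil(n_samples / batch_size)   # the last batch is smaller

indices = list(range(n_samples))
random.seed(0)
random.shuffle(indices)                                # a fresh order for this epoch
batches = [indices[i:i + batch_size]
           for i in range(0, n_samples, batch_size)]

print(steps_per_epoch, len(batches), len(batches[-1]))
```

Every sample appears in exactly one batch, and the weights are updated once per batch rather than once per epoch.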
Edit:
I will clarify the place where 1d convolutional layers are applied to embedding layers.
You should interpret the output of the Embedding layer as a sequence of length 1380 with 100 channels.
This is similar to 2D images: an RGB input image has three channels, so its shape is (width, height, 3). When you apply a convolutional layer built of 32 filters (the filter size is irrelevant here), the convolution is applied across all channels simultaneously and the output shape is (new_width, new_height, 32). Notice that the last dimension of the output equals the number of filters.
Back to your example. Treat the output shape of the embedding layer as (width, channels). The 1D convolutional layer with 32 filters and a kernel size of 4 is then applied to a sequence of length 1380 and depth 100. As a result, you get an output of shape (1377, 32).
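A naive multi-channel 1D convolution shows where the output shape comes from. This is a toy sketch with 10 steps, 3 channels and 2 hand-written filters; bias and activation are omitted, and the function name is made up.

```python
def conv1d_valid(sequence, filters):
    """sequence: list of steps, each a list of channel values.
    filters: list of kernels, each kernel_size x in_channels.
    Returns len(sequence) - kernel_size + 1 steps,
    each with one value per filter."""
    kernel_size = len(filters[0])
    out = []
    for start in range(len(sequence) - kernel_size + 1):
        window = sequence[start:start + kernel_size]
        step = []
        for kernel in filters:
            # Sum over both the kernel positions and all input channels.
            step.append(sum(w * x
                            for k_row, x_row in zip(kernel, window)
                            for w, x in zip(k_row, x_row)))
        out.append(step)
    return out

# 10 steps with 3 channels each; 2 filters of kernel size 4.
sequence = [[float(t), 1.0, 0.0] for t in range(10)]
filters = [[[1.0, 0.0, 0.0]] * 4,      # sums channel 0 over the window
           [[0.0, 1.0, 0.0]] * 4]      # sums channel 1 over the window
result = conv1d_valid(sequence, filters)
print(len(result), len(result[0]))     # 7 steps, 2 output channels
print(result[0])
```

Each filter spans all input channels at once, so the channel dimension of the input disappears and is replaced by one output channel per filter.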
I am training a normal feed-forward network on financial data of the last 90 days of a stock, and I am predicting whether the stock will go up or down on the next day. I am using binary cross entropy as my loss and standard SGD for the optimizer. When I train, the training and validation loss continue to go down as they should, but the accuracy and validation accuracy stay around the same.
Here's my model:
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 90, 256) 1536
_________________________________________________________________
elu (ELU) (None, 90, 256) 0
_________________________________________________________________
flatten (Flatten) (None, 23040) 0
_________________________________________________________________
dropout (Dropout) (None, 23040) 0
_________________________________________________________________
dense_1 (Dense) (None, 1024) 23593984
_________________________________________________________________
elu_1 (ELU) (None, 1024) 0
_________________________________________________________________
dropout_1 (Dropout) (None, 1024) 0
_________________________________________________________________
dense_2 (Dense) (None, 512) 524800
_________________________________________________________________
elu_2 (ELU) (None, 512) 0
_________________________________________________________________
dropout_2 (Dropout) (None, 512) 0
_________________________________________________________________
dense_3 (Dense) (None, 512) 262656
_________________________________________________________________
elu_3 (ELU) (None, 512) 0
_________________________________________________________________
dropout_3 (Dropout) (None, 512) 0
_________________________________________________________________
dense_4 (Dense) (None, 256) 131328
_________________________________________________________________
activation (Activation) (None, 256) 0
_________________________________________________________________
dense_5 (Dense) (None, 2) 514
_________________________________________________________________
activation_1 (Activation) (None, 2) 0
_________________________________________________________________
Total params: 24,514,818
Trainable params: 24,514,818
Non-trainable params: 0
_________________________________________________________________
I expect that either both losses should decrease while both accuracies increase, or the network will overfit and the validation loss and accuracy won't change much. Either way, shouldn't the loss and its corresponding accuracy value be directly linked and move inversely to each other?
Also, I notice that my validation loss is always less than my normal loss, which seems wrong to me.
Here's the loss (Normal: Blue, Validation: Green)
Here's the accuracy (Normal: Black, Validation: Yellow):
Loss and accuracy are indeed connected, but the relationship is not so simple.
Loss drops but accuracy is about the same
Let's say we have 6 samples, our y_true could be:
[0, 0, 0, 1, 1, 1]
Furthermore, let's assume our network predicts following probabilities:
[0.9, 0.9, 0.9, 0.1, 0.1, 0.1]
This gives a mean binary cross-entropy of about 2.30 (each sample contributes -ln(0.1)) and an accuracy of zero, as every sample falls on the wrong side of the 0.5 threshold.
Now, after parameter updates via backprop, let's say new predictions would be:
[0.6, 0.6, 0.6, 0.4, 0.4, 0.4]
One can see those are better estimates of the true distribution (the loss for this example is about 0.92, as each sample now contributes -ln(0.4)), while accuracy didn't change and is still zero.
All in all, the relation is more complicated: the network could improve its predictions for some examples while worsening them for others, which keeps accuracy about the same.
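The six-sample example can be reproduced in plain Python. This sketch assumes mean binary cross-entropy with the natural logarithm and a 0.5 decision threshold; the absolute loss values depend on that choice of reduction and log base.

```python
import math

def binary_crossentropy(y_true, y_prob):
    """Mean binary cross-entropy, natural logarithm."""
    losses = [-(t * math.log(p) + (1 - t) * math.log(1 - p))
              for t, p in zip(y_true, y_prob)]
    return sum(losses) / len(losses)

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    hits = [int(p >= threshold) == t for t, p in zip(y_true, y_prob)]
    return sum(hits) / len(hits)

y_true = [0, 0, 0, 1, 1, 1]
before = [0.9, 0.9, 0.9, 0.1, 0.1, 0.1]
after = [0.6, 0.6, 0.6, 0.4, 0.4, 0.4]

for name, probs in (("before", before), ("after", after)):
    print(name, round(binary_crossentropy(y_true, probs), 3),
          accuracy(y_true, probs))
```

The loss falls between the two sets of predictions while the accuracy stays at zero, because no prediction has crossed the threshold yet.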
Why is my network unable to fit the data?
Such a situation usually occurs when your data is really complicated (or incomplete) and/or your model is too weak. Here both are the case: financial data prediction has a lot of hidden variables which your model cannot infer. Furthermore, dense layers are not well suited to this task. Each day depends on the previous values, which makes this a natural fit for recurrent neural networks; you can find an article about LSTMs and how to use them here (and tons of others over the web).
I have this network:
Tensor("input_1:0", shape=(?, 5, 1), dtype=float32)
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) (None, 5, 1) 0
_________________________________________________________________
bidirectional_1 (Bidirection (None, 5, 64) 2176
_________________________________________________________________
activation_1 (Activation) (None, 5, 64) 0
_________________________________________________________________
bidirectional_2 (Bidirection (None, 5, 128) 16512
_________________________________________________________________
activation_2 (Activation) (None, 5, 128) 0
_________________________________________________________________
bidirectional_3 (Bidirection (None, 1024) 656384
_________________________________________________________________
activation_3 (Activation) (None, 1024) 0
_________________________________________________________________
dense_1 (Dense) (None, 1) 1025
_________________________________________________________________
p_re_lu_1 (PReLU) (None, 1) 1
=================================================================
Total params: 676,098
Trainable params: 676,098
Non-trainable params: 0
_________________________________________________________________
None
Train on 27496 samples, validate on 6875 samples
I compile and fit it with:
model.compile(loss='mse',optimizer=Adamx,metrics=['accuracy'])
model.fit(x_train,y_train,batch_size=100,epochs=10,validation_data=(x_test,y_test),verbose=2)
When I run it and also evaluate it on unseen data, it returns an accuracy of 0.0 with a very low loss. I can't figure out what the problem is.
Epoch 10/10
- 29s - loss: 1.6972e-04 - acc: 0.0000e+00 - val_loss: 1.7280e-04 - val_acc: 0.0000e+00
What you are getting is expected. Your model is working correctly; it is your choice of metric that is incorrect. The aim of the optimization is to minimize the loss, not to increase accuracy.
Since you are using PReLU as the activation function of your last layer, the network always outputs floats. Comparing these float outputs with the actual labels to measure accuracy is not the right option. Since the outputs and labels are continuous random variables, the probability of hitting one specific value is zero. Therefore, even if the model predicts values very close to the true label, the accuracy will still be zero unless the model predicts exactly the same value as the true label, which is improbable.
E.g. if y_true is 1.0 and the model predicts 0.99999, this prediction still does not count towards the accuracy, since 1.0 != 0.99999.
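The effect is easy to reproduce without any framework. The sketch below uses made-up targets and predictions and a simplified exact-match accuracy as a stand-in for the metric; the tolerance-based comparison is one possible alternative for regression outputs.

```python
def exact_match_accuracy(y_true, y_pred):
    """Fraction of predictions that equal the target exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def within_tolerance(y_true, y_pred, tol=0.01):
    """Fraction of predictions within `tol` of the target."""
    return sum(abs(t - p) < tol for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [1.0, 0.5, 0.25, 0.75]
y_pred = [0.99999, 0.5004, 0.2497, 0.7502]   # made-up, very close predictions

print(mean_squared_error(y_true, y_pred))    # tiny loss
print(exact_match_accuracy(y_true, y_pred))  # 0.0: no prediction is identical
print(within_tolerance(y_true, y_pred))      # 1.0 with a 0.01 tolerance
```

A very small loss and an exact-match accuracy of zero can therefore coexist without anything being wrong with the model.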
Update
The choice of metrics function depends on the type of problem. Keras also provides functionality for implementing custom metrics.
Assuming the problem in the question is a regression and two values count as equal if they differ by less than 0.01, a custom metric can be defined as:
import keras.backend as K
import tensorflow as tf
accepted_diff = 0.01
def linear_regression_equality(y_true, y_pred):
    # Fraction of predictions within accepted_diff of the target
    diff = K.abs(y_true - y_pred)
    return K.mean(K.cast(diff < accepted_diff, tf.float32))
Now you can use this metric for your model:
model.compile(loss='mse',optimizer=Adamx,metrics=[linear_regression_equality])