I am training a plain feed-forward network on the last 90 days of a stock's financial data, predicting whether the stock will go up or down on the next day. I am using binary cross-entropy as my loss and standard SGD as the optimizer. When I train, the training and validation loss keep going down as they should, but the training and validation accuracy stay around the same.
Here's my model:
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 90, 256) 1536
_________________________________________________________________
elu (ELU) (None, 90, 256) 0
_________________________________________________________________
flatten (Flatten) (None, 23040) 0
_________________________________________________________________
dropout (Dropout) (None, 23040) 0
_________________________________________________________________
dense_1 (Dense) (None, 1024) 23593984
_________________________________________________________________
elu_1 (ELU) (None, 1024) 0
_________________________________________________________________
dropout_1 (Dropout) (None, 1024) 0
_________________________________________________________________
dense_2 (Dense) (None, 512) 524800
_________________________________________________________________
elu_2 (ELU) (None, 512) 0
_________________________________________________________________
dropout_2 (Dropout) (None, 512) 0
_________________________________________________________________
dense_3 (Dense) (None, 512) 262656
_________________________________________________________________
elu_3 (ELU) (None, 512) 0
_________________________________________________________________
dropout_3 (Dropout) (None, 512) 0
_________________________________________________________________
dense_4 (Dense) (None, 256) 131328
_________________________________________________________________
activation (Activation) (None, 256) 0
_________________________________________________________________
dense_5 (Dense) (None, 2) 514
_________________________________________________________________
activation_1 (Activation) (None, 2) 0
_________________________________________________________________
Total params: 24,514,818
Trainable params: 24,514,818
Non-trainable params: 0
_________________________________________________________________
I expect that either both losses decrease while both accuracies increase, or the network overfits and the validation loss and accuracy stop improving. Either way, shouldn't the loss and its corresponding accuracy be directly linked and move inversely to each other?
Also, I notice that my validation loss is always lower than my training loss, which seems wrong to me.
Here's the loss (training: blue, validation: green):
Here's the accuracy (training: black, validation: yellow):
Loss and accuracy are indeed connected, but the relationship is not so simple.
Loss drops but accuracy is about the same
Let's say we have 6 samples; our y_true could be:
[0, 0, 0, 1, 1, 1]
Furthermore, let's assume our network predicts the following probabilities:
[0.9, 0.9, 0.9, 0.1, 0.1, 0.1]
This gives a mean binary cross-entropy of about 2.30 and an accuracy of zero, since every sample is classified wrongly.
Now, after parameter updates via backprop, let's say the new predictions are:
[0.6, 0.6, 0.6, 0.4, 0.4, 0.4]
One can see these are better estimates of the true distribution (the mean loss drops to about 0.92), while the accuracy didn't change and is still zero.
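You can check these numbers with a few lines of NumPy (a minimal sketch; it computes the mean binary cross-entropy and thresholds predictions at 0.5):
import numpy as np

y_true = np.array([0, 0, 0, 1, 1, 1], dtype=float)
p_old = np.array([0.9, 0.9, 0.9, 0.1, 0.1, 0.1])
p_new = np.array([0.6, 0.6, 0.6, 0.4, 0.4, 0.4])

def bce(y, p):
    # mean binary cross-entropy
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def accuracy(y, p):
    # fraction of predictions that land on the correct side of 0.5
    return np.mean((p > 0.5) == y.astype(bool))

print(bce(y_true, p_old), accuracy(y_true, p_old))  # ~2.30, 0.0
print(bce(y_true, p_new), accuracy(y_true, p_new))  # ~0.92, 0.0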
All in all, the relation is more complicated: the network can improve its predictions on some examples while getting worse on others, which keeps the accuracy about the same.
Why is my network unable to fit the data?
Such a situation usually occurs when your data is really complicated (or incomplete) and/or your model is too weak. Here both are the case: financial prediction involves a lot of hidden variables which your model cannot infer. Furthermore, dense layers are not well suited to this task; each day depends on the previous values, which makes it a natural fit for recurrent neural networks. You can find an article about LSTMs and how to use them here (and tons of others over the web).
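For illustration, here is a minimal Keras sketch of such a recurrent model. The unit count, dropout rate and optimizer are arbitrary choices; the 5 features per day are inferred from the 1,536 parameters of your first Dense layer, so treat the input shape as an assumption:
from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense

model = Sequential()
# 90 timesteps, 5 features per day (inferred, see above)
model.add(LSTM(64, input_shape=(90, 5)))
model.add(Dropout(0.3))
model.add(Dense(1, activation='sigmoid'))  # probability that the stock goes up tomorrow

model.compile(loss='binary_crossentropy', optimizer='sgd', metrics=['accuracy'])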
Related
I'm investigating the use of CNNs for feature extraction, chained together with a gradient boosting trees model for classification. The CNN architecture I have found most effective is four convolution layers, each followed by an activation, then a single max pooling layer, a flattening step and two fully connected layers.
I'm not particularly experienced with CNNs, but my understanding is that the general expectation is to have several sets of convolution and max pooling layers (i.e. to pool more than once in the architecture). However, I've found that this approach massively reduces the precision and recall of the classification model, although I'm not especially clear why.
I based my approach on the work done in this article: https://towardsdatascience.com/cnn-application-on-structured-data-automated-feature-extraction-8f2cd28d9a7e (apologies if it's pay-walled for you).
My model doesn't seem to be overfitting, as I hold out 40% of the data for validation in the classification model's train/test split, but I feel sure I'm missing something, as the CNN architecture I've used seems uncommon, or not used at all.
For clarity, here is my CNN architecture:
_________________________________________________________________
Layer (type)
=================================================================
conv2d_1_input (InputLayer)
_________________________________________________________________
conv2d_1 (Conv2D)
_________________________________________________________________
activation_1 (Activation)
_________________________________________________________________
conv2d_2 (Conv2D)
_________________________________________________________________
activation_2 (Activation)
_________________________________________________________________
conv2d_3 (Conv2D)
_________________________________________________________________
activation_3 (Activation)
_________________________________________________________________
conv2d_4 (Conv2D)
_________________________________________________________________
activation_4 (Activation)
_________________________________________________________________
max_pooling2d_1 (MaxPooling2
_________________________________________________________________
flatten_1 (Flatten)
_________________________________________________________________
dense_1 (Dense)
_________________________________________________________________
batch_normalization_1 (Batch
_________________________________________________________________
dropout_1 (Dropout)
_________________________________________________________________
feature_dense (Dense)
_________________________________________________________________
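In Keras code this layout corresponds roughly to the following; the input shape, filter counts and kernel sizes here are placeholders rather than my actual values:
from keras.models import Sequential
from keras.layers import Conv2D, Activation, MaxPooling2D, Flatten, Dense, BatchNormalization, Dropout

model = Sequential()
model.add(Conv2D(32, (3, 3), input_shape=(28, 28, 1)))
model.add(Activation('relu'))
model.add(Conv2D(32, (3, 3)))
model.add(Activation('relu'))
model.add(Conv2D(64, (3, 3)))
model.add(Activation('relu'))
model.add(Conv2D(64, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))   # the single pooling step
model.add(Flatten())
model.add(Dense(128))
model.add(BatchNormalization())
model.add(Dropout(0.5))
model.add(Dense(64, name='feature_dense'))  # features handed to the gradient boosting trees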
Thanks in advance!
I am working on a text classification problem. My model looks like this:
Model: "sequential_6"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_6 (Embedding) (None, 100, 50) 676050
_________________________________________________________________
lstm_6 (LSTM) (None, 16) 4288
_________________________________________________________________
dropout_1 (Dropout) (None, 16) 0
_________________________________________________________________
dense_6 (Dense) (None, 3) 51
=================================================================
Total params: 680,389
Trainable params: 680,389
Non-trainable params: 0
_________________________________________________________________
The dataset contains around 5,300 sentences. I am using validation_split=0.33.
The model behaves abnormally: the validation loss keeps increasing while the validation accuracy stays roughly constant. I am attaching the graph.
Please guide me on how to solve this issue.
The model code looks like this:
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dropout, GlobalMaxPool1D, Dense

model = Sequential()
model.add(Embedding(num_words, EMBEDDING_DIM, input_length=MAX_SEQUENCE_LENGTH))
model.add(LSTM(32, return_sequences=True))  # return_sequences=True so GlobalMaxPool1D receives a sequence
model.add(Dropout(0.5))
model.add(GlobalMaxPool1D())
model.add(Dense(len(possible_labels), activation="softmax"))
I am also attaching the accuracy graph.
Increase dropout.
Train for fewer epochs.
Try Conv1D instead of LSTM to see if the overfitting goes away.
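For example, here is a minimal sketch of the Conv1D variant; the filter count (64) and kernel size (5) are illustrative, and num_words, EMBEDDING_DIM, MAX_SEQUENCE_LENGTH and possible_labels are taken from your code:
from keras.models import Sequential
from keras.layers import Embedding, Conv1D, Dropout, GlobalMaxPool1D, Dense

model = Sequential()
model.add(Embedding(num_words, EMBEDDING_DIM, input_length=MAX_SEQUENCE_LENGTH))
model.add(Conv1D(64, 5, activation='relu'))  # replaces the LSTM layer
model.add(Dropout(0.5))
model.add(GlobalMaxPool1D())
model.add(Dense(len(possible_labels), activation='softmax'))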
I think this model is underfitting. Is this correct?
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
lstm_1 (LSTM) (50, 60, 100) 42400
_________________________________________________________________
dropout_1 (Dropout) (50, 60, 100) 0
_________________________________________________________________
lstm_2 (LSTM) (50, 60) 38640
_________________________________________________________________
dropout_2 (Dropout) (50, 60) 0
_________________________________________________________________
dense_1 (Dense) (50, 20) 1220
_________________________________________________________________
dense_2 (Dense) (50, 1) 21
=================================================================
The above is a summary of the model.
Any advice on how the model could be improved?
When both the train and test losses are bad, your model is underfitting, which is not the case here. From the model loss plot, the train and test losses are close in value and reasonably good. More training data might help, given how quickly your train loss dropped. I think the data used for training and validation (first plot) is small and very similar (low variance), so in your second plot the model is seeing data quite different from anything it saw during training (based on the training and validation sets used to build the model). As mentioned initially, try removing dropout (that is what I meant by regularisation).
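If you want to try that, here is a rough sketch of the same stack with the dropout layers removed. The 60 timesteps and 5 input features are inferred from the parameter counts in your summary (42,400 = 4 * 100 * (5 + 100 + 1)); the loss and optimizer are assumptions:
from keras.models import Sequential
from keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(100, return_sequences=True, input_shape=(60, 5)))
model.add(LSTM(60))
model.add(Dense(20))
model.add(Dense(1))
model.compile(loss='mse', optimizer='adam')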
I have this network:
Tensor("input_1:0", shape=(?, 5, 1), dtype=float32)
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) (None, 5, 1) 0
_________________________________________________________________
bidirectional_1 (Bidirection (None, 5, 64) 2176
_________________________________________________________________
activation_1 (Activation) (None, 5, 64) 0
_________________________________________________________________
bidirectional_2 (Bidirection (None, 5, 128) 16512
_________________________________________________________________
activation_2 (Activation) (None, 5, 128) 0
_________________________________________________________________
bidirectional_3 (Bidirection (None, 1024) 656384
_________________________________________________________________
activation_3 (Activation) (None, 1024) 0
_________________________________________________________________
dense_1 (Dense) (None, 1) 1025
_________________________________________________________________
p_re_lu_1 (PReLU) (None, 1) 1
=================================================================
Total params: 676,098
Trainable params: 676,098
Non-trainable params: 0
_________________________________________________________________
Train on 27496 samples, validate on 6875 samples
I compile and fit it with:
model.compile(loss='mse', optimizer=Adamx, metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=100, epochs=10, validation_data=(x_test, y_test), verbose=2)
When I run it and then evaluate it on unseen data, it returns 0.0 accuracy with very low loss. I can't figure out what the problem is.
Epoch 10/10
- 29s - loss: 1.6972e-04 - acc: 0.0000e+00 - val_loss: 1.7280e-04 - val_acc: 0.0000e+00
What you are getting is expected. Your model is working correctly; it is your choice of metric that is inappropriate. The aim of the optimizer is to minimize the loss, not to increase accuracy.
Since you are using PReLU as the activation of your last layer, the network always produces floating-point outputs. Comparing these float outputs with the actual labels to measure accuracy is not the right option: since the outputs and labels are continuous random variables, the probability of them taking exactly the same value is zero. Therefore, even if the model predicts values very close to the true label, the accuracy will still be zero unless the model predicts exactly the same value as the true label, which is improbable.
For example, if y_true is 1.0 and the model predicts 0.99999, this prediction still contributes nothing to the accuracy, since 1.0 != 0.99999.
Update
The choice of metric depends on the type of problem. Keras also provides functionality for implementing custom metrics.
Assuming the problem in question is a linear regression, and that we treat two values as equal if their difference is less than 0.01, the custom metric can be defined as:
import keras.backend as K
import tensorflow as tf

accepted_diff = 0.01

def linear_regression_equality(y_true, y_pred):
    diff = K.abs(y_true - y_pred)
    return K.mean(K.cast(diff < accepted_diff, tf.float32))
Now you can use this metric for your model:
model.compile(loss='mse', optimizer=Adamx, metrics=[linear_regression_equality])
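A quick standalone check of the metric (the example values are arbitrary; two of the three predictions are within 0.01 of their labels):
import numpy as np

y_true = K.constant(np.array([1.0, 2.0, 3.0]))
y_pred = K.constant(np.array([1.005, 2.5, 2.995]))
print(K.eval(linear_regression_equality(y_true, y_pred)))  # ~0.667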
I'm addressing a sentence-level binary classification task. My data consists of 3 subarrays of tokens: left context, core, and right context.
I used Keras to devise several alternative convolutional neural networks and to validate which one best fits my problem.
I'm a newbie in Python and Keras, and I decided to start with simpler solutions in order to test which changes improve my metrics (accuracy, precision, recall, F1 and AUC-ROC). The first simplification concerned the input data: I decided to ignore the contexts and create a Keras Sequential model:
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) (None, 500) 0
_________________________________________________________________
masking_1 (Masking) (None, 500) 0
_________________________________________________________________
embedding_1 (Embedding) (None, 500, 100) 64025600
_________________________________________________________________
conv1d_1 (Conv1D) (None, 497, 128) 51328
_________________________________________________________________
average_pooling1d_1 (Average (None, 62, 128) 0
_________________________________________________________________
dropout_1 (Dropout) (None, 62, 128) 0
_________________________________________________________________
conv1d_2 (Conv1D) (None, 61, 256) 65792
_________________________________________________________________
dropout_2 (Dropout) (None, 61, 256) 0
_________________________________________________________________
conv1d_3 (Conv1D) (None, 54, 32) 65568
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 32) 0
_________________________________________________________________
dense_1 (Dense) (None, 16) 528
_________________________________________________________________
dropout_3 (Dropout) (None, 16) 0
_________________________________________________________________
dense_2 (Dense) (None, 2) 34
=================================================================
As you can see, I use a fixed input size, so I applied padding as a preprocessing step. I also used an embedding layer initialized from a Word2Vec model.
This model returns the following results:
P 0.875457875
R 0.878676471
F1 0.87706422
AUC-ROC 0.906102654
I then wanted to select a subarray of the input data inside my CNN by means of Lambda layers. I use the following definition for my Lambda layer:
Lambda(lambda x: x[:, 1], output_shape=(500,))(input)
And this is the summary of my new CNN (as you can see, it's almost the same as the prior one):
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) (None, 3, 500) 0
_________________________________________________________________
lambda_1 (Lambda) (None, 500) 0
_________________________________________________________________
masking_1 (Masking) (None, 500) 0
_________________________________________________________________
embedding_1 (Embedding) (None, 500, 100) 64025600
_________________________________________________________________
conv1d_1 (Conv1D) (None, 497, 128) 51328
_________________________________________________________________
average_pooling1d_1 (Average (None, 62, 128) 0
_________________________________________________________________
dropout_1 (Dropout) (None, 62, 128) 0
_________________________________________________________________
conv1d_2 (Conv1D) (None, 61, 256) 65792
_________________________________________________________________
dropout_2 (Dropout) (None, 61, 256) 0
_________________________________________________________________
conv1d_3 (Conv1D) (None, 54, 32) 65568
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 32) 0
_________________________________________________________________
dense_1 (Dense) (None, 16) 528
_________________________________________________________________
dropout_3 (Dropout) (None, 16) 0
_________________________________________________________________
dense_2 (Dense) (None, 2) 34
=================================================================
But the results were poor: accuracy gets stuck at around 60%, and precision, recall and F1 dropped far below the first model's results (< 0.10).
I don't know what's happening, and I don't know whether these networks are more different than I thought.
Any clue regarding this issue?
A few initial questions (I would comment but don't have sufficient rep yet):
(1) What's the motivation for using a CNN? These are good at picking out local features in a 2-D array of input values. For example, if you imagine a black-and-white picture as a 2-D array of integers representing greyscale values, a CNN might pick out clumps of pixels that represent things like edges, corners or diagonal white lines. Unless you have a reason to expect your data, like a picture, to have such locally clustered features, with points that are nearer to each other horizontally and vertically in your input arrays being more relevant to one another, you may be better off with dense layers, which make no assumptions about which input features are relevant to which others. Start with, say, 2 layers and see where that gets you.
(2) Assuming you are confident about the shape of your architecture, have you tried lowering the learning rate? That's the first thing to try in any NN that is not converging well (see the sketch at the end of this answer).
(3) Depending on the task, you may be better off using a dictionary and one-hot encoding for your words, especially if it's a relatively simple classification and context isn't too big a deal. Word2Vec means you are encoding the words as dense numeric vectors, which has implications for gradient descent. It's hard to say without knowing what you are trying to achieve, but if you don't have a reasonable idea of why Word2Vec is a good choice, it may not be.
This link explains the difference between CNNs and dense layers well, so it may help you judge.
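Regarding point (2), here is a minimal sketch of what lowering the learning rate looks like in Keras; the optimizer choice, the loss and the value 1e-4 are illustrative, and model is your existing network:
from keras.optimizers import Adam

model.compile(loss='categorical_crossentropy',
              optimizer=Adam(lr=1e-4),  # smaller than the Keras default of 1e-3
              metrics=['accuracy'])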