Query about Loss Functions for LSTM models (Binary Classification) - python

I'm building an LSTM model for binary classification of price movements.
My training data is simulated: a dataframe of 2,000 rows by 3,780 columns of price movements.
The labels are stored in a separate file (to save memory) and classify each price movement as either 1 or 2.
From what I've read, two loss functions appear to be the most appropriate for binary classification:
Binary Cross-Entropy
Hinge Loss
I've implemented two separate LSTM models in Google Colab, and both run as expected.
The code is identical for both models; only the loss function changes, from squared hinge loss in the former to binary cross-entropy in the latter.
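In outline, both models are compiled along the following lines (this is a sketch rather than the exact notebook code; the layer sizes and input shape are assumptions):

from tensorflow import keras

timesteps, n_features = 3780, 1  # assumed input shape: 3,780 price movements per sample

def build_model(loss):
    model = keras.Sequential([
        keras.layers.LSTM(32, input_shape=(timesteps, n_features)),  # hypothetical layer size
        keras.layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam', loss=loss, metrics=['accuracy', 'mse'])
    return model

model_shl = build_model('squared_hinge')        # Keras squared hinge expects targets in {-1, +1}
model_bce = build_model('binary_crossentropy')  # binary cross-entropy expects targets in {0, 1}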
My issue is deciding which is the better model, as the model outputs give conflicting outputs.
Hinge Loss Output:
Training Output:
The loss starts at 0.3, drops to around 0.20, and stays roughly constant for the remaining 98 epochs.
The MSE decreases marginally across the epochs, from 2.8 to 1.68 at the end. Average MSE = 1.72.
The accuracy is 0.00 on every epoch (which I don't understand).
Validation Output:
The Validation loss starts at 0.0117 and goes to 9.8264e-06 by the end.
The Validation MSE starts at 2.4 and ends at 1.54. Average Validation MSE = 1.31.
The Validation accuracy is 0.00 on every epoch (which again I don't understand).
Binary Cross Entropy Loss Output:
Training Output:
The loss starts at 8.3095, drops to around 3.83, and stays roughly constant for the remaining 97 epochs.
The MSE decreases marginally across the epochs, from 2.8 to 1.68 at the end. Average MSE = 1.69.
The accuracy starts at 0.00 and increases to roughly 0.8 by the end.
Validation Output:
The Validation loss starts at -0.82 and goes to -0.89 by the end.
The Validation MSE starts at 1.56 and ends at 1.53. Average Validation MSE = 1.30.
The Validation accuracy starts at 0.00 and increases to roughly 0.997 by the end.
So, I have a question now:
Why is the accuracy of the squared hinge loss (SHL) model 0.00? Is there an error in my model?
My code is saved here:
https://nbviewer.jupyter.org/github/Ianfm94/Financial_Analysis/blob/master/LSTM_Workings/LSTM_Model.ipynb
The training data* and labels data are saved at the below location:
https://github.com/Ianfm94/Financial_Analysis/tree/master/LSTM_Workings
*The training data is split into two separate files because GitHub limits file size to 25 MB.
Any help would be greatly appreciated.
Thanks.

Related

How can I improve performance of my neural network model?

I'm trying to build a model in Python to predict an operational parameter (ROP, rate of penetration) while drilling an oil well. I'm working with a neural network trained with PSO using the pyswarms library. The input layer consists of 11 neurons and the output layer of just 1 neuron (ROP). I'm still searching for the "right" number of hidden layers. I don't have much knowledge about machine learning, so any suggestion is welcome. The loss function to minimize is MAE, because it is less affected by outliers.
To track the performance of the model I'm not sure which loss function I should use. That's why, after every run, I print MAE, RMSE, MSE, R2 and R. The problem is that the values for the training data are "high" (the loss functions) or "low" (R or R2), and the values for the validation data are fairly similar.
I would like to get your opinion about my "work". I'm not really sure whether the model is overfitting, underfitting, or the data quality is just low.
The whole dataset consists of 6 wells (F-1A, F-1B, F-1C, F-11A, F-11B, F-11T2); for each well we have 12 parameters (including ROP, which is the target). The number of samples differs per well:
For instance:
Well F-1A: 60,000 samples (approx.)
Well F-1B: 20,000 samples (approx.)
Well F-1C: 25,000 samples (approx.)
So I consider it enough to train the model on one well, for example Well F-11A, and then validate on Well F-1B.
On one of those runs I got this result:
Input layer: 11
Hidden layers: 2 (8 neurons and 10 neurons)
Output layer: 1
Options : {'c1': 0.68, 'c2': 0.7, 'w': 0.73}
n_particles = 100
iters = 100
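Roughly, the PSO setup looks like this (the cost function body and the dimension count are placeholders, not the exact code):

import numpy as np
import pyswarms as ps

options = {'c1': 0.68, 'c2': 0.7, 'w': 0.73}
dimensions = 197  # total weights and biases of an 11-8-10-1 network: 11*8+8 + 8*10+10 + 10*1+1

def forward_mae(particles):
    # placeholder objective: decode each particle's weight vector, run the
    # 11-8-10-1 network on the training data, and return one MAE per particle
    return np.zeros(particles.shape[0])  # dummy values; replace with the real MAE

optimizer = ps.single.GlobalBestPSO(n_particles=100, dimensions=dimensions, options=options)
best_cost, best_pos = optimizer.optimize(forward_mae, iters=100)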
The results for loss functions, R2 and R for each dataset are:
ROP Train Data r^2= 0.4955
ROP Train Data r= 0.7039
ROP Train Data MAE = 3.272725
ROP Train Data MSE = 19.528535
ROP Train Data RMSE = 4.41911
ROP Validation Data r^2= 0.5169
ROP Validation Data r= 0.719
ROP Validation Data MAE = 10.755544
ROP Validation Data MSE = 124.405781
ROP Validation Data RMSE = 11.153734
I don't really know how to interpret these values. What should I do next? I have also noticed that, on the right plot, the curve of the predicted validation data (green curve) follows the trend of the actual validation data (blue curve), but the predicted values seem to be lower (as if they had been displaced).
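For reference, the metrics above can be computed consistently for both datasets with scikit-learn along these lines (the variable names are placeholders):

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def report(y_true, y_pred, name):
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)                      # RMSE is just the square root of MSE
    r2 = r2_score(y_true, y_pred)
    r = np.corrcoef(y_true, y_pred)[0, 1]    # Pearson correlation coefficient
    print(name, 'r^2=%.4f r=%.4f MAE=%.4f MSE=%.4f RMSE=%.4f' % (r2, r, mae, mse, rmse))

# report(y_train, y_train_pred, 'ROP Train Data')        # placeholder arrays for the training well
# report(y_val, y_val_pred, 'ROP Validation Data')       # placeholder arrays for the validation well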

Python - neural network training loss and validation loss values

I have a question about the training loss and validation loss of a neural network in Python using PyTorch. I am using BERT to classify labels for some given text.
I have about 14k text records with 20 unique labels, where some labels are more frequent than others.
I use about 25% as my validation set and use stratification when performing train_test_split.
My learning rate is 1e-6
attention_probs_dropout_prob=0.2
hidden_dropout_prob=0.2
There is no data leakage, as I did not impute any values.
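A minimal sketch of that setup (the checkpoint name and variable names are assumptions, not the actual code):

import torch
from sklearn.model_selection import train_test_split
from transformers import BertForSequenceClassification

# `texts` and `labels` stand in for the ~14k records and their 20-class label ids
# stratified 75/25 split so each label keeps its relative frequency in both sets
train_texts, val_texts, train_labels, val_labels = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42)

model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',                  # assumed checkpoint
    num_labels=20,
    attention_probs_dropout_prob=0.2,
    hidden_dropout_prob=0.2)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)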
While training my model I notice a few things:
training loss remains higher than validation loss
with each epoch both losses go down, but the training loss never goes below the validation loss, even though they are close
Example:

                     epoch 8   epoch 50   epoch 100   epoch 200
training loss          2.9       1.18        0.98        0.75
validation loss        2.4       1.0         0.67        0.45
F1 score (weighted)    0.55      0.75        0.86        0.90
As noted, the training loss decreases a bit at first and then slows down, while the validation loss keeps decreasing in bigger increments.
Can someone explain to me what is going on with how this model is learning? My understanding is that it is not performing well, based on the training and validation loss values; usually I would expect the values to be lower, with the training loss below the validation loss.
Any input is appreciated.
Thank you

Tensorflow Beginner, Basic Question On Linear Model

https://www.tensorflow.org/tutorials/estimator/linear
I am following the TensorFlow documentation to implement a linear classifier, but I'd like to use my own data instead of the tutorial set. I just have a few general questions.
My dataset is as follows. It's not a time series.
row[0] - float (changed to binary, 0 = negative, 1 = positive) VALUE TO ESTIMATE
row[1] - string (categorical, changed to vocabulary, ints 1,2,3,4,5,6,7,8,9)
row[2-19] - float (positive and negative)
row[20-60] - ints (percentile ranks, ints 10,20,30,40,50,60,70,80,90)
row[61-95] - ints (binary 1, 0)
I started by using 50k (45k training) rows of data and num_epochs=100, batch_size=256.
{'accuracy': 0.8912, 'accuracy_baseline': 0.8932, 'auc': 0.7101819, 'auc_precision_recall': 0.2830853, 'average_loss': 0.30982444, 'label/mean': 0.1068, 'loss': 0.31013006, 'precision': 0.4537037, 'prediction/mean': 0.11840516, 'recall': 0.0917603, 'global_step': 17600}
Does the column I want to estimate need to be a column of binaries for this model?
Is it a bad idea to mix data types like this? Would it be necessary to normalize the data using something like preprocessing.Normalization?
Should I alter the epochs/batch if I want to use more data?
The accuracy seems high but the loss also seems quite high, why is that?
Any other suggestions?
Thanks for any help or advice.
Here are answers to your questions.
By default tf.estimator.LinearClassifier performs binary classification with n_classes=2, but you can specify more than 2 classes as well.
For a linear classifier, normalizing the data won't change accuracy much, compared to the accuracy change a non-linear classifier sees after normalizing the same data.
Observe the change in accuracy and loss: if they do not change much for about 5-10 epochs, you can stop at that number of epochs. You can then repeat the same step while changing the batch size.
Accuracy and loss are not directly dependent on each other. Consider your case of classifying 0 and 1: a model that always predicts 0.51 for the true class has the same accuracy as one that predicts 0.99. The best model has high accuracy and low loss; if your model has good accuracy but high loss, it means it made huge errors on a few examples.
Tune your model's hyperparameters based on several observations, and feed it quality data with some preprocessing; that is always the best way to reach high accuracy and low loss, and additional data is always good to have.
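For illustration, a minimal LinearClassifier setup with mixed numeric and categorical columns might look like this (the column names and input function are assumptions, not your actual data layout):

import tensorflow as tf

# hypothetical feature columns mirroring the layout described in the question
category_col = tf.feature_column.categorical_column_with_vocabulary_list(
    'category', vocabulary_list=[1, 2, 3, 4, 5, 6, 7, 8, 9])
float_cols = [tf.feature_column.numeric_column('f%d' % i) for i in range(2, 20)]
rank_cols = [tf.feature_column.numeric_column('r%d' % i) for i in range(20, 61)]

estimator = tf.estimator.LinearClassifier(
    feature_columns=[category_col] + float_cols + rank_cols,
    n_classes=2)  # binary target in {0, 1} by default

# train_input_fn would be a tf.data-based input function built from your rows
# estimator.train(input_fn=train_input_fn)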

Improving accuracy on a multi-class image classifier

I am building a classifier using the Food-101 dataset. The dataset has predefined training and test sets, both labeled, with a total of 101,000 images. I'm trying to build a classifier with >=90% top-1 accuracy; I'm currently sitting at 75%. The training set was provided unclean. I would like to know some ways I can improve my model and what I might be doing wrong.
I’ve partitioned the train and test images into their respective folders. Here, I am using 0.2 of the training dataset to validate the learner by running 5 epochs.
np.random.seed(42)
# 20% random validation split, labels parsed from filenames, images resized to 224
data = (ImageList.from_folder(path)
        .split_by_rand_pct(valid_pct=0.2)
        .label_from_re(pat=file_parse)
        .transform(size=224)
        .databunch())
top_1 = partial(top_k_accuracy, k=1)  # top-1 accuracy metric
learn = cnn_learner(data, models.resnet50, metrics=[accuracy, top_1], callback_fns=ShowGraph)
learn.fit_one_cycle(5)
epoch train_loss valid_loss accuracy top_k_accuracy time
0 2.153797 1.710803 0.563498 0.563498 19:26
1 1.677590 1.388702 0.637096 0.637096 18:29
2 1.385577 1.227448 0.678746 0.678746 18:36
3 1.154080 1.141590 0.700924 0.700924 18:34
4 1.003366 1.124750 0.707063 0.707063 18:25
And here, I’m trying to find the learning rate. Pretty standard to how it was in the lectures:
learn.lr_find()
learn.recorder.plot(suggestion=True)
LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.
Min numerical gradient: 1.32E-06
Min loss divided by 10: 6.31E-08
Using the learning rate of 1e-06 to run another 5 epochs. Saving it as stage-2
learn.fit_one_cycle(5, max_lr=slice(1.e-06))
learn.save('stage-2')
epoch train_loss valid_loss accuracy top_k_accuracy time
0 0.940980 1.124032 0.705809 0.705809 18:18
1 0.989123 1.122873 0.706337 0.706337 18:24
2 0.963596 1.121615 0.706733 0.706733 18:38
3 0.975916 1.121084 0.707195 0.707195 18:27
4 0.978523 1.123260 0.706403 0.706403 17:04
Previously I ran 3 stages altogether, but the model wasn't improving beyond 0.706403, so I didn't want to repeat it. Below is my confusion matrix. I apologize for the terrible resolution; it's the doing of Colab.
Since I’ve created an additional validation set, I decided to use the test set to validate the saved model of stage-2 to see how well it was performing:
path = '/content/food-101/images'
data_test = ImageList.from_folder(path).split_by_folder(train='train', valid='test').label_from_re(file_parse).transform(size=224).databunch()
learn.load('stage-2')
learn.validate(data_test.valid_dl)
This is the result:
[0.87199837, tensor(0.7584), tensor(0.7584)]
Try augmentations like RandomHorizontalFlip, RandomResizedCrop, RandomRotation, Normalize, etc. from torchvision transforms. These always help a lot in classification problems.
Label smoothing and/or Mixup, and mixed-precision training.
Simply try using a more optimized architecture, like EfficientNet.
Instead of OneCycle, a longer, more manual training approach may help. Try stochastic gradient descent with a weight decay of 5e-4 and Nesterov momentum of 0.9. Use warmup training of around 1-3 epochs, and then regular training of around 200 epochs. You could set a manual learning rate schedule, cosine annealing, or some other scheme; a rough sketch follows below. This entire method will consume a lot more time and effort than the usual one-cycle training, and should be explored only if other methods don't show considerable gains.
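A minimal PyTorch sketch of that schedule, assuming a generic model (`model` is a placeholder) and the epoch counts suggested above; the base learning rate is an assumption:

import math
import torch

warmup_epochs, total_epochs = 3, 200
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                            nesterov=True, weight_decay=5e-4)  # base lr is an assumption

def lr_lambda(epoch):
    if epoch < warmup_epochs:                 # linear warmup for the first few epochs
        return (epoch + 1) / warmup_epochs
    # cosine annealing over the remaining epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# for epoch in range(total_epochs):
#     train_one_epoch(...)                    # placeholder training loop
#     scheduler.step()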

training by batches leads to more over-fitting

I'm training a sequence-to-sequence (seq2seq) model, and I have different values to train on for the input_sequence_length.
For values 10 and 15 I get acceptable results, but when I try to train with 20 I get memory errors, so I switched to training by batches. However, the model then over-fits and the validation loss explodes; even with accumulated gradients I get the same behavior. So I'm looking for hints and leads towards more accurate ways to do the update.
Here is my training function (batch section only):
if batch_size is not None:
    k = len(list(np.arange(0, (X_train_tensor_1.size()[0]//batch_size - 1), batch_size)))
    for epoch in range(num_epochs):
        optimizer.zero_grad()
        epoch_loss = 0
        # using equidistant batch starts makes this much faster than using X.size()[0] directly
        for i in list(np.arange(0, (X_train_tensor_1.size()[0]//batch_size - 1), batch_size)):
            sequence = X_train_tensor[i:i+batch_size, :, :].reshape(-1, sequence_length, input_size).to(device)
            labels = y_train_tensor[i:i+batch_size, :, :].reshape(-1, sequence_length, output_size).to(device)
            # Forward pass
            outputs = model(sequence)
            loss = criterion(outputs, labels)
            epoch_loss += loss.item()
            # Backward
            loss.backward()
        # Optimize (one step per epoch, after the inner loop)
        optimizer.step()
        epoch_loss = epoch_loss/k
        model.eval()
        validation_loss, _ = evaluate(model, X_test_hard_tensor_1, y_test_hard_tensor_1)
        model.train()
        training_loss_log.append(epoch_loss)
        print('Epoch [{}/{}], Train MSELoss: {}, Validation: {}'.format(epoch+1, num_epochs, epoch_loss, validation_loss))
EDIT:
here are the parameters that I'm training with:
batch_size = 1024
num_epochs = 25000
learning_rate = 10e-04
optimizer=torch.optim.Adam(model.parameters(), lr=learning_rate)
criterion = nn.MSELoss(reduction='mean')
Batch size affects regularization. Training on a single example at a time is quite noisy, which makes it harder to overfit. Training on batches smoothes everything out, which makes it easier to overfit. Translating back to regularization:
Smaller batches add regularization.
Larger batches reduce regularization.
I am also curious about your learning rate. Every call to loss.backward() will accumulate the gradient. If you have set your learning rate to expect a single example at a time, and have not reduced it to account for batch accumulation, then one of two things will happen.
The learning rate will be too high for the now-accumulated gradient, training will diverge, and both training and validation errors will explode.
The learning rate won't be too high, and nothing will diverge. The model will just train more quickly and effectively. If the model is too large for the data being fit, then training error will go to 0 but validation error will explode due to overfitting.
Update
Here is a bit more detail regarding the gradient accumulation.
Every call to loss.backward() will accumulate gradient, until you reset it with optimizer.zero_grad(). It will be acted on when you call optimizer.step(), based on whatever it has accumulated.
The way your code is written, you call loss.backward() for every pass through the inner loop, then you call optimizer.step() in the outer loop before resetting. So the gradient is accumulated, that is summed, across all of those backward calls before a single step is taken, rather than being applied one batch at a time.
Under most assumptions, that will make the batch-accumulated gradient larger than the gradient for a single example. If the gradients are all aligned, for B batches, it will be larger by B times. If the gradients are i.i.d. then it will be more like sqrt(B) times larger.
If you do not account for this, then you have effectively increased your learning rate by that factor. Some of that will be mitigated by the smoothing effect of larger batches, which can then tolerate a higher learning rate. Larger batches reduce regularization, larger learning rates add it back. But that will not be a perfect match to compensate, so you will still want to adjust accordingly.
In general, whenever you change your batch size you will also want to re-tune your learning rate to compensate.
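For contrast, a per-batch update looks like this, reusing the question's variable names (a sketch only, not the original code):

# one optimizer step per batch: reset, backpropagate, and step inside the inner loop,
# so nothing accumulates across the epoch
for epoch in range(num_epochs):
    epoch_loss = 0
    for i in np.arange(0, (X_train_tensor_1.size()[0]//batch_size - 1), batch_size):
        sequence = X_train_tensor[i:i+batch_size, :, :].reshape(-1, sequence_length, input_size).to(device)
        labels = y_train_tensor[i:i+batch_size, :, :].reshape(-1, sequence_length, output_size).to(device)

        optimizer.zero_grad()              # reset before each batch
        loss = criterion(model(sequence), labels)
        loss.backward()                    # gradient for this batch only
        optimizer.step()                   # update once per batch
        epoch_loss += loss.item()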
Leslie N. Smith has written some excellent papers on a methodical approach to hyperparameter tuning. A great place to start is A disciplined approach to neural network hyper-parameters: Part 1 -- learning rate, batch size, momentum, and weight decay. He recommends you start by reading the diagrams, which are very well done.
