How can I improve performance of my neural network model?

How can I improve performance of my neural network model? - python

I'm trying to build a model on python to predict an operational parameter (ROP- Rate of Penetration) while drilling an oil well. I'm working with a neural network trained with PSO using pyswarms library. Input layer consists of 11 neurons and output layer just 1 neuron (ROP). I'm still searching for the "right" number of hidden layers.I don't have enough knowledge about machine learning, so any suggestion will be accepted.The loss function to minimize is MAE, due to it is not affected by outliers.
To track the performance of the model I'm not sure about what loss function I have to use. That's why after every run, I print MAE, RMSE, MSE R2 and R. The problem is that the values for train are "high" (loss functions) or "low" (R o R2) and for validation data is quite close.
I would like you to give oppinion about my "work".I'm not really sure about if the model is overfitting, underfitting or data quality is low.
Whole dataset consists of 6 wells (F-1A, F-1B, F-1C, F-11A,F-11B,F-11T2), for each well we have 12 parameters (including ROP that is the target). The number of samples for each well is different:
For instance:
Well F-1A: 60 000 samples (aprox)
Well F-1B: 20 000 samples (aprox)
Well F-1C: 25 000 samples (aprox)
So I consider that is enough to train my model on one well, for example on Well F-11A and then validate on Well F-1B.
On one of those runs I got this result:
Input layer: 11
Hidden layers: 2 (8 neurons and 10 neurons)
Output layer: 1
Options : {'c1': 0.68, 'c2': 0.7, 'w': 0.73}
n_particles = 100
iters = 100
The results for loss functions, R2 and R for each dataset are:
ROP Train Data r^2= 0.4955
ROP Train Data r= 0.7039
ROP Train Data MAE = 3.272725
ROP Train Data MSE = 19.528535
ROP Train Data RMSE = 4.41911
ROP Validation Data r^2= 0.5169
ROP Validation Data r= 0.719
ROP Validation Data MAE = 10.755544
ROP Validation Data MSE = 124.405781
ROP Validation Data RMSE = 11.153734
I dont know well what is the interpretation of this values. What I have to do next? Because I have realized that on the right plot, the curve of the Predicted Validation data (green curve) follow the trend of the Actual Validation data (blue curve) but the predicted values seems to be lower (as if they had been displaced)

Related

How to obtain weight matrices during training on Scikit

I am training a MLPClassifier by using Scikit. Lets say I want to train for 5 epochs on MNIST with one hidden layer of 100 neurons.
If I do "mlp = MLPClassifier(...)" and then "mlp.fit(train,test)", then I can obtain the trained weights with "mlp.coefs_".
But what I want is the sequence of weight matrices obtained after each epoch during training. So if I train for 5 epochs I would want a list of size 5 with the history of weight matrices.
Is this possible with scikit? Or should I use Keras?

One option is to train your model with a fraction of the epochs you wanted to do.
Store the parameters.
Then continue training your model with the warm_start = True parameter. You would do this until you got the overall number of epochs you wanted.
In the context of sci-kit learns implementation the max_iter parameter would be the epochs. This is referenced in this link.
https://stats.stackexchange.com/questions/284491/are-the-epochs-equivalent-to-the-iterations

Very large loss values when training multiple regression model in Keras

I was trying to build a multiple regression model to predict housing prices using the following features:
[bedrooms bathrooms sqft_living view grade]
= [0.09375 0.266667 0.149582 0.0 0.6]
I have standardized and scaled the features using sklearn.preprocessing.MinMaxScaler.
I used Keras to build the model:
def build_model(X_train):
model = Sequential()
model.add(Dense(5, activation = 'relu', input_shape = X_train.shape[1:]))
model.add(Dense(1))
optimizer = Adam(lr = 0.001)
model.compile(loss = 'mean_squared_error', optimizer = optimizer)
return model
When I go to train the model, my loss values are insanely high, something like 4 or 40 trillion and it will only go down about a million per epoch making training infeasibly slow. At first I tried increasing the learning rate, but it didn't help much. Then I did some searching and found that others have used a log-MSE loss function so I tried it and my model seemed to work fine. (Started at 140 loss, went down to 0.2 after 400 epochs)
My question is do I always just use log-MSE when I see very large MSE values for linear/multiple regression problems? Or are there other things i can do to try and fix this issue?
A guess as to why this issue occurred is the scale between my predictor and response variables were vastly different. X's are between 0-1 while the highest Y went up to 8 million. (Am I suppose to scale down my Y's? And then scale back up for predicting?)

A lot of people believe in scaling everything. If your y goes up to 8 million, I'd scale it, yes, and reverse the scaling when you get predictions out, later.
Don't worry too much about specifically what loss number you see. Sure, 40 trillion is a bit ridiculously high, indicating changes may need to be made to the network architecture or parameters. The main concern is whether the validation loss is actually decreasing, and the network actually learning therewith. If, as you say, it 'went down to 0.2 after 400 epochs', then it sounds like you're on the right track.
There are many other loss functions besides log-mse, mse, and mae, for regression problems. Have a look at these. Hope that helps!

Training E-net on human segmentation

I am trying to train a semantic-segmentation network (E-Net) in particular for high-quality human segmentation. For that, I have collected the "Supervisely Person" data-set and extracted the annotation masks using the provided API. This data-set holds high quality masks, thus I think it will provide better results in comparison to e.g. COCO data-set.
Supervisely - Example below : original image - ground truth.
First I want to give some details of the model. The network itself (Enet_arch) returns logits from the last convolution layer and probabilities which are produced through tf.nn.sigmoid(logits,name='logits_to_softmax').
I am using sigmoid cross-entropy on the ground truth and the returned logits, momentum and exponential decay on the learning rate. The model instance and the training pipeline is as follows.
self.global_step = tf.Variable(0, name='global_step', trainable=False)
self.momentum = tf.Variable(0.9, trainable=False)
# introducing weight decay
#with slim.arg_scope(ENet_arg_scope(weight_decay=2e-4)):
self.logits, self.probabilities = Enet_arch(inputs=self.input_data, num_classes=self.num_classes, batch_size=self.batch_size) # returns logits (2d), probabilities (2d)
#self.gt is int32 with values 0 or 1 (coming from read_tfrecords.Read_TFRecords annotation images + placeholder defined to int)
self.gt = self.input_masks
# self.probabilities is output of sigmoid, pixel-wise between probablities [0, 1].
# self.predictions is filtered probabilities > 0.5 = 1 else 0
self.predictions = tf.to_int32(self.probabilities > 0.5)
# capture segmentation accuracy
self.accuracy, self.accuracy_update = tf.metrics.accuracy(labels=self.gt, predictions=self.predictions)
# losses and updates
# calculate cross entropy loss on logits
loss = tf.losses.sigmoid_cross_entropy(multi_class_labels=self.gt, logits=self.logits)
# add the loss to total loss and average (?)
self.total_loss = tf.losses.get_total_loss()
# decay_steps = depend on the number of epochs
self.learning_rate = tf.train.exponential_decay(self.starter_learning_rate, global_step=self.global_step, decay_steps=123893, decay_rate=0.96, staircase=True)
#Now we can define the optimizer
#optimizer = tf.train.AdamOptimizer(learning_rate=self.learning_rate, epsilon=1e-8)
optimizer = tf.train.MomentumOptimizer(self.learning_rate, self.momentum)
#Create the train_op.
self.train_op = optimizer.minimize(loss, global_step=self.global_step)
I first tried to over-fit the model on a single image to identify the depth of details that this network can capture. To increase the output quality I resized all the images to 1080p before feeding them to the network. On this trial I trained the network for 10K iterations and the total error reached ~30% (captured from tf.losses.get_total_loss() ).
The results while training on a single image are pretty good as you can see below.
Supervisely - Example below : (1) Loss (2) input (before resizing) | ground truth (before resizing) | 1080p out
Later, I tried to train on the whole data-set but the training loss produce lot of oscillations. That means that in some images the network perform well and in some other not. As a results after 743360 iterations (which is 160 epochs, since the training set holds 4646 images) I stopped training since obviously there is something wrong with the hyper-parameters selection that I made.
Supervisely - Example below : (1) Loss (2) learning rate (3) input (before resizing) | ground truth (before resizing) | 1080p out
On the other hand on some instances of the training set images the network produce fair (not very good though) results like below.
Supervisely - Example below : input (before resizing) | ground truth (before resizing) | 1080p out
Why do I have those differences on these training instances? Are there any obvious changes that I should do on the model or on the hyper-parameters? Is it possible that this model is just not suitable for this use-case (e.g. low network capacity) ?
Thanks in advance.

It turns out that the problem here is indeed E-net architecture. I changed the architecture with DeepLabV3 and saw a big difference in loss behaviour and performance.. even in small resolution!

R-squared results of test and validation differ by a huge margin

I am working on a regression problem with keras and tensorflow using a neural network. The data is split, so that 282774 datasets are for training, 70694 are for validation and 88367 are for testing. To evaluate my models I am printing out the mean squared error (MSE), the mean absolute error (MAE) and the R-squared score. These are some examples from the results I get:
MSE MAE R-squared
Training 1.562072899 0.958128839 0.849787137
Validation 0.687871457 0.62066941 0.935365564
Test 0.683918759 0.618674863 -16.22829222
I do not understand the value for R-squared on test data. I know that R-squared can be negative, but how can it be that there is such a big difference between validation and test if both fall into the category of unseen data. Can someone give me a hint?
Some background information:
Since keras does not have the R-squared metric built in, I implemented it with some code I found on the web and which seems logical for me:
def r2_keras(y_true, y_pred):
SS_res = K.sum(K.square(y_true - y_pred))
SS_tot = K.sum(K.square(y_true - K.mean(y_true)))
return ( 1 - SS_res/(SS_tot + K.epsilon()) )
And if it helps: this is my model:
model = Sequential()
model.add(Dense(75, input_shape=(7,)))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='linear'))
adam = optimizers.Adam(lr=0.001)
model.compile(loss='mse',
optimizer=adam,
metrics=['mse', 'mae', r2_keras])
history = model.fit(x_train, y_train,
epochs=50,
batch_size=32,
validation_split=0.2)
score = model.evaluate(x_test, y_test, batch_size=32)
One strange thing I noticed is, that not all testing data seems to be considered. The console prints out the following:
86304/88367 [============================>.] - ETA: 0s-----
Maybe this leads to a miscalculation for R-squared?
I am thankful for any help/hint I can get on understanding this issue.
Update:
I checked for outliers, but could not find any significant one. Min and max-values for test and train are close by, considering the standard deviation. Also the histograms look very much alike.
So in the next step I let my model predict the values for test data again and used pandas + numpy to calculate the r2_score. This time I got a value which is approximately equal to the r2_score for validation.
Below is how I did it. Do you see any flaws in the way I performed the calculation? (I just want to be sure that the old r2_score for "test" was indeed a calculation error)
# "test" is a dataframe with input data and the real outputs
# "inputs" is a list of the input column names
# The real/true outputs are contained in the column "output"
test['output_pred'] = model.predict(x=np.array(test[inputs]))
output_mean = test['output'].mean() # Is this the correct mean value for r2 here?
test['SSres'] = np.square(test['output']-test['output_pred'])
test['SStot'] = np.square(test['output']-output_mean)
r2 = 1-(test['SSres'].sum()/(test['SStot'].sum()))

Tensorflow's built-in evaluate method evaluates your test set batch by batch and hence calculates r2 at each batch. The metrics produced from model.evaluate() is then simple average of all r2 from each batch. While in model.fit(), r2 (and all metrics on validation set) are calculated per epoch (instead of per batch and then take avg.)
You may slice your output and output_pred into batches of the same batch size you used in model.evaluate() and calculate r2 on each batch. I guess the model produces high r2 on batches with high total sum of squares (SS_tot) and bad r2 on lower ones. So when taken average, result would be poor (however when calculate r2 on entire dataset, samples with higher ss_tot usually dominate the result).

Why does recognition rate drop after multiple online training epochs?

I am using tensorflow to do image recognition on the MNIST dataset. In each training epoch, I picked 10,000 random images and conducted online training with batch size of 1. The recognition rate increased for the first few epochs, however, after several epochs the recognition rate started to drop greatly. (In the first 20 epochs, the recognition rate goes up to ~94%. Afterwards, the recognition rate went from 90->50->40->30->20). What is the reason for this?
Also, with a batch size of 1, the performance is worse than when using a batch size of 100 (max recognition rate 94% vs. 96%). I looked through several references but there seems to be contradictory results on whether small or large batch sizes achieve better performance. What would be this case in this situation?
Edit: I also added a figure of the recognition rate of the training dataset and the test dataset.Recognition rate vs. epoch
I have attached a copy of the code below. Thanks for the help!
import tensorflow as tf
import numpy as np
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("/tmp/data/", one_hot = True)
#parameters
n_nodes_hl1 = 500
n_nodes_hl2 = 500
n_nodes_hl3 = 500
n_classes = 10
batch_size = 1
x = tf.placeholder('float', [None, 784])
y = tf.placeholder('float')
#model of neural network
def neural_network_model(data):
hidden_1_layer = {'weights':tf.Variable(tf.random_normal([784, n_nodes_hl1]) , name='l1_w'),
'biases': tf.Variable(tf.random_normal([n_nodes_hl1]) , name='l1_b')}
hidden_2_layer = {'weights':tf.Variable(tf.random_normal([n_nodes_hl1, n_nodes_hl2]) , name='l2_w'),
'biases' :tf.Variable(tf.random_normal([n_nodes_hl2]) , name='l2_b')}
hidden_3_layer = {'weights':tf.Variable(tf.random_normal([n_nodes_hl2, n_nodes_hl3]) , name='l3_w'),
'biases' :tf.Variable(tf.random_normal([n_nodes_hl3]) , name='l3_b')}
output_layer = {'weights':tf.Variable(tf.random_normal([n_nodes_hl3, n_classes]) , name='lo_w'),
'biases' :tf.Variable(tf.random_normal([n_classes]) , name='lo_b')}
l1 = tf.add(tf.matmul(data,hidden_1_layer['weights']), hidden_1_layer['biases'])
l1 = tf.nn.relu(l1)
l2 = tf.add(tf.matmul(l1,hidden_2_layer['weights']), hidden_2_layer['biases'])
l2 = tf.nn.relu(l2)
l3 = tf.add(tf.matmul(l2,hidden_3_layer['weights']), hidden_3_layer['biases'])
l3 = tf.nn.relu(l3)
output = tf.matmul(l3,output_layer['weights']) + output_layer['biases']
return output
#train neural network
def train_neural_network(x):
prediction = neural_network_model(x)
cost = tf.reduce_mean( tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=y))
optimizer = tf.train.AdamOptimizer().minimize(cost)
hm_epoches = 100
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
for epoch in range(hm_epoches):
epoch_loss=0
for batch in range (10000):
epoch_x, epoch_y=mnist.train.next_batch(batch_size)
_,c =sess.run([optimizer, cost], feed_dict = {x:epoch_x, y:epoch_y})
epoch_loss += c
correct = tf.equal(tf.argmax(prediction, 1), tf.argmax(y,1))
accuracy = tf.reduce_mean(tf.cast(correct, 'float'))
print(epoch_loss)
print('Accuracy_test:', accuracy.eval({x:mnist.test.images, y:mnist.test.labels}))
print('Accuracy_train:', accuracy.eval({x:mnist.train.images, y:mnist.train.labels}))
train_neural_network(x)

DROPPING ACCURACY
You're over-fitting. This is when the model learns false features that are specific to artifacts of the images in the training data, at the expense of important features. One of the main experimental results of any application is to determine the optimal number of training iterations.
For instance, perhaps 80% of the 7's in your training data happen to have a little extra slant to the right near the bottom of the stem, where 4's and 1's do not. After too much training, your model "decides" that the best way to tell a 7 from another digit is from that extra slant, despite any other features. As a result, some 1's and 4's now get classed as 7's.
BATCH SIZE
Again, the best batch size is one of the experimental results. Typically, a batch size of 1 is too small: this gives the first few input images too much influence on the early weights in kernel or perceptron training. This is a minor case of over-fitting: one item having undue influence on the model. However, it's significant enough to alter your best results by 2%.
You need to balance the batch size with the other hyper-parameters to find the model's "sweet spot", optimum performance followed by shortest training time. In my experience, it's been best to increase the batch size until my time per image degraded. The models I've used most (MNIST, CIFAR-10, AlexNet, GoogleNet, ResNet, VGG, etc.) had very little loss of accuracy once we reached a rather minimal batch size; from there, the training speed was usually a matter of choosing the batch size the best used available RAM.

There are a few possibilities, although you'll need to do some experimentation to find out which it is.
Overfitting
Prune did a good job of explaining this. I'll add that the simplest way to avoid overfitting is to just remove 10-15% of the training set and evaluate the recognition rate on this held out validation set after every few epochs. If you graph the change in recognition rate on both the training and validation sets, you'll eventually reach a point on the graph where the training error keeps going down but the validation error starts going up. Stop training at that point; that's where overfitting is starting in earnest. Note that it's important that there be no overlap between the training/validation/test sets.
This was more likely before you mentioned that the training error wasn't also decreasing, but it's possible that it's overfitting on a fairly homogeneous part of your training set at the expense of the outliers, or something like this. Try randomizing the order of your training set after each epoch; if it's fitting one section of the set at the expense of the others, this might help.
Addendum: The massive instantaneous drop in quality around epoch 20 makes this even less likely; that is not what overfitting looks like.
Numerical Instability
If you get a particularly incorrect input at a point on the activation function with a large gradient, it's possible to end up with a gigantic weight update that screws up everything it's learned thus far. It's common to put a hard limit on the gradient magnitude for this reason. But you're using AdamOptimizer, which has an epsilon parameter for avoiding instability. I haven't read the paper it references, so I don't know exactly how it works, but the fact that it's there makes instability less likely.
Saturated Neurons
Some activation functions have regions with very small gradients, so if you end up with weights such that the function is almost always in that region, you have a tiny gradient and thus can't learn effectively. Sigmoids and Tanh are particularly prone to this since they have flat regions on both sides of the function. ReLUs don't have a flat region on the high end, but do on the low end. Try replacing your activation functions with Softplus; those are similar to ReLU, but with a continuous nonzero gradient.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.