How can I get the weights to converge so that the MSE is minimized? - python

Here is my code:
from keras import backend as K
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

for _ in range(5):
    K.clear_session()
    model = Sequential()
    model.add(LSTM(256, input_shape=(None, 1)))
    model.add(Dropout(0.2))
    model.add(Dense(256))
    model.add(Dropout(0.2))
    model.add(Dense(1))
    model.compile(loss='mean_squared_error', optimizer='rmsprop', metrics=['accuracy'])
    hist = model.fit(x_train, y_train, epochs=20, batch_size=64, verbose=0,
                     validation_data=(x_val, y_val))

    p = model.predict(x_test)
    print(mean_squared_error(y_test, p))
    plt.plot(y_test)
    plt.plot(p)
    plt.legend(['testY', 'p'], loc='upper right')
    plt.show()
Total params: 330,241
Samples: 2,264
and below is the result.
I haven't changed anything; I only re-ran the for loop. As you can see in the result plot, the MSE is huge, even though I just ran the same for loop again.
I think the fundamental cause of this problem is that the optimizer cannot find the global minimum and instead converges to a local minimum. My reasoning is that, after checking all the loss curves, the loss stops decreasing significantly after about 20 epochs. So to solve this problem I have to find the global minimum. How should I do this?
I tried adjusting the batch_size and the number of epochs. I also tried changing the hidden layer size, the number of LSTM units, adding a kernel_initializer, changing the optimizer, etc., but could not get any meaningful result.
How can I solve this problem?
Your valuable opinions and thoughts will be very much appreciated.
If you want to see the full source, here is the link: https://gist.github.com/Lay4U/e1fc7d036356575f4d0799cdcebed90e

From your example, the problem simply comes from the fact that you have over 100 times more parameters than you have samples. If you reduce the size of your model, you will see less variance.
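As a rough sketch of what "smaller" could look like (the layer sizes below are illustrative, not a recommendation; x_train and y_train are the arrays from your code), something like this keeps the parameter count around a thousand, well below your 2,264 samples:
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout

small_model = Sequential()
small_model.add(LSTM(16, input_shape=(None, 1)))  # 16 units instead of 256
small_model.add(Dropout(0.2))
small_model.add(Dense(1))
small_model.compile(loss='mean_squared_error', optimizer='rmsprop')
small_model.fit(x_train, y_train, epochs=20, batch_size=64, verbose=0)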
The wider question you are asking is actually very interesting and usually isn't covered in tutorials. Nearly all machine learning models are stochastic by nature: the output predictions change slightly every time you train, which means you will always have to ask the question: which model do I deploy to production?
Off the top of my head there are two things you can do:
Choose the first model trained on all the data (after cross-validation, ...)
Build an ensemble of models that all have the same hyper-parameters and implement a simple voting strategy (a sketch follows below)
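For the ensemble option, here is a minimal sketch. Since the target here is a regression value, "voting" becomes averaging the member predictions; build_model() just rebuilds whatever architecture you settle on (the one below is illustrative), and x_train/y_train/x_test are the arrays from the question:
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

def build_model():
    # rebuild the same (illustrative) architecture for every ensemble member
    m = Sequential()
    m.add(LSTM(32, input_shape=(None, 1)))
    m.add(Dense(1))
    m.compile(loss='mean_squared_error', optimizer='rmsprop')
    return m

member_predictions = []
for _ in range(5):
    member = build_model()
    member.fit(x_train, y_train, epochs=20, batch_size=64, verbose=0)
    member_predictions.append(member.predict(x_test))

# "voting" for a regression target: average the member predictions
ensemble_prediction = np.mean(member_predictions, axis=0)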
References:
https://machinelearningmastery.com/train-final-machine-learning-model/
https://machinelearningmastery.com/randomness-in-machine-learning/

If you want to always start from the same point, you should set the random seeds. With the TensorFlow backend in Keras you can do it like this:
from numpy.random import seed
seed(1)                       # fix NumPy's RNG
from tensorflow import set_random_seed
set_random_seed(2)            # fix TensorFlow's RNG (TF 1.x API)
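Note that set_random_seed belongs to the TensorFlow 1.x API. If you are on TensorFlow 2.x / tf.keras, the equivalent is, to the best of my knowledge:
import numpy as np
import tensorflow as tf

np.random.seed(1)       # fix NumPy's RNG
tf.random.set_seed(2)   # TF 2.x replacement for set_random_seed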
If you want to learn why you get different results in ML/DL models, I recommend this article.


How could I increase the accuracy of the training set?

I'm working on a classification problem (human activity classification) and I used a CNN. The code of the model is:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPool2D, Dense, Dropout, Flatten
from tensorflow.keras.optimizers import Adam

model = Sequential()
model.add(Conv2D(100, (2, 2), activation = 'relu', input_shape = X_train[0].shape))
model.add(Dropout(0.1))
#adding pooling layer
model.add(MaxPool2D(2,2))
model.add(Dense(64, activation = 'relu'))
model.add(Dropout(0.5))
model.add(Flatten())
model.add(Dense(64, activation = 'relu'))
model.add(Dropout(0.5))
model.add(Dense(7, activation='softmax'))
Compiling and fitting:
model.compile(optimizer=Adam(learning_rate = 0.001), loss = 'sparse_categorical_crossentropy', metrics = ['accuracy'])
history = model.fit(X_train, y_train, epochs = 20, validation_data= (X_test, y_test), verbose=1)
The accuracy came out like this.
How could I increase the final accuracy value? And why does the curve increase so quickly?
There are a few avenues you can pursue here, specifically finding answers to the following questions for your particular problem. Here's a great video; it isn't about TensorFlow specifically, but I think the question you are asking is general enough for it to apply.
What is the right amount of time to train for? The answer is likely somewhere between 20 and 90 epochs; more specifically, it's where the two series in your plot start to diverge. In other words, your model starts to memorize the training data at the point of divergence. TensorFlow has early-stopping mechanisms to help with this (see the sketch after this list of questions).
What is the performance of a naïve guesser? Is the complexity of your model proportional to the complexity/dimensionality of the problem?
What is the human insight that you can bring to the problem? Are there things you can do to the features that will help the model create separability in higher dimensions? For example, let's say your model is going to predict what activity a person is going to do at a given point in time. In this case, information related to people might be separate from time and activity data. You can create features that represent combinations of other features (assuming you have a lot of data), and encode this and feed it to your model. You can create embeddings in your model to get your model to deal with the sparsity that occurs when you combine such categorical features.
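Regarding the first question, here is a minimal early-stopping sketch (the patience and epoch count are illustrative; X_train/y_train/X_test/y_test are the arrays from your code):
from tensorflow.keras.callbacks import EarlyStopping

# stop once val_loss has not improved for 5 epochs and roll back to the best weights
early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

history = model.fit(X_train, y_train,
                    epochs=90,
                    validation_data=(X_test, y_test),
                    callbacks=[early_stop],
                    verbose=1)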
Another aspect of this that I think is very important to answer is "Why am I solving this problem?". In some cases, the answer might be "I want to learn X", in which case you might approach it differently. For example, if it's all tabular data, you might have more interpretable/better results using something like scikit-learn using a tree based model. It also, of course, depends on the amount and type of data you have. Nested cross-validation can give you great insight into what are the combinations of hyperparameters and features that will produce a model that generalizes, and also about the variation you can expect to see on unseen data.
Best of luck!

My LSTM model overfits on the validation data

Here's my LSTM model to classify hand gestures. Initially, I had 1960 training samples of shape (num_sequences, num_joints, 3), which I reshape to (num_sequences, num_joints*3).
Here's my model:
from keras.models import Sequential
from keras.layers import Masking, Bidirectional, LSTM, Dropout, BatchNormalization, Dense
from keras.optimizers import Adam

input_shape = (trainx.shape[1], trainx.shape[2])
print("Build LSTM RNN model ...")
model = Sequential()
model.add(Masking(mask_value=0., input_shape=(171, 66)))
model.add(Bidirectional(LSTM(units=256, activation='tanh', return_sequences=True, input_shape=input_shape)))
model.add(Dropout(0.5))
model.add(BatchNormalization())
model.add(Bidirectional(LSTM(units=128, activation='tanh', return_sequences=True)))
model.add(Dropout(0.5))
model.add(BatchNormalization())
model.add(Bidirectional(LSTM(units=128, activation='tanh', return_sequences=False)))
model.add(Dropout(0.5))
model.add(BatchNormalization())
model.add(Dense(units=trainy.shape[1], activation="softmax"))
print("Compiling ...")
# Keras optimizer defaults:
# Adam : lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-8, decay=0.
# RMSprop: lr=0.001, rho=0.9, epsilon=1e-8, decay=0.
# SGD : lr=0.01, momentum=0., decay=0.
opt = Adam()
model.compile(loss="categorical_crossentropy", optimizer=opt, metrics=["accuracy"])
I get a 90% accuracy on train and 50% on test
Overfitting is quite common in deep learning.
To mitigate overfitting with your LSTM architecture, try the following things in this order:
Decrease the learning rate from 0.1 or 0.01 to 0.001, 0.0001, or 0.00001.
Reduce the number of epochs. You can plot the training and validation accuracy as a function of the number of epochs and see where the training accuracy becomes larger than the validation accuracy; that is the number of epochs you should use. Combine this with the first step of decreasing the learning rate.
Then you can try to modify the architecture of the LSTM. You have already added dropout (maximum value 0.5); I would suggest trying 0.2 or 0.3. You have 3 recurrent layers, which is better than 2, and the layer sizes look reasonable. What embedding dimension are you currently using? Since you are overfitting, it is worth trying to reduce the number of recurrent layers from 3 to 2 while keeping the same number of units.
The batch size might matter, as might the distribution of classes in your dataset. Is the dataset equally distributed and balanced between the training and validation sets? What I mean is: if one hand gesture is over-represented in the training set compared to the validation set, that can be a problem. A good strategy to check this is to keep part of the data as a held-out test set, then do a 5-fold train/validation split with sklearn, train your architecture on each split separately, and compare the training and validation accuracies. If there is a big bias in the split or among the sets, you will be able to identify it this way (a sketch follows at the end of this answer).
Last, you can try augmentation, specifically rotation and horizontal/vertical flips. This library might help: https://github.com/aleju/imgaug
Hope this helps!
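As a sketch of the balanced-split check suggested above: build_model is a hypothetical helper that rebuilds the Bidirectional LSTM from the question; X is the sequence array of shape (num_sequences, 171, 66), y_int the integer class labels and y_onehot the one-hot labels (names assumed):
import numpy as np
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y_int)):
    fold_model = build_model()   # hypothetical: rebuilds the LSTM architecture above
    fold_model.fit(X[train_idx], y_onehot[train_idx],
                   validation_data=(X[val_idx], y_onehot[val_idx]),
                   epochs=30, batch_size=32, verbose=0)
    _, train_acc = fold_model.evaluate(X[train_idx], y_onehot[train_idx], verbose=0)
    _, val_acc = fold_model.evaluate(X[val_idx], y_onehot[val_idx], verbose=0)
    print("fold", fold, "train acc:", train_acc, "val acc:", val_acc)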
How do you know the network is overfitting rather than suffering from some kind of error in your data set? Does the validation loss improve initially up to some epoch and then plateau or start to increase? Then it is overfitting. If it starts at 50% and stays there, it is not an overfitting problem. With the amount of dropout you have, overfitting does not look very likely.
How did you select your validation set? Was it randomly selected from the overall data set, or did you do the selection yourself? It is always better to select the data randomly so that its probability distribution mirrors that of the training data. As said in the comments, please show your code for model.fit; there could be a problem there. How do you input the data? Did you use generators? A 50% validation accuracy leads me to suspect some error in how your validation data is provided to the network, or some error in the labeling of the validation data.
I would also recommend dynamically adjusting your learning rate based on monitoring of the validation loss. Keras has a callback for this called ReduceLROnPlateau (documentation is here). Set it up to monitor validation loss; I set the parameters patience=3 and factor=.5, which seems to work well. You can think of training as descending into a valley. As you descend, the valley gets narrower. If the learning rate is too large and remains fixed, you won't be able to reach further down toward the minimum. Reducing it should improve your training accuracy, which should in turn improve validation accuracy. As I said, with the level of dropout you have I do not think it is overfitting, but if it is, you can also use Keras regularizers to help avoid overtraining (documentation is here).
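A minimal sketch of the ReduceLROnPlateau setup mentioned above (factor and patience follow the values quoted; the training/validation variable names are assumed, since the question does not show model.fit):
from keras.callbacks import ReduceLROnPlateau

# halve the learning rate whenever val_loss has not improved for 3 epochs
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=3,
                              min_lr=1e-6, verbose=1)

model.fit(trainx, trainy,
          validation_data=(valx, valy),   # valx/valy: your validation split (names assumed)
          epochs=50, batch_size=32,
          callbacks=[reduce_lr])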

How do I best optimize my parameters and choices of activation, optimizer, etc. in an LSTM?

I'm training an LSTM neural network to predict volatility (a time series) in Keras. At the moment, my network is specified as follows:
from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras.regularizers import l2

model = Sequential()
model.add(LSTM(10, input_shape=(1,1), kernel_regularizer = l2(0.0001)))
model.add(Dense(1, activation = 'relu'))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(X_train, y_train, validation_split=0.2, epochs=100, batch_size=16)
Here, I have a lot of parameters I could cross-validate:
Number of units in the LSTM?
More layers?
Regularizer (L1 or L2, and how much)?
Activation function?
Optimizer?
Batch size?
However, cross-validating all of these parameters would result in huge computational time, so how do I determine the correct specification for all of them?
As far as I know, grid search might be the best approach. However, you can narrow the search space by examining your data. If you don't have much data, go for a smaller model rather than a big one (otherwise it will overfit); that alone shrinks the search space a bit. Some say fewer layers with more units work well for low-resource data, but even that is not guaranteed.
A regularizer can sometimes help or hurt; it depends on the task. You'll never know whether a setting is right unless you experiment with it.
For the batch size, it is recommended to experiment with values from 16 to 512 (or higher if you can). The larger the batch size, the faster it trains and the more memory it consumes. A smaller batch size also means the model "walks" more randomly; in other words, the loss decreases at a noisier pace.
For the optimizer, if you want to keep it out of the grid search, just use Adam. It is quite good for most tasks.
All in all, no one can guarantee that tuning different hyperparameters will result in a performance gain. Everything needs to be experimented with and recorded; that's why there are so many research papers on hyperparameter tuning.
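To make the grid-search suggestion concrete, here is a minimal manual sketch over two of the hyperparameters listed in the question (the candidate values are illustrative; X_train and y_train are the arrays from the question):
from itertools import product
from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras.regularizers import l2

results = {}
for units, batch_size in product([10, 32, 64], [16, 64, 256]):
    model = Sequential()
    model.add(LSTM(units, input_shape=(1, 1), kernel_regularizer=l2(0.0001)))
    model.add(Dense(1, activation='relu'))
    model.compile(loss='mean_squared_error', optimizer='adam')
    hist = model.fit(X_train, y_train, validation_split=0.2,
                     epochs=100, batch_size=batch_size, verbose=0)
    results[(units, batch_size)] = min(hist.history['val_loss'])

best = min(results, key=results.get)
print("best (units, batch_size):", best, "val MSE:", results[best])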

Changing optimizer or lr after loading model yields strange results

I'm using the latest Keras with the TensorFlow backend (Python 3.6).
I'm loading a model that had a training accuracy of around 86% when I last trained it.
The original optimizer that I used was:
r_optimizer = Adam(lr=0.0001, decay=.02)
model.compile(optimizer=r_optimizer,
              loss='categorical_crossentropy', metrics=['accuracy'])
If I load the model and continue training without recompiling, my accuracy stays around 86% (even after 10 or so more epochs). So I wanted to try changing the learning rate or optimizer. If I recompile the model and try to change the learning rate or the optimizer as follows:
new_optimizer = Adam(lr=0.001, decay=.02)
or to this one:
sgd = optimizers.SGD(lr= .0001)
and then compile:
model.compile(optimizer=new_optimizer,
              loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(...)
then the accuracy resets to around 15-20% instead of starting around 86%, and my loss is much higher. Even if I use a small learning rate and recompile, I still start off from a very low accuracy.
From browsing the internet, it seems some optimizers like Adam or RMSprop have a problem with resetting weights after recompiling (I can't find the link at the moment).
So I did some digging and tried to reset my optimizer without recompiling as follows:
import numpy as np
from keras.models import load_model
from keras import optimizers

model = load_model(load_path)
sgd = optimizers.SGD(lr=1.0)  # very high, just for testing
model.optimizer = sgd         # change the optimizer

# fit for training
history = model.fit_generator(
    train_gen,
    steps_per_epoch=r_steps_per_epoch,
    epochs=r_epochs,
    validation_data=valid_gen,
    validation_steps=np.ceil(len(valid_gen.filenames) / r_batch_size),
    callbacks=callbacks,
    shuffle=True,
    verbose=1)
However, these changes don't seem to be reflected in my training. Despite raising the lr significantly, I'm still floundering around 86% with the same loss. During each epoch I'm seeing very little loss or accuracy movement; I would expect the loss to be a lot more volatile. This leads me to believe that my change of optimizer and lr isn't being picked up by the model. Any idea what I could be doing wrong?
I think your assignment does not actually give the optimizer a new lr. There is a way to reset the lr value after loading a model in Keras; I hope it helps you.
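One way to do that (a sketch, assuming the standalone Keras API used in the question) is to overwrite the optimizer's learning-rate variable directly instead of recompiling:
from keras import backend as K
from keras.models import load_model

model = load_model(load_path)              # load_path as in the question
print("old lr:", K.get_value(model.optimizer.lr))

K.set_value(model.optimizer.lr, 1e-3)      # set the new learning rate in place
print("new lr:", K.get_value(model.optimizer.lr))
# then continue with model.fit(...) / model.fit_generator(...) as before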
This is a partial answer referring to what you wrote here:
From browsing the internet it seems some optimizers like ADAM or RMSPROP have a problem with resetting weights after recompiling (can't find the link at the moment)
Adaptive optimizers such as Adam, RMSprop, Adagrad, Adadelta, and any variation on these rely on previous update steps to improve the direction and magnitude of any current adjustment to the weights of the model.
Because of this, the first few steps they take tend to be relatively "bad", as they "calibrate themselves" with information from previous steps.
When used on a random initialization this is not a problem, but when used on a pretrained model, those first few steps can degrade the model so much that almost all of the pretrained work gets lost.
Even worse, the training no longer starts from a carefully chosen random initialization such as Xavier initialization, but from some sub-optimal starting point, which could prevent the model from converging to the local optimum it would have reached had it started from a good random initialization.
Unfortunately I'm not sure how you can avoid this... Perhaps pretrain with one optimizer --> save the weights --> replace the optimizer --> restore the weights --> train for a few epochs and hope the new adaptive optimizer learns a "useful history" --> then restore the weights again from the saved weights of the pretrained model and, without recompiling, start training again, now with a better optimizer "history".
Please let us know if this works.
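For concreteness, here is a hedged sketch of that save/replace/restore idea (the file name, learning rate, and the warm-up data x_warm/y_warm are all placeholders):
from keras.models import load_model
from keras import optimizers

model = load_model(load_path)                        # pretrained model from the question
model.save_weights('pretrained_weights.h5')          # 1. save the good weights

model.compile(optimizer=optimizers.Adam(lr=1e-4),    # 2. recompile with the new optimizer
              loss='categorical_crossentropy', metrics=['accuracy'])
model.load_weights('pretrained_weights.h5')          # 3. restore the weights

model.fit(x_warm, y_warm, epochs=3)                  # 4. a few epochs so the optimizer builds up its "history"
model.load_weights('pretrained_weights.h5')          # 5. restore the weights again, keeping the optimizer state
# 6. continue training for real, without recompiling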

Getting started with Keras for machine learning

I'm getting started with machine learning tools and I'd like to learn more about what the heck I'm doing. For instance, the script:
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, BatchNormalization
from keras.initializers import RandomUniform
import numpy
numpy.random.seed(13)
RandomUniform(seed=13)
model = Sequential()
model.add(Dense(6, input_dim=6))
model.add(BatchNormalization())
model.add(Activation('tanh'))
model.add(Dropout(0.01))
model.add(Dense(11))
model.add(Activation('tanh'))
model.add(Dropout(0.01))
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.compile(optimizer='sgd', loss='mean_absolute_error', metrics=['accuracy'])
data = numpy.loadtxt('train', delimiter=' ')
X = data[:, 0:6]
Y = data[:, 6]
model.fit(X, Y, batch_size=1, epochs=1000)
data = numpy.loadtxt('test', delimiter=' ')
X = data[:, 0:6]
Y = data[:, 6]
score = model.evaluate(X, Y, verbose=1)
print ('\n\nThe error is:\n', score, "\n")
print('\n\nPrediction:\n')
Y = model.predict(X, batch_size=1, verbose=1)
print('\nResult:\n', Y, '\n')
It's a Frankenstein I made from some examples I found on the internet and I have many unanswered questions about it:
The file train has 60 rows. Is 1000 epochs too few? Is it too many? Could I be underfitting or overfitting?
What does the result I get from model.evaluate() mean? I know it's the loss, but if I get [7.0506157875061035, 0.0], does that mean my model has a 7% error?
And last, I'm getting predictions of 0.99875391, 0.99875391, 0.9362126, 0.99875391, 0.99875391, 0.99875391, 0.93571019 when the expected values were around 7.86, 3.57, 8.93, 6.57, 11.7, 8.53, 9.06, which means it's a really bad prediction. Clearly there are a lot of things I'm doing wrong. Could you give me a few pointers?
I know it all depends on the type of data I'm using, but is there anything I shouldn't do at all? Or maybe something I should be doing?
1
There is never a ready answer for how many epochs is a good number. It varies wildly depending on the size of your data, your model, and what you want to achieve. Normally, small models require fewer epochs and bigger models require more. Yours seems small enough, and 1000 epochs seems like way too many.
It also depends on the learning rate, a parameter given to the optimizer that defines how large the steps are that your model takes when updating its weights. Bigger learning rates mean fewer epochs, but there is a chance that you simply never find a good point because you're adjusting the weights beyond what is good. Smaller learning rates mean more epochs and better learning.
Normally, if the loss stops decreasing, you're approaching a point where training is no longer useful. (Of course, there may also be problems with the model; there is really no simple answer for this one.)
To detect overfitting, you need, besides the training data (X and Y), another group of test data (say Xtest and Ytest, for instance).
Then you use it in model.fit(X,Y, validation_data=(Xtest,Ytest), ...)
Test data is not given for training, it's kept separate just to see if your model can predict good things from data it has never seen in training.
If the training loss goes down, but the validation loss doesn't, you're overfitting (roughly, your model is capable of memorizing the training data without really understanding it).
An underfit, on the contrary, happens when you never achieve the accuracy you expect (of course we always hope for 100% accuracy with no mistakes, but good models tend to land around the 90s; some applications do better than 99%, some worse; again, it's very subjective).
2
model.evaluate() gives you the losses and the metrics you added in the compile method.
The loss value is something your model will always try to decrease during training. It roughly means how distant your model is from the exact values. There is no rule for what the loss value means, it could even be negative (but usually keras uses positive losses). The point is: it must decrease during training, that means your model is evolving.
The accuracy value means how many right predictions your model outputs compared to the true values (Y). It seems your accuracy is 0%, your model is getting everything wrong. (You can see that from the values you typed).
3
In your model, you used activation functions. These normalize the results so they don't get too big, which avoids overflow problems, propagation of numeric errors, and so on.
It's very common to work with values within such bounds.
tanh - outputs values between -1 and 1
sigmoid - outputs values between 0 and 1
Well, if you use a sigmoid activation in the last layer, your model will never output 3, for instance. It tries, but the maximum value is 1.
What you should do is prepare your data (Y) so it's contained between 0 and 1. (This is usually the best approach for classification problems, and it's often done with images too.)
But if you actually want numerical values, then you should just remove the activation and let the output be free to reach higher values. (It all depends on what you want to achieve with your model.)
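A small sketch of both options (using the X/Y arrays from your script; the scaler comes from scikit-learn):
from sklearn.preprocessing import MinMaxScaler

# Option A: squash the targets into [0, 1] so they match the sigmoid output,
# and invert the scaling when reading predictions back.
scaler = MinMaxScaler()
Y_scaled = scaler.fit_transform(Y.reshape(-1, 1))
# model.fit(X, Y_scaled, ...)
# real_predictions = scaler.inverse_transform(model.predict(X))

# Option B: keep the raw targets and drop the final sigmoid,
# i.e. end the model with model.add(Dense(1)) and no Activation('sigmoid').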
An epoch is a single pass through the full training set. In my mind 1000 seems like a lot, but you'd have to check for overfitting and evaluate the predictions. There are many ways of checking and controlling for overfitting in a model; if you understand the methods for doing so from here, coding them in Keras should be no problem.
According to the documentation .evaluate returns:
Scalar test loss (if the model has no metrics) or list of scalars (if the model computes other metrics)
so these are the evaluation metrics of your model; they tell you how good your model is, given some notion of "good". Those metrics depend on the model and the type of data you've used. Some explanation of them can be found here and here. As mentioned in the documentation,
The attribute model.metrics_names will give you the display labels for the scalar outputs.
So you can know what metric you are looking at. It is easier to do that interactively through the console (ipython, bpython) or Jupyter notebook.
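For example, a quick sketch with the objects from your script:
score = model.evaluate(X, Y, verbose=1)
print(dict(zip(model.metrics_names, score)))   # e.g. {'loss': 7.05..., 'acc': 0.0}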
I can't see your data, but if you are doing a classification problem, as suggested by metrics=['accuracy'], then loss='mean_absolute_error' doesn't make sense, since it is intended for regression problems. To learn more about those, I refer you to here and here, which discuss classification and regression problems with Keras.
PS: question 3 is not related to software per se, but to the theoretical construct supporting the software. In such cases, I'd recommend asking them at Cross Validated.
