Randomness of LSTM model - python

I have one LSTM model like below:
from keras.models import Sequential
from keras.layers import Conv1D, LSTM, Dense, Dropout

model = Sequential()
model.add(Conv1D(3, 32, input_shape=(60, 12)))  # 3 filters, kernel size 32
model.add(LSTM(units=256, return_sequences=False, dropout=0.25))
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')
model.summary()
Each time I train it on the same dataset, I get a different model. Most of the time the performance of the trained model is acceptable, but sometimes it is really bad. I think there is some randomness in the initialization or training. So how can I fix everything so that every training run produces the same model?

I've experienced this problem with Keras as well; it has to do with the random seeds. You can fix the seeds like this before importing Keras, so that you get consistent results:
import os
os.environ['PYTHONHASHSEED'] = '0'

import random
random.seed(12345)

import numpy as np
np.random.seed(1000)

# Also set the TensorFlow randomness to a fixed value if you need to (TF 1.x API):
import tensorflow as tf
tf.set_random_seed(1234)
This worked for me.

Weights are initialized randomly in neural networks, so it is possible to get different results by design. If you think about how backpropagation works and how the cost function is minimized, you will notice that there is no guarantee that your network will find the global minimum. Fixing the seed is one way to get reproducible results, but on the other hand you limit your network to a fixed starting position, from which it may never reach the global minimum.
A lot of complex models, especially LSTMs, are unstable. You could look at convolutional approaches; I noticed they perform almost as well and are much more stable.
https://arxiv.org/pdf/1803.01271.pdf

You can also save a trained model:
model.save("lstm_model.h5")
And load it later on:
from keras.models import load_model
model = load_model("lstm_model.h5")

Related

How do I best optimize my parameters, choices of activation, optimizer, etc. in an LSTM?

I'm training an LSTM neural network to predict volatility (a time series) in Keras. At the moment, my network is specified as follows:
from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras.regularizers import l2

model = Sequential()
model.add(LSTM(10, input_shape=(1, 1), kernel_regularizer=l2(0.0001)))
model.add(Dense(1, activation='relu'))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(X_train, y_train, validation_split=0.2, epochs=100, batch_size=16)
Here, I have a lot of parameters I could cross-validate:
Number of units in the LSTM?
More layers?
Regularizer (L1 or L2, and how much)?
Activation function?
Optimizer?
Batch size?
However, cross-validating all of these parameters would take a huge amount of computation time, so how do I determine the correct specification for each of them?
As far as I know, grid search might be the best approach. However, you can shrink the search space by examining your data. If you don't have much data, go for a smaller model; don't go too big, or it will overfit. This narrows the search space a bit. Some say fewer layers but more units work well for low-resource data, but even that is not guaranteed.
Regularization can be good or bad; it depends on the task. You will never know whether a setting is correct unless you experiment with it.
For batch size, it is recommended to experiment with values from 16 to 512 (or higher if you can). The larger the batch size, the faster the training and the more memory it consumes. A smaller batch size also means the model will "walk" more randomly; in other words, the loss will decrease at a more erratic pace.
For the optimizer, if you don't want to include it in the grid search, just use Adam. It is quite good for most tasks.
All in all, no one can guarantee that tuning a particular hyperparameter will result in a performance gain. Everything needs to be experimented with and recorded. That is why there is so much research on hyperparameter tuning.
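As a concrete illustration, here is a minimal grid-search sketch over two of the hyperparameters above (LSTM units and batch size). It reuses X_train and y_train from the question; the grid values, the build_model helper, and the selection by best validation loss are placeholder assumptions, not a definitive recipe:
from itertools import product
from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras.regularizers import l2

def build_model(units):
    # same architecture as in the question, with the LSTM size as a parameter
    model = Sequential()
    model.add(LSTM(units, input_shape=(1, 1), kernel_regularizer=l2(0.0001)))
    model.add(Dense(1, activation='relu'))
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model

results = []
for units, batch_size in product([5, 10, 20], [16, 64, 256]):
    model = build_model(units)
    hist = model.fit(X_train, y_train, validation_split=0.2,
                     epochs=100, batch_size=batch_size, verbose=0)
    results.append((min(hist.history['val_loss']), units, batch_size))

best_val_loss, best_units, best_batch_size = sorted(results)[0]
print(best_val_loss, best_units, best_batch_size)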

Most suitable Machine Learning algorithm for this problem?

I have a dataset and I want to decide which ML algorithm to apply to my problem.
Customers fill out an assessment questionnaire of 50 questions. Examples of the questions are: what is your job, your previous job history, how much do you earn, have you been rejected for a loan, etc., and the end goal is to decide whether they should be rejected or not.
I have circa 500 entries for my algorithm to learn from, and I have pre-processed my dataset and converted the inputs into a NumPy array. I am wondering what would be the best algorithm to use. Should I use a classification algorithm or a neural network in TensorFlow, and if the latter, what layers should I use?
Thanks
How about beginning with xgboost or a random forest, i.e. plain "old" ML?
The advantage would be that you could visualize the decision trees of the model once trained.
If you use an NN in TensorFlow (or, even easier, Keras with the TensorFlow backend), you could go with an MLP (multi-layer perceptron), since the questionnaire answers have fixed positions in the input. You don't need many layers.
It is important that you normalize your input data column-wise, so that the input numbers are not much bigger/smaller than +1/-1. Introductory books often miss this point, even though it is important.
Since your target labels are "accept" or "reject", a binary classifier will do (also with the classical machine-learning approach); use 0 and 1 as labels.
For an NN, you don't need many layers or neurons for this kind of classification. Try the smallest network first: say 10 neurons in the first layer, then 7 neurons in the next layer (probably even fewer), and then 1 output neuron for the binary decision.
With Keras this would be:
from keras.models import Sequential
from keras.layers import Dense

def create_mlp(n_input=500):  # number of columns of input data, 500 here
    model = Sequential()
    model.add(Dense(10, input_dim=n_input, kernel_initializer='normal', activation='relu'))  # init = kernel_initializer
    model.add(Dense(7, kernel_initializer='normal', activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['acc'])
    return model

model = create_mlp(500)  # this will generate the correct NN, compiled.
Your data frame (or NumPy input array) must have the samples as rows and one column per question answer. The answers have to be encoded in numeric form, and the numbers should be small, ideally between -1 and 1; NNs don't like big numbers, which is why column-wise normalization helps.
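For the column-wise normalization, a minimal sketch using scikit-learn's StandardScaler (X_train and X_test here stand for your already-encoded NumPy arrays):
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit mean/std per column on the training data only
X_test_scaled = scaler.transform(X_test)        # reuse the same statistics for the test data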
That's it. I learned all this stuff last year. Good luck for learning. It will be tons of fun!

Keras trained regression model predicts same output for all set of test features

I am trying to build a regression model that predicts the 'Ratings' for movies using the dataset https://www.kaggle.com/shubhammehta21/movie-lens-small-latest-dataset. However, after training, the model predicts the same output for every set of test features. I have read previous similar questions that suggested adjusting the learning rate, the number of features, and checking that the model used for prediction is the same as the trained one. None of these has worked for me.
I load the data and process it:
links= pd.read_csv('../input/movie-lens-small-latest-dataset/links.csv')
movies=pd.read_csv('../input/movie-lens-small-latest-dataset/movies.csv')
...
dataset=movies.merge(ratings,on='movieId').merge(tags,on='movieId').merge(links,on='movieId')
to_drop=['title','genres','timestamp_x','timestamp_y','userId_y','imdbId','tmdbId']
dataset.drop(columns=to_drop,inplace=True)
dataset=pd.get_dummies(dataset)
The code below shows how I build the regression model. I have tried adjusting the number of neurons and layers, but that has not influenced the output.
from keras.models import Sequential
from keras.layers.core import Dense, Activation
from keras.optimizers import Adam
model = Sequential()
model.add(Dense(13, input_dim=1586, kernel_initializer='zero', activation='relu'))
model.add(Dense(6, kernel_initializer='normal', activation='relu'))
model.add(Dense(1, kernel_initializer='normal',activation='linear'))
# Compile model
adam = Adam(lr=0.001)
model.compile(loss='mean_squared_error', optimizer=adam,metrics=['mse','mae'])
model.summary()
history = model.fit(train_dataset,train_labels,batch_size=30, epochs=10,verbose=1, validation_split=0.3)
score = model.evaluate(validation_dataset,validation_labels)
print("Test score:", score)
Whenever I predict on the test dataset:
model.predict(test_dataset)
it predicts the same value, 3.97, for every sample. I am expecting a range of values between 0 and 5.
You should never (I mean, never) use kernel_initializer='zero'; to be honest, I am surprised that the option even exists in Keras!
Also, kernel_initializer='normal' is not recommended.
As a first step, remove all kernel_initializer arguments so as to revert to the default and recommended one, kernel_initializer='glorot_uniform'; keep in mind that defaults are there for a reason (they usually work well), and you should change them only if you really have a reason to do so (which I trust you don't have here) and you know what you are doing.
If you still don't get what you would expect, experiment with other parameters (number of layers/neurons, more epochs, etc.); leave the learning rate (lr) of the Adam optimizer as is for starters (it is also one of those default values that seem to work nicely across cases).
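For illustration, here is a sketch of the question's model with all kernel_initializer arguments removed so that the defaults apply; everything else is kept as in the question:
from keras.models import Sequential
from keras.layers.core import Dense
from keras.optimizers import Adam

model = Sequential()
model.add(Dense(13, input_dim=1586, activation='relu'))  # default glorot_uniform initializer
model.add(Dense(6, activation='relu'))
model.add(Dense(1, activation='linear'))
model.compile(loss='mean_squared_error', optimizer=Adam(lr=0.001), metrics=['mse', 'mae'])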

How can I get weights converged in a way that MSE minimizes?

Here is my code:
import matplotlib.pyplot as plt
from keras import backend as K
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout
from sklearn.metrics import mean_squared_error

for _ in range(5):
    K.clear_session()
    model = Sequential()
    model.add(LSTM(256, input_shape=(None, 1)))
    model.add(Dropout(0.2))
    model.add(Dense(256))
    model.add(Dropout(0.2))
    model.add(Dense(1))
    model.compile(loss='mean_squared_error', optimizer='rmsprop', metrics=['accuracy'])

    hist = model.fit(x_train, y_train, epochs=20, batch_size=64, verbose=0, validation_data=(x_val, y_val))

    p = model.predict(x_test)
    print(mean_squared_error(y_test, p))
    plt.plot(y_test)
    plt.plot(p)
    plt.legend(['testY', 'p'], loc='upper right')
    plt.show()
Total params : 330,241
samples : 2264
and below is the result
I haven't changed anything; I only reran the for loop. As you can see in the picture, the resulting MSE is huge, even though nothing but the loop iteration changed.
I think the fundamental reason for this problem is that the optimizer cannot find the global minimum; it finds a local minimum and converges there. I think so because, after checking all the loss graphs, the loss is no longer reduced significantly (after 20 epochs). So in order to solve this problem, I have to find the global minimum. How should I do this?
I tried adjusting the batch size and the number of epochs. I also tried changing the hidden layer size and the number of LSTM units, adding a kernel_initializer, changing the optimizer, etc., but could not get any meaningful result.
I wonder how I can solve this problem.
Your valuable opinions and thoughts will be very much appreciated.
If you want to see the full source, here is the link: https://gist.github.com/Lay4U/e1fc7d036356575f4d0799cdcebed90e
From your example, the problem simply comes from the fact that you have over 100 times more parameters than samples. If you reduce the size of your model, you will see less variance.
The wider question you are asking is actually very interesting and usually isn't covered in tutorials. Nearly all machine learning models are stochastic by nature: the output predictions will change slightly every time you train, which means you always have to ask the question: which model do I deploy to production?
Off the top of my head there are two things you can do:
Choose the first model trained on all the data (after cross-validation, ...)
Build an ensemble of models that all have the same hyper-parameters and implement a simple voting (for regression, averaging) strategy; a minimal sketch follows the references below
References:
https://machinelearningmastery.com/train-final-machine-learning-model/
https://machinelearningmastery.com/randomness-in-machine-learning/
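As a rough sketch of the second option, assuming a hypothetical build_model() helper that returns a freshly compiled model, and the x_train, y_train, x_test arrays from the question (for a regression target, averaging the predictions takes the place of voting):
import numpy as np

n_members = 5
predictions = []
for _ in range(n_members):
    member = build_model()  # same architecture and hyper-parameters for every member
    member.fit(x_train, y_train, epochs=20, batch_size=64, verbose=0)
    predictions.append(member.predict(x_test))

ensemble_prediction = np.mean(predictions, axis=0)  # average over the ensemble members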
If you always want to start from the same point, you should set the seeds. With the TensorFlow backend in Keras you can do it like this:
from numpy.random import seed
seed(1)
from tensorflow import set_random_seed
set_random_seed(2)
If you want to learn why you get different results in ML/DL models, I recommend this article.

Keras: specify input dropout layer that always keeps certain features

I'm training a neural net using Keras in Python for time-series climate data (predicting value X at time t=T), and tried adding a (20%) dropout layer on the inputs, which seemed to limit overfitting and cause a slight increase in performance. However, after I added a new and particularly useful feature (the value of the response variable at time of prediction t=0), I found massively increased performance by removing the dropout layer. This makes sense to me, since I can imagine how the neural net would "learn" the importance of that one feature and base the rest of its training around adjusting that value (i.e, "how do these other features affect how the response at t=0 changes by time t=T").
In addition, there are a few other features that I think should be present for all epochs. That said, I am still hopeful that a dropout layer could improve the model's performance; it just needs to not drop certain features, like X at t=0. In short, I need a dropout layer that will only drop out certain features.
I have searched for examples of doing this, and read the Keras documentation here, but can't seem to find a way to do it. I may be missing something obvious, as I'm still not familiar with how to manually edit layers. Any help would be appreciated. Thanks!
Edit: sorry for any lack of clarity. Here is the code where I define the model (p is the number of features):
def create_model(p):
    model = Sequential()
    model.add(Dropout(0.2, input_shape=(p,)))  # % of features dropped
    model.add(Dense(1000, input_dim=p, kernel_initializer='normal', activation='sigmoid'))
    model.add(Dense(30, kernel_initializer='normal', activation='relu'))
    model.add(Dense(1, kernel_initializer='normal', activation='linear'))
    model.compile(loss=cost_fn, optimizer='adam')
    return model
The best way I can think of applying dropout only to specific features is to simply separate the features in different layers.
For that, I suggest you simply divide your inputs in essential features and droppable features:
from keras.layers import *
from keras.models import Model

def create_model(essentialP, droppableP):
    essentialInput = Input((essentialP,))
    droppableInput = Input((droppableP,))
    dropped = Dropout(0.2)(droppableInput)  # % of droppable features dropped

    completeInput = Concatenate()([essentialInput, dropped])

    output = Dense(1000, kernel_initializer='normal', activation='sigmoid')(completeInput)
    output = Dense(30, kernel_initializer='normal', activation='relu')(output)
    output = Dense(1, kernel_initializer='normal', activation='linear')(output)

    model = Model([essentialInput, droppableInput], output)
    model.compile(loss=cost_fn, optimizer='adam')
    return model
Train the model using two inputs. You have to manage your inputs before training:
model.fit([essential_train_data,droppable_train_data], predictions, ...)
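For example, if the essential features happen to occupy the first k columns of your feature matrix, a minimal sketch of preparing the two inputs could look like this (k, the column ordering, and the y_train target name are assumptions about your data):
k = 3  # assumed number of essential features, occupying the first k columns of X_train
essential_train_data = X_train[:, :k]
droppable_train_data = X_train[:, k:]

model = create_model(essentialP=k, droppableP=X_train.shape[1] - k)
model.fit([essential_train_data, droppable_train_data], y_train, epochs=50, batch_size=32)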
I don't see any harm in using dropout in the input layer. The usage/effect would be a little different from normal, of course. The effect would be similar to adding synthetic noise to an input signal, except that the feature/pixel/whatever would be entirely unknown (zeroed out) instead of noisy. And inserting synthetic noise into the input is one of the oldest ways to improve robustness; certainly not bad practice as long as you think about whether it makes sense for your data set.
This question already has an accepted answer, but it seems to me you are using dropout in a bad way.
Dropout is only for the hidden layers, not for the input layer!
Dropout acts as a regularizer and prevents complex co-adaptation in the hidden layers. Quoting the Hinton paper: "Our work extends this idea by showing that dropout can be effectively applied in the hidden layers as well and that it can be interpreted as a form of model averaging" (http://www.jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf).
Dropout can be seen as training several different models with your data and averaging their predictions at test time. If you prevent your models from having all the inputs during training, they will perform badly, especially if one input is crucial. What you actually want is to avoid overfitting, meaning you prevent overly complex models during the training phase (so each of your models selects the most important features first) before testing.
It is common practice to drop some of the features in ensemble learning, but this is controlled, not stochastic like dropout. It also works for neural networks because hidden layers (often) have many more neurons than inputs, so dropout follows the law of large numbers, whereas with a small number of inputs you can, in an unlucky case, have almost all of your inputs dropped.
In conclusion: it is a bad practice to use dropout in the input layer of a neural network.
