I was following a tutorial to generate English text using LSTMs, with Shakespeare's works as the training file. This is the model I am using, based on that tutorial:
model = Sequential()
model.add(LSTM(HIDDEN_DIM, input_shape=(None, VOCAB_SIZE), return_sequences=True))
model.add(Dropout(0.2))
for i in range(LAYER_NUM - 1):
    model.add(LSTM(HIDDEN_DIM, return_sequences=True))
model.add(TimeDistributed(Dense(VOCAB_SIZE)))
model.add(Activation('softmax'))
model.compile(loss="categorical_crossentropy", optimizer="rmsprop")
After 30 epochs of training, I save the model using model.save('model.h5'). At this point, the model has learned the basic format and has learned a few words. However, when I try to load the model in a new program using load_model('model.h5') and try to generate some text, it ends up predicting completely random letters and symbols. This led me to think that the model weights are not being restored properly, since I encountered the same problem while storing only the model weights. So is there any alternative for storing and restoring trained models with LSTM layers?
For reference, in order to generate the text, the function randomly generates a character and feeds it into the model to predict the next character. This is the function:
def generate_text(model, length):
    ix = [np.random.randint(VOCAB_SIZE)]
    y_char = [ix_to_char[ix[-1]]]
    X = np.zeros((1, length, VOCAB_SIZE))
    for i in range(length):
        X[0, i, :][ix[-1]] = 1
        print(ix_to_char[ix[-1]], end="")
        ix = np.argmax(model.predict(X[:, :i+1, :])[0], 1)
        y_char.append(ix_to_char[ix[-1]])
    return ('').join(y_char)
EDIT
The snippet of code for training:
for nbepoch in range(1, 11):
    print('Epoch ', nbepoch)
    model.fit(X, y, batch_size=64, verbose=1, epochs=1)
    if nbepoch % 10 == 0:
        model.model.save('checkpoint_{}_epoch_{}.h5'.format(512, nbepoch))
    generate_text(model, 50)
    print('\n\n\n')
Here generate_text() is just a function to predict new characters, starting from a randomly generated character. After every 10 epochs of training, the entire model is saved as an .h5 file.
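For reference, the same periodic saving could also be done with Keras's ModelCheckpoint callback instead of a manual check in the loop; this is only a rough sketch (not the code I actually ran, and the period argument is only available in older Keras versions):

from keras.callbacks import ModelCheckpoint

# Save the full model every 10 epochs; older Keras exposes `period`,
# newer versions use `save_freq` instead.
checkpoint = ModelCheckpoint('checkpoint_512_epoch_{epoch:02d}.h5',
                             save_best_only=False, period=10)
model.fit(X, y, batch_size=64, epochs=30, callbacks=[checkpoint])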
The code for loading the model:
print('Loading Model')
model = load_model('checkpoint_512_epoch_10.h5')
print('Model loaded')
generate_text(model, 400)
As far as predictions go, the generated text is reasonably structured during training and the model learns some words. However, when the saved model is loaded, the text generation is completely random, as if the weights had been randomly reinitialized.
After doing a bit of digging, I finally found out that the issue was the way I was creating the dictionary mapping between characters and one-hot vectors. I was using char = list(set(data)) to get a list of all the characters in the file, and then assigning each character's index in that list as its 'code number'. However, list(set(data)) does not always output the list in the same order; the ordering is randomized for each Python session. So my dictionary mapping changed between saving and loading the model, since those happened in different scripts. Using char = sorted(list(set(data))) eliminates this problem.
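A minimal sketch of the deterministic mapping (the file name and variable names here are illustrative, not the exact ones from my scripts):

data = open('shakespeare.txt').read()   # hypothetical training file
chars = sorted(list(set(data)))         # sorted -> same order in every Python session
VOCAB_SIZE = len(chars)
char_to_ix = {c: i for i, c in enumerate(chars)}
ix_to_char = {i: c for i, c in enumerate(chars)}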
Related
I am trying to develop an LSTM model using Keras, following this tutorial. However, I am implementing it with a different dataset of U.S. political news articles, with the aim of classifying them based on political bias (labels: Left, Centre and Right). I got a model to run following the tutorial, but the loss and accuracy looked very off, like this:
I tried playing around with different Dropout probabilities (i.e. 0.5 instead of 0.2), adding/removing hidden layers (and making them less dense), and decreasing/increasing the maximum number of words and maximum sequence length.
I have managed to get the graphs to align a bit more; however, that has led to the model having lower accuracy on the training data (and the overfitting problem is still bad):
Additionally, I am not sure why the validation accuracy always seems to be higher than the training accuracy in the first epoch (shouldn't it usually be lower)?
Here is some code that is being used when tokenizing, padding, and initializing variables:
# The maximum number of words to be used (most frequent).
MAX_NB_WORDS = 500
# Max number of words in each news article.
MAX_SEQUENCE_LENGTH = 100  # I am aware this may be too small
# This is fixed.
EMBEDDING_DIM = 64

tokenizer = Tokenizer(num_words=MAX_NB_WORDS,
                      filters='!"#$%&()*+,-./:;<=>?#[\]^_`{|}~',
                      lower=True)
tokenizer.fit_on_texts(df_raw['titletext'].values)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

X = tokenizer.texts_to_sequences(df_raw['titletext'].values)
X = pad_sequences(X, maxlen=MAX_SEQUENCE_LENGTH)
print('Shape of data tensor:', X.shape)

Y = pd.get_dummies(df_raw['label']).values
print('Shape of label tensor:', Y.shape)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20)
print(X_train.shape, Y_train.shape)
print(X_test.shape, Y_test.shape)
X_train.view()
When I look at the output of X_train.view(), I am also not sure why all the arrays start with zeros, like this:
I also did a third attempt, which was just the second attempt with the number of epochs increased; it looks like this:
Here is the code of the actual model:
model = Sequential()
model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=X.shape[1]))
# model.add(SpatialDropout1D(0.2)) ---> commented out
# model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2)) ---> commented out
model.add(LSTM(64, dropout=0.2, recurrent_dropout=0.2))
model.add(Dropout(0.5))
model.add(Dense(8))
model.add(Dropout(0.5))
model.add(Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

epochs = 25
batch_size = 64

history = model.fit(X_train, Y_train, epochs=epochs, batch_size=batch_size,
                    validation_split=0.2,
                    callbacks=[EarlyStopping(monitor='val_loss', patience=3, min_delta=0.0001)])
Here is the link to the full code, including the dataset
Any help would be greatly appreciated!
Hyperparameter adjustments for reducing overfitting in neural networks
Identify and address overfitting. The first attempt largely shows overfitting, with early divergence of your test and train loss. I would try a lower learning rate here (in addition to the regularisation steps you already took with the dropout layers). Using the default learning rate does not guarantee the best results.
Allow your model to find the global minimum rather than getting stuck in a local minimum. The second attempt looks better. However, if the x-axis shows the number of epochs, it could be that your early stopping is too strict; relax it by increasing the patience or the min_delta threshold. Also consider other optimisers, including SGD with a learning rate scheduler; see the sketch below.
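A minimal sketch of these two suggestions, applied to the model from the question (the specific learning-rate values are only illustrative starting points):

from tensorflow import keras

# Option 1: Adam with an explicitly lower learning rate than the default 0.001
model.compile(loss='categorical_crossentropy',
              optimizer=keras.optimizers.Adam(learning_rate=1e-4),
              metrics=['accuracy'])

# Option 2: SGD with an exponentially decaying learning rate schedule
lr_schedule = keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.01, decay_steps=1000, decay_rate=0.9)
model.compile(loss='categorical_crossentropy',
              optimizer=keras.optimizers.SGD(learning_rate=lr_schedule),
              metrics=['accuracy'])

# Relaxed early stopping so training is not cut off too soon
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=10,
                                           min_delta=0.0001,
                                           restore_best_weights=True)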
Too large a network leads to overfitting on the training set and difficulty generalising. Too many neurons may cause the network to 'memorize' your training set and overfit. I would try 8, 16 or 24 neurons in your LSTM layer, for example.
Data preprocessing and cleaning. Check your pad_sequences call: by default it pads the start of each text with zeros. I would pad at the end of the text instead; see the sketch below.
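A one-line sketch of post-padding (this also explains why the arrays printed by X_train.view() start with zeros: the default is pre-padding):

X = pad_sequences(X, maxlen=MAX_SEQUENCE_LENGTH, padding='post', truncating='post')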
Dataset. Depending on the size of your current dataset, I would suggest data augmentation to get to a sizable amount of training text (empirically >= 1M words). I would also try several other techniques, including feature engineering and improving data quality, such as spell checking. Are the classes imbalanced? You may need to balance them out by over- or undersampling.
Consider using transfer learning and incorporating a trained language model as your embedding layer instead of training one from scratch (see the sketch below), e.g. https://www.gcptutorials.com/post/how-to-create-embedding-with-tensorflow
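A rough sketch of plugging pre-trained vectors into the embedding layer. It assumes you have already built an embedding_matrix of shape (MAX_NB_WORDS, EMBEDDING_DIM) from a pre-trained source such as GloVe, with EMBEDDING_DIM matching the pre-trained vector size; the matrix construction itself is not shown:

from tensorflow import keras

model = keras.Sequential()
model.add(keras.layers.Embedding(MAX_NB_WORDS, EMBEDDING_DIM,
                                 input_length=MAX_SEQUENCE_LENGTH,
                                 weights=[embedding_matrix],  # assumed precomputed pre-trained vectors
                                 trainable=False))            # freeze so the small dataset cannot overwrite them
model.add(keras.layers.LSTM(16, dropout=0.2, recurrent_dropout=0.2))
model.add(keras.layers.Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])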
My question is about the practical implementation of "Domain Adaptation" in a functional Keras model with the TensorFlow backend.
Description of the problem:
I have a collection of particle collision samples, each consisting of n variables. One half of them is simulated data with certain class labels (e.g. "W-Boson"). The other half is real collision data which is not labeled. The key idea is to set up a Keras model with two outputs: one for classifying the class of a sample and one for classifying the domain, i.e. whether it is simulated or real data. The catch is that the model shall be trained so that the domain classifier performs very poorly. This is achieved by flipping the sign of the incoming gradient from the domain end of the network during training. This technique is called "Domain Adaptation". The model is expected to learn domain-invariant features, or in other words, to perform the same on simulated and real collision data.
The framework I am working with has an existing functional Keras model, which I wanted to expand with said domain classifier. This is a prototype I came up with:
# common layers
inputs = keras.Input(shape=(n_variables, ))
X = layers.Dense(units=50, activation="relu")(inputs)
# domain end
flip_layer = flipGradientTF.GradientReversal(hp_lambda=0.3)(X)
X_domain = layers.Dense(units=50, activation="relu")(flip_layer)
domain_out = layers.Dense(units=2, activation="softmax", name="domain_out")(X_domain)
# class end
X_class = layers.Dense(units=50, activation="relu")(X)
class_out = layers.Dense(units=n_classes, activation="softmax", name="class_out")(X_class)
The code for flipGradientTF is taken from https://github.com/michetonu/gradient_reversal_keras_tf
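For context, the gradient-flipping idea can also be sketched directly with tf.custom_gradient in TF2; this is only a minimal illustration, not the flipGradientTF code I am actually using:

import tensorflow as tf
from tensorflow import keras

class GradientReversal(keras.layers.Layer):
    # Identity in the forward pass; scales the gradient by -hp_lambda in the backward pass.
    def __init__(self, hp_lambda=1.0, **kwargs):
        super().__init__(**kwargs)
        self.hp_lambda = hp_lambda

    def call(self, inputs):
        @tf.custom_gradient
        def _reverse(x):
            def grad(dy):
                return -self.hp_lambda * dy
            return tf.identity(x), grad
        return _reverse(inputs)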
And further on for compiling and training the model:
model = keras.Model(inputs=inputs, outputs=[class_out, domain_out])
model.compile(optimizer="adam", loss=loss_function, metrics="accuracy")

# train model
model.fit(
    x=train_data,
    y=[train_class_labels, train_domain_labels],
    batch_size=200,
    epochs=200,
    sample_weight={"class_out": class_weights, "domain_out": None}
)
For train_data I am passing a dataframe which contains the data from both domains. Depending on whether I tried "categorical_crossentropy" or "sparse_categorical_crossentropy" as the loss_function, train_class_labels and train_domain_labels were either in one-hot or in integer representation. My biggest issue is figuring out what to use for the class labels of the unlabeled data, and this led to a gut feeling that I am on the wrong track here.
So in a nutshell:
Is this implementation strategy legit and assuming it is, what should I do about the class labels for the unlabeled data? And if it is not legit, what would be a better way of attacking this problem?
Any help would be much appreciated :)
To preface this, I have plenty of experience with python and moderate experience building and using machine learning networks. That being said, this is the first LSTM I have made aside from some of the cookie-cutter examples available, so any help is appreciated. I feel like this is a problem with a simple solution and that I have just been looking at this code for far too long to see it.
This model is made in a python3.5 venv using Keras with a tensorflow backend.
In short, I am trying to make predictions of some temporal data using the data itself as well as a few mathematical permutations of this data, creating four input features. I am building a time-series input from the prior 60 data points and specifying the prediction target to be 60 data points in the future.
Shape of complete training data (input)(target): (2476224, 60, 4) (2476224)
Shape of single data "point" (input)(target): (1, 60, 4) (1)
What appears to be happening is that the trained model has fit the trailing value of my input time-series (the current value) instead of the target I have provided it (60 cycles in the future).
What is interesting is that the loss function seems to be calculating according to the correct prediction target, yet the model is not converging to the proper solution.
I have no idea why the model should be doing this. My first thought was that I was preprocessing my data incorrectly and feeding it the wrong target. I have tested my input formatting of the data extensively and am pretty confident that I am providing the model with the correct target and input information.
In one instance, I increased the learning rate a tad so that the model converged to a local minimum. The test loss at this convergence was very similar to the loss with my preferred learning rate (still quite high). But the predictions were still of the "current value". Why is this so?
Here is how I created my model:
def create_model():
    lstm_model = Sequential()
    lstm_model.add(CuDNNLSTM(100, batch_input_shape=(batch_size, time_step, train_input.shape[2]),
                             stateful=True, return_sequences=True,
                             kernel_initializer='random_uniform'))
    lstm_model.add(Dropout(0.4))
    lstm_model.add(CuDNNLSTM(60))
    lstm_model.add(Dropout(0.4))
    lstm_model.add(Dense(20, activation='relu'))
    lstm_model.add(Dense(1, activation='linear'))
    optimizer = optimizers.Adagrad(lr=params["lr"])
    lstm_model.compile(loss='mean_squared_error', optimizer=optimizer)
    return lstm_model
This is how I am pre-processing the data. The first function, build_timeseries, constructs my input-output pairs. I believe this is working correctly (but please correct me if I am wrong). The second function trims the pairs to fit the batch size. I do the exact same for the test input/target.
train_input, train_target = build_timeseries(train_input, time_step, pred_horiz, 0)
train_input = trim_dataset(train_input, batch_size)
train_target = trim_dataset(train_target, batch_size)

def build_timeseries(mat, TIME_STEPS, PRED_HORIZON, y_col_index):
    # y_col_index is the index of column that would act as output column
    dim_0 = mat.shape[0]  # num datasets
    dim_1 = mat.shape[1]  # num features
    dim_2 = mat.shape[2]  # num datapoints

    # Reformatted matrix
    mat = mat.swapaxes(1, 2)

    x = np.zeros((dim_0*(dim_2-PRED_HORIZON), TIME_STEPS, dim_1))
    y = np.zeros((dim_0*(dim_2-PRED_HORIZON),))

    k = 0
    for i in range(dim_0):  # Iterate through datasets
        for j in range(TIME_STEPS, dim_2-PRED_HORIZON):
            x[k] = mat[i, j-TIME_STEPS:j]
            y[k] = mat[i, j+PRED_HORIZON, y_col_index]
            k += 1
    print("length of time-series i/o", x.shape, y.shape)
    return x, y

def trim_dataset(mat, batch_size):
    no_of_rows_drop = mat.shape[0] % batch_size
    if no_of_rows_drop > 0:
        return mat[no_of_rows_drop:]
    else:
        return mat
Lastly, this is how I call the actual model.
history = model.fit(train_input, train_target, epochs=params["epochs"], verbose=2,
                    batch_size=batch_size, shuffle=True,
                    validation_data=(test_input, test_target), callbacks=[es, mcp])
As the model converges, I expect it to predict values close to the specified targets I fed it. Instead, its predictions align much more closely with the trailing value of the time-series data (the current value). On the other hand, the model appears to be evaluating the loss according to the specified target... Why is it working this way, and how can I fix it? Any help is appreciated.
I am building a very simple DNN binary model which I define as:
def __build_model(self, vocabulary_size):
    model = Sequential()
    model.add(Embedding(vocabulary_size, 12, input_length=vocabulary_size))
    model.add(Flatten())
    model.add(Dense(16, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
    return model
with training like:
def __train_model(self, model, model_data, training_data, labels):
    hist = model.fit(training_data, labels, epochs=20, verbose=True, validation_split=0.2)
    model.save('models/' + model_data['Key'] + '.h5')
    return model
The idea is to feed tf-idf vectorized text after training and predict whether it belongs to class 1 or 0. Sadly, when I run predict against it, I get an array of predictions instead of the expected single probability of the article belonging to class 1. The array values seem very uniform. I assume this comes from some mistake in the model. I produce the prediction like so:
self._tokenizer.fit_on_texts(asset_article_data.content)
predicted_post_vector = self._tokenizer.texts_to_matrix(post, mode='tfidf')
return model.predict(predicted_post_vector) > 0.60  # here it returns an array instead of true/false
The training data is vectorized text itself. What might be off?
The mistake you are probably making is that post is a string, whereas it should be a list of strings. That's why, as you mentioned, model.predict() produces a lot of values: the tokenizer has iterated over the characters of post and produced a tf-idf vector for each of them! Just put it in a list and the problem should be resolved:
... = self._tokenizer.texts_to_matrix([post], ...)
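For example, a minimal sketch using the same tokenizer and model (the 0.60 threshold is just the one from the question):

predicted_post_vector = self._tokenizer.texts_to_matrix([post], mode='tfidf')  # shape: (1, vocabulary_size)
probability = model.predict(predicted_post_vector)[0][0]  # single probability for class 1
return probability > 0.60  # True / False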
There are two ways of solving your issue:
Use model.predict_classes, as Simon said, or use argmax:
np.argmax(model.predict(predicted_post_vector), axis=1)
I would personally use pd.get_dummies(y_train) on your target variable and adjust the output layer to Dense(2, activation='sigmoid').
Keras is built to predict outputs for more than one input at a time; that's why the output is an array. Refer to the Keras docs here (Returns Numpy array(s) of predictions). So if you need a single output, just select the first element of the array:
model.predict(predicted_post_vector)[0] > 0.60
I am trying to use LSTM neural networks to make a song composer. Basically this is based on a text generator (which tries to predict the next character after looking at a sequence of characters), but instead of characters, it tries to predict notes.
Structure of the MIDI file that serves as the input (the Y-axis is the pitch or note value, while the X-axis is time):
And these are the predicted note values:
I set the number of epochs to 50, but the LSTM's loss does not seem to decrease; most of the time it does not improve at all.
I suspect this is because there is an overwhelming number of a particular note (in this case, note value 65), which makes the LSTM lazy during the training phase and makes it predict 65 each and every time.
I feel like this is a common problem among LSTMs and time-series based learning algorithms. How would I solve a problem like this? If what I mentioned is not the problem, then what is the problem and how do I solve that?
Here is the code I am using to train if you need it:
import numpy
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils
seq_length = 100
read_path = '../matrices/input/world-is-mine/world-is-mine-y-0.npy'
raw_text = numpy.load(read_path)
# create mapping of unique chars to integers, and a reverse mapping
chars = sorted(list(set(raw_text)))
char_to_int = dict((c,i) for i,c in enumerate(chars))
n_chars = len(raw_text)
n_vocab = len(chars)
# prepare the dataset of input to output pairs encoded as integers
dataX = []
dataY = []
# dataX is the encoding version of the sequence
# dataY is an encoded version of the next prediction
for i in range(0, n_chars - seq_length, 1):
    seq_in = raw_text[i:i + seq_length]
    seq_out = raw_text[i+seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
n_patterns = len(dataX)
print "Total Patterns: ", n_patterns
# reshape X to be [samples, time steps, features]
X = numpy.reshape(dataX, (n_patterns, seq_length,1))
# normalize
X = X/float(n_vocab)
# one hot encode the output variable
y = np_utils.to_categorical(dataY)
print 'X: ', X.shape
print 'Y: ', y.shape
# define the LSTM model
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
#model.add(Dropout(0.05))
model.add(LSTM(256))
#model.add(Dropout(0.05))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
# There is no test dataset. We are modeling the entire training dataset to learn the probability of each character in a sequence.
# We are interested in a generalization of the dataset that minimizes the chosen loss function
# We are seeking a balance between generalization of the dataset and overfitting but short of memorization
# define the check point
filepath="../checkpoints/weights-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]
model.fit(X,y, nb_epoch=50, batch_size=64, callbacks=callbacks_list)
I have no experience working with music data. From my experience with text data, this seems like an under-fitted model. Increasing the training dataset with different note values should overcome the underfitting problem; it seems like there are not enough training examples for learning the note variation. For example, for a character-level language model, 1 MB of data is too small for training a reasonable LSTM model. Also, try training with a smaller sequence length (say, 20) first; a smaller sequence length is easier to learn than a longer one with limited training data.
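A minimal sketch of that last suggestion, applied to the preprocessing in the script above (only the window length changes; the downstream reshaping stays the same):

seq_length = 20  # shorter windows are easier to learn with limited data

dataX, dataY = [], []
for i in range(0, n_chars - seq_length, 1):
    dataX.append([char_to_int[ch] for ch in raw_text[i:i + seq_length]])
    dataY.append(char_to_int[raw_text[i + seq_length]])

X = numpy.reshape(dataX, (len(dataX), seq_length, 1)) / float(n_vocab)
y = np_utils.to_categorical(dataY)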