Character LSTM keeps generating same character sequence - python

I'm training a 2-layer character LSTM with Keras to generate sequences of characters similar to the corpus I am training on. However, the trained LSTM generates the same sequence over and over again.
I've seen suggestions for similar problems to increase the LSTM input sequence length, increase the batch size, add dropout layers, and increase the dropout amount. I've tried all these things and none of them seem to have fixed the issue. The one thing that has yielded some success is adding a random noise vector to each vector outputted by the LSTM during generation. This makes sense since the LSTM uses the previous step's output to generate the next output. However, generally if I add enough noise to break the LSTM out of its repetitive generation, the quality of the output degrades a great deal.
My LSTM training code is as follows:
# [load data from file]
import sys
import numpy
from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils
from sklearn.model_selection import train_test_split

raw_text = collected_statements.lower()

# create mapping of unique chars to integers
chars = sorted(list(set(raw_text + '\b')))
char_to_int = dict((c, i) for i, c in enumerate(chars))
n_chars = len(raw_text)
n_vocab = len(chars)

seq_length = 100
dataX = []
dataY = []
for i in range(0, n_chars - seq_length, 1):
    seq_in = raw_text[i:i + seq_length]
    seq_out = raw_text[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
n_patterns = len(dataX)

# reshape X to be [samples, time steps, features]
X = numpy.reshape(dataX, (n_patterns, seq_length, 1))
# normalize
X = X / float(n_vocab)
# one hot encode the output variable
y = np_utils.to_categorical(dataY)

# define the LSTM model
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2]),
               return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(256))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

# define the checkpoint
filepath = "weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1,
                             save_best_only=True, mode='min')
callbacks_list = [checkpoint]

# fix random seed for reproducibility
seed = 8
numpy.random.seed(seed)

# split into 80% for train and 20% for test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=seed)

# train the model
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=18,
          batch_size=256, callbacks=callbacks_list)
My generation code is as follows:
filename = "weights-improvement-18-1.5283.hdf5"
model.load_weights(filename)
model.compile(loss='categorical_crossentropy', optimizer='adam')
int_to_char = dict((i, c) for i, c in enumerate(chars))
# pick a random seed
start = numpy.random.randint(0, len(dataX)-1)
pattern = unpadded_patterns[start]
print("Seed:")
print("\"", ''.join([int_to_char[value] for value in pattern]), "\"")
# generate characters
for i in range(1000):
x = numpy.reshape(pattern, (1, len(pattern), 1))
x = (x / float(n_vocab)) + (numpy.random.rand(1, len(pattern), 1) * 0.01)
prediction = model.predict(x, verbose=0)
index = numpy.argmax(prediction)
#print(index)
result = int_to_char[index]
seq_in = [int_to_char[value] for value in pattern]
sys.stdout.write(result)
pattern.append(index)
pattern = pattern[1:len(pattern)]
print("\nDone.")
When I run the generation code, I get the same sequence over and over again:
we have the best economy in the history of our country." "we have the best
economy in the history of our country." "we have the best economy in the
history of our country." "we have the best economy in the history of our
country." "we have the best economy in the history of our country." "we
have the best economy in the history of our country." "we have the best
economy in the history of our country." "we have the best economy in the
history of our country." "we have the best economy in the history of our
country."
Is there anything else I could try that could help to generate something besides the same sequence over and over?

In your character generation, I would suggest sampling from the probabilities your model outputs instead of taking the argmax directly. This is what the Keras char-rnn example does to get diversity.
This is the code they use for sampling in their example:
import numpy as np

def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
In your code you've got index = numpy.argmax(prediction). I'd suggest replacing that with index = sample(prediction[0]) (the [0] drops the batch dimension that model.predict returns) and experimenting with temperatures of your choice. Keep in mind that higher temperatures make your output more random and lower temperatures make it less random.
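For concreteness, a minimal sketch of how your generation loop could look with sampling (the 0.8 temperature is just an illustrative value; I've dropped the noise term from your original loop since sampling already adds randomness):

for i in range(1000):
    x = numpy.reshape(pattern, (1, len(pattern), 1))
    x = x / float(n_vocab)
    prediction = model.predict(x, verbose=0)
    # sample instead of argmax; prediction[0] is the probability vector
    index = sample(prediction[0], temperature=0.8)
    sys.stdout.write(int_to_char[index])
    pattern.append(index)
    pattern = pattern[1:len(pattern)]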

What the model generates as its output is the probability of the next character given the previous characters. In the text generation process you just take the character with the maximum probability. Instead, it might help to inject some stochasticity (i.e. randomness) into this process by sampling the next character from the probability distribution generated by the model. One easy way to do this is to use the np.random.choice function:
# get the probability distribution generated by the model
prediction = model.predict(x, verbose=0)
# sample the next character based on the predicted probabilites
idx = np.random.choice(y.shape[1], 1, p=prediction[0])[0]
# the rest is the same...
This way the next selected character is not always the most probable one. Rather, all the characters have a chance to be selected, guided by the probability distribution generated by your model. This stochasticity not only breaks the repetitive loop, but may also result in some interesting generated text.
Additionally, you can inject further stochasticity by introducing a softmax temperature into the sampling process, as shown in Primusa's answer, which is based on the Keras char-rnn example. Basically, the idea is to re-weight the probability distribution so that you can control how surprising (i.e. higher temperature/entropy) or predictable (i.e. lower temperature/entropy) the next selected character will be; see the sketch below.
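A minimal sketch of combining the two ideas, i.e. re-weighting the predicted distribution with a temperature before drawing from it with np.random.choice (the 0.7 temperature and the helper name are just illustrative):

import numpy as np

def sample_with_temperature(probs, temperature=1.0):
    # re-weight the distribution: lower temperature -> more peaked/predictable,
    # higher temperature -> flatter/more surprising
    logits = np.log(np.asarray(probs).astype('float64') + 1e-8) / temperature
    reweighted = np.exp(logits) / np.sum(np.exp(logits))
    return np.random.choice(len(reweighted), p=reweighted)

prediction = model.predict(x, verbose=0)
idx = sample_with_temperature(prediction[0], temperature=0.7)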

Related

Multi-step LSTM Time series prediction

I am trying to build an LSTM model for multi-step prediction. My data is a time series of parking occupancy rates sampled every five minutes (I have 25 weeks of samples). I started creating the code as below:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM

training_data_len = int(np.ceil(len(data) * .90))
train_data = data.iloc[0:int(training_data_len), :]
print(len(train_data))
# Create the testing data set
test_data = data.iloc[training_data_len:, :]  # - timestep
print(len(test_data))
data_train = np.array(train_data)

def split_sequence(sequence, n_steps_in, n_steps_out):
    X, y = list(), list()
    for i in range(len(sequence)):
        # find the end of this pattern
        end_ix = i + n_steps_in
        out_end_ix = end_ix + n_steps_out
        # check if we are beyond the sequence
        if out_end_ix > len(sequence):
            break
        # gather input and output parts of the pattern
        seq_x, seq_y = sequence[i:end_ix], sequence[end_ix:out_end_ix]
        X.append(seq_x)
        y.append(seq_y)
    return np.array(X), np.array(y)

X_train, y_train = split_sequence(data_train, 6, 6)

reg = Sequential()
reg.add(LSTM(units=200, return_sequences=True, input_shape=(1, 1)))  # , activation='relu'
reg.add(Dropout(0.2))
reg.add(LSTM(units=200, return_sequences=True))  # , activation='relu'
reg.add(Dropout(0.2))
reg.add(LSTM(units=200, return_sequences=True))  # , activation='relu'
reg.add(Dropout(0.2))
reg.add(Dense(6,))
# here we have considered loss as mean square error and optimizer as adam
reg.compile(loss='mse', optimizer='adam')
# training the model
# ,validation_split=0.1,
# shuffle=False
reg.fit(X_train, y_train, epochs=10, verbose=1)

data_test = np.array(test_data)
# here we are splitting the data week-wise (7 days)
X_test, y_test = split_sequence(data_test, 6, 6)
y_pred = reg.predict(X_test)
My goal is to predict the next 30 minutes (6 samples = 30 minutes) using the previous 30 minutes (6 samples = 30 minutes).
I'm new to these kinds of models and I want to know whether I'm doing this correctly or whether there is something I'm missing or could improve.
Thank you
Question: Is there an issue with my approach?
Usually you would want to try out multiple models and multiple hyper-parameters. Even if it's a toy project, you should at least try out multiple models. Make sure you understand how each model works before setting its parameters.
You may want to have more data coming in than going out, e.g. use one hour of history to predict the next 10 minutes (see the sketch below).
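With the split_sequence helper from your code, that could look something like this (a sketch; 12 and 2 assume 5-minute samples, i.e. one hour in and 10 minutes out):

# 12 input steps (1 h) predicting 2 output steps (10 min)
X_train, y_train = split_sequence(data_train, n_steps_in=12, n_steps_out=2)
X_test, y_test = split_sequence(data_test, n_steps_in=12, n_steps_out=2)

# the final Dense layer should then output 2 values instead of 6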
You may want to do some data analysis before running any code, to get some insight into what might work. Make it visual; create plots, e.g. a PCA projection (which may not work well with time series).
Talking about models: you could replace your LSTM with a Transformer. It's a newer architecture that can retain information over longer ranges and has largely replaced LSTMs for many sequence tasks (a rough sketch is below).
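A minimal sketch of a small Transformer-style encoder for this regression setup, assuming a reasonably recent TensorFlow/Keras that provides layers.MultiHeadAttention; all layer sizes are illustrative, and a real model would also add positional encodings:

import tensorflow as tf
from tensorflow.keras import layers

n_steps_in, n_features, n_steps_out = 6, 1, 6

inputs = layers.Input(shape=(n_steps_in, n_features))
x = layers.Dense(32)(inputs)                           # project into a model dimension
attn = layers.MultiHeadAttention(num_heads=2, key_dim=16)(x, x)
x = layers.LayerNormalization()(x + attn)              # residual connection + norm
ff = layers.Dense(64, activation="relu")(x)
ff = layers.Dense(32)(ff)
x = layers.LayerNormalization()(x + ff)
x = layers.GlobalAveragePooling1D()(x)                 # collapse the time dimension
outputs = layers.Dense(n_steps_out)(x)

transformer = tf.keras.Model(inputs, outputs)
transformer.compile(loss="mse", optimizer="adam")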
If you have questions about data science or machine learning you should try the datascience.StackExchange instead of StackOverflow. Here we are supposed to help with quick, snappy responses about code. ;)

Regression with LSTM - python and Keras

I am trying to use an LSTM network in Keras to make predictions of time-series data one step into the future. The data has 5 dimensions, and I am trying to use the previous 3 periods of readings to predict a value in the next period. I have normalised the data and removed all NaNs etc., and this is the code I am trying to use to train the network:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, LSTM

def Network_ii(IN, OUT, TIME_PERIOD, EPOCHS, BATCH_SIZE, LTSM_SHAPE):
    length = len(OUT)
    train_x = IN[:int(0.9 * length)]
    validation_x = IN[int(0.9 * length):]
    train_y = OUT[:int(0.9 * length)]
    validation_y = OUT[int(0.9 * length):]

    # Define network & callback:
    train_x = train_x.reshape(train_x.shape[0], 3, 5)
    validation_x = validation_x.reshape(validation_x.shape[0], 3, 5)

    model = Sequential()
    model.add(LSTM(units=128, return_sequences=True, input_shape=(train_x.shape[1], 3)))
    model.add(LSTM(units=128))
    model.add(Dense(units=1))
    model.compile(optimizer='adam', loss='mean_squared_error')

    train_y = np.asarray(train_y)
    validation_y = np.asarray(validation_y)

    history = model.fit(train_x, train_y, batch_size=BATCH_SIZE, epochs=EPOCHS,
                        validation_data=(validation_x, validation_y))

    # Score model
    score = model.evaluate(validation_x, validation_y, verbose=0)
    print('Test loss:', score)

    # Save model
    model.save(f"models/new_model")
I am attempting to roughly follow the steps outlined here- https://machinelearningmastery.com/multivariate-time-series-forecasting-lstms-keras/
However, no matter what adjustments I make, whether to the number of dimensions used to train the network or to the length of the time period, I cannot get the model to output predictions other than 1 or 0, even though the target data in the array 'OUT' is continuous on [0, 1].
I think there may be something wrong with how I am setting up the Sequential() model, but I cannot see what to adjust. I am relatively new to this, so any help would be greatly appreciated.
You are probably using a prediction function other than the standard one. Maybe you are using predict_classes?
The standard, well-documented method is model.predict; see the sketch below.
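A minimal sketch of the difference, assuming validation_x as in the function above (predict_classes existed on older Keras Sequential models and thresholds/argmaxes the output, which would explain seeing only 0s and 1s):

# continuous outputs on [0, 1], which is what you want for regression
preds = model.predict(validation_x)
print(preds[:5])

# predict_classes (older Keras Sequential API) rounds to class labels,
# so it would only ever return 0 or 1 here:
# class_preds = model.predict_classes(validation_x)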

Keras network producing inverse predictions

I have a timeseries dataset and I am trying to train a network so that it overfits (obviously, that's just the first step, I will then battle the overfitting).
The network has two layers:
LSTM (32 neurons) and Dense (1 neuron, no activation)
Training/model has these parameters:
epochs: 20, steps_per_epoch: 100, loss: "mse", optimizer: "rmsprop".
TimeseriesGenerator produces the input series with: length: 1, sampling_rate: 1, batch_size: 1.
I would expect the network to simply memorize such a small dataset (I have even tried much more complicated networks, to no avail) and the loss on the training dataset to be pretty much zero. It is not, and when I visualize the results on the training set like this:
y_pred = model.predict_generator(gen)
plot_points = 40
epochs = range(1, plot_points + 1)
pred_points = numpy.resize(y_pred[:plot_points], (plot_points,))
target_points = gen.targets[:plot_points]
plt.plot(epochs, pred_points, 'b', label='Predictions')
plt.plot(epochs, target_points, 'r', label='Targets')
plt.legend()
plt.show()
I get the following (plot omitted): the predictions have a somewhat smaller amplitude but are precisely inverse to the targets. By the way, this is not memorization; they are inverted even for the test dataset, which the algorithm hasn't trained on at all. It appears that instead of memorizing the dataset, my network just learned to negate the input value and slightly scale it down. Any idea why this is happening? It doesn't seem like the solution the optimizer should have converged to (the loss is pretty big).
EDIT (some relevant parts of my code):
train_gen = keras.preprocessing.sequence.TimeseriesGenerator(
    x,
    y,
    length=1,
    sampling_rate=1,
    batch_size=1,
    shuffle=False
)

model = Sequential()
model.add(LSTM(32, input_shape=(1, 1), return_sequences=False))
model.add(Dense(1, input_shape=(1, 1)))

model.compile(
    loss="mse",
    optimizer="rmsprop",
    metrics=[keras.metrics.mean_squared_error]
)

history = model.fit_generator(
    train_gen,
    epochs=20,
    steps_per_epoch=100
)
EDIT (different, randomly generated dataset):
I had to increase the number of LSTM neurons to 256; with the previous setting (32 neurons), the blue line was pretty much flat. However, with the increase the same pattern arises: inverse predictions with a somewhat smaller amplitude.
EDIT (targets shifted by +1):
Shifting the targets by one relative to the predictions doesn't produce a much better fit. In the (omitted) plot, the highlighted parts where the graph isn't just alternating make this most apparent.
EDIT (increased length to 2 ... TimeseriesGenerator(length=2, ...)):
With length=2 the predictions stop tracking the targets as closely, but the overall pattern of inversion still stands.
You say that your network "just learned to negate the input value and slightly scale it down". I don't think so. It is very likely that all you are seeing is the network performing poorly and just predicting the previous value (but scaled, as you say). This issue is something I've seen again and again. Here is another example, and another, of this issue. Also, remember that it is very easy to fool yourself by shifting the data by one: it is very likely you are simply shifting the poor prediction back in time and getting an overlap. One way to check is sketched below.
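A quick way to check this claim (my own sketch, not from the original answer): compare the model's error against a naive persistence baseline that always repeats the previous target value. If the two are close, the model has not really learned anything beyond "repeat the last value":

import numpy as np

y_pred = model.predict_generator(gen).ravel()
# targets as the generator pairs them with the inputs (length=1, batch_size=1)
y_true = np.asarray(gen.targets[1:len(y_pred) + 1]).ravel()
# persistence baseline: predict the previous target value
y_naive = np.asarray(gen.targets[:len(y_pred)]).ravel()

print("model MSE:      ", np.mean((y_pred - y_true) ** 2))
print("persistence MSE:", np.mean((y_naive - y_true) ** 2))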
EDIT: After author's comments I do not believe this is the correct answer but I will keep it posted for posterity.
Great question, and the answer is due to how the TimeseriesGenerator works! Apparently, instead of grabbing x, y pairs with the same index (e.g. input x[0] to output target y[0]), it grabs the target with an offset of 1 (so x[0] goes with y[1]).
Thus, plotting y with an offset of 1 will produce the desired fit.
Code to simulate:
import numpy as np
import keras
import matplotlib.pyplot as plt

x = np.random.uniform(0, 10, size=41).reshape(-1, 1)
x[::2] *= -1
y = x[1:]
x = x[:-1]

train_gen = keras.preprocessing.sequence.TimeseriesGenerator(
    x,
    y,
    length=1,
    sampling_rate=1,
    batch_size=1,
    shuffle=False
)

model = keras.models.Sequential()
model.add(keras.layers.LSTM(100, input_shape=(1, 1), return_sequences=False))
model.add(keras.layers.Dense(1))

model.compile(
    loss="mse",
    optimizer="rmsprop",
    metrics=[keras.metrics.mean_squared_error]
)
model.optimizer.lr /= .1

history = model.fit_generator(
    train_gen,
    epochs=20,
    steps_per_epoch=100
)
Proper plotting:
y_pred = model.predict_generator(train_gen)
plot_points = 39
epochs = range(1, plot_points + 1)
pred_points = np.resize(y_pred[:plot_points], (plot_points,))
target_points = train_gen.targets[1:plot_points+1] #NOTICE DIFFERENT INDEXING HERE
plt.plot(epochs, pred_points, 'b', label='Predictions')
plt.plot(epochs, target_points, 'r', label='Targets')
plt.legend()
plt.show()
Output (plot omitted): notice how the fit is no longer inverted and is mostly very accurate. With the incorrect offset, the plot shows the same inverted pattern described in the question.

Creating a neural network in keras to multiply two input integers

I am playing around with Keras v2.0.8 in Python v2.7 (TensorFlow backend) to create small neural networks that calculate simple arithmetic functions (add, subtract, multiply, etc.), and am a bit confused. The code below is my network, which generates a random training dataset of integers with the corresponding labels (the two inputs added together):
import numpy as np
from math import sqrt
from keras.models import Sequential
from keras.layers import Dense

def create_data(low, high, examples):
    train_data = []
    label_data = []
    a = np.random.randint(low=low, high=high, size=examples, dtype='int')
    b = np.random.randint(low=low, high=high, size=examples, dtype='int')
    for i in range(0, examples):
        train_data.append([a[i], b[i]])
        label_data.append((a[i] + b[i]))
    train_data = np.array(train_data)
    label_data = np.array(label_data)
    return train_data, label_data

X, y = create_data(0, 500, 10000)

model = Sequential()
model.add(Dense(3, input_dim=2))
model.add(Dense(5, activation='relu'))
model.add(Dense(3, activation='relu'))
model.add(Dense(5, activation='relu'))
model.add(Dense(1, activation='relu'))
model.compile(optimizer='adam', loss='mean_squared_error', metrics=['accuracy'])
model.fit(X, y, epochs=10, batch_size=10)

test_data, _ = create_data(0, 500, 10)
results = model.predict(test_data, batch_size=2)

sq_error = []
for i in range(0, len(test_data)):
    print 'test value:', test_data[i], 'result:', results[i][0], 'error:',\
        '%.2f' % (results[i][0] - (test_data[i][0] + test_data[i][1]))
    sq_error.append((results[i][0] - (test_data[i][0] + test_data[i][1])))
print '\n total rmse error: ', sqrt(np.sum(np.array(sq_error)))
This trains perfectly well and produces no unexpected results. However, when I create the training data by multiplying the two inputs together, the model's loss stays around 7,000,000,000 for every epoch and the model does not converge at all. The data creation function for this is as follows:
def create_data(low, high, examples):
    train_data = []
    label_data = []
    a = np.random.randint(low=low, high=high, size=examples, dtype='int')
    b = np.random.randint(low=low, high=high, size=examples, dtype='int')
    for i in range(0, examples):
        train_data.append([a[i], b[i]])
        label_data.append((a[i] * b[i]))
    train_data = np.array(train_data)
    label_data = np.array(label_data)
    return train_data, label_data
I also had the same problem when I had training data of a single input integer and created the label by squaring the input data. However, it worked fine when I only multiplied the single input by a constant value or added/subtracted by a constant.
I have two questions:
1) Why is this the case? I assume it has something to do with the fundamentals of neural networks, but I can't work it out.
2) How could I adapt this code to train a model that multiplies two input numbers together?
The network architecture (2 - 3 - 5 - 3 - 5 - 1) is fairly random right now. I've tried lots of different ones varying in layers and neurons, this one just happened to be on my screen as I write this and got an accuracy of 100% for adding two inputs.
It is due to large gradient updates caused by the large numbers in the training data. When using a neural network, you should first ensure that the training data falls within a small range (usually [-1, 1] or [0, 1]) to help the optimization process and prevent disruptive gradient updates. Therefore, you should normalize the data first. In this case, one good candidate would be log-normalization; see the sketch below.
Further, 'accuracy' as a metric in Keras is meant for classification problems. In a regression problem it does not make sense; instead, use a relevant metric such as mean absolute error ('mae').
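A minimal sketch of what that could look like for the multiplication case (my own illustration, not part of the original answer): scale the inputs, train on log-transformed targets with 'mae' as the metric, and exponentiate the predictions to get back to the original scale. It assumes the multiplication version of create_data and starts the range at 1 so the log is defined:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

X, y = create_data(1, 500, 10000)   # multiplication labels; low=1 so log(a * b) is defined
X_scaled = X / 500.0                # inputs roughly in [0, 1]
y_log = np.log(y)                   # log-normalized targets

model = Sequential()
model.add(Dense(16, input_dim=2, activation='relu'))
model.add(Dense(16, activation='relu'))
model.add(Dense(1))                 # linear output for regression
model.compile(optimizer='adam', loss='mean_squared_error', metrics=['mae'])
model.fit(X_scaled, y_log, epochs=10, batch_size=32)

# predictions come back in log space, so undo the transform
preds = np.exp(model.predict(X_scaled[:5]))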

How to deal with situation where LSTM fails to learn (constantly makes the same incorrect prediction)

I am trying to use LSTM neural networks to make a song composer. Basically this is based on a text generator (which tries to predict the next character after looking at a sequence of characters), but instead of characters, it tries to predict notes.
The structure of the MIDI file that serves as the input is a plot of pitch/note value (Y-axis) against time (X-axis), and the predicted note values were plotted the same way (both plots omitted here).
I trained for 50 epochs, but the LSTM's loss barely decreases; most of the time it does not improve at all.
I suspect this is because there is an overwhelming number of a particular note (in this case, note value 65), which makes the LSTM lazy during the training phase and predict 65 each and every time.
I feel like this is a common problem among LSTMs and time-series-based learning algorithms. How would I solve a problem like this? If what I mentioned is not the problem, then what is the problem and how do I solve it? (A quick way to check the note distribution is sketched below.)
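To verify whether one note really dominates, a quick check (a sketch; raw_text here is the 1-D note array loaded in the training code below):

from collections import Counter

note_counts = Counter(raw_text.tolist())
total = float(len(raw_text))
# print the five most common notes and their share of the dataset
for note, count in note_counts.most_common(5):
    print note, count, '%.1f%%' % (100 * count / total)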
Here is the code I am using to train if you need it:
import numpy
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils
seq_length = 100
read_path = '../matrices/input/world-is-mine/world-is-mine-y-0.npy'
raw_text = numpy.load(read_path)
# create mapping of unique chars to integers, and a reverse mapping
chars = sorted(list(set(raw_text)))
char_to_int = dict((c,i) for i,c in enumerate(chars))
n_chars = len(raw_text)
n_vocab = len(chars)
# prepare the dataset of input to output pairs encoded as integers
dataX = []
dataY = []
# dataX is the encoding version of the sequence
# dataY is an encoded version of the next prediction
for i in range(0, n_chars - seq_length, 1):
    seq_in = raw_text[i:i + seq_length]
    seq_out = raw_text[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
n_patterns = len(dataX)
print "Total Patterns: ", n_patterns
# reshape X to be [samples, time steps, features]
X = numpy.reshape(dataX, (n_patterns, seq_length,1))
# normalize
X = X/float(n_vocab)
# one hot encode the output variable
y = np_utils.to_categorical(dataY)
print 'X: ', X.shape
print 'Y: ', y.shape
# define the LSTM model
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
#model.add(Dropout(0.05))
model.add(LSTM(256))
#model.add(Dropout(0.05))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
# There is no test dataset. We are modeling the entire training dataset to learn the probability of each character in a sequence.
# We are interested in a generalization of the dataset that minimizes the chosen loss function
# We are seeking a balance between generalization of the dataset and overfitting but short of memorization
# define the check point
filepath="../checkpoints/weights-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]
model.fit(X,y, nb_epoch=50, batch_size=64, callbacks=callbacks_list)
I have no experience working with music data, but from my experience with text data, this looks like an under-fitted model. Increasing the training dataset with different note values should overcome the underfitting; it seems the training examples are not enough to learn the note variation. For example, for a character language model, 1 MB of data is too small to train a reasonable LSTM model. Also, try training with a smaller sequence length (say, 20) first: shorter sequences are easier to learn than longer ones when training data is limited.
