Currently I am training my Word2Vec + CNN for Twitter sentiment analysis in the COVID-19 vaccine domain. I used the pre-trained GoogleNewsVectorNegative300 word embedding. The problem is that I overfit heavily during training. The reason I used the pre-trained GoogleNewsVectorNegative300 is that the performance was much worse when I trained my own Word2Vec on my own dataset. Here are the several processing steps I performed before fitting the model:
Text pre-processing (a rough sketch of these steps follows the list):
Lower casing
Remove hashtags, mentions, URLs, numbers, change words to numbers, non-ASCII characters, retweets "RT"
Expand contractions
Replace negations with antonyms
Remove punctuation
Remove stopwords
Lemmatization
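For reference only (this is not the code from the question), the listed steps might look roughly like the sketch below; the contraction expansion and the negation-to-antonym replacement are omitted, and the nltk corpora (stopwords, wordnet) are assumed to be downloaded:

import re
import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(tweet):
    text = tweet.lower()                                      # lower casing
    text = re.sub(r'\brt\b', ' ', text)                       # retweet marker "RT"
    text = re.sub(r'https?://\S+|www\.\S+', ' ', text)        # URLs
    text = re.sub(r'[@#]\w+', ' ', text)                      # mentions and hashtags
    text = re.sub(r'[^\x00-\x7f]', ' ', text)                 # non-ASCII characters
    text = re.sub(r'\d+', ' ', text)                          # numbers
    text = text.translate(str.maketrans('', '', string.punctuation))  # punctuation
    tokens = [t for t in text.split() if t not in stop_words]         # stopwords
    return ' '.join(lemmatizer.lemmatize(t) for t in tokens)          # lemmatization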
I split my dataset into 90:10 for train:test as follows:
def split_data(X, y):
    X_train, X_test, y_train, y_test = train_test_split(X,
                                                         y,
                                                         train_size=0.9,
                                                         test_size=0.1,
                                                         stratify=y,
                                                         random_state=0)
    return X_train, X_test, y_train, y_test
The split leaves 2060 training samples: 708 in the positive sentiment class, 837 in the negative class, and 515 in the neutral class.
Then I applied text augmentation, namely EDA (Easy Data Augmentation), to all of the training data as follows:
class TextAugmentation:
    def __init__(self):
        self.augmenter = EDA()

    def replace_synonym(self, text):
        augmented_text_portion = int(len(text) * 0.1)
        synonym_replaced = self.augmenter.synonym_replacement(text, n=augmented_text_portion)
        return synonym_replaced

    def random_insert(self, text):
        augmented_text_portion = int(len(text) * 0.1)
        random_inserted = self.augmenter.random_insertion(text, n=augmented_text_portion)
        return random_inserted

    def random_swap(self, text):
        augmented_text_portion = int(len(text) * 0.1)
        random_swaped = self.augmenter.random_swap(text, n=augmented_text_portion)
        return random_swaped

    def random_delete(self, text):
        random_deleted = self.augmenter.random_deletion(text, p=0.5)
        return random_deleted


text_augmentation = TextAugmentation()
The data augmentation leaves the training set with 10300 samples: 3540 positive, 4185 negative, and 2575 neutral. (The augmentation call itself is not shown; a sketch follows.)
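A hypothetical reconstruction of that step, assuming one augmented copy per EDA operation (which matches the 5x increase, 2060 * 5 = 10300) and the dataframe names used below:

import pandas as pd

augmented_rows = []
for _, row in df_pfizer_train.iterrows():
    for augment in (text_augmentation.replace_synonym,
                    text_augmentation.random_insert,
                    text_augmentation.random_swap,
                    text_augmentation.random_delete):
        # keep the original label, store the augmented text
        augmented_rows.append({'text': augment(row['text']),
                               'sentiment': row['sentiment']})

df_pfizer_train = pd.concat([df_pfizer_train[['text', 'sentiment']],
                             pd.DataFrame(augmented_rows)],
                            ignore_index=True)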
Then, I tokenized the sequence as follows:
# Tokenize the sequence
pfizer_tokenizer = Tokenizer(oov_token='OOV')
pfizer_tokenizer.fit_on_texts(df_pfizer_train['text'].values)
X_pfizer_train_tokenized = pfizer_tokenizer.texts_to_sequences(df_pfizer_train['text'].values)
X_pfizer_test_tokenized = pfizer_tokenizer.texts_to_sequences(df_pfizer_test['text'].values)
# Pad the sequence
X_pfizer_train_padded = pad_sequences(X_pfizer_train_tokenized, maxlen=100)
X_pfizer_test_padded = pad_sequences(X_pfizer_test_tokenized, maxlen=100)
pfizer_max_length = 100
pfizer_num_words = len(pfizer_tokenizer.word_index) + 1
# Encode label
y_pfizer_train_encoded = df_pfizer_train['sentiment'].factorize()[0]
y_pfizer_test_encoded = df_pfizer_test['sentiment'].factorize()[0]
y_pfizer_train_category = to_categorical(y_pfizer_train_encoded)
y_pfizer_test_category = to_categorical(y_pfizer_test_encoded)
This results in 8869 unique words and a maximum sequence length of 100.
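The embedding matrix fed into the model in the next step is not shown in the question; a sketch of how it is typically built from the pretrained GoogleNews vectors and the fitted tokenizer (the variable name pfizer_embedding_matrix is an assumption):

import numpy as np
from gensim.models import KeyedVectors

# Sketch only: map each tokenizer index to its pretrained 300-d vector.
word_vectors = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin',
                                                 binary=True)
embedding_dim = 300
pfizer_embedding_matrix = np.zeros((pfizer_num_words, embedding_dim))
for word, idx in pfizer_tokenizer.word_index.items():
    if word in word_vectors:
        pfizer_embedding_matrix[idx] = word_vectors[word]
    # words missing from GoogleNews (including the OOV token) stay as all-zero rows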
Finally, I fit the data into my model using the pre-trained GoogleNewsVectorNegative300 word embedding and the CNN, again splitting off 10% of the training data for validation, as follows:
# Build single CNN model
def build_cnn_model(embedding_matrix, max_sequence_length):
    # Input layer
    input_layer = Input(shape=(max_sequence_length,))

    # Word embedding layer
    embedding_layer = Embedding(input_dim=embedding_matrix.shape[0],
                                output_dim=embedding_matrix.shape[1],
                                weights=[embedding_matrix],
                                input_length=max_sequence_length,
                                trainable=True)(input_layer)

    # CNN model layers
    cnn_layer = Conv1D(filters=256,
                       kernel_size=2,
                       strides=1,
                       padding='valid',
                       activation='relu')(embedding_layer)
    cnn_layer = MaxPooling1D(pool_size=2)(cnn_layer)
    cnn_layer = Dropout(rate=0.5)(cnn_layer)
    batch_norm_layer = BatchNormalization()(cnn_layer)

    cnn_layer = Conv1D(filters=256,
                       kernel_size=2,
                       strides=1,
                       padding='valid',
                       activation='relu')(batch_norm_layer)
    cnn_layer = MaxPooling1D(pool_size=2)(cnn_layer)
    cnn_layer = Dropout(rate=0.5)(cnn_layer)
    batch_norm_layer = BatchNormalization()(cnn_layer)

    cnn_layer = Conv1D(filters=256,
                       kernel_size=2,
                       strides=1,
                       padding='valid',
                       activation='relu')(batch_norm_layer)
    cnn_layer = MaxPooling1D(pool_size=2)(cnn_layer)
    cnn_layer = Dropout(rate=0.5)(cnn_layer)
    batch_norm_layer = BatchNormalization()(cnn_layer)

    flatten = Flatten()(batch_norm_layer)

    # Dense model layers
    dense_layer = Dense(units=10, activation='relu')(flatten)
    batch_norm_layer = BatchNormalization()(dense_layer)
    output_layer = Dense(units=3, activation='softmax')(batch_norm_layer)

    cnn_model = Model(inputs=input_layer, outputs=output_layer)
    return cnn_model
sinovac_cnn_history = sinovac_cnn_model.fit(x=X_sinovac_train,
                                            y=y_sinovac_train,
                                            batch_size=128,
                                            epochs=100,
                                            validation_split=0.1,
                                            verbose=1)
The training result:
I really need some suggestions or insights, because so far nothing I have tried has improved the model's performance.
That's quite a complex problem. It certainly looks like overfitting, as you said yourself, meaning the model can't generalize well from your training set to new data.
Intuitively, I would suggest cycling through hyperparameters (epochs, batch size, learning rate, dropout rates), if you haven't already, to look for a better combination. I would also suggest using cross-validation to get a better idea of the performance of your classifier. This would also shuffle the training data and keep the model from learning it by heart. A rough sketch of what that could look like is shown below.
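A minimal sketch of stratified k-fold cross-validation reusing the names from the question (build_cnn_model, pfizer_max_length, X_pfizer_train_padded, y_pfizer_train_encoded/category); PFIZER_EMBEDDING_MATRIX is an assumed name for the embedding matrix:

import numpy as np
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_accuracies = []
for train_idx, val_idx in skf.split(X_pfizer_train_padded, y_pfizer_train_encoded):
    # rebuild and recompile a fresh model for every fold
    model = build_cnn_model(PFIZER_EMBEDDING_MATRIX, pfizer_max_length)
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.fit(X_pfizer_train_padded[train_idx], y_pfizer_train_category[train_idx],
              validation_data=(X_pfizer_train_padded[val_idx], y_pfizer_train_category[val_idx]),
              epochs=20, batch_size=128, verbose=0)
    _, acc = model.evaluate(X_pfizer_train_padded[val_idx],
                            y_pfizer_train_category[val_idx], verbose=0)
    fold_accuracies.append(acc)
print('mean validation accuracy:', np.mean(fold_accuracies))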
Have you tried classifying the original data without the data augmentation? It's not a lot of data, but it could be enough to see if the performance on the test set is better than the final version, and thus see whether the data augmentation might be screwing something up in your data.
Did you try another embedding? I don't really think this is the problem, but in the search for the error I would probably switch it to see what happens.
Last but not least, do you know for a fact that this model structure can handle this task? Meaning, did you find a working example somewhere? It sure sounds like it could, but there is a chance that the CNN, for example, just doesn't generalize well over the embeddings. Have you considered using a model specialized for text classification, like a Transformer or an LSTM?
Related
Currently I am training my Word2Vec + LSTM for Twitter sentiment analysis. I use the pre-trained GoogleNewsVectorNegative300 word embedding. The reason I used the pre-trained GoogleNewsVectorNegative300 is that the performance was much worse when I trained my own Word2Vec on my own dataset. The problem is that my training process has validation accuracy and loss stuck at 0.88 and 0.34 respectively. My confusion matrix also seems wrong. Here are the several processing steps I performed before fitting the model:
Text pre-processing:
Lower casing
Remove hashtags, mentions, URLs, numbers, change words to numbers, non-ASCII characters, retweets "RT"
Expand contractions
Replace negations with antonyms
Remove punctuation
Remove stopwords
Lemmatization
I split my dataset into 90:10 for train:test as follows:
def split_data(X, y):
    X_train, X_test, y_train, y_test = train_test_split(X,
                                                         y,
                                                         train_size=0.9,
                                                         test_size=0.1,
                                                         stratify=y,
                                                         random_state=0)
    return X_train, X_test, y_train, y_test
The split leaves 2060 training samples: 708 in the positive sentiment class, 837 in the negative class, and 515 in the neutral class.
Then I applied text augmentation, namely EDA (Easy Data Augmentation), to all of the training data as follows:
class TextAugmentation:
    def __init__(self):
        self.augmenter = EDA()

    def replace_synonym(self, text):
        augmented_text_portion = int(len(text) * 0.1)
        synonym_replaced = self.augmenter.synonym_replacement(text, n=augmented_text_portion)
        return synonym_replaced

    def random_insert(self, text):
        augmented_text_portion = int(len(text) * 0.1)
        random_inserted = self.augmenter.random_insertion(text, n=augmented_text_portion)
        return random_inserted

    def random_swap(self, text):
        augmented_text_portion = int(len(text) * 0.1)
        random_swaped = self.augmenter.random_swap(text, n=augmented_text_portion)
        return random_swaped

    def random_delete(self, text):
        random_deleted = self.augmenter.random_deletion(text, p=0.5)
        return random_deleted


text_augmentation = TextAugmentation()
The data augmentation leaves the training set with 10300 samples: 3540 positive, 4185 negative, and 2575 neutral.
Then, I tokenized the sequence as follows:
# Tokenize the sequence
pfizer_tokenizer = Tokenizer(oov_token='OOV')
pfizer_tokenizer.fit_on_texts(df_pfizer_train['text'].values)
X_pfizer_train_tokenized = pfizer_tokenizer.texts_to_sequences(df_pfizer_train['text'].values)
X_pfizer_test_tokenized = pfizer_tokenizer.texts_to_sequences(df_pfizer_test['text'].values)
# Pad the sequence
X_pfizer_train_padded = pad_sequences(X_pfizer_train_tokenized, maxlen=100)
X_pfizer_test_padded = pad_sequences(X_pfizer_test_tokenized, maxlen=100)
pfizer_max_length = 100
pfizer_num_words = len(pfizer_tokenizer.word_index) + 1
# Encode label
y_pfizer_train_encoded = df_pfizer_train['sentiment'].factorize()[0]
y_pfizer_test_encoded = df_pfizer_test['sentiment'].factorize()[0]
y_pfizer_train_category = to_categorical(y_pfizer_train_encoded)
y_pfizer_test_category = to_categorical(y_pfizer_test_encoded)
This results in 8869 unique words and a maximum sequence length of 100.
Finally, I fit the data into my model using the pre-trained GoogleNewsVectorNegative300 word embedding (using only its weights) and the LSTM, again splitting off 10% of the training data for validation, as follows:
# Build single LSTM model
def build_lstm_model(embedding_matrix, max_sequence_length):
    # Input layer
    input_layer = Input(shape=(max_sequence_length,), dtype='int32')

    # Word embedding layer
    embedding_layer = Embedding(input_dim=embedding_matrix.shape[0],
                                output_dim=embedding_matrix.shape[1],
                                weights=[embedding_matrix],
                                input_length=max_sequence_length,
                                trainable=True)(input_layer)

    # LSTM model layers
    lstm_layer = LSTM(units=128,
                      dropout=0.5,
                      return_sequences=True)(embedding_layer)
    batch_normalization = BatchNormalization()(lstm_layer)
    lstm_layer = LSTM(units=128,
                      dropout=0.5,
                      return_sequences=False)(batch_normalization)
    batch_normalization = BatchNormalization()(lstm_layer)

    # Dense model layers
    dense_layer = Dense(units=128, activation='relu')(batch_normalization)
    dropout_layer = Dropout(rate=0.5)(dense_layer)
    batch_normalization = BatchNormalization()(dropout_layer)
    output_layer = Dense(units=3, activation='softmax')(batch_normalization)

    lstm_model = Model(inputs=input_layer, outputs=output_layer)
    return lstm_model
# Building single LSTM model
sinovac_lstm_model = build_lstm_model(SINOVAC_EMBEDDING_MATRIX, SINOVAC_MAX_SEQUENCE)
sinovac_lstm_model.summary()
sinovac_lstm_model.compile(loss='categorical_crossentropy',
                           optimizer=Adam(learning_rate=0.001),
                           metrics=['accuracy'])
sinovac_lstm_history = sinovac_lstm_model.fit(x=X_sinovac_train,
                                              y=y_sinovac_train,
                                              batch_size=64,
                                              epochs=20,
                                              validation_split=0.1,
                                              verbose=1)
The training result:
The evaluation result:
I really need some suggestions or insights on how to get good accuracy on my test set.
Without reviewing everything, a few high-order things that may be limiting your results:
The GoogleNews vectors were trained on media-outlet news stories from 2012 and earlier. Tweets in 2020+ use a very different style of language. I wouldn't necessarily expect those pretrained vectors, from a different era & domain-of-writing, to be very good at modeling the words you'll need. A well-trained word2vec model (using plenty of modern tweet data, with good preprocessing/tokenization & parameterization choices) has a good chance of working better, so you may want to revisit that choice.
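For example, a minimal sketch of training such a model with gensim; tokenized_tweets is an assumed name for a list of token lists produced by the same preprocessing used for the classifier:

from gensim.models import Word2Vec

w2v = Word2Vec(sentences=tokenized_tweets,
               vector_size=300,   # gensim 4.x keyword; older gensim 3.x used size=300
               window=5,
               min_count=2,       # keep rare-but-real tweet tokens; tune for corpus size
               workers=4,
               epochs=10)
w2v.save('tweet_word2vec.model')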
The preprocessing of the GoogleNews training texts, while as far as I can tell never fully documented, did not appear to flatten all casing, nor remove stopwords, nor involve lemmatization. It did not mutate obvious negations into antonyms, but it did perform statistical combination of some single words into multigram tokens. So some of your steps are potentially causing your tokens to have less concordance with that set's vectors – even throwing away info, like inflectional variations of words, that could be beneficially retained. Be sure every step you're taking is worth the trouble – and note that a sufficient modern word2vec model, trained on tweets using the same preprocessing for word2vec training and the later steps, would match vocabularies perfectly.
Both the word2vec model, and any deeper neural network, often need lots of data to train well, and avoid overfitting. Even disregarding the 900 million parameters from GoogleNews, you're trying to train ~130k parameters – at least 520KB of state – from an initial set of merely 2060 tweet-sized texts (maybe 100KB of data). Models that learn generalizable things tend to be compressions of the data, in some sense, and a model that's much larger than the training data brings risk of severe overfitting. (Your mechanistic process for replacing words with synonyms may not be really giving the model any info that the word-vector similarity between synonyms didn't already provide.) So: consider shrinking your model, and getting much more training data - potentially even from other domains than your main classification interest, as long as the use-of-language is similar.
I'm training a 2-layer character LSTM with keras to generate sequences of characters similar to the corpus I am training on. When I train the LSTM, however, the generated output by the trained LSTM is the same sequence over and over again.
I've seen suggestions for similar problems to increase the LSTM input sequence length, increase the batch size, add dropout layers, and increase the dropout amount. I've tried all these things and none of them seem to have fixed the issue. The one thing that has yielded some success is adding a random noise vector to each vector outputted by the LSTM during generation. This makes sense since the LSTM uses the previous step's output to generate the next output. However, generally if I add enough noise to break the LSTM out of its repetitive generation, the quality of the output degrades a great deal.
My LSTM training code is as follows:
import numpy
from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils
from sklearn.model_selection import train_test_split

# [load data from file]
raw_text = collected_statements.lower()

# create mapping of unique chars to integers
chars = sorted(list(set(raw_text + '\b')))
char_to_int = dict((c, i) for i, c in enumerate(chars))
n_chars = len(raw_text)
n_vocab = len(chars)

seq_length = 100
dataX = []
dataY = []
for i in range(0, n_chars - seq_length, 1):
    seq_in = raw_text[i:i + seq_length]
    seq_out = raw_text[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
n_patterns = len(dataX)

# reshape X to be [samples, time steps, features]
X = numpy.reshape(dataX, (n_patterns, seq_length, 1))
# normalize
X = X / float(n_vocab)
# one hot encode the output variable
y = np_utils.to_categorical(dataY)

# define the LSTM model
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2]),
               return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(256))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

# define the checkpoint
filepath = "weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1,
                             save_best_only=True, mode='min')
callbacks_list = [checkpoint]

# fix random seed for reproducibility
seed = 8
numpy.random.seed(seed)

# split into 80% for train and 20% for test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=seed)

# train the model
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=18,
          batch_size=256, callbacks=callbacks_list)
My generation code is as follows:
filename = "weights-improvement-18-1.5283.hdf5"
model.load_weights(filename)
model.compile(loss='categorical_crossentropy', optimizer='adam')

int_to_char = dict((i, c) for i, c in enumerate(chars))

# pick a random seed
start = numpy.random.randint(0, len(dataX)-1)
pattern = unpadded_patterns[start]
print("Seed:")
print("\"", ''.join([int_to_char[value] for value in pattern]), "\"")

# generate characters
for i in range(1000):
    x = numpy.reshape(pattern, (1, len(pattern), 1))
    x = (x / float(n_vocab)) + (numpy.random.rand(1, len(pattern), 1) * 0.01)
    prediction = model.predict(x, verbose=0)
    index = numpy.argmax(prediction)
    #print(index)
    result = int_to_char[index]
    seq_in = [int_to_char[value] for value in pattern]
    sys.stdout.write(result)
    pattern.append(index)
    pattern = pattern[1:len(pattern)]
print("\nDone.")
When I run the generation code, I get the same sequence over and over again:
we have the best economy in the history of our country." "we have the best
economy in the history of our country." "we have the best economy in the
history of our country." "we have the best economy in the history of our
country." "we have the best economy in the history of our country." "we
have the best economy in the history of our country." "we have the best
economy in the history of our country." "we have the best economy in the
history of our country." "we have the best economy in the history of our
country."
Is there anything else I could try that could help to generate something besides the same sequence over and over?
In your character generation I would suggest sampling from the probabilities your model outputs instead of taking the argmax directly. This is what the keras example char-rnn does to get diversity.
This is the code they use for sampling in their example:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
In your code you've got index = numpy.argmax(prediction)
I'd suggest just replacing that with index = sample(prediction) and experiment with temperatures of your choice. Keep in mind that higher temperatures make your output more random and lower temperatures make it less random.
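Inside your generation loop, the change would look roughly like this (note that model.predict returns a (1, vocab_size) array, so the single row of probabilities is passed to sample; the temperature value here is only an example to tune):

prediction = model.predict(x, verbose=0)
index = sample(prediction[0], temperature=0.5)   # sample instead of argmax
result = int_to_char[index]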
What the model generates as its output is the probability of the next character given the previous character. And in the text generation process you just take the character with maximum probability. Instead, it might help to inject some stochasticity (i.e. randomness) into this process by sampling the next character based on the probability distribution generated by the model. One easy way to do this is to use np.random.choice function:
# get the probability distribution generated by the model
prediction = model.predict(x, verbose=0)
# sample the next character based on the predicted probabilites
idx = np.random.choice(y.shape[1], 1, p=prediction[0])[0]
# the rest is the same...
This way the next selected character is not always the most probable character. Rather, all the characters have a chance to be selected, guided by the probability distribution generated by your model. This stochasticity not only breaks the repetitive loop, but it may also result in some interesting generated texts.
Additionally, you can inject further stochasticity by introducing a softmax temperature into the sampling process, as shown in Primusa's answer above, which is based on the Keras char-rnn example. Basically, the idea is to re-weight the probability distribution so that you can control how surprising (i.e. higher temperature/entropy) or predictable (i.e. lower temperature/entropy) the next selected character will be. A sketch combining the two ideas is shown below.
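A minimal sketch combining temperature re-weighting with np.random.choice (the helper name and temperature value are illustrative):

import numpy as np

def sample_next_char(preds, temperature=1.0):
    # re-weight the predicted distribution by temperature, then sample from it
    preds = np.asarray(preds).astype('float64')
    preds = np.exp(np.log(preds + 1e-10) / temperature)
    preds = preds / np.sum(preds)
    return np.random.choice(len(preds), p=preds)

index = sample_next_char(prediction[0], temperature=0.8)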
I looked at several answers, and was not able to see a clear solution to what I'm trying to do.
I have an LSTM for binary text classification that takes the top 40k words in a corpus, then operates on the first 50 tokens. Prepared like this:
max_words = 40000
max_review_length = 50
embedding_vector_length = 100
batch_size = 128
epochs = 10
all_texts = combo.title.tolist()
lstm_text_tokenizer = Tokenizer(nb_words=max_words)
lstm_text_tokenizer.fit_on_texts(all_texts)
x_train = lstm_text_tokenizer.texts_to_sequences(x_train.title.tolist())
x_test = lstm_text_tokenizer.texts_to_sequences(x_test.title.tolist())
x_test = sequence.pad_sequences(x_test, maxlen=50)
x_train = sequence.pad_sequences(x_train, maxlen=50)
My current model looks like this:
def lstm_cnn_model(max_words, embedding_vector_length, max_review_length):
    model = Sequential()
    model.add(Embedding(max_words, embedding_vector_length, input_length=max_review_length))
    model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))
    model.add(MaxPooling1D(pool_size=2))
    model.add(LSTM(100))
    model.add(Dense(num_classes))
    model.add(Activation('softmax'))
    return model
I also have a 1-dimensional list of meta data, with one value per example. I may have more complex meta data to add in the future.
My question is, what is the best way to combine these two inputs in the training of the model?
It would be wise to now switch to the functional API and create a multi-input network which takes the text as well as the meta data:
text_in = Input(shape=(max_review_length,))
meta_in = Input(shape=(1,))  # so 1 meta feature per review

# You process text_in however you like, e.g. embedding followed by an LSTM
embedded_text = Embedding(max_words, embedding_vector_length,
                          input_length=max_review_length)(text_in)
text_features = LSTM(100)(embedded_text)

merged = concatenate([text_features, meta_in])  # (samples, 101)
text_class = Dense(num_classes, activation='softmax')(merged)

model = Model([text_in, meta_in], text_class)
The idea is that the functional API gives you the option to create computation graphs that can use both inputs in a non-sequential way. You can extract features from the text, extract features from the meta data, and merge them to see if that improves classification. You might also want to look into how best to encode the meta data before feeding it in. Training then takes both inputs, for example as sketched below.
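A short usage sketch; meta_train and meta_test are hypothetical arrays of shape (samples, 1), and y_train/y_test are assumed to be one-hot encoded to match the softmax output:

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit([x_train, meta_train], y_train,
          validation_data=([x_test, meta_test], y_test),
          epochs=epochs, batch_size=batch_size)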
I am using sel = SelectFromModel(ExtraTreesClassifier(10), threshold='mean') to select the most important features in my data set.
Then I want to feed these selected features to my keras classifier. But my keras-based neural network classifier needs the number of important features selected in the first step. Below is the code for my keras classifier; the variable X_new is the numpy array of newly selected features.
The code for the keras classifier is as follows.
def create_model(dropout=0.2):
    n_x_new = X_new.shape[1]
    np.random.seed(6000)
    model_new = Sequential()
    model_new.add(Dense(n_x_new, input_dim=n_x_new, kernel_initializer='glorot_uniform', activation='sigmoid'))
    model_new.add(Dense(10, kernel_initializer='glorot_uniform', activation='sigmoid'))
    model_new.add(Dropout(0.2))
    model_new.add(Dense(1, kernel_initializer='glorot_uniform', activation='sigmoid'))
    model_new.compile(loss='binary_crossentropy', optimizer='adam', metrics=['binary_crossentropy'])
    return model_new
seed = 7
np.random.seed(seed)
clf=KerasClassifier(build_fn=create_model, epochs=10, batch_size=1000, verbose=0)
param_grid = {'clf__dropout':[0.1,0.2]}
model = Pipeline([('sel', sel),('clf', clf),])
grid = GridSearchCV(estimator=model, param_grid=param_grid,scoring='roc_auc', n_jobs=1)
grid_result = grid.fit(np.concatenate((train_x_upsampled, cross_val_x_upsampled), axis=0), np.concatenate((train_y_upsampled, cross_val_y_upsampled), axis=0))
As I am using a Pipeline with grid search, I don't understand how my neural network will get the important features selected in the first step. I want to get those selected important features into the array X_new.
Do I need to implement a custom estimator between sel and the keras model?
If yes, how would I implement one? I know the generic code for a custom estimator but I am unable to adapt it to my requirement. The generic code is as follows.
class new_features(TransformerMixin):
    def transform(self, X):
        X_new = sel.transform(X)
        return X_new
But this is not working. Is there any way I can solve this problem without using a custom estimator in between?
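For reference, the generic sklearn-compatible pattern usually also implements fit (returning self) so it can sit inside a Pipeline; a minimal sketch with a hypothetical class name wrapping the selector:

from sklearn.base import BaseEstimator, TransformerMixin

class FeatureSelector(BaseEstimator, TransformerMixin):   # hypothetical name
    def __init__(self, selector):
        self.selector = selector            # e.g. the SelectFromModel instance

    def fit(self, X, y=None):
        self.selector.fit(X, y)
        return self                         # fit must return self inside a Pipeline

    def transform(self, X):
        return self.selector.transform(X)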
I am trying to use LSTM neural networks in order to make a song composer. Basically this is based on a text generator (which tries to predict the next character after looking at a sequence of characters), but instead of characters, it tries to predict notes.
Structure of the midi file that serves as the input (Y-axis is the pitch or note value while X-axis is time):
And these are the predicted note values:
I set the number of epochs to 50, but it seems that the LSTM's loss does not decrease; most of the time it does not improve at all.
I suspect this is because there is an overwhelming number of a particular note (in this case, note value 65), which makes the LSTM lazy during the training phase and predict 65 each and every time.
I feel like this is a common problem among LSTMs and time-series based learning algorithms. How would I solve a problem like this? If what I mentioned is not the problem, then what is the problem and how do I solve that?
Here is the code I am using to train if you need it:
import numpy
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils
seq_length = 100
read_path = '../matrices/input/world-is-mine/world-is-mine-y-0.npy'
raw_text = numpy.load(read_path)
# create mapping of unique chars to integers, and a reverse mapping
chars = sorted(list(set(raw_text)))
char_to_int = dict((c,i) for i,c in enumerate(chars))
n_chars = len(raw_text)
n_vocab = len(chars)
# prepare the dataset of input to output pairs encoded as integers
dataX = []
dataY = []
# dataX is the encoding version of the sequence
# dataY is an encoded version of the next prediction
for i in range(0, n_chars - seq_length, 1):
    seq_in = raw_text[i:i + seq_length]
    seq_out = raw_text[i+seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
n_patterns = len(dataX)
print "Total Patterns: ", n_patterns
# reshape X to be [samples, time steps, features]
X = numpy.reshape(dataX, (n_patterns, seq_length,1))
# normalize
X = X/float(n_vocab)
# one hot encode the output variable
y = np_utils.to_categorical(dataY)
print 'X: ', X.shape
print 'Y: ', y.shape
# define the LSTM model
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
#model.add(Dropout(0.05))
model.add(LSTM(256))
#model.add(Dropout(0.05))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
# There is no test dataset. We are modeling the entire training dataset to learn the probability of each character in a sequence.
# We are interested in a generalization of the dataset that minimizes the chosen loss function
# We are seeking a balance between generalization of the dataset and overfitting but short of memorization
# define the check point
filepath="../checkpoints/weights-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]
model.fit(X,y, nb_epoch=50, batch_size=64, callbacks=callbacks_list)
I have no experience working with music data. From my experience with text data, this looks like an under-fitted model. Increasing the training dataset with different note values should overcome the underfitting problem. It seems the training examples are not enough for learning the note variation. For example, for a char language model, 1 MB of data is too small for training a reasonable LSTM model. Also, try training with a smaller sequence length (say, 20) first. A smaller sequence length will be easier to learn than a longer one with limited training data.