Python - Keras model doesn't converge


I have a network with 32 input nodes, 20 hidden nodes and 65 output nodes. My network input actually is a hash code of length 32 and the output is the word.
The input is the ASCII value of each character of the hash code. The output of the network is a binary representation I have made: for example, a is 00000, b is 00001, and so on. It only covers the alphabet and the space, which is why it needs only 5 bits per character. My training words are limited to a maximum of 13 characters, so I have 13 * 5 = 65 output nodes, and I am expecting a binary output like 10101010101010101010101010101010101010101010101010101010101001011. The bit sequence can therefore represent a word of at most 13 characters, given a hash code of length 32 as input. Below is my current code:
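from keras.models import Sequential
from keras.layers import Dense, BatchNormalization
from keras.optimizers import SGD
from keras.callbacks import EarlyStopping
from sklearn.preprocessing import MinMaxScaler

# Scale each of the 32 ASCII inputs to the range [0, 1]
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_train_samples = scaler.fit_transform(train_samples.reshape(-1, 32))
train_labels = train_labels.reshape(-1, 65)

model = Sequential([
    Dense(32, input_shape=(32,), activation='sigmoid'),
    BatchNormalization(),
    Dense(25, activation='tanh'),
    BatchNormalization(),
    Dense(65, input_shape=(65,), activation='sigmoid')
])
# Stop only after 1000 epochs without any improvement in the training loss
overfitCallback = EarlyStopping(monitor='loss', min_delta=0, patience=1000)
model.summary()
model.compile(SGD(lr=.01, decay=1e-6, momentum=0.9), loss='binary_crossentropy', metrics=['accuracy'])
# Note: fit() receives the unscaled train_samples, not scaled_train_samples
model.fit(train_samples, train_labels, batch_size=1000, epochs=1000000, callbacks=[overfitCallback], shuffle=True, verbose=2)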
I plan to overfit the model so that it memorizes the hash codes of all the words in the dictionary. As an initial test, I am training on only about 5,000 samples, just to see whether it can learn from a small dataset. How can I make the network converge faster? It has been running for more than an hour, and the loss is still around .5004 with an accuracy of .7301. The values go up and down, but when I check every 10 minutes or so I see only a little improvement. How should I fine-tune it?
UPDATE:
Training has now stopped, but the model did not converge. Its loss is .4614 and its accuracy is .7422.
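The output encoding described in the question (5 bits per character, 13 characters, 65 bits in total) could be sketched as follows. This is only an illustration: the alphabet ordering, the code assigned to the space, the right-padding with spaces, and the names `ALPHABET`, `encode_word` and `decode_bits` are assumptions, since the question does not show the actual encoding code.

```python
# Hypothetical 27-symbol alphabet: a=0 (00000), b=1 (00001), ..., z=25, space=26.
ALPHABET = "abcdefghijklmnopqrstuvwxyz "
BITS_PER_CHAR = 5
MAX_CHARS = 13  # 13 * 5 = 65 output nodes


def encode_word(word):
    """Return a list of 65 bits (ints) for a word of at most 13 characters."""
    if len(word) > MAX_CHARS:
        raise ValueError("word is longer than 13 characters")
    padded = word.ljust(MAX_CHARS)  # assumption: short words are padded with spaces
    bits = []
    for ch in padded:
        index = ALPHABET.index(ch)
        bits.extend(int(b) for b in format(index, "05b"))
    return bits


def decode_bits(bits):
    """Inverse of encode_word for valid bit vectors; strips the padding spaces."""
    chars = []
    for i in range(0, len(bits), BITS_PER_CHAR):
        chunk = bits[i:i + BITS_PER_CHAR]
        index = int("".join(str(b) for b in chunk), 2)
        chars.append(ALPHABET[index])
    return "".join(chars).rstrip()


vector = encode_word("ab")
print(len(vector))          # 65
print(vector[:10])          # [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
print(decode_bits(vector))  # ab
```

Note that 5 bits allow 32 codes but only 27 are used here, so a rounded network output can contain chunks (values 27 to 31) that do not map to any character; `decode_bits` as written does not handle those.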

There are some hyperparameters that I would suggest changing first.
Try 'relu' or LeakyReLU() as the activation function for the non-output layers. relu is the standard activation function for baseline models.
The standard optimizer (for most cases) is currently Adam, so try that and tweak its learning rate when needed. You could get better results with SGD, but it often takes a lot of epochs and a lot of hyperparameter tuning. Adam is generally the quickest optimizer to reach a 'low' loss.
To prevent overfitting you might also want to add Dropout(0.5), where 0.5 is just an example value.
Once you have reached the lowest loss, you can start changing these hyperparameters further to try to get an even lower loss.
Apart from this, the first thing I would actually suggest is adding multiple hidden layers of different sizes. This might have a much larger impact than trying to optimize all the hyperparameters.
Edit: Maybe you could post a screenshot of your training loss vs. epochs for the train and validation data? This might make things clearer for others.

Related

Overfitting on LSTM text classification using Keras

I am trying to develop an LSTM model using Keras, following this tutorial. However, I am implementing it with a different dataset of U.S. political news articles with the aim of classifying them based on a political bias (labels: Left, Centre and Right). I have gotten a model to run with the tutorial, but the loss and accuracy would look very off, like this:
I tried playing around with different Dropout probabilities (e.g. 0.5 instead of 0.2), adding/removing hidden layers (and making them less dense), and decreasing/increasing the maximum number of words and the maximum sequence length.
I have managed to get the graphs to align a bit more, however, that has led to the model having less accuracy with the training data (and the problem of overfitting is still bad):
Additionally, I am not sure why the validation accuracy always seems to be higher than the model accuracy in the first epoch (shouldn't it usually be lower)?
Here is some code that is being used when tokenizing, padding, and initializing variables:
import pandas as pd
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

# The maximum number of words to be used. (most frequent)
MAX_NB_WORDS = 500
# Max number of words in each news article
MAX_SEQUENCE_LENGTH = 100 # I am aware this may be too small
# This is fixed.
EMBEDDING_DIM = 64

tokenizer = Tokenizer(num_words=MAX_NB_WORDS, filters='!"#$%&()*+,-./:;<=>?#[\]^_`{|}~',
                      lower=True)
tokenizer.fit_on_texts(df_raw['titletext'].values)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

# Convert each text to a sequence of word indices, then pad to a fixed length
X = tokenizer.texts_to_sequences(df_raw['titletext'].values)
X = pad_sequences(X, maxlen=MAX_SEQUENCE_LENGTH)
print('Shape of data tensor:', X.shape)

# One-hot encode the labels
Y = pd.get_dummies(df_raw['label']).values
print('Shape of label tensor:', Y.shape)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20)
print(X_train.shape, Y_train.shape)
print(X_test.shape, Y_test.shape)
X_train.view()
When I look at what is shown when X_train.view() is executed, I am also not sure why all the arrays start with zeros like this:
I also made a third attempt, which was just the second attempt with the number of epochs increased. It looks like this:
Here is the code of the actual model:
model = Sequential()
model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=X.shape[1]))
# model.add(SpatialDropout1D(0.2)) ---> commented out
# model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2)) ---> commented out
model.add(LSTM(64, dropout=0.2, recurrent_dropout=0.2))
model.add(Dropout(0.5))
model.add(Dense(8))
model.add(Dropout(0.5))
model.add(Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
epochs = 25
batch_size = 64
history = model.fit(X_train, Y_train, epochs=epochs, batch_size=batch_size, validation_split=0.2,
                    callbacks=[EarlyStopping(monitor='val_loss', patience=3, min_delta=0.0001)])
Here is the link to the full code, including the dataset
Any help would be greatly appreciated!
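For reference, the stopping rule configured in the fit call above (monitor val_loss, patience=3, min_delta=0.0001) behaves roughly like the sketch below. It is a simplified plain-Python illustration, not Keras's actual implementation; the function name `stopping_epoch` and the loss values are made up for the example.

```python
def stopping_epoch(val_losses, patience=3, min_delta=0.0001):
    """Return the 1-based epoch at which training would stop, or None.

    An epoch counts as an improvement only if the loss drops by more than
    min_delta below the best value seen so far.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Made-up validation losses: best at epoch 2, then three epochs without improvement.
losses = [1.05, 0.98, 0.99, 1.01, 1.00, 0.90]
print(stopping_epoch(losses))              # 5
print(stopping_epoch(losses, patience=5))  # None
```

With patience=3 the run ends at epoch 5 and never reaches the lower loss at epoch 6; a larger patience lets training continue past short plateaus.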

Hyperparameter adjustments for reducing overfitting in neural networks
Identify and ascertain overfitting. The first attempt shows substantial overfitting, with the test and train loss diverging early. I would try a lower learning rate here (in addition to the dropout layers you already added for regularisation). The default rate does not guarantee the best results.
Allow your model to find the global minimum rather than getting stuck in a local minimum. The second attempt looks better. However, if the x-axis shows the number of epochs, your early stopping may be too strict, i.e. increase the threshold (for example, the patience). Also consider other optimisers, including SGD with a learning rate scheduler.
Too large a network leads to overfitting on the training set and difficulty generalising. Too many neurons may cause the network to 'memorize' your whole training set. I would try 8, 16 or 24 neurons in your LSTM layer, for example.
Data preprocessing and cleaning. Check your pad_sequences call. By default it pads the start of each sequence with zeros (padding='pre'); I would try padding after the text instead (padding='post').
Dataset. Depending on the size of your current dataset, I would suggest data augmentation to reach a sizable amount of training text (empirically >= 1M words). I would also try techniques such as feature engineering and improving data quality, e.g. with spell checks. Are the classes imbalanced? You may need to balance them by over- or undersampling.
Consider using transfer learning and incorporating pre-trained language models as your embedding layer instead of training one from scratch, e.g. https://www.gcptutorials.com/post/how-to-create-embedding-with-tensorflow
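The padding point above can be illustrated without Keras. The helper below is a plain-Python sketch that mimics what `pad_sequences` does for `padding='pre'` (its default) and `padding='post'`; the function name `pad` and the token ids are made up for the example, and truncation is simplified to keeping the last `maxlen` tokens.

```python
def pad(sequences, maxlen, padding="pre", value=0):
    """Pad (or truncate) each list of token ids to exactly maxlen entries."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep the last maxlen tokens
        fill = [value] * (maxlen - len(seq))
        padded.append(fill + seq if padding == "pre" else seq + fill)
    return padded


token_ids = [[5, 12, 7], [3, 9]]
print(pad(token_ids, maxlen=5))                  # [[0, 0, 5, 12, 7], [0, 0, 0, 3, 9]]
print(pad(token_ids, maxlen=5, padding="post"))  # [[5, 12, 7, 0, 0], [3, 9, 0, 0, 0]]
```

With a maximum length of 100 and texts shorter than that, 'pre' padding is why every row of X_train starts with zeros.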

Accuracy of TensorFlow model changes a lot each time I run model.fit

My project is to find out whether I can predict the gender of people speaking near a phone from gyroscope and accelerometer data. I have 315 examples (60 seconds each), and each example has 2997 lines, where each line is the magnitude of the vector from the gyro/accelerometer x, y, z axes.
I shuffled the inputs and outputs with the same seed and normalized the input data. I split the data 60|20|20. In this test I use the accelerometer data to detect whether a male is speaking, so the output is binary.
When I train the current model, I sometimes get accuracy as high as 0.68 and sometimes as low as 0.36, while the loss is almost always around 0.69. Running it 10 times in a for loop gives an average accuracy of 0.5 and a loss of 0.69.
First question: I tried multiple types of models, learning rates, optimization algorithms, etc., but on average I wasn't very successful. Should I try recurrent NNs, and where can I learn about them?
Second question: if one training run gives a model with 68% accuracy, is it okay to say the model has 68% accuracy even though I know the average is 50%?
import tensorflow as tf
from tensorflow.keras import layers, regularizers

model = tf.keras.Sequential()
model.add(layers.Dense(512, activation='relu', input_shape=(2997,), kernel_regularizer=regularizers.l2(0.001)))
model.add(layers.Dropout(0.5))
# Ten further hidden layers of 1024 units, each followed by dropout
for j in range(10):
    model.add(layers.Dense(1024, activation='relu', kernel_regularizer=regularizers.l2(0.001)))
    model.add(layers.Dropout(0.5))
model.add(layers.Dense(1, activation='sigmoid'))

lr_schedule = tf.keras.optimizers.schedules.InverseTimeDecay(
    0.001,
    decay_steps=200,
    decay_rate=1,
    staircase=True)

callbacks = [
    tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=20)
]

# Note: from_logits=True is set although the output layer already applies a sigmoid
model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=lr_schedule),
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])

history = model.fit(
    train_vector_examples,
    train_vector_labels,
    validation_split=0.25,
    epochs=80,
    callbacks=callbacks,
    verbose=0,
    shuffle=False
)
loss1, accuracy = model.evaluate(test_vector_examples, test_vector_labels)

This is from my own experience working with different types of data; to get good answers to your questions you should probably study the characteristics of the data closely before coming up with any models/algorithms.
First question: generally speaking, RNNs are good for data that has time dependency, or in other words, for cases where the inputs' order matters (e.g. time series, text). So I think RNNs may not be the best choice for your type of data, as I suppose ordering does not matter in your dataset.
Second question: this really depends on the difficulty of the problem you are trying to solve; but in my opinion 68% is quite low as 50% is basically the same as random choice. You probably want to improve the accuracy further.
Also, from your explanation, I can see that each gyro/accelerometer reading has three components (x, y, z), so maybe you can try some CNN architectures and see how it goes.
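The reported loss of about 0.69 fits the "random choice" reading above: binary cross-entropy for a model that outputs 0.5 for every example is ln 2 ≈ 0.693, whatever the labels are. The short check below uses made-up labels and a hypothetical helper `binary_cross_entropy`; it is only meant to show the arithmetic.

```python
import math


def binary_cross_entropy(labels, probs):
    """Mean binary cross-entropy for 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)


labels = [1, 0, 0, 1, 1, 0]   # made-up labels
probs = [0.5] * len(labels)   # a model that always answers "50/50"
print(round(binary_cross_entropy(labels, probs), 4))  # 0.6931
print(round(math.log(2), 4))                          # 0.6931
```

A loss that stays near this value across runs suggests the network is not separating the two classes at all, so the accuracy on a small test set mostly reflects chance.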

Loss function for class imbalanced multi-class classifier in Keras

I am trying to apply deep learning to a multi-class classification problem with high class imbalance between target classes (10K, 500K, 90K, 30K). I want to write a custom loss function.
This is my current model:
model = Sequential()
model.add(LSTM(
    units=10,  # number of units returned by LSTM
    return_sequences=True,
    input_shape=(timestamps, nb_features),
    dropout=0.2,
    recurrent_dropout=0.2
))
model.add(TimeDistributed(Dense(1)))
model.add(Dropout(0.2))
model.add(Flatten())
model.add(Dense(units=nb_classes, activation='softmax'))
model.compile(loss="categorical_crossentropy",
              metrics=['accuracy'],
              optimizer='adadelta')
Unfortunately, all predictions belong to class 1!!! The model always predicts 1 for any input...
Appreciate any pointers on how I can solve this task.
Update:
Dimensions of input data:
94981 train sequences
29494 test sequences
X_train shape: (94981, 20, 18)
X_test shape: (29494, 20, 18)
y_train shape: (94981, 4)
y_test shape: (29494, 4)
Basically in the train data I have 94981 samples. Each sample contains a sequence of 20 timestamps. There are 18 features.
The imbalance between target classes (10K, 500K, 90K, 30K) is just an example. I have similar proportions in my real dataset.

First of all, you have ~100k samples. Start with something smaller, like 100 samples and multiple epochs, and see whether your model overfits this smaller training dataset (if it can't, you either have an error in your code or the model is not capable of modelling the dependencies [I would go with the second case]). Seriously, start with this one. And remember to represent all of your classes in this small dataset.
Secondly, the hidden size of the LSTM may be too small: you have 18 features per time step and sequences of length 20, while your hidden size is only 10. On top of that, you apply dropout, which regularizes the network even further.
Furthermore, you may want to add some dense output units instead of merely returning a linear layer of size 10 x 1 for each timestamp.
Last but not least, you may want to upsample the underrepresented data. Class 0 would have to be repeated, say, 50 times (or maybe 25), class 2 around 4 times and the remaining one (class 3) around 10-15 times, so that the network is trained on them.
Oh, and use cross-validation for hyperparameters like the hidden size, the number of dense units, etc.
Also, I don't know for how many epochs you have been training this network, or what your test dataset looks like (it is entirely possible that it consists only of the first class if you haven't stratified the split).
I think this will get you started; hit me up with any doubts in the comments.
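The repetition factors suggested above can be derived from the class counts. The sketch below uses the example counts from the question (10K, 500K, 90K, 30K) and simply divides the majority count by each class count; the rounding and the names `repeat_factors` and `upsample` are assumptions, and in practice a smaller factor may be enough.

```python
# Example class counts from the question, indexed by class id.
class_counts = {0: 10_000, 1: 500_000, 2: 90_000, 3: 30_000}

majority = max(class_counts.values())

# How many times each class would be repeated to roughly match the majority.
repeat_factors = {
    label: round(majority / count) for label, count in class_counts.items()
}
print(repeat_factors)  # {0: 50, 1: 1, 2: 6, 3: 17}


def upsample(samples, labels, factors):
    """Repeat each (sample, label) pair factors[label] times."""
    out_samples, out_labels = [], []
    for sample, label in zip(samples, labels):
        out_samples.extend([sample] * factors[label])
        out_labels.extend([label] * factors[label])
    return out_samples, out_labels
```

Upsampling like this would normally be applied to the training split only, so that the test set keeps the real class proportions.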
EDIT: When it comes to metrics, you may want to check something other than mere accuracy, for example the F1 score alongside your loss and accuracy, to see how the model performs. There are other choices available; for inspiration you can check sklearn's documentation, which provides quite a few options.
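To see why accuracy alone can mislead here, consider a model that always predicts class 1, as described in the question. The sketch below computes accuracy and macro-averaged F1 by hand for a small made-up label set with roughly the proportions from the question; `macro_f1` is a hypothetical helper, and sklearn's `f1_score(..., average='macro')` is the usual library route.

```python
def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 scores (F1 = 0 when undefined)."""
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up labels in roughly the 10K : 500K : 90K : 30K proportions.
y_true = [0] * 1 + [1] * 50 + [2] * 9 + [3] * 3
y_pred = [1] * len(y_true)  # the model always predicts class 1

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(accuracy, 3))                                # 0.794
print(round(macro_f1(y_true, y_pred, [0, 1, 2, 3]), 3))  # 0.221
```

The accuracy looks respectable even though three of the four classes are never predicted, which the macro F1 score makes visible.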

Keras: using mask_zero with padded sequences versus single sequence non padded training

I'm building an LSTM model in Keras to classify entities from sentences. I'm experimenting with both zero padded sequences and the mask_zero parameter, or a generator to train the model on one sentence (or batches of same length sentences) at a time so I don't need to pad them with zeros.
If I define my model as such:
model = Sequential()
model.add(Embedding(input_dim=vocab_size+1, output_dim=200, mask_zero=True,
                    weights=[pretrained_weights], trainable=True))
model.add(Bidirectional(LSTM(units=100, return_sequences=True, recurrent_dropout=0.1)))
model.add(Dropout(0.2))
model.add(Bidirectional(LSTM(units=100, return_sequences=True, recurrent_dropout=0.1)))
model.add(Dropout(0.2))
model.add(TimeDistributed(Dense(target_size, activation='softmax')))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
Can I expect the padded sequences with the mask_zero parameter to perform similarly to feeding the model non-padded sequences one sentence at a time? Essentially:
model.fit(padded_x, padded_y, batch_size=128, epochs=n_epochs,
          validation_split=0.1, verbose=1)
or
def iter_sentences():
    while True:
        for i in range(len(train_x)):
            yield np.array([train_x[i]]), to_categorical([train_y[i]], num_classes=target_size)

model.fit_generator(iter_sentences(), steps_per_epoch=less_steps, epochs=way_more_epochs, verbose=1)
I'm just not sure if there is a general preference for one method over the other, or the exact effect the mask_zero parameter has on the model.
Note: There are slight parameter differences for the model initialization based on which training method I'm using - I've left those out for brevity.

The biggest differences will be performance and training stability; otherwise, padding and then masking is the same as processing a single sentence at a time.
Performance: you will train on one data point at a time, which may not exploit the parallelism available on the hardware. Often, we adjust the batch size to get the best performance from the machine during training and prediction.
Training stability: when you set the batch size to 1, you are no longer performing mini-batch training. The training routine applies an update after every data point, which might be detrimental for momentum-based algorithms such as Adam. Accumulating gradients over a batch instead tends to give more stable convergence, especially if the data is noisy.
So to answer the question: no, you can't expect them to perform similarly.
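A middle ground the question hints at is batching sentences of the same length, which avoids padding without falling back to a batch size of 1. A minimal plain-Python sketch of that grouping is shown below; the function name `batches_by_length` and the toy data are assumptions, and each yielded batch would still need converting to arrays (and one-hot labels) before being passed to the model.

```python
from collections import defaultdict


def batches_by_length(sentences, labels, batch_size):
    """Yield (sentences, labels) batches in which all sentences share a length."""
    buckets = defaultdict(list)
    for sentence, label in zip(sentences, labels):
        buckets[len(sentence)].append((sentence, label))
    for length in sorted(buckets):
        pairs = buckets[length]
        for start in range(0, len(pairs), batch_size):
            chunk = pairs[start:start + batch_size]
            yield [s for s, _ in chunk], [l for _, l in chunk]


# Toy token-id sentences with per-token labels.
train_x = [[4, 8], [1, 2, 3], [7, 7], [5, 6, 9], [2, 2]]
train_y = [[0, 1], [0, 0, 1], [1, 1], [0, 1, 0], [0, 0]]

for xs, ys in batches_by_length(train_x, train_y, batch_size=2):
    print(len(xs[0]), xs)
# 2 [[4, 8], [7, 7]]
# 2 [[2, 2]]
# 3 [[1, 2, 3], [5, 6, 9]]
```

Batches built this way can still be smaller than the requested size when few sentences share a length, so the stability concern above is reduced rather than removed.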

Solving FizzBuzz with Keras

I am trying to solve FizzBuzz using Keras, and it works quite well for numbers between 1 and 10,000 (90-100% win rate and close to 0 loss). However, for higher numbers, that is, numbers between 1 and 100,000, it performs quite poorly (~50% win rate, loss ~0.3), and I have no clue what I can do to fix this. So far I am using a very simple neural net architecture with 3 hidden layers:
model = Sequential()
model.add(Dense(2000, input_dim=state_size, activation="relu"))
model.add(Dense(1000, activation="relu"))
model.add(Dense(500, activation="relu"))
model.add(Dense(num_actions, activation="softmax"))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=["accuracy"])
I found that the more neurons I have, the better it performs, at least for numbers below 10,000.
I am training my neural net in a step-wise fashion, meaning that I am not computing the inputs and targets beforehand but instead train the network step by step. Again, this works quite well, and it shouldn't make a difference, right? Here's the main loop:
for epoch in range(np_epochs):
    action = random_number()
    x_raw = to_binary(action)
    x = np.expand_dims(x_raw, 0)
    prediction = model.predict(x)
    y, victory, _, _ = check_prediction(action, prediction)
    memory.append((x_raw, y))
    curr_batch_size = min(batch_size, len(memory))
    batch = random.sample(memory, curr_batch_size)
    inputs = []
    targets = []
    for i, t in batch:
        inputs.append(i)
        targets.append(t)
    if victory:
        wins += 1
    loss, accuracy = model.train_on_batch(np.array(inputs), np.array(targets))
As you can see, I am not training my network on decimal numbers; I convert them to binary first before feeding them into the net.
Another thing to mention is that I am using a memory, to make it more like a supervised problem. I thought it might perform better if I train on numbers that the neural net has already been trained on, but it doesn't seem to make any difference at all.
Is there anything I can do to solve this particular problem with a neural net? Is it really so hard for a function approximator to figure out the simple math behind FizzBuzz? Am I doing something wrong? Would you suggest a different architecture?
See my code on MachineLabs. You can simply fork my lab and fiddle with it if you want. To view the code, simply click on the 'Editor' tab at the top.
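The helpers `to_binary` and `check_prediction` are not shown in the question. A hypothetical version of the input encoding and the target label might look like the sketch below; the bit order, the four-class labelling and the name `fizzbuzz_class` are assumptions. It also shows that numbers up to 100,000 need 17 input bits, against 14 for numbers up to 10,000, so `state_size` must be at least that large for the bigger range.

```python
def to_binary(number, num_bits):
    """Return number as a list of num_bits bits, least significant bit first."""
    if number >= 2 ** num_bits:
        raise ValueError("number does not fit in num_bits bits")
    return [(number >> i) & 1 for i in range(num_bits)]


def fizzbuzz_class(number):
    """0 = plain number, 1 = Fizz, 2 = Buzz, 3 = FizzBuzz."""
    if number % 15 == 0:
        return 3
    if number % 5 == 0:
        return 2
    if number % 3 == 0:
        return 1
    return 0


print((10_000).bit_length())   # 14
print((100_000).bit_length())  # 17
print(to_binary(6, 4))         # [0, 1, 1, 0]
print([fizzbuzz_class(n) for n in (1, 3, 5, 15)])  # [0, 1, 2, 3]
```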
