Training with keras using fragments of data - python

I train a sequential model (20 dense layers) in keras (python) using default settings and just 1 epoch.
All layers are activated with relu, except the last on that uses sigmoid.
METHOD A:
Feed model with 1,000,000 records of labeled training data.
METHOD B:
Train model with 50,000 records
Save the model
Do some stuff
Load saved model
Train with another 50,000 records
Repeat until all 1,000,000 records are used
Why is there a discrepancy between the above 2 methods?
I always get better accuracy using all data at once, than using it in groups.
What is the reason for that?
model = Sequential()
model.add(Dense(30, input_dim = 27, activation = 'relu'))
...
model.add(Dense(1, input_dim = 10, activation = 'sigmoid'))
model.compile(loss = 'binary_crossentropy', optimizer = 'sgd', metrics = ['accuracy'])
model.load_weights(PreviousWeightsFile)
model.fit(X, Y, verbose = 0)
model.save_weights(WeightsFile)
(exit python and do some stuff)

from the documentation, here the crucial model parameters for your question
initial_epoch: Integer. Epoch at which to start training (useful for
resuming a previous training run).
and
epochs: Integer. Number of epochs to train the model. An epoch is an
iteration over the entire x and y data provided. Note that in
conjunction with initial_epoch, epochs is to be understood as "final
epoch". The model is not trained for a number of iterations given by
epochs, but merely until the epoch of index epochs is reached.
You are not using these parameters therefore you are overwriting your weights and are not resuming training like you could with the epochs parameter. That's the reason why your model always performs worse with method B.

With all the data, the interactions between features and the resultant backpropagation will be more accurate with all present data; this allows for features and the architecture of the model to build upon additional epochs.
When you save and reload you essentially restart this.

Related

Overfitting on LSTM text classification using Keras

I am trying to develop an LSTM model using Keras, following this tutorial. However, I am implementing it with a different dataset of U.S. political news articles with the aim of classifying them based on a political bias (labels: Left, Centre and Right). I have gotten a model to run with the tutorial, but the loss and accuracy would look very off, like this:
I tried to play around with different DropOut probabilities (i.e. 0.5 instead of 0.2), adding/removing hidden layers (and making them less dense), and decreasing/increasing the max number of words and max sequence length.
I have managed to get the graphs to align a bit more, however, that has led to the model having less accuracy with the training data (and the problem of overfitting is still bad):
Additionally, I am not sure why the validation accuracy always seems to be higher than the model accuracy in the first epoch (shouldn't it usually be lower)?
Here is some code that is being used when tokenizing, padding, and initializing variables:
# The maximum number of words to be used. (most frequent)
MAX_NB_WORDS = 500
# Max number of words in each news article
MAX_SEQUENCE_LENGTH = 100 # I am aware this may be too small
# This is fixed.
EMBEDDING_DIM = 64
tokenizer = Tokenizer(num_words=MAX_NB_WORDS, filters='!"#$%&()*+,-./:;<=>?#[\]^_`{|}~',
lower=True)
tokenizer.fit_on_texts(df_raw['titletext'].values)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
X = tokenizer.texts_to_sequences(df_raw['titletext'].values)
X = pad_sequences(X, maxlen=MAX_SEQUENCE_LENGTH)
print('Shape of data tensor:', X.shape)
Y = pd.get_dummies(df_raw['label']).values
print('Shape of label tensor:', Y.shape)
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.20)
print(X_train.shape,Y_train.shape)
print(X_test.shape,Y_test.shape)
X_train.view()
When I look at what is shown when X_train.view() is executed, I am also not sure why all the arrays start with zeros like this:
I also did a third attempt that was just a second attempt with the number of epochs increased, it looks like this:
Here is the code of the actual model:
model = Sequential()
model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=X.shape[1]))
# model.add(SpatialDropout1D(0.2)) ---> commented out
# model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2)) ---> commented out
model.add(LSTM(64, dropout=0.2, recurrent_dropout=0.2))
model.add(Dropout(0.5))
model.add(Dense(8))
model.add(Dropout(0.5))
model.add(Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
epochs = 25
batch_size = 64
history = model.fit(X_train, Y_train, epochs=epochs,
batch_size=batch_size,validation_split=0.2,callbacks=[EarlyStopping(monitor='val_loss', patience=3, min_delta=0.0001)])
Here is the link to the full code, including the dataset
Any help would be greatly appreciated!
Hyperparameter adjustments for reducing overfitting in neural networks
Identify and ascertain overfitting. The first attempt shows largely overfitting, with early divergence of your test & train loss. I would try a lower learning rate here (in addition to the steps you took for regularisation with dropout layers). Using the default rate does not guarantee best results.
Allowing your model to find the global mimima / not being stuck in a local minima. On the second attempt, it looks better. However, if the x-axis shows the number of epochs -- it could be that your early stopping is too strict? ie. increase the threshold. Consider other optimisers, including SGD with a learning rate scheduler.
Too large network leads to overfitting on the trainset and difficulty in generalisation. Too many neurons may cause the network to 'memorize' all you trainset and overfit. I would try out 8, 16 or 24 neurons in your LSTM layer for example.
Data preprocessing & cleaning. Check your padding_sequences. It is probably padding the start of each text with zeros. I would pad post text.
Dataset. Depending on the size of your current dataset, I would suggest data augmentation to get to a sizable amount of text of training (empirically >=1M words). I would also try several techniques including feature engineering / improving data quality such as, spell checks. Are the classes imbalanced? You may need to balance them out by over/undersampling.
Consider using transfer learning and incorporate trained language models as your embeddings layer instead of training one from scratch. ie. https://www.gcptutorials.com/post/how-to-create-embedding-with-tensorflow

Keras neural network predicting the same output

I need to develop a neural network with Keras to predict a disease using genetic data. It is known, that predicting this disease is possible even with logistic regression (however the predictions, in this case, are of very poor quality). It's worth mentioning that my data is imbalanced, so I introduced class weights later.
I decided to start with the simplest way to predict it - with a network, analogous to a logistic regression - one hidden layer with one neuron and achieved a bad, yet at least some result - 0.12-0.14 F1 score. Then I tried with 2 hidden and 1 output layers with different amount of neurons in the first hidden layer - from 1 to 8.
It turns out that in some cases it learns something, and in some is predicting the same output for every sample. I displayed the accuracy and loss function over the epochs and this is what I get:
Network loss function by epoch. It's clear that the loss function has roughly the same value, for the training data.
Network accuracy by epoch. It's clear that the accuracy is not improving, but fluctuates from 0 to 1
I searched for similar questions and the suggestions were the following:
Make more neurons - I just have to make it work with 1, 2 or more neurons in the first layer, so I can't add neurons to this one. I increased the amount of neurons in the second hidden layer up to 20, but it then stopped predicting anything with any number oh neurons in the first layer configuration.
Make more layers - I tried adding one more layer, but still have the same problem
To introduce dropout and increase it - what dropout are we talking about if it can learn with just one layer and one neuron in it
Reduce learning rate - decreased it from the default 10^(-3) to 10^(-4)
Reduce batch size - varied it from 500 samples in a minibatch to 1 (stochastic gradient descent)
More epochs - isn't 20 to 50 epochs on a 500'000 sample dataset enough?
Here's the model:
def run_nn_class_weights(data, labels, model):
n_iter = 20
predicted = None
true = None
print('Splitting the data')
x_train, x_valid, y_train, y_valid = train_test_split(data, labels, test_size = 0.05)
#model = create_model()
early_stopping_monitor=EarlyStopping(patience=240)
class_weights = class_weight.compute_class_weight('balanced',
np.unique(labels),
labels)
class_weights = dict(enumerate(class_weights))
hist = model.fit(x_train, y_train, validation_data=[x_valid, y_valid], class_weight=class_weights,
epochs=n_iter, batch_size=500, shuffle=True, callbacks=[early_stopping_monitor],verbose=1)
proba = model.predict(data)
predicted = proba.flatten()
true = labels
return(model, proba, hist)
def old_model_n_pred(n_neurons_1st = 1):
model = Sequential()
model.add(Dense(n_neurons_1st, activation='relu', input_shape=(7516,), kernel_initializer='glorot_normal'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
#model.add(Flatten())
model.compile(loss='binary_crossentropy',
optimizer='adam',
metrics=['accuracy'])
return model
This is a small network that should be able to converge to something that's not an atractor (getting stuck on a single value).
I suggest taking a look at the weights of all the neurons with ReLu activation.
ReLus are great because get quick calculations; but half of the relu has derivate of zero, which doesn't help with gradient descent. This might be your case.
In guess in yout case the enemy would be the first neuron.
In order to overcome this problem, I would try to do regularize inputs (to have all samples centered around 0.5 and scaled by the standard deviation). If you do this to a ReLU, you'll make it ignore anything under between [-inf, sd].
if that does not fix part of the problem, swich to a different activation function in the first layer. A sigmoid will work very good and it's not too expensive for just one neuron.
Also, take a close look at your input distribution. What your network actually does is doing a sigmoid-like classification, then using between 4 to 8 neurons to "zoom"/correct on the important parts of the function that the first transformation didn't account for.

Tensorflow neural network doesn’t learn

I built a neural network for a university project. The goal is to find out if sensor data (temperature, humidity and light) can predict if the sunrise happened during a given time frame. So, it is a binary classification.
The problem is that the network does not learn. The accuracy converges towards about 0.8 and does not change after about 5 epochs. Same with the loss, which sits at about 0.4921 after a few epochs. I tried several things like changing the activation function or the number of hidden layers, but nothing worked.
I also created a dataset with an equal amount of "sunrise = 1" and "sunrise = 0" data points. The accuracy ended up at exactly 0,5. Therefore I think that there is something wrong with the network setup itself.
Do you have any idea what could be wrong?
Here is my code:
def build_network():
input = keras.Input(shape=(4,25), name="input")
hidden = layers.Dense(1000, activation="sigmoid", name="dense1")(input)
hidden = layers.Dense(1000, activation="sigmoid", name="dense2")(hidden)
hidden = layers.Flatten()(hidden)
hidden = layers.Dense(500, activation="sigmoid", name="dense3")(hidden)
hidden = layers.Dense(500, activation="sigmoid", name="dense4")(hidden)
hidden = layers.Dense(10, activation="sigmoid", name="dense5")(hidden)
output = layers.Dense(1, activation="sigmoid", name="output")(hidden)
model = keras.Model(inputs=input, outputs=output, name="sunrise_model")
return model
def train_model():
training_files = r'data/training'
test_files = r'data/test'
print('reding files...')
train_x, train_y = load_data(training_files)
test_x, test_y = load_data(test_files)
print("training network")
# compile model
model = build_network()
model.compile(
loss=keras.losses.BinaryCrossentropy(from_logits=False),
optimizer=keras.optimizers.RMSprop(),
metrics=["accuracy"],
)
# Train / fit
model.fit(train_x, train_y, batch_size=100, epochs=200)
# evaluate
test_scores = model.evaluate(test_x, test_y, verbose=2)
print("Test loss:", test_scores[0])
print("Test accuracy:", test_scores[1])
Here is the output: loss: 0.4921 - accuracy: 0.8225
Test loss: 0.4921109309196472,
Test accuracy: 0.8225
And here is an example of the data: https://hastebin.com/hazipagija.json
I would use RELU instead of sigmoid as the activation function. What was the learning rate you used? Try a smaller learning rate. Actually I find I get the best results using a variable learning rate. The Keras callback ReduceLROnPlateau makes this easy to do. Documentation is here. I also recommend that you use the Keras callback ModelCheckpoint to save the model with the lowest validation loss then use that model to make predictions on the test set. Documentation is here.I also think your model has to many parameters and will overfit. Add dropout layers to the model to help reduce this problem. I would try reducing the model complexity as a good alternative. Take out in of the layers with 1000 nodes and one of the layers with 500 nodes and see what results you get. I also prefer to use the Adamax optimizer. Documentation is here.. Use the default values.

Determine number of Nodes and Layers based on shape of the data

Is there a way to determine number of nodes and hidden layers based on shape of the data?
Also, is there a way to determine the best activation function based on the topic?
For example, Im making model for fake news prediction. My features are number of words in text, number of words in title, number of questions, number of capital letters etc.
My dataset has 22 features and around 35000 rows. My output should be 0 or 1.
Based on that, how many layers and nodes should I use and what activation functions are the best for this?
This is my net:
model = Sequential()
model.add(Dense(100, input_dim = features.shape[1], activation = 'relu')) # input layer requires input_dim param
model.add(Dense(100, activation = 'relu'))
model.add(Dense(100, activation = 'relu'))
model.add(Dropout(0.1))
model.add(Dense(1, activation='sigmoid')) # sigmoid instead of relu for final probability between 0 and 1
sgd = optimizers.SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss="mean_squared_error", optimizer=sgd, metrics=['accuracy'])
# call the function to fit to the data training the network)
model.fit(x_train, y_train, epochs = 10, shuffle = True, batch_size=32, validation_data=(x_test, y_test), verbose=1)
scores = model.evaluate(features, results)
print(model.metrics_names[1], scores[1]*100)
Selecting those requires prior experience, otherwise we won't need that much ML Engineers trying different architectures and writing papers.
But for a start I would recommend you take a look at autokeras, It will help with your problem as it's kind of a known problem -Text Classification-, you only need to structure your data as input(X and Y) and then feed that to their Text Classifier which will try different models(You could specify that) to choose the best fitting for your case.
You could find more examples in the docs here
https://autokeras.com/tutorial/text_classification/
import autokeras as ak
# Initialize the text classifier.
clf = ak.TextClassifier(max_trials=10) # It tries 10 different models
# Feed the text classifier with training data.
clf.fit(x_train, y_train)
# Predict with the best model.
predicted_y = clf.predict(x_test)
# Evaluate the best model with testing data.
print(clf.evaluate(x_test, y_test))
Answer is no and no.
Well these are also hyperparameters. You can select a bunch of them and try all of them to get a rough idea of which is giving you the best result. Yes the same statement holds for activation function as well.
You can use more layers than you need and then use regularization to stop producing an overfitted model. Also if it is too less you can clearly understand the underfitting behavior from the loss curve giving high training error.
There is no formula for determining all these. You have to try different things based on the problem at hand and you will see some of it would work better than the others.
For output softmax layer would be good as this will give you a probability of predictions which you can easily convert to one-hot encoding.

Keras: using mask_zero with padded sequences versus single sequence non padded training

I'm building an LSTM model in Keras to classify entities from sentences. I'm experimenting with both zero padded sequences and the mask_zero parameter, or a generator to train the model on one sentence (or batches of same length sentences) at a time so I don't need to pad them with zeros.
If I define my model as such:
model = Sequential()
model.add(Embedding(input_dim=vocab_size+1, output_dim=200, mask_zero=True,
weights=[pretrained_weights], trainable = True))
model.add(Bidirectional(LSTM(units=100, return_sequences=True, recurrent_dropout=0.1)))
model.add(Dropout(0.2))
model.add(Bidirectional(LSTM(units=100, return_sequences=True, recurrent_dropout=0.1)))
model.add(Dropout(0.2))
model.add(TimeDistributed(Dense(target_size, activation='softmax')))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics = ['accuracy'])
Can I expect the padded sequences with the mask_zero parameter to perform similarly to feeding the model non-padded sequences one sentence at a time? Essentially:
model.fit(padded_x, padded_y, batch_size=128, epochs=n_epochs,
validation_split=0.1, verbose=1)
or
def iter_sentences():
while True:
for i in range(len(train_x)):
yield np.array([train_x[i]]), to_categorical([train_y[i]], num_classes = target_size)
model.fit_generator(iter_sentences(), steps_per_epoch=less_steps, epochs=way_more_epochs, verbose=1)
I'm just not sure if there is a general preference for one method over the other, or the exact effect the mask_zero parameter has on the model.
Note: There are slight parameter differences for the model initialization based on which training method I'm using - I've left those out for brevity.
The biggest difference will be performance and training stability, otherwise padding and then masking is the same as processing single sentence at time.
performance: Well you will train one point at a time which might not exploit any parallelism that is available on the hardware. Often, we adjust the batch size to get the best performance from the machine during training and prediction.
training stability: when you set batch size to 1 you are not longer performing mini-batch training. The training routine will apply updates after every data point which might be detrimental for momentum based algorithms such as Adam. Instead, accumulating gradients over a batch tends to provide more stable convergence especially if the data is noisy.
So to answer the question, no, you can't expect them to perform similarly.

Categories