Please see gist here for the full code. In this question I've pasted what I regard the important part of the code at the bottom.
First I generate a dataset as shown in [2], which consists of 2 features (x and y) and 3 labels (grey, blue and orange). Then I make 4 keras sequential models with identical layer structure, optimizer, loss function, etc. Lastly, I call fit on each model and plot the resulting metrics shown in [3]. As you can see the models all perform differently and I'm wondering why? I've locked the random seed value so each time I run this script I get the exact same result.
In this example the models have identical structures, so I would expect the metric plots to be identical. Eventually I would like to vary the number of layers, layer size, loss function, etc. between the models to see its effects, but that does not seem feasible in this setup. Am I approaching this incorrectly?
An interesting thing to note is that by setting the batch_size to 32 this effect is not as prominent, but it is still present and reproducible (See [4]).
# ---- MAKE MODELS ---- #
NUMBER_OF_MODELS = 4
models = []
for i in range(NUMBER_OF_MODELS):
model = keras.models.Sequential(name=f'{i}')
model.add(keras.layers.Dense(8, activation='relu', input_shape=df_train['features'].values.shape[-1:]))
model.add(keras.layers.Dense(3, activation='softmax'))
model.compile(optimizer=keras.optimizers.Adam(),
loss=keras.losses.CategoricalCrossentropy(),
metrics=[keras.metrics.CategoricalAccuracy()])
model.summary()
models.append(model)
# --------------------- #
# ---- TRAIN MODELS ---- #
histories = []
for model in models:
with tf.device('/cpu:0'):
history = model.fit(x=df_train['features'].values, y=df_train['labels'].values,
validation_data=(df_val['features'].values, df_val['labels'].values),
batch_size=512, epochs=100, verbose=0)
histories.append(history)
# ---------------------- #
you simply need to set the seed every time you define and fit the model
following your code I collapse all in these lines:
NUMBER_OF_MODELS = 4
models = []
histories = []
for i in range(NUMBER_OF_MODELS):
set_seed_TF2(33)
model = keras.models.Sequential(name=f'{i}')
model.add(keras.layers.Dense(8, activation='relu', input_shape=df_train['features'].values.shape[-1:]))
model.add(keras.layers.Dense(3, activation='softmax'))
model.compile(optimizer=keras.optimizers.Adam(),
loss=keras.losses.CategoricalCrossentropy(),
metrics=[keras.metrics.CategoricalAccuracy()])
with tf.device('/cpu:0'):
history = model.fit(x=df_train['features'].values, y=df_train['labels'].values,
validation_data=(df_val['features'].values, df_val['labels'].values),
batch_size=512, epochs=100, verbose=0)
histories.append(history)
models.append(model)
the magic function is set_seed_TF2
def set_seed_TF2(seed):
tf.random.set_seed(seed)
os.environ['PYTHONHASHSEED'] = str(seed)
np.random.seed(seed)
random.seed(seed)
that must be call every time you initialize the model and fit
with this in mind, you can produce every time the same metrics/predictions:
here the running notebook: https://colab.research.google.com/drive/1nHEDI6d3LsRPQUXGiTOYfNOfw954Pu-H?usp=sharing
this works for CPU only
Related
I'm confused about how to implement tfa's SWA optimizer. There are two points here:
When you look at the documentation it points you to [this] model averaging tutorial. That tutorial uses tfa.callbacks.AverageModelCheckpoint, which allows you to
Assign the moving average weights to the model, and save them.
(or) Keep the old non-averaged weights, but the saved model uses the average weights.
Having a distinct ModelCheckpoint that allows you to save moving average weights (rather than the current weights) makes sense. However - it seems like SWA should be managing the weight averaging. That makes me want to set update_weights=False.
Is this correct? The tutorial uses update_weights=True.
There is a note about SWA not updating the BN layers in the documentation. Following the suggestion here I did this,
# original training
model.fit(...)
# updating weights from final run
optimizer.assign_average_vars(model.variables)
# batch-norm-hack: lr=0 as suggested https://stackoverflow.com/a/64376062/607528
model.compile(
optimizer=tf.keras.optimizers.Adam(learning_rate=0),
loss=loss,
metrics=metrics)
model.fit(
data,
validation_data=None,
epochs=1,
callbacks=final_callbacks)
before saving my model.
Is this correct?
Thanks!
The easiest way to deal with the batch norm is the following:
First, loop through all layers in your model and reset the moving mean and moving variance in the batch norm layers (in my example I assume the batch norm layers end with "bn"):
for l in model.layers:
if l.name.split('_')[-1] == 'bn': # e.g. conv1_bn
l.moving_mean.assign(tf.zeros_like(l.moving_mean))
l.moving_variance.assign(tf.ones_like(l.moving_variance))
After that run your model for one epoch and set training to true to update the moving average and variance:
count = 0
for x,_ in dataset_train:
_ = model(x, training = True)
count += 1
if count > steps_per_epoch:
break
There are two ways of doing this, the first one is you manually update the weights before saving, like this example from the documentation.
import tensorflow as tf
import tensorflow_addons as tfa
model = tf.Sequential([...])
opt = tfa.optimizers.SWA(
tf.keras.optimizers.SGD(lr=2.0), 100, 10)
model.compile(opt, ...)
model.fit(x, y, ...)
# Update the weights to their mean before saving
opt.assign_average_vars(model.variables)
model.save('model.h5')
The second option is to update the weight through AverageModelCheckpoint if you set update_weights = True. As the collab notebook example shows
avg_callback = tfa.callbacks.AverageModelCheckpoint(filepath=checkpoint_dir,
update_weights=True)
...
#Build Model
model = create_model(moving_avg_sgd)
#Train the network
model.fit(fmnist_train_ds, epochs=5, callbacks=[avg_callback])
Notice that AverageModelCheckpoint also calls assign_average_vars before saving the model, from source code:
def _save_model(self, epoch, logs):
optimizer = self._get_optimizer()
assert isinstance(optimizer, AveragedOptimizerWrapper)
if self.update_weights:
optimizer.assign_average_vars(self.model.variables)
return super()._save_model(epoch, logs)
...
I have a question regarding the evaluation of an LSTM Model. I have trained an LSTM Model and stored it with model.save(...). Now I want load_model and evaluate it on the validation set datasets. Since neural networks are stochastic, I run it several times and compute the mean and the variance of the different metrics I am interested in.
Now I am shocked that after the first run all consecutive runs have the same performance on every metric. I don't think that is right, but I don't know where the error occurs.
So my question is:
what is my mistake in setting up the validation of my model?
and how can I fix that?
Here are the code snippets that should explain what I am doing:
Compile and fit the Model
def compile_and_fit( hparams,
MAX_EPOCHS,
model_path ):
window = WindowGenerator( input_width= hparams[HP_WINDOW_SIZE],
label_width=hparams[HP_WINDOW_SIZE], shift=1,
label_columns=['q_MARI'], batch_size = hparams[HP_BATCH_SIZE])
model = tf.keras.models.Sequential([
tf.keras.layers.LSTM(hparams[HP_NUM_UNITS], return_sequences=True, name="LSTM_1"),
tf.keras.layers.Dropout(hparams[HP_DROPOUT], name="Dropout_1"),
tf.keras.layers.LSTM(hparams[HP_NUM_UNITS], return_sequences=True, name="LSTM_2"),
tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(1))
])
learning_rate = hparams[HP_LEARNING_RATE]
model.compile(loss=tf.losses.MeanSquaredError(),
optimizer=tf.optimizers.Adam(learning_rate=learning_rate),
metrics=get_metrics())
history = model.fit(window.train,
epochs=MAX_EPOCHS,
validation_data=window.val,
callbacks= get_callbacks(model_path))
_, a,_,_,_,_ = model.evaluate(window.val)
return a, model, history
Train and safe it
a, model, history = compile_and_fit( hparams = hparams, MAX_EPOCHS = MAX_EPOCHS, model_path = run_path)
model.save(run_path)
Load and evaluate it
model = tf.keras.models.load_model(os.path.join(hparam_path, model_name),
custom_objects={"max_error": max_error, "median_absolute_error": median_absolute_error, "rev_metric": rev_metric, "nse_metric": nse_metric})
model.compile(loss=tf.losses.MeanSquaredError(), optimizer="adam", metrics=get_metrics())
metric_values = np.empty(shape = (nr_runs, len(metrics)), dtype=float)
for j in range(nr_runs):
window = WindowGenerator(input_width= hparam_vals[i], label_width=hparam_vals[i], shift=1,
label_columns=['q_MARI'])
metric_values[j]= np.array(model.evaluate(window.val))
means = metric_values.mean(axis=0)
varis = metric_values.var(axis=0)
print(f'means: {means}, varis: {varis}')
The results I am getting
For setting up the Training I follow those two guides:
https://www.tensorflow.org/tutorials/structured_data/time_series
https://www.tensorflow.org/tensorboard/hyperparameter_tuning_with_hparams
LSTM is not stochastic. Evaluation results should be the same for the same data.
There are two steps, when you train the model, randomness will influence the model you trained. However, after that, you saved the model, the prediction result would be same if you use the same model.
In my TensorFlow model I have some data that I feed into a stack of CNNs before it goes into a few fully connected layers. I have implemented that with Keras' Sequential model. However, I now have some data that should not go into the CNN and instead be fed directly into the first fully connected layer because that data contains some values and labels that are part of the input data but that data should not undergo convolutions as it is not image data.
Is such a thing possible with tensorflow.keras or should I do that with tensorflow.nn instead? As far as I understand Keras' sequential models is that the input goes in one end and comes out the other with no special wiring in the middle.
Am I correct that to do this I have to use tensorflow.concat on the data from the last CNN layer and the data that bypasses the CNNs before feeding it into the first fully connected layer?
Here is an simple example in which the operation is to sum the activations from different subnets:
import keras
import numpy as np
import tensorflow as tf
from keras.layers import Input, Dense, Activation
tf.reset_default_graph()
# this represents your cnn model
def nn_model(input_x):
feature_maker = Dense(10, activation='relu')(input_x)
feature_maker = Dense(20, activation='relu')(feature_maker)
feature_maker = Dense(1, activation='linear')(feature_maker)
return feature_maker
# a list of input layers, of course the input shapes can be different
input_layers = [Input(shape=(3, )) for _ in range(2)]
coupled_feature = [nn_model(input_x) for input_x in input_layers]
# assume you take the sum of the outputs
coupled_feature = keras.layers.Add()(coupled_feature)
prediction = Dense(1, activation='relu')(coupled_feature)
model = keras.models.Model(inputs=input_layers, outputs=prediction)
model.compile(loss='mse', optimizer='adam')
# example training set
x_1 = np.linspace(1, 90, 270).reshape(90, 3)
x_2 = np.linspace(1, 90, 270).reshape(90, 3)
y = np.random.rand(90)
inputs_x = [x_1, x_2]
model.fit(inputs_x, y, batch_size=32, epochs=10)
You can actually plot the model to gain more intuition
from keras.utils.vis_utils import plot_model
plot_model(model, show_shapes=True)
The model of the above code looks like this
With a little remodeling and the functional API you can:
#create the CNN - it can also be a sequential
cnn_input = Input(image_shape)
cnn_output = Conv2D(...)(cnn_input)
cnn_output = Conv2D(...)(cnn_output)
cnn_output = MaxPooling2D()(cnn_output)
....
cnn_model = Model(cnn_input, cnn_output)
#create the FC model - can also be a sequential
fc_input = Input(fc_input_shape)
fc_output = Dense(...)(fc_input)
fc_output = Dense(...)(fc_output)
fc_model = Model(fc_input, fc_output)
There is a lot of space for creativity, this is just one of the ways.
#create the full model
full_input = Input(image_shape)
full_output = cnn_model(full_input)
full_output = fc_model(full_output)
full_model = Model(full_input, full_output)
You can use any of the three models in any way you want. They share the layers and the weights, so internally they are the same.
Saving and loading the full model might be quirky. I'd probably save the other two separately and when loading create the full model again.
Notice also that if you save two models that share the same layers, after loading they will probably not share these layers anymore. (Another reason for saving/loading only fc_model and cnn_model, while creating full_model again from code)
I train a sequential model (20 dense layers) in keras (python) using default settings and just 1 epoch.
All layers are activated with relu, except the last on that uses sigmoid.
METHOD A:
Feed model with 1,000,000 records of labeled training data.
METHOD B:
Train model with 50,000 records
Save the model
Do some stuff
Load saved model
Train with another 50,000 records
Repeat until all 1,000,000 records are used
Why is there a discrepancy between the above 2 methods?
I always get better accuracy using all data at once, than using it in groups.
What is the reason for that?
model = Sequential()
model.add(Dense(30, input_dim = 27, activation = 'relu'))
...
model.add(Dense(1, input_dim = 10, activation = 'sigmoid'))
model.compile(loss = 'binary_crossentropy', optimizer = 'sgd', metrics = ['accuracy'])
model.load_weights(PreviousWeightsFile)
model.fit(X, Y, verbose = 0)
model.save_weights(WeightsFile)
(exit python and do some stuff)
from the documentation, here the crucial model parameters for your question
initial_epoch: Integer. Epoch at which to start training (useful for
resuming a previous training run).
and
epochs: Integer. Number of epochs to train the model. An epoch is an
iteration over the entire x and y data provided. Note that in
conjunction with initial_epoch, epochs is to be understood as "final
epoch". The model is not trained for a number of iterations given by
epochs, but merely until the epoch of index epochs is reached.
You are not using these parameters therefore you are overwriting your weights and are not resuming training like you could with the epochs parameter. That's the reason why your model always performs worse with method B.
With all the data, the interactions between features and the resultant backpropagation will be more accurate with all present data; this allows for features and the architecture of the model to build upon additional epochs.
When you save and reload you essentially restart this.
Below is the simple example of multi-class classification task with
IRIS data.
import seaborn as sns
import numpy as np
from sklearn.cross_validation import train_test_split
from keras.models import Sequential
from keras.layers.core import Dense, Activation, Dropout
from keras.regularizers import l2
from keras.utils import np_utils
#np.random.seed(1335)
# Prepare data
iris = sns.load_dataset("iris")
iris.head()
X = iris.values[:, 0:4]
y = iris.values[:, 4]
# Make test and train set
train_X, test_X, train_y, test_y = train_test_split(X, y, train_size=0.5, random_state=0)
################################
# Evaluate Keras Neural Network
################################
# Make ONE-HOT
def one_hot_encode_object_array(arr):
'''One hot encode a numpy array of objects (e.g. strings)'''
uniques, ids = np.unique(arr, return_inverse=True)
return np_utils.to_categorical(ids, len(uniques))
train_y_ohe = one_hot_encode_object_array(train_y)
test_y_ohe = one_hot_encode_object_array(test_y)
model = Sequential()
model.add(Dense(16, input_shape=(4,),
activation="tanh",
W_regularizer=l2(0.001)))
model.add(Dropout(0.5))
model.add(Dense(3, activation='sigmoid'))
model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='adam')
# Actual modelling
# If you increase the epoch the accuracy will increase until it drop at
# certain point. Epoch 50 accuracy 0.99, and after that drop to 0.977, with
# epoch 70
hist = model.fit(train_X, train_y_ohe, verbose=0, nb_epoch=100, batch_size=1)
score, accuracy = model.evaluate(test_X, test_y_ohe, batch_size=16, verbose=0)
print("Test fraction correct (NN-Score) = {:.2f}".format(score))
print("Test fraction correct (NN-Accuracy) = {:.2f}".format(accuracy))
My question is how do people usually decide the size of layers?
For example based on code above we have:
model.add(Dense(16, input_shape=(4,),
activation="tanh",
W_regularizer=l2(0.001)))
model.add(Dense(3, activation='sigmoid'))
Where first parameter of Dense is 16 and second is 3.
Why two layers uses two different values for Dense?
How do we choose what's the best value for Dense?
Basically it is just trial and error. Those are called hyperparameters and should be tuned on a validation set (split from your original data into train/validation/test).
Tuning just means trying different combinations of parameters and keep the one with the lowest loss value or better accuracy on the validation set, depending on the problem.
There are two basic methods:
Grid search: For each parameter, decide a range and steps into that range, like 8 to 64 neurons, in powers of two (8, 16, 32, 64), and try each combination of the parameters. This is obviously requires an exponential number of models to be trained and tested and takes a lot of time.
Random search: Do the same but just define a range for each parameter and try a random set of parameters, drawn from an uniform distribution over each range. You can try as many parameters sets you want, for as how long you can. This is just a informed random guess.
Unfortunately there is no other way to tune such parameters. About layers having different number of neurons, that could come from the tuning process, or you can also see it as dimensionality reduction, like a compressed version of the previous layer.
There is no known way to determine a good network structure evaluating the number of inputs or outputs. It relies on the number of training examples, batch size, number of epochs, basically, in every significant parameter of the network.
Moreover, a high number of units can introduce problems like overfitting and exploding gradient problems. On the other side, a lower number of units can cause a model to have high bias and low accuracy values. Once again, it depends on the size of data used for training.
Sadly it is trying some different values that give you the best adjustments. You may choose the combination that gives you the lowest loss and validation loss values, as well as the best accuracy for your dataset, as said in the previous post.
You could do some proportion on your number of units value, something like:
# Build the model
model = Sequential()
model.add(Dense(num_classes * 8, input_shape=(shape_value,), activation = 'relu' ))
model.add(Dropout(0.5))
model.add(Dense(num_classes * 4, activation = 'relu'))
model.add(Dropout(0.2))
model.add(Dense(num_classes * 2, activation = 'relu'))
model.add(Dropout(0.2))
#Output layer
model.add(Dense(num_classes, activation = 'softmax'))
The model above shows an example of a categorisation AI system. The num_classes are the number of different categories the system has to choose. For instance, in the iris dataset from Keras, we have:
Iris Setosa
Iris Versicolour
Iris Virginica
num_classes = 3
However, this could lead to worse results than with other random values. We need to adjust the parameters to the training dataset by making some different tries and then analyse the results seeking for the best combination of parameters.
My suggestion is to use EarlyStopping(). Then check the number of epochs and accuracy with test loss.
from tensorflow.keras.callbacks import ReduceLROnPlateau, EarlyStopping
rlp = lrd = ReduceLROnPlateau(monitor = 'val_loss',patience = 2,verbose = 1,factor = 0.8, min_lr = 1e-6)
es = EarlyStopping(verbose=1, patience=2)
his = classifier.fit(X_train, y_train, epochs=500, batch_size = 128, validation_split=0.1, verbose = 1, callbacks=[lrd,es])