Tensorflow model pruning gives 'nan' for training and validation losses

Tensorflow model pruning gives 'nan' for training and validation losses - python

I'm trying to prune a base model that consists of several layers on top of a VGG network. It also contains a user-defined layer named instance_normalization. For pruning to be successful, I've defined the get_prunable_weights function of this layer as follows:
### defined for model pruning
def get_prunable_weights(self):
return self.weights
I used the following function to obtain a to-be-pruned model structure using a base model named model:
def define_prune_model(self, model, img_shape, epochs, batch_size, validation_split=0.1):
num_images = img_shape[0] * (1 - validation_split)
end_step = np.ceil(num_images / batch_size).astype(np.int32) * epochs
# Define model for pruning.
pruning_params = {
'pruning_schedule': tfmot.sparsity.keras.PolynomialDecay(initial_sparsity=0.5,
final_sparsity=0.80,
begin_step=0,
end_step=end_step)
}
model_for_pruning = prune_low_magnitude(model, **pruning_params)
model_for_pruning.compile(optimizer='adam',
loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
metrics=['accuracy'])
model_for_pruning.summary()
return model_for_pruning
Then, I wrote the following function to perform training on this pruning model:
def train_prune_model(self, model_for_pruning, train_images, train_labels,
epochs, batch_size, validation_split=0.1):
callbacks = [
tfmot.sparsity.keras.UpdatePruningStep(),
tfmot.sparsity.keras.PruningSummaries(log_dir='./models/pruned'),
]
model_for_pruning.fit(train_images, train_labels,
batch_size=batch_size, epochs=epochs, validation_split=validation_split,
callbacks=callbacks)
return model_for_pruning
However, when training, I found out that the training and validation losses were all nan, and the final model prediction output was totally zero. However, the base model that passed to define_prune_model has successfully trained and predicted correctly.
How can I solve this? Thank you in advance.

It is difficult to pinpoint the issue without more informations. In particular, can you please give more detail (preferably as code) about your custom instance_normalization layer ?
Assuming that the code is fine: Since you mentioned that the model trains correctly without pruning, could it be that those pruning parameters are too harsh ? After all, those options set 50% of the weights to zero right from the first learning step.
Here is what I would try:
Experiment with a lower level of sparsity (especially initial_sparsity).
Start to apply pruning later during the training (begin_step argument of the pruning schedule). Some even prefer to train the model once without applying pruning at all. Then re-train again with prune_low_magnitude().
Only prune at some steps, giving time for the model to recover between prunings (frequency argument).
Finally should it still fail, the usual cures when encountering nan losses: reduce the learning rate, use regularization or gradient clipping, ...

Related

fit_generator() returns NoneType instead of History object in Mask R CNN

I would like to save the loss data while training my Mask R CNN, but I seem to be missing something. The training is working but I'm getting the Error:
AttributeError: 'NoneType' object has no attribute 'history'
history = model.train(dataset_train, dataset_val,
learning_rate=config.LEARNING_RATE,
epochs=EPOCH_NUMBER,
augmentation=augmentation,
layers='heads', custom_callbacks=custom_callbacks)
df = pd.DataFrame(history.history).to_excel(
LOSS_DIR)
I'm not even sure if this is the right approach but it seemed easy enough.
This part of the code calls this function in model.py, to which I only added at the very end the history = fit_generator(...) and return history, it used to just call fit_generator(...):
def train(self, train_dataset, val_dataset, learning_rate, epochs, layers,
augmentation=None, custom_callbacks=None, no_augmentation_sources=None):
"""Train the model.
train_dataset, val_dataset: Training and validation Dataset objects.
learning_rate: The learning rate to train with
epochs: Number of training epochs. Note that previous training epochs
are considered to be done already, so this actually determines
the epochs to train in total rather than in this particular
call.
layers: Allows selecting which layers to train. It can be:
- A regular expression to match layer names to train
- One of these predefined values:
heads: The RPN, classifier and mask heads of the network
all: All the layers
3+: Train Resnet stage 3 and up
4+: Train Resnet stage 4 and up
5+: Train Resnet stage 5 and up
augmentation: Optional. An imgaug (https://github.com/aleju/imgaug)
augmentation. For example, passing imgaug.augmenters.Fliplr(0.5)
flips images right/left 50% of the time. You can pass complex
augmentations as well. This augmentation applies 50% of the
time, and when it does it flips images right/left half the time
and adds a Gaussian blur with a random sigma in range 0 to 5.
augmentation = imgaug.augmenters.Sometimes(0.5, [
imgaug.augmenters.Fliplr(0.5),
imgaug.augmenters.GaussianBlur(sigma=(0.0, 5.0))
])
custom_callbacks: Optional. Add custom callbacks to be called
with the keras fit_generator method. Must be list of type keras.callbacks.
no_augmentation_sources: Optional. List of sources to exclude for
augmentation. A source is string that identifies a dataset and is
defined in the Dataset class.
"""
assert self.mode == "training", "Create model in training mode."
# Pre-defined layer regular expressions
layer_regex = {
# all layers but the backbone
"heads": r"(mrcnn\_.*)|(rpn\_.*)|(fpn\_.*)",
# From a specific Resnet stage and up
"3+": r"(res3.*)|(bn3.*)|(res4.*)|(bn4.*)|(res5.*)|(bn5.*)|(mrcnn\_.*)|(rpn\_.*)|(fpn\_.*)",
"4+": r"(res4.*)|(bn4.*)|(res5.*)|(bn5.*)|(mrcnn\_.*)|(rpn\_.*)|(fpn\_.*)",
"5+": r"(res5.*)|(bn5.*)|(mrcnn\_.*)|(rpn\_.*)|(fpn\_.*)",
# All layers
"all": ".*",
}
if layers in layer_regex.keys():
layers = layer_regex[layers]
# Data generators
train_generator = data_generator(train_dataset, self.config, shuffle=True,
augmentation=augmentation,
batch_size=self.config.BATCH_SIZE,
no_augmentation_sources=no_augmentation_sources)
val_generator = data_generator(val_dataset, self.config, shuffle=True,
batch_size=self.config.BATCH_SIZE)
# Create log_dir if it does not exist
if not os.path.exists(self.log_dir):
os.makedirs(self.log_dir)
# Callbacks
callbacks = [
keras.callbacks.TensorBoard(log_dir=self.log_dir,
histogram_freq=0, write_graph=True, write_images=False),
keras.callbacks.ModelCheckpoint(self.checkpoint_path,
verbose=0, save_weights_only=True),
]
# Add custom callbacks to the list
if custom_callbacks:
callbacks += custom_callbacks
# Train
log("\nStarting at epoch {}. LR={}\n".format(self.epoch, learning_rate))
log("Checkpoint Path: {}".format(self.checkpoint_path))
self.set_trainable(layers)
self.compile(learning_rate, self.config.LEARNING_MOMENTUM)
# Work-around for Windows: Keras fails on Windows when using
# multiprocessing workers. See discussion here:
# https://github.com/matterport/Mask_RCNN/issues/13#issuecomment-353124009
if os.name is 'nt':
workers = 1
else:
workers = multiprocessing.cpu_count()
history = self.keras_model.fit_generator(
train_generator,
initial_epoch=self.epoch,
epochs=epochs,
steps_per_epoch=self.config.STEPS_PER_EPOCH,
callbacks=callbacks,
validation_data=val_generator,
validation_steps=self.config.VALIDATION_STEPS,
max_queue_size=100,
workers=0,
use_multiprocessing=False,
)
self.epoch = max(self.epoch, epochs)
return history
In the documentation of fit_generator() it says that it returns a History Object but it looks like it doesn't? I'm very new to machine learning and working on projects like this in general so I'm sorry if this was a stupid question or if I forgot some crucial information.

I believe that model.fit_generator is deprecated, in TensorFlow 2.2 and higher you can just use model.fit because this now supports generators.
https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit_generator

As of Tensorflow 2.1, if you call fit_generator you will get a warning like this one:
Model.fit_generator (from tensorflow.python.keras.engine.training) is
deprecated and will be removed in a future version.
Instructions for updating:
Please use Model.fit, which supports generators.
It prompts you to instead use model.fit which now supports generators.
https://www.codesofinterest.com/2020/09/model-fit-vs-fit-generator-tfkeras.html

How do keras LSTM input and output shapes work?

trainX, trainY, sequence_length=len(train), batch_size=batchTrain
)
val=timeseries_dataset_from_array(
valX, valY, sequence_length=len(val), batch_size=batchVal
)
test=timeseries_dataset_from_array(
testX, testY, sequence_length=len(test), batch_size=batchTest
)
return train, val, test
train, val, test = preprocessor()
model=Sequential()
model.add(LSTM(4,return_sequences=True))
model.add(Dense(2,activation='softmax'))
model.compile(optimizer='Adam', loss="mae")
model.fit(train, epochs=200, verbose=2, validation_data=val, shuffle=False)
I'm trying to make an LSTM from time-series data and when I run the above, the loss doesn't change at all. I'm definitely struggling to understand how lstm input/output shapes work. I've read as much online as I could find, but I can't seem to get the model to learn. I'm under the impression that the first argument is the dimensionality of the output space. I want the lstm to return the whole sequence to the output function.

There are many problems in your model. You final layer is dense with two units and you are using softmax which should be replaced by sigmoid. Since you are using softmax, i guess that you are using this model for classification and not regression.
If you are using a model for classification tasks then you should use BinaryCrossentropy and not MeanAbsoluteError as loss.
To answer the question in full detail, you need to post the additional information. For example: What are you target variables etc.

How to update weights in Stochastic Weight Averaging (SWA) on tensorflow?

I'm confused about how to implement tfa's SWA optimizer. There are two points here:
When you look at the documentation it points you to [this] model averaging tutorial. That tutorial uses tfa.callbacks.AverageModelCheckpoint, which allows you to
Assign the moving average weights to the model, and save them.
(or) Keep the old non-averaged weights, but the saved model uses the average weights.
Having a distinct ModelCheckpoint that allows you to save moving average weights (rather than the current weights) makes sense. However - it seems like SWA should be managing the weight averaging. That makes me want to set update_weights=False.
Is this correct? The tutorial uses update_weights=True.
There is a note about SWA not updating the BN layers in the documentation. Following the suggestion here I did this,
# original training
model.fit(...)
# updating weights from final run
optimizer.assign_average_vars(model.variables)
# batch-norm-hack: lr=0 as suggested https://stackoverflow.com/a/64376062/607528
model.compile(
optimizer=tf.keras.optimizers.Adam(learning_rate=0),
loss=loss,
metrics=metrics)
model.fit(
data,
validation_data=None,
epochs=1,
callbacks=final_callbacks)
before saving my model.
Is this correct?
Thanks!

The easiest way to deal with the batch norm is the following:
First, loop through all layers in your model and reset the moving mean and moving variance in the batch norm layers (in my example I assume the batch norm layers end with "bn"):
for l in model.layers:
if l.name.split('_')[-1] == 'bn': # e.g. conv1_bn
l.moving_mean.assign(tf.zeros_like(l.moving_mean))
l.moving_variance.assign(tf.ones_like(l.moving_variance))
After that run your model for one epoch and set training to true to update the moving average and variance:
count = 0
for x,_ in dataset_train:
_ = model(x, training = True)
count += 1
if count > steps_per_epoch:
break

There are two ways of doing this, the first one is you manually update the weights before saving, like this example from the documentation.
import tensorflow as tf
import tensorflow_addons as tfa
model = tf.Sequential([...])
opt = tfa.optimizers.SWA(
tf.keras.optimizers.SGD(lr=2.0), 100, 10)
model.compile(opt, ...)
model.fit(x, y, ...)
# Update the weights to their mean before saving
opt.assign_average_vars(model.variables)
model.save('model.h5')
The second option is to update the weight through AverageModelCheckpoint if you set update_weights = True. As the collab notebook example shows
avg_callback = tfa.callbacks.AverageModelCheckpoint(filepath=checkpoint_dir,
update_weights=True)
...
#Build Model
model = create_model(moving_avg_sgd)
#Train the network
model.fit(fmnist_train_ds, epochs=5, callbacks=[avg_callback])
Notice that AverageModelCheckpoint also calls assign_average_vars before saving the model, from source code:
def _save_model(self, epoch, logs):
optimizer = self._get_optimizer()
assert isinstance(optimizer, AveragedOptimizerWrapper)
if self.update_weights:
optimizer.assign_average_vars(self.model.variables)
return super()._save_model(epoch, logs)
...

LSTM Model not having any variance during evaluation

I have a question regarding the evaluation of an LSTM Model. I have trained an LSTM Model and stored it with model.save(...). Now I want load_model and evaluate it on the validation set datasets. Since neural networks are stochastic, I run it several times and compute the mean and the variance of the different metrics I am interested in.
Now I am shocked that after the first run all consecutive runs have the same performance on every metric. I don't think that is right, but I don't know where the error occurs.
So my question is:
what is my mistake in setting up the validation of my model?
and how can I fix that?
Here are the code snippets that should explain what I am doing:
Compile and fit the Model
def compile_and_fit( hparams,
MAX_EPOCHS,
model_path ):
window = WindowGenerator( input_width= hparams[HP_WINDOW_SIZE],
label_width=hparams[HP_WINDOW_SIZE], shift=1,
label_columns=['q_MARI'], batch_size = hparams[HP_BATCH_SIZE])
model = tf.keras.models.Sequential([
tf.keras.layers.LSTM(hparams[HP_NUM_UNITS], return_sequences=True, name="LSTM_1"),
tf.keras.layers.Dropout(hparams[HP_DROPOUT], name="Dropout_1"),
tf.keras.layers.LSTM(hparams[HP_NUM_UNITS], return_sequences=True, name="LSTM_2"),
tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(1))
])
learning_rate = hparams[HP_LEARNING_RATE]
model.compile(loss=tf.losses.MeanSquaredError(),
optimizer=tf.optimizers.Adam(learning_rate=learning_rate),
metrics=get_metrics())
history = model.fit(window.train,
epochs=MAX_EPOCHS,
validation_data=window.val,
callbacks= get_callbacks(model_path))
_, a,_,_,_,_ = model.evaluate(window.val)
return a, model, history
Train and safe it
a, model, history = compile_and_fit( hparams = hparams, MAX_EPOCHS = MAX_EPOCHS, model_path = run_path)
model.save(run_path)
Load and evaluate it
model = tf.keras.models.load_model(os.path.join(hparam_path, model_name),
custom_objects={"max_error": max_error, "median_absolute_error": median_absolute_error, "rev_metric": rev_metric, "nse_metric": nse_metric})
model.compile(loss=tf.losses.MeanSquaredError(), optimizer="adam", metrics=get_metrics())
metric_values = np.empty(shape = (nr_runs, len(metrics)), dtype=float)
for j in range(nr_runs):
window = WindowGenerator(input_width= hparam_vals[i], label_width=hparam_vals[i], shift=1,
label_columns=['q_MARI'])
metric_values[j]= np.array(model.evaluate(window.val))
means = metric_values.mean(axis=0)
varis = metric_values.var(axis=0)
print(f'means: {means}, varis: {varis}')
The results I am getting
For setting up the Training I follow those two guides:
https://www.tensorflow.org/tutorials/structured_data/time_series
https://www.tensorflow.org/tensorboard/hyperparameter_tuning_with_hparams

LSTM is not stochastic. Evaluation results should be the same for the same data.

There are two steps, when you train the model, randomness will influence the model you trained. However, after that, you saved the model, the prediction result would be same if you use the same model.

Model performance not improving during federated learning training

I have followed this emnist tutorial to create an image classification experiment (7 classes) with the aim of training a classifier on 3 silos of data with the TFF framework.
Before training begins, I convert the model to a tf keras model using tff.learning.assign_weights_to_keras_model(model,state.model) to evaluate on my validation set. Regardless of the label, the model only predicts one class. This is to be expected as no training of the model has occurred yet. However, I repeat this step after each federated averaging round and the problem persists. All validation images are predicted to one class. I also save the tf keras model weights after each round and make predictions on the test set - no changes.
Some of the steps I have taken to check the source of the issue:
Checked if the tf keras model weights are updating when the FL model is converted after each round - they are updating.
Ensured that the buffer size is greater than the training dataset size for each client.
Compared the predictions to the class distribution in the training datasets. There is a class imbalance but the one class that the model predicts is not necessarily the majority class. Also, it is not always the same class. For the most part, it predicts only class 0.
Increased the number of rounds to 5 and epochs per round to 10. This is computationally very intensive as it is quite a large model being trained with approx 1500 images per client.
Investigated the TensorBoard logs from each training attempt. The training loss is decreasing as the round progresses.
Tried a much simpler model - basic CNN with 2 conv layers. This allowed me to greatly increase the number of epochs and rounds. When evaluating this model on the test set, it predicted 4 different classes but the performance remains very bad. This would indicate that I just would need to increase the number of rounds and epochs for my original model to increase the variation in predictions. This is difficult due the large training time that would be a result.
Model details:
The model uses the XceptionNet as the base model with the weights unfrozen. This performs well on the classification task when all the training images are pooled into a global dataset. Our aim is to hopefully achieve a comparable performance with FL.
base_model = Xception(include_top=False,
weights=weights,
pooling='max',
input_shape=input_shape)
x = GlobalAveragePooling2D()( x )
predictions = Dense( num_classes, activation='softmax' )( x )
model = Model( base_model.input, outputs=predictions )
Here is my training code:
def fit(self):
"""Train FL model"""
# self.load_data()
summary_writer = tf.summary.create_file_writer(
self.logs_dir
)
federated_averaging = self._construct_iterative_process()
state = federated_averaging.initialize()
tfkeras_model = self._convert_to_tfkeras_model( state )
print( np.argmax( tfkeras_model.predict( self.val_data ), axis=-1 ) )
val_loss, val_acc = tfkeras_model.evaluate( self.val_data, steps=100 )
with summary_writer.as_default():
for round_num in tqdm( range( 1, self.num_rounds ), ascii=True, desc="FedAvg Rounds" ):
print( "Beginning fed avg round..." )
# Round of federated averaging
state, metrics = federated_averaging.next(
state,
self.training_data
)
print( "Fed avg round complete" )
# Saving logs
for name, value in metrics._asdict().items():
tf.summary.scalar(
name,
value,
step=round_num
)
print( "round {:2d}, metrics={}".format( round_num, metrics ) )
tff.learning.assign_weights_to_keras_model(
tfkeras_model,
state.model
)
# tfkeras_model = self._convert_to_tfkeras_model(
# state
# )
val_metrics = {}
val_metrics["val_loss"], val_metrics["val_acc"] = tfkeras_model.evaluate(
self.val_data,
steps=100
)
for name, metric in val_metrics.items():
tf.summary.scalar(
name=name,
data=metric,
step=round_num
)
self._checkpoint_tfkeras_model(
tfkeras_model,
round_num,
self.checkpoint_dir
)
def _checkpoint_tfkeras_model(self,
model,
round_number,
checkpoint_dir):
# Obtaining model dir path
model_dir = os.path.join(
checkpoint_dir,
f'round_{round_number}',
)
# Creating directory
pathlib.Path(
model_dir
).mkdir(
parents=True
)
model_path = os.path.join(
model_dir,
f'model_file_round{round_number}.h5'
)
# Saving model
model.save(
model_path
)
def _convert_to_tfkeras_model(self, state):
"""Converts global TFF modle of TF keras model
Takes the weights of the global model
and pushes them back into a standard
Keras model
Args:
state: The state of the FL server
containing the model and
optimization state
Returns:
(model); TF Keras model
"""
model = self._load_tf_keras_model()
model.compile(
loss=self.loss,
metrics=self.metrics
)
tff.learning.assign_weights_to_keras_model(
model,
state.model
)
return model
def _load_tf_keras_model(self):
"""Loads tf keras models
Raises:
KeyError: A model name was not defined
correctly
Returns:
(model): TF keras model object
"""
model = create_models(
model_type=self.model_type,
input_shape=[self.img_h, self.img_w, 3],
freeze_base_weights=self.freeze_weights,
num_classes=self.num_classes,
compile_model=False
)
return model
def _define_model(self):
"""Model creation function"""
model = self._load_tf_keras_model()
tff_model = tff.learning.from_keras_model(
model,
dummy_batch=self.sample_batch,
loss=self.loss,
# Using self.metrics throws an error
metrics=[tf.keras.metrics.CategoricalAccuracy()] )
return tff_model
def _construct_iterative_process(self):
"""Constructing federated averaging process"""
iterative_process = tff.learning.build_federated_averaging_process(
self._define_model,
client_optimizer_fn=lambda: tf.keras.optimizers.SGD( learning_rate=0.02 ),
server_optimizer_fn=lambda: tf.keras.optimizers.SGD( learning_rate=1.0 ) )
return iterative_process

Increased the number of rounds to 5 ...
Running only a few rounds of federated learning sounds insufficient. One of the earliest Federated Averaging papers (McMahan 2016) required running for hundreds of rounds when the MNIST data had non-iid splits. More recently (Reddi 2020) required thousands of rounds for CIFAR-100. One thing to note is that each "round" is one "step" of the global model. That step may be larger with more client epochs, but these are averaged and diverging clients may reduce the magnitude of the global step.
I also save the tf keras model weights after each round and make predictions on the test set - no changes.
This can be concerning. It will be easier to debug if you could share the code used in the FL training loop.

Note sure this is an answer, but more a liked observation.
I've been trying to characterize the learning process (accuracy and loss) on the Federated Learning for Image Classification notebook tutorial with TFF.
I'm seeing major improvements in speed of convergence by modifying the epoch hyperparameter. Changing epochs from 5, 10, 20 etc. But I'm also seeing major increase in training accuracy. I suspect overfitting is occurring, though then I evaluate on the test set accuracy is still high.
Wondering what is going on. ?
My understanding is that the epoch param controls the # of forward/back prop on each client per round of training. Is this correct ? So ie 10 rounds of training on 10 clients with 10 epochs would be 10 Epochs X 10 Clients X 10 rounds. Realise a lager range of clients is needed etc but I was expecting to see poorer accuracy on the test set.
What can I do to see whats going on. Could I use the evaluation check with something like learning curves to to see if overfitting is occurring ?
test_metrics = evaluation(state.model, federated_test_data) Only appears to give a single data point, how can I get the individual test accuracy for each test example validated?
Appreciate any thoughts you may have on the matter, Colin . . .

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Tensorflow model pruning gives 'nan' for training and validation losses - python

Related

fit_generator() returns NoneType instead of History object in Mask R CNN

How do keras LSTM input and output shapes work?

How to update weights in Stochastic Weight Averaging (SWA) on tensorflow?

LSTM Model not having any variance during evaluation

Model performance not improving during federated learning training

Categories

Resources