Hugging Face not able to reload all weights after training - python

I have recently been using a RoBERTa-large model, on which I perform downstream training using the Trainer API.
Everything goes well: I see the loss going down, and I manually compare some results against the validation dataset.
The problem arises when I try to save the model and reload it afterwards.
I keep seeing this warning when reloading the model:
Some weights of the model checkpoint at Roberta_trained_1epoch were not used when initializing RobertaPreTrainedModel: ['module.roberta.encoder.layer.10.output.dense.bias', [........................................340_LAYERS_..................................]
'module.roberta.encoder.layer.6.attention.self.key.bias', 'module.roberta.encoder.layer.22.output.dense.weight', 'module.roberta.encoder.layer.3.attention.self.key.bias', 'module.roberta.encoder.layer.15.attention.self.value.bias', 'module.roberta.encoder.layer.15.attention.self.query.bias', 'module.roberta.encoder.layer.2.attention.self.value.bias']
I looked extensively for an explanation of this problem and so far couldn't find a solution. Some claim this is just a warning and nothing is wrong; however, I got suspicious, did some manual checks, and indeed the model seems... untrained.
I'm using Trainer.save_model('save_here') after training, and RobertaForTokenClassification.from_pretrained('save_here', local_files_only=True) to reload it.
However, the results clearly show that the model is not loading correctly.
Training code:
trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=ds_train,
    eval_dataset=ds_valid,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
trainer.evaluate()
trainer.save_model('save_here')
This results in an evaluation loss of 0.002.
Reloading and re-evaluation:
model = RobertaForTokenClassification.from_pretrained('save_here', local_files_only=True)
tokenizer = AutoTokenizer.from_pretrained('tokenizers_saved')
dl_valid = DataLoader(ds_valid, batch_size=Config.batch_size, shuffle=True)

model.to(device)
model.eval()
predictions, reals, eval_loss = [], [], 0.0  # accumulators used below (omitted in the original snippet)

with torch.no_grad():
    for index, data in enumerate(dl_valid):
        batch_input_ids = data['input_ids'].to(device, dtype=torch.long)
        batch_att_mask = data['attention_mask'].to(device, dtype=torch.long)
        batch_target = data['label_ids'].to(device, dtype=torch.long)
        output = model(batch_input_ids, token_type_ids=None, attention_mask=batch_att_mask, labels=batch_target)
        step_loss, eval_prediction = output['loss'], output['logits']
        eval_prediction = np.argmax(eval_prediction.detach().to('cpu').numpy(), axis=2)
        predictions.append(eval_prediction)
        reals.append(batch_target)
        eval_loss += step_loss
print(eval_loss)
This results in a loss of 0.9-1.2 (varying randomly after each load).

I found out what was wrong and will share it here, since others may have the same issue.
My problem was that I had wrapped my model in DataParallel: model = nn.DataParallel(model)
It seems that Trainer can't save such a model properly and get it back the usual way.
As a workaround:
model = trainer.model
model.module.save_pretrained('save_here')
....
# afterwards, on another machine
....
model = RobertaForTokenClassification.from_pretrained('save_here')
I still think this should be handled differently by Trainer.
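For what it's worth, the 'module.' prefix on every key in the warning is the giveaway: nn.DataParallel keeps the real model under a .module attribute, so all saved parameter names carry that prefix and no longer match what RobertaForTokenClassification expects. If you already have such a checkpoint on disk, a minimal recovery sketch (the directory names, base model and NUM_LABELS below are placeholders, not taken from the question) could look like this:
import torch
from transformers import RobertaForTokenClassification

# Load the raw state dict written from the DataParallel-wrapped model.
state_dict = torch.load('old_save_dir/pytorch_model.bin', map_location='cpu')

# Strip the 'module.' prefix that DataParallel added to every parameter name.
clean_state_dict = {k[len('module.'):] if k.startswith('module.') else k: v
                    for k, v in state_dict.items()}

# Rebuild the model, load the cleaned weights, then save it in the usual format.
model = RobertaForTokenClassification.from_pretrained('roberta-large', num_labels=NUM_LABELS)
model.load_state_dict(clean_state_dict, strict=False)
model.save_pretrained('fixed_save_dir')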

Related

Optimizer doesn't work when resuming training

I defined a scalar _log_alpha and an optimizer to optimize it.
self._log_alpha = torch.log(torch.ones(1) * alpha).to(get_device()).requires_grad_(True)
self._log_alpha_optimizer = optim.Adam([self._log_alpha], lr=lr)
If I start training from the very beginning, it works just fine. The _log_alpha changes a little bit every time I call
self._log_alpha_optimizer.zero_grad()
log_alpha_loss.backward()
self._log_alpha_optimizer.step()
However, if I train for several steps and save the optimizer's state_dict together with _log_alpha
ckpt = {'log_alpha_optimizer_state_dict': self._log_alpha_optimizer.state_dict(),
        'log_alpha': self._log_alpha}
torch.save(ckpt, save_dir)
and then load them to resume training
ckpt = torch.load(load_dir, map_location=torch.device(get_device()))
self._log_alpha_optimizer.load_state_dict(ckpt['log_alpha_optimizer_state_dict'])
self._log_alpha = ckpt['log_alpha']
self._log_alpha.requires_grad_(True)
the _log_alpha won't change anymore.
I also defined some nn.Modules, whose optimizers still work after saving and loading. I wonder which part of the _log_alpha_optimizer is wrong?
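One thing worth checking (this is a hypothesis, not something stated in the question): optim.Adam([self._log_alpha]) holds a reference to the tensor object that existed when the optimizer was created. The line self._log_alpha = ckpt['log_alpha'] rebinds the attribute to a different tensor, so the optimizer keeps updating the old, now-unused tensor while the loss is computed from the new one. That would also explain why the nn.Module optimizers still work: module.load_state_dict copies values into the existing parameters instead of replacing them. A minimal sketch of a restore that keeps the reference intact:
import torch

ckpt = torch.load(load_dir, map_location=torch.device(get_device()))
self._log_alpha_optimizer.load_state_dict(ckpt['log_alpha_optimizer_state_dict'])

# Copy the saved value into the tensor the optimizer already tracks,
# instead of rebinding self._log_alpha to a brand-new tensor.
with torch.no_grad():
    self._log_alpha.copy_(ckpt['log_alpha'])
Alternatively, rebuild the optimizer from the freshly loaded tensor so that both refer to the same object.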

Callback not working in TensorFlow to stop the training

I have written a callback which stops training when accuracy reaches 99%. The problem is that I get the error below, and when I work around that error the callback is not triggered even though accuracy reaches 100%.
'>' not supported between instances of 'NoneType' and 'float'
class myCallback(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs={}):
        if(logs.get('accuracy') > 0.99):
            self.model.stop_training = True

def train_mnist():
    # Please write your code only where you are indicated.
    # please do not remove # model fitting inline comments.
    # YOUR CODE SHOULD START HERE
    # YOUR CODE SHOULD END HERE
    call = myCallback()
    mnist = tf.keras.datasets.mnist
    (x_train, y_train),(x_test, y_test) = mnist.load_data(path=path)
    # YOUR CODE SHOULD START
    x_train = x_train/255
    y_train = y_train/255
    # YOUR CODE SHOULD END HERE
    model = tf.keras.models.Sequential([
        # YOUR CODE SHOULD START HERE
        keras.layers.Flatten(input_shape=(28,28)),
        keras.layers.Dense(128,activation='relu'),
        keras.layers.Dense(10,activation='softmax')
        # YOUR CODE SHOULD END HERE
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    # model fitting
    history = model.fit(# YOUR CODE SHOULD START HERE
        x_train, y_train, epochs=9, callbacks=[call])
    # model fitting
    return history.epoch, history.history['acc'][-1]
Two major problems with the above code:
Getting to 100% accuracy on the training set almost always means that your model is overfitting. That's bad. What you want to do instead is pass the validation_split=.2 parameter to the .fit method and look for high accuracy on the validation set.
What you are trying to build in your custom callback already exists as keras.callbacks.EarlyStopping, which even has an option to restore the best overall model seen during training. And, by default, it monitors a validation metric (val_loss), not a training metric, when you have a validation split.
So, here's what you should do:
Stop using custom callbacks; they take some mastery to get working. Use EarlyStopping with restore_best_weights instead.
Always use validation_split and look for high accuracy on the validation set, like in the quick sketch below.
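A minimal sketch of that combination (the layer sizes, patience and epoch count are placeholders, not taken from the question):
import tensorflow as tf

mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train / 255.0  # scale pixels only; labels stay as integer class ids

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Stop when validation accuracy stops improving and keep the best weights seen so far.
early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_accuracy',
                                              patience=3,
                                              restore_best_weights=True)

history = model.fit(x_train, y_train,
                    validation_split=0.2,
                    epochs=50,
                    callbacks=[early_stop])
monitor='val_accuracy' assumes a recent tf.keras; on older versions the metric is logged as 'val_acc'.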
Did using built-in callbacks resolve your problem?

OutOfRangeError: tensorflow iterator not reinitializing between runs

I am fine-tuning an Inception model with TensorFlow using the setup below, feeding batches through the tf.data API. However, every time I attempt to train this model (before successfully retrieving any batches), I get an OutOfRangeError claiming that the iterator is exhausted:
Caught OutOfRangeError. Stopping Training. End of sequence
[[node IteratorGetNext (defined at <ipython-input-8-c768436e70d8>:13) = IteratorGetNext[output_shapes=[[?,224,224,3], [?,1]], output_types=[DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](OneShotIterator)]]
I created a function to feed in hard-coded batches as the result of get_batch, and this runs and converges without any issues, leading me to believe that the graph and session code are working properly. I also tested the get_batch function iterating within a session, and this causes no errors. The behavior I would expect is that restarting training (especially after resetting the notebook, etc.) would produce a fresh iterator over the dataset.
Code to train model:
with tf.Graph().as_default():
    tf.logging.set_verbosity(tf.logging.INFO)
    images, labels = get_batch(filenames=tf_train_record_path+train_file)
    # Create the model, use the default arg scope to configure the batch norm parameters.
    with slim.arg_scope(inception.inception_v1_arg_scope()):
        logits, ax = inception.inception_v1(images, num_classes=1, is_training=True)
    # Specify the loss function:
    tf.losses.mean_squared_error(labels, logits)
    total_loss = tf.losses.get_total_loss()
    tf.summary.scalar('losses/Total_Loss', total_loss)
    # Specify the optimizer and create the train op:
    optimizer = tf.train.AdamOptimizer(learning_rate=0.01)
    train_op = slim.learning.create_train_op(total_loss, optimizer)
    # Run the training:
    final_loss = slim.learning.train(
        train_op,
        logdir=train_dir,
        init_fn=get_init_fn(),
        number_of_steps=1)
Code to get batches using Dataset
def get_batch(filenames):
    dataset = tf.data.TFRecordDataset(filenames=filenames)
    dataset = dataset.map(parse)
    dataset = dataset.batch(2)
    iterator = dataset.make_one_shot_iterator()
    data_X, data_y = iterator.get_next()
    return data_X, data_y
This previously asked question resembles the issue I am experiencing; however, I am not using a batch_join call. I am not sure whether this is an issue with slim.learning.train, restoring from a checkpoint, or scope. Any help would be appreciated!
Your input pipeline looks OK. The problem might be a damaged TFRecords file. You can try your code with random data, or feed your images as numpy arrays via tf.data.Dataset.from_tensor_slices().
Your parse function may also cause problems; try printing your image/label with sess.run.
I'd also advise using the Estimator API instead of slim's train_op. It is much more convenient, and slim will be deprecated soon.
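A minimal sketch of the random-data sanity check suggested above, assuming TensorFlow 1.x (the shapes and sample count are placeholders):
import numpy as np
import tensorflow as tf

# Random stand-in data with the shapes the model expects.
fake_images = np.random.rand(8, 224, 224, 3).astype(np.float32)
fake_labels = np.random.rand(8, 1).astype(np.float32)

dataset = tf.data.Dataset.from_tensor_slices((fake_images, fake_labels)).batch(2)
iterator = dataset.make_one_shot_iterator()
data_X, data_y = iterator.get_next()

with tf.Session() as sess:
    # If this prints 4 batches without an OutOfRangeError, the iterator mechanics
    # are fine and the suspect is the TFRecords file or the parse function.
    for _ in range(4):
        images_val, labels_val = sess.run([data_X, data_y])
        print(images_val.shape, labels_val.ravel())
If the real TFRecord pipeline still fails, printing one parsed element the same way (sess.run on the output of get_batch) usually shows where it breaks.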

Save Keras model at specific epochs

I am using Keras for training on my dataset, and it is time-consuming to keep re-running it just to find the number of epochs needed to get the best results. I tried using callbacks to get the best model, but it just does not work and usually stops too early. Also, saving every N epochs is not an option for me.
What I am trying to do is save the model after certain specific epochs are done. Let's say, for example, after epoch 150 is over it will be saved as model.save('model_1.h5'), after epoch 152 as model.save('model_2.h5'), etc., for a few specific epochs.
Is there a way to implement this in Keras? I have already searched for a method, but no luck so far.
Thank you for any help/suggestions.
Edit
In most cases it's enough to use the filename formatting suggested by @Toan Tran in his answer.
But if you need some more sophisticated logic, you can use a callback, for example:
import keras

class CustomSaver(keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs={}):
        if epoch == 2:  # or save after some epoch, each k-th epoch etc.
            self.model.save("model_{}.hd5".format(epoch))
on_epoch_end is called at the end of each epoch; epoch is the epoch number, and the second argument is the logs dict (you can read about other callback methods in the docs). Put your logic into this method (in the example it's kept as simple as possible).
Create the saver object and pass it to the fit method:
import keras
import numpy as np
inp = keras.layers.Input(shape=(10,))
dense = keras.layers.Dense(10, activation='relu')(inp)
out = keras.layers.Dense(1, activation='sigmoid')(dense)
model = keras.models.Model(inp, out)
model.compile(optimizer="adam", loss="binary_crossentropy",)
# Just a noise data for fast working example
X = np.random.normal(0, 1, (1000, 10))
y = np.random.randint(0, 2, 1000)
# create and use callback:
saver = CustomSaver()
model.fit(X, y, callbacks=[saver], epochs=5)
In the bash:
!ls
Out:
model_2.hd5
So, it works.
checkpoint = keras.callbacks.ModelCheckpoint('model{epoch:08d}.h5', period=5)
model.fit(X_train, Y_train, callbacks=[checkpoint])
Did you try ModelCheckpoint? period=5 means the model is saved every 5 epochs.
More details here.
Hope this helps :)
Well, I can't comment on posts yet, so I'm adding on to @Toan Tran's answer. In the latest versions of Keras, the argument period is deprecated; instead, we can use save_freq.
In the following example, the model is saved after every epoch.
checkpoint = keras.callbacks.ModelCheckpoint(model_save_path+'/checkpoint_{epoch:02d}', save_freq='epoch')
H=model.fit(x=x_train, y=y_train,epochs=epoch_no,verbose=2, callbacks=[checkpoint])
You can find more details in the Keras documentation.
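One caveat worth knowing: when save_freq is an integer it counts batches, not epochs, so saving every N epochs means multiplying by the number of batches per epoch. A small sketch, where N, batch_size and the filename pattern are placeholders:
import math
import tensorflow as tf

N = 5                      # save every N epochs (placeholder)
batch_size = 32            # must match the batch size passed to fit (placeholder)
steps_per_epoch = math.ceil(len(x_train) / batch_size)

checkpoint = tf.keras.callbacks.ModelCheckpoint(
    'model_{epoch:03d}.h5',
    save_freq=N * steps_per_epoch)  # integer save_freq is measured in batches

model.fit(x_train, y_train, batch_size=batch_size, epochs=epoch_no, callbacks=[checkpoint])
For arbitrary, non-periodic epochs (like 150 and 152 in the question), the CustomSaver callback above remains the simpler route.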

How to make predictions with tf.estimator.Estimator from checkpoint?

I just trained a CNN to recognise sunspots with TensorFlow. My model is pretty much the same as this one.
The problem is that I cannot find a clear explanation anywhere of how to make predictions from the checkpoint generated by the training phase.
I tried using the standard restore method:
saver = tf.train.import_meta_graph('./model/model.ckpt.meta')
saver.restore(sess,'./model/model.ckpt')
but then I cannot figure out how to run it.
I also tried using tf.estimator.Estimator.predict() like this:
# Create the Estimator (should reload the last checkpoint but it doesn't)
sunspot_classifier = tf.estimator.Estimator(
    model_fn=cnn_model_fn, model_dir="./model")
# Set up logging for predictions
# Log the values in the "Softmax" tensor with label "probabilities"
tensors_to_log = {"probabilities": "softmax_tensor"}
logging_hook = tf.train.LoggingTensorHook(
    tensors=tensors_to_log, every_n_iter=50)
# predict with the model and print results
pred_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"x": pred_data},
    shuffle=False)
pred_results = sunspot_classifier.predict(input_fn=pred_input_fn)
print(pred_results)
but all it does is print <generator object Estimator.predict at 0x10dda6bf8>.
Whereas if I use the same code but with tf.estimator.Estimator.evaluate(), it works like a charm (it reloads the model, performs evaluation and sends the results to TensorBoard).
I know there are many similar questions, but I couldn't really find an answer that worked for me.
sunspot_classifier.predict(input_fn=pred_input_fn) returns a generator, so pred_results is a generator object. To get a value from it you need to iterate it, e.g. with next(pred_results).
The solution is:
print(next(pred_results))
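If you want every prediction rather than just the first, you can loop over the generator; the Estimator lazily restores the checkpoint from model_dir and runs the input_fn as you iterate. The dictionary key below ('probabilities') depends on what your model_fn puts into its EstimatorSpec predictions, so treat it as an assumption:
# Iterate over all predictions produced from the checkpoint.
for i, pred in enumerate(sunspot_classifier.predict(input_fn=pred_input_fn)):
    print(i, pred["probabilities"])

# Or materialise everything at once (fine for small prediction sets).
all_preds = list(sunspot_classifier.predict(input_fn=pred_input_fn))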
