When I train the ML model that my other team members had no problem with, I get 'ValueError: No gradients provided for any variable:'.
The full error message is below:
ValueError: No gradients provided for any variable: ['dense/kernel:0', 'dense/bias:0', 'lstm/lstm_cell/kernel:0', 'lstm/lstm_cell/recurrent_kernel:0', 'lstm/lstm_cell/bias:0', 'dense_1/kernel:0', 'dense_1/bias:0', 'dense_2/kernel:0', 'dense_2/bias:0'].
And below is the block of the Jupyter notebook that raises the error:
# layer 2 is the pre-trained embedding layer
model.layers[2]
model.layers[2].set_weights([embedding_matrix])
model.layers[2].trainable = False

model.compile(loss='categorical_crossentropy', optimizer='adam')

epochs = 10
number_pics_per_bath = 3
steps = len(train_descriptions) // number_pics_per_bath

for i in range(epochs):
    generator = data_generator(train_descriptions, train_features, wordtoix, max_length, number_pics_per_bath)
    model.fit_generator(generator, epochs=1, steps_per_epoch=steps, verbose=1)
    model.save('C:/MyAI/archive/model_' + str(i) + '.h5')
I think you may also want to see my full code. It is simple, because I almost copied the code from the GitHub link below:
https://github.com/hlamba28/Automatic-Image-Captioning/blob/master/Automatic%20Image%20Captioning.ipynb
Because I don't have the same file paths or names as in the GitHub repository, I changed the code above a little bit, but I did not change its logic.
My other team member said he had no problem with our (slightly changed) code, and showed me that the model trains well and produces the intended results.
I have recently been using a RoBERTa-large model, on which I perform downstream training using the Hugging Face Trainer.
All goes well: I see the loss going down, and I manually compared some results against the validation dataset.
The problem comes when I try to save the model and reload it afterwards.
I keep seeing this warning when trying to reload the model:
Some weights of the model checkpoint at Roberta_trained_1epoch were not used when initializing RobertaPreTrainedModel: ['module.roberta.encoder.layer.10.output.dense.bias', [........................................340_LAYERS_..................................]
'module.roberta.encoder.layer.6.attention.self.key.bias', 'module.roberta.encoder.layer.22.output.dense.weight', 'module.roberta.encoder.layer.3.attention.self.key.bias', 'module.roberta.encoder.layer.15.attention.self.value.bias', 'module.roberta.encoder.layer.15.attention.self.query.bias', 'module.roberta.encoder.layer.2.attention.self.value.bias']
I looked extensively for an answer to this problem and so far couldn't find a solution. Some claim it is just a warning and nothing is wrong, but I was suspicious, did some manual checks, and the model indeed seems... untrained.
I'm using Trainer.save_model('save_here') after training, and RobertaForTokenClassification.from_pretrained('save_here', local_files_only=True) to reload it.
However, the results clearly show that the model is not loading correctly.
training code:
trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=ds_train,
    eval_dataset=ds_valid,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)

trainer.train()
trainer.evaluate()
trainer.save_model('save_here')
This results in an evaluation loss of 0.002.
Reloading and re-evaluation:
model = RobertaForTokenClassification.from_pretrained('save_here', local_files_only=True)
tokenizer = AutoTokenizer.from_pretrained('tokenizers_saved')

dl_valid = DataLoader(ds_valid, batch_size=Config.batch_size, shuffle=True)

# accumulators for the manual evaluation loop
predictions, reals = [], []
eval_loss = 0.0

model.to(device)
model.eval()

with torch.no_grad():
    for index, data in enumerate(dl_valid):
        batch_input_ids = data['input_ids'].to(device, dtype=torch.long)
        batch_att_mask = data['attention_mask'].to(device, dtype=torch.long)
        batch_target = data['label_ids'].to(device, dtype=torch.long)

        output = model(batch_input_ids, token_type_ids=None, attention_mask=batch_att_mask, labels=batch_target)
        step_loss, eval_prediction = output['loss'], output['logits']

        eval_prediction = np.argmax(eval_prediction.detach().to('cpu').numpy(), axis=2)
        predictions.append(eval_prediction)
        reals.append(batch_target)
        eval_loss += step_loss

print(eval_loss)
This results in a loss between 0.9 and 1.2 (varying randomly after each load).
I found out what was wrong, and will share it here since others may have the same issue.
My problem was that I had wrapped my model in DataParallel: model = nn.DataParallel(model).
It seems that the Trainer can't save such a model properly and load it back the usual way.
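For context, nn.DataParallel stores the wrapped network under a .module attribute, so every parameter name in the saved state_dict gets a module. prefix; that is exactly where the 'module.roberta...' names in the warning above come from. A minimal sketch of the effect (a toy model, not my actual RoBERTa setup):

import torch.nn as nn

# a tiny stand-in model, just to show how the key names change
net = nn.Linear(4, 2)
print(list(net.state_dict().keys()))      # ['weight', 'bias']

wrapped = nn.DataParallel(net)
print(list(wrapped.state_dict().keys()))  # ['module.weight', 'module.bias']

# from_pretrained() expects the un-prefixed names, so it ignores all
# 'module.*' entries and keeps its freshly initialized weights instead.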
As a workaround:
model = trainer.model
model.module.save_pretrained('save_here')
....
# afterwards in another machine
....
model = RobertaForTokenClassification.from_pretrained('save_here')
I still think this should be handled differently.
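If you already have a checkpoint that was written by the DataParallel-wrapped model and cannot retrain, another option (just a sketch, not what I did; it assumes the checkpoint file is pytorch_model.bin and that num_labels matches your classification head) is to strip the prefix from the saved state_dict before loading it:

import torch
from transformers import RobertaForTokenClassification

# raw state_dict written by the DataParallel-wrapped model
state_dict = torch.load('save_here/pytorch_model.bin', map_location='cpu')

# drop the 'module.' prefix that nn.DataParallel adds to every key
clean_state_dict = {k[len('module.'):] if k.startswith('module.') else k: v
                    for k, v in state_dict.items()}

model = RobertaForTokenClassification.from_pretrained('roberta-large', num_labels=num_labels)
model.load_state_dict(clean_state_dict)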
I am using Keras to train on my dataset, and it is time-consuming to rerun training every time just to find the number of epochs needed to get the best results. I tried using callbacks to get the best model, but they just do not work and usually stop too early. Also, saving every N epochs is not an option for me.
What I am trying to do is save the model after some specific epochs are done. For example, after epoch 150 is over it would be saved as model_1.h5, after epoch 152 as model_2.h5, and so on for a few specific epochs.
Is there a way to implement this in Keras? I have already searched for a method but no luck so far.
Thank you for any help or suggestions.
Edit
In most cases it's enough to use the filename formatting suggested by @Toan Tran in his answer.
But if you need some more sophisticated logic, you can use a callback, for example:
import keras

class CustomSaver(keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs={}):
        if epoch == 2:  # or save after some epoch, each k-th epoch etc.
            self.model.save("model_{}.hd5".format(epoch))
on_epoch_end is called at the end of each epoch; epoch is the epoch number, and the latter argument is the logs dictionary (you can read about the other callback methods in the docs). Put your logic into this method (in the example it is as simple as possible).
Create a saver object and pass it to the fit method:
import keras
import numpy as np
inp = keras.layers.Input(shape=(10,))
dense = keras.layers.Dense(10, activation='relu')(inp)
out = keras.layers.Dense(1, activation='sigmoid')(dense)
model = keras.models.Model(inp, out)
model.compile(optimizer="adam", loss="binary_crossentropy",)
# Just a noise data for fast working example
X = np.random.normal(0, 1, (1000, 10))
y = np.random.randint(0, 2, 1000)
# create and use callback:
saver = CustomSaver()
model.fit(X, y, callbacks=[saver], epochs=5)
Then, in the shell:
!ls
Out:
model_2.hd5
So, it works.
checkpoint = keras.callbacks.ModelCheckpoint('model{epoch:08d}.h5', period=5)
model.fit(X_train, Y_train, callbacks=[checkpoint])
Did you try ModelCheckpoint? period=5 means the model is saved every 5 epochs.
More details are in the Keras documentation.
Hope this helps :)
Well, I can't comment on posts yet, so I'm adding on to @Toan Tran's answer. In the latest version of Keras, the period argument is deprecated; instead, we can use save_freq.
In the following example, the model is saved after every epoch.
checkpoint = keras.callbacks.ModelCheckpoint(model_save_path + '/checkpoint_{epoch:02d}', save_freq='epoch')
H = model.fit(x=x_train, y=y_train, epochs=epoch_no, verbose=2, callbacks=[checkpoint])
You can find more details in the Keras documentation.
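If you want the old period=N behaviour (save every N epochs) with save_freq, one option is to pass an integer; note that an integer save_freq counts batches, not epochs, so you have to multiply by the number of batches per epoch yourself. A small sketch, where steps_per_epoch is an assumed variable holding the number of batches in one epoch:

N = 5  # save every 5 epochs
checkpoint = keras.callbacks.ModelCheckpoint(
    model_save_path + '/checkpoint_{epoch:02d}',
    save_freq=N * steps_per_epoch)  # save_freq is measured in batches here
H = model.fit(x=x_train, y=y_train, epochs=epoch_no, verbose=2, callbacks=[checkpoint])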
We are currently working on a project in which we modify a cGAN architecture in TensorFlow to see if we get better results than with standard cGANs. Because we implement a progressively growing architecture, we would like to reset the TensorFlow AdamOptimizer after each phase transition. So far we have not managed to do so: we tried multiple approaches, but we either get the error message "Graph is finalized and cannot be modified" or the parameters do not get reset.
We would be very thankful if somebody could give a hint or a general approach.
You just have to define the optimizer, gather the Adam variables and their initializers, and then, during training, re-initialize those variables by running their initializers.
The following minimal example should point you in the right direction:
import tensorflow as tf

x = tf.placeholder(tf.float32, shape=(None, 1))
y_hat = tf.layers.Dense(10)(x)
y = 10
loss = tf.reduce_mean(tf.squared_difference(y_hat, y))
train = tf.train.AdamOptimizer().minimize(loss)

print(tf.all_variables())

# gather the optimizer's slot variables: their names contain "adam"
adam_vars = [var for var in tf.all_variables() if "adam" in var.name.lower()]
print(adam_vars)

adam_reset = [var.initializer for var in adam_vars]

with tf.Session() as sess:
    # do stuff with your model: train, evaluate, whatever
    # when the reset condition is met, run:
    sess.run(adam_reset)
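This is outside the scope of the original graph-mode answer, but the same idea can be expressed in TF2/Keras-style code by zeroing the optimizer's own variables. A sketch, assuming a tf.keras.optimizers.Adam instance named opt whose variables() method returns the step counter and slot variables (they only exist once at least one gradient update has been applied):

import tensorflow as tf

opt = tf.keras.optimizers.Adam()

# ... build a model and train for a while so the slot variables exist ...

# reset the Adam state (step counter plus the m/v slots)
for var in opt.variables():
    var.assign(tf.zeros_like(var))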
I am training a model in TensorFlow, and it works perfectly when I evaluate with the model that has just been trained.
However, at various points I save a checkpoint and then load that checkpoint to run evaluations on it. The loaded network just outputs NaNs.
Using tfdbg and running the "has_inf_or_nan" filter while feeding input shows the first NaNs in the network appearing in the moving_mean and moving_variance variables of one of the batch normalization layers.
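(For reference, this is roughly how the filter is used: the session is wrapped in the tfdbg CLI wrapper and the built-in has_inf_or_nan filter is registered. A sketch, not my exact code:)

from tensorflow.python import debug as tf_debug

# wrap the normal session in the debug CLI wrapper and register the
# built-in has_inf_or_nan tensor filter
sess = tf_debug.LocalCLIDebugWrapperSession(sess)
sess.add_tensor_filter("has_inf_or_nan", tf_debug.has_inf_or_nan)

# inside the CLI, `run -f has_inf_or_nan` stops at the first tensor
# containing an inf or NaN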
Saving is being done with the following code:
with self.graph.as_default():
    if not self.saver:
        self.saver = tf.train.Saver(tf.global_variables(), max_to_keep=10000)

    save_dir = create_save_dir(path, name)
    return self.saver.save(self.session, save_dir, global_step=iteration, write_meta_graph=True)
Loading is being done with the following code:
with self.graph.as_default():
    save_dir = create_save_dir(load_dir, load_name)
    self.saver = tf.train.import_meta_graph(save_dir + "-" + str(iteration) + ".meta")
    self.saver.restore(self.session, save_dir + "-" + str(iteration))
    self.input_layer = self.graph.get_tensor_by_name("network/input_layer:0")
    self.out_policy_layer = self.graph.get_tensor_by_name("network/out_policy_layer:0")
    self.out_value_layer = self.graph.get_tensor_by_name("network/out_value_layer/Tanh:0")
    self.is_training = self.graph.get_tensor_by_name("network/is_training:0")
Again, the thing that has me suspecting an issue with my save/load routine is that the network outputs valid results when I run data through the network that has just been trained. I only get NaNs when I run something through a network that was loaded.
Editing to add that my batch norm is being created with the following code:
def _conv_block(self, name, input_layer, filter_size, num_input_channels, num_output_channels):
    weights = self._create_weights_for_layer(f"{name}_weights",
                                             shape=[filter_size[0],
                                                    filter_size[1],
                                                    num_input_channels,
                                                    num_output_channels],
                                             use_regularizer=self._config.l2_regularizer)
    conv = self._conv2d(input_layer, weights, strides=[1, 1, 1, 1], padding="SAME", name=f"{name}_conv")
    bn = self._conv_batch_norm(conv, f"{name}_batch_norm")
    return tf.nn.relu(bn, name=f"{name}_act")

def _conv_batch_norm(self, input_layer, name):
    return tf.layers.batch_normalization(input_layer, axis=CHANNEL_SHAPE_INDEX, center=True, scale=True,
                                         training=self.is_training,
                                         momentum=self._config.batch_norm_momentum,
                                         name=name)
Long story short: if you are getting random NaNs in your model and you have looked at the usual suspects, consider that your hardware could be failing before you waste hundreds of hours.
This was caused by a video card with dying memory. I assumed we had a software problem and didn't think of this; the training happened on a different PC and experienced no issues.
This is how it finally occurred to us that we had a hardware problem. We had previously experienced randomly occurring NaNs on the old PC and spent hundreds of hours debugging the model, thinking we had a bug. After we made a change we also happened to move to an upgraded PC, so we thought our change had fixed it, since the NaNs stopped. Then, a month later, we started running evaluations on that old PC and ran into NaNs again; that is when I posted this. Shortly afterwards I realized that there might be a hardware problem.
I'm using Keras to build a deep autoencoder. I used its checkpointer to load the model and the weights, but the result is always None, which I think means the checkpoint doesn't work correctly and is not saving the weights.
Here is the code showing how I proceed:
checkpointer = ModelCheckpoint(filepath="weights.best.h5",
                               verbose=0,
                               save_best_only=True)
tensorboard = TensorBoard(log_dir='/tmp/autoencoder',
                          histogram_freq=0,
                          write_graph=True,
                          write_images=True)

input_enc = Input(shape=(input_size,))
hidden_1 = Dense(hidden_size1, activation='relu')(input_enc)
hidden_11 = Dense(hidden_size2, activation='relu')(hidden_1)
code = Dense(code_size, activation='relu')(hidden_11)
hidden_22 = Dense(hidden_size2, activation='relu')(code)
hidden_2 = Dense(hidden_size1, activation='relu')(hidden_22)
output_enc = Dense(input_size, activation='tanh')(hidden_2)

autoencoder_yes = Model(input_enc, output_enc)
autoencoder_yes.compile(optimizer='adam',
                        loss='mean_squared_error',
                        metrics=['accuracy'])

history_yes = autoencoder_yes.fit(df_noyau_norm_y, df_noyau_norm_y,
                                  epochs=200,
                                  batch_size=batch_size,
                                  shuffle=True,
                                  validation_data=(df_test_norm_y, df_test_norm_y),
                                  verbose=1,
                                  callbacks=[checkpointer, tensorboard]).history

autoencoder_yes.save_weights("weights.best.h5")
print(autoencoder_yes.load_weights("weights.best.h5"))
Can somebody help me find out a way to resolve the problem?
Thanks
No, your interpretation of load_weights returning None is not correct. load_weights is a procedure: it does not return anything, and if you assign the return value of a procedure to a variable, that variable gets the value None.
So weight saving is probably working fine; it's just your interpretation that is wrong.
You should use save_weights_only=True. Without it, the whole model is saved, not just the weights. To be able to load only the weights, you must save them like this:
checkpointer = ModelCheckpoint(filepath="weights.best.h5",
                               verbose=0,
                               save_weights_only=True,
                               save_best_only=True)
This is expected behavior, not an error. autoencoder_yes.load_weights("weights.best.h5") doesn't actually return anything, so if you try to print the output of this call you will get None.
Expected behavior
In the code that you have provided, you have trained the model and saved the weights, so autoencoder_yes is a keras.Model object that already holds the fine-tuned weights.
If you load the saved weights again in the same script, nothing visible is supposed to happen: the weights you saved simply get loaded again.
For clarity
Start with another fresh script, build the same model architecture, reload the weights from the h5 file, and then make some predictions. In that case the model will silently load the pre-trained weights and predict accordingly.
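A minimal sketch of what that fresh script could look like, reusing the layer sizes from the question (input_size, hidden_size1, hidden_size2, code_size and new_data are assumed to be defined just as in the original script):

from keras.layers import Input, Dense
from keras.models import Model

# rebuild exactly the same architecture as in the training script
input_enc = Input(shape=(input_size,))
hidden_1 = Dense(hidden_size1, activation='relu')(input_enc)
hidden_11 = Dense(hidden_size2, activation='relu')(hidden_1)
code = Dense(code_size, activation='relu')(hidden_11)
hidden_22 = Dense(hidden_size2, activation='relu')(code)
hidden_2 = Dense(hidden_size1, activation='relu')(hidden_22)
output_enc = Dense(input_size, activation='tanh')(hidden_2)
autoencoder = Model(input_enc, output_enc)

# load the previously saved weights into the identical architecture
autoencoder.load_weights("weights.best.h5")

# the model now predicts with the trained weights
reconstructions = autoencoder.predict(new_data)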