Running keras model on colab TPU

I am currently trying to train a convolutional neural network (CNN) using Keras on the Google Colab GPU.
I found this article, which discusses an option to reduce the time needed to train the model. Since training on the GPU is currently very slow, I tried to implement the method from the article (running the model on the Colab TPU). I have the following code:
sgd = optimizers.SGD(lr=0.02)
model.compile(optimizer=sgd, loss='categorical_crossentropy', metrics=['accuracy'])
def create_train_subsets():
    X_train = []
    y_train = []
    for i in range(80):
        cat = i + 1
        path = 'train_set/by_cat/{}'.format(cat)
        for img in os.listdir(path):
            actual_image = Image.open("train_set/by_cat/{}/{}".format(cat, img))
            X_train.append(actual_image)
            y_train.append(cat)
    return X_train, y_train
x_train, y_train = create_train_subsets()
# This address identifies the TPU we'll use when configuring TensorFlow.
TPU_WORKER = 'grpc://' + os.environ['COLAB_TPU_ADDR']
tf.logging.set_verbosity(tf.logging.INFO)
tpu_model = tf.contrib.tpu.keras_to_tpu_model(
    model,
    strategy=tf.contrib.tpu.TPUDistributionStrategy(
        tf.contrib.cluster_resolver.TPUClusterResolver(TPU_WORKER)))
history = tpu_model.fit(x_train, y_train,
                        epochs=20,
                        batch_size=128 * 8,
                        validation_split=0.2)
tpu_model.save_weights('./tpu_model.h5', overwrite=True)
# tpu_model.evaluate(x_test, y_test, batch_size=128 * 8)
This code however gives back the following error:
InvalidArgumentError: No OpKernel was registered to support Op 'ConfigureDistributedTPU' used by node ConfigureDistributedTPU (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) with these attrs: [tpu_embedding_config="", is_global_init=false, embedding_config=""]
Registered devices: [CPU, XLA_CPU]
Registered kernels:
<no registered kernels>
[[ConfigureDistributedTPU]]
I searched extensively online but can't find any indication of what this error means. I also don't understand the process well enough to work out its exact meaning myself.
Can anybody help me understand what is wrong, and perhaps suggest how to solve it?
Thank you in advance!
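The error lists only CPU and XLA_CPU as registered devices, which suggests that TensorFlow did not see a TPU in that session. A quick first check, sketched below, is whether the runtime exposes a TPU address at all. Only the COLAB_TPU_ADDR variable comes from the question's code; the helper name colab_tpu_address and its optional environ argument are assumptions made for illustration.

```python
import os


def colab_tpu_address(environ=None):
    """Return 'grpc://<addr>' for the Colab TPU, or None if no TPU is attached."""
    env = os.environ if environ is None else environ
    addr = env.get("COLAB_TPU_ADDR")
    if not addr:
        # The variable is missing or empty, so this runtime has no TPU address.
        return None
    return "grpc://" + addr


address = colab_tpu_address()
if address is None:
    print("No TPU address found; check that the runtime type is set to TPU.")
else:
    print("TPU worker:", address)
```

If this prints the "No TPU address found" message, the runtime is not a TPU runtime, and os.environ['COLAB_TPU_ADDR'] in the original code would raise a KeyError.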

Related

Hugging Face not able to reload all weights after training

I have recently been using a RobertaLarge model, which I fine-tune on a downstream task using the Trainer class.
All goes well: I see the loss going down, and I manually compared some results against the validation dataset.
The problem appears when I try to save the model and reload it afterwards.
I keep seeing this warning when I reload the model:
Some weights of the model checkpoint at Roberta_trained_1epoch were not used when initializing RobertaPreTrainedModel: ['module.roberta.encoder.layer.10.output.dense.bias', [........................................340_LAYERS_..................................]
'module.roberta.encoder.layer.6.attention.self.key.bias', 'module.roberta.encoder.layer.22.output.dense.weight', 'module.roberta.encoder.layer.3.attention.self.key.bias', 'module.roberta.encoder.layer.15.attention.self.value.bias', 'module.roberta.encoder.layer.15.attention.self.query.bias', 'module.roberta.encoder.layer.2.attention.self.value.bias']
I looked extensively for the cause of this problem and so far couldn't find a solution. Some claim this is just a warning and that nothing is wrong; however, I was suspicious and did some manual checks, and the model indeed seems untrained.
I'm using Trainer.save_model('save_here') after training, and RobertaForTokenClassification.from_pretrained('save_here', local_files_only=True) to reload it.
However, the results show that the model is clearly not loading correctly.
Training code:
trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=ds_train,
    eval_dataset=ds_valid,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
trainer.evaluate()
trainer.save_model('save_here')
This results in an evaluation loss of 0.002.
Reloading and re-evaluation:
model = RobertaForTokenClassification.from_pretrained('save_here', local_files_only=True)
tokenizer = AutoTokenizer.from_pretrained('tokenizers_saved')
dl_valid = DataLoader(ds_valid, batch_size=Config.batch_size, shuffle=True)
with torch.no_grad():
    for index, data in enumerate(dl_valid):
        batch_input_ids = data['input_ids'].to(device, dtype=torch.long)
        batch_att_mask = data['attention_mask'].to(device, dtype=torch.long)
        batch_target = data['label_ids'].to(device, dtype=torch.long)
        output = model(batch_input_ids, token_type_ids=None, attention_mask=batch_att_mask, labels=batch_target)
        step_loss, eval_prediction = output['loss'], output['logits']
        eval_prediction = np.argmax(eval_prediction.detach().to('cpu').numpy(), axis=2)
        predictions.append(eval_prediction)
        reals.append(batch_target)
        eval_loss += step_loss
print(eval_loss)
This results in a loss between 0.9 and 1.2 (it varies randomly from one load to the next).

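I found out what was wrong.
I'll share it here, since others may run into the same issue.
My problem was that I had wrapped my model in DataParallel: model = nn.DataParallel(model)
DataParallel keeps the real model under .module, so every saved weight name gets a 'module.' prefix (visible in the warning above), and from_pretrained does not match those names. It seems that Trainer can't save such a wrapped model in a way that loads back the usual way.
As a workaround: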
model = trainer.model
model.module.save_pretrained('save_here')
....
# afterwards in another machine
....
model = RobertaForTokenClassification.from_pretrained('save_here')
I still think that this should be handled differently.
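If a checkpoint was already saved with the 'module.' prefix and retraining is not an option, the key names can in principle be repaired after the fact. The following is a minimal sketch under that assumption, not something the answer above describes: strip_module_prefix is a hypothetical helper, the values are plain numbers standing in for tensors, and 'module.classifier.weight' is an illustrative key name.

```python
from collections import OrderedDict


def strip_module_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading `prefix` removed from each key."""
    cleaned = OrderedDict()
    for name, value in state_dict.items():
        if name.startswith(prefix):
            name = name[len(prefix):]
        cleaned[name] = value
    return cleaned


# Stand-in values; a real state dict maps these names to tensors.
saved = OrderedDict([
    ("module.roberta.encoder.layer.10.output.dense.bias", 0.1),
    ("module.classifier.weight", 0.2),
])
print(list(strip_module_prefix(saved)))
```

With PyTorch, the cleaned dictionary could then be passed to model.load_state_dict(...) on an unwrapped model; where the saved weights file lives depends on how the checkpoint was written.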

I got 'ValueError: No gradients provided for any variable:'

When I train an ML model that my other team members have no problem with, I get 'ValueError: No gradients provided for any variable:'.
The full error message is below:
ValueError: No gradients provided for any variable: ['dense/kernel:0', 'dense/bias:0', 'lstm/lstm_cell/kernel:0', 'lstm/lstm_cell/recurrent_kernel:0', 'lstm/lstm_cell/bias:0', 'dense_1/kernel:0', 'dense_1/bias:0', 'dense_2/kernel:0', 'dense_2/bias:0'].
Below is the Jupyter Notebook block that raises the error:
model.layers[2]
model.layers[2].set_weights([embedding_matrix])
model.layers[2].trainable = False
model.compile(loss='categorical_crossentropy', optimizer='adam')
epochs = 10
number_pics_per_bath = 3
steps = len(train_descriptions)//number_pics_per_bath
for i in range(epochs):
    generator = data_generator(train_descriptions, train_features, wordtoix, max_length, number_pics_per_bath)
    model.fit_generator(generator, epochs=1, steps_per_epoch=steps, verbose=1)
    model.save('C:/MyAI/archive/model_' + str(i) + '.h5')
You may also want to see my full code. That is easy, because I copied almost all of it from the GitHub notebook below:
https://github.com/hlamba28/Automatic-Image-Captioning/blob/master/Automatic%20Image%20Captioning.ipynb
Because my file paths and names differ from those in the GitHub code, I changed it slightly, but I did not change any of its logic.
My other team member said he had no problem with our (slightly changed) code, and showed me that the model trains well and produces the intended results.
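The question does not show data_generator, so the following is only a guess at the cause. A commonly reported trigger for this error in TensorFlow 2.x is a generator that yields a list such as [[X1, X2], y]: Keras can then treat the whole list as inputs with no targets, which leaves the loss without gradients. Yielding a tuple (inputs, targets) avoids the ambiguity, and a difference in TensorFlow versions would be consistent with a teammate not seeing the error. The wrapper name as_tuple_batches and the demo generator are hypothetical.

```python
def as_tuple_batches(generator):
    """Re-yield each [inputs, targets] batch as an (inputs, targets) tuple."""
    for batch in generator:
        inputs, targets = batch
        yield (inputs, targets)


def demo_list_generator():
    # Stand-in for data_generator(...); plain lists replace the real arrays.
    yield [[[0.1, 0.2], [1, 2]], [0, 1]]
    yield [[[0.3, 0.4], [3, 4]], [1, 0]]


for inputs, targets in as_tuple_batches(demo_list_generator()):
    print(type(inputs).__name__, targets)
```

In the notebook, this would mean wrapping the existing call, for example generator = as_tuple_batches(data_generator(...)), before passing it to model.fit_generator.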

Use TPU in Google Colab

I am currently training a neural network with the help of a TPU.
I changed the runtime type and initialized the TPU.
I have the feeling that it is still not any faster. I followed https://www.tensorflow.org/guide/tpu.
Did I do something wrong?
# TPU initialization
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='grpc://' + os.environ['COLAB_TPU_ADDR'])
tf.config.experimental_connect_to_cluster(resolver)
# This is the TPU initialization code that has to be at the beginning.
tf.tpu.experimental.initialize_tpu_system(resolver)
print("All devices: ", tf.config.list_logical_devices('TPU'))
.
.
.
# experimental_steps_per_execution = 50
model.compile(optimizer=Adam(lr=learning_rate), loss='binary_crossentropy', metrics=['accuracy'], experimental_steps_per_execution = 50)
The summary of my model
Is there anything I still have to consider or adjust?

You need to create a TPU strategy:
strategy = tf.distribute.TPUStrategy(resolver)
Then build and compile the model inside the strategy's scope, so that its variables are created under the TPU strategy:
with strategy.scope():
    model = create_model()
    model.compile(optimizer='adam',
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                  metrics=['sparse_categorical_accuracy'])
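The pattern in this answer can be wrapped in a small helper so that model creation and compilation always happen inside the scope. This is a sketch only: build_in_scope is a hypothetical name, and FakeStrategy and FakeModel are stand-ins that let the example run without TensorFlow or a TPU.

```python
from contextlib import nullcontext


def build_in_scope(strategy, create_model, **compile_kwargs):
    """Create and compile a model inside strategy.scope()."""
    with strategy.scope():
        model = create_model()
        model.compile(**compile_kwargs)
    return model


class FakeStrategy:
    """Stand-in for tf.distribute.TPUStrategy so the sketch runs anywhere."""

    def __init__(self):
        self.entered = 0

    def scope(self):
        self.entered += 1
        return nullcontext()


class FakeModel:
    """Stand-in for a Keras model; it only records its compile arguments."""

    def __init__(self):
        self.compiled_with = None

    def compile(self, **kwargs):
        self.compiled_with = kwargs


strategy = FakeStrategy()
model = build_in_scope(strategy, FakeModel, optimizer="adam")
print(model.compiled_with)
```

On Colab, the same helper would be called with tf.distribute.TPUStrategy(resolver) and the real create_model function, using the compile arguments shown in the answer.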

Failed to convert a NumPy array to a Tensor (Unsupported object type tensorflow.python.framework.ops.EagerTensor)

I'm trying to implement a simple recurrent network using TensorFlow, but am receiving the above error. I've looked through several answers related to the:
"Failed to convert a NumPy array to a Tensor (Unsupported object type ____)"
error, but none so far have addressed "tensorflow.python.framework.ops.EagerTensor" as the unsupported type. I am receiving this error after trying to implement code from this tutorial (albeit with a different data-set).
The error occurs on the history = model.fit line:
# Define the network
epochs_qty = 50
batch_size_qty = 72
model = Sequential()
model.add(LSTM(epochs_qty, input_shape = (train_X.shape[1], train_X.shape[2])))
model.add(Dense(1))
model.compile(loss = 'mae', optimizer = 'adam')
# Fit the network
history = model.fit(train_X, train_y, epochs = epochs_qty, batch_size = batch_size_qty, validation_data = (test_X, test_y), verbose = 2, shuffle = False)
The data sets have the following shapes:
print(train_X.shape, train_y.shape, test_X.shape, test_y.shape)
>> (1762, 1, 2) (1762,) (588, 1, 2) (588,)
I am running the following versions:
Python 3.7.9
Windows 10
tensorflow-gpu-2.3.1
CUDA Toolkit 10.1 Update 1
cuDNN v8.0.3 for CUDA 10.1
I have tried disabling eager execution, but this leads to a pile of additional errors, and does not seem optimal for future code development.
Also, I have tried running this code both locally and through a jupyter notebook. Both result in the exact same error, so it seems like my software setup is not the issue.
Can anyone please suggest where to look next for the cause of this error?

keras doesn't load the model and weights when using checkpoint

I'm using Keras to build a deep autoencoder. I used its checkpointer to save and load the model and the weights, but the result is always None, which I think means that the checkpoint doesn't work correctly and is not saving the weights.
Here is my code:
checkpointer = ModelCheckpoint(filepath="weights.best.h5",
                               verbose=0,
                               save_best_only=True)
tensorboard = TensorBoard(log_dir='/tmp/autoencoder',
                          histogram_freq=0,
                          write_graph=True,
                          write_images=True)
input_enc = Input(shape=(input_size,))
hidden_1 = Dense(hidden_size1, activation='relu')(input_enc)
hidden_11 = Dense(hidden_size2, activation='relu')(hidden_1)
code = Dense(code_size, activation='relu')(hidden_11)
hidden_22 = Dense(hidden_size2, activation='relu')(code)
hidden_2 = Dense(hidden_size1, activation='relu')(hidden_22)
output_enc = Dense(input_size, activation='tanh')(hidden_2)
autoencoder_yes = Model(input_enc, output_enc)
autoencoder_yes.compile(optimizer='adam',
                        loss='mean_squared_error',
                        metrics=['accuracy'])
history_yes = autoencoder_yes.fit(df_noyau_norm_y, df_noyau_norm_y,
                                  epochs=200,
                                  batch_size=batch_size,
                                  shuffle=True,
                                  validation_data=(df_test_norm_y, df_test_norm_y),
                                  verbose=1,
                                  callbacks=[checkpointer, tensorboard]).history
autoencoder_yes.save_weights("weights.best.h5")
print(autoencoder_yes.load_weights("weights.best.h5"))
Can somebody help me find out a way to resolve the problem?
Thanks

No, your interpretation of load_weights returning None is not correct. load_weights is a procedure: it does not return anything, and if you assign or print the result of such a call, you get None.
So weight saving is probably working fine; it's just your interpretation that is wrong.
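The same behavior can be seen with any Python method that changes an object in place and has no return statement. The class below is a hypothetical stand-in written for illustration, not the Keras API.

```python
class TinyModel:
    """Hypothetical stand-in for a model; not the real Keras class."""

    def __init__(self):
        self.weights = []

    def load_weights(self, new_weights):
        # Updates the object in place and has no return statement.
        self.weights = list(new_weights)


model = TinyModel()
result = model.load_weights([0.5, -1.0])
print(result)         # None: the call returned nothing
print(model.weights)  # [0.5, -1.0]: the state was still updated
```

The None that gets printed says nothing about whether loading succeeded; the loaded state has to be inspected on the model itself.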

You should use save_weights_only=True. Without it, the whole model is saved, not just the weights. To be able to load the weights, save them like this:
checkpointer = ModelCheckpoint(filepath="weights.best.h5",
                               verbose=0, save_weights_only=True,
                               save_best_only=True)

This is expected behavior, not an error. autoencoder_yes.load_weights("weights.best.h5") doesn't actually return anything, so if you print the result of this call you will get None as output.
Expected behavior
In the code that you have provided, you have trained the model and saved the weights. So, the autoencoder_yes is a keras.Model object that has the fine-tuned weights.
In the same script if you load the saved weights once again, nothing is supposed to happen, the weights that you saved will get loaded again.
For clarity
Start with another fresh script, build the same model architecture and reload the weights from the h5 file and then do some predictions. In that case it will silently load the pre-trained weights and do the predictions according to that.
