Use TPU in Google Colab - python

I am currently training a neural network on a TPU.
I changed the runtime type and initialized the TPU, but I have the feeling that it is still not any faster. I followed https://www.tensorflow.org/guide/tpu.
Did I do something wrong?
# TPU initialization
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='grpc://' + os.environ['COLAB_TPU_ADDR'])
tf.config.experimental_connect_to_cluster(resolver)
# This is the TPU initialization code that has to be at the beginning.
tf.tpu.experimental.initialize_tpu_system(resolver)
print("All devices: ", tf.config.list_logical_devices('TPU'))
...
# experimental_steps_per_execution = 50
model.compile(optimizer=Adam(lr=learning_rate), loss='binary_crossentropy', metrics=['accuracy'], experimental_steps_per_execution = 50)
The summary of my model
Is there anything else I need to consider or adjust?

You need to create a TPU strategy:
strategy = tf.distribute.TPUStrategy(resolver)
and then build and compile the model inside the strategy scope:
with strategy.scope():
    model = create_model()
    model.compile(optimizer='adam',
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                  metrics=['sparse_categorical_accuracy'])
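Adapted to the code in the question, a minimal end-to-end sketch could look like this (create_model and learning_rate are placeholders for your own model and settings; in recent TF releases the compile argument is named steps_per_execution, while older releases use experimental_steps_per_execution as in the question):

import os
import tensorflow as tf

# TPU initialization (same as in the question).
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='grpc://' + os.environ['COLAB_TPU_ADDR'])
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)

# Build and compile the model inside the TPU strategy scope.
strategy = tf.distribute.TPUStrategy(resolver)
with strategy.scope():
    model = create_model()  # placeholder for your own model-building function
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
                  loss='binary_crossentropy',
                  metrics=['accuracy'],
                  steps_per_execution=50)  # experimental_steps_per_execution on older TF versions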

Related

How to stop CUDA from re-initializing for every subprocess which trains a keras model?

I am using CUDA/CUDNN to train multiple tensorflow keras models on my GPU (for an evolutionary algorithm attempting to optimize hyperparameters). Initially, the program would crash with an Out of Memory error after a couple generations. Eventually, I found that using a new sub-process for every model would clear the GPU memory automatically.
However, each process seems to reinitialize CUDA (loading dynamic libraries from the .dll files), which is incredibly time-consuming. Is there any method to avoid this?
Code is pasted below. The function "fitness_wrapper" is called for each individual.
def fitness_wrapper(indiv):
    fit = multiprocessing.Value('d', 0.0)
    if __name__ == '__main__':
        process = multiprocessing.Process(target=fitness, args=(indiv, fit))
        process.start()
        process.join()
    return (fit.value,)
def fitness(indiv, fit):
    model = tf.keras.Sequential.from_config(indiv['architecture'])
    optimizer_dict = indiv['optimizer']
    opt = tf.keras.optimizers.Adam(learning_rate=optimizer_dict['lr'],
                                   beta_1=optimizer_dict['b1'],
                                   beta_2=optimizer_dict['b2'],
                                   epsilon=optimizer_dict['epsilon'])
    model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])
    model.fit(data_split[0], data_split[2], batch_size=32, epochs=5)
    fit = model.evaluate(data_split[1], data_split[3])[1]
It turns out the solution is to simply call tf.keras.backend.clear_session() after each model (rather than using subprocesses). I had tried this before and it didn't work, but for some reason this time it fixed everything.
Apparently you should also delete the model and call tf.compat.v1.reset_default_graph():
def fitness(indiv, fit):
    model = tf.keras.Sequential.from_config(indiv['architecture'])
    optimizer_dict = indiv['optimizer']
    opt = tf.keras.optimizers.Adam(learning_rate=optimizer_dict['lr'],
                                   beta_1=optimizer_dict['b1'],
                                   beta_2=optimizer_dict['b2'],
                                   epsilon=optimizer_dict['epsilon'])
    model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])
    model.fit(data_split[0], data_split[2], batch_size=32, epochs=5)
    fit = model.evaluate(data_split[1], data_split[3])[1]

    # Free the GPU memory held by this model before the next individual is evaluated.
    del model
    tf.keras.backend.clear_session()
    tf.compat.v1.reset_default_graph()

    return fit
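For context, a hypothetical driver loop over the population could then call fitness directly, without any subprocesses (population is assumed to be a list of individual dicts with 'architecture' and 'optimizer' keys, as above):

scores = []
for indiv in population:
    # Each call trains, evaluates, and then frees the GPU memory it used.
    accuracy = fitness(indiv, fit=None)  # the fit argument is unused in this version
    scores.append(accuracy)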

Running keras model on colab TPU

I am currently trying to train a convolutional neural network (CNN) using Keras and the Google Colab GPU.
I found this article, which discussed an option to reduce the time needed to train the model. Since the current training on the GPU is very slow, I tried to implement the method from the article. I have the following code:
sgd = optimizers.SGD(lr=0.02)
model.compile(optimizer=sgd,loss='categorical_crossentropy',metrics=['accuracy'])
def create_train_subsets():
    X_train = []
    y_train = []
    for i in range(80):
        cat = i + 1
        path = 'train_set/by_cat/{}'.format(cat)
        for img in os.listdir(path):
            actual_image = Image.open("train_set/by_cat/{}/{}".format(cat, img))
            X_train.append(actual_image)
            y_train.append(cat)
    return X_train, y_train
x_train, y_train = create_train_subsets()

# This address identifies the TPU we'll use when configuring TensorFlow.
TPU_WORKER = 'grpc://' + os.environ['COLAB_TPU_ADDR']
tf.logging.set_verbosity(tf.logging.INFO)

tpu_model = tf.contrib.tpu.keras_to_tpu_model(
    model,
    strategy=tf.contrib.tpu.TPUDistributionStrategy(
        tf.contrib.cluster_resolver.TPUClusterResolver(TPU_WORKER)))

history = tpu_model.fit(x_train, y_train,
                        epochs=20,
                        batch_size=128 * 8,
                        validation_split=0.2)
tpu_model.save_weights('./tpu_model.h5', overwrite=True)
# tpu_model.evaluate(x_test, y_test, batch_size=128 * 8)
This code however gives back the following error:
InvalidArgumentError: No OpKernel was registered to support Op 'ConfigureDistributedTPU' used by node ConfigureDistributedTPU (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) with these attrs: [tpu_embedding_config="", is_global_init=false, embedding_config=""]
Registered devices: [CPU, XLA_CPU]
Registered kernels:
<no registered kernels>
[[ConfigureDistributedTPU]]
I did an extensive search online but I can't find any indication of what it means, and I don't understand the process well enough to work out the meaning of the error myself.
Is there anybody out there who can help me understand what is wrong and perhaps how to solve it?
Thank you in advance!

keras load model error trying to load a weight file containing 17 layers into a model with 0 layers

I am currently working with a VGG16 model in Keras.
I fine-tune the VGG model with some layers of my own.
After fitting (training) the model, I save it with model.save('name.h5').
It saves without any problem.
However, when I try to reload the model with the load_model function, I get this error:
You are trying to load a weight file containing 17 layers into a model
with 0 layers
Has anyone run into this problem before?
My Keras version is 2.2.
Here is part of my code:
from keras.models import load_model

vgg_model = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))

global model_2
model_2 = Sequential()
for layer in vgg_model.layers:
    model_2.add(layer)
for layer in model_2.layers:
    layer.trainable = False

model_2.add(Flatten())
model_2.add(Dense(128, activation='relu'))
model_2.add(Dropout(0.5))
model_2.add(Dense(2, activation='softmax'))

model_2.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model_2.fit(x=X_train, y=y_train, batch_size=32, epochs=30, verbose=2)
model_2.save('name.h5')
del model_2
model_2 = load_model('name.h5')
(I don't actually delete the model and then call load_model immediately; this is just to show the problem.)
It seems that this problem is related to the input_shape parameter of the first layer. I had this problem with a wrapper layer (Bidirectional) that did not have an input_shape parameter set. In code:
model.add(Bidirectional(LSTM(units=units, input_shape=(None, feature_size)), merge_mode='concat'))
did not work for loading my old model, because input_shape is only defined for the LSTM layer, not the outer wrapper. Instead
model.add(Bidirectional(LSTM(units=units), input_shape=(None, feature_size), merge_mode='concat'))
worked, because the Bidirectional wrapper layer now has an input_shape parameter. Maybe you should check whether the VGG net's input_shape parameter is set, or add a single input layer to your model with the correct input_shape.
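Applied to the VGG16 code above, that suggestion might look like the following sketch (not the exact fix from the answer; it simply gives the Sequential model an explicit InputLayer before the VGG layers are added):

from keras.models import Sequential
from keras.layers import InputLayer, Flatten, Dense, Dropout
from keras.applications import VGG16

vgg_model = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))

model_2 = Sequential()
model_2.add(InputLayer(input_shape=(224, 224, 3)))  # explicit input layer with the correct input_shape
for layer in vgg_model.layers:
    model_2.add(layer)
model_2.add(Flatten())
model_2.add(Dense(128, activation='relu'))
model_2.add(Dropout(0.5))
model_2.add(Dense(2, activation='softmax'))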
I spent 6 hours looking for a solution to apply my trained model.
Finally, I tried using VGG16 as the model and loading the h5 weights I had trained myself, and it worked:
weights_model = 'C:/Anaconda/weightsnew2.h5'  # my already trained .h5 weights

vgg = applications.vgg16.VGG16()
cnn = Sequential()
for capa in vgg.layers:
    cnn.add(capa)
cnn.layers.pop()  # drop VGG16's own 1000-class output layer
for layer in cnn.layers:
    layer.trainable = False
cnn.add(Dense(2, activation='softmax'))
cnn.load_weights(weights_model)

def predict(file):
    x = load_img(file, target_size=(longitud, altura))
    x = img_to_array(x)
    x = np.expand_dims(x, axis=0)
    array = cnn.predict(x)
    result = array[0]
    respuesta = np.argmax(result)
    if respuesta == 0:
        print("Gato")   # cat
    elif respuesta == 1:
        print("Perro")  # dog
In case anyone is still wondering about this error:
I had the same problem and spent days figuring out what was causing it. I have a copy of my whole code and dataset on another system, on which it worked. I noticed that it is something about the training, because without training the model, saving and loading were no problem.
The only difference between my systems was that I was using tensorflow-gpu on my main system, and for this reason the TensorFlow base version was a little lower (1.14.0 instead of 2.2.0). So all I had to do was use
model.fit_generator()
instead of
model.fit()
before saving the model, and it works.
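If you want to try that suggestion, a minimal fit_generator sketch might look like this (the ImageDataGenerator pipeline, directory name, and image size are assumptions, not the poster's code; on newer TensorFlow versions fit_generator is deprecated and model.fit accepts generators directly):

from keras.preprocessing.image import ImageDataGenerator

# Hypothetical data pipeline feeding the model_2 defined in the question.
datagen = ImageDataGenerator(rescale=1. / 255)
train_generator = datagen.flow_from_directory('train_data_dir',
                                              target_size=(224, 224),
                                              batch_size=32,
                                              class_mode='categorical')

model_2.fit_generator(train_generator,
                      steps_per_epoch=len(train_generator),
                      epochs=30,
                      verbose=2)
model_2.save('name.h5')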

keras doesn't load the model and weights when using checkpoint

I'm using Keras to build a deep autoencoder. I use a ModelCheckpoint callback to save the model and the weights, but loading the weights always returns None, which I think means the checkpoint doesn't work correctly and isn't saving the weights.
Here is the code how I proceed:
checkpointer = ModelCheckpoint(filepath="weights.best.h5",
                               verbose=0,
                               save_best_only=True)
tensorboard = TensorBoard(log_dir='/tmp/autoencoder',
                          histogram_freq=0,
                          write_graph=True,
                          write_images=True)

input_enc = Input(shape=(input_size,))
hidden_1 = Dense(hidden_size1, activation='relu')(input_enc)
hidden_11 = Dense(hidden_size2, activation='relu')(hidden_1)
code = Dense(code_size, activation='relu')(hidden_11)
hidden_22 = Dense(hidden_size2, activation='relu')(code)
hidden_2 = Dense(hidden_size1, activation='relu')(hidden_22)
output_enc = Dense(input_size, activation='tanh')(hidden_2)

autoencoder_yes = Model(input_enc, output_enc)
autoencoder_yes.compile(optimizer='adam',
                        loss='mean_squared_error',
                        metrics=['accuracy'])

history_yes = autoencoder_yes.fit(df_noyau_norm_y, df_noyau_norm_y,
                                  epochs=200,
                                  batch_size=batch_size,
                                  shuffle=True,
                                  validation_data=(df_test_norm_y, df_test_norm_y),
                                  verbose=1,
                                  callbacks=[checkpointer, tensorboard]).history

autoencoder_yes.save_weights("weights.best.h5")
print(autoencoder_yes.load_weights("weights.best.h5"))
Can somebody help me find out a way to resolve the problem?
Thanks
No, your interpretation of load_weights returning None is not correct. load_weights is a procedure: it does not return anything, so if you assign (or print) its return value, you get None.
So weight saving is probably working fine; it's just your interpretation that is wrong.
You should use save_weights_only=True. Without it, the whole model is saved, not just the weights. To be able to load the weights with load_weights, save them like this:
checkpointer = ModelCheckpoint(filepath="weights.best.h5",
                               verbose=0,
                               save_weights_only=True,
                               save_best_only=True)
This is expected behavior, not an error. autoencoder_yes.load_weights("weights.best.h5") doesn't actually return anything, so if you print the output of that call you will get None.
Expected behavior
In the code you have provided, you have trained the model and saved the weights, so autoencoder_yes is a keras.Model object that already holds the trained weights.
If you load the saved weights again in the same script, nothing visible is supposed to happen; the weights you saved simply get loaded again.
For clarity
Start with a fresh script, build the same model architecture, reload the weights from the h5 file, and then run some predictions. In that case it will silently load the pre-trained weights and make predictions with them.
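A rough sketch of that fresh-script approach (input_size, hidden_size1, hidden_size2, code_size, and df_test_norm_y are the same names as in the question and are assumed to be defined):

from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

# Rebuild the same architecture as in the question, then load the checkpointed weights.
def build_autoencoder(input_size, hidden_size1, hidden_size2, code_size):
    inp = Input(shape=(input_size,))
    h1 = Dense(hidden_size1, activation='relu')(inp)
    h11 = Dense(hidden_size2, activation='relu')(h1)
    code = Dense(code_size, activation='relu')(h11)
    h22 = Dense(hidden_size2, activation='relu')(code)
    h2 = Dense(hidden_size1, activation='relu')(h22)
    out = Dense(input_size, activation='tanh')(h2)
    return Model(inp, out)

autoencoder = build_autoencoder(input_size, hidden_size1, hidden_size2, code_size)
autoencoder.compile(optimizer='adam', loss='mean_squared_error')
autoencoder.load_weights("weights.best.h5")   # returns None; the weights are loaded in place
reconstructions = autoencoder.predict(df_test_norm_y)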

How do I add a top dense layer to ResNet50 in Keras?

I read this very helpful Keras tutorial on transfer learning here:
https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html
I am thinking that this is probably very applicable to the fish data here, and started going down that route. I tried to follow the tutorial as closely as I could. The code is a mess, as I was just trying to figure out how everything works, but it can be found here:
https://github.com/MrChristophRivera/ClassifiyingFish/blob/master/notebooks/Anthony/Resnet50%2BTransfer%20Learning%20Attempt.ipynb
For brevity, here are the steps I took:
# Load ResNet50 without its classification head.
model = ResNet50(include_top=False, weights="imagenet")

# I would resize the images to the standard input size of ResNet50.
datagen = ImageDataGenerator(rescale=1. / 255)
generator = datagen.flow_from_directory(
    train_data_dir,
    target_size=(img_width, img_height),
    batch_size=32,
    class_mode=None,
    shuffle=False)

# Predict on the training data to get the bottleneck features.
bottleneck_features_train = model.predict_generator(generator, nb_train_samples)
print(bottleneck_features_train)

file_name = join(save_directory, 'tbottleneck_features_train.npy')
np.save(open(file_name, 'wb'), bottleneck_features_train)

# Then I would use this output to feed my top layer and train it.
# Let's say I defined it like so:
top_model = Sequential()
# Skipping some layers for brevity
top_model.add(Dense(8, activation='relu'))
top_model.fit(train_data, train_labels)
top_model.save_weights(top_model_weights_path)
At this point, I have the weights saved. The next step would be to add the top model to ResNet50. The tutorial simply did it like so:
# The VGG16 model defined via Sequential is called bottom_model.
bottom_model.add(top_model)
The problem is that when I try to do this, it fails because "model does not have property add". My guess is that ResNet50 was defined in a different way. At any rate, my question is: how can I add this top model with the loaded weights to the bottom model? Can anyone give helpful pointers?
Try:
input_to_model = Input(shape=shape_of_your_image)
base_model = model(input_to_model)
top_model = Flatten()(base_model)
top_model = Dense(8, activation='relu')(top_model)
...
Your problem comes from the fact that ResNet50 is defined with the so-called functional API. I would also advise you to use a different activation function, because having relu as the output activation might cause problems. Moreover, your model is not compiled.
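Putting that advice together, a sketch of the full functional-API assembly might look like this (the 224x224 input size, softmax output over the 8 fish classes, and optimizer are assumptions for illustration, not the tutorial's exact settings):

from tensorflow.keras.applications import ResNet50
from tensorflow.keras.layers import Input, Flatten, Dense
from tensorflow.keras.models import Model

# Base network without its classification head; weights frozen for transfer learning.
inputs = Input(shape=(224, 224, 3))
base_model = ResNet50(include_top=False, weights='imagenet', input_shape=(224, 224, 3))
base_model.trainable = False

x = base_model(inputs)          # call the pre-built model like a layer (functional API)
x = Flatten()(x)
x = Dense(256, activation='relu')(x)
outputs = Dense(8, activation='softmax')(x)   # softmax rather than relu at the output

full_model = Model(inputs, outputs)
full_model.compile(optimizer='adam',
                   loss='categorical_crossentropy',
                   metrics=['accuracy'])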
