I have a Keras model that was trained on 8 GPUs, so the model contains blocks like with tf.device('gpu:0'). Now I want to apply transfer learning on another PC that has 4 GPUs. However, this results in an error, most likely because the model was trained on more GPUs (error: could not set cudnn tensor descriptor: CUDNN_STATUS_BAD_PARAM). In the error log I can also see a warning that TensorFlow is trying to colocate gradients on devices GPU 0-7. Is there a way to adapt or clear the devices in a trained model that was configured with Keras?
FYI: I don't have a meta graph file, because the model was also saved with Keras and not with the TensorFlow saver function.
Current attempts
I tried to change the layer properties, but this did not work:
import numpy as np

track = 0
for i in range(len(model.layers)):
    if model.layers[i].name[:6] == 'lambda':
        model.layers[i].arguments['n_gpus'] = n_gpus
        if model.layers[i].arguments['part'] > n_gpus - 1:
            model.layers[i].arguments['part'] = np.arange(n_gpus)[track]
            track += 1
            if track > n_gpus - 1:
                track = 0
In addition, I tried to set the number of visible devices, which also didn't work:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = "0,1,2,3"
Script to create a model split over 8 GPUs
"""
to_multi_gpu & slice_batch by: https://github.com/fchollet/keras/issues/2436
baseline_model by: http://machinelearningmastery.com/
"""
from keras import backend as K
from keras.models import Sequential, Model
from keras.layers import Dense, Input, Lambda, merge
import tensorflow as tf
def slice_batch(x, n_gpus, part):
    """
    Divide the input batch into [n_gpus] slices, and obtain slice no. [part],
    i.e. if len(x)=10, then slice_batch(x, 2, 1) will return x[5:].
    x: input batch (input shape of model)
    n_gpus: number of gpus
    part: id of current gpu
    return: the batch slice for the given gpu
    """
    sh = K.shape(x)
    L = sh[0] // n_gpus
    if part == n_gpus - 1:
        return x[part*L:]
    return x[part*L:(part+1)*L]
def to_multi_gpu(model, n_gpus):
    """
    Given a keras [model], return an equivalent model which parallelizes
    the computation over [n_gpus] GPUs.
    Each GPU gets a slice of the input batch, applies the model on that
    slice, and the outputs are later concatenated into a single tensor,
    so the user sees a model that behaves the same as the original.
    model: sequential model created with the Keras library
    n_gpus: number of gpus
    return: model divided over n_gpus
    """
    # Only divide the model over multiple gpus if there is more than one
    if n_gpus > 1:
        with tf.device('/cpu:0'):
            x = Input(model.input_shape[1:])  # , name=model.input_names[0]
        towers = []
        # Divide the model over the gpus
        for g in range(n_gpus):
            # Work on GPU number g.
            with tf.device('/gpu:' + str(g)):
                # Obtain the g-th slice of the batch.
                slice_g = Lambda(slice_batch, lambda shape: shape,
                                 arguments={'n_gpus': n_gpus, 'part': g})(x)
                # Apply the model to the batch slice.
                towers.append(model(slice_g))
        # Merge the multi-gpu outputs on the cpu
        with tf.device('/cpu:0'):
            merged = merge(towers, mode='concat', concat_axis=0)
        return Model(input=[x], output=merged)
    else:
        return model
def baseline_model(num_pixels, num_classes, n_gpus):
    # Create the model
    model = Sequential()
    model.add(Dense(num_pixels, input_dim=num_pixels, init='normal', activation='relu'))
    model.add(Dense(num_classes, init='normal', activation='softmax'))
    model = to_multi_gpu(model, n_gpus)
    # Compile the model
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

if __name__ == '__main__':
    model = baseline_model(784, 9, 8)
Using the settings below solved the issue. However, the model now runs on the CPU instead of the GPU. Since I am only fine-tuning the last layer of this model, this is not a big problem. But if you want to re-load and train the complete model, this answer might not be satisfactory.
The important settings are os.environ['CUDA_VISIBLE_DEVICES'] = "" and allow_soft_placement=True.
The first masks all the GPUs, and the second makes TensorFlow automatically allocate the model on the available devices (in this case the CPU).
Sample code
import os
os.environ['CUDA_VISIBLE_DEVICES'] = ""
import tensorflow as tf
from keras.models import load_model
from keras import backend as K

if __name__ == '__main__':
    model = load_model('baseline_model.h5')
    init = tf.global_variables_initializer()
    gpu_options = tf.GPUOptions(allow_growth=True)
    # Add ops to save and restore all the variables.
    with tf.Session(config=tf.ConfigProto(gpu_options=gpu_options, allow_soft_placement=True,
                                          log_device_placement=True)) as sess:
        K.set_session(sess)
        sess.run(init)
        tf.train.start_queue_runners(sess=sess)
        # Call model.fit here
        sess.close()
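If you do need the GPUs for full retraining, one possible direction (an untested sketch on my part) is to load the model on the CPU as above, pull the nested single-tower Sequential model out of the wrapper, and re-wrap it with to_multi_gpu() from the script above for the new GPU count. Note that loading a model containing Lambda layers may also require slice_batch to be importable:

from keras.models import load_model, Sequential

# Untested sketch: assumes the file was produced by baseline_model() above,
# where the original Sequential model is nested as a layer of the wrapper.
loaded = load_model('baseline_model.h5')
inner = next(l for l in loaded.layers if isinstance(l, Sequential))

# Re-wrap the trained tower for the 4 available GPUs and recompile
model4 = to_multi_gpu(inner, n_gpus=4)
model4.compile(loss='categorical_crossentropy', optimizer='adam',
               metrics=['accuracy'])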
Related
I have created a sequential model using the Keras package, similar to this:
from keras.layers import Dense
from keras.models import Sequential

model = Sequential()
# Adding the input layer and the first hidden layer
model.add(Dense(6, activation='relu', input_dim=11))
# Adding the second hidden layer (the real model contains many hidden layers)
model.add(Dense(6, activation='relu'))
# Adding the output layer
model.add(Dense(1, activation='sigmoid'))
Then I used keras visualizer to get a visualization of the neural network without weights.
# Compiling the ANN
model.compile(optimizer='Adamax', loss='binary_crossentropy', metrics=['accuracy'])
model_history = model.fit(X_train, y_train.to_numpy(), batch_size=10, epochs=100)
I want to print the trained weights of the model for this kind of visualization. Is there any library or module that I can use for that? Any suggestion would be helpful. Here is the picture of the designed neural network without the weights printed.
Option 1: deepreplay
There is a workaround in the form of a package called Deep Replay that you can import as a library to address your problem.
Thanks to this package, you can visualize/animate and most likely print the trained weights using the following example:
# install FFMPEG (to generate animations)
#!apt-get install ffmpeg
# install actual deepreplay package
#!pip install deepreplay
from keras.initializers import glorot_normal, glorot_uniform, he_normal, he_uniform
from keras.layers import Dense
from keras.models import Sequential
from deepreplay.callbacks import ReplayData
from deepreplay.datasets.ball import load_data
from deepreplay.plot import compose_plots, compose_animations
from deepreplay.replay import Replay
from matplotlib import pyplot as plt
plt.rcParams['animation.ffmpeg_path'] = '/usr/bin/ffmpeg'
X, y = load_data(n_dims=10)
activation = 'relu'
initializer_name = 'he_uniform'
initializer = eval(initializer_name)(seed=13)
title = 'Activation: ReLU - Initializer: {}'.format(initializer_name)
group_name = 'relu_{}'.format(initializer_name)
filename = f'{group_name}_{initializer_name}_{activation}_weight_initializers.h5'
# Model builder function
def build_model(n_layers, input_dim, units, activation, initializer):
    if isinstance(units, list):
        assert len(units) == n_layers
    else:
        units = [units] * n_layers
    model = Sequential()
    # Adds first hidden layer with input_dim parameter
    model.add(Dense(units=units[0],
                    input_dim=input_dim,
                    activation=activation,
                    kernel_initializer=initializer,
                    name='h1'))
    # Adds remaining hidden layers
    for i in range(2, n_layers + 1):
        model.add(Dense(units=units[i-1],
                        activation=activation,
                        kernel_initializer=initializer,
                        name='h{}'.format(i)))
    # Adds output layer
    model.add(Dense(units=1, activation='sigmoid', kernel_initializer=initializer, name='o'))
    # Compiles the model
    model.compile(loss='binary_crossentropy', optimizer='sgd', metrics=['acc'])
    return model
replaydata = ReplayData(X, y, filename=filename, group_name=group_name)
# Create the MLP model with 5 layers, 10 input features, and 100 units in each hidden layer
model = build_model(n_layers=5, input_dim=10, units=100, activation=activation, initializer=initializer)
# fit the model over 10 epochs with batch size of 16
model.fit(X, y, epochs=10, batch_size=16, callbacks=[replaydata])
# Plot the results
replay = Replay(replay_filename=filename, group_name=group_name)
fig = plt.figure(figsize=(12, 6))
ax_zvalues = plt.subplot2grid((2, 2), (0, 0))
ax_weights = plt.subplot2grid((2, 2), (0, 1))
ax_activations = plt.subplot2grid((2, 2), (1, 0))
ax_gradients = plt.subplot2grid((2, 2), (1, 1))
wv = replay.build_weights(ax_weights)
gv = replay.build_gradients(ax_gradients)
# Z-values
zv = replay.build_outputs(ax_zvalues, before_activation=True,
                          exclude_outputs=True, include_inputs=False)
# Activations
av = replay.build_outputs(ax_activations, exclude_outputs=True, include_inputs=False)
# Save plots
fig = compose_plots([zv, wv, av, gv], epoch=0, title=title)
fig.savefig('part2.png', format='png', dpi=120)
# Animate & save mp4
sample_anim = compose_animations([zv, wv, av, gv])
sample_anim.save('part2.mp4', dpi=120, fps=5)
The output results are visualized using violin plots over the 10 epochs:
The top-right subplot shows how the weights change through the layers over the ten epochs. The other subplots illustrate the behavior of the Z-values, activation functions, and gradients.
Note 1: if you are interested in interpreting violin plots, please check these posts: post1, post2, post3
Note 2: Please note that the training process starts from some initialization, and different initializers can produce different weights at the beginning. Common initialization schemes are as follows:
Random
Xavier / Glorot
He
By default, the kernel initializer is glorot_uniform when you use the Keras module (reference), but you can check this post and the paper Understanding the difficulty of training deep feedforward neural networks for further info. It is also possible to initialize the weights of a NN manually, as sketched below; you can also check this post.
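As a minimal sketch of what manual selection looks like (my own example, reusing the he_uniform initializer from the code above):

from keras.layers import Dense
from keras.initializers import he_uniform

# Explicitly set the kernel initializer instead of relying on the
# glorot_uniform default
layer = Dense(64, activation='relu', kernel_initializer=he_uniform(seed=13))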
Note 3: This package currently has a bug and cannot be run in a Google Colab notebook; this is still an open issue (see its GitHub repo as well as this post on SO). So it is better to try it on your own local machine.
Option 2: wandb
There is another ML tool, called W&B (Weights and Biases), that you can import as a library to address your problem.
Once you sign up and log in to your account following the instructions, you can use this API to track and visualize all the pieces of your ML pipeline, including the weights, biases, and other parameters of your pipeline:
import wandb
from wandb.keras import WandbCallback

# Step 1: Initialize the W&B run
wandb.init(project='project_name')

# Step 2: Save model inputs and hyperparameters
config = wandb.config
config.learning_rate = 0.01

# Model training code here ...
import tensorflow as tf
from tensorflow import keras

loss = tf.keras.losses.MeanSquaredError()
optimiser = tf.keras.optimizers.Adam(learning_rate=0.001)
model.compile(loss=loss, optimizer=optimiser, metrics=['accuracy'])

# Step 3: Add the WandbCallback, which logs losses, metrics, and weights
model.fit(X, y, epochs=10, batch_size=16, callbacks=[WandbCallback()])
Once you run your model, you can check the graph info in the Model section, which is selected/shown on the left side in blue:
Hope this answer helps you out, and if it does, you can accept it as the answer ✅.
I am training an LSTM model with an embedding input layer with a vocabulary size of approximately 100,000. While profiling the training via TensorBoard, I discovered that most of the training time is spent on "Kernel Launch" (58%), followed by "All Others" (36%). In other words, the GPU is idle most of the time due to overhead. The high kernel launch time seems to be driven by the size of the embedding layer.
My question is: how can I improve the training speed? Is it inevitable that most of the training time is spent on kernel launch when working with a large-ish embedding? Increasing the batch size (currently 128) would help, since the kernel launch time doesn't depend on the batch size, but 128 is already on the high side.
I'm also not sure what exactly falls under "All Others".
I am working on a Tesla T4 GPU with TensorFlow 2.2.0, but I see the same behavior using the nightly build.
Following the RNN tutorial on tensorflow.org (https://www.tensorflow.org/tutorials/text/text_classification_rnn), here is an example that highlights the performance issues:
import tensorflow_datasets as tfds
import tensorflow as tf
from datetime import datetime
from tqdm.auto import tqdm
### retrieve data ###
# use imdb_reviews dataset from TFDS
dataset = tfds.load('imdb_reviews',as_supervised=True)
train_dataset, test_dataset = dataset['train'], dataset['test']
### get encoder ###
# initialize tokenizer
tokenizer = tfds.features.text.Tokenizer()
# build vocabulary
def addOrUpdate(d, token):
    d[token] = d.get(token, 0) + 1

vocab = dict()
dataset_iter = iter(train_dataset)
for el in tqdm(dataset_iter):
    text = el[0].numpy().decode("utf-8")
    for token in tokenizer.tokenize(text):
        addOrUpdate(vocab, token)
# shrink vocabulary (MIN_COUNT>1 significantly reduces model dimension)
MIN_COUNT = 1
vocab_subset = set([k for k,v in vocab.items() if v >= MIN_COUNT])
print("Using vocabulary subset with min_count={:}: {:,} words, ".format(MIN_COUNT,len(vocab_subset)))
# create encoder
encoder = tfds.features.text.TokenTextEncoder(vocab_subset)
### Prepare the data for training ###
def encode(text_tensor, label):
    encoded_text = encoder.encode(text_tensor.numpy())
    return encoded_text, label

def encode_map_fn(text, label):
    # encode
    encoded_text, label = tf.py_function(encode,
                                         inp=[text, label],
                                         Tout=(tf.int64, tf.int64))
    # set shapes
    encoded_text.set_shape([None])
    label.set_shape([])
    return encoded_text, label
train_dataset = train_dataset.map(encode_map_fn)
test_dataset = test_dataset.map(encode_map_fn)
BUFFER_SIZE = 25000
BATCH_SIZE = 128
train_dataset = train_dataset.shuffle(BUFFER_SIZE)
train_dataset = train_dataset.padded_batch(BATCH_SIZE)
### create the model ###
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(encoder.vocab_size, 256, mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(1)
])

model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])
### Train the model ###
# create tensorboard callback
log_path = 'logs_'+datetime.now().strftime("%Y%m%d_%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_path,
                                                      profile_batch='10,20')
history = model.fit(train_dataset, epochs=1, steps_per_epoch=30,
                    callbacks=[tensorboard_callback])
Same code in a Colab Notebook: https://colab.research.google.com/drive/1WoAShXR2cGOYWPQoKdh4IGlhZh4FAK7o?usp=sharing
I haven't tried your code, but from looking at it, I guess the following issue might be related:
If a GPU is present but eager execution is enabled, Embedding layers are still placed on the CPU.
See https://github.com/tensorflow/tensorflow/issues/44194 (it includes a workaround).
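As an illustration (my own sketch, not the exact code from that issue), the workaround amounts to building the embedding under an explicit GPU device scope so that eager placement does not put it on the CPU:

import tensorflow as tf

vocab_size = 100000  # example value; use encoder.vocab_size in the code above

# Pin the model (in particular the Embedding) to the GPU explicitly;
# see the linked issue for the authoritative workaround and its caveats.
with tf.device('/GPU:0'):
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, 256, mask_zero=True),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128)),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(1)
    ])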
In my TensorFlow model I have some data that I feed into a stack of CNNs before it goes into a few fully connected layers. I have implemented that with Keras' Sequential model. However, I now have some data that should not go into the CNN; instead, it should be fed directly into the first fully connected layer, because that data contains values and labels that are part of the input data but should not undergo convolutions, as it is not image data.
Is such a thing possible with tensorflow.keras, or should I do that with tensorflow.nn instead? As far as I understand Keras' Sequential models, the input goes in one end and comes out the other, with no special wiring in the middle.
Am I correct that, to do this, I have to use tensorflow.concat on the data from the last CNN layer and the data that bypasses the CNNs before feeding it into the first fully connected layer?
Here is a simple example in which the combining operation is to sum the activations from different subnets:
import keras
import numpy as np
import tensorflow as tf
from keras.layers import Input, Dense, Activation
tf.reset_default_graph()
# this represents your cnn model
def nn_model(input_x):
    feature_maker = Dense(10, activation='relu')(input_x)
    feature_maker = Dense(20, activation='relu')(feature_maker)
    feature_maker = Dense(1, activation='linear')(feature_maker)
    return feature_maker
# a list of input layers, of course the input shapes can be different
input_layers = [Input(shape=(3, )) for _ in range(2)]
coupled_feature = [nn_model(input_x) for input_x in input_layers]
# assume you take the sum of the outputs
coupled_feature = keras.layers.Add()(coupled_feature)
prediction = Dense(1, activation='relu')(coupled_feature)
model = keras.models.Model(inputs=input_layers, outputs=prediction)
model.compile(loss='mse', optimizer='adam')
# example training set
x_1 = np.linspace(1, 90, 270).reshape(90, 3)
x_2 = np.linspace(1, 90, 270).reshape(90, 3)
y = np.random.rand(90)
inputs_x = [x_1, x_2]
model.fit(inputs_x, y, batch_size=32, epochs=10)
You can actually plot the model to gain more intuition
from keras.utils.vis_utils import plot_model
plot_model(model, show_shapes=True)
The model of the above code looks like this
With a little remodeling and the functional API you can:
#create the CNN - it can also be a sequential
cnn_input = Input(image_shape)
cnn_output = Conv2D(...)(cnn_input)
cnn_output = Conv2D(...)(cnn_output)
cnn_output = MaxPooling2D()(cnn_output)
....
cnn_model = Model(cnn_input, cnn_output)
#create the FC model - can also be a sequential
fc_input = Input(fc_input_shape)
fc_output = Dense(...)(fc_input)
fc_output = Dense(...)(fc_output)
fc_model = Model(fc_input, fc_output)
There is a lot of space for creativity, this is just one of the ways.
#create the full model
full_input = Input(image_shape)
full_output = cnn_model(full_input)
full_output = fc_model(full_output)
full_model = Model(full_input, full_output)
You can use any of the three models in any way you want. They share the layers and the weights, so internally they are the same.
Saving and loading the full model might be quirky. I'd probably save the other two separately and when loading create the full model again.
Notice also that if you save two models that share the same layers, after loading they will probably not share these layers anymore. (Another reason for saving/loading only fc_model and cnn_model, while creating full_model again from code)
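To tie this back to the question's bypass data: a minimal sketch (with hypothetical shapes and layer sizes) that concatenates the CNN features with the extra non-image input before the fully connected layers could look like this:

from keras.layers import Input, Dense, Flatten, Conv2D, MaxPooling2D, Concatenate
from keras.models import Model

# Hypothetical shapes, just for illustration
image_shape = (64, 64, 3)
n_extra = 10  # number of features that bypass the convolutions

image_input = Input(image_shape)
extra_input = Input((n_extra,))

# CNN branch
x = Conv2D(16, 3, activation='relu')(image_input)
x = MaxPooling2D()(x)
x = Flatten()(x)

# concatenate the CNN features with the bypass data
merged = Concatenate()([x, extra_input])

# fully connected layers on the merged features
out = Dense(64, activation='relu')(merged)
out = Dense(1, activation='sigmoid')(out)

full_model = Model([image_input, extra_input], out)
full_model.compile(loss='binary_crossentropy', optimizer='adam')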
I am fairly new to tensorflow and I was following the answer to the question below in order to build a custom loss function in Keras that considers only the top 20 predictions.
How can I sort the values in a custom Keras / Tensorflow Loss Function?
However, when I try to compile my model using this code I get the following error about dimensions
InvalidArgumentError: input must have last dimension >= k = 20 but is 1 for 'loss_21/dense_65_loss/TopKV2' (op: 'TopKV2') with input shapes: [?,1], [] and with computed input tensors: input[1] = <20>.
A simplified version of the code that reproduces the error is the following:
import tensorflow as tf
from tensorflow.python.keras.models import Sequential
from tensorflow.python.keras.layers import Dense
from tensorflow.python.keras.optimizers import SGD
top = 20

def top_loss(y_true, y_pred):
    y_pred_top_k, y_pred_ind_k = tf.nn.top_k(y_pred, top)
    loss_per_sample = tf.reduce_mean(tf.reduce_sum(y_pred_top_k,
                                                   axis=-1))
    return loss_per_sample
model = Sequential()
model.add(Dense(50, input_dim=201))
model.add(Dense(1))
sgd = SGD(lr=0.01, decay=0, momentum=0.9)
model.compile(loss=top_loss, optimizer=sgd)
and the error is thrown at the following line of the top_loss function when the model is compiled.
y_pred_top_k, y_pred_ind_k = tf.nn.top_k(y_pred, top)
It seems that at compile time y_pred is by default of shape [?, 1], while the tf.nn.top_k function expects its last dimension to be at least k (i.e. 20).
Do I have to cast or reshape y_pred so that tf.nn.top_k sees the correct dimensions?
Use:
y_pred_top_k, y_pred_ind_k = tf.nn.top_k(y_pred[:,0], top)
y_pred[:,0] gets the predicted values of the full batch as a rank 1 tensor.
Another Problem:
However, you will still end up with a problem on the last batch. Say your batch size is 32 and your training data has 100 samples; then the last batch will have fewer than 20 samples, and tf.nn.top_k will raise a runtime error for that batch. Just make sure your last batch size is >= 20 to avoid this issue. A much better way, however, is to check whether the current batch has fewer than 20 samples and, if so, adjust the k value used in top_k:
Code
import tensorflow as tf
from tensorflow.python.keras.models import Sequential
from tensorflow.python.keras.layers import Dense
from tensorflow.python.keras.optimizers import SGD

top = tf.constant(20)

def top_loss(y_true, y_pred):
    # use the batch size as k when the current batch has fewer than `top` samples
    result = tf.cond(tf.math.greater(top, tf.shape(y_true)[0]),
                     lambda: tf.shape(y_true)[0], lambda: top)
    y_pred_top_k, y_pred_ind_k = tf.nn.top_k(y_pred[:, 0], result)
    loss_per_sample = tf.reduce_mean(tf.reduce_sum(y_pred_top_k,
                                                   axis=-1))
    return loss_per_sample

model = Sequential()
model.add(Dense(50, input_dim=201))
model.add(Dense(1))
sgd = SGD(lr=0.01, decay=0, momentum=0.9)
model.compile(loss=top_loss, optimizer=sgd)
tf.keras.applications contains many famous neural networks like VGG, DenseNet, MobileNet, and so on. Take tf.keras.applications.MobileNet as an example: what I am interested in is not only the final output, but also the outputs of the intermediate layers. How could I get all these outputs when retraining the network?
Maybe model.get_output_at(index) helps. However, every time I call this function I get a DeferredTensor, because I cannot forward the data at the same time. Does a convenient way exist?
Thanks in advance~
I suggest you read the Keras documentation:
One simple way is to create a new Model that will output the layers that you are interested in:
from keras.models import Model

model = ...  # create the original model

layer_name = 'my_layer'
intermediate_layer_model = Model(inputs=model.input,
                                 outputs=model.get_layer(layer_name).output)
intermediate_output = intermediate_layer_model.predict(data)
Alternatively, you can build a Keras function that will return the output of a certain layer given a certain input, for example:
from keras import backend as K
# with a Sequential model
get_3rd_layer_output = K.function([model.layers[0].input],
                                  [model.layers[3].output])
layer_output = get_3rd_layer_output([x])[0]
Similarly, you could build a Theano or TensorFlow function directly, as sketched below.
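For instance, a minimal sketch of the direct TensorFlow variant (assuming the TensorFlow 1.x backend, and that x is a NumPy batch matching the model's input):

from keras import backend as K

# Run the intermediate tensor directly in the backend session
sess = K.get_session()
layer_output = sess.run(model.layers[3].output,
                        feed_dict={model.layers[0].input: x})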
Note that if your model has a different behavior in training and testing phase (e.g. if it uses Dropout, BatchNormalization, etc.), you will need to pass the learning phase flag to your function:
get_3rd_layer_output = K.function([model.layers[0].input, K.learning_phase()],
                                  [model.layers[3].output])
# output in test mode = 0
layer_output = get_3rd_layer_output([x, 0])[0]
# output in train mode = 1
layer_output = get_3rd_layer_output([x, 1])[0]
Here is another similar answer written by fchollet himself:
How can I get hidden layer representation of the given data?