I am training an LSTM model with an embedding input layer with a vocabulary size of approximately 100,000. While profiling the training via TensorBoard, I discovered that most of the training time is spent on "Kernel Launch" (58%), followed by "All Others" (36%). In other words, the GPU is idle most of the time due to overhead. The high kernel launch time seems to be driven by the size of the embedding layer.
My question is: how can I improve the training speed? Is it inevitable that most of the training time is spent on kernel launch when working with a large-ish embedding? Increasing the batch size (currently at 128) would help since the kernel launch time doesn't depend on the batch size, but 128 is already on the high side.
I am also not sure what exactly falls under "All Others".
I am working on a Tesla T4 GPU with Tensorflow 2.2.0, but I see the same behavior using the nightly build.
Following the RNN tutorial on tensorflow.org (https://www.tensorflow.org/tutorials/text/text_classification_rnn), here is an example that highlights the performance issues:
import tensorflow_datasets as tfds
import tensorflow as tf
from datetime import datetime
from tqdm.auto import tqdm
### retrieve data ###
# use imdb_reviews dataset from TFDS
dataset = tfds.load('imdb_reviews',as_supervised=True)
train_dataset, test_dataset = dataset['train'], dataset['test']
### get encoder ###
# initialize tokenizer
tokenizer = tfds.features.text.Tokenizer()
# build vocabulary
def addOrUpdate(d, token):
    d[token] = d.get(token, 0) + 1
vocab = dict()
dataset_iter = iter(train_dataset)
for el in tqdm(dataset_iter):
    text = el[0].numpy().decode("utf-8")
    for token in tokenizer.tokenize(text):
        addOrUpdate(vocab, token)
# shrink vocabulary (MIN_COUNT>1 significantly reduces model dimension)
MIN_COUNT = 1
vocab_subset = set([k for k,v in vocab.items() if v >= MIN_COUNT])
print("Using vocabulary subset with min_count={:}: {:,} words, ".format(MIN_COUNT,len(vocab_subset)))
# create encoder
encoder = tfds.features.text.TokenTextEncoder(vocab_subset)
### Prepare the data for training ###
def encode(text_tensor, label):
    encoded_text = encoder.encode(text_tensor.numpy())
    return encoded_text, label

def encode_map_fn(text, label):
    # encode
    encoded_text, label = tf.py_function(encode,
                                         inp=[text, label],
                                         Tout=(tf.int64, tf.int64))
    # set shapes
    encoded_text.set_shape([None])
    label.set_shape([])
    return encoded_text, label
train_dataset = train_dataset.map(encode_map_fn)
test_dataset = test_dataset.map(encode_map_fn)
BUFFER_SIZE = 25000
BATCH_SIZE = 128
train_dataset = train_dataset.shuffle(BUFFER_SIZE)
train_dataset = train_dataset.padded_batch(BATCH_SIZE)
### create the model ###
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(encoder.vocab_size, 256, mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(1)
])
model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])
### Train the model ###
# create tensorboard callback
log_path = 'logs_'+datetime.now().strftime("%Y%m%d_%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_path,
                                                      profile_batch='10,20')
history = model.fit(train_dataset, epochs=1, steps_per_epoch=30,
                    callbacks=[tensorboard_callback])
Same code in a Colab Notebook: https://colab.research.google.com/drive/1WoAShXR2cGOYWPQoKdh4IGlhZh4FAK7o?usp=sharing
I haven't tried your code, but from looking at it, I guess the following issue might be related:
If a GPU is present but eager execution is enabled, Embedding layers are still placed on the CPU.
See https://github.com/tensorflow/tensorflow/issues/44194 (it includes a workaround).
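For reference, here is a minimal sketch of one way to force the embedding variables onto the GPU by building the model under an explicit device scope; this is an assumption on my part and may differ from the exact workaround described in the linked issue:

import tensorflow as tf

# Sketch: create the model inside a device scope so the Embedding weights
# (and ideally the lookups) are placed on the GPU instead of the CPU.
VOCAB_SIZE = 100_000  # illustrative value
with tf.device('/GPU:0'):
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(VOCAB_SIZE, 256, mask_zero=True),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128)),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(1)
    ])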
I have a text classification task that I am trying to do using BERT. Below is the code I am using. The model training code works fine, but I am facing an issue with the prediction part.
from transformers import TFBertForSequenceClassification
import tensorflow as tf
# recommended learning rate for Adam 5e-5, 3e-5, 2e-5
learning_rate = 5e-5
nlabels = 26
# we will do just 1 epoch for illustration, though more epochs might be better as long as we do not overfit the model
number_of_epochs = 1
# model initialization
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=nlabels,
                                                         output_attentions=False,
                                                         output_hidden_states=False)
# optimizer Adam
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate, epsilon=1e-08)
# we do not have one-hot vectors, so we can use sparse categorical cross-entropy and accuracy
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
bert_history = model.fit(ds_tr_encoded, epochs=number_of_epochs)
I am getting the output using the following
preds = model.predict(ds_te_encoded)
pred_labels_idx = np.argmax(preds['logits'], axis=1)
The issue I am facing is that the length of pred_labels_idx is not the same as the cardinality of ds_te_encoded:
len(pred_labels_idx) #426820
tf.data.experimental.cardinality(ds_te_encoded) #<tf.Tensor: shape=(), dtype=int64, numpy=21341>
Not sure why this is happening.
Since ds_te_encoded is of type tf.data.Dataset and you call cardinality(...), the cardinality in your case is simply the rounded number of batches and not the number of samples. So I am assuming you are using a batch size of 20, because 426820/20 = 21341. That is probably what is causing the confusion.
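If you want to compare against the number of samples rather than the number of batches, a small sketch (assuming ds_te_encoded is a batched tf.data.Dataset) would be:

import tensorflow as tf

# Number of batches: what cardinality() reports for a batched dataset.
n_batches = tf.data.experimental.cardinality(ds_te_encoded).numpy()

# Number of individual samples: unbatch first, then count the elements.
n_samples = sum(1 for _ in ds_te_encoded.unbatch())

print(n_batches, n_samples)  # n_samples should match len(pred_labels_idx)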
I'm a beginner in Python and ML. I was practising on the Iris dataset to create an ML model using TensorFlow 2.0.
I parsed the CSV and trained the model on the dataset. I'm able to get 90% training accuracy and 91% validation accuracy during model creation.
import tensorflow as tf
import numpy as np
from sklearn import preprocessing
csv_data = np.loadtxt('iris_training.csv',delimiter=',')
target_all = csv_data[:,-1]
csv_data = csv_data[:,0:-1]
# Shuffling the input
shuffled_indices = np.arange(csv_data.shape[0])
np.random.shuffle(shuffled_indices)
shuffled_inputs = csv_data[shuffled_indices]
shuffled_targets = target_all[shuffled_indices]
# Standardize the Inputs
shuffled_inputs = preprocessing.scale(shuffled_inputs)
# Split date into train , validation and test
total_count = shuffled_inputs.shape[0]
train_data_count = int(0.8*total_count)
validation_data_count = int(0.1*total_count)
test_data_count = total_count - train_data_count - validation_data_count
train_inputs = shuffled_inputs[:train_data_count]
train_targets = shuffled_targets[:train_data_count]
validation_inputs = shuffled_inputs[train_data_count:train_data_count+validation_data_count]
validation_targets = shuffled_targets[train_data_count:train_data_count+validation_data_count]
test_inputs = shuffled_inputs[train_data_count+validation_data_count:]
test_targets = shuffled_targets[train_data_count+validation_data_count:]
print(len(train_inputs))
print(len(validation_inputs))
print(len(test_inputs))
# Model Creation
input_size = 4
hidden_layer_size = 100
output_size = 3
model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(hidden_layer_size, input_dim=input_size, activation=tf.nn.relu))
model.add(tf.keras.layers.Dense(hidden_layer_size, activation=tf.nn.relu))
model.add(tf.keras.layers.Dense(output_size, activation=tf.nn.softmax))
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',metrics=['accuracy'])
model.fit(train_inputs,train_targets, epochs=10, validation_data=(validation_inputs, validation_targets), verbose=2)
prediction = model.predict(test_inputs)
Please point out if there is something in my code that I could do to improve the accuracy of my model on this simple Iris dataset.
File used for training my model: Iris CSV
As for your model, you can try hyperparameter tuning:
Set the learning rate to a lower value.
Increase the number of epochs.
Add more training data, since you currently have only a small dataset; neural networks shine when there is a good amount of data for training.
You can also add more layers to the model, add dropout to avoid overfitting, and try different activation functions.
These are the common factors that affect model performance; a rough sketch of some of these changes follows below.
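Here is what that could look like on the model above; the dropout rate, learning rate, and epoch count are illustrative guesses, not tuned values:

model = tf.keras.Sequential([
    tf.keras.layers.Dense(hidden_layer_size, input_shape=(input_size,), activation='relu'),
    tf.keras.layers.Dropout(0.2),   # dropout to reduce overfitting
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(output_size, activation='softmax')
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),  # lower this further if training is unstable
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(train_inputs, train_targets, epochs=50,  # more epochs than the original 10
          validation_data=(validation_inputs, validation_targets), verbose=2)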
I have come across some odd behaviours when training CNNs with Tensorflow 2.0 and would appreciate any help in solving them.
I am doing transfer learning (just training the classification head) using the pre-trained networks available in 'tensorflow.keras.applications' and have noticed the following:
For the first epoch, validation metrics are always zero, no matter what I do.
When training after the first epoch, the training metrics improve as you would expect, but the validation metrics are essentially random guesses, even when the EXACT same dataset is used as both the training and the validation dataset. It is as if it isn't using the model being trained to do its evaluation.
I have tried VGG16, MobileNetV2, and ResNet50V2, and they all exhibit the same behaviours.
The configurations I am able to reproduce this on are:
Ubuntu 18.04LTS, Nvidia RTX2080ti with driver version 430.50, CUDA10.0, TensorFlow-gpu==2.0.0
MacBook Pro, TensorFlow==2.0.0 (cpu)
Both are running in Conda environments and I have installed TensorFlow with pip. I have put some sample code below to show the essence of my workflow, just in case I am doing anything obviously stupid. Any help would be much appreciated, as I am at a loss as to how to fix it.
def parse_function(example_proto):
    image_feature_description = {
        'label': tf.io.FixedLenFeature([], tf.int64),
        'image_raw': tf.io.FixedLenFeature([], tf.string)
    }
    parsed_example = tf.io.parse_single_example(example_proto, image_feature_description)
    image = tf.io.decode_image(
        parsed_example['image_raw'],
        channels=3,
        dtype=tf.float32,
        expand_animations=False
    )
    image = tf.image.per_image_standardization(image)
    label = tf.one_hot(parsed_example['label'], 24, dtype=tf.float32)
    return (image, label)
def load_dataset(TFRecord_dir, record_name):
    record_files = tf.io.matching_files(os.path.join(TFRecord_dir, record_name + '.tfrecords-????'))
    shards = tf.data.TFRecordDataset(record_files)
    shards = shards.shuffle(tf.cast(tf.shape(record_files)[0], tf.int64))
    dataset = shards.map(map_func=parse_function)
    dataset = dataset.batch(batch_size=16, drop_remainder=True)
    dataset = dataset.prefetch(16)
    return dataset
base_model = tf.keras.applications.ResNet50V2(
    input_shape=(224, 224, 3),
    weights='imagenet',
    include_top=False
)
base_model.trainable = False
model = tf.keras.Sequential([
    base_model,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(24, activation='softmax')
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss=tf.keras.losses.CategoricalCrossentropy(),
    metrics=[
        tf.keras.metrics.CategoricalAccuracy(),
        tf.keras.metrics.TopKCategoricalAccuracy(),
        tf.keras.metrics.Precision(),
        tf.keras.metrics.Recall()
    ])
train_dataset = load_dataset(train_dir, 'train')
model.fit(train_dataset,
          verbose=1,
          epochs=5,
          validation_data=train_dataset)
model.evaluate(train_dataset)
When training after the first epoch, the training metrics improve as you would expect, but the validation metrics are essentially random guesses, even when the EXACT same dataset is used as both the training and the validation dataset. It is as if it isn't using the model being trained to do its evaluation.
This means that your network is not able to generalize and is just overfitting. Random guesses mean you get roughly 1/n accuracy, where n is the number of classes.
You may want to lower the learning_rate to a much smaller value (e.g. 1e-5) to start with, and then even unfreeze some of the lower layers (close to your GAP+Dropout+Dense head); a sketch follows below.
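A rough sketch of that suggestion applied to the model above; the number of unfrozen layers and the 1e-5 rate are illustrative assumptions, not tested values:

# Unfreeze only the last few layers of the pre-trained backbone and
# fine-tune with a much smaller learning rate.
base_model.trainable = True
for layer in base_model.layers[:-10]:   # keep all but roughly the last 10 layers frozen
    layer.trainable = False

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
    loss=tf.keras.losses.CategoricalCrossentropy(),
    metrics=[tf.keras.metrics.CategoricalAccuracy()]
)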
I no longer have this problem since I started using the Docker images provided instead. There must have been something installed incorrectly, but I don't know what.
Also, for anyone in the same position: I found out during debugging that normalising the images with image = (image/127.5) - 1, as in the transfer learning with a pre-trained CNN tutorial, exhibits the same behaviour even in the Docker container, i.e. the training metrics improve but the validation metrics remain random on the same dataset used for training, so change it to image = tf.image.per_image_standardization(image).
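For concreteness, these are the two normalisation variants being compared, as they would appear inside a parse function like the one in the question (only this one line differs):

# Tutorial-style scaling to [-1, 1] (the variant that showed the problem):
image = (image / 127.5) - 1
# Per-image standardisation (zero mean, unit variance per image):
image = tf.image.per_image_standardization(image)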
I have the problem that I am not able to reproduce my results with Keras and TensorFlow.
It seems that a workaround for this issue has recently been published on the Keras documentation site, but somehow it doesn't work for me.
What am I doing wrong?
I'm using a Jupyter Notebook on a MBP Retina (without Nvidia GPU).
# ** Workaround from Keras Documentation **
import numpy as np
import tensorflow as tf
import random as rn
# The below is necessary in Python 3.2.3 onwards to
# have reproducible behavior for certain hash-based operations.
# See these references for further details:
# https://docs.python.org/3.4/using/cmdline.html#envvar-PYTHONHASHSEED
# https://github.com/fchollet/keras/issues/2280#issuecomment-306959926
import os
os.environ['PYTHONHASHSEED'] = '0'
# The below is necessary for starting Numpy generated random numbers
# in a well-defined initial state.
np.random.seed(42)
# The below is necessary for starting core Python generated random numbers
# in a well-defined state.
rn.seed(12345)
# Force TensorFlow to use single thread.
# Multiple threads are a potential source of
# non-reproducible results.
# For further details, see: https://stackoverflow.com/questions/42022950/which-seeds-have-to-be-set-where-to-realize-100-reproducibility-of-training-res
session_conf = tf.ConfigProto(intra_op_parallelism_threads=1, inter_op_parallelism_threads=1)
from keras import backend as K
# The below tf.set_random_seed() will make random number generation
# in the TensorFlow backend have a well-defined initial state.
# For further details, see: https://www.tensorflow.org/api_docs/python/tf/set_random_seed
tf.set_random_seed(1234)
sess = tf.Session(graph=tf.get_default_graph(), config=session_conf)
K.set_session(sess)
# ** Workaround end **
# ** Start of my code **
# LSTM and CNN for sequence classification in the IMDB dataset
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
from sklearn import metrics
# fix random seed for reproducibility
#np.random.seed(7)
# ... importing data and so on ...
# create the model
embedding_vecor_length = 32
neurons = 91
epochs = 1
model = Sequential()
model.add(Embedding(top_words, embedding_vecor_length, input_length=max_review_length))
model.add(LSTM(neurons))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='mean_squared_logarithmic_error', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(X_train, y_train, epochs=epochs, batch_size=64)
# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))
Used Python version:
Python 3.6.3 |Anaconda custom (x86_64)| (default, Oct 6 2017, 12:04:38)
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
The workaround is already included in the code (without effect).
Every time I run the training part I get different results.
When resetting the kernel of the Jupyter Notebook, the 1st run corresponds with the previous 1st run and the 2nd run with the previous 2nd run.
So after resetting I will always get, for example, 0.7782 on the first run, 0.7732 on the second run, etc.
But without a kernel reset the results are different each time I run it.
I would be grateful for any suggestion!
I had exactly the same problem and managed to solve it by closing and restarting the tensorflow session every time I run the model. In your case it should look like this:
#START A NEW TF SESSION
np.random.seed(0)
tf.set_random_seed(0)
sess = tf.Session(graph=tf.get_default_graph())
K.set_session(sess)
embedding_vecor_length = 32
neurons = 91
epochs = 1
model = Sequential()
model.add(Embedding(top_words, embedding_vecor_length, input_length=max_review_length))
model.add(LSTM(neurons))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='mean_squared_logarithmic_error', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(X_train, y_train, epochs=epochs, batch_size=64)
# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))
#CLOSE TF SESSION
K.clear_session()
I ran the following code and had reproducible results using GPU and tensorflow backend:
print(datetime.now())
for i in range(10):
    np.random.seed(0)
    tf.set_random_seed(0)
    sess = tf.Session(graph=tf.get_default_graph())
    K.set_session(sess)
    n_classes = 3
    n_epochs = 20
    batch_size = 128
    task = Input(shape=x.shape[1:])
    h = Dense(100, activation='relu', name='shared')(task)
    h1 = Dense(100, activation='relu', name='single1')(h)
    output1 = Dense(n_classes, activation='softmax')(h1)
    model = Model(task, output1)
    model.compile(loss='categorical_crossentropy', optimizer='Adam')
    model.fit(x_train, y_train_onehot, batch_size=batch_size, epochs=n_epochs, verbose=0)
    print(model.evaluate(x=x_test, y=y_test_onehot, batch_size=batch_size, verbose=0))
    K.clear_session()
And obtained this output:
2017-10-23 11:27:14.494482
0.489712882132
0.489712893813
0.489712892765
0.489712854426
0.489712882132
0.489712864011
0.486303713004
0.489712903398
0.489712892765
0.489712903398
What I understood is that if you don't close your tf session (which you are effectively doing by running in a new kernel), you keep sampling from the same "seeded" random sequence, so successive runs give different results.
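A tiny illustration of that point with NumPy (the same idea applies to the TensorFlow seeds): seeding once and then drawing repeatedly walks through one random sequence, while re-seeding restarts it:

import numpy as np

np.random.seed(0)
print(np.random.rand())  # first value of the seeded sequence
print(np.random.rand())  # a different value: same sequence, next draw

np.random.seed(0)
print(np.random.rand())  # re-seeding restarts the sequence: identical to the first value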
My answer is the following, which uses Keras with TensorFlow as the backend. Within your nested for loops, where one typically iterates through the various parameters you wish to explore for your model's development, immediately add this function after your last for loop:
for ...
    for ...
        reset_keras()
        .
        .
        .
where the reset function is defined as
def reset_keras():
    sess = tf.keras.backend.get_session()
    tf.keras.backend.clear_session()
    sess.close()
    sess = tf.keras.backend.get_session()
    np.random.seed(1)
    tf.set_random_seed(2)
PS: The function above also prevents your Nvidia GPU from building up too much memory (which happens after many iterations) and eventually becoming very slow, so it restores GPU performance and keeps the results reproducible.
It looks like a bug in TensorFlow or Keras, I'm not sure which. When setting the Keras backend to CNTK, the results are reproducible.
I even tried several versions of TensorFlow, from 1.2.1 to 1.13.1. None of the TensorFlow versions gave consistent results across multiple runs, even when the random seeds were set.
The thing that worked for me was to run the training every time in a new console. In addition to this, I also have these parameters set:
RANDOM_STATE = 42
os.environ['PYTHONHASHSEED'] = str(RANDOM_STATE)
random.seed(RANDOM_STATE)
np.random.seed(RANDOM_STATE)
tf.set_random_seed(RANDOM_STATE)
session_conf = tf.ConfigProto(intra_op_parallelism_threads=1, inter_op_parallelism_threads=1)
sess = tf.Session(graph=tf.get_default_graph(), config=session_conf)
K.set_session(sess)
intra_op_parallelism_threads could also be set to a bigger value.
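For reference, a sketch of roughly equivalent settings in TensorFlow 2.x (the snippet above uses the TF 1.x session/ConfigProto APIs); treat this as an approximation rather than a drop-in replacement:

import os
import random
import numpy as np
import tensorflow as tf

RANDOM_STATE = 42
os.environ['PYTHONHASHSEED'] = str(RANDOM_STATE)
random.seed(RANDOM_STATE)
np.random.seed(RANDOM_STATE)
tf.random.set_seed(RANDOM_STATE)                          # replaces tf.set_random_seed
tf.config.threading.set_intra_op_parallelism_threads(1)   # replaces the ConfigProto thread settings
tf.config.threading.set_inter_op_parallelism_threads(1)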
I have a Keras model which was trained on 8 GPUs. This means the model has blocks like: with tf.device('gpu:0'). Now I want to apply transfer learning on another PC which has 4 GPUs. However, this results in an error, most likely because the model was trained on more GPUs (error: could not set cudnn tensor descriptor: CUDNN_STATUS_BAD_PARAM). In the error log I can also see a warning that TensorFlow is trying to colocate gradients on devices GPU 0-7. Is there a way to adapt or clear the devices in a trained model which is configured with Keras?
FYI: I don't have a meta graph file, because the model was also saved with Keras and not with the TensorFlow saver function.
Current attempts
I tried to change the layer properties, but this did not make it work:
track = 0
for i in range(len(model.layers)):
    if model.layers[i].name[:6] == 'lambda':
        model.layers[i].arguments['n_gpus'] = n_gpus
        if model.layers[i].arguments['part'] > n_gpus - 1:
            model.layers[i].arguments['part'] = np.arange(n_gpus)[track]
            track += 1
            if track > n_gpus - 1:
                track = 0
In addition, I tried to set the number of visible devices, which also didn't work:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = "0,1,2,3"
Script to create a model split over 8 gpus
"""
to_multi_gpu & slice_batch by: https://github.com/fchollet/keras/issues/2436
baseline_model by: http://machinelearningmastery.com/
"""
from keras import backend as K
from keras.models import Sequential, Model
from keras.layers import Dense, Input, Lambda, merge
import tensorflow as tf
def slice_batch(x, n_gpus, part):
    """
    Divide the input batch into [n_gpus] slices, and obtain slice no. [part],
    i.e. if len(x)=10, then slice_batch(x, 2, 1) will return x[5:].
    x: input batch (input shape of model)
    n_gpus: number of gpus
    part: id of current gpu
    return: sliced model per gpu
    """
    sh = K.shape(x)
    L = sh[0] // n_gpus
    if part == n_gpus - 1:
        return x[part*L:]
    return x[part*L:(part+1)*L]
def to_multi_gpu(model, n_gpus):
    """
    Given a keras [model], return an equivalent model which parallelizes
    the computation over [n_gpus] GPUs.
    Each GPU gets a slice of the input batch, applies the model on that slice
    and later the outputs of the models are concatenated to a single
    tensor, hence the user sees a model that behaves the same as the original.
    model: sequential model created with the Keras library
    n_gpus: number of gpus
    return: model divided over n_gpus
    """
    # Only divide model over multiple gpus if there is more than one
    if n_gpus > 1:
        with tf.device('/cpu:0'):
            x = Input(model.input_shape[1:])  # , name=model.input_names[0]
        towers = []
        # Divide model over gpus
        for g in range(n_gpus):
            # Work on GPU number g.
            with tf.device('/gpu:' + str(g)):
                # Obtain the g-th slice of the batch.
                slice_g = Lambda(slice_batch, lambda shape: shape,
                                 arguments={'n_gpus': n_gpus, 'part': g})(x)
                # Apply model on the batch slice.
                towers.append(model(slice_g))
        # Merge multi-gpu outputs with cpu
        with tf.device('/cpu:0'):
            merged = merge(towers, mode='concat', concat_axis=0)
        return Model(input=[x], output=merged)
    else:
        return model
def baseline_model(num_pixels, num_classes, n_gpus):
    # create model
    model = Sequential()
    model.add(Dense(num_pixels, input_dim=num_pixels, init='normal', activation='relu'))
    model.add(Dense(num_classes, init='normal', activation='softmax'))
    model = to_multi_gpu(model, n_gpus)
    # Compile model
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

if __name__ == '__main__':
    model = baseline_model(784, 9, 8)
Using the settings below solved it. However, the model is now running on the CPU instead of the GPU. Since I am only fine-tuning the last layer of this model, that is not a big issue, but if you want to reload and train the complete model this answer might not be satisfactory.
The important settings are os.environ['CUDA_VISIBLE_DEVICES'] = "" and allow_soft_placement=True.
The first masks all the GPUs and the second makes TensorFlow automatically allocate the model to the available devices (in this case the CPU).
Sample code
import os
os.environ['CUDA_VISIBLE_DEVICES'] = ""
import tensorflow as tf
from keras.models import load_model
from keras import backend as K
if __name__ == '__main__':
    model = load_model('baseline_model.h5')
    init = tf.global_variables_initializer()
    gpu_options = tf.GPUOptions(allow_growth=True)
    # Add ops to save and restore all the variables.
    with tf.Session(config=tf.ConfigProto(gpu_options=gpu_options, allow_soft_placement=True,
                                          log_device_placement=True)) as sess:
        K.set_session(sess)
        sess.run(init)
        tf.train.start_queue_runners(sess=sess)
        # Call model.fit here
        sess.close()