Keras model.fit hangs

Attempting to fit a Keras model on a dataset created with audio_dataset_from_directory leaves the kernel apparently unresponsive. The following code reproduces my problem (tested in VS Code and Jupyter Notebook):
import tensorflow.keras as keras
import pandas as pd
import os
# Create architecture of model
inputs = keras.layers.Input((None, 1))
rnn = keras.layers.SimpleRNN(200)(inputs)
output = keras.layers.Dense(1)(rnn)
# Compile model
model = keras.Model(inputs, output)
model.compile(loss="mean_squared_error")
# Load data
data = pd.read_csv(".\\files\\metadata.csv", index_col="title")
data = keras.utils.audio_dataset_from_directory(
    ".\\files\\songs",
    labels=data["UserLikes"].to_list(),
    label_mode="int",
    ragged=True,
    shuffle=True,
)
# Fit model
model.fit(data, epochs=1, verbose=2)
In this code, data["UserLikes"] (and thus y in the Keras dataset) consists of integers in the range [0, inf). Keras processes each audio file as a tensor of floats with shape (timesteps, channels=1). The total size of the audio files is only 320 MB. The goal of the code is to predict the number of likes a song gets.
The code produces no output: every time I run it, it gets stuck on model.fit. Sometimes the application (i.e., VS Code or Jupyter Notebook) even crashes.
Any advice would be greatly appreciated.

Related

Saving and loading some models takes a very long time in Keras

I've noticed that when doing the following workflow:
load a pre-trained model from keras.applications with weights from ImageNet
fine-tune this model with new data
save the fine-tuned model to an hdf5 file with model.save('file.h5')
re-load the model somewhere else with load_model('file.h5')
The saving and loading steps can take a really long time when using some models.
When using VGG16 or VGG19 or MobileNet, saving and loading happen very quickly (a few seconds at most).
However, when using NASNet, InceptionV3, or DenseNet121, both saving and loading can take 10 to 30 minutes each, as illustrated in the following examples:
import keras
from keras.layers import Dense, GlobalAveragePooling2D
from keras.models import Model, load_model
# VGG16
model_ = keras.applications.vgg16.VGG16(weights='imagenet', include_top=False)
x = GlobalAveragePooling2D()(model_.output)
x = Dense(16, activation='softmax')(x)
my_model = Model(inputs=model_.input, outputs=x)
my_model.fit(some_data)
my_model.save('file.h5') # takes 2 seconds
load_model('file.h5') # takes 2 seconds
# NASNetMobile
model_ = keras.applications.nasnet.NASNetMobile(weights='imagenet', include_top=False)
x = GlobalAveragePooling2D()(model_.output)
x = Dense(16, activation='softmax')(x)
my_model = Model(inputs=model_.input, outputs=x)
my_model.fit(some_data)
my_model.save('file.h5') # takes 10 minutes
load_model('file.h5') # takes 5 minutes
# DenseNet121
model_ = keras.applications.densenet.DenseNet121(weights='imagenet', include_top=False)
x = GlobalAveragePooling2D()(model_.output)
x = Dense(16, activation='softmax')(x)
my_model = Model(inputs=model_.input, outputs=x)
my_model.fit(some_data)
my_model.save('file.h5') # takes 10 minutes
load_model('file.h5') # takes 5 minutes
When monitoring the file from the command line while it is being saved, we can see file.h5 being built up slowly, at around 100 KB per minute in the worst cases. Then, once it reaches 22 MB, it very quickly completes to the full size (80-100 MB depending on the model).
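The file-growth observation above can also be captured programmatically rather than by watching the shell. The following is only a minimal sketch using the standard library; the helper names sample_file_growth and growth_rate_bytes_per_s are illustrative and not part of Keras. Because model.save blocks, the sampler would have to run in a second Python process (or a thread) while the save is in progress.

```python
import os
import time


def sample_file_growth(path, interval_s=1.0, samples=5):
    """Poll `path` and return a list of (elapsed_seconds, size_bytes) pairs."""
    start = time.perf_counter()
    readings = []
    for _ in range(samples):
        # A file that does not exist yet is reported as 0 bytes.
        size = os.path.getsize(path) if os.path.exists(path) else 0
        readings.append((time.perf_counter() - start, size))
        time.sleep(interval_s)
    return readings


def growth_rate_bytes_per_s(readings):
    """Average growth between the first and last reading (0.0 if undefined)."""
    (t0, s0), (t1, s1) = readings[0], readings[-1]
    return (s1 - s0) / (t1 - t0) if t1 > t0 else 0.0
```

For example, sample_file_growth('file.h5', interval_s=60, samples=10) would record one reading per minute for ten minutes, which is enough to see whether the slow phase and the sudden jump described above also occur on another machine.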
I was wondering if that's "standard behaviour", just because these models are inherently complex and then such long saving/loading durations are expected, or is it not normal? Also, can something be done to mitigate this?
Configuration used:
Keras 2.2 with TensorFlow backend
TensorFlow-GPU 1.13
Python 3.6
CUDA 10.1
running on an AWS Deep Learning EC2 pre-configured instance

I'm having a similar experience trying to load a ResNet50 in TF 2.0 with Keras. I'm not sure what's going on, but I see 100% CPU utilization on a single core (out of 64 available CPU cores).

Get summary of tensorflow model

model.summary() gives me output that lists sequential_1 and sequential_3 as layers.
Now, how can I inspect the layers inside sequential_1 and sequential_3?
I want a summary of the whole model, but it shows two Sequential entries, which I take to mean that two models were combined. How can I get a summary of both?
I only have the model.h5 file, nothing else.

A model saved in .h5 format contains the architecture, the weights, and the training configuration.
To inspect the layers of a model nested inside another model, as in your case, you can iterate over the outer model's layers and call the summary method on each of them, i.e.:
for layer in loaded_model.layers:
    layer.summary()
Here is the complete code I used to reproduce your scenario.
import tensorflow as tf
print('Running Tensorflow version {}'.format(tf.__version__)) # Tensorflow 2.1.0
model_path = '/content/keras_model.h5'
loaded_model = tf.keras.models.load_model(model_path)
loaded_model.summary()
inp = loaded_model.input
# Each top-level layer here is itself a model, so it has its own summary()
for layer in loaded_model.layers:
    layer.summary()
I've also used the model.h5 file you uploaded.
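If the loaded model mixes nested models with plain layers, or nests models more than one level deep, calling summary() on every entry of loaded_model.layers may fail. The sketch below is one hedged way to handle that. The helper name summarize_nested is hypothetical, and it assumes that only nested models expose a summary() method while plain layers do not, so it relies on duck typing instead of importing TensorFlow.

```python
def summarize_nested(model, depth=0):
    """Call summary() on `model`, then on every nested model inside it.

    Returns a list of (depth, name) pairs for the models that were summarized.
    """
    visited = [(depth, getattr(model, "name", "<unnamed>"))]
    model.summary()
    for layer in getattr(model, "layers", []):
        # Assumption: only nested models expose summary(); plain layers are skipped.
        if callable(getattr(layer, "summary", None)):
            visited.extend(summarize_nested(layer, depth + 1))
    return visited


# Usage, assuming TensorFlow is installed and model.h5 exists:
#   import tensorflow as tf
#   loaded_model = tf.keras.models.load_model("model.h5")
#   summarize_nested(loaded_model)
```

The returned list makes it easy to check which sub-models were found, for example sequential_1 and sequential_3 at depth 1.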

fastai error predicting with exported/reloaded model: "Input type and weight type should be the same"

Whenever I export a fastai model and reload it, I get this error (or a very similar one) when I try and use the reloaded model to generate predictions on a new test set:
RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.cuda.HalfTensor) should be the same
A minimal reproducible code example is below; you just need to update the FILES_DIR variable to wherever the MNIST data gets deposited on your system:
from fastai import *
from fastai.vision import *
# download data for reproducible example
untar_data(URLs.MNIST_SAMPLE)
FILES_DIR = '/home/mepstein/.fastai/data/mnist_sample' # this is where command above deposits the MNIST data for me
# Create FastAI databunch for model training
tfms = get_transforms()
tr_val_databunch = ImageDataBunch.from_folder(path=FILES_DIR, # location of downloaded data shown in log of prev command
                                              train='train',
                                              valid_pct=0.2,
                                              ds_tfms=tfms).normalize()
# Create Model
conv_learner = cnn_learner(tr_val_databunch,
                           models.resnet34,
                           metrics=[error_rate]).to_fp16()
# Train Model
conv_learner.fit_one_cycle(4)
# Export Model
conv_learner.export() # saves model as 'export.pkl' in path associated with the learner
# Reload Model and use it for inference on new hold-out set
reloaded_model = load_learner(path=FILES_DIR,
                              test=ImageList.from_folder(path=f'{FILES_DIR}/valid'))
preds = reloaded_model.get_preds(ds_type=DatasetType.Test)
Output:
"RuntimeError: Input type (torch.cuda.FloatTensor) and weight type
(torch.cuda.HalfTensor) should be the same"
Stepping through the code statement by statement, everything works fine until the last line, preds = ..., which is where the torch error above pops up.
Relevant software versions:
Python 3.7.3
fastai 1.0.57
torch 1.2.0
torchvision 0.4.0

So the answer to this ended up being relatively simple:
1) As noted in my comment, training in mixed precision mode (calling .to_fp16() on conv_learner) caused the error with the exported/reloaded model
2) To train in mixed precision mode (which is faster than regular training) and enable export/reload of the model without errors, simply set the model back to default precision before exporting.
...In code, simply changing the example above:
# Export Model
conv_learner.export()
to:
# Export Model (after converting back to default precision for safe export/reload)
conv_learner = conv_learner.to_fp32()
conv_learner.export()
...and now the full (reproducible) code example above runs without errors, including the prediction after model reload.

Your model is in half precision if you called .to_fp16(), which is the same as calling model.half() in PyTorch.
In fact, if you trace the code, .to_fp16() will call model.half().
But there is a problem: if you also convert the batch norm layers to half precision, you may run into convergence problems.
This is why you would typically do this in PyTorch:
import torch
model.half() # convert to half precision
for layer in model.modules():
    # keep batch norm layers in float32 to avoid convergence problems
    if isinstance(layer, torch.nn.modules.batchnorm._BatchNorm):
        layer.float()
This converts every layer except the batch norm layers to half precision.
Note that the code from the PyTorch forum also works, but it only covers nn.BatchNorm2d.
Then make sure your input is in half precision by using to(), like this:
import torch
t = torch.tensor(10.)
print(t)
print(t.dtype)
t=t.to(dtype=torch.float16)
print(t)
print(t.dtype)
# tensor(10.)
# torch.float32
# tensor(10., dtype=torch.float16)
# torch.float16

AssertionError when using MirroredStrategy: isinstance(x, dataset_ops.DatasetV2)

I am trying to use MirroredStrategy to fit my sequential model using two Titan Xp GPUs. I am using TensorFlow 2.0 alpha on Ubuntu 16.04.
I can successfully run the code snippet from the TensorFlow documentation:
from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow as tf
mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
    model.compile(loss='mse', optimizer='sgd')
dataset = tf.data.Dataset.from_tensors(([1.], [1.])).repeat(100).batch(10)
model.fit(dataset, epochs=2)
model.evaluate(dataset)
However, when I try to train on my own data, which is a sparse matrix with the shapes below (using the Adam optimizer and binary cross-entropy loss):
Shape X_train: (91422, 65545)
Shape y_train: (91422, 1)
I receive an assertion error in _distribution_standardize_user_data at
assert isinstance(x, dataset_ops.DatasetV2)
In the TensorFlow code, line 2166 in training.py seems to be causing this assertion error.
Can someone explain to me what the problem with my data could be?

I got a similar error when using dataset = strategy.experimental_distribute_dataset(train_dataset) with model.fit(dataset).
After I removed strategy.experimental_distribute_dataset, it worked fine. This matches the TF documentation, which says that keras.Model.fit() handles distribution automatically and that a distributed dataset only needs to be created manually for custom training loops with tf.GradientTape().
You can go through the official MNIST tutorial for more info.

It seems like you are feeding a dataset into model.fit, while model.fit is expecting a numpy.ndarray.

Warning `tried to deallocate nullptr` when using tensorflow eager execution with tf.keras

As per the TensorFlow team's suggestion, I'm getting used to TensorFlow's eager execution with tf.keras. However, whenever I train a model, I receive a warning (EDIT: actually, I receive this warning repeated many times, more than once per training step, flooding my standard output):
E tensorflow/core/common_runtime/bfc_allocator.cc:373] tried to deallocate nullptr
The warning doesn't seem to affect the quality of the training but I wonder what it means and if it is possible to get rid of it.
I use a conda virtual environment with Python 3.7 and TensorFlow 1.12 running on a CPU. (EDIT: a test with Python 3.6 gives the same results.) A minimal code example that reproduces the warnings follows. Interestingly, if you comment out the line tf.enable_eager_execution(), the warnings disappear.
import numpy as np
import tensorflow as tf
tf.enable_eager_execution()
N_EPOCHS = 50
N_TRN = 10000
N_VLD = 1000
# the label is positive if the input is a number larger than 0.5
# a little noise is added, just for fun
x_trn = np.random.random(N_TRN)
x_vld = np.random.random(N_VLD)
y_trn = ((x_trn + np.random.random(N_TRN) * 0.02) > 0.5).astype(float)
y_vld = ((x_vld + np.random.random(N_VLD) * 0.02) > 0.5).astype(float)
# a simple logistic regression
model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(1, input_dim=1))
model.add(tf.keras.layers.Activation('sigmoid'))
model.compile(
    optimizer=tf.train.AdamOptimizer(),
    # optimizer=tf.keras.optimizers.Adam(), # doesn't work at all with tf eager execution
    loss='binary_crossentropy',
    metrics=['accuracy']
)
# Train model on dataset
model.fit(
    x_trn, y_trn,
    epochs=N_EPOCHS,
    validation_data=(x_vld, y_vld),
)
model.summary()

Quick solutions:
The warning did not appear when I ran the same script in TF 1.11, and training reached the same final validation accuracy on the synthetic dataset.
OR
Suppress the errors/warnings using the native os module (adapted from https://stackoverflow.com/a/38645250/2374160), i.e., by setting the TensorFlow logging environment variable so that no error messages are shown.
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
import tensorflow as tf
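For reference, the environment variable accepts four levels, and the message in question is logged with the E (error) prefix, so only level 3 hides it. The sketch below wraps the snippet above in a small helper; the names TF_LOG_LEVELS and set_tf_log_level are illustrative and not part of TensorFlow. The variable has to be set before TensorFlow is imported.

```python
import os

# Meaning of each TF_CPP_MIN_LOG_LEVEL value for TensorFlow's C++ logging.
TF_LOG_LEVELS = {
    0: "show all messages",
    1: "hide INFO",
    2: "hide INFO and WARNING",
    3: "hide INFO, WARNING, and ERROR",
}


def set_tf_log_level(level):
    """Set TF_CPP_MIN_LOG_LEVEL; call this before `import tensorflow`."""
    if level not in TF_LOG_LEVELS:
        raise ValueError(f"level must be one of {sorted(TF_LOG_LEVELS)}")
    os.environ["TF_CPP_MIN_LOG_LEVEL"] = str(level)
    return TF_LOG_LEVELS[level]


set_tf_log_level(3)  # same effect as the snippet above
# import tensorflow as tf  # import only after the variable is set
```

Keep in mind that level 3 also hides genuine errors from the C++ runtime, so it is best treated as a temporary measure.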
More info:
Solving this error properly may require familiarity with MKL library calls and how they interface with TensorFlow, which is written in C++ (this is beyond my current TF expertise).
In my case, this memory deallocation error occurred whenever the apply_gradients() method of an optimizer was called. In your script, it is called when the model is being fitted to the training data.
This error is raised from here: tensorflow/core/common_runtime/mkl_cpu_allocator.h
I hope this helps as a temporary solution for convenience.
