I have a pretrained quantized model that I trained on Colab, and I moved the files to my own machine to run ONNX Runtime inference. However, when I load the model with
quantized_model = torch.load('quantizedmodel.pt')
my kernel dies. Non-quantized models load just fine.
My torch version is 1.11.0. One thing I did differently: on Colab I was mapping the model to CUDA, whereas locally I mapped it to the CPU with map_location = "cpu:0". I tried switching back to the CUDA device with cuda:0, but that made no difference.
Running nvidia-smi gives me:
Which seems to check out. Looking for some help with debugging this.
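One way to narrow down a kernel that dies silently is to run the load in a separate Python process with faulthandler enabled, so that a native crash leaves an exit code and a traceback instead of taking the notebook down with it. The following is a minimal sketch, assuming the crash happens inside torch.load; the helper name run_isolated is hypothetical and not part of any library.

```python
import subprocess
import sys


def run_isolated(code, timeout=600.0):
    """Run a Python snippet in a child process with faulthandler enabled.

    Returns (returncode, stdout, stderr). On POSIX, a negative return code
    means the child was killed by a signal (for example -11 for SIGSEGV).
    """
    proc = subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stdout, proc.stderr


# Example (commented out so the sketch itself has no torch dependency):
# rc, out, err = run_isolated(
#     "import torch; torch.load('quantizedmodel.pt', map_location='cpu')"
# )
# print(rc, err)
```

If the child exits with a non-zero code, the stderr text usually shows which native call was active at the time, which is more to go on than a dead kernel.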
Related
While training a TensorFlow Model in Jupyter, the kernel dies before the first epoch.
The model I am using is a DeepLab with input size 256 on a ResNet50 encoder. I cannot show the model summary because it is too long to fit in the question.
This issue only happens with this specific model and does not occur with others that I have used.
Here is the output of the cell when I try to train the model:
Epoch 1/100
2023-01-07 12:22:01.752760: W tensorflow/tsl/platform/profile_utils/cpu_utils.cc:128] Failed to get CPU frequency: 0 Hz
2023-01-07 12:22:05.727903: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
The Kernel crashed while executing code in the the current cell or a previous cell. Please review the code in the cell(s) to identify a possible cause of the failure. Click here for more info. View Jupyter log for further details.
Canceled future for execute_request message before replies were done
This issue occurs in both VSCode Jupyter and Jupyter Notebook/Lab.
I have tried restarting the kernel, reinstalling tensorflow, creating a new environment, and using the nomkl library. I am on an M1 MacBook Pro running Tensorflow 2.11.0 (macos). The python version is 3.10.
Problem solved by running in Colab. I just downloaded the weights and log files from there.
I am training a CNN model to classify simple images (squares and crosses). Everything works fine when I use the CPU, but when I use the GPU everything works until training starts, and then I get this error:
2022-06-15 04:25:49.158944: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8401
And then the program just stops.
Does anyone have an idea how to fix this?
If you use PyCharm, you can select the "Emulate terminal in output console" option to print detailed error information:
Run -> Edit Configurations -> Execution -> Emulate terminal in output console
On Windows, CUDA may be missing the zlibwapi.dll file; you can download it and move it to the bin directory of your CUDA installation.
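Before downloading anything, it may help to check whether the DLL is already reachable. This is a rough sketch that only scans the directories listed in PATH; the helper name find_on_path is hypothetical, and the actual DLL lookup used by CUDA on Windows may consult other locations as well.

```python
import os


def find_on_path(filename, path=None):
    """Return every directory in a PATH-style string that contains filename."""
    if path is None:
        path = os.environ.get("PATH", "")
    hits = []
    for directory in path.split(os.pathsep):
        if directory and os.path.isfile(os.path.join(directory, filename)):
            hits.append(directory)
    return hits


# An empty list suggests zlibwapi.dll is not in any PATH directory.
print(find_on_path("zlibwapi.dll"))
```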
I'm trying to run a simple LSTM model with the following code:
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.LSTM(32,
                               input_shape=x_train_single.shape[-2:]))
model.add(tf.keras.layers.Dense(1))
model.compile(optimizer=tf.keras.optimizers.RMSprop(), loss='mae')
single_step_history = model.fit(train_data_single, epochs=EPOCHS,
                                steps_per_epoch=EVALUATION_INTERVAL)
The error occurs when it tries to fit the model:
tensorflow.python.framework.errors_impl.UnknownError: [_Derived_] Fail to find the dnn implementation.
[[{{node CudnnRNN}}]]
[[sequential/lstm/StatefulPartitionedCall]] [Op:__inference_distributed_function_3107]
There's also another message like this:
2020-02-22 19:08:06.478567: W tensorflow/core/kernels/data/cache_dataset_ops.cc:820] The calling
iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the
dataset, the partially cached contents of the dataset will be discarded. This can happen if you have
an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use
`dataset.take(k).cache().repeat()` instead.
I tried all the methods in this question, but none of them worked for me.
My environment is:
tensorflow-gpu 2.0
CUDA v10
CuDNN 7.6.5
Solution
OK, I found that I didn't have the latest Nvidia driver, so I upgraded it, and now it works.
Answering here for the benefit of the community even if the user has provided the solution.
Upgrading Nvidia driver to the latest has resolved the issue.
You can update the NVIDIA driver manually from the NVIDIA website: select your product details and OS, download the most recent driver, then run the installer and overwrite the old driver.
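To see which driver is currently installed before and after upgrading, the version can be read from nvidia-smi. This is a small sketch, assuming nvidia-smi is on the PATH; the function names are hypothetical, and the code returns an empty list rather than failing when the tool is missing.

```python
import shutil
import subprocess


def parse_driver_versions(output):
    """Turn nvidia-smi's one-version-per-line output into a list of strings."""
    return [line.strip() for line in output.splitlines() if line.strip()]


def installed_driver_versions():
    """Return the driver version reported for each GPU, or [] if unavailable."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return []
    result = subprocess.run(
        [exe, "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return []
    return parse_driver_versions(result.stdout)


print(installed_driver_versions())
```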
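Try the following:
import tensorflow as tf
# Enable memory growth so TensorFlow does not reserve all GPU memory up front.
physical_devices = tf.config.list_physical_devices('GPU')
for device in physical_devices:
    tf.config.experimental.set_memory_growth(device, enable=True)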
I am working with a rather large network (98 million parameters) and use the Keras ModelCheckpoint callback to save my weights. When I reload the saved weights using Keras, I can see that the loading operation adds approximately 10 operations per layer to my graph. This results in a huge increase in the memory used by my total network. Is this expected behavior? And if so, are there any known workarounds?
Details:
I am using: tf.keras.callbacks.ModelCheckpoint with "save_weights_only=True" as argument to save the weights
The code for loading it is:
model.load_weights(path_to_existing_weights)
where model is a custom keras model.
I am using Tensorflow 1.14 and Keras 2.3.0
Does anyone have any ideas?
This seems to me to be unexpected behavior, but I can't see anything obvious that you are doing wrong. Are you sure there were no changes to your model between the time you saved the weights and the time you reloaded them? All I can suggest is to try the same thing, except this time change the callback to save the entire model. Then reload the model and check the graph. I also ran across this; I doubt it is the problem, but I would check it out:
In order to save your Keras models as HDF5 files, e.g. via keras.callbacks.ModelCheckpoint, Keras uses the h5py Python package. It is a dependency of Keras and should be installed by default. If you are unsure if h5py is installed you can open a Python shell and load the module via
import h5py
If it imports without error it is installed, otherwise you can find detailed installation instructions here: http://docs.h5py.org/en/latest/build.html
Perhaps you might try reinstalling it.
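The installation check can also be done without actually importing the package, which is handy inside a script. This is a minimal sketch using only the standard library; the helper name is_installed is hypothetical.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


print("h5py installed:", is_installed("h5py"))
```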
When training deep neural networks, training runs faster when it uses the GPU as the processing unit.
This is done by configuring cuDNN optimizations and changing the processing unit through an environment variable with the following line (Python 2.7 and Keras on Windows):
os.environ["THEANO_FLAGS"] = "floatX=float32,device=gpu,optimizer_including=cudnn,gpuarray.preallocate=0.8,dnn.conv.algo_bwd_filter=deterministic,dnn.conv.algo_bwd_data=deterministic,dnn.include_path=e:/toolkits.win/cuda-8.0.61/include,dnn.library_path=e:/toolkits.win/cuda-8.0.61/lib/x64"
The output is then:
Using gpu device 0: TITAN Xp (CNMeM is disabled, cuDNN 5110)
The problem is that the GPU memory is limited compared to the RAM (12GB and 128GB respectively), and the training is only one phase of the whole flow. Therefore I want to change back to CPU once the training is completed.
I've tried the following line, but it has no effect:
os.environ["THEANO_FLAGS"] = "floatX=float32,device=cpu"
My questions are:
Is it possible to change from GPU to CPU and vice-versa during runtime? (technically)
If yes, how can I do it programmatically in Python? (2.7, Windows, and Keras with Theano backend).
Yes, this is possible, at least for the TensorFlow backend. You just have to also import tensorflow and put your code inside the following with blocks:
import tensorflow as tf
with tf.device('/cpu:0'):
    pass  # your code
with tf.device('/gpu:0'):
    pass  # your code
I am unsure whether this also works for the Theano backend. However, switching from one backend to the other is just a matter of setting a flag beforehand, so this should not be too much trouble.
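For the Theano backend, THEANO_FLAGS is read when theano is first imported, which would explain why reassigning os.environ afterwards has no effect. One possible workaround is to run each phase in its own process with its own flags. The sketch below is written for Python 3 (the question uses 2.7, where subprocess.run is not available); the helper name run_with_theano_flags and the module names train and postprocess are hypothetical.

```python
import os
import subprocess
import sys


def run_with_theano_flags(code, flags):
    """Run a Python snippet in a child process whose THEANO_FLAGS is set to flags."""
    env = dict(os.environ)
    env["THEANO_FLAGS"] = flags
    proc = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stdout, proc.stderr


# Example (hypothetical modules, commented out so the sketch runs anywhere):
# run_with_theano_flags("import train; train.main()",
#                       "floatX=float32,device=gpu")
# run_with_theano_flags("import postprocess; postprocess.main()",
#                       "floatX=float32,device=cpu")
```

Because the GPU process exits after training, its GPU memory is released before the CPU phase starts; the trade-off is that results have to be passed between phases through files.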