Jupyter Kernel Dies when Running TensorFlow Model - python

While training a TensorFlow model in Jupyter, the kernel dies before the first epoch completes.
The model I am using is a DeepLab with input size 256 on a ResNet50 encoder. I cannot show the model summary because it is too long to fit in the question.
This issue only happens with this specific model and does not occur with others that I have used.
Here is the output of the cell when I try to train the model:
Epoch 1/100
2023-01-07 12:22:01.752760: W tensorflow/tsl/platform/profile_utils/cpu_utils.cc:128] Failed to get CPU frequency: 0 Hz
2023-01-07 12:22:05.727903: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
The Kernel crashed while executing code in the current cell or a previous cell. Please review the code in the cell(s) to identify a possible cause of the failure. Click here for more info. View Jupyter log for further details.
Canceled future for execute_request message before replies were done
This issue occurs in both VSCode Jupyter and Jupyter Notebook/Lab.
I have tried restarting the kernel, reinstalling TensorFlow, creating a new environment, and using the nomkl library. I am on an M1 MacBook Pro running tensorflow-macos 2.11.0 with Python 3.10.
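One isolation step, a sketch assuming the standard tensorflow-macos/tensorflow-metal setup (the "Plugin optimizer for device_type GPU" line in the log above suggests tensorflow-metal is active), is to hide the GPU before building the model; if training then gets past the first epoch, the crash is in the Metal GPU path rather than in the model itself:
import tensorflow as tf

# Hide all GPUs so the Metal plugin optimizer is taken out of the picture;
# this must run before the model is built or any op touches the GPU.
tf.config.set_visible_devices([], 'GPU')
print(tf.config.get_visible_devices())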

Problem solved by running in Colab. I just downloaded the weights and log files from there.

Related

Dead Kernel when loading a quantized model: PyTorch

I have a pretrained quantized model which I trained on Colab. I moved the files to my system to run ONNX Runtime inference. However, when the model is loaded with
quantized_model = torch.load('quantizedmodel.pt')
my kernel proceeds to die; non-quantized models seem to load just fine.
My torch version is '1.11.0'. One thing I did differently: on Colab I was mapping the model to CUDA, but here I mapped it to the device with map_location = "cpu:0" instead. I tried changing it back to the CUDA device with cuda:0, to no avail.
Running nvidia-smi shows the GPU and driver as expected, which seems to check out. Looking for some help with debugging this.
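A debugging sketch of what could be tried next (the quantized-engine line is an assumption that the local machine is x86; nothing here is a confirmed fix):
import torch

# Quantized kernels are backend-specific: 'fbgemm' targets x86, 'qnnpack'
# targets ARM. A mismatch between the engine the model was quantized for
# and the one active locally can crash at load or inference time.
torch.backends.quantized.engine = 'fbgemm'  # assumption: x86 machine

# Map everything onto the CPU explicitly; eager-mode quantized ops run on
# CPU only, so there is nothing to gain from cuda:0 here.
quantized_model = torch.load('quantizedmodel.pt', map_location='cpu')
quantized_model.eval()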

Training CNN model using keras works with CPU but not with GPU

I am training a CNN model to classify simple images (squares and crosses). Everything works just fine when I use the CPU, but when I use the GPU everything works until the training starts, and then I get this error:
2022-06-15 04:25:49.158944: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8401
And then the program just stops.
Does anyone have an idea how to fix this?
If you use PyCharm, you can select the "Emulate terminal in output console" option to print detailed error information:
Run -> Edit Configurations -> Execution -> Emulate terminal in output console
On Windows, CUDA may be missing the zlibwapi.dll file; you can download it and place it in the CUDA bin directory.
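As a quick sanity check (a sketch assuming TensorFlow 2.x), you can print the CUDA and cuDNN versions your TensorFlow build expects and compare them with what is installed, since a silent exit right after the "Loaded cuDNN" line often points to a missing library such as zlibwapi.dll:
import tensorflow as tf

# Versions this TensorFlow binary was built against.
build = tf.sysconfig.get_build_info()
print(build.get('cuda_version'), build.get('cudnn_version'))

# Whether the GPU is visible at all.
print(tf.config.list_physical_devices('GPU'))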

Python neural network with Keras runs on CPU, but crashes on the GPU

I implemented a neural network that learns to play Pac-Man using gym, box2d, and gym[atari] with Keras models. The training was very slow, so I tried to make it run on my GTX 1060 Max-Q.
I installed the latest version of TensorFlow, CUDA 11.0, and cuDNN 8.0.4.30. The program opens all the libraries successfully, detects the GPU correctly, creates the Tensor device, starts the first frame of the render, freezes for about 9 seconds, and then exits with code -1073740791 (0xC0000409).
Why is this happening and what can I do to fix it?
-1073740791 (0xC0000409) is a stack buffer overflow on Windows machines.
Here's some documentation for it.
You need to make the training run smaller or run it on a better PC.
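If memory pressure is the suspect, one hedged test (assuming a recent TF 2.x; the 2 GB cap is an arbitrary illustration, not a recommendation) is to limit how much GPU memory TensorFlow may allocate and see whether the exit code changes:
import tensorflow as tf

# Cap TensorFlow's GPU allocation; must run before the GPU is initialized.
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=2048)])  # 2 GB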

Fail to find the dnn implementation for LSTM

I'm trying to run a simple LSTM model with the following code:
import tensorflow as tf

model = tf.keras.models.Sequential()
model.add(tf.keras.layers.LSTM(32, input_shape=x_train_single.shape[-2:]))
model.add(tf.keras.layers.Dense(1))
model.compile(optimizer=tf.keras.optimizers.RMSprop(), loss='mae')
single_step_history = model.fit(train_data_single, epochs=EPOCHS,
                                steps_per_epoch=EVALUATION_INTERVAL)
The error happens when it tries to fit the model:
tensorflow.python.framework.errors_impl.UnknownError: [_Derived_] Fail to find the dnn implementation.
[[{{node CudnnRNN}}]]
[[sequential/lstm/StatefulPartitionedCall]] [Op:__inference_distributed_function_3107]
There's another message alongside it:
2020-02-22 19:08:06.478567: W tensorflow/core/kernels/data/cache_dataset_ops.cc:820] The calling
iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the
dataset, the partially cached contents of the dataset will be discarded. This can happen if you have
an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use
`dataset.take(k).cache().repeat()` instead.
I tried all the methods in this question, but none of them worked for me.
My environment is:
tensorflow-gpu 2.0
CUDA v10
CuDNN 7.6.5
Solution
OK.. I found that I didn't have the latest Nvidia driver, so I upgraded, and it works.
Answering here for the benefit of the community, even though the user has provided the solution.
Upgrading the Nvidia driver to the latest version resolved the issue.
You can update the NVIDIA driver manually from here by selecting the product details and OS; you will have to download the most recent driver from their website, then run the installer and overwrite the old driver.
Try the below:
import tensorflow as tf

# Let TensorFlow allocate GPU memory on demand rather than grabbing it all
# up front, which can prevent cuDNN from failing to initialize.
physical_devices = tf.config.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(physical_devices[0], enable=True)
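Note that set_memory_growth must be called before the GPU is first initialized (ideally right after importing TensorFlow), and physical_devices will be an empty list if TensorFlow cannot see the GPU at all, so it is worth printing it before indexing into it.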

Do pretrained tensorflow models need to be used by machines with the same versions?

I trained a CNN on a Linux machine with Keras/TensorFlow but can't get the pretrained model to run on my Raspberry Pi. The model was made on Ubuntu 16.04 with Python 3.6.7, TensorFlow 1.7.0, cuDNN 7.0.5, and CUDA 9. I am trying to run it on a Raspberry Pi 3 Model B+ with Python 3.5.3 and TensorFlow 1.13.1.
I have no problem loading and running the pretrained model on the same machine it was created on. The issue is only when I try to run that same pretrained model on the RPi system. I end up getting a segmentation fault.
I tried updating the Linux machine that created the model to tensorflow 1.12 but after tensorflow 1.12 successfully installed, I got "Failed to get convolution algorithm. This is probably because cuDNN failed to initialize" errors, so I'd rather not go down that route. I want to know if it's possible to just use this pretrained model with tensorflow 1.13.1 on the RPi.
Here's what I'm doing on the RPi:
>>> import tensorflow as tf
/usr/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: compiletime version 3.4 of module 'tensorflow.python.framework.fast_tensor_util' does not match runtime version 3.5
return f(*args, **kwds)
/usr/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: builtins.type size changed, may indicate binary incompatibility. Expected 432, got 412
>>> print(tf.__version__)
1.13.1
>>> from keras.models import load_model
Using TensorFlow backend.
>>> model = load_model(save_dir+model_name)
WARNING:tensorflow:From /home/pi/.local/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From /home/pi/.local/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py:3445: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
2019-03-25 17:08:11.471364: W tensorflow/core/framework/allocator.cc:124] Allocation of 209715200 exceeds 10% of system memory.
2019-03-25 17:12:55.123877: W tensorflow/core/framework/allocator.cc:124] Allocation of 209715200 exceeds 10% of system memory.
Backend terminated (returncode: -11)
Fatal Python error: Segmentation fault
I need some guidance on why this is happening - are the versions incompatible? Maybe the model is too large for the RPi (I doubt it - it's a fairly shallow model with 18 layers)? The other forum posts I've seen about segmentation faults seemed a lot more dire (e.g., they couldn't even type standard commands in the terminal without seeing a segmentation error) - this segmentation fault only happens (and happens repeatably) through the above commands.
Any advice/help greatly appreciated!
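One workaround worth trying (a sketch, not guaranteed across a 1.7-to-1.13 gap) is to ship the architecture and the weights separately instead of one full .h5 model, which sidesteps some of the serialization differences between Keras/TensorFlow versions:
# On the Ubuntu machine that trained the model:
with open('model_config.json', 'w') as f:
    f.write(model.to_json())            # architecture only
model.save_weights('model_weights.h5')  # weights only

# On the Raspberry Pi:
from keras.models import model_from_json
with open('model_config.json') as f:
    model = model_from_json(f.read())
model.load_weights('model_weights.h5')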
