We have a Databricks notebook issue. One of our notebook cells seems to hang, even though the driver logs show that the cell has finished executing. Does anyone know why the cell keeps hanging and never completes? Details below.
Situation
We are training an ML model with PyTorch in the Databricks notebook UI
The training uses MLflow to register a model
At the end of the cell we print the statement "Done with training"
We are using a single-node cluster with:
Databricks Runtime: 10.4 LTS ML (includes Apache Spark 3.2.1, GPU, Scala 2.12)
Node type: Standard_NC6s_v3
Observations
In the Databricks notebook UI we see the cell running the PyTorch training and showing the intermediate training logs
After a while the model is registered in MLflow, but we don't see this logged in the Databricks notebook UI
We can also see the print statement "Done with training" in the driver logs, but we don't see this statement in the Databricks notebook UI
Code
import mlflow
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks.early_stopping import EarlyStopping

trainer = Trainer(gpus=-1, num_sanity_val_steps=0, logger=logger,
                  callbacks=[EarlyStopping(monitor="test_loss", patience=2, mode="min", verbose=True)])

with mlflow.start_run(experiment_id=experiment_id) as run:
    trainer.fit(model, train_loader, val_loader)
    mlflow.log_param("param1", param1)
    mlflow.log_param("param2", param2)
    mlflow.pytorch.log_model(model._image_model, artifact_path="model", registered_model_name="image_model")
    mlflow.pytorch.log_state_dict(model._image_model.state_dict(), "model")

print("Done with training")
Packages
mlflow-skinny==1.25.1
torch==1.10.2+cu111
torchvision==0.11.3+cu111
Solutions that I tried that did not work
Tried clearing the GPU cache and forcing garbage collection, but that did not work:
# Cleaning up to avoid any open processes...
import gc
import torch

del trainer
torch.cuda.empty_cache()
# Force garbage collection
gc.collect()
Tried forcing the notebook to exit, but that also did not work:
import json

parameters = json.dumps({"Status": "SUCCESS", "Message": "DONE"})
dbutils.notebook.exit(parameters)
I figured out the issue. To solve this, adjust the parameters of the torch.utils.data.DataLoader:
Disable pin_memory
Set num_workers to about 30% of the total vCPUs (e.g. 1 or 2 for Standard_NC6s_v3)
For example:
train_loader = DataLoader(
    train_dataset,
    batch_size=32,
    num_workers=1,
    pin_memory=False,
    shuffle=True,
)
This seems to be a known PyTorch DataLoader deadlock. See the details here:
https://stackoverflow.com/a/72473053/10555941
There is more to add to the above answer. When you set pin_memory=True and set num_workers equal to the total number of vCPUs on the node, the DataLoader workers communicate over IPC. These IPC channels use shared memory, and they can exhaust the VM's shared memory, which leads to the hanging processes. num_workers only controls how many child workers load data, so it does not need to be an extreme value to speed up data loading; keeping it small, around 30% of the vCPUs, is enough.
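As a rough illustration of that sizing rule, the worker count can be derived from the vCPU count at runtime instead of being hard-coded (the 30% factor simply mirrors the advice above and is not an official recommendation):

import os

from torch.utils.data import DataLoader

# Roughly 30% of the available vCPUs, but always at least one worker.
num_workers = max(1, int((os.cpu_count() or 1) * 0.3))

train_loader = DataLoader(
    train_dataset,          # assumes train_dataset exists, as in the example above
    batch_size=32,
    num_workers=num_workers,
    pin_memory=False,       # disabled, as recommended above
    shuffle=True,
)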
Related
How can we clear GPU memory after finishing training a deep learning model in a Jupyter notebook? The problem is that no matter which framework I use (TensorFlow, PyTorch), the memory allocated on the GPU is not released unless I kill the process manually or kill the kernel and restart Jupyter. Do you have any idea how we could get rid of this problem by automating these steps?
The only workaround I found was to run the training in a separate process (a subprocess), so that the GPU memory is released when that process exits.
An example:
from multiprocessing import Process

def Training(arguments):
    # ... build and train the model here ...
    return model

if __name__ == '__main__':
    # Target is the training function defined above; note the trailing comma
    # in args so that Python treats it as a tuple.
    Subprocess = Process(target=Training, args=(arguments,))
    # Start the subprocess
    Subprocess.start()
    # Wait for the subprocess to complete
    Subprocess.join()
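One caveat with the example above: the return model in the child process never reaches the parent, because the function runs in a separate address space. A minimal sketch of the usual pattern (with a hypothetical checkpoint path and a placeholder model) is to have the child save its weights to disk and have the parent load them after join(); the GPU memory is freed when the child process exits:

import torch
import torch.nn as nn
from multiprocessing import Process

CHECKPOINT_PATH = "checkpoint.pt"  # hypothetical path; any writable location works

def training(arguments):
    model = nn.Linear(10, 1)  # placeholder for the real model and training loop
    # Persist the result instead of returning it; return values from a
    # child process are not visible to the parent.
    torch.save(model.state_dict(), CHECKPOINT_PATH)

if __name__ == '__main__':
    p = Process(target=training, args=({"epochs": 1},))
    p.start()
    p.join()  # once the child exits, its GPU memory is released
    model = nn.Linear(10, 1)
    model.load_state_dict(torch.load(CHECKPOINT_PATH))  # reload the trained weights in the parent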
I'm not very familiar with parallelization in Python and I'm getting an error when trying to train a model on multiple training folds in parallel. Here's a simplified version of my code:
import mlflow
from multiprocessing.pool import ThreadPool

def train_test_model(fold):
    # here I train the model etc...
    # now I want to save the parameters and metrics
    with mlflow.start_run():
        mlflow.log_param("run_name", run_name)
        mlflow.log_param("modeltype", modeltype)
        # and so on...

if __name__ == "__main__":
    pool = ThreadPool(processes=num_trials)
    # run the folds in parallel
    pool.map(lambda fold: train_test_model(fold), folds)
I'm getting the following error:
Exception: Run with UUID 23e9bb6d22674a518e48af9c51252860 is already active. To start a new run, first end the current run with mlflow.end_run(). To start a nested run, call start_run with nested=True
The documentation says that mlflow.start_run() starts a new run and makes it active, which is the root of my problem. Every thread starts an MLflow run for its corresponding fold and makes it active, while I need the runs to run in parallel, i.e. all be active(?) and save the parameters/metrics of their corresponding folds. How can I solve this issue?
I found a solution; maybe it will be useful for someone else. You can see the details with code examples here: https://github.com/mlflow/mlflow/issues/3592
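For reference, one pattern that avoids the fluent mlflow.start_run() state entirely (not necessarily the exact code from the linked issue) is to create one run per fold explicitly through MlflowClient; the experiment id, parameter and metric values below are illustrative:

from multiprocessing.pool import ThreadPool
from mlflow.tracking import MlflowClient

client = MlflowClient()
experiment_id = "0"  # illustrative; use your own experiment id

def train_test_model(fold):
    run = client.create_run(experiment_id)  # one explicit run per fold
    run_id = run.info.run_id
    # ... train and evaluate the model for this fold here ...
    client.log_param(run_id, "fold", fold)
    client.log_metric(run_id, "accuracy", 0.9)  # placeholder metric value
    client.set_terminated(run_id)  # mark the run as finished

if __name__ == "__main__":
    folds = [0, 1, 2]
    with ThreadPool(processes=len(folds)) as pool:
        pool.map(train_test_model, folds)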
I recently started using Google Colab to train my CNN model. It always takes about 10+ hours to train once, but I cannot stay in the same place during those 10+ hours, so I always power off my notebook and let the process keep going.
My code saves models automatically. I figured out that when I disconnect from Colab, the process is still saving models after disconnection.
Here are the questions:
When I try to reconnect to the Colab notebook, it always gets stuck at the "INITIALIZING" stage and can't connect. I'm sure that the process is running. How do I know whether the process is OVER?
Is there any way to reconnect to the ongoing process? It would be nice to observe the training losses during training.
Sorry for my poor English, thanks a lot.
Output your loss results to a log file saved in your drive, and periodically check this file.
You can run your training process like:
# Define the path in Python, then interpolate it into the shell command.
log_file = "/content/drive/My Drive/path/log.log"
!python train.py > "{log_file}"
First question: restart the runtime from the Runtime menu.
Second question: I think you can use TensorBoard to monitor your work, for example as sketched below.
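If the model is built with Keras, one way to follow the TensorBoard suggestion (a sketch with an illustrative Drive path) is to write the TensorBoard logs to a directory on Drive during training and point TensorBoard at it after reconnecting:

import tensorflow as tf

# Illustrative path on a mounted Google Drive; the logs survive a disconnect.
log_dir = "/content/drive/My Drive/path/tb_logs"
tb_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir)

# Attach it to the existing training call, e.g.:
# model.fit(x_train, y_train, epochs=10, callbacks=[tb_callback])

# After reconnecting, in a notebook cell:
#   %load_ext tensorboard
#   %tensorboard --logdir "/content/drive/My Drive/path/tb_logs"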
It seems there's no standard way to do this, but you can save your model to Google Drive with the current training epoch number in the file name, so when you see something like "my_model_epoch_1000" on your Google Drive, you will know that the process is over.
I am conducting experiments with neural networks using Keras + TensorFlow backend. I do this using GPU on my PC, running Windows 7.
My workflow looks like the following.
I create a small Python script that defines a model and then runs model.fit_generator for ~50 epochs, with early stopping if validation accuracy does not improve for 10-15 epochs. Then I open a terminal and run a command like python model_v3_4_5.py.
Usually one epoch takes about 1.5 hours. During this period some new ideas (training parameters or a new architecture) come to mind.
Then I create a new python script...
During my experiments I've found that it is better not to train several models in parallel: I've seen epoch times double and validation accuracy drop strangely.
Therefore, I'd like to wait until the first training finishes and then run the second one. At the same time, I'd like to avoid my PC sitting idle, and start a new training immediately after the first one has finished.
But I don't know exactly when the first training will finish, so running commands like timeout <50 hours> && python model_v3_4_6.py would be a dumb solution.
So I need some kind of queue manager.
One solution that has come to mind is installing a Jenkins slave on my PC and using the queues that Jenkins provides. As far as I remember, though, Jenkins has issues with GPU access.
Another variant is training models in a Jupyter notebook in separate cells. However, I cannot see a queue of cell executions here, and this is a topic that is still being discussed.
Update. Another variant: add some code to the model scripts that checks the current GPU state (is it running a network right now?) and waits while it is busy. This will cause issues if several scripts (more than one bright new idea :) ) are waiting for the GPU to become idle.
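Something like this rough sketch (it assumes nvidia-smi is on the PATH and simply polls for running compute processes):

import subprocess
import time

def gpu_busy():
    # Lists the PIDs of compute processes currently using the GPU.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-compute-apps=pid", "--format=csv,noheader"]
    )
    return bool(out.strip())

# Wait until the GPU is idle before starting this script's training.
while gpu_busy():
    time.sleep(60)

# ... start training here ...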
Are there any other variants?
Finally, I've come up with a simple cmd script:
rem Run the queued model scripts one at a time, then poll for new ones.
set PYTHONPATH=%CD%
:start
for %%m in (train_queue\model*.py) do (
    python %%m
    del %%m
)
rem Wait a few seconds before checking the queue again.
timeout 5
goto start
Create a subdirectory train_queue and put the model scripts in it. All scripts log their output to files whose names contain timestamps.
The script also calls the timeout program to pause between passes over the queue.
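A minimal sketch of that logging convention inside a model script (directory and file names are illustrative):

import os
import sys
from datetime import datetime

# Each model script redirects its own output to a timestamped log file.
os.makedirs("logs", exist_ok=True)
log_path = f"logs/model_v3_4_5_{datetime.now():%Y%m%d_%H%M%S}.log"
sys.stdout = open(log_path, "w", buffering=1)  # line-buffered so progress appears promptly

print("training started")
# ... define the model and call model.fit_generator(...) here ...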
I want to run TensorFlow on multiple machines with multiple GPUs. As an initial step, I am trying out distributed TensorFlow on a single machine (following the TensorFlow tutorial https://www.tensorflow.org/how_tos/distributed/).
Below are the lines after which sess.run() gets stuck:
import tensorflow as tf
cluster = tf.train.ClusterSpec({"local": ["localhost:2222", "localhost:2223"]})
server = tf.train.Server(cluster, job_name="local", task_index=0)
a = tf.constant(8)
b = tf.constant(9)
sess = tf.Session('grpc://localhost:2222')
Everything works fine up to here, but when I run sess.run(), it hangs:
sess.run(tf.mul(a,b))
If anybody has already worked with distributed TensorFlow, please let me know the solution, or point me to another tutorial that works.
By default, Distributed TensorFlow will block until all servers named in the tf.train.ClusterSpec have started. This happens during the first interaction with the server, which will typically be the first sess.run() call. Therefore, if you haven't also started a server listening on localhost:2223, then TensorFlow will block until you do.
There are a few solutions to this problem, depending on your later goals:
Start a server on localhost:2223. In another process, run the following script:
import tensorflow as tf
cluster = tf.train.ClusterSpec({"local": ["localhost:2222", "localhost:2223"]})
server = tf.train.Server(cluster, job_name="local", task_index=1)
server.join() # Wait forever for incoming connections.
Remove task 1 from the original tf.train.ClusterSpec:
import tensorflow as tf
cluster = tf.train.ClusterSpec({"local": ["localhost:2222"]})
server = tf.train.Server(cluster, job_name="local", task_index=0)
# ...
Specify a "device filter" when you create the tf.Session so that the session only uses task 0.
# ...
sess = tf.Session("grpc://localhost:2222",
                  config=tf.ConfigProto(device_filters=["/job:local/task:0"]))