How can we clear GPU memory after finishing deep learning model training in a Jupyter notebook? The problem is that, no matter which framework I stick to (TensorFlow, PyTorch), the memory allocated on the GPU does not get released unless I kill the process manually, or kill the kernel and restart Jupyter. Do you have any idea how we could get rid of this problem by automating those steps?
The only workaround I found was to use multiprocessing and execute the training in a subprocess.
An example:
from multiprocessing import Process

def Training(arguments):
    ....
    ....
    return model

if __name__ == '__main__':
    # Run the complete training function defined above in its own process.
    # Note the trailing comma in args: it tells Python this is a tuple.
    Subprocess = Process(target=Training, args=(arguments, ))
    # Start the defined subprocess
    Subprocess.start()
    # Wait for the subprocess to complete
    Subprocess.join()
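Note that Process does not give you the function's return value, so if you need the trained model back in the notebook, pass a result through a multiprocessing.Queue or save the weights to disk inside the subprocess. A minimal sketch of that idea (the training body, file path and arguments below are placeholders, not part of the original answer):

from multiprocessing import Process, Queue

def training(arguments, result_queue):
    # ... build and train the model here with the framework of your choice ...
    # Save the trained weights to disk instead of returning the model object,
    # then report the path back to the parent through the queue.
    model_path = "trained_model.pt"   # hypothetical output path
    # e.g. torch.save(model.state_dict(), model_path)
    result_queue.put(model_path)

if __name__ == "__main__":
    arguments = {"epochs": 10}        # placeholder arguments
    result_queue = Queue()
    p = Process(target=training, args=(arguments, result_queue))
    p.start()
    result_path = result_queue.get()  # read before join() so a full queue cannot block the child
    p.join()                          # GPU memory is released when the child process exits
    print("Weights written to", result_path)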
We have a Databricks notebook issue. One of our notebook cells seems to hang, while the driver logs show that the cell has actually been executed. Does anyone know why our notebook cell keeps hanging and does not complete? See the details below.
Situation
We are training an ML model with PyTorch in the Databricks notebook UI
The training uses mlflow to register a model
At the end of the cell we print a statement "Done with training"
We are using a single-node cluster with:
Databricks Runtime: 10.4 LTS ML (includes Apache Spark 3.2.1, GPU, Scala 2.12)
Node type: Standard_NC6s_v3
Observations
In the Databricks notebook UI we see the cell running pytorch training and showing the intermediate logs of the training
After a while, the model is registered in mlflow, but we don't see this log in the Databricks notebook UI
We can also see the print statement "Done with training" in the driver logs, but we don't see this statement in the Databricks notebook UI either
Code
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks.early_stopping import EarlyStopping
trainer = Trainer(gpus=-1, num_sanity_val_steps=0, logger = logger, callbacks=[EarlyStopping(monitor="test_loss", patience = 2, mode = "min", verbose=True)])
with mlflow.start_run(experiment_id = experiment_id) as run:
trainer.fit(model, train_loader, val_loader)
mlflow.log_param("param1", param1)
mlflow.log_param("param2", param2)
mlflow.pytorch.log_model(model._image_model, artifact_path="model", registered_model_name="image_model")
mlflow.pytorch.log_state_dict(model._image_model.state_dict(), "model")
print("Done with training")
Packages
mlflow-skinny==1.25.1
torch==1.10.2+cu111
torchvision==0.11.3+cu111
Solutions that I tried that did not work
Tried adding cache deletion, but that did not work
import gc
import torch

# Cleaning up to avoid any open processes...
del trainer
torch.cuda.empty_cache()
# force garbage collection
gc.collect()
Tried forcing the notebook to exit, but that also did not work
import json

parameters = json.dumps({"Status": "SUCCESS", "Message": "DONE"})
dbutils.notebook.exit(parameters)
I figured out the issue. To solve this, adjust the parameters of the torch.utils.data.DataLoader:
Disable pin_memory
Set num_workers to about 30% of the total vCPUs (e.g. 1 or 2 for Standard_NC6s_v3)
For example:
from torch.utils.data import DataLoader

train_loader = DataLoader(
    train_dataset,
    batch_size=32,
    num_workers=1,
    pin_memory=False,
    shuffle=True,
)
This issue seems to be related to PyTorch and is due to a deadlock. See the details here:
https://stackoverflow.com/a/72473053/10555941
There is more to add to the above answer. When you set pin_memory=True and set num_workers equal to the total number of vCPUs on the node, the workers communicate via IPC. These IPCs use shared memory, and they exhaust the shared memory of the VM.
This leads to the processes hanging. The DataLoader's num_workers only controls how many worker processes help with data loading; it does not need to be an extreme value to speed up loading. A small value, around 30% of the vCPUs, is enough.
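For instance, a small helper that derives num_workers from the machine's CPU count, sketching the 30% rule of thumb above (the tiny TensorDataset is only a stand-in for the real training data):

import os

import torch
from torch.utils.data import DataLoader, TensorDataset

# Roughly 30% of the visible vCPUs, but at least one worker.
num_workers = max(1, int(0.3 * (os.cpu_count() or 1)))

# A tiny stand-in dataset; replace with the real training dataset.
train_dataset = TensorDataset(torch.randn(256, 3), torch.randint(0, 2, (256,)))

train_loader = DataLoader(
    train_dataset,
    batch_size=32,
    num_workers=num_workers,
    pin_memory=False,
    shuffle=True,
)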
I'm not very familiar with parallelization in Python and I'm getting an error when trying to train a model on multiple training folds in parallel. Here's a simplified version of my code:
from multiprocessing.pool import ThreadPool

import mlflow

def train_test_model(fold):
    # here I train the model etc...
    # now I want to save the parameters and metrics
    with mlflow.start_run():
        mlflow.log_param("run_name", run_name)
        mlflow.log_param("modeltype", modeltype)
        # and so on...

if __name__ == "__main__":
    pool = ThreadPool(processes=num_trials)
    # run folds in parallel
    pool.map(lambda fold: train_test_model(fold), folds)
I'm getting the following error:
Exception: Run with UUID 23e9bb6d22674a518e48af9c51252860 is already active. To start a new run, first end the current run with mlflow.end_run(). To start a nested run, call start_run with nested=True
The documentation says that mlflow.start_run() starts a new run and makes it active, which is the root of my problem. Every thread starts an MLflow run for its corresponding fold and makes it active, while I need the runs to run in parallel, i.e. all be active(?), and save the parameters/metrics of the corresponding fold. How can I solve this issue?
I found a solution; maybe it will be useful for someone else. You can see the details with code examples here: https://github.com/mlflow/mlflow/issues/3592
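One pattern that avoids the clash (a sketch along those lines, not necessarily the exact code from the linked issue) is to skip the thread-local active run entirely and log through MlflowClient with an explicit run_id per fold; the experiment id, folds and logged values below are placeholders:

from multiprocessing.pool import ThreadPool

from mlflow.tracking import MlflowClient

client = MlflowClient()
experiment_id = "0"              # placeholder experiment id
folds = [0, 1, 2, 3]             # placeholder folds

def train_test_model(fold):
    # Create a dedicated run for this fold and log through the client,
    # so no thread ever touches the single globally active run.
    run = client.create_run(experiment_id)
    run_id = run.info.run_id
    client.log_param(run_id, "fold", fold)
    # ... train and evaluate, then log metrics the same way ...
    client.log_metric(run_id, "accuracy", 0.0)   # placeholder value
    client.set_terminated(run_id)

if __name__ == "__main__":
    pool = ThreadPool(processes=len(folds))
    pool.map(train_test_model, folds)
    pool.close()
    pool.join()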
I am trying to parallelize data loading and learning in a PyTorch project. From the main thread I create two threads, one loading the next batch while the other is learning from the current batch. The loading thread transfers the loaded data through a Queue.
My problem is that the program suddenly stops at a random point of execution without any error message (whether run under the debugger or not). Sometimes in the first epoch, sometimes after 7 epochs (50 min)... Is it possible to get an error message somehow? Has anyone encountered this problem?
It makes me think of a memory leak, but I have checked all the code around the shared data. I have also seen that prints are not thread safe, so I removed them... Please also note that the code didn't have this problem before parallelization.
I am using:
a conda environment
threading.Thread
PyTorch
Windows Server
Update: Apparently the PyTorch code that runs the CUDA learning does not like being called from a separate thread. If I keep the CUDA learning in the main thread, it stays alive...
Code: since I get fewer unexpected crashes with the CUDA learning in the main thread, I kept only one thread (it also makes more sense).
part of the main:
import queue
import threading

dataQueue = queue.Queue()
dataAvailable = threading.Event()
doneComputing = threading.Event()

# Create the loading thread (LoadingThread is defined elsewhere)
loadThread = LoadingThread(1, dataQueue)
loadThread.daemon = True

# Add threads to the thread list
threads = []
threads.append(loadThread)

print(" >> Parallel process start")

# Start the loading thread
loadThread.start()
doneComputing.set()

# Learning process (kept in the main thread)
for i_import in range(import_init['n_import_train']):
    # Wait for the next batch to be loaded
    dataAvailable.wait()
    data = dataQueue.get()
    dataAvailable.clear()
    # Do the learning step
    net, net_prev, y_all, yhat_all = doLearning(data, i_import, net, net_prev, y_all, yhat_all)
    doneComputing.set()

# Wait for all threads to complete
for t in threads:
    t.join()
One interesting fact is that the program seems to crash more often if the model sent to CUDA is heavy... Could that be purely a CUDA problem?
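As a side note on the missing error messages: exceptions raised inside a threading.Thread are not propagated to the main thread, so a crash in the loader can look like a silent stop. A hedged sketch of one way to surface them, by forwarding the exception through the queue and re-raising it in the main thread (the loader body here is a placeholder, not the original LoadingThread):

import queue
import threading

def loader(data_queue):
    try:
        for i in range(10):
            data_queue.put(("data", i))    # placeholder payload
        data_queue.put(("done", None))
    except Exception as exc:
        # Forward the exception so the main thread can see and re-raise it.
        data_queue.put(("error", exc))

if __name__ == "__main__":
    data_queue = queue.Queue()
    t = threading.Thread(target=loader, args=(data_queue,), daemon=True)
    t.start()
    while True:
        kind, payload = data_queue.get()
        if kind == "error":
            raise payload                  # the loader's traceback is no longer silent
        if kind == "done":
            break
        # ... do the learning step on `payload` in the main thread, as in the update above ...
    t.join()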
I'm using Keras to make and test different types of neural nets and need data to compare them. I need data on the CPU and memory used during training and testing. This is in Python, and as I looked around I found a lot of suggestions for psutil. However, everything I see seems to grab the current usage.
What is current usage? The amount of memory used at that specific moment? How do I use it to get the total CPU and memory used by the entire program, or at least by the portion of the code where the NN is training and testing? Thanks for any help!
psutil is a good recommendation for collecting that type of information. If you incorporate the code below into your existing Keras code, you can collect information about the CPU usage of your process at the time the cpu_times() method is called:
import psutil
process = psutil.Process()
print(process.cpu_times())
The meaning of the value returned by cpu_times() is explained here. It is cumulative, so if you want to know how much CPU time your Keras code used altogether, just call it before you exit the Python script.
To get the memory usage of your process at the moment you make the call, run memory_info() on the same process object we declared before:
print(process.memory_info())
The exact meaning of the CPU and memory results depends on which platform you're using. The memory info structure is explained here.
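To measure only the training and testing portion of your script, you could take snapshots of both counters before and after that block and compute the difference. A minimal sketch, with train_and_test() standing in for your Keras code (note that the RSS difference is a change in resident memory, not a peak; for peaks you would sample in the background as shown next):

import psutil

process = psutil.Process()

def train_and_test():
    # placeholder for the Keras training/testing code
    sum(i * i for i in range(10_000_000))

cpu_before = process.cpu_times()
mem_before = process.memory_info().rss

train_and_test()

cpu_after = process.cpu_times()
mem_after = process.memory_info().rss

print("CPU seconds (user):", cpu_after.user - cpu_before.user)
print("CPU seconds (system):", cpu_after.system - cpu_before.system)
print("RSS change (MB):", (mem_after - mem_before) / 1024 ** 2)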
A more comprehensive example shows how you could use the Advanced Python Scheduler to take CPU and memory measurements in the background while your Keras training runs:
import time

import psutil
from apscheduler.schedulers.background import BackgroundScheduler

process = psutil.Process()

def get_info():
    print(process.cpu_times(), process.memory_info())

if __name__ == '__main__':
    scheduler = BackgroundScheduler()
    scheduler.add_job(get_info, 'interval', seconds=3)
    scheduler.start()
    # run the code you want to measure here
    # replace this nonsense loop
    now = time.time()
    finish = now + 60
    while time.time() < finish:
        print("Some progress message: {}".format(time.time()))
        time.sleep(10)
I am conducting experiments with neural networks using Keras with the TensorFlow backend. I do this using the GPU on my PC, which runs Windows 7.
My workflow looks like the following.
I create a small Python script that defines a model and then runs model.fit_generator for ~50 epochs, with early stopping if validation accuracy does not improve for 10-15 epochs. Then I run it from a terminal with a command like python model_v3_4_5.py.
Usually one epoch takes about 1.5 hours. During this period, new ideas (different training parameters or a new architecture) come to mind.
Then I create a new Python script...
During my experiments I've found that it is better not to train several models in parallel: I've experienced a doubling of epoch time and a strange decrease in validation accuracy.
Therefore, I'd like to wait until the first training finishes and then run the second one. At the same time, I'd like to avoid my PC idling, and start a new training immediately after the first one has finished.
But I don't know exactly when the first training will finish, so running a command like timeout <50 hours> && python model_v3_4_6.py would be a dumb solution.
So I need some kind of queue manager.
One solution that has come to mind is installing a Jenkins slave on my PC and using the queues that Jenkins provides. As far as I remember, Jenkins has issues with GPU access.
Another variant is training the models in a Jupyter notebook in separate cells. However, I cannot see a queue of cell execution here, and this is a topic already being discussed.
Update. Another variant: add some code to the model scripts that checks the current GPU state (is it running a NN right now?) and waits if it is busy. This will cause issues if several scripts (more than one bright new idea :) ) are waiting for the GPU to become idle.
Are there any other variants?
Finally, I've come up with a simple cmd script:
set PYTHONPATH=%CD%
:start
for %%m in (train_queue\model*.py) do (
    python %%m
    del %%m
)
timeout 5
goto start
Create a subdirectory train_queue and put the model scripts into it. All scripts log their output to files whose names contain timestamps.
The script also calls the timeout program, which pauses for a few seconds before rescanning the queue.
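If you prefer to stay in Python rather than cmd, the same queue can be implemented with a small watcher script (a sketch under the same assumptions: model scripts are dropped into train_queue and deleted once they have run):

import glob
import os
import subprocess
import sys
import time

QUEUE_DIR = "train_queue"

while True:
    for script in sorted(glob.glob(os.path.join(QUEUE_DIR, "model*.py"))):
        # Run each queued training script to completion, one at a time,
        # so only one model ever occupies the GPU.
        subprocess.run([sys.executable, script], check=False)
        os.remove(script)
    time.sleep(5)   # pause before rescanning the queue, like the cmd timeout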