Distributed Tensorflow is getting stuck at sess.run() - python

I want to run tensorflow on multiple machines, multiple GPUs. As an initial step, trying out distributed tensorflow on single machine (following tensorflow tutorial https://www.tensorflow.org/how_tos/distributed/)
Bellow are the lines after which sess.run() stucks
import tensorflow as tf
cluster = tf.train.ClusterSpec({"local": ["localhost:2222", "localhost:2223"]})
server = tf.train.Server(cluster, job_name="local", task_index=0)
a = tf.constant(8)
b = tf.constant(9)
sess = tf.Session('grpc://localhost:2222')
Everything is working fine till here, but when I am running sess.run(), it stucks.
sess.run(tf.mul(a,b))
If anybody has already worked on distributed tensorflow, Please let me know the solution or other tutorial which works fine.

By default, Distributed TensorFlow will block until all servers named in the tf.train.ClusterSpec have started. This happens during the first interaction with the server, which will typically be the first sess.run() call. Therefore, if you haven't also started a server listening on localhost:2223, then TensorFlow will block until you do.
There are a few solutions to this problem, depending on your later goals:
Start a server on localhost:2223. In another process, run the following script:
import tensorflow as tf
cluster = tf.train.ClusterSpec({"local": ["localhost:2222", "localhost:2223"]})
server = tf.train.Server(cluster, job_name="local", task_index=1)
server.join() # Wait forever for incoming connections.
Remove task 1 from the original tf.train.ClusterSpec:
import tensorflow as tf
cluster = tf.train.ClusterSpec({"local": ["localhost:2222"]})
server = tf.train.Server(cluster, job_name="local", task_index=0)
# ...
Specify a "device filter" when you create the tf.Session so that the session only uses task 0.
# ...
sess = tf.Session("grpc://localhost:2222",
config=tf.ConfigProto(device_filters=["/job:local/task:0"]))

Related

Limit number of cores used on server for tensorflow 2 and keras

I try to run a Python script that trains several Neural Networks using TensorFlow and Keras. The problem is that I cannot restrict the number of cores used on the server, even though it works on my local desktop.
The basic structure is that I have defined a function run_net that runs the neural net. This function is called with different parameters in parallel using joblib (see below). Additionally, I have tried running the function iteratively with different parameters which didn't solve the problem.
Parallel(n_jobs=1, backend="multiprocessing")(
delayed(run_net)
If I run that on my local Windows Desktop, everything works fine. However, if I try to run the same script on our institute's server with 48 cores and check CPU usage using htop command, all cores are used. I already tried setting n_jobs in joblib Parallel to 1 and it looks like CPU usage goes to 100% once the tensorflow models are trained.
I already searched for different solutions and the main one that I found is the one below. I define that before running the parallel jobs shown above. I also tried placing the code below before every fit or predict method of the model.
NUM_PARALLEL_EXEC_UNITS = 5
config = tf.compat.v1.ConfigProto(
intra_op_parallelism_threads=NUM_PARALLEL_EXEC_UNITS,
inter_op_parallelism_threads=2,
device_count={"CPU": NUM_PARALLEL_EXEC_UNITS},
)
session = tf.compat.v1.Session(config=config)
K.set_session(session)
At this point, I am quite lost and have no idea how to make Tensorflow and/or Keras use a limited number of cores as the server I am using is shared across the institute.
The server is running linux. However, I don't know which exact distribution/version it is. I am very new to running code on a server.
These are the versions I am using:
python == 3.10.8
tensorflow == 2.10.0
keras == 2.10.0
If you need any other information, I am happy to provide that.
Edit 1
Both the answer suggested in this thread doesn't work as well as using only these commands:
tf.config.threading.set_intra_op_parallelism_threads(5)
tf.config.threading.set_inter_op_parallelism_threads(5)
after trying some things, I have found a solution to my problem. With the following code, I can restrict the number of CPUs used:
os.environ["OMP_NUM_THREADS"] = "5"
tf.config.threading.set_intra_op_parallelism_threads(5)
tf.config.threading.set_inter_op_parallelism_threads(5)
Note, that I have no idea how many CPUs will be used in the end. I noticed that it isn't five cores being used but more. As I don't really care about the exact number of cores but just that I don't use all cores, I am fine with that solution for now. If anybody knows how to calculate the number of cores used from the information provided above, let me know.

Databricks notebook hanging with pytorch

We have a Databricks notebooks issue. One of our notebook cells seems to be hanging, while the driver logs do show that the notebook cell has been executed. Does anyone know why our notebook cell keeps hanging, and does not complete? See below the details.
Situation
We are training a ML model with pytorch in the Databricks notebook UI
The training uses mlflow to register a model
At the end of the cell we print a statement "Done with training"
We are using a single node cluster with
Databricks Runtime: 10.4 LTS ML (includes Apache Spark 3.2.1, GPU, Scala 2.12)
Node type: Standard_NC6s_v3
Observations
In the Databricks notebook UI we see the cell running pytorch training and showing the intermediate logs of the training
After awhile the model is registered in mlflow but we don't see this log in the Databricks notebook UI
We can also see the print statement "Done with training" in the driver logs. We don't see this statement in the Databricks notebook UI
Code
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks.early_stopping import EarlyStopping
trainer = Trainer(gpus=-1, num_sanity_val_steps=0, logger = logger, callbacks=[EarlyStopping(monitor="test_loss", patience = 2, mode = "min", verbose=True)])
with mlflow.start_run(experiment_id = experiment_id) as run:
trainer.fit(model, train_loader, val_loader)
mlflow.log_param("param1", param1)
mlflow.log_param("param2", param2)
mlflow.pytorch.log_model(model._image_model, artifact_path="model", registered_model_name="image_model")
mlflow.pytorch.log_state_dict(model._image_model.state_dict(), "model")
print("Done with training")
Packages
mlflow-skinny==1.25.1
torch==1.10.2+cu111
torchvision==0.11.3+cu111
Solutions that I tried that did not work
Tried adding cache deletion, but that did not work
# Cleaning up to avoid any open processes...
del trainer
torch.cuda.empty_cache()
# force garbage collection
gc.collect()
Tried forcing exiting the notebook, but also did not work
parameters = json.dumps({"Status": "SUCCESS", "Message": "DONE"})
dbutils.notebook.exit(parameters)
I figured out the issue. To solve this, adjust the parameters for the torch.utils.data.DataLoader
Disable pin_memory
Set num_workers to 30% of total vCPU (e.g. 1 or 2 for Standard_NC6s_v3)
For example:
train_loader = DataLoader(
train_dataset,
batch_size=32,
num_workers=1,
pin_memory=False,
shuffle=True,
)
This issue seems to be related to PyTorch and is due to a deadlock issue. See the details here.
https://stackoverflow.com/a/72473053/10555941
There is more to add to the above answer. When you set pin_memory True and have num_workers equal to the total number of vCpus on the node, it uses IPC for the threads to communicate. These IPC's use shared memory and they suffocate the shared memory of VM.
This leads to the hanging of the processes. The DataLoader num_workers is just to help in data loading using child threads. That said it need not be an extreme value to speed up data-loading. Having it small like 30% of vCPUS would suffice for data loading.

Jupyter Notebook GPU memory release after training model

How could we clear up the GPU memory after finishing a deep learning model training with Jupyter notebook. The problem is, no matter what framework I am sticking to (tensorflow, pytorch) the memory stored in the GPU do not get released except I kill the process manually or kill the kernel and restart the Jupyter. Do you have any idea how we can possible get rid of this problem by automating the steps?
The only walkaround I found was to use threading. Executing the Training with a subprocess.
An example:
def Training(arguments):
....
....
return model
if __name__=='__main__':
Subprocess = Process(# The complete function defined above
target = Training,
# Pass the arguments defined in the complete function
# above.
# Note - The comma after the arguments : In order for
# Python
# to understand this is a tuple
args = (arguments, ))
# Starting the defined subprocess
Subprocess.start()
# Wait for the subprocess to get completed
Subprocess.join()

How to have multiple MLFlow runs in parallel?

I'm not very familiar with parallelization in Python and I'm getting an error when trying to train a model on multiple training folds in parallel. Here's a simplified version of my code:
def train_test_model(fold):
# here I train the model etc...
# now I want to save the parameters and metrics
with mlflow.start_run():
mlflow.log_param("run_name", run_name)
mlflow.log_param("modeltype", modeltype)
# and so on...
if __name__=="__main__":
pool = ThreadPool(processes = num_trials)
# run folds in parallel
pool.map(lambda fold:train_test_model(fold), folds)
I'm getting the following error:
Exception: Run with UUID 23e9bb6d22674a518e48af9c51252860 is already active. To start a new run, first end the current run with mlflow.end_run(). To start a nested run, call start_run with nested=True
The documentation says that mlflow.start_run() starts a new run and makes it active which is the root of my problem. Every thread starts a MLFlow run for its corresponding fold and makes it active while I need the runs to run in parallel i.e. all be active(?) and save parameters/metrics of the corresponding fold. How can I solve that issue?
I found a solution, maybe it will be useful for someone else. You can see details with code examples here: https://github.com/mlflow/mlflow/issues/3592

How to get rid of theano.gof.compilelock?

I am using the joblib library to run multiple NN on my multiple CPU at once. the idea is that I want to make a final prediction as the average of all the different NN predictions. I use keras and theano on the backend.
My code works if I set n_job=1 but fails for anything >1.
Here is the error message:
[Parallel(n_jobs=3)]: Using backend ThreadingBackend with 3 concurrent workers.
Using Theano backend.
WARNING (theano.gof.compilelock): Overriding existing lock by dead process '6088' (I am process '6032')
WARNING (theano.gof.compilelock): Overriding existing lock by dead process '6088' (I am process '6032')
The code I use is rather simple (it works for n_job=1)
from joblib import Parallel, delayed
result = Parallel(n_jobs=1,verbose=1, backend="threading")(delayed(myNNfunction)(arguments,i,X_train,Y_train,X_test,Y_test) for i in range(network))
For information (I don't know if this is relevant), this is my parameters for keras:
os.environ['KERAS_BACKEND'] = 'theano'
os.environ["MKL_THREADING_LAYER"] = "GNU"
os.environ['MKL_NUM_THREADS'] = '3'
os.environ['GOTO_NUM_THREADS'] = '3'
os.environ['OMP_NUM_THREADS'] = '3'
I have tried to use the technique proposed here but it didn't change a thing. To be precise I have created a file in C:\Users\myname.theanorc with this in it:
[global]
base_compiledir=/tmp/%(user)s/theano.NOBACKUP
I've read somewhere (I can't find the link sorry) that on windows machines I shouldn't call the file .theanorc.txt but only .theanorc ; in any case it doesn't work.
Would you know what I am missing?

Categories