I'm trying to code an Asynchronous Actor Critic in PyTorch based on this repo: https://github.com/seungeunrho/minimalRL/blob/master/a3c.py
but I'm changing the ActorCritic class to use the one I coded myself.
Basically, I have a class A3C and an instance of it, global_model, with shared memory, and I use torch.multiprocessing to start some Processes in order to train the model in parallel. At the beginning of each process I have to create a new instance of the model, called local_model, in order to proceed with training, but the process gets stuck initializing the local model, even though the initialization of the global model works every time.
While debugging I can see that it enters A3C.__init__ and SharedActorCritic.__init__ too, but there it stops, just after the checkpoint print I put in. However, if I print any expression containing list(critic_param_gen), magically everything works. I also noted that printing just critic_param_gen won't do.
Any idea why that is?
A similar thing happens if I use local_model = copy.deepcopy(global_model) inside create_local_model, i.e. it only works if that print is present.
In pseudo-code:
import torch.multiprocessing as mp
import torch.nn as nn
import itertools as it

debug = True

class A3C(nn.Module):
    def __init__(self, model, n_features):
        ...
        self.AC_architecture = SharedActorCritic(model, n_features)

class SharedActorCritic(nn.Module):
    def __init__(self, model, n_features):
        super(SharedActorCritic, self).__init__()
        self.shared_architecture = model(n_features)         # inherits from nn.Module
        self.actor = SharedActor(n_features)                 # inherits from nn.Module
        self.critic = SharedCritic(n_features)                # inherits from nn.Module
        self.critic_target = BaseCritic(model, n_features)    # inherits from nn.Module

        critic_param_gen = it.chain(self.shared_architecture.parameters(), self.critic.parameters())
        print("checkpoint")
        if debug: print(list(critic_param_gen))  # this makes the whole thing work
        for trg_params, params in zip(self.critic_target.parameters(), critic_param_gen):
            trg_params.data.copy_(params.data)

def create_local_model(model, n_features):
    local_model = A3C(model, n_features)
    print("Process ended")

# in the main
global_model = A3C(model, n_features)  # works
global_model.share_memory()            # doesn't really matter
p = mp.Process(target=create_local_model, args=(model, n_features))
p.start()
print("Process started")
p.join()
----
# output if debug is True
Process started
checkpoint
[ ...actual list of critic_param_gen ... ]
Process ended
# output if debug is False
Process started
checkpoint
# and then runs forever
Edit: the mystery about the print statement is solved, thanks to snakecharmerb. I created a minimal reproducible example. It seems that if the network is large enough, the copy operation hangs when executed in a child process, but not outside of it (the global model can always be instantiated).
import torch.nn as nn
import torch.multiprocessing as mp
import copy

class Net(nn.Module):
    def __init__(self, n_features=256, n_layers=8):
        super(Net, self).__init__()
        self.net1 = nn.Sequential(*[nn.Linear(n_features, n_features) for _ in range(n_layers)])
        self.net2 = nn.Sequential(*[nn.Linear(n_features, n_features) for _ in range(n_layers)])
        for p1, p2 in zip(self.net1.parameters(), self.net2.parameters()):
            p1.data.copy_(p2.data)

    def forward(self, x):
        return self.net2(self.net1(x))

def create_local_model_v1(global_model):
    local_model = copy.deepcopy(global_model)
    print("Process ended")
%%time
global_model = Net(16,2)
print("Global model created")
p = mp.Process(target=create_local_model_v1, args=(global_model,))
p.start()
print("Process started")
p.join()
# Output
Global model created
Process ended
Process started
CPU times: user 3 ms, sys: 11.9 ms, total: 14.9 ms
Wall time: 45.1 ms
%%time
global_model = Net(256,8)
print("Global model created")
p = mp.Process(target=create_local_model_v1, args=(global_model,))
p.start()
print("Process started")
p.join()
# Output - Gets stuck
Global model created
Process started
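(For completeness, the print mystery itself: list(critic_param_gen) exhausts the itertools.chain iterator, so the zip loop afterwards copies nothing and the big parameter copy never actually runs in the debug branch. A tiny sketch of that effect, not the original code:)

import itertools as it

gen = it.chain([1, 2], [3, 4])
print(list(gen))               # [1, 2, 3, 4] -- this consumes the iterator
print(list(zip(gen, "abcd")))  # [] -- nothing left, so a copy loop over it would do nothing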
TLDR: use torch.multiprocessing.spawn
I'm not quite skilled enough to determine the exact cause and solution to this error, but the problem occurs at this point in torch/nn/parameter.py:
result = type(self)(self.data.clone(memory_format=torch.preserve_format), self.requires_grad)
This gets called during the deep copy. To investigate a little more, I put together a somewhat more detailed experiment to test which parameters and environments cause the hang. The gist of the results is that the overall size of the model is not the issue so much as the number of features: for me, 256 features causes the hang, regardless of how many layers. Another, more curious, observation: when I remove the part of initialization where the parameters from net1 get copied to net2, the hang disappears; and if I don't send anything to another process at all, everything works fine. Finally, when using the spawn function, everything works just fine until the number of layers exceeds 256.
I need to caveat everything about the hang: as far as I can tell it is a deadlock, but it may just be some extremely slow process. That is highly unlikely, because it seems as though all activity stops, but I couldn't confirm it's a deadlock because when I went for a backtrace of the C code during the hang, all I got were memory addresses (to really confirm everything I guess I'd need to rebuild torch with some debugging options...). Anyway, I'm about 99% confident it's a deadlock, probably caused by something in multiprocessing somewhere. The reason my confidence is so high is that the code won't even react to signals. If everything were working as expected, I would expect the program to at least let me print a traceback from a signal handler, but nothing.
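For concreteness, here is a minimal sketch of switching the repro above to the spawn start method (one way to do it; torch.multiprocessing.spawn is a helper built on the same start method). Net and create_local_model_v1 are the ones from the question and, with spawn, must live in an importable module rather than a notebook cell:

import torch.multiprocessing as mp

if __name__ == '__main__':
    ctx = mp.get_context('spawn')   # fresh interpreter per worker instead of fork
    global_model = Net(256, 8)      # the size that hangs under fork
    p = ctx.Process(target=create_local_model_v1, args=(global_model,))
    p.start()
    p.join()                        # completes instead of hanging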
I found the following blog post to be somewhat nice:
The tragic tale of the deadlocking Python queue
Other than that, my opinion at this point is f*** combining torch and multiprocessing.
If anyone cares to see the code for the experiments I ran or the result, let me know.
Related
TL;DR: using PyTorch with Optuna with multiprocessing done with Queue(), a GPU (out of 4) can hang. Probably not a deadlock. Any ideas?
Normal version:
I am using PyTorch in combination with Optuna (a hyperparameter optimization framework that basically starts different trials of one model with different parameters; see: https://optuna.readthedocs.io/en/stable/) for model training on a setup with 4 GPUs. I've been looking for a way to distribute the workload more efficiently across the GPUs, hence I explored the multiprocessing library.
The core of the multiprocessing code looks like the following:
from contextlib import contextmanager
import multiprocessing

import optuna

class GpuQueue:
    def __init__(self):
        self.queue = multiprocessing.Manager().Queue()
        all_idxs = list(range(N_GPUS)) if N_GPUS > 0 else [None]
        for idx in all_idxs:
            self.queue.put(idx)

    @contextmanager
    def one_gpu_per_process(self):
        current_idx = self.queue.get()
        yield current_idx
        self.queue.put(current_idx)

class Objective:
    def __init__(self, gpu_queue: GpuQueue, params, signals):
        self.gpu_queue = gpu_queue
        # create dataset
        # ...

    def __call__(self, trial: optuna.Trial):
        with self.gpu_queue.one_gpu_per_process() as gpu_i:
            val = trainer(trial, gpu=gpu_i, ...)
        return val
And in main, optuna study and optuna optimize are initiated with:
study = optuna.create_study(direction="minimize", sampler=optuna.samplers.TPESampler(seed=17))  # storage="sqlite:///trials.db"
study.optimize(Objective(GpuQueue(), ...), n_jobs=4)
The same implementation can be found in this StackOverflow post (used as inspiration): Is there a way to pass arguments to multiple jobs in optuna?
What this code does is give every trial its own GPU, so GPU usage and distribution are better than with other methods. However, it often happens that a GPU gets stuck and just 'shuts itself off', never finishing its trial, so the code never finishes running and that GPU is never freed.
Say, for example, that I am running 100 trials; then trials 1, 2, 3, 4 get assigned GPUs 0, 1, 2, 3 (not always in that order), and whenever a GPU is freed, say GPU 2, it takes on trial 5, etc. The issue is that the trial a GPU is assigned to can 'quit' partway through and never finish, so that GPU never takes on another trial, and a run with many trials never completes.
I suspected a deadlock, but apparently Queue() is thread-safe (see: Is Python multiprocessing.Queue thread safe?).
Any clue on what can cause the hang and what I can look for?
I've encountered a mysterious bug while trying to implement Hogwild with torch.multiprocessing. In particular, one version of the code runs fine, but when I add in a seemingly unrelated bit of code before the multiprocessing step, this somehow causes an error during the multiprocessing step: RuntimeError: Unable to handle autograd's threading in combination with fork-based multiprocessing. See https://github.com/pytorch/pytorch/wiki/Autograd-and-Fork
I reproduced the error in a minimal code sample, pasted below. If I comment out the two lines of code m0 = Model(); train(m0) which carry out a non-parallel training run on a separate model instance, then everything runs fine. I can't figure out how these lines could be causing a problem.
I'm running PyTorch 1.5.1 and Python 3.7.6 on a Linux machine, training on CPU only.
import torch
import torch.multiprocessing as mp
from torch import nn

def train(model):
    opt = torch.optim.Adam(model.parameters(), lr=1e-5)
    for _ in range(10000):
        opt.zero_grad()
        # We train the model to output the value 4 (arbitrarily)
        loss = (model(0) - 4)**2
        loss.backward()
        opt.step()

# Toy model with one parameter tensor of size 3.
# Output is always the sum of the elements in the tensor,
# independent of the input
class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.x = nn.Parameter(torch.ones(3))

    def forward(self, x):
        return torch.sum(self.x)

############################################
# Create a separate Model instance and run
# a non-parallel training run.
# For some reason, this code causes the
# subsequent parallel run to fail.
m0 = Model()
train(m0)
print('Done with preliminary run')
############################################

num_processes = 2
model = Model()
model.share_memory()
processes = []
for rank in range(num_processes):
    p = mp.Process(target=train, args=(model,))
    p.start()
    processes.append(p)
for p in processes:
    p.join()
print(model.x)
If you modify your code to create new processes like this:
processes = []
ctx = mp.get_context('spawn')
for rank in range(num_processes):
    p = ctx.Process(target=train, args=(model,))
it seems to run fine (rest of code same as yours, tested on pytorch 1.5.0 / python 3.6 / NVIDIA T4 GPU).
I'm not completely sure what is carried over from the non-parallel run to the parallel run; I tried creating a completely new model for the two runs (with its own class), deleting everything from the original, and making sure to delete any tensors and free up memory, and none of that made any difference.
What did make a difference was making sure that .backward() never got called outside of mp.Process() before it was called by a function inside mp.Process(). I think what may be carried over is an autograd thread: if the thread exists before multiprocessing with the default fork method, it fails; if the thread is created after the fork, it seems to work okay; and when using spawn it also works okay.
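For completeness, a minimal sketch of the working variant under those assumptions (same Model and train as in the question; only the start method and the __main__ guard change):

if __name__ == '__main__':
    m0 = Model()
    train(m0)                      # preliminary, non-parallel run
    print('Done with preliminary run')

    ctx = mp.get_context('spawn')  # spawn avoids the fork + pre-existing autograd state that triggers the RuntimeError
    model = Model()
    model.share_memory()
    processes = [ctx.Process(target=train, args=(model,)) for _ in range(2)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    print(model.x)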
Btw: That's a really interesting question - thank you especially for digesting it to a minimal example!
You missed this:
if __name__ == '__main__':
which is very important for multi-processing!
I am trying to use optuna for searching hyperparameter spaces.
In one particular scenario I train a model on a machine with a few GPUs.
The model and batch size allows me to run 1 training per 1 GPU.
So, ideally I would like to let optuna spread all trials across the available GPUs
so that there is always 1 trial running on each GPU.
In the docs it says I should just start one process per GPU in a separate terminal, like:
CUDA_VISIBLE_DEVICES=0 optuna study optimize foo.py objective --study foo --storage sqlite:///example.db
I want to avoid that, because the whole hyperparameter search continues over multiple rounds after that. I don't want to have to manually start a process per GPU every time, check when all of them are finished, and then start the next round.
I saw study.optimize has a n_jobs argument.
At first glance this seems to be perfect.
E.g. I could do this:
import optuna

def objective(trial):
    # the actual model would be trained here;
    # the trainer would need to know which GPU it should be using
    best_val_loss = trainer(**trial.params)
    return best_val_loss

study = optuna.create_study()
study.optimize(objective, n_trials=100, n_jobs=8)
This starts multiple threads each starting a training.
However, the trainer within objective somehow needs to know which GPU it should be using.
Is there a trick to accomplish that?
After a few mental breakdowns I figured out that I can do what I want using a multiprocessing.Queue. To get it into the objective function I need to define the objective as a lambda function or as a class (I guess partial would also work). E.g.:
from contextlib import contextmanager
import multiprocessing

import optuna
from optuna.trial import Trial

N_GPUS = 2

class GpuQueue:
    def __init__(self):
        self.queue = multiprocessing.Manager().Queue()
        all_idxs = list(range(N_GPUS)) if N_GPUS > 0 else [None]
        for idx in all_idxs:
            self.queue.put(idx)

    @contextmanager
    def one_gpu_per_process(self):
        current_idx = self.queue.get()
        yield current_idx
        self.queue.put(current_idx)

class Objective:
    def __init__(self, gpu_queue: GpuQueue):
        self.gpu_queue = gpu_queue

    def __call__(self, trial: Trial):
        with self.gpu_queue.one_gpu_per_process() as gpu_i:
            best_val_loss = trainer(**trial.params, gpu=gpu_i)
        return best_val_loss

if __name__ == '__main__':
    study = optuna.create_study()
    study.optimize(Objective(GpuQueue()), n_trials=100, n_jobs=8)
If you want a documented solution for passing arguments to objective functions used by multiple jobs, the Optuna docs present two solutions:
callable classes (which can be combined with multiprocessing),
lambda function wrappers (caution: simpler, but does not work with multiprocessing).
If you are prepared to take a few shortcuts, you can skip some boilerplate by passing global values (constants such as the number of GPUs used) directly, via the Python environment, to the __call__() method, rather than as arguments of __init__().
The callable-classes solution was tested to work (in optuna==2.0.0) with the two multiprocessing backends (loky/multiprocessing) and with remote database backends (mariadb/postgresql).
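A rough sketch of that shortcut (run_training and the search space are hypothetical; N_GPUS is read as a module-level constant inside __call__ instead of being passed through __init__):

import optuna

N_GPUS = 4  # module-level constant read directly by the objective

class Objective:
    def __call__(self, trial: optuna.Trial):
        gpu_i = trial.number % N_GPUS                          # e.g. a simple round-robin assignment
        lr = trial.suggest_float('lr', 1e-5, 1e-1, log=True)   # hypothetical search space
        return run_training(lr=lr, gpu=gpu_i)                  # hypothetical training function

if __name__ == '__main__':
    study = optuna.create_study()
    study.optimize(Objective(), n_trials=20, n_jobs=N_GPUS)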
To overcome the problem I introduced a global variable that tracks which GPU is currently in use, which can then be read in the objective function. The code looks like this:
EPOCHS = n
USED_DEVICES = []

def objective(trial):
    time.sleep(random.uniform(0, 2))  # used because all n_jobs start at the same time
    gpu_list = list(range(torch.cuda.device_count()))
    unused_gpus = [x for x in gpu_list if x not in USED_DEVICES]
    idx = random.choice(unused_gpus)
    USED_DEVICES.append(idx)
    unused_gpus.remove(idx)

    DEVICE = f"cuda:{idx}"
    model = define_model(trial).to(DEVICE)
    # ... YOUR CODE ...

    for epoch in range(EPOCHS):
        # ... YOUR CODE ...
        if trial.should_prune():
            USED_DEVICES.remove(idx)
            raise optuna.exceptions.TrialPruned()

    # remove idx from the list so the device can be reused in the next trial
    USED_DEVICES.remove(idx)
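With that in place, the study can then be launched with one job per GPU, e.g. (usage sketch):

study = optuna.create_study()
study.optimize(objective, n_trials=100, n_jobs=torch.cuda.device_count())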
I'm trying to write some code to parallelize a bunch of tasks. Basically, the script is organized as follows.
import random
import multiprocessing as mp

import torch
import torch.nn as nn
import torch.optim as optim

def obj_train(x):
    return x.train()

class ServerModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.S = nn.Parameter(torch.rand(x, y), requires_grad=True)

class ClientModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.S = nn.Parameter(torch.rand(x, y), requires_grad=True)
        self.U = nn.Parameter(torch.rand(x, y), requires_grad=True)

class Server:
    def __init__(self, model):
        self.model = model
        ...

    def train(self, clients):
        for i, c in enumerate(clients):
            sd = c.model.state_dict()
            sd['S'] = self.model.S
            c.model.load_state_dict(sd)
        self.c_list = random.sample(clients, 200)

        pool = mp.Pool(mp.cpu_count() - 1)
        results = pool.map(obj_train, self.c_list)
        pool.close()
        pool.join()
        print("Training complete")

class Client:
    def __init__(self, client_id, model, train_set):
        self.id = client_id
        self.model = model
        self.train_set = train_set

    def train(self):
        self.optimizer = optim.SGD([self.model.S, self.model.U], lr=0.01)  # lr is required by optim.SGD
        for i in self.train_set:
            loss = self.model(i)
            loss.backward()
            self.optimizer.step()
        print("Trained client %d" % self.id)
        return self.model.S

if __name__ == '__main__':
    ...
    server = Server(server_model)
    clients = [Client(u, ClientModel(), train_set[u]) for u in range(n_clients)]
    server.train(clients)
Ok, the problem is in the multiprocessing. I tried a lot of approaches, but all of them give me the same problem. The Server should manage the training of 200 clients, but after a certain number of trainings (it depends on the approach, but approximately 50-100) the script gets completely stuck and the CPU cores stop working.
Do you have any ideas? Other approaches I tried are, for example, mp.Pool and ProcessPoolExecutor.
Thank you for your help.
Could it be that you hit the maximum number of processes/threads your machine is able to handle?
It is common, for example, when moving a web crawler from development to production, that the machine does not allow more processes.
I would have a look at the files under
/etc/sysctl.d
and, if needed, increase the number of processes the machine is allowed to handle.
Another reason might be that you hit a RAM limit or something similar; try having a quick look at the output of
htop
followed by
free -m
and see what they tell you. It might be a hardware problem. On the software side, it might be that the library you are using, https://docs.python.org/2/library/multiprocessing.html, has a hard-coded limit; here too you can easily set it higher through the library's parameters.
Last but not least, try to find the problem incrementally: I would test with 2 processes and increase the count slowly to see when the application starts having issues. At that point it should be much clearer what the issue is. Good luck!
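A rough sketch of that incremental test, reusing obj_train and the clients list from the question (purely illustrative):

import multiprocessing as mp

def probe(n_workers, clients):
    # run a small batch of client trainings with a given pool size and see if it returns
    with mp.Pool(n_workers) as pool:
        pool.map(obj_train, clients[:n_workers * 2])
    print("pool size", n_workers, "ok")

for n in (2, 4, 8, 16, 32):
    probe(n, clients)  # the first size that never returns is where to dig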
There seem to be many open questions about the usage of TensorFlow out there, and some TensorFlow developers are active here on Stack Overflow, so here is another question. I want to generate training data on the fly in other thread(s) using numpy or something else that does not belong to TensorFlow, but I do not want to re-compile the entire TensorFlow source again and again; I am simply looking for another way. "tf.py_func" seems to be a workaround. But the
This is related to "how-to-prefetch-data-using-a-custom-python-function-in-tensorflow".
Here is my MnWE (minimal-not-working-example):
Update (now there is an output, but a race condition, too):
import numpy as np
import tensorflow as tf
import threading
import os
import glob
import random
import matplotlib.pyplot as plt

IMAGE_ROOT = "/graphics/projects/data/mscoco2014/data/images/"
files = ["train/COCO_train2014_000000178763.jpg",
         "train/COCO_train2014_000000543841.jpg",
         "train/COCO_train2014_000000364433.jpg",
         "train/COCO_train2014_000000091123.jpg",
         "train/COCO_train2014_000000498916.jpg",
         "train/COCO_train2014_000000429865.jpg",
         "train/COCO_train2014_000000400199.jpg",
         "train/COCO_train2014_000000230367.jpg",
         "train/COCO_train2014_000000281214.jpg",
         "train/COCO_train2014_000000041920.jpg"]

# --------------------------------------------------------------------------------

def pre_process(data):
    """Pre-process an image with arbitrary functions,
    not only tf functions but arbitrary Python/numpy code.
    """
    # here is the place to do some fancy stuff
    # which might be out of the scope of tf
    return data[0:81, 0, 0].flatten()

def populate_queue(sess, thread_pool, qData_enqueue_op):
    """Put data into the queue;
    responsible for making sure there is always data for TensorFlow to process.
    """
    # until somebody tells me I can stop ...
    while not thread_pool.should_stop():
        # get a random image from MS COCO
        idx = random.randint(0, len(files)) - 1
        data = np.array(plt.imread(os.path.join(IMAGE_ROOT, files[idx])))
        data = pre_process(data)
        # put it into the queue
        sess.run(qData_enqueue_op, feed_dict={data_input: data})

# a simple queue for gathering data (just to keep it simple for now)
qData = tf.FIFOQueue(100, [tf.float32], shapes=[[9, 9]])
data_input = tf.placeholder(tf.float32)
qData_enqueue_op = qData.enqueue([tf.reshape(data_input, [9, 9])])
qData_dequeue_op = qData.dequeue()

init_op = tf.initialize_all_variables()

with tf.Session() as sess:
    # init all variables
    sess.run(init_op)
    # coordinator for the pool of threads
    thread_pool = tf.train.Coordinator()
    # start filling in data
    t = threading.Thread(target=populate_queue, args=(sess, thread_pool, qData_enqueue_op))
    t.start()
    # Can I use "tf.train.start_queue_runners" here?
    # How to use multiple threads?
    try:
        while not thread_pool.should_stop():
            print "iter"
            # HERE THE SILENCE BEGIN !!!!!!!!!!!
            batch = sess.run([qData_dequeue_op])
            print batch
    except tf.errors.OutOfRangeError:
        print('Done training -- no more data')
    finally:
        # When done, ask the threads to stop.
        thread_pool.request_stop()

    # now they should definitely stop
    thread_pool.request_stop()
    thread_pool.join([t])
I basically have three questions:
What's wrong with this code? It runs into an endless loop (which is not debuggable). See the line "HERE THE SILENCE BEGIN ...".
How can I extend this code to use more threads?
Is it worth converting large datasets, or data which can be generated on the fly, to tf.Record?
You have a mistake on this line:
t = threading.Thread(target=populate_queue, args=(sess, thread_pool, qData))
It should be qData_enqueue_op instead of qData. Otherwise your enqueue operations fail, and you get stuck trying to dequeue from a queue of size 0. I saw this when trying to run your code and getting
TypeError: Fetch argument <google3.third_party.tensorflow.python.ops.data_flow_ops.FIFOQueue object at 0x4bc1f10> of <google3.third_party.tensorflow.python.ops.data_flow_ops.FIFOQueue object at 0x4bc1f10> has invalid type <class 'google3.third_party.tensorflow.python.ops.data_flow_ops.FIFOQueue'>, must be a string or Tensor. (Can not convert a FIFOQueue into a Tensor or Operation.)
Regarding other questions:
You don't need to start queue runners in this example because you don't have any. Queue runners are created by input producers like string_input_producer, which is essentially a FIFO queue plus the logic to launch threads. You are replicating 50% of the queue runner functionality by launching your own threads that do enqueue ops (the other 50% is closing the queue).
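To the "more threads" question, a sketch in the same style, assuming the session, coordinator and ops defined in the question; this reproduces roughly what a queue runner would do (several producer threads, plus closing the queue on shutdown):

# start several producer threads instead of the single one above
n_threads = 4
threads = [threading.Thread(target=populate_queue,
                            args=(sess, thread_pool, qData_enqueue_op))
           for _ in range(n_threads)]
for t in threads:
    t.start()

# ... training loop as before ...

# shutdown: stop the producers, close the queue so blocked ops return, then join
thread_pool.request_stop()
sess.run(qData.close(cancel_pending_enqueues=True))
thread_pool.join(threads)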
RE: converting to tf.Record -- Python has this thing called the Global Interpreter Lock, which means that two bits of Python code can't execute concurrently. In practice that's mitigated by the fact that a lot of the time is spent in numpy C++ code or in IO ops (which release the GIL). So I think it's a matter of checking whether you can achieve the required parallelism with a Python pre-processing pipeline.