I'm trying to write some code to parallelize a bunch of tasks. Basically, the script is organized as follows.
import random
import multiprocessing as mp

import torch
import torch.nn as nn
import torch.optim as optim

def obj_train(x):
    # Module-level wrapper so pool.map can pickle the call
    return x.train()

class ServerModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.S = nn.Parameter(torch.rand(x, y), requires_grad=True)

class ClientModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.S = nn.Parameter(torch.rand(x, y), requires_grad=True)
        self.U = nn.Parameter(torch.rand(x, y), requires_grad=True)

class Server:
    def __init__(self, model):
        self.model = model
        ...

    def train(self, clients):
        # Push the server's S into every client model
        for i, c in enumerate(clients):
            sd = c.model.state_dict()
            sd['S'] = self.model.S
            c.model.load_state_dict(sd)
        self.c_list = random.sample(clients, 200)
        pool = mp.Pool(mp.cpu_count()-1)
        results = pool.map(obj_train, self.c_list)
        pool.close()
        pool.join()
        print("Training complete")

class Client:
    def __init__(self, client_id, model, train_set):
        self.id = client_id
        self.model = model
        self.train_set = train_set

    def train(self):
        self.optimizer = optim.SGD([self.model.S, self.model.U])
        for i in self.train_set:
            loss = self.model(i)
            loss.backward()
            self.optimizer.step()
        print("Trained client %d" % self.id)
        return self.model.S

if __name__ == '__main__':
    ...
    server = Server(server_model)
    clients = [Client(u, ClientModel(), train_set[u]) for u in range(n_clients)]
    server.train(clients)
OK, the problem is in the multiprocessing. I tried a lot of approaches, but all of them give me the same problem. The server should manage the training of 200 clients, but after a certain number of trainings (it depends on the approach, but roughly 50-100), the script gets completely stuck and the CPU cores stop doing any work.
Do you have any ideas? Other approaches I tried include, for example, mp.Pool and ProcessPoolExecutor.
Thank you for your help.
Could it be that you hit the maximum number of processes/threads your machine is able to handle?
It is common, for example, when moving a web crawler from development to production that the machine does not allow more processes.
I would take a look at the configuration files in
/etc/sysctl.d
and, if needed, raise the number of processes the machine is allowed to handle.
Another reason might be that you hit a RAM limit or something similar. Take a quick look at the output of
htop
followed by
free -m
and see what they tell you. It might be a hardware problem. On the software side, it might be that the library you are using (https://docs.python.org/2/library/multiprocessing.html) imposes a limit of its own; if so, check whether it can be raised through the library's parameters.
Last but not least, try to find the problem incrementally. I would test it with 2 processes and increase the number slowly to see when the application starts having issues. At that point it will probably be much clearer what the issue is. Good luck!
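The process limit mentioned above can also be read from inside Python instead of digging through config files. Below is a minimal sketch using the standard-library resource module (Unix only). The helper name report_limits is made up for illustration, and the open-file limit is an extra check added on the assumption that a pool of workers passing data around can also run out of file descriptors.

```python
import os
import resource  # Unix only

def report_limits():
    """Return the soft/hard limits that commonly cap a process pool."""
    limits = {
        # Max number of processes/threads the current user may create
        "max_processes": resource.getrlimit(resource.RLIMIT_NPROC),
        # Max number of open file descriptors per process (assumed relevant)
        "max_open_files": resource.getrlimit(resource.RLIMIT_NOFILE),
    }
    limits["cpu_count"] = os.cpu_count()
    return limits

if __name__ == "__main__":
    for name, value in report_limits().items():
        # Each limit is a (soft, hard) pair; resource.RLIM_INFINITY means unlimited
        print(f"{name}: {value}")
```

If the soft limit for processes is close to the number of workers you start (plus whatever else runs under the same user), that would support the first theory; otherwise it can probably be ruled out.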
Related
TL;DR: using PyTorch with Optuna with multiprocessing done with Queue(), a GPU (out of 4) can hang. Probably not a deadlock. Any ideas?
Normal version:
I am using PyTorch in combination with Optuna (a hyperparameter optimization framework; basically starts different trials for one model with different parameters, see: https://optuna.readthedocs.io/en/stable/) for my model training on a setup with 4 GPUs. Here, I've been looking for a way to distribute the workload more efficiently on the GPUs, hence I explored the multiprocessing library.
The core of the multiprocessing code looks like the following:
class GpuQueue:
    def __init__(self):
        self.queue = multiprocessing.Manager().Queue()
        all_idxs = list(range(N_GPUS)) if N_GPUS > 0 else [None]
        for idx in all_idxs:
            self.queue.put(idx)

    @contextmanager
    def one_gpu_per_process(self):
        current_idx = self.queue.get()
        yield current_idx
        self.queue.put(current_idx)

class Objective:
    def __init__(self, gpu_queue: GpuQueue, params, signals):
        self.gpu_queue = gpu_queue
        # create dataset
        # ...

    def __call__(self, trial: optuna.Trial):
        with self.gpu_queue.one_gpu_per_process() as gpu_i:
            val = trainer(trial, gpu=gpu_i, ...)
            return val
And in main, optuna study and optuna optimize are initiated with:
study = optuna.create_study(direction="minimize", sampler = optuna.samplers.TPESampler(seed=17)) # storage = "sqlite:///trials.db")
study.optimize(Objective(GpuQueue(), ...), n_jobs=4)
Same implementation can be found in this StackOverflow post (used as inspiration): Is there a way to pass arguments to multiple jobs in optuna?
What this code does is that every trial gets its own GPU, hence the GPU usage and distribution is better than other methods. However it happens often that a GPU is stuck and just 'shuts itself off' and does not finish the trial, hence the code actually never finishes running and that GPU is never freed.
Say, for example, that I am running 100 trials, then trial 1,2,3,4 get assigned GPUs 0,1,2,3 (not always in that order), and whenever a GPU is freed, say GPU 2, it takes on trial 5, etc. The issue is, it can happen that the trial that the GPU is assigned to 'quits' in the process and never finishes the trial, hence not taking on another trial and resulting in the run with many trials not completing.
I suspected a deadlock, but apparently Queue() is thread-safe (see: Is Python multiprocessing.Queue thread safe?).
Any clue on what can cause the hang and what I can look for?
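One thing worth checking (an assumption, nothing in the post confirms it): in one_gpu_per_process the index is only put back if the code after the yield is reached. If the trial raises inside the with block, for example because it is pruned or crashes, the exception is re-raised at the yield, the put is skipped and that GPU index never returns to the queue. Once enough indices are lost, the remaining workers would block forever on queue.get(). A minimal sketch of a variant that always returns the index, using a plain queue.Queue and a dummy failing job in place of Optuna and the real trainer:

```python
import queue
from contextlib import contextmanager

class GpuQueue:
    def __init__(self, n_gpus):
        self.queue = queue.Queue()
        for idx in range(n_gpus):
            self.queue.put(idx)

    @contextmanager
    def one_gpu_per_process(self):
        current_idx = self.queue.get()
        try:
            yield current_idx
        finally:
            # Runs on success *and* when the body of the with block raises
            self.queue.put(current_idx)

def failing_trial(gpu_queue):
    # Hypothetical stand-in for a trial that is pruned or crashes
    with gpu_queue.one_gpu_per_process() as gpu_i:
        raise RuntimeError(f"trial on GPU {gpu_i} failed")

if __name__ == "__main__":
    gpus = GpuQueue(n_gpus=2)
    try:
        failing_trial(gpus)
    except RuntimeError as err:
        print(err)
    print("free GPUs:", gpus.queue.qsize())
```

Without the try/finally, the same run would leave only one index in the queue after the failed trial.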
I've encountered a mysterious bug while trying to implement Hogwild with torch.multiprocessing. In particular, one version of the code runs fine, but when I add in a seemingly unrelated bit of code before the multiprocessing step, this somehow causes an error during the multiprocessing step: RuntimeError: Unable to handle autograd's threading in combination with fork-based multiprocessing. See https://github.com/pytorch/pytorch/wiki/Autograd-and-Fork
I reproduced the error in a minimal code sample, pasted below. If I comment out the two lines of code m0 = Model(); train(m0) which carry out a non-parallel training run on a separate model instance, then everything runs fine. I can't figure out how these lines could be causing a problem.
I'm running PyTorch 1.5.1 and Python 3.7.6 on a Linux machine, training on CPU only.
import torch
import torch.multiprocessing as mp
from torch import nn

def train(model):
    opt = torch.optim.Adam(model.parameters(), lr=1e-5)
    for _ in range(10000):
        opt.zero_grad()
        # We train the model to output the value 4 (arbitrarily)
        loss = (model(0) - 4)**2
        loss.backward()
        opt.step()

# Toy model with one parameter tensor of size 3.
# Output is always the sum of the elements in the tensor,
# independent of the input
class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.x = nn.Parameter(torch.ones(3))

    def forward(self, x):
        return torch.sum(self.x)

############################################
# Create a separate Model instance and run
# a non-parallel training run.
# For some reason, this code causes the
# subsequent parallel run to fail.
m0 = Model()
train(m0)
print('Done with preliminary run')
############################################

num_processes = 2
model = Model()
model.share_memory()
processes = []
for rank in range(num_processes):
    p = mp.Process(target=train, args=(model,))
    p.start()
    processes.append(p)
for p in processes:
    p.join()
print(model.x)
If you modify your code to create new processes like this:
processes = []
ctx = mp.get_context('spawn')
for rank in range(num_processes):
    p = ctx.Process(target=train, args=(model,))
it seems to run fine (rest of code same as yours, tested on pytorch 1.5.0 / python 3.6 / NVIDIA T4 GPU).
I'm not completely sure what is carried over from the non-parallel run to the parallel run; I tried creating a completely new model for the two runs (with its own class), and/or deleting anything from the original, and/or making sure to delete any tensors and free up memory, and none of that made any difference.
What did make a difference was making sure that .backward() never got called outside of mp.Process() before it was called by a function within mp.Process(). I think what may be carried over is an autograd thread; if the thread exists before multiprocessing with the default fork method it fails, if the thread is created after fork it seems to work okay, and if using spawn it also works okay.
Btw: That's a really interesting question - thank you especially for digesting it to a minimal example!
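For completeness, here is a minimal self-contained sketch of the spawn-based pattern, using only the standard library so it runs without PyTorch (torch.multiprocessing is meant as a drop-in replacement for multiprocessing, so the same get_context call should apply there). The helper names build_workers and run_workers and the square target are made up for illustration.

```python
import multiprocessing as mp

def square(x):
    # Hypothetical stand-in for train(model); must live at module top level
    return x * x

def build_workers(target, args, num_processes, method="spawn"):
    """Create (but do not start) processes from an explicit start-method context."""
    ctx = mp.get_context(method)
    return [ctx.Process(target=target, args=args) for _ in range(num_processes)]

def run_workers(processes):
    """Start all processes, wait for them, and return their exit codes."""
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    return [p.exitcode for p in processes]
```

In a script you would call run_workers(build_workers(square, (3,), num_processes=2)) under an if __name__ == '__main__': guard. The guard matters with spawn because each child re-imports the main module, and anything at module level (such as a preliminary training run) would be executed again in every child.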
You missed this:
if __name__ == '__main__':
which is very important for multi-processing!
I'm trying to code an Asynchronous Actor Critic in PyTorch based on this repo: https://github.com/seungeunrho/minimalRL/blob/master/a3c.py
but I'm changing the ActorCritic class to use the one I coded myself.
Basically I have a class A3C, an instance of it, global_model, with shared memory and I use torch.multiprocessing to open some Processes in order to train the model in parallel. In each process at the beginning I have to create a new instance of the model, called local_model, in order to proceed with the training, but the process gets stuck in the initialization of the local model even though the one of the global model works every time.
Trying to debug it, I can see that it enters the A3C.__init__ function and SharedActorCritic.__init__ too, but there it stops just after the checkpoint print I added. However, if I print any expression that contains list(critic_param_gen), magically everything works. I also noticed that printing just critic_param_gen won't do.
Any idea why that is?
Also, a similar thing happens if I use local_model = copy.deepcopy(global_model) inside create_local_model, i.e. it only works if that print is present.
In pseudo-code:
import torch.multiprocessing as mp
import torch.nn as nn
import itertools as it

debug = True

class A3C(nn.Module):
    def __init__(self, model, n_features):
        ...
        self.AC_architecture = SharedActorCritic(model, n_features)

class SharedActorCritic(nn.Module):
    def __init__(self, model, n_features):
        super(SharedActorCritic, self).__init__()
        self.shared_architecture = model(n_features) # inherits from nn.Module
        self.actor = SharedActor(n_features) # inherits from nn.Module
        self.critic = SharedCritic(n_features) # inherits from nn.Module
        self.critic_target = BaseCritic(model, n_features) # inherits from nn.Module
        critic_param_gen = it.chain(self.shared_architecture.parameters(), self.critic.parameters())
        print("checkpoint")
        if debug: print(list(critic_param_gen)) # this makes the whole thing work
        for trg_params, params in zip(self.critic_target.parameters(), critic_param_gen):
            trg_params.data.copy_(params.data)

def create_local_model(model, n_features):
    local_model = A3C(model, n_features)
    print("Process ended")

# in the main
global_model = Model() # works
global_model.share_memory() # doesn't really matter
p = mp.Process(target=create_local_model, args=(model, n_features, ))
p.start()
print("Process started")
p.join()
----
# output if debug is True
Process started
checkpoint
[ ...actual list of critic_param_gen ... ]
Process ended
# output if debug is False
Process started
checkpoint
# and then runs forever
Edit: solved the mystery about the print statement thanks to snakecharmerb. I created a minimal reproducible example. It seems that if the network is large enough, the copy operation breaks if executed in a process, but not outside of it (since global model can be instantiated).
import torch.nn as nn
import torch.multiprocessing as mp
import copy

class Net(nn.Module):
    def __init__(self, n_features=256, n_layers=8):
        super(Net, self).__init__()
        self.net1 = nn.Sequential(*nn.ModuleList([nn.Linear(n_features, n_features) for _ in range(n_layers)]))
        self.net2 = nn.Sequential(*nn.ModuleList([nn.Linear(n_features, n_features) for _ in range(n_layers)]))
        for p1, p2 in zip(self.net1.parameters(), self.net2.parameters()):
            p1.data.copy_(p2.data)

    def forward(self, x):
        return self.net(x)

def create_local_model_v1(global_model):
    local_model = copy.deepcopy(global_model)
    print("Process ended")
%%time
global_model = Net(16,2)
print("Global model created")
p = mp.Process(target=create_local_model_v1, args=(global_model,))
p.start()
print("Process started")
p.join()
# Output
Global model created
Process ended
Process started
CPU times: user 3 ms, sys: 11.9 ms, total: 14.9 ms
Wall time: 45.1 ms
%%time
global_model = Net(256,8)
print("Global model created")
p = mp.Process(target=create_local_model_v1, args=(global_model,))
p.start()
print("Process started")
p.join()
# Output - Gets stuck
Global model created
Process started
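As for why the debug print in the pseudo-code made the hang disappear: itertools.chain returns a one-shot iterator, so list(critic_param_gen) consumes it and the zip loop that follows has nothing left to iterate over. In other words, the print most likely did not fix the copy, it skipped it. A minimal sketch with plain lists standing in for the parameter tensors (the function name is made up):

```python
import itertools as it

def count_copy_steps(debug):
    """Return how many (target, source) pairs the copy loop would visit."""
    shared_params = [1.0, 2.0]      # stand-in for shared_architecture.parameters()
    critic_params = [3.0]           # stand-in for critic.parameters()
    target_params = [0.0, 0.0, 0.0] # stand-in for critic_target.parameters()

    critic_param_gen = it.chain(shared_params, critic_params)
    if debug:
        # Materialising the iterator exhausts it
        print(list(critic_param_gen))

    steps = 0
    for _trg, _src in zip(target_params, critic_param_gen):
        steps += 1  # the real code calls trg_params.data.copy_(params.data) here
    return steps

print(count_copy_steps(debug=False))  # 3
print(count_copy_steps(debug=True))   # 0
```

If the print is wanted for debugging, build the list once (params = list(it.chain(...))) and loop over that list instead of the iterator.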
TLDR: use torch.multiprocessing.spawn
I'm not quite skilled enough to determine the exact cause and solution to this error, but the problem occurs at this point in torch/nn/parameter.py:
result = type(self)(self.data.clone(memory_format=torch.preserve_format), self.requires_grad)
This gets called during the deep copy process. To investigate a little more, I put together a somewhat more detailed experiment to test which parameters and environments cause the hang. The gist of the results is that the overall size of the model is not the issue, but rather the number of features. For me, 256 features causes the hang, regardless of how many layers there are. Another, more curious, finding is that when I remove the part of the initialization where parameters are copied between net1 and net2, the hang disappears; likewise, if I don't send anything to another process, everything works fine. Finally, when using the spawn function, everything works just fine until the number of layers exceeds 256.
I need to caveat everything about the hang: as far as I can tell it is a deadlock, but it may just be some extremely slow process. That is highly unlikely, because it seems as though all activity stops. However, I couldn't confirm that it's a deadlock, because when I went for a backtrace of the C code during the hang, all I got was memory addresses (to really confirm it, I guess I would need to rebuild torch with some debugging options...). Anyway, I'm about 99% confident it's a deadlock, probably caused by something in multiprocessing somewhere. The reason my confidence is so high is that the code won't even react to signals. If everything were working as expected, I would expect the program to at least allow me to print out a traceback from a signal handler, but nothing.
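On the "won't even react to signals" point: a handler installed with signal.signal only runs once the interpreter gets back to executing Python bytecode in the main thread, so it stays silent while the process is blocked inside C code. The standard-library faulthandler module installs its handler at the C level and can still dump the Python stacks in that situation. A minimal sketch (Unix only; the function names and the choice of SIGUSR1 are illustrative, and this has not been tried against this particular hang):

```python
import faulthandler
import signal
import sys

def install_hang_diagnostics(file=sys.stderr, timeout=None):
    """Make it possible to see where a stuck process is blocked.

    `kill -USR1 <pid>` will dump the traceback of every thread to `file`.
    If `timeout` is given, a dump is also written automatically every
    `timeout` seconds until cancelled.
    """
    faulthandler.register(signal.SIGUSR1, file=file, all_threads=True)
    if timeout is not None:
        faulthandler.dump_traceback_later(timeout, repeat=True, file=file)

def remove_hang_diagnostics():
    """Undo install_hang_diagnostics()."""
    faulthandler.unregister(signal.SIGUSR1)
    faulthandler.cancel_dump_traceback_later()
```

Calling install_hang_diagnostics() at the top of the function passed to mp.Process and then sending SIGUSR1 to the child's PID should show which Python line it is stuck on. It only prints Python frames, so for the C side gdb or a debug build would still be needed.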
I found the following blog post to be somewhat nice:
The tragic tale of the deadlocking Python queue
Other than that, my opinion at this point is f*** combining torch and multiprocessing.
If anyone cares to see the code for the experiments I ran or the result, let me know.
I am trying to use optuna for searching hyper parameter spaces.
In one particular scenario I train a model on a machine with a few GPUs.
The model and batch size allows me to run 1 training per 1 GPU.
So, ideally I would like to let optuna spread all trials across the available GPUs
so that there is always 1 trial running on each GPU.
In the docs it says, I should just start one process per GPU in a separate terminal like:
CUDA_VISIBLE_DEVICES=0 optuna study optimize foo.py objective --study foo --storage sqlite:///example.db
I want to avoid that because the whole hyper parameter search continues in multiple rounds after that. I don't want to always manually start a process per GPU, check when all are finished, then start the next round.
I saw study.optimize has a n_jobs argument.
At first glance this seems to be perfect.
E.g. I could do this:
import optuna

def objective(trial):
    # the actual model would be trained here
    # the trainer here would need to know which GPU
    # it should be using
    best_val_loss = trainer(**trial.params)
    return best_val_loss

study = optuna.create_study()
study.optimize(objective, n_trials=100, n_jobs=8)
This starts multiple threads each starting a training.
However, the trainer within objective somehow needs to know which GPU it should be using.
Is there a trick to accomplish that?
After a few mental breakdowns I figured out that I can do what I want using a multiprocessing.Queue. To get it into the objective function I need to define it as a lambda function or as a class (I guess partial also works). E.g.
from contextlib import contextmanager
import multiprocessing

import optuna

N_GPUS = 2

class GpuQueue:
    def __init__(self):
        self.queue = multiprocessing.Manager().Queue()
        all_idxs = list(range(N_GPUS)) if N_GPUS > 0 else [None]
        for idx in all_idxs:
            self.queue.put(idx)

    @contextmanager
    def one_gpu_per_process(self):
        current_idx = self.queue.get()
        try:
            yield current_idx
        finally:
            # Give the GPU back even if the trial raises (e.g. is pruned)
            self.queue.put(current_idx)

class Objective:
    def __init__(self, gpu_queue: GpuQueue):
        self.gpu_queue = gpu_queue

    def __call__(self, trial: optuna.Trial):
        with self.gpu_queue.one_gpu_per_process() as gpu_i:
            best_val_loss = trainer(**trial.params, gpu=gpu_i)
            return best_val_loss

if __name__ == '__main__':
    study = optuna.create_study()
    study.optimize(Objective(GpuQueue()), n_trials=100, n_jobs=8)
If you want a documented solution of passing arguments to objective functions used by multiple jobs, then Optuna docs present two solutions:
callable classes (it can be combined with multiprocessing),
lambda function wrapper (caution: simpler, but does not work with multiprocessing).
If you are prepared to take a few shortcuts, then you can skip some boilerplate by passing global values (constants such as number of GPUs used) directly (via python environment) to the __call__() method (rather than as arguments of __init__()).
The callable classes solution was tested to work (in optuna==2.0.0) with the two multiprocessing backends (loky/multiprocessing) and remote database backends (mariadb/postgresql).
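The caveat about the lambda wrapper comes down to pickling: a process-based backend that relies on the standard pickle module has to serialize the objective to send it to the workers, and lambdas cannot be pickled that way, whereas an instance of a top-level class can. A minimal sketch without Optuna; Objective, FakeTrial and is_picklable are illustrative names, not part of any library:

```python
import pickle

class FakeTrial:
    """Hypothetical stand-in for optuna.Trial with a fixed parameter dict."""
    params = {"lr": 0.1}

class Objective:
    def __init__(self, n_gpus):
        self.n_gpus = n_gpus  # extra argument carried by the instance

    def __call__(self, trial):
        return trial.params["lr"] * self.n_gpus

def is_picklable(obj):
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False

n_gpus = 2
as_class = Objective(n_gpus)
as_lambda = lambda trial: trial.params["lr"] * n_gpus

# Both wrappers compute the same value when called directly
print(as_class(FakeTrial()), as_lambda(FakeTrial()))
# Only the class instance survives pickling when run as a normal script
print(is_picklable(as_class), is_picklable(as_lambda))
```

With thread-based parallelism nothing is pickled, so either form works there.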
To overcome the problem, I introduced a global variable that tracks which GPUs are currently in use, which can then be read in the objective function. The code looks like this:
import random
import time

import optuna
import torch

EPOCHS = n
USED_DEVICES = []

def objective(trial):
    time.sleep(random.uniform(0, 2)) # used because all n_jobs start at the same time
    gpu_list = list(range(torch.cuda.device_count()))
    unused_gpus = [x for x in gpu_list if x not in USED_DEVICES]
    idx = random.choice(unused_gpus)
    USED_DEVICES.append(idx)
    unused_gpus.remove(idx)
    DEVICE = f"cuda:{idx}"
    model = define_model(trial).to(DEVICE)
    #... YOUR CODE ...
    for epoch in range(EPOCHS):
        # ... YOUR CODE ...
        if trial.should_prune():
            USED_DEVICES.remove(idx)
            raise optuna.exceptions.TrialPruned()
    # remove idx from list to reuse in next trial
    USED_DEVICES.remove(idx)
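Since the trials share one global list, the random sleep is only a probabilistic guard against two trials picking the same GPU at the same moment. A minimal sketch of the same bookkeeping made deterministic with a threading.Condition, plus a try/finally so the device is released on every exit path. The names DevicePool and claim are made up, and the device count is passed in rather than read from torch.cuda.device_count() so the sketch runs without a GPU:

```python
import threading
from contextlib import contextmanager

class DevicePool:
    def __init__(self, n_devices):
        self._free = set(range(n_devices))
        self._cond = threading.Condition()

    @contextmanager
    def claim(self):
        with self._cond:
            # Block until some device is free, then take it atomically
            while not self._free:
                self._cond.wait()
            idx = self._free.pop()
        try:
            yield idx
        finally:
            # Release the device and wake one waiting thread
            with self._cond:
                self._free.add(idx)
                self._cond.notify()
```

Inside the objective this would be used as: with POOL.claim() as idx: DEVICE = f"cuda:{idx}", where POOL is a module-level DevicePool created once before study.optimize is called.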
I want to run two endless loops in parallel. One reads data from a server and updates an object with a number. The other does nothing but read that object and, when it changes, process it. They do not have to be in sync. So my questions are:
If one side writes while the other reads, does Python have issues with that?
If I do get a sync problem, do I need to lock the read/write operations? Is there any other way I should do it?
What is best to use, thread or threading?
As the next step, I will read from 100 sites and update 100 objects, and read from 100 loops for the changes. Is it recommended to use multiprocessing from the beginning so I can scale without problems? Do I need to worry about the read and write issues there as well?
Any help is appreciated.
Short answer is, whatever you think will be understandable for you.
Meaning, your code should make sense to you for learning purposes..
Here's an example, it's light and easy to use.
Getting values from and to the thread is easy..
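It's not truly parallel though (in CPython the GIL lets only one thread run Python code at a time, so the threads do not spread over several CPU cores).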
from threading import Thread

class worker(Thread):
    def __init__(self, input=0):
        self.input = input
        Thread.__init__(self)
        self.start()

    def run(self):
        while 1:
            self.input += 1

x = worker(-100)
y = worker(x.input)
print(y.input)
This is just an example to show that the y thread can be given data from x. Note, though, that y only receives a snapshot of x.input at the moment it is created (ints are immutable), so after that each thread increments its own counter. If both threads really updated the same variable, that could be dangerous without a lock :)
Will not span across multiple CPUs
Easy to use (accessing data across threads is easy)
Logical code, if you're not familiar with queue systems or distributed systems
Will raise an error if the OS can't create more threads (a "limitation")
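If two threads really do need to update the same value, the usual approach is to share one mutable object and guard every read-modify-write with a lock. A minimal sketch (Python 3; the Counter class and the helper names are made up for illustration):

```python
import threading

class Counter:
    def __init__(self, value=0):
        self.value = value
        self._lock = threading.Lock()

    def increment(self):
        # `self.value += 1` is a read-modify-write, so guard it
        with self._lock:
            self.value += 1

def hammer(counter, n):
    for _ in range(n):
        counter.increment()

def run(n_threads=4, n_increments=10_000):
    counter = Counter()
    threads = [threading.Thread(target=hammer, args=(counter, n_increments))
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value

print(run())  # 40000
```

For the one-writer, one-reader case in the question, a plain attribute read is fine in practice; the lock becomes important as soon as more than one thread modifies the value.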
from threading import Thread
from queue import Queue

class producer(Thread):
    def __init__(self, queue):
        Thread.__init__(self)
        self.queue = queue
        self.start()

    def run(self):
        while 1:
            self.queue.put(update_value())

class consumer(Thread):
    def __init__(self, queue):
        Thread.__init__(self)
        self.queue = queue
        self.start()

    def run(self):
        while True:
            value = self.queue.get()
            do_whatever_you_want(value)

queue = Queue()
producer(queue)
consumer(queue)
Notice that you can scale by using 100 producers and one consumer (and, of course, one queue). 100 threads should be OK, but things would be different if you wanted 10,000.
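A minimal sketch of that many-producers, one-consumer layout in Python 3, with a sentinel so the threads can actually stop (the classes above loop forever). fetch_value is a made-up stand-in for reading a number from a site:

```python
import queue
import threading

STOP = object()  # sentinel telling the consumer to finish

def fetch_value(site_id, step):
    # Hypothetical stand-in for reading a number from a server
    return site_id * 100 + step

def producer(q, site_id, n_updates):
    for step in range(n_updates):
        q.put((site_id, fetch_value(site_id, step)))

def consumer(q, latest):
    while True:
        item = q.get()
        if item is STOP:
            break
        site_id, value = item
        latest[site_id] = value  # process the change here

def run(n_sites=100, n_updates=3):
    q = queue.Queue()
    latest = {}
    cons = threading.Thread(target=consumer, args=(q, latest))
    cons.start()
    producers = [threading.Thread(target=producer, args=(q, s, n_updates))
                 for s in range(n_sites)]
    for p in producers:
        p.start()
    for p in producers:
        p.join()
    q.put(STOP)  # all producers are done, so nothing follows the sentinel
    cons.join()
    return latest

if __name__ == "__main__":
    result = run()
    print(len(result), result[7])
```

Only the consumer touches the latest dict, so no extra lock is needed; the queue does the synchronisation.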