multiprocessing.Pool.map stuck at last Process - python

I am running a program in PyCharm on a Linux server; it uses multiprocessing.Pool().map for increased performance.
The code looks something like this:
import multiprocessing
from functools import partial

for episode in episodes:
    with multiprocessing.Pool() as mpool:
        func_part = partial(worker_function)
        mpool.map(func_part, range(step))
The weird thing is that it runs perfectly fine on my Windows 10 laptop, but as soon as I try to run it on a Linux server the program gets stuck at the very last process, at measurement count 241/242, i.e. right before proceeding to the next iteration of the loop (the next episode).
No error message is given. I am running PyCharm on both machines. The step layer is where I placed the multiprocessing.Pool().map call.
Edit:
I've added mpool.close() and mpool.join(), but it seems to have no effect:
import multiprocessing
from functools import partial

for episode in episodes:
    with multiprocessing.Pool() as mpool:
        func_part = partial(worker_function)
        mpool.map(func_part, range(step))
        mpool.close()
        mpool.join()
It still gets stuck at the last process.
Edit2:
This is the worker function:
def worker_func(steplength, episode, episodes, env, agent, state, log_data_qvalues, log_data, steps):
    env.time_ = step
    action = agent.act(state, env)  # given the state, the agent acts (eps-greedy) either by choosing randomly or relying on its own prediction (weights are considered here to sum up the q-values of all objectives)
    next_state, reward = env.steplength(action, state)  # given the action, the environment gives back the next_state and the reward for the transaction for all objectives separately
    agent.remember(state, action, reward, next_state, env.future_reward)  # agent puts the experience in his memory
    q_values = agent.model.predict(np.reshape(state, [1, env.state_size]))  # This part is not necessary for the framework, but lets the agent predict every time_ to
    start = 0  # to store the development of the prediction and to investigate the development of the Q-values
    machine_start = 0
    for counter, machine in enumerate(env.list_of_machines):
        liste = [episode, steplength, state[counter]]
        q_values_objectives = []
        for objective in range(1, env.number_of_objectives + 1):
            liste.append(objective)
            liste.append(q_values[0][start:machine.actions + start])
            start = int(agent.action_size / env.number_of_objectives) + start
        log_data_qvalues.append(liste)
        machine_start += machine.actions
        start = machine_start
    state = next_state
    steps.append(state)
    env.current_step += 1
    if len(agent.memory) > agent.batch_size:  # If the agent has collected more than batch_size experiences, the networks of the agents start
        agent.replay(env)  # to be trained; with the replay function, batch_size samples from the memory of the agents are selected
        agent.update_target_model()  # the Q-target is updated after one batch-training
    if steplength == env.steplength-2:  # for plotting the process during training
        # agent.update_target_model()
        print(f'Episode: {episode + 1}/{episodes} Score: {steplength} e: {agent.epsilon:.5}')
        log_data.append([episode, agent.epsilon])
As you can see, it uses several classes to pass attributes around, so I don't know how I would reproduce it. I am still experimenting to find out where exactly the process gets stuck. The worker function communicates with the env and the agent class and passes information that is required to train a neural network. The agent class controls the learning process, while the env class simulates the environment the network has control over.
step is an integer variable:
step = 12

Are you calling
mpool.close()
mpool.join()
at the end?
EDIT
The problem is not with multiprocessing but with the measurement count part. According to the screenshot, the pool map successfully ends with step 11 (range(12) starts at 0). measurement count appears nowhere in the provided snippets, so there is nothing to go on to debug that part.
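As a first step, a minimal sketch (based only on the snippet from the question, with flushed print statements added; everything else stays as in the question) of how to confirm whether the hang really is inside pool.map or in the surrounding measurement-count logic:

import multiprocessing
from functools import partial

for episode in episodes:
    with multiprocessing.Pool() as mpool:
        func_part = partial(worker_function)
        print(f'episode {episode}: submitting {step} tasks', flush=True)
        mpool.map(func_part, range(step))
        print(f'episode {episode}: map returned', flush=True)   # printed -> all workers finished
    print(f'episode {episode}: pool torn down', flush=True)     # missing -> the hang is in pool cleanup
print('all episodes done', flush=True)

If "map returned" appears for the last episode but the program still hangs, the pool is not the culprit and the measurement count code is the place to look.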

Related

Optuna hyperparameter search not reproducible with interrupted / resumed studies

For big ML models with many parameters, it is helpful if one can interrupt and resume the hyperparameter optimization search.
Optuna allows doing that with the RDB backend, which stores the study in a SQlite database (https://optuna.readthedocs.io/en/stable/tutorial/20_recipes/001_rdb.html#sphx-glr-tutorial-20-recipes-001-rdb-py).
However, when interrupting and resuming a study, the results are not the same as that of an uninterrupted study.
Expected: for a fixed seed, the results of an optimization run with n_trials = x are identical to those of a study run as 5 resumed chunks of n_trials = x/5, and to a study that is interrupted with KeyboardInterrupt and resumed 5 times until it reaches n_trials = x.
Actual: The results are equal up to the point of the first interruption. From then on, they differ.
The figures show the optimization history of all trials in a study. The left-most figure (A) shows the uninterrupted run, the center one (B) shows a run interrupted by keyboard, and the right-most figure (C) shows the run interrupted by reaching n_trials. In B and C, the red dotted line marks the point where the study was first interrupted. Left of the line the results are equal to the uninterrupted study; to the right they differ.
Is it possible to interrupt and resume a study so that another study with the same seed that has not been interrupted generates exactly the same result?
(Obviously assuming that the objective function itself behaves deterministically.)
Minimal working example to reproduce:
import optuna
import logging
import sys
import numpy as np

def objective(trial):
    x = trial.suggest_float("x", -10, 10)
    return (x - 4) ** 2

def set_study(db_name,
              study_name,
              seed,
              direction="minimize"):
    '''
    Creates a new study in a sqlite database located in results/ .
    The study can be resumed after keyboard interrupt by simply creating it
    using the same command used for the initial creation.
    '''
    # Add stream handler of stdout to show the messages
    optuna.logging.get_logger("optuna").addHandler(logging.StreamHandler(sys.stdout))
    sampler = optuna.samplers.TPESampler(seed=seed, n_startup_trials=0)
    storage_name = f"sqlite:///{db_name}.db"
    storage = optuna.storages.RDBStorage(storage_name, heartbeat_interval=1)
    study = optuna.create_study(storage=storage,
                                study_name=study_name,
                                sampler=sampler,
                                direction=direction,
                                load_if_exists=True)
    return study

study = set_study('optuna_test', 'optuna_test_study', 1)
try:
    # Press CTRL+C to stop the optimization.
    study.optimize(objective, n_trials=100)
except KeyboardInterrupt:
    pass

df = study.trials_dataframe(attrs=("number", "value", "params", "state"))
print(df)
print("Best params: ", study.best_params)
print("Best value: ", study.best_value)
print("Best Trial: ", study.best_trial)
# print("Trials: ", study.trials)

fig = optuna.visualization.plot_optimization_history(study)
fig.show()
Found what's causing the problem: the random number generator in the sampler is initialized from the seed, but of course it returns a different sequence if the study is interrupted and resumed (it is then reinitialised).
This is especially bad when using random search with a fixed seed, as the search then basically starts from scratch.
If one really needs reproducible runs, one can simply extract the rng into a binary file after a run or keyboard interrupt, and resume by overwriting the newly generated rng of the sampler with the saved one.
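A rough sketch of that save/restore idea, assuming Optuna 2.x, where TPESampler keeps its state in a private _rng attribute (this is internal API and may be named or structured differently in other versions); the file name is made up:

import pickle
import optuna

RNG_FILE = "sampler_rng.pkl"  # hypothetical file name

def save_sampler_rng(sampler):
    # persist the sampler's random state after a finished or interrupted run
    with open(RNG_FILE, "wb") as f:
        pickle.dump(sampler._rng, f)  # _rng is an Optuna-internal attribute

def restore_sampler_rng(sampler):
    # overwrite the freshly seeded rng of the resumed study's sampler
    # note: TPESampler also wraps an internal RandomSampler with its own state;
    # a fully faithful restore would need to save and restore that as well
    with open(RNG_FILE, "rb") as f:
        sampler._rng = pickle.load(f)

sampler = optuna.samplers.TPESampler(seed=1, n_startup_trials=0)
study = optuna.create_study(storage="sqlite:///optuna_test.db",
                            study_name="optuna_test_study",
                            sampler=sampler,
                            load_if_exists=True)
try:
    restore_sampler_rng(sampler)  # only succeeds if a previous run saved its state
except FileNotFoundError:
    pass
try:
    study.optimize(objective, n_trials=100)
except KeyboardInterrupt:
    pass
finally:
    save_sampler_rng(sampler)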

Python: parallel execution of a function which has a sequential loop inside

I am reproducing some simple 10-arm bandit experiments from Sutton and Barto's book Reinforcement Learning: An Introduction.
Some of these require significant computation time, so I tried to take advantage of my multi-core CPU.
Here is the function which I need to run 2000 times. It has 1000 sequential steps which incrementally improve the reward:
import numpy as np

def foo(eps):  # need an (unused) argument to use pool.map()
    # initialising
    # the true values of the actions
    q = np.random.normal(0, 1, size=10)
    # the estimated values
    q_est = np.zeros(10)
    # the counter of how many times each of the 10 actions was chosen
    n = np.zeros(10)
    rewards = []
    for i in range(1000):
        # choose an action based on its estimated value
        a = np.argmax(q_est)
        # get the normally distributed reward
        rewards.append(np.random.normal(q[a], 1))
        # increment the chosen action counter
        n[a] += 1
        # update the estimated value of the action
        q_est[a] += (rewards[-1] - q_est[a]) / n[a]
    return rewards
I execute this function 2000 times to get a (2000, 1000) array:
reward = np.array([foo(0) for _ in range(2000)])
Then I plot the mean reward across 2000 experiments:
import matplotlib.pyplot as plt
plt.plot(np.arange(1000), reward.mean(axis=0))
sequential plot
which fully corresponds to the expected result (it looks the same as in the book).
But when I try to execute it in parallel, I get a much greater standard deviation of the average reward:
import multiprocessing as mp

with mp.Pool(mp.cpu_count()) as pool:
    reward_p = np.array(pool.map(foo, [0]*2000))

plt.plot(np.arange(1000), reward_p.mean(axis=0))
parallel plot
I suppose this is due to the parallelization of the loop inside foo. As I reduce the number of cores allocated to the task, the reward plot approaches the expected shape.
Is there a way to get the advantage of the multiprocessing here while getting the correct results?
UPD:
I tried running the same code on Windows 10, and the sequential and parallel results turned out to be the same! What may be the reason?
Ubuntu 20.04, Python 3.8.5, jupyter
Windows 10, Python 3.7.3, jupyter
As we found out, it is different on Windows and Ubuntu. It is probably because of this (from the multiprocessing documentation on start methods):
spawn: The parent process starts a fresh Python interpreter process. The child process will only inherit those resources necessary to run the process object's run() method. In particular, unnecessary file descriptors and handles from the parent process will not be inherited. Starting a process using this method is rather slow compared to using fork or forkserver. Available on Unix and Windows. The default on Windows and macOS.
fork: The parent process uses os.fork() to fork the Python interpreter. The child process, when it begins, is effectively identical to the parent process. All resources of the parent are inherited by the child process. Note that safely forking a multithreaded process is problematic. Available on Unix only. The default on Unix.
Try adding this line to your code:
mp.set_start_method('spawn')
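For context, a minimal sketch (an assumption about how the pieces fit together, not part of the original answer) of wiring the spawn start method into the snippet from the question. With fork, every worker starts from a copy of the parent's NumPy RNG state, so many of the 2000 runs presumably draw identical random numbers; spawn gives each worker a fresh interpreter. Note that set_start_method must run once in the main module before the pool is created, and under spawn, foo has to live in an importable module rather than only in a notebook cell:

import multiprocessing as mp
import numpy as np

from bandit import foo  # hypothetical module holding foo(), so spawned workers can import it

if __name__ == "__main__":
    ctx = mp.get_context("spawn")  # alternative to mp.set_start_method('spawn'); leaves the global default untouched
    with ctx.Pool(ctx.cpu_count()) as pool:
        reward_p = np.array(pool.map(foo, [0] * 2000))
    print(reward_p.mean(axis=0)[:5])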

Can Kubeflow Pipelines run GPU components in parallel?

I am trying to build a Kubeflow pipeline where I run two components (with a GPU constraint) in parallel. It seemed like a non-issue, but every time I tried it, one component would get stuck at "pending" until the other component was done.
Example run
The two components I am testing are simple while loops with a GPU constraint:
while_op1 = while_loop_op(image_name='tensorflow/tensorflow:1.15.2-py3')
while_op1.name = 'while-1-gpu'
while_op1.set_security_context(V1SecurityContext(privileged=True))
while_op1.apply(gcp.use_gcp_secret('user-gcp-sa'))
while_op1.add_pvolumes({pv_base_path: _volume_op.volume})
while_op1.add_node_selector_constraint('cloud.google.com/gke-accelerator', 'nvidia-tesla-p100')
while_op1.set_gpu_limit(1)
while_op1.after(init_op)
Where while_loop_op:
import kfp.components as comp

def while_loop_op(image_name):
    def while_loop():
        import time
        max_count = 300
        count = 0
        while True:
            if count >= max_count:
                print('Done.')
                break
            time.sleep(10)
            count += 10
            print("{} seconds have passed...".format(count))
    op = comp.func_to_container_op(while_loop, base_image=image_name)
    return op()
The issue might be related to your use of volumes. Have you tried the better-supported data-passing mechanisms instead?
For example, take this pipeline: https://github.com/kubeflow/pipelines/blob/091316b8bf3790e14e2418843ff67a3072cfadc0/components/XGBoost/_samples/sample_pipeline.py
Apply the GPU-related customizations to the trainer:
some_task.add_node_selector_constraint('cloud.google.com/gke-accelerator', 'nvidia-tesla-p100')
some_task.set_gpu_limit(1)
Put the trainer and predictor inside a for _ in range(10): loop so that you have 10 parallel copies (a rough sketch of such a loop follows below).
Check whether the trainers run in parallel.
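For illustration, a sketch of what that could look like with the kfp v1 DSL; trainer_op stands in for a hypothetical component factory (e.g. built with comp.func_to_container_op as in the question), and only the GPU customizations are taken from the answer above:

import kfp
from kfp import dsl

@dsl.pipeline(name='parallel-gpu-test')
def parallel_gpu_pipeline():
    # ten independent copies of the (hypothetical) trainer component,
    # each requesting a single GPU on a P100 node
    for i in range(10):
        task = trainer_op()
        task.add_node_selector_constraint('cloud.google.com/gke-accelerator',
                                          'nvidia-tesla-p100')
        task.set_gpu_limit(1)

if __name__ == '__main__':
    kfp.compiler.Compiler().compile(parallel_gpu_pipeline, 'parallel_gpu_test.yaml')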
P.S. It's better to create issues in the official repo: https://github.com/kubeflow/pipelines/issues

Is there a way to pass arguments to multiple jobs in optuna?

I am trying to use Optuna to search hyperparameter spaces.
In one particular scenario I train a model on a machine with a few GPUs.
The model and batch size allow me to run 1 training per GPU.
So, ideally I would like to let Optuna spread all trials across the available GPUs
so that there is always 1 trial running on each GPU.
In the docs it says I should just start one process per GPU in a separate terminal, like:
CUDA_VISIBLE_DEVICES=0 optuna study optimize foo.py objective --study foo --storage sqlite:///example.db
I want to avoid that because the whole hyperparameter search continues in multiple rounds after that. I don't want to always manually start a process per GPU, check when all are finished, then start the next round.
I saw that study.optimize has an n_jobs argument.
At first glance this seems to be perfect.
E.g. I could do this:
import optuna

def objective(trial):
    # the actual model would be trained here
    # the trainer here would need to know which GPU
    # it should be using
    best_val_loss = trainer(**trial.params)
    return best_val_loss

study = optuna.create_study()
study.optimize(objective, n_trials=100, n_jobs=8)
This starts multiple threads, each starting a training run.
However, the trainer within objective somehow needs to know which GPU it should be using.
Is there a trick to accomplish that?
After a few mental breakdowns I figured out that I can do what I want using a multiprocessing.Queue. To get it into the objective function, I need to define the objective as a lambda function or as a class (I guess partial also works). E.g.:
from contextlib import contextmanager
import multiprocessing

import optuna
from optuna.trial import Trial

N_GPUS = 2

class GpuQueue:

    def __init__(self):
        self.queue = multiprocessing.Manager().Queue()
        all_idxs = list(range(N_GPUS)) if N_GPUS > 0 else [None]
        for idx in all_idxs:
            self.queue.put(idx)

    @contextmanager
    def one_gpu_per_process(self):
        current_idx = self.queue.get()
        yield current_idx
        self.queue.put(current_idx)

class Objective:

    def __init__(self, gpu_queue: GpuQueue):
        self.gpu_queue = gpu_queue

    def __call__(self, trial: Trial):
        with self.gpu_queue.one_gpu_per_process() as gpu_i:
            best_val_loss = trainer(**trial.params, gpu=gpu_i)  # trainer() is the user's training function
            return best_val_loss

if __name__ == '__main__':
    study = optuna.create_study()
    study.optimize(Objective(GpuQueue()), n_trials=100, n_jobs=8)
If you want a documented solution for passing arguments to objective functions used by multiple jobs, then the Optuna docs present two solutions:
callable classes (can be combined with multiprocessing),
lambda function wrapper (caution: simpler, but does not work with multiprocessing).
If you are prepared to take a few shortcuts, then you can skip some boilerplate by passing global values (constants such as the number of GPUs used) directly (via the Python environment) to the __call__() method rather than as arguments of __init__(); a sketch of this follows below.
The callable-classes solution was tested to work (in optuna==2.0.0) with the two multiprocessing backends (loky/multiprocessing) and remote database backends (mariadb/postgresql).
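A minimal sketch of that shortcut, assuming the thread-based n_jobs of study.optimize (with process-based backends, the Manager queue from the accepted answer above is needed instead); trainer() again stands in for the training function from the question:

import queue
import optuna

N_GPUS = 2                  # global constant, read inside __call__ instead of being passed to __init__
GPU_QUEUE = queue.Queue()   # thread-safe pool of free GPU indices
for i in range(N_GPUS):
    GPU_QUEUE.put(i)

class Objective:
    def __call__(self, trial):
        gpu_i = GPU_QUEUE.get()        # claim a free GPU index
        try:
            return trainer(**trial.params, gpu=gpu_i)  # hypothetical training function
        finally:
            GPU_QUEUE.put(gpu_i)       # release the GPU even if the trial fails

if __name__ == "__main__":
    study = optuna.create_study()
    study.optimize(Objective(), n_trials=100, n_jobs=N_GPUS)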
To overcome the problem I introduced a global variable that tracks which GPU is currently in use, which can then be read out in the objective function. The code looks like this:
import random
import time

import optuna
import torch

EPOCHS = n  # n = number of training epochs
USED_DEVICES = []

def objective(trial):
    time.sleep(random.uniform(0, 2))  # used because all n_jobs start at the same time
    gpu_list = list(range(torch.cuda.device_count()))
    unused_gpus = [x for x in gpu_list if x not in USED_DEVICES]
    idx = random.choice(unused_gpus)
    USED_DEVICES.append(idx)
    unused_gpus.remove(idx)
    DEVICE = f"cuda:{idx}"
    model = define_model(trial).to(DEVICE)
    # ... YOUR CODE ...
    for epoch in range(EPOCHS):
        # ... YOUR CODE ...
        if trial.should_prune():
            USED_DEVICES.remove(idx)
            raise optuna.exceptions.TrialPruned()
    # remove idx from the list so the GPU can be reused in the next trial
    USED_DEVICES.remove(idx)

Scheduling non-periodic events with multiple threads

I am attempting to develop a GUI program in Python (using PyQt5) to interact with a data acquisition device (DAQ) that will be connected via LAN or USB to a Windows PC. On the click of a button (in the GUI), the DAQ will perform a test.
Each "Test" will consist of collecting a reading (collecting a reading takes about 1.5 seconds) at user-defined intervals from the start of the test (e.g., 0.1 min, 0.2 min, 0.5 min, 1 min, 2 min, 5 min...1000 min etc.). A reading is collected by execution of a function, so code for a single test might look like this:
import time

t = [0, 0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000]  # times from start of test to collect readings at (min)
intervals = [t[i] - t[i-1] for i in range(1, len(t))]  # time delta between readings (min)

def GetReading():
    # some code to connect to the DAQ (using pyvisa) and collect reading
    reading = ['2020-01-02 17:33:33', 1.23456]  # the data returned from the DAQ
    return reading

def RunTest(r):
    results = [GetReading()]  # get the initial (t=0) reading
    ReadTime = 1.5  # time in seconds to collect 1 reading (I may use an implementation of
                    # time.run_process() or similar to actually calculate this instead)
    for j in r:
        time.sleep(j*60 - ReadTime)
        results.append(GetReading())
    return results

RunTest(intervals)
The DAQ can only perform one reading at a time. I would like to be able to run multiple tests simultaneously, and have my program automatically wait and start a new test when it is feasible (i.e., delay starting a test on click if another test is already running).
It is imperative that the first, say, 5 readings happen on time, but the subsequent readings of a given test can be delayed a bit without affecting the quality of the test. For example, if a test is running at the 0.2 min reading interval and the user initiates a new test, the program would wait until the current test completed, say, the 5 min reading before starting the additional test sequence.
Subsequent readings beyond the 5 min reading could be delayed to collect the first 5 readings of a new test sequence, or to collect a reading from another test.
I'm struggling with how to program this, conceptually. I think I need to use multiprocessing or similar to allow multiple tests to be run in parallel (though no actual parallel readings can occur). Or perhaps I can use a scheduler? I'm just not sure how to implement either of these; I've never used them before, and I'm having trouble understanding the examples I find in the context of my problem.
Furthermore, I need to be able to access the results (the output from RunTest) between calls to GetReading() (e.g., to view data as the test progresses), and using time.sleep wouldn't allow that.
UPDATE
The measurement the DAQ is collecting is deformation, via an LVDT.
The time zero in var t is not actually the button click supplied by the user. On button click, the DAQ will open the specified channel and the program will monitor for a change in deformation above a certain threshold. The user will then physically start the test (which involves adding a weight onto some material to measure stress-strain properties), and time zero will occur at reading i-1, where i is the first reading at which a change above the threshold is detected (i.e., t=0 corresponds to the zero-deformation reading the instant before the weight is added). I need the whole process, from button click, to adding the weight, to collecting up to the 5 minute reading, to be uninterrupted for a single test (deformation occurs most rapidly, and potentially erratically, in the first 5 minutes or so).
The code below works, but doesn't ensure that the first measurements of a new test are prioritized.
If this is essential, then the solution will be a little bit more difficult.
In order to be sure that only one function/thread is reading data at a given time, you can use a mutex (threading.Lock):
from threading import Lock

read_lock = Lock()

def get_data():
    with read_lock:
        # some code to connect to the DAQ (using pyvisa) and collect reading
        reading = ['2020-01-02 17:33:33', 1.23456]  # the data returned from the DAQ
        return reading
I'd propose writing a function that fetches the result and appends it to a results list.
Any object being modified by one thread and read by another should be protected with a Lock; therefore there is a second lock to avoid simultaneous reading/writing of results.
results_lock = Lock()

def get_and_store_data(results):
    result = get_data()
    with results_lock:
        results.append(result)
You can schedule a get_and_store_data action with threading.Timer
Below is the full code example:
from threading import Lock
from threading import Timer

t = [0, 0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000]  # times from start of test to collect readings at (min)

read_lock = Lock()
results_lock = Lock()

def get_data():
    import time
    import datetime
    import random
    with read_lock:
        time.sleep(1.5)
        # some code to connect to the DAQ (using pyvisa) and collect reading
        reading = [
            datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
            random.randint(0, 10000) / 1000,
        ]
        return reading

def get_and_store_data(results):
    result = get_data()
    with results_lock:
        results.append(result)

# schedule measures for one test
def schedule_measures(measure_times, results):
    timers = []
    for t in measure_times:
        timer = Timer(t, get_and_store_data, args=[results])
        timer.start()
        timers.append(timer)

def main():
    results = []
    meas_times = [tim * 1 for tim in t]
    schedule_measures(meas_times, results)
    while True:
        msg = "Please press enter to display results or q and enter to quit"
        choice = input(msg).strip()
        if choice == "q":
            break
        print("Results:")
        with results_lock:
            for result in results:
                print(result)

main()
If you want to reduce the 'drift' between measurements, then you could do something like:
import time

def schedule_measures(measure_times, results):
    timers = []
    t_0 = time.time()
    for t in measure_times:
        now = time.time()
        timer = Timer(t - (now - t_0), get_and_store_data, args=[results])
        timer.start()
        timers.append(timer)
The drift will probably be low enough anyway, but it's a neat trick if you have more CPU-intensive actions in your scheduling function or if you do not want to schedule all events at startup.
For prioritizing measurements it might be easier to create a sorted list of the calculated times at which the measurements should be performed, and to start the next timer only when the previous timer has fired. There would have to be some logic to decide which measurement should be the next one to be scheduled. I don't have time now, but will probably come back within the next 12 hours with a suggested algorithm.
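A rough sketch of how that "one timer at a time" idea could look, reusing get_and_store_data from the code above; the test-priority decision itself is deliberately left as just a comment, and an entry added while a later timer is already pending will only be picked up once that timer fires:

import heapq
import time
from threading import Lock, Timer

schedule_lock = Lock()
planned = []  # heap of (absolute_time, test_id, reading_index)

def add_test(test_id, reading_times_min):
    # register all readings of one test relative to now
    t_0 = time.time()
    with schedule_lock:
        for i, t_min in enumerate(reading_times_min):
            heapq.heappush(planned, (t_0 + t_min * 60, test_id, i))

def schedule_next(results):
    with schedule_lock:
        if not planned:
            return
        # priority logic would go here, e.g. prefer the first readings of a new test
        due, test_id, idx = heapq.heappop(planned)
    delay = max(0.0, due - time.time())
    Timer(delay, fire, args=[test_id, idx, results]).start()

def fire(test_id, idx, results):
    get_and_store_data(results)  # the reading is taken under read_lock, as above
    schedule_next(results)       # only now is the next timer armed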
