I am trying to build a kubeflow pipeline where I run two components (with a GPU constraint) in parallel. It seemed like a non-issue, but every time I tried it, one component would get stuck at "pending" until the other component is done.
Example run
The two components I am testing are simple while loops with a GPU constraint:
while_op1 = while_loop_op(image_name='tensorflow/tensorflow:1.15.2-py3')
while_op1.name = 'while-1-gpu'
while_op1.set_security_context(V1SecurityContext(privileged=True))
while_op1.apply(gcp.use_gcp_secret('user-gcp-sa'))
while_op1.add_pvolumes({pv_base_path: _volume_op.volume})
while_op1.add_node_selector_constraint('cloud.google.com/gke-accelerator', 'nvidia-tesla-p100')
while_op1.set_gpu_limit(1)
while_op1.after(init_op)
Where while_loop_op:
import kfp.components as comp
def while_loop_op(image_name):
def while_loop():
import time
max_count = 300
count = 0
while True:
if count >= max_count:
print('Done.')
break
time.sleep(10)
count += 10
print("{} seconds have passed...".format(count))
op = comp.func_to_container_op(while_loop, base_image=image_name)
return op()
the issue might be related to your use of volumes. Have you tried to use the more supported data passing mechanisms?
For example, take this pipeline: https://github.com/kubeflow/pipelines/blob/091316b8bf3790e14e2418843ff67a3072cfadc0/components/XGBoost/_samples/sample_pipeline.py
Apply the GPU-related customizations to the trainer:
some_task.add_node_selector_constraint('cloud.google.com/gke-accelerator', 'nvidia-tesla-p100')
some_task.set_gpu_limit(1)
Put the trainer and predictor inside a for _ in range(10): loop so that you have 10 parallel copies.
Check whether the trainers run in parallel.
P.S. It's better to create issues in the official repo: https://github.com/kubeflow/pipelines/issues
Related
I am running a program in pycharm on a linux server which uses multiprocessing.Pool().map for increased performance.
The code looks something like this:
import multiprocessing
from functools import partial
for episode in episodes:
with multiprocessing.Pool() as mpool:
func_part = partial(worker_function)
mpool.map(func_part, range(step))
The weird thing is that it runs perfectly fine on my Windows 10 Laptop but as soon as I try to run it on a linux server the program gets stuck at the exact last Process measurement count 241/242, so right before proceeding to the next iteration of the loop e.g. the next episode.
No error message given. I am running pycharm on both machines. The Step layer is where I placed the multiprocessing.Pool().map function.
Edit:
I've added mpool.close() and mpool.join() but it does seem to have no effect:
import multiprocessing
from functools import partial
for episode in episodes:
with multiprocessing.Pool() as mpool:
func_part = partial(worker_function)
mpool.map(func_part, range(step))
mpool.close()
mpool.join()
It still gets stuck at the last process.
Edit2:
This is the worker function:
def worker_func(steplength, episode, episodes, env, agent, state, log_data_qvalues, log_data, steps):
env.time_ = step
action = agent.act(state, env) # given the state, the agent acts (eps-greedy) either by choosing randomly or relying on its own prediction (weights are considered here to sum up the q-values of all objectives)
next_state, reward = env.steplength(action, state) # given the action, the environment gives back the next_state and the reward for the transaction for all objectives seperately
agent.remember(state, action, reward, next_state, env.future_reward) # agent puts the experience in his memory
q_values = agent.model.predict(np.reshape(state, [1, env.state_size])) # This part is not necessary for the framework, but lets the agent predict every time_ to
start = 0 # to store the development of the prediction and to investigate the development of the Q-values
machine_start = 0
for counter, machine in enumerate(env.list_of_machines):
liste = [episode, steplength, state[counter]]
q_values_objectives = []
for objective in range(1, env.number_of_objectives + 1):
liste.append(objective)
liste.append(q_values[0][start:machine.actions + start])
start = int(agent.action_size / env.number_of_objectives) + start
log_data_qvalues.append(liste)
machine_start += machine.actions
start = machine_start
state = next_state
steps.append(state)
env.current_step += 1
if len(agent.memory) > agent.batch_size: # If the agent has collected more than batch_size-experience, the networks of the agents are starting
agent.replay(env) # to be trained, with the replay function, batch-size- samples from the memory of the agents are selected
agent.update_target_model() # the Q-target is updated after one batch-training
if steplength == env.steplength-2: # for plotting the process during training
#agent.update_target_model()
print(f'Episode: {episode + 1}/{episodes} Score: {steplength} e: {agent.epsilon:.5}')
log_data.append([episode, agent.epsilon])
As you can see it uses several classes to pass attributes. I don't know how I would reproduce it. I am still experimenting on where the process gets stuck exactly. The worker function communicates with the env and the agent class and passes information that is required to train a neural network. The agent class controls the learning process while the env class simulates the environment the network has control over.
step is an integer variable:
step = 12
Are you calling
mpool.close()
mpool.join()
at the end?
EDIT
The problem is not w/ multiprocessing but with the measurement count part. According to the screenshot, the pool map successfully ends w/ step 11 (range(12) starts at 0). measurement count is nowhere to be seen in the provided snippets to try debugging that part.
I am trying to use optuna for searching hyper parameter spaces.
In one particular scenario I train a model on a machine with a few GPUs.
The model and batch size allows me to run 1 training per 1 GPU.
So, ideally I would like to let optuna spread all trials across the available GPUs
so that there is always 1 trial running on each GPU.
In the docs it says, I should just start one process per GPU in a separate terminal like:
CUDA_VISIBLE_DEVICES=0 optuna study optimize foo.py objective --study foo --storage sqlite:///example.db
I want to avoid that because the whole hyper parameter search continues in multiple rounds after that. I don't want to always manually start a process per GPU, check when all are finished, then start the next round.
I saw study.optimize has a n_jobs argument.
At first glance this seems to be perfect.
E.g. I could do this:
import optuna
def objective(trial):
# the actual model would be trained here
# the trainer here would need to know which GPU
# it should be using
best_val_loss = trainer(**trial.params)
return best_val_loss
study = optuna.create_study()
study.optimize(objective, n_trials=100, n_jobs=8)
This starts multiple threads each starting a training.
However, the trainer within objective somehow needs to know which GPU it should be using.
Is there a trick to accomplish that?
After a few mental breakdowns I figured out that I can do what I want using a multiprocessing.Queue. To get it into the objective function I need to define it as a lambda function or as a class (I guess partial also works). E.g.
from contextlib import contextmanager
import multiprocessing
N_GPUS = 2
class GpuQueue:
def __init__(self):
self.queue = multiprocessing.Manager().Queue()
all_idxs = list(range(N_GPUS)) if N_GPUS > 0 else [None]
for idx in all_idxs:
self.queue.put(idx)
#contextmanager
def one_gpu_per_process(self):
current_idx = self.queue.get()
yield current_idx
self.queue.put(current_idx)
class Objective:
def __init__(self, gpu_queue: GpuQueue):
self.gpu_queue = gpu_queue
def __call__(self, trial: Trial):
with self.gpu_queue.one_gpu_per_process() as gpu_i:
best_val_loss = trainer(**trial.params, gpu=gpu_i)
return best_val_loss
if __name__ == '__main__':
study = optuna.create_study()
study.optimize(Objective(GpuQueue()), n_trials=100, n_jobs=8)
If you want a documented solution of passing arguments to objective functions used by multiple jobs, then Optuna docs present two solutions:
callable classes (it can be combined with multiprocessing),
lambda function wrapper (caution: simpler, but does not work with multiprocessing).
If you are prepared to take a few shortcuts, then you can skip some boilerplate by passing global values (constants such as number of GPUs used) directly (via python environment) to the __call__() method (rather than as arguments of __init__()).
The callable classes solution was tested to work (in optuna==2.0.0) with the two multiprocessing backends (loky/multiprocessing) and remote database backends (mariadb/postgresql).
To overcome the problem if introduced a global variable that tracks, which GPU is currently in use, which can then be read out in the objective function. The code looks like this.
EPOCHS = n
USED_DEVICES = []
def objective(trial):
time.sleep(random.uniform(0, 2)) #used because all n_jobs start at the same time
gpu_list = list(range(torch.cuda.device_count())
unused_gpus = [x for x in gpu_list if x not in USED_DEVICES]
idx = random.choice(unused_gpus)
USED_DEVICES.append(idx)
unused_gpus.remove(idx)
DEVICE = f"cuda:{idx}"
model = define_model(trial).to(DEVICE)
#... YOUR CODE ...
for epoch in range(EPOCHS):
# ... YOUR CODE ...
if trial.should_prune():
USED_DEVICES.remove(idx)
raise optuna.exceptions.TrialPruned()
#remove idx from list to reuse in next trial
USED_DEVICES.remove(idx)
I am running a backtest for a trading strategy, defined as a class. I am trying to select the best combination of parameters to input in the model, so I am running multiple backtesting on a given period, trying out different combinations. The idea is to be able to select the first generation of a population to feed into a genetic algorithm. Seems like the perfect job for multiprocessing!
So I tried a bunch of things to see what works faster. I opened 10 Spyder consoles (yes, I tried it) and ran a single combination of parameters for each console (all running at the same time).
The sample code used for each single Spyder console:
class MyStrategy(day,parameters):
# my strategy that runs on a single day
backtesting=[]
for day in days:
backtesting_day=MyStrategy(day,single_parameter_combi)
backtesting.append(backtesting_day)
I then tried the multiprocessing way, using pool.
The sample code used in multiprocessing:
class MyStrategy(day,parameters):
# my strategy that runs on a single day
def single_run_backtesting(single_parameter_combi):
backtesting=[]
for day in days:
backtesting_day=MyStrategy(day,single_parameter_combi)
backtesting.append(backtesting_day)
return backtesting
def backtest_many(list_of parameter_combinations):
p=multiprocessing.pool()
result=p.map(single_run_backtesting,list_of parameter_combinations)
p.close()
p.join()
return result
if __name__ == '__main__':
parameter_combis=[...] # a list of parameter combinations, 10 different ones in this case
result = backtest_many(parameter_combis)
I have also tried the following: opening 5 Spyder consoles and running 2 instances of the class in a for loop, as below, and a single Spyder console with 10 instances of the class.
class MyStrategy(day,parameters):
# my strategy that runs on a single day
parameter_combis=[...] # a list of parameter combinations
backtest_dict={k: [] for k in range(len(parameter_combis)} # make a dictionary of empty lists
for day in days:
for j,single_parameter_combi in enumerate(parameter_combis):
backtesting_day=MyStrategy(day,single_parameter_combi)
backtest_dict[j].append(backtesting_day)
To my great surprise, it takes around 25 minutes with multiprocessing to go thorugh a single day, about the same time with a single Spyder console with 10 instances of a class in the for loop, and magically it takes only 15 minutes when I run 10 Spyder consoles at the same time. How do I process this information? It doesn't really make sense to me. I am running a 12-cpu machine on windows 10.
Consider that I am planning to run things on AWS with a 96-core machine, with something like 100 combinations of parameters that cross in a genetic algorithm which should run something like 20-30 generations (a full backtesting is 2 business months = 44 days).
My question is: what am I missing??? Most importantly, is this just a difference in scale?
I know that for example if you define a simple squaring function and run it serially for 100 times, multiprocessing is actually slower than a for loop. You start seeing the advantage around 10000 times, see for example this: https://github.com/vprusso/youtube_tutorials/blob/master/multiprocessing_and_threading/multiprocessing/multiprocessing_pool.py
Will I see a difference in performance when I go up to 100 combinations with multiprocessing, and is there any way of knowing in advnace if this is the case? Am I properly writing the code? Other ideas? Do you think it would speed up significatively if I was to use multiprocessing one step "above", in a single parameter combination over many days?
To expand upon my comment "Try p.imap_unordered().":
p.map() ensures that you get the results in the same order they're in the parameter list. To achieve this, some of the workers necessarily remain idle for some time
For your use case – essentially a grid search of parameter combinations – you really don't need to have them in the same order, you just want to end up with the best option. (Additionally, quoth the documentation, "it may cause high memory usage for very long iterables. Consider using imap() or imap_unordered() with explicit chunksize option for better efficiency.")
p.imap_unordered(), by contrast, doesn't really care – it just queues things up and workers work on them as they free up.
It's also worth experimenting with the chunksize parameter – quoting the imap() documentation, "For very long iterables using a large value for chunksize can make the job complete much faster than using the default value of 1." (since you spend less time queueing and synchronizing things).
Finally, for your particular use case, you might want to consider having the master process generate an infinite amount of parameter combinations using a generator function, and breaking off the loop once you find a good enough solution or enough time passes.
A simple-ish function to do this and a contrived problem (finding two random numbers 0..1 to maximize their sum) follows. Just remember to return the original parameter set from the worker function too, otherwise you won't have access to it! :)
import random
import multiprocessing
import time
def find_best(*, param_iterable, worker_func, metric_func, max_time, chunksize=10):
best_result = None
best_metric = None
start_time = time.time()
n_results = 0
with multiprocessing.Pool() as p:
for result in p.imap_unordered(worker_func, param_iterable, chunksize=chunksize):
n_results += 1
elapsed_time = time.time() - start_time
metric = metric_func(result)
if best_metric is None or metric > best_metric:
print(f'{elapsed_time}: Found new best solution, metric {metric}')
best_metric = metric
best_result = result
if elapsed_time >= max_time:
print(f'{elapsed_time}: Max time reached.')
break
final_time = time.time() - start_time
print(f'Searched {n_results} results in {final_time} s.')
return best_result
# ------------
def generate_parameter():
return {'a': random.random(), 'b': random.random()}
def generate_parameters():
while True:
yield generate_parameter()
def my_worker(parameters):
return {
'parameters': parameters, # remember to return this too!
'value': parameters['a'] + parameters['b'], # our maximizable metric
}
def my_metric(result):
return result['value']
def main():
result = find_best(
param_iterable=generate_parameters(),
worker_func=my_worker,
metric_func=my_metric,
max_time=5,
)
print(f'Best result: {result}')
if __name__ == '__main__':
main()
An example run:
~/Desktop $ python3 so59357979.py
0.022627830505371094: Found new best solution, metric 0.5126700311039976
0.022940874099731445: Found new best solution, metric 0.9464256914062249
0.022969961166381836: Found new best solution, metric 1.2946600313637404
0.02298712730407715: Found new best solution, metric 1.6255217652861256
0.023016929626464844: Found new best solution, metric 1.7041449687571075
0.02303481101989746: Found new best solution, metric 1.8898109980050104
0.030200958251953125: Found new best solution, metric 1.9031436071918972
0.030324935913085938: Found new best solution, metric 1.9321951916206537
0.03880715370178223: Found new best solution, metric 1.9410837287942249
0.03970479965209961: Found new best solution, metric 1.9649277383314245
0.07829880714416504: Found new best solution, metric 1.9926667738329622
0.6105098724365234: Found new best solution, metric 1.997217792614364
5.000051021575928: Max time reached.
Searched 621931 results in 5.07216 s.
Best result: {'parameters': {'a': 0.997483, 'b': 0.999734}, 'value': 1.997217}
(By the way, this is nearly 6 times slower when chunksize=1.)
I am trying to get into Dask. For that I attempted to parallelize some time consuming sequential code I got. The original code is this:
def sequential():
sims = []
chunksize = len(tokens)//4
for i in range(0, len(tokens), chunksize):
print(i, i+chunksize)
chunk = tokens[i:i+chunksize]
sims.append(process(chunk))
return sims
%time sequential()
and the prallelized code is this:
def parallel():
sims = []
chunksize = len(tokens)//4
for i in range(0, len(tokens), chunksize):
print(i, i+chunksize)
chunk = dask.delayed(tokens[i:i+chunksize])
sims.append(dask.delayed(process)(chunk))
return dask.delayed(sims)
%time parallel().visualize()
But the parallelized code always runs around 10% slower than the parallel one. when I visualize the computation graph for sims I get this:
Not sure where list-#8 comes from, but other than that it looks correct. So why is there no speedup? When I look into htop I can see 3 cores active (~30% load each), while for the sequential code I see only 1 core active (100% load). The sequential code runs 7 minutes and the parallel code runs 7 - 8 minutes.
I guess I am misunderstanding how delayed and compute should be used here?
The setup is this, if you require it:
import numpy
import spacy
import dask
nlp = spacy.load('en_core_web_lg')
tokens = [t for t in nlp(" ".join(t.strip() for t in open('./words.txt','r').readlines())) if len(t.text) > 1 and len(t.text) < 20]
def process(chunk):
sims = numpy.zeros([len(chunk),len(tokens)], dtype=numpy.float32)
for i in range(len(chunk)):
for j in range(len(tokens)):
sims[i,j] = chunk[i].similarity(tokens[j])
return sims
You are seeing this behaviour because the default execution engine for dask is based on multiple threads in a single process (the "threaded" scheduler). Python has a lock, the GIL, which ensures the safety of the interpreter by only executing one python statement at a time. Therefore, each thread is spending most of its time waiting for the lock to become available.
To avoid this problem, you have two options:
find a version of your computation that releases the GIL. This is possible if you can phrase it as (mainly) some numpy, pandas, numba, etc., computation, code that executes at the C level and doesn't need the interpreter, unlike your nested loops.
run your code using processes, using either the "mutiprocessing" scheduler or (better) the "distributed" scheduler which, despite the name, also runs well on a single machine.
Further information: http://dask.pydata.org/en/latest/scheduler-overview.html
I am looking for a definitive answer to MATLAB's parfor for Python (Scipy, Numpy).
Is there a solution similar to parfor? If not, what is the complication for creating one?
UPDATE: Here is a typical numerical computation code that I need speeding up
import numpy as np
N = 2000
output = np.zeros([N,N])
for i in range(N):
for j in range(N):
output[i,j] = HeavyComputationThatIsThreadSafe(i,j)
An example of a heavy computation function is:
import scipy.optimize
def HeavyComputationThatIsThreadSafe(i,j):
n = i * j
return scipy.optimize.anneal(lambda x: np.sum((x-np.arange(n)**2)), np.random.random((n,1)))[0][0,0]
The one built-in to python would be multiprocessing docs are here. I always use multiprocessing.Pool with as many workers as processors. Then whenever I need to do a for-loop like structure I use Pool.imap
As long as the body of your function does not depend on any previous iteration then you should have near linear speed-up. This also requires that your inputs and outputs are pickle-able but this is pretty easy to ensure for standard types.
UPDATE:
Some code for your updated function just to show how easy it is:
from multiprocessing import Pool
from itertools import product
output = np.zeros((N,N))
pool = Pool() #defaults to number of available CPU's
chunksize = 20 #this may take some guessing ... take a look at the docs to decide
for ind, res in enumerate(pool.imap(Fun, product(xrange(N), xrange(N))), chunksize):
output.flat[ind] = res
There are many Python frameworks for parallel computing. The one I happen to like most is IPython, but I don't know too much about any of the others. In IPython, one analogue to parfor would be client.MultiEngineClient.map() or some of the other constructs in the documentation on quick and easy parallelism.
Jupyter Notebook
To see an example consider you want to write the equivalence of this Matlab code on in Python
matlabpool open 4
parfor n=0:9
for i=1:10000
for j=1:10000
s=j*i
end
end
n
end
disp('done')
The way one may write this in python particularly in jupyter notebook. You have to create a function in the working directory (I called it FunForParFor.py) which has the following
def func(n):
for i in range(10000):
for j in range(10000):
s=j*i
print(n)
Then I go to my Jupyter notebook and write the following code
import multiprocessing
import FunForParFor
if __name__ == '__main__':
pool = multiprocessing.Pool(processes=4)
pool.map(FunForParFor.func, range(10))
pool.close()
pool.join()
print('done')
This has worked for me! I just wanted to share it here to give you a particular example.
This can be done elegantly with Ray, a system that allows you to easily parallelize and distribute your Python code.
To parallelize your example, you'd need to define your functions with the #ray.remote decorator, and then invoke them with .remote.
import numpy as np
import time
import ray
ray.init()
# Define the function. Each remote function will be executed
# in a separate process.
#ray.remote
def HeavyComputationThatIsThreadSafe(i, j):
n = i*j
time.sleep(0.5) # Simulate some heavy computation.
return n
N = 10
output_ids = []
for i in range(N):
for j in range(N):
# Remote functions return a future, i.e, an identifier to the
# result, rather than the result itself. This allows invoking
# the next remote function before the previous finished, which
# leads to the remote functions being executed in parallel.
output_ids.append(HeavyComputationThatIsThreadSafe.remote(i,j))
# Get results when ready.
output_list = ray.get(output_ids)
# Move results into an NxN numpy array.
outputs = np.array(output_list).reshape(N, N)
# This program should take approximately N*N*0.5s/p, where
# p is the number of cores on your machine, N*N
# is the number of times we invoke the remote function,
# and 0.5s is the time it takes to execute one instance
# of the remote function. For example, for two cores this
# program will take approximately 25sec.
There are a number of advantages of using Ray over the multiprocessing module. In particular, the same code will run on a single machine as well as on a cluster of machines. For more advantages of Ray see this related post.
Note: One point to keep in mind is that each remote function is executed in a separate process, possibly on a different machine, and thus the remote function's computation should take more than invoking a remote function. As a rule of thumb a remote function's computation should take at least a few 10s of msec to amortize the scheduling and startup overhead of a remote function.
I've always used Parallel Python but it's not a complete analog since I believe it typically uses separate processes which can be expensive on certain operating systems. Still, if the body of your loops are chunky enough then this won't matter and can actually have some benefits.
I tried all solutions here, but found that the simplest way and closest equivalent to matlabs parfor is numba's prange.
Essentially you change a single letter in your loop, range to prange:
from numba import autojit, prange
#autojit
def parallel_sum(A):
sum = 0.0
for i in prange(A.shape[0]):
sum += A[i]
return sum
I recommend trying joblib Parallel.
one liner
from joblib import Parallel, delayed
out = Parallel(n_jobs=2)(delayed(heavymethod)(i) for i in range(10))
instructional
instead of taking a for loop
from time import sleep
for _ in range(10):
sleep(.2)
rewrite your operation into a list comprehension
[sleep(.2) for _ in range(10)]
Now let us not directly evaluate the expression, but collect what should be done.
This is what the delayed method is for.
from joblib import delayed
[delayed(sleep(.2)) for _ in range(10)]
Next instantiate a parallel process with n_workers and process the list.
from joblib import Parallel
r = Parallel(n_jobs=2, verbose=10)(delayed(sleep)(.2) for _ in range(10))
[Parallel(n_jobs=2)]: Done 1 tasks | elapsed: 0.6s
[Parallel(n_jobs=2)]: Done 4 tasks | elapsed: 0.8s
[Parallel(n_jobs=2)]: Done 10 out of 10 | elapsed: 1.4s finished
Ok, I'll also give it a go, let's see if my way is easier
from multiprocessing import Pool
def heavy_func(key):
#do some heavy computation on each key
output = key**2
return key, output
output_data ={} #<--this dict will store the results
keys = [1,5,7,8,10] #<--compute heavy_func over all the values of keys
with Pool(processes=40) as pool:
for i in pool.imap_unordered(heavy_func, keys):
output_data[i[0]] = i[1]
Now output_data is a dictionary that will contain for every key the result of the computation on this key.
That is it..