Periodically restart Python multiprocessing pool

I have a Python multiprocessing pool doing a very long job that even after a thorough debugging is not robust enough not to fail every 24 hours or so, because it depends on many third-party, non-Python tools with complex interactions. Also, the underlying machine has certain problems that I cannot control. Note that by failing I don't mean the whole program crashing, but some or most of the processes becoming idle because of some errors, and the app itself either hanging or continuing the job just with the processes that haven't failed.
My solution right now is to periodically kill the job, manually, and then just restart from where it was.
Even if it's not ideal, what I want to do now is the following: restart the multiprocessing pool periodically, programmatically, from the Python code itself. I don't really care if this implies killing the pool workers in the middle of their job. What would be the best way to do that?
My code looks like:
with Pool() as p:
    for _ in p.imap_unordered(function, data):
        save_checkpoint()
        log()
What I have in mind would be something like:
start = 0
end = 1000  # magic number
while start + 1 < len(data):
    current_data = data[start:end]
    with Pool() as p:
        for _ in p.imap_unordered(function, current_data):
            save_checkpoint()
            log()
    start += 1000
    end += 1000
Or:
start = 0
end = 1000  # magic number
while start + 1 < len(data):
    current_data = data[start:end]
    start_timeout(time=TIMEOUT)  # what would be the best way to do that without breaking multiprocessing?
    try:
        with Pool() as p:
            for _ in p.imap_unordered(function, current_data):
                save_checkpoint()
                log()
        start += 1000
        end += 1000
    except Timeout:
        pass
Or any suggestion you think would be better. Any help would be much appreciated, thanks!

The problem with your current code is that it iterates the multiprocessed results directly, and that call will block. Fortunately there's an easy solution: use apply_async exactly as suggested in the docs. But because of how you describe the use-case here and the failure, I've adapted it somewhat. Firstly, a mock task:
from multiprocessing import Pool, TimeoutError, cpu_count
from time import sleep
from random import randint

def log():
    print("logging is a dangerous activity: wear a hard hat.")

def work(d):
    sleep(randint(1, 100) / 100)
    print("finished working")
    if randint(1, 10) == 1:
        print("blocking...")
        while True:
            sleep(0.1)
    return d
This work function will fail with a probability of 0.1, blocking indefinitely. We create the tasks:
data = list(range(100))
nproc = cpu_count()
And then generate futures for all of them:
while data:
    print(f"== Processing {len(data)} items. ==")
    with Pool(nproc) as p:
        tasks = [p.apply_async(work, (d,)) for d in data]
        failed = []  # tasks whose .get() timed out
Then we can try to get the tasks out manually:
for task in tasks:
    try:
        res = task.get(timeout=1)
        data.remove(res)
        log()
    except TimeoutError:
        failed.append(task)
        if len(failed) < nproc:
            print(
                f"{len(failed)} processes are blocked,"
                f" but {nproc - len(failed)} remain."
            )
        else:
            break
The controlling timeout here is the timeout to .get. It should be as long as you expect the longest process to take. Note that we detect when the whole pool is tied up and give up.
But since in the scenario you describe some processes are going to take longer than others, we can give 'failed' processes some time to recover. Thus every time a task fails we quickly check whether the others have in fact succeeded:
for task in list(failed):  # iterate over a copy so we can remove items as we go
    try:
        res = task.get(timeout=0.01)
        data.remove(res)
        failed.remove(task)
        log()
    except TimeoutError:
        continue
Whether this is a good addition in your case depends on whether your tasks really are as flaky as I'm guessing they are.
Exiting the context manager for the pool will terminate the pool, so we don't even need to handle that ourselves. If you have significant variation you might want to increase the pool size (thus increasing the number of tasks which are allowed to stall) or allow tasks a grace period before considering them 'failed'.
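Putting those fragments together (reusing the work and log functions above; the 1-second and 0.01-second timeouts are placeholders to tune, and exactly where you run the recovery scan is a judgment call), the whole loop might look roughly like this:

from multiprocessing import Pool, TimeoutError, cpu_count

if __name__ == "__main__":
    data = list(range(100))
    nproc = cpu_count()

    while data:
        print(f"== Processing {len(data)} items. ==")
        with Pool(nproc) as p:
            tasks = [p.apply_async(work, (d,)) for d in data]
            failed = []
            for task in tasks:
                try:
                    res = task.get(timeout=1)
                    data.remove(res)
                    log()
                except TimeoutError:
                    failed.append(task)
                    if len(failed) >= nproc:
                        break  # every worker is tied up: give up on this pool
                # give previously 'failed' tasks a quick chance to have recovered
                for t in list(failed):
                    try:
                        res = t.get(timeout=0.01)
                        data.remove(res)
                        failed.remove(t)
                        log()
                    except TimeoutError:
                        continue
        # leaving the `with` block terminates any still-blocked workers;
        # whatever remains in `data` is retried with a fresh pool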

Related

Why is my multithreading program only actually using a single thread?

I have realized that my multithreading program isn't doing what I think it's doing. The following is a MWE of my strategy. In essence I'm creating nThreads threads but only actually using one of them. Could somebody help me understand my mistake and how to fix it?
import threading
import queue

NPerThread = 100
nThreads = 4

def worker(q: queue.Queue, oq: queue.Queue):
    while True:
        l = []
        threadIData = q.get(block=True)
        for i in range(threadIData["N"]):
            l.append(f"hello {i} from thread {threading.current_thread().name}")
        oq.put(l)
        q.task_done()

threadData = [{} for i in range(nThreads)]
inputQ = queue.Queue()
outputQ = queue.Queue()

for threadI in range(nThreads):
    threadData[threadI]["thread"] = threading.Thread(
        target=worker, args=(inputQ, outputQ),
        name=f"WorkerThread{threadI}"
    )
    threadData[threadI]["N"] = NPerThread
    threadData[threadI]["thread"].setDaemon(True)
    threadData[threadI]["thread"].start()

for threadI in range(nThreads):
    # start and end are in units of 8 bytes.
    inputQ.put(threadData[threadI])

inputQ.join()

outData = [None] * nThreads
count = 0
while not outputQ.empty():
    outData[count] = outputQ.get()
    count += 1

for i in outData:
    assert len(i) == NPerThread
    print(len(i))

print(outData)
Edit: I only actually realised that I had made this mistake after profiling.
In your sample program, the worker function is just executing so fast that the same thread is able to dequeue every item. If you add a time.sleep(1) call to it, you'll see other threads pick up some of the work.
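For example, this is just the question's worker with that sleep dropped in:

import queue
import threading
import time

def worker(q: queue.Queue, oq: queue.Queue):
    while True:
        threadIData = q.get(block=True)
        time.sleep(1)  # simulate non-trivial work so other threads get a chance to dequeue
        l = []
        for i in range(threadIData["N"]):
            l.append(f"hello {i} from thread {threading.current_thread().name}")
        oq.put(l)
        q.task_done()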
However, it is important to understand if threads are the right choice for your real application, which presumably is doing actual work in the worker threads. As #jrbergen pointed out, because of the GIL, only one thread can execute Python bytecode at a time, so if your worker functions are executing CPU-bound Python code (meaning not doing blocking I/O or calling a library that releases the GIL), you're not going to get a performance benefit from threads. You'd need to use processes instead in that case.
I'll also note that you may want to use concurrent.futures.ThreadPoolExecutor or multiprocessing.pool.ThreadPool for an out-of-the-box thread pool implementation, rather than creating your own.
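As a rough illustration (not from the original program): with concurrent.futures the whole fan-out collapses to a few lines, and swapping ThreadPoolExecutor for ProcessPoolExecutor is how you would move to processes if the work turns out to be CPU-bound Python. make_greetings here is just a stand-in for the real per-thread work:

import threading
from concurrent.futures import ThreadPoolExecutor

NPerThread = 100
nThreads = 4

def make_greetings(n):
    # stand-in for whatever each thread is really supposed to do
    return [f"hello {i} from thread {threading.current_thread().name}" for i in range(n)]

with ThreadPoolExecutor(max_workers=nThreads) as executor:
    outData = list(executor.map(make_greetings, [NPerThread] * nThreads))

for chunk in outData:
    assert len(chunk) == NPerThread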

Python run two loops at the same time where one is rate limited and depends on data from the other

I have a problem in python where I want to run two loops at the same time. I feel like I need to do this because the second loop needs to be rate limited, but the first loop really shouldn't be rate limited. Also, the second loop takes an input from the first.
I'm looking for something that works like this:
for line in file:
    # do some stuff
    list = []
    list.append("an_item")
Rate limited:
for x in list:
    # do some stuff simultaneously
There are two basic approaches with different tradeoffs: synchronously switching between tasks, and running in threads or subprocesses. First, some common setup:
from queue import Queue  # or "from Queue import Queue" on Python 2

work = Queue()

def fast_task():
    """ Do the fast thing """
    if done:
        return None
    else:
        return result

def slow_task(arg):
    """ Do the slow thing """

RATE_LIMIT = 30  # seconds
Now, the synchronous approach. It has the advantage of being much simpler, and easier to debug, at the cost of being a bit slower. How much slower depends on the details of your tasks. How it works is, we run a tight loop that calls the fast job every time, and the slow job only if enough time has passed. If the fast job is no longer producing work and the queue is empty, we quit.
import time

last_call = 0
while True:
    next_job = fast_task()
    if next_job:
        work.put(next_job)
    elif work.empty():
        # nothing left to do
        break
    else:
        # fast task has done all its work - short sleep to slow the spin
        time.sleep(.1)
    now = time.time()
    if now - last_call > RATE_LIMIT:
        last_call = now
        slow_task(work.get())
If you feel like this doesn't work fast enough, you can try the multiprocessing approach. You can use the same structure for working with threads or processes, depending on whether you import from multiprocessing.dummy or multiprocessing itself. We use a multiprocessing.Queue for communication instead of queue.Queue.
import time

def do_the_fast_loop(work_queue):
    while True:
        next_job = fast_task()
        if next_job:
            work_queue.put(next_job)
        else:
            work_queue.put(None)  # sentinel - tells slow process to quit
            break

def do_the_slow_loop(work_queue):
    next_call = time.time()
    while True:
        job = work_queue.get()
        if job is None:  # sentinel seen - no more work to do
            break
        time.sleep(max(0, next_call - time.time()))
        next_call = time.time() + RATE_LIMIT
        slow_task(job)

if __name__ == '__main__':
    # from multiprocessing.dummy import Queue, Process  # for threads
    from multiprocessing import Queue, Process  # for processes

    work = Queue()
    fast = Process(target=do_the_fast_loop, args=(work,))
    slow = Process(target=do_the_slow_loop, args=(work,))
    fast.start()
    slow.start()
    fast.join()
    slow.join()
As you can see, there's quite a lot more machinery for you to implement, but it will be somewhat faster. Again, how much faster depends a lot on your tasks. I'd try all three approaches - synchronous, threaded, and multiprocess - and see which you like best.
You need to do two things:
Put the function that requires data from the other on its own process.
Implement a way to communicate between the two processes (e.g. a Queue).
All of this is necessary because of the GIL.
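A minimal sketch of those two points, with the producer/consumer names, the sample items, and the 30-second limit all chosen purely for illustration:

import time
from multiprocessing import Process, Queue

RATE_LIMIT = 30  # seconds between slow, rate-limited calls; illustrative value

def producer(q):
    # stand-in for the unthrottled loop over the file
    for item in ["line 1", "line 2", "line 3"]:
        q.put(item)
    q.put(None)  # sentinel: nothing more is coming

def consumer(q):
    while True:
        item = q.get()
        if item is None:
            break
        print("slow, rate-limited work on", item)
        time.sleep(RATE_LIMIT)

if __name__ == "__main__":
    q = Queue()
    fast = Process(target=producer, args=(q,))
    slow = Process(target=consumer, args=(q,))
    fast.start()
    slow.start()
    fast.join()
    slow.join()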

Python multiprocessing: retrieve next result

I'm trying to figure out a good way to use the multiprocessing package in Python 3.6 to run a set of around 100 tasks, with a maximum of 4 of them running simultaneously. I also want to:
repeatedly reap the next completed task from the pool and process its return value, until all tasks have either succeeded or failed;
make exceptions thrown in any given task non-fatal, so I can still access the results from the other tasks.
I don't need to maintain the order of tasks submitted to the pool (i.e. I don't need a queue). The total number of tasks ("100" above) isn't prohibitively huge, e.g. I don't mind submitting them all at once and letting them be queued until workers are available.
I thought that multiprocessing.Pool would be a good fit for this, but I can't seem to find a "get next result" method that I can call iteratively.
Is this something I'm going to have to roll myself from process management primitives? Or can Pool (or another thing I'm missing) support this workflow?
For context, I'm using each worker to invoke a remote process that could take a few minutes, and that has capacity to handle N jobs simultaneously ("4" in my concretized example above).
I came up with the following pattern (shown using 2 workers & 6 jobs, instead of 4 & 100):
import random
import time
from multiprocessing import Pool, TimeoutError
from queue import Queue

def worker(x):
    print("Start: {}".format(x))
    time.sleep(5 * random.random())  # Sleep a random amount of time
    if x == 2:
        raise Exception("Two is bad")
    return x

if __name__ == '__main__':
    with Pool(processes=2) as pool:
        jobs = Queue()
        for i in range(6):
            jobs.put(pool.apply_async(worker, [i]))
        while not jobs.empty():
            j = jobs.get(timeout=1)
            try:
                r = j.get(timeout=0.1)
                print("Done: {}".format(r))
            except TimeoutError as e:
                jobs.put(j)  # Not ready, try again later
            except Exception as e:
                print("Exception: {}".format(e))
Seems to work pretty well:
Start: 0
Start: 1
Start: 2
Done: 1
Start: 3
Exception: Two is bad
Start: 4
Start: 5
Done: 3
Done: 4
Done: 5
Done: 0
I'll see whether I can make a general utility to manage the queueing for me.
The main shortcoming I think it has is that completed jobs can go unnoticed for a while, while uncompleted jobs are polled and possibly time out. Avoiding that would probably require using callbacks - if it becomes a big enough problem, I'll probably add that to my app.
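For what it's worth, a rough sketch of that callback route (same worker as above; apply_async accepts callback and error_callback in Python 3) might look like:

import random
import time
from multiprocessing import Pool

def worker(x):
    print("Start: {}".format(x))
    time.sleep(5 * random.random())  # Sleep a random amount of time
    if x == 2:
        raise Exception("Two is bad")
    return x

def on_done(result):
    print("Done: {}".format(result))

def on_error(exc):
    print("Exception: {}".format(exc))

if __name__ == '__main__':
    with Pool(processes=2) as pool:
        for i in range(6):
            pool.apply_async(worker, [i], callback=on_done, error_callback=on_error)
        pool.close()  # no new tasks
        pool.join()   # callbacks fire in the main process as each job finishes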

Process.join() and queue don't work with large numbers [duplicate]

This question already has an answer here:
Script using multiprocessing module does not terminate
(1 answer)
Closed 7 years ago.
I am trying to split a for loop, i.e.
N = 1000000
for i in xrange(N):
    # do something
using multiprocessing.Process and it works well for small values of N.
Problems arise when I use bigger values of N. Something strange happens before or during p.join() and the program doesn't respond. If I put print i instead of q.put(i) in the definition of the function f, everything works well.
I would appreciate any help. Here is the code.
from multiprocessing import Process, Queue

def f(q, nMin, nMax):  # function for multiprocessing
    for i in xrange(nMin, nMax):
        q.put(i)

if __name__ == '__main__':

    nEntries = 1000000
    nCpu = 10
    nEventsPerCpu = nEntries/nCpu

    processes = []
    q = Queue()
    for i in xrange(nCpu):
        processes.append(Process(target=f, args=(q, i*nEventsPerCpu, (i+1)*nEventsPerCpu)))

    for p in processes:
        p.start()
    for p in processes:
        p.join()

    print q.qsize()
You are trying to grow your queue without bounds, and you are joining a subprocess that is waiting for room in the queue, so your main process is stalled waiting for that one to complete, and it never will.
If you pull data out of the queue before the join it will work fine.
One technique you could use is something like this:
while 1:
    running = any(p.is_alive() for p in processes)
    while not queue.empty():
        process_queue_data()
    if not running:
        break
According to the documentation, p.is_alive() should perform an implicit join, but it also appears to imply that the best practice might be to explicitly join all the processes after this (see the short snippet at the end of this answer).
Edit: Although that is pretty clear, it may not be all that performant. How you make it perform better will be highly task and machine specific (and in general, you shouldn't be creating that many processes at a time, anyway, unless some are going to be blocked on I/O).
Besides reducing the number of processes to the number of CPUs, some easy fixes to make it a bit faster (again, depending on circumstances) might look like this:
liveprocs = list(processes)
while liveprocs:
    try:
        while 1:
            process_queue_data(q.get(False))
    except Queue.Empty:
        pass

    time.sleep(0.5)  # Give tasks a chance to put more data in
    if not q.empty():
        continue
    liveprocs = [p for p in liveprocs if p.is_alive()]
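And if you want to finish with the explicit joins mentioned above, once the queue has been drained and the loop has exited it could be as simple as:

for p in processes:
    p.join()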

Timeout for each thread in ThreadPool in python

I am using Python 2.7.
I am currently using ThreadPoolExecutor like this:
params = [1,2,3,4,5,6,7,8,9,10]
with concurrent.futures.ThreadPoolExecutor(5) as executor:
    result = list(executor.map(f, params))
The problem is that f sometimes runs for too long. Whenever I run f, I want to limit its run to 100 seconds, and then kill it.
Eventually, for each element x in param, I would like to have an indication of whether or not f had to be killed, and in case it wasn't - what was the return value.
Even if f times out for one parameter, I still want to run it with the next parameters.
The executor.map method does have a timeout parameter, but it sets a timeout for the entire run, from the time of the call to executor.map, and not for each thread separately.
What is the easiest way to get my desired behavior?
This answer is in terms of python's multiprocessing library, which is usually preferable to the threading library, unless your functions are just waiting on network calls. Note that the multiprocessing and threading libraries have the same interface.
Given your processes run for potentially 100 seconds each, the overhead of creating a process for each one is fairly small in comparison. You probably have to make your own processes to get the necessary control.
One option is to wrap f in another function that will execute for at most 100 seconds:
from multiprocessing import Pool

def timeout_f(arg):
    pool = Pool(processes=1)
    return pool.apply_async(f, [arg]).get(timeout=100)
Then your code changes to:
result = list(executor.map(timeout_f, params))
Alternatively, you could write your own thread/process control:
from multiprocessing import Process
from time import time

def chunks(l, n):
    """ Yield successive n-sized chunks from l. """
    for i in xrange(0, len(l), n):
        yield l[i:i+n]

processes = [Process(target=f, args=(i,)) for i in params]
exit_codes = []

for five_processes in chunks(processes, 5):
    for p in five_processes:
        p.start()

    time_waited = 0
    start = time()
    for p in five_processes:
        if time_waited >= 100:
            p.join(0)
            p.terminate()
        p.join(100 - time_waited)
        p.terminate()
        time_waited = time() - start

    for p in five_processes:
        exit_codes.append(p.exitcode)
You'd have to get the return values through something like Can I get a return value from multiprocessing.Process?
The exit codes of the processes are 0 if the processes completed and non-zero if they were terminated.
Techniques from:
Join a group of python processes with a timeout, How do you split a list into evenly sized chunks?
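For completeness, the usual approach from that linked question is to have each process put its result on a multiprocessing.Queue. A small sketch, assuming the same f, params, and 100-second budget as the rest of this answer:

from multiprocessing import Process, Queue

def f_wrapper(q, x):
    q.put((x, f(x)))  # tag each result with its input

result_q = Queue()
processes = [Process(target=f_wrapper, args=(result_q, x)) for x in params]
for p in processes:
    p.start()
for p in processes:
    p.join(100)          # same 100-second budget; a terminated process contributes nothing
    if p.is_alive():
        p.terminate()

results = {}
while not result_q.empty():  # fine for small result counts; drain earlier if results are large
    x, val = result_q.get()
    results[x] = val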
As another option, you could just try to use apply_async on multiprocessing.Pool
from multiprocessing import Pool, TimeoutError
from time import sleep

if __name__ == "__main__":
    pool = Pool(processes=5)
    processes = [pool.apply_async(f, [i]) for i in params]
    results = []
    for process in processes:
        try:
            results.append(process.get(timeout=100))
        except TimeoutError as e:
            results.append(e)
Note that the above possibly waits more than 100 seconds for each process, as if the first one takes 50 seconds to complete, the second process will have had 50 extra seconds in its run time. More complicated logic (such as the previous example) is needed to enforce stricter timeouts.
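For example, if the pool is at least as large as the batch (so every task starts at roughly the same time), one rough way to cap the whole batch at 100 seconds of wall time is a shared deadline; this is only a sketch, not a strict per-task guarantee:

from multiprocessing import Pool, TimeoutError
from time import time

if __name__ == "__main__":
    pool = Pool(processes=len(params))  # large enough that every task starts immediately
    async_results = [pool.apply_async(f, [i]) for i in params]
    deadline = time() + 100             # one wall-clock budget shared by the batch
    results = []
    for ar in async_results:
        try:
            results.append(ar.get(timeout=max(0, deadline - time())))
        except TimeoutError as e:
            results.append(e)
    pool.terminate()                    # kill anything still running past the deadline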
