Python multiprocessing: retrieve next result

I'm trying to figure out a good way to use the multiprocessing package in Python 3.6 to run a set of around 100 tasks, with a maximum of 4 of them running simultaneously. I also want to:
repeatedly reap the next completed task from the pool and process its return value, until all tasks have either succeeded or failed;
make exceptions thrown in any given task non-fatal, so I can still access the results from the other tasks.
I don't need to maintain the order of tasks submitted to the pool (i.e. I don't need a queue). The total number of tasks ("100" above) isn't prohibitively huge, e.g. I don't mind submitting them all at once and letting them be queued until workers are available.
I thought that multiprocessing.Pool would be a good fit for this, but I can't seem to find a "get next result" method that I can call iteratively.
Is this something I'm going to have to roll myself from process management primitives? Or can Pool (or another thing I'm missing) support this workflow?
For context, I'm using each worker to invoke a remote process that could take a few minutes, and that has capacity to handle N jobs simultaneously ("4" in my concretized example above).

I came up with the following pattern (shown using 2 workers & 6 jobs, instead of 4 & 100):
import random
import time
from multiprocessing import Pool, TimeoutError
from queue import Queue

def worker(x):
    print("Start: {}".format(x))
    time.sleep(5 * random.random())  # Sleep a random amount of time
    if x == 2:
        raise Exception("Two is bad")
    return x

if __name__ == '__main__':
    with Pool(processes=2) as pool:
        jobs = Queue()
        for i in range(6):
            jobs.put(pool.apply_async(worker, [i]))
        while not jobs.empty():
            j = jobs.get(timeout=1)
            try:
                r = j.get(timeout=0.1)
                print("Done: {}".format(r))
            except TimeoutError as e:
                jobs.put(j)  # Not ready, try again later
            except Exception as e:
                print("Exception: {}".format(e))
Seems to work pretty well:
Start: 0
Start: 1
Start: 2
Done: 1
Start: 3
Exception: Two is bad
Start: 4
Start: 5
Done: 3
Done: 4
Done: 5
Done: 0
I'll see whether I can make a general utility to manage the queueing for me.
The main shortcoming I think it has is that completed jobs can go unnoticed for a while, while uncompleted jobs are polled and possibly time out. Avoiding that would probably require using callbacks - if it becomes a big enough problem, I'll probably add that to my app.
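For reference, a callback-based variant might look something like the sketch below. It is not the code from my app, just an illustration: the callbacks run in the pool's result-handler thread inside the parent process and push each outcome onto a queue as soon as the pool reports it, so nothing is polled.
import random
import time
from multiprocessing import Pool
from queue import Queue

def worker(x):
    time.sleep(5 * random.random())
    if x == 2:
        raise Exception("Two is bad")
    return x

if __name__ == '__main__':
    results = Queue()  # lives in the parent process; filled by the pool's callbacks
    with Pool(processes=2) as pool:
        for i in range(6):
            pool.apply_async(worker, [i],
                             callback=lambda r: results.put(("Done", r)),
                             error_callback=lambda e: results.put(("Exception", e)))
        for _ in range(6):
            status, value = results.get()  # blocks until the next job finishes
            print("{}: {}".format(status, value))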

Related

Periodically restart Python multiprocessing pool

I have a Python multiprocessing pool doing a very long job that, even after thorough debugging, is not robust enough to avoid failing every 24 hours or so, because it depends on many third-party, non-Python tools with complex interactions. Also, the underlying machine has certain problems that I cannot control. Note that by failing I don't mean the whole program crashing, but some or most of the processes becoming idle because of some errors, and the app itself either hanging or continuing the job with only the processes that haven't failed.
My solution right now is to periodically kill the job, manually, and then just restart from where it was.
Even if it's not ideal, what I want to do now is the following: restart the multiprocessing pool periodically and programmatically, from the Python code itself. I don't really care if this implies killing the pool workers in the middle of their job. Which would be the best way to do that?
My code looks like:
with Pool() as p:
    for _ in p.imap_unordered(function, data):
        save_checkpoint()
        log()
What I have in mind would be something like:
start = 0
end = 1000  # magic number
while start + 1 < len(data):
    current_data = data[start:end]
    with Pool() as p:
        for _ in p.imap_unordered(function, current_data):
            save_checkpoint()
            log()
    start += 1
    end += 1
Or:
start = 0
end = 1000  # magic number
while start + 1 < len(data):
    current_data = data[start:end]
    start_timeout(time=TIMEOUT)  # which would be the best way to do that without breaking multiprocessing?
    try:
        with Pool() as p:
            for _ in p.imap_unordered(function, current_data):
                save_checkpoint()
                log()
        start += 1
        end += 1
    except Timeout:
        pass
Or any suggestion you think would be better. Any help would be much appreciated, thanks!
The problem with your current code is that it iterates the multiprocessed results directly, and that call will block. Fortunately there's an easy solution: use apply_async exactly as suggested in the docs. But because of how you describe the use-case here and the failure, I've adapted it somewhat. Firstly, a mock task:
from multiprocessing import Pool, TimeoutError, cpu_count
from time import sleep
from random import randint

def log():
    print("logging is a dangerous activity: wear a hard hat.")

def work(d):
    sleep(randint(1, 100) / 100)
    print("finished working")
    if randint(1, 10) == 1:
        print("blocking...")
        while True:
            sleep(0.1)
    return d
This work function will fail with a probability of 0.1, blocking indefinitely. We create the tasks:
data = list(range(100))
nproc = cpu_count()
And then generate futures for all of them:
while data:
    print(f"== Processing {len(data)} items. ==")
    with Pool(nproc) as p:
        tasks = [p.apply_async(work, (d,)) for d in data]
Then we can try to get the tasks out manually:
failed = []  # tasks that timed out; initialised before the loop
for task in tasks:
    try:
        res = task.get(timeout=1)
        data.remove(res)
        log()
    except TimeoutError:
        failed.append(task)
        if len(failed) < nproc:
            print(
                f"{len(failed)} processes are blocked,"
                f" but {nproc - len(failed)} remain."
            )
        else:
            break
The controlling timeout here is the timeout to .get. It should be as long as you expect the longest process to take. Note that we detect when the whole pool is tied up and give up.
But since in the scenario you describe some threads are going to take longer than others, we can give 'failed' processes some time to recover. Thus every time a task fails we quickly check if the others have in fact succeeded:
for task in list(failed):  # iterate over a copy so tasks can be removed safely
    try:
        res = task.get(timeout=0.01)
        data.remove(res)
        failed.remove(task)
        log()
    except TimeoutError:
        continue
Whether this is a good addition in your case depends on whether your tasks really are as flaky as I'm guessing they are.
Exiting the context manager for the pool will terminate the pool, so we don't even need to handle that ourselves. If you have significant variation you might want to increase the pool size (thus increasing the number of tasks which are allowed to stall) or allow tasks a grace period before considering them 'failed'.
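Pieced together, the fragments above nest roughly like this (a sketch; work and log are the mock functions defined earlier, and failed is initialised once per pass):
from multiprocessing import Pool, TimeoutError, cpu_count

# work() and log() are the functions defined above
data = list(range(100))
nproc = cpu_count()

while data:
    print(f"== Processing {len(data)} items. ==")
    with Pool(nproc) as p:
        tasks = [p.apply_async(work, (d,)) for d in data]
        failed = []
        for task in tasks:
            try:
                res = task.get(timeout=1)
                data.remove(res)
                log()
            except TimeoutError:
                failed.append(task)
                if len(failed) >= nproc:
                    break  # every worker is stuck; give up on this pass
        for task in list(failed):
            try:
                res = task.get(timeout=0.01)
                data.remove(res)
                failed.remove(task)
                log()
            except TimeoutError:
                continue
    # leaving the with-block terminates the pool, killing any stuck workers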

How to wait for thread execution to complete before starting new thread?

I have some Python code in which I can run a maximum of 10 threads at a time due to GPU and compute limitations. I have 100 folders that I want to process, and I want each thread to process one folder. Here is some sample code that I have written to achieve this.
import random
import threading
import time

def random_wait(thread_id):
    # print('Inside wait')
    rand_number = random.randint(3, 9)
    # print(f'Random number : {rand_number}')
    print(f'Thread {thread_id} waiting for {rand_number} seconds')
    time.sleep(rand_number)
    print(f'Thread {thread_id} completed execution')

if __name__ == '__main__':
    total_runs = 6
    thread_limit = 3
    running_threads = list()
    for i in range(total_runs):
        print(f'Active threads : {threading.active_count()}')
        if threading.active_count() > thread_limit:
            print(f'Active thread count exceeded')
            # check whether an existing thread has finished and remove it
            for running_thread in running_threads:
                if not running_thread.is_alive():
                    # Remove thread
                    running_threads.remove(running_thread)
                    print(f'Removing thread: {running_thread}')
        else:
            thread = threading.Thread(target=random_wait, args=(i,), kwargs={})
            running_threads.append(thread)
            print(f'Starting thread : {i}')
            thread.start()
In this code, I am checking if the number of active threads exceeds the thread limit that I have specified, and the process refrains from creating new threads unless there's space for one more thread to be executed.
I am able to keep the process from starting new threads. However, I lose the threads that I wanted to start, and the code just ends up starting and stopping the first three threads. How can I start a new thread/process as soon as there's space for one more? Is there a better way in which I just start 10 threads, but as soon as one thread completes, I assign it to start processing another folder?
You should use a ThreadPoolExecutor from the standard library module concurrent.futures; it automatically manages a fixed number of threads. If you need to execute the same function with different arguments in parallel (as in a parallel for-loop), you can use the .map() method:
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(10) as e:
    results = e.map(work, (arg_1, arg_2, ..., arg_n))
If you need to schedule different work in parallel you should use the .submit() method:
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(10) as e:
    future_1 = e.submit(work_1, arg_1)
    future_2 = e.submit(work_2, arg_2)
    result_1 = future_1.result()
    result_2 = future_2.result()
In the second case, .submit() returns a Future object which encapsulates the asynchronous execution of the work. You should store that future and get the result when needed. Note that the context manager (with statement) ensures that the .shutdown() method is called before leaving it, so all work is done by that point.
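Applied to the scenario in the question (10 worker threads, 100 folders), a minimal sketch could look like the following; process_folder and folders are placeholder names, not taken from the question's code:
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_folder(path):
    ...  # whatever work one folder needs
    return path

folders = [f"folder_{i}" for i in range(100)]

with ThreadPoolExecutor(max_workers=10) as e:
    futures = [e.submit(process_folder, f) for f in folders]
    for fut in as_completed(futures):  # a new folder starts as soon as a thread frees up
        print(f"finished {fut.result()}")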

What are the semantics of Python multiprocessing join with timeout? Why does it hang?

Newbie here with Python and sympy. I have a simple question.
If I start a worker using multiprocessing and give the join call a small timeout, say 3 seconds, what happens if the worker is busy, perhaps waiting for something to complete? What happens when the timeout expires while the worker itself is still waiting on sympy?
I found that if the worker calls sympy on a long computation, the join hangs waiting for the worker, much longer than the timeout period.
Here is a MWE
from multiprocessing import Process, Queue
from sympy import *

x = symbols('x')

def worker(integrand, que):
    que.put(integrate(integrand, x))

if __name__ == '__main__':
    que = Queue()
    result = "still waiting "
    # this first call works OK since it is an easy integral
    #action_process = Process(target=worker, args=(cos(x), que))
    # this one hangs inside sympy since hard integral
    p = Process(target=worker, args=(1/(x**3+cos(x)), que))
    # start the process and wait for max of 3 seconds.
    p.start()
    p.join(timeout=3)
    result = que.get()
    print("result from worker is " + str(result))
    p.terminate()
Question: Is it possible to force the worker to terminate at the timeout? Is there another option to use to do that? Am I doing something wrong in the above?
If it's not possible to force a timeout at join, then what is the meaning of join() with a timeout of n seconds, if the worker can't join?
The only reason I am trying the above is to find a way to time out a long computation in sympy. I am on Windows, where timers and alarms do not work, so I thought to try multiprocessing with a timeout, but it does not seem to do what I want, and hence it will not be of any use for what I want.
P.S. If you run the above, the worker will hang inside sympy for a long time, maybe 10 minutes or more, and you might have to kill it manually.
The above is on Windows, but I also tried it on Linux and it also hangs.
Anaconda 4.3.1, Python 3.6
Update
Thanks to the hint in the answer, the following seems to work now
from multiprocessing import Process, Queue
from sympy import *

x = symbols('x')

def worker(integrand, que):
    que.put(integrate(integrand, x))

if __name__ == '__main__':
    que = Queue()
    result = "timed out "
    # this one hangs inside sympy since hard integral
    p = Process(target=worker, args=(1/(x**3+cos(x)), que))
    # start the process and wait for max of 3 seconds.
    p.start()
    p.join(timeout=3)
    try:
        result = que.get(block=False)
    except:
        print("timed out on que.get()")
        pass
    print("result from worker is " + str(result))
    p.terminate()
I still need to test it more to make sure the worker process does get killed by terminate(), as I do not want to have zombie workers around.
The reason your code never reaches the print in time is that it doesn't get beyond result=que.get(). You specified a timeout for p.join, which makes the calling process resume after the specified timeout. However, que.get is also blocking and therefore waits until an item has been put in the queue. You can specify a timeout here as well: que.get(timeout=...), or simply turn off blocking: que.get(block=False). In either case, if no item is available in the queue, it will raise a queue.Empty exception.
However the documentation contains several warnings about terminating a process that holds a (shared) reference to a queue. The Programming Guidelines seem to contain a solution for that case and also warn of possible deadlocks (see "Joining processes that use queues").
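Putting that together, a minimal sketch of the pattern (with a placeholder worker standing in for the long sympy computation) could be:
from multiprocessing import Process, Queue
import queue
import time

def worker(que):
    time.sleep(60)  # stand-in for the long sympy computation
    que.put("done")

if __name__ == '__main__':
    que = Queue()
    p = Process(target=worker, args=(que,))
    p.start()
    try:
        result = que.get(timeout=3)  # raises queue.Empty after 3 seconds
    except queue.Empty:
        result = "timed out"
    p.terminate()  # kill the worker if it is still running
    p.join()       # reap it so it does not linger as a zombie
    print("result from worker is", result)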

Why is a ThreadPoolExecutor with one worker still faster than normal execution?

I'm using this library, Tomorrow, that in turn uses the ThreadPoolExecutor from the standard library, in order to allow for Async function calls.
Calling the decorator @tomorrow.threads(1) spins up a ThreadPoolExecutor with 1 worker.
Question
Why is it faster to execute a function using 1 thread worker over just calling it as is (e.g. normally)?
Why is it slower to execute the same code with 10 thread workers in place of just 1, or even None?
Demo code
imports excluded
def openSync(path: str):
    for row in open(path):
        for _ in row:
            pass

@tomorrow.threads(1)
def openAsync1(path: str):
    openSync(path)

@tomorrow.threads(10)
def openAsync10(path: str):
    openSync(path)

def openAll(paths: list):
    def do(func: callable) -> float:
        t = time.time()
        [func(p) for p in paths]
        t = time.time() - t
        return t
    print(do(openSync))
    print(do(openAsync1))
    print(do(openAsync10))

openAll(glob.glob("data/*"))
Note: The data folder contains 18 files, each 700 lines of random text.
Output
0 workers: 0.0120 seconds
1 worker: 0.0009 seconds
10 workers: 0.0535 seconds
What I've tested
I've run the code more than a couple dozen times, with different programs running in the background (I ran a bunch yesterday, and a couple today). The numbers change, of course, but the order is always the same (i.e. 1 is fastest, then 0, then 10).
I've also tried changing the order of execution (e.g. moving the do calls around) in order to eliminate caching as a factor, but still the same.
Turns out that executing in the order 10, 1, None results in a different order (1 is fastest, then 10, then 0) compared to every other permutation. The result shows that whatever do call is executed last, is considerably slower than it would have been had it been executed first or in the middle instead.
Results (after receiving solution from @Dunes)
0 workers: 0.0122 seconds
1 worker: 0.0214 seconds
10 workers: 0.0296 seconds
When you call one of your async functions it returns a "futures" object (an instance of tomorrow.Tomorrow in this case). This allows you to submit all your jobs without having to wait for them to finish. However, you never actually wait for the jobs to finish. So all do(openAsync1) does is time how long it takes to set up all the jobs (which should be very fast). For a more accurate test you need to do something like:
def openAll(paths: list):
    def do(func: callable) -> float:
        t = time.time()
        # do all jobs if openSync, else start all jobs if openAsync
        results = [func(p) for p in paths]
        # if openAsync, the following waits until all jobs are finished
        if func is not openSync:
            for r in results:
                r._wait()
        t = time.time() - t
        return t
    print(do(openSync))
    print(do(openAsync1))
    print(do(openAsync10))

openAll(glob.glob("data/*"))
Using additional threads in Python generally slows things down. This is because of the global interpreter lock, which means only one thread can execute Python bytecode at a time, regardless of the number of cores the CPU has.
However, things are complicated by the fact that your job is IO bound. More worker threads might speed things up. This is because a single thread might spend more time waiting for the hard drive to respond than is lost between context switching between the various threads in the multi-threaded variant.
Side note: even though neither openAsync1 nor openAsync10 waits for jobs to complete, do(openAsync10) is probably slower because it requires more synchronisation between threads when submitting a new job.
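For comparison, the same measurement can be written directly against concurrent.futures, which makes the wait for completion explicit via Future.result(); this is just a sketch, not the Tomorrow library's API:
import glob
import time
from concurrent.futures import ThreadPoolExecutor

def openSync(path: str):
    for row in open(path):
        for _ in row:
            pass

def timed(n_workers: int, paths: list) -> float:
    t = time.time()
    if n_workers == 0:
        for p in paths:
            openSync(p)
    else:
        with ThreadPoolExecutor(n_workers) as e:
            futures = [e.submit(openSync, p) for p in paths]
            for f in futures:
                f.result()  # block until this job has finished
    return time.time() - t

paths = glob.glob("data/*")
for n in (0, 1, 10):
    print(f"{n} workers: {timed(n, paths):.4f} seconds")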

Troubleshooting data inconsistencies with Python multiprocessing/threading

TL;DR: Getting different results after running code with threading, with multiprocessing, and single-threaded. Need guidance on troubleshooting.
Hello, I apologize in advance if this may be a bit too generic, but I need a bit of help troubleshooting an issue and I am not sure how best to proceed.
Here is the story: I have a bunch of data indexed in a Solr collection (~250M items), and every item in that collection has a session id. Some items can share the same session id. I am combing through the collection to extract all items that have the same session, massaging the data a bit, and spitting out another JSON file for indexing later.
The code has two main functions:
proc_day - accepts a day and processes all the sessions for that day
and
proc_session - does everything that needs to happen for a single session.
Multiprocessing is implemented on proc_day, so each day is processed by a separate process; the proc_session function can be run with threads. Below is the code I am using for threading/multiprocessing. It accepts a function, a list of arguments, and the number of threads/processes. It will then create a queue based on the input args, then create processes/threads and let them work through it. I am not posting the actual processing code, since it generally runs fine single-threaded without any issues, but I can post it if needed.
autoprocs.py
import sys
import logging
from multiprocessing import Process, Queue, JoinableQueue
import time
import multiprocessing
import os

def proc_proc(func, data, threads, delay=10):
    if threads < 0:
        return
    q = JoinableQueue()
    procs = []
    for i in range(threads):
        thread = Process(target=proc_exec, args=(func, q))
        thread.daemon = True
        thread.start()
        procs.append(thread)
    for item in data:
        q.put(item)
    logging.debug(str(os.getpid()) + ' *** Processes started and data loaded into queue waiting')
    s = q.qsize()
    while s > 0:
        logging.info(str(os.getpid()) + " - Proc Queue Size is:" + str(s))
        s = q.qsize()
        time.sleep(delay)
    for p in procs:
        logging.debug(str(os.getpid()) + " - Joining Process {}".format(p))
        p.join(1)
    logging.debug(str(os.getpid()) + ' - *** Main Proc waiting')
    q.join()
    logging.debug(str(os.getpid()) + ' - *** Done')

def proc_exec(func, q):
    p = multiprocessing.current_process()
    logging.debug(str(os.getpid()) + ' - Starting:{},{}'.format(p.name, p.pid))
    while True:
        d = q.get()
        try:
            logging.debug(str(os.getpid()) + " - Starting to Process {}".format(d))
            func(d)
            sys.stdout.flush()
            logging.debug(str(os.getpid()) + " - Marking Task as Done")
            q.task_done()
        except:
            logging.error(str(os.getpid()) + " - Exception in subprocess execution")
            logging.error(sys.exc_info()[0])
    logging.debug(str(os.getpid()) + 'Ending:{},{}'.format(p.name, p.pid))
autothreads.py:
import threading
import logging
import time
from queue import Queue

def thread_proc(func, data, threads):
    if threads < 0:
        return "Thread count not specified"
    q = Queue()
    for i in range(threads):
        thread = threading.Thread(target=thread_exec, args=(func, q))
        thread.daemon = True
        thread.start()
    for item in data:
        q.put(item)
    logging.debug('*** Main thread waiting')
    s = q.qsize()
    while s > 0:
        logging.debug("Queue Size is:" + str(s))
        s = q.qsize()
        time.sleep(1)
    logging.debug('*** Main thread waiting')
    q.join()
    logging.debug('*** Done')

def thread_exec(func, q):
    while True:
        d = q.get()
        #logging.debug("Working...")
        try:
            func(d)
        except:
            pass
        q.task_done()
I am running into problems validating the data after the code runs under different multiprocessing/threading configurations. There is a lot of data, so I really need to get multiprocessing working. Here are the results of my test yesterday.
Only with multiprocessing - 10 procs:
Days Processed 30
Sessions Found 3,507,475
Sessions Processed 3,514,496
Files 162,140
Data Output: 1.9G
multiprocessing and multithreading - 10 procs 10 threads
Days Processed 30
Sessions Found 3,356,362
Sessions Processed 3,272,402
Files 424,005
Data Output: 2.2GB
just threading - 10 threads
Days Processed 31
Sessions Found 3,595,263
Sessions Processed 3,595,263
Files 733,664
Data Output: 3.3GB
Single process/ no threading
Days Processed 31
Sessions Found 3,595,263
Sessions Processed 3,595,263
Files 162,190
Data Output: 1.9GB
These counts were gathered by grepping and counting entries in the log files (one per main process). The first thing that jumps out is that the days processed don't match. However, I manually checked the log files, and it looks like a log entry was missing; there are follow-on log entries indicating that the day was actually processed. I have no idea why it was omitted.
I really don't want to write more code to validate this code, it just seems like a terrible waste of time. Is there any alternative?
I gave some general hints in the comments above. I think there are multiple problems with your approach, at very different levels of abstraction. You are also not showing all code of relevance.
The issue might very well be:
1. in the method you are using to read from Solr, or in preparing the read data before feeding it to your workers;
2. in the architecture you have come up with for distributing the work among multiple processes;
3. in your logging infrastructure (as you have pointed out yourself);
4. in your analysis approach.
You have to go through all of these points, and given the complexity of the issue, surely nobody here will be able to identify the exact issues for you.
Regarding points (3) and (4):
If you are not sure about the completeness of your log files, you should perform the analysis based on the payload output of your processing engine. What I am trying to say: the log files probably are just a side product of your data processing. The primary product is the thing you should analyze. Of course it is also important to get your logs right. But these two problems should be treated independently.
My contribution regarding point (2) in the list above:
What is especially suspicious about your multiprocessing-based solution is your way to wait for the workers to finish. You seem not to be sure by which method you should wait for your workers, so you apply three different methods:
First, you are monitoring the size of the queue in a while loop, waiting for it to become 0. This is a non-canonical approach, which might actually work.
Secondly, you join() your processes in a weird way:
for p in procs:
    logging.debug(str(os.getpid()) + " - Joining Process {}".format(p))
    p.join(1)
Why do you define a timeout of one second here without responding to whether the process actually terminated within that time frame? You should either really join a process, i.e. wait until it has terminated, or specify a timeout and, if that timeout expires before the process finishes, treat that situation specially. Your code does not distinguish these situations, so p.join(1) is like writing time.sleep(1) instead.
Thirdly, you join the queue.
So, after making sure that q.qsize() returns 0 and after waiting for another second, do you really think that joining the queue is important? Does it make any difference? One of these approaches should be enough, and you need to think about which of these criteria is most important to your problem. That is, one of these conditions should deterministically imply the other two.
All this looks like a quick & dirty hack of a multiprocessing solution, whereas you yourself are not really sure how that solution should behave. One of the most important insights I have obtained while working on concurrency architectures: You, the architect, must be 100 % aware of how the communication and control flow works in your system. Not properly monitoring and controlling the state of your worker processes may very well be the source of the issues you are observing.
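For illustration, one canonical shutdown pattern for a JoinableQueue-based worker pool looks roughly like this (a sketch, not your code): block on q.join() so every task has been marked done, then send one sentinel per worker and join the processes without a timeout.
from multiprocessing import Process, JoinableQueue

def worker(q):
    while True:
        item = q.get()
        if item is None:  # sentinel: no more work
            q.task_done()
            break
        try:
            print("processing", item)
        finally:
            q.task_done()

if __name__ == '__main__':
    q = JoinableQueue()
    procs = [Process(target=worker, args=(q,)) for _ in range(4)]
    for p in procs:
        p.start()
    for item in range(20):
        q.put(item)
    q.join()          # blocks until every item put has been marked task_done()
    for _ in procs:
        q.put(None)   # one sentinel per worker
    q.join()
    for p in procs:
        p.join()      # no timeout: wait until the workers really exit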
I figured it out. I followed Jan-Philip's advice and started examining the output data of the multiprocess/multithreaded run. It turned out that an object that does all these things with the data from Solr was shared among threads. I did not have any locking mechanisms, so in some cases it mixed data from multiple sessions, which caused inconsistent output. I validated this by instantiating a new object for every thread, and the counts matched up. It is a bit slower, but still workable.
Thanks
