Creating a timeout function in Python with multiprocessing

I'm trying to create a timeout function in Python 2.7.11 (on Windows) with the multiprocessing library.
My basic goal is to return one value if the function times out and the actual value if it doesn't timeout.
My approach is the following:
from multiprocessing import Process, Manager

def timeoutFunction(puzzleFileName, timeLimit):
    manager = Manager()
    returnVal = manager.list()

    # Create worker function
    def solveProblem(return_val):
        return_val[:] = doSomeWork(puzzleFileName)  # doSomeWork() returns list

    p = Process(target=solveProblem, args=[returnVal])
    p.start()
    p.join(timeLimit)
    if p.is_alive():
        p.terminate()
        returnVal = ['Timeout']
    return returnVal
And I call the function like this:
if __name__ == '__main__':
    print timeoutFunction('example.txt', 600)
Unfortunately this doesn't work, and I receive an EOF error from pickle.py.
Can anyone see what I'm doing wrong?
Thanks in advance,
Alexander
Edit: doSomeWork() is not an actual function. Just a filler for some other work I do. That work is not done in parallel and does not use any shared variables. I'm only trying to run a single function and have it possibly timeout.

You can use the Pebble library for this.
from pebble import concurrent
from concurrent.futures import TimeoutError

TIMEOUT_IN_SECONDS = 10

@concurrent.process(timeout=TIMEOUT_IN_SECONDS)
def function(foo, bar=0):
    return foo + bar

future = function(1, bar=2)

try:
    result = future.result()  # blocks until results are ready or timeout
except TimeoutError as error:
    print "Function took longer than %d seconds" % error.args[1]
    result = 'timeout'
The documentation has more complete examples.
The library will terminate the function if it times out, so you don't need to worry about IO or CPU being wasted.
EDIT:
If you're doing an assignment, you can still look at its implementation.
Short example:
from multiprocessing import Pipe, Process

def worker(pipe, function, args, kwargs):
    try:
        results = function(*args, **kwargs)
    except Exception as error:
        results = error
    pipe.send(results)

reader, writer = Pipe(duplex=False)
process = Process(target=worker, args=(writer, function, args, kwargs))
process.start()

if reader.poll(5):  # a result arrived within the time limit
    results = reader.recv()
else:
    process.terminate()
    process.join()
    results = 'timeout'
Pebble provides a neat API, takes care of corner cases and uses more robust mechanisms. Yet this is more or less what it does under the hood.

The problem seems to have been that the function solveProblem was defined inside my outer function. Python doesn't seem to like that. Once I moved it outside it worked fine.
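For reference, a minimal sketch of the corrected layout, with the worker defined at module level so Windows can pickle it (doSomeWork is still the placeholder from the question):

from multiprocessing import Process, Manager

def solveProblem(puzzleFileName, return_val):
    # lives at module level so multiprocessing can pickle it on Windows
    return_val[:] = doSomeWork(puzzleFileName)  # doSomeWork() returns a list

def timeoutFunction(puzzleFileName, timeLimit):
    manager = Manager()
    returnVal = manager.list()
    p = Process(target=solveProblem, args=(puzzleFileName, returnVal))
    p.start()
    p.join(timeLimit)
    if p.is_alive():
        p.terminate()
        return ['Timeout']
    return list(returnVal)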
I'll mark noxdafox's answer as the accepted answer, since implementing the Pebble solution led me to this fix.
Thanks all!

Related

When running two functions simultaneously how to return the first result and use it for further processes

So I have two webscrapers that collect data from two different sources. I am running them both simultaneously to collect a specific piece of data (e.g. covid numbers).
When one of the functions finds data I want to use that data without waiting for the other one to finish.
So far I have tried the multiprocessing Pool module and returning the results with get(), but by definition I have to wait for both get() calls to finish before I can continue with my code. My goal is to have the code as simple and as short as possible.
My webscraper functions can be run with arguments and return a result if found. It is also possible to modify them.
The code I have so far, which waits for both get() calls to finish:
from multiprocessing import Pool
from scraper1 import main_1
from scraper2 import main_2
from twitter import post_tweet

if __name__ == '__main__':
    with Pool(processes=2) as pool:
        r1 = pool.apply_async(main_1, ('www.website1.com', 'June'))
        r2 = pool.apply_async(main_2, ())
        data = r1.get()
        data2 = r2.get()
        post_tweet("New data is {}".format(data))
        post_tweet("New data is {}".format(data2))
From here I have seen that threading might be a better option, since web scraping involves a lot of waiting and only a little parsing, but I am not sure how I would implement this.
I think the solution is fairly easy but I have been searching and trying different things all day without much success so I think I will just ask here. (I only started programming 2 months ago)
As always, there are many ways to accomplish this task.
You have already mentioned using a Queue:
from multiprocessing import Process, Queue
from scraper1 import main_1
from scraper2 import main_2

def simple_worker(target, args, ret_q):
    # mp.Queue has its own mutex, so we don't need to worry about concurrent read/write
    ret_q.put(target(*args))

if __name__ == "__main__":
    q = Queue()
    p1 = Process(target=simple_worker, args=(main_1, ('www.website1.com', 'June'), q))
    p2 = Process(target=simple_worker, args=(main_2, ('www.website2.com', 'July'), q))
    p1.start()
    p2.start()

    first_result = q.get()
    do_stuff(first_result)

    # don't forget to get() the second result before you quit. It's not a good idea to
    # leave things in a Queue and just assume it will be properly cleaned up at exit.
    second_result = q.get()

    p1.join()
    p2.join()
You could also still use a Pool by using imap_unordered and just taking the first result:
from multiprocessing import Pool
from scraper1 import main_1
from scraper2 import main_2

def simple_worker2(args):
    target, arglist = args  # unpack args
    return target(*arglist)

if __name__ == "__main__":
    tasks = ((main_1, ('www.website1.com', 'June')),
             (main_2, ('www.website2.com', 'July')))

    # The Pool context manager handles worker cleanup (your target function may,
    # however, be interrupted at any point if the pool exits before a task is complete).
    with Pool() as p:
        for result in p.imap_unordered(simple_worker2, tasks, chunksize=1):
            do_stuff(result)
            break  # don't bother with further results
I've seen people use queues in such cases: create one and pass it to both parsers so that they put their results in the queue instead of returning them. Then do a blocking pop on the queue to retrieve the first available result.
I have seen that threading might be a better option
Almost true, but not quite. I'd say that asyncio and async-based libraries are a much better fit than both threading and multiprocessing when we're talking about code with a lot of blocking I/O. If it's applicable in your case, I'd recommend rewriting both your parsers in async.
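A rough sketch of the asyncio version (scrape_1 and scrape_2 are placeholder names for async rewrites of your two scrapers):

import asyncio

async def scrape_1():
    # placeholder for an async rewrite of main_1
    await asyncio.sleep(2)
    return 'data from scraper 1'

async def scrape_2():
    # placeholder for an async rewrite of main_2
    await asyncio.sleep(5)
    return 'data from scraper 2'

async def main():
    tasks = [asyncio.create_task(scrape_1()),
             asyncio.create_task(scrape_2())]
    # unblocks as soon as the first scraper produces a result
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    first_result = done.pop().result()
    print(first_result)  # use the first result right away
    await asyncio.gather(*pending)  # let the slower scraper finish

asyncio.run(main())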

How to call method from different class using multiprocess pool python

How do I call a method from a different class (in a different module) using a multiprocessing Pool in Python?
My aim is to start a process which keeps running until some task is provided, and once the task is completed it goes back to waiting mode.
Below is the code, which has three modules. The Reader class is my runtime task; I hand execution of its reader method to ProcessExecutor.
ProcessExecutor is the process pool; it keeps looping until some task is provided to it.
The main module initiates everything.
Module 1
class Reader(object):
    def __init__(self, message):
        self.message = message

    def reader(self):
        print self.message
Module 2
class ProcessExecutor():
    def run(self, queue):
        print 'Before while loop'
        while True:
            print 'Reached Run'
            try:
                pair = queue.get()
                print 'Running process'
                print pair
                func = pair.get('target')
                arguments = pair.get('args', None)
                if arguments is None:
                    func()
                else:
                    func(arguments)
                queue.task_done()
            except Exception:
                print Exception.message
main Module
from process_helper import ProcessExecutor
from reader import Reader
import multiprocessing
import Queue

if __name__ == '__main__':
    queue = Queue.Queue()
    myReader = Reader('Hi')
    ps = ProcessExecutor()
    pool = multiprocessing.Pool(2)
    pool.apply_async(ps.run, args=(queue, ))
    param = {'target': myReader.reader}
    queue.put(param)
The code executes without any error:
C:\Python27\python.exe C:/Users/PycharmProjects/untitled1/main/main.py
Process finished with exit code 0
However, it never reaches the run method. I am not sure whether it is even possible to call a method of a different class using multiprocessing.
I tried apply_async, map and apply, but none of them work.
All the examples I found online call the target method from the script where the main method is implemented.
I am using Python 2.7.
Please help.
Your first problem is that you just exit without waiting on anything. You have a Pool, a Queue, and an AsyncResult, but you just ignore all of them and exit as soon as you've created them. You should be able to get away with only waiting on the AsyncResult (after that, there's no more work to do, so who cares what you abandon), except for the fact that you're trying to use Queue.task_done, which doesn't make any sense without a Queue.join on the other side, so you need to wait on that as well.
Your second problem is that you're using the Queue from the Queue module, instead of the one from the multiprocessing module. The Queue module only works across threads in the same process.
Also, you can't call task_done on a plain Queue; that's only a method for the JoinableQueue subclass.
Once you've gotten to the point where the pool tries to actually run a task, you will get the problem that bound methods can't be pickled unless you write a pickler for them. Doing that is a pain, even though it's the right way. The traditional workaround—hacky and cheesy, but everyone did it, and it works—is to wrap each method you want to call in a top-level function. The modern solution is to use the third-party dill or cloudpickle libraries, which know how to pickle bound methods, and how to hook into multiprocessing. You should definitely look into them. But, to keep things simple, I'll show you the workaround.
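(Purely for reference, a tiny sketch of the dill route, assuming nothing beyond dill's dumps/loads API: bound methods round-trip without any wrapper.)

import dill
from reader import Reader

payload = dill.dumps(Reader('Hi').reader)  # a bound method, which the stdlib pickle in Python 2 can't handle
restored = dill.loads(payload)
restored()  # prints 'Hi'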
Notice that, because you've created an extra queue to pass methods onto, in addition to the one built into the pool, you'll need the workaround for both targets.
With these problems fixed, your code looks like this:
from process_helper import ProcessExecutor
from reader import Reader
import multiprocessing

def call_run(ps):
    ps.run(queue)

def call_reader(reader):
    return reader.reader()

if __name__ == '__main__':
    queue = multiprocessing.JoinableQueue()
    myReader = Reader('Hi')
    ps = ProcessExecutor()
    pool = multiprocessing.Pool(2)
    res = pool.apply_async(call_run, args=(ps,))
    param = {'target': call_reader, 'args': myReader}
    queue.put(param)
    print res.get()
    queue.join()
You have additional bugs beyond this in your ProcessExecutor, but I'm not going to debug everything for you. This gets you past the initial hurdles, and shows the answer to the specific question you were asking about. Also, I'm not sure what the point of all that code is. You seem to be trying to replace what Pool already does on top of Pool, only in a more complicated but less powerful way, but I'm not entirely sure.
Meanwhile, here's a program that does what I think you want, with no problems, by just throwing away that ProcessExecutor and everything that goes with it:
from reader import Reader
import multiprocessing

def call_reader(reader):
    return reader.reader()

if __name__ == '__main__':
    myReader = Reader('Hi')
    pool = multiprocessing.Pool(2)
    res = pool.apply_async(call_reader, args=(myReader,))
    print res.get()

Subprocess not thread safe, alternatives?

I'm using Python 2.7 and do not have the option of upgrading or back-porting subprocess32. I am using subprocess in a threaded environment, where it usually works fine; however, sometimes the subprocess creation does not return, so the thread hangs. Even strace gives no output when this happens, so I get no feedback.
E.g. this line can cause a hang (the data returned is small, so it is not a pipe issue):
process = subprocess.Popen(cmd,
                           stdout=subprocess.PIPE,
                           stderr=subprocess.STDOUT)
I have subsequently read that subprocess is not thread safe in python 2.7 and that "various issues" were fixed in the newest versions. I am using multiple threads calling subprocess.
I have demonstrated the problem with the following code (a minimal example, not my actual code), which starts numerous threads, each running a subprocess:
import os, time, threading, sys
from subprocess import Popen

i = 0

class Process:
    def __init__(self, args):
        self.args = args

    def run(self):
        global i
        retcode = -1
        try:
            self.process = Popen(self.args)
            i += 1
            if i == 10:
                sys.stdout.write("Complete\n")
            while self.process.poll() is None:
                time.sleep(1.0)
            retcode = self.process.returncode
        except:
            sys.stdout.write("ERROR\n")
        return retcode

def main():
    processes = [Process(["/bin/cat"]) for _ in range(10)]

    # start all processes
    for p in processes:
        t = threading.Thread(target=Process.run, args=(p,))
        t.daemon = True
        t.start()

    sys.stdout.write("all threads started\n")

    # wait for Ctrl+C
    while True:
        time.sleep(1.0)

main()
This will often result in one or more subprocess calls never returning. Does anybody have more information on this, or a solution/alternative?
I am thinking of using the deprecated commands.getoutput instead, but I do not know whether that is thread safe. It certainly seems to work correctly for the code above.
If the bulk of what your threads are doing is just waiting on subprocesses, you can accomplish this much more effectively with coroutines. With Python 2 you would implement this with generators, so the necessary changes to the run function (sketched below) are:
replace time.sleep(1.0) with yield to pass control to another routine
replace return retcode with self.retcode = retcode or similar, since generators can't return a value before Python 3.3
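A sketch of run with those two changes applied (the progress counter from the original is omitted for brevity):

    def run(self):
        retcode = -1
        try:
            self.process = Popen(self.args)
            while self.process.poll() is None:
                yield  # hand control back to the scheduler in main()
            retcode = self.process.returncode
        except:
            sys.stdout.write("ERROR\n")
        self.retcode = retcode  # generators can't return a value before Python 3.3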
Then the main function could be something like this:
def main():
    processes = [Process(["/bin/cat"]) for _ in range(10)]
    # since p.run() is a generator this doesn't run any of the code yet
    routines = [p.run() for p in processes]

    while routines:
        # iterate in reverse so we can remove routines while iterating without skipping any
        for routine in reversed(routines):
            try:
                next(routine)  # continue the routine to the next yield
            except StopIteration:
                # this routine has finished, we no longer need to check it
                routines.remove(routine)
This is intended to give you a place to start from, I'd recommend adding print statements around the yields or use pythontutor to better understand the order of execution.
This has the benefit of never having any threads waiting for anything, just one thread doing a section of processing at a time, which can be much more efficient than many idling threads.

How to Clip Rasters in Parallel?

I'm working through a multiprocessing example (An introduction to parallel programming). I modified the Pool class example to meet my specific needs: to clip a bunch of rasters with a study area polygon in parallel. On the plus side, the script finishes and prints "Processing complete." On the negative side, no output is generated. I suspect I have some procedural error in the pool.apply_async call. Why is this script producing no results?
import arcpy, os
import multiprocessing as mp

arcpy.env.workspace = r'F:\temp\inws'
outws_utm11 = r'F:\temp\outws'
clipper_utm11 = r'F:\temp\some_polygon.shp'
rasters = arcpy.ListRasters()

pool = mp.Pool(processes=4)

def clip_raster(clipper, outws, raster):
    arcpy.Clip_management(raster, "#", os.path.join(outws, raster), clipper,
                          nodata_value=0, clipping_geometry="ClippingGeometry")

[pool.apply_async(clip_raster, args=(clipper_utm11, outws_utm11, ras)) for ras in rasters]

print "Processing complete."
The apply_async function kicks off your function in a worker process, but does not block until the function completes. You're letting the main process complete and exit instead of waiting for the workers to finish. This causes them to be killed, which is likely happening before they can create your output.
Since you're just applying the same function to all of the items in the rasters list, you should consider using pool.map instead. It will accept both a function name and an iterable object as its arguments, and call the function on each of the items in the list. All of these function calls will occur in a worker process in the pool. One caveat of the pool.map function, though, is that the function object you pass it must accept only one argument: the item from the list. I see your clip_raster function uses a couple of other arguments, so in my example below I'm using functools.partial to create a new version of clip_raster that always includes the first two arguments. This new function, with clipper_utm11 and outws_utm11 bound to it, can now be used with pool.map.
import arcpy, os
import functools
import multiprocessing as mp

arcpy.env.workspace = r'F:\temp\inws'
outws_utm11 = r'F:\temp\outws'
clipper_utm11 = r'F:\temp\some_polygon.shp'
rasters = arcpy.ListRasters()

pool = mp.Pool(processes=4)

def clip_raster(clipper, outws, raster):
    arcpy.Clip_management(raster, "#", os.path.join(outws, raster), clipper,
                          nodata_value=0, clipping_geometry="ClippingGeometry")

bound_clip_raster = functools.partial(clip_raster, clipper_utm11, outws_utm11)

results = pool.map(bound_clip_raster, rasters)

print "Processing complete."
This code will call the bound_clip_raster function once for each of the items in your rasters list, including clipper_utm11 and outws_utm11. All of the results will be available in a list called results, and the call to pool.map is blocking, so the main process will wait until all the workers are done before it exits.
If, for some strange reason, you're intent on using apply_async, then you'll need to add some code to the end of your script to use the AsyncResult object's associated methods to block the main process until they can complete, such as wait(), or poll for completion in a loop by calling ready(). But you should really use pool.map for this use case. This is what it's made for.
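For example, a minimal sketch of that approach, reusing the clip_raster function and variables from your script:

results = [pool.apply_async(clip_raster, args=(clipper_utm11, outws_utm11, ras))
           for ras in rasters]
for res in results:
    res.wait()  # block until this worker call has finished (or poll res.ready() in a loop)
print "Processing complete."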
I have answered a question that may be useful to you too. Take a look here: question
There are a few good practices here, like putting everything inside a function as you did, but a must-have is a main() function guarded by:
if __name__ == '__main__':
    main()
Another thing is the function that calls pool.apply_async, which I've inserted into your code.
I've also made a few other modifications so you can try it; I've tested it and it works for me:
import arcpy, os
from multiprocessing import Pool

arcpy.env.workspace = r'C:\Gis\lab_geo\2236622'
outws_utm11 = r'C:\Gis\lab_geo\2236622\outs'
clipper_utm11 = r'C:\Gis\arcpy_teste\teste.shp'
rasters = arcpy.ListRasters()

def clipRaster(clipper, outws, raster):
    arcpy.Clip_management(raster, "#", os.path.join(outws, raster), clipper, 0, "ClippingGeometry")

def clipRasterMulti(processList):
    pool = Pool(processes=4, maxtasksperchild=10)
    jobs = {}
    for item in processList:
        jobs[item[2]] = pool.apply_async(clipRaster, [x for x in item])  # key by raster name

    for item, result in jobs.items():
        try:
            result = result.get()
        except Exception as e:
            print(e)

    pool.close()
    pool.join()

def main():
    processList = [(clipper_utm11, outws_utm11, ras) for ras in rasters]
    clipRasterMulti(processList)
    print "Processing complete."

if __name__ == '__main__':
    main()

multiprocessing.Pool hangs if child causes a segmentation fault

I want to apply a function in parallel using multiprocessing.Pool.
The problem is that if one function call triggers a segmentation fault the Pool hangs forever.
Has anybody an idea how I can make a Pool that detects when something like this happens and raises an error?
The following example shows how to reproduce it (requires scikit-learn > 0.14)
import numpy as np
from sklearn.ensemble import gradient_boosting
import time

from multiprocessing import Pool

class Bad(object):
    tree_ = None

def fit_one(i):
    if i == 3:
        # this will segfault
        bad = np.array([[Bad()] * 2], dtype=np.object)
        gradient_boosting.predict_stages(bad,
                                         np.random.rand(20, 2).astype(np.float32),
                                         1.0, np.random.rand(20, 2))
    else:
        time.sleep(1)
    return i

pool = Pool(2)
out = pool.imap_unordered(fit_one, range(10))

# we will never see 3
for o in out:
    print o
As described in the comments, this just works in Python 3 if you use concurrent.futures.ProcessPoolExecutor instead of multiprocessing.Pool.
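A minimal sketch of that, assuming the fit_one from the question (the exception class lives in concurrent.futures.process):

from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool

with ProcessPoolExecutor(2) as executor:
    try:
        for o in executor.map(fit_one, range(10)):
            print(o)
    except BrokenProcessPool:
        # raised instead of hanging when a worker dies abruptly (e.g. from a segfault)
        print('a worker process crashed')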
If you're stuck on Python 2, the best option I've found is to use the timeout argument on the result objects returned by Pool.apply_async and Pool.map_async. For example:
pool = Pool(2)
results = [pool.apply_async(fit_one, (i,)) for i in range(10)]
for res in results:
    print res.get(timeout=1000)  # allow 1000 seconds max per task
This works as long as you have an upper bound for how long a child process should take to complete a task.
This is a known bug, issue #22393, in Python. There is no meaningful workaround as long as you're using multiprocessing.pool until it's fixed. A patch is available at that link, but it has not been integrated into the main release as yet, so no stable release of Python fixes the problem.
Instead of using Pool().imap(), you might rather create the child processes yourself with Process(). The returned Process objects let you check the liveness status and exit code of any child, so you will know when one of them dies.
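A rough sketch of that idea, assuming the fit_one from the question: a child killed by a segfault ends with a negative exitcode, which you can check instead of hanging.

from multiprocessing import Process, Queue

def worker(q, i):
    q.put(fit_one(i))

if __name__ == '__main__':
    q = Queue()
    procs = [Process(target=worker, args=(q, i)) for i in range(10)]
    for p in procs:
        p.start()

    finished = 0
    for p in procs:
        p.join()
        if p.exitcode < 0:
            # the child was killed by a signal (SIGSEGV gives exitcode -11), so it queued no result
            print('a child crashed with exit code %d' % p.exitcode)
        else:
            finished += 1

    for _ in range(finished):
        print(q.get())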
I haven't run your example to see if it can handle the error, but try concurrent.futures. Simply replace my_function(i) with your fit_one(i). Keep the __name__ == '__main__': structure; concurrent.futures seems to need it. The code below is tested on my machine, so it will hopefully work straight up on yours.
import concurrent.futures

def my_function(i):
    print('function running')
    return i

def run():
    number_processes = 4
    executor = concurrent.futures.ProcessPoolExecutor(number_processes)
    futures = [executor.submit(my_function, i) for i in range(10)]
    concurrent.futures.wait(futures)
    for f in futures:
        print(f.result())

if __name__ == '__main__':
    run()
