I have tasks in a chain with three tasks fetch_page, check_source and store page
def update_page_info(**headers):
chain=fetch_page.s(headers['key']) | check_source.s(headers['key_1']) | store_info.s()
chain().apply_async()
fetch_page fetches the page and gathers what it needs to gather:
#app.task(bind=True)
def fetch_page(self,url):
#fetch_page here and return a tuple so that it can be unpacked
# dosomething
After fetching the page now it checks for source in the next tasks check_source.
#app.task(bind=True)
def check_source(self,page_and_url,handle):
try:
#unpack your stuffs here
page,url=page_and_url
get_result={}
if handle=='first_option':
get_result=select_first_option(one,two)
return get_result
elif handle=='second_option':
get_result=select_second_option(one,two)
return (get_result)
elif handle=='third_option':
get_result=select_third_option(one,two)
return (get_result)
else:
return "IGNORE FOR NOW"
except Exception as exc:
pass
so the confusion is can i call some other tasks from here?? will there be any inconsistency or will the worker will ever get in deadlock doing this?
And finally it should execute store_info() which would just store things returned from check_source()
#app.task(bind=True)
def store_info(self,result):
print ("store_info ")
try:
#store the fetched pages
except Exception as exc:
#dosomething
finally:
pass
I was following this approach which just needed little modification http://docs.celeryproject.org/en/latest/userguide/tasks.html#avoid-launching-synchronous-subtasks.
Can anybody suggest me how it should be done and things i need to be more careful about?
This all should work like you are reading (and communicating) it should. The three tasks will execute in order without any "inconsistency."
If you call update_page_info one time the three chained sub-tasks will be run exclusively of each other. That said, there is still the potential for deadlocks with this setup. If you called update_page_info while previous tasks from the last time you called it were running then you could get more than one task running at once. This would introduce the potential for deadlocks depending how your tasks share resources.
If your tasks share resources I would suggest using something like redis or memcached as a locking system across workers.
Edit: the code I see now is totally fine as the results are passed along as parameters to the next task.
Related
Edit: I am closing this question.
As it turns out, my goal of having parallel HTTP posts is pointless. After implementing it successfully with aiohttp, I run into deadlocks elsewhere in the pipeline.
I will reformulate this and post a single question in a few days.
Problem
I want to have a class that, during some other computation, holds generated data and can write it to a DB via HTTP (details below) when convenient. It's gotta be a class as it is also used to load/represent/manipulate data.
I have written a naive, nonconcurrent implementation that works:
The class is initialized and then used in a "main loop". Data is added to it in this main loop to a naive "Queue" (a list of HTTP requests). At certain intervals in the main loop, the class calls a function to write those requests and clear the "queue".
As you can expect, this is IO bound. Whenever I need to write the "queue", the main loop halts. Furthermore, since the main computation runs on a GPU, the loop is also not really CPU bound.
Essentially, I want to have a queue, and, say, ten workers running in the background and pushing items to the http connector, waiting for the push to finish and then taking on the next (or just waiting for the next write call, not a big deal). In the meantime, my main loop runs and adds to the queue.
Program example
My naive program looks something like this
class data_storage(...):
def add(...):
def write_queue(self):
if len(self.queue) > 0:
res = self.connector.run(self.queue)
self.queue = []
def main_loop(storage):
# do many things
for batch in dataset: #simplified example
# Do stuff
for some_other_loop:
(...)
storage.add(results)
# For example, call each iteration
storage.write_queue()
if __name__ == "__main__":
storage=data_storage()
main_loop(storage)
...
In detail: the connector class is from the package 'neo4j-connector' to post to my Neo4j database. It essentially does JSON formatting and uses the "requests" api from python.
This works, even without a real queue, since nothing is concurrent.
Now I have to make it work concurrently.
From my research, I have seen that ideally I would want a "producer-consumer" pattern, where both are initialized via asyncio. I have only seen this implemented via functions, not classes, so I don't know how to approach this. With functions, my main loop should be a producer coroutine and my write function becomes the consumer. Both are initiated as tasks on the queue and then gathered, where I'd initialize only one producer but many consumers.
My issue is that the main loop includes parts that are already parallel (e.g. PyTorch). Asyncio is not thread safe, so I don't think I can just wrap everything in async decorators and make a co-routine. This is also precisely why I want the DB logic in a separate class.
I also don't actually want or need the main loop to run "concurrently" on the same thread with the workers. But it's fine if that's the outcome as the workers don't do much on the CPU. But technically speaking, I want multi-threading? I have no idea.
My only other option would be to write into the queue until it is "full", halt the loop and then use multiple threads to dump it to the DB. Still, this would be much slower than doing it while the main loop is running. My gain would be minimal, just concurrency while working through the queue. I'd settle for it if need be.
However, from a stackoverflow post, I came up with this small change
class data_storage(...):
def add(...):
def background(f):
def wrapped(*args, **kwargs):
return asyncio.get_event_loop().run_in_executor(None, f, *args, **kwargs)
return wrapped
#background
def write_queue(self):
if len(self.queue) > 0:
res = self.connector.run(self.queue)
self.queue = []
Shockingly this sort of "works" and is blazingly fast. Of course since it's not a real queue, things get overwritten. Furthermore, this overwhelms or deadlocks the HTTP API and in general produces a load of errors.
But since this - in principle - works, I wonder if I could do is the following:
class data_storage(...):
def add(...):
def background(f):
def wrapped(*args, **kwargs):
return asyncio.get_event_loop().run_in_executor(None, f, *args, **kwargs)
return wrapped
#background
def post(self, items):
if len(items) > 0:
self.nr_workers.increase()
res = self.connector.run(items)
self.nr_workers.decrease()
def write_queue(self):
if self.nr_workers < 10:
items=self.queue.get(200) # Extract and delete from queue, non-concurrent
self.post(items) # Add "Worker"
for some hypothetical queue and nr_workers objects. Then at the end of the main loop, have a function that blocks progress until number of workers is zero and clears, non-concurrently, the rest of the queue.
This seems like a monumentally bad idea, but I don't know how else to implement this. If this is terrible, I'd like to know before I start doing more work on this. Do you think it would it work?
Otherwise, could you give me any pointers as how to approach this situation correctly?
Some key words, tools or things to research would of course be enough.
Thank you!
I know there are a few questions and answers related to hanging threads in Python, but my situation is slightly different as the script is hanging AFTER all the threads have been completed. The threading script is below, but obviously the first 2 functions are simplified massively.
When I run the script shown, it works. When I use my real functions, the script hangs AFTER THE LAST LINE. So, all the scenarios are processed (and a message printed to confirm), logStudyData() then collates all the results and writes to a csv. "Script Complete" is printed. And THEN it hangs.
The script with threading functionality removed runs fine.
I have tried enclosing the main script in try...except but no exception gets logged. If I use a debugger with a breakpoint on the final print and then step it forward, it hangs.
I know there is not much to go on here, but short of including the whole 1500-line script, I don't know hat else to do. Any suggestions welcome!
def runScenario(scenario):
# Do a bunch of stuff
with lock:
# access global variables
pass
pass
def logStudyData():
# Combine results from all scenarios into a df and write to csv
pass
def worker():
global q
while True:
next_scenario = q.get()
if next_scenario is None:
break
runScenario(next_scenario)
print(next_scenario , " is complete")
q.task_done()
import threading
from queue import Queue
global q, lock
q = Queue()
threads = []
scenario_list = ['s1','s2','s3','s4','s5','s6','s7','s8','s9','s10','s11','s12']
num_worker_threads = 6
lock = threading.Lock()
for i in range(num_worker_threads):
print("Thread number ",i)
this_thread = threading.Thread(target=worker)
this_thread.start()
threads.append(this_thread)
for scenario_name in scenario_list:
q.put(scenario_name)
q.join()
print("q.join completed")
logStudyData()
print("script complete")
As the docs for Queue.get say:
Remove and return an item from the queue. If optional args block is true and timeout is None (the default), block if necessary until an item is available. If timeout is a positive number, it blocks at most timeout seconds and raises the Empty exception if no item was available within that time. Otherwise (block is false), return an item if one is immediately available, else raise the Empty exception (timeout is ignored in that case).
In other words, there is no way get can ever return None, except by you calling q.put(None) on the main thread, which you don't do.
Notice that the example directly below those docs does this:
for i in range(num_worker_threads):
q.put(None)
for t in threads:
t.join()
The second one is technically necessary, but you usually get away with not doing it.
But the first one is absolutely necessary. You need to either do this, or come up with some other mechanism to tell your workers to quit. Without that, your main thread just tries to exit, which means it tries to join every worker, but those workers are all blocked forever on a get that will never happen, so your program hangs forever.
Building a thread pool may not be rocket science (if only because rocket scientists tend to need their calculations to be deterministic and hard real-time…), but it's not trivial, either, and there are plenty of things you can get wrong. You may want to consider using one of the two already-built threadpools in the Python standard library, concurrent.futures.ThreadPoolExecutor or multiprocessing.dummy.Pool. This would reduce your entire program to:
import concurrent.futures
def work(scenario):
runScenario(scenario)
print(scenario , " is complete")
scenario_list = ['s1','s2','s3','s4','s5','s6','s7','s8','s9','s10','s11','s12']
with concurrent.futures.ThreadPoolExecutor(max_workers=6) as x:
results = list(x.map(work, scenario_list))
print("q.join completed")
logStudyData()
print("script complete")
Obviously you'll still need a lock around any mutable variables you change inside runScenario—although if you're only using a mutable variable there because you couldn't figure out how to return values to the main thread, that's trivial with an Executor: just return the values from work, and then you can use them like this:
for result in x.map(work, scenario_list):
do_something(result)
I have a situation where a periodic monthly big_task reads a file and enqueue one chained-task per row in this file, where the chained tasks are small_task_1 and small_task_2:
class BigTask(PeriodicTask):
run_every = crontab(hour=00, minute=00, day_of_month=1)
def run(self):
task_list = []
with open("the_file.csv" as f:
for row in f:
t = chain(
small_task_1.s(row),
small_task_2.s(),
)
task_list.append(t)
gr = group(*task_list)
r = gr.apply_async()
I would like to get statistics about the number of enqueued, failed tasks (and detail about the exception) for each small_task, as soon as all of them are finished (whatever the status is) to send a summary email to the project admins.
I first thought of using chord, but callback is not executed if any of the headers task fails, which will surely happen in my case.
I could also use r.get() in the BigTask, very convenient, but not recommended to wait for a task result into another task (even if here, I guess the risk of worker deadlock is poor since task will be executed only once a month).
Important note: input file contains ~700k rows.
How would you recommend to proceed?
I'm not sure if it can help you to monitor, but about the chord and the callback issue you could use link_error callback (for catching exceptions). In your case for example you can use it like:
small_task_1.s(row).set(link_error=error_task))
and implement celery error_task that send you notification or whatever.
In celery 4, you can set it once for the all canvas (but it didn't work for me in 3.1):
r = gr.apply_async(link_error=error_task)
For the monitoring part, you can use flower of course.
Hope that help
EDIT: An alternative (without using additional persistency) would be to catch the exception and add some logic to the result and the callback. For example:
def small_task_1():
try:
// do stuff
return 'success', result
except:
return 'fail', result
and then in your callback task iterate over the results tuples and check for fails because doing the actual logic.
I found the best solution to be iterate over the group results, after the group is ready.
When you issue a Group, you have a ResultSet object. You can .save() this object, to get it later and check if .is_ready, or you can call .join() and wait for the results.
When it ends, you can access .results and you have a list of AsyncResult objects. These objects all have a .state property that you can access and check if the task was successul or not.
However, you can only check the results after the group ends. During the process, you can get the value of .completed_count() and have an idea of group progress.
https://docs.celeryproject.org/en/latest/reference/celery.result.html#celery.result.ResultSet
The solution we use for a partly similar problem where celery builtin stuff (tasks states etc) doesn't really cut it is to manually store desired informations in Redis and retrieve them when needed.
I have a simple Flask web app that make many HTTP requests to an external service when a user push a button. On the client side I have an angularjs app.
The server side of the code look like this (using multiprocessing.dummy):
worker = MyWorkerClass()
pool = Pool(processes=10)
result_objs = [pool.apply_async(worker.do_work, (q,))
for q in queries]
pool.close() # Close pool
pool.join() # Wait for all task to finish
errors = not all(obj.successful() for obj in result_objs)
# extract result only from successful task
items = [obj.get() for obj in result_objs if obj.successful()]
As you can see I'm using apply_async because I want to later inspect each task and extract from them the result only if the task didn't raise any exception.
I understood that in order to show a progress bar on client side, I need to publish somewhere the number of completed tasks so I made a simple view like this:
#app.route('/api/v1.0/progress', methods=['GET'])
def view_progress():
return jsonify(dict(progress=session['progress']))
That will show the content of a session variable. Now, during the process, I need to update that variable with the number of completed tasks (the total number of tasks to complete is fixed and known).
Any ideas about how to do that? I working in the right direction?
I'have seen similar questions on SO like this one but I'm not able to adapt the answer to my case.
Thank you.
For interprocess communication you can use a multiprocessiong.Queue and your workers can put_nowait tuples with progress information on it while doing their work. Your main process can update whatever your view_progress is reading until all results are ready.
A bit like in this example usage of a Queue, with a few adjustments:
In the writers (workers) I'd use put_nowait instead of put because working is more important than waiting to report that you are working (but perhaps you judge otherwise and decide that informing the user is part of the task and should never be skipped).
The example just puts strings on the queue, I'd use collections.namedtuples for more structured messages. On tasks with many steps, this enables you to raise the resolution of you progress report, and report more to the user.
In general the approach you are taking is okay, I do it in a similar way.
To calculate the progress you can use an auxiliary function that counts the completed tasks:
def get_progress(result_objs):
done = 0
errors = 0
for r in result_objs:
if r.ready():
done += 1
if not r.successful():
errors += 1
return (done, errors)
Note that as a bonus this function returns how many of the "done" tasks ended in errors.
The big problem is for the /api/v1.0/progress route to find the array of AsyncResult objects.
Unfortunately AsyncResult objects cannot be serialized to a session, so that option is out. If your application supports a single set of async tasks at a time then you can just store this array as a global variable. If you need to support multiple clients, each with a different set of async tasks, then you will need figure out a strategy to keep client session data in the server.
I implemented the single client solution as a quick test. My view functions are as follows:
results = None
#app.route('/')
def index():
global results
results = [pool.apply_async(do_work) for n in range(20)]
return render_template('index.html')
#app.route('/api/v1.0/progress')
def progress():
global results
total = len(results)
done, errored = get_progress(results)
return jsonify({'total': total, 'done': done, 'errored': errored})
I hope this helps!
I think you should be able to update the number of completed tasks using multiprocessing.Value and multiprocessing.Lock.
In your main code, use:
processes=multiprocessing.Value('i', 10)
lock=multiprocessing.Lock()
And then, when you call worker.dowork, pass a lock object and the value to it:
worker.dowork(lock, processes)
In your worker.dowork code, decrease "processes" by one when the code is finished:
lock.acquire()
processes.value-=1
lock.release()
Now, "processes.value" should be accessible from your main code, and be equal to the number of remaining processes. Make sure you acquire the lock before acessing processes.value, and release the lock afterwards
I am trying to create a class than can run a separate process to go do some work that takes a long time, launch a bunch of these from a main module and then wait for them all to finish. I want to launch the processes once and then keep feeding them things to do rather than creating and destroying processes. For example, maybe I have 10 servers running the dd command, then I want them all to scp a file, etc.
My ultimate goal is to create a class for each system that keeps track of the information for the system in which it is tied to like IP address, logs, runtime, etc. But that class must be able to launch a system command and then return execution back to the caller while that system command runs, to followup with the result of the system command later.
My attempt is failing because I cannot send an instance method of a class over the pipe to the subprocess via pickle. Those are not pickleable. I therefore tried to fix it various ways but I can't figure it out. How can my code be patched to do this? What good is multiprocessing if you can't send over anything useful?
Is there any good documentation of multiprocessing being used with class instances? The only way I can get the multiprocessing module to work is on simple functions. Every attempt to use it within a class instance has failed. Maybe I should pass events instead? I don't understand how to do that yet.
import multiprocessing
import sys
import re
class ProcessWorker(multiprocessing.Process):
"""
This class runs as a separate process to execute worker's commands in parallel
Once launched, it remains running, monitoring the task queue, until "None" is sent
"""
def __init__(self, task_q, result_q):
multiprocessing.Process.__init__(self)
self.task_q = task_q
self.result_q = result_q
return
def run(self):
"""
Overloaded function provided by multiprocessing.Process. Called upon start() signal
"""
proc_name = self.name
print '%s: Launched' % (proc_name)
while True:
next_task_list = self.task_q.get()
if next_task is None:
# Poison pill means shutdown
print '%s: Exiting' % (proc_name)
self.task_q.task_done()
break
next_task = next_task_list[0]
print '%s: %s' % (proc_name, next_task)
args = next_task_list[1]
kwargs = next_task_list[2]
answer = next_task(*args, **kwargs)
self.task_q.task_done()
self.result_q.put(answer)
return
# End of ProcessWorker class
class Worker(object):
"""
Launches a child process to run commands from derived classes in separate processes,
which sit and listen for something to do
This base class is called by each derived worker
"""
def __init__(self, config, index=None):
self.config = config
self.index = index
# Launce the ProcessWorker for anything that has an index value
if self.index is not None:
self.task_q = multiprocessing.JoinableQueue()
self.result_q = multiprocessing.Queue()
self.process_worker = ProcessWorker(self.task_q, self.result_q)
self.process_worker.start()
print "Got here"
# Process should be running and listening for functions to execute
return
def enqueue_process(target): # No self, since it is a decorator
"""
Used to place an command target from this class object into the task_q
NOTE: Any function decorated with this must use fetch_results() to get the
target task's result value
"""
def wrapper(self, *args, **kwargs):
self.task_q.put([target, args, kwargs]) # FAIL: target is a class instance method and can't be pickled!
return wrapper
def fetch_results(self):
"""
After all processes have been spawned by multiple modules, this command
is called on each one to retreive the results of the call.
This blocks until the execution of the item in the queue is complete
"""
self.task_q.join() # Wait for it to to finish
return self.result_q.get() # Return the result
#enqueue_process
def run_long_command(self, command):
print "I am running number % as process "%number, self.name
# In here, I will launch a subprocess to run a long-running system command
# p = Popen(command), etc
# p.wait(), etc
return
def close(self):
self.task_q.put(None)
self.task_q.join()
if __name__ == '__main__':
config = ["some value", "something else"]
index = 7
workers = []
for i in range(5):
worker = Worker(config, index)
worker.run_long_command("ls /")
workers.append(worker)
for worker in workers:
worker.fetch_results()
# Do more work... (this would actually be done in a distributor in another class)
for worker in workers:
worker.close()
Edit: I tried to move the ProcessWorker class and the creation of the multiprocessing queues outside of the Worker class and then tried to manually pickle the worker instance. Even that doesn't work and I get an error
RuntimeError: Queue objects should only be shared between processes
through inheritance
. But I am only passing references of those queues into the worker instance?? I am missing something fundamental. Here is the modified code from the main section:
if __name__ == '__main__':
config = ["some value", "something else"]
index = 7
workers = []
for i in range(1):
task_q = multiprocessing.JoinableQueue()
result_q = multiprocessing.Queue()
process_worker = ProcessWorker(task_q, result_q)
worker = Worker(config, index, process_worker, task_q, result_q)
something_to_look_at = pickle.dumps(worker) # FAIL: Doesn't like queues??
process_worker.start()
worker.run_long_command("ls /")
So, the problem was that I was assuming that Python was doing some sort of magic that is somehow different from the way that C++/fork() works. I somehow thought that Python only copied the class, not the whole program into a separate process. I seriously wasted days trying to get this to work because all of the talk about pickle serialization made me think that it actually sent everything over the pipe. I knew that certain things could not be sent over the pipe, but I thought my problem was that I was not packaging things up properly.
This all could have been avoided if the Python docs gave me a 10,000 ft view of what happens when this module is used. Sure, it tells me what the methods of multiprocess module does and gives me some basic examples, but what I want to know is what is the "Theory of Operation" behind the scenes! Here is the kind of information I could have used. Please chime in if my answer is off. It will help me learn.
When you run start a process using this module, the whole program is copied into another process. But since it is not the "__main__" process and my code was checking for that, it doesn't fire off yet another process infinitely. It just stops and sits out there waiting for something to do, like a zombie. Everything that was initialized in the parent at the time of calling multiprocess.Process() is all set up and ready to go. Once you put something in the multiprocess.Queue or shared memory, or pipe, etc. (however you are communicating), then the separate process receives it and gets to work. It can draw upon all imported modules and setup just as if it was the parent. However, once some internal state variables change in the parent or separate process, those changes are isolated. Once the process is spawned, it now becomes your job to keep them in sync if necessary, either through a queue, pipe, shared memory, etc.
I threw out the code and started over, but now I am only putting one extra function out in the ProcessWorker, an "execute" method that runs a command line. Pretty simple. I don't have to worry about launching and then closing a bunch of processes this way, which has caused me all kinds of instability and performance issues in the past in C++. When I switched to launching processes at the beginning and then passing messages to those waiting processes, my performance improved and it was very stable.
BTW, I looked at this link to get help, which threw me off because the example made me think that methods were being transported across the queues: http://www.doughellmann.com/PyMOTW/multiprocessing/communication.html
The second example of the first section used "next_task()" that appeared (to me) to be executing a task received via the queue.
Instead of attempting to send a method itself (which is impractical), try sending a name of a method to execute.
Provided that each worker runs the same code, it's a matter of a simple getattr(self, task_name).
I'd pass tuples (task_name, task_args), where task_args were a dict to be directly fed to the task method:
next_task_name, next_task_args = self.task_q.get()
if next_task_name:
task = getattr(self, next_task_name)
answer = task(**next_task_args)
...
else:
# poison pill, shut down
break
REF: https://stackoverflow.com/a/14179779
Answer on Jan 6 at 6:03 by David Lynch is not factually correct when he says that he was misled by
http://www.doughellmann.com/PyMOTW/multiprocessing/communication.html.
The code and examples provided are correct and work as advertised. next_task() is executing a task received via the queue -- try and understand what the Task.__call__() method is doing.
In my case what, tripped me up was syntax errors in my implementation of run(). It seems that the sub-process will not report this and just fails silently -- leaving things stuck in weird loops! Make sure you have some kind of syntax checker running e.g. Flymake/Pyflakes in Emacs.
Debugging via multiprocessing.log_to_stderr()F helped me narrow down the problem.