Parallel processing within a queue (using Pool within Celery) - python

I'm using Celery to queue jobs from a CGI application I made. The way I've set it up, Celery makes each job run one- or two-at-a-time by setting CELERYD_CONCURRENCY = 1 or = 2 (so they don't crowd the processor or thrash from memory consumption). The queue works great, thanks to advice I got on StackOverflow.
Each of these jobs takes a fair amount of time (~30 minutes serial), but is embarrassingly parallel. For this reason I was using Pool.map to split the work and do it in parallel. It worked great from the command line, and I got runtimes around 5 minutes using a new many-cored chip.
Unfortunately, there is a limitation that does not allow daemonic processes to have child processes, and when I run the fancy parallelized code from within the Celery queue, I get this error:
AssertionError: daemonic processes are not allowed to have children
I noticed other people have had similar questions, but I can't find an answer that wouldn't require abandoning Pool.map altogether, and making more complicated thread code.
What is the appropriate design choice here? I can easily run my serial jobs using my Celery queue. I can also run my much faster parallelized jobs without a queue. How should I approach this, and is it possible to get what I want (both the queue and the per-job parallelization)?
A couple of ideas I've had (some are quite hacky):
The job sent to the Celery queue simply calls the command line program. That program can use Pool as it pleases, and then saves the result figures & data to a file (just as it does now). Downside: I won't be able to check on the status of the job or see if it terminated successfully. Also, system calls from CGI may cause security issues.
Obviously, if the queue is very full of jobs, I can make use of the CPU resources (by setting CELERYD_CONCURRENCY = 6 or so); this will allow many people to be "at the front of the queue" at once. Downside: each job will spend a lot of time at the front of the queue; if the queue isn't full, there will be no speedup. Also, many partially finished jobs will be stored in memory at the same time, using much more RAM.
Use Celery's @task decorator to parallelize within sub-jobs. Then, instead of setting CELERYD_CONCURRENCY = 1, I would set it to 6 (or however many sub-jobs I'd like to allow in memory at a time). Downside: First of all, I'm not sure whether this will successfully avoid the "task-within-task" problem. But also, the notion of queue position may be lost, and many partially finished jobs may end up in memory at once.
Perhaps there is a way to call Pool.map and specify that the worker processes are non-daemonic (see the sketch after this list)? Or perhaps there is something more lightweight I can use instead of Pool.map? This is similar to an approach taken on another open StackOverflow question. Also, I should note that the parallelization I exploit via Pool.map is similar to linear algebra, and there is no inter-process communication (each worker just runs independently and returns its result without talking to the others).
Throw away Celery and use multiprocessing.Queue. Then maybe there'd be some way to use the same "thread depth" for every thread I use (i.e. maybe all of the threads could use the same Pool, avoiding nesting)?
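To make that non-daemonic idea concrete: the workaround I have seen sketched elsewhere overrides the daemon flag on the Pool's worker class. This is untested here, written against the Python 2-era multiprocessing internals, and MyPool is just an illustrative name:

import multiprocessing
import multiprocessing.pool

class NoDaemonProcess(multiprocessing.Process):
    # report daemon=False so these workers are allowed to spawn children
    def _get_daemon(self):
        return False
    def _set_daemon(self, value):
        pass
    daemon = property(_get_daemon, _set_daemon)

class MyPool(multiprocessing.pool.Pool):
    # Pool builds its workers from this class attribute
    Process = NoDaemonProcess

A Celery task could then create MyPool(6) and call map on it as before; whether that behaves well under a given Celery worker pool is something that would need testing.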
Thanks a lot in advance.

What you need is a workflow management system (WFMS) that manages
task concurrency
task dependency
task nesting
among other things.
From a very high level view, a WFMS sits on top of a task pool like celery, and submits the tasks which are ready to execute to the pool. It is also responsible for opening up a nest and submitting the tasks in the nest accordingly.
I've developed a system to do just that. It's called pomsets. Try it out, and feel free to send me any questions.

I normally use multiprocess daemons based on Twisted with forking, and query jobs via Gearman.
Try to look at Gearman.

Will I run into trouble with python's Global Interpreter Lock?

I am aware that this question is rather high-level and may be vague. Please ask if you need any more details and I will try to edit.
I am using QuickFix with Python bindings to consume high-throughput market data from circa 30 markets simultaneously. Most of the computational work is done on separate CPUs via the multiprocessing module. These parallel processes are spawned by the main process on startup. If I wish to interact with the market in any way via QuickFix, I have to do this within the main process, thus any commands (to enter orders, for example) which come from the child processes must be piped (via an mp.Queue object we will call Q) to the main process before execution.
This raises the problem of monitoring Q, which must be done within the main process. I cannot use Q.get(), since this method blocks and my entire main process will hang until something shows up in Q. In order to decrease latency, I must check Q frequently, on the order of 50 times per second. I have been using the apscheduler to do this, but I keep getting Warning errors stating that the runtime was missed. These errors are a serious issue because they prevent me from easily viewing important information.
I have therefore refactored my application to use the code posted by MestreLion as an answer to this question. This is working for me because it starts a new thread from the main process, and it does not print error messages. However, I am worried that this will cause nasty problems down the road.
I am aware of the Global Interpreter Lock in python (this is why I used the multiprocessing module to begin with), but I don't really understand it. Owing to the high-frequency nature of my application, I do not know if the Q monitoring thread and the main process consuming lots of incoming messages will compete for resources and slow each other down.
My questions:
Am I likely to run into trouble in this scenario?
If not, can I add more monitoring threads using the present approach and still be okay? There are at least two other things I would like to monitor at high frequency.
Thanks.
@MestreLion's solution that you've linked creates 50 threads per second in your case.
All you need is a single thread to consume the queue without blocking the rest of the main process:
import threading

def consume(queue, sentinel=None):
    # block in this helper thread instead of the main process;
    # iteration stops if the sentinel value is ever put on the queue
    for item in iter(queue.get, sentinel):
        pass_to_quickfix(item)  # placeholder for whatever hands the item to QuickFix

threading.Thread(target=consume, args=[queue], daemon=True).start()
GIL may or may not matter for performance in this case. Measure it.
Without knowing your scenario, it's difficult to say anything specific. Your question suggests that the threads are waiting most of the time in get, so the GIL is not a problem. Interprocess communication may cause problems much earlier. There you could think about switching to another protocol, for example TCP sockets. Then you can write the scheduler more efficiently with select instead of threads, as threads are also slow and resource-consuming. select is a system call that lets you monitor many socket connections at once, so it scales very efficiently with the number of connections and needs almost no CPU power for the monitoring.
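For illustration only, a minimal select-based monitor of the kind described might look like this (the port number and the handle_order handler are made up for the sketch):

import select
import socket

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(('localhost', 9000))   # hypothetical port the worker processes connect to
server.listen(5)

sockets = [server]
while True:
    # wait at most 20 ms for any connection to become readable
    readable, _, _ = select.select(sockets, [], [], 0.02)
    for sock in readable:
        if sock is server:
            conn, _ = sock.accept()
            sockets.append(conn)
        else:
            data = sock.recv(4096)
            if data:
                handle_order(data)   # hypothetical: hand the command to QuickFix
            else:
                sockets.remove(sock)
                sock.close()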

Best way for single worker implementation in Flask

I have a spider that downloads pages and stores data in a database. I have created a Flask application with an admin panel (via the Flask-Admin extension) that shows the database.
Now I want to add a function to my Flask app to control the spider state: switch it on/off.
I think this is possible with threads or multiprocessing. Celery doesn't seem like a good choice because the program as a whole must use minimal memory.
Which method should I choose to implement this feature?
Discounting Celery based on memory usage would probably be a mistake, as Celery has low overhead in both time and space. In fact, using Celery+Flask does not use much more memory than using Flask alone.
In addition Celery comes with several choices you can make that can have an impact
on the amount of memory used. For example, there are 5 different pool implementations that all have different strengths and trade-offs, the pool choices are:
multiprocessing
By default Celery uses multiprocessing, which means that it will spawn child processes
to offload work to. This is the most memory expensive option - simply because
every child process will duplicate the amount of base memory needed.
But Celery also comes with an autoscale feature that will kill off worker
processes when there's little work to do, and spawn new processes when there's more work:
$ celeryd --autoscale=0,10
where 0 is the minimum number of processes, and 10 is the maximum. Here celeryd will
start off with no child processes, and grow based on load up to a maximum of 10 processes. When load decreases, so will the number of worker processes.
eventlet/gevent
When using the eventlet/gevent pools only a single process will be used, and thus it will
use a lot less memory, but with the downside that tasks calling blocking code will
block other tasks from executing. If your tasks are mostly I/O bound you should be ok,
and you can also combine different pools and send problem tasks to a multiprocessing pool instead.
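If I remember the flags correctly, the pool is selected on the command line, something like the following (check celeryd --help for your version):

$ celeryd --pool=eventlet --concurrency=100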
threads
Celery also comes with a pool using threads.
The development version that will become version 2.6 includes a lot of optimizations,
and there is no longer any need for the Flask-Celery extension module. If you are not going
into production in the next few days then I would encourage you to try the development version
which must be installed like this:
$ pip install https://github.com/ask/kombu/zipball/master
$ pip install https://github.com/ask/celery/zipball/master
The new API is now also Flask inspired, so you should read the new getting started guide:
http://ask.github.com/celery/getting-started/first-steps-with-celery.html
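As a rough sketch of the flavour of the new API (the broker URL and the fetch_and_store helper are just placeholders):

from celery import Celery

celery = Celery('tasks', broker='amqp://guest@localhost//')

@celery.task
def crawl(url):
    # the spider work for a single page goes here
    return fetch_and_store(url)   # hypothetical helper

# from a Flask view you would then call: crawl.delay('http://example.com')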
With all this said, most optimization work has been focused on execution speed so far,
and there are probably many more memory optimizations that can be made. It has not been a request so far, but in the unlikely event that Celery does not match your memory constraints, you can open up an issue at our bug tracker and I'm sure it will get focus, or you can even help us to do so.
You could supervise the process using multiprocessing or subprocess, then just hand the handle around the session.

What is the optimal way to organize infinitely looped work queue?

I have about 1000-10000 jobs which I need to run on a constant basis, each one every minute or so. Sometimes a new job comes in or an existing one needs to be cancelled, but that's a rare event. Jobs are tagged and must be distributed among workers, each of which processes only jobs of a specific kind.
For now I want to use cron and load the whole database of jobs into some broker -- RabbitMQ or beanstalkd (I haven't decided which one to use yet).
But this approach seems ugly to me (using a timer to simulate infinity, loading the whole database, etc.) and has a disadvantage: if some kind of job is processed more slowly than it is added to the queue, the queue may be overwhelmed and the message broker will eat all the RAM, then swap, and then just halt.
Are there any other possibilities? Am I just not using the right pattern for the job? (Maybe I don't need a queue at all?)
p.s. I'm using python if this is important.
You create your initial batch of jobs and add them to the queue.
You have n consumers of the queue, each running jobs. Adding consumers to the queue simply round-robins the distribution of jobs to each listening consumer, giving you arbitrary horizontal scalability.
Each job can, upon completion, be responsible for resubmitting itself back to the queue. This means that your job queue won't grow beyond the length that it was when you initialised it.
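To make the pattern concrete, here is a hedged sketch using only the standard library (a real deployment would use RabbitMQ or beanstalkd; the tags and the worker count are invented):

import Queue     # the module is named queue on Python 3
import threading

jobs = Queue.Queue()

def run_job(tag):
    # ... do the actual work for this tag ...
    return tag   # hand back the tag so the job re-queues itself

def worker():
    while True:
        tag = jobs.get()
        resubmit = run_job(tag)
        if resubmit is not None:
            jobs.put(resubmit)   # the queue never grows beyond its initial length
        jobs.task_done()

for tag in ['feeds', 'prices', 'mail']:   # the initial batch of jobs
    jobs.put(tag)

for _ in range(4):                        # n consumers
    t = threading.Thread(target=worker)
    t.daemon = True
    t.start()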
The master job can, if need be, spawn sub-jobs and add them to the queue.
For different types of jobs it is probably a good idea to use different queues. That way you can balance the load more effectively by having different quantities/horsepower of workers running the jobs from the different queues.
The fact that you are running Python isn't important here, it's the pattern, not the language that you need to nail first.
You can use an asynchronous framework, e.g. Twisted.
I also don't think it's a good idea to run a script from the cron daemon each minute (you mentioned the reasons yourself), so I suggest Twisted. It doesn't give you any scheduling benefit, but you get flexibility in process management and memory sharing.
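A minimal sketch of the kind of loop Twisted gives you (task.LoopingCall is the relevant utility; run_due_jobs stands in for your own dispatch code):

from twisted.internet import task, reactor

def run_due_jobs():
    # dispatch whatever jobs are due; state lives in the long-running process,
    # so there is no per-minute reload of the whole job database
    pass

loop = task.LoopingCall(run_due_jobs)
loop.start(60.0)   # run every 60 seconds
reactor.run()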

What's the pythonic way to deal with worker processes that must coordinate their tasks?

I'm currently learning Python (from a Java background), and I have a question about something I would have used threads for in Java.
My program will use workers to read from some web-service some data periodically. Each worker will call on the web-service at various times periodically.
From what I have read, it's preferable to use the multiprocessing module and set up the workers as independent processes that get on with their data-gathering tasks. On Java I would have done something conceptually similar, but using threads. While it appears I can use threads in Python, I'll lose out on multi-cpu utilisation.
Here's the guts of my question: The web-service is throttled, viz., the workers must not call on it more than x times per second. What is the best way for the workers to check on whether they may request data?
I'm confused as to whether this should be achieved using:
Pipes as a way to communicate to some other 'managing object', which monitors the total calls per second.
Something along the lines of mmap'd shared memory, to share some data/value between the processes that describes whether they may call the web-service.
A Manager() object that monitors the calls per seconds and informs workers if they have permission to make their calls.
Of course, I guess this may come down to how I keep track of the calls per second. I suppose one option would be for the workers to call a function on some other object, which makes the call to the web-service and records the current number of calls/sec. Another option would be for the function that calls the web-service to live within each worker, and for them to message a managing object every time they make a call to the web-service.
Thoughts welcome!
Delegate the retrieval to a separate process which queues the requests until it is their turn.
I think that you'll find that the multiprocessing module will provide you with some fairly familiar constructs.
You might find that multiprocessing.Queue is useful for connecting your worker threads back to a managing thread that could provide monitoring or throttling.
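One hedged sketch of that idea: a single throttling process owns the call budget and drips permission tokens into a bounded multiprocessing.Queue; workers take a token before each call. The 5-calls-per-second limit, the worker count, and call_web_service are all invented for the sketch:

import time
import multiprocessing as mp

MAX_CALLS_PER_SEC = 5   # assumed throttle limit

def throttler(tokens):
    # emit at most MAX_CALLS_PER_SEC tokens per second; put() blocks when the
    # queue is full, so unused permission never accumulates past one second's budget
    while True:
        for _ in range(MAX_CALLS_PER_SEC):
            tokens.put(None)
        time.sleep(1.0)

def worker(tokens):
    while True:
        tokens.get()           # blocks until a call is allowed
        call_web_service()     # hypothetical call to the throttled service

if __name__ == '__main__':
    tokens = mp.Queue(maxsize=MAX_CALLS_PER_SEC)
    t = mp.Process(target=throttler, args=(tokens,))
    t.daemon = True
    t.start()
    for _ in range(8):         # assumed number of workers
        mp.Process(target=worker, args=(tokens,)).start()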
Not really an answer to your question, but an alternative approach to your problem: you could get rid of the synchronization issues by doing the requests event-driven, e.g. by using the Python async module or Twisted. You wouldn't benefit from multiple CPUs/cores, but in the context of network communication that's usually negligible.

Multiprocessing in python with more then 2 levels

I want to write a program that spawns processes like this: process -> n processes -> n processes each.
Can the second level spawn processes with multiprocessing? I am using the multiprocessing module of Python 2.6.
Thanks.
@vilalian's answer is correct, but terse. Of course, it's hard to supply more information when your original question was vague.
To expand a little, you'd have your original program spawn its n processes, but they'd be slightly different from the original in that you'd want them (each, if I understand your question) to spawn n more processes. You could accomplish this either by having them run code similar to your original process, but that spawned new sets of programs that performed the task at hand without further processing, or you could use the same code/entry point, just providing different arguments - something like
def main(level):
    if level == 0:
        do_work()   # the real task runs at the leaves
    else:
        # each level fans out into n child processes running this same entry point
        for i in range(n):
            spawn_process_that_runs_main(level - 1)
and start it off with level == 2
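In terms of the multiprocessing module itself, the placeholder spawn_process_that_runs_main above might be realized like this (n and do_work are still stand-ins; note that plain multiprocessing.Process children are non-daemonic by default, so they are allowed to spawn their own children):

import multiprocessing

n = 3   # fan-out at each level (placeholder value)

def do_work():
    pass   # the real task goes here

def main(level):
    if level == 0:
        do_work()
    else:
        children = [multiprocessing.Process(target=main, args=(level - 1,))
                    for _ in range(n)]
        for p in children:
            p.start()
        for p in children:
            p.join()

if __name__ == '__main__':
    main(2)   # process -> n processes -> n processes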
You can structure your app as a series of process pools communicating via Queues at any nested depth. Though it can get hairy pretty quick (probably due to the required context switching).
It's not erlang though that's for sure.
The docs on multiprocessing are extremely useful.
Here (a little too much to drop in a comment) is some code I use to increase throughput in a program that updates my feeds. I have one process polling for feeds that need to be fetched; it stuffs its results in a queue that a process pool of 4 workers picks up, and those workers fetch the feeds. Their results (if any) are then put in a queue for another process pool to parse and push onto a queue to shove back into the database. Done sequentially, this process would be really slow due to some sites taking their own sweet time to respond, so most of the time the process was waiting on data from the internet and would only use one core. Under this process-based model, I'm actually waiting on the database the most, it seems, and my NIC is saturated most of the time, as well as all 4 cores actually doing something. Your mileage may vary.
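The original code isn't reproduced here, but the pipeline described reads roughly like this (a hedged reconstruction with invented helper names, not the author's actual code):

import multiprocessing as mp

def poller(fetch_q):
    # one process: find feeds that are due and hand them to the fetchers
    for feed in feeds_needing_update():            # hypothetical DB query
        fetch_q.put(feed)

def fetcher(fetch_q, parse_q):
    # pool of 4: download feeds; most of the wait here is on the network
    while True:
        feed = fetch_q.get()
        raw = download(feed)                       # hypothetical HTTP fetch
        if raw:
            parse_q.put((feed, raw))

def parser(parse_q, db_q):
    # turn raw feed data into rows for a database-writer process to drain from db_q
    while True:
        feed, raw = parse_q.get()
        db_q.put(parse_entries(feed, raw))         # hypothetical parse step

if __name__ == '__main__':
    fetch_q, parse_q, db_q = mp.Queue(), mp.Queue(), mp.Queue()
    for _ in range(4):
        mp.Process(target=fetcher, args=(fetch_q, parse_q)).start()
    for _ in range(2):
        mp.Process(target=parser, args=(parse_q, db_q)).start()
    poller(fetch_q)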
Yes - but you might run into an issue which would require the fix I committed to Python trunk yesterday. See bug http://bugs.python.org/issue5313
Sure you can. Especially if you are using fork to spawn child processes, they work as perfectly normal processes (like the parent). Thread management is quite different, but you can also use "second level" sub-threading.
Take care not to over-complicate your program; for example, programs with two levels of threads are rarely needed.
