Best way to implement a single worker in Flask - python

I have a spider that downloads pages and stores data in a database. I have created a Flask application with an admin panel (via the Flask-Admin extension) that shows the database.
Now I want to add a function to my Flask app to control the spider's state: switch it on/off.
I think it is possible with threads or multiprocessing. Celery is not a good choice because the whole program must use minimal memory.
Which method should I choose to implement this function?

Discounting Celery based on memory usage would probably be a mistake, as Celery has low overhead in both time and space. In fact, using Celery+Flask does not use much more memory than using Flask alone.
In addition, Celery comes with several configuration choices that can have an impact
on the amount of memory used. For example, there are 5 different pool implementations, all with different strengths and trade-offs; the pool choices are:
multiprocessing
By default Celery uses multiprocessing, which means that it will spawn child processes
to offload work to. This is the most memory-expensive option, simply because
every child process duplicates the amount of base memory needed.
But Celery also comes with an autoscale feature that will kill off worker
processes when there's little work to do, and spawn new processes when there's more work:
$ celeryd --autoscale=10,0
where 10 is the maximum number of processes, and 0 is the minimum. Here celeryd will
start off with no child processes, and grow based on load up to a maximum of 10 processes. When the load decreases, so will the number of worker processes.
eventlet/gevent
When using the eventlet/gevent pools only a single process will be used, and thus it will
use a lot less memory, with the downside that tasks calling blocking code will
block other tasks from executing. If your tasks are mostly I/O bound you should be OK,
and you can also combine different pools and send problem tasks to a multiprocessing pool instead (a routing sketch follows the list of pools below).
threads
Celery also comes with a pool using threads.
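As mentioned in the eventlet/gevent section above, pools can be combined by routing tasks to different queues and running one worker per pool. A minimal sketch with Celery 2.x-era settings; the task and queue names here are made up for illustration:
# celeryconfig.py
CELERY_ROUTES = {
    'tasks.crunch_numbers': {'queue': 'cpu'},  # blocking, CPU-bound work
    'tasks.fetch_page': {'queue': 'io'},       # I/O-bound work
}
# Then start one worker per pool, each consuming only its own queue:
#   $ celeryd --pool=processes -Q cpu
#   $ celeryd --pool=eventlet --concurrency=100 -Q io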
The development version that will become version 2.6 includes a lot of optimizations,
and there is no longer any need for the Flask-Celery extension module. If you are not going
into production in the next few days then I would encourage you to try the development version,
which can be installed like this:
$ pip install https://github.com/ask/kombu/zipball/master
$ pip install https://github.com/ask/celery/zipball/master
The new API is now also Flask inspired, so you should read the new getting started guide:
http://ask.github.com/celery/getting-started/first-steps-with-celery.html
With all this said, most optimization work has so far been focused on execution speed,
and there are probably many more memory optimizations that can be made. It has not been a request so far, but in the unlikely event that Celery does not match your memory constraints, you can open an issue at our bug tracker and I'm sure it will get focus, or you can even help us do so.

You could supervise the process using multiprocessing or subprocess, then just hand the handle around via the session.
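A minimal sketch of that idea, assuming the spider's entry point is an importable function run_spider() (a hypothetical name) and that the Flask app runs as a single process:
import multiprocessing
from flask import Flask

app = Flask(__name__)
spider_proc = None  # module-level handle to the spider process

@app.route('/spider/start')
def start_spider():
    global spider_proc
    if spider_proc is None or not spider_proc.is_alive():
        spider_proc = multiprocessing.Process(target=run_spider)  # hypothetical entry point
        spider_proc.start()
    return 'spider running'

@app.route('/spider/stop')
def stop_spider():
    global spider_proc
    if spider_proc is not None and spider_proc.is_alive():
        spider_proc.terminate()  # hard stop; a shared multiprocessing.Event allows a graceful one
        spider_proc.join()
    return 'spider stopped'
Note that the module-level handle only works while the app runs in a single process; with several WSGI workers, each would hold its own handle.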

Related

Django - background processing without using any external message-broker and related packages

I have to solve this issue: I have a 4GB/4-core Ubuntu Server 18.04 LTS machine running my Django application, Postgres, Redis, and some other services that support my web application.
The nature of the application is quite simple: every 5 minutes I have to collect data from remote devices via the SNMP protocol and save it to my DB; the total number of devices is around 300.
I want to start this task without slowing down my application, so I've offloaded it and tried threads, multiprocessing, and Django-RQ/python-rq.
Now, here are the results:
Threads: 1800 seconds
Multiprocessing: 180 seconds (using fork as the default context, but I'm getting random deadlocks, so it is totally unusable after all)
Python-RQ and Django-RQ with 12 workers: 180/200 seconds
Now, it seems that my operations are all I/O bound, so, excluding the multiprocessing approach, the background worker with python-rq seems effective, but I'm finding that it eats up all my memory.
Do you have a solution? I mean: the multiprocessing way with fork as the default context is totally unusable due to the random deadlocks, but the python-rq/django-rq solution is functional yet heavy!
How can I use the multiprocessing module with spawn as the default context, load Django, and then feed the created processes the tasks to execute?
Do you have any alternatives?
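For what it's worth, a minimal sketch of the spawn-based approach asked about above; the module and function names are hypothetical, and each spawned worker has to set up Django itself, since nothing is inherited the way it is with fork:
import os
import multiprocessing as mp

def init_worker():
    # Spawned children start with a fresh interpreter, so configure Django here.
    os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'myproject.settings')  # hypothetical
    import django
    django.setup()

def poll_device(ip):
    ...  # hypothetical: SNMP query against one device, then save to the DB

if __name__ == '__main__':
    ctx = mp.get_context('spawn')
    pool = ctx.Pool(processes=12, initializer=init_worker)
    pool.map(poll_device, device_ips)  # device_ips: the ~300 device addresses
    pool.close()
    pool.join()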

ThreadPoolExecutor on long running process

I want to use ThreadPoolExecutor in a web app (Django).
All the examples I've seen use the thread pool like this:
with ThreadPoolExecutor(max_workers=1) as executor:
    code
I tried to store the thread pool as a class member and use the map function,
but I got a memory leak; the only way I could use it is with the with notation.
So I have 2 questions:
Each time I run with ThreadPoolExecutor, does it create the threads again and then release them? In other words, is this operation expensive?
If I avoid using with, how can I release the memory of the threads?
thanks
Normally, web applications are stateless. That means every object you create should live in a request and die at the end of the request. That includes your ThreadPoolExecutor. Having an executor at the application level may work, but it will be embedded in your web application instead of running as a separate group of processes.
So if you want to take the workers down or restart them, your web app will have to restart as well.
And there will be stability concerns, since there is no main process watching over the child processes to detect which ones have gone stale; getting multiprocessing right requires a lot of code.
Alternatively, if you want a persistent group of processes to listen to a job queue and run your tasks, there are several projects that do that for you. All you need to do is set up a server that takes care of queueing and locking, such as redis or rabbitmq, then point your project at that server and start the workers. Some projects even let you use the database as a job queue backend.
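For example, a minimal sketch with RQ on top of redis; it assumes a redis server on localhost, and long_running_job stands in for your actual task function:
from redis import Redis
from rq import Queue

q = Queue(connection=Redis())

def handle_request(arg):
    # The web request only enqueues the work and returns immediately;
    # separate worker processes (started with the `rq worker` command) execute it.
    job = q.enqueue(long_running_job, arg)  # hypothetical task function
    return job.id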

Celery parallel distributed task with multiprocessing

I have a CPU-intensive Celery task. I would like to use all the processing power (cores) across lots of EC2 instances to get this job done faster (a celery parallel distributed task with multiprocessing, I think).
Threading, multiprocessing, distributed computing, and distributed parallel processing are all terms I'm trying to understand better.
Example task:
@app.task
def do_work(list_of_millions_of_ids):
    for item in list_of_millions_of_ids:
        id = item  # do some long complicated equation here, very CPU heavy!!!!!!!
        database.objects(newid=id).save()
Using the code above (with an example if possible), how would one go about distributing this task using Celery, allowing this one task to be split up to utilize all the CPU power across all the available machines in the cloud?
Your goals are:
Distribute your work to many machines (distributed computing / distributed parallel processing)
Distribute the work on a given machine across all CPUs (multiprocessing/threading)
Celery can do both of these for you fairly easily. The first thing to understand is that each celery worker is configured by default to run as many tasks as there are CPU cores available on a system:
Concurrency is the number of prefork worker process used to process
your tasks concurrently, when all of these are busy doing work new
tasks will have to wait for one of the tasks to finish before it can
be processed.
The default concurrency number is the number of CPU’s on that machine
(including cores), you can specify a custom number using -c option.
There is no recommended value, as the optimal number depends on a
number of factors, but if your tasks are mostly I/O-bound then you can
try to increase it, experimentation has shown that adding more than
twice the number of CPU’s is rarely effective, and likely to degrade
performance instead.
This means each individual task doesn't need to worry about using multiprocessing/threading to make use of multiple CPUs/cores. Instead, celery will run enough tasks concurrently to use each available CPU.
With that out of the way, the next step is to create a task that handles processing some subset of your list_of_millions_of_ids. You have a couple of options here - one is to have each task handle a single ID, so you run N tasks, where N == len(list_of_millions_of_ids). This will guarantee that work is evenly distributed amongst all your tasks since there will never be a case where one worker finishes early and is just waiting around; if it needs work, it can pull an id off the queue. You can do this (as mentioned by John Doe) using the celery group.
tasks.py:
@app.task
def process_ids(item):
    id = item  # long complicated equation here
    database.objects(newid=id).save()
And to execute the tasks:
from celery import group
from tasks import process_ids

jobs = group(process_ids.s(item) for item in list_of_millions_of_ids)
result = jobs.apply_async()
Another option is to break the list into smaller pieces and distribute the pieces to your workers. This approach runs the risk of wasting some cycles, because you may end up with some workers waiting around while others are still doing work. However, the celery documentation notes that this concern is often unfounded:
Some may worry that chunking your tasks results in a degradation of
parallelism, but this is rarely true for a busy cluster and in
practice since you are avoiding the overhead of messaging it may
considerably increase performance.
So, you may find that chunking the list and distributing the chunks to each task performs better, because of the reduced messaging overhead. You can probably also lighten the load on the database a bit this way, by calculating each id, storing it in a list, and then adding the whole list to the DB once you're done, rather than doing it one id at a time. The chunking approach would look something like this:
tasks.py:
@app.task
def process_ids(items):
    for item in items:
        id = item  # long complicated equation here
        database.objects(newid=id).save()  # Still adding one id at a time, but you don't have to.
And to start the tasks:
from celery import group
from tasks import process_ids

# Break the list into pieces of 30 ids each. Experiment with what size works best here.
jobs = group(process_ids.s(list_of_millions_of_ids[i:i + 30])
             for i in range(0, len(list_of_millions_of_ids), 30))
jobs.apply_async()
You can experiment a bit with what chunking size gives you the best result. You want to find a sweet spot where you're cutting down messaging overhead while also keeping the size small enough that you don't end up with workers finishing their chunk much faster than another worker, and then just waiting around with nothing to do.
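As a sketch of the batched-write idea mentioned above, reusing the app and database placeholders from the snippets in this answer; the bulk_save call is hypothetical, so use whatever your ORM actually provides (e.g. bulk_create in Django):
@app.task
def process_ids(items):
    results = []
    for item in items:
        newid = compute(item)  # the long complicated equation, hypothetical helper
        results.append(database.objects(newid=newid))
    database.bulk_save(results)  # hypothetical: one round trip instead of len(items)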
In the world of distribution there is only one thing you should remember above all:
"Premature optimization is the root of all evil." (D. Knuth)
I know it sounds obvious, but before distributing, double-check that you are using the best algorithm (if it exists...).
Having said that, optimizing distribution is a balancing act between 3 things:
Writing/Reading data from a persistent medium,
Moving data from medium A to medium B,
Processing data,
Computers are made so that the closer you get to your processing unit (3), the faster and more efficient (1) and (2) will be. The order in a classic cluster will be: network hard drive, local hard drive, RAM, inside processing unit territory...
Nowadays processors are becoming sophisticated enough to be considered an ensemble of independent hardware processing units, commonly called cores; these cores process data (3) through threads (2).
Imagine your core is so fast that when you send data with one thread you are using 50% of the computing power; if the core has 2 threads you will then use 100%. Two threads per core is called hyper-threading, and your OS will see 2 CPUs per hyper-threaded core.
Managing threads in a processor is commonly called multi-threading.
Managing CPUs from the OS is commonly called multi-processing.
Managing concurrent tasks in a cluster is commonly called parallel programming.
Managing dependent tasks in a cluster is commonly called distributed programming.
So where is your bottleneck ?
In (1): Try to persist and stream from the upper level (the one closer to your processing unit); for example, if the network hard drive is slow, first save to the local hard drive.
In (2): This is the most common one. Try to avoid communication packets not needed for the distribution, or compress packets "on the fly" (for example, if the HD is slow, save only a "batch computed" message and keep the intermediate results in RAM).
In (3): You are done! You are using all the processing power at your disposal.
What about Celery ?
Celery is a messaging framework for distributed programming that uses a broker module for communication (2) and a backend module for persistence (1). This means that, by changing the configuration, you will be able to avoid most bottlenecks (if possible) on your network, and only on your network.
First profile your code to achieve the best performance in a single computer.
Then use celery in your cluster with the default configuration and set CELERY_RESULT_PERSISTENT=True:
from celery import Celery

app = Celery('tasks',
             broker='amqp://guest@localhost//',
             backend='redis://localhost')

@app.task
def process_id(all_the_data_parameters_needed_to_process_in_this_computer):
    # code that does stuff
    return result
During execution, open your favorite monitoring tools; I use the default for RabbitMQ, flower for celery, and top for the CPUs. Your results will be saved in your backend. An example of a network bottleneck is the task queue growing so much that it delays execution; if you see that, you can proceed to change modules or the celery configuration; if not, your bottleneck is somewhere else.
Why not use the group celery task for this?
http://celery.readthedocs.org/en/latest/userguide/canvas.html#groups
Basically, you should divide the ids into chunks (or ranges) and give them to a bunch of tasks in a group.
For something more sophisticated, like aggregating the results of particular celery tasks, I have successfully used the chord task for a similar purpose:
http://celery.readthedocs.org/en/latest/userguide/canvas.html#chords
Increase settings.CELERYD_CONCURRENCY to a number that is reasonable and that you can afford; those celery workers will then keep executing your tasks in a group or a chord until done.
Note: due to a bug in kombu there was trouble with reusing workers for a high number of tasks in the past; I don't know if it's fixed now. Maybe it is, but if not, reduce CELERYD_MAX_TASKS_PER_CHILD.
Example based on simplified and modified code I run:
from celery import chord

@app.task
def do_matches():
    match_data = ...
    result = chord(single_batch_processor.s(m) for m in match_data)(summarize.s())
summarize gets the results of all single_batch_processor tasks. Every task runs on any Celery worker; kombu coordinates that.
Now I get it: single_batch_processor and summarize ALSO have to be celery tasks, not regular functions - otherwise of course it will not be parallelized (I'm not even sure the chord constructor will accept it if it's not a celery task).
Adding more celery workers will certainly speed up executing the task. You might have another bottleneck though: the database. Make sure it can handle the simultaneous inserts/updates.
Regarding your question: you add celery workers by assigning another process on your EC2 instances as celeryd. Depending on how many workers you need, you might want to add even more instances.

Parallel processing within a queue (using Pool within Celery)

I'm using Celery to queue jobs from a CGI application I made. The way I've set it up, Celery makes each job run one or two at a time by setting CELERYD_CONCURRENCY = 1 or = 2 (so they don't crowd the processor or thrash from memory consumption). The queue works great, thanks to advice I got on StackOverflow.
Each of these jobs takes a fair amount of time (~30 minutes serial), but is embarrassingly parallelizable. For this reason I was using Pool.map to split the work and do it in parallel. It worked great from the command line, and I got runtimes around 5 minutes using a new many-cored chip.
Unfortunately, there is a limitation that does not allow daemonic processes to have subprocesses, and when I run the fancy parallelized code within the CGI queue, I get this error:
AssertionError: daemonic processes are not allowed to have children
I noticed other people have had similar questions, but I can't find an answer that wouldn't require abandoning Pool.map altogether, and making more complicated thread code.
What is the appropriate design choice here? I can easily run my serial jobs using my Celery queue. I can also run my much faster parallelized jobs without a queue. How should I approach this, and is it possible to get what I want (both the queue and the per-job parallelization)?
A couple of ideas I've had (some are quite hacky):
The job sent to the Celery queue simply calls the command-line program. That program can use Pool as it pleases, and then saves the result figures & data to a file (just as it does now). Downside: I won't be able to check on the status of the job or see whether it terminated successfully. Also, system calls from CGI may cause security issues.
Obviously, if the queue is very full of jobs, I can make use of the CPU resources (by setting CELERYD_CONCURRENCY = 6 or so); this will allow many people to be "at the front of the queue" at once. Downside: each job will spend a lot of time at the front of the queue; if the queue isn't full, there will be no speedup. Also, many partially finished jobs will be stored in memory at the same time, using much more RAM.
Use Celery's @task to parallelize within sub-jobs. Then, instead of setting CELERYD_CONCURRENCY = 1, I would set it to 6 (or however many sub-jobs I'd like to allow in memory at a time). Downside: first of all, I'm not sure whether this will successfully avoid the "task-within-task" problem. But also, the notion of queue position may be lost, and many partially finished jobs may end up in memory at once.
Perhaps there is a way to call Pool.map and specify that the processes are non-daemonic? Or perhaps there is something more lightweight I can use instead of Pool.map? (A sketch of such a workaround appears after this list.) This is similar to an approach taken in another open StackOverflow question. Also, I should note that the parallelization I exploit via Pool.map is similar to linear algebra, and there is no inter-process communication (each worker just runs independently and returns its result without talking to the others).
Throw away Celery and use multiprocessing.Queue. Then maybe there'd be some way to use the same "thread depth" for every thread I use (i.e. maybe all of the threads could use the same Pool, avoiding nesting)?
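For the fourth idea, the classic workaround is a Pool subclass whose worker processes report themselves as non-daemonic, so they are allowed to have children. This sketch relies on multiprocessing internals of the Python 2.x/early 3.x era, so treat it as fragile:
import multiprocessing
import multiprocessing.pool

class NoDaemonProcess(multiprocessing.Process):
    # Make the 'daemon' attribute always return False so the
    # "daemonic processes are not allowed to have children" check passes.
    @property
    def daemon(self):
        return False

    @daemon.setter
    def daemon(self, value):
        pass  # silently ignore attempts to mark the process daemonic

class NoDaemonPool(multiprocessing.pool.Pool):
    Process = NoDaemonProcess  # the pool spawns NoDaemonProcess workers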
Thanks a lot in advance.
What you need is a workflow management system (WFMS) that manages
task concurrency
task dependency
task nesting
among other things.
From a very high-level view, a WFMS sits on top of a task pool like celery and submits the tasks which are ready to execute to the pool. It is also responsible for opening up a nest and submitting the tasks in the nest accordingly.
I've developed a system to do just that. It's called pomsets. Try it out, and feel free to send me any questions.
I use multiprocess daemons based on Twisted with forking, and normally queue jobs via Gearman.
Take a look at Gearman.

how to process long-running requests in python workers?

I have a python (well, it's PHP now, but we're rewriting it) function that takes some parameters (A and B) and computes some results (finds the best path from A to B in a graph; the graph is read-only). In the typical scenario one call takes 0.1s to 0.9s to complete. The function is accessed by users as a simple REST web service (GET bestpath.php?from=A&to=B). The current implementation is quite stupid: it's a simple PHP script + Apache + mod_php + APC; every request needs to load all the data (over 12MB in PHP arrays), create all the structures, compute a path, and exit. I want to change that.
I want a setup with N independent workers (X per server with Y servers), where each worker is a python app running in a loop (getting a request -> processing -> sending the reply -> getting a request...), and each worker can process one request at a time. I need something that will act as a frontend: get requests from users, manage a queue of requests (with a configurable timeout), and feed my workers one request at a time.
How do I approach this? Can you propose some setup? nginx + fcgi or wsgi or something else? haproxy? As you can see, I'm a newbie at python, reverse proxies, etc. I just need a starting point for the architecture (and the data flow).
btw. the workers use read-only data, so there is no need for locking or communication between them.
The typical way to handle this sort of arrangement using threads in Python is to use the standard library module Queue. An example of using the Queue module for managing workers can be found here: Queue Example
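A minimal sketch of that pattern; handle_request is a hypothetical stand-in for the path computation and reply, and on Python 3 the module is spelled queue:
import threading
import Queue  # "queue" on Python 3

tasks = Queue.Queue()

def worker():
    while True:
        a, b = tasks.get()
        try:
            handle_request(a, b)  # hypothetical: compute the path, send the reply
        finally:
            tasks.task_done()

for _ in range(4):  # N workers per server
    t = threading.Thread(target=worker)
    t.daemon = True  # don't block interpreter exit
    t.start()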
Looks like you need the "workers" to be separate processes (at least some of them, and therefore you might as well make them all separate processes rather than bunches of threads divided into several processes). The multiprocessing module in the standard library of Python 2.6 and later offers good facilities to spawn a pool of processes and communicate with them via FIFO "queues"; if for some reason you're stuck with Python 2.5 or even earlier, there are versions of multiprocessing on PyPI that you can download and use with those older versions of Python.
The "frontend" can and should be pretty easily made to run with WSGI (with either Apache or Nginx), and it can deal with all communications to/from worker processes via multiprocessing, without the need to use HTTP, proxying, etc, for that part of the system; only the frontend would be a web app per se, the workers just receive, process and respond to units of work as requested by the frontend. This seems the soundest, simplest architecture to me.
There are other distributed processing approaches available in third party packages for Python, but multiprocessing is quite decent and has the advantage of being part of the standard library, so, absent other peculiar restrictions or constraints, multiprocessing is what I'd suggest you go for.
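A sketch of that layout, where load_graph and find_best_path are hypothetical stand-ins for your own code: the frontend owns a pool of worker processes, and each worker loads the read-only graph once at startup and then answers path queries.
import multiprocessing

graph = None  # per-worker copy of the read-only data

def init_worker():
    global graph
    graph = load_graph()  # hypothetical: the ~12MB read-only graph, loaded once

def best_path(pair):
    a, b = pair
    return find_best_path(graph, a, b)  # hypothetical path computation

pool = multiprocessing.Pool(processes=4, initializer=init_worker)

def handle_request(a, b):
    # Called from the WSGI frontend; the timeout keeps the queue configurable.
    return pool.apply_async(best_path, ((a, b),)).get(timeout=30)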
There are many FastCGI modules with a preforked mode and a WSGI interface for python around; the best known is flup. My personal preference for such a task is superfcgi with nginx. Both will launch several processes and dispatch requests to them. 12MB is not that much to load separately in each process, but if you'd like to share data among workers you need threads, not processes. Note that heavy math in python with a single process and many threads won't use several CPUs/cores efficiently due to the GIL. Probably the best approach is to use several processes (as many as you have cores), each running several threads (the default mode in superfcgi).
The simplest solution in this case is to use the webserver to do all the heavy lifting. Why should you handle threads and/or processes when the webserver will do all that for you?
The standard arrangement in deployments of Python is:
The webserver starts a number of processes, each running a complete python interpreter and loading all your data into memory.
An HTTP request comes in and gets dispatched to some process.
The process does your calculation and returns the result directly to the webserver and user.
When you need to change your code or the graph data, you restart the webserver and go back to step 1.
This is the architecture used by Django and other popular web frameworks.
I think you can configure mod_wsgi/Apache so it will have several "hot" Python interpreters in separate processes ready to go at all times, and also reuse them for new accesses (and spawn a new one if they are all busy).
In this case you could load all the preprocessed data as module globals; they would only get loaded once per process and get reused for each new access. In fact, I'm not sure this isn't the default configuration for mod_wsgi/Apache.
The main problem here is that you might end up consuming a lot of "core" memory (but that may not be a problem either).
I think you can also configure mod_wsgi for single process/multiple threads -- but in that case you may only be using one CPU because of the Python Global Interpreter Lock (the infamous GIL), I think.
Don't be afraid to ask on the mod_wsgi mailing list -- they are very responsive and friendly.
You could use an nginx load balancer to proxy to PythonPaste paster (which serves WSGI, for example for Pylons); paster launches each request as a separate thread anyway.
Another option is a queue table in the database.
The worker processes run in a loop or off cron and poll the queue table for new jobs.
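A sketch of that polling loop; the SQL helpers are hypothetical, and a real implementation needs row locking (e.g. SELECT ... FOR UPDATE) so two workers cannot claim the same job:
import time

def worker_loop(conn):
    while True:
        job = claim_next_job(conn)  # hypothetical: atomically mark one pending row as running
        if job is None:
            time.sleep(5)  # queue is empty; poll again shortly
            continue
        try:
            run_job(job)  # hypothetical: the actual work
            mark_done(conn, job)
        except Exception:
            mark_failed(conn, job)  # record the failure so the job can be retried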
