I have a Django application that runs background tasks using the Celery library, and I need to obtain and store the maximum memory usage of a task.
I've tried memory_usage from the memory_profiler library, but I cannot use this function inside a task because I get the error: "daemonic processes are not allowed to have children". I've also tried the memory_usage function outside the task, to monitor the task.async call, but for some reason the task is triggered twice.
All the other approaches I have found consist of checking the memory usage at different points in the code and taking the maximum. My feeling is that this is very inaccurate: calls with high memory usage could be missed entirely if garbage collection runs before I manage to check the current memory usage.
The official documentation has some useful functions, but they would still rely on the sampling method above. https://docs.celeryproject.org/en/latest/reference/celery.utils.debug.html
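For reference, this is a minimal sketch of the "sample in several places" approach I'm describing, using only the standard library on Linux (the task body and the amount of work are made-up stand-ins):

import resource
from celery import shared_task

def peak_rss_kb():
    # Peak resident set size of this process so far (kilobytes on Linux)
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

@shared_task
def my_task(n):
    peak = peak_rss_kb()
    data = [0] * n                    # stand-in for real work that allocates memory
    peak = max(peak, peak_rss_kb())
    del data
    peak = max(peak, peak_rss_kb())
    return peak                       # store the observed maximum somewhere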
Thanks in advance!
Why not a controller task?
Celery's infrastructure lets you query the current status of all workers:
from celery import Celery
app = Celery(...)
app.control.inspect().active()
This can be used inside a task to poll the cluster every few seconds and see what's happening.
I've used a similar approach to identify and send the kill() command between tasks. My tasks are killable, so each of them knows how to handle a soft kill.
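A rough sketch of such a controller task, assuming a broker at pyamqp://localhost and that printing what each worker is doing is enough for your needs:

import time
from celery import Celery

app = Celery('proj', broker='pyamqp://localhost')   # assumed broker URL

@app.task
def controller(poll_interval=5, rounds=12):
    # Poll the cluster a few times and report what every worker is doing.
    for _ in range(rounds):
        active = app.control.inspect().active() or {}
        for worker, tasks in active.items():
            print(worker, [t['name'] for t in tasks])
        time.sleep(poll_interval)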
Related
The problem is a very serious memory leak until the server crashes (or you could recover by killing the celery worker service, which releases all the RAM used)
There seem to be a number of reported bugs on this matter, but very little attention is paid to this warning in the Celery API docs, here:
Warning:
Backends use resources to store and transmit results. To ensure that resources are released, you must eventually call get() or forget() on EVERY AsyncResult instance returned after calling a task.
And it is reasonable to assume that the leak is related to this warning.
But the conceptual problem, based on my understanding of Celery, is that AsyncResult instances are created across multiple Django views within a user session: some are created when you initiate/spawn new tasks in one view, and some you may create later manually (using a task_id saved in the user session) to check on the progress (state) of those tasks in another view.
Therefore, AsyncResult objects will eventually go out of scope across multiple views in a real-world Django application, and you don't want to call get() in ANY of these views, because you don't want to slow down the Django (or Apache2) daemon process.
Is the solution to never let AsyncResult Objects go out of scope before calling their get() method?
CELERY_RESULT_BACKEND = 'django-db' #backend is a mysql DB
BROKER_URL = 'pyamqp://localhost' #rabbitMQ
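For illustration, this is the pattern I mean; the view and task names are made up, but it shows how an AsyncResult goes out of scope in one view and is re-created from the session in another (and where one could call forget() to release the backend row the warning refers to):

from celery.result import AsyncResult
from django.http import JsonResponse
from myapp.tasks import generate_report   # hypothetical task

def start_report(request):
    result = generate_report.delay(request.user.id)
    request.session['report_task_id'] = result.id   # the AsyncResult itself goes out of scope here
    return JsonResponse({'task_id': result.id})

def report_status(request):
    result = AsyncResult(request.session['report_task_id'])
    if result.successful():
        payload = result.get(timeout=1)   # the task has already finished, so this returns quickly
        result.forget()                   # release the backend resources the warning refers to
        return JsonResponse({'state': 'done', 'data': payload})
    return JsonResponse({'state': result.state})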
We also faced multiple issues with celery in production, and also tackled a memory leak issue. I'm not sure if our problem scope is the same, but if you don't mind you could try out our solution.
You see, we had multiple tasks running on a couple of workers managed by supervisor (all workers were on the same queue). What we saw was that when a lot of tasks were being queued, the broker (in our case RabbitMQ) was sending only as many tasks as our Celery workers could process and keeping the rest in memory. This caused our memory to overflow, and the broker started paging to our hard drive. We found out from reading the docs that if we allow our broker not to wait for worker results, this issue could be resolved. Thus, in our tasks we used the options:
@task(time_limit=10, ignore_result=True)
def ggwp():
    # do sth
Here, the time limit would close the task after a certain amount of time, and the ignore_result option would allow the broker to just hand a task to a Celery worker as soon as one is freed.
Is it OK to run certain pieces of code asynchronously in a Django web app? If so, how?
For example:
I have a search algorithm that returns hundreds or thousands of results. I want to record in the database that these items were the result of the search, so I can see what users are searching for most. I don't want the client to have to wait for hundreds or thousands of extra database inserts. Is there a way I can do this asynchronously? Is there any danger in doing so? Is there a better way to achieve this?
As far as Django is concerned, yes.
The bigger concern is your web server and whether it plays nicely with threading. For instance, the sync workers of gunicorn are single-threaded, but there are other worker types, such as greenlet-based ones. I'm not sure how well they play with threads.
Combining threading and multiprocessing can be an issue if you're forking from threads:
Status of mixing multiprocessing and threading in Python
http://bugs.python.org/issue6721
That being said, I know of popular performance-analytics utilities that have been using threads to report metrics, so it seems to be an accepted practice.
In sum, it seems safest to use the threading.Thread object from the standard library, as long as whatever you do in it doesn't fork (via Python's multiprocessing library).
https://docs.python.org/2/library/threading.html
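For example, a minimal sketch of that approach for the search-logging case (the model and search function names are made up):

import threading
from django.shortcuts import render
from myapp.models import SearchHit        # hypothetical model

def record_hits(query, item_ids):
    # Runs in a background thread; the response does not wait for these inserts.
    SearchHit.objects.bulk_create(
        [SearchHit(query=query, item_id=pk) for pk in item_ids]
    )

def search_view(request):
    query = request.GET.get('q', '')
    results = run_search(query)            # hypothetical search call
    t = threading.Thread(target=record_hits, args=(query, [r.pk for r in results]))
    t.daemon = True                        # don't block process shutdown on this thread
    t.start()
    return render(request, 'results.html', {'results': results})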
Offloading requests from the main thread is a common practice, as the end goal is to return a result to the client (browser) as quickly as possible.
As I am sure you are aware, HTTP is blocking - so until you return a response, the client cannot do anything (it is blocked, in a waiting state).
The de facto way of offloading work is through Celery, which is a task queuing system.
I highly recommend you read the Celery introduction, but in summary here is what happens:
You mark certain pieces of code as "tasks". These are usually functions that you want to run asynchronously.
Celery manages workers - you can think of them as threads - that will run these tasks.
To communicate with the workers, a message queue is required. RabbitMQ is the one most often recommended.
Once you have all the components running (it takes but a few minutes), your workflow goes like this:
In your view, when you want to offload some work, you call the function that does that work with the .delay() method. This will trigger a worker to start executing the function in the background.
Your view then returns a response immediately.
You can then check for the result of the task, and take appropriate actions based on what needs to be done. There are ways to track progress as well.
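Putting those steps together, a minimal sketch (the task and view names are made up):

from django.http import JsonResponse
from celery import shared_task

@shared_task
def analyse_keywords(query):                        # hypothetical task
    # the expensive work runs in a Celery worker, not in the web process
    ...

def search_view(request):
    query = request.GET.get('q', '')
    async_result = analyse_keywords.delay(query)    # queued; a free worker picks it up
    request.session['task_id'] = async_result.id    # keep the id to check state/progress later
    return JsonResponse({'queued': True})           # respond immediately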
It is also good practice to include caching, so that you are not executing expensive tasks unnecessarily. For example, you might choose to offload a request to do some analytics on search keywords that will be placed in a report.
Once the report is generated, I would cache the results (if applicable) so that the same report can be displayed if requested later, rather than being generated again.
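A small sketch of that caching step using Django's cache framework (the key scheme and report function are assumptions):

from django.core.cache import cache

def keyword_report(query):
    key = 'keyword-report:%s' % query
    report = cache.get(key)
    if report is None:
        report = build_keyword_report(query)   # hypothetical expensive call
        cache.set(key, report, 60 * 60)        # keep it for an hour
    return report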
I have a spider that downloads pages and stores data in a database. I have created a Flask application with an admin panel (using the Flask-Admin extension) that displays the database.
Now I want to add a function to my Flask app to control the spider's state: switch it on/off.
I think this is possible with threads or multiprocessing. Celery is not a good choice because the whole program must use as little memory as possible.
Which method should I choose to implement this feature?
Discounting Celery based on memory usage would probably be a mistake, as Celery has low overhead in both time and space. In fact, using Celery+Flask does not use much more memory than using Flask alone.
In addition, Celery comes with several choices you can make that have an impact
on the amount of memory used. For example, there are 5 different pool implementations that all have different strengths and trade-offs; the pool choices are:
multiprocessing
By default Celery uses multiprocessing, which means that it will spawn child processes
to offload work to. This is the most memory-expensive option, simply because
every child process duplicates the amount of base memory needed.
But Celery also comes with an autoscale feature that will kill off worker
processes when there's little work to do, and spawn new processes when there's more work:
$ celeryd --autoscale=0,10
where 0 is the minimum number of processes and 10 is the maximum. Here celeryd will
start off with no child processes and grow based on load, up to a maximum of 10 processes. When load decreases, so will the number of worker processes.
eventlet/gevent
When using the eventlet/gevent pools only a single process will be used, and thus it will
use a lot less memory, but with the downside that tasks calling blocking code will
block other tasks from executing. If your tasks are mostly I/O bound you should be ok,
and you can also combine different pools and send problem tasks to a multiprocessing pool instead.
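As a rough illustration of that combination, you could route I/O-bound tasks to an eventlet worker and blocking/CPU-bound tasks to a process-pool worker. The queue and task names here are made up, and the exact flags may differ between Celery versions:

# settings
CELERY_ROUTES = {
    'myapp.tasks.fetch_url': {'queue': 'io'},         # I/O bound, fine on eventlet
    'myapp.tasks.crunch_numbers': {'queue': 'cpu'},   # blocking work, keep on processes
}

# one worker per pool type
$ celeryd --pool=eventlet --concurrency=100 -Q io
$ celeryd --pool=processes --concurrency=4 -Q cpu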
threads
Celery also comes with a pool using threads.
The development version that will become version 2.6 includes a lot of optimizations,
and there is no longer any need for the Flask-Celery extension module. If you are not going
into production in the next days then I would encourage you to try the development version
which must be installed like this:
$ pip install https://github.com/ask/kombu/zipball/master
$ pip install https://github.com/ask/celery/zipball/master
The new API is now also Flask inspired, so you should read the new getting started guide:
http://ask.github.com/celery/getting-started/first-steps-with-celery.html
With all this said, most optimization work has been focused on execution speed so far,
and there are probably many more memory optimizations that can be made. It has not been requested so far, but in the unlikely event that Celery does not match your memory constraints, you can open an issue at our bug tracker and I'm sure it will get attention, or you can even help us do so.
You could supervise the process using multiprocessing or subprocess, then just hand the handle around in the session.
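A bare-bones sketch of that idea with subprocess (the spider command is an assumption; note that a Popen handle can't be stored in the session itself, so here it lives in a module-level dict keyed by name):

import subprocess

spiders = {}   # name -> Popen handle; Popen objects aren't serializable into a session

def start_spider(name):
    proc = subprocess.Popen(['python', 'spider.py', name])   # assumed command
    spiders[name] = proc
    return proc.pid

def stop_spider(name):
    proc = spiders.pop(name, None)
    if proc is not None and proc.poll() is None:   # still running
        proc.terminate()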
I'm using Celery to queue jobs from a CGI application I made. The way I've set it up, Celery makes each job run one- or two-at-a-time by setting CELERYD_CONCURRENCY = 1 or = 2 (so they don't crowd the processor or thrash from memory consumption). The queue works great, thanks to advice I got on StackOverflow.
Each of these jobs takes a fair amount of time (~30 minutes serially), but is embarrassingly parallel. For this reason, I was using Pool.map to split the work and do it in parallel. It worked great from the command line, and I got runtimes around 5 minutes using a new many-core chip.
Unfortunately, there is a limitation that does not allow daemonic processes to have child processes, and when I run the fancy parallelized code within the CGI queue, I get this error:
AssertionError: daemonic processes are not allowed to have children
I noticed other people have had similar questions, but I can't find an answer that wouldn't require abandoning Pool.map altogether, and making more complicated thread code.
What is the appropriate design choice here? I can easily run my serial jobs using my Celery queue. I can also run my much faster parallelized jobs without a queue. How should I approach this, and is it possible to get what I want (both the queue and the per-job parallelization)?
A couple of ideas I've had (some are quite hacky):
The job sent to the Celery queue simply calls the command line program. That program can use Pool as it pleases, and then saves the result figures & data to a file (just as it does now). Downside: I won't be able to check on the status of the job or see if it terminated successfully. Also, system calls from CGI may cause security issues.
Obviously, if the queue is very full of jobs, I can make use of the CPU resources (by setting CELERYD_CONCURRENCY = 6 or so); this will allow many people to be "at the front of the queue" at once. Downside: Each job will spend a lot of time at the front of the queue; if the queue isn't full, there will be no speedup. Also, many partially finished jobs will be stored in memory at the same time, using much more RAM.
Use Celery's @task to parallelize within sub-jobs. Then, instead of setting CELERYD_CONCURRENCY = 1, I would set it to 6 (or however many sub-jobs I'd like to allow in memory at a time). Downside: First of all, I'm not sure whether this will successfully avoid the "task-within-task" problem. But also, the notion of queue position may be lost, and many partially finished jobs may end up in memory at once.
Perhaps there is a way to call Pool.map and specify that the processes are non-daemonic? Or perhaps there is something more lightweight I can use instead of Pool.map (see the sketch after this list)? This is similar to an approach taken in another open StackOverflow question. Also, I should note that the parallelization I exploit via Pool.map is similar to linear algebra, and there is no inter-process communication (each worker just runs independently and returns its result without talking to the others).
Throw away Celery and use multiprocessing.Queue. Then maybe there'd be some way to use the same "thread depth" for every thread I use (i.e. maybe all of the threads could use the same Pool, avoiding nesting)?
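On the "something more lightweight" idea, one option under the stated assumption of independent, no-communication work is the thread-based pool from the standard library, which sidesteps the daemonic-children restriction; it mainly helps if the heavy lifting happens in C extensions (e.g. NumPy) that release the GIL:

from multiprocessing.pool import ThreadPool   # threads, so no daemonic children involved

def crunch(chunk):
    # stand-in for the independent, linear-algebra-like work described above
    return sum(x * x for x in chunk)

def run_parallel(chunks, workers=6):
    pool = ThreadPool(processes=workers)
    try:
        return pool.map(crunch, chunks)
    finally:
        pool.close()
        pool.join()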
Thanks a lot in advance.
What you need is a workflow management system (WFMS) that manages
task concurrency
task dependency
task nesting
among other things.
From a very high level view, a WFMS sits on top of a task pool like celery, and submits the tasks which are ready to execute to the pool. It is also responsible for opening up a nest and submitting the tasks in the nest accordingly.
I've developed a system to do just that. It's called pomsets. Try it out, and feel free to send me any questions.
I use multiprocess daemons based on Twisted with forking and query Gearman jobs normally.
Take a look at Gearman.
I'm running a Django website where I use Celery to implement preventive caching - that is, I calculate and cache results even before they are requested by the user.
However, one of my Celery tasks could, in some situations, be called a lot (I'd say slightly more quickly than it completes on average, actually). I'd like to rate_limit it so that it doesn't consume a lot of resources when it's actually not that useful.
However, I'd like first to understand how Celery's celery.task.base.Task.rate_limit attribute is enforced. Are tasks refused? Are they delayed and executed later?
Thanks in advance!
Rate-limited tasks are never dropped; they are queued internally in the worker so that they execute as soon as they are allowed to run.
The token bucket algorithm does not specify anything about dropping packets (it is an option, but Celery does not do that).
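For reference, setting the limit is just a task option, and note that it is enforced per worker instance, not globally across the cluster. A minimal sketch (the task name is made up, and the decorator import may differ between Celery versions):

from celery.task import task    # decorator location in older Celery versions

@task(rate_limit='10/m')        # at most 10 executions per minute, per worker
def precompute(key):            # hypothetical task
    ...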