Large celery task memory leak - python

I have a huge celery task that works basically like this:
@task
def my_task(id):
    if settings.DEBUG:
        print "Don't run this with debug on."
        return False
    related_ids = get_related_ids(id)
    chunk_size = 500
    for i in xrange(0, len(related_ids), chunk_size):
        ids = related_ids[i:i+chunk_size]
        MyModel.objects.filter(pk__in=ids).delete()
        print_memory_usage()
I also have a manage.py command that just runs my_task(int(args[0])), so this can either be queued or run on the command line.
When run on the command line, print_memory_usage() reveals a relatively constant amount of memory used.
When run inside celery, print_memory_usage() reveals an ever-increasing amount of memory, continuing until the process is killed (I'm using Heroku with a 1GB memory limit, but other hosts would have a similar problem.) The memory leak appears to correspond with the chunk_size; if I increase the chunk_size, the memory consumption increases per-print. This seems to suggest that either celery is logging queries itself, or something else in my stack is.
Does celery log queries somewhere else?
Other notes:
DEBUG is off.
This happens both with RabbitMQ and Amazon's SQS as the queue.
This happens both locally and on Heroku (though it doesn't get killed locally due to having 16 GB of RAM.)
The task actually goes on to do more things than just deleting objects. Later it creates new objects via MyModel.objects.get_or_create(). This also exhibits the same behavior (memory grows under celery, doesn't grow under manage.py).

A bit of necroposting, but this can help people in the future. Although the best solution is to track down the source of the problem, sometimes that is not possible, for example because the source of the problem is outside of our control. In that case you can use the --max-memory-per-child option when spawning the Celery worker process.
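For example (the app name proj and the 200 MB threshold here are only placeholders; the value is given in kibibytes):
$ celery -A proj worker --max-memory-per-child=200000
A child process that exceeds the limit is allowed to finish its current task and is then replaced, so a slow leak cannot grow without bound.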

This turned out not to have anything to do with celery. Instead, it was New Relic's logger that consumed all of that memory. Despite DEBUG being set to False, it was storing every SQL statement in memory in preparation for sending it to their logging server. I do not know if it still behaves this way, but it wouldn't flush that memory until the task fully completed.
The workaround was to use subtasks for each chunk of ids, so the delete is done on a finite number of items.
The reason this wasn't a problem when running this as a management command is that New Relic's logger wasn't integrated into the command framework.
Other solutions presented here attempted to reduce the overhead of the chunking operation, which doesn't help with an O(N) scaling problem, or to force the Celery task to fail if a memory limit is exceeded (a feature that didn't exist at the time, but might have eventually worked with infinite retries).
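For reference, the subtask workaround looked roughly like this (a sketch, not the original code; delete_chunk is an illustrative name):
from celery import shared_task

@shared_task
def delete_chunk(ids):
    # each subtask deletes a bounded number of rows, so whatever the
    # instrumentation accumulates is released when the subtask exits
    MyModel.objects.filter(pk__in=ids).delete()

@shared_task
def my_task(id):
    related_ids = get_related_ids(id)
    chunk_size = 500
    for i in xrange(0, len(related_ids), chunk_size):
        delete_chunk.delay(related_ids[i:i + chunk_size])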

Try using the @shared_task decorator.

You can, however, run the worker with the --autoscale n,0 option. If the minimum pool size is 0, Celery will kill off unused workers and the memory will be released.
But this is not a good solution.
A lot of memory is used by Django's Collector: before deleting, it collects all related objects and deletes them first. You can set on_delete to SET_NULL on the model fields.
Another possible solution is deleting objects with limits, for example some objects per hour. That will lower memory usage.
Django does not have raw_delete. You can use raw SQL for this.
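A sketch of both suggestions (model, field and table names are placeholders; the IN %s parameter style assumes psycopg2/PostgreSQL):
from django.db import connection, models

class Child(models.Model):
    # SET_NULL stops the deletion Collector from cascading into these rows
    parent = models.ForeignKey('MyModel', null=True, on_delete=models.SET_NULL)

def raw_delete(ids):
    # bypasses the Collector entirely: no cascades, no signals; assumes ids is non-empty
    cursor = connection.cursor()
    cursor.execute("DELETE FROM myapp_mymodel WHERE id IN %s", [tuple(ids)])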

Related

how to profile memory usage of a celery task?

I have a django application that runs background tasks using the celery lib and I need to obtain and store the max memory usage of a task.
I've tried memory_usage from memory_profiler library, but I can not use this function inside a task because I get the error: "daemonic processes not allowed have children". I've also tried the memory_usage function outside the task, to monitor the task.async call, but for some reason the task is triggered twice.
All the other ways I found consist of checking the memory usage at different places in the code and then taking the maximum, but I have the feeling that this is very inaccurate: calls with high memory usage may be missed because garbage collection runs before I manage to check the current usage.
The official documentation has some useful functions, but they would still rely on the approach above: https://docs.celeryproject.org/en/latest/reference/celery.utils.debug.html
Thanks in advance!
Why not a controller task?
Celery's infrastructure lets you query the current status of all workers:
from celery import Celery
app = Celery(...)
app.control.inspect().active()
This can be used inside a task to poll the cluster every # sec and see what's happening.
I've used a similar approach to identify tasks and send the kill() command between them. My tasks are killable, so each of them knows how to handle the soft kill.
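A minimal sketch of such a controller task (the polling interval, the iteration cap, the logging, and the import path of the Celery app are my own assumptions):
from time import sleep
from celery import shared_task
from celery.utils.log import get_task_logger
from proj.celery import app  # assumed location of your Celery app instance

logger = get_task_logger(__name__)

@shared_task
def controller(polls=60, interval=5):
    # poll the cluster a bounded number of times and log what is running
    for _ in range(polls):
        active = app.control.inspect().active() or {}
        for worker, tasks in active.items():
            for t in tasks:
                logger.info("%s is running %s (%s)", worker, t['name'], t['id'])
        sleep(interval)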

Python Django Celery AsyncResult Memory Leak

The problem is a very serious memory leak until the server crashes (or you can recover by killing the Celery worker service, which releases all the RAM used).
There seem to be a bunch of reported bugs on this matter, but very little attention is paid to this warning in the Celery API docs:
Warning:
Backends use resources to store and transmit results. To ensure that resources are released, you must eventually call get() or forget() on EVERY AsyncResult instance returned after calling a task.
And it is reasonable to assume that the leak is related to this warning.
But the conceptual problem is, based on my understanding of celery, that AsyncResult instances are created across multiple Django views within a user session: some are created as you initiate/spawn new tasks in one view, and some you may create later manually (using task_id saved in the user session) to check on the progress (state) of those tasks in another view.
Therefore, AsyncResult objects will eventually go out of scope across multiple views in a real-world Django application, and you don't want to call get() in ANY of these views, because you don't want to slow down the Django (or the apache2) daemon process.
Is the solution to never let AsyncResult Objects go out of scope before calling their get() method?
CELERY_RESULT_BACKEND = 'django-db' #backend is a mysql DB
BROKER_URL = 'pyamqp://localhost' #rabbitMQ
We also faced multiple issues with celery in production, and also tackled a memory leak issue. I'm not sure if our problem scope is the same, but if you don't mind you could try out our solution.
We had multiple tasks running on a couple of workers managed by supervisor (all workers were on the same queue). What we saw was that when a lot of tasks were being queued, the broker (in our case RabbitMQ) would send as many tasks as our Celery workers could process and keep the rest in memory. This caused our memory to overflow, and the broker started paging to the hard drive. We found out from reading the docs that if we allow our broker not to wait for worker results, this issue could be resolved. Thus, in our tasks we used the option:
@task(time_limit=10, ignore_result=True)
def ggwp():
    # do sth
Here, the time limit closes the task after a certain amount of time, and the ignore_result option lets the broker hand a task to a Celery worker as soon as one is freed.
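If you do need the results (so ignore_result is not an option), the warning quoted in the question points at the other route: rebuild the AsyncResult from the stored task_id in whichever view checks progress, and call forget() once you are done with it. A rough sketch, assuming a Django view and a result backend that implements forget():
from celery.result import AsyncResult
from django.http import JsonResponse

def task_status(request):
    # task_id was stored in the session by the view that spawned the task
    result = AsyncResult(request.session['task_id'])
    state = result.state
    if state in ('SUCCESS', 'FAILURE'):
        # releases the backend storage for this result; unlike get(),
        # forget() does not block waiting for the task to finish
        result.forget()
    return JsonResponse({'state': state})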

Does Python garbage collect when Heroku warns about memory quota vastly exceeded (R15)?

I have a background task running under Celery on Heroku, that is getting "Error R14 (Memory quota exceeded)" frequently and "Error R15 (Memory quota vastly exceeded)" occasionally. I am loading a lot of stuff from the database (via Django on Postgres), but it should be loading up a big object, processing it, then disposing of the reference and loading up the next big object.
My question is, does the garbage collector know to run before hitting Heroku's memory limit? Should I manually run the gc?
Another thing is that my task sometimes fails, and then Celery automatically retries it, and it succeeds. It should be deterministic. I wonder if something is hanging around in memory after the task is done, and still takes up space when the next task starts. Restarting the worker process clears the memory and lets it succeed. Maybe Django or the DB has some caches that are not cleared?
I'm using standard-2x size. I could go to performance-m or performance-l, but trying to avoid that as it would cost more money.
Looks like the problem is that I'm not using .iterator() to iterate over the main queryset. Even though I'm freeing the data structures I'm creating after each iteration, the actual query results are all cached.
Unfortunately, I can't use .iterator(), because I use prefetch_related extensively.
I need some kind of hybrid method. I think it will involve processing the top-level queryset in batches. It won't completely have the advantage of a finite number of queries that prefetch_related has, but it will be better than one query per model object.
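A rough sketch of that hybrid approach, batching the top-level queryset by primary key and applying prefetch_related per batch (the batch size, the 'related_items' name and process_obj are placeholders):
def process_in_batches(queryset, batch_size=500):
    # fetch only the primary keys up front, so the full rows are never cached
    pks = list(queryset.values_list('pk', flat=True))
    for i in range(0, len(pks), batch_size):
        batch = (queryset.model.objects
                 .filter(pk__in=pks[i:i + batch_size])
                 .prefetch_related('related_items'))
        for obj in batch:
            process_obj(obj)
        # each batch's result cache goes out of scope here, so memory stays bounded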

Best way for single worker implementation in Flask

I have a spider that downloads pages and stores data in a database. I have created a Flask application with an admin panel (using the Flask-Admin extension) that shows the database.
Now I want to add a function to my Flask app to control the spider state: switch it on/off.
I think this is possible with threads or multiprocessing. Celery is not a good choice because the whole program must use minimal memory.
Which method should I choose to implement this function?
Discounting Celery based on memory usage would probably be a mistake, as Celery has low overhead in both time and space. In fact, using Celery+Flask does not use much more memory than using Flask alone.
In addition Celery comes with several choices you can make that can have an impact
on the amount of memory used. For example, there are 5 different pool implementations that all have different strengths and trade-offs, the pool choices are:
multiprocessing
By default Celery uses multiprocessing, which means that it will spawn child processes
to offload work to. This is the most memory expensive option - simply because
every child process will duplicate the amount of base memory needed.
But Celery also comes with an autoscale feature that will kill off worker
processes when there's little work to do, and spawn new processes when there's more work:
$ celeryd --autoscale=0,10
where 0 is the minimum number of processes, and 10 is the maximum. Here celeryd will
start off with no child processes, and grow based on load up to a maximum of 10 processes. When load decreases, so will the number of worker processes.
eventlet/gevent
When using the eventlet/gevent pools only a single process will be used, and thus it will
use a lot less memory, but with the downside that tasks calling blocking code will
block other tasks from executing. If your tasks are mostly I/O bound you should be ok,
and you can also combine different pools and send problem tasks to a multiprocessing pool instead.
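For example, an I/O-bound queue can be served by a single eventlet-based worker process (the concurrency value is only illustrative; on current Celery versions the equivalent is celery -A proj worker -P eventlet):
$ celeryd --pool=eventlet --concurrency=100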
threads
Celery also comes with a pool using threads.
The development version that will become version 2.6 includes a lot of optimizations,
and there is no longer any need for the Flask-Celery extension module. If you are not going
into production in the next days then I would encourage you to try the development version
which must be installed like this:
$ pip install https://github.com/ask/kombu/zipball/master
$ pip install https://github.com/ask/celery/zipball/master
The new API is now also Flask inspired, so you should read the new getting started guide:
http://ask.github.com/celery/getting-started/first-steps-with-celery.html
With all this said, most optimization work has been focused on execution speed so far,
and there are probably many more memory optimizations that can be made. It has not been a request so far, but in the unlikely event that Celery does not match your memory constraints, you can open an issue at our bug tracker and I'm sure it will get attention, or you can even help us do so.
You could supervise the process using multiprocessing or subprocess, and then just hand the handle around the session.

Parallel processing within a queue (using Pool within Celery)

I'm using Celery to queue jobs from a CGI application I made. The way I've set it up, Celery makes each job run one- or two-at-a-time by setting CELERYD_CONCURRENCY = 1 or = 2 (so they don't crowd the processor or thrash from memory consumption). The queue works great, thanks to advice I got on StackOverflow.
Each of these jobs takes a fair amount of time (~30 minutes serial), but has an embarrassing parallelizability. For this reason, I was using Pool.map to split it and do the work in parallel. It worked great from the command line, and I got runtimes around 5 minutes using a new many-cored chip.
Unfortunately, there is a limitation that does not allow daemonic processes to have subprocesses, and when I run the fancy parallelized code within the CGI queue, I get this error:
AssertionError: daemonic processes are not allowed to have children
I noticed other people have had similar questions, but I can't find an answer that wouldn't require abandoning Pool.map altogether, and making more complicated thread code.
What is the appropriate design choice here? I can easily run my serial jobs using my Celery queue. I can also run my much faster parallelized jobs without a queue. How should I approach this, and is it possible to get what I want (both the queue and the per-job parallelization)?
A couple of ideas I've had (some are quite hacky):
The job sent to the Celery queue simply calls the command line program. That program can use Pool as it pleases, and then saves the result figures & data to a file (just as it does now). Downside: I won't be able to check on the status of the job or see if it terminated successfully. Also, system calls from CGI may cause security issues.
Obviously, if the queue is very full of jobs, I can make use of the CPU resources (by setting CELERYD_CONCURRENCY = 6 or so); this will allow many people to be "at the front of the queue" at once. Downside: Each job will spend a lot of time at the front of the queue; if the queue isn't full, there will be no speedup. Also, many partially finished jobs will be stored in memory at the same time, using much more RAM.
Use Celery's @task to parallelize within sub-jobs. Then, instead of setting CELERYD_CONCURRENCY = 1, I would set it to 6 (or however many sub jobs I'd like to allow in memory at a time). Downside: First of all, I'm not sure whether this will successfully avoid the "task-within-task" problem. But also, the notion of queue position may be lost, and many partially finished jobs may end up in memory at once.
Perhaps there is a way to call Pool.map and specify that the threads are non-daemonic (see the sketch after this list)? Or perhaps there is something more lightweight I can use instead of Pool.map? This is similar to an approach taken on another open StackOverflow question. Also, I should note that the parallelization I exploit via Pool.map is similar to linear algebra, and there is no inter-process communication (each just runs independently and returns its result without talking to the others).
Throw away Celery and use multiprocessing.Queue. Then maybe there'd be some way to use the same "thread depth" for every thread I use (i.e. maybe all of the threads could use the same Pool, avoiding nesting)?
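As a concrete sketch of the non-daemonic Pool idea mentioned above (this is the Python 2-era recipe; newer multiprocessing versions structure Pool differently, so treat it as a starting point rather than a drop-in fix):
import multiprocessing
import multiprocessing.pool

class NoDaemonProcess(multiprocessing.Process):
    # report daemon as always False so the pool's children may themselves fork,
    # sidestepping the "daemonic processes are not allowed to have children" check
    def _get_daemon(self):
        return False
    def _set_daemon(self, value):
        pass
    daemon = property(_get_daemon, _set_daemon)

class NoDaemonPool(multiprocessing.pool.Pool):
    Process = NoDaemonProcess
Inside the Celery task you would then call NoDaemonPool(...).map(work, inputs) in place of Pool.map; whether that is wise is a separate question, since the worker no longer manages those grandchild processes.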
Thanks a lot in advance.
What you need is a workflow management system (WFMS) that manages
task concurrency
task dependency
task nesting
among other things.
From a very high level view, a WFMS sits on top of a task pool like celery, and submits the tasks which are ready to execute to the pool. It is also responsible for opening up a nest and submitting the tasks in the nest accordingly.
I've developed a system to do just that. It's called pomsets. Try it out, and feel free to send me any questions.
I normally use multiprocess daemons based on Twisted with forking, and query jobs through Gearman.
Take a look at Gearman.
