memcache.get returns wrong object (Celery, Django) - python

Here is what we have currently:
We're trying to get a cached Django model instance; the cache key includes the model name and the instance id. Django's standard memcached backend is used. This lookup is part of a common procedure used very widely, not only in celery.
Sometimes (randomly and/or very rarely) cache.get(key) returns the wrong object: either an int or a different model instance; even a same-model-different-id case has appeared. We catch this by checking that the model name and id match the cache key.
The bug appears only in the context of three of our celery tasks; it never reproduces in the python shell or in other celery tasks. UPD: it appears under long-running CPU/RAM-intensive tasks only.
The cache stores the correct value (we checked that manually at the moment the bug appeared).
Calling the same task again with the same arguments may not reproduce the issue, although the probability is much higher, so bug appearances tend to "group" in the same period of time.
Restarting celery solves the issue for a random period of time (minutes to weeks).
*NEW* This isn't connected with memory overflow. We always have at least 2 GB of free RAM when this happens.
*NEW* We have cache_instance = cache.get_cache("cache_entry") in static code. During investigation I found that at the moment the bug happens, cache_instance.get(key) returns the wrong value, although get_cache("cache_entry").get(key) on the next line returns the correct one. This means either the bug disappears too quickly, or the cache_instance object somehow got corrupted.
Isn't the cache instance object returned by Django's cache thread-safe?
*NEW* We logged a very strange case: as yet another wrong object from the cache, we got a model instance without its id set. This means the instance was never saved to the DB and therefore couldn't have been cached (I hope).
*NEW* At least one MemoryError was logged during this period.
I know, all of this sounds like some sort of magic. Really, any ideas on how this is possible or how to debug it would be very much appreciated.
PS: My current assumption is that this is connected with multiprocessing: the cache instance is created in static code before the worker processes fork, so all workers end up sharing the same socket. (Does that sound plausible?)

Solved it finally:
Celery has a dynamic scaling feature: it can add or kill workers according to load.
It does this by forking an existing worker.
Open sockets and files are copied to the forked process, so both processes share them, which leads to a race condition where one process reads the response meant for the other. Simply put, it's possible that one process reads the response intended for the second one, and vice versa.
The object imported via from django.core.cache import cache stores a pre-connected memcached socket. Don't use it when your process could be dynamically forked, and don't reuse stored connections, pools and the like across a fork.
OR store them keyed by the current PID, and check the PID each time you access the cache.
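A minimal sketch of that PID check, assuming a module-level holder and the get_cache alias from the question (both are illustrative, not our exact code):

import os
from django.core import cache

_cache_holder = {"pid": None, "cache": None}   # hypothetical module-level holder

def get_safe_cache():
    # Re-create the memcached client whenever the current PID differs from the
    # one that created it, i.e. after the worker process has been forked.
    pid = os.getpid()
    if _cache_holder["pid"] != pid:
        _cache_holder["cache"] = cache.get_cache("cache_entry")
        _cache_holder["pid"] = pid
    return _cache_holder["cache"]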

This has been bugging me for a while until I found this question and answer. I just want to add some things I've learnt.
You can easily reproduce this problem with a local memcached instance:
from django.core.cache import cache
import os

def write_read_test():
    pid = os.getpid()
    cache.set(pid, pid)
    for x in range(5):
        value = cache.get(pid)
        if value != pid:
            print "Unexpected response {} in process {}. Attempt {}/5".format(
                value, pid, x+1)
    os._exit(0)

cache.set("access cache", "before fork")
for x in range(5):
    if os.fork() == 0:
        write_read_test()
What you can do is close the cache client as Django does in the request_finished signal:
https://github.com/django/django/blob/master/django/core/cache/__init__.py#L128
If you put a cache.close() after the fork, everything works as expected.
For celery you could connect to a signal that is fired after the worker is forked and execute cache.close().
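For example, a sketch using Celery's worker_process_init signal, which fires in each pool process after it has been forked:

from celery.signals import worker_process_init
from django.core.cache import cache

@worker_process_init.connect
def close_inherited_cache_connection(**kwargs):
    # Drop the memcached socket inherited from the parent process; the next
    # cache access in this worker will open a fresh connection.
    cache.close()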
This also affects gunicorn when preload is active and the cache is initialized before forking the workers.
For gunicorn, you could use post_fork in your gunicorn configuration:
def post_fork(server, worker):
    from django.core.cache import cache
    cache.close()

Related

Python Django Celery AsyncResult Memory Leak

The problem is a very serious memory leak until the server crashes (you can recover by killing the celery worker service, which releases all the RAM used).
There seem to be a bunch of reported bugs on this matter, but very little attention is paid to this warning in the celery API docs:
Warning:
Backends use resources to store and transmit results. To ensure that resources are released, you must eventually call get() or forget() on EVERY AsyncResult instance returned after calling a task.
And it is reasonable to assume that the leak is related to this warning.
But the conceptual problem is, based on my understanding of celery, that AsyncResult instances are created across multiple Django views within a user session: some are created as you initiate/spawn new tasks in one view, and some you may create later manually (using task_id saved in the user session) to check on the progress (state) of those tasks in another view.
Therefore, AsyncResult objects will eventually go out of scope across multiple views in a real-world Django application, and you don't want to call get() in ANY of these views, because you don't want to slow down the Django (or apache2) daemon process.
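To make that flow concrete, here is a rough sketch (the view names, session key and my_task are hypothetical): one view spawns the task and stores only its id; a later view rebuilds an AsyncResult from that id to check the state, and that rebuilt instance is the one you would eventually have to get() or forget().

from celery.result import AsyncResult

def start_view(request):
    result = my_task.delay()                  # my_task is a hypothetical task
    request.session["task_id"] = result.id    # store only the id, not the object
    # ... render a response ...

def progress_view(request):
    # A brand-new AsyncResult instance wrapping the same task_id.
    result = AsyncResult(request.session["task_id"])
    state = result.state
    if result.ready():
        result.forget()   # release the backend resources once we are done
    # ... render `state` into a response ...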
Is the solution to never let AsyncResult Objects go out of scope before calling their get() method?
CELERY_RESULT_BACKEND = 'django-db' #backend is a mysql DB
BROKER_URL = 'pyamqp://localhost' #rabbitMQ
We also faced multiple issues with celery in production, including a memory leak. I'm not sure our problem scope is the same, but if you don't mind you could try out our solution.
We had multiple tasks running on a couple of workers managed by supervisor (all workers were on the same queue). What we saw was that when a lot of tasks were queued, the broker (in our case rabbitmq) would send as many tasks as our celery workers could process and keep the rest in memory. This caused our memory to overflow, and the broker started paging to the hard drive. We found out from reading the docs that if we allow our broker not to wait for worker results, the issue could be resolved. Thus, in our tasks we used these options:
@task(time_limit=10, ignore_result=True)
def ggwp():
    # do sth
    pass
Here, the time limit ends the task after a certain amount of time, and the ignore_result option allows the broker to just send the task to a celery worker as soon as one is freed.

Use a multiprocessing.Queue in luigi

I have a set of tasks in luigi which all need to access a database. I can have up to 8 tasks accessing my database at the same time provided they are on different ports (I have the list of allowed ports).
How should I best implement this restriction? It seems similar to the standard restriction on the number of workers; i.e., in my case a task should run when a worker is free AND a database port is free.
I tried creating a multiprocessing.Queue() in __main__ and passing it to the WrapperTask, which receives it as a luigi.Parameter(), but this raises a warning and then hangs:
UserWarning: Parameter "queue" with value <multiprocessing.queues.Queue object at 0x00000000149E4518>" is not of type string.
warnings.warn('Parameter "{}" with value "{}" is not of type string.'.format(param_name, param_value))
The idea was that a .get() call would block a task while the queue is empty and resume once another task calls .put(port) again.
What is going wrong here? Or am I taking the completely wrong approach to managing the resource in luigi?
You should be using "resources" section in Luigi Configuration. This will ensure that not more than this number of workers share a global resource. Find more here https://luigi.readthedocs.io/en/stable/configuration.html#resources.

shared objects after os.fork()

I've encountered some strange application behaviour while interacting with database using many processes. I'm using Linux.
I have my own implementation of QueryExecutor, which uses a single connection during its lifetime:
class QueryExecutor(object):
    def __init__(self, db_conf):
        self._db_config = db_conf
        self._conn = self._get_connection()

    def execute_query(self, query):
        # some code
        # some more code
        pass

_QUERY_EXECUTOR = None  # module-level singleton, elided in the original snippet

def query_executor():
    global _QUERY_EXECUTOR
    if _QUERY_EXECUTOR is None:
        _QUERY_EXECUTOR = QueryExecutor(some_db_config)
    return _QUERY_EXECUTOR
Query Executor is never modified after instantiation.
Initially there is only one process, which from time to time forks (os.fork()) several times. The new processes are workers which do some tasks and then exit. Each worker calls query_executor() to be able to execute a SQL query.
I have found out that SQL queries often return wrong results (it seems that sometimes a SQL query result is returned to the wrong process). The only sensible explanation is that all processes share the same SQL connection (according to the MySQLdb docs, threadsafety = 1: "Threads may share the module, but not connections").
I wonder which OS mechanism leads to this situation. As far as I know, on Linux when a process forks, the parent process's pages are not copied for the child process; they are shared by both processes until one of them tries to modify a page (copy-on-write). As I mentioned before, the QueryExecutor object remains unmodified after creation. I guess this is why all processes use the same QueryExecutor instance and hence the same SQL connection.
Am I right or do I miss something? Do you have any suggestions?
Thanks in advance!
Grzegorz
The root of the problem is that fork() simply creates an exact independent copy of a process, but the two processes share open files, sockets and pipes. That's why any data written by the MySQL server can be [correctly] read by only one process, and if two processes try to make requests and read responses they will quite likely mess up each other's work. This has nothing to do with "multithreading", because in the multi-threading case there's a single process with a few threads of execution; they share data and can coordinate.
The correct way to use fork() is to close (or re-open), right after forking, all file-handle-like objects in all but one copy of the process, or at least to avoid using them from multiple processes.
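Applied to the snippet above, that could look roughly like this (a sketch: the child resets the module-level executor so the next query_executor() call opens a fresh connection; do_worker_tasks is a placeholder for the worker body):

import os

def spawn_worker():
    pid = os.fork()
    if pid == 0:
        # Child: discard the executor (and its socket) inherited from the
        # parent so query_executor() builds a brand-new connection here.
        global _QUERY_EXECUTOR
        _QUERY_EXECUTOR = None
        do_worker_tasks()   # placeholder for the work that calls query_executor()
        os._exit(0)
    return pid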

Large celery task memory leak

I have a huge celery task that works basically like this:
@task
def my_task(id):
    if settings.DEBUG:
        print "Don't run this with debug on."
        return False

    related_ids = get_related_ids(id)

    chunk_size = 500
    for i in xrange(0, len(related_ids), chunk_size):
        ids = related_ids[i:i+chunk_size]
        MyModel.objects.filter(pk__in=ids).delete()
        print_memory_usage()
I also have a manage.py command that just runs my_task(int(args[0])), so this can either be queued or run on the command line.
When run on the command line, print_memory_usage() reveals a relatively constant amount of memory used.
When run inside celery, print_memory_usage() reveals an ever-increasing amount of memory, continuing until the process is killed (I'm using Heroku with a 1GB memory limit, but other hosts would have a similar problem.) The memory leak appears to correspond with the chunk_size; if I increase the chunk_size, the memory consumption increases per-print. This seems to suggest that either celery is logging queries itself, or something else in my stack is.
Does celery log queries somewhere else?
Other notes:
DEBUG is off.
This happens both with RabbitMQ and Amazon's SQS as the queue.
This happens both locally and on Heroku (though it doesn't get killed locally due to having 16 GB of RAM.)
The task actually goes on to do more things than just deleting objects. Later it creates new objects via MyModel.objects.get_or_create(). This also exhibits the same behavior (memory grows under celery, doesn't grow under manage.py).
A bit of necroposting, but this can help people in the future. Although the best solution is to track down the source of the problem, sometimes that is not possible, for example because the source of the problem is outside of our control. In that case you can use the --max-memory-per-child option when spawning the Celery worker process.
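For example (the limit is given in kilobytes; the app name and value below are placeholders):

celery -A myproj worker --max-memory-per-child=200000   # recycle a child after ~200 MB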
This turned out not to have anything to do with celery. Instead, it was New Relic's logger that consumed all of that memory. Despite DEBUG being set to False, it was storing every SQL statement in memory in preparation for sending it to their logging server. I do not know if it still behaves this way, but it wouldn't flush that memory until the task fully completed.
The workaround was to use subtasks for each chunk of ids, to do the delete on a finite number of items.
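Roughly, the idea was something like this (a sketch reusing the names from the question, not the exact production code):

@task
def delete_chunk(ids):
    # Each subtask deletes a bounded number of rows and then finishes, so any
    # per-task state kept by the instrumentation is released with it.
    MyModel.objects.filter(pk__in=ids).delete()

@task
def my_task(id):
    related_ids = get_related_ids(id)
    chunk_size = 500
    for i in xrange(0, len(related_ids), chunk_size):
        delete_chunk.delay(related_ids[i:i+chunk_size])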
The reason this wasn't a problem when running this as a management command is that New Relic's logger wasn't integrated into the command framework.
Other solutions presented attempted to reduce the overhead of the chunking operation, which doesn't help with an O(N) scaling concern, or to force the celery tasks to fail if a memory limit is exceeded (a feature that didn't exist at the time, but might have eventually worked with infinite retries).
Try using the @shared_task decorator.
You can also run the worker with the --autoscale n,0 option. If the minimum pool size is 0, celery will kill unused workers and the memory will be released.
But this is not a good solution.
A lot of memory is used by Django's Collector: before deleting, it collects all related objects and deletes them first. You can set on_delete to SET_NULL on model fields.
Another possible solution is to delete objects in limited batches, for example a certain number of objects per hour. That will lower memory usage.
Django does not have a raw_delete. You can use raw SQL for this.
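For instance, something along these lines (a sketch; the table name myapp_mymodel stands in for MyModel's actual db_table):

from django.db import connection

def raw_delete(ids):
    # Issue the DELETE directly, bypassing Django's Collector (note: this
    # skips cascade handling and signals), so related objects are never
    # loaded into memory.
    placeholders = ", ".join(["%s"] * len(ids))
    sql = "DELETE FROM myapp_mymodel WHERE id IN ({0})".format(placeholders)
    cursor = connection.cursor()
    cursor.execute(sql, list(ids))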

Set / get objects with memcached

In a Django Python app, I launch jobs with Celery (a task manager). When each job is launched, it returns an object (let's call it an instance of class X) that lets you check on the job and retrieve the return value or errors thrown.
Several people (someday, I hope) will be able to use this web interface at the same time; therefore, several instances of class X may exist at the same time, each corresponding to a job that is queued or running in parallel. It's difficult to come up with a way to hold onto these X objects because I cannot use a global variable (a dictionary that allows me to look up each X object from a key); this is because Celery uses different processes, not just different threads, so each would modify its own copy of the global table, causing mayhem.
Subsequently, I received the great advice to use memcached to share the memory across the tasks. I got it working and was able to set and get integer and string values between processes.
The trouble is this: after a great deal of debugging today, I learned that memcached's set and get don't seem to work for classes. This is my best guess: perhaps under the hood memcached serializes objects into the shared memory; class X (understandably) cannot be serialized because it points at live data (the status of the job), and so the serialized version may be out of date (i.e. it may point to the wrong place) when it is loaded again.
Attempts to use a SQLite database were similarly fruitless; not only could I not figure out how to serialize objects as database fields (using my Django models.py file), I would be stuck with the same problem: the handles of the launched jobs need to stay in RAM somehow (or use some fancy OS tricks underneath), so that they update as the jobs finish or fail.
My best guess is that (despite the advice that thankfully got me this far) I should be launching each job in some external queue (for instance Sun/Oracle Grid Engine). However, I couldn't come up with a good way of doing that without using a system call, which I thought may be bad style (and potentially insecure).
How do you keep track of jobs that you launch in Django or Django Celery? Do you launch them by simply putting the job arguments into a database and then have another job that polls the database and runs jobs?
Thanks a lot for your help, I'm quite lost.
I think django-celery does this work for you. Did you have a look at the tables created by django-celery? E.g. djcelery_taskstate holds all the data for a given task, like state, worker_id and so on. For periodic tasks there is a table called djcelery_periodictask.
In a Django view you can access the TaskMeta object:
from djcelery.models import TaskMeta
task = TaskMeta.objects.get(task_id=task_id)
print task.status
