Celery: Dictionary not getting updated [duplicate]

from celery import Celery
app = Celery('tasks', backend='amqp://guest@localhost//', broker='amqp://guest@localhost//')
a_num = 0

@app.task
def addone():
    global a_num
    a_num = a_num + 1
    return a_num
This is the code I used to test Celery.
I expected the return value to increase every time I call addone().
But it's always 1. Why?
Results:

>>> from tasks import addone
>>> r = addone.delay()
>>> r.get()
1
>>> r = addone.delay()
>>> r.get()
1
>>> r = addone.delay()
>>> r.get()
1

By default when a worker is started Celery starts it with a concurrency of 4, which means it has 4 processes started to handle task requests. (Plus a process that controls the other processes.) I don't know what algorithm is used to assign task requests to the processes started for a worker but eventually, if you execute addone.delay().get() enough, you'll see the number get greater than 1. What happens is that each process (not each task) gets its own copy of a_num. When I try it here, my fifth execution of addone.delay().get() returns 2.
You could force the number to increment each time by starting your worker with a single process to handle requests. (e.g. celery -A tasks worker -c1) However, if you ever restart your worker, the numbering will be reset to 0. Moreover, I would not design code that works only if the number of processes handling requests is forced to be 1. One day down the road a colleague decides that multiple processes should handle the requests for the tasks and then things break. (Big fat warnings in comments in the code can only do so much.)
At the end of the day, such state should be shared in a cache, like Redis, or a database used as a cache, which would work for the code in your question.
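For instance, a minimal sketch of the counter backed by Redis (assuming a local Redis server and the redis-py package; the key name a_num is just illustrative):

import redis
from celery import Celery

app = Celery('tasks', broker='amqp://guest@localhost//')
store = redis.Redis(host='localhost', port=6379, db=0)

@app.task
def addone():
    # INCR is atomic on the Redis server, so every worker process
    # increments and reads the same shared counter.
    return store.incr('a_num')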
However, in a comment you wrote:
Let's see I want use a task to send something. Instead of connecting every time in task, I want to share a global connection.
Storing the connection in a cache won't work. I would strongly advocate having each process that Celery starts use its own connection rather than try to share it among processes. The connection does not need to be reopened with each new task request. It is opened once per process, and then each task request served by this process reuses the connection.
In many cases, trying to share the same connection among processes (through sharing virtual memory through a fork, for instance) would flat out not work anyway. Connections often carry state with them (e.g. whether a database connection is in autocommit mode). If two parts of the code expect the connection to be in different states, the code will operate inconsistently.
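A rough sketch of that per-process pattern using Celery's worker_process_init signal (connect_to_backend() is a hypothetical helper standing in for whatever connection you actually open):

from celery import Celery
from celery.signals import worker_process_init

app = Celery('tasks', broker='amqp://guest@localhost//')
connection = None  # one connection object per worker process

@worker_process_init.connect
def open_connection(**kwargs):
    # Runs once in each child process right after the worker forks,
    # so every process ends up with its own connection.
    global connection
    connection = connect_to_backend()  # hypothetical helper

@app.task
def send_something(payload):
    # Every task served by this process reuses the same connection.
    connection.send(payload)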

The tasks run asynchronously, so every time a new task starts, a_num is set to 0; they run as separate instances.
If you want to work with shared values, I suggest a key-value store or a database of some sort.

Related

master-workers implementation in Python

I am trying to implement a prototype for my application in Python and am stuck on choosing libraries and frameworks.
The application is a master-workers application (event loop?), where workers request work from a master and respond to the master with the result of their work.
All tasks (works) are stored in PostgreSQL table, and only master can access its data. The table looks like:
task(task_id int, status varchar, length int, error_msg varchar)
The master process should expose the following API methods to the outer world (REST/HTTP):
get_workers_count: returns the number of workers. When it starts for the first time, the initial number of workers is 0
set_workers(workers_count): sets a new count of workers. If the new count is greater than the current one, the master should spawn new workers. If the new count is less than the current one, some workers should die after they complete their current work
add_task(time): adds a task to the task table with status 'READY' and length equal to time
The master process should also have the following API methods for workers (these should not be accessible to the outer world):
get_task: returns the task_id and length of the first record in the task table with status 'READY'. After returning it to the worker, it changes the status to 'EXECUTING'. Returns -1 if there are no tasks to execute. Returns -2 if the worker should die.
set_task_status(task_id, status): sets the task status
The worker process should be started by the master process and works as follows:
it calls the master's get_task method. If it gets -2, it terminates. If it gets -1, it sleeps and calls get_task again
if it gets a positive task number, it sleeps for length seconds (simulating work) and responds with a status (SUCCESS for the prototype).
I am new to Python and would appreciate help choosing frameworks/libraries for my application. My current state is:
I want to use Flask/gunicorn for the REST API in the master process
I have no idea what to use for communication between the master and worker processes. Is SocketServer a good choice for me?
almost all of the work in the worker process will be performed by a C extension module
workers and the master will run on a single machine
I have no idea how to start workers: should I spawn a thread/greenlet or should I fork a new process?
Please advise.
Async is probably your best bet; I personally love gevent. You could look at gipc, which adds multiprocessing to gevent and gives you a read/write channel back and forth. Or you can just have them communicate over REST APIs.
Personally I would fire up two distinct processes: a master process that manages the pool and handles the queues. Then I would have worker processes poll the API for new work, and when they retrieve the work they do their business in a separate thread.
The advantage of this is that when you want to split the workers across other machines (micro computers), the only change required is an IP address.
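A minimal sketch of that worker loop, assuming the master exposes get_task/set_task_status over HTTP and the requests library is available (the URLs and JSON shape are illustrative, not a real API):

import time
import requests

MASTER = 'http://127.0.0.1:5000'  # assumed address of the master's REST API

def worker_loop():
    while True:
        task = requests.get(MASTER + '/get_task').json()  # e.g. {"task_id": 7, "length": 5}
        if task['task_id'] == -2:
            break                   # master asked this worker to die
        if task['task_id'] == -1:
            time.sleep(1)           # no work available, poll again later
            continue
        time.sleep(task['length'])  # simulate the work
        requests.put(MASTER + '/set_task_status',
                     json={'task_id': task['task_id'], 'status': 'SUCCESS'})

if __name__ == '__main__':
    worker_loop()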
Don't know much about master/worker architecture, but you can use pika/RabbitMQ + Celery for event handling and task queues.
Consider RabbitMQ instead of Postgres for events - see some discussion here.
Hope it helps.
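If you go the Celery route, a hedged sketch of what a task could look like (the simulated work mirrors the prototype above; this is not a full design):

import time
from celery import Celery

app = Celery('workers', broker='amqp://guest@localhost//')

@app.task
def run_task(task_id, length):
    # The broker and Celery workers replace the hand-rolled
    # master/worker polling: just enqueue run_task.delay(task_id, length).
    time.sleep(length)  # simulate the work
    return {'task_id': task_id, 'status': 'SUCCESS'}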

Avoiding duplicate tasks (or dealing with them) in task queue in Google App Engine

I have a service I've developed running on GAE. The application needs to 'tick' every 3 seconds to perform a bunch of calculations. It is a simulation-type game.
I have a manually scaled instance that I start which uses the deferred API and task queue like so (some error handling etc, removed for clarity):
@app.route('/_ah/start')
def start():
    log.info('Ticker instance started')
    return tick()

@app.route('/tick')
def tick():
    _do_tick()
    deferred.defer(tick, _countdown=3)
    return 'Tick!', 200
The problem is that sometimes I end up with this being scheduled twice for some reason (likely a transient error/timeout causing the task to be rescheduled) and I end up with multiple tasks in the task queue, and the game ticking multiple times per 3-second period.
Any ideas how best to deal with this?
As far as I can see you can't ask a queue 'Are there are tasks of X already there?' or 'how many items on the queue at the moment?'.
I understand that this uses as push queue, and one idea might be to switch instead to a pull queue and have the ticker lease items off the queue, grouped by tag, which would get all of them, including duplicates. Would that be better?
In essence what I really want is just a cron-like scheduler to schedule something every 3 seconds, but I know that the scheduler on GAE likely doesn't run to that resolution.
I could just move everything into the startup handler, e.g.:
@app.route('/_ah/start')
def start():
    log.info('Ticker instance started')
    while True:
        _do_tick()
        sleep(3)
    return 'OK', 200
But from what I see, the logs won't update as I do this, as it is perceived to be a single request that never completes. This makes it a bit harder to see in the logs what is going on. Currently I see each individual tick as a separate request log entry.
Also if the above gets killed, I'd need to get it to reschedule itself anyway. Which might not be too much of a hassle as I know there are exceptions you can catch when the instance is about to be shut down and I could then fire off a deferred task to start it again.
Or is there a better way to handle this on GAE?
I can't see a way to detect/eliminate duplicates, but have worked around it now using a different mechanism. Rather than rely on the task queue as a scheduler, I run my own scheduler loop in a manually scaled instance:
TICKINTERVAL = 3

@app.route('/_ah/start')
def scheduler():
    log.info('Ticker instance started')
    while True:
        if game.is_running():
            task = taskqueue.add(
                url='/v1/game/tick',
                queue_name='tickqueue',
                method='PUT',
                target='tickworker',
            )
        else:
            log.info('Tick skipped as game stopped')
        db_session.rollback()
        sleep(TICKINTERVAL)
I have defined my own queue, tickqueue, in queue.yaml:
queue:
- name: tickqueue
  rate: 5/s
  max_concurrent_requests: 1
  retry_parameters:
    task_retry_limit: 0
    task_age_limit: 1m
The queue doesn't retry tasks, and any tasks left on it longer than a minute get cancelled. I set the max concurrency to 1 so that it only attempts to process one item at a time.
If an occasional 'tick' takes longer than 3 seconds then it will back up on the queue, but the queue should clear if it speeds up again. If ticks end up taking longer than 3s on average, then tasks that have been on the queue longer than a minute will get discarded.
This gives the advantage that I get a log entry for each tick (and it is called /v1/game/tick, so it is easy to spot, as opposed to /_ah/deferred). The downside is that I need one instance for the scheduler and one for the worker, since the scheduler instance can't process requests: it won't serve them until /_ah/start completes, which it never does here.
You can set the task_retry_limit value to 0 in the optional _retry_options argument, as mentioned in https://stackoverflow.com/a/36621588/4495081.
The trouble is that if a valid reason for a failure exists, the ticking job stops forever. You may also want to keep track of the last time the job executed and have a cron-based sanity-check job that periodically checks that ticking is still running and restarts it if not.
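A rough sketch of that sanity check, assuming the ticker writes a timestamp to memcache after every tick (the key name, module names and handler route are invented for illustration):

import time
from google.appengine.api import memcache, modules

LAST_TICK_KEY = 'last_tick_ts'  # assumed: ticker stores time.time() here on each tick
STALE_AFTER = 30                # seconds without a tick before we assume the ticker died

@app.route('/cron/check_ticker')  # hypothetical cron-invoked handler
def check_ticker():
    last = memcache.get(LAST_TICK_KEY)
    if last is None or time.time() - last > STALE_AFTER:
        log.warning('Ticker looks dead, restarting the scheduler module')
        # Assumed module/version names; a manually scaled module may need
        # modules.stop_version() before start_version() takes effect.
        modules.start_version('tickscheduler', 'v1')
    return 'OK', 200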

memcache.get returns wrong object (Celery, Django)

Here is what we have currently:
We're trying to get a cached Django model instance; the cache key includes the model name and the instance id. Django's standard memcached backend is used. This is part of a common procedure used very widely, not only in celery.
Sometimes (randomly and/or very rarely) cache.get(key) returns the wrong object: either an int or a different model instance; even a same-model-different-id case has appeared. We catch this by checking that the model name & id correspond to the cache key.
The bug appears only in the context of three of our celery tasks; it never reproduces in the python shell or in other celery tasks. UPD: it appears only under long-running, CPU- and RAM-intensive tasks.
The cache stores the correct value (we checked that manually at the moment the bug appeared).
Calling the same task again with the same arguments might not reproduce the issue, although the probability is much higher, so bug appearances tend to "group" in the same period of time.
Restarting celery solves the issue for a random period of time (minutes to weeks).
*NEW* this isn't connected with memory overflow. We always have at least 2Gb of free RAM when this happens.
*NEW* we have cache_instance = cache.get_cache("cache_entry") in static code. During the investigation, I found that at the moment the bug happens, cache_instance.get(key) returns the wrong value, although get_cache("cache_entry").get(key) on the next line returns the correct one. This means either the bug disappears too quickly or the cache_instance object somehow got corrupted.
Isn't the cache instance object returned by Django's cache thread-safe?
*NEW* we logged a very strange case: as another wrong object from the cache, we got a model instance without its id set. This means the instance was never saved to the DB and therefore couldn't have been cached. (I hope.)
*NEW* At least one MemoryError was logged in these days.
I know all of this sounds like some sort of magic. Really, any ideas on how this is possible or how to debug it would be very appreciated.
PS: My current assumption is that this is connected with multiprocessing: the cache instance is created in static code before the worker processes fork, which leads to all workers sharing the same socket. (Does that sound plausible?)
Solved it finally:
Celery has a dynamic scaling feature: it can add/kill workers according to load
It does this by forking an existing worker
Opened sockets and files are copied to the forked process, so both processes share them, which leads to a race condition where one process reads the response meant for the other. Simply put, it's possible that one process reads a response intended for the second one, and vice versa.
from django.core.cache import cache: this object stores a pre-connected memcached socket. Don't use it when your process could be dynamically forked, and don't use stored connections, pools and the like.
OR store them keyed by the current PID, and check the PID each time you access the cache
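A hedged sketch of that PID check (get_process_cache is an invented helper; the point is only to detect a fork and drop the inherited socket):

import os
from django.core.cache import cache

_cache_pid = None

def get_process_cache():
    # Invented helper: if this process has forked since the cache client
    # was last used, close the inherited connection so it reconnects lazily.
    global _cache_pid
    if _cache_pid != os.getpid():
        cache.close()
        _cache_pid = os.getpid()
    return cache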
This has been bugging me for a while until I found this question and answer. I just want to add some things I've learnt.
You can easily reproduce this problem with a local memcached instance:
from django.core.cache import cache
import os

def write_read_test():
    pid = os.getpid()
    cache.set(pid, pid)
    for x in range(5):
        value = cache.get(pid)
        if value != pid:
            print "Unexpected response {} in process {}. Attempt {}/5".format(
                value, pid, x+1)
    os._exit(0)

cache.set("access cache", "before fork")

for x in range(5):
    if os.fork() == 0:
        write_read_test()
What you can do is close the cache client, as Django does in the request_finished signal:
https://github.com/django/django/blob/master/django/core/cache/__init__.py#L128
If you put a cache.close() after the fork, everything works as expected.
For celery you could connect to a signal that is fired after the worker is forked and execute cache.close().
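A short sketch of that, assuming worker_process_init is the signal you hook (it is the child-side counterpart of the gunicorn post_fork example below):

from celery.signals import worker_process_init
from django.core.cache import cache

@worker_process_init.connect
def reset_cache_connection(**kwargs):
    # Runs in the freshly forked worker process; drops the memcached
    # socket inherited from the parent so the child reconnects on next use.
    cache.close()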
This also affects gunicorn when preload is active and the cache is initialized before forking the workers.
For gunicorn, you could use post_fork in your gunicorn configuration:
def post_fork(server, worker):
    from django.core.cache import cache
    cache.close()

Unknown queue names show on Rabbitmq mgmt. when using Celery

I only created the last 2 queue names that show in the RabbitMQ management web UI in the table below:
The rest of the table consists of hash-like queue names, which I don't recognize:
1- Who created them? (I know it is Celery, but which process, task, etc.)
2- Why are they created, and what are they created for?
I notice that when the number of pushed messages increases, the number of those hash-like queues increases as well.
When using Celery, RabbitMQ is used as the default result backend, and also to store the errors of failing tasks (those that raised exceptions).
Every new task creates a new queue on the server; with thousands of tasks the broker may be overloaded with queues, and this will affect performance in negative ways.
Each queue in Rabbit is a separate Erlang process, so if you're planning to keep many results simultaneously you may have to increase the Erlang process limit, and the maximum number of file descriptors your OS allows.
Old results will not be cleaned up automatically, so we have to tell Rabbit to do so.
The configuration line below dictates the time to live of the temporary queues. The default is 1 day:
CELERY_AMQP_TASK_RESULT_EXPIRES = number of seconds
OR, we can change the result backend entirely, so results are not kept in Rabbit at all; the relevant setting is CELERY_BACKEND, which defaults to "amqp".
We may also ignore results entirely:
CELERY_IGNORE_RESULT = True
Also, when ignoring the result, we can still keep the errors stored for later use, which means one more queue for the failing tasks:
CELERY_STORE_ERRORS_EVEN_IF_IGNORED = True
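Put together, a hedged sketch of an old-style celeryconfig.py using the legacy uppercase setting names from this answer (newer Celery releases spell these result_expires, task_ignore_result, and task_store_errors_even_if_ignored):

# celeryconfig.py - sketch only
BROKER_URL = 'amqp://guest@localhost//'

# Expire the per-result queues after one hour instead of the 1-day default
CELERY_AMQP_TASK_RESULT_EXPIRES = 3600

# Or skip result storage altogether, keeping only the failures
CELERY_IGNORE_RESULT = True
CELERY_STORE_ERRORS_EVEN_IF_IGNORED = True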
I will not mark this question as answered; I'm waiting for a better answer.
References:
This SO link
Celery documentation
Rabbitmq documentation

Python App Engine: Task Queues

I need to import some data to show to the user, but the page execution time exceeds the 30-second limit. So I decided to split my big code into several tasks and try Task Queues. I add about 10-20 tasks to the queue and App Engine executes the tasks in parallel while the user is waiting for the data. How can I determine that my tasks are completed, so that I can show the user the data ASAP? Can I somehow iterate over active tasks?
I've solved this in the past by keeping the status of the tasks in memcache and polling (via Ajax) to determine when the tasks are finished.
If you go this way, it's best if you can always "manually" determine the status of the tasks without looking in memcache, since there's always a (slim) chance that memcache will go down or get cleared while a task is running.
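A rough sketch of that pattern on the Python 2 App Engine runtime, assuming a Flask-style handler (the status_ key prefix, NUM_CHUNKS and the route are made up for illustration):

import json
from google.appengine.api import memcache

NUM_CHUNKS = 20  # assumed number of tasks enqueued for one import job

def import_chunk(job_id, chunk_id, rows):
    # ... do the actual import work for this chunk ...
    memcache.set('status_%s_%s' % (job_id, chunk_id), 'done')

@app.route('/import_status/<job_id>')
def import_status(job_id):
    # Polled via Ajax: report how many of the expected chunks have finished.
    keys = ['status_%s_%s' % (job_id, i) for i in range(NUM_CHUNKS)]
    done = memcache.get_multi(keys)
    return json.dumps({'done': len(done), 'total': NUM_CHUNKS})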
