CPU Concurrency behavior with celery task that calls a subprocess - python

I have a celery task which uses subprocess.Popen() to call out to an executable which does some CPU-intensive data crunching. It works well but does not take full advantage of the celery worker concurrency.
If I start up celeryd with --concurrency 8 -P prefork, I can confirm with ps aux | grep celeryd that 8 child processes have been spawned. OK.
Now when I start, say, 3 tasks in parallel, I see each task picked up by a different child worker:
[2014-05-08 13:41:23,839: WARNING/Worker-2] running task a...
[2014-05-08 13:41:23,839: WARNING/Worker-4] running task b...
[2014-05-08 13:41:24,661: WARNING/Worker-7] running task c...
... and they run for several minutes before completing successfully. However, when I observe the CPU usage during that time, it's clear that all three tasks are sharing the same CPU, even though other cores are sitting idle.
If I add two more tasks, each subprocess takes ~20% of that one CPU, etc.
I would expect each child celery process (created via multiprocessing by the prefork pool) to be able to operate independently and not be constrained to a single core. If that's not the case, how can I take full advantage of multiple CPU cores with a CPU-bound celery task?
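For reference, the task looks roughly like this (a sketch; the broker URL, task name and executable path are placeholders rather than my real code):

from celery import Celery
import subprocess

app = Celery('tasks', broker='amqp://localhost')

@app.task
def crunch(input_path):
    # Launch the CPU-bound external program and block until it finishes.
    proc = subprocess.Popen(['/usr/local/bin/cruncher', input_path])
    return proc.wait()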

According to http://bugs.python.org/issue17038 and https://stackoverflow.com/a/15641148/519385, there is a problem whereby some C extensions mess with CPU core affinity and prevent multiprocessing from using all available cores. The solution is a total hack, but it appears to work:
import os
os.system("taskset -p 0xff %d" % os.getpid())
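One way to apply this is to reset the affinity in every pool process as it starts, e.g. from the worker_process_init signal (a sketch; it assumes Linux with the taskset utility installed, and the 0xff mask covers the first 8 cores):

import os
from celery.signals import worker_process_init

@worker_process_init.connect
def reset_cpu_affinity(**kwargs):
    # Re-allow this child process to run on any of the first 8 cores.
    os.system("taskset -p 0xff %d" % os.getpid())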

Related

Serial processing of specific tasks using Celery with concurrency

I have a python/celery setup: there is a queue named "task_queue" and multiple python scripts that feed it data from different sensors. A celery worker reads from that queue and sends an alarm to the user if a sensor value changed from high to low. The worker has multiple threads (I have the autoscaling parameter enabled) and everything works fine until one sensor decides to send multiple messages at once. That's when I get a race condition and may send multiple alarms to the user: before one thread stores the information that it has already sent an alarm, a few other threads send it as well.
I have n sensors (n can be more than 10000) and messages from any given sensor must be processed sequentially. So in theory I could have n threads, but that would be overkill. I'm looking for the simplest way to distribute the messages evenly across x threads (usually 10 or 20), so that I don't have to (re)write the routing function and define new queues each time I want to increase or decrease x.
So is it possible to somehow mark the tasks that originate from the same sensor so that they are executed serially (when calling delay or apply_async)? Or is there a different queue/worker architecture I should be using to achieve that?
From what I understand, you have some tasks that can all run at the same time and a specific kind of task that cannot (it needs to be executed one at a time).
There is no way (for now) to set the concurrency of a specific task queue, so I think the best approach in your situation would be handling the problem with multiple workers.
Let's say you have the following queues:
queue_1 - here we send tasks that can all run at the same time.
queue_2 - here we send tasks that must run one at a time.
You could start celery with the following commands (if you want them on the same machine):
celery -A proj worker --loglevel=INFO --concurrency=10 -n worker1@%h -Q queue_1
celery -A proj worker --loglevel=INFO --concurrency=1 -n worker2@%h -Q queue_2
This will make worker1, which has concurrency 10, handle all the tasks that can run at the same time, while worker2 handles only the tasks that need to run one at a time.
Here is some documentation reference:
https://docs.celeryproject.org/en/stable/userguide/workers.html
NOTE: you will need to specify which queue each task runs in. This can be done when calling apply_async, directly from the task decorator, or in other ways.
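For example (a sketch; the app, module and task names are made up), the queue can be set via the task_routes setting or per call with apply_async:

from celery import Celery

app = Celery('proj', broker='amqp://localhost')

# Route every call of this task to the serial queue by task name.
app.conf.task_routes = {
    'tasks.send_alarm': {'queue': 'queue_2'},
}

@app.task(name='tasks.send_alarm')
def send_alarm(sensor_id):
    ...

# Alternatively, route a single call explicitly at call time:
# send_alarm.apply_async(args=(sensor_id,), queue='queue_2')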

Celery pool processes, tasks, and system processes & memory space

Is each task execution in a unique process space?
Do Celery pool (not master) processes spawn off a new process for each task execution?
In other words, is each task executed in a new process spawned by a worker pool process?
Or is it the other way around: is the task executed as part of the worker pool process itself?
One implication: if a celery task relies on data stored in process memory, that data belongs to the worker pool process executing it, and all tasks executed by that worker pool process have access to the same copy of the data.
These details depend on the concurrency model you pick for your workers.
In the default prefork model (based on processes), every task is executed inside one of the pre-forked processes (worker processes). So yes, it is a process pool. You can configure Celery to create a new worker process for each task, but that is not the default behaviour: by default Celery does not replace old worker processes with new ones, but you can control that with the worker_max_tasks_per_child setting.
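A minimal sketch of that setting (using the lowercase Celery 4+ setting name; the command-line equivalent is --max-tasks-per-child):

from celery import Celery

app = Celery('proj', broker='amqp://localhost')

# Recycle each pool process after it has executed a single task,
# i.e. effectively give every task execution a fresh process.
app.conf.worker_max_tasks_per_child = 1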

Correcting mis-configured celery (running with supervisord)

I started running celery for tasks in a Python/Django web project, hosted on a single VM with 8 cores or CPUs. I need to improve the configuration now - I've made rookie mistakes.
I use supervisor to handle celery workers and beat. In /etc/supervisor/conf.d/, I have two worker-related conf files - celery1.conf and celery2.conf. Should I...
1) Remove one of them? Both spawn different workers. I.e. the former conf file has command=python manage.py celery worker -l info -n celeryworker1. The latter has command=python manage.py celery worker -l info -n celeryworker2. And it's authoritatively stated here to run 1 worker per machine.
2) Tinker with numprocs in the conf? Currently in celery1.conf I've defined numprocs=2, and in celery2.conf I've defined numprocs=3* (see footnote below). At the same time, in /etc/default/celeryd, I have CELERYD_OPTS="--time-limit=300 --concurrency=8". So what's going on? Does supervisor's numprocs take precedence over concurrency in celeryd, or what? Should I set numprocs=0?
*Total numprocs over both files = 2+3 = 5. This checks out: sudo supervisorctl shows 5 celery worker processes. But in New Relic I see 45 processes running for celeryd. What the heck?! Even if each process created by supervisor actually spawns 8 processes (via celeryd), total numprocs x concurrency = 5 x 8 = 40. That's 5 less than the 45 shown by New Relic. I need guidance in righting these wrongs.
it's authoritatively stated here to run 1 worker per machine
Actually, it's advised ("I would suggest") to only run one worker per machine for this given use case.
You may have pretty good reasons to do otherwise (for example, having different workers for different queues...), and the celery docs state that how many workers and how many processes per worker (concurrency) work best really depends on your tasks, usage, machine and so on.
As for numprocs in the supervisor conf and concurrency in celery, these are totally unrelated (well, almost...) things. A celery "worker" is actually one main process that spawns concurrency children (which are the ones effectively handling your tasks). Supervisor's numprocs tells supervisor how many processes (here: the celery worker main processes) it should launch. So if you have one celery conf with numprocs = 2 and another with numprocs = 3, you launch a total of 5 parent worker processes, each of them spawning n children, where n defaults to your server's CPU count. This means you have a total of 5 + (5*8) = 45 celery processes running.
Whether you actually need that many workers is a question only you can answer ;)
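If what you actually want is a single worker per conf file with an explicit pool size, a supervisor program section could look roughly like this (a sketch; the program name, command and options are illustrative, not taken from your setup):

[program:celeryworker1]
command=python manage.py celery worker -l info -n celeryworker1 --concurrency=8
numprocs=1
autostart=true
autorestart=true
stopwaitsecs=600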

Single Celery task suspends all other Celery workers

I have a celery configuration (via django-celery) with RabbitMQ as the broker and a concurrency of 20 threads.
One of the tasks takes a really long time (about an hour) to execute, and after the task has been running for a few minutes, all the other concurrency threads stop working until it finishes. Why is this happening?
Thanks!
You need to have multiple workers to pick up the tasks.
For example, the environment needs to have at least 2 CPUs.
http://docs.celeryproject.org/en/latest/userguide/workers.html#concurrency
You may use Celery Flower to inspect the workers.
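For example (a sketch; the queue and app names are made up), you could dedicate a separate worker to the long-running task so that it cannot starve the rest:

celery -A proj worker -l info -Q default --concurrency=20 -n mainworker@%h
celery -A proj worker -l info -Q long_tasks --concurrency=1 -n longworker@%h

The long task would then have to be routed to the long_tasks queue, as in the two-queue example above.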

How can I get my Luigi scheduler to utilize multiple cores with the parallel-scheduling flag?

I have the following line in my luigi.cfg file (on all nodes, scheduler and workers):
[core]
parallel-scheduling: true
However, when I monitor CPU utilization on my luigi scheduler (with a graph of ~4000 tasks, handling requests from ~100 workers), it only utilizes a single core on the scheduler, with the single luigid thread often hitting 100% CPU utilization. My understanding was that this configuration variable should parallelize the scheduling of tasks.
The source suggests that this flag should indeed use multiple cores on the scheduler. In https://github.com/spotify/luigi/blob/master/luigi/interface.py#L194, a call is made to https://github.com/spotify/luigi/blob/master/luigi/worker.py#L498 to check the .complete() state of the task in parallel.
What am I missing to get my Luigi scheduler to utilize all of its cores?
I just realized the name parallel-scheduling is a bit confusing. It does not affect the scheduler, only the workers. Workers will perform the scheduling phase in parallel when that option is set.
As of today there is no way to utilize multiple cores for the central scheduler.
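If the goal is simply to keep more cores busy overall, the parallelism has to come from the worker side, e.g. by letting each worker invocation run several tasks at once (a sketch; the module and task names are hypothetical):

luigi --module my_tasks MyTask --workers 4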
