Distinguishing different processes spawned by multiprocessing - python

I have an application that uses multiprocessing. It creates several processes using multiprocessing.Process(name='foo', target=fn). I would like to be able to see which of the processes is consuming more resources (CPU, memory) through the task manager, but all these processes end up being named python.exe.
Is there a way to distinguish between the spawned processes? I'm running under Windows.

Each process has a different PID; you can get it with os.getpid (http://docs.python.org/library/os.html#os.getpid).
I'm just not sure whether the PID is visible in Task Manager, though :<
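A minimal sketch of that approach, using the same name/target keywords as the question (the function body is just illustrative). Task Manager can display a PID column (for example on the Details tab), so you can match the printed PIDs against it:

```python
import multiprocessing
import os

def fn():
    # Each child reports its own PID; the process name in Task Manager will
    # still be python.exe, but the PID lets you tell them apart.
    print(f"{multiprocessing.current_process().name}: pid={os.getpid()}")

if __name__ == "__main__":
    for name in ("foo", "bar"):
        multiprocessing.Process(name=name, target=fn).start()
```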

Related

Daemon threads vs daemon processes in Python

Based on the Python documentation, daemon threads are threads that die once the main thread dies. This seems to be the complete opposite of the behavior of daemon processes, which involve creating a child process and terminating the parent process so that init takes over the child process (i.e. killing the parent process does NOT kill the child process).
So why do daemon threads die when the parent dies? Is this a misnomer? I would think that "daemon" threads would keep running after the main process has been terminated.
It's just names meaning different things in different contexts.
In case you are not aware: like threading.Thread, multiprocessing.Process can also be flagged as "daemon". Your description of "daemon processes" fits Unix daemons, not Python's daemon-processes.
The docs also have a section about Process.daemon:
... Note that a daemonic process is not allowed to create child processes.
Otherwise a daemonic process would leave its children orphaned if it
gets terminated when its parent process exits. Additionally, these are
not Unix daemons or services, they are normal processes that will be
terminated (and not joined) if non-daemonic processes have exited.
The only thing Python's daemon-processes and Unix-daemons (or Windows "services") have in common is that you would use them for background tasks
(for Python: only an option for tasks which don't need proper cleanup on shutdown, though).
Python imposes its own abstraction layer on top of OS threads and processes. The daemon-attribute for Thread and Process is about this OS-independent, Python-level abstraction.
At the Python level, a daemon-thread is a thread which doesn't get joined (awaited to exit voluntarily) when the main-thread exits, and a daemon-process is a process which gets terminated (not joined) when the parent-process exits. Daemon-threads and daemon-processes share the same behavior in that their natural exit is not awaited when the main thread or the parent process is shutting down. That's all.
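A minimal sketch of that difference (the worker function and timings are just illustrative): the daemon thread is simply abandoned when the main thread exits, while the daemon process is actively terminated when its parent exits.

```python
import multiprocessing
import threading
import time

def work(label):
    # Loops "forever"; how it ends depends entirely on its host's daemon flag.
    while True:
        print(label, "still running")
        time.sleep(0.5)

if __name__ == "__main__":
    # Daemon thread: not joined when the main thread exits.
    t = threading.Thread(target=work, args=("daemon thread",), daemon=True)
    t.start()

    # Daemon process: terminated (not joined) when the parent process exits.
    p = multiprocessing.Process(target=work, args=("daemon process",), daemon=True)
    p.start()

    time.sleep(2)
    print("main exiting; neither daemon is waited for")
```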
Note that Windows doesn't even have the concept of "related processes" like Unix, but Python implements this relationship of "child" and "parent" in a cross-platform manner.
I would think that "daemon" threads would keep running after the main process has been terminated.
A thread cannot exist outside of a process. A process always hosts and gives context to at least one thread.

multiprocessing.pool.Pool/ThreadPool can create workers lazily?

I'd like to declare my ThreadPool globally in my program.
However, multiprocessing.pool.ThreadPool creates its worker threads greedily, spawning them even if no jobs are submitted to the pool. This causes various sorts of problems even when I don't end up using the threads (e.g. PyCharm is utterly unusable for me when I set breakpoints in a multi-threaded program, and zombie processes are left behind if the process is not shut down cleanly).
Is there a way to ask multiprocessing.pool.ThreadPool to create the threads lazily?
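For what it's worth, the eager behavior is easy to observe; a small sketch (the thread counts in the comments are approximate, since the pool also starts a few bookkeeping threads):

```python
import threading
from multiprocessing.pool import ThreadPool

print("threads before:", threading.active_count())  # typically 1: the main thread

# The workers are spawned right here, even though no job is ever submitted.
pool = ThreadPool(processes=4)
print("threads after:", threading.active_count())   # main + 4 workers + pool helpers

pool.close()
pool.join()
```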

Can a Daemon process fork child processes in python?

I have a daemon process that keeps on running, which I created using the runit package. I want the daemon process to listen to a table and perform tasks based on a column of the table which says what task it needs to perform.
E.g. table 'A' has a column job_type.
I was thinking of forking child processes from this daemon process every time it gets a new task to perform (based on a new row inserted into table A, which the daemon listens to).
The multiprocessing module says I can't (or shouldn't) fork child processes from a daemon, because if it dies, the child processes are orphaned.
What is a good approach to achieve this: the daemon listens to the table and, based on the column value, forks child processes (all independent of each other) which do the task and then exit.
I need to use some locking mechanism if the child processes are accessing and modifying shared data.
I assume the daemon process you have was also spawned from a Python script that created a multiprocessing.Process with daemon=True.
In that case, the fact that the daemon is running implies that its creator process is still running, so you can just send the creator a message via a pipe to have it spawn a new process for you. If your daemon needs to talk with that new process, use sockets or any IPC method of your choice.
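A rough sketch of that arrangement, with a hardcoded list standing in for rows appearing in table A and a hypothetical run_task doing the actual work:

```python
import multiprocessing

def run_task(job_type):
    # Hypothetical worker: performs the job described by one row of table A.
    print("working on", job_type)

def daemon_listener(conn):
    # Daemonic process: watches for new work (faked here) and, since it may not
    # create children itself, asks its creator to spawn a worker via the pipe.
    for job_type in ["resize", "transcode", None]:  # None tells the parent to stop
        conn.send(job_type)

if __name__ == "__main__":
    parent_conn, child_conn = multiprocessing.Pipe()
    listener = multiprocessing.Process(target=daemon_listener,
                                       args=(child_conn,), daemon=True)
    listener.start()

    # The (non-daemonic) creator spawns one independent worker per request.
    workers = []
    for job_type in iter(parent_conn.recv, None):
        w = multiprocessing.Process(target=run_task, args=(job_type,))
        w.start()
        workers.append(w)

    for w in workers:
        w.join()
```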

Evenlet semaphore, how to limit calls to a particular subprocess?

I need to create a semaphore to restrict the parallel count of a particular subprocess. I am using gunicorn with eventlet workers and allow many simultaneous connections. Mostly these are waiting on remote data. However, they all enter a processing phase at some point, and this involves calling a subprocess. That subprocess should not be run with too much parallelism, though, as it is memory/CPU hungry.
Is threading.Semaphore correctly monkey_patch'd and usable with eventlet inside gunicorn?
As I understand the problem:
one gunicorn process (this is crucial) spawns N green threads
each worker may spawn one or more subprocesses
you want to limit the total number of subprocesses
In this case, yes, a semaphore will work as expected.
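Assuming eventlet's monkey patching is in effect (as it normally is under gunicorn's eventlet worker), a module-level threading.Semaphore shared by all green threads in that one process is enough; a minimal sketch, with an arbitrary limit of 2 and a hypothetical run_heavy_tool wrapper:

```python
import subprocess
import threading

# One semaphore per gunicorn process, shared by all of its green threads.
# Under eventlet's monkey patching, acquiring it yields to the hub rather than
# blocking the whole worker. The limit of 2 is an arbitrary example.
HEAVY_TOOL_SLOTS = threading.Semaphore(2)

def run_heavy_tool(args):
    # Wait for a free slot, then run the memory/CPU-hungry subprocess.
    with HEAVY_TOOL_SLOTS:
        return subprocess.check_output(args)
```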
However, if you have more than one gunicorn process, each will have its own instance of the semaphore and you would observe more subprocesses. In that case, I recommend moving the subprocess responsibilities to a separate application running on the same machine and calling it via whatever API you like (RPC/socket/message queue/dbus/etc.). You could design the system like this:
user -> gunicorn (any number of processes)
gunicorn -> one subprocess manager
manager -> N subprocesses
The manager listens for jobs from gunicorn, spawns a subprocess if needed, and perhaps reuses existing subprocesses. You may like a job queue system such as Beanstalk, Celery, or Gearman, or you may wish to build a custom solution on top of an existing message transport like NSQ, RabbitMQ, or ZeroMQ.
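A rough sketch of the manager idea, with a local queue standing in for whatever transport you choose and a fixed-size thread pool acting as the hard cap on concurrent subprocesses (all names here are illustrative):

```python
import subprocess
from queue import Queue
from multiprocessing.pool import ThreadPool

MAX_CONCURRENT = 2   # hard cap on simultaneous heavy subprocesses
jobs = Queue()       # whatever listens to gunicorn would feed argument lists in here

def run_job(args):
    subprocess.check_call(args)

def manager_loop():
    # The pool size is the concurrency limit: extra jobs simply wait in the
    # pool's internal queue until one of the MAX_CONCURRENT threads is free.
    pool = ThreadPool(processes=MAX_CONCURRENT)
    for args in iter(jobs.get, None):   # None acts as a shutdown sentinel
        pool.apply_async(run_job, (args,))
    pool.close()
    pool.join()
```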

Python Celery - lookup task by pid

A pretty straightforward question, maybe -
I often see a celery task process running on my system that I cannot find when I use celery.task.control.inspect()'s active() method. Often this process will be running for hours, and I worry that it's a zombie of some sort. Usually it's using up a lot of memory, too.
Is there a way to look up a task by Linux PID? Does Celery or the AMQP result backend save that?
If not, any other way to figure out which particular task is the one that's sitting around eating up memory?
---- updated:
What can I do when active() tells me that there are no tasks running on a particular box, but the box's memory is in full use, and htop shows that these worker pool threads are the ones using it while sitting at 0% CPU? If it turns out this is related to some quirk of my current Rackspace setup and nobody can answer, I'll still accept Loren's.
Thanks~
I'm going to make the assumption that by 'task' you mean 'worker'. The question would make little sense otherwise.
For some context it's important to understand the process hierarchy of Celery worker pools. A worker pool is a group of worker processes (or threads) that share the same configuration (process messages of the same set of queues, etc.). Each pool has a single parent process that manages the pool. This process controls how many child workers are forked and is responsible for forking replacement children when children die. The parent process is the only process bound to AMQP and the children ingest and process tasks from the parent via IPC. The parent process itself does not actually process (run) any tasks.
Additionally, and towards an answer to your question, the parent process is the process responsible for responding to your Celery inspect broadcasts, and the PIDs listed as workers in the pool are only the child workers. The parent PID is not included.
If you're starting the Celery daemon using the --pidfile command-line parameter, that file will contain the PID of the parent process, and you should be able to cross-reference that PID with the process you're referring to in order to determine whether it is in fact a pool parent process. If you're using Celery multi to start multiple instances (multiple worker pools), then by default the PID files should be located in the directory from which you invoked Celery multi. If you're not using either of these means to start Celery, try using one of them to verify that the process isn't a zombie and is in fact simply a parent.
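If you have the inspect API handy, the stats broadcast can also give you the PIDs directly; a rough sketch, assuming a Celery version whose stats() report includes the parent's pid and the pool's child process PIDs (suspect_pid is just a placeholder for the PID you see in htop):

```python
from celery.task.control import inspect  # same API path as in the question

suspect_pid = 12345  # placeholder: the PID you're seeing in htop/top

stats = inspect().stats() or {}
for worker_name, info in stats.items():
    parent_pid = info.get("pid")                            # the pool parent
    child_pids = info.get("pool", {}).get("processes", [])  # the forked children
    if suspect_pid == parent_pid:
        print(worker_name, "-> pool parent (it never runs tasks itself)")
    elif suspect_pid in child_pids:
        print(worker_name, "-> one of this pool's child workers")
```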
