Get process id of spawned process in Python Pebble library?

I am using a ProcessPool from the Pebble library to launch a subprocess that is prone to crashing. I'd like to log the process ID of the subprocess that crashed, but from the main process rather than the child process. (The reason is that I have a log line in the main process with a bunch of relevant information about one request, and I want to include the PID there instead of scattering it across multiple log lines.) Is there some way to access this process ID? I can't seem to find this information in the documentation.
I guess as a workaround I can get the PID in the subprocess with os.getpid() before doing anything else and use IPC to communicate it back to the parent process. But I'd like to avoid this if possible.

The ProcessPool is designed to abstract its inner workings from the user, so it hides access to the processes that execute the workers.
If you need this information just for debugging purposes, my suggestion would be to mark your jobs with unique identifiers and then log both these identifiers and the worker PID from the worker processes. That way you can work out which job is causing your function to crash.
import logging
import os

def function(jobid, *args):
    logging.debug("Job ID %d started on worker %d", jobid, os.getpid())
    ...

pool.schedule(function, (jobid, arg1, arg2))

Related

Interprocess Communication between two python scripts without STDOUT

I am trying to create a monitor script that monitors all the threads of a huge Python script, which has several loggers and several threads running.
From Monitor.py I could run a subprocess and forward its STDOUT, which might contain the status of the threads, but since several loggers are running I see their log output mixed in as well.
Question: How can I run the main script as a separate process and get custom messages and thread status without interfering with logging? (By passing a PIPE as an argument?)
Main_Script.py
* Runs several threads
* Each thread has a separate logger
Monitor.py
* Spins up Main_Script.py
* Monitors each of the threads in Main_Script.py (and may obtain other messages from Main_Script.py in the future)
So far, I have tried subprocess and Process from multiprocessing.
subprocess lets me start Main_Script.py and forward its stdout back to the monitor, but the threads' log output comes through the same STDOUT. I am using the logging library (import logging) to log the data from each thread to separate files.
I also tried Process from multiprocessing. I had to call the main function of Main_Script.py as a process and pass it a PIPE argument from Monitor.py. Now I can't see Main_Script.py as a separate process when I run the top command.

Normally, you want to change the child process to work like a typical Unix userland tool: the logging and other side-band information goes to stderr (or to a file, or syslog, etc.), and only the actual output goes to stdout.
Then, the problem is easy: just capture stdout to a PIPE that you process, and either capture stderr to a different PIPE, or pass it through to real stderr.
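As a minimal sketch of the stdout/stderr split described above: the child program, its log message, and its STATUS line are hypothetical stand-ins for the real script, and run_child is an illustrative name. The child sends its logging to stderr and its actual output to stdout, and the parent captures each through its own pipe.

```python
import subprocess
import sys

# Hypothetical child program: logs go to stderr, real output goes to stdout.
CHILD_CODE = r"""
import logging, sys
logging.basicConfig(stream=sys.stderr, level=logging.INFO)
logging.info("worker thread started")   # side-band information -> stderr
print("STATUS thread-1 alive")          # actual output -> stdout
"""


def run_child():
    """Run the child and capture stdout and stderr through separate pipes."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD_CODE],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )
    out, err = proc.communicate()
    return out.splitlines(), err.splitlines()


if __name__ == "__main__":
    status_lines, log_lines = run_child()
    print("status:", status_lines)
    print("logs:", log_lines)
```

Leaving out the stderr argument (its default is None) would pass the child's log output through to the parent's real stderr instead of capturing it.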
If that's not appropriate for some reason, you need to come up with some other mechanism for IPC: Unix or Windows named pipes, anonymous pipes that you pass by leaking the file descriptor across the fork/exec and then pass the fd as an argument, Unix-domain sockets, TCP or UDP localhost sockets, a higher-level protocol like a web service on top of TCP sockets, mmapped files, anonymous mmaps or pipes that you pass between processes via a Unix-domain socket or Windows API calls, …
As you can see, there are a huge number of options. Without knowing anything about your problem other than that you want "custom messages", it's impossible to tell you which one you want.
While we're at it: If you can rewrite your code around multiprocessing rather than subprocess, there are nice high-level abstractions built in to that module. For example, you can use a Queue that automatically manages synchronization and blocking, and also manages pickling/unpickling so you can just pass any (picklable) object rather than having to worry about serializing to text and parsing the text. Or you can create shared memory holding arrays of int32 objects, or NumPy arrays, or arbitrary structures that you define with ctypes. And so on. Of course you could build the same abstractions yourself, without needing to use multiprocessing, but it's a lot easier when they're there out of the box.
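For illustration, a small sketch of the Queue approach, assuming the work can be expressed as a function run by multiprocessing.Process. The names worker and collect and the message layout are made up for this example. Each child puts a picklable dict on the shared queue and the parent reads the dicts back, with no text serialization or parsing involved.

```python
import multiprocessing as mp
import os


def worker(queue, job_id):
    # Any picklable object can be sent; here a small dict that includes
    # the PID of the process that handled the job.
    queue.put({"job": job_id, "pid": os.getpid(), "status": "done"})


def collect(n_jobs=3):
    """Start n_jobs child processes and gather one message from each."""
    queue = mp.Queue()
    procs = [mp.Process(target=worker, args=(queue, i)) for i in range(n_jobs)]
    for p in procs:
        p.start()
    # Read the messages before joining so no child blocks on a full pipe.
    messages = [queue.get(timeout=10) for _ in procs]
    for p in procs:
        p.join()
    return sorted(messages, key=lambda m: m["job"])


if __name__ == "__main__":
    for msg in collect():
        print(msg)
```

The `if __name__ == "__main__":` guard matters on platforms where child processes are spawned rather than forked, because the main module is imported again in each child.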
Finally, while your question is tagged ipc and pipe, and titled "Interprocess Communication", your description refers to threads, not processes. If you actually are using a bunch of threads in a single process, you don't need any of this.
You can just stick your results on a queue.Queue, or store them in a list or deque with a Lock around it, or pass in a callback to be called with each new result, or use a higher-level abstraction like concurrent.futures.ThreadPoolExecutor and return a Future object or an iterator of Futures, etc.
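A rough sketch of the single-process case, assuming each thread only needs to report a status. The names worker and collect_statuses are hypothetical. The threads put their results on a queue.Queue and the monitoring code drains it once they have finished.

```python
import queue
import threading


def worker(name, results):
    # queue.Queue is thread-safe, so no extra lock is needed here.
    results.put((name, "alive"))


def collect_statuses(names):
    """Run one thread per name and return a {name: status} dict."""
    results = queue.Queue()
    threads = [threading.Thread(target=worker, args=(n, results)) for n in names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # All threads have finished, so every result is already on the queue.
    statuses = {}
    while not results.empty():
        name, state = results.get()
        statuses[name] = state
    return statuses


if __name__ == "__main__":
    print(collect_statuses(["reader", "writer"]))
```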

Message passing between processes using multiprocessing module

What's the recommended way of sending message from a worker process to another randomly selected (worker or master) process? One approach that I can think of is using Pipes, but since it can only create a pipe between two selected processes, I need to create a pipe for each process pair. This doesn't seem so practical. What I want is to create a complete graph between processes and select one of the pipes randomly.

You could use a Queue to communicate among your processes by maintaining some convention in your queue. You can find the details on using Queue here.
P.S.: As mentioned here, Queues are thread- and process-safe.
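One possible convention, sketched here as an assumption rather than the only way to do it, is to give every process its own inbox Queue, so that N queues replace a pipe per process pair. Any process holding the list of inboxes can then pick a random recipient. The names worker, send_random, demo, and the STOP sentinel are all illustrative.

```python
import multiprocessing as mp
import random

STOP = None  # sentinel telling a worker to exit


def worker(my_id, inboxes, master_inbox):
    """Read this process's own inbox and report each message to the master."""
    while True:
        msg = inboxes[my_id].get()
        if msg is STOP:
            break
        master_inbox.put((my_id, msg))


def send_random(inboxes, msg):
    """Put msg on a randomly chosen inbox and return the chosen index."""
    target = random.randrange(len(inboxes))
    inboxes[target].put(msg)
    return target


def demo(n_workers=3, n_messages=5):
    inboxes = [mp.Queue() for _ in range(n_workers)]
    master_inbox = mp.Queue()
    procs = [
        mp.Process(target=worker, args=(i, inboxes, master_inbox))
        for i in range(n_workers)
    ]
    for p in procs:
        p.start()
    sent = []
    for k in range(n_messages):
        msg = "msg-%d" % k
        sent.append((send_random(inboxes, msg), msg))
    received = [master_inbox.get(timeout=10) for _ in range(n_messages)]
    for q in inboxes:
        q.put(STOP)  # arrives after the real messages sent by this process
    for p in procs:
        p.join()
    return sorted(sent), sorted(received)


if __name__ == "__main__":
    sent, received = demo()
    print("sent:    ", sent)
    print("received:", received)
```

In this sketch only the master sends, but since each worker also receives the full list of inboxes, a worker could call send_random in the same way to reach a random peer.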

Python Celery - lookup task by pid

A pretty straightforward question, maybe -
I often see a celery task process running on my system that I cannot find when I use celery.task.control.inspect()'s active() method. Often this process will be running for hours, and I worry that it's a zombie of some sort. Usually it's using up a lot of memory, too.
Is there a way to look up a task by Linux PID? Does Celery or the AMQP result backend save that?
If not, any other way to figure out which particular task is the one that's sitting around eating up memory?
---- updated:
What can I do when active() tells me that there are no tasks running on a particular box, but the box's memory is in full use, and htop shows that these worker pool threads are the ones using it while using 0% CPU? If it turns out this is related to some quirk of my current Rackspace setup and nobody can answer, I'll still accept Loren's.
Thanks~

I'm going to make the assumption that by 'task' you mean 'worker'. The question would make little sense otherwise.
For some context it's important to understand the process hierarchy of Celery worker pools. A worker pool is a group of worker processes (or threads) that share the same configuration (process messages of the same set of queues, etc.). Each pool has a single parent process that manages the pool. This process controls how many child workers are forked and is responsible for forking replacement children when children die. The parent process is the only process bound to AMQP and the children ingest and process tasks from the parent via IPC. The parent process itself does not actually process (run) any tasks.
Additionally, and towards an answer to your question, the parent process is the process responsible for responding to your Celery inspect broadcasts, and the PIDs listed as workers in the pool are only the child workers. The parent PID is not included.
If you're starting the Celery daemon using the --pidfile command-line parameter, that file will contain the PID of the parent process, and you should be able to cross-reference that PID with the process you're referring to, to determine whether it is in fact a pool parent process. If you're using Celery multi to start multiple instances (multiple worker pools), then by default PID files should be located in the directory from which you invoked Celery multi. If you're not using either of these means to start Celery, try using one of them to verify that the process isn't a zombie and is in fact simply a parent.

scheduling jobs using python apscheduler

I have to monitor a process continuously and I use the process ID to monitor the process. I wrote a program to send an email once the process had stopped so that I would manually reschedule it, but often I forget to reschedule the process ( basically another python program). I then came across the apscheduler module and used the cron style scheduling ( http://packages.python.org/APScheduler/cronschedule.html) to spawn a process once it has stopped. Now, I am able to spawn the process once PID of the process has been killed, but when I spawn it using the apscheduler I am not able to get the process id (PID) of the newly scheduled process; Hence, I am not able to monitor the process. Is there a function in apscheduler to get the process ID of the scheduled process?

Instead of relying on APScheduler to return the PID, why not have your program report the PID itself? It's quite common for daemons to have pidfiles, which are files at a known location that just contain the PID of the running process. Just wrap your main function in something like this:
import os
try:
    # Open in write mode; the default read mode would make write() fail.
    with open("/tmp/myproc.pid", "w") as pidfile:
        pidfile.write(str(os.getpid()))
    main()
finally:
    os.remove("/tmp/myproc.pid")
Now whenever you want to monitor your process, you can first check whether the pid file exists and, if it does, retrieve the PID of the process for further monitoring. This has the benefit of being independent of a specific implementation of cron, and will make it easier in future if you want to write more programs that interact with the program locally.
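On the monitoring side, a minimal sketch could look like the following. It assumes a Unix-like system, where os.kill with signal 0 checks that a process exists without actually signalling it. The helper names read_pid, is_running, and monitored_process_alive are hypothetical.

```python
import os


def read_pid(pidfile):
    """Return the PID stored in pidfile, or None if it is missing or malformed."""
    try:
        with open(pidfile) as f:
            return int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return None


def is_running(pid):
    """Unix-only check: signal 0 tests for existence without sending a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def monitored_process_alive(pidfile="/tmp/myproc.pid"):
    pid = read_pid(pidfile)
    return pid is not None and is_running(pid)


if __name__ == "__main__":
    print("alive" if monitored_process_alive() else "not running")
```

A stale pidfile left behind by a crashed process can name a PID that the system has since reused, so a check like this is a hint and not a guarantee.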

Starting a process as a subprocess in Python

I am writing a program that uses multiple worker processes (a pre-forking model) with the following code.
from multiprocessing import Process
for i in range(0,3):
    Process(target=worker, args=(i,)).start()
I use Windows. I notice that they are run as separate processes when I wanted them to start as subprocesses instead. How do I make them subprocesses of the main process?
I am hesitant to use the subprocess module as it seems suited to run external processes (as far as I have used it).
An update: It seems Windows does not launch new processes as sub-processes. Python doesn't support getppid() (get parent's PID) on Windows.

What do you call a subprocess? To me they are subprocesses of your main process. Here is my example and the output it returned.
import os
from multiprocessing import Process

def worker():
    # Each child reports its own PID and the PID of its parent.
    print("I'm process %s, my father is %s" % (os.getpid(), os.getppid()))

print("I'm the main process %s" % os.getpid())
for i in range(0,3):
    Process(target=worker).start()
The output is:
I'm the main process 5897
I'm process 5898, my father is 5897
I'm process 5899, my father is 5897
I'm process 5900, my father is 5897
You have 3 subprocesses attached to a main process...

You seem to be confusing terminology here. A subprocess is a separate process. The processes that are created will be children of the main process of your program, and in that sense are subprocesses. If you want threads, then use the threading module instead of multiprocessing, but note that in standard CPython the global interpreter lock keeps multiple threads from executing Python bytecode on several cores/CPUs at once.
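To make the distinction concrete, here is a small sketch (collect_pids is a made-up name): threads started with the threading module all live inside the process that created them, so each one reports the same PID as the main program, whereas child processes each get their own.

```python
import os
import threading


def collect_pids(n_threads=3):
    """Start n_threads threads and record the PID each one sees."""
    pids = []
    lock = threading.Lock()

    def worker():
        with lock:
            pids.append(os.getpid())

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return pids


if __name__ == "__main__":
    print("main process:", os.getpid())
    print("seen by threads:", collect_pids())
```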
"I am hesitant to use the subprocess module as it seems suited to run external processes"
I'm sorry, I don't understand this remark.

Short answer: http://docs.python.org/library/threading.html
Longer: I don't understand the question, aitchnyu. In the typical Unix model, the only processes a process can start are subprocesses. I have a strong feeling that there's a vocabulary conflict between the two of us I don't know how to unravel. You seem to have something like an "internal process" in mind; what's an example of that, in any language or operating system?
I can attest that Python's subprocess module is widely used.
You write "... multiple working threads ..." Have you read the documentation to which I refer in the first line at the top of this response?
