I want to use the Python multiprocessing module to communicate with child processes (via Pipes), which are independent of the caller (ie completely different and independent source code, path, executable).
The main process is doing some networking while the childs should simply do some data conversion (file in, file out). I need to do that several times a second, so restarting a subprocess for each file is no solution.
Is there a way to do that (under Windows)?
Related
I currently have a worker subprocess that does a lot of processing for my main application. I have my stdin and stdout already connected between the two, but now I need more than just these two between the main application and the subprocess worker, since the subprocess worker is heavily threaded and should be able to run multiple different workload at the same time.
So for each workload, I want to dynamically create a separate pipe between the main and subprocess. I don't want to run more than 1 subprocess worker in the background, and want everything to run on a single subprocess.
My problem that I ran into is creating named pipes between the main and subprocess, that will work on both Unix an Windows. Unix has a os.mkfifo() that can be used with tempfiles, but that does not work on Windows. The os.pipe() will not work, because the memory block being allocated is only for the main application, and I have no way to link that with the subprocess.
So basically,
tmp_read, tmp_write = os.pipe( )
for both tmp_read and tmp_write, they are represented as integers, or memory blocks on the main app stack. I can't send these integers to the subprocess and connect, as the subprocess has no idea what they mean. Am I missing something, or is it not possible to share an undefined number of pipes between the processes using IPC? I can't use sockets for IPC either, since the computers that this has to run on are heavily restricted and I don't want to deal with blocked ports.
I am trying to create a Monitor script that monitors all the threads or a huge python script which has several loggers running, several thread running.
From Monitor.py i could run subprocess and forward the STDOUT which might contain my status of the threads.. but since several loggers are running i am seeing other logging in that..
Question: How can run the main script as a separate process and get custom messages, thread status without interfering with logging. ( passing PIPE as argument ? )
Main_Script.py * Runs Several Threads * Each Thread has separate Loggers.
Monitor.py * Spins up the Main_script.py * Monitors the each of the threads in MainScript.py ( may be obtain other messages from Main_script in the future)
So Far, I tried subprocess, process from Multiprocessing.
Subprocess lets me start the Main_script and forward the stdout back to monitor but I see the logging of threads coming in through the same STDOUT. I am using the “import logging “ Library to log the data from each threads to separate files.
I tried “process” from Multiprocessing. I had to call the main function of the main_script.py as a process and send a PIPE argument to it from monitor.py. Now I can’t see the Main_script.py as a separate process when I run top command.
Normally, you want to change the child process to work like a typical Unix userland tool: the logging and other side-band information goes to stderr (or to a file, or syslog, etc.), and only the actual output goes to stdout.
Then, the problem is easy: just capture stdout to a PIPE that you process, and either capture stderr to a different PIPE, or pass it through to real stderr.
If that's not appropriate for some reason, you need to come up with some other mechanism for IPC: Unix or Windows named pipes, anonymous pipes that you pass by leaking the file descriptor across the fork/exec and then pass the fd as an argument, Unix-domain sockets, TCP or UDP localhost sockets, a higher-level protocol like a web service on top of TCP sockets, mmapped files, anonymous mmaps or pipes that you pass between processes via a Unix-domain socket or Windows API calls, …
As you can see, there are a huge number of options. Without knowing anything about your problem other than that you want "custom messages", it's impossible to tell you which one you want.
While we're at it: If you can rewrite your code around multiprocessing rather than subprocess, there are nice high-level abstractions built in to that module. For example, you can use a Queue that automatically manages synchronization and blocking, and also manages pickling/unpickling so you can just pass any (picklable) object rather than having to worry about serializing to text and parsing the text. Or you can create shared memory holding arrays of int32 objects, or NumPy arrays, or arbitrary structures that you define with ctypes. And so on. Of course you could build the same abstractions yourself, without needing to use multiprocessing, but it's a lot easier when they're there out of the box.
Finally, while your question is tagged ipc and pipe, and titled "Interprocess Communication", your description refers to threads, not processes. If you actually are using a bunch of threads in a single process, you don't need any of this.
You can just stick your results on a queue.Queue, or store them in a list or deque with a Lock around it, or pass in a callback to be called with each new result, or use a higher-level abstraction like concurrent.futures.ThreadPoolExecutor and return a Future object or an iterator of Futures, etc.
It seems like subprocess.Popen() and os.fork() both are able to create a child process. I would however like to know what the difference is between both. When would you use which one? I tried looking at their source code but I couldn't find fork()'s source code on my machine and it wasn't totally clear how Popen works on Unix machines.
Could someobody please elaborate?
Thanks
subprocess.Popen let's you execute an arbitrary program/command/executable/whatever in its own process.
os.fork only allows you to create a child process that will execute the same script from the exact line in which you called it. As its name suggests, it "simply" forks the current process into 2.
os.fork is only available on Unix, and subprocess.Popen is cross-platfrom.
So I read the documentation for you. Results:
os.fork only exists on Unix. It creates a child process (by cloning the existing process), but that's all it does. When it returns, you have two (mostly) identical processes, both running the same code, both returning from os.fork (but the new process gets 0 from os.fork while the parent process gets the PID of the child process).
subprocess.Popen is more portable (in particular, it works on Windows). It creates a child process, but you must specify another program that the child process should execute. On Unix, it is implemented by calling os.fork (to clone the parent process), then os.execvp (to load the program into the new child process). Because Popen is all about executing a program, it lets you customize the initial environment of the program. You can redirect its standard handles, specify command line arguments, override environment variables, set its working directory, etc. None of this applies to os.fork.
In general, subprocess.Popen is more convenient to use. If you use os.fork, there's a lot you need to handle manually, and it'll only work on Unix systems. On the other hand, if you actually want to clone a process and not execute a new program, os.fork is the way to go.
Subprocess.popen() spawns a new OS level process.
os.fork() creates another process which will resume at exactly the same place as this one. So within the first loop run, you get a fork after which you have two processes, the "original one" (which gets a pid value of the PID of the child process) and the forked one (which gets a pid value of 0).
According to
https://github.com/joblib/joblib/issues/180, and Is there a safe way to create a subprocess from a thread in python?
the Python multiprocessing module does not allow use from within threads. Is this true?
My understanding is that its fine to fork from threads, as long as you
aren't holding a threading.Lock when you do so (in the current thread? anywhere in the process?). However, Python's documentation is silent on whether threading.Lock objects are safely shared after a fork.
There's also this: locks shared from the logging module causes issues with fork. https://bugs.python.org/issue6721
I'm not sure how this issue arises. It sounds like the state of any locks in the process are copied into the child process when the current thread forks (which seems like a design error and certain to deadlock). If so, does using multiprocessing really provide any protection against this (since I'm free to create my multiprocessing.Pool after threading.Lock is created and entered by other threads, and after threads have started that using the not-fork-safe logging module) -- the multiprocessing module docs are also silent about whether multiprocessing.Pools should be allocated before Locks.
Does replacing threading.Lock with multiprocessing.Lock everywhere avoid this issue and allow us to safely combine threads and forks?
It sounds like the state of any locks in the process are copied into the child process when the current thread forks (which seems like a design error and certain to deadlock).
It is not a design error, rather, fork() predates single-process multithreading. The state of all locks is copied into the child process because they're just objects in memory; the entire address-space of the process is copied as is in fork. There are only bad alternatives: either copy all threads over fork, or deny forking in multithreaded application.
Therefore, fork()ing in a multithreading program was never the safe thing to do, unless then followed by execve() or exit() in the child process.
Does replacing threading.Lock with multiprocessing.Lock everywhere avoid this issue and allow us to safely combine threads and forks?
No. Nothing makes it safe to combine threads and forks, it cannot be done.
The problem is that when you have multiple threads in a process, after fork() system call you cannot continue safely running the program in POSIX systems.
For example, Linux manuals fork(2):
After a fork(2) in a multithreaded program, the child can safely call
only async-signal-safe functions (see signal(7)) until such time as it
calls execve(2).
I.e. it is OK to fork() in a multithreaded program and then only call async-signal-safe C functions (which is a rather limited subset of C functions), until the child process has been replaced with another executable!
Unsafe C function calls in child processes are then for example
malloc for dynamic memory allocation
any <stdio.h> functions for formatted input
most of the pthread_* functions required for thread state handling, including creation of new threads...
Thus there is very little what the child process can actually safely do. Unfortunately CPython core developers have been downplaying the problems caused by this. Even now the documentation says:
Note that safely forking a multithreaded process is
problematic.
Quite an euphemism for "impossible".
It is safe to use multiprocessing from a Python process that has multiple threads of control provided that you're not using the fork start method; in Python 3.4+ it is now possible to change the start method. In previous Python versions including all of Python 2, the POSIX systems always behaved as if fork was specified as the start method; this would result in undefined behaviour.
The problems are not limited to just threading.Lock objects but all locks held by the C standard library, the C extensions etc. What is worse that most of the time people would say "it works for me"... until it stops from working.
There were even a cases where a seemingly single-threading Python program is actually multithreading in MacOS X, causing failures and deadlocks upon using multiprocessing.
Another problem is that all opened file handles, their use, shared sockets might behave oddly in programs that forks, but that would be the case even in single-threaded programs.
TL;DR: using multiprocessing in multithreaded programs, with C extensions, with opened sockets etc:
fine in 3.4+ & POSIX if you explicitly specify a starting method that is not fork,
fine in Windows because it doesn't support forking;
in Python 2 - 3.3 on POSIX: you'll mostly shoot yourself in the foot.
I've written a large program in Python that calls numerous custom modules' main methods one after another. The parent script creates some common resources, like logging instances, database cursors, and file references which I then pass around to the individual modules. The problem is that I now need to call some of these modules by means of subprocess.check_output, and I don't know how I can share the aforementioned resources across these modules. Is this possible?
The question exactly as you ask it has no general answer. There might be custom ways; e.g. on Linux a lot of things are actually file descriptors, and there are ways to pass them to subprocesses, but it's not nicely Pythonic: you have to give them as numbers on the command line of the subprocess, and then the subprocess rebuilds a file object around the file descriptor (see file.fileno() and os.fdopen() for regular files; I'm not sure there are ways to do it in Python for other things than regular files...).
In your problem, if everything is in Python, why do you need to make subprocesses instead of doing it all in a single process?
If you really need to, then one general way is to use os.fork() instead of the subprocess module: you'd fork the process (which creates two copies of it); in the parent copy you wait for the child copy to terminate; and in the child copy you proceed to run the particular submodule. The advantage is that at the end the child process terminates, which cleans up what it did --- while at the same time starting with its own copy of almost everything that the parent had (file descriptors, database cursors, etc.)