I currently have a worker subprocess that does a lot of processing for my main application. I already have stdin and stdout connected between the two, but now I need more than just these two channels, since the subprocess worker is heavily threaded and should be able to run multiple different workloads at the same time.
So for each workload, I want to dynamically create a separate pipe between the main application and the subprocess. I don't want to run more than one subprocess worker in the background; everything should run in a single subprocess.
The problem I ran into is creating named pipes between the main application and the subprocess that work on both Unix and Windows. Unix has os.mkfifo(), which can be used with temp files, but that does not work on Windows. os.pipe() will not work either, because the pipe it allocates belongs only to the main application, and I have no way to link it to the subprocess.
So basically,
tmp_read, tmp_write = os.pipe()
both tmp_read and tmp_write are just integers (file descriptors) valid in the main application. I can't simply send these integers to the subprocess and connect, as the subprocess has no idea what they refer to. Am I missing something, or is it just not possible to share an undefined number of pipes between the processes using IPC? I can't use sockets for IPC either, since the computers this has to run on are heavily restricted and I don't want to deal with blocked ports.
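For reference, the closest thing I have found is multiprocessing.connection, which wraps Windows named pipes and Unix domain sockets behind one API; here is a minimal sketch of the per-workload channel I have in mind (the addresses and key are made up), though I am not sure it fits my constraints:

import sys
from multiprocessing.connection import Listener, Client

# Per-workload address: a named pipe on Windows, a Unix domain socket path
# elsewhere (both names here are made up).
if sys.platform == 'win32':
    address = r'\\.\pipe\workload_1'
else:
    address = '/tmp/workload_1'

# Main application side: wait for the worker to connect, then exchange
# picklable objects over the connection.
listener = Listener(address, authkey=b'placeholder-key')
conn = listener.accept()
conn.send({'workload': 1, 'payload': 'data'})
print(conn.recv())

# Worker subprocess side, given the same address string (e.g. via stdin):
#     conn = Client(address, authkey=b'placeholder-key')
#     job = conn.recv()
#     conn.send('done')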
Related
I am trying to create a Monitor script that monitors all the threads of a huge Python script which has several loggers and several threads running.
From Monitor.py I could run a subprocess and forward its STDOUT, which might contain the status of the threads, but since several loggers are running I am seeing other logging mixed into it.
Question: How can I run the main script as a separate process and get custom messages and thread status without interfering with the logging? (By passing a PIPE as an argument?)
Main_Script.py
* Runs several threads
* Each thread has a separate logger

Monitor.py
* Spins up Main_Script.py
* Monitors each of the threads in Main_Script.py (and may obtain other messages from Main_Script.py in the future)
So far, I have tried subprocess and Process from multiprocessing.
subprocess lets me start Main_Script.py and forward its stdout back to the monitor, but I see the logging from the threads coming in through the same STDOUT. I am using the logging library to log the data from each thread to separate files.
I also tried Process from multiprocessing. I had to call the main function of Main_Script.py as a process and pass a Pipe to it as an argument from Monitor.py. But now I can't see Main_Script.py as a separate process when I run the top command.
Normally, you want to change the child process to work like a typical Unix userland tool: the logging and other side-band information goes to stderr (or to a file, or syslog, etc.), and only the actual output goes to stdout.
Then, the problem is easy: just capture stdout to a PIPE that you process, and either capture stderr to a different PIPE, or pass it through to real stderr.
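A minimal sketch of that split, assuming the child is the Main_Script.py from the question and its real output goes to stdout:

import subprocess

# Status messages come out of stdout; logging stays on stderr (here passed
# through to the monitor's own stderr).
proc = subprocess.Popen(
    ['python', 'Main_Script.py'],
    stdout=subprocess.PIPE,
    stderr=None,                 # or subprocess.PIPE to capture the logs too
    universal_newlines=True,
)
for line in proc.stdout:
    print('status:', line.rstrip())   # handle each status message as it arrives
proc.wait()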
If that's not appropriate for some reason, you need to come up with some other mechanism for IPC: Unix or Windows named pipes, anonymous pipes that you pass by leaking the file descriptor across the fork/exec and then pass the fd as an argument, Unix-domain sockets, TCP or UDP localhost sockets, a higher-level protocol like a web service on top of TCP sockets, mmapped files, anonymous mmaps or pipes that you pass between processes via a Unix-domain socket or Windows API calls, …
As you can see, there are a huge number of options. Without knowing anything about your problem other than that you want "custom messages", it's impossible to tell you which one you want.
While we're at it: If you can rewrite your code around multiprocessing rather than subprocess, there are nice high-level abstractions built in to that module. For example, you can use a Queue that automatically manages synchronization and blocking, and also manages pickling/unpickling so you can just pass any (picklable) object rather than having to worry about serializing to text and parsing the text. Or you can create shared memory holding arrays of int32 objects, or NumPy arrays, or arbitrary structures that you define with ctypes. And so on. Of course you could build the same abstractions yourself, without needing to use multiprocessing, but it's a lot easier when they're there out of the box.
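For example, a minimal sketch of the Queue approach (the status dict is made up):

import multiprocessing

def worker(q):
    # Any picklable object can go straight onto the queue; no manual
    # serialization or parsing needed.
    q.put({'thread': 'parser', 'status': 'ok', 'items': 42})

if __name__ == '__main__':
    q = multiprocessing.Queue()
    p = multiprocessing.Process(target=worker, args=(q,))
    p.start()
    print(q.get())    # blocks until the child puts a result
    p.join()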
Finally, while your question is tagged ipc and pipe, and titled "Interprocess Communication", your description refers to threads, not processes. If you actually are using a bunch of threads in a single process, you don't need any of this.
You can just stick your results on a queue.Queue, or store them in a list or deque with a Lock around it, or pass in a callback to be called with each new result, or use a higher-level abstraction like concurrent.futures.ThreadPoolExecutor and return a Future object or an iterator of Futures, etc.
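A minimal sketch of the ThreadPoolExecutor version (the work function is a placeholder):

from concurrent.futures import ThreadPoolExecutor, as_completed

def do_work(n):
    # Placeholder for the real per-thread workload.
    return n * n

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(do_work, n) for n in range(8)]
    for fut in as_completed(futures):
        print(fut.result())    # results arrive as each thread finishes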
I have a process that connects to a pipe with Python 2.7's multiprocessing.Listener() and waits for a message with recv(). I run it on both Windows 7 and Ubuntu 11.
On Windows, the pipe is called \\.\pipe\some_unique_id. On Ubuntu, the pipe is called /temp/some_unique_id. Other than that, the code is the same.
All works well, until, in an unrelated bug, monit starts a SECOND copy of the same program. It tries to listen to the exact same pipe.
I had naively* expected that the second connection attempt would fail, leaving the first connection unscathed.
Instead, I find the behaviour is officially undefined.
Note that data in a pipe may become corrupted if two processes (or threads) try to read from or write to the same end of the pipe at the same time.
On Ubuntu, the earlier copies seem to be ignored, and are left without any messages, while the latest version wins.
On Windows, the behaviour is more complex. Sometimes the original pipe raises an EOFError on the recv() call; sometimes both listeners are allowed to coexist and each message is delivered to one of them arbitrarily.
Is there a way to open a pipe exclusively, so that a second process cannot open it while the first process still has it open?
* I could have sworn I manually tested this exact scenario, but clearly I did not.
Other SO questions I looked at:
several TCP-servers on the same port - I don't (knowingly) set SO_REUSEADDR
Can two applications listen to the same port?
accept() with sockets shared between multiple processes (based on Apache preforking) - there's no forking involved.
Named pipes have the same access semantics as regular files: any process with read or write permission can open the pipe for reading or writing.
If you had a way to guarantee that the two instances of the Python script were invoked by processes with differing UIDs or GIDs, you could implement unique access control using file permissions.
If both instances of the script have the same UID and GID, you can try the file locking implemented in Skip Montanaro's FileLock, hosted on GitHub. YMMV.
A simpler way to implement this might be to create a lock file in /var/lock containing the PID of the process that created it, and to check for that file's existence before opening the pipe. This scheme is used by most long-running daemons, but it has problems when the process that created the lock file terminates in a way that prevents it from removing the file.
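A minimal sketch of that lock-file scheme (the path is a placeholder, and it shares the stale-lock problem just described):

import errno
import os
import sys

LOCKFILE = '/var/lock/my_listener.pid'    # placeholder path

# O_CREAT | O_EXCL makes creation atomic: a second instance fails here
# instead of opening the same pipe.
try:
    fd = os.open(LOCKFILE, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
except OSError as e:
    if e.errno == errno.EEXIST:
        sys.exit('another instance is already listening')
    raise
os.write(fd, str(os.getpid()).encode())
os.close(fd)

try:
    pass    # ... create the Listener and call recv() here ...
finally:
    os.remove(LOCKFILE)    # best effort; a crash can leave a stale file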
You could also try a System V semaphore from Python to prevent simultaneous access.
I need a semaphore to restrict how many instances of a particular subprocess run in parallel. I am using gunicorn with eventlet workers and allow many simultaneous connections. Mostly these are waiting on remote data. However, they all enter a processing phase at some point, and this involves calling a subprocess. This subprocess, though, should not run too many instances in parallel, as it is memory/CPU hungry.
Is threading.Semaphore correctly monkey_patch'd and usable with eventlet inside gunicorn?
As I understand the problem:
* one gunicorn process (this is crucial) spawns N green threads
* each worker may spawn one or more subprocesses
* you want to limit the total number of subprocesses
In this case, yes, semaphore will work as expected.
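A minimal sketch of that single-process case, assuming eventlet's monkey patching is active (as it is under gunicorn's eventlet worker) and with a made-up limit:

import threading
import subprocess

MAX_PARALLEL = 2                            # placeholder limit
_slots = threading.Semaphore(MAX_PARALLEL)  # eventlet-aware after monkey_patch()

def run_heavy_tool(args):
    # Only MAX_PARALLEL green threads get past this point at once; the rest
    # yield to the eventlet hub instead of blocking the whole worker.
    with _slots:
        return subprocess.check_output(args)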
However, if you have more than one gunicorn process, each has its own instance of the semaphore and you will observe more subprocesses than the limit. In that case, I recommend moving the subprocess responsibilities to a separate application running on the same machine and calling it via whatever API you like (RPC/socket/message queue/D-Bus/etc.). You could design the system like this:
user -> gunicorn (any number of processes)
gunicorn -> one subprocess manager
manager -> N subprocesses
The manager listens for jobs from gunicorn, spawns a subprocess if needed, and may reuse existing subprocesses. You may like a job queue system such as Beanstalk, Celery, or Gearman, or you may wish to build a custom solution on top of an existing message transport like NSQ, RabbitMQ, or ZeroMQ.
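A bare-bones sketch of such a manager over a local socket, handling one job at a time (the address, key, and command are made up; a real deployment would use one of the queue systems above):

# manager.py: a single process that serializes calls to the heavy tool.
import subprocess
from multiprocessing.connection import Listener

with Listener('/tmp/subprocess_manager.sock', authkey=b'placeholder') as server:
    while True:
        with server.accept() as conn:
            cmd = conn.recv()                  # e.g. ['heavy-tool', '--input', 'x']
            conn.send(subprocess.check_output(cmd))

# In a gunicorn worker:
#     from multiprocessing.connection import Client
#     with Client('/tmp/subprocess_manager.sock', authkey=b'placeholder') as conn:
#         conn.send(['heavy-tool', '--input', 'x'])
#         result = conn.recv()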
I have a memory intensive Python application (between hundreds of MB to several GB).
I have a couple of VERY SMALL Linux executables the main application needs to run, e.g.
child = Popen("make html", cwd=r'../../docs', stdout=PIPE, shell=True)
child.wait()
When I run these external utilities (once, at the end of the long main process run) using subprocess.Popen I sometimes get OSError: [Errno 12] Cannot allocate memory.
I don't understand why... The requested process is tiny!
The system has enough memory for many more shells.
I'm using Linux (Ubuntu 12.10, 64-bit), so I guess subprocess calls fork().
And fork() duplicates my existing process, thus doubling the amount of memory consumed, and fails??
What happened to "copy-on-write"?
Can I spawn a new process without fork (or at least without copying memory - starting fresh)?
Related:
The difference between fork(), vfork(), exec() and clone()
fork () & memory allocation behavior
Python subprocess.Popen erroring with OSError: [Errno 12] Cannot allocate memory after period of time
Python memory allocation error using subprocess.Popen
It doesn't appear that a real solution will be forthcoming (i.e. an alternate implementation of subprocess that uses vfork). So how about a cute hack? At the beginning of your process, spawn a slave that hangs around with a small memory footprint, ready to spawn your subprocesses, and keep open communication to it throughout the life of the main process.
Here's an example using rfoo (http://code.google.com/p/rfoo/) with a named unix socket called rfoosocket (you could obviously use other connection types rfoo supports, or another RPC library):
Server:
import rfoo
import subprocess
class MyHandler(rfoo.BaseHandler):
    def RPopen(self, cmd):
        c = subprocess.Popen(cmd, stdout=subprocess.PIPE, shell=True)
        c.wait()
        return c.stdout.read()

rfoo.UnixServer(MyHandler).start('rfoosocket')
Client:
import rfoo
# Waste a bunch of memory before spawning the child. Swap out the RPC below
# for a straight popen to show it otherwise fails. Tweak to suit your
# available system memory.
mem = [x for x in range(100000000)]
c = rfoo.UnixConnection().connect('rfoosocket')
print rfoo.Proxy(c).RPopen('ls -l')
If you need real-time back and forth coprocess interaction with your spawned subprocesses this model probably won't work, but you might be able to hack it in. You'll presumably want to clean up the available args that can be passed to Popen based on your specific needs, but that should all be relatively straightforward.
You should also find it straightforward to launch the server at the start of the client, and to manage the socket file (or port) to be cleaned up on exit.
Hi, let's assume I have a simple program in Python. This program is run every five minutes through cron, but I don't know how to write it so that it will run multiple processes of itself simultaneously. I want to speed things up...
I'd handle the forking and process control inside your main Python program. Let cron spawn only a single process, and let that process be a master for (possibly multiple) worker processes.
As for how you can create multiple workers, there's the threading module for multithreading and the multiprocessing module for multiprocessing. You can also keep your actual worker code in separate files and use the subprocess module.
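For example, a minimal master using multiprocessing that cron would start once (the work function is a placeholder):

import multiprocessing

def work(item):
    # Placeholder for the real per-item job.
    return item * 2

if __name__ == '__main__':
    # One master process; a fixed pool of workers handles the items in parallel.
    with multiprocessing.Pool(processes=4) as pool:
        results = pool.map(work, range(100))
    print(results)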
Now that I think about it, maybe you should use supervisord to do the actual process control and simply write the actual work code.