Recently I've been studying parallel programming tools in Python, and here are two major differences I've noticed between os.pipe and multiprocessing.Pipe (aside from the situations in which they are used):
os.pipe is unidirectional, multiprocessing.Pipe is bidirectional;
When putting things into the pipe and receiving things from it, os.pipe uses encode/decode, while multiprocessing.Pipe uses pickle/unpickle.
I want to know if my understanding is correct, and whether there are other differences. Thank you.
I believe everything you've stated is correct.
On Linux, os.pipe is just a Python interface for accessing traditional POSIX pipes. On Windows, it's implemented using CreatePipe. When you call it, you get two ordinary file descriptors back. It's unidirectional, and you just write bytes to it on one end that get buffered by the kernel until someone reads from the other side. It's fairly low-level, at least by Python standards.
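For illustration, a minimal sketch of os.pipe on a POSIX system (it uses os.fork, so this particular example won't run on Windows):

import os

r, w = os.pipe()              # two plain file descriptors: read end, write end
pid = os.fork()               # POSIX only
if pid == 0:                  # child: read raw bytes from the read end
    os.close(w)
    print(os.read(r, 1024).decode())
    os._exit(0)
else:                         # parent: write raw bytes to the write end
    os.close(r)
    os.write(w, "hello from the parent".encode())
    os.close(w)
    os.waitpid(pid, 0)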
multiprocessing.Pipe is a much higher-level interface, implemented using multiprocessing.Connection objects. On Linux, these are actually built on top of POSIX sockets rather than POSIX pipes. On Windows, they're built using the CreateNamedPipe API. As you noted, multiprocessing.Connection objects can send/receive any picklable object, and will automatically handle the pickling/unpickling process, rather than just dealing with bytes. They're capable of being either bidirectional or unidirectional.
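For comparison, a minimal sketch of multiprocessing.Pipe sending a picklable object in both directions:

from multiprocessing import Process, Pipe

def worker(conn):
    obj = conn.recv()                 # automatically unpickled
    conn.send({"echo": obj})          # any picklable object; the pipe is duplex

if __name__ == "__main__":
    parent_conn, child_conn = Pipe()  # duplex=True by default
    p = Process(target=worker, args=(child_conn,))
    p.start()
    parent_conn.send([1, 2, 3])
    print(parent_conn.recv())         # {'echo': [1, 2, 3]}
    p.join()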
I need to mutex several processes running Python on a Linux host.
The processes are not spawned in a way I control (to be clear, they are my code), so I cannot use multiprocessing.Lock, at least as I understand it. The resource being synchronized is a series of reads/writes to two separate internal services, which are old, stateful, not designed for concurrent/transactional access, and out of scope to modify.
A couple of approaches I'm familiar with but have rejected so far:
In native code, using shmget / pthread_mutex_lock (e.g. create a pthread mutex by a well-known string name, in shared memory provided by the OS). I'm hoping not to have to use/add a ctypes wrapper for this (or, ideally, to have any low-level constructs visible at all here for this high-level app).
Using one of the lock-file libraries such as fasteners would work, but requiring any particular file system access is awkward (the library/approach could use it robustly under the hood, but ideally my client code would be abstracted from that).
Is there a preferred way to accomplish this in Python (under Linux; bonus points for cross-platform)?
Options for synchronizing non-child processes:
Use a remote manager. I'm not super familiar with this process, but the docs have at least a simple example.
Create a simple server with your own protocol (rather than a manager): something like a socket server on the loopback address for bouncing simple messages around.
Use the filesystem: https://pypi.org/project/filelock/
On POSIX-compliant systems, there's a rather straightforward wrapper for IPC constructs: posix-ipc. I also found a wrapper for Windows semaphores, but it's not quite as simple (though also not difficult, per se). In both cases your program would use a well-known string "name" to access/create the mutex. In both cases, care and error checking are needed to handle creation of the mutex properly (see things like the O_CREX flag). A minimal sketch follows this list.
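Here is that sketch using posix-ipc (the semaphore name is an arbitrary well-known string I made up; O_CREAT creates the semaphore only if it does not already exist, so every process ends up sharing the same one). The commented-out lines show the equivalent filelock usage.

import posix_ipc

# Attach to (or create) a named semaphore that acts as a cross-process mutex.
sem = posix_ipc.Semaphore("/my_app_mutex", flags=posix_ipc.O_CREAT, initial_value=1)

sem.acquire()
try:
    pass  # talk to the stateful internal services here
finally:
    sem.release()

# Filesystem-based alternative with filelock:
# from filelock import FileLock
# with FileLock("/tmp/my_app.lock"):
#     ...  # critical section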
I was doing multiprocessing in Python and hit a pickling error, which makes me wonder: why do we need to pickle objects in order to do multiprocessing? Isn't fork() enough?
Edit: I kind of get why we need pickle to do interprocess communication, but that is just for the data you want to transfer, right? Why does the multiprocessing module also try to pickle stuff like functions, etc.?
Which makes me wonder why do we need to pickle the object in order to do multiprocessing?
We don't need pickle, but we do need to communicate between processes, and pickle happens to be a very convenient, fast, and general serialization method for Python. Serialization is one way to communicate between processes; memory sharing is the other. Unlike memory sharing, the processes don't even need to be on the same machine to communicate. For example, PySpark uses serialization very heavily to communicate between executors (which are typically on different machines).
Addendum: There are also issues with the GIL (Global Interpreter Lock) when sharing memory in Python.
Isn't fork() enough?
Not if you want your processes to communicate and share data after they've forked. fork() clones the current memory space, but changes in one process won't be reflected in another after the fork (unless we explicitly share data, of course).
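A tiny demonstration of that point (POSIX only, since it uses os.fork):

import os

counter = 0
pid = os.fork()           # the child starts with a copy of the parent's memory
if pid == 0:
    counter += 1          # only the child's copy changes
    os._exit(0)
os.waitpid(pid, 0)
print(counter)            # still 0 in the parent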
I kind of get why we need pickle to do interprocess communication, but that is just for the data you want to transfer, right? Why does the multiprocessing module also try to pickle stuff like functions, etc.?
Sometimes complex objects (i.e. "other stuff"? I'm not totally clear on what you meant here) contain the data you want to manipulate, so we'll definitely want to be able to send that "other stuff".
Being able to send a function to another process is incredibly useful. You can create a bunch of child processes and then send them all a function to execute concurrently that you define later in your program. This is essentially the crux of PySpark (again a bit off topic, since PySpark isn't multiprocessing, but it feels strangely relevant).
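As a minimal sketch of that idea, multiprocessing.Pool pickles both the function and its arguments before shipping them to the workers, which is also why a lambda here would fail with a pickling error:

from multiprocessing import Pool

def square(x):                 # a module-level function pickles fine
    return x * x

if __name__ == "__main__":
    with Pool(4) as pool:
        print(pool.map(square, range(8)))   # [0, 1, 4, 9, 16, 25, 36, 49]
    # pool.map(lambda x: x * x, range(8)) would raise a PicklingError,
    # because lambdas can't be pickled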
There are some functional purists (mostly the LISP people) who argue that code and data are the same thing. So it's not much of a line to draw for some.
TL;DR: How can I spawn a different python interpreter (from within python) and create a communication channel between the parent and child when stdin/stdout are unavailable?
I would like my Python script to execute a modified Python interpreter and, through some kind of IPC such as multiprocessing.Pipe, communicate with the script that interpreter runs.
Let's say I've got something similar to the following:
import subprocess

subprocess.Popen(args=["/my_modified_python_interpreter.exe",
                       "--my_additional_flag",
                       "my_python_script.py"])
This works fine and well; it executes my Python script and all.
I would now like to set up some kind of interprocess communication with that modified Python interpreter.
Ideally, I would like to share something similar to one of the values returned by multiprocessing.Pipe(); however, I will need to share that object with the modified Python process (and I suspect multiprocessing.Pipe won't handle that well even if I do).
Although sending text and binary data will be sufficient (I don't need to share Python objects or anything), I do need this to be functional on all major OSes (Windows, Linux, Mac).
Some more use-case/business explanation
More specifically, the modified interpreter is the IDAPython interpreter that is shipped with IDA to allow scripting within the IDA tool.
Unfortunately, since stdio is already heavily used for the existing user interface functionalities (provided by IDA), I cannot use stdin/stdout for the communication.
I'm searching for possibilities that are better than the ones I thought of:
Use two files on disk (rx and tx channels) and pass the paths to both as arguments.
Use a local socket and pass a path as an argument.
Use a memory-mapped file and the tagname on Windows, and some other sync method on other OSes.
After some tinkering with the multiprocessing.Pipe function and the multiprocessing.Connection objects it returns, I realized that serialization of Connection objects is far simpler than I originally thought.
A Connection object has three describing properties:
fileno - A handle. An arbitrary file descriptor on Unix and a socket on Windows.
readable - A boolean controlling whether the Connection object can be read from.
writable - A boolean controlling whether the Connection object can be written to.
All three properties are accessible as object attributes and are controllable through the Connection class constructor.
It appears that if:
The process calling Pipe spawns a child process and shares the connection.fileno() number.
The child process creates a Connection object using that file descriptor as the handle.
Both interpreters implement the Connection object roughly the same (And this is the risky part, I guess).
then it is possible to Connection.send and Connection.recv between those two processes, even though they do not share the same interpreter build and the multiprocessing module was not actually used to instantiate the child process.
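A minimal POSIX sketch of that idea (the interpreter path and script name are just the ones from the question; pass_fds keeps the descriptor open across the exec, and on Windows the handle would have to be duplicated/inherited differently):

# parent.py - spawns the other interpreter and hands it one end of the pipe
import subprocess
from multiprocessing import Pipe

parent_conn, child_conn = Pipe()
fd = child_conn.fileno()

proc = subprocess.Popen(
    ["/my_modified_python_interpreter.exe", "my_python_script.py", str(fd)],
    pass_fds=(fd,),           # POSIX only: keep the descriptor open in the child
)
parent_conn.send({"cmd": "ping"})
print(parent_conn.recv())     # whatever the child sends back
proc.wait()

# my_python_script.py - wraps the inherited descriptor in a Connection object
import sys
from multiprocessing.connection import Connection   # _multiprocessing.Connection on Python 2

conn = Connection(int(sys.argv[1]))
print(conn.recv())            # {'cmd': 'ping'}
conn.send("pong")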
EDIT:
Please note the Connection class is available as multiprocessing.connection.Connection in Python 3 and as _multiprocessing.Connection in Python 2 (which might suggest its usage is discouraged; YMMV).
Going with my other answer turned out to be a mistake. Because of how handles are inherited in Python 2 on Windows, I couldn't get the same solution to work on Windows machines. I ended up using the far superior Listener and Client interfaces, also found in the multiprocessing module.
This question of mine discusses that mistake.
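For reference, a minimal sketch of the Listener/Client approach (the address and authkey here are made up for illustration):

# server side
from multiprocessing.connection import Listener

with Listener(("localhost", 6000), authkey=b"secret") as listener:
    with listener.accept() as conn:
        print(conn.recv())        # "ping"
        conn.send("pong")

# client side (runs in the other, unrelated process)
from multiprocessing.connection import Client

with Client(("localhost", 6000), authkey=b"secret") as conn:
    conn.send("ping")
    print(conn.recv())            # "pong"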
First question, so please be gentle.
I am using Python.
When creating a named pipe to a C++ Windows program with
PIPE = open(r'\\.\pipe\NamedPipe','rb+',0)
as a global, I can read from and write to the pipe.
def pipe_writer():
    PIPE.write(some_stuff)

def pipe_reader():
    data = struct.unpack("byte-type", PIPE.read(number_of_bytes))

pipe_writer()
pipe_reader()
This is fine for collecting data from the pipe and processing the complete data with several functions, one function after the other.
Unfortunately, I have to process the data bit by bit as I pull it from the pipe, with several functions, in a serialized manner.
I thought that queueing the data would do the job, so I used the multiprocessing module.
When I try to multiprocess, I am able to create the pipe and send data once, right after opening it:
if __name__ == '__main__':
    PIPE = open(r'\\.\pipe\NamedPipe', 'rb+', 0)
    PIPE.write(some_stuff)
when I then try to .start() the functions as processes and read from the pipe I get an error that the pipe doesn't exist or is open in the wrong mode, which can't really be as it works just fine when reading/writing to it without using Process() on the functions AND i can write to it ... even if it's only once.
Any suggestions? Also, I think I kind of need to use multiprocessing, as threading doesn't work ... probably ... because of the GIL slowing things down.
If you're in control of the C++ source code too, you can save yourself a lot of code and hassle by moving to ZeroMQ or Nanomsg instead of the pipe, and Google Protocol Buffers instead of interpreting a byte stream yourself.
ZeroMQ and Nanomsg are like networks/pipes/IPC on steroids, and are much easier to use than raw pipes, sockets, etc. You have less source code and more functionality: win-win.
Google's Protocol Buffers allow you to define data structures (messages) in a language-neutral way, and then auto-generate source code in C++, Python, Java, or whatever. That source code defines structs, classes, etc. that represent the messages and also converts them to a standard binary format. That binary data is what you'll send via ZeroMQ. Again, less source code for you to write, more functionality.
This is ideal for getting C++ classes into Python and vice versa.
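For illustration, a minimal request/reply sketch with pyzmq (the endpoint is a placeholder; a Protocol Buffers message would be serialized to bytes and sent the same way, and the C++ side would mirror this with ZeroMQ's C++ bindings):

# replier (could just as well be the C++ program)
import zmq

ctx = zmq.Context()
rep = ctx.socket(zmq.REP)
rep.bind("tcp://127.0.0.1:5555")
request = rep.recv()              # raw bytes, e.g. a serialized protobuf message
rep.send(b"ack: " + request)

# requester
import zmq

ctx = zmq.Context()
req = ctx.socket(zmq.REQ)
req.connect("tcp://127.0.0.1:5555")
req.send(b"hello")
print(req.recv())                 # b'ack: hello'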
The nanomsg Python wrapper is also available on GitHub at Nanomsg Python.
You can see examples at the Examples page. I guess this wrapper will serve your purpose. It's always better to use this in place of raw pipes. It supports IPC (between processes) and TCP communication patterns.
Moreover, it is cross-platform and its base implementation is in C, so I guess communication between a Python process and a C process can also be made possible.
Why does the multiprocessing package for Python pickle objects to pass them between processes, i.e. to return results from different processes to the main interpreter process? This may be an incredibly naive question, but why can't process A say to process B "object x is at point y in memory, it's yours now" without having to perform the operations necessary to represent the object as a string?
multiprocessing runs jobs in different processes. Processes have their own independent memory spaces, and in general cannot share data through memory.
To make processes communicate, you need some sort of channel. One possible channel would be a "shared memory segment", which pretty much is what it sounds like. But it's more common to use "serialization". I haven't studied this issue extensively but my guess is that the shared memory solution is too tightly coupled; serialization lets processes communicate without letting one process cause a fault in the other.
When data sets are really large, and speed is critical, shared memory segments may be the best way to go. The main example I can think of is video frame buffer image data (for example, passed from a user-mode driver to the kernel or vice versa).
http://en.wikipedia.org/wiki/Shared_memory
http://en.wikipedia.org/wiki/Serialization
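For the shared-memory route, Python 3.8+ exposes named shared memory segments directly; here is a minimal sketch (older versions can use multiprocessing.Value/Array for a similar effect):

from multiprocessing import Process, shared_memory

def worker(name):
    shm = shared_memory.SharedMemory(name=name)   # attach to the existing segment
    shm.buf[0] = 42                               # visible to the parent without pickling
    shm.close()

if __name__ == "__main__":
    shm = shared_memory.SharedMemory(create=True, size=16)
    p = Process(target=worker, args=(shm.name,))
    p.start()
    p.join()
    print(shm.buf[0])    # 42
    shm.close()
    shm.unlink()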
Linux, and other *NIX operating systems, provide a built-in mechanism for sharing data via serialization: "domain sockets". These should be quite fast.
http://en.wikipedia.org/wiki/Unix_domain_socket
Since Python has pickle that works well for serialization, multiprocessing uses that. pickle is a fast, binary format; it should be more efficient in general than a serialization format like XML or JSON. There are other binary serialization formats such as Google Protocol Buffers.
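For instance, pickle round-trips an arbitrary picklable object to a compact byte string that can be written to a pipe or socket:

import pickle

payload = {"result": [1, 2, 3], "ok": True}
data = pickle.dumps(payload)            # bytes, ready to send between processes
print(pickle.loads(data) == payload)    # True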
One good thing about using serialization: it's about the same to share the work within one computer (to use additional cores) or to share the work between multiple computers (to use multiple computers in a cluster). The serialization work is identical, and network sockets work about like domain sockets.
EDIT: @Mike McKerns said, in a comment below, that multiprocessing can use shared memory sometimes. I did a Google search and found this great discussion of it: Python multiprocessing shared memory