Sharing extension-allocated data structures across python processes

I am working on a python extension that does some specialized tree things on a large (5GB+) in-memory data structure.
The underlying data structure is already thread-safe (it uses RW locks), and I've written a wrapper around it that exposes it to python. It supports multiple simultaneous readers, only one writer, and I'm using a pthread_rwlock for access synchronization. Since the application is very read-heavy, multiple readers should provide a decent performance improvement (I hope).
However, I cannot determine what the proper solution is to allow the extension data-store to be shared across multiple python processes accessed via the multiprocessing module.
Basically, I'd like something that looks like the current multiprocessing.Value/multiprocessing.Array system, but the caveat here is that my extension is allocating all its own memory in C++.
How can I allow multiple processes to access my shared data structure?
Sources are here (C++ Header-only library), and here (Cython wrapper).
Right now, if I build an instance of the tree, and then pass references out to multiple processes, it fails with a serialization error:
Traceback (most recent call last):
File "/usr/lib/python3.4/multiprocessing/queues.py", line 242, in _feed
obj = ForkingPickler.dumps(obj)
File "/usr/lib/python3.4/multiprocessing/reduction.py", line 50, in dumps
cls(buf, protocol).dump(obj)
TypeError: cannot serialize '_io.TextIOWrapper' object
(failing test-case)
I'm currently releasing the GIL in my library, but there are some future tasks that would greatly benefit from independent processes, and I'd like to avoid having to implement a RPC system for talking to the BK tree.

If the extension data is going to exist as a single logical mutable object across multiple processes (so a change in Process A will be reflected in the view in Process B), you can't avoid some sort of IPC mechanism. The memory spaces of two processes are separate; Python can't magically share unshared data.
The closest you could get (without explicitly using shared memory at the C layer to map the same memory into each process) would be to use a custom subclass of multiprocessing.BaseManager, which would just hide the IPC from you (the actual object would live in a single process, with other processes proxying to that original object). You can see a really simple example in the multiprocessing docs.
The manager approach is simple, but performance-wise it's probably not going to be great; shared memory at the C layer avoids a lot of overhead that the proxying mechanism can't. You'd need to test to verify anything. Making the C++ STL containers use shared memory would be a royal pain to my knowledge, and probably not worth the trouble, so unless the manager approach is shown to be too slow, I'd avoid even attempting that optimization.
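For illustration, a minimal sketch of that manager approach, with BKTree as a made-up stand-in for the Cython-wrapped tree (the real class and method names will differ): the object lives in the manager's process, and every proxy call is an IPC round-trip.

from multiprocessing import Process
from multiprocessing.managers import BaseManager

class BKTree:                              # stand-in for the extension's wrapper class
    def __init__(self):
        self._items = set()
    def insert(self, item):
        self._items.add(item)
    def search(self, item):
        return item in self._items

class TreeManager(BaseManager):
    pass

TreeManager.register("BKTree", BKTree)     # expose the class through the manager

def worker(tree):
    tree.insert("hello")                   # each call is marshalled to the manager process
    print(tree.search("hello"))            # True

if __name__ == "__main__":
    with TreeManager() as manager:
        tree = manager.BKTree()            # the real object lives in the manager process
        procs = [Process(target=worker, args=(tree,)) for _ in range(2)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()

Each process gets only a proxy, so the multi-gigabyte structure is never copied, but every read pays the cost of a pipe round-trip plus pickling of arguments and return values.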

Related

Why is pickle needed for multiprocessing module in python

I was doing multiprocessing in python and hit a pickling error. Which makes me wonder why do we need to pickle the object in order to do multiprocessing? isn't fork() enough?
Edit: I kind of get why we need pickle to do interprocess communication, but that is just for the data you want to transfer right? why does the multiprocessing module also try to pickle stuff like functions etc?
Which makes me wonder why do we need to pickle the object in order to do multiprocessing?
We don't need pickle, but we do need to communicate between processes, and pickle happens to be a very convenient, fast, and general serialization method for Python. Serialization is one way to communicate between processes; memory sharing is the other. Unlike memory sharing, serialization doesn't even require the processes to be on the same machine. For example, PySpark uses serialization very heavily to communicate between executors (which are typically different machines).
Addendum: there are also GIL (Global Interpreter Lock) issues to contend with when sharing memory in Python.
isn't fork() enough?
Not if you want your processes to communicate and share data after they've forked. fork() clones the current memory space, but changes in one process won't be reflected in another after the fork (unless we explicitly share data, of course).
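A minimal sketch of that point, assuming a POSIX system where os.fork() is available: the child's write lands in its own copy of the memory, so the parent never sees it.

import os

data = [0]

pid = os.fork()
if pid == 0:            # child process
    data[0] = 42        # modifies only the child's private copy
    os._exit(0)
else:                   # parent process
    os.wait()
    print(data[0])      # prints 0: the parent's copy is unchanged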
I kind of get why we need pickle to do interprocess communication, but that is just for the data you want to transfer right? why does the multiprocessing module also try to pickle stuff like functions etc?
Sometimes complex objects (i.e. "other stuff"? not totally clear on what you meant here) contain the data you want to manipulate, so we'll definitely want to be able to send that "other stuff".
Being able to send a function to another process is incredibly useful. You can create a bunch of child processes and then send them all a function to execute concurrently that you define later in your program. This is essentially the crux of PySpark (again a bit off topic, since PySpark isn't multiprocessing, but it feels strangely relevant).
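As a minimal sketch of that idea, a pool of worker processes is created first, and the function plus its arguments are pickled and shipped to them only when map() is called (a module-level function is pickled by reference, i.e. by its module and name):

from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == "__main__":
    with Pool(4) as pool:
        print(pool.map(square, range(8)))   # [0, 1, 4, 9, 16, 25, 36, 49]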
There are some functional purists (mostly the LISP people) that make arguments that code and data are the same thing. So it's not much of a line to draw for some.

Python multiprocessing guidelines seem to conflict: share memory or pickle?

I'm playing with Python multiprocessing module to have a (read-only) array shared among multiple processes. My goal is to use multiprocessing.Array to allocate the data and then have my code forked (forkserver) so that each worker can read straight from the array to do their job.
While reading the Programming guidelines I got a bit confused.
It is first said:
Avoid shared state
As far as possible one should try to avoid shifting large amounts of data between processes.
It is probably best to stick to using queues or pipes for communication between processes rather than using the lower level synchronization primitives.
And then, a couple of lines below:
Better to inherit than pickle/unpickle
When using the spawn or forkserver start methods many types from multiprocessing need to be picklable so that child processes can use them. However, one should generally avoid sending shared objects to other processes using pipes or queues. Instead you should arrange the program so that a process which needs access to a shared resource created elsewhere can inherit it from an ancestor process.
As far as I understand, queues and pipes pickle objects. If so, aren't those two guidelines conflicting?
Thanks.
The second guideline is the one relevant to your use case.
The first is reminding you that this isn't threading where you manipulate shared data structures with locks (or atomic operations). If you use Manager.dict() (which is actually SyncManager.dict) for everything, every read and write has to access the manager's process, and you also need the synchronization typical of threaded programs (which itself may come at a higher cost from being cross-process).
The second guideline suggests inheriting shared, read-only objects via fork; in the forkserver case, this means you have to create such objects before the call to set_start_method, since all workers are children of a process created at that time.
The reports on the usability of such sharing are mixed at best, but if you can use a small number of any of the C-like array types (like numpy or the standard array module), you should see good performance, because the bulk of their pages are never written to: reference-count updates only touch the small object headers, so the large data pages stay shared via copy-on-write. Note that you do not need multiprocessing.Array here (though it may work fine), since you do not need writes in one concurrent process to be visible in another.
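A minimal sketch of that inheritance pattern, assuming a POSIX system and using the fork start method for brevity (the array size is arbitrary): the array is built before any worker exists, so the children read it through inherited copy-on-write pages instead of receiving it through a pickle.

import multiprocessing as mp
from array import array

N = 1000000
big = array('d', range(N))            # built once, before the workers are created

def chunk_sum(bounds):
    lo, hi = bounds
    return sum(big[lo:hi])            # reads inherited pages; only the bounds are pickled

if __name__ == "__main__":
    mp.set_start_method("fork")       # inherit rather than pickle (POSIX only)
    with mp.Pool(4) as pool:
        chunks = [(i, i + N // 4) for i in range(0, N, N // 4)]
        print(sum(pool.map(chunk_sum, chunks)))   # 499999500000.0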

How to share a user-defined object that occupies a lot of memory and changes frequently across Python processes efficiently?

I've tried multiprocessing's Manager and Array, but they can't meet my needs.
Is there a method just like shared memory in linux C?
Not as such.
Sharing memory like this in the general case is very tricky. The CPython interpreter does not relocate objects, so they would have to be created in situ within the shared memory region. That means shared memory allocation, which is considerably more complex than just calling PyMem_Malloc(). In increasing order of difficulty, you would need cross-process locking, a per-process reference count, and some kind of inter-process cyclic garbage collection. That last one is really hard to do efficiently and safely. It's also necessary to ensure that shared objects only reference other shared objects, which is very difficult to do if you're not willing to relocate objects into the shared region. So Python doesn't provide a general purpose means of stuffing arbitrary full-blown Python objects into shared memory.
But you can share mmap objects between processes, and mmap supports the buffer protocol, so you can wrap it up in something higher-level like array/numpy.ndarray or anything else with buffer protocol support. Depending on your precise modality, you might have to write a small amount of C or Cython glue code to rapidly move data between the mmap and the array. This should not be necessary if you are working with NumPy. Note that high-level objects may require locking which mmap does not provide.
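A minimal sketch of that mmap approach, assuming NumPy is available (the backing file name and element count are illustrative): both processes map the same file, so a write made by one is visible to the other without any pickling of the array data.

import mmap
import os
import multiprocessing as mp
import numpy as np

PATH = "shared_example.bin"            # illustrative backing file
N = 1000000                            # number of float64 elements

def worker():
    with open(PATH, "r+b") as f:
        buf = mmap.mmap(f.fileno(), N * 8)
        arr = np.frombuffer(buf, dtype=np.float64)
        arr[0] = 42.0                  # lands in the shared file mapping
        buf.flush()

if __name__ == "__main__":
    with open(PATH, "wb") as f:
        f.truncate(N * 8)              # back the mapping with a real file
    p = mp.Process(target=worker)
    p.start()
    p.join()
    with open(PATH, "r+b") as f:
        buf = mmap.mmap(f.fileno(), N * 8)
        arr = np.frombuffer(buf, dtype=np.float64)
        print(arr[0])                  # 42.0, written by the other process
    del arr
    buf.close()
    os.remove(PATH)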

Why does python multiprocessing pickle objects to pass objects between processes?

Why does the multiprocessing package for python pickle objects to pass them between processes, i.e. to return results from different processes to the main interpreter process? This may be an incredibly naive question, but why can't process A say to process B "object x is at point y in memory, it's yours now" without having to perform the operation necessary to represent the object as a string?
multiprocessing runs jobs in different processes. Processes have their own independent memory spaces, and in general cannot share data through memory.
To make processes communicate, you need some sort of channel. One possible channel would be a "shared memory segment", which pretty much is what it sounds like. But it's more common to use "serialization". I haven't studied this issue extensively but my guess is that the shared memory solution is too tightly coupled; serialization lets processes communicate without letting one process cause a fault in the other.
When data sets are really large, and speed is critical, shared memory segments may be the best way to go. The main example I can think of is video frame buffer image data (for example, passed from a user-mode driver to the kernel or vice versa).
http://en.wikipedia.org/wiki/Shared_memory
http://en.wikipedia.org/wiki/Serialization
Linux, and other *NIX operating systems, provide a built-in mechanism for sharing serialized data between processes: "domain sockets". These should be quite fast.
http://en.wikipedia.org/wiki/Unix_domain_socket
Since Python has pickle, which works well for serialization, multiprocessing uses that. pickle is a fast, binary format; it should be more efficient in general than a serialization format like XML or JSON. There are other binary serialization formats such as Google Protocol Buffers.
One good thing about using serialization: it's about the same to share the work within one computer (to use additional cores) or to share the work between multiple computers (to use multiple computers in a cluster). The serialization work is identical, and network sockets work about like domain sockets.
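A minimal sketch of what that serialization channel means in practice: an object sent through a multiprocessing Pipe is pickled on one side and unpickled on the other, so the receiver gets a copy rather than a reference to the sender's object.

from multiprocessing import Pipe, Process

def child(conn):
    obj = conn.recv()          # an unpickled copy of the parent's dict
    obj["touched"] = True      # mutates only the child's copy
    conn.send(obj)             # pickled again on the way back
    conn.close()

if __name__ == "__main__":
    parent_conn, child_conn = Pipe()
    original = {"touched": False}
    p = Process(target=child, args=(child_conn,))
    p.start()
    parent_conn.send(original)
    print(parent_conn.recv())  # {'touched': True}  -- the returned copy
    print(original)            # {'touched': False} -- unchanged in the parent
    p.join()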
EDIT: @Mike McKerns pointed out in a comment that multiprocessing can use shared memory sometimes. I did a Google search and found this great discussion of it: Python multiprocessing shared memory

Python 2.7.5 - Run multiple threads simultaneously without slowing down

I'm creating a simple multiplayer game in python. I have split the processes up using the default thread module in python. However I noticed that the program still slows down with the speed of other threads. I tried using the multiprocessing module but not all of my objects can be pickled.
Is there an alternative to using the multiprocessing module for running simultaneous processes?
Here are your options:
MPI4PY:
http://code.google.com/p/mpi4py/
Celery:
http://www.celeryproject.org/
Pprocess:
http://www.boddie.org.uk/python/pprocess.html
Parallel Python(PP):
http://www.parallelpython.com/
You need to analyze why your program is slowing down when other threads do their work. Assuming that the threads are doing CPU-intensive work, the slowdown is consistent with threads being serialized by the global interpreter lock.
It is impossible to answer in detail without knowing more about the nature of the work your threads are performing and of the objects that must be shared between them. In general, you have two viable options:
Use processes, typically through the multiprocessing module. The typical reason why objects are not picklable is that they contain unpicklable state such as closures, open file handles, or other system resources. But pickle allows objects to implement methods like __getstate__ or __reduce__ that extract the object's state and use it to rebuild the object on the other side (see the sketch after these two options). If your objects are unpicklable because they are huge, then you might need to write a C extension that stores them in shared memory or a memory-mapped file, and pickle only a key that identifies them in the shared memory.
Use threads, finding ways to work around the GIL. If your computation is concentrated in several hot spots, you can move those hot spots to C, and release the GIL for the duration of the computation. For this to work, the computation must not refer to any Python objects, i.e. all data must be extracted from the objects while the GIL is held, and stored back into the Python world after the GIL has been reacquired.
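Referring back to the first option, a minimal sketch of the __getstate__/__setstate__ hooks, using a made-up LogReader class that wraps an otherwise unpicklable open file handle: only the path and offset are pickled, and the handle is rebuilt on the receiving side.

import os
import pickle

class LogReader:                            # hypothetical example class
    def __init__(self, path):
        self.path = path
        self._fh = open(path, "rb")         # open file objects cannot be pickled

    def __getstate__(self):
        state = self.__dict__.copy()
        state["_offset"] = self._fh.tell()  # keep only the state needed to reopen
        del state["_fh"]                    # drop the unpicklable handle
        return state

    def __setstate__(self, state):
        offset = state.pop("_offset")
        self.__dict__.update(state)
        self._fh = open(self.path, "rb")    # rebuild the handle in the receiving process
        self._fh.seek(offset)

if __name__ == "__main__":
    with open("example.log", "wb") as f:    # illustrative file
        f.write(b"hello\nworld\n")
    reader = LogReader("example.log")
    clone = pickle.loads(pickle.dumps(reader))   # now round-trips without error
    print(clone._fh.readline())                  # b'hello\n'
    reader._fh.close()
    clone._fh.close()
    os.remove("example.log")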
