I'm looking to wrap a small C++ library for use in Python. I've briefly read up on the Python C-API, ctypes, Cython, SWIG, Boost.Python, and CLIF. Which framework (or other) should I use given my specific use-case described below?
Context: I have a multi-process program that does communication via shared memory. I've written Writer and Reader classes to interact with the shared memory and handle synchronization. I have multiple C++ programs working together via the shared memory and want to add a Python program to the party. Its behavior in interacting with the shared memory should be the same as the C++ programs.
Considerations:
(main) The data packets stored in the shared memory are large, so I want to avoid a copy when returning to Python. Specifically, the Reader::read() method returns a const T& reference so the client can read from it directly. I'd like to preserve this behavior in the Python class (see the usage sketch after this list).
Would like to directly expose the Writer() and Reader() classes as Python classes without re-implementing logic.
I'm only exposing these two classes, not an entire library. I don't mind some upfront work if it means more maintainability.
All things being equal, I'd like to minimize external dependencies. That said, I don't mind dependencies if it leads to a better outcome.
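For illustration only, here is the kind of zero-copy read behavior I'm after on the Python side, demonstrated with a plain mmap/memoryview standing in for the wrapped Reader (nothing here is an existing API, just the shape of the usage I want):

    import mmap

    shm = mmap.mmap(-1, 4096)          # stand-in for the real shared-memory segment
    shm[:4] = b"\xde\xad\xbe\xef"

    packet = memoryview(shm)           # a view: no bytes are copied here
    header = packet[:4]                # slicing a memoryview is still zero-copy
    print(bytes(header))               # an explicit copy only where I ask for one

That is roughly what I'd like Reader.read() to return: something view-like the Python client can read directly, with copies happening only on request.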
Related
I'm coding a program that requires high memory usage.
I use python 3.7.10.
During the program I create about 3 GB of Python objects and modify them.
Some of the objects I create contain pointers to other objects.
Also, sometimes I need to deepcopy one object to create another.
My problem is that creating and modifying these objects takes a lot of time and causes performance issues.
I wish I could do some of the creation and modification in parallel. However, there are some limitations:
the program is very CPU-bound and there is almost no I/O or network usage - so the threading module will not help because of the GIL
the system I work with has no copy-on-write feature - so using the multiprocessing library spends a lot of time forking the process
the objects do not contain numbers and most of the work in the program is not mathematical - so I cannot benefit from numpy or ctypes
What would be a good alternative way of handling this kind of memory that would let me parallelize my code better?
Deepcopy is extremely slow in Python. A possible solution is to serialize and load the objects from disk. See this answer for viable options – perhaps ujson and pickle (in Python 3 the pickle module is already C-accelerated). Furthermore, you can serialize and deserialize objects asynchronously using aiofiles.
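If the objects are picklable, a pickle round-trip is often a noticeably faster substitute for copy.deepcopy; a minimal sketch (fast_copy is just an illustrative helper name):

    import copy
    import pickle

    def fast_copy(obj):
        # Often much faster than copy.deepcopy for large, picklable object graphs;
        # fall back to deepcopy for objects that pickle cannot handle.
        try:
            return pickle.loads(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))
        except (pickle.PicklingError, TypeError, AttributeError):
            return copy.deepcopy(obj)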
Can't you use your GPU RAM via CUDA?
https://developer.nvidia.com/how-to-cuda-python
If it doesn't need to be realtime I'd use PySpark (see streaming section https://spark.apache.org/docs/latest/api/python/) and work with remote machines.
Can you tell me a bit about the application? Perhaps you're searching for something like the PyTorch framework (https://pytorch.org/).
You may also like to try using Transparent Huge Pages and a hugepage-aware allocator, such as tcmalloc. That may speed up your application by 5-15% without having to change a line of code.
See thp-usage for more information.
I have the following problem. Let's say I have this Python function:

    def func():
        ...  # run some code here which calls some native code
Inside func() I am calling some functions which in turn call some native C code.
If a crash happens in that native code, the whole Python process crashes.
How is it possible to catch and recover from such errors?
One way that came to my mind is to run this function in a separate process. But I can't just start another process from scratch: the function uses a lot of memory and objects, and it would be very hard to split those out. Is there something like C's fork() available in Python, to create a copy of the exact same process with the same memory structures and so on?
Or maybe other ideas?
Update:
It seems that there is no real way of catching C runtime errors in Python; they happen at a lower level and crash the whole Python virtual machine.
As solutions you currently have two options:
Use os.fork(), but it works only in Unix-like OS environments (a minimal sketch follows this list).
Use multiprocessing and a shared-memory model to share big objects between processes. Usual serialization just won't work with objects that occupy multiple gigabytes of memory (you would simply run out of memory). However, there is a very good Python library called Ray (https://docs.ray.io/en/master/) that performs in-memory serialization of big objects using a shared-memory model, and it's ideal for BigData/ML workloads - highly recommended.
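A minimal sketch of the os.fork() option (POSIX only; the child's result is discarded here, only success or failure is reported back to the parent):

    import os

    def run_isolated(func, *args):
        # Fork a child that shares the parent's memory image (copy-on-write),
        # run the risky call there, and let only the child die on a hard crash.
        pid = os.fork()
        if pid == 0:                      # child
            try:
                func(*args)
                os._exit(0)
            except BaseException:
                os._exit(1)
        _, status = os.waitpid(pid, 0)    # parent
        if os.WIFSIGNALED(status):        # e.g. SIGSEGV coming from the native code
            raise RuntimeError("child killed by signal %d" % os.WTERMSIG(status))
        if os.WEXITSTATUS(status) != 0:
            raise RuntimeError("child exited with an error")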
As long as you are running on an operating system that supports fork, that's already how the multiprocessing module creates subprocesses by default. You could use os.fork(), multiprocessing.Process, or multiprocessing.Pool to get what you want.
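For example (a sketch, not a drop-in solution): with the fork start method the child inherits all of the parent's objects, func() from the question runs in that child, and a native crash only kills the child. This assumes the return value of func() is small and picklable.

    import multiprocessing as mp

    def _call_in_child(q):
        q.put(func())                     # func is the crash-prone function from the question

    def safe_func():
        ctx = mp.get_context("fork")      # child inherits the parent's objects; nothing is pickled on the way in
        q = ctx.Queue()
        p = ctx.Process(target=_call_in_child, args=(q,))
        p.start()
        p.join()                          # for large results, drain the queue before joining
        if p.exitcode != 0:               # a negative exitcode means the child was killed by a signal
            raise RuntimeError("func() crashed in the child, exit code %s" % p.exitcode)
        return q.get()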
I'm relatively inexperienced with C++, but I need to build a framework to shuffle some data around. Not necessarily relevant, but the general flow path of my data needs to go like this:
1. Data is generated in a Python script
2. The Python object is passed to a compiled C++ extension
3. The C++ extension makes some changes and passes the data (presumably a pointer?) to compiled C++/CUDA code (.exe)
4. The C++/CUDA .exe does its work
5. Data is handed back to the Python script and sent to more Python functions
Step 3 is where I'm having trouble. How would I go about calling the .exe containing the CUDA code so that it can access the data held by the C++ Python extension? I assume I should be able to pass a pointer somehow, but I'm having trouble finding resources that explain how. I've seen references to creating shared memory, but I'm unclear on the details there as well.
There are many ways two executables can exchange data.
Some examples:
write/read data to/from a shared file (don't forget locking so they don't stumble over each other).
use TCP or UDP sockets between the processes to exchange data.
use shared memory (see the sketch at the end of this answer).
if one application starts the other you can pass data via commandline arguments or in the environment.
use pipes between the processes.
use Unix domain sockets between the processes.
And there are more options but the above are probably the most common ones.
What you need to research is IPC (Inter-Process Communication).
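For the shared-memory flavour in particular, here is a rough sketch of one way to do it (the path, size, and the cuda_worker executable name are all made up): the Python side stages the data in a memory-mapped file and passes the file name to the .exe on the command line; the C++/CUDA program mmaps the same file, works on the bytes in place, and writes its results back.

    import mmap
    import subprocess

    PATH = "/dev/shm/packet.bin"          # tmpfs-backed on Linux; any agreed-on path works
    SIZE = 4 * 1024 * 1024                # hypothetical payload size

    # create and size the backing file
    with open(PATH, "wb") as f:
        f.truncate(SIZE)

    with open(PATH, "r+b") as f:
        buf = mmap.mmap(f.fileno(), SIZE) # shared mapping of the file
        buf[:4] = b"\x01\x02\x03\x04"     # stage the input data

        # the C++/CUDA side opens PATH, mmaps it, and operates on the same bytes
        subprocess.run(["./cuda_worker", PATH, str(SIZE)], check=True)

        result = bytes(buf[:4])           # read back whatever the worker wrote
        buf.close()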
I've tried the multiprocessing module's Manager and Array, but they can't meet my needs.
Is there a method just like shared memory in Linux C?
Not as such.
Sharing memory like this in the general case is very tricky. The CPython interpreter does not relocate objects, so they would have to be created in situ within the shared memory region. That means shared memory allocation, which is considerably more complex than just calling PyMem_Malloc(). In increasing order of difficulty, you would need cross-process locking, a per-process reference count, and some kind of inter-process cyclic garbage collection. That last one is really hard to do efficiently and safely. It's also necessary to ensure that shared objects only reference other shared objects, which is very difficult to do if you're not willing to relocate objects into the shared region. So Python doesn't provide a general purpose means of stuffing arbitrary full-blown Python objects into shared memory.
But you can share mmap objects between processes, and mmap supports the buffer protocol, so you can wrap it up in something higher-level like array/numpy.ndarray or anything else with buffer protocol support. Depending on your precise modality, you might have to write a small amount of C or Cython glue code to rapidly move data between the mmap and the array. This should not be necessary if you are working with NumPy. Note that high-level objects may require locking which mmap does not provide.
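A minimal sketch of that idea, assuming a POSIX system where the fork start method is available (the element type and size are arbitrary):

    import mmap
    import multiprocessing as mp
    import numpy as np

    N = 1000000

    def worker(buf):
        arr = np.frombuffer(buf, dtype=np.float64)  # zero-copy view of the shared mapping
        arr[0] = 42.0                               # immediately visible to the parent

    if __name__ == "__main__":
        ctx = mp.get_context("fork")                # the anonymous shared mmap is inherited across fork
        buf = mmap.mmap(-1, N * 8)                  # anonymous MAP_SHARED mapping
        p = ctx.Process(target=worker, args=(buf,))
        p.start()
        p.join()
        print(np.frombuffer(buf, dtype=np.float64)[0])  # prints 42.0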
I am working on a python extension that does some specialized tree things on a large (5GB+) in-memory data structure.
The underlying data structure is already thread-safe (it uses RW locks), and I've written a wrapper around it that exposes it to python. It supports multiple simultaneous readers, only one writer, and I'm using a pthread_rwlock for access synchronization. Since the application is very read-heavy, multiple readers should provide a decent performance improvement (I hope).
However, I cannot determine what the proper solution is to allow the extension data-store to be shared across multiple python processes accessed via the multiprocessing module.
Basically, I'd like something that looks like the current multiprocessing.Value/multiprocessing.Array system, but the caveat here is that my extension is allocating all its own memory in C++.
How can I allow multiple processes to access my shared data structure?
Sources are here (C++ Header-only library), and here (Cython wrapper).
Right now, if I build an instance of the tree and then pass references out to multiple processes, it fails with a serialization error:
Traceback (most recent call last):
  File "/usr/lib/python3.4/multiprocessing/queues.py", line 242, in _feed
    obj = ForkingPickler.dumps(obj)
  File "/usr/lib/python3.4/multiprocessing/reduction.py", line 50, in dumps
    cls(buf, protocol).dump(obj)
TypeError: cannot serialize '_io.TextIOWrapper' object
(failing test-case)
I'm currently releasing the GIL in my library, but there are some future tasks that would greatly benefit from independent processes, and I'd like to avoid having to implement a RPC system for talking to the BK tree.
If the extension data is going to exist as a single logical mutable object across multiple processes (so a change in Process A will be reflected in the view in Process B), you can't avoid some sort of IPC mechanism. The memory spaces of two processes are separate; Python can't magically share unshared data.
The closest you could get (without explicitly using shared memory at the C layer that could be used to allow mapping the same memory into each process) would be to use a custom subclass of multiprocessing.BaseManager, which would just hide the IPC from you (the actual object would live in a single process, with other processes proxying to that original object). You can see a really simple example in the multiprocessing docs.
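A stripped-down illustration of the manager route (TreeStore here is a plain Python stand-in for the wrapped BK-tree; the real object lives in the manager's process and every call from a worker is proxied to it):

    from multiprocessing import Process
    from multiprocessing.managers import BaseManager

    class TreeStore:                      # stand-in for the wrapped C++ tree
        def __init__(self):
            self._items = []
        def insert(self, item):
            self._items.append(item)
        def search(self, item):
            return item in self._items

    class TreeManager(BaseManager):
        pass

    TreeManager.register("TreeStore", TreeStore)

    def worker(tree):
        tree.insert("hello")              # a proxied call, executed inside the manager process

    if __name__ == "__main__":
        manager = TreeManager()
        manager.start()
        tree = manager.TreeStore()        # a proxy; safe to hand to child processes
        p = Process(target=worker, args=(tree,))
        p.start()
        p.join()
        print(tree.search("hello"))       # True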
The manager approach is simple, but performance-wise it's probably not going to do so hot; shared memory at the C layer avoids a lot of overhead that the proxying mechanism can't avoid. You'd need to test to verify anything. Making C++ STL use shared memory would be a royal pain to my knowledge, probably not worth the trouble, so unless the manager approach is shown to be too slow, I'd avoid even attempting the optimization.