Python - alternatives for internal memory

I'm coding a program that requires high memory usage.
I use Python 3.7.10.
During the run I create about 3 GB of Python objects and modify them.
Some of the objects I create contain pointers to other objects.
Also, sometimes I need to deepcopy one object to create another.
My problem is that creating and modifying these objects takes a lot of time and causes performance issues.
I wish I could do some of the creation and modification in parallel. However, there are some limitations:
the program is very CPU-bound and does almost no IO/network, so the threading library will not help because of the GIL
the system I work on has no copy-on-write feature, so the multiprocessing library spends a lot of time forking the process
the objects do not contain numbers and most of the work in the program is not mathematical, so I cannot benefit from numpy or ctypes
What would be a good alternative for this kind of memory that would let me parallelize my code better?

Deepcopy is extremely slow in Python. A possible solution is to serialize and load the objects from the disk. See this answer for viable options – perhaps ujson and cPickle. Furthermore, you can serialize and deserialize objects asynchronously using aiofiles.
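If your objects are picklable, an in-memory pickle round-trip is a common stand-in for copy.deepcopy and is often faster for large object graphs - worth benchmarking on your actual data. A minimal sketch:
import pickle
def fast_deepcopy(obj):
    # round-trip through pickle in memory; often beats copy.deepcopy
    # for large, picklable object graphs, but measure on your own data
    return pickle.loads(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))
original = {"edges": [[1, 2], [2, 3]], "labels": {"a": [1] * 1000}}
clone = fast_deepcopy(original)
assert clone == original and clone is not original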

Can't you use your GPU RAM and use CUDA?
https://developer.nvidia.com/how-to-cuda-python
If it doesn't need to be realtime I'd use PySpark (see streaming section https://spark.apache.org/docs/latest/api/python/) and work with remote machines.
Can you tell me a bit about the application? Perhaps you're searching for something like the PyTorch framework (https://pytorch.org/).

You may also like to try using Transparent Huge Pages and a hugepage-aware allocator, such as tcmalloc. That may speed up your application by 5-15% without having to change a line of code.
See thp-usage for more information.
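For example, on Linux you can preload the allocator without touching the code; the exact library path depends on your distro and how tcmalloc was installed, the one below is just an illustration:
$ LD_PRELOAD=/usr/lib/libtcmalloc.so python my_program.py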

Related

How to catch runtime errors from native code in python?

I have the following problem. Let's say I have this Python function:
def func():
    # run some code here which calls some native code
Inside func() I am calling some functions which in turn call some native C code.
If any crash happens, the whole Python process crashes altogether.
How is it possible to catch and recover from such errors?
One way that came to my mind is to run this function in a separate process. But I can't just start another process from scratch, because the function uses a lot of memory and objects, and it would be very hard to split that out. Is there something like C's fork() available in Python, to create a copy of the exact same process with the same memory structures and so on?
Or maybe other ideas?
Update:
It seems that there is no real way of catching C runtime errors in Python; they happen at a lower level and crash the whole Python virtual machine.
As solutions you currently have two options:
Use os.fork(), but this works only in Unix-like OS environments.
Use multiprocessing and a shared memory model to share big objects between processes. Usual serialization will just not work with objects that take multiple gigabytes in memory (you will simply run out of memory). However there is a very good Python library called Ray (https://docs.ray.io/en/master/) that performs in-memory big-object serialization using a shared memory model and is ideal for BigData/ML workloads - highly recommended.
As long as you are running on an operating system that supports fork, that's already how the multiprocessing module creates subprocesses. You could use os.fork(), multiprocessing.Process, or multiprocessing.Pool to get what you want.
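A rough sketch of the isolate-in-a-child-process pattern (call_native below is a stand-in for whatever really calls into the C code):
import multiprocessing as mp
def call_native(q, arg):
    # replace this body with the real call into native code; if it
    # segfaults, only this child process dies, not the parent
    q.put(len(arg))
if __name__ == "__main__":
    q = mp.Queue()
    p = mp.Process(target=call_native, args=(q, "some input"))
    p.start()
    p.join()
    if p.exitcode == 0:
        print("native call succeeded:", q.get())
    else:
        # a negative exitcode means the child was killed by a signal,
        # e.g. -11 for SIGSEGV on Linux
        print("native call crashed, exit code:", p.exitcode)
On Unix, with the default fork start method, the child starts with a copy-on-write view of the parent's memory, so the big objects do not have to be rebuilt or serialized before the call.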

How to share objects and data between python processes in real-time?

I'm trying to find a reasonable approach in Python for a real-time application, multiprocessing and large files.
A parent process spawns 2 or more children. The first child reads data and keeps it in memory, and the others process it in a pipeline fashion. The data should be organized into an object, sent to the following process, processed, sent, processed, and so on.
Available methodologies such as Pipe, Queue, and Managers seem inadequate due to overheads (serialization, etc.).
Is there an adequate approach for this?
I've used Celery and Redis for real-time multiprocessing in high memory applications, but it really depends on what you're trying to accomplish.
The biggest benefits I've found in Celery over built-in multiprocessing tools (Pipe/Queue) are:
Low overhead. You call a function directly, no need to serialize data.
Scaling. Need to ramp up worker processes? Just add more workers.
Transparency. Easy to inspect tasks/workers and find bottlenecks.
For really squeezing out performance, ZMQ is my go to. A lot more work to set up and fine-tune, but it's as close to bare sockets as you can safely get.
Disclaimer: This is all anecdotal. It really comes down to what your specific needs are. I'd benchmark different options with sample data before you go down any path.
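For reference, a bare-bones pyzmq PUSH/PULL stage might look roughly like this; note that send_pyobj/recv_pyobj still pickle under the hood, so use send/recv with raw bytes if you want to avoid even that:
# process 1: producer stage
import zmq
ctx = zmq.Context()
push = ctx.socket(zmq.PUSH)
push.bind("tcp://127.0.0.1:5557")
push.send_pyobj({"frame_id": 1, "payload": b"..."})
# process 2: consumer stage
import zmq
ctx = zmq.Context()
pull = ctx.socket(zmq.PULL)
pull.connect("tcp://127.0.0.1:5557")
item = pull.recv_pyobj()   # a rebuilt copy of the producer's dict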
First, a suspicion that message-passing may be inadequate because of all the overhead is not a good reason to overcomplicate your program. It's a good reason to build a proof of concept and come up with some sample data and start testing. If you're spending 80% of your time pickling things or pushing stuff through queues, then yes, that's probably going to be a problem in your real life code—assuming the amount of work your proof of concept does is reasonably comparable to your real code. But if you're spending 98% of your time doing the real work, then there is no problem to solve. Message passing will be simpler, so just use it.
Also, even if you do identify a problem here, that doesn't mean that you have to abandon message passing; it may just be a problem with what's built in to multiprocessing. Technologies like 0MQ and Celery may have lower overhead than a simple queue. Even being more careful about what you send over the queue can make a huge difference.
But if message passing is out, the obvious alternative is data sharing. This is explained pretty well in the multiprocessing docs, along with the pros and cons of each.
Sharing state between processes describes the basics of how to do it. There are other alternatives, like using mmapped files or platform-specific shared memory APIs, but there's not much reason to do that over multiprocessing unless you need, e.g., persistent storage between runs.
There are two big problems to deal with, but both can be dealt with.
First, you can't share Python objects, only simple values. Python objects have internal references to each other all over the place, the garbage collector can't see references to objects in other processes' heaps, and so on. So multiprocessing.Value can only hold the same basic kinds of native values as array.array, and multiprocessing.Array can hold (as you'd guess by the name) 1D arrays of the same values, and that's it. For anything more complicated, if you can define it in terms of a ctypes.Structure, you can use multiprocessing.sharedctypes (https://docs.python.org/3/library/multiprocessing.html#module-multiprocessing.sharedctypes), but this still means that any references between objects have to be indirect. (For example, you often have to store indices into an array.) (Of course none of this is bad news if you're using NumPy, because you're probably already storing most of your data in NumPy arrays of simple values, which are sharable.)
Second, shared data are of course subject to race conditions. And, unlike multithreading within a single process, you can't rely on the GIL to help protect you here; there are multiple interpreters that can all be trying to modify the same data at the same time. So you have to use locks or conditions to protect things.
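A minimal sketch of the shared-array-plus-lock pattern (sizes and values are illustrative; mp.Array also exposes its own lock via get_lock() if you prefer):
import multiprocessing as mp
def worker(shared, lock, start, stop):
    for i in range(start, stop):
        with lock:                     # guard against concurrent writers
            shared[i] = shared[i] * 2
if __name__ == "__main__":
    shared = mp.Array('d', [1.0] * 1000)   # 1D array of doubles in shared memory
    lock = mp.Lock()
    procs = [mp.Process(target=worker, args=(shared, lock, 0, 500)),
             mp.Process(target=worker, args=(shared, lock, 500, 1000))]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(shared[0], shared[999])          # 2.0 2.0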
For multiprocessing pipeline check out MPipe.
For shared memory (specifically NumPy arrays) check out numpy-sharedmem.
I've used these to do high-performance realtime, parallel image processing (average accumulation and face detection using OpenCV) while squeezing out all available resources from a multi-core CPU system. Check out Sherlock if interested. Hope this helps.
One option is to use something like brain-plasma that maintains a shared-memory object namespace that is independent of the Python process or thread. Kind of like Redis but can be used with big objects and has a simple API, built on top of Apache Arrow.
$ pip install brain-plasma
# process 1
from brain_plasma import Brain
brain = Brain()
brain['myvar'] = 657
# process 2
from brain_plasma import Brain
brain = Brain()
brain['myvar']
# >>> 657
Python 3.8 now offers shared memory access between processes using multiprocessing.shared_memory. All you hand off between processes is a string that references the shared memory block. In the consuming process you get a memoryview object which supports slicing without copying the data like byte arrays do. If you are using numpy it can reference the memory block in an O(1) operation, allowing fast transfers of large blocks of numeric data. As far as I understand generic objects still need to be deserialized since a raw byte array is what's received by the consuming process.
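A rough sketch of that pattern (array size, dtype, and the hand-off of the block name are illustrative):
from multiprocessing import shared_memory
import numpy as np
# producer process: create the block and fill it
shm = shared_memory.SharedMemory(create=True, size=1_000_000 * 8)
a = np.ndarray((1_000_000,), dtype=np.float64, buffer=shm.buf)
a[:] = 42.0
print(shm.name)                        # pass this string to the consumer
# consumer process: attach by name; no copy of the data is made
existing = shared_memory.SharedMemory(name=shm.name)
b = np.ndarray((1_000_000,), dtype=np.float64, buffer=existing.buf)
print(b[0])                            # 42.0
existing.close()
# producer, once everyone is done with the block
shm.close()
shm.unlink()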

Why does python multiprocessing pickle objects to pass objects between processes?

Why does the multiprocessing package for Python pickle objects to pass them between processes, i.e. to return results from different processes to the main interpreter process? This may be an incredibly naive question, but why can't process A say to process B "object x is at point y in memory, it's yours now" without having to perform the operation necessary to represent the object as a string?
multiprocessing runs jobs in different processes. Processes have their own independent memory spaces, and in general cannot share data through memory.
To make processes communicate, you need some sort of channel. One possible channel would be a "shared memory segment", which pretty much is what it sounds like. But it's more common to use "serialization". I haven't studied this issue extensively but my guess is that the shared memory solution is too tightly coupled; serialization lets processes communicate without letting one process cause a fault in the other.
When data sets are really large, and speed is critical, shared memory segments may be the best way to go. The main example I can think of is video frame buffer image data (for example, passed from a user-mode driver to the kernel or vice versa).
http://en.wikipedia.org/wiki/Shared_memory
http://en.wikipedia.org/wiki/Serialization
Linux, and other *NIX operating systems, provide a built-in mechanism for sharing data via serialization: "domain sockets". This should be quite fast.
http://en.wikipedia.org/wiki/Unix_domain_socket
Since Python has pickle that works well for serialization, multiprocessing uses that. pickle is a fast, binary format; it should be more efficient in general than a serialization format like XML or JSON. There are other binary serialization formats such as Google Protocol Buffers.
One good thing about using serialization: it's about the same to share the work within one computer (to use additional cores) or to share the work between multiple computers (to use multiple computers in a cluster). The serialization work is identical, and network sockets work about like domain sockets.
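To make the mechanics concrete: everything that goes through a multiprocessing channel must survive a pickle round-trip, and the receiving side gets a rebuilt copy, not the original object. A small sketch:
import multiprocessing as mp
def worker(q):
    # this dict is pickled here and unpickled in the parent;
    # no memory is shared, the parent receives a reconstructed copy
    q.put({"result": [1, 2, 3]})
if __name__ == "__main__":
    q = mp.Queue()
    p = mp.Process(target=worker, args=(q,))
    p.start()
    print(q.get())     # {'result': [1, 2, 3]}
    p.join()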
EDIT: @Mike McKerns said, in a comment below, that multiprocessing can use shared memory sometimes. I did a Google search and found this great discussion of it: Python multiprocessing shared memory

How does Python handle memory?

I've been looking at an in-memory database, and it got me thinking: how does Python handle IO that's not tied to a connection (and even data that is), for example hashes, sets, etc.? Is this a config somewhere, or is it dynamically managed based on resources? Are there "easy" ways to view the effect resources are having on a real program, and to simulate what the performance hit would be on differing hardware setups?
NOTE: If it matters, Redis is the in-memory data store I'm looking at; there's an implementation of a wrapper for Redis datatypes so they mimic the datatypes found in Python.
Python allocates all memory that the application asks for. There is not much room for policy. The only issue is when to release memory. (C)Python immediately releases all memory that is not referenced anymore (this is also not tunable). Memory that is referenced only from itself (i.e. cycles) is released by the garbage collector; this has tunable settings.
It is the operating system's decision to write some of the memory into the pagefile.
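Those tunable settings live in the gc module; a quick illustration:
import gc
print(gc.get_threshold())          # defaults to (700, 10, 10)
gc.set_threshold(1400, 20, 20)     # run generation-0 collections half as often
gc.collect()                       # force a full collection of reference cycles
gc.disable()                       # or switch cyclic collection off entirely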
Not exactly what you're asking for, but Dowser is a Python tool for interactively browsing the memory usage of your running program. Very useful in understanding memory usage and allocation patterns.
http://www.aminus.net/wiki/Dowser

How to debug a MemoryError in Python? Tools for tracking memory use?

I have a Python program that dies with a MemoryError when I feed it a large file. Are there any tools that I could use to figure out what's using the memory?
This program ran fine on smaller input files. The program obviously needs some scalability improvements; I'm just trying to figure out where. "Benchmark before you optimize", as a wise person once said.
(Just to forestall the inevitable "add more RAM" answer: This is running on a 32-bit WinXP box with 4GB RAM, so Python has access to 2GB of usable memory. Adding more memory is not technically possible. Reinstalling my PC with 64-bit Windows is not practical.)
EDIT: Oops, this is a duplicate of Which Python memory profiler is recommended?
Heapy is a memory profiler for Python, which is the type of tool you need.
The simplest and most lightweight way would likely be to use the built-in memory query capabilities of Python, such as sys.getsizeof - just run it on your objects for a reduced problem (i.e. a smaller file) and see what takes a lot of memory.
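Keep in mind that sys.getsizeof reports only the object itself, not whatever it references, so for containers you have to walk the contents yourself (or use something like Pympler's asizeof). Roughly:
import sys
data = [list(range(100)) for _ in range(1000)]
print(sys.getsizeof(data))                        # the outer list only
print(sum(sys.getsizeof(row) for row in data))    # the inner lists, still not their ints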
In your case, the answer is probably very simple: do not read the whole file at once but process the file chunk by chunk. That may be very easy or complicated depending on your usage scenario. Just as an example, an MD5 checksum computation can be done much more efficiently for huge files without reading the whole file in. The latter change has dramatically reduced memory consumption in some SCons usage scenarios but was almost impossible to trace with a memory profiler.
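For example, hashing a huge file chunk by chunk keeps memory use roughly constant no matter how big the file is:
import hashlib
def md5_of_file(path, chunk_size=1024 * 1024):
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):   # read 1 MB at a time
            h.update(chunk)
    return h.hexdigest()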
If you still need a memory profiler: eliben already suggested sys.getsizeof. If that doesn't cut it, try Heapy or Pympler.
You asked for a tool recommendation:
Python Memory Validator allows you to monitor the memory usage, allocation locations, GC collections, object instances, memory snapshots, etc of your Python application. Windows only.
http://www.softwareverify.com/python/memory/index.html
Disclaimer: I was involved in the creation of this software.
