We are writing a system where memory is tight. There are multiple python processes that will import the same set of classes. If the processes load a gazillion classes and each process consumes a couple hundred megs, then 10 processes and I am running into gigs. Is there a way to "share" class imports across python processes?
Summary:
import classA # Process 1 : loads classA in memory
import classA # Process 2 : reuses what was loaded previously by Process 1
PS: What I'm trying to achieve is something that you would do with .so modules in C/C++.
It's possible to save at least a bit of memory if your OS supports an efficient copy-on-write fork (many do). Just import everything once in the parent, then fork() to create all the children.
Note that you won't save anything if you have many small objects, as the objects' refcounts will need to be written to. However, you could see substantial savings with large static structures like attribute dictionaries.
I don't believe this is possible in the design of python. You might want to uses threads instead of separate processes if this is a big deal. Also, if you're running 100 Megs just by importing several modules you might be doing something wrong in the modules (seems like quite a bit of memory to be using).
One possibility if you absolutely must import everything, can't cut down on memory usage and can't use threads would be to move the big memory code into python extensions written in C/C++. This would then allow you to share the common segments of the shared library your python extensions are in across processes. It also might reduce the memory footprint somewhat.
P.S. In python you're actually importing modules, not classes. Similar to how in C/C++ you're including header files, not classes.
Related
I have the following problem, Lets have this python function
def func():
run some code here which calls some native code
Inside func() I am calling some functions which in turn calls some native C code.
If any crash happens the whole python process crashes alltoghether.
How is possible to catch and recover from such errors?
One way that came to my mind is run this function in a separate process, but not just starting another process because there is a lot of memory and objects used by the function, will be very hard to split that. Is there something like fork() in C available in python, to create a copy of the same exact process with same memory structures and etc?
Or maybe other ideas?
Update:
It seems that there is no real way of catching the C runtime errors in python, those are at a lower level and crashes the whole Python virtual machine.
As solutions you currently have two options:
Use os.fork() but work only in unix like OS env.
Use multiprocessing and a shared memory model to share big objects between processes. Usual serialization will just not work with objects that have multi-gigabytes in memory (you will just run out of memory). However there is a very good python library called Ray (https://docs.ray.io/en/master/) that performs in-memory big objects serialization using shared memory model and it's ideal for BigData/ML workloads - highly recommended.
As long as you are running on an operating system that supports fork that's already how the multiprocessing module creates subprocesses. You could os.fork, multiprocessing.Process or multiprocessing.Pool to get what you want. You can also use the os.fork() call on these systems.
I need to read a large dataset (about 25GB of images) into memory and read it from multiple processes. None of the processes has to write, only read. All the processes are started using Python's multiprocessing module, so they have the same parent process. They train different models on the data and run independently of each other. The reason why I want to read it only one time rather than in each process is that the memory on the machine is limited.
I have tried using Redis, but unfortunately it is extremely slow when many processes read from it. Is there another option to do this?
Is it maybe somehow possible to have another process that only serves as a "get the image with ID x" function? What Python module would be suited for this? Otherwise, I was thinking about implementign a small webserver using werkzeug or Flask, but I am not sure if that would become my new bottleneck then...
Another possibility that came to my mind was to use threads instead of processes, but since Python is not really doing "real" multithreading, this would probably become my new bottleneck.
If you are on linux and the content is read-only, you can use the linux fork inheriting mechanism.
from mp documentation:
Better to inherit than pickle/unpickle
When using the spawn or forkserver start methods many types from
multiprocessing need to be picklable so that child processes can use
them. However, one should generally avoid sending shared objects to
other processes using pipes or queues. Instead you should arrange the
program so that a process which needs access to a shared resource
created elsewhere can inherit it from an ancestor process.
which means:
Before you fork your child processes, prepare your big data in a module level variable (global to all the functions).
Then in the same module, run your child with multiprocessing in 'fork' mode set_start_method('fork').
using this the sub-processes will see this variable without copying it. This happens due to linux forking mechanism that creates child processes with the same memory mapping as the parent (see "copy on write").
I'd suggest mmapping the files, that way they can be shared across multiple processes as well as getting swapped in/out as appropriate
the details of this would depend on what you mean by "25GB of images" and how these models want to access the images
the basic idea would be to preprocess the images into an appropriate format (e.g. one big 4D uint8 numpy array or maybe smaller ones, indicies could be (image, row, column, channel)) and save them in a format where they can be efficiently used by the models. see numpy.memmap for some examples of this
I'd suggest preprocessing files into a useful format "offline", i.e. not part of the model training but a seperate program that is run first. as this would probably take a while and you'd probably not want to do it every time
I was doing multiprocessing in python and hit a pickling error. Which makes me wonder why do we need to pickle the object in order to do multiprocessing? isn't fork() enough?
Edit: I kind of get why we need pickle to do interprocess communication, but that is just for the data you want to transfer right? why does the multiprocessing module also try to pickle stuff like functions etc?
Which makes me wonder why do we need to pickle the object in order to
do multiprocessing?
We don't need pickle, but we do need to communicate between processes, and pickle happens to be a very convenient, fast, and general serialization method for Python. Serialization is one way to communicate between processes. Memory sharing is the other. Unlike memory sharing, the processes don't even need to be on the same machine to communicate. For example, PySpark using serialization very heavily to communicate between executors (which are typically different machines).
Addendum: There are also issues with the GIL (Global Interpreter Lock) when sharing memory in Python (see comments below for detail).
isn't fork() enough?
Not if you want your processes to communicate and share data after they've forked. fork() clones the current memory space, but changes in one process won't be reflected in another after the fork (unless we explicitly share data, of course).
I kind of get why we need pickle to do interprocess communication, but
that is just for the data you want to transfer right? why does the
multiprocessing module also try to pickle stuff like functions etc?
Sometimes complex objects (i.e. "other stuff"? not totally clear on what you meant here) contain the data you want to manipulate, so we'll definitely want to be able to send that "other stuff".
Being able to send a function to another process is incredibly useful. You can create a bunch of child processes and then send them all a function to execute concurrently that you define later in your program. This is essentially the crux of PySpark (again a bit off topic, since PySpark isn't multiprocessing, but it feels strangely relevant).
There are some functional purists (mostly the LISP people) that make arguments that code and data are the same thing. So it's not much of a line to draw for some.
I'm playing with Python multiprocessing module to have a (read-only) array shared among multiple processes. My goal is to use multiprocessing.Array to allocate the data and then have my code forked (forkserver) so that each worker can read straight from the array to do their job.
While reading the Programming guidelines I got a bit confused.
It is first said:
Avoid shared state
As far as possible one should try to avoid shifting large amounts of
data between processes.
It is probably best to stick to using queues or pipes for
communication between processes rather than using the lower level
synchronization primitives.
And then, a couple of lines below:
Better to inherit than pickle/unpickle
When using the spawn or forkserver start methods many types from
multiprocessing need to be picklable so that child processes can use
them. However, one should generally avoid sending shared objects to
other processes using pipes or queues. Instead you should arrange the
program so that a process which needs access to a shared resource
created elsewhere can inherit it from an ancestor process.
As far as I understand, queues and pipes pickle objects. If so, aren't those two guidelines conflicting?
Thanks.
The second guideline is the one relevant to your use case.
The first is reminding you that this isn't threading where you manipulate shared data structures with locks (or atomic operations). If you use Manager.dict() (which is actually SyncManager.dict) for everything, every read and write has to access the manager's process, and you also need the synchronization typical of threaded programs (which itself may come at a higher cost from being cross-process).
The second guideline suggests inheriting shared, read-only objects via fork; in the forkserver case, this means you have to create such objects before the call to set_start_method, since all workers are children of a process created at that time.
The reports on the usability of such sharing are mixed at best, but if you can use a small number of any of the C-like array types (like numpy or the standard array module), you should see good performance (because the majority of pages will never be written to deal with reference counts). Note that you do not need multiprocessing.Array here (though it may work fine), since you do not need writes in one concurrent process to be visible in another.
Why does the multiprocessing package for python pickle objects to pass them between processes, i.e. to return results from different processes to the main interpreter process? This may be an incredibly naive question, but why can't process A say to process B "object x is at point y in memory, it's yours now" without having to perform the operation necessary to represent the object as a string.
multiprocessing runs jobs in different processes. Processes have their own independent memory spaces, and in general cannot share data through memory.
To make processes communicate, you need some sort of channel. One possible channel would be a "shared memory segment", which pretty much is what it sounds like. But it's more common to use "serialization". I haven't studied this issue extensively but my guess is that the shared memory solution is too tightly coupled; serialization lets processes communicate without letting one process cause a fault in the other.
When data sets are really large, and speed is critical, shared memory segments may be the best way to go. The main example I can think of is video frame buffer image data (for example, passed from a user-mode driver to the kernel or vice versa).
http://en.wikipedia.org/wiki/Shared_memory
http://en.wikipedia.org/wiki/Serialization
Linux, and other *NIX operating systems, provide a built-in mechanism for sharing data via serialization: "domain sockets" This should be quite fast.
http://en.wikipedia.org/wiki/Unix_domain_socket
Since Python has pickle that works well for serialization, multiprocessing uses that. pickle is a fast, binary format; it should be more efficient in general than a serialization format like XML or JSON. There are other binary serialization formats such as Google Protocol Buffers.
One good thing about using serialization: it's about the same to share the work within one computer (to use additional cores) or to share the work between multiple computers (to use multiple computers in a cluster). The serialization work is identical, and network sockets work about like domain sockets.
EDIT: #Mike McKerns said, in a comment below, that multiprocessing can use shared memory sometimes. I did a Google search and found this great discussion of it: Python multiprocessing shared memory