I've written a large program in Python that calls numerous custom modules' main methods one after another. The parent script creates some common resources, like logging instances, database cursors, and file references which I then pass around to the individual modules. The problem is that I now need to call some of these modules by means of subprocess.check_output, and I don't know how I can share the aforementioned resources across these modules. Is this possible?
The question exactly as you ask it has no general answer. There might be custom ways; e.g. on Linux a lot of things are actually file descriptors, and there are ways to pass them to subprocesses, but it's not nicely Pythonic: you have to pass them as numbers on the command line of the subprocess, and the subprocess then rebuilds a file object around the file descriptor (see file.fileno() and os.fdopen() for regular files; I'm not sure there is a way to do this in Python for anything other than regular files).
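For instance, a rough sketch of that approach for a regular file (assuming Python 3.2+ for pass_fds; the child script name child_module.py is made up here):

    import subprocess
    import sys

    # Parent: open a file and hand its descriptor number to the child on the command line.
    log = open("shared.log", "a")
    fd = log.fileno()
    output = subprocess.check_output(
        [sys.executable, "child_module.py", str(fd)],
        pass_fds=[fd],        # keep this descriptor open in the child (Python 3.2+)
    )

    # child_module.py would then rebuild a file object around the inherited descriptor:
    #   import os, sys
    #   fd = int(sys.argv[1])
    #   log = os.fdopen(fd, "a")
    #   log.write("hello from the child\n")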
In your problem, if everything is in Python, why do you need to make subprocesses instead of doing it all in a single process?
If you really need to, then one general way is to use os.fork() instead of the subprocess module: you fork the process (which creates two copies of it); in the parent copy you wait for the child copy to terminate; and in the child copy you run the particular submodule. The advantage is that when the child process terminates, everything it did is cleaned up, while it started out with its own copy of almost everything the parent had (file descriptors, database cursors, etc.).
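A minimal sketch of that fork-and-wait pattern (submodule_main stands in for whatever entry point you would call; error handling is left out):

    import os

    def run_in_child(submodule_main, *args):
        # Fork: the child starts with copies of the parent's open files,
        # database cursors, logging instances, etc.
        pid = os.fork()
        if pid == 0:
            # Child copy: run the submodule, then exit so everything it did is cleaned up.
            try:
                submodule_main(*args)
            finally:
                os._exit(0)
        # Parent copy: wait for the child to terminate.
        _, status = os.waitpid(pid, 0)
        return status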
I need to read a large dataset (about 25GB of images) into memory and read it from multiple processes. None of the processes has to write, only read. All the processes are started using Python's multiprocessing module, so they have the same parent process. They train different models on the data and run independently of each other. The reason why I want to read it only one time rather than in each process is that the memory on the machine is limited.
I have tried using Redis, but unfortunately it is extremely slow when many processes read from it. Is there another option to do this?
Is it maybe somehow possible to have another process that only serves as a "get the image with ID x" function? What Python module would be suited for this? Otherwise, I was thinking about implementing a small webserver using werkzeug or Flask, but I am not sure if that would then become my new bottleneck...
Another possibility that came to my mind was to use threads instead of processes, but since Python is not really doing "real" multithreading, this would probably become my new bottleneck.
If you are on Linux and the content is read-only, you can use the Linux fork inheritance mechanism.
From the multiprocessing documentation:
Better to inherit than pickle/unpickle
When using the spawn or forkserver start methods many types from
multiprocessing need to be picklable so that child processes can use
them. However, one should generally avoid sending shared objects to
other processes using pipes or queues. Instead you should arrange the
program so that a process which needs access to a shared resource
created elsewhere can inherit it from an ancestor process.
which means:
Before you fork your child processes, prepare your big data in a module-level variable (global to all the functions).
Then, in the same module, run your children with multiprocessing in 'fork' mode (set_start_method('fork')).
This way the sub-processes will see this variable without copying it, thanks to the Linux forking mechanism that creates child processes with the same memory mapping as the parent (see "copy on write").
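A minimal sketch of that arrangement (the array here is just an arbitrary stand-in for your image data):

    import multiprocessing as mp
    import numpy as np

    # Module-level data, prepared before any child process is started.
    BIG_DATA = np.zeros((1000, 256, 256, 3), dtype=np.uint8)

    def train(model_id):
        # The child reads BIG_DATA through the memory pages inherited from the parent;
        # as long as it only reads, the pages are never physically copied.
        print(model_id, BIG_DATA[model_id].mean())

    if __name__ == "__main__":
        mp.set_start_method("fork")     # fork is only available on Unix
        workers = [mp.Process(target=train, args=(i,)) for i in range(4)]
        for w in workers:
            w.start()
        for w in workers:
            w.join()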
I'd suggest mmapping the files; that way they can be shared across multiple processes as well as swapped in/out as appropriate.
The details of this would depend on what you mean by "25GB of images" and how the models want to access the images.
The basic idea would be to preprocess the images into an appropriate format (e.g. one big 4D uint8 numpy array, or maybe several smaller ones, with indices (image, row, column, channel)) and save them in a format where they can be efficiently used by the models. See numpy.memmap for some examples of this.
I'd suggest preprocessing the files into a useful format "offline", i.e. not as part of the model training but as a separate program that is run first, since this would probably take a while and you'd probably not want to do it every time.
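A rough sketch of that memmap idea (shape, dtype and file name are assumptions):

    import numpy as np

    # Offline preprocessing (run once): write the decoded images into one big
    # memory-mapped uint8 array with indices (image, row, column, channel).
    shape = (25000, 256, 256, 3)
    images = np.memmap("images.dat", dtype=np.uint8, mode="w+", shape=shape)
    # ... fill images[i] from the decoded image files ...
    images.flush()

    # In each training process: map the same file read-only; the OS shares the
    # pages between processes and swaps them in/out on demand.
    images = np.memmap("images.dat", dtype=np.uint8, mode="r", shape=shape)
    batch = images[0:32]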
I have several scripts. Each of them does some computation and it is completely independent from the others. Once these computations are done, they will be saved to disk and a record updated.
The record is maintained by an instance of a class which saves itself to disk. I would like to have a single record instance used in multiple scripts (for example, record_manager = RecordManager(file_on_disk) and then record_manager.update(...)); but I can't do this right now, because when updating the record there may be concurrent write accesses to the same file on disk, leading to data loss. So I have a separate record manager for every script, and then I merge the records manually later.
What is the easiest way to have a single instance used in all the scripts that solves the concurrent write access problem?
I am using macOS (High sierra) and linux (Ubuntu 16.04).
Thanks!
To build a custom solution to this you will probably need to write a small new queuing module. This queuing module will be the only component with write access to the file(s) and will be passed write actions from the existing modules in your code.
The queue logic itself should be a pretty straightforward queue architecture.
There may also be existing Python libraries that handle this problem and would save you from writing your own queue class.
Finally, it is possible that this whole thing will be, or could be, handled in some way by your OS, independent of Python.
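One possible shape for such a solution, using the standard library's multiprocessing.managers so that a single server process owns the record and every script talks to it over a socket (the RecordManager shown, the address and the authkey are all placeholders for your own code):

    # record_server.py -- run once; the only process that ever writes the file.
    from multiprocessing.managers import BaseManager

    class RecordManager:
        # Placeholder for your existing record class.
        def __init__(self, path):
            self.path = path
        def update(self, key, value):
            with open(self.path, "a") as f:   # single writer: no concurrent access
                f.write("{}\t{}\n".format(key, value))

    record = RecordManager("records.tsv")

    class RecordServer(BaseManager):
        pass

    RecordServer.register("get_record", callable=lambda: record)
    server = RecordServer(address=("127.0.0.1", 50000), authkey=b"change-me")
    server.get_server().serve_forever()

Each script would then connect and use the shared instance through a proxy:

    from multiprocessing.managers import BaseManager

    class RecordServer(BaseManager):
        pass

    RecordServer.register("get_record")
    manager = RecordServer(address=("127.0.0.1", 50000), authkey=b"change-me")
    manager.connect()
    record_manager = manager.get_record()
    record_manager.update("script_1", "done")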
I want to call python functions in sub-processes without creating a copy of the current process.
I have a method A.run() which should call B.run() multiple times.
A.run() consumes a lot of memory, so I don't want to use a ProcessPoolExecutor because it copies the whole memory AFAIK.
I also do not want to use subprocess.Popen because it has several disadvantages to me:
only pass strings as parameters
cannot take advantage of exceptions
I have to know the location of B.py exactly, instead of relying on PYTHONPATH
I also do not want to spawn threads because B.run() crashes easily and I don't want it to affect the parent process.
Is there a way I have overlooked that has the advantage of spawning separate processes, without the extra memory but with the benefits of calling a python method?
Edit 1:
Answers to some questions:
If I understand this correctly, I don't need the context of the first Python process.
I cannot reuse Processes because I call a C++ library which has static variables and they need to be destroyed.
Most Unix Operating Systems are using Copy-On-Write when they fork new processes.
This implies that, if the memory is not changed by the process children, the memory is not duplicated but shared.
The processes appear to use the same amount of memory because they share the same amount of virtual memory, but when it comes to physical memory, the parent process's memory actually exists in a single copy shared among them all.
If my assumption is right and the child processes are not touching the parent's memory at all, then you're just wasting your time going against Unix design principles.
More info here.
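If that is your case, one way to lean on it is to fork one fresh process per call (fork is the default start method on Linux), so each B.run() gets its own copy-on-write view of the parent and its own C++ static state, while exceptions are shipped back over a pipe. A sketch, with the error handling being my own assumption:

    import multiprocessing as mp

    def _worker(conn, func, args):
        # Child: run the function and ship the result (or the exception) back.
        try:
            conn.send(("ok", func(*args)))    # the result must be picklable
        except Exception as exc:
            conn.send(("error", exc))
        finally:
            conn.close()

    def run_isolated(func, *args):
        # One freshly forked process per call: a crash or leftover C++ static
        # state stays in the child and never touches the parent.
        parent_conn, child_conn = mp.Pipe()
        proc = mp.Process(target=_worker, args=(child_conn, func, args))
        proc.start()
        proc.join()
        if not parent_conn.poll():
            raise RuntimeError("child died with exit code {}".format(proc.exitcode))
        status, payload = parent_conn.recv()
        if status == "error":
            raise payload
        return payload

    # e.g. result = run_isolated(B.run, some_argument)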
Is it possible for Python 2.5 to access the file descriptors opened by subprocess.Popen() child processes? Not specifically the input and output, but other pipes that it creates? If so, how would I do this? Appreciate the help.
There is no cross-platform way to access another process's file descriptors.
On almost any POSIX system, you can use sendmsg to pass open files over a Unix socket or pipe (usually requiring SCM_RIGHTS), and Windows has different but similar functionality. So, if you control the child process, you can do things that way. This allows you to have the same actual file object and/or a clone of it, although you will still have a different descriptor (small number) or handle for that object.
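For example, on POSIX with Python 3.3+ (which added socket.sendmsg/recvmsg), passing a descriptor over a connected AF_UNIX socket looks roughly like this:

    import array
    import socket

    def send_fd(sock, fd):
        # Ship one file descriptor over a Unix-domain socket via SCM_RIGHTS.
        sock.sendmsg([b"fd"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS,
                                array.array("i", [fd]))])

    def recv_fd(sock):
        # Receive one descriptor; the kernel installs it into this process's fd table.
        fds = array.array("i")
        _, ancdata, _, _ = sock.recvmsg(16, socket.CMSG_SPACE(fds.itemsize))
        for level, ctype, data in ancdata:
            if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
                fds.frombytes(data[:fds.itemsize])
                return fds[0]
        raise RuntimeError("no descriptor received")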
Some platforms have a way to access another process's file handles explicitly by using their descriptor or handle. On linux, for example, '/proc/{}/fd/{}'.format(childpid, childfd) will be a symlink to the open file (even if it's not in a name-accessible part of the filesystem), although of course in that case you end up with a different file object for the same file. Windows has NT-level APIs for enumerating all open handles of a child process; you usually won't have access to them unless the child gives it to you explicitly, but you can get the pathname and open it for yourself. (Although any close-on-exec files will get broken, of course, so be careful with that… and if you want to use stdin/out/err pipes it will get a lot more complicated.)
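On Linux that looks like this (the pid and descriptor number are placeholders you would have to get from the child somehow):

    import os

    child_pid, child_fd = 12345, 7          # placeholders
    path = "/proc/{}/fd/{}".format(child_pid, child_fd)
    print(os.readlink(path))                # where the descriptor points
    clone = open(path, "rb")                # a new, independent file object on the same file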
Finally, some platforms have ways to do something between forking a new process and just starting a new thread. For example, with linux, you can clone with all the flags the same as in fork, except for CLONE_FILES set to true, and you will then have a separate process that nevertheless shares the file descriptor table. Then you can actually use the same file descriptor numbers to refer to the same file objects. This obviously isn't wrapped up by subprocess, but it wouldn't be too hard to fork the subprocess source (or, better, if you're using 3.1 or earlier, the subprocess32 backport) and make your own version that wraps this up.
If you want a cross-platform solution, you can build one yourself. For example, in most cases, you can just pass an absolute pathname over a pipe, and have the parent open the same file as the child. Unless you really need to share the position within the file or similar, or are using file descriptors to files you don't have access to as passable capabilities (and in either case you're probably not writing cross-platform code), this will usually work just as well, and it's a lot simpler.
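A sketch of that pathname-passing variant (child.py is a made-up helper that reads the path from its stdin):

    import os
    import subprocess
    import sys

    # Parent: send an absolute pathname instead of a descriptor.
    path = os.path.abspath("shared_data.txt")
    child = subprocess.Popen([sys.executable, "child.py"], stdin=subprocess.PIPE)
    child.communicate((path + "\n").encode())

    # child.py would then open the same file for itself:
    #   import sys
    #   path = sys.stdin.readline().strip()
    #   with open(path) as f:
    #       data = f.read()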
I want to use the Python multiprocessing module to communicate with child processes (via Pipes), which are independent of the caller (ie completely different and independent source code, path, executable).
The main process is doing some networking while the children should simply do some data conversion (file in, file out). I need to do that several times a second, so restarting a subprocess for each file is not an option.
Is there a way to do that (under Windows)?