Python multiprocessing: sharing data between separate Python processes

Multiprocessing allows me to share data between processes started from within the same Python interpreter.
But what if I need to share data between processes started by separate Python interpreters?
I was looking at multiprocessing.Manager, which seems to be the right construct for this. If I create a manager I can see its address:
>>> from multiprocessing import Manager
>>> m=Manager()
>>> m.address
'/tmp/pymp-o2TCd_/listener-Qld03B'
And the socket is there:
adrian@sammy ~/temp $ netstat -naA unix | grep pymp
unix 2 [ ACC ] STREAM LISTENING 1220401 /tmp/pymp-o2TCd_/listener-Qld03B
If I start a new process with multiprocessing.Process, it spawns a new Python interpreter that somehow inherits information about these shared constructs, such as this Manager.
Is there a way to access the Manager from a new Python process that was NOT spawned by the one that created it?

You are on the (or a) right track with this.
In a comment, stovfl suggests looking at the remote manager section of the Python multiprocessing Manager documentation (Python2, Python3). As you have observed, each manager has a name-able entity (a socket in /tmp in this case) through which each Python process can connect to a peer Python process. Because these are accessible from any process, however, they each have an access key.
The default key for each Manager is the one for the "main process", and it is a string of 32 random bytes:
class _MainProcess(BaseProcess):

    def __init__(self):
        self._identity = ()
        self._name = 'MainProcess'
        self._parent_pid = None
        self._popen = None
        self._config = {'authkey': AuthenticationString(os.urandom(32)),
                        'semprefix': '/mp'}
        # Note that some versions of FreeBSD only allow named
        # semaphores to have names of up to 14 characters. Therefore
        # we choose a short prefix.
        #
        # On MacOSX in a sandbox it may be necessary to use a
        # different prefix -- see #19478.
        #
        # Everything in self._config will be inherited by descendant
        # processes.
but you may assign your own key, which you can then know and therefore use from anywhere else.
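For example, here is a minimal sketch of the remote-manager pattern: one process serves a plain dict, and an entirely unrelated Python process connects to it. The address, the authkey value, and the registered name get_shared are all made up for illustration.
# serve.py (sketch)
from multiprocessing.managers import BaseManager

shared = {}

class SharedManager(BaseManager):
    pass

SharedManager.register('get_shared', callable=lambda: shared)

if __name__ == '__main__':
    # A Unix socket path such as '/tmp/example-listener' would also work as the address.
    manager = SharedManager(address=('127.0.0.1', 50000), authkey=b'my-secret-key')
    server = manager.get_server()
    server.serve_forever()

# client.py (sketch) -- run from any other Python interpreter
from multiprocessing.managers import BaseManager

class SharedManager(BaseManager):
    pass

SharedManager.register('get_shared')

if __name__ == '__main__':
    manager = SharedManager(address=('127.0.0.1', 50000), authkey=b'my-secret-key')
    manager.connect()
    shared = manager.get_shared()   # proxy to the dict living in serve.py
    shared.update({'answer': 42})
    print(shared.get('answer'))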
There are other ways to handle this. For instance, you can use XML RPC to export callable functions from one Python process, callable from anything—not just Python—that can speak XML RPC. See the Python2 or Python3 documentation. Heed this warning (this is the py3k variant but it applies in py2k as well):
Warning: The xmlrpc.client module is not secure against maliciously constructed data. If you need to parse untrusted or unauthenticated data see XML vulnerabilities.
Do not, however, assume that using a multiprocessing.Manager instead of XML RPC secures you against maliciously constructed data. Those are just as vulnerable since they will unpickle arbitrary data. See Attacking Python's pickle for more about this.
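For reference, a bare-bones sketch of the XML RPC route in Python 3 (in Python 2 the modules are SimpleXMLRPCServer and xmlrpclib); the port number and the function names are arbitrary choices for illustration:
# rpc_server.py (sketch)
from xmlrpc.server import SimpleXMLRPCServer

shared = {}

def put(key, value):
    shared[key] = value
    return True

def get(key):
    return shared.get(key)

server = SimpleXMLRPCServer(('127.0.0.1', 8000), allow_none=True)
server.register_function(put)
server.register_function(get)
server.serve_forever()

# rpc_client.py (sketch) -- any other process
import xmlrpc.client

proxy = xmlrpc.client.ServerProxy('http://127.0.0.1:8000', allow_none=True)
proxy.put('answer', 42)
print(proxy.get('answer'))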

Related

Multiprocess inherently shared memory is no longer working on Python 3.10 (coming from 3.6)

I understand there are a variety of techniques for sharing memory and data structures between processes in Python. This question is specifically about the inherently shared memory in Python scripts that existed in Python 3.6 but seems to no longer exist in 3.10. Does anyone know why, and whether it's possible to bring this back in 3.10? Or what is this change that I'm observing? I've upgraded my Mac to Monterey and it no longer supports Python 3.6, so I'm forced to upgrade to either 3.9 or 3.10+.
Note: I tend to develop on Mac and run production on Ubuntu. Not sure if that factors in here. Historically with 3.6, everything behaved the same regardless of OS.
Make a simple project with the following Python files:
myLibrary.py
MyDict = {}
test.py
import threading
import time
import multiprocessing
import myLibrary

def InitMyDict():
    myLibrary.MyDict = {'woot': 1, 'sauce': 2}
    print('initialized myLibrary.MyDict to ', myLibrary.MyDict)

def MainLoop():
    numOfSubProcessesToStart = 3
    for i in range(numOfSubProcessesToStart):
        t = threading.Thread(
            target=CoolFeature(),
            args=())
        t.start()

    while True:
        time.sleep(1)

def CoolFeature():
    MyProcess = multiprocessing.Process(
        target=SubProcessFunction,
        args=())
    MyProcess.start()

def SubProcessFunction():
    print('SubProcessFunction: ', myLibrary.MyDict)

if __name__ == '__main__':
    InitMyDict()
    MainLoop()
When I run this on 3.6 it has a significantly different behavior than 3.10. I do understand that a subprocess cannot modify the memory of the main process, but it is still super convenient to access the main process' data structure that was previously set up as opposed to moving every little tiny thing into shared memory just to read a simple dictionary/int/string/etc.
Python 3.10 output:
python3.10 test.py
initialized myLibrary.MyDict to {'woot': 1, 'sauce': 2}
SubProcessFunction: {}
SubProcessFunction: {}
SubProcessFunction: {}
Python 3.6 output:
python3.6 test.py
initialized myLibrary.MyDict to {'woot': 1, 'sauce': 2}
SubProcessFunction: {'woot': 1, 'sauce': 2}
SubProcessFunction: {'woot': 1, 'sauce': 2}
SubProcessFunction: {'woot': 1, 'sauce': 2}
Observation:
Notice that in 3.6, the subprocess can view the value that was set from the main process. But in 3.10, the subprocess sees an empty dictionary.
In short: since 3.8, CPython uses the spawn start method on macOS. Before that, it used the fork method.
On UNIX platforms, the fork start method is used which means that every new multiprocessing process is an exact copy of the parent at the time of the fork.
The spawn method means that it starts a new Python interpreter for each new multiprocessing process. According to the documentation:
The child process will only inherit those resources necessary to run the process object’s run() method.
It will import your program into this new interpreter, so starting processes etc. should only be done from within the if __name__ == '__main__': block!
This means you cannot count on variables from the parent process being available in the children, unless they are module level constants which would be imported.
So the change is significant.
What can be done?
If the required information could be a module-level constant, that would solve the problem in the simplest way.
If that is not possible (e.g. because the data needs to be generated at runtime), you could have the parent write the information to be shared to a file, e.g. in JSON format, before it starts the other processes. The children can then simply read this file. That is probably the next simplest solution.
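A minimal sketch of that approach (the file name is arbitrary):
# sketch: the parent dumps the data before starting any process; children just read it
import json
import multiprocessing

SHARED_FILE = 'shared_data.json'

def SubProcessFunction():
    with open(SHARED_FILE) as f:
        my_dict = json.load(f)
    print('SubProcessFunction:', my_dict)

if __name__ == '__main__':
    with open(SHARED_FILE, 'w') as f:
        json.dump({'woot': 1, 'sauce': 2}, f)
    multiprocessing.Process(target=SubProcessFunction).start()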
Using a multiprocessing.Manager would allow you to share a dict between processes. There is however a certain amount of overhead associated with this.
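A small sketch of the Manager variant:
# sketch: a Manager-backed dict, passed to the child, readable and writable on both sides
import multiprocessing

def SubProcessFunction(shared):
    print('SubProcessFunction:', dict(shared))
    shared['from_child'] = True   # visible in the parent afterwards

if __name__ == '__main__':
    with multiprocessing.Manager() as manager:
        shared = manager.dict({'woot': 1, 'sauce': 2})
        p = multiprocessing.Process(target=SubProcessFunction, args=(shared,))
        p.start()
        p.join()
        print('parent sees:', dict(shared))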
Or you could try calling multiprocessing.set_start_method("fork") before creating processes or pools and see if it doesn't crash in your case. That would revert to the pre-3.8 method on macOS. But as documented in this bug, there are real problems with using the fork method on macOS.
Reading the issue indicates that fork might be OK as long as you don't use threads.
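And a minimal sketch of the set_start_method option applied to the question's example (bearing in mind the caveat above about threads):
# sketch: force the fork start method so module state is inherited again
import multiprocessing
import myLibrary

def SubProcessFunction():
    print('SubProcessFunction:', myLibrary.MyDict)

if __name__ == '__main__':
    multiprocessing.set_start_method('fork')  # must run before any Process is created
    myLibrary.MyDict = {'woot': 1, 'sauce': 2}
    multiprocessing.Process(target=SubProcessFunction).start()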

UUID stayed the same for different processes on CentOS, but works fine on Windows (UUID per process flow)

I have two source files that I am running in Python 3.9. (The files are big...)
File one (fileOne.py)
# ...
sessionID = uuid.uuid4().hex
# ...
File two (fileTwo.py)
# ...
from fileOne import sessionID
# ...
File two is executed using the multiprocessing module.
When I run this on my local machine and print the UUID in file two, it is always unique.
When I run the script on CentOS, the UUID somehow stays the same.
If I restart the service, the UUID will change once.
My question: Why does this work locally (Windows OS) as expected, but not on a CentOS VM?
UPDATE 1.0:
To make it clear: for each separate process, I need the UUID to be the same across fileOne and fileTwo. Which means:
processOne = UUID in file one and in file two will be 1q2w3e
processTwo = UUID in file one and in file two will be r4t5y6 (a different one)
Your riddle is likely caused by the way multiprocessing works on different operating systems. You don't mention it, but your "run locally" is certainly Windows or macOS, not Linux or another Unix flavor.
The thing is that multiprocessing on Linux (and, until Python 3.8, on macOS as well) uses a system fork call: the current process is duplicated "as is" with all its defined variables and classes. Since your sessionID is defined at import time, it stays the same in all subprocesses.
Windows lacks the fork call, and multiprocessing resorts to starting a new Python interpreter which re-imports all modules from the current process (this leads to another, more common cause of confusion, where any code not guarded by an if __name__ == "__main__": block in the entry Python file is re-executed). In your case, the value of sessionID is regenerated on each import.
Check the docs at: https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods
So, if you want the variable to behave reliably and have the same value across all processes when running under multiprocessing, you should either pass it as a parameter to the target functions in the other processes, or use a proper structure meant to share values across processes, as documented here:
https://docs.python.org/3/library/multiprocessing.html#sharing-state-between-processes
(you can also check this recent question about the same topic: why is a string printing 3 times instead of 1 when using time.sleep with multiprocessing imported?)
If you need a unique ID across files for each different process:
(As is more clear from the edit and comments)
Have a global (plain) dictionary which will work as a per-process registry for the IDs, and use a function to retrieve the ID - the function can use os.getpid() as a key to the registry.
file 1:
import os
import uuid
...
_id_registry = {}
def get_session_id():
    return _id_registry.setdefault(os.getpid(), uuid.uuid4())
file2:
from file1 import get_session_id
sessionID = get_session_id()
(the setdefault dict method takes care of providing a new ID value if none was set)
NB: the registry set up this way will keep at most the master process's ID (if multiprocessing is using fork mode) and its own - no data on the siblings, as each process will hold its own copy of the registry. If you need a working inter-process dictionary (which could hold a live registry for all processes, for example) you will probably be better off using Redis for it (https://redis.io - certainly one of the Python bindings has a transparent Python-mapping-over-Redis, so you don't have to worry about its semantics).
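A rough sketch of that Redis idea with the redis-py client (it assumes a Redis server running on localhost; the hash name 'session_ids' is made up for illustration):
# shared_registry.py (sketch) -- pip install redis
import os
import uuid

import redis

_r = redis.Redis(host='localhost', port=6379, db=0)

def get_session_id():
    pid = str(os.getpid())
    # Store an ID only if this process has not registered one yet,
    # then read back whatever is stored (visible to every process).
    _r.hsetnx('session_ids', pid, uuid.uuid4().hex)
    return _r.hget('session_ids', pid).decode()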
When you run your script it generates a new value for the UUID each time, but when you run it inside some service your code behaves the same as:
sessionID = 123 # simple constant
so to fix the issue you can try wrapping the code in a function, for example:
def get_uuid():
    return uuid.uuid4().hex
in your second file:
from fileOne import get_uuid
get_uuid()

A timeout decorator class with multiprocessing gives a pickling error

So on Windows, the signal and thread approaches are in general bad ideas / don't work for timing out functions.
I've made the following timeout code, which throws a timeout exception from multiprocessing when the code takes too long. This is exactly what I want.
def timeout(timeout, func, *arg):
    with Pool(processes=1) as pool:
        result = pool.apply_async(func, (*arg,))
        return result.get(timeout=timeout)
I'm now trying to get this into a decorator style so that I can add it to a wide range of functions, especially those where external services are called and I have no control over the code or duration. My current attempt is below:
class TimeWrapper(object):

    def __init__(self, timeout=10):
        """Timing decorator"""
        self.timeout = timeout

    def __call__(self, f):
        def wrapped_f(*args):
            with Pool(processes=1) as pool:
                result = pool.apply_async(f, (*args,))
                return result.get(timeout=self.timeout)

        return wrapped_f
It gives a pickling error:
@TimeWrapper(7)
def func2(x, y):
    time.sleep(5)
    return x*y
File "C:\Users\rmenk\AppData\Local\Continuum\anaconda3\lib\multiprocessing\reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
_pickle.PicklingError: Can't pickle <function func2 at 0x000000770C8E4730>: it's not the same object as __main__.func2
I suspect this is due to multiprocessing and the decorator not playing nice, but I don't actually know how to make them play nice. Any ideas on how to fix this?
PS: I've done some extensive research on this site and other places but haven't found any answers that work, be it with pebble, threading, as a function decorator or otherwise. If you have a solution that you know works on Windows and Python 3.5, I'd be very happy to just use that.
What you are trying to achieve is particularly cumbersome in Windows. The core issue is that when you decorate a function, you shadow it. This happens to work just fine in UNIX due to the fact it uses the fork strategy to create a new process.
In Windows though, the new process will be a blank one where a brand new Python interpreter is started and loads your module. When the module gets loaded, the decorator hides the real function making it hard to find for the pickle protocol.
The only way to get it right is to rely on a trampoline function to be set during the decoration. You can take a look at how it is done in pebble but, as long as you're not doing it for an exercise, I'd recommend using pebble directly as it already offers what you are looking for.
from pebble import concurrent

@concurrent.process(timeout=60)
def my_function(var, keyvar=0):
    return var + keyvar

future = my_function(1, keyvar=2)
future.result()
The only problem you have here is that you tested the decorated function in the main context. Move it out to a different module and it will probably work.
I wrote the wrapt_timeout_decorator, which uses wrapt & dill & multiprocess & pipes instead of pickle & multiprocessing & queue, because it can serialize more datatypes.
It might look simple at first, but under Windows a reliable timeout decorator is quite tricky - you might use mine; it's quite mature and tested:
https://github.com/bitranox/wrapt_timeout_decorator
On Windows the main module is imported again (but with a name != '__main__') because Python is trying to simulate a forking-like behavior on a system that doesn't support forking. multiprocessing tries to create an environment similar to your main process by importing the main module again under a different name. That's why you need to shield the entry point of your program with the famous if __name__ == '__main__': guard:
import lib_foo

def some_module():
    lib_foo.function_foo()

def main():
    some_module()

# here the subprocess stops loading, because __name__ is NOT '__main__'
if __name__ == '__main__':
    main()
This is a problem on Windows, because the Windows operating system does not support fork.
You can find more information on that here:
Workaround for using __name__=='__main__' in Python multiprocessing
https://docs.python.org/2/library/multiprocessing.html#windows
Since main.py is loaded again under a name other than '__main__', the decorated function now points to objects that do not exist anymore; therefore you need to put the decorated classes and functions into another module. In general (especially on Windows), the main() program should not contain anything but the main function - the real work should happen in the modules. I am also used to putting all settings or configurations in a different file, so all processes or threads can access them (and also to keep them in one place together, not to forget typing hints and name completion in your favorite editor).
The "dill" serializer is able to serialize also the main context; that means the objects in our example are pickled as "__main__.lib_foo", "__main__.some_module", "__main__.main" etc. We would not have this limitation when using "pickle", with the downside that "pickle" cannot serialize the following types:
functions with yields, nested functions, lambdas, cell, method, unboundmethod, module, code, methodwrapper, dictproxy, methoddescriptor, getsetdescriptor, memberdescriptor, wrapperdescriptor, xrange, slice, notimplemented, ellipsis, quit
additionally, dill supports:
save and load python interpreter sessions, save and extract the source code from functions and classes, interactively diagnose pickling errors
To support more types with the decorator, we selected dill as the serializer, with the small downside that methods and classes cannot be decorated in the __main__ context but need to reside in a module.
You can find more information on that here: Serializing an object in __main__ with pickle or dill

IPC with a Python subprocess

I'm trying to do some simple IPC in Python as follows: One Python process launches another with subprocess. The child process sends some data into a pipe and the parent process receives it.
Here's my current implementation:
# parent.py
import pickle
import os
import subprocess
import sys
read_fd, write_fd = os.pipe()
if hasattr(os, 'set_inheritable'):
    os.set_inheritable(write_fd, True)
child = subprocess.Popen((sys.executable, 'child.py', str(write_fd)), close_fds=False)
try:
    with os.fdopen(read_fd, 'rb') as reader:
        data = pickle.load(reader)
finally:
    child.wait()
assert data == 'This is the data.'
# child.py
import pickle
import os
import sys
with os.fdopen(int(sys.argv[1]), 'wb') as writer:
    pickle.dump('This is the data.', writer)
On Unix this works as expected, but if I run this code on Windows, I get the following error, after which the program hangs until interrupted:
Traceback (most recent call last):
File "child.py", line 4, in <module>
with os.fdopen(int(sys.argv[1]), 'wb') as writer:
File "C:\Python34\lib\os.py", line 978, in fdopen
return io.open(fd, *args, **kwargs)
OSError: [Errno 9] Bad file descriptor
I suspect the problem is that the child process isn't inheriting the write_fd file descriptor. How can I fix this?
The code needs to be compatible with Python 2.7, 3.2, and all subsequent versions. This means that the solution can't depend on either the presence or the absence of the changes to file descriptor inheritance specified in PEP 446. As implied above, it also needs to run on both Unix and Windows.
(To answer a couple of obvious questions: The reason I'm not using multiprocessing is because, in my real-life non-simplified code, the two Python programs are part of Django projects with different settings modules. This means they can't share any global state. Also, the child process's standard streams are being used for other purposes and are not available for this.)
UPDATE: After setting the close_fds parameter, the code now works in all versions of Python on Unix. However, it still fails on Windows.
subprocess.PIPE is implemented for all platforms. Why don't you just use this?
If you want to manually create and use an os.pipe(), you need to take care of the fact that Windows does not support fork(). It rather uses CreateProcess(), which by default does not make the child inherit open files. But there is a way: each single file descriptor can be made explicitly inheritable. This requires calling the Win32 API. I have implemented this in gipc, see the _pre/post_createprocess_windows() methods here.
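For illustration only, a rough sketch of that idea using just the standard library. It assumes Python 3.4+ (os.set_handle_inheritable), so it does not by itself satisfy the question's 2.7 requirement; the gipc code linked above shows the full, version-portable treatment. File names mirror the question.
# parent_win.py (sketch)
import msvcrt
import os
import subprocess
import sys

read_fd, write_fd = os.pipe()
# Convert the C-runtime descriptor to a Windows handle and mark it inheritable.
write_handle = msvcrt.get_osfhandle(write_fd)
os.set_handle_inheritable(write_handle, True)
child = subprocess.Popen(
    (sys.executable, 'child_win.py', str(write_handle)),
    close_fds=False)  # keep inheritable handles open for the child
os.close(write_fd)    # the parent no longer needs the write end
with os.fdopen(read_fd, 'rb') as reader:
    data = reader.read()
child.wait()
print(data)

# child_win.py (sketch)
import msvcrt
import os
import sys

# Turn the inherited handle back into a C-runtime file descriptor.
fd = msvcrt.open_osfhandle(int(sys.argv[1]), 0)
with os.fdopen(fd, 'wb') as writer:
    writer.write(b'This is the data.')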
As @Jan-Philip Gehrcke suggested, you could use subprocess.PIPE instead of os.pipe():
#!/usr/bin/env python
# parent.py
import sys
from subprocess import check_output
data = check_output([sys.executable or 'python', 'child.py'])
assert data.decode().strip() == 'This is the data.'
check_output() uses stdout=subprocess.PIPE internally.
You could use obj = pickle.loads(data) if child.py uses data = pickle.dumps(obj).
And the child.py could be simplified:
#!/usr/bin/env python
# child.py
print('This is the data.')
If the child process is written in Python then, for greater flexibility, you could import the child script as a module and call its functions instead of using subprocess. You could use the multiprocessing or concurrent.futures modules if you need to run some Python code in a different process.
If you can't use the standard streams, then your Django applications could use sockets to talk to one another.
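A bare-bones sketch of the socket option (the port number and the JSON payload are arbitrary choices for illustration):
# listener.py (sketch) -- one process waits for a message
import json
import socket

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(('127.0.0.1', 9000))
server.listen(1)
conn, _ = server.accept()
data = json.loads(conn.recv(4096).decode())
print('received:', data)
conn.close()
server.close()

# sender.py (sketch) -- the other process connects and sends a message
import json
import socket

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(('127.0.0.1', 9000))
client.sendall(json.dumps({'msg': 'This is the data.'}).encode())
client.close()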
The reason I'm not using multiprocessing is because, in my real-life non-simplified code, the two Python programs are part of Django projects with different settings modules. This means they can't share any global state.
This seems bogus. multiprocessing may itself use the subprocess module under the hood. If you don't want to share global state, then don't share it; that is the default for multiple processes. You should probably ask a more specific question, for your particular case, about how to organize the communication between the various parts of your project.

Strange way to pass data between modules in Python: How does it work?

I'm supposed to work with some messy code that I haven't written myself, and amidst the mess I found two scripts that communicate in this strange fashion (via a third, middleman script):
message.py, the 'middleman' script:
class m():
    pass
sender.py, who wants to send some info to the receiver:
from message import m
someCalculationResult = 1 + 2
m.result = someCalculationResult
receiver.py, who wants to print some results produced by sender.py:
from message import m
mInstance = m()
print mInstance.result
And, by magic, in the interpreter, importing sender.py then receiver.py does indeed print 3...
Now, what the hell is happening behind the scenes here? Are we storing our results in the class definition itself and recovering them via a particular instance? If so, why can't we recover the results from the definition itself as well? Is there a more elegant way to pass stuff between scripts run successively in the interpreter?
Using Python 2.6.6
That is just a convoluted way to set a global.
m is a class, m.result a class attribute. Both the sender and receiver can access it directly, just as they can access m.
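In other words, no instance is needed at all; the class attribute can be read straight off the class. A minimal sketch of the same idea:
# receiver_direct.py - same effect without creating an instance
from message import m

print(m.result)   # reads the class attribute set by sender.py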
They could have done this too:
# sender
import message
message.result = someCalculationResult
# receiver
import message
print message.result
Here result is just a name at the top level of the message module.
It should be noted that what you are doing is not running separate scripts; you are importing modules into the same interpreter. If you ran python sender.py first, without ever importing receiver.py, and then separately ran python receiver.py without ever importing sender.py, this whole scheme wouldn't work.
There are myriad ways to pass data from one section of code to another section, too many to name here, all fitting for a different scenario and need. Threading, separate processes, separate computers all introduce different constraints on how message passing can and should take place, for example.
