Error when trying to use pickle and multiprocessing - python

I'm trying to use multiprocessing to run my code in Python 3.7, but met a problem. There is an error when I try to run my code:
Can't pickle local object 'mm_prepare_run.<locals>...
I understand it's a problem with pickle, but I didn't find a proper answer how to resolve this issue.
My simple code is below. Could you advise how I can solve the problem?
import multiprocessing
import copy
from pathlib import Path
proc_mrg = multiprocessing.Manager()
num_cpu = 8 # number of CPU
def prepare_run(config):
din['config'] = config
din_temp = copy.deepcopy(din)
dout_list.append(proc_mrg.dict({}))
#process = multiprocessing.Process(target=Run_IDEAS_instance_get_trajectory,args=(din_temp, dout_list[-1]))
process = multiprocessing.Process(target=Run_IDEAS_instance_get_trajectory(din_temp, dout_list[-1]))
proc_list.append(process)
for job in proc_list:
job.start()

When you create a Process, in prepare_run you are calling Run_IDEAS_instance_get_trajectory instead of just passing it as a reference.
And since that function does not return a result, the target of the Process is None.
Use this instead:
process = multiprocessing.Process(
target=Run_IDEAS_instance_get_trajectory,
args=(din_temp, dout_list[-1])
)
Functions in Python are first class objects of the callable type.
See the "Data model" chapter in the Python language reference.
Edit:
From your comment, I can see that you are running this code on ms-windows.
On this platform it is required that you run process creation inside a if __name__ == "__main__" block! Because of how multiprocessing works on this platform, python has to be able to import your script without side effects such as starting a new process. See the "programming guidelines" section in the documentation for multiprocessing.

Related

Recursion error when building beam pipeline with dataflow runner when using custom ptransform

I built and managed to run a satisfactory pipeline locally with beam, and I am ready to send the job to DataFlow.
I planned to just pickle my session with the save_main_session pipeline option, however I run into a recursion error when trying to do so. After a couple of trial and error I managed to narrow it down to the way I define my ptransform_fn, with a decorator.
Please find below a minimal reproducible example
# my_script.py
from typing import Set
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions
from apache_beam.transforms.ptransform import ptransform_fn
#ptransform_fn
def my_function(pcoll):
return pcoll | beam.Create([1])
if __name__ == "__main__":
options = PipelineOptions()
options.view_as(SetupOptions).save_main_session = True
with beam.Pipeline(options=options) as p:
p | my_function()
The full traceback is quite long but ends with RecursionError: maximum recursion depth exceeded while calling a Python object
(Note that this is the save_main_session=True) option that enables this error, so I can run this python -m my_script using the local runner and will run into the RecursionError)
As the ptransform_fn is actually making my_function behave in an "unpythonic" way (being called without the argument it was defined), it seems like the pickler library has a problem with this.
So my final questions are:
Is this expected behavior ? Should I stick to defining classes which inherit from Beam.PTransform if I want to use the save_main_session ?
Is there a simple way (as setting up a pipeline option) to be able to pickle this script and run it on dataflow ?
save_main_session is inherently a bit fragile; for anything non-trivial I recommend putting the logic in a named module that can be imported in your main script (and on your workers).

A timeout decorator class with multiprocessing gives a pickling error

So on windows the signal and the thread approahc in general are bad ideas / don't work for timeout of functions.
I've made the following timeout code which throws a timeout exception from multiprocessing when the code took to long. This is exactly what I want.
def timeout(timeout, func, *arg):
with Pool(processes=1) as pool:
result = pool.apply_async(func, (*arg,))
return result.get(timeout=timeout)
I'm now trying to get this into a decorator style so that I can add it to a wide range of functions, especially those where external services are called and I have no control over the code or duration. My current attempt is below:
class TimeWrapper(object):
def __init__(self, timeout=10):
"""Timing decorator"""
self.timeout = timeout
def __call__(self, f):
def wrapped_f(*args):
with Pool(processes=1) as pool:
result = pool.apply_async(f, (*args,))
return result.get(timeout=self.timeout)
return wrapped_f
It gives a pickling error:
#TimeWrapper(7)
def func2(x, y):
time.sleep(5)
return x*y
File "C:\Users\rmenk\AppData\Local\Continuum\anaconda3\lib\multiprocessing\reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
_pickle.PicklingError: Can't pickle <function func2 at 0x000000770C8E4730>: it's not the same object as __main__.func2
I'm suspecting this is due to the multiprocessing and the decorator not playing nice but I don't actually know how to make them play nice. Any ideas on how to fix this?
PS: I've done some extensive research on this site and other places but haven't found any answers that work, be it with pebble, threading, as a function decorator or otherwise. If you have a solution that you know works on windows and python 3.5 I'd be very happy to just use that.
What you are trying to achieve is particularly cumbersome in Windows. The core issue is that when you decorate a function, you shadow it. This happens to work just fine in UNIX due to the fact it uses the fork strategy to create a new process.
In Windows though, the new process will be a blank one where a brand new Python interpreter is started and loads your module. When the module gets loaded, the decorator hides the real function making it hard to find for the pickle protocol.
The only way to get it right is to rely on a trampoline function to be set during the decoration. You can take a look on how is done on pebble but, as long as you're not doing it for an exercise, I'd recommend to use pebble directly as it already offers what you are looking for.
from pebble import concurrent
#concurrent.process(timeout=60)
def my_function(var, keyvar=0):
return var + keyvar
future = my_function(1, keyvar=2)
future.result()
The only problem You have here is that You tested the decorated function in the main context. Move it out to a different module and it will probably work.
I wrote the wrapt_timeout_decorator what uses wrapt & dill & multiprocess & pipes versus pickle & multiprocessing & queue, because it can serialize more datatypes.
It might look simple at first, but under windows a reliable timeout decorator is quite tricky - You might use mine, its quite mature and tested :
https://github.com/bitranox/wrapt_timeout_decorator
On Windows the main module is imported again (but with a name != 'main') because Python is trying to simulate a forking-like behavior on a system that doesn't support forking. multiprocessing tries to create an environment similar to Your main process by importing the main module again with a different name. Thats why You need to shield the entry point of Your program with the famous " if name == 'main': "
import lib_foo
def some_module():
lib_foo.function_foo()
def main():
some_module()
# here the subprocess stops loading, because __name__ is NOT '__main__'
if __name__ = '__main__':
main()
This is a problem of Windows OS, because the Windows Operating System does not support "fork"
You can find more information on that here:
Workaround for using __name__=='__main__' in Python multiprocessing
https://docs.python.org/2/library/multiprocessing.html#windows
Since main.py is loaded again with a different name but "main", the decorated function now points to objects that do not exist anymore, therefore You need to put the decorated Classes and functions into another module. In general (especially on windows) , the main() program should not have anything but the main function, the real thing should happen in the modules. I am also used to put all settings or configurations in a different file - so all processes or threads can access them (and also to keep them in one place together, not to forget typing hints and name completion in Your favorite editor)
The "dill" serializer is able to serialize also the main context, that means the objects in our example are pickled to "main.lib_foo", "main.some_module","main.main" etc. We would not have this limitation when using "pickle" with the downside that "pickle" can not serialize following types:
functions with yields, nested functions, lambdas, cell, method, unboundmethod, module, code, methodwrapper, dictproxy, methoddescriptor, getsetdescriptor, memberdescriptor, wrapperdescriptor, xrange, slice, notimplemented, ellipsis, quit
additional dill supports:
save and load python interpreter sessions, save and extract the source code from functions and classes, interactively diagnose pickling errors
To support more types with the decorator, we selected dill as serializer, with the small downside that methods and classes can not be decorated in the main context, but need to reside in a module.
You can find more information on that here: Serializing an object in __main__ with pickle or dill

How should a Python file be written such that it can be both a module and a script with command line options and pipe capabilities?

I'm considering how a Python file could be made to be an importable module as well as a script that is capable of accepting command line options and arguments as well as pipe data. How should this be done?
My attempt seems to work, but I want to know if my approach is how such a thing should be done (if such a thing should be done). Could there be complexities (such as when importing it) that I have not considered?
#!/usr/bin/env python
"""
usage:
program [options]
options:
--version display version and exit
--datamode engage data mode
--data=FILENAME input data file [default: data.txt]
"""
import docopt
import sys
def main(options):
print("main")
datamode = options["--datamode"]
filename_input_data = options["--data"]
if datamode:
print("engage data mode")
process_data(filename_input_data)
if not sys.stdin.isatty():
print("accepting pipe data")
input_stream = sys.stdin
input_stream_list = [line for line in input_stream]
print("input stream: {data}".format(data = input_stream_list))
def process_data(filename):
print("process data of file {filename}".format(filename = filename))
if __name__ == "__main__":
options = docopt.docopt(__doc__)
if options["--version"]:
print(version)
exit()
main(options)
That's it, you're good.
Nothing matters[1] except the if __name__ == '__main__', as noted elsewhere
From the docs (emphasis mine):
A module’s __name__ is set equal to '__main__' when read from standard input, a script, or from an interactive prompt. A module can discover whether or not it is running in the main scope by checking its own __name__, which allows a common idiom for conditionally executing code in a module when it is run as a script or with python -m but not when it is imported
I also like how python 2's docs poetically phrase it
It is this environment in which the idiomatic “conditional script” stanza causes a script to run:
That guard guarantees that the code underneath it will only be accepted if it is the main function being called; put all your argument-grabbing code there. If there is no other top-level code except class/function declarations, it will be safe to import.
Other complications?
Yes:
Multiprocessing (a new interpreter is started and things are re-imported). if __name__ == '__main__' covers that
If you're used to C coding, you might be thinking you can protect your imports with ifdef's and the like. There's some analogous hacks in python, but it's not what you're looking for.
I like having a main method like C and Java - when's that coming out? Never.
But I'm paranoid! What if someone changes my main function. Stop being friends with that person. As long as you're the user, I assume this isn't an issue.
I mentioned the -m flag. That sounds great, what's that?! Here and here, but don't worry about it.
Footnotes:
[1] Well, the fact that you put your main code in a function is nice. Means things will run slightly faster

IPC with a Python subprocess

I'm trying to do some simple IPC in Python as follows: One Python process launches another with subprocess. The child process sends some data into a pipe and the parent process receives it.
Here's my current implementation:
# parent.py
import pickle
import os
import subprocess
import sys
read_fd, write_fd = os.pipe()
if hasattr(os, 'set_inheritable'):
os.set_inheritable(write_fd, True)
child = subprocess.Popen((sys.executable, 'child.py', str(write_fd)), close_fds=False)
try:
with os.fdopen(read_fd, 'rb') as reader:
data = pickle.load(reader)
finally:
child.wait()
assert data == 'This is the data.'
# child.py
import pickle
import os
import sys
with os.fdopen(int(sys.argv[1]), 'wb') as writer:
pickle.dump('This is the data.', writer)
On Unix this works as expected, but if I run this code on Windows, I get the following error, after which the program hangs until interrupted:
Traceback (most recent call last):
File "child.py", line 4, in <module>
with os.fdopen(int(sys.argv[1]), 'wb') as writer:
File "C:\Python34\lib\os.py", line 978, in fdopen
return io.open(fd, *args, **kwargs)
OSError: [Errno 9] Bad file descriptor
I suspect the problem is that the child process isn't inheriting the write_fd file descriptor. How can I fix this?
The code needs to be compatible with Python 2.7, 3.2, and all subsequent versions. This means that the solution can't depend on either the presence or the absence of the changes to file descriptor inheritance specified in PEP 446. As implied above, it also needs to run on both Unix and Windows.
(To answer a couple of obvious questions: The reason I'm not using multiprocessing is because, in my real-life non-simplified code, the two Python programs are part of Django projects with different settings modules. This means they can't share any global state. Also, the child process's standard streams are being used for other purposes and are not available for this.)
UPDATE: After setting the close_fds parameter, the code now works in all versions of Python on Unix. However, it still fails on Windows.
subprocess.PIPE is implemented for all platforms. Why don't you just use this?
If you want to manually create and use an os.pipe(), you need to take care of the fact that Windows does not support fork(). It rather uses CreateProcess() which by default not make the child inherit open files. But there is a way: each single file descriptor can be made explicitly inheritable. This requires calling Win API. I have implemented this in gipc, see the _pre/post_createprocess_windows() methods here.
As #Jan-Philip Gehrcke suggested, you could use subprocess.PIPE instead of os.pipe():
#!/usr/bin/env python
# parent.py
import sys
from subprocess import check_output
data = check_output([sys.executable or 'python', 'child.py'])
assert data.decode().strip() == 'This is the data.'
check_output() uses stdout=subprocess.PIPE internally.
You could use obj = pickle.loads(data) if child.py uses data = pickle.dumps(obj).
And the child.py could be simplified:
#!/usr/bin/env python
# child.py
print('This is the data.')
If the child process is written in Python then for greater flexibility you could import the child script as a module and call its function instead of using subprocess. You could use multiprocessing, concurrent.futures modules if you need to run some Python code in a different process.
If you can't use standard streams then your django applications could use sockets to talk to one another.
The reason I'm not using multiprocessing is because, in my real-life non-simplified code, the two Python programs are part of Django projects with different settings modules. This means they can't share any global state.
This seems bogus. multiprocessing under-the-hood also may use subprocess module. If you don't want to share global state -- don't share it -- it is the default for multiple processes. You should probably ask a more specific for your particular case question about how to organize the communication between various parts of your project.

Using object in different module than original script

I just started with Python and I'm having some problems. I've written already a few scripts for ArcGIS and had some recurring stuff. So I thought it would be smart to put that in modules which I can easily use again.
So now I have two scripts, script.py and toolbox.py.
My script was working fine so I copied and paste the part I needed, edited it a bit and everything goes well except for the messages created with gp.Addmessage
script.py will create the message "Hello Stackoverflow" but the messages from toolbox.py doesn't show up. Why is that? It loads the toolbox because I can use it later on, so it regocnizes the gp object.
I'm kind of stuck here, would love to be able to print messages from inside the modules to inform the user of the tool what is happening.
script.py:
import os, sys, arcgisscripting
# Create the Geoprocessor object
gp = arcgisscripting.create()
gp.AddMessage("# Hello Stackoverflow")
import toolbox
toolbox.loadToolbox
toolbox.py:
def loadToolbox:
try:
some code
gp.AddToolbox(path)
gp.AddMessage("# Toolbox loaded")
except:
gp.AddMessage("# Toolbox not found")
You have two problems with your code:
You never call the loadToolBox method, you only refer to it. Add ():
toolbox.loadToolbox()
Your loadToolbox() function doesn't take gp as an argument. If gp is meant to be a global, then it won't be visible to the toolbox module (globals are only visible in the current module).
Add gp as a parameter and pass it in when calling loadToolbox. In script.py:
toolbox.loadToolbox(gp)
and in toolbox.py:
def loadToolbox(gp):
# rest of function

Categories