Changes made to object attribute not seen when using the multiprocessing module - python

When using multiprocessing in Python and importing a module, why is it that any instance variables in the module are passed by copy to the child process, whereas arguments passed in the args parameter are passed by reference?
Does this have to do with thread safety perhaps?
foo.py
class User:
    def __init__(self, name):
        self.name = name

foo_user = User('foo')
main.py
import multiprocessing
from foo import User, foo_user

def worker(main_foo):
    print(main_foo.name) #prints 'main user'
    print(foo_user.name) #prints 'foo user', why doesn't it print 'override'

if __name__ == '__main__':
    main_foo = User('main user')
    foo_user.name = 'override'
    p = multiprocessing.Process(target=worker, args=(main_foo,))
    p.start()
    p.join()
EDIT: I'm an idiot, self.name = None should have been self.name = name. I made the correction in my code and forgot to copy it back over.

Actually, it does print override. Look at this:
$ python main.py
None
override
But! This only happens on *Nix. My guess is that you are running on Windows. The difference being that, in Windows, a fresh copy of the interpreter is spawned to just run your function, and the change you made to foo_user.name is not made, because in this new instance, __name__ is not __main__, so that bit of code is not executed. This is done to prevent infinite recursion.
You'll see the difference if you add this line to your function:
def worker(main_foo):
    print(__name__)
    ...
This prints __main__ on *Nix. However, it will not be __main__ for Windows.
You'll want to move that line out of the if __name__ == '__main__' block if you want it to work.
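The effect described in this answer can be reproduced without starting any processes. The sketch below (Python 3) executes a small script twice with runpy.run_path: once as __main__, which is what the parent process does, and once under a different module name, which approximates how the spawn start method re-runs the main script in the child (CPython uses the name __mp_main__ for this). The script text and the helper name_seen_when_run_as are made up for illustration.

```python
import os
import runpy
import tempfile

# Hypothetical stand-in for a main script that mutates a module-level
# object inside the __main__ guard.
SCRIPT = '''
class User:
    def __init__(self, name):
        self.name = name

foo_user = User("foo")

if __name__ == "__main__":
    foo_user.name = "override"
'''

def name_seen_when_run_as(run_name):
    """Execute SCRIPT under the given module name and report foo_user.name."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo_main.py")
        with open(path, "w") as handle:
            handle.write(SCRIPT)
        module_globals = runpy.run_path(path, run_name=run_name)
    return module_globals["foo_user"].name

parent_view = name_seen_when_run_as("__main__")     # guard runs
child_view = name_seen_when_run_as("__mp_main__")   # guard is skipped
print(parent_view, child_view)
```

Under these assumptions the first run sees the overridden name and the second does not, which matches the Windows behaviour described above.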

Related

Parent class init is executing during inheritance

One basic question in OOP.
test.py file content:
class test(object):
    def __init__(self):
        print 'INIT of test class'

obj = test()
Then I opened another file.
I just inherited from the above test class:
from test import test

class test1(test):
    def __init__(self):
        pass
So when I run this second file, the __init__() from the parent class is executed (the INIT got printed).
I read that I can avoid it by using
if __name__ == '__main__':
    # ...
I can overcome this, but my question is why the parent class's init is executing as I am just importing this class only in my second file. Why is the object creation code executed?
Importing a module executes all module-level statements, including obj=test().
To avoid this, make an instance only when run as the main program, not when imported:
class test(object):
    def __init__(self):
        print 'INIT of test class'

if __name__ == '__main__':
    obj=test()
The problem is not the inheritance but the import. In your case you execute obj=test() when importing:
from test import test
When you import test, its __name__ is test.
But when you run it from the command line as the main program with python test.py, its __name__ is __main__. So, in the import case, you skip obj=test() if you use:
if __name__ == '__main__':
    obj=test()
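A minimal sketch of the point made in these answers, written for Python 3. It does not perform a real import: it runs the same source text in two fresh module namespaces with exec, once named test and once named __main__, which is roughly what import and direct execution do. SOURCE, execute_as and the created list are hypothetical names used only for this illustration.

```python
import types

# Hypothetical stand-in for the contents of test.py.
SOURCE = '''
created = []

class Test(object):
    def __init__(self):
        created.append("INIT of Test class")

if __name__ == "__main__":
    obj = Test()
'''

def execute_as(module_name):
    """Run SOURCE top to bottom in a fresh module namespace."""
    module = types.ModuleType(module_name)  # sets module.__name__
    exec(SOURCE, module.__dict__)
    return module

imported = execute_as("test")        # like "import test": guard is skipped
as_script = execute_as("__main__")   # like "python test.py": guard runs
print(imported.created)
print(as_script.created)
```

Only the second namespace creates an instance, because only there does __name__ equal '__main__'.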

Wrote script in OSX, with multiprocessing. Now windows won't play ball

The program/script I've made works on OSX and Linux. It uses selenium to scrape data from some pages, manipulates the data and saves it. To be more efficient, I included the multiprocessing Pool and Manager. I create a pool, and for each item in a list it calls the scrape class, starts a phantomjs instance and scrapes. Since I'm using multiprocessing.Pool and I want a way to pass data between the processes, I read that multiprocessing.Manager was the way forward. If I wrote
manager = Manager()
info = manager.dict([])
it would create a dict that could be accessed by all worker processes. It all worked perfectly.
My issue is that the client wants to run this on a Windows machine (I wrote the entire thing on OSX). I assumed it would be as simple as installing Python and selenium and launching it. I had errors which later led me to writing if __name__ == '__main__': at the top of my main.py file and indenting everything to be inside it. The issue is, when I have class scrape(): outside of the if statement, it cannot see the global info, since it is declared outside of its scope. If I put class scrape(): inside the if __name__ == '__main__': block, then I get an attribute error saying
AttributeError: 'module' object has no attribute 'scrape'
And if I go back to declaring manager = Manager() and info = manager.dict([]) outside of the if __name__ == '__main__': block, then I get the error on Windows about making sure I use if __name__ == '__main__':. It doesn't seem like I can win with this project at the moment.
Code Layout...
Imports...
from multiprocessing import Pool
from multiprocessing import Manager

manager = Manager()
info = manager.dict([])
date = str(datetime.date.today())

class do_scrape():
    def __init__():
    def...

def scrape_items():#This contains code which creates a pool and then pool.map(do_scrape, s) s = a list of items
def save_scrape():
def update_price():
def main():

main()
Basically, the scrape_items is called by main, then scrape_items uses pool.map(do_scrape, s) so it calls the do_scrape class and passes the list of items to it one by one. The do_scrape then scrapes a web page based on the item url in "s" then saves that info in the global info which is the multiprocessing.manager dict. The above code does not show any if __name__ == '__main__': statements, it is an outline of how it works on my OSX setup. It runs and completes the task as is. If someone could issue a few pointers, I would appreciate it. Thanks
It would be helpful to see your code, but it sounds like you just need to explicitly pass your shared dict to scrape, like this:
import multiprocessing
from functools import partial

def scrape(info, item):
    # Use info in here
    pass

if __name__ == "__main__":
    manager = multiprocessing.Manager()
    info = manager.dict()
    pool = multiprocessing.Pool()
    func = partial(scrape, info) # use a partial to make it easy to pass the dict to pool.map
    items = [1,2,3,4,5] # This would be your actual data
    results = pool.map(func, items)
    #pool.apply_async(scrape, [shared_dict, "abc"]) # In case you're not using map...
Note that you shouldn't put all your code inside the if __name__ == "__main__": guard, just the code that actually creates processes via multiprocessing. This includes creating the Manager and the Pool.
Any method you want to run in a child process must be declared at the top level of the module, because it has to be importable from __main__ in the child process. When you declared scrape inside the if __name__ ... guard, it could no longer be imported from the __main__ module, so you saw the AttributeError: 'module' object has no attribute 'scrape' error.
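To make the role of partial concrete, here is a minimal sketch (Python 3) that runs without starting any processes. It assumes a plain dict in place of manager.dict() and the built-in map in place of pool.map; the body of scrape is invented purely for illustration.

```python
from functools import partial

def scrape(info, item):
    """Hypothetical worker: record a fake result for one item in the shared mapping."""
    info[item] = "scraped-%s" % item
    return item * 2

info = {}                      # stands in for manager.dict()
func = partial(scrape, info)   # func(item) now means scrape(info, item)
results = list(map(func, [1, 2, 3]))   # stands in for pool.map(func, items)
print(results)
print(info)
```

Because scrape is defined at the top level of the module and receives the shared mapping as an argument, it does not depend on any global that exists only in the parent process.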
Edit:
Taking your example:
import datetime
import multiprocessing
from functools import partial

date = str(datetime.date.today())

#class do_scrape():
#    def __init__():
#    def...

def do_scrape(info, s):
    # do stuff
    # Also note that do_scrape should probably be a function, not a class
    pass

def scrape_items():
    # scrape_items is called by main(), which is protected by a `if __name__ ...` guard
    # so this is ok.
    manager = multiprocessing.Manager()
    info = manager.dict([])
    pool = multiprocessing.Pool()
    func = partial(do_scrape, info)
    s = [1,2,3,4,5] # Substitute with the real s
    results = pool.map(func, s)

def save_scrape():
    pass

def update_price():
    pass

def main():
    scrape_items()

if __name__ == "__main__":
    # Note that you can declare manager and info here, instead of in scrape_items, if you wanted
    #manager = multiprocessing.Manager()
    #info = manager.dict([])
    main()
One other important note here is that the first argument to map should be a callable that processes a single item, and a plain function is usually a better fit for that than a class. As the docs state, Pool.map is meant to be the parallel equivalent of the built-in map.
Find the starting point of your program, and make sure you wrap only that with your if statement. For example:
Imports...
from multiprocessing import Pool
from multiprocessing import Manager

manager = Manager()
info = manager.dict([])
date = str(datetime.date.today())

class do_scrape():
    def __init__():
    def...

def scrape_items():#This contains code which creates a pool and then pool.map(do_scrape, s) s = a list of items
def save_scrape():
def update_price():
def main():

if __name__ == "__main__":
    main()
Essentially the contents of the if are only executed if you called this file directly when running your python code. If this file/module is included as an import from another file, all attributes will be defined, so you can access various attributes without actually beginning execution of the module.
Read more here:
What does if __name__ == "__main__": do?

Access function from within scripts and from commandline

I want to do the following:
I have a class which should provide several functions, which need different inputs. And I would like to use these functions from within other scripts, or solely from commandline.
e.g. I have the class "test". It has a function "quicktest" (which basically just prints something). From the commandline I want to be able to
$ python test.py quicktest "foo" "bar"
Whereas quicktest is the name of the function, and "foo" and "bar" are the variables.
Also (from within another script) I want to
from test import test
# this
t = test()
t.quicktest(["foo1", "bar1"])
# or this
test().quicktest(["foo2", "bar2"])
I just can't bring that to work. I managed to write a class for the first request and one for the second, but not for both of them. The problem is that I sometimes have to call the functions via (self), sometimes not, and also I have to provide the given parameters at any time, which is also kinda complicated.
So, does anybody have an idea for that?
This is what I already have:
Works only from commandline:
class test:
    def quicktest(params):
        pprint(params)
    if (__name__ == '__main__'):
        if (sys.argv[1] == "quicktest"):
            quicktest(sys.argv)
        else:
            print "Wrong call."
Works only from within other scripts:
class test:
    _params = sys.argv
    def quicktest(self, params):
        pprint(params)
        pprint(self._params)
    if (__name__ == '__main__'):
        if (sys.argv[1] == "quicktest"):
            quicktest()
        else:
            print "Wrong call"
Try the following (note the different indentation: the if __name__ part is no longer part of class test):
import sys
from pprint import pprint

class test:
    def quicktest(self, params):
        pprint(params)

if __name__ == '__main__':
    if sys.argv[1] == "quicktest":
        testObj = test()
        testObj.quicktest(sys.argv)
    else:
        print "Wrong call."
from other scripts:
from test import test
testObj = test()
testObj.quicktest(...)
The if __name__ == '__main__': block needs to be at the top level:
import sys

class Test(object): # Python class names are capitalized and should inherit from object
    def __init__(self, *args):
        # parse args here so you can import and call with options too
        self.args = args

    def quicktest(self):
        return 'ret_value'

if __name__ == '__main__':
    test = Test(*sys.argv[1:])
You can use argparse to parse the values from the command line, and then map the parsed arguments to the methods of your class.
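One way the argparse suggestion could look, as a sketch for Python 3 rather than a definitive design. The helpers build_parser and run are hypothetical names, and quicktest returns a string instead of printing so the result is easy to inspect. The same class is usable from another script (Test().quicktest([...])) and from the command line.

```python
import argparse

class Test(object):
    def quicktest(self, params):
        return "quicktest got %s" % ", ".join(params)

def build_parser():
    """Describe the command line: a method name followed by its parameters."""
    parser = argparse.ArgumentParser(description="Call a Test method by name.")
    parser.add_argument("method", choices=["quicktest"], help="method to call")
    parser.add_argument("params", nargs="*", help="values passed to the method")
    return parser

def run(argv):
    """Parse argv (without the program name) and dispatch to the chosen method."""
    args = build_parser().parse_args(argv)
    method = getattr(Test(), args.method)
    return method(args.params)

# Equivalent to: python test.py quicktest foo bar
print(run(["quicktest", "foo", "bar"]))
```

In a real script the last line would sit inside an if __name__ == '__main__': block and pass sys.argv[1:] to run.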

How does name resolution work when classes are derived across modules?

Classes B and C both derive from base class A, and neither overrides A's method test(). B is defined in the same module as A; C is defined in a separate module. How is it that calling B.test() prints "hello", but calling C.test() fails? Shouldn't either invocation end up executing A.test() and therefore be able to resolve the symbol "message" in mod1's namespace?
I'd also gratefully receive hints on where this behaviour is documented as I've been unable to turn up anything. How are names resolved when C.test() is called, and can "message" be injected into one of the namespaces somehow?
FWIW, the reason I haven't used an instance variable (e.g. set A.message = "hello") is because I'm wanting to access a "global" singleton object and don't want to have an explicit referent to it in every other object.
mod1.py:
import mod2

class A(object):
    def test(self):
        print message

class B(A):
    pass

if __name__ == "__main__":
    message = "hello"
    A().test()
    B().test()
    mod2.C().test()
mod2.py:
import mod1

class C(mod1.A):
    pass
output is:
$ python mod1.py
hello
hello
Traceback (most recent call last):
  File "mod1.py", line 14, in <module>
    mod2.C().test()
  File "mod1.py", line 5, in test
    print message
NameError: global name 'message' is not defined
Many thanks!
EOL is correct, moving the "main" part of the program into a new file mod3.py does indeed make things work.
http://bytebaker.com/2008/07/30/python-namespaces/ further clarifies the issue.
In my original question, it turns out that the variable message is stored in the __main__ module namespace because mod1.py is being run as a script. mod2 imports mod1, but it gets a separate mod1 namespace, where the variable message does not exist. The following code snippet demonstrates this more clearly, as it writes message into mod1's namespace (not that I'd recommend this be done in real life), causing the expected behaviour.
import sys

class A(object):
    def test(self):
        print message

class B(A):
    pass

if __name__ == "__main__":
    import mod2
    message = "hello"
    sys.modules["mod1"].message = message
    A().test()
    B().test()
    mod2.C().test()
I think the best real-world fix is to move the "main" part of the program into a separate module, as EOL implies, or do:
class A(object):
    def test(self):
        print message

class B(A):
    pass

def main():
    global message
    message = "hello"
    A().test()
    B().test()
    # resolve circular import by importing in local scope
    import mod2
    mod2.C().test()

if __name__ == "__main__":
    # break into mod1 namespace from __main__ namespace
    import mod1
    mod1.main()
Could you use a class attribute instead of a global? The following works
import mod2

class A(object):
    message = "Hello" # Class attribute (not duplicated in instances)
    def test(self):
        print self.message # Class A attribute can be overridden by subclasses

class B(A):
    pass

if __name__ == "__main__":
    A().test()
    B().test()
    mod2.C().test()
Not using globals is cleaner: in the code above, message is explicitly attached to the class it is used in.
That said, I am also very curious as to why the global message is not found by mod2.C().test().
Things work as expected, though, if the cross-importing is removed (no main program in mod1.py, and no import mod2): importing mod1 and mod2 from mod3.py, doing mod1.message = "Hello" there and mod2.C().test() works. I am therefore wondering if the problem is not related to cross-importing…
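The cross-importing puzzle comes down to the same file being loaded twice: once as __main__ (the script) and once under its own module name (the import), each with its own globals. The sketch below (Python 3) reproduces that with a temporary file; mod1_demo and lookup are made-up names, and runpy.run_path stands in for running the file from the command line.

```python
import importlib
import os
import runpy
import sys
import tempfile

# Hypothetical stand-in for mod1.py; "mod1_demo" is a made-up module name.
MOD1_SOURCE = '''
def lookup():
    return message  # looked up in the globals of the module that defined lookup

if __name__ == "__main__":
    message = "hello"
'''

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "mod1_demo.py")
    with open(path, "w") as handle:
        handle.write(MOD1_SOURCE)

    # Copy 1: executed as a script, so the guarded assignment runs.
    script_globals = runpy.run_path(path, run_name="__main__")

    # Copy 2: imported by name, which is what "import mod1" inside mod2 does.
    sys.path.insert(0, tmp)
    try:
        imported = importlib.import_module("mod1_demo")
    finally:
        sys.path.remove(tmp)

script_has_message = "message" in script_globals
import_has_message = hasattr(imported, "message")

try:
    imported.lookup()
    lookup_error = None
except NameError as exc:
    lookup_error = exc

print(script_has_message, import_has_message, type(lookup_error).__name__)
sys.modules.pop("mod1_demo", None)  # tidy up the demo module
```

The imported copy never ran the guarded block, so a function defined in it cannot find message, just as C().test() cannot when C derives from the imported mod1.A.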

How to test or mock "if __name__ == '__main__'" contents

Say I have a module with the following:
def main():
    pass

if __name__ == "__main__":
    main()
I want to write a unit test for the bottom half (I'd like to achieve 100% coverage). I discovered the runpy standard-library module that performs the import/__name__-setting mechanism, but I can't figure out how to mock or otherwise check that the main() function is called.
This is what I've tried so far:
import runpy
import mock

@mock.patch('foobar.main')
def test_main(self, main):
    runpy.run_module('foobar', run_name='__main__')
    main.assert_called_once_with()
I would choose another alternative, which is to exclude the if __name__ == '__main__' block from the coverage report. Of course, you can do that only if you already have a test case for your main() function in your tests.
As for why I choose to exclude it rather than write a new test case for the whole script: if, as I said, you already have a test case for your main() function, then adding another test case for the script (just to reach 100% coverage) would only duplicate it.
To exclude the if __name__ == '__main__' block, you can write a coverage configuration file and add this to the report section:
[report]
exclude_lines =
    if __name__ == .__main__.:
More info about the coverage configuration file can be found here.
Hope this can help.
You can do this using the imp module rather than the import statement. The problem with the import statement is that the test for '__main__' runs as part of the import statement before you get a chance to assign to runpy.__name__.
For example, you could use imp.load_source() like so:
import imp
runpy = imp.load_source('__main__', '/path/to/runpy.py')
The first parameter is assigned to __name__ of the imported module.
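The imp module was deprecated in Python 3.4 and removed in Python 3.12, so on current interpreters the same idea has to go through importlib. The sketch below is one possible equivalent, not the answerer's code: load_source_as is a hypothetical helper, and the script under test is a made-up file written to a temporary directory.

```python
import importlib.util
import os
import tempfile

def load_source_as(name, path):
    """Execute the file at path as a module whose __name__ is name."""
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module

# Hypothetical script under test.
SCRIPT = '''
calls = []

def main():
    calls.append("main ran")

if __name__ == "__main__":
    main()
'''

with tempfile.TemporaryDirectory() as tmp:
    script_path = os.path.join(tmp, "script_demo.py")
    with open(script_path, "w") as handle:
        handle.write(SCRIPT)
    run_as_main = load_source_as("__main__", script_path)
    run_as_import = load_source_as("script_demo", script_path)

print(run_as_main.calls)
print(run_as_import.calls)
```

Loading under the name __main__ triggers the guarded block; loading under any other name does not.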
Whoa, I'm a little late to the party, but I recently ran into this issue and I think I came up with a better solution, so here it is...
I was working on a module that contained a dozen or so scripts all ending with this exact copypasta:
if __name__ == '__main__':
    if '--help' in sys.argv or '-h' in sys.argv:
        print(__doc__)
    else:
        sys.exit(main())
Not horrible, sure, but not testable either. My solution was to write a new function in one of my modules:
def run_script(name, doc, main):
    """Act like a script if we were invoked like a script."""
    if name == '__main__':
        if '--help' in sys.argv or '-h' in sys.argv:
            sys.stdout.write(doc)
        else:
            sys.exit(main())
and then place this gem at the end of each script file:
run_script(__name__, __doc__, main)
Technically, this function will be run unconditionally, whether your script was imported as a module or run as a script. This is ok, however, because the function doesn't actually do anything unless the script is being run as a script. So code coverage sees the function run and says "yes, 100% code coverage!" Meanwhile, I wrote three tests to cover the function itself:
@patch('mymodule.utils.sys')
def test_run_script_as_import(self, sysMock):
    """The run_script() func is a NOP when name != __main__."""
    mainMock = Mock()
    sysMock.argv = []
    run_script('some_module', 'docdocdoc', mainMock)
    self.assertEqual(mainMock.mock_calls, [])
    self.assertEqual(sysMock.exit.mock_calls, [])
    self.assertEqual(sysMock.stdout.write.mock_calls, [])

@patch('mymodule.utils.sys')
def test_run_script_as_script(self, sysMock):
    """Invoke main() when run as a script."""
    mainMock = Mock()
    sysMock.argv = []
    run_script('__main__', 'docdocdoc', mainMock)
    mainMock.assert_called_once_with()
    sysMock.exit.assert_called_once_with(mainMock())
    self.assertEqual(sysMock.stdout.write.mock_calls, [])

@patch('mymodule.utils.sys')
def test_run_script_with_help(self, sysMock):
    """Print help when the user asks for help."""
    mainMock = Mock()
    for h in ('-h', '--help'):
        sysMock.argv = [h]
        run_script('__main__', h*5, mainMock)
        self.assertEqual(mainMock.mock_calls, [])
        self.assertEqual(sysMock.exit.mock_calls, [])
        sysMock.stdout.write.assert_called_with(h*5)
Blam! Now you can write a testable main(), invoke it as a script, have 100% test coverage, and not need to ignore any code in your coverage report.
Python 3 solution:
import os
from importlib.machinery import SourceFileLoader
from importlib.util import spec_from_loader, module_from_spec
from importlib import reload
from unittest import TestCase
from unittest.mock import MagicMock, patch

class TestIfNameEqMain(TestCase):
    def test_name_eq_main(self):
        loader = SourceFileLoader('__main__',
                                  os.path.join(os.path.dirname(os.path.dirname(__file__)),
                                               '__main__.py'))
        with self.assertRaises(SystemExit) as e:
            loader.exec_module(module_from_spec(spec_from_loader(loader.name, loader)))
Using the alternative solution of defining your own little function:
# module.py
def main():
    if __name__ == '__main__':
        return 'sweet'
    return 'child of mine'
You can test with:
# Override the `__name__` value in your module to '__main__'
with patch('module_name.__name__', '__main__'):
    import module_name
    self.assertEqual(module_name.main(), 'sweet')
with patch('module_name.__name__', 'anything else'):
    reload(module_name)
    del module_name
    import module_name
    self.assertEqual(module_name.main(), 'child of mine')
I did not want to exclude the lines in question, so based on this explanation of a solution, I implemented a simplified version of the alternate answer given here...
I wrapped if __name__ == "__main__": in a function to make it easily testable, and then called that function to retain logic:
# myapp.module.py
def main():
    pass

def init():
    if __name__ == "__main__":
        main()

init()
I mocked the __name__ using unittest.mock to get at the lines in question:
from unittest.mock import patch, MagicMock
from myapp import module

def test_name_equals_main():
    # Arrange
    with patch.object(module, "main", MagicMock()) as mock_main:
        with patch.object(module, "__name__", "__main__"):
            # Act
            module.init()
    # Assert
    mock_main.assert_called_once()
If you are sending arguments into the mocked function, like so,
if __name__ == "__main__":
    main(main_args)
then you can use assert_called_once_with() for an even better test:
expected_args = ["expected_arg_1", "expected_arg_2"]
mock_main.assert_called_once_with(expected_args)
If desired, you can also add a return_value to the MagicMock() like so:
with patch.object(module, "main", MagicMock(return_value='foo')) as mock_main:
One approach is to run the modules as scripts (e.g. os.system(...)) and compare their stdout and stderr output to expected values.
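A minimal sketch of that approach for Python 3. It swaps os.system for subprocess.run so that the output can be captured and compared; the script being run is a made-up file written to a temporary directory, and run_script is a hypothetical helper name.

```python
import os
import subprocess
import sys
import tempfile

# Hypothetical script under test.
SCRIPT = '''
def main():
    print("hello from main")

if __name__ == "__main__":
    main()
'''

def run_script(path, *args):
    """Run path with the current interpreter and capture what it prints."""
    return subprocess.run(
        [sys.executable, path] + list(args),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )

with tempfile.TemporaryDirectory() as tmp:
    script_path = os.path.join(tmp, "script_demo.py")
    with open(script_path, "w") as handle:
        handle.write(SCRIPT)
    completed = run_script(script_path)

print(completed.returncode, completed.stdout.strip())
```

A test would then assert on completed.returncode and completed.stdout, and on completed.stderr where the expected error text is known.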
I found this solution helpful. It works well if you use a function to hold all your script code.
The code will be treated as a single line, so the coverage counter does not care whether the whole line was actually executed (though this is not what you would actually expect from 100% coverage).
The trick is also accepted by pylint. ;-)
if __name__ == '__main__': \
    main()
If it's just to get the 100% and there is nothing "real" to test there, it is easier to ignore that line.
If you are using the regular coverage lib, you can just add a simple comment, and the line will be ignored in the coverage report.
if __name__ == '__main__':
    main() # pragma: no cover
https://coverage.readthedocs.io/en/coverage-4.3.3/excluding.html
Another comment by @Taylor Edmiston also mentions it
My solution is to use imp.load_source() and force an exception to be raised early in main() by not providing a required CLI argument, providing a malformed argument, setting paths in such a way that a required file is not found, etc.
import imp
import os
import sys

def mainCond(testObj, srcFilePath, expectedExcType=SystemExit, cliArgsStr=''):
    sys.argv = [os.path.basename(srcFilePath)] + (
        [] if len(cliArgsStr) == 0 else cliArgsStr.split(' '))
    testObj.assertRaises(expectedExcType, imp.load_source, '__main__', srcFilePath)
Then in your test class you can use this function like this:
def testMain(self):
    mainCond(self, 'path/to/main.py', cliArgsStr='-d FailingArg')
To run your "main" code under pytest, you can load the main module with the standard-library importlib package:
def test_main():
    import importlib.machinery
    loader = importlib.machinery.SourceFileLoader("__main__", "src/glue_jobs/move_data_with_resource_partitionning.py")
    # load_module() executes the file with __name__ set to "__main__", so the guarded block runs here
    runpy_main = loader.load_module()
    assert runpy_main is not None
