The script I've written works on OS X and Linux. It uses Selenium to scrape data from some pages, manipulates the data and saves it. To be more efficient, I use a multiprocessing Pool and Manager: I create a pool and, for each item in a list, it calls the scrape class, which starts a PhantomJS instance and scrapes. Since I'm using multiprocessing.Pool and wanted a way to pass data between the worker processes, I read that multiprocessing.Manager was the way forward. If I wrote
manager = Manager()
info = manager.dict([])
it would create a dict that could be accessed by all processes. It all worked perfectly.
My issue is that the client wants to run this on a Windows machine (I wrote the entire thing on OS X). I assumed it would be as simple as installing Python and Selenium and launching it. I got errors, which later led me to write if __name__ == '__main__': at the top of my main.py file and indent everything to be inside it. The issue is that when I have class scrape(): outside of the if statement, it cannot see the global info, since info is now defined only inside the if block. If I put class scrape(): inside the if __name__ == '__main__': block, then I get an attribute error saying
AttributeError: 'module' object has no attribute 'scrape'
And if I go back to declaring manager = Manager() and info = manager.dict([]) outside of the if __name__ == '__main__': block, then I get the Windows error telling me to make sure I use if __name__ == '__main__':. It doesn't seem like I can win with this project at the moment.
Code Layout...
Imports...
from multiprocessing import Pool
from multiprocessing import Manager

manager = Manager()
info = manager.dict([])
date = str(datetime.date.today())

class do_scrape():
    def __init__():
    def...

def scrape_items():  # This contains code which creates a pool and then pool.map(do_scrape, s); s = a list of items
def save_scrape():
def update_price():
def main():

main()
Basically, the scrape_items is called by main, then scrape_items uses pool.map(do_scrape, s) so it calls the do_scrape class and passes the list of items to it one by one. The do_scrape then scrapes a web page based on the item url in "s" then saves that info in the global info which is the multiprocessing.manager dict. The above code does not show any if __name__ == '__main__': statements, it is an outline of how it works on my OSX setup. It runs and completes the task as is. If someone could issue a few pointers, I would appreciate it. Thanks
It would be helpful to see your code, but it sounds like you just need to explicitly pass your shared dict to scrape, like this:
import multiprocessing
from functools import partial

def scrape(info, item):
    # Use info in here
    pass

if __name__ == "__main__":
    manager = multiprocessing.Manager()
    info = manager.dict()
    pool = multiprocessing.Pool()
    func = partial(scrape, info)  # use a partial to make it easy to pass the dict to pool.map
    items = [1, 2, 3, 4, 5]  # This would be your actual data
    results = pool.map(func, items)
    #pool.apply_async(scrape, [info, "abc"])  # In case you're not using map...
Note that you shouldn't put all your code inside the if __name__ == "__main__": guard, just the code that actually creates processes via multiprocessing; this includes creating the Manager and the Pool.
Any method you want to run in a child process must be declared at the top level of the module, because it has to be importable from __main__ in the child process. When you declared scrape inside the if __name__ ... guard, it could no longer be imported from the __main__ module, so you saw the AttributeError: 'module' object has no attribute 'scrape' error.
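As a rough illustration of why the worker has to live at the top level of the module: pickle stores a top-level function as a reference (module name plus function name) rather than as code, and a partial is stored as that reference plus its bound arguments. The sketch below uses a stand-in scrape function and a plain dict in place of the manager dict; both are assumptions made only for the demonstration.

```python
import pickle
from functools import partial


def scrape(info, item):
    """Stand-in worker; a real one would fetch and parse a page."""
    info[item] = item * 2
    return item


# Pickling a top-level function records where to find it, not its code.
payload = pickle.dumps(scrape)
print(b"scrape" in payload)

# Unpickling looks the name up again in the defining module.
restored = pickle.loads(payload)
print(restored is scrape)

# A partial pickles as "reference to scrape" plus the bound arguments.
# A plain dict stands in for the manager dict here (assumption).
func = partial(scrape, {})
clone = pickle.loads(pickle.dumps(func))
print(clone.func is scrape)
```

If scrape were defined inside the if __name__ == "__main__": block, that name lookup would fail in a freshly started child process, which is the AttributeError described above.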
Edit:
Taking your example:
import datetime
import multiprocessing
from functools import partial

date = str(datetime.date.today())

#class do_scrape():
#    def __init__():
#    def...

def do_scrape(info, s):
    # do stuff
    # Also note that do_scrape should probably be a function, not a class
    pass

def scrape_items():
    # scrape_items is called by main(), which is protected by an `if __name__ ...` guard,
    # so this is ok.
    manager = multiprocessing.Manager()
    info = manager.dict([])
    pool = multiprocessing.Pool()
    func = partial(do_scrape, info)
    s = [1, 2, 3, 4, 5]  # Substitute with the real s
    results = pool.map(func, s)

def save_scrape():
    pass

def update_price():
    pass

def main():
    scrape_items()

if __name__ == "__main__":
    # Note that you can declare manager and info here, instead of in scrape_items, if you wanted
    #manager = multiprocessing.Manager()
    #info = manager.dict([])
    main()
One other important note here is that the first argument to map should be a function, not a class. This is stated in the docs (Pool.map is meant to be a parallel equivalent of the built-in map).
Find the starting point of your program, and make sure you wrap only that with your if statement. For example:
Imports...
from multiprocessing import Pool
from multiprocessing import Manager

manager = Manager()
info = manager.dict([])
date = str(datetime.date.today())

class do_scrape():
    def __init__():
    def...

def scrape_items():  # This contains code which creates a pool and then pool.map(do_scrape, s); s = a list of items
def save_scrape():
def update_price():
def main():

if __name__ == "__main__":
    main()
Essentially, the contents of the if block are only executed if you run this file directly. If the file is imported as a module from another file, all of its attributes are still defined, so you can access them without actually beginning execution of the module.
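A small, self-contained way to see this difference is to load the same source once as a script and once as an import. The sketch below writes a throwaway module to a temporary directory; the module name demo_module and the events list are made up for the illustration.

```python
import importlib.util
import runpy
import tempfile
from pathlib import Path

# Source of a hypothetical module with a guarded block.
SOURCE = '''
events = ["defined at top level"]
if __name__ == "__main__":
    events.append("ran guarded block")
'''


def load_both_ways():
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "demo_module.py"
        path.write_text(SOURCE)

        # 1) Execute the file the way "python demo_module.py" would.
        as_script = runpy.run_path(str(path), run_name="__main__")["events"]

        # 2) Execute the file the way "import demo_module" would.
        spec = importlib.util.spec_from_file_location("demo_module", path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        as_import = module.events

    return as_script, as_import


as_script, as_import = load_both_ways()
print(as_script)   # includes the guarded block's entry
print(as_import)   # only the top-level entry
```

Only the script-style run executes the guarded block; the import still defines events, which is what lets a child process import the module safely.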
Read more here:
What does if __name__ == "__main__": do?
Related
In a custom class I have the following code:
class CustomClass():
    triggerQueue: multiprocessing.Queue

    def __init__(self):
        self.triggerQueue = multiprocessing.Queue()

    def poolFunc(queueString):
        print(queueString)

    def listenerFunc(self):
        pool = multiprocessing.Pool(5)
        while True:
            try:
                queueString = self.triggerQueue.get_nowait()
                pool.apply_async(func=self.poolFunc, args=(queueString,))
            except queue.Empty:
                break
What I intend to do is:
add a trigger to the queue (not implemented in this snippet) -> works as intended
run an endless loop within the listenerFunc that reads all triggers from the queue (if any are found) -> works as intended
pass trigger to poolFunc which is to be executed asynchronously -> not working
It works as soon as I move poolFunc() outside of the class, like
def poolFunc(queueString):
print(queueString)
class CustomClass():
[...]
But why is that so? Do I have to pass the self argument somehow? Is it impossible to perform it this way in general?
Thank you for any hint!
There are several problems going on here.
Your instance method, poolFunc, is missing a self parameter.
You are never properly terminating the Pool. You should take advantage of the fact that a multiprocessing.Pool object is a context manager.
You're calling apply_async, but you're never waiting for the results. Read the documentation: you need to call the get method on the AsyncResult object to receive the result; if you don't do this before your program exits your poolFunc function may never run.
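By making the Queue object part of your class, you won't be able to pass instance methods to workers: sending a bound method to a worker means pickling the instance, including its Queue attribute, and a multiprocessing.Queue refuses to be pickled that way.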
We can fix all of the above like this:
import multiprocessing
import queue

triggerQueue = multiprocessing.Queue()

class CustomClass:
    def poolFunc(self, queueString):
        print(queueString)

    def listenerFunc(self):
        results = []
        with multiprocessing.Pool(5) as pool:
            while True:
                try:
                    queueString = triggerQueue.get_nowait()
                    results.append(pool.apply_async(self.poolFunc, (queueString,)))
                except queue.Empty:
                    break
            # Wait for every task before the with block terminates the pool
            for res in results:
                print(res.get())

if __name__ == "__main__":
    c = CustomClass()
    for i in range(10):
        triggerQueue.put(f"testval{i}")
    c.listenerFunc()
You can, as you mention, also replace your instance method with a static method, in which case we can keep triggerQueue as part of the class:
import multiprocessing
import queue

class CustomClass:
    def __init__(self):
        self.triggerQueue = multiprocessing.Queue()

    @staticmethod
    def poolFunc(queueString):
        print(queueString)

    def listenerFunc(self):
        results = []
        with multiprocessing.Pool(5) as pool:
            while True:
                try:
                    queueString = self.triggerQueue.get_nowait()
                    results.append(pool.apply_async(self.poolFunc, (queueString,)))
                except queue.Empty:
                    break
            # Wait for every task before the with block terminates the pool
            for r in results:
                print(r.get())

if __name__ == "__main__":
    c = CustomClass()
    for i in range(10):
        c.triggerQueue.put(f"testval{i}")
    c.listenerFunc()
But we still need to collect the apply_async results.
Okay, I found an answer and a workaround:
The answer is based on noxdafox's answer to this question.
Instance methods cannot be serialized that easily. When pickle serializes a function, it essentially stores only a reference to it (its module and qualified name), not its code.
For a bound method, the instance it belongs to has to be pickled as well, and because each process has its own address space, the child process cannot simply look up the object your instance method refers to.
A functioning workaround is to declare the poolFunc() as static function like
@staticmethod
def poolFunc(queueString):
    print(queueString)
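The difference between the three variants can be checked without starting a pool at all, by asking pickle directly. This is only a sketch: the class names WithQueue, WithoutQueue and WithStatic and the can_pickle helper are invented for the comparison, and it assumes the classes are defined at module level.

```python
import multiprocessing
import pickle


class WithQueue:
    """Instance method on a class that stores a Queue on the instance."""
    def __init__(self):
        self.triggerQueue = multiprocessing.Queue()

    def poolFunc(self, queueString):
        return queueString


class WithoutQueue:
    """Same instance method, but no Queue attribute."""
    def poolFunc(self, queueString):
        return queueString


class WithStatic:
    """Queue on the instance, but the worker is a static method."""
    def __init__(self):
        self.triggerQueue = multiprocessing.Queue()

    @staticmethod
    def poolFunc(queueString):
        return queueString


def can_pickle(obj):
    """Return True if obj survives pickle.dumps, else report why not."""
    try:
        pickle.dumps(obj)
        return True
    except Exception as exc:
        print(f"cannot pickle: {exc!r}")
        return False


print(can_pickle(WithoutQueue().poolFunc))  # bound method, plain instance
print(can_pickle(WithQueue().poolFunc))     # bound method drags the Queue along
print(can_pickle(WithStatic().poolFunc))    # plain function, instance not pickled
```

A bound method is pickled together with its instance, so the Queue attribute is what gets in the way; the static method is just a function looked up by name, so the instance never has to be sent.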
I need to run the same function 10 times. Because it depends on data linked to a login, it needs to be defined inside another function:
from multiprocessing import Pool

def main():
    def inside(a):
        print(a)
    Pool.map(inside, 'Ok' * 10)

if __name__ == '__main__':
    main()

from multiprocessing import Pool

def main():
    def inside(a):
        print(a)
    Pool.map(main.inside, 'Ok' * 10)

if __name__ == '__main__':
    main()
In both attempts the result is this:
AttributeError: 'function' object has no attribute 'map'
How can I do this by keeping the function inside the other function?
Is there a way to do this?
AttributeError: 'function' object has no attribute 'map'
We need to instantiate Pool from multiprocessing and call map method of that pool object.
You have to move the inside function into a class (or to module level), because Pool uses pickle to serialize and deserialize the functions it sends to workers, and a function nested inside another function cannot be imported by pickle.
Pool needs to pickle (serialize) everything it sends to its worker-processes (IPC). Pickling actually only saves the name of a function and unpickling requires re-importing the function by name. For that to work, the function needs to be defined at the top-level, nested functions won't be importable by the child and already trying to pickle them raises an exception.
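The quoted behaviour is easy to reproduce with pickle alone. In this sketch, try_pickle is a hypothetical helper, and both exception types are caught because the exact type raised for a local function has differed between Python versions.

```python
import pickle


def inside(a):
    """Top-level function: picklable by name."""
    return a


def main():
    def nested(a):
        """Nested function: has no importable name."""
        return a
    return nested


def try_pickle(func):
    """Report whether pickle accepts the given function."""
    try:
        pickle.dumps(func)
        return "ok"
    except (pickle.PicklingError, AttributeError) as exc:
        return f"failed: {exc}"


print(try_pickle(inside))
print(try_pickle(main()))
```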
from multiprocessing import Pool

class Wrap:
    def inside(self, a):
        print(a)

def main():
    pool = Pool()
    pool.map(Wrap().inside, 'Ok' * 10)

if __name__ == '__main__':
    main()
If you don't want to wrap the inside method in a class, move it to global scope so it can be pickled:
from multiprocessing import Pool

def inside(a):
    print(a)

def main():
    with Pool() as pool:
        pool.map(inside, 'Ok' * 10)

if __name__ == '__main__':
    main()
To start with, here is some code that works:
from multiprocessing import Pool, Manager
import random

manager = Manager()
dct = manager.dict()

def do_thing(n):
    for i in range(10_000_000):
        i += 1
    dct[n] = random.randint(0, 9)

with Pool(2) as pool:
    pool.map(do_thing, range(10))
Now if I try to make a class out of this:
from multiprocessing import Pool, Manager
import random

class SomeClass:
    def __init__(self):
        self.manager = Manager()
        self.dct = self.manager.dict()

    def __call__(self):
        with Pool(2) as pool:
            pool.map(self.do_thing, range(10))

    def do_thing(self, n):
        for i in range(10_000_000):
            i += 1
        self.dct[n] = random.randint(0, 9)

if __name__ == '__main__':
    inst = SomeClass()
    inst()
I run into: TypeError: Pickling an AuthenticationString object is disallowed for security reasons. Now from here, I get the hint that Python is trying to pickle the Manager which as I understand has its own dedicated process, and processes can't be pickled because they contain an AuthenticationString.
I don't know enough about how forking works (I'm on Linux, so I understand this is the default method for starting new processes) to understand exactly why the Manager instance needs to be pickled.
So here are my questions:
Why is this happening?
How can I use a Manager when doing multiprocessing within a class? PS: I want to be able to import SomeClass from this module.
Is what I'm asking for unreasonable or unconventional?
PS: I know I can do this exact snippet without the Manager by exploiting the fact that pool.map returns results in order, so something like this: res = pool.map(self.do_thing, range(10)) then dct = {k: v for k, v in zip(range(10), res)}. But that's beside the point of the question.
To answer your questions:
Q1 - Why is this happening?
Each worker process created by the Pool needs to execute the instance method self.do_thing(). To make that possible, Python pickles the instance and passes it to the worker process, which unpickles it. This happens even with the fork start method, because tasks are sent to the already-running workers through a pipe. Part of the unpickling process involves importing the module that defines the class and restoring the instance's attributes (which were also pickled). If each instance holds a Manager, this is a problem, because a Manager is not picklable.
Q2 - How to fix it
You can avoid the problem by having the class create its own class-level Manager (shared by all instances of the class). Here the __init__() method creates the manager class attribute the first time an instance is created, and from that point on further instances reuse it. This is sometimes called "lazy initialization".
from multiprocessing import Pool, Manager
import random

class SomeClass:
    def __init__(self):
        # Lazy creation of class attribute.
        try:
            manager = getattr(type(self), 'manager')
        except AttributeError:
            manager = type(self).manager = Manager()
        self.dct = manager.dict()

    def __call__(self):
        with Pool(2) as pool:
            pool.map(self.do_thing, range(10))
        print('done')

    def do_thing(self, n):
        for i in range(10_000_000):
            i += 1
        self.dct[n] = random.randint(0, 9)

if __name__ == '__main__':
    inst = SomeClass()
    inst()
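Another option, not part of the answer above and offered only as a sketch, is to keep the manager on the instance but leave it out of the pickled state with __getstate__. To keep the example runnable without starting a server process, a threading.Lock stands in for the unpicklable Manager and a plain dict stands in for manager.dict(); both substitutions are assumptions, and the behaviour with a real Manager has not been verified here.

```python
import pickle
import threading


class SomeClass:
    def __init__(self, manager, dct):
        self.manager = manager  # unpicklable handle (stand-in for Manager())
        self.dct = dct          # stand-in for manager.dict()

    def __getstate__(self):
        # Copy the instance state and drop the attribute pickle cannot handle.
        state = self.__dict__.copy()
        state.pop("manager", None)
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.manager = None  # the unpickled copy has no manager of its own

    def do_thing(self, n):
        self.dct[n] = n * n
        return n


inst = SomeClass(threading.Lock(), {})

# Pickling the bound method pickles the instance via __getstate__.
clone = pickle.loads(pickle.dumps(inst.do_thing))
print(clone.__self__.manager)
print(clone(3), clone.__self__.dct)
```

With a plain dict the copy's writes stay in the copy; the point of using manager.dict() in the real code is that the proxy, unlike the Manager itself, can be pickled and still refers to the shared data.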
Q3 - Is this a reasonable thing to do?
In my opinion, yes.
I have a server that contains a class which performs an expensive computation during its initialization. I want to initialize this class once, inside the main() method of the server module, before starting the server. Then, I want other modules that import the server module to be able to retrieve the instance of this class.
Example (the sleep emulates the server running)
import time

# I want to store the shared instance in this global variable
shared_instance = None

class Shared:
    def __init__(self):
        # Expensive computation that I only want to run once
        pass

def main():
    global shared_instance
    shared_instance = Shared()  # Now shared_instance is not None anymore
    print(shared_instance)
    print("Starting server...")
    time.sleep(1000)

if __name__ == '__main__':
    main()
When I run this server it prints:
<__main__.Shared object at 0x000001865A3C4320>
Starting server...
Now I have another module that should be able to see the instance:
import server
print(server.shared_instance)
However, shared_instance is not '<__main__.Shared object at 0x000001865A3C4320>' as expected. It is 'None'. Could you please tell me what I'm doing wrong and how I can solve this issue and achieve this functionality?
Many thanks
When using multiprocessing in Python and importing a module, why is it that any instance variables in the module are passed by copy to the child process, whereas arguments passed in the args parameter are passed by reference?
Does this have to do with thread safety perhaps?
foo.py
class User:
    def __init__(self, name):
        self.name = name

foo_user = User('foo')
main.py
import multiprocessing
from foo import User, foo_user

def worker(main_foo):
    print(main_foo.name)  # prints 'main user'
    print(foo_user.name)  # prints 'foo user', why doesn't it print 'override'

if __name__ == '__main__':
    main_foo = User('main user')
    foo_user.name = 'override'
    p = multiprocessing.Process(target=worker, args=(main_foo,))
    p.start()
    p.join()
EDIT: I'm an idiot, self.name = None should have been self.name = name. I made the correction in my code and forgot to copy it back over.
Actually, it does print override. Look at this:
$ python main.py
None
override
But! This only happens on *Nix. My guess is that you are running on Windows. The difference being that, in Windows, a fresh copy of the interpreter is spawned to just run your function, and the change you made to foo_user.name is not made, because in this new instance, __name__ is not __main__, so that bit of code is not executed. This is done to prevent infinite recursion.
You'll see the difference if you add this line to your function:
def worker(main_foo):
    print(__name__)
    ...
This prints __main__ on *Nix. However, it will not be __main__ for Windows.
You'll want to move that line (foo_user.name = 'override') out of the if __name__ == '__main__': block if you want it to work.
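The same contrast can be reproduced on one machine by choosing the start method explicitly. This is a sketch under the assumption that it is saved to a file and run as a script; settings, worker and run_with are illustrative names, and "fork" is skipped where it is not available (for example on Windows).

```python
import multiprocessing

# Module-level state, modified only inside the guarded block below.
settings = {"name": "foo"}


def worker():
    # Under fork the child inherits the parent's memory, including the change.
    # Under spawn the child re-imports this file without running the guard.
    print(f"__name__={__name__!r}, name={settings['name']!r}", flush=True)


def run_with(method):
    """Run worker in a child process started with the given start method."""
    ctx = multiprocessing.get_context(method)
    p = ctx.Process(target=worker)
    p.start()
    p.join()
    return p.exitcode


if __name__ == "__main__":
    settings["name"] = "override"
    for method in ("fork", "spawn"):
        if method in multiprocessing.get_all_start_methods():
            print(f"start method: {method}")
            run_with(method)
```

With fork the child should report the overridden value, while with spawn it should report the original one, which is the behaviour described above for Windows.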