Python Multiprocessing Manager - List Name Error?

I am trying to use a shared list that will collect information scraped with Selenium, so I can later export it or use it however I choose. For some reason it is giving me this error:
NameError: name 'scrapedinfo' is not defined...
This is really strange to me because I declared the list global AND I used multiprocessing.Manager() to create it. I have double-checked my code many times and it is not a case-sensitivity issue. I also tried to pass the list through the functions as an argument, but this created other problems and did not work. Any help is greatly appreciated!
from selenium import webdriver
from multiprocessing import Pool

def browser():
    driver = webdriver.Chrome()
    return driver

def test_func(link):
    driver = browser()
    driver.get(link)

def scrape_stuff(driver):
    # Scrape things
    scrapedinfo.append(#Scraped Stuff)

def multip():
    manager = Manager()
    # Declare list here
    global scrapedinfo
    scrapedinfo = manager.list()
    links = ["https://stackoverflow.com/", "https://signup.microsoft.com/", "www.example.com"]
    chunks = [links[i::3] for i in range(3)]
    pool = Pool(processes=3)
    pool.map(test_func, chunks)
    print(scrapedinfo)

multip()

On Windows, multiprocessing starts a new Python process and then pickles/unpickles a limited view of the parent's state for the child. Global variables that are not passed in the map call are not included, so scrapedinfo is never created in the child and you get the error.
One solution is to pass scrapedinfo in the map call. Hacking it down to a quick example:
from multiprocessing import Pool, Manager

def test_func(param):
    scrapedinfo, link = param
    scrapedinfo.append("i scraped stuff from " + str(link))

def multip():
    manager = Manager()
    global scrapedinfo
    scrapedinfo = manager.list()
    links = ["https://stackoverflow.com/", "https://signup.microsoft.com/", "www.example.com"]
    chunks = [links[i::3] for i in range(3)]
    pool = Pool(processes=3)
    pool.map(test_func, list((scrapedinfo, chunk) for chunk in chunks))
    print(scrapedinfo)

if __name__ == "__main__":
    multip()
But you are doing more work than you need to with the Manager. map passes the worker's return value back to the parent process (and handles chunking). So you could do:
from multiprocessing import Pool

def test_func(link):
    return "i scraped stuff from " + link

def multip():
    links = ["https://stackoverflow.com/", "https://signup.microsoft.com/", "www.example.com"]
    pool = Pool(processes=3)
    scrapedinfo = pool.map(test_func, links)
    print(scrapedinfo)

if __name__ == "__main__":
    multip()
And you avoid the extra processing of a clunky list proxy.
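To tie this back to the original Selenium code, here is a minimal sketch of how the same idea could look with a driver created inside the worker (assuming one Chrome instance per link is acceptable; the scrape_stuff body here is a stand-in for whatever you actually extract):

from selenium import webdriver
from multiprocessing import Pool

def scrape_stuff(driver):
    # placeholder for your real extraction logic
    return driver.title

def test_func(link):
    driver = webdriver.Chrome()
    try:
        driver.get(link)
        return scrape_stuff(driver)
    finally:
        driver.quit()

def multip():
    links = ["https://stackoverflow.com/", "https://signup.microsoft.com/", "http://www.example.com"]
    with Pool(processes=3) as pool:
        scrapedinfo = pool.map(test_func, links)
    print(scrapedinfo)

if __name__ == "__main__":
    multip()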

Related

Python multiprocessing a class

I am trying to multiprocess Selenium where each process is spawned with a Selenium driver and a session (each process is connected with a different account).
I have a list of URLs to visit.
Each URL needs to be visited once by one of the accounts (no matter which one).
To avoid some nasty global variable management, I tried to initialize each process with a class object using the initializer of multiprocessing.Pool.
After that, I can't figure out how to distribute tasks to the processes, given that the function used by each process has to be a method of the class.
Here is a simplified version of what I'm trying to do:
from selenium import webdriver
import multiprocessing

account = [{'account': 1}, {'account': 2}]

class Collector():
    def __init__(self, account):
        self.account = account
        self.driver = webdriver.Chrome()

    def parse(self, item):
        self.driver.get(f"https://books.toscrape.com{item}")

if __name__ == '__main__':
    processes = 1
    pool = multiprocessing.Pool(processes, initializer=Collector, initargs=[account.pop()])
    items = ['/catalogue/a-light-in-the-attic_1000/index.html', '/catalogue/tipping-the-velvet_999/index.html']
    pool.map(parse(), items, chunksize=1)
    pool.close()
    pool.join()
The problem comes on the pool.map line: there is no reference to the instantiated object inside the subprocess.
Another approach would be to distribute the URLs and parse them during __init__, but this would be very nasty.
Is there a way to achieve this?
Since Chrome starts its own process, there is really no need to use multiprocessing when multithreading will suffice. I would like to offer a more general solution for the case where you have N URLs to retrieve, where N might be very large, but you want to limit the number of concurrent Selenium sessions to MAX_DRIVERS, a significantly smaller number. You therefore only want to create one driver session per thread in the pool and reuse it as necessary. The remaining problem is calling quit on each driver when you are finished with the pool so that you don't leave any Selenium processes running behind.
The following code uses thread-local storage, which is unique to each thread, to store the current driver instance for each pool thread, and uses a class destructor to call the driver's quit method when the class instance is destroyed:
from selenium import webdriver
from multiprocessing.pool import ThreadPool
import threading

items = ['/catalogue/a-light-in-the-attic_1000/index.html',
         '/catalogue/tipping-the-velvet_999/index.html']
accounts = [{'account': 1}, {'account': 2}]
baseurl = 'https://books.toscrape.com'

threadLocal = threading.local()

class Driver:
    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        options.add_experimental_option('excludeSwitches', ['enable-logging'])
        self.driver = webdriver.Chrome(options=options)

    def __del__(self):
        self.driver.quit()  # clean up driver when we are cleaned up
        print('The driver has been "quitted".')

    @classmethod
    def create_driver(cls):
        the_driver = getattr(threadLocal, 'the_driver', None)
        if the_driver is None:
            the_driver = cls()
            threadLocal.the_driver = the_driver
        return the_driver.driver

def process(i, a):
    print(f'Processing account {a}')
    driver = Driver.create_driver()
    driver.get(f'{baseurl}{i}')

def main():
    global threadLocal
    # We never want to create more than MAX_DRIVERS driver sessions
    MAX_DRIVERS = 8  # Rather arbitrary
    POOL_SIZE = min(len(items), MAX_DRIVERS)
    pool = ThreadPool(POOL_SIZE)
    pool.starmap(process, zip(items, accounts))
    # ensure the drivers are "quitted":
    del threadLocal
    import gc
    gc.collect()  # a little extra insurance
    pool.close()
    pool.join()

if __name__ == '__main__':
    main()
I'm not entirely certain if this solves your problem.
If you have one account per URL then you could do this:
from selenium import webdriver
from multiprocessing import Pool

items = ['/catalogue/a-light-in-the-attic_1000/index.html',
         '/catalogue/tipping-the-velvet_999/index.html']
accounts = [{'account': 1}, {'account': 2}]
baseurl = 'https://books.toscrape.com'

def process(i, a):
    print(f'Processing account {a}')
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    with webdriver.Chrome(options=options) as driver:
        driver.get(f'{baseurl}{i}')

def main():
    with Pool() as pool:
        pool.starmap(process, zip(items, accounts))

if __name__ == '__main__':
    main()
If the number of accounts doesn't match the number of URLs, you have said that it doesn't matter which account GETs which URL, so in that case you could just select the account to use at random (random.choice()).
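As a minimal sketch of that idea (reusing the items, accounts, and process function from the snippet above; those names are taken from the earlier code, not redefined here):

import random
from multiprocessing import Pool

# items, accounts, and process() are assumed to be defined as in the snippet above
def main():
    pairs = [(i, random.choice(accounts)) for i in items]  # pick an arbitrary account per URL
    with Pool() as pool:
        pool.starmap(process, pairs)

if __name__ == '__main__':
    main()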

python multiprocessing module updates a class attribute inside the called process but is not updated globally

Here is my multiprocessing code:
def __get_courses_per_moderator(self, moderator, page=None):
    print("started")
    try:
        response = self.class_service.courses().list(pageSize=0,
                                                     courseStates='ACTIVE',
                                                     teacherId=moderator, pageToken=None).execute()
        for data in response["courses"]:
            self.courses_in_classroom.append(data["name"])
        print(self.courses_in_classroom)  # THE modifications are clear
    except Exception as e:
        print("__get_courses_per_moderator ERROR-{}".format(e))

def get_courses_from_classroom(self):
    # batch = service.new_batch_http_request(callback=self.__callback)
    pool = []
    for email in self.additional_emails:
        try:
            process = multiprocessing.Process(
                target=self.__get_courses_per_moderator, args=[email])
            process.start()
            pool.append(process)
        except Exception as e:
            print("get_courses_from_classroom ERROR-{}".format(e))
    for process in pool:
        process.join()
        print("joined")
    print(self.courses_in_classroom)  # the attribute is empty
As far as I understand, Python is synchronous, so when the processes update the class attribute, the value should be there, right? Or should I try to return it and then concatenate after join()?
A simple explanation would be lovely.
I fixed it by using a process Manager and creating the managed list outside of the class:
import multiprocessing

manager = multiprocessing.Manager()
course_list = manager.list()

class x:
    ...  # append things to course_list
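Expanding that into a minimal runnable sketch (the class and worker names here are illustrative, not from the original code): the Manager and its list live outside the class, the list proxy is passed in, and every child process appends to the same managed list, so the results survive the join().

import multiprocessing

class CourseCollector:
    def __init__(self, emails, course_list):
        self.emails = emails
        self.course_list = course_list  # managed list proxy, created outside the class

    def _fetch(self, email):
        # stand-in for the real API call
        self.course_list.append("course-for-" + email)

    def collect(self):
        procs = [multiprocessing.Process(target=self._fetch, args=(e,)) for e in self.emails]
        for p in procs:
            p.start()
        for p in procs:
            p.join()

if __name__ == '__main__':
    manager = multiprocessing.Manager()
    course_list = manager.list()
    CourseCollector(["a@example.com", "b@example.com"], course_list).collect()
    print(list(course_list))  # appends made in the children are visible here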

Object local to a thread in multiprocessing.dummy.Pool

I'm using multiprocessing.dummy.Pool to issue RESTful API calls in parallel.
For now the code looks like:
from multiprocessing.dummy import Pool

def onecall(args):
    env = args[0]
    option = args[1]
    return env.call(option)  # call() returns a list

def call_all():
    threadpool = Pool(processes=4)
    all_item = []
    for item in threadpool.imap_unordered(onecall, ((create_env(), x) for x in range(100))):
        all_item.extend(item)
    return all_item
In the code above, the env object wraps a requests.Session() object and is thus in charge of maintaining a connection session. The 100 tasks use 100 different env objects, so each task creates one connection, makes one API call, and disconnects.
However, to enjoy the benefit of HTTP keep-alive, I want the 100 tasks to share 4 env objects (one object per thread) so that each connection serves multiple API calls one by one. How can I achieve that?
Using threading.local seems to work.
from multiprocessing.dummy import Pool
import threading

tlocal = threading.local()

def getEnv():
    try:
        return tlocal.env
    except AttributeError:
        tlocal.env = create_env()
        return tlocal.env

def onecall(args):
    option = args[0]
    return getEnv().call(option)  # call() returns a list

def call_all():
    threadpool = Pool(processes=4)
    all_item = []
    for item in threadpool.imap_unordered(onecall, ((x,) for x in range(100))):
        all_item.extend(item)
    return all_item
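An equivalent sketch, if you prefer to build the per-thread env up front, uses the pool's initializer so each worker thread creates its own env exactly once (this still assumes your existing create_env() and env.call(); it is a variation, not the accepted code above):

from multiprocessing.dummy import Pool
import threading

tlocal = threading.local()

def init_worker():
    # runs once in each pool thread, so each thread gets its own env
    tlocal.env = create_env()

def onecall(option):
    return tlocal.env.call(option)  # call() returns a list

def call_all():
    threadpool = Pool(processes=4, initializer=init_worker)
    all_item = []
    for item in threadpool.imap_unordered(onecall, range(100)):
        all_item.extend(item)
    return all_item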

Python - How to run array batches

I am new to Python and I am currently working on a multi-page web scraper. While playing around with Python I found out about threading, and this really speeds up the code. The problem is that the script scrapes lots of sites, and I would like to process the array in 'batches' while using threading.
When I've got an array of 1000 items, I'd like to grab 10 items, and when the script is done with those 10 items, grab 10 new items until there is nothing left.
I hope someone can help me. Thanks in advance!
import subprocess
import threading
from multiprocessing import Pool

def scrape(url):
    return subprocess.call("casperjs test.js --url=" + url, shell=True)

if __name__ == '__main__':
    pool = Pool()
    sites = ["http://site1.com", "http://site2.com", "http://site3.com", "http://site4.com"]
    results = pool.imap(scrape, sites)
    for result in results:
        print(result)
In the future I will use an SQLite database where I will store all the URLs (this will replace the array). When I run the script I want to be able to stop the process and continue whenever I want. This is not my question, but it is the context of my problem.
Question: ... array of 1000 items I'd like to grab 10 items
for p in range(0, 1000, 10):
    process10(sites[p:p+10])
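process10 is left for you to fill in; a minimal sketch (using a thread pool of 10 and reusing the scrape function and sites array from the question, which are assumptions, not part of this answer) could be:

from multiprocessing.dummy import Pool as ThreadPool

def process10(batch):
    # scrape one batch of up to 10 URLs concurrently, then return their results
    # scrape() and sites come from the question's code above
    with ThreadPool(len(batch)) as pool:
        return pool.map(scrape, batch)

for p in range(0, len(sites), 10):
    print(process10(sites[p:p+10]))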
How about using Process and Queue from multiprocessing? Writing a worker function and calling it from a loop makes it run in batches. With Process, jobs can be started and stopped when needed, giving you more control over them.
import subprocess
from multiprocessing import Process, Queue

def worker(url_queue, result_queue):
    for url in iter(url_queue.get, 'DONE'):
        scrape_result = scrape(url)
        result_queue.put(scrape_result)

def scrape(url):
    return subprocess.call("casperjs test.js --url=" + url, shell=True)

if __name__ == '__main__':
    sites = ['http://site1.com', "http://site2.com", "http://site3.com", "http://site4.com", "http://site5.com",
             "http://site6.com", "http://site7.com", "http://site8.com", "http://site9.com", "http://site10.com",
             "http://site11.com", "http://site12.com", "http://site13.com", "http://site14.com", "http://site15.com",
             "http://site16.com", "http://site17.com", "http://site18.com", "http://site19.com", "http://site20.com"]

    url_queue = Queue()
    result_queue = Queue()
    processes = []

    for url in sites:
        url_queue.put(url)
    for i in range(10):
        p = Process(target=worker, args=(url_queue, result_queue))
        p.start()
        processes.append(p)
        url_queue.put('DONE')  # one sentinel per worker so every worker can exit
    for p in processes:
        p.join()
    result_queue.put('DONE')
    for response in iter(result_queue.get, 'DONE'):
        print(response)
Note that Queue is a FIFO queue that supports putting and getting elements.

Appending a Global List with threading - unable to access results outside of thread

I am using threading to speed up collecting data from a website via a RESTful API. I am storing the results in a list that I will walk later. I have made the list global, however when I try to print(list) outside of a thread, I see no results. Within a thread, I can print(list) and see that it is appending correctly with all the data. What's the best way to collect this data?
global_list = []
pool = ActivePool()
s = threading.Semaphore(3)

def get_data(s, pool, i, list):
    with s:
        name = threading.currentThread().getName()
        pool.makeActive(name)
        data = []
        session = get_session(login info and url)  # function for establishing connection to remote site
        request = {API Call 'i'}
        response = session.post(someurl, json=request)
        session.close()
        data = response.json()
        global_list.append(data)
        pool.makeInactive(name)

def main():
    for i in api_list:
        t = threading.Thread(target=get_data, name=i['uniqueID'], args=(s, pool, i, global_list))
    print(global_list)

if __name__ == '__main__':
    main()
I was able to use Queue! Per the question linked below, I stored the data I needed with put() within each thread, and then built the list I needed by appending to it with get() outside of the threads.
Return Value From Thread
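A minimal sketch of that approach (with a stubbed-out worker in place of the real API call, since the session and request details above are specific to the original code):

import threading
import queue

result_queue = queue.Queue()

def get_data(i):
    data = {"id": i}  # stand-in for the real API response
    result_queue.put(data)

def main():
    threads = [threading.Thread(target=get_data, args=(i,)) for i in range(10)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # drain the queue into a plain list after all threads have finished
    results = []
    while not result_queue.empty():
        results.append(result_queue.get())
    print(results)

if __name__ == '__main__':
    main()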
