I have a service that takes screenshots of a given URL using Selenium WebDriver.
It works OK: it spawns a process -> takes the screenshot -> closes the process.
The problem is that it takes too long to return.
Is there a way to keep the WebDriver process always on, waiting for requests?
Here is my code:
class WebDriver(webdriver.Chrome):
    def __init__(self, *args, **kwargs):
        logger.info('Start WebDriver instance.')
        self.start_time = datetime.now()
        self.lock = threading.Lock()
        kwargs['chrome_options'] = self.get_chrome_options()
        super().__init__(*args, **kwargs)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        logger.info(f'Quitting WebDriver instance {id(self)}, took {datetime.now() - self.start_time}')
        self.quit()

    @staticmethod
    def get_chrome_options():
        chrome_options = ChromeOptions()
        chrome_options.headless = True
        chrome_options.add_argument('--start-maximized')
        chrome_options.add_argument("--no-sandbox")  # bypass OS security model
        chrome_options.add_argument('--disable-dev-shm-usage')  # overcome limited /dev/shm resources
        chrome_options.add_argument("--lang=en")
        chrome_options.add_argument("--disable-infobars")  # disable infobars
        chrome_options.add_argument("--disable-extensions")  # disable extensions
        chrome_options.add_argument("--hide-scrollbars")
        return chrome_options

    def capture_screenshot_from_html_string(self, html_str, window_size):
        with tempfile.TemporaryDirectory() as tmpdirname:
            html_filename = tmpdirname + '/template.html'
            with open(html_filename, 'w') as f:
                f.write(html_str)
            url = 'file://' + html_filename
            img_str = self.capture_screenshot(url, window_size)
        return img_str

    def capture_screenshot(self, url, window_size):
        self.lock.acquire()
        try:
            self.set_window_size(*window_size)
            self.set_page_load_timeout(PAGE_LOAD_TIMEOUT)  # set before get() so it applies to this load
            self.get(url)
            self.maximize_window()
            img_str = self.get_screenshot_as_png()
        except Exception as exc:
            logger.error(f'Error capturing screenshot url: {url}; {exc}')
            img_str = None
        finally:
            self.lock.release()
        return img_str
After some research I found a solution, and I'm posting it here in case it helps others with a similar problem.
It uses the py-object-pool library.
The library creates a pool of resource-class instances for use in your project; the pool is implemented with Python's built-in Queue.
Creating a new browser instance on every request is a time-consuming task that makes the client wait.
On the other hand, if you keep a single browser instance and juggle browser tabs, it becomes cumbersome to maintain and debug when an issue arises.
An object pool helps in that situation: it creates a pool of resources and hands one to each client on request, isolating clients from one another without waiting or creating a new instance on the spot.
Code example:
ff_browser_pool = ObjectPool(FirefoxBrowser, min_init=2)
with ff_browser_pool.get() as (browser, browser_stats):
title = browser.get_page_title('https://www.google.co.in/')
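FirefoxBrowser here is a resource class you define yourself. A minimal sketch of what it might look like (the exact protocol py-object-pool expects of resource classes may differ; check the project page below):

from selenium import webdriver

class FirefoxBrowser:
    # pooled resource wrapping a long-lived Firefox session
    def __init__(self):
        self.browser = webdriver.Firefox()

    def get_page_title(self, url):
        self.browser.get(url)
        return self.browser.title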
For more information, see the link below:
https://pypi.org/project/py-object-pool/
I can't send commands to a Selenium WebDriver in a detached session, because the link http://localhost:port dies.
But if I pause at breakpoint 1, the link stays alive.
import multiprocessing
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def create_driver_pool(q):
    options = Options()
    driver = webdriver.Chrome(options=options)
    pass  # breakpoint 1
    return driver.command_executor._url

windows_pool = multiprocessing.Pool(processes=1)
result = windows_pool.map(create_driver_pool, [1])
print(result)
pass  # breakpoint 2 for testing the link
Why is this happening, and what can I do about it?
After some research I finally found the reason for this behavior.
Thanks to https://bentyeh.github.io/blog/20190527_Python-multiprocessing.html and some googling about signals.
It turns out this is not about signals at all.
I found this code in selenium.webdriver.common.service:
def __del__(self):
    print("del detected")
    # `subprocess.Popen` doesn't send a signal on `__del__`,
    # so we attempt to close the launched process when `__del__`
    # is triggered.
    try:
        self.stop()
    except Exception:
        pass
This is a hook for the garbage collector: when the Service object is collected, stop() kills the driver subprocess (terminate() sends SIGTERM on POSIX), roughly like this:

self.process.terminate()
self.process.wait()
self.process.kill()
self.process = None
But if you are in debug mode, paused at a breakpoint, the garbage collector won't collect the object, so __del__ never runs.
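So outside the debugger, the fix is to keep the driver referenced for as long as you need the link. A minimal sketch of that idea, reusing the imports from the question: store the driver in a module-level container inside the worker process (the worker stays alive as long as the pool does), so the Service object is never collected and __del__ never fires.

_drivers = []  # module-level: lives as long as the worker process

def create_driver_pool(q):
    options = Options()
    driver = webdriver.Chrome(options=options)
    _drivers.append(driver)  # keep a reference so __del__ never fires
    return driver.command_executor._url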
I have an automation project built with Python and Selenium, which I'm trying to make run with multiple browsers in parallel.
The current workflow:
open a browser for manual login
save cookies for later use
in a loop, open additional browsers, load the saved session in each newly opened browser
The described workflow opens the browsers one by one until all the required browsers are open.
My code contains several classes, among them Browser and Ui.
The object instantiated from the Ui class contains a method which at some point executes the following code:
for asset in Inventory.assets:
    self.browsers[asset] = ui.Browser()
    # self.__open_window(asset)  # if this is uncommented, the code works properly
    # without the multithreading part; all the browsers are opened one by one

# try 1
# threads = []
# for asset in Inventory.assets:
#     threads.append(Thread(target=self.__open_window, args=(asset,), name=asset))
# for thread in threads:
#     thread.start()

# try 2
# with concurrent.futures.ThreadPoolExecutor() as executor:
#     futures = []
#     for asset in Inventory.assets:
#         futures.append(executor.submit(self.__open_window, asset=asset))
#     for future in concurrent.futures.as_completed(futures):
#         print(future.result())
The problem appears when self.__open_window is executed within a thread. There I get a Selenium-related error, something like 'NoneType' object has no attribute 'get', when self.driver.get(url) is called from the Browser class.
def __open_window(self, asset):
    self.interface = self.browsers[asset]
    self.interface.open_browser()
In the Browser class:
def open_browser(self, driver_path=""):
    # ...
    options = webdriver.ChromeOptions()
    # ...
    web_driver = webdriver.Chrome(executable_path=driver_path, options=options)
    self.driver = web_driver
    self.opened_tabs["default"] = web_driver.current_window_handle
    # ...

def get_url(self, url):
    try:
        self.driver.get(url)  # this line causes problems ...
    except Exception as e:
        print(e)
My questions are:
Why do I have this issue in a multithreading environment?
What should I do to make the code work properly?
Thank you!
I found the mistake: it was caused by a wrong object reference.
After the modification the code works well.
I updated the following lines in __open_window:
def __open_window(self, asset, browser):
    browser.interface = self.browsers[asset]
    browser.interface.open_browser()
and in the # try 1 code section:
threads.append(Thread(target=self.__open_window, args=(asset, browser, ), name=asset))
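The root cause was that every thread wrote its browser into the shared self.interface attribute, so concurrently running threads overwrote each other's reference before open_browser ran. Where browser comes from isn't shown above; a sketch of the corrected # try 1 section, assuming each thread is handed its own per-asset object:

threads = []
for asset in Inventory.assets:
    browser = self.browsers[asset]  # per-thread object instead of shared state
    threads.append(Thread(target=self.__open_window, args=(asset, browser), name=asset))
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()  # wait until every browser window is open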
I am trying to multiprocess Selenium, where each process is spawned with a Selenium driver and a session (each process is connected to a different account).
I have a list of URLs to visit.
Each URL needs to be visited once, by any one of the accounts (it doesn't matter which one).
To avoid some nasty global-variable management, I tried to initialize each process with a class object, using the initializer of multiprocessing.Pool.
After that, I can't figure out how to distribute tasks to the processes, given that the function used by each process has to belong to the class.
Here is a simplified version of what I'm trying to do:
from selenium import webdriver
import multiprocessing

account = [{'account': 1}, {'account': 2}]

class Collector():
    def __init__(self, account):
        self.account = account
        self.driver = webdriver.Chrome()

    def parse(self, item):
        self.driver.get(f"https://books.toscrape.com{item}")

if __name__ == '__main__':
    processes = 1
    pool = multiprocessing.Pool(processes, initializer=Collector, initargs=[account.pop()])
    items = ['/catalogue/a-light-in-the-attic_1000/index.html',
             '/catalogue/tipping-the-velvet_999/index.html']
    pool.map(parse(), items, chunksize=1)
    pool.close()
    pool.join()
The problem comes on the pool.map line: there is no reference to the instantiated object inside the subprocess.
Another approach would be to distribute the URLs and parse them during __init__, but this would be very nasty.
Is there a way to achieve this?
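For what it's worth, one common way out (a sketch of the standard initializer-plus-global pattern, not taken from the answers below; init_worker and parse_item are names I'm introducing) is to let the initializer stash the per-process Collector in a module-level global, which is private to each worker process:

import multiprocessing
from selenium import webdriver

# Collector is the class from the question above
account = [{'account': 1}, {'account': 2}]
items = ['/catalogue/a-light-in-the-attic_1000/index.html',
         '/catalogue/tipping-the-velvet_999/index.html']

collector = None  # one Collector instance per worker process

def init_worker(acct):
    # runs once in each worker; the global is private to that process
    global collector
    collector = Collector(acct)

def parse_item(item):
    collector.parse(item)  # reaches this worker's own Collector

if __name__ == '__main__':
    with multiprocessing.Pool(1, initializer=init_worker,
                              initargs=(account.pop(),)) as pool:
        pool.map(parse_item, items, chunksize=1)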
Since Chrome starts its own process, there is really no need to be using multiprocessing when multithreading will suffice. I would like to offer a more general solution to handle the case where you have N URLs to retrieve, where N might be very large but you would like to limit the number of concurrent Selenium sessions to MAX_DRIVERS, a significantly smaller number. You therefore want to create only one driver session per pool thread and reuse it as necessary. The remaining problem is calling quit on each driver when you are finished with the pool, so that you don't leave any Selenium processes running behind.
The following code uses thread-local storage, which is unique to each thread, to store the current driver instance for each pool thread, and uses a class destructor to call the driver's quit method when the class instance is destroyed:
from selenium import webdriver
from multiprocessing.pool import ThreadPool
import threading

items = ['/catalogue/a-light-in-the-attic_1000/index.html',
         '/catalogue/tipping-the-velvet_999/index.html']
accounts = [{'account': 1}, {'account': 2}]
baseurl = 'https://books.toscrape.com'

threadLocal = threading.local()

class Driver:
    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        options.add_experimental_option('excludeSwitches', ['enable-logging'])
        self.driver = webdriver.Chrome(options=options)

    def __del__(self):
        self.driver.quit()  # clean up driver when we are cleaned up
        print('The driver has been "quitted".')

    @classmethod
    def create_driver(cls):
        the_driver = getattr(threadLocal, 'the_driver', None)
        if the_driver is None:
            the_driver = cls()
            threadLocal.the_driver = the_driver
        return the_driver.driver

def process(i, a):
    print(f'Processing account {a}')
    driver = Driver.create_driver()
    driver.get(f'{baseurl}{i}')

def main():
    global threadLocal
    # We never want to create more than MAX_DRIVERS driver sessions.
    MAX_DRIVERS = 8  # rather arbitrary
    POOL_SIZE = min(len(items), MAX_DRIVERS)
    pool = ThreadPool(POOL_SIZE)
    pool.starmap(process, zip(items, accounts))
    # ensure the drivers are "quitted":
    del threadLocal
    import gc
    gc.collect()  # a little extra insurance
    pool.close()
    pool.join()

if __name__ == '__main__':
    main()
I'm not entirely certain if this solves your problem.
If you have one account per URL then you could do this:
from selenium import webdriver
from multiprocessing import Pool

items = ['/catalogue/a-light-in-the-attic_1000/index.html',
         '/catalogue/tipping-the-velvet_999/index.html']
accounts = [{'account': 1}, {'account': 2}]
baseurl = 'https://books.toscrape.com'

def process(i, a):
    print(f'Processing account {a}')
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    with webdriver.Chrome(options=options) as driver:
        driver.get(f'{baseurl}{i}')

def main():
    with Pool() as pool:
        pool.starmap(process, zip(items, accounts))

if __name__ == '__main__':
    main()
If the number of accounts doesn't match the number of URLs, you have said that it doesn't matter which account GETs which URL. So, in that case, you could just select the account to use at random with random.choice(), as sketched below.
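A sketch of that variation of the process function, reusing the names from the example above:

import random

def process(i):
    a = random.choice(accounts)  # any account will do for this URL
    print(f'Processing account {a}')
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    with webdriver.Chrome(options=options) as driver:
        driver.get(f'{baseurl}{i}')

# then map over the items alone:
# pool.map(process, items)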
I am working on a crawl project using Selenium and WebDriver. Since the data I need to crawl is big, I want to split it between two threads that run at the same time. But when I start two WebDrivers at the same time, my code cannot recognize which driver belongs to which thread when filling in the information.
This is the code where I set up the two threads running the main function:
if __name__ == '__main__':
    data = load_data(INPUT_DIR)
    t1_data = data[:250]
    t2_data = data[250:]
    try:
        _thread.start_new_thread(main, (t1_data, LOG_FILE_T1))
        _thread.start_new_thread(main, (t2_data, LOG_FILE_T2))
    except:
        print("Error: unable to start thread")
    while 1:
        pass
This is the code where I start the WebDriver:
def start_driver():
    global driver
    options = Options()
    options.add_argument("--disable-notifications")
    options.add_argument("--disable-infobars")
    options.add_argument("--mute-audio")
    # options.add_argument("headless")
    driver = webdriver.Chrome(options=options)
After the two WebDrivers have started, I fill in the username/password information on facebook.com:
def login(email, password):
    """ Logging into our own profile """
    try:
        driver.get('https://mbasic.facebook.com')
        time.sleep(DELAY_TIME)
        driver.find_element_by_name('email').send_keys(email)
        driver.find_element_by_name('pass').send_keys(password)
        driver.find_element_by_name('login').click()
        # deal with the "Not Now" button if it shows up the first time
        not_now_button = driver.find_element_by_xpath("//a")
        if not_now_button.size != 0:
            not_now_button.click()
    except Exception as e:
        print('Error in Login')
        print(e)
        exit()
At the send_keys step, both threads fill in the same text box in one WebDriver.
How can I change my code so that the two threads see different WebDrivers and fill in their own information?
I just found the solution; I want to share it here in case someone needs it.
Instead of a global driver, I changed to a local driver and passed it to each function.
def start_driver():
    options = Options()
    options.add_argument("--disable-notifications")
    options.add_argument("--disable-infobars")
    options.add_argument("--mute-audio")
    # options.add_argument("headless")
    driver = webdriver.Chrome(options=options)
    return driver
def login(driver, email, password):
    """ Logging into our own profile """
    try:
        driver.get('https://mbasic.facebook.com')
        time.sleep(DELAY_TIME)
        driver.find_element_by_name('email').send_keys(email)
        driver.find_element_by_name('pass').send_keys(password)
        driver.find_element_by_name('login').click()
        # deal with the "Not Now" button if it shows up the first time
        not_now_button = driver.find_element_by_xpath("//a")
        if not_now_button.size != 0:
            not_now_button.click()
    except Exception as e:
        print('Error in Login')
        print(e)
        exit()
This way, the two threads see different WebDrivers and each fills in its own information.
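For completeness, here is how main might thread the driver through under this change (a sketch only; main's body isn't shown in the question, and EMAIL, PASSWORD, and process_item are hypothetical names):

def main(data, log_file):
    # each thread creates and owns its driver; nothing is shared globally
    driver = start_driver()
    login(driver, EMAIL, PASSWORD)  # EMAIL/PASSWORD: assumed constants
    for item in data:
        process_item(driver, item, log_file)  # process_item: hypothetical helper
    driver.quit()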
Here's just about the simplest open-and-close you can do with WebDriver and PhantomJS:
from selenium import webdriver
crawler = webdriver.PhantomJS()
crawler.set_window_size(1024,768)
crawler.get('https://www.google.com/')
crawler.quit()
On Windows (7), every time I run my code to test something out, new instances of the conhost.exe and phantomjs.exe processes start and never quit. Am I doing something stupid here? I figured the processes would quit when crawler.quit() ran...
Go figure. Problem resolved with a reboot.
Rebooting is not a solution to this problem. I have experimented with this hack on a Linux system. Try modifying the stop() function defined in service.py:
def stop(self):
    """
    Cleans up the process
    """
    if self._log:
        self._log.close()
        self._log = None
    # if it's dead, don't worry
    if self.process is None:
        return
    # tell the server to properly die, just in case
    try:
        if self.process:
            self.process.stdin.close()
            # self.process.kill()
            self.process.send_signal(signal.SIGTERM)
            self.process.wait()
            self.process = None
    except OSError:
        # kill may not be available under a Windows environment
        pass
The send_signal line was added to explicitly signal the phantomjs process to quit. Don't forget to add an import signal statement at the start of the file.
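If you would rather not patch the installed library, a different workaround (my suggestion, not part of the answer above) is to register quit with atexit, so it runs even when an unhandled exception ends the script:

import atexit
from selenium import webdriver

crawler = webdriver.PhantomJS()
atexit.register(crawler.quit)  # runs at interpreter exit, even after errors
crawler.set_window_size(1024, 768)
crawler.get('https://www.google.com/')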