I launch multiple Selenium sessions with multiprocessing and want to close all of them, or only a few, when the user asks. Outline of my code:
from multiprocessing import Process

def drop():
    ...
    driver = webdriver.Chrome(executable_path=os.path.join(path, '\\chromedriver.exe'), chrome_options=options)
    ...

for i in range(10):
    s = Process(target=drop)
    s.start()

input('press enter to stop all sessions')
###close selenium driver event
You can create a list process_list = [] and append the s variable to it on every iteration with process_list.append(s); then, to stop all the processes:
for p in process_list:
    p.terminate()  # note: terminate() kills the process, but the browser it started may stay open
Adding each driver instance to a list when creating it, and then iterating through that list to close the drivers afterwards, might solve your issue.
from multiprocessing import Process

list_drivers = []

def drop():
    ...
    driver = webdriver.Chrome(executable_path=os.path.join(path, '\\chromedriver.exe'), chrome_options=options)
    list_drivers.append(driver)
    ...

for i in range(10):
    s = Process(target=drop)
    s.start()

input('press enter to stop all sessions')

### close selenium driver event
for driver in list_drivers:
    driver.quit()
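One caveat: because every Process runs in its own memory space, a list_drivers defined at module level is not actually shared; each child appends to its own copy, and the parent's list stays empty. A minimal sketch of a variant that sidesteps this, using a multiprocessing.Event so that each child quits its own driver (the worker body is a placeholder):

from multiprocessing import Process, Event
from selenium import webdriver

def drop(stop_event):
    driver = webdriver.Chrome()  # add your executable_path/options as needed
    # ... do the actual work here ...
    stop_event.wait()   # block until the parent signals shutdown
    driver.quit()       # each child closes its own browser

if __name__ == '__main__':
    stop_event = Event()
    processes = [Process(target=drop, args=(stop_event,)) for _ in range(10)]
    for p in processes:
        p.start()
    input('press enter to stop all sessions')
    stop_event.set()    # tell every child to quit its driver
    for p in processes:
        p.join()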
Related
I am trying to multiprocess Selenium, where each process is spawned with a Selenium driver and a session (each process is connected to a different account).
I have a list of URLs to visit.
Each URL needs to be visited once by one of the accounts (no matter which one).
To avoid some nasty global variable management, I tried to initialize each process with a class object using the initializer of multiprocessing.Pool.
After that, I can't figure out how to distribute tasks to the processes, given that the function used by each process has to be in the class.
Here is a simplified version of what I'm trying to do:
from selenium import webdriver
import multiprocessing

account = [{'account': 1}, {'account': 2}]

class Collector():
    def __init__(self, account):
        self.account = account
        self.driver = webdriver.Chrome()

    def parse(self, item):
        self.driver.get(f"https://books.toscrape.com{item}")

if __name__ == '__main__':
    processes = 1
    pool = multiprocessing.Pool(processes, initializer=Collector, initargs=[account.pop()])
    items = ['/catalogue/a-light-in-the-attic_1000/index.html', '/catalogue/tipping-the-velvet_999/index.html']
    pool.map(parse(), items, chunksize=1)
    pool.close()
    pool.join()
The problem comes on the pool.map line: there is no reference to the instantiated object inside the subprocess.
Another approach would be to distribute the URLs and parse during __init__, but this would be very nasty.
Is there a way to achieve this?
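For reference, a minimal sketch of one common workaround, assuming the Collector class and the account/items lists above: have the initializer store the instance in a module-level global of the worker process, and map over a top-level function that delegates to it, so pool.map has something picklable to call.

import multiprocessing

collector = None

def init_collector(acc):
    global collector
    collector = Collector(acc)   # one Collector per worker process

def parse_item(item):
    collector.parse(item)        # delegate to this process's instance

if __name__ == '__main__':
    # note: initargs is evaluated once in the parent, so with more than one
    # worker you would need another mechanism (e.g. a queue) to hand each
    # process its own account
    with multiprocessing.Pool(1, initializer=init_collector, initargs=(account.pop(),)) as pool:
        pool.map(parse_item, items, chunksize=1)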
Since Chrome starts its own process, there is really no need to use multiprocessing when multithreading will suffice. I would like to offer a more general solution to handle the case where you have N URLs to retrieve, where N might be very large, but you want to limit the number of concurrent Selenium sessions to MAX_DRIVERS, a significantly smaller number. You therefore only want to create one driver session per thread in the pool and reuse it as necessary. The problem then becomes calling quit on each driver when you are finished with the pool, so that you don't leave any Selenium processes running behind.
The following code uses thread-local storage, which is unique to each thread, to store the current driver instance for each pool thread, and uses a class destructor to call the driver's quit method when the class instance is destroyed:
from selenium import webdriver
from multiprocessing.pool import ThreadPool
import threading

items = ['/catalogue/a-light-in-the-attic_1000/index.html',
         '/catalogue/tipping-the-velvet_999/index.html']
accounts = [{'account': 1}, {'account': 2}]
baseurl = 'https://books.toscrape.com'

threadLocal = threading.local()

class Driver:
    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        options.add_experimental_option('excludeSwitches', ['enable-logging'])
        self.driver = webdriver.Chrome(options=options)

    def __del__(self):
        self.driver.quit()  # clean up driver when we are cleaned up
        print('The driver has been "quitted".')

    @classmethod
    def create_driver(cls):
        the_driver = getattr(threadLocal, 'the_driver', None)
        if the_driver is None:
            the_driver = cls()
            threadLocal.the_driver = the_driver
        return the_driver.driver

def process(item, account):
    print(f'Processing account {account}')
    driver = Driver.create_driver()  # reuses this thread's driver if one exists
    driver.get(f'{baseurl}{item}')

def main():
    global threadLocal
    # We never want to create more drivers than MAX_DRIVERS:
    MAX_DRIVERS = 8  # rather arbitrary
    POOL_SIZE = min(len(items), MAX_DRIVERS)
    pool = ThreadPool(POOL_SIZE)
    pool.starmap(process, zip(items, accounts))
    # ensure the drivers are "quitted":
    del threadLocal
    import gc
    gc.collect()  # a little extra insurance
    pool.close()
    pool.join()

if __name__ == '__main__':
    main()
I'm not entirely certain if this solves your problem.
If you have one account per URL then you could do this:
from selenium import webdriver
from multiprocessing import Pool

items = ['/catalogue/a-light-in-the-attic_1000/index.html',
         '/catalogue/tipping-the-velvet_999/index.html']
accounts = [{'account': 1}, {'account': 2}]
baseurl = 'https://books.toscrape.com'

def process(i, a):
    print(f'Processing account {a}')
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    with webdriver.Chrome(options=options) as driver:
        driver.get(f'{baseurl}{i}')

def main():
    with Pool() as pool:
        pool.starmap(process, zip(items, accounts))

if __name__ == '__main__':
    main()
If the number of accounts doesn't match the number of URLs, you have said that it doesn't matter which account GETs which URL, so in that case you could just select the account to use at random (random.choice()).
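For example, a brief sketch of that random pairing, reusing process, items, accounts, and Pool from the snippet above:

import random

# pair every URL with a randomly chosen account when the counts differ
pairs = [(item, random.choice(accounts)) for item in items]

with Pool() as pool:
    pool.starmap(process, pairs)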
I am trying to run multiple Selenium instances in which I need to enter captchas, but I am a beginner with multiprocessing.
While it is running and it is time to give input, it shows an error:
EOFError: EOF when reading a line
Here is an example of the code I am running:
import time
from selenium import webdriver
import multiprocessing

def first():
    chromedriver = "C:\\chromedriver"
    driver = webdriver.Chrome(chromedriver)
    driver.set_window_size(1000, 1000)
    driver.get('https://www.google.com/')
    time.sleep(5)
    captcha1 = input("in1: ")
    print(captcha1)

def sec():
    chromedriver = "C:\\chromedriver"
    driverr = webdriver.Chrome(chromedriver)
    driverr.set_window_size(1000, 1000)
    driverr.get('https://www.google.com/')
    captcha2 = input("in2: ")
    print(captcha2)

if __name__ == '__main__':
    p1 = multiprocessing.Process(target=first)
    p2 = multiprocessing.Process(target=sec)
    p1.start()
    p2.start()
    p1.join()
    p2.join()
Not only do I need to know how to give input, but in this instance the 'captcha2' input would be needed first, so 'captcha1' would have to wait until 'captcha2' is given...
You need to send messages requesting user input back to the main process so that it (and only it) can ask the user about them. The simplest way to do this is probably to create a multiprocessing.Queue object for the requests (so that the main process can listen to all children) and a Pipe for each process for the answers. Each request would of course be labeled with an identifier for the process sending it so that the response could be sent to the right place.
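A minimal sketch of that Queue-plus-Pipe arrangement (all names are illustrative; the browser work is elided):

from multiprocessing import Process, Queue, Pipe

def worker(worker_id, request_queue, conn):
    # ... drive the browser until a captcha appears ...
    request_queue.put((worker_id, 'in{}: '.format(worker_id)))  # labeled request
    captcha = conn.recv()               # block until the main process answers
    print('worker', worker_id, 'got', captcha)

if __name__ == '__main__':
    request_queue = Queue()
    conns = {}
    workers = []
    for i in (1, 2):
        parent_conn, child_conn = Pipe()
        conns[i] = parent_conn
        p = Process(target=worker, args=(i, request_queue, child_conn))
        p.start()
        workers.append(p)
    for _ in workers:                            # one request expected per worker
        worker_id, prompt = request_queue.get()  # served in arrival order
        conns[worker_id].send(input(prompt))     # only the main process reads stdin
    for p in workers:
        p.join()

Because the main process serves requests in the order they arrive, whichever captcha shows up first is answered first.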
I want to run multiple Chrome instances with Selenium. I tried to loop over the webdrivers, but Selenium keeps shutting the instances down.
Here is the code:
from selenium import webdriver

user = str(input("Do you want to run this program? "))
amount = 0
if user == "yes":
    amount = int(input("How many instances do you want to run? "))
    for w in range(1, amount + 1):
        webdriver.Chrome("path of my driver")
elif user == "no":
    print("Program is closing...")
else:
    print("Invalid input")
The weird thing is that the instances won't close if I write them without a loop:
from selenium import webdriver

user = str(input("Do you want to run this program? "))
if user == "yes":
    driver1 = webdriver.Chrome("path of driver")
    driver2 = webdriver.Chrome("path of driver")
    driver3 = webdriver.Chrome("path of driver")
    driver4 = webdriver.Chrome("path of driver")
    driver5 = webdriver.Chrome("path of driver")
elif user == "no":
    print("Program is closing...")
else:
    print("Invalid input")
Is there any solution for my problem?
To close the instances when you write it without a loop, do the following.
driver.close() closes the browser window on which the focus is set.
driver.quit() basically calls the driver.dispose method, which in turn closes all the browser windows and ends the WebDriver session gracefully.
You should use driver.quit whenever you want to end the program. It will close all opened browser windows and terminate the WebDriver session.
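Applied to the loop from the question, a minimal sketch: keep a reference to each driver in a list so the instances are not cleaned up as soon as the loop moves on, then quit them all when you are done (amount is the count read from the user, as in the question):

from selenium import webdriver

drivers = []
for w in range(1, amount + 1):
    # keeping the reference stops the instance from being shut down immediately
    drivers.append(webdriver.Chrome("path of my driver"))

# ... work with the drivers ...

for driver in drivers:
    driver.quit()  # close all windows and end each WebDriver session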
Beyond that, consider structuring the loop so that each instance runs its task in parallel (near parallel), using the multithreading, multiprocessing, or subprocess module; this handles shutting the instances down much better.
Multithreading example
from selenium import webdriver
import threading
import time

def test_logic():
    driver = webdriver.Firefox()
    url = 'https://www.google.co.in'
    driver.get(url)
    # Implement your test logic
    time.sleep(2)
    driver.quit()

N = 5  # Number of browsers to spawn
thread_list = list()

# Start test
for i in range(N):
    t = threading.Thread(name='Test {}'.format(i), target=test_logic)
    t.start()
    time.sleep(1)
    print(t.name + ' started!')
    thread_list.append(t)

# Wait for all threads to complete
for thread in thread_list:
    thread.join()

print('Test completed!')
Here, I spawn 5 browsers to run test cases at one time. Instead of implementing the test logic, I have put in a sleep time of 2 seconds for the purpose of demonstration. The code will fire up 5 Firefox browsers (tested with Python 2.7), open Google, and wait for 2 seconds before quitting.
Logs:
C:\Python27\python.exe C:/Users/swadchan/Documents/TestPyCharm/stackoverflow/so49617485.py
Test 0 started!
Test 1 started!
Test 2 started!
Test 3 started!
Test 4 started!
Test completed!
Process finished with exit code 0
I am working on a crawl project using Selenium and WebDriver. Since the data I need to crawl is big, I want to split it across 2 threads and run them at the same time. But when I start 2 WebDrivers at the same time, my code cannot recognize which driver belongs to which thread to fill in the information.
This is the code where I set up the 2 threads that run the main function:
if __name__ == '__main__':
    data = load_data(INPUT_DIR)
    t1_data = data[:250]
    t2_data = data[250:]
    try:
        _thread.start_new_thread(main, (t1_data, LOG_FILE_T1))
        _thread.start_new_thread(main, (t2_data, LOG_FILE_T2))
    except:
        print("Error: unable to start thread")
    while 1:
        pass
This is the code where I start the WebDriver:
def start_driver():
    global driver
    options = Options()
    options.add_argument("--disable-notifications")
    options.add_argument("--disable-infobars")
    options.add_argument("--mute-audio")
    # options.add_argument("headless")
    driver = webdriver.Chrome(options=options)
After the two WebDrivers have started, I fill in the username/password information on facebook.com:
def login(email, password):
    """ Logging into our own profile """
    try:
        driver.get('https://mbasic.facebook.com')
        time.sleep(DELAY_TIME)
        driver.find_element_by_name('email').send_keys(email)
        driver.find_element_by_name('pass').send_keys(password)
        driver.find_element_by_name('login').click()
        # deal with the "Not Now" button if it shows up the first time
        not_now_button = driver.find_element_by_xpath("//a")
        if not_now_button.size != 0:
            not_now_button.click()
    except Exception as e:
        print('Error in Login')
        print(e)
        exit()
At the send_keys step, both threads fill in the same text box in one WebDriver.
How can I change my code so that the 2 threads see different WebDrivers and fill in their own information?
I just found the solution and want to share it here in case someone needs it.
Instead of a global driver, I changed to a local driver and pass it to each function.
def start_driver():
    options = Options()
    options.add_argument("--disable-notifications")
    options.add_argument("--disable-infobars")
    options.add_argument("--mute-audio")
    # options.add_argument("headless")
    driver = webdriver.Chrome(options=options)
    return driver

def login(driver, email, password):
    """ Logging into our own profile """
    try:
        driver.get('https://mbasic.facebook.com')
        time.sleep(DELAY_TIME)
        driver.find_element_by_name('email').send_keys(email)
        driver.find_element_by_name('pass').send_keys(password)
        driver.find_element_by_name('login').click()
        # deal with the "Not Now" button if it shows up the first time
        not_now_button = driver.find_element_by_xpath("//a")
        if not_now_button.size != 0:
            not_now_button.click()
    except Exception as e:
        print('Error in Login')
        print(e)
        exit()
This way, the 2 threads see different WebDrivers and fill in their own information.
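A short usage sketch of that fix (the run_session helper, thread setup, and credentials are placeholders, not from the original post): each thread builds its own driver locally and passes it along, so nothing is shared.

import threading

def run_session(email, password, data):
    driver = start_driver()          # each thread gets its own WebDriver
    login(driver, email, password)   # and passes it to every function it calls
    # ... crawl data with this driver ...
    driver.quit()

t1 = threading.Thread(target=run_session, args=('user1@example.com', 'pass1', t1_data))
t2 = threading.Thread(target=run_session, args=('user2@example.com', 'pass2', t2_data))
t1.start(); t2.start()
t1.join(); t2.join()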
I have a program that creates a multiprocessing pool to handle a web extraction job. Essentially, a list of product IDs is fed into a pool of 10 processes that handle the queue. The code is pretty simple:
import multiprocessing
import time

num_procs = 10
products = ['92765937', '20284759', '92302047', '20385473', ...]  # etc.

def worker():
    for workeritem in iter(q.get, None):
        time.sleep(10)
        get_product_data(workeritem)
        q.task_done()
    q.task_done()  # for the None sentinel

q = multiprocessing.JoinableQueue()
procs = []
for i in range(num_procs):
    procs.append(multiprocessing.Process(target=worker))
    procs[-1].daemon = True
    procs[-1].start()

for product in products:
    time.sleep(10)
    q.put(product)

q.join()

for p in procs:
    q.put(None)
q.join()

for p in procs:
    p.join()
The get_product_data() function takes the product, opens an instance of Selenium, navigates to a site, logs in, collects the details of the product, and outputs them to a csv file. The problem is that randomly (literally: it happens at different points of the site's navigation or extraction process) Selenium will stop doing whatever it's doing and just sit there, doing nothing. No exceptions are thrown. I've done everything I can in the get_product_data() function to keep this from happening, but it seems to just be a problem with Selenium (I've tried Firefox, PhantomJS, and Chrome as the driver and still run into the same problem no matter what).
Essentially, a process should never run longer than, say, 10 minutes. Is there any way to kill a process and restart it with the same product ID if it has been running longer than the specified time?
This is all running on a Debian Wheezy box with Python 2.7.
You could write your code using multiprocessing.Pool and the timeout() function suggested by @VooDooNOFX. Not tested; consider it executable pseudo-code:
#!/usr/bin/env python
import signal
from contextlib import closing
from multiprocessing import Pool

class Alarm(Exception):
    pass

def alarm_handler(*args):
    raise Alarm("timeout")

def mp_get_product_data(id, timeout=10, nretries=3):
    signal.signal(signal.SIGALRM, alarm_handler)  # XXX could move it to initializer
    for i in range(nretries):
        signal.alarm(timeout)
        try:
            return id, get_product_data(id), None
        except Alarm as e:
            timeout *= 2  # retry with increased timeout
        except Exception as e:
            break
        finally:
            signal.alarm(0)  # disable alarm, no need to restore handler
    return id, None, str(e)

if __name__ == "__main__":
    with closing(Pool(num_procs)) as pool:
        for id, result, error in pool.imap_unordered(mp_get_product_data, products):
            if error is not None:  # report and/or reschedule
                print("error: {} for {}".format(error, id))
    pool.join()
You need to ask Selenium to wait an explicit amount of time, or to wait for some DOM element to become available. Take a quick look at the Selenium docs about waits.
From the link, here's a snippet that waits up to 10 seconds for the DOM element myDynamicElement to appear.
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait  # available since 2.4.0
from selenium.webdriver.support import expected_conditions as EC  # available since 2.26.0

ff = webdriver.Firefox()
ff.get("http://somedomain/url_that_delays_loading")
try:
    element = WebDriverWait(ff, 10).until(EC.presence_of_element_located((By.ID, "myDynamicElement")))
except TimeoutException as why:
    pass  # Do something to reject this item, possibly by re-adding it to the worker queue.
finally:
    ff.quit()
If nothing is available in the given time period, a selenium.common.exceptions.TimeoutException is raised, which you can catch in a try/except block as above.
EDIT
Another option is to ask multiprocessing to time the process out after some amount of time. This is done using the built-in signal library. Here's an excellent example of doing this; however, it's still up to you to add the item back into the work queue when you detect that a process has been killed. You can do this in the handler section of the code.
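For completeness, a hedged sketch of a process-level timeout (run_with_timeout and MAX_SECONDS are illustrative names, and get_product_data stands in for the question's function): give each product its own Process, join with a timeout, then terminate and retry if it is still alive.

from multiprocessing import Process

MAX_SECONDS = 600  # never let one product run longer than 10 minutes

def get_product_data(product_id):
    ...  # placeholder for the Selenium extraction from the question

def run_with_timeout(product_id):
    p = Process(target=get_product_data, args=(product_id,))
    p.start()
    p.join(MAX_SECONDS)      # wait at most MAX_SECONDS
    if p.is_alive():         # still running after the timeout
        p.terminate()        # kill the stuck worker (this can orphan its browser,
        p.join()             # so get_product_data should quit the driver in a finally)
        return False         # tell the caller to re-queue product_id
    return True

if __name__ == '__main__':
    for product_id in ['92765937', '20284759']:
        while not run_with_timeout(product_id):  # retry until it finishes in time
            print('retrying', product_id)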