Python undetectable_webdriver won't open in a loop - python

I am trying to open a site multiple times in a loop to test if different credentials have expired so that I can notify our users. I'm achieving this by opening the database, getting the records, calling the chrome driver to open the site, and inputting the values into the site. The first loop works but when the next one initiates the driver hangs and eventually outputs the error:
"unknown error: cannot connect to chrome at 127.0.0.1:XXXX from chrome not reachable"
This error commonly occurs when there is already an instance running. I have tried to prevent this by using both driver.close() and driver.quit() when the first loop is done but to no avail. I have taken care of all other possibilities of detection such as using proxies, different user agents, and also using the undetected_chromedriver by https://github.com/ultrafunkamsterdam/undetected-chromedriver.
The core issue I am looking to solve is being able to open an instance of the chrome driver, close it and open it back again all in the same execution loop until all the credentials I am testing have finished. I have abstracted the code and provided an isolated version that replicates the issue:
# INSTALL CHROMEDRIVER USING "pip install undetected-chromedriver"
import undetected_chromedriver.v2 as uc
# Python Libraries
import time
options = uc.ChromeOptions()
options.add_argument('--no-first-run')
driver = uc.Chrome(options=options)
length = 8
count = 0
if count < length:
    print("Im outside the loop")
while count < length:
    print("This is loop ",count)
    time.sleep(2)
    with driver:
        print("Im inside the loop")
        count += 1
        driver.get("https://google.com")
        time.sleep(5)
        print("Im at the end of the loop")
    driver.quit() # Used to exit the browser, and end the session
    # driver.close() # Only closes the window in focus
I recommend using a python virtualenv to keep packages consistent. I am using python3.9 on a Linux machine. Any solutions, suggestions, or workarounds would be greatly appreciated.

You are quitting your driver inside the loop and then trying to use it again on the next iteration. By then the executor address no longer exists, hence your error. You need to reinitialize the driver on every iteration by moving its creation down into the while loop, before the with statement.
import undetected_chromedriver as uc
# Python Libraries
import time
chroptions = uc.ChromeOptions()
chroptions.add_argument('--no-first-run')
# driver = uc.Chrome(options=chroptions)
length = 8
count = 0
if count < length:
    print("Im outside the loop")
while count < length:
    print("This is loop ",count)
    driver = uc.Chrome(options=chroptions) # a new driver for every iteration
    time.sleep(2)
    with driver:
        print("Im inside the loop")
        count += 1
        driver.get("https://google.com")
        print("Session ID: ", end='') #added to show your session ID is changing
        print(driver.session_id)
    driver.quit()
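
The same idea can be written so that the browser is always shut down, even when a page load fails. This is only a sketch, not the library's recommended pattern: check_all, make_uc_driver, and the driver_factory parameter are names made up for illustration. It also builds a fresh ChromeOptions object for each driver, on the assumption (reported for some undetected_chromedriver releases) that an options object cannot be reused.

```python
import time


def check_all(credentials, driver_factory, url="https://google.com", pause=0.0):
    """Open a fresh browser per credential and always quit it afterwards."""
    session_ids = []
    for cred in credentials:
        driver = driver_factory()  # new driver (and new session) on every iteration
        try:
            driver.get(url)
            # ... fill in the login form with `cred` here ...
            session_ids.append(driver.session_id)
            time.sleep(pause)
        finally:
            driver.quit()  # runs even if get() raises, so no browser is left behind
    return session_ids


def make_uc_driver():
    """Hypothetical factory: build fresh options together with each driver."""
    import undetected_chromedriver as uc  # imported lazily; only needed for real runs

    options = uc.ChromeOptions()
    options.add_argument("--no-first-run")
    return uc.Chrome(options=options)
```

With real browsers this would be called as check_all(records, make_uc_driver, pause=2), where records is whatever list the database query returns. The session IDs it returns should all differ, which confirms that each iteration got its own driver.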

Related

Looking for a timeout function in Python that will run on Windows

I've got a Python program on Windows 10 that runs a loop thousands of times with multiple functions that can sometimes hang and stop the program execution. Some are IO functions, and some are selenium webdriver functions. I'm trying to build a mechanism that will let me run a function, then kill that function after a specified number of seconds and try it again if that function didn't finish. If the function completes normally, let the program execution continue without waiting for the timeout to finish.
I've looked at at least a dozen different solutions, and can't find something that fits my requirements. Many require SIGNALS which is not available on Windows. Some spawn processes or threads which consume resources that can't easily be released, which is a problem when I'm going to run these functions thousands of times. Some work for very simple functions, but fail when a function makes a call to another function.
The situations this must work for:
Must run on Windows 10
A "driver.get" command for selenium webdriver to read a web page
A function that reads from or writes to a text file
A function that runs an external command (like checking my IP address or connecting to a VPN server)
I need to be able to specify a different timeout for each of these situations. A file write should take < 2 seconds, whereas a VPN server connection may take 20 seconds.
I've tried the following function libraries:
timeout-decorator 0.5.0
wrapt-timeout-decorator 1.3.1
func-timeout 4.3.5
Here is a trimmed version of my program that includes the functions I need to wrap in a timeout function:
import csv
import time
from datetime import date
from selenium import webdriver
import urllib.request
cities = []
total_cities = 0
city = ''
city_counter = 0
results = []
temp = ''
temp2 = 'IP address not found'
driver = None
if __name__ == '__main__':
    #Read city list
    with open('citylist.csv') as csvfile:
        readCity = csv.reader(csvfile, delimiter='\n');
        for row in csvfile:
            city = row.replace('\n','')
            cities.append(city.replace('"',''))
    #Get my IP address
    try:
        temp = urllib.request.urlopen('http://checkip.dyndns.org')
        temp = str(temp.read())
        found = temp.find(':')
        found2 = temp.find('<',found)
    except:
        pass
    if (temp.find('IP Address:') > -1):
        temp2 = temp[found+2:found2]
    print(' IP: [',temp2,']\n',sep='')
    total_cities = len(cities)
    ## Open browser for automation
    try: driver.close()
    except AttributeError: driver = None
    options = webdriver.ChromeOptions()
    options.add_experimental_option("excludeSwitches", ["enable-logging"])
    driver = webdriver.Chrome(options=options)
    #Search for links
    while (city_counter < total_cities):
        city = cities[city_counter]
        searchTerm = 'https://www.bing.com/search?q=skate park ' + city
        ## Perform search using designated search term
        driver.get(searchTerm)
        haystack = driver.page_source
        driver.get(searchTerm)
        found = 0
        found2 = 0
        while (found > -1):
            found = haystack.find('<a href=',found2)
            found2 = haystack.find('"',found+10)
            if (haystack[found+9:found+13] == 'http'):
                results.append(haystack[found+9:found2])
        city_counter += 1
    driver.close()
    counter = 0
    while counter < len(results):
        print(counter,': ',results[counter],sep='')
        counter += 1
The citylist.csv file:
"Oakland, CA",
"San Francisco, CA",
"San Jose, CA"

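One possible direction, sketched here under assumptions and not as a full answer, is to give each operation its own timeout and wrap it in a small retry helper, instead of trying to kill an arbitrary Python function. The helper names retry and run_command are made up for illustration. subprocess.run accepts a timeout and kills the child process when it expires, which works on Windows without signals.

```python
import subprocess
import sys
import time


def retry(func, attempts=3, delay=0.0, exceptions=(Exception,)):
    """Call func() until it returns; re-raise the last error after `attempts` tries."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise
            time.sleep(delay)


def run_command(args, timeout):
    """Run an external command; the child is killed if `timeout` seconds pass."""
    return subprocess.run(args, capture_output=True, text=True, timeout=timeout)


if __name__ == "__main__":
    # Example: a command that should finish well within 20 seconds.
    result = retry(
        lambda: run_command([sys.executable, "-c", "print('connected')"], timeout=20),
        attempts=3,
        exceptions=(subprocess.TimeoutExpired,),
    )
    print(result.stdout.strip())
```

The same retry helper could wrap the other cases: driver.set_page_load_timeout(seconds) makes driver.get raise Selenium's TimeoutException, and urllib.request.urlopen accepts a timeout argument. Plain file reads and writes have no built-in timeout, so this sketch does not cover them.
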
Selenium Python script has different behavior on Windows and Ubuntu environments

I've tried running a script on Windows and on Ubuntu, both using Python 3 and the latest versions of geckodriver, resulting in differing behavior. The full script is given below.
I'm trying to get the data for several different tests from a test prep site. There are different subjects, each of which has a specialization, each of which has a practice-test, each of which has several questions. The scrape function walks through the steps to get data of each type.
subject <--- specialization <---- practice-test *------ question
The get_questions function is where the difference shows up:
In Windows, it behaves as expected. After the last question's choice is clicked, it goes on to a results page.
In Ubuntu, when a choice is clicked on the last question, it reloads the last question and keeps clicking the same choice and reloading the same question.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pathlib
import time
import json
import os
driver=webdriver.Firefox(executable_path="./geckodriver.exe")
wait = WebDriverWait(driver, 15)
data=[]
def setup():
    driver.get('https://www.varsitytutors.com/practice-tests')
    try:
        go_away_1= driver.find_element_by_class_name("ub-emb-iframe")
        driver.execute_script("arguments[0].style.visibility='hidden'", go_away_1)
        go_away_2= driver.find_element_by_class_name("ub-emb-iframe-wrapper")
        driver.execute_script("arguments[0].style.visibility='hidden'", go_away_2)
        go_away_3= driver.find_element_by_class_name("ub-emb-visible")
        driver.execute_script("arguments[0].style.visibility='hidden'", go_away_3)
    except:
        pass
def get_subjects(subs=[]):
    subject_clickables_xpath="/html/body/div[3]/div[9]/div/*/div[@data-subject]/div[1]"
    subject_clickables=driver.find_elements_by_xpath(subject_clickables_xpath)
    subject_names=map(lambda x : x.find_element_by_xpath('..').get_attribute('data-subject'), subject_clickables)
    subject_pairs=zip(subject_names, subject_clickables)
    return subject_pairs
def get_specializations(subject):
    specialization_clickables_xpath="//div//div[@data-subject='"+subject+"']/following-sibling::div//div[@class='public_problem_set']//a[contains(.,'Practice Tests')]"
    specialization_names_xpath="//div//div[@data-subject='"+subject+"']/following-sibling::div//div[@class='public_problem_set']//a[contains(.,'Practice Tests')]/../.."
    specialization_names=map(lambda x : x.get_attribute('data-subject'), driver.find_elements_by_xpath(specialization_names_xpath))
    specialization_clickables = driver.find_elements_by_xpath(specialization_clickables_xpath)
    specialization_pairs=zip(specialization_names, specialization_clickables)
    return specialization_pairs
def get_practices(subject, specialization):
    practice_clickables_xpath="/html/body/div[3]/div[8]/div[3]/*/div[1]/a[1]"
    practice_names_xpath="//*/h3[@class='subject_header']"
    lengths_xpath="/html/body/div[3]/div[8]/div[3]/*/div[2]"
    lengths=map(lambda x : x.text, driver.find_elements_by_xpath(lengths_xpath))
    print(lengths)
    practice_names=map(lambda x : x.text, driver.find_elements_by_xpath(practice_names_xpath))
    practice_clickables = driver.find_elements_by_xpath(practice_clickables_xpath)
    practice_pairs=zip(practice_names, practice_clickables)
    return practice_pairs
def remove_popup():
    try:
        button=wait.until(EC.element_to_be_clickable((By.XPATH,"//button[contains(.,'No Thanks')]")))
        button.location_once_scrolled_into_view
        button.click()
    except:
        print('could not find the popup')
def get_questions(subject, specialization, practice):
    remove_popup()
    questions=[]
    current_question=None
    while True:
        question={}
        try:
            WebDriverWait(driver,5).until(EC.presence_of_element_located((By.XPATH,"/html/body/div[3]/div[7]/div[1]/div[2]/div[2]/table/tbody/tr/td[1]")))
            question_number=driver.find_element_by_xpath('/html/body/div[3]/div[7]/div[1]/div[2]/div[2]/table/tbody/tr/td[1]').text.replace('.','')
            question_pre=driver.find_element_by_class_name('question_pre')
            question_body=driver.find_element_by_xpath('/html/body/div[3]/div[7]/div[1]/div[2]/div[2]/table/tbody/tr/td[2]/p')
            answer_choices=driver.find_elements_by_class_name('question_row')
            answers=map(lambda x : x.text, answer_choices)
            question['id']=question_number
            question['pre']=question_pre.text
            question['body']=question_body.text
            question['answers']=list(answers)
            questions.append(question)
            choice=WebDriverWait(driver,20).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR,"input.test_button")))
            driver.execute_script("arguments[0].click();", choice[3])
            time.sleep(3)
        except Exception as e:
            if 'results' in driver.current_url:
                driver.get(driver.current_url.replace('http://', 'https://'))
                # last question has been answered; record results
                remove_popup()
                pathlib.Path('data/'+subject+'/'+specialization).mkdir(parents=True, exist_ok=True)
                with open('data/'+subject+'/'+specialization+'/questions.json', 'w') as outfile:
                    json.dump(list(questions), outfile)
                break
            else:
                driver.get(driver.current_url.replace('http://', 'https://'))
    return questions
def scrape():
    setup()
    subjects=get_subjects()
    for subject_name, subject_clickable in subjects:
        subject={}
        subject['name']=subject_name
        subject['specializations']=[]
        subject_clickable.click()
        subject_url=driver.current_url.replace('http://', 'https://')
        specializations=get_specializations(subject_name)
        for specialization_name, specialization_clickable in specializations:
            specialization={}
            specialization['name']=specialization_name
            specialization['practices']=[]
            specialization_clickable.click()
            specialization_url=driver.current_url.replace('http://', 'https://')
            practices=get_practices(subject_name, specialization_name)
            for practice_name, practice_clickable in practices:
                practice={}
                practice['name']=practice_name
                practice_clickable.click()
                questions=get_questions(subject_name, specialization_name, practice_name)
                practice['questions']=questions
                driver.get(specialization_url)
            driver.get(subject_url)
        data.append(subject)
    print(data)
scrape()
Can anyone help me figure out what may be causing this?
It's just timing. The last question will take much longer than the 3 second sleep until it loads the next page. Waiting for the page to be gone fixes this and speeds up the script execution.
from selenium.common.exceptions import StaleElementReferenceException
<snip>
choice=WebDriverWait(driver,20).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR,"input.test_button")))
choice[3].click()
try:
    while choice[3].is_displayed():
        time.sleep(1)
except StaleElementReferenceException as e:
    continue
The first problem here is that you are using an Exception to break the loop. The proper way is to use a condition test: for example, continue the loop while 'results' is not in the URL, and break out of it otherwise. The Exception can then serve as a backup step.
The second is that just using sleep to wait for the results page is not enough. You need to test for the presence of an element on the results page, or you could just watch for the title change:
wait = WebDriverWait(driver, 10)
wait.until(EC.title_contains("Results"))
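
To make both points concrete without needing a browser, here is a rough sketch in plain Python. wait_until mimics the idea behind an explicit wait (poll a condition until it holds or a deadline passes); it is not Selenium's implementation. answer_all shows a loop that is driven by a condition instead of by an exception. Both function names and the get_url / answer_question callables are assumptions made for illustration.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Poll condition() until it returns a truthy value or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll)


def answer_all(get_url, answer_question, max_questions=100):
    """Answer questions until the URL shows the results page."""
    answered = 0
    # The loop condition, not an exception, decides when to stop.
    while "results" not in get_url() and answered < max_questions:
        answer_question()
        answered += 1
    return answered
```

In the scraper, get_url would be something like lambda: driver.current_url, and answer_question would click a choice and then wait for the old page to go away before returning.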

Selenium create multiple browser sessions

I want to create multiple browser sessions and log in with different accounts. The code below does what I want, but all the browsers close after the for loop ends. My guess is that Python ends all processes after the focus is gone. How can I solve the problem? With multithreading?
I want every session to stay open for 60 seconds.
def playroutine():
    index = 0
    for i in range(len(getlogindata())):
        username, password = givemelogin(index)
        index += 1
        driver = webdriver.Chrome('/Users/fb/Documents/chromedriver') # Optional argument, if not specified will search path.
        driver.get('[...]')
        driver.find_element_by_name("username").send_keys(username)
        driver.find_element_by_name("password").send_keys(password)
        driver.find_element_by_id("login-button").click()
        time.sleep(2)
        driver.get('[...]')
Thanks :)
After the loop ends, the driver variable refers only to the last browser that was created, so you can't use it to close all of the browsers.
You can, however, close the drivers inside the loop, one at a time:
def playroutine():
    index = 0
    for i in range(len(getlogindata())):
        username, password = givemelogin(index)
        index += 1
        driver = webdriver.Chrome('/Users/fb/Documents/chromedriver') # Optional argument, if not specified will search path.
        driver.get('[...]')
        driver.find_element_by_name("username").send_keys(username)
        driver.find_element_by_name("password").send_keys(password)
        driver.find_element_by_id("login-button").click()
        time.sleep(2)
        # close the driver
        driver.close()
        driver.quit()
Or, you can keep track of the drivers in a list, and try to loop through them and close them -- this is a bit hacky, and I can't say I would recommend it:
def playroutine():
    driver_list = []
    index = 0
    for i in range(len(getlogindata())):
        username, password = givemelogin(index)
        index += 1
        driver = webdriver.Chrome('/Users/fb/Documents/chromedriver') # Optional argument, if not specified will search path.
        # add this driver to your list to keep track of it
        driver_list.append(driver)
        driver.get('[...]')
        driver.find_element_by_name("username").send_keys(username)
        driver.find_element_by_name("password").send_keys(password)
        driver.find_element_by_id("login-button").click()
        time.sleep(2)
        driver.get('[...]')
    # for loop is finished -- close all drivers
    for driver in driver_list:
        driver.close()
        driver.quit()
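
A variation on the list approach, offered only as a sketch, is to let contextlib.ExitStack from the standard library keep track of the drivers. Each quit is registered as soon as its browser exists, so every browser is closed even if a later login raises. open_sessions, driver_factory, and login are illustrative names, and hold_seconds=60 reflects the 60 seconds asked for in the question.

```python
import time
from contextlib import ExitStack


def open_sessions(logins, driver_factory, login, hold_seconds=60):
    """Log in once per account, keep every browser open, then quit them all."""
    with ExitStack() as stack:
        for username, password in logins:
            driver = driver_factory()
            stack.callback(driver.quit)  # registered immediately, so it always runs
            login(driver, username, password)
        time.sleep(hold_seconds)  # all sessions stay open together
    # Leaving the with-block calls every registered quit(), newest first.
```

Here driver_factory could be lambda: webdriver.Chrome('/Users/fb/Documents/chromedriver'), and login would hold the send_keys and click calls from the question.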

How to reuse a selenium driver instance during parallel processing?

To scrape a pool of URLs, I am parallel processing selenium with joblib. In this context, I am facing two challenges:
Challenge 1 is to speed up this process. At the moment, my code opens and closes a driver instance for every URL (ideally there would be one for every process).
Challenge 2 is to get rid of the CPU-intensive while loop that I think I need in order to continue on empty results (I know that this is most likely wrong).
Pseudocode:
URL_list = [URL1, URL2, URL3, ..., URL100000]  # List of URLs to be scraped
def scrape(URL):
    while True:  # Loop needed to use continue
        try:  # Try scraping
            driver = webdriver.Firefox(executable_path=path)  # Set up driver
            website = driver.get(URL)  # Get URL
            results = do_something(website)  # Get results from URL content
            driver.close()  # Close worker
            if len(results) == 0:  # If do_something() failed:
                continue  # THEN Worker to skip URL
            else:  # If do_something() worked:
                safe_results("results.csv")  # THEN Save results
                break  # Go to next worker/URL
        except Exception as e:  # If something weird happens:
            save_exception(URL, e)  # THEN Save error message
            break  # Go to next worker/URL
Parallel(n_jobs = 40)(delayed(scrape)(URL) for URL in URL_list)  # Run in 40 processes
My understanding is that in order to re-use a driver instance across iterations, the # Set up driver-line needs to be placed outside scrape(URL). However, everything outside scrape(URL) will not find its way to joblib's Parallel(n_jobs = 40). This would imply that you can't reuse driver instances while scraping with joblib which can't be true.
Q1: How to reuse driver instances during parallel processing in the above example?
Q2: How to get rid of the while-loop while maintaining functionality in the above-mentioned example?
Note: Flash and image loading is disabled in firefox_profile (code not shown)
1) You should first create a set of drivers, one for each process, and pass an instance to each worker. I don't know how to pass drivers to a Parallel object, but you could use threading.current_thread().name as a key to identify drivers. To do that, use backend="threading". Each thread will then have its own driver.
2) You don't need a loop at all. The Parallel object itself iterates over all your URLs (I hope I have understood your reason for using a loop correctly).
import threading
from joblib import Parallel, delayed
from selenium import webdriver
def scrape(URL):
    try:
        driver = drivers[threading.current_thread().name]
    except KeyError:
        drivers[threading.current_thread().name] = webdriver.Firefox()
        driver = drivers[threading.current_thread().name]
    driver.get(URL)
    results = do_something(driver)
    if results:
        safe_results("results.csv")
drivers = {}
Parallel(n_jobs=-1, backend="threading")(delayed(scrape)(URL) for URL in URL_list)
for driver in drivers.values():
    driver.quit()
But I don't really think you gain anything by setting n_jobs higher than the number of CPUs you have, so n_jobs=-1 is probably the best choice (of course I may be wrong; try it).
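
The same one-driver-per-thread idea can also be sketched with the standard library alone, swapping joblib for concurrent.futures.ThreadPoolExecutor and the name-keyed dict for threading.local. This is an alternative under stated assumptions, not a drop-in replacement: scrape_all, driver_factory, and scrape_one are illustrative names.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def scrape_all(urls, driver_factory, scrape_one, workers=4):
    """Scrape urls with a fixed thread pool; each thread reuses one driver."""
    local = threading.local()  # per-thread storage for that thread's driver
    created = []               # every driver created, kept for cleanup
    lock = threading.Lock()

    def task(url):
        driver = getattr(local, "driver", None)
        if driver is None:  # first URL handled by this thread
            driver = driver_factory()
            local.driver = driver
            with lock:
                created.append(driver)
        return scrape_one(driver, url)

    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(task, urls))  # results keep the order of urls
    finally:
        for driver in created:
            driver.quit()
```

With Selenium, driver_factory could simply be webdriver.Firefox, and scrape_one(driver, url) would call driver.get(url) and return whatever do_something extracts. At most `workers` browsers are ever started, however many URLs there are.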

Selenium won't open a new url in for loop (Python & Chrome)

I can't seem to get Selenium to run this for loop correctly. It runs the first time without issue, but when it starts the second iteration the program just stops running with no error message. I get the same results when I attempt this with a Firefox browser. Maybe it has to do with me trying to start a browser instance when one is already running?
def bookroom(self):
sessionrooms=["611", "618"] #the list being used by the for loop
driver = webdriver.Firefox()
#for loop trying each room
for rooms in sessionrooms:
room=roomoptions[rooms][0]
sidenum=roomoptions[rooms][1]
bookingurl="https://coparooms.newschool.edu/copa-scheduler/Web/reservation.php?rid="+room+"&sid="+sidenum+"&rd="+self.startdate
driver.get(bookingurl)
time.sleep(3)
usernamefield = driver.find_element_by_id("email")
usernamefield.send_keys(self.username)
passwordfield = driver.find_element_by_id("password")
passwordfield.send_keys(self.password)
passwordfield.send_keys(Keys.RETURN)
time.sleep(5)
begin=Select(driver.find_element_by_name("beginPeriod"))
print(self.starttime)
begin.select_by_visible_text(convertarmy(self.starttime))
end=Select(driver.find_element_by_name("endPeriod"))
end.select_by_visible_text(convertarmy(self.endtime))
creates=driver.find_element_by_xpath("//button[contains(.,'Create')]")
creates.click() #clicks the confirm button
time.sleep(8)
xpathtest=driver.find_element_by_id("result").text
#if statement checks if creation was a success. If it is, exit the browser.
if "success" in xpathtest or "Success" in xpathtest:
print "Success!"
driver.exit()
else:
print "Failure"
time.sleep(2)
#if creation was not success try the next room in sessionrooms
Update:
I found the problem, it was just a matter of uneven spacing. Only some of the loop was "in the loop".
if "success" in xpathtest or "Success" in xpathtest:
    print "Success!"
    driver.exit()
You will want to break out of your loop here, as you're closing the driver but using it in the next iteration of the loop without starting a new driver object.
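
A minimal sketch of that control flow, with the booking steps hidden behind a hypothetical try_booking(driver, room) callable that returns True on success. book_first_available and driver_factory are also illustrative names. The sketch calls quit(), which is the Selenium WebDriver method for ending a session.

```python
def book_first_available(rooms, driver_factory, try_booking):
    """Try each room in turn with one browser; stop at the first success."""
    driver = driver_factory()
    booked = None
    try:
        for room in rooms:
            if try_booking(driver, room):
                booked = room
                break  # stop looping; the driver is about to be closed
    finally:
        driver.quit()  # close the browser exactly once, success or not
    return booked
```

Because the driver is closed only after the loop has ended, no iteration can ever run against a browser that has already been shut down.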
