I use Selenium Chrome to extract information from online sources. Basically, I loop over a list of URLs (stored in mylinks) and load the webpages in the browser as follows:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("window-size=1200,800")
browser = webdriver.Chrome(options=options)
browser.implicitly_wait(30)

for x in mylinks:
    try:
        browser.get(x)
        soup = BeautifulSoup(browser.page_source, "html.parser")
        city = soup.find("div", {"class": "city"}).text
    except:
        continue
My problem is that the browser "freezes" at some point. I know that this problem is caused by the webpage. As a consequence, my routine stops, since the browser no longer responds. browser.implicitly_wait(30) does not help here either; neither explicit nor implicit waits solve the problem.
I want to "timeout" the problem, meaning that I want to quit() the browser after x seconds (in case the browser freezes) and restart it.
I know that I could use a subprocess with a timeout, like:
import subprocess

def startprocess(filepath, waitingtime):
    p = subprocess.Popen("C://mypath//" + filepath)
    try:
        p.wait(waitingtime)
    except subprocess.TimeoutExpired:
        p.kill()
However, for my task this solution would be second-best.
Question: is there an alternative way to timeout the browser.get(x) step in the loop above (in case the browser freezes) and to continue to the next step?
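To make the intent concrete, here is a minimal sketch of the pattern I have in mind, using set_page_load_timeout() and recreating the browser whenever a load times out (make_browser is just a placeholder helper, not code I have working):
def make_browser():
    options = Options()
    options.add_argument("window-size=1200,800")
    b = webdriver.Chrome(options=options)
    b.set_page_load_timeout(30)  # give up on get() after 30 seconds
    return b

browser = make_browser()
for x in mylinks:
    try:
        browser.get(x)
    except TimeoutException:
        # the session may be unusable after a timeout, so restart cleanly
        browser.quit()
        browser = make_browser()
        continue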
I am trying to get logs from Chrome's console using Selenium with Python, and I am a little bit lost because it is my first experience with it.
In general, the code works and prints logs, but I need to see some specific events that I normally see by typing a command like _newsb.getEv.getBuffer() (it is an imaginary command; I am using something similar). With Selenium, I try to run it like this: driver.execute_script("_newsb.getEv.getBuffer()"), but I don't see the events.
Here is the code:
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
import time

dc = DesiredCapabilities.CHROME
dc['goog:loggingPrefs'] = {'browser': 'ALL'}
driver = webdriver.Chrome(desired_capabilities=dc, service_args=["--verbose", "--log-path=D:\\qc1.log"])
driver.implicitly_wait(30)
driver.get('https://www.somewebsite.com/')
page = driver.find_element_by_tag_name("html")
driver.execute_script("window.scrollTo(0, 400)")
time.sleep(3)
driver.execute_script("_newsb.getEv.getBuffer()")
time.sleep(3)
driver.execute_script("window.scrollTo(0, window.scrollY + 400)")
time.sleep(3)
driver.execute_script("window.scrollTo(0, window.scrollY + 400)")
time.sleep(1)
for entry in driver.get_log('browser'):
    print(entry)
driver.quit()
Could someone please point out what I am doing wrong, and how do you type a command into the console with Selenium and Python? Any advice is appreciated.
Also, how do I see all logs, including Info and Verbose, not just errors?
from time import sleep
from selenium import webdriver

options = webdriver.ChromeOptions()
options.set_capability("goog:loggingPrefs", {"browser": "ALL"})  # old capability name: loggingPrefs
driver = webdriver.Chrome(
    options=options
)

driver.execute_script("window.open('');")
driver.switch_to.window(driver.window_handles[1])
driver.get('http://www.google.com')
driver.execute_script("console.error('This is error')")
driver.execute_script("console.info('This is info')")
driver.execute_script("console.log('This is log')")
# get_log returns a list of all console events collected so far
logs = driver.get_log("browser")
print(logs)
sleep(100000)
This prints all three types of log entries. I am not sure if this is what you are looking for.
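One more detail worth checking against the original snippet: execute_script only hands a value back to Python if the JavaScript returns it, so calling the buffer command without return discards the result. A minimal sketch, assuming _newsb.getEv.getBuffer() returns a JSON-serializable value:
buffer = driver.execute_script("return _newsb.getEv.getBuffer();")
print(buffer)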
Currently having quite the issue with Selenium.
I am trying to get all the links on a page, click each one, obtain the data from the page, and go back. Even when using the StaleElementReferenceException handler, it completely breaks the loop, despite using driver.back() as is advised.
The code is as follows:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver import ActionChains
from selenium.webdriver.common.keys import Keys
from datetime import datetime
from pymongo import MongoClient
from selenium.common.exceptions import StaleElementReferenceException

options = Options()
options.page_load_strategy = 'none'
# options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
url = "https://www.depop.com/purevintage_clothing/"
driver.get(url)

# links is populated earlier; that part of the code is omitted here
for link in links:
    linkClass = link.get_attribute("class")
    try:
        if str(linkClass[:19]) == "styles__ProductCard":
            action = ActionChains(driver)
            action.move_to_element(link)
            action.click().perform()
            product = doSomethingFunction()
            if product != None:
                insertIntoDatabase(product)
            driver.back()
    except StaleElementReferenceException as e:
        print(e)
        driver.back()
I am aware the indentation is a bit dodgy here; I wrote this out manually, as I don't think the rest of the processing code, such as insertIntoDatabase, is relevant here (please let me know if you need all of it).
Whenever I do this, I end up with the exception raised on every iteration of the loop, despite the driver.back(). I'm sure the answer is staring me in the face and I'm a bit too dense to see it, but any help is appreciated.
Every time you go back to the main page you need to find the links again, because the elements you found before are no longer attached to the DOM once you have changed page. So you should do as follows:
links = driver.find_elements_by_xpath(path_to_elements)
for i in range(len(links)):
    # re-find the elements on each iteration, since navigating invalidates them
    link = driver.find_elements_by_xpath(path_to_elements)[i]
    linkClass = link.get_attribute("class")
    if str(linkClass[:19]) == "styles__ProductCard":
        action = ActionChains(driver)
        action.move_to_element(link)
        action.click().perform()
        product = doSomethingFunction()
        if product != None:
            insertIntoDatabase(product)
        driver.back()
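If the listing page takes a moment to restore after driver.back(), the re-find can still come up empty or stale. An explicit wait before re-querying may help; a small sketch under that assumption, reusing the same path_to_elements placeholder:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

driver.back()
# wait up to 10 seconds for the product cards to be present again
WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.XPATH, path_to_elements))
)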
I am writing a script that will check if a proxy is working. The program should:
1. Load a proxy from the list (a txt file).
2. Go to any page (for example Wikipedia).
3. If the page has loaded (even partially), save the proxy data to another txt file.
It must all run in a loop, and it must also check whether the browser has displayed an error. My problem is reliably closing the previous browser on every iteration; after several loops, several browsers are open at once.
P.S. I replaced the iteration with a random number.
from selenium import webdriver
import random
from configparser import ConfigParser
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import traceback

while 1:
    ini = ConfigParser()
    ini.read('liczba_proxy.ini')
    liczba_losowa = random.randint(1, 999)
    f = open('user-agents.txt')
    lines = f.readlines()
    user_agent = lines[liczba_losowa].strip()  # strip the trailing newline
    s = open('proxy_list.txt')
    proxy = s.readlines()
    i = ini.getint('liczba', 'liczba')
    prefs = {"profile.managed_default_content_settings.images": 2}
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--proxy-server=%s' % proxy[liczba_losowa].strip())
    chrome_options.add_argument(f'user-agent={user_agent}')
    chrome_options.add_experimental_option("prefs", prefs)
    driver = webdriver.Chrome(options=chrome_options, executable_path=r'C:\Python\Driver\chromedriver.exe')
    driver.get('https://en.wikipedia.org/wiki/Replication_error_phenotype')

    def error_catching():
        print("error")
        driver.stop_client()
        traceback.print_stack()
        traceback.print_exc()
        return False

    def presence_of_element(driver, timeout=5):
        try:
            w = WebDriverWait(driver, timeout)
            w.until(EC.presence_of_element_located((By.ID, 'siteNotice')))
            print('work')
            driver.stop_client()
            return True
        except:
            print('not working')
            driver.stop_client()
            error_catching()
Without commenting on your code design:
In order to close a driver instance, use driver.close() or driver.quit() instead of your driver.stop_client().
The first one closes the browser window on which the focus is set.
The second one basically closes all the browser windows and ends the WebDriver session gracefully.
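As a sketch of how that fits the loop above (reusing the question's chrome_options and WebDriverWait imports), quitting in a finally block guarantees that each iteration's browser is closed even when the check fails:
driver = webdriver.Chrome(options=chrome_options)
try:
    driver.get('https://en.wikipedia.org/wiki/Replication_error_phenotype')
    w = WebDriverWait(driver, 5)
    w.until(EC.presence_of_element_located((By.ID, 'siteNotice')))
    print('work')
except Exception:
    print('not working')
finally:
    # runs whether the wait succeeded or failed,
    # so stray browser windows cannot accumulate across iterations
    driver.quit()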
Use
driver.quit()
(your chrome_options object has no quit() method; quitting is done on the driver).
Note: I'm pretty sure you should not structure tests like that... "while 1"? So your test will never end?
I guess you should set up your tests as TestCases and call the test suite to run all your test cases and give you feedback about what passed or not, and maybe set up a cron job to keep calling it from time to time.
Here is a simple example of mine using test cases with Django and Splinter (Splinter is built on top of Selenium):
https://github.com/Diegow3b/python-django-basictestcase/blob/master/myApp/tests/test_views.py
I would like to refresh a page if the loading time exceeds my expectation, so I planned to use the existing function set_page_load_timeout(time_to_wait). But it turns out that, after the timeout fires, calls to driver.get() no longer seem to work.
I've written a simple program below and hit the problem.
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
import time

driver = webdriver.Chrome()
time.sleep(5)
driver.set_page_load_timeout(2)
try:
    driver.get("https://aws.amazon.com/")
except TimeoutException as e:
    print(str(e))
    driver.set_page_load_timeout(86400)
time.sleep(5)
print("open page")
driver.get("https://aws.amazon.com/")
print("page loaded")
The environment info:
chrome=67.0.3396.99
chromedriver=2.40.565386 (45a059dc425e08165f9a10324bd1380cc13ca363),platform=Mac OS X 10.13.4 x86_64
Selenium Version: 3.12.0
What you are seeing is an unfortunate situation: when a get/navigation call times out, the connection to the browser is no longer stable, so you cannot operate on the browser reliably afterwards.
The only workaround that exists as of now is to disable the pageLoadStrategy, but then you lose a lot of good perks that automatically wait for the page load on get and click operations:
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
import time
from selenium.webdriver import DesiredCapabilities
cap = DesiredCapabilities.CHROME
cap["pageLoadStrategy"] = "none"
driver = webdriver.Chrome(desired_capabilities=cap)
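Since "none" makes driver.get() return immediately, you then have to wait for the parts of the page you actually need yourself. A minimal sketch of that manual wait (the By.TAG_NAME target is just a placeholder for whatever element matters to you):
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

driver.get("https://aws.amazon.com/")
# wait up to 10 seconds for a known element instead of the full page load
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, "body"))
)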
I am using Selenium + Python, with implicit waits and try/except code in Python to catch errors. However, I have noticed that if the browser crashes (let's say the user closes the browser during the program's execution), my Python program will hang, and the implicit wait timeouts seem not to apply when this happens. The process below will just stay there forever.
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium import webdriver
import datetime
import time
import sys
import os

def open_browser():
    print("Opening web page...")
    driver = webdriver.Chrome()
    driver.implicitly_wait(1)
    #driver.set_page_load_timeout(30)
    return driver

driver = open_browser()  # Opens web browser

# LET'S SAY I CLOSE THE BROWSER RIGHT HERE!
# IF I CLOSE THE PROCESS HERE, THE PROGRAM WILL HANG FOREVER
time.sleep(5)

while True:
    try:
        driver.get('http://www.google.com')
        break
    except:
        driver.quit()
        driver = open_browser()
The code you have provided will always hang in the event that there is an exception while getting the Google home page.
What is probably happening is that attempting to get the Google home page raises an exception that would normally halt the program, but you are masking it with the bare except clause.
Try the following amendment to your loop.
max_attempts = 10
attempts = 0
while attempts <= max_attempts:
    try:
        print("Retrieving google")
        driver.get('http://www.google.com')
        break
    except:
        print("Retrieving google failed")
        attempts += 1
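If the browser process itself has died, driver.get() can keep raising on every attempt, so a variant that also restarts the browser inside the except may be needed. This is only a sketch building on the open_browser() helper from the question, with the quit guarded because it can itself fail on a dead session:
max_attempts = 10
attempts = 0
while attempts <= max_attempts:
    try:
        driver.get('http://www.google.com')
        break
    except Exception:
        attempts += 1
        try:
            driver.quit()  # may fail if the browser is already gone
        except Exception:
            pass
        driver = open_browser()  # start a fresh session and retry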