Issues when scraping a web page using Selenium in Python

I have been given a model for a successful web scraper on a selected website; however, when I alter it to collect data from a second website, it keeps returning an error. I'm not sure if the error is in the code or whether the website is refusing my requests. Could you please look through this and see where my issue lies? Any help hugely appreciated!
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
try:
    driver.get("http://www.caiso.com/TodaysOutlook/Pages/supply.aspx")  # load the page
    WebDriverWait(driver, 5).until(EC.presence_of_element_located((By.CSS_SELECTOR, '.highcharts-legend-item highcharts-pie-series highcharts-color-0')))  # wait till relevant elements are on the page
except:
    driver.quit()  # quit if there was an error getting the page or we've waited 15 seconds and the stats haven't appeared.

stat_elements = driver.find_elements_by_css_selector('.highcharts-legend-item highcharts-pie-series highcharts-color-0')

for el in stat_elements:
    print(el.find_element_by_css_selector('b').text)
    print(el.find_element_by_css_selector('br').text)

driver.quit()

First of all, you are passing the wrong CSS selector. For an element that carries several classes it should be
.highcharts-legend-item.highcharts-pie-series.highcharts-color-0
not the space-separated version you used.
Second, in your except block you quit the browser, and the code after the try/except then tries to use the driver (and quit it) again, which is what raises the error:
try:
    driver.get("http://www.caiso.com/TodaysOutlook/Pages/supply.aspx")  # load the page
    WebDriverWait(driver, 5).until(EC.presence_of_element_located((By.CSS_SELECTOR, '.highcharts-legend-item.highcharts-pie-series.highcharts-color-0')))  # wait till relevant elements are on the page
except:
    driver.quit()
Finally, on each list item you are fetching text with:
print(el.find_element_by_css_selector('b').text)
Debugged Code here:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.common.exceptions import NoSuchElementException
driver = webdriver.Chrome()
try:
    driver.get("http://www.caiso.com/TodaysOutlook/Pages/supply.aspx")  # load the page
    WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.CSS_SELECTOR, '.highcharts-legend-item.highcharts-pie-series.highcharts-color-0')))  # wait till relevant elements are on the page
    # driver.quit()  # quit if there was an error getting the page or we've waited 15 seconds and the stats haven't appeared.
except TimeoutException:
    pass
finally:
    try:
        stat_elements = driver.find_elements_by_css_selector('.highcharts-legend-item.highcharts-pie-series.highcharts-color-0')
        for el in stat_elements:
            for i in el.find_elements_by_tag_name('b'):
                print(i.text)
            for i in el.find_elements_by_tag_name('br'):
                print(i.text)
    except NoSuchElementException:
        print("No Such Element Found")
    driver.quit()
I hope this solves your problem; if not, let me know.

Related

Selenium not going to next page in scraper

I'm writing my first real scraper and although in general it's been going well, I've hit a wall using Selenium: I can't get it to go to the next page.
Below is the head of my code. The output below this just prints data to the terminal for now, and that's all working fine. It simply stops scraping at the end of page 1 and returns me to my terminal prompt; it never starts on page 2. I would be so grateful if anyone could make a suggestion. I've tried selecting the button at the bottom of the page I'm trying to scrape using both the relative and the full XPath (you're seeing the full one here), but neither works. I'm trying to click the right-arrow button.
I built in my own error message to indicate whether the driver successfully found the element by XPath or not. The error message fires when I execute my code, so I guess it's not finding the element. I just can't understand why not.
# Importing libraries
import requests
import csv
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup
# Import selenium
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, WebDriverException
import time
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
options.add_argument('--headless')
driver = webdriver.Chrome("/path/to/driver", options=options)
# Yes, I do have the actual path to my driver in the original code
driver.get("https://uk.eu-supply.com/ctm/supplier/publictenders?B=UK")
time.sleep(5)
while True:
    try:
        driver.find_element_by_xpath('/html/body/div[1]/div[3]/div/div/form/div[3]/div/div/ul[1]/li[4]/a').click()
    except (TimeoutException, WebDriverException) as e:
        print("A timeout or webdriver exception occurred.")
        break

driver.quit()
What you can do is set up Selenium expected conditions (visibility_of_element_located, element_to_be_clickable) and use a relative XPath to select the next-page element, all of this in a loop whose range is the number of pages you have to deal with.
XPath for the next page link:
//div[@class='pagination ctm-pagination']/ul[1]/li[last()-1]/a
The code could look like:
## imports
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.get("https://uk.eu-supply.com/ctm/supplier/publictenders?B=UK")

## count the number of pages you have
els = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[@class='pagination ctm-pagination']/ul[1]/li[last()]/a"))).get_attribute("data-current-page")

## loop; at the end of each iteration, click through to the following page
for i in range(int(els)):
    # *** scrape what you want here ***
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//div[@class='pagination ctm-pagination']/ul[1]/li[last()-1]/a"))).click()
You were pretty close with the while True and try/except logic. To go to the next page using Selenium and Python you have to induce WebDriverWait for element_to_be_clickable(), and you can use the following locator strategy:
Code Block:
driver.get("https://uk.eu-supply.com/ctm/supplier/publictenders?B=UK")
while True:
try:
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//a[contains(#class, 'state-active')]//following::li[1]/a[#href]"))).click()
print("Clicked for next page")
WebDriverWait(driver, 10).until(EC.staleness_of(driver.find_element_by_xpath("//a[contains(#class, 'state-active')]//following::li[1]/a[#href]")))
except (TimeoutException):
print("No more pages")
break
driver.quit()
Console Output:
Clicked for next page
No more pages

Unable to locate elements in Chrome page using Selenium Python

This is the HTML code:
This is my code:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
timeout = 20
driver = webdriver.Chrome()
driver.get('http://polarionprod1.delphiauto.net/polarion/#/project/10032024_MY21_FORD_PODS_SDPS_P702/wiki/10_Testing/SysTs_ATR')
wait = WebDriverWait(driver, 10)
men_menu = 0
while(men_menu==0):
    try:
        men_menu = WebDriverWait(driver, timeout).until(EC.element_to_be_clickable((By.CSS_SELECTOR, '#polarion_type_icon')))
        print(men_menu)
    except TimeoutException:
        if(men_menu==0):
            print(men_menu)
            continue
    else:
        men_menu.click()
        break
I have used try and except in a loop since the web page I am dealing with takes a lot of time to load. When I run the code, it seems to always be in the try block, where I am printing the value of men_menu to see if the element is located, but it always prints 0. That is how I confirmed the element is not getting located.
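For reference, here is a minimal sketch of the same retry idea; the capped retry count and the final None check are assumptions added here (they are not in the original code), while the URL and the #polarion_type_icon selector are taken from the question.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

driver = webdriver.Chrome()
driver.get('http://polarionprod1.delphiauto.net/polarion/#/project/10032024_MY21_FORD_PODS_SDPS_P702/wiki/10_Testing/SysTs_ATR')

men_menu = None
for attempt in range(5):  # assumption: cap the retries instead of looping while men_menu == 0
    try:
        # wait up to 20 seconds for the icon to become clickable (selector from the question)
        men_menu = WebDriverWait(driver, 20).until(
            EC.element_to_be_clickable((By.CSS_SELECTOR, '#polarion_type_icon')))
        break
    except TimeoutException:
        print("attempt", attempt + 1, "timed out, retrying")

if men_menu is not None:
    men_menu.click()
else:
    print("element never became clickable")

driver.quit()

With a bounded retry count the loop always terminates, and printing only on timeouts makes it clear which attempt, if any, finally located the element.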

How to make Selenium only click a button and nothing else? Inconsistent clicking

My goal: to scrape the number of projects done by a user on Khan Academy.
To do so I need to parse the user's profile page. But I need to click on "Show more" to see all the projects a user has done, and then scrape them.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException,StaleElementReferenceException
from bs4 import BeautifulSoup
# here is one example of a user
driver = webdriver.Chrome()
driver.get('https://www.khanacademy.org/profile/trekcelt/projects')
# to infinite click on show more button until there is none
while True:
    try:
        showmore_project = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME, 'showMore_17tx5ln')))
        showmore_project.click()
    except TimeoutException:
        break
    except StaleElementReferenceException:
        break

# parsing the profile
soup = BeautifulSoup(driver.page_source, 'html.parser')
# get a list of all the projects
project = soup.find_all(class_='title_1usue9n')
# get the number of projects
print(len(project))
This code returns 0 for print(len(project)), and that's not normal, because when you manually check https://www.khanacademy.org/profile/trekcelt/projects you can see that the number of projects is definitely not 0.
The weird thing: at first you can see (with the webdriver) that this code works fine, and then Selenium clicks on something other than the "Show more" button (one of the project links, for example), which changes the page, and that's why we get 0.
I don't understand how to correct my code so Selenium only clicks on the right button and nothing else.
Check out the following implementation to get the desired behavior. When the script is running, take a closer look at the scroll bar to see the progress.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
with webdriver.Chrome() as driver:
    wait = WebDriverWait(driver, 10)
    driver.get('https://www.khanacademy.org/profile/trekcelt/projects')
    while True:
        try:
            showmore = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '[class^="showMore"] > a')))
            driver.execute_script("arguments[0].click();", showmore)
        except Exception:
            break

    soup = BeautifulSoup(driver.page_source, 'html.parser')
    project = soup.find_all(class_='title_1usue9n')
    print(len(project))
Another way would be:
while True:
    try:
        showmore = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '[class^="showMore"] > a')))
        showmore.location_once_scrolled_into_view
        showmore.click()
        wait.until(EC.invisibility_of_element_located((By.CSS_SELECTOR, '[class^="spinnerContainer"] > img[class^="loadingSpinner"]')))
    except Exception:
        break
Output at this moment:
381
I have modified the accepted answer to improve the performance of your script. Comments on how this is achieved are in the code.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException, StaleElementReferenceException
from bs4 import BeautifulSoup
import time
start_time = time.time()
# here is one example of a user
with webdriver.Chrome() as driver:
    driver.get('https://www.khanacademy.org/profile/trekcelt/projects')
    # This code will wait until the first Show More is displayed (after the page has loaded)
    showmore_project = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME, 'showMore_17tx5ln')))
    showmore_project.click()
    # click on the show more button repeatedly until there is none
    while True:
        try:
            # We will retrieve and click until we no longer find the element.
            # NoSuchElementException will be raised once the button is gone. This saves the wait time of 10 sec.
            showmore_project = driver.find_element_by_css_selector('.showMore_17tx5ln [role="button"]')
            # Using JS to send the click avoids Selenium throwing an exception when the click would not be
            # performed on the right element.
            driver.execute_script("arguments[0].click();", showmore_project)
        except StaleElementReferenceException:
            continue
        except NoSuchElementException:
            break
    # parsing the profile
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    # get a list of all the projects
    project = soup.find_all(class_='title_1usue9n')
    # get the number of projects
    print(len(project))
print(time.time() - start_time)
Execution Time1: 14.343502759933472
Execution Time2: 13.955228090286255
Hope this helps you!

My scraper fails to get all the items from a webpage

I've written some code in Python in combination with Selenium to parse different product names from a webpage. A few "load more" buttons become visible as the browser scrolls downward. The webpage only displays its full content once it has been scrolled all the way down and there is no "load more" button left to click. My scraper seems to be doing well, but I'm not getting all the results: there are around 200 products on that page and I'm getting 90 of them. What change should I make to my scraper to get them all? Thanks in advance.
The webpage I'm dealing with: Page_Link
This is the script I'm trying with:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.get("put_above_url_here")
wait = WebDriverWait(driver, 10)
page = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR,".listing_item")))
for scroll in range(17):
    page.send_keys(Keys.PAGE_DOWN)
    time.sleep(2)
    try:
        load = driver.find_element_by_css_selector(".lm-btm")
        load.click()
    except Exception:
        pass

for item in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "[id^=item_]"))):
    name = item.find_element_by_css_selector(".pro-name.el2").text
    print(name)

driver.quit()
Try the code below to get the required data:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.get("https://www.purplle.com/search?q=hair%20fall%20shamboo")
wait = WebDriverWait(driver, 10)
header = driver.find_element_by_tag_name("header")
driver.execute_script("arguments[0].style.display='none';", header)
while True:
    try:
        page = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, ".listing_item")))
        driver.execute_script("arguments[0].scrollIntoView();", page)
        page.send_keys(Keys.END)
        load = wait.until(EC.element_to_be_clickable((By.PARTIAL_LINK_TEXT, "LOAD MORE")))
        driver.execute_script("arguments[0].scrollIntoView();", load)
        load.click()
        wait.until(EC.staleness_of(load))
    except:
        break

for item in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "[id^=item_]"))):
    name = item.find_element_by_css_selector(".pro-name.el2").text
    print(name)

driver.quit()
You should only use Selenium as a last resort.
A simple look around in the webpage showed the API it calls to get your data.
It returns JSON output with all the details:
Link
You can now just loop over the response and store it in a dataframe easily.
Very fast, and fewer errors than Selenium.
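As a rough sketch of that approach: the endpoint URL, the q/page parameters, and the "items" key below are placeholders (the answer only links to the API rather than quoting it), so treat this as the shape of the idea, not the confirmed interface.

import requests
import pandas as pd

# Placeholder endpoint and parameters; substitute the actual API URL seen in the
# browser's network tab when the search page loads.
API_URL = "https://www.purplle.com/api/search"
params = {"q": "hair fall shampoo", "page": 1}

products = []
while True:
    resp = requests.get(API_URL, params=params, timeout=10)
    resp.raise_for_status()
    items = resp.json().get("items", [])  # "items" is an assumption about the JSON shape
    if not items:
        break
    products.extend(items)
    params["page"] += 1  # assumes the API paginates via a page parameter

df = pd.DataFrame(products)  # one row per product, columns taken from the JSON keys
print(len(df))

One HTTP request per page replaces the whole browser session, which is why this route tends to be faster and less error-prone than Selenium.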

selenium and python3: selecting search box on www.rottentomatoes.com

I have a list of movies for which I want to get the reviews from www.rottentomatoes.com, but I have run into a snag.
What I want is to be able to pass the title of each movie to the website's search box and then process the result to get the review I want.
At present I cannot get beyond the search stage, because I have not been able to locate the search box.
My code is as shown below:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
browser = webdriver.Chrome('/home/zona/chromedriver')
url = 'https://www.rottentomatoes.com/'
browser.get(url)
time.sleep(10)
try:
    element = WebdriverWait(browser, 10).until(
        EC.presence_of_element_located((By.XPATH, '//body//input[@name="search"]')))
    element = browser.find_element_by_xpath('//body//input[@name="search"]')
    element.clear()
    element.send_keys("avatar")
except:
    print("cound not find search box")

time.sleep(5)
browser.quit()
I get the output:
cound not find search box
Can someone please help me see what I am doing wrong?
Apologies if this is too basic; I am new to programming and to Python.
It is just a case-sensitivity issue:
you used WebdriverWait (lower-case d) instead of WebDriverWait.
Note: I used the traceback module to print the stack trace so you can see the exception details.
Try the following code:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import traceback
browser = webdriver.Chrome('/home/zona/chromedriver')
url = 'https://www.rottentomatoes.com/'
browser.get(url)
time.sleep(5)
try:
    element = WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.XPATH, '//body//input[@name="search"]')))
    # element = browser.find_element_by_xpath('//body//input[@name="search"]')
    element.clear()
    element.send_keys("avatar")
except:
    traceback.print_exc()
    print("cound not find search box")

time.sleep(5)
browser.quit()
