I'm writing my first real scraper and although in general it's been going well, I've hit a wall using Selenium. I can't get it to go to the next page.
Below is the head of my code. The output below this is just printing out data in terminal for now and that's all working fine. It just stops scraping at the end of page 1 and shows me my terminal prompt. It never starts on page 2. I would be so grateful if anyone could make a suggestion. I've tried selecting the button at the bottom of the page I'm trying to scrape using both the relative and full Xpath (you're seeing the full one here) but neither work. I'm trying to click the right-arrow button.
I built in my own error message to indicate whether the driver successfully found the element by Xpath or not. The error message fires when I execute my code, so I guess it's not finding the element. I just can't understand why not.
# Importing libraries
import requests
import csv
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup
# Import selenium
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, WebDriverException
import time
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
options.add_argument('--headless')
driver = webdriver.Chrome("/path/to/driver", options=options)
# Yes, I do have the actual path to my driver in the original code
driver.get("https://uk.eu-supply.com/ctm/supplier/publictenders?B=UK")
time.sleep(5)
while True:
try:
driver.find_element_by_xpath('/html/body/div[1]/div[3]/div/div/form/div[3]/div/div/ul[1]/li[4]/a').click()
except (TimeoutException, WebDriverException) as e:
print("A timeout or webdriver exception occurred.")
break
driver.quit()
What you can do is to set up Selenium expected conditions (visibility_of_element_located, element_to_be_clickable) and use a relative XPath to select the next page element. All of this in a loop (its range is the number of pages you have to deal with).
XPath for the next page link :
//div[#class='pagination ctm-pagination']/ul[1]/li[last()-1]/a
Code could look like :
## imports
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver.get("https://uk.eu-supply.com/ctm/supplier/publictenders?B=UK")
## count the number of pages you have
els = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[#class='pagination ctm-pagination']/ul[1]/li[last()]/a"))).get_attribute("data-current-page")
## loop. at the end of the loop, click on the following page
for i in range(int(els)):
***scrape what you want***
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//div[#class='pagination ctm-pagination']/ul[1]/li[last()-1]/a"))).click()
You were pretty close with while True and try-catch{} logic. To go to the next page using Selenium and python you have to induce WebDriverWait for element_to_be_clickable() and you can use either of the following Locator Strategies:
Code Block:
driver.get("https://uk.eu-supply.com/ctm/supplier/publictenders?B=UK")
while True:
try:
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//a[contains(#class, 'state-active')]//following::li[1]/a[#href]"))).click()
print("Clicked for next page")
WebDriverWait(driver, 10).until(EC.staleness_of(driver.find_element_by_xpath("//a[contains(#class, 'state-active')]//following::li[1]/a[#href]")))
except (TimeoutException):
print("No more pages")
break
driver.quit()
Console Output:
Clicked for next page
No more pages
Related
I'm trying to execute a selenium program in Python to go to a new URL on click of a button in the current homepage. I'm new to selenium and any help regarding this would be appreciated. Here's my code
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
url = 'https://nmit.ac.in'
driver = webdriver.Chrome()
driver.get(url)
try:
# wait 10 seconds before looking for element
element = WebDriverWait(driver, 10).until(
EC.presence_of_element_located(By.LINK_TEXT, "Parent Portal")
)
except:
print()
driver.find_element(By.LINK_TEXT, "Parent Portal").click()
I have tried to increase the wait time as well as using all forms of the supported located strategies under the BY keyword, but to no avail. I keep getting this error.
As far as I know, you shouldn't be worried by those errors. Although, I proved your code and it's not finding any element in the web page. I can recommend you use the xpath of the element you want to find:
# wait 10 seconds before looking for element
element = WebDriverWait(driver, 10).until(
EC.presence_of_element_located((By.XPATH, "/html//div[#class='wsmenucontainer']//button[#href='https://nmitparents.contineo.in/']"))
)
element.click()
#wait for the new view to be loaded
time.sleep(10)
driver.quit()
Psdt: you can use the extention Ranorex Selocity to extract a good and unique xpath of any element in a webpage and also test it!!
image
For one study research I would like to scrape some links from webpages which located out of viewport (to see this links you need to scroll down the page).
Page example (https://www.twitch.tv/lirik)
Link example: https://www.amazon.com/dp/B09FVR22R2
Link located in div class='Layout-sc-nxg1ff-0 itdjvg default-panel' (in total 16 links on the page).
I have write the script but I get empty list:
from selenium import webdriver
import time
browser = webdriver.Firefox()
browser.get('https://www.twitch.tv/lirik')
time.sleep(3)
browser.execute_script("window.scrollBy(0,document.body.scrollHeight)")
time.sleep(3)
panel_blocks = browser.find_elements(by='class name', value='Layout-sc-nxg1ff-0 itdjvg default-panel')
browser.close()
print(panel_blocks)
print(type(panel_blocks))
I just get empty list after page was loaded. Here is output from the script above:
/usr/local/bin/python /Users/greg.fetisov/PycharmProjects/baltazar_platform/Twitch_parser.py
[]
<class 'list'>
Process finished with exit code 0
p.s.
when webdriver opens the page, I see there is no scroll down action. It just open a page and then close it after time.sleep cooldown.
How I can change the script to get the links properly?
Any help or advice would be appreciated!
You are using a wrong locator.
You should use expected conditions explicit waits instead of hardcoded pauses.
find_elements method returns a list of web elements while you want to the link inside the element(s).
This should work better:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
browser = webdriver.Firefox()
browser.get('https://www.twitch.tv/lirik')
wait = WebDriverWait(browser, 20)
wait.until(EC.element_to_be_clickable((By.XPATH, "//div[#class='channel-panels-container']//a")))
time.sleep(0.5)
link_blocks = browser.find_elements_by_xpath("//div[#class='channel-panels-container']//a")
for link_block in link_blocks:
link = link_block.get_attribute("href")
print(link)
browser.close()
To print the values of the href attribute you have to induce WebDriverWait for the visibility_of_all_elements_located() and you can use either of the following Locator Strategies:
Using CSS_SELECTOR:
driver.get("https://www.twitch.tv/lirik")
print([my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.Layout-sc-nxg1ff-0.itdjvg.default-panel > a")))])
Console Output:
['https://www.amazon.com/dp/B09FVR22R2', 'http://bs.serving-sys.com/Serving/adServer.bs?cn=trd&pli=1077437714&gdpr=$%7BGDPR%7D&gdpr_consent=$%7BGDPR_CONSENT_68%7D&adid=1085757156&ord=[timestamp]', 'https://store.epicgames.com/lirik/rumbleverse', 'https://bitly/3GP0cM0', 'https://lirik.com/', 'https://streamlabs.com/lirik', 'https://twitch.amazon.com/tp', 'https://www.twitch.tv/subs/lirik', 'https://www.youtube.com/lirik?sub_confirmation=1', 'http://www.twitter.com/lirik', 'http://www.instagram.com/lirik', 'http://gfuel.ly/lirik', 'http://www.cyberpowerpc.com/', 'https://www.cyberpowerpc.com/page/Intel/LIRIK/', 'https://discord.gg/lirik', 'http://www.amazon.com/?_encoding=UTF8&camp=1789&creative=390957&linkCode=ur2&tag=l0e6d-20&linkId=YNM2SXSSG3KWGYZ7']
Note : You have to add the following imports :
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
So, i was creating a signup machine using selenium. When this page loads to https://mail.protonmail.com/create/new?language=en it is unable to find element by id/xpath of username. On the other side it was able to find password,passwordc elements. I tried to use WebDriverWait function but it is giving timeout error. Tried many things but this thing is still giving me error. If possible then suggest a way to find element of username on the final page or a perfect WebDriverWait code. Below is my code
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
import time
url = 'https://protonmail.com/'
driver = webdriver.Chrome('C://Users/AAA/Desktop/chromedriver.exe')
driver.get(url)
driver.find_element_by_xpath('//*[#id="bs-example-navbar-collapse-1"]/ul/li[8]/a').click()
time.sleep(2)
driver.find_element_by_xpath('//*[#id="signup-plans"]/div[5]/div[1]/div[1]/div/div[1]/h4').click()
time.sleep(1)
driver.find_element_by_id('freePlan').click()
time.sleep(1)
wait = WebDriverWait(driver, 10)
element = wait.until(EC.element_to_be_clickable((By.ID, "username")))
driver.find_element_by_xpath('//*[#id="username"]').send_keys('santaking44455')
time.sleep(1)
driver.find_element_by_id('password').send_keys('25J8e5b8')
time.sleep(1)
driver.find_element_by_id('passwordc').send_keys('25J8e5b8')```
You can't find it because it's in an iframe tag higher up in the html source. Switch first to the iframe, then you should be able to interact with the element.
iframe=driver.find_element_by_xpath('//*[#title="Registration form"]')
driver.switch_to.frame(iframe)
driver.find_element_by_xpath('//*[#id="username"]').send_keys('santaking44455')
My goal: to scrape the amount of projects done by a user on khan academy.
To do so I need to parse the profile user page. But I need to click on show more to see all the project a user had done and then scrape them.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException,StaleElementReferenceException
from bs4 import BeautifulSoup
# here is one example of a user
driver = webdriver.Chrome()
driver.get('https://www.khanacademy.org/profile/trekcelt/projects')
# to infinite click on show more button until there is none
while True:
try:
showmore_project=WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME,'showMore_17tx5ln')))
showmore_project.click()
except TimeoutException:
break
except StaleElementReferenceException:
break
# parsing the profile
soup=BeautifulSoup(driver.page_source,'html.parser')
# get a list of all the projects
project=soup.find_all(class_='title_1usue9n')
# get the number of projects
print(len(project))
This code return 0 for print(len(project)). And that's not normal because when you manually check https://www.khanacademy.org/profile/trekcelt/projects you can see there that the amount of projects is definetly not 0.
The weird thing: at first, you can see (with the webdriver) that this code is working fine and then selenium clicks on something else than the show more button, it click on one of the project's link for example and thus change the page and that's why we get 0.
I don't understand how to correct my code so selenium is only clicking on the right button and nothing else.
Check out the following implementation to get the desired behavior. When the script is running, take a closer look at the scroll bar to see the progress.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
with webdriver.Chrome() as driver:
wait = WebDriverWait(driver,10)
driver.get('https://www.khanacademy.org/profile/trekcelt/projects')
while True:
try:
showmore = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR,'[class^="showMore"] > a')))
driver.execute_script("arguments[0].click();",showmore)
except Exception:
break
soup = BeautifulSoup(driver.page_source,'html.parser')
project = soup.find_all(class_='title_1usue9n')
print(len(project))
Another way would be:
while True:
try:
showmore = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR,'[class^="showMore"] > a')))
showmore.location_once_scrolled_into_view
showmore.click()
wait.until(EC.invisibility_of_element_located((By.CSS_SELECTOR,'[class^="spinnerContainer"] > img[class^="loadingSpinner"]')))
except Exception:
break
Output at this moment:
381
I have modified the accepted answer to improve the performance of your script. Comment on how you can achieve it is in the code
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException, StaleElementReferenceException
from bs4 import BeautifulSoup
import time
start_time = time.time()
# here is one example of a user
with webdriver.Chrome() as driver:
driver.get('https://www.khanacademy.org/profile/trekcelt/projects')
# This code will wait until the first Show More is displayed (After page loaded)
showmore_project = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME,
'showMore_17tx5ln')))
showmore_project.click()
# to infinite click on show more button until there is none
while True:
try:
# We will retrieve and click until we do not find the element
# NoSuchElementException will be raised when we reach the button. This will save the wait time of 10 sec
showmore_project= driver.find_element_by_css_selector('.showMore_17tx5ln [role="button"]')
# Using a JS to send the click will avoid Selenium to through an exception where the click would not be
# performed on the right element.
driver.execute_script("arguments[0].click();", showmore_project)
except StaleElementReferenceException:
continue
except NoSuchElementException:
break
# parsing the profile
soup=BeautifulSoup(driver.page_source,'html.parser')
# get a list of all the projects
project=soup.find_all(class_='title_1usue9n')
# get the number of projects
print(len(project))
print(time.time() - start_time)
Execution Time1: 14.343502759933472
Execution Time2: 13.955228090286255
Hope this help you!
I'm trying to do some webscraping from a betting website:
As part of the process, I have to click on the different buttons under the "Favourites" section on the left side to select different competitions.
Let's take the ENG Premier League button as example. I identified the button as:
(source: 666kb.com)
The XPath is: //*[#id="SportMenuF"]/div[3] and the ID is 91.
My code for clicking on the button is as follows:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
chrome_path = "C:\Python27\Scripts\chromedriver_win32\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)
driver.get("URL Removed")
content = driver.find_element_by_xpath('//*[#id="SportMenuF"]/div[3]')
content.click()
Unfortunately, I always get this error message when I run the script:
"no such element: Unable to locate element:
{"method":"xpath","selector":"//*[#id="SportMenuF"]/div[3]"}"
I have tried different identifiers such as CCS Selector, ID and, as shown in the example above, the Xpath. I tried using waits and explicit conditions, too. None of this has worked.
I also attempted scraping some values from the website without any success:
from selenium import webdriver
from selenium.webdriver.common.by import By
chrome_path = "C:\Python27\Scripts\chromedriver_win32\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)
driver.get("URL removed")
content = driver.find_elements_by_class_name('price-val')
for entry in content:
print entry.text
Same problem, nothing shows up.
The website embeddes an iframe from a different website. Could this be the cause of my problems? I tried scraping directly from the iframe URL, too, which didn't work, either.
I would appreciate any suggestions.
Sometimes elements are either hiding behind an iframe, or they haven't loaded yet
For the iframe check, try:
driver.switch_to.frame(0)
For the wait check, try:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
element = WebDriverWait(driver, 10).until(
EC.presence_of_element_located((By.XPATH, '-put the x-path here-')))