For a research study I would like to scrape some links from webpages that are located outside the viewport (to see these links you have to scroll down the page).
Page example (https://www.twitch.tv/lirik)
Link example: https://www.amazon.com/dp/B09FVR22R2
Each link is located in a div with class='Layout-sc-nxg1ff-0 itdjvg default-panel' (16 links on the page in total).
I have written a script, but I get an empty list:
from selenium import webdriver
import time
browser = webdriver.Firefox()
browser.get('https://www.twitch.tv/lirik')
time.sleep(3)
browser.execute_script("window.scrollBy(0,document.body.scrollHeight)")
time.sleep(3)
panel_blocks = browser.find_elements(by='class name', value='Layout-sc-nxg1ff-0 itdjvg default-panel')
browser.close()
print(panel_blocks)
print(type(panel_blocks))
I just get an empty list after the page has loaded. Here is the output from the script above:
/usr/local/bin/python /Users/greg.fetisov/PycharmProjects/baltazar_platform/Twitch_parser.py
[]
<class 'list'>
Process finished with exit code 0
P.S. When the webdriver opens the page, I see no scroll-down action. It just opens the page and then closes it after the time.sleep cooldown.
How can I change the script to get the links properly?
Any help or advice would be appreciated!
You are using the wrong locator.
You should use expected-conditions explicit waits instead of hardcoded pauses.
The find_elements method returns a list of web elements, while what you want is the link inside those element(s).
This should work better:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
browser = webdriver.Firefox()
browser.get('https://www.twitch.tv/lirik')
wait = WebDriverWait(browser, 20)
wait.until(EC.element_to_be_clickable((By.XPATH, "//div[@class='channel-panels-container']//a")))
time.sleep(0.5)
link_blocks = browser.find_elements_by_xpath("//div[@class='channel-panels-container']//a")
for link_block in link_blocks:
    link = link_block.get_attribute("href")
    print(link)
browser.close()
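A side note on the scrolling issue raised in the question: window.scrollBy() often has no visible effect on Twitch, most likely because the scrollable area is an inner container rather than the document body. A hedged workaround sketch, to run before browser.close(), is to scroll one of the found elements into view instead:
# Scroll the last panel link into view (assumes link_blocks is non-empty).
browser.execute_script("arguments[0].scrollIntoView();", link_blocks[-1])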
To print the values of the href attribute you have to induce WebDriverWait for visibility_of_all_elements_located(), and you can use the following Locator Strategy:
Using CSS_SELECTOR:
driver.get("https://www.twitch.tv/lirik")
print([my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.Layout-sc-nxg1ff-0.itdjvg.default-panel > a")))])
Console Output:
['https://www.amazon.com/dp/B09FVR22R2', 'http://bs.serving-sys.com/Serving/adServer.bs?cn=trd&pli=1077437714&gdpr=$%7BGDPR%7D&gdpr_consent=$%7BGDPR_CONSENT_68%7D&adid=1085757156&ord=[timestamp]', 'https://store.epicgames.com/lirik/rumbleverse', 'https://bitly/3GP0cM0', 'https://lirik.com/', 'https://streamlabs.com/lirik', 'https://twitch.amazon.com/tp', 'https://www.twitch.tv/subs/lirik', 'https://www.youtube.com/lirik?sub_confirmation=1', 'http://www.twitter.com/lirik', 'http://www.instagram.com/lirik', 'http://gfuel.ly/lirik', 'http://www.cyberpowerpc.com/', 'https://www.cyberpowerpc.com/page/Intel/LIRIK/', 'https://discord.gg/lirik', 'http://www.amazon.com/?_encoding=UTF8&camp=1789&creative=390957&linkCode=ur2&tag=l0e6d-20&linkId=YNM2SXSSG3KWGYZ7']
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
Related
Cannot find the element
The code was written using Python in Visual Studio Code.
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
driver = webdriver.Chrome()
paginaHit = 'https://hit.com.do/solicitud-de-verificacion/'
driver.get(paginaHit)
driver.maximize_window()
time.sleep(5)
bl = 'SMLU7318830A'
elementoBL = driver.find_element(By.XPATH, '//*[@id="billoflanding"]').send_keys(bl)
# WebDriverWait(driver,2).until(EC.element_to_be_clickable((By.NAME, "bl"))).Click()
The code runs, but it cannot find the element on the webpage.
The portion of the page you are trying to access is inside an EMBED tag. It behaves much like an IFRAME, so I would start by switching the context to the EMBED tag and then try searching for the element.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
paginaHit = 'https://hit.com.do/solicitud-de-verificacion/'
driver.get(paginaHit)
driver.maximize_window()
embed = driver.find_element(By.CSS_SELECTOR, "embed")
driver.switch_to.frame(embed)
bl = 'SMLU7318830A'
wait = WebDriverWait(driver, 20)
wait.until(EC.visibility_of_element_located((By.ID, "billoflanding"))).send_keys(bl)
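A note on the design choice here: switch_to.frame() accepts a frame index, a frame name or id, or a WebElement, so passing the embed element found above switches the driver's context directly into that embedded document.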
A couple of additional points:
Don't use sleeps... sleeps are a bad practice. Instead use WebDriverWait when you need to wait for something to happen.
If you are using an ID to find an element, use By.ID and not XPath. ID should be preferred when available; next should be a CSS selector, and finally XPATH only when needed, e.g. to locate elements by contained text or to do complicated DOM traversal (see the sketch below).
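A minimal sketch of that preference order, reusing the locators from this page (the button text in the XPath example is hypothetical):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
wait = WebDriverWait(driver, 10)
# 1. Prefer an ID when one is available.
wait.until(EC.visibility_of_element_located((By.ID, "billoflanding")))
# 2. Fall back to a CSS selector.
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "embed")))
# 3. Use XPath only when needed, e.g. matching by contained text (hypothetical button).
wait.until(EC.visibility_of_element_located((By.XPATH, "//button[contains(., 'Submit')]")))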
I'm writing my first real scraper and although in general it's been going well, I've hit a wall using Selenium. I can't get it to go to the next page.
Below is the head of my code. The output below this just prints data to the terminal for now, and that's all working fine. It just stops scraping at the end of page 1 and shows me my terminal prompt; it never starts on page 2. I would be so grateful if anyone could make a suggestion. I've tried selecting the button at the bottom of the page I'm trying to scrape using both the relative and full XPath (you're seeing the full one here), but neither works. I'm trying to click the right-arrow button.
I built in my own error message to indicate whether the driver successfully found the element by XPath or not. The error message fires when I execute my code, so I guess it's not finding the element. I just can't understand why not.
# Importing libraries
import requests
import csv
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup
# Import selenium
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, WebDriverException
import time
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
options.add_argument('--headless')
driver = webdriver.Chrome("/path/to/driver", options=options)
# Yes, I do have the actual path to my driver in the original code
driver.get("https://uk.eu-supply.com/ctm/supplier/publictenders?B=UK")
time.sleep(5)
while True:
    try:
        driver.find_element_by_xpath('/html/body/div[1]/div[3]/div/div/form/div[3]/div/div/ul[1]/li[4]/a').click()
    except (TimeoutException, WebDriverException) as e:
        print("A timeout or webdriver exception occurred.")
        break
driver.quit()
What you can do is set up Selenium expected conditions (visibility_of_element_located, element_to_be_clickable) and use a relative XPath to select the next-page element, all of this in a loop (whose range is the number of pages you have to deal with).
XPath for the next page link:
//div[@class='pagination ctm-pagination']/ul[1]/li[last()-1]/a
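In that expression, li[last()-1] selects the second-to-last list item of the pagination bar, which on this site is the next-page arrow sitting just before the last-page arrow.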
Code could look like:
## imports
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver.get("https://uk.eu-supply.com/ctm/supplier/publictenders?B=UK")
## count the number of pages you have
els = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[@class='pagination ctm-pagination']/ul[1]/li[last()]/a"))).get_attribute("data-current-page")
## loop. at the end of the loop, click on the following page
for i in range(int(els)):
    # *** scrape what you want here ***
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//div[@class='pagination ctm-pagination']/ul[1]/li[last()-1]/a"))).click()
You were pretty close with the while True and try/except logic. To go to the next page using Selenium and Python you have to induce WebDriverWait for element_to_be_clickable(), and you can use the following Locator Strategy:
Code Block:
driver.get("https://uk.eu-supply.com/ctm/supplier/publictenders?B=UK")
while True:
    try:
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//a[contains(@class, 'state-active')]//following::li[1]/a[@href]"))).click()
        print("Clicked for next page")
        WebDriverWait(driver, 10).until(EC.staleness_of(driver.find_element_by_xpath("//a[contains(@class, 'state-active')]//following::li[1]/a[@href]")))
    except TimeoutException:
        print("No more pages")
        break
driver.quit()
Console Output:
Clicked for next page
No more pages
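Note: staleness_of waits until the previously located element is detached from the DOM, which is a reliable signal that the next page has replaced the old one before the loop tries to click again.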
I've created a script in Python using Selenium to click on a like button available on a webpage. I've used an XPath to locate that button and I think I've used it correctly. However, the script doesn't seem to find that button, and as a result it throws a TimeoutException error pointing at the very line containing the XPath.
As it is not possible to hit that like button without logging in, I expect the script to get the relevant HTML connected to that button, so that I know I have located it correctly.
I've tried with:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
link = "https://www.instagram.com/p/CBi_eIuAwbG/"
with webdriver.Chrome() as driver:
    wait = WebDriverWait(driver, 10)
    driver.get(link)
    item = wait.until(EC.visibility_of_element_located((By.XPATH, "//button[./svg[@aria-label='Like']]")))
    print(item.get_attribute("innerHTML"))
How can I locate that like button, visible as a heart symbol, using Selenium?
To click on the Like button, induce WebDriverWait() for visibility_of_element_located() with the XPath below.
Then scroll the element into view and click.
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
driver.get("https://www.instagram.com/p/CBi_eIuAwbG/")
element=WebDriverWait(driver,10).until(EC.visibility_of_element_located((By.XPATH,"//button[.//*[name()='svg' and #aria-label='Like']]")))
element.location_once_scrolled_into_view
element.click()
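As a side note, location_once_scrolled_into_view is a property rather than a method: merely reading it scrolls the element into view as a side effect, which is why there are no parentheses after it.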
You can do it like this:
likeSVG = driver.find_element(By.CSS_SELECTOR, 'svg[aria-label="Like"]')
likeBtn = likeSVG.find_element(By.XPATH, './..')
likeBtn.click()
likeBtn is the parent of the likeSVG element; XPath lets you navigate upward much like file-navigation commands in a CLI.
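Equivalently, likeSVG.find_element(By.XPATH, '..') selects the parent node directly, since '..' is the XPath shorthand for the parent axis.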
Try using the .find_element_by_xpath(xPath) method (this uses the full XPath):
likeXPATH = "/html/body/div[1]/section/main/div/div[1]/article/div[2]/section[1]/span[1]/button"
likeElement = driver.find_element_by_xpath(likeXPATH)
likeElement.click()
I'm learning to use Selenium for web scraping. I have a couple of questions about the website I'm working with:
- The website has multiple pages to go over, and I can't seem to find a way to locate the pages' paths and go over them. For example, the following code returns link_page as NoneType.
from selenium import webdriver
import time
driver = webdriver.Chrome('chromedriver')
driver.get('https://www.oddsportal.com/soccer/england/premier-league')
time.sleep(0.5)
results_button = driver.find_element_by_xpath('/html/body/div[1]/div/div[2]/div[6]/div[1]/div/div[1]/div[2]/div[1]/div[2]/ul/li[3]/span')
results_button.click()
time.sleep(3)
season_button = driver.find_element_by_xpath('/html/body/div[1]/div/div[2]/div[6]/div[1]/div/div[1]/div[2]/div[1]/div[3]/ul/li[2]/span/strong/a')
season_button.click()
link_page = driver.find_element_by_xpath('/html/body/div[1]/div/div[2]/div[6]/div[1]/div/div[1]/div[2]/div[1]/div[6]/div/a[3]/span').get_attribute('href')
print(link_page.text)
driver.get(link_page)
- For some reason I have to use the results_button to be able to get the href of matches. For example, the following code tries to go to the page directly (as an attempt to circumvent problem 1 above), but link_page raises a NoSuchElementException error.
from selenium import webdriver
import time
driver = webdriver.Chrome('chromedriver')
driver.get('https://www.oddsportal.com/soccer/england/premier-league/results/#/page/2')
time.sleep(3)
link_page = driver.find_element_by_xpath('/html/body/div[1]/div/div[2]/div[6]/div[1]/div/div[1]/div[2]/div[1]/div[6]/table/tbody/tr[11]/td[2]/a').get_attribute('href')
print(link_page.text)
driver.get(link_page)
To locate the pages and go over them using Selenium you need to induce WebDriverWait for visibility_of_all_elements_located(), and you can use the following Locator Strategy:
Using XPATH:
driver.get('https://www.oddsportal.com/soccer/england/premier-league/')
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//a[text()='RESULTS']"))).click()
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//a[text()='2018/2019']"))).click()
print([my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.XPATH, "//span[@class='active-page']//following::a[@x-page]/span[not(contains(., '|')) and not(contains(., '»'))]/..")))])
Console Output:
['https://www.oddsportal.com/soccer/england/premier-league-2018-2019/results/#/page/2/', 'https://www.oddsportal.com/soccer/england/premier-league-2018-2019/results/#/page/3/', 'https://www.oddsportal.com/soccer/england/premier-league-2018-2019/results/#/page/4/', 'https://www.oddsportal.com/soccer/england/premier-league-2018-2019/results/#/page/5/', 'https://www.oddsportal.com/soccer/england/premier-league-2018-2019/results/#/page/6/', 'https://www.oddsportal.com/soccer/england/premier-league-2018-2019/results/#/page/7/', 'https://www.oddsportal.com/soccer/england/premier-league-2018-2019/results/#/page/8/']
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
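To then visit each of those pages, store the href values in a list before navigating away, since navigation makes the collected elements stale. A minimal sketch, assuming page_links holds the list printed above:
for page_link in page_links:
    driver.get(page_link)
    # scrape each results page here before moving to the next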
I'm trying to do some webscraping from a betting website:
As part of the process, I have to click on the different buttons under the "Favourites" section on the left side to select different competitions.
Let's take the ENG Premier League button as an example. I identified the button as follows:
The XPath is: //*[@id="SportMenuF"]/div[3] and the ID is 91.
My code for clicking on the button is as follows:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
chrome_path = "C:\Python27\Scripts\chromedriver_win32\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)
driver.get("URL Removed")
content = driver.find_element_by_xpath('//*[#id="SportMenuF"]/div[3]')
content.click()
Unfortunately, I always get this error message when I run the script:
"no such element: Unable to locate element:
{"method":"xpath","selector":"//*[#id="SportMenuF"]/div[3]"}"
I have tried different identifiers such as CSS selector, ID and, as shown in the example above, the XPath. I tried using waits and explicit conditions, too. None of this has worked.
I also attempted scraping some values from the website without any success:
from selenium import webdriver
from selenium.webdriver.common.by import By
chrome_path = "C:\Python27\Scripts\chromedriver_win32\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)
driver.get("URL removed")
content = driver.find_elements_by_class_name('price-val')
for entry in content:
    print entry.text
Same problem, nothing shows up.
The website embeds an iframe from a different website. Could this be the cause of my problems? I tried scraping directly from the iframe URL, too, which didn't work either.
I would appreciate any suggestions.
Sometimes elements are either hiding behind an iframe or they haven't loaded yet.
For the iframe check, try:
driver.switch_to.frame(0)
For the wait check, try:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
element = WebDriverWait(driver, 10).until(
EC.presence_of_element_located((By.XPATH, '-put the x-path here-')))
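Combining the two checks, a minimal hedged sketch (the XPath placeholder is still yours to fill in):
driver.switch_to.frame(0)  # enter the first frame on the page
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, '-put the x-path here-')))
# ... interact with element ...
driver.switch_to.default_content()  # switch back out of the frame when done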