I'm writing a Python script that extracts the current page URL, moves to the next page, and extracts that page's URL as well. I can confirm that the browser starts up and connects to the start page, but after that nothing happens.
For example:
start page:
`https://www.jtb.co.jp/kokunai-hotel/list/kyoto/feature/couple_yado/?multiarea=26&dateunspecified=1&page=1`
The URLs I want to extract are the following 4 pages:
・https://www.jtb.co.jp/kokunai-hotel/list/kyoto/feature/couple_yado/?multiarea=26&dateunspecified=1&page=1
・https://www.jtb.co.jp/kokunai-hotel/list/kyoto/feature/couple_yado/?multiarea=26&dateunspecified=1&page=2
・https://www.jtb.co.jp/kokunai-hotel/list/kyoto/feature/couple_yado/?multiarea=26&dateunspecified=1&page=3
・https://www.jtb.co.jp/kokunai-hotel/list/kyoto/feature/couple_yado/?multiarea=26&dateunspecified=1&page=4
I wrote the script below:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import NoSuchElementException
import pandas as pd
from time import sleep
import time
options = Options()
driver = webdriver.Chrome('path', options=options)
pageURL = 'https://www.jtb.co.jp/kokunai-hotel/list/kyoto/feature/couple_yado/'
driver.get(pageURL)
sleep(3)
elem_urls = []
while True:
    url = driver.current_url
    for urls in url:
        elem_urls.append(urls)
    try:
        next_button = driver.find_elemenent_by_class_name('f-list-paging__next')
        next_button.click()
        sleep(3)
    except Exception:
        break
To extract the links for the pages, you need to induce WebDriverWait for visibility_of_all_elements_located(), and you can use either of the following locator strategies:
Using CSS_SELECTOR:
driver.get('https://www.jtb.co.jp/kokunai-hotel/list/kyoto/feature/couple_yado/')
print([my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "a.f-list-paging-num__link")))])
Using XPATH:
driver.get('https://www.jtb.co.jp/kokunai-hotel/list/kyoto/feature/couple_yado/')
print([my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//a[contains(@class, 'f-list-paging-num__link')]")))])
Console Output:
['https://www.jtb.co.jp/kokunai-hotel/list/kyoto/?page=1', 'https://www.jtb.co.jp/kokunai-hotel/list/kyoto/?page=2', 'https://www.jtb.co.jp/kokunai-hotel/list/kyoto/?page=3', 'https://www.jtb.co.jp/kokunai-hotel/list/kyoto/?page=4']
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
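For reference, here is a minimal end-to-end sketch assembling the snippet and imports above; it assumes chromedriver is discoverable on your PATH:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()  # assumes chromedriver is on PATH
driver.get('https://www.jtb.co.jp/kokunai-hotel/list/kyoto/feature/couple_yado/?multiarea=26&dateunspecified=1&page=1')
# Wait until all pagination links are visible, then collect their hrefs
links = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "a.f-list-paging-num__link")))
print([link.get_attribute("href") for link in links])
driver.quit()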
I am trying to scrape a website to get the headings and summaries of the news. The problem I am facing is that when the website is first opened, a redirect appears, and you have to wait 8 seconds for the website to load. As a result, the web data being stored is that of the redirect instead of the main website.
from selenium import webdriver
import time
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
# Specify the path to the ChromeDriver executable
chrome_driver_path = "C:/webdrivers/chromedriver"
# Initialize the webdriver
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
# Navigate to website
driver.get("https://economictimes.indiatimes.com/markets/stocks/news")
time.sleep(10)
data2, data4 = [], []
while True:
    # Extract data
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    data = soup.find_all("div", {"class": "example-class"})
    for item in data:
        data2.append(item.find_all('h3'))
        data4.append(item.find_all('p'))
    try:
        # Find the "Load More" button
        load_more_button = driver.find_element_by_css_selector("div.autoload_continue")
        # Click the button
        load_more_button.click()
    except:
        break
# Close the browser
driver.quit()
print(data2)
You could wait for the switch to your final URL:
wait.until(EC.url_to_be('https://economictimes.indiatimes.com/markets/stocks/news'))
Example
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
url = 'https://economictimes.indiatimes.com/markets/stocks/news'
wait = WebDriverWait(driver, 10)
driver.get(url)
wait.until(EC.url_to_be('https://economictimes.indiatimes.com/markets/stocks/news'))
An ideal approach would be to wait for the News heading within the webpage to be visible.
Solution
To wait for the News heading to be visible, you need to induce WebDriverWait for the visibility_of_element_located(), and you can use either of the following locator strategies:
Using CSS_SELECTOR:
driver.get('https://economictimes.indiatimes.com/markets/stocks/news')
WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "h1.h1")))
Using XPATH:
driver.get('https://economictimes.indiatimes.com/markets/stocks/news')
WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//h1[@class='h1' and text()='News']")))
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
Alternative
You can also wait for the Page Title of the webpage to contain Stocks in News Today as follows:
driver.get('https://economictimes.indiatimes.com/markets/stocks/news')
WebDriverWait(driver, 10).until(EC.title_contains("Stocks in News Today"))
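Putting one of these waits together with the scraping loop from the question might look like the sketch below; note that "example-class" is the placeholder selector from the question, and find_element uses the By-style API:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
driver.get("https://economictimes.indiatimes.com/markets/stocks/news")
# Wait for the redirect to finish instead of sleeping for a fixed time
WebDriverWait(driver, 20).until(EC.title_contains("Stocks in News Today"))
data2, data4 = [], []
while True:
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    for item in soup.find_all("div", {"class": "example-class"}):  # placeholder class from the question
        data2.append(item.find_all('h3'))
        data4.append(item.find_all('p'))
    try:
        # Click "Load More" until it is no longer found
        driver.find_element(By.CSS_SELECTOR, "div.autoload_continue").click()
    except Exception:
        break
driver.quit()
print(data2)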
References
You can find a couple of relevant detailed discussions in:
Python selenium get page title
How to make selenium wait before getting contents from the actual website which loads after the landing page through IEDriverServer and IE
I want to get the text of some HTML tags from bonbast, where some elements are loaded by AJAX (for example, the tag with the "ounce_top" id). I have tried Selenium and geckodriver, but I still cannot crawl these tags; also, when the automated Firefox (geckodriver) window opens, these elements are not shown on the web page! I have no idea why this happens. How can I crawl this website?
Code trials:
from selenium import webdriver
from bs4 import BeautifulSoup
url_news = 'https://bonbast.com/'
driver = webdriver.Firefox()
driver.get(url_news)
html = driver.page_source
soup = BeautifulSoup(html)
a = driver.find_element_by_id(id_="ounce_top")
The desired element is a dynamic element, so to extract the desired text, i.e. 1,817.43, you need to induce WebDriverWait for the visibility_of_element_located(), and you can use either of the following locator strategies:
Using CSS_SELECTOR:
driver.get("https://bonbast.com/")
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button.btn.btn-primary.btn-sm.acceptcookies"))).click()
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "span#ounce_top"))).text)
Using XPATH:
driver.get("https://bonbast.com/")
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button.btn.btn-primary.btn-sm.acceptcookies"))).click()
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//span[@id='ounce_top']"))).text)
Console Output:
1,817.43
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
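For reference, here is a minimal end-to-end sketch assembling the snippet and imports above (using Firefox, as in the question):
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Firefox()
driver.get("https://bonbast.com/")
# Dismiss the cookie banner first, then read the AJAX-loaded value
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button.btn.btn-primary.btn-sm.acceptcookies"))).click()
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "span#ounce_top"))).text)
driver.quit()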
You can find a relevant discussion in How to retrieve the text of a WebElement using Selenium - Python
To do that with Selenium you will need to add a wait / delay, preferably an expected conditions explicit wait.
I guess you are trying to get the text value inside that element?
This should work:
from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
url_news = 'https://bonbast.com/'
driver = webdriver.Firefox()
wait = WebDriverWait(driver, 20)
driver.get(url_news)
html = driver.page_source  # captured before the wait, so it may not yet contain the AJAX content
soup = BeautifulSoup(html, 'html.parser')
# Wait until the AJAX-loaded element is visible, then read its text
your_gold_value = wait.until(EC.visibility_of_element_located((By.ID, "ounce_top"))).text
print(your_gold_value)
For a research study I would like to scrape some links from web pages that are located outside the viewport (to see these links you need to scroll down the page).
Page example (https://www.twitch.tv/lirik)
Link example: https://www.amazon.com/dp/B09FVR22R2
The links are located in div class='Layout-sc-nxg1ff-0 itdjvg default-panel' (16 links on the page in total).
I have written the script below, but I get an empty list:
from selenium import webdriver
import time
browser = webdriver.Firefox()
browser.get('https://www.twitch.tv/lirik')
time.sleep(3)
browser.execute_script("window.scrollBy(0,document.body.scrollHeight)")
time.sleep(3)
panel_blocks = browser.find_elements(by='class name', value='Layout-sc-nxg1ff-0 itdjvg default-panel')
browser.close()
print(panel_blocks)
print(type(panel_blocks))
I just get an empty list after the page has loaded. Here is the output from the script above:
/usr/local/bin/python /Users/greg.fetisov/PycharmProjects/baltazar_platform/Twitch_parser.py
[]
<class 'list'>
Process finished with exit code 0
P.S. When the webdriver opens the page, I see there is no scroll-down action. It just opens the page and then closes it after the time.sleep cooldown.
How can I change the script to get the links properly?
Any help or advice would be appreciated!
You are using a wrong locator.
You should use expected conditions explicit waits instead of hardcoded pauses.
The find_elements method returns a list of web elements, while you want the link inside the element(s).
This should work better:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
browser = webdriver.Firefox()
browser.get('https://www.twitch.tv/lirik')
wait = WebDriverWait(browser, 20)
wait.until(EC.element_to_be_clickable((By.XPATH, "//div[@class='channel-panels-container']//a")))
time.sleep(0.5)
link_blocks = browser.find_elements_by_xpath("//div[#class='channel-panels-container']//a")
for link_block in link_blocks:
    link = link_block.get_attribute("href")
    print(link)
browser.close()
To print the values of the href attribute, you have to induce WebDriverWait for visibility_of_all_elements_located(), and you can use the following locator strategy:
Using CSS_SELECTOR:
driver.get("https://www.twitch.tv/lirik")
print([my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.Layout-sc-nxg1ff-0.itdjvg.default-panel > a")))])
Console Output:
['https://www.amazon.com/dp/B09FVR22R2', 'http://bs.serving-sys.com/Serving/adServer.bs?cn=trd&pli=1077437714&gdpr=$%7BGDPR%7D&gdpr_consent=$%7BGDPR_CONSENT_68%7D&adid=1085757156&ord=[timestamp]', 'https://store.epicgames.com/lirik/rumbleverse', 'https://bitly/3GP0cM0', 'https://lirik.com/', 'https://streamlabs.com/lirik', 'https://twitch.amazon.com/tp', 'https://www.twitch.tv/subs/lirik', 'https://www.youtube.com/lirik?sub_confirmation=1', 'http://www.twitter.com/lirik', 'http://www.instagram.com/lirik', 'http://gfuel.ly/lirik', 'http://www.cyberpowerpc.com/', 'https://www.cyberpowerpc.com/page/Intel/LIRIK/', 'https://discord.gg/lirik', 'http://www.amazon.com/?_encoding=UTF8&camp=1789&creative=390957&linkCode=ur2&tag=l0e6d-20&linkId=YNM2SXSSG3KWGYZ7']
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
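If the panel links still fail to render because they sit outside the viewport, one option is to combine the scroll from the question with an explicit wait; a minimal sketch, assuming the panel container markup shown above:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
browser = webdriver.Firefox()
browser.get('https://www.twitch.tv/lirik')
# Scroll to the bottom so the panel section gets rendered
browser.execute_script("window.scrollBy(0, document.body.scrollHeight)")
# Wait for the panel links to be present, then collect their hrefs
links = WebDriverWait(browser, 20).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "div.Layout-sc-nxg1ff-0.itdjvg.default-panel > a")))
print([link.get_attribute("href") for link in links])
browser.quit()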
I am trying to click a link on a web page with Python Selenium but I am getting this exception:
no such element: Unable to locate element:
I have already tried using find_element_by_xpath, find_element_by_partial_link_text and find_element_by_link_text.
This is my code:
import time
from selenium import webdriver
driver = webdriver.Chrome('C:/Users/me/Downloads/projetos/chromedriver_win32/chromedriver.exe') # Optional argument, if not specified will search path.
driver.get('http://10.7.0.4/web/guest/br/websys/webArch/mainFrame.cgi')
time.sleep(10) # Let the user actually see something!
#elem = driver.find_element_by_xpath('//*[#id="machine"]/div[1]/div[1]/dl[2]/dt/a')
elem = driver.find_element_by_link_text('Mensagens (2item(ns))')
elem.click()
print("Fim...")
This is the element I need to click:
Mensagens (2item(ns))
You can try explicit waits with a customized CSS selector:
CSS_SELECTOR:
a[href*='../../websys/webArch/getStatus.cgi']
Sample code:
wait = WebDriverWait(driver, 10)
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "a[href*='../../websys/webArch/getStatus.cgi']"))).click()
Imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
Update 1:
wait = WebDriverWait(driver, 10)
wait.until(EC.element_to_be_clickable((By.XPATH, "//a[contains(@href, 'ebsys/webArch/getStatus.cgi') and contains(text(), 'Mensagens')]"))).click()
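For reference, a minimal end-to-end sketch assembling the pieces above (the chromedriver path and device URL are taken from the question):
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome('C:/Users/me/Downloads/projetos/chromedriver_win32/chromedriver.exe')
driver.get('http://10.7.0.4/web/guest/br/websys/webArch/mainFrame.cgi')
wait = WebDriverWait(driver, 10)
# Wait for the link to become clickable instead of sleeping for a fixed time
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "a[href*='../../websys/webArch/getStatus.cgi']"))).click()
print("Fim...")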
I'm learning to use Selenium for web scraping. I have a couple of questions about the website I'm working with:
- The website has multiple pages to go over, and I can't seem to find a way to locate the pages' paths and iterate over them. For example, the following code returns link_page as NoneType.
from selenium import webdriver
import time
driver = webdriver.Chrome('chromedriver')
driver.get('https://www.oddsportal.com/soccer/england/premier-league')
time.sleep(0.5)
results_button = driver.find_element_by_xpath('/html/body/div[1]/div/div[2]/div[6]/div[1]/div/div[1]/div[2]/div[1]/div[2]/ul/li[3]/span')
results_button.click()
time.sleep(3)
season_button = driver.find_element_by_xpath('/html/body/div[1]/div/div[2]/div[6]/div[1]/div/div[1]/div[2]/div[1]/div[3]/ul/li[2]/span/strong/a')
season_button.click()
link_page = driver.find_element_by_xpath('/html/body/div[1]/div/div[2]/div[6]/div[1]/div/div[1]/div[2]/div[1]/div[6]/div/a[3]/span').get_attribute('href')
print(link_page.text)
driver.get(link_page)
- For some reason I have to use results_button to be able to get the href of the matches. For example, the following code tries to go to the page directly (as an attempt to circumvent problem 1 above), but link_page raises a NoSuchElementException error.
from selenium import webdriver
import time
driver = webdriver.Chrome('chromedriver')
driver.get('https://www.oddsportal.com/soccer/england/premier-league/results/#/page/2')
time.sleep(3)
link_page = driver.find_element_by_xpath('/html/body/div[1]/div/div[2]/div[6]/div[1]/div/div[1]/div[2]/div[1]/div[6]/table/tbody/tr[11]/td[2]/a').get_attribute('href')
print(link_page.text)
driver.get(link_page)
To locate the pages and go over them using Selenium, you need to induce WebDriverWait for the visibility_of_all_elements_located(), and you can use the following locator strategy:
Using XPATH:
driver.get('https://www.oddsportal.com/soccer/england/premier-league/')
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//a[text()='RESULTS']"))).click()
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//a[text()='2018/2019']"))).click()
print([my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.XPATH, "//span[@class='active-page']//following::a[@x-page]/span[not(contains(., '|')) and not(contains(., '»'))]/..")))])
Console Output:
['https://www.oddsportal.com/soccer/england/premier-league-2018-2019/results/#/page/2/', 'https://www.oddsportal.com/soccer/england/premier-league-2018-2019/results/#/page/3/', 'https://www.oddsportal.com/soccer/england/premier-league-2018-2019/results/#/page/4/', 'https://www.oddsportal.com/soccer/england/premier-league-2018-2019/results/#/page/5/', 'https://www.oddsportal.com/soccer/england/premier-league-2018-2019/results/#/page/6/', 'https://www.oddsportal.com/soccer/england/premier-league-2018-2019/results/#/page/7/', 'https://www.oddsportal.com/soccer/england/premier-league-2018-2019/results/#/page/8/']
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
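Once those hrefs are collected, you can iterate over the result pages with driver.get(); a minimal continuation of the snippet above:
# Continuing from the snippet above: store the page URLs, then visit each one
page_urls = [my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.XPATH, "//span[@class='active-page']//following::a[@x-page]/span[not(contains(., '|')) and not(contains(., '»'))]/..")))]
for page_url in page_urls:
    driver.get(page_url)
    # ... scrape each results page here ...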