Next Page Iteration in Selenium/BeautifulSoup for Scraping an E-Commerce Website - python

I'm scraping an e-commerce website, Lazada, using Selenium and bs4. I manage to scrape the 1st page, but I'm unable to iterate to the next page. What I'm trying to achieve is to scrape all the pages of the categories I've selected.
Here is what I've tried:
from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

# Run the argument with incognito
option = webdriver.ChromeOptions()
option.add_argument('--incognito')
driver = webdriver.Chrome(executable_path='chromedriver', chrome_options=option)
driver.get('https://www.lazada.com.my/')
driver.maximize_window()

# Select category item #
element = driver.find_elements_by_class_name('card-categories-li-content')[0]
webdriver.ActionChains(driver).move_to_element(element).click(element).perform()

t = 10
try:
    WebDriverWait(driver, t).until(EC.visibility_of_element_located((By.ID, "a2o4k.searchlistcategory.0.i0.460b6883jV3Y0q")))
except TimeoutException:
    print('Page Refresh!')
    driver.refresh()
    element = driver.find_elements_by_class_name('card-categories-li-content')[0]
    webdriver.ActionChains(driver).move_to_element(element).click(element).perform()
print('Page Load!')

# Soup and select element
def getData(np):
    soup = bs(driver.page_source, "lxml")
    product_containers = soup.findAll("div", class_='c2prKC')
    for p in product_containers:
        title = p.find(class_='c16H9d').text  # title
        selling_price = p.find(class_='c13VH6').text  # selling price
        try:
            original_price = p.find("del", class_='c13VH6').text  # original price
        except:
            original_price = "-1"
        if p.find("i", class_='ic-dynamic-badge ic-dynamic-badge-freeShipping ic-dynamic-group-2'):
            freeShipping = 1
        else:
            freeShipping = 0
        try:
            discount = p.find("span", class_='c1hkC1').text
        except:
            discount = "-1"
        if p.find("div", class_='c16H9d'):
            url = "https:" + p.find("a").get("href")
        else:
            url = "-1"
        print("- -" * 30)
        toSave = [title, selling_price, original_price, freeShipping, discount, url]
        print(toSave)
        writerows(toSave, filename)  # writerows and filename are defined elsewhere in my script
    # Click the next-page button, then recurse to scrape the following page
    nextpage_elements = driver.find_elements_by_class_name('ant-pagination-next')[0]
    np = webdriver.ActionChains(driver).move_to_element(nextpage_elements).click(nextpage_elements).perform()
    getData(np)

The problem might be that the driver is trying to click the button before the element has even loaded correctly.
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome(PATH, chrome_options=option)
# Use this code after driver initialization:
# it makes the driver wait up to 5 seconds for the page to load.
driver.implicitly_wait(5)
url = "https://www.lazada.com.ph/catalog/?q=phone&_keyori=ss&from=input&spm=a2o4l.home.search.go.239e359dTYxZXo"
driver.get(url)

next_page_path = "//ul[@class='ant-pagination ']//li[@class=' ant-pagination-next']"
# The following code waits up to 5 seconds for the
# element to become clickable and then tries clicking it.
try:
    next_page = WebDriverWait(driver, 5).until(
        EC.element_to_be_clickable((By.XPATH, next_page_path)))
    next_page.click()
except Exception as e:
    print(e)
EDIT 1
Changed the code to make the driver wait for the element to become clickable. You can put this code inside a while loop to iterate multiple times, and break the loop if the button is not found or is not clickable.
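For example, a minimal sketch of that while loop (assuming the driver, imports, and next_page_path from above; the loop breaks on the timeout raised when the next button is no longer clickable):
while True:
    # ... scrape the current page here ...
    try:
        next_page = WebDriverWait(driver, 5).until(
            EC.element_to_be_clickable((By.XPATH, next_page_path)))
        next_page.click()
    except Exception:
        break  # next button missing or not clickable: last page reached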


selenium: stale element reference: element is not attached to the page document

from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import chromedriver_autoinstaller

chromedriver_autoinstaller.install()

TYPES = ['user', 'verified_audience', 'top_critics']
TYPE = TYPES[2]
URL = 'https://www.rottentomatoes.com/m/dunkirk_2017/reviews'
PAGES = 2

driver = Chrome()
driver.get(URL)

data_reviews = []
while PAGES != 0:
    wait = WebDriverWait(driver, 30)
    reviews = wait.until(lambda _driver: _driver.find_elements(
        By.CSS_SELECTOR, '.review_table_row'))
    # Extracting review data
    for review in reviews:
        if TYPE == 'top_critics':
            critic_name_el = review.find_element(
                By.CSS_SELECTOR, '[data-qa=review-critic-link]')
            critic_review_text_el = review.find_element(
                By.CSS_SELECTOR, '[data-qa=review-text]')
            data_reviews.append(critic_name_el.text)
    try:
        next_button_el = driver.find_element(
            By.CSS_SELECTOR, '[data-qa=next-btn]:not([disabled=disabled])'
        )
        if not next_button_el:
            PAGES = 0
        next_button_el.click()  # refresh new reviews
        PAGES -= 1
    except Exception as e:
        driver.quit()
Here, a Rotten Tomatoes review page is opened and the reviews are scraped, but when the next button is clicked and the new reviews are about to be scraped, this error pops up... I am guessing that the new reviews have not been loaded, and trying to access them is causing the problem. I tried driver.implicitly_wait, but that doesn't work either.
The error originates from line 33, data_reviews.append(critic_name_el.text).
By clicking the next page button next_button_el, the new page starts loading, but this takes some time, while your Selenium code continues immediately after the click. So this line, reviews = wait.until(lambda _driver: _driver.find_elements(By.CSS_SELECTOR, '.review_table_row')), probably collects elements while still on the old page; the page is then refreshed, so some of the elements (critic_name_el) collected after that, still from the old page, are no longer attached once the new page loads.
To make your code work, you need to introduce a short delay after clicking the next page button, as follows:
import time  # needed for the added delay

data_reviews = []
while PAGES != 0:
    wait = WebDriverWait(driver, 30)
    reviews = wait.until(lambda _driver: _driver.find_elements(
        By.CSS_SELECTOR, '.review_table_row'))
    # Extracting review data
    for review in reviews:
        if TYPE == 'top_critics':
            critic_name_el = review.find_element(
                By.CSS_SELECTOR, '[data-qa=review-critic-link]')
            critic_review_text_el = review.find_element(
                By.CSS_SELECTOR, '[data-qa=review-text]')
            data_reviews.append(critic_name_el.text)
    try:
        next_button_el = driver.find_element(
            By.CSS_SELECTOR, '[data-qa=next-btn]:not([disabled=disabled])'
        )
        if not next_button_el:
            PAGES = 0
        next_button_el.click()  # refresh new reviews
        PAGES -= 1
        time.sleep(2)  # short delay so the next page can load
    except Exception as e:
        driver.quit()
Also, I'd suggest waiting for the elements' visibility, not just their presence, here:
reviews = wait.until(lambda _driver: _driver.find_elements(By.CSS_SELECTOR, '.review_table_row'))
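A minimal sketch of that suggestion, using the standard expected_conditions helper in place of the lambda:
reviews = wait.until(EC.visibility_of_all_elements_located(
    (By.CSS_SELECTOR, '.review_table_row')))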
Also, you need to understand that driver.implicitly_wait does not introduce any actual pause. It just sets the timeout for the find_element and find_elements methods.
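To illustrate the difference (a small sketch, not from the original answer; assumes time is imported):
driver.implicitly_wait(5)  # find_element now polls up to 5 s, returning as soon as the element appears
row = driver.find_element(By.CSS_SELECTOR, '.review_table_row')  # may return well before 5 s
time.sleep(5)              # by contrast, always blocks for the full 5 seconds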

navigate to next page and get href link

How do I navigate to the last page and get all the href links from a page whose URL doesn't change?
Here is my code:
url = 'https://hoaxornot.detik.com/paging#'
options = webdriver.ChromeOptions()
pathToChromeDriver = "C:/Program Files/Google/Chrome/Application/chromedriver.exe"
browser = webdriver.Chrome(executable_path=pathToChromeDriver,
                           options=options)
try:
    browser.get(url)
    browser.implicitly_wait(10)
    html = browser.page_source
    page = 1
    while page <= 2:
        paging = browser.find_elements_by_xpath('//*[@id="number_filters"]/a[{}]'.format(page)).click()
        for p in paging:
            articles = p.find_elements_by_xpath('//*[@id="results-search-hoax-paging"]/div/div/article/a')
            for article in articles:
                print(article.get_attribute("href"))
        page += 1
finally:
    browser.quit()
wait = WebDriverWait(browser, 60)
browser.get("https://hoaxornot.detik.com/paging#")
page = 1
articles = []
while True:
    try:
        time.sleep(1)
        pagearticles = wait.until(EC.visibility_of_all_elements_located((By.XPATH, '//*[@id="results-search-hoax-paging"]/div/div/article/a')))
        for article in pagearticles:
            articles.append(article.get_attribute("href"))
        page += 1
        wait.until(EC.element_to_be_clickable((By.XPATH, '//*[@id="number_filters"]/a[{}]'.format(page)))).click()
    except:
        break
print(articles)
Here's a simple way to loop through the pages and wait for the elements' visibility to come up, so you obtain their values instead of an empty list.
Imports:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
Outputs:
['https://news.detik.com/berita/d-5900248/video-jembatan-ambruk-disebut-di-samarinda-faktanya-bukan-di-indonesia', 'https://news.detik.com/berita/d-5898607/kantor-walkot-jakbar-diviralkan-rusak-akibat-gempa-ini-faktanya', 'https://news.detik.com/berita/d-5896931/polisi-di-singkawang-diviralkan-berbahasa-china-di-publik-begini-faktanya', 'https://news.detik.com/berita-jawa-timur/d-5895069/video-viral-hutan-baluran-banjir-dipastikan-hoax-polisi-itu-video-lama', 'https://news.detik.com/internasional/d-5873027/beredar-video-ledakan-parah-di-dubai-ternyata-3-insiden-lama-beda-negara', 'https://news.detik.com/berita/d-5865905/awas-ikut-tertipu-sejumlah-warga-ke-kantor-pln-bali-gegara-hoax-rekrutmen', 'https://news.detik.com/berita/d-5863802/beredar-pesan-gambar-kpk-pantau-muktamar-nu-di-lampung-ini-faktanya', 'https://news.detik.com/berita/d-5842083/viral-video-ayah-pukuli-anak-pakai-balok-kayu-begini-faktanya', 'https://news.detik.com/berita/d-5798562/video-mobil-ngebut-190-kmjam-dikaitkan-vanessa-angel-dipastikan-hoax', 'https://news.detik.com/berita/d-5755035/muncul-isu-liar-jokowi-joget-tanpa-masker-di-papua-ini-faktanya', 'https://news.detik.com/berita/d-5729500/beredar-edaran-penerima-bantuan-pesantren-kemenag-pastikan-hoax', 'https://news.detik.com/berita-jawa-timur/d-5715146/5-bersaudara-di-surabaya-butuh-diadopsi-karena-papa-mama-meninggal-covid-19-hoaks', 'https://news.detik.com/berita/d-5714873/minta-maaf-ustaz-royan-jelaskan-viral-5-polisi-angkat-poster-demo-jokowi', 'https://health.detik.com/berita-detikhealth/d-5714239/viral-bawang-putih-tarik-cairan-dari-paru-paru-pasien-corona-ini-faktanya', 'https://health.detik.com/berita-detikhealth/d-5699731/awas-hoax-viral-info-vaksin-palsu-beredar-di-indonesia-ini-faktanya', 'https://finance.detik.com/berita-ekonomi-bisnis/d-5688266/hoax-pesan-bantuan-subsidi-gaji-rp-35-juta-jangan-dibuka', 'https://news.detik.com/berita-jawa-timur/d-5658878/2-sekolah-ditolak-warga-bondowoso-jadi-tempat-isolasi-satgas-tak-patah-arang', 'https://news.detik.com/berita/d-5655368/viral-video-demo-rusuh-di-jl-gajah-mada-polisi-pastikan-hoax', 'https://news.detik.com/berita/d-5755035/muncul-isu-liar-jokowi-joget-tanpa-masker-di-papua-ini-faktanya', 'https://news.detik.com/berita/d-5729500/beredar-edaran-penerima-bantuan-pesantren-kemenag-pastikan-hoax', 'https://news.detik.com/berita-jawa-timur/d-5715146/5-bersaudara-di-surabaya-butuh-diadopsi-karena-papa-mama-meninggal-covid-19-hoaks', 'https://news.detik.com/berita/d-5714873/minta-maaf-ustaz-royan-jelaskan-viral-5-polisi-angkat-poster-demo-jokowi', 'https://health.detik.com/berita-detikhealth/d-5714239/viral-bawang-putih-tarik-cairan-dari-paru-paru-pasien-corona-ini-faktanya', 'https://health.detik.com/berita-detikhealth/d-5699731/awas-hoax-viral-info-vaksin-palsu-beredar-di-indonesia-ini-faktanya', 'https://finance.detik.com/berita-ekonomi-bisnis/d-5688266/hoax-pesan-bantuan-subsidi-gaji-rp-35-juta-jangan-dibuka', 'https://news.detik.com/berita-jawa-timur/d-5658878/2-sekolah-ditolak-warga-bondowoso-jadi-tempat-isolasi-satgas-tak-patah-arang', 'https://news.detik.com/berita/d-5655368/viral-video-demo-rusuh-di-jl-gajah-mada-polisi-pastikan-hoax', 'https://news.detik.com/berita-jawa-tengah/d-5645668/heboh-ajakan-tolak-ppkm-darurat-di-pekalongan-ini-kata-polisi', 'https://news.detik.com/berita/d-5643373/heboh-tim-covid-buru-warga-tanjungpinang-langgar-ppkm-darurat-ini-faktanya', 'https://news.detik.com/berita/d-5638774/viral-rusa-keliaran-di-jalanan-denpasar-saat-ppkm-darurat-ini-faktanya', 
'https://health.detik.com/berita-detikhealth/d-5635282/deretan-hoax-air-kelapa-netralkan-vaksin-hingga-obati-covid-19', 'https://news.detik.com/berita-jawa-tengah/d-5633158/beredar-pesan-ada-pasien-corona-kabur-di-kudus-ternyata', 'https://news.detik.com/berita-jawa-tengah/d-5622194/viral-tim-sar-klaten-kewalahan-jasad-covid-belum-dimakamkan-ini-faktanya', 'https://news.detik.com/berita/d-5607406/beredar-isu-sutiyoso-meninggal-keluarga-tidak-benar', 'https://news.detik.com/berita-jawa-tengah/d-5603576/waspada-ada-akun-wa-catut-bupati-klaten-minta-sumbangan', 'https://news.detik.com/berita-jawa-tengah/d-5603472/heboh-pesan-berantai-soal-varian-baru-corona-di-kudus-ini-faktanya', 'https://news.detik.com/berita/d-5591931/beredar-poster-konvensi-capres-nu-2024-pbnu-pastikan-hoax', 'https://health.detik.com/berita-detikhealth/d-5591504/viral-hoax-makan-bawang-3-kali-sehari-sembuhkan-corona-ini-faktanya', 'https://news.detik.com/berita/d-5590632/viral-tes-antigen-pakai-air-keran-hasilnya-positif-satgas-kepri-menepis', 'https://news.detik.com/internasional/d-5586179/fakta-di-balik-aksi-penyiar-malaysia-tutup-1-mata-untuk-palestina', 'https://inet.detik.com/cyberlife/d-5585732/waspada-6-hoax-vaksin-bermagnet-hingga-china-siapkan-senjata-biologis', 'https://health.detik.com/berita-detikhealth/d-5533468/viral-jadi-sulit-ereksi-karena-vaksin-sinovac-ini-penjelasan-dokter', 'https://health.detik.com/berita-detikhealth/d-5527149/viral-cacing-di-masker-impor-dari-china-ini-fakta-di-baliknya', 'https://finance.detik.com/energi/d-5526617/viral-gaji-petugas-kebersihan-pertamina-rp-13-juta-manajemen-hoax', 'https://news.detik.com/berita-jawa-tengah/d-5519314/fakta-fakta-gibran-disebut-duduk-di-meja-menteri-pupr-duduk-di-kursi', 'https://finance.detik.com/energi/d-5511928/awas-hoax-bbm-langka-imbas-kilang-kebakaran-pertamina-stok-luber', 'https://news.detik.com/berita-jawa-tengah/d-5511550/viral-gibran-duduk-di-atas-meja-depan-menteri-basuki-begini-faktanya', 'https://news.detik.com/berita/d-5507088/geger-kaca-bus-transmetro-deli-medan-diduga-ditembak-begini-faktanya', 'https://health.detik.com/berita-detikhealth/d-5487986/viral-lansia-non-dki-bisa-vaksin-corona-di-senayan-dipastikan-hoax', 'https://finance.detik.com/berita-ekonomi-bisnis/d-5487983/awas-hoax-pesan-berantai-soal-vaksinasi-lansia-di-istora-senayan', 'https://health.detik.com/berita-detikhealth/d-5480124/hoax-tak-ada-larangan-minum-obat-jantung-sebelum-vaksin-covid-19', 'https://health.detik.com/berita-detikhealth/d-5473657/hoax-kemenkes-bantah-puluhan-wartawan-terkapar-setelah-vaksinasi-covid-19', 'https://health.detik.com/berita-detikhealth/d-5368305/minum-air-putih-bisa-atasi-kekentalan-darah-pasien-covid-19-ini-faktanya', 'https://health.detik.com/berita-detikhealth/d-5360703/viral-info-penemu-vaksin-covid-19-sinovac-meninggal-ini-faktanya', 'https://health.detik.com/berita-detikhealth/d-5357602/pasien-jalan-ngangkang-seperti-penguin-disebut-karena-anal-swab-ini-faktanya', 'https://finance.detik.com/moneter/d-5351004/kabar-bi-di-lockdown-bank-internasional-swiss-dipastikan-hoax', 'https://finance.detik.com/berita-ekonomi-bisnis/d-5350942/hoax-jangan-percaya-pesan-berantai-dana-bagi-bagi-uang-tunai', 'https://health.detik.com/berita-detikhealth/d-5340874/sederet-hoax-vaksin-jokowi-disebut-salah-suntik-hingga-tak-sampai-habis', 'https://health.detik.com/berita-detikhealth/d-5338133/hoax-viral-kasdim-0817-gresik-wafat-usai-vaksin-covid-19-ini-faktanya', 
'https://health.detik.com/berita-detikhealth/d-5337075/viral-urutan-mandi-agar-tak-kena-stroke-ini-faktanya', 'https://news.detik.com/berita/d-5328895/foto-bayi-selamat-dari-sriwijaya-air-sj182-dipastikan-hoax', 'https://health.detik.com/berita-detikhealth/d-5324630/viral-vaksin-covid-19-memperbesar-penis-bpom-hoax-lah', 'https://news.detik.com/berita-jawa-timur/d-5321500/wawali-surabaya-terpilih-armuji-dikabarkan-meninggal-ketua-dprd-hoaks', 'https://news.detik.com/berita/d-5287986/beredar-chat-kapolda-metro-soal-sikat-laskar-hrs-dipastikan-hoax', 'https://news.detik.com/berita/d-5286913/video-ambulans-fpi-masuk-rs-saat-ricuh-diviralkan-ini-faktanya', 'https://news.detik.com/berita-jawa-tengah/d-5280091/viral-bendung-gerak-serayu-jebol-kepala-upt-itu-kapal-ponton-hanyut', 'https://news.detik.com/berita-jawa-tengah/d-5279872/viral-asrama-isolasi-mandiri-ugm-penuh-ternyata-begini-faktanya', 'https://news.detik.com/berita/d-5275107/kpu-makassar-bantah-keluarkan-flyer-hasil-survei-paslon-pilwalkot-berlogo-kpu', 'https://news.detik.com/berita-jawa-tengah/d-5264429/beredar-voice-note-binatang-buas-gunung-merapi-turun-ke-selo-kades-hoax', 'https://news.detik.com/berita-jawa-tengah/d-5262931/viral-peta-bahaya-gunung-merapi-sejauh-10-km-bpptkg-itu-peta-2010', 'https://health.detik.com/berita-detikhealth/d-5254580/viral-tips-sembuhkan-covid-19-dalam-waktu-5-menit-dokter-paru-pastikan-hoax', 'https://news.detik.com/berita-jawa-timur/d-5253524/video-jenazah-covid-19-diviralkan-bola-mata-hilang-keluarga-sebut-hoaks', 'https://news.detik.com/berita/d-5287986/beredar-chat-kapolda-metro-soal-sikat-laskar-hrs-dipastikan-hoax', 'https://news.detik.com/berita/d-5286913/video-ambulans-fpi-masuk-rs-saat-ricuh-diviralkan-ini-faktanya', 'https://news.detik.com/berita-jawa-tengah/d-5280091/viral-bendung-gerak-serayu-jebol-kepala-upt-itu-kapal-ponton-hanyut', 'https://news.detik.com/berita-jawa-tengah/d-5279872/viral-asrama-isolasi-mandiri-ugm-penuh-ternyata-begini-faktanya', 'https://news.detik.com/berita/d-5275107/kpu-makassar-bantah-keluarkan-flyer-hasil-survei-paslon-pilwalkot-berlogo-kpu', 'https://news.detik.com/berita-jawa-tengah/d-5264429/beredar-voice-note-binatang-buas-gunung-merapi-turun-ke-selo-kades-hoax', 'https://news.detik.com/berita-jawa-tengah/d-5262931/viral-peta-bahaya-gunung-merapi-sejauh-10-km-bpptkg-itu-peta-2010', 'https://health.detik.com/berita-detikhealth/d-5254580/viral-tips-sembuhkan-covid-19-dalam-waktu-5-menit-dokter-paru-pastikan-hoax', 'https://news.detik.com/berita-jawa-timur/d-5253524/video-jenazah-covid-19-diviralkan-bola-mata-hilang-keluarga-sebut-hoaks', 'https://news.detik.com/berita/d-3124615/benarkah-sesuap-lele-mengandung-3000-sel-kanker', 'https://news.detik.com/berita/d-3124915/loket-tiket-konser-bon-jovi-di-gbk-dibakar-hoax']

Stale Element error after a specific element in a list

Trying to get the tyres' details from this page. https://eurawheels.com/fr/catalogue/BBS
links = driver.find_elements_by_xpath('//div[@class="col-xs-1 col-md-3"]//a')
parent_window = driver.current_window_handle
x = 0
for j in range(len(links)):
    driver.execute_script('window.open(arguments[0]);', links[j])
    # scraping here
    if x == 0:
        driver.close()
        driver.switch_to.window(parent_window)
        x += 1
    else:
        driver.back()
        driver.refresh()  # refresh page
        tyres = WebDriverWait(driver, 25).until(EC.visibility_of_all_elements_located((By.XPATH, '//div[@class="card-body text-center"]//a')))  # redefine links
        time.sleep(4)
It works for 10 links but then the links go stale. Cannot figure out what needs to be changed. Any help is welcome.
You need to scroll each element into view before executing driver.execute_script('window.open(arguments[0]);', links[j]), since not all the elements are initially loaded on the page.
So your code should look like the following:
from selenium.webdriver.common.action_chains import ActionChains

actions = ActionChains(driver)
links = driver.find_elements_by_xpath('//div[@class="col-xs-1 col-md-3"]//a')
parent_window = driver.current_window_handle
x = 0
for j in range(len(links)):
    actions.move_to_element(links[j]).perform()  # scroll the link into view before opening it
    driver.execute_script('window.open(arguments[0]);', links[j])
    # scraping here
    if x == 0:
        driver.close()
        driver.switch_to.window(parent_window)
        x += 1
    else:
        driver.back()
        driver.refresh()  # refresh page
        tyres = WebDriverWait(driver, 25).until(EC.visibility_of_all_elements_located((By.XPATH, '//div[@class="card-body text-center"]//a')))  # redefine links
        time.sleep(4)
Try this:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

link = 'https://eurawheels.com/fr/catalogue/BBS'

with webdriver.Chrome() as driver:
    wait = WebDriverWait(driver, 15)
    driver.get(link)
    linklist = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".card-body > a")))
    for i, elem in enumerate(linklist):
        linklist[i].click()
        wait.until(EC.invisibility_of_element_located((By.CSS_SELECTOR, ".spinner-border[role='status']")))
        time.sleep(2)  # if you kick out this delay, your script will run very fast but you may end up getting the same results multiple times
        item = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "h3"))).text
        print(item)
        wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "h1.modal-title + button[class='close'][data-dismiss='modal']"))).click()
        driver.back()

Selenium cannot get all elements of a page

I am using Selenium to run a search on Agoda and scrape all the hotel names on the page, but the output only returns 2 names.
Then I tried adding a line to scroll to the bottom; now the output gives me the first 2 names and the last 2 names (first two from the beginning, last two from the bottom).
I don't understand what the problem is. I added time.sleep() after each step, so the whole page should have loaded completely. Does Selenium limit by page view, so that it can only scrape the elements in sight?
My code below:
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(30)
def scrape():
r = requests.get(current_page)
if r.status_code == requests.codes.ok:
print('start scraping!')
hotel = driver.find_elements_by_class_name('hotel-name')
hotels = []
for h in hotel:
if hotel:
hotels.append(h.text)
print(hotels, file=open("output.txt", 'a', encoding="utf-8"))
scrape()
Here is the page I want to scrape.
Try the below script to scroll the page down until no more results appear, and then scrape all the available names:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait as wait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.maximize_window()
driver.get('https://www.agoda.com/pages/agoda/default/DestinationSearchResult.aspx?asq=8wUBc629jr0%2B3O%2BxycijdcaVIGtokeWrEO7ShJumN8xsNvkFkEV9bUgNnbx6%2Bx22ncbzTLOPBjT84OgAAKXmu6quf8aEKRA%2FQH%2BGoyXgowLt%2BXyB8OpN1h2WP%2BnBM%2FwNPzD%2BpaeII93w%2Bs4dMWI4QPJNbZJ8DWvRiPsrPVVBJY7ilpMPlUermwV1UKIKfuyeis3BqRkJh9FzJOs0E98zXQ%3D%3D&city=9590&cid=-142&tick=636818018163&languageId=20&userId=3c2c4cb9-ba6d-4519-8ef4-c85dfd280b8f&sessionId=d4qzq2tgymjrwsf22lnadxpc&pageTypeId=1&origin=HK&locale=zh-TW&aid=130589&currencyCode=HKD&htmlLanguage=zh-tw&cultureInfoName=zh-TW&ckuid=3c2c4cb9-ba6d-4519-8ef4-c85dfd280b8f&prid=0&checkIn=2019-01-16&checkOut=2019-01-17&rooms=1&adults=2&children=0&priceCur=HKD&los=1&textToSearch=%E5%A4%A7%E9%98%AA&productType=-1&travellerType=1')

# Get initial list of names
hotels = wait(driver, 15).until(EC.presence_of_all_elements_located((By.CLASS_NAME, 'hotel-name')))
while True:
    # Scroll down to last name in list
    driver.execute_script('arguments[0].scrollIntoView();', hotels[-1])
    try:
        # Wait for more names to be loaded
        wait(driver, 15).until(lambda driver: len(wait(driver, 15).until(EC.presence_of_all_elements_located((By.CLASS_NAME, 'hotel-name')))) > len(hotels))
        # Update names list
        hotels = wait(driver, 15).until(EC.presence_of_all_elements_located((By.CLASS_NAME, 'hotel-name')))
    except:
        # Break the loop in case no new names loaded after page scrolled down
        break
# Print names list
print([hotel.text for hotel in hotels])

Python - Selenium next page

I am trying to make a scraping application for Hants.gov.uk, and right now I am working on just clicking through the pages instead of scraping. When it gets to the last row on page 1 it just stops, so what I did was make it click the "Next Page" button, but first it has to go back to the original URL. It clicks page 2, but after page 2 is scraped it doesn't go to page 3; it just restarts page 2.
Can somebody help me fix this issue?
Code:
import time
import config  # Don't worry about this. This is an external file to make a DB
import urllib.request
from bs4 import BeautifulSoup
from selenium import webdriver

url = "https://planning.hants.gov.uk/SearchResults.aspx?RecentDecisions=True"
driver = webdriver.Chrome(executable_path=r"C:\Users\Goten\Desktop\chromedriver.exe")
driver.get(url)
driver.find_element_by_id("mainContentPlaceHolder_btnAccept").click()

def start():
    elements = driver.find_elements_by_css_selector(".searchResult a")
    links = [link.get_attribute("href") for link in elements]
    result = []
    for link in links:
        if link not in result:
            result.append(link)
        else:
            driver.get(link)
            goUrl = urllib.request.urlopen(link)
            soup = BeautifulSoup(goUrl.read(), "html.parser")
            #table = soup.find_element_by_id("table", {"class": "applicationDetails"})
            for i in range(20):
                pass  # Don't worry about all this commented code, it isn't relevant right now
                #table = soup.find_element_by_id("table", {"class": "applicationDetails"})
                #print(table.text)
                # div = soup.select("div.applicationDetails")
                # getDiv = div[i].split(":")[1].get_text()
                # log = open("log.txt", "a")
                # log.write(getDiv + "\n")
            #log.write("\n")

start()
driver.get(url)
for i in range(5):
    driver.find_element_by_id("ctl00_mainContentPlaceHolder_lvResults_bottomPager_ctl02_NextButton").click()
    url = driver.current_url
    start()
    driver.get(url)
driver.close()
Try this:
import time
# import config  # Don't worry about this. This is an external file to make a DB
import urllib.request
from bs4 import BeautifulSoup
from selenium import webdriver

url = "https://planning.hants.gov.uk/SearchResults.aspx?RecentDecisions=True"
driver = webdriver.Chrome()
driver.get(url)
driver.find_element_by_id("mainContentPlaceHolder_btnAccept").click()

result = []

def start():
    elements = driver.find_elements_by_css_selector(".searchResult a")
    links = [link.get_attribute("href") for link in elements]
    result.extend(links)

def start2():
    for link in result:
        # if link not in result:
        #     result.append(link)
        # else:
        driver.get(link)
        goUrl = urllib.request.urlopen(link)
        soup = BeautifulSoup(goUrl.read(), "html.parser")
        #table = soup.find_element_by_id("table", {"class": "applicationDetails"})
        for i in range(20):
            pass  # Don't worry about all this commented code, it isn't relevant right now
            #table = soup.find_element_by_id("table", {"class": "applicationDetails"})
            #print(table.text)
            # div = soup.select("div.applicationDetails")
            # getDiv = div[i].split(":")[1].get_text()
            # log = open("log.txt", "a")
            # log.write(getDiv + "\n")
        #log.write("\n")

while True:
    start()
    element = driver.find_element_by_class_name('rdpPageNext')
    try:
        check = element.get_attribute('onclick')
        if check != "return false;":
            element.click()
        else:
            break
    except:
        break

print(result)
start2()
driver.get(url)
As per the URL https://planning.hants.gov.uk/SearchResults.aspx?RecentDecisions=True, to click through all the pages you can use the following solution:
Code Block:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.add_argument("start-maximized")
options.add_argument("disable-infobars")
options.add_argument("--disable-extensions")
driver = webdriver.Chrome(chrome_options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
driver.get('https://planning.hants.gov.uk/SearchResults.aspx?RecentDecisions=True')
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.ID, "mainContentPlaceHolder_btnAccept"))).click()
numLinks = len(WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div#ctl00_mainContentPlaceHolder_lvResults_topPager div.rdpWrap.rdpNumPart>a"))))
print(numLinks)
for i in range(numLinks):
    print("Perform your scrapping here on page {}".format(str(i+1)))
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//div[@id='ctl00_mainContentPlaceHolder_lvResults_topPager']//div[@class='rdpWrap rdpNumPart']//a[@class='rdpCurrentPage']/span//following::span[1]"))).click()
driver.quit()
Console Output:
8
Perform your scrapping here on page 1
Perform your scrapping here on page 2
Perform your scrapping here on page 3
Perform your scrapping here on page 4
Perform your scrapping here on page 5
Perform your scrapping here on page 6
Perform your scrapping here on page 7
Perform your scrapping here on page 8
hi @Feitan Portor, you have written the code absolutely perfectly. The only reason you are redirected back to the first page is that you have given url = driver.current_url in the last for loop: the URL remains static, and only the JavaScript instigates the next-click event. So just remove url = driver.current_url and driver.get(url) and you are good to go; I have tested it myself.
Also, to get the current page that your scraper is on, just add this part in the for loop so you will know where your scraper is:
ss = driver.find_element_by_class_name('rdpCurrentPage').text
print(ss)
Hope this solves your confusion.
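Put together, a sketch of the corrected loop from the question with those two lines removed (assuming the same driver setup and start() as above):
start()
for i in range(5):
    driver.find_element_by_id("ctl00_mainContentPlaceHolder_lvResults_bottomPager_ctl02_NextButton").click()
    ss = driver.find_element_by_class_name('rdpCurrentPage').text
    print(ss)  # shows which page the scraper is currently on
    start()
driver.close()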
