I am trying to scrape basic information on Google. The code I am using is the following. Unfortunately it does not move to the next page and I cannot figure out why. I am using Selenium and Google Chrome as the browser (not Firefox). Could you please tell me what is wrong with my code?
driver.get('https://www.google.com/advanced_search?q=google&tbs=cdr:1,cd_min:3/4/2020,cd_max:3/4/2020&hl=en')
search = driver.find_element_by_name('q')
search.send_keys('tea')
search.submit()

soup = BeautifulSoup(driver.page_source, 'lxml')
result_div = soup.find_all('div', attrs={'class': 'g'})

titles = []
while True:
    next_page_btn = driver.find_elements_by_xpath("//a[@id='pnnext']")
    for r in result_div:
        if len(next_page_btn) < 1:
            print("no more pages left")
            break
        else:
            try:
                title = None
                title = r.find('h3')
                if isinstance(title, Tag):
                    title = title.get_text()
                    print(title)
                if title != '':
                    titles.append(title)
            except:
                continue
    element = WebDriverWait(driver, 5).until(expected_conditions.element_to_be_clickable((By.ID, 'pnnext')))
    driver.execute_script("return arguments[0].scrollIntoView();", element)
    element.click()
I set q in the query string to an empty string, used as_q rather than q for the search-box name, and reordered your code a bit. I put in a page limit to stop it going on forever.
from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions

driver = webdriver.Chrome()
driver.get('https://www.google.com/advanced_search?q=&tbs=cdr:1,cd_min:3/4/2020,cd_max:3/4/2020&hl=en')
search = driver.find_element_by_name('as_q')
search.send_keys('tea')
search.submit()

titles = []
page_limit = 5
page = 0
while True:
    soup = BeautifulSoup(driver.page_source, 'lxml')
    result_div = soup.find_all('div', attrs={'class': 'g'})
    for r in result_div:
        for title in r.find_all('h3'):
            title = title.get_text()
            print(title)
            titles.append(title)
    next_page_btn = driver.find_elements_by_id('pnnext')
    if len(next_page_btn) == 0 or page > page_limit:
        break
    element = WebDriverWait(driver, 5).until(expected_conditions.element_to_be_clickable((By.ID, 'pnnext')))
    driver.execute_script("return arguments[0].scrollIntoView();", element)
    element.click()
    page = page + 1

driver.quit()
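Note: current Selenium releases (4.x) removed the find_element_by_* helper methods used above, so on a newer install the same lookups need By locators. A minimal sketch of the equivalent calls, assuming Selenium 4:

from selenium.webdriver.common.by import By

search = driver.find_element(By.NAME, 'as_q')          # replaces find_element_by_name('as_q')
next_page_btn = driver.find_elements(By.ID, 'pnnext')  # replaces find_elements_by_id('pnnext')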
How to navigate to the last page and get all href links from a page whose URL doesn't change?
Here my code is:
url = 'https://hoaxornot.detik.com/paging#'
options = webdriver.ChromeOptions()
pathToChromeDriver = "C:/Program Files/Google/Chrome/Application/chromedriver.exe"
browser = webdriver.Chrome(executable_path=pathToChromeDriver,
                           options=options)
try:
    browser.get(url)
    browser.implicitly_wait(10)
    html = browser.page_source
    page = 1
    while page <= 2:
        paging = browser.find_elements_by_xpath('//*[@id="number_filters"]/a[{}]'.format(page)).click()
        for p in paging:
            articles = p.find_elements_by_xpath('//*[@id="results-search-hoax-paging"]/div/div/article/a')
            for article in articles:
                print(article.get_attribute("href"))
        page += 1
finally:
    browser.quit()
wait = WebDriverWait(browser, 60)
browser.get("https://hoaxornot.detik.com/paging#")
page = 1
articles = []
while True:
    try:
        time.sleep(1)
        pagearticles = wait.until(EC.visibility_of_all_elements_located((By.XPATH, '//*[@id="results-search-hoax-paging"]/div/div/article/a')))
        for article in pagearticles:
            articles.append(article.get_attribute("href"))
        page += 1
        wait.until(EC.element_to_be_clickable((By.XPATH, '//*[@id="number_filters"]/a[{}]'.format(page)))).click()
    except:
        break
print(articles)
Here's a simple way to loop through the pages, waiting for the elements to become visible so you obtain their values instead of an empty list.
Imports:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
Outputs:
['https://news.detik.com/berita/d-5900248/video-jembatan-ambruk-disebut-di-samarinda-faktanya-bukan-di-indonesia', 'https://news.detik.com/berita/d-5898607/kantor-walkot-jakbar-diviralkan-rusak-akibat-gempa-ini-faktanya', 'https://news.detik.com/berita/d-5896931/polisi-di-singkawang-diviralkan-berbahasa-china-di-publik-begini-faktanya', 'https://news.detik.com/berita-jawa-timur/d-5895069/video-viral-hutan-baluran-banjir-dipastikan-hoax-polisi-itu-video-lama', 'https://news.detik.com/internasional/d-5873027/beredar-video-ledakan-parah-di-dubai-ternyata-3-insiden-lama-beda-negara', 'https://news.detik.com/berita/d-5865905/awas-ikut-tertipu-sejumlah-warga-ke-kantor-pln-bali-gegara-hoax-rekrutmen', 'https://news.detik.com/berita/d-5863802/beredar-pesan-gambar-kpk-pantau-muktamar-nu-di-lampung-ini-faktanya', 'https://news.detik.com/berita/d-5842083/viral-video-ayah-pukuli-anak-pakai-balok-kayu-begini-faktanya', 'https://news.detik.com/berita/d-5798562/video-mobil-ngebut-190-kmjam-dikaitkan-vanessa-angel-dipastikan-hoax', 'https://news.detik.com/berita/d-5755035/muncul-isu-liar-jokowi-joget-tanpa-masker-di-papua-ini-faktanya', 'https://news.detik.com/berita/d-5729500/beredar-edaran-penerima-bantuan-pesantren-kemenag-pastikan-hoax', 'https://news.detik.com/berita-jawa-timur/d-5715146/5-bersaudara-di-surabaya-butuh-diadopsi-karena-papa-mama-meninggal-covid-19-hoaks', 'https://news.detik.com/berita/d-5714873/minta-maaf-ustaz-royan-jelaskan-viral-5-polisi-angkat-poster-demo-jokowi', 'https://health.detik.com/berita-detikhealth/d-5714239/viral-bawang-putih-tarik-cairan-dari-paru-paru-pasien-corona-ini-faktanya', 'https://health.detik.com/berita-detikhealth/d-5699731/awas-hoax-viral-info-vaksin-palsu-beredar-di-indonesia-ini-faktanya', 'https://finance.detik.com/berita-ekonomi-bisnis/d-5688266/hoax-pesan-bantuan-subsidi-gaji-rp-35-juta-jangan-dibuka', 'https://news.detik.com/berita-jawa-timur/d-5658878/2-sekolah-ditolak-warga-bondowoso-jadi-tempat-isolasi-satgas-tak-patah-arang', 'https://news.detik.com/berita/d-5655368/viral-video-demo-rusuh-di-jl-gajah-mada-polisi-pastikan-hoax', 'https://news.detik.com/berita/d-5755035/muncul-isu-liar-jokowi-joget-tanpa-masker-di-papua-ini-faktanya', 'https://news.detik.com/berita/d-5729500/beredar-edaran-penerima-bantuan-pesantren-kemenag-pastikan-hoax', 'https://news.detik.com/berita-jawa-timur/d-5715146/5-bersaudara-di-surabaya-butuh-diadopsi-karena-papa-mama-meninggal-covid-19-hoaks', 'https://news.detik.com/berita/d-5714873/minta-maaf-ustaz-royan-jelaskan-viral-5-polisi-angkat-poster-demo-jokowi', 'https://health.detik.com/berita-detikhealth/d-5714239/viral-bawang-putih-tarik-cairan-dari-paru-paru-pasien-corona-ini-faktanya', 'https://health.detik.com/berita-detikhealth/d-5699731/awas-hoax-viral-info-vaksin-palsu-beredar-di-indonesia-ini-faktanya', 'https://finance.detik.com/berita-ekonomi-bisnis/d-5688266/hoax-pesan-bantuan-subsidi-gaji-rp-35-juta-jangan-dibuka', 'https://news.detik.com/berita-jawa-timur/d-5658878/2-sekolah-ditolak-warga-bondowoso-jadi-tempat-isolasi-satgas-tak-patah-arang', 'https://news.detik.com/berita/d-5655368/viral-video-demo-rusuh-di-jl-gajah-mada-polisi-pastikan-hoax', 'https://news.detik.com/berita-jawa-tengah/d-5645668/heboh-ajakan-tolak-ppkm-darurat-di-pekalongan-ini-kata-polisi', 'https://news.detik.com/berita/d-5643373/heboh-tim-covid-buru-warga-tanjungpinang-langgar-ppkm-darurat-ini-faktanya', 'https://news.detik.com/berita/d-5638774/viral-rusa-keliaran-di-jalanan-denpasar-saat-ppkm-darurat-ini-faktanya', 
'https://health.detik.com/berita-detikhealth/d-5635282/deretan-hoax-air-kelapa-netralkan-vaksin-hingga-obati-covid-19', 'https://news.detik.com/berita-jawa-tengah/d-5633158/beredar-pesan-ada-pasien-corona-kabur-di-kudus-ternyata', 'https://news.detik.com/berita-jawa-tengah/d-5622194/viral-tim-sar-klaten-kewalahan-jasad-covid-belum-dimakamkan-ini-faktanya', 'https://news.detik.com/berita/d-5607406/beredar-isu-sutiyoso-meninggal-keluarga-tidak-benar', 'https://news.detik.com/berita-jawa-tengah/d-5603576/waspada-ada-akun-wa-catut-bupati-klaten-minta-sumbangan', 'https://news.detik.com/berita-jawa-tengah/d-5603472/heboh-pesan-berantai-soal-varian-baru-corona-di-kudus-ini-faktanya', 'https://news.detik.com/berita/d-5591931/beredar-poster-konvensi-capres-nu-2024-pbnu-pastikan-hoax', 'https://health.detik.com/berita-detikhealth/d-5591504/viral-hoax-makan-bawang-3-kali-sehari-sembuhkan-corona-ini-faktanya', 'https://news.detik.com/berita/d-5590632/viral-tes-antigen-pakai-air-keran-hasilnya-positif-satgas-kepri-menepis', 'https://news.detik.com/internasional/d-5586179/fakta-di-balik-aksi-penyiar-malaysia-tutup-1-mata-untuk-palestina', 'https://inet.detik.com/cyberlife/d-5585732/waspada-6-hoax-vaksin-bermagnet-hingga-china-siapkan-senjata-biologis', 'https://health.detik.com/berita-detikhealth/d-5533468/viral-jadi-sulit-ereksi-karena-vaksin-sinovac-ini-penjelasan-dokter', 'https://health.detik.com/berita-detikhealth/d-5527149/viral-cacing-di-masker-impor-dari-china-ini-fakta-di-baliknya', 'https://finance.detik.com/energi/d-5526617/viral-gaji-petugas-kebersihan-pertamina-rp-13-juta-manajemen-hoax', 'https://news.detik.com/berita-jawa-tengah/d-5519314/fakta-fakta-gibran-disebut-duduk-di-meja-menteri-pupr-duduk-di-kursi', 'https://finance.detik.com/energi/d-5511928/awas-hoax-bbm-langka-imbas-kilang-kebakaran-pertamina-stok-luber', 'https://news.detik.com/berita-jawa-tengah/d-5511550/viral-gibran-duduk-di-atas-meja-depan-menteri-basuki-begini-faktanya', 'https://news.detik.com/berita/d-5507088/geger-kaca-bus-transmetro-deli-medan-diduga-ditembak-begini-faktanya', 'https://health.detik.com/berita-detikhealth/d-5487986/viral-lansia-non-dki-bisa-vaksin-corona-di-senayan-dipastikan-hoax', 'https://finance.detik.com/berita-ekonomi-bisnis/d-5487983/awas-hoax-pesan-berantai-soal-vaksinasi-lansia-di-istora-senayan', 'https://health.detik.com/berita-detikhealth/d-5480124/hoax-tak-ada-larangan-minum-obat-jantung-sebelum-vaksin-covid-19', 'https://health.detik.com/berita-detikhealth/d-5473657/hoax-kemenkes-bantah-puluhan-wartawan-terkapar-setelah-vaksinasi-covid-19', 'https://health.detik.com/berita-detikhealth/d-5368305/minum-air-putih-bisa-atasi-kekentalan-darah-pasien-covid-19-ini-faktanya', 'https://health.detik.com/berita-detikhealth/d-5360703/viral-info-penemu-vaksin-covid-19-sinovac-meninggal-ini-faktanya', 'https://health.detik.com/berita-detikhealth/d-5357602/pasien-jalan-ngangkang-seperti-penguin-disebut-karena-anal-swab-ini-faktanya', 'https://finance.detik.com/moneter/d-5351004/kabar-bi-di-lockdown-bank-internasional-swiss-dipastikan-hoax', 'https://finance.detik.com/berita-ekonomi-bisnis/d-5350942/hoax-jangan-percaya-pesan-berantai-dana-bagi-bagi-uang-tunai', 'https://health.detik.com/berita-detikhealth/d-5340874/sederet-hoax-vaksin-jokowi-disebut-salah-suntik-hingga-tak-sampai-habis', 'https://health.detik.com/berita-detikhealth/d-5338133/hoax-viral-kasdim-0817-gresik-wafat-usai-vaksin-covid-19-ini-faktanya', 
'https://health.detik.com/berita-detikhealth/d-5337075/viral-urutan-mandi-agar-tak-kena-stroke-ini-faktanya', 'https://news.detik.com/berita/d-5328895/foto-bayi-selamat-dari-sriwijaya-air-sj182-dipastikan-hoax', 'https://health.detik.com/berita-detikhealth/d-5324630/viral-vaksin-covid-19-memperbesar-penis-bpom-hoax-lah', 'https://news.detik.com/berita-jawa-timur/d-5321500/wawali-surabaya-terpilih-armuji-dikabarkan-meninggal-ketua-dprd-hoaks', 'https://news.detik.com/berita/d-5287986/beredar-chat-kapolda-metro-soal-sikat-laskar-hrs-dipastikan-hoax', 'https://news.detik.com/berita/d-5286913/video-ambulans-fpi-masuk-rs-saat-ricuh-diviralkan-ini-faktanya', 'https://news.detik.com/berita-jawa-tengah/d-5280091/viral-bendung-gerak-serayu-jebol-kepala-upt-itu-kapal-ponton-hanyut', 'https://news.detik.com/berita-jawa-tengah/d-5279872/viral-asrama-isolasi-mandiri-ugm-penuh-ternyata-begini-faktanya', 'https://news.detik.com/berita/d-5275107/kpu-makassar-bantah-keluarkan-flyer-hasil-survei-paslon-pilwalkot-berlogo-kpu', 'https://news.detik.com/berita-jawa-tengah/d-5264429/beredar-voice-note-binatang-buas-gunung-merapi-turun-ke-selo-kades-hoax', 'https://news.detik.com/berita-jawa-tengah/d-5262931/viral-peta-bahaya-gunung-merapi-sejauh-10-km-bpptkg-itu-peta-2010', 'https://health.detik.com/berita-detikhealth/d-5254580/viral-tips-sembuhkan-covid-19-dalam-waktu-5-menit-dokter-paru-pastikan-hoax', 'https://news.detik.com/berita-jawa-timur/d-5253524/video-jenazah-covid-19-diviralkan-bola-mata-hilang-keluarga-sebut-hoaks', 'https://news.detik.com/berita/d-5287986/beredar-chat-kapolda-metro-soal-sikat-laskar-hrs-dipastikan-hoax', 'https://news.detik.com/berita/d-5286913/video-ambulans-fpi-masuk-rs-saat-ricuh-diviralkan-ini-faktanya', 'https://news.detik.com/berita-jawa-tengah/d-5280091/viral-bendung-gerak-serayu-jebol-kepala-upt-itu-kapal-ponton-hanyut', 'https://news.detik.com/berita-jawa-tengah/d-5279872/viral-asrama-isolasi-mandiri-ugm-penuh-ternyata-begini-faktanya', 'https://news.detik.com/berita/d-5275107/kpu-makassar-bantah-keluarkan-flyer-hasil-survei-paslon-pilwalkot-berlogo-kpu', 'https://news.detik.com/berita-jawa-tengah/d-5264429/beredar-voice-note-binatang-buas-gunung-merapi-turun-ke-selo-kades-hoax', 'https://news.detik.com/berita-jawa-tengah/d-5262931/viral-peta-bahaya-gunung-merapi-sejauh-10-km-bpptkg-itu-peta-2010', 'https://health.detik.com/berita-detikhealth/d-5254580/viral-tips-sembuhkan-covid-19-dalam-waktu-5-menit-dokter-paru-pastikan-hoax', 'https://news.detik.com/berita-jawa-timur/d-5253524/video-jenazah-covid-19-diviralkan-bola-mata-hilang-keluarga-sebut-hoaks', 'https://news.detik.com/berita/d-3124615/benarkah-sesuap-lele-mengandung-3000-sel-kanker', 'https://news.detik.com/berita/d-3124915/loket-tiket-konser-bon-jovi-di-gbk-dibakar-hoax']
How do you scrape a web page with infinite scrolling?
My first try was using Selenium, but it gets detected as a robot:
from selenium import webdriver
from bs4 import BeautifulSoup
import time
import pandas as pd

url = 'https://www.bloomberg.com/search?query=indonesia%20mining'
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument("disable-infobars")
options.add_argument("--disable-extensions")

driver = webdriver.Chrome(options=options)
driver.get(url)
html = driver.page_source.encode('utf-8')

page_num = 0
while driver.find_elements_by_css_selector('.contentWell__a8d28605a5'):
    driver.find_element_by_css_selector('.contentWell__a8d28605a5').click()
    page_num += 1
    print("getting page number " + str(page_num))
    time.sleep(1)

html = driver.page_source.encode('utf-8')
soup = BeautifulSoup(html, 'lxml')
titles = soup.find_all('div', {"class": "text__d88756958e withThumbnail__c4ffc902a6"})

df = pd.DataFrame(columns=['judul', 'link'])
news = {}
for t in titles:
    news['judul'] = t.find('a', {'class': 'headline__96ba1917df'}).text.strip()
    news['link'] = t.find('a', {'class': 'headline__96ba1917df'}).get('href')
    df = df.append(news, ignore_index=True)
Any idea how to limit the maximum page number?
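One way to cap it is to count the clicks and stop at a threshold, mirroring the page_limit idea from the first answer above. A sketch reusing the question's own selector (untested against Bloomberg's bot detection; max_pages is a hypothetical cap):

max_pages = 10  # stop after this many "load more" clicks
page_num = 0
while page_num < max_pages and driver.find_elements_by_css_selector('.contentWell__a8d28605a5'):
    driver.find_element_by_css_selector('.contentWell__a8d28605a5').click()
    page_num += 1
    print("getting page number " + str(page_num))
    time.sleep(1)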
So I came from the question here
Now I am able to interact with the page, scroll down the page, close the popup that appears and click at the bottom to expand the page.
The problem is that when I count the items, the code returns only 20 when it should be 40.
I have checked the code again and again - I'm missing something but I don't know what.
See my code below:
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
import time
import datetime

options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
#options.add_argument('--headless')
driver = webdriver.Chrome(executable_path=r"C:\\chromedriver.exe", options=options)

url = 'https://www.coolmod.com/componentes-pc-procesadores?f=375::No'
driver.get(url)

iter = 1
while True:
    scrollHeight = driver.execute_script("return document.documentElement.scrollHeight")
    Height = 10 * iter
    driver.execute_script("window.scrollTo(0, " + str(Height) + ");")
    if Height > scrollHeight:
        print('End of page')
        break
    iter += 1

time.sleep(3)
popup = driver.find_element_by_class_name('confirm').click()
time.sleep(3)

ver_mas = driver.find_elements_by_class_name('button-load-more')
for x in range(len(ver_mas)):
    if ver_mas[x].is_displayed():
        driver.execute_script("arguments[0].click();", ver_mas[x])
        time.sleep(10)

page_source = driver.page_source
soup = BeautifulSoup(page_source, 'lxml')
# print(soup)
items = soup.find_all('div', class_='col-xs-12 col-sm-6 col-sm-6 col-md-6 col-lg-3 col-product col-custom-width')
print(len(items))
What is wrong? I'm a newbie in the scraping world.
Regards
Your while and for statements don't work as intended:
- Using while True: is bad practice.
- You scroll to the bottom of the page, but the button-load-more button isn't displayed there, so Selenium will not find it as displayed.
- find_elements_by_class_name looks for multiple elements, but the page has only one element with that class.
- if ver_mas[x].is_displayed(): - if you are lucky this will be executed only once, because the range is 1.
Below you can find the solution - the code looks for the button, moves to it instead of scrolling, and performs a click. If the code fails to find the button - meaning all the items were loaded - it breaks out of the while loop and moves forward.
# assumes driver is already initialized as in the question
from selenium.webdriver.common.action_chains import ActionChains
from selenium.common.exceptions import NoSuchElementException
from bs4 import BeautifulSoup
import time

url = 'https://www.coolmod.com/componentes-pc-procesadores?f=375::No'
driver.get(url)
time.sleep(3)
popup = driver.find_element_by_class_name('confirm').click()

iter = 1
while iter > 0:
    time.sleep(3)
    try:
        ver_mas = driver.find_element_by_class_name('button-load-more')
        actions = ActionChains(driver)
        actions.move_to_element(ver_mas).perform()
        driver.execute_script("arguments[0].click();", ver_mas)
    except NoSuchElementException:
        break
    iter += 1

page_source = driver.page_source
soup = BeautifulSoup(page_source, 'lxml')
# print(soup)
items = soup.find_all('div', class_='col-xs-12 col-sm-6 col-sm-6 col-md-6 col-lg-3 col-product col-custom-width')
print(len(items))
I'm scraping an e-commerce website, Lazada, using Selenium and bs4. I managed to scrape the 1st page but I am unable to iterate to the next page. What I'm trying to achieve is to scrape all the pages based on the categories I've selected.
Here is what I've tried:
# Run the argument with incognito
option = webdriver.ChromeOptions()
option.add_argument('--incognito')
driver = webdriver.Chrome(executable_path='chromedriver', chrome_options=option)
driver.get('https://www.lazada.com.my/')
driver.maximize_window()

# Select category item #
element = driver.find_elements_by_class_name('card-categories-li-content')[0]
webdriver.ActionChains(driver).move_to_element(element).click(element).perform()

t = 10
try:
    WebDriverWait(driver, t).until(EC.visibility_of_element_located((By.ID, "a2o4k.searchlistcategory.0.i0.460b6883jV3Y0q")))
except TimeoutException:
    print('Page Refresh!')
    driver.refresh()
    element = driver.find_elements_by_class_name('card-categories-li-content')[0]
    webdriver.ActionChains(driver).move_to_element(element).click(element).perform()
    print('Page Load!')

# Soup and select element
def getData(np):
    soup = bs(driver.page_source, "lxml")
    product_containers = soup.findAll("div", class_='c2prKC')
    for p in product_containers:
        title = (p.find(class_='c16H9d').text)  # title
        selling_price = (p.find(class_='c13VH6').text)  # selling price
        try:
            original_price = (p.find("del", class_='c13VH6').text)  # original price
        except:
            original_price = "-1"
        if p.find("i", class_='ic-dynamic-badge ic-dynamic-badge-freeShipping ic-dynamic-group-2'):
            freeShipping = 1
        else:
            freeShipping = 0
        try:
            discount = (p.find("span", class_='c1hkC1').text)
        except:
            discount = "-1"
        if p.find(("div", {'class': ['c16H9d']})):
            url = "https:" + (p.find("a").get("href"))
        else:
            url = "-1"
        nextpage_elements = driver.find_elements_by_class_name('ant-pagination-next')[0]
        np = webdriver.ActionChains(driver).move_to_element(nextpage_elements).click(nextpage_elements).perform()
        print("- -" * 30)
        toSave = [title, selling_price, original_price, freeShipping, discount, url]
        print(toSave)
        writerows(toSave, filename)

getData(np)
The problem might be that the driver is trying to click the button before the element is even loaded correctly.
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome(PATH, chrome_options=option)

# Use this code after driver initialization;
# it makes the driver wait 5 seconds for the page to load.
driver.implicitly_wait(5)

url = "https://www.lazada.com.ph/catalog/?q=phone&_keyori=ss&from=input&spm=a2o4l.home.search.go.239e359dTYxZXo"
driver.get(url)

next_page_path = "//ul[@class='ant-pagination ']//li[@class=' ant-pagination-next']"

# The following code waits 5 seconds for the
# element to become clickable,
# then tries clicking the element.
try:
    next_page = WebDriverWait(driver, 5).until(
        EC.element_to_be_clickable((By.XPATH, next_page_path)))
    next_page.click()
except Exception as e:
    print(e)
EDIT 1
Changed the code to make the driver wait for the element to become clickable. You can put this code inside a while loop to iterate multiple times, breaking out of the loop once the button is no longer found or clickable - see the sketch below.
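A minimal sketch of that loop, reusing the next_page_path locator from above and assuming each click loads the next page:

from selenium.common.exceptions import TimeoutException

while True:
    try:
        # wait for the "next page" button to become clickable, then click it
        next_page = WebDriverWait(driver, 5).until(
            EC.element_to_be_clickable((By.XPATH, next_page_path)))
        next_page.click()
        # ... scrape the freshly loaded page here ...
    except TimeoutException:
        # the button never became clickable: assume the last page was reached
        break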
I am trying to make a scraping application to scrape Hants.gov.uk, and right now I am working on just clicking through the pages instead of scraping. When it gets to the last row on page 1 it just stops, so what I did was make it click the "Next Page" button, but first it has to go back to the original URL. It clicks page 2, but after page 2 is scraped it doesn't go to page 3; it just restarts page 2.
Can somebody help me fix this issue?
Code:
import time
import config  # Don't worry about this. This is an external file to make a DB
import urllib.request
from bs4 import BeautifulSoup
from selenium import webdriver

url = "https://planning.hants.gov.uk/SearchResults.aspx?RecentDecisions=True"

driver = webdriver.Chrome(executable_path=r"C:\Users\Goten\Desktop\chromedriver.exe")
driver.get(url)

driver.find_element_by_id("mainContentPlaceHolder_btnAccept").click()

def start():
    elements = driver.find_elements_by_css_selector(".searchResult a")
    links = [link.get_attribute("href") for link in elements]
    result = []
    for link in links:
        if link not in result:
            result.append(link)
        else:
            driver.get(link)
            goUrl = urllib.request.urlopen(link)
            soup = BeautifulSoup(goUrl.read(), "html.parser")
            #table = soup.find_element_by_id("table", {"class": "applicationDetails"})
            for i in range(20):
                pass  # Don't worry about all this commented code, it isn't relevant right now
                #table = soup.find_element_by_id("table", {"class": "applicationDetails"})
                #print(table.text)
                # div = soup.select("div.applicationDetails")
                # getDiv = div[i].split(":")[1].get_text()
                # log = open("log.txt", "a")
                # log.write(getDiv + "\n")
            #log.write("\n")

start()
driver.get(url)

for i in range(5):
    driver.find_element_by_id("ctl00_mainContentPlaceHolder_lvResults_bottomPager_ctl02_NextButton").click()
    url = driver.current_url
    start()
    driver.get(url)

driver.close()
try this:
import time
# import config # Don't worry about this. This is an external file to make a DB
import urllib.request
from bs4 import BeautifulSoup
from selenium import webdriver

url = "https://planning.hants.gov.uk/SearchResults.aspx?RecentDecisions=True"

driver = webdriver.Chrome()
driver.get(url)

driver.find_element_by_id("mainContentPlaceHolder_btnAccept").click()

result = []

def start():
    elements = driver.find_elements_by_css_selector(".searchResult a")
    links = [link.get_attribute("href") for link in elements]
    result.extend(links)

def start2():
    for link in result:
        # if link not in result:
        #     result.append(link)
        # else:
        driver.get(link)
        goUrl = urllib.request.urlopen(link)
        soup = BeautifulSoup(goUrl.read(), "html.parser")
        #table = soup.find_element_by_id("table", {"class": "applicationDetails"})
        for i in range(20):
            pass  # Don't worry about all this commented code, it isn't relevant right now
            #table = soup.find_element_by_id("table", {"class": "applicationDetails"})
            #print(table.text)
            # div = soup.select("div.applicationDetails")
            # getDiv = div[i].split(":")[1].get_text()
            # log = open("log.txt", "a")
            # log.write(getDiv + "\n")
        #log.write("\n")

while True:
    start()
    element = driver.find_element_by_class_name('rdpPageNext')
    try:
        check = element.get_attribute('onclick')
        if check != "return false;":
            element.click()
        else:
            break
    except:
        break

print(result)
start2()
driver.get(url)
As per the URL https://planning.hants.gov.uk/SearchResults.aspx?RecentDecisions=True, to click through all the pages you can use the following solution:
Code Block:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.add_argument("start-maximized")
options.add_argument("disable-infobars")
options.add_argument("--disable-extensions")
driver = webdriver.Chrome(chrome_options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
driver.get('https://planning.hants.gov.uk/SearchResults.aspx?RecentDecisions=True')
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.ID, "mainContentPlaceHolder_btnAccept"))).click()
numLinks = len(WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div#ctl00_mainContentPlaceHolder_lvResults_topPager div.rdpWrap.rdpNumPart>a"))))
print(numLinks)
for i in range(numLinks):
    print("Perform your scraping here on page {}".format(str(i+1)))
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//div[@id='ctl00_mainContentPlaceHolder_lvResults_topPager']//div[@class='rdpWrap rdpNumPart']//a[@class='rdpCurrentPage']/span//following::span[1]"))).click()
driver.quit()
Console Output:
8
Perform your scraping here on page 1
Perform your scraping here on page 2
Perform your scraping here on page 3
Perform your scraping here on page 4
Perform your scraping here on page 5
Perform your scraping here on page 6
Perform your scraping here on page 7
Perform your scraping here on page 8
Hi @Feitan Portor, you have written the code absolutely perfectly. The only reason you are redirected back to the first page is that you have url = driver.current_url in the last for loop, where the URL stays static and only the JavaScript instigates the next click event. So just remove url = driver.current_url and driver.get(url),
and you are good to go - I have tested it myself.
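With those two lines removed, the final loop from the question reduces to a sketch like this (same pager id and start() function as in the question):

for i in range(5):
    driver.find_element_by_id("ctl00_mainContentPlaceHolder_lvResults_bottomPager_ctl02_NextButton").click()
    start()  # scrape the page that the pager's JavaScript just loaded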
Also, to get the current page that your scraper is on, just add this part in the for loop so you will know where your scraper is:
ss = driver.find_element_by_class_name('rdpCurrentPage').text
print(ss)
Hope this solves your confusion.