Beautiful soup parsing of breakdown list

I need to extract values from Yahoo Finance. However, the relevant information is "hidden" inside a collapsible breakdown list.
As you can see, I can access other data, e.g. total revenue and cost of revenue. The problem occurs when I try to access data hidden inside the breakdown list: Current Assets and Inventory (which sit under the Total Assets and Current Assets sections).
Python raises AttributeError: 'NoneType' object has no attribute 'find_next', which I do not find very informative.
P.S. I identified these elements as the problem by commenting out each line in turn.
import urllib.request as url
from bs4 import BeautifulSoup
company = input('enter companies abbreviation')
income_page = 'https://finance.yahoo.com/quote/' + company + '/financials/'
balance_page = 'https://finance.yahoo.com/quote/' + company + '/balance-sheet/'
set_income_page = url.urlopen(income_page).read()
set_balance_page = url.urlopen(balance_page).read()
soup_income = BeautifulSoup(set_income_page, 'html.parser')
soup_balance = BeautifulSoup(set_balance_page, 'html.parser')
revenue_element = soup_income.find('span', string='Total Revenue').find_next('span').text
cogs_element = soup_income.find('span', string='Cost of Revenue').find_next('span').text
ebit_element = soup_income.find('span', string='Operating Income').find_next('span').text
net_element = soup_income.find('span', string='Pretax Income').find_next('span').text
short_assets_element = soup_balance.find('span', string='Current Assets').find_next('span').text
inventory_element = soup_balance.find('span', string='Inventory').find_next('span').text

Here is an example of parsing this web page using selenium. It lets you emulate user behavior: wait until the page has loaded, close the pop-up, expand a tree node by clicking it, and extract information from it.
from selenium import webdriver
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup
company = input('enter companies abbreviation: ')
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
wd = webdriver.Chrome('<<PATH_TO_CHROMEDRIVER>>', options=chrome_options)
# delay (how long selenium waits for element to be loaded)
DELAY = 30
# maximize browser window
wd.maximize_window()
# load page via selenium
wd.get('https://finance.yahoo.com/quote/' + company + '/financials/')
# check for popup, close it
try:
    btn = WebDriverWait(wd, DELAY).until(EC.presence_of_element_located((By.XPATH, '//button[text()="I agree"]')))
    wd.execute_script("arguments[0].scrollIntoView();", btn)
    wd.execute_script("arguments[0].click();", btn)
except:
    pass
# wait for page to load
results = WebDriverWait(wd, DELAY).until(EC.presence_of_element_located((By.ID, 'Col1-1-Financials-Proxy')))
# parse content
soup_income = BeautifulSoup(results.get_attribute('innerHTML'), 'html.parser')
# extract values
revenue_element = soup_income.find('span', string='Total Revenue').find_next('span').text
cogs_element = soup_income.find('span', string='Cost of Revenue').find_next('span').text
ebit_element = soup_income.find('span', string='Operating Income').find_next('span').text
net_element = soup_income.find('span', string='Pretax Income').find_next('span').text
# load page via selenium
wd.get('https://finance.yahoo.com/quote/' + company + '/balance-sheet/')
# wait for page to load
results = WebDriverWait(wd, DELAY).until(EC.presence_of_element_located((By.ID, 'Col1-1-Financials-Proxy')))
# expand total assets
btn = WebDriverWait(wd, DELAY).until(EC.element_to_be_clickable((By.XPATH, '//span[text()="Total Assets"]/preceding-sibling::button')))
wd.execute_script("arguments[0].scrollIntoView();", btn)
wd.execute_script("arguments[0].click();", btn)
# expand current assets (this reveals the Inventory row)
btn = WebDriverWait(wd, DELAY).until(EC.element_to_be_clickable((By.XPATH, '//span[text()="Current Assets"]/preceding-sibling::button')))
wd.execute_script("arguments[0].scrollIntoView();", btn)
wd.execute_script("arguments[0].click();", btn)
# parse content
soup_balance = BeautifulSoup(results.get_attribute('innerHTML'), 'html.parser')
# extract values
short_assets_element = soup_balance.find('span', string='Current Assets').find_next('span').text
inventory_element = soup_balance.find('span', string='Inventory').find_next('span').text
# close webdriver
wd.quit()
print(revenue_element)
print(cogs_element)
print(ebit_element)
print(net_element)
print(short_assets_element)
print(inventory_element)
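
The AttributeError in the question appears whenever find() returns None, which presumably happens here because the collapsed rows are not in the HTML until they are expanded. As a minimal sketch, a small guard can return a default instead of raising. The helper name text_after_label and its default parameter are hypothetical (they are not part of BeautifulSoup); the helper only relies on the find and find_next calls already used above.

```python
def text_after_label(soup, label, default=None):
    """Return the text of the span that follows the span whose text is `label`.

    Hypothetical helper, not part of BeautifulSoup: it returns `default`
    instead of raising AttributeError when either span is missing.
    """
    label_span = soup.find('span', string=label)
    if label_span is None:
        # The row is not in the parsed HTML, e.g. it is still collapsed.
        return default
    value_span = label_span.find_next('span')
    if value_span is None:
        return default
    return value_span.text
```

With the objects from the code above, a call would look like inventory_element = text_after_label(soup_balance, 'Inventory', default='N/A'). A missing row then shows up as 'N/A' in the output rather than stopping the script.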

Related

Problem clicking a radio button: unable to select it. Message: stale element reference: element is not attached to the page document

Error: selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: element is not attached to the page document.
The website I'm scraping is https://www.telekom.de/unterwegs/apple/apple-iphone-13-pro/graphit-512gb. I want to loop over the tariff details in each section, where each radio button shows a different price. For every radio button, one by one until the end of the page, I want to scrape the price details together with the name of the selected radio button. I have tried, but without success.
Could anyone help with this? It would help me learn. I got as far as opening the change-tariff link, but I'm having trouble scraping the details. Screenshots of the change-tariff view are linked below:
https://i.stack.imgur.com/RRyJa.png
https://i.stack.imgur.com/fNafB.png
https://i.stack.imgur.com/jFnLA.png
https://i.stack.imgur.com/WlyLU.png
I'm trying to click each radio button and scrape the price details for the selected one.
import xlwt
from selenium import webdriver
import re
import time
from datetime import date
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
class telekommobiles:
def __init__(self):
self.url="https://www.telekom.de/mobilfunk/geraete/smartphone?page=1&pageFilter=promotion"
self.country='DE'
self.currency='GBP'
self.VAT='Included'
self.shipping = 'free shipping within 3-4 weeks'
self.Pre_PromotionPrice ='N/A'
self.color ='N/A'
def telekom(self):
#try:
driver=webdriver.Chrome()
driver.maximize_window()
driver.get(self.url)
today = date.today()
#time.sleep(5)
WebDriverWait(driver, 30).until(EC.presence_of_all_elements_located((By.XPATH,"//*[@id='consentAcceptAll']")))
cookies = driver.find_element_by_css_selector('button.cl-btn.cl-btn--accept-all').click()
print("cookies accepted")
links_prod_check = []
prod_models = []
prod_manufacturer =[]
prod_memorys = []
product_colors =[]
product_price_monthly_payments = []
product_price_one_time_payments =[]
product_links = []
containers = driver.find_elements_by_css_selector('div[class="styles_item__12Aw4"]')
i = 1
for container in containers:
p_links =container.find_element_by_tag_name('a').get_attribute('href')
i = i + 1
product_links.append(p_links)
#print(p_links)
for links in product_links:
driver.get(links)
#time.sleep(5)
#print(driver.current_url)
#links_prod_check.append(driver.current_url)
coloroptions = WebDriverWait(driver, 30).until(EC.presence_of_all_elements_located((By.XPATH,"//li[@data-qa='list_ColorVariant']")))
#print(coloroptions)
for i in range(len(coloroptions)):
coloroption = driver.find_elements_by_xpath("//li[@data-qa='list_ColorVariant']")
coloroption[i].click()
#print(coloroption[i])
time.sleep(3)
memoryoptions = WebDriverWait(driver, 30).until(EC.presence_of_all_elements_located((By.XPATH,"//span[@class='phx-radio__element']")))
for i in range(len(memoryoptions)):
memoryoption = driver.find_elements_by_xpath("//span[@class='phx-radio__element']")
try:
memoryoption[i].click()
except:
pass
time.sleep(3)
change_traiff = driver.find_element_by_css_selector('button[class="phx-link phx-list-of-links__link js-mod tracking-added"]').click()
time.sleep(3)
section_loops = driver.find_elements_by_css_selector('section[class="tariff-catalog--layer"]')
for section_loop in section_loops:
#Headings
heading_1 = section_loop.find_element_by_css_selector('h2[class="page-title page-title--lowercase"]').text
print(heading_1)
looping_for_tariff = WebDriverWait(driver, 30).until(EC.presence_of_all_elements_located((By.XPATH,"//span[@class='phx-radio__element']")))
subcontainers = section_loop.find_elements_by_css_selector('div[class="phx-tariff-box__section"]')
for subcontainer in subcontainers:
radio_buttons_list=subcontainer.find_elements_by_css_selector('div[class="phx-form__row phx-form__row--small phx-form__row--full-width phx-form__row--radio"]')
for radio in radio_buttons_list:
input=radio.find_elements_by_css_selector('span[class="phx-radio__element"]')
if input[0].is_enabled():
try:
ActionChains(driver).move_to_element(subcontainer).perform()
time.sleep(2)
input[0].click()
time.sleep(3)
except:
print('Not clickable')
pass
lable_list=radio.find_elements_by_css_selector('span[class="phx-radio__label"]')
label=""
if lable_list:
label=lable_list[0].text
heading_2 = subcontainer.find_element_by_css_selector('p[class="phx-t6 phx-t--medium"]').text
data_price_list= subcontainer.find_element_by_css_selector('div[class="phx-tariff-box__data-price"]')
volumn_list=data_price_list.find_elements_by_css_selector('div[data-qa="label_Tariff_VolumeSize"]')
volumn=""
if volumn_list:
volumn=volumn_list[0].text
price_list=subcontainer.find_elements_by_css_selector('p[class="phx-price phx-price--size_large phx-price--strong phx-price--color_brand"]')
price=""
nonBreakSpace = u'\xa0'
if price_list:
price=price_list[0].text
print(str(heading_2) + " " + str(label) + " " + str(volumn.replace(' ', '').replace( '\\r\\n','')) + " " + str(price))
#except:
#pass
telekom_de=telekommobiles()
telekom_de.telekom()

After a different option is selected, the page refreshes, which causes the stale element error. I could not find where your code clicks the buttons, so I tried clicking all the radio buttons with the code below, and it worked. Give it a try.
from selenium import webdriver
import time
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
driver = webdriver.Chrome(executable_path="path to chromedriver.exe")
driver.maximize_window()
driver.implicitly_wait(10)
driver.get("https://www.telekom.de/unterwegs/apple/apple-iphone-13-pro/sierrablau-128gb")
wait = WebDriverWait(driver,30)
wait.until(EC.element_to_be_clickable((By.XPATH,"//button[text()='Accept All']"))).click()
radiooptions = wait.until(EC.presence_of_all_elements_located((By.XPATH,"//span[@class='phx-radio__element']")))
for i in range(len(radiooptions)):
    radiooptions = driver.find_elements_by_xpath("//span[@class='phx-radio__element']")
    radiooptions[i].click()
    time.sleep(2)

Please use the li element instead of the span:
//li[@data-qa='list_ColorVariant']
Also add a wait of about 5 seconds after each click before clicking the next one.
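
Both answers rest on the same idea: after a click refreshes the page, references obtained earlier go stale, so the elements are looked up again before every click. A rough sketch of that pattern follows. The function click_each and its locate, pause and on_click parameters are hypothetical names chosen for illustration, and the element objects only need a click() method.

```python
import time


def click_each(locate, pause=2.0, on_click=None):
    """Click every element returned by `locate`, re-locating before each click.

    `locate` is a zero-argument callable that returns a fresh list of
    elements. `on_click`, if given, is called with the index after each
    click, e.g. to scrape the price shown for that option.
    Returns the number of elements that were clicked.
    """
    count = len(locate())
    clicked = 0
    for index in range(count):
        elements = locate()  # fresh references, valid after a page refresh
        if index >= len(elements):
            break  # the list shrank after a refresh
        elements[index].click()
        clicked += 1
        time.sleep(pause)  # give the page time to re-render
        if on_click is not None:
            on_click(index)
    return clicked
```

With Selenium, locate could be lambda: driver.find_elements(By.XPATH, "//span[@class='phx-radio__element']"). A fixed pause is the simplest option; an explicit wait on whatever element changes after the click would likely be more robust.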

Web Scraping shopee.sg with selenium and BeautifulSoup in python

Whenever I try to scrape shopee.sg using selenium and BeautifulSoup, I am unable to extract all the data from a single page.
Example: for a search result of 50 products, information on the first 15 is extracted, while the rest come back as null values.
I know this has something to do with scrolling, but I have no idea how to make it work. Any idea how to fix this?
Code as of now:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from time import sleep
import csv
# create object for chrome options
chrome_options = Options()
#base_url = 'https://shopee.sg/search?keyword=disinfectant'
# set chrome driver options to disable any popup's from the website
# to find local path for chrome profile, open chrome browser
# and in the address bar type, "chrome://version"
chrome_options.add_argument('disable-notifications')
chrome_options.add_argument('--disable-infobars')
chrome_options.add_argument('start-maximized')
#chrome_options.add_argument('user-data-dir=C:\\Users\\username\\AppData\\Local\\Google\\Chrome\\User Data\\Default')
# To disable the message, "Chrome is being controlled by automated test software"
chrome_options.add_argument("disable-infobars")
# Pass the argument 1 to allow and 2 to block
chrome_options.add_experimental_option("prefs", {
"profile.default_content_setting_values.notifications": 2
})
def get_url(search_term):
"""Generate an url from the search term"""
template = "https://www.shopee.sg/search?keyword={}"
search_term = search_term.replace(' ','+')
#add term query to url
url = template.format(search_term)
#add page query placeholder
url+= '&page={}'
return url
def main(search_term):
# invoke the webdriver
driver = webdriver.Chrome(options = chrome_options)
item_cost = []
item_name = []
url=get_url(search_term)
for page in range(0,3):
driver.get(url.format(page))
delay = 5 #seconds
try:
WebDriverWait(driver, delay)
print ("Page is ready")
sleep(5)
html = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
#print(html)
soup = BeautifulSoup(html, "html.parser")
#find the product description
for item_n in soup.find_all('div',{'class':'col-xs-2-4 shopee-search-item-result__item'}):
try:
description_soup = item_n.find('div',{'class':'yQmmFK _1POlWt _36CEnF'})
name = description_soup.text.strip()
except AttributeError:
name = ''
print(name)
item_name.append(name)
# find the price of items
for item_c in soup.find_all('div',{'class':'col-xs-2-4 shopee-search-item-result__item'}):
try:
price_soup = item_c.find('div',{'class':'WTFwws _1lK1eK _5W0f35'})
price_final = price_soup.find('span',{'class':'_29R_un'})
price = price_final.text.strip()
except AttributeError:
price = ''
print(price)
item_cost.append(price)
except TimeoutException:
print ("Loading took too much time!-Try again")
sleep(5)
rows = zip(item_name, item_cost)
with open('shopee_item_list.csv','w',newline='',encoding='utf-8') as f:
writer=csv.writer(f)
writer.writerow(['Product Description', 'Price'])
writer.writerows(rows)

The issue is that the products you are trying to scrape load dynamically as you scroll down the page. There may be more elegant solutions than mine, but I implemented a simple JavaScript scroller using driver.execute_script (additional resource: https://www.geeksforgeeks.org/execute_script-driver-method-selenium-python).
Scroller
It scrolls down by a tenth of the page's height, pauses for 500 milliseconds, and then continues.
driver.execute_script("""
    var scroll = document.body.scrollHeight / 10;
    var i = 0;
    function scrollit(i) {
        window.scrollBy({top: scroll, left: 0, behavior: 'smooth'});
        i++;
        if (i < 10) {
            setTimeout(scrollit, 500, i);
        }
    }
    scrollit(i);
""")
Additionally, you had two for loops, for item_n in soup.find_all(...), for item_c in soup.find_all(...) that were iterating over divs in the same class. I fixed that, in my code, so that you can get both the price and the name of each item while only using one for loop.
You also had try-except statements (in case there was an AttributeError, i.e. if the items you were finding in soup.find_all were NoneTypes). I simplified those into if statements, like this one
name = item.find('div', {'class': 'yQmmFK _1POlWt _36CEnF'})
if name is not None:
    name = name.text.strip()
else:
    name = ''
And finally, you were using zip for two different lists (names and prices), to add to a csv file. I combined those individual lists into a nested list in the for loop, instead of appending to two separate lists and zipping at the end. This saves a step, though it is optional and may not be what you need.
Full (updated) code
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import csv
from time import sleep
# create object for chrome options
chrome_options = Options()
# base_url = 'https://shopee.sg/search?keyword=disinfectant'
# set chrome driver options to disable any popup's from the website
# to find local path for chrome profile, open chrome browser
# and in the address bar type, "chrome://version"
chrome_options.add_argument('disable-notifications')
chrome_options.add_argument('--disable-infobars')
chrome_options.add_argument('start-maximized')
# chrome_options.add_argument('user-data-dir=C:\\Users\\username\\AppData\\Local\\Google\\Chrome\\User Data\\Default')
# To disable the message, "Chrome is being controlled by automated test software"
chrome_options.add_argument("disable-infobars")
# Pass the argument 1 to allow and 2 to block
chrome_options.add_experimental_option("prefs", {
    "profile.default_content_setting_values.notifications": 2
})
def get_url(search_term):
    """Generate an url from the search term"""
    template = "https://www.shopee.sg/search?keyword={}"
    search_term = search_term.replace(' ', '+')
    # add term query to url
    url = template.format(search_term)
    # add page query placeholder
    url += '&page={}'
    return url
def main(search_term):
    # invoke the webdriver
    driver = webdriver.Chrome(options=chrome_options)
    rows = []
    url = get_url(search_term)
    for page in range(0, 3):
        driver.get(url.format(page))
        WebDriverWait(driver, 20).until(EC.presence_of_all_elements_located((By.CLASS_NAME, "shopee-search-item-result__item")))
        driver.execute_script("""
            var scroll = document.body.scrollHeight / 10;
            var i = 0;
            function scrollit(i) {
                window.scrollBy({top: scroll, left: 0, behavior: 'smooth'});
                i++;
                if (i < 10) {
                    setTimeout(scrollit, 500, i);
                }
            }
            scrollit(i);
        """)
        sleep(5)
        html = driver.page_source
        soup = BeautifulSoup(html, "html.parser")
        for item in soup.find_all('div', {'class': 'col-xs-2-4 shopee-search-item-result__item'}):
            name = item.find('div', {'class': 'yQmmFK _1POlWt _36CEnF'})
            if name is not None:
                name = name.text.strip()
            else:
                name = ''
            price = item.find('div', {'class': 'WTFwws _1lK1eK _5W0f35'})
            if price is not None:
                price = price.find('span', {'class': '_29R_un'}).text.strip()
            else:
                price = ''
            print([name, price])
            rows.append([name, price])
    with open('shopee_item_list.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['Product Description', 'Price'])
        writer.writerows(rows)
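
One small, optional refinement: get_url only replaces spaces, so a search term containing characters such as & or # would produce a broken query string. A sketch using the standard library's urlencode is shown below. The function name build_search_url and its base parameter are hypothetical, and it assumes the site accepts the same keyword and page parameters as the template above.

```python
from urllib.parse import urlencode


def build_search_url(search_term, page, base="https://www.shopee.sg/search"):
    """Return a search URL with the term and page number percent-encoded."""
    # urlencode turns spaces into '+' and escapes reserved characters.
    query = urlencode({"keyword": search_term, "page": page})
    return f"{base}?{query}"
```

Inside the page loop, driver.get(build_search_url(search_term, page)) would then replace driver.get(url.format(page)).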

Selenium switch window stop after a specific iteration

I don't know why I'm getting this error. I have tried handling several exceptions, but sometimes the script runs smoothly up to a specific page and then stops, and sometimes it doesn't even start.
Here's my code:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.common import exceptions
from time import sleep
import pandas as pd
options = Options()
# Create our Players Salaries Dictionary
weekly_wages_players_dataset = pd.DataFrame(columns=['name', 'team', 'weekly wage'])
# We specify our Chrome driver path
path = "C:/Users/Al4D1N/Documents/ChromeDriver_webscraping/chromedriver.exe"
web_driver = webdriver.Chrome(options=options, executable_path=path)
# Famous Championships
premier_league_url = "https://eurofootballrumours.com/premier-league-players-salaries/"
bundesliga_url = "https://eurofootballrumours.com/bundesliga-players-salaries/"
laliga_url = "https://eurofootballrumours.com/la-liga-players-salaries/"
seriea_url = "https://eurofootballrumours.com/serie-a-players-salaries/"
ligue_fr_url = "https://eurofootballrumours.com/ligue-1-players-salaries/"
championships = [premier_league_url, bundesliga_url, laliga_url, seriea_url, ligue_fr_url]
def players_info(driver, url, weekly_wages_players):
try:
driver.get(url)
except exceptions.InvalidSessionIdException as e:
print(e.message)
sleep(3)
# Let's get the teams values
teams = driver.find_elements_by_css_selector("h2")
for t in teams:
atags = t.find_elements_by_css_selector('a')
for atag in atags:
# In each atag, select the href
href = atag.get_attribute('href')
print(href)
# Open a new window
driver.execute_script("window.open('');")
driver.switch_to.window(driver.window_handles[1])
driver.get(href)
sleep(2)
# We get players infos
player_team = driver.find_element_by_class_name('wp-caption-text').text
# We get table content since it has all players name and their weekly wages
table_id = driver.find_element(By.TAG_NAME, "table")
tbody = table_id.find_element(By.TAG_NAME, "tbody")
rows = tbody.find_elements(By.TAG_NAME, "tr")
for row in rows:
player_name = row.find_elements(By.TAG_NAME, "td")[0].text
player_week_salary = row.find_elements(By.TAG_NAME, "td")[1].text
print(player_name)
print(player_week_salary)
# Now we store our result to our dataframe
weekly_wages_players = weekly_wages_players.append(
{'name': player_name, 'team': player_team, 'weekly wage': player_week_salary}, ignore_index=True)
# Close the tab with URL B
driver.close()
# Switch back to the first tab with URL A
driver.switch_to.window(driver.window_handles[0])
# We call our function through all the championships links
for championship in championships:
players_info(web_driver, championship, weekly_wages_players_dataset)
web_driver.close()
# We store our dataframe in an excel file
weekly_wages_players_dataset.to_excel('Weekly_Players_Wages.xlsx', index=False)
And this is the error I get :
driver.switch_to.window(driver.window_handles[0])
selenium.common.exceptions.InvalidSessionIdException: Message: invalid session id

Next Page Iteration in Selenium/BeautifulSoup for Scraping E-Commerce Website

I'm scraping an e-commerce website, Lazada, using Selenium and bs4. I managed to scrape the first page, but I am unable to iterate to the next page. What I'm trying to achieve is to scrape all the pages of the categories I've selected.
Here is what I've tried:
# Run the argument with incognito
option = webdriver.ChromeOptions()
option.add_argument('--incognito')
driver = webdriver.Chrome(executable_path='chromedriver', chrome_options=option)
driver.get('https://www.lazada.com.my/')
driver.maximize_window()
# Select category item #
element = driver.find_elements_by_class_name('card-categories-li-content')[0]
webdriver.ActionChains(driver).move_to_element(element).click(element).perform()
t = 10
try:
WebDriverWait(driver,t).until(EC.visibility_of_element_located((By.ID,"a2o4k.searchlistcategory.0.i0.460b6883jV3Y0q")))
except TimeoutException:
print('Page Refresh!')
driver.refresh()
element = driver.find_elements_by_class_name('card-categories-li-content')[0]
webdriver.ActionChains(driver).move_to_element(element).click(element).perform()
print('Page Load!')
#Soup and select element
def getData(np):
soup = bs(driver.page_source, "lxml")
product_containers = soup.findAll("div", class_='c2prKC')
for p in product_containers:
title = (p.find(class_='c16H9d').text)#title
selling_price = (p.find(class_='c13VH6').text)#selling price
try:
original_price=(p.find("del", class_='c13VH6').text)#original price
except:
original_price = "-1"
if p.find("i", class_='ic-dynamic-badge ic-dynamic-badge-freeShipping ic-dynamic-group-2'):
freeShipping = 1
else:
freeShipping = 0
try:
discount = (p.find("span", class_='c1hkC1').text)
except:
discount ="-1"
if p.find(("div", {'class':['c16H9d']})):
url = "https:"+(p.find("a").get("href"))
else:
url = "-1"
nextpage_elements = driver.find_elements_by_class_name('ant-pagination-next')[0]
np=webdriver.ActionChains(driver).move_to_element(nextpage_elements).click(nextpage_elements).perform()
print("- -"*30)
toSave = [title,selling_price,original_price,freeShipping,discount,url]
print(toSave)
writerows(toSave,filename)
getData(np)

The problem might be that the driver is trying to click the button before the element is even loaded correctly.
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome(PATH, chrome_options=option)
# use this code after driver initialization
# this makes the driver wait up to 5 seconds when looking up elements
driver.implicitly_wait(5)
url = "https://www.lazada.com.ph/catalog/?q=phone&_keyori=ss&from=input&spm=a2o4l.home.search.go.239e359dTYxZXo"
driver.get(url)
next_page_path = "//ul[@class='ant-pagination ']//li[@class=' ant-pagination-next']"
# the following code will wait up to 5 seconds for
# the element to become clickable
# and then try clicking the element.
try:
    next_page = WebDriverWait(driver, 5).until(
        EC.element_to_be_clickable((By.XPATH, next_page_path)))
    next_page.click()
except Exception as e:
    print(e)
EDIT 1
Changed the code to make the driver wait for the element to become clickable. You can put this code inside a while loop to iterate over multiple pages, and break out of the loop when the button is not found or is not clickable.
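
The loop described in the edit could be sketched roughly as follows. The names scrape_all_pages, click_next and max_pages are hypothetical and chosen for illustration; the Selenium calls inside click_next are the same ones used in the snippet above, plus TimeoutException from selenium.common.exceptions.

```python
def scrape_all_pages(scrape_page, go_to_next_page, max_pages=50):
    """Collect results page by page until there is no next page.

    `scrape_page` returns a list of items for the current page.
    `go_to_next_page` returns True if it moved to another page.
    `max_pages` is a safety cap so the loop always terminates.
    """
    results = []
    for _ in range(max_pages):
        results.extend(scrape_page())
        if not go_to_next_page():
            break
    return results


def click_next(driver, xpath, timeout=5):
    """Click the next-page button; return False if it never becomes clickable."""
    # Imported here so scrape_all_pages can be used without Selenium installed.
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    try:
        button = WebDriverWait(driver, timeout).until(
            EC.element_to_be_clickable((By.XPATH, xpath)))
    except TimeoutException:
        return False
    button.click()
    return True
```

A call might look like scrape_all_pages(lambda: getData(driver), lambda: click_next(driver, next_page_path)), assuming getData is changed to return a list of rows. The max_pages cap is there because a disabled "next" button on the last page may still count as clickable, in which case the loop would otherwise never end.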

Python - Selenium next page

I am trying to make a scraping application to scrape Hants.gov.uk and right now I am working on it just clicking the pages instead of scraping. When it gets to the last row on page 1 it just stopped, so what I did was make it click button "Next Page" but first it has to go back to the original URL. It clicks page 2, but after page 2 is scraped it doesn't go to page 3, it just restarts page 2.
Can somebody help me fix this issue?
Code:
import time
import config # Don't worry about this. This is an external file to make a DB
import urllib.request
from bs4 import BeautifulSoup
from selenium import webdriver
url = "https://planning.hants.gov.uk/SearchResults.aspx?RecentDecisions=True"
driver = webdriver.Chrome(executable_path=r"C:\Users\Goten\Desktop\chromedriver.exe")
driver.get(url)
driver.find_element_by_id("mainContentPlaceHolder_btnAccept").click()
def start():
elements = driver.find_elements_by_css_selector(".searchResult a")
links = [link.get_attribute("href") for link in elements]
result = []
for link in links:
if link not in result:
result.append(link)
else:
driver.get(link)
goUrl = urllib.request.urlopen(link)
soup = BeautifulSoup(goUrl.read(), "html.parser")
#table = soup.find_element_by_id("table", {"class": "applicationDetails"})
for i in range(20):
pass # Don't worry about all this commented code, it isn't relevant right now
#table = soup.find_element_by_id("table", {"class": "applicationDetails"})
#print(table.text)
# div = soup.select("div.applicationDetails")
# getDiv = div[i].split(":")[1].get_text()
# log = open("log.txt", "a")
# log.write(getDiv + "\n")
#log.write("\n")
start()
driver.get(url)
for i in range(5):
driver.find_element_by_id("ctl00_mainContentPlaceHolder_lvResults_bottomPager_ctl02_NextButton").click()
url = driver.current_url
start()
driver.get(url)
driver.close()

try this:
import time
# import config # Don't worry about this. This is an external file to make a DB
import urllib.request
from bs4 import BeautifulSoup
from selenium import webdriver
url = "https://planning.hants.gov.uk/SearchResults.aspx?RecentDecisions=True"
driver = webdriver.Chrome()
driver.get(url)
driver.find_element_by_id("mainContentPlaceHolder_btnAccept").click()
result = []
def start():
    elements = driver.find_elements_by_css_selector(".searchResult a")
    links = [link.get_attribute("href") for link in elements]
    result.extend(links)
def start2():
    for link in result:
        # if link not in result:
        #     result.append(link)
        # else:
        driver.get(link)
        goUrl = urllib.request.urlopen(link)
        soup = BeautifulSoup(goUrl.read(), "html.parser")
        #table = soup.find_element_by_id("table", {"class": "applicationDetails"})
        for i in range(20):
            pass # Don't worry about all this commented code, it isn't relevant right now
            #table = soup.find_element_by_id("table", {"class": "applicationDetails"})
            #print(table.text)
            # div = soup.select("div.applicationDetails")
            # getDiv = div[i].split(":")[1].get_text()
            # log = open("log.txt", "a")
            # log.write(getDiv + "\n")
            #log.write("\n")
while True:
    start()
    element = driver.find_element_by_class_name('rdpPageNext')
    try:
        check = element.get_attribute('onclick')
        if check != "return false;":
            element.click()
        else:
            break
    except:
        break
print(result)
start2()
driver.get(url)
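
Because result.extend(links) keeps every href it sees, the same link could end up in result more than once, for example if a search result contains several anchors. If that matters, an order-preserving de-duplication step is one option. This is only a sketch, and the helper name unique_links is hypothetical.

```python
def unique_links(links):
    """Return `links` without duplicates or empty entries, in first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(link for link in links if link))
```

In the code above, start2() could then iterate over unique_links(result) instead of result.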

As per the url https://planning.hants.gov.uk/SearchResults.aspx?RecentDecisions=True to click through all the pages you can use the following solution:
Code Block:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
options = Options()
options.add_argument("start-maximized")
options.add_argument("disable-infobars")
options.add_argument("--disable-extensions")
driver = webdriver.Chrome(chrome_options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
driver.get('https://planning.hants.gov.uk/SearchResults.aspx?RecentDecisions=True')
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.ID, "mainContentPlaceHolder_btnAccept"))).click()
numLinks = len(WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div#ctl00_mainContentPlaceHolder_lvResults_topPager div.rdpWrap.rdpNumPart>a"))))
print(numLinks)
for i in range(numLinks):
    print("Perform your scrapping here on page {}".format(str(i+1)))
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//div[@id='ctl00_mainContentPlaceHolder_lvResults_topPager']//div[@class='rdpWrap rdpNumPart']//a[@class='rdpCurrentPage']/span//following::span[1]"))).click()
driver.quit()
Console Output:
8
Perform your scrapping here on page 1
Perform your scrapping here on page 2
Perform your scrapping here on page 3
Perform your scrapping here on page 4
Perform your scrapping here on page 5
Perform your scrapping here on page 6
Perform your scrapping here on page 7
Perform your scrapping here on page 8

Hi @Feitan Portor, your code is almost right. The only reason you are sent back to the first page is the url = driver.current_url line in the last for loop: the URL stays the same from page to page, and only the JavaScript triggered by the click moves to the next page. So just remove url = driver.current_url and driver.get(url) and you are good to go. I have tested this myself.
To see which page your scraper is currently on, add this inside the for loop:
ss = driver.find_element_by_class_name('rdpCurrentPage').text
print(ss)
Hope this clears up the confusion.
