This is the first webpage that I've scraped, and some of the other solutions I've found don't quite seem to help. As you'll see, the "Next" button is still visible, but the CSS changes just a bit when you get to the last page.
A few notes: I'm using Python, Selenium, and Google Chrome.
I am trying to loop through each part of the table on this page: https://caearlyvoting.sos.ca.gov/
I have figured out how to loop through each county and grab the information I need (I think). However, I am getting hung up on how to move to the next page when a table has more records than the 10 displayed by default.
I've tried variations of this:
try:
    next_page = driver.find_element_by_class_name('paginate_button')
    next_page.click()
except NoSuchElementException:
    pass
But no luck. I've tried getting the element in different ways, but I run into the same issues.
Can someone help me figure out how to click through each page, grab what I need, and then move on to the next county? I don't need help grabbing the info from the table, just clicking through the pages and then moving on to the next county.
EDIT
Here's the rest of the code, based on a follow-up. I am having difficulty structuring it.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
import pandas as pd
import time  # not for production

# Names of the counties: a single column with county names
county_df = pd.read_csv('Counties.csv')

# Path to driver on this computer
chrome_driver_path = r'C:\Windows\chromedriver'

# url to scrape
url = 'https://caearlyvoting.sos.ca.gov/'

with webdriver.Chrome(executable_path=chrome_driver_path) as driver:
    # Open window, maximize and set an implicit wait
    driver.get(url)
    driver.maximize_window()
    driver.implicitly_wait(10)
    actions = ActionChains(driver)  # * New line here from stackoverflow

    # find the county selection
    county_selector = driver.find_element_by_id('CountyID')

    # for loop to move through the counties
    for county in county_df['County'][:5]:
        # Input the county name
        county_selector.send_keys(county)

        ### Code to grab data goes here

        ######## Code from stackoverflow ########
        while True:
            next_page = driver.find_element_by_css_selector(".paginate_button.next")
            next_btn_classes = next_page.get_attribute("class")
            if "disabled" in next_btn_classes:
                break  # last page reached, no more next pages, break the loop
            else:
                actions.move_to_element(next_page).perform()
                time.sleep(0.5)
                # get the actual next page button and click it
                driver.find_element_by_css_selector(".paginate_button.next a").click()
You are using the wrong locator.
Also, the next page button can appear out of view, at the bottom of the page, so you will have to scroll to that element and only after that click it.
On the last page the next page button is disabled; in that case it contains the disabled class name.
So your code can be:
from selenium.webdriver.common.action_chains import ActionChains

actions = ActionChains(driver)

while True:
    # grab the data from the current page, after that:
    next_page = driver.find_element_by_css_selector(".paginate_button.next")
    next_btn_classes = next_page.get_attribute("class")
    if "disabled" in next_btn_classes:
        break  # last page reached, no more next pages, break the loop
    else:
        actions.move_to_element(next_page).perform()
        time.sleep(0.5)
        # get the actual next page button and click it
        driver.find_element_by_css_selector(".paginate_button.next a").click()
UPD
The working code is slightly different:
from selenium.webdriver.common.action_chains import ActionChains

actions = ActionChains(driver)

while True:
    # grab the data from the current page, after that:
    next_page = driver.find_element_by_css_selector(".paginate_button.next")
    next_btn_classes = next_page.get_attribute("class")
    if next_btn_classes == 'paginate_button next disabled':
        break  # last page reached, no more next pages, break the loop
    else:
        # Move to the next page for the county and append the data
        next_page.click()
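Regarding the structuring difficulty mentioned in the EDIT above: one way to fit the pieces together is to nest the whole pagination loop inside the county loop, so that every page for one county is consumed before the next county is selected. A minimal sketch reusing the names from the question's code; the data-grabbing step is a placeholder, and it assumes the table resets to its first page whenever a new county is chosen:

for county in county_df['County']:
    # pick the county (assumes the table resets to page 1 on each change)
    driver.find_element_by_id('CountyID').send_keys(county)
    while True:
        ### grab the data from the current page here ###
        next_page = driver.find_element_by_css_selector(".paginate_button.next")
        if "disabled" in next_page.get_attribute("class"):
            break  # last page for this county, move on to the next county
        next_page.click()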
Related
I am new to Selenium and trying to scrape:
https://www.asklaila.com/search/Delhi-NCR/-/book-distributor/
I need all the details mentioned on this page and the others as well.
Also, there are more pages containing the same information, and I need to scrape them as well. I tried scraping by making changes to the target URL:
https://www.asklaila.com/search/Delhi-NCR/-/book-distributor/40
but the last item keeps changing and is not even similar to the page number. Page 3 has 40 at the end, and page 5:
https://www.asklaila.com/search/Delhi-NCR/-/book-distributor/80
so I'm not able to get the data that way.
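For what it's worth, the trailing number looks like a result offset rather than a page number: page 3 ends in 40 and page 5 in 80, which is consistent with 20 results per page, i.e. offset = (page - 1) * 20. A hedged sketch of building the URLs directly, assuming that pattern holds for every page:

# assumption: the trailing number is a result offset, 20 results per page
base = "https://www.asklaila.com/search/Delhi-NCR/-/book-distributor/"
for page in range(1, 11):
    offset = (page - 1) * 20
    url = base if offset == 0 else base + str(offset)
    # driver.get(url) and scrape here

That said, the answer below, which clicks the next button until it is disabled, avoids relying on the URL pattern at all.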
Here is my code:
def extract_url():
    url = driver.find_elements(By.XPATH, "//h2[@class='resultTitle']//a")
    for i in url:
        dist.append(i.get_attribute("href"))
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    driver.find_element(By.XPATH, "//li[@class='btnNextPre']//a").click()

for _ in range(10):
    extract_url()
It works fine till page 5 but not after that. Could you please suggest how I can iterate over pages where we don't know the number of pages, and extract data till the last page?
You need to check whether the pagination link is disabled. Use an infinite loop and check whether the pagination button is disabled.
Use WebDriverWait() and wait for visibility of the element.
Code:
driver.get("https://www.asklaila.com/search/Delhi-NCR/-/book-distributor/")
counter=1
while(True):
WebDriverWait(driver,20).until(EC.visibility_of_element_located((By.CSS_SELECTOR,"h2.resultTitle >a")))
urllist=[item.get_attribute('href') for item in driver.find_elements(By.CSS_SELECTOR, "h2.resultTitle >a")]
print(urllist)
print("Page number :" +str(counter))
driver.execute_script("arguments[0].click();", driver.find_element(By.CSS_SELECTOR, "ul.pagination >li.btnNextPre>a"))
#check for pagination button disabled
if len(driver.find_elements(By.XPATH, "//li[#class='disabled']//a[text()='>']"))>0:
print("pagination not found!!!")
break
time.sleep(2) #To slowdown the loop
counter=counter+1
Import the libraries below:
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
import time
I have written a simple web scraping script using Selenium, but I want to scrape only the portion that is present 'before scrolling'.
Say it is this page I want to scrape - https://en.wikipedia.org/wiki/Pandas_(software) - Selenium reads information till the absolute last element/text, which for me is the 'Powered by MediaWiki' button on the far bottom-right of the page.
What I want Selenium to do is stop after DataFrames and not scroll down to the bottom.
And I also want to know where on the page it stops. I have checked multiple sources, and most of them deal with infinite-scroll websites; no one asks for just the 'visible' half of a page.
This is my code now:
from selenium import webdriver

EXECUTABLE = r"chromedriver.exe"

# get the URL
url = "https://en.wikipedia.org/wiki/Pandas_(software)"

# open the chromedriver
driver = webdriver.Chrome(executable_path=EXECUTABLE)

# google window is maximized so that all webpages are rendered in the same size
driver.maximize_window()

# make the driver wait for 30 seconds before throwing a time-out exception
driver.implicitly_wait(30)

# get URL
driver.get(url)

for element in driver.find_elements_by_xpath("//*"):
    try:
        pass  # stuff
    except:
        continue

driver.close()
Absolutely any direction is appreciated. I have tried to be as clear as possible here but let me know if any more details are required.
I don't think that is possible. Observe the DOM: all the informational elements are under one section, I mean one tag, div[@id='content'], which is already visible to Selenium. Even if you try with //*, div[@id='content'] is visible.
And trying to check whether an element is visible though not scrolled to will also return True. (If someone knows how to do what you are asking for, even I would like to know.)
from selenium import webdriver
from selenium.webdriver.support.expected_conditions import _element_if_visible

driver = webdriver.Chrome(executable_path='path to chromedriver.exe')
driver.maximize_window()
driver.implicitly_wait(30)
driver.get("https://en.wikipedia.org/wiki/Pandas_(software)")

elements = driver.find_elements_by_xpath("//div[@id='content']//*")
for element in elements:
    try:
        if _element_if_visible(element):
            print(element.get_attribute("innerText"))
    except:
        break

driver.quit()
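That said, if only a rough approximation of 'above the fold' is needed, a different technique is a plain viewport geometry check rather than a Selenium visibility check: compare each element's position against window.innerHeight before anything has been scrolled. A sketch under that assumption, reusing the content div from the code above:

# rough approximation: keep only elements whose top edge lies inside the
# initial window height (nothing has been scrolled at this point)
viewport_height = driver.execute_script("return window.innerHeight")
for element in driver.find_elements_by_xpath("//div[@id='content']//*"):
    if element.rect['y'] < viewport_height:
        print(element.get_attribute("innerText"))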
I am trying to get search results from Yahoo Search using Python - Selenium and bs4. I have been able to get the links successfully, but I am not able to click the button at the bottom to go to the next page. I tried one way, but it couldn't identify it after the second page.
Here is the link:
https://in.search.yahoo.com/search;_ylt=AwrwSY6ratRgKEcA0Bm6HAx.;_ylc=X1MDMjExNDcyMzAwMgRfcgMyBGZyAwRmcjIDc2ItdG9wLXNlYXJjaARncHJpZANidkhMeWFsMlJuLnZFX1ZVRk15LlBBBG5fcnNsdAMwBG5fc3VnZwMxMARvcmlnaW4DaW4uc2VhcmNoLnlhaG9vLmNvbQRwb3MDMARwcXN0cgMEcHFzdHJsAzAEcXN0cmwDMTQEcXVlcnkDc3RhY2slMjBvdmVyZmxvdwR0X3N0bXADMTYyNDUzMzY3OA--?p=stack+overflow&fr=sfp&iscqry=&fr2=sb-top-search
This is what I'm doing to get data from the page, but I need to put it in a loop which changes pages:
page = BeautifulSoup(driver.page_source, 'lxml')
lnks = page.find('div', {'id': 'web'}).find_all('a', href=True)
for i in lnks:
    print(i['href'])
You don't need to scroll down to the bottom; the next button is accessible without scrolling. Suppose you want to navigate 10 pages. The Python script can be like this:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get('Yahoo Search URL')

# Loop through the pages, waiting for the next button
# to be clickable before clicking it each time.
for i in range(10):
    WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.XPATH, '//a[@class="next"]')))
    driver.find_element_by_xpath('//a[@class="next"]').click()
The next page button is at the bottom of the page, so you first need to scroll to that element and then click it. Like this:
import time

from selenium.webdriver.common.action_chains import ActionChains

actions = ActionChains(driver)
next_page_btn = driver.find_element_by_css_selector("a.next")
actions.move_to_element(next_page_btn).perform()
time.sleep(0.5)
next_page_btn.click()
I've created a script in Python to scrape the content populated upon initiating a search in the search box in Google Maps. My script can generate results by pressing that search button. Now I wish to keep parsing the results by pressing the next button (located at the bottom left) until there are none left.
Site address
I'm using the keyword motels in new jersey as the search.
I've tried with:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.google.com/maps/search/")
wait = WebDriverWait(driver, 10)
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "input#searchboxinput"))).send_keys("motels in new jersey")
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button#searchbox-searchbutton"))).click()

while True:
    for item in wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, ".section-result-content"))):
        name = WebDriverWait(item, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "h3[class='section-result-title'] > span"))).text
        print(name)
    try:
        next_page = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[jsaction$='.nextPage'] > span")))
        driver.execute_script("arguments[0].click();", next_page)
    except TimeoutException:
        break

driver.quit()
The above script gives me the same results (from the first page) several times, no matter how far it goes clicking on that next button.
How can I get accurate results from the next pages?
Here is the logic that should work.
There is a server error (an application issue) occurring when navigating through the list, so we wait for the page to load the information and then check whether the server error is displayed; if not, we continue with populating the results.
driver.get("https://www.google.com/maps/search/")
wait = WebDriverWait(driver, 10)
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "input#searchboxinput"))).send_keys("motels in new jersey")
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button#searchbox-searchbutton"))).click()
while True:
# wait until the information is loaded
wait.until_not(EC.presence_of_element_located((By.XPATH, "//div[#id='searchbox'][contains(#class,'loading')]")))
# check if there is any server error
if len(driver.find_elements_by_xpath("//div[#class='snackbar-message'][contains(.,'error')]"))>0:
# print the error message
print(driver.find_element_by_xpath("//div[#class='snackbar-message'][contains(.,'error')]").text)
# exit the loop
break
for item in wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, ".section-result-content"))):
name = WebDriverWait(item,10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "h3[class='section-result-title'] > span"))).text
print(name)
try:
next_page = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR,"button[jsaction$='.nextPage'] > span")))
driver.execute_script("arguments[0].click();",next_page)
except TimeoutException: break
Being in a while True loop, your script does not wait for the next page to be rendered before searching for the name. The locators input#searchboxinput and button#searchbox-searchbutton are still active when the next page is loading. Thus your script will output the same names from the same page for as many iterations as will run before the next page is loaded.
I recommend a wait condition for the page loading, such as the presence of the spinner animation in the top left where the X button usually is. This should pause execution until the next page is loaded. The div with id searchbox has a show-loading class that appears only while that spinner is active; you can use that to determine whether the page is still loading.
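A minimal sketch of that wait condition, assuming the show-loading class behaves exactly as described (the div#searchbox.show-loading selector is inferred from the description above, not verified against the live page):

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# after clicking next: block until the spinner's show-loading class
# is gone from div#searchbox, i.e. the next page has finished loading
WebDriverWait(driver, 10).until_not(
    EC.presence_of_element_located((By.CSS_SELECTOR, "div#searchbox.show-loading")))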
This is a follow-up question to this:
WebDriver element found, but click returns nothing
I am trying to scrape data from the URL in the code after making selections in the drop-down menu. I first click on Progress Monitoring and then Physical and Financial Project Summary. Then I make the following selections: State, District, Block, Year, Batch, and Collaboration. I would also like to check the Road Wise button and then click on the View button. After the table loads, I would like to click on the save button and download the Excel file. In the code below I also loop through the different selections under the "State" item. Here is my code:
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import time
import os
chromedriver = r"C:\Users\yuppal\chromedriver"
os.environ["webdriver.chrome.driver"] = chromedriver
browser = webdriver.Chrome(chromedriver)
browser.implicitly_wait(10)
browser.get("http://omms.nic.in")
browser.maximize_window()
# Click on the item Progress Monitoring
progElem = browser.find_element_by_link_text("Progress Monitoring").click()

# Click on the item Physical and Financial Project Summary
summElem = browser.find_element_by_link_text("Physical and Financial Project Summary").click()

# Find the element for state and create a list of the selection options
stateElem = browser.find_element_by_xpath("//select[@name='StateCode']")
state_options = stateElem.find_elements_by_tag_name("option")

# delete the first option in the list
del state_options[0]

def select_option(xpath, text):
    '''
    This function will select the remaining drop-down menu items.
    '''
    elem = browser.find_element_by_xpath(xpath)
    Select(elem).select_by_visible_text(text)
# run the loop for each option in the list of states
for option in state_options:
    select_state = Select(stateElem).select_by_value(option.get_attribute("value"))

    # Select the district.
    select_option("//select[@name='DistrictCode']", "All Districts")
    # Select the block.
    select_option("//select[@name='BlockCode']", "All Blocks")
    # Select the year.
    select_option("//select[@name='Year']", "All Years")
    # Select the batch.
    select_option("//select[@name='Batch']", "All Batches")
    # Select the funding agency.
    select_option("//select[@name='FundingAgency']", "Regular PMGSY")

    # Check the road wise box.
    time.sleep(10)
    checkElem = WebDriverWait(browser, 120).until(EC.element_to_be_clickable((By.XPATH, "//input[@title='Road Wise']")))
    browser.execute_script("arguments[0].click();", checkElem)

    # Click on the view button.
    time.sleep(10)
    browser.find_element_by_xpath("//input[@type='button']").click()

    # Switch to a new frame.
    time.sleep(10)
    frame = browser.find_element_by_xpath("//div[@id='loadReport']/iframe")
    browser.switch_to.default_content()
    #browser.switch_to.frame(frame)
    WebDriverWait(browser, 120).until(EC.frame_to_be_available_and_switch_to_it(frame))
    #browser.switch_to.frame(browser.find_element_by_xpath("//*[@id='loadReport']/iframe"))

    # click on the save button
    time.sleep(10)
    WebDriverWait(browser, 120).until(EC.element_to_be_clickable((By.XPATH, "//a[@title='Export drop down menu']"))).click()

    # Within the save button, click on the "Excel" option.
    time.sleep(10)
    WebDriverWait(browser, 20).until(EC.element_to_be_clickable((By.XPATH, "//div/a[@title='Excel']"))).click()

    # Switch back to the main content.
    time.sleep(20)
    browser.switch_to.default_content()
My issue is that the "Road Wise" checkbox gets clicked only for some states, so the loop proceeds without clicking the checkbox for the others. I checked the HTML and it is the same for all the checkboxes.
I thought the problem might be that the "View" button gets clicked before the Road Wise checkbox is clickable, so I put a waiting period before both the Road Wise and View buttons, but that doesn't seem to help. So I can't really understand why the checkbox isn't clicked in some iterations of the loop.
Before clicking on the checkbox, check whether it is already selected:
# Check the road wise box.
time.sleep(10)
checkElem = WebDriverWait(browser, 120).until(EC.element_to_be_clickable((By.XPATH, "//input[@title='Road Wise']")))
if not checkElem.is_selected():
    browser.execute_script("arguments[0].click();", checkElem)
PS: In your case the click will happen only in the first iteration of the loop: once the checkbox has been selected it stays selected for the following states, so is_selected() returns True and the click is skipped.