How to update a function's input by its own output - python

I am scraping a web page with a table that spans multiple pages. I have a function that finds the next-page button and clicks it. The function needs to return to the main table page to do that, and I have the link to that main table page hardcoded in a variable.
Once I have moved to, e.g., page 2, how do I update that table-page link to the new page's link, so that after the scraper is done going inside the elements of the table it goes back to the page-2 link and continues from there, not from the first page?
def nextPage(driver, desiredPage):
    driver.get(desiredPage)
    time.sleep(55)
    button = driver.find_element_by_xpath("XPATH WRITTEN HERE")
    driver.execute_script("arguments[0].click();", button)
    time.sleep(45)
    next_page = driver.current_url
    return next_page

You can use a while loop to call the method again and again. You may add a condition on page_nb, or on another counter variable, to avoid running the loop forever.
page_url = main_table_url  # the hardcoded link to the main table page
page_nb = 1
while page_nb != 99:
    page_url = nextPage(driver, page_url)  # nextPage returns the URL of the page it landed on
    page_nb += 1
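To tie this back to the question: keep the returned URL in a variable and pass it back in on each iteration, so the scraper always re-enters the table at the page it last reached. A minimal sketch, where scrape_rows is a hypothetical helper that visits the row links of the current table page:
current_page = main_table_url  # your hardcoded main-table-page link (name assumed)
for _ in range(99):  # assumed upper bound on the number of pages
    scrape_rows(driver, current_page)  # hypothetical helper: visit each row, then return to the table
    current_page = nextPage(driver, current_page)  # store the URL we end up on and reuse it next time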

Related

Clicking through pagination links that appear in sets with selenium

This is my first time with Selenium. The website I'm scraping (page) doesn't have a next-page button, and the pagination links don't change until you click the "...", which then shows the next set of 10 pagination links. How do I loop through the clicking?
I've seen a few answers online, but I couldn't adapt them to my code because the links only come in sets. This is the code:
from selenium.webdriver import Chrome
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.by import By

driver_path = r'Projects\Selenium Driver\chromedriver_win32'
driver = Chrome(executable_path=driver_path)
driver.get('https://business.nh.gov/nsor/search.aspx')

drop_down = driver.find_element(By.ID, 'ctl00_cphMain_lstStates')
select = Select(drop_down)
select.select_by_visible_text('NEW HAMPSHIRE')
driver.find_element(By.ID, 'ctl00_cphMain_btnSubmit').click()

content = driver.find_elements(By.CSS_SELECTOR, 'table#ctl00_cphMain_gvwOffender a')
hrefs = []
for link_el in content:
    href = link_el.get_attribute('href')
    hrefs.append(href)

offenders_href = hrefs[:10]
pagination_links = driver.find_elements(By.CSS_SELECTOR, 'table#ctl00_cphMain_gvwOffender tbody tr td table tbody a')
With your current code, the next-page elements are already captured in content[10:], and the last hyperlink with the ellipsis is in fact the next page in the logical sequence. Using this fact, we can keep a current-page variable to track the page being visited and use it to identify the right anchor element within content for the next page.
Using do-while loop logic and your code to scrape the required elements, here is the primary code:
from time import sleep

offenders_href = list()
curr_page = 1
while True:
    # find all anchor tags within this table
    content = driver.find_elements(By.CSS_SELECTOR, 'table#ctl00_cphMain_gvwOffender a')
    hrefs = []
    for link_el in content:
        href = link_el.get_attribute('href')
        hrefs.append(href)
    offenders_href += hrefs[:10]
    curr_page += 1
    # find the next-page element (the for-else runs the else only if no break occurred)
    for page_elem in content[10:]:
        if page_elem.get_attribute("href").endswith('$' + str(curr_page) + "')"):
            next_page = page_elem
            break
    else:
        # last page reached, break out of the while loop
        break
    print(f'clicking {next_page.text}...')
    next_page.click()
    sleep(1)
I placed this code in a function launch_click_pages. Launching it with your URL, it is able to scroll through the pages (it kept going, but I stopped it at some page):
>>> launch_click_pages('https://business.nh.gov/nsor/search.aspx')
clicking 2...
clicking 3...
clicking 4...
clicking 5...
clicking 6...
clicking 7...
clicking 8...
clicking 9...
clicking 10...
clicking ......
clicking 12...
clicking 13...
clicking 14...
clicking 15...
^C
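For completeness, one possible shape of that launch_click_pages wrapper (a reconstruction, not the answerer's exact code), reusing the setup code from the question:
def launch_click_pages(url):
    driver = Chrome(executable_path=driver_path)  # driver_path as defined in the question
    driver.get(url)
    drop_down = driver.find_element(By.ID, 'ctl00_cphMain_lstStates')
    Select(drop_down).select_by_visible_text('NEW HAMPSHIRE')
    driver.find_element(By.ID, 'ctl00_cphMain_btnSubmit').click()
    # ...followed by the while True pagination loop shown above...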
You can try to execute a script, e.g. driver.execute_script("javascript:__doPostBack('ctl00$cphMain$gvwOffender','Page$5')"), and you will be redirected to the fifth page.
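If you prefer to jump straight to a given page number, that same postback can be issued in a loop. A minimal sketch (assuming you know, or have read from the pager row, the total page count; total_pages below is an assumption):
from time import sleep

total_pages = 15  # assumed page count; in practice read it from the pager row
for page in range(1, total_pages + 1):
    if page > 1:
        driver.execute_script(
            "javascript:__doPostBack('ctl00$cphMain$gvwOffender','Page$%d')" % page)
        sleep(1)  # crude wait; an explicit WebDriverWait would be more robust
    # scrape the current page here (same hrefs[:10] logic as above)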

scrapy+selenium how to crawl a different page list once i'm done with one?

I'm trying to scrape data from an auction/user-trading website. It is in Italian, so I'll try to be as clear as possible.
I'm also really new to Python and Scrapy; this is my first project.
The website has no easy way to follow links, so I had to come up with a few things.
First I go to the general list, where all the pages are listed. This is pretty easy, as the first page is "https://www.subito.it/annunci-italia/vendita/usato/?o=1" and it goes on to "/?o=218776". I pick the first link of the page and open it with Selenium; once there I get the data I need and then click the "next page" button. But here's the tricky part:
If I go to the same page using the same URL, there isn't a "next page" button; it works only if you start on the list page and then click on the item link. From there you can follow the other links.
I thought that would be it, but I was wrong. The general list is divided into pages (.../?o=1, .../?o=2, etc.), and each page has an X number of links (I haven't counted them). When you are on one of the auction pages (coming from the list page, so you can use the "next page" button) and you click "next page", you follow the order of the links in the general list.
To be clearer: if the general list has 200k pages and each page has 50 links, when you click on the first link of a page you can then click "next page" 49 times. After that the "next page" button is inactive and you can't go to older links; you must go back to the list, go to its next page, and repeat the process.
import scrapy
from scrapy.http import HtmlResponse
from scrapy.selector import Selector
from selenium import webdriver


class NumeriSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'https://www.subito.it/annunci-italia/vendita/usato/?o=41',
    ]

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)
        self.driver.find_element_by_xpath('/html/body/div[5]/div/main/div[2]/div[5]/div[2]/div[1]/div[3]/div[1]/a').click()
        while True:
            sel = Selector(text=self.driver.page_source)
            item = {
                'titolo': sel.xpath('//h1[@class="classes_sbt-text-atom__2GBat classes_token-h4__3_Swu size-normal classes_weight-semibold__1RkLc ad-info__title"]/text()').get(),
                'nome': sel.xpath("//p[@class='classes_sbt-text-atom__2GBat classes_token-subheading__3yij_ size-normal classes_weight-semibold__1RkLc user-name jsx-1261801758']/text()").get(),
                'luogo': sel.xpath("//span[@class='classes_sbt-text-atom__2GBat classes_token-overline__2P5H8 size-normal ad-info__location__text']/text()").get()
            }
            yield item
            next = self.driver.find_element_by_xpath('/html/body/div[5]/div/main/div[2]/div[1]/section[1]/nav/div/div/button[2]')
            try:
                next.click()
            except:
                self.driver.quit()
This is the code I wrote with the help of the Scrapy docs and many websites/Stack Overflow pages.
I give it the page of the general list to scrape, in this case https://www.subito.it/annunci-italia/vendita/usato/?o=41. It finds the first link of the page (self.driver.find_element_by_xpath('/html/body/div[5]/div/main/div[2]/div[5]/div[2]/div[1]/div[3]/div[1]/a').click()) and then starts getting the data I want. Once it is done, it clicks on the "next page" button (next = self.driver.find_element_by_xpath('/html/body/div[5]/div/main/div[2]/div[1]/section[1]/nav/div/div/button[2]')) and repeats the "get data, click next page" process.
The last item page will have an inactive "next page" button, so at the moment the crawler gets stuck. I manually close the browser, edit the "start_urls" link with Notepad++ to be the page after the one I've just scraped, and run the crawler again to scrape the next page.
I'd like it to be fully automatic, so I can leave it to do its thing for hours (I'm saving the data in a JSON file at the moment).
The "inactive" next-page button differs from the active one only by a disabled="" attribute. How do I detect that? And once detected, how do I tell the crawler to go back to the list page plus 1 and do the data-scraping process again?
My issue is only detecting that inactive button and making a loop that adds 1 to the list page I gave it (if I start with the link "https://www.subito.it/annunci-italia/vendita/usato/?o=1", it should then go to "https://www.subito.it/annunci-italia/vendita/usato/?o=2" and do the same thing).
It's possible to iterate over the pages by overriding the start_requests method. To do that, you need to write a loop that requests all (in this case 219xxx) list pages and extracts the second-layer page hrefs.
def start_requests(self):
    pages_count = 1  # in this method you need to hard-code your page count
    for i in range(pages_count):
        url = 'https://www.subito.it/annunci-italia/vendita/usato/?o=%s' % str(i + 1)
        yield scrapy.Request(url, callback=self.parse)
Or, in a better way, also find out how many pages there are in the first layer, which is always in the last class="unselected-page" element, so you can find it with response.xpath('//*[@class="unselected-page"]//text()').getall()[-1]. In this case you'll need to make the requests for the first-layer pages in the first parse method.
def start_requests(self):
    base_url = 'https://www.subito.it/annunci-italia/vendita/usato'
    yield scrapy.Request(base_url, callback=self.parse_first_layer)

def parse_first_layer(self, response):
    pages_count = int(response.xpath('//*[@class="unselected-page"]//text()').getall()[-1])
    for i in range(pages_count):
        url = 'https://www.subito.it/annunci-italia/vendita/usato/?o=%s' % str(i + 1)
        yield scrapy.Request(url, callback=self.parse_second_layer)
After reaching the first-layer pages you can iterate over the 50 links on every page as before.
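As a rough illustration (not the answerer's code), parse_second_layer could hand each list page to Selenium and keep clicking "next page" until the button carries the disabled attribute the question mentions. The XPaths below are taken from the question's spider; everything else is an assumption and slots into that spider class:
def parse_second_layer(self, response):
    self.driver.get(response.url)
    # open the first listing of this list page (XPath from the question)
    self.driver.find_element_by_xpath(
        '/html/body/div[5]/div/main/div[2]/div[5]/div[2]/div[1]/div[3]/div[1]/a').click()
    while True:
        sel = Selector(text=self.driver.page_source)
        yield {'titolo': sel.xpath('//h1/text()').get()}  # extract the other fields as in parse()
        next_btn = self.driver.find_element_by_xpath(
            '/html/body/div[5]/div/main/div[2]/div[1]/section[1]/nav/div/div/button[2]')
        if next_btn.get_attribute('disabled') is not None:
            break  # last listing for this list page; Scrapy then moves on to the next ?o= page
        next_btn.click()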

Selenium clicking to next page until on last page

I am trying to keep clicking to the next page on this website, each time appending the table data to a CSV file, and then, when I reach the last page, append the table data and break the while loop.
Unfortunately, for some reason it keeps staying on the last page, and I have tried several different methods to catch the error.
while True:
    try:
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.LINK_TEXT, 'Next'))).click()
    except:
        print("No more pages left")
        break
driver.quit()
I also tried this one:
try:
    link = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, '.pagination-next a')))
    driver.execute_script('arguments[0].scrollIntoView();', link)
    link.click()
except:
    keep_going = False
I've tried putting print statements in and it just keeps staying on the last page.
Here is the HTML of the next button on the first page and on the last page; I'm not sure if I could do something utilizing this:
HTML for the first page:
<li role="menuitem" ng-if="::directionLinks" ng-class="{disabled: noNext()||ngDisabled}" class="pagination-next" style=""><a href="">Next</a></li>
HTML for the last page:
<li role="menuitem" ng-if="::directionLinks" ng-class="{disabled: noNext()||ngDisabled}" class="pagination-next disabled" style=""><a href="">Next</a></li>
You can solve the problem as below.
The Next button will be enabled until the last page, and it will be disabled on the last page.
So you can create two lists: one holding the enabled button element and one holding the disabled button element. At any point in time, either the enabled-element list or the disabled-element list will have size one. If the element is disabled, break out of the while loop; otherwise click the next button.
I am not familiar with Python syntax, so you can convert the Java code below and then use it. It will work for sure.
Code:
boolean hasNextPage = true;
while (hasNextPage) {
    List<WebElement> enabled_next_page_btn = driver.findElements(By.xpath("//li[@class='pagination-next']/a"));
    List<WebElement> disabled_next_page_btn = driver.findElements(By.xpath("//li[@class='pagination-next disabled']/a"));
    // If the Next button is enabled/available, then enabled_next_page_btn size will be one.
    // So you can perform the click action and then do your scraping.
    if (enabled_next_page_btn.size() > 0) {
        enabled_next_page_btn.get(0).click();
        hasNextPage = true;
    } else if (disabled_next_page_btn.size() > 0) {
        System.out.println("No more Pages Available");
        break;
    }
}
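For reference, a rough Python translation of the Java sketch above might look like this (untested, same locators):
has_next_page = True
while has_next_page:
    enabled_next_page_btn = driver.find_elements_by_xpath("//li[@class='pagination-next']/a")
    disabled_next_page_btn = driver.find_elements_by_xpath("//li[@class='pagination-next disabled']/a")
    if enabled_next_page_btn:
        # scrape the current page here, then move on
        enabled_next_page_btn[0].click()
    elif disabled_next_page_btn:
        print("No more pages available")
        has_next_page = False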
The next_page_btn.index(0).click() wasn't working, but checking the len of next_page_btn worked to find out whether it was the last page, so I was able to do this:
while True:
    next_page_btn = driver.find_elements_by_xpath("//li[@class = 'pagination-next']/a")
    if len(next_page_btn) < 1:
        print("No more pages left")
        break
    else:
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.LINK_TEXT, 'Next'))).click()
Thanks so much for the help!
How about using a do/while loop and just check for the class "disabled" to be included in the attributes of the next button to exit out? (Excuse the syntax. I just threw this together and haven't tried it)
classAttribute = ""
try:
    # emulate the do/while: click first, then check whether the button we just clicked was disabled
    while True:
        element = driver.find_element_by_link_text("Next")
        classAttribute = element.get_attribute("class")
        element.click()
        if "disabled" in classAttribute:
            break
except:
    pass
driver.quit()
The XPath to the button is:
//li[@class = 'pagination-next']/a
so every time you need to load the next page you can click on this element:
next_page_btn = driver.find_elements_by_xpath("//li[@class = 'pagination-next']/a")
next_page_btn.index(0).click()
Note: you should add logic like this:
while True:
    next_page_btn = driver.find_elements_by_xpath("//li[@class = 'pagination-next']/a")
    if len(next_page_btn) < 1:
        print("No more pages left")
        break
    else:
        pass  # do stuff

Selenium click on a next-page link not loading the next page

I'm new to Selenium and web scraping, and I'm trying to get information from this link: https://www.carmudi.com.ph/cars/civic/distance:50km/?sort=suggested
Here's a snippet of the code I'm using:
while max_pages > 0:
    results.extend(extract_content(driver.page_source))
    next_page = driver.find_element_by_xpath('//div[@class="next-page"]')
    driver.execute_script('arguments[0].click();', next_page)
    max_pages -= 1
When I try to print results, I always get (max_pages) copies of the same results from page 1. The "Next page" button is visible on the page, and when I try to find elements of the same class, it only shows one element. When I try getting the element by its exact XPath and performing the click action on it, that doesn't work either. I enclosed it in a try/except block, but there were no errors. Why might this be?
You are making this more complicated than it needs to be. There's no point in using JS clicks here... just use the normal Selenium clicks.
while True:
    # do stuff on the page
    next_btn = driver.find_elements_by_css_selector("a[title='Next page']")
    if next_btn:
        next_btn[0].click()
    else:
        break
replace:
next_page = driver.find_element_by_xpath('//div[@class="next-page"]')
driver.execute_script('arguments[0].click();', next_page)
with:
driver.execute_script('next = document.querySelector(".next-page"); next.click();')
If you try next = document.querySelector(".next-page"); next.click(); in the console, you can see that it works.

Selenium Python - Access next pages of search results

I have to click on each search result one by one from this url:
Search Guidelines
I first extract the total number of results from the displayed text so that I can set the upper limit for iteration:
upperlimit = driver.find_element_by_id("total_results")
number = int(upperlimit.text.split(' ')[0])
The loop is then defined as:
for i in range(1, number):
However, after going through the first 10 results on the first page, the list index goes out of range (probably because there are no more links to click). I need to click on "Next" to get the next 10 results, and so on until I'm done with all the search results. How can I go about doing that?
Any help would be appreciated!
The problem is that the value of the element with id total_results changes after the page is loaded: at first it contains 117, then it changes to 44.
Instead, here is a more robust approach. It processes page by page until there are no pages left:
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Firefox()
url = 'http://www.nice.org.uk/Search.do?searchText=bevacizumab&newsearch=true#/search/?searchText=bevacizumab&mode=&staticTitle=false&SEARCHTYPE_all2=true&SEARCHTYPE_all1=&SEARCHTYPE=GUIDANCE&TOPICLVL0_all2=true&TOPICLVL0_all1=&HIDEFILTER=TOPICLVL1&HIDEFILTER=TOPICLVL2&TREATMENTS_all2=true&TREATMENTS_all1=&GUIDANCETYPE_all2=true&GUIDANCETYPE_all1=&STATUS_all2=true&STATUS_all1=&HIDEFILTER=EGAPREFERENCE&HIDEFILTER=TOPICLVL3&DATEFILTER_ALL=ALL&DATEFILTER_PREV=ALL&custom_date_from=&custom_date_to=11-06-2014&PAGINATIONURL=%2FSearch.do%3FsearchText%40%40bevacizumab%26newsearch%40%40true%26page%40%40&SORTORDER=BESTMATCH'
driver.get(url)

page_number = 1
while True:
    try:
        link = driver.find_element_by_link_text(str(page_number))
    except NoSuchElementException:
        break
    link.click()
    print(driver.current_url)
    page_number += 1
Basically, the idea here is to get the next page link until there is none left (a NoSuchElementException would be thrown). Note that it works for any number of pages and results.
It prints:
http://www.nice.org.uk/Search.do?searchText=bevacizumab&newsearch=true&page=1
http://www.nice.org.uk/Search.do?searchText=bevacizumab&newsearch=true&page=2#showfilter
http://www.nice.org.uk/Search.do?searchText=bevacizumab&newsearch=true&page=3#showfilter
http://www.nice.org.uk/Search.do?searchText=bevacizumab&newsearch=true&page=4#showfilter
http://www.nice.org.uk/Search.do?searchText=bevacizumab&newsearch=true&page=5#showfilter
There is not even a need to programmatically press the Next button; if you look carefully, the URL just needs a new parameter when browsing other result pages:
url = "http://www.nice.org.uk/Search.do?searchText=bevacizumab&newsearch=true&page={}#showfilter"
for i in range(1,5):
driver.get(url.format(i))
upperlimit=driver.find_element_by_id("total_results")
number = int(upperlimit.text.split(' ')[0])
If you still want to programmatically press the Next button you could use:
driver.find_element_by_class_name('next').click()
But I haven't tested that.
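A minimal loop sketch around that idea (untested; it assumes the Next element is no longer present on the last page):
from selenium.common.exceptions import NoSuchElementException

while True:
    # scrape the current page of results here
    try:
        driver.find_element_by_class_name('next').click()
    except NoSuchElementException:
        break  # no Next button found: assume we are on the last page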
