I have some crawler code using chromedriver and Selenium in Python which goes through different pages.
However, it does not reach the last page.
For example, with a maximum of 9 pages and 10 rows per table, it keeps scraping page 8 and then starts over with page 8 again, infinitely.
My looping code looks like this:
def extract(page):
    while 1:
        pc = 1
        print("Extracting Page: " + str(page))
        while pc <= 10:
            colpageprod(pc)
            browser.back()
            waitSmall()
            pc += 1
        try:
            np = browser.find_element_by_xpath('//li[@class="next"]/a').click()
            waitSmall()
        except:
            pass
        page += 1
        if page < 1:
            break

try:
    extract(1)
The problem is that on the website the second-to-last page has no "next" button.
How could we handle this in the code?
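One possible way to handle this (a minimal sketch, assuming the same //li[@class="next"]/a locator and the colpageprod and waitSmall helpers from the code above) is to break out of the outer loop as soon as the "next" link can no longer be found, instead of silently passing:

from selenium.common.exceptions import NoSuchElementException

def extract(page):
    while True:
        print("Extracting Page: " + str(page))
        for pc in range(1, 11):
            colpageprod(pc)
            browser.back()
            waitSmall()
        try:
            # If there is no "next" link on this page, this raises and pagination stops
            browser.find_element_by_xpath('//li[@class="next"]/a').click()
        except NoSuchElementException:
            break
        waitSmall()
        page += 1

extract(1)

Catching NoSuchElementException specifically, rather than using a bare except, makes the end-of-pagination case explicit instead of hiding it.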
I'm writing a script to scrape product names from a website, filtered by brands. Some search results may contain more than one page, and this is where the problem comes in. I'm able to scrape the first page, but when the script clicks through to the next page, the error selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: element is not attached to the page document is raised. Below is my code:
def scrape():
    resultList = []
    currentPage = 1
    while currentPage <= 2:
        titleResults = WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, 'h4.mt-0')))
        resultList.append(titleResults)
        checkNextPage = WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.XPATH, "//div/nav/ul/li/a[@aria-label='Next']")))
        for cnp in checkNextPage:
            nextPageNumber = int(cnp.get_attribute("data-page"))
        currentPage += 1
        driver.find_element_by_xpath("//div/nav/ul/li/a[@aria-label='Next']").click()
    for result in resultList[0]:
        print("Result: {}".format(result.text))
I think the error got triggered when .click() was called. I've done a lot of searching on the internet before resorting to posting this question here because either I don't understand the solutions from other articles/posts or they don't apply to my case.
A stale element means an old element, i.e. one that is no longer attached to the page document.
I think the error is caused by the last line.
You should extract the elements' text before the elements become unavailable.
def scrape():
    resultList = []
    currentPage = 1
    while currentPage <= 2:
        titleResults = WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, 'h4.mt-0')))
        # Extract the elements' text while they are still attached to the page
        results_text = [titleResults[i].text for i in range(0, len(titleResults))]
        resultList.extend(results_text)
        checkNextPage = WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.XPATH, "//div/nav/ul/li/a[@aria-label='Next']")))
        for cnp in checkNextPage:
            nextPageNumber = int(cnp.get_attribute("data-page"))
        currentPage += 1
        driver.find_element_by_xpath("//div/nav/ul/li/a[@aria-label='Next']").click()
    print("Result: {}".format(resultList))
I am trying to scrape data from a number of pages on a website using Selenium in Python. The script runs and scrapes data successfully on the first page, but after the second page it can't find the click button and stops scraping. I checked the HTML code of the webpage, and the element on the second page is the same as the one on the first page. I found this question related to the same issue. I think the problem is that the reference to the button is lost after the DOM is changed, but I still can't fix the issue properly. I would appreciate any suggestions or solutions. The code and results are included below:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import pandas as pd
import time

browser = webdriver.Chrome(r"C:\Users\...\chromedriver.exe")
browser.get('https://fortune.com/global500/2019/walmart')
table = browser.find_element_by_css_selector('tbody')
data = []
# Use For Loop for Index
i = 1
while True:
    if i > 5:
        break
    try:
        print("Scraping Page no. " + str(i))
        i = i + 1
        # Select rows in the table
        for row in table.find_elements_by_css_selector('tr'):
            cols = data.append([cell.text for cell in row.find_elements_by_css_selector('td')])
        try:
            WebDriverWait(browser, 10).until(EC.element_to_be_clickable((By.XPATH, '//span[@class="singlePagination__icon--2KbZn"]')))
            time.sleep(10)
        finally:
            browser.find_element_by_xpath('//span[@class="singlePagination__icon--2KbZn"]').click()
    except Exception as e:
        print(e)
        break

data1 = pd.DataFrame(data, columns=['Labels', 'Value'])
print(data1)
browser.close()
output:
Scraping Page no. 1
Scraping Page no. 2
Message: stale element reference: element is not attached to the page document
(Session info: chrome=....)
Labels Value
0 (...) (...)
1 (...) (...)
Move the table = browser.find_element_by_css_selector('tbody') line into your while loop, so that you get a fresh reference to the table element on each iteration; then you should not see the stale element issue.
while True:
    table = browser.find_element_by_css_selector('tbody')
    if i > 5:
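For completeness, a sketch of the corrected loop (assuming the same imports, data list, and locators as in the question) could look like this:

i = 1
while True:
    if i > 5:
        break
    # Re-locate the table on every iteration so the reference is never stale
    table = browser.find_element_by_css_selector('tbody')
    print("Scraping Page no. " + str(i))
    i = i + 1
    for row in table.find_elements_by_css_selector('tr'):
        data.append([cell.text for cell in row.find_elements_by_css_selector('td')])
    # element_to_be_clickable returns the element, so it can be clicked directly
    WebDriverWait(browser, 10).until(
        EC.element_to_be_clickable((By.XPATH, '//span[@class="singlePagination__icon--2KbZn"]'))
    ).click()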
I am trying to get the Google search result descriptions.
from selenium import webdriver
import re
chrome_path = r"C:\Users\xxxx\Downloads\Compressed\chromedriver_win32\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)
driver.get("https://www.google.co.in/search?q=stackoverflow")
posts = driver.find_elements_by_class_name("st")
for post in posts:
    print(post.text)
Here I'm getting correct results.
But I only want to print the links from the descriptions.
I also want to get results from 5 Google search pages; here I am only getting them from 1 page.
I have tried using
print(post.get_attribute('href'))
but the description links are not clickable, so this returns None.
Try the below code:
for i in range(1, 6, 1):
    print("--------------------------------------------------------------------")
    print("Page " + str(i) + " Results : ")
    print("--------------------------------------------------------------------")
    staticLinks = driver.find_elements_by_xpath("//*[@class='st']")
    for desc in staticLinks:
        txt = desc.text
        if txt.count('http://') > 0 or txt.count('https://') > 0:
            for c in txt.split():
                if c.startswith('http') or c.startswith('https'):
                    print(c)
    dynamicLinks = driver.find_elements_by_xpath("//*[@class='st']//a")
    for desc in dynamicLinks:
        link = desc.get_attribute('href')
        if link is not None:
            print(link)
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    nextPage = driver.find_element_by_xpath("//a[@aria-label='Page " + str(i + 1) + "']")
    nextPage.click()
This will try to fetch the static and dynamic links from the descriptions on Google's first 5 search result pages.
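If a query returns fewer than 5 result pages, the final next-page click will fail; a small variation of the same loop (a sketch using the same locator, with the click guarded) stops gracefully instead:

from selenium.common.exceptions import NoSuchElementException

for i in range(1, 6, 1):
    # ... collect the static and dynamic links for page i as above ...
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    try:
        driver.find_element_by_xpath("//a[@aria-label='Page " + str(i + 1) + "']").click()
    except NoSuchElementException:
        break  # fewer result pages than expected, stop here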
I'm scraping this website using Python and Selenium. I have the code working, but it currently only scrapes the first page. I would like to iterate through all the pages and scrape them all, but they handle pagination in a weird way. How would I go through the pages and scrape them one by one?
Pagination HTML:
<div class="pagination">
    <a href="...">First</a>
    <a href="...">Prev</a>
    <a href="...">1</a>
    <span class="current">2</span>
    <a href="...">3</a>
    <a href="...">4</a>
    <a href="...">Next</a>
    <a href="...">Last</a>
</div>
Scraper:
import re
import json
import requests
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.chrome.options import Options
options = Options()
# options.add_argument('--headless')
options.add_argument("start-maximized")
options.add_argument('disable-infobars')
driver=webdriver.Chrome(chrome_options=options,
executable_path=r'/Users/weaabduljamac/Downloads/chromedriver')
url = 'https://services.wiltshire.gov.uk/PlanningGIS/LLPG/WeeklyList'
driver.get(url)
def getData():
    data = []
    rows = driver.find_element_by_xpath('//*[@id="form1"]/table/tbody').find_elements_by_tag_name('tr')
    for row in rows:
        app_number = row.find_elements_by_tag_name('td')[1].text
        address = row.find_elements_by_tag_name('td')[2].text
        proposals = row.find_elements_by_tag_name('td')[3].text
        status = row.find_elements_by_tag_name('td')[4].text
        data.append({"CaseRef": app_number, "address": address, "proposals": proposals, "status": status})
    print(data)
    return data
def main():
    all_data = []
    select = Select(driver.find_element_by_xpath("//select[@class='formitem' and @id='selWeek']"))
    list_options = select.options
    for item in range(len(list_options)):
        select = Select(driver.find_element_by_xpath("//select[@class='formitem' and @id='selWeek']"))
        select.select_by_index(str(item))
        driver.find_element_by_css_selector("input.formbutton#csbtnSearch").click()
        all_data.extend(getData())
        driver.find_element_by_xpath('//*[@id="form1"]/div[3]/a[4]').click()
        driver.get(url)
    with open('wiltshire.json', 'w+') as f:
        json.dump(all_data, f)
    driver.quit()

if __name__ == "__main__":
    main()
Before moving on to automating any scenario, always write down the manual steps you would perform to execute it. The manual steps for what you want to do (as I understand from the question) are -
1) Go to the site - https://services.wiltshire.gov.uk/PlanningGIS/LLPG/WeeklyList
2) Select the first week option
3) Click search
4) Get the data from every page
5) Load the URL again
6) Select the second week option
7) Click search
8) Get the data from every page
... and so on.
You have a loop to select the different weeks, but inside each iteration of that loop you also need a loop that iterates over all of the pages. Since you are not doing that, your code returns only the data from the first page.
Another problem is with how you are locating the 'Next' button -
driver.find_element_by_xpath('//*[@id="form1"]/div[3]/a[4]').click()
You are selecting the 4th <a> element, which is of course not robust because the Next button's index will be different on different pages. Instead, use this better locator -
driver.find_element_by_xpath("//a[contains(text(),'Next')]").click()
Logic for creating the loop which will iterate through the pages -
First you will need the number of pages. I did that by locating the <a> immediately before the "Next" button. As per the screenshot below, it is clear that the text of this element will be equal to the number of pages -
[screenshot: the pagination bar, with the last page number shown immediately before the "Next" link]
I did that using the following code -
number_of_pages = int(driver.find_element_by_xpath("//a[contains(text(),'Next')]/preceding-sibling::a[1]").text)
Now, once you have the number of pages in number_of_pages, you only need to click the "Next" button number_of_pages - 1 times!
Final code for your main function -
def main():
    all_data = []
    select = Select(driver.find_element_by_xpath("//select[@class='formitem' and @id='selWeek']"))
    list_options = select.options
    for item in range(len(list_options)):
        select = Select(driver.find_element_by_xpath("//select[@class='formitem' and @id='selWeek']"))
        select.select_by_index(str(item))
        driver.find_element_by_css_selector("input.formbutton#csbtnSearch").click()
        number_of_pages = int(driver.find_element_by_xpath("//a[contains(text(),'Next')]/preceding-sibling::a[1]").text)
        for j in range(number_of_pages - 1):
            all_data.extend(getData())
            driver.find_element_by_xpath("//a[contains(text(),'Next')]").click()
            time.sleep(1)  # requires "import time" at the top of the script
        all_data.extend(getData())  # collect the last page of the current week as well
        driver.get(url)
    with open('wiltshire.json', 'w+') as f:
        json.dump(all_data, f)
    driver.quit()
The following approach simply worked for me.
driver.find_element_by_link_text("3").click()
driver.find_element_by_link_text("4").click()
....
driver.find_element_by_link_text("Next").click()
First get the total number of pages in the pagination, using:
from bs4 import BeautifulSoup

ins.get('https://services.wiltshire.gov.uk/PlanningGIS/LLPG/WeeklyList/10702380,1')
ins.find_element_by_class_name("pagination")
source = BeautifulSoup(ins.page_source, 'html.parser')
div = source.find_all('div', {'class': 'pagination'})
all_as = div[0].find_all('a')
total = 0
for i in range(len(all_as)):
    if 'Next' in all_as[i].text:
        # The <a> just before "Next" holds the last page number
        total = int(all_as[i - 1].text)
        break
Now just loop through that range:
for i in range(total):
    ins.get('https://services.wiltshire.gov.uk/PlanningGIS/LLPG/WeeklyList/10702380,{}'.format(i + 1))
Increment the page number on each iteration, get the source code for that page, and then extract the data from it.
Note: don't forget to sleep when going from one page to another.
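Putting those pieces together, a rough sketch (assuming ins is the webdriver instance and total is the page count computed above, and that the number after the comma in the URL is the page index) might look like:

import time
from bs4 import BeautifulSoup

for i in range(total):
    ins.get('https://services.wiltshire.gov.uk/PlanningGIS/LLPG/WeeklyList/10702380,{}'.format(i + 1))
    time.sleep(1)  # give the page a moment to load before parsing
    source = BeautifulSoup(ins.page_source, 'html.parser')
    # ... extract the rows of the results table from `source` here ...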
I need to analyse the number of Facebook groups created in the past year related to a topic, and their membership numbers over the same period of time.
Currently I have followed a tutorial to scrape Facebook for all groups related to that one keyword using the following code:
from selenium import webdriver

your_username = input("Please Enter Your Email/Login")
your_password = input("Please Enter Your Password")
query = input("Please enter a search query")
driver = webdriver.Chrome("C:\Python34\selenium\webdriver\chromedriver.exe")
print("Logging in...")
driver.get("http://facebook.com")
driver.find_element_by_id("email").send_keys(your_username)
driver.find_element_by_id("pass").send_keys(your_password)
driver.find_element_by_id("loginbutton").click()
print("Login Successful!")
driver.get("https://mobile.facebook.com/search/groups/?q=" + query)
import time
time.sleep(2)  # Wait for page to load.

check = 0          # Variable to check after each pagination (scroll down)
last = 0           # What the last length of group_links was
time_to_sleep = 1  # Total time to sleep after each scroll down
group_links = []   # A list to store new group links

while check < 10:
    elems = driver.find_elements_by_xpath("//a[@href]")  # grabs every anchor element on the page each loop
    for elem in elems:  # Loops through each anchor element above
        new_link = elem.get_attribute("href")  # grabs link from anchor element
        if "facebook.com/groups/" in new_link:  # Checks to see if it is a Facebook group link
            if new_link not in group_links:  # If the new link found is not already in our group links, add it
                group_links.append(new_link)
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(time_to_sleep)  # Sleep here, let the page scroll load
    if last == len(group_links):  # If the amount of group links is the same as last time, then add 1 to check
        print("Found Same Amount...")
        check += 1
    else:  # Check out http://www.pythonhowto.com
        check = 0  # If not, reset check back to 0
    last = len(group_links)  # changes last to current length of group links
    print("Total group links found => ", last)

print("Out of Loop")
filey = open("grouplinks.txt", "w")  # Open file
for link in group_links:  # For each link found, write it to file
    filey.write(link + "\n")
filey.close()
driver.quit()  # Exits selenium driver (it can sometimes hang in the background)
However, this only gives me the groups that exist today. Is it possible to run something similar to analyse the number of groups created since, let's say, 01/01/2017?
Sidenote: I have read that the Facebook Graph API is a more efficient way to carry out tasks like this than scraping. Should I be doing this differently?
Lastly: this is for a college project. Ultimately, what I want to achieve is to compare the number of Facebook groups related to Bitcoin and their memberships over a period of time against the price of Bitcoin over the same period.