I've created a script using Python and Selenium to get all the text available at the following link. The webpage uses lazy loading, so more content becomes visible with each scroll; my script can handle that too.
However, the problem is that once my script makes the webpage exhaust its content by reaching the bottom, it gets stuck right there. If it could break out of the loop, I could fetch the content. How can I break out of the loop?
I know .LoadingDots is always there, and that is the only reason I can't find any logic to break the loop.
Link to that site
Here is what I've tried so far (it can't get out of the loop):
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)
driver.get("https://www.quora.com/topic/American-Football")

while True:
    try:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        wait.until(EC.invisibility_of_element_located((By.CSS_SELECTOR, ".LoadingDots")))
    except Exception:
        break

for item in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".ui_qtext_rendered_qtext .ui_qtext_para"))):
    print(item.text)

driver.quit()
I know I can solve the issue if I take the following approach instead:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)
driver.get("https://www.quora.com/topic/American-Football")

last_len = len(wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".ui_qtext_rendered_qtext .ui_qtext_para"))))

while True:
    for load_more in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "a[id$='_more']"))):
        driver.execute_script("arguments[0].click();", load_more)
    try:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        wait.until(lambda driver: len(wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".ui_qtext_rendered_qtext .ui_qtext_para")))) > last_len)
        items = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".ui_qtext_rendered_qtext .ui_qtext_para")))
        last_len = len(items)
    except TimeoutException:
        break

for item in items:
    print(item.text)

driver.quit()
My question is: how can I fetch the content from that page, exhausting all the scrolls, the way I tried in my first script, making use of .LoadingDots?
When the page is scrolled to the bottom, the element with classes .LoadingDots.regular stays the same, but its parent element gets a new class, hidden. You can check whether the class was added using the get_attribute function. You can also locate the parent directly by the class spinner_display_area:
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    loading_dots = driver.find_element_by_class_name('spinner_display_area')
    if 'hidden' in loading_dots.get_attribute('class'):
        break
Your script doesn't work as expected because the (By.CSS_SELECTOR, ".LoadingDots") selector returns the element <div class="LoadingDots tiny">, which is always hidden, so your expectation of its invisibility always returns True and the loop is never broken.
You need to check another element with the "LoadingDots" class name, <div class="LoadingDots regular">, and the logic should be the following:
Scroll page down
Wait for loading dots to appear (start loading more content)
Wait for loading dots to disappear (loading more content is done)
If no dots appear after the page is scrolled, break the loop
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 5)
driver.get("https://www.quora.com/topic/American-Football")

while True:
    try:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # dots appear -> more content is being loaded
        wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".LoadingDots.regular")))
        # dots disappear -> loading the extra content is done
        wait.until(EC.invisibility_of_element_located((By.CSS_SELECTOR, ".LoadingDots.regular")))
    except Exception:
        # the dots never appeared after the scroll -> no more content to load
        break

for item in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".ui_qtext_rendered_qtext .ui_qtext_para"))):
    print(item.text)

driver.quit()
BUT! Note that I've posted this script just to point out the reason why your script is not working... It's not really efficient: if the content loads too fast (the possibility is quite low, but...) the script might not catch the moment when the loading dots appear, and you won't get all the required content.
So @Guy's solution seems to be more reliable (+1)
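If you want to keep the dots-based logic but guard against that race, one option is to combine it with @Guy's hidden-class check and only break once the spinner's container has actually been hidden. This is a rough, untested sketch: the short 2-second "appear" timeout is my own assumption, everything else reuses the selectors and the wait object from above.

from selenium.common.exceptions import TimeoutException

while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    try:
        # give the dots a short moment to show up (loading has started)
        WebDriverWait(driver, 2).until(
            EC.visibility_of_element_located((By.CSS_SELECTOR, ".LoadingDots.regular")))
    except TimeoutException:
        # the dots never showed up; if their container is now hidden, we hit the bottom
        spinner_area = driver.find_element_by_class_name("spinner_display_area")
        if "hidden" in spinner_area.get_attribute("class"):
            break
        continue
    # the dots appeared, wait for them to go away (loading finished)
    wait.until(EC.invisibility_of_element_located((By.CSS_SELECTOR, ".LoadingDots.regular")))

The rest (fetching the paragraphs and driver.quit()) stays the same as in the script above.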
Related
I'm trying to simulate clicking the "Load more listings" button on the "https://empireflippers.com/marketplace/" webpage until the button is no longer there. I tried the following code, but it results in a "move target out of bounds" error.
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
from selenium.webdriver.common.action_chains import ActionChains

HOME_PAGE_URL = "https://empireflippers.com/marketplace/"

driver = webdriver.Chrome('./chromedriver.exe')
driver.get(HOME_PAGE_URL)

while True:
    try:
        element = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//button[contains(text(),'Load More Listings')]")))
        ActionChains(driver).move_to_element(element).click().perform()
    except Exception as e:
        print(e)
        break

print("Complete")
time.sleep(10)
page_source = driver.page_source
driver.quit()
I'm expecting to retrieve the HTML code of the full web page, with no "Load more listings" button left.
So it seems that the button you are trying to click is not visible on the screen. You could try this:
driver.execute_script("arguments[0].click();", driver.find_element(By.XPATH, "//button[contains(text(),'Load More Listings')]"))
To click the button.
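For example, dropped into your existing loop it could look like this (untested sketch, reusing your locator and 20-second timeout):

while True:
    try:
        element = WebDriverWait(driver, 20).until(EC.element_to_be_clickable(
            (By.XPATH, "//button[contains(text(),'Load More Listings')]")))
        # a JavaScript click is not affected by the element being off-screen,
        # so it avoids the "move target out of bounds" error
        driver.execute_script("arguments[0].click();", element)
    except Exception as e:
        print(e)
        break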
I have no idea why, but trying to click twice works for me. [I still get the same error if I try to click twice with ActionChains, and I'm not familiar enough with ActionChains to try to fix that; my usual approach is to use .execute_script to scroll to the element with JavaScript and then just apply .click() to the element, so that's what I've done below.]
while True:
    try:
        element = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//button[contains(text(),'Load More Listings')]")))
        # ActionChains(driver).move_to_element(element).click().perform()
        driver.execute_script('arguments[0].scrollIntoView(false);', element)
        try:
            element.click()    # for some reason, the 1st click always fails
        except:
            element.click()    # but after the 1st attempt, the 2nd click works...
    except Exception as e:
        print(e)
        break
I'm writing my first real scraper and, although in general it's been going well, I've hit a wall using Selenium. I can't get it to go to the next page.
Below is the head of my code. The output below this point just prints data to the terminal for now, and that's all working fine. It just stops scraping at the end of page 1 and shows me my terminal prompt; it never starts on page 2. I would be so grateful if anyone could make a suggestion. I've tried selecting the button at the bottom of the page I'm trying to scrape using both the relative and the full XPath (you're seeing the full one here), but neither works. I'm trying to click the right-arrow button.
I built in my own error message to indicate whether the driver successfully found the element by XPath or not. The error message fires when I execute my code, so I guess it's not finding the element. I just can't understand why not.
# Importing libraries
import requests
import csv
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Import selenium
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, WebDriverException
import time

options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
options.add_argument('--headless')
driver = webdriver.Chrome("/path/to/driver", options=options)
# Yes, I do have the actual path to my driver in the original code

driver.get("https://uk.eu-supply.com/ctm/supplier/publictenders?B=UK")
time.sleep(5)

while True:
    try:
        driver.find_element_by_xpath('/html/body/div[1]/div[3]/div/div/form/div[3]/div/div/ul[1]/li[4]/a').click()
    except (TimeoutException, WebDriverException) as e:
        print("A timeout or webdriver exception occurred.")
        break

driver.quit()
What you can do is set up Selenium expected conditions (visibility_of_element_located, element_to_be_clickable) and use a relative XPath to select the next-page element, all of this in a loop whose range is the number of pages you have to deal with.
XPath for the next page link:
//div[@class='pagination ctm-pagination']/ul[1]/li[last()-1]/a
Code could look like:
## imports
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.get("https://uk.eu-supply.com/ctm/supplier/publictenders?B=UK")

## count the number of pages you have
els = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[@class='pagination ctm-pagination']/ul[1]/li[last()]/a"))).get_attribute("data-current-page")

## loop; at the end of each iteration, click through to the following page
for i in range(int(els)):
    # *** scrape what you want here ***
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//div[@class='pagination ctm-pagination']/ul[1]/li[last()-1]/a"))).click()
You were pretty close with the while True and try/except logic. To go to the next page using Selenium and Python, you have to induce WebDriverWait for element_to_be_clickable(), and you can use the following locator strategy:
Code Block:
driver.get("https://uk.eu-supply.com/ctm/supplier/publictenders?B=UK")
while True:
    try:
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//a[contains(@class, 'state-active')]//following::li[1]/a[@href]"))).click()
        print("Clicked for next page")
        WebDriverWait(driver, 10).until(EC.staleness_of(driver.find_element_by_xpath("//a[contains(@class, 'state-active')]//following::li[1]/a[@href]")))
    except TimeoutException:
        print("No more pages")
        break
driver.quit()
Console Output:
Clicked for next page
No more pages
I have read several articles on this site about the StaleElementReferenceException, and I am aware that this error is caused by the element no longer being in the site's DOM. What I am trying to do is click the links at the bottom of this webpage in order to go on and see the next page's listings. I have tried a few ways around this exception, but none of them worked. Here is an example of the code I have tried, and what I thought it might accomplish.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait as wait
from selenium.webdriver.support.expected_conditions import presence_of_element_located
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import StaleElementReferenceException
import time

driver = webdriver.Chrome(r'C:\Users\Hank\Desktop\chromedriver_win32\chromedriver.exe')
driver.get('https://steamcommunity.com/market/listings/440/Unusual%20Old%20Guadalajara')

action = ActionChains(driver)

page_links = wait(driver, 10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '[class^=market_paging_pagelink]')))
try:
    action.move_to_element(page_links[1]).click().perform()
except StaleElementReferenceException:
    print("Exception received, trying again")
    time.sleep(5)
    page_links = wait(driver, 10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '[class^=market_paging_pagelink]')))
    action.move_to_element(page_links[1]).click().perform()
I was hoping that this code segment would attempt to move to the element at the bottom and click it, or, on receiving the error message, wait and try again, succeeding the second time. Instead, the code simply throws the error again. If my question has already been answered, please direct me to the relevant link.
Thank you!
The approach I normally go for is to click Next page until the button gets disabled/invisible.
Here's a working example based on your page. You should obviously do whatever is relevant in the while loop; I chose to capture prices for the sake of example.
url = "https://steamcommunity.com/market/listings/440/Unusual%20Old%20Guadalajara"
driver.get(url)

next_button = wait(driver, 10).until(EC.presence_of_element_located((By.ID, 'searchResults_btn_next')))

# capture the start value from "Showing x-xx of 22 results"
# need this to check against later
ref_val = wait(driver, 10).until(EC.presence_of_element_located((By.ID, 'searchResults_start'))).text

while next_button.get_attribute('class') == 'pagebtn':
    next_button.click()
    # wait until ref_val has changed
    wait(driver, 10).until(lambda driver: wait(driver, 10).until(EC.presence_of_element_located((By.ID, 'searchResults_start'))).text != ref_val)

    # ====== Do whatever relevant here =============================
    page_num = wait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, '.market_paging_pagelink.active'))).text
    print(f"Prices from page {page_num}")
    prices = wait(driver, 10).until(EC.presence_of_all_elements_located(
        (By.XPATH, ".//span[@class='market_listing_price market_listing_price_with_fee']")))
    for price in prices:
        print(price.text)
    # ================================================================

    # get the new reference value
    ref_val = wait(driver, 10).until(EC.presence_of_element_located((By.ID, 'searchResults_start'))).text
I've written a script in Python to parse some names from a webpage. The items on that webpage don't get displayed all at once; rather, it's necessary to scroll to the bottom to make the page release a few more items, and a few more again on the next scroll, and so on until all items are visible. The problem is that the items are not located in the body, which is why the command driver.execute_script("return document.body.scrollHeight;") is not working (IMO). They are located in the left-hand area, in something like a sliding container. How can I reach the bottom of that container and parse the names from this webpage? I've written almost all of the code except for controlling the lazy load. I'm attaching an image to give you an idea of what I mean by calling it a sliding container.
The link to that webpage: Link
This is what I've tried so far:
from selenium import webdriver
import time
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)
driver.get("replace_the_above_link")

check_height = driver.execute_script("return document.body.scrollHeight;")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(3)
    height = driver.execute_script("return document.body.scrollHeight;")
    if height == check_height:
        break
    check_height = height

for item in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".select_list h2 a"))):
    print(item.text)

driver.quit()
This is the image of the box that contains the items: Click Here
Currently my scraper only parses the items that are visible when the page loads.
The code below should let you scroll the container as many times as needed to trigger the XHR requests and then scrape the required data:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)
driver.get("https://www.weedsta.com/dispensaries/in/california")

entries_count = len(wait.until(EC.presence_of_all_elements_located((By.CLASS_NAME, "select_list"))))

while True:
    # sending END to an element inside the container scrolls the container itself
    driver.find_element_by_class_name("tel").send_keys(Keys.END)
    try:
        # wait until more entries than before have been loaded...
        wait.until(lambda driver: entries_count < len(driver.find_elements_by_class_name("select_list")))
        # ...and remember the new count for the next iteration
        entries_count = len(driver.find_elements_by_class_name("select_list"))
    except:
        break

for item in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".select_list h2 a"))):
    print(item.text)

driver.quit()
I want to grab the page source of the page after I make a click, and then go back using the browser.back() function. But Selenium doesn't wait for the page to fully load after the click, and the content that is generated by JavaScript isn't included in the page source of that page.
element[i].click()
#Need to wait here until the content is fully generated by JS.
#And then grab the page source.
scoreCardHTML = browser.page_source
browser.back()
As Alan mentioned, you can wait for some element to be loaded. Below is example code:
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
browser = webdriver.Firefox()
element = WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.ID, "element_id")))
You can also use Selenium's staleness_of:
from contextlib import contextmanager
from selenium.webdriver.support.expected_conditions import staleness_of

@contextmanager
def wait_for_page_load(browser, timeout=30):
    old_page = browser.find_element_by_tag_name('html')
    yield
    WebDriverWait(browser, timeout).until(
        staleness_of(old_page)
    )
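Used as a context manager, a typical call would look like this (sketch; element[i].click() stands for the click from the question that triggers the navigation):

with wait_for_page_load(browser, timeout=30):
    element[i].click()                   # anything that triggers navigation
scoreCardHTML = browser.page_source      # old <html> is stale, new page is loaded
browser.back()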
You can do it with this method: a loop of try and wait, which is easy to implement.
from selenium import webdriver

browser = webdriver.Firefox()
browser.get("url")

button = ''
while not button:
    try:
        button = browser.find_element_by_name('NAME OF ELEMENT')
        button.click()
    except:
        continue
Assuming "pass" is an element in the current page and won't be present at the target page.
I mostly use Id of the link I am going to click on. because it is rarely present at the target page.
while True:
    try:
        browser.find_element_by_id("pass")
    except:
        break
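A slightly fuller, untested sketch of that idea (assuming "pass" is the id of a link that exists only on the current page, and adding a small sleep so the loop doesn't hammer the DOM):

import time
from selenium.common.exceptions import NoSuchElementException

while True:
    try:
        browser.find_element_by_id("pass")   # still found: the old page is still there
        time.sleep(0.5)                      # avoid busy-waiting
    except NoSuchElementException:
        break                                # element gone: the target page has loaded
scoreCardHTML = browser.page_source
browser.back()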