How to skip a webpage after a period of time in Selenium (Python)

I am parsing a file with a ton of colleges. Selenium googles "Admissions " + college_name then clicks the first link and gets some data from each page. The issue is that the list of college names I am pulling from is very rough (technically a list of all accredited institutions in America), so some of the links are broken or get stuck in a load loop. How do I set some sort of timer that basically says
if page load time > x seconds:
    go to next element in list

You could invoke WebDriverWait on the page; if a TimeoutException is raised, you know the page took too long to load, and you can proceed to the next one.
Given you do not know what each page HTML will look like, this is a very challenging problem.
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

# make slow page loads raise TimeoutException
driver.set_page_load_timeout(10)
# list of college names
names = []
for name in names:
    # search for the college here
    # get the list of search results
    WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located((By.XPATH, "//div[@class='rc']")))
    search_results = driver.find_elements_by_xpath("//div[@class='rc']")
    # get the first result
    search_result = search_results[0]
    # attempt to load the page
    try:
        search_result.click()
    except TimeoutException:
        # the click times out if the next page does not load;
        # pass to move on to the next URL
        pass
This is a very rough, general outline. As I mentioned, without knowing what the expected page title will be, or what the expected page content will look like, it's incredibly difficult to write a generic method that will successfully accomplish this. This code is meant to be just a starting point for you.
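The question's timing rule ("if page load time > x seconds: go to next element") can be factored into a small helper. This is only a sketch: `safe_get` is a hypothetical name, the driver is assumed to exist already, and a stand-in `TimeoutException` is defined so the snippet runs even where selenium is not installed.

```python
try:
    from selenium.common.exceptions import TimeoutException
except ImportError:  # stand-in so the sketch runs without selenium installed
    class TimeoutException(Exception):
        pass

def safe_get(driver, url, timeout=10):
    """Return True if `url` finished loading within `timeout` seconds."""
    driver.set_page_load_timeout(timeout)  # make driver.get raise on slow pages
    try:
        driver.get(url)
        return True
    except TimeoutException:
        return False  # caller can move on to the next college
```

In the loop over colleges this reduces to `if not safe_get(driver, link): continue`, which is exactly the "go to next element in list" behavior the asker wants.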


Unable to click Next button using selenium as number of pages are unknown

I am new to selenium and trying to scrape:
https://www.asklaila.com/search/Delhi-NCR/-/book-distributor/
I need all the details mentioned on this page, and on the other pages as well.
Several more pages contain the same kind of information, and I need to scrape them too. I tried doing that by changing the target URL:
https://www.asklaila.com/search/Delhi-NCR/-/book-distributor/40
but the last part of the URL changes and does not even correspond to the page number: page 3 has 40 at the end, while page 5 is
https://www.asklaila.com/search/Delhi-NCR/-/book-distributor/80
so I am not able to get the data that way.
Here is my code:-
def extract_url():
    url = driver.find_elements(By.XPATH, "//h2[@class='resultTitle']//a")
    for i in url:
        dist.append(i.get_attribute("href"))
    driver.execute_script("window.scrollTo(0,document.body.scrollHeight)")
    driver.find_element(By.XPATH, "//li[@class='btnNextPre']//a").click()

for _ in range(10):
    extract_url()
It works fine till page 5 but not after that. Could you please suggest how I can iterate over the pages when we don't know how many there are, and extract data till the last page?
You need to check whether the pagination link is disabled: use an infinite loop and break once the next button is disabled.
Use WebDriverWait() and wait for visibility of the element.
Code:
driver.get("https://www.asklaila.com/search/Delhi-NCR/-/book-distributor/")
counter = 1
while True:
    WebDriverWait(driver, 20).until(EC.visibility_of_element_located(
        (By.CSS_SELECTOR, "h2.resultTitle > a")))
    urllist = [item.get_attribute('href') for item in
               driver.find_elements(By.CSS_SELECTOR, "h2.resultTitle > a")]
    print(urllist)
    print("Page number: " + str(counter))
    # check for the pagination button being disabled before clicking
    if len(driver.find_elements(By.XPATH, "//li[@class='disabled']//a[text()='>']")) > 0:
        print("pagination not found!!!")
        break
    driver.execute_script("arguments[0].click();", driver.find_element(
        By.CSS_SELECTOR, "ul.pagination > li.btnNextPre > a"))
    time.sleep(2)  # to slow down the loop
    counter = counter + 1
Import the libraries below:
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
import time
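The stopping rule above can also be pulled out into a tiny predicate, which makes it easy to check before clicking "next". `has_next_page` is a hypothetical helper name; the XPath is the one used in the answer, `driver.find_elements("xpath", …)` is the Selenium 4 string form of `By.XPATH`, and the driver argument only needs a `find_elements` method, so the sketch runs without a browser.

```python
NEXT_DISABLED = "//li[@class='disabled']//a[text()='>']"

def has_next_page(driver):
    """True while the '>' pagination arrow has not yet been disabled."""
    return len(driver.find_elements("xpath", NEXT_DISABLED)) == 0
```

The scraping loop then becomes `while has_next_page(driver): …`, with no manual break needed.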

Selenium: wait until text NOT to be present in element

I want to extract the article titles from a webpage with a multi-page list of articles.
I get the article titles on the first page using:
titles = browser.find_elements_by_xpath(r'path')
for i in range(len(titles)):
    titles_list.append(titles[i].text)
I navigate to the next page using:
next_page = browser.find_element_by_xpath(r'path')
next_page.click()
Then, I return to the first step (i.e. getting the article titles).
The problem is that, using the code above, I sometimes get a page's article titles twice, and sometimes miss a page's article titles entirely.
I believe the solution is to wait until the page fully loads after the second step and before repeating the first step: I should store something unique to the first page (e.g. the first article's title) in a variable (e.g. 'first_item'), and I should wait until the corresponding element does not contain that text.
I found an answer to my question in Java, which used ExpectedConditions.not, but the following code (the EC.not() part) is not valid in Python and raises a SyntaxError, since not is a reserved keyword:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
next_page.click()
wait = WebDriverWait(browser, 10)
wait.until(EC.not(EC.text_to_be_present_in_element((By.XPATH, r'path'), first_item)))
How can I wait until a text is not present in an element in Python?
You can wait like this:
element = WebDriverWait(driver, 6).until_not(
    EC.element_to_be_clickable((By.XPATH, 'xpath')))
while element == True:
    try:
        element.click()
    except:
        pass
It looks odd, but it will wait until the element is no longer clickable; otherwise the loop continues.
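The asker's own idea also works directly: with selenium installed, the Python counterpart of Java's ExpectedConditions.not is `WebDriverWait(browser, 10).until_not(EC.text_to_be_present_in_element((By.XPATH, r'path'), first_item))`. The plain-Python sketch below shows the same polling logic and is runnable without a browser; `get_text` is a hypothetical stand-in for reading the element's text.

```python
import time

def wait_until_text_absent(get_text, old_text, timeout=10.0, poll=0.1):
    """Poll get_text() until `old_text` disappears; True on success, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if old_text not in get_text():
            return True   # the page changed under us, safe to re-read titles
        time.sleep(poll)  # still the old page; poll again shortly
    return False
```

After `next_page.click()`, calling this with the first article's title as `old_text` blocks until the new page has replaced it, which prevents both the duplicated and the skipped title lists.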

How to fix "stale element reference: element is not attached to the page document" while _scraping_ with selenium

Good evening modern-day heroes, hope everyone's safe and sound !
What I'm hoping to achieve with this selenium script: load the page, click the BTC, ETH and XRP icons to filter results, keep clicking the "show more" button until the maximum number of elements (1138) has been loaded, obtain the hrefs of those 1138 companies, then visit each company's page and scrape further data points located there.
With that said, I've tried lots of different approaches, including just printing the link of each company, which worked; however, the script fails to actually visit the extracted hrefs and says "stale element reference: element is not attached to the page document".
I've heard that explicit/implicit waits could help fix this, but I can't wrap my head around how to use them with the links variable in particular, which is where the code stops and gives me the error mentioned above.
I have a feeling the issue is with the while loop and how it handles looping through the list of links to be visited next. Can't emphasize how grateful I'll be if someone can guide me in the right direction!!
from selenium.webdriver import Chrome
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
import time
from selenium.common.exceptions import NoSuchElementException, ElementNotVisibleException

webdriver = '/Users/karimnabil/projects/selenium_js/chromedriver-1'
driver = Chrome(webdriver)
url = 'https://acceptedhere.io/catalog/company/'
driver.get(url)

btc = driver.find_element_by_xpath("//ul[@role='currency-list']/li[1]/a")
btc.click()
eth = driver.find_element_by_xpath("//ul[@role='currency-list']/li[2]/a")
eth.click()
xrp = driver.find_element_by_xpath("//ul[@role='currency-list']/li[5]/a")
xrp.click()
all_categories = driver.find_element_by_xpath("//div[@class='dropdownMenu']/ul/li[1]")
all_categories.click()
time.sleep(1)

maximum_number = 1138
while True:
    show_more = driver.find_element_by_xpath("//div[@class='row search-result']/div[3]/button")
    elements = driver.find_elements_by_xpath("//div[@class='row desktop-results mobile-hide']/div")
    if len(elements) > maximum_number:
        break
    show_more.click()
    time.sleep(1)

for element in elements:
    links = element.find_elements_by_xpath(".//div/div/div[2]/div/div/div[1]/a")
    links = [url.get_attribute('href') for url in links]
    time.sleep(0.5)
    for link in links:
        driver.get(link)
        company_title = driver.find_element_by_xpath("//h3").text
        print(company_title)
When you navigate to another page, the elements stored in your variables (e.g. show_more) become stale, since you are now on a different page. You may also need to wait for an element to load or to become clickable. Here are some examples:
https://seleniumbyexamples.github.io/waitclickable
https://seleniumbyexamples.github.io/waitvisibility
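A common fix for this particular stale reference: finish reading everything you need from the listing page as plain strings before the first driver.get, since strings, unlike WebElements, cannot go stale. `collect_hrefs` is a hypothetical helper; the elements only need a `get_attribute` method, so the sketch is runnable without a browser.

```python
def collect_hrefs(elements):
    """Extract href strings up front; navigating afterwards cannot invalidate them."""
    return [el.get_attribute("href") for el in elements]

# Hypothetical usage against the question's code:
#   links = collect_hrefs(driver.find_elements_by_xpath(
#       "//div[@class='row desktop-results mobile-hide']//a"))
#   for link in links:
#       driver.get(link)  # safe: `links` holds strings, not WebElements
```

The key design point is that the loop over `links` never touches `elements` again after navigation starts.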

Scraping dynamic data from a form on a site

I am trying to scrape a dynamic list of options from a form on a site. The site works in a way that when you enter some data in the query box, it takes them as keywords and searches from its own database and accordingly generates results.
I am trying to extract the whole complete list by scraping using selenium.
(The question included screenshots of the inspect-element view before and after typing keywords into the form; the images are not reproduced here.)
for i in range(1, 100):
    try:
        depart.append(browser.find_elements_by_class_name("accessabilityBar textIndent")[i].text)
    except Exception as e:
        break
print(depart)
So, here is what I get as output: [u'']
Can somebody help me out with this?
browser.find_elements_by_class_name("accessabilityBar textIndent") raises an exception because compound class names are not permitted, but the exception is caught by the except block.
Try below instead:
depart = [item.text for item in browser.find_elements_by_css_selector("span.accessabilityBar.textIndent")]
If you need to wait until the text is generated, you might need something like:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
WebDriverWait(browser, 10).until(EC.frame_to_be_available_and_switch_to_it(
    browser.find_element_by_xpath('//iframe[@src="s.effectivemeasure.net/html/frame_2.3.7.html"]')))
depart = [item.text for item in WebDriverWait(browser, 10).until(
    EC.presence_of_all_elements_located((By.XPATH,
        "//span[@class='accessabilityBar textIndent' and normalize-space()]")))]
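The `[u'']` output means the matched spans existed but had no visible text at the moment they were read. Besides waiting for the text to appear, a small filter drops empty entries; `visible_texts` is a hypothetical helper and runs against any objects with a `.text` attribute, so no browser is needed to try it.

```python
def visible_texts(elements):
    """Keep only entries whose .text is non-empty after stripping whitespace."""
    return [el.text.strip() for el in elements if el.text and el.text.strip()]
```

Combined with the wait above, `depart = visible_texts(...)` guarantees the result list never contains blank strings.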

cannot locate element within a web page using selenium python

I just want to write a simple log-in script for one website. However, I think the log-in page was written in JS, and it's really hard to locate the elements with selenium.
The web page I am going to play with is:
"https://www.nike.com/snkrs/login?returnUrl=%2F"
This is how the page and its inspect-element view look (screenshot not reproduced here):
I was trying to locate the element by following code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
driver = webdriver.Firefox()
driver.get("https://www.nike.com/snkrs/login?returnUrl=%2Fthread%2Fe9680e08e7e3cd76b8832684037a58a369cad5ed")
time.sleep(5)
driver.switch_to.frame(driver.find_element_by_tag_name("iframe"))
elem = driver.find_element_by_xpath("//*[@id='ce3feab5-6156-441a-970e-23544473a623']")
elem.send_keys("pycon")
elem.send_keys(Keys.RETURN)
driver.close()
This code returns an error saying no element could be found by [@id='ce3feab5-6156-441a-970e-23544473a623'].
I tried playing with frames, but that does not seem to work. If I go to the "view source" page, it is full of JS code.
Is there a good way to play with such a web page with selenium?
Try changing the code:
elem = driver.find_element_by_xpath("//*[@id='ce3feab5-6156-441a-970e-23544473a623']")
to
elem = driver.find_element_by_xpath("//*[@type='email']")
My guess (and observation) is that the id changes each time you visit the page. The id looks auto-generated, and when I go to the page multiple times, the id is different each time.
You'll need to search for something that doesn't change. For example, you can search for the name attribute, which has the seemingly static value "emailAddress"
element = driver.find_element_by_name("emailAddress")
You could also use an xpath expression to search for other attributes, such as data-componentname:
element = driver.find_element_by_xpath("//input[@data-componentname='emailAddress']")
Also, instead of a hard-coded sleep, you can simply wait for the element to be visible:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Firefox()
driver.get("https://www.nike.com/snkrs/login")
element = WebDriverWait(driver, 10).until(
    EC.visibility_of_element_located((By.NAME, "emailAddress"))
)
