for element in driver.find_elements_by_xpath('.//span[@data-bind = "text: $salableQuantityData.qty"]'):
    elem = element.text
    stock = int(elem)
    if stock < 0:
        print(stock)
After this loop I have to click driver.find_element_by_xpath('.//button[@class="action-next"]').click() and then continue the same loop.
Note: the web table has 5 pages of pagination and each page has a few negative values; I'm trying to get the negative values from all pages.
If I understand correctly, you will need a function, which is neat when you have to do the same thing several times.
Just wrap the work in a simple function and call it every time you need it. If I understood correctly, you need to click some sort of 'next page' button and then continue, right?
def some_work():
    for element in driver.find_elements_by_xpath('.//span[@data-bind = "text: $salableQuantityData.qty"]'):
        elem = element.text
        stock = int(elem)
        if stock < 0:
            print(stock)
    driver.find_element_by_xpath('.//button[@class="action-next"]').click()
some_work()
Or just nest it in for/while loops. Why not?
Try this to walk through all pages: it keeps calling some_work() until the 'action-next' button can no longer be found, at which point find_element raises an exception and the loop breaks. This is my first time seeing Selenium, but its documentation suggests using NoSuchElementException.
from selenium.common.exceptions import NoSuchElementException

while True:
    try:
        some_work()
    except NoSuchElementException:
        break
I'm a beginner and have a lot to learn, so please be patient with me.
Using Python and Selenium, I'm trying to scrape table data from a website while navigating through different pages. As I navigate through the pages, the table shows the updated data, but the page doesn't refresh and the URL stays the same.
To get the refreshed data from the table and avoid a stale element exception, I used WebDriverWait and expected_conditions (on the tr elements). Even with the wait, my code didn't get the refreshed data; it was still getting the old data from the previous page and raising the exception. So I added time.sleep() after clicking the next page button, which solved the problem.
However, I noticed my code was getting slower as I navigated through more and more pages. At around page 120, it gave me the stale element exception and could no longer get the refreshed data. I'm assuming this is because the for loop inside the while loop slows down performance.
I tried an implicit wait and gradually increased time.sleep() to avoid the staleness exception, but nothing worked. There are 100 table rows on each page and around 3,100 pages in total.
These are my problems:
Why do I get the stale element exception, and how can I avoid it?
How can I increase the efficiency of the code?
I searched a lot and really tried to fix it on my own before I decided to write here.
I'm stuck here and don't know what to do. Please help, and thank you so much for your time.
while True:
    # waits until the table elements are visible when the page is loaded
    # this is a must for Selenium to scrape data from the dynamic table when navigating through different pages
    tr = WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.XPATH, "//*[@id='erdashboard']/tbody/tr")))
    for record in tr:
        count += 1
        posted_date = datetime.strptime(record.find_element(By.XPATH, './td[7]').text, "%m/%d/%Y").date()
        exclusion_request_dict["ID"].append(int(record.find_element(By.XPATH, './td[1]').text))
        exclusion_request_dict["Company"].append(record.find_element(By.XPATH, './td[2]').text)
        exclusion_request_dict["Product"].append(record.find_element(By.XPATH, './td[3]').text)
        exclusion_request_dict["HTSUSCode"].append(record.find_element(By.XPATH, './td[4]').text)
        exclusion_request_dict["Status"].append(record.find_element(By.XPATH, './td[5]').text)
        exclusion_request_dict["Posted Date"].append(posted_date)
    next_button = driver.find_element(By.ID, "erdashboard_next")
    next_button_clickable = driver.find_element(By.ID, "erdashboard_next").get_attribute("class").split(" ")
    print(next_button_clickable)
    print("Current Page:", page, "Total Counts:", count)
    if next_button_clickable[-1] == "disabled":
        break
    next_button.click()  # goes to the next page
    time.sleep(wait + 0.01)
When you click the next page button, you can avoid the stale element exception by, for example, checking when the ID in the first row has changed. This is done in the part of the code marked # wait until new page is loaded (see the full code below).
When scraping data from a table, you can increase the efficiency of the code with two tricks. First, loop over columns rather than over rows, because there are (almost always) more rows than columns. Second, use JavaScript instead of the Selenium command .text, because JavaScript is way faster than .text. For example, to scrape the values in the first column, the Selenium command is
[td.text for td in driver.find_elements(By.XPATH, '//tbody/tr/td[1]')]
and it takes about 1.2 seconds on my computer, while the corresponding JavaScript command (see the code inside for idx in range(1,8) below) takes only about 0.008 seconds (150 times faster!). The first trick makes only a slight difference when using .text, but with JavaScript it is really effective: for example, scraping the whole table by rows with JavaScript takes about 0.52 seconds, while by columns it takes only about 0.05 seconds.
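To see the trick in isolation, here is a minimal sketch of the column-wise JavaScript read (the nth-child index selects the column; the tbody selector is the same one used in the full code below):
# read an entire table column with a single JavaScript call instead of many .text calls
first_column = driver.execute_script(
    "var result = [];"
    "var all = document.querySelectorAll('tbody>tr>td:nth-child(1)');"
    "for (var i = 0; i < all.length; i++) { result.push(all[i].innerText); }"
    "return result;")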
Here is the full code:
import math, time, pandas
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import StaleElementReferenceException

chromedriver_path = '...'
driver = webdriver.Chrome(service=Service(chromedriver_path))
wait = WebDriverWait(driver, 9)

driver.get('https://232app.azurewebsites.net/')

dropdown = wait.until(EC.element_to_be_clickable((By.NAME, 'erdashboard_length')))
target_number_of_rows = 100
Select(dropdown).select_by_value(str(target_number_of_rows))

# wait until 100 rows are loaded
current_number_of_rows = 0
while current_number_of_rows != target_number_of_rows:
    current_number_of_rows = len(driver.find_elements(By.CSS_SELECTOR, 'tbody tr'))

header = [th.text for th in driver.find_elements(By.XPATH, '//tr/th[position()<last()]')]
data = {key: [] for key in header}
number_of_pages = int(driver.find_element(By.CSS_SELECTOR, '.paginate_button:last-child').text)
times = []

while 1:
    start = time.time()
    if len(times) > 0:
        current_page = int(driver.find_element(By.CLASS_NAME, "current").text)
        mean = sum(times) / len(times)
        eta = (number_of_pages - current_page) * mean
        minutes = math.floor(eta / 60)
        seconds = round((eta / 60 - minutes) * 60)
        print(f'current page {current_page} (ETA {minutes}:{seconds}) (mean per page {mean:.2f}s) ({len(data[header[0]])} rows scraped)', end='\r')

    for idx in range(1, 8):
        data[header[idx - 1]] += driver.execute_script("var result = [];" +
            f"var all = document.querySelectorAll('tbody>tr>td:nth-child({idx})');" +
            "for (var i=0, max=all.length; i < max; i++) {" +
            "    result.push(all[i].innerText);" +
            "} " +
            "return result;")

    # check if all lists in the dictionary have the same length; if not there is a problem (column missing or not scraped properly)
    lens = [len(data[h]) for h in header]
    if len(set(lens)) != 1:
        print('\nerror: lists in the dictionary have different lengths')
        print(lens)
        break

    # click next page button if available
    next_btn = driver.find_element(By.ID, 'erdashboard_next')
    if 'disabled' not in next_btn.get_attribute('class'):
        next_btn.click()
    else:
        print('\nno more pages to load')
        break

    # wait until new page is loaded
    first_row_id_old = WebDriverWait(driver, 9).until(EC.visibility_of_element_located((By.CSS_SELECTOR, 'tbody>tr>td'))).text
    first_row_id_new = first_row_id_old
    while first_row_id_new == first_row_id_old:
        try:
            first_row_id_new = WebDriverWait(driver, 9).until(EC.visibility_of_element_located((By.CSS_SELECTOR, 'tbody>tr>td'))).text
        except StaleElementReferenceException:
            continue

    times += [time.time() - start]
While the loop is running you get output like this ("ETA" is the estimated remaining time in the format minutes:seconds; "mean per page" is the mean time it takes to process one page):
current page 156 (ETA 73:58) (mean per page 1.52s) (15500 rows scraped)
Then, by running pandas.DataFrame(data), you get something like this
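If you also want to persist the scraped table to disk, a minimal follow-up sketch (the filename is just an example):
df = pandas.DataFrame(data)
df.to_csv('scraped_table.csv', index=False)  # example filename, adjust as needed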
When I run this code to get the titles and links, I get 10X results. Any idea what I am doing wrong? Is there a way to stop the scraping when we reach the last result on the page?
Thanks!
while True:
    web = 'https://news.google.com/search?q=weather&hl=en-US&gl=US&ceid=US%3Aen'
    driver.get(web)
    time.sleep(3)
    titleContainers = driver.find_elements(by='xpath', value='//*[@class="DY5T1d RZIKme"]')
    linkContainers = driver.find_elements(by='xpath', value='//*[@class="DY5T1d RZIKme"]')
    if (len(titleContainers) != 0):
        for i in range(len(titleContainers)):
            counter = counter + 1
            print("Counter: " + str(counter))
            titles.append(titleContainers[i].text)
            links.append(linkContainers[i].get_attribute("href"))
    else:
        break
You put yourself in an infinite loop with that while True statement: the condition if (len(titleContainers) != 0): will always evaluate to True once the elements are found on the page (there are 100 of them), so the same page keeps being scraped over and over. You're not posting your full code, so I imagine counter, titles and links are defined somewhere in it. You may want to test that counter is less than or equal to the length of titleContainers, or simply stop after one pass over the page.
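A minimal sketch of that last idea: since the page only needs to be loaded once, the simplest fix is to drop the outer while True entirely (this assumes driver, titles and links, and the time import, are already set up as in your code):
web = 'https://news.google.com/search?q=weather&hl=en-US&gl=US&ceid=US%3Aen'
driver.get(web)
time.sleep(3)

articles = driver.find_elements(by='xpath', value='//*[@class="DY5T1d RZIKme"]')
for counter, article in enumerate(articles, start=1):
    print("Counter: " + str(counter))
    titles.append(article.text)
    links.append(article.get_attribute("href"))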
I'm trying to fetch web-table data using a for loop, and the table has pagination up to 42 pages. Here is my code:
driver.get()

# identification and Locators
stack = driver.find_elements_by_xpath("//*[@id='container']/div/div[4]/table/tbody/tr/td[10]/div/ul/li")
quant = driver.find_elements_by_xpath("//*[@class='admin__data-grid-wrap']/table/tbody/tr/td[7]/div")
link = driver.find_elements_by_xpath("//*[@class='admin__data-grid-wrap']/table/tbody/tr/td[15]/a")

# Start a procedure
for i in driver.find_elements_by_xpath("//*[@id='container']/div/div[2]/div[2]/div[2]/div/div[2]/div/div[2]/button[2]"):
    for steck, quanty, links in zip(stack, quant, link):
        stuck = steck.text
        quantity = quanty.text
        linkes = links.get_attribute("href")
        if stuck != 'No manage stock':
            word = "Default Stock: "
            stock = stuck.replace(word, '')
            stocks = int(stock)
            quanties = int(float(quantity))
            if stocks < 0:
                print(stocks, quanties, linkes)
                stacks = abs(stocks)
                total = stacks + quanties + 1
                print(total)
    i.click()
    driver.implicitly_wait(10)
    print("Next Page")
This code fetches data from the 1st page, but after clicking the next page, the second for loop doesn't fetch the 2nd-page data from the web table.
Most likely your query driver.find_elements_by_xpath("//*[@id='container']/div/div[2]/div[2]/div[2]/div/div[2]/div/div[2]/button[2]") only returns one element (the actual button that goes to the next page), so I guess you should read the number of pages and use it for an outer loop. At the very least, you might have to rebind the selections on the HTML elements (notably the one on the clickable next-page button), because they may change when a new page of the table is loaded:
driver.get()

# Read the number of pages and store it as an integer
nb_pages = int(driver.find_element_by_id('someId').text)

# Repeat your code (and rebind your selections, notably the one
# on the button that goes to the next page) on each page of the table
for page in range(nb_pages):
    # lines below are adapted from your code; I notably removed your first loop
    stack = driver.find_elements_by_xpath("//*[@id='container']/div/div[4]/table/tbody/tr/td[10]/div/ul/li")
    quant = driver.find_elements_by_xpath("//*[@class='admin__data-grid-wrap']/table/tbody/tr/td[7]/div")
    link = driver.find_elements_by_xpath("//*[@class='admin__data-grid-wrap']/table/tbody/tr/td[15]/a")

    # loop removed here (I also split the string across lines for readability,
    # but it doesn't change the actual string value)
    i = driver.find_elements_by_xpath(
        "//*[@id='container']/div/div[2]/div[2]/div[2]"
        "/div/div[2]/div/div[2]/button[2]")[0]

    for steck, quanty, links in zip(stack, quant, link):
        # your logic ...
        # ...
        pass

    # Load the next page:
    i.click()
If you can't read the number of pages, you may also use a while loop and exit it when you can't find a button to load the next page, with something like:
while True:
    i = driver.find_elements_by_xpath(
        "//*[@id='container']/div/div[2]/div[2]/div[2]"
        "/div/div[2]/div/div[2]/button[2]")
    if not i:
        break
    i = i[0]

    # the rest of your logic
    # ...

    i.click()
This is only a guess (as we don't have a sample HTML code of the page / table structure that you are trying to use).
I want to click on an element that is repeated throughout the website (it is a button), but how do I click on, let's say, the second button rather than the first?
Here is the code of the button I want to click:
SHOP NOW
However, the issue is that it may sometimes be greyed out if the item is not in stock, in which case I don't want to click it.
As a result, here is all of my code:
def mainclick(website):
    while True:
        time.sleep(1)
        price_saved = [i.text.replace('$', "").replace(',', '') for i in driver.find_elements_by_css_selector('[itemprop=youSave]')]
        print(price_saved)
        for g in range(len(price_saved)):
            a = g + 1
            if float(price_saved[g]) > 200:
                try:
                    driver.find_element_by_link_text("SHOP NOW")[a].click()
                    time.sleep(3)
                    try:
                        driver.find_element_by_id("addToCartButtonTop").click()
                        driver.execute_script("window.history.go(-1)")
                    except:
                        driver.execute_script("window.history.go(-1)")
                except:
                    print("couldn't click")
                    pass
                print(a)
        driver.find_element_by_link_text("Next Page").click()
    print("all pages done")

# starts time
start_time = time.time()
mainweb = "https://www.lenovo.com/us/en/outletus/laptops/c/LAPTOPS?q=%3Aprice-asc%3AfacetSys-Memory%3A16+GB%3AfacetSys-Processor%3AIntel%C2%AE+Core%E2%84%A2+i7%3AfacetSys-Processor%3AIntel%C2%AE+Core%E2%84%A2+i5%3AfacetSys-Memory%3A8+GB&uq=&text=#"
driver.get(mainweb)
mainclick(mainweb)
I tried using [a] to click on a certain one, but it doesn't seem to work. Also, the href of the SHOP NOW button might change based on the product.
You can collect the elements using .find_elements*.
elements = driver.find_elements_by_link_text('insert_value_here')
elements[0].click()
The above example clicks the first element.
Replace the index [0] with the one you want.
If you are sure that you always want to click on the 2nd button,
try using the below XPath:
(//*[@class='button-called-out button-full facetedResults-cta'])[2]
If the count of buttons is not always the same (some may be greyed out),
try using findElements:
List<WebElement> buttons = driver.findElements(By.xpath("//*[@class='button-called-out button-full facetedResults-cta']"));
buttons.size();
Append the index (up to buttons.size()) to the XPath in place of the '2' dynamically, and you can click on the first/second button that is not greyed out.
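In Python (as used in the question), a rough sketch of that idea might look like this; the class name comes from the answer above, and treating a greyed-out button as one with 'disabled' in its class attribute is an assumption about the page:
buttons = driver.find_elements_by_xpath("//*[@class='button-called-out button-full facetedResults-cta']")
for idx in range(1, len(buttons) + 1):
    # build the indexed XPath dynamically (XPath indices start at 1)
    btn = driver.find_element_by_xpath(
        "(//*[@class='button-called-out button-full facetedResults-cta'])[{}]".format(idx))
    # assumption: a greyed-out button exposes 'disabled' somewhere in its class attribute
    if 'disabled' not in btn.get_attribute('class'):
        btn.click()
        break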
You can use XPath with an index a:
driver.find_element_by_xpath("(//a[.='SHOP NOW'])[{}]".format(a))
Note that the first element has index 1.
I'm writing a program that scrapes courses from my school's website, and I'm trying to check which element is present after I click the search button.
If 'id1' is present, the course is not available and the next course needs to be searched; if 'id2' is present, the course is available, so scrape it and then search the next course.
This is what I have at the moment, but it's not working. I've tried using WebDriverWait with conditional statements, but I couldn't get that to work either. How can I solve this?
while(i < len(courseNumList)):
    self.clearAndSearch(courseNumList[i], coursePrefixList[i])
    if(len(self.driver.find_elements_by_id('id1')) > 0):
        i = i + 1
        continue
    self.scrapeAndModifySearch(courses, INDEX_NAME, TYPE_NAME)
    i = i + 1
scrapeAndModifySearch():
def scrapeAndModifySearch(self, courses, esindex, estypename):
    self.getCourses(courses, esindex, estypename)
    self.modifySearch()
getCourses():
def getCourses(self, courses, INDEX_N, TYPE_N):
    course = {}
    try:
        element_present = EC.presence_of_element_located((By.ID, 'id2'))
        WebDriverWait(self.driver, 30).until(element_present)
    except TimeoutException:
        print("Loading took too much time!")
    classSectionsFound = self.driver.find_element_by_xpath('id2').text
Seems you've mixed up your locators. That last line should be:
classSectionsFound = self.driver.find_element_by_id('id2').text
'id2' is not valid XPath (at least not for an HTML document).
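If you also want to branch on whichever of the two elements appears first, instead of always waiting up to 30 seconds for 'id2', one minimal sketch is a small polling helper on your class (the method name which_result_present is hypothetical; it uses the 'id1'/'id2' ids from your description and requires import time):
def which_result_present(self, timeout=30, poll=0.5):
    # Poll until either 'id1' (course unavailable) or 'id2' (course available) appears.
    end = time.time() + timeout
    while time.time() < end:
        if len(self.driver.find_elements_by_id('id2')) > 0:
            return 'id2'
        if len(self.driver.find_elements_by_id('id1')) > 0:
            return 'id1'
        time.sleep(poll)
    return None
In the main while loop you could then call self.which_result_present() once per search and only run scrapeAndModifySearch() when it returns 'id2'.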