Getting data from an HTML table with Selenium (Python): submitting changes breaks the loop

I want to scrape data from an HTML table for different combinations of drop down values via looping over those combinations. After a combination is chosen, the changes need to be submitted. This is, however, causing an error since it refreshes the page.
This is what I've done so far:
from selenium import webdriver
from selenium.webdriver.support.ui import Select
import time

browser = webdriver.Firefox()  # driver setup (assumed; any browser works)
browser.get('https://daten.ktbl.de/feldarbeit/entry.html')

# Selecting the constant values of some of the drop downs:
fertilizer = Select(browser.find_element_by_name("hgId"))
fertilizer.select_by_value("2")
fertilizer = Select(browser.find_element_by_name("gId"))
fertilizer.select_by_value("193")
fertilizer = Select(browser.find_element_by_name("avId"))
fertilizer.select_by_value("383")
fertilizer = Select(browser.find_element_by_name("hofID"))
fertilizer.select_by_value("2")

# Looping over different combinations of plot size and amount of fertilizer:
size = Select(browser.find_element_by_name("flaecheID"))
for size_values in size.options:
    size.select_by_value(size_values.get_attribute("value"))
    time.sleep(1)
    amount = Select(browser.find_element_by_name("mengeID"))
    for amount_values in amount.options:
        amount.select_by_value(amount_values.get_attribute("value"))
        time.sleep(1)
        # Submitting the changes (refreshes the page) after the two variable values are chosen:
        button = browser.find_element_by_xpath("//*[@type='submit']")
        button.click()
        time.sleep(5)
This leads to the error:

selenium.common.exceptions.StaleElementReferenceException: Message: The element reference of <option> is stale; either the element is no longer attached to the DOM, it is not in the current frame context, or the document has been refreshed.
Obviously the issue is that I did indeed refresh the document.
After submitting the changes and the page has loaded the results, I want to retrieve them with:
import pandas as pd

html_source = browser.page_source
df_list = pd.read_html(html_source, match="Dieselbedarf")
(Shout-out to @bink1time, who answered this part of my question here.)
How can I update the page without breaking the loop?
I would very much appreciate some help here!

Stale Element Reference Exception often occurs upon page refresh because of an element UUID change in the DOM.
In order to avoid it, always try to search for an element before an interaction. In your particular case, you searched for size and amount, found them and stored them in variables. But then, upon refresh, their UUID changed, so old ones that you have stored are no longer attached to the DOM. When trying to interact with them, Selenium cannot find them in the DOM and throws this exception.
I modified your code to always re-search size and amount elements before the interaction:
# Looping over different combinations of plot size and amount of fertilizer:
size = Select(browser.find_element_by_name("flaecheID"))
for i in range(len(size.options)):
    # Search and save a fresh select element
    size = Select(browser.find_element_by_name("flaecheID"))
    size.select_by_value(size.options[i].get_attribute("value"))
    time.sleep(1)
    amount = Select(browser.find_element_by_name("mengeID"))
    for j in range(len(amount.options)):
        # Search and save a fresh select element
        amount = Select(browser.find_element_by_name("mengeID"))
        amount.select_by_value(amount.options[j].get_attribute("value"))
        time.sleep(1)
        # Refreshing the page after the two variable values are chosen:
        button = browser.find_element_by_xpath("//*[@type='submit']")
        button.click()
        time.sleep(5)
Try this? It worked for me. I hope it helps.
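A variation on the same idea: snapshot the option values up front and loop over the plain strings, re-finding the Select once per iteration. This is just a sketch against the same page, reusing the setup from the question; the element names (flaecheID, mengeID) come from there.

# Collect the option values first; plain strings never go stale.
size_values = [o.get_attribute("value")
               for o in Select(browser.find_element_by_name("flaecheID")).options]

for value in size_values:
    # Re-locate the <select> after every page refresh before using it.
    Select(browser.find_element_by_name("flaecheID")).select_by_value(value)
    time.sleep(1)
    # ... select the amount and submit as above ...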

Related

StaleElementReferenceException while looping over list

I'm trying to make a web scraper for this website. The idea is that the code iterates over all institutions by selecting the institution's name (3B-Wonen in the first instance), closes the pop-up screen, clicks the download button, and does it all again for all items in the list.
However, after the first loop it throws a StaleElementReferenceException when selecting the second institution. From what I read about it, this implies that the elements defined in the first loop are no longer accessible. I've read multiple posts, but I have no idea how to overcome this particular case.
Can anybody point me in the right direction? By the way, I'm using Python's Selenium and I'm quite a beginner in programming, so I'm still learning. If you could point me in a general direction, that would help me a lot! The code I have is the following:
# importing and setting up parameters for geckodriver/firefox
...
# webpage
driver.get("https://opendata-dashboard.cijfersoverwonen.nl/dashboard/opendata-dashboard/beleidswaarde")
WebDriverWait(driver, 30)

# Get rid of cookie notification
# driver.find_element_by_class_name("cc-compliance").click()

# Store position of download button
element_to_select = driver.find_element_by_id("utilsmenu")
action = ActionChains(driver)
WebDriverWait(driver, 30)

# Drop down menu
driver.find_element_by_id("baseGeo").click()

# Add institutions to array
corporaties = []
corporaties = driver.find_elements_by_xpath("//button[@role='option']")

# Iteration
for i in corporaties:
    i.click()  # select institution
    driver.find_element_by_class_name("close-button").click()  # close pop-up screen
    action.move_to_element(element_to_select).perform()  # select download button
    driver.find_element_by_id("utilsmenu").click()  # click download button
    driver.find_element_by_id("utils-export-spreadsheet").click()  # pick export to excel
    driver.find_element_by_id("baseGeo").click()  # select drop down menu for next iteration
This code worked for me, though I am not performing the driver.find_element_by_id("utils-export-spreadsheet").click() step.
from selenium import webdriver
import time
from selenium.webdriver.common.action_chains import ActionChains

driver = webdriver.Chrome(executable_path="path")
driver.maximize_window()
driver.implicitly_wait(10)
driver.get("https://opendata-dashboard.cijfersoverwonen.nl/dashboard/opendata-dashboard/beleidswaarde")
act = ActionChains(driver)
driver.find_element_by_xpath("//a[text()='Sluiten en niet meer tonen']").click()  # Close pop-up

# Get the count of options
driver.find_element_by_id("baseGeoContent").click()
cor_len = len(driver.find_elements_by_xpath("//button[contains(@class,'sel-listitem')]"))
print(cor_len)
driver.find_element_by_class_name("close-button").click()

# No need to start from 0, since the 1st option is already selected. Start by downloading, then move to the next items.
for i in range(1, cor_len-288):  # Tried only for 5 items
    act.move_to_element(driver.find_element_by_id("utilsmenu")).click().perform()
    # Code to click on downloading option
    print("Downloaded:{}".format(driver.find_element_by_id("baseGeoContent").get_attribute("innerText")))
    driver.find_element_by_id("baseGeoContent").click()
    time.sleep(3)  # Takes time to load.
    coritems = driver.find_elements_by_xpath("//button[contains(@class,'sel-listitem')]")
    coritems[i].click()
    driver.find_element_by_class_name("close-button").click()
driver.quit()
Output:
295
Downloaded:3B-Wonen
Downloaded:Acantus
Downloaded:Accolade
Downloaded:Actium
Downloaded:Almelose Woningstichting Beter Wonen
Downloaded:Alwel
Problem Explanation:
The problem here is that you have defined a list, corporaties = driver.find_elements_by_xpath("//button[@role='option']"), and are then iterating over this list and clicking its elements, where each click may cause a redirection to a new page, open a new tab, etc.
So when Selenium tries to interact with the second web element from the same list, it has to come back to the original page, and the moment it does, all of the stored elements become stale.
Solution:
One basic solution in such cases is to define the list again inside the loop, so the elements won't be stale. Please see the illustration below.
Code:
corporaties = []
corporaties = driver.find_elements_by_xpath("//button[@role='option']")
# Iteration
j = 0
for i in range(len(corporaties)):
    # Re-find the list each time so the elements are never stale
    elements = driver.find_elements_by_xpath("//button[@role='option']")
    elements[j].click()
    j = j + 1  # select institution
    driver.find_element_by_class_name("close-button").click()  # close pop-up screen
    action.move_to_element(element_to_select).perform()  # select download button
    driver.find_element_by_id("utilsmenu").click()  # click download button
    driver.find_element_by_id("utils-export-spreadsheet").click()  # pick export to excel
    driver.find_element_by_id("baseGeo").click()  # select drop down menu for next iteration
    time.sleep(2)

How to load in the entirety of a website for selenium to collect data from, and keep everything loaded in?

I am trying to scrape the terms and definitions, using the Selenium Chrome driver in Python, from this website: https://quizlet.com/433328443/ap-us-history-flash-cards/. There are 533 terms... so many, in fact, that Quizlet makes you click a "See more" button if you want to see all of them. The following code successfully extracts terms and definitions (I have tested it on other Quizlet sites with fewer terms). There are also if statements to deal with popups and the "See more" button. Again, my goal is to get the terms and definitions for every single term-definition pair on the page; however, to do this, the entire page needs to be loaded in, which is the basis of my problem.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome(executable_path=chrome_driver_path)
driver.get("https://quizlet.com/433328443/ap-us-history-flash-cards/")

# IN CASE OF POPUP, CLICK AWAY
if len(driver.find_elements_by_xpath("//button[@class='UILink UILink--revert']")) > 0:
    popup = driver.find_element_by_xpath("//button[@class='UILink UILink--revert']")
    popup.click()
    del popup

# SCROLL TO BOTTOM TO LOAD IN ALL TERMS, AND THEN BACK TO THE TOP
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

# IN CASE OF "SEE MORE" BUTTON AT BOTTOM, CLICK IT
if len(driver.find_elements_by_xpath("//button[@class='UIButton UIButton--fill' and @aria-label='See more']")) > 0:
    see_more = driver.find_element_by_xpath("//button[@class='UIButton UIButton--fill' and @aria-label='See more']")
    see_more.click()
    del see_more
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

# list of terms
quizlet_terms = tuple(map(lambda a: a.text,
                          driver.find_elements_by_class_name("SetPageTerm-wordText")))
# list of definitions
quizlet_definitions = tuple(map(lambda a: a.text,
                                driver.find_elements_by_class_name("SetPageTerm-definitionText")))
In my code, I have tried the scrolling down trick to load in everything, but this does not work. This is because as I scroll down, while terms in my browser window are loaded, terms above and below my browser window get unloaded. Obviously, this is done for memory reasons, but I do not care about memory and I just want for all the terms to be loaded at once so I can access their contents. My code works on smaller quizlet sites (with say 100 terms), but it breaks on this site, generating the following error:
selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: element is not attached to the page document
This stackoverflow page explains the error message: Python with Selenium "element is not attached to the page document".
From reading the aforementioned page, I have come to the conclusion that because the website is so large, as I scroll down the quizlet page, the terms I am currently looking at in my browser window are loaded, but terms that I have scrolled past and are no longer in my view are unloaded and stored in some funky way that I cannot properly access, generating the error message.
How would one go about in keeping the entirety of the page loaded-in so I can access the contents of all 533 terms? Ideally, I would like a solution that keeps everything I have scrolled past fully-loaded in, and does not unload anything. Another idea is that the whole page is loaded in from the get-go. It would also be nice if there is some memory-saving solution to this, perhaps by simply accessing just the raw html code and no fancy graphics or anything. Has anyone ever encountered this problem, and if so, how did you solve it? Thank you, any help is appreciated.
Much thanks to @Abhishek Dhoundiyal's comment. My working code:
driver.execute_script("window.scrollTo(800, 800);")
terms_in_this_set = int(sub("\D", "", (driver.find_element_by_xpath("//h4[#class='UIHeading UIHeading--assembly UIHeading--four']")).text))
chunk_size = 15000
quizlet = numpy.empty(shape = (0, 2), dtype = "str")
# done in while loop so that terms and definitions can be extracted while scrolling (while making sure there are no duplicate entries)
while len(quizlet) != terms_in_this_set:
# INCASE OF "SEE MORE" BUTTON, CLICK IT TO SEE MORE
if len(driver.find_elements_by_xpath("//button[#class='UIButton UIButton--fill' and #aria-label='See more']")) > 0:
see_more = driver.find_element_by_xpath("//button[#class='UIButton UIButton--fill' and #aria-label='See more']")
see_more.click()
del see_more
# CHECK IF THERE ARE TERMS
quizlet_terms_classes = driver.find_elements_by_class_name("SetPageTerm-wordText")
quizlet_definitions_classes = driver.find_elements_by_class_name("SetPageTerm-definitionText")
if (len(quizlet_terms_classes) > 0) and (len(quizlet_definitions_classes) > 0):
# append current iteration terms and definitions to full quizlet terms and definitions
quizlet = numpy.vstack((quizlet, numpy.transpose([list(map(lambda term: remove_whitespace(term.text), quizlet_terms_classes)), list(map(lambda definition: remove_whitespace(definition.text), quizlet_definitions_classes))])))
# get unique rows
quizlet = numpy.unique(quizlet, axis = 0)
del quizlet_terms_classes, quizlet_definitions_classes
driver.execute_script(f"window.scrollBy(0, {chunk_size})")
del terms_in_this_set
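Scrolling in fixed chunks works because each pass captures whatever terms happen to be rendered in the viewport at that moment, and numpy.unique deduplicates the rows collected from overlapping scroll positions, so nothing is lost when Quizlet unloads off-screen terms.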

Detecting when an element is refreshed, even if the value doesn't change

(Selenium/webscraping noob warning.)
selenium 3.141.0
chromedriver 78
MacOS 10.14.6
I'm compiling a list of URLs across a range of dates for later download. The URLs are in a table that displays information for the date selected on a nearby calendar. When the user clicks a new date on the calendar, the table is updated asynchronously with a new list of URLs or – if no files exist for that date – with a message inside a <td class="dataTables_empty"> tag.
For each date in the desired range, my code clicks the calendar, using WebDriverWait with a custom expectation to track when the first href value in the table changes (indicating the table has finished updating), and scrapes the URLs for that day. If no files are available for a given date, the code looks for the dataTables_empty tag to go away to indicate the next date's URLs have loaded.
if current_first_uri != NO_ATT_DATA:
    element = WebDriverWait(browser, 10).until_not(
        text_to_be_present_in_href((
            By.XPATH, first_uri_in_att_xpath),
            current_first_uri))
else:
    element = WebDriverWait(browser, 10).until_not(
        EC.presence_of_element_located((
            By.CLASS_NAME, "dataTables_empty")))
This works great in all my use cases but one: if two or more consecutive days have no data, the code doesn't notice the table has refreshed, since the dataTables_empty class remains in the table (and the cell is identical in every other respect).
In the Chrome inspector, when I click from one date without data to another, the corresponding <td> flashes pink. That suggests the values are being updated, even though their values remain the same.
Questions:
Is there a mechanism in Selenium to detect that the value was refreshed, even if it hasn't changed?
If not, any creative ideas on how to determine the table has refreshed in the problem use case? I don't want to wait blindly for some arbitrary length of time.
UPDATE: The accepted answer answered the latter of the two questions, and I was able to replace my entire detection scheme using the MutationObserver.
You could use a MutationObserver:
driver.execute_script("""
new MutationObserver(() => {
window.lastRefresh = new Date()
}).observe(document.querySelector('table.my-table'), { attributes: true, childList: true, subtree: true } )
""")
And get the last time the table DOM changed with:
lastRefresh = driver.execute_script("return window.lastRefresh")
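From there, waiting for the next refresh can be a simple poll of that value. A minimal sketch (the helper name and the timeout are my own; getTime() is used so the comparison is over a plain number):

import time

def wait_for_table_refresh(driver, timeout=10, poll=0.2):
    # Numeric timestamp of the last mutation recorded by the observer above
    last_seen = driver.execute_script(
        "return window.lastRefresh && window.lastRefresh.getTime()")
    deadline = time.time() + timeout
    while time.time() < deadline:
        current = driver.execute_script(
            "return window.lastRefresh && window.lastRefresh.getTime()")
        if current != last_seen:
            return True  # the table DOM changed since we last looked
        time.sleep(poll)
    return False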
I use the method below to check whether an element has gone stale; usually I expect False. The same approach may help in your case, where you are expecting True.

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def isElementStale(driver, element):
    try:
        wait = WebDriverWait(driver, 2)
        element.is_enabled()  # raises if the stored reference has gone stale
        element = wait.until(EC.element_to_be_clickable(element))
        if element is not None:
            return False
    except Exception:
        pass
    return True
So you can pass an element to this method and check whether any change has occurred to it, like:

# element = get the element first
# make the changes that cause the refresh
if isElementStale(driver, element):
    print('Element refreshed')
else:
    print('Element not refreshed')

How to use Selenium to click through multiple elements while avoiding Stale Element Error

I'm working on making somewhat of a site map/tree (using anytree) and in order to do so, I need Selenium to find particular elements on a page (representing categories) and then systematically click through these elements, looking for new categories on each new page until we hit no more categories, ie. all leaves and the tree is populated.
I have much of this already written. My issue arises when trying to iterate through my elements list. I currently try to populate the tree depth-first, going down to the leaves and then popping back up to the original page to continue the same thing with the next element in the list. This, however, is resulting in a Stale element reference error because my page reloads. What is a workaround to this? Can I somehow open the new links in a new window so that the old page is preserved? The only fixes I have found for that exception are to neatly catch it, but that doesn't help me.
Here is my code so far (the issue lies in the for loop):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from anytree import Node, RenderTree

def findnodes(driver):
    driver.implicitly_wait(5)
    try:
        nodes = driver.find_elements_by_css_selector('h3.ng-binding')
    except:
        nodes = []
    return nodes

def populateTree(driver, par):
    url = driver.current_url
    pages = findnodes(driver)
    if len(pages) > 0:
        for page in pages:
            print(page.text)
            Node(page.text, parent=par)
            page.click()
            populateTree(driver, page.text)
            driver.get(url)

driver = webdriver.Chrome()
# Get starting page
main = 'http://www.example.com'
root = Node(main)
driver.get(main)
populateTree(driver, root)
for pre, fill, node in RenderTree(root):
    print("%s%s" % (pre, node.name))
I haven't worked in Python, but I have worked with Java/Selenium, so I can give you the idea for overcoming staleness.
Generally we get a stale element exception when the element's attributes (or something else about it) change after the web element was initialized. For example, if the user tries to click the same element on the same page after a page refresh, a stale element exception is thrown.
To overcome this, we can create a fresh web element whenever the page changes or refreshes. The code below can give you an idea (it's in Java, but the concept is the same).
Example:
WebElement element = driver.findElement(By.xpath("//*[@id='StackOverflow']"));
element.click();
// page is refreshed
element.click(); // This will obviously throw a stale element exception
To overcome this, we can store the XPath in a string and use it to create a fresh web element as we go.
String xpath = "//*[@id='StackOverflow']";
driver.findElement(By.xpath(xpath)).click();
// page has been refreshed. Now create a new element and work on it
driver.findElement(By.xpath(xpath)).click(); // This works
Hope this helps you.
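Since the question is in Python, here is roughly the same idea translated (a sketch; the XPath is the placeholder from the Java example):

xpath = "//*[@id='StackOverflow']"
driver.find_element_by_xpath(xpath).click()
# page has been refreshed; look the element up again instead of reusing the old reference
driver.find_element_by_xpath(xpath).click()  # this works, because it is a fresh element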
The xpath variable is not supposed to stay a star (*); it is an XPath matching the desired elements. The stale exception appears because we click something in the browser, which requires finding all the elements again after each click. So on each pass of the loop we find all the elements with driver.find_elements_by_xpath(xpath) and get a list back, but we only need one of them, so we take the element at the specific index idx, which ranges from 0 to the number of elements.
xpath = '*'
for idx in range(len(driver.find_elements_by_xpath(xpath))):
    element = driver.find_elements_by_xpath(xpath)[idx]
    element.click()
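Applied to the populateTree function from the question, the same re-find-by-index idea might look like this (a sketch reusing the question's imports; it also passes the Node rather than the bare text as the parent, which anytree expects):

def populateTree(driver, par):
    url = driver.current_url
    selector = 'h3.ng-binding'  # the category selector from the question
    for idx in range(len(driver.find_elements_by_css_selector(selector))):
        # Re-find the categories after navigating back; old references are stale
        pages = driver.find_elements_by_css_selector(selector)
        page = pages[idx]
        print(page.text)
        node = Node(page.text, parent=par)
        page.click()
        populateTree(driver, node)
        driver.get(url)  # return to the listing for the next category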

Selenium Visible, Non Visible Elements (Drop Down)

I am trying to select all elements of a dropdown.
The site I am testing on is: http://jenner.com/people
The dropdown (checkbox list) I am trying to access is the "locations" list.
I am using Python. I am getting the following error: Message: u'Element is not currently visible and so may not be interacted with'
The code I am using is:
from selenium import webdriver

url = "http://jenner.com/people"
driver = webdriver.Firefox()
driver.get(url)
page = driver.page_source
element = driver.find_element_by_xpath("//div[@class='filter offices']")
elements = element.find_elements_by_tag_name("input")
counter = 0
while counter <= len(elements) - 1:
    driver.get(url)
    element = driver.find_element_by_xpath("//div[@class='filter offices']")
    elements1 = element.find_elements_by_tag_name("input")
    elements1[counter].click()
    counter = counter + 1
I have tried a few variations, including clicking the initial element before clicking on the dropdown options, but that didn't work. Any ideas on how to make elements visible in Selenium? I have spent the last few hours searching for an answer online. I have seen a few posts regarding moving the mouse in Selenium, but haven't found a solution that works for me yet.
Thanks a lot.
The input checkboxes are not visible in the initial state; they become visible after clicking the "filter offices" option. Also, the class name changes from "filter offices" to "filter offices open", as you can observe in Firebug. The code below works for me, but it is in Java; you can figure out the Python, as it contains only really basic calls.
driver.get("http://jenner.com/people");
driver.findElement(By.xpath("//div[#class='filter offices']/div")).click();
Thread.sleep(2000L);
WebElement element = driver.findElement(By.xpath("//div[#class='filter offices open']"));
Thread.sleep(2000L);
List <WebElement> elements = element.findElements(By.tagName("input"));
for(int i=0;i<=elements.size()-1;i++)
{
elements.get(i).click();
Thread.sleep(2000L);
elements = element.findElements(By.tagName("input"));
}
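A rough Python translation of the same approach (a sketch; the sleeps mirror the Java version and could be replaced with explicit waits):

import time

driver.get("http://jenner.com/people")
driver.find_element_by_xpath("//div[@class='filter offices']/div").click()
time.sleep(2)
element = driver.find_element_by_xpath("//div[@class='filter offices open']")
elements = element.find_elements_by_tag_name("input")
for i in range(len(elements)):
    elements[i].click()
    time.sleep(2)
    # re-read the inputs after each click, since the list may have been re-rendered
    elements = element.find_elements_by_tag_name("input")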
I know this is an old question, but I came across it when looking for other information. I don't know if you were doing QA on the site to see if the proper cities were showing in the drop down, or if you were actually interacting with the site to get the list of people who should be at each location. (Side note: selecting a location then un-selecting it returns 0 results if you don't reset the filter - possibly not desired behavior.)
If you were trying to get a list of users at each location on this site, I would think it easier to not use Selenium. Here is a pretty simple solution to pull the people from the first city "Chicago." Of course, you could make a list of the cities that you are supposed to look for and sub them into the "data" variable by looping through the list.
import requests
from bs4 import BeautifulSoup

url = 'http://jenner.com/people/search'
data = 'utf8=%E2%9C%93&authenticity_token=%2BayQ8%2FyDPAtNNlHRn15Fi9w9OgXS12eNe8RZ8saTLmU%3D&search_scope=full_name' \
       '&search%5Bfull_name%5D=&search%5Boffices%5D%5B%5D=Chicago'
r = requests.post(url, data=data)
soup = BeautifulSoup(r.content)
people_results = soup.find_all('div', attrs={'class': 'name'})
for p in people_results:
    print(p.text)
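To cover every location, the idea mentioned above is to loop a list of cities into the data payload. A sketch, assuming the city names match the site's office values (the list here is illustrative) and reusing the token captured above:

from urllib.parse import quote_plus

cities = ['Chicago', 'London', 'Los Angeles']  # illustrative; use the values from the locations list
base = ('utf8=%E2%9C%93&authenticity_token=%2BayQ8%2FyDPAtNNlHRn15Fi9w9OgXS12eNe8RZ8saTLmU%3D'
        '&search_scope=full_name&search%5Bfull_name%5D=&search%5Boffices%5D%5B%5D={}')
for city in cities:
    r = requests.post(url, data=base.format(quote_plus(city)))
    soup = BeautifulSoup(r.content)
    for p in soup.find_all('div', attrs={'class': 'name'}):
        print(city, p.text)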
