I'm creating two for loops (one nested in the other).
My code looks like this:
try:
    a = browser.find_elements_by_class_name("node")
    for links in a:
        links.click()
        for id in range(2, 41):
            my_id = "stree{}".format(id)
            browser.find_element_by_id(my_id).click()
        browser.find_element_by_xpath('/html/body/center[2]/form/table[1]/tbody/tr/td[3]/table/tbody/tr[5]/td[1]/a[1]/img').click()
        browser.find_element_by_xpath('/html/body/center[2]/form/table[2]/tbody/tr/td[4]/input').click()
        browser.find_element_by_xpath('/html/body/center/form/table[2]/tbody/tr/td[5]/a').click()
        sleep(5)
        # browser.execute_script("window.history.go(-1)")
except:
    a = browser.find_elements_by_class_name("node")
    for links in a:
        links.click()
        for id in range(2, 41):
            my_id = "stree{}".format(id)
            browser.find_element_by_id(my_id).click()
        browser.find_element_by_xpath('/html/body/center[2]/form/table[1]/tbody/tr/td[3]/table/tbody/tr[5]/td[1]/a[1]/img').click()
        browser.find_element_by_xpath('/html/body/center[2]/form/table[2]/tbody/tr/td[4]/input').click()
        browser.find_element_by_xpath('/html/body/center/form/table[2]/tbody/tr/td[5]/a').click()
        sleep(5)
        # browser.execute_script("window.history.go(-1)")
What the code is doing:
It goes through the two for loops and then moves to a new page where it clicks on something. Then I want the browser to go back so it can continue with the for loops. The problem is that the outer loop has to run before the inner one can, and I run into issues while going back.
Two important questions:
1. Do I need to tell my browser to go back?
2. How can I execute the outer code first and then the code within?
The page, the HTML for the outer loop, and the HTML for the inner loop (clicking on this takes me to the next page) are shown in attached screenshots (not reproduced here).
How do I improve my code? Just to clarify: I want to go through all the files.
Edit: Someone asked for more clarification. In the attached photo of the page you can see folder icons. I want to click on them, which opens up all the file icons. I then select those files by clicking on them, click the arrow on the page to put them into a box, and click "Accept my selection", which takes me to the next page, where I click on Excel, and that downloads my file. The for loop is my attempt to go through all the files in those folders. I know this is a long explanation, but the point remains about the for loop. The class name "node" refers to the folder icons and the ids in the inner for loop refer to the file icons.
At the end of the outer for loop, you could add a function that goes back to the starting page in order to click the next link,
OR
instead of clicking the links, you could collect them and then use the outer loop to visit them. In other words, collect all the link URLs from the starting page with find_elements and then make your browser connect to each one in the outer for loop. More specifically:
First, you create a browser instance (ABrowser could be Firefox() or anything else) and connect to the starting webpage as you already do:
browser = webdriver.ABrowser()     # e.g. webdriver.Firefox()
browser.get(StartingPageURL)
Then you collect all links with the desired characteristics:
a = browser.find_elements_by_class_name("node")
Now a is a list of element references rather than URLs, so pull each link's URL out of its href attribute up front (the references would go stale once you navigate away). Then, instead of clicking a link, doing the job and going back to the starting page, you can make your browser connect to each link URL, do the job, and then connect to the next link URL, all within a for loop:
links = [element.get_attribute("href") for element in a]   # pull the URLs out before navigating away
for link in links:
    browser.get(link)   # browser connects to the link URL
    for id in range(2, 41):
        my_id = "stree{}".format(id)
        browser.find_element_by_id(my_id).click()
    browser.find_element_by_xpath('/html/body/center[2]/form/table[1]/tbody/tr/td[3]/table/tbody/tr[5]/td[1]/a[1]/img').click()
    browser.find_element_by_xpath('/html/body/center[2]/form/table[2]/tbody/tr/td[4]/input').click()
    browser.find_element_by_xpath('/html/body/center/form/table[2]/tbody/tr/td[5]/a').click()
    sleep(5)
Usually I prefer the second option
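For completeness, a minimal sketch of the first option (go back after each folder) might look like the following. It reuses the locators from the question; StartingPageURL is a placeholder, the folder icons are re-collected after every back-navigation so their references do not go stale, and one back() is assumed to be enough to return to the starting page (the real site may need more):

from time import sleep
from selenium import webdriver

browser = webdriver.Firefox()          # or any other driver
browser.get(StartingPageURL)           # placeholder for the starting page URL

folders = browser.find_elements_by_class_name("node")
for i in range(len(folders)):
    # re-collect the folder icons each time round so the reference is fresh
    folders = browser.find_elements_by_class_name("node")
    folders[i].click()
    for id in range(2, 41):
        browser.find_element_by_id("stree{}".format(id)).click()
    browser.find_element_by_xpath('/html/body/center[2]/form/table[1]/tbody/tr/td[3]/table/tbody/tr[5]/td[1]/a[1]/img').click()
    browser.find_element_by_xpath('/html/body/center[2]/form/table[2]/tbody/tr/td[4]/input').click()
    browser.find_element_by_xpath('/html/body/center/form/table[2]/tbody/tr/td[5]/a').click()
    sleep(5)
    # go back to the starting page before the next folder
    browser.back()
    sleep(2)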
Related
I navigate to a page and then find the list of episodes. I get each episode and click on the link for the episode. But when I go back to the page that has the list of episodes the following error happens:
stale element reference: element is not attached to the page document
My code is:
navigator.get('https://anchor.fm/dashboard/episodes')
time.sleep(5)

# get list
list_episodes = navigator.find_element_by_xpath('//*[@id="app-content"]/div/div/div/div[2]/ul')
# get episodes in list
items = list_episodes.find_elements_by_tag_name('li')

for item in items:
    item.find_element_by_tag_name('button').click()
    time.sleep(10)
    navigator.find_element_by_xpath('//*[@id="app-content"]/div/div/div/div[2]/div/div[2]/div/div/div/button').click()
    time.sleep(2)
    navigator.find_element_by_xpath('//*[@id="app-content"]/div/div/div/div[2]/div/div[2]/div/div/div/div/div/div[1]/button[6]').click()
    time.sleep(2)
    navigator.find_element_by_xpath('//*[@id="app-content"]/div/div/div/div[3]/div[1]/div/div/div/button').click()
    time.sleep(2)
    navigator.find_element_by_xpath('//*[@id="app-content"]/div/div/div/div[3]/div[1]/div/div/div/div/div/div/button[1]').click()
    time.sleep(2)
    navigator.find_element_by_xpath('//*[@id="app-content"]/div/div/div/div[3]/div[3]/div/div/a').click()
    time.sleep(3)
    navigator.get('https://anchor.fm/dashboard/episodes')
    time.sleep(5)
By navigating to another page, all the web elements Selenium has collected (they are actually references to physical web elements) become invalid, because the web page is re-built when you open it again.
To make your code work, you need to collect the items list again each time.
This should work:
navigator.get('https://anchor.fm/dashboard/episodes')
time.sleep(5)

# get list
list_episodes = navigator.find_element_by_xpath('//*[@id="app-content"]/div/div/div/div[2]/ul')
# get episodes in list
items = list_episodes.find_elements_by_tag_name('li')

for i in range(len(items)):
    item = items[i]
    item.find_element_by_tag_name('button').click()
    time.sleep(10)
    navigator.find_element_by_xpath('//*[@id="app-content"]/div/div/div/div[2]/div/div[2]/div/div/div/button').click()
    time.sleep(2)
    navigator.find_element_by_xpath('//*[@id="app-content"]/div/div/div/div[2]/div/div[2]/div/div/div/div/div/div[1]/button[6]').click()
    time.sleep(2)
    navigator.find_element_by_xpath('//*[@id="app-content"]/div/div/div/div[3]/div[1]/div/div/div/button').click()
    time.sleep(2)
    navigator.find_element_by_xpath('//*[@id="app-content"]/div/div/div/div[3]/div[1]/div/div/div/div/div/div/button[1]').click()
    time.sleep(2)
    navigator.find_element_by_xpath('//*[@id="app-content"]/div/div/div/div[3]/div[3]/div/div/a').click()
    time.sleep(3)
    navigator.get('https://anchor.fm/dashboard/episodes')
    time.sleep(5)
    # the old references are stale now, so re-locate the list and the `items` in it
    list_episodes = navigator.find_element_by_xpath('//*[@id="app-content"]/div/div/div/div[2]/ul')
    items = list_episodes.find_elements_by_tag_name('li')
In this case, it's helpful to think of the pages as instances of a class. They may have the same name, the same properties, the same values but they're still separate objects and you can't call object A if you have a reference to object B.
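To make the analogy concrete, here is a tiny, purely illustrative Python snippet (the Page class is invented for illustration, nothing Selenium-specific): two objects built from the same class with identical attributes are still different objects.

class Page:
    def __init__(self, title):
        self.title = title

first_visit = Page("episodes")     # the page as it was first rendered
second_visit = Page("episodes")    # the page rebuilt after navigating back

print(first_visit.title == second_visit.title)   # True  - same content
print(first_visit is second_visit)               # False - two different objects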
Here's what's happening to you in this case, step by step; the server-side steps are the interesting parts.
1. You navigate to a directory page.
2. The server builds an instance of the page & displays it to you.
3. You get a hold of episode objects on the page.
4. You navigate to one of the episodes.
5. The server destroys the directory page. Any objects in it you were holding disappear with it.
6. The server builds a copy of the episode page & displays it to you.
7. You navigate back to the directory page.
8. The server builds a new instance of the page & displays it to you.
9. You try to click an element from the old instance of the page, which no longer exists.
10. You get a stale reference exception because, well - your reference is now stale.
The way to fix this is to find the episode elements each time you navigate to the directory page. If you find them once and store them, they'll go bad as soon as you navigate elsewhere and their parent page poofs.
Also, a note about your Xpaths: I'd encourage you to stop using your browser's 'Copy Xpath' function, it doesn't often get good results. There are plenty of tutorials on how to write good Xpaths online that are worth reading.
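For example, compare a copied absolute XPath with a hand-written relative one (the attribute values below are invented purely for illustration; inspect the real page for stable hooks such as data-* or aria-* attributes):

# Brittle: an absolute XPath copied from the browser's dev tools
navigator.find_element_by_xpath('//*[@id="app-content"]/div/div/div/div[2]/ul/li[3]/button')

# More robust: a relative XPath anchored on meaningful attributes
# (these attribute names are hypothetical, not taken from anchor.fm)
navigator.find_element_by_xpath('//ul[@data-testid="episode-list"]//li[3]//button[@aria-label="Episode options"]')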
I am trying to scrape the terms and definitions from this website, using the Selenium Chrome driver in Python: https://quizlet.com/433328443/ap-us-history-flash-cards/. There are 533 terms; so many, in fact, that Quizlet makes you click a "See more" button if you want to see all of them. The following code successfully extracts terms and definitions (I have tested it on other Quizlet sets with fewer terms). There are also if() statements to deal with popups and the "See more" button. Again, my goal is to get the terms and definitions for every single term-definition pair on the page; however, to do this, the entire page needs to be loaded in, which is the basis of my problem.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome(executable_path = chrome_driver_path)
driver.get("https://quizlet.com/433328443/ap-us-history-flash-cards/")

# INCASE OF POPUP, CLICK AWAY
if len(driver.find_elements_by_xpath("//button[@class='UILink UILink--revert']")) > 0:
    popup = driver.find_element_by_xpath("//button[@class='UILink UILink--revert']")
    popup.click()
    del popup

# SCROLL TO BOTTOM TO LOAD IN ALL TERMS, AND THEN BACK TO THE TOP
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

# INCASE OF "SEE MORE" BUTTON AT BOTTOM, CLICK IT
if len(driver.find_elements_by_xpath("//button[@class='UIButton UIButton--fill' and @aria-label='See more']")) > 0:
    see_more = driver.find_element_by_xpath("//button[@class='UIButton UIButton--fill' and @aria-label='See more']")
    see_more.click()
    del see_more
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

# list of terms
quizlet_terms = tuple(map(lambda a: a.text,
                          driver.find_elements_by_class_name("SetPageTerm-wordText")))
# list of definitions
quizlet_definitions = tuple(map(lambda a: a.text,
                                driver.find_elements_by_class_name("SetPageTerm-definitionText")))
In my code, I have tried the scrolling down trick to load in everything, but this does not work. This is because as I scroll down, while terms in my browser window are loaded, terms above and below my browser window get unloaded. Obviously, this is done for memory reasons, but I do not care about memory and I just want for all the terms to be loaded at once so I can access their contents. My code works on smaller quizlet sites (with say 100 terms), but it breaks on this site, generating the following error:
selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: element is not attached to the page document
This stackoverflow page explains the error message: Python with Selenium "element is not attached to the page document".
From reading the aforementioned page, I have come to the conclusion that, because the website is so large, the terms I am currently looking at in my browser window are loaded as I scroll down the Quizlet page, but terms that I have scrolled past and that are no longer in view are unloaded and stored in some way that I cannot properly access, generating the error message.
How would one go about in keeping the entirety of the page loaded-in so I can access the contents of all 533 terms? Ideally, I would like a solution that keeps everything I have scrolled past fully-loaded in, and does not unload anything. Another idea is that the whole page is loaded in from the get-go. It would also be nice if there is some memory-saving solution to this, perhaps by simply accessing just the raw html code and no fancy graphics or anything. Has anyone ever encountered this problem, and if so, how did you solve it? Thank you, any help is appreciated.
Much thanks to @Abhishek Dhoundiyal's comment. My working code:
driver.execute_script("window.scrollTo(800, 800);")
terms_in_this_set = int(sub("\D", "", (driver.find_element_by_xpath("//h4[#class='UIHeading UIHeading--assembly UIHeading--four']")).text))
chunk_size = 15000
quizlet = numpy.empty(shape = (0, 2), dtype = "str")
# done in while loop so that terms and definitions can be extracted while scrolling (while making sure there are no duplicate entries)
while len(quizlet) != terms_in_this_set:
# INCASE OF "SEE MORE" BUTTON, CLICK IT TO SEE MORE
if len(driver.find_elements_by_xpath("//button[#class='UIButton UIButton--fill' and #aria-label='See more']")) > 0:
see_more = driver.find_element_by_xpath("//button[#class='UIButton UIButton--fill' and #aria-label='See more']")
see_more.click()
del see_more
# CHECK IF THERE ARE TERMS
quizlet_terms_classes = driver.find_elements_by_class_name("SetPageTerm-wordText")
quizlet_definitions_classes = driver.find_elements_by_class_name("SetPageTerm-definitionText")
if (len(quizlet_terms_classes) > 0) and (len(quizlet_definitions_classes) > 0):
# append current iteration terms and definitions to full quizlet terms and definitions
quizlet = numpy.vstack((quizlet, numpy.transpose([list(map(lambda term: remove_whitespace(term.text), quizlet_terms_classes)), list(map(lambda definition: remove_whitespace(definition.text), quizlet_definitions_classes))])))
# get unique rows
quizlet = numpy.unique(quizlet, axis = 0)
del quizlet_terms_classes, quizlet_definitions_classes
driver.execute_script(f"window.scrollBy(0, {chunk_size})")
del terms_in_this_set
I am scraping a small site wherein I loop to send_keys to a textbox then click on the search button, the page loads some results, I check for presence_of_element and finally I get text of those results.
But the issue is that when the site opens it already has a few results present on the page, so when the search loop starts and the search button is clicked, the page takes a few seconds to load the new results, but the script continues: Selenium sees the presence of the initial results and captures them again, and the loop continues with the same result. I tried adding time.sleep but still ran into issues. Below are the workflow and the code:
1. URL opens.
2. Result 0 is already present on the page.
3. Change the dropdown.
4. Loop starts >> text sent to the search box >> search button clicked.
5. Result 0 is still present on the site.
6. Page is loading >> but Selenium sees the presence of Result 0 and gets its text.
7. Loop continues: the next key is sent and the search button is clicked.
8. Page is still loading with Result 1 >> Selenium again checks presence, and this continues.
self.driver.get(self.url)
self.waitForPresenceOfElement(locator=self.radius_drop_down, locatorType='id')
self.dropByType(data='100', locator=self.radius_drop_down, locatorType='id', type='value')
time.sleep(6)

for state in self.states:
    self.sendKeysWhenReady(data=state, locator=self.search_box, locatorType='id')
    time.sleep(1)
    self.elementClick(locator=self.search_button, locatorType='xpath')
    time.sleep(3)
    if self.getElementList(self.storesXpath, locatorType='xpath'):  # to ignore empty states
        stores = self.waitForPresenceOfAllElements(locator=self.storesXpath, locatorType='xpath')
        for store in stores:
            self.full_list.append(self.getText(element=store).lower())
The way you fix this is to:
1. Start your loop.
2. Find an existing search result on the page and get a reference to it:
   result = driver.find_element(...)
3. Send the search terms and click Search.
4. Wait for the result reference to go stale; that tells you that the page is reloading:
   wait = WebDriverWait(driver, 10)
   wait.until(EC.staleness_of(result))
5. Wait for the results to be visible and continue looping.
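A rough sketch of those steps folded into the loop from the question might look like the following; the wrapper methods and locators (self.storesXpath, self.search_box, etc.) are the question's own, the explicit waits are plain Selenium, and empty states would still need the extra check from the original code:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(self.driver, 10)
for state in self.states:
    # grab a reference to one of the results currently on the page
    old_result = self.driver.find_element_by_xpath(self.storesXpath)
    self.sendKeysWhenReady(data=state, locator=self.search_box, locatorType='id')
    self.elementClick(locator=self.search_button, locatorType='xpath')
    # the old result going stale means the new results are being rendered
    wait.until(EC.staleness_of(old_result))
    # now it is safe to collect the fresh results
    stores = wait.until(EC.presence_of_all_elements_located((By.XPATH, self.storesXpath)))
    for store in stores:
        self.full_list.append(store.text.lower())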
I've got the following use case.
I want to Loop through different games on this website:
https://sports.bwin.de/en/sports/football-4/betting/germany-17
Each game has got a detailed page to be found by this element:
grid-event-wrapper
By looping over these elements, I would have to click on each one of them, scrape the data from the detailed page and go back.
Something like this:
events = driver.find_elements_by_class_name('grid-event-wrapper')
for event in events:
    event.click()
    time.sleep(5)
    # =============================================================================
    # Logic for scraping detailed information
    # =============================================================================
    driver.back()
    time.sleep(5)
The first iteration works fine, but on the second one it throws the following exception:
StaleElementReferenceException: stale element reference: element is not attached to the page document
(Session info: chrome=90.0.4430.93)
I tried different things like re-initializing my events, but nothing worked.
I am sure that there is a way to hold on to the state even though I have to go back in the browser.
Thanks for your help in advance
Instead of the for event in events: loop, try the following:
size = len(driver.find_elements_by_class_name('grid-event-wrapper'))
for i in range(1, size + 1):
    xpath = "(//div[@class='grid-event-wrapper'])[{}]".format(i)
    driver.find_element_by_xpath(xpath).click()
    # ... do whatever you need on the detailed page ...
    driver.back()
Inside the loop you do whatever you want on the detailed page and finally go back, as sketched in the comment above. Because the element is located freshly by index on every iteration, there is no stale reference to worry about.
Clicking on the element reloads the page, thereby losing the old references.
There are two things you can do.
One is to keep a global set in which you store the "ID" of each game (you can use the game's URL, e.g. https://sports.bwin.de/en/sports/events/fsv-mainz-05-hertha-bsc-11502399, as the ID, or any other distinguishing characteristic).
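A rough sketch of this first approach is below; it assumes the href of the first a tag inside each wrapper can serve as the game ID (an assumption about the markup), and it re-collects the elements on every pass so they are never stale:

import time

# driver is assumed to already be on the overview page from the question
processed = set()          # IDs (here: hrefs) of games that were already scraped
while True:
    events = driver.find_elements_by_class_name('grid-event-wrapper')
    todo = None
    for event in events:
        href = event.find_element_by_tag_name('a').get_attribute('href')
        if href not in processed:
            todo = event
            processed.add(href)
            break
    if todo is None:
        break              # every game has been handled
    todo.click()
    time.sleep(5)
    # ... scraping logic for the detailed page ...
    driver.back()
    time.sleep(5)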
Alternatively, you can first extract all the links. These are the first children of your grid-event-wrapper elements, so you can do event.find_element_by_tag_name('a') and read the href attribute of each. Once all links are extracted, you can load them one by one:
events = driver.find_elements_by_class_name('grid-event-wrapper')
links = []
for event in events:
    link = event.find_element_by_tag_name('a').get_attribute('href')
    links.append(link)

for link in links:
    driver.get(link)   # load the link
    # Extraction logic
I feel the second way is a bit cleaner.
I want to click on the Next button at https://free-proxy-list.net/. The XPath selector is //*[@id="proxylisttable_next"]/a
I do this with the following piece of code:
element = WebDriverWait(driver, 2, poll_frequency = 0.1).until(
    EC.visibility_of_element_located((By.XPATH, '//*[@id="proxylisttable_next"]/a')))
if (element.is_enabled() == True) and (element.is_displayed() == True):
    element.click()
    print("next button located and clicked")  # printed in case of success
Subsequently, I get all the IPs from the table like this:
IPs = WebDriverWait(driver, 2, poll_frequency = 0.1).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, ':nth-child(n) > td:nth-child(1)')))
Although the CSS selector is the same for all pages, and although "next button located and clicked" is printed, the IPs output is the same for both pages (i.e. it seems as if the Next button was never clicked). Additionally, no exception is thrown.
Therefore, there must be something fundamentally wrong with my approach.
How to click on visible & enabled buttons correctly in phantomJS using python/selenium?
For your understanding, the HTML of the page section I am referring to was attached as a screenshot (not reproduced here).
As far as I see there could be two possible causes:
1. The click was not registered, though this is highly unlikely. You could look at other ways to click, such as a JavaScript click via execute_script.
2. (Most likely) The elements are queried right after the click is performed and before the page 2 results are loaded. Since the page 1 elements are still visible, the wait exits immediately with the list of elements from page 1. An ideal way of doing this would be (in pseudocode, as I am not familiar with Python):
a. Get the current page number
b. Get all the IPs from the current page
c. Click Next
d. Check if the (current page + 1) has become active (the class 'active' is added to the page-2 button)
e. Get all the elements from the current page
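A minimal Python sketch of that pseudocode is given below. The table and pagination selectors (the proxylisttable id, the 'active'/'disabled' classes on the paging buttons) are assumptions about the site's DataTables markup and should be verified against the actual HTML:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.PhantomJS()          # or Firefox()/Chrome()
driver.get('https://free-proxy-list.net/')
wait = WebDriverWait(driver, 10, poll_frequency=0.1)

all_ips = []
page = 1
while True:
    # b. get all the IPs from the current page
    cells = wait.until(EC.presence_of_all_elements_located(
        (By.CSS_SELECTOR, '#proxylisttable tbody tr > td:nth-child(1)')))
    all_ips.extend(cell.text for cell in cells)
    # c. click Next, unless it is disabled (last page reached)
    next_button = driver.find_element_by_id('proxylisttable_next')
    if 'disabled' in next_button.get_attribute('class'):
        break
    next_button.find_element_by_tag_name('a').click()
    # d. wait until the (current page + 1) button becomes the active one
    page += 1
    wait.until(EC.text_to_be_present_in_element(
        (By.CSS_SELECTOR, '#proxylisttable_paginate li.active'), str(page)))
    # e. loop around and collect the elements from the now-current page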
I am the OP; for anyone coming across a similar problem: the Next element was no longer attached to the DOM after being located, which caused a StaleElementReferenceException when printing element.is_enabled() or when clicking it. A detailed solution can be found here.