I know what it does, but I can't understand HOW it does it, if you know what I mean.
For example, the code below will pull out all links from the page, OR it will time out if it doesn't find any <a> tag on the page.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.get('https://selenium-python.readthedocs.io/waits.html')
links = WebDriverWait(driver, 30).until(EC.presence_of_all_elements_located((By.TAG_NAME, 'a')))
for link in links:
    print(link.get_attribute('href'))
driver.quit()
I'm wondering HOW Selenium knows for sure that presence_of_all_elements_located((By.TAG_NAME, 'a')) detected all <a> elements and the page won't dynamically load any more links?
BTW, pardon the following question, but can you also explain why we use double brackets here: EC.presence_of_all_elements_located((By.TAG_NAME, 'a'))? Is that because the presence_of_all_elements_located method accepts a tuple as its parameter?
Selenium doesn't know that the page won't dynamically load more links. When you use presence_of_all_elements_located (which is a class, not a method!), then as long as there is at least 1 matching element on the page, it will return a list of all such elements.
When you write EC.presence_of_all_elements_located((By.TAG_NAME, 'a')) you are instantiating this class with a single argument, which is a tuple, as you say. This tuple is called a "locator".
"How this works" is kind of complicated and the only way to really understand is to read the source code. Selenium sees the root html as a WebElement and all children elements also as WebElements. These classes are created and discarded dynamically. They are only kept around if assigned to something. When you check for the presence of all elements matching your locator, it will traverse the HTML tree by jumping from parent to children and back up to parent siblings. Waiting for the presence of something just does this on a loop until it gets a positive match (then it completes the tree traversal and returns a list) or until the wait times out.
<li>
<b>word</b>
<i>type</i>
<b>1.</b>
"translation 1"
<b>2.</b>
"translation 2"
</li>
I'm doing webscraping from an online dictionary, and the main dictionary part has roughly the above structure.
How exactly do I get all those children? With the usual Selenium approach I see online, that is list_elem.find_elements(By.XPATH, ".//*"), I only get the "proper" element children, but not the textual ones (sorry if my word choice is off). Meaning I would like to have len(children) == 6 instead of len(children) == 4.
I would like to get all children for further analysis
If you want to get the text of all child nodes (including the text nodes) of the li node, you can try this code:
from selenium import webdriver
from selenium.webdriver.remote.webelement import WebElement
driver = webdriver.Chrome()
driver.get(<URL>)
li = driver.find_element('xpath', '//li')
nodes = driver.execute_script("return arguments[0].childNodes", li)
text_nodes = []
for node in nodes:
    if not isinstance(node, WebElement):  # Extract text from direct child text nodes
        _text = node['textContent'].strip()
        if _text:  # Ignore all the empty text nodes
            text_nodes.append(_text)
    else:  # Extract text from WebElements like <b>, <i>...
        text_nodes.append(node.text)
print(text_nodes)
Output:
['word', 'type', '1.', '"translation 1"', '2.', '"translation 2"']
I'm not a Selenium expert but I've read StackOverflow answers where apparently knowledgeable people have asserted that Selenium's XPath queries must return elements (so text nodes are not supported as a query result type), and I'm pretty sure that's correct.
So a query like //* (return every element in the document) will work fine in Selenium, but //text() (return every text node in the document) won't, because although it's a valid XPath query, it returns text nodes rather than elements.
I suggest you consider using a different XPath API to execute your XPath queries, e.g. lxml, which doesn't have that limitation.
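For example, here is a minimal sketch with lxml, feeding it the page source you already have from Selenium (the <li> markup is assumed to be the one shown in the question):
from lxml import html

# lxml's XPath can return text nodes directly, unlike Selenium's find_element*.
tree = html.fromstring(driver.page_source)
li = tree.xpath('//li')[0]
texts = [t.strip() for t in li.xpath('.//text()') if t.strip()]
print(texts)  # e.g. ['word', 'type', '1.', '"translation 1"', '2.', '"translation 2"']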
Elements (*), comments (comment()), text nodes (text()), and processing instructions (processing-instruction()) are all nodes.
To select all nodes:
.//node()
To ensure that it's only selecting * and text() you can add a predicate filter:
.//node()[self::* or self::text()]
However, the Selenium methods are find_element() and find_elements(), and they expect to locate elements, not text() nodes. It seems that there isn't a more generic method to find nodes, so you may need to write some code to achieve what you want, such as in JaSON's answer.
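One way to write that code (just a sketch, not the only option) is to run the node() XPath in the browser with execute_script, because JavaScript's document.evaluate has no element-only restriction. Raw text nodes can't be returned across the WebDriver boundary, so the script returns each node's text content instead; the expression is also restricted to direct child nodes (./node() rather than .//node()) so that it yields the six children the original question asks about:
js = """
var result = [];
var snapshot = document.evaluate(
    './node()[self::* or self::text()]', arguments[0], null,
    XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
for (var i = 0; i < snapshot.snapshotLength; i++) {
    result.push(snapshot.snapshotItem(i).textContent.trim());
}
return result.filter(function (t) { return t.length > 0; });
"""
li = driver.find_element('xpath', '//li')
print(driver.execute_script(js, li))  # expected: the six text values for the <li> in the question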
I have a test which is sending thousands of api requests in order to complete the test. This is causing the test to take a very long time to complete, over 15 minutes.
What I am doing over and over again is finding an element, then finding two elements within that element, and then reading the text of those elements.
This causes Selenium to send an API request to find the element, then another to find the elements within that element, and then one final one to get the text of each element. I could skip almost all of these API requests if I just did one request to get the DOM tree of the first element and then parsed the HTML for the text of the elements myself. Is there a library for this purpose?
Something along the lines of this?
import ElementParser

elems = map(ElementParser, driver.find_elements(By.CSS_SELECTOR, 'div.row'))
for elem in elems:
    name = elem.find_element(By.CSS_SELECTOR, 'div.content').text
    description = elem.find_element(By.CSS_SELECTOR, 'div.description').text
It would be great if this library took a WebElement as an argument and mimicked Selenium syntax as shown above. If nothing like this exists then I could create my own library, but I'd rather use something that already exists.
I'm new to Stack Overflow, so hopefully random advice questions like this are appropriate.
BeautifulSoup was the exact sort of library I was looking for. I was easily able to override one of my functions to use BeautifulSoup instead of selenium. This caused nearly a 100x speed increase since it would no longer need to wait for API responses.
Original function:
def find_value(self, cell):
    text = self.find_element(*self.get_row_selector(cell.cell_number)).get_attribute('innerHTML')
    if text:
        return text
New Function:
def find_value(self, cell):
    text = BeautifulSoup(self.get_attribute('innerHTML'), 'html.parser').select(self.get_row_selector(cell.cell_number)[1])[0].decode_contents()
    if text:
        return text
(Code not copied verbatim, to keep it readable. Also, type(self) == WebElement.)
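More generally, the pattern from the original question can be rewritten so that only one WebDriver call is made per row, and everything else happens locally in BeautifulSoup. A rough sketch, reusing the div.row / div.content / div.description selectors assumed in the question:
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By

# One get_attribute() round-trip per row; all further lookups are local.
for row in driver.find_elements(By.CSS_SELECTOR, 'div.row'):
    soup = BeautifulSoup(row.get_attribute('innerHTML'), 'html.parser')
    name = soup.select_one('div.content').get_text(strip=True)
    description = soup.select_one('div.description').get_text(strip=True)
    print(name, description)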
I am scraping this webpage and while trying to extract text from one element, I am hitting a dead end.
So the element in question is shown below in the image -
The text in this element is within the <p> tags inside the <div>. I tried extracting the text in the scrapy shell using the following code - response.css("div.home-hero-blurb no-select::text").getall(). I received an empty list as the result.
Alternatively, if I try going a bit further and reference the <p> tags individually, I can get the text. Why does this happen? Isn't the <div> a parent element and shouldn't my code extract the text?
Note - I wanted to use the div because I thought that'll help me get both the <p> tags in one query.
I can see two issues here.
The first is that if you separate the class names with a space, the CSS selector will understand you are looking for a descendant element with that name. So the correct approach is "div.home-hero-blurb.no-select::text" instead of "div.home-hero-blurb no-select::text".
The second issue is that the text you want is inside a p element that is a child of that div. If you only select the div, the selector will return the text directly inside the div, but not the text in its children. Since there is also a strong element as a child of the p, I would suggest using a more general approach like:
response.css("div.home-hero-blurb.no-select *::text").getall()
This should return all text from the div and its descendants.
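A quick way to see the difference is to run both selectors against a stand-alone Selector with some made-up markup that mirrors the structure described in the question:
from scrapy.selector import Selector

html = ('<div class="home-hero-blurb no-select">'
        '<p>First paragraph with <strong>bold</strong> text.</p>'
        '<p>Second paragraph.</p></div>')
sel = Selector(text=html)
print(sel.css("div.home-hero-blurb.no-select::text").getall())    # [] - the div has no direct text children
print(sel.css("div.home-hero-blurb.no-select *::text").getall())  # text from the <p> and <strong> descendants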
It's relevant to point out that extracting text with CSS selectors is a Scrapy extension of the standard selectors. Scrapy mentions this here.
Edit
If you were to use XPath, this would be the equivalent expression:
response.xpath('//div[@class="home-hero-blurb no-select"]//text()').getall()
I'm trying to access text from elements that have different xpaths but very predictable href schemes across multiple pages in a web database. Here are some examples:
<a href="/mathscinet/search/mscdoc.html?code=65J22,(35R30,47A52,65J20,65R30,90C30)">
65J22 (35R30 47A52 65J20 65R30 90C30) </a>
In this example I would want to extract "65J22 (35R30 47A52 65J20 65R30 90C30)"
<a href="/mathscinet/search/mscdoc.html?code=05C80,(05C15)">
05C80 (05C15) </a>
In this example I would want to extract "05C80 (05C15)". My web scraper would not be able to search by xpath directly due to the xpaths of my desired elements changing between pages, so I am looking for a more roundabout approach.
My main idea is to use the fact that every href contains "/mathscinet/search/mscdoc.html?code=". Selenium can't directly search for hrefs, but I was thinking of doing something similar to this C# implementation:
Driver.Instance.FindElement(By.XPath("//a[contains(@href, 'long')]"))
To port this over to python, the only analogous method I could think of would be to use the in operator, but I am not sure how the syntax will work when everything is nested in a find_element_by_xpath. How would I bring all of these ideas together to obtain my desired text?
driver.find_element_by_xpath("//a['/mathscinet/search/mscdoc.html?code=' in #href]").text
If I understand correctly, you want to locate all elements that have the same partial href. You can use this:
elements = driver.find_elements_by_xpath("//a[contains(@href, '/mathscinet/search/mscdoc.html')]")
for element in elements:
    print(element.text)
or if you want to locate one element:
driver.find_element_by_xpath("//a[contains(#href, '/mathscinet/search/mscdoc.html')]").text
This will give a list of all elements located.
As per the HTML you have shared, @AndreiSuvorkov's answer would possibly cater to your current requirement. Perhaps you can get much more granular and construct an optimized xpath by:
Using starts-with instead of using contains
Including the ?code= part of the @href attribute
Your effective code block will be:
all_elements = driver.find_elements_by_xpath("//a[starts-with(@href,'/mathscinet/search/mscdoc.html?code=')]")
for elem in all_elements:
    print(elem.get_attribute("innerHTML"))
In the case that I want the first use of a class, so I don't have to guess at find_elements_by_xpath(), what are my options? The goal is to write less code, so that any changes to the source I am scraping can be fixed easily. Is it possible to essentially do
find_elements_by_css_selector('source[1]')
This code does not work as is though.
I am using selenium with Python and will likely be using phantomJS as the webdriver (Firefox for testing).
In CSS selectors, square brackets select attributes, so your sample code is trying to select the 'source' type element with an attribute named 1, e.g.
<source 1="your_element" />
Whereas I gather you're trying to find the first in a list that looks like this:
<source>Blah</source>
<source>Rah</source>
If you just want the first matching element, you can use the singular form:
element = driver.find_element_by_css_selector("source")
The form you were using returns a list, so you can also take element [n-1] to get the nth instance on the page (lists index from 0):
element = driver.find_elements_by_css_selector("source")[0]
Finally, if you want your CSS selectors to be completely explicit in which element they're finding, you can use the nth-of-type selector:
element = find_element_by_css_selector("source:nth-of-type(1)")
You might find some other helpful information at this blog post from Sauce Labs to help you write flexible selectors to replace your XPath.