XPath to the children as well as "text children" - python

<li>
<b>word</b>
<i>type</i>
<b>1.</b>
"translation 1"
<b>2.</b>
"translation 2"
</li>
I'm doing webscraping from an online dictionary, and the main dictionary part has roughly the above structure.
How exactly do I get all those children? With the usual Selenium approach I see online, that is list_elem.find_elements(By.XPATH, ".//*"), I only get the "proper" element children, but not the textual ones (sorry if my word choice is off). In other words, I would like len(children) == 6 instead of len(children) == 4.
I would like to get all children for further analysis

If you want to get all child nodes (including descendant text nodes) of the li node, you can try this code:
from selenium import webdriver
from selenium.webdriver.remote.webelement import WebElement

driver = webdriver.Chrome()
driver.get(<URL>)
li = driver.find_element('xpath', '//li')
nodes = driver.execute_script("return arguments[0].childNodes", li)

text_nodes = []
for node in nodes:
    if not isinstance(node, WebElement):  # Text nodes come back as plain dicts, not WebElements
        _text = node['textContent'].strip()
        if _text:  # Ignore all the empty text nodes
            text_nodes.append(_text)
    else:  # Extract text from WebElements like <b>, <i>...
        text_nodes.append(node.text)
print(text_nodes)
Output:
['word', 'type', '1.', '"translation 1"', '2.', '"translation 2"']

I'm not a Selenium expert, but I've read Stack Overflow answers where apparently knowledgeable people assert that Selenium's XPath queries must return elements (so text nodes are not supported as a query result type), and I'm fairly sure that's correct.
So a query like //* (return every element in the document) will work fine in Selenium, but //text() (return every text node in the document) won't, because although it's a valid XPath query, it returns text nodes rather than elements.
I suggest you consider using a different XPath API to execute your XPath queries, e.g. lxml, which doesn't have that limitation.
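For instance, here's a minimal sketch with lxml, using the <li> fragment from the question; note that a text() query is a perfectly valid result type here:

```python
from lxml import etree

# The <li> fragment from the question
fragment = '''<li>
<b>word</b>
<i>type</i>
<b>1.</b>
"translation 1"
<b>2.</b>
"translation 2"
</li>'''

li = etree.fromstring(fragment)
# Unlike Selenium's find_elements, lxml's xpath() can return text nodes
texts = [t.strip() for t in li.xpath('.//text()') if t.strip()]
print(texts)
# ['word', 'type', '1.', '"translation 1"', '2.', '"translation 2"']
```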

Elements *, comment(), text(), and processing-instruction() are all nodes.
To select all nodes:
.//node()
To ensure that it's only selecting * and text() you can add a predicate filter:
.//node()[self::* or self::text()]
However, the Selenium methods are find_element() and find_elements(), and they expect to locate elements, not text() nodes. There doesn't seem to be a more generic method for finding nodes, so you may need to write some code to achieve what you want, such as the code in JaSON's answer.
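To illustrate (with lxml standing in for Selenium here, since Selenium can't return non-element nodes), node() with that predicate picks up both the elements and the text nodes:

```python
from lxml import etree

li = etree.fromstring(
    '<li><b>word</b><i>type</i><b>1.</b>"translation 1"'
    '<b>2.</b>"translation 2"</li>'
)
# node() matches elements, text nodes, comments and processing
# instructions; the predicate keeps only elements (*) and text()
all_desc = li.xpath('.//node()[self::* or self::text()]')
children = li.xpath('node()[self::* or self::text()]')
print(len(all_desc), len(children))  # 10 descendant nodes, 6 direct children
```

Note that .//node() also descends into <b> and <i>, so it returns the nested text nodes too; the child-axis form node() gives exactly the six direct children the question asks for.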

Related

Selenium: How does WebDriverWait (presence_of_all_elements_located) actually work?

I know what it does, but I can't understand HOW it does it, if you know what I mean.
For example, the code below will pull all links from the page, OR it will time out if it doesn't find any <a> tag on the page.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://selenium-python.readthedocs.io/waits.html')
links = WebDriverWait(driver, 30).until(
    EC.presence_of_all_elements_located((By.TAG_NAME, 'a')))
for link in links:
    print(link.get_attribute('href'))
driver.quit()
I'm wondering HOW Selenium knows for sure that presence_of_all_elements_located((By.TAG_NAME, 'a')) detected all <a> elements and that the page won't dynamically load any more links?
BTW, pardon the following question, but can you also explain why we use double parentheses here: EC.presence_of_all_elements_located((By.TAG_NAME, 'a'))? Is that because presence_of_all_elements_located accepts a tuple as its parameter?
Selenium doesn't know the page won't dynamically load more links. When you use this presence_of_all_elements_located class (not a method!), then as long as there is at least one matching element on the page, it will return a list of all such elements.
When you write EC.presence_of_all_elements_located((By.TAG_NAME, 'a')) you are instantiating this class with a single argument which is a tuple as you say. This tuple is called a "locator".
"How this works" is kind of complicated and the only way to really understand is to read the source code. Selenium sees the root html as a WebElement and all children elements also as WebElements. These classes are created and discarded dynamically. They are only kept around if assigned to something. When you check for the presence of all elements matching your locator, it will traverse the HTML tree by jumping from parent to children and back up to parent siblings. Waiting for the presence of something just does this on a loop until it gets a positive match (then it completes the tree traversal and returns a list) or until the wait times out.
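As a rough sketch (not Selenium's actual source, just the shape of it), an expected condition is a callable class whose instance the wait invokes on a loop until it returns something truthy or the timeout expires; the SimpleWait class below is a hypothetical stand-in for WebDriverWait:

```python
import time

class presence_of_all_elements_located:
    """Sketch of an expected-condition class: __init__ stores the
    locator tuple, __call__ is what the wait polls."""
    def __init__(self, locator):
        self.locator = locator  # e.g. (By.TAG_NAME, 'a')

    def __call__(self, driver):
        # An empty list is falsy, so the wait keeps polling
        return driver.find_elements(*self.locator)

class SimpleWait:
    """Sketch of WebDriverWait's polling loop."""
    def __init__(self, driver, timeout, poll=0.5):
        self.driver, self.timeout, self.poll = driver, timeout, poll

    def until(self, condition):
        end = time.time() + self.timeout
        while True:
            value = condition(self.driver)
            if value:
                return value
            if time.time() > end:
                raise TimeoutError('condition never became true')
            time.sleep(self.poll)
```

The condition piggybacks on find_elements, which returns an empty (falsy) list when nothing matches yet, so the wait simply keeps calling the instance until the list is non-empty.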

Selenium Python - Store XPath in var and extract deeper-hierarchy XPath from var

I sadly couldn't find any resources online for my problem. I'm trying to store elements found by XPath in a list and then loop over that list to search within each element. But instead of searching within the given element, Selenium always seems to search the whole page again.
Anyone with good knowledge about this? I've seen that:
// Selects nodes in the document from the current node that match the selection no matter where they are
But I've also tried "/" and it didn't work either.
Instead of giving me the text for each div, it gives me the text from all divs.
My Code:
from selenium import webdriver

driver = webdriver.Chrome()
result_text = []
# I'm looking for all divs with a specific class and store them in a list
divs_found = driver.find_elements_by_xpath("//div[@class='a-fixed-right-grid-col a-col-left']")
# Here seems to be the problem: instead of searching within divs_found[1], it behaves like "driver" and searches the whole site
hrefs_matching_in_div = divs_found[1].find_elements_by_xpath("//a[contains(@href, '/gp/product/')]")
# Now I'm looking in the found href matches to store the text from each
for href in hrefs_matching_in_div:
    result_text.append(href.text)
print(result_text)
You need to add . so the XPath is evaluated relative to the element (.// searches that element's descendants rather than the whole document). Try now:
hrefs_matching_in_div = divs_found[1].find_elements_by_xpath(".//a[contains(@href, '/gp/product/')]")
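The difference is easy to demonstrate outside Selenium; here's a sketch with lxml (Selenium's find_elements treats a leading // the same way):

```python
from lxml import etree

doc = etree.fromstring(
    '<body><div id="first"><a>one</a></div>'
    '<div id="second"><a>two</a></div></body>'
)
second = doc.xpath('//div[@id="second"]')[0]

# A leading // is absolute: it searches from the document root,
# even though xpath() is called on the second div
print([a.text for a in second.xpath('//a')])   # ['one', 'two']

# A leading .// is relative to the element it's called on
print([a.text for a in second.xpath('.//a')])  # ['two']
```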

How to check if matched text in web element is partially enclosed in <a> tag?

I have the following:
This is my text string and this next <a href='https//somelink.org/'>part</a> is only partially enclosed in a tags.
In the above string I have to search for "next part", not only "part". Once I find "next part", I need to check whether there is any a tag present in the matched text (sometimes there is no a tag) - how can I do that?
In addition to my main question, I can't get my XPath to work to find "next part" in the elements.
I tried this:
//*[contains(text(),"next part")]
But it doesn't find anything probably because I have spaces in there - how do I overcome this?
Thank you in advance,
Let's assume this html:
<p>This is my text string and this next <a href='https//somelink.org/'>part</a> is only partially enclosed in a tags.</p>
We can select it with Selenium:
p = driver.find_element_by_xpath('//p[contains(.,"next part")]')
And we can determine if it's partly in an a tag with a regex (Tony the Pony notwithstanding):
import re

html = p.get_attribute('innerHTML')
partly_in_a = 'next part' in re.sub(r'</?a.*?>', '', html) and 'next part' not in html
There's no pure xpath 1.0 solution for this, and it's a mistake in general to depend on xpath for stuff like this.
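For reference, here's a self-contained version of that regex check, run on the string from the question:

```python
import re

html = ('This is my text string and this next '
        '<a href="https//somelink.org/">part</a> '
        'is only partially enclosed in a tags.')

# Strip the <a> and </a> tags, then test: if "next part" only shows up
# once the tags are removed, the phrase must straddle an <a> boundary
stripped = re.sub(r'</?a.*?>', '', html)
partly_in_a = 'next part' in stripped and 'next part' not in html
print(partly_in_a)  # True
```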
You'll need to use a nested XPath selector for this.
//*[contains(text(), 'next') and a[contains(text(), 'part')]]
This will query any element that contains the text 'next', then also check that the element contains a nested a element with the text 'part'.
To determine whether or not there actually IS a nested a tag, you will need to write a method for this that checks against two different XPaths. There is no easy way around this, other than to evaluate the elements and see what's there.
public bool DoesElementHaveNestedTag()
{
    // Check for the presence of the locator with the nested tag:
    // if FindElements returns more than 0 matches, the nested-tag locator exists
    return driver.FindElements(By.XPath("//*[contains(text(), 'next') and a[contains(text(), 'part')]]")).Count > 0;
}
You can change this method to fit your needs, but the idea is the same. There is no way to know if a WebElement has a nested tag or not, unless you try to find the WebElement using two XPaths -- one that checks for the tag, and one that does not.

Difference between Scrapy selectors "a::text" and "a ::text"

I've created a scraper to grab some product names from a webpage. It is working smoothly. I've used CSS selectors to do the job. However, the only thing I can't understand is the difference between the selectors a::text and a ::text (don't overlook the space between a and ::text in the latter). When I run my script, I get the same exact result no matter which selector I choose.
import requests
from scrapy import Selector

res = requests.get("https://www.kipling.com/uk-en/sale/type/all-sale/?limit=all#")
sel = Selector(res)
for item in sel.css(".product-list-product-wrapper"):
    title = item.css(".product-name a::text").extract_first().strip()
    title_ano = item.css(".product-name a ::text").extract_first().strip()
    print("Name: {}\nName_ano: {}\n".format(title, title_ano))
As you can see, both title and title_ano contain the same selector, bar the space in the latter. Nevertheless, the results are always the same.
My question: is there any substantial difference between the two, and when should I use the former and when the latter?
Interesting observation! I spent the past couple of hours investigating this and it turns out, there's a lot more to it than meets the eye.
If you're coming from CSS, you'd probably expect to write a::text in much the same way you'd write a::first-line, a::first-letter, a::before or a::after. No surprises there.
On the other hand, standard selector syntax would suggest that a ::text matches the ::text pseudo-element of a descendant of the a element, making it equivalent to a *::text. However, .product-list-product-wrapper .product-name a doesn't have any child elements, so by rights, a ::text is supposed to match nothing. The fact that it does match suggests that Scrapy is not following the grammar.
Scrapy uses Parsel (itself based on cssselect) to translate selectors into XPath, which is where ::text comes from. With that in mind, let's examine how Parsel implements ::text:
>>> from parsel import css2xpath
>>> css2xpath('a::text')
'descendant-or-self::a/text()'
>>> css2xpath('a ::text')
'descendant-or-self::a/descendant-or-self::text()'
So, like cssselect, anything that follows a descendant combinator is translated into a descendant-or-self axis, but because text nodes are proper children of element nodes in the DOM, ::text is treated as a standalone node and converted directly to text(), which, with the descendant-or-self axis, matches any text node that is a descendant of an a element, just as a/text() matches any text node child of an a element (a child is also a descendant).
Egregiously, this happens even when you add an explicit * to the selector:
>>> css2xpath('a *::text')
'descendant-or-self::a/descendant-or-self::text()'
However, the use of the descendant-or-self axis means that a ::text can match all text nodes in the a element, including those in other elements nested within the a. In the following example, a ::text will match two text nodes: 'Link ' followed by 'text':
Link <span>text</span>
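You can watch both translated expressions run against that snippet with lxml (a stand-in here, since the XPath semantics are what matter, not Scrapy itself):

```python
from lxml import html

p = html.fromstring('<p><a href="#">Link <span>text</span></a></p>')

# a::text  ->  descendant-or-self::a/text()
# matches only the a element's own top-level text nodes
print(p.xpath('descendant-or-self::a/text()'))
# ['Link ']

# a ::text  ->  descendant-or-self::a/descendant-or-self::text()
# also matches text nodes nested deeper, e.g. inside the <span>
print(p.xpath('descendant-or-self::a/descendant-or-self::text()'))
# ['Link ', 'text']
```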
So while Scrapy's implementation of ::text is an egregious violation of the Selectors grammar, it seems to have been done this way very much intentionally.
In fact, Scrapy's other pseudo-element ::attr()¹ behaves similarly. The following selectors all match the id attribute node belonging to the div element when it does not have any descendant elements:
>>> css2xpath('div::attr(id)')
'descendant-or-self::div/@id'
>>> css2xpath('div ::attr(id)')
'descendant-or-self::div/descendant-or-self::*/@id'
>>> css2xpath('div *::attr(id)')
'descendant-or-self::div/descendant-or-self::*/@id'
... but div ::attr(id) and div *::attr(id) will match all id attribute nodes within the div's descendants along with its own id attribute, such as in the following example:
<div id="parent"><p id="child"></p></div>
This, of course, is a much less plausible use case, so one has to wonder if this was an unintentional side effect of the implementation of ::text.
Compare the pseudo-element selectors to one that substitutes any simple selector for the pseudo-element:
>>> css2xpath('a [href]')
'descendant-or-self::a/descendant-or-self::*/*[@href]'
This correctly translates the descendant combinator to descendant-or-self::*/* with an additional implicit child axis, ensuring that the [@href] predicate is never tested on the a element.
If you're new to XPath, Selectors, or even Scrapy, this may all seem very confusing and overwhelming. So here's a summary of when to use one selector over the other:
Use a::text if your a element contains only text, or if you're only interested in the top-level text nodes of this a element and not its nested elements.
Use a ::text if your a element contains nested elements and you want to extract all the text nodes within this a element.
While you can use a ::text if your a element contains only text, its syntax is confusing, so for the sake of consistency, use a::text instead.
¹ On an interesting note, ::attr() appears in the (abandoned as of 2021) Non-element Selectors spec, where as you'd expect it behaves consistently with the Selectors grammar, making its behavior in Scrapy inconsistent with the spec. ::text, on the other hand, is conspicuously missing from the spec; based on this answer, I think you can make a reasonable guess as to why.

pulling multiple values from python ElementTree with lxml and xpath

I am almost certainly doing this horribly wrong, and the cause of my problem is my own ignorance, but reading python docs and examples isn't helping.
I am web-scraping. The pages I am scraping have the following salient elements:
<div class='parent'>
<span class='title'>
<a>THIS IS THE TITLE</a>
</span>
<div class='copy'>
<p>THIS IS THE COPY</p>
</div>
</div>
My objective is to pull the text nodes from 'title' and 'copy', grouped by their parent div. In the above example, I should like to retrieve a tuple ('THIS IS THE TITLE', 'THIS IS THE COPY')
Below is my code
## 'tree' is the ElementTree of the document I've just pulled
xpath = "//div[@class='parent']"
filtered_html = tree.xpath(xpath)
arr = []
for i in filtered_html:
    title_filter = "//span[@class='title']/a/text()"  # xpath for title text
    copy_filter = "//div[@class='copy']/p/text()"  # xpath for copy text
    title = i.getroottree().xpath(title_filter)
    copy = i.getroottree().xpath(copy_filter)
    arr.append((title, copy))
I'm expecting filtered_html to be a list of n elements (which it is). I'm then trying to iterate over that list of elements and for each one, convert it to an ElementTree and retrieve the title and copy text with another xpath expression. So at each iteration, I'm expecting title to be a list of length 1, containing the title text for element i, and copy to be a corresponding list for the copy text.
What I end up with: at every iteration, title is a list of length n containing all elements in the document matching the title_filter xpath expression, and copy is a corresponding list of length n for the copy text.
I'm sure that by now, anyone who knows what they're doing with xpath and etree can recognise I'm doing something horrible and mistaken and stupid. If so, can they please tell me how I should be doing this instead?
Your core problem is that the getroottree call you're making on each element resets you to running your xpath over the whole tree. getroottree does exactly what it sounds like: it returns the root element tree of the element you call it on. Note that an expression starting with // is evaluated from the document root even when called on an element, so leave that call out, make the inner expressions relative (.//span... and .//div...), and you'll get what you want.
I personally would use the iterfind method on the element tree for my main loop, and would probably use the findtext method on the resulting elements to ensure that I receive only one title and one copy.
My (untested!) code would look like this:
parent_div_xpath = ".//div[@class='parent']"
title_filter = ".//span[@class='title']/a"
copy_filter = ".//div[@class='copy']/p"
arr = [(i.findtext(title_filter), i.findtext(copy_filter)) for i in tree.iterfind(parent_div_xpath)]
Alternately, you could skip explicit iteration entirely. findall won't accept a trailing text() step, so extract .text in generator expressions instead:
title_filter = ".//div[@class='parent']/span[@class='title']/a"
copy_filter = ".//div[@class='parent']/div[@class='copy']/p"
arr = zip((title.text for title in tree.findall(title_filter)),
          (copy.text for copy in tree.findall(copy_filter)))
And you might need to tweak that xpath if having more than one title/copy pair in a parent div is a possibility.
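Putting the iterfind/findtext suggestion together into a runnable sketch (using the stdlib xml.etree here, which shares those methods with lxml; the two-parent HTML is invented for illustration):

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<body>
  <div class='parent'>
    <span class='title'><a>TITLE 1</a></span>
    <div class='copy'><p>COPY 1</p></div>
  </div>
  <div class='parent'>
    <span class='title'><a>TITLE 2</a></span>
    <div class='copy'><p>COPY 2</p></div>
  </div>
</body>
""")

arr = []
for parent in doc.iterfind(".//div[@class='parent']"):
    # findtext with a relative path stays inside this parent div
    arr.append((parent.findtext("./span[@class='title']/a"),
                parent.findtext("./div[@class='copy']/p")))
print(arr)
# [('TITLE 1', 'COPY 1'), ('TITLE 2', 'COPY 2')]
```

Because each inner path starts with ./, the title and copy stay grouped by their own parent div, which is exactly the pairing the question asks for.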
