I've created a scraper to grab some product names from a webpage. It is working smoothly. I've used CSS selectors to do the job. However, the only thing I can't understand is the difference between the selectors a::text and a ::text (don't overlook the space between a and ::text in the latter). When I run my script, I get the same exact result no matter which selector I choose.
import requests
from scrapy import Selector
res = requests.get("https://www.kipling.com/uk-en/sale/type/all-sale/?limit=all#")
sel = Selector(res)
for item in sel.css(".product-list-product-wrapper"):
title = item.css(".product-name a::text").extract_first().strip()
title_ano = item.css(".product-name a ::text").extract_first().strip()
print("Name: {}\nName_ano: {}\n".format(title,title_ano))
As you can see, both title and title_ano contain the same selector, bar the space in the latter. Nevertheless, the results are always the same.
My question: is there any substantial difference between the two, and when should I use the former and when the latter?
Interesting observation! I spent the past couple of hours investigating this, and it turns out there's a lot more to it than meets the eye.
If you're coming from CSS, you'd probably expect to write a::text in much the same way you'd write a::first-line, a::first-letter, a::before or a::after. No surprises there.
On the other hand, standard selector syntax would suggest that a ::text matches the ::text pseudo-element of a descendant of the a element, making it equivalent to a *::text. However, .product-list-product-wrapper .product-name a doesn't have any child elements, so by rights, a ::text is supposed to match nothing. The fact that it does match suggests that Scrapy is not following the grammar.
Scrapy uses Parsel (itself based on cssselect) to translate selectors into XPath, which is where ::text comes from. With that in mind, let's examine how Parsel implements ::text:
>>> from parsel import css2xpath
>>> css2xpath('a::text')
'descendant-or-self::a/text()'
>>> css2xpath('a ::text')
'descendant-or-self::a/descendant-or-self::text()'
So, like cssselect, anything that follows a descendant combinator is translated into a descendant-or-self axis, but because text nodes are proper children of element nodes in the DOM, ::text is treated as a standalone node and converted directly to text(), which, with the descendant-or-self axis, matches any text node that is a descendant of an a element, just as a/text() matches any text node child of an a element (a child is also a descendant).
Strangely, this happens even when you add an explicit * to the selector:
>>> css2xpath('a *::text')
'descendant-or-self::a/descendant-or-self::text()'
However, the use of the descendant-or-self axis means that a ::text can match all text nodes in the a element, including those in other elements nested within the a. In the following example, a ::text will match two text nodes: 'Link ' followed by 'text':
Link <span>text</span>
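To see the difference concretely, here's a minimal parsel sketch (the anchor markup is just this example, not the page from the question):
from parsel import Selector

sel = Selector(text='<a href="#">Link <span>text</span></a>')
print(sel.css('a::text').getall())   # ['Link '] - only the a element's own text node
print(sel.css('a ::text').getall())  # ['Link ', 'text'] - descendant text nodes as well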
So while Scrapy's implementation of ::text is an egregious violation of the Selectors grammar, it seems to have been done this way very much intentionally.
In fact, Scrapy's other pseudo-element, ::attr()¹, behaves similarly. The following selectors all match the id attribute node belonging to the div element when it does not have any descendant elements:
>>> css2xpath('div::attr(id)')
'descendant-or-self::div/@id'
>>> css2xpath('div ::attr(id)')
'descendant-or-self::div/descendant-or-self::*/@id'
>>> css2xpath('div *::attr(id)')
'descendant-or-self::div/descendant-or-self::*/@id'
... but div ::attr(id) and div *::attr(id) will match all id attribute nodes within the div's descendants along with its own id attribute, such as in the following example:
<div id="parent"><p id="child"></p></div>
This, of course, is a much less plausible use case, so one has to wonder if this was an unintentional side effect of the implementation of ::text.
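You can verify this with the same kind of parsel sketch, reusing the example markup above:
from parsel import Selector

sel = Selector(text='<div id="parent"><p id="child"></p></div>')
print(sel.css('div::attr(id)').getall())   # ['parent']
print(sel.css('div ::attr(id)').getall())  # ['parent', 'child']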
Compare the pseudo-element selectors to one that substitutes any simple selector for the pseudo-element:
>>> css2xpath('a [href]')
'descendant-or-self::a/descendant-or-self::*/*[@href]'
This correctly translates the descendant combinator to descendant-or-self::*/* with an additional implicit child axis, ensuring that the [@href] predicate is never tested on the a element.
If you're new to XPath, Selectors, or even Scrapy, this may all seem very confusing and overwhelming. So here's a summary of when to use one selector over the other:
Use a::text if your a element contains only text, or if you're only interested in the top-level text nodes of this a element and not its nested elements.
Use a ::text if your a element contains nested elements and you want to extract all the text nodes within this a element.
While you can use a ::text if your a element contains only text, its syntax is confusing, so for the sake of consistency, use a::text instead.
¹ On an interesting note, ::attr() appears in the (abandoned as of 2021) Non-element Selectors spec, where, as you'd expect, it behaves consistently with the Selectors grammar, making its behavior in Scrapy inconsistent with the spec. ::text, on the other hand, is conspicuously missing from the spec; based on this answer, I think you can make a reasonable guess as to why.
If I am implementing string locators, such as:
continue_button: str = "button:has-text(\"Continue\")"
If there are multiple buttons on the same page that say Continue but are for different paths, how do I select the correct one? Is there a way to add an index to that string locator?
There are several good practices for creating locators/selectors.
Playwright's official documentation describes each common and unique selector type, how to use it, and what it does.
More information at https://playwright.dev/docs/selectors#text-selector
For your case, I would suggest always using a parent selector when locating an element.
When there is a button, try to find its unique parent:
By id
By unique class
Something else unique.
Example:
<div id="test">
  <button id="continue-test">Continue</button>
</div>
In this case you can use the unique id of the button and not the text.
Selector css: #continue-test
But if you don't have a unique identifier for the button, you can use the parent and go down to the button.
Selector css: #test > button
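As a sketch of that approach in Playwright's sync Python API (assuming the <div id="test"> markup from the example above):
from playwright.sync_api import sync_playwright

with sync_playwright() as pw:
    browser = pw.chromium.launch()
    page = browser.new_page()
    page.set_content('<div id="test"><button id="continue-test">Continue</button></div>')
    page.locator("#test > button").click()  # scope to the unique parent, then descend to the button
    browser.close()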
Matching text using CSS is not possible, but with XPath it can look like this:
//button[text()="Continue"]
This selector MATCHES the text using "equals".
Using playwright:
button:has-text("Continue")
Using has-text matches elements that contain the given text, including text inside descendant elements (a case-insensitive substring match).
If you are using another selector, for example text=Continue, the unquoted form also matches elements that contain the text "Continue", while the quoted form text="Continue" matches the text exactly.
All of this is explained with examples in the official documentation for Playwright selectors.
That does not mean you should avoid XPath to achieve your goal.
CSS selectors are fast but somewhat limited when it comes to working with text.
XPath is slower but much more powerful for working with text and parent/child elements.
I would suggest always using a parent element with a unique identifier and going down from it to reach the actual element that will receive the interaction.
One of the reasons I love Playwright is scenarios like this and how easily they can be handled.
If a string such as abc occurs multiple times on a single page, you can use the :nth-match() selector to pick the nth matching element.
For example,
await page.locator(':nth-match(:text("abc"), 3)').click();
will select the 3rd occurrence of the word abc. Similarly, in your case, if you want to select the first or second or third, you can simply do
await page.locator(':nth-match(:text("Continue"), 1)').click();
await page.locator(':nth-match(:text("Continue"), 2)').click();
await page.locator(':nth-match(:text("Continue"), 3)').click();
Please refer to the Playwright Selectors documentation.
This is different from the nth-child concept, as the documentation mentions:
Unlike :nth-child(), elements do not have to be siblings, they could
be anywhere on the page. In the snippet above, all three buttons match
:text("Buy") selector, and :nth-match() selects the third button.
<li>
<b>word</b>
<i>type</i>
<b>1.</b>
"translation 1"
<b>2.</b>
"translation 2"
</li>
I'm doing webscraping from an online dictionary, and the main dictionary part has roughly the above structure.
How exactly do I get all those children? With the usual Selenium approach I see online, that is list_elem.find_elements(By.XPATH, ".//*"), I only get the "proper" children, but not the textual ones (sorry if my word choice is off). In other words, I would like len(children) == 6 instead of len(children) == 4.
I would like to get all children for further analysis
If you want to get all child (including descendant) text nodes from the li node, you can try this code:
from selenium import webdriver
from selenium.webdriver.remote.webelement import WebElement

driver = webdriver.Chrome()
driver.get(<URL>)

li = driver.find_element('xpath', '//li')
nodes = driver.execute_script("return arguments[0].childNodes", li)

text_nodes = []
for node in nodes:
    if not isinstance(node, WebElement):  # Extract text from direct child text nodes
        _text = node['textContent'].strip()
        if _text:  # Ignore all the empty text nodes
            text_nodes.append(_text)
    else:  # Extract text from WebElements like <b>, <i>...
        text_nodes.append(node.text)

print(text_nodes)
Output:
['word', 'type', '1.', '"translation 1"', '2.', '"translation 2"']
I'm not a Selenium expert but I've read StackOverflow answers where apparently knowledgeable people have asserted that Selenium's XPath queries must return elements (so text nodes are not supported as a query result type), and I'm pretty sure that's correct.
So a query like //* (return every element in the document) will work fine in Selenium, but //text() (return every text node in the document) won't, because although it's a valid XPath query, it returns text nodes rather than elements.
I suggest you consider using a different XPath API to execute your XPath queries, e.g. lxml, which doesn't have that limitation.
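For example, here's a minimal lxml sketch (the markup is a trimmed copy of the question's <li>, and lxml is assumed to be installed):
from lxml import etree

li = etree.fromstring(
    '<li><b>word</b><i>type</i><b>1.</b>"translation 1"<b>2.</b>"translation 2"</li>'
)
# Unlike Selenium's find_element*, lxml's xpath() can return text() nodes directly
print([t for t in li.xpath('.//text()') if t.strip()])
# ['word', 'type', '1.', '"translation 1"', '2.', '"translation 2"']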
Elements (*), comments (comment()), text nodes (text()), and processing instructions (processing-instruction()) are all nodes.
To select all nodes:
.//node()
To ensure that it only selects * and text() nodes, you can add a predicate filter:
.//node()[self::* or self::text()]
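As a quick check of these expressions outside Selenium, here's an lxml sketch using a trimmed version of the question's markup:
from lxml import etree

li = etree.fromstring('<li><b>word</b>"translation 1"</li>')
print(li.xpath('.//node()[self::* or self::text()]'))
# one element node (<b>) and two text nodes: 'word' and '"translation 1"'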
However, the Selenium methods are find_element() and find_elements(), and they expect to locate elements, not text() nodes. There doesn't seem to be a more generic method for finding nodes, so you may need to write some code to achieve what you want, such as the approach in JaSON's answer above.
In md2pptx - which uses python-pptx to turn Markdown into PowerPoint - I've implemented a few functions that manipulate the XML tree.
In a few places I need to find a child element if it exists - and create it if it doesn't.
I have a rather hacky way of searching for this element. I'd rather have a decent way.
So, could someone show me the "right" way to check whether a child element exists?
There's probably a more general version of this question - how to manipulate XML in the context of python-pptx. I could use a reference for that, too. (Yes, I can read the python-pptx code and often do - but a synopsis would help me get it right.)
Using XPath for this job is almost always the right answer.
For example, if you wanted to get all the a:fld child elements of a paragraph to implement something to do with text fields:
# --- get <a:p> XML element of paragraph ---
p = paragraph._p
# --- use XPath to get all the `<a:fld>` child elements ---
flds = p.xpath("./a:fld")
# --- do something with them ---
for fld in flds:
    do_fieldy_thing(fld)
The result of an .xpath() call is a list of the zero-or-more items that matched the str XPath expression provided as its argument. If there can only be zero or one result it's common to process it like this instead:
if flds:
    do_fieldy_thing(flds[0])
The complication arises when the "starting" element (p in this case) is not a defined oxml element. oxml is a layer of custom element classes added by python-pptx "on top of" the base lxml.etree._Element class for each XML element. These custom element classes provide some convenience services, in particular allowing you to specify elements using their namespace prefixes (like "a:fld" in this case).
Not all elements in python-pptx have a custom element class, only those we manipulate via the API in some way. Any element you get from a python-pptx object (like paragraph._p above) will be an oxml element, but the elements returned by the .xpath() calls very likely won't be (otherwise you would have used python-pptx to get them). Elements that are not oxml elements are plain lxml.etree._Element instances.
The .xpath() implementation on an lxml.etree._Element instance requires use of so-called "Clark names" which look something like: "{http://schemas.openxmlformats.org/drawingml/2006/main}fld" instead of "a:fld".
You can create a Clark-name from a namespace-prefixed tag name using the pptx.oxml.ns.qn() function:
>>> from pptx.oxml.ns import qn
>>> qn("a:fld")
'{http://schemas.openxmlformats.org/drawingml/2006/main}fld'
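Putting those pieces together for the original find-it-or-create-it need, here's a hedged sketch; the <a:fld> child is only an example tag, and the get-or-add logic is plain lxml rather than a python-pptx API:
from lxml import etree
from pptx import Presentation
from pptx.oxml.ns import qn

prs = Presentation()
slide = prs.slides.add_slide(prs.slide_layouts[6])  # blank layout in the default template
paragraph = slide.shapes.add_textbox(0, 0, 914400, 914400).text_frame.paragraphs[0]

p = paragraph._p                                     # the <a:p> oxml element
flds = p.xpath("./a:fld")                            # prefixed names work on oxml elements
fld = flds[0] if flds else etree.SubElement(p, qn("a:fld"))  # create the child if it's missing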
I am scraping this webpage and while trying to extract text from one element, I am hitting a dead end.
So the element in question is shown below in the image -
The text in this element is within the <p> tags inside the <div>. I tried extracting the text in the scrapy shell using the following code - response.css("div.home-hero-blurb no-select::text").getall(). I received an empty list as the result.
Alternatively, if I try going a bit further and reference the <p> tags individually, I can get the text. Why does this happen? Isn't the <div> a parent element and shouldn't my code extract the text?
Note - I wanted to use the div because I thought that'll help me get both the <p> tags in one query.
I can see two issues here.
The first is that if you separate the class names with a space, the CSS selector is interpreted as looking for a descendant element with that name. So the correct approach is "div.home-hero-blurb.no-select::text" instead of "div.home-hero-blurb no-select::text".
The second issue is that the text you want is inside a p element that is a child of that div. If you only select the div, the selector will return the text nodes directly inside the div, but not those in its children. Since there is also a strong element as a child of the p, I would suggest using a more general approach like:
response.css("div.home-hero-blurb.no-select *::text").getall()
This should return all text from the div and its descendants.
It's worth pointing out that extracting text with CSS selectors is a Scrapy extension of the standard selectors. Scrapy mentions this here.
Edit
If you were to use XPath, this would be the equivalent expression:
response.xpath('//div[@class="home-hero-blurb no-select"]//text()').getall()
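A small parsel sketch shows both forms in action; the HTML here is a simplified stand-in for the real page, not its actual markup:
from parsel import Selector

sel = Selector(text=(
    '<div class="home-hero-blurb no-select">'
    '<p>First paragraph with a <strong>bold</strong> part.</p>'
    '<p>Second paragraph.</p>'
    '</div>'
))
print(sel.css('div.home-hero-blurb.no-select *::text').getall())
# ['First paragraph with a ', 'bold', ' part.', 'Second paragraph.']
print(sel.xpath('//div[@class="home-hero-blurb no-select"]//text()').getall())
# same four text nodes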
In the case that I want the first use of a class, so I don't have to guess with find_elements_by_xpath(), what are my options? The goal is to write less code, ensuring any changes to the source I am scraping can be fixed easily. Is it possible to essentially write
find_elements_by_css_selector('source[1]')
This code does not work as is though.
I am using selenium with Python and will likely be using phantomJS as the webdriver (Firefox for testing).
In CSS selectors, square brackets select attributes, so your sample code is trying to select the 'source' type element with an attribute named 1, e.g.
<source 1="your_element" />
Whereas I gather you're trying to find the first in a list that looks like this:
<source>Blah</source>
<source>Rah</source>
If you just want the first matching element, you can use the singular form:
element = find_element_by_css_selector("source")
The form you were using returns a list, so you can also take the (n-1)th item to find the nth instance on the page (lists index from 0):
element = find_elements_by_css_selector("source")[0]
Finally, if you want your CSS selectors to be completely explicit in which element they're finding, you can use the nth-of-type selector:
element = find_element_by_css_selector("source:nth-of-type(1)")
You might find some other helpful information at this blog post from Sauce Labs to help you write flexible selectors to replace your XPath.
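If you're on a current Selenium release, note that the find_element_by_* helpers were removed in Selenium 4; here's a hedged sketch of the same three approaches with the By API (the URL and the source tag are placeholders):
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("https://example.com")  # placeholder URL

first = driver.find_element(By.CSS_SELECTOR, "source")                           # first match
first_from_list = driver.find_elements(By.CSS_SELECTOR, "source")[0]             # list form, index 0
first_explicit = driver.find_element(By.CSS_SELECTOR, "source:nth-of-type(1)")   # explicit nth-of-type

driver.quit()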