I have an XML document that has many nodes with different tags but the same attribute. Is it possible to find all these nodes?
I know it is possible to find all nodes by attribute if they all have the same tag:
root.findall(".//tag[@attrib]")
but in my case they all have different tags. Something like this does not work:
root.findall(".//[@attrib]")
In XPath you can use * to reference an element of any name, and you can use @* to reference an attribute of any name:
root.findall(".//*[@attrib]")
Side notes:
As a heads up, if you're really using lxml (and didn't just accidentally tag the question with lxml), I would suggest using the xpath() method instead of findall(). The former has much better XPath support. For example, when you need to find elements from a limited set of names, say foo and bar, you can use the following XPath expression with the xpath() method:
root.xpath("//*[self::foo or self::bar][@attrib]")
The same expression, when passed to findall(), results in an error:
SyntaxError: prefix 'self' not found in prefix map
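A runnable sketch of that lxml-only expression, with illustrative element names:

```python
from lxml import etree

root = etree.fromstring(
    '<root><foo attrib="1"/><bar attrib="2"/><baz attrib="3"/></root>'
)

# lxml's xpath() accepts full XPath 1.0, including self:: axis tests;
# baz is excluded even though it has the attribute
hits = root.xpath("//*[self::foo or self::bar][@attrib]")
print([e.tag for e in hits])  # ['foo', 'bar']
```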
Related
Using XPath with Python, do I really need to use get() or getall(), or does the XPath string suffice?
For example, is this OK?
product_links = response.xpath('//a[contains(@class,"box_product")]/@href')
or do I really need to use
product_links = response.xpath('//a[contains(@class,"box_product")]/@href').getall()
Or is it that @href works when selecting an attribute, but to retrieve the data (text) within the HTML tags themselves we use get() and getall()?
Question: when do I need to use variant 1 (/@href) or variant 2 (/@href with .getall())?
The goal is to obtain a workable array of links.
Calling response.xpath('//a[contains(@class,"box_product")]/@href') gives you only an instance of Selector (i.e. a recipe for getting the results you want) instead of the actual results.
To get the actual results, you need to call either get(), which will give you only the first match, or getall(), which will return all matches.
So for your use case, go with getall().
=====================
Example and read more at https://www.pythongasm.com/introduction-to-scrapy/
I would like to get a list of items, independently of their prefixes.
My goal is to create a method (please let me know if something like this already exists) that takes one argument (tagname) and returns a list of elements.
For example, with the argument 'item', both <media:item> and <abc:item> should be part of the result of this function.
It would be nice to use lxml, but it can be any Python DOM-based parser.
Unfortunately I can't assume that the XML declares its xmlns, which is why I need to handle any prefix.
lxml is a good option, primarily because it has full support for XPath 1.0 via the xpath() method, besides many other useful utilities. In XPath, you can ignore an element's namespace by using local-name(), as mentioned in the comment.
lxml is also able to deal with undefined prefixes if you set recover=True on the parser, but here comes the catch: local-name() still returns the prefixed 'tagname' for an element with an undefined prefix. There is a hacky way to match this kind of element: find elements whose local name contains :tagname (or, to be more precise, ends with :tagname rather than merely contains it).
The following is a working example for demo purposes. It combines two expressions with the logical operator or: one for elements with an undefined prefix, the other for elements without a prefix or with a properly defined prefix:
from lxml import etree
xml = """<root foo="bar">
<media:item>a</media:item>
<abc:item>b</abc:item>
<foo:item>c</foo:item>
<item>d</item>
</root>"""
parser = etree.XMLParser(recover=True)
tree = etree.fromstring(xml, parser=parser)
tagname = "item"
# expression to match elements with an undefined prefix
predicate1 = "contains(local-name(),':{0}')".format(tagname)
# expression to match elements with a properly defined prefix or no prefix
predicate2 = "local-name()='{0}'".format(tagname)
elements = tree.xpath("//*[{0} or {1}]".format(predicate1, predicate2))
for e in elements:
    print(etree.tostring(e))
Output:
<media:item>a</media:item>
<abc:item>b</abc:item>
<foo:item>c</foo:item>
<item>d</item>
I can't find info on how to parse my XML with a namespace:
I have this xml:
<par:Request xmlns:par="http://somewhere.net/actual">
<par:actual>blabla</par:actual>
<par:documentType>string</par:documentType>
</par:Request>
And tried to parse it:
dom = ET.parse(u'C:\\filepath\\1.xml')
rootxml = dom.getroot()
for subtag in rootxml.xpath(u'//par:actual'):
    # do something
    print(subtag)
And got an exception, because it doesn't know about the namespace prefix.
Is there a good way to solve this problem, given that the script will know nothing in advance about the file it is going to parse or the tag it is going to search for?
Searching the web and Stack Overflow, I found that if I add this:
namespace = {u'par': u"http://somewhere.net/actual"}
for subtag in rootxml.xpath(u'//par:actual', namespaces=namespace):
    # do something
    print(subtag)
That works. Perfect. But I don't know in advance which XML I will parse, and the tag to search for (such as //par:actual) is also unknown to my script. So I need a way to extract the namespace from the XML itself.
I found a lot of ways to extract the namespace URI, such as:
print(rootxml.tag)
print(rootxml.xpath('namespace-uri(.)'))
print(rootxml.xpath('namespace-uri(/*)'))
But how should I extract the prefix to create the dictionary that ElementTree wants from me? I don't want to run a regular-expression monster over the XML body to extract the prefix; I believe there must be a supported way to do that, isn't there?
And maybe there are methods to extract the namespaces from the XML as a dictionary (as ETree wants!) without manual manipulation?
You cannot rely on the namespace declarations on the root element: there is no guarantee that the declarations will even be there, or that the document will have the same prefix for the same namespace throughout.
Assuming you are going to have some way of passing the tag you want to search for (because you say it is not known by your script), you should also provide a way to pass a namespace mapping as well. Or use the James Clark notation, like {http://somewhere.net/actual}actual (ETXPath supports this syntax, whereas "normal" xpath does not, but you can also use other methods like .findall() if you don't need full XPath).
If you don't care about the prefix at all, you could also use the local-name() function in XPath, e.g. //*[local-name()="actual"] (but you won't be "really" sure it's the right "actual").
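A runnable sketch of the local-name() approach, reusing the question's document:

```python
from lxml import etree

xml = '''<par:Request xmlns:par="http://somewhere.net/actual">
  <par:actual>blabla</par:actual>
  <par:documentType>string</par:documentType>
</par:Request>'''

root = etree.fromstring(xml)

# match by local name, ignoring whatever prefix/namespace the document uses
hits = root.xpath('//*[local-name()="actual"]')
print([e.text for e in hits])  # ['blabla']
```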
Oh, I found it.
After we do that:
dom = ET.parse(u'C:\\filepath\\1.xml')
rootxml = dom.getroot()
The rootxml object has a dictionary attribute, nsmap, which contains all the namespaces I want.
So, the simplest solution I've found:
dom = ET.parse(u'C:\\filepath\\1.xml')
rootxml = dom.getroot()
nss = rootxml.nsmap
for subtag in rootxml.xpath(u'//par:actual', namespaces=nss):
    # do something
    print(subtag)
That works.
UPD: this works only if the user understands what 'par' means in the XML at hand, for example by comparing the expected namespace with the actual namespace before any other operations.
Still, I much prefer the variant with an XPath that understands {...}actual; that is what I was trying to achieve.
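That prefix-free variant can be sketched with lxml's ETXPath, which compiles expressions written in James Clark notation (reusing the question's namespace URI):

```python
from lxml import etree

xml = '''<par:Request xmlns:par="http://somewhere.net/actual">
  <par:actual>blabla</par:actual>
</par:Request>'''

root = etree.fromstring(xml)

# ETXPath accepts {namespace-uri}localname directly,
# so no prefix-to-URI mapping is needed
find_actual = etree.ETXPath("//{http://somewhere.net/actual}actual")
print([e.text for e in find_actual(root)])  # ['blabla']
```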
With Python 3.8.2 I came across this question while having the same issue.
This is the solution I found: put the namespace in the XPath query, between the {}.
ApplicationArea = BOD_IN_tree.find('.//ApplicationArea', ns)
if ApplicationArea is None:
    ApplicationArea = BOD_IN_tree.find('.//{http://www.defaultNamespace.com/2}ApplicationArea', ns)
I search for the element without the namespace, then search again if it's not found. I have no control over the inbound documents, some have namespaces, some do not.
I hope this helps!
If the same tag name is used in multiple places within an XML file, with the nesting providing uniqueness, what is the best way to specify the particular node of interest?
from xml.dom.minidom import parse
dom = parse("inputs.xml")
data_node = dom.getElementsByTagName("outer_level_x")[0].getElementsByTagName('inner_level_y')[0].getElementsByTagName('Data')
So, is there a better way to specify the "Data" node nested under "<outer_level_x><inner_level_y>"? The specific nesting is always known, and a function that recursively calls getElementsByTagName could be written; but I suspect that I am missing something basic here.
xml.etree.ElementTree supports a subset of XPath syntax in find/findall, allowing precision when specifying the desired tags/attributes.
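A minimal sketch using the question's element names with ElementTree's path syntax:

```python
import xml.etree.ElementTree as ET

xml = """<inputs>
  <outer_level_x>
    <inner_level_y>
      <Data>wanted</Data>
    </inner_level_y>
  </outer_level_x>
  <other><Data>not wanted</Data></other>
</inputs>"""

root = ET.fromstring(xml)

# one path expression instead of chained getElementsByTagName calls
data = root.find("./outer_level_x/inner_level_y/Data")
print(data.text)  # wanted
```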
I'm new to lxml and I'm trying to figure how to rewrite links using iterlinks().
import lxml.html
html = lxml.html.document_fromstring(doc)
for element, attribute, link, pos in html.iterlinks():
    if attribute == "src":
        link = link.replace('foo', 'bar')
print(lxml.html.tostring(html))
However, this doesn't actually replace the links. I know I can use .rewrite_links, but iterlinks provides more information about each link, so I would prefer to use this.
Thanks in advance.
Instead of just assigning a new (string) value to the variable name link, you have to alter the element itself, in this case by setting its src attribute:
new_src = link.replace('foo', 'bar') # or element.get('src').replace('foo', 'bar')
element.set('src', new_src)
Note that, if you know which "links" you are interested in (for example, only img elements), you can also get the elements by using .findall() (or XPath or CSS selectors) instead of .iterlinks().
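Putting the fix together, a small runnable sketch (the foo/bar URL is made up for illustration):

```python
import lxml.html

doc = '<html><body><img src="http://foo.example/a.png"></body></html>'
html = lxml.html.document_fromstring(doc)

for element, attribute, link, pos in html.iterlinks():
    if attribute == "src":
        # mutate the element itself; rebinding the name `link` changes nothing
        element.set("src", link.replace("foo", "bar"))

print(lxml.html.tostring(html))
```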
lxml provides a rewrite_links method (or function that you pass the text to be parsed into a document) to provide a method of changing all links in a document:
.rewrite_links(link_repl_func, resolve_base_href=True, base_href=None):
This rewrites all the links in the document using your given link replacement function. If you give a base_href value, all links will be passed in after they are joined with this URL.
For each link, link_repl_func(link) is called. That function then returns the new link, or None to remove the attribute or tag that contains the link. Note that all links will be passed in, including links like "#anchor" (which is purely internal) and things like "mailto:bob@example.com" (or javascript:...).
link is just a string value, not a live reference into the document, so reassigning it changes nothing. Try setting the attribute on element inside your loop instead.
Here is working code with rewrite_links:
from lxml.html import fromstring, tostring
e = fromstring("<html><body><a href='http://localhost'>hello</body></html>")
def my_rewriter(link):
    return "http://newlink.com"
e.rewrite_links(my_rewriter)
print(tostring(e))
Output:
b'<html><body><a href="http://newlink.com">hello</a></body></html>'