Following on from Removing child elements in XML using python ...
Thanks to @Tichodroma, I have this code:
If you can use lxml, try this:
import lxml.etree

tree = lxml.etree.parse("leg.xml")
for dog in tree.xpath("//Leg1:Dog",
                      namespaces={"Leg1": "http://what.not"}):
    parent = dog.xpath("..")[0]
    parent.remove(dog)
    parent.text = None
tree.write("leg.out.xml")
Now leg.out.xml looks like this:
<?xml version="1.0"?>
<Leg1:MOR xmlns:Leg1="http://what.not" oCount="7">
  <Leg1:Order>
    <Leg1:CTemp id="FO">
      <Leg1:Group bNum="001" cCount="4"/>
      <Leg1:Group bNum="002" cCount="4"/>
    </Leg1:CTemp>
    <Leg1:CTemp id="GO">
      <Leg1:Group bNum="001" cCount="4"/>
      <Leg1:Group bNum="002" cCount="4"/>
    </Leg1:CTemp>
  </Leg1:Order>
</Leg1:MOR>
How do I modify my code to remove the Leg1: namespace prefix from all of the elements' tag names?
One possible way to remove the namespace prefix from each element:
from lxml import etree

def strip_ns_prefix(tree):
    # iterate through element nodes only (skip comment nodes, text nodes, etc.)
    for element in tree.xpath('descendant-or-self::*'):
        # if the element has a prefix...
        if element.prefix:
            # ...replace the element name with its local name
            element.tag = etree.QName(element).localname
    return tree
Another version, which does the namespace check in the XPath expression instead of an if statement:
from lxml import etree

def strip_ns_prefix(tree):
    # XPath query selecting all element nodes that are in a namespace
    query = "descendant-or-self::*[namespace-uri()!='']"
    # for each element returned by the above XPath query...
    for element in tree.xpath(query):
        # ...replace the element name with its local name
        element.tag = etree.QName(element).localname
    return tree
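If lxml is not available, roughly the same effect can be sketched with the standard library's xml.etree.ElementTree. This is a substitute technique, not the lxml code above: ElementTree stores tags in Clark notation ({uri}local), so the sketch simply cuts off everything up to the closing brace. The function name strip_namespaces and the shortened sample document are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def strip_namespaces(root):
    """Rewrite '{uri}local' element tags to 'local' in place (attributes are left alone)."""
    for element in root.iter():
        # Comments and processing instructions carry a non-string tag; skip them.
        if isinstance(element.tag, str) and element.tag.startswith("{"):
            element.tag = element.tag.split("}", 1)[1]
    return root


# Shortened, hypothetical version of leg.out.xml
xml_text = (
    '<Leg1:MOR xmlns:Leg1="http://what.not" oCount="7">'
    '<Leg1:Order><Leg1:CTemp id="FO">'
    '<Leg1:Group bNum="001" cCount="4"/>'
    '</Leg1:CTemp></Leg1:Order>'
    '</Leg1:MOR>'
)

root = strip_namespaces(ET.fromstring(xml_text))
tags = [element.tag for element in root.iter()]
print(tags)
print(ET.tostring(root, encoding="unicode"))
```

Because no element refers to the namespace any more, ElementTree also leaves the xmlns:Leg1 declaration out when serializing.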
I have this XML as a string, returned by a DB query as a CLOB and converted by an OutputTypeHandler function, which returns the content of the CLOB in a tuple:
This is the code that returns the tuple from the CLOB content:
def OutputTypeHandler(cursor, name, defaultType, size, precision, scale):
    if defaultType == cx_Oracle.CLOB:
        return cursor.var(cx_Oracle.LONG_STRING, arraysize=cursor.arraysize)
This is the code where the XML tree is built after the tuple returned by OutputTypeHandler is converted to a string:
import xml.etree.ElementTree as ET

conn.outputtypehandler = OutputTypeHandler
c = conn.cursor()
c.execute("""select Clob from Table""")
clobData = c.fetchone()
xml_text = ''.join(clobData)    # join the tuple into one string (avoids shadowing the built-in str)
root = ET.fromstring(xml_text)  # build the XML tree using xml.etree.ElementTree as ET
ET.dump(root)                   # writes the whole tree to stdout
The resulting XML message (a replica of the XML in the DB) is:
<Parent>
  <Batch_Number>2000</Batch_Number>
  <Total_No_Of_Batches>12312</Total_No_Of_Batches>
  <requestNo>1923</requestNo>
  <Parent1>
    <Parent2>
      <Parent3>
        <lastModifiedDateTime>2022-11-11T11:07:30.000</lastModifiedDateTime>
        <purpose>NeverMore</purpose>
        <endDate>9999-12-31T00:00:00.000</endDate>
        <createdDateTime>2019-06-06T06:32:16.000</createdDateTime>
        <createdOn>2019-06-06T08:32:16.000</createdOn>
        <address2>Forever street 21</address2>
        <externalCode>home</externalCode>
        <lastModifiedBy>user2.thisUser</lastModifiedBy>
        <lastModifiedOn>2039-06-11T13:07:30.000</lastModifiedOn>
        <lastModifiedBy>MG</lastModifiedBy>
        <PS>1234431</PS>
      </Parent3>
    </Parent2>
  </Parent1>
Here is where I try to look at the value of every child/grandchild of the XML until I find a specific value:
for child in root:
    if child.text == 'MG':
        print(child.text)
    else:
        print("Value not found")
The result is really strange, and I don't understand where it's coming from:
<Parent>
  <Batch_Number>2000</Batch_Number>
  <Total_No_Of_Batches>12312</Total_No_Of_Batches>
  <requestNo>1923</requestNo>
  <Parent1>
    <Parent2>
      <Parent3>
        <lastModifiedDateTime>2022-11-11T11:07:30.000</lastModifiedDateTime>
        <purpose>NeverMore</purpose>
        <endDate>9999-12-31T00:00:00.000</endDate>
        <createdDateTime>2019-06-06T06:32:16.000</createdDateTime>
        <createdOn>2019-06-06T08:32:16.000</createdOn>
        <address2>Forever street 21</address2>
        <externalCode>home</externalCode>
        <lastModifiedBy>user2.thisUser</lastModifiedBy>
        <lastModifiedOn>2039-06-11T13:07:30.000</lastModifiedOn>
        <lastModifiedBy>MG</lastModifiedBy>
        <PS>1234431</PS>
      </Parent3>
    </Parent2>
  </Parent1>
Value not found
Value not found
Value not found
Value not found
If I only print every child found under root:
for child in root:
    print(child)
The result is :
*Whole XML*
<Element 'Batch_Number' at 0x05203E10>
<Element 'Total_No_Of_Batches' at 0x05203E70>
<Element 'requestNo' at 0x05203EA0>
<Element 'Parent1' at 0x05203ED0>
I did try another approach:
element = root.find('MG')
if not element:
    print("element not found, or element has no subelements")
if element is None:
    print("element not found")
The result was the same: the full XML was printed and no element was found:
*WholeXML*
element not found, or element has no subelements
element not found
I'm not sure what I'm doing wrong. I assume that the XML tree built from the string is faulty and somehow is not being parsed tag by tag.
lastModifiedBy is nested inside Parent3, which is itself nested inside Parent2 and Parent1. Iterating over root only visits its direct children, which is why your approach never finds a text matching MG.
If you'd like to stay with this approach, you need to define a function that recursively checks the children of every element that has children.
Please refer to: ElementTree - findall to recursively select all child elements
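As a minimal sketch of that recursive search, Element.iter() from the standard library already walks the element and all of its descendants, so no hand-written recursion is needed. The helper name find_by_text and the shortened sample document are assumptions for illustration. Note also that root.find('MG') looks for an element whose tag is MG, not one whose text is MG, which is a second reason the earlier attempt returned nothing.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical version of the document from the question
xml_text = """
<Parent>
  <requestNo>1923</requestNo>
  <Parent1><Parent2><Parent3>
    <lastModifiedBy>user2.thisUser</lastModifiedBy>
    <lastModifiedBy>MG</lastModifiedBy>
  </Parent3></Parent2></Parent1>
</Parent>
"""
root = ET.fromstring(xml_text)


def find_by_text(root, wanted):
    """Return every element, at any depth, whose stripped text equals `wanted`."""
    return [
        element
        for element in root.iter()  # iter() visits root and all descendants
        if element.text is not None and element.text.strip() == wanted
    ]


matches = find_by_text(root, "MG")
for element in matches:
    print(element.tag, element.text)
```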
I'm trying to parse the following XML using Python and lxml:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="/bind9.xsl"?>
<isc version="1.0">
<bind>
<statistics version="2.2">
<memory>
<summary>
<TotalUse>1232952256
</TotalUse>
<InUse>835252452
</InUse>
<BlockSize>598212608
</BlockSize>
<ContextSize>52670016
</ContextSize>
<Lost>0
</Lost>
</summary>
</memory>
</statistics>
</bind>
</isc>
The goal is to extract the tag name and text of every element under bind/statistics/memory/summary in order to produce the following mapping:
TotalUse: 1232952256
InUse: 835252452
BlockSize: 598212608
ContextSize: 52670016
Lost: 0
I've managed to extract the element values, but I can't figure out the xpath expression to get the element tag names.
A sample script:
from lxml import etree as et

def main():
    xmlfile = "bind982.xml"
    location = "bind/statistics/memory/summary/*"
    label_selector = "??????"  ## what to put here...?
    value_selector = "text()"

    with open(xmlfile, "r") as data:
        xmldata = et.parse(data)

    etree = xmldata.getroot()
    statlist = etree.xpath(location)
    for stat in statlist:
        label = stat.xpath(label_selector)[0]
        value = stat.xpath(value_selector)[0]
        print("{0}: {1}".format(label, value))

if __name__ == '__main__':
    main()
I know I could use label = stat.tag instead of stat.xpath(), but the script must be sufficiently generic to also process other pieces of XML where the label selector is different.
What xpath selector would return an element's tag name?
Simply use XPath's name(), and remove the zero index, since name() returns a string and not a list.
from lxml import etree as et

def main():
    xmlfile = "ExtractXPathTagName.xml"
    location = "bind/statistics/memory/summary/*"
    label_selector = "name()"  # returns the element's tag name as a string
    value_selector = "text()"

    with open(xmlfile, "r") as data:
        xmldata = et.parse(data)

    etree = xmldata.getroot()
    statlist = etree.xpath(location)
    for stat in statlist:
        label = stat.xpath(label_selector)
        value = stat.xpath(value_selector)[0]
        # strip() removes the trailing newline inside each element's text
        print("{0}: {1}".format(label, value).strip())

if __name__ == '__main__':
    main()
Output
TotalUse: 1232952256
InUse: 835252452
BlockSize: 598212608
ContextSize: 52670016
Lost: 0
I don't think you need XPath for the two values. Element nodes have the properties tag and text, so you can use, for instance, a list comprehension:
[(element.tag, element.text) for element in etree.xpath(location)]
Or, if you really want to use XPath:
result = [(element.xpath('name()'), element.xpath('string()')) for element in etree.xpath(location)]
You could of course also construct a list of dictionaries:
result = [{ element.tag : element.text } for element in etree.xpath(location)]
or
result = [{ element.xpath('name()') : element.xpath('string()') } for element in etree.xpath(location)]
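Since the goal in the question is a single name-to-value mapping, a dict comprehension over tag and text may be the most direct form. The sketch below uses the standard library's xml.etree.ElementTree rather than lxml (its findall() accepts the same simple location path, but it has no name() function), and the sample document is a shortened, hypothetical copy of the one in the question.

```python
import xml.etree.ElementTree as ET

# Shortened sample; the newline inside each element mirrors the original file
xml_text = """<isc version="1.0"><bind><statistics version="2.2"><memory><summary>
<TotalUse>1232952256
</TotalUse>
<InUse>835252452
</InUse>
<Lost>0
</Lost>
</summary></memory></statistics></bind></isc>"""

root = ET.fromstring(xml_text)
location = "bind/statistics/memory/summary/*"

# One dictionary for the whole summary; strip() drops the newline in each text
stats = {element.tag: element.text.strip() for element in root.findall(location)}
print(stats)
```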
<?xml version="1.0" ?>
<school xmlns="loyo:22:2.2">
  <profile>
    <student xmlns="loyo:5:542">
      <marks>
        <mark java="java:/lo">
          <ca1>200</ca1>
        </mark>
      </marks>
    </student>
  </profile>
</school>
I am trying to access the ca1 text. I am using ElementTree, but I cannot access it. I'm using the code below.
import xml.etree.ElementTree as ET

tree = ET.parse('mca.xml')
root = tree.getroot()

def getElementsData(xpath):
    elements = list()
    if root.findall(xpath):
        for elem in root.findall(xpath):
            elements.append(elem.text)
        return elements
    else:
        raise SystemExit("Invalid xpath provided")

t = getElementsData('.//ca1')
for i in t:
    print(i)
I have tried several different ways to access it, but I can't work out the exact problem. Is it a file type issue?
Your document declares namespaces on the school and student nodes, so you need to include them in your search. Since you are looking for ca1, which sits under student, specify the namespace that the student node declares:
import xml.etree.ElementTree as ET

tree = ET.parse('mca.xml')
root = tree.getroot()

def getElementsData(xpath, namespaces):
    elements = root.findall(xpath, namespaces)
    if elements == []:
        raise SystemExit("Invalid xpath provided")
    return elements

namespaces = {'ns_school': 'loyo:22:2.2', 'ns_student': 'loyo:5:542'}
elements = getElementsData('.//ns_student:ca1', namespaces)
for element in elements:
    print(element.text)  # the text of each matching ca1 element
Notes
Since your namespaces have no prefixes, I gave them the names ns_school and ns_student, but these names can be anything (e.g. ns1, mystudent, ...).
In a more complex system, I recommend raising some other kind of error and letting the caller decide whether or not to exit.
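The prefix mapping is only a convenience. ElementTree stores each tag in Clark notation, {namespace-uri}localname, so the same lookup can be sketched without any mapping by writing the URI into the path. The example parses the document from the question out of a string instead of mca.xml so that it is self-contained.

```python
import xml.etree.ElementTree as ET

xml_text = """<school xmlns="loyo:22:2.2"><profile><student xmlns="loyo:5:542">
<marks><mark java="java:/lo"><ca1>200</ca1></mark></marks>
</student></profile></school>"""
root = ET.fromstring(xml_text)

# The default namespace declared on an element is inherited by its descendants,
# so ca1 lives in the student namespace, not the school one.
print(root.tag)

ca1_texts = [element.text for element in root.findall(".//{loyo:5:542}ca1")]
print(ca1_texts)
```

Printing root.tag is a quick way to see which namespace an element actually ended up in when a search unexpectedly returns nothing.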
How about traversing like this? (getchildren() was removed in Python 3.9, so index the elements directly.)
import xml.etree.ElementTree

e = xml.etree.ElementTree.parse('test.xml').getroot()
# school -> profile -> student -> marks -> mark -> ca1
data = e[0][0][0][0][0].text
print(data)
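Try the following XPath with lxml. Because ca1 is in a namespace, a plain //ca1 matches nothing, so match on the local name instead:
tree.xpath("//*[local-name()='ca1']/text()")[0].strip()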
Example:
html = <a><b>Text</b>Text2</a>
BeautifulSoup code:
[x.extract() for x in html.findAll('b')]
The result is:
html = <a>Text2</a>
lxml code:
[bad.getparent().remove(bad) for bad in html.xpath(".//b")]
The result is:
html = <a></a>
because lxml treats "Text2" as the tail of <b></b>.
If we only need the joined text of the line without the tags' own text, we can use:
for bad in raw.xpath(xpath_search):
    bad.text = ''
But how can I remove the tags themselves, leaving the remaining text unchanged, without also removing their tails?
While the accepted answer from phlou will work, there are easier ways to remove tags without also removing their tails.
If you want to remove a specific element, then the lxml method you are looking for is drop_tree, which is available on elements parsed with lxml.html.
From the docs:
Drops the element and all its children. Unlike el.getparent().remove(el) this does not remove the tail text; with drop_tree the tail text is merged with the previous element.
If you want to remove all instances of a specific tag, you can use the lxml.etree.strip_elements or lxml.html.etree.strip_elements with with_tail=False.
Delete all elements with the provided tag names from a tree or
subtree. This will remove the elements and their entire subtree,
including all their attributes, text content and descendants. It
will also remove the tail text of the element unless you
explicitly set the with_tail keyword argument option to False.
So, for the example in the original post:
>>> from lxml.html import fragment_fromstring, tostring
>>>
>>> html = fragment_fromstring('<a><b>Text</b>Text2</a>')
>>> for bad in html.xpath('.//b'):
...     bad.drop_tree()
...
>>> tostring(html, encoding="unicode")
'<a>Text2</a>'
or
>>> from lxml.html import fragment_fromstring, tostring, etree
>>>
>>> html = fragment_fromstring('<a><b>Text</b>Text2</a>')
>>> etree.strip_elements(html, 'b', with_tail=False)
>>> tostring(html, encoding="unicode")
'<a>Text2</a>'
Edit:
Please look at @Joshmakers' answer https://stackoverflow.com/a/47946748/8055036, which is clearly the better one.
I did the following to save the tail text to the previous sibling or parent.
def remove_keeping_tail(self, element):
    """Save the tail text and then delete the element"""
    self._preserve_tail_before_delete(element)
    element.getparent().remove(element)

def _preserve_tail_before_delete(self, node):
    if node.tail:  # preserve the tail
        previous = node.getprevious()
        if previous is not None:  # if there is a previous sibling, it gets the tail
            if previous.tail is None:
                previous.tail = node.tail
            else:
                previous.tail = previous.tail + node.tail
        else:  # the parent gets the tail as text
            parent = node.getparent()
            if parent.text is None:
                parent.text = node.tail
            else:
                parent.text = parent.text + node.tail
HTH
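The same tail-preserving idea can be sketched with the standard library's xml.etree.ElementTree, which has no getparent() or getprevious(). As an assumption of this sketch, the caller passes the parent explicitly and the previous sibling is found by position; the function only handles direct children of that parent.

```python
import xml.etree.ElementTree as ET


def remove_keeping_tail(parent, child):
    """Remove `child` from `parent`, handing its tail to the previous sibling or to parent.text."""
    if child.tail:
        siblings = list(parent)
        index = siblings.index(child)
        if index > 0:
            # A previous sibling exists: append the tail to that sibling's tail
            previous = siblings[index - 1]
            previous.tail = (previous.tail or "") + child.tail
        else:
            # First child: the tail becomes part of the parent's text
            parent.text = (parent.text or "") + child.tail
    parent.remove(child)


root = ET.fromstring("<a><b>Text</b>Text2</a>")
for bad in root.findall("b"):  # findall returns a list, so removing while looping is safe
    remove_keeping_tail(root, bad)

result = ET.tostring(root, encoding="unicode")
print(result)
```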
I'm following this link to try to get values of several tags:
Parsing XML with namespace in Python via 'ElementTree'
According to that link, there should be no problem accessing the root tag like this:
import sys
from lxml import etree as ET

doc = ET.parse('file.xml')
namespaces_rdf = {'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'}  # add more as needed
namespaces_dcat = {'dcat': 'http://www.w3.org/ns/dcat#'}  # add more as needed
namespaces_dct = {'dct': 'http://purl.org/dc/terms/'}
print(doc.findall('rdf:RDF', namespaces_rdf))
print(doc.findall('dcat:Dataset', namespaces_dcat))
print(doc.findall('dct:identifier', namespaces_dct))
OUTPUT:
[]
[<Element {http://www.w3.org/ns/dcat#}Dataset at 0x2269b98>]
[]
I can only access dcat:Dataset, and I can't see how to get the value of rdf:about.
After that, I need to access dct:identifier.
Of course, once I have this information, I also need to access the dcat:distribution information.
This is my example file, generated with ckanext-dcat:
<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF
xmlns:dct="http://purl.org/dc/terms/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dcat="http://www.w3.org/ns/dcat#"
>
<dcat:Dataset rdf:about="http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01">
<dct:identifier>ec631628-2f46-4f17-a685-d62a37466c01</dct:identifier>
<dct:description>FOO-Description</dct:description>
<dct:title>FOO-title</dct:title>
<dcat:keyword>keyword1</dcat:keyword>
<dcat:keyword>keyword2</dcat:keyword>
<dct:issued rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2014-10-08T08:55:04.566618</dct:issued>
<dct:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2015-06-25T11:04:10.328902</dct:modified>
<dcat:distribution>
<dcat:Distribution rdf:about="http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01/resource/f5707551-6bf3-468f-9a96-b4184cc51d1f">
<dct:title>FOO-title-1</dct:title>
<dct:description>FOO-Description-1</dct:description>
<dcat:accessURL>http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01/resource/f5707551-6bf3-468f-9a96-b4184cc51d1f/download/myxls.xls</dcat:accessURL>
<dct:format>XLS</dct:format>
</dcat:Distribution>
</dcat:distribution>
<dcat:distribution>
<dcat:Distribution rdf:about="http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01/resource/74c1acc8-b2b5-441b-afb2-d072d0d00a7f">
<dct:format>XLS</dct:format>
<dct:title>FOO-title-2</dct:title>
<dct:description>FOO-Description-2</dct:description>
<dcat:accessURL>http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01/resource/74c1acc8-b2b5-441b-afb2-d072d0d00a7f/download/myxls.xls</dcat:accessURL>
</dcat:Distribution>
</dcat:distribution>
</dcat:Dataset>
</rdf:RDF>
Any idea on how to access this info??
Thanks
UPDATE:
Well, I need to access rdf:about in:
<dcat:Dataset rdf:about="http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01">
so with this code taken from:
Parse xml with lxml - extract element value
for node in doc.xpath('//dcat:Dataset', namespaces=namespaces):
    # Iterate over attributes
    for attrib in node.attrib:
        print('#' + attrib + '=' + node.attrib[attrib])
I get this output:
[<Element {http://www.w3.org/ns/dcat#}Dataset at 0x23d8ee0>]
#{http://www.w3.org/1999/02/22-rdf-syntax-ns#}about=http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01
So, the question is:
How can I check that the attribute is about before taking its value? In other files there are several attributes.
UPDATE 2: Fixed how I get the about value (Clark notation)
for node in doc.xpath('//dcat:Dataset', namespaces=namespaces):
    # Iterate over attributes
    for attrib in node.attrib:
        if attrib.endswith('about'):
            pass  # do my jobs
Well, almost finished, but I have one last question: when I access a <dct:title>, I need to know which element it belongs to. I have:
<dcat:Dataset rdf:about="http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01">
<dct:title>FOO-title</dct:title>
<dcat:Distribution rdf:about="http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01/resource/f5707551-6bf3-468f-9a96-b4184cc51d1f">
<dct:title>FOO-title-1</dct:title>
<dcat:Distribution rdf:about="http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01/resource/74c1acc8-b2b5-441b-afb2-d072d0d00a7f">
<dct:title>FOO-title-2</dct:title>
If I do something like this, I get:
for node in doc.xpath('//dct:title', namespaces=namespaces):
    print(node.tag, node.text)
{http://purl.org/dc/terms/}title FOO-title
{http://purl.org/dc/terms/}title FOO-title-1
{http://purl.org/dc/terms/}title FOO-title-2
Thanks
Use the xpath() method with the namespaces keyword argument:
namespaces = {
    'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
    'dcat': 'http://www.w3.org/ns/dcat#',
    'dct': 'http://purl.org/dc/terms/'
}
print(doc.xpath('//rdf:RDF', namespaces=namespaces))
print(doc.xpath('//dcat:Dataset', namespaces=namespaces))
print(doc.xpath('//dct:identifier', namespaces=namespaces))
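For the two follow-up questions (reading rdf:about directly and knowing which element each dct:title belongs to), one possible approach is to start from the owning element rather than from the title. The sketch below uses the standard library's xml.etree.ElementTree instead of lxml; the sample document is a trimmed copy of the one in the question with shortened, hypothetical URLs, and the names RDF and owners are made up for illustration.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
namespaces = {
    "rdf": RDF,
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
}

# Trimmed sample with shortened, hypothetical URLs
xml_text = """<rdf:RDF xmlns:dct="http://purl.org/dc/terms/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dcat="http://www.w3.org/ns/dcat#">
  <dcat:Dataset rdf:about="http://www.myweb.com/dataset/ds1">
    <dct:title>FOO-title</dct:title>
    <dcat:distribution>
      <dcat:Distribution rdf:about="http://www.myweb.com/dataset/ds1/resource/r1">
        <dct:title>FOO-title-1</dct:title>
      </dcat:Distribution>
    </dcat:distribution>
  </dcat:Dataset>
</rdf:RDF>"""
root = ET.fromstring(xml_text)

owners = []
for tag in ("dcat:Dataset", "dcat:Distribution"):
    for node in root.findall(".//" + tag, namespaces):
        # Attribute names use Clark notation, so ask for the attribute by its full name
        about = node.get("{%s}about" % RDF)
        # find() without './/' only looks at direct children, so each title is the node's own
        title = node.find("dct:title", namespaces)
        owners.append((tag, about, title.text))

for row in owners:
    print(row)
```

Asking for the attribute by its full Clark name avoids the endswith('about') check, which would also match an unrelated attribute whose name happens to end in "about".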