Getting subelements using lxml and iterparse

Getting subelements using lxml and iterparse - python

I am trying to write a parsing algorithm to efficiently pull data from an xml document. I am currently rolling through the document based on elements and children, but would like to use iterparse instead. One issue is that I have a list of elements that when found, I want to pull the child data from them, but it seems like using iterparse my options are to filter based on either one element name, or get every single element.
Example xml:
<?xml version="1.0" encoding="UTF-8"?>
<data_object xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<source id="0">
<name>Office Issues</name>
<datetime>2012-01-13T16:09:15</datetime>
<data_id>7</data_id>
</source>
<event id="125">
<date>2012-11-06</date>
<state_id>7</state_id>
</event>
<state id="7">
<name>Washington</name>
</state>
<locality id="2">
<name>Olympia</name>
<state_id>7</state_id>
<type>City</type>
</locality>
<locality id="3">
<name>Town</name>
<state_id>7</state_id>
<type>Town</type>
</locality>
</data_object>
Code example:
from lxml import etree
fname = "test.xml"
ELEMENT_LIST = ["source", "event", "state", "locality"]
with open(fname) as xml_doc:
context = etree.iterparse(xml_doc, events=("start", "end"))
context = iter(context)
event, root = context.next()
base = False
b_name = ""
for event, elem in context:
if event == "start" and elem.tag in ELEMENT_LIST:
base = True
bname = elem.tag
children = elem.getchildren()
child_list = []
for child in children:
child_list.append(child.tag)
print bname + ":" + str(child_list)
elif event == "end" and elem.tag in ELEMENT_LIST:
base = False
root.clear()

With iterparse you cannot limit parsing to some types of tags, you may do this only with one tag (by passing argument tag). However it is easy to do manually what you would like to achieve. In the following snippet:
from lxml import etree
fname = "test.xml"
ELEMENT_LIST = ["source", "event", "state", "locality"]
with open(fname) as xml_doc:
context = etree.iterparse(xml_doc, events=("start", "end"))
for event, elem in context:
if event == "start" and elem.tag in ELEMENT_LIST:
print "this elem is interesting, do some processing: %s: [%s]" % (elem.tag, ", ".join(child.tag for child in elem))
elem.clear()
you limit your search to interesting tags only. Important part of iterparse is the elem.clear() which clears memory when item is obsolete. That is why it is memory efficient, see http://lxml.de/parsing.html#modifying-the-tree

I would use XPath instead. It's much more elegant than walking the document on your own and certainly more efficient I assume.

Use tag='{http://www.sitemaps.org/schemas/sitemap/0.9}url'
Similar question with right answer https://stackoverflow.com/a/7019273/1346222
#!/usr/bin/python
# coding: utf-8
""" Parsing xml file. Basic example """
from StringIO import StringIO
from lxml import etree
import urllib2
sitemap = urllib2.urlopen(
'http://google.com/sitemap.xml',
timeout=10
).read()
NS = {
'x': 'http://www.sitemaps.org/schemas/sitemap/0.9',
'x2': 'http://www.google.com/schemas/sitemap-mobile/1.0'
}
res = []
urls = etree.iterparse(StringIO(sitemap), tag='{http://www.sitemaps.org/schemas/sitemap/0.9}url')
for event, url in urls:
t = []
t = url.xpath('.//x:loc/text() | .//x:priority/text()', namespaces=NS)
t.append(url.xpath('boolean(.//x2:mobile)', namespaces=NS))
res.append(t)

Related

Is there a way to skip nodes/elements with iterparse lxml?

Is there a way using lxml iterparse to skip an element without checking the tag? Take this xml for example:
<root>
<sample>
<tag1>text1</tag1>
<tag2>text2</tag2>
<tag3>text3</tag3>
<tag4>text4</tag4>
</sample>
<sample>
<tag1>text1</tag1>
<tag2>text2</tag2>
<tag3>text3</tag3>
<tag4>text4</tag4>
</sample>
</root>
If I care about tag1 and tag4, checking tag2 and tag3 will eat up some time. If the file isn't big, it doesn't really matter but if I have a million <sample> nodes, I could reduce search time some if I don't have to check tag2 nd tag3. They're always there and I never need them.
using iterparse in lxml
import lxml
xmlfile = 'myfile.xml'
context = etree.iterparse(xmlfile, events('end',), tag='sample')
for event, elem in context:
for child in elem:
if child.tag == 'tag1'
my_list.append(child.text)
#HERE I'd like to advance the loop twice without checking tag2 and tag3 at all
#something like:
#next(child)
#next(child)
elif child.tag == 'tag4'
my_list.append(child.text)

If you use the tag arg in iterchildren like you do in iterparse, you can "skip" elements other than tag1 and tag4.
Example...
from lxml import etree
xmlfile = "myfile.xml"
my_list = []
for event, elem in etree.iterparse(xmlfile, tag="sample"):
for child in elem.iterchildren(tag=["tag1", "tag4"]):
if child.tag == "tag1":
my_list.append(child.text)
elif child.tag == "tag4":
my_list.append(child.text)
print(my_list)
Printed output...
['text1', 'text4', 'text1', 'text4']

XML parsing using ElementTree thrown of by lower case attribute of root node in Python [duplicate]

I have an xml file I need to open and make some changes to, one of those changes is to remove the namespace and prefix and then save to another file.
Here is the xml:
<?xml version='1.0' encoding='UTF-8'?>
<package xmlns="http://apple.com/itunes/importer">
<provider>some data</provider>
<language>en-GB</language>
</package>
I can make the other changes I need, but can't find out how to remove the namespace and prefix. This is the reusklt xml I need:
<?xml version='1.0' encoding='UTF-8'?>
<package>
<provider>some data</provider>
<language>en-GB</language>
</package>
And here is my script which will open and parse the xml and save it:
metadata = '/Users/user1/Desktop/Python/metadata.xml'
from lxml import etree
parser = etree.XMLParser(remove_blank_text=True)
open(metadata)
tree = etree.parse(metadata, parser)
root = tree.getroot()
tree.write('/Users/user1/Desktop/Python/done.xml', pretty_print = True, xml_declaration = True, encoding = 'UTF-8')
So how would I add code in my script which will remove the namespace and prefix?

We can get the desired output document in two steps:
Remove namespace URIs from element names
Remove unused namespace declarations from the XML tree
Example code
from lxml import etree
input_xml = """
<package xmlns="http://apple.com/itunes/importer">
<provider>some data</provider>
<language>en-GB</language>
<!-- some comment -->
<?xml-some-processing-instruction ?>
</package>
"""
root = etree.fromstring(input_xml)
# Iterate through all XML elements
for elem in root.getiterator():
# Skip comments and processing instructions,
# because they do not have names
if not (
isinstance(elem, etree._Comment)
or isinstance(elem, etree._ProcessingInstruction)
):
# Remove a namespace URI in the element's name
elem.tag = etree.QName(elem).localname
# Remove unused namespace declarations
etree.cleanup_namespaces(root)
print(etree.tostring(root).decode())
Output XML
<package>
<provider>some data</provider>
<language>en-GB</language>
<!-- some comment -->
<?xml-some-processing-instruction ?>
</package>
Details explaining the code
As described in the documentation, we use lxml.etree.QName.localname to get local names of elements, that is names without namespace URIs. Then we replace the fully qualified names of the elements by their local names.
Some XML elements, such as comments and processing instructions do not have names. So, we have to skip these elements while replacing element names, otherwise a ValueError will be raised.
Finally, we use lxml.etree.cleanup_namespaces() to remove unused namespace declarations from the XML tree.
Note on namespaced XML attributes
If the XML input contains attributes with explicitly specified namespace prefixes, the example code will not remove those prefixes. To accomplish the deletion of namespace prefixes in attributes, add the following for-loop after the line elem.tag = etree.QName(elem).localname, as suggested here
for attr_name in elem.attrib:
local_attr_name = etree.QName(attr_name).localname
if attr_name != local_attr_name:
attr_value = elem.attrib[attr_name]
del elem.attrib[attr_name]
elem.attrib[local_attr_name] = attr_value
To learn more about namespaced XML attributes see this answer.

Replace tag as Uku Loskit suggests. In addition to that, use lxml.objectify.deannotate.
from lxml import etree, objectify
metadata = '/Users/user1/Desktop/Python/metadata.xml'
parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse(metadata, parser)
root = tree.getroot()
####
for elem in root.getiterator():
if not hasattr(elem.tag, 'find'): continue # guard for Comment tags
i = elem.tag.find('}')
if i >= 0:
elem.tag = elem.tag[i+1:]
objectify.deannotate(root, cleanup_namespaces=True)
####
tree.write('/Users/user1/Desktop/Python/done.xml',
pretty_print=True, xml_declaration=True, encoding='UTF-8')
Note: Some tags like Comment return a function when accessing tag attribute. added a guard for that.

import xml.etree.ElementTree as ET
def remove_namespace(doc, namespace):
"""Remove namespace in the passed document in place."""
ns = u'{%s}' % namespace
nsl = len(ns)
for elem in doc.getiterator():
if elem.tag.startswith(ns):
elem.tag = elem.tag[nsl:]
metadata = '/Users/user1/Desktop/Python/metadata.xml'
tree = ET.parse(metadata)
root = tree.getroot()
remove_namespace(root, u'http://apple.com/itunes/importer')
tree.write('/Users/user1/Desktop/Python/done.xml',
pretty_print=True, xml_declaration=True, encoding='UTF-8')
Used a snippet of code from here
This method could be easily extended to delete any namespace attributes by searching for tags that begin with "xmlns"

You could also use XSLT to strip the namespaces...
XSLT 1.0 (test.xsl)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="node()">
<xsl:copy>
<xsl:apply-templates select="#*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="*" priority="1">
<xsl:element name="{local-name()}" namespace="">
<xsl:apply-templates select="#*|node()"/>
</xsl:element>
</xsl:template>
<xsl:template match="#*">
<xsl:attribute name="{local-name()}" namespace="">
<xsl:value-of select="."/>
</xsl:attribute>
</xsl:template>
</xsl:stylesheet>
Python
from lxml import etree
tree = etree.parse("metadata.xml")
xslt = etree.parse("test.xsl")
new_tree = tree.xslt(xslt)
print(etree.tostring(new_tree, pretty_print=True, xml_declaration=True,
encoding="UTF-8").decode("UTF-8"))
Output
<?xml version='1.0' encoding='UTF-8'?>
<package>
<provider>some data</provider>
<language>en-GB</language>
</package>

you can try with lxml:
# Remove namespace prefixes
for elem in root.getiterator():
namespace_removed = elem.xpath('local-name()')

Define and call the following function, right after you parse the XML string:
from lxml import etree
def clean_xml_namespaces(root):
for element in root.getiterator():
if isinstance(element, etree._Comment):
continue
element.tag = etree.QName(element).localname
etree.cleanup_namespaces(root)
💡 Note - comment elements in the XML are ignored, as they should be
Usage:
xml_content = b'''<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<dependencies>
<dependency>
<groupId>org.easytesting</groupId>
<artifactId>fest-assert</artifactId>
<version>1.4</version>
</dependency>
<!-- this dependency is critical -->
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-lang3</artifactId>
<version>3.4</version>
</dependency>
</dependencies>
</project>
'''
root = etree.fromstring(xml_content)
clean_xml_namespaces(root)
elements = root.findall(".//dependency")
print(len(elements))
# outputs "2", as expected

So I realize this is an older answer with a highly up-voted and accepted answer, but if you are reading LARGE-FILES and find yourself in the same predicament I did; I hope this helps you out.
The issue with this approach is, in fact, the iteration. Regardless of how fast the parser is, doing anything say... a few 100k times is gonna eat your execution time. With that said, it came down to really thinking about the problem for me and understanding how namespaces work (or are "intended to work", because they are honestly not needed). Now if your xml truly uses namespaces, meaning you see tags that look like this: <xs:table>, then you'll need to tweak the approach here for your use-case. I'll include the full way of handling, as well.
DISCLAIMER : I cannot, with a good conscience, tell you to use regular expressions when parsing html/xml, go look at SergiyKolesnikov's answer as it WORKS, but I had an edge case so with that said... let's dive into some regex!
Problem: namespace stripping takes forever... and most of the time the namespaces only live inside of the very opening tag, or our "root". So in thinking about how python reads information in, and where our only problem-child is that root node, why not use that to our advantage.
Please NOTE: the file i'm using as my example comes as a raw, horrid, remarkably senseless structure of lulz with the promise of data in there somewhere.
my_file is the path to the file im using for our example, I cannot share it with you for professional reasons; and it has been cut down way in size just to get through this answer.
import os, sys, subprocess, re, io, json
from lxml import etree
# Your file would be '_biggest_file' if playing along at home
my_file = _biggest_file
meta_stuff = dict(
exists = os.path.exists(_biggest_file),
sizeof = os.path.getsize(_biggest_file),
extension_is_a_real_thing = any(re.findall("\.(html|xml)$", my_file, re.I)),
system_thinks_its_a = subprocess.check_output(
["file", "-i", _biggest_file]
).decode().split(":")[-1:][0].strip()
)
print(json.dumps(meta_stuff, indent = 2))
So for starters, decently sized, and system thinks at best it's html; the file extension is neither xml or html either...
{
"exists": true,
"sizeof": 24442371,
"extension_is_a_real_thing": false,
"system_thinks_its_a": "text/html; charset=us-ascii"
}
Approach:
In order to parse an xml file... it should at the very least be xml, so we'll need to check and add a declarations tag if one doesn't exist
If I have namespaces.. thats bad because I can't use xpaths, which is what I want to do
If my file is huge, I should only operate on the smallest imaginable parts that I need to clean before I'm ready to parse it.
Function
def speed_read(file_path):
# We're gonna be low-brow and add our own using this string. It's fine
_xml_dec = '<?xml version="1.0" encoding="utf-8"?>'
# Even worse.. rgx for xml here we go
#
# We'll need to extract the very first node that we find in our document,
# because for our purposes thats the one we know has the namespace uri's
# ie: "attributes"
# FiRsT node : <actual_name xmlns:xsi="idontactuallydoanything.com">
# We're going to pluck out that first node, get the tags actual name
# which means from:
# <actual_name xmlns:xsi="idontactuallydoanything.com">...</actual_name>
# We pluck:
# actual_name
# Then we're gonna replace the entire tag with one we make from that name
# by simple string substitution
#
# -> 'starting from the beginning, capture everything between the < and the >'
_first_node = re.compile('^(\<.*?\>)', re.I|re.M|re.U)
# -> 'Starting from the beginning, but dont you get me the <, find anything that happens
# before the first white-space, which i don't want either man'
_first_tagname = re.compile('(?<=^\<)(.*?)\S+',re.I|re.M|re.U)
# open the file context
with open(file_path, "r", encoding = "utf-8") as f:
# go ahead and strip leading and trailing, cause why not... plus adds
# safety for our regex's
_raw = f.read().strip()
# Now, if the file somehow happens to magically have the xml declaration, we
# wanna go ahead and remove it as we plan to add our own. But for efficiency,
# only check the first couple of characters
if _raw.startswith('<?xml', 0, 5):
#_raw = re.sub(_xml_dec, '', _raw).strip()
_raw = re.sub('\<\?xml.*?\?>\n?', '', _raw).strip()
# Here we grab that first node that has those meaningless namespaces
root_element = _first_node.search(_raw).group()
# here we get its name
first_tag = _first_tagname.search(root_element).group()
# Here, we rubstitute the entire element, with a new one
# that only contains the elements name
_raw = re.sub(root_element, '<{}>'.format(first_tag), _raw)
# Now we add our declaration tag in the worst way you have ever
# seen, but I miss sprintf, so this is how i'm rolling. Python is terrible btw
_raw = "{}{}".format(_xml_dec, _raw)
# The bytes part here might end up being overkill.. but this has worked
# for me consistently so it stays.
return etree.parse(io.BytesIO(bytes(bytearray(_raw, encoding = "utf-8"))))
# a good answer from above:
def safe_read(file_path):
root = etree.parse(file_path)
for elem in root.getiterator():
elem.tag = etree.QName(elem).localname
# Remove unused namespace declarations
etree.cleanup_namespaces(root)
return root
Benchmarking - Yes I know there's better ways to do this.
import pandas as pd
safe_times = []
for i in range(0,5):
s = time.time()
safe_read(_biggest_file)
safe_times.append(time.time() - s)
fast_times = []
for i in range(0,5):
s = time.time()
speed_read(_biggest_file)
fast_times.append(time.time() - s)
pd.DataFrame({"safe":safe_times, "fast":fast_times})
Results
safe
fast
2.36
0.61
2.15
0.58
2.47
0.49
2.94
0.60
2.83
0.53

The accepted solution removes namespaces in node names and not in attributes, i.e. <b:spam c:name="cheese"/> will be transformed to <spam c:name="cheese"/>.
An updated version which will give you <spam name="cheese"/>
def remove_namespaces(root):
for elem in root.getiterator():
if not (
isinstance(elem, etree._Comment)
or isinstance(elem, etree._ProcessingInstruction)
):
localname = etree.QName(elem).localname
if elem.tag != localname:
elem.tag = etree.QName(elem).localname
for attr_name in elem.attrib:
local_attr_name = etree.QName(attr_name).localname
if attr_name != local_attr_name:
attr_value = elem.attrib[attr_name]
del elem.attrib[attr_name]
elem.attrib[local_attr_name] = attr_value
deannotate(root, cleanup_namespaces=True)

Here are two other ways of removing namespaces. The first uses the lxml.etree.QName helper while the second uses regexes. Both functions allow an optional list of namespaces to match against. If no namespace list is supplied then all namespaces are removed. Attribute keys are also cleaned.
from lxml import etree
import re
def remove_namespaces_qname(doc, namespaces=None):
for el in doc.getiterator():
# clean tag
q = etree.QName(el.tag)
if q is not None:
if namespaces is not None:
if q.namespace in namespaces:
el.tag = q.localname
else:
el.tag = q.localname
# clean attributes
for a, v in el.items():
q = etree.QName(a)
if q is not None:
if namespaces is not None:
if q.namespace in namespaces:
del el.attrib[a]
el.attrib[q.localname] = v
else:
del el.attrib[a]
el.attrib[q.localname] = v
return doc
def remove_namespace_re(doc, namespaces=None):
if namespaces is not None:
ns = list(map(lambda n: u'{%s}' % n, namespaces))
for el in doc.getiterator():
# clean tag
m = re.match(r'({.+})(.+)', el.tag)
if m is not None:
if namespaces is not None:
if m.group(1) in ns:
el.tag = m.group(2)
else:
el.tag = m.group(2)
# clean attributes
for a, v in el.items():
m = re.match(r'({.+})(.+)', a)
if m is not None:
if namespaces is not None:
if m.group(1) in ns:
del el.attrib[a]
el.attrib[m.group(2)] = v
else:
del el.attrib[a]
el.attrib[m.group(2)] = v
return doc

all you need to do is:
objectify.deannotate(root, cleanup_namespaces=True)
after you have get the root, by using root = tree.getroot()

Python lxml: how to fetch XML tag names with xpath selector?

I'm trying to parse the following XML using Python and lxml:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="/bind9.xsl"?>
<isc version="1.0">
<bind>
<statistics version="2.2">
<memory>
<summary>
<TotalUse>1232952256
</TotalUse>
<InUse>835252452
</InUse>
<BlockSize>598212608
</BlockSize>
<ContextSize>52670016
</ContextSize>
<Lost>0
</Lost>
</summary>
</memory>
</statistics>
</bind>
</isc>
The goal is to extract the tag name and text of every element under bind/statistics/memory/summary in order to produce the following mapping:
TotalUse: 1232952256
InUse: 835252452
BlockSize: 598212608
ContextSize: 52670016
Lost: 0
I've managed to extract the element values, but I can't figure out the xpath expression to get the element tag names.
A sample script:
from lxml import etree as et
def main():
xmlfile = "bind982.xml"
location = "bind/statistics/memory/summary/*"
label_selector = "??????" ## what to put here...?
value_selector = "text()"
with open(xmlfile, "r") as data:
xmldata = et.parse(data)
etree = xmldata.getroot()
statlist = etree.xpath(location)
for stat in statlist:
label = stat.xpath(label_selector)[0]
value = stat.xpath(value_selector)[0]
print "{0}: {1}".format(label, value)
if __name__ == '__main__':
main()
I know I could use value = stat.tag instead of stat.xpath(), but the script must be sufficiently generic to also process other pieces of XML where the label selector is different.
What xpath selector would return an element's tag name?

Simply use XPath's name(), and remove the zero index since this returns a string and not list.
from lxml import etree as et
def main():
xmlfile = "ExtractXPathTagName.xml"
location = "bind/statistics/memory/summary/*"
label_selector = "name()" ## what to put here...?
value_selector = "text()"
with open(xmlfile, "r") as data:
xmldata = et.parse(data)
etree = xmldata.getroot()
statlist = etree.xpath(location)
for stat in statlist:
label = stat.xpath(label_selector)
value = stat.xpath(value_selector)[0]
print("{0}: {1}".format(label, value).strip())
if __name__ == '__main__':
main()
Output
TotalUse: 1232952256
InUse: 835252452
BlockSize: 598212608
ContextSize: 52670016
Lost: 0

I think you don't need XPath for the two values, the element nodes have properties tag and text so use for instance a list comprehension:
[(element.tag, element.text) for element in etree.xpath(location)]
Or if you really want to use XPath
result = [(element.xpath('name()'), element.xpath('string()')) for element in etree.xpath(location)]
You could of course also construct a list of dictionaries:
result = [{ element.tag : element.text } for element in root.xpath(location)]
or
result = [{ element.xpath('name()') : element.xpath('string()') } for element in etree.xpath(location)]

How to parse the xml with xmlns attribute using python

<?xml version="1.0" ?>
<school xmlns="loyo:22:2.2">
<profile>
<student xmlns="loyo:5:542">
<marks>
<mark java="java:/lo">
<ca1>200</ca1>
</mark>
</marks>
</student>
</profile>
</school>
I trying to access the ca1 text. I am using etree but I cannot access it. I'm using below code.
import xml.etree.ElementTree as ET
tree = ET.parse('mca.xml')
root = tree.getroot()
def getElementsData(xpath):
elements = list()
if root.findall(xpath):
for elem in root.findall(xpath):
elements.append(elem.text)
return elements
else:
raise SystemExit("Invalid xpath provided")
t = getElementsData('.//ca1')
for i in t:
print(i)
I tried in different way to access it I don't know the exact problem. Is it recording file type issue?

Your document has namespaces on nodes school and student, you need to incorporate the namespaces in your search. Since you are looking for ca1, which is under student, you will need to specify the namespace that student node has:
import xml.etree.ElementTree as ET
tree = ET.parse('mca.xml')
root = tree.getroot()
def getElementsData(xpath, namespaces):
elements = root.findall(xpath, namespaces)
if elements == []:
raise SystemExit("Invalid xpath provided")
return elements
namespaces = {'ns_school': 'loyo:22:2.2', 'ns_student': 'loyo:5:542'}
elements = getElementsData('.//ns_student:ca1', namespaces)
for element in elements:
print(element)
Notes
Since your namespaces have no names, I gave them such names as ns_school, ns_student, but these name can be anything (e.g. ns1, mystudent, ...)
In a more complex system, I recommend raising some other kinds of errors and let the caller decide whether or not to exit.

How about traversing like this
import xml.etree.ElementTree
e = xml.etree.ElementTree.parse('test.xml').getroot()
data = e.getchildren()[0].getchildren()[0].getchildren()[0].getchildren()[0].getchildren()[0].text
print(data)

Try the following xpath
tree.xpath('//ca1//text()')[0].strip()

Remove namespace and prefix from xml in python using lxml

I have an xml file I need to open and make some changes to, one of those changes is to remove the namespace and prefix and then save to another file.
Here is the xml:
<?xml version='1.0' encoding='UTF-8'?>
<package xmlns="http://apple.com/itunes/importer">
<provider>some data</provider>
<language>en-GB</language>
</package>
I can make the other changes I need, but can't find out how to remove the namespace and prefix. This is the reusklt xml I need:
<?xml version='1.0' encoding='UTF-8'?>
<package>
<provider>some data</provider>
<language>en-GB</language>
</package>
And here is my script which will open and parse the xml and save it:
metadata = '/Users/user1/Desktop/Python/metadata.xml'
from lxml import etree
parser = etree.XMLParser(remove_blank_text=True)
open(metadata)
tree = etree.parse(metadata, parser)
root = tree.getroot()
tree.write('/Users/user1/Desktop/Python/done.xml', pretty_print = True, xml_declaration = True, encoding = 'UTF-8')
So how would I add code in my script which will remove the namespace and prefix?

We can get the desired output document in two steps:
Remove namespace URIs from element names
Remove unused namespace declarations from the XML tree
Example code
from lxml import etree
input_xml = """
<package xmlns="http://apple.com/itunes/importer">
<provider>some data</provider>
<language>en-GB</language>
<!-- some comment -->
<?xml-some-processing-instruction ?>
</package>
"""
root = etree.fromstring(input_xml)
# Iterate through all XML elements
for elem in root.getiterator():
# Skip comments and processing instructions,
# because they do not have names
if not (
isinstance(elem, etree._Comment)
or isinstance(elem, etree._ProcessingInstruction)
):
# Remove a namespace URI in the element's name
elem.tag = etree.QName(elem).localname
# Remove unused namespace declarations
etree.cleanup_namespaces(root)
print(etree.tostring(root).decode())
Output XML
<package>
<provider>some data</provider>
<language>en-GB</language>
<!-- some comment -->
<?xml-some-processing-instruction ?>
</package>
Details explaining the code
As described in the documentation, we use lxml.etree.QName.localname to get local names of elements, that is names without namespace URIs. Then we replace the fully qualified names of the elements by their local names.
Some XML elements, such as comments and processing instructions do not have names. So, we have to skip these elements while replacing element names, otherwise a ValueError will be raised.
Finally, we use lxml.etree.cleanup_namespaces() to remove unused namespace declarations from the XML tree.
Note on namespaced XML attributes
If the XML input contains attributes with explicitly specified namespace prefixes, the example code will not remove those prefixes. To accomplish the deletion of namespace prefixes in attributes, add the following for-loop after the line elem.tag = etree.QName(elem).localname, as suggested here
for attr_name in elem.attrib:
local_attr_name = etree.QName(attr_name).localname
if attr_name != local_attr_name:
attr_value = elem.attrib[attr_name]
del elem.attrib[attr_name]
elem.attrib[local_attr_name] = attr_value
To learn more about namespaced XML attributes see this answer.

Replace tag as Uku Loskit suggests. In addition to that, use lxml.objectify.deannotate.
from lxml import etree, objectify
metadata = '/Users/user1/Desktop/Python/metadata.xml'
parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse(metadata, parser)
root = tree.getroot()
####
for elem in root.getiterator():
if not hasattr(elem.tag, 'find'): continue # guard for Comment tags
i = elem.tag.find('}')
if i >= 0:
elem.tag = elem.tag[i+1:]
objectify.deannotate(root, cleanup_namespaces=True)
####
tree.write('/Users/user1/Desktop/Python/done.xml',
pretty_print=True, xml_declaration=True, encoding='UTF-8')
Note: Some tags like Comment return a function when accessing tag attribute. added a guard for that.

import xml.etree.ElementTree as ET
def remove_namespace(doc, namespace):
"""Remove namespace in the passed document in place."""
ns = u'{%s}' % namespace
nsl = len(ns)
for elem in doc.getiterator():
if elem.tag.startswith(ns):
elem.tag = elem.tag[nsl:]
metadata = '/Users/user1/Desktop/Python/metadata.xml'
tree = ET.parse(metadata)
root = tree.getroot()
remove_namespace(root, u'http://apple.com/itunes/importer')
tree.write('/Users/user1/Desktop/Python/done.xml',
pretty_print=True, xml_declaration=True, encoding='UTF-8')
Used a snippet of code from here
This method could be easily extended to delete any namespace attributes by searching for tags that begin with "xmlns"

You could also use XSLT to strip the namespaces...
XSLT 1.0 (test.xsl)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="node()">
<xsl:copy>
<xsl:apply-templates select="#*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="*" priority="1">
<xsl:element name="{local-name()}" namespace="">
<xsl:apply-templates select="#*|node()"/>
</xsl:element>
</xsl:template>
<xsl:template match="#*">
<xsl:attribute name="{local-name()}" namespace="">
<xsl:value-of select="."/>
</xsl:attribute>
</xsl:template>
</xsl:stylesheet>
Python
from lxml import etree
tree = etree.parse("metadata.xml")
xslt = etree.parse("test.xsl")
new_tree = tree.xslt(xslt)
print(etree.tostring(new_tree, pretty_print=True, xml_declaration=True,
encoding="UTF-8").decode("UTF-8"))
Output
<?xml version='1.0' encoding='UTF-8'?>
<package>
<provider>some data</provider>
<language>en-GB</language>
</package>

you can try with lxml:
# Remove namespace prefixes
for elem in root.getiterator():
namespace_removed = elem.xpath('local-name()')

Define and call the following function, right after you parse the XML string:
from lxml import etree
def clean_xml_namespaces(root):
for element in root.getiterator():
if isinstance(element, etree._Comment):
continue
element.tag = etree.QName(element).localname
etree.cleanup_namespaces(root)
💡 Note - comment elements in the XML are ignored, as they should be
Usage:
xml_content = b'''<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<dependencies>
<dependency>
<groupId>org.easytesting</groupId>
<artifactId>fest-assert</artifactId>
<version>1.4</version>
</dependency>
<!-- this dependency is critical -->
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-lang3</artifactId>
<version>3.4</version>
</dependency>
</dependencies>
</project>
'''
root = etree.fromstring(xml_content)
clean_xml_namespaces(root)
elements = root.findall(".//dependency")
print(len(elements))
# outputs "2", as expected

So I realize this is an older answer with a highly up-voted and accepted answer, but if you are reading LARGE-FILES and find yourself in the same predicament I did; I hope this helps you out.
The issue with this approach is, in fact, the iteration. Regardless of how fast the parser is, doing anything say... a few 100k times is gonna eat your execution time. With that said, it came down to really thinking about the problem for me and understanding how namespaces work (or are "intended to work", because they are honestly not needed). Now if your xml truly uses namespaces, meaning you see tags that look like this: <xs:table>, then you'll need to tweak the approach here for your use-case. I'll include the full way of handling, as well.
DISCLAIMER : I cannot, with a good conscience, tell you to use regular expressions when parsing html/xml, go look at SergiyKolesnikov's answer as it WORKS, but I had an edge case so with that said... let's dive into some regex!
Problem: namespace stripping takes forever... and most of the time the namespaces only live inside of the very opening tag, or our "root". So in thinking about how python reads information in, and where our only problem-child is that root node, why not use that to our advantage.
Please NOTE: the file i'm using as my example comes as a raw, horrid, remarkably senseless structure of lulz with the promise of data in there somewhere.
my_file is the path to the file im using for our example, I cannot share it with you for professional reasons; and it has been cut down way in size just to get through this answer.
import os, sys, subprocess, re, io, json
from lxml import etree
# Your file would be '_biggest_file' if playing along at home
my_file = _biggest_file
meta_stuff = dict(
exists = os.path.exists(_biggest_file),
sizeof = os.path.getsize(_biggest_file),
extension_is_a_real_thing = any(re.findall("\.(html|xml)$", my_file, re.I)),
system_thinks_its_a = subprocess.check_output(
["file", "-i", _biggest_file]
).decode().split(":")[-1:][0].strip()
)
print(json.dumps(meta_stuff, indent = 2))
So for starters, decently sized, and system thinks at best it's html; the file extension is neither xml or html either...
{
"exists": true,
"sizeof": 24442371,
"extension_is_a_real_thing": false,
"system_thinks_its_a": "text/html; charset=us-ascii"
}
Approach:
In order to parse an xml file... it should at the very least be xml, so we'll need to check and add a declarations tag if one doesn't exist
If I have namespaces.. thats bad because I can't use xpaths, which is what I want to do
If my file is huge, I should only operate on the smallest imaginable parts that I need to clean before I'm ready to parse it.
Function
def speed_read(file_path):
# We're gonna be low-brow and add our own using this string. It's fine
_xml_dec = '<?xml version="1.0" encoding="utf-8"?>'
# Even worse.. rgx for xml here we go
#
# We'll need to extract the very first node that we find in our document,
# because for our purposes thats the one we know has the namespace uri's
# ie: "attributes"
# FiRsT node : <actual_name xmlns:xsi="idontactuallydoanything.com">
# We're going to pluck out that first node, get the tags actual name
# which means from:
# <actual_name xmlns:xsi="idontactuallydoanything.com">...</actual_name>
# We pluck:
# actual_name
# Then we're gonna replace the entire tag with one we make from that name
# by simple string substitution
#
# -> 'starting from the beginning, capture everything between the < and the >'
_first_node = re.compile('^(\<.*?\>)', re.I|re.M|re.U)
# -> 'Starting from the beginning, but dont you get me the <, find anything that happens
# before the first white-space, which i don't want either man'
_first_tagname = re.compile('(?<=^\<)(.*?)\S+',re.I|re.M|re.U)
# open the file context
with open(file_path, "r", encoding = "utf-8") as f:
# go ahead and strip leading and trailing, cause why not... plus adds
# safety for our regex's
_raw = f.read().strip()
# Now, if the file somehow happens to magically have the xml declaration, we
# wanna go ahead and remove it as we plan to add our own. But for efficiency,
# only check the first couple of characters
if _raw.startswith('<?xml', 0, 5):
#_raw = re.sub(_xml_dec, '', _raw).strip()
_raw = re.sub('\<\?xml.*?\?>\n?', '', _raw).strip()
# Here we grab that first node that has those meaningless namespaces
root_element = _first_node.search(_raw).group()
# here we get its name
first_tag = _first_tagname.search(root_element).group()
# Here, we rubstitute the entire element, with a new one
# that only contains the elements name
_raw = re.sub(root_element, '<{}>'.format(first_tag), _raw)
# Now we add our declaration tag in the worst way you have ever
# seen, but I miss sprintf, so this is how i'm rolling. Python is terrible btw
_raw = "{}{}".format(_xml_dec, _raw)
# The bytes part here might end up being overkill.. but this has worked
# for me consistently so it stays.
return etree.parse(io.BytesIO(bytes(bytearray(_raw, encoding = "utf-8"))))
# a good answer from above:
def safe_read(file_path):
root = etree.parse(file_path)
for elem in root.getiterator():
elem.tag = etree.QName(elem).localname
# Remove unused namespace declarations
etree.cleanup_namespaces(root)
return root
Benchmarking - Yes I know there's better ways to do this.
import pandas as pd
safe_times = []
for i in range(0,5):
s = time.time()
safe_read(_biggest_file)
safe_times.append(time.time() - s)
fast_times = []
for i in range(0,5):
s = time.time()
speed_read(_biggest_file)
fast_times.append(time.time() - s)
pd.DataFrame({"safe":safe_times, "fast":fast_times})
Results
safe
fast
2.36
0.61
2.15
0.58
2.47
0.49
2.94
0.60
2.83
0.53

The accepted solution removes namespaces in node names and not in attributes, i.e. <b:spam c:name="cheese"/> will be transformed to <spam c:name="cheese"/>.
An updated version which will give you <spam name="cheese"/>
def remove_namespaces(root):
for elem in root.getiterator():
if not (
isinstance(elem, etree._Comment)
or isinstance(elem, etree._ProcessingInstruction)
):
localname = etree.QName(elem).localname
if elem.tag != localname:
elem.tag = etree.QName(elem).localname
for attr_name in elem.attrib:
local_attr_name = etree.QName(attr_name).localname
if attr_name != local_attr_name:
attr_value = elem.attrib[attr_name]
del elem.attrib[attr_name]
elem.attrib[local_attr_name] = attr_value
deannotate(root, cleanup_namespaces=True)

Here are two other ways of removing namespaces. The first uses the lxml.etree.QName helper while the second uses regexes. Both functions allow an optional list of namespaces to match against. If no namespace list is supplied then all namespaces are removed. Attribute keys are also cleaned.
from lxml import etree
import re
def remove_namespaces_qname(doc, namespaces=None):
for el in doc.getiterator():
# clean tag
q = etree.QName(el.tag)
if q is not None:
if namespaces is not None:
if q.namespace in namespaces:
el.tag = q.localname
else:
el.tag = q.localname
# clean attributes
for a, v in el.items():
q = etree.QName(a)
if q is not None:
if namespaces is not None:
if q.namespace in namespaces:
del el.attrib[a]
el.attrib[q.localname] = v
else:
del el.attrib[a]
el.attrib[q.localname] = v
return doc
def remove_namespace_re(doc, namespaces=None):
if namespaces is not None:
ns = list(map(lambda n: u'{%s}' % n, namespaces))
for el in doc.getiterator():
# clean tag
m = re.match(r'({.+})(.+)', el.tag)
if m is not None:
if namespaces is not None:
if m.group(1) in ns:
el.tag = m.group(2)
else:
el.tag = m.group(2)
# clean attributes
for a, v in el.items():
m = re.match(r'({.+})(.+)', a)
if m is not None:
if namespaces is not None:
if m.group(1) in ns:
del el.attrib[a]
el.attrib[m.group(2)] = v
else:
del el.attrib[a]
el.attrib[m.group(2)] = v
return doc

all you need to do is:
objectify.deannotate(root, cleanup_namespaces=True)
after you have get the root, by using root = tree.getroot()

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Getting subelements using lxml and iterparse - python

I would use XPath instead. It's much more elegant than walking the document on your own and certainly more efficient I assume.

Related

Is there a way to skip nodes/elements with iterparse lxml?

XML parsing using ElementTree thrown of by lower case attribute of root node in Python [duplicate]

Python lxml: how to fetch XML tag names with xpath selector?

How to parse the xml with xmlns attribute using python

Remove namespace and prefix from xml in python using lxml

Categories

Resources