lxml: Do not parse subtree but treat as binary content - python

I am working on XML content that contains elements which may hold potentially malformed XML/markup-like (e.g. HTML) content as text. For example:
<root>
<data>
<x>foo<y>bar</y>
</data>
<data>
<z>foo<y>bar</y>
</data>
</root>
Goal: I want lxml.etree to not attempt to parse anything under data-elements as XML but rather simply return it as bytes or str (can be in elem.text).
The files are big, so I want to use lxml.etree.iterparse to extract the contents of the data elements.
Initial Idea: A straightforward way to just get the contents of the element (in this case containing the data start- and end-tags) could be:
from io import BytesIO
from lxml import etree

data = BytesIO(b"""
<root>
<data>
<x>foo<y>bar</y>
</data>
<data>
<z>foo<y>bar</y>
</data>
</root>
""")
# see below why html=True
context = etree.iterparse(data, events=("end",), tag=("data",), html=True)
contents = []  # I don't keep lists in the "real" application
for event, elem in context:
    contents.append(etree.tostring(elem))  # get back the full content underneath data
The problem with this is that lxml.etree can run into issues parsing the children of data (for example, I already had to use html=True to avoid errors when HTML is stored under data). I know that lxml has custom element classes, but as far as I understand the documentation, they do not change the parsing behaviour of lxml.etree, which is dictated by libxml2.
Is there an easy way to tell lxml not to parse element content as children? The application benefits from other lxml functionality, which I would have to replicate if I wrote a custom extractor for data alone.
Or could there be a way to use XSLT to first transform the input for processing in lxml and to later link back the data?

Does this work as expected?
The XML is modified by adding DTD and CDATA to specify that the content inside the data element has to be treated as character data.
import io
data = io.BytesIO(b'''<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE root [
<!ELEMENT root (data+)>
<!ELEMENT data (#PCDATA)>
]>
<root>
<data>
<![CDATA[
<x>foo<y>bar</y>
]]>
</data>
<data>
<![CDATA[
<z>foo<y>bar</y>
]]>
</data>
</root>
''')
from lxml import etree
# load_dtd and dtd_validation make the parser validate against the DTD declared in the document
context = etree.iterparse(data, events=("end",), tag=("data",), dtd_validation=True, load_dtd=True)
contents = []  # I don't keep lists in the "real" application
for event, elem in context:
    # elem.text holds the unparsed CDATA content as a string
    contents.append(etree.tostring(elem))  # get back the full content underneath data
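If the input does not already contain CDATA sections, one way to add them is a preprocessing pass before parsing. The following is only a minimal sketch under stated assumptions: the helper name wrap_data_in_cdata is hypothetical, the data elements are assumed to have no attributes, to be unnested, and never to contain the sequence ]]>, and the whole document is held in memory (which may not suit very large files). It uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without third-party packages.

```python
import io
import re
import xml.etree.ElementTree as ET

raw = b"""<root>
<data>
<x>foo<y>bar</y>
</data>
<data>
<z>foo<y>bar</y>
</data>
</root>
"""

# Matches the body of each <data> element (non-greedy, across newlines).
DATA_BODY = re.compile(rb"<data>(.*?)</data>", re.DOTALL)


def wrap_data_in_cdata(document: bytes) -> bytes:
    """Hypothetical helper: wrap every <data> body in a CDATA section."""
    return DATA_BODY.sub(
        lambda m: b"<data><![CDATA[" + m.group(1) + b"]]></data>",
        document,
    )


wrapped = wrap_data_in_cdata(raw)

raw_contents = []
for event, elem in ET.iterparse(io.BytesIO(wrapped), events=("end",)):
    if elem.tag == "data":
        # The CDATA body arrives as plain text; the malformed markup is never parsed.
        raw_contents.append(elem.text)
        elem.clear()  # release memory for elements that are already processed
```

After the loop, raw_contents holds the unparsed markup of each data element as a string, including the surrounding newlines.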

Related

Parsing an XML using ElementTree: The root of the tree is returned as an XML itself. How do I further parse it to find an element?

I'm parsing an XML file using ElementTree. In my case, the root of the tree is returned as an XML itself. How do I further parse it to extract the text inside the element <a:Message>?
tree = ETree.ElementTree(response)
print("tree:---", tree)
print("root:---", tree.getroot())
print("element found:---", tree.getroot().findall("./a:Message"))
Output
tree:--- <xml.etree.ElementTree.ElementTree object at 0x00000>
root:--- <s:Envelope xmlns:s="http://www.w3.org/2003/05/soap-envelope">
<s:Header>
<o:Security s:mustUnderstand="1"
xmlns:o="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd">
<!-- Sample XML -->
</o:Security>
</s:Header>
<s:Body>
<Response xmlns="http://tempuri.org/">
<Result xmlns:a="http://xmldataschemas.data">
<!-- Fields must be in this exact order. -->
<a:Message>xxx Document is being processed</a:Message>
<a:ResponseCode>DOCUMENT_ERROR</a:ResponseCode>
</Result>
</Response>
</s:Body>
</s:Envelope>
element found:--- None
You have to deal with the namespaces in your XML. Map the prefix to its URI and pass the mapping to find():
ns = {'a': 'http://xmldataschemas.data'}
root.find('.//a:Message', ns).text  # root is the parsed root element
Output:
'xxx Document is being processed'
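The output in the question suggests that response is a string of XML rather than a parsed element, so it probably has to be parsed before any lookup can work. A minimal end-to-end sketch, assuming a shortened copy of the response is available as a str:

```python
import xml.etree.ElementTree as ET

# Shortened copy of the SOAP response from the question (an assumption for illustration).
response = """<s:Envelope xmlns:s="http://www.w3.org/2003/05/soap-envelope">
  <s:Body>
    <Response xmlns="http://tempuri.org/">
      <Result xmlns:a="http://xmldataschemas.data">
        <a:Message>xxx Document is being processed</a:Message>
        <a:ResponseCode>DOCUMENT_ERROR</a:ResponseCode>
      </Result>
    </Response>
  </s:Body>
</s:Envelope>"""

# Parse the string into an element tree first.
root = ET.fromstring(response)

# The prefix used in the query only has to match the key in this mapping;
# what identifies the namespace is the URI.
ns = {"a": "http://xmldataschemas.data"}

message = root.find(".//a:Message", ns).text
code = root.find(".//a:ResponseCode", ns).text
```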

Insert XML document into existing XML with Python

Given these XML documents:
Document 1
<root>
<element1>
</element1>
</root>
Document 2
<request>
<dummyValue>5</dummyValue>
</request>
Using Python's ElementTree, I'd like to insert the second document into the first document so that the result looks as follows.
Resulting document
<root>
<element1>
<request>
<dummyValue>5</dummyValue>
</request>
</element1>
</root>
ET.SubElement(element1, request) gives me a serialization error.
Is there a Pythonic way of doing this?
SubElement() constructs an Element and then attaches it to the tree. Since you already have request as an Element, you don't need to construct a new one.
Try element1.append(request), like so:
import xml.etree.ElementTree as ET
doc1 = ET.XML('''
<root>
<element1>
</element1>
</root>
''')
request = ET.XML('''
<request>
<dummyValue>5</dummyValue>
</request>
''')
for element1 in doc1.findall('element1'):
    element1.append(request)
ET.dump(doc1)
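If the first document had more than one element1, appending the same request object to each would not produce independent copies. In xml.etree.ElementTree the same object would be shared by every parent, and in lxml an element has a single parent, so each append would move it. A small sketch that inserts a separate copy each time, assuming a document with two element1 entries:

```python
import copy
import xml.etree.ElementTree as ET

doc1 = ET.XML("<root><element1/><element1/></root>")
request = ET.XML("<request><dummyValue>5</dummyValue></request>")

for element1 in doc1.findall("element1"):
    # deepcopy gives every target its own independent subtree
    element1.append(copy.deepcopy(request))

result = ET.tostring(doc1, encoding="unicode")
```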

Parse XML in Python (one correct way to do so) using lxml

I am getting a response using requests module in Python and the response is in form of xml. I want to parse it and get details out of each 'dt' tag. I am not able to do that using lxml.
Here is the xml response:
<?xml version="1.0" encoding="utf-8" ?>
<entry_list version="1.0">
<entry id="harsh">
<ew>harsh</ew><subj>MD-2</subj><hw>harsh</hw>
<sound><wav>harsh001.wav</wav><wpr>!h#rsh</wpr></sound>
<pr>ˈhärsh</pr>
<fl>adjective</fl>
<et>Middle English <it>harsk,</it> of Scandinavian origin; akin to Norwegian <it>harsk</it> harsh</et>
<def>
<date>14th century</date>
<sn>1</sn>
<dt>:having a coarse uneven surface that is rough or unpleasant to the touch</dt>
<sn>2 a</sn>
<dt>:causing a disagreeable or painful sensory reaction :<sx>irritating</sx></dt>
<sn>b</sn>
<dt>:physically discomforting :<sx>painful</sx></dt>
<sn>3</sn>
<dt>:unduly exacting :<sx>severe</sx></dt>
<sn>4</sn>
<dt>:lacking in aesthetic appeal or refinement :<sx>crude</sx></dt>
<ss>rough</ss>
</def>
<uro><ure>harsh*ly</ure> <fl>adverb</fl></uro>
<uro><ure>harsh*ness</ure> <fl>noun</fl></uro>
</entry>
</entry_list>
A simple way would be to traverse down the hierarchy of the xml document.
import requests
from lxml import etree

response = requests.get(url)  # url is the address of the XML endpoint
root = etree.fromstring(response.content)
print(root.xpath('//entry_list/entry/def/dt/text()'))
This returns the text nodes directly inside each 'dt' element in the XML document.
An alternative using xml.dom.minidom from the standard library:
from xml.dom import minidom

# List with dt values
dt_elems = []
# Process xml getting elements by tag name
xmldoc = minidom.parse('text.xml')
itemlist = xmldoc.getElementsByTagName('dt')
# Get the values
for i in itemlist:
    dt_elems.append(" ".join(t.nodeValue for t in i.childNodes if t.nodeType == t.TEXT_NODE))
# Print the list result
print(dt_elems)
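Both approaches above collect only the text that sits directly inside each dt element, so the words inside nested elements such as sx are left out. If the full definition text is wanted, itertext() walks all descendant text. A minimal sketch on a shortened copy of the response, using xml.etree.ElementTree (lxml elements offer the same itertext() method):

```python
import xml.etree.ElementTree as ET

# Shortened copy of the response from the question (an assumption for illustration).
xml_text = """<entry_list version="1.0">
<entry id="harsh">
<def>
<dt>:having a coarse uneven surface that is rough or unpleasant to the touch</dt>
<dt>:unduly exacting :<sx>severe</sx></dt>
</def>
</entry>
</entry_list>"""

root = ET.fromstring(xml_text)

# itertext() yields the element's own text plus the text of all descendants, in order.
definitions = ["".join(dt.itertext()) for dt in root.iter("dt")]
```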

Change parent in xml by Python (lxml)

Hi, I am parsing and completely modifying an XML file in Python 3 using lxml, and I need to put a new element into an existing element and move its children under the new element.
Example:
old xml
<a>
<b>something</b>
<c>something different</c>
</a>
new xml
<a>
<new_parent>
<b>something</b>
<c>something different</c>
</new_parent>
</a>
Is it possible?
I'm not sure there is a function that does directly what you want. I would do it as follows: create a new_parent node, append the children of a to it, and then append new_parent to a.
import lxml.etree
# bytes literal: lxml rejects str input that carries an encoding declaration
xml = b'''<?xml version='1.0' encoding='ASCII'?>
<root>
<a>
<b>something</b>
<c>something different</c>
</a>
</root>'''
root = lxml.etree.fromstring(xml)
a = root.find('.//a')
parent = lxml.etree.Element('new_parent')
for child in list(a):  # iterate over a copy, since append() moves each child out of a
    parent.append(child)
a.append(parent)
print(lxml.etree.tostring(root, xml_declaration=True).decode())
prints (output format is modified to make it easy to read)
<?xml version='1.0' encoding='ASCII'?>
<root>
<a>
<new_parent>
<b>something</b>
<c>something different</c>
</new_parent>
</a>
</root>
UPDATE You can use extend instead of multiple calls of append.
root = lxml.etree.fromstring(xml)
a = root.find('.//a')
parent = lxml.etree.Element('new_parent')
parent.extend(a)
a.append(parent)
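The snippets above rely on lxml moving an element when it is appended to a new parent. For comparison, here is a sketch of the same restructuring with the standard library's xml.etree.ElementTree, where append() does not detach the element from its old parent, so each child has to be removed explicitly. Switching libraries is an assumption of this sketch, not part of the original answer.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<root><a><b>something</b><c>something different</c></a></root>"
)
a = root.find("a")

new_parent = ET.Element("new_parent")
for child in list(a):  # iterate over a copy because a is modified in the loop
    a.remove(child)    # ElementTree does not detach the child automatically
    new_parent.append(child)
a.append(new_parent)

result = ET.tostring(root, encoding="unicode")
```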

How to get XML tag value in Python

I have some XML in a unicode-string variable in Python as follows:
<?xml version='1.0' encoding='UTF-8'?>
<results preview='0'>
<meta>
<fieldOrder>
<field>count</field>
</fieldOrder>
</meta>
<result offset='0'>
<field k='count'>
<value><text>6</text></value>
</field>
</result>
</results>
How do I extract the 6 in <value><text>6</text></value> using Python?
With lxml:
import lxml.etree
# xmlstr is your XML in a str; encode it, because lxml rejects str input with an encoding declaration
root = lxml.etree.fromstring(xmlstr.encode('utf-8'))
textelem = root.find('result/field/value/text')
print(textelem.text)
Edit: But I imagine there could be more than one result...
import lxml.etree
# xmlstr is your XML in a str; encode it, because lxml rejects str input with an encoding declaration
root = lxml.etree.fromstring(xmlstr.encode('utf-8'))
results = root.findall('result')
textnumbers = [r.find('field/value/text').text for r in results]
BeautifulSoup is the simplest way to parse XML as far as I know.
Assuming you have read its introduction, simply use:
from bs4 import BeautifulSoup
soup = BeautifulSoup(xmlstr, 'xml')  # xmlstr is your XML string; the 'xml' parser requires lxml to be installed
print(soup.find('text').string)
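If the document may contain several field elements, selecting by the k attribute is a little more targeted than taking the first match. A minimal sketch with the standard library's xml.etree.ElementTree, assuming the XML is available as a str without an encoding declaration:

```python
import xml.etree.ElementTree as ET

# Copy of the document from the question, without the XML declaration.
xmlstr = """<results preview='0'>
<meta><fieldOrder><field>count</field></fieldOrder></meta>
<result offset='0'>
<field k='count'><value><text>6</text></value></field>
</result>
</results>"""

root = ET.fromstring(xmlstr)

# [@k='count'] restricts the match to the field whose k attribute is 'count'.
count_text = root.findtext("result/field[@k='count']/value/text")
count = int(count_text)
```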
