Given these XML documents:
Document 1
<root>
<element1>
</element1>
</root>
Document 2
<request>
<dummyValue>5</dummyValue>
</request>
Using Python's ElementTree, I'd like to insert the second document into the first document so that the result looks as follows.
Resulting document
<root>
<element1>
<request>
<dummyValue>5</dummyValue>
</request>
</element1>
</root>
ET.SubElement(element1, request) gives me a serialization error.
Is there a Pythonic way of doing this?
SubElement() constructs an Element and then attaches it to the tree. Since you already have request as an Element, you don't need to construct a new one.
Try element1.append(request), like so:
import xml.etree.ElementTree as ET
doc1 = ET.XML('''
<root>
<element1>
</element1>
</root>
''')
request = ET.XML('''
<request>
<dummyValue>5</dummyValue>
</request>
''')
for element1 in doc1.findall('element1'):
    element1.append(request)
ET.dump(doc1)
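To make the difference between the two calls concrete, here is a minimal sketch. The `created` tag and the compact input document are made up for illustration and are not part of the question.

```python
import xml.etree.ElementTree as ET

root = ET.XML("<root><element1/></root>")
element1 = root.find("element1")

# SubElement expects a tag name (a string) and creates a new, empty child.
# "created" is a hypothetical tag used only for this illustration.
created = ET.SubElement(element1, "created")
created.text = "new"

# append expects an already-built Element and attaches it as-is.
request = ET.XML("<request><dummyValue>5</dummyValue></request>")
element1.append(request)

result = ET.tostring(root, encoding="unicode")
print(result)
```

Passing an Element where SubElement expects a tag name is the likely cause of the serialization error in the question: the tag of the new child ends up being an Element object rather than a string.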
Related
I'm parsing an XML file using ElementTree. In my case, the root of the tree is returned as an XML itself. How do I further parse it to extract the text inside the element <a:Message>?
tree = ETree.ElementTree(response)
print("tree:---", tree)
print("root:---", tree.getroot())
print("element found:---", tree.getroot().findall("./a:Message"))
Output
tree:--- <xml.etree.ElementTree.ElementTree object at 0x00000>
root:--- <s:Envelope xmlns:s="http://www.w3.org/2003/05/soap-envelope">
<s:Header>
<o:Security s:mustUnderstand="1"
xmlns:o="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd">
<!-- Sample XML -->
</o:Security>
</s:Header>
<s:Body>
<Response xmlns="http://tempuri.org/">
<Result xmlns:a="http://xmldataschemas.data">
<!-- Fields must be in this exact order. -->
<a:Message>xxx Document is being processed</a:Message>
<a:ResponseCode>DOCUMENT_ERROR</a:ResponseCode>
</Result>
</Response>
</s:Body>
</s:Envelope>
element found:--- None
You have to deal with the namespaces in your XML: map the a prefix to its URI and pass that mapping to find. So try this instead:
ns = {'a': 'http://xmldataschemas.data'}
root.find('.//a:Message',ns).text
Output:
'xxx Document is being processed'
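For completeness, a self-contained sketch of the same lookup. It assumes `response` holds the raw XML text (the question does not show how it is obtained) and uses a trimmed copy of the envelope from the question.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the envelope from the question; assumed to arrive as text.
response = """<s:Envelope xmlns:s="http://www.w3.org/2003/05/soap-envelope">
  <s:Body>
    <Response xmlns="http://tempuri.org/">
      <Result xmlns:a="http://xmldataschemas.data">
        <a:Message>xxx Document is being processed</a:Message>
        <a:ResponseCode>DOCUMENT_ERROR</a:ResponseCode>
      </Result>
    </Response>
  </s:Body>
</s:Envelope>"""

# Parse the text first; ElementTree(...) alone only wraps an existing element.
root = ET.fromstring(response)

# Map the prefix used in the path to the namespace URI from the document.
ns = {"a": "http://xmldataschemas.data"}
message = root.find(".//a:Message", ns)
text = message.text if message is not None else None

# Same lookup with the URI written out in Clark notation, no mapping needed.
same = root.findtext(".//{http://xmldataschemas.data}Message")
print(text)
```

The prefix in the mapping only has to match the one used in the path expression; it is the URI that has to match the document.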
I have an XML file similar to the following:
<xml>
<Catalog>
<Book>
<Textbook>
<Author ="MEMO" />
</Textbook>
</Book>
<Journal>
<Science>
<Author ="David" />
</Science>
</Journal>
</Catalog>
</xml>
I would like to write Python code that finds and prints the XPath of every node in my XML file. Any ideas or suggestions would be much appreciated. Is there a module I can use to find the full path? For example, the result should look like:
MEMO: Catalog/Book/Textbook/Author
It can be done with lxml:
import lxml.html as lh
from lxml import etree
books = """[your html above]"""
doc = lh.fromstring(books)
tree = etree.ElementTree(doc)
for e in doc.iter('author'):
    print("Memo: ",tree.getpath(e).replace('/xml/',''))
Output:
Memo: catalog/book/textbook/author
Memo: catalog/journal/science/author
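The HTML parser lowercases tag names, which is why the paths above come out in lowercase. If the original case matters, a possible alternative is to walk the tree with the standard library and build the paths by hand. This is only a sketch: the attribute name `name` is an assumption (the sample in the question omits it), and `iter_paths` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# The attribute name "name" is assumed; the question's sample leaves it out.
catalog_xml = """<xml>
  <Catalog>
    <Book><Textbook><Author name="MEMO" /></Textbook></Book>
    <Journal><Science><Author name="David" /></Science></Journal>
  </Catalog>
</xml>"""


def iter_paths(elem, prefix=""):
    """Yield (element, path) pairs for elem and all of its descendants."""
    path = f"{prefix}/{elem.tag}" if prefix else elem.tag
    yield elem, path
    for child in elem:
        yield from iter_paths(child, path)


root = ET.fromstring(catalog_xml)
lines = []
for elem, path in iter_paths(root):
    if elem.tag == "Author":
        # Drop the leading "xml/" so the path starts at Catalog.
        lines.append(f"{elem.get('name')}: {path.split('/', 1)[1]}")
print("\n".join(lines))
```

Siblings that share a tag get identical paths here; positional indexes such as Author[2] would have to be added separately if they are needed.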
I am working on XML content that contains elements which may hold potentially malformed XML/markup-like (e.g. HTML) content as text. For example:
<root>
<data>
<x>foo<y>bar</y>
</data>
<data>
<z>foo<y>bar</y>
</data>
</root>
Goal: I want lxml.etree to not attempt to parse anything under data-elements as XML but rather simply return it as bytes or str (can be in elem.text).
The files are big and I wanted to use lxml.etree.iterparse to extract the contents found in data-elements.
Initial Idea: A straightforward way to just get the contents of the element (in this case containing the data start- and end-tags) could be:
from io import BytesIO
data = BytesIO(b"""
<root>
<data>
<x>foo<y>bar</y>
</data>
<data>
<z>foo<y>bar</y>
</data>
</root>
""")
from lxml import etree
# see below why html=True
context = etree.iterparse(data, events=("end",), tag=("data",), html=True)
contents = [] # I don't keep lists in the "real" application
for event, elem in context:
    contents.append(etree.tostring(elem)) # get back the full content underneath data
The problem with this is that lxml.etree can run into issues parsing the children of data (for example: I already had to use html=True to not run into issues when html-data is stored under data). I know that there are custom element classes in lxml, but from how I understand the documentation, they do not change lxml.etree's parsing behaviour (which is dictated by libxml2).
Is there any easy way to tell lxml not to parse element content as children? The application itself benefits from other lxml functionality which I would have to replicate if I wrote a custom extractor for data alone.
Or could there be a way to use XSLT to first transform the input for processing in lxml and to later link back the data?
Does this work as expected?
The XML is modified by adding DTD and CDATA to specify that the content inside the data element has to be treated as character data.
import io
data = io.BytesIO(b'''<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE root [
<!ELEMENT root (data+)>
<!ELEMENT data (#PCDATA)>
]>
<root>
<data>
<![CDATA[
<x>foo<y>bar</y>
]]>
</data>
<data>
<![CDATA[
<z>foo<y>bar</y>
]]>
</data>
</root>
''')
from lxml import etree
# validate against the inline DTD; the CDATA sections keep the markup as text
context = etree.iterparse(data, events=("end",), tag=("data",), dtd_validation=True, load_dtd=True)
contents = [] # I don't keep lists in the "real" application
for event, elem in context:
    contents.append(etree.tostring(elem)) # get back the full content underneath data
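Once the markup is wrapped in CDATA as above, the parser returns it as plain text in elem.text rather than as child elements. Here is a minimal sketch of that effect, using the standard library's xml.etree.ElementTree.iterparse in place of lxml so it runs without third-party packages. The helper name `iter_data_text` is made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Same idea as the modified input above: the markup inside data is CDATA.
xml_bytes = b"""<root>
<data><![CDATA[<x>foo<y>bar</y>]]></data>
<data><![CDATA[<z>foo<y>bar</y>]]></data>
</root>"""


def iter_data_text(source):
    """Yield the raw text of every data element, one element at a time."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "data":
            yield elem.text
            # Free the element's content once it has been consumed.
            elem.clear()


raw_contents = list(iter_data_text(io.BytesIO(xml_bytes)))
print(raw_contents)
```

This only helps if the producer of the files can add the CDATA wrapping; content that itself contains `]]>` would need extra handling.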
I am working on an XML file that contains SOAP tags. I want to remove those SOAP tags as part of an XML cleanup process.
How can I achieve this in either Python or Scala? A shell script should not be used.
Sample Input :
<?xml version="1.0" encoding="UTF-8"?>
<soap:Envelope xmlns:soap="http://sample.com/">
<soap:Body>
<com:RESPONSE xmlns:com="http://sample.com/">
<Student>
<StudentID>100234</StudentID>
<Gender>Male</Gender>
<Surname>Robert</Surname>
<Firstname>Mathews</Firstname>
</Student>
</com:RESPONSE>
</soap:Body>
</soap:Envelope>
Expected Output :
<?xml version="1.0" encoding="UTF-8"?>
<com:RESPONSE xmlns:com="http://sample.com/">
<Student>
<StudentID>100234</StudentID>
<Gender>Male</Gender>
<Surname>Robert</Surname>
<Firstname>Mathews</Firstname>
</Student>
</com:RESPONSE>
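This could help you! Note that soap is a namespace prefix, not a tag name, so the elements have to be selected by namespace. The sketch below serializes the first child of soap:Body instead of removing nodes:
from lxml import etree
doc = etree.parse('test.xml')
ns = {'soap': 'http://sample.com/'}  # URI taken from the sample input
# the payload is whatever element sits directly inside soap:Body
payload = doc.xpath('/soap:Envelope/soap:Body/*', namespaces=ns)[0]
# lxml may still write the inherited xmlns:soap declaration on the payload
print(etree.tostring(payload, xml_declaration=True, encoding='UTF-8').decode())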
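A standard-library alternative, sketched on the assumption that the payload is always the first child of soap:Body and that the input is available as bytes (here an inline copy of the sample instead of a file):

```python
import xml.etree.ElementTree as ET

# Inline copy of the sample input from the question.
soap_bytes = b"""<?xml version="1.0" encoding="UTF-8"?>
<soap:Envelope xmlns:soap="http://sample.com/">
<soap:Body>
<com:RESPONSE xmlns:com="http://sample.com/">
<Student>
<StudentID>100234</StudentID>
<Gender>Male</Gender>
<Surname>Robert</Surname>
<Firstname>Mathews</Firstname>
</Student>
</com:RESPONSE>
</soap:Body>
</soap:Envelope>"""

NS_URI = "http://sample.com/"

# Ask ElementTree to write this URI with the "com" prefix (global setting).
ET.register_namespace("com", NS_URI)

envelope = ET.fromstring(soap_bytes)
body = envelope.find(f"{{{NS_URI}}}Body")

# Assumption: the payload is the first child element of soap:Body.
payload = list(body)[0]
result = ET.tostring(payload, encoding="unicode").strip()
print('<?xml version="1.0" encoding="UTF-8"?>')
print(result)
```

In the sample both prefixes point at the same URI, so registering `com` is enough to get the expected prefix back; with a real SOAP namespace the envelope would use its own URI in the `find` call.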
This XML document contains the set of tags events-data. I want to extract information from the most RECENT events-data. For example, in the code below I want to go to the last events-data tag, go down to the event-date tag and extract the text of the date child tag. At the moment I am using BeautifulSoup in Python to traverse this document. Any ideas?
<?xml version="1.0" encoding="UTF-8"?>
<first-tag>
<second-tag>
<events-data>
<event-date>
<date>20040913</date>
</event-date>
</events-data>
<events-data> #the one i want to traverse to grab date text
<event-date>
<date>20040913</date>
</event-date>
</events-data>
</second-tag>
</first-tag>
This is using BeautifulSoup 3
import os
import sys
# Import Custom libraries
from BeautifulSoup import BeautifulStoneSoup
xml_str = \
'''
<?xml version="1.0" encoding="UTF-8"?>
<first-tag>
<second-tag>
<events-data>
<event-date>
<date>20040913</date>
</event-date>
</events-data>
<events-data>
<event-date>
<date>20040913</date>
</event-date>
</events-data>
</second-tag>
</first-tag>
'''
soup = BeautifulStoneSoup(xml_str)
event_data_location = lambda x: x.name == "events-data"
events = soup.findAll(event_data_location)
if events:
    # The last events-data
    print events[-1].text
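The same lookup can also be sketched with Python 3 and the standard library instead of BeautifulSoup 3. It assumes that "most recent" means the last events-data element in document order; the first date is changed from the question's sample purely so the two blocks can be told apart.

```python
import xml.etree.ElementTree as ET

# Sample from the question; the first date is altered for illustration only
# so that the two events-data blocks are distinguishable.
xml_bytes = b"""<?xml version="1.0" encoding="UTF-8"?>
<first-tag>
  <second-tag>
    <events-data>
      <event-date><date>20030101</date></event-date>
    </events-data>
    <events-data>
      <event-date><date>20040913</date></event-date>
    </events-data>
  </second-tag>
</first-tag>"""

root = ET.fromstring(xml_bytes)

# findall returns matches in document order, so the last item is the
# final events-data block.
events = root.findall(".//events-data")
latest_date = events[-1].findtext("event-date/date") if events else None
print(latest_date)
```

If the blocks are not guaranteed to be in chronological order, the dates would have to be compared explicitly rather than relying on position.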