Hi, I am parsing and heavily modifying an XML file in Python 3 using lxml, and I need to insert a new element between an existing element and its children, so that the children get a new parent.
Example:
old xml
<a>
<b>something</b>
<c>something different</c>
</a>
new xml
<a>
<new_parent>
<b>something</b>
<c>something different</c>
</new_parent>
</a>
Is it possible?
I'm not sure there is a function that does exactly what you want directly. I would do it as follows: create a new_parent node, move the children of a into new_parent, and then append new_parent to a.
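import lxml.etree
# bytes input: under Python 3, lxml rejects str input that carries an encoding declaration
xml = b'''<?xml version='1.0' encoding='ASCII'?>
<root>
<a>
<b>something</b>
<c>something different</c>
</a>
</root>'''
root = lxml.etree.fromstring(xml)
a = root.find('.//a')
parent = lxml.etree.Element('new_parent')
for child in list(a):  # iterate over a copy, since append() moves each child out of a
    parent.append(child)
a.append(parent)
print(lxml.etree.tostring(root, xml_declaration=True).decode())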
prints (output format is modified to make it easy to read)
<?xml version='1.0' encoding='ASCII'?>
<root>
<a>
<new_parent>
<b>something</b>
<c>something different</c>
</new_parent>
</a>
</root>
UPDATE You can use extend instead of multiple calls of append.
root = lxml.etree.fromstring(xml)
a = root.find('.//a')
parent = lxml.etree.Element('new_parent')
parent.extend(a)
a.append(parent)
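The steps above can be folded into a small reusable helper. This is only a sketch: the name wrap_children is made up for illustration and is not part of lxml, and moving any leading text of the element into the wrapper is an assumption about what is wanted.

```python
from lxml import etree


def wrap_children(elem, tag):
    """Move all children of elem into a new child element named tag.

    wrap_children is a hypothetical helper, not an lxml function.
    """
    wrapper = etree.Element(tag)
    # Assumption: leading text of elem should end up inside the wrapper too.
    wrapper.text, elem.text = elem.text, None
    # list(elem) takes a snapshot of the children before they are moved.
    wrapper.extend(list(elem))
    elem.append(wrapper)
    return wrapper


root = etree.fromstring(
    "<root><a><b>something</b><c>something different</c></a></root>"
)
wrap_children(root.find("a"), "new_parent")
result = etree.tostring(root)
print(result.decode())
```

The helper returns the new wrapper element, so the caller can keep modifying it (for example, setting attributes) without searching for it again.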
I am working on XML content that contains elements which may hold potentially malformed XML/markup-like (e.g. HTML) content as text. For example:
<root>
<data>
<x>foo<y>bar</y>
</data>
<data>
<z>foo<y>bar</y>
</data>
</root>
Goal: I want lxml.etree to not attempt to parse anything under data-elements as XML but rather simply return it as bytes or str (can be in elem.text).
The files are big and I wanted to use lxml.etree.iterparse to extract the contents found in data-elements.
Initial Idea: A straightforward way to just get the contents of the element (in this case containing the data start- and end-tags) could be:
from io import BytesIO
from lxml import etree
data = BytesIO(b"""
<root>
<data>
<x>foo<y>bar</y>
</data>
<data>
<z>foo<y>bar</y>
</data>
</root>
""")
# see below why html=True
context = etree.iterparse(data, events=("end",), tag=("data",), html=True)
contents = []  # I don't keep lists in the "real" application
for event, elem in context:
    contents.append(etree.tostring(elem))  # get back the full content underneath data
The problem with this is that lxml.etree can run into issues parsing the children of data (for example, I already had to use html=True to avoid problems when HTML data is stored under data). I know that there are custom element classes in lxml, but as far as I understand the documentation, they do not change lxml.etree's parsing behaviour (which is dictated by libxml2).
Is there any easy way to tell lxml not to parse element content as children? The application itself benefits from other lxml functionality, which I would have to replicate if I wrote a custom extractor for data alone.
Or could there be a way to use XSLT to first transform the input for processing in lxml and to later link back the data?
Does this work as expected?
The XML is modified by adding DTD and CDATA to specify that the content inside the data element has to be treated as character data.
import io
data = io.BytesIO(b'''<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE root [
<!ELEMENT root (data+)>
<!ELEMENT data (#PCDATA)>
]>
<root>
<data>
<![CDATA[
<x>foo<y>bar</y>
]]>
</data>
<data>
<![CDATA[
<z>foo<y>bar</y>
]]>
</data>
</root>
''')
from lxml import etree
# validate against the inline DTD, which declares data as character data only
context = etree.iterparse(data, events=("end",), tag=("data",), dtd_validation=True, load_dtd=True)
contents = []  # I don't keep lists in the "real" application
for event, elem in context:
    contents.append(etree.tostring(elem))  # get back the full content underneath data
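If the goal is to get the raw markup back as a string, as the question asks, reading elem.text may be more convenient than etree.tostring(elem), which escapes the content again. A minimal sketch, assuming the input already wraps each payload in CDATA as above; the variable names are illustrative and the DTD is left out for brevity:

```python
import io
from lxml import etree

# Assumption: every data element wraps its payload in a CDATA section.
raw = io.BytesIO(b"""<root>
<data><![CDATA[<x>foo<y>bar</y>]]></data>
<data><![CDATA[<z>foo<y>bar</y>]]></data>
</root>""")

payloads = []
for _event, elem in etree.iterparse(raw, events=("end",), tag="data"):
    # The CDATA content arrives as plain text, not as child elements.
    payloads.append(elem.text)
    # Release the element's content to keep memory use low on big files.
    elem.clear()

print(payloads)
```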
Using ElementTree, how do I place a comment just below the XML declaration and above the root element?
I have tried root.append(comment), but this places the comment as the last child of root. Can I append the comment to whatever is root's parent?
Thanks.
Here is how a comment can be added in the wanted position (after XML declaration, before root element) with lxml, using the addprevious() method.
from lxml import etree
root = etree.fromstring('<root><x>y</x></root>')
comment = etree.Comment('This is a comment')
root.addprevious(comment) # Add the comment as a preceding sibling
etree.ElementTree(root).write("out.xml",
pretty_print=True,
encoding="UTF-8",
xml_declaration=True)
Result (out.xml):
<?xml version='1.0' encoding='UTF-8'?>
<!--This is a comment-->
<root>
<x>y</x>
</root>
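To check the placement without writing a file, the tree can be serialized in memory. A small sketch; note that it serializes the tree obtained from getroottree(), since the comment is a sibling of root rather than part of it:

```python
from lxml import etree

root = etree.fromstring('<root><x>y</x></root>')
root.addprevious(etree.Comment('This is a comment'))

# Serialize the whole document tree so that top-level siblings of root are included.
serialized = etree.tostring(
    root.getroottree(),
    pretty_print=True,
    encoding="UTF-8",
    xml_declaration=True,
)
print(serialized.decode("UTF-8"))
```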
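With the standard library's xml.etree.ElementTree, a comment can be inserted as the first child of root (note that this places it inside root, not above it):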
import xml.etree.ElementTree as ET
root = ET.fromstring('<root><e1><e2></e2></e1></root>')
comment = ET.Comment('Here is a Comment')
root.insert(0, comment)
ET.dump(root)
output
<root><!--Here is a Comment--><e1><e2 /></e1></root>
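To get the comment above the root element while staying with xml.etree.ElementTree, one possible workaround is to serialize the pieces separately and join them. This is a sketch that assumes plain string assembly is acceptable; the hard-coded declaration line is an illustrative choice:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<root><e1><e2></e2></e1></root>')
comment = ET.Comment('Here is a Comment')

# Assemble the document by hand: declaration, then comment, then root element.
document = "\n".join([
    "<?xml version='1.0' encoding='UTF-8'?>",  # written manually (assumption: UTF-8 output)
    ET.tostring(comment, encoding="unicode"),
    ET.tostring(root, encoding="unicode"),
])
print(document)
```

When writing the result to disk, the file should be opened with encoding="utf-8" so that it matches the declaration.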
Is it possible to comment out an xml element with python's lxml while preserving the original element rendering inside the comment? I tried the following
elem.getparent().replace(elem, etree.Comment(etree.tostring(elem, pretty_print=True)))
but tostring() adds the namespace declaration.
The namespace of the commented-out element is inherited from the root element. Demo:
from lxml import etree
XML = """
<root xmlns='foo'>
<a>
<b>AAA</b>
</a>
</root>"""
root = etree.fromstring(XML)
b = root.find(".//{foo}b")
b.getparent().replace(b, etree.Comment(etree.tostring(b)))
print(etree.tostring(root).decode())
Result:
<root xmlns="foo">
<a>
<!--<b xmlns="foo">AAA</b>
--></a>
</root>
Manipulating namespaces is often harder than you might suspect. See https://stackoverflow.com/a/31870245/407651.
My suggestion here is to use BeautifulSoup, which in practice does not really care about namespaces (soup.find('b') returns the b element even though it is in the foo namespace).
from bs4 import BeautifulSoup, Comment
soup = BeautifulSoup(XML, "xml")
b = soup.find('b')
b.replace_with(Comment(str(b)))
print(soup.prettify())
Result:
<?xml version="1.0" encoding="utf-8"?>
<root xmlns="foo">
<a>
<!--<b>AAA</b>-->
</a>
</root>
So for example I have XML doc:
<?xml version="1.0"?>
<a>
<b>Text I need</b>
</a>
<a>
<b>Text I need2</b>
</a>
How do I extract the text inside all the b elements? I read my whole file into a string.
I only know how to parse HTML; I tried applying the same approach to XML, but it failed.
from lxml import html
string = myfile.read();
tree = html.fromstring(string);
result = tree.xpath('//a/#b');
But it won't work.
The first thing to do is make sure that your XML file is well-formed. An XML document must have exactly one root element, and the file in the question has two top-level a elements, so the lxml parser will fail on it. The name of the wrapping element is arbitrary; "body" is used here. May I make this suggestion:
<?xml version="1.0"?>
<body>
<a>
<b>Text I need</b>
</a>
<a>
<b>Text I need2</b>
</a>
</body>
Let us refer to this file as "foo.xml". Now that this data format is better for parsing, import etree from the lxml library:
from lxml import etree as et
Now it is time to parse the data and create a root object from which to start:
file_name = r"C:\foo.xml"
xmlParse = et.parse(file_name) #Parse the xml file
root = xmlParse.getroot() #Get the root
Once the root object has been obtained, we can use the iter() method to iterate through all b tags (the older getiterator() name is deprecated). Because iter() returns an iterator, we can use a list comprehension to save the element objects in a list. From there we can edit the text between the b tags:
bTags = [tag for tag in root.iter("b")] #List comprehension with the iterator
bTags[0].text = "Change b tag 1." #Change tag from "Text I need"
bTags[1].text = "Change b tag 2." #Change tag from "Text I need2"
xmlParse.write(file_name) #Edit original xml file
The final output should look something like this:
<?xml version="1.0"?>
<body>
<a>
<b>Change b tag 1.</b>
</a>
<a>
<b>Change b tag 2.</b>
</a>
</body>
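To only read the texts inside the b elements, as the question asks, an XPath query on the wrapped document is enough. A short sketch, using an in-memory bytes string in place of the file for illustration:

```python
from lxml import etree

# Same content as foo.xml above, held in memory for the example.
doc = b"""<?xml version="1.0"?>
<body>
<a>
<b>Text I need</b>
</a>
<a>
<b>Text I need2</b>
</a>
</body>"""

root = etree.fromstring(doc)
# text() selects the text nodes of every b that is a child of an a element.
texts = root.xpath("//a/b/text()")
print(texts)
```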
I want to transform this xml tree
<doc>
<a>
<op>xxx</op>
</a>
</doc>
to
<doc>
<a>
<cls>
<op>xxx</op>
</cls>
</a>
</doc>
I use this python code
from lxml import etree
f = etree.fromstring('<doc><a><op>xxx</op></a></doc>')
node_a = f.xpath('/doc/a')[0]
ele = etree.Element('cls')
node_a.insert(0, ele)
node_cls = f.xpath('/doc/a/cls')[0]
node_op = f.xpath('/doc/a/op')[0]
node_cls.append(node_op)
print(etree.tostring(f, pretty_print=True).decode())
Is it the best solution?
Now I want to obtain
<cls>
<doc>
<a>
<op>xxx</op>
</a>
</doc>
</cls>
I am unable to find any solution.
Thanks for your help.
I prefer BeautifulSoup to lxml; I find it easier to handle.
Both problems can be solved using the same approach: first find the element, get its parent, create the new element, and put the old one inside it.
from bs4 import BeautifulSoup
import sys
soup = BeautifulSoup(open(sys.argv[1], 'r'), 'xml')
for e in soup.find_all(sys.argv[2]):
    p = e.parent
    cls = soup.new_tag('cls')
    e_extracted = e.extract()
    cls.append(e_extracted)
    p.append(cls)
print(soup.prettify())
The script accepts two arguments: the first one is the XML file and the second one is the tag to surround with the new tag. Run it like:
python3 script.py xmlfile op
That yields:
<?xml version="1.0" encoding="utf-8"?>
<doc>
<a>
<cls>
<op>
xxx
</op>
</cls>
</a>
</doc>
For <doc>, run it like:
python3 script.py xmlfile doc
With following result:
<?xml version="1.0" encoding="utf-8"?>
<cls>
<doc>
<a>
<op>
xxx
</op>
</a>
</doc>
</cls>
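The same approach can also be sketched with lxml alone. The helper name wrap is made up for illustration and is not part of lxml; handing the element's tail text over to the wrapper is an assumption about the desired result.

```python
from lxml import etree


def wrap(elem, tag):
    """Put elem inside a new element named tag and return the new element.

    wrap is a hypothetical helper, not an lxml function.
    """
    wrapper = etree.Element(tag)
    parent = elem.getparent()
    if parent is not None:
        # Assumption: text following elem should now follow the wrapper instead.
        wrapper.tail, elem.tail = elem.tail, None
        parent.replace(elem, wrapper)
    # Appending moves elem (and its subtree) into the wrapper.
    wrapper.append(elem)
    return wrapper


doc = etree.fromstring('<doc><a><op>xxx</op></a></doc>')

# First case: wrap op in cls.
wrap(doc.find('a/op'), 'cls')
inner = etree.tostring(doc)

# Second case: wrap the root itself; the returned wrapper is the new root.
new_root = wrap(doc, 'cls')
outer = etree.tostring(new_root)

print(inner.decode())
print(outer.decode())
```

Because the root element has no parent, the second call skips the replace step and simply appends doc to the new cls element.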