Is it possible to comment out an xml element with python's lxml while preserving the original element rendering inside the comment? I tried the following
elem.getparent().replace(elem, etree.Comment(etree.tostring(elem, pretty_print=True)))
but tostring() adds the namespace declaration.
The namespace of the commented-out element is inherited from the root element. Demo:
from lxml import etree
XML = """
<root xmlns='foo'>
<a>
<b>AAA</b>
</a>
</root>"""
root = etree.fromstring(XML)
b = root.find(".//{foo}b")
b.getparent().replace(b, etree.Comment(etree.tostring(b)))
print(etree.tostring(root).decode())
Result:
<root xmlns="foo">
<a>
<!--<b xmlns="foo">AAA</b>
--></a>
</root>
Manipulating namespaces is often harder than you might suspect. See https://stackoverflow.com/a/31870245/407651.
My suggestion here is to use BeautifulSoup, which in practice does not really care about namespaces (soup.find('b') returns the b element even though it is in the foo namespace).
from bs4 import BeautifulSoup, Comment
soup = BeautifulSoup(XML, "xml")
b = soup.find('b')
b.replace_with(Comment(str(b)))
print(soup.prettify())
Result:
<?xml version="1.0" encoding="utf-8"?>
<root xmlns="foo">
<a>
<!--<b>AAA</b>-->
</a>
</root>
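If you would rather stay with lxml, one possible workaround is to serialize a namespace-free copy of the element and put that into the comment. This is only a sketch under stated assumptions, not a built-in lxml feature: the helper name comment_out is hypothetical, it drops all namespace information from the copied subtree (so it only suits cases like the one above, where the namespace is inherited from an ancestor), and the serialized text must not contain "--", which is not allowed inside an XML comment.

```python
import copy
from lxml import etree

XML = """
<root xmlns='foo'>
<a>
<b>AAA</b>
</a>
</root>"""

def comment_out(elem):
    """Hypothetical helper: replace elem with a comment holding its markup, minus namespaces."""
    clone = copy.deepcopy(elem)
    # Rewrite every element tag in the copy to its local name.
    for node in clone.iter():
        if isinstance(node.tag, str):  # skip comments and processing instructions
            node.tag = etree.QName(node).localname
    # Drop namespace declarations that are no longer used by the copy.
    etree.cleanup_namespaces(clone)
    text = etree.tostring(clone, encoding="unicode", with_tail=False)
    comment = etree.Comment(text)
    comment.tail = elem.tail  # keep the whitespace that followed the element
    elem.getparent().replace(elem, comment)

root = etree.fromstring(XML)
comment_out(root.find(".//{foo}b"))
result = etree.tostring(root, encoding="unicode")
print(result)
```

The original tree keeps its namespace declaration on the root; only the text placed inside the comment is stripped.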
Related
I'm trying to get the text content of the 'event-id' tag in the XML, but the hyphenated name is not recognized as an element. I know the script itself works, because if I replace the hyphen with an underscore in the XML and run it, I get the expected result. Does anybody know what the problem could be?
<?xml version="1.0" encoding="UTF-8"?>
<eventsUpdate xmlns="http://nateng.com/xsd/NETworks">
<fullEventsUpdate xmlns="">
<fullEventUpdate xmlns="">
<event-reference xmlns="">
<event-id xmlns="">24425412</event-id>
<event-update xmlns="">34</event-update>
</event-reference>
</fullEventUpdate>
<fullEventUpdate xmlns="">
<event-reference xmlns="">
<event-id xmlns="">24342548</event-id>
<event-update xmlns="">34</event-update>
</event-reference>
</fullEventUpdate>
</fullEventsUpdate>
</eventsUpdate>
from bs4 import BeautifulSoup
dir_path = '20211006085201.xml'
file = open(dir_path, encoding='UTF-8')
contents = file.read()
soup = BeautifulSoup(contents, 'xml')
events = soup.find_all('fullEventUpdate')
print(' \n-------', len(events), 'events calculated on ', dir_path, '--------\n')
idi = soup.find_all('event-reference')
for x in range(0, len(events)):
    idText = (idi[x].event-id.get_text())
    print(idText)
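The problem is that idi[x].event-id is not a valid attribute lookup: Python reads it as idi[x].event minus id. You are also dealing with namespaced XML, and for that type of document you should use CSS selectors instead: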
events = soup.select('fullEventUpdate')
for event in events:
    print(event.select_one('event-id').text)
Output:
24425412
24342548
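A quick way to see why the attribute-style lookup in the question fails, independent of BeautifulSoup: Python identifiers cannot contain hyphens, so an expression like ref.event-id is read as a subtraction. The sketch below uses only the standard library's ast module (ast.unparse needs Python 3.9 or later); the variable name ref is just a placeholder.

```python
import ast

# Parse the expression without evaluating it.
tree = ast.parse("ref.event-id.get_text()", mode="eval")
expr = tree.body

# The top-level node is a subtraction: (ref.event) - (id.get_text())
is_subtraction = isinstance(expr, ast.BinOp) and isinstance(expr.op, ast.Sub)
left = ast.unparse(expr.left)    # left operand of the minus sign
right = ast.unparse(expr.right)  # right operand of the minus sign
print(is_subtraction, left, right)
```

With BeautifulSoup, a tag name that contains a hyphen therefore has to be passed as a string, for example ref.find('event-id') or the select_one call shown above.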
More generally, in dealing with xml documents, you are probably better off using something which supports xpath (like lxml or ElementTree).
For XML parsing, the idiomatic approach is to use XPath selectors.
In Python this can easily be achieved with the parsel package, which is similar to BeautifulSoup but built on top of lxml for full XPath support:
body = ...
from parsel import Selector
selector = Selector(body)
for event in selector.xpath("//event-reference"):
    print(event.xpath('event-id/text()').get())
results in:
24425412
24342548
Without any external lib (Just ElementTree)
import xml.etree.ElementTree as ET
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<eventsUpdate xmlns="http://nateng.com/xsd/NETworks">
<fullEventsUpdate xmlns="">
<fullEventUpdate xmlns="">
<event-reference xmlns="">
<event-id xmlns="">24425412</event-id>
<event-update xmlns="">34</event-update>
</event-reference>
</fullEventUpdate>
<fullEventUpdate xmlns="">
<event-reference xmlns="">
<event-id xmlns="">24342548</event-id>
<event-update xmlns="">34</event-update>
</event-reference>
</fullEventUpdate>
</fullEventsUpdate>
</eventsUpdate> '''
root = ET.fromstring(xml)
ids = [e.text for e in root.findall('.//event-id')]
print(ids)
output
['24425412', '24342548']
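The plain tag names work here because the descendants reset the default namespace with xmlns="", so only the root element is actually namespaced. A small sketch, again using only ElementTree, that makes this visible and pairs each event-id with its event-update (the variable names are illustrative, and the document is a trimmed copy of the one above):

```python
import xml.etree.ElementTree as ET

XML_DOC = """<eventsUpdate xmlns="http://nateng.com/xsd/NETworks">
  <fullEventsUpdate xmlns="">
    <fullEventUpdate>
      <event-reference>
        <event-id>24425412</event-id>
        <event-update>34</event-update>
      </event-reference>
    </fullEventUpdate>
    <fullEventUpdate>
      <event-reference>
        <event-id>24342548</event-id>
        <event-update>34</event-update>
      </event-reference>
    </fullEventUpdate>
  </fullEventsUpdate>
</eventsUpdate>"""

root = ET.fromstring(XML_DOC)

# The root carries the namespace in its tag; its children do not.
root_tag = root.tag
child_tag = root[0].tag

# findtext() takes the hyphenated name as a plain string.
pairs = [
    (ref.findtext("event-id"), ref.findtext("event-update"))
    for ref in root.iter("event-reference")
]
print(root_tag, child_tag, pairs)
```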
So for example I have XML doc:
<?xml version="1.0"?>
<a>
<b>Text I need</b>
</a>
<a>
<b>Text I need2</b>
</a>
How do I get the text inside all the b elements? I read my whole file into a string.
I only know how to parse HTML; I tried applying the same approach to XML, but failed.
from lxml import html
string = myfile.read();
tree = html.fromstring(string);
result = tree.xpath('//a/#b');
But it won't work.
The first thing you should do is make sure that your XML file is well-formed. An XML document must have exactly one root element; with two top-level a elements, as in your example, the lxml XML parser will fail. The root can have any name; here I wrap everything in a "body" tag:
<?xml version="1.0"?>
<body>
<a>
<b>Text I need</b>
</a>
<a>
<b>Text I need2</b>
</a>
</body>
Let us refer to this file as "foo.xml". Now that this data format is better for parsing, import etree from the lxml library:
from lxml import etree as et
Now it is time to parse the data and create a root object from which to start:
file_name = r"C:\foo.xml"
xmlParse = et.parse(file_name) #Parse the xml file
root = xmlParse.getroot() #Get the root
Once the root object has been obtained, we can use the iter() method to iterate over all b tags (older code used getiterator(), which is deprecated in lxml and has been removed from the standard library's ElementTree). Because iter() returns an iterator, we can pass it to list() to collect the element objects in a list. From there we can edit the text between the b tags:
bTags = list(root.iter("b")) #Collect all b elements in a list
bTags[0].text = "Change b tag 1." #Change tag from "Text I need"
bTags[1].text = "Change b tag 2." #Change tag from "Text I need2"
xmlParse.write(file_name) #Edit original xml file
The final output should look something like this:
<?xml version="1.0"?>
<body>
<a>
<b>Change b tag 1.</b>
</a>
<a>
<b>Change b tag 2.</b>
</a>
</body>
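If the file itself cannot be edited, the same wrapping can be done in memory, and the texts can simply be read rather than changed. A minimal sketch using only the standard library; the element name wrapper is an arbitrary choice, and the regular expression assumes the XML declaration, if present, is the first thing in the string:

```python
import re
import xml.etree.ElementTree as ET

raw = """<?xml version="1.0"?>
<a>
<b>Text I need</b>
</a>
<a>
<b>Text I need2</b>
</a>"""

# Drop the XML declaration, which may only appear at the very start of a document.
body = re.sub(r"^\s*<\?xml[^>]*\?>", "", raw)

# Wrap the top-level fragments in a single synthetic root element.
root = ET.fromstring("<wrapper>" + body + "</wrapper>")

# Collect the text of every b element, at any depth.
texts = [b.text for b in root.iter("b")]
print(texts)
```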
I want to transform this xml tree
<doc>
<a>
<op>xxx</op>
</a>
</doc>
to
<doc>
<a>
<cls>
<op>xxx</op>
</cls>
</a>
</doc>
I use this python code
from lxml import etree
f = etree.fromstring('<doc><a><op>xxx</op></a></doc>')
node_a = f.xpath('/doc/a')[0]
ele = etree.Element('cls')
node_a.insert(0, ele)
node_cls = f.xpath('/doc/a/cls')[0]
node_op = f.xpath('/doc/a/op')[0]
node_cls.append(node_op)
print(etree.tostring(f, pretty_print=True).decode())
Is this the best solution?
Now I want to obtain
<cls>
<doc>
<a>
<op>xxx</op>
</a>
</doc>
</cls>
I am unable to find any solution.
Thanks for your help.
I prefer BeautifulSoup to lxml; I find it easier to handle.
Both problems can be solved with the same approach: find the element, get its parent, create the new element, and put the old one inside it.
from bs4 import BeautifulSoup
import sys
soup = BeautifulSoup(open(sys.argv[1], 'r'), 'xml')
for e in soup.find_all(sys.argv[2]):
    p = e.parent
    cls = soup.new_tag('cls')
    e_extracted = e.extract()
    cls.append(e_extracted)
    p.append(cls)
print(soup.prettify())
The script accepts two arguments: the first is the XML file and the second is the tag to surround with the new tag. Run it like:
python3 script.py xmlfile op
That yields:
<?xml version="1.0" encoding="utf-8"?>
<doc>
<a>
<cls>
<op>
xxx
</op>
</cls>
</a>
</doc>
For <doc>, run it like:
python3 script.py xmlfile doc
With following result:
<?xml version="1.0" encoding="utf-8"?>
<cls>
<doc>
<a>
<op>
xxx
</op>
</a>
</doc>
</cls>
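For readers who want to stay with lxml, as in the question, the same idea can be sketched as follows. The helper name wrap is hypothetical. It swaps the element for the new wrapper in place, so the position among siblings is kept, and it also handles the root element, which has no parent.

```python
from lxml import etree

def wrap(elem, tag):
    """Hypothetical helper: put elem inside a new element named tag and return the wrapper."""
    wrapper = etree.Element(tag)
    parent = elem.getparent()
    if parent is not None:
        # Hand the trailing text over to the wrapper, then swap it in place.
        wrapper.tail = elem.tail
        elem.tail = None
        parent.replace(elem, wrapper)
    # append() moves elem under the wrapper; for the root, the wrapper becomes the new root.
    wrapper.append(elem)
    return wrapper

doc = etree.fromstring('<doc><a><op>xxx</op></a></doc>')

# First case: wrap <op> inside <a>.
wrap(doc.find('a/op'), 'cls')
inner = etree.tostring(doc)

# Second case: wrap the root element itself.
new_root = wrap(doc, 'cls')
outer = etree.tostring(new_root)
print(inner.decode())
print(outer.decode())
```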
Hi, I am parsing and completely modifying an XML file in Python 3 using lxml, and I need to insert a new element around existing elements, changing their parent.
Example:
old xml
<a>
<b>something</b>
<c>something different</c>
</a>
new xml
<a>
<new_parent>
<b>something</b>
<c>something different</c>
</new_parent>
</a>
Is it possible?
I'm not sure there is a function that does directly what you want. I would do it as follows: create a new_parent node, move the children of a into it, and then append new_parent to a.
import lxml.etree
# Bytes literal: lxml rejects str input that carries an encoding declaration
xml = b'''<?xml version='1.0' encoding='ASCII'?>
<root>
<a>
<b>something</b>
<c>something different</c>
</a>
</root>'''
root = lxml.etree.fromstring(xml)
a = root.find('.//a')
parent = lxml.etree.Element('new_parent')
# Iterate over a copy, since append() moves each child out of a
for child in list(a):
    parent.append(child)
a.append(parent)
print(lxml.etree.tostring(root, xml_declaration=True).decode())
prints (output format is modified to make it easy to read)
<?xml version='1.0' encoding='ASCII'?>
<root>
<a>
<new_parent>
<b>something</b>
<c>something different</c>
</new_parent>
</a>
</root>
UPDATE: You can use extend instead of multiple calls to append.
root = lxml.etree.fromstring(xml)
a = root.find('.//a')
parent = lxml.etree.Element('new_parent')
parent.extend(a)
a.append(parent)
I'm using BeautifulSoup to read, modify, and write an XML file. I'm having trouble with CDATA sections being stripped out. Here's a simplified example.
The culprit XML file:
<?xml version="1.0" ?>
<foo>
<bar><![CDATA[
!##$%^&*()_+{}|:"<>?,./;'[]\-=
]]></bar>
</foo>
And here's the Python script.
from bs4 import BeautifulSoup
xmlfile = open("cdata.xml", "r")
soup = BeautifulSoup( xmlfile, "xml" )
print(soup)
Here's the output. Note the CDATA section tags are missing.
<?xml version="1.0" encoding="utf-8"?>
<foo>
<bar>
!##$%^&*()_+{}|:"<>?,./;'[]\-=
</bar>
</foo>
I also tried printing soup.prettify(formatter="xml") and got the same result with slightly different whitespace. There isn't much in the docs about reading in CDATA sections, so maybe this is an lxml thing?
Is there a way to tell BeautifulSoup to preserve CDATA sections?
Update Yes, it's an lxml thing. http://lxml.de/api.html#cdata So, the question becomes, is it possible to tell BeautifulSoup to initialize lxml with strip_cdata=False?
In my case if I use
soup = BeautifulSoup( xmlfile, "lxml-xml" )
then the CDATA is preserved and accessible.
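The strip_cdata option mentioned in the update can also be tried directly in lxml, without BeautifulSoup. A minimal sketch, assuming lxml is installed; the sample document is made up for illustration:

```python
from lxml import etree

SRC = b"<foo><bar><![CDATA[a < b && c > d]]></bar></foo>"

# Default parser: the CDATA section is turned into ordinary text.
plain = etree.tostring(etree.fromstring(SRC), encoding="unicode")

# Parser that keeps CDATA sections as they are.
parser = etree.XMLParser(strip_cdata=False)
root = etree.fromstring(SRC, parser)
kept = etree.tostring(root, encoding="unicode")

# The text content is the same either way; only the serialization differs.
bar_text = root.find("bar").text
print(plain)
print(kept)
```

To create a CDATA section from scratch, lxml also offers etree.CDATA, which can be assigned to an element's text.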