Can CDATA sections be preserved by BeautifulSoup?

I'm using BeautifulSoup to read, modify, and write an XML file. I'm having trouble with CDATA sections being stripped out. Here's a simplified example.
The culprit XML file:
<?xml version="1.0" ?>
<foo>
<bar><![CDATA[
!##$%^&*()_+{}|:"<>?,./;'[]\-=
]]></bar>
</foo>
And here's the Python script.
from bs4 import BeautifulSoup
xmlfile = open("cdata.xml", "r")
soup = BeautifulSoup( xmlfile, "xml" )
print(soup)
Here's the output. Note the CDATA section tags are missing.
<?xml version="1.0" encoding="utf-8"?>
<foo>
<bar>
!##$%^&*()_+{}|:"<>?,./;'[]\-=
</bar>
</foo>
I also tried printing soup.prettify(formatter="xml") and got the same result with slightly different whitespace. There isn't much in the docs about reading in CDATA sections, so maybe this is an lxml thing?
Is there a way to tell BeautifulSoup to preserve CDATA sections?
Update: Yes, it's an lxml thing: http://lxml.de/api.html#cdata. So the question becomes: is it possible to tell BeautifulSoup to initialize lxml with strip_cdata=False?
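For reference, here is a minimal sketch of that lxml option used directly, without BeautifulSoup; strip_cdata is a documented parameter of lxml's XMLParser:
from lxml import etree

# strip_cdata=False tells lxml's parser to keep CDATA sections intact
parser = etree.XMLParser(strip_cdata=False)
tree = etree.parse("cdata.xml", parser)
print(etree.tostring(tree).decode())  # the <![CDATA[...]]> markers survive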

In my case, if I use
soup = BeautifulSoup(xmlfile, "lxml-xml")
then the CDATA is preserved and accessible.
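A minimal sketch to verify this, assuming a bs4/lxml combination where the XML tree builder passes strip_cdata=False through to lxml (this varies by version):
from bs4 import BeautifulSoup, CData

with open("cdata.xml") as f:
    soup = BeautifulSoup(f, "lxml-xml")

node = soup.bar.string
print(type(node))  # <class 'bs4.element.CData'> if the section was preserved
print(soup)        # serializes with the <![CDATA[...]]> markers intact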

Related

In Python, using Beautiful Soup, how do I get the text of an XML tag that contains a hyphen?

I'm trying to get the text content of the 'event-id' tags in the XML below, but the hyphen is not recognized as part of the element name. I know the script otherwise works, because if I replace the hyphen with an underscore in the XML and run it again, it works. Does anybody know what the problem could be?
<?xml version="1.0" encoding="UTF-8"?>
<eventsUpdate xmlns="http://nateng.com/xsd/NETworks">
<fullEventsUpdate xmlns="">
<fullEventUpdate xmlns="">
<event-reference xmlns="">
<event-id xmlns="">24425412</event-id>
<event-update xmlns="">34</event-update>
</event-reference>
</fullEventUpdate>
<fullEventUpdate xmlns="">
<event-reference xmlns="">
<event-id xmlns="">24342548</event-id>
<event-update xmlns="">34</event-update>
</event-reference>
</fullEventUpdate>
</fullEventsUpdate>
</eventsUpdate>
from bs4 import BeautifulSoup
dir_path = '20211006085201.xml'
file = open(dir_path, encoding='UTF-8')
contents = file.read()
soup = BeautifulSoup(contents, 'xml')
events = soup.find_all('fullEventUpdate')
print(' \n-------', len(events), 'events calculated on ', dir_path, '--------\n')
idi = soup.find_all('event-reference')
for x in range(0, len(events)):
    idText = (idi[x].event-id.get_text())
    print(idText)
The problem is that you are dealing with namespaced XML; for that type of document, you should use CSS selectors instead:
events = soup.select('fullEventUpdate')
for event in events:
    print(event.select_one('event-id').text)
Output:
24425412
24342548
More generally, when dealing with XML documents, you are probably better off using something that supports XPath (like lxml or ElementTree).
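For instance, a minimal lxml sketch against the file from the question:
import lxml.etree as etree

tree = etree.parse("20211006085201.xml")
# the inner elements reset the default namespace with xmlns="",
# so a plain //event-id XPath matches them
for event_id in tree.xpath("//event-id"):
    print(event_id.text)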
For XML parsing, the idiomatic approach is to use XPath selectors.
In Python this can be achieved with the parsel package, which is similar to BeautifulSoup but built on top of lxml for full XPath support:
from parsel import Selector

body = ...  # the XML document as a string
selector = Selector(body)
for event in selector.xpath("//event-reference"):
    print(event.xpath('event-id/text()').get())
results in:
24425412
24342548
Without any external lib (just ElementTree):
import xml.etree.ElementTree as ET
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<eventsUpdate xmlns="http://nateng.com/xsd/NETworks">
<fullEventsUpdate xmlns="">
<fullEventUpdate xmlns="">
<event-reference xmlns="">
<event-id xmlns="">24425412</event-id>
<event-update xmlns="">34</event-update>
</event-reference>
</fullEventUpdate>
<fullEventUpdate xmlns="">
<event-reference xmlns="">
<event-id xmlns="">24342548</event-id>
<event-update xmlns="">34</event-update>
</event-reference>
</fullEventUpdate>
</fullEventsUpdate>
</eventsUpdate> '''
root = ET.fromstring(xml)
ids = [e.text for e in root.findall('.//event-id')]
print(ids)
Output:
['24425412', '24342548']
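Note that the .//event-id lookup only matches because the inner elements reset the default namespace with xmlns=""; if they instead inherited the root's namespace, the query would need the namespace URI:
ids = [e.text for e in root.findall('.//{http://nateng.com/xsd/NETworks}event-id')]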

lxml.etree.XMLSyntaxError, Document labelled UTF-16 but has UTF-8 content

I am getting this error using the lxml library in Python. Other solutions/hacks I've seen replace utf-16 with utf-8 in the file itself (e.g. in PHP). What is the Pythonic way to solve this?
python code:
import lxml.etree as etree
tree = etree.parse("req.xml")
req.xml:
<?xml version="1.0" encoding="utf-16"?>
<test
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
</test>
Have a look at the documentation of the XMLParser constructor:
>>> help(etree.XMLParser)
Among other options, there is an encoding parameter, which allows you to "override the document encoding", as the docs say.
That's exactly what you need:
parser = etree.XMLParser(encoding='UTF-8')
tree = etree.parse("req.xml", parser=parser)
If the error message is right (i.e. there aren't any other problems with the document), then I expect this to work.
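The same parser can also be reused when parsing from a string; note that lxml's fromstring needs bytes here, since it rejects str input that carries an encoding declaration (a minimal sketch):
with open("req.xml", "rb") as f:
    root = etree.fromstring(f.read(), parser=parser)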
You can also parse the XML content using BeautifulSoup, which may be the more Pythonic way you asked for.
Note: even though the file is labelled utf-16, you can parse it by decoding the content as utf-8 while reading the file.
So below is the code:
sample.xml contains following data:
<?xml version="1.0" encoding="utf-16"?>
<test
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
</test>
Code:
from bs4 import BeautifulSoup
with open("sample.xml", "rb") as f:  # open the xml file in binary mode
    content = f.read().decode('utf-8', 'ignore')  # decode the raw bytes as utf-8, ignoring errors
soup = BeautifulSoup(content, 'html.parser')  # parse the content with the html.parser backend
data = [tag.attrs for tag in soup.find_all("test")]  # collect the attributes of every <test> tag
print(data)
Output:
[{'xmlns:xsd': 'http://www.w3.org/2001/XMLSchema', 'xmlns:xsi': 'http://www.w3.org/2001/XMLSchema-instance'}]

python beautifulsoup adds html tags to svg

While parsing an SVG file, I noticed that beautifulsoup adds html tags to it.
from bs4 import BeautifulSoup
soup = BeautifulSoup('<svg></svg>', 'lxml')
print(soup)
results in:
<html><body><svg></svg></body></html>
Why is this so and can this be avoided?
You are using the lxml parser, which is an HTML parser. To parse XML you should use the xml parser:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<svg></svg>', 'xml')
print(soup) # ^^^^^^
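which should print the document without any wrapping html/body tags, along the lines of:
<?xml version="1.0" encoding="utf-8"?>
<svg/>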
From BeautifulSoup documentation:
Beautiful Soup presents the same interface to a number of different
parsers, but each parser is different. Different parsers will create
different parse trees from the same document. The biggest differences
are between the HTML parsers and the XML parsers. Here’s a short
document, parsed as HTML:
BeautifulSoup("<a><b /></a>")
# <html><head></head><body><a><b></b></a></body></html>
Since an empty <b /> tag is not valid HTML, the parser turns it into a
<b></b> tag pair.
Here’s the same document parsed as XML (running this requires that you
have lxml installed). Note that the empty <b /> tag is left alone, and
that the document is given an XML declaration instead of being put
into an <html> tag:
BeautifulSoup("<a><b /></a>", "xml")
# <?xml version="1.0" encoding="utf-8"?>
# <a><b/></a>
Source: Differences between parsers, emphasis mine.

Comment out an element using lxml

Is it possible to comment out an xml element with python's lxml while preserving the original element rendering inside the comment? I tried the following
elem.getparent().replace(elem, etree.Comment(etree.tostring(elem, pretty_print=True)))
but tostring() adds the namespace declaration.
The namespace of the commented-out element is inherited from the root element. Demo:
from lxml import etree
XML = """
<root xmlns='foo'>
<a>
<b>AAA</b>
</a>
</root>"""
root = etree.fromstring(XML)
b = root.find(".//{foo}b")
b.getparent().replace(b, etree.Comment(etree.tostring(b)))
print(etree.tostring(root).decode())
Result:
<root xmlns="foo">
<a>
<!--<b xmlns="foo">AAA</b>
--></a>
</root>
Manipulating namespaces is often harder than you might suspect. See https://stackoverflow.com/a/31870245/407651.
My suggestion here is to use BeautifulSoup, which in practice does not really care about namespaces (soup.find('b') returns the b element even though it is in the foo namespace).
from bs4 import BeautifulSoup, Comment
soup = BeautifulSoup(XML, "xml")
b = soup.find('b')
b.replace_with(Comment(str(b)))
print(soup.prettify())
Result:
<?xml version="1.0" encoding="utf-8"?>
<root xmlns="foo">
<a>
<!--<b>AAA</b>-->
</a>
</root>

Python and BeautifulSoup shuffles my tags around?

This has been puzzling me for a while now, and I can not figure out what's going on here. This is the original XML file:
<?xml version="1.0" encoding="UTF-8" ?>
<book>
<meta>
<title>Some Title</title>
<creator>Another Author</creator>
<language>en-US</language>
...
</meta>
<chapter>
...
</chapter>
</book>
Then I read the file in:
with open(filename) as f:
    soup = BeautifulSoup(f)
print(soup.root)
and that, oddly, dumps the following:
<html><body><book>
<meta/>
<title>Some Title</title>
<creator>Some Author</creator>
<language>en-US</language>
...
So why oh why is the <meta> tag in the soup empty, when it is not in the original XML file? (I could swear that this worked just two weeks ago, and I could also swear that I've not touched the script. I did make some slight changes to the XML file further down, but I fail to see a correlation.)
You are opening an XML file with an HTML parser, so BeautifulSoup repairs the document into the HTML structure it expects. In HTML, <meta> is a void element that cannot contain content, which is why the parser closes it immediately and its children spill out after it.
Instead, use an XML parser, or use BeautifulSoup in XML mode:
soup = BeautifulSoup(f, 'xml')
For this to work you must have lxml installed.
lxml is an excellent XML library in and of itself. You could also use the ElementTree library included with Python; lxml is based on the same API but has more features.
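For instance, a quick ElementTree check (the filename here is hypothetical) showing that <meta> keeps its children once a real XML parser is used:
import xml.etree.ElementTree as ET

tree = ET.parse("book.xml")  # hypothetical filename for the XML shown above
meta = tree.getroot().find("meta")
print(meta.find("title").text)  # Some Title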
