Python parse XML files with HTML content - python

I use an API to get some XML files but some of them contain HTML tags without escaping them. For example, <br> or <b></b>
I use this code to read them, but the files with the HTML raise an error. I don't have access to change manually all the files. Is there any way to parse the file without losing the HTML tags?
from xml.dom.minidom import parse, parseString
xml = ...#here is the api to receive the xml file
dom = parse(xml)
strings = dom.getElementsByTagName("string")

Read the xml file as a string, and fix the malformed tags before you parse it:
import xml.etree.ElementTree as ET
with open(xml) as xml_file: # open the xml file for reading
text= xml_file.read() # read its contents
text= text.replace('<br>', '<br />') # fix malformed tags
document= ET.fromstring(text) # parse the string
strings= document.findall('string') # find all string elements

If you can use third-party libs I suggest you to use Beautiful Soup it can handle xml as well as html and also it parses broken markup, also providing easy to use api.

Related

Put a XML file inside a Python script?

I'm trying to create a face-detection script using Python's OpenCV using the haar cascade XML file.
My goal is to upload a python file to a website but due to some weird policies, I can only upload the Python file, without the XML...
The question is, is it possible to somehow put the XML file inside the Python script, say, convert it to a String or something and then generate an XML from that String?
xml = """<?xml version="1.0" encoding="UTF-8"?>
<a>
<b>Yes, you can embed XML in a string literal in Python.</b>
</a>"""
Not answer to title but answer of your description question.
Haar cascade doesn't support non-file XML strings. Also, if you try to put an XML file to a website and give a link to an XML file with cv2.CascadeClassifier(), it will give an error.
But you can use the request module on python to achieve what you want.
It gets XML from the website, then puts it into a file
def function(self, image):
# download XML from server
link = LINK_TO_XML
r = requests.get(link, allow_redirects=True)
open('haarcascade_frontalface_default.xml', 'wb').write(r.content)
# end of download
haar_cascade = cv.CascadeClassifier('haarcascade_frontalface_default.xml')
First, copy the contents of the XML file into the python file and assign the whole thing to a string. Then use XML library to create a tree type data structure named root which contains the contents of the XML file. This tree is traversable and you can do what you like with it in your program:
import xml.etree.ElementTree as ET
root = ET.fromstring(XML_file_example_as_string).
To generate XML from the string you can use ElementTree.write() like this:
tree = ET.ElementTree(root)
tree.write('example.xml')

How can i see the content of a xml file in python

I'm struggling to find a way of seeing the content of a xml file. I have done a lot of searching and the only progress I am making is to keep running my code without any results
Have a look at the BeautifulSoup package and use the lxml parser.
Based on this url:
https://pymotw.com/2/xml/etree/ElementTree/parse.html
the relevant code:
from xml.etree import ElementTree
with open('example.xml', 'rt') as f:
tree = ElementTree.parse(f)
print tree
This will print the XML file.
It's also good for parsing the file and search elements.

Find all titles in an XML with Elementree from a bz2 file

I'm new to parsing in XML and am stuck with my code regarding finding all titles (title tags) in an XML. This is what I came up with, but it is returning just an empty list, while there should be titles in there.
import bz2
from xml.etree import ElementTree as etree
def parse_xml(filename):
with bz2.BZ2File(filename) as f:
doc = etree.parse(f)
titles = doc.findall('.//{http://www.mediawiki.org/xml/export-0.7/}title')
print titles[:10]
Can someone tell me why this is not working properly? Just to be clear; I need to find all text inside title tags stored in a list, taken from an XML wrapped in a bz2 file (as far as I read the best way is without unzipping).

Writing modified Beautiful Soup tree to file, while maintaining original XML formatting

We have an XML document that has a tag we wish to alter:
...<version>1.0</version>...
It's buried deep in the XML file, but we're successfully able to use Beautiful Soup to replace its contents with a command-line parameter.
The problem is that after modifying the tree, we need to write back to the file we read it from. But, we want to maintain the original formatting of the document. When I use:
fileForWriting = open(myXmlFile, 'w')
fileForWriting.write(soup.prettify())
The prettify() call breaks the formatting, and I end up with:
<version>
1.0
</version>
Is there any way to maintain the original formatting of the XML document, while replacing that single tag text?
Note: Using simply:
fileForWriting.write(str(soup))
Keeps the text and tags on the same line, but eliminates the indents and extra newlines that had been human-added for readability. Close, but no cigar.
By request, the entire script:
from BeautifulSoup import BeautifulSoup as bs
import sys
xmlFile = sys.argv[1:][0]
version = sys.argv[1:][1]
fileForReading = open(xmlFile, 'r')
xmlString = fileForReading.read()
fileForReading.close()
soup = bs(xmlString)
soup.findAll('version')[1].contents[0].replaceWith(version)
fileForWriting = open(xmlFile, 'w')
fileForWriting.write(str(soup))
fileForWriting.close()
The script is then run using:
python myscript.py someFile.xml 1.2
And if you use xml.elementtree, the tree.write(file) method replaces the CRLF by LF only, which also creates issues when trying to import the XML file into i.e. PyXB.
The solution I found is to use ElementTree just to find what I have to replace. Then I do source_XML = 'new value'.join(source_XML.split('what you need to replace)) Finally a file.write(source_XML)
it's not nice, but it solves the issue. However, I do not mind about the indentations, so on this I can't really say. I would only use pprint.pprint() whenever I need to print it.

Handling XML with Python

All I want to do is get the content of an XML tag in Python. I'm maybe using the wrong import; ideally I'd love to have the way PHP deals with XML (i.e $XML->this_tag), like the way pyodbc does database stuff (i.e. table.field)
Here's my example:
from xml.dom.minidom import parseString
dom = parseString("<test>I want to read this</test>")
dom.getElementsByTagName("test")[0].toxml()
>>> u'<test>I want to read this</test>'
All I want to be able to do read the contents of the tag (like innerHTML in javascript).
instead of dom.getElementsByTagName("test")[0].toxml() put dom.getElementsByTagName("test")[0].firstChild.data It will print the node value.
I like BeautifulSoup :
from BeautifulSoup import BeautifulStoneSoup
xml = """<test>I want to read this</test>"""
soup = BeautifulStoneSoup(xml)
soup.find('test')
I want to read this
looks somewhat better.
Use firstChild.data instead of toxml:
from xml.dom.minidom import parseString
dom = parseString('<test>I want to read this</test>')
element = dom.getElementsByTagName('test')[0]
print element.firstChild.data
Output:
>>> I want to read this

Categories