How to parse an XML file with encoding declaration in Python?

I have this XML file, called xmltest.xml:
<?xml version="1.0" encoding="GBK"?>
<productMeta>
<bands>1,2,3,4</bands>
<imageName>TestName.tif</imageName>
<browseName>TestName.jpg</browseName>
</productMeta>
And I have this Python dummy code:
import xml.etree.ElementTree as ET
xmldoc = ET.parse('xmltest.xml')
But it raises a ValueError:
ValueError: multi-byte encodings are not supported
I understand the error: it is raised because of the encoding declaration on the first line of the XML file. The XML files are actually UTF-8 encoded but always carry that declaration (I'm not the creator of the XML files to be analyzed). How can I make the parser ignore the encoding declaration when parsing a file like the one above?

One thing I tried that worked for me is to open the XML file as a file object and then call ElementTree.fromstring(), passing in the complete contents of the file.
Example:
>>> import xml.etree.ElementTree as ET
>>> ef = ET.parse('a.xml')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python34\lib\xml\etree\ElementTree.py", line 1187, in parse
    tree.parse(source, parser)
  File "C:\Python34\lib\xml\etree\ElementTree.py", line 598, in parse
    self._root = parser._parse_whole(source)
ValueError: multi-byte encodings are not supported
>>> with open('a.xml','r') as f:
...     ef = ET.fromstring(f.read())
...
>>> ef
<Element 'productMeta' at 0x028DF180>
You can also create an XMLParser with the required encoding, which should let you parse data in that encoding regardless of what the declaration says. Example:
import xml.etree.ElementTree as ET
xmlp = ET.XMLParser(encoding="utf-8")
f = ET.parse('a.xml',parser=xmlp)
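
Both approaches above can be sketched in one self-contained script. This is a minimal sketch, assuming the file really is UTF-8 despite declaring GBK; the temporary file and the variable names are only for illustration.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Sample document: declares GBK, but the bytes are actually UTF-8 (assumption
# taken from the question).
xml_bytes = (
    '<?xml version="1.0" encoding="GBK"?>\n'
    '<productMeta>\n'
    '<bands>1,2,3,4</bands>\n'
    '<imageName>TestName.tif</imageName>\n'
    '</productMeta>\n'
).encode('utf-8')

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'xmltest.xml')
    with open(path, 'wb') as fh:
        fh.write(xml_bytes)

    # Option 1: give the parser an explicit encoding, which overrides the
    # encoding named in the XML declaration.
    tree = ET.parse(path, parser=ET.XMLParser(encoding='utf-8'))
    root_a = tree.getroot()

    # Option 2: decode the file yourself and parse the resulting str.
    with open(path, encoding='utf-8') as fh:
        root_b = ET.fromstring(fh.read())

print(root_a.find('bands').text)
print(root_b.find('imageName').text)
```

Option 1 keeps the ElementTree object returned by ET.parse(), while option 2 returns the root Element directly.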

ET.parse('a.xml', parser=ET.XMLParser(encoding='iso-8859-5'))
solved my problem when dealing with Excel XML files in Python.

Related

xml.etree.ElementTree parse content in kernel

I am trying to convert a bunch of .xml.gz files into data frames. Because there are too many files, and many of their nodes are not useful for our project, I do not want to write all the decompressed XML files to disk.
However, to parse XML with xml.etree.ElementTree, I seem to need the path of an XML file. Is there a way to parse the decompressed content directly in memory (in the kernel)?
with gzip.open(gz_files[0], 'rb') as f:
    content = f.read()
xmlparse = Xet.parse(gz_files[1])
Traceback (most recent call last):
  File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\IPython\core\interactiveshell.py:3251 in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  Input In [63] in
    xmlparse = Xet.parse(gz_files[1])
  File ~\AppData\Local\Programs\Python\Python310\lib\xml\etree\ElementTree.py:1229 in parse
    tree.parse(source, parser)
  File ~\AppData\Local\Programs\Python\Python310\lib\xml\etree\ElementTree.py:580 in parse
    self._root = parser._parse_whole(source)
  File ParseError: not well-formed (invalid token): line 1, column 0

It should be possible to parse gzipped content by passing the opened file object straight to ET.parse(), instead of passing the path of the compressed file:
import xml.etree.ElementTree as ET
import gzip
with gzip.open('file.xml.gz', 'rb') as f:
    xmlparse = ET.parse(f)
print(xmlparse)
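
A self-contained sketch of the same idea, with nothing written to disk. The sample payload and the element names (records, record) are made up for illustration; the point is that ET.parse() accepts any binary file object and ET.fromstring() accepts bytes that have already been read.

```python
import gzip
import io
import xml.etree.ElementTree as ET

# Hypothetical gzipped XML payload, built in memory for the example.
payload = gzip.compress(
    b'<records><record id="1">a</record><record id="2">b</record></records>'
)

# Variant 1: hand the decompressing file object to ET.parse().
with gzip.open(io.BytesIO(payload), 'rb') as f:
    root = ET.parse(f).getroot()
ids = [rec.get('id') for rec in root.iter('record')]

# Variant 2: if the content was already read into bytes, parse those bytes.
content = gzip.decompress(payload)
root_from_bytes = ET.fromstring(content)

print(ids)
print(root_from_bytes.tag)
```

With real files, gzip.open(path, 'rb') replaces the io.BytesIO wrapper used here.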

Python Post Request Response Xml Error load fromstring

I'm new to Python and have run into something I'm not sure how to resolve. I'm sure it must be a simple fix, but I haven't found a solution and hope someone with more Python knowledge can help.
My request:
...
contacts = requests.post(url, data=readContactsXml, headers=headers)
# print(contacts.content)
with open("contact.xml", "wb") as outF:
    outF.write(contacts.content)
All is fine with the above until I have to manipulate the data before saving it, e.g.:
...
contacts = requests.post(url, data=readContactsXml, headers=headers)
import xml.etree.ElementTree as ET
# contacts.encoding = 'utf-8'
parser = ET.XMLParser(encoding="UTF-8")
tree = ET.fromstring(contacts.content, parser=parser)
root = tree.getroot()
for item in root[0][0].findall('.//fields'):
    if item[0].text == 'maching-text-here':
        if not item[1].text:
            item[1].text = 'N/A'
        print(item[1].text)
# print(contacts.content)
with open("contact.xml", "wb") as outF:
    outF.write(contacts.content)
In the above I am simply replacing an empty value with 'N/A'.
The error I'm receiving is:
Traceback (most recent call last):
  File "Desktop/PythonTests/test.py", line 107, in <module>
    tree = ET.fromstring(contacts.content, parser=parser)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1311, in XML
    parser.feed(text)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1659, in feed
    self._raiseerror(v)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1523, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 192300
Looking around this column I can see text with accented characters, e.g. Sinéd; the É is the problem here. In fact, when I just save this XML file and open it in the browser, I get much the same error at roughly the same column (off by 2):
This page contains the following errors:
error on line 1 at column 192298: Encoding error
Below is a rendering of the page up to the first error.
What can I do with an XML response that contains such characters?
Any help is appreciated!

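Found my answer after digging through Stack Overflow. I changed
tree = ET.fromstring(contacts.content, parser=parser)
to
tree = ET.ElementTree(ET.fromstring(contacts.content))
ET.fromstring() returns an Element rather than an ElementTree, so the result has to be wrapped before getroot() can be called. The new call also drops the parser that forced UTF-8, so the parser falls back to the encoding declared in the response itself, which is presumably why the É no longer triggers a parse error.
REF: https://stackoverflow.com/questions/33962620/elementtree-returns-element-instead-of-elementtree/44483259#44483259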
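
A minimal sketch of the whole round trip without the network call. The sample bytes stand in for contacts.content and are an assumption: the real response layout is not shown, and ISO-8859-1 is only a guess at an encoding that would explain the É problem.

```python
import io
import xml.etree.ElementTree as ET

# Stand-in for contacts.content (hypothetical structure and encoding).
response_bytes = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    '<contacts><fields><name>Sin\u00e9d</name><value></value></fields></contacts>'
).encode('iso-8859-1')

# No explicit parser: the declared encoding is honoured.
root = ET.fromstring(response_bytes)   # returns an Element
tree = ET.ElementTree(root)            # wrap it to get write()/getroot()

# Replace empty second children with 'N/A', as in the question.
for item in root.findall('.//fields'):
    if not item[1].text:
        item[1].text = 'N/A'

# Serialise the modified tree instead of the original response bytes.
buf = io.BytesIO()
tree.write(buf, encoding='utf-8', xml_declaration=True)
output = buf.getvalue().decode('utf-8')
print(output)
```

Note that the question's code writes contacts.content back to disk, which would discard the edits; writing the tree, as sketched here, keeps them.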

Unable to Parse XML file in Python

I am trying to parse a large XML file (more than 50 MB) and get the following parsing error.
The file is attached for reference.
import xml.etree.cElementTree as ET
tree = ET.parse('input_file.xml')
Error:
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/xml/etree/ElementTree.py", line 598, in parse
    self._root = parser._parse_whole(source)
  File "<string>", line unknown
ParseError: no element found: line 21, column 0

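Your XML is not well-formed, so ElementTree cannot parse it. "no element found" usually means the parser reached the end of the input while still expecting content, so look at the reported position (line 21, column 0) and check for a missing closing tag or a truncated file, as well as stray special characters.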
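
To locate such a problem programmatically, the ParseError can be caught and its position inspected. This is a small sketch; the truncated sample document is invented for illustration and is not the asker's file.

```python
import xml.etree.ElementTree as ET

# Hypothetical truncated document: the closing tags are missing.
truncated = "<catalog>\n<item>first</item>\n<item>second"

try:
    ET.fromstring(truncated)
    error = None
except ET.ParseError as exc:
    error = exc
    # position is a (line, column) tuple pointing at where parsing stopped.
    line, column = exc.position
    print(f"Parse failed: {exc} (line {line}, column {column})")
```

For a file on disk, the same try/except can wrap ET.parse('input_file.xml').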

Strip attributes / namespaces from SOAP XML

If I have several tags like this:
<ServiceId xsi:type="xsd:string">aval</ServiceId>
Is xsi:type="xsd:string" technically an attribute?
When I try this:
from StringIO import StringIO
from SOAPpy.wstools.Utility import DOM
badxml = '''<?xml version="1.0" encoding="utf-8"?>
<ServiceId xsi:type="xsd:string">aval</ServiceId>'''
document = DOM.loadDocument(StringIO(badxml))
orig_len = len(document.childNodes[0].toxml())
for node in document.childNodes:
    node.removeAttribute('xsi:type')
    new_len = len(node.toxml())
    diff = orig_len - new_len
    print diff
...I get an error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/site-packages/SOAPpy/wstools/Utility.py", line 572, in loadDocument
    return xml.dom.minidom.parse(data)
  File "/usr/lib64/python2.6/site-packages/_xmlplus/dom/minidom.py", line 1915, in parse
    return expatbuilder.parse(file)
  File "/usr/lib64/python2.6/site-packages/_xmlplus/dom/expatbuilder.py", line 930, in parse
    result = builder.parseFile(file)
  File "/usr/lib64/python2.6/site-packages/_xmlplus/dom/expatbuilder.py", line 207, in parseFile
    parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: unbound prefix: line 2, column 9
I basically want to remove all attributes from large XML documents.

xsi is a namespace prefix, so yes, xsi:type is technically an attribute, one whose name lives in the xsi namespace. The "unbound prefix" error occurs because your snippet uses the xsi prefix without declaring it (there is no xmlns:xsi="..." on the element or on an ancestor). You can use namespaces in your queries if you need them; removing them can have detrimental effects on your data, for example if other XML elements share the same element name but belong to a different namespace.
Have a look here:
Python ElementTree module: How to ignore the namespace of XML files to locate matching element when using the method "find", "findall"
Otherwise, what you are doing is a bit of a hack, and you might as well read the file as a string and do a mass regex replace on the namespace string you want to delete (not recommended).
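
If stripping every attribute really is the goal, a sketch with the standard library's ElementTree (swapped in here for SOAPpy's DOM helper) could look like the following. The Envelope wrapper is a hypothetical stand-in for the real SOAP document; it exists so that the xsi and xsd prefixes are declared and the input is well-formed.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper element that declares the prefixes used below.
xml_text = '''<Envelope xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <ServiceId xsi:type="xsd:string">aval</ServiceId>
</Envelope>'''

root = ET.fromstring(xml_text)

# Walk every element in the tree and drop all of its attributes.
for elem in root.iter():
    elem.attrib.clear()

result = ET.tostring(root, encoding='unicode')
print(result)
```

This removes attributes only; element names that are themselves namespaced keep their namespace.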

Python xml.etree.ElementTree Parsing Forward Slashes

I'm trying to parse XML returned by the Stanford CoreNLP in python using the xml.etree.ElementTree module, but I seem to keep running into this error.
Here is the error I get:
  File "my_script.py", line 5
    root = ET.fromstring(content)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1300, in XML
    parser.feed(text)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1642, in feed
    self._raiseerror(v)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1506, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 4473, column 19
I checked what's on line 4473 of the XML file:
<word>5 1/2</word>
Column 19 starts at the 5.
I assume the problem is due to the forward slash in the number 5 1/2, since this is the first place 5 1/2 occurs in the XML file. Does anyone know how I can still parse the XML file with the forward slashes?
Here is the code as well:
import xml.etree.ElementTree as ET
with open("samplefiles/samplefile999.txt.xml", "r") as f:
    content = f.read()
root = ET.fromstring(content)
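
A quick check suggests the forward slash itself is not the culprit, since it is an ordinary character in XML text. The sketch below also shows one hypothetical way to get the same "invalid token" error on such a line: a raw non-breaking-space byte (0xA0) between the 5 and the 1 in a document parsed as UTF-8. Whether that is what the CoreNLP output actually contains is an assumption; inspecting the raw bytes of line 4473 would confirm it.

```python
import xml.etree.ElementTree as ET

# A forward slash in element text parses without any problem.
ok = ET.fromstring("<word>5 1/2</word>")
slash_text = ok.text

# Hypothetical cause: a lone 0xA0 byte, which is not valid UTF-8.
bad_bytes = b"<word>5\xa01/2</word>"
try:
    ET.fromstring(bad_bytes)
    bad_error = None
except ET.ParseError as exc:
    bad_error = exc
    print("Parse failed:", exc)
```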

We Keep Coding
