Python XML Iterparse halt on text - python

I am new to Python (using 3.x) and am running into an issue with an XML file that I'm testing/learning on. Looking at the raw file (which is ASCII-encoded, by the way), I'm fairly sure the issue is a U+00A0 character (a 0xA0 byte) in the text.
The XML is as follows:
<?xml version="1.0" encoding="utf-8"?>
<XMLSetData xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns="http://www.clientsite.com/subdir/r2.4/v1">
<FileCreationDate>2018-05-05T11:35:44.1043858-05:00</FileCreationDate>
<XMLSetDataList>
<DataIDNumber>99345346</DataIDNumber>
<DataName>RSRS TVL5697 ULL  Georgetown</DataName>
</XMLSetDataList>
</XMLSetData>
Using Notepad++, it shows me that the text has "xA0 " instead of " " (two spaces) between ULL and Georgetown. So when I do the code below:
import xml.etree.ElementTree as ET
events = ("end", "start-ns", "end-ns")
for event, elem in ET.iterparse(xml_file, events=events):
    if event == "end":
        eltag = elem.tag
        eltext = elem.text
        print(eltag, eltext)
It gives me an error stating:
File "C:\Users\d\AppData\Local\Programs\Python\Python37-32\lib\xml\etree\ElementTree.py", line 1222, in iterator
yield from pullparser.read_events()
File "C:\Users\d\AppData\Local\Programs\Python\Python37-32\lib\xml\etree\ElementTree.py", line 1297, in read_events
raise event
File "C:\Users\d\AppData\Local\Programs\Python\Python37-32\lib\xml\etree\ElementTree.py", line 1269, in feed
self._parser.feed(data)
File "<string>", line None
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 6, column 30
How do I fix this / get around it? If I remove the xA0 part, it parses fine, but obviously something like this may come up again, and I'd like to programmatically handle it.
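One possible way to handle this programmatically, offered as a sketch rather than a definitive fix: the file declares encoding="utf-8", but a lone 0xA0 byte is not valid UTF-8, which is why the parser rejects it. The example below reads the raw bytes, falls back to Latin-1 when UTF-8 decoding fails (an assumption; in Latin-1 the byte 0xA0 is U+00A0), and re-encodes the text as UTF-8 before parsing. The function name iterparse_tolerant and the sample data are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def iterparse_tolerant(raw_bytes, fallback_encoding="latin-1"):
    """Yield (event, elem) pairs, tolerating bytes that are not valid UTF-8."""
    try:
        text = raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: the offending bytes are Latin-1, where 0xA0 is U+00A0.
        text = raw_bytes.decode(fallback_encoding)
    # Re-encode so the bytes agree with the encoding="utf-8" declaration.
    clean = io.BytesIO(text.encode("utf-8"))
    yield from ET.iterparse(clean, events=("end",))


# Hypothetical sample containing the same stray 0xA0 byte as in the question.
sample = (
    b'<?xml version="1.0" encoding="utf-8"?>\n'
    b"<XMLSetData><DataName>RSRS TVL5697 ULL\xa0 Georgetown</DataName></XMLSetData>"
)
names = [elem.text for _, elem in iterparse_tolerant(sample) if elem.tag == "DataName"]
print(names)
```

For a file on disk, the bytes could come from open(xml_file, "rb").read(). Two caveats: this reads the whole file into memory, which gives up the streaming benefit of iterparse, and if the file mixes genuine multi-byte UTF-8 characters with stray single bytes, the Latin-1 fallback will garble the genuine ones.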

Related

xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 3, column 15

I am making a program that reads and writes an XML file using ElementTree in Python. This is my XML file:
<?xml version='1.0' encoding='utf_8'?>
<data>
<Hoidap hoi="bạn tên gì" dap="tôi tên là Tuấn" />
</data>
here is my python code:
parser = ET.XMLParser(encoding='utf_8')
tree = ET.parse("F:\data.xml", parser=parser)
and here is the error message:
tree = ET.parse("F:\data.xml", parser=parser)
File "C:\Users\admin\AppData\Local\Programs\Python\Python38\lib\xml\etree\ElementTree.py", line 1202, in parse
tree.parse(source, parser)
File "C:\Users\admin\AppData\Local\Programs\Python\Python38\lib\xml\etree\ElementTree.py", line 601, in parse
parser.feed(data)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 3, column 15
I have searched everywhere but found nothing similar to this. Can someone help me? Thanks a lot.
The encoding should be spelled 'UTF-8'. If the document's encoding declaration were spelled correctly, ET.parse('data.xml') would work without an explicit parser, because the default parser is XMLParser() and it uses the document's declaration.
from xml.etree import ElementTree as ET
parser = ET.XMLParser(encoding='UTF-8')
tree = ET.parse("data.xml", parser=parser)
ET.dump(tree)
Output:
<data>
<Hoidap hoi="bạn tên gì" dap="tôi tên là Tuấn" />
</data>
See Extensible Markup Language (XML) 1.0 (Fifth Edition), 2.8 Prolog and Document Type Declaration, Encoding Declaration:
In an encoding declaration, the values "UTF-8", "UTF-16", "ISO-10646-UCS-2", and "ISO-10646-UCS-4" should be used for the various encodings and transformations of Unicode / ISO/IEC 10646, ....

Error with Python and XML

I'm getting an error when trying to grab a value from my XML. I get "Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration."
Here is my code:
import requests
import lxml.etree
from requests.auth import HTTPBasicAuth
r= requests.get("https://somelinkhere/folder/?parameter=abc", auth=HTTPBasicAuth('username', 'password'))
print r.text
root = lxml.etree.fromstring(r.text)
textelem = root.find("opensearch:totalResults")
print textelem.text
I'm getting this error:
Traceback (most recent call last):
File "tickets2.py", line 8, in <module>
root = lxml.etree.fromstring(r.text)
File "src/lxml/lxml.etree.pyx", line 3213, in lxml.etree.fromstring (src/lxml/lxml.etree.c:82934)
File "src/lxml/parser.pxi", line 1814, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:124471)
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
Here is what the XML looks like, where I'm trying to grab the file in the last line.
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:apple-wallpapers="http://www.apple.com/ilife/wallpapers" xmlns:g-custom="http://base.google.com/cns/1.0" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/" xmlns:media="http://search.yahoo.com/mrss/" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:georss="http://www.georss.org/georss/" xmlns:creativeCommons="http://backend.userland.com/creativeCommonsRssModule" xmlns:cc="http://web.resource.org/cc/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:g-core="http://base.google.com/ns/1.0">
<title>Feed from some link here</title>
<link rel="self" href="https://somelinkhere/folder/?parameter=abc" />
<link rel="first" href="https://somelinkhere/folder/?parameter=abc" />
<id>https://somelinkhere/folder/?parameter=abc</id>
<updated>2018-03-06T17:48:09Z</updated>
<dc:creator>company.com</dc:creator>
<dc:date>2018-03-06T17:48:09Z</dc:date>
<opensearch:totalResults>4</opensearch:totalResults>
I have tried various changes from links like https://twigstechtips.blogspot.com/2013/06/python-lxml-strings-with-encoding.html and http://makble.com/how-to-parse-xml-with-python-and-lxml but I keep running into the same error.
Instead of r.text, which guesses at the text encoding and decodes it, try using r.content which accesses the response body as bytes. (See http://docs.python-requests.org/en/latest/user/quickstart/#response-content.)
You could also use r.raw. See parsing XML file gets UnicodeEncodeError (ElementTree) / ValueError (lxml) for more info.
Once that issue is fixed, you'll have the issue of the namespace. The element you're trying to find (opensearch:totalResults) has the prefix opensearch which is bound to the uri http://a9.com/-/spec/opensearch/1.1/.
You can find the element by combining the namespace uri and the local name (Clark notation):
{http://a9.com/-/spec/opensearch/1.1/}totalResults
See http://lxml.de/tutorial.html#namespaces for more info.
Here's an example with both changes implemented:
import lxml.etree
# r is the requests response from the question
os = "{http://a9.com/-/spec/opensearch/1.1/}"
root = lxml.etree.fromstring(r.content)
textelem = root.find(os + "totalResults")
print(textelem.text)
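As an alternative to building Clark-notation names by hand, a prefix-to-URI mapping can be passed to find(). The sketch below uses the standard library's xml.etree.ElementTree so that it runs without a network call or extra packages; lxml's find() accepts the same namespaces mapping. The byte string body is a hypothetical stand-in for r.content, trimmed to the one element of interest.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for r.content: bytes, including an encoding declaration.
body = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<feed xmlns="http://www.w3.org/2005/Atom" '
    b'xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">'
    b"<opensearch:totalResults>4</opensearch:totalResults>"
    b"</feed>"
)

# Map the prefix used in the search path to the namespace URI.
ns = {"opensearch": "http://a9.com/-/spec/opensearch/1.1/"}

root = ET.fromstring(body)
total = root.find("opensearch:totalResults", ns).text
print(total)
```

The prefix in the mapping only has to match the prefix used in the search path; it does not need to be the same prefix the document uses.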

How to remove all " \n" in xml payload by using lxml library

I'm trying to change a text value in an XML file and return the updated XML content using the lxml library. I can update the value successfully, but the updated XML contains "\n" (newline) characters, as shown below.
Output:
<?xml version='1.0' encoding='ASCII'?>\n<Order>\n <content>\n <sID>123</sID>\n <spNumber>UserTemp</spNumber>\n <client>ARRCHANA</client>\n <orderType>Dashboard</orderType>\n </content>\n
<content>\n <sID>111</sID>\n <spNumber>UserTemp</spNumber>\n <client>ARRCHANA</client>\n <orderType>Dashboard</orderType>\n </content>\n
</Order>
Note: I have not reformatted the XML output above; it is posted exactly as it appears in the output console.
Input:
<Order>
<content>
<sID>123</sID>
<spNumber>UserTemp</spNumber>
<client>WALLMART</client>
<orderType>Dashboard</orderType>
</content>
<content>
<sID>111</sID>
<spNumber>UserTemp</spNumber>
<client>D&amp;B</client>
<orderType>Dashboard</orderType>
</content>
</Order>
Also, I tried to remove the \n character in output xml file by using
getValue = getValue.replace('\n','')
but, no luck.
This is the code I used to update the XML (the client tag) and to return the updated XML content.
Python Code:
from lxml import etree
from io import StringIO
import six
import numpy
def getListOfNodes(location):
    f = open(location)
    xml = f.read()
    f.close()
    #print(xml)
    getXml = etree.parse(location)
    for elm in getXml.xpath('.//Order//content/client'):
        index='ARRCHANA'
        elm.text=index
    #with open('C:\\New folder\\temp.xml','w',newline='\r\n') as writeFile:
        #writeFile.write(str(etree.tostring(getXml,pretty_print=True, xml_declaration=True)))
    getValue=str((etree.tostring(getXml,pretty_print=True, xml_declaration=True)))
    #getValue = getValue.replace('\n','')
    #getValue=getValue.replace("\n","<br/>")
    print(getValue)
    return getValue
When I try to open the response payload in Firefox, it shows the following error message:
XML Parsing Error: no element found Location:
file:///C:/New%20folder/Confidential.xml
Line Number 1, Column 1:
In other words, it reports "no element found" at line 1, column 1 of the XML file when the file contains the "\n" characters.
Can somebody suggest a better way to update the text value and return the XML without any additional characters?
I fixed it myself with the script below:
code = root.xpath('.//Order//content/client')
if code:
    code[0].text = 'ARRCHANA'
# Raw string so that \t in the path is not read as a tab character
etree.ElementTree(root).write(r'D:\test.xml', pretty_print=True)
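A likely explanation for the literal "\n" sequences in the question: tostring() returns bytes, and wrapping bytes in str() produces the bytes repr (b'...' with escaped newlines) rather than decoded text, so replace('\n', '') finds nothing to remove. The sketch below illustrates this with the standard library's xml.etree.ElementTree so that it runs without lxml; lxml's etree.tostring() also returns bytes by default. The sample document is a shortened, hypothetical version of the input above.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical version of the <Order> input from the question.
root = ET.fromstring(
    "<Order>\n  <content>\n    <client>WALLMART</client>\n  </content>\n</Order>"
)
for elm in root.iter("client"):
    elm.text = "ARRCHANA"

raw = ET.tostring(root)      # bytes, not str
wrong = str(raw)             # repr of the bytes: starts with b' and holds backslash + n
right = raw.decode("ascii")  # real text with real newline characters

# Asking for text directly avoids the bytes round trip altogether.
as_text = ET.tostring(root, encoding="unicode")

print(wrong)
print(right)
```

Writing the tree straight to a file, as the accepted fix does, sidesteps the problem for the same reason: the bytes never pass through str().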

Shelve raises out of overflow pages

I have a lot of data that I read from an XML file, but only part of it will be saved in the database later.
The XML is structured like this:
<element id="1" other_attrib="">
<element id="2" other_attrib="">
...
<other_element>
<elem id="1">
<elem id="100">
</other_element>
<other_element>
<elem id...>
</other_element>
All of the element tags precede the other_element tags (this XML is third-party, so I can't restructure it).
I have to read all of the elements first because they contain additional attributes, but I only save those that are referenced by other_element tags.
So I opened a shelve with writeback=False and used lxml.etree.iterparse() to parse the XML, saving elements as I go. After many elements have been added (I don't know the exact number, but it runs into the hundreds of thousands), I receive the following error:
HASH: Out of overflow pages. Increase page size
Traceback (most recent call last):
File "/Library/Python/2.7/site-packages/celery/app/trace.py", line 240, in trace_task
R = retval = fun(*args, **kwargs)
File "/Library/Python/2.7/site-packages/celery/app/trace.py", line 438, in __protected_call__
return self.run(*args, **kwargs)
...
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/shelve.py", line 133, in __setitem__
self.dict[key] = f.getvalue()
Basically, this is what I do for every element that iterparse returns:
if elem.tag == 'element':
    # lxml elements expose attributes through .get() / .attrib, not elem['name']
    shlv[elem.get('id')] = {'attrib1': elem.get('other_attrib'), 'attrib2': elem.get('attrib2')}
elif elem.tag == 'other_element':
    # Here I iterate through this tag's children and find references in the shlv object
    for ref in elem:
        save_in_database(shlv[ref.attrib['id']])
What can I change so that shelve handles more data? Or should I use something else to store that data?
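One option for "something else", sketched here as an assumption rather than a tested drop-in replacement: keep the intermediate attributes in a SQLite table through the standard library's sqlite3 module, which does not depend on the dbm hash backend that raises the overflow-pages error. The class name AttribStore, the table layout, and the use of JSON for the values are all hypothetical choices.

```python
import json
import sqlite3


class AttribStore:
    """Minimal dict-like key/value store backed by SQLite (hypothetical helper)."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS elements "
            "(id TEXT PRIMARY KEY, attribs TEXT NOT NULL)"
        )

    def __setitem__(self, key, value):
        # Values are serialized as JSON; they must be JSON-compatible.
        self.conn.execute(
            "INSERT OR REPLACE INTO elements (id, attribs) VALUES (?, ?)",
            (key, json.dumps(value)),
        )

    def __getitem__(self, key):
        row = self.conn.execute(
            "SELECT attribs FROM elements WHERE id = ?", (key,)
        ).fetchone()
        if row is None:
            raise KeyError(key)
        return json.loads(row[0])

    def close(self):
        self.conn.commit()
        self.conn.close()


store = AttribStore()  # pass a file path instead of ":memory:" for on-disk storage
store["1"] = {"attrib1": "", "attrib2": "x"}
store["2"] = {"attrib1": "a", "attrib2": "y"}
looked_up = store["2"]
```

In the parsing loop, store would take the place of shlv. Inserts are only committed in close() here, so a long run may want periodic commits as well.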

WordPress XML ParseError: unbound prefix?

I am trying to use Python's xml.etree.ElementTree.parse() function to parse an XML file I created by exporting all of the content from a WordPress blog. However, when I try like so:
import xml.etree.ElementTree as xml
tree = xml.parse('/path/to/file.xml')
I get the following error:
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1183, in parse
tree.parse(source, parser)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 656, in parse
parser.feed(data)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1643, in feed
self._raiseerror(v)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1507, in _raiseerror
raise err
ParseError: unbound prefix: line 189, column 1
Here's what's on line 189 of my XML file:
<atom:link rel="search" type="application/opensearchdescription+xml" href="http://blogname.wordpress.com/osd.xml" title="blog name" />
I've seen many questions about this error coming up with Android development, but I can't tell if and how that applies to my situation. Can anyone help with this?
Apologies to everyone for whom this was obvious, but it turns out I simply didn't have a namespace definition for "atom" in the document. I'm guessing that "unbound prefix" means the prefix "atom" wasn't bound to a namespace definition.
Anyway, adding that definition solved the problem, although it makes me wonder why WordPress exports XML files without proper definitions for all of the namespaces they use...
If you remove all of the namespace prefixes, it parses fine.
Change
<s:home>USA</s:home>
to
<home>USA</home>
Just in case it helps someone some day, I was also working with a WordPress XML export (WordPress eXtended RSS) file in Python and was getting the same error. In my case, WordPress had included most of the correct namespace definitions. However, the XML had iTunes podcast information as well, and the iTunes namespace declaration was not present.
I fixed it by adding xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" into the RSS declaration block. So this:
<!-- generator="WordPress/4.9.8" created="2018-08-06 03:12" -->
<rss version="2.0"
xmlns:excerpt="http://wordpress.org/export/1.2/excerpt/"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:wp="http://wordpress.org/export/1.2/"
>
became this:
<!-- generator="WordPress/4.9.8" created="2018-08-06 03:12" -->
<rss version="2.0"
xmlns:excerpt="http://wordpress.org/export/1.2/excerpt/"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:wp="http://wordpress.org/export/1.2/"
xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd"
>
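The error and the fix described in these answers can be reproduced in a few lines. This is a minimal sketch: the document strings are hypothetical, and the Atom namespace URI http://www.w3.org/2005/Atom is assumed to be the one intended for the atom prefix.

```python
import xml.etree.ElementTree as ET

# The atom prefix is used but never declared, so parsing fails.
broken = '<rss version="2.0"><atom:link href="http://example.com/osd.xml" /></rss>'

try:
    ET.fromstring(broken)
    message = ""
except ET.ParseError as err:
    message = str(err)

# Declaring the prefix on the root element binds it to a namespace URI.
fixed = broken.replace(
    "<rss ", '<rss xmlns:atom="http://www.w3.org/2005/Atom" ', 1
)
root = ET.fromstring(fixed)

# After parsing, the element is addressed by namespace URI, not by prefix.
link = root.find("{http://www.w3.org/2005/Atom}link")
print(message)
print(link.get("href"))
```

Patching the declaration in as a string before parsing is a workaround for files that cannot be re-exported; editing the export once by hand, as the answers above did, achieves the same thing.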
