Encoding error while parsing RSS with lxml - python

I want to parse downloaded RSS with lxml, but I don't know how to handle the UnicodeDecodeError:
request = urllib2.Request('http://wiadomosci.onet.pl/kraj/rss.xml')
response = urllib2.urlopen(request)
response = response.read()
encd = chardet.detect(response)['encoding']
parser = etree.XMLParser(ns_clean=True,recover=True,encoding=encd)
tree = etree.parse(response, parser)
But I get an error:
tree = etree.parse(response, parser)
File "lxml.etree.pyx", line 2692, in lxml.etree.parse (src/lxml/lxml.etree.c:49594)
File "parser.pxi", line 1500, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:71364)
File "parser.pxi", line 1529, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:71647)
File "parser.pxi", line 1429, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:70742)
File "parser.pxi", line 975, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:67
740)
File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etr
ee.c:63824)
File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:64745)
File "parser.pxi", line 559, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64027)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 97: ordinal not in range(128)

I ran into a similar problem, and it turns out this has NOTHING to do with encodings. What's happening is this: lxml is throwing you a totally unrelated error. In this case, the error is that the .parse function expects a filename or URL, not a string with the contents itself. However, when it tries to print out the error, it chokes on non-ASCII characters and shows that completely confusing error message. It is highly unfortunate, and other people have commented on this issue here:
https://mailman-mail5.webfaction.com/pipermail/lxml/2009-February/004393.html
Luckily, yours is a very easy fix. Just replace .parse with .fromstring and you should be totally good to go:
request = urllib2.Request('http://wiadomosci.onet.pl/kraj/rss.xml')
response = urllib2.urlopen(request)
response = response.read()
encd = chardet.detect(response)['encoding']
parser = etree.XMLParser(ns_clean=True, recover=True, encoding=encd)
# fromstring() parses from a string in memory; parse() wants a file or URL
tree = etree.fromstring(response, parser)
Just tested this on my machine and it worked fine. Hope it helps!

It's often easier to get the string loaded and sorted out for the lxml library first, and then call fromstring on it, rather than rely on the lxml.etree.parse() function and its difficult-to-manage encoding options.
This particular RSS file begins with an encoding declaration, so everything should just work:
<?xml version="1.0" encoding="utf-8"?>
The following code shows some of the variations you can apply to make etree parse different encodings. You can also ask it to write out different encodings, which will appear in the XML declaration.
import lxml.etree
import urllib2
request = urllib2.Request('http://wiadomosci.onet.pl/kraj/rss.xml')
response = urllib2.urlopen(request).read()
print [response]
# ['<?xml version="1.0" encoding="utf-8"?>\n<feed xmlns=... <title>Wiadomo\xc5\x9bci...']
uresponse = response.decode("utf8")
print [uresponse]
# [u'<?xml version="1.0" encoding="utf-8"?>\n<feed xmlns=... <title>Wiadomo\u015bci...']
tree = lxml.etree.fromstring(response)
res = lxml.etree.tostring(tree)
print [res]
# ['<feed xmlns="http://www.w3.org/2005/Atom">\n<title>Wiadomości...']
lres = lxml.etree.tostring(tree, encoding="latin1")
print [lres]
# ["<?xml version='1.0' encoding='latin1'?>\n<feed xmlns=...<title>Wiadomości...']
# works because the 38 character encoding declaration is sliced off
print lxml.etree.fromstring(uresponse[38:])
# throws ValueError(u'Unicode strings with encoding declaration are not supported.',)
print lxml.etree.fromstring(uresponse)
Code can be tried here:
http://scraperwiki.com/scrapers/lxml_and_encoding_declarations/edit/#

You should probably only define the character encoding as a last resort, since the encoding is clear from the XML prolog (if not from the HTTP headers). Anyway, it's unnecessary to pass the encoding to etree.XMLParser unless you want to override the encoding, so get rid of the encoding parameter and it should work.
Edit: okay, the problem actually seems to be with lxml. The following works, for whatever reason:
parser = etree.XMLParser(ns_clean=True, recover=True)
etree.parse('http://wiadomosci.onet.pl/kraj/rss.xml', parser)
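If you would rather keep etree.parse() with content you have already downloaded, a minimal sketch (assuming Python 2, as in the question) is to hand it a file-like object, which parse() accepts alongside filenames and URLs:
from io import BytesIO
from lxml import etree
import urllib2

response = urllib2.urlopen('http://wiadomosci.onet.pl/kraj/rss.xml').read()
parser = etree.XMLParser(ns_clean=True, recover=True)
# parse() takes filenames, URLs and file-like objects, but not raw strings
tree = etree.parse(BytesIO(response), parser)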

Related

Python Post Request Response Xml Error load fromstring

I'm new to Python and I have encountered something that I am not sure how to resolve. I'm sure it must be a simple fix, but I haven't found a solution and hope someone with more knowledge of Python will be able to help.
My request:
...
contacts = requests.post(url, data=readContactsXml, headers=headers)
#print (contacts.content)
outF = open("contact.xml", "wb")
outF.write(contacts.content)
outF.close()
All is fine with the above until I have to manipulate the data before saving it, e.g.:
...
contacts = requests.post(url, data=readContactsXml, headers=headers)
import xml.etree.ElementTree as ET
# contacts.encoding = 'utf-8'
parser = ET.XMLParser(encoding="UTF-8")
tree = ET.fromstring(contacts.content, parser=parser)
root = tree.getroot()
for item in root[0][0].findall('.//fields'):
    if item[0].text == 'matching-text-here':
        if not item[1].text:
            item[1].text = 'N/A'
        print(item[1].text)
#print (contacts.content)
outF = open("contact.xml", "wb")
outF.write(contacts.content)
outF.close()
In the above I am literally replacing an empty value with the value 'N/A'.
The error that I'm receiving is:
Traceback (most recent call last):
File "Desktop/PythonTests/test.py", line 107, in <module>
tree = ET.fromstring(contacts.content, parser=parser)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1311, in XML
parser.feed(text)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1659, in feed
self._raiseerror(v)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1523, in _raiseerror
raise err
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 192300
Looking around this column I can see text with accented characters, e.g. Sinéd; the é seems to be the problem here. Actually, when I just save this XML file and open it in the browser, I get more or less the same error at about the same column (off by two):
This page contains the following errors:
error on line 1 at column 192298: Encoding error
Below is a rendering of the page up to the first error.
I wonder what I can do with an XML response that contains data with such characters?
Anyone, any help appreciated!
Found my answer after digging through Stack Overflow.
I've modified:
FROM:
tree = ET.fromstring(contacts.content, parser=parser)
TO:
tree = ElementTree(fromstring(contacts.content))
REF: https://stackoverflow.com/questions/33962620/elementtree-returns-element-instead-of-elementtree/44483259#44483259
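For context, a minimal sketch of why this change works (the XML literal is a hypothetical stand-in for contacts.content): fromstring() returns an Element, which has no getroot(), while wrapping it in an ElementTree restores the tree.getroot() call used above:
import xml.etree.ElementTree as ET

xml_bytes = b'<root><fields><field>some text</field></fields></root>'
tree = ET.ElementTree(ET.fromstring(xml_bytes))
root = tree.getroot()  # works now, because tree really is an ElementTree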

lxml: some XML from URL give this lxml.etree.XMLSyntaxError

I have a script which is supposed to extract some terms from XML files from a list of URLs.
All the URLs give access to XML data.
It works fine at first, opening, parsing, and extracting correctly, but then gets interrupted partway through by some XML files with this error:
File "<stdin>", line 18, in <module>
File "lxml.etree.pyx", line 2953, in lxml.etree.parse (src/lxml/lxml.etree.c:56204)
File "parser.pxi", line 1555, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:82511)
File "parser.pxi", line 1585, in lxml.etree._parseFilelikeDocument (src/lxml/lxml.etree.c:82832)
File "parser.pxi", line 1468, in lxml.etree._parseDocFromFilelike (src/lxml/lxml.etree.c:81688)
File "parser.pxi", line 1024, in lxml.etree._BaseParser._parseDocFromFilelike (src/lxml/lxml.etree.c:78735)
File "parser.pxi", line 569, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:74472)
File "parser.pxi", line 650, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:75363)
File "parser.pxi", line 590, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:74696)
lxml.etree.XMLSyntaxError: Document is empty, line 1, column 1
From my search it might be because some XML files have white space, but I'm not sure that is the problem. I can't tell which files give the error.
Is there a way to get around this error?
Here is my script:
URLlist = ["http://www.uniprot.org/uniprot/"+x+".xml" for x in IDlist]
for id, item in zip(IDlist, URLlist):
goterm_location = []
goterm_function = []
goterm_process = []
location_list[id] = []
function_list[id] = []
biological_list[id] = []
try:
textfile = urllib2.urlopen(item);
except urllib2.HTTPError:
print("URL", item, "could not be read.")
continue
#Try to solve empty line error#
tree = etree.parse(textfile);
#root = tree.getroot()
for node in tree.iter('{http://uniprot.org/uniprot}dbReference'):
if node.attrib.get('type') == 'GO':
for child in node:
value = child.attrib.get('value');
if value.startswith('C:'):
goterm_C = node.attrib.get('id')
if goterm_C:
location_list[id].append(goterm_C);
if value.startswith('F:'):
goterm_F = node.attrib.get('id')
if goterm_F:
function_list[id].append(goterm_F);
if value.startswith('P:'):
goterm_P = node.attrib.get('id')
if goterm_P:
biological_list[id].append(goterm_P);
I have tried:
tree = etree.iterparse(textfile, events=("start", "end"))
OR
parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse(textfile, parser)
Without success.
Any help would be greatly appreciated
"I can't tell which files give the error"
Debug by printing the name of the file/URL prior to parsing. Then you'll see which file(s) cause the error.
Also, read the error message:
lxml.etree.XMLSyntaxError: Document is empty, line 1, column 1
this suggests that the downloaded XML file is empty. Once you have determined the URL(s) that cause the problem, try downloading the file and check its contents. I suspect it might be empty.
You can ignore problematic files (empty or otherwise syntactically invalid) by using a try/except block when parsing:
try:
    tree = etree.parse(textfile)
except etree.XMLSyntaxError:  # etree from "from lxml import etree", as in your script
    print 'Skipping invalid XML from URL {}'.format(item)
    continue  # go on to the next URL
Or you could check just for empty files by checking the 'Content-length' header, or even by reading the resource returned by urlopen(), but I think that the above is better as it will also catch other potential errors.
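For the explicit empty check, a minimal sketch (assuming the urllib2 loop from the question; the URL here is a hypothetical example):
from io import BytesIO
from lxml import etree
import urllib2

item = 'http://www.uniprot.org/uniprot/P12345.xml'  # hypothetical example URL
textfile = urllib2.urlopen(item)
content = textfile.read()
if not content.strip():
    print 'Skipping empty response from URL {}'.format(item)
else:
    tree = etree.parse(BytesIO(content))  # parse from the bytes we already read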
I got the same error message in Python 3.6
lxml.etree.XMLSyntaxError: Document is empty, line 1, column 1
In my case the XML file was not empty. The issue was the encoding.
Initially I used utf-8:
from lxml import etree
etree.iterparse('my_xml_file.xml', tag='MyTag', encoding='utf-8')
Changing the encoding to iso-8859-1 solved my issue:
etree.iterparse('my_xml_file.xml', tag='MyTag', encoding='iso-8859-1')
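Note that iterparse() returns a lazy iterator, so nothing is parsed until you consume it; a short sketch of the usual pattern (file name and tag are hypothetical):
from lxml import etree

context = etree.iterparse('my_xml_file.xml', tag='MyTag',
                          events=('end',), encoding='iso-8859-1')
for event, elem in context:
    print(elem.text)
    elem.clear()  # free each element after handling it to keep memory use flat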

Python error with decode utf-8 and Japanese characters

Traceback (most recent call last):
File "C:\Program Files (x86)\Python\Projects\test.py", line 70, in <module>
html = urlopen("https://www.google.co.jp/").read().decode('utf-8')
File "C:\Program Files (x86)\Python\lib\http\client.py", line 506, in read
return self._readall_chunked()
File "C:\Program Files (x86)\Python\lib\http\client.py", line 592, in _readall_chunked
value.append(self._safe_read(chunk_left))
File "C:\Program Files (x86)\Python\lib\http\client.py", line 664, in _safe_read
raise IncompleteRead(b''.join(s), amt)
http.client.IncompleteRead: IncompleteRead(5034 bytes read, 3158 more expected)
So I am trying to get data from a website, but it seems that whenever it comes across Japanese or other unreadable characters it comes up with this error. All I am using is urlopen and .read().decode('utf-8'). Is there some way I can just ignore or replace all of those characters so there is no error?
In the code you posted, there is no problem with character encoding. Instead you have a problem getting the whole HTTP response. (Look closely at the error message.)
I tried this in an interactive Python shell:
>>> import urllib2
>>> url = urllib2.urlopen("https://www.google.co.jp/")
>>> body = url.read()
>>> len(body)
11155
This worked.
>>> body.decode('utf-8')
UnicodeDecodeError: 'utf8' codec can't decode byte 0x90 in position 102: invalid start byte
Ok, there is indeed an encoding error.
>>> url.headers['Content-Type']
'text/html; charset=Shift_JIS'
This is because your HTTP response is not encoded in UTF-8, but in Shift-JIS.
You should probably not use urllib2 but a higher level library that takes care of the HTTP encoding. Or, if you want to do it yourself, see https://stackoverflow.com/a/20714761.
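If you do stay with urllib2, a minimal sketch (Python 2) of honoring the declared charset instead of assuming UTF-8:
import cgi
import urllib2

url = urllib2.urlopen("https://www.google.co.jp/")
body = url.read()
# read the charset out of the Content-Type header rather than guessing
_, params = cgi.parse_header(url.headers['Content-Type'])
text = body.decode(params.get('charset', 'utf-8'))  # fall back to UTF-8 if absent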
Use requests and BeautifulSoup:
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.google.co.jp/")
soup = BeautifulSoup(r.content)
print soup.find_all("p")
[<p style="color:#767676;font-size:8pt">© 2013 - プライバシーと利用規約</p>]
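As a side note, requests can also do the decoding itself; a small sketch: r.encoding comes from the Content-Type header, and r.text is the already-decoded unicode body.
import requests

r = requests.get("https://www.google.co.jp/")
print r.encoding   # e.g. 'Shift_JIS', taken from the response headers
print r.text[:80]  # the body, decoded to unicode using that charset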

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2

I am creating an XML file in Python, and there's a field in my XML where I put the contents of a text file. I do it by:
import xml.etree.ElementTree as ET

f = open('myText.txt', "r")
data = f.read()
f.close()
root = ET.Element("add")
doc = ET.SubElement(root, "doc")
field = ET.SubElement(doc, "field")
field.set("name", "text")
field.text = data
tree = ET.ElementTree(root)
tree.write("output.xml")
And then I get the UnicodeDecodeError. I already tried to put the special comment # -*- coding: utf-8 -*- at the top of my script, but I still got the error. I also tried to force the encoding of my variable with data.encode('utf-8'), but still got the error. I know this issue is very common, but all the solutions I got from other questions didn't work for me.
UPDATE
Traceback: Using only the special comment on the first line of the script
Traceback (most recent call last):
File "D:\Python\lse\createxml.py", line 151, in <module>
tree.write("D:\\python\\lse\\xmls\\" + items[ctr][0] + ".xml")
File "C:\Python27\lib\xml\etree\ElementTree.py", line 820, in write
serialize(write, self._root, encoding, qnames, namespaces)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 939, in _serialize_xml
_serialize_xml(write, e, encoding, qnames, None)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 939, in _serialize_xml
_serialize_xml(write, e, encoding, qnames, None)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 937, in _serialize_xml
write(_escape_cdata(text, encoding))
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1073, in _escape_cdata
return text.encode(encoding, "xmlcharrefreplace")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 243: ordinal not in range(128)
Traceback: Using .encode('utf-8')
Traceback (most recent call last):
File "D:\Python\lse\createxml.py", line 148, in <module>
field.text = data.encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 227: ordinal not in range(128)
I used .decode('utf-8') and the error message didn't appear, and it successfully created my XML file. But the problem is that the XML is not viewable in my browser.
You need to decode the input data to unicode before using it, to avoid encoding problems:
field.text = data.decode("utf8")
I was running into a similar error in pywikipediabot. The .decode method is a step in the right direction but for me it didn't work without adding 'ignore':
ignore_encoding = lambda s: s.decode('utf8', 'ignore')
Ignoring encoding errors can lead to data loss or produce incorrect output. But if you just want to get it done and the details aren't very important this can be a good way to move faster.
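For illustration, a quick sketch of what the two lossy error handlers do with a byte string containing one invalid byte (the literal is hypothetical):
raw = 'Wiadomo\xc5\x9bci \xff'       # valid UTF-8 plus one stray 0xff byte
print raw.decode('utf8', 'ignore')   # u'Wiadomo\u015bci ' - the bad byte is dropped
print raw.decode('utf8', 'replace')  # u'Wiadomo\u015bci \ufffd' - replaced with U+FFFD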
Python 2
The error is raised because ElementTree did not expect to find non-ASCII byte strings in the XML when trying to write it out. You should use Unicode strings for non-ASCII text instead. Unicode strings can be made either by using the u prefix on string literals, i.e. u'€', or by decoding a byte string with mystr.decode('utf-8') using the appropriate encoding.
The best practice is to decode all text data as it's read, rather than decoding mid-program. The io module provides an open() function which decodes text data to Unicode strings as it's read.
ElementTree will be much happier with Unicode strings and will encode them correctly when using the ET.write() method.
Also, for best compatibility and readability, ensure that ET encodes to UTF-8 during write() and adds the relevant XML declaration.
Presuming your input file is UTF-8 encoded (0xC2 is a common UTF-8 lead byte), putting everything together, and using the with statement, your code should look like:
import io
import xml.etree.ElementTree as ET

with io.open('myText.txt', "r", encoding='utf-8') as f:
    data = f.read()

root = ET.Element("add")
doc = ET.SubElement(root, "doc")
field = ET.SubElement(doc, "field")
field.set("name", "text")
field.text = data
tree = ET.ElementTree(root)
tree.write("output.xml", encoding='utf-8', xml_declaration=True)
Output:
<?xml version='1.0' encoding='utf-8'?>
<add><doc><field name="text">data€</field></doc></add>
Try adding this at the top of your Python file:
#!/usr/bin/python
# encoding=utf8

Beautiful Soup raises UnicodeEncodeError "ordinal not in range(128)"

I am trying to parse arbitrary documents downloaded from the wild web, and yes, I have no control over their content.
Since Beautiful Soup won't choke if you give it bad markup... I wonder why it gives me these hiccups when part of the doc is malformed, and whether there is a way to make it resume at the next readable portion of the doc, regardless of this error.
The line where the error occurred is the 3rd one:
from BeautifulSoup import BeautifulSoup as doc_parser
reader = open(options.input_file, "rb")
doc = doc_parser(reader)
The full CLI output is:
Traceback (most recent call last):
File "./grablinks", line 101, in <module>
sys.exit(main())
File "./grablinks", line 88, in main
links = grab_links(options)
File "./grablinks", line 36, in grab_links
doc = doc_parser(reader)
File "/usr/local/lib/python2.7/dist-packages/BeautifulSoup.py", line 1519, in __init__
BeautifulStoneSoup.__init__(self, *args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/BeautifulSoup.py", line 1144, in __init__
self._feed(isHTML=isHTML)
File "/usr/local/lib/python2.7/dist-packages/BeautifulSoup.py", line 1186, in _feed
SGMLParser.feed(self, markup)
File "/usr/lib/python2.7/sgmllib.py", line 104, in feed
self.goahead(0)
File "/usr/lib/python2.7/sgmllib.py", line 143, in goahead
k = self.parse_endtag(i)
File "/usr/lib/python2.7/sgmllib.py", line 320, in parse_endtag
self.finish_endtag(tag)
File "/usr/lib/python2.7/sgmllib.py", line 358, in finish_endtag
method = getattr(self, 'end_' + tag)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 15-16: ordinal not in range(128)
Yeah, it will choke if you have elements with non-ASCII names (<café>). And that's not even ‘bad markup’, for XML...
It's a bug in sgmllib, which BeautifulSoup is using: it tries to find custom methods with the same names as tags, but in Python 2 method names are byte strings, so even looking up a method with a non-ASCII character in its name, which will never be present, fails.
You can hack a fix into sgmllib by changing lines 259 and 371 from except AttributeError: to except AttributeError, UnicodeError: but that's not really a good fix. It's not trivial to override the rest of the method either.
What is it you're trying to parse? BeautifulStoneSoup was always of questionable usefulness, really: XML doesn't have the wealth of ghastly parser hacks that HTML does, so in general broken XML isn't XML. Consequently you should generally use a plain old XML parser (e.g. a standard DOM or etree). For parsing general HTML, html5lib is your better option these days.
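For instance, a minimal html5lib sketch (assuming the package is installed; the file name is hypothetical):
import html5lib

with open('page.html', 'rb') as f:
    doc = html5lib.parse(f)  # tolerant HTML5 parsing with spec-defined error recovery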
This happens if there are non-ASCII chars in the input, in Python versions before 3.0.
If you try to use str(...) on a string containing characters with a value > 128 (i.e. beyond ASCII), this exception is raised.
Here, the error probably occurs because getattr tries to use str on a unicode string: it "thinks" it can safely do this because in Python versions prior to 3.0, identifiers must not contain unicode.
Check your HTML for unicode characters. Try to replace / encode these, and if it still does not work, tell us.
