ast.literal_eval somehow throwing UnicodeDecodeError - python

OK so...
- a Unicode string gets encoded to a Python 2.x string (actually, a sequence of bytes)
- a Python 2.x string gets decoded to a Unicode string
(see "Python UnicodeDecodeError - Am I misunderstanding encode?")
I've got this Python 2.7 code:
try:
    print '***'
    print type(relationsline)
    relationsline = relationsline.decode("ascii", "ignore")
    print type(relationsline)
    relationsline = relationsline.encode("ascii", "ignore")
    print type(relationsline)
    relations = ast.literal_eval(relationsline)
except ValueError:
    return
except UnicodeDecodeError:
    return
The ast.literal_eval call in the code above sometimes throws
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 341: ordinal not in range(128)
I would think that this would (1) start with a string in some (unknown) encoding, (2) decode it into a unicode object, keeping only the characters that can be decoded as ASCII and ignoring the rest, and (3) encode that unicode object back into an ASCII byte string, again ignoring any characters that can't be represented in ASCII.
Here is the full stack trace:
Traceback (most recent call last):
  File "outputprocessor.py", line 69, in <module>
    getPersonRelations(lines, fname)
  File "outputprocessor.py", line 41, in getPersonRelations
    relations = ast.literal_eval(relationsline)
  File "/usr/lib/python2.7/ast.py", line 49, in literal_eval
    node_or_string = parse(node_or_string, mode='eval')
  File "/usr/lib/python2.7/ast.py", line 37, in parse
    return compile(source, filename, mode, PyCF_ONLY_AST)
  File "<unknown>", line 1
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 341: ordinal not in range(128)
                      ^
SyntaxError: invalid syntax
But that is clearly wrong somewhere. Even more perplexing, the except UnicodeDecodeError clause is not catching the UnicodeDecodeError. What am I missing? Maybe this is the problem? http://bugs.python.org/issue22221

Look at the stack trace more closely. It is throwing a SyntaxError, not a UnicodeDecodeError.
You are trying to literal_eval the string "UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 341: ordinal not in range(128)". You can encode/decode that string all you want, but ast won't know what to do with it - that's clearly not a valid python literal.
See:
>>> import ast
>>> ast.literal_eval('''UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 341: ordinal not in range(128)''')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/ast.py", line 49, in literal_eval
    node_or_string = parse(node_or_string, mode='eval')
  File "/usr/lib/python2.7/ast.py", line 37, in parse
    return compile(source, filename, mode, PyCF_ONLY_AST)
  File "<unknown>", line 1
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 341: ordinal not in range(128)
                      ^
SyntaxError: invalid syntax
I would look at the source of whatever is passing these strings to your function; it's generating some bogus input.

The passed-in string is itself the text of an error message, and you are trying to literal_eval it after running it through relationsline = relationsline.encode("ascii", "ignore").
You will need to move your literal_eval check into its own try/except, catch the exception (a SyntaxError) in your original try block, or filter the input somehow.
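As a rough sketch of that suggestion (written for Python 3, although the question uses 2.7), the parse can be wrapped so that both failure modes are caught. The helper name parse_literal and the choice to return None on failure are assumptions made for illustration, not part of the original code.

```python
import ast


def parse_literal(text):
    """Return the Python literal encoded in text, or None if parsing fails.

    Hypothetical helper: returning None is an illustrative choice, and it
    cannot be told apart from a successfully parsed "None" literal.
    """
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        # ValueError: the text parses, but is not a plain literal.
        # SyntaxError: the text is not a valid expression at all.
        return None


# The string from the traceback above is not a Python literal.
bogus = ("UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc "
         "in position 341: ordinal not in range(128)")

good = parse_literal("{'a': [1, 2]}")
bad = parse_literal(bogus)
```

Catching SyntaxError alongside ValueError is the key difference from the code in the question, which only handled ValueError and UnicodeDecodeError.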

Related

Python - cannot decode html (urllib)

I'm trying to write the HTML from a webpage to a file, but I have a problem decoding characters:
import urllib.request
response = urllib.request.urlopen("https://www.google.com")
charset = response.info().get_content_charset()
print(response.read().decode(charset))
Last line causes error:
Traceback (most recent call last):
  File "script.py", line 7, in <module>
    print(response.read().decode(charset))
UnicodeEncodeError: 'ascii' codec can't encode character '\u015b' in position 6079: ordinal not in range(128)
response.info().get_content_charset() returns iso-8859-2, but if I check the content of the response without decoding (print(response.read())), the HTML meta tag declares "utf-8" as the encoding. If I use "utf-8" in the decode function, there is a similar problem:
Traceback (most recent call last):
  File "script.py", line 7, in <module>
    print(response.read().decode("utf-8"))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb6 in position 6111: invalid start byte
What's going on?
You can ignore invalid characters using:
response.read().decode("utf-8", 'ignore')
Besides ignore, there are other error handlers, e.g. replace:
https://www.tutorialspoint.com/python/string_encode.htm
https://docs.python.org/3/howto/unicode.html#the-string-type
(There is also str.encode(encoding='UTF-8',errors='strict') for strings.)
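A minimal sketch of how the error handlers behave, using a short hand-made byte string in place of the real page body (the sample bytes are an assumption). It also shows that the byte 0xb6 from the second traceback is a valid ISO-8859-2 encoding of '\u015b', the character named in the first traceback.

```python
# Stand-in for the page body: valid UTF-8 for U+015B, then a stray 0xb6.
raw = b"\xc5\x9b" + b"\xb6"

strict_failed = False
try:
    raw.decode("utf-8")  # errors="strict" is the default
except UnicodeDecodeError:
    strict_failed = True

ignored = raw.decode("utf-8", "ignore")    # drops the undecodable byte
replaced = raw.decode("utf-8", "replace")  # substitutes U+FFFD for it

# In ISO-8859-2 the byte 0xb6 is U+015B (s with acute).
latin2 = b"\xb6".decode("iso-8859-2")

# Encoding for an ASCII-only destination also fails in strict mode;
# an error handler such as backslashreplace avoids UnicodeEncodeError.
ascii_safe = latin2.encode("ascii", "backslashreplace")

print(ascii(ignored), ascii(replaced), ascii(latin2), ascii_safe)
```

Note that ignore and replace discard information, so they are best kept for cases where the true encoding really cannot be determined.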

Eliminate accents in python

I have this function to remove accents from a word:
import string
import unicodedata
def remove_accents(word):
    return ''.join(x for x in unicodedata.normalize('NFKD', word) if x in string.ascii_letters)
But when I run it, it shows an error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf3 in position 3: ordinal not in range(128)
The character in position 3 is: ó
If your input is a unicode string, it works:
>>> remove_accents(u"foóbar")
u'foobar'
If it isn't, it doesn't. I don't get the error you describe, though: I get a TypeError instead, and only get the UnicodeDecodeError if I try to convert it to unicode by doing:
>>> remove_accents(unicode("foóbar"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128)
If that is your problem, i.e. you have Python 2 str objects as an input, you can solve it by decoding it as utf-8 first:
>>> remove_accents("foóbar".decode("utf-8"))
u'foobar'
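For comparison, here is a sketch of the same function for Python 3, where str is already Unicode. Accepting bytes and the utf-8 default for decoding them are assumptions added for illustration.

```python
import string
import unicodedata


def remove_accents(word, encoding="utf-8"):
    # Accept bytes as well as str; the utf-8 default is an assumption.
    if isinstance(word, bytes):
        word = word.decode(encoding)
    # NFKD splits an accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", word)
    # Like the original, keep ASCII letters only, so digits, spaces and
    # punctuation are dropped along with the combining marks.
    return "".join(ch for ch in decomposed if ch in string.ascii_letters)


from_text = remove_accents("fo\u00f3bar")
from_bytes = remove_accents(b"fo\xc3\xb3bar")
```

Both calls should yield 'foobar'; the explicit decode step plays the role of .decode("utf-8") in the Python 2 answer.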

Python fails with parsing file using re

I have a file that is mostly ASCII, but some non-ASCII characters appear in it occasionally. I want to parse this file and extract the lines that are marked in a certain way. Previously I used sed for this, but now I need to do the same in Python. (Of course I can still use os.system, but I'm hoping for something more convenient.)
I'm doing the following:
p = re.compile(".*STATWAH ([0-9]*):([0-9]*):([0-9 ]*):([0-9 ]*) STATWAH.*")
f = open("capture_8_8_8__1_2_3.log", encoding="ascii")
fl = filter(lambda line: p.match(line), f)
len(list(fl))
And on the last line I get the following error message:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.4/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0x81 in position 2227: ordinal not in range(128)
If I remove the encoding parameter from the open() call, i.e. use the default encoding, which is utf-8, the error is the following:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.4/codecs.py", line 313, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x81 in position 2227: invalid start byte
Could you please tell me what I can do here, other than calling sed from Python?
UPD.
Thanks to @Wooble I found the answer.
The correct code looks like this:
p = re.compile(rb".*STATWAH ([0-9]*):([0-9]*):([0-9 ]*):([0-9 ]*) STATWAH.*")
f = open("capture_8_8_8__1_2_3.log", "rb")
fl = filter(lambda line: p.match(line), f)
len(list(fl))
I opened the file in binary mode and also compiled the regex from a bytes pattern.
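The sketch below reproduces that approach on in-memory data so it can be run without the log file; the sample bytes are made up for illustration. It also shows one possible alternative that is not from the original thread: keeping text mode but passing errors="replace", so undecodable bytes become U+FFFD instead of raising.

```python
import io
import re

# Hypothetical log content: mostly ASCII, with one stray 0x81 byte.
sample = (
    b"\x81junk STATWAH 1:2:3:4 STATWAH tail\n"
    b"an unmarked line\n"
    b"STATWAH 10:20:30:40 STATWAH\n"
)

pattern = re.compile(
    rb".*STATWAH ([0-9]*):([0-9]*):([0-9 ]*):([0-9 ]*) STATWAH.*")

# Binary mode: lines are bytes, so the pattern must be bytes as well.
# With a real file this would be open(path, "rb").
binary_hits = [line for line in io.BytesIO(sample) if pattern.match(line)]

# Text mode alternative: decode leniently and use a str pattern.
# With a real file: open(path, encoding="ascii", errors="replace").
text_pattern = re.compile(pattern.pattern.decode("ascii"))
text_stream = io.TextIOWrapper(
    io.BytesIO(sample), encoding="ascii", errors="replace")
text_hits = [line for line in text_stream if text_pattern.match(line)]

first_groups = pattern.match(sample).groups()
```

The binary variant leaves the matched lines byte-for-byte intact, while the text variant yields str lines in which the bad byte has been replaced.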

Why do some strings encode in utf-16, while others only encode in utf-8?

>>> unicode('восстановление информации', 'utf-16')
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
  File "C:\Python27\lib\encodings\utf_16.py", line 16, in decode
    return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0xb8 in position 48: truncated data
>>> unicode('восстановление информации', 'utf-8')
u'\u0432\u043e\u0441\u0441\u0442\u0430\u043d\u043e\u0432\u043b\u0435\u043d\u0438\u0435 \u0438\u043d\u0444\u043e\u0440\u043c\u0430\u0446\u0438\u0438'
Why do these Russian words encode in UTF-8 fine, but not UTF-16?
You are asking the unicode function to decode a byte string, and then giving it the wrong encoding.
Pasting your string into Python 2.7 on OS X gives:
>>> 'восстановление информации'
'\xd0\xb2\xd0\xbe\xd1\x81\xd1\x81\xd1\x82\xd0\xb0\xd0\xbd\xd0\xbe\xd0\xb2\xd0\xbb\xd0\xb5\xd0\xbd\xd0\xb8\xd0\xb5 \xd0\xb8\xd0\xbd\xd1\x84\xd0\xbe\xd1\x80\xd0\xbc\xd0\xb0\xd1\x86\xd0\xb8\xd0\xb8'
At this stage it is already a UTF-8 encoded string (probably your terminal determined this), so you can decode it by specifying the utf-8 codec
>>> 'восстановление информации'.decode('utf-8')
u'\u0432\u043e\u0441\u0441\u0442\u0430\u043d\u043e\u0432\u043b\u0435\u043d\u0438\u0435 \u0438\u043d\u0444\u043e\u0440\u043c\u0430\u0446\u0438\u0438'
But not UTF-16 as that would be invalid
>>> 'восстановление информации'.decode('utf-16')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_16.py", line 16, in decode
    return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0xb8 in position 48: truncated data
If you want to encode a unicode string to UTF-8 or UTF-16, then use
>>> u'восстановление информации'.encode('utf-16')
'\xff\xfe2\x04>\x04A\x04A\x04B\x040\x04=\x04>\x042\x04;\x045\x04=\x048\x045\x04 \x008\x04=\x04D\x04>\x04#\x04<\x040\x04F\x048\x048\x04'
>>> u'восстановление информации'.encode('utf-8')
'\xd0\xb2\xd0\xbe\xd1\x81\xd1\x81\xd1\x82\xd0\xb0\xd0\xbd\xd0\xbe\xd0\xb2\xd0\xbb\xd0\xb5\xd0\xbd\xd0\xb8\xd0\xb5 \xd0\xb8\xd0\xbd\xd1\x84\xd0\xbe\xd1\x80\xd0\xbc\xd0\xb0\xd1\x86\xd0\xb8\xd0\xb8'
Notice the input strings are unicode (have a u at the front), but the outputs here are byte-strings (they don't have u at the start) which contain the unicode data encoded in the respective formats.
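The same round trip can be sketched in Python 3, where the text/bytes split is explicit: str.encode produces bytes, and bytes.decode must be given the codec that produced them. The variable names are illustrative only.

```python
text = "восстановление информации"

# Encoding: str -> bytes. The two codecs produce different byte strings.
utf8_bytes = text.encode("utf-8")    # 2 bytes per Cyrillic letter, 1 for the space
utf16_bytes = text.encode("utf-16")  # 2-byte BOM, then 2 bytes per character

# Decoding: bytes -> str, using the codec that matches the bytes.
roundtrip_utf8 = utf8_bytes.decode("utf-8")
roundtrip_utf16 = utf16_bytes.decode("utf-16")

# Decoding UTF-8 bytes as UTF-16 fails: the 49 bytes cannot be split
# into whole 2-byte code units, so the last byte is left over.
wrong_codec_failed = False
try:
    utf8_bytes.decode("utf-16")
except UnicodeDecodeError:
    wrong_codec_failed = True

print(len(utf8_bytes), len(utf16_bytes), wrong_codec_failed)
```

The odd length of the UTF-8 byte string (49 bytes) is consistent with the "position 48: truncated data" message in the tracebacks above.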

UnicodeDecodeError reading string in CSV

I'm having a problem reading some characters in Python.
I have a CSV file in UTF-8 format that I'm reading, but when the script reads:
Preußen Münster-Kaiserslautern II
I get this error:
Traceback (most recent call last):
  File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/webapp/__init__.py", line 515, in __call__
    handler.get(*groups)
  File "/Users/fermin/project/gae/cuotastats/controllers/controllers.py", line 50, in get
    f.name = unicode( row[1])
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)
I tried to use Unicode functions and convert string to Unicode, but I haven't found the solution. I tried to use sys.setdefaultencoding('utf8') but that doesn't work either.
Try the unicode_csv_reader() generator described in the csv module docs.
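That recipe is specific to Python 2, where the csv module works on byte strings. As a point of comparison only, here is a sketch of the Python 3 equivalent, where the file is decoded as it is opened and csv.reader yields str values directly. The sample row is made up around the team name from the question.

```python
import csv
import io

# Hypothetical UTF-8 CSV content; the second column mirrors the failing row.
raw = "1,Preußen Münster-Kaiserslautern II,2.10\n".encode("utf-8")

# With a real file: open(path, encoding="utf-8", newline="")
with io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8", newline="") as f:
    rows = list(csv.reader(f))

# Already a str (Unicode) value, so no unicode(...) conversion is needed.
name = rows[0][1]
print(ascii(name))
```

In Python 2 without the recipe, the equivalent of the failing line would be an explicit row[1].decode("utf-8") rather than unicode(row[1]), which falls back to the ascii codec.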
