I'm working on a series of parsers where I get a bunch of tracebacks from my unit tests like:
File "c:\Python31\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 112: character maps to <undefined>
The files are opened with open() with no extra arguments. Can I pass extra arguments to open() or use something in the codecs module to open these files differently?
This came up with code that was written in Python 2 and converted to 3 with the 2to3 tool.
UPDATE: it turns out this is a result of feeding a zipfile into the parser. The unit test actually expects this to happen. The parser should recognize it as something that can't be parsed. So, I need to change my exception handling. In the process of doing that now.
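For what it's worth, a minimal sketch of what that revised handling might look like; parse() and UnparsableFileError are hypothetical stand-ins, not the actual parser code:

class UnparsableFileError(ValueError):
    """Raised when the input isn't parsable text (e.g. a zip archive)."""

def try_parse(path):
    # parse() is a hypothetical stand-in for the real parser entry point
    try:
        with open(path) as f:
            return parse(f)
    except UnicodeDecodeError:
        # binary input (like a zipfile) can't be decoded as text;
        # report it as unparsable instead of crashing the test run
        raise UnparsableFileError(path)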
Byte 0x81 is unassigned in Windows-1252 (aka cp1252). It is assigned to the U+0081 HIGH OCTET PRESET (HOP) control character in Latin-1 (aka ISO 8859-1). I can reproduce your error in Python 3.1 like this:
>>> b'\x81'.decode('cp1252')
Traceback (most recent call last):
...
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 0: character maps to <undefined>
or with an actual file:
>>> open('test.txt', 'wb').write(b'\x81\n')
2
>>> open('test.txt').read()
Traceback (most recent call last):
...
UnicodeDecodeError: 'utf8' codec can't decode byte 0x81 in position 0: unexpected code byte
Now to treat this file as Latin-1 you pass the encoding argument, like codeape suggested:
>>> open('test.txt', encoding='latin-1').read()
'\x81\n'
Beware that there are differences between the Windows-1252 and Latin-1 encodings, e.g. Latin-1 doesn't have “smart quotes”. If the file you're processing is a text file, ask yourself what that \x81 is doing in it.
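A quick REPL illustration of that difference, using bytes that cp1252 maps to smart quotes:

>>> b'\x93quoted\x94'.decode('cp1252')   # cp1252 maps 0x93/0x94 to smart quotes
'“quoted”'
>>> b'\x93quoted\x94'.decode('latin-1')  # Latin-1 maps them to C1 control characters
'\x93quoted\x94'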
You can relax the error handling.
For instance:
f = open(filename, encoding="...", errors="replace")
Or:
f = open(filename, encoding="...", errors="ignore")
See the docs.
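For example, with the offending byte from your traceback, the two handlers behave like this:

>>> b'abc\x81def'.decode('cp1252', errors='replace')  # bad byte becomes U+FFFD
'abc�def'
>>> b'abc\x81def'.decode('cp1252', errors='ignore')   # bad byte is dropped
'abcdef'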
EDIT:
But are you certain that the problem is in reading the file? Could it be that the exception happens when something is written to the console? Check http://wiki.python.org/moin/PrintFails
All files are "not Unicode". Unicode is an internal representation which must be encoded. You need to determine for each file what encoding has been used, and specify that where necessary when the file is opened.
As the traceback and error message indicate, the file in question is NOT encoded in cp1252.
If it is encoded in latin1, the "\x81" it is complaining about is a C1 control character that doesn't even have a name in Unicode. Consider latin1 extremely unlikely to be valid.
You say "some of the files are parsed with xml.dom.minidom" -- parsed successfully or unsuccessfully?
A valid XML file should declare its encoding (default is UTF-8) in the first line, and you should not need to specify an encoding in your code. Show us the code that you are using to do the xml.dom.minidom parsing.
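For reference: minidom normally honors the declared encoding by itself when handed the raw file or bytes. A minimal sketch, with a hypothetical filename:

from xml.dom import minidom

# minidom reads the file in binary itself and honors the encoding
# declared in the XML prolog, so no encoding= argument is needed:
dom = minidom.parse('data.xml')
# equivalently, from bytes already in memory:
dom = minidom.parseString(open('data.xml', 'rb').read())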
"others read directly as iterables" -- sample code please.
Suggestion: try opening some of each type of file in your browser. Then click View and click Character Encoding (Firefox) or Encoding (Internet Explorer). What encoding has the browser guessed [usually reliably]?
Other possible encoding clues: What languages are used in the text in the files? Where did you get the files from?
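If the browser trick is impractical, the third-party chardet package can make the same kind of guess programmatically. A minimal sketch, with a hypothetical filename:

import chardet  # third-party: pip install chardet

with open('unknown.txt', 'rb') as f:
    raw = f.read()
guess = chardet.detect(raw)  # e.g. {'encoding': 'windows-1252', 'confidence': 0.73}
text = raw.decode(guess['encoding'])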
Note: please edit your question with clarifying information; don't answer in the comments.
Here's what brought this question up:
with open(path + "/OneChance1.mid") as f:
    for line in f.readline():
        print(line)
Here I am simply trying to read a midi file to scour its contents. I then receive this error message: UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 153: character maps to <undefined>
If I use open()'s second param like so: with open(path + "/OneChance1.mid", encoding='utf-8') as f: then I receive this error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 13: invalid start byte
If I change the encoding param to ascii I get another error about an ordinal being out of range. Lastly I tried utf-16 and it said that the file didn't start with BOM (which made me smile for some reason). Also, if I ignore the errors I get characters that resemble nothing of the kind of data I am expecting. My expectations are based on this source: http://www.sonicspot.com/guide/midifiles.html
Anyway, does anyone know what kind of encoding a midi file uses? My research is coming up short in that regard so I thought it would be worth asking on SO. Or maybe someone can point out some other possibilities or blunders?
MIDI files are binary content. By opening the file in text mode, however, Python applies the default system encoding and tries to decode the contents as Unicode text.
Open the file in binary mode instead:
with open(midifile, 'rb') as mfile:
    leader = mfile.read(4)
    if leader != b'MThd':
        raise ValueError('Not a MIDI file!')
You'd have to study the MIDI standard file format if you wanted to learn more from the file. Also see What is the structure of a MIDI file?
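If you do go further, the rest of the MThd header is fixed-size and can be unpacked with the struct module. A sketch based on the published chunk layout (length, format, number of tracks, division):

import struct

with open(midifile, 'rb') as mfile:
    if mfile.read(4) != b'MThd':
        raise ValueError('Not a MIDI file!')
    # 4-byte big-endian chunk length, then three 2-byte fields
    length, fmt, ntracks, division = struct.unpack('>IHHH', mfile.read(10))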
It's a binary file; it's not text in a text encoding, as you seem to expect.
To open a file in binary mode in Python, pass a string containing "b" as the second argument to open().
This page contains a description of the format.
I have a log file that I need to go through line by line, and apparently it contains some "bad bytes". I get an error message along the following lines:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 9: invalid start byte
I have been able to strip down the problem to a file "log.test" containing the following line:
Message: \260
(At least this is how it shows up in my Emacs.)
I have a file "demo_error.py" which looks like this:
import sys
with open(sys.argv[1], 'r') as lf:
    for i, l in enumerate(lf):
        print(i, l.strip())
I then run, from the command line:
$ python3 demo_error.py log.test
The full traceback is:
Traceback (most recent call last):
File "demo_error.py", line 5, in <module>
for i, l in enumerate(lf):
File "/usr/local/Cellar/python3/3.4.0/Frameworks/Python.framework/Versions/3.4/lib/python3.4/codecs.py", line 313, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 13: invalid start byte
My hunch is that I have to somehow specify a more general codec ("raw ascii" for instance) - but I'm not quite sure how to do this.
Note that this is not really a problem in Python 2.7.
And just to make my point clear: I don't mind getting an exception for the line in question - then I can simply discard the line. The problem is that the exception seems to happen on the "for" loop itself, which makes special handling of that particular line impossible.
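One way to get per-line control is to open the file in binary mode and decode each line yourself; then the failure happens inside the loop body, where you can catch it. A minimal sketch:

import sys

with open(sys.argv[1], 'rb') as lf:  # bytes in; iteration itself cannot fail
    for i, raw in enumerate(lf):
        try:
            l = raw.decode('utf-8')
        except UnicodeDecodeError:
            continue  # discard the undecodable line
        print(i, l.strip())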
You can also use the codecs module. When you use the codecs.open() function, you can specify how it handles errors using the errors argument:
codecs.open(filename, mode[, encoding[, errors[, buffering]]])
The errors argument can be one of several keywords that specify how you want Python to behave when it attempts to decode a character that is invalid for the current encoding. You'll probably be most interested in 'ignore' or 'replace', which cause invalid characters to be either ignored or replaced with a default character, respectively.
This method can be a good alternative when you know you have corrupt data that will cause the UnicodeDecodeError to be raised even when you specify the correct encoding.
Example:
import codecs

with codecs.open('file.txt', mode='r', errors='ignore') as f:
    # Even if there is corrupt data and invalid characters for the default
    # encoding, this open() and subsequent reads will still succeed
    data = f.read()
So apparently your file does not contain valid UTF-8 (which is your system's default encoding).
If you know which encoding is used (e.g. iso-8859-1), you can specify it when opening:
open(sys.argv[1], mode='r', encoding='iso-8859-1')
If the encoding is unknown or not valid at all, you can open the file as binary:
open(sys.argv[1], mode='rb')
This will make the content accessible as bytes rather than trying to interpret them as characters.
In Python 2, strings (str) are sequences of 8-bit bytes. So when reading a file composed of bytes, you get the bytes without problems, no matter what the actual encoding is; you may simply read them with a wrong representation, but it will never throw an exception.
In Python 3, strings are Unicode strings. So when reading a text file, Python has to decode its contents, and by default it uses the system encoding, which is not necessarily UTF-8. In your case it assumed UTF-8, but your log file is not valid UTF-8, hence the exception.
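You can check what open() will assume on your system when no encoding argument is given:

>>> import locale
>>> locale.getpreferredencoding()  # the default used by open() in text mode
'UTF-8'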
If you are not sure of the encoding, you can reasonably try ISO-8859-1, since it maps every possible byte value to a character:
open(sys.argv[1], 'r', encoding='iso-8859-1')
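For the byte in your stripped-down file, ISO-8859-1 maps 0xb0 to the degree sign:

>>> b'Message: \xb0\n'.decode('iso-8859-1')
'Message: °\n'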
I am trying to replace text. Unfortunately, the main string is stored as type unicode, but the string which describes the text to be replaced is stored as type string. Below is a reproducible example:
mystring = u'Bunch of text with non-standard character in the name Rubén'
old = 'Rubén'
new = u'newtext'
mystring.replace(old, new)
This throws an error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 3: ordinal not in range(128)
I get the same error when I try to convert old to unicode with unicode(old). Several answers solve the problem for specific characters, but I cannot find a generic solution.
You need to convert the old value to Unicode with an explicit codec. What that codec is depends entirely on how you sourced old.
If it is a string literal in the source code, use the source code encoding. Python won't accept your source file unless you specify a valid codec in a comment at the top; see PEP 263.
Pasting your old definition into a terminal will use your terminal codec (the terminal sends Python encoded bytes as you paste).
If the data is sourced from anywhere else, you'll need to determine the encoding from that source. For HTTP data, check the Content-Type header for a charset parameter, for example.
Then decode:
old = old.decode(encoding)
When you use unicode(old) without an explicit codec, or try to use the bytestring in unicode.replace(), Python uses the default codec, ASCII.
Demo in my terminal, configured to use UTF-8:
>>> import sys
>>> sys.stdin.encoding # reflects the detected terminal codec
'UTF-8'
>>> old = 'Rubén'
>>> old # shows encoded data in python string literal form
'Rub\xc3\xa9n'
>>> old.decode('utf8') # unicode string literal form
u'Rub\xe9n'
>>> print old.decode('utf8') # string value written to the terminal
Rubén
>>> mystring = u'Bunch of text with non-standard character in the name Rubén'
>>> new = u'newtext'
>>> mystring.replace(old, new)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
>>> mystring.replace(old.decode('utf8'), new)
u'Bunch of text with non-standard character in the name newtext'
Generally speaking, you want to decode early and encode late; make your data flow a Unicode Sandwich. As soon as you receive text, decode it all to Unicode values, and don't encode again until the data is leaving your program.
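A tiny Python 2 sketch of that sandwich, assuming UTF-8 at both file boundaries (the filenames are hypothetical):

# -*- coding: utf-8 -*-
raw = open('names.txt', 'rb').read()                 # bytes come in...
text = raw.decode('utf-8')                           # ...decode once at the edge
result = text.replace(u'Rubén', u'newtext')          # work purely in Unicode
open('out.txt', 'wb').write(result.encode('utf-8'))  # encode once on the way out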
I'm using Python to read a text file with the segment below (can't post a screenshot since I'm a noob), but this is what it looks like in Notepad++:
NULSOHSOHNULNULNULSUBMessage-ID:
error:
Traceback (most recent call last):
File "<pyshell#3>", line 1, in <module>
print(f.readline())
File "C:\Python32\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 7673: character maps to <undefined>
Opening the file as binary:
f = open('file.txt','rb')
f.readline()
gives me the line as a bytes object:
b'\x00\x01\x01\x00\x00\x00\x1a\xb7Message-ID:
but how do I get the text as ASCII? And what's the easiest/most Pythonic way of handling this?
The problem is with "byte 0x8f in position 7673", not with "byte 0x00 in position 1". I.e., your NUL bytes are not the problem. If you look at the cp1252 code page on Wikipedia, you can see that 0x8f has no corresponding character.
The larger issue is that your file is not in a single encoding: it appears to be a mix of binary framing of text segments. What you really need to do is figure out the format of this file and parse it into binary pieces (or perhaps some richer data structure, like a tuple, list, dict, object, etc), then decode the text pieces into unicode if you need to process further.
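A hedged sketch of that approach, splitting the line on a marker visible in the dump (treating b'Message-ID:' as the boundary is an assumption about the format):

with open('file.txt', 'rb') as f:
    raw = f.readline()

# Split the binary framing from the text segment on a known marker;
# using b'Message-ID:' as the boundary is an assumption about the format.
framing, marker, payload = raw.partition(b'Message-ID:')
text = (marker + payload).decode('ascii', errors='replace')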
When opening a file in text mode, you can specifically tell which encoding to use:
f = open('file.txt','r',encoding='ascii')
However, your real problem is different: the binary piece you cited cannot be read as ASCII, because the byte \xb7 is outside the ASCII range (0-127). The exception traceback shows that Python is using the cp1252 codec by default, which cannot decode your file either.
You need either to figure out which encoding the file has, or to handle it as binary all the time.
Perhaps open it in the correct read mode?
f = open('file.txt','r')
f.readline()
I am retrieving a list of names from a webservice using a client I've written in Python. Upon retrieving the list, I encode each name to unicode and then print each of them to stdout. When I get to the name "Ólafur Jóhann Ólafsson", I get the following error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0:
ordinal not in range(128)
Since I cannot know what the encoding will be, how do I convert all of these strings to unicode? Or can you suggest a better way to handle this problem?
First of all, you decode data to Unicode (the absence of encoding) when reading from a file, pipe, socket, terminal, etc.; and encode Unicode to an appropriate byte encoding when sending/persisting data. I suspect this is the root of your problem.
The web service should declare the encoding in the headers or in the data received. print normally encodes Unicode automatically to the terminal's encoding (discovered through sys.stdout.encoding), or in the absence of that, just ascii. If the characters in your data are not supported by the target encoding, you'll get a UnicodeEncodeError.
Since that is not the error you received, you should post some code so we can see what you are doing. Most likely, you are encoding a byte string instead of decoding. Here's an example of this:
>>> data = '\xc2\xbd' # UTF-8 encoded 1/2 symbol.
>>> data.encode('cp437')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\dev\python\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
What I did here is call encode on a byte string. Since encode requires a Unicode string, Python used the default ascii encoding to decode the byte string to Unicode first, before encoding to cp437.
Fix this by decoding instead of encoding the data, then print will encode to stdout automatically. As long as your terminal supports the characters in the data, it will display properly:
>>> import sys
>>> sys.stdout.encoding
'cp437'
>>> print data.decode('utf8') # implicit encode to sys.stdout.encoding
½
>>> print data.decode('utf8').encode('cp437') # explicit encode.
½
The UnicodeDammit module from BeautifulSoup can automagically detect the encoding.
from BeautifulSoup import UnicodeDammit
u = UnicodeDammit("Ólafur Jóhann Ólafsson")
print u.unicode
print u.originalEncoding
This page may help you http://wiki.python.org/moin/PrintFails
The problem, I guess, is that you need to print those names to the console. Do you really need to? Or is it just a test environment? If you use the console just for testing, you could switch to other tools, such as unit tests, to check exactly which values you get.