Ok, so Python 3 and Unicode. I know that all Python 3 strings are actually Unicode strings and that Python 3 source code is assumed to be UTF-8. But how does Python 3 read text files? Does it assume they are encoded in UTF-8? Do I need to call decode('utf-8') when reading a text file? What about pandas read_csv() and to_csv()?
Python's built-in function open() has an optional parameter encoding:
encoding is the name of the encoding used to decode or encode the file. This should only be used in text mode. The default encoding is
platform dependent (whatever locale.getpreferredencoding() returns),
but any text encoding supported by Python can be used. See the
codecs module for the list of supported encodings.
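In practice this means you can, and usually should, pass the encoding explicitly instead of relying on the platform default. A minimal Python 3 sketch (data.txt and out.txt are placeholder names):
with open('data.txt', encoding='utf-8') as f:
    text = f.read()            # text is a str, already decoded

with open('out.txt', 'w', encoding='utf-8') as f:
    f.write(text)              # encoded back to UTF-8 on write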
An analogous parameter can be found in pandas (a usage sketch follows the excerpts below):
pandas.read_csv(): encoding: str, default None. Encoding to use for UTF when reading/writing (ex. ‘utf-8’).
Series.to_csv(): encoding: string, optional. A string representing the encoding to use if the contents are non-ascii, for python versions prior to 3.
DataFrame.to_csv(): encoding: string, optional. A string representing the encoding to use in the output file, defaults to ‘ascii’ on Python 2 and ‘utf-8’ on Python 3.
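A minimal sketch of the pandas round trip (data.csv and out.csv are placeholder names):
import pandas as pd

df = pd.read_csv('data.csv', encoding='utf-8')       # decode the file as UTF-8
df.to_csv('out.csv', encoding='utf-8', index=False)  # write UTF-8 output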
Do I need to call decode('utf-8') when reading a text file?
No, not in text mode: open() decodes the bytes for you and returns str objects, so there is nothing left to decode. What you do need is to make sure the encoding you pass to open() matches what is actually in the file; when in doubt, try reading the file with encoding='utf-8' and check that it decodes cleanly.
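To make the distinction concrete, here is a small Python 3 sketch (data.txt is a placeholder name); decode() only enters the picture when you open the file in binary mode:
# Text mode: open() decodes for you
with open('data.txt', encoding='utf-8') as f:
    text = f.read()                    # already a str; no decode() needed

# Binary mode: you get bytes and must decode yourself
with open('data.txt', 'rb') as f:
    text = f.read().decode('utf-8')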
Related
This tiny python program:
#!/usr/bin/env python
# -*- coding: utf8 -*-
import json
import sys
x = { "name":u"This doesn't work β" }
json.dump(x, sys.stdout, ensure_ascii=False, encoding="utf8")
print
Generates this output when run at a terminal:
$ ./tester.py
{"name": "This doesn't work β"}
Which is exactly as I would expect. However, if I redirect stdout to a file, it fails:
$ ./tester.py > output.json
Traceback (most recent call last):
File "./tester.py", line 9, in <module>
json.dump(x, sys.stdout, ensure_ascii=False, encoding="utf8")
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 190, in dump
fp.write(chunk)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u03b2' in position 19: ordinal not in range(128)
However, a direct print (without json.dump) can be redirected to a file:
print u"This does work β".encode('utf-8')
It's as if the json package ignores the encoding option if stdout is not a terminal.
How can I get the json package to do what I want?
JSON is a text serialization format (that incidentally has a recommended binary encoding), not a binary serialization format. The json module itself only cares about encoding to the extent that it would like to know what Python 2's terrible str type is supposed to represent (is it ASCII bytes? UTF-8 bytes? latin-1 bytes?).
Since Python 2 text handling is, as stated, terrible, the json module is happy to return either str (when ensure_ascii is true, or the stars align in other cases and it's convinced you've told it str is compatible with your expected encoding, and none of the inputs are actually unicode) or unicode (when ensure_ascii is false, most of the time).
Like the rest of Python 2, sys.stdout is a bit wishy-washy. Even if its encoding is set to 'ascii' by your locale settings, it ignores that when you write a str to it (sys.stdout.write('\xe9') should fail, but instead it treats the str as pre-encoded raw binary data and doesn't bother to verify it matches the expected encoding). But when unicode comes in, it doesn't have that option; unicode is text (not UTF-8 text, not ASCII text, etc.), from the ideal text world of unicorns and rainbows, and that world isn't expressed in tawdry bytes.
So sys.stdout must encode the result, and it does so with the locale determined encoding (sys.stdout.encoding will tell you what it is). When that's ASCII, and it receives something that can't encode to ASCII, it explodes (as it should).
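You can check this yourself; under Python 2, the attribute is only set when the stream is attached to a terminal:
import sys
print sys.stdout.encoding   # e.g. 'UTF-8' at a terminal; None when redirected to a file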
The point is, the json module is always returning text (either unicode, or str that it's convinced is effectively text in the wishy-washy Python 2 world), and sometimes you get lucky and that text happens to be in a format that bypasses checks in sys.stdout.
But you shouldn't be relying on that. If your output must be in a specific encoding, use that encoding. The simplest way to do this (simplest in the sense that it pushes most work to the interpreter to do for you) is to not use sys.stdout (explicitly, or implicitly via print) and write your data to files you open with io.open (a backport of Python 3's open, that properly handles encodings), explicitly specifying encoding='utf-8'. If you must use sys.stdout, and you insist on ignoring the locale encoding, you can rewrap it, e.g.:
import io

with io.open(sys.stdout.fileno(), encoding='utf-8', closefd=False) as encodedout:
    json.dump(x, encodedout, ensure_ascii=False, encoding="utf-8")
which temporarily wraps the stdout file descriptor in a modern file-like object (using closefd to avoid closing the underlying descriptor when it's closed).
TL;DR: Switch to Python 3. Python 2 is awful when it comes to non-ASCII text, and its modules are often even worse (json should absolutely be returning a consistent type, or at least just one type for each setting of ensure_ascii, not dynamically selecting based on the inputs and encoding; and it isn't even the worst offender, the csv module is absolutely awful). Also, Python 2 has reached end-of-life and will not be patched for anything from here on out, so continuing to use it leaves you vulnerable to any security problems found between the beginning of this year and the end of time. Among other things, Python 3 uses str exclusively for text (with the full Unicode support of Py2's unicode type), and modern Python 3 (3.7+) will coerce ASCII locales to UTF-8 (because basically all systems can actually handle the latter), which should fix all your problems. Non-ASCII text will behave the same as ASCII text, and weirdo locales like yours that insist they're ASCII (and therefore won't handle non-ASCII output) will be "fixed" to work as you desire, without manually encoding and decoding, rewrapping file handles, etc.
Consolidating all the comments and answers into one final answer:
Note: this answer is for Python 2.7. Python 3 is likely to be different.
The json spec says that json files are utf-8 encoded. However, the Python json package does not like to take chances, so by default it writes straight ascii and escapes unicode characters in the output.
You can set the ensure_ascii flag to False, in which case the json package will generate unicode output instead of str. In that case, encoding the unicode output is your problem.
There is no way to make the json package generate utf-8 or any other encoding on output. It's either ascii or unicode; take your pick.
The encoding argument was a red herring. That option tells the json package how the input strings are encoded.
Here's what finally worked for me:
import codecs
import json
import sys

ofile = codecs.getwriter('utf-8')(sys.stdout)
json.dump(x, ofile, ensure_ascii=False)
tl;dr: the real mystery was why it didn't barf when stdout went to the terminal. It turned out that stdout.write() detects when output goes to a terminal and encodes per the $LANG environment variable. When output goes to a file, however, the unicode is encoded to ascii, and an error results when a non-encodable character is encountered.
There is an environment variable Python uses that can override encoding to the terminal or for redirection, so this should work without wrapping stdout inside the script.
$ export PYTHONIOENCODING=utf8
$ ./tester.py > output.json
I am trying to encode Bangla words in Python using a pandas dataframe. But utf-8 is not working as the encoding type, while utf-8-sig is. I know utf-8-sig is UTF-8 with a BOM (byte order mark). But why is it called utf-8-sig, and how does it work?
"sig" in "utf-8-sig" is the abbreviation of "signature" (i.e. signature utf-8 file).
Using utf-8-sig to read a file will treat the BOM as metadata that explains how to interpret the file, instead of as part of the file contents.
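A minimal Python 3 sketch of the difference (bangla.txt is a hypothetical file name):
with open('bangla.txt', 'w', encoding='utf-8-sig') as f:
    f.write('data')              # utf-8-sig writes the BOM for you

with open('bangla.txt', encoding='utf-8') as f:
    print(repr(f.read()))        # '\ufeffdata' -- the BOM leaks into the contents

with open('bangla.txt', encoding='utf-8-sig') as f:
    print(repr(f.read()))        # 'data' -- the BOM is treated as metadata and stripped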
Some people use the following to declare the encoding method for the text of their Python source code:
# -*- coding: utf-8 -*-
Back in 2001, the default encoding the Python interpreter assumed was said to be ASCII. I have dealt with strings using non-ASCII characters in my Python code without declaring an encoding for my code, and I don't remember ever bumping into an encoding error. What is the default encoding for code assumed by the Python interpreter now?
I am not sure if this is relevant.
My OS is Ubuntu, and I am using the default Python interpreter, and gedit or emacs for editing.
Will the default encoding assumed by the Python interpreter change if the above changes?
Thanks.
Without any explicit encoding declaration, the assumed encoding for your source code will be
ascii for Python 2.x
utf-8 for Python 3.x
See PEP 0263 and Using source code encoding for Python 2.x, and PEP 3120 for the new default of utf-8 for Python 3.x.
So the default encoding assumed for source code depends directly on the version of the Python interpreter, and it is not configurable.
Note that the source code encoding is something entirely different than dealing with non-ASCII characters as part of your data in strings.
There are two distinct cases where you may encounter non-ASCII characters:
As part of your program's data, at runtime
As part of your source code (and since you can't have non-ASCII characters in identifiers, that usually means hard-coded string data in your source code or comments).
The source code encoding declaration affects what encoding your source code will be interpreted with - so it's only needed if you decide to directly put non-ASCII characters in your source code.
So, the following code will eventually have to deal with the fact that there might be non-ASCII characters in data.txt:
with open('data.txt') as f:
    for line in f:
        pass  # do something with `line`
But it doesn't contain any non-ASCII characters in the source code, therefore it doesn't need an encoding declaration at the top of the file. It will however need to properly decode line if it wants to turn it into unicode. Simply doing unicode(line) will use the system default encoding, which is ascii (different from the default source encoding, but it happens to also be ascii). So to explicitly decode the string using utf-8 you'd need to call line.decode('utf-8').
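A Python 2 sketch of that explicit decode (data.txt is a placeholder name):
with open('data.txt') as f:
    for line in f:
        line = line.decode('utf-8')   # line is now a unicode object
        # ... work with `line` as text ...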
This code however does contain non-ASCII characters directly in its source code:
TEST_DATA = 'Bär' # <--- non-ASCII character on this line
print TEST_DATA
And it will fail with a SyntaxError similar to this, unless you declare an explicit source code encoding:
SyntaxError: Non-ASCII character '\xc3' in file foo.py on line 1, but no encoding declared;
see http://www.python.org/peps/pep-0263.html for details
So assuming your text editor is configured to save files in utf-8, you'd need to put the line
# -*- coding: utf-8 -*-
at the top of the file for Python to interpret the source code correctly.
My advice however would be to generally avoid putting non-ASCII characters in your source code, exactly because whether they will be written and read correctly depends on your and your co-workers' editor and terminal settings.
Instead you can use escaped strings to safely enter non-ASCII characters in your code:
TEST_DATA = 'B\xc3\xa4r'  # the UTF-8 byte sequence for 'Bär'
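An alternative that also keeps the source pure ASCII is a unicode literal with a code-point escape:
TEST_DATA = u'B\xe4r'   # three characters; \xe4 is the code point for a-umlaut, not UTF-8 bytes
The str version above holds four bytes (the UTF-8 encoding of the three characters), while this version holds the three characters themselves; which one you want depends on whether the surrounding code works with bytes or with unicode.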
By default, Python source files are treated as encoded in UTF-8. In that encoding, characters of most languages in the world can be used simultaneously in string literals, identifiers and comments, although the standard library only uses ASCII characters for identifiers, a convention that any portable code should follow. To display all these characters properly, the editor must recognize that the file is UTF-8, and it must use a font that supports all the characters in the file.
It is also possible to specify a different encoding for source files. To do this, put a special comment line at the top of the file:
# -*- coding: encoding -*-
https://docs.python.org/dev/tutorial/interpreter.html
I wrote a simple file parser and writer, but then I came across an article talking about the importance of unicode, and it occurred to me that I'm assuming the input file is ASCII encoded, which may not always be the case, though it would be rare in my situation.
In those rare cases, I would expect UTF-8 encoded files.
Is there a way to work with UTF-8 files by simply changing how I read and write? All I do with the strings is store them and then write them out, so I just need to make sure I can read them, store them, and write them properly.
Furthermore, would I have to treat ascii and UTF-8 files separately and write different functions for each? I have not worked with anything other than ascii files yet and only read about handling unicode.
Python natively supports Unicode. If you read from the first file and write directly to the second, no data is lost, since the bytes are copied verbatim. However, if you decode the text and then re-encode it, you'll need to make sure you use the right encoding.
If you are using Python 2, you can simply change all your str objects to unicode objects. Unicode objects have all the same methods as strings, but they hold decoded text (Unicode code points) rather than raw bytes. See http://docs.python.org/library/functions.html#unicode .
If you are using Python 3, strings are Unicode by default, and files opened in text mode are decoded for you.
If you are using Python 2.6 or later, you can use the io library and its io.open method to open the files you want. It has an encoding argument which should be set to 'utf-8' in your case. When you read from or write to the returned file objects, strings are automatically decoded and encoded.
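A minimal sketch of that approach (input.txt and output.txt are placeholder names; this works on Python 2.6+ and on Python 3):
import io

with io.open('input.txt', 'r', encoding='utf-8') as src:
    with io.open('output.txt', 'w', encoding='utf-8') as dst:
        for line in src:        # line is decoded to text on read
            dst.write(line)     # and encoded back to UTF-8 on write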
Anyway, you don't need to do something special for ASCII, because UTF-8 is a superset of ASCII.
So long as you are only reading and writing to files and not expecting any other type of encoded input, then you should not have to do anything special.
% cat /tmp/u
π is 3.14.
% file /tmp/u
/tmp/u: UTF-8 Unicode text
% cat f.py
f = open('/tmp/u', 'r')
d = f.read()
print d.split()
f.close()
% python f.py
['\xcf\x80', 'is', '3.14.']
This changes when the UTF-8 text is embedded in the program source itself:
% cat g.py
s = 'π is 3.14.'
print s.split()
% python g.py
File "g.py", line 1
SyntaxError: Non-ASCII character '\xcf' in file g.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
To handle this properly, declare the encoding for the Python program at the beginning per PEP 263 (referenced by the SyntaxError exception above).
% cat h.py
# -*- coding: utf-8 -*-
s = 'π is 3.14.'
print s.split()
% python h.py
['\xcf\x80', 'is', '3.14.']
I'm using Python to pick apart some pieces, so I'm definitely a noob here, but I didn't see a satisfactory answer.
I have a UTF-8 JSON file with some entries that contain graves, acutes, etc. I'm using codecs and have (for example):
import codecs
import json

str = codecs.open('../../publish_scripts/locations.json', 'r', 'utf-8')
locations = json.load(str)
for location in locations:
    print location['name']
For print'ing, does anything special need to be done? It's giving me the following
'ascii' codec can't encode character u'\xe9' in position 5
It looks like the correct value for e-acute. I suspect I'm doing something wrong with print'ing. Would the iteration cause it to lose its utf-8'ness?
PHP and Ruby versions handle the utf-8 piece fine; is there some looseness in those languages that python won't do?
thx
codecs.open() will decode the contents of the file using the codec you supplied (utf-8). You then have a Python unicode object (which behaves similarly to a string object).
Printing a unicode object will cause an implicit (behind-the-scenes) encode using the default codec, which is usually ascii. If ascii cannot encode all of the characters present, it will fail.
To print it, you should first encode it, thus:
for location in locations:
    print location['name'].encode('utf8')
EDIT:
For your info, json.load() actually takes a file-like object (which is what codecs.open() returns). What you have at that point is neither a string nor a unicode object, but an iterable wrapper around the file.
By default json.load() expects the file to be utf8 encoded, so your code snippet can be simplified:
locations = json.load(open('../../publish_scripts/locations.json'))
for location in locations:
    print location['name'].encode('utf8')
You're probably reading the file correctly. The error occurs when you're printing. Python tries to convert the unicode string to ascii, and fails on the character in position 5.
Try this instead:
print location['name'].encode('utf-8')
If your terminal is set to expect output in utf-8 format, this will print correctly.
It's the same as in PHP: UTF-8 encoded strings are fine to print as-is.
The standard io streams are broken for non-ascii character io in python2 under some site.py setups. Basically, you need to call sys.setdefaultencoding('utf8') (or whatever the system locale's encoding is) very early in your script. With the site.py shipped in Ubuntu, you need imp.reload(sys) to make sys.setdefaultencoding available. Alternatively, you can wrap sys.stdout (and stdin and stderr) in unicode-aware readers/writers, which you can get from codecs.getreader / codecs.getwriter, as sketched below.
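A Python 2 sketch of that codecs-wrapping approach, done once near the top of the script:
import codecs
import sys

# Encode unicode going out as UTF-8 and decode bytes coming in as UTF-8,
# regardless of whether the streams are attached to a terminal.
sys.stdout = codecs.getwriter('utf-8')(sys.stdout)
sys.stderr = codecs.getwriter('utf-8')(sys.stderr)
sys.stdin = codecs.getreader('utf-8')(sys.stdin)

print u'\u03b2 now survives redirection'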