Python doesn't interpret UTF-8 correctly

I know similar questions have been asked a million times, but despite reading through many of them I can't find a solution that applies to my situation.
I have a Django application in which I've created a management script. This script reads some text files and outputs them to the terminal (it will do more useful stuff with the contents later, but I'm still testing it), and the characters come out as escape sequences like \xc3\xa5 instead of the intended å. Since that byte sequence means Ã¥, which is a common misinterpretation of å caused by encoding problems, I suspect there are at least two places where this is going wrong. However, I can't figure out where - I've checked all the possible culprits I can think of:
The terminal encoding is UTF-8; echo $LANG gives en_US.UTF-8
The text files are encoded in UTF-8; file * in the directory where they reside results in all entries being listed as "UTF-8 Unicode text" except one, which does not contain any non-ASCII characters and is listed as "ASCII text". Running iconv -f ascii -t utf8 thefile.txt > utf8.txt on that file yields another file with ASCII text encoding.
The Python scripts are all UTF-8 (or, in several cases, ASCII with no non-ASCII characters). I tried inserting a comment in my management script with some special characters to force it to save as UTF-8, but it did not change the behavior. The above observations on the text files apply to all the Python script files as well.
The Python script that handles the text files has # -*- encoding: utf-8 -*- at the top; the only line preceding that is #!/usr/bin/python3, but I've tried both changing it to .../python for Python 2.7 and removing it entirely to leave it up to Django, without results.
According to the documentation, "Django natively supports Unicode data", so I "can safely pass around Unicode strings" anywhere in the application.
I really can't think of anywhere else to look for a non-UTF-8 link in the chain. Where could I possibly have missed a setting to change to UTF-8?
For completeness: I'm reading from the files with lines = file.readlines() and printing with the standard print() function. No manual encoding or decoding happens at either end.
UPDATE:
In response to questions in comments:
print(sys.getdefaultencoding(), sys.stdout.encoding, f.encoding) yields ('ascii', 'UTF-8', None) for all files.
I started compiling an SSCCE, and quickly found that the problem only appears if I try to print the values as a tuple. In other words, print(lines[0].strip()) works fine, but print(lines[0].strip(), lines[1].strip()) does not. Adding .decode('utf-8') yields a tuple where both strings are prefixed with u and show \xe5 (the correct escape sequence for å) instead of the odd characters from before - but I can't figure out how to print them as regular strings, with no escape sequences. I've tested another call to .decode('utf-8') as well as wrapping in str(), but both fail with a UnicodeEncodeError complaining that \xe5 can't be encoded in ascii. Since a single string works correctly, I don't know what else to test.
SSCCE:
# -*- coding: utf-8 -*-
import os, sys

for root, dirs, files in os.walk('txt-songs'):
    for filename in files:
        with open(os.path.join(root, filename)) as f:
            print(sys.getdefaultencoding(), sys.stdout.encoding, f.encoding)
            lines = f.readlines()
            print(lines[0].strip())                    # works
            print(lines[0].strip(), lines[1].strip())  # does not work

The big problem here is that you're mixing up Python 2 and Python 3. In particular, you've written Python 3 code, and you're trying to run it in Python 2.7. But there are a few other problems along the way. So, let me try to explain everything that's going wrong.
I started compiling an SSCCE, and quickly found that the problem only appears if I try to print the values as a tuple. In other words, print(lines[0].strip()) works fine, but print(lines[0].strip(), lines[1].strip()) does not.
The first problem here is that the str of a tuple (or any other collection) includes the repr, not the str, of its elements. The simple way to solve this problem is to not print collections. In this case, there is really no reason to print a tuple at all; the only reason you have one is that you've built it for printing. Just do something like this:
print '({}, {})'.format(lines[0].strip(), lines[1].strip())
In cases where you already have a collection in a variable, and you want to print out the str of each element, you have to do that explicitly. You can print the repr of the str of each with this:
print tuple(map(str, my_tuple))
… or print the str of each directly with this:
print '({})'.format(', '.join(map(str, my_tuple)))
Notice that I'm using Python 2 syntax above. That's because if you actually used Python 3, there would be no tuple in the first place, and there would also be no need to call str.
You've got a Unicode string. In Python 3, unicode and str are the same type. But in Python 2, it's bytes and str that are the same type, and unicode is a different one. So, in 2.x, you don't have a str yet, which is why you need to call str.
And Python 2 is also why print(lines[0].strip(), lines[1].strip()) prints a tuple. In Python 3, that's a call to the print function with two strings as arguments, so it will print out two strings separated by a space. In Python 2, it's a print statement with one argument, which is a tuple.
If you want to write code that works the same in both 2.x and 3.x, you either need to avoid ever printing more than one argument, or use a wrapper like six.print_, or do a from __future__ import print_function, or be very careful to do ugly things like adding in extra parentheses to make sure your tuples are tuples in both versions.
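For example, a minimal sketch of the __future__ approach:

from __future__ import print_function  # makes print a function in 2.x

# now two arguments are two arguments in both versions, not a tuple in 2.x
print('first', 'second')  # -> first second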
So, in 3.x, you've got str objects and you just print them out. In 2.x, you've got unicode objects, and you're printing out their repr. You can change that to print out their str, or to avoid printing a tuple in the first place… but that still won't help anything.
Why? Well, printing anything, in either version, just calls str on it and then passes the result to sys.stdout.write. But in 3.x, str means unicode, and sys.stdout is an io.TextIOWrapper; in 2.x, str means bytes, and sys.stdout is a binary file.
So, the pseudocode for what ultimately happens is:
In 3.x: sys.stdout.wrapped_binary_file.write(s.encode(sys.stdout.encoding, sys.stdout.errors))
In 2.x: sys.stdout.write(s.encode(sys.getdefaultencoding()))
And, as you saw, those will do different things, because:
print(sys.getdefaultencoding(), sys.stdout.encoding, f.encoding) yields ('ascii', 'UTF-8', None)
You can simulate Python 3 here by using an io.TextIOWrapper or codecs.StreamWriter and then using print >>f, … or f.write(…) instead of print, or you can explicitly encode all your unicode objects like this:
print '({})'.format(', '.join(element.encode('utf-8') for element in my_tuple))
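A minimal sketch of the wrapper approach in Python 2 (this assumes your terminal really is UTF-8):

import codecs
import sys

# wrap the byte-oriented stdout so unicode objects are encoded on the way out,
# roughly what Python 3's TextIOWrapper does for you
out = codecs.getwriter('utf-8')(sys.stdout)
out.write(u'\xe5\n')  # writes the bytes '\xc3\xa5', which a UTF-8 terminal shows as å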
But really, the best way to deal with all of these problems is to run your existing Python 3 code in a Python 3 interpreter instead of a Python 2 interpreter.
If you want or need to use Python 2.7, that's fine, but you have to write Python 2 code. If you want to write Python 3 code, that's great, but you have to run Python 3.3. If you really want to write code that works properly in both, you can, but it's extra work, and takes a lot more knowledge.
For further details, see What's New In Python 3.0 (the "Print Is A Function" and "Text Vs. Data Instead Of Unicode Vs. 8-bit" sections), although that's written from the point of view of explaining 3.x to 2.x users, which is backward from what you need. The 3.x and 2.x versions of the Unicode HOWTO may also help.

For completeness: I'm reading from the files with lines = file.readlines() and printing with the standard print() function. No manual encoding or decoding happens at either end.
In Python 3.x, the standard print function just writes Unicode to sys.stdout. Since that's an io.TextIOWrapper, its write method is equivalent to this:
self.wrapped_binary_file.write(s.encode(self.encoding, self.errors))
So one likely problem is that sys.stdout.encoding does not match your terminal's actual encoding.
And of course another is that your shell's encoding does not match your terminal window's encoding.
For example, on OS X, I create a myscript.py like this:
print('\u00e5')
Then I fire up Terminal.app, create a session profile with encoding "Western (ISO Latin 1)", create a tab with that session profile, and do this:
$ export LANG=en_US.UTF-8
$ python3 myscript.py
… and I get exactly the behavior you're seeing.
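A quick way to check for such a mismatch from inside Python (the values shown will of course depend on your system):

import locale
import sys

print(sys.stdout.encoding)            # what Python thinks the terminal uses
print(locale.getpreferredencoding())  # what the locale settings claim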

It seems from your comment that you are using Python 2 and not Python 3.
If you are using Python 3, it's worth reading the Unicode HOWTO guide on reading/writing to understand what Python is doing.
The basic flow of encoding is:
decode from encoding to unicode -> processing -> encode from unicode to encoding
In Python 3, bytes are decoded to strings and strings are encoded to bytes.
The bytes-to-string decoding is handled for you by open().
[..] the built-in open() function can return a file-like object that
assumes the file’s contents are in a specified encoding and accepts
Unicode parameters for methods such as read() and write(). This works
through open()‘s encoding and errors parameters [..]
So to read in unicode from a utf-8 encoded file you should be doing this:
# python-3
with open('utf8.txt', mode='r', encoding='utf-8') as f:
    lines = f.readlines()  # returns unicode
If you want similar functionality using python-2, you can use codecs.open():
# python-2
import codecs
with codecs.open('utf8.txt', mode='r', encoding='utf-8') as f:
    lines = f.readlines()  # returns unicode
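On Python 2.6 and later, io.open is another option; it mirrors the Python 3 open() signature:

# python-2 alternative
import io
with io.open('utf8.txt', mode='r', encoding='utf-8') as f:
    lines = f.readlines()  # returns unicode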

Related

If I want to use UTF-8 encoding, the default for python, do I have to encode my string variables to byte variables?

If I have a string that I want to use in byte form encoded as UTF-8, do I need to encode the variable as a byte variable? Or, since Python is by default encoded as UTF-8, will it just treat the string as UTF-8 byte form in certain contexts without explicit encoding?
For example, I'm working on a project where I have an array of dictionaries that map strings to strings. If I write this array to a file with json.dump and then read it with json.load, the strings are recovered just fine, and I get no error, despite never encoding. This indicates to me that if you're just using UTF-8, you don't actually need to convert to byte form. Am I wrong? If I'm right, is this bad practice nonetheless? Would my example be any different if I were just writing strings without the JSON?
Python has multiple defaults regarding encoding.
In Python 3, the situation is as follows:
The source file encoding is UTF-8 by default. You can override this with a comment in one of the first two lines of the module (# coding: latin-1) if you really have to. It only affects string literals (and variable names).
The encoding parameter of str.encode() and bytes.decode() is UTF-8 too.
But when you open a file with open(), then the default for encoding depends on the circumstances (OS, env variables, Python version, build). You can check its value with locale.getpreferredencoding(). This default is also used when you read from sys.stdin or use print().
So I'd say it's okay to rely on the defaults for the first two cases (it's officially recommended for the first one).
But the third one is tricky: The IO default is UTF-8 on many systems, so you might think that with open(path) as f: will always use UTF-8, because it did so during development, but then you port the script to a different server and suddenly it raises UnicodeErrors or produces gibberish.
It's often not necessary to deal with encoded strings (i.e. bytes objects) when processing text.
Rather, you make sure to have it decoded when reading and encoded when writing/sending the text.
This is done automatically for streams created with open() (unless you specify binary mode 'rb'/'wb').
If you think input/output has to be UTF-8, then you should explicitly specify encoding='utf8' when calling open().
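A minimal sketch of both points (data.txt is a hypothetical file):

import locale

# what open() would silently use if you omitted encoding=
print(locale.getpreferredencoding())  # e.g. 'UTF-8' on Linux, 'cp1252' on Windows

# being explicit removes the platform dependency
with open('data.txt', encoding='utf-8') as f:
    text = f.read()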

Is there a way to specify which Unicode format is used in unicode encoding in python 2.7?

So, I'd like to encode some values in Unicode in my Python 2.7 script. I'd like to know if I can specify which type of Unicode encoding to use, i.e. UTF-8 vs UTF-32. Apart from that, are there any limitations as to which encodings are supported in Python 2.7, and how is the default encoding determined?
So, first things first: you should be using Python 3, not Python 2.
The handling of text and Unicode is the major difference between the two versions of the language, and the real reason they had to make incompatible changes; it is much, much more straightforward in Python 3.
This means that to talk about Unicode in Python 2 you have to understand certain things. Unicode is used to represent text: characters, regardless of the underlying representation those characters have.
In Python 2 programs, all text typed in the program itself has to be typed as "u"-prefixed strings, like u"..." or u'...' - otherwise the strings are considered "byte strings", just like the ones in C code. (Alternatively, you can place from __future__ import unicode_literals in the first or second line of the file, so this is done automatically.)
Otherwise, all data read into the program - from text files, database connections, inbound HTTP requests - will usually arrive as byte strings in Python 2, and has to be explicitly converted to text strings (that is, "unicode objects" in Python 2 speak) before being processed. This is done by calling the byte string's .decode method, passing as the first parameter the name of the encoding used for those bytes. That is, if you have data you have read from a UTF-8 encoded file, it can be decoded to text by doing:
data = data.decode("utf-8") # and so on for other encodings.
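Put together, a round-trip sketch for Python 2 (input.txt and output.txt are hypothetical files, assumed to be UTF-8):

# decode at the boundary in, work in unicode, encode at the boundary out
raw = open('input.txt', 'rb').read()  # byte string (str in Python 2)
text = raw.decode('utf-8')            # unicode object
processed = text.upper()              # operate on characters, not bytes
with open('output.txt', 'wb') as out:
    out.write(processed.encode('utf-8'))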
Also, if you type any non-ASCII character in the source code of a Python 2 file, regardless of whether it is inside a string (or inside a comment, for example), you have to declare the file encoding in the first line of the file.
That is done with a Python comment that is treated in a special way by the language parser - the first line of code should contain:
# encoding: utf-8
(Of course, you should type the encoding actually used by your editor to store the file. Some variants of this marker are also allowed, such as writing "coding" instead of "encoding", making the ":" optional, and so on.)
So - what I've described in the previous five paragraphs happens automatically in Python 3. But if you've followed so far, you now have a program running with text to be handled. As you may have noticed, you did not mention in your question how you are inputting the text you want to encode in different ways.
So, just as you did explicitly convert the input bytes to in memory unicode strings, now you can use the .encode method to convert the text back to whatever text-encoding you want.
If you have some text that you want to write in a text-file encoded in utf-32 little endian, you do:
with open("myfile.txt", "wt") as file_:
file_.write(data.encode("utf-32 LE"))
The valid text codecs are listed, as per Eran's answer at:
https://docs.python.org/2/library/codecs.html#standard-encodings
Now, if you do some tests with this and succeed, you'd better do two things before proceeding any further:
switch to Python 3. Python 2 is really obsolete at this point - check whether it is already installed on your system by typing "python3" instead of just "python". If it is not, just install it - it can live side by side with Python 2
Read this article to get a grasp of what really goes on when we talk about Unicode and encodings. (The author, Joel, is a founder of Stack Overflow itself, and the article is from 2003)
In Python 2, strings are by default ASCII-encoded byte strings. You can decode them and re-encode them.
Supported encodings can be found here: https://docs.python.org/2/library/codecs.html#standard-encodings
Here's an example:
a = "my string" # a is ASCII encoded bytes
b = u"my string" # b is unicode, not encoded
c = a.decode() # c is unicode, not encoded, by default decoding ASCII, you can specify otherwise as an argument
d = c.encode('utf-32') # d is utf-32 encoded bytes
print type(a) # output: <type 'str'>
print type(b) # output: <type 'unicode'>
print type(c) # output: <type 'unicode'>
print type(d) # output: <type 'str'>
Note 1: In Python 3, things are somewhat different.
Note 2: In order to write non-ASCII literals in your script (that is, if you want to write a = "☂" as part of your code, as opposed to a being just a variable that contains data you got from somewhere), you have to declare the encoding at the top of the file; more info here. And in Python 2 only a small subset of Unicode characters is accepted in literal code (while in memory, of course, you are not limited).
Note 3: Of course, while the unicode type is not encoded as far as your code is concerned, internally Python keeps it in a fixed-width representation (UCS-2 or UCS-4, depending on how the interpreter was built). But that's an internal detail that shouldn't affect your code, generally speaking.
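If you're curious which width your build uses, you can check:

# Python 2: narrow builds store unicode as UCS-2, wide builds as UCS-4
import sys
print sys.maxunicode  # 65535 on a narrow build, 1114111 on a wide build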

How to convert the Python 2 unicode() function into correct Python 3.x syntax

I enabled the compatibility check in my Python IDE and now I realize that the inherited Python 2.7 code has a lot of calls to unicode() which are not allowed in Python 3.x.
I looked at the docs of Python 2 and found no hint on how to upgrade:
I don't want to switch to Python3 now, but maybe in the future.
The code contains about 500 calls to unicode()
How to proceed?
Update
The comment of user vaultah to read the pyporting guide has received several upvotes.
My current solution is this (thanks to Peter Brittain):
from builtins import str
I could not find this hint in the pyporting docs.
As has already been pointed out in the comments, there is already advice on porting from 2 to 3.
Having recently had to port some of my own code from 2 to 3 and maintain compatibility for each for now, I wholeheartedly recommend using python-future, which provides a great tool to help update your code (futurize) as well as clear guidance for how to write cross-compatible code.
In your specific case, I would simply convert all calls to unicode to use str and then import str from builtins. Any IDE worth its salt these days will do that global search and replace in one operation.
Of course, that's the sort of thing futurize should catch too, if you just want to use automatic conversion (and to look for other potential issues in your code).
You can test whether there is such a function as unicode() in the version of Python that you're running. If not, you can create a unicode() alias for the str() function, which does in Python 3 what unicode() did in Python 2, as all strings are unicode in Python 3.
# Python 3 compatibility hack
try:
    unicode('')
except NameError:
    unicode = str
Note that a more complete port is probably a better idea; see the porting guide for details.
Short answer: Replace all unicode calls with str calls.
Long answer: In Python 3, the separate unicode type was merged into str, since all strings in Python 3 are Unicode. The following works if you are only running under Python 3:
unicode = str
# the rest of your code goes here
If you are using it with both Python 2 or Python 3, use this instead:
import sys
if sys.version_info.major == 3:
    unicode = str
# the rest of your code goes here
The other way: run this in the command line
$ 2to3 package -w
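For instance, 2to3's unicode fixer performs rewrites of this kind (user_id is a hypothetical variable):

# before, Python 2
name = unicode(user_id)
# after running 2to3
name = str(user_id)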
First, as a strategy, I would take a small part of your program and try to port it. The number of unicode calls you are describing suggests to me that your application cares about string representations more than most, and each use case is often different.
The important consideration is that all strings are Unicode in Python 3. If you are using the str type to store "bytes" (for example, if they are read from a file), then you should be aware that in Python 3 those will not be bytes but Unicode characters to begin with.
Let's look at a few cases.
First, if you do not have any non-ASCII characters at all and really are not using the Unicode character set, it is easy. Chances are you can simply change the unicode() function to str(). That will assure that any object passed as an argument is properly converted. However, it is wishful thinking to assume it's that easy.
Most likely, you'll need to look at the argument to unicode() to see what it is, and determine how to treat it.
For example, if you are reading UTF-8 characters from a file in Python 2 and converting them to Unicode, your code would look like this:
data = open('somefile', 'r').read()
udata = unicode(data, 'utf-8')
However, in Python3, read() returns Unicode data to begin with, and the unicode decoding must be specified when opening the file:
udata = open('somefile', 'r', encoding='UTF-8').read()
As you can see, transforming unicode() simply when porting may depend heavily on how and why the application is doing Unicode conversions, where the data has come from, and where it is going to.
Python3 brings greater clarity to string representations, which is welcome, but can make porting daunting. For example, Python3 has a proper bytes type, and you convert byte-data to unicode like this:
udata = bytedata.decode('UTF-8')
or convert Unicode data back to bytes using the opposite transform:
bytedata = udata.encode('UTF-8')
I hope this at least helps determine a strategy.
You can use the six library, which has text_type (unicode in py2, str in py3):
from six import text_type
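A short usage sketch (the helper function is just an illustration):

from six import text_type

def ensure_text(value):
    # assumption for this sketch: any bytes input is UTF-8 encoded
    if isinstance(value, bytes):
        return value.decode('utf-8')
    return text_type(value)  # unicode on 2.x, str on 3.x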

Python & fql: getting "Dami\u00e1n" instead of "Damián"

I created a file containing a dictionary with data written in Spanish (i.e. Damián, etc.):
fileNameX.write(json.dumps(dictionaryX, indent=4))
The data come from some fql fetching operations, i.e.:
select name from user where uid in XXX
When I open the file, I find that, for instance, "Damián" looks like "Dami\u00e1n".
I've tried some options:
ensure_ascii=False:
fileNameX.write(json.dumps(dictionaryX, indent=4, ensure_ascii=False))
But I get an error (UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position XXX: ordinal not in range(128)).
encode(encoding='latin-1'):
dictionaryX.append({
    'name': unicodeVar.encode(encoding='latin-1'),
    ...
})
But I get another error (UnicodeDecodeError: 'utf8' codec can't decode byte 0xe1 in position XXX: invalid continuation byte)
To sum up, I've tried several possibilities, but have less than a clue. I'm lost. Please, I need help. Thanks!
You have many options, and have stumbled upon something rather complicated that depends on your Python version and which you absolutely must understand fully in order to write correct code. Generally the approach taken in 3.x is stricter and a bit harder to work with, but it is much less likely that you will make a mistake or get yourself into a complicated situation. (Based on the exact symptoms you report, you seem to be using 2.x.)
json.dumps has different behaviour in 2.x and 3.x. In 2.x, it produces a str, which is a byte-string (unknown encoding). In 3.x, it still produces a str, but now str in 3.x is a proper Unicode string.
JSON is an inherently Unicode-supporting format, but it expects files to be in UTF-8 encoding. However, please understand that JSON supports \u style escapes in strings. When you read this data back in, you will get the correct string: the reading code produces unicode objects (no matter whether you use 2.x or 3.x) when it reads strings out of the JSON.
When I open the file, I find that, for instance, "Damián" looks like "Dami\u00e1n"
á cannot be represented in ASCII. It gets encoded as \u00e1 by default, to avoid the other problems you had. This happens even in 3.x.
ensure_ascii=False
This disables the previous encoding. In 2.x, it means you get a unicode object instead - a real Unicode object, preserving the original á character. In 3.x, it means that the character is not explicitly translated. But either way, ensure_ascii=False means that json.dumps will give you a Unicode string.
Unicode strings must be encoded to be written to a file. There is no such thing as "unicode data"; Unicode is an abstraction. In 2.x, this encoding is implicitly 'ascii' when you feed a Unicode object to file.write; it was expecting a str. To get around this, you can use the codecs module, or explicitly encode as 'utf-8' before writing. In 3.x, the encoding is set with the encoding keyword argument when you open the file (the default is again probably not what you want).
encode(encoding='latin-1')
Here, you are encoding before producing the dictionary, so that you have a str object in your data. Now a problem occurs because when there are str objects in your data, the JSON encoder assumes, by default, that they represent Unicode strings in UTF-8 encoding. This can be changed, in 2.x, using the encoding keyword argument to json.dumps. (In 3.x, the encoder will simply refuse to serialize bytes objects, i.e. non-Unicode strings!)
However, if your goal is simply to get the data into the file directly, then json.dumps is the wrong tool for you. Have you wondered what that s in the name is for? It stands for "string"; this is the special case. The ordinary case, in fact, is writing directly to a file! (Instead of giving you a string and expecting you to write it yourself.) Which is what json.dump (no 's') does. Again, the JSON standard expects UTF-8 encoding, and again 2.x has an encoding keyword parameter that defaults to UTF-8 (you should leave this alone).
Use codecs.open() to open fileNameX with a specific encoding, e.g. encoding='utf-8', instead of using open().
Also, use json.dump() instead of json.dumps(), so the result is written straight to the file.
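A minimal sketch combining the two, in Python 2 (the filename and sample data are made up, following the question):

import codecs
import json

dictionaryX = [{'name': u'Dami\u00e1n'}]
with codecs.open('fileNameX.json', 'w', encoding='utf-8') as f:
    json.dump(dictionaryX, f, indent=4, ensure_ascii=False)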
Since the string has a \u inside it that means it's a Unicode string. The string is actually correct! Your problem lies in displaying the string. If you print the string, Python's output encoding should print it in the proper encoding for your environment.
For example, this is what I get inside IDLE on Windows:
>>> print u'Dami\u00e1n'
Damián

Python and UTF-8: kind of confusing

I am on Google App Engine with Python 2.5. My application has to deal with multiple languages, so I have to deal with UTF-8.
I have done lots of googling but don't get what I want.
1. What's the usage of # -*- coding: utf-8 -*-?
2. What is the difference between
s = u'Witaj świecie'
s = 'Witaj świecie'
where 'Witaj świecie' is a UTF-8 string?
3. When I save the .py file as 'utf-8', do I still need the u before every string?
u'blah' turns it into a different kind of string (type unicode rather than type str) - it makes it a sequence of Unicode code points. Without it, it is a sequence of bytes. Only bytes can be written to disk or to a network stream, but you generally want to work in Unicode (although Python, and some libraries, will do some of the conversion for you) - the encoding (UTF-8) is the translation between the two. So yes, you should use the u in front of all your literals; it will make your life much easier. See Pragmatic Unicode for a better explanation.
The coding line tells Python what encoding your file is in, so that Python can understand it. Again, reading from disk gives bytes - but Python wants to see the characters. In Py2, the default encoding for code is ASCII, so the coding line lets you put things like ś directly in your .py file in the first place - other than that, it doesn't change how your code works.
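A minimal sketch of the difference, using the question's own string (Python 2, file saved as UTF-8):

# -*- coding: utf-8 -*-
s = 'Witaj świecie'   # str: the UTF-8 bytes of the text (because of the coding line)
u = u'Witaj świecie'  # unicode: the code points of the text

print len(s)                  # 14 -- 'ś' occupies two bytes in UTF-8
print len(u)                  # 13 -- one code point per character
print u.encode('utf-8') == s  # True: the encoding is the bridge between the two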
