Python Unicode strings and the Python interactive interpreter - python

I'm trying to understand how python 2.5 deals with unicode strings. Although by now I think I have a good grasp of how I'm supposed to handle them in code, I don't fully understand what's going on behind the scenes, particularly when you type strings at the interpreter's prompt.
So python pre 3.0 has two types for strings, namely: str (byte strings) and unicode, which are both derived from basestring. The default type for strings is str.
str objects have no notion of their actual encoding; they are just bytes. Either you've encoded a unicode string yourself and therefore know what encoding the bytes are in, or you've read a stream of bytes whose encoding you (ideally) know beforehand. You can guess the encoding of a byte string whose encoding is unknown to you, but there just isn't a reliable way of figuring it out. Your best bet is to decode early, use unicode everywhere in your code, and encode late.
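To make that concrete, here is a minimal sketch of the decode-early/encode-late pattern (the file names and the UTF-8 assumption are mine, purely for illustration):
# Hypothetical example: decode at the boundary, work in unicode, encode on output.
raw = open('input.txt', 'rb').read()        # bytes; assumed to be UTF-8 encoded
text = raw.decode('utf-8')                  # decode early -> unicode object
text = text.upper()                         # do all processing on unicode
open('output.txt', 'wb').write(text.encode('utf-8'))   # encode late -> bytes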
That's fine. But strings typed at the interpreter's prompt do get encoded for you behind your back, don't they? Provided that my understanding of strings in Python is correct, what method/setting does python use to make this decision?
The source of my confusion is the differing results I get when I try the same thing on my system's python installation, and on my editor's embedded python console.
# Editor (Sublime Text)
>>> s = "La caña de España"
>>> s
'La ca\xc3\xb1a de Espa\xc3\xb1a'
>>> s.decode("utf-8")
u'La ca\xf1a de Espa\xf1a'
>>> sys.getdefaultencoding()
'ascii'
# Windows python interpreter
>>> s= "La caña de España"
>>> s
'La ca\xa4a de Espa\xa4a'
>>> s.decode("utf-8")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python25\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa4 in position 5: unexpected code byte
>>> sys.getdefaultencoding()
'ascii'

Let me expand Ignacio's reply: In both cases there is an extra layer between Python and you: in one case it is Sublime Text and in the other it's cmd.exe. The difference in behaviour you see is not caused by Python but by the different encodings used by Sublime Text (utf-8, as it seems) and cmd.exe (cp437).
So, when you type ñ, Sublime Text sends '\xc3\xb1' to Python, whereas cmd.exe sends '\xa4'. [I'm simplifying here, omitting details that are not relevant to the question.]
Still, Python knows about that. From cmd.exe you'll probably get something like:
>>> import sys
>>> sys.stdin.encoding
'cp437'
whereas within Sublime Text you'll get something like
>>> import sys
>>> sys.stdin.encoding
'utf-8'

The interpreter uses your command prompt's native encoding for text entry. In your case it's CP437:
>>> print '\xa4'.decode('cp437')
ñ

You're getting confused because the editor and the interpreter are using different encodings themselves. The python interpreter uses your console's encoding (in this case, cp437), while your editor uses utf-8.
Note, the difference disappears if you specify a unicode string, like so:
# Windows python interpreter
>>> s = "La caña de España"
>>> s
'La ca\xa4a de Espa\xa4a'
>>> s = u"La caña de España"
>>> s
u'La ca\xf1a de Espa\xf1a'
The moral of the story? Encodings are tricky. Be sure you know what encoding your source files are in, or play it safe by always using the escaped version of special characters.
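For instance, the escaped form is pure ASCII in the source, so it means the same thing regardless of which encoding the editor or console happens to use (a small sketch; the printed output still depends on your terminal):
>>> s = u'La ca\xf1a de Espa\xf1a'
>>> print s
La caña de España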


Why Python (2.7) encode and decode functions failed [duplicate]

I'm really confused. I tried to encode but the error said can't decode....
>>> "你好".encode("utf8")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
I know how to avoid the error with "u" prefix on the string. I'm just wondering why the error is "can't decode" when encode was called. What is Python doing under the hood?
"你好".encode('utf-8')
encode converts a unicode object to a string object. But here you have invoked it on a string object (because you don't have the u). So python has to convert the string to a unicode object first. So it does the equivalent of
"你好".decode().encode('utf-8')
But the decode fails because the string isn't valid ascii. That's why you get a complaint about not being able to decode.
Always encode from unicode to bytes.
In this direction, you get to choose the encoding.
>>> u"你好".encode("utf8")
'\xe4\xbd\xa0\xe5\xa5\xbd'
>>> print _
你好
The other way is to decode from bytes to unicode.
In this direction, you have to know what the encoding is.
>>> bytes = '\xe4\xbd\xa0\xe5\xa5\xbd'
>>> print bytes
你好
>>> bytes.decode('utf-8')
u'\u4f60\u597d'
>>> print _
你好
This point can't be stressed enough. If you want to avoid playing unicode "whack-a-mole", it's important to understand what's happening at the data level. Here it is explained another way:
A unicode object is decoded already, you never want to call decode on it.
A bytestring object is encoded already, you never want to call encode on it.
Now, on seeing .encode on a byte string, Python 2 first tries to implicitly convert it to text (a unicode object). Similarly, on seeing .decode on a unicode string, Python 2 implicitly tries to convert it to bytes (a str object).
These implicit conversions are why you can get UnicodeDecodeError when you've called encode. It's because encoding usually accepts a parameter of type unicode; when receiving a str parameter, there's an implicit decoding into an object of type unicode before re-encoding it with another encoding. This conversion chooses a default 'ascii' decoder†, giving you the decoding error inside an encoder.
In fact, in Python 3 the methods str.decode and bytes.encode don't even exist. Their removal was a [controversial] attempt to avoid this common confusion.
† ...or whatever coding sys.getdefaultencoding() mentions; usually this is 'ascii'
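A quick way to see that the implicit step really is just an ascii decode (a sketch making the hidden conversion explicit, not the actual C-level implementation):
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> "你好".decode(sys.getdefaultencoding())
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)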
You can try this
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
Or
You can also try the following: add this line at the top of your .py file.
# -*- coding: utf-8 -*-
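For that comment to help, the file must actually be saved in UTF-8, and you still want u'' literals for text you intend to manipulate. A minimal sketch of such a file (the print line is just one assumption about how you might output the result):
#!/usr/bin/python
# -*- coding: utf-8 -*-
s = u"你好"                 # a unicode object, decoded from the UTF-8 source bytes
print s.encode("utf-8")     # encode explicitly when writing bytes out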
If you're using Python < 3, you'll need to tell the interpreter that your string literal is Unicode by prefixing it with a u:
Python 2.7.2 (default, Jan 14 2012, 23:14:09)
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> "你好".encode("utf8")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
>>> u"你好".encode("utf8")
'\xe4\xbd\xa0\xe5\xa5\xbd'
Further reading: Unicode HOWTO.
You use u"你好".encode('utf8') to encode a unicode string.
But if you want to represent "你好", you should decode it, like this:
"你好".decode("utf8")
You will get what you want. Maybe you should learn more about encode & decode.
In case you're dealing with Unicode, sometimes instead of encode('utf-8'), you can also try to ignore the special characters, e.g.
"你好".encode('ascii','ignore')
or as something.decode('unicode_escape').encode('ascii','ignore') as suggested here.
Not particularly useful in this example, but can work better in other scenarios when it's not possible to convert some special characters.
Alternatively you can consider replacing particular characters using replace().
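For example, a tiny sketch of the replace() approach (the replacement character is an arbitrary choice of mine):
>>> u'España'.replace(u'ñ', u'n')
u'Espana'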
If you are starting the python interpreter from a shell on Linux or similar systems (BSD, not sure about Mac), you should also check the default encoding for the shell.
Call locale charmap from the shell (not the python interpreter) and you should see
[user@host dir] $ locale charmap
UTF-8
[user@host dir] $
If this is not the case, and you see something else, e.g.
[user@host dir] $ locale charmap
ANSI_X3.4-1968
[user@host dir] $
Python will (at least in some cases such as in mine) inherit the shell's encoding and will not be able to print (some? all?) unicode characters. Python's own default encoding that you see and control via sys.getdefaultencoding() and sys.setdefaultencoding() is in this case ignored.
If you find that you have this problem, you can fix that by
[user@host dir] $ export LC_CTYPE="en_EN.UTF-8"
[user@host dir] $ locale charmap
UTF-8
[user@host dir] $
(Or alternatively choose whichever locale you want instead of en_EN.) You can also edit /etc/locale.conf (or whichever file governs the locale definition on your system) to correct this.
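To check what Python actually inherited from the environment, something like this should work inside the interpreter (a sketch; the exact values depend on your system):
>>> import sys, locale
>>> locale.getpreferredencoding()
'UTF-8'
>>> sys.stdout.encoding
'UTF-8'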

Is u'string' the same as 'string'.decode('XXX')

Although the title is a question, the short answer is apparently no. I've tried in the shell. The real question is why?
ps: string is some non-ascii characters like Chinese and XXX is the current encoding of string
>>> u'中文' == '中文'.decode('gbk')
False
# The first one is u'\xd6\xd0\xce\xc4' while the second one is u'\u4e2d\u6587'
The example is above. I am using Windows Chinese Simplified. The default encoding is gbk, and so is the python shell's. And the two unicode objects turned out to be unequal.
UPDATES
>>> a = '中文'.decode('gbk')
>>> a
u'\u4e2d\u6587'
>>> print a
中文
>>> b = u'中文'
>>> print b
ÖÐÎÄ
Yes, str.decode() usually returns a unicode string, if the codec can successfully decode the bytes. But the values only represent the same text if the correct codec is used.
Your sample is not using the right codec; you have text that is GBK-encoded but was decoded as Latin-1:
>>> print u'\u4e2d\u6587'
中文
>>> u'\u4e2d\u6587'.encode('gbk')
'\xd6\xd0\xce\xc4'
>>> u'\u4e2d\u6587'.encode('gbk').decode('latin1')
u'\xd6\xd0\xce\xc4'
The values are indeed not equal, because they are not the same text.
Again, it is important that you use the right codec; a different codec will result in very different results:
>>> print u'\u4e2d\u6587'.encode('gbk').decode('latin1')
ÖÐÎÄ
Here I decoded the GBK-encoded bytes as Latin-1, not as GBK or UTF-8. The decoding may have succeeded, but the resulting text is not readable.
Note also that pasting non-ASCII characters only works because the Python interpreter has detected my terminal codec correctly. I can paste text from my browser into my terminal, which then passes the text to Python as UTF-8-encoded data. Because Python asked the terminal what codec is used, it was able to decode the u'....' Unicode literal value correctly. When printing the decoded unicode result, Python once more auto-encodes the data to fit my terminal encoding.
To see what codec Python detected, print sys.stdin.encoding:
>>> import sys
>>> sys.stdin.encoding
'UTF-8'
Similar decisions have to be made when dealing with different sources of text. Reading string literals from the source file, for example, requires that you either use ASCII only (and use escape codes for everything else), or provide Python with an explicit codec notation at the top of the file.
I urge you to read:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
to gain a more complete understanding on how Unicode works, and how Python handles Unicode.
Assuming Python 2.7 from the title.
The answer is no. No, because when you issue string.decode(XXX) you get a unicode object whose value depends on the codec you pass as an argument.
When you use u'string', the codec is inferred from the shell's current encoding, or, in a source file, it is ascii by default or whatever # coding: utf-8 special comment you insert at the beginning of the script.
Just to clarify: if codec XXX is guaranteed to always be the same codec used for the script's input (either the shell or the file), then both approaches behave pretty much the same.
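As a sketch of that last point (assuming a UTF-8 terminal that Python has detected correctly, so the literal's bytes arrive as UTF-8):
>>> u'中文' == '中文'.decode('utf-8')
True
>>> u'中文' == '中文'.decode('gbk')
False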
Hope this helps!

Is there an easy way to make unicode work in python?

I'm trying to deal with unicode in python 2.7.2. I know there is the .encode('utf-8') thing but 1/2 the time when I add it, I get errors, and 1/2 the time when I don't add it I get errors.
Is there any way to tell python - what I thought was an up-to-date & modern language - to just use unicode for strings and not make me have to fart around with .encode('utf-8') stuff?
I know... python 3.0 is supposed to do this, but I can't use 3.0 and 2.7 isn't all that old anyways...
For example:
url = "http://en.wikipedia.org//w/api.php?action=query&list=search&format=json&srlimit=" + str(items) + "&srsearch=" + urllib2.quote(title.encode('utf-8'))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 19: ordinal not in range(128)
Update
If I remove all my .encode statements from all my code and add # -*- coding: utf-8 -*- to the top of my file, right under the #!/usr/bin/python then I get the following, same as if I didn't add the # -*- coding: utf-8 -*- at all.
/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py:1250: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
return ''.join(map(quoter, s))
Traceback (most recent call last):
File "classes.py", line 583, in <module>
wiki.getPage(title)
File "classes.py", line 146, in getPage
url = "http://en.wikipedia.org/w/api.php?action=query&prop=revisions&format=json&rvprop=content&rvlimit=1&titles=" + urllib2.quote(title)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 1250, in quote
return ''.join(map(quoter, s))
KeyError: u'\xf1'
I'm not manually typing in any string, I'm parsing HTML and json from websites. So the scripts/bytestreams/whatever they are, are all created by python.
Update 2 I can move the error along, but it just keeps coming up in new places. I was hoping python would be a useful scripting tool, but it looks like after 3 days of no luck I'll just try a different language. It's a shame, python is preinstalled on osx. I've marked correct the answer that fixed the one instance of the error I posted.
This is a very old question but just wanted to add one partial suggestion. While I sympathise with the OP's pain - having gone through it a lot myself - here's one (partial) answer to make things "easier". Put this at the top of any Python 2.7 script:
from __future__ import unicode_literals
This will at least ensure that your own literal strings default to unicode rather than str.
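A minimal sketch of the effect at the interactive prompt:
>>> from __future__ import unicode_literals
>>> type('now a unicode literal')
<type 'unicode'>
>>> type(b'still a byte string')
<type 'str'>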
There is no way to make unicode "just work" apart from using unicode strings everywhere and immediately decoding any encoded string you receive. The problem is that you MUST ALWAYS keep straight whether you're dealing with encoded or unencoded data, or use tools that keep track of it for you, or you're going to have a bad time.
Python 2 does some things that are problematic for this: it makes str the "default" rather than unicode for things like string literals, it silently coerces str to unicode when you add the two, and it lets you call .encode() on an already-encoded string to double-encode it. As a result, there are a lot of python coders and python libraries out there that have no idea what encodings they're designed to work with, but are nonetheless designed to deal with some particular encoding since the str type is designed to let the programmer manage the encoding themselves. And you have to think about the encoding each time you use these libraries since they don't support the unicode type themselves.
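The silent coercion looks harmless with ASCII data and blows up as soon as a non-ASCII byte shows up (a small sketch of the failure mode):
>>> u'abc' + 'def'
u'abcdef'
>>> u'abc' + 'La ca\xc3\xb1a'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 5: ordinal not in range(128)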
In your particular case, the first error tells you you're dealing with encoded UTF-8 data and trying to double-encode it, while the 2nd tells you you're dealing with UNencoded data. It looks like you may have both. You should really find and fix the source of the problem (I suspect it has to do with the silent coercion I mentioned above), but here's a hack that should fix it in the short term:
encoded_title = title
if isinstance(encoded_title, unicode):
    encoded_title = title.encode('utf-8')
If this is in fact a case of silent coercion biting you, you should be able to easily track down the problem using the excellent unicode-nazi tool:
python -Werror -municodenazi myprog.py
This will give you a traceback right at the point where unicode leaks into your non-unicode strings, instead of trying to troubleshoot this exception way down the road from the actual problem. See my answer on this related question for details.
Yes, define your unicode data as unicode literals:
>>> u'Hi, this is unicode: üæ'
u'Hi, this is unicode: üæ'
You usually want to use '\uxxxx' unicode escapes or set a source code encoding. The following line at the top of your module, for example, sets the encoding to UTF-8:
# -*- coding: utf-8 -*-
Read the Python Unicode HOWTO for the details, such as default encodings and such (the default source code encoding, for example, is ASCII).
As for your specific example, your title is not a Unicode literal but a python byte string, and python is trying to decode it to unicode for you just so you can encode it again. This fails, as the default codec for such automatic decoding is ASCII:
>>> 'å'.encode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
Encoding only applies to actual unicode strings, so a byte string needs to be explicitly decoded:
>>> 'å'.decode('utf-8').encode('utf-8')
'\xc3\xa5'
If you are used to Python 3, then unicode literals in Python 2 (u'') are the new default string type in Python 3, while regular (byte) strings in Python 2 ('') are the same as bytes objects in Python 3 (b'').
If you have errors both with and without the encode call on title, you have mixed data. Test the title and encode as needed:
if isinstance(title, unicode):
    title = title.encode('utf-8')
You may want to find out what produces the mixed unicode / byte string titles though, and correct that source to always produce one or the other.
Be sure that title in your title.encode("utf-8") is of type unicode, and don't use str("İŞşĞğÖöÜü");
use unicode("ĞğıIİiÖöŞşcçÇ") in your stringifiers.
Actually, the easiest way to make Python work with unicode is to use Python 3, where everything is unicode by default.
Unfortunately, there are not many libraries written for P3, as well as some basic differences in coding & keyword use. That's the problem I have: the libraries I need are only available for P 2.7, and I don't know enough to convert them to P 3. :(

Printing escaped Unicode in Python

>>> s = 'auszuschließen'
>>> print(s.encode('ascii', errors='xmlcharrefreplace'))
b'auszuschlie&#223;en'
>>> print(str(s.encode('ascii', errors='xmlcharrefreplace'), 'ascii'))
auszuschlie&#223;en
Is there a prettier way to print any string without the b''?
EDIT:
I'm just trying to print escaped characters from Python, and my only gripe is that Python adds "b''" when I do that.
If I want to see the actual characters in a dumb terminal like Windows 7's, then I get this:
Traceback (most recent call last):
File "Mailgen.py", line 378, in <module>
marked_copy = mark_markup(language_column, item_row)
File "Mailgen.py", line 210, in mark_markup
print("TP: %r" % "".join(to_print))
File "c:\python32\lib\encodings\cp437.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2026' in position 29: character maps to <undefined>
>>> s='auszuschließen…'
>>> s
'auszuschließen…'
>>> print(s)
auszuschließen…
>>> b=s.encode('ascii','xmlcharrefreplace')
>>> b
b'auszuschlie&#223;en&#8230;'
>>> print(b)
b'auszuschlie&#223;en&#8230;'
>>> b.decode()
'auszuschlie&#223;en&#8230;'
>>> print(b.decode())
auszuschlie&#223;en&#8230;
You start out with a Unicode string. Encoding it to ascii creates a bytes object with the characters you want. Python won't print it without converting it back into a string, and the default conversion puts in the b and quotes. Using decode explicitly converts it back to a string; the default encoding is utf-8, and since your bytes consist only of ascii, which is a subset of utf-8, it is guaranteed to work.
To see ascii representation (like repr() on Python 2) for debugging:
print(ascii('auszuschließen…'))
# -> 'auszuschlie\xdfen\u2026'
To print bytes:
import sys
sys.stdout.buffer.write('auszuschließen…'.encode('ascii', 'xmlcharrefreplace'))
# -> auszuschlie&#223;en&#8230;
Not all terminals can handle more than some sort of 8-bit character set, that's true. But they won't handle that no matter what you do, really.
Printing a Unicode string will, assuming that your OS sets up the terminal properly, produce the best result possible, which means that the characters the terminal cannot print will be replaced with some character, like a question mark or similar. Doing that translation yourself will not really improve things.
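For reference, that substitution is roughly what you would get by encoding with errors='replace' yourself (a sketch; I'm assuming sys.stdout.encoding is plain 'ascii' here, so both the ß and the … get replaced; a more capable terminal replaces fewer characters):
>>> import sys
>>> s = 'auszuschließen…'
>>> print(s.encode(sys.stdout.encoding, errors='replace').decode(sys.stdout.encoding))
auszuschlie?en?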
Update:
Since you want to know what characters are in the string, you actually want to know the Unicode codes for them, or the XML equivalent in this case. That's more inspecting than printing, and then usually the b'' part isn't a problem per se.
But you can get rid of it easily and hackily like so:
print(repr(s.encode('ascii', errors='xmlcharrefreplace'))[2:-1])
Since you're using Python 3, you're afforded the ability to write print(s) to the console.
I can agree that, depending on the console, it may not be able to print properly, but I would imagine that most modern OSes since 2006 can handle Unicode strings without too much of an issue. I'd encourage you to give it a try and see if it works.
Alternatively, you can enforce a coding by placing this before any lines in a file (similar to a shebang):
# -*- coding: utf-8 -*-
This tells the interpreter that the source file itself is encoded as UTF-8.

Python Unicode in and out of IDE

when I run my programs from within Eclipse IDE the following piece of code works perfectly:
address_name = self.text_ctrl_address.GetValue().encode('utf-8')
self.address_list = [i for i in data if address_name.upper() in i[5].upper().encode('utf-8')]
but when running the same piece of code directly with python, I get a "UnicodeDecodeError".
What does the IDE do differently so that it doesn't hit this error?
ps: I encode both unicode strings because it is the only way to test one string against another containing letters like ñ or ç.
Edit:
Sorry, I should have given more details: this piece of code belongs to a dialog built with WxPython. The GetValue() function gets text from a line edit widget and tries to match this piece of text against a database. The program runs on Windows (and because of this, maybe Michael Shopsin above might be right: "Win-1252 to UTF-8 is a serious nuisance"). I've read many times that I should always work with unicode and avoid encoding, but if I don't encode, certain string methods don't seem to work very well depending on the characters in a word (I am in Spain, so lots of non-ascii characters). By directly I meant "double clicking" the file itself, not running it from within the IDE.
UnicodeDecodeError indicates that the error happens during decoding of a bytestring into Unicode.
In particular, it may happen if you try to encode a bytestring instead of Unicode string on Python 2:
>>> u"\N{EM DASH}".encode('utf-8').encode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
u"\N{EM DASH}".encode('utf-8') is a bytestring and invoking .encode('utf-8') the 2nd time leads to implicit .decode(sys.getdefaultencoding()) that leads to the UnicodeDecodeError.
What does the IDE do differently so that it doesn't hit this error?
It probably works in the IDE because the IDE changes sys.getdefaultencoding() to utf-8, which you should not do. It may hide bugs, as your question demonstrates. In general, it may also break 3rd-party libraries that do not expect a non-ascii sys.getdefaultencoding() on Python 2.
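A quick check you can run both inside the IDE and from a plain python prompt to confirm this (a sketch; the utf-8 value inside the IDE is a guess based on the behaviour you describe):
>>> import sys
>>> sys.getdefaultencoding()   # 'ascii' from a plain prompt; 'utf-8' would explain the IDE behaviour
'ascii'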
I encode both unicode strings because it is the only way to test one string against another containing letters like ñ or ç.
You should use unicodedata.normalize() instead:
>>> import unicodedata
>>> a, b = u'\xf1', u'n\u0303'
>>> print(a)
ñ
>>> print(b)
ñ
>>> a == unicodedata.normalize('NFC', b)
True
Note: the code in your question may produce surprising results:
#XXX BROKEN, DON'T DO IT
...address_name.upper() in i[5].upper().encode('utf-8')...
address_name.upper() calls the bytes.upper method while i[5].upper() calls the unicode.upper method. The former does not support Unicode and may depend on the current locale; the latter is better, but to perform a case-insensitive comparison, use the .casefold() method (Python 3.3+) instead:
key = unicode_address_name.casefold()
... if key == i[5].casefold()...
In general, if you need to sort unicode strings then you could use icu.Collator. Compare the default lexicographical sort:
>>> L = [u'sandwiches', u'angel delight', u'custard', u'éclairs', u'glühwein']
>>> sorted(L)
[u'angel delight', u'custard', u'gl\xfchwein', u'sandwiches', u'\xe9clairs']
with the order in en_GB locale:
>>> import icu # PyICU
>>> collator = icu.Collator.createInstance(icu.Locale('en_GB'))
>>> sorted(L, key=collator.getSortKey)
[u'angel delight', u'custard', u'\xe9clairs', u'gl\xfchwein', u'sandwiches']
I could solve the problem by changing the encoding from UTF-8 to cp1252 (Windows Western Europe). Apparently UTF-8 could not encode some Windows characters. Thanks to Michael Shopsin above for the insight.
The program runs on Windows and uses a WxPython dialog, getting values from a line edit widget and matching the string against a database.
Thank you all for the attention, and I hope this post can help people in the future with a similar problem.
