Python Unicode in and out of IDE

Python Unicode in and out of IDE - python

when I run my programs from within Eclipse IDE the following piece of code works perfectly:
address_name = self.text_ctrl_address.GetValue().encode('utf-8')
self.address_list = [i for i in data if address_name.upper() in i[5].upper().encode('utf-8')]
but when running the same piece of code directly with python, I get an "UnicodeDecodeError".
What does the IDE does differently that it doesn't fall on this error ?
ps: I encode both unicode strings because it is the only way to test one string against another containing letters like ñ or ç.
Edit:
Sorry, I should have given more details: This piece of code belongs to a dialog built with WxPython. The GetValue() functions gets texts from a line edit widget and try to match this piece of text against a database. The program runs on Windows (and because of this, maybe michael Shopsin above might be right("Win-1252 to UTF-8 is a serious nuisance"). I've read many times that I should always work with unicode, avoid encoding, but if I don't encode, certain string methods don't seem to work very well depending on the characters in a word (I am in Spain, so lots of non ascii characters). By directly I meant "double clicking" the file it self, and not running from within the IDE.

UnicodeDecodeError indicates that the error happens during decoding of a bytestring into Unicode.
In particular, it may happen if you try to encode a bytestring instead of Unicode string on Python 2:
>>> u"\N{EM DASH}".encode('utf-8').encode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
u"\N{EM DASH}".encode('utf-8') is a bytestring and invoking .encode('utf-8') the 2nd time leads to implicit .decode(sys.getdefaultencoding()) that leads to the UnicodeDecodeError.
What does the IDE does differently that it doesn't fall on this error ?
It probably works in IDE because it changes sys.getdefaultencoding() to utf-8 that you should not do. It may hide bugs as your question demonstrates. In general, it may also break 3rd-party libraries that do not expect non-ascii sys.getdefaultencoding() on Python 2.
I encode both unicode strings because it is the only way to test one string against another containing letters like ñ or ç.
You should use unicodedata.normalize() instead:
>>> import unicodedata
>>> a, b = u'\xf1', u'n\u0303'
>>> print(a)
ñ
>>> print(b)
ñ
>>> a == unicodedata.normalize('NFC', b)
True
Note: the code in your question may produce surprising results:
#XXX BROKEN, DON'T DO IT
...address_name.upper() in i[5].upper().encode('utf-8')...
address_name.upper() calls bytes.upper method while i[5].upper() calls unicode.upper method. The former does not support Unicode and it may depend on the current locale, the latter is better but to perform case-insensitive comparison, use .casefold() method instead:
key = unicode_address_name.casefold()
... if key == i[5].casefold()...
In general, If you need to sort unicode strings then you could use icu.Collator. Compare the default lexicographical sort:
>>> L = [u'sandwiches', u'angel delight', u'custard', u'éclairs', u'glühwein']
>>> sorted(L)
[u'angel delight', u'custard', u'gl\xfchwein', u'sandwiches', u'\xe9clairs']
with the order in en_GB locale:
>>> import icu # PyICU
>>> collator = icu.Collator.createInstance(icu.Locale('en_GB'))
>>> sorted(L, key=collator.getSortKey)
[u'angel delight', u'custard', u'\xe9clairs', u'gl\xfchwein', u'sandwiches']

I could solve the problem changing the encoding from UTF-8 to cp1252 (Windows western europe). Apparently UTF-8 could not encode some Windows characters. Thanks to Michael Shopsin above for the insight.
The program runs on windows and uses WxPython dialog , getting values from a line edit widget and matching the string against a database.
Thank you all for the attention, and I hope this post can help people in the future with a similar problem.

Related

What is the most Pythonic way to handle unicode strings in code that runs in both 2.x and 3.x

I'm writing code which must work on both 2 and 3, and occasionally have to deal with utf-8 strings.
Consider the following on 2.x:
>>> mystr = 'Nyår'
>>> mystr_u = mystr.encode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128)
This will work on 3, but will raise a UnicodeError in 2.x because of the differences between what the str object actually is.
This is obviously problematic when code must work in both versions.
How I am currently getting around this is with the following:
>>> mystr = 'Nyår'
>>> try:
... mystr_u = mystr.encode('utf-8')
... except UnicodeError as e:
... mystr_u = mystr
...
>>> mystr_u
'Ny\xc3\xa5r'
This seems a little messy to me. Is there a more Pythonic way to make my code version-independent when it comes to utf-8 strings?
Edit: Just to clarify, I am not working with literals in the actual code. That was just done for the example. The actual code is getting the string from another call, such as os.listdir() on a directory with entries using unicode characters.

Unless you need to work in Python 3.0-3.2, the simple answer is to use explicitly-unicode string literals, like this:
mystr = u'Nyår'
In Python 2.7, this means the literal is a unicode instead of a str. In Python 3.4+, it's ignored, because literals are already str (which is Unicode) by default.
And now you can use it as Unicode, or encode it to UTF-8 (which will give str in 2.x and bytes in 3.x), or whatever else you want, and it will work consistently.
Also, if this is a literal in a module's or script's source code, make sure to add an encoding declaration to the file—the default isn't UTF-8 until Python 3.5. And of course make sure to actually save the file as UTF-8 in your editor.
If you do need to work with 3.2, there is no ideal option. The best choices are to use six.u to fake it (at some cost in performance on 2.x, and in error handling if you screw up the encoding declaration), or to use 2to3 plus modernize instead of a static codebase.
If you're looking for a way to handle the string type rather than literals, you may want to use six, or just do something like this at the top of your code:
try:
unicode
except NameError:
unicode = str
And then, if you're writing a library function that should take both unicode and bytes and do different things accordingly, you can write code that works properly in both 2.x and 3.x, like:
def func(s, default_encoding='utf-8'):
if not isinstance(s, unicode):
s = s.decode(default_encoding)
# now use s knowing it's unicode

Is there an easy way to make unicode work in python?

I'm trying to deal with unicode in python 2.7.2. I know there is the .encode('utf-8') thing but 1/2 the time when I add it, I get errors, and 1/2 the time when I don't add it I get errors.
Is there any way to tell python - what I thought was an up-to-date & modern language to just use unicode for strings and not make me have to fart around with .encode('utf-8') stuff?
I know... python 3.0 is supposed to do this, but I can't use 3.0 and 2.7 isn't all that old anyways...
For example:
url = "http://en.wikipedia.org//w/api.php?action=query&list=search&format=json&srlimit=" + str(items) + "&srsearch=" + urllib2.quote(title.encode('utf-8'))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 19: ordinal not in range(128)
Update
If I remove all my .encode statements from all my code and add # -*- coding: utf-8 -*- to the top of my file, right under the #!/usr/bin/python then I get the following, same as if I didn't add the # -*- coding: utf-8 -*- at all.
/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py:1250: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
return ''.join(map(quoter, s))
Traceback (most recent call last):
File "classes.py", line 583, in <module>
wiki.getPage(title)
File "classes.py", line 146, in getPage
url = "http://en.wikipedia.org/w/api.php?action=query&prop=revisions&format=json&rvprop=content&rvlimit=1&titles=" + urllib2.quote(title)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 1250, in quote
return ''.join(map(quoter, s))
KeyError: u'\xf1'
I'm not manually typing in any string, I parsing HTML and json from websites. So the scripts/bytestreams/whatever they are, are all created by python.
Update 2 I can move the error along, but it just keeps coming up in new places. I was hoping python would be a useful scripting tool, but looks like after 3 days of no luck I'll just try a different language. Its a shame, python is preinstalled on osx. I've marked correct the answer that fixed the one instance of the error I posted.

This is a very old question but just wanted to add one partial suggestion. While I sympathise with the OP's pain - having gone through it a lot myself - here's one (partial) answer to make things "easier". Put this at the top of any Python 2.7 script:
from __future__ import unicode_literals
This will at least ensure that your own literal strings default to unicode rather than str.

There is no way to make unicode "just work" apart from using unicode strings everywhere and immediately decoding any encoded string you receive. The problem is that you MUST ALWAYS keep straight whether you're dealing with encoded or unencoded data, or use tools that keep track of it for you, or you're going to have a bad time.
Python 2 does some things that are problematic for this: it makes str the "default" rather than unicode for things like string literals, it silently coerces str to unicode when you add the two, and it lets you call .encode() on an already-encoded string to double-encode it. As a result, there are a lot of python coders and python libraries out there that have no idea what encodings they're designed to work with, but are nonetheless designed to deal with some particular encoding since the str type is designed to let the programmer manage the encoding themselves. And you have to think about the encoding each time you use these libraries since they don't support the unicode type themselves.
In your particular case, the first error tells you you're dealing with encoded UTF-8 data and trying to double-encode it, while the 2nd tells you you're dealing with UNencoded data. It looks like you may have both. You should really find and fix the source of the problem (I suspect it has to do with the silent coercion I mentioned above), but here's a hack that should fix it in the short term:
encoded_title = title
if isinstance(encoded_title, unicode):
encoded_title = title.encode('utf-8')
If this is in fact a case of silent coercion biting you, you should be able to easily track down the problem using the excellent unicode-nazi tool:
python -Werror -municodenazi myprog.py
This will give you a traceback right at the point unicode leaks into your non-unicode strings, instead of trying troubleshooting this exception way down the road from the actual problem. See my answer on this related question for details.

Yes, define your unicode data as unicode literals:
>>> u'Hi, this is unicode: üæ'
u'Hi, this is unicode: üæ'
You usually want to use '\uxxxx` unicode escapes or set a source code encoding. The following line at the top of your module, for example, sets the encoding to UTF-8:
# -*- coding: utf-8 -*-
Read the Python Unicode HOWTO for the details, such as default encodings and such (the default source code encoding, for example, is ASCII).
As for your specific example, your title is not a Unicode literal but a python byte string, and python is trying to decode it to unicode for you just so you can encode it again. This fails, as the default codec for such automatic encodings is ASCII:
>>> 'å'.encode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
Encoding only applies to actual unicode strings, so a byte string needs to be explicitly decoded:
>>> 'å'.decode('utf-8').encode('utf-8')
'\xc3\xa5'
If you are used to Python 3, then unicode literals in Python 2 (u'') are the new default string type in Python 3, while regular (byte) strings in Python 2 ('') are the same as bytes objects in Python 3 (b'').
If you have errors both with and without the encode call on title, you have mixed data. Test the title and encode as needed:
if isinstance(title, unicode):
title = title.encode('utf-8')
You may want to find out what produces the mixed unicode / byte string titles though, and correct that source to always produce one or the other.

be sure that title in your title.encode("utf-8") is type of unicode and dont use str("İŞşĞğÖöÜü")
use unicode("ĞğıIİiÖöŞşcçÇ") in your stringifiers

Actually, the easiest way to make Python work with unicode is to use Python 3, where everything is unicode by default.
Unfortunately, there are not many libraries written for P3, as well as some basic differences in coding & keyword use. That's the problem I have: the libraries I need are only available for P 2.7, and I don't know enough to convert them to P 3. :(

Printing escaped Unicode in Python

>>> s = 'auszuschließen'
>>> print(s.encode('ascii', errors='xmlcharrefreplace'))
b'auszuschließen'
>>> print(str(s.encode('ascii', errors='xmlcharrefreplace'), 'ascii'))
auszuschließen
Is there a prettier way to print any string without the b''?
EDIT:
I'm just trying to print escaped characters from Python, and my only gripe is that Python adds "b''" when i do that.
If i wanted to see the actual character in a dumb terminal like Windows 7's, then i get this:
Traceback (most recent call last):
File "Mailgen.py", line 378, in <module>
marked_copy = mark_markup(language_column, item_row)
File "Mailgen.py", line 210, in mark_markup
print("TP: %r" % "".join(to_print))
File "c:\python32\lib\encodings\cp437.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2026' in position 29: character maps to <undefined>

>>> s='auszuschließen…'
>>> s
'auszuschließen…'
>>> print(s)
auszuschließen…
>>> b=s.encode('ascii','xmlcharrefreplace')
>>> b
b'auszuschließen…'
>>> print(b)
b'auszuschließen…'
>>> b.decode()
'auszuschließen…'
>>> print(b.decode())
auszuschließen…
You start out with a Unicode string. Encoding it to ascii creates a bytes object with the characters you want. Python won't print it without converting it back into a string and the default conversion puts in the b and quotes. Using decode explicitly converts it back to a string; the default encoding is utf-8, and since your bytes only consist of ascii which is a subset of utf-8 it is guaranteed to work.

To see ascii representation (like repr() on Python 2) for debugging:
print(ascii('auszuschließen…'))
# -> 'auszuschlie\xdfen\u2026'
To print bytes:
sys.stdout.buffer.write('auszuschließen…'.encode('ascii', 'xmlcharrefreplace'))
# -> auszuschließen…

Not all terminals can handle more than some sort of 8-bit character set, that's true. But they won't handle that no matter what you do, really.
Printing a Unicode string will, assuming that your OS set's up the terminal properly, result in the best result possible, which means that the characters that the terminal can not print will be replaced with some character, like a question mark or similar. Doing that translation yourself will not really improve things.
Update:
Since you want to know what characters are in the string, you actually want to know the Unicode codes for them, or the XML equivalent in this case. That's more inspecting than printing, and then usually the b'' part isn't a problem per se.
But you can get rid of it easily and hackily like so:
print(repr(s.encode('ascii', errors='xmlcharrefreplace'))[2:-1])

Since you're using Python 3, you're afforded the ability to write print(s) to the console.
I can agree that, depending on the console, it may not be able to print properly, but I would imagine that most modern OSes since 2006 can handle Unicode strings without too much of an issue. I'd encourage you to give it a try and see if it works.
Alternatively, you can enforce a coding by placing this before any lines in a file (similar to a shebang):
# -*- coding: utf-8 -*-
This will force the interpreter to render it as UTF-8.

SQLite, python, unicode, and non-utf data

I started by trying to store strings in sqlite using python, and got the message:
sqlite3.ProgrammingError: You must
not use 8-bit bytestrings unless you
use a text_factory that can interpret
8-bit bytestrings (like text_factory =
str). It is highly recommended that
you instead just switch your
application to Unicode strings.
Ok, I switched to Unicode strings. Then I started getting the message:
sqlite3.OperationalError: Could not
decode to UTF-8 column 'tag_artist'
with text 'Sigur Rós'
when trying to retrieve data from the db. More research and I started encoding it in utf8, but then 'Sigur Rós' starts looking like 'Sigur RÃ³s'
note: My console was set to display in 'latin_1' as #John Machin pointed out.
What gives? After reading this, describing exactly the same situation I'm in, it seems as if the advice is to ignore the other advice and use 8-bit bytestrings after all.
I didn't know much about unicode and utf before I started this process. I've learned quite a bit in the last couple hours, but I'm still ignorant of whether there is a way to correctly convert 'ó' from latin-1 to utf-8 and not mangle it. If there isn't, why would sqlite 'highly recommend' I switch my application to unicode strings?
I'm going to update this question with a summary and some example code of everything I've learned in the last 24 hours so that someone in my shoes can have an easy(er) guide. If the information I post is wrong or misleading in any way please tell me and I'll update, or one of you senior guys can update.
Summary of answers
Let me first state the goal as I understand it. The goal in processing various encodings, if you are trying to convert between them, is to understand what your source encoding is, then convert it to unicode using that source encoding, then convert it to your desired encoding. Unicode is a base and encodings are mappings of subsets of that base. utf_8 has room for every character in unicode, but because they aren't in the same place as, for instance, latin_1, a string encoded in utf_8 and sent to a latin_1 console will not look the way you expect. In python the process of getting to unicode and into another encoding looks like:
str.decode('source_encoding').encode('desired_encoding')
or if the str is already in unicode
str.encode('desired_encoding')
For sqlite I didn't actually want to encode it again, I wanted to decode it and leave it in unicode format. Here are four things you might need to be aware of as you try to work with unicode and encodings in python.
The encoding of the string you want to work with, and the encoding you want to get it to.
The system encoding.
The console encoding.
The encoding of the source file
Elaboration:
(1) When you read a string from a source, it must have some encoding, like latin_1 or utf_8. In my case, I'm getting strings from filenames, so unfortunately, I could be getting any kind of encoding. Windows XP uses UCS-2 (a Unicode system) as its native string type, which seems like cheating to me. Fortunately for me, the characters in most filenames are not going to be made up of more than one source encoding type, and I think all of mine were either completely latin_1, completely utf_8, or just plain ascii (which is a subset of both of those). So I just read them and decoded them as if they were still in latin_1 or utf_8. It's possible, though, that you could have latin_1 and utf_8 and whatever other characters mixed together in a filename on Windows. Sometimes those characters can show up as boxes, other times they just look mangled, and other times they look correct (accented characters and whatnot). Moving on.
(2) Python has a default system encoding that gets set when python starts and can't be changed during runtime. See here for details. Dirty summary ... well here's the file I added:
\# sitecustomize.py
\# this file can be anywhere in your Python path,
\# but it usually goes in ${pythondir}/lib/site-packages/
import sys
sys.setdefaultencoding('utf_8')
This system encoding is the one that gets used when you use the unicode("str") function without any other encoding parameters. To say that another way, python tries to decode "str" to unicode based on the default system encoding.
(3) If you're using IDLE or the command-line python, I think that your console will display according to the default system encoding. I am using pydev with eclipse for some reason, so I had to go into my project settings, edit the launch configuration properties of my test script, go to the Common tab, and change the console from latin-1 to utf-8 so that I could visually confirm what I was doing was working.
(4) If you want to have some test strings, eg
test_str = "ó"
in your source code, then you will have to tell python what kind of encoding you are using in that file. (FYI: when I mistyped an encoding I had to ctrl-Z because my file became unreadable.) This is easily accomplished by putting a line like so at the top of your source code file:
# -*- coding: utf_8 -*-
If you don't have this information, python attempts to parse your code as ascii by default, and so:
SyntaxError: Non-ASCII character '\xf3' in file _redacted_ on line 81, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
Once your program is working correctly, or, if you aren't using python's console or any other console to look at output, then you will probably really only care about #1 on the list. System default and console encoding are not that important unless you need to look at output and/or you are using the builtin unicode() function (without any encoding parameters) instead of the string.decode() function. I wrote a demo function I will paste into the bottom of this gigantic mess that I hope correctly demonstrates the items in my list. Here is some of the output when I run the character 'ó' through the demo function, showing how various methods react to the character as input. My system encoding and console output are both set to utf_8 for this run:
'�' = original char <type 'str'> repr(char)='\xf3'
'?' = unicode(char) ERROR: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data
'ó' = char.decode('latin_1') <type 'unicode'> repr(char.decode('latin_1'))=u'\xf3'
'?' = char.decode('utf_8') ERROR: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data
Now I will change the system and console encoding to latin_1, and I get this output for the same input:
'ó' = original char <type 'str'> repr(char)='\xf3'
'ó' = unicode(char) <type 'unicode'> repr(unicode(char))=u'\xf3'
'ó' = char.decode('latin_1') <type 'unicode'> repr(char.decode('latin_1'))=u'\xf3'
'?' = char.decode('utf_8') ERROR: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data
Notice that the 'original' character displays correctly and the builtin unicode() function works now.
Now I change my console output back to utf_8.
'�' = original char <type 'str'> repr(char)='\xf3'
'�' = unicode(char) <type 'unicode'> repr(unicode(char))=u'\xf3'
'�' = char.decode('latin_1') <type 'unicode'> repr(char.decode('latin_1'))=u'\xf3'
'?' = char.decode('utf_8') ERROR: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data
Here everything still works the same as last time but the console can't display the output correctly. Etc. The function below also displays more information that this and hopefully would help someone figure out where the gap in their understanding is. I know all this information is in other places and more thoroughly dealt with there, but I hope that this would be a good kickoff point for someone trying to get coding with python and/or sqlite. Ideas are great but sometimes source code can save you a day or two of trying to figure out what functions do what.
Disclaimers: I'm no encoding expert, I put this together to help my own understanding. I kept building on it when I should have probably started passing functions as arguments to avoid so much redundant code, so if I can I'll make it more concise. Also, utf_8 and latin_1 are by no means the only encoding schemes, they are just the two I was playing around with because I think they handle everything I need. Add your own encoding schemes to the demo function and test your own input.
One more thing: there are apparently crazy application developers making life difficult in Windows.
#!/usr/bin/env python
# -*- coding: utf_8 -*-
import os
import sys
def encodingDemo(str):
validStrings = ()
try:
print "str =",str,"{0} repr(str) = {1}".format(type(str), repr(str))
validStrings += ((str,""),)
except UnicodeEncodeError as ude:
print "Couldn't print the str itself because the console is set to an encoding that doesn't understand some character in the string. See error:\n\t",
print ude
try:
x = unicode(str)
print "unicode(str) = ",x
validStrings+= ((x, " decoded into unicode by the default system encoding"),)
except UnicodeDecodeError as ude:
print "ERROR. unicode(str) couldn't decode the string because the system encoding is set to an encoding that doesn't understand some character in the string."
print "\tThe system encoding is set to {0}. See error:\n\t".format(sys.getdefaultencoding()),
print ude
except UnicodeEncodeError as uee:
print "ERROR. Couldn't print the unicode(str) because the console is set to an encoding that doesn't understand some character in the string. See error:\n\t",
print uee
try:
x = str.decode('latin_1')
print "str.decode('latin_1') =",x
validStrings+= ((x, " decoded with latin_1 into unicode"),)
try:
print "str.decode('latin_1').encode('utf_8') =",str.decode('latin_1').encode('utf_8')
validStrings+= ((x, " decoded with latin_1 into unicode and encoded into utf_8"),)
except UnicodeDecodeError as ude:
print "The string was decoded into unicode using the latin_1 encoding, but couldn't be encoded into utf_8. See error:\n\t",
print ude
except UnicodeDecodeError as ude:
print "Something didn't work, probably because the string wasn't latin_1 encoded. See error:\n\t",
print ude
except UnicodeEncodeError as uee:
print "ERROR. Couldn't print the str.decode('latin_1') because the console is set to an encoding that doesn't understand some character in the string. See error:\n\t",
print uee
try:
x = str.decode('utf_8')
print "str.decode('utf_8') =",x
validStrings+= ((x, " decoded with utf_8 into unicode"),)
try:
print "str.decode('utf_8').encode('latin_1') =",str.decode('utf_8').encode('latin_1')
except UnicodeDecodeError as ude:
print "str.decode('utf_8').encode('latin_1') didn't work. The string was decoded into unicode using the utf_8 encoding, but couldn't be encoded into latin_1. See error:\n\t",
validStrings+= ((x, " decoded with utf_8 into unicode and encoded into latin_1"),)
print ude
except UnicodeDecodeError as ude:
print "str.decode('utf_8') didn't work, probably because the string wasn't utf_8 encoded. See error:\n\t",
print ude
except UnicodeEncodeError as uee:
print "ERROR. Couldn't print the str.decode('utf_8') because the console is set to an encoding that doesn't understand some character in the string. See error:\n\t",uee
print
print "Printing information about each character in the original string."
for char in str:
try:
print "\t'" + char + "' = original char {0} repr(char)={1}".format(type(char), repr(char))
except UnicodeDecodeError as ude:
print "\t'?' = original char {0} repr(char)={1} ERROR PRINTING: {2}".format(type(char), repr(char), ude)
except UnicodeEncodeError as uee:
print "\t'?' = original char {0} repr(char)={1} ERROR PRINTING: {2}".format(type(char), repr(char), uee)
print uee
try:
x = unicode(char)
print "\t'" + x + "' = unicode(char) {1} repr(unicode(char))={2}".format(x, type(x), repr(x))
except UnicodeDecodeError as ude:
print "\t'?' = unicode(char) ERROR: {0}".format(ude)
except UnicodeEncodeError as uee:
print "\t'?' = unicode(char) {0} repr(char)={1} ERROR PRINTING: {2}".format(type(x), repr(x), uee)
try:
x = char.decode('latin_1')
print "\t'" + x + "' = char.decode('latin_1') {1} repr(char.decode('latin_1'))={2}".format(x, type(x), repr(x))
except UnicodeDecodeError as ude:
print "\t'?' = char.decode('latin_1') ERROR: {0}".format(ude)
except UnicodeEncodeError as uee:
print "\t'?' = char.decode('latin_1') {0} repr(char)={1} ERROR PRINTING: {2}".format(type(x), repr(x), uee)
try:
x = char.decode('utf_8')
print "\t'" + x + "' = char.decode('utf_8') {1} repr(char.decode('utf_8'))={2}".format(x, type(x), repr(x))
except UnicodeDecodeError as ude:
print "\t'?' = char.decode('utf_8') ERROR: {0}".format(ude)
except UnicodeEncodeError as uee:
print "\t'?' = char.decode('utf_8') {0} repr(char)={1} ERROR PRINTING: {2}".format(type(x), repr(x), uee)
print
x = 'ó'
encodingDemo(x)
Much thanks for the answers below and especially to #John Machin for answering so thoroughly.

I'm still ignorant of whether there is a way to correctly convert 'ó' from latin-1 to utf-8 and not mangle it
repr() and unicodedata.name() are your friends when it comes to debugging such problems:
>>> oacute_latin1 = "\xF3"
>>> oacute_unicode = oacute_latin1.decode('latin1')
>>> oacute_utf8 = oacute_unicode.encode('utf8')
>>> print repr(oacute_latin1)
'\xf3'
>>> print repr(oacute_unicode)
u'\xf3'
>>> import unicodedata
>>> unicodedata.name(oacute_unicode)
'LATIN SMALL LETTER O WITH ACUTE'
>>> print repr(oacute_utf8)
'\xc3\xb3'
>>>
If you send oacute_utf8 to a terminal that is set up for latin1, you will get A-tilde followed by superscript-3.
I switched to Unicode strings.
What are you calling Unicode strings? UTF-16?
What gives? After reading this, describing exactly the same situation I'm in, it seems as if the advice is to ignore the other advice and use 8-bit bytestrings after all.
I can't imagine how it seems so to you. The story that was being conveyed was that unicode objects in Python and UTF-8 encoding in the database were the way to go. However Martin answered the original question, giving a method ("text factory") for the OP to be able to use latin1 -- this did NOT constitute a recommendation!
Update in response to these further questions raised in a comment:
I didn't understand that the unicode characters still contained an implicit encoding. Am I saying that right?
No. An encoding is a mapping between Unicode and something else, and vice versa. A Unicode character doesn't have an encoding, implicit or otherwise.
It looks to me like unicode("\xF3") and "\xF3".decode('latin1') are the same when evaluated with repr().
Say what? It doesn't look like it to me:
>>> unicode("\xF3")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf3 in position 0: ordinal
not in range(128)
>>> "\xF3".decode('latin1')
u'\xf3'
>>>
Perhaps you meant: u'\xf3' == '\xF3'.decode('latin1') ... this is certainly true.
It is also true that unicode(str_object, encoding) does the same as str_object.decode(encoding) ... including blowing up when an inappropriate encoding is supplied.
Is that a happy circumstance
That the first 256 characters in Unicode are the same, code for code, as the 256 characters in latin1 is a good idea. Because all 256 possible latin1 characters are mapped to Unicode, it means that ANY 8-bit byte, ANY Python str object can be decoded into unicode without an exception being raised. This is as it should be.
However there exist certain persons who confuse two quite separate concepts: "my script runs to completion without any exceptions being raised" and "my script is error-free". To them, latin1 is "a snare and a delusion".
In other words, if you have a file that's actually encoded in cp1252 or gbk or koi8-u or whatever and you decode it using latin1, the resulting Unicode will be utter rubbish and Python (or any other language) will not flag an error -- it has no way of knowing that you have commited a silliness.
or is unicode("str") going to always return the correct decoding?
Just like that, with the default encoding being ascii, it will return the correct unicode if the file is actually encoded in ASCII. Otherwise, it'll blow up.
Similarly, if you specify the correct encoding, or one that's a superset of the correct encoding, you'll get the correct result. Otherwise you'll get gibberish or an exception.
In short: the answer is no.
If not, when I receive a python str that has any possible character set in it, how do I know how to decode it?
If the str object is a valid XML document, it will be specified up front. Default is UTF-8.
If it's a properly constructed web page, it should be specified up front (look for "charset"). Unfortunately many writers of web pages lie through their teeth (ISO-8859-1 aka latin1, should be Windows-1252 aka cp1252; don't waste resources trying to decode gb2312, use gbk instead). You can get clues from the nationality/language of the website.
UTF-8 is always worth trying. If the data is ascii, it'll work fine, because ascii is a subset of utf8. A string of text that has been written using non-ascii characters and has been encoded in an encoding other than utf8 will almost certainly fail with an exception if you try to decode it as utf8.
All of the above heuristics and more and a lot of statistics are encapsulated in chardet, a module for guessing the encoding of arbitrary files. It usually works well. However you can't make software idiot-proof. For example, if you concatenate data files written some with encoding A and some with encoding B, and feed the result to chardet, the answer is likely to be encoding C with a reduced level of confidence e.g. 0.8. Always check the confidence part of the answer.
If all else fails:
(1) Try asking here, with a small sample from the front of your data ... print repr(your_data[:400]) ... and whatever collateral info about its provenance that you have.
(2) Recent Russian research into techniques for recovering forgotten passwords appears to be quite applicable to deducing unknown encodings.
Update 2 BTW, isn't it about time you opened up another question ?-)
One more thing: there are apparently characters that Windows uses as Unicode for certain characters that aren't the correct Unicode for that character, so you may have to map those characters to the correct ones if you want to use them in other programs that are expecting those characters in the right spot.
It's not Windows that's doing it; it's a bunch of crazy application developers. You might have more understandably not paraphrased but quoted the opening paragraph of the effbot article that you referred to:
Some applications add CP1252 (Windows, Western Europe) characters to documents marked up as ISO 8859-1 (Latin 1) or other encodings. These characters are not valid ISO-8859-1 characters, and may cause all sorts of problems in processing and display applications.
Background:
The range U+0000 to U+001F inclusive is designated in Unicode as "C0 Control Characters". These exist also in ASCII and latin1, with the same meanings. They include such familar things as carriage return, line feed, bell, backspace, tab, and others that are used rarely.
The range U+0080 to U+009F inclusive is designated in Unicode as "C1 Control Characters". These exist also in latin1, and include 32 characters that nobody outside unicode.org can imagine any possible use for.
Consequently, if you run a character frequency count on your unicode or latin1 data, and you find any characters in that range, your data is corrupt. There is no universal solution; it depends on how it became corrupted. The characters may have the same meaning as the cp1252 characters at the same positions, and thus the effbot's solution will work. In another case that I've been looking at recently, the dodgy characters appear to have been caused by concatenating text files encoded in UTF-8 and another encoding which needed to be deduced based on letter frequencies in the (human) language the files were written in.

UTF-8 is the default encoding of SQLite databases. This shows up in situations like "SELECT CAST(x'52C3B373' AS TEXT);". However, the SQLite C library doesn't actually check whether a string inserted into a DB is valid UTF-8.
If you insert a Python unicode object (or str object in 3.x), the Python sqlite3 library will automatically convert it to UTF-8. But if you insert a str object, it will just assume the string is UTF-8, because Python 2.x "str" doesn't know its encoding. This is one reason to prefer Unicode strings.
However, it doesn't help you if your data is broken to begin with.
To fix your data, do
db.create_function('FIXENCODING', 1, lambda s: str(s).decode('latin-1'))
db.execute("UPDATE TheTable SET TextColumn=FIXENCODING(CAST(TextColumn AS BLOB))")
for every text column in your database.

I fixed this pysqlite problem by setting:
conn.text_factory = lambda x: unicode(x, 'utf-8', 'ignore')
By default text_factory is set to unicode(), which will use the current default encoding (ascii on my machine)

Of course there is. But your data is already broken in the database, so you'll need to fix it:
>>> print u'Sigur RÃ³s'.encode('latin-1').decode('utf-8')
Sigur Rós

My unicode problems with Python 2.x (Python 2.7.6 to be specific) fixed this:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
It also solved the error you are mentioning right at the beginning of the post:
sqlite3.ProgrammingError: You must not use 8-bit bytestrings unless
...
EDIT
sys.setdefaultencoding is a dirty hack. Yes, it can solve UTF-8 issues, but everything comes with a price. For more details refer to following links:
Why sys.setdefaultencoding() will break code
Why we need sys.setdefaultencoding(“utf-8”) in a py script?

Python Unicode strings and the Python interactive interpreter

I'm trying to understand how python 2.5 deals with unicode strings. Although by now I think I have a good grasp of how I'm supposed to handle them in code, I don't fully understand what's going on behind the scenes, particularly when you type strings at the interpreter's prompt.
So python pre 3.0 has two types for strings, namely: str (byte strings) and unicode, which are both derived from basestring. The default type for strings is str.
str objects have no notion of their actual encoding, they are just bytes. Either you've encoded a unicode string yourself and therefore know what encoding they are in, or you've read a stream of bytes whose encoding you also know beforehand (indeally). You can guess the encoding of a byte string whose encoding is unknown to you, but there just isn't a reliable way of figuring this out. Your best bet is to decode early, use unicode everywhere in your code and encode late.
That's fine. But strings typed into the interpreter are indeed encoded for you behind your back? Provided that my understanding of strings in Python is correct, what's the method/setting python uses to make this decision?
The source of my confusion is the differing results I get when I try the same thing on my system's python installation, and on my editor's embedded python console.
# Editor (Sublime Text)
>>> s = "La caña de España"
>>> s
'La ca\xc3\xb1a de Espa\xc3\xb1a'
>>> s.decode("utf-8")
u'La ca\xf1a de Espa\xf1a'
>>> sys.getdefaultencoding()
'ascii'
# Windows python interpreter
>>> s= "La caña de España"
>>> s
'La ca\xa4a de Espa\xa4a'
>>> s.decode("utf-8")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python25\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa4 in position 5: unexpected code byte
>>> sys.getdefaultencoding()
'ascii'

Let me expand Ignacio's reply: In both cases there is an extra layer between Python and you: in one case it is Sublime Text and in the other it's cmd.exe. The difference in behaviour you see is not due to Python but by the different encodings used by Sublime Text (utf-8, as it seems) and cmd.exe (cp437).
So, when you type ñ, Sublime Text sends '\xc3\xb1' to Python, whereas cmd.exe sends \xa4. [I'm simplyfing here, omitting details that are not relevant to the question.].
Still, Python knows about that. From cmd.exe you'll probably get something like:
>>> import sys
>>> sys.stdin.encoding
'cp437'
whereas within Sublime Text you'll get something like
>>> import sys
>>> sys.stdin.encoding
'utf-8'

The interpreter uses your command prompt's native encoding for text entry. In your case it's CP437:
>>> print '\xa4'.decode('cp437')
ñ

You're getting confused because the editor and the interpreter are using different encodings themselves. The python interpreter uses your system default (in this case, cp437), while your editor uses utf-8.
Note, the difference disappears if you specify a unicode string, like so:
# Windows python interpreter
>>> s = "La caña de España"
>>> s
'La ca\xa4a de Espa\xa4a'
>>> s = u"La caña de España"
>>> s
u'La ca\xf1a de Espa\xf1a'
The moral of the story? Encodings are tricky. Be sure you know what encoding your source files are in, or play it safe by always using the escaped version of special characters.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python Unicode in and out of IDE - python

Related

What is the most Pythonic way to handle unicode strings in code that runs in both 2.x and 3.x

Is there an easy way to make unicode work in python?

Printing escaped Unicode in Python

SQLite, python, unicode, and non-utf data

Python Unicode strings and the Python interactive interpreter

Categories

Resources