Latin-1 and the unicode factory in Python - python

I have a Python 2.6 script that is gagging on special characters, encoded in Latin-1, that I am retrieving from a SQL Server database. I would like to print these characters, but I'm somewhat limited because I am using a library that calls the unicode factory, and I don't know how to make Python use a codec other than ascii.
The script is a simple tool to return lookup data from a database without having to execute the SQL directly in a SQL editor. I use the PrettyTable 0.5 library to display the results.
The core of the script is this bit of code. The tuples I get from the cursor contain integer and string data, and no Unicode data. (I'd use adodbapi instead of pyodbc, which would get me Unicode, but adodbapi gives me other problems.)
x = pyodbc.connect(cxnstring)
r = x.cursor()
r.execute(sql)
t = PrettyTable(columns)
for rec in r:
    t.add_row(rec)
r.close()
x.close()
t.set_field_align("ID", 'r')
t.set_field_align("Name", 'l')
print t
But the Name column can contain characters that fall outside the ASCII range. I'll sometimes get an error message like this, in line 222 of prettytable.pyc, when it gets to the t.add_row call:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xed in position 12: ordinal not in range(128)
This is line 222 in prettytable.py. It uses unicode, which is the source of my problems, and not just in this script, but in other Python scripts that I have written.
for i in range(0,len(row)):
    if len(unicode(row[i])) > self.widths[i]: # This is line 222
        self.widths[i] = len(unicode(row[i]))
Please tell me what I'm doing wrong here. How can I make unicode work without hacking prettytable.py or any of the other libraries that I use? Is there even a way to do this?
EDIT: The error occurs not at the print statement, but at the t.add_row call.
EDIT: With Bastien Léonard's help, I came up with the following solution. It's not a panacea, but it works.
x = pyodbc.connect(cxnstring)
r = x.cursor()
r.execute(sql)
t = PrettyTable(columns)
for rec in r:
    urec = [s.decode('latin-1') if isinstance(s, str) else s for s in rec]
    t.add_row(urec)
r.close()
x.close()
t.set_field_align("ID", 'r')
t.set_field_align("Name", 'l')
print t.get_string().encode('latin-1')
I ended up having to decode on the way in and encode on the way out. All of this makes me hopeful that everybody ports their libraries to Python 3.x sooner rather than later!
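For anyone landing here later, the decode-on-the-way-in step can be pulled out into a small helper. This is only a sketch: decode_row is a hypothetical name, not part of pyodbc or PrettyTable, and it assumes every byte string in the row really is Latin-1. It sticks to b'' and u'' literals so it should run on Python 2.6+ as well as Python 3 (where bytes plays the role of Python 2's str).

```python
def decode_row(row, encoding='latin-1'):
    """Return a list copy of row with byte strings decoded to unicode.

    Hypothetical helper: non-string values such as integers pass through
    untouched, which mirrors the isinstance check in the solution above.
    """
    return [value.decode(encoding) if isinstance(value, bytes) else value
            for value in row]


# A made-up record shaped like the (ID, Name) tuples in the question.
record = (42, b'Mar\xeda')
decoded = decode_row(record)
```

The decoded list can then be handed to t.add_row() in place of the raw record.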

Add this at the beginning of the module:
# coding: latin1
Or decode the string to Unicode yourself.
[Edit]
It's been a while since I played with Unicode, but hopefully this example will show how to convert from Latin1 to Unicode:
>>> s = u'ééé'.encode('latin1') # a string you may get from the database
>>> s.decode('latin1')
u'\xe9\xe9\xe9'
[Edit]
Documentation:
http://docs.python.org/howto/unicode.html
http://docs.python.org/library/codecs.html

Maybe try to decode the latin1-encoded strings into unicode?
t.add_row((value.decode('latin1') for value in rec))

After a quick peek at the source for PrettyTable, it appears that it works on unicode objects internally (see _stringify_row, add_row and add_column, for example). Since it doesn't know what encoding your input strings are using, it uses the default encoding, usually ascii.
Now ascii is a subset of latin-1, which means if you're converting from ascii to latin-1, you shouldn't have any problems. The reverse however, isn't true; not all latin-1 characters map to ascii characters. To demonstrate this:
>>> s = '\xed\x31\x32\x33'
>>> print unicode(s)
# FAILS: Python calls "s.decode('ascii')", but ascii codec can't decode '\xed'
>>> print s.decode('ascii')
# FAILS: Same as above
>>> print s.decode('latin-1')
í123
Explicitly converting the strings to unicode (like you eventually did) fixes things, and makes more sense, IMO -- you're more likely to know what charset your data is using than the author of PrettyTable :). BTW, unicode(s, 'latin-1') is equivalent to s.decode('latin-1') for byte strings, but keep the check for strings in your list comprehension: unicode() with an encoding argument raises a TypeError for non-string objects such as integers.
One last thing: don't forget to check the character set of your database and tables -- you don't want to assume 'latin-1' in code, when the data is actually being stored as something else ('utf-8'?) in the database. In MySQL, you can use the SHOW CREATE TABLE <table_name> command to find out what character set a table is using, and SHOW CREATE DATABASE <db_name> to do the same for a database.

Related

Python UnicodeEncodeError when Outputting Parsed Data from a Webpage

I have a program that parses webpages and then writes the data out somewhere else. When I am writing the data, I get
"UnicodeEncodeError: 'ascii' codec can't encode characters in position
19-21: ordinal not in range(128)"
I am gathering the data using lxml.
name = apiTree.xpath("//boardgames/boardgame/name[@primary='true']")[0].text
worksheet.goog["Name"].append(name)
Upon reading, http://effbot.org/pyfaq/what-does-unicodeerror-ascii-decoding-encoding-error-ordinal-not-in-range-128-mean.htm, it suggests I record all of my variables in unicode. This means I need to know what encoding the site is using.
My final line that actually writes the data out somewhere is:
wks.update_cell(row + 1, worksheet.goog[value + "_col"], (str(worksheet.goog[value][row])).encode('ascii', 'ignore'))
How would I incorporate using unicode assuming the encoding is UTF-8 on the way in and I want it to be ASCII on the way out?
Your error is because of:
str(worksheet.goog[value][row])
Calling str on a unicode object implicitly encodes it with the ascii codec, which fails for non-ASCII characters. What you should be doing is encoding to utf-8:
worksheet.goog[value][row].encode("utf-8")
As for "How would I incorporate using unicode assuming the encoding is UTF-8 on the way in and I want it to be ASCII on the way out?": you can't. There is no ASCII equivalent of characters like the Latin ă, unless you want to get the closest ASCII equivalent using something like Unidecode.
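If pulling in a third-party package is not an option, a rough approximation can be had from the standard library. The sketch below is not Unidecode: it decomposes accented letters with unicodedata.normalize and then drops whatever is left outside ASCII, so characters with no decomposition simply disappear. to_ascii is a made-up name for illustration.

```python
import unicodedata


def to_ascii(text):
    """Approximate unicode text as ASCII bytes (lossy).

    NFKD splits a letter such as u'\xf1' into a base letter plus a combining
    mark; encoding with 'ignore' then discards the mark and anything else
    that has no ASCII form.
    """
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore')


pena = to_ascii(u'La Pe\xf1a')      # n with tilde becomes plain n
breve = to_ascii(u'\u0103')         # a with breve becomes plain a
```

Whether this is acceptable depends on the data: names usually survive in recognisable form, while text in non-Latin scripts is lost entirely.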
I think I may have figured my own problem out.
apiTree.xpath("//boardgames/boardgame/name[@primary='true']")[0].text
Actually defaults to unicode. So what I did was change this line to:
name = (apiTree.xpath("//boardgames/boardgame/name[@primary='true']")[0].text).encode('ascii', errors='ignore')
And I just output without changing anything:
wks.update_cell(row + 1, worksheet.goog[value + "_col"], worksheet.goog[value][row])
Due to the nature of the data, ASCII only is mostly fine. Although, I may be able to use UTF-8 and catch some extra characters...but this is not relevant to the question.
:)

UnicodeEncodeError when writing to a file

I am trying to write some strings to a file (the strings have been given to me by the HTML parser BeautifulSoup).
I can use "print" to display them, but when I use file.write() I get the following error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 6: ordinal not in range(128)
How can I parse this?
If I type 'python unicode' into Google, I get about 14 million results; the first is the official doc which describes the whole situation in excruciating detail; and the fourth is a more practical overview that will pretty much spoon-feed you an answer, and also make sure you understand what's going on.
You really do need to read and understand these sorts of overviews, however long they seem. There really isn't any getting around it. Text is hard. There is no such thing as "plain text", there hasn't been a reasonable facsimile for years, and there never really was, although we spent decades pretending there was. But Unicode is at least a standard.
You also should read http://www.joelonsoftware.com/articles/Unicode.html .
This error occurs when you pass a Unicode string containing non-English characters (Unicode characters beyond 128) to something that expects an ASCII bytestring. The default encoding for a Python bytestring is ASCII, "which handles exactly 128 (English) characters". This is why trying to convert Unicode characters beyond 128 produces the error.
The unicode() constructor has the signature unicode(string[, encoding, errors]). All of its arguments should be 8-bit strings.
The first argument is converted to Unicode using the specified encoding; if you leave off the encoding argument, the ASCII encoding is used for the conversion, so characters greater than 127 will be treated as errors.
Going the other way, encode the Unicode string explicitly before writing it, for example
s = u'La Pe\xf1a'
print s.encode('latin-1')
or
write(s.encode('latin-1'))
will encode using latin-1
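An alternative to encoding by hand before every write() is to open the file with an encoding and let the file object do it. A minimal sketch, assuming UTF-8 is an acceptable output encoding; write_text and the temporary file are only there for illustration. io.open is available from Python 2.6 and is the same function as the built-in open on Python 3.

```python
import io
import os
import tempfile


def write_text(path, text, encoding='utf-8'):
    """Write unicode text to path, encoding it on the way out."""
    # io.open encodes unicode text on write, so no manual .encode() is needed.
    with io.open(path, 'w', encoding=encoding) as handle:
        handle.write(text)


# Round trip through a temporary file to see which bytes end up on disk.
fd, demo_path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
write_text(demo_path, u'Price: \xa3100')
with io.open(demo_path, 'rb') as handle:
    raw = handle.read()
os.remove(demo_path)
```

Note that a file opened this way only accepts unicode text; passing it an already encoded byte string raises a TypeError instead of silently mixing encodings.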
The answer to your question is "use codecs". The appended code also shows some gettext magic, FWIW. http://wiki.wxpython.org/Internationalization
import codecs
import gettext
import wx
localedir = './locale'
langid = wx.LANGUAGE_DEFAULT # use OS default; or use LANGUAGE_JAPANESE, etc.
domain = "MyApp"
mylocale = wx.Locale(langid)
mylocale.AddCatalogLookupPathPrefix(localedir)
mylocale.AddCatalog(domain)
translater = gettext.translation(domain, localedir,
                                 [mylocale.GetCanonicalName()], fallback = True)
translater.install(unicode = True)
# translater.install() installs the gettext _() translater function into our namespace...
msg = _("A message that gettext will translate, probably putting Unicode in here")
# use codecs.open() to convert Unicode strings to UTF8
Logfile = codecs.open(logfile_name, 'w', encoding='utf-8')
Logfile.write(msg + '\n')
Despite Google being full of hits on this problem, I found it rather hard to find this simple solution (it is actually in the Python docs about Unicode, but rather buried).
So ... HTH...
GaJ

Converting Unicode objects with non-ASCII symbols in them into strings objects (in Python)

I want to send Chinese characters to be translated by an online service, and have the resulting English string returned. I'm using simple JSON and urllib for this.
And yes, I am declaring.
# -*- coding: utf-8 -*-
on top of my code.
Now everything works fine if I feed urllib a string type object, even if that object contains what would be Unicode information. My function is called translate.
For example:
stringtest1 = '無與倫比的美麗'
print translate(stringtest1)
results in the proper translation and doing
type(stringtest1)
confirms this to be a string object.
But if I do
stringtest1 = u'無與倫比的美麗'
and try to use my translation function I get this error:
File "C:\Python27\lib\urllib.py", line 1275, in urlencode
v = quote_plus(str(v))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-8: ordinal not in range(128)
After researching a bit, it seems this is a common problem:
Problem: neither urllib2.quote nor urllib.quote encode the unicode strings arguments
urllib.quote throws exception on Unicode URL
Now, if I type in a script
stringtest1 = '無與倫比的美麗'
stringtest2 = u'無與倫比的美麗'
print 'stringtest1',stringtest1
print 'stringtest2',stringtest2
execution of it returns:
stringtest1 無與倫比的美麗
stringtest2 無與倫比的美麗
But just typing the variables in the console:
>>> stringtest1
'\xe7\x84\xa1\xe8\x88\x87\xe5\x80\xab\xe6\xaf\x94\xe7\x9a\x84\xe7\xbe\x8e\xe9\xba\x97'
>>> stringtest2
u'\u7121\u8207\u502b\u6bd4\u7684\u7f8e\u9e97'
gets me that.
My problem is that I don't control how the information to be translated comes to my function. And it seems I have to bring it in the Unicode form, which is not accepted by the function.
So, how do I convert one thing into the other?
I've read Stack Overflow question Convert Unicode to a string in Python (containing extra symbols).
But this is not what I'm after. Urllib accepts the string object but not the Unicode object, both containing the same information.
Well, at least in the eyes of the web application I'm sending the unchanged information to; I'm not sure if they are still equivalent things in Python.
When you get a unicode object and want to return a UTF-8 encoded byte string from it, use theobject.encode('utf8').
It seems strange that you don't know whether the incoming object is a str or unicode -- surely you do control the call sites to that function, too?! But if that is indeed the case, for whatever weird reason, you may need something like:
def ensureutf8(s):
    if isinstance(s, unicode):
        s = s.encode('utf8')
    return s
which only encodes conditionally, that is, if it receives a unicode object, not if the object it receives is already a byte string. It returns a byte string in either case.
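To tie this back to the urlencode call in the traceback, here is a sketch of the same idea wired into a query-string builder. The names ensure_utf8 and build_query and the parameter name 'q' are placeholders, not anything the translation service is known to expect. The import fallback lets it run on Python 2 (urllib.urlencode) and Python 3 (urllib.parse.urlencode).

```python
try:
    from urllib import urlencode          # Python 2
except ImportError:
    from urllib.parse import urlencode    # Python 3


def ensure_utf8(value):
    """Return value as a UTF-8 byte string, whatever text type came in."""
    if isinstance(value, bytes):
        return value                      # already encoded, leave alone
    return value.encode('utf-8')


def build_query(text):
    """Build a query string; 'q' is a made-up parameter name."""
    return urlencode({'q': ensure_utf8(text)})


from_unicode = build_query(u'\u7121')
from_bytes = build_query(b'\xe7\x84\xa1')
```

Both calls produce the same percent-encoded query, which is the point: the function no longer cares which of the two types it was handed.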
BTW, part of your confusion seems to be due to the fact that you don't know that just entering an expression at the interpreter prompt will show you its repr, which is not the same effect you get with print;-).

Decode string with hex characters in python 2

I have a string containing hex escapes and I want to convert it to UTF-8 to insert it into MySQL (my database is utf8).
hex_string = 'kitap ara\xfet\xfdrmas\xfd'
...
result = 'kitap araştırması'
How can I do that?
Try (Python 3.x):
import codecs
codecs.decode("707974686f6e2d666f72756d2e696f", "hex").decode('utf-8')
From here.
Assuming Python 2.6,
>>> print('kitap ara\xfet\xfdrmas\xfd'.decode('iso-8859-9'))
kitap araştırması
>>> 'kitap ara\xfet\xfdrmas\xfd'.decode('iso-8859-9').encode('utf-8')
'kitap ara\xc5\x9ft\xc4\xb1rmas\xc4\xb1'
Try
hex_string.decode("cp1254").encode("utf-8")
(cp1254 or iso-8859-9 are the Turkish codepages, the former being the usual name on Windows platforms, but in Python, both work equally well)
First you need to decode it from the encoded bytes you have. That appears to be ISO-8859-9 (latin-5), or, if you are using Windows, probably code page 1254, which is based on latin-5.
>>> 'kitap ara\xfet\xfdrmas\xfd'.decode('cp1254')
u'kitap ara\u015ft\u0131rmas\u0131' # u'kitap araştırması'
If you are using Windows, then depending on where you are getting those bytes, it might be more appropriate to decode them as mbcs, which translates to ‘whichever code page the local system is using’. If the string is just sitting in a .py file, you would be better off just writing u'kitap araştırması' in the source and setting a -*- coding declaration to direct Python to decode it. See PEP 263.
As to how to encode unicode strings to UTF-8 for the database, well, if you want to you can do it manually:
>>> u'kitap ara\u015ft\u0131rmas\u0131'.encode('utf-8')
'kitap ara\xc5\x9ft\xc4\xb1rmas\xc4\xb1'
but a good data access layer is likely to do that automatically for you, if you've got the COLLATION of the tables the data is going into right.
String literals explains how to use UTF8 strings in Python source.

SQLite, python, unicode, and non-utf data

I started by trying to store strings in sqlite using python, and got the message:
sqlite3.ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.
Ok, I switched to Unicode strings. Then I started getting the message:
sqlite3.OperationalError: Could not decode to UTF-8 column 'tag_artist' with text 'Sigur Rós'
when trying to retrieve data from the db. More research and I started encoding it in utf8, but then 'Sigur Rós' starts looking like 'Sigur RÃ³s'
note: My console was set to display in 'latin_1' as @John Machin pointed out.
What gives? After reading this, describing exactly the same situation I'm in, it seems as if the advice is to ignore the other advice and use 8-bit bytestrings after all.
I didn't know much about unicode and utf before I started this process. I've learned quite a bit in the last couple hours, but I'm still ignorant of whether there is a way to correctly convert 'ó' from latin-1 to utf-8 and not mangle it. If there isn't, why would sqlite 'highly recommend' I switch my application to unicode strings?
I'm going to update this question with a summary and some example code of everything I've learned in the last 24 hours so that someone in my shoes can have an easy(er) guide. If the information I post is wrong or misleading in any way please tell me and I'll update, or one of you senior guys can update.
Summary of answers
Let me first state the goal as I understand it. The goal in processing various encodings, if you are trying to convert between them, is to understand what your source encoding is, then convert it to unicode using that source encoding, then convert it to your desired encoding. Unicode is a base and encodings are mappings of subsets of that base. utf_8 has room for every character in unicode, but because they aren't in the same place as, for instance, latin_1, a string encoded in utf_8 and sent to a latin_1 console will not look the way you expect. In python the process of getting to unicode and into another encoding looks like:
str.decode('source_encoding').encode('desired_encoding')
or if the str is already in unicode
str.encode('desired_encoding')
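As a concrete sketch of that two-step pattern (transcode is just an illustrative name, and it runs on Python 2.6+ and Python 3 because it only uses byte literals):

```python
def transcode(data, source_encoding, target_encoding='utf-8'):
    """Re-encode a byte string: decode to unicode first, then encode again."""
    return data.decode(source_encoding).encode(target_encoding)


latin1_bytes = b'Sigur R\xf3s'                 # 'Sigur Ros' with o-acute, in latin_1
utf8_bytes = transcode(latin1_bytes, 'latin_1')

# Reading those UTF-8 bytes back as latin_1 is what mangles the name:
mangled = utf8_bytes.decode('latin_1')
```

Nothing is wrong with utf8_bytes itself; the mangling only appears when something downstream (a console, in my case) interprets the bytes with the wrong encoding.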
For sqlite I didn't actually want to encode it again, I wanted to decode it and leave it in unicode format. Here are four things you might need to be aware of as you try to work with unicode and encodings in python.
(1) The encoding of the string you want to work with, and the encoding you want to get it to.
(2) The system encoding.
(3) The console encoding.
(4) The encoding of the source file.
Elaboration:
(1) When you read a string from a source, it must have some encoding, like latin_1 or utf_8. In my case, I'm getting strings from filenames, so unfortunately, I could be getting any kind of encoding. Windows XP uses UCS-2 (a Unicode system) as its native string type, which seems like cheating to me. Fortunately for me, the characters in most filenames are not going to be made up of more than one source encoding type, and I think all of mine were either completely latin_1, completely utf_8, or just plain ascii (which is a subset of both of those). So I just read them and decoded them as if they were still in latin_1 or utf_8. It's possible, though, that you could have latin_1 and utf_8 and whatever other characters mixed together in a filename on Windows. Sometimes those characters can show up as boxes, other times they just look mangled, and other times they look correct (accented characters and whatnot). Moving on.
(2) Python has a default system encoding that gets set when python starts and can't be changed during runtime. See here for details. Dirty summary ... well here's the file I added:
# sitecustomize.py
# this file can be anywhere in your Python path,
# but it usually goes in ${pythondir}/lib/site-packages/
import sys
sys.setdefaultencoding('utf_8')
This system encoding is the one that gets used when you use the unicode("str") function without any other encoding parameters. To say that another way, python tries to decode "str" to unicode based on the default system encoding.
(3) If you're using IDLE or the command-line python, I think that your console will display according to the default system encoding. I am using pydev with eclipse for some reason, so I had to go into my project settings, edit the launch configuration properties of my test script, go to the Common tab, and change the console from latin-1 to utf-8 so that I could visually confirm what I was doing was working.
(4) If you want to have some test strings, eg
test_str = "ó"
in your source code, then you will have to tell python what kind of encoding you are using in that file. (FYI: when I mistyped an encoding I had to ctrl-Z because my file became unreadable.) This is easily accomplished by putting a line like so at the top of your source code file:
# -*- coding: utf_8 -*-
If you don't have this information, python attempts to parse your code as ascii by default, and so:
SyntaxError: Non-ASCII character '\xf3' in file _redacted_ on line 81, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
Once your program is working correctly, or, if you aren't using python's console or any other console to look at output, then you will probably really only care about #1 on the list. System default and console encoding are not that important unless you need to look at output and/or you are using the builtin unicode() function (without any encoding parameters) instead of the string.decode() function. I wrote a demo function I will paste into the bottom of this gigantic mess that I hope correctly demonstrates the items in my list. Here is some of the output when I run the character 'ó' through the demo function, showing how various methods react to the character as input. My system encoding and console output are both set to utf_8 for this run:
'�' = original char <type 'str'> repr(char)='\xf3'
'?' = unicode(char) ERROR: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data
'ó' = char.decode('latin_1') <type 'unicode'> repr(char.decode('latin_1'))=u'\xf3'
'?' = char.decode('utf_8') ERROR: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data
Now I will change the system and console encoding to latin_1, and I get this output for the same input:
'ó' = original char <type 'str'> repr(char)='\xf3'
'ó' = unicode(char) <type 'unicode'> repr(unicode(char))=u'\xf3'
'ó' = char.decode('latin_1') <type 'unicode'> repr(char.decode('latin_1'))=u'\xf3'
'?' = char.decode('utf_8') ERROR: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data
Notice that the 'original' character displays correctly and the builtin unicode() function works now.
Now I change my console output back to utf_8.
'�' = original char <type 'str'> repr(char)='\xf3'
'�' = unicode(char) <type 'unicode'> repr(unicode(char))=u'\xf3'
'�' = char.decode('latin_1') <type 'unicode'> repr(char.decode('latin_1'))=u'\xf3'
'?' = char.decode('utf_8') ERROR: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data
Here everything still works the same as last time but the console can't display the output correctly. Etc. The function below also displays more information than this and hopefully would help someone figure out where the gap in their understanding is. I know all this information is in other places and more thoroughly dealt with there, but I hope that this would be a good kickoff point for someone trying to get coding with python and/or sqlite. Ideas are great but sometimes source code can save you a day or two of trying to figure out what functions do what.
Disclaimers: I'm no encoding expert, I put this together to help my own understanding. I kept building on it when I should have probably started passing functions as arguments to avoid so much redundant code, so if I can I'll make it more concise. Also, utf_8 and latin_1 are by no means the only encoding schemes, they are just the two I was playing around with because I think they handle everything I need. Add your own encoding schemes to the demo function and test your own input.
One more thing: there are apparently crazy application developers making life difficult in Windows.
#!/usr/bin/env python
# -*- coding: utf_8 -*-
import os
import sys
def encodingDemo(str):
    validStrings = ()
    try:
        print "str =",str,"{0} repr(str) = {1}".format(type(str), repr(str))
        validStrings += ((str,""),)
    except UnicodeEncodeError as ude:
        print "Couldn't print the str itself because the console is set to an encoding that doesn't understand some character in the string. See error:\n\t",
        print ude
    try:
        x = unicode(str)
        print "unicode(str) = ",x
        validStrings+= ((x, " decoded into unicode by the default system encoding"),)
    except UnicodeDecodeError as ude:
        print "ERROR. unicode(str) couldn't decode the string because the system encoding is set to an encoding that doesn't understand some character in the string."
        print "\tThe system encoding is set to {0}. See error:\n\t".format(sys.getdefaultencoding()),
        print ude
    except UnicodeEncodeError as uee:
        print "ERROR. Couldn't print the unicode(str) because the console is set to an encoding that doesn't understand some character in the string. See error:\n\t",
        print uee
    try:
        x = str.decode('latin_1')
        print "str.decode('latin_1') =",x
        validStrings+= ((x, " decoded with latin_1 into unicode"),)
        try:
            print "str.decode('latin_1').encode('utf_8') =",str.decode('latin_1').encode('utf_8')
            validStrings+= ((x, " decoded with latin_1 into unicode and encoded into utf_8"),)
        except UnicodeDecodeError as ude:
            print "The string was decoded into unicode using the latin_1 encoding, but couldn't be encoded into utf_8. See error:\n\t",
            print ude
    except UnicodeDecodeError as ude:
        print "Something didn't work, probably because the string wasn't latin_1 encoded. See error:\n\t",
        print ude
    except UnicodeEncodeError as uee:
        print "ERROR. Couldn't print the str.decode('latin_1') because the console is set to an encoding that doesn't understand some character in the string. See error:\n\t",
        print uee
    try:
        x = str.decode('utf_8')
        print "str.decode('utf_8') =",x
        validStrings+= ((x, " decoded with utf_8 into unicode"),)
        try:
            print "str.decode('utf_8').encode('latin_1') =",str.decode('utf_8').encode('latin_1')
        except UnicodeDecodeError as ude:
            print "str.decode('utf_8').encode('latin_1') didn't work. The string was decoded into unicode using the utf_8 encoding, but couldn't be encoded into latin_1. See error:\n\t",
            validStrings+= ((x, " decoded with utf_8 into unicode and encoded into latin_1"),)
            print ude
    except UnicodeDecodeError as ude:
        print "str.decode('utf_8') didn't work, probably because the string wasn't utf_8 encoded. See error:\n\t",
        print ude
    except UnicodeEncodeError as uee:
        print "ERROR. Couldn't print the str.decode('utf_8') because the console is set to an encoding that doesn't understand some character in the string. See error:\n\t",uee
    print
    print "Printing information about each character in the original string."
    for char in str:
        try:
            print "\t'" + char + "' = original char {0} repr(char)={1}".format(type(char), repr(char))
        except UnicodeDecodeError as ude:
            print "\t'?' = original char {0} repr(char)={1} ERROR PRINTING: {2}".format(type(char), repr(char), ude)
        except UnicodeEncodeError as uee:
            print "\t'?' = original char {0} repr(char)={1} ERROR PRINTING: {2}".format(type(char), repr(char), uee)
            print uee
        try:
            x = unicode(char)
            print "\t'" + x + "' = unicode(char) {1} repr(unicode(char))={2}".format(x, type(x), repr(x))
        except UnicodeDecodeError as ude:
            print "\t'?' = unicode(char) ERROR: {0}".format(ude)
        except UnicodeEncodeError as uee:
            print "\t'?' = unicode(char) {0} repr(char)={1} ERROR PRINTING: {2}".format(type(x), repr(x), uee)
        try:
            x = char.decode('latin_1')
            print "\t'" + x + "' = char.decode('latin_1') {1} repr(char.decode('latin_1'))={2}".format(x, type(x), repr(x))
        except UnicodeDecodeError as ude:
            print "\t'?' = char.decode('latin_1') ERROR: {0}".format(ude)
        except UnicodeEncodeError as uee:
            print "\t'?' = char.decode('latin_1') {0} repr(char)={1} ERROR PRINTING: {2}".format(type(x), repr(x), uee)
        try:
            x = char.decode('utf_8')
            print "\t'" + x + "' = char.decode('utf_8') {1} repr(char.decode('utf_8'))={2}".format(x, type(x), repr(x))
        except UnicodeDecodeError as ude:
            print "\t'?' = char.decode('utf_8') ERROR: {0}".format(ude)
        except UnicodeEncodeError as uee:
            print "\t'?' = char.decode('utf_8') {0} repr(char)={1} ERROR PRINTING: {2}".format(type(x), repr(x), uee)
        print
x = 'ó'
encodingDemo(x)
Much thanks for the answers below and especially to @John Machin for answering so thoroughly.
I'm still ignorant of whether there is a way to correctly convert 'ó' from latin-1 to utf-8 and not mangle it
repr() and unicodedata.name() are your friends when it comes to debugging such problems:
>>> oacute_latin1 = "\xF3"
>>> oacute_unicode = oacute_latin1.decode('latin1')
>>> oacute_utf8 = oacute_unicode.encode('utf8')
>>> print repr(oacute_latin1)
'\xf3'
>>> print repr(oacute_unicode)
u'\xf3'
>>> import unicodedata
>>> unicodedata.name(oacute_unicode)
'LATIN SMALL LETTER O WITH ACUTE'
>>> print repr(oacute_utf8)
'\xc3\xb3'
>>>
If you send oacute_utf8 to a terminal that is set up for latin1, you will get A-tilde followed by superscript-3.
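The same mix-up can be reproduced without a terminal by decoding the UTF-8 bytes as latin1 on purpose. A small sketch (variable names are illustrative; it runs on Python 2.6+ and Python 3):

```python
import unicodedata

oacute_utf8 = u'\xf3'.encode('utf8')           # the two bytes C3 B3

# What a latin1 terminal effectively does with those two bytes:
misread = oacute_utf8.decode('latin1')
misread_names = [unicodedata.name(ch) for ch in misread]

# As long as nothing was lost, the damage can be undone by reversing the steps.
recovered = misread.encode('latin1').decode('utf8')
```

misread_names spells out the two characters mentioned above, and recovered is the original single character again.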
I switched to Unicode strings.
What are you calling Unicode strings? UTF-16?
What gives? After reading this, describing exactly the same situation I'm in, it seems as if the advice is to ignore the other advice and use 8-bit bytestrings after all.
I can't imagine how it seems so to you. The story that was being conveyed was that unicode objects in Python and UTF-8 encoding in the database were the way to go. However Martin answered the original question, giving a method ("text factory") for the OP to be able to use latin1 -- this did NOT constitute a recommendation!
Update in response to these further questions raised in a comment:
I didn't understand that the unicode characters still contained an implicit encoding. Am I saying that right?
No. An encoding is a mapping between Unicode and something else, and vice versa. A Unicode character doesn't have an encoding, implicit or otherwise.
It looks to me like unicode("\xF3") and "\xF3".decode('latin1') are the same when evaluated with repr().
Say what? It doesn't look like it to me:
>>> unicode("\xF3")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf3 in position 0: ordinal not in range(128)
>>> "\xF3".decode('latin1')
u'\xf3'
>>>
Perhaps you meant: u'\xf3' == '\xF3'.decode('latin1') ... this is certainly true.
It is also true that unicode(str_object, encoding) does the same as str_object.decode(encoding) ... including blowing up when an inappropriate encoding is supplied.
Is that a happy circumstance
That the first 256 characters in Unicode are the same, code for code, as the 256 characters in latin1 is a good idea. Because all 256 possible latin1 characters are mapped to Unicode, it means that ANY 8-bit byte, ANY Python str object can be decoded into unicode without an exception being raised. This is as it should be.
However there exist certain persons who confuse two quite separate concepts: "my script runs to completion without any exceptions being raised" and "my script is error-free". To them, latin1 is "a snare and a delusion".
In other words, if you have a file that's actually encoded in cp1252 or gbk or koi8-u or whatever and you decode it using latin1, the resulting Unicode will be utter rubbish and Python (or any other language) will not flag an error -- it has no way of knowing that you have committed a silliness.
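A quick sketch of both halves of that point, using the euro sign as an arbitrary example of a character that cp1252 has and latin1 does not (runs on Python 2.6+ and Python 3):

```python
# Every possible byte value decodes under latin1, each to the code point
# with the same number, so decoding can never raise.
all_bytes = bytes(bytearray(range(256)))
decoded = all_bytes.decode('latin1')
code_points = [ord(ch) for ch in decoded]

# Bytes that were really cp1252 also decode "successfully" as latin1,
# but the result is a C1 control character, not the euro sign.
euro_cp1252 = u'\u20ac'.encode('cp1252')       # a single byte, 0x80
wrong = euro_cp1252.decode('latin1')
right = euro_cp1252.decode('cp1252')
```

No exception anywhere, yet wrong and right are different characters.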
or is unicode("str") going to always return the correct decoding?
Just like that, with the default encoding being ascii, it will return the correct unicode if the file is actually encoded in ASCII. Otherwise, it'll blow up.
Similarly, if you specify the correct encoding, or one that's a superset of the correct encoding, you'll get the correct result. Otherwise you'll get gibberish or an exception.
In short: the answer is no.
If not, when I receive a python str that has any possible character set in it, how do I know how to decode it?
If the str object is a valid XML document, it will be specified up front. Default is UTF-8.
If it's a properly constructed web page, it should be specified up front (look for "charset"). Unfortunately many writers of web pages lie through their teeth (ISO-8859-1 aka latin1, should be Windows-1252 aka cp1252; don't waste resources trying to decode gb2312, use gbk instead). You can get clues from the nationality/language of the website.
UTF-8 is always worth trying. If the data is ascii, it'll work fine, because ascii is a subset of utf8. A string of text that has been written using non-ascii characters and has been encoded in an encoding other than utf8 will almost certainly fail with an exception if you try to decode it as utf8.
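A minimal sketch of the try-UTF-8-first idea, with cp1252 as an assumed fallback for Western European data. The function name and the fallback order are my choices for illustration, not a general-purpose detector; with latin1 last in the list the loop always returns, for the reasons given earlier.

```python
def decode_guess(data, fallbacks=('cp1252', 'latin1')):
    """Decode bytes as UTF-8 if possible, else try the fallbacks in order.

    Returns a (text, encoding_used) pair so the caller can see what happened.
    """
    try:
        return data.decode('utf8'), 'utf8'
    except UnicodeDecodeError:
        pass
    for encoding in fallbacks:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError('none of the candidate encodings fit the data')


from_utf8 = decode_guess(b'Sigur R\xc3\xb3s')
from_legacy = decode_guess(b'Sigur R\xf3s')
plain_ascii = decode_guess(b'Sigur Ros')
```

A fallback that "works" only means no exception was raised; whether the text is right still has to be checked by eye or by other evidence.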
All of the above heuristics and more and a lot of statistics are encapsulated in chardet, a module for guessing the encoding of arbitrary files. It usually works well. However you can't make software idiot-proof. For example, if you concatenate data files written some with encoding A and some with encoding B, and feed the result to chardet, the answer is likely to be encoding C with a reduced level of confidence e.g. 0.8. Always check the confidence part of the answer.
If all else fails:
(1) Try asking here, with a small sample from the front of your data ... print repr(your_data[:400]) ... and whatever collateral info about its provenance that you have.
(2) Recent Russian research into techniques for recovering forgotten passwords appears to be quite applicable to deducing unknown encodings.
Update 2 BTW, isn't it about time you opened up another question ?-)
One more thing: there are apparently characters that Windows uses as Unicode for certain characters that aren't the correct Unicode for that character, so you may have to map those characters to the correct ones if you want to use them in other programs that are expecting those characters in the right spot.
It's not Windows that's doing it; it's a bunch of crazy application developers. It would have been clearer if you had quoted, rather than paraphrased, the opening paragraph of the effbot article that you referred to:
Some applications add CP1252 (Windows, Western Europe) characters to documents marked up as ISO 8859-1 (Latin 1) or other encodings. These characters are not valid ISO-8859-1 characters, and may cause all sorts of problems in processing and display applications.
Background:
The range U+0000 to U+001F inclusive is designated in Unicode as "C0 Control Characters". These exist also in ASCII and latin1, with the same meanings. They include such familiar things as carriage return, line feed, bell, backspace, tab, and others that are used rarely.
The range U+0080 to U+009F inclusive is designated in Unicode as "C1 Control Characters". These exist also in latin1, and include 32 characters that nobody outside unicode.org can imagine any possible use for.
Consequently, if you run a character frequency count on your unicode or latin1 data, and you find any characters in that range, your data is corrupt. There is no universal solution; it depends on how it became corrupted. The characters may have the same meaning as the cp1252 characters at the same positions, and thus the effbot's solution will work. In another case that I've been looking at recently, the dodgy characters appear to have been caused by concatenating text files encoded in UTF-8 and another encoding which needed to be deduced based on letter frequencies in the (human) language the files were written in.
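A rough sketch of that frequency check, plus the cp1252 reinterpretation for the case where it applies. Both function names are hypothetical, and the repair is only valid if the stray characters really are cp1252 bytes that were decoded as latin1:

```python
from collections import Counter


def count_c1_controls(text):
    """Count characters in the C1 range U+0080..U+009F found in `text`."""
    return Counter(ch for ch in text if u"\u0080" <= ch <= u"\u009f")


def c1_to_cp1252(text):
    """Reinterpret C1 controls as the cp1252 characters at the same byte values."""
    out = []
    for ch in text:
        if u"\u0080" <= ch <= u"\u009f":
            try:
                ch = ch.encode("latin-1").decode("cp1252")
            except UnicodeDecodeError:
                # 0x81, 0x8D, 0x8F, 0x90 and 0x9D are undefined in Python's
                # cp1252 codec; leave those characters untouched.
                pass
        out.append(ch)
    return u"".join(out)
```

A non-empty result from `count_c1_controls` is the signal to investigate; whether `c1_to_cp1252` is the right repair still depends on how the data got corrupted.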
UTF-8 is the default encoding of SQLite databases. This shows up in situations like "SELECT CAST(x'52C3B373' AS TEXT);". However, the SQLite C library doesn't actually check whether a string inserted into a DB is valid UTF-8.
If you insert a Python unicode object (or str object in 3.x), the Python sqlite3 library will automatically convert it to UTF-8. But if you insert a str object, it will just assume the string is UTF-8, because Python 2.x "str" doesn't know its encoding. This is one reason to prefer Unicode strings.
However, it doesn't help you if your data is broken to begin with.
To fix your data, do
db.create_function('FIXENCODING', 1, lambda s: str(s).decode('latin-1'))
db.execute("UPDATE TheTable SET TextColumn=FIXENCODING(CAST(TextColumn AS BLOB))")
for every text column in your database.
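For reference, here is a self-contained sketch of the same repair in Python 3 syntax, using an in-memory database. It assumes the column holds Latin-1 bytes that were stored as TEXT; the sample row is made up, and the `CAST(? AS TEXT)` insert is just a way to simulate the damage:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE TheTable (TextColumn TEXT)")

# Simulate the broken data: Latin-1 bytes stored as if they were UTF-8 text.
db.execute("INSERT INTO TheTable VALUES (CAST(? AS TEXT))", (b"Sigur R\xf3s",))

# In Python 3 a BLOB argument reaches the callback as bytes, so the str()
# call from the Python 2 version is not needed.
db.create_function("FIXENCODING", 1, lambda b: b.decode("latin-1"))
db.execute("UPDATE TheTable SET TextColumn = FIXENCODING(CAST(TextColumn AS BLOB))")

fixed = db.execute("SELECT TextColumn FROM TheTable").fetchone()[0]
```

Two caveats with this sketch: the lambda does not handle NULL values, and running it on rows that already contain valid UTF-8 would garble them, so restrict the UPDATE to rows known to be broken.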
I fixed this pysqlite problem by setting:
conn.text_factory = lambda x: unicode(x, 'utf-8', 'ignore')
By default text_factory is set to unicode(), which will use the current default encoding (ascii on my machine)
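A small sketch of what a lenient text_factory does to invalid bytes, in Python 3 syntax (where the factory receives bytes). The table `t`, the column `name` and the sample row are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (name TEXT)")
# Store bytes that are not valid UTF-8 in a TEXT column.
conn.execute("INSERT INTO t VALUES (CAST(? AS TEXT))", (b"Sigur R\xf3s",))

# "ignore" silently drops the undecodable byte.
conn.text_factory = lambda b: b.decode("utf-8", "ignore")
ignored = conn.execute("SELECT name FROM t").fetchone()[0]

# "replace" keeps a visible U+FFFD marker where the bad byte was.
conn.text_factory = lambda b: b.decode("utf-8", "replace")
replaced = conn.execute("SELECT name FROM t").fetchone()[0]
```

Either handler avoids the exception, but both lose the original character, so this is a way to read damaged data rather than a way to repair it.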
Of course there is. But your data is already broken in the database, so you'll need to fix it:
>>> print u'Sigur RÃ³s'.encode('latin-1').decode('utf-8')
Sigur Rós
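To see why that encode/decode round trip repairs the text, here is a short sketch in Python 3 compatible syntax. It assumes the damage came from UTF-8 bytes being misread as Latin-1, which is the usual cause of this kind of garbling:

```python
original = u"Sigur R\u00f3s"

# How the damage happens: the UTF-8 bytes are decoded as Latin-1, so the
# two-byte sequence for the accented letter turns into two separate characters.
mojibake = original.encode("utf-8").decode("latin-1")

# The repair reverses it: recover the raw bytes, then decode them as UTF-8.
repaired = mojibake.encode("latin-1").decode("utf-8")
```

If the second step raises UnicodeDecodeError, the text was not produced by this particular mix-up and a different repair is needed.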
The following fixed my Unicode problems with Python 2.x (Python 2.7.6 to be specific):
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
It also solved the error you are mentioning right at the beginning of the post:
sqlite3.ProgrammingError: You must not use 8-bit bytestrings unless
...
EDIT
sys.setdefaultencoding is a dirty hack. Yes, it can solve UTF-8 issues, but everything comes with a price. For more details, refer to the following links:
Why sys.setdefaultencoding() will break code
Why we need sys.setdefaultencoding(“utf-8”) in a py script?
