Unicode utf-8/utf-16 encoding in Python

In python:
u'\u3053\n'
Is it utf-16?
I'm not really aware of all the unicode/encoding stuff, but this type of thing is coming up in my dataset,
like if I have a=u'\u3053\n'.
print gives an exception and
decoding gives an exception.
a.encode("utf-16") > '\xff\xfeS0\n\x00'
a.encode("utf-8") > '\xe3\x81\x93\n'
print a.encode("utf-8") > πüô
print a.encode("utf-16") >  ■S0
What's going on here?

It's a unicode character that doesn't seem to be displayable in your terminal's encoding. print tries to encode the unicode object in the encoding of your terminal, and if this can't be done you get an exception.
On a terminal that can display utf-8 you get:
>>> print u'\u3053'
こ
Your terminal doesn't seem to be able to display utf-8, else at least the print a.encode("utf-8") line should produce the correct character.

You ask:
u'\u3053\n'
Is it utf-16?
The answer is no: it's unicode, not any specific encoding. utf-16 is an encoding.
To print a Unicode string effectively to your terminal, you need to find out what encoding that terminal is willing to accept and able to display. For example, the Terminal.app on my laptop is set to UTF-8 and with a rich font, so:
[Screenshot of the terminal session; source: aleax.it]
...the Hiragana letter displays correctly. On a Linux workstation I have a terminal program that keeps resetting to Latin-1, so it would mangle things somewhat like yours -- I can set it to utf-8, but it doesn't have a huge number of glyphs in the font, so it would display somewhat-useless placeholder glyphs instead.

Character U+3053 "HIRAGANA LETTER KO".
The \xff\xfe bit at the start of the UTF-16 binary format is the encoded byte order mark (U+FEFF), then "S0" is \x53\x30, then there's the \n from the original string, encoded as \n\x00. (Each of the characters has its bytes "reversed" as it's using little endian UTF-16 encoding.)
The UTF-8 form represents the same Hiragana character in three bytes, with the bit pattern as documented here.
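The byte sequences discussed above can be reproduced with a few lines. This is a minimal sketch written for Python 3 (an assumption; the session in the question is Python 2), where text literals are Unicode and encoded results are bytes. It uses the explicit utf-16-le codec plus codecs.BOM_UTF16_LE so that the result does not depend on the byte order of the machine running it.

```python
import codecs

text = "\u3053\n"  # HIRAGANA LETTER KO followed by a newline

utf8_bytes = text.encode("utf-8")
utf16le_bytes = text.encode("utf-16-le")            # little endian, no byte order mark
utf16_with_bom = codecs.BOM_UTF16_LE + utf16le_bytes

print(utf8_bytes)      # three bytes for the Hiragana letter, then the newline
print(utf16_with_bom)  # BOM, then 0x53 0x30 (shown as "S0"), then 0x0a 0x00

# Decoding with the matching codec recovers the original text;
# the generic "utf-16" codec reads the BOM to pick the byte order.
round_trip = utf16_with_bom.decode("utf-16")
```

The text itself is never "in UTF-16" or "in UTF-8"; only the byte strings produced by encode() are.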
Now, as for whether you should really have it in your data set... where is this data coming from? Is it reasonable for it to have Hiragana characters in it?

Here's the Unicode HowTo Doc for Python 2.6.2:
http://docs.python.org/howto/unicode.html
Also see the links in the Reference section of that document for other explanations, including one by Joel Spolsky.

Related

encoding issue. Replace special character

I have a dictionary that looks like this:
{ u'Samstag & Sonntag': u'Ganztags ge\xf6ffnet', u'Freitag': u'18:00 & 22:00'}
Now I'm trying to replace the \xf6 with ö ,
but trying .replace('\xf6', 'ö') returns an error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf6 in position
0: ordinal not in range(128)
How can I fix this?

Now encoding is a mine field, and I might be off on this one - please correct me if that's the case.
From what I've gathered over the years, Python 2 assumes ASCII unless you define an encoding at the top of your script, mainly because either it's compiled that way or the OS/terminal uses ASCII as its primary encoding.
With that said, what you see in your example data:
{ u'Samstag & Sonntag': u'Ganztags ge\xf6ffnet', u'Freitag': u'18:00 & 22:00'}
Is the ASCII representation of a unicode string. Somehow Python needs to tell you there's an ö in there - but it can't with ASCII because ö has no representation in the ASCII table.
But when you try to replace it using:
x.replace('\xf6', 'ö')
Here x is a unicode object, but '\xf6' and 'ö' are byte strings, so Python 2 first tries to decode them to unicode with the ASCII codec. The byte \xf6 is outside the range ASCII accepts, so that decoding raises an exception. The replacement 'ö' is also a non-ASCII byte string and would fail the same way.
Hence the "'ascii' codec can't decode byte..." message.
You can do unicode replacements like this:
a = u'Ganztags ge\xf6ffnet'
a.replace(u'\xf6', u'ö')
This will tell Python to find a unicode string, and replace it with another unicode string.
But the output data will result in the same thing in the example above, because \xf6 is ö in unicode.
What you want to do, is encode your string into something you want to use, for instance - UTF-8:
a.encode('UTF-8')
'Ganztags ge\xc3\xb6ffnet'
And define UTF-8 as your primary encoding by placing this at the top of your code:
#!/usr/bin/python
# -*- coding: UTF-8
This should in theory make your application a little easier to work with.
And you can from then on work with UTF-8 as your base model.
But there's no way that I know of to convert your representation into an ASCII ö, because there really isn't such a thing. There are just different ways Python will do this encoding magic for you to make you believe it's possible to "just write ö".
In Python3 most of the strings you encounter will either be bytes data or treated a bit differently from Python2. And for the most part it's a lot easier.
There are numerous ways to change the encoding that are not part of standard practice. But there are ways to do it.
The closest to "good" practice would be the locale:
locale.setlocale(locale.LC_ALL, 'sv_SE.UTF-8')
I also had a horrendous solution and approach to this years back, it looked something like this (it was a great bodge for me at the time):
Python - Encoding string - Swedish Letters
tl;dr:
Your code usually assumes/uses ASCII as its encoder/decoder.
ö is not a part of ASCII, therefore you'll always see \xf6 in the repr if you've somehow gotten unicode characters. Normally, if you print u'Ganztags ge\xf6ffnet' it will be shown with an ö because of automatic encoding. If you need to verify that input matches that string, compare unicode to unicode: u'ö' == u'ö'. If other systems depend on this data, encode it with something they understand: .encode('UTF-8'). But replacing \xf6 with ö changes nothing; ö just doesn't exist in ASCII, so you need to write u'ö', which results in the same data at the end.

As you are using the German language, you should be aware of non-ASCII characters. You know whether your system prefers Latin1 (Windows console and some Unixes), UTF8 (most Linux variants), or native unicode (Windows GUI).
If you can process everything as native unicode, things are cleaner and you should just accept the fact that u'ö' and u'\xf6' are the same character - the latter is simply independent of the python source file charset.
If you have to output byte strings or store them in files, you should encode them in UTF8 (can process any unicode character, but characters of code above 127 use more than 1 byte) or Latin1 (one byte per character, but only supports unicode code points below 256).
In that case just use an explicit encoding to convert your unicode strings to byte strings:
print u'Ganztags ge\xf6ffnet'.encode('Latin1') # or .encode('utf8')
should give what you expect.
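A short sketch of the points made in the answers above: '\xf6' and 'ö' are two spellings of the same character, so the replace is a no-op, and the visible difference only appears once the text is encoded to bytes. The snippet is written for Python 3 (an assumption; the question uses Python 2), where every string literal is already unicode and the u prefix is unnecessary. The variable names are illustrative only.

```python
opening_hours = {
    "Samstag & Sonntag": "Ganztags ge\xf6ffnet",
    "Freitag": "18:00 & 22:00",
}
value = opening_hours["Samstag & Sonntag"]

# Both literals denote the single code point U+00F6.
same_character = ("\xf6" == "ö")

# Replacing a character with itself leaves the text unchanged.
unchanged = value.replace("\xf6", "ö")

# The encodings differ only in the bytes they produce for ö.
as_utf8 = value.encode("utf-8")      # two bytes for ö
as_latin1 = value.encode("latin-1")  # one byte for ö

print(value)      # displays ö if the terminal encoding supports it
print(as_utf8)
print(as_latin1)
```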

Python Unicode Bug

I'm making a virtual machine in RPython using PyPy. The problem is, when I tried to add unicode support I found an unusual problem. I'll use the letter "á" in my examples.
# The char in the example is á
print len(char)
OUTPUT:
2
I understand how the letter "á" takes two bytes, hence the length of 2. But the problem is when I use this example below I am faced with the problem.
# In this example instr = "á" (including the quotes)
for char in instr:
    print hex(int(ord(char)))
OUTPUT:
0x22
0xc3
0xa1
0x22
As you can see, there are 4 numbers. The two 0x22 values are the quotes, but there is only 1 letter between the quotes and yet there are two numbers for it. My question is, some machines I tested this script on produced this output:
OUTPUT:
0x22
0xe1
0x22
Is there any way to make the output the same on both machines? The script is exactly the same on each.

The program is not being given the same input on the two machines:
In [154]: '\xe1'.decode('cp1252').encode('utf_8') == '\xc3\xa1'
Out[154]: True
When you type á in a console, you may see the glyph á, but the console is translating that into bytes. The particular bytes it translates that into depends on the encoding used by the console. On a Windows machine, that may be cp1252, while on a Unix machine it is likely to be utf-8.
So you may see the input as the same, but the console (and thus the program) receives different input.
If your program were to decode the bytes with the appropriate encoding, and then work with unicode, then both programs will operate the same after that point. If you are receiving the bytes from sys.stdin, then sys.stdin.encoding will be the encoding Python detects the console is using.
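To illustrate, here is a minimal sketch in Python 3 (an assumption; the original session is Python 2) of the two inputs the program actually receives and what happens once each is decoded with the encoding of the console it came from. The byte strings are taken from the outputs quoted in the question; the variable names are hypothetical.

```python
# What the program receives for "á" (quotes included) on each machine.
windows_bytes = b'"\xe1"'      # cp1252 console: one byte for á
unix_bytes = b'"\xc3\xa1"'     # UTF-8 console: two bytes for á

# Decode each with the encoding its console used.
windows_text = windows_bytes.decode("cp1252")
unix_text = unix_bytes.decode("utf-8")

# After decoding, both machines see the same three code points.
code_points = [hex(ord(ch)) for ch in unix_text]
print(code_points)
print(windows_text == unix_text)
```

In a real program the encoding name would come from the environment (for example sys.stdin.encoding) rather than being hard-coded.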

You have this question tagged "Python-3.x" -- is it possible that some machines are running Python 2.x, and others are running Python 3.x?
The character á is in fact U+00E1, so on a Python 3.x system, I would expect to see your second output. Since strings are Unicode in Python3 by default, len(char) will be 3 (including the quotes).
In Python 2.x, that same character in a string will be two bytes long, and (depending on your input method) will be represented in UTF-8 as \xc3\xa1. On that system, len(char) will be 4, and you would see your first output.

The issue is that you use bytestrings to work with a text data. You should use Unicode instead.
It implies that you need to know the character encoding of your input data -- There Ain't No Such Thing As Plain Text.
If you know the character encoding then it is easy to convert a bytestring to Unicode e.g.:
unicode_text = bytestring.decode(encoding)
It should resolve your initial issue.
There are also Unicode normalization forms e.g.:
import unicodedata
norm_text = unicodedata.normalize('NFC', unicode_text)
If I don't change the encoding in the program how can I output unicode characters for example?
You might mean that you have a sequence of bytes e.g., '\xc3\xa1' (two bytes) that can be interpreted as text using some character encoding e.g., it is U+00E1 Unicode codepoint in utf-8. It may be something different in a different character encoding. Please, read the link I've provided above The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
Unless by accident your terminal uses the same character encoding as data in your input file; you need to be able to convert from one character encoding to another. Otherwise the output will be corrupted e.g., instead of á you might get ├б on the screen.
In ordinary Python, you could use bytes.decode, unicode.encode methods (or codecs module directly). I don't know whether it is possible in RPython.
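The decode, normalize and re-encode steps described above might look like this in ordinary Python 3 (an assumption; whether any of it is available in RPython is not known here). The helper name to_normalized_text is hypothetical. The last lines show how the same two bytes turn into ├б when they are interpreted with the wrong single-byte code page, cp866 in this sketch.

```python
import unicodedata


def to_normalized_text(raw_bytes, encoding):
    """Decode bytes with a known encoding, then apply NFC normalization."""
    text = raw_bytes.decode(encoding)
    return unicodedata.normalize("NFC", text)


# The same visible letter, arriving in two different forms.
precomposed = to_normalized_text(b"\xc3\xa1", "utf-8")   # U+00E1
decomposed = to_normalized_text(b"a\xcc\x81", "utf-8")   # 'a' + U+0301 combining acute

print(precomposed == decomposed)

# Interpreting UTF-8 bytes with an unrelated code page corrupts the text.
corrupted = b"\xc3\xa1".decode("cp866")
print(corrupted)
```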

Encoding error in Python with Chinese characters

I'm a beginner having trouble decoding several dozen CSV files with numbers + (Simplified) Chinese characters to UTF-8 in Python 2.7.
I do not know the encoding of the input files so I have tried all the possible encodings I am aware of -- GB18030, UTF-7, UTF-8, UTF-16 & UTF-32 (LE & BE). Also, for good measure, GBK and GB2312, though these should be a subset of GB18030. The UTF ones all stop when they get to the first Chinese characters. The other encodings stop somewhere in the first line except GB18030. I thought this would be the solution because it read through the first few files and decoded them fine. Part of my code, reading line by line, is:
line = line.decode("GB18030")
The first 2 files I tried to decode worked fine. Midway through the third file, Python spits out
UnicodeDecodeError: 'gb18030' codec can't decode bytes in position 168-169: illegal multibyte sequence
In this file, there are about 5 such errors in about a million lines.
I opened the input file in a text editor and checked which characters were giving the decoding errors, and the first few all had Euro signs in a particular column of the CSV files. I am fairly confident these are typos, so I would just like to delete the Euro characters. I would like to examine types of encoding errors one by one; I would like to get rid of all the Euro errors but do not want to just ignore others until I look at them first.
Edit: I used chardet which gave GB2312 as the encoding with .99 confidence for all files. I tried using GB2312 to decode which gave:
UnicodeDecodeError: 'gb2312' codec can't decode bytes in position 108-109: illegal multibyte sequence

""" ... GB18030. I thought this would be the solution because it read through the first few files and decoded them fine.""" -- please explain what you mean. To me, there are TWO criteria for a successful decoding: firstly that raw_bytes.decode('some_encoding') didn't fail, secondly that the resultant unicode when displayed makes sense in a particular language. Every file in the universe will pass the first test when decoded with latin1 aka iso_8859_1. Many files in East Asian languages pass the first test with gb18030, because mostly the frequently used characters in Chinese, Japanese, and Korean are encoded using the same blocks of two-byte sequences. How much of the second test have you done?
Don't muck about looking at the data in an IDE or text editor. Look at it in a web browser; they usually make a better job of detecting encodings.
How do you know that it's a Euro character? By looking at the screen of a text editor that's decoding the raw bytes using what encoding? cp1252?
How do you know it contains Chinese characters? Are you sure it's not Japanese? Korean? Where did you get it from?
Chinese files created in Hong Kong, Taiwan, maybe Macao, and other places off the mainland use big5 or big5_hkscs encoding -- try that.
In any case, take Mark's advice and point chardet at it; chardet usually makes a reasonably good job of detecting the encoding used if the file is large enough and correctly encoded Chinese/Japanese/Korean -- however if someone has been hand editing the file in a text editor using a single-byte charset, a few illegal characters may cause the encoding used for the other 99.9% of the characters not to be detected.
You may like to do print repr(line) on say 5 lines from the file and edit the output into your question.
If the file is not confidential, you may like to make it available for download.
Was the file created on Windows? How are you reading it in Python? (show code)
Update after OP comments:
Notepad etc don't attempt to guess the encoding; "ANSI" is the default. You have to tell it what to do. What you are calling the Euro character is the raw byte "\x80" decoded by your editor using the default encoding for your environment -- the usual suspect being "cp1252". Don't use such an editor to edit your file.
Earlier you were talking about the "first few errors". Now you say you have 5 errors total. Please explain.
If the file is indeed almost correct gb18030, you should be able to decode the file line by line, and when you get such an error, trap it, print the error message, extract the byte offsets from the message, print repr(two_bad_bytes), and keep going. I'm very interested in which of the two bytes the \x80 appears. If it doesn't appear at all, the "Euro character" is not part of your problem. Note that \x80 can appear validly in a gb18030 file, but only as the 2nd byte of a 2-byte sequence starting with \x81 to \xfe.
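A sketch of that trap-and-report loop, in Python 3 (an assumption; the thread is about Python 2.7). It relies on the start, end, object and reason attributes that UnicodeDecodeError carries, so the offsets do not have to be parsed out of the message text. The function name and the sample data are hypothetical; the sample uses UTF-8 so the behaviour is predictable, and for the files in question the encoding argument would be "gb18030".

```python
def find_undecodable_lines(raw_lines, encoding):
    """Return (line_number, start, end, bad_bytes, reason) for each line that fails."""
    problems = []
    for line_number, raw in enumerate(raw_lines, start=1):
        try:
            raw.decode(encoding)
        except UnicodeDecodeError as exc:
            bad_bytes = exc.object[exc.start:exc.end]
            problems.append((line_number, exc.start, exc.end, bad_bytes, exc.reason))
    return problems


# Hypothetical sample: the third line contains a byte that is never valid in UTF-8.
sample = [b"id,name\n", b"1,caf\xc3\xa9\n", b"2,bad \xff byte\n"]
report = find_undecodable_lines(sample, "utf-8")

for line_number, start, end, bad_bytes, reason in report:
    print(line_number, start, end, repr(bad_bytes), reason)
```

Only the first failure on each line is reported, which is enough to see which bytes surround the \x80.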
It's a good idea to know what your problem is before you try to fix it. Trying to fix it by bashing it about with Notepad etc in "ANSI" mode is not a good idea.
You have been very coy about how you decided that the results of gb18030 decoding made sense. In particular I would be closely scrutinising the lines where gbk fails but gb18030 "works" -- there must be some extremely rare Chinese characters in there, or maybe some non-Chinese non-ASCII characters ...
Here's a suggestion for a better way to inspect the damage: decode each file with raw_bytes.decode(encoding, 'replace') and write the result (encoded in utf8) to another file. Count the errors by result.count(u'\ufffd'). View the output file with whatever you used to decide that the gb18030 decoding made sense. The U+FFFD character should show up as a white question mark inside a black diamond.
If you decide that the undecodable pieces can be discarded, the easiest way is raw_bytes.decode(encoding, 'ignore')
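The 'replace' and 'ignore' suggestions can be sketched as follows, in Python 3 (an assumption). The helper decode_and_count is hypothetical, and the sample bytes are UTF-8 rather than gb18030, because how a decoder resynchronises after a bad byte differs between codecs and versions. Note that the count also includes any U+FFFD characters that were legitimately present in the data.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def decode_and_count(raw_bytes, encoding):
    """Decode leniently and report how many replacement characters resulted."""
    text = raw_bytes.decode(encoding, "replace")
    return text, text.count(REPLACEMENT)


raw = b"abc\xffdef"  # 0xff can never appear in valid UTF-8

text, errors = decode_and_count(raw, "utf-8")
discarded = raw.decode("utf-8", "ignore")

print(repr(text), errors)
print(repr(discarded))
```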
Update after further information
All those \\ are confusing. It appears that "getting the bytes" involves repr(repr(bytes)) instead of just repr(bytes) ... at the interactive prompt, do either bytes (you'll get an implicit repr()), or print repr(bytes) (which won't get the implicit repr())
The blank space: I presume that you mean that '\xf8\xf8'.decode('gb18030') is what you interpret as some kind of full-width space, and that the interpretation is done by visual inspection using some unnameable viewer software. Is that correct?
Actually, '\xf8\xf8'.decode('gb18030') -> u'\ue28b'. U+E28B is in the Unicode PUA (Private Use Area). The "blank space" presumably means that the viewer software unsurprisingly doesn't have a glyph for U+E28B in the font it is using.
Perhaps the source of the files is deliberately using the PUA for characters that are not in standard gb18030, or for annotation, or for transmitting pseudosecret info. If so, you will need to resort to the decoding tambourine, an offshoot of recent Russian research reported here.
Alternative: the cp939-HKSCS theory. According to the HK government, HKSCS big5 code FE57 was once mapped to U+E28B but is now mapped to U+28804.
The "euro": You said """Due to the data I can't share the whole line, but what I was calling the euro char is in: \xcb\xbe\x80\x80" [I'm assuming a \ was omitted from the start of that, and the " is literal]. The "euro character", when it appears, is always in the same column that I don't need, so I was hoping to just use "ignore". Unfortunately, since the "euro char" is right next to quotes in the file, sometimes "ignore" gets rid of both the euro character as well [as] quotes, which poses a problem for the csv module to determine columns"""
It would help enormously if you could show the patterns of where these \x80 bytes appear in relation to the quotes and the Chinese characters -- keep it readable by just showing the hex, and hide your confidential data e.g. by using C1 C2 to represent "two bytes which I am sure represent a Chinese character". For example:
C1 C2 C1 C2 cb be 80 80 22 # `\x22` is the quote character
Please supply examples of (1) where the " is not lost by 'replace' or 'ignore' (2) where the quote is lost. In your sole example to date, the " is not lost:
>>> '\xcb\xbe\x80\x80\x22'.decode('gb18030', 'ignore')
u'\u53f8"'
And the offer to send you some debugging code (see example output below) is still open.
>>> import decode_debug as de
>>> def logger(s):
...     sys.stderr.write('*** ' + s + '\n')
...
>>> import sys
>>> de.decode_debug('\xcb\xbe\x80\x80\x22', 'gb18030', 'replace', logger)
*** input[2:5] ('\x80\x80"') doesn't start with a plausible code sequence
*** input[3:5] ('\x80"') doesn't start with a plausible code sequence
u'\u53f8\ufffd\ufffd"'
>>> de.decode_debug('\xcb\xbe\x80\x80\x22', 'gb18030', 'ignore', logger)
*** input[2:5] ('\x80\x80"') doesn't start with a plausible code sequence
*** input[3:5] ('\x80"') doesn't start with a plausible code sequence
u'\u53f8"'
>>>
Eureka: -- Probable cause of sometimes losing the quote character --
It appears there is a bug in the gb18030 decoder replace/ignore mechanism: \x80 is not a valid gb18030 lead byte; when it is detected the decoder should attempt to resync with the NEXT byte. However it seems to be ignoring both the \x80 AND the following byte:
>>> '\x80abcd'.decode('gb18030', 'replace')
u'\ufffdbcd' # the 'a' is lost
>>> de.decode_debug('\x80abcd', 'gb18030', 'replace', logger)
*** input[0:4] ('\x80abc') doesn't start with a plausible code sequence
u'\ufffdabcd'
>>> '\x80\x80abcd'.decode('gb18030', 'replace')
u'\ufffdabcd' # the second '\x80' is lost
>>> de.decode_debug('\x80\x80abcd', 'gb18030', 'replace', logger)
*** input[0:4] ('\x80\x80ab') doesn't start with a plausible code sequence
*** input[1:5] ('\x80abc') doesn't start with a plausible code sequence
u'\ufffd\ufffdabcd'
>>>

You might try chardet.

Try this:
codecs.open(file, encoding='gb18030', errors='replace')
Don't forget the parameter errors, you can also set it to 'ignore'.
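In Python 3 (an assumption; the suggestion above is for Python 2's codecs.open), the built-in open accepts the same encoding and errors parameters. The sketch below writes a small temporary file so it is self-contained; the file content is hypothetical, and the two bytes \xcb\xbe are the ones the session quoted earlier shows decoding to U+53F8 in gb18030.

```python
import os
import tempfile

# Create a throwaway file holding gb18030-encoded data.
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as handle:
    handle.write(b"1,\xcb\xbe\n")
    path = handle.name

try:
    # errors="replace" substitutes U+FFFD for undecodable bytes instead of raising;
    # errors="ignore" would drop them silently.
    with open(path, encoding="gb18030", errors="replace") as source:
        lines = source.readlines()
finally:
    os.remove(path)

print(lines)
```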

Python Encoding issue

Why am I getting this issue? and how do I resolve it?
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 24: unexpected code byte
Thank you

Somewhere, perhaps subtly, you are asking Python to turn a stream of bytes into a "string" of characters.
Don't think of a string as "bytes". A string is a list of numbers, each number having an agreed meaning in Unicode. (#65 = Latin Capital A. #19968 = Chinese Character "One"/"First") .
There are many methods of encoding a list of Unicode entities into a stream of bytes. Python is assuming your stream of bytes is the result of a particular such method, called "UTF-8".
However, your stream of bytes has data that does not correspond to that method. Thus the error is raised.
You need to figure out the encoding of the stream of bytes, and tell Python that encoding.
It's important to know if you're using Python 2 or 3, and the code leading up to this exception to see where your bytes came from and what the appropriate way to deal with them is.
If it's from reading a file, you can explicitly deal with the bytes read. But you must be sure of the file encoding.
If it's from a string that is part of your source code, then Python is assuming the "wrong thing" about your source files... perhaps $LC_ALL or $LANG needs to be set. This is a good time to firmly understand the concept of encoding, and how text editors choose an encoding to write, and what is standard for your language and operating system.

In addition to what Joe said, chardet is a useful tool to detect encoding of the source data.

Somewhere you have a plain string encoded as "Windows-1252" (or "cp1252") containing a "RIGHT SINGLE QUOTATION MARK" (’) instead of an APOSTROPHE ('). This could come from a file you read, or even in a Python source file of yours; you could be running Python 2.x and have a # -*- coding: utf8 -*- line somewhere near the script's beginning, or you could be running Python 3.x.
You don't give enough data; however, somewhere you have a cp1252-encoded string, which you try (explicitly or implicitly) to decode to unicode as utf-8. This won't work.
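A small reproduction of that diagnosis, in Python 3 (an assumption, since the question does not say which version is in use). The sample bytes are hypothetical: byte 0x92 is RIGHT SINGLE QUOTATION MARK in cp1252, but it cannot start a character in UTF-8, so decoding as UTF-8 fails at that position.

```python
raw = b"It\x92s a quotation mark"  # hypothetical cp1252-encoded input

try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError as exc:
    utf8_failed = True
    failure_position = exc.start
    print("utf-8 decoding failed at byte", exc.start, "-", exc.reason)

# Decoding with the encoding the data was actually written in succeeds.
as_text = raw.decode("cp1252")
print(as_text)
```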
Give us more info, and we'll try again to help you.
Joe Koberg's answer reminded me of an older answer of mine, which some people have found helpful: Python UnicodeDecodeError - Am I misunderstanding encode?

SQLite, python, unicode, and non-utf data

I started by trying to store strings in sqlite using python, and got the message:
sqlite3.ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.
Ok, I switched to Unicode strings. Then I started getting the message:
sqlite3.OperationalError: Could not decode to UTF-8 column 'tag_artist' with text 'Sigur Rós'
when trying to retrieve data from the db. More research and I started encoding it in utf8, but then 'Sigur Rós' starts looking like 'Sigur RÃ³s'
note: My console was set to display in 'latin_1', as @John Machin pointed out.
What gives? After reading this, describing exactly the same situation I'm in, it seems as if the advice is to ignore the other advice and use 8-bit bytestrings after all.
I didn't know much about unicode and utf before I started this process. I've learned quite a bit in the last couple hours, but I'm still ignorant of whether there is a way to correctly convert 'ó' from latin-1 to utf-8 and not mangle it. If there isn't, why would sqlite 'highly recommend' I switch my application to unicode strings?
I'm going to update this question with a summary and some example code of everything I've learned in the last 24 hours so that someone in my shoes can have an easy(er) guide. If the information I post is wrong or misleading in any way please tell me and I'll update, or one of you senior guys can update.
Summary of answers
Let me first state the goal as I understand it. The goal in processing various encodings, if you are trying to convert between them, is to understand what your source encoding is, then convert it to unicode using that source encoding, then convert it to your desired encoding. Unicode is a base and encodings are mappings of subsets of that base. utf_8 has room for every character in unicode, but because they aren't in the same place as, for instance, latin_1, a string encoded in utf_8 and sent to a latin_1 console will not look the way you expect. In python the process of getting to unicode and into another encoding looks like:
str.decode('source_encoding').encode('desired_encoding')
or if the str is already in unicode
str.encode('desired_encoding')
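Applied to the 'Sigur Rós' case, the two steps look like this. The sketch is in Python 3 (an assumption; the summary is written for Python 2), where the encoded form is a bytes object and the decoded form is a str. The last step reproduces the mangled 'Sigur RÃ³s' by reading the UTF-8 bytes as if they were latin_1, which is what a latin_1 console does.

```python
latin1_bytes = b"Sigur R\xf3s"             # 'ó' stored as a single latin_1 byte

text = latin1_bytes.decode("latin-1")      # step 1: source encoding -> unicode
utf8_bytes = text.encode("utf-8")          # step 2: unicode -> desired encoding

# What a latin_1 console shows when it is handed the UTF-8 bytes.
mojibake = utf8_bytes.decode("latin-1")

print(text)
print(utf8_bytes)
print(mojibake)
```

Nothing is lost in the conversion itself; the mangling only appears when bytes in one encoding are displayed as if they were in another.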
For sqlite I didn't actually want to encode it again, I wanted to decode it and leave it in unicode format. Here are four things you might need to be aware of as you try to work with unicode and encodings in python.
(1) The encoding of the string you want to work with, and the encoding you want to get it to.
(2) The system encoding.
(3) The console encoding.
(4) The encoding of the source file.
Elaboration:
(1) When you read a string from a source, it must have some encoding, like latin_1 or utf_8. In my case, I'm getting strings from filenames, so unfortunately, I could be getting any kind of encoding. Windows XP uses UCS-2 (a Unicode system) as its native string type, which seems like cheating to me. Fortunately for me, the characters in most filenames are not going to be made up of more than one source encoding type, and I think all of mine were either completely latin_1, completely utf_8, or just plain ascii (which is a subset of both of those). So I just read them and decoded them as if they were still in latin_1 or utf_8. It's possible, though, that you could have latin_1 and utf_8 and whatever other characters mixed together in a filename on Windows. Sometimes those characters can show up as boxes, other times they just look mangled, and other times they look correct (accented characters and whatnot). Moving on.
(2) Python has a default system encoding that gets set when python starts and can't be changed during runtime. See here for details. Dirty summary ... well here's the file I added:
# sitecustomize.py
# this file can be anywhere in your Python path,
# but it usually goes in ${pythondir}/lib/site-packages/
import sys
sys.setdefaultencoding('utf_8')
This system encoding is the one that gets used when you use the unicode("str") function without any other encoding parameters. To say that another way, python tries to decode "str" to unicode based on the default system encoding.
(3) If you're using IDLE or the command-line python, I think that your console will display according to the default system encoding. I am using pydev with eclipse for some reason, so I had to go into my project settings, edit the launch configuration properties of my test script, go to the Common tab, and change the console from latin-1 to utf-8 so that I could visually confirm what I was doing was working.
(4) If you want to have some test strings, eg
test_str = "ó"
in your source code, then you will have to tell python what kind of encoding you are using in that file. (FYI: when I mistyped an encoding I had to ctrl-Z because my file became unreadable.) This is easily accomplished by putting a line like so at the top of your source code file:
# -*- coding: utf_8 -*-
If you don't have this information, python attempts to parse your code as ascii by default, and so:
SyntaxError: Non-ASCII character '\xf3' in file _redacted_ on line 81, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
Once your program is working correctly, or, if you aren't using python's console or any other console to look at output, then you will probably really only care about #1 on the list. System default and console encoding are not that important unless you need to look at output and/or you are using the builtin unicode() function (without any encoding parameters) instead of the string.decode() function. I wrote a demo function I will paste into the bottom of this gigantic mess that I hope correctly demonstrates the items in my list. Here is some of the output when I run the character 'ó' through the demo function, showing how various methods react to the character as input. My system encoding and console output are both set to utf_8 for this run:
'�' = original char <type 'str'> repr(char)='\xf3'
'?' = unicode(char) ERROR: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data
'ó' = char.decode('latin_1') <type 'unicode'> repr(char.decode('latin_1'))=u'\xf3'
'?' = char.decode('utf_8') ERROR: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data
Now I will change the system and console encoding to latin_1, and I get this output for the same input:
'ó' = original char <type 'str'> repr(char)='\xf3'
'ó' = unicode(char) <type 'unicode'> repr(unicode(char))=u'\xf3'
'ó' = char.decode('latin_1') <type 'unicode'> repr(char.decode('latin_1'))=u'\xf3'
'?' = char.decode('utf_8') ERROR: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data
Notice that the 'original' character displays correctly and the builtin unicode() function works now.
Now I change my console output back to utf_8.
'�' = original char <type 'str'> repr(char)='\xf3'
'�' = unicode(char) <type 'unicode'> repr(unicode(char))=u'\xf3'
'�' = char.decode('latin_1') <type 'unicode'> repr(char.decode('latin_1'))=u'\xf3'
'?' = char.decode('utf_8') ERROR: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data
Here everything still works the same as last time, but the console can't display the output correctly. Etc. The function below also displays more information than this and hopefully will help someone figure out where the gap in their understanding is. I know all this information is in other places and more thoroughly dealt with there, but I hope that this will be a good kickoff point for someone trying to get coding with python and/or sqlite. Ideas are great, but sometimes source code can save you a day or two of trying to figure out what functions do what.
Disclaimers: I'm no encoding expert, I put this together to help my own understanding. I kept building on it when I should have probably started passing functions as arguments to avoid so much redundant code, so if I can I'll make it more concise. Also, utf_8 and latin_1 are by no means the only encoding schemes, they are just the two I was playing around with because I think they handle everything I need. Add your own encoding schemes to the demo function and test your own input.
One more thing: there are apparently crazy application developers making life difficult in Windows.
#!/usr/bin/env python
# -*- coding: utf_8 -*-
# Python 2 only: relies on print statements, unicode() and str.decode().
import os
import sys

def encodingDemo(str):
    validStrings = ()
    # 1. Print the byte string as-is.
    try:
        print "str =",str,"{0} repr(str) = {1}".format(type(str), repr(str))
        validStrings += ((str,""),)
    except UnicodeEncodeError as ude:
        print "Couldn't print the str itself because the console is set to an encoding that doesn't understand some character in the string. See error:\n\t",
        print ude
    # 2. Decode with the default system encoding.
    try:
        x = unicode(str)
        print "unicode(str) = ",x
        validStrings+= ((x, " decoded into unicode by the default system encoding"),)
    except UnicodeDecodeError as ude:
        print "ERROR. unicode(str) couldn't decode the string because the system encoding is set to an encoding that doesn't understand some character in the string."
        print "\tThe system encoding is set to {0}. See error:\n\t".format(sys.getdefaultencoding()),
        print ude
    except UnicodeEncodeError as uee:
        print "ERROR. Couldn't print the unicode(str) because the console is set to an encoding that doesn't understand some character in the string. See error:\n\t",
        print uee
    # 3. Decode as latin_1, then re-encode as utf_8.
    try:
        x = str.decode('latin_1')
        print "str.decode('latin_1') =",x
        validStrings+= ((x, " decoded with latin_1 into unicode"),)
        try:
            print "str.decode('latin_1').encode('utf_8') =",str.decode('latin_1').encode('utf_8')
            validStrings+= ((x, " decoded with latin_1 into unicode and encoded into utf_8"),)
        except UnicodeEncodeError as uee:
            # A failed encode raises UnicodeEncodeError, not UnicodeDecodeError.
            print "The string was decoded into unicode using the latin_1 encoding, but couldn't be encoded into utf_8. See error:\n\t",
            print uee
    except UnicodeDecodeError as ude:
        print "Something didn't work, probably because the string wasn't latin_1 encoded. See error:\n\t",
        print ude
    except UnicodeEncodeError as uee:
        print "ERROR. Couldn't print the str.decode('latin_1') because the console is set to an encoding that doesn't understand some character in the string. See error:\n\t",
        print uee
    # 4. Decode as utf_8, then re-encode as latin_1.
    try:
        x = str.decode('utf_8')
        print "str.decode('utf_8') =",x
        validStrings+= ((x, " decoded with utf_8 into unicode"),)
        try:
            print "str.decode('utf_8').encode('latin_1') =",str.decode('utf_8').encode('latin_1')
            # Record the result only when the encode actually succeeded.
            validStrings+= ((x, " decoded with utf_8 into unicode and encoded into latin_1"),)
        except UnicodeEncodeError as uee:
            print "str.decode('utf_8').encode('latin_1') didn't work. The string was decoded into unicode using the utf_8 encoding, but couldn't be encoded into latin_1. See error:\n\t",
            print uee
    except UnicodeDecodeError as ude:
        print "str.decode('utf_8') didn't work, probably because the string wasn't utf_8 encoded. See error:\n\t",
        print ude
    except UnicodeEncodeError as uee:
        print "ERROR. Couldn't print the str.decode('utf_8') because the console is set to an encoding that doesn't understand some character in the string. See error:\n\t",uee
    print
    # 5. Repeat the same experiments byte by byte.
    print "Printing information about each character in the original string."
    for char in str:
        try:
            print "\t'" + char + "' = original char {0} repr(char)={1}".format(type(char), repr(char))
        except UnicodeDecodeError as ude:
            print "\t'?' = original char {0} repr(char)={1} ERROR PRINTING: {2}".format(type(char), repr(char), ude)
        except UnicodeEncodeError as uee:
            print "\t'?' = original char {0} repr(char)={1} ERROR PRINTING: {2}".format(type(char), repr(char), uee)
            print uee
        try:
            x = unicode(char)
            print "\t'" + x + "' = unicode(char) {1} repr(unicode(char))={2}".format(x, type(x), repr(x))
        except UnicodeDecodeError as ude:
            print "\t'?' = unicode(char) ERROR: {0}".format(ude)
        except UnicodeEncodeError as uee:
            print "\t'?' = unicode(char) {0} repr(char)={1} ERROR PRINTING: {2}".format(type(x), repr(x), uee)
        try:
            x = char.decode('latin_1')
            print "\t'" + x + "' = char.decode('latin_1') {1} repr(char.decode('latin_1'))={2}".format(x, type(x), repr(x))
        except UnicodeDecodeError as ude:
            print "\t'?' = char.decode('latin_1') ERROR: {0}".format(ude)
        except UnicodeEncodeError as uee:
            print "\t'?' = char.decode('latin_1') {0} repr(char)={1} ERROR PRINTING: {2}".format(type(x), repr(x), uee)
        try:
            x = char.decode('utf_8')
            print "\t'" + x + "' = char.decode('utf_8') {1} repr(char.decode('utf_8'))={2}".format(x, type(x), repr(x))
        except UnicodeDecodeError as ude:
            print "\t'?' = char.decode('utf_8') ERROR: {0}".format(ude)
        except UnicodeEncodeError as uee:
            print "\t'?' = char.decode('utf_8') {0} repr(char)={1} ERROR PRINTING: {2}".format(type(x), repr(x), uee)
        print

x = 'ó'
encodingDemo(x)
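The disclaimer mentions that the demo could be shorter if the repeated try/except blocks were factored out. A minimal sketch of that idea follows. It is written for Python 3, where bytes plays the role of the Python 2 str used in this thread, and the name try_decodings and its default encoding list are made up for illustration.

```python
# Python 3 sketch: "raw" is a bytes object (the Python 2 str of this thread).
# try_decodings is a hypothetical helper, not part of the original demo.
def try_decodings(raw, encodings=("ascii", "latin_1", "utf_8")):
    """Return {encoding name: decoded text, or the UnicodeDecodeError raised}."""
    results = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError as err:
            results[name] = err
    return results

# The two bytes '\xc3\xb3' are the UTF-8 form of o-acute.
report = try_decodings(u"\u00f3".encode("utf_8"))
for name, outcome in sorted(report.items()):
    # repr() keeps the output safe on consoles that cannot display the text.
    print("{0:8} -> {1!r}".format(name, outcome))
```

Adding another encoding to test is then a matter of extending the tuple rather than copying a block.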
Many thanks for the answers below, and especially to @John Machin for answering so thoroughly.

I'm still ignorant of whether there is a way to correctly convert 'ó' from latin-1 to utf-8 and not mangle it
repr() and unicodedata.name() are your friends when it comes to debugging such problems:
>>> oacute_latin1 = "\xF3"
>>> oacute_unicode = oacute_latin1.decode('latin1')
>>> oacute_utf8 = oacute_unicode.encode('utf8')
>>> print repr(oacute_latin1)
'\xf3'
>>> print repr(oacute_unicode)
u'\xf3'
>>> import unicodedata
>>> unicodedata.name(oacute_unicode)
'LATIN SMALL LETTER O WITH ACUTE'
>>> print repr(oacute_utf8)
'\xc3\xb3'
>>>
If you send oacute_utf8 to a terminal that is set up for latin1, you will get A-tilde followed by superscript-3.
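The A-tilde plus superscript-3 effect can be reproduced without a terminal by decoding the UTF-8 bytes as latin1, which is in effect what such a terminal does. This is a small sketch written for Python 3 (bytes stands in for the Python 2 str), and the variable names are illustrative only.

```python
import unicodedata

oacute_utf8 = u"\u00f3".encode("utf-8")             # the two bytes C3 B3
# Interpret each byte as one latin1 character, like a latin1 terminal would.
as_seen_by_latin1 = oacute_utf8.decode("latin-1")
names = [unicodedata.name(ch) for ch in as_seen_by_latin1]
print(names)

# Nothing was lost: reversing the wrong step recovers the original character.
recovered = as_seen_by_latin1.encode("latin-1").decode("utf-8")
print(repr(recovered))
```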
I switched to Unicode strings.
What are you calling Unicode strings? UTF-16?
What gives? After reading this, describing exactly the same situation I'm in, it seems as if the advice is to ignore the other advice and use 8-bit bytestrings after all.
I can't imagine how it seems so to you. The story that was being conveyed was that unicode objects in Python and UTF-8 encoding in the database were the way to go. However Martin answered the original question, giving a method ("text factory") for the OP to be able to use latin1 -- this did NOT constitute a recommendation!
Update in response to these further questions raised in a comment:
I didn't understand that the unicode characters still contained an implicit encoding. Am I saying that right?
No. An encoding is a mapping between Unicode and something else, and vice versa. A Unicode character doesn't have an encoding, implicit or otherwise.
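To make the "mapping" point concrete, here is a short sketch (Python 3 syntax, variable names chosen for illustration) that takes the single character U+3053 from the top of this thread and maps it to bytes with several encodings. The character stays the same; only the byte representation differs.

```python
ko = u"\u3053"  # HIRAGANA LETTER KO: one Unicode character, no encoding attached

# One character, three different byte sequences.
encoded = {name: ko.encode(name) for name in ("utf-8", "utf-16-le", "utf-16-be")}
for name in sorted(encoded):
    print(name, repr(encoded[name]))

# Each mapping is reversible with the matching decoder.
round_trips = all(data.decode(name) == ko for name, data in encoded.items())

# Not every encoding has a mapping for every character.
try:
    ko.encode("latin-1")
    latin1_can_encode = True
except UnicodeEncodeError:
    latin1_can_encode = False
```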
It looks to me like unicode("\xF3") and "\xF3".decode('latin1') are the same when evaluated with repr().
Say what? It doesn't look like it to me:
>>> unicode("\xF3")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf3 in position 0: ordinal not in range(128)
>>> "\xF3".decode('latin1')
u'\xf3'
>>>
Perhaps you meant: u'\xf3' == '\xF3'.decode('latin1') ... this is certainly true.
It is also true that unicode(str_object, encoding) does the same as str_object.decode(encoding) ... including blowing up when an inappropriate encoding is supplied.
Is that a happy circumstance
That the first 256 characters in Unicode are the same, code for code, as the 256 characters in latin1 is a good idea. Because all 256 possible latin1 characters are mapped to Unicode, it means that ANY 8-bit byte, ANY Python str object can be decoded into unicode without an exception being raised. This is as it should be.
However there exist certain persons who confuse two quite separate concepts: "my script runs to completion without any exceptions being raised" and "my script is error-free". To them, latin1 is "a snare and a delusion".
In other words, if you have a file that's actually encoded in cp1252 or gbk or koi8-u or whatever and you decode it using latin1, the resulting Unicode will be utter rubbish and Python (or any other language) will not flag an error -- it has no way of knowing that you have committed a silliness.
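A small illustration of that "no exception, wrong result" trap, sketched for Python 3 (bytes stands in for the Python 2 str). The sample text is made up; the point is only that latin1 accepts every byte while cp1252 data decoded that way comes out as control characters.

```python
# Every one of the 256 possible byte values decodes under latin1.
all_bytes = bytes(bytearray(range(256)))
assert len(all_bytes.decode("latin-1")) == 256

# Text with curly quotes and a euro sign, saved by some cp1252 application.
original = u"\u201cquoted\u201d \u20ac5"
raw = original.encode("cp1252")

wrong = raw.decode("latin-1")   # no exception is raised ...
right = raw.decode("cp1252")

# ... but the quotes and euro sign have become C1 control characters.
c1_found = [ch for ch in wrong if u"\u0080" <= ch <= u"\u009f"]
print(repr(wrong))
print(repr(right))
```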
or is unicode("str") going to always return the correct decoding?
Just like that, with the default encoding being ascii, it will return the correct unicode if the file is actually encoded in ASCII. Otherwise, it'll blow up.
Similarly, if you specify the correct encoding, or one that's a superset of the correct encoding, you'll get the correct result. Otherwise you'll get gibberish or an exception.
In short: the answer is no.
If not, when I receive a python str that has any possible character set in it, how do I know how to decode it?
If the str object is a valid XML document, it will be specified up front. Default is UTF-8.
If it's a properly constructed web page, it should be specified up front (look for "charset"). Unfortunately many writers of web pages lie through their teeth (ISO-8859-1 aka latin1, should be Windows-1252 aka cp1252; don't waste resources trying to decode gb2312, use gbk instead). You can get clues from the nationality/language of the website.
UTF-8 is always worth trying. If the data is ascii, it'll work fine, because ascii is a subset of utf8. A string of text that has been written using non-ascii characters and has been encoded in an encoding other than utf8 will almost certainly fail with an exception if you try to decode it as utf8.
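The "try UTF-8 first" heuristic might be sketched like this. The function name decode_guess and the choice of cp1252 as the fallback are assumptions for illustration, not a recommendation from the answer; pick a fallback that matches what you know about your data. Written for Python 3, where the input is a bytes object.

```python
def decode_guess(raw, fallback="cp1252"):
    """Hypothetical helper: return (text, name of the encoding that was used)."""
    try:
        # Non-ASCII text in some other encoding almost always fails here.
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # The fallback is a guess. It can produce gibberish, and cp1252 itself
        # raises UnicodeDecodeError on the few byte values it leaves undefined.
        return raw.decode(fallback), fallback

ascii_result = decode_guess(b"plain ascii")
utf8_result = decode_guess(u"\u3053".encode("utf-8"))
legacy_result = decode_guess(u"Sigur R\u00f3s".encode("latin-1"))
```

Returning the encoding name alongside the text makes it easy to log how often the fallback was needed.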
All of the above heuristics and more and a lot of statistics are encapsulated in chardet, a module for guessing the encoding of arbitrary files. It usually works well. However you can't make software idiot-proof. For example, if you concatenate data files written some with encoding A and some with encoding B, and feed the result to chardet, the answer is likely to be encoding C with a reduced level of confidence e.g. 0.8. Always check the confidence part of the answer.
If all else fails:
(1) Try asking here, with a small sample from the front of your data ... print repr(your_data[:400]) ... and whatever collateral info about its provenance that you have.
(2) Recent Russian research into techniques for recovering forgotten passwords appears to be quite applicable to deducing unknown encodings.
Update 2 BTW, isn't it about time you opened up another question ?-)
One more thing: there are apparently characters that Windows uses as Unicode for certain characters that aren't the correct Unicode for that character, so you may have to map those characters to the correct ones if you want to use them in other programs that are expecting those characters in the right spot.
It's not Windows that's doing it; it's a bunch of crazy application developers. You might have more understandably not paraphrased but quoted the opening paragraph of the effbot article that you referred to:
Some applications add CP1252 (Windows, Western Europe) characters to documents marked up as ISO 8859-1 (Latin 1) or other encodings. These characters are not valid ISO-8859-1 characters, and may cause all sorts of problems in processing and display applications.
Background:
The range U+0000 to U+001F inclusive is designated in Unicode as "C0 Control Characters". These exist also in ASCII and latin1, with the same meanings. They include such familiar things as carriage return, line feed, bell, backspace, tab, and others that are used rarely.
The range U+0080 to U+009F inclusive is designated in Unicode as "C1 Control Characters". These exist also in latin1, and include 32 characters that nobody outside unicode.org can imagine any possible use for.
Consequently, if you run a character frequency count on your unicode or latin1 data, and you find any characters in that range, your data is corrupt. There is no universal solution; it depends on how it became corrupted. The characters may have the same meaning as the cp1252 characters at the same positions, and thus the effbot's solution will work. In another case that I've been looking at recently, the dodgy characters appear to have been caused by concatenating text files encoded in UTF-8 and another encoding which needed to be deduced based on letter frequencies in the (human) language the files were written in.
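As a rough sketch of that check, the following counts C1 control characters in already-decoded text and, for the case where they really are cp1252 characters in disguise, reinterprets them one by one. This is not the effbot's code: that article uses a lookup table, whereas this sketch leans on Python's cp1252 codec. The function names are made up, and the repair is only valid under the stated assumption about how the data was corrupted.

```python
from collections import Counter

def c1_controls(text):
    """Count characters in the C1 range U+0080..U+009F."""
    return Counter(ch for ch in text if u"\u0080" <= ch <= u"\u009f")

def fix_c1_as_cp1252(text):
    """Assumes each C1 character is a cp1252 byte that was decoded as latin1."""
    out = []
    for ch in text:
        if u"\u0080" <= ch <= u"\u009f":
            try:
                ch = ch.encode("latin-1").decode("cp1252")
            except UnicodeDecodeError:
                pass  # a few positions are undefined in cp1252; leave them alone
        out.append(ch)
    return u"".join(out)

suspect = u"\x93quoted\x94 \u3053"
counts = c1_controls(suspect)
repaired = fix_c1_as_cp1252(suspect)
```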

UTF-8 is the default encoding of SQLite databases. This shows up in situations like "SELECT CAST(x'52C3B373' AS TEXT);". However, the SQLite C library doesn't actually check whether a string inserted into a DB is valid UTF-8.
If you insert a Python unicode object (or str object in 3.x), the Python sqlite3 library will automatically convert it to UTF-8. But if you insert a str object, it will just assume the string is UTF-8, because Python 2.x "str" doesn't know its encoding. This is one reason to prefer Unicode strings.
However, it doesn't help you if your data is broken to begin with.
To fix your data, do
db.create_function('FIXENCODING', 1, lambda s: str(s).decode('latin-1'))
db.execute("UPDATE TheTable SET TextColumn=FIXENCODING(CAST(TextColumn AS BLOB))")
for every text column in your database.
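For readers on Python 3, the same repair might look like the sketch below, run against an in-memory database so it can be tried safely. The table and column names are taken from the snippet above; the INSERT only simulates the damage (latin-1 bytes stored in a TEXT column) and is an assumption about how the data got broken. In Python 3 the user-defined function receives the BLOB as bytes.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE TheTable (TextColumn TEXT)")

# Simulate the breakage: latin-1 bytes stored as if they were UTF-8 text.
broken_bytes = u"Sigur R\u00f3s".encode("latin-1")
db.execute("INSERT INTO TheTable VALUES (CAST(? AS TEXT))", (broken_bytes,))

# Read each value back as raw bytes, decode it as latin-1, and store the
# resulting text, which sqlite3 writes out as valid UTF-8.
db.create_function("FIXENCODING", 1, lambda blob: bytes(blob).decode("latin-1"))
db.execute("UPDATE TheTable SET TextColumn = FIXENCODING(CAST(TextColumn AS BLOB))")

fixed = db.execute("SELECT TextColumn FROM TheTable").fetchone()[0]
stored = db.execute("SELECT CAST(TextColumn AS BLOB) FROM TheTable").fetchone()[0]
db.close()
```

Run it on a copy of the database first: applying the update twice, or to a column that was already valid UTF-8, would mangle the text instead of repairing it.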

I fixed this pysqlite problem by setting:
conn.text_factory = lambda x: unicode(x, 'utf-8', 'ignore')
By default text_factory is set to unicode(), which will use the current default encoding (ascii on my machine)
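A Python 3 sketch of the same text_factory idea follows, with the caveat it carries. In Python 3 the callable receives bytes, and the 'ignore' handler silently drops whatever cannot be decoded, so this hides broken data rather than repairing it. The sample values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Decode every TEXT value as UTF-8 and drop bytes that are not valid UTF-8.
conn.text_factory = lambda raw: raw.decode("utf-8", "ignore")

# Properly stored text comes back unchanged.
good = conn.execute("SELECT ?", (u"Sigur R\u00f3s",)).fetchone()[0]

# A latin-1 byte stored as text is not an error any more, but it is lost.
lossy = conn.execute("SELECT CAST(? AS TEXT)", (b"R\xf3s",)).fetchone()[0]
conn.close()
```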

Of course there is. But your data is already broken in the database, so you'll need to fix it:
>>> print u'Sigur RÃ³s'.encode('latin-1').decode('utf-8')
Sigur Rós

This fixed my unicode problems with Python 2.x (Python 2.7.6 to be specific):
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
It also solved the error you are mentioning right at the beginning of the post:
sqlite3.ProgrammingError: You must not use 8-bit bytestrings unless
...
EDIT
sys.setdefaultencoding is a dirty hack. Yes, it can solve UTF-8 issues, but everything comes with a price. For more details, refer to the following links:
Why sys.setdefaultencoding() will break code
Why we need sys.setdefaultencoding(“utf-8”) in a py script?
