Python UnicodeDecodeError - Am I misunderstanding encode?

Python UnicodeDecodeError - Am I misunderstanding encode? - python

Any thoughts on why this isn't working? I really thought 'ignore' would do the right thing.
>>> 'add \x93Monitoring\x93 to list '.encode('latin-1','ignore')
Traceback (most recent call last):
File "<interactive input>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 4: ordinal not in range(128)

… There's a reason they're called "encodings" …
A little preamble: think of unicode as the norm, or the ideal state. Unicode is just a table of characters. №65 is latin capital A. №937 is greek capital omega. Just that.
In order for a computer to store and-or manipulate Unicode, it has to encode it into bytes. The most straightforward encoding of Unicode is UCS-4; every character occupies 4 bytes, and all ~1000000 characters are available. The 4 bytes contain the number of the character in the Unicode tables as a 4-byte integer. Another very useful encoding is UTF-8, which can encode any Unicode character with one to four bytes. But there also are some limited encodings, like "latin1", which include a very limited range of characters, mostly used by Western countries. Such encodings use only one byte per character.
Basically, Unicode can be encoded with many encodings, and encoded strings can be decoded to Unicode. The thing is, Unicode came quite late, so all of us that grew up using an 8-bit character set learned too late that all this time we worked with encoded strings. The encoding could be ISO8859-1, or windows CP437, or CP850, or, or, or, depending on our system default.
So when, in your source code, you enter the string "add “Monitoring“ to list" (and I think you wanted the string "add “Monitoring” to list", note the second quote), you actually are using a string already encoded according to your system's default codepage (by the byte \x93 I assume you use Windows codepage 1252, “Western”). If you want to get Unicode from that, you need to decode the string from the "cp1252" encoding.
So, what you meant to do, was:
"add \x93Monitoring\x94 to list".decode("cp1252", "ignore")
It's unfortunate that Python 2.x includes an .encode method for strings too; this is a convenience function for "special" encodings, like the "zip" or "rot13" or "base64" ones, which have nothing to do with Unicode.
Anyway, all you have to remember for your to-and-fro Unicode conversions is:
a Unicode string gets encoded to a Python 2.x string (actually, a sequence of bytes)
a Python 2.x string gets decoded to a Unicode string
In both cases, you need to specify the encoding that will be used.
I'm not very clear, I'm sleepy, but I sure hope I help.
PS A humorous side note: Mayans didn't have Unicode; ancient Romans, ancient Greeks, ancient Egyptians didn't too. They all had their own "encodings", and had little to no respect for other cultures. All these civilizations crumbled to dust. Think about it people! Make your apps Unicode-aware, for the good of mankind. :)
PS2 Please don't spoil the previous message by saying "But the Chinese…". If you feel inclined or obligated to do so, though, delay it by thinking that the Unicode BMP is populated mostly by chinese ideograms, ergo Chinese is the basis of Unicode. I can go on inventing outrageous lies, as long as people develop Unicode-aware applications.

encode is available to unicode strings, but the string you have there does not seems unicode (try with u'add \x93Monitoring\x93 to list ')
>>> u'add \x93Monitoring\x93 to list '.encode('latin-1','ignore')
'add \x93Monitoring\x93 to list '

And the magic line is:
unicodedata.normalize('NFKD', text).encode('utf-8', 'ignore')
The one liner that wont raise exceptions when it is most needed (remove bad Unicode characters...)

This seems to work:
'add \x93Monitoring\x93 to list '.decode('latin-1').encode('latin-1')
Any issues with that? I wonder when 'ignore', 'replace' and other such encode error handling comes in?

Related

encoding string in python [duplicate]

I have this issue and I can't figure out how to solve it. I have this string:
data = '\xc4\xb7\x86\x17\xcd'
When I tried to encode it:
data.encode()
I get this result:
b'\xc3\x84\xc2\xb7\xc2\x86\x17\xc3\x8d'
I only want:
b'\xc4\xb7\x86\x17\xcd'
Anyone knows the reason and how to fix this. The string is already stored in a variable, so I can't add the literal b in front of it.

You cannot convert a string into bytes or bytes into string without taking an encoding into account. The whole point about the bytes type is an encoding-independent sequence of bytes, while str is a sequence of Unicode code points which by design have no unique byte representation.
So when you want to convert one into the other, you must tell explicitly what encoding you want to use to perform this conversion. When converting into bytes, you have to say how to represent each character as a byte sequence; and when you convert from bytes, you have to say what method to use to map those bytes into characters.
If you don’t specify the encoding, then UTF-8 is the default, which is a sane default since UTF-8 is ubiquitous, but it's also just one of many valid encodings.
If you take your original string, '\xc4\xb7\x86\x17\xcd', take a look at what Unicode code points these characters represent. \xc4 for example is the LATIN CAPITAL LETTER A WITH DIAERESIS, i.e. Ä. That character happens to be encoded in UTF-8 as 0xC3 0x84 which explains why that’s what you get when you encode it into bytes. But it also has an encoding of 0x00C4 in UTF-16 for example.
As for how to solve this properly so you get the desired output, there is no clear correct answer. The solution that Kasramvd mentioned is also somewhat imperfect. If you read about the raw_unicode_escape codec in the documentation:
raw_unicode_escape
Latin-1 encoding with \uXXXX and \UXXXXXXXX for other code points. Existing backslashes are not escaped in any way. It is used in the Python pickle protocol.
So this is just a Latin-1 encoding which has a built-in fallback for characters outside of it. I would consider this fallback somewhat harmful for your purpose. For Unicode characters that cannot be represented as a \xXX sequence, this might be problematic:
>>> chr(256).encode('raw_unicode_escape')
b'\\u0100'
So the code point 256 is explicitly outside of Latin-1 which causes the raw_unicode_escape encoding to instead return the encoded bytes for the string '\\u0100', turning that one character into 6 bytes which have little to do with the original character (since it’s an escape sequence).
So if you wanted to use Latin-1 here, I would suggest you to use that one explictly, without having that escape sequence fallback from raw_unicode_escape. This will simply cause an exception when trying to convert code points outside of the Latin-1 area:
>>> '\xc4\xb7\x86\x17\xcd'.encode('latin1')
b'\xc4\xb7\x86\x17\xcd'
>>> chr(256).encode('latin1')
Traceback (most recent call last):
File "<pyshell#28>", line 1, in <module>
chr(256).encode('latin1')
UnicodeEncodeError: 'latin-1' codec can't encode character '\u0100' in position 0: ordinal not in range(256)
Of course, whether or not code points outside of the Latin-1 area can cause problems for you depends on where that string actually comes from. But if you can make guarantees that the input will only contain valid Latin-1 characters, then chances are that you don't really need to be working with a string there in the first place. Since you are actually dealing with some kind of bytes, you should look whether you cannot simply retrieve those values as bytes in the first place. That way you won’t introduce two levels of encoding there where you can corrupt data by misinterpreting the input.

You can use 'raw_unicode_escape' as your encoding:
In [14]: bytes(data, 'raw_unicode_escape')
Out[14]: b'\xc4\xb7\x86\x17\xcd'
As mentioned in comments you can also pass the encoding directly to the encode method of your string.
In [15]: data.encode("raw_unicode_escape")
Out[15]: b'\xc4\xb7\x86\x17\xcd'

encoding issue. Replace special character

I have a dictionary that looks like this:
{ u'Samstag & Sonntag': u'Ganztags ge\xf6ffnet', u'Freitag': u'18:00 & 22:00'}
Now I'm trying to replace the \xf6 with ö ,
but trying .replace('\xf6', 'ö') returns an error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf6 in position
0: ordinal not in range(128)
How can I fix this?

Now encoding is a mine field, and I might be off on this one - please correct me if that's the case.
From what I've gathered over the years is that Python2 assumes ASCII unless you defined a encoding at the top of your script. Mainly because either it's compiled that way or the OS/Terminal uses ASCII as it's primary encoding.
With that said, what you see in your example data:
{ u'Samstag & Sonntag': u'Ganztags ge\xf6ffnet', u'Freitag': u'18:00 & 22:00'}
Is the ASCII representation of a unicode string. Some how Python needs to tell you there's an ö in there - but it can't with ASCII because ö has no representation in the ASCII table.
But when you try to replace it using:
x.replace('\xf6', 'ö')
You're trying to find a ASCII character/string called \xf6 that is outside of the accepted bytes ranges of ASCII, so that will raise an exception. And you're trying to replace it with another invalid ASCII character and that will cause the same exception.
Hence why you get the "'ascii' codec can't decode byte...' message.
You can do unicode replacements like this:
a = u'Ganztags ge\xf6ffnet'
a.replace(u'\xf6', u'ö')
This will tell Python to find a unicode string, and replace it with another unicode string.
But the output data will result in the same thing in the example above, because \xf6 is ö in unicode.
What you want to do, is encode your string into something you want to use, for instance - UTF-8:
a.encode('UTF-8')
'Ganztags ge\xc3\xb6ffnet'
And define UTF-8 as your primary encoding by placing this at the top of your code:
#!/usr/bin/python
# -*- coding: UTF-8
This should in theory make your application a little easier to work with.
And you can from then on work with UTF-8 as your base model.
But there's no way that I know of, to convert your representation into a ASCII ö, because there really isn't such a thing. There's just different ways Python will do this encoding magic for you to make you believe it's possible to "just write ö".
In Python3 most of the strings you encounter will either be bytes data or treated a bit differently from Python2. And for the most part it's a lot easier.
There's numerous ways to change the encoding that is not part of the standard praxis. But there are ways to do it.
The closest to "good" praxis, would be the locale:
locale.setlocale(locale.LC_ALL, 'sv_SE.UTF-8')
I also had a horrendous solution and approach to this years back, it looked something like this (it was a great bodge for me at the time):
Python - Encoding string - Swedish Letters
tl;dr:
Your code usually assume/use ASCII as it's encoder/decoder.
ö is not a part of ASCII, there for you'll always see \xf6 if you've some how gotten unicode characters. Normally, if you print u'Ganztags ge\xf6ffnet' it will be shown as a Ö because of automatic encoding, if you need to verify if input matches that string, you have to compare them u'ö' == u'ö', if other systems depend on this data, encode it with something they understand .encode('UTF-8'). But replacing \xf6 with ö is the same thing, just that ö doesn't exist in ASCII and you need to do u'ö' - which, will result in the same data at the end.

As you are using German language, you should be aware of non ascii characters. You know whether your system prefers Latin1 (Windows console and some Unixes), UTF8 (most Linux variants), or native unicode (Windows GUI).
If you can process everything as native unicode things are cleaner and you should just accept the fact that u'ö' and u'\xf6' are the same character - the latter is simply independant of the python source file charset.
If you have to output byte strings of store them in files, you should encode them in UTF8 (can process any unicode character but characters of code above 127 use more than 1 byte) or Latin1 (one byte per character, but only supports unicode code point below 256)
In that case just use an explicit encoding to convert your unicode strings to byte strings:
print u'Ganztags ge\xf6ffnet'.encode('Latin1') # or .encode('utf8')
should give what you expect.

UnicodeEncodeError when writing to a file

I am trying to write some strings to a file (the strings have been given to me by the HTML parser BeautifulSoup).
I can use "print" to display them, but when I use file.write() I get the following error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 6: ordinal not in range(128)
How can I parse this?

If I type 'python unicode' into Google, I get about 14 million results; the first is the official doc which describes the whole situation in excruciating detail; and the fourth is a more practical overview that will pretty much spoon-feed you an answer, and also make sure you understand what's going on.
You really do need to read and understand these sorts of overviews, however long they seem. There really isn't any getting around it. Text is hard. There is no such thing as "plain text", there hasn't been a reasonable facsimile for years, and there never really was, although we spent decades pretending there was. But Unicode is at least a standard.
You also should read http://www.joelonsoftware.com/articles/Unicode.html .

This error occurs when you pass a Unicode string containing non-English characters (Unicode characters beyond 128) to something that expects an ASCII bytestring. The default encoding for a Python bytestring is ASCII, "which handles exactly 128 (English) characters". This is why trying to convert Unicode characters beyond 128 produces the error.
The unicode()
unicode(string[, encoding, errors])
constructor has the signature unicode(string[, encoding, errors]). All of its arguments should be 8-bit strings.
The first argument is converted to Unicode using the specified encoding; if you leave off the encoding argument, the ASCII encoding is used for the conversion, so characters greater than 127 will be treated as errors
for example
s = u'La Pe\xf1a'
print s.encode('latin-1')
or
write(s.encode('latin-1'))
will encode using latin-1

The answer to your question is "use codecs". The appeded code also shows some gettext magic, FWIW. http://wiki.wxpython.org/Internationalization
import codecs
import gettext
localedir = './locale'
langid = wx.LANGUAGE_DEFAULT # use OS default; or use LANGUAGE_JAPANESE, etc.
domain = "MyApp"
mylocale = wx.Locale(langid)
mylocale.AddCatalogLookupPathPrefix(localedir)
mylocale.AddCatalog(domain)
translater = gettext.translation(domain, localedir,
[mylocale.GetCanonicalName()], fallback = True)
translater.install(unicode = True)
# translater.install() installs the gettext _() translater function into our namespace...
msg = _("A message that gettext will translate, probably putting Unicode in here")
# use codecs.open() to convert Unicode strings to UTF8
Logfile = codecs.open(logfile_name, 'w', encoding='utf-8')
Logfile.write(msg + '\n')
Despite Google being full of hits on this problem, I found it rather hard to find this simple solution (it is actually in the Python docs about Unicode, but rather burried).
So ... HTH...
GaJ

Python Unicode CSV export (using Django)

I'm using a Django app to export a string to a CSV file. The string is a message that was submitted through a front end form. However, I've been getting this error when a unicode single quote is provided in the input.
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019'
in position 200: ordinal not in range(128)
I've been trying to convert the unicode to ascii using the code below, but still get a similar error.
UnicodeEncodeError: 'ascii' codec can't encode characters in
position 0-9: ordinal not in range(128)
I've sifted through dozens of websites and learned a lot about unicode, however, I'm still not able to convert this unicode to ascii. I don't care if the algorithm removes the unicode characters. The commented lines indicate some various options I've tried, but the error persists.
import csv
import unicodedata
...
#message = unicode( unicodedata.normalize(
# 'NFKD',contact.message).encode('ascii','ignore'))
#dmessage = (contact.message).encode('utf-8','ignore')
#dmessage = contact.message.decode("utf-8")
#dmessage = "%s" % dmessage
dmessage = contact.message
csv_writer.writerow([
dmessage,
])
Does anyone have any advice in removing unicode characters to I can export them to CSV? This seemingly easy problem has kept my head spinning. Any help is much appreciated.
Thanks,
Joe

You can't encode the Unicode character u'\u2019' (U+2019 Right Single Quotation Mark) into ASCII, because ASCII doesn't have that character in it. ASCII is only the basic Latin alphabet, digits and punctuation; you don't get any accented letters or ‘smart quotes’ like this character.
So you will have to choose another encoding. Now normally the sensible thing to do would be to export to UTF-8, which can hold any Unicode character. Unfortunately for you if your target users are using Office (and they probably are), they're not going to be able to read UTF-8-encoded characters in CSV. Instead Excel will read the files using the system default code page for that machine (also misleadingly known as the ‘ANSI’ code page), and end up with mojibake like â€™ instead of ’.
So that means you have to guess the user's system default code page if you want the characters to show up correctly. For Western users, that will be code page 1252. Users with non-Western Windows installs will see the wrong characters, but there's nothing you can do about that (other than organise a letter-writing campaign to Microsoft to just drop the stupid nonsense with ANSI already and use UTF-8 like everyone else).
Code page 1252 can contain U+2019 (’), but obviously there are many more characters it can't represent. To avoid getting UnicodeEncodeError for those characters you can use the ignore argument (or replace to replace them with question marks).
dmessage= contact.message.encode('cp1252', 'ignore')
alternatively, to give up and remove all non-ASCII characters, so that everyone gets an equally bad experience regardless of locale:
dmessage= contact.message.encode('ascii', 'ignore')

Encoding is a pain, but if you're working in django have you tried smart_unicode(str) from django.utils.encoding? I find that usually does the trick.
The only other option I've found is to use the built-in python encode() and decode() for strings, but you have to specify the encoding for those and honestly, it's a pain.

[caveat: I'm not a djangoist; django may have a better solution].
General non-django-specific answer:
If you have a smallish number of known non-ASCII characters and there are user-acceptable ASCII equivalents for them, you can set up a translation table and use the unicode.translate method:
smashcii = {
0x2019 : u"'",
# etc
#
smashed = input_string.translate(smashcii)

Python Encoding issue

Why am I getting this issue? and how do I resolve it?
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 24: unexpected code byte
Thank you

Somewhere, perhaps subtly, you are asking Python to turn a stream of bytes into a "string" of characters.
Don't think of a string as "bytes". A string is a list of numbers, each number having an agreed meaning in Unicode. (#65 = Latin Capital A. #19968 = Chinese Character "One"/"First") .
There are many methods of encoding a list of Unicode entities into a stream of bytes. Python is assuming your stream of bytes is the result of a particular such method, called "UTF-8".
However, your stream of bytes has data that does not correspond to that method. Thus the error is raised.
You need to figure out the encoding of the stream of bytes, and tell Python that encoding.
It's important to know if you're using Python 2 or 3, and the code leading up to this exception to see where your bytes came from and what the appropriate way to deal with them is.
If it's from reading a file, you can explicity deal with the bytes read. But you must be sure of the file encoding.
If it's from a string that is part of your source code, then Python is assuming the "wrong thing" about your source files... perhaps $LC_ALL or $LANG needs to be set. This is a good time to firmly understand the concept of encoding, and how text editors choose an encoding to write, and what is standard for your language and operating system.

In addition to what Joe said, chardet is a useful tool to detect encoding of the source data.

Somewhere you have a plain string encoded as "Windows-1252" (or "cp1252") containing a "RIGHT SINGLE QUOTATION MARK" (’) instead of an APOSTROPHE ('). This could come from a file you read, or even in a Python source file of yours; you could be running Python 2.x and have a # -*- coding: utf8 -*- line somewhere near the script's beginning, or you could be running Python 3.x.
You don't give enough data; however, somewhere you have a cp1252-encoded string, which you try (explicitly or implicitly) to decode to unicode as utf-8. This won't work.
Give us more info, and we'll try again to help you.
Joe Koberg's answer reminded me of an older answer of mine, which some people have found helpful: Python UnicodeDecodeError - Am I misunderstanding encode?

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.