Why am I getting this issue? and how do I resolve it?
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 24: unexpected code byte
Thank you
Somewhere, perhaps subtly, you are asking Python to turn a stream of bytes into a "string" of characters.
Don't think of a string as "bytes". A string is a list of numbers, each number having an agreed meaning in Unicode. (#65 = Latin Capital A. #19968 = Chinese Character "One"/"First") .
There are many methods of encoding a list of Unicode entities into a stream of bytes. Python is assuming your stream of bytes is the result of a particular such method, called "UTF-8".
However, your stream of bytes has data that does not correspond to that method. Thus the error is raised.
You need to figure out the encoding of the stream of bytes, and tell Python that encoding.
It's important to know if you're using Python 2 or 3, and the code leading up to this exception to see where your bytes came from and what the appropriate way to deal with them is.
If it's from reading a file, you can explicity deal with the bytes read. But you must be sure of the file encoding.
If it's from a string that is part of your source code, then Python is assuming the "wrong thing" about your source files... perhaps $LC_ALL or $LANG needs to be set. This is a good time to firmly understand the concept of encoding, and how text editors choose an encoding to write, and what is standard for your language and operating system.
In addition to what Joe said, chardet is a useful tool to detect encoding of the source data.
Somewhere you have a plain string encoded as "Windows-1252" (or "cp1252") containing a "RIGHT SINGLE QUOTATION MARK" (’) instead of an APOSTROPHE ('). This could come from a file you read, or even in a Python source file of yours; you could be running Python 2.x and have a # -*- coding: utf8 -*- line somewhere near the script's beginning, or you could be running Python 3.x.
You don't give enough data; however, somewhere you have a cp1252-encoded string, which you try (explicitly or implicitly) to decode to unicode as utf-8. This won't work.
Give us more info, and we'll try again to help you.
Joe Koberg's answer reminded me of an older answer of mine, which some people have found helpful: Python UnicodeDecodeError - Am I misunderstanding encode?
Related
I'm trying to deal with unicode in python 2.7.2. I know there is the .encode('utf-8') thing but 1/2 the time when I add it, I get errors, and 1/2 the time when I don't add it I get errors.
Is there any way to tell python - what I thought was an up-to-date & modern language to just use unicode for strings and not make me have to fart around with .encode('utf-8') stuff?
I know... python 3.0 is supposed to do this, but I can't use 3.0 and 2.7 isn't all that old anyways...
For example:
url = "http://en.wikipedia.org//w/api.php?action=query&list=search&format=json&srlimit=" + str(items) + "&srsearch=" + urllib2.quote(title.encode('utf-8'))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 19: ordinal not in range(128)
Update
If I remove all my .encode statements from all my code and add # -*- coding: utf-8 -*- to the top of my file, right under the #!/usr/bin/python then I get the following, same as if I didn't add the # -*- coding: utf-8 -*- at all.
/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py:1250: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
return ''.join(map(quoter, s))
Traceback (most recent call last):
File "classes.py", line 583, in <module>
wiki.getPage(title)
File "classes.py", line 146, in getPage
url = "http://en.wikipedia.org/w/api.php?action=query&prop=revisions&format=json&rvprop=content&rvlimit=1&titles=" + urllib2.quote(title)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 1250, in quote
return ''.join(map(quoter, s))
KeyError: u'\xf1'
I'm not manually typing in any string, I parsing HTML and json from websites. So the scripts/bytestreams/whatever they are, are all created by python.
Update 2 I can move the error along, but it just keeps coming up in new places. I was hoping python would be a useful scripting tool, but looks like after 3 days of no luck I'll just try a different language. Its a shame, python is preinstalled on osx. I've marked correct the answer that fixed the one instance of the error I posted.
This is a very old question but just wanted to add one partial suggestion. While I sympathise with the OP's pain - having gone through it a lot myself - here's one (partial) answer to make things "easier". Put this at the top of any Python 2.7 script:
from __future__ import unicode_literals
This will at least ensure that your own literal strings default to unicode rather than str.
There is no way to make unicode "just work" apart from using unicode strings everywhere and immediately decoding any encoded string you receive. The problem is that you MUST ALWAYS keep straight whether you're dealing with encoded or unencoded data, or use tools that keep track of it for you, or you're going to have a bad time.
Python 2 does some things that are problematic for this: it makes str the "default" rather than unicode for things like string literals, it silently coerces str to unicode when you add the two, and it lets you call .encode() on an already-encoded string to double-encode it. As a result, there are a lot of python coders and python libraries out there that have no idea what encodings they're designed to work with, but are nonetheless designed to deal with some particular encoding since the str type is designed to let the programmer manage the encoding themselves. And you have to think about the encoding each time you use these libraries since they don't support the unicode type themselves.
In your particular case, the first error tells you you're dealing with encoded UTF-8 data and trying to double-encode it, while the 2nd tells you you're dealing with UNencoded data. It looks like you may have both. You should really find and fix the source of the problem (I suspect it has to do with the silent coercion I mentioned above), but here's a hack that should fix it in the short term:
encoded_title = title
if isinstance(encoded_title, unicode):
encoded_title = title.encode('utf-8')
If this is in fact a case of silent coercion biting you, you should be able to easily track down the problem using the excellent unicode-nazi tool:
python -Werror -municodenazi myprog.py
This will give you a traceback right at the point unicode leaks into your non-unicode strings, instead of trying troubleshooting this exception way down the road from the actual problem. See my answer on this related question for details.
Yes, define your unicode data as unicode literals:
>>> u'Hi, this is unicode: üæ'
u'Hi, this is unicode: üæ'
You usually want to use '\uxxxx` unicode escapes or set a source code encoding. The following line at the top of your module, for example, sets the encoding to UTF-8:
# -*- coding: utf-8 -*-
Read the Python Unicode HOWTO for the details, such as default encodings and such (the default source code encoding, for example, is ASCII).
As for your specific example, your title is not a Unicode literal but a python byte string, and python is trying to decode it to unicode for you just so you can encode it again. This fails, as the default codec for such automatic encodings is ASCII:
>>> 'å'.encode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
Encoding only applies to actual unicode strings, so a byte string needs to be explicitly decoded:
>>> 'å'.decode('utf-8').encode('utf-8')
'\xc3\xa5'
If you are used to Python 3, then unicode literals in Python 2 (u'') are the new default string type in Python 3, while regular (byte) strings in Python 2 ('') are the same as bytes objects in Python 3 (b'').
If you have errors both with and without the encode call on title, you have mixed data. Test the title and encode as needed:
if isinstance(title, unicode):
title = title.encode('utf-8')
You may want to find out what produces the mixed unicode / byte string titles though, and correct that source to always produce one or the other.
be sure that title in your title.encode("utf-8") is type of unicode and dont use str("İŞşĞğÖöÜü")
use unicode("ĞğıIİiÖöŞşcçÇ") in your stringifiers
Actually, the easiest way to make Python work with unicode is to use Python 3, where everything is unicode by default.
Unfortunately, there are not many libraries written for P3, as well as some basic differences in coding & keyword use. That's the problem I have: the libraries I need are only available for P 2.7, and I don't know enough to convert them to P 3. :(
I created a file containing a dictionary with data written in Spanish (i.e. Damián, etc.):
fileNameX.write(json.dumps(dictionaryX, indent=4))
The data come from some fql fetching operations, i.e.:
select name from user where uid in XXX
When I open the file, I find that, for instance, "Damián" looks like "Dami\u00e1n".
I've tried some options:
ensure_ascii=False:
fileNameX.write(json.dumps(dictionaryX, indent=4, ensure_ascii=False))
But I get an error (UnicodeEncodeError: 'ascii' codec can´t encode character u'\xe1' in position XXX: ordinal not in range(128)).
encode(encoding='latin-1):
dictionaryX.append({
'name': unicodeVar.encode(encoding='latin-1'),
...
})
But I get another error (UnicodeDecodeError: 'utf8' codec can't decode byte 0xe1 in position XXX: invalid continuation byte)
To sum up, I've tried several possibilities, but have less than a clue. I'm lost. Please, I need help. Thanks!
You have many options, and have stumbled upon something rather complicated that depends on your Python version and which you absolutely must understand fully in order to write correct code. Generally the approach taken in 3.x is stricter and a bit harder to work with, but it is much less likely that you will make a mistake or get yourself into a complicated situation. (Based on the exact symptoms you report, you seem to be using 2.x.)
json.dumps has different behaviour in 2.x and 3.x. In 2.x, it produces a str, which is a byte-string (unknown encoding). In 3.x, it still produces a str, but now str in 3.x is a proper Unicode string.
JSON is inherently a Unicode-supporting format, but it expects files to be in UTF-8 encoding. However, please understand that JSON supports \u style escapes in strings. When you read in this data, you will get the correct encoded string back. The reading code produces unicode objects (no matter whether you use 2.x or 3.x) when it reads strings out of the JSON.
When I open the file, I find that, for instance, "Damián" looks like "Dami\u00e1n"
á cannot be represented in ASCII. It gets encoded as \u00e1 by default, to avoid the other problems you had. This happens even in 3.x.
ensure_ascii=False
This disables the previous encoding. In 2.x, it means you get a unicode object instead - a real Unicode object, preserving the original á character. In 3.x, it means that the character is not explicitly translated. But either way, ensure_ascii=False means that json.dumps will give you a Unicode string.
Unicode strings must be encoded to be written to a file. There is no such thing as "unicode data"; Unicode is an abstraction. In 2.x, this encoding is implicitly 'ascii' when you feed a Unicode object to file.write; it was expecting a str. To get around this, you can use the codecs module, or explicitly encode as 'utf-8' before writing. In 3.x, the encoding is set with the encoding keyword argument when you open the file (the default is again probably not what you want).
encode(encoding='latin-1')
Here, you are encoding before producing the dictionary, so that you have a str object in your data. Now a problem occurs because when there are str objects in your data, the JSON encoder assumes, by default, that they represent Unicode strings in UTF-8 encoding. This can be changed, in 2.x, using the encoding keyword argument to json.dumps. (In 3.x, the encoder will simply refuse to serialize bytes objects, i.e. non-Unicode strings!)
However, if your goal is simply to get the data into the file directly, then json.dumps is the wrong tool for you. Have you wondered what that s in the name is for? It stands for "string"; this is the special case. The ordinary case, in fact, is writing directly to a file! (Instead of giving you a string and expecting you to write it yourself.) Which is what json.dump (no 's') does. Again, the JSON standard expects UTF-8 encoding, and again 2.x has an encoding keyword parameter that defaults to UTF-8 (you should leave this alone).
Use codecs.open() to open fileNameX with a specific encoding like encoding='utf-8' for example instead of using open().
Also, json.dump().
Since the string has a \u inside it that means it's a Unicode string. The string is actually correct! Your problem lies in displaying the string. If you print the string, Python's output encoding should print it in the proper encoding for your environment.
For example, this is what I get inside IDLE on Windows:
>>> print u'Dami\u00e1n'
Damián
I'm a beginner having trouble decoding several dozen CSV file with numbers + (Simplified) Chinese characters to UTF-8 in Python 2.7.
I do not know the encoding of the input files so I have tried all the possible encodings I am aware of -- GB18030, UTF-7, UTF-8, UTF-16 & UTF-32 (LE & BE). Also, for good measure, GBK and GB3212, though these should be a subset of GB18030. The UTF ones all stop when they get to the first Chinese characters. The other encodings stop somewhere in the first line except GB18030. I thought this would be the solution because it read through the first few files and decoded them fine. Part of my code, reading line by line, is:
line = line.decode("GB18030")
The first 2 files I tried to decode worked fine. Midway through the third file, Python spits out
UnicodeDecodeError: 'gb18030' codec can't decode bytes in position 168-169: illegal multibyte sequence
In this file, there are about 5 such errors in about a million lines.
I opened the input file in a text editor and checked which characters were giving the decoding errors, and the first few all had Euro signs in a particular column of the CSV files. I am fairly confident these are typos, so I would just like to delete the Euro characters. I would like to examine types of encoding errors one by one; I would like to get rid of all the Euro errors but do not want to just ignore others until I look at them first.
Edit: I used chardet which gave GB2312 as the encoding with .99 confidence for all files. I tried using GB2312 to decode which gave:
UnicodeDecodeError: 'gb2312' codec can't decode bytes in position 108-109: illegal multibyte sequence
""" ... GB18030. I thought this would be the solution because it read through the first few files and decoded them fine.""" -- please explain what you mean. To me, there are TWO criteria for a successful decoding: firstly that raw_bytes.decode('some_encoding') didn't fail, secondly that the resultant unicode when displayed makes sense in a particular language. Every file in the universe will pass the first test when decoded with latin1 aka iso_8859_1. Many files in East Asian languages pass the first test with gb18030, because mostly the frequently used characters in Chinese, Japanese, and Korean are encoded using the same blocks of two-byte sequences. How much of the second test have you done?
Don't muck about looking at the data in an IDE or text editor. Look at it in a web browser; they usually make a better job of detecting encodings.
How do you know that it's a Euro character? By looking at the screen of a text editor that's decoding the raw bytes using what encoding? cp1252?
How do you know it contains Chinese characters? Are you sure it's not Japanese? Korean? Where did you get it from?
Chinese files created in Hong Kong, Taiwan, maybe Macao, and other places off the mainland use big5 or big5_hkscs encoding -- try that.
In any case, take Mark's advice and point chardet at it; chardet usually makes a reasonably good job of detecting the encoding used if the file is large enough and correctly encoded Chinese/Japanese/Korean -- however if someone has been hand editing the file in a text editor using a single-byte charset, a few illegal characters may cause the encoding used for the other 99.9% of the characters not to be detected.
You may like to do print repr(line) on say 5 lines from the file and edit the output into your question.
If the file is not confidential, you may like to make it available for download.
Was the file created on Windows? How are you reading it in Python? (show code)
Update after OP comments:
Notepad etc don't attempt to guess the encoding; "ANSI" is the default. You have to tell it what to do. What you are calling the Euro character is the raw byte "\x80" decoded by your editor using the default encoding for your environment -- the usual suspect being "cp1252". Don't use such an editor to edit your file.
Earlier you were talking about the "first few errors". Now you say you have 5 errors total. Please explain.
If the file is indeed almost correct gb18030, you should be able to decode the file line by line, and when you get such an error, trap it, print the error message, extract the byte offsets from the message, print repr(two_bad_bytes), and keep going. I'm very interested in which of the two bytes the \x80 appears. If it doesn't appear at all, the "Euro character" is not part of your problem. Note that \x80 can appear validly in a gb18030 file, but only as the 2nd byte of a 2-byte sequence starting with \x81 to \xfe.
It's a good idea to know what your problem is before you try to fix it. Trying to fix it by bashing it about with Notepad etc in "ANSI" mode is not a good idea.
You have been very coy about how you decided that the results of gb18030 decoding made sense. In particular I would be closely scrutinising the lines where gbk fails but gb18030 "works" -- there must be some extremely rare Chinese characters in there, or maybe some non-Chinese non-ASCII characters ...
Here's a suggestion for a better way to inspect the damage: decode each file with raw_bytes.decode(encoding, 'replace') and write the result (encoded in utf8) to another file. Count the errors by result.count(u'\ufffd'). View the output file with whatever you used to decide that the gb18030 decoding made sense. The U+FFFD character should show up as a white question mark inside a black diamond.
If you decide that the undecodable pieces can be discarded, the easiest way is raw_bytes.decode(encoding, 'ignore')
Update after further information
All those \\ are confusing. It appears that "getting the bytes" involves repr(repr(bytes)) instead of just repr(bytes) ... at the interactive prompt, do either bytes (you'll get an implict repr()), or print repr(bytes) (which won't get the implicit repr())
The blank space: I presume that you mean that '\xf8\xf8'.decode('gb18030') is what you interpret as some kind of full-width space, and that the interpretation is done by visual inspection using some unnameable viewer software. Is that correct?
Actually, '\xf8\xf8'.decode('gb18030') -> u'\e28b'. U+E28B is in the Unicode PUA (Private Use Area). The "blank space" presumably means that the viewer software unsuprisingly doesn't have a glyph for U+E28B in the font it is using.
Perhaps the source of the files is deliberately using the PUA for characters that are not in standard gb18030, or for annotation, or for transmitting pseudosecret info. If so, you will need to resort to the decoding tambourine, an offshoot of recent Russian research reported here.
Alternative: the cp939-HKSCS theory. According to the HK government, HKSCS big5 code FE57 was once mapped to U+E28B but is now mapped to U+28804.
The "euro": You said """Due to the data I can't share the whole line, but what I was calling the euro char is in: \xcb\xbe\x80\x80" [I'm assuming a \ was omitted from the start of that, and the " is literal]. The "euro character", when it appears, is always in the same column that I don't need, so I was hoping to just use "ignore". Unfortunately, since the "euro char" is right next to quotes in the file, sometimes "ignore" gets rid of both the euro character as well [as] quotes, which poses a problem for the csv module to determine columns"""
It would help enormously if you could show the patterns of where these \x80 bytes appear in relation to the quotes and the Chinese characters -- keep it readable by just showing the hex, and hide your confidential data e.g. by using C1 C2 to represent "two bytes which I am sure represent a Chinese character". For example:
C1 C2 C1 C2 cb be 80 80 22 # `\x22` is the quote character
Please supply examples of (1) where the " is not lost by 'replace' or 'ignore' (2) where the quote is lost. In your sole example to date, the " is not lost:
>>> '\xcb\xbe\x80\x80\x22'.decode('gb18030', 'ignore')
u'\u53f8"'
And the offer to send you some debugging code (see example output below) is still open.
>>> import decode_debug as de
>>> def logger(s):
... sys.stderr.write('*** ' + s + '\n')
...
>>> import sys
>>> de.decode_debug('\xcb\xbe\x80\x80\x22', 'gb18030', 'replace', logger)
*** input[2:5] ('\x80\x80"') doesn't start with a plausible code sequence
*** input[3:5] ('\x80"') doesn't start with a plausible code sequence
u'\u53f8\ufffd\ufffd"'
>>> de.decode_debug('\xcb\xbe\x80\x80\x22', 'gb18030', 'ignore', logger)
*** input[2:5] ('\x80\x80"') doesn't start with a plausible code sequence
*** input[3:5] ('\x80"') doesn't start with a plausible code sequence
u'\u53f8"'
>>>
Eureka: -- Probable cause of sometimes losing the quote character --
It appears there is a bug in the gb18030 decoder replace/ignore mechanism: \x80 is not a valid gb18030 lead byte; when it is detected the decoder should attempt to resync with the NEXT byte. However it seems to be ignoring both the \x80 AND the following byte:
>>> '\x80abcd'.decode('gb18030', 'replace')
u'\ufffdbcd' # the 'a' is lost
>>> de.decode_debug('\x80abcd', 'gb18030', 'replace', logger)
*** input[0:4] ('\x80abc') doesn't start with a plausible code sequence
u'\ufffdabcd'
>>> '\x80\x80abcd'.decode('gb18030', 'replace')
u'\ufffdabcd' # the second '\x80' is lost
>>> de.decode_debug('\x80\x80abcd', 'gb18030', 'replace', logger)
*** input[0:4] ('\x80\x80ab') doesn't start with a plausible code sequence
*** input[1:5] ('\x80abc') doesn't start with a plausible code sequence
u'\ufffd\ufffdabcd'
>>>
You might try chardet.
Try this:
codecs.open(file, encoding='gb18030', errors='replace')
Don't forget the parameter errors, you can also set it to 'ignore'.
In python:
u'\u3053\n'
Is it utf-16?
I'm not really aware of all the unicode/encoding stuff, but this type of thing is coming up in my dataset,
like if I have a=u'\u3053\n'.
print gives an exception and
decoding gives an exception.
a.encode("utf-16") > '\xff\xfeS0\n\x00'
a.encode("utf-8") > '\xe3\x81\x93\n'
print a.encode("utf-8") > πüô
print a.encode("utf-16") > ■S0
What's going on here?
It's a unicode character that doesn't seem to be displayable in your terminals encoding. print tries to encode the unicode object in the encoding of your terminal and if this can't be done you get an exception.
On a terminal that can display utf-8 you get:
>>> print u'\u3053'
こ
Your terminal doesn't seem to be able to display utf-8, else at least the print a.encode("utf-8") line should produce the correct character.
You ask:
u'\u3053\n'
Is it utf-16?
The answer is no: it's unicode, not any specific encoding. utf-16 is an encoding.
To print a Unicode string effectively to your terminal, you need to find out what encoding that terminal is willing to accept and able to display. For example, the Terminal.app on my laptop is set to UTF-8 and with a rich font, so:
(source: aleax.it)
...the Hiragana letter displays correctly. On a Linux workstation I have a terminal program that keeps resetting to Latin-1 so it would mangle things somewhat like yours -- I can set it to utf-8, but it doesn't have huge number of glyphs in the font, so it would display somewhat-useless placeholder glyphs instead.
Character U+3053 "HIRAGANA LETTER KO".
The \xff\xfe bit at the start of the UTF-16 binary format is the encoded byte order mark (U+FEFF), then "S0" is \x5e\x30, then there's the \n from the original string. (Each of the characters has its bytes "reversed" as it's using little endian UTF-16 encoding.)
The UTF-8 form represents the same Hiragana character in three bytes, with the bit pattern as documented here.
Now, as for whether you should really have it in your data set... where is this data coming from? Is it reasonable for it to have Hiragana characters in it?
Here's the Unicode HowTo Doc for Python 2.6.2:
http://docs.python.org/howto/unicode.html
Also see the links in the Reference section of that document for other explanations, including one by Joel Spolsky.
Any thoughts on why this isn't working? I really thought 'ignore' would do the right thing.
>>> 'add \x93Monitoring\x93 to list '.encode('latin-1','ignore')
Traceback (most recent call last):
File "<interactive input>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 4: ordinal not in range(128)
… There's a reason they're called "encodings" …
A little preamble: think of unicode as the norm, or the ideal state. Unicode is just a table of characters. №65 is latin capital A. №937 is greek capital omega. Just that.
In order for a computer to store and-or manipulate Unicode, it has to encode it into bytes. The most straightforward encoding of Unicode is UCS-4; every character occupies 4 bytes, and all ~1000000 characters are available. The 4 bytes contain the number of the character in the Unicode tables as a 4-byte integer. Another very useful encoding is UTF-8, which can encode any Unicode character with one to four bytes. But there also are some limited encodings, like "latin1", which include a very limited range of characters, mostly used by Western countries. Such encodings use only one byte per character.
Basically, Unicode can be encoded with many encodings, and encoded strings can be decoded to Unicode. The thing is, Unicode came quite late, so all of us that grew up using an 8-bit character set learned too late that all this time we worked with encoded strings. The encoding could be ISO8859-1, or windows CP437, or CP850, or, or, or, depending on our system default.
So when, in your source code, you enter the string "add “Monitoring“ to list" (and I think you wanted the string "add “Monitoring” to list", note the second quote), you actually are using a string already encoded according to your system's default codepage (by the byte \x93 I assume you use Windows codepage 1252, “Western”). If you want to get Unicode from that, you need to decode the string from the "cp1252" encoding.
So, what you meant to do, was:
"add \x93Monitoring\x94 to list".decode("cp1252", "ignore")
It's unfortunate that Python 2.x includes an .encode method for strings too; this is a convenience function for "special" encodings, like the "zip" or "rot13" or "base64" ones, which have nothing to do with Unicode.
Anyway, all you have to remember for your to-and-fro Unicode conversions is:
a Unicode string gets encoded to a Python 2.x string (actually, a sequence of bytes)
a Python 2.x string gets decoded to a Unicode string
In both cases, you need to specify the encoding that will be used.
I'm not very clear, I'm sleepy, but I sure hope I help.
PS A humorous side note: Mayans didn't have Unicode; ancient Romans, ancient Greeks, ancient Egyptians didn't too. They all had their own "encodings", and had little to no respect for other cultures. All these civilizations crumbled to dust. Think about it people! Make your apps Unicode-aware, for the good of mankind. :)
PS2 Please don't spoil the previous message by saying "But the Chinese…". If you feel inclined or obligated to do so, though, delay it by thinking that the Unicode BMP is populated mostly by chinese ideograms, ergo Chinese is the basis of Unicode. I can go on inventing outrageous lies, as long as people develop Unicode-aware applications.
encode is available to unicode strings, but the string you have there does not seems unicode (try with u'add \x93Monitoring\x93 to list ')
>>> u'add \x93Monitoring\x93 to list '.encode('latin-1','ignore')
'add \x93Monitoring\x93 to list '
And the magic line is:
unicodedata.normalize('NFKD', text).encode('utf-8', 'ignore')
The one liner that wont raise exceptions when it is most needed (remove bad Unicode characters...)
This seems to work:
'add \x93Monitoring\x93 to list '.decode('latin-1').encode('latin-1')
Any issues with that? I wonder when 'ignore', 'replace' and other such encode error handling comes in?