I am dealing with unicode strings returned by the python-lastfm library.
I assume somewhere on the way, the library gets the encoding wrong and returns a unicode string that may contain invalid characters.
For example, the original string I am expecting in the variable a is "Glück":
>>> a
u'Gl\xfcck'
>>> print a
Traceback (most recent call last):
File "", line 1, in
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 2: ordinal not in range(128)
\xfc is the escaped value 252 (0xFC), which corresponds to the latin-1 encoding of "ü". Somehow this gets embedded in the unicode string in a way Python can't handle on its own.
How do I convert this back to a normal str or unicode string that contains the original "Glück"? I tried playing around with the decode/encode methods, but either got a UnicodeEncodeError, or a string containing the sequence \xfc.
You have to convert your unicode string into a standard string (str) using some encoding, e.g. UTF-8:
some_unicode_string.encode('utf-8')
Apart from that: this is a dupe of
BeautifulSoup findall with class attribute - unicode encode error
and at least ten other related questions on SO. Research first.
Your unicode string is fine:
>>> import unicodedata
>>> unicodedata.name(u"\xfc")
'LATIN SMALL LETTER U WITH DIAERESIS'
The problem you see at the interactive prompt is that the interpreter doesn't know what encoding to use to output the string to your terminal, so it falls back to the "ascii" codec, and that codec only knows how to deal with ASCII characters. It works fine on my machine because sys.stdout.encoding is "UTF-8" for me, likely because my environment variable settings differ from yours:
>>> print u'Gl\xfcck'
Glück
At the beginning of your code, just after imports, add these 3 lines.
import sys # import sys package, if not already imported
reload(sys)
sys.setdefaultencoding('utf-8')
It will override the system default encoding (ascii) for the duration of your program.
Edit: You shouldn't do this unless you are sure of the consequences; see the comments below. This post is also helpful: Dangers of sys.setdefaultencoding('utf-8')
Do not cast to str() what you've got from model fields if it is already a unicode string.
(Oops, I totally missed that this is not Django-related.)
I stumbled upon this bug myself while processing a file containing German words; I was unaware that it had been encoded in UTF-8. The problem manifested itself when I started processing the words and some of them triggered the decoding error.
# python
Python 2.7.12 (default, Aug 22 2019, 16:36:40)
>>> utf8_word = u"Gl\xfcck"
>>> print("Word read was: {}".format(utf8_word))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 2: ordinal not in range(128)
I solved the error by calling the encode method on the string:
>>> print("Word read was: {}".format(utf8_word.encode('utf-8')))
Word read was: Glück
I am trying to use Python 3's unicode_escape to turn the escaped \n in my string into a real newline. The challenge is that there are non-ASCII characters present in the string: if I encode with UTF-8 and then decode the bytes using unicode_escape, the special character gets garbled. Is there any way to have the \n escaped to a newline without garbling the special character?
s = "hello\\nworld└--"
print(s.encode('utf8').decode('unicode_escape'))
Expected Result:
hello
world└--
Actual Result:
hello
worldâ--
As user wowcha observes, the unicode-escape codec assumes a latin-1 encoding, but your string contains a character that is not encodable as latin-1.
>>> s = "hello\\nworld└--"
>>> s.encode('latin-1')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2514' in position 12: ordinal not in range(256)
Encoding the string as utf-8 gets around the encoding problem, but results in mojibake when decoding from unicode-escape, as your example shows.
The solution is to use the backslashreplace error handler when encoding. This will convert the problem character to an escape sequence that can be encoded as latin-1 and does not get mangled when decoded from unicode-escape.
>>> s.encode('latin-1', errors='backslashreplace')
b'hello\\nworld\\u2514--'
>>> s.encode('latin-1', errors='backslashreplace').decode('unicode-escape')
'hello\nworld└--'
>>> print(s.encode('latin-1', errors='backslashreplace').decode('unicode-escape'))
hello
world└--
Try removing the second escape backslash and decoding using utf8:
>>> s = "hello\nworld└--"
>>> print(s.encode('utf8').decode('utf8'))
hello
world└--
I believe the problem you are having is that unicode_escape assumes your bytes are Latin-1: that is the codec it decodes from internally.
Looking at the Python documentation for codecs, we see this description of unicode_escape: "Encoding suitable as the contents of a Unicode literal in ASCII-encoded Python source code, except that quotes are not escaped. Decodes from Latin-1 source code. Beware that Python source code actually uses UTF-8 by default." This tells us that unicode_escape assumes your text is ISO Latin-1.
So if we run your code with latin1 encoding we get this error:
s.encode('latin1')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2514' in position 12: ordinal not in range(256)
And the offending character is '\u2514', which is '└'. The simplest way to put it is that this character cannot be represented in a Latin-1 string, which is why you get a different character back.
I also think it's worth pointing out that your string contains '\\n' and not just '\n'. The extra backslash means this is not a newline; it is a literal backslash followed by the letter n. Perhaps try not using the \\n...
I am using the unicodecsv drop-in module for Python 2.7 to read a CSV file containing columns of words in 28 different languages, some of which are accented and/or utilise completely different alphabet/character systems. I am loading the CSV like this:
with open(sourceFile, 'rU') as keywordCSV:
keywordList = csv.reader(keywordCSV, encoding='utf-8-sig', dialect=csv.excel)
but reading from keywordList currently produces Unicode escape sequences rather than the native character symbols. While this is not ideal (ideally I would be able to load the Unicode in the CSV as native character symbols from the start), it is acceptable as long as I can convert these into native character symbols later in the script (when exporting to whichever file type will make this easiest). How is this, or preferably the ideal case, done? I have tried workarounds such as these to no avail, and I am still not sure whether this is an interpreter issue or an encoding issue within the script.
The reason I used utf-8-sig when reading the file is that not doing so resulted in a BOM-related error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 155:
but this has now stopped happening for reasons unbeknown to me. Similarly, I am using 'rU' when opening the file, as not doing so produces:
_csv.Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
but I am not sure if either of these is appropriate.
In this question, printing each character one by one results in the native characters being printed (something that also works in my code when run from the terminal). Is there a way of iterating through the characters and converting each one to its native character?
Apologies for posting another question on this already saturated topic, but I haven't been able to get other people's suggestions working for this case. Perhaps I have been looking in the wrong place in trying to decode the encoded CSV output at the end of the script; perhaps the problem is instead in my csv.reader's encoding. Any help will be very much appreciated.
What you are seeing is the repr() of your Unicode characters. In Python 2.7, repr() only displays ASCII characters normally. Characters outside the ASCII range are displayed using escapes. This is for debugging purposes to make non-printing characters or characters not supported by the current code page visible. If you want to see the characters rendered, print them, but note that characters not supported by the terminal's configured code page may not work:
>>> s = u'\N{LATIN SMALL LETTER E WITH ACUTE}'
>>> s
u'\xe9'
>>> print repr(s)
u'\xe9'
>>> print s
é
>>> print unicode(s)
é
In the following case, the character isn't supported by the configured code page 437:
>>> s = u'\N{HORIZONTAL ELLIPSIS}'
>>> s
u'\u2026'
>>> print s
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\dev\Python27\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2026' in position 0: character maps to <undefined>
I am trying to fix this problem:
# -*- coding: utf-8 -*-
s = "Ñ ÑÑÑаÑ! Ð½ÐµÑ Ñил"
e = s.encode('ascii')
print e
but it gives me this error.
Traceback (most recent call last):
File "C:/Users/username/Desktop/unicode.py", line 3, in <module>
e = s.encode('ascii')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
How do I get the text to be readable? I have been trying for hours! Not sure how to fix this. Any help would be greatly appreciated!
You have a whole slew of problems here.
First, you've stuck Unicode characters into a str literal instead of a unicode literal. That's almost always a bad idea.
Second, you've called encode on a str. But encode is for converting unicode to str.* In order to do that, Python has to first decode your str to a unicode so that it can call encode on it. And if you force Python to decode for you without telling it which codec to use, it will use sys.getdefaultencoding(), which is almost never what you want. (In particular, it's not going to be UTF-8 just because your source encoding is.)
You can fix those first two problems just by adding one letter:
s = u"Ñ ÑÑÑаÑ! Ð½ÐµÑ Ñил"
But it's still not going to work. Why? Because you're asking it to encode non-ASCII characters into the ASCII character set. Which is impossible. So it's going to call the error handler. Since you didn't specify an error handler, you get the default, called strict. As the name implies, strict raises an exception when you ask it do something impossible.
There are other error handlers—see the str.encode docs for a full list. I'm not sure what output you were expecting, but you can get backslash-escaped text, or text with all the non-ASCII characters replaced by ?s, or a few other possibilities. For example:
e = s.encode('ascii', 'replace')
Of course if you didn't actually want ASCII, but rather UTF-8, then everything is easy: just tell Python you want UTF-8 instead of ASCII:
e = s.encode('utf-8')
* There are a few special codecs, like hex and gzip, that convert str to str, unicode to unicode, or str to unicode, but ascii isn't one of them.
I'm a very new Python programmer, working on my first script. The script pulls in text from a plist string, then does some things to it, then packages it up as an HTML email.
From a few of the entries, I'm getting the dreaded Unicode "outside ordinal 128" error.
Having read as much as I can find about encoding and decoding, I know that it is important for me to get the text encoded, but I'm having a difficult time understanding when or how exactly to do this.
The offending variable is first pulled in using plistlib, and converted to HTML from markdown, like this:
entry = result['Entry Text']
donotecontent = markdown2.markdown(entry)
Later, it is put in the email like this:
html = donotecontent + '<br /><br />' + var3
part1 = MIMEText(html, 'html')
msg.attach(part1)
My question is: what is the best way to make sure that Unicode characters in this content don't cause an error to be thrown? I prefer not to ignore the characters.
Sorry for my broken English. I speak Chinese/Japanese and use CJK characters every day.
Ceron's answer covers most of this problem, so I won't talk about how to use encode()/decode() again.
When we use str() to cast a unicode object, it encodes the unicode string to byte data; when we use unicode() to cast a str object, it decodes the byte data to unicode characters.
The encoding used is whatever sys.getdefaultencoding() returns.
By default, sys.getdefaultencoding() returns 'ascii', so an encoding/decoding exception may be thrown when doing a str()/unicode() cast.
If you want to do str <-> unicode conversion with str() or unicode(), with implicit 'utf-8' encoding/decoding, you can execute the following statements:
import sys # sys.setdefaultencoding is cancelled by site.py
reload(sys) # to re-enable sys.setdefaultencoding()
sys.setdefaultencoding('utf-8')
After that, str() and unicode() will convert any basestring object using the utf-8 encoding.
However, I would prefer to use encode()/decode() explicitly, because it makes code maintenance easier for me.
Assuming you're using Python 2.x, remember: there are two types of strings: str and unicode. str are byte strings, whereas unicode are Unicode strings. Unicode strings can represent text in any language, but to store text in a computer or to send it via email, you need to represent that text using bytes. To represent text using bytes, you need a coding format. There are many coding formats; Python uses ascii by default, but ascii can only represent a few characters, mostly English letters. If you try to encode text with other letters using ascii, you will get the famous "outside ordinal 128" error. For example:
>>> u'Cerón'.encode('ascii')
Traceback (most recent call last):
File "<input>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf3' in position 3:
ordinal not in range(128)
The same happens if you use str(u'Cerón'), because Python uses ascii by default to convert unicode to str.
To make this work, you have to use a different coding format. UTF-8 is a coding format that can express any unicode text as bytes. To convert the u'Cerón' unicode string to bytes you have to use:
>>> u'Cerón'.encode('utf-8')
'Cer\xc3\xb3n'
No errors this time.
Now, back to your email problem. I can see that you're using MIMEText, which accepts an already encoded str argument, in your case the html variable. MIMEText also accepts an argument specifying what kind of encoding is being used. So, in your case, if html is a unicode string, you have to encode it as utf-8 and pass the charset parameter too (because MIMEText uses ascii by default):
part1 = MIMEText(html.encode('utf-8'), 'html', 'utf-8')
But be careful, because if html is already a str instead of unicode, the encoding will fail. This is one of the traps of Python 2.x: it lets you call encode on an already-encoded str, but to do so it first implicitly decodes the str as ascii, which throws an error as soon as the str contains non-ASCII bytes.
Another problem to add to the list is that utf-8 is compatible with ascii characters, and Python will always try to automatically encode/decode strings using ascii. If you're not properly encoding your strings but you only use ascii characters, things will work fine. However, if for some reason some non-ascii characters slip into your message, you will get the error; this makes such bugs harder to detect.
Remember: You can't decode a unicode, and you can't encode a str
>>> u"\xa0".decode("ascii", "ignore")
Traceback (most recent call last):
File "<pyshell#7>", line 1, in <module>
u"\xa0".decode("ascii", "ignore")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 0: ordinal not in range(128)
>>> "\xc2".encode("ascii", "ignore")
Traceback (most recent call last):
File "<pyshell#6>", line 1, in <module>
"\xc2".encode("ascii", "ignore")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
Check out this excellent tutorial.
I am working against an application that seems keen on returning what I believe to be double-UTF-8-encoded strings.
I send the string u'XüYß' (that is, u'X\u00fcY\u00df') encoded using UTF-8, so the bytes on the wire are X\xc3\xbcY\xc3\x9f.
The server should simply echo back what I sent it, yet it returns the following: X\xc3\x83\xc2\xbcY\xc3\x83\xc2\x9f (it should be X\xc3\xbcY\xc3\x9f). If I decode it using str.decode('utf-8'), it becomes u'X\xc3\xbcY\xc3\x9f', which looks like a unicode string containing the original string encoded as UTF-8.
But Python won't let me decode a unicode string without re-encoding it first, which fails for some reason that escapes me:
>>> ret = 'X\xc3\x83\xc2\xbcY\xc3\x83\xc2\x9f'.decode('utf-8')
>>> ret
u'X\xc3\xbcY\xc3\x9f'
>>> ret.decode('utf-8')
# Throws UnicodeEncodeError: 'ascii' codec can't encode ...
How do I persuade Python to re-decode the string? And is there any practical way of debugging what's actually in the strings, without passing them through all the implicit conversions that print uses?
(And yes, I have reported this behaviour with the developers of the server-side.)
ret.decode() implicitly tries to encode ret first, using the system encoding - in your case ascii.
If you explicitly encode the unicode string, you should be fine. There is a builtin encoding that does what you need:
>>> 'X\xc3\xbcY\xc3\x9f'.encode('raw_unicode_escape').decode('utf-8')
'XüYß'
Really, .encode('latin1') (or cp1252) would be OK, because that's what the server is almost certainly using. The raw_unicode_escape codec will just give you something recognizable at the end instead of raising an exception:
>>> '€\xe2\x82\xac'.encode('raw_unicode_escape').decode('utf8')
'\\u20ac€'
>>> '€\xe2\x82\xac'.encode('latin1').decode('utf8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac' in position 0: ordinal not in range(256)
In case you run into this sort of mixed data, you can use the codec again, to normalize everything:
>>> '€\xe2\x82\xac'.encode('raw_unicode_escape').decode('utf8')
'\\u20ac€'
>>> '\\u20ac€'.encode('raw_unicode_escape')
b'\\u20ac\\u20ac'
>>> '\\u20ac€'.encode('raw_unicode_escape').decode('raw_unicode_escape')
'€€'
What you want is the encoding where Unicode code point X is encoded to the same byte value X. For code points in the range 0-255 you have this in the latin-1 encoding:
def double_decode(bstr):
    # Interpret the bytes as UTF-8, collapse the resulting code points
    # back to single bytes via latin-1, then decode the real UTF-8 underneath.
    return bstr.decode("utf-8").encode("latin-1").decode("utf-8")
Don't use this! Use #hop's solution.
My nasty hack: (cringe! but quietly. It's not my fault, it's the server developers' fault)
def double_decode_unicode(s, encoding='utf-8'):
    # Decode once, then squash each code point back to a raw byte with
    # chr(ord(c)) (only safe for code points < 256), and decode again.
    return ''.join(chr(ord(c)) for c in s.decode(encoding)).decode(encoding)
Then,
>>> double_decode_unicode('X\xc3\x83\xc2\xbcY\xc3\x83\xc2\x9f')
u'X\xfcY\xdf'
>>> print _
XüYß
Here's a little script that might help you, doubledecode.py --
https://gist.github.com/1282752