Unicode and overriding '__str__' - python

I'm getting a unicode error only when overriding my class' __str__ method. What's going on?
In Test.py:
class Obj(object):
    def __init__(self):
        self.title = u'\u2018'
    def __str__(self):
        return self.title

print "1: ", Obj().title
print "2: ", str(Obj())
Running this I get:
$ python Test.py
1: ‘
2:
Traceback (most recent call last):
File "Test.py", line 11, in <module>
print "2: ", str(Obj())
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2018' in position 0: ordinal not in range(128)
EDIT: Please don't just answer that str(u'\u2018') also raises an error (while that may be related). That circumvents the entire purpose of built-in method overloading: at no point should this code call str(u'\u2018')!

You're using Python 2.x. str() calls __str__ and expects you to return a string—that is, a str. But you're not; you're returning a unicode object. So str() helpfully tries to convert that to a str since it's what str() is supposed to return.
Now, in Python 2.x strings are sequences of bytes, not codepoints, so Python is trying to convert your Unicode object to a sequence of bytes. Since you didn't (and can't, in this scenario) specify what encoding to use when making the string, Python uses the default encoding of ASCII. This fails because ASCII can't represent the character.
Possible solutions:
Use Python 3, where all strings are Unicode. This will provide you with an entertainingly different set of things to wrap your head around, but this won't be one of them.
Override __unicode__() instead of __str__() and use unicode() instead of str() when converting your object to a string. You still have the problem (shared with Python 3) of how to get that converted into a sequence of bytes that will output correctly.
Figure out what encoding your terminal is using (i.e. sys.stdout.encoding) and have __str__() convert the Unicode object to that encoding before returning it. Note that there's still no guarantee that the character is representable in that encoding; you can't convert this example string to the default Windows terminal encoding, for instance. In that case you could fall back to e.g. the unicode-escape encoding if you get an exception trying to convert to the output encoding (a sketch combining the last two options follows below).
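For instance, here is a minimal sketch combining the last two options. It is only a sketch: treating a missing sys.stdout.encoding as ASCII and the unicode-escape fallback are choices made for illustration, not the only way to do it.

import sys

class Obj(object):
    def __init__(self):
        self.title = u'\u2018'

    def __unicode__(self):
        # The natural text form of the object.
        return self.title

    def __str__(self):
        # Encode for the terminal if its encoding is known; when output
        # is piped, sys.stdout.encoding is None, so assume ASCII.
        encoding = sys.stdout.encoding or 'ascii'
        try:
            return self.title.encode(encoding)
        except UnicodeEncodeError:
            # The terminal can't represent the character: degrade to escapes.
            return self.title.encode('unicode-escape')

print "2: ", str(Obj())   # prints the curly quote, or \u2018 as a fallback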

The problem is that str() cannot handle u'\u2018' (a unicode object), since it tries to convert it to ASCII and there is no ASCII character for it.
>>> str(u'\u2018')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2018' in position 0: ordinal not in range(128)
>>>


Python character handling in terminal

I'm in an interactive Python 2.7 terminal (the terminal's default output encoding is "utf-8"). I have a string from the internet; let's call it a:
>>> a
u'M\xfcssen'
>>> a[1]
u'\xfc'
I wonder why its value is not shown as ü, so I try
>>> print(a)
Müssen
>>> print(a[1])
ü
which works as intended.
So my first question is: what does print a do that is missing if I just type a?
And out of curiosity: why do I get different output for the following in the same Python terminal session?
>>> "ü"
'\xc3\xbc'
>>> print "ü"
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/codecs.py", line 351, in write
data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
>>> print u"ü"
ü
You have to understand how Python stores the various data types and which functions expect which input. It's all quite confusing, and it also depends on the LOCALE setting of your terminal.
The following link might help reduce the confusion: https://pythonhosted.org/kitchen/unicode-frustrations.html
All str objects like "My String" are stored as 8-bit bytes. In your case '\xc3\xbc' is the UTF-8 representation of the u-umlaut (ü) as a str object.
For unicode objects, Python uses 16-bit or 32-bit integers to store the code points.
Now the print statement ultimately writes str objects (raw bytes) to the terminal; a unicode object has to be encoded to str first. That's why the following works:
>>> print '\xc3\xbc'
ü
To turn the u-umlaut from a str into a unicode object, you have to tell Python that the string is in UTF-8 representation when you convert it:
>>> '\xc3\xbc'.decode('utf8')   # decode() alone already returns a unicode object
u'\xfc'
what does print a do, which is missing if i just type a?
The interactive >>> prompt outputs values using the Python source code representation of the value, as returned by the repr() function. That's why you get not just \xFC for the ü character but also quote marks around the string. The prompt is trying to show you what you would need to type into a Python program to get the string value you have.
The print statement outputs the plain, human-readable form of the value. For a str that means the raw bytes; for a unicode object, print first encodes the text for your terminal (using sys.stdout.encoding), which is why print a displays ü even though str(a) would fail.
For some types repr() and str() generate the same output, but this is not the case for strings.
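A quick illustration of the difference, assuming the UTF-8 terminal from the question:

>>> a = u'M\xfcssen'
>>> a                # the bare prompt shows repr(a)
u'M\xfcssen'
>>> print repr(a)    # the same thing, explicitly
u'M\xfcssen'
>>> print a          # encoded for the terminal before being written
Müssen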

Converting Unicode to UTF-8 in python [duplicate]

This question already has an answer here:
Closed 10 years ago.
Possible Duplicate:
Convert Unicode to UTF-8 Python
I'm a very new Python programmer, working on my first script. The script pulls in text from a plist string, then does some things to it, then packages it up as an HTML email.
For a few of the entries, I'm getting the dreaded Unicode "ordinal not in range(128)" error.
Having read as much as I can find about encoding and decoding, I know that it is important to get the text properly encoded, but I'm having a difficult time understanding when or how exactly to do this.
The offending variable is first pulled in using plistlib, and converted to HTML from markdown, like this:
entry = result['Entry Text']
donotecontent = markdown2.markdown(entry)
Later, it is put in the email like this:
html = donotecontent + '<br /><br />' + var3
part1 = MIMEText(html, 'html')
msg.attach(part1)
My question is, what is the best way for me to make sure that Unicode characters in this content don't cause an error to be thrown? I'd prefer not to ignore the characters.
Sorry for my broken English; I speak Chinese/Japanese and use CJK characters every day.
Cerón's answer covers most of this problem, so I won't repeat how to use encode()/decode().
When we use str() to cast a unicode object, it encodes the unicode string to byte data; when we use unicode() to cast a str object, it decodes the byte data to unicode characters.
In both cases, the encoding used is the one returned by sys.getdefaultencoding().
By default, sys.getdefaultencoding() returns 'ascii', so an encoding/decoding exception may be thrown when doing the str()/unicode() cast.
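You can check this in any stock Python 2 interpreter:

>>> import sys
>>> sys.getdefaultencoding()
'ascii'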
If you want str <-> unicode conversion via str() or unicode() to implicitly encode/decode with 'utf-8', you can execute the following statements:
import sys # sys.setdefaultencoding is cancelled by site.py
reload(sys) # to re-enable sys.setdefaultencoding()
sys.setdefaultencoding('utf-8')
After that, str() and unicode() will convert any basestring object using the utf-8 encoding.
However, I would prefer to use encode()/decode() explicitly, because it makes code maintenance easier for me.
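For example, a small sketch of the explicit style (the variable names here are made up for illustration):

# Explicit boundaries: decode bytes to unicode on the way in,
# encode unicode back to bytes on the way out.
raw_bytes = 'Cer\xc3\xb3n'           # UTF-8 bytes arriving from outside
text = raw_bytes.decode('utf-8')     # unicode for internal processing
assert text == u'Cer\xf3n'
out_bytes = text.encode('utf-8')     # bytes again for email/storage
assert out_bytes == raw_bytes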
Assuming you're using Python 2.x, remember: there are two types of strings: str and unicode. str are byte strings, whereas unicode are Unicode strings. Unicode strings can represent text in any language, but to store text in a computer or to send it via email, you need to represent that text using bytes. To represent text using bytes, you need an encoding. There are many encodings; Python uses ascii by default, but ascii can only represent a few characters, mostly English letters. If you try to encode text with other letters using ascii, you will get the famous "ordinal not in range(128)" error. For example:
>>> u'Cerón'.encode('ascii')
Traceback (most recent call last):
File "<input>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf3' in position 3:
ordinal not in range(128)
The same happens if you use str(u'Cerón'), because Python uses ascii by default to convert unicode to str.
To make this work, you have to use a different encoding. UTF-8 is an encoding that can express any Unicode text as bytes. To convert the u'Cerón' unicode string to bytes you have to use:
>>> u'Cerón'.encode('utf-8')
'Cer\xc3\xb3n'
No errors this time.
Now, back to your email problem. I can see that you're using MIMEText, which accepts an already encoded str argument, in your case the html variable. MIMEText also accepts an argument specifying what charset is being used. So, in your case, if html is a unicode string, you have to encode it as utf-8 and pass the charset parameter too (because MIMEText uses us-ascii by default):
part1 = MIMEText(html.encode('utf-8'), 'html', 'utf-8')
But be careful, because if html is already a str instead of unicode, the encoding will fail. This is one of the quirks of Python 2.x: it lets you call encode() on an already encoded str, but to do so it first decodes the bytes using ascii, which throws an error for non-ASCII content.
Another problem to add to the list is that utf-8 is backward compatible with ascii, and Python will happily encode/decode strings using ascii automatically. If you're not properly encoding your strings but only ever use ascii characters, things will appear to work fine. However, if some non-ascii character slips into your message, you will get the error at runtime, which makes these bugs harder to detect.
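To guard against that double-encoding pitfall, you can normalize before attaching. A hedged sketch (the html content is invented for illustration, and the bytes, if any, are assumed to be UTF-8):

from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

msg = MIMEMultipart('alternative')
html = u'Cer\xf3n <br /><br /> more text'   # may arrive as unicode or str
if isinstance(html, str):
    html = html.decode('utf-8')             # assumption: the bytes are UTF-8
# Now html is unicode, so we encode exactly once, with a matching charset.
part1 = MIMEText(html.encode('utf-8'), 'html', 'utf-8')
msg.attach(part1)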
Remember: you can't decode a unicode, and you can't encode a str. Python 2 will let you try, but it first converts through the ascii codec behind the scenes, which is why the error types below look reversed:
>>> u"\xa0".decode("ascii", "ignore")
Traceback (most recent call last):
File "<pyshell#7>", line 1, in <module>
u"\xa0".decode("ascii", "ignore")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 0: ordinal not in range(128)
>>> "\xc2".encode("ascii", "ignore")
Traceback (most recent call last):
File "<pyshell#6>", line 1, in <module>
"\xc2".encode("ascii", "ignore")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

Python Unicode Ascii, Ordinal not In range, frustrating error

Here is my problem...
Database stores everything in unicode.
hashlib.sha256() accepts str and its .digest() method returns str.
When I try to feed the hash function the data, I get the famous error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x90 in position 1: ordinal not in range(128)
This is my data
>>> db_digest
u"'\x90\x017~1\xe0\xaf4\xf2\xec\xd5]:j\xef\xe6\x80\x88\x89\xfe\xf7\x99,c\xff\xb7\x06hXR\x99\xad\x91\x93lM:\xafT\xc9j\xec\xc3\xb7\xea[\x80\xe0e\xd6\\\xd8\x16'\xcb6\xc8\xaa\xdf\xc9 :\xff"
>>>
>>> hashlib.sha256(db_digest)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\x90' in position 1: ordinal not in range(128)
>>>
>>> asc_db_digest
"'\x90\x017~1\xe0\xaf4\xf2\xec\xd5]:j\xef\xe6\x80\x88\x89\xfe\xf7\x99,c\xff\xb7\x06hXR\x99\xad\x91\x93lM:\xafT\xc9j\xec\xc3\xb7\xea[\x80\xe0e\xd6\\\xd8\x16'\xcb6\xc8\xaa\xdf\xc9 :\xff"
>>> hashlib.sha256(asc_db_digest)
<sha256 HASH object @ 0x7f7da0f04300>
So all I am asking for is a way to turn db_digest into asc_db_digest
Edit
I have rephrased the question, as it seems I hadn't recognized the problem correctly in the first place.
If you have a unicode string that only contains code points from 0 to 255 (bytes) you can convert it to a Python str using the raw_unicode_escape encoding:
>>> db_digest = u"'\x90\x017~1\xe0\xaf4\xf2\xec\xd5]:j\xef\xe6\x80\x88\x89\xfe\xf7\x99,c\xff\xb7\x06hXR\x99\xad\x91\x93lM:\xafT\xc9j\xec\xc3\xb7\xea[\x80\xe0e\xd6\\\xd8\x16'\xcb6\xc8\xaa\xdf\xc9 :\xff"
>>> hash_digest = "'\x90\x017~1\xe0\xaf4\xf2\xec\xd5]:j\xef\xe6\x80\x88\x89\xfe\xf7\x99,c\xff\xb7\x06hXR\x99\xad\x91\x93lM:\xafT\xc9j\xec\xc3\xb7\xea[\x80\xe0e\xd6\\\xd8\x16'\xcb6\xc8\xaa\xdf\xc9 :\xff"
>>> db_digest.encode('raw_unicode_escape')
"'\x90\x017~1\xe0\xaf4\xf2\xec\xd5]:j\xef\xe6\x80\x88\x89\xfe\xf7\x99,c\xff\xb7\x06hXR\x99\xad\x91\x93lM:\xafT\xc9j\xec\xc3\xb7\xea[\x80\xe0e\xd6\\\xd8\x16'\xcb6\xc8\xaa\xdf\xc9 :\xff"
>>> db_digest.encode('raw_unicode_escape') == hash_digest
True
Hashes operate on bytes (str in Python 2.x, bytes in 3.x), not text strings (unicode in 2.x, str in 3.x). Therefore, you must supply bytes. Try:
hashlib.sha1(salt.encode('utf-8') + data).digest()
The hash will contain "characters" that are in the range 0-255. These are all valid Unicode characters, but it's not a Unicode string. You need to convert it somehow. The best solution would be to encode it into something like base64.
There's also a hacky solution to convert the bytes returned directly into a pseudo-Unicode string, exactly as your database appears to be doing it:
hash_unicode = u''.join([unichr(ord(c)) for c in hash_digest])
You can also go the other way, but this is more dangerous as the "string" will contain characters outside of the ASCII range of 0-127 and might throw errors when you try to use it.
asc_db_digest = ''.join([chr(ord(c)) for c in db_digest])
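As suggested above, base64 is the cleaner way to keep a binary digest in a text field; a minimal sketch:

import base64
import hashlib

digest = hashlib.sha256('some data').digest()   # raw bytes, not text
encoded = base64.b64encode(digest)              # pure-ASCII str, safe to store
assert base64.b64decode(encoded) == digest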

Handle wrongly encoded character in Python unicode string

I am dealing with unicode strings returned by the python-lastfm library.
I assume that somewhere along the way, the library gets the encoding wrong and returns a unicode string that contains mis-decoded characters.
For example, the original string I am expecting in the variable a is "Glück":
>>> a
u'Gl\xfcck'
>>> print a
Traceback (most recent call last):
File "", line 1, in
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 2: ordinal not in range(128)
\xfc is the escaped value 252, which corresponds to the latin1 encoding of "ü". Somehow this gets embedded in the unicode string in a way Python can't handle on its own.
How do I convert this back to a normal str or unicode string that contains the original "Glück"? I tried playing around with the decode/encode methods, but either got a UnicodeEncodeError or a string containing the sequence \xfc.
You have to convert your unicode string into a standard string using some encoding e.g. utf-8:
some_unicode_string.encode('utf-8')
Apart from that: this is a dupe of
BeautifulSoup findall with class attribute- unicode encode error
and at least ten other related questions on SO. Research first.
Your unicode string is fine:
>>> unicodedata.name(u"\xfc")
'LATIN SMALL LETTER U WITH DIAERESIS'
The problem you see at the interactive prompt is that the interpreter doesn't know what encoding to use to output the string to your terminal, so it falls back to the "ascii" codec -- but that codec only knows how to deal with ASCII characters. It works fine on my machine (because sys.stdout.encoding is "UTF-8" for me -- likely because something like my environment variable settings differ from yours)
>>> print u'Gl\xfcck'
Glück
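You can check what encoding your interpreter will use for terminal output (the value shown is an example; yours depends on platform and environment):

>>> import sys
>>> sys.stdout.encoding   # None when output is piped instead of a terminal
'UTF-8'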
At the beginning of your code, just after imports, add these 3 lines.
import sys # import sys package, if not already imported
reload(sys)
sys.setdefaultencoding('utf-8')
It will override system default encoding (ascii) for the course of your program.
Edit: You shouldn't do this unless you are sure of the consequences, see comment below. This post is also helpful: Dangers of sys.setdefaultencoding('utf-8')
Do not cast what you've got from model fields with str() if it is already a unicode string.
(Oops, I totally missed that this is not Django-related.)
I stumbled upon this bug myself while processing a file containing German words that, unknown to me, had been encoded in UTF-8. The problem manifested itself when I started processing the words and some of them showed the decoding error.
# python
Python 2.7.12 (default, Aug 22 2019, 16:36:40)
>>> utf8_word = u"Gl\xfcck"
>>> print("Word read was: {}".format(utf8_word))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 2: ordinal not in range(128)
I solved the error by calling the encode method on the string:
>>> print("Word read was: {}".format(utf8_word.encode('utf-8')))
Word read was: Glück

Double-decoding unicode in python

I am working against an application that seems keen on returning, what I believe to be, double UTF-8 encoded strings.
I send the string u'XüYß' (i.e. u'X\u00fcY\u00df') encoded using UTF-8, so it goes over the wire as 'X\xc3\xbcY\xc3\x9f'.
The server should simply echo what I sent it, yet it returns the following: 'X\xc3\x83\xc2\xbcY\xc3\x83\xc2\x9f' (it should be 'X\xc3\xbcY\xc3\x9f'). If I decode it using str.decode('utf-8'), it becomes u'X\xc3\xbcY\xc3\x9f', which looks like a unicode string containing the original string encoded using UTF-8.
But Python won't let me decode a unicode string without re-encoding it first, and that fails for a reason that escapes me:
>>> ret = 'X\xc3\x83\xc2\xbcY\xc3\x83\xc2\x9f'.decode('utf-8')
>>> ret
u'X\xc3\xbcY\xc3\x9f'
>>> ret.decode('utf-8')
# Throws UnicodeEncodeError: 'ascii' codec can't encode ...
How do I persuade Python to re-decode the string? - and/or is there any (practical) way of debugging what's actually in the strings, without passing it though all the implicit conversion print uses?
(And yes, I have reported this behaviour with the developers of the server-side.)
ret.decode() tries implicitly to encode ret with the system encoding - in your case ascii.
If you explicitly encode the unicode string, you should be fine. There is a builtin encoding that does what you need:
>>> 'X\xc3\xbcY\xc3\x9f'.encode('raw_unicode_escape').decode('utf-8')
'XüYß'
Really, .encode('latin1') (or cp1252) would be OK, because that's what the server is almost certainly using. The raw_unicode_escape codec will just give you something recognizable at the end instead of raising an exception:
>>> '€\xe2\x82\xac'.encode('raw_unicode_escape').decode('utf8')
'\\u20ac€'
>>> '€\xe2\x82\xac'.encode('latin1').decode('utf8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac' in position 0: ordinal not in range(256)
In case you run into this sort of mixed data, you can use the codec again, to normalize everything:
>>> '€\xe2\x82\xac'.encode('raw_unicode_escape').decode('utf8')
'\\u20ac€'
>>> '\\u20ac€'.encode('raw_unicode_escape')
b'\\u20ac\\u20ac'
>>> '\\u20ac€'.encode('raw_unicode_escape').decode('raw_unicode_escape')
'€€'
What you want is the encoding where Unicode code point X is encoded to the same byte value X. For code points inside 0-255 you have this in the latin-1 encoding:
def double_decode(bstr):
    # Undo the outer UTF-8 layer, map code points 0-255 back to the
    # same byte values via latin-1, then decode the inner UTF-8.
    return bstr.decode("utf-8").encode("latin-1").decode("utf-8")
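Applied to the question's data in Python 2 (the printed form assumes a UTF-8 terminal):

>>> double_decode('X\xc3\x83\xc2\xbcY\xc3\x83\xc2\x9f')
u'X\xfcY\xdf'
>>> print double_decode('X\xc3\x83\xc2\xbcY\xc3\x83\xc2\x9f')
XüYß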
Don't use this! Use #hop's solution.
My nasty hack: (cringe! but quietly. It's not my fault, it's the server developers' fault)
def double_decode_unicode(s, encoding='utf-8'):
    # Decode once, turn each code point back into a byte, then decode again.
    return ''.join(chr(ord(c)) for c in s.decode(encoding)).decode(encoding)
Then,
>>> double_decode_unicode('X\xc3\x83\xc2\xbcY\xc3\x83\xc2\x9f')
u'X\xfcY\xdf'
>>> print _
XüYß
Here's a little script that might help you, doubledecode.py --
https://gist.github.com/1282752
