Python unicode character in __str__

Python unicode character in __str__ - python

I'm trying to print cards using their suit unicode character and their values. I tried doing to following:
def __str__(self):
return u'\u2660'.encode('utf-8')
like suggested in another thread, but I keep getting errors saying UnicodeEncodeError: ascii, ♠, 0, 1, ordinal not in range(128). What can I do to get those suit character to show up when I print a list of cards?

Where does that UnicodeEncodeError occur exactly? I can think about two possible issues here:
The UnicodeEncodeError occurs in you __unicode__ method.
Your __unicode__ method returns a byte string instead of a unicode object and that byte string contains non-ASCII characters.
Do you have a __unicode__ method in your class?
I tried this on the Python console according to the actual data from your comment:
>>> u'\u2660'.encode('utf-8')
'\xe2\x99\xa0'
>>> print '\xe2\x99\xa0'
♠
It seems to work. Could you please try to print the same on your console? Maybe your console encoding is the problem.

Depending on how you have encoded those "suit symbols" into a byte string, you'll need to make the unicode string back for it by mentioning the appropriate codec (for example, thebytestr.decode('latin-1') if latin-1 is how you encoded it!), before making the utf-8 encoding of that unicode string. Just unicode(something) uses the default encoding, which is ASCII and therefore totally ignorant of any "suit symbols"!-)
As I said back then (3 months ago), I'd go for implementing __unicode__ instead of __str__, but that's just a minor issue of simplicity. The core point is, rather: if your byte string includes anything outside of the limited ASCII encoding, you must know what encoding your byte string uses, and decode it back into Unicode by explicitly using that codec!

I ran the same code and got
>>> u'\u2660'.encode('utf-8')
'\xe2\x99\xa0'
>>> print ('\xe2\x99\xa0')
â™

Related

encoding string in python [duplicate]

I have this issue and I can't figure out how to solve it. I have this string:
data = '\xc4\xb7\x86\x17\xcd'
When I tried to encode it:
data.encode()
I get this result:
b'\xc3\x84\xc2\xb7\xc2\x86\x17\xc3\x8d'
I only want:
b'\xc4\xb7\x86\x17\xcd'
Anyone knows the reason and how to fix this. The string is already stored in a variable, so I can't add the literal b in front of it.

You cannot convert a string into bytes or bytes into string without taking an encoding into account. The whole point about the bytes type is an encoding-independent sequence of bytes, while str is a sequence of Unicode code points which by design have no unique byte representation.
So when you want to convert one into the other, you must tell explicitly what encoding you want to use to perform this conversion. When converting into bytes, you have to say how to represent each character as a byte sequence; and when you convert from bytes, you have to say what method to use to map those bytes into characters.
If you don’t specify the encoding, then UTF-8 is the default, which is a sane default since UTF-8 is ubiquitous, but it's also just one of many valid encodings.
If you take your original string, '\xc4\xb7\x86\x17\xcd', take a look at what Unicode code points these characters represent. \xc4 for example is the LATIN CAPITAL LETTER A WITH DIAERESIS, i.e. Ä. That character happens to be encoded in UTF-8 as 0xC3 0x84 which explains why that’s what you get when you encode it into bytes. But it also has an encoding of 0x00C4 in UTF-16 for example.
As for how to solve this properly so you get the desired output, there is no clear correct answer. The solution that Kasramvd mentioned is also somewhat imperfect. If you read about the raw_unicode_escape codec in the documentation:
raw_unicode_escape
Latin-1 encoding with \uXXXX and \UXXXXXXXX for other code points. Existing backslashes are not escaped in any way. It is used in the Python pickle protocol.
So this is just a Latin-1 encoding which has a built-in fallback for characters outside of it. I would consider this fallback somewhat harmful for your purpose. For Unicode characters that cannot be represented as a \xXX sequence, this might be problematic:
>>> chr(256).encode('raw_unicode_escape')
b'\\u0100'
So the code point 256 is explicitly outside of Latin-1 which causes the raw_unicode_escape encoding to instead return the encoded bytes for the string '\\u0100', turning that one character into 6 bytes which have little to do with the original character (since it’s an escape sequence).
So if you wanted to use Latin-1 here, I would suggest you to use that one explictly, without having that escape sequence fallback from raw_unicode_escape. This will simply cause an exception when trying to convert code points outside of the Latin-1 area:
>>> '\xc4\xb7\x86\x17\xcd'.encode('latin1')
b'\xc4\xb7\x86\x17\xcd'
>>> chr(256).encode('latin1')
Traceback (most recent call last):
File "<pyshell#28>", line 1, in <module>
chr(256).encode('latin1')
UnicodeEncodeError: 'latin-1' codec can't encode character '\u0100' in position 0: ordinal not in range(256)
Of course, whether or not code points outside of the Latin-1 area can cause problems for you depends on where that string actually comes from. But if you can make guarantees that the input will only contain valid Latin-1 characters, then chances are that you don't really need to be working with a string there in the first place. Since you are actually dealing with some kind of bytes, you should look whether you cannot simply retrieve those values as bytes in the first place. That way you won’t introduce two levels of encoding there where you can corrupt data by misinterpreting the input.

You can use 'raw_unicode_escape' as your encoding:
In [14]: bytes(data, 'raw_unicode_escape')
Out[14]: b'\xc4\xb7\x86\x17\xcd'
As mentioned in comments you can also pass the encoding directly to the encode method of your string.
In [15]: data.encode("raw_unicode_escape")
Out[15]: b'\xc4\xb7\x86\x17\xcd'

Encoding unicode with python

I’m trying to understand how encoding unicode works in python2.7, so far it’s easy to find the solution but I haven’t found any clear explanation as what is going on here. Here is an example.
The introduction
We have a unicode variable we received, called filter_type
filter_type = u'some_välüe'.
We put this into a dict and pass this into the python library urllib.urlencode.
Like so:
urllib.urlencode({"param:" ..., "filter_type": filter_type}
This issue.
Inside urllib.urlencode it loops around the data given to it and wraps the keys and values into the str() builtin function to get a string representation of each key and value before encoding it into a url.
We get an error similar to the following:
{UnicodeEncodeError}'ascii' codec can't encode character u'\xf1' in position 42: ordinal not in range(128).
You get this same error by doing str(u'some_välüe').
So after some research and digging into this it looks like when you wrap unicode values in the str() it tries to encode the value into the default encoding that is set. (my assumption)
>>> import sys
>>> sys.getdefaultencoding()
ascii
The solution.
So we can fix this by encoding these unicode strings with utf-8.
filter_type = u'some_välüe'.encode('utf-8').
The question.
But here is the question. Before i mentioned that urllib.urlencode wraps keys and values into the str() function.
These values are already encoded now, so..
What does str() does in this case now?
Does the representation of a unicode object change when it’s encoded to utf-8?
If it does why did str() try to encode the unicode object to ascii (default) in the first place.

Python & fql: getting "Dami\u00e1n" instead of "Damián"

I created a file containing a dictionary with data written in Spanish (i.e. Damián, etc.):
fileNameX.write(json.dumps(dictionaryX, indent=4))
The data come from some fql fetching operations, i.e.:
select name from user where uid in XXX
When I open the file, I find that, for instance, "Damián" looks like "Dami\u00e1n".
I've tried some options:
ensure_ascii=False:
fileNameX.write(json.dumps(dictionaryX, indent=4, ensure_ascii=False))
But I get an error (UnicodeEncodeError: 'ascii' codec can´t encode character u'\xe1' in position XXX: ordinal not in range(128)).
encode(encoding='latin-1):
dictionaryX.append({
'name': unicodeVar.encode(encoding='latin-1'),
...
})
But I get another error (UnicodeDecodeError: 'utf8' codec can't decode byte 0xe1 in position XXX: invalid continuation byte)
To sum up, I've tried several possibilities, but have less than a clue. I'm lost. Please, I need help. Thanks!

You have many options, and have stumbled upon something rather complicated that depends on your Python version and which you absolutely must understand fully in order to write correct code. Generally the approach taken in 3.x is stricter and a bit harder to work with, but it is much less likely that you will make a mistake or get yourself into a complicated situation. (Based on the exact symptoms you report, you seem to be using 2.x.)
json.dumps has different behaviour in 2.x and 3.x. In 2.x, it produces a str, which is a byte-string (unknown encoding). In 3.x, it still produces a str, but now str in 3.x is a proper Unicode string.
JSON is inherently a Unicode-supporting format, but it expects files to be in UTF-8 encoding. However, please understand that JSON supports \u style escapes in strings. When you read in this data, you will get the correct encoded string back. The reading code produces unicode objects (no matter whether you use 2.x or 3.x) when it reads strings out of the JSON.
When I open the file, I find that, for instance, "Damián" looks like "Dami\u00e1n"
á cannot be represented in ASCII. It gets encoded as \u00e1 by default, to avoid the other problems you had. This happens even in 3.x.
ensure_ascii=False
This disables the previous encoding. In 2.x, it means you get a unicode object instead - a real Unicode object, preserving the original á character. In 3.x, it means that the character is not explicitly translated. But either way, ensure_ascii=False means that json.dumps will give you a Unicode string.
Unicode strings must be encoded to be written to a file. There is no such thing as "unicode data"; Unicode is an abstraction. In 2.x, this encoding is implicitly 'ascii' when you feed a Unicode object to file.write; it was expecting a str. To get around this, you can use the codecs module, or explicitly encode as 'utf-8' before writing. In 3.x, the encoding is set with the encoding keyword argument when you open the file (the default is again probably not what you want).
encode(encoding='latin-1')
Here, you are encoding before producing the dictionary, so that you have a str object in your data. Now a problem occurs because when there are str objects in your data, the JSON encoder assumes, by default, that they represent Unicode strings in UTF-8 encoding. This can be changed, in 2.x, using the encoding keyword argument to json.dumps. (In 3.x, the encoder will simply refuse to serialize bytes objects, i.e. non-Unicode strings!)
However, if your goal is simply to get the data into the file directly, then json.dumps is the wrong tool for you. Have you wondered what that s in the name is for? It stands for "string"; this is the special case. The ordinary case, in fact, is writing directly to a file! (Instead of giving you a string and expecting you to write it yourself.) Which is what json.dump (no 's') does. Again, the JSON standard expects UTF-8 encoding, and again 2.x has an encoding keyword parameter that defaults to UTF-8 (you should leave this alone).

Use codecs.open() to open fileNameX with a specific encoding like encoding='utf-8' for example instead of using open().
Also, json.dump().

Since the string has a \u inside it that means it's a Unicode string. The string is actually correct! Your problem lies in displaying the string. If you print the string, Python's output encoding should print it in the proper encoding for your environment.
For example, this is what I get inside IDLE on Windows:
>>> print u'Dami\u00e1n'
Damián

Python throws UnicodeEncodeError although I am doing str.decode(). Why?

Consider this function:
def escape(text):
print repr(text)
escaped_chars = []
for c in text:
try:
c = c.decode('ascii')
except UnicodeDecodeError:
c = '&{};'.format(htmlentitydefs.codepoint2name[ord(c)])
escaped_chars.append(c)
return ''.join(escaped_chars)
It should escape all non ascii characters by the corresponding htmlentitydefs. Unfortunately python throws
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 0: ordinal not in range(128)
when the variable text contains the string whose repr() is u'Tam\xe1s Horv\xe1th'.
But, I don't use str.encode(). I only use str.decode(). Do I miss something?

It's a misleading error-report which comes from the way python handles the de/encoding process. You tried to decode an already decoded String a second time and that confuses the Python function which retaliates by confusing you in turn! ;-) The encoding/decoding process takes place as far as i know, by the codecs-module. And somewhere there lies the origin for this misleading Exception messages.
You may check for yourself: either
u'\x80'.encode('ascii')
or
u'\x80'.decode('ascii')
will throw a UnicodeEncodeError, where a
u'\x80'.encode('utf8')
will not, but
u'\x80'.decode('utf8')
again will!
I guess you are confused by the meaning of encoding and decoding.
To put it simple:
decode encode
ByteString (ascii) --------> UNICODE ---------> ByteString (utf8)
codec codec
But why is there a codec-argument for the decode method? Well, the underlying function can not guess which codec the ByteString was encoded with, so as a hint it takes codec as an argument. If not provided it assumes you mean the sys.getdefaultencoding() to be implicitly used.
so when you use c.decode('ascii') you a) have a (encoded) ByteString (thats why you use decode) b) you want to get a unicode-representation-object (thats what you use decode for) and c) the codec in which the ByteString is encoded is ascii.
See also:
https://stackoverflow.com/a/370199/1107807
http://docs.python.org/howto/unicode.html
http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
http://www.stereoplex.com/blog/python-unicode-and-unicodedecodeerror

You're passing a string that's already unicode. So, before Python can call decode on it, it has to actually encode it - and it does so by default using the ASCII encoding.
Edit to add It depends on what you want to do. If you simply want to convert a unicode string with non-ASCII characters into an HTML-encoded representation, you can do it in one call: text.encode('ascii', 'xmlcharrefreplace').

Python has two types of strings: character-strings (the unicode type) and byte-strings (the str type). The code you have pasted operates on byte-strings. You need a similar function to handle character-strings.
Maybe this:
def uescape(text):
print repr(text)
escaped_chars = []
for c in text:
if (ord(c) < 32) or (ord(c) > 126):
c = '&{};'.format(htmlentitydefs.codepoint2name[ord(c)])
escaped_chars.append(c)
return ''.join(escaped_chars)
I do wonder whether either function is truly necessary for you. If it were me, I would choose UTF-8 as the character encoding for the result document, process the document in character-string form (without worrying about entities), and perform a content.encode('UTF-8') as the final step before delivering it to the client. Depending on the web framework of choice, you may even be able to deliver character-strings directly to the API and have it figure out how to set the encoding.

This answer always works for me when I have this problem:
def byteify(input):
'''
Removes unicode encodings from the given input string.
'''
if isinstance(input, dict):
return {byteify(key):byteify(value) for key,value in input.iteritems()}
elif isinstance(input, list):
return [byteify(element) for element in input]
elif isinstance(input, unicode):
return input.encode('utf-8')
else:
return input
from How to get string objects instead of Unicode ones from JSON in Python?

I found solution in this-site
reload(sys)
sys.setdefaultencoding("latin-1")
a = u'\xe1'
print str(a) # no exception

decode a str make no sense.
I think you can check ord(c)>127

Python UnicodeDecodeError - Am I misunderstanding encode?

Any thoughts on why this isn't working? I really thought 'ignore' would do the right thing.
>>> 'add \x93Monitoring\x93 to list '.encode('latin-1','ignore')
Traceback (most recent call last):
File "<interactive input>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 4: ordinal not in range(128)

… There's a reason they're called "encodings" …
A little preamble: think of unicode as the norm, or the ideal state. Unicode is just a table of characters. №65 is latin capital A. №937 is greek capital omega. Just that.
In order for a computer to store and-or manipulate Unicode, it has to encode it into bytes. The most straightforward encoding of Unicode is UCS-4; every character occupies 4 bytes, and all ~1000000 characters are available. The 4 bytes contain the number of the character in the Unicode tables as a 4-byte integer. Another very useful encoding is UTF-8, which can encode any Unicode character with one to four bytes. But there also are some limited encodings, like "latin1", which include a very limited range of characters, mostly used by Western countries. Such encodings use only one byte per character.
Basically, Unicode can be encoded with many encodings, and encoded strings can be decoded to Unicode. The thing is, Unicode came quite late, so all of us that grew up using an 8-bit character set learned too late that all this time we worked with encoded strings. The encoding could be ISO8859-1, or windows CP437, or CP850, or, or, or, depending on our system default.
So when, in your source code, you enter the string "add “Monitoring“ to list" (and I think you wanted the string "add “Monitoring” to list", note the second quote), you actually are using a string already encoded according to your system's default codepage (by the byte \x93 I assume you use Windows codepage 1252, “Western”). If you want to get Unicode from that, you need to decode the string from the "cp1252" encoding.
So, what you meant to do, was:
"add \x93Monitoring\x94 to list".decode("cp1252", "ignore")
It's unfortunate that Python 2.x includes an .encode method for strings too; this is a convenience function for "special" encodings, like the "zip" or "rot13" or "base64" ones, which have nothing to do with Unicode.
Anyway, all you have to remember for your to-and-fro Unicode conversions is:
a Unicode string gets encoded to a Python 2.x string (actually, a sequence of bytes)
a Python 2.x string gets decoded to a Unicode string
In both cases, you need to specify the encoding that will be used.
I'm not very clear, I'm sleepy, but I sure hope I help.
PS A humorous side note: Mayans didn't have Unicode; ancient Romans, ancient Greeks, ancient Egyptians didn't too. They all had their own "encodings", and had little to no respect for other cultures. All these civilizations crumbled to dust. Think about it people! Make your apps Unicode-aware, for the good of mankind. :)
PS2 Please don't spoil the previous message by saying "But the Chinese…". If you feel inclined or obligated to do so, though, delay it by thinking that the Unicode BMP is populated mostly by chinese ideograms, ergo Chinese is the basis of Unicode. I can go on inventing outrageous lies, as long as people develop Unicode-aware applications.

encode is available to unicode strings, but the string you have there does not seems unicode (try with u'add \x93Monitoring\x93 to list ')
>>> u'add \x93Monitoring\x93 to list '.encode('latin-1','ignore')
'add \x93Monitoring\x93 to list '

And the magic line is:
unicodedata.normalize('NFKD', text).encode('utf-8', 'ignore')
The one liner that wont raise exceptions when it is most needed (remove bad Unicode characters...)

This seems to work:
'add \x93Monitoring\x93 to list '.decode('latin-1').encode('latin-1')
Any issues with that? I wonder when 'ignore', 'replace' and other such encode error handling comes in?

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.