I am using Python 2.7.11.
I have 2 tuples:
>>> t1 = (u'aaa', u'bbb')
>>> t2 = ('aaa', 'bbb')
And I tried this:
>>> t1==t2
True
How could Python treat unicode and non-unicode the same?
Python 2 considers bytestrings and unicode equal. By the way, this has nothing to do with the containing tuple; it is due to an implicit type conversion, which I will explain below.
It's difficult to demonstrate with 'easy' ASCII codepoints, so to see what really goes on under the hood we can provoke a failure by using higher codepoints:
>>> bites = u'Ç'.encode('utf-8')
>>> unikode = u'Ç'
>>> print bites
Ç
>>> print unikode
Ç
>>> bites == unikode
/Users/wim/Library/Python/2.7/bin/ipython:1: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
#!/usr/bin/python
False
On seeing a unicode-to-bytes comparison like the one above, Python implicitly attempts to decode the bytestring to a unicode object, assuming the bytes were encoded with sys.getdefaultencoding() (which is 'ascii' on my platform).
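You can check what that default is on your own interpreter:
>>> import sys
>>> sys.getdefaultencoding()
'ascii'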
In the case I just showed above, this failed, because the bytes were encoded in 'utf-8'. Now, let's make it "work":
>>> bites = u'Ç'.encode('ISO8859-1')
>>> unikode = u'Ç'
>>> import sys
>>> reload(sys) # please don't ever actually use this hack, guys
<module 'sys' (built-in)>
>>> sys.setdefaultencoding('ISO8859-1')
>>> bites == unikode
True
Your upconversion "works" in pretty much the same way, but using the 'ascii' codec. These kinds of implicit conversions between bytes and unicode are actually pretty evil and can cause a lot of pain, so it was decided to stop doing them in Python 3, because "explicit is better than implicit".
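Roughly speaking, the comparison in your question is equivalent to performing that ascii decode explicitly (a sketch of the implicit machinery, not the actual code path):
>>> t1 = (u'aaa', u'bbb')
>>> t2 = ('aaa', 'bbb')
>>> tuple(s.decode('ascii') for s in t2) == t1
True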
As a minor digression, on Python 3+ both of your literals represent unicode strings, so they are equal anyway: the u prefix is silently ignored. If you want a bytestring literal in Python 3, you need to spell it b'this'. Then you would want to either 1) explicitly decode the bytes, or 2) explicitly encode the unicode object before making a comparison.
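For example, in a Python 3 session:
>>> u'aaa' == 'aaa'                  # both are str; the u prefix is ignored
True
>>> b'aaa' == 'aaa'                  # bytes never compare equal to str
False
>>> b'aaa'.decode('ascii') == 'aaa'  # explicit decode before comparing
True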
Related
Is there a way to convert a \x escaped string like "\\xe8\\xaa\\x9e\\xe8\\xa8\\x80" into readable form: "語言"?
>>> a = "\\xe8\\xaa\\x9e\\xe8\\xa8\\x80"
>>> print(a)
\xe8\xaa\x9e\xe8\xa8\x80
I am aware that there is a similar question here, but it seems the solution is only for latin characters. How can I convert this form of string into readable CJK characters?
Decode it first using 'unicode-escape', then as 'utf8':
a = "\\xe8\\xaa\\x9e\\xe8\\xa8\\x80"
decoded = a.encode('latin1').decode('unicode_escape').encode('latin1').decode('utf8')
print(decoded)
# 語言
Note that since we can only decode bytes objects, we have to re-encode the intermediate string back to bytes, using 'latin1' (which maps code points 0-255 straight onto byte values 0-255).
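Here is the same chain split into steps so the intermediate values are visible (a sketch, assuming Python 3):
step1 = a.encode('latin1')              # bytes that still contain literal backslashes
step2 = step1.decode('unicode_escape')  # escapes resolved: code points U+00E8 U+00AA U+009E U+00E8 U+00A8 U+0080
step3 = step2.encode('latin1')          # code points below 256 map one-to-one back to raw bytes
step4 = step3.decode('utf8')            # b'\xe8\xaa\x9e\xe8\xa8\x80' interpreted as UTF-8
print(step4)
# 語言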
Starting with string a, which appears to follow Python's hex escaping rules, you can decode it to a bytes object plus the length of the string that was decoded:
>>> a = "\\xe8\\xaa\\x9e\\xe8\\xa8\\x80"
>>> import codecs
>>> codecs.escape_decode(a)
(b'\xe8\xaa\x9e\xe8\xa8\x80', 24)
You don't need the length here, so just take item 0. Now it's time for some guessing. Assuming that this string actually represents UTF-8-encoded text, you now have a bytes object that you can decode:
>>> codecs.escape_decode(a)[0].decode('utf-8')
'語言'
If the underlying encoding was different (say, a Windows CJK code page), you'd have to decode with its decoder.
Text like this could make a valid Python bytes literal. Assuming we don't have to worry about invalid input, we can simply construct a string that looks like the corresponding source code, and use ast.literal_eval to interpret it that way (this is safe, unlike using eval). Finally we decode the resulting bytes as UTF-8. Thus:
>>> a = "\\xe8\\xaa\\x9e\\xe8\\xa8\\x80"
>>> import ast
>>> ast.literal_eval(f"b'{a}'")
b'\xe8\xaa\x9e\xe8\xa8\x80'
>>> ast.literal_eval(f"b'{a}'").decode('utf-8')
'語言'
Such a codec is missing from the stdlib. My package all-escapes registers a codec which can be used:
>>> a = "\\xe8\\xaa\\x9e\\xe8\\xa8\\x80"
>>> a.encode('all-escapes').decode()
'語言'
In Python 2, you can call str.decode to get a unicode object, and unicode.encode to get a str object.
>>> "foo".decode('utf-8')
u'foo'
>>> u"foo".encode('utf-8')
'foo'
Python 3 is similar, using bytes.decode to get a string, and str.encode to get a bytes object.
>>> "foo".encode('utf-8')
b'foo'
>>> b"foo".decode('utf-8')
'foo'
However, Python 2 (but not Python 3) also provides methods the wrong way around: you can call .encode on a str object, or .decode on a unicode object!
>>> "foo".encode('utf-8')
'foo'
>>> u"foo".decode('utf-8')
u'foo'
Why is this? Is there any time that it is useful to call .decode on a unicode object, or vice versa?
Because in Python 2, the thought was that you would want to treat text in byte strings (str objects) and Unicode strings (unicode objects) interchangeably, transparently. When a bytestring is expected, unicode objects are transparently encoded (to ASCII), and conversely, when Unicode is expected, a str object is transparently decoded, assuming ASCII again.
So str.encode() will first decode the bytes (using the ASCII default), then encode the result with the codec you asked for. Likewise, unicode.decode() will first encode to bytes (again using ASCII), then decode the result.
There is only a use for this if your code wants to accept either str or unicode objects and treat these interchangeably. So a function that expects a bytestring and attempts to decode that bytestring will continue to work, even if you pass in a unicode object containing only ASCII codepoints.
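A short Python 2 session sketches both the harmless case and the surprising failure (traceback abbreviated):
>>> u'foo'.decode('utf-8')      # the hidden .encode('ascii') succeeds
u'foo'
>>> u'caf\xe9'.decode('utf-8')  # the hidden encode step is what blows up
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3: ordinal not in range(128)
Note that we called decode but got a UnicodeEncodeError: the failure happens in the hidden encoding step.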
This led to a huge amount of confusion and errors (just search for UnicodeEncodeError and UnicodeDecodeError here on Stack Overflow), so in Python 3 the types were dis-entangled.
I noticed the following holds:
>>> u'abc' == 'abc'
True
>>> 'abc' == u'abc'
True
Will this always be true or could it possibly depend on the system locale?
(It seems strings are unicode in Python 3, but bytes in 2.x.)
Python 2 coerces between unicode and str using the ASCII codec when comparing the two types. So yes, this is always true.
That is to say, unless you mess up your Python installation and use sys.setdefaultencoding() to change that default. You cannot normally do that, because the function is deleted from the sys module at start-up time. However, there is a cargo cult going around in which people use reload(sys) to reinstate it and change the default encoding to something else, in an attempt to fix implicit encoding and decoding problems. That is a dumb thing to do, for precisely this reason.
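You can see the edge of that guarantee by comparing non-ASCII values (Python 2):
>>> u'abc' == 'abc'    # pure ASCII on both sides, coercion succeeds
True
>>> u'\xe9' == '\xe9'  # the byte 0xe9 is not valid ascii, coercion fails
__main__:1: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
False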
I am unable to convert the following Unicode to ASCII without losing data:
u'ABRA\xc3O JOS\xc9'
I tried encode and decode and they won’t do it.
Does anyone have a suggestion?
The Unicode characters u'\xc3' and u'\xc9' do not have any corresponding ASCII values. So, if you don't want to lose data, you have to encode that data in some way that's valid as ASCII. Options include:
>>> s = u'ABRA\xc3O JOS\xc9'
>>> print s.encode('ascii', errors='backslashreplace')
ABRA\xc3O JOS\xc9
>>> print s.encode('ascii', errors='xmlcharrefreplace')
ABRA&#195;O JOS&#201;
>>> print s.encode('unicode-escape')
ABRA\xc3O JOS\xc9
>>> print s.encode('punycode')
ABRAO JOS-jta5e
All of these are ASCII strings, and contain all of the information from your original Unicode string (so they can all be reversed without loss of data), but none of them are all that pretty for an end-user (and none of them can be reversed just by decode('ascii')).
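For example, the 'unicode-escape' version round-trips by decoding with the same codec (the others each need their own inverse, e.g. an XML/HTML unescaper for the character references):
>>> s.encode('unicode-escape').decode('unicode-escape') == s
True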
See str.encode, Python Specific Encodings, and Unicode HOWTO for more info.
As a side note, when some people say "ASCII", they really don't mean "ASCII" but rather "any 8-bit character set that's a superset of ASCII" or "some particular 8-bit character set that I have in mind". If that's what you meant, the solution is to encode to the right 8-bit character set:
>>> s.encode('utf-8')
'ABRA\xc3\x83O JOS\xc3\x89'
>>> s.encode('cp1252')
'ABRA\xc3O JOS\xc9'
>>> s.encode('iso-8859-15')
'ABRA\xc3O JOS\xc9'
The hard part is knowing which character set you meant. If you're writing both the code that produces the 8-bit strings and the code that consumes it, and you don't know any better, you meant UTF-8. If the code that consumes the 8-bit strings is, say, the open function or a web browser that you're serving a page to or something else, things are more complicated, and there's no easy answer without a lot more information.
I found this library very useful: https://pypi.org/project/Unidecode/
>>> from unidecode import unidecode
>>> unidecode('ko\u017eu\u0161\u010dek')
'kozuscek'
>>> unidecode('30 \U0001d5c4\U0001d5c6/\U0001d5c1')
'30 km/h'
>>> unidecode('\u5317\u4EB0')
'Bei Jing '
I needed to calculate the MD5 hash of a unicode string received in an HTTP request. MD5 was raising UnicodeEncodeError, and Python's built-in encoding methods didn't work for me because they replace the characters in the string with their corresponding hex escapes, thus changing the MD5 hash.
So I came up with the following code, which keeps the string intact while converting from unicode.
unicode_string = ''.join([chr(ord(x)) for x in unicode_string]).strip()  # maps each code point to one byte; raises ValueError for code points above 255
This removes the unicode part from the string and keeps all the data intact.
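As a sketch of how that conversion feeds into hashing (Python 2; u_s stands in for a hypothetical request value, and the trick only works while every code point is below 256), note that it is equivalent to a hand-rolled Latin-1 encode:
>>> import hashlib
>>> u_s = u'caf\xe9'
>>> s = ''.join([chr(ord(x)) for x in u_s])  # byte-for-codepoint copy
>>> hashlib.md5(s).hexdigest() == hashlib.md5(u_s.encode('latin-1')).hexdigest()
True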
Consider the following example:
>>> s = u"баба"
>>> s
u'\xe1\xe0\xe1\xe0'
>>> print s
áàáà
I'm using the cp1251 encoding within IDLE, but it seems like the interpreter actually uses latin1 to create the unicode string:
>>> print s.encode('latin1')
баба
Why so? Is there spec for such behavior?
CPython, 2.7.
Edit
The code I was actually looking for is
>>> u'\xe1\xe0\xe1\xe0' == u'\u00e1\u00e0\u00e1\u00e0'
True
It seems that when encoding unicode with the latin1 codec, all code points less than 256 are simply left as is, thus resulting in the bytes I typed in before.
When you type a character such as б into the terminal, you see a б, but what is really inputted is a sequence of bytes.
Since your terminal encoding is cp1251, typing баба results in the sequence of bytes equal to the unicode баба encoded in cp1251:
In [219]: "баба".decode('utf-8').encode('cp1251')
Out[219]: '\xe1\xe0\xe1\xe0'
(Note I use utf-8 above because my terminal encoding is utf-8, not cp1251. For me, "баба".decode('utf-8') is just unicode for баба.)
Since typing баба results in the sequence of bytes \xe1\xe0\xe1\xe0, when you type u"баба" into the terminal, Python receives u'\xe1\xe0\xe1\xe0' instead. This is why you are seeing
>>> s
u'\xe1\xe0\xe1\xe0'
This unicode happens to represent áàáà.
And when you type
>>> print s.encode('latin1')
the latin1 encoding converts u'\xe1\xe0\xe1\xe0' to '\xe1\xe0\xe1\xe0'.
The terminal receives the sequence of bytes '\xe1\xe0\xe1\xe0', and decodes them with cp1251, thus printing баба:
In [222]: print('\xe1\xe0\xe1\xe0'.decode('cp1251'))
баба
Try:
>>> s = "баба"
(without the u) instead. Or,
>>> s = "баба".decode('cp1251')
to make s unicode. Or, use the verbose but very explicit (and terminal-encoding agnostic):
>>> s = u'\N{CYRILLIC SMALL LETTER BE}\N{CYRILLIC SMALL LETTER A}\N{CYRILLIC SMALL LETTER BE}\N{CYRILLIC SMALL LETTER A}'
Or the short but less-readily comprehensible
>>> s = u'\u0431\u0430\u0431\u0430'
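The last two spellings denote the same string, which you can verify directly:
>>> u'\N{CYRILLIC SMALL LETTER BE}\N{CYRILLIC SMALL LETTER A}\N{CYRILLIC SMALL LETTER BE}\N{CYRILLIC SMALL LETTER A}' == u'\u0431\u0430\u0431\u0430'
True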