Encoding used for u"" literals - python

Consider the following example:
>>> s = u"баба"
>>> s
u'\xe1\xe0\xe1\xe0'
>>> print s
áàáà
I'm using cp1251 encoding in IDLE, but it seems like the interpreter actually uses latin1 to create the unicode string:
>>> print s.encode('latin1')
баба
Why so? Is there spec for such behavior?
CPython, 2.7.
Edit
The code I was actually looking for is
>>> u'\xe1\xe0\xe1\xe0' == u'\u00e1\u00e0\u00e1\u00e0'
True
It seems that when encoding unicode with the latin1 codec, all code points below 256 are simply left as-is, resulting in the same bytes I typed in before.
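That identity is easy to verify: latin-1 maps each of the first 256 code points to the byte of the same value. A minimal Python 2 check:
>>> u'\xe1\xe0\xe1\xe0'.encode('latin1')
'\xe1\xe0\xe1\xe0'
>>> all(unichr(i).encode('latin1') == chr(i) for i in range(256))
True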

When you type a character such as б into the terminal, you see б, but what is actually sent is a sequence of bytes.
Since your terminal encoding is cp1251, typing баба results in the sequence of bytes equal to the unicode баба encoded in cp1251:
In [219]: "баба".decode('utf-8').encode('cp1251')
Out[219]: '\xe1\xe0\xe1\xe0'
(Note I use utf-8 above because my terminal encoding is utf-8, not cp1251. For me, "баба".decode('utf-8') is just unicode for баба.)
Since typing баба results in the sequence of bytes \xe1\xe0\xe1\xe0, when you type u"баба" into the terminal, Python receives u'\xe1\xe0\xe1\xe0' instead. This is why you are seeing
>>> s
u'\xe1\xe0\xe1\xe0'
This unicode happens to represent áàáà.
And when you type
>>> print s.encode('latin1')
the latin1 encoding converts u'\xe1\xe0\xe1\xe0' to '\xe1\xe0\xe1\xe0'.
The terminal receives the sequence of bytes '\xe1\xe0\xe1\xe0', and decodes them with cp1251, thus printing баба:
In [222]: print('\xe1\xe0\xe1\xe0'.decode('cp1251'))
баба
Try:
>>> s = "баба"
(without the u) instead. Or,
>>> s = "баба".decode('cp1251')
to make s unicode. Or, use the verbose but very explicit (and terminal-encoding agnostic):
>>> s = u'\N{CYRILLIC SMALL LETTER BE}\N{CYRILLIC SMALL LETTER A}\N{CYRILLIC SMALL LETTER BE}\N{CYRILLIC SMALL LETTER A}'
Or the short but less-readily comprehensible
>>> s = u'\u0431\u0430\u0431\u0430'
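All four spellings denote the same unicode object, which is easy to confirm (the print output assumes the cp1251 terminal from the question):
>>> s = u'\u0431\u0430\u0431\u0430'
>>> s == u'\N{CYRILLIC SMALL LETTER BE}\N{CYRILLIC SMALL LETTER A}\N{CYRILLIC SMALL LETTER BE}\N{CYRILLIC SMALL LETTER A}'
True
>>> print s
баба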

Related

Understanding unicode and encoding in Python

When I enter the following in the Python 2.7 console
>>>'áíóús'
'\xc3\xa1\xc3\xad\xc3\xb3\xc3\xbas'
>>>u'áíóús'
u'\xe1\xed\xf3\xfas'
I get the above output. What is the difference between the two? I understand the basics of unicode, and different kind of encoding like UTF8, UTF16 etc. But, I don't understand what is being printed on the console or how to make sense of it.
u'áíóús' is a string of text. What you see echoed in the REPL is the canonical representation of that object:
>>> print u'áíóús'
áíóús
>>> print repr(u'áíóús')
u'\xe1\xed\xf3\xfas'
Things like \xe1 are the hexadecimal ordinals of each character:
>>> [hex(ord(c)) for c in u'áíóús']
['0xe1', '0xed', '0xf3', '0xfa', '0x73']
Only the last character was in the ascii range, i.e. ordinals in range(128), so only that last character "s" is plainly visible in Python 2.x:
>>> chr(0x73)
's'
'áíóús' is a string of bytes. What you see printed is an encoding of the same text characters, with your terminal emulator assuming the encoding:
>>> 'áíóús'
'\xc3\xa1\xc3\xad\xc3\xb3\xc3\xbas'
>>> u'áíóús'.encode('utf-8')
'\xc3\xa1\xc3\xad\xc3\xb3\xc3\xbas'
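Decoding the byte string with the terminal's encoding (utf-8 here, as in this answer) recovers the same text object; a quick check:
>>> '\xc3\xa1\xc3\xad\xc3\xb3\xc3\xbas'.decode('utf-8') == u'\xe1\xed\xf3\xfas'
True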

String literal Vs Unicode literal Vs unicode type object - Memory representation

Python 2.x doc says,
Unicode string is a sequence of code points
Unicode strings are expressed as instances of the unicode type
>>> ThisisNotUnicodeString = 'a정정💛' # What is the memory representation?
>>> ThisisNotUnicodeString
'a\xec\xa0\x95\xec\xa0\x95\xf0\x9f\x92\x9b'
>>> type(ThisisNotUnicodeString)
<type 'str'>
>>> a = u'a정정💛' # Which encoding technique used to represent in memory? utf-8?
>>> a
u'a\uc815\uc815\U0001f49b'
>>> type(a)
<type 'unicode'>
>>> b = unicode('a정정💛', 'utf-8')
>>> b
u'a\uc815\uc815\U0001f49b'
>>> c = unicode('a정정💛', 'utf-16')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/encodings/utf_16.py", line 16, in decode
return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x9b in position 10: truncated data
>>>
Question:
1) ThisisNotUnicodeString is a string literal. Although it is not a unicode literal, which encoding is used to represent it in memory? There must be some encoding involved, since characters like 정 or 💛 have to be stored somehow.
2) Which encoding is used to represent the unicode literal a in memory? UTF-8? If so, how can I find out the number of bytes occupied?
3) Why can't c be constructed using UTF-16?
1) ThisisNotUnicodeString is a string literal. Although it is not a unicode literal, which encoding is used to represent it in memory? There must be some encoding involved, since characters like 정 or 💛 have to be stored somehow.
At the interactive prompt, the encoding used for Python 2.X's str type depends on your shell's encoding. For example, in a terminal on a Linux system where the terminal's encoding is UTF-8:
>>> s = "a정정💛"
>>> s
'a\xec\xa0\x95\xec\xa0\x95\xf0\x9f\x92\x9b'
Now change the encoding of your terminal window to something else; here I've switched the shell's encoding from UTF-8 to WINDOWS-1250:
>>> s = "a???"
If you try this in a tty session you get diamonds instead of ? (at least under Ubuntu); you may see different characters elsewhere.
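The same character becomes different bytes under different terminal encodings, which is exactly why the str contents vary; a small illustration reusing cp1251 from the first question:
>>> u'\u0431'.encode('cp1251')    # Cyrillic б is a single byte in cp1251
'\xe1'
>>> u'\u0431'.encode('utf-8')     # the same character in UTF-8
'\xd0\xb1'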
As you can see, the encoding of a str entered at the interactive prompt is shell-dependent. This only applies to code run interactively under the Python interpreter; the same assignment in a script raises an exception:
#main.py
s = "a정정💛"
Trying to run this code raises a SyntaxError:
$ python main.py
SyntaxError: Non-ASCII character '\xec' in file main.py...
This is because Python 2.X uses ASCII by default:
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
So you have to specify the encoding explicitly in your code with a coding declaration:
#main.py
# -*- coding: utf-8 -*-
s = "a정정💛"
2) Which encoding is used to represent the unicode literal a in memory? UTF-8? If so, how can I find out the number of bytes occupied?
Keep in mind that the encoding can differ when you run your code under different shells; I have tested this under Linux. It may be slightly different on Windows, so check your operating system's documentation.
To know the number of bytes occupied, use len:
>>> s = "a정정💛"
>>> len(s)
11
s occupies exactly 11 bytes.
2) Which encoding is used to represent the unicode literal a in memory? UTF-8? If so, how can I find out the number of bytes occupied?
Well, this is a common confusion: the unicode type has no encoding. It is just a sequence of Unicode code points (e.g. U+0040, COMMERCIAL AT).
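To see the distinction, compare the code-point length of a unicode object with the byte lengths of its encodings. A small sketch (the emoji is left out because its Python 2 length differs between narrow and wide CPython builds):
>>> a = u'a\uc815\uc815'            # u'a정정'
>>> len(a)                          # code points, not bytes
3
>>> len(a.encode('utf-8'))          # 1 + 3 + 3 bytes
7
>>> len(a.encode('utf-16'))         # 2-byte BOM + 3 * 2 bytes
8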
3) Why can't c be constructed using UTF-16?
UTF-8 and UTF-16 are different encoding schemes: they map the same characters to different byte sequences. Here:
>>> c = unicode('a정정💛', 'utf-16')
You're essentially doing this:
>>> "a정정💛"
'a\xec\xa0\x95\xec\xa0\x95\xf0\x9f\x92\x9b'
>>> unicode('a\xec\xa0\x95\xec\xa0\x95\xf0\x9f\x92\x9b', 'utf-16')
UnicodeDecodeError: 'utf16' codec can't decode byte 0x9b in position 10: truncated data
This fails because you're trying to decode UTF-8 bytes as UTF-16. The two schemes encode the same characters as different byte sequences (with different byte counts), so bytes produced by one generally cannot be decoded with the other.
For your reference:
Python str vs unicode types
Which encoding technique is used to represent it in memory? UTF-8?
You can try the following:
ThisisNotUnicodeString.decode('utf-8')
If you get a result, it's UTF-8, otherwise it's not.
If you want to get the UTF-16 representation of the string, you should first decode it, and then encode with UTF-16 scheme:
ThisisNotUnicodeString.decode('utf-8').encode('utf-16')
So you can freely convert the given string between UTF-8 and UTF-16, because every character in it can be represented in both schemes:
ThisisNotUnicodeString.decode('utf-8').encode('utf-16').decode('utf-16').encode('utf-8')
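A quick check that the full round trip is lossless, using the byte string from the question:
>>> s = 'a\xec\xa0\x95\xec\xa0\x95\xf0\x9f\x92\x9b'
>>> s.decode('utf-8').encode('utf-16').decode('utf-16').encode('utf-8') == s
True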

Python - convert unicode and hex to unicode

I have a supposedly unicode string like this:
u'\xc3\xa3\xc6\u2019\xc2\xa9\xc3\xa3\xc6\u2019\xe2\u20ac\u201c\xc3\xa3\xc6\u2019\xc2\xa9\xc3\xa3\xe2\u20ac\u0161\xc2\xa4\xc3\xa3\xc6\u2019\xe2\u20ac\u201c\xc3\xaf\xc2\xbc\xc2\x81\xc3\xa3\xe2\u20ac\u0161\xc2\xb9\xc3\xa3\xe2\u20ac\u0161\xc2\xaf\xc3\xa3\xc6\u2019\xc2\xbc\xc3\xa3\xc6\u2019\xc2\xab\xc3\xa3\xe2\u20ac\u0161\xc2\xa2\xc3\xa3\xe2\u20ac\u0161\xc2\xa4\xc3\xa3\xc6\u2019\xe2\u20ac\xb0\xc3\xa3\xc6\u2019\xc2\xab\xc3\xa3\xc6\u2019\xe2\u20ac\xa2\xc3\xa3\xe2\u20ac\u0161\xc2\xa7\xc3\xa3\xe2\u20ac\u0161\xc2\xb9\xc3\xa3\xc6\u2019\xe2\u20ac\xa0\xc3\xa3\xe2\u20ac\u0161\xc2\xa3\xc3\xa3\xc6\u2019\xc2\x90\xc3\xa3\xc6\u2019\xc2\xab\xc3\xaf\xc2\xbc\xcb\u2020\xc3\xa3\xe2\u20ac\u0161\xc2\xb9\xc3\xa3\xe2\u20ac\u0161\xc2\xaf\xc3\xa3\xc6\u2019\xe2\u20ac\xa2\xc3\xa3\xe2\u20ac\u0161\xc2\xa7\xc3\xa3\xe2\u20ac\u0161\xc2\xb9\xc3\xaf\xc2\xbc\xe2\u20ac\xb0'
How do I get the correct unicode string out of this? I think, the actual unicode value is ラブライブ!スクールアイドルフェスティバル(スクフェス)
You have a Mojibake: an incorrectly decoded piece of text.
You can use the ftfy library to un-do the damage:
>>> from ftfy import fix_text
>>> fix_text(s)
u'\u30e9\u30d6\u30e9\u30a4\u30d6!\u30b9\u30af\u30fc\u30eb\u30a2\u30a4\u30c9\u30eb\u30d5\u30a7\u30b9\u30c6\u30a3\u30d0\u30eb(\u30b9\u30af\u30d5\u30a7\u30b9)'
>>> print fix_text(s)
ラブライブ!スクールアイドルフェスティバル(スクフェス)
According to ftfy, your data was encoded as UTF-8, then decoded as Windows codepage 1252; the ftfy.fixes.fix_one_step_and_explain() function shows the repair steps needed:
>>> ftfy.fixes.fix_one_step_and_explain(s)[-1]
[(u'encode', u'sloppy-windows-1252', 0), (u'decode', u'utf-8', 0)]
(the 'sloppy' encoding is needed because not all UTF-8 bytes can be decoded as cp1252, but some bad decoders then just copy the original byte; the special codec reverses that process).
In fact, in your case this was done twice, not a feat I had seen before:
>>> print s.encode('sloppy-cp1252').decode('utf8').encode('sloppy-cp1252').decode('utf8')
ラブライブ!スクールアイドルフェスティバル(スクフェス)
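When every byte of the garbled text happens to be a valid cp1252 character, the same one-step repair works with the standard codecs alone; a minimal sketch on a single character:
>>> broken = u'\xc3\xa9'    # what UTF-8 'é' (bytes \xc3\xa9) looks like after a bad cp1252 decode
>>> print broken.encode('cp1252').decode('utf-8')
é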

How could python treat unicode and non-unicode tuple as equal?

I am using Python 2.7.11.
I have 2 tuples:
>>> t1 = (u'aaa', u'bbb')
>>> t2 = ('aaa', 'bbb')
And I tried this:
>>> t1==t2
True
How could Python treat unicode and non-unicode the same?
Python 2 can consider bytestrings and unicode strings equal. By the way, this has nothing to do with the containing tuple; it is due to an implicit type conversion, which I will explain below.
It's difficult to demonstrate it with 'easy' ascii codepoints, so to see what really goes on under the hood, we can provoke a failure by using higher codepoints:
>>> bites = u'Ç'.encode('utf-8')
>>> unikode = u'Ç'
>>> print bites
Ç
>>> print unikode
Ç
>>> bites == unikode
/Users/wim/Library/Python/2.7/bin/ipython:1: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
#!/usr/bin/python
False
On seeing the unicode-and-bytes comparison above, Python implicitly attempted to decode the bytestring into a unicode object, assuming the bytes were encoded with sys.getdefaultencoding() (which is 'ascii' on my platform).
In the case I just showed above, this failed, because the bytes were encoded in 'utf-8'. Now, let's make it "work":
>>> bites = u'Ç'.encode('ISO8859-1')
>>> unikode = u'Ç'
>>> import sys
>>> reload(sys) # please don't ever actually use this hack, guys
<module 'sys' (built-in)>
>>> sys.setdefaultencoding('ISO8859-1')
>>> bites == unikode
True
Your comparison "works" in pretty much the same way, but using the 'ascii' codec. These kinds of implicit conversions between bytes and unicode are actually pretty evil and can cause a lot of pain, so it was decided to stop doing them in Python 3, because "explicit is better than implicit".
As a minor digression: in Python 3, both of your literals represent unicode strings, so they are equal anyway; the u prefix is silently ignored. If you want a bytestring literal in Python 3, you need to write it as b'this'. Then you would either 1) explicitly decode the bytes, or 2) explicitly encode the unicode object before making a comparison.
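For contrast, here is a minimal Python 3 session, where no implicit conversion is attempted:
>>> b'aaa' == 'aaa'    # Python 3: bytes never compare equal to str
False
>>> b'aaa'.decode('ascii') == 'aaa'
True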

Python - Unicode to ASCII conversion

I am unable to convert the following Unicode to ASCII without losing data:
u'ABRA\xc3O JOS\xc9'
I tried encode and decode and they won’t do it.
Does anyone have a suggestion?
The Unicode characters u'\xc3' and u'\xc9' do not have corresponding ASCII values. So, if you don't want to lose data, you have to encode the data in some way that is valid as ASCII. Options include:
>>> print s.encode('ascii', errors='backslashreplace')
ABRA\xc3O JOS\xc9
>>> print s.encode('ascii', errors='xmlcharrefreplace')
ABRAÃO JOSÉ
>>> print s.encode('unicode-escape')
ABRA\xc3O JOS\xc9
>>> print s.encode('punycode')
ABRAO JOS-jta5e
All of these are ASCII strings, and contain all of the information from your original Unicode string (so they can all be reversed without loss of data), but none of them are all that pretty for an end-user (and none of them can be reversed just by decode('ascii')).
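For example, a quick check that the unicode-escape version really does round-trip:
>>> s = u'ABRA\xc3O JOS\xc9'
>>> s.encode('unicode-escape').decode('unicode-escape') == s
True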
See str.encode, Python Specific Encodings, and Unicode HOWTO for more info.
As a side note, when some people say "ASCII", they really don't mean "ASCII" but rather "any 8-bit character set that's a superset of ASCII" or "some particular 8-bit character set that I have in mind". If that's what you meant, the solution is to encode to the right 8-bit character set:
>>> s.encode('utf-8')
'ABRA\xc3\x83O JOS\xc3\x89'
>>> s.encode('cp1252')
'ABRA\xc3O JOS\xc9'
>>> s.encode('iso-8859-15')
'ABRA\xc3O JOS\xc9'
The hard part is knowing which character set you meant. If you're writing both the code that produces the 8-bit strings and the code that consumes it, and you don't know any better, you meant UTF-8. If the code that consumes the 8-bit strings is, say, the open function or a web browser that you're serving a page to or something else, things are more complicated, and there's no easy answer without a lot more information.
I found the Unidecode library (https://pypi.org/project/Unidecode/) very useful:
>>> from unidecode import unidecode
>>> unidecode(u'ko\u017eu\u0161\u010dek')
'kozuscek'
>>> unidecode(u'30 \U0001d5c4\U0001d5c6/\U0001d5c1')
'30 km/h'
>>> unidecode(u'\u5317\u4EB0')
'Bei Jing '
I needed to calculate the MD5 hash of a unicode string received in an HTTP request. MD5 was raising UnicodeEncodeError, and the built-in encoding methods didn't work for me because they replace characters in the string with escape sequences, which changes the MD5 hash.
So I came up with the following code, which keeps the string intact while converting from unicode.
unicode_string = ''.join([chr(ord(x)) for x in unicode_string]).strip()
This converts the string from unicode to bytes while keeping the character data intact.
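One caveat: chr() raises ValueError for code points above 255, so this trick only works for latin-1-range text, where it is equivalent to encoding with latin-1; a quick check:
>>> u = u'caf\xe9'
>>> ''.join([chr(ord(x)) for x in u]) == u.encode('latin-1')
True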
