Confused about unicode representations - python

I am confused about hex representation of Unicode.
I have an example file with a single mathematical integral sign character in it. That is U+222B
If I cat the file or edit it in vi I get an integral sign displayed.
A hex dump of the file shows its hex content is 88e2 0aab
In Python I can create an integral Unicode character and print p, which renders an integral sign on my terminal.
>>> p=u'\u222b'
>>> p
u'\u222b'
>>> print p
∫
What confuses me is I can open a file with the integral sign in it, get the integral symbol but the hex content is different.
>>> c=open('mycharfile','r').read()
>>> c
'\xe2\x88\xab\n'
>>> print c
∫
One is a Unicode object and one is a plain string but what is the relationship between the two hex codes apparently for the same character? How would I manually convert one to another?

The plain string has been encoded using UTF-8, one of a variety of ways to represent Unicode code points in bytes. UTF-8 is a multibyte encoding which has the often useful feature that it is a superset of ASCII - the same byte encodes any ASCII character in UTF-8 or in ASCII.
In Python 2.x, use the encode method on a Unicode object to encode it, and decode or the unicode constructor to decode it:
>>> u'\u222b'.encode('utf8')
'\xe2\x88\xab'
>>> '\xe2\x88\xab'.decode('utf8')
u'\u222b'
>>> unicode('\xe2\x88\xab', 'utf8')
u'\u222b'
print, when given a Unicode argument, implicitly encodes it using the encoding of the output stream. On my system:
>>> import sys
>>> sys.stdout.encoding
'UTF-8'
See this answer for a longer discussion of print's behavior:
Why does Python print unicode characters when the default encoding is ASCII?
Python 3 handles things a bit differently; the changes are documented here:
http://docs.python.org/3.0/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit
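As a rough Python 3 counterpart to the session above (a minimal sketch, not part of the original answer): in Python 3 the text type is str and the encoded form is bytes, so the same round trip looks like this. The variable names are only illustrative.

```python
# Python 3: str holds code points, bytes holds the encoded form.
integral = "\u222b"                     # one code point, U+222B

encoded = integral.encode("utf-8")      # str -> bytes
print(encoded)                          # b'\xe2\x88\xab'

decoded = encoded.decode("utf-8")       # bytes -> str
same_via_constructor = str(encoded, "utf-8")

print(len(integral), len(encoded))      # 1 3
```

Reading the file in binary mode (open(..., 'rb')) would give the bytes form, while text mode decodes it for you.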

Okay, I have it. Thanks for the answers. I wanted to see how to do the conversion by hand rather than have Python convert the string.
The conversion works this way.
Take a Unicode character, in my example an integral symbol.
A hex dump with od -x produces
echo -n "∫"|od -x
0000000 88e2 00ab
od -x prints 16-bit words, which on a little-endian machine swaps the two bytes in each group, so the real byte order is
e288ab00
(the trailing 00 is padding from od, not part of the data).
The first hex digit is E, binary 1110. In UTF-8 a lead byte that starts with 1110 means the character is encoded in three bytes, which together carry 16 bits of payload.
The first two bits of each of the remaining two bytes are 10, which marks them as continuation bytes; those bits are thrown away as well. The full bit stream is
111000101000100010101011
Throw away the first 4 bits, and the first two bits of each of the remaining two bytes:
0010001000101011
Re-expressing this in hex
222B
There you have it!
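The manual steps above can be sketched in Python 3 as follows. This is an illustration only: the function name decode_three_byte_utf8 is made up for this example, and it handles just the three-byte case without the checks a real decoder performs (overlong forms, surrogates).

```python
def decode_three_byte_utf8(data: bytes) -> int:
    """Return the code point encoded by one three-byte UTF-8 sequence."""
    if len(data) != 3:
        raise ValueError("expected exactly three bytes")
    lead, cont1, cont2 = data
    # Lead byte must look like 1110xxxx, continuation bytes like 10xxxxxx.
    if lead & 0xF0 != 0xE0 or cont1 & 0xC0 != 0x80 or cont2 & 0xC0 != 0x80:
        raise ValueError("not a three-byte UTF-8 sequence")
    # Keep 4 payload bits from the lead byte and 6 from each continuation byte.
    return ((lead & 0x0F) << 12) | ((cont1 & 0x3F) << 6) | (cont2 & 0x3F)


code_point = decode_three_byte_utf8(b"\xe2\x88\xab")
print(f"U+{code_point:04X}")  # U+222B
```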

Related

Python3 counting UTF-16 code points in a string

I am trying to figure out how to either convert UTF-16 offsets to UTF-8 offsets, or somehow be able to count the # of UTF-16 code points in a string. (I think in order to do the former, you have to do the latter anyways.)
Sanity check: am I correct that the len() function, when applied to a Python string, returns the number of code points in it in UTF-8?
I need to do this because the LSP protocol requires the offsets to be in UTF-16, and I am trying to build something with LSP in mind.
I can't seem to find how to do this, the only python LSP server I know of doesn't even handle this conversion itself.
Python has two datatypes which can be used for characters, neither of which natively represents UTF-16 code units.
In Python-3, strings are represented as str objects, which are conceptually vectors of unicode codepoints. So the length of a str is the number of Unicode characters it contains, and len("𐐀") is 1, just as with any other single character. That's independent of the fact that "𐐀" requires two UTF-16 code units (or four UTF-8 code units).
Python-3 also has a bytes object, which is a vector of bytes (as its name suggests). You can encode a str into a sequence of bytes using the encode method, specifying some encoding. So if you want to produce the stream of bytes representing the character "𐐀" in UTF-16LE, you would invoke "𐐀".encode('utf-16-le').
Specifying le (for little-endian) is important because encode produces a stream of bytes, not UTF-16 code units, and each code unit requires two bytes since it's a 16-bit number. If you don't specify a byte order, as in encode('utf-16'), you'll find a two-byte UTF-16 Byte Order Mark at the beginning of the encoded stream.
Since the UTF-16 encoding requires exactly two bytes for each UTF-16 code unit, you can get the UTF-16 length of a unicode string by dividing the length of the encoded bytes object by two: len(s.encode('utf-16-le'))//2.
But that's a pretty clunky way to convert between UTF-16 offsets and character indexes. Instead, you can just use the fact that characters representable with a single UTF-16 code unit are precisely the characters with codepoints less than 65536 (2^16):
def utf16len(c):
    """Returns the length of the single character 'c'
    in UTF-16 code units."""
    return 1 if ord(c) < 65536 else 2
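One possible way to build the offset conversion on top of that per-character rule is sketched below. The helper names utf16_len and utf16_offset_to_index are invented for this example, and the treatment of an offset that falls in the middle of a surrogate pair (it maps to the following character) is an assumption, not something the LSP specification is quoted on here.

```python
def utf16_len(text: str) -> int:
    """Number of UTF-16 code units needed to encode text."""
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in text)


def utf16_offset_to_index(text: str, utf16_offset: int) -> int:
    """Convert an offset counted in UTF-16 code units to a str index."""
    units = 0
    for index, ch in enumerate(text):
        if units >= utf16_offset:
            return index
        units += 2 if ord(ch) > 0xFFFF else 1
    # Offset is at or past the end of the string.
    return len(text)


sample = "a\U00010400b"          # 'a', U+10400 (needs a surrogate pair), 'b'
print(len(sample))               # 3
print(utf16_len(sample))         # 4
print(utf16_offset_to_index(sample, 3))  # 2
```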
For counting the bytes, including BOM, len(str.encode("utf-16")) would work. You can use utf-16-le for bytes without BOM.
Example:
>>> len("abcd".encode("utf-16"))
10
>>> len("abcd".encode("utf-16-le"))
8
As for your question: No, len(str) in Python returns the number of code points in the string, regardless of any encoding. If a character takes 4 bytes (code units) in UTF-8, it still counts as 1.

Does python automatically decode ASCII and UTF-8 byte strings? [duplicate]

From what I understand, a Python3 string is a sequence of bytes that has been decoded to be readable by humans, and a Python3 bytes object is the raw bytes that are not human readable. What I'm having trouble understanding, however, is how strings encoded with UTF-8 or ASCII are displayed as a string prefixed by a b, rather than a sequence of bytes.
string = "I am a string"
# prints a sequence of bytes, like I would expect
string.encode("UTF-16")
b'\xff\xfeI\x00 \x00a\x00m\x00 \x00a\x00 \x00s\x00t\x00r\x00i\x00n\x00g\x00'
# Prints a sequence of human readable characters, which I don't understand
string.encode("UTF-8")
b'I am a string'
Why does a string encoded by UTF-8 or ASCII not display a sequence of bytes?
UTF-8 is a backwards-compatible superset of ASCII, i.e. anything that’s valid ASCII is valid UTF-8 and everything present in ASCII is encoded by UTF-8 using the same byte as ASCII. So it’s not “UTF-8 or ASCII” so much as “just some of ASCII”. Try other Unicode:
>>> "café".encode("UTF-8")
b'caf\xc3\xa9'
or other ASCII that wouldn’t be very helpful to look at in character form:
>>> "hello\f\n\t\r\v\0\N{SOH}\N{DEL}".encode("UTF-8")
b'hello\x0c\n\t\r\x0b\x00\x01\x7f'
The repr of bytes displays printable characters instead of \xnn escapes when possible because that's helpful if you do happen to have bytes that contain ASCII.
And, of course, it’s still a well-formed bytes literal:
>>> b'I am a string'[0]
73
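A small sketch of the same point, assuming nothing beyond the standard bytes API: the b'...' form is only how the bytes are displayed, and the underlying integers are always available.

```python
data = "caf\u00e9".encode("utf-8")   # the string 'café' as UTF-8

print(data)         # b'caf\xc3\xa9'   (ASCII shown as characters, the rest as escapes)
print(list(data))   # [99, 97, 102, 195, 169]
print(data.hex())   # 636166c3a9

# The same object can be built from plain integers.
same_bytes = bytes([73, 32, 97, 109])
print(same_bytes)   # b'I am'
```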
Additionally: From the docs
While bytes literals and representations are based on ASCII text,
bytes objects actually behave like immutable sequences of integers,
with each value in the sequence restricted such that 0 <= x < 256
(attempts to violate this restriction will trigger ValueError). This
is done deliberately to emphasise that while many binary formats
include ASCII based elements and can be usefully manipulated with some
text-oriented algorithms, this is not generally the case for arbitrary
binary data
-emphasis added.
At the end of the day, this is a design choice that python made for displaying bytes.

Convert UTF-8 to string literals in Python

I have a string in UTF-8 format but am not sure how to convert this string to its corresponding character literal. For example I have the string:
My string is: 'Entre\xc3\xa9'
Example one:
This code:
u'Entre\xc3\xa9'.encode('latin-1').decode('utf-8')
returns the result: u'Entre\xe9'
If I then continue by printing this:
print u'Entre\xe9'
I get the result: Entreé
This is great and close to what I need. The problem is, I can't make 'Entre\xc3\xa9' a variable and pass it through the steps as this now breaks. Any tips for getting this working?
Example:
a = 'Entre\xc3\xa9'
b = 'u'+ a.encode('latin-1').decode('utf-8')
c= 'u'+ b
I would like result of "c" to be:
Entreé
The u'' syntax only works for string literals, e.g. defining values in source code. Using the syntax results in a unicode object being created, but that's not the only way to create such an object.
You cannot make a unicode value from a byte string by adding u in front of it. But if you call str.decode() with the right encoding, you get a unicode value. Vice-versa, you can encode unicode objects to byte strings with unicode.encode().
Note that when displaying a unicode object, Python represents it by using the Unicode string literal syntax again (so u'...'), to ease debugging. You can paste the representation back in to a Python interpreter and get an object with the same value.
Your a value is defined using a byte string literal, so you only need to decode:
a = 'Entre\xc3\xa9'
b = a.decode('utf8')
Your first example created a Mojibake, a Unicode string containing Latin-1 codepoints that actually represent UTF-8 bytes. This is why you had to encode to Latin-1 first (to undo the Mojibake), then decode from UTF-8.
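For readers on Python 3, here is a hedged sketch of the same two situations (the answer above is written for Python 2, so this is a translation rather than the original author's code). Bytes literals need the b prefix, and the Mojibake case is reproduced by decoding the UTF-8 bytes with the wrong codec.

```python
raw = b"Entre\xc3\xa9"                 # UTF-8 bytes

# Case 1: you have bytes, so a single decode is enough.
text = raw.decode("utf-8")             # 'Entre' followed by U+00E9

# Case 2: Mojibake, i.e. the bytes were wrongly decoded as Latin-1.
mojibake = raw.decode("latin-1")       # 'Entre' followed by U+00C3 U+00A9

# Undo it: back to the original bytes, then decode correctly.
repaired = mojibake.encode("latin-1").decode("utf-8")

print(repaired == text)                # True
```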
You may want to read up on Python and Unicode in the Unicode HOWTO. Other articles of interest are:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
Pragmatic Unicode by Ned Batchelder

Is the Unicode code point value equal to the UTF-16BE representation for every character?

I saved some strings in Microsoft Agenda in Unicode big endian format (UTF-16BE). When I open it with the shell command xxd to see the binary value, write it down, and get the value of the Unicode code point by ord() to get the ordinal value character by character (this is a python built-in function which takes a one-character Unicode string and returns the code point value), and compare them, I find they are equal.
But I think that the Unicode code point value is different to UTF-16BE — one is a code point; the other is an encoding format. Some of them are equal, but maybe they are different for some characters.
Is the Unicode code point value equal to the UTF-16BE encoding representation for every character?
No, codepoints outside of the Basic Multilingual Plane use two UTF-16 words (so 4 bytes).
For codepoints in the U+0000 to U+D7FF and U+E000 to U+FFFF ranges, the codepoint and UTF-16 encoding map one-to-one.
For codepoints in the range U+10000 to U+10FFFF, two words in the range U+D800 to U+DFFF are used; a lead surrogate from 0xD800 to 0xDBFF and a trail surrogate from 0xDC00 to 0xDFFF.
See the UTF-16 Wikipedia article for the nitty-gritty details.
So, most UTF-16 big-endian bytes, when printed, can be mapped directly to Unicode codepoints. For UTF-16 little-endian you just swap the bytes around. For UTF-16 words starting with a byte from 0xD8 through 0xDF, you'll have to map the surrogates to the actual codepoint.
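The surrogate mapping can be sketched in Python 3 like this. The function names are made up for the illustration; the arithmetic follows the standard UTF-16 scheme (subtract 0x10000, split the remaining 20 bits into two 10-bit halves).

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a code point above U+FFFF into (lead, trail) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000            # 20-bit value
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def from_surrogate_pair(lead: int, trail: int) -> int:
    """Combine a lead and trail surrogate back into a code point."""
    return 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00)


lead, trail = to_surrogate_pair(0x10400)
print(hex(lead), hex(trail))                     # 0xd801 0xdc00
print("\U00010400".encode("utf-16-be").hex())    # d801dc00
print("\u222b".encode("utf-16-be").hex())        # 222b (BMP: same as the code point)
```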

Latin1 character values not displaying the same as in utf8

FOR PYTHON 2.7 (I took a shot of using encode in 3 and am all confused now...would love some advice how to replicate this test in python 3....)
For the Euro character (€) I looked up what its utf8 Hex code point was using this tool. It said it was 0x20AC.
For Latin1 (again using Python 2.7), I used decode to get its Hex code point:
>>> import unicodedata
>>> p='€'
## notably x80 seems to correspond to Windows CP1252 according to the link
>>> p.decode('latin-1')
u'\x80'
Then I used this print statement for both of them, and this is what I got:
for utf8:
>>> print unichr(0x20AC).encode('utf-8')
â‚¬
for latin-1:
>>> print unichr(0x80).encode('latin-1')
€
What in the heck happened? Why did encode return 'â‚¬' for utf-8? Also...it seems that Latin1 hex code points CAN be different than their utf8 counterparts (I have a colleague who believes different -- says that Latin1 is just like ASCII in this respect). But the presence of different code points seems to suggest otherwise to me...HOWEVER the reason why python 2.7 is reading the Windows CP1252 'x80' is a real mystery to me....is this the standard for latin-1 in python 2.7??
You've got some serious misunderstandings here. If you haven't read the Unicode HOWTOs for Python 2 and Python 3, you should start there.
First, UTF-8 is an encoding of Unicode to 8-bit bytes. There is no such thing as UTF-8 code point 0x20AC. There is a Unicode code point U+20AC, but in UTF-8, that's three bytes: 0xE2, 0x82, 0xAC.
And that explains your confusion here:
Why did encode return 'â‚¬' for utf-8?
It didn't. It returned the byte string '\xE2\x82\xAC'. You then printed that out to your console. Your console is presumably in CP-1252, so it interpreted those bytes as if they were CP-1252, which gave you â‚¬.
Meanwhile, when you write this:
p='€'
The console isn't giving Python Unicode, it's giving Python bytes in CP-1252, which Python just stores as bytes. The CP-1252 byte for the Euro sign is \x80. So, this is the same as typing:
p='\x80'
But in Latin-1, \x80 isn't the Euro sign, it's an invisible control character, equivalent to Unicode U+0080. So, when you call p.decode('latin-1'), you get back u'\x80'. Which is exactly what you're seeing.
The reason you can't reproduce this in Python 3 is that in Python 3, str, and plain string literals, are Unicode strings, not byte strings. So, when you write this:
p='€'
… the console gives Python some bytes, which Python then automatically decodes with the character set it guessed for the console (CP-1252) into Unicode. So, it's equivalent to writing this:
p='\u20ac'
… or this:
p=b'\x80'.decode(sys.stdin.encoding)
Also, you keep saying "hex code points" to mean a variety of different things, none of which make any sense.
A code point is a Unicode concept. A unicode string in Python is a sequence of code points. A str is a sequence of bytes, not code points. Hex is just a way of representing a number—the hex number 20AC, or 0x20AC, is the same thing as the decimal number 8364, and the hex number 0x80 is the same thing as the decimal number 128.
That sequence of bytes doesn't have any inherent meaning as text on its own; it needs to be combined with an encoding to have a meaning. Depending on the encoding, some code points may not be representable at all, and others may take 2 or more bytes to represent.
Finally:
Also...it seems that Latin1 hex code points CAN be different than their utf8 counterparts (I have a colleague who believes different -- says that Latin1 is just like ASCII in this respect).
Latin-1 is a superset of ASCII, and Unicode is in turn a superset of Latin-1: the first 256 code points match Latin-1. In UTF-8, however, only code points up to U+7F are encoded as a single byte with the same value as the code point; U+80 through U+FF take two bytes each. CP-1252 is a different superset of the printable subset of Latin-1. Since there is no Euro sign in either ASCII or Latin-1, it's perfectly reasonable for CP-1252 and UTF-8 to represent it differently.
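To make the byte-level differences concrete, here is a small Python 3 sketch (an illustration added for this write-up, using only the standard codecs 'utf-8', 'cp1252' and 'latin-1'):

```python
euro = "\u20ac"                              # the Euro sign, U+20AC

utf8_bytes = euro.encode("utf-8")            # b'\xe2\x82\xac'
cp1252_bytes = euro.encode("cp1252")         # b'\x80'

# UTF-8 bytes misread as CP-1252 give three unrelated characters.
misread = utf8_bytes.decode("cp1252")        # U+00E2, U+201A, U+00AC

# The CP-1252 byte misread as Latin-1 gives the control character U+0080.
control = cp1252_bytes.decode("latin-1")

# Latin-1 has no Euro sign at all, so encoding fails.
try:
    euro.encode("latin-1")
    latin1_has_euro = True
except UnicodeEncodeError:
    latin1_has_euro = False

# Code points above U+7F need two bytes in UTF-8 but one in Latin-1.
e_acute_utf8 = "\u00e9".encode("utf-8")      # b'\xc3\xa9'
e_acute_latin1 = "\u00e9".encode("latin-1")  # b'\xe9'
```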
