From what I understand, a Python 3 string is a sequence of bytes that has been decoded to be readable by humans, and a Python 3 bytes object is the raw bytes, which are not human readable. What I'm having trouble understanding, however, is why strings encoded with UTF-8 or ASCII are displayed as a string prefixed by a b, rather than as a sequence of bytes:
string = "I am a string"
# prints a sequence of bytes, like I would expect
string.encode("UTF-16")
b'\xff\xfeI\x00 \x00a\x00m\x00 \x00a\x00 \x00s\x00t\x00r\x00i\x00n\x00g\x00'
# Prints a sequence of human readable characters, which I don't understand
string.encode("UTF-8")
b'I am a string'
Why does a string encoded by UTF-8 or ASCII not display a sequence of bytes?
UTF-8 is a backwards-compatible superset of ASCII, i.e. anything that’s valid ASCII is valid UTF-8 and everything present in ASCII is encoded by UTF-8 using the same byte as ASCII. So it’s not “UTF-8 or ASCII” so much as “just some of ASCII”. Try other Unicode:
>>> "café".encode("UTF-8")
b'caf\xc3\xa9'
or other ASCII that wouldn’t be very helpful to look at in character form:
>>> "hello\f\n\t\r\v\0\N{SOH}\N{DEL}".encode("UTF-8")
b'hello\x0c\n\t\r\x0b\x00\x01\x7f'
The reason the repr of bytes displays printable characters instead of \xnn escapes when possible is that it’s helpful if you do happen to have bytes that contain ASCII.
And, of course, it’s still a well-formed bytes literal:
>>> b'I am a string'[0]
73
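If you want to see every byte as two hex digits regardless of printability, bytes.hex() (available since Python 3.5) shows the raw bytes uniformly:
>>> "I am a string".encode("UTF-8").hex()
'4920616d206120737472696e67'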
Additionally, from the docs:
While bytes literals and representations are based on ASCII text, bytes objects actually behave like immutable sequences of integers, with each value in the sequence restricted such that 0 <= x < 256 (attempts to violate this restriction will trigger ValueError). This is done deliberately to emphasise that while many binary formats include ASCII based elements and can be usefully manipulated with some text-oriented algorithms, this is not generally the case for arbitrary binary data.
(Emphasis added.)
At the end of the day, this is a design choice that Python made for displaying bytes.
Related
I am trying to figure out how to either convert UTF-16 offsets to UTF-8 offsets, or somehow be able to count the number of UTF-16 code units in a string. (I think that in order to do the former, you have to do the latter anyway.)
Sanity check: am I correct that the len() function, when called on a Python string, returns the number of UTF-8 code points in it?
I need to do this because the LSP protocol requires the offsets to be in UTF-16, and I am trying to build something with LSP in mind.
I can't seem to find how to do this, the only python LSP server I know of doesn't even handle this conversion itself.
Python has two datatypes which can be used for characters, neither of which natively represents UTF-16 code units.
In Python-3, strings are represented as str objects, which are conceptually vectors of unicode codepoints. So the length of a str is the number of Unicode characters it contains, and len("𐐀") is 1, just as with any other single character. That's independent of the fact that "𐐀" requires two UTF-16 code units (or four UTF-8 code units).
Python-3 also has a bytes object, which is a vector of bytes (as its name suggests). You can encode a str into a sequence of bytes using the encode method, specifying some encoding. So if you want to produce the stream of bytes representing the character "𐐀" in UTF-16LE, you would invoke "𐐀".encode('utf-16-le').
Specifying le (for little-endian) is important because encode produces a stream of bytes, not UTF-16 code units, and each code unit requires two bytes since it's a 16-bit number. If you don't specify a byte order, as in encode('utf-16'), you'll find a two-byte UTF-16 Byte Order Mark at the beginning of the encoded stream.
Since the UTF-16 encoding requires exactly two bytes for each UTF-16 code unit, you can get the UTF-16 length of a Unicode string by dividing the length of the encoded bytes object by two: len(s.encode('utf-16-le')) // 2.
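For example, 𐐀 lies outside the Basic Multilingual Plane, so it needs two UTF-16 code units:
>>> s = 'a𐐀'
>>> len(s.encode('utf-16-le')) // 2
3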
But that's a pretty clunky way to convert between UTF-16 offsets and character indexes. Instead, you can just use the fact that characters representable with a single UTF-16 code unit are precisely the characters with codepoints less than 65536 (216):
def utf16len(c):
    """Returns the length of the single character 'c'
    in UTF-16 code units."""
    return 1 if ord(c) < 65536 else 2
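Building on that, here is a minimal sketch (the helper name utf16_offset is mine, not part of any library) that turns a character index into a UTF-16 code-unit offset, which is what LSP positions need:
def utf16_offset(s, char_index):
    # UTF-16 code-unit offset of the character at position char_index.
    return sum(utf16len(c) for c in s[:char_index])

>>> utf16_offset('𐐀x', 1)   # '𐐀' alone occupies two UTF-16 code units
2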
For counting the bytes, including the BOM, len(s.encode("utf-16")) would work. You can use utf-16-le for bytes without a BOM.
Example:
>>> len("abcd".encode("utf-16"))
10
>>> len("abcd".encode("utf-16-le"))
8
As for your question: no, len() on a Python string counts the number of decoded characters (Unicode code points). If a character takes 4 bytes in UTF-8, it still counts as 1.
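You can see all three counts side by side:
>>> len('𐐀')                            # Unicode code points
1
>>> len('𐐀'.encode('utf-8'))            # UTF-8 bytes
4
>>> len('𐐀'.encode('utf-16-le')) // 2   # UTF-16 code units
2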
I have recently been studying something related to encoding and I am confused about the following.
Say I have:
a = "哈" ## whatever non-ascii char is fine
a[0] ## = "\xe5"
a[1] ## = "\x93"
a[2] ## = "\x88"
len(a) would be 3, and each of the values would be "\xe5", "\x93", and "\x88".
I understand that if I do:
a.decode("utf-8") ## = u"\u54c8"
It will become a unicode string, and the code point would be "\u54c8".
The question is: what encoding does the built-in Python string use?
Why is a[0] not "\x54" and a[1] not "\xc8", so that together they are "54c8"?
I guess the encoding of the built-in Python str must not be UTF-8, because the correct code point is "\u54c8". Is that right?
UTF-8 and Unicode are not the same thing. Unicode is an abstract mapping of integer values to characters; UTF-8 is one particular way of representing those integers as a sequence of bytes. \xe5\x93\x88 is the three-byte UTF-8 encoding of the integer 0x54c8, which cannot be represented by a single byte.
The default encoding in Python 2 was ASCII, but was changed to UTF-8 in Python 3.
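You can check this yourself with sys.getdefaultencoding():
>>> import sys
>>> sys.getdefaultencoding()
'utf-8'
On Python 2 the same call returns 'ascii'.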
The result of pasting a non-ASCII character into the interpreter like that depends on your terminal encoding. It's likely (from the look of your data) that your terminal uses UTF-8.
a = "哈"
When you evaluate that line of code in Python 2 interactive interpreter, you'll create a bytestring object that is already encoded.
To get a text object from it, you'll have to decode the data using:
a.decode(encoding)
It helps to always think of a str object as bytes and a unicode object as text.
There is no simple relationship between the code point and the UTF-8-encoded bytes. The relationship that is simple is:
u'哈' == u'\u54c8' == unichr(21704)
Think of the codepoint as just an index in a big table, which you use to lookup the character at that index. The above equality just shows that 哈 is the character at codepoint 21704 (because in hex, 0x54c8 is 21704).
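You can confirm this in the Python 2 interpreter:
>>> ord(u'\u54c8')
21704
>>> unichr(21704) == u'哈'
True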
If you want to know the relationship between a codepoint (21704) and the UTF bytes (the \xe5 and \x93 stuff), I already wrote a long answer about that here. You can read it if you're interested to learn how to encode/decode UTF by hand.
I am confused about the hex representation of Unicode.
I have an example file with a single mathematical integral sign character in it. That is U+222B.
If I cat the file or edit it in vi I get an integral sign displayed.
A hex dump of the file shows its hex content is 88e2 0aab
In Python I can create an integral Unicode character and print p, rendering an integral sign on my terminal.
>>> p=u'\u222b'
>>> p
u'\u222b'
>>> print p
∫
What confuses me is that I can open the file and get the integral symbol, but the hex content is different.
>>> c=open('mycharfile','r').read()
>>> c
'\xe2\x88\xab\n'
>>> print c
∫
One is a Unicode object and one is a plain string, but what is the relationship between the two hex codes, apparently for the same character? How would I manually convert one to the other?
The plain string has been encoded using UTF-8, one of a variety of ways to represent Unicode code points in bytes. UTF-8 is a multibyte encoding which has the often useful feature that it is a superset of ASCII - the same byte encodes any ASCII character in UTF-8 or in ASCII.
In Python 2.x, use the encode method on a Unicode object to encode it, and decode or the unicode constructor to decode it:
>>> u'\u222b'.encode('utf8')
'\xe2\x88\xab'
>>> '\xe2\x88\xab'.decode('utf8')
u'\u222b'
>>> unicode('\xe2\x88\xab', 'utf8')
u'\u222b'
print, when given a Unicode argument, implicitly encodes it. On my system:
>>> sys.stdout.encoding
'UTF-8'
See this answer for a longer discussion of print's behavior:
Why does Python print unicode characters when the default encoding is ASCII?
Python 3 handles things a bit differently; the changes are documented here:
http://docs.python.org/3.0/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit
Okay, I have it. Thanks for the answers. I wanted to see how to do the conversion by hand rather than convert a string using Python.
The conversion works this way.
If you have a Unicode character, in my example the integral symbol, a hex dump (od -x) produces:
echo -n "∫"|od -x
0000000 88e2 00ab
od -x prints little-endian 16-bit words, so the bytes within each pair are swapped; the actual byte sequence is
e2 88 ab 00
(the trailing 00 is padding added by od, not part of the character).
The first byte is 0xE2, binary 11100010. The three leading 1 bits followed by a 0 mean this is the lead byte of a three-byte UTF-8 sequence.
Each remaining byte is a continuation byte: its first two bits are always 10 and carry no data (they only mark the byte as a continuation). The full bit stream is
111000101000100010101011
Throw away the first 4 bits (1110) of the lead byte and the first two bits (10) of each continuation byte:
0010001000101011
Re-expressing this in hex
222B
There you have it!
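For completeness, here is the same bit-twiddling as a small Python 3 sketch; the function name decode_utf8_3byte is mine, and it only handles three-byte sequences:
def decode_utf8_3byte(b):
    # Lead byte must match 1110xxxx; each continuation byte contributes 6 bits.
    assert len(b) == 3 and b[0] >> 4 == 0b1110
    return ((b[0] & 0x0F) << 12) | ((b[1] & 0x3F) << 6) | (b[2] & 0x3F)

>>> hex(decode_utf8_3byte(b'\xe2\x88\xab'))
'0x222b'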
This may be a newbie question, but here goes. I have a large string (167572 bytes) with both ASCII and non-ASCII characters. When I use len() on the string I get the wrong length. It seems that len() doesn't count 0x0A characters. The only way I can get the actual length of the string is with this code:
totalLen = 0
for x in test:
    totalLen += 1
for x in test:
    if x == '\x0a':
        totalLen += 1
print totalLen
What is wrong with len()? Or am I using it wrong?
You are confusing encoded byte strings with Unicode text. In UTF-8, for example, up to 4 bytes are used to encode any given character, and in UTF-16 each character is encoded using at least 2 bytes.
A Python 2 string is a series of bytes; to get Unicode text, you have to decode the string with an appropriate codec. If your text is encoded using UTF-8, for example, you can decode it with:
test = test.decode('utf8')
On the other hand, data written to a file is always encoded, so a unicode string of length 10 could take up 20 bytes in a file, if written using the UTF-16 codec.
Most likely you are getting confused by such 'wider' characters, not by whether your \n (ASCII 10) characters are counted correctly.
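A quick Python 2 session shows the difference:
>>> test = '\xe5\x93\x88'   # three UTF-8 bytes encoding one character
>>> len(test)
3
>>> len(test.decode('utf8'))
1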
Please do yourself a favour and read up on Unicode and encodings:
Joel on Software: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
The Python Unicode HOWTO.
Could it be that you're expecting it to contain \r\n, i.e. ASCII 13 (carriage return) followed by ASCII 10 (line feed), or that you look at the string once it's been written out to a text file, which adds these?
It's hard to be specific since you don't give a lot of detail, i.e. where the string's data comes from.
I need to test whether a string is Unicode, and then whether it's UTF-8. After that, get the string's length in bytes, including the BOM if it ever uses one. How can this be done in Python?
Also for didactic purposes, what does a byte list representation of a UTF-8 string look like? I am curious how a UTF-8 string is represented in Python.
Later edit: pprint does that pretty well.
try:
    string.decode('utf-8')
    print "string is UTF-8, length %d bytes" % len(string)
except UnicodeError:
    print "string is not UTF-8"
In Python 2, str is a sequence of bytes and unicode is a sequence of characters. You use str.decode to decode a byte sequence to unicode, and unicode.encode to encode a sequence of characters to str. So for example, u"é" is the unicode string containing the single character U+00E9 and can also be written u"\xe9"; encoding into UTF-8 gives the byte sequence "\xc3\xa9".
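In a Python 2 session, that looks like:
>>> u'\xe9'.encode('utf-8')
'\xc3\xa9'
>>> '\xc3\xa9'.decode('utf-8')
u'\xe9'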
In Python 3, this is changed; bytes is a sequence of bytes and str is a sequence of characters.
To check if it is Unicode:
>>> a = u'F'
>>> isinstance(a, unicode)
True
To check if it is UTF-8 or ASCII:
>>> import chardet
>>> encoding = chardet.detect('AA')
>>> encoding['encoding']
'ascii'
I would definitely recommend Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know about Unicode and Character Sets (No Excuses!), if you haven't already read it.
For Python's Unicode and encoding/decoding machinery, start here. To get the byte-length of a Unicode string encoded in utf-8, you could do:
print len(my_unicode_string.encode('utf-8'))
Your question is tagged python-2.5, but be aware that this changes somewhat in Python 3+.
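If you also need a BOM counted for UTF-8, the utf-8-sig codec (available since Python 2.5) prepends the three-byte BOM when encoding:
>>> len(u'abc'.encode('utf-8-sig'))
6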