This may be a newbie question, but here it goes. I have a large string (167572 bytes) with both ASCII and non ASCII characters. When I use len() on the string I get the wrong length. It seems that len() doesn't count 0x0A characters. The only way I can get the actual length of the string is with this code:
totalLen = 0
for x in test:
    totalLen += 1
for x in test:
    if x == '\x0a':
        totalLen += 1
print totalLen
What is wrong with len()? Or am I using it wrong?
You are confusing encoded byte strings with unicode text. In UTF-8, for example, up to 4 bytes are used to encode any given character; in UTF-16, each character is encoded using at least 2 bytes.
A python string is a series of bytes, to get unicode you'd have to decode the string with an appropriate codec. If your text is encoded using UTF-8, for example, you can decode it with:
test = test.decode('utf8')
On the other hand, data written to a file is always encoded, so a unicode string of length 10 could take up 20 bytes in a file, if written using the UTF-16 codec.
Most likely you are getting confused by such 'wider' characters, not by whether or not your \n (ASCII 10) characters are counted correctly.
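In Python 3 terms the difference is easy to demonstrate; the string below is just an illustration:

```python
data = "Ωmega\n".encode("utf-8")   # simulate bytes read from a file
print(len(data))                    # 7 bytes: Ω alone takes two bytes in UTF-8
text = data.decode("utf-8")
print(len(text))                    # 6 characters, '\n' counted like any other
```

The byte length and the character length differ exactly by the number of extra bytes the multi-byte characters need.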
Please do yourself a favour and read up on Unicode and encodings:
Joel on Software: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
The Python Unicode HOWTO.
Could it be that you're expecting it to contain \r\n, i.e. ASCII 13 (carriage return) followed by ASCII 10 (line feed), or that you look at the string once it's been written out to a text file, which adds these?
It's hard to be specific since you don't give a lot of detail, i.e. where the string's data comes from.
I am trying to figure out how to either convert UTF-16 offsets to UTF-8 offsets, or somehow be able to count the # of UTF-16 code points in a string. (I think in order to do the former, you have to do the latter anyways.)
Sanity check: I am correct that the len() function, when operated on a python string returns the number of code points in it in UTF-8?
I need to do this because the LSP protocol requires the offsets to be in UTF-16, and I am trying to build something with LSP in mind.
I can't seem to find how to do this, the only python LSP server I know of doesn't even handle this conversion itself.
Python has two datatypes which can be used for characters, neither of which natively represents UTF-16 code units.
In Python-3, strings are represented as str objects, which are conceptually vectors of unicode codepoints. So the length of a str is the number of Unicode characters it contains, and len("𐐀") is 1, just as with any other single character. That's independent of the fact that "𐐀" requires two UTF-16 code units (or four UTF-8 code units).
Python-3 also has a bytes object, which is a vector of bytes (as its name suggests). You can encode a str into a sequence of bytes using the encode method, specifying some encoding. So if you want to produce the stream of bytes representing the character "𐐀" in UTF-16LE, you would invoke "𐐀".encode('utf-16-le').
Specifying le (for little-endian) is important because encode produces a stream of bytes, not UTF-16 code units, and each code unit requires two bytes since it's a 16-bit number. If you don't specify a byte order, as in encode('utf-16'), you'll find a two-byte UTF-16 Byte Order Mark at the beginning of the encoded stream.
Since the UTF-16 encoding requires exactly two bytes for each UTF-16 code unit, you can get the UTF-16 length of a unicode string by dividing the length of the encoded bytes object by two: len(s.encode('utf-16-le')) // 2.
But that's a pretty clunky way to convert between UTF-16 offsets and character indexes. Instead, you can just use the fact that characters representable with a single UTF-16 code unit are precisely the characters with codepoints less than 65536 (2¹⁶):
def utf16len(c):
    """Returns the length of the single character 'c'
    in UTF-16 code units."""
    return 1 if ord(c) < 65536 else 2
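Summing that per-character length gives a character-index to UTF-16-offset conversion for the LSP use case; utf16_offset is a hypothetical helper name, and the helper function is repeated here so the sketch is self-contained:

```python
def utf16len(c):
    """Length of the single character c in UTF-16 code units."""
    return 1 if ord(c) < 65536 else 2

def utf16_offset(s, char_index):
    """UTF-16 offset corresponding to character index char_index in s."""
    return sum(utf16len(c) for c in s[:char_index])

print(utf16_offset("a𐐀b", 2))  # 'a' is 1 unit, '𐐀' is 2, so offset 3
```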
For counting the bytes, including the BOM, len(s.encode("utf-16")) would work. You can use utf-16-le for bytes without a BOM.
Example:
>>> len("abcd".encode("utf-16"))
10
>>> len("abcd".encode("utf-16-le"))
8
As for your question: No, len(str) in Python returns the number of decoded characters (code points). A character that takes 4 bytes in UTF-8 still counts as 1.
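For example, with a character outside the Basic Multilingual Plane:

```python
s = "𐐀"                           # U+10400, DESERET CAPITAL LETTER LONG I
print(len(s))                      # 1 character
print(len(s.encode("utf-8")))      # 4 bytes in UTF-8
print(len(s.encode("utf-16-le")))  # 4 bytes = 2 UTF-16 code units
```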
This question already has answers here:
What does a b prefix before a python string mean?
From what I understand, a Python3 string is a sequence of bytes that has been decoded to be readable by humans, and a Python3 bytes object is the raw bytes, which are not human readable. What I'm having trouble understanding, however, is how strings encoded with UTF-8 or ASCII are displayed as a string prefixed by a b, rather than as a sequence of bytes.
string = "I am a string"
# prints a sequence of bytes, like I would expect
string.encode("UTF-16")
b'\xff\xfeI\x00 \x00a\x00m\x00 \x00a\x00 \x00s\x00t\x00r\x00i\x00n\x00g\x00'
# Prints a sequence of human readable characters, which I don't understand
string.encode("UTF-8")
b'I am a string'
Why does a string encoded by UTF-8 or ASCII not display a sequence of bytes?
UTF-8 is a backwards-compatible superset of ASCII, i.e. anything that’s valid ASCII is valid UTF-8 and everything present in ASCII is encoded by UTF-8 using the same byte as ASCII. So it’s not “UTF-8 or ASCII” so much as “just some of ASCII”. Try other Unicode:
>>> "café".encode("UTF-8")
b'caf\xc3\xa9'
or other ASCII that wouldn’t be very helpful to look at in character form:
>>> "hello\f\n\t\r\v\0\N{SOH}\N{DEL}".encode("UTF-8")
b'hello\x0c\n\t\r\x0b\x00\x01\x7f'
The reason the repr of bytes displays printable characters instead of \xnn escapes when possible is because it’s helpful if you do happen to have bytes that contain ASCII.
And, of course, it’s still a well-formed bytes literal:
>>> b'I am a string'[0]
73
Additionally: From the docs
While bytes literals and representations are based on ASCII text,
bytes objects actually behave like immutable sequences of integers,
with each value in the sequence restricted such that 0 <= x < 256
(attempts to violate this restriction will trigger ValueError). This
is done deliberately to emphasise that while many binary formats
include ASCII based elements and can be usefully manipulated with some
text-oriented algorithms, this is not generally the case for arbitrary
binary data
-emphasis added.
At the end of the day, this is a design choice that python made for displaying bytes.
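A quick way to see both faces of the same object:

```python
b = b'I am a string'
print(list(b[:4]))      # [73, 32, 97, 109] -- the underlying integers
print(b'\x49' == b'I')  # True: \x49 and 'I' are two spellings of byte 73
```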
I read questions and answers for a quiz from a file in UTF-8 encoding, but the answer can consist of 1-byte symbols (English) and 2-byte symbols (Russian) in the same text:
"best car тайота"
I need to write the answer replaced with "*" so it looks like "**** *** ******", to help guess what the answer is. For determining the length I use
len(answer.decode('utf-8'))
But for the next hint, when I want to show some symbols like "b*s* ca* *а*от*", I can access the 1-byte symbols via answer[index] but I can't read the 2-byte symbols this way, and that's why I get "b*s* ca*" without the 2-byte symbols.
Is there solution for this?
Decode the string to a Unicode value once, and do your replacements in that.
A unicode string object supports the same operations as byte strings; just be careful when mixing byte strings and Unicode strings, as that could trigger an implicit encode or decode (leading to UnicodeEncodeError or UnicodeDecodeError). Printing the string should automatically encode the value to match your terminal codec.
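A sketch of the masking in Python 3 syntax, where str is already a Unicode string and indexing works per character (the answer text is just the example from the question):

```python
answer = "best car тайота"
masked = "".join(" " if ch == " " else "*" for ch in answer)
print(masked)      # **** *** ******
print(answer[9])   # т -- per-character indexing works for Cyrillic too
```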
You may want to read up on Python and Unicode:
Pragmatic Unicode by Ned Batchelder
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
I am confused about hex representation of Unicode.
I have an example file with a single mathematical integral sign character in it. That is U+222B
If I cat the file or edit it in vi I get an integral sign displayed.
A hex dump of the file shows its hex content is 88e2 0aab
In Python I can create an integral Unicode character and print p, which renders an integral sign on my terminal.
>>> p=u'\u222b'
>>> p
u'\u222b'
>>> print p
∫
What confuses me is I can open a file with the integral sign in it, get the integral symbol but the hex content is different.
>>> c=open('mycharfile','r').read()
>>> c
'\xe2\x88\xab\n'
>>> print c
∫
One is a Unicode object and one is a plain string but what is the relationship between the two hex codes apparently for the same character? How would I manually convert one to another?
The plain string has been encoded using UTF-8, one of a variety of ways to represent Unicode code points in bytes. UTF-8 is a multibyte encoding which has the often useful feature that it is a superset of ASCII - the same byte encodes any ASCII character in UTF-8 or in ASCII.
In Python 2.x, use the encode method on a Unicode object to encode it, and decode or the unicode constructor to decode it:
>>> u'\u222b'.encode('utf8')
'\xe2\x88\xab'
>>> '\xe2\x88\xab'.decode('utf8')
u'\u222b'
>>> unicode('\xe2\x88\xab', 'utf8')
u'\u222b'
print, when given a Unicode argument, implicitly encodes it. On my system:
>>> sys.stdout.encoding
'UTF-8'
See this answer for a longer discussion of print's behavior:
Why does Python print unicode characters when the default encoding is ASCII?
Python 3 handles things a bit differently; the changes are documented here:
http://docs.python.org/3.0/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit
Okay, I have it. Thanks for the answers. I wanted to see how to do the conversion by hand rather than convert a string using Python.
The conversion works this way.
If you have a unicode character, in my example an integral symbol.
A hex dump (od -x) produces
echo -n "∫"|od -x
0000000 88e2 00ab
Each two-byte group is byte-swapped (od -x prints little-endian 16-bit words), so the actual byte sequence is
e2 88 ab 00
(the trailing 00 is just padding for the odd byte count). The first byte is 0xE2. Its high bit being set means this is part of a multi-byte UTF-8 sequence, and the leading bits 1110 mean the character is encoded in three bytes (24 bits).
The first two bits of each remaining byte are thrown away (the 10 prefix only marks a continuation byte). The full bit stream is
111000101000100010101011
Throw away the first 4 bits of the first byte and the first two bits of each continuation byte:
0010001000101011
Re-expressing this in hex
222B
There you have it!
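The same unpacking can be written out in Python, using the masks described above:

```python
data = bytes([0xE2, 0x88, 0xAB])   # UTF-8 bytes of the integral sign
# keep the low 4 bits of the lead byte and the low 6 bits of each continuation byte
cp = ((data[0] & 0x0F) << 12) | ((data[1] & 0x3F) << 6) | (data[2] & 0x3F)
print(hex(cp))   # 0x222b
print(chr(cp))   # ∫
```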
I have a string in unicode and I need to return the first N characters.
I am doing this:
result = unistring[:5]
but of course the length of unicode strings != length of characters.
Any ideas? The only solution is using re?
Edit: More info
unistring = "Μεταλλικα" #Metallica written in Greek letters
result = unistring[:1]
returns-> ?
I think that unicode strings are two bytes (char), that's why this thing happens. If I do:
result = unistring[:2]
I get
M
which is correct,
So, should I always slice*2 or should I convert to something?
Unfortunately, for historical reasons, prior to Python 3.0 there are two string types: byte strings (str) and Unicode strings (unicode).
Prior to the unification in Python 3.0 there were two ways to declare a string literal: unistring = "Μεταλλικα", which is a byte string, and unistring = u"Μεταλλικα", which is a unicode string.
The reason you see ? when you do result = unistring[:1] is that slicing a byte string cuts at byte boundaries, and the first byte here is only part of the multi-byte UTF-8 encoding of 'Μ', so your terminal cannot render the truncated sequence. You have probably seen this kind of problem if you ever used a really old email client and received emails from friends in countries like Greece, for example.
So in Python 2.x if you need to handle Unicode you have to do it explicitly. Take a look at this introduction to dealing with Unicode in Python: Unicode HOWTO
When you say:
unistring = "Μεταλλικα" #Metallica written in Greek letters
You do not have a unicode string. You have a bytestring in (presumably) UTF-8. That is not the same thing. A unicode string is a separate datatype in Python. You get unicode by decoding bytestrings using the right encoding:
unistring = "Μεταλλικα".decode('utf-8')
or by using the unicode literal in a source file with the right encoding declaration
# coding: UTF-8
unistring = u"Μεταλλικα"
The unicode string will do what you want when you do unistring[:5].
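A sketch in Python 3 syntax, where the byte string has to be built explicitly because Python 3 str literals are already Unicode:

```python
raw = "Μεταλλικα".encode("utf-8")  # the kind of byte string the question starts from
print(len(raw))                     # 18 bytes: each Greek letter is 2 bytes in UTF-8
uni = raw.decode("utf-8")
print(uni[:5])                      # Μεταλ -- the first five characters
```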
There is no correct straightforward approach with any type of "Unicode string".
Even Python's "Unicode" strings are UTF-16 on some builds and so contain variable-length characters; you can't always just cut with ustring[:5], because some Unicode code points take more than one "character", i.e. surrogate pairs.
So if you want to cut 5 code points (note: these are not characters), you may have to analyze the text; see http://en.wikipedia.org/wiki/UTF-8 and http://en.wikipedia.org/wiki/UTF-16 for the encoding definitions. You would need to use bit masks to figure out the boundaries.
Also, you still do not get characters that way. For example, the word "שָלוֹם" ("shalom", peace in Hebrew) consists of 4 letters but 6 code points: letter "shin", vowel "a", letter "lamed", letter "vav", vowel "o", and final letter "mem".
So a character is not a code point.
The same holds for most Western languages, where a letter with diacritics may be represented as two code points. Search, for example, for "unicode normalization".
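The diacritics case is easy to reproduce with the standard library's unicodedata module; NFC normalization merges the pair into a single precomposed code point:

```python
import unicodedata

s = "e\u0301"                        # 'e' + COMBINING ACUTE ACCENT, renders as é
print(len(s))                        # 2 code points, one visible character
nfc = unicodedata.normalize("NFC", s)
print(len(nfc), nfc == "\u00e9")     # 1 True -- the precomposed é (U+00E9)
```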
So... if you really need the first 5 characters, you have to use tools like the ICU library. For example, there is an ICU binding for Python (PyICU) that provides a character boundary iterator.