Python, len and slices on unicode strings

I need to make a string fit in the space allocated for it on the screen. Because the text is Unicode, len() and slices apparently work on bytes, so I end up cutting strings too short: € occupies a single column on screen but counts as three for len() and slicing.
I have the encoding header set up properly, and I'm willing to use something other than slices or len(), but I need to know how many columns the string will take and how to cut it to the available width.
$cat test.py
# -*- coding: utf-8 -*-
a = "2 €uros"
b = "2 Euros"
print len(b)
print len(a)
print a[3:]
print b[3:]
$python test.py
7
9
��uros
uros

You're not creating Unicode strings there; you're creating byte strings encoded as UTF-8 (a variable-length encoding, as you're seeing). Use literals of the form u"..." (or u'...') instead, and you get the expected result:
% cat test.py
# -*- coding: utf-8 -*-
a = u"2 €uros"
b = u"2 Euros"
print len(b)
print len(a)
print a[3:]
print b[3:]
% python test.py
7
7
uros
uros
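
For readers on Python 3, where every str is already Unicode, the same distinction shows up between str and bytes. The following is a minimal sketch under that assumption. fit_to_width is a hypothetical helper (it is not part of the answer above), and it assumes one code point per screen column, which does not hold for wide East Asian characters or combining marks.

```python
text = "2 €uros"
encoded = text.encode("utf-8")   # the byte string Python 2 was slicing

print(len(text))       # counts code points
print(len(encoded))    # counts bytes; the euro sign takes three
print(text[3:])        # slices on code points


def fit_to_width(s, width):
    """Hypothetical helper: keep at most `width` code points of s."""
    return s[:width]


print(fit_to_width(text, 3))
```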

Related

(python utf-8) using 'à','ç','é','è','ê','ë','î','ô','ù'

I am having trouble with accented characters in Python.
I added # -*- coding: utf-8 -*- so that the accents are recognized,
but it still doesn't always work: I get '?', and when I use the character afterwards I get the error "SyntaxError: Non-ASCII character '\xc3'".
Why? What should I change? Thanks.
(It fails for all of these characters: 'à','ç','é','è','ê','ë','î','ô','ù',"‘","’")
This is my code:
# -*- coding: utf-8 -*-
testList = ['à','ç','é','è','ê','ë','î','ô','ù',"‘","’"]
testCharacter = raw_input('test a character : ') # example : é
print(testCharacter) # getting é
print(testCharacter[0]) # getting ?
print(testCharacter + testCharacter[0]) # getting é?
testCharacterPosition = testList.index(testCharacter)
print(testCharacterPosition) #getting 2
This is the result on my console:
test a character : é
é
?
é?
2

It seems you are still using Python 2 (consider switching to Python 3, since Python 2 is discontinued).
When you paste a UTF-8 string into Python 2, it is stored in its encoded form and therefore consists of multiple bytes, e.g.:
>>> s = 'à'
>>> s
'\xc3\xa0'
>>> s[0]
'\xc3'
Of course this prints a question mark, since one byte alone doesn't make up the full character:
>>> print(s + s[0])
à�
You can convert this to a unicode string, which then consists of a single character:
>>> s.decode('utf-8')
u'\xe0'
>>> print(s.decode('utf-8'))
à
You can avoid the decode step by using unicode literals directly in Python 2:
>>> s = u'à'
>>> s
u'\xe0'
Better still, use Python 3, which simplifies the whole thing to:
>>> s = 'à'
>>> s
'à'
>>>
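
The effect described above can be reproduced in Python 3 by encoding explicitly. This is only a sketch: it assumes the terminal delivers UTF-8, and test_list mirrors the list from the question. A lone first byte decodes to the replacement character (shown as ? or � depending on the terminal), while the fully decoded character is found in the list as expected.

```python
test_list = ['à', 'ç', 'é', 'è', 'ê', 'ë', 'î', 'ô', 'ù', '‘', '’']

raw = 'é'.encode('utf-8')     # the two bytes a UTF-8 terminal would send
lone = raw[:1].decode('utf-8', errors='replace')   # half a character
char = raw.decode('utf-8')    # the whole character

print(len(raw), repr(lone), char)
position = test_list.index(char)
print(position)
```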

Weird behaviour when trying to print characters of a byte string

Why does this short code behave differently from one run to the next?
# -*- coding: utf-8 -*-
for c in 'aɣyul':
    print c
The outputs that I have in each run are:
# nothing
---
a
---
l
---
u
l
---
a
y
u
l
...etc
EDIT:
I know how to solve the problem; the question is just why Python prints a different part of the string on each run instead of the same part.

Add a u prefix to your string literal so that Python treats it as a unicode string and each iteration yields a whole character:
>>> for c in u'aɣyul':
...     print c
...
a
ɣ
y
u
l
Note that without the u prefix Python 2 stores the character as two separate bytes, and each print outputs one of those bytes on its own:
>>> 'aɣyul'
'a\xc9\xa3yul'
  ^   ^
Python splits the character into two bytes because instances of str hold raw 8-bit values, while this character needs more than 8 bits to represent.
You can also decode the bytes manually:
>>> print '\xc9\xa3'.decode('utf8')
ɣ
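
In Python 3 the two views are separate types, which makes the split easy to inspect. A small sketch, assuming Python 3 and UTF-8: iterating a str yields characters, while iterating the encoded bytes yields one integer per byte, including the two bytes that together form ɣ.

```python
word = "aɣyul"

chars = list(word)                         # iterating a str yields characters
byte_values = list(word.encode("utf-8"))   # iterating bytes yields integers

print(chars)
print([hex(b) for b in byte_values])
```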

python wrong character encoding comparison

I have a problem comparing Cyrillic characters in Python. Here is a small test case:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
def convert(text):
    result = []
    for i in xrange(len(text)):
        if text[i].lower() == 'й':
            result.append('q')
    print result
if __name__ == '__main__':
    convert('йцукенг')
As you can see, the first character should equal the character in the condition, but the condition fails and result is empty.
Also, printing the whole string (text) works fine, but printing a single character (like text[2]) gives '?' in the output.
I'm sure the problem is with the encoding, but how can I compare individual characters correctly?

You are seeing this behavior because you are looping over the bytes in a UTF-8 string, not over the characters. Here is an example of the difference:
>>> 'й' # note that this is two bytes
'\xd0\xb9'
>>> 'йцукенг'[0] # but when you loop you are looking at a single byte
'\xd0'
>>> len('йцукенг') # 7 characters, but 14 bytes
14
This is why it is necessary to use Unicode when checking the character, as in mVChr's answer.
The easiest way to do this is to leave all of your code exactly as it is and just add a u prefix to all of your string literals (u'йцукенг' and u'й').

Presuming you're using Python 2.x, you should use unicode strings. Try:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
def convert(text):
    result = []
    for i in xrange(len(text)):
        if text[i].lower() == unicode('й', 'utf8'):
            result.append('q')
    print result
if __name__ == '__main__':
    convert(unicode('йцукенг', 'utf8'))
Or you can simply write the unicode literals u'йцукенг' and u'й' directly.
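
For comparison, here is how the same test case might look on Python 3, where string literals are Unicode by default. This is a sketch rather than the answerers' code: it iterates over characters directly instead of indexing, and it returns the list instead of printing it so the result can be checked.

```python
def convert(text):
    """Append 'q' for every character that lower-cases to 'й'."""
    result = []
    for ch in text:              # each ch is a whole character, not a byte
        if ch.lower() == 'й':
            result.append('q')
    return result


print(convert('йцукенг'))
```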

Split an utf-8 encoded string given a bytes offset (python 2.7)

Having a UTF-8 encoded string like this:
bar = "hello ｡◕‿‿◕｡"
and a byte offset that tells me at which byte I have to split the string:
bytes_offset = 9
how can I split bar into two parts, resulting in:
>>> first_part
'hello ｡' <---- #9 bytes 'hello \xef\xbd\xa1'
>>> second_part
'◕‿‿◕｡'
In a nutshell:
given a byte offset, how can I turn it into the corresponding character index of a UTF-8 encoded string?

UTF-8 Python 2.x strings are basically byte strings.
# -*- coding: utf-8 -*-
bar = "hello ｡◕‿‿◕｡"
assert(isinstance(bar, str))
first_part = bar[:9]
second_part = bar[9:]
print first_part
print second_part
Yields:
hello ｡
◕‿‿◕｡
Python 2.6 on OS X here, but I expect the same from 2.7. If I split at 10 or 11 instead of 9, I get ? characters in the output, which means the split fell in the middle of a multibyte sequence; splitting at 12 moves the first "eyeball" into the first part of the string.
I have PYTHONIOENCODING set to utf8 in the terminal.

The character offset is the number of characters that precede the byte offset:
def byte_to_char_offset(b_string, b_offset, encoding='utf8'):
    return len(b_string[:b_offset].decode(encoding))
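
A usage sketch of that idea on Python 3, assuming the data arrives as bytes and the offset falls on a character boundary (otherwise decode raises UnicodeDecodeError). The variable names here are illustrative only.

```python
def byte_to_char_offset(data, byte_offset, encoding="utf-8"):
    """Number of characters encoded in the first byte_offset bytes."""
    return len(data[:byte_offset].decode(encoding))


bar = "hello ｡◕‿‿◕｡"
raw = bar.encode("utf-8")

char_index = byte_to_char_offset(raw, 9)
first_part, second_part = bar[:char_index], bar[char_index:]
print(char_index, first_part, second_part)
```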

Truncating unicode so it fits a maximum size when encoded for wire transfer

Given a Unicode string and these requirements:
The string must be encoded into some byte-sequence format (e.g. UTF-8 or JSON unicode escape)
The encoded string has a maximum length
For example, the iPhone push service requires JSON encoding with a maximum total packet size of 256 bytes.
What is the best way to truncate the string so that it re-encodes to valid Unicode and that it displays reasonably correctly?
(Human language comprehension is not necessary—the truncated version can look odd e.g. for an orphaned combining character or a Thai vowel, just as long as the software doesn't crash when handling the data.)
See Also:
Related Java question: How do I truncate a java string to fit in a given number of bytes, once UTF-8 encoded?
Related Javascript question: Using JavaScript to truncate text to a certain size

def unicode_truncate(s, length, encoding='utf-8'):
    encoded = s.encode(encoding)[:length]
    return encoded.decode(encoding, 'ignore')
Here is an example with a Unicode string in which each character takes 2 bytes in UTF-8; it would have crashed if the split code point had not been ignored:
>>> unicode_truncate(u'абвгд', 5)
u'\u0430\u0431'
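
The same function runs unchanged on Python 3. The sketch below, assuming Python 3 and UTF-8, checks the two properties the question asks for at every possible limit: the result is a prefix of the input, and it re-encodes to no more than the limit. The sample string is arbitrary.

```python
def unicode_truncate(s, max_bytes, encoding="utf-8"):
    # Cut the encoded form, then drop the incomplete trailing sequence, if any.
    return s.encode(encoding)[:max_bytes].decode(encoding, "ignore")


sample = "2 €uros абвгд"
for limit in range(len(sample.encode("utf-8")) + 1):
    cut = unicode_truncate(sample, limit)
    assert sample.startswith(cut)
    assert len(cut.encode("utf-8")) <= limit

print(unicode_truncate("абвгд", 5))
```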

One of UTF-8's properties is that it is easy to resync, that is, to find the character boundaries in the encoded byte stream. A first idea is to cut the encoded string at the maximum length, then walk backwards from the end removing any bytes that are > 127, since those are part of, or the start of, a multibyte character.
As written, that is too simple: it erases back to the last ASCII character, possibly the whole string. What we actually need to check is that the cut does not leave a truncated two-byte (starting with 110yyyxx), three-byte (1110yyyy) or four-byte (11110zzz) sequence at the end.
Here is a Python 2.6 implementation in clear code. Optimization should not be an issue: regardless of length, we only check the last 1-4 bytes.
# coding: UTF-8
def decodeok(bytestr):
    try:
        bytestr.decode("UTF-8")
    except UnicodeDecodeError:
        return False
    return True
def is_first_byte(byte):
    """return if the UTF-8 #byte is the first byte of an encoded character"""
    o = ord(byte)
    return ((0b10111111 & o) != o)
def truncate_utf8(bytestr, maxlen):
    u"""
    >>> us = u"ウィキペディアにようこそ"
    >>> s = us.encode("UTF-8")
    >>> trunc20 = truncate_utf8(s, 20)
    >>> print trunc20.decode("UTF-8")
    ウィキペディ
    >>> len(trunc20)
    18
    >>> trunc21 = truncate_utf8(s, 21)
    >>> print trunc21.decode("UTF-8")
    ウィキペディア
    >>> len(trunc21)
    21
    """
    L = maxlen
    for x in xrange(1, 5):
        if is_first_byte(bytestr[L-x]) and not decodeok(bytestr[L-x:L]):
            return bytestr[:L-x]
    return bytestr[:L]
if __name__ == '__main__':
    # unicode doctest hack
    import sys
    reload(sys)
    sys.setdefaultencoding("UTF-8")
    import doctest
    doctest.testmod()
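
A shorter variant of the resync idea for Python 3 is sketched below. It uses a different technique from the answer's code: instead of test-decoding the tail, it looks at the first byte that would be dropped and backs up while that byte is a continuation byte (bit pattern 10xxxxxx). It assumes the input is valid UTF-8; the function name simply mirrors the one above.

```python
def truncate_utf8(data: bytes, maxlen: int) -> bytes:
    """Cut valid UTF-8 bytes to at most maxlen bytes on a character boundary."""
    if len(data) <= maxlen:
        return data
    cut = maxlen
    # data[cut] is the first byte that would be dropped. If it is a
    # continuation byte, the cut splits a character, so move it left.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut]


s = "ウィキペディアにようこそ".encode("utf-8")
print(truncate_utf8(s, 20).decode("utf-8"))
print(truncate_utf8(s, 21).decode("utf-8"))
```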

This will do for UTF-8, if you would like to do it with a regex.
import re
partial = "\xc2\x80\xc2\x80\xc2"
re.sub("([\xf0-\xf7][\x80-\xbf]{0,2}|[\xe0-\xef][\x80-\xbf]{0,1}|[\xc0-\xdf])$", "", partial)
"\xc2\x80\xc2\x80"
It covers UTF-8 strings from U+0080 (2 bytes) to U+10FFFF (4 bytes).
It is really straightforward and follows the UTF-8 algorithm directly.
From U+0080 to U+07FF, 2 bytes are needed: 110yyyxx 10xxxxxx
That means that if you see only one byte at the end like 110yyyxx (0b11000000 to 0b11011111), it is a partial character.
It will match the pattern [\xc0-\xdf]
From U+0800 to U+FFFF, 3 bytes are needed: 1110yyyy 10yyyyxx 10xxxxxx
If you see only 1 or 2 of those bytes at the end, it is a partial character.
It will match the pattern [\xe0-\xef][\x80-\xbf]{0,1}
From U+10000 to U+10FFFF, 4 bytes are needed: 11110zzz 10zzyyyy 10yyyyxx 10xxxxxx
If you see only 1 to 3 of those bytes at the end, it is a partial character.
It will match the pattern [\xf0-\xf7][\x80-\xbf]{0,2}
Update:
If you only need the Basic Multilingual Plane, you can drop the last pattern. This will do:
re.sub("([\xe0-\xef][\x80-\xbf]{0,1}|[\xc0-\xdf])$", "", partial)
Let me know if there is any problem with that regex.
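
On Python 3 the same idea needs a bytes pattern. The following is a sketch, not the answerer's code: strip_partial_tail is a hypothetical name, four-byte lead bytes are taken as \xf0-\xf7 to match the 11110zzz bit pattern, and \Z is used instead of $ so that a trailing newline is not treated as the end of the data.

```python
import re

# A lead byte followed by too few continuation bytes, anchored at the very end.
_PARTIAL_TAIL = re.compile(
    rb"(?:[\xf0-\xf7][\x80-\xbf]{0,2}|[\xe0-\xef][\x80-\xbf]?|[\xc0-\xdf])\Z"
)


def strip_partial_tail(data: bytes) -> bytes:
    """Remove an incomplete UTF-8 sequence from the end of data, if present."""
    return _PARTIAL_TAIL.sub(b"", data)


print(strip_partial_tail(b"\xc2\x80\xc2\x80\xc2"))
```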

For JSON formatting (unicode escapes, e.g. \uabcd), I am using the following algorithm:
1. Encode the Unicode string into the backslash-escape format that it will eventually have in the JSON version
2. Truncate to 3 bytes more than my final limit
3. Use a regular expression to detect and chop off a partial encoding of a Unicode value
So (in Python 2.5), with some_string and a requirement to cut to around 100 bytes:
import re
# Given some_string is a long string with arbitrary Unicode data.
encoded_string = some_string.encode('unicode_escape')
partial_string = re.sub(r'([^\\])\\(u|$)[0-9a-f]{0,3}$', r'\1', encoded_string[:103])
final_string = partial_string.decode('unicode_escape')
Now final_string is back in Unicode but guaranteed to fit within the JSON packet later. I truncated to 103 because a purely-Unicode message would be 102 bytes encoded (17 escapes of 6 bytes each).
Disclaimer: Only tested on the Basic Multilingual Plane. Yeah yeah, I know.
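
An alternative that avoids parsing escapes by hand is sketched below for Python 3. It is a different technique from the regex above: it asks json.dumps how long each candidate prefix would be and binary-searches for the longest one that fits. truncate_for_json is a hypothetical helper, and the limit is assumed to apply to the JSON string literal including its two quote characters, so a limit below 2 yields an empty string that still does not fit.

```python
import json


def truncate_for_json(text, max_bytes):
    """Longest prefix of text whose ASCII JSON literal fits in max_bytes."""
    lo, hi = 0, len(text)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        # json.dumps escapes non-ASCII by default, so len() counts bytes.
        if len(json.dumps(text[:mid])) <= max_bytes:
            lo = mid
        else:
            hi = mid - 1
    return text[:lo]


print(json.dumps(truncate_for_json("абвгд", 14)))
```

Because Python 3 strings are sequences of code points, a character outside the Basic Multilingual Plane is kept or dropped as a whole, and its surrogate-pair escape is never split.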

Look at the byte just past the cut. If it is a continuation byte (top two bits 10), the cut
falls inside a multibyte UTF-8 character, so back up and try again
until it does not.
mxlen = 255
# bytearray indexing yields integers on both Python 2 and Python 3
encoded = bytearray(toolong.encode("utf8"))
while 0 < mxlen < len(encoded) and encoded[mxlen] & 0xc0 == 0x80:
    mxlen -= 1
truncated_string = encoded[:mxlen].decode("utf8")
