python utf-8 behaviour [duplicate] - python

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Python returning the wrong length of string when using special characters
I read a multilingual string from file in windows-1251, for example s="qwe абв" (second part in Russian), and then:
for i in s.decode('windows-1251').encode('utf-8').split():
    print i, len(i)
and I get:
qwe 3
абв 6
Oh God, why? o_O

In programming languages you can't always think of strings as a sequence of characters, because generally they are actually a sequence of bytes. You can't store every character or symbol in 8 bits, so character encodings define rules for combining multiple bytes into a single character.
In the case of the string 'абв' encoded in utf-8, what you have is 6 bytes that represent 3 characters. If you want to count the number of characters instead of the number of bytes, make sure you are taking the length from a unicode string.
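A minimal Python 3 sketch of the same distinction (the question uses Python 2, where str is the byte string and unicode is the character string). The variable names are only for illustration:

```python
# Simulate the bytes read from a windows-1251 file.
raw = "qwe абв".encode("windows-1251")

# Decode once to get a real text string, then measure both ways.
text = raw.decode("windows-1251")
lengths = []
for word in text.split():
    utf8_bytes = word.encode("utf-8")
    lengths.append((word, len(word), len(utf8_bytes)))
    print(word, len(word), len(utf8_bytes))
# qwe 3 3
# абв 3 6
```

The character count comes from the decoded string, the byte count from the encoded one.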

>>> print "абв"
абв
>>> print [char for char in "абв"]
['\xd0', '\xb0', '\xd0', '\xb1', '\xd0', '\xb2']
That's why :)

Related

How to convert string hex into bytes format? [duplicate]

This question already has answers here:
Show hex value for all bytes, even when ASCII characters are present
(2 answers)
Closed 3 months ago.
The string looks like "e52c886a88b6f421a9324ea175dc281478f03003499de6162ca72ddacf4b09e0". When I run the code below, the output is not what I expect.
hexstr = "e52c886a88b6f421a9324ea175dc281478f03003499de6162ca72ddacf4b09e0"
hexstr = bytes.fromhex(hexstr)
print(hexstr)
The output is
b'\xe5,\x88j\x88\xb6\xf4!\xa92N\xa1u\xdc(\x14x\xf00\x03I\x9d\xe6\x16,\xa7-\xda\xcfK\t\xe0'
My expected output should be like b'\xe5\x2c\xc8\x86......
Your code is correct.
Python tries to be helpful by displaying any byte that maps to a printable ASCII character as that character. For example, \x2c maps to , (a comma).
>>> b',' == b'\x2c'
True
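If the goal is only to see every byte as a hex escape, one option is to build that display string yourself. This is a sketch, not the only way; the bytes object itself is unchanged and `escaped` is just a name chosen here:

```python
hexstr = "e52c886a88b6f421a9324ea175dc281478f03003499de6162ca72ddacf4b09e0"
data = bytes.fromhex(hexstr)

# Format each byte as \xNN, including the ones that are printable ASCII.
escaped = "".join(f"\\x{byte:02x}" for byte in data)
print(escaped[:16])   # \xe5\x2c\x88\x6a

# bytes.hex() gives the plain hex digits back.
print(data.hex() == hexstr)   # True
```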

Efficient way to cut a (UTF-8) string in Python to a given maximal byte length [duplicate]

This question already has answers here:
Truncating string to byte length in Python
(4 answers)
Closed 3 years ago.
For storage in a given Oracle table (whose field lengths are defined in bytes) I need to cut strings beforehand in Python 3 to a maximal length in Bytes, although the strings can contain UTF-8 characters.
My solution is to concatenate the result string character by character from the original string and check when the result string exceeds the length limit:
def cut_str_to_bytes(s, max_bytes):
    """
    Ensure that a string has not more than max_bytes bytes
    :param s: The string (utf-8 encoded)
    :param max_bytes: Maximal number of bytes
    :return: The cut string
    """
    def len_as_bytes(s):
        return len(s.encode(errors='replace'))
    if len_as_bytes(s) <= max_bytes:
        return s
    res = ""
    for c in s:
        old = res
        res += c
        if len_as_bytes(res) > max_bytes:
            res = old
            break
    return res
This is obviously rather slow. What is an efficient way to do this?
PS: I saw Truncate a string to a specific number of bytes in Python, but the solution there uses sys.getsizeof(), which does not give the number of bytes of the string's characters but rather the size of the whole string object (Python needs some bytes to manage the string object), so that does not really help.
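A small sketch of that difference, assuming CPython; the exact value of sys.getsizeof() depends on the interpreter version, so it is not shown here:

```python
import sys

s = "абв"

# Number of bytes the text occupies once encoded as UTF-8.
byte_length = len(s.encode("utf-8"))
print(byte_length)   # 6

# Size of the whole str object in memory, including interpreter overhead.
object_size = sys.getsizeof(s)
print(object_size > byte_length)
```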
It is valid to cut a UTF-8 string anywhere except in the middle of a multibyte character. So, if you want the longest UTF-8 string within a maximum byte length, what you need is to first take the max bytes and then reduce it as long as it has an unfinished character at the end.
Your solution is at least O(n), because it goes character by character and re-encodes the growing result at every step. This one encodes once and then removes at most 3 bytes from the end (because a UTF-8 character is never longer than 4 bytes).
RFC 3629 specifies these as valid UTF-8 byte sequences:
Char. number range | UTF-8 octet sequence
(hexadecimal) | (binary)
--------------------+---------------------------------------------
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
So, the simplest way to go with a valid UTF-8 stream:
if the last byte is 0xxxxxxx, all is fine
otherwise, find the location of a 11xxxxxx within the last 4 bytes to see whether you have a complete character, based on the table above
Therefore, this should work:
def cut_str_to_bytes(s, max_bytes):
    # cut it twice to avoid encoding potentially GBs of `s` just to get e.g. 10 bytes?
    b = s[:max_bytes].encode('utf-8')[:max_bytes]
    # `b and` guards against an empty result (empty `s` or max_bytes == 0)
    if b and b[-1] & 0b10000000:
        # search at most the last 4 bytes, without indexing past the start of `b`
        last_11xxxxxx_index = [i for i in range(-1, -min(len(b), 4) - 1, -1)
                               if b[i] & 0b11000000 == 0b11000000][0]
        # note that last_11xxxxxx_index is negative
        last_11xxxxxx = b[last_11xxxxxx_index]
        if not last_11xxxxxx & 0b00100000:    # 110xxxxx
            last_char_length = 2
        elif not last_11xxxxxx & 0b00010000:  # 1110xxxx
            last_char_length = 3
        else:                                 # 11110xxx
            last_char_length = 4
        if last_char_length > -last_11xxxxxx_index:
            # remove the incomplete character
            b = b[:last_11xxxxxx_index]
    return b.decode('utf-8')
Alternatively, you may try decoding the last bytes, rather than doing the low-level stuff, but I'm not sure the code would be simpler that way...
Note: The function shown here works for strings which are longer than two characters. A version which also covers the edge cases of shorter strings can be found on GitHub.
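The decoding alternative mentioned above might look like the sketch below. It relies on the fact that, after cutting valid UTF-8, the only invalid bytes are those of a truncated character at the very end, so errors='ignore' drops exactly that tail. The function name is made up for this example:

```python
def cut_str_to_bytes_by_decoding(s, max_bytes):
    # Pre-slice by characters: every character needs at least one byte,
    # so the first max_bytes characters already cover max_bytes bytes.
    b = s[:max_bytes].encode('utf-8')[:max_bytes]
    # A character cut in half at the end is invalid UTF-8 and gets dropped.
    return b.decode('utf-8', errors='ignore')


print(cut_str_to_bytes_by_decoding("абв", 5))   # аб
print(cut_str_to_bytes_by_decoding("абв", 6))   # абв
```

This version also handles empty and very short strings, since it never indexes into the byte string.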

Printing escape sequence character in python [duplicate]

This question already has answers here:
How to get the ASCII value of a character
(5 answers)
Closed 3 years ago.
I tried to print the escape sequence characters or the ASCII representation of numbers in Python in a for loop.
Like:
for i in range(100, 150):
    b = "\%d" %i
    print(b)
I expected the output like,
A
B
C
Or something.
But I got like,
\100
\101
How to print ASCII representation of the numbers?
Python has two built-in functions for this, ord and chr.
ord returns the numeric code of a character, for example:
print(ord('h'))
The output of the above is 104
ord only accepts a string of length one.
chr is the inverse of ord:
print(chr(104))
The output of the above is h
chr only accepts an integer; floats, strings, and bytes are not supported.
chr and ord are useful when you need to translate the characters of an encoded text file.
You can use the ord() function to print the ASCII value of a character.
print(ord('b'))
> 98
Likewise, you can use the chr() function to print the ASCII character represented by a number.
print(chr(98))
> b
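Applied to the loop from the question, a sketch could look like this. The range 65 to 69 is chosen here because 65 is the code of 'A'; the original range of 100 to 149 would also run into non-printable codes from 127 onward:

```python
# Convert each number to its character with chr() instead of
# trying to build an escape sequence at runtime.
letters = []
for code in range(65, 70):
    letters.append(chr(code))
    print(code, chr(code))
# 65 A
# 66 B
# 67 C
# 68 D
# 69 E
```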

Capitalize first letter in string if first character is not letter? [duplicate]

This question already has answers here:
python capitalize first letter only
(10 answers)
Closed 9 years ago.
I'd like to capitalize the first letter in a string. The string will be a hash (and therefore mostly numbers), so string.title() won't work, because a string like 85033ba6c would be changed to 85033Ba6C, not 85033Ba6c, because the number separates words, confusing title(). I'd like to capitalize the first letter of a string, no matter how far into the string the letter is. Is there a function for this?
Using re.sub with count:
>>> import re
>>> strs = '85033ba6c'
>>> re.sub(r'[A-Za-z]', lambda m: m.group(0).upper(), strs, count=1)
'85033Ba6c'
It is assumed in this answer that there is at least one character in the string where isalpha will return True (otherwise, this raises StopIteration)
i,letter = next(x for x in enumerate(myhash) if x[1].isalpha())
new_string = ''.join((myhash[:i],letter.upper(),myhash[i+1:]))
Here, I pick out the character (and index) of the first alpha character in the string. I turn that character into an uppercase character and I join the rest of the string with it.
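If the string might contain no letters at all, a plain loop avoids the StopIteration case. This is a sketch with a made-up function name, returning the input unchanged when there is nothing to capitalize:

```python
def capitalize_first_letter(text):
    # Find the first alphabetic character and uppercase only that one.
    for index, char in enumerate(text):
        if char.isalpha():
            return text[:index] + char.upper() + text[index + 1:]
    # No letters found: nothing to change.
    return text


print(capitalize_first_letter('85033ba6c'))   # 85033Ba6c
print(capitalize_first_letter('12345'))       # 12345
```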

how does python2.7 deal with unicode? I am more and more confused

On Linux, I opened a terminal, started python2.7, and entered the following code:
>>> s = u'\u0561'
>>> print s
ա
>>> len(s)
1
The length of u'\u0561' is only 1? Why? I learned that every non-alphabet character's length is 2 to 4 bytes in Unicode, so why does it use only 1 byte? I also tested other Unicode characters and found that almost all of them have length 1. Why?
The len function doesn't count the number of bytes. It counts the number of items in any sequence (in this case, the number of characters in the string).
the length of u'\u0561' is only 1? Why?
Because ա is one character.
In other words, for the same reason that the len() of ['hi mom this is an incredibly long string'] is 1: because 'hi mom this is an incredibly long string' is one list item.
It's giving you the length in characters, not bytes.
\u0561
This is one character, so the length is one.
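To see the byte counts the question was expecting, encode the string first. A Python 3 sketch (in Python 2.7 the literal would be u'\u0561'); the byte length depends on the encoding you choose:

```python
s = "\u0561"

char_count = len(s)                        # characters in the string
utf8_count = len(s.encode("utf-8"))        # bytes in UTF-8
utf16_count = len(s.encode("utf-16-le"))   # bytes in UTF-16, no BOM
utf32_count = len(s.encode("utf-32-le"))   # bytes in UTF-32, no BOM

print(char_count, utf8_count, utf16_count, utf32_count)   # 1 2 2 4
```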
