Python: wrong character encoding comparison

I have a problem with Cyrillic character comparison in Python. Here is a small test case:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

def convert(text):
    result = []
    for i in xrange(len(text)):
        if text[i].lower() == 'й':
            result.append('q')
    print result

if __name__ == '__main__':
    convert('йцукенг')
You can clearly see that the first character should be equal to the character in the condition, but the condition fails and result stays empty.
Also, if I print the whole string (text) it works fine, but if I print a single character (like text[2]) I get '?' in the output.
I'm sure the problem is with encoding, but how can I do correct comparison of separate characters?

You are seeing this behavior because you are looping over the bytes in a UTF-8 string, not over the characters. Here is an example of the difference:
>>> 'й' # note that this is two bytes
'\xd0\xb9'
>>> 'йцукенг'[0] # but when you loop you are looking at a single byte
'\xd0'
>>> len('йцукенг') # 7 characters, but 14 bytes
14
This is why it is necessary to use Unicode for checking the character, as in mVChr's answer.
The easiest way to do this is to leave all of your code exactly the same and just add a u prefix to all of your string literals (u'йцукенг' and u'й').
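For reference, a minimal Python 3 sketch of the same function (in Python 3, str is already Unicode, so the per-character comparison works directly; returning the list instead of printing is an illustrative choice):

```python
def convert(text):
    # In Python 3, str is a sequence of characters, not bytes,
    # so iteration and comparison work per character.
    result = []
    for ch in text:
        if ch.lower() == 'й':
            result.append('q')
    return result

print(convert('йцукенг'))  # ['q'] -- only the first character matches
```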

Presuming you're using Python 2.x, you should use unicode strings; try:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

def convert(text):
    result = []
    for i in xrange(len(text)):
        if text[i].lower() == unicode('й', 'utf8'):
            result.append('q')
    print result

if __name__ == '__main__':
    convert(unicode('йцукенг', 'utf8'))
Or you can simply use Unicode string literals: u'йцукенг' and u'й'.

Related

Python 3: Unescape(?) a string [duplicate]

I have encountered a case where I need to convert a string of escape sequences into the actual characters in Python.
s = "\\x80\\x78\\x07\\x00\\x75\\xb3"
print s #gives: \x80\x78\x07\x00\x75\xb3
What I want is that, given the string s, I can get the real characters stored in s, which in this case are "\x80", "\x78", "\x07", "\x00", "\x75", and "\xb3".
You can use string-escape encoding (Python 2.x):
>>> s = "\\x80\\x78\\x07\\x00\\x75\\xb3"
>>> s.decode('string-escape')
'\x80x\x07\x00u\xb3'
Use unicode-escape encoding (in Python 3.x, need to convert to bytes first):
>>> s.encode().decode('unicode-escape')
'\x80x\x07\x00u³'
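In Python 3 the unicode-escape result is a str of code points rather than bytes; a common follow-up (a sketch, not part of the answer above) is to re-encode via latin-1 to recover the actual byte values:

```python
s = "\\x80\\x78\\x07\\x00\\x75\\xb3"
# unicode-escape turns each \xNN into the code point U+00NN;
# latin-1 maps code points 0-255 back to single bytes one-to-one.
raw = s.encode('ascii').decode('unicode-escape').encode('latin-1')
print(raw)  # b'\x80x\x07\x00u\xb3'
```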
You can simply write a function that takes the string and returns the converted form, something like this:
def str_to_chr(s):
    res = ""
    for part in s.split("\\")[1:]:  # "\\x33\\x45" -> ["x33", "x45"]
        res += chr(int("0" + part, 16))  # parse the hex value, then take chr()
    return res
Remember to print the function's return value.
To find out what each line does, run that line on its own; if you still have questions, leave a comment and I'll answer.
Or you can build a string from the byte values, but the result might not all be "printable" depending on your encoding. Example:
# -*- coding: utf-8 -*-
s = "\\x80\\x78\\x07\\x00\\x75\\xb3"
r = ''
for byte in s.split('\\x'):
    if byte:  # skip the empty leading element
        r += chr(int(byte, 16))  # convert from hex string to int first
print(r)  # given the example, not all bytes are printable characters in utf-8
HTH, Edwin
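A Python 3 sketch of the same idea (illustrative, not from the answer above): collect the integers and build a bytes object, since chr() in Python 3 returns a one-character str rather than a byte:

```python
s = "\\x80\\x78\\x07\\x00\\x75\\xb3"
# Split on the literal two-character sequence "\x" and parse each hex pair.
r = bytes(int(part, 16) for part in s.split('\\x') if part)
print(r)  # b'\x80x\x07\x00u\xb3'
```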

Convert escaped utf-8 string to utf in python 3

I have a Python 3 string that includes escaped UTF-8 sequences, such as "Company\\ffffffc2\\ffffffae", which I would like to convert to the correct UTF-8 string (which in the example would be "Company®", since the escaped sequence is c2 ae). I've tried
print(bytes("Company\\\\ffffffc2\\\\ffffffae".replace(
    "\\\\ffffff", "\\x"), "ascii").decode("utf-8"))
result: Company\xc2\xae
print(bytes("Company\\\\ffffffc2\\\\ffffffae".replace(
    "\\\\ffffff", "\\x"), "ascii").decode("unicode_escape"))
result: Company®
(wrong, since the characters are treated separately, but they should be decoded together).
If I do
print (b"Company\xc2\xae".decode("utf-8"))
It gives the correct result.
Company®
How can I achieve that programmatically (i.e. starting from a Python 3 str)?
A simple solution is:
import ast
test_in = "Company\\\\ffffffc2\\\\ffffffae"
test_out = ast.literal_eval("b'''" + test_in.replace('\\\\ffffff','\\x') + "'''").decode('utf-8')
print(test_out)
However it will fail if there is a triple quote ''' in the input string itself.
Following code does not have this problem, but it is not as simple as the first one.
In the first step the string is split on a regular expression. The odd items are ascii parts, e.g. "Company"; each even item corresponds to one escaped utf8 code, e.g. "\\\\ffffffc2". Each substring is converted to bytes according to its meaning in the input string. Finally all parts are joined together and decoded from bytes to a string.
import re

REGEXP = re.compile(r'(\\\\ffffff[0-9a-f]{2})', flags=re.I)

def convert(estr):
    def split(estr):
        for i, substr in enumerate(REGEXP.split(estr)):
            if i % 2:
                yield bytes.fromhex(substr[-2:])
            elif substr:
                yield bytes(substr, 'ascii')
    return b''.join(split(estr)).decode('utf-8')

test_in = "Company\\\\ffffffc2\\\\ffffffae"
print(convert(test_in))
The code could be optimized: ASCII parts do not need the encode/decode round trip, and consecutive hex codes should be concatenated before decoding.
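The whole transformation can also be sketched as a single re.sub over bytes (an illustrative variant, not the original answer's code): every escape is replaced by its raw byte in one pass, so consecutive escapes are naturally adjacent before the final decode.

```python
import re

def convert(estr):
    # Substitute each \\ffffffXX escape with its raw byte,
    # then decode the whole byte string as UTF-8 at the end.
    raw = re.sub(
        rb'\\\\ffffff([0-9a-fA-F]{2})',
        lambda m: bytes.fromhex(m.group(1).decode('ascii')),
        estr.encode('ascii'),
    )
    return raw.decode('utf-8')

print(convert("Company\\\\ffffffc2\\\\ffffffae"))  # Company®
```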

Weird behaviour when trying to print characters of a byte string

Why does this short code behave differently from one run to the next?
# -*- coding: utf-8 -*-
for c in 'aɣyul':
    print c
The outputs that I have in each run are:
# nothing
---
a
---
l
---
u
l
---
a
y
u
l
...etc
EDIT:
I know how to solve the problem; the question is why Python prints a different part of the string, instead of the same part, on each run.
You need to add a u prefix to your string, which makes Python treat it as Unicode and decode the characters while printing:
>>> for c in u'aɣyul':
... print c
...
a
ɣ
y
u
l
Note that without the u prefix, Python stores the Unicode character as two separate bytes, and on each print you get the raw representation of those byte values:
>>> 'aɣyul'
'a\xc9\xa3yul'
  ^   ^
If you want to know why Python breaks the character into two values: instances of str hold raw 8-bit values, while this Unicode character needs more than 8 bits, so UTF-8 encodes it as two bytes.
You can also decode the hex values manually:
>>> print '\xc9\xa3'.decode('utf8')
ɣ
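For comparison, a Python 3 sketch (illustrative): iterating a str yields characters, while iterating the UTF-8 bytes yields the individual byte values the answer above describes:

```python
s = 'aɣyul'
print(list(s))   # ['a', 'ɣ', 'y', 'u', 'l'] -- five characters
b = s.encode('utf-8')
print(list(b))   # [97, 201, 163, 121, 117, 108] -- six bytes;
                 # 0xc9 0xa3 (201, 163) together encode 'ɣ'
```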

Split an utf-8 encoded string given a bytes offset (python 2.7)

Having a UTF-8 encoded string like this:
bar = "hello 。◕‿‿◕。"
and a bytes offset that tells me at which byte I have to split the string:
bytes_offset = 9
how can I split the bar string in two parts resulting in:
>>> first_part
'hello 。'  # <-- 9 bytes: 'hello \xef\xbd\xa1'
>>> second_part
'◕‿‿◕。'
In a nutshell:
given a bytes offset, how can I transform it into the actual character index of a UTF-8 encoded string?
UTF-8 Python 2.x strings are basically byte strings.
# -*- coding: utf-8 -*-
bar = "hello 。◕‿‿◕。"
assert(isinstance(bar, str))
first_part = bar[:9]
second_part = bar[9:]
print first_part
print second_part
Yields:
hello 。
◕‿‿◕。
Python 2.6 on OSX here but I expect the same from 2.7. If I split on 10 or 11 instead of 9, I get ? characters output implying that it broke the sequence of bytes in the middle of a multibyte character sequence; splitting on 12 moves the first "eyeball" to the first part of the string.
I have PYTHONIOENCODING set to utf8 in the terminal.
Character offset is the number of characters before the byte offset:
def byte_to_char_offset(b_string, b_offset, encoding='utf8'):
    return len(b_string[:b_offset].decode(encoding))
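A Python 3 sketch of the same approach (names are illustrative); here the encoded form is explicit bytes, and the offset must fall on a character boundary or decode() raises UnicodeDecodeError:

```python
bar = 'hello 。◕‿‿◕。'
data = bar.encode('utf-8')

def byte_to_char_offset(raw, b_offset, encoding='utf-8'):
    # Count how many characters the first b_offset bytes decode to.
    return len(raw[:b_offset].decode(encoding))

idx = byte_to_char_offset(data, 9)  # '。' ends at byte 9
print(bar[:idx], '/', bar[idx:])
```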

Python, len and slices on unicode strings

I am handling a situation where I need to make a string fit in the allocated gap on the screen. As I'm using Unicode, len() and slices apparently work on bytes, and I end up cutting strings too short, because '€' occupies only one cell on the screen but three bytes for len() and slices.
I have the encoding headers properly set up, and I'm willing to use things other than slices or len() to deal with this, but I really need to know how many screen cells the string will take and how to cut it to fit.
$cat test.py
# -*- coding: utf-8 -*-
a = "2 €uros"
b = "2 Euros"
print len(b)
print len(a)
print a[3:]
print b[3:]
$python test.py
7
9
��uros
uros
You're not creating Unicode strings there; you're creating byte strings with UTF-8 encoding (which is variable-length, as you're seeing). You need to use constants of the form u"..." (or u'...'). If you do that, you get the expected result:
% cat test.py
# -*- coding: utf-8 -*-
a = u"2 €uros"
b = u"2 Euros"
print len(b)
print len(a)
print a[3:]
print b[3:]
% python test.py
7
7
uros
uros
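In Python 3 (a sketch for comparison) the default str is already Unicode, so len() counts characters and the byte length only appears after an explicit encode:

```python
a = "2 €uros"
print(len(a))                  # 7 characters
print(len(a.encode('utf-8')))  # 9 bytes: '€' encodes to 3 bytes in UTF-8
print(a[3:])                   # uros -- slicing is per character
```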