Two unicode encodings represent 1 cyrillic letter

Two unicode encodings represent 1 cyrillic letter - python

I have such string in unicode and utf-8 representation:
\u00d0\u0095\u00d1\u0081\u00d0\u00bb\u00d0\u00b8\u00d0\u00bf\u00d0\u00be\u00d0\u00b2\u00d0\u00b5\u00d0\u00b7\u00d0\u00b5\u00d1\u0082 \u00d1\u0082\u00d0\u00be\u00d1\u0081\u00d0\u00b5\u00d0\u00b3\u00d0\u00be\u00d0\u00b4\u00d0\u00bd\u00d1\u008f\u00d1\u0083\u00d0\u00b6\u00d0\u00b5\u00d1\u0081\u00d0\u00ba\u00d0\u00b8\u00d0\u00bd\u00d1\u0083
and
Ð•ÑÐ»Ð¸ Ð¿Ð¾Ð²ÐµÐ·ÐµÑ‚ Ñ‚Ð¾ ÑÐµÐ³Ð¾Ð´Ð½Ñ ÑƒÐ¶Ðµ ÑÐºÐ¸Ð½Ñƒ.
The desired ouput is "Если повезет то сегодня уже скину".
I have tried all possible encodings but still wasn't able to get it in complete cyrillic form.
The best I got was
'�?�?ли повезе�? �?о �?егодн�? �?же �?кин�?'
using windows-1252.
And also I've noticed that one cyrillic letter in desired string means two unicode encodings.
For example: \u00d0\u0095 = 'Е'.
Maybe someone knows what encoding and how to use it to get a normal result?

You have a mis-decoded string where the UTF-8 bytes were translated as ISO-8859-1 (also known as latin1). Ideally, re-download with the correct encoding, but you can also encode with the wrongly-used encoding to regain the original byte stream, then decode with the right encoding (UTF-8):
Python:
>>> s = '\u00d0\u0095\u00d1\u0081\u00d0\u00bb\u00d0\u00b8\u00d0\u00bf\u00d0\u00be\u00d0\u00b2\u00d0\u00b5\u00d0\u00b7\u00d0\u00b5\u00d1\u0082 \u00d1\u0082\u00d0\u00be\u00d1\u0081\u00d0\u00b5\u00d0\u00b3\u00d0\u00be\u00d0\u00b4\u00d0\u00bd\u00d1\u008f\u00d1\u0083\u00d0\u00b6\u00d0\u00b5\u00d1\u0081\u00d0\u00ba\u00d0\u00b8\u00d0\u00bd\u00d1\u0083'
>>> s
'Ð\x95Ñ\x81Ð»Ð¸Ð¿Ð¾Ð²ÐµÐ·ÐµÑ\x82 Ñ\x82Ð¾Ñ\x81ÐµÐ³Ð¾Ð´Ð½Ñ\x8fÑ\x83Ð¶ÐµÑ\x81ÐºÐ¸Ð½Ñ\x83'
>>> print(s)
ÐÑÐ»Ð¸Ð¿Ð¾Ð²ÐµÐ·ÐµÑ ÑÐ¾ÑÐµÐ³Ð¾Ð´Ð½ÑÑÐ¶ÐµÑÐºÐ¸Ð½Ñ
>>> s.encode('latin1')
b'\xd0\x95\xd1\x81\xd0\xbb\xd0\xb8\xd0\xbf\xd0\xbe\xd0\xb2\xd0\xb5\xd0\xb7\xd0\xb5\xd1\x82 \xd1\x82\xd0\xbe\xd1\x81\xd0\xb5\xd0\xb3\xd0\xbe\xd0\xb4\xd0\xbd\xd1\x8f\xd1\x83\xd0\xb6\xd0\xb5\xd1\x81\xd0\xba\xd0\xb8\xd0\xbd\xd1\x83'
>>> s.encode('latin1').decode('utf8')
'Еслиповезет тосегодняужескину'
You may also have a literal string of Unicode escape codes, which is a bit trickier:
>>> s=r'\u00d0\u0095\u00d1\u0081\u00d0\u00bb\u00d0\u00b8\u00d0\u00bf\u00d0\u00be\u00d0\u00b2\u00d0\u00b5\u00d0\u00b7\u00d0\u00b5\u00d1\u0082 \u00d1\u0082\u00d0\u00be\u00d1\u0081\u00d0\u00b5\u00d0\u00b3\u00d0\u00be\u00d0\u00b4\u00d0\u00bd\u00d1\u008f\u00d1\u0083\u00d0\u00b6\u00d0\u00b5\u00d1\u0081\u00d0\u00ba\u00d0\u00b8\u00d0\u00bd\u00d1\u0083'
>>> print(s)
\u00d0\u0095\u00d1\u0081\u00d0\u00bb\u00d0\u00b8\u00d0\u00bf\u00d0\u00be\u00d0\u00b2\u00d0\u00b5\u00d0\u00b7\u00d0\u00b5\u00d1\u0082 \u00d1\u0082\u00d0\u00be\u00d1\u0081\u00d0\u00b5\u00d0\u00b3\u00d0\u00be\u00d0\u00b4\u00d0\u00bd\u00d1\u008f\u00d1\u0083\u00d0\u00b6\u00d0\u00b5\u00d1\u0081\u00d0\u00ba\u00d0\u00b8\u00d0\u00bd\u00d1\u0083
In this case, the string has to be converted back to bytes, decoded as Unicode escapes, then encoded back to bytes and correctly decoded as UTF-8. latin1 has the feature that the first 256 code points of Unicode map to bytes 0-255 in that codec, so it converts 1:1 code point to byte value.
>>> s.encode('latin1').decode('unicode-escape').encode('latin1').decode('utf8')
'Еслиповезет тосегодняужескину'

d0 95 d1 81 d0 bb d0 b8 is the correct UTF-8 octet stream for "Если".
So you need to convert each character to a byte (8-bit word, octet) by removing the most significant part (which is always 0 anyway in your example). Then decode them as UTF-8.
Or better, go back to the source from which you got this, and make sure the stream of octets is not seen as single-byte encoding.

Related

How come I can decode a UTF-8 byte string to ISO8859-1 and back again without any UnicodeEncodeError/UnicodeDecodeError?

How come the following works without any errors in Python?
>>> '你好'.encode('UTF-8').decode('ISO8859-1')
'ä½\xa0å¥½'
>>> _.encode('ISO8859-1').decode('UTF-8')
'你好'
I would have expected it to fail with a UnicodeEncodeError or UnicodeDecodeError
Is there some property of ISO8859-1 and UTF-8 such that I can take any UTF-8 encoded string and decode it to a ISO8859-1 string, which can later be reversed to get the original UTF-8 string?
I'm working with an older database that only supports the ISO8859-1 character set. It seems like the developers were able to store Chinese and other languages in this database by decoding UTF-8 encoded strings into ISO8859-1, and storing the resulting garbage string in the database. Downstream systems which query this database then have to encode the garbage string in ISO8859-1 and then decode the result with UTF-8 to get the correct string.
I would have assumed that such a process would not work at all.
What am I missing?

The special property of ISO-8859-1 is that the 256 characters it represents correspond 1:1 with the first 256 Unicode code points, so byte 00h decodes to U+0000, and byte FFh decodes to U+00FF.
So if you encode as UTF-8 and decode as ISO-8859-1 you get a Unicode string made up of code points whose values match the UTF-8 bytes encoded:
>>> s = '你好'
>>> s.encode('utf8').hex()
'e4bda0e5a5bd'
>>> s.encode('utf8').decode('iso-8859-1')
'ä½\xa0å¥½'
>>> for c in u:
... print(f'{c} U+{ord(c):04X}')
...
ä U+00E4 # Unicode code points are the same as the bytes of UTF-8.
½ U+00BD
  U+00A0
å U+00E5
¥ U+00A5
½ U+00BD
>>> u.encode('iso-8859-1').hex() # transform back to bytes.
'e4bda0e5a5bd'
>>> u.encode('iso-8859-1').decode('utf8') # and decode to UTF-8 again.
'你好'
Any 8-bit encoding that has a representation for all 256 bytes would also work, it just wouldn't be a 1:1 mapping. Code Page 1256 is one such encoding:
>>> for c in s.encode('utf8').decode('cp1256'):
... print(f'{c} U+{ord(c):04X}')
...
ن U+0646 # This would still .encode('cp1256') back to byte E4, for example
½ U+00BD
  U+00A0
ه U+0647
¥ U+00A5
½ U+00BD

No, there is no special property of ISO8859-1, but one property common on many 8-bit encoding: they accept all bytes from 0 to 255.
So you decode('ISO8859-1') is just transforming bytes into 256 characters (and control codes) in a unique way. Then you do the contrary action, so you lose nothing.
This happens with most of old 8-bit encoding: they should just have a corresponding Unicode codepoint (because Python expect strings to be Unicode strings).
Note: really ISO8859-1 is special with Unicode: the first 256 codepoint of Unicode correspond to the Latin-1 characters (with same number). But this doesn't matter much on your experiment.

Obtaining the original bytes after decoding to unicode and back

I have a byte string which I'm decoding to unicode in python using .decode('unicode-escape'). This returns a unicode string. Encoding this unicode string to obtain it in byte form again however returns a different byte string. Why is this, and how can I decode and encode in a way that preserves the original data?
Examples:
some_bytes = b'7Q\x82\xacqo\xbb\x0f\x03\x105\x93<\xebD\xbe\xde\xad\x82\xf9\xa6\x1cX\x01N\x8c\xff\x9e\x84\x1e\xa1\x97'
some_bytes.decode('unicode-escape')
yields: 7Q¬qo»5<ëD¾Þù¦XNÿ¡
some_bytes.decode('unicode-escape').encode()
yields: b'7Q\xc2\x82\xc2\xacqo\xc2\xbb\x0f\x03\x105\xc2\x93<\xc3\xabD\xc2\xbe\xc3\x9e\xc2\xad\xc2\x82\xc3\xb9\xc2\xa6\x1cX\x01N\xc2\x8c\xc3\xbf\xc2\x9e\xc2\x84\x1e\xc2\xa1\xc2\x97'

xc2,xc3 refers to 00 in utf-8. For eg :For power 2, utf-8 is \xc2\xb2
So when you are encoding it is added before every code-point.
For more details, you can see below link
https://www.utf8-chartable.de/unicode-utf8-table.pl?start=128&number=128&utf8=string-literal&unicodeinhtml=hex

Python - convert unicode and hex to unicode

I have a supposedly unicode string like this:
u'\xc3\xa3\xc6\u2019\xc2\xa9\xc3\xa3\xc6\u2019\xe2\u20ac\u201c\xc3\xa3\xc6\u2019\xc2\xa9\xc3\xa3\xe2\u20ac\u0161\xc2\xa4\xc3\xa3\xc6\u2019\xe2\u20ac\u201c\xc3\xaf\xc2\xbc\xc2\x81\xc3\xa3\xe2\u20ac\u0161\xc2\xb9\xc3\xa3\xe2\u20ac\u0161\xc2\xaf\xc3\xa3\xc6\u2019\xc2\xbc\xc3\xa3\xc6\u2019\xc2\xab\xc3\xa3\xe2\u20ac\u0161\xc2\xa2\xc3\xa3\xe2\u20ac\u0161\xc2\xa4\xc3\xa3\xc6\u2019\xe2\u20ac\xb0\xc3\xa3\xc6\u2019\xc2\xab\xc3\xa3\xc6\u2019\xe2\u20ac\xa2\xc3\xa3\xe2\u20ac\u0161\xc2\xa7\xc3\xa3\xe2\u20ac\u0161\xc2\xb9\xc3\xa3\xc6\u2019\xe2\u20ac\xa0\xc3\xa3\xe2\u20ac\u0161\xc2\xa3\xc3\xa3\xc6\u2019\xc2\x90\xc3\xa3\xc6\u2019\xc2\xab\xc3\xaf\xc2\xbc\xcb\u2020\xc3\xa3\xe2\u20ac\u0161\xc2\xb9\xc3\xa3\xe2\u20ac\u0161\xc2\xaf\xc3\xa3\xc6\u2019\xe2\u20ac\xa2\xc3\xa3\xe2\u20ac\u0161\xc2\xa7\xc3\xa3\xe2\u20ac\u0161\xc2\xb9\xc3\xaf\xc2\xbc\xe2\u20ac\xb0'
How do I get the correct unicode string out of this? I think, the actual unicode value is ラブライブ！スクールアイドルフェスティバル（スクフェス）

You have a Mojibake, an incorrectly decoded piece text.
You can use the ftfy library to un-do the damage:
>>> from ftfy import fix_text
>>> fix_text(s)
u'\u30e9\u30d6\u30e9\u30a4\u30d6!\u30b9\u30af\u30fc\u30eb\u30a2\u30a4\u30c9\u30eb\u30d5\u30a7\u30b9\u30c6\u30a3\u30d0\u30eb(\u30b9\u30af\u30d5\u30a7\u30b9)'
>>> print fix_text(s)
ラブライブ!スクールアイドルフェスティバル(スクフェス)
According to ftfy, your data was encoded as UTF-8, then decoded as Windows codepage 1252; the ftfy.fixes.fix_one_step_and_explain() function shows the repair steps needed:
>>> ftfy.fixes.fix_one_step_and_explain(s)[-1]
[(u'encode', u'sloppy-windows-1252', 0), (u'decode', u'utf-8', 0)]
(the 'sloppy' encoding is needed because not all UTF-8 bytes can be decoded as cp1252, but some bad decoders then just copy the original byte; the special codec reverses that process).
In fact, in your case this was done twice, not a feat I had seen before:
>>> print s.encode('sloppy-cp1252').decode('utf8').encode('sloppy-cp1252').decode('utf8')
ラブライブ！スクールアイドルフェスティバル（スクフェス）

Python: decoding a string that consists of both unicode code points and unicode text

Parsing some HTML content I got the following string:
АБВ\u003d\"res
The common advice on handling it appears to be to decode using unicode_escape. However, this results in the following:
ÐÐÐ="res
The escaped characters get correctly decoded, but cyrillic letters for some reason get mangled. Other than using regexes to extract everything that looks like a unicode string, decoding only them using unicode_escape and then putting everything into a new string, which other methods exist to decode strings with unicode code points in Python?

unicode_escape treats the input as Latin-1 encoded; any bytes that do not represent a Python string literal escape sequence are decoded mapping bytes directly to Unicode codepoints. You gave it UTF-8 bytes, so the cyrillic characters are represented with 2 bytes each where decoded to two Latin-1 characters each, one of which is U+00D0 Ð, the other unprintable:
>>> print repr('АБВ\\u003d\\"res')
'\xd0\x90\xd0\x91\xd0\x92\\u003d\\"res'
>>> print repr('АБВ\\u003d\\"res'.decode('latin1'))
u'\xd0\x90\xd0\x91\xd0\x92\\u003d\\"res'
>>> print 'АБВ\\u003d\\"res'.decode('latin1')
ÐÐÐ\u003d\"res
This kind of mis-decoding is called a Mojibake, and can be repaired by re-encoding to Latin-1, then decoding from the correct codec (UTF-8 in your case):
>>> print 'АБВ\u003d\\"res'.decode('unicode_escape')
ÐÐÐ="res
>>> print 'АБВ\u003d\\"res'.decode('unicode_escape').encode('latin1').decode('utf8')
АБВ="res
Note that this will fail if the \uhhhh escape sequences encode codepoints outside of the Latin-1 range (U+0000-U+00FF).
The Python 3 equivalent of the above uses codecs.encode():
>>> import codecs
>>> codecs.decode('АБВ\\u003d\\"res', 'unicode_escape').encode('latin1').decode('utf8')
'АБВ="res'

The regex really is the easiest solution (Python 3):
text = 'АБВ\\u003d\\"re'
re.sub(r'(?i)(?<!\\)(?:\\\\)*\\u([0-9a-f]{4})', lambda m: chr(int(m.group(1), 16)), text)
This works fine with any 4-nibble Unicode escape, and can be pretty easily extended to other escapes.
For Python 2, make all strings u'' strings, and use unichr.

encode and decode for a specific character set

There is no difference for the printing results, what is the usage of encoding and decoding for utf-8?
And is it encode('utf8') or encode('utf-8')?
u ='abc'
print(u)
u=u.encode('utf-8')
print(u)
uu = u.decode('utf-8')
print(uu)

str.encode encodes the string (or unicode string) into a series of bytes. In Python 3 this is a bytearray, in Python 2 it's str again (confusingly). When you encode a unicode string, you are left with bytes, not unicode—remember that UTF-8 is not unicode, it's an encoding method that can turn unicode codepoints into bytes.
str.decode will decode the serialized byte stream with the selected codec, picking the proper unicode codepoints and giving you a unicode string.
So, what you're doing in Python 2 is: 'abc' > 'abc' > u'abc', and in Python 3 is:
'abc' > b'abc' > 'abc'. Try printing repr(u) or type(u) in addition to see what's changing where.
utf_8 might be the most canonical, but it doesn't really matter.

Usually Python will first try to decode it to unicode before it can encode it back to UTF-8.There are encording which doesnt have anything to do with the character sets which can be applied to 8 bit strings
For eg
data = u'\u00c3' # Unicode data
data = data.encode('utf8')
print data
'\xc3\x83' //the output.
Please have a look through here and here.It would be helpful.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Two unicode encodings represent 1 cyrillic letter - python

Related

How come I can decode a UTF-8 byte string to ISO8859-1 and back again without any UnicodeEncodeError/UnicodeDecodeError?

Obtaining the original bytes after decoding to unicode and back

Python - convert unicode and hex to unicode

Python: decoding a string that consists of both unicode code points and unicode text

encode and decode for a specific character set

Categories

Resources