I have a byte string which I'm decoding to Unicode in Python using .decode('unicode-escape'). This returns a Unicode string. However, encoding that Unicode string to get bytes again returns a different byte string. Why is this, and how can I decode and encode in a way that preserves the original data?
Examples:
some_bytes = b'7Q\x82\xacqo\xbb\x0f\x03\x105\x93<\xebD\xbe\xde\xad\x82\xf9\xa6\x1cX\x01N\x8c\xff\x9e\x84\x1e\xa1\x97'
some_bytes.decode('unicode-escape')
yields: 7Q¬qo»5<ëD¾Þù¦XNÿ¡
some_bytes.decode('unicode-escape').encode()
yields: b'7Q\xc2\x82\xc2\xacqo\xc2\xbb\x0f\x03\x105\xc2\x93<\xc3\xabD\xc2\xbe\xc3\x9e\xc2\xad\xc2\x82\xc3\xb9\xc2\xa6\x1cX\x01N\xc2\x8c\xc3\xbf\xc2\x9e\xc2\x84\x1e\xc2\xa1\xc2\x97'
When you call .encode() with no argument, Python uses UTF-8. In UTF-8, every code point from U+0080 to U+00FF takes two bytes, with a lead byte of \xc2 or \xc3. For example, the superscript two (U+00B2) is encoded as \xc2\xb2.
So when you encode, a lead byte is added in front of every code point in that range, which is why the result differs from your original bytes.
For more details, see the table at the link below:
https://www.utf8-chartable.de/unicode-utf8-table.pl?start=128&number=128&utf8=string-literal&unicodeinhtml=hex
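For the round trip the original question asks about, a minimal sketch, assuming the byte string contains no literal backslash bytes (so that 'unicode-escape' simply maps each byte to the code point of the same value), is to encode back with 'latin-1' instead of the default UTF-8:

'\u00b2'.encode('utf-8')    # b'\xc2\xb2' -- the two-byte UTF-8 form with the \xc2 lead byte
'\u00b2'.encode('latin-1')  # b'\xb2'     -- one byte, no lead byte

# latin-1 maps code points 0-255 one-to-one onto byte values 0-255,
# so encoding with it restores the original bytes:
some_bytes = b'7Q\x82\xacqo\xbb\x0f\x03\x105\x93<\xebD\xbe\xde\xad\x82\xf9\xa6\x1cX\x01N\x8c\xff\x9e\x84\x1e\xa1\x97'
some_bytes.decode('unicode-escape').encode('latin-1') == some_bytes   # True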
Related
I have the same string in a Unicode-escape representation and in a mis-decoded UTF-8 representation:
\u00d0\u0095\u00d1\u0081\u00d0\u00bb\u00d0\u00b8\u00d0\u00bf\u00d0\u00be\u00d0\u00b2\u00d0\u00b5\u00d0\u00b7\u00d0\u00b5\u00d1\u0082 \u00d1\u0082\u00d0\u00be\u00d1\u0081\u00d0\u00b5\u00d0\u00b3\u00d0\u00be\u00d0\u00b4\u00d0\u00bd\u00d1\u008f\u00d1\u0083\u00d0\u00b6\u00d0\u00b5\u00d1\u0081\u00d0\u00ba\u00d0\u00b8\u00d0\u00bd\u00d1\u0083
and
ЕÑли повезет то ÑÐµÐ³Ð¾Ð´Ð½Ñ ÑƒÐ¶Ðµ Ñкину.
The desired output is "Если повезет то сегодня уже скину".
I have tried every encoding I could think of, but still wasn't able to get it fully in Cyrillic.
The best I got was
'�?�?ли повезе�? �?о �?егодн�? �?же �?кин�?'
using windows-1252.
I've also noticed that one Cyrillic letter in the desired string corresponds to two escape sequences.
For example: \u00d0\u0095 = 'Е'.
Does anyone know which encoding to use, and how, to get a normal result?
You have a mis-decoded string where the UTF-8 bytes were translated as ISO-8859-1 (also known as latin1). Ideally, re-download with the correct encoding, but you can also encode with the wrongly-used encoding to regain the original byte stream, then decode with the right encoding (UTF-8):
Python:
>>> s = '\u00d0\u0095\u00d1\u0081\u00d0\u00bb\u00d0\u00b8\u00d0\u00bf\u00d0\u00be\u00d0\u00b2\u00d0\u00b5\u00d0\u00b7\u00d0\u00b5\u00d1\u0082 \u00d1\u0082\u00d0\u00be\u00d1\u0081\u00d0\u00b5\u00d0\u00b3\u00d0\u00be\u00d0\u00b4\u00d0\u00bd\u00d1\u008f\u00d1\u0083\u00d0\u00b6\u00d0\u00b5\u00d1\u0081\u00d0\u00ba\u00d0\u00b8\u00d0\u00bd\u00d1\u0083'
>>> s
'Ð\x95Ñ\x81липовезеÑ\x82 Ñ\x82оÑ\x81егоднÑ\x8fÑ\x83жеÑ\x81кинÑ\x83'
>>> print(s)
ÐÑÐ»Ð¸Ð¿Ð¾Ð²ÐµÐ·ÐµÑ ÑоÑегоднÑÑжеÑкинÑ
>>> s.encode('latin1')
b'\xd0\x95\xd1\x81\xd0\xbb\xd0\xb8\xd0\xbf\xd0\xbe\xd0\xb2\xd0\xb5\xd0\xb7\xd0\xb5\xd1\x82 \xd1\x82\xd0\xbe\xd1\x81\xd0\xb5\xd0\xb3\xd0\xbe\xd0\xb4\xd0\xbd\xd1\x8f\xd1\x83\xd0\xb6\xd0\xb5\xd1\x81\xd0\xba\xd0\xb8\xd0\xbd\xd1\x83'
>>> s.encode('latin1').decode('utf8')
'Еслиповезет тосегодняужескину'
You may also have a literal string of Unicode escape codes, which is a bit trickier:
>>> s=r'\u00d0\u0095\u00d1\u0081\u00d0\u00bb\u00d0\u00b8\u00d0\u00bf\u00d0\u00be\u00d0\u00b2\u00d0\u00b5\u00d0\u00b7\u00d0\u00b5\u00d1\u0082 \u00d1\u0082\u00d0\u00be\u00d1\u0081\u00d0\u00b5\u00d0\u00b3\u00d0\u00be\u00d0\u00b4\u00d0\u00bd\u00d1\u008f\u00d1\u0083\u00d0\u00b6\u00d0\u00b5\u00d1\u0081\u00d0\u00ba\u00d0\u00b8\u00d0\u00bd\u00d1\u0083'
>>> print(s)
\u00d0\u0095\u00d1\u0081\u00d0\u00bb\u00d0\u00b8\u00d0\u00bf\u00d0\u00be\u00d0\u00b2\u00d0\u00b5\u00d0\u00b7\u00d0\u00b5\u00d1\u0082 \u00d1\u0082\u00d0\u00be\u00d1\u0081\u00d0\u00b5\u00d0\u00b3\u00d0\u00be\u00d0\u00b4\u00d0\u00bd\u00d1\u008f\u00d1\u0083\u00d0\u00b6\u00d0\u00b5\u00d1\u0081\u00d0\u00ba\u00d0\u00b8\u00d0\u00bd\u00d1\u0083
In this case, the string has to be converted back to bytes, decoded as Unicode escapes, then encoded back to bytes and correctly decoded as UTF-8. latin1 has the feature that the first 256 code points of Unicode map to bytes 0-255 in that codec, so it converts 1:1 code point to byte value.
>>> s.encode('latin1').decode('unicode-escape').encode('latin1').decode('utf8')
'Еслиповезет тосегодняужескину'
d0 95 d1 81 d0 bb d0 b8 is the correct UTF-8 octet stream for "Если".
So you need to convert each character to a byte (an 8-bit word, an octet) by dropping the most significant byte of its code point (which is always 0 in your example), then decode the result as UTF-8.
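A minimal sketch of that byte-by-byte conversion, using the first few escapes of the string above (the variable names are only illustrative):

s = '\u00d0\u0095\u00d1\u0081\u00d0\u00bb\u00d0\u00b8'   # the first four Cyrillic letters' worth of escapes
raw = bytes(ord(c) & 0xff for c in s)   # keep only the low byte of each code point
print(raw)                              # b'\xd0\x95\xd1\x81\xd0\xbb\xd0\xb8'
print(raw.decode('utf-8'))              # Если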
Or better, go back to the source from which you got this, and make sure the stream of octets is not interpreted as a single-byte encoding.
I've logged a lot of texts that were decoded to Unicode (UTF-8) from byte strings.
Example:
From upstream I received a lot of byte strings, like:
b_st = b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x00\x00\x01\x00\x01\x00\x00\xff\xfe\x00'
I saved those on my computer after decoding them:
b_un = b_st.decode("utf-8", "replace")
As you can see, the initial byte string contains bytes that are invalid in UTF-8 (e.g. \xff), so those get replaced.
After that I tried to recover the byte string from that Unicode text with b_un.encode("utf-8"), but it returns a different byte string, not the same as the original.
Is it possible to recover the original byte string?
PS: I didn't decode those texts intentionally; I hadn't read about the default behavior of a class that automatically converts any text to Unicode when necessary.
replace is a lossy codec error handler: it replaces any un-decodable byte with \ufffd (the Unicode replacement character).
As such, it is impossible to recover your original image from the decoded text.
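A short sketch of why it is unrecoverable (the names are just illustrative):

b_st = b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x00\x00\x01\x00\x01\x00\x00\xff\xfe\x00'
b_un = b_st.decode('utf-8', 'replace')
print('\ufffd' in b_un)               # True: the un-decodable bytes all became U+FFFD
print(b_un.encode('utf-8') == b_st)   # False: the original byte values are gone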
If you're handed a byte string, you can write it to a file by using a binary io object:
with open(filename, 'wb') as f:
    f.write(byte_string)
I need to convert a bytearray which contains non-encoded, raw Unicode data into a Unicode string. For example, the code point \u2167 represents the Roman numeral eight:
print(u'\u2167')
Ⅷ
Having this information stored in a bytearray, I need to find a way to convert it back to Unicode. Decoding from e.g. 'utf8' obviously does not work:
b = bytearray([0x21,0x67])
print(b.decode('utf8'))
!g
Any ideas?
EDIT
@Luke's comment got me on the right track. Apparently the original data (not the simplified example I am showing here) is encoded as UTF-16LE. The data is obtained from a wxPython TextDataObject. wxPython internally usually uses Unicode, which is what made me think I was dealing with raw Unicode data.
... a bytearray which contains non-encoded raw unicode data
If it is in a bytearray, it is by definition encoded. The Python bytes or bytearray types can contain encoded Unicode data. The str type contains Unicode code points. You .decode a byte string to a Unicode string, and .encode a Unicode string into byte strings. The encoding used for your example is UTF-16BE:
>>> b = bytearray([0x21,0x67])
>>> b.decode('utf-16be')
'Ⅷ'
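Given the edit above, where the real data turned out to be UTF-16LE from wxPython, the same character simply has its byte order swapped:
>>> bytearray([0x67, 0x21]).decode('utf-16le')
'Ⅷ'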
Note that print(b.decode('utf8')) and print(b.decode("utf-8")) are equivalent: 'utf8' and 'utf-8' are aliases of the same codec, so the spelling is not the problem here; the bytes simply are not UTF-8 encoded.
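A quick check of that alias claim (using the UTF-8 bytes of U+2167 as a sample):
>>> b'\xe2\x85\xa7'.decode('utf8') == b'\xe2\x85\xa7'.decode('utf-8')
True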
Suppose I have the following string that I want to decode as utf-8:
str ='\\u00d7\\u0090\\u00d7\\u0090\\u00d7\\u0090'
# expect 'אאא'
Using Python 3, I would expect the following to work, but it doesn't:
bytes(str, 'ascii').decode('unicode-escape')
# prints '×××'
bytes(str, 'ascii').decode('utf-8')
# prints '\\u00d7\\u0090\\u00d7\\u0090\\u00d7\\u0090'
Any help?
You can do it with multiple trips through encode/decode (here st is the original string, renamed so it doesn't shadow the built-in str):
print(st.encode('ascii').decode('unicode-escape').encode('iso-8859-1').decode('utf-8'))
The first call is the preferred alternative to bytes(st, 'ascii'). The second converts the escape sequences to their equivalent characters. The third takes advantage of the fact that the first 256 Unicode code points match ISO-8859-1, converting those characters directly back into byte values. Finally, you can decode the resulting UTF-8 bytes.
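A step-by-step sketch of the same chain, showing the intermediate values (variable names are only illustrative):

st = '\\u00d7\\u0090\\u00d7\\u0090\\u00d7\\u0090'
raw = st.encode('ascii')                 # the literal escape text as bytes
chars = raw.decode('unicode-escape')     # '\xd7\x90\xd7\x90\xd7\x90' -- escapes resolved to code points
utf8_bytes = chars.encode('iso-8859-1')  # b'\xd7\x90\xd7\x90\xd7\x90' -- code points back to byte values
print(utf8_bytes.decode('utf-8'))        # אאא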
The printed results look the same, so what is the point of encoding and decoding with UTF-8?
And should it be encode('utf8') or encode('utf-8')?
u = 'abc'
print(u)
u=u.encode('utf-8')
print(u)
uu = u.decode('utf-8')
print(uu)
str.encode encodes the string (or Unicode string) into a series of bytes. In Python 3 the result is a bytes object; in Python 2 it's a str again (confusingly). When you encode a Unicode string, you are left with bytes, not Unicode. Remember that UTF-8 is not Unicode; it's an encoding method that can turn Unicode code points into bytes.
str.decode will decode the serialized byte stream with the selected codec, picking out the proper Unicode code points and giving you a Unicode string.
So, what you're doing in Python 2 is: 'abc' > 'abc' > u'abc', and in Python 3 it is:
'abc' > b'abc' > 'abc'. Try printing repr(u) or type(u) as well to see what's changing where.
As for 'utf8' versus 'utf-8': both are aliases of the same codec ('utf_8' might be the most canonical spelling), so it doesn't really matter.
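A small sketch of the repr()/type() suggestion above, run under Python 3:

u = 'abc'
print(type(u), repr(u))          # <class 'str'> 'abc'
b = u.encode('utf-8')
print(type(b), repr(b))          # <class 'bytes'> b'abc'
print(type(b.decode('utf-8')))   # <class 'str'>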
In Python 2, Python will usually first try to decode a byte string to Unicode before it can encode it back to UTF-8. There are also encodings that have nothing to do with character sets and that can be applied directly to 8-bit strings.
For example (Python 2):
data = u'\u00c3'            # Unicode data (a single code point)
data = data.encode('utf8')  # now a byte string: '\xc3\x83'
print repr(data)            # '\xc3\x83'