remove prefix b in encoded text in python

text = "hello world what is happening"
encodedText = text.encode('utf-16') #Encoding the input text
textReplaced = encodedText.replace('h'.encode('utf-16'), 'Q'.encode('utf-16')) #Doing the replacement of an encoded character by another encoded character
print('Input : ', text)
print('Expected Output : Qello world wQat is Qappening')
print('Actual Output : ', textReplaced.decode('utf-16'))
print('Encoded h : ', 'h'.encode('utf-16'))
print('Encoded Q : ', 'Q'.encode('utf-16'))
print('Encoded Actual Output : ', textReplaced)
Output:
Input : hello world what is happening
Expected Output : Qello world wQat is Qappening
Actual Output : Qello world what is happening
Encoded h : b'\xff\xfeh\x00'
Encoded Q : b'\xff\xfeQ\x00'
Encoded Actual Output : b'\xff\xfeQ\x00e\x00l\x00l\x00o\x00 \x00w\x00o\x00r\x00l\x00d\x00 \x00w\x00h\x00a\x00t\x00 \x00i\x00s\x00 \x00h\x00a\x00p\x00p\x00e\x00n\x00i\x00n\x00g\x00'
The problem with the code seems to be that every encoded string or character carries a b' prefix, so the replacement is done only on the first occurrence in the encoded input.

The problem is that the replacement bytes include the byte order mark (b'\xff\xfe'), which is only present at the beginning of the bytestring. If you are obliged to do the replacing in bytes rather than in str, you need to encode the replacement bytes without a BOM by using the UTF-16 encoding that matches the endianness of your system (or of the bytes, which might not be the same).
Assuming the endianness of the bytes is that of your system, this will work:
>>> import sys
>>> enc = 'utf-16-le' if sys.byteorder == 'little' else 'utf-16-be'
>>> textReplaced = encodedText.replace('h'.encode(enc), 'Q'.encode(enc))
>>> textReplaced.decode('utf-16')
'Qello world wQat is Qappening'
An even simpler, and more flexible, approach is to use the bytes.translate method. This works here because the two encoded strings have the same length, so bytes.maketrans maps them byte for byte: the BOM and null bytes map onto themselves, and h maps to Q.
>>> trans_table = bytes.maketrans('h'.encode('utf-16'), 'Q'.encode('utf-16'))
>>> print(encodedText.translate(trans_table).decode('utf-16'))
Qello world wQat is Qappening
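If nothing forces the replacement to happen on the encoded bytes, a simpler route (my suggestion, equivalent in effect, not part of the original answer) is to do the replacement on the str and encode once at the end:
>>> text.replace('h', 'Q')
'Qello world wQat is Qappening'
>>> textReplaced = text.replace('h', 'Q').encode('utf-16')  # encode once, after replacing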

Related

Converting bytes string with escape characters into a normal string

There are a million questions and answers related to this, but nothing worked for me.
I have a byte string that contains escape characters. I want to decode it and convert it to a string without the escape characters, so that when printed it looks like the following:
>>> normal_string = "Hello\r\nWorld"
>>> print(normal_string)
Hello
World
Here is a part of my byte string
>>> bytes_string = b'Normal sentence\\r\\n\\tIndented sentence\\r\\nanother one'
>>> converted_string = str(bytes_string, 'utf-8')
>>> print(converted_string)
Normal sentence\r\n\tIndented sentence\r\nanother one
and I want this
>>> print(string_that_I_want)
Normal sentence
Indented sentence
another one
The escape sequences are literal text in the byte string, so the unicode_escape codec will interpret them:
>>> print(bytes_string.decode("unicode_escape"))
Normal sentence
Indented sentence
another one
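One caveat worth noting (my addition, not from the original answer): unicode_escape treats the input bytes as latin-1, so if the byte string also contains UTF-8 encoded non-ASCII text, that part will come out garbled:
>>> b'caf\xc3\xa9\\n'.decode('unicode_escape')
'cafÃ©\n'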

python write file dealing with encode

I'm confused. I need help!
I'm dealing with a file that contains Chinese characters; for instance, let's call it a.TEST, and here is what's inside:
你好 中国 Hello China 1 2 3
You don't need to understand what the Chinese means. (Actually, it's 'hello China'.)
>>> f=open('a.TEST')
>>> print f.read()
你好 中国 Hello China 1 2 3
>>> f.seek(0)
>>> content = f.readline()
>>> content
'\xe4\xbd\xa0\xe5\xa5\xbd \xe4\xb8\xad\xe5\x9b\xbd Hello China 1 2 3\n'
>>> print content
你好 中国 Hello China 1 2 3
>>> type(content)
<type 'str'>
>>> isinstance(content,unicode)
False
Here comes the first question: why does the Python shell show me the UTF-8 bytes of content when I just type content, while the print content command outputs the form that I want to see?
The Second Question: what's the difference between unicode and str?
Someone told me that encode converts unicode to str, but what I learned from the Unicode HOWTO tells me encode converts unicode to utf-8.
Not over yet! :)
here is test.py
#!/usr/bin/python
#-*- coding: utf-8 -*-
fr = open('a.TEST')
fw = open('out.TEST','w')
content = fr.readline()
content_list = content.split()
print content
fw.write('{0}'.format(content_list))
fr.close()
fw.close()
Third question: why do the Chinese characters turn into UTF-8 codes when I do .split()?
I also thought fw.write('{0}'.format(content_list).decode('utf-8')) would work, but it doesn't.
I don't want what's written into out.TEST to be in character-encoding form; I want it to be exactly the characters as they originally looked (你好). How do I do that?
What is an Encoding?
A file consists of bytes. You can represent each byte with a number between 0 and 255 (or 0x00 and 0xFF in hexadecimal).
Text is also written as bytes. There is an agreement on the way text is written. That is an encoding. The most basic encoding is ASCII and other encodings are usually based on it. For example, ASCII defines that number 65 (0x41) represents 'A', 66 (0x42) represents 'B' etc.
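For instance, chr converts such a number into its one-byte character:
>>> chr(65) + chr(66)
'AB'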
How are Strings Represented
In python, you can define a string using numeric values:
>>> '\x41\x42\x43'
'ABC'
'\x41\x42\x43' is exactly the same thing as 'ABC'. Python will always represent the string using the more readable textual representation ('ABC').
However, some numeric values are not printable characters, so they will be represented in numeric form:
>>> '\x00\x01\x02\x03\x04'
'\x00\x01\x02\x03\x04'
Other characters have aliases to make your job easier:
>>> '\x0a\x0d\x09'
'\n\r\t'
Different Encodings
The ASCII table defines the meaning of numbers 0-127 and includes only the English alphabet. Numbers 128-255 are undefined, so other encodings define a meaning for 128-255; yet others change the meaning of the whole range 0-255.
There are many encodings, and they define 128-255 differently.
For example, character 185 (0xB9) is ą in windows-1250 encoding, but it is š in iso-8859-2 encoding.
So, what happens if you print \xb9? It depends on the encoding used in the console. In my case (my console uses cp852 encoding) it is:
>>> print '\xb9'
╣
Because of that ambiguity, string '\xb9' will never be represented as '╣' (nor 'ą'...). That would hide the true value. It will be represented as the numeric value:
>>> '\xb9'
'\xb9'
Also:
>>> '╣'
'\xb9'
See also the string from the question in my console:
>>> content = '\xe4\xbd\xa0\xe5\xa5\xbd \xe4\xb8\xad\xe5\x9b\xbd Hello China 1 2 3\n'
>>>
>>> content
'\xe4\xbd\xa0\xe5\xa5\xbd \xe4\xb8\xad\xe5\x9b\xbd Hello China 1 2 3\n'
>>>
>>> print content
ńŻáňąŻ ńŞşňŤŻ Hello China 1 2 3
But what happens if a variable is just entered in the console?
When a variable is entered in the console without print, its representation is printed. It is the same as the following:
>>> print repr(content)
'\xe4\xbd\xa0\xe5\xa5\xbd \xe4\xb8\xad\xe5\x9b\xbd Hello China 1 2 3\n'
What is Unicode?
The Unicode table aims to define a numeric representation of all characters in the world, and more. It can actually do that, because it is not limited to 256 values (or to any other limit, actually). It is not an encoding, but a universal mapping of numbers to characters.
For example, Unicode defines that number 353 (0x0161) is the character š. That is always true regardless of your locale and the encodings you use. That character can be stored in files (or memory) in any encoding which supports š.
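For example, in a Python 2 shell (provided the console can display the character):
>>> print u'\u0161'
š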
What is UTF-8?
When encoding a unicode character, one can use any encoding, but not all of them will support all characters.
For example, š (Unicode 0x0161) can be encoded in iso-8859-2 as 0xB9, but it cannot be encoded in iso-8859-1 at all.
So, to be able to encode anything, you need an encoding which supports every unicode character. UTF-8 is one of those encodings, but there are others:
>>> u'\u0161'.encode('utf-7')
'+AWE-'
>>> u'\u0161'.encode('utf-8')
'\xc5\xa1'
>>> u'\u0161'.encode('utf-16le')
'a\x01'
>>> u'\u0161'.encode('utf-16be')
'\x01a'
>>> u'\u0161'.encode('utf-32le')
'a\x01\x00\x00'
>>> u'\u0161'.encode('utf-32be')
'\x00\x00\x01a'
The good thing about utf-8 is that the whole ASCII range is unchanged and as long as only ASCII is used, only one byte is used per character:
>>> u'abcdefg'.encode('utf-8')
'abcdefg'
Unicode in Python 2
Important: This is really specific to Python 2. Python 3 is different.
Unlike str objects, which are strings of bytes, unicode objects are strings of unicode characters.
They can be encoded into a str in a chosen encoding, or decoded from a str in a chosen encoding.
A unicode string is specified using u before the opening quote. The characters inside are interpreted using the source encoding, or they can be specified in numeric format \uHEX:
>>> u'ABCD'
u'ABCD'
>>>
>>> u'\u0041\u0042\u0043'
u'ABC'
>>> u'šâů'
u'\u0161\xe2\u016f'
And Now the Answers
First Question
content prints repr(content)
print content prints content
Second Question
UTF-8 strings are byte strings (str). You get them by encoding the unicode:
>>> u'\u0161'.encode('utf-8')
'\xc5\xa1'
>>> '\xc5\xa1'.decode('utf-8')
u'\u0161'
So yes, encode converts unicode to str. The str can be utf-8, but it does not have to be.
Third Question
A) "Why the chinese character turn into utf-8 code when I do .split()?"
They were utf-8 all the time.
B) "I thought fw.write('{0}'.format(content_list).decode('utf-8')) will work"
content_list is not a string. It is a list. When a list is converted to a string, it is done using its repr, which also does repr of all of the contents.
For example:
>>> 'a \n a \n a'
'a \n a \n a'
>>> print 'a \n a \n a'
a
a
a
>>> print ['a \n a \n a']
['a \n a \n a']
The last print printed repr(list) which contains repr(str).
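For the original script, a minimal fix (my sketch, assuming the goal is to write the words themselves separated by spaces) is to join the list back into a single byte string before writing:
fw.write(' '.join(content_list))  # the items are utf-8 byte strings, so the file gets the original characters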
In the beginning, there were just English characters, and people were not satisfied.
Then they wanted to display every character in the world. But there is a problem: one byte can only represent 256 values, and that is simply not enough room to hold them all.
So people decided to use more than one byte per character, and multi-byte encodings such as UTF-8 were created.
No matter what characters you write, they are all stored in byte form.
In Python 2 there are two string datatypes: str, which holds bytes, and unicode, which holds characters; an encoding such as UTF-8 converts between the two.
'\xe4\xbd\xa0\xe5\xa5\xbd \xe4\xb8\xad\xe5\x9b\xbd' is the byte form of "你好 中国".
It cannot be displayed without an encoding being specified.
I suppose you could blame Linux/Unix; whether Python or cat can display UTF-8 characters depends on the terminal's encoding.

Python3: unescaping non-ASCII characters

(Python 3.3.2) I have to unescape some non-ASCII escaped characters returned by a call to re.escape(). I see here and here methods that don't work. I'm working in a 100% UTF-8 environment.
# pure ASCII string : ok
mystring = "a\n" # expected unescaped string : "a\n"
cod = codecs.getencoder('unicode_escape')
print( cod(mystring) )
# non ASCII string : method #1
mystring = "€\n"
# equivalent to : mystring = codecs.unicode_escape_decode(mystring)
cod = codecs.getdecoder('unicode_escape')
print(cod(mystring))
# RESULT = ('â\x82¬\n', 5) INSTEAD OF ("€\n", 2)
# non ASCII string : method #2
mystring = "€\n"
mystring = bytes(mystring, 'utf-8').decode('unicode_escape')
print(mystring)
# RESULT = â\202¬ INSTEAD OF "€\n"
Is this a bug ? Have I misunderstood something ?
Any help would be appreciated !
PS: I edited my post thanks to Michael Foukarakis' remark.
I guess the actual string you need to process is mystring = €\\n?
mystring = "€\n" # that's 2 char, "€" and new line
mystring = "€\\n" # that's 3 char, "€", "\" and "n"
I don't really understand what's going wrong inside Python 3's encode() and decode(), but my friend solved this problem while we were writing some tools.
What we did was bypass encode("utf_8") after the escape procedure was done.
>>> "€\\n".encode("utf_8")
b'\xe2\x82\xac\\n'
>>> "€\\n".encode("utf_8").decode("unicode_escape")
'â\x82¬\n'
>>> "€\\n".encode("utf_8").decode("unicode_escape").encode("utf_8")
b'\xc3\xa2\xc2\x82\xc2\xac\n' # we don't want this
>>> bytes([ord(char) for char in "€\\n".encode("utf_8").decode("unicode_escape")])
b'\xe2\x82\xac\n' # what we really need
>>> str(bytes([ord(char) for char in "€\\n".encode("utf_8").decode("unicode_escape")]), "utf_8")
'€\n'
We can see that, though the result of decode("unicode_escape") looks weird, the bytes object actually contains the correct bytes of your string (in UTF-8 encoding), in this case "\xe2\x82\xac\n".
So we do not print the str object directly, nor do we use encode("utf_8"); we use ord() to create the bytes object b'\xe2\x82\xac\n'.
And you can get the correct str from this bytes object by simply passing it to str().
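A side note (my addition, not from the original answer): because decode("unicode_escape") only produces code points below 256 here, the ord() comprehension is equivalent to encoding with latin-1:
>>> "€\\n".encode("utf_8").decode("unicode_escape").encode("latin1")
b'\xe2\x82\xac\n'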
By the way, the tool my friend and I wanted to make is a wrapper that allows the user to input C-like string literals and converts the escape sequences automatically.
User input: \n\x61\x62\n\x20\x21 # 20 characters, which represent 6 chars semantically
output: # \n
ab # \x61\x62\n
! # \x20\x21
That's a powerful tool for users to input non-printable characters in a terminal.
Our final tool is:
#!/usr/bin/env python3
import sys
for line in sys.stdin:
    sys.stdout.buffer.write(bytes([ord(char) for char in line[:-1].encode().decode('unicode_escape')]))
    sys.stdout.flush()
You seem to misunderstand encodings. To be protected against common errors, we usually encode a string when it leaves our application, and decode it when it comes in.
Firstly, let's look at the documentation for unicode_escape, which states:
Produce[s] a string that is suitable as Unicode literal in Python source code.
Here is what you would get from the network or a file that claims its contents are Unicode escaped:
b'\\u20ac\\n'
Now, you have to decode this to use it in your app:
>>> s = b'\\u20ac\\n'.decode('unicode_escape')
>>> s
'€\n'
and if you wanted to write it back to, say, a Python source file:
with open('/tmp/foo', 'wb') as fh:  # binary mode
    fh.write(b'print("' + s.encode('unicode_escape') + b'")')
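As a quick check (my addition, not part of the original answer), reading the file back shows the escaped literal:
>>> open('/tmp/foo', 'rb').read()
b'print("\\u20ac\\n")'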
import string

printable = string.printable
printable = printable + '€'

def cod(c):
    return c.encode('unicode_escape').decode('ascii')

def unescape(s):
    return ''.join(c if ord(c) >= 32 and c in printable else cod(c) for c in s)

mystring = "€\n"
print(unescape(mystring))
Unfortunately string.printable only includes ASCII characters. You can make a copy as I did here and extend it with any Unicode characters that you'd like, such as €.
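For example (my demonstration, not part of the original answer), non-printable and non-listed characters get escaped while € passes through:
>>> print(unescape("€\nñ"))
€\n\xf1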

String to unicode code point escape sequences in Python

Here is my problem... I have a "normal" string like:
Hello World
And unlike all the other questions I have found, I WANT to print it as its Unicode code point escape value!
The output I am looking for is something like this:
\u0015\u0123
If anyone has an idea :)
You are encoding ASCII codepoints only. UTF-8 is a superset of ASCII, any ASCII codepoints are encoded to the same bytes as ASCII would use. What you are printing is correct, that is UTF-8.
Use some non-ASCII codepoints to see the difference:
>>> 'Hello world with an em-dash: \u2014\n'.encode('utf8')
b'Hello world with an em-dash: \xe2\x80\x94\n'
Python will just use the characters themselves when it shows you a bytes value with printable ASCII bytes in it. Any byte value that is not printable is shown as a \x.. escape code, or a single-character escape sequence if there is one (\n for newline).
From your example output, on the other hand, you seem to be expecting to output Python unicode literal escape codes:
>>> '\u0015\u0123'
'\x15ģ'
Since U+0123 is printable, Python 3 just shows it; the non-printable U+0015 (NEGATIVE ACKNOWLEDGE) is a codepoint in the 0x00-0xFF range and is shown using the shorter \x.. escape notation.
To show only unicode escape sequences for your text, you need to process it character by character:
>>> input_text = 'Hello World!'
>>> print(''.join('\\u{:04x}'.format(ord(c)) for c in input_text))
\u0048\u0065\u006c\u006c\u006f\u0020\u0057\u006f\u0072\u006c\u0064\u0021
>>> input_text = 'Hello world with an em-dash: \u2014\n'
>>> print(''.join('\\u{:04x}'.format(ord(c)) for c in input_text))
\u0048\u0065\u006c\u006c\u006f\u0020\u0077\u006f\u0072\u006c\u0064\u0020\u0077\u0069\u0074\u0068\u0020\u0061\u006e\u0020\u0065\u006d\u002d\u0064\u0061\u0073\u0068\u003a\u0020\u2014\u000a
It is important to stress that this is not UTF-8, however.
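One caveat worth adding (my note, not from the original answer): ord() of a character above U+FFFF yields more than four hex digits, so a plain \u escape would be invalid there. A sketch (uescape is a hypothetical helper) that switches to \U escapes for such code points:
>>> def uescape(text):
...     # \UXXXXXXXX for code points beyond the BMP, \uXXXX otherwise
...     return ''.join('\\U{:08x}'.format(ord(c)) if ord(c) > 0xffff
...                    else '\\u{:04x}'.format(ord(c)) for c in text)
...
>>> print(uescape('A\U0001F600'))
\u0041\U0001f600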
You can use ord to turn the encoded bytes into numbers and string formatting to display their hex values.
>>> s = u'Hello World \u0664\u0662'
>>> print s
Hello World ٤٢
>>> print ''.join('\\x%02X' % ord(c) for c in s.encode('utf-8'))
\x48\x65\x6C\x6C\x6F\x20\x57\x6F\x72\x6C\x64\x20\xD9\xA4\xD9\xA2

Bytes in a unicode Python string

In Python 2, Unicode strings may contain both unicode and bytes:
a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'
I understand that this is absolutely not something one should write in his own code, but this is a string that I have to deal with.
The bytes in the string above are UTF-8 for ек (Unicode \u0435\u043a).
My objective is to get a unicode string containing everything in Unicode, which is to say Русский ек (\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a).
Encoding it to UTF-8 yields
>>> a.encode('utf-8')
'\xd0\xa0\xd1\x83\xd1\x81\xd1\x81\xd0\xba\xd0\xb8\xd0\xb9 \xc3\x90\xc2\xb5\xc3\x90\xc2\xba'
Which then decoded from UTF-8 gives the initial string with bytes in them, which is not good:
>>> a.encode('utf-8').decode('utf-8')
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'
I found a hacky way to solve the problem, however:
>>> repr(a)
"u'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \\xd0\\xb5\\xd0\\xba'"
>>> eval(repr(a)[1:])
'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \xd0\xb5\xd0\xba'
>>> s = eval(repr(a)[1:]).decode('utf8')
>>> s
u'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \u0435\u043a'
# Almost there, the bytes are proper now but the former real-unicode characters
# are now escaped with \u's; need to un-escape them.
>>> import re
>>> re.sub(u'\\\\u([a-f\\d]+)', lambda x : unichr(int(x.group(1), 16)), s)
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a' # Success!
This works fine but looks very hacky due to its use of eval, repr, and then additional regex'ing of the unicode string representation. Is there a cleaner way?
In Python 2, Unicode strings may contain both unicode and bytes:
No, they may not. They contain Unicode characters.
Within the original string, \xd0 is not a byte that's part of a UTF-8 encoding. It is the Unicode character with code point 208. u'\xd0' == u'\u00d0'. It just happens that the repr for Unicode strings in Python 2 prefers to represent characters with \x escapes where possible (i.e. code points < 256).
There is no way to look at the string and tell that the \xd0 byte is supposed to be part of some UTF-8 encoded character, or if it actually stands for that Unicode character by itself.
However, if you assume that you can always interpret those values as encoded ones, you could try writing something that analyzes each character in turn (use ord to convert to a code-point integer), decodes characters < 256 as UTF-8, and passes characters >= 256 as they were.
(In response to the comments above): this code converts everything that looks like utf8 and leaves other codepoints as is:
import re

a = u'\u0420\u0443\u0441 utf:\xd0\xb5\xd0\xba bytes:bl\xe4\xe4'

def convert(s):
    try:
        return s.group(0).encode('latin1').decode('utf8')
    except UnicodeError:  # leave runs that are not valid utf-8 untouched
        return s.group(0)

a = re.sub(r'[\x80-\xFF]+', convert, a)
print a.encode('utf8')
Result:
Рус utf:ек bytes:blää
The problem is that your string is not actually encoded in a specific encoding. Your example string:
a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'
is mixing Python's internal representation of unicode strings with UTF-8 encoded text. If we just consider the 'special' characters:
>>> orig = u'\u0435\u043a'
>>> bytes = u'\xd0\xb5\xd0\xba'
>>> print orig
ек
>>> print bytes
ек
But, you say, bytes is UTF-8 encoded:
>>> print bytes.encode('utf-8')
ек
>>> print bytes.encode('utf-8').decode('utf-8')
ек
Wrong! But what about:
>>> bytes = '\xd0\xb5\xd0\xba'
>>> print bytes
ек
>>> print bytes.decode('utf-8')
ек
Hurrah.
So. What does this mean for me? It means you're (probably) solving the wrong problem. What you should be asking us/trying to figure out is why your strings are in this form to begin with and how to avoid it/fix it before you have them all mixed up.
You should convert unichrs to chrs, then decode them.
u'\xd0' == u'\u00d0' is True
$ python
>>> import re
>>> a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'
>>> re.sub(r'[\000-\377]*', lambda m:''.join([chr(ord(i)) for i in m.group(0)]).decode('utf8'), a)
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a'
r'[\000-\377]*' will match unichrs u'[\u0000-\u00ff]*'
u'\xd0\xb5\xd0\xba' == u'\u00d0\u00b5\u00d0\u00ba'
You are using UTF-8 encoded bytes as unicode code points (this is the PROBLEM).
I solve the problem by treating those mistaken unichars as the corresponding bytes.
I search for all these mistaken unichars, convert them to chars, then decode them.
If I'm wrong, please tell me.
You've already got an answer, but here's a way to unscramble UTF-8-like Unicode sequences that is less likely to decode latin-1 Unicode sequences in error. The re.sub function:
Matches Unicode characters < U+0100 that resemble valid UTF-8 sequences (ref: RFC 3629).
Encodes the Unicode sequence into its equivalent latin-1 byte sequence.
Decodes the sequence using UTF-8 back into Unicode.
Replaces the original UTF-8-like sequence with the matching Unicode character.
Note this could still match a Unicode sequence if just the right characters appear next to each other, but it is much less likely.
import re
# your example
a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'
# printable Unicode characters < 256.
a += ''.join(chr(n) for n in range(32,256)).decode('latin1')
# a few UTF-8 characters decoded as latin1.
a += ''.join(unichr(n) for n in [2**7-1,2**7,2**11-1,2**11]).encode('utf8').decode('latin1')
# Some non-BMP characters
a += u'\U00010000\U0010FFFF'.encode('utf8').decode('latin1')
print repr(a)
# Unicode codepoint sequences that resemble UTF-8 sequences.
p = re.compile(ur'''(?x)
\xF0[\x90-\xBF][\x80-\xBF]{2} | # Valid 4-byte sequences
[\xF1-\xF3][\x80-\xBF]{3} |
\xF4[\x80-\x8F][\x80-\xBF]{2} |
\xE0[\xA0-\xBF][\x80-\xBF] | # Valid 3-byte sequences
[\xE1-\xEC][\x80-\xBF]{2} |
\xED[\x80-\x9F][\x80-\xBF] |
[\xEE-\xEF][\x80-\xBF]{2} |
[\xC2-\xDF][\x80-\xBF] # Valid 2-byte sequences
''')
def replace(m):
    return m.group(0).encode('latin1').decode('utf8')
print
print repr(p.sub(replace,a))
Output:
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff\x7f\xc2\x80\xdf\xbf\xe0\xa0\x80\xf0\x90\x80\x80\xf4\x8f\xbf\xbf'
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff\x7f\x80\u07ff\u0800\U00010000\U0010ffff'
I solved it by
unicodeText.encode("utf-8").decode("unicode-escape").encode("latin1")
