How to convert '\u5f71\u89c6\...' into the characters it represents? (Python) - python

I want to convert
'[["[FK\u5f71\u89c6\u51fa\u54c1]\u7576\u65fa\u7238\u7238-17.\u7ca4\u8bed\u5b57\u5e55.TV-RMVB.rmvb", "205.53 MB"]]'
to
'[["[FK影视出品]當旺爸爸-17.粤语字幕.TV-RMVB.rmvb", "205.53 MB"]]'
I made the mistake of using json.dumps(file_list) to convert a list object to a str and saving the result to the database. I did not notice the mistake until I used Sphinx to index the data.
I have tried data.decode('utf-8'), but it does not seem to work.

Just decode from JSON again:
>>> import json
>>> json.loads('[["[FK\u5f71\u89c6\u51fa\u54c1]\u7576\u65fa\u7238\u7238-17.\u7ca4\u8bed\u5b57\u5e55.TV-RMVB.rmvb", "205.53 MB"]]')
[['[FK影视出品]當旺爸爸-17.粤语字幕.TV-RMVB.rmvb', '205.53 MB']]
You don't have UTF-8 encoded data, you have JSON-encoded data, which uses \uhhhh escape sequences to represent Unicode codepoints.
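If the goal is to store readable text rather than escapes, one option is to decode the stored JSON and serialize it again with ensure_ascii=False. This is only a sketch: the variable names and the shortened file name are made up for illustration, and it assumes the database column can hold non-ASCII text.

```python
import json

# JSON text as it was stored, with \uXXXX escapes (file name shortened for the example).
stored = r'[["[FK\u5f71\u89c6\u51fa\u54c1]\u7576\u65fa\u7238\u7238-17.rmvb", "205.53 MB"]]'

# Decode the JSON text back into Python objects.
file_list = json.loads(stored)

# Serialize again, this time leaving non-ASCII characters unescaped.
readable = json.dumps(file_list, ensure_ascii=False)
print(readable)
```

By default json.dumps escapes every non-ASCII character, which is what produced the \uXXXX sequences in the first place.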

Related

how to convert python binary str literal to real bytes

In Python 3, I have a str like this, which is the exact literal representation of bytes data:
'8\x81p\x925\x00\x003dx\x91P\x00x\x923\x00\x00\x91Pd\x00\x921d\x81p1\x00\x00'
I would like to convert it to real bytes:
b'8\x81p\x925\x00\x003dx\x91P\x00x\x923\x00\x00\x91Pd\x00\x921d\x81p1\x00\x00'
I tried to use .encode() on the str data, but the result added many "xc2":
b'8\xc2\x81p\xc2\x925\x00\x003dx\xc2\x91P\x00x\xc2\x923\x00\x00\xc2\x91Pd\x00\xc2\x921d\xc2\x81p1\x00\x00'.
I also tried:
import ast
ast.literal_eval("b'8\x81p\x925\x00\x003dx\x91P\x00x\x923\x00\x00\x91Pd\x00\x921d\x81p1\x00\x00'")
The result is:
ValueError: source code string cannot contain null bytes
How can I convert the str input to exactly the following bytes?
b'8\x81p\x925\x00\x003dx\x91P\x00x\x923\x00\x00\x91Pd\x00\x921d\x81p1\x00\x00'
You are on the right track with the encode function already. Just try with this encoding:
>>> '8\x81p\x925\x00\x003dx\x91P\x00x\x923\x00\x00\x91Pd\x00\x921d\x81p1\x00\x00'.encode('raw_unicode_escape')
b'8\x81p\x925\x00\x003dx\x91P\x00x\x923\x00\x00\x91Pd\x00\x921d\x81p1\x00\x00'
I took it from the table of standard encodings in Python's codecs documentation.
Edit: I just found it needs raw_unicode_escape instead of unicode_escape
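For a string whose code points are all below 256, as in the question, 'latin-1' appears to give the same bytes as 'raw_unicode_escape'. The sketch below compares the two on a shortened version of the question's string; the variable names are illustrative only.

```python
# A shortened version of the string from the question.
text = '8\x81p\x925\x00\x003dx\x91P'

as_raw = text.encode('raw_unicode_escape')  # code points < 256 become single bytes
as_latin1 = text.encode('latin-1')          # maps code point N to byte N (N < 256)
as_utf8 = text.encode('utf-8')              # code points >= 0x80 become two bytes

print(as_raw)
print(as_utf8)

# The two codecs differ for code points above 255:
# raw_unicode_escape writes a \uXXXX escape, latin-1 raises an error.
roman = '\u2167'.encode('raw_unicode_escape')
try:
    '\u2167'.encode('latin-1')
    latin1_failed = False
except UnicodeEncodeError:
    latin1_failed = True
```

The UTF-8 result is where the extra \xc2 bytes in the question come from.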

Read in special characters for pandas dataframe [duplicate]

I have a JSON file which contains strings encoded like this:
"sender_name": "Horn\u00c3\u00adkov\u00c3\u00a1",
I am trying to parse this file using the json module. However I am not able to decode this string correctly.
What I get after decoding the JSON using .load() method is 'HornÃ\xadková'. The string should be correctly decoded as 'Horníková' instead.
I read the JSON specification and I understand that \u should be followed by 4 hexadecimal digits specifying the Unicode number of the character. But it seems that in this JSON file UTF-8 encoded bytes are stored as \u-sequences.
What type of encoding is this and how do I parse it correctly in Python 3?
Is this type of JSON file even valid according to the specification?
Your text holds UTF-8 encoded bytes that were written out as \u escapes. Normally you would mark encoded data with a b prefix, but json needs a str as input, so you have to undo the encoding manually. Opening the file with the 'raw_unicode_escape' encoding stops open() from applying its default encoding and turns each \u00XX escape into the character with that code point. Encoding the resulting string with 'raw_unicode_escape' again maps each of those characters back to a single byte, and decoding those bytes gives the desired text.
Note that because the encoding and decoding has to be done on the file content as a string, you have to read the file yourself and use json.loads() instead of json.load().
In [168]: with open('test.json', encoding='raw_unicode_escape') as f:
     ...:     d = json.loads(f.read().encode('raw_unicode_escape').decode())
     ...:
In [169]: d
Out[169]: {'sender_name': 'Horníková'}
The JSON you are reading was written incorrectly. The Unicode strings decoded from it have to be re-encoded with the wrong encoding that was used, then decoded with the correct encoding.
Here's an example:
#!python3
import json
# The bad JSON you have
bad_json = r'{"sender_name": "Horn\u00c3\u00adkov\u00c3\u00a1"}'
print('bad_json =',bad_json)
# The wanted result from json.loads()
wanted = {'sender_name':'Horníková'}
# What correctly written JSON should look like
good_json = json.dumps(wanted)
print('good_json =',good_json)
# What you get when loading the bad JSON.
got = json.loads(bad_json)
print('wanted =',wanted)
print('got =',got)
# How to correct the mojibake string
corrected_sender = got['sender_name'].encode('latin1').decode('utf8')
print('corrected_sender =',corrected_sender)
Output:
bad_json = {"sender_name": "Horn\u00c3\u00adkov\u00c3\u00a1"}
good_json = {"sender_name": "Horn\u00edkov\u00e1"}
wanted = {'sender_name': 'Horníková'}
got = {'sender_name': 'HornÃ\xadková'}
corrected_sender = Horníková
I don't know enough about JSON to be able to say whether this is valid or not, but you can parse these strings using the raw_unicode_escape codec:
>>> "Horn\u00c3\u00adkov\u00c3\u00a1".encode('raw_unicode_escape').decode('utf8')
'Horníková'
Reencode to bytes, and then redecode to text.
>>> 'HornÃ\xadková'.encode('latin-1').decode('utf-8')
'Horníková'
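When a whole document is affected, the same latin-1/UTF-8 round trip can be applied to every string after loading. The following is a minimal sketch under that assumption; fix_mojibake is a hypothetical helper, not part of the json module, and the "tags" field is invented for the example.

```python
import json

def fix_mojibake(value):
    """Recursively repair strings that hold UTF-8 bytes mis-decoded as Latin-1."""
    if isinstance(value, str):
        try:
            return value.encode('latin-1').decode('utf-8')
        except (UnicodeEncodeError, UnicodeDecodeError):
            # Not mojibake of this kind; leave the string untouched.
            return value
    if isinstance(value, list):
        return [fix_mojibake(item) for item in value]
    if isinstance(value, dict):
        return {fix_mojibake(k): fix_mojibake(v) for k, v in value.items()}
    return value  # numbers, booleans, None

bad_json = r'{"sender_name": "Horn\u00c3\u00adkov\u00c3\u00a1", "tags": ["caf\u00c3\u00a9"]}'
fixed = fix_mojibake(json.loads(bad_json))
print(fixed)
```

Strings that are already correct usually fail one of the two steps and are returned unchanged, but a correct string that happens to also be valid under this round trip would be altered, so it is worth checking the result on real data.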
Is this type JSON file even valid JSON file according to the specification?
No.
A string is a sequence of zero or more Unicode characters, wrapped in double quotes, using backslash escapes [emphasis added].
source
A string is a sequence of Unicode code points wrapped with quotation marks (U+0022). [...] Any code point may be represented as a hexadecimal escape sequence [...] represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the code point [emphasis added].
source
UTF-8 byte sequences are neither Unicode characters nor Unicode code points.

Convert bytearray containing unicode data to str

I need to convert a bytearray which contains non-encoded raw Unicode data to a Unicode string, e.g. the Unicode character \u2167 represents the Roman numeral 8:
print(u'\u2167')
Ⅷ
Having this information stored in a bytearray, I need to find a way to convert it back to Unicode. Decoding from e.g. 'utf8' obviously does not work:
b = bytearray([0x21,0x67])
print(b.decode('utf8'))
!g
Any ideas?
EDIT
@Luke's comment got me on the right track. Apparently the original data (not the simplified one I am showing here) is encoded as UTF-16LE. The data is obtained from a wxPython TextDataObject. wxPython usually uses Unicode internally. That is what made me think that I was dealing with Unicode data.
... a bytearray which contains non-encoded raw unicode data
If it is in a bytearray, it is by definition encoded. The Python bytes or bytearray types can contain encoded Unicode data. The str type contains Unicode code points. You .decode a byte string to a Unicode string, and .encode a Unicode string into byte strings. The encoding used for your example is UTF-16BE:
>>> b = bytearray([0x21,0x67])
>>> b.decode('utf-16be')
'Ⅷ'
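Byte order matters here: the same two bytes give a different character when read as UTF-16LE, which is the encoding the asker's real data turned out to use. A small sketch of the difference, with illustrative variable names:

```python
raw = bytearray([0x21, 0x67])

# Big-endian: the first byte is the high byte, giving U+2167.
big = raw.decode('utf-16-be')

# Little-endian: the first byte is the low byte, giving U+6721.
little = raw.decode('utf-16-le')

print(hex(ord(big)), hex(ord(little)))

# For UTF-16LE data the bytes of U+2167 arrive in the opposite order.
le_bytes = '\u2167'.encode('utf-16-le')
print(le_bytes)
```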
The line print(b.decode('utf8')) is valid: 'utf8' is an alias for 'utf-8', so the following prints the same !g and does not solve the problem:
print(b.decode("utf-8"))

Python: Cyrillic handling

I got this data returned from an API: b'\\u041a\\u0435\\u0439\\u0442\\u043b\\u0438\\u043d\\u043f\\u0440\\u043e'. I know for sure that this data is in Russian. I am guessing these values are the Unicode representation of the Cyrillic letters?
The data returned was a byte array.
How can I convert that into a readable Cyrillic string? Basically, I need a way to convert this kind of data into human-readable text.
EDIT: Yes this is JSON data. Forgot to mention, sorry.
Chances are you have JSON data; JSON uses \uhhhh escape sequences to represent Unicode codepoints. Use the json.loads() function on unicode (decoded) data to produce a Python string:
import json
string = json.loads(data.decode('utf8'))
UTF-8 is the default JSON encoding; check your response headers (if you are using an HTTP-based API) to see if a different encoding was used.
Demo:
>>> import json
>>> json.loads(b'"\\u041a\\u0435\\u0439\\u0442\\u043b\\u0438\\u043d\\u043f\\u0440\\u043e"'.decode('utf8'))
'Кейтлинпро'
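On Python 3.6 and later, json.loads() also accepts bytes directly and detects UTF-8, UTF-16 or UTF-32 itself, so the explicit decode step can be left out. A short sketch, assuming as in the demo above that the response body is a complete JSON string including its double quotes:

```python
import json

# Response body as bytes: a JSON string made of \uXXXX escapes.
data = b'"\\u041a\\u0435\\u0439\\u0442\\u043b\\u0438\\u043d\\u043f\\u0440\\u043e"'

from_bytes = json.loads(data)                # bytes input, Python 3.6+
from_text = json.loads(data.decode('utf8'))  # explicit decode, as in the demo above

print(from_bytes)
```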

Convert Unicode string to UTF-8, and then to JSON

I want to encode a string in UTF-8 and view the corresponding UTF-8 bytes individually. In the Python REPL the following seems to work fine:
>>> unicode('©', 'utf-8').encode('utf-8')
'\xc2\xa9'
Note that I’m using U+00A9 COPYRIGHT SIGN as an example here. The '\xC2\xA9' looks close to what I want — a string consisting of two separate code points: U+00C2 and U+00A9. (When UTF-8-decoded, it gives back the original string, '\xA9'.)
Then, I want the UTF-8-encoded string to be converted to a JSON-compatible string. However, the following doesn’t seem to do what I want:
>>> import json; json.dumps('\xc2\xa9')
'"\\u00a9"'
Note that it generates a string containing U+00A9 (the original symbol). Instead, I need the UTF-8-encoded string, which would look like "\u00C2\u00A9" in valid JSON.
TL;DR How can I turn '©' into "\u00C2\u00A9" in Python? I feel like I’m missing something obvious — is there no built-in way to do this?
If you really want "\u00c2\u00a9" as the output, give json a Unicode string as input.
>>> print json.dumps(u'\xc2\xa9')
"\u00c2\u00a9"
You can generate this Unicode string from the raw bytes:
s = unicode('©', 'utf-8').encode('utf-8')
s2 = u''.join(unichr(ord(c)) for c in s)
I think what you really want is "\xc2\xa9" as the output, but I'm not sure how to generate that yet.
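The code above is Python 2. In Python 3 the same idea could be sketched as follows: decoding the UTF-8 bytes as 'latin-1' maps each byte to the code point with the same value, which is what the unichr(ord(c)) join does. Note that json.dumps writes the hex digits in lowercase.

```python
import json

utf8_bytes = '©'.encode('utf-8')               # the two bytes 0xC2 0xA9
as_code_points = utf8_bytes.decode('latin-1')  # one code point per byte: U+00C2 U+00A9
result = json.dumps(as_code_points)

print(result)
```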
