Python: Cyrillic handling


An API returned this data: b'\\u041a\\u0435\\u0439\\u0442\\u043b\\u0438\\u043d\\u043f\\u0440\\u043e'. I know for sure that the data is in Russian. I am guessing these values are Unicode escape sequences for the Cyrillic letters?
The data was returned as a bytes object.
How can I convert it into a readable Cyrillic string? Basically, I need a way to turn this kind of data into human-readable text.
EDIT: Yes this is JSON data. Forgot to mention, sorry.

Chances are you have JSON data; JSON uses \uhhhh escape sequences to represent Unicode codepoints. Use the json.loads() function on unicode (decoded) data to produce a Python string:
import json
string = json.loads(data.decode('utf8'))
UTF-8 is the default JSON encoding; check your response headers (if you are using an HTTP-based API) to see if a different encoding was used.
Demo:
>>> import json
>>> json.loads(b'"\\u041a\\u0435\\u0439\\u0442\\u043b\\u0438\\u043d\\u043f\\u0440\\u043e"'.decode('utf8'))
'Кейтлинпро'
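On Python 3.6 and later, json.loads() also accepts bytes directly and detects UTF-8, UTF-16 or UTF-32 on its own, so the explicit decode is optional. A minimal sketch, assuming the payload is a complete JSON document (here the sample from the question wrapped in double quotes); the name payload is only illustrative:

```python
import json

# Bytes as they might arrive from the API: a JSON string literal whose
# content consists of \uXXXX escapes (assumed to be a complete JSON document).
payload = b'"\\u041a\\u0435\\u0439\\u0442\\u043b\\u0438\\u043d\\u043f\\u0440\\u043e"'

# json.loads() accepts bytes on Python 3.6+ and detects the UTF encoding itself.
text = json.loads(payload)
print(text)
```

If the API declares a non-UTF encoding in its response headers, decoding explicitly with that encoding before calling json.loads() remains the safer route.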

Related

Read in special characters for pandas dataframe [duplicate]

I have a JSON file which contains strings encoded like this:
"sender_name": "Horn\u00c3\u00adkov\u00c3\u00a1",
I am trying to parse this file using the json module. However, I am not able to decode this string correctly.
What I get after decoding the JSON with the .load() method is 'HornÃ\xadkovÃ¡'. The string should decode to 'Horníková' instead.
I read the JSON specification and I understand that \u should be followed by 4 hexadecimal digits specifying the Unicode number of the character. But it seems that in this JSON file UTF-8 encoded bytes are stored as \u-sequences.
What type of encoding is this, and how do I parse it correctly in Python 3?
Is this type of JSON file even valid JSON according to the specification?

The text in your file was encoded as UTF-8 and each resulting byte was then written as a \u00XX escape. Because json needs a str as input, you have to turn those escapes back into bytes and decode them yourself. The 'raw_unicode_escape' codec does this: opening the file with encoding='raw_unicode_escape' converts each \u00XX escape into the character with that code point, and encoding the string with 'raw_unicode_escape' again writes each of those characters out as a single byte. Decoding the resulting bytes as UTF-8 (the default for .decode()) gives the desired result.
Note that since the encoding and decoding have to be applied to the file content as a string, you have to read the file yourself and use json.loads() instead of json.load().
In [168]: with open('test.json', encoding='raw_unicode_escape') as f:
     ...:     d = json.loads(f.read().encode('raw_unicode_escape').decode())
     ...:
In [169]: d
Out[169]: {'sender_name': 'Horníková'}

The JSON you are reading was written incorrectly. The strings decoded from it have to be re-encoded with the wrong encoding that was used (Latin-1), then decoded with the correct one (UTF-8).
Here's an example:
#!python3
import json
# The bad JSON you have
bad_json = r'{"sender_name": "Horn\u00c3\u00adkov\u00c3\u00a1"}'
print('bad_json =',bad_json)
# The wanted result from json.loads()
wanted = {'sender_name':'Horníková'}
# What correctly written JSON should look like
good_json = json.dumps(wanted)
print('good_json =',good_json)
# What you get when loading the bad JSON.
got = json.loads(bad_json)
print('wanted =',wanted)
print('got =',got)
# How to correct the mojibake string
corrected_sender = got['sender_name'].encode('latin1').decode('utf8')
print('corrected_sender =',corrected_sender)
Output:
bad_json = {"sender_name": "Horn\u00c3\u00adkov\u00c3\u00a1"}
good_json = {"sender_name": "Horn\u00edkov\u00e1"}
wanted = {'sender_name': 'Horníková'}
got = {'sender_name': 'HornÃ\xadkovÃ¡'}
corrected_sender = Horníková
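If every string in the file is garbled the same way, the same encode/decode step can be applied to the whole decoded structure. A minimal sketch, assuming all strings (keys included) are UTF-8 that was misread as Latin-1; fix_mojibake is a hypothetical helper name and the "ids" field is added only for illustration:

```python
import json

def fix_mojibake(value):
    """Recursively undo UTF-8-read-as-Latin-1 mojibake in decoded JSON data."""
    if isinstance(value, str):
        # Assumption: every string was mis-encoded; other input may raise here.
        return value.encode('latin-1').decode('utf-8')
    if isinstance(value, list):
        return [fix_mojibake(item) for item in value]
    if isinstance(value, dict):
        return {fix_mojibake(k): fix_mojibake(v) for k, v in value.items()}
    return value  # numbers, booleans and None pass through unchanged

bad_json = r'{"sender_name": "Horn\u00c3\u00adkov\u00c3\u00a1", "ids": [1, 2]}'
fixed = fix_mojibake(json.loads(bad_json))
print(fixed)
```

Walking the decoded data instead of patching the raw file text leaves the JSON syntax itself untouched.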

I don't know enough about JSON to be able to say whether this is valid or not, but you can parse these strings using the raw_unicode_escape codec:
>>> "Horn\u00c3\u00adkov\u00c3\u00a1".encode('raw_unicode_escape').decode('utf8')
'Horníková'

Reencode to bytes, and then redecode to text.
>>> 'HornÃ\xadkovÃ¡'.encode('latin-1').decode('utf-8')
'Horníková'
Is this type of JSON file even valid JSON according to the specification?
No.
A string is a sequence of zero or more Unicode characters, wrapped in double quotes, using backslash escapes [emphasis added].
source
A string is a sequence of Unicode code points wrapped with quotation marks (U+0022). [...] Any code point may be represented as a hexadecimal escape sequence [...] represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the code point [emphasis added].
source
UTF-8 byte sequences are neither Unicode characters nor Unicode code points.

The correct way to load and read JSON file contains special characters in Python

I'm working with a JSON file that contains strings in an unknown encoding, like the example below:
"L\u00c3\u00aa Nguy\u00e1\u00bb\u0085n Ph\u00c3\u00ba"
I loaded this text with the json.load() function in a Python 3.7 environment and tried to encode/decode it with several methods I found around the Internet, but I still cannot get the string I expect (in this case, it should be Lê Nguyễn Phú).
My question is: which encoding did they use, and how do I parse this text properly in Python?
The JSON file comes from an external source that I don't control, so I cannot find out or change how the text was encoded.
[Updated] More details:
The JSON file looks like this:
{
"content":"L\u00c3\u00aa Nguy\u00e1\u00bb\u0085n Ph\u00c3\u00ba"
}
Firstly, I loaded the JSON file:
with open(json_path, 'r') as f:
    data = json.load(f)
But when I extract the content, it's not what I expected:
string = data.get('content', '')
print(string)
'LÃª Nguyá»\x85n PhÃº'

Someone took "Lê Nguyễn Phú", encoded that as UTF-8, and then took the resulting series of bytes and lied to a JSON encoder by telling it that those bytes were the characters of a string. The JSON encoder then cooperatively produced garbage by encoding those characters. But it is reversible garbage. You can reverse this process using something like
json.loads(in_string).encode("latin_1").decode("utf_8")
Which decodes the string from the JSON, extracts the bytes from it (the 256 symbols in Latin-1 are in a 1-to-1 correspondence with the first 256 Unicode codepoints), and then re-decodes those bytes as UTF-8.
The big problem with this technique is that it only works if you are sure that all of your input is garbled in this fashion... there's no completely reliable way to look at an input and decide whether it should have this broken decoding applied to it. If you try to apply it to a validly-encoded string containing codepoints above U+00FF, it will crash with a UnicodeEncodeError. If you try to apply it to a validly-encoded string containing non-ASCII codepoints up to U+00FF, it will either crash with a UnicodeDecodeError or, when the bytes happen to form valid UTF-8, turn your perfectly good string into a different kind of garbage.
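A guarded version of the round trip makes these failure modes visible. This is only a sketch: undo_utf8_as_latin1 is a hypothetical helper name, and falling back to the original text on an error is an assumption about what the caller wants, not a reliable detector.

```python
import json

def undo_utf8_as_latin1(text):
    """Try to reverse the mojibake; return (result, changed)."""
    try:
        fixed = text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either a code point above U+00FF or bytes that are not valid UTF-8:
        # the string was probably not garbled this way, so keep it as it is.
        return text, False
    return fixed, fixed != text

garbled = json.loads(r'"L\u00c3\u00aa Nguy\u00e1\u00bb\u0085n Ph\u00c3\u00ba"')

print(undo_utf8_as_latin1(garbled))          # garbled input gets repaired
print(undo_utf8_as_latin1('Lê Nguyễn Phú'))  # valid input, kept (encode fails)
print(undo_utf8_as_latin1('café'))           # valid input, kept (decode fails)
print(undo_utf8_as_latin1('Ã©'))             # valid input, silently altered
```

The last call is the undetectable case: a string that really was meant to read 'Ã©' comes back as 'é'.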

Python converting characters to URL encoding

I am trying to convert texts into URLs, but certain characters are not being converted as I'm expecting. For example:
>>> import urllib.parse
>>> my_text="City of Liège"
>>> my_url=urllib.parse.quote(my_text,safe='')
>>> my_url
'City%20of%20Li%C3%A8ge'
The spaces get converted properly. However, the "è" should get converted into %E8, but it is returned as %C3%A8. What am I missing?
I am using Python 3.6.

In Python 3 a str holds Unicode code points, and urllib.parse.quote() encodes it to bytes as UTF-8 by default before percent-escaping; the URL-encoded string reflects this.
0xC3 0xA8 is the UTF-8 encoding of the Unicode value U+00E8, which is described as "LATIN SMALL LETTER E WITH GRAVE".
To get %E8 you have to quote the bytes of a single-byte encoding such as cp1252 (or Latin-1), for example by converting the string to bytes first:
my_text=bytes("City of Liège",'cp1252')
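quote() also takes an encoding argument, which avoids the manual conversion to bytes. A short sketch, assuming the receiving server really expects cp1252-encoded URLs (most expect UTF-8):

```python
from urllib.parse import quote, unquote

my_text = "City of Liège"

# Default behaviour: the text is encoded as UTF-8 before percent-escaping.
utf8_url = quote(my_text, safe='')
# Same call, but the text is encoded as cp1252 first.
cp1252_url = quote(my_text, safe='', encoding='cp1252')

print(utf8_url)    # City%20of%20Li%C3%A8ge
print(cp1252_url)  # City%20of%20Li%E8ge

# Decoding needs the matching encoding to get the original text back.
print(unquote(cp1252_url, encoding='cp1252'))
```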

when python loads json,how to convert str to unicode ,so I can print Chinese characters?

I got a json file like this:
{
'errNum': 0,
'retData': {
'city': "武汉"
}
}
import json
content = json.loads(result) # supposing json file named result
cityname = content['retData']['city']
print cityname
After that, I got this output: \u6b66\u6c49
I know it's the Unicode escape form of the Chinese characters 武汉, but its type is str:
isinstance(cityname,str) is True.
So how can I convert this str to unicode so that the output will be 武汉?
I have also tried these solutions:
>>> u'\u6b66\u6c49'
u'\u6b66\u6c49'
>>> print u'\u6b66\u6c49'
武汉
>>> print '\u6b66\u6c49'.decode()
\u6b66\u6c49
>>> print '\u6b66\u6c49'
\u6b66\u6c49
I have searched for information about ASCII, Unicode, UTF-8, encode and decode, but I still cannot understand it. It is crazy!
I need some help. Thanks!

Perhaps this answer comes five years too late, but I had a similar issue while building a preprocessor for the Japanese language, so here is the answer I found.
json.loads() does not accept an ensure_ascii argument; it is a parameter of json.dumps() and json.dump(). Add the flag when you serialize content back to JSON text:
text = json.dumps(content, ensure_ascii=False)
This fixed my issue.
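The effect of ensure_ascii is easiest to see on the serialized output. A minimal Python 3 sketch, using a dict shaped like the data in the question (the variable names are illustrative):

```python
import json

content = {'errNum': 0, 'retData': {'city': '武汉'}}

escaped = json.dumps(content)                       # default: ensure_ascii=True
readable = json.dumps(content, ensure_ascii=False)  # keeps non-ASCII characters

print(escaped)   # {"errNum": 0, "retData": {"city": "\u6b66\u6c49"}}
print(readable)  # {"errNum": 0, "retData": {"city": "武汉"}}

# Both forms are valid JSON and decode to the same data.
print(json.loads(escaped) == json.loads(readable) == content)
```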

Your json contains escaped unicode characters. You can decode them into actual unicode characters using the unicode_escape codec:
print cityname.decode('unicode_escape')
Note that, while this will usually work, depending on the source of the unicode escaping you could have problems with characters outside the Basic Multilingual Plane (U+0000 to U+FFFF). A convenient quote from user @bobince that I took from a comment:
Note that ... there are a number of different formats that use \u
escapes - Python unicode literals (which unicode-escape handles), Java
properties, JavaScript string literals, JSON, and so on. It is
important to know which one you are dealing with because they all have
slightly different rules about what other escapes are valid.
unicode-escape may or may not be a valid way of parsing that data
depending on where it comes from.
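One concrete difference shows up with characters outside the Basic Multilingual Plane, which JSON writes as a surrogate pair. A small sketch, assuming Python 3 (unlike the Python 2 code above), where the JSON parser joins the pair and the unicode_escape codec does not:

```python
import json

# JSON-style escape for U+1F600, written as a surrogate pair.
escaped = r'\ud83d\ude00'

via_json = json.loads('"' + escaped + '"')                    # parsed as JSON
via_codec = escaped.encode('ascii').decode('unicode_escape')  # parsed as a Python literal

print(len(via_json))   # 1: the pair was combined into one code point
print(len(via_codec))  # 2: two lone surrogates, which cannot be encoded as UTF-8
```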

How to convert a '\u5f71\u89c6\...' in to its real meaning?(python)

I want to convert
'[["[FK\u5f71\u89c6\u51fa\u54c1]\u7576\u65fa\u7238\u7238-17.\u7ca4\u8bed\u5b57\u5e55.TV-RMVB.rmvb", "205.53 MB"]]'
to
'[["[FK影视出品]當旺爸爸-17.粤语字幕.TV-RMVB.rmvb", "205.53 MB"]]'
I made the mistake of using json.dumps(file_list) to convert a list object to str and saving the result to the db. I did not notice this mistake until I used sphinx to index the data...
I have tried data.decode('utf-8'), but it does not seem to work.

Just decode from JSON again:
>>> import json
>>> json.loads('[["[FK\u5f71\u89c6\u51fa\u54c1]\u7576\u65fa\u7238\u7238-17.\u7ca4\u8bed\u5b57\u5e55.TV-RMVB.rmvb", "205.53 MB"]]')
[['[FK影视出品]當旺爸爸-17.粤语字幕.TV-RMVB.rmvb', '205.53 MB']]
You don't have UTF-8 encoded data, you have JSON-encoded data, which uses \uhhhh escape sequences to represent Unicode codepoints.
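The round trip can be checked end to end. A minimal Python 3 sketch, in which stored stands in for the value saved in the database column (the variable names are assumptions for illustration):

```python
import json

file_list = [["[FK影视出品]當旺爸爸-17.粤语字幕.TV-RMVB.rmvb", "205.53 MB"]]

stored = json.dumps(file_list)   # what ended up in the database column
restored = json.loads(stored)    # decoding from JSON recovers the list

print(all(ord(ch) < 128 for ch in stored))  # True: only ASCII plus \uXXXX escapes
print(restored == file_list)                # True: nothing was lost
```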
