Python - Issues with Unicode String from API Call - python

I'm using Python to call an API that returns the last name of some soccer players. One of the players has a "ć" in his name.
When I call the endpoint, the name prints out with the unicode attached to it:
>>> last_name = (json.dumps(response["response"][2]["player"]["lastname"]))
>>> print(last_name)
"Mitrovi\u0107"
>>> print(type(last_name))
<class 'str'>
If I were to take copy and paste that output and put it in a variable on its own like so:
>>> print("Mitrovi\u0107")
Mitrović
>>> print(type("Mitrovi\u0107"))
<class 'str'>
Then it prints just fine?
What is wrong with the API endpoint call and the string that comes from it?

Well, you serialise the string with json.dumps() before printing it, that's why you get a different output.
Compare the following:
>>> print("Mitrović")
Mitrović
and
>>> print(json.dumps("Mitrović"))
"Mitrovi\u0107"
The second command adds double quotes to the output and escapes non-ASCII chars, because that's how strings are encoded in JSON. So it's possible that response["response"][2]["player"]["lastname"] contains exactly what you want, but maybe you fooled yourself by wrapping it in json.dumps() before printing.
Note: don't confuse Python string literals and JSON serialisation of strings. They share some common features, but they aren't the same (eg. JSON strings can't be single-quoted), and they serve a different purpose (the first are for writing strings in source code, the second are for encoding data for sending it accross the network).
Another note: You can avoid most of the escaping with ensure_ascii=False in the json.dumps() call:
>>> print(json.dumps("Mitrović", ensure_ascii=False))
"Mitrović"

Count the number of characters in your string & I'll bet you'll notice that the result of json is 13 characters:
"M-i-t-r-o-v-i-\-u-0-1-0-7", or "Mitrovi\\u0107"
When you copy "Mitrovi\u0107" you're coping 8 characters and the '\u0107' is a single unicode character.
That would suggest the endpoint is not sending properly json-escaped unicode, or somewhere in your doc you're reading it as ascii first. Carefully look at exactly what you're receiving.

Related

Python removing nested unicode 'u' sign from string

I have a unicode object which should represent a json but it contains the unicode u in it as part of the string value e.g. u'{u\'name\':u\'my_name\'}'
My goal is to be able to load this into a json object. Just using json.loads fails. I know this happens because of the u inside the string which are not part of an acceptable json format.
I, then, tired sanitizing the string using replace("u\'", "'"), encode('ascii', 'ignore') and other methods without success.
What finally worked was using ast.literal_eval but I'm worried about using it. I found a few sources online claiming its safe. But, I also found other sources claiming it's bad practice and one should avoid it.
Are there other methods I'm missing?
The unicode string is the result of unicode being called on a dictionary.
>>> d = {u'name': u'myname'}
>>> u = unicode(d)
>>> u
u"{u'name': u'myname'}"
If you control the code that's doing this, the best fix is to change it to call json.dumps instead.
>>> json.dumps(d)
'{"name": "myname"}'
If you don't control the creation of this object, you'll need to use ast.literal_eval to create the dictionary, as the unicode string is not valid json.
>>> json.loads(u)
Traceback (most recent call last):
...
ValueError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)
>>> ast.literal_eval(u)
{u'name': u'myname'}
The docs confirm that ast.literal_eval is safe:
can be used for safely evaluating strings containing Python values from untrusted sources
You could use eval instead, but as you don't control the creation of the object you cannot be certain that it has not been crafted by a malicious user, to cause damage to your system.

Python - How can I convert a special character to the unicode representation?

In a dictionary, I have the following value with equals signal:
{"appVersion":"o0u5jeWA6TwlJacNFnjiTA=="}
To be explicit, I need to replace the = for the unicode representation '\u003d' (basically the reverse process of [json.loads()][1]). How can I set the unicode value to a variable without store the value with two scapes (\\u003d)?.
I've tryed of different ways, including the enconde/decode, repr(), unichr(61), etc, and even searching a lot, cound't find anything that does this, all the ways give me the following final result (or the original result):
'o0u5jeWA6TwlJacNFnjiTA\\u003d\\u003d'
Since now, thanks for your attention.
EDIT
When I debug the code, it gives me the value of the variable with 2 escapes. The program will get this value and use it to do the following actions, including the extra escape. I'm using this code to construct a json by the json.dumps() and the result returned is a unicode with 2 escapes.
Follow a print of the final result after the JSON construction. I need to find a way to store the value in the var with just one escape.
I don't know if make difference, but I'm doing this to a custom BURP Plugin, manipulating some selected requests.
Here is an image of my POC, getting the value of the var.
The extra backslash is not actually added, The Python interpreter uses the repr() to indicate that it's a backslash not something like \t or \n when the string containing \ gets printed:
I hope this helps:
>>> t['appVersion'] = t["appVersion"].replace('=', '\u003d')
>>> t['appVersion']
'o0u5jeWA6TwlJacNFnjiTA\\u003d\\u003d'
>>> print(t['appVersion'])
o0u5jeWA6TwlJacNFnjiTA\u003d\u003d
>>> t['appVersion'] == 'o0u5jeWA6TwlJacNFnjiTA\u003d\u003d'
True

when python loads json,how to convert str to unicode ,so I can print Chinese characters?

I got a json file like this:
{
'errNum': 0,
'retData': {
'city': "武汉"
}
}
import json
content = json.loads(result) # supposing json file named result
cityname = content['retData']['city']
print cityname
After that, I got a output : \u6b66\u6c49
I know it's unicode of Chinese character of 武汉 ,but the type of it is str
isinstance(cityname,str) is True.
so how can I convert this str to unicode and output will be 武汉
I also have tried these solutions:
>>> u'\u6b66\u6c49'
u'\u6b66\u6c49'
>>> print u'\u6b66\u6c49'
武汉
>>> print '\u6b66\u6c49'.decode()
\u6b66\u6c49
>>> print '\u6b66\u6c49'
\u6b66\u6c49
Searched something about ascii,unicode and utf-8 ,encode and decode ,but also cannot understand,it is crazy!
I need some help ,Thanks !
Perhaps this answer comes five years too late, but since I had a similar issue that I was trying to solve, while building a preprocessor for the Japanese language, here is the answer I found.
when you loads the result to content add the following flag:
content = json.loads(result, ensure_ascii=False)
This fixed my issue.
Your json contains escaped unicode characters. You can decode them into actual unicode characters using the unicode_escape codec:
print cityname.decode('unicode_escape')
Note that, while this will usually work, depending on the source of the unicode escaping you could have problems with characters outside the Basic Multilingual Plane (U+0 to U+FFFF). A convenient quote from user #bobince that I took from a comment:
Note that ... there are a number of different formats that use \u
escapes - Python unicode literals (which unicode-escape handles), Java
properties, JavaScript string literals, JSON, and so on. It is
important to know which one you are dealing with because they all have
slightly different rules about what other escapes are valid.
unicode-escape may or may not be a valid way of parsing that data
depending on where it comes from.

Convert Unicode string to UTF-8, and then to JSON

I want to encode a string in UTF-8 and view the corresponding UTF-8 bytes individually. In the Python REPL the following seems to work fine:
>>> unicode('©', 'utf-8').encode('utf-8')
'\xc2\xa9'
Note that I’m using U+00A9 COPYRIGHT SIGN as an example here. The '\xC2\xA9' looks close to what I want — a string consisting of two separate code points: U+00C2 and U+00A9. (When UTF-8-decoded, it gives back the original string, '\xA9'.)
Then, I want the UTF-8-encoded string to be converted to a JSON-compatible string. However, the following doesn’t seem to do what I want:
>>> import json; json.dumps('\xc2\xa9')
'"\\u00a9"'
Note that it generates a string containing U+00A9 (the original symbol). Instead, I need the UTF-8-encoded string, which would look like "\u00C2\u00A9" in valid JSON.
TL;DR How can I turn '©' into "\u00C2\u00A9" in Python? I feel like I’m missing something obvious — is there no built-in way to do this?
If you really want "\u00c2\u00a9" as the output, give json a Unicode string as input.
>>> print json.dumps(u'\xc2\xa9')
"\u00c2\u00a9"
You can generate this Unicode string from the raw bytes:
s = unicode('©', 'utf-8').encode('utf-8')
s2 = u''.join(unichr(ord(c)) for c in s)
I think what you really want is "\xc2\xa9" as the output, but I'm not sure how to generate that yet.

How do I get rid of the "u" from a decoded JSON object?

I have a dictionary of dictionaries in Python:
d = {"a11y_firesafety.html":{"lang:hi": {"div1": "http://a11y.in/a11y/idea/a11y_firesafety.html:hi"}, "lang:kn": {"div1": "http://a11y.in/a11ypi/idea/a11y_firesafety.html:kn}}}
I have this in a JSON file and I encoded it using json.dumps(). Now when I decode it using json.loads() in Python I get a result like this:
temp = {u'a11y_firesafety.html': {u'lang:hi': {u'div1': u'http://a11y.in/a11ypi/idea/a11y_firesafety.html:hi'}, u'lang:kn': {u'div1': u'http://a11y.in/a11ypi/idea/a11y_firesafety.html:kn'}}}
My problem is with the "u" which signifies the Unicode encoding in front of every item in my temp (dictionary of dictionaries). How to get rid of that "u"?
Why do you care about the 'u' characters? They're just a visual indicator; unless you're actually using the result of str(temp) in your code, they have no effect on your code. For example:
>>> test = u"abcd"
>>> test == "abcd"
True
If they do matter for some reason, and you don't care about consequences like not being able to use this code in an international setting, then you could pass in a custom object_hook (see the json docs here) to produce dictionaries with string contents rather than unicode.
You could also use this:
import fileinput
fout = open("out.txt", 'a')
for i in fileinput.input("in.txt"):
str = i.replace("u\"","\"").replace("u\'","\'")
print >> fout,str
The typical json responses from standard websites have these two encoding representations - u' and u"
This snippet gets rid of both of them. It may not be required as this encoding doesn't hinder any logical processing, as mentioned by previous commenter
There is no "unicode" encoding, since unicode is a different data type and I don't really see any reason unicode would be a problem, since you may always convert it to string doing e.g. foo.encode('utf-8').
However, if you really want to have string objects upfront you should probably create your own decoder class and use it while decoding JSON.

Categories