Python removing nested unicode 'u' sign from string


I have a unicode object which should represent a json but it contains the unicode u in it as part of the string value e.g. u'{u\'name\':u\'my_name\'}'
My goal is to load this into a JSON object. Just using json.loads fails. I know this happens because of the u prefixes inside the string, which are not valid JSON.
I then tried sanitizing the string using replace("u\'", "'"), encode('ascii', 'ignore') and other methods, without success.
What finally worked was ast.literal_eval, but I'm worried about using it. I found a few sources online claiming it's safe, but I also found others claiming it's bad practice and should be avoided.
Are there other methods I'm missing?

The unicode string is the result of unicode being called on a dictionary.
>>> d = {u'name': u'myname'}
>>> u = unicode(d)
>>> u
u"{u'name': u'myname'}"
If you control the code that's doing this, the best fix is to change it to call json.dumps instead.
>>> json.dumps(d)
'{"name": "myname"}'
If you don't control the creation of this object, you'll need to use ast.literal_eval to create the dictionary, as the unicode string is not valid json.
>>> json.loads(u)
Traceback (most recent call last):
...
ValueError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)
>>> ast.literal_eval(u)
{u'name': u'myname'}
The docs confirm that ast.literal_eval is safe:
can be used for safely evaluating strings containing Python values from untrusted sources
You could use eval instead, but as you don't control the creation of the object you cannot be certain that it has not been crafted by a malicious user, to cause damage to your system.
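A minimal Python 3 sketch of the recovery path described above, assuming the input is the text produced by calling unicode() on a dict (the variable names are illustrative). Python 3.3+ still accepts the u prefix in literals, so ast.literal_eval parses it, and an expression that is not a plain literal is rejected with ValueError rather than executed.

```python
import ast
import json

# Text as produced by unicode(d) in Python 2 (assumed input for this sketch).
raw = "{u'name': u'myname'}"

data = ast.literal_eval(raw)   # parses literals only, never runs code
as_json = json.dumps(data)     # now it is real JSON text
print(as_json)                 # {"name": "myname"}

# A function call is not a literal, so it is refused instead of evaluated.
try:
    ast.literal_eval("__import__('os').getcwd()")
except ValueError as exc:
    print("rejected:", exc)
```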

Related

Python - Issues with Unicode String from API Call

I'm using Python to call an API that returns the last name of some soccer players. One of the players has a "ć" in his name.
When I call the endpoint, the name prints out with a Unicode escape sequence in it:
>>> last_name = (json.dumps(response["response"][2]["player"]["lastname"]))
>>> print(last_name)
"Mitrovi\u0107"
>>> print(type(last_name))
<class 'str'>
If I copy that output and paste it into a string literal of its own, like so:
>>> print("Mitrovi\u0107")
Mitrović
>>> print(type("Mitrovi\u0107"))
<class 'str'>
Then it prints just fine?
What is wrong with the API endpoint call and the string that comes from it?

Well, you serialise the string with json.dumps() before printing it, that's why you get a different output.
Compare the following:
>>> print("Mitrović")
Mitrović
and
>>> print(json.dumps("Mitrović"))
"Mitrovi\u0107"
The second command adds double quotes to the output and escapes non-ASCII chars, because that's how strings are encoded in JSON. So it's possible that response["response"][2]["player"]["lastname"] contains exactly what you want, but maybe you fooled yourself by wrapping it in json.dumps() before printing.
Note: don't confuse Python string literals and JSON serialisation of strings. They share some common features, but they aren't the same (e.g. JSON strings can't be single-quoted), and they serve different purposes (the first is for writing strings in source code, the second is for encoding data to send across the network).
Another note: You can avoid most of the escaping with ensure_ascii=False in the json.dumps() call:
>>> print(json.dumps("Mitrović", ensure_ascii=False))
"Mitrović"
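As a small Python 3 sketch of the point above (the variable names are illustrative), both forms of JSON text decode back to the same Python string, so the escaping only affects how the data looks on the wire:

```python
import json

name = "Mitrovi\u0107"  # the same text as "Mitrović"

escaped = json.dumps(name)                       # ASCII-only JSON text
readable = json.dumps(name, ensure_ascii=False)  # keeps the non-ASCII character

print(escaped)
print(readable)

# Both JSON texts decode to the identical Python string.
assert json.loads(escaped) == json.loads(readable) == name
```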

Count the number of characters in your string and I'll bet you'll notice that the result of json.dumps is 13 characters (not counting the surrounding quotes):
"M-i-t-r-o-v-i-\-u-0-1-0-7", or "Mitrovi\\u0107"
When you copy "Mitrovi\u0107" into a string literal, you're copying 8 characters, because '\u0107' is a single Unicode character.
That would suggest the endpoint is not sending properly JSON-escaped Unicode, or somewhere in your doc you're reading it as ASCII first. Look carefully at exactly what you're receiving.
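The character counts mentioned above can be checked directly. This is a Python 3 sketch with illustrative variable names:

```python
import json

name = "Mitrovi\u0107"         # 8 characters; \u0107 is one character
serialized = json.dumps(name)  # JSON text, with quotes and an escape
inner = serialized[1:-1]       # drop the surrounding double quotes

print(len(name))                    # 8
print(len(inner))                   # 13
print(len(json.loads(serialized)))  # 8 again after decoding
```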

Python - How can I convert a special character to the unicode representation?

In a dictionary, I have the following value containing equals signs:
{"appVersion":"o0u5jeWA6TwlJacNFnjiTA=="}
To be explicit, I need to replace each = with its Unicode escape '\u003d' (basically the reverse of what json.loads() does). How can I store that value in a variable without ending up with two backslashes (\\u003d)?
I've tried several approaches, including encode/decode, repr(), unichr(61), etc., and even after a lot of searching I couldn't find anything that does this. Every approach gives me the following final result (or the original value):
'o0u5jeWA6TwlJacNFnjiTA\\u003d\\u003d'
Thanks in advance for your attention.
EDIT
When I debug the code, the variable's value is shown with two backslashes. The program takes this value and uses it in the following steps, including the extra backslash. I'm using this code to build a JSON document with json.dumps(), and the result returned is a unicode string with two backslashes.
I need to find a way to store the value in the variable with just one backslash.
I don't know if it makes a difference, but I'm doing this in a custom Burp plugin that manipulates some selected requests.

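The extra backslash is not actually added. The interactive interpreter displays the repr() of a value, which doubles the backslash to show that it is a literal backslash and not the start of an escape such as \t or \n. When the string is printed, only one backslash appears. (In a Python 2 byte string, '\u003d' is not an escape sequence, so the backslash is kept literally.)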
>>> t['appVersion'] = t["appVersion"].replace('=', '\u003d')
>>> t['appVersion']
'o0u5jeWA6TwlJacNFnjiTA\\u003d\\u003d'
>>> print(t['appVersion'])
o0u5jeWA6TwlJacNFnjiTA\u003d\u003d
>>> t['appVersion'] == 'o0u5jeWA6TwlJacNFnjiTA\u003d\u003d'
True
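The session above relies on Python 2 byte strings. In Python 3, '\u003d' in an ordinary literal is simply '=', so a raw string or a doubled backslash is needed there. If the real goal is JSON text that contains \u003d with a single backslash, one option (an assumption about the goal, not the method from the answer above) is to serialize first and replace afterwards. That works because a bare = can only occur inside string values in JSON text:

```python
import json

payload = {"appVersion": "o0u5jeWA6TwlJacNFnjiTA=="}

# Serialize first, then swap each "=" for the six-character escape \u003d.
# Doing it in this order stops json.dumps from doubling the backslash.
wire_text = json.dumps(payload).replace("=", r"\u003d")

print(wire_text)

# A JSON parser turns the escape back into "=".
assert json.loads(wire_text) == payload
```

Replacing inside the value before calling json.dumps() would instead produce \\u003d in the output, because JSON has to escape a literal backslash.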

Python delete element from JSON causes unicode errors

I have a list of US counties that I downloaded from Wikipedia using import.io, but it produced several elements for each county that I want to remove from the document (e.g. the URLs).
I am really confused because I thought JSON docs were in Unicode, although I've seen similar questions/answers on this topic say just pop or delete the element. When I try to pop or delete I get an error saying you can't del unicode and there's no pop in unicode. What am I missing?
Example entry in the JSON doc:
"data":[{"state/_text":["Alabama"],
"county":["http://en.wikipedia.org/wiki/Autauga_County,_Alabama"],
"state":["http://en.wikipedia.org/wiki/Alabama"],
"state/_source":["/wiki/Alabama"],
"state/_title":["Alabama"],
"county/_title":["Autauga County, Alabama"],
"county/_text":["Autauga County"],
"county/_source":["/wiki/Autauga_County,_Alabama"],
My code:
import json
countiesDoc = json.load(open("US_Counties.json"))
for element in countiesDoc:
    del element["county"]
open("updated_US_Counties.json", "w").write(
    json.dumps(countiesDoc, sort_keys=True, indent=4, separators=(',', ': '))
)
Traceback:
Traceback (most recent call last):
  File "edit_us_counties.py", line 10, in <module>
    del element["county"]
TypeError: 'unicode' object does not support item deletion
Process finished with exit code 1

countiesDoc should be a Python dict after loading. Iterating over a dict returns the keys, which are strings; therefore, element is a string. Example:
import json
jstr = '''\
{
"element":"value",
"other":123
}
'''
doc = json.loads(jstr)
print('doc', type(doc))
for e in doc:
    print('e', e, type(e))
Output:
doc <class 'dict'>
e element <class 'str'>
e other <class 'str'>
I don't know the format of your document, but you probably just want the following assuming county is a key:
del countiesDoc['county']
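Judging by the sample entry in the question, the county fields appear to live in a list of dicts under a top-level "data" key. Under that assumption (the full file layout is not shown, and the shortened document below is made up for illustration), a Python 3 sketch of removing the key from every entry could look like this:

```python
import json

# Shortened stand-in for US_Counties.json, shaped like the sample entry.
raw = """
{"data": [
  {"state/_text": ["Alabama"],
   "county": ["http://en.wikipedia.org/wiki/Autauga_County,_Alabama"],
   "county/_text": ["Autauga County"]}
]}
"""

doc = json.loads(raw)

# Iterate over the list of entries, not over the top-level dict's keys.
for entry in doc["data"]:
    entry.pop("county", None)  # no error if an entry lacks the key

print(json.dumps(doc, sort_keys=True, indent=4))
```

Using pop with a default rather than del keeps the loop from failing on entries that have no "county" key.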

You're confusing a few things.
JSON can encode all kinds of things—numbers, booleans, strings, arrays or dictionaries of any of the above—into a big string.
The server uses JSON to encode an array or dictionary to a big string, then sends it over the wire to your program. You then need to decode it to get back an array or dictionary (in Python terms, a list or dict). Until you do that, all you have is a string. And you can't pop or delete from a string.
So:
import json
j = <however you retrieve the JSON document>
obj = json.loads(j)
del obj['key_i_want_gone']
And of course if you want to send the modified dictionary back to the server, or write it to a text file, or whatever, you're probably going to need to re-encode it as JSON first:
j = json.dumps(obj)
<however you save or upload or whatever a JSON document>
The reason you're getting error messages about Unicode is that in Python 2.x, the name of the type that holds Unicode strings is unicode. So, when you call pop on that string, you're trying to call a method named unicode.pop, and there is no such method.
It's basically just a coincidence that the API you're using to fetch the document gives you a Unicode string, and JSON is defined as encoding to Unicode strings, and JSON can take Unicode strings as one of the things it can encode, and so on. (Well, not a coincidence. Most new APIs, document formats, etc. use Unicode because it's the best way to be able to handle most of the characters in most of the languages people care about. But the error has nothing to do with whether or not JSON's strings are Unicode or something different.)

Python, Need Long Value to be Integer

I'm trying to insert a Unix timestamp into a web service over REST. When I convert the dictionary I get the value 1392249600000L, and I need this value to be an integer.
So I tried int(1392249600000L) and I get 1392249600000L, still a long value.
I need this because the JSON web service only accepts timestamps with milliseconds in them, but when I pass the JSON value with the 'L' in it I get an invalid JSON primitive error for the value 1392249600000L.
Can someone please help me resolve this? It seems like it should be so easy, but it's driving me crazy!

You should not be using Python representations when you are sending JSON data. Use the json module to represent integers instead:
>>> import json
>>> json.dumps(1392249600000L)
'1392249600000'
The L is only part of the string representation, there to make debugging easier by making it clear you have a long, not an int value. Don't use Python string representations for network communications.
For example, if you have a list of Python values, the str() representation of that list uses the repr() of each item, resulting in L suffixes for long integers. json.dumps() handles such cases properly, and handles other types correctly too (Python None becomes JSON null, Python True becomes JSON true, and so on):
>>> json.dumps([1392249600000L, True, None])
'[1392249600000, true, null]'
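Python 3 has a single int type and no L suffix, but the same advice still applies there, because str() of a container is Python syntax and not JSON. A short Python 3 sketch, with a made-up payload for illustration:

```python
import json

# Hypothetical payload; only the timestamp value comes from the question.
payload = {"timestamp": 1392249600000, "active": True, "note": None}

python_text = str(payload)       # Python repr: True, None, single quotes
json_text = json.dumps(payload)  # JSON: true, null, double quotes

print(python_text)
print(json_text)

# The Python repr is not accepted by a JSON parser.
try:
    json.loads(python_text)
except ValueError as exc:  # json.JSONDecodeError is a subclass of ValueError
    print("not JSON:", exc)
```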

How do I get rid of the "u" from a decoded JSON object?

I have a dictionary of dictionaries in Python:
d = {"a11y_firesafety.html":{"lang:hi": {"div1": "http://a11y.in/a11y/idea/a11y_firesafety.html:hi"}, "lang:kn": {"div1": "http://a11y.in/a11ypi/idea/a11y_firesafety.html:kn"}}}
I have this in a JSON file and I encoded it using json.dumps(). Now when I decode it using json.loads() in Python I get a result like this:
temp = {u'a11y_firesafety.html': {u'lang:hi': {u'div1': u'http://a11y.in/a11ypi/idea/a11y_firesafety.html:hi'}, u'lang:kn': {u'div1': u'http://a11y.in/a11ypi/idea/a11y_firesafety.html:kn'}}}
My problem is with the "u" that appears in front of every item in temp (the dictionary of dictionaries), marking it as a Unicode string. How do I get rid of that "u"?

Why do you care about the 'u' characters? They're just a visual indicator; unless you're actually using the result of str(temp) in your code, they have no effect on your code. For example:
>>> test = u"abcd"
>>> test == "abcd"
True
If they do matter for some reason, and you don't care about consequences like not being able to use this code in an international setting, then you could pass in a custom object_hook (see the json module docs) to produce dictionaries with byte-string (str) contents rather than unicode.
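A small sketch of the first point, written for Python 3 and using a shortened version of the dictionary from the question. The decoded values compare equal to ordinary literals, and json.dumps produces prefix-free text if that is what needs to be displayed or stored:

```python
import json

text = '{"lang:hi": {"div1": "http://a11y.in/a11ypi/idea/a11y_firesafety.html:hi"}}'
temp = json.loads(text)

# The decoded value is a normal string; a u prefix only ever shows up
# in the Python 2 repr, not in the data itself.
assert temp["lang:hi"]["div1"] == "http://a11y.in/a11ypi/idea/a11y_firesafety.html:hi"

# Serializing again gives JSON text with no prefixes.
print(json.dumps(temp))
```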

You could also use this:
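import fileinput
fout = open("out.txt", 'a')
for line in fileinput.input("in.txt"):
    # Strip the u prefix in front of double- and single-quoted strings.
    cleaned = line.replace("u\"", "\"").replace("u\'", "\'")
    # Each line already ends with a newline, so write it unchanged.
    fout.write(cleaned)
fout.close()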
Printing decoded JSON in Python 2 produces these two prefix forms, u' and u".
This snippet strips both of them from a text file. It may not be required, as the prefix doesn't hinder any logical processing, as mentioned in the previous answer.

There is no "unicode" encoding: unicode is a separate data type. I don't see why unicode would be a problem, since you can always convert it to a byte string with, e.g., foo.encode('utf-8').
However, if you really want str objects up front, you should probably create your own decoder class and use it when decoding JSON.
