decoding JSON data with backslash encoding - python

I have the following JSON data.
"[
\"msgType\": \"0\",
\"tid\": \"1\",
\"data\": \"[
{
\\\"EventName\\\": \\\"TExceeded\\\",
\\\"Severity\\\": \\\"warn\\\",
\\\"Subject\\\": \\\"Exceeded\\\",
\\\"Message\\\": \\\"tdetails: {
\\\\\\\"Message\\\\\\\": \\\\\\\"my page tooktoolong(2498ms: AT: 5ms,
BT: 1263ms,
CT: 1230ms),
andexceededthresholdof5ms\\\\\\\",
\\\\\\\"Referrer\\\\\\\": \\\\\\\"undefined\\\\\\\",
\\\\\\\"Session\\\\\\\": \\\\\\\"None\\\\\\\",
\\\\\\\"ResponseTime\\\\\\\": 0,
\\\\\\\"StatusCode\\\\\\\": 0,
\\\\\\\"Links\\\\\\\": 215,
\\\\\\\"Images\\\\\\\": 57,
\\\\\\\"Forms\\\\\\\": 2,
\\\\\\\"Platform\\\\\\\": \\\\\\\"Linuxx86_64\\\\\\\",
\\\\\\\"BrowserAppname\\\\\\\": \\\\\\\"Netscape\\\\\\\",
\\\\\\\"AppCodename\\\\\\\": \\\\\\\"Mozilla\\\\\\\",
\\\\\\\"CPUs\\\\\\\": 8,
\\\\\\\"Language\\\\\\\": \\\\\\\"en-GB\\\\\\\",
\\\\\\\"isEvent\\\\\\\": \\\\\\\"true\\\\\\\",
\\\\\\\"PageLatency\\\\\\\": 2498,
\\\\\\\"Threshold\\\\\\\": 5,
\\\\\\\"AT\\\\\\\": 5,
\\\\\\\"BT\\\\\\\": 1263,
\\\\\\\"CT\\\\\\\": 1230
}\\\",
\\\"EventTimestamp\\\": \\\"1432514783269\\\"
}
]\",
\"Timestamp\": \"1432514783269\",
\"AppName\": \"undefined\",
\"Group\": \"UndefinedGroup\"
]"
I want to make this JSON file into a single level of wrapping.i.e I want to remove the nested structure inside and copy that data over to the top level JSON structure. How can I do this?
If this strucutre is named json_data
I want to be able to access
json_data['Platform']
json_data[BrowserAppname']
json_data['Severity']
json_data['msgType']
Basically some kind of rudimentary normalization.What is the easiest way to do this using python

A generally unsafe but probably okay in this case solution would be:
import json
d = json.loads(json_string.replace('\\', ''))

I'm not sure what happened but this doesn't look like valid JSON.
You have some double-quotes escaped once, some twice, some three times etc.
You have key/value pairs inside of a list-like object []
tdetails is missing a trailing quote
Even if you fix the above you still have your data list quoted as a multi-line string which is invalid.
It appears to be that this "JSON" was constructed by hand, by someone with no knowledge of JSON.
You can try "massaging" the data into JSON with the following:
import re
x = re.sub(r'\\+', '', js_str)
x = re.sub(r'\n', '', js_str)
x = '{' + js_str.strip()[1:-1] + '}'
Which would make the string almost json like, but you still need to fix point #3.

Related

Converting data inside a JSON file into Python dict

I have a JSON file and the inside of it looks like this:
"{'reviewId': 'gp:AOqpTOGiJUWB2pk4jpWSuvqeXofM9B4LQQ4Iom1mNeGzvweEriNTiMdmHsxAJ0jaJiK7CbjJ_s7YEWKE2DA_Qzo', 'userName': '\u00c0ine Mongey', 'userImage': 'https://play-lh.googleusercontent.com/a-/AOh14GhUv3c6xHP4kvLSJLaRaydi6o2qxp6yZhaLeL8QmQ', 'content': \"Honestly a great game, it does take a while to get money at first, and they do make it easier to get money by watching ads. I'm glad they don't push it though, and the game is super relaxing and fun!\", 'score': 5, 'thumbsUpCount': 2, 'reviewCreatedVersion': '1.33.0', 'at': datetime.datetime(2021, 4, 23, 8, 20, 34), 'replyContent': None, 'repliedAt': None}"
I am trying to convert this into a dict and then to a pandas DataFrame. I tried this but it will just turn this into a string representation of a dict, not a dict itself:
with open('sampledict.json') as f:
dictdump = json.loads(f.read())
print(type(dictdump))
I feel like I am so close now but I can't find out what I miss to get a dict out of this. Any input will be greatly appreciated!
If I get your data format correctly, this will work:
with open('sampledict.json') as f:
d = json.load(f)
d = eval(d)
# Or this works as well
d = json.loads(f.read())
d = eval(d)
>>> d.keys()
['userName', 'userImage', 'repliedAt', 'score', 'reviewCreatedVersion', 'at', 'replyContent', 'content', 'reviewId', 'thumbsUpCount']
Are you sure that you have your source JSON correct? The JSON snippet you have provided is a string; it has a " at the start and end. So in its current form getting a string is correct behaviour.
Note also that it is a string representation of a Python dict rather than a JSON object. This is evidenced by the fact that the strings are denoted by single quotes rather than double, and the use of the Python keyword None rather than the JSON null.
If the JSON file were a representative of a plain object then the content would be something of the form:
{
"reviewId": "gp"AO...",
"userName": "...",
"replyContent": null,
"repliedAt": null
}
I.e. the first and last characters are curly braces, not double quotes.

Converting nested JSON into Python dictionary

I'm receiving a string server side which I then convert to JSON:
127.0.0.1:8000/devices/f751/?json={ "DeviceId":"192-2993-2993", "Date":"1/4/2019 9:52:2", "Location":"-1.000000000,-1.000000000", "Key":"{XXXX-XXXX-XXXX}", "Data":" { \"Value0\":\"{ \"ReferenceValue\":\"Elevation\", \"Prediction\":\"22.216558464\"}\", \"Value1\":\"{ \"ReferenceValue\":\"Wind Speed\", \"Prediction\":\"42.216558464\"}\" } "}
After conversion using json.loads() I get the following output:
updatedRequest = json.loads(jsonRequest)
updatedRequest
{'DeviceId': '192-2993-2993',
'Date': '1/4/2019 9:52:2',
'Location': '-1.000000000,-1.000000000',
'Key': '{XXXX-XXXX-XXXX}',
'Data': '{ "Value0":"{ "ReferenceValue":"Elevation", "Prediction":"22.216558464"}", "Value1":"{ "ReferenceValue":"Wind Speed", "Prediction":"42.216558464"}" }'}
So far so good, I can access the Data value via updatedRequest['Data'].
updatedRequest['Data']
'{ "Value0":"{ "ReferenceValue":"Elevation", "Prediction":"22.216558464"}", "Value1":"{ "ReferenceValue":"Wind Speed", "Prediction":"42.216558464"}" }'
My issue when attempting to convert this into a Python usable dictionary (e.g updatedRequest['Data']['Value0']['ReferenceValue']). Because there is an unknown number of 'Value' keys, I'm uncertain as to what the best procedure would be to move this into workable data.
You have received a JSON document with a nested JSON document, itself containing further JSON documents, inside one another like a Matryoshka doll.
Unfortunately, you can only decode one level, because the next level is broken. There should be \ escapes in front of the " quote characters used for the 3rd level of JSON documents, just like the second level quotes were escaped when it was embedded in the top-level JSON document. Those are missing so no JSON parser can decode it anymore. The delimiters around JSON strings have been derailed by stray, unescaped " characters that were meant to be part of a JSON string value.
You either need to repair the client sending this data, and discard these malformed values as an invalid request.
For completeness sake, a valid document would look like this:
>>> v0 = '''{ "ReferenceValue":"Elevation", "Prediction":"22.216558464"}'''
>>> v1 = '''{ "ReferenceValue":"Wind Speed", "Prediction":"42.216558464"}" }'''
>>> data_value = json.dumps({'Value0': v0, 'Value1': v1})
>>> print(json.dumps({'Data': data_value, 'Date': '1/4/2019 9:52:2', 'DeviceId': '192-2993-2993', 'Key': '{XXXX-XXXX-XXXX}', 'Location': '-1.000000000,-1.000000000'}, indent=4))
{
"Data": "{\"Value0\": \"{ \\\"ReferenceValue\\\":\\\"Elevation\\\", \\\"Prediction\\\":\\\"22.216558464\\\"}\", \"Value1\": \"{ \\\"ReferenceValue\\\":\\\"Wind Speed\\\", \\\"Prediction\\\":\\\"42.216558464\\\"}\\\" }\"}",
"Date": "1/4/2019 9:52:2",
"DeviceId": "192-2993-2993",
"Key": "{XXXX-XXXX-XXXX}",
"Location": "-1.000000000,-1.000000000"
}
Note the \" and \\\" escapes in the Data value. On decoding, the string value for Data will have one level of escape sequences removed, forming " and \" sequences, where the " quotes are part of the JSON syntax and \" are part of the string values, which in turn can be decoded to " used in the innermost JSON document.
It really depends what you want to do with the data. You can loop through the 'Data' dictionary with:
for k,v in updatedRequest['Data'].items():
# do some stuff
This will allow you to process without having to deal with the variable number of items in this dictionary. Hard to say what is best without knowing exactly what you wish to do though!

How do I put a dictionary into JSON without the escape slash

I'm not sure what I am doing wrong. I have a dictionary that I want to convert to JSON. My problem is with the escape \
How do I put a dictionary into JSON without the escape \
Here is my code:
def printJSON(dump):
print(json.dumps(dump, indent=4, sort_keys=True))
data = {'number':7, 'second_number':44}
json_data = json.dumps(data)
printJSON(json_data)
The output is:
"{\"second_number\": 44, \"number\": 7}"
I want the output to look like this:
"{"second_number": 44, "number": 7}"
The reason is because you are dumping your JSON data twice. Once outside the function and another inside it. For reference:
>>> import json
>>> data = {'number':7, 'second_number':44}
# JSON dumped once, without `\`
>>> json.dumps(data)
'{"second_number": 44, "number": 7}'
# JSON dumped twice, with `\`
>>> json.dumps(json.dumps(data))
'"{\\"second_number\\": 44, \\"number\\": 7}"'
If you print the data dumped twice, you will see what you are getting currently, i.e:
>>> print json.dumps(json.dumps(data))
"{\"second_number\": 44, \"number\": 7}"
I had a slightly different problem that resulted in the same issue. My code had this:
requests.post('https://example.com/data', data=clinicListBody).text
When it should have had this
requests.post('https://example.com/data', data=clinicListBody).json()
.text was returning a string with strings inside it, which is why I was seeing escaped json in the saved file.

Importing wrongly concatenated JSONs in python

I've a text document that has several thousand jsons strings in the form of: "{...}{...}{...}". This is not a valid json it self but each {...} is.
I currently use the following a regular expression to split them:
fp = open('my_file.txt', 'r')
raw_dataset = (re.sub('}{', '}\n{', fp.read())).split('\n')
Which basically breaks every line where a curly bracket closes and other opens (}{ -> }\n{) so I can split them into different lines.
The problem is that few of them have a tags attribute written as "{tagName1}{tagName2}" which breaks my regular expression.
An example would be:
'{"name":\"Bob Dylan\", "tags":"{Artist}{Singer}"}{"name": "Michael Jackson"}'
Is parsed into
'{"name":"Bob Dylan", "tags":"{Artist}'
'{Singer}"}'
'{"name": "Michael Jackson"}'
instead of
'{"name":"Bob Dylan", "tags":"{Artist}{Singer}"}'
'{"name": "Michael Jackson"}'
What is the proper way of achieve this for further json parsing?
Use the raw_decode method of json.JSONDecoder
>>> import json
>>> d = json.JSONDecoder()
>>> x='{"name":\"Bob Dylan\", "tags":"{Artist}{Singer}"}{"name": "Michael Jackson"}'
>>> d.raw_decode(x)
({'tags': '{Artist}{Singer}', 'name': 'Bob Dylan'}, 47)
>>> x=x[47:]
>>> d.raw_decode(x)
({'name': 'Michael Jackson'}, 27)
raw_decode returns a 2-tuple, the first element being the decoded JSON and the second being the offset in the string of the next byte after the JSON ended.
To loop until the end or until an invalid JSON element is encountered:
>>> while True:
... try:
... j,n = d.raw_decode(x)
... except ValueError:
... break
... print(j)
... x=x[n:]
...
{'name': 'Bob Dylan', 'tags': '{Artist}{Singer}'}
{'name': 'Michael Jackson'}
When the loop breaks, inspection of x will reveal if it has processed the whole string or had encountered a JSON syntax error.
With a very long file of short elements you might read a chunk into a buffer and apply the above loop, concatenating anything that's left over with the next chunk after the loop breaks.
You can use the jq command line utility to transfer your input to json. Let's say you have the following input:
input.txt:
{"name":"Bob Dylan", "tags":"{Artist}{Singer}"}{"name": "Michael Jackson"}
You can use jq -s, which consumes multiple json documents from input and transfers them into a single output array:
jq -s . input.txt
Gives you:
[
{
"name": "Bob Dylan",
"tags": "{Artist}{Singer}"
},
{
"name": "Michael Jackson"
}
]
I've just realized that there are python bindings for libjq. Meaning you
don't need to use the command line, you can use jq directly in python.
https://github.com/mwilliamson/jq.py
However, I've not tried it so far. Let me give it a try :) ...
Update: The above library is nice, but it does not support the slurp mode so far.
you need to make a parser ... I dont think regex can help you for
data = ""
curlies = []
def get_dicts(file_text):
for letter in file_text:
data += letter
if letter == "{":
curlies.append(letter)
elif letter == "}":
curlies.pop() # remove last
if not curlies:
yield json.loads(data)
data = ""
note that this does not actually solve the problem that {name:"bob"} is not valid json ... {"name":"bob"} is
this will also break in the event you have weird unbalanced parenthesis inside of strings ie {"name":"{{}}}"} would break this
really your json is so broken based on your example your best bet is probably to edit it by hand and fix the code that is generating it ... if that is not feasible you may need to write a more complex parser using pylex or some other grammar library (effectively writing your own language parser)

Parsing JSON failed

I am trying to parse this data (from the Viper malware analysis framework API specifically). I am having a hard time figure out the best way to do this. Ideally, I would just do a:
jsonObject.get("SSdeep")
... and I would get the value.
I don't think this is valid JSON unfortunately, and without editing the source of the project, how can I make this proper JSON or easily get these values?
[{
'data': {
'header': ['Key', 'Value'],
'rows': [
['Name', u 'splwow64.exe'],
['Tags', ''],
['Path', '/home/ubuntu/viper-master/projects/../binaries/8/e/e/5/8ee5b228bd78781aa4e6b2e15e965e24d21f791d35b1eccebd160693ba781781'],
['Size', 125952],
['Type', 'PE32+ executable (GUI) x86-64, for MS Windows'],
['Mime', 'application/x-dosexec'],
['MD5', '4b1d2cba1367a7b99d51b1295b3a1d57'],
['SHA1', 'caf8382df0dcb6e9fb51a5e277685b540632bf18'],
['SHA256', '8ee5b228bd78781aa4e6b2e15e965e24d21f791d35b1eccebd160693ba781781'],
['SHA512', '709ca98bfc0379648bd686148853116cabc0b13d89492c8a0fa2596e50f7e4d384e5c359081a90f893d8d250cfa537193cbaa1c53186f29c0b6dedeb50d53d4d'],
['SSdeep', ''],
['CRC32', '7106095E']
]
},
'type': 'table'
}]
Edit 1
Thank you! So I have tried this:
jsonObject = r.content.replace("'", "\"")
jsonObject = jsonObject.replace(" u", "")
and the output I have now is:
"[{"data": {"header": ["Key", "Value"], "rows": [["Name","splwow64.exe"], ["Tags", ""], ["Path", "/home/ubuntu/viper-master/projects/../binaries/8/e/e/5/8ee5b228bd78781aa4e6b2e15e965e24d21f791d35b1eccebd160693ba781781"], ["Size", 125952], ["Type", "PE32+ executable (GUI) x86-64, for MS Windows"], ["Mime", "application/x-dosexec"], ["MD5", "4b1d2cba1367a7b99d51b1295b3a1d57"], ["SHA1", "caf8382df0dcb6e9fb51a5e277685b540632bf18"], ["SHA256", "8ee5b228bd78781aa4e6b2e15e965e24d21f791d35b1eccebd160693ba781781"], ["SHA512", "709ca98bfc0379648bd686148853116cabc0b13d89492c8a0fa2596e50f7e4d384e5c359081a90f893d8d250cfa537193cbaa1c53186f29c0b6dedeb50d53d4d"], ["SSdeep", ""], ["CRC32", "7106095E"]]}, "type": "table"}]"
and now I'm getting this error:
File "/usr/lib/python2.7/json/decoder.py", line 369, in decode
raise ValueError(errmsg("Extra data", s, end, len(s)))
ValueError: Extra data: line 1 column 5 - line 1 column 716 (char 4 - 715)
Note: I'd really rather not do the find and replaces like that.. especially the " u" one, as this could have unintended consequences.
Edit 2:
Figured it out! Thank you everyone!
Here's what I ended up doing, as someone mentioned the original text from the server was a "list of dicts":
r = requests.post(url, data=data) #Make the server request
listObject = r.content #Grab the content (don't really need this line)
listObject = listObject[1:-1] #Get rid of the quotes
listObject = ast.literal_eval(listObject) #Create a list out of the literal characters of the string
dictObject = listObject[0] #My dict!
JSON specifies double quotes "s for strings, from the JSON standard
A value can be a string in double quotes, or a number, or true or false or null, or an object or an array.
So you would need to replace all the single quotes with double quotes:
data.replace("'", '"')
There is also a spurious u in the Name field that will need to be removed.
However if the data is valid Python and you trust it you could try evaluating it, this worked with your original data (without the space after the u):
result = eval(data)
Or more safely:
result = ast.literal_eval(data)
Now you appear to have quotes "wrapping" the entire thing. Which is causing all the brackets to be strings. Remove the quotes at the start and end of the JSON.
Also, in JSON, start the structure with either '[' or '{' (usually '{'), not both.
No need to use eval(), just replace the malformed characters (use escape \ character) and parse it with json will be fine:
resp = r.content.replace(" u \'", " \'").replace("\'", "\"")
json.loads(resp)

Categories