Parsing JSON failed - python

I am trying to parse this data (from the Viper malware analysis framework API specifically). I am having a hard time figuring out the best way to do this. Ideally, I would just do a:
jsonObject.get("SSdeep")
... and I would get the value.
I don't think this is valid JSON unfortunately, and without editing the source of the project, how can I make this proper JSON or easily get these values?
[{
'data': {
'header': ['Key', 'Value'],
'rows': [
['Name', u 'splwow64.exe'],
['Tags', ''],
['Path', '/home/ubuntu/viper-master/projects/../binaries/8/e/e/5/8ee5b228bd78781aa4e6b2e15e965e24d21f791d35b1eccebd160693ba781781'],
['Size', 125952],
['Type', 'PE32+ executable (GUI) x86-64, for MS Windows'],
['Mime', 'application/x-dosexec'],
['MD5', '4b1d2cba1367a7b99d51b1295b3a1d57'],
['SHA1', 'caf8382df0dcb6e9fb51a5e277685b540632bf18'],
['SHA256', '8ee5b228bd78781aa4e6b2e15e965e24d21f791d35b1eccebd160693ba781781'],
['SHA512', '709ca98bfc0379648bd686148853116cabc0b13d89492c8a0fa2596e50f7e4d384e5c359081a90f893d8d250cfa537193cbaa1c53186f29c0b6dedeb50d53d4d'],
['SSdeep', ''],
['CRC32', '7106095E']
]
},
'type': 'table'
}]
Edit 1
Thank you! So I have tried this:
jsonObject = r.content.replace("'", "\"")
jsonObject = jsonObject.replace(" u", "")
and the output I have now is:
"[{"data": {"header": ["Key", "Value"], "rows": [["Name","splwow64.exe"], ["Tags", ""], ["Path", "/home/ubuntu/viper-master/projects/../binaries/8/e/e/5/8ee5b228bd78781aa4e6b2e15e965e24d21f791d35b1eccebd160693ba781781"], ["Size", 125952], ["Type", "PE32+ executable (GUI) x86-64, for MS Windows"], ["Mime", "application/x-dosexec"], ["MD5", "4b1d2cba1367a7b99d51b1295b3a1d57"], ["SHA1", "caf8382df0dcb6e9fb51a5e277685b540632bf18"], ["SHA256", "8ee5b228bd78781aa4e6b2e15e965e24d21f791d35b1eccebd160693ba781781"], ["SHA512", "709ca98bfc0379648bd686148853116cabc0b13d89492c8a0fa2596e50f7e4d384e5c359081a90f893d8d250cfa537193cbaa1c53186f29c0b6dedeb50d53d4d"], ["SSdeep", ""], ["CRC32", "7106095E"]]}, "type": "table"}]"
and now I'm getting this error:
File "/usr/lib/python2.7/json/decoder.py", line 369, in decode
raise ValueError(errmsg("Extra data", s, end, len(s)))
ValueError: Extra data: line 1 column 5 - line 1 column 716 (char 4 - 715)
Note: I'd really rather not do the find and replaces like that.. especially the " u" one, as this could have unintended consequences.
Edit 2:
Figured it out! Thank you everyone!
Here's what I ended up doing, as someone mentioned the original text from the server was a "list of dicts":
import ast
import requests

r = requests.post(url, data=data) # Make the server request
listObject = r.content # Grab the content (don't really need this line)
listObject = listObject[1:-1] # Get rid of the wrapping quotes
listObject = ast.literal_eval(listObject) # Build a list from the Python literal in the string
dictObject = listObject[0] # My dict!

JSON requires double quotes for strings. From the JSON standard:
A value can be a string in double quotes, or a number, or true or false or null, or an object or an array.
So you would need to replace all the single quotes with double quotes:
data = data.replace("'", '"')
There is also a spurious u in the Name field that will need to be removed.
However, if the data is valid Python and you trust it, you could try evaluating it. This worked with your original data (without the space after the u):
result = eval(data)
Or more safely:
result = ast.literal_eval(data)
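To get from the parsed result to the lookup-by-key access the question asks for, the rows can be turned into a dict. This is only a sketch: raw below is a shortened, hypothetical copy of the response text, written without the space after the u so that it is a valid Python literal, and values is just an illustrative name.

```python
import ast

# Shortened stand-in for the server response (assumption: no space after the u prefix).
raw = ("[{'data': {'header': ['Key', 'Value'], "
       "'rows': [['Name', u'splwow64.exe'], ['Size', 125952], "
       "['SSdeep', ''], ['CRC32', '7106095E']]}, 'type': 'table'}]")

# literal_eval only accepts Python literals, so it cannot run arbitrary code.
parsed = ast.literal_eval(raw)

# Each row is a [key, value] pair, so dict() turns the table into a lookup.
values = dict(parsed[0]['data']['rows'])

print(values.get('SSdeep'))
print(values['CRC32'])
```

After this, values.get("SSdeep") behaves like the jsonObject.get("SSdeep") call from the question.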

Now you appear to have quotes "wrapping" the entire thing, which turns the whole structure, brackets included, into a single string. Remove the quotes at the start and end of the JSON.
Also note that a JSON document can start with either '[' or '{'. Starting with '[{' is valid too: it is simply an array whose first element is an object.

There is no need to use eval(). Just replace the malformed characters (escaping the quotes with \) and parse the result with json:
resp = r.content.replace(" u \'", " \'").replace("\'", "\"")
json.loads(resp)

Related

Parsing data containing escaped quotes and separators in python

I have data that is structured like this:
1661171420, foo="bar", test="This, is a \"TEST\"", count=5, com="foo, bar=blah"
It always starts with a unix timestamp, but then I can't know how many other fields follow and how they are called.
The goal is to parse this into a dictionary as such:
{"timestamp": 1661171420,
"foo": "bar",
"test": 'This, is a "TEST"',
"count": 5,
"com": "foo, bar=blah"}
I'm having trouble parsing this, especially regarding the escaped quotes and commas in the values.
What would be the best way to parse this correctly? Preferably without any 3rd party modules.
If changing the format of input data is not an option (JSON would be much easier to handle, but if it is an API as you say then you might be stuck with this) the following would work assuming the file follows given structure more or less. Not the cleanest solution, I agree, but it does the job.
import re

# Temporarily swap escaped quotes (\") for ''' so they do not end a quoted value early.
d = r'''1661171420, foo="bar", test="This, is a \"TEST\"", count=5, com="foo, bar=blah", fraction=-0.11'''.replace(r"\"", "'''")

parsed_data = {}
# The line always starts with the unix timestamp.
parsed_data['timestamp'] = int(d.partition(", ")[0])

# Quoted string values: key="value"
string_pattern = re.compile(r'''(\w+)="([^"]*)"''')
for match in re.finditer(string_pattern, d):
    # Restore the escaped quotes as plain double quotes.
    parsed_data[match.group(1)] = match.group(2).replace("'''", "\"")

# Bare numeric values: key=5 or key=-0.11
number_pattern = re.compile(r'''(\w+)=([+-]?\d+(?:\.\d+)?)''')
for match in re.finditer(number_pattern, d):
    try:
        parsed_data[match.group(1)] = int(match.group(2))
    except ValueError:
        parsed_data[match.group(1)] = float(match.group(2))

print(parsed_data)
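A possible alternative, sketched here under the assumption that every field after the timestamp is either key="quoted value" or key=bare_value: a single regular expression that understands backslash escapes inside the quotes, so no placeholder substitution is needed. The names FIELD, convert and parse_line are made up for this example.

```python
import re

# key="quoted value with \" escapes"  or  key=bare_value
FIELD = re.compile(r'(\w+)=(?:"((?:[^"\\]|\\.)*)"|([^,\s]+))')

def convert(token):
    """Turn a bare token into int or float when possible, else keep the string."""
    for cast in (int, float):
        try:
            return cast(token)
        except ValueError:
            continue
    return token

def parse_line(line):
    timestamp, _, rest = line.partition(",")
    result = {"timestamp": int(timestamp)}
    for match in FIELD.finditer(rest):
        key, quoted, bare = match.groups()
        if quoted is not None:
            # Drop the backslash from each escape sequence (\" becomes ").
            result[key] = re.sub(r'\\(.)', r'\1', quoted)
        else:
            result[key] = convert(bare)
    return result

sample = r'1661171420, foo="bar", test="This, is a \"TEST\"", count=5, com="foo, bar=blah"'
print(parse_line(sample))
```

Because the regex consumes each quoted value as a whole, text such as bar=blah inside a quoted value is not mistaken for a separate field.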

How to fix missing double quotes issue when parsing JSON data?

I am running a piece of code in Python 3 where I am consuming JSON data from a source that I don't control. While reading the JSON data I am getting the following error:
simplejson.errors.JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2
Here is the code
import logging
import simplejson as json
from kafka import KafkaConsumer

logging.basicConfig(level=logging.INFO)
consumer = KafkaConsumer(
    bootstrap_servers='localhost:9092',
    api_version=(1,0,0))
consumer.subscribe(['Test_Topic-1'])
for message in consumer:
    msg_str=message.value
    y = json.loads(msg_str)
    print(y["city_name"])
As I cannot change the source, I need to fix it at my end. I found this post helpful, as my data contains timestamps with : in them: How to Fix JSON Key Values without double-quotes?
But it also fails for some values in my JSON data, because those values contain : as well, e.g.
address:"1600:3050:rf02:hf64:h000:0000:345e:d321"
Is there any way where I can add double quotes to keys in my json data?
You can try the dirtyjson module; it can fix some mistakes.
import dirtyjson
d = dirtyjson.loads('{address:"1600:3050:rf02:hf64:h000:0000:345e:d321"}')
print( d['address'] )
d = dirtyjson.loads('{abc:"1:2:3:4", efg:"5:6:7:8", "hij":"foo"}')
print( d['abc'] )
It creates an AttributedDict, so you may need dict() to get a normal dictionary:
d = dirtyjson.loads('{abc:"1:2:3:4", efg:"5:6:7:8", "hij":"foo"}')
print( d )
print( dict(d) )
Result:
AttributedDict([('abc', '1:2:3:4'), ('efg', '5:6:7:8'), ('hij', 'foo')])
{'abc': '1:2:3:4', 'efg': '5:6:7:8', 'hij': 'foo'}
I think your problem is that you have strings like this:
{abc:"1:2:3:4", efg:"5:6:7:8", "hij":"foo"}
which are not valid JSON. You could try to repair it with a regular expression substitution:
import re
jtxt_bad ='{abc:"1:2:3:4", efg:"5:6:7:8", "hij":"foo", klm:"bar"\n}'
jtxt = re.sub(r'\b([a-zA-Z]+):("[^"]+"[,\n}])', r'"\1":\2', jtxt_bad)
print(f'Original: {jtxt_bad}\nRepaired: {jtxt}')
The output of this is:
Original: {abc:"1:2:3:4", efg:"5:6:7:8", "hij":"foo", klm:"bar"
}
Repaired: {"abc":"1:2:3:4", "efg":"5:6:7:8", "hij":"foo", "klm":"bar"
}
The regular expression \b([a-zA-Z]+):("[^"]+"[,\n}]) means: a word boundary, followed by one or more letters, followed by a :, followed by a double-quoted string, followed by one of ,, \n or }. However, this will fail if there is a quote inside the string, such as "1:\"2:3".
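One way around the escaped-quote limitation, as a rough sketch: let the regular expression match whole string literals (including backslash escapes) as one alternative and bare keys as the other, and only rewrite the bare keys. TOKEN and quote_keys are names invented for this example, and it assumes keys look like identifiers and that the only defect in the input is the missing quotes around keys.

```python
import json
import re

# Alternative 1: a complete double-quoted string (left untouched).
# Alternative 2: a bare identifier directly followed by a colon (a key to quote).
TOKEN = re.compile(r'"(?:[^"\\]|\\.)*"|([A-Za-z_]\w*)(?=\s*:)')

def quote_keys(text):
    def fix(match):
        key = match.group(1)
        # group(1) is None when the match was a string literal.
        return match.group(0) if key is None else f'"{key}"'
    return TOKEN.sub(fix, text)

bad = r'{abc:"1:\"2:3", address:"1600:3050:rf02:hf64:h000:0000:345e:d321", "hij":"foo"}'
repaired = quote_keys(bad)
print(repaired)
print(json.loads(repaired))
```

Since string literals are consumed as a unit, colons and escaped quotes inside values never get a chance to match as keys.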

Converting nested JSON into Python dictionary

I'm receiving a string server side which I then convert to JSON:
127.0.0.1:8000/devices/f751/?json={ "DeviceId":"192-2993-2993", "Date":"1/4/2019 9:52:2", "Location":"-1.000000000,-1.000000000", "Key":"{XXXX-XXXX-XXXX}", "Data":" { \"Value0\":\"{ \"ReferenceValue\":\"Elevation\", \"Prediction\":\"22.216558464\"}\", \"Value1\":\"{ \"ReferenceValue\":\"Wind Speed\", \"Prediction\":\"42.216558464\"}\" } "}
After conversion using json.loads() I get the following output:
updatedRequest = json.loads(jsonRequest)
updatedRequest
{'DeviceId': '192-2993-2993',
'Date': '1/4/2019 9:52:2',
'Location': '-1.000000000,-1.000000000',
'Key': '{XXXX-XXXX-XXXX}',
'Data': '{ "Value0":"{ "ReferenceValue":"Elevation", "Prediction":"22.216558464"}", "Value1":"{ "ReferenceValue":"Wind Speed", "Prediction":"42.216558464"}" }'}
So far so good, I can access the Data value via updatedRequest['Data'].
updatedRequest['Data']
'{ "Value0":"{ "ReferenceValue":"Elevation", "Prediction":"22.216558464"}", "Value1":"{ "ReferenceValue":"Wind Speed", "Prediction":"42.216558464"}" }'
My issue arises when attempting to convert this into a usable Python dictionary (e.g. updatedRequest['Data']['Value0']['ReferenceValue']). Because there is an unknown number of 'Value' keys, I'm uncertain what the best procedure would be to turn this into workable data.
You have received a JSON document with a nested JSON document, itself containing further JSON documents, inside one another like a Matryoshka doll.
Unfortunately, you can only decode one level, because the next level is broken. There should be \ escapes in front of the " quote characters used for the 3rd level of JSON documents, just like the second level quotes were escaped when it was embedded in the top-level JSON document. Those are missing so no JSON parser can decode it anymore. The delimiters around JSON strings have been derailed by stray, unescaped " characters that were meant to be part of a JSON string value.
You either need to repair the client sending this data, or discard these malformed values as an invalid request.
For completeness' sake, a valid document would look like this:
>>> import json
>>> v0 = '''{ "ReferenceValue":"Elevation", "Prediction":"22.216558464"}'''
>>> v1 = '''{ "ReferenceValue":"Wind Speed", "Prediction":"42.216558464"}'''
>>> data_value = json.dumps({'Value0': v0, 'Value1': v1})
>>> print(json.dumps({'Data': data_value, 'Date': '1/4/2019 9:52:2', 'DeviceId': '192-2993-2993', 'Key': '{XXXX-XXXX-XXXX}', 'Location': '-1.000000000,-1.000000000'}, indent=4))
{
    "Data": "{\"Value0\": \"{ \\\"ReferenceValue\\\":\\\"Elevation\\\", \\\"Prediction\\\":\\\"22.216558464\\\"}\", \"Value1\": \"{ \\\"ReferenceValue\\\":\\\"Wind Speed\\\", \\\"Prediction\\\":\\\"42.216558464\\\"}\"}",
    "Date": "1/4/2019 9:52:2",
    "DeviceId": "192-2993-2993",
    "Key": "{XXXX-XXXX-XXXX}",
    "Location": "-1.000000000,-1.000000000"
}
Note the \" and \\\" escapes in the Data value. On decoding, the string value for Data will have one level of escape sequences removed, forming " and \" sequences, where the " quotes are part of the JSON syntax and \" are part of the string values, which in turn can be decoded to " used in the innermost JSON document.
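For a correctly escaped document, decoding each level is just a matter of calling json.loads() once per layer. The following is a small sketch with made-up, shortened sample data (the request string is built here with json.dumps so that the escaping is guaranteed to be valid); it does not repair the malformed input from the question.

```python
import json

# Build a correctly nested request: each inner document is itself a JSON string.
inner0 = json.dumps({"ReferenceValue": "Elevation", "Prediction": "22.216558464"})
inner1 = json.dumps({"ReferenceValue": "Wind Speed", "Prediction": "42.216558464"})
data_value = json.dumps({"Value0": inner0, "Value1": inner1})
request = json.dumps({"DeviceId": "192-2993-2993", "Data": data_value})

# Level 1: the outer document.
decoded = json.loads(request)
# Level 2: the Data string; level 3: every ValueN string inside it.
decoded["Data"] = {
    key: json.loads(value)
    for key, value in json.loads(decoded["Data"]).items()
}

print(decoded["Data"]["Value0"]["ReferenceValue"])
```

The dict comprehension does not care how many ValueN keys there are, which covers the unknown number of values mentioned in the question.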
It really depends on what you want to do with the data. Once the 'Data' value has been decoded into a dictionary (it arrives as a string), you can loop through it with:
for k,v in updatedRequest['Data'].items():
    pass  # do some stuff
This will allow you to process without having to deal with the variable number of items in this dictionary. Hard to say what is best without knowing exactly what you wish to do though!

Importing wrongly concatenated JSONs in python

I have a text document that contains several thousand JSON strings in the form "{...}{...}{...}". The document as a whole is not valid JSON, but each {...} is.
I currently use the following regular expression to split them:
fp = open('my_file.txt', 'r')
raw_dataset = (re.sub('}{', '}\n{', fp.read())).split('\n')
This basically breaks the line wherever one curly bracket closes and another opens (}{ -> }\n{) so I can split the objects into different lines.
The problem is that a few of them have a tags attribute written as "{tagName1}{tagName2}", which breaks my regular expression.
An example would be:
'{"name":\"Bob Dylan\", "tags":"{Artist}{Singer}"}{"name": "Michael Jackson"}'
Is parsed into
'{"name":"Bob Dylan", "tags":"{Artist}'
'{Singer}"}'
'{"name": "Michael Jackson"}'
instead of
'{"name":"Bob Dylan", "tags":"{Artist}{Singer}"}'
'{"name": "Michael Jackson"}'
What is the proper way to achieve this for further JSON parsing?
Use the raw_decode method of json.JSONDecoder
>>> import json
>>> d = json.JSONDecoder()
>>> x='{"name":\"Bob Dylan\", "tags":"{Artist}{Singer}"}{"name": "Michael Jackson"}'
>>> d.raw_decode(x)
({'tags': '{Artist}{Singer}', 'name': 'Bob Dylan'}, 47)
>>> x=x[47:]
>>> d.raw_decode(x)
({'name': 'Michael Jackson'}, 27)
raw_decode returns a 2-tuple, the first element being the decoded JSON and the second being the offset in the string of the next character after the JSON ended.
To loop until the end or until an invalid JSON element is encountered:
>>> while True:
...     try:
...         j,n = d.raw_decode(x)
...     except ValueError:
...         break
...     print(j)
...     x=x[n:]
...
{'name': 'Bob Dylan', 'tags': '{Artist}{Singer}'}
{'name': 'Michael Jackson'}
When the loop breaks, inspection of x will reveal if it has processed the whole string or had encountered a JSON syntax error.
With a very long file of short elements you might read a chunk into a buffer and apply the above loop, concatenating anything that's left over with the next chunk after the loop breaks.
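The same idea can be wrapped in a generator that passes a start index to raw_decode instead of re-slicing the string each time. This is a sketch for text that already fits in memory; iter_concatenated is a name chosen for this example, and the whitespace skipping is there because raw_decode expects the document to begin exactly at the given index.

```python
import json

def iter_concatenated(text):
    """Yield each JSON document from a string of back-to-back documents."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # Skip any whitespace between documents.
        if text[pos].isspace():
            pos += 1
            continue
        # raw_decode returns the object and the index just past it.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj

text = '{"name":"Bob Dylan", "tags":"{Artist}{Singer}"}{"name": "Michael Jackson"}'
for item in iter_concatenated(text):
    print(item)
```

A syntax error in the input surfaces as a ValueError from raw_decode, so the caller can decide whether to stop or to report the offending position.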
You can use the jq command line utility to transform your input into valid JSON. Let's say you have the following input:
input.txt:
{"name":"Bob Dylan", "tags":"{Artist}{Singer}"}{"name": "Michael Jackson"}
You can use jq -s, which consumes multiple JSON documents from the input and combines them into a single output array:
jq -s . input.txt
Gives you:
[
{
"name": "Bob Dylan",
"tags": "{Artist}{Singer}"
},
{
"name": "Michael Jackson"
}
]
I've just realized that there are Python bindings for libjq, meaning you don't need to use the command line; you can use jq directly in Python.
https://github.com/mwilliamson/jq.py
However, I've not tried it so far. Let me give it a try :) ...
Update: The above library is nice, but it does not support the slurp mode so far.
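You need to make a parser ... I don't think a regex can help you for this.
import json

def get_dicts(file_text):
    data = ""      # text of the object currently being collected
    curlies = []   # stack of currently open curly brackets
    for letter in file_text:
        data += letter
        if letter == "{":
            curlies.append(letter)
        elif letter == "}":
            curlies.pop()  # remove last
            if not curlies:
                # The outermost bracket just closed: one complete object.
                yield json.loads(data)
                data = ""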
Note that this does not actually solve the problem that {name:"bob"} is not valid JSON ... {"name":"bob"} is.
This will also break if you have unbalanced curly brackets inside of strings, i.e. {"name":"{{}}}"} would break it.
Really, based on your example, your JSON is so broken that your best bet is probably to edit it by hand and fix the code that is generating it ... if that is not feasible you may need to write a more complex parser using pylex or some other grammar library (effectively writing your own language parser).

decoding JSON data with backslash encoding

I have the following JSON data.
"[
\"msgType\": \"0\",
\"tid\": \"1\",
\"data\": \"[
{
\\\"EventName\\\": \\\"TExceeded\\\",
\\\"Severity\\\": \\\"warn\\\",
\\\"Subject\\\": \\\"Exceeded\\\",
\\\"Message\\\": \\\"tdetails: {
\\\\\\\"Message\\\\\\\": \\\\\\\"my page tooktoolong(2498ms: AT: 5ms,
BT: 1263ms,
CT: 1230ms),
andexceededthresholdof5ms\\\\\\\",
\\\\\\\"Referrer\\\\\\\": \\\\\\\"undefined\\\\\\\",
\\\\\\\"Session\\\\\\\": \\\\\\\"None\\\\\\\",
\\\\\\\"ResponseTime\\\\\\\": 0,
\\\\\\\"StatusCode\\\\\\\": 0,
\\\\\\\"Links\\\\\\\": 215,
\\\\\\\"Images\\\\\\\": 57,
\\\\\\\"Forms\\\\\\\": 2,
\\\\\\\"Platform\\\\\\\": \\\\\\\"Linuxx86_64\\\\\\\",
\\\\\\\"BrowserAppname\\\\\\\": \\\\\\\"Netscape\\\\\\\",
\\\\\\\"AppCodename\\\\\\\": \\\\\\\"Mozilla\\\\\\\",
\\\\\\\"CPUs\\\\\\\": 8,
\\\\\\\"Language\\\\\\\": \\\\\\\"en-GB\\\\\\\",
\\\\\\\"isEvent\\\\\\\": \\\\\\\"true\\\\\\\",
\\\\\\\"PageLatency\\\\\\\": 2498,
\\\\\\\"Threshold\\\\\\\": 5,
\\\\\\\"AT\\\\\\\": 5,
\\\\\\\"BT\\\\\\\": 1263,
\\\\\\\"CT\\\\\\\": 1230
}\\\",
\\\"EventTimestamp\\\": \\\"1432514783269\\\"
}
]\",
\"Timestamp\": \"1432514783269\",
\"AppName\": \"undefined\",
\"Group\": \"UndefinedGroup\"
]"
I want to flatten this JSON into a single level, i.e. I want to remove the nested structure inside and copy that data up to the top-level JSON structure. How can I do this?
If this structure is named json_data,
I want to be able to access
json_data['Platform']
json_data['BrowserAppname']
json_data['Severity']
json_data['msgType']
Basically some kind of rudimentary normalization. What is the easiest way to do this using Python?
A generally unsafe but probably okay in this case solution would be:
import json
d = json.loads(json_string.replace('\\', ''))
I'm not sure what happened but this doesn't look like valid JSON.
1. You have some double-quotes escaped once, some twice, some three times etc.
2. You have key/value pairs inside of a list-like object []
3. tdetails is missing a trailing quote
4. Even if you fix the above you still have your data list quoted as a multi-line string which is invalid.
It appears to be that this "JSON" was constructed by hand, by someone with no knowledge of JSON.
You can try "massaging" the data into JSON with the following:
import re
x = re.sub(r'\\+', '', js_str)      # drop every run of backslashes
x = re.sub(r'\n', '', x)            # join the lines
x = '{' + x.strip()[1:-1] + '}'     # swap the outer [ ] for { }
This would make the string almost JSON-like, but you still need to fix point #3.
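Once the source produces properly escaped JSON, the flattening asked for in the question could look roughly like the sketch below. It assumes well-formed nesting (each nested level is a valid JSON string), which the data in the question is not; maybe_decode and flatten are hypothetical helper names, the sample document is made up, and later keys silently overwrite earlier ones with the same name.

```python
import json

def maybe_decode(value):
    """Return the decoded container if value is a JSON-encoded object or array."""
    if isinstance(value, str):
        try:
            decoded = json.loads(value)
        except ValueError:
            return value  # an ordinary string
        if isinstance(decoded, (dict, list)):
            return decoded
    return value

def flatten(node, out=None):
    """Copy every scalar key/value pair, at any nesting depth, into one dict."""
    out = {} if out is None else out
    node = maybe_decode(node)
    if isinstance(node, dict):
        for key, value in node.items():
            value = maybe_decode(value)
            if isinstance(value, (dict, list)):
                flatten(value, out)
            else:
                out[key] = value
    elif isinstance(node, list):
        for item in node:
            flatten(item, out)
    return out

# Made-up, well-formed stand-in for the nested document.
message = json.dumps({"Platform": "Linuxx86_64", "CPUs": 8})
event = {"Severity": "warn", "Message": message}
doc = json.dumps({"msgType": "0", "data": json.dumps([event])})

json_data = flatten(doc)
print(json_data["Platform"], json_data["Severity"], json_data["msgType"])
```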
