Converting data inside a JSON file into Python dict - python

I have a JSON file and the inside of it looks like this:
"{'reviewId': 'gp:AOqpTOGiJUWB2pk4jpWSuvqeXofM9B4LQQ4Iom1mNeGzvweEriNTiMdmHsxAJ0jaJiK7CbjJ_s7YEWKE2DA_Qzo', 'userName': '\u00c0ine Mongey', 'userImage': 'https://play-lh.googleusercontent.com/a-/AOh14GhUv3c6xHP4kvLSJLaRaydi6o2qxp6yZhaLeL8QmQ', 'content': \"Honestly a great game, it does take a while to get money at first, and they do make it easier to get money by watching ads. I'm glad they don't push it though, and the game is super relaxing and fun!\", 'score': 5, 'thumbsUpCount': 2, 'reviewCreatedVersion': '1.33.0', 'at': datetime.datetime(2021, 4, 23, 8, 20, 34), 'replyContent': None, 'repliedAt': None}"
I am trying to convert this into a dict and then to a pandas DataFrame. I tried this but it will just turn this into a string representation of a dict, not a dict itself:
import json
with open('sampledict.json') as f:
    dictdump = json.loads(f.read())
print(type(dictdump))
I feel like I am so close now, but I can't work out what I am missing to get a dict out of this. Any input will be greatly appreciated!

If I get your data format correctly, this will work:
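import datetime  # eval needs this name to rebuild the 'at' value
import json
with open('sampledict.json') as f:
    d = json.load(f)  # a str, because the file holds a JSON string
    # Or this works as well:
    # d = json.loads(f.read())
d = eval(d)  # only for trusted input: eval executes arbitrary code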
>>> d.keys()
['userName', 'userImage', 'repliedAt', 'score', 'reviewCreatedVersion', 'at', 'replyContent', 'content', 'reviewId', 'thumbsUpCount']

Are you sure that you have your source JSON correct? The JSON snippet you have provided is a string; it has a " at the start and end. So in its current form getting a string is correct behaviour.
Note also that it is a string representation of a Python dict rather than a JSON object. This is evidenced by the fact that the strings are denoted by single quotes rather than double, and the use of the Python keyword None rather than the JSON null.
If the JSON file represented a plain object, then the content would be something of the form:
{
    "reviewId": "gp:AO...",
    "userName": "...",
    "replyContent": null,
    "repliedAt": null
}
I.e. the first and last characters are curly braces, not double quotes.
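A minimal sketch of the two-step decode implied by the answers above, assuming the file really contains one JSON string wrapping the repr of a Python dict. The sample text is a shortened stand-in for the file content, not the real data. Because the repr contains a datetime.datetime(...) call, ast.literal_eval would reject it, so this sketch falls back to eval with a restricted namespace. That is still only reasonable for input you trust.

```python
import datetime
import json

# Shortened stand-in for the file content: a JSON string whose value is the
# repr() of a Python dict (hypothetical, abbreviated fields).
file_text = json.dumps(
    "{'userName': 'Example User', 'score': 5, "
    "'at': datetime.datetime(2021, 4, 23, 8, 20, 34), 'replyContent': None}"
)

# Step 1: the JSON layer. The top-level JSON value is a string, so this is a str.
inner = json.loads(file_text)

# Step 2: the Python-literal layer. Only the datetime module is exposed to the
# evaluated expression; builtins are removed. Use on trusted input only.
record = eval(inner, {"__builtins__": {}, "datetime": datetime})

print(type(inner).__name__, type(record).__name__)
print(record["at"].isoformat())
```

If pandas is installed, `pd.DataFrame([record])` should then give a one-row DataFrame. The more robust fix is to have the producer write real JSON in the first place.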

Related

Unable to iterate through JSON data

I'm trying to loop through JSON data to find values for specific keys. My data is coming from an HTTP request and looks like:
{'1': {'manufacturername': 'SVLJ',
'modelid': 'TCL014',
'name': 'Fling'},
'10': {'manufacturername': 'SONY',
'modelid': 'BLL4554',
'name': 'ACQ'}}
My current goal is to loop through each item number (1, 10, etc.) and get the value of name ('Fling', 'ACQ', etc.). My latest attempt is:
import requests
RESOURCE_URL = 'xxx/xxx/'
def get_json(url):
    raw_response = requests.get(url)
    data = raw_response.json()
    return data
def get_SMR():
    url = "{}SMR/".format(RESOURCE_URL)
    return get_json(url)
smr_json = get_SMR()
for SMR in smr_json:
    print(SMR['name'])
When I try running this, I get the error:
TypeError: string indices must be integers
I've also tried importing the json library, and using json.loads(raw_response.text); however, it's still being recognized as a string, rather than an iterable json object (that can be referenced by key). Any and all insight would be greatly appreciated.
When you are doing for SMR in smr_json:, you are iterating over the keys of the dictionary. In other words, SMR is a string, which does not allow indexing by a string:
In [1]: SMR = 'test'
In [2]: SMR['string']
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
...
TypeError: string indices must be integers
You probably meant to iterate over both the keys and values:
for key, SMR in smr_json.items():
    print(SMR['name'])
Or, just values:
for SMR in smr_json.values():
    print(SMR['name'])
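As a self-contained illustration of the difference, here is a small sketch using the sample data from the question as a literal dict (standing in for the result of raw_response.json()). The variable names keys_only and names_by_id are only illustrative.

```python
# Sample data copied from the question, standing in for raw_response.json().
smr_json = {
    '1': {'manufacturername': 'SVLJ', 'modelid': 'TCL014', 'name': 'Fling'},
    '10': {'manufacturername': 'SONY', 'modelid': 'BLL4554', 'name': 'ACQ'},
}

# Iterating over a dict directly yields its keys, which are strings here.
keys_only = [item_id for item_id in smr_json]

# Iterating over .items() yields (key, value) pairs, so each value can be
# indexed by 'name'.
names_by_id = {item_id: item['name'] for item_id, item in smr_json.items()}

print(keys_only)
print(names_by_id)
```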
You are probably getting a string because that is not valid JSON. JSON requires " for strings, not '.
See json.org:
A string is a sequence of zero or more Unicode characters, wrapped in double quotes, using backslash escapes.
I think that the problem is in JSON file. Single quotes are not allowed.
I'd first replace the single quotes ' with the double quotes " , to have something like this:
{
    "1": {
        "manufacturername": "SVLJ",
        "modelid": "TCL014",
        "name": "Fling"
    },
    "10": {
        "manufacturername": "SONY",
        "modelid": "BLL4554",
        "name": "ACQ"
    }
}

How do I put a dictionary into JSON without the escape slash

I'm not sure what I am doing wrong. I have a dictionary that I want to convert to JSON, but the output contains escape backslashes (\).
How do I put a dictionary into JSON without the escape \?
Here is my code:
import json
def printJSON(dump):
    print(json.dumps(dump, indent=4, sort_keys=True))
data = {'number':7, 'second_number':44}
json_data = json.dumps(data)
printJSON(json_data)
The output is:
"{\"second_number\": 44, \"number\": 7}"
I want the output to look like this:
"{"second_number": 44, "number": 7}"
The reason is that you are dumping your JSON data twice: once outside the function and once more inside it. For reference:
>>> import json
>>> data = {'number':7, 'second_number':44}
# JSON dumped once, without `\`
>>> json.dumps(data)
'{"second_number": 44, "number": 7}'
# JSON dumped twice, with `\`
>>> json.dumps(json.dumps(data))
'"{\\"second_number\\": 44, \\"number\\": 7}"'
If you print the data dumped twice, you will see what you are getting currently, i.e:
>>> print(json.dumps(json.dumps(data)))
"{\"second_number\": 44, \"number\": 7}"
I had a slightly different problem that resulted in the same issue. My code had this:
requests.post('https://example.com/data', data=clinicListBody).text
when it should have had this:
requests.post('https://example.com/data', data=clinicListBody).json()
.text was returning a string with strings inside it, which is why I was seeing escaped json in the saved file.
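For the code in the question, one way to apply the "dump only once" advice is to pass the dictionary itself to the printing helper. The sketch below renames the helper to print_json and makes it return the text so the result can be inspected. Both changes are illustrative choices, not part of the original code.

```python
import json

def print_json(obj):
    """Serialize obj exactly once, print it, and return the text."""
    text = json.dumps(obj, indent=4, sort_keys=True)
    print(text)
    return text

data = {'number': 7, 'second_number': 44}

# Pass the dict, not an already-dumped string, so it is serialized only once.
once = print_json(data)

# For comparison: dumping twice encodes the JSON text as a JSON string,
# which is where the backslashes come from.
twice = json.dumps(json.dumps(data))
```

Decoding the double-dumped text takes two json.loads calls: the first returns a str, the second returns the dict.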

Importing wrongly concatenated JSONs in python

I have a text document that contains several thousand JSON strings in the form "{...}{...}{...}". The document as a whole is not valid JSON, but each {...} is.
I currently use the following regular expression to split them:
import re
with open('my_file.txt', 'r') as fp:
    raw_dataset = re.sub('}{', '}\n{', fp.read()).split('\n')
This breaks the text wherever one curly bracket closes and another opens (}{ -> }\n{) so that I can split the objects onto separate lines.
The problem is that a few of them have a tags attribute written as "{tagName1}{tagName2}", which breaks my regular expression.
An example would be:
'{"name":\"Bob Dylan\", "tags":"{Artist}{Singer}"}{"name": "Michael Jackson"}'
Is parsed into
'{"name":"Bob Dylan", "tags":"{Artist}'
'{Singer}"}'
'{"name": "Michael Jackson"}'
instead of
'{"name":"Bob Dylan", "tags":"{Artist}{Singer}"}'
'{"name": "Michael Jackson"}'
What is the proper way to achieve this for further JSON parsing?
Use the raw_decode method of json.JSONDecoder
>>> import json
>>> d = json.JSONDecoder()
>>> x='{"name":\"Bob Dylan\", "tags":"{Artist}{Singer}"}{"name": "Michael Jackson"}'
>>> d.raw_decode(x)
({'tags': '{Artist}{Singer}', 'name': 'Bob Dylan'}, 47)
>>> x=x[47:]
>>> d.raw_decode(x)
({'name': 'Michael Jackson'}, 27)
raw_decode returns a 2-tuple: the first element is the decoded JSON value and the second is the index in the string at which that JSON document ended.
To loop until the end or until an invalid JSON element is encountered:
>>> while True:
...     try:
...         j, n = d.raw_decode(x)
...     except ValueError:
...         break
...     print(j)
...     x = x[n:]
...
{'name': 'Bob Dylan', 'tags': '{Artist}{Singer}'}
{'name': 'Michael Jackson'}
When the loop breaks, inspection of x will reveal if it has processed the whole string or had encountered a JSON syntax error.
With a very long file of short elements you might read a chunk into a buffer and apply the above loop, concatenating anything that's left over with the next chunk after the loop breaks.
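The loop above can also be wrapped in a generator that tracks an index instead of re-slicing the string. This is a sketch under two assumptions: the whole input fits in memory, and raw_decode accepts a start index as its second argument (it does in CPython's implementation, although the documentation mostly shows the one-argument form). The name iter_json_documents is hypothetical.

```python
import json

def iter_json_documents(text):
    """Yield each JSON document from a string of concatenated documents.

    Raises ValueError (json.JSONDecodeError) at the first malformed document.
    """
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so skip it here.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        # obj is the decoded value; pos becomes the index just past it.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj

sample = '{"name":"Bob Dylan", "tags":"{Artist}{Singer}"}\n{"name": "Michael Jackson"}'
documents = list(iter_json_documents(sample))
print(documents)
```

Because the decoder understands JSON string syntax, the braces inside the tags value do not confuse it, unlike the regular expression split.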
You can use the jq command-line utility to convert your input to valid JSON. Let's say you have the following input:
input.txt:
{"name":"Bob Dylan", "tags":"{Artist}{Singer}"}{"name": "Michael Jackson"}
You can use jq -s (slurp mode), which consumes multiple JSON documents from the input and combines them into a single output array:
jq -s . input.txt
Gives you:
[
{
"name": "Bob Dylan",
"tags": "{Artist}{Singer}"
},
{
"name": "Michael Jackson"
}
]
I've just realized that there are Python bindings for libjq, meaning you don't need to use the command line; you can use jq directly in Python.
https://github.com/mwilliamson/jq.py
However, I've not tried it so far. Let me give it a try :) ...
Update: The above library is nice, but it does not support the slurp mode so far.
You need to write a parser; I don't think a regex can help you here:
import json
def get_dicts(file_text):
    data = ""     # characters of the object collected so far
    curlies = []  # stack of currently open "{"
    for letter in file_text:
        data += letter
        if letter == "{":
            curlies.append(letter)
        elif letter == "}":
            curlies.pop()  # remove last
            if not curlies:
                yield json.loads(data)
                data = ""
Note that this does not solve the problem that {name:"bob"} is not valid JSON, whereas {"name":"bob"} is.
It will also break if you have unbalanced braces inside strings, e.g. {"name":"{{}}}"} would break it.
Really, based on your example, your JSON is so broken that your best bet is probably to edit it by hand and fix the code that generates it. If that is not feasible, you may need to write a more complex parser using pylex or some other grammar library (effectively writing your own language parser).

decoding JSON data with backslash encoding

I have the following JSON data.
"[
\"msgType\": \"0\",
\"tid\": \"1\",
\"data\": \"[
{
\\\"EventName\\\": \\\"TExceeded\\\",
\\\"Severity\\\": \\\"warn\\\",
\\\"Subject\\\": \\\"Exceeded\\\",
\\\"Message\\\": \\\"tdetails: {
\\\\\\\"Message\\\\\\\": \\\\\\\"my page tooktoolong(2498ms: AT: 5ms,
BT: 1263ms,
CT: 1230ms),
andexceededthresholdof5ms\\\\\\\",
\\\\\\\"Referrer\\\\\\\": \\\\\\\"undefined\\\\\\\",
\\\\\\\"Session\\\\\\\": \\\\\\\"None\\\\\\\",
\\\\\\\"ResponseTime\\\\\\\": 0,
\\\\\\\"StatusCode\\\\\\\": 0,
\\\\\\\"Links\\\\\\\": 215,
\\\\\\\"Images\\\\\\\": 57,
\\\\\\\"Forms\\\\\\\": 2,
\\\\\\\"Platform\\\\\\\": \\\\\\\"Linuxx86_64\\\\\\\",
\\\\\\\"BrowserAppname\\\\\\\": \\\\\\\"Netscape\\\\\\\",
\\\\\\\"AppCodename\\\\\\\": \\\\\\\"Mozilla\\\\\\\",
\\\\\\\"CPUs\\\\\\\": 8,
\\\\\\\"Language\\\\\\\": \\\\\\\"en-GB\\\\\\\",
\\\\\\\"isEvent\\\\\\\": \\\\\\\"true\\\\\\\",
\\\\\\\"PageLatency\\\\\\\": 2498,
\\\\\\\"Threshold\\\\\\\": 5,
\\\\\\\"AT\\\\\\\": 5,
\\\\\\\"BT\\\\\\\": 1263,
\\\\\\\"CT\\\\\\\": 1230
}\\\",
\\\"EventTimestamp\\\": \\\"1432514783269\\\"
}
]\",
\"Timestamp\": \"1432514783269\",
\"AppName\": \"undefined\",
\"Group\": \"UndefinedGroup\"
]"
I want to flatten this JSON into a single level of wrapping, i.e. I want to remove the nested structure inside and copy that data over to the top-level JSON structure. How can I do this?
If this structure is named json_data, I want to be able to access
json_data['Platform']
json_data['BrowserAppname']
json_data['Severity']
json_data['msgType']
Basically some kind of rudimentary normalization. What is the easiest way to do this using Python?
A solution that is generally unsafe, but probably okay in this case, would be:
import json
d = json.loads(json_string.replace('\\', ''))
I'm not sure what happened, but this doesn't look like valid JSON:
1. You have some double quotes escaped once, some twice, some three times, etc.
2. You have key/value pairs inside of a list-like object [].
3. tdetails is missing a trailing quote.
4. Even if you fix the above, you still have your data list quoted as a multi-line string, which is invalid.
It appears that this "JSON" was constructed by hand, by someone with no knowledge of JSON.
You can try "massaging" the data into JSON with the following:
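import re
x = re.sub(r'\\+', '', js_str)  # remove every run of backslashes
x = re.sub(r'\n', '', x)  # join the lines (operate on x, not the original js_str)
x = '{' + x.strip()[1:-1] + '}'  # swap the outer quote characters for braces
Which would make the string almost JSON-like, but you still need to fix point #3.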
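If the producer can be fixed so that every layer is properly encoded JSON, the flattening the question asks for can be sketched as below. This assumes well-formed nested JSON strings, which the sample in the question is not (for example, its Message value has a "tdetails:" prefix that would need stripping first). The helper names _decode_nested and flatten, and the sample payload, are hypothetical. Later keys overwrite earlier ones with the same name.

```python
import json

def _decode_nested(value):
    """Return the decoded container if value is a string holding a JSON
    object or array; otherwise return value unchanged."""
    if isinstance(value, str):
        try:
            decoded = json.loads(value)
        except ValueError:
            return value
        if isinstance(decoded, (dict, list)):
            return decoded
    return value

def flatten(value, out=None):
    """Copy every scalar key/value pair from all nesting levels into one dict."""
    out = {} if out is None else out
    value = _decode_nested(value)
    if isinstance(value, dict):
        for key, item in value.items():
            item = _decode_nested(item)
            if isinstance(item, (dict, list)):
                flatten(item, out)
            else:
                out[key] = item
    elif isinstance(value, list):
        for item in value:
            flatten(item, out)
    return out

# Hypothetical, well-formed miniature of the structure in the question.
details = {"Platform": "Linuxx86_64", "BrowserAppname": "Netscape"}
event = {"Severity": "warn", "Message": json.dumps(details)}
payload = json.dumps({"msgType": "0", "data": json.dumps([event])})

json_data = flatten(payload)
print(json_data)
```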

Parsing JSON failed

I am trying to parse this data (from the Viper malware analysis framework API specifically). I am having a hard time figuring out the best way to do this. Ideally, I would just do a:
jsonObject.get("SSdeep")
... and I would get the value.
I don't think this is valid JSON unfortunately, and without editing the source of the project, how can I make this proper JSON or easily get these values?
[{
'data': {
'header': ['Key', 'Value'],
'rows': [
['Name', u 'splwow64.exe'],
['Tags', ''],
['Path', '/home/ubuntu/viper-master/projects/../binaries/8/e/e/5/8ee5b228bd78781aa4e6b2e15e965e24d21f791d35b1eccebd160693ba781781'],
['Size', 125952],
['Type', 'PE32+ executable (GUI) x86-64, for MS Windows'],
['Mime', 'application/x-dosexec'],
['MD5', '4b1d2cba1367a7b99d51b1295b3a1d57'],
['SHA1', 'caf8382df0dcb6e9fb51a5e277685b540632bf18'],
['SHA256', '8ee5b228bd78781aa4e6b2e15e965e24d21f791d35b1eccebd160693ba781781'],
['SHA512', '709ca98bfc0379648bd686148853116cabc0b13d89492c8a0fa2596e50f7e4d384e5c359081a90f893d8d250cfa537193cbaa1c53186f29c0b6dedeb50d53d4d'],
['SSdeep', ''],
['CRC32', '7106095E']
]
},
'type': 'table'
}]
Edit 1
Thank you! So I have tried this:
jsonObject = r.content.replace("'", "\"")
jsonObject = jsonObject.replace(" u", "")
and the output I have now is:
"[{"data": {"header": ["Key", "Value"], "rows": [["Name","splwow64.exe"], ["Tags", ""], ["Path", "/home/ubuntu/viper-master/projects/../binaries/8/e/e/5/8ee5b228bd78781aa4e6b2e15e965e24d21f791d35b1eccebd160693ba781781"], ["Size", 125952], ["Type", "PE32+ executable (GUI) x86-64, for MS Windows"], ["Mime", "application/x-dosexec"], ["MD5", "4b1d2cba1367a7b99d51b1295b3a1d57"], ["SHA1", "caf8382df0dcb6e9fb51a5e277685b540632bf18"], ["SHA256", "8ee5b228bd78781aa4e6b2e15e965e24d21f791d35b1eccebd160693ba781781"], ["SHA512", "709ca98bfc0379648bd686148853116cabc0b13d89492c8a0fa2596e50f7e4d384e5c359081a90f893d8d250cfa537193cbaa1c53186f29c0b6dedeb50d53d4d"], ["SSdeep", ""], ["CRC32", "7106095E"]]}, "type": "table"}]"
and now I'm getting this error:
File "/usr/lib/python2.7/json/decoder.py", line 369, in decode
raise ValueError(errmsg("Extra data", s, end, len(s)))
ValueError: Extra data: line 1 column 5 - line 1 column 716 (char 4 - 715)
Note: I'd really rather not do find-and-replace like that, especially the " u" one, as it could have unintended consequences.
Edit 2:
Figured it out! Thank you everyone!
Here's what I ended up doing, as someone mentioned the original text from the server was a "list of dicts":
import ast
import requests
r = requests.post(url, data=data)  # Make the server request
listObject = r.content  # Grab the content (don't really need this line)
listObject = listObject[1:-1]  # Get rid of the quotes
listObject = ast.literal_eval(listObject)  # Create a list out of the literal characters of the string
dictObject = listObject[0]  # My dict!
JSON specifies double quotes (") for strings. From the JSON standard:
A value can be a string in double quotes, or a number, or true or false or null, or an object or an array.
So you would need to replace all the single quotes with double quotes:
data.replace("'", '"')
There is also a spurious u in the Name field that will need to be removed.
However, if the data is valid Python and you trust it, you could try evaluating it. This worked with your original data (without the space after the u):
result = eval(data)
Or more safely:
result = ast.literal_eval(data)
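A short sketch of the ast.literal_eval route, using an abbreviated copy of the data from the question (with the space after the u removed, as noted above). Turning the rows into a dict is an extra step added here for illustration. It gives the .get("SSdeep") style lookup the question asks for, assuming every row is a two-element [key, value] pair.

```python
import ast

# Abbreviated copy of the response text from the question.
data = (
    "[{'data': {'header': ['Key', 'Value'], 'rows': ["
    "['Name', u'splwow64.exe'], ['Tags', ''], ['Size', 125952], "
    "['SSdeep', ''], ['CRC32', '7106095E']]}, 'type': 'table'}]"
)

# literal_eval only accepts Python literals (strings, numbers, lists, dicts,
# ...), so unlike eval it cannot run arbitrary code.
result = ast.literal_eval(data)

# Each row is a [key, value] pair, so the list of rows converts to a dict.
info = dict(result[0]['data']['rows'])

print(info.get('CRC32'))
print(repr(info.get('SSdeep')))
```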
Now you appear to have quotes wrapping the entire thing, which causes everything inside the brackets to be treated as part of a string. Remove the quotes at the start and end of the JSON.
Also, a JSON text may start with either '[' (an array) or '{' (an object). The leading '[{' here is an array whose first element is an object, which is valid in itself.
There is no need to use eval(); just replace the malformed characters (escaping the quotes with \) and parse the result with json:
resp = r.content.replace(" u \'", " \'").replace("\'", "\"")
json.loads(resp)
