Sorting a JSON Object - python

I'm new to JSON and Python, and I'm attempting to load a JSON file I created from disk so I can manipulate it and output it to an XML file. I've gotten most of it figured out, except that I want to 'sort' the JSON data by a certain value after I load it.
Example of json file:
{
    "0": {
        "name": "John Doe",
        "finished": "4",
        "logo": "http://someurl.com/icon.png"
    },
    "1": {
        "name": "Jane Doe",
        "finished": "10",
        "logo": "http://anotherurl.com/icon.png"
    },
    "2": {
        "name": "Jacob Smith",
        "finished": "3",
        "logo": "http://example.com/icon.png"
    }
}
What I want to do is sort 'tree' by the 'finished' key.
JSONFILE = "file.json"
with open(CHANS) as json_file:
tree = json.load(json_file)

It depends on how you "consume" the tree dictionary: are you using tree.keys(), tree.values(), or tree.items()?
tree.keys()
ordered_keys = sorted(tree.keys(), key=lambda k: int(tree[k]['finished']))
tree.values()
ordered_values = sorted(tree.values(), key=lambda v: int(v['finished']))
tree.items()
ordered_items = sorted(tree.items(), key=lambda t: int(t[1]['finished']))
Just keep in mind that JSON is only what's inside the actual file; the result of json.load() is a plain Python value/object, so just work with that.
If you are walking over the sorted data once, the snippets above will work just fine. However, if you need to access it multiple times, then I would follow Jean-François's suggestion and use an OrderedDict, with something like:
from collections import OrderedDict
tree = OrderedDict(sorted(tree.items(), key=lambda t: int(t[1]['finished'])))
That way the sorting operation (arguably the most expensive) is done just once.
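For example, a minimal sketch of walking the sorted result (assuming tree was loaded and sorted as above); the expected output is based on the sample file:
for key, entry in tree.items():
    print(key, entry['name'], entry['finished'])
# 2 Jacob Smith 3
# 0 John Doe 4
# 1 Jane Doe 10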

Related

Python: Identify the duplicate JSON values and generate an output file with sets of duplicate input rows, one row per set of duplicates

I am trying to find duplicate JSON objects in a 30 GB JSON Lines file.
Given a JSON object A that looks like this:
{
    "data": {
        "cert_index": 691749790,
        "cert_link": "http://ct.googleapis.com/icarus/ct/v1/get-entries?start=691749790&end=691749790",
        "chain": [{...}, {...}],
        "leaf_cert": {
            "all_domains": [...],
            "as_der": "MIIFcjCCBFqgAwIBAgISBDD2+d1gP36/+9uUveS...",
            "extensions": {...},
            "fingerprint": "0C:E4:AF:24:F1:AE:B1:09:B0:42:67:CB:F8:FC:B6:AF:1C:07:D6:5B",
            "not_after": 1573738488,
            "not_before": 1565962488,
            "serial_number": "430F6F9DD603F7EBFFBDB94BDE4BBA4EC9A",
            "subject": {...}
        },
        "seen": 1565966152.750253,
        "source": {
            "name": "Google 'Icarus' log",
            "url": "ct.googleapis.com/icarus/"
        },
        "update_type": "PrecertLogEntry"
    },
    "message_type": "certificate_update"
}
How can I generate an output file where each row looks like this:
{"fingerprint":"0C:E4:AF:24:F1:AE:B1:09:B0:42:67:CB:F8:FC:B6:AF:1C:07:D6:5B", "certificates":[A, B, C]}
Here A, B, and C are the full JSON objects for each of the duplicates.
You need to keep your information in an array (a Python list), and before adding a new JSON object, check whether its fingerprint is already in the array. For example:
currentFingerprint = myJson['data']['leaf_cert']['fingerprint']
for elem in arrayOfFingerprints:
    if elem['fingerprint'] == currentFingerprint:
        elem['certificates'].append(myJson)
        break
else:
    arrayOfFingerprints.append({'fingerprint': currentFingerprint, 'certificates': [myJson]})
I'm going to assume that you have already read the file and created a list of dicts.
from collections import defaultdict
import json

d = defaultdict(list)
for jobj in file:
    d[jobj['data']['leaf_cert']['fingerprint']].append(jobj)

with open('file.txt', 'w') as out:
    for k, v in d.items():
        json.dump({"fingerprint": k, "certificates": v}, out)
        out.write('\n')

Reading and parsing JSON with ijson [duplicate]

I have the following data in my JSON file:
{
    "first": {
        "name": "James",
        "age": 30
    },
    "second": {
        "name": "Max",
        "age": 30
    },
    "third": {
        "name": "Norah",
        "age": 30
    },
    "fourth": {
        "name": "Sam",
        "age": 30
    }
}
I want to print the top-level key and object as follows:
import json
import ijson

fname = "data.json"
with open(fname) as f:
    raw_data = f.read()
    data = json.loads(raw_data)
    for k in data.keys():
        print k, data[k]
OUTPUT:
second {u'age': 30, u'name': u'Max'}
fourth {u'age': 30, u'name': u'Sam'}
third {u'age': 30, u'name': u'Norah'}
first {u'age': 30, u'name': u'James'}
So far so good. However, if I want to do this same thing for a huge file, I would have to read it all into memory. This is very slow and requires lots of memory.
I want to use an incremental JSON parser (ijson in this case) to achieve what I described earlier. The code below was taken from: No access to top level elements with ijson?
with open(fname) as f:
    json_obj = ijson.items(f, '').next()  # '' loads everything as only one object.
    for (key, value) in json_obj.items():
        print key + " -> " + str(value)
This is not suitable either, because it also reads the whole file into memory, so it is not truly incremental.
How can I do incremental parsing of the top-level keys and their corresponding objects of a JSON file in Python?
Since JSON files are essentially text files, consider stripping out the top level as a string. Basically, use a read-file-iterable approach where you concatenate a string with each line, and break out of the loop once the string contains the double braces }} signaling the end of the top level. Of course, the double-brace check must first strip out spaces and line breaks.
import json

toplevelstring = ''

with open('data.json') as f:
    for line in f:
        if '}}' not in toplevelstring.replace('\n', '').replace(' ', ''):
            toplevelstring = toplevelstring + line
        else:
            break

data = json.loads(toplevelstring)
Now, if your larger JSON is wrapped in square brackets or other braces, still run the above routine, but add the line below to slice out the first character, [, and the last two characters (the comma and line break after the top level's final brace):
[{
    "first": {
        "name": "James",
        "age": 30
    },
    "second": {
        "name": "Max",
        "age": 30
    },
    "third": {
        "name": "Norah",
        "age": 30
    },
    "fourth": {
        "name": "Sam",
        "age": 30
    }
},
{
    "data1": {
        "id": "AAA",
        "type": 55
    },
    "data2": {
        "id": "BBB",
        "type": 1601
    },
    "data3": {
        "id": "CCC",
        "type": 817
    }
}]
...
toplevelstring = toplevelstring[1:-2]
data = json.loads(toplevelstring)
Since version 2.6, ijson comes with a kvitems function that achieves exactly this.
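A minimal sketch of how that might look (assuming ijson 2.6 or newer; the empty prefix '' addresses the root object):
import ijson

with open('data.json', 'rb') as f:
    # kvitems yields (key, value) pairs of the object at the given prefix,
    # building each value fully without holding the whole file in memory.
    for key, value in ijson.kvitems(f, ''):
        print(key, value)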
Answer from a GitHub issue [file name changed]:
import ijson
from ijson.common import ObjectBuilder

def objects(file):
    key = '-'
    for prefix, event, value in ijson.parse(file):
        if prefix == '' and event == 'map_key':  # found new object at the root
            key = value  # mark the key value
            builder = ObjectBuilder()
        elif prefix.startswith(key):  # while at this key, build the object
            builder.event(event, value)
            if event == 'end_map':  # found the end of an object at the current key, yield
                yield key, builder.value

for key, value in objects(open('data.json', 'rb')):
    print(key, value)

Key order while writing a JSON object in Python [duplicate]

I need to produce output like the example below as the result of a web service call:
{
    "predictions": [
        {
            "id": 18009,
            "cuisine": "italian",
            "probability": 0.17846838753494407
        },
        {
            "id": 28583,
            "cuisine": "italian",
            "probability": 0.1918703125538735
        }
    ]
}
I have the below code to create the object:
json_data = []
for i in range(0, len(predicted_labels)):
    data = {}
    data['id'] = int(test['id'][i])
    data['cuisine'] = categoricalTransformer[predicted_labels[i]]
    item = predicted_probability[i]
    data['probability'] = item[predicted_labels[i]]
    json_data.append(data)
json_return = {"predictions": json_data}
return jsonify(json_return)
which reorders the keys alphabetically, as shown below:
{
    "predictions": [
        {
            "cuisine": "italian",
            "id": 18009,
            "probability": 0.17846838753494407
        },
        {
            "cuisine": "italian",
            "id": 28583,
            "probability": 0.1918703125538735
        }
    ]
}
What should I do?
Rebuild your new dictionary with a collections.defaultdict(), and append ordered dictionaries with a collections.OrderedDict():
from collections import OrderedDict
from collections import defaultdict
from json import dumps

data = {
    "predictions": [
        {"id": 18009, "cuisine": "italian", "probability": 0.17846838753494407},
        {"id": 28583, "cuisine": "italian", "probability": 0.1918703125538735},
    ]
}

key_order = ["id", "cuisine", "probability"]

result = defaultdict(list)
for dic in data["predictions"]:
    ordered = OrderedDict((key, dic.get(key)) for key in key_order)
    result["predictions"].append(ordered)

print(dumps(result))
# {"predictions": [{"id": 18009, "cuisine": "italian", "probability": 0.17846838753494407}, {"id": 28583, "cuisine": "italian", "probability": 0.1918703125538735}]}
json.dumps() here serializes the dictionary into a JSON formatted string.
Note: If you are using Python 3.6+, you can use normal dictionaries instead of OrderedDict(), since the insertion order of keys is remembered. However, in 3.6 this is an implementation detail, whereas in 3.7 it is a language feature. You can read more about this at Are dictionaries ordered in Python 3.6+?.
You can probably rely on this feature for applications that require at least Python 3.6. That said, it's probably safer to use OrderedDict() anyway, so your code stays backwards compatible with all Python versions.
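A quick way to see this behaviour (assuming Python 3.7+): json.dumps() keeps insertion order unless you explicitly ask it to sort keys:
import json

data = {"id": 18009, "cuisine": "italian", "probability": 0.17846838753494407}

print(json.dumps(data))                  # keys stay in insertion order
print(json.dumps(data, sort_keys=True))  # keys forced into alphabetical order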
You can look into OrderedDict (an ordered dictionary). It remembers insertion order, so you can organise your JSON output with this collection.
Example:
import collections

datadict = {'a': 'a', 'c': 'c', 'b': 'b'}

print(collections.OrderedDict(datadict))
# OrderedDict([('a', 'a'), ('c', 'c'), ('b', 'b')])

x = collections.OrderedDict(datadict)
print(dict(x))
# {'a': 'a', 'c': 'c', 'b': 'b'}
print(x == datadict)
# True

print(collections.OrderedDict(sorted(datadict.items(), key=lambda t: t[0])))
# OrderedDict([('a', 'a'), ('b', 'b'), ('c', 'c')])

z = collections.OrderedDict(sorted(datadict.items(), key=lambda t: t[0]))
print(dict(z))
# {'a': 'a', 'b': 'b', 'c': 'c'}
print(z == datadict)
# True
Using this, you can convert your json_return dict into an OrderedDict with your desired insertion order, then either return that OrderedDict object or convert it back into a dict and return that.

List Indices in json in Python

I've got a JSON file that I've pulled from a web service and I'm trying to parse it. I see that this question has been asked a whole bunch, and I've read whatever I could find, but the JSON data in each example appears to be very simplistic in nature. Likewise, the JSON example data in the Python docs is very simple and does not reflect what I'm trying to work with. Here is what the JSON looks like:
{"RecordResponse": {
"Id": blah
"Status": {
"state": "complete",
"datetime": "2016-01-01 01:00"
},
"Results": {
"resultNumber": "500",
"Summary": [
{
"Type": "blah",
"Size": "10000000000",
"OtherStuff": {
"valueOne": "first",
"valueTwo": "second"
},
"fieldIWant": "value i want is here"
The code block in question is:
import json

jsonFile = r'C:\Temp\results.json'
with open(jsonFile) as dataFile:
    json_obj = json.load(dataFile)
    for i in json_obj["Summary"]:
        print(i["fieldIWant"])
Not only am I not getting to the field I want, but I'm also getting a KeyError when trying to suss out "Summary".
I don't know how the indices work within the array; once I even get into the "Summary" field, do I have to issue an index manually to return the value from the field I need?
The example you posted is not valid JSON (there are missing commas after some object fields), so it's hard to dig in much; if it came straight from the web service, something's messed up. If you do fix it with proper commas, the "Summary" key is within the "Results" object, so you'd need to change your loop to:
with open(jsonFile) as dataFile:
    json_obj = json.load(dataFile)
    for i in json_obj["Results"]["Summary"]:
        print(i["fieldIWant"])
If you don't know the structure at all, you could look through the resulting object recursively:
def findfieldsiwant(obj, keyname="Summary", fieldname="fieldIWant"):
    try:
        for key, val in obj.items():
            if key == keyname:
                return [d[fieldname] for d in val]
            else:
                sub = findfieldsiwant(val)
                if sub:
                    return sub
    except AttributeError:  # obj is not a dict
        pass
    # keyname not found
    return None
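A small usage sketch of the function above, run against a hypothetical, pared-down version of the structure from the question:
sample = {
    "RecordResponse": {
        "Results": {
            "Summary": [
                {"Type": "blah", "fieldIWant": "value i want is here"}
            ]
        }
    }
}

print(findfieldsiwant(sample))
# ['value i want is here']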

python - exporting dictionary(array) to json

I have an array of dictionaries like so:
myDict[0] = {'date':'today', 'status': 'ok'}
myDict[1] = {'date':'yesterday', 'status': 'bad'}
and I'm trying to export this array to a json file where each dictionary is its own entry. The problem is when I try to run:
dump(myDict, open("test.json", "w"))
It outputs a JSON file with a number prefix before each entry:
{"0": {"date": "today", "status": "ok"}, "1": {"date": "yesterday", "status": "bad"} }
which apparently isn't legal JSON, since my JSON parser (Protovis) is giving me error messages.
Any ideas?
Thanks
Use a list instead of a dictionary; you probably used:
myDict = {}
myDict[0] = {...}
You should use:
myList = []
myList.append({...})
P.S.: It looks like valid JSON to me anyway, but it is an object and not a list; maybe that is the reason why your parser is complaining.
You should use a JSON serializer...
Also, an array of dictionaries would better serialize to something like this:
[
    {
        "date": "today",
        "status": "ok"
    },
    {
        "date": "yesterday",
        "status": "bad"
    }
]
That is, you should just use a list on the Python side, which serializes to a JSON array.
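A minimal sketch of the list-based approach, reusing the sample data and output file name from the question:
import json

records = []
records.append({'date': 'today', 'status': 'ok'})
records.append({'date': 'yesterday', 'status': 'bad'})

# json.dump now writes a JSON array ([...]) instead of an object keyed by "0", "1", ...
with open("test.json", "w") as f:
    json.dump(records, f, indent=2)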
