Reading and parsing JSON with ijson [duplicate]

I have the following data in my JSON file:
{
    "first": {
        "name": "James",
        "age": 30
    },
    "second": {
        "name": "Max",
        "age": 30
    },
    "third": {
        "name": "Norah",
        "age": 30
    },
    "fourth": {
        "name": "Sam",
        "age": 30
    }
}
I want to print the top-level key and object as follows:
import json
import ijson

fname = "data.json"

with open(fname) as f:
    raw_data = f.read()

data = json.loads(raw_data)
for k in data.keys():
    print k, data[k]
OUTPUT:
second {u'age': 30, u'name': u'Max'}
fourth {u'age': 30, u'name': u'Sam'}
third {u'age': 30, u'name': u'Norah'}
first {u'age': 30, u'name': u'James'}
So far, so good. However, to do the same thing for a huge file I would have to read it all into memory first. That is very slow and requires a lot of memory.
I want to use an incremental JSON parser (ijson in this case) to achieve what I described earlier.
The code below was taken from: No access to top level elements with ijson?
with open(fname) as f:
    json_obj = ijson.items(f, '').next()  # '' loads everything as only one object.
    for (key, value) in json_obj.items():
        print key + " -> " + str(value)
This is not suitable either, because it also reads the whole file into memory; it is not truly incremental.
How can I incrementally parse the top-level keys and their corresponding objects of a JSON file in Python?

Since JSON files are essentially text files, one option is to strip the top level out as a string. Read the file line by line, concatenating each line onto a string, and break out of the loop once the string contains the double braces }} that signal the end of the first top-level object. Of course, the double-brace check must first strip out spaces and line breaks.
import json
import re

toplevelstring = ''

with open('data.json') as f:
    for line in f:
        # strip all whitespace before testing for the closing double brace
        if '}}' not in re.sub(r'\s+', '', toplevelstring):
            toplevelstring = toplevelstring + line
        else:
            break

data = json.loads(toplevelstring)
Now, if your larger JSON is wrapped in square brackets or other braces, run the same routine but add the line below to slice out the first character, [, and the last two characters, the comma and line break that follow the top level's final brace:
[{
    "first": {
        "name": "James",
        "age": 30
    },
    "second": {
        "name": "Max",
        "age": 30
    },
    "third": {
        "name": "Norah",
        "age": 30
    },
    "fourth": {
        "name": "Sam",
        "age": 30
    }
},
{
    "data1": {
        "id": "AAA",
        "type": 55
    },
    "data2": {
        "id": "BBB",
        "type": 1601
    },
    "data3": {
        "id": "CCC",
        "type": 817
    }
}]
...
toplevelstring = toplevelstring[1:-2]
data = json.loads(toplevelstring)

Since version 2.6 ijson comes with a kvitems function that achieves exactly this.
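A minimal sketch of how that looks, assuming the same data.json as above (ijson wants the file opened in binary mode):

import ijson

with open('data.json', 'rb') as f:
    # kvitems yields (key, value) pairs from the map at the given prefix;
    # '' targets the top-level object, so pairs are streamed one at a time
    # without loading the whole document into memory.
    for key, value in ijson.kvitems(f, ''):
        print(key, value)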

Answer from github issue [file name changed]
import ijson
from ijson.common import ObjectBuilder

def objects(file):
    key = '-'
    for prefix, event, value in ijson.parse(file):
        if prefix == '' and event == 'map_key':  # found new object at the root
            key = value  # mark the key value
            builder = ObjectBuilder()
        elif prefix.startswith(key):  # while at this key, build the object
            builder.event(event, value)
            if event == 'end_map':  # found the end of an object at the current key, yield
                yield key, builder.value

for key, value in objects(open('data.json', 'rb')):
    print(key, value)

Related

Getting identical sha256 for each JSON file in Python

I am in a huge hashing crisis. Using CHIP-0007's default format I generated a few JSON files, and from those files I have been trying to generate sha256 hash values. I expect a unique hash value for each file.
However, my Python code isn't producing that. I thought there might be some issue with the JSON files, but there is not; it is something to do with the sha256 code.
All the json files ->
JSON File 1
{ "format": "CHIP-0007", "name": "adewale-the-amebo", "description": "Adewale always wants to be in everyone's business.", "attributes": [ { "trait_type": "Gender", "value": "male" } ], "collection": { "name": "adewale-the-amebo Collection", "id": "1" } }
JSON File 2
{ "format": "CHIP-0007", "name": "alli-the-queeny", "description": "Alli is an LGBT Stan.", "attributes": [ { "trait_type": "Gender", "value": "male" } ], "collection": { "name": "alli-the-queeny Collection", "id": "2" } }
JSON File 3
{ "format": "CHIP-0007", "name": "aminat-the-snnobish", "description": "Aminat never really wants to talk to anyone.", "attributes": [ { "trait_type": "Gender", "value": "female" } ], "collection": { "name": "aminat-the-snnobish Collection", "id": "3" } }
Sample CSV File:
Series Number,Filename,Description,Gender
1,adewale-the-amebo,Adewale always wants to be in everyone's business.,male
2,alli-the-queeny,Alli is an LGBT Stan.,male
3,aminat-the-snnobish,Aminat never really wants to talk to anyone.,female
Python CODE
# TODO 2 : Generate a JSON file per entry in team's sheet in CHIP-0007's default format
new_jsonFile = f"{row[1]}.json"
json_data = {}
json_data["format"] = "CHIP-0007"
json_data["name"] = row[1]
json_data["description"] = row[2]

attribute_data = {}
attribute_data["trait_type"] = "Gender"  # gender
attribute_data["value"] = row[3]  # "value/male/female"
json_data["attributes"] = [attribute_data]

collection_data = {}
collection_data["name"] = f"{row[1]} Collection"
collection_data["id"] = row[0]  # "ID of the NFT collection"
json_data["collection"] = collection_data

filepath = f"Json_Files/{new_jsonFile}"
with open(filepath, 'w') as f:
    json.dump(json_data, f, indent=2)
    C += 1
    sha256_hash = sha256_gen(filepath)
    temp.append(sha256_hash)
    NEW.append(temp)

# TODO 3 : Calculate sha256 of each entry
def sha256_gen(fn):
    return hashlib.sha256(open(fn, 'rb').read()).hexdigest()
How can I generate a unique sha256 hash for each JSON?
I tried reading in byte blocks; that is not working out either. After many trials, I am going nowhere. Sharing the unexpected output for each JSON file:
[ All hashes are identical ]
Unexpected SHA256 output:
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
Expected:
Unique Hash value. Different from each other
Because of output buffering, you're calling sha256_gen(filepath) before anything is written to the file, so you're getting the hash of an empty file (e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 is exactly the sha256 of zero bytes of input). You should do that outside the with, so that the JSON file is closed and the buffer is flushed:
with open(filepath, 'w') as f:
    json.dump(json_data, f, indent=2)

C += 1
sha256_hash = sha256_gen(filepath)
temp.append(sha256_hash)
NEW.append(temp)
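As an alternative sketch (assuming the same json_data dict and filepath as above), you can hash the serialized string directly and skip the file round-trip entirely, which sidesteps buffering altogether:

import hashlib
import json

# Serialize once, reuse the same bytes for both the file and the hash.
payload = json.dumps(json_data, indent=2)
sha256_hash = hashlib.sha256(payload.encode('utf-8')).hexdigest()

with open(filepath, 'w') as f:
    f.write(payload)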

string indices must be integers when using python with json

I want to print the IP addresses from jobs.json, but I am getting the error 'string indices must be integers'.
Here is my python code:
import json

f = open('jobs.json')
data = json.load(f)
f.close()

for item in data["Jobs"]:
    print(item["ip"])
And here is the Jobs.json file:
{
    "Jobs": {
        "Carpenter": {
            "ip": "123.1432.515",
            "address": ""
        },
        "Electrician": {
            "ip": "643.452.234",
            "address": "mini-iad.com"
        },
        "Plumber": {
            "ip": "151.101.193",
            "Address": "15501 Birch St"
        },
        "Mechanic": {
            "ip": "218.193.942",
            "Address": "Yellow Brick Road"
        }
    }
}
data["Company"] is a dictionary, so you're iterating over the keys (which are strings). Use data["Company"].values():
import json
with open("company.json", "r") as f_in:
data = json.load(f_in)
for item in data["Company"].values():
print(item["ip"])
Prints:
142.250.115.139
151.101.193
data["Company"] returns a dictionary. When iterating over that, you will get string keys for item, since that's what you get by default when iterating over a dictionary. Then you try to do item["ip"], where item is "Google" for example, which causes your error.
You want to iterate the values of the dictionary instead:
for item in data["Company"].values():
print(item["ip"])

How to merge non-fixed key json multilines into one json abstractly

If I have a heavy JSON file that has 30M entries, like this:
{"id":3,"price":"231","type":"Y","location":"NY"}
{"id":4,"price":"321","type":"N","city":"BR"}
{"id":5,"price":"354","type":"Y","city":"XE","location":"CP"}
--snip--
{"id":30373779,"price":"121","type":"N","city":"SR","location":"IU"}
{"id":30373780,"price":"432","type":"Y","location":"TB"}
{"id":30373780,"price":"562","type":"N","city":"CQ"}
how can I extract only the location and the city and merge them into one JSON like this in Python:
{
    "orders": {
        3: {
            "location": "NY"
        },
        4: {
            "city": "BR"
        },
        5: {
            "city": "XE",
            "location": "CP"
        },
        30373779: {
            "city": "SR",
            "location": "IU"
        },
        30373780: {
            "location": "TB"
        },
        30373780: {
            "city": "CQ"
        }
    }
}
P.S.: beautifying the syntax is not necessary.
Assuming your input file is actually in jsonlines format, then you can read each line, extract the city and location keys from the dict and then append those to a new dict:
import json
from collections import defaultdict

orders = {'orders': defaultdict(dict)}

with open('orders.txt', 'r') as f:
    for line in f:
        o = json.loads(line)
        id = o['id']
        if 'location' in o:
            orders['orders'][id]['location'] = o['location']
        if 'city' in o:
            orders['orders'][id]['city'] = o['city']

print(json.dumps(orders, indent=2))
Output for your sample data (note it has two 30373780 id values, so the values get merged into one dict):
{
    "orders": {
        "3": {
            "location": "NY"
        },
        "4": {
            "city": "BR"
        },
        "5": {
            "location": "CP",
            "city": "XE"
        },
        "30373779": {
            "location": "IU",
            "city": "SR"
        },
        "30373780": {
            "location": "TB",
            "city": "CQ"
        }
    }
}
Since you've said your file is pretty big and you probably don't want to keep all entries in memory, here is a way to consume the source file line by line and write the output immediately:
import json

with open(r"in.jsonp") as i_f, open(r"out.json", "w") as o_f:
    o_f.write('{"orders":{')
    for i in i_f:
        i_obj = json.loads(i)
        o_f.write(f'{i_obj["id"]}:')
        o_obj = {}
        if location := i_obj.get("location"):
            o_obj["location"] = location
        if city := i_obj.get("city"):
            o_obj["city"] = city
        json.dump(o_obj, o_f)
        o_f.write(",")
    o_f.write('}}')
It will generate a semi-valid JSON object in the same format you've provided in your question (unquoted integer keys and a trailing comma, just as in your example).
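If you need strictly valid JSON instead, here is a minimal variant of the same streaming loop (a sketch, assuming the same in.jsonp input; note that duplicate ids would still come out as duplicate keys) that quotes the keys and suppresses the trailing comma:

import json

with open(r"in.jsonp") as i_f, open(r"out_valid.json", "w") as o_f:
    o_f.write('{"orders":{')
    first = True
    for line in i_f:
        i_obj = json.loads(line)
        # keep only the keys we care about, if present
        o_obj = {k: i_obj[k] for k in ("location", "city") if k in i_obj}
        if not first:
            o_f.write(",")
        first = False
        o_f.write(f'"{i_obj["id"]}":')
        json.dump(o_obj, o_f)
    o_f.write('}}')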

How to convert nested JSON data to CSV using python?

I have a file consisting of an array containing over 5000 objects. However, I am having trouble converting one particular part of my JSON file into the appropriate columns in CSV format.
Below is an example version of my data file:
{
    "Result": {
        "Example 1": {
            "Type1": [
                {
                    "Owner": "Name1 Example",
                    "Description": "Description1 Example",
                    "Email": "example1_email#email.com",
                    "Phone": "(123) 456-7890"
                }
            ]
        },
        "Example 2": {
            "Type1": [
                {
                    "Owner": "Name2 Example",
                    "Description": "Description2 Example",
                    "Email": "example2_email#email.com",
                    "Phone": "(111) 222-3333"
                }
            ]
        }
    }
}
Here is my current code:
import csv
import json

json_file = 'example.json'

with open(json_file, 'r') as json_data:
    x = json.load(json_data)

f = csv.writer(open("example.csv", "w"))
f.writerow(["Address", "Type", "Owner", "Description", "Email", "Phone"])

for key in x["Result"]:
    type = "Type1"
    f.writerow([key,
                type,
                x["Result"][key]["Type1"]["Owner"],
                x["Result"][key]["Type1"]["Description"],
                x["Result"][key]["Type1"]["Email"],
                x["Result"][key]["Type1"]["Phone"]])
My problem is that I'm encountering this issue:
Traceback (most recent call last):
  File "./convert.py", line 18, in <module>
    x["Result"][key]["Type1"]["Owner"],
TypeError: list indices must be integers or slices, not str
When I try to substitute the last key, such as "Owner", with an integer index, I receive this error: IndexError: list index out of range.
When I instead change the f.writerow call to
f.writerow([key,
            type,
            x["Result"][key]["Type1"]])
I receive the results in a column, but it merges everything into one column, which makes sense. Picture of the output: https://imgur.com/a/JpDkaAT
I would like the results to be separated based on the label into individual columns instead of being merged into one. Could anyone assist?
Thank you!
Type1 in your data structure is a list, not a dict. So you need to iterate over it instead of referencing by key.
for key in x["Result"]:
# key is now "Example 1" etc.
type1 = x["Result"][key]["Type1"]
# type1 is a list, not a dict
for i in type1:
f.writerow([key,
"Type1",
type1["Owner"],
type1["Description"],
type1["Email"],
type1["Phone"]])
The inner for loop ensures that you're protected from the assumption that "Type1" only ever has one item in the list.
It's definitely not the best example, but I'm too sleepy to optimize it.
import csv

def json_to_csv(obj, res):
    # walk the structure depth-first, appending container keys
    # and raw leaf values onto the row
    for k, v in obj.items():
        if isinstance(v, dict):
            res.append(k)
            json_to_csv(v, res)
        elif isinstance(v, list):
            res.append(k)
            for el in v:
                json_to_csv(el, res)
        else:
            res.append(v)

obj = {
    "Result": {
        "Example 1": {
            "Type1": [
                {
                    "Owner": "Name1 Example",
                    "Description": "Description1 Example",
                    "Email": "example1_email#email.com",
                    "Phone": "(123) 456-7890"
                }
            ]
        },
        "Example 2": {
            "Type1": [
                {
                    "Owner": "Name2 Example",
                    "Description": "Description2 Example",
                    "Email": "example2_email#email.com",
                    "Phone": "(111) 222-3333"
                }
            ]
        }
    }
}

with open("out.csv", "w+") as f:
    writer = csv.writer(f)
    writer.writerow(["Address", "Type", "Owner", "Description", "Email", "Phone"])
    for k, v in obj["Result"].items():
        row = [k]
        json_to_csv(v, row)
        writer.writerow(row)
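Note that this relies on dicts preserving insertion order (guaranteed in Python 3.7+), so the values appended by json_to_csv line up with the header columns in the order they appear in the JSON.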
Figured it out!
I changed the f.writerow function to the following:
for key in x["Result"]:
type = "Type1"
f.writerow([key,
type,
x["Result"][key]["Type1"][0]["Owner"],
x["Result"][key]["Type1"][0]["Email"]])
...
This allowed me to reference the keys within the object. Hopefully this helps someone down the line!

Sorting a JSON Object

I'm new to JSON and Python, and I'm attempting to load a JSON file from disk to manipulate and output to an XML file. I've gotten most of it figured out, except that I want to 'sort' the JSON file by a certain value after I load it.
Example of json file:
{
    "0": {
        "name": "John Doe",
        "finished": "4",
        "logo": "http://someurl.com/icon.png"
    },
    "1": {
        "name": "Jane Doe",
        "finished": "10",
        "logo": "http://anotherurl.com/icon.png"
    },
    "2": {
        "name": "Jacob Smith",
        "finished": "3",
        "logo": "http://example.com/icon.png"
    }
}
What I want to do is sort 'tree' by the 'finished' key.
JSONFILE = "file.json"
with open(CHANS) as json_file:
tree = json.load(json_file)
It depends on how you "consume" the tree dictionary. Are you using tree.keys(), tree.values(), or tree.items()?
tree.keys()
ordered_keys = sorted(tree.keys(), key=lambda k: int(tree[k]['finished']))
tree.values()
ordered_values = sorted(tree.values(), key=lambda v: int(v['finished']))
tree.items()
ordered_items = sorted(tree.items(), key=lambda t: int(t[1]['finished']))
Just keep in mind that JSON is what's inside the actual file; the result of json.load() is a plain Python value/object, so just work with it.
If you are walking over the sorted dictionary once, the above snippets will work just fine. However, if you need to access it multiple times, then I would follow Jean-François's suggestion and use an OrderedDict, with something like:
from collections import OrderedDict
tree = OrderedDict(sorted(tree.items(), key=lambda t: int(t[1]['finished'])))
That way the sorting operation (arguably the most expensive part) is done just once.
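On Python 3.7+ a plain dict also preserves insertion order, so (a small sketch under that assumption) the OrderedDict isn't strictly required:
tree = dict(sorted(tree.items(), key=lambda t: int(t[1]['finished'])))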
