How to create a very large JSON object in Python

I have a json file like this:
{
"students": [
{
"name": "Jack",
"age": "12",
"class": "8",
"start_date": "2021-01-01",
"score": {
"Eng": "90",
"Math": "90",
"History": "91",
"Art": "80"
},
"friend": {},
"talents": [
"dance"
]
}
],
"school": [
{
"city": "LA",
"state": "CA",
"country": "US"
}
]
}
Now I want to modify this file and add new students' info to it,
like this (adding more similar JSON objects to students with minimal changes):
{
"students": [
{
"name": "Jack",
"age": "12",
"class": "8",
"start_date": "2021-01-01",
"score": {
"Eng": "90",
"Math": "90",
"History": "91",
"Art": "80"
},
"friend": {},
"talents": [
"dance"
]
},
{
"name": "David",
"age": "12",
"class": "8",
"start_date": "2021-02-01",
"score": {
"Eng": "92",
"Math": "90",
"History": "95",
"Art": "70"
},
"friend": {},
"talents": [
"skate"
]
},
... ...
],
"school": [
{
"city": "LA",
"state": "CA",
"country": "US"
}
]
}
Is there a way to do this at scale? I need to append more than 1,000 new students.
Since this is a large JSON file and each new entry needs only minimal changes, I don't want to specify every key and value more than 1,000 times.
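One possible approach, sketched below, is to treat an existing student as a template, deep-copy it for each new student, and override only the fields that differ. The helper name `add_students` and the `new_students` list are hypothetical, and the sketch assumes the file is valid JSON (no trailing commas). In practice `data` would come from `json.load` and be written back with `json.dump`; it is inlined here so the example runs on its own.

```python
import copy
import json

def add_students(data, template, overrides_list):
    """Append one modified copy of `template` per entry in `overrides_list`."""
    for overrides in overrides_list:
        # Deep copy so nested dicts/lists are not shared between students.
        student = copy.deepcopy(template)
        for key, value in overrides.items():
            if isinstance(value, dict) and isinstance(student.get(key), dict):
                student[key].update(value)  # merge nested dicts such as "score"
            else:
                student[key] = value
        data["students"].append(student)
    return data

# Stand-in for: with open("students.json") as f: data = json.load(f)
data = {
    "students": [
        {"name": "Jack", "age": "12", "class": "8", "start_date": "2021-01-01",
         "score": {"Eng": "90", "Math": "90", "History": "91", "Art": "80"},
         "friend": {}, "talents": ["dance"]}
    ],
    "school": [{"city": "LA", "state": "CA", "country": "US"}],
}

# Only the fields that differ from the template need to be listed (hypothetical input).
new_students = [
    {"name": "David", "start_date": "2021-02-01",
     "score": {"Eng": "92", "History": "95", "Art": "70"}, "talents": ["skate"]},
]

add_students(data, data["students"][0], new_students)
print(json.dumps(data["students"][1], indent=2))
```

The list of overrides could itself be generated from a CSV file or a loop, so the 1,000+ entries never have to be typed out by hand.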

Related

Update nested JSON with name of json file name

I'm wondering if you could help me fill JSON objects with their original filenames.
Here is a sample of the JSON:
jsv is a list of JSON objects (the first main key is the document number: document_0, document_1, ...)
jsv =
[
{
{
"document_0":{
"id":111,
"laboratory":"xxx",
"document_type":"xxx",
"language":"pl",
"creation_date":"09-12-2022",
"source_filename":"None",
"version":"0.1",
"exams_ocr_avg_confidence":0.0,
"patient_data":{
"first_name":"YYYY",
"surname":"YYYY",
"pesel":"12345678901",
"birth_date":"1111-22-22",
"sex":"F",
"age":"None"
},
"exams":[
{
"name":"xx",
"sampling_date":"2020-11-30",
"comment":"None",
"confidence":97,
"result":"222",
"unit":"ml",
"norm":"None",
"material":"None",
"icd9":"uuuuu"
},
{
"document_1":{
"id":111,
"laboratory":"xxx",
"document_type":"xxx",
"language":"pl",
"creation_date":"09-12-2022",
"source_filename":"None",
"version":"0.1",
"exams_ocr_avg_confidence":0.0,
"patient_data":{
"first_name":"YYYY",
"surname":"YYYY",
"pesel":"12345678901",
"birth_date":"1111-22-22",
"sex":"F",
"age":"None"
},
"exams":[
{
"name":"xx",
"sampling_date":"2020-11-30",
"comment":"None",
"confidence":97,
"result":"222",
"unit":"ml",
"norm":"None",
"material":"None",
"icd9":"uuuuu"
}
}
]
Inside this JSON there is a key, source_filename, which I want to update with the real name of the JSON file.
My folder with files, as an example:
'11111.pdf.json',
'11112.pdf.json',
'11113.pdf.json',
'11114.pdf.json',
'11115.pdf.json'
What I want to achieve:
jsv =
[
{
{
"document_0":{
"id":111,
"laboratory":"xxx",
"document_type":"xxx",
"language":"pl",
"creation_date":"09-12-2022",
"source_filename":"11111.pdf.json",
"version":"0.1",
"exams_ocr_avg_confidence":0.0,
"patient_data":{
"first_name":"YYYY",
"surname":"YYYY",
"pesel":"12345678901",
"birth_date":"1111-22-22",
"sex":"F",
"age":"None"
},
"exams":[
{
"name":"xx",
"sampling_date":"2222-22-22",
"comment":"None",
"confidence":22,
"result":"222",
"unit":"ml",
"norm":"None",
"material":"None",
"icd9":"uuuuu"
},
{
"document_1":{
"id":111,
"laboratory":"xxx",
"document_type":"xxx",
"language":"pl",
"creation_date":"22-22-2222",
"source_filename":"11111.pdf.json",
"version":"0.1",
"exams_ocr_avg_confidence":0.0,
"patient_data":{
"first_name":"YYYY",
"surname":"YYYY",
"pesel":"12345678901",
"birth_date":"1111-22-22",
"sex":"F",
"age":"None"
},
"exams":[
{
"name":"xx",
"sampling_date":"2222-11-22",
"comment":"None",
"confidence":22,
"result":"222",
"unit":"ml",
"norm":"None",
"material":"None",
"icd9":"uuuuu"
}
}
]
document_0 and document_1 have the same filename.
What I've managed to get:
dir_name = 'path_name'
from os import listdir
from os.path import isfile, join
onlyfiles = [f for f in listdir(dir_name) if isfile(join(dir_name, f))]
onlyfiles is a list of the filenames of my JSON files.
Now I was thinking of somehow updating my jsv with it in a loop.
But I'm also looking for a method that is efficient, given the large amount of data I have to process.
EDIT:
I've managed to do it with a for loop, but maybe there is a more efficient way:
for i in range(len(jsv)):
    if isinstance(jsv[i], dict):
        jsv[i]["document_0"].update({"source_filename": onlyfiles[i]})
    else:
        print(onlyfiles[i])
If your jsv is:
jsv = [
{
"document_0": {
"id": 111,
"laboratory": "xxx",
"document_type": "xxx",
"language": "pl",
"creation_date": "09-12-2022",
"source_filename": "None",
"version": "0.1",
"exams_ocr_avg_confidence": 0.0,
"patient_data": {
"first_name": "YYYY",
"surname": "YYYY",
"pesel": "12345678901",
"birth_date": "1111-22-22",
"sex": "F",
"age": "None",
},
"exams": [
{
"name": "xx",
"sampling_date": "2020-11-30",
"comment": "None",
"confidence": 97,
"result": "222",
"unit": "ml",
"norm": "None",
"material": "None",
"icd9": "uuuuu",
},
],
}
},
{
"document_1": {
"id": 111,
"laboratory": "xxx",
"document_type": "xxx",
"language": "pl",
"creation_date": "09-12-2022",
"source_filename": "None",
"version": "0.1",
"exams_ocr_avg_confidence": 0.0,
"patient_data": {
"first_name": "YYYY",
"surname": "YYYY",
"pesel": "12345678901",
"birth_date": "1111-22-22",
"sex": "F",
"age": "None",
},
"exams": [
{
"name": "xx",
"sampling_date": "2020-11-30",
"comment": "None",
"confidence": 97,
"result": "222",
"unit": "ml",
"norm": "None",
"material": "None",
"icd9": "uuuuu",
},
],
},
},
]
In Python, you can do something like this:
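arq = ['11111.pdf.json', '11112.pdf.json']
if len(arq) == len(jsv):
    # "doc" rather than "json" so the name does not shadow the json module
    for i, doc in enumerate(jsv):
        # each element holds a single top-level key such as "document_0"
        for key in doc:
            doc[key]['source_filename'] = arq[i]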
Make sure the list of files has the same length as the jsv list!
This results in the following jsv:
[
{
"document_0": {
"id": 111,
"laboratory": "xxx",
"document_type": "xxx",
"language": "pl",
"creation_date": "09-12-2022",
"source_filename": "11111.pdf.json",
"version": "0.1",
"exams_ocr_avg_confidence": 0.0,
"patient_data": {
"first_name": "YYYY",
"surname": "YYYY",
"pesel": "12345678901",
"birth_date": "1111-22-22",
"sex": "F",
"age": "None",
},
"exams": [
{
"name": "xx",
"sampling_date": "2020-11-30",
"comment": "None",
"confidence": 97,
"result": "222",
"unit": "ml",
"norm": "None",
"material": "None",
"icd9": "uuuuu",
}
],
}
},
{
"document_1": {
"id": 111,
"laboratory": "xxx",
"document_type": "xxx",
"language": "pl",
"creation_date": "09-12-2022",
"source_filename": "11112.pdf.json",
"version": "0.1",
"exams_ocr_avg_confidence": 0.0,
"patient_data": {
"first_name": "YYYY",
"surname": "YYYY",
"pesel": "12345678901",
"birth_date": "1111-22-22",
"sex": "F",
"age": "None",
},
"exams": [
{
"name": "xx",
"sampling_date": "2020-11-30",
"comment": "None",
"confidence": 97,
"result": "222",
"unit": "ml",
"norm": "None",
"material": "None",
"icd9": "uuuuu",
}
],
}
},
]
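As a variation on the loop above, here is a minimal sketch that pairs documents and filenames with zip and raises an error on a length mismatch instead of silently doing nothing. The function name `set_source_filenames` and the shortened `jsv_small` sample are hypothetical. It assumes the filenames are already in the same order as the documents; os.listdir returns entries in arbitrary order, so sorting the list first may be needed.

```python
def set_source_filenames(docs, filenames):
    """Set "source_filename" on every top-level document in each element of docs."""
    if len(docs) != len(filenames):
        raise ValueError(
            f"{len(docs)} documents but {len(filenames)} filenames"
        )
    for doc, filename in zip(docs, filenames):
        # Each element is a dict like {"document_0": {...}}; update every inner dict.
        for inner in doc.values():
            inner["source_filename"] = filename
    return docs

# Shortened sample with only the fields relevant here (hypothetical data).
jsv_small = [
    {"document_0": {"id": 111, "source_filename": "None"}},
    {"document_1": {"id": 111, "source_filename": "None"}},
]

set_source_filenames(jsv_small, ["11111.pdf.json", "11112.pdf.json"])
print(jsv_small)
```

This is still a single pass over the list, so it is unlikely to be faster than the explicit loop; the gain is mainly readability and an explicit failure when the two lists do not line up.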

Explode json without pandas

I have a JSON object:
{
"data": {
"geography": [
{
"id": "1",
"state": "USA",
"properties": [
{
"code": "CMD-01",
"value": "34"
},
{
"code": "CMD-02",
"value": "24"
}
]
},
{
"id": "2",
"state": "Canada",
"properties": [
{
"code": "CMD-04",
"value": "50"
},
{
"code": "CMD-05",
"value": "60"
}
]
}
]
}
}
I want to get the result as a new JSON, but without using pandas (and all those explode, flatten and normalize functions...). Is there any option to get this structure without using pandas or having an Out of memory issue?
The output should be:
{ "id": "1",
"state": "USA",
"code": "CMD-01",
"value": "34"
},
{ "id": "1",
"state": "USA",
"code": "CMD-02",
"value": "24",
},
{ "id": "2",
"state": "Canada",
"code": "CMD-04",
"value": "50"
},
{ "id": "2",
"state": "Canada",
"code": "CMD-05",
"value": "60"
},
You can simply loop over the list associated with "geography" and build new dictionaries that you will add to a newly created list:
dict_in = {
"data": {
"geography": [
{
"id": "1",
"state": "USA",
"properties": [
{
"code": "CMD-01",
"value": "34"
},
{
"code": "CMD-02",
"value": "24"
}
]
},
{
"id": "2",
"state": "Canada",
"properties": [
{
"code": "CMD-04",
"value": "50"
},
{
"code": "CMD-05",
"value": "60"
}
]
}
]
}
}
import json
rec_out = []
for obj in dict_in["data"]["geography"]:
    for prop in obj["properties"]:
        # start from the parent fields, then merge in the property's fields
        dict_out = {
            "id": obj["id"],
            "state": obj["state"]
        }
        dict_out.update(prop)
        rec_out.append(dict_out)
print(json.dumps(rec_out, indent=4))
Output:
[
{
"id": "1",
"state": "USA",
"code": "CMD-01",
"value": "34"
},
{
"id": "1",
"state": "USA",
"code": "CMD-02",
"value": "24"
},
{
"id": "2",
"state": "Canada",
"code": "CMD-04",
"value": "50"
},
{
"id": "2",
"state": "Canada",
"code": "CMD-05",
"value": "60"
}
]
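If holding the whole output list in memory is the concern, the same loop can be sketched as a generator that yields one flattened record at a time, for example to write one JSON object per line. The function name `explode` is hypothetical, and the sketch assumes every non-"properties" key of the parent should be carried over. The input is still loaded in full here; only the output is streamed.

```python
import json

def explode(geography):
    """Yield one flat dict per entry in each object's "properties" list."""
    for obj in geography:
        # Parent fields shared by every row produced from this object.
        parent = {k: v for k, v in obj.items() if k != "properties"}
        for prop in obj.get("properties", []):
            yield {**parent, **prop}

sample = {
    "data": {
        "geography": [
            {"id": "1", "state": "USA", "properties": [
                {"code": "CMD-01", "value": "34"},
                {"code": "CMD-02", "value": "24"}]},
            {"id": "2", "state": "Canada", "properties": [
                {"code": "CMD-04", "value": "50"},
                {"code": "CMD-05", "value": "60"}]},
        ]
    }
}

# Stream the records, one JSON object per line.
for record in explode(sample["data"]["geography"]):
    print(json.dumps(record))

# Materialize only if the full list is actually needed.
rows = list(explode(sample["data"]["geography"]))
```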

What is the best way for me to iterate over this dataset to return all matching values from another key value pair if I match a separate key?

I want to be able to search through this list (see bottom of post) of dicts (I think that is what this particular arrangement is called) to search for an ['address'] that matches '0xd2'. If that match is found, I want to return/print all the corresponding ['id']s.
So in this case I would like to return:
632, 315, 432
I'm able to extract individual values like this:
none = None
print(my_dict['result'][2]["id"])
432
I'm struggling with how to get a loop to do this properly.
{
"total": 4,
"page": 0,
"page_size": 100,
"result": [
{
"address": "0xd2",
"id": "632",
"amount": "1",
"name": "Avengers",
"group": "Marvel",
"uri": "https://google.com/",
"metadata": null,
"synced_at": "2022-05-26T22:52:34.113Z",
"last_sync": "2022-05-26T22:52:34.113Z"
},
{
"address": "0xd2",
"id": "315",
"amount": "1",
"name": "Avengers",
"group": "Marvel",
"uri": "https://google.com/",
"metadata": null,
"synced_at": "2022-05-26T22:52:34.113Z",
"last_sync": "2022-05-26T22:52:34.113Z"
},
{
"address": "0xd2",
"id": "432",
"amount": "1",
"name": "Avengers",
"group": "Marvel",
"uri": "https://google.com/",
"metadata": null,
"synced_at": "2022-05-26T22:52:34.113Z",
"last_sync": "2022-05-26T22:52:34.113Z"
},
{
"address": "0x44",
"id": "100",
"amount": "1",
"name": "Suicide Squad",
"group": "DC",
"uri": "https://google.com/",
"metadata": null,
"synced_at": "2022-05-26T22:52:34.113Z",
"last_sync": "2022-05-26T22:52:34.113Z"
}
],
"status": "SYNCED"
}
Welcome to StackOverflow.
You can try list comprehension:
[res["id"] for res in my_dict["result"] if res["address"] == "0xd2"]
If you'd like to use a for loop:
l = []
for res in my_dict["result"]:
    if res["address"] == "0xd2":
        l.append(res["id"])
You can use a list comprehension.
import json
json_string = """{
"total": 4,
"page": 0,
"page_size": 100,
"result": [
{
"address": "0xd2",
"id": "632",
"amount": "1",
"name": "Avengers",
"group": "Marvel",
"uri": "https://google.com/",
"metadata": null,
"synced_at": "2022-05-26T22:52:34.113Z",
"last_sync": "2022-05-26T22:52:34.113Z"
},
{
"address": "0xd2",
"id": "315",
"amount": "1",
"name": "Avengers",
"group": "Marvel",
"uri": "https://google.com/",
"metadata": null,
"synced_at": "2022-05-26T22:52:34.113Z",
"last_sync": "2022-05-26T22:52:34.113Z"
},
{
"address": "0xd2",
"id": "432",
"amount": "1",
"name": "Avengers",
"group": "Marvel",
"uri": "https://google.com/",
"metadata": null,
"synced_at": "2022-05-26T22:52:34.113Z",
"last_sync": "2022-05-26T22:52:34.113Z"
},
{
"address": "0x44",
"id": "100",
"amount": "1",
"name": "Suicide Squad",
"group": "DC",
"uri": "https://google.com/",
"metadata": null,
"synced_at": "2022-05-26T22:52:34.113Z",
"last_sync": "2022-05-26T22:52:34.113Z"
}
],
"status": "SYNCED"
}"""
json_dict = json.loads(json_string)
result = [elem['id'] for elem in json_dict['result'] if elem['address'] == '0xd2']
print(result)
Output:
['632', '315', '432']
This would store the associated ids in the list:
ids = []
for r in dataset.get('result'):
    if r.get('address') == '0xd2':
        ids.append(r.get('id'))

Getting all the Keys from JSON Object?

Goal: to create a script that takes a nested JSON object as input and outputs a CSV file with all keys as rows.
Example:
{
"Document": {
"DocumentType": 945,
"Version": "V007",
"ClientCode": "WI",
"Shipment": [
{
"ShipmentHeader": {
"ShipmentID": 123456789,
"OrderChannel": "Shopify",
"CustomerNumber": 234234,
"VendorID": "2343SDF",
"ShipViaCode": "FEDX2D",
"AsnDate": "2018-01-27",
"AsnTime": "09:30:47-08:00",
"ShipmentDate": "2018-01-23",
"ShipmentTime": "09:30:47-08:00",
"MBOL": 12345678901234568,
"BOL": 12345678901234566,
"ShippingNumber": "1ZTESTTEST",
"LoadID": 321456987,
"ShipmentWeight": 10,
"ShipmentCost": 2.3,
"CartonsTotal": 2,
"CartonPackagingCode": "CTN25",
"OrdersTotal": 2
},
"References": [
{
"Reference": {
"ReferenceQualifier": "TST",
"ReferenceText": "Testing text"
}
}
],
"Addresses": {
"Address": [
{
"AddressLocationQualifier": "ST",
"LocationNumber": 23234234,
"Name": "John Smith",
"Address1": "123 Main St",
"Address2": "Suite 12",
"City": "Hometown",
"State": "WA",
"Zip": 92345,
"Country": "USA"
},
{
"AddressLocationQualifier": "BT",
"LocationNumber": 2342342,
"Name": "Jane Smith",
"Address1": "345 Second Ave",
"Address2": "Building 32",
"City": "Sometown",
"State": "CA",
"Zip": "23665-0987",
"Country": "USA"
}
]
},
"Orders": {
"Order": [
{
"OrderHeader": {
"PurchaseOrderNumber": 23456342,
"RetailerPurchaseOrderNumber": 234234234,
"RetailerOrderNumber": 23423423,
"CustomerOrderNumber": 234234234,
"Department": 3333,
"Division": 23423,
"OrderWeight": 10.23,
"CartonsTotal": 2,
"QTYOrdered": 12,
"QTYShipped": 23
},
"Cartons": {
"Carton": [
{
"SSCC18": 12345678901234567000,
"TrackingNumber": "1ZTESTTESTTEST",
"CartonContentsQty": 10,
"CartonWeight": 10.23,
"LineItems": {
"LineItem": [
{
"LineNumber": 1,
"ItemNumber": 1234567890,
"UPC": 9876543212,
"QTYOrdered": 34,
"QTYShipped": 32,
"QTYUOM": "EA",
"Description": "Shoes",
"Style": "Tall",
"Size": 9.5,
"Color": "Bllack",
"RetailerItemNumber": 2342333,
"OuterPack": 10
},
{
"LineNumber": 2,
"ItemNumber": 987654321,
"UPC": 7654324567,
"QTYOrdered": 12,
"QTYShipped": 23,
"QTYUOM": "EA",
"Description": "Sunglasses",
"Style": "Short",
"Size": 10,
"Color": "White",
"RetailerItemNumber": 565465456,
"OuterPack": 12
}
]
}
}
]
}
}
]
}
}
]
}
}
In the above JSON object, I want all the keys (nested ones included) in a list (duplicates can be removed by using a set data structure). If a nested key occurs multiple times in the actual JSON, it can appear multiple times in the CSV.
I personally feel that recursion is a perfect application for this type of problem if the amount of nests you will encounter is unpredictable. Here I have written an example in Python of how you can utilise recursion to extract all keys. Cheers.
import json
row = ""
def extract_keys(data):
    global row
    if isinstance(data, dict):
        for key, value in data.items():
            row += key + "\n"
            extract_keys(value)
    elif isinstance(data, list):
        for element in data:
            extract_keys(element)
# MAIN
with open("input.json", "r") as rfile:
    dicts = json.load(rfile)
extract_keys(dicts)
with open("output.csv", "w") as wfile:
    wfile.write(row)
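A variation on the recursion above, sketched without the global variable: a generator yields each key as it is visited, and the csv module writes one key per row. The name `iter_keys` and the cut-down `sample` (including its second shipment) are hypothetical. Duplicates are dropped here with dict.fromkeys, which keeps first-seen order; skip that step if repeated keys should stay in the CSV.

```python
import csv
import io

def iter_keys(data):
    """Recursively yield every key found in nested dicts and lists."""
    if isinstance(data, dict):
        for key, value in data.items():
            yield key
            yield from iter_keys(value)
    elif isinstance(data, list):
        for element in data:
            yield from iter_keys(element)

# Cut-down sample; the second shipment is made up to show duplicate keys.
sample = {
    "Document": {
        "DocumentType": 945,
        "Shipment": [
            {"ShipmentHeader": {"ShipmentID": 123456789}},
            {"ShipmentHeader": {"ShipmentID": 987}},
        ],
    }
}

all_keys = list(iter_keys(sample))            # keys in visit order, duplicates kept
unique_keys = list(dict.fromkeys(all_keys))   # duplicates removed, order preserved

# Written to an in-memory buffer here; open("output.csv", "w", newline="") works the same way.
buffer = io.StringIO()
writer = csv.writer(buffer)
for key in unique_keys:
    writer.writerow([key])
csv_text = buffer.getvalue()
print(csv_text)
```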

How to use S3 Select for Nested Parquet Objects

I have dumped data into a parquet file.
When I use
SELECT * FROM s3object s LIMIT 1
it gives me the following result.
{
"name": "John",
"age": "45",
"country": "USA",
"experience": [{
"company": {
"name": "ABC",
"years": "10",
"position": "Manager"
}
},
{
"company": {
"name": "BBC",
"years": "2",
"position": "Assistant"
}
}
]
}
I want to filter the result where company.name = "ABC"
So the output should look like the following:
{
"name": "John",
"age": "45",
"country": "USA",
"experience": [{
"company": {
"name": "ABC",
"years": "10",
"position": "Manager"
}
}
]
}
or this
{
"name": "John",
"age": "45",
"country": "USA",
"experience.company.name": "ABC",
"experience.company.years": "10",
"experience.company.position": "Manager"
}
Any support is highly appreciated.
Thanks.
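Whether S3 Select can apply this filter server-side on nested Parquet data is not settled here. As a fallback, the sketch below is plain client-side Python, not S3 Select syntax: it post-processes a record already returned by the query and produces both requested shapes. The function names `filter_experience` and `flatten_match` are hypothetical.

```python
def filter_experience(record, company_name):
    """Return a copy of record keeping only experience entries for company_name."""
    filtered = dict(record)  # shallow copy; the original record is left untouched
    filtered["experience"] = [
        entry for entry in record.get("experience", [])
        if entry.get("company", {}).get("name") == company_name
    ]
    return filtered

def flatten_match(record, company_name):
    """Return a flat dict using the first matching experience entry, if any."""
    filtered = filter_experience(record, company_name)
    flat = {k: v for k, v in filtered.items() if k != "experience"}
    if filtered["experience"]:
        for field, value in filtered["experience"][0]["company"].items():
            flat[f"experience.company.{field}"] = value
    return flat

record = {
    "name": "John",
    "age": "45",
    "country": "USA",
    "experience": [
        {"company": {"name": "ABC", "years": "10", "position": "Manager"}},
        {"company": {"name": "BBC", "years": "2", "position": "Assistant"}},
    ],
}

nested = filter_experience(record, "ABC")
flat = flatten_match(record, "ABC")
print(nested)
print(flat)
```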
