I am trying to convert an Excel to nested JSON using Python where the repeated values go in as an array of elements.
Ex: structure of CSV
Manufacturer,oilType,viscosity
shell,superOil,1ova
shell,superOil,2ova
shell,normalOil,1ova
bp, power, 10bba
Should be displayed in JSON (expected output) as
elements: [
{
"Manufacturer": "shell",
"details": [
{
"OilType": "superOil",
"Viscosity": [
"1ova",
"2ova"
]
},
{
"OilType": "normalOil",
"Viscosity": [
"1ova"
]
}
]
},
{
"Manufacturer": "bp",
"details": [
{
"OilType": "power",
"Viscosity": [
"10bba"
]
}
]
}
]
I have currently converted the CSV into JSON using openpyxl and the values are displayed for each of the headers in format like (Current output)
[{Manufacturer: "shell", oilType: "superOil", Viscosity:"1ova"},{...},{...},...]
Please help in getting the expected output.
Hi and welcome to StackOverflow.
Your question has actually nothing to do with openpyxl because you don't need to save into an Excel file.
You can do thought:
Load the csv (or Excel) into a pandas DataFrame
Group by Manufacturer and oil type
Dump into the format you want
Transform to JSON (either string or file)
In practice, that gives something like that:
import json
import pandas as pd
df = pd.read_csv("oil.csv") # or read_excel if this is an Excel
oils = df.groupby(["Manufacturer", "oilType"]).aggregate(pd.Series.to_list)
elements = [
{
"Manufacturer": manufacturer,
"Details": [
{"OilType": o, "Viscosity": v}
for o, v in data.droplevel(0).viscosity.items()
],
}
for manufacturer, data in oils.groupby(level="Manufacturer")
]
with open("oil.json", "w") as f:
json.dump({"elements": elements}, f)
For information, oils would look like this:
viscosity
Manufacturer oilType
bp power [10bba]
shell normalOil [1ova]
superOil [1ova, 2ova]
Related
I want to convert csv file to json file.
I have large data in csv file.
CSV Column Structure
This is my column structure in csv file . I has 200+ records.
id.oid libId personalinfo.Name personalinfo.Roll_NO personalinfo.addr personalinfo.marks.maths personalinfo.marks.physic clginfo.clgName clginfo.clgAddr clginfo.haveCert clginfo.certNo clginfo.certificates.cert_name_1 clginfo.certificates.cert_no_1 clginfo.certificates.cert_exp_1 clginfo.certificates.cert_name_2 clginfo.certificates.cert_no_2 clginfo.certificates.cert_exp_2 clginfo.isDept clginfo.NoofDept clginfo.DeptDetails.DeptName_1 clginfo.DeptDetails.location_1 clginfo.DeptDetails.establish_date_1 _v updatedAt.date
Expected Json
[{
"id":
{
"$oid": "00001"
},
"libId":11111,
"personalinfo":
{
"Name":"xyz",
"Roll_NO":101,
"addr":"aa bb cc ddd",
"marks":
[
"maths":80,
"physic":90
.....
]
},
"clginfo"
{
"clgName":"pqr",
"clgAddr":"qwerty",
"haveCert":true, //this is boolean true or false
"certNo":1, //this could be 1-10
"certificates":
[
{
"cert_name_1":"xxx",
"cert_no_1":12345,
"cert_exp.1":"20/2/20202"
},
{
"cert_name_2":"xxx",
"cert_no_2":12345,
"cert_exp_2":"20/2/20202"
},
......//could be up to 10
],
"isDept":true, //this is boolean true or false
"NoofDept":1 , //this could be 1-10
"DeptDetails":
[
{
"DeptName_1":"yyy",
"location_1":"zzz",
"establish_date_1":"1/1/1919"
},
......//up to 10 records
]
},
"__v": 1,
"updatedAt":
{
"$date": "2022-02-02T13:35:59.843Z"
}
}]
I have tried using pandas but I'm getting output as
My output
[{
"id.$oid": "00001",
"libId":11111,
"personalinfo.Name":"xyz",
"personalinfo.Roll_NO":101,
"personalinfo.addr":"aa bb cc ddd",
"personalinfo.marks.maths":80,
"personalinfo.marks.physic":90,
"clginfo.clgName":"pqr",
"clginfo.clgAddr":"qwerty",
"clginfo.haveCert":true,
"clginfo.certNo":1,
"clginfo.certificates.cert_name_1":"xxx",
"clginfo.certificates.cert_no_1":12345,
"clginfo.certificates.cert_exp.1":"20/2/20202"
"clginfo.certificates.cert_name_2":"xxx",
"clginfo.certificates.cert_no_2":12345,
"clginfo.certificates.cert_exp_2":"20/2/20202"
"clginfo.isDept":true,
"clginfo.NoofDept":1 ,
"clginfo.DeptDetails.DeptName_1":"yyy",
"clginfo.DeptDetails.location_1":"zzz",
"eclginfo.DeptDetails.stablish_date_1":"1/1/1919",
"__v": 1,
"updatedAt.$date": "2022-02-02T13:35:59.843Z",
}]
I am new to python I only know the basic Please help me getting this output.
200+ records is really tiny, so even naive solution is good.
It can't be totally generic because I don't see how it can be seen from the headers that certificates is a list, unless we rely on all names under certificates having _N at the end.
Proposed solution using only basic python:
read header row - split all column names by period. Iterate over resulting list and create nested dicts with appropriate keys and dummy values (if you want to handle lists: create array if current key ends with _N and use N as an index)
for all rows:
clone dictionary with dummy values
for each column use split keys from above to put the value into the corresponding dict. same solution from above for lists.
append the dictionary to list of rows
I used to have a simple row and column format in table and was reading it by pandassql .
but if you have structure like below and want to get age>10 from this , how do I get it using pandassql?
[ {
"response":{
"version":"1.1",
"token":"dsfgf",
"body":{
"customer":{
"customer_id":"1234567",
"verified":"true"
},
"contact":{
"email":"mr#abc.com",
"mobile_number":"0123456789"
},
"personal":{
"gender": "m",
"title":"Dr.",
"last_name":"Muster",
"first_name":"Max",
"family_status":"single",
"dob":"1985-12-23",
}
}
} ]
This is answered here:
Pandas read nested json
You can use json_normalize
There is also a full description of the dame issue here:
https://medium.com/swlh/converting-nested-json-structures-to-pandas-dataframes-e8106c59976e
I have some dynamically generated nested json that I want to convert to a CSV file using python. I am trying to use pandas for this. My question is - is there a way to use this and flatten the json data to put in the csv without knowing the json keys that need flattened in advance? An example of my data is this:
{
"reports": [
{
"name": "report_1",
"details": {
"id": "123",
"more info": "zyx",
"people": [
"person1",
"person2"
]
}
},
{
"name": "report_2",
"details": {
"id": "123",
"more info": "zyx",
"actions": [
"action1",
"action2"
]
}
}
]
}
More nested json objects can be dynamically generated in the "details" section that I do not know about in advance but need to be represented in their own cell in the csv.
For the above example, I'd want the csv to look something like this:
Name, Id, More Info, People_1, People_2, Actions_1, Actions_2
report_1, 123, zxy, person1, person2, ,
report_2, 123, zxy , , , action1 , action2
Here's the code I have:
data = json.loads('{"reports": [{"name": "report_1","details": {"id": "123","more info": "zyx","people": ["person1","person2"]}},{"name": "report_2","details": {"id": "123","more info": "zyx","actions": ["action1","action2"]}}]}')
df = pd.json_normalize(data['reports'])
df.to_csv("test.csv")
And here is the outcome currently:
,name,details.id,details.more info,details.people,details.actions
0,report_1,123,zyx,"['person1', 'person2']",
1,report_2,123,zyx,,"['action1', 'action2']"
I think what your are lookig for is:
https://pandas.pydata.org/docs/reference/api/pandas.json_normalize.html
If using pandas doesn't work for you, here's the more canonical Python way of doing it.
You're trying to write out a CSV file, and that implicitly means you must write out a header containing all the keys.
The constraint that you don't know the keys in advance means you can't do this in a single pass.
def convert_record_to_flat_dict(record):
# You need to figure out exactly how you want to do this; everything
# should be strings.
record.update(record.pop('details'))
return record
header = {}
rows = [[]] # Leave the header row blank for now.
csv_out = csv.writer(buffer)
for record in report:
record = convert_record_to_flat_dict(record)
for key in record.keys():
if key not in header:
header[key] = len(header)
rows[0].append(key)
row = [''] * len(header)
for key, index in header.items():
row[index] = record.get(key, '')
rows.append(row)
# And you can go back to ensure all rows have the same number of keys:
for row in rows:
row.extend([''] * (len(row) - len(header)))
Now you have a list of lists that's ready to be sent to csv.csvwriter() or the like.
If memory is an issue, another technique is to write out a temporary file and then reprocess it once you know the header.
I have one json payload which is used for one service request. After processing that payload(JSON) will be stored in S3 and through Athena we can download those data in CSV format. Now in the actual scenario, there are more than 100 fields. I want to verify their value through some automated script instead of manual.
say my sample payload is similar to the following:
{
"BOOK": {
"serialno": "123",
"author": "xyz",
"yearofpublish": "2015",
"price": "16"
}, "Author": [
{
"isbn": "xxxxx", "title": "first", "publisher": "xyz", "year": "2020"
}, {
"isbn": "yyyy", "title": "second", "publisher": "zmy", "year": "2019"
}
]
}
the sample csv will be like following:
Can anyone please help me how exactly I can do it on Python? Maybe the library or dictionary?
it looks like you just want to flatten out the JSON structure. It'll be easiest to loop over the "Author" list. Since the CSV has renamed the columns you'll need some way to represent that mapping. Based only on example, this works:
import json
fin=open(some_json_file, 'r')
j=json.load(fin)
result=[]
for author in j['Author']:
val = {'book_serialno': j['BOOK']['serialno'],
'book_author': j['BOOK']['author'],
'book_yearofpublish': j['BOOK']['yearofpublish'],
'book_price': j['BOOK']['price'],
'author_isbn': author['isbn'],
'author_title': author['title'],
'author_publisher': author['publisher'],
'author_year': author['year']}
result.append(val)
This is using a dictionary to show the mapping of data points to the new column names. You might be able to get away with using a list as well. Depends how you want to use it later on. To write to a CSV:
import csv
fout=open(some_csv_file, 'w')
writer=csv.writer(fout)
writer.writerow(result[0].keys())
writer.writerows(r.values() for r in result)
This writes the column names in the first row, then the data. If you don't want the column names, just leave out the writerow(...) line.
There are lots of question about json to pandas dataframe but none of them solved my issues. I am practicing on this complex json file which looks like this
{
"type" : "FeatureCollection",
"features" : [ {
"Id" : 265068000,
"type" : "Feature",
"geometry" : {
"type" : "Point",
"coordinates" : [ 22.170376666666666, 65.57273333333333 ]
},
"properties" : {
"timestampExternal" : 1529151039629
}
}, {
"Id" : 265745760,
"type" : "Feature",
"geometry" : {
"type" : "Point",
"coordinates" : [ 20.329506666666667, 63.675425000000004 ]
},
"properties" : {
"timestampExternal" : 1529151278287
}
} ]
}
I want to convert this json directly to pandas dataframe using pd.read_json() My Primary Goal is to extract Id, Coordinates and timestampExternal. As this is very complex json, normal way of pd.read_json(), simply doesnt give correct output. Can you suggest me, how can i approach to solve in this kind of situations. Expected output is something like this
Id,Coordinates,timestampExternal
265068000,[22.170376666666666, 65.57273333333333],1529151039629
265745760,[20.329506666666667, 63.675425000000004],1529151278287
You can read the json to load it into a dictionary. Then, using dictionary comprehension, extract the attributes you want as columns -
import json
import pandas as pd
_json = json.load(open('/path/to/json'))
df_dict = [{'id':item['Id'], 'coordinates':item['geometry']['coordinates'],
'timestampExternal':item['properties']['timestampExternal']} for item in _json['features']]
extracted_df = pd.DataFrame(df_dict)
>>>
coordinates id timestampExternal
0 [22.170376666666666, 65.57273333333333] 265068000 1529151039629
1 [20.329506666666667, 63.675425000000004] 265745760 1529151278287
You can read the json directly, and then given the features array to pandas as a dict like:
Code:
import json
with open('test.json', 'rU') as f:
data = json.load(f)
df = pd.DataFrame([dict(id=datum['Id'],
coords=datum['geometry']['coordinates'],
ts=datum['properties']['timestampExternal'],
)
for datum in data['features']])
print(df)
Results:
coords id ts
0 [22.170376666666666, 65.57273333333333] 265068000 1529151039629
1 [20.329506666666667, 63.675425000000004] 265745760 1529151278287