Convert a MultiIndex pandas DataFrame to a nested JSON

I have the following Dataframe with MultiIndex rows in pandas.
               time  available_slots     status
month day
1     1    10:00:00                1  AVAILABLE
      1    12:00:00                1  AVAILABLE
      1    14:00:00                1  AVAILABLE
      1    16:00:00                1  AVAILABLE
      1    18:00:00                1  AVAILABLE
      2    10:00:00                1  AVAILABLE
...             ...              ...        ...
2     28   12:00:00                1  AVAILABLE
      28   14:00:00                1  AVAILABLE
      28   16:00:00                1  AVAILABLE
      28   18:00:00                1  AVAILABLE
      28   20:00:00                1  AVAILABLE
I need to transform it into a hierarchical, nested JSON like this:
[
{
"month": 1,
"days": [
{
"day": 1,
"slots": [
{
"time": "10:00:00",
"available_slots": 1,
"status": "AVAILABLE"
},
{
"time": "12:00:00",
"available_slots": 1,
"status": "AVAILABLE"
},
...
]
},
{
"day": 2,
"slots": [
...
]
}
]
},
{
"month": 2,
"days":[
{
"day": 1,
"slots": [
...
]
}
]
},
...
]
Unfortunately, it is not as easy as calling df.to_json(orient="index").
Does anyone know whether pandas has a method for this kind of transformation, or how I could iterate over the DataFrame to build the final object?

Here's one way: repeat groupby + apply(to_dict) + reset_index, once per index level, until the data has the desired shape:
out = (df.groupby(level=[0, 1])
         .apply(lambda x: x.to_dict('records'))   # one list of slot records per (month, day)
         .reset_index()
         .rename(columns={0: 'slots'})
         .groupby('month')
         .apply(lambda x: x[['day', 'slots']].to_dict('records'))   # one list of days per month
         .reset_index()
         .rename(columns={0: 'days'})
         .to_json(orient='records', indent=2)
      )
Output:
[
{
"month":1,
"days":[
{
"day":1,
"slots":[
{
"time":"10:00:00",
"available_slots":1,
"status":"AVAILABLE"
},
{
"time":"12:00:00",
"available_slots":1,
"status":"AVAILABLE"
},
{
"time":"14:00:00",
"available_slots":1,
"status":"AVAILABLE"
},
{
"time":"16:00:00",
"available_slots":1,
"status":"AVAILABLE"
},
{
"time":"18:00:00",
"available_slots":1,
"status":"AVAILABLE"
}
]
},
{
"day":2,
"slots":[
{
"time":"10:00:00",
"available_slots":1,
"status":"AVAILABLE"
}
]
}
]
},
{
"month":2,
"days":[
{
"day":28,
"slots":[
{
"time":"12:00:00",
"available_slots":1,
"status":"AVAILABLE"
},
{
"time":"14:00:00",
"available_slots":1,
"status":"AVAILABLE"
},
{
"time":"16:00:00",
"available_slots":1,
"status":"AVAILABLE"
},
{
"time":"18:00:00",
"available_slots":1,
"status":"AVAILABLE"
},
{
"time":"20:00:00",
"available_slots":1,
"status":"AVAILABLE"
}
]
}
]
}
]

You can use a nested loop, with one loop per level of your index:
data = []
for month, df1 in df.groupby(level=0):
    # start a new month entry, then fill in its days
    data.append({'month': month, 'days': []})
    for day, df2 in df1.groupby(level=1):
        data[-1]['days'].append({'day': day, 'slots': df2.to_dict('records')})
Output:
import json
print(json.dumps(data, indent=2))
[
{
"month": 1,
"days": [
{
"day": 1,
"slots": [
{
"time": "10:00:00",
"available_slots": 1,
"status": "AVAILABLE"
},
{
"time": "12:00:00",
"available_slots": 1,
"status": "AVAILABLE"
},
{
"time": "14:00:00",
"available_slots": 1,
"status": "AVAILABLE"
},
{
"time": "16:00:00",
"available_slots": 1,
"status": "AVAILABLE"
},
{
"time": "18:00:00",
"available_slots": 1,
"status": "AVAILABLE"
}
]
},
{
"day": 2,
"slots": [
{
"time": "10:00:00",
"available_slots": 1,
"status": "AVAILABLE"
}
]
}
]
},
{
"month": 2,
"days": [
{
"day": 28,
"slots": [
{
"time": "12:00:00",
"available_slots": 1,
"status": "AVAILABLE"
},
{
"time": "14:00:00",
"available_slots": 1,
"status": "AVAILABLE"
},
{
"time": "18:00:00",
"available_slots": 1,
"status": "AVAILABLE"
},
{
"time": "20:00:00",
"available_slots": 1,
"status": "AVAILABLE"
}
]
}
]
}
]
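
Both answers above hard-code two index levels. As a rough sketch (not part of either answer), the same grouping can be written recursively so that it works for any number of levels. The helper name nest, the spec argument, and the four-row sample frame are assumptions made for illustration only.

```python
import json
import pandas as pd

# Small stand-in for the DataFrame in the question (assumed sample data).
index = pd.MultiIndex.from_tuples(
    [(1, 1), (1, 1), (1, 2), (2, 28)], names=["month", "day"]
)
df = pd.DataFrame(
    {
        "time": ["10:00:00", "12:00:00", "10:00:00", "12:00:00"],
        "available_slots": [1, 1, 1, 1],
        "status": ["AVAILABLE"] * 4,
    },
    index=index,
)


def nest(frame, spec):
    """Group `frame` by each index level in `spec`, outermost first.

    `spec` is a list of (level_name, children_key) pairs (hypothetical helper).
    """
    if not spec:
        # No levels left: emit the plain row records.
        return frame.to_dict("records")
    level, children_key = spec[0]
    result = []
    for key, group in frame.groupby(level=level):
        # Convert NumPy scalars to plain Python values so json.dumps accepts them.
        key = key.item() if hasattr(key, "item") else key
        result.append({level: key, children_key: nest(group, spec[1:])})
    return result


nested = nest(df, [("month", "days"), ("day", "slots")])
print(json.dumps(nested, indent=2))
```

Adding a third index level would then only require one more pair in spec rather than another loop.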

Related

Python Flattening Complex JSON

I have a LinkedIn dataset of follower statistics in the following JSON (I removed many key-value pairs to keep it readable). Each key holds a different number of inner key-value pairs.
Can somebody help me convert this to CSV output using Python?
{
"paging": { "start": 0, "count": 10, "links": [] },
"elements": [
{
"followerCountsByAssociationType": [
{
"followerCounts": {
"organicFollowerCount": 2775,
"paidFollowerCount": 0
}
},
{
"followerCounts": {
"organicFollowerCount": 13,
"paidFollowerCount": 0
},
"associationType": "EMPLOYEE"
}
],
"followerCountsByRegion": [
{
"region": "urn:li:region:7312",
"followerCounts": {
"organicFollowerCount": 2,
"paidFollowerCount": 0
}
},
{
"region": "urn:li:region:6981",
"followerCounts": {
"organicFollowerCount": 2,
"paidFollowerCount": 0
}
},
{
"region": "urn:li:region:620",
"followerCounts": {
"organicFollowerCount": 2,
"paidFollowerCount": 0
}
}
],
"followerCountsBySeniority": [
{
"followerCounts": {
"organicFollowerCount": 12,
"paidFollowerCount": 0
},
"seniority": "urn:li:seniority:8"
},
{
"followerCounts": {
"organicFollowerCount": 5,
"paidFollowerCount": 0
},
"seniority": "urn:li:seniority:9"
},
{
"followerCounts": {
"organicFollowerCount": 2,
"paidFollowerCount": 0
},
"seniority": "urn:li:seniority:1"
}
],
"followerCountsByIndustry": [
{
"followerCounts": {
"organicFollowerCount": 1,
"paidFollowerCount": 0
},
"industry": "urn:li:industry:51"
},
{
"followerCounts": {
"organicFollowerCount": 1,
"paidFollowerCount": 0
},
"industry": "urn:li:industry:74"
},
{
"followerCounts": {
"organicFollowerCount": 1,
"paidFollowerCount": 0
},
"industry": "urn:li:industry:77"
},
{
"followerCounts": {
"organicFollowerCount": 1,
"paidFollowerCount": 0
},
"industry": "urn:li:industry:78"
}
],
"followerCountsByFunction": [
{
"followerCounts": {
"organicFollowerCount": 3,
"paidFollowerCount": 0
},
"function": "urn:li:function:14"
},
{
"followerCounts": {
"organicFollowerCount": 3,
"paidFollowerCount": 0
},
"function": "urn:li:function:21"
},
{
"followerCounts": {
"organicFollowerCount": 2,
"paidFollowerCount": 0
},
"function": "urn:li:function:11"
},
{
"followerCounts": {
"organicFollowerCount": 2,
"paidFollowerCount": 0
},
"function": "urn:li:function:17"
},
{
"followerCounts": {
"organicFollowerCount": 2,
"paidFollowerCount": 0
},
"function": "urn:li:function:1"
}
],
"followerCountsByStaffCountRange": [
{
"followerCounts": {
"organicFollowerCount": 267,
"paidFollowerCount": 0
},
"staffCountRange": "SIZE_1001_TO_5000"
},
{
"followerCounts": {
"organicFollowerCount": 185,
"paidFollowerCount": 0
},
"staffCountRange": "SIZE_201_TO_500"
},
{
"followerCounts": {
"organicFollowerCount": 131,
"paidFollowerCount": 0
},
"staffCountRange": "SIZE_501_TO_1000"
},
{
"followerCounts": {
"organicFollowerCount": 81,
"paidFollowerCount": 0
},
"staffCountRange": "SIZE_5001_TO_10000"
},
{
"followerCounts": {
"organicFollowerCount": 74,
"paidFollowerCount": 0
},
"staffCountRange": "SIZE_2_TO_10"
},
{
"followerCounts": {
"organicFollowerCount": 10,
"paidFollowerCount": 0
},
"staffCountRange": "SIZE_1"
}
],
"followerCountsByCountry": [
{
"followerCounts": {
"organicFollowerCount": 1,
"paidFollowerCount": 0
},
"country": "urn:li:country:es"
},
{
"followerCounts": {
"organicFollowerCount": 1,
"paidFollowerCount": 0
},
"country": "urn:li:country:ph"
},
{
"followerCounts": {
"organicFollowerCount": 1,
"paidFollowerCount": 0
},
"country": "urn:li:country:ng"
}
],
"organizationalEntity": "urn:li:organization:28849398"
}
]
}
I tried json_normalize(data['elements']), but that did not give the output I wanted.
I am sure there must be some parameter of json_normalize() that can flatten the inner nesting.
The desired output is as follows:
FollowerCountByAssociationOrganicFollowerCount | FollowerCountByAssociationPaidFollowerCount | AssociationType | Region | RegionOrganicFollowerCount | RegionPaidFollowerCount
2775 | 0 | Employee | urn:li:region:7312 | 2 | 0
 |  |  | urn:li:region:6981 | 2 | 0
... and so on
I have only shown a small part of the output. The rows continue for as many entries as the longest list has, and the columns that come from shorter lists are null in the remaining rows.
I would appreciate any help. Thanks!
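
One possible approach, sketched here under assumptions rather than taken from an answer: normalize each inner list on its own with pd.json_normalize, prefix the columns with the list's key, and concatenate the pieces side by side so that shorter lists are padded with nulls. The function name flatten_element and the trimmed sample data are illustrative only, and the resulting column names follow the JSON keys rather than the exact headers in the table above.

```python
import pandas as pd

# Trimmed-down stand-in for the response in the question (assumed shape).
data = {
    "elements": [
        {
            "followerCountsByAssociationType": [
                {"followerCounts": {"organicFollowerCount": 2775, "paidFollowerCount": 0}},
                {
                    "followerCounts": {"organicFollowerCount": 13, "paidFollowerCount": 0},
                    "associationType": "EMPLOYEE",
                },
            ],
            "followerCountsByRegion": [
                {
                    "region": "urn:li:region:7312",
                    "followerCounts": {"organicFollowerCount": 2, "paidFollowerCount": 0},
                },
                {
                    "region": "urn:li:region:6981",
                    "followerCounts": {"organicFollowerCount": 2, "paidFollowerCount": 0},
                },
                {
                    "region": "urn:li:region:620",
                    "followerCounts": {"organicFollowerCount": 2, "paidFollowerCount": 0},
                },
            ],
            "organizationalEntity": "urn:li:organization:28849398",
        }
    ]
}


def flatten_element(element):
    """Build one wide table from every list-valued key of `element`."""
    frames = []
    for key, value in element.items():
        if isinstance(value, list):  # scalar keys such as organizationalEntity are skipped
            frame = pd.json_normalize(value)
            frame.columns = [f"{key}.{col}" for col in frame.columns]
            frames.append(frame)
    # Rows are aligned by position; shorter lists are padded with NaN.
    return pd.concat(frames, axis=1)


flat = flatten_element(data["elements"][0])
csv_text = flat.to_csv(index=False)
print(csv_text)
```

Passing a file path to to_csv instead of printing the text would write the CSV to disk.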

How to eliminate duplicate items while adding them to their own structure

I have a list of dictionary items, with each dictionary containing a list of presentation items. The sample dictionaries below are a small prototype of my real data set.
I need to remove duplicate presentations based on day (one presentation per day) and store them in a new dictionary with the same structure within the existing list.
So starting with:
[
{
"time": "04:00-20:59",
"category": 1,
"presentations": [
{
"presentation": "ABC",
"day": 7,
},
{
"presentation": "DEF",
"day": 7,
},
{
"presentation": "GHI",
"day": 8,
},
{
"presentation": "JKL",
"day": 8,
},
{
"presentation": "MNO",
"day": 9,
},
{
"presentation": "PQR",
"day": 9,
},
{
"presentation": "STU",
"day": 9,
}
]
} #only one dictionary item in the list for simplicity
]
The end result should be three dictionaries containing lists of presentations where there is one presentation for a given day:
[
{
"time": "04:00-20:59",
"category": 1,
"presentations": [
{
"presentation": "ABC",
"day": 7
},
{
"presentation": "DEF",
"day": 8
},
{
"presentation": "GHI",
"day": 9
}
]
},
{
"time": "04:00-20:59",
"category": 1,
"presentations": [
{
"presentation": "JKL",
"day": 7
},
{
"presentation": "MNO",
"day": 8
},
{
"presentation": "PQR",
"day": 9
}
]
},
{
"time": "04:00-20:59",
"category": 1,
"presentations": [
{
"presentation": "STU",
"day": 9
}
]
}
]
I don't know how to go about removing these duplicates (based on day) while adding them to their own dictionary.
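
A minimal sketch of one reading of this requirement: the n-th presentation seen for each day goes into the n-th output dictionary, so no dictionary holds two presentations on the same day. It assumes each presentation keeps its original day, which gives the groups ABC/GHI/MNO, DEF/JKL/PQR and STU for the sample input; the function name split_by_day is hypothetical.

```python
from collections import defaultdict

items = [
    {
        "time": "04:00-20:59",
        "category": 1,
        "presentations": [
            {"presentation": "ABC", "day": 7},
            {"presentation": "DEF", "day": 7},
            {"presentation": "GHI", "day": 8},
            {"presentation": "JKL", "day": 8},
            {"presentation": "MNO", "day": 9},
            {"presentation": "PQR", "day": 9},
            {"presentation": "STU", "day": 9},
        ],
    }
]


def split_by_day(entries):
    """Split each entry so that every output dict has at most one presentation per day."""
    result = []
    for entry in entries:
        seen = defaultdict(int)  # day -> number of presentations already placed
        buckets = []             # buckets[n] holds the n-th presentation of each day
        for pres in entry["presentations"]:
            n = seen[pres["day"]]
            seen[pres["day"]] += 1
            if n == len(buckets):
                buckets.append([])
            buckets[n].append(pres)
        for bucket in buckets:
            # Copy the other keys (time, category) and attach this bucket.
            new_entry = {k: v for k, v in entry.items() if k != "presentations"}
            new_entry["presentations"] = bucket
            result.append(new_entry)
    return result


split = split_by_day(items)
for entry in split:
    print([p["presentation"] for p in entry["presentations"]])
```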

Python - Get Nested Data from Multiple Levels

I wasn't sure how to title this question. I am working with the QuickBooks Online API, and when I query a report such as BalanceSheet or GeneralLedger, the API returns data rows at multiple nested levels, which is quite frustrating to parse.
An example BalanceSheet response is included below. I am only interested in the data from the "Row" objects, but as you can see they can appear at one, two, three, or more levels of nesting. I am thinking of walking through each level to check for rows and then collecting each Row, but that seems overly complex because I would need a separate for loop for every level.
Is there a better way to get every "Row" in that data regardless of the level it is on? Any ideas would be appreciated!
Here's an example of a response from their sandbox data:
{
"Header": {
"Time": "2021-04-28T14:12:17-07:00",
"ReportName": "BalanceSheet",
"DateMacro": "this calendar year-to-date",
"ReportBasis": "Accrual",
"StartPeriod": "2021-01-01",
"EndPeriod": "2021-04-28",
"SummarizeColumnsBy": "Month",
"Currency": "USD",
"Option": [
{
"Name": "AccountingStandard",
"Value": "GAAP"
},
{
"Name": "NoReportData",
"Value": "false"
}
]
},
"Columns": {
"Column": [
{
"ColTitle": "",
"ColType": "Account",
"MetaData": [
{
"Name": "ColKey",
"Value": "account"
}
]
},
{
"ColTitle": "Jan 2021",
"ColType": "Money",
"MetaData": [
{
"Name": "StartDate",
"Value": "2021-01-01"
},
{
"Name": "EndDate",
"Value": "2021-01-31"
},
{
"Name": "ColKey",
"Value": "Jan 2021"
}
]
},
{
"ColTitle": "Feb 2021",
"ColType": "Money",
"MetaData": [
{
"Name": "StartDate",
"Value": "2021-02-01"
},
{
"Name": "EndDate",
"Value": "2021-02-28"
},
{
"Name": "ColKey",
"Value": "Feb 2021"
}
]
},
{
"ColTitle": "Mar 2021",
"ColType": "Money",
"MetaData": [
{
"Name": "StartDate",
"Value": "2021-03-01"
},
{
"Name": "EndDate",
"Value": "2021-03-31"
},
{
"Name": "ColKey",
"Value": "Mar 2021"
}
]
},
{
"ColTitle": "Apr 1-28, 2021",
"ColType": "Money",
"MetaData": [
{
"Name": "StartDate",
"Value": "2021-04-01"
},
{
"Name": "EndDate",
"Value": "2021-04-28"
},
{
"Name": "ColKey",
"Value": "Apr 1-28, 2021"
}
]
}
]
},
"Rows": {
"Row": [
{
"Header": {
"ColData": [
{
"value": "ASSETS"
},
{
"value": ""
},
{
"value": ""
},
{
"value": ""
},
{
"value": ""
}
]
},
"Rows": {
"Row": [
{
"Header": {
"ColData": [
{
"value": "Current Assets"
},
{
"value": ""
},
{
"value": ""
},
{
"value": ""
},
{
"value": ""
}
]
},
"Rows": {
"Row": [
{
"Header": {
"ColData": [
{
"value": "Bank Accounts"
},
{
"value": ""
},
{
"value": ""
},
{
"value": ""
},
{
"value": ""
}
]
},
"Rows": {
"Row": [
{
"ColData": [
{
"value": "Checking",
"id": "35"
},
{
"value": "1201.00"
},
{
"value": "1201.00"
},
{
"value": "1201.00"
},
{
"value": "1201.00"
}
],
"type": "Data"
},
{
"ColData": [
{
"value": "Savings",
"id": "36"
},
{
"value": "800.00"
},
{
"value": "800.00"
},
{
"value": "800.00"
},
{
"value": "800.00"
}
],
"type": "Data"
}
]
},
"Summary": {
"ColData": [
{
"value": "Total Bank Accounts"
},
{
"value": "2001.00"
},
{
"value": "2001.00"
},
{
"value": "2001.00"
},
{
"value": "2001.00"
}
]
},
"type": "Section",
"group": "BankAccounts"
},
{
"Header": {
"ColData": [
{
"value": "Accounts Receivable"
},
{
"value": ""
},
{
"value": ""
},
{
"value": ""
},
{
"value": ""
}
]
},
"Rows": {
"Row": [
{
"ColData": [
{
"value": "Accounts Receivable (A/R)",
"id": "84"
},
{
"value": "5281.52"
},
{
"value": "5281.52"
},
{
"value": "5281.52"
},
{
"value": "5281.52"
}
],
"type": "Data"
}
]
},
"Summary": {
"ColData": [
{
"value": "Total Accounts Receivable"
},
{
"value": "5281.52"
},
{
"value": "5281.52"
},
{
"value": "5281.52"
},
{
"value": "5281.52"
}
]
},
"type": "Section",
"group": "AR"
},
{
"Header": {
"ColData": [
{
"value": "Other Current Assets"
},
{
"value": ""
},
{
"value": ""
},
{
"value": ""
},
{
"value": ""
}
]
},
"Rows": {
"Row": [
{
"ColData": [
{
"value": "Inventory Asset",
"id": "81"
},
{
"value": "596.25"
},
{
"value": "596.25"
},
{
"value": "596.25"
},
{
"value": "596.25"
}
],
"type": "Data"
},
{
"ColData": [
{
"value": "Undeposited Funds",
"id": "4"
},
{
"value": "2062.52"
},
{
"value": "2062.52"
},
{
"value": "2062.52"
},
{
"value": "2062.52"
}
],
"type": "Data"
}
]
},
"Summary": {
"ColData": [
{
"value": "Total Other Current Assets"
},
{
"value": "2658.77"
},
{
"value": "2658.77"
},
{
"value": "2658.77"
},
{
"value": "2658.77"
}
]
},
"type": "Section",
"group": "OtherCurrentAssets"
}
]
},
"Summary": {
"ColData": [
{
"value": "Total Current Assets"
},
{
"value": "9941.29"
},
{
"value": "9941.29"
},
{
"value": "9941.29"
},
{
"value": "9941.29"
}
]
},
"type": "Section",
"group": "CurrentAssets"
},
{
"Header": {
"ColData": [
{
"value": "Fixed Assets"
},
{
"value": ""
},
{
"value": ""
},
{
"value": ""
},
{
"value": ""
}
]
},
"Rows": {
"Row": [
{
"Header": {
"ColData": [
{
"value": "Truck",
"id": "37"
},
{
"value": ""
},
{
"value": ""
},
{
"value": ""
},
{
"value": ""
}
]
},
"Rows": {
"Row": [
{
"ColData": [
{
"value": "Original Cost",
"id": "38"
},
{
"value": "13495.00"
},
{
"value": "13495.00"
},
{
"value": "13495.00"
},
{
"value": "13495.00"
}
],
"type": "Data"
}
]
},
"Summary": {
"ColData": [
{
"value": "Total Truck"
},
{
"value": "13495.00"
},
{
"value": "13495.00"
},
{
"value": "13495.00"
},
{
"value": "13495.00"
}
]
},
"type": "Section"
}
]
},
"Summary": {
"ColData": [
{
"value": "Total Fixed Assets"
},
{
"value": "13495.00"
},
{
"value": "13495.00"
},
{
"value": "13495.00"
},
{
"value": "13495.00"
}
]
},
"type": "Section",
"group": "FixedAssets"
}
]
},
"Summary": {
"ColData": [
{
"value": "TOTAL ASSETS"
},
{
"value": "23436.29"
},
{
"value": "23436.29"
},
{
"value": "23436.29"
},
{
"value": "23436.29"
}
]
},
"type": "Section",
"group": "TotalAssets"
},
{
"Header": {
"ColData": [
{
"value": "LIABILITIES AND EQUITY"
},
{
"value": ""
},
{
"value": ""
},
{
"value": ""
},
{
"value": ""
}
]
},
"Rows": {
"Row": [
{
"Header": {
"ColData": [
{
"value": "Liabilities"
},
{
"value": ""
},
{
"value": ""
},
{
"value": ""
},
{
"value": ""
}
]
},
"Rows": {
"Row": [
{
"Header": {
"ColData": [
{
"value": "Current Liabilities"
},
{
"value": ""
},
{
"value": ""
},
{
"value": ""
},
{
"value": ""
}
]
},
"Rows": {
"Row": [
{
"Header": {
"ColData": [
{
"value": "Accounts Payable"
},
{
"value": ""
},
{
"value": ""
},
{
"value": ""
},
{
"value": ""
}
]
},
"Rows": {
"Row": [
{
"ColData": [
{
"value": "Accounts Payable (A/P)",
"id": "33"
},
{
"value": "1602.67"
},
{
"value": "1602.67"
},
{
"value": "1602.67"
},
{
"value": "1602.67"
}
],
"type": "Data"
}
]
},
"Summary": {
"ColData": [
{
"value": "Total Accounts Payable"
},
{
"value": "1602.67"
},
{
"value": "1602.67"
},
{
"value": "1602.67"
},
{
"value": "1602.67"
}
]
},
"type": "Section",
"group": "AP"
},
{
"Header": {
"ColData": [
{
"value": "Credit Cards"
},
{
"value": ""
},
{
"value": ""
},
{
"value": ""
},
{
"value": ""
}
]
},
"Rows": {
"Row": [
{
"ColData": [
{
"value": "Mastercard",
"id": "41"
},
{
"value": "157.72"
},
{
"value": "157.72"
},
{
"value": "157.72"
},
{
"value": "157.72"
}
],
"type": "Data"
}
]
},
"Summary": {
"ColData": [
{
"value": "Total Credit Cards"
},
{
"value": "157.72"
},
{
"value": "157.72"
},
{
"value": "157.72"
},
{
"value": "157.72"
}
]
},
"type": "Section",
"group": "CreditCards"
},
{
"Header": {
"ColData": [
{
"value": "Other Current Liabilities"
},
{
"value": ""
},
{
"value": ""
},
{
"value": ""
},
{
"value": ""
}
]
},
"Rows": {
"Row": [
{
"ColData": [
{
"value": "Arizona Dept. of Revenue Payable",
"id": "89"
},
{
"value": "0.00"
},
{
"value": "0.00"
},
{
"value": "0.00"
},
{
"value": "0.00"
}
],
"type": "Data"
},
{
"ColData": [
{
"value": "Board of Equalization Payable",
"id": "90"
},
{
"value": "370.94"
},
{
"value": "370.94"
},
{
"value": "370.94"
},
{
"value": "370.94"
}
],
"type": "Data"
},
{
"ColData": [
{
"value": "Loan Payable",
"id": "43"
},
{
"value": "4000.00"
},
{
"value": "4000.00"
},
{
"value": "4000.00"
},
{
"value": "4000.00"
}
],
"type": "Data"
}
]
},
"Summary": {
"ColData": [
{
"value": "Total Other Current Liabilities"
},
{
"value": "4370.94"
},
{
"value": "4370.94"
},
{
"value": "4370.94"
},
{
"value": "4370.94"
}
]
},
"type": "Section",
"group": "OtherCurrentLiabilities"
}
]
},
"Summary": {
"ColData": [
{
"value": "Total Current Liabilities"
},
{
"value": "6131.33"
},
{
"value": "6131.33"
},
{
"value": "6131.33"
},
{
"value": "6131.33"
}
]
},
"type": "Section",
"group": "CurrentLiabilities"
},
{
"Header": {
"ColData": [
{
"value": "Long-Term Liabilities"
},
{
"value": ""
},
{
"value": ""
},
{
"value": ""
},
{
"value": ""
}
]
},
"Rows": {
"Row": [
{
"ColData": [
{
"value": "Notes Payable",
"id": "44"
},
{
"value": "25000.00"
},
{
"value": "25000.00"
},
{
"value": "25000.00"
},
{
"value": "25000.00"
}
],
"type": "Data"
}
]
},
"Summary": {
"ColData": [
{
"value": "Total Long-Term Liabilities"
},
{
"value": "25000.00"
},
{
"value": "25000.00"
},
{
"value": "25000.00"
},
{
"value": "25000.00"
}
]
},
"type": "Section",
"group": "LongTermLiabilities"
}
]
},
"Summary": {
"ColData": [
{
"value": "Total Liabilities"
},
{
"value": "31131.33"
},
{
"value": "31131.33"
},
{
"value": "31131.33"
},
{
"value": "31131.33"
}
]
},
"type": "Section",
"group": "Liabilities"
},
{
"Header": {
"ColData": [
{
"value": "Equity"
},
{
"value": ""
},
{
"value": ""
},
{
"value": ""
},
{
"value": ""
}
]
},
"Rows": {
"Row": [
{
"ColData": [
{
"value": "Opening Balance Equity",
"id": "34"
},
{
"value": "-9337.50"
},
{
"value": "-9337.50"
},
{
"value": "-9337.50"
},
{
"value": "-9337.50"
}
],
"type": "Data"
},
{
"ColData": [
{
"value": "Retained Earnings",
"id": "2"
},
{
"value": "1642.46"
},
{
"value": "1642.46"
},
{
"value": "1642.46"
},
{
"value": "1642.46"
}
],
"type": "Data"
},
{
"ColData": [
{
"value": "Net Income"
},
{
"value": ""
},
{
"value": ""
},
{
"value": ""
},
{
"value": ""
}
],
"type": "Data",
"group": "NetIncome"
}
]
},
"Summary": {
"ColData": [
{
"value": "Total Equity"
},
{
"value": "-7695.04"
},
{
"value": "-7695.04"
},
{
"value": "-7695.04"
},
{
"value": "-7695.04"
}
]
},
"type": "Section",
"group": "Equity"
}
]
},
"Summary": {
"ColData": [
{
"value": "TOTAL LIABILITIES AND EQUITY"
},
{
"value": "23436.29"
},
{
"value": "23436.29"
},
{
"value": "23436.29"
},
{
"value": "23436.29"
}
]
},
"type": "Section",
"group": "TotalLiabilitiesAndEquity"
}
]
}
}
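
A recursive walk avoids writing one loop per level. The sketch below assumes that the rows of interest are the dictionaries marked "type": "Data", as in the sample above; the function name iter_data_rows and the shortened report are made up for illustration.

```python
def iter_data_rows(node):
    """Yield every dict with "type": "Data", at any nesting depth."""
    if isinstance(node, dict):
        if node.get("type") == "Data":
            yield node
        for value in node.values():
            yield from iter_data_rows(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_data_rows(item)


# Shortened stand-in for the BalanceSheet response (assumed shape).
report = {
    "Rows": {"Row": [
        {
            "Header": {"ColData": [{"value": "ASSETS"}]},
            "Rows": {"Row": [
                {
                    "Header": {"ColData": [{"value": "Bank Accounts"}]},
                    "Rows": {"Row": [
                        {"ColData": [{"value": "Checking", "id": "35"}, {"value": "1201.00"}],
                         "type": "Data"},
                        {"ColData": [{"value": "Savings", "id": "36"}, {"value": "800.00"}],
                         "type": "Data"},
                    ]},
                    "type": "Section",
                },
            ]},
            "type": "Section",
        },
        {"ColData": [{"value": "Net Income"}, {"value": ""}], "type": "Data"},
    ]}
}

rows = [[col["value"] for col in row["ColData"]] for row in iter_data_rows(report)]
print(rows)
```

The same generator could be adapted to yield "Section" headers or "Summary" blocks by changing the condition it checks.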

Extracting data from JSON File to CSV

I have a big JSON file with a very complex structure.
You can look at it here: https://drive.google.com/file/d/1tBVJ2xYSCpTTUGPJegvAz2ZXbeN0bteX/view?usp=sharing
It contains more than 7 million lines, and I want to extract only the "text" field.
I have written Python code to extract all the values of the "text" key in the whole file, but it extracted only 12 values, while when I open the JSON file in Visual Studio I can see more than 19000 of them.
Here is the code:
import json
import csv
with open("/Users/zahraa-maher/rasa-init-demo/venv/Tickie/external_data/frames2.json") as file:
    data = json.load(file)
fname = "outputText8.csv"
with open(fname, "w") as file:
    csv_file = csv.writer(file,lineterminator='\n')
    csv_file.writerow(["text"])
    for item in data[i]["turns"]:
        csv_file.writerow([item['text']])
Please take a look at the JSON file itself: it is very large and has a complex structure, so I cannot paste all of it here in a readable form.
Here is a part of the JSON file:
[
{
"user_id": "U22HTHYNP",
"turns": [
{
"text": "I'd like to book a trip to Atlantis from Caprica on Saturday, August 13, 2016 for 8 adults. I have a tight budget of 1700.",
"labels": {
"acts": [
{
"args": [
{
"val": "book",
"key": "intent"
}
],
"name": "inform"
},
{
"args": [
{
"val": "Atlantis",
"key": "dst_city"
},
{
"val": "Caprica",
"key": "or_city"
},
{
"val": "Saturday, August 13, 2016",
"key": "str_date"
},
{
"val": "8",
"key": "n_adults"
},
{
"val": "1700",
"key": "budget"
}
],
"name": "inform"
}
],
"acts_without_refs": [
{
"args": [
{
"val": "book",
"key": "intent"
}
],
"name": "inform"
},
{
"args": [
{
"val": "Atlantis",
"key": "dst_city"
},
{
"val": "Caprica",
"key": "or_city"
},
{
"val": "Saturday, August 13, 2016",
"key": "str_date"
},
{
"val": "8",
"key": "n_adults"
},
{
"val": "1700",
"key": "budget"
}
],
"name": "inform"
}
],
"active_frame": 1,
"frames": [
{
"info": {
"intent": [
{
"val": "book",
"negated": false
}
],
"budget": [
{
"val": "1700.0",
"negated": false
}
],
"dst_city": [
{
"val": "Atlantis",
"negated": false
}
],
"or_city": [
{
"val": "Caprica",
"negated": false
}
],
"str_date": [
{
"val": "august 13",
"negated": false
}
],
"n_adults": [
{
"val": "8",
"negated": false
}
]
},
"frame_id": 1,
"requests": [],
"frame_parent_id": null,
"binary_questions": [],
"compare_requests": []
}
]
},
"author": "user",
"timestamp": 1471272019730.0
},
{
"db": {
"result": [
[
{
"trip": {
"returning": {
"duration": {
"hours": 0,
"min": 51
},
"arrival": {
"hour": 10,
"year": 2016,
"day": 24,
"min": 51,
"month": 8
},
"departure": {
"hour": 10,
"year": 2016,
"day": 24,
"min": 0,
"month": 8
}
},
"seat": "ECONOMY",
"leaving": {
"duration": {
"hours": 0,
"min": 51
},
"arrival": {
"hour": 0,
"year": 2016,
"day": 16,
"min": 51,
"month": 8
},
"departure": {
"hour": 0,
"year": 2016,
"day": 16,
"min": 0,
"month": 8
}
},
"or_city": "Porto Alegre",
"duration_days": 9
},
"price": 2118.81,
"hotel": {
"gst_rating": 7.15,
"vicinity": [],
"name": "Scarlet Palms Resort",
"country": "Brazil",
"amenities": [
"FREE_BREAKFAST",
"FREE_PARKING",
"FREE_WIFI"
],
"dst_city": "Goiania",
"category": "3.5 star hotel"
}
},
{
"trip": {
"returning": {
"duration": {
"hours": 2,
"min": 37
},
"arrival": {
"hour": 12,
"year": 2016,
"day": 10,
"min": 37,
"month": 8
},
"departure": {
"hour": 10,
"year": 2016,
"day": 10,
"min": 0,
"month": 8
}
},
"seat": "ECONOMY",
"leaving": {
"duration": {
"hours": 2,
"min": 37
},
"arrival": {
"hour": 0,
"year": 2016,
"day": 4,
"min": 37,
"month": 8
},
"departure": {
"hour": 22,
"year": 2016,
"day": 3,
"min": 0,
"month": 8
}
},
"or_city": "Porto Alegre",
"duration_days": 7
},
"price": 2369.83,
"hotel": {
"gst_rating": 0,
"vicinity": [],
"name": "Sunway Hostel",
"country": "Argentina",
"amenities": [
"FREE_BREAKFAST",
"FREE_WIFI"
],
"dst_city": "Rosario",
"category": "2.0 star hotel"
}
},
{
"trip": {
"returning": {
"duration": {
"hours": 0,
"min": 51
},
"arrival": {
"hour": 10,
"year": 2016,
"day": 24,
"min": 51,
"month": 8
},
"departure": {
"hour": 10,
"year": 2016,
"day": 24,
"min": 0,
"month": 8
}
},
"seat": "BUSINESS",
"leaving": {
"duration": {
"hours": 0,
"min": 51
},
"arrival": {
"hour": 0,
"year": 2016,
"day": 16,
"min": 51,
"month": 8
},
"departure": {
"hour": 0,
"year": 2016,
"day": 16,
"min": 0,
"month": 8
}
},
"or_city": "Porto Alegre",
"duration_days": 9
},
"price": 2375.72,
"hotel": {
"gst_rating": 7.15,
"vicinity": [],
"name": "Scarlet Palms Resort",
"country": "Brazil",
"amenities": [
"FREE_BREAKFAST",
"FREE_PARKING",
"FREE_WIFI"
],
"dst_city": "Goiania",
"category": "3.5 star hotel"
}
},
{
"trip": {
"returning": {
"duration": {
"hours": 1,
"min": 30
},
"arrival": {
"hour": 11,
"year": 2016,
"day": 1,
"min": 30,
"month": 9
},
"departure": {
"hour": 10,
"year": 2016,
"day": 1,
"min": 0,
"month": 9
}
},
"seat": "BUSINESS",
"leaving": {
"duration": {
"hours": 1,
"min": 30
},
"arrival": {
"hour": 18,
"year": 2016,
"day": 19,
"min": 30,
"month": 8
},
"departure": {
"hour": 17,
"year": 2016,
"day": 19,
"min": 0,
"month": 8
}
},
"or_city": "Porto Alegre",
"duration_days": 13
},
"price": 2492.95,
"hotel": {
"gst_rating": 0,
"vicinity": [],
"name": "Hotel Mundo",
"country": "Brazil",
"amenities": [
"FREE_BREAKFAST",
"FREE_WIFI",
"FREE_PARKING"
],
"dst_city": "Manaus",
"category": "2.5 star hotel"
}
},
{
"trip": {
"returning": {
"duration": {
"hours": 0,
"min": 51
},
"arrival": {
"hour": 10,
"year": 2016,
"day": 31,
"min": 51,
"month": 8
},
"departure": {
"hour": 10,
"year": 2016,
"day": 31,
"min": 0,
"month": 8
}
},
"seat": "ECONOMY",
"leaving": {
"duration": {
"hours": 0,
"min": 51
},
"arrival": {
"hour": 19,
"year": 2016,
"day": 27,
"min": 51,
"month": 8
},
"departure": {
"hour": 19,
"year": 2016,
"day": 27,
"min": 0,
"month": 8
}
},
"or_city": "Porto Alegre",
"duration_days": 4
},
"price": 2538.0,
"hotel": {
"gst_rating": 8.22,
"vicinity": [],
"name": "The Glee",
"country": "Brazil",
"amenities": [
"FREE_BREAKFAST",
"FREE_WIFI"
],
"dst_city": "Recife",
"category": "4.0 star hotel"
}
}
],
[],
[],
[],
[],
[],
[]
],
"search": [
{
"ORIGIN_CITY": "Porto Alegre",
"PRICE_MIN": "2000",
"NUM_ADULTS": "2",
"timestamp": 1471271949.995,
"PRICE_MAX": "3000",
"ARE_DATES_FLEXIBLE": "true",
"NUM_CHILDREN": "5",
"START_TIME": "1470110400000",
"MAX_DURATION": 2592000000.0,
"DESTINATION_CITY": "Brazil",
"RESULT_LIMIT": "10",
"END_TIME": "1472616000000"
},
{
"ORIGIN_CITY": "Atlantis",
"NUM_ADULTS": "8",
"RESULT_LIMIT": "10",
"timestamp": 1471272148.124,
"PRICE_MAX": "1700",
"NUM_CHILDREN": "",
"ARE_DATES_FLEXIBLE": "true",
"START_TIME": "NaN",
"END_TIME": "NaN"
},
{
"ORIGIN_CITY": "Caprica",
"PRICE_MAX": "1700",
"NUM_ADULTS": "8",
"RESULT_LIMIT": "10",
"timestamp": 1471272189.07,
"DESTINATION_CITY": "Atlantis",
"NUM_CHILDREN": "",
"ARE_DATES_FLEXIBLE": "true",
"START_TIME": "1470715200000",
"END_TIME": "1472011200000"
},
{
"ORIGIN_CITY": "Caprica",
"PRICE_MAX": "1700",
"NUM_ADULTS": "8",
"RESULT_LIMIT": "10",
"timestamp": 1471272205.436,
"DESTINATION_CITY": "Atlantis",
"NUM_CHILDREN": "",
"ARE_DATES_FLEXIBLE": "true",
"START_TIME": "1470715200000",
"END_TIME": "1472011200000"
},
{
"ORIGIN_CITY": "Caprica",
"PRICE_MIN": "1700",
"NUM_ADULTS": "8",
"RESULT_LIMIT": "10",
"timestamp": 1471272278.72,
"DESTINATION_CITY": "Atlantis",
"NUM_CHILDREN": "",
"ARE_DATES_FLEXIBLE": "true",
"START_TIME": "1470715200000",
"END_TIME": "1472011200000"
},
{
"ORIGIN_CITY": "Caprica",
"PRICE_MIN": "1700",
"NUM_ADULTS": "8",
"RESULT_LIMIT": "10",
"timestamp": 1471272454.542,
"DESTINATION_CITY": "Atlantis",
"NUM_CHILDREN": "",
"ARE_DATES_FLEXIBLE": "true",
"START_TIME": "1471060800000",
"END_TIME": "1472011200000"
},
{
"ORIGIN_CITY": "Caprica",
"PRICE_MIN": "1700",
"NUM_ADULTS": "8",
"RESULT_LIMIT": "10",
"timestamp": 1471272466.008,
"DESTINATION_CITY": "Atlantis",
"NUM_CHILDREN": "",
"ARE_DATES_FLEXIBLE": "true",
"START_TIME": "1471060800000",
"END_TIME": "1472011200000"
}
]
},
How can the code be modified to extract all the "text" values from the JSON file into a CSV file?

This is a potential solution using pandas:
import pandas as pd

# Load the JSON file; each row is one dialogue
dj = pd.read_json("frames2.json")
dtext = dj[["user_id", "turns"]]

# Collect the "text" of every turn in every dialogue
list_ = []
for record in dtext["turns"].values:
    for r in record:
        list_.append(r["text"])

# Export the collected values to CSV
out = pd.Series(list_, name="text")
out.to_csv("text.csv")
This writes every collected "text" value to text.csv.

Try looping over every dialogue in the file instead of a single index:
import json
import csv

with open("/Users/zahraa-maher/rasa-init-demo/venv/Tickie/external_data/frames2.json") as file:
    data = json.load(file)

fname = "outputText8.csv"
with open(fname, "w") as file:
    csv_file = csv.writer(file, lineterminator='\n')
    csv_file.writerow(["text"])
    # data is a list of dialogues, and each dialogue has its own list of turns
    for dialogue in data:
        for turn in dialogue["turns"]:
            csv_file.writerow([turn["text"]])
Which fields you save is up to you; if you use a debugger you can inspect the keys and values.

Dictionary data is not separated into columns in Pandas DataFrame

I have created a variable that stores my json data. It looks like this:
datasett = '''
{
"data": {
"trafficRegistrationPoints": [
{
"id": "99100B1687283",
"name": "Menstad sykkeltellepunkt",
"location": {
"coordinates": {
"latLon": {
"lat": 59.173876,
"lon": 9.641772
}
}
}
},
{
"id": "11101B1800681",
"name": "Garpa - sykkel",
"location": {
"coordinates": {
"latLon": {
"lat": 63.795114,
"lon": 11.494511
}
}
}
},
{
"id": "30961B1175469",
"name": "STENMALEN-SYKKEL",
"location": {
"coordinates": {
"latLon": {
"lat": 59.27665,
"lon": 10.411814
}
}
}
},
{
"id": "53749B1700621",
"name": "TUNEVANNET SYKKEL",
"location": {
"coordinates": {
"latLon": {
"lat": 59.292846,
"lon": 11.084058
}
}
}
},
{
"id": "80565B1689290",
"name": "Nenset sykkeltellepunkt",
"location": {
"coordinates": {
"latLon": {
"lat": 59.168377,
"lon": 9.634257
}
}
}
},
{
"id": "24783B2045151",
"name": "Orstad sykkel- begge retn.",
"location": {
"coordinates": {
"latLon": {
"lat": 58.798377,
"lon": 5.72743
}
}
}
},
{
"id": "46418B2616452",
"name": "Elgeseter bru sykkel øst",
"location": {
"coordinates": {
"latLon": {
"lat": 63.425015,
"lon": 10.393928
}
}
}
},
{
"id": "35978B1700571",
"name": "Tune kirke nord",
"location": {
"coordinates": {
"latLon": {
"lat": 59.292626,
"lon": 11.084066
}
}
}
},
{
"id": "21745B1996708",
"name": "Munkedamsveien Sykkel",
"location": {
"coordinates": {
"latLon": {
"lat": 59.911198,
"lon": 10.725568
}
}
}
},
{
"id": "33443B2542097",
"name": "KANALBRUA-SYKKEL",
"location": {
"coordinates": {
"latLon": {
"lat": 59.261823,
"lon": 10.416349
}
}
}
},
{
"id": "77570B384357",
"name": "HAVRENESVEGEN (SYKKEL)",
"location": {
"coordinates": {
"latLon": {
"lat": 61.598202,
"lon": 5.016999
}
}
}
},
{
"id": "95959B971385",
"name": "JELØGATA SYKKEL",
"location": {
"coordinates": {
"latLon": {
"lat": 59.43385,
"lon": 10.65388
}
}
}
},
{
"id": "61523B971803",
"name": "ST.HANSFJELLET SYKKEL",
"location": {
"coordinates": {
"latLon": {
"lat": 59.218978,
"lon": 10.93455
}
}
}
},
}
}
}
]
}
}
'''
Next, I used json.loads() to turn it into a Python dictionary:
dict = json.loads(datasett)
Because the result is a nested dictionary, I move one level further into the nest:
movedDict = dict['data']
I then want to turn this into a Pandas DataFrame:
df = pd.DataFrame.from_dict(movedDict)
However, when I print this, the data is not separated into individual columns. What am I doing wrong?

You can use json_normalize here. I also removed some extra } characters from your JSON:
import json
import pandas as pd

data = json.loads(datasett)
# record_path points at the list of records; nested dicts become dotted column names
df = pd.json_normalize(data, record_path=['data', 'trafficRegistrationPoints'])
print(df)
    id             name                        location.coordinates.latLon.lat  location.coordinates.latLon.lon
0   99100B1687283  Menstad sykkeltellepunkt    59.173876                        9.641772
1   11101B1800681  Garpa - sykkel              63.795114                        11.494511
2   30961B1175469  STENMALEN-SYKKEL            59.276650                        10.411814
3   53749B1700621  TUNEVANNET SYKKEL           59.292846                        11.084058
4   80565B1689290  Nenset sykkeltellepunkt     59.168377                        9.634257
5   24783B2045151  Orstad sykkel- begge retn.  58.798377                        5.727430
6   46418B2616452  Elgeseter bru sykkel øst    63.425015                        10.393928
7   35978B1700571  Tune kirke nord             59.292626                        11.084066
8   21745B1996708  Munkedamsveien Sykkel       59.911198                        10.725568
9   33443B2542097  KANALBRUA-SYKKEL            59.261823                        10.416349
10  77570B384357   HAVRENESVEGEN (SYKKEL)      61.598202                        5.016999
11  95959B971385   JELØGATA SYKKEL             59.433850                        10.653880
12  61523B971803   ST.HANSFJELLET SYKKEL       59.218978                        10.934550

When you use from_dict, the dict should look like this:
data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}
pd.DataFrame.from_dict(data)
   col_1 col_2
0      3     a
1      2     b
2      1     c
3      0     d
In your case the dict looks like this:
data = {'trafficRegistrationPoints':[.....]}
Pull out the 'trafficRegistrationPoints' value as a list first, and then create the DataFrame from that list.

The value of the data key in your dict is not a set of individual dicts but a list of dicts stored under the trafficRegistrationPoints key, so you need to move one level further, into that key:
df = pd.DataFrame.from_dict(movedDict['trafficRegistrationPoints'])
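
To make the difference between the suggestions concrete, here is a small sketch that assumes a two-point excerpt of the question's data. Building the frame from the list of dicts gives one column per top-level key and leaves location as a nested dict in each cell, while json_normalize also expands the nested coordinates. The variable names points, df_records and df_flat are illustrative.

```python
import json
import pandas as pd

# Two points copied from the question's data; the rest are omitted for brevity.
datasett = '''
{"data": {"trafficRegistrationPoints": [
  {"id": "99100B1687283", "name": "Menstad sykkeltellepunkt",
   "location": {"coordinates": {"latLon": {"lat": 59.173876, "lon": 9.641772}}}},
  {"id": "11101B1800681", "name": "Garpa - sykkel",
   "location": {"coordinates": {"latLon": {"lat": 63.795114, "lon": 11.494511}}}}
]}}
'''
points = json.loads(datasett)["data"]["trafficRegistrationPoints"]

# A list of dicts gives one column per top-level key; "location" stays a dict.
df_records = pd.DataFrame(points)

# json_normalize additionally expands nested dicts into dotted column names.
df_flat = pd.json_normalize(points)

print(df_records.columns.tolist())
print(df_flat.columns.tolist())
```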
