Nested Json to pandas DataFrame with specific format - python

I need to format the contents of a JSON file into a specific pandas DataFrame layout so that I can run pandasql to transform the data and run it through a scoring model.
file = C:\scoring_model\json.js (contents of 'file' are below)
{
    "response": {
        "version": "1.1",
        "token": "dsfgf",
        "body": {
            "customer": {
                "customer_id": "1234567",
                "verified": "true"
            },
            "contact": {
                "email": "mr#abc.com",
                "mobile_number": "0123456789"
            },
            "personal": {
                "gender": "m",
                "title": "Dr.",
                "last_name": "Muster",
                "first_name": "Max",
                "family_status": "single",
                "dob": "1985-12-23"
            }
        }
    }
}
I need the DataFrame to look like this (obviously all values on the same row; wrapped here only to fit the question):
version | token | customer_id | verified | email | mobile_number | gender |
1.1 | dsfgf | 1234567 | true | mr#abc.com | 0123456789 | m |
title | last_name | first_name |family_status | dob
Dr. | Muster | Max | single | 23.12.1985
I have looked at all the other questions on this topic and have tried various ways to load the JSON file into pandas:

with open(r'C:\scoring_model\json.js', 'r') as f:
    c = pd.read_json(f.read())

with open(r'C:\scoring_model\json.js', 'r') as f:
    c = f.readlines()

I also tried pd.Panel() from this solution (Python Pandas: How to split a sorted dictionary in a column of a dataframe) on the DataFrame built from [c = f.readlines()]. I thought about splitting the contents of each cell on ("") and finding a way to put the pieces into different columns, but no luck so far.

If you load the entire JSON as a dict (or list), e.g. using json.load, you can use json_normalize:
In [11]: d = {"response": {"body": {"contact": {"email": "mr#abc.com", "mobile_number": "0123456789"}, "personal": {"last_name": "Muster", "gender": "m", "first_name": "Max", "dob": "1985-12-23", "family_status": "single", "title": "Dr."}, "customer": {"verified": "true", "customer_id": "1234567"}}, "token": "dsfgf", "version": "1.1"}}
In [12]: df = pd.json_normalize(d)
In [13]: df.columns = df.columns.map(lambda x: x.split(".")[-1])
In [14]: df
Out[14]:
email mobile_number customer_id verified dob family_status first_name gender last_name title token version
0 mr#abc.com 0123456789 1234567 true 1985-12-23 single Max m Muster Dr. dsfgf 1.1
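The expected output in the question also shows dob reformatted as 23.12.1985; assuming that day-first format is wanted, one extra step (a sketch, not part of the original answer) converts it:

In [15]: df['dob'] = pd.to_datetime(df['dob']).dt.strftime('%d.%m.%Y')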

It's much easier if you deserialize the JSON using the built-in json module first (instead of pd.read_json()) and then flatten it using pd.json_normalize().
import json
import pandas as pd

# deserialize
with open(r'C:\scoring_model\json.js', 'r') as f:
    data = json.load(f)

# flatten
df = pd.json_normalize(data)
If a dictionary is passed to json_normalize(), it's flattened into a single row, but if a list is passed to it, it's flattened into multiple rows. So if the nested structure contains only key-value pairs, pd.json_normalize() with no parameters suffices to flatten it.
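A quick illustration of that dict-vs-list behaviour (a minimal sketch):

pd.json_normalize({"a": 1})              # dict -> one row
pd.json_normalize([{"a": 1}, {"a": 2}])  # list -> two rows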
However, if the data contains a list (a JSON array somewhere in the nesting), pass the record_path= argument to tell pandas where the records are. For example, if the data is like the following (notice how the value under "body" is a list, i.e. a list of records):
data = {
    "response": [
        {
            "version": "1.1",
            "customer": {"id": "1234567", "verified": "true"},
            "body": [
                {"email": "mr#abc.com", "mobile_number": "0123456789"},
                {"email": "ms#abc.com", "mobile_number": "9876543210"}
            ]
        },
        {
            "version": "1.2",
            "customer": {"id": "0987654", "verified": "true"},
            "body": [
                {"email": "master#abc.com", "mobile_number": "9999999999"}
            ]
        }
    ]
}
Then pass record_path= to let pandas know that the records are under "body", and pass meta= to set the paths to the metadata. Note that "version" and "customer" sit at the same level as "body", but "id" is nested one level deeper, so you need to pass a list (['customer', 'id']) to reach the value under "id".
df = pd.json_normalize(data['response'], record_path=['body'], meta=['version', ['customer', 'id']])
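For reference, that call should produce one row per record in "body", with the metadata repeated on each row, roughly:

            email mobile_number version customer.id
0      mr#abc.com    0123456789     1.1     1234567
1      ms#abc.com    9876543210     1.1     1234567
2  master#abc.com    9999999999     1.2     0987654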

Related

Nested Json Using pyspark

We have to build nested JSON in PySpark using the structure below, and I have added the data that needs to feed it.
Input Data structure
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
a1=["DA_STinf","DA_Stinf_NA","DA_Stinf_city","DA_Stinf_NA_ID","DA_Stinf_NA_ID_GRANT","DA_country"]
a2=["data.studentinfo","data.studentinfo.name","data.studentinfo.city","data.studentinfo.name.id","data.studentinfo.name.id.grant","data.country"]
columns = ["data","action"]
df = spark.createDataFrame(zip(a1, a2), columns)
#Input data for json structure
a1=["Pune"]
a2=["YES"]
a3=["India"]
col=["DA_Stinf_city","DA_Stinf_NA_ID_GRANT","DA_country"]
data=spark.createDataFrame(zip(a1, a2,a3), col)
Expected result based on above data
{
    "data": {
        "studentinfo": {
            "city": "Pune",
            "name": {
                "id": {
                    "grant": "YES"
                }
            }
        },
        "country": "India"
    }
}
We have tried using the F.struct function manually, but we need a dynamic way to build this JSON using the df DataFrame, which holds the data and action columns:
from pyspark.sql import functions as F

data.select(
    F.struct(
        F.struct(
            F.struct(F.col("DA_Stinf_city")).alias("city"),
            F.struct(
                F.struct(F.col("DA_Stinf_NA_ID_GRANT")).alias("id")
            ).alias("name"),
        ).alias("studentinfo"),
        F.struct(F.col("DA_country")).alias("country")
    ).alias("data")
)
The approach below should give the correct structure (with the wrong key names - if you are happy with the approach, which doesn't use DataFrame operations but rather works in the underlying RDD, then I can flesh it out):
def build_json(pairs, running=None):
    # Avoid a shared mutable default argument between calls
    if running is None:
        running = {}
    new_input = {}
    for hierarchy, value in pairs:
        key = hierarchy.pop(0)
        if len(hierarchy) == 0:
            running[key] = value
        else:
            new_input[key] = new_input.get(key, []) + [(hierarchy, value)]
    for key in new_input:
        running[key] = build_json(new_input[key], running={})
    return running

data.rdd.map(
    lambda x: build_json(
        [(column.split("_"), value) for column, value in x.asDict().items()]
    )
)
The basic idea is to get a set of tuples from the underlying RDD consisting of the column name broken into its json hierarchy and the value to insert into the hierarchy. Then the function build_json inserts the value into its correct place in the json hierarchy, while building out the json object recursively.
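As a quick sanity check (plain Python, no Spark needed), here is what build_json returns for one sample row; the key names come from splitting the column names on "_", hence the "wrong key names" caveat above:

row = {"DA_Stinf_city": "Pune", "DA_Stinf_NA_ID_GRANT": "YES", "DA_country": "India"}
print(build_json([(column.split("_"), value) for column, value in row.items()]))
# {'DA': {'country': 'India', 'Stinf': {'city': 'Pune', 'NA': {'ID': {'GRANT': 'YES'}}}}}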

Normalizing json using pandas with inconsistent nested lists/dictionaries

I've been using pandas' json_normalize for a bit but ran into a problem with a specific JSON file, similar to the one seen here: https://github.com/pandas-dev/pandas/issues/37783#issuecomment-1148052109
I'm trying to find a way to retrieve the data within the Ats -> Ats dict and return any null values (like the one seen in the ID:101 entry) as NaN values in the dataframe. Ignoring errors within the json_normalize call doesn't prevent the TypeError that stems from trying to iterate through a null value.
Any advice or methods to receive a valid dataframe out of data with this structure is greatly appreciated!
import json
import pandas as pd

data = """[
    {
        "ID": "100",
        "Ats": {
            "Ats": [
                {
                    "Name": "At1",
                    "Desc": "Lazy At"
                }
            ]
        }
    },
    {
        "ID": "101",
        "Ats": null
    }
]"""

data = json.loads(data)
df = pd.json_normalize(data, ["Ats", "Ats"], "ID", errors='ignore')
df.head()
TypeError: 'NoneType' object is not iterable
I tried to iterate through the Ats dictionary, which works for the entry with ID 100 but not for ID 101. I expected ignoring errors within the function to return a NaN value in the DataFrame, but instead received a TypeError for trying to iterate through a null value.
The desired output is a DataFrame with one row per ID, with NaN in the Name and Desc columns where Ats is null (as in the outputs shown below).
This approach can be more efficient when it comes to dealing with large datasets.
import numpy as np

data = json.loads(data)
desired_data = list(
    map(lambda x: pd.json_normalize(x, ["Ats", "Ats"], "ID").to_dict(orient="records")[0]
        if x["Ats"] is not None
        else {"ID": x["ID"], "Name": np.nan, "Desc": np.nan}, data))
df = pd.DataFrame(desired_data)
Output:
Name Desc ID
0 At1 Lazy At 100
1 NaN NaN 101
You might want to consider this simple try/except approach when working with small datasets: whenever an error is raised, append a new row with NaN values. Note that DataFrame.append was removed in pandas 2.0, so the version below collects the rows in a list and concatenates them instead.
Example:
import numpy as np

data = json.loads(data)
rows = []
for item in data:
    try:
        rows.append(pd.json_normalize(item, ["Ats", "Ats"], "ID"))
    except TypeError:
        rows.append(pd.DataFrame([{"ID": item["ID"], "Name": np.nan, "Desc": np.nan}]))
df = pd.concat(rows, ignore_index=True)
print(df)
Output:
Name Desc ID
0 At1 Lazy At 100
1 NaN NaN 101
Maybe you can create a DataFrame from the data normally (without pd.json_normalize) and then transform it into the requested form afterwards:
import json
import pandas as pd

data = """\
[
    {
        "ID": "100",
        "Ats": {
            "Ats": [
                {
                    "Name": "At1",
                    "Desc": "Lazy At"
                }
            ]
        }
    },
    {
        "ID": "101",
        "Ats": null
    }
]"""
data = json.loads(data)
df = pd.DataFrame(data)
df["Ats"] = df["Ats"].str["Ats"]  # pull the inner list out of each "Ats" dict (null stays NaN)
df = df.explode("Ats")            # one row per inner record
df = pd.concat([df, df.pop("Ats").apply(pd.Series, dtype=object)], axis=1)
print(df)
Prints:
ID Name Desc
0 100 At1 Lazy At
1 101 NaN NaN

How can I parse a json to a pandas dataframe without losing values? [duplicate]

This question already has answers here:
JSON to pandas DataFrame
(14 answers)
Closed 1 year ago.
I'm trying to parse a JSON file into a DataFrame, and I want to focus only on the first key of the JSON (validations). The structure of the JSON is pretty standard, as in the example below:
{
    "validations": [
        {
            "id": "1111111-2222-3333-4444-555555555555",
            "created_at": "2020-02-19T14:35:58-03:00",
            "finished_at": "2020-02-19T14:36:01-03:00",
            "processing_status": "concluded",
            "receivable_id": "VAL-AAAAAAA-BBBB-CCCC-DDDD-EEEEEEEEEEEE",
            "external_reference": "FFFFFFFF-GGGG-HHHH-IIII-JJJJJJJJJJJJ",
            "batch_id": "e2fb8d34-8c53-4910-b7a4-602ab6845855",
            "portfolio": {
                "id": "57a3e56a-347b-449c-8f1a-253baba90e7a",
                "nome": "COMPANY_NAME"
            }
        }
    ],
    "pages": {
        "per_page": 10,
        "page": 1
    }
}
I'm using the following code:
import json
import pandas as pd
import os

print(os.getcwd())  # check the directory you're working in

filename = r"file_path\file_name.json"
f = open(filename)
data1 = json.loads(f.read())
df = pd.json_normalize(data1)
data1.keys()
## => returns: dict_keys(['validacoes', 'paginacao'])

res = dict((k, data1[k]) for k in ['validacoes'] if k in data1)
res.keys()
## => returns dict_keys(['validacoes'])

df = pd.DataFrame(res, columns=['id', 'data_criacao', 'data_finalizacao', 'status_do_processamento',
                                'recebivel_id', 'referencia_externa', 'lote_id', 'veiculo'])
df.head()
## => returns a DataFrame with no values in the columns, as if they were empty in the JSON
| id | created_at | finished_at | processing_status | receivable_id | external_reference | batch_id | external_reference | portfolio |
I already checked the original file in a text editor and, yes, the JSON is properly populated with values.
And the format is standardized throughout the file.
Any thoughts as to why the data from the json is being lost on the process?
You need to flatten your JSON.
This post should help you out:
Python flatten multilevel/nested JSON
Also, pandas has a simple json_normalize method you could use.
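For instance, a minimal sketch with json_normalize (assuming the English key names from the example JSON; substitute 'validacoes' etc. if your real file uses the Portuguese names):

import json
import pandas as pd

with open(r"file_path\file_name.json") as f:
    data1 = json.load(f)

# Normalize only the records under "validations"; the nested "portfolio"
# dict becomes the dotted columns portfolio.id and portfolio.nome
df = pd.json_normalize(data1["validations"])
print(df.columns.tolist())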

Create json files from dataframe

I have a dataframe with 10 rows like this:
id | name      | team
1  | Employee1 | Team1
2  | Employee2 | Team2
...
How can I generate 10 json files from the dataframe with python?
Here is the format of each json file:
{
"Company": "Company",
"id": "1",
"name": "Employee1",
"team": "Team1"
}
The field "Company": "Company" is the same in all json files.
Name of each json file is the name of each employee (i.e Employee1.json)
I do not really like iterrows but as you need a file per row, I cannot imagine how to vectorize the operation:
for _, row in df.iterrows():
    row['Company'] = 'Company'
    row.to_json(row['name'] + '.json')
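If you prefer to avoid iterrows altogether, a minimal alternative sketch using to_dict (with the column names from the question) writes the same files:

import json

for record in df.to_dict(orient="records"):
    # Put the constant "Company" field first, matching the requested format
    with open(record["name"] + ".json", "w") as f:
        json.dump({"Company": "Company", **record}, f, indent=4)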
You could use apply in the following way, passing the employee name (available to you in x) to to_json so each row is written to its own file:

df.apply(lambda x: x.to_json(x['name'] + '.json'), axis=1)
Another approach is to iterate over the rows like:
for i in df.index:
    df.loc[i].to_json("Employee{}.json".format(i))

Python dataframe to nested json file

I have a pandas DataFrame as below:
Emp_No Name Project Task
1 ABC P1 T1
1 ABC P2 T2
2 DEF P3 T3
3 IJH Null Null
I need to convert it to a JSON file and save it to disk as below.
JSON file:
{
    "Records": [
        {
            "Emp_No": "1",
            "Project_Details": [
                {
                    "Project": "P1",
                    "Task": "T1"
                },
                {
                    "Project": "P2",
                    "Task": "T2"
                }
            ],
            "Name": "ABC"
        },
        {
            "Emp_No": "2",
            "Project_Details": [
                {
                    "Project": "P3",
                    "Task": "T3"
                }
            ],
            "Name": "DEF"
        },
        {
            "Emp_No": "3",
            "Project_Details": [],
            "Name": "IJH"
        }
    ]
}
I feel like this post is not a doubt per se, but a cheeky attempt to avoid formatting the data, hahaha. But since I'm trying to get used to the DataFrame structure and the different ways of handling it, here you go!
import pandas as pd

asutosh_data = {'Emp_No': ["1", "1", "2", "3"],
                'Name': ["ABC", "ABC", "DEF", "IJH"],
                'Project': ["P1", "P2", "P3", "Null"],
                'Task': ["T1", "T2", "T3", "Null"]}
df = pd.DataFrame(data=asutosh_data)

records = []
dif_emp_no = df['Emp_No'].unique()
for emp_no in dif_emp_no:
    emp_data = df.loc[df['Emp_No'] == emp_no]
    emp_project_details = []
    for index, data in emp_data.iterrows():
        if data["Project"] != "Null":
            emp_project_details.append({"Project": data["Project"], "Task": data["Task"]})
    records.append({"Emp_No": emp_data.iloc[0]["Emp_No"],
                    "Project_Details": emp_project_details,
                    "Name": emp_data.iloc[0]["Name"]})

final_data = {"Records": records}
print(final_data)
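To actually save the result to disk as the question asks, a plain json.dump call completes the job (the file name here is just an example):

import json

with open("records.json", "w") as f:
    json.dump(final_data, f, indent=4)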
If you have any question about the code above, feel free to ask. I'll also leave below the documentation I've used to solve your problem (you may want to check that):
unique : https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.unique.html
loc : https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html
iloc : https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html
