I am working with Python and I have the following JSON, which I need to convert to a DataFrame:
JSON:
{"Results":
{"forecast": [2.1632421537363355, 16.35421956127545],
"prediction_interval": ["[-114.9747272420262, 119.30121154949884]",
"[-127.10990770140964, 159.8183468239605]"],
"index": [{"SaleDate": 1644278400000, "OfferingGroupId": 0},
{"SaleDate": 1644364800000, "OfferingGroupId": 1}]
}
}
Expected Dataframe output:
Forecast SaleDate OfferingGroupId
2.1632421537363355 2022-02-08 0
16.35421956127545 2022-02-09 1
I have tried a few things but am not getting anywhere close; my last attempt was:
string = '{"Results": {"forecast": [2.1632421537363355, 16.35421956127545], "prediction_interval": ["[-114.9747272420262, 119.30121154949884]", "[-127.10990770140964, 159.8183468239605]"], "index": [{"SaleDate": 1644278400000, "OfferingGroupId": 0}, {"SaleDate": 1644364800000, "OfferingGroupId": 1}]}}'
json_obj = json.loads(string)
df = pd.DataFrame(json_obj)
print(df)
df = pd.concat([df['Results']], axis=0)
df = pd.concat([df['forecast'], df['index'].apply(pd.Series)], axis=1)
which resulted in an error:
AttributeError: 'list' object has no attribute 'apply'
One possible approach is to create a DataFrame from the value under "Results" (this will create a column named "index"), build another DataFrame from that "index" column, and join it back to the original DataFrame:
data = json.loads(string)  # parse the JSON from the question
df = pd.DataFrame(data['Results'])
df = df.join(pd.DataFrame(df['index'].tolist())).drop(columns=['prediction_interval', 'index'])
df['SaleDate'] = pd.to_datetime(df['SaleDate'], unit='ms')
Output:
forecast SaleDate OfferingGroupId
0 2.163242 2022-02-08 0
1 16.354220 2022-02-09 1
It's not very pretty, but you can discard all the nesting that makes this complicated by forcing the data into an aligned list of tuples and then reading that:
import json
import pandas as pd
string = '{"Results": {"forecast": [2.1632421537363355, 16.35421956127545], "prediction_interval": ["[-114.9747272420262, 119.30121154949884]", "[-127.10990770140964, 159.8183468239605]"], "index": [{"SaleDate": 1644278400000, "OfferingGroupId": 0}, {"SaleDate": 1644364800000, "OfferingGroupId": 1}]}}'
results_dict = json.loads(string)["Results"]
results_tuples = zip(results_dict["forecast"],
[d["SaleDate"] for d in results_dict["index"]],
[d["OfferingGroupId"] for d in results_dict["index"]])
df = pd.DataFrame(results_tuples, columns=["Forecast", "SaleDate", "OfferingGroupId"])
df['SaleDate'] = pd.to_datetime(df['SaleDate'], unit='ms')
print(df)
   Forecast   SaleDate  OfferingGroupId
0 2.163242 2022-02-08 0
1 16.354220 2022-02-09 1
Or, using the same idea but forcing it into an aligned dict format:
string = '{"Results": {"forecast": [2.1632421537363355, 16.35421956127545], "prediction_interval": ["[-114.9747272420262, 119.30121154949884]", "[-127.10990770140964, 159.8183468239605]"], "index": [{"SaleDate": 1644278400000, "OfferingGroupId": 0}, {"SaleDate": 1644364800000, "OfferingGroupId": 1}]}}'
results_dict = json.loads(string)["Results"]
results_dict = {"Forecast": results_dict["forecast"],
"SaleDate": [d["SaleDate"] for d in results_dict["index"]],
"OfferingGroupId": [d["OfferingGroupId"] for d in results_dict["index"]]}
df = pd.DataFrame.from_dict(results_dict)
df['SaleDate'] = pd.to_datetime(df['SaleDate'], unit='ms')
print(df)
   Forecast   SaleDate  OfferingGroupId
0 2.163242 2022-02-08 0
1 16.354220 2022-02-09 1
In my experience, letting pandas read an input format it wasn't meant for and then fixing it up with pandas methods generally causes much more of a headache than building a dict or tuple-list format as an intermediate step and reading that instead. But that might just be personal preference.
Just load "index" as a column, then use tolist() to split it into two columns and create a new DataFrame. Combine the new DataFrame with the original via pd.concat().
In this example, I also included columns for prediction_interval because I figured you might want that, too.
d = {"Results":
{"forecast": [2.1632421537363355, 16.35421956127545],
"prediction_interval": ["[-114.9747272420262, 119.30121154949884]", "[-127.10990770140964, 159.8183468239605]"],
"index": [{"SaleDate": 1644278400000, "OfferingGroupId": 0}, {"SaleDate": 1644364800000, "OfferingGroupId": 1}]
}
}
res = pd.DataFrame(d['Results'])
sd = pd.DataFrame(res['index'].tolist())
sd['SaleDate'] = pd.to_datetime(sd['SaleDate'], unit='ms')
pi = pd.DataFrame(res['prediction_interval'].map(json.loads).tolist(), columns=['pi_start', 'pi_end'])
df = pd.concat((res, pi, sd), axis=1).drop(columns=['index', 'prediction_interval'])
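For reference, the combined frame should look roughly like this (values taken from the question's JSON, column order per the concat above):
    forecast    pi_start      pi_end   SaleDate  OfferingGroupId
0   2.163242 -114.974727  119.301212 2022-02-08                0
1  16.354220 -127.109908  159.818347 2022-02-09                1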
You can use the pandas library:
import json
import pandas as pd
with open('data.json') as f:
    data = json.load(f)
print(data)
df = pd.read_json('data.json')
df
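Note that for the nested payload in this question, pd.read_json on its own will likely leave you with a single "Results" column indexed by the inner keys (forecast, prediction_interval, index), so you would still need the reshaping shown in the other answers to get the expected table.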
I've been using pandas' json_normalize for a while but ran into a problem with a specific JSON file, similar to the one seen here: https://github.com/pandas-dev/pandas/issues/37783#issuecomment-1148052109
I'm trying to find a way to retrieve the data within the Ats -> Ats dict and return any null values (like the one in the ID 101 entry) as NaN values in the DataFrame. Ignoring errors within the json_normalize call doesn't prevent the TypeError that stems from trying to iterate through a null value.
Any advice on how to get a valid DataFrame out of data with this structure is greatly appreciated!
import json
import pandas as pd
data = """[
{
"ID": "100",
"Ats": {
"Ats": [
{
"Name": "At1",
"Desc": "Lazy At"
}
]
}
},
{
"ID": "101",
"Ats": null
}
]"""
data = json.loads(data)
df = pd.json_normalize(data, ["Ats", "Ats"], "ID", errors='ignore')
df.head()
TypeError: 'NoneType' object is not iterable
I tried to iterate through the Ats dictionary, which would work normally for the data with ID 100 but not with ID 101. I expected ignoring errors within the function to return a NaN value in a dataframe but instead received a TypeError for trying to iterate through a null value.
The desired output would look like this:
  Name     Desc   ID
0  At1  Lazy At  100
1  NaN      NaN  101
This approach can be more efficient when dealing with large datasets:
import json
import numpy as np
import pandas as pd

data = json.loads(data)
desired_data = list(
map(lambda x: pd.json_normalize(x, ["Ats", "Ats"], "ID").to_dict(orient="records")[0]
if x["Ats"] is not None
else {"ID": x["ID"], "Name": np.nan, "Desc": np.nan}, data))
df = pd.DataFrame(desired_data)
Output:
Name Desc ID
0 At1 Lazy At 100
1 NaN NaN 101
You might want to consider this simple try/except approach when working with small datasets: whenever an error is raised, it appends a new row with NaN values to the DataFrame.
Example:
data = json.loads(data)
df = pd.DataFrame()
for item in data:
    try:
        df = df.append(pd.json_normalize(item, ["Ats", "Ats"], "ID"))
    except TypeError:
        df = df.append({"ID": item["ID"], "Name": np.nan, "Desc": np.nan}, ignore_index=True)
print(df)
Output:
Name Desc ID
0 At1 Lazy At 100
1 NaN NaN 101
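Note that DataFrame.append was removed in pandas 2.0. A minimal sketch of the same idea for newer pandas versions, collecting the rows in a list and concatenating once at the end (assuming the same data and imports as above):
rows = []
for item in data:
    try:
        # normalize the nested Ats -> Ats records, carrying ID along
        rows.append(pd.json_normalize(item, ["Ats", "Ats"], "ID"))
    except TypeError:
        # Ats is null: fall back to a single NaN row for this ID
        rows.append(pd.DataFrame([{"ID": item["ID"], "Name": np.nan, "Desc": np.nan}]))
df = pd.concat(rows, ignore_index=True)
print(df)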
Maybe you can create a DataFrame from the data normally (without pd.json_normalize) and then transform it into the requested form afterwards:
import json
import pandas as pd
data = """\
[
{
"ID": "100",
"Ats": {
"Ats": [
{
"Name": "At1",
"Desc": "Lazy At"
}
]
}
},
{
"ID": "101",
"Ats": null
}
]"""
data = json.loads(data)
df = pd.DataFrame(data)
df["Ats"] = df["Ats"].str["Ats"]
df = df.explode("Ats")
df = pd.concat([df, df.pop("Ats").apply(pd.Series, dtype=object)], axis=1)
print(df)
Prints:
ID Name Desc
0 100 At1 Lazy At
1 101 NaN NaN
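Note the df["Ats"].str["Ats"] step: the .str accessor performs element-wise item lookup on each cell (it works on dicts as well as strings) and returns NaN for the null entries, which is what lets the ID 101 row survive the explode with NaN values.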
I'm trying to flatten two columns from a table loaded into a DataFrame as below:
u_group                                                                                      t_group
{"link": "https://hi.com/api/now/table/system/2696f18b376bca0", "value": "2696f18b376bca0"}  {"link": "https://hi.com/api/now/table/system/2696f18b376bca0", "value": "2696f18b376bca0"}
{"link": "https://hi.com/api/now/table/system/99b27bc1db761f4", "value": "99b27bc1db761f4"}  {"link": "https://hi.com/api/now/table/system/99b27bc1db761f4", "value": "99b27bc1db761f4"}
I want to separate them and get them as:
u_group.link                                         u_group.value    t_group.link                                         t_group.value
https://hi.com/api/now/table/system/2696f18b376bca0  2696f18b376bca0  https://hi.com/api/now/table/system/2696f18b376bca0  2696f18b376bca0
https://hi.com/api/now/table/system/99b27bc1db761f4  99b27bc1db761f4  https://hi.com/api/now/table/system/99b27bc1db761f4  99b27bc1db761f4
I used the code below, but wasn't successful.
import ast
from pandas.io.json import json_normalize
df12 = spark.sql("""select u_group,t_group from tbl""")
def only_dict(d):
    '''
    Convert json string representation of dictionary to a python dict
    '''
    return ast.literal_eval(d)

def list_of_dicts(ld):
    '''
    Create a mapping of the tuples formed after
    converting json strings of list to a python list
    '''
    return dict([(list(d.values())[1], list(d.values())[0]) for d in ast.literal_eval(ld)])

A = json_normalize(df12['u_group'].apply(only_dict).tolist()).add_prefix('link.')
B = json_normalize(df['u_group'].apply(list_of_dicts).tolist()).add_prefix('value.')
TypeError: 'Column' object is not callable
Kindly help or suggest if any other code would work better.
Here is a simple example:
data = [[{'link':'A1', 'value':'B1'}, {'link':'A2', 'value':'B2'}],
[{'link':'C1', 'value':'D1'}, {'link':'C2', 'value':'D2'}]]
df = pd.DataFrame(data, columns=['u', 't'])
Output (df):
u t
0 {'link': 'A1', 'value': 'B1'} {'link': 'A2', 'value': 'B2'}
1 {'link': 'C1', 'value': 'D1'} {'link': 'C2', 'value': 'D2'}
Use the following code:
pd.concat([df[i].apply(lambda x: pd.Series(x)).add_prefix(i + '_') for i in df.columns], axis=1)
Output:
u_link u_value t_link t_value
0 A1 B1 A2 B2
1 C1 D1 C2 D2
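In the original table the cells are JSON strings (coming out of Spark SQL) rather than dicts, so you would need to parse them first. A minimal sketch, assuming every cell holds a valid JSON object string:
import json
import pandas as pd

# parse each JSON string cell into a dict before expanding
for col in df.columns:
    df[col] = df[col].map(json.loads)

# then expand each dict column into prefixed link/value columns as above
df_flat = pd.concat(
    [df[c].apply(pd.Series).add_prefix(c + '_') for c in df.columns],
    axis=1
)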
Here are my 2 cents: a simple way to achieve this using PySpark.
Create the DataFrame as follows:
data = [
(
"""{"link": "https://hi.com/api/now/table/system/2696f18b376bca0", "value": "2696f18b376bca0"}""",
"""{"link": "https://hi.com/api/now/table/system/2696f18b376bca0", "value": "2696f18b376bca0"}"""
),
(
"""{"link": "https://hi.com/api/now/table/system/2696f18b376bca0", "value": "2696f18b376bca0"}""",
"""{"link": "https://hi.com/api/now/table/system/99b27bc1db761f4", "value": "99b27bc1db761f4"}"""
)
]
df = spark.createDataFrame(data,schema=['u_group','t_group'])
Then use the from_json() to parse the dictionary and fetch the individual values as follows:
from pyspark.sql.types import *
from pyspark.sql.functions import *
schema_column = StructType([
StructField("link",StringType(),True),
StructField("value",StringType(),True),
])
df = df.withColumn('U_GROUP_PARSE',from_json(col('u_group'),schema_column))\
.withColumn('T_GROUP_PARSE',from_json(col('t_group'),schema_column))\
.withColumn('U_GROUP.LINK',col("U_GROUP_PARSE.link"))\
.withColumn('U_GROUP.VALUE',col("U_GROUP_PARSE.value"))\
.withColumn('T_GROUP.LINK',col("T_GROUP_PARSE.link"))\
.withColumn('T_GROUP.VALUE',col("T_GROUP_PARSE.value"))\
.drop('u_group','t_group','U_GROUP_PARSE','T_GROUP_PARSE')
Print the DataFrame:
df.show(truncate=False)
The result contains the four flattened columns: U_GROUP.LINK, U_GROUP.VALUE, T_GROUP.LINK, and T_GROUP.VALUE.
Say I have the data for a DataFrame defined as:
df = {
"customer_name":"john",
"phone":{
"mobile":000,
"office":111
},
"mail":{
"office":"john#office.com",
"personal":"john#home.com",
"fax":"12345"
}
}
I want to somehow alter the value in the "mail" column to remove the "fax" key. E.g., the output DataFrame would be something like:
output_df = {
"customer_name":"john",
"phone":{
"mobile":000,
"office":111
},
"mail":{
"office":"john#office.com",
"personal":"john#home.com"
}
}
where the "fax" key-value pair has been deleted. I tried to use pandas.map with a dict in the lambda, but it does not work. One bad workaround I had was to normalize the dict, but this created unnecessary output columns, and I could not merge them back. Eg.;
df = pd.json_normalize(df)
Is there a better way for this?
You can use pop to remove an element with a given key from the dict.
import pandas as pd
df['mail'].pop('fax')
df = pd.json_normalize(df)
df
Output:
  customer_name  phone.mobile  phone.office      mail.office  mail.personal
0          john             0           111  john@office.com  john@home.com
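Note that pop mutates the dict in place (and returns the removed value), so it has to run before the json_normalize call.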
Is there a reason you just don't access it directly and delete it?
Like this:
del df['mail']['fax']
print(df)
{'customer_name': 'john',
'phone': {'mobile': 0, 'office': 111},
'mail': {'office': 'john@office.com', 'personal': 'john@home.com'}}
This is the simplest technique to achieve your aim.
import pandas as pd
import numpy as np
df = {
"customer_name":"john",
"phone":{
"mobile":000,
"office":111
},
"mail":{
"office":"john#office.com",
"personal":"john#home.com",
"fax":"12345"
}
}
del df['mail']['fax']
df = pd.json_normalize(df)
df
Output:
  customer_name  phone.mobile  phone.office      mail.office  mail.personal
0          john             0           111  john@office.com  john@home.com
I am making an API call with the following code:
import json
import urllib.request

req = urllib.request.Request(url, body, headers)
response = urllib.request.urlopen(req)
string = response.read().decode('utf-8')
json_obj = json.loads(string)
Which returns the following:
{"forecast": [17.588294043898163, 17.412641963452206],
"index": [
{"SaleDate": 1629417600000, "Type": "Type 1"},
{"SaleDate": 1629504000000, "Type": "Type 2"}
]
}
How can I convert this API response dict to a pandas DataFrame in the following format?
Forecast SaleDate Type
17.588294043898163 2021-08-20 Type 1
17.412641963452206 2021-08-21 Type 2
You can use the following. It uses pandas.Series to convert the dictionary to columns and pandas.to_datetime to map the correct date from the millisecond timestamp:
d = {"forecast": [17.588294043898163, 17.412641963452206],
"index": [
{"SaleDate": 1629417600000, "Type": "Type 1"},
{"SaleDate": 1629504000000, "Type": "Type 2"}
]
}
df = pd.DataFrame(d)
df = pd.concat([df['forecast'], df['index'].apply(pd.Series)], axis=1)
df['SaleDate'] = pd.to_datetime(df['SaleDate'], unit='ms')
Output:
forecast SaleDate Type
0 17.588294 2021-08-20 Type 1
1 17.412642 2021-08-21 Type 2
Here is a solution you can try, using a list comprehension to flatten the data.
import pandas as pd

# resp is the parsed API response shown above
resp = {"forecast": [17.588294043898163, 17.412641963452206],
        "index": [{"SaleDate": 1629417600000, "Type": "Type 1"},
                  {"SaleDate": 1629504000000, "Type": "Type 2"}]}

flatten = [
    {"forecast": j, **resp['index'][i]} for i, j in enumerate(resp['forecast'])
]
pd.DataFrame(flatten)
forecast SaleDate Type
0 17.588294 1629417600000 Type 1
1 17.412642 1629504000000 Type 2
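If you also want SaleDate rendered as a date, the same millisecond conversion from the first answer applies:
df = pd.DataFrame(flatten)
df['SaleDate'] = pd.to_datetime(df['SaleDate'], unit='ms')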
Groups sub-groups selections
0 sg1 csg1 sc1
1 sg1 csg1 sc2
2 sg1 csg2 sc3
3 sg1 csg2 sc4
4 sg2 csg3 sc5
5 sg2 csg3 sc6
6 sg2 csg4 sc7
7 sg2 csg4 sc8
I have the DataFrame shown above, and I am trying to create a JSON object as follows:
{
"sg1": {
"csg1": ['sc1', 'sc2'],
"csg2": ['sc3', 'sc4']
},
"sg2": {
"csg3": ['sc5', 'sc6'],
"csg4": ['sc7', 'sc8']
}
}
I tried using the pandas to_json and to_dict methods with orient arguments, but I am not getting the expected result. I also tried grouping by the columns, creating the lists, and converting the result into JSON.
Any help is much appreciated.
You can group by ['Groups','sub-groups'] and build a nested dictionary from the resulting MultiIndex series. Note that a flat dictionary comprehension such as {k1:{k2:v} for (k1,k2),v in ...} would keep only the last sub-group per group, so build the inner dicts incrementally instead:
s = df.groupby(['Groups', 'sub-groups'])['selections'].agg(list)
d = {}
for (k1, k2), v in s.items():
    d.setdefault(k1, {})[k2] = v
print(d)
# {'sg1': {'csg1': ['sc1', 'sc2'], 'csg2': ['sc3', 'sc4']}, 'sg2': {'csg3': ['sc5', 'sc6'], 'csg4': ['sc7', 'sc8']}}
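Since the goal is a JSON object rather than a Python dict, you can serialize the result with the json module:
import json
print(json.dumps(d, indent=4))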
You need to group on the columns of interest, like this:
import pandas as pd
data = {
'Groups': ['sg1', 'sg1', 'sg1', 'sg1', 'sg2', 'sg2', 'sg2', 'sg2'],
'sub-groups': ['csg1', 'csg1', 'csg2', 'csg2', 'csg3', 'csg3', 'csg4', 'csg4'],
'selections': ['sc1', 'sc2', 'sc3', 'sc4', 'sc5', 'sc6', 'sc7', 'sc8']
}
df = pd.DataFrame(data)
print(df.groupby(['Groups', 'sub-groups'])['selections'].unique().to_dict())
The output is:
{
('sg1', 'csg1'): array(['sc1', 'sc2'], dtype=object),
('sg1', 'csg2'): array(['sc3', 'sc4'], dtype=object),
('sg2', 'csg3'): array(['sc5', 'sc6'], dtype=object),
('sg2', 'csg4'): array(['sc7', 'sc8'], dtype=object)
}
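This leaves you with tuple keys and numpy arrays rather than the nested shape from the question; if you need that shape, a small follow-up sketch:
d = df.groupby(['Groups', 'sub-groups'])['selections'].unique().to_dict()
nested = {}
for (group, sub), arr in d.items():
    # convert the numpy array to a plain list so the result is JSON-serializable
    nested.setdefault(group, {})[sub] = list(arr)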
Let's try a dictify function, which builds a nested dictionary with top-level keys from Groups and corresponding sub-level keys from sub-groups:
from collections import defaultdict

def dictify():
    dct = defaultdict(dict)
    for (x, y), g in df.groupby(['Groups', 'sub-groups']):
        dct[x][y] = [*g['selections']]
    return dict(dct)
# dictify()
{
"sg1": {
"csg1": ["sc1","sc2"],
"csg2": ["sc3","sc4"]
},
"sg2": {
"csg3": ["sc5","sc6"],
"csg4": ["sc7","sc8"]
}
}