Create pandas MultiIndex Dataframe from json - python

I am receiving the following json from a webservice:
{
"headers":[
{
"seriesId":"18805",
"Name":"Name1",
"assetId":"4"
},
{
"seriesId":"18801",
"Name":"Name2",
"assetId":"209"
}
],
"values":[
{
"Date":"01-Jan-2021",
"18805":"127.93",
"18801":"75.85"
}
]
}
Is there a way to create a MultiIndex dataframe from this data? I would like Date to be the row index and the rest to be column indexes.

The values key maps directly onto a DataFrame.
The column MultiIndex can be rebuilt from the headers key.
import pandas as pd
js = {'headers': [{'seriesId': '18805', 'Name': 'Name1', 'assetId': '4'},
                  {'seriesId': '18801', 'Name': 'Name2', 'assetId': '209'}],
      'values': [{'Date': '01-Jan-2021', '18805': '127.93', '18801': '75.85'}]}
# get values into dataframe
df = pd.DataFrame(js["values"]).set_index("Date")
# get headers for use in rebuilding column names
dfc = pd.DataFrame(js["headers"])
# rebuild columns (assumes headers are listed in the same order as the value columns)
df.columns = pd.MultiIndex.from_tuples(dfc.apply(tuple, axis=1), names=dfc.columns)
print(df)
seriesId      18805  18801
Name          Name1  Name2
assetId           4    209
Date
01-Jan-2021  127.93  75.85
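The column rebuild in this answer relies on the headers arriving in the same order as the value columns. A minimal sketch of a variant that aligns the two by seriesId instead, assuming every value column has a matching header entry. The names values and headers are illustrative, and the cast to float is an extra step that is not part of the original answer:

```python
import pandas as pd

js = {'headers': [{'seriesId': '18805', 'Name': 'Name1', 'assetId': '4'},
                  {'seriesId': '18801', 'Name': 'Name2', 'assetId': '209'}],
      'values': [{'Date': '01-Jan-2021', '18805': '127.93', '18801': '75.85'}]}

values = pd.DataFrame(js["values"]).set_index("Date")

# Index the headers by seriesId (keeping it as a column too), then reorder
# the header rows so they follow the order of the value columns.
headers = pd.DataFrame(js["headers"]).set_index("seriesId", drop=False)
headers = headers.loc[values.columns]

# Each header row becomes one column tuple; the header column names
# become the level names of the MultiIndex.
values.columns = pd.MultiIndex.from_frame(headers)

# The web service delivers numbers as strings; convert them (assumption:
# every value parses as a float).
values = values.astype(float)
print(values)
```

With the columns in place, a single series can be selected by its full tuple, for example values[("18805", "Name1", "4")].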

Related

How to remove redundant elements from a JSON string in Python

I have the below JSON string which I converted from a Pandas data frame.
[
{
"ID":"1",
"Salary1":69.43,
"Salary2":513.0,
"Date":"2022-06-09",
"Name":"john",
"employeeId":12,
"DateTime":"2022-09-0710:57:55"
},
{
"ID":"2",
"Salary1":691.43,
"Salary2":5123.0,
"Date":"2022-06-09",
"Name":"john",
"employeeId":12,
"DateTime":"2022-09-0710:57:55"
}
]
I want to change the above JSON to the below format.
[
{
"Date":"2022-06-09",
"Name":"john",
"DateTime":"2022-09-0710:57:55",
"employeeId":12,
"Results":[
{
"ID":1,
"Salary1":69.43,
"Salary2":513
},
{
"ID":"2",
"Salary1":691.43,
"Salary2":5123
}
]
}
]
Kindly let me know how we can achieve this in Python.
Original Dataframe:
ID  Salary1  Salary2        Date  Name  employeeId            DateTime
 1    69.43    513.0  2022-06-09  john          12  2022-09-0710:57:55
 2   691.43   5123.0  2022-06-09  john          12  2022-09-0710:57:55
Thank you.
As @Harsha pointed out, you can adapt one of the answers from another question, with some minor tweaks to make it work for the OP's case:
(
    df.groupby(["Date","Name","DateTime","employeeId"])[["ID","Salary1","Salary2"]]
    # to_dict(orient="records") - returns list of rows, where each row is a dict,
    # "oriented" like [{column -> value}, … , {column -> value}]
    .apply(lambda x: x.to_dict(orient="records"))
    # groupby returns a Series with the grouping columns as index and lists of dicts as values.
    # This structure is no good for the next to_dict() call,
    # so here we create a new DataFrame out of the grouped Series,
    # with the Series' index levels as columns of the DataFrame,
    # renaming the Series' values to "Results" while we are at it.
    .reset_index(name="Results")
    # Finally we can achieve the desired structure with the last call to to_dict():
    .to_dict(orient="records")
)
# [{'Date': '2022-06-09', 'Name': 'john', 'DateTime': '2022-09-0710:57:55', 'employeeId': 12,
# 'Results': [
# {'ID': 1, 'Salary1': 69.43, 'Salary2': 513.0},
# {'ID': 2, 'Salary1': 691.43, 'Salary2': 5123.0}
# ]}]
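Since the question asks for a JSON string rather than a list of dicts, one option is to stop before the final to_dict() and call to_json on the regrouped frame. A self-contained sketch, assuming the dataframe from the question (rebuilt here by hand). The names group_cols, detail_cols, nested and json_text are illustrative:

```python
import json

import pandas as pd

df = pd.DataFrame({
    "ID": ["1", "2"],
    "Salary1": [69.43, 691.43],
    "Salary2": [513.0, 5123.0],
    "Date": ["2022-06-09", "2022-06-09"],
    "Name": ["john", "john"],
    "employeeId": [12, 12],
    "DateTime": ["2022-09-0710:57:55", "2022-09-0710:57:55"],
})

group_cols = ["Date", "Name", "DateTime", "employeeId"]
detail_cols = ["ID", "Salary1", "Salary2"]

# One row per group; the "Results" cell holds that group's rows as a list of dicts.
nested = (
    df.groupby(group_cols)[detail_cols]
    .apply(lambda g: g.to_dict(orient="records"))
    .reset_index(name="Results")
)

# Serialize the regrouped frame; nested lists and dicts are written as JSON arrays and objects.
json_text = nested.to_json(orient="records")

# Round-trip through the json module to inspect the structure.
parsed = json.loads(json_text)
print(json.dumps(parsed, indent=2))
```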

Pandas DataFrame - remove / replace dict values based on key

Say I have a DataFrame defined as:
df = {
"customer_name":"john",
"phone":{
"mobile":000,
"office":111
},
"mail":{
"office":"john#office.com",
"personal":"john#home.com",
"fax":"12345"
}
}
I want to somehow alter the value in column "mail" to remove the key "fax". Eg, the output DataFrame would be something like:
output_df = {
"customer_name":"john",
"phone":{
"mobile":000,
"office":111
},
"mail":{
"office":"john#office.com",
"personal":"john#home.com"
}
}
where the "fax" key-value pair has been deleted. I tried to use pandas.map with a dict in the lambda, but it does not work. One bad workaround I had was to normalize the dict, but this created unnecessary output columns, and I could not merge them back. E.g.:
df = pd.json_normalize(df)
Is there a better way for this?
You can use pop to remove the element with a given key from a dict.
import pandas as pd
df['mail'].pop('fax')
df = pd.json_normalize(df)
df
Output:
  customer_name  phone.mobile  phone.office      mail.office  mail.personal
0          john             0           111  john#office.com  john#home.com
Is there a reason you just don't access it directly and delete it?
Like this:
del df['mail']['fax']
print(df)
{'customer_name': 'john',
'phone': {'mobile': 0, 'office': 111},
'mail': {'office': 'john#office.com', 'personal': 'john#home.com'}}
This is the simplest technique to achieve your aim.
import pandas as pd
df = {
    "customer_name": "john",
    "phone": {
        "mobile": 000,
        "office": 111
    },
    "mail": {
        "office": "john#office.com",
        "personal": "john#home.com",
        "fax": "12345"
    }
}
# remove the unwanted key from the nested dict, then flatten
del df['mail']['fax']
df = pd.json_normalize(df)
df
Output :
  customer_name  phone.mobile  phone.office      mail.office  mail.personal
0          john             0           111  john#office.com  john#home.com
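The answers here operate on a plain dict, which is what the question's snippet actually defines. If the data really is a DataFrame whose "mail" column holds one dict per row (an assumption, not something the question shows), a sketch of removing the key row by row without flattening could look like this:

```python
import pandas as pd

# Assumed shape: one row per customer, with dicts stored in the cells.
df = pd.DataFrame([
    {
        "customer_name": "john",
        "phone": {"mobile": 0, "office": 111},
        "mail": {
            "office": "john#office.com",
            "personal": "john#home.com",
            "fax": "12345",
        },
    }
])

# Build a new dict per row without the "fax" key; the original dicts are left untouched.
df["mail"] = [
    {key: value for key, value in mail.items() if key != "fax"}
    for mail in df["mail"]
]
print(df.loc[0, "mail"])
```

Building new dicts rather than calling pop on the stored ones avoids mutating objects that other code may still reference.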

Add dict as value to dataframe

I want to add a dict to a dataframe, where the dict has dicts or lists as values.
Example:
abc = {'id': 'niceId',
'category': {'sport':'tennis',
'land': 'USA'
},
'date': '2022-04-12T23:33:21+02:00'
}
Now, I want to add this dict to a dataframe. I tried this, but it failed:
df = pd.DataFrame(abc, columns = abc.keys())
Output:
ValueError: All arrays must be of the same length
I'm thankful for your help.
Your question does not make the expected output clear. Assuming you want a dataframe with the columns id, category, date and numbers (numbers is added here just to show the list case), where each cell in the category column holds a dictionary and each cell in the numbers column holds a list, you can use the from_dict method with a transpose:
import pandas as pd
abc = {'id': 'niceId',
       'category': {'sport':'tennis',
                    'land': 'USA'
                    },
       'date': '2022-04-12T23:33:21+02:00',
       'numbers': [1,2,3,4,5]
       }
df = pd.DataFrame.from_dict(abc, orient="index").T
gives you a dataframe as:
       id                          category                       date      numbers
0  niceId  {'sport':'tennis','land': 'USA'}  2022-04-12T23:33:21+02:00  [1,2,3,4,5]
So let's say you want to add another item to this dataframe:
efg = {'id': 'notniceId',
       'category': {'sport':'swimming',
                    'land': 'UK'
                    },
       'date': '2021-04-12T23:33:21+02:00',
       'numbers': [4,5]
       }
df2 = pd.DataFrame.from_dict(efg, orient="index").T
pd.concat([df, df2], ignore_index=True)
gives you a dataframe as:
          id                           category                       date      numbers
0     niceId   {'sport':'tennis','land': 'USA'}  2022-04-12T23:33:21+02:00  [1,2,3,4,5]
1  notniceId  {'sport':'swimming','land': 'UK'}  2021-04-12T23:33:21+02:00        [4,5]
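An alternative sketch that is not from the answer above: passing a list of records to the DataFrame constructor treats each dict as one row, so nested dicts and lists stay in their cells and no transpose is needed. It reuses the abc and efg dicts from this answer:

```python
import pandas as pd

abc = {'id': 'niceId',
       'category': {'sport': 'tennis', 'land': 'USA'},
       'date': '2022-04-12T23:33:21+02:00',
       'numbers': [1, 2, 3, 4, 5]}
efg = {'id': 'notniceId',
       'category': {'sport': 'swimming', 'land': 'UK'},
       'date': '2021-04-12T23:33:21+02:00',
       'numbers': [4, 5]}

# Each dict in the list becomes one row; its keys become the columns.
df = pd.DataFrame([abc, efg])
print(df)

# The nested objects are stored as-is and can be read back from the cells.
print(df.loc[0, "category"]["sport"])
```

Wrapping a single dict in a list, as in pd.DataFrame([abc]), also avoids the ValueError from the question, because the constructor no longer tries to treat each value as a column array.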

Converting json dictionary to spark dataframe by having keys as columns

Is it possible to convert a dictionary into a dataframe by having the keys as columns with the values beneath?
I have this result set from api as a dictionary:
{
'information': [{
'created': '2020-10-26T00:00:00+00:00',
'title': 'Random1',
'published': 'YES',
}, {
'created': '2020-11-06T00:00:00+00:00',
'title': 'Random2',
'published': 'YES',
}, {
'created': '2020-10-27T00:00:00+00:00',
'title': 'Random3',
'published': 'YES',
}, {
'created': '2020-10-29T00:00:00+00:00',
'title': 'Random4',
'published': 'YES',
}]
}
If I convert this to a dataframe like this:
json_rdd=sc.parallelize([data_dict['information']])
spark_df = spark.createDataFrame(json_rdd)
spark_df.createOrReplaceTempView("data_df");
This gives me columns listed as _1, _2, _3,_4 with the data still showing as objects within them.
Is it possible to have the data_df (converted dataframe) show the columns as created, title, published and have the values within the corresponding columns as flat?
You can create the dataframe directly from the list of dictionaries; there is no need to convert it to an RDD.
arr = your_dict_here
spark.createDataFrame(arr['information']).show()
Output:
+--------------------+---------+-------+
|             created|published|  title|
+--------------------+---------+-------+
|2020-10-26T00:00:...|      YES|Random1|
|2020-11-06T00:00:...|      YES|Random2|
|2020-10-27T00:00:...|      YES|Random3|
|2020-10-29T00:00:...|      YES|Random4|
+--------------------+---------+-------+

Pandas json_normalize and JSON flattening error

A pandas newbie here, struggling to understand why I'm unable to completely flatten a JSON I receive from an API. I need a DataFrame with all the data returned by the API, but with all nested data expanded into its own columns so that I can use it.
The JSON I receive is as follows:
[
{
"query":{
"id":"1596487766859-3594dfce3973bc19",
"name":"test"
},
"webPage":{
"inLanguages":[
{
"code":"en"
}
]
},
"product":{
"name":"Test",
"description":"Test2",
"mainImage":"image1.jpg",
"images":[
"image2.jpg",
"image3.jpg"
],
"offers":[
{
"price":"45.0",
"currency":"€"
}
],
"probability":0.9552192
}
}
]
Running pd.json_normalize(data) without any additional parameters shows the nested values price and currency in the product.offers column. When I try to separate these out into their own columns with the following:
pd.json_normalize(data,record_path=['product',meta['product',['offers']]])
I end up with the following error:
f"{js} has non list value {result} for path {spec}. "
Any help would be much appreciated.
I've used this technique a few times:
do an initial pd.json_normalize() to discover the columns
build the meta parameter by inspecting those columns and the original JSON. NB possible index out of range here
only one list can drive the record_path param
a few tricks: product/images is a list of plain values, so its column gets named 0. Rename it
I did a Cartesian product to merge the two data frames that come from breaking down the lists. It's not so stable
import pandas as pd
data = [{'query': {'id': '1596487766859-3594dfce3973bc19', 'name': 'test'},
         'webPage': {'inLanguages': [{'code': 'en'}]},
         'product': {'name': 'Test',
                     'description': 'Test2',
                     'mainImage': 'image1.jpg',
                     'images': ['image2.jpg', 'image3.jpg'],
                     'offers': [{'price': '45.0', 'currency': '€'}],
                     'probability': 0.9552192}}]
# build default to get column names
df = pd.json_normalize(data)
# from column names build the list that gets sent to meta param
mymeta = [c.split(".") for c in df.columns]
# exclude lists from meta - NB this lookup assumes two-level paths; a top-level column raises IndexError
mymeta = [l for l in mymeta if not isinstance(data[0][l[0]][l[1]], list)]
# you can build df from either of the product lists NOT both
df1 = pd.json_normalize(data, record_path=[["product","offers"]], meta=mymeta)
df2 = pd.json_normalize(data, record_path=[["product","images"]], meta=mymeta).rename(columns={0:"image"})
# want them together - you can merge them. note columns heavily overlap so remove most columns from df2
df1.assign(foo=1).merge(
    df2.assign(foo=1).drop(columns=[c for c in df2.columns if c != "image"]), on="foo"
).drop(columns="foo")
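The foo helper column in this answer emulates a cross join. A minimal sketch of the same idea using merge(how="cross"), assuming pandas 1.2 or later and a single top-level record (with several records a cross join would pair rows from different records). The names scalars, offers, images and combined are illustrative:

```python
import pandas as pd

data = [{'query': {'id': '1596487766859-3594dfce3973bc19', 'name': 'test'},
         'webPage': {'inLanguages': [{'code': 'en'}]},
         'product': {'name': 'Test',
                     'description': 'Test2',
                     'mainImage': 'image1.jpg',
                     'images': ['image2.jpg', 'image3.jpg'],
                     'offers': [{'price': '45.0', 'currency': '€'}],
                     'probability': 0.9552192}}]

# Scalar fields: flatten everything, then drop the two list columns
# that are expanded separately below.
scalars = pd.json_normalize(data).drop(columns=["product.offers", "product.images"])

# One frame per nested list; a list of plain strings yields a column named 0.
offers = pd.json_normalize(data, record_path=["product", "offers"])
images = pd.json_normalize(data, record_path=["product", "images"]).rename(columns={0: "image"})

# Cross joins pair every row on the left with every row on the right.
combined = scalars.merge(offers, how="cross").merge(images, how="cross")
print(combined[["product.name", "price", "currency", "image"]])
```

Here the result has one row per image, with the offer and scalar fields repeated on each row.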
