How can I turn a nested list with dict inside into extra columns in a dataframe in Python?
I received information within a dict from an API:
{'orders':
[
{ 'orderId': '2838168630',
'dateTimeOrderPlaced': '2020-01-22T18:37:29+01:00',
'orderItems': [{ 'orderItemId': 'BFC0000361764421',
'ean': '234234234234234',
'cancelRequest': False,
'quantity': 1}
]},
{ 'orderId': '2708182540',
'dateTimeOrderPlaced': '2020-01-22T17:45:36+01:00',
'orderItems': [{ 'orderItemId': 'BFC0000361749496',
'ean': '234234234234234',
'cancelRequest': False,
'quantity': 3}
]},
{ 'orderId': '2490844970',
'dateTimeOrderPlaced': '2019-08-17T14:21:46+02:00',
'orderItems': [{ 'orderItemId': 'BFC0000287505870',
'ean': '234234234234234',
'cancelRequest': True,
'quantity': 1}
]}
]
}
which I managed to turn into a simple dataframe by doing this:
pd.DataFrame(received_data.get('orders'))
output:
orderId  date  orderItems
1        1-12  [{'orderItemId': 'dfs13', 'ean': '34234'}]
2        etc.
...
I would like to have something like this:
orderId  date  orderItemId  ean
1        1-12  dfs13        34234
2        etc.
...
I already tried to single out the orderItems column with .iloc and then turn it into a list so I could extract the values again. However, I then still end up with a list from which I need to extract another list, which contains the dict.
# Load the dataframe as you have already done.
temp_df = df['orderItems'].apply(pd.Series)
# concat temp_df and the original df side by side (axis=1 adds columns, not rows)
final_df = pd.concat([df, temp_df], axis=1)
# drop columns if required, e.g. final_df = final_df.drop(columns=['orderItems'])
Hope it works for you.
Cheers
By combining the answers on this question I reached my end goal. I did the following:
# unlist the orderItems column
temp_df = df['orderItems'].apply(pd.Series)
# put the items in orderItems into separate columns
temp_df_json = json_normalize(temp_df[0])
# join the tables
final_df = df.join(temp_df_json)
# drop the old orderItems column for a clean table
final_df = final_df.drop(["orderItems"], axis=1)
Also, instead of .concat() I applied .join() to join both tables based on the existing index.
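For comparison, pd.json_normalize can build the same table in one call straight from the raw dict; a minimal sketch, assuming the 'orders' structure shown above:
import pandas as pd

# explode each order's orderItems into rows, carrying the order-level fields along
final_df = pd.json_normalize(received_data['orders'],
                             record_path='orderItems',
                             meta=['orderId', 'dateTimeOrderPlaced'])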
Just to make it clear: you are receiving JSON from the API, so you can try the function json_normalize.
Try this:
import pandas as pd
from pandas.io.json import json_normalize  # deprecated since pandas 1.0; use pd.json_normalize instead
# DataFrame initialization
df = pd.DataFrame({"orderId": [1], "date": ["1-12"], "oderItems": [{ 'orderItemId': 'dfs13', 'ean': '34234'}]})
# Serializing inner dict
sub_df = json_normalize(df["oderItems"])
# Dropping the unserialized column
df = df.drop(["oderItems"], axis=1)
# joining both dataframes.
df.join(sub_df)
So the output is:
   orderId  date    ean orderItemId
0        1  1-12  34234       dfs13
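Note that from pandas 1.0 onward json_normalize is exposed at the top level, so the deprecated import can be dropped; a sketch of the same flow:
import pandas as pd

df = pd.DataFrame({"orderId": [1], "date": ["1-12"],
                   "orderItems": [{'orderItemId': 'dfs13', 'ean': '34234'}]})
sub_df = pd.json_normalize(df["orderItems"].tolist())  # top-level function in pandas >= 1.0
df = df.drop(["orderItems"], axis=1).join(sub_df)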
I'm creating a dataframe using
data_df = pd.DataFrame(
{'time_stamp': pd.date_range(date_fromx, date_tox, freq=time_length, tz=timezone)}
)
data_df['data'] = np.nan
where I'm making the time_stamp column timezone-aware to match the data in the list of dictionaries below.
I have a list of dictionaries like this:
[{'time_stamp': '2022-10-07T00:00:00.000Z', 'data': 8044.849457711932},
 {'time_stamp': '2022-10-07T00:15:00.000Z', 'data': 4997.691731774312},
 {'time_stamp': '2022-10-07T00:30:00.000Z', 'data': 6984.109211541678},
 {'time_stamp': '2022-10-07T00:45:00.000Z', 'data': 5492.891985861485},
 {'time_stamp': '2022-10-07T01:00:00.000Z', 'data': 5473.0496118099745},
 {'time_stamp': '2022-10-07T01:15:00.000Z', 'data': 6501.250396808703},
 {'time_stamp': '2022-10-07T01:30:00.000Z', 'data': 6017.03827304475},
 {'time_stamp': '2022-10-07T01:45:00.000Z', 'data': 7511.133012466583},
 {'time_stamp': '2022-10-07T02:00:00.000Z', 'data': 5942.32914821161}]
The problem
I have a few dates missing from the list of dictionaries of time_stamp and data. I want to show the missing dates with empty data next to them.
I'm trying to match the time_stamp key in the list of dictionaries against the dates in my dataframe and enter the data from the matching entry next to each matched date in the dataframe.
I have tried creating a new dataframe from the list of dictionaries and mapping with it:
usage_df = pd.DataFrame(usage_data)  # usage_data is the list of dictionaries
map_dict = dict(zip(usage_df['time_stamp'], usage_df['data']))
data_df['data'] = data_df['time_stamp'].map(map_dict)
But this is missing a few of the values at the end.
I also tried concatenating like this:
merged_df = pd.concat([data_df, usage_df], ignore_index=True)
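One way to line the two up (a sketch, assuming the timestamps differ only in formatting and timezone, and reusing the timezone variable from above) is to parse the strings into timezone-aware datetimes and left-merge, so missing dates keep NaN in data:
usage_df = pd.DataFrame(usage_data)
# parse the ISO strings ('...Z') into timezone-aware UTC datetimes
usage_df['time_stamp'] = pd.to_datetime(usage_df['time_stamp'], utc=True)
# convert to the same timezone as data_df so the merge keys compare equal
usage_df['time_stamp'] = usage_df['time_stamp'].dt.tz_convert(timezone)
# a left merge keeps every row of data_df; dates missing from usage_data stay NaN
merged_df = data_df.drop(columns='data').merge(usage_df, on='time_stamp', how='left')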
I want to add a dict to a dataframe, where the dict has dicts or lists as values.
Example:
abc = {'id': 'niceId',
'category': {'sport':'tennis',
'land': 'USA'
},
'date': '2022-04-12T23:33:21+02:00'
}
Now, I want to add this dict to a dataframe. I tried this, but it failed:
df = pd.DataFrame(abc, columns = abc.keys())
Output:
ValueError: All arrays must be of the same length
I'm thankful for your help.
Your question is not very clear about what your expected output is. But assuming you want to create a dataframe whose columns are id, category, date and numbers (just added to show the list case), where each cell in the category column keeps a dictionary and each cell in the numbers column keeps a list, you may use the from_dict method with a transpose:
abc = {'id': 'niceId',
'category': {'sport':'tennis',
'land': 'USA'
},
'date': '2022-04-12T23:33:21+02:00',
'numbers': [1,2,3,4,5]
}
df = pd.DataFrame.from_dict(abc, orient="index").T
gives you a dataframe as:
       id                            category                       date          numbers
0  niceId  {'sport': 'tennis', 'land': 'USA'}  2022-04-12T23:33:21+02:00  [1, 2, 3, 4, 5]
So let's say you want to add another item to this dataframe:
efg = {'id': 'notniceId',
'category': {'sport':'swimming',
'land': 'UK'
},
'date': '2021-04-12T23:33:21+02:00',
'numbers': [4,5]
}
df2 = pd.DataFrame.from_dict(efg, orient="index").T
pd.concat([df, df2], ignore_index=True)
gives you a dataframe as:
          id                              category                       date          numbers
0     niceId    {'sport': 'tennis', 'land': 'USA'}  2022-04-12T23:33:21+02:00  [1, 2, 3, 4, 5]
1  notniceId  {'sport': 'swimming', 'land': 'UK'}   2021-04-12T23:33:21+02:00           [4, 5]
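If you instead want the nested category dict expanded into its own columns rather than kept in one cell, pd.json_normalize is an option; a minimal sketch using the abc dict from above:
# flattens nested dicts into dotted column names; the numbers list stays in one cell
df_flat = pd.json_normalize(abc)
# columns: id, date, numbers, category.sport, category.land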
I have the following JSON snippet:
{'search_metadata': {'completed_in': 0.027,
'count': 2},
'statuses': [{'contributors': None,
'coordinates': None,
'created_at': 'Wed Mar 31 19:25:16 +0000 2021',
'text': 'The text',
'truncated': True,
'user': {'contributors_enabled': False,
'screen_name': 'abcde',
'verified': False
}
}
,{...}]
}
The info that interests me is all in the statuses array. With pandas I can turn this into a DataFrame like this
df = pd.DataFrame(Data['statuses'])
Then I extract a subset out of this dataframe with
dfsub = df[['created_at', 'text']]
display(dfsub) shows exactly what I expect.
But I also want to include [user][screen_name] in the subset.
dfs = df[[ 'user', 'created_at', 'text']]
is syntactically correct, but user contains too much information.
How do I add only the screen_name to the subset?
I have tried things like the following, but none of them works:
[user][screen_name]
user.screen_name
user:screen_name
I would normalize the data before constructing the DataFrame.
Take a look here: https://stackoverflow.com/a/41801708/14596032
Working example as an answer for your question:
df = pd.json_normalize(Data['statuses'], sep='_')
dfs = df[[ 'user_screen_name', 'created_at', 'text']]
print(dfs)
You can try to access the DataFrame, then the Series, then the dict:
df['user'] # user column = Series
df['user'][0] # 1st (only) item of the Series = dict
df['user'][0]['screen_name'] # screen_name in dict
You can use pd.Series.str. The docs don't do justice to all the wonderful things .str can do, such as accessing list and dict items. Case in point, you can access dict elements like this:
df['user'].str['screen_name']
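Building on that, the subset from the question can be assembled in one step; a sketch:
dfsub = df[['created_at', 'text']].assign(screen_name=df['user'].str['screen_name'])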
That said, I agree with @VladimirGromes that a better way is to normalize your data into a flat table.
A pandas newbie here who's struggling to understand why I'm unable to completely flatten a JSON I receive from an API. I need a DataFrame with all the data that is returned by the API; however, I need all nested data to be expanded and given its own columns for me to be able to use it.
The JSON I receive is as follows:
[
{
"query":{
"id":"1596487766859-3594dfce3973bc19",
"name":"test"
},
"webPage":{
"inLanguages":[
{
"code":"en"
}
]
},
"product":{
"name":"Test",
"description":"Test2",
"mainImage":"image1.jpg",
"images":[
"image2.jpg",
"image3.jpg"
],
"offers":[
{
"price":"45.0",
"currency":"€"
}
],
"probability":0.9552192
}
}
]
Running pd.json_normalize(data) without any additional parameters shows the nested values price and currency in the product.offers column. When I try to separate these out into their own columns with the following:
pd.json_normalize(data,record_path=['product',meta['product',['offers']]])
I end up with the following error:
f"{js} has non list value {result} for path {spec}. "
Any help would be much appreciated.
I've used this technique a few times:
- do an initial pd.json_normalize() to discover the columns
- build the meta parameter by inspecting those columns and the original JSON (NB: possible index out of range here)
- you can only request one list to drive the record_path parameter
- a few tricks: product/images is a list, so its column gets named 0 - rename it
- do a Cartesian product to merge the two data frames that come from breaking down the lists. It's not so stable
data = [{'query': {'id': '1596487766859-3594dfce3973bc19', 'name': 'test'},
'webPage': {'inLanguages': [{'code': 'en'}]},
'product': {'name': 'Test',
'description': 'Test2',
'mainImage': 'image1.jpg',
'images': ['image2.jpg', 'image3.jpg'],
'offers': [{'price': '45.0', 'currency': '€'}],
'probability': 0.9552192}}]
# build default to get column names
df = pd.json_normalize(data)
# from column names build the list that gets sent to meta param
mymeta = [c.split(".") for c in df.columns]
# exclude lists from meta - putting a list column in meta would fail
mymeta = [l for l in mymeta if not isinstance(data[0][l[0]][l[1]], list)]
# you can build df from either of the product lists NOT both
df1 = pd.json_normalize(data, record_path=[["product","offers"]], meta=mymeta)
df2 = pd.json_normalize(data, record_path=[["product","images"]], meta=mymeta).rename(columns={0:"image"})
# want them together - you can merge them. note columns heavily overlap so remove most columns from df2
df1.assign(foo=1).merge(
df2.assign(foo=1).drop(columns=[c for c in df2.columns if c!="image"]), on="foo").drop(columns="foo")
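On pandas >= 1.2 the foo helper column can be replaced with a cross merge, which computes the same Cartesian product; a sketch:
df_combined = df1.merge(df2[['image']], how='cross')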
I have some json data that I want to put into a pandas dataframe. The json looks like this:
{'date': [20170629,
20170630,
20170703,
20170705,
20170706,
20170707],
'errorMessage': None,
'seriesarr': [{'chartOnlyFlag': 'false',
'dqMaxValidStr': None,
'expression': 'DB(FXO,V1,EUR,USD,7D,VOL)',
'freq': None,
'frequency': None,
'iDailyDates': None,
'label': '',
'message': None,
'plotPoints': [0.0481411225888,
0.0462401214563,
0.0587196848727,
0.0765737640932,
0.0678912611279,
0.0675766942022],
}]
}
I am trying to create a pandas DataFrame with 'date' as the index and 'plotPoints' as a second column. I don't need any of the other information.
I've tried
df = pd.io.json.json_normalize(data, record_path = 'date', meta = ['seriesarr', ['plotPoints']])
When I do this I get the following error:
KeyError: ("Try running with errors='ignore' as key %s is not always present", KeyError('plotPoints',)
Any help with this is appreciated.
Thanks!
IIUC, json_normalize may not be able to help you here. It might instead just be easier to extract that data and then load it into a dataframe directly. If need be, convert to datetime using pd.to_datetime:
date = data.get('date')
plotPoints = data.get('seriesarr')[0].get('plotPoints')
df = pd.DataFrame({'date' : pd.to_datetime(date, format='%Y%m%d'),
'plotPoints' : plotPoints})
df
        date  plotPoints
0 2017-06-29    0.048141
1 2017-06-30    0.046240
2 2017-07-03    0.058720
3 2017-07-05    0.076574
4 2017-07-06    0.067891
5 2017-07-07    0.067577
This is under the assumption that your data is exactly as shown in the question.
As @COLDSPEED pointed out, getting the data directly from the dictionary is suitable here, since 'plotPoints' is contained within a list of dicts.
A variation using a generator expression is below, with date as the index and plotPoints as the column:
col1 = data['date']
adict = dict((k, v) for d in data['seriesarr'] for k, v in d.items())
col2 = adict['plotPoints']
pd.DataFrame(data=col2, index=col1)
                 0
20170629  0.048141
20170630  0.046240
20170703  0.058720
20170705  0.076574
20170706  0.067891
20170707  0.067577
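To get a proper datetime index here as well, a sketch reusing col1 and col2 from above:
pd.DataFrame({'plotPoints': col2},
             index=pd.to_datetime(col1, format='%Y%m%d'))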