Fill a pandas DataFrame column with values matched by key from a list of dictionaries - python

I'm creating a dataframe using
data_df = pd.DataFrame(
{'time_stamp': pd.date_range(date_fromx, date_tox, freq=time_length, tz=timezone)}
)
data_df['data'] = np.nan
Here I make the time_stamp column timezone-aware so that it matches the data in a list of dictionaries like the one below:
[{'time_stamp': '2022-10-07T00:00:00.000Z', 'data': 8044.849457711932}, {'time_stamp': '2022-10-07T00:15:00.000Z', 'data': 4997.691731774312}, {'time_stamp': '2022-10-07T00:30:00.000Z', 'data': 6984.109211541678}, {'time_stamp': '2022-10-07T00:45:00.000Z', 'data': 5492.891985861485}, {'time_stamp': '2022-10-07T01:00:00.000Z', 'data': 5473.0496118099745}, {'time_stamp': '2022-10-07T01:15:00.000Z', 'data': 6501.250396808703}, {'time_stamp': '2022-10-07T01:30:00.000Z', 'data': 6017.03827304475}, {'time_stamp': '2022-10-07T01:45:00.000Z', 'data': 7511.133012466583}, {'time_stamp': '2022-10-07T02:00:00.000Z', 'data': 5942.32914821161}]
The problem
A few dates are missing from the list of dictionaries with time_stamp and data. I want to show those missing dates with empty data next to them.
I'm trying to match the time_stamp key in the list of dictionaries against the dates in my dataframe and put the data value of each matching entry next to the matched date in the dataframe.
I have tried creating a new dataframe from the list of dictionaries and filling data_df from it with
usage_df = pd.DataFrame(usage_data) #usage_data is list of dictionaries
map_dict = dict(zip(usage_df['time_stamp'], usage_df['data']))
data_df['data'] = data_df['time_stamp'].map(usage_data)
But this leaves some of the values at the end missing.
I also tried concatenating, like
merged_df = pd.concat([data_df, usage_df], ignore_index=True)
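One possible approach, sketched under assumptions: the keys in the list are ISO strings while data_df holds timezone-aware Timestamps, so they would not compare equal until the strings are parsed (note also that the snippet above passes usage_data, not map_dict, to map). The sample values, the UTC timezone, and the 15-minute frequency below are stand-ins for date_fromx, date_tox, time_length and timezone, which are not shown in the question.

```python
import pandas as pd

# Hypothetical sample input with 00:15 and 01:00 deliberately missing.
usage_data = [
    {"time_stamp": "2022-10-07T00:00:00.000Z", "data": 8044.849457711932},
    {"time_stamp": "2022-10-07T00:30:00.000Z", "data": 6984.109211541678},
    {"time_stamp": "2022-10-07T00:45:00.000Z", "data": 5492.891985861485},
]

# Full, timezone-aware range of expected timestamps (UTC is an assumption).
data_df = pd.DataFrame(
    {"time_stamp": pd.date_range("2022-10-07 00:00", "2022-10-07 01:00",
                                 freq="15min", tz="UTC")}
)

usage_df = pd.DataFrame(usage_data)
# Parse the ISO strings into timezone-aware Timestamps so both key columns share a dtype.
usage_df["time_stamp"] = pd.to_datetime(usage_df["time_stamp"], utc=True)

# A left merge keeps every expected timestamp; unmatched rows get NaN in "data".
merged_df = data_df.merge(usage_df, on="time_stamp", how="left")
print(merged_df)
```

Here data_df is built without the placeholder data column, so the merge does not produce data_x and data_y columns.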

Related

How to normalize a complex json format in a pandas data frame that is a list of dictionaries

I have a pandas data frame that has one column like this in json format. I am not able to understand how to extract this.
df['completionDetails'][0] gives:
[{'name': 'start', 'time': 1654098788177},
{'name': 'arrival',
'time': 1654099038368,
'location': [-74.2713929, 40.5017297]},
{'name': 'departure',
'time': 1654098843357,
'location': [-74.2802414, 40.5095964]}]
I have tried:
dict_df = pd.DataFrame([ast.literal_eval(i) for i in df['completionDetails'].values])
But it gives me an error. What method can I use for this?
Expected Output:
start_time     arrival_time   arrival_location           departure_time  departure_location
1654098788177  1654099038368  [-74.2713929, 40.5017297]  1654098843357   [-74.2802414, 40.5095964]
IIUC each cell of the completionDetails column is a list of dictionaries.
You can make a dataframe out of each cell and concatenate the dfs:
dict_df = pd.concat([pd.DataFrame(i) for i in df['completionDetails'].values])
Edit:
Following your own edit, this is how you'd get the desired output:
dict_df = pd.concat([pd.DataFrame({f"{x['name']}_{k}": [v]
for x in i for k,v in x.items() if k!='name'}
) for i in df['completionDetails'].values if isinstance(i, list)])
As you can see, we build new key names by combining each entry's name value with its other keys. The resulting dictionaries are used to create dataframes, which are in turn concatenated.
Output:
   start_time     arrival_time   arrival_location           departure_time  departure_location
0  1654098788177  1654099038368  [-74.2713929, 40.5017297]  1654098843357   [-74.2802414, 40.5095964]
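The same idea can also be sketched by building one flat dict per row and calling the DataFrame constructor once, instead of concatenating many one-row frames. The helper name flatten_events and the one-row sample frame are made up for illustration.

```python
import pandas as pd

# Hypothetical one-row frame mirroring the structure shown in the question.
df = pd.DataFrame({"completionDetails": [[
    {"name": "start", "time": 1654098788177},
    {"name": "arrival", "time": 1654099038368,
     "location": [-74.2713929, 40.5017297]},
    {"name": "departure", "time": 1654098843357,
     "location": [-74.2802414, 40.5095964]},
]]})


def flatten_events(events):
    """Turn one list of event dicts into a single flat dict keyed '<name>_<field>'."""
    return {
        f"{event['name']}_{key}": value
        for event in events
        for key, value in event.items()
        if key != "name"
    }


# One flat dict per row, skipping cells that are not lists (e.g. NaN).
rows = [flatten_events(cell) for cell in df["completionDetails"]
        if isinstance(cell, list)]
dict_df = pd.DataFrame(rows)
print(dict_df)
```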

Add dict as value to dataframe

I want to add a dict to a dataframe, where some of the dict's values are themselves dicts or lists.
Example:
abc = {'id': 'niceId',
'category': {'sport':'tennis',
'land': 'USA'
},
'date': '2022-04-12T23:33:21+02:00'
}
Now, I want to add this dict to a dataframe. I tried this, but it failed:
df = pd.DataFrame(abc, columns = abc.keys())
Output:
ValueError: All arrays must be of the same length
I'm thankful for your help.
Your question is not very clear about what output you expect. I'll assume you want a dataframe with the columns id, category, date and numbers (numbers is added only to show the list case), where each cell in the category column holds a dictionary and each cell in the numbers column holds a list. In that case you can use the from_dict method followed by a transpose:
abc = {'id': 'niceId',
'category': {'sport':'tennis',
'land': 'USA'
},
'date': '2022-04-12T23:33:21+02:00',
'numbers': [1,2,3,4,5]
}
df = pd.DataFrame.from_dict(abc, orient="index").T
gives you a dataframe as:
   id      category                          date                       numbers
0  niceId  {'sport':'tennis','land': 'USA'}  2022-04-12T23:33:21+02:00  [1,2,3,4,5]
So let's say you want to add another item to this dataframe:
efg = {'id': 'notniceId',
'category': {'sport':'swimming',
'land': 'UK'
},
'date': '2021-04-12T23:33:21+02:00',
'numbers': [4,5]
}
df2 = pd.DataFrame.from_dict(efg, orient="index").T
pd.concat([df, df2], ignore_index=True)
gives you a dataframe as:
   id         category                           date                       numbers
0  niceId     {'sport':'tennis','land': 'USA'}   2022-04-12T23:33:21+02:00  [1,2,3,4,5]
1  notniceId  {'sport':'swimming','land': 'UK'}  2021-04-12T23:33:21+02:00  [4,5]
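As an alternative sketch (not the method used in the answer above), wrapping the dicts in a list makes pandas treat each dict as one row, so nested dicts and lists stay intact as cell values and no transpose is needed.

```python
import pandas as pd

abc = {"id": "niceId",
       "category": {"sport": "tennis", "land": "USA"},
       "date": "2022-04-12T23:33:21+02:00",
       "numbers": [1, 2, 3, 4, 5]}
efg = {"id": "notniceId",
       "category": {"sport": "swimming", "land": "UK"},
       "date": "2021-04-12T23:33:21+02:00",
       "numbers": [4, 5]}

# Each dict in the list becomes one row; dict and list values are stored
# unchanged in object-dtype cells.
df = pd.DataFrame([abc, efg])
print(df)
```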

Basic pandas dataframe manipulation question

I have the following JSON snippet:
{'search_metadata': {'completed_in': 0.027,
'count': 2},
'statuses': [{'contributors': None,
'coordinates': None,
'created_at': 'Wed Mar 31 19:25:16 +0000 2021',
'text': 'The text',
'truncated': True,
'user': {'contributors_enabled': False,
'screen_name': 'abcde',
'verified': False
}
}
,{...}]
}
The info that interests me is all in the statuses array. With pandas I can turn this into a DataFrame like this
df = pd.DataFrame(Data['statuses'])
Then I extract a subset out of this dataframe with
dfsub = df[['created_at', 'text']]
display(dfsub) shows exactly what I expect.
But I also want to include [user][screen_name] in the subset.
dfs = df[[ 'user', 'created_at', 'text']]
is syntactically correct, but user contains too much information.
How do I add only the screen_name to the subset?
I have tried things like the following, but none of them work:
[user][screen_name]
user.screen_name
user:screen_name
I would normalize the data before constructing the DataFrame.
Take a look here: https://stackoverflow.com/a/41801708/14596032
Working example as an answer for your question:
df = pd.json_normalize(Data['statuses'], sep='_')
dfs = df[[ 'user_screen_name', 'created_at', 'text']]
print(dfs)
You can drill down from the DataFrame to the Series and then to the dict:
df['user'] # user column = Series
df['user'][0] # 1st (only) item of the Series = dict
df['user'][0]['screen_name'] # screen_name in dict
You can use pd.Series.str. The docs don't do justice to all the wonderful things .str can do, such as accessing list and dict items. Case in point, you can access dict elements like this:
df['user'].str['screen_name']
That said, I agree with @VladimirGromes that a better way is to normalize your data into a flat table.
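To compare the two suggestions side by side, here is a small self-contained sketch. The Data dict is a trimmed, made-up version of the snippet in the question with a single status.

```python
import pandas as pd

# Trimmed, hypothetical version of the JSON from the question.
Data = {"statuses": [
    {"created_at": "Wed Mar 31 19:25:16 +0000 2021",
     "text": "The text",
     "user": {"screen_name": "abcde", "verified": False}},
]}

# Option 1: flatten first; nested keys become columns such as user_screen_name.
flat = pd.json_normalize(Data["statuses"], sep="_")
dfs = flat[["user_screen_name", "created_at", "text"]]

# Option 2: keep the nested column and pull one key out of each dict with .str.
df = pd.DataFrame(Data["statuses"])
dfsub = df[["created_at", "text"]].assign(
    screen_name=df["user"].str["screen_name"]
)

print(dfs)
print(dfsub)
```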

Pandas json_normalize and JSON flattening error

A pandas newbie here who is struggling to understand why I'm unable to completely flatten a JSON I receive from an API. I need a DataFrame with all the data returned by the API, but I need all nested data to be expanded and given its own columns so that I can use it.
The JSON I receive is as follows:
[
{
"query":{
"id":"1596487766859-3594dfce3973bc19",
"name":"test"
},
"webPage":{
"inLanguages":[
{
"code":"en"
}
]
},
"product":{
"name":"Test",
"description":"Test2",
"mainImage":"image1.jpg",
"images":[
"image2.jpg",
"image3.jpg"
],
"offers":[
{
"price":"45.0",
"currency":"€"
}
],
"probability":0.9552192
}
}
]
Running pd.json_normalize(data) without any additional parameters shows the nested values price and currency in the product.offers column. When I try to separate these out into their own columns with the following:
pd.json_normalize(data,record_path=['product',meta['product',['offers']]])
I end up with the following error:
f"{js} has non list value {result} for path {spec}. "
Any help would be much appreciated.
I've used this technique a few times:
Do an initial pd.json_normalize() to discover the columns.
Build the meta parameter by inspecting those columns and the original JSON. NB: an index out of range error is possible here.
Only one list can drive the record_path parameter per call.
A few tricks: product/images is a list of plain values, so its column gets named 0; rename it.
I used a Cartesian product to merge the two data frames that come from breaking down the lists. It's not very stable.
data = [{'query': {'id': '1596487766859-3594dfce3973bc19', 'name': 'test'},
'webPage': {'inLanguages': [{'code': 'en'}]},
'product': {'name': 'Test',
'description': 'Test2',
'mainImage': 'image1.jpg',
'images': ['image2.jpg', 'image3.jpg'],
'offers': [{'price': '45.0', 'currency': '€'}],
'probability': 0.9552192}}]
# build default to get column names
df = pd.json_normalize(data)
# from column names build the list that gets sent to meta param
mymeta = [[s for s in c.split(".")] for c in df.columns ]
# exclude lists from meta - NB this lookup assumes two-level paths and raises IndexError for a top-level column
mymeta = [l for l in mymeta if not isinstance(data[0][l[0]][l[1]], list)]
# you can build df from either of the product lists NOT both
df1 = pd.json_normalize(data, record_path=[["product","offers"]], meta=mymeta)
df2 = pd.json_normalize(data, record_path=[["product","images"]], meta=mymeta).rename(columns={0:"image"})
# want them together - you can merge them. note columns heavily overlap so remove most columns from df2
df1.assign(foo=1).merge(
df2.assign(foo=1).drop(columns=[c for c in df2.columns if c!="image"]), on="foo").drop(columns="foo")
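As a side note, the helper column foo is one way to get a Cartesian product. A minimal sketch of an alternative, assuming pandas 1.2 or later, is merge with how="cross". The two small frames below are hypothetical stand-ins for df1 and df2, reduced to the columns that matter.

```python
import pandas as pd

# Hypothetical stand-ins for df1 (offers) and df2 (images) from the answer above.
offers = pd.DataFrame([{"price": "45.0", "currency": "€"}])
images = pd.DataFrame({"image": ["image2.jpg", "image3.jpg"]})

# how="cross" pairs every offers row with every images row,
# so no temporary join key has to be added and dropped.
combined = offers.merge(images, how="cross")
print(combined)
```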

Handle nested lists in pandas

How can I turn a nested list with dict inside into extra columns in a dataframe in Python?
I received information within a dict from an API,
{'orders':
[
{ 'orderId': '2838168630',
'dateTimeOrderPlaced': '2020-01-22T18:37:29+01:00',
'orderItems': [{ 'orderItemId': 'BFC0000361764421',
'ean': '234234234234234',
'cancelRequest': False,
'quantity': 1}
]},
{ 'orderId': '2708182540',
'dateTimeOrderPlaced': '2020-01-22T17:45:36+01:00',
'orderItems': [{ 'orderItemId': 'BFC0000361749496',
'ean': '234234234234234',
'cancelRequest': False,
'quantity': 3}
]},
{ 'orderId': '2490844970',
'dateTimeOrderPlaced': '2019-08-17T14:21:46+02:00',
'orderItems': [{ 'orderItemId': 'BFC0000287505870',
'ean': '234234234234234',
'cancelRequest': True,
'quantity': 1}
]}
which I managed to turn into a simple dataframe by doing this:
pd.DataFrame(recieved_data.get('orders'))
output:
orderId  date  oderItems
1        1-12  [{orderItemId: 'dfs13', 'ean': '34234'}]
2        etc.
...
I would like to have something like this
orderId  date  oderItemId  ean
1        1-12  dfs13       34234
2        etc.
...
I already tried to single out the orderItems column with iloc and then turn it into a list so that I can extract the values again. However, I then still end up with a list from which I need to extract another list, which has the dict in it.
# Load the dataframe as you have already done.
temp_df = df['orderItems'].apply(pd.Series)
# concat the temp_df and original df side by side (axis=1 aligns on the index)
final_df = pd.concat([df, temp_df], axis=1)
# drop columns if required
Hope it works for you.
Cheers
By combining the answers on this question I reached my end goal. I did the following:
# unlist the orderItems column
temp_df = df['orderItems'].apply(pd.Series)
# put the items in orderItems into separate columns
temp_df_json = pd.json_normalize(temp_df[0].tolist())
# join the tables
final_df = df.join(temp_df_json)
# drop the old orderItems column for a clean table
final_df = final_df.drop(["orderItems"], axis=1)
Also, instead of .concat() I applied .join() to join both tables based on the existing index.
Just to make it clear, you are receiving a json from the API, so you can try to use the function json_normalize.
Try this:
import pandas as pd
# DataFrame initialization
df = pd.DataFrame({"orderId": [1], "date": ["1-12"], "oderItems": [{ 'orderItemId': 'dfs13', 'ean': '34234'}]})
# Flattening the inner dicts into columns (pd.json_normalize is the top-level name since pandas 1.0)
sub_df = pd.json_normalize(df["oderItems"].tolist())
# Dropping the original nested column
df = df.drop(["oderItems"], axis=1)
# Joining both dataframes on the index
df.join(sub_df)
So the output is:
orderId date ean orderItemId
0 1 1-12 34234 dfs13
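Starting from the raw API response rather than from an intermediate dataframe, the flattening can probably be done in one call with record_path and meta. This is a sketch on a shortened copy of the data from the question; the variable name received_data is an assumption.

```python
import pandas as pd

# Shortened copy of the API response from the question.
received_data = {"orders": [
    {"orderId": "2838168630",
     "dateTimeOrderPlaced": "2020-01-22T18:37:29+01:00",
     "orderItems": [{"orderItemId": "BFC0000361764421",
                     "ean": "234234234234234",
                     "cancelRequest": False,
                     "quantity": 1}]},
    {"orderId": "2708182540",
     "dateTimeOrderPlaced": "2020-01-22T17:45:36+01:00",
     "orderItems": [{"orderItemId": "BFC0000361749496",
                     "ean": "234234234234234",
                     "cancelRequest": False,
                     "quantity": 3}]},
]}

# record_path names the nested list that yields one row per item;
# meta lists the order-level fields to repeat on each of those rows.
final_df = pd.json_normalize(
    received_data["orders"],
    record_path="orderItems",
    meta=["orderId", "dateTimeOrderPlaced"],
)
print(final_df)
```

An order with several items would produce several rows here, each carrying the same orderId.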
