Selectively load JSON data into a dataframe

I have some json data that I want to put into a pandas dataframe. The json looks like this:
{'date': [20170629,
20170630,
20170703,
20170705,
20170706,
20170707],
'errorMessage': None,
'seriesarr': [{'chartOnlyFlag': 'false',
'dqMaxValidStr': None,
'expression': 'DB(FXO,V1,EUR,USD,7D,VOL)',
'freq': None,
'frequency': None,
'iDailyDates': None,
'label': '',
'message': None,
'plotPoints': [0.0481411225888,
0.0462401214563,
0.0587196848727,
0.0765737640932,
0.0678912611279,
0.0675766942022],
}
I am trying to create a pandas DataFrame with 'date' as the index and 'plotPoints' as a second column. I don't need any of the other information.
I've tried
df = pd.io.json.json_normalize(data, record_path = 'date', meta = ['seriesarr', ['plotPoints']])
When I do this I get the following error:
KeyError: ("Try running with errors='ignore' as key %s is not always present", KeyError('plotPoints',)
Any help with this is appreciated.
Thanks!

IIUC, json_normalize may not be able to help you here. It might instead just be easier to extract that data and then load it into a dataframe directly. If need be, convert to datetime using pd.to_datetime:
date = data.get('date')
plotPoints = data.get('seriesarr')[0].get('plotPoints')
df = pd.DataFrame({'date' : pd.to_datetime(date, format='%Y%m%d'),
'plotPoints' : plotPoints})
df
date plotPoints
0 2017-06-29 0.048141
1 2017-06-30 0.046240
2 2017-07-03 0.058720
3 2017-07-05 0.076574
4 2017-07-06 0.067891
5 2017-07-07 0.067577
This is under the assumption that your data is exactly as shown in the question.

As @COLDSPEED pointed out, getting the data directly from the dictionary is suitable here, since 'plotPoints' is contained within a list of dicts.
A variation using a comprehension is shown below; it has date as the index and plotPoints as the column.
col1 = data['date']
adict = dict((k,v) for d in data['seriesarr'] for k,v in d.items() )  # .items() replaces the Python 2-only .iteritems()
col2 = adict['plotPoints']
pd.DataFrame(data= col2, index=col1)
>>> 0
20170629 0.048141
20170630 0.046240
20170703 0.058720
20170705 0.076574
20170706 0.067891
20170707 0.067577
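
For reference, here is a self-contained sketch of the same idea that produces the shape the question asks for: the dates as a datetime index and plotPoints as the only column. The `data` dict is a trimmed stand-in for the question's JSON that keeps only the two relevant keys and the first three values; that trimming is an assumption made for illustration.

```python
import pandas as pd

# Trimmed stand-in for the question's JSON (assumption: other keys omitted).
data = {
    'date': [20170629, 20170630, 20170703],
    'seriesarr': [{'plotPoints': [0.0481411225888,
                                  0.0462401214563,
                                  0.0587196848727]}],
}

# Parse the integer dates via their string form, e.g. 20170629 -> 2017-06-29.
index = pd.to_datetime([str(d) for d in data['date']], format='%Y%m%d')

# plotPoints lives in the first (here: only) dict of the 'seriesarr' list.
points = data['seriesarr'][0]['plotPoints']

df = pd.DataFrame({'plotPoints': points}, index=index)
df.index.name = 'date'
print(df)
```

If 'seriesarr' ever holds more than one series, the `[0]` lookup would need to be replaced by a loop or an explicit selection.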

Related

How to normalize a complex json format in a pandas data frame that is a list of dictionaries

I have a pandas data frame that has one column like this in json format. I am not able to understand how to extract this.
df['completionDetails'][0] gives:
[{'name': 'start', 'time': 1654098788177},
{'name': 'arrival',
'time': 1654099038368,
'location': [-74.2713929, 40.5017297]},
{'name': 'departure',
'time': 1654098843357,
'location': [-74.2802414, 40.5095964]}]
I have tried:
dict_df = pd.DataFrame([ast.literal_eval(i) for i in df['completionDetails'].values])
But it gives me an error. What method can I use for this?
Expected Output:
start_time arrival_time arrival_location departure_time departure_location
1654098788177 1654099038368 [-74.2713929, 40.5017297] 1654098843357 [-74.2802414, 40.5095964]

IIUC each cell of the completionDetails column is a list of dictionaries.
You can make a dataframe out of each cell and concatenate the dfs:
dict_df = pd.concat([pd.DataFrame(i) for i in df['completionDetails'].values])
Edit:
Following your own edit, this is how you'd get the desired output:
dict_df = pd.concat([pd.DataFrame({f"{x['name']}_{k}": [v]
for x in i for k,v in x.items() if k!='name'}
) for i in df['completionDetails'].values if isinstance(i, list)])
As you can see we're building key names from the name key and other keys to create new dictionaries that will be used to create dataframes (that in turn will be concatenated to each other)
Output:
start_time arrival_time arrival_location departure_time departure_location
0 1654098788177 1654099038368 [-74.2713929, 40.5017297] 1654098843357 [-74.2802414, 40.5095964]
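
A runnable sketch of the approach above, for anyone who wants to try it without the original data. The one-row frame is built from the values in the question, and the helper name `flatten_events` is hypothetical. It builds one flat dict per row and hands the list of dicts to a single `pd.DataFrame` call instead of concatenating one small frame per row.

```python
import pandas as pd

# Sample frame shaped like the question's data (one row, values from the question).
df = pd.DataFrame({'completionDetails': [[
    {'name': 'start', 'time': 1654098788177},
    {'name': 'arrival', 'time': 1654099038368,
     'location': [-74.2713929, 40.5017297]},
    {'name': 'departure', 'time': 1654098843357,
     'location': [-74.2802414, 40.5095964]},
]]})


def flatten_events(events):
    """Turn a list of {'name': ..., key: value} dicts into one flat dict."""
    return {
        f"{event['name']}_{key}": value
        for event in events
        for key, value in event.items()
        if key != 'name'
    }


# One flat dict per row; cells that do not hold a list are skipped.
rows = [flatten_events(cell)
        for cell in df['completionDetails']
        if isinstance(cell, list)]
flat = pd.DataFrame(rows)
print(flat.columns.tolist())
```

Because skipped cells produce no row, the result may be shorter than the original frame; keep the original index alongside each dict if the rows need to be joined back.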

Flatten and Shape JSON DataFrame

I have the below JSON string in data. I want it to look like the Expected Result Below
import json
import pandas as pd
data = [{'useEventValue': True,
'eventConditions': [{'type': 'CATEGORY',
'matchType': 'EXACT',
'expression': 'ABC'},
{'type': 'ACTION',
'matchType': 'EXACT',
'expression': 'DEF'},
{'type': 'LABEL', 'matchType': 'REGEXP', 'expression': 'GHI|JKL'}]}]
Expected Result:
   Category_matchType  Category_expression  Action_matchType  Action_expression  Label_matchType  Label_expression
0  EXACT               ABC                  EXACT             DEF                REGEXP           GHI|JKL
What I've Tried:
This question is similar, but I'm not using the index the way the OP is. Following this example, I've tried using json_normalize and then using various forms of melt, stack, unstack, pivot, etc. But there has to be an easier way!
# this bit of code produces the below result where I can start using reshaping functions to get to what I need but it seems messy
df = pd.json_normalize(data, 'eventConditions')
   type      matchType  expression
0  CATEGORY  EXACT      ABC
1  ACTION    EXACT      DEF
2  LABEL     REGEXP     GHI|JKL

We can use json_normalize to read the json data as pandas dataframe, then use stack followed by unstack to reshape the dataframe
df = pd.json_normalize(data, 'eventConditions')
df = df.set_index([df.groupby('type').cumcount(), 'type']).stack().unstack([1, 2])
df.columns = df.columns.map('_'.join)
CATEGORY_matchType CATEGORY_expression ACTION_matchType ACTION_expression LABEL_matchType LABEL_expression
0 EXACT ABC EXACT DEF REGEXP GHI|JKL

If your data is not too large in size, you could maybe process the json data first and then create a dataframe like this:
import pandas as pd
import json
data = [{'useEventValue': True,
'eventConditions': [{'type': 'CATEGORY',
'matchType': 'EXACT',
'expression': 'ABC'},
{'type': 'ACTION',
'matchType': 'EXACT',
'expression': 'DEF'},
{'type': 'LABEL', 'matchType': 'REGEXP', 'expression': 'GHI|JKL'}]}]
new_data = {}
for i in data:
    for event in i['eventConditions']:
        for key in event.keys():
            if key != 'type':
                col_name = event['type'] + '_' + key
                # list.append returns None, so collect values with setdefault instead of reassigning
                new_data.setdefault(col_name, []).append(event[key])
df = pd.DataFrame(new_data)
df
Just found a way to do it with Pandas only:
df = pd.json_normalize(data, 'eventConditions')
df = df.melt(id_vars=[('type')])
df['type'] = df['type'] + '_' + df['variable']
df.drop(columns=['variable'], inplace=True)
df.set_index('type', inplace=True)
df = df.T
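
If `data` can contain more than one record, a per-record variant of the pre-processing idea may be easier to reason about than transposing. The sketch below is only one way to do it; the helper name `conditions_to_row` is hypothetical, and it assumes each 'type' appears at most once per record (a repeated type would overwrite the earlier value).

```python
import pandas as pd

data = [{'useEventValue': True,
         'eventConditions': [
             {'type': 'CATEGORY', 'matchType': 'EXACT', 'expression': 'ABC'},
             {'type': 'ACTION', 'matchType': 'EXACT', 'expression': 'DEF'},
             {'type': 'LABEL', 'matchType': 'REGEXP', 'expression': 'GHI|JKL'}]}]


def conditions_to_row(record):
    """Flatten one record's eventConditions into {'<type>_<key>': value}."""
    return {
        f"{cond['type']}_{key}": value
        for cond in record['eventConditions']
        for key, value in cond.items()
        if key != 'type'
    }


# One output row per record in data.
wide = pd.DataFrame([conditions_to_row(record) for record in data])
print(wide)
```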

Basic pandas dataframe manipulation question

I have the following JSON snippet:
{'search_metadata': {'completed_in': 0.027,
'count': 2},
'statuses': [{'contributors': None,
'coordinates': None,
'created_at': 'Wed Mar 31 19:25:16 +0000 2021',
'text': 'The text',
'truncated': True,
'user': {'contributors_enabled': False,
'screen_name': 'abcde',
'verified': False
}
}
,{...}]
}
The info that interests me is all in the statuses array. With pandas I can turn this into a DataFrame like this
df = pd.DataFrame(Data['statuses'])
Then I extract a subset out of this dataframe with
dfsub = df[['created_at', 'text']]
display(dfsub) shows exactly what I expect.
But I also want to include [user][screen_name] in the subset.
dfs = df[[ 'user', 'created_at', 'text']]
is syntactically correct, but user contains too much information.
How do I add only the screen_name to the subset?
I have tried things like the following but none of that works
[user][screen_name]
user.screen_name
user:screen_name

I would normalize the data before constructing the DataFrame.
Take a look here: https://stackoverflow.com/a/41801708/14596032
Working example as an answer for your question:
df = pd.json_normalize(Data['statuses'], sep='_')
dfs = df[[ 'user_screen_name', 'created_at', 'text']]
print(dfs)

You can try to access Dataframe, then Series, then Dict
df['user'] # user column = Series
df['user'][0] # 1st (only) item of the Series = dict
df['user'][0]['screen_name'] # screen_name in dict

You can use pd.Series.str. The docs don't do justice to all the wonderful things .str can do, such as accessing list and dict items. Case in point, you can access dict elements like this:
df['user'].str['screen_name']
That said, I agree with @VladimirGromes that a better way is to normalize your data into a flat table.
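
To compare the two suggestions side by side, here is a small self-contained sketch. The `statuses` list is a stand-in for Data['statuses'] that keeps only the fields used here (an assumption; the real payload has many more keys).

```python
import pandas as pd

# Stand-in for Data['statuses'] with only the fields used here (assumption).
statuses = [
    {'created_at': 'Wed Mar 31 19:25:16 +0000 2021',
     'text': 'The text',
     'user': {'screen_name': 'abcde', 'verified': False}},
]

# Option 1: flatten nested dicts; 'user' becomes 'user_screen_name', 'user_verified'.
flat = pd.json_normalize(statuses, sep='_')
dfs = flat[['user_screen_name', 'created_at', 'text']]

# Option 2: keep the nested column and pull one key out of each dict with .str.
df = pd.DataFrame(statuses)
df['screen_name'] = df['user'].str['screen_name']

print(dfs)
print(df[['screen_name', 'created_at', 'text']])
```

Option 1 flattens every nested key at once, which is convenient when several user fields are needed; option 2 touches only the one key and leaves the rest of the frame as it was.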

Handle nested lists in pandas

How can I turn a nested list with dict inside into extra columns in a dataframe in Python?
I received information within a dict from an API,
{'orders':
[
{ 'orderId': '2838168630',
'dateTimeOrderPlaced': '2020-01-22T18:37:29+01:00',
'orderItems': [{ 'orderItemId': 'BFC0000361764421',
'ean': '234234234234234',
'cancelRequest': False,
'quantity': 1}
]},
{ 'orderId': '2708182540',
'dateTimeOrderPlaced': '2020-01-22T17:45:36+01:00',
'orderItems': [{ 'orderItemId': 'BFC0000361749496',
'ean': '234234234234234',
'cancelRequest': False,
'quantity': 3}
]},
{ 'orderId': '2490844970',
'dateTimeOrderPlaced': '2019-08-17T14:21:46+02:00',
'orderItems': [{ 'orderItemId': 'BFC0000287505870',
'ean': '234234234234234',
'cancelRequest': True,
'quantity': 1}
]}
which I managed to turn into a simple dataframe by doing this:
pd.DataFrame(recieved_data.get('orders'))
output:
orderId date oderItems
1 1-12 [{orderItemId: 'dfs13', 'ean': '34234'}]
2 etc.
...
I would like to have something like this
orderId date oderItemId ean
1 1-12 dfs13 34234
2 etc.
...
I already tried to single out the orderItems column with iloc and then turn it into a list so I could extract the values again. However, I then still end up with a list from which I need to extract another list, which has the dict in it.

# Load the dataframe as you have already done.
temp_df = df['orderItems'].apply(pd.Series)
# concat the temp_df and original df side by side (axis=1 adds columns rather than rows)
final_df = pd.concat([df, temp_df], axis=1)
# drop columns if required
Hope it works for you.
Cheers

By combining the answers on this question I reached my end goal. I did the following:
#unlist the orderItems column
temp_df = df['orderItems'].apply(pd.Series)
#Put items in orderItems into separate columns
temp_df_json = pd.json_normalize(temp_df[0])
#Join the tables
final_df = df.join(temp_df_json)
#Drop the old orderItems column for a clean table
final_df = final_df.drop(["orderItems"], axis=1)
Also, instead of .concat() I applied .join() to join both tables based on the existing index.

Just to make it clear, you are receiving a json from the API, so you can try to use the function json_normalize.
Try this:
import pandas as pd
# DataFrame initialization
df = pd.DataFrame({"orderId": [1], "date": ["1-12"], "oderItems": [{ 'orderItemId': 'dfs13', 'ean': '34234'}]})
# Flattening the inner dict (pd.json_normalize replaces the older pandas.io.json import)
sub_df = pd.json_normalize(df["oderItems"])
# Dropping the unserialized column
df = df.drop(["oderItems"], axis=1)
# joining both dataframes.
df.join(sub_df)
So the output is:
orderId date ean orderItemId
0 1 1-12 34234 dfs13
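
Since the API response is already a nested dict, `pd.json_normalize` can also be pointed at the original structure directly, using its `record_path` and `meta` parameters. The sketch below uses two of the orders from the question; the variable name `received_data` is a stand-in for whatever the API call returned.

```python
import pandas as pd

# Two orders copied from the question; the variable name is a stand-in.
received_data = {'orders': [
    {'orderId': '2838168630',
     'dateTimeOrderPlaced': '2020-01-22T18:37:29+01:00',
     'orderItems': [{'orderItemId': 'BFC0000361764421',
                     'ean': '234234234234234',
                     'cancelRequest': False,
                     'quantity': 1}]},
    {'orderId': '2708182540',
     'dateTimeOrderPlaced': '2020-01-22T17:45:36+01:00',
     'orderItems': [{'orderItemId': 'BFC0000361749496',
                     'ean': '234234234234234',
                     'cancelRequest': False,
                     'quantity': 3}]},
]}

# One output row per entry in 'orderItems'; the meta fields are repeated
# from the parent order on each of its item rows.
items = pd.json_normalize(
    received_data['orders'],
    record_path='orderItems',
    meta=['orderId', 'dateTimeOrderPlaced'],
)
print(items)
```

Unlike the `temp_df[0]` approach, this keeps every item when an order contains more than one entry in 'orderItems'.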

Get a list of keys and values in a nested dictionary oriented by index

I have an Excel file with a structure like this:
name age status
anna 35 single
petr 27 married
I have converted such a file into a nested dictionary with a structure like this:
{'anna': {'age': 35, 'status': 'single'},
'petr': {'age': 27, 'status': 'married'}}
using pandas:
import pandas as pd
df = pd.read_excel('path/to/file')
df.set_index('name', inplace=True)
print(df.to_dict(orient='index'))
But now list(df.keys()) returns all the column keys ('age', 'status', etc.) but not 'name'.
My eventual goal is to get all the keys and values for a person by typing their name.
Is that possible? Or should I import the data some other way to achieve this? In the end I need a dictionary anyway, because I will merge it with other dictionaries by key.

I think you need to pass drop=False to set_index so that the name column is not dropped:
import pandas as pd
df = pd.read_excel('path/to/file')
df.set_index('name', inplace=True, drop=False)
print (df)
name age status
name
anna anna 35 single
petr petr 27 married
d = df.to_dict(orient='index')
print (d)
{'anna': {'age': 35, 'status': 'single', 'name': 'anna'},
'petr': {'age': 27, 'status': 'married', 'name': 'petr'}}
print (list(df.keys()))
['name', 'age', 'status']

Given a dataframe read from Excel, you can do this to obtain what you want:
resulting_dict = {}
# Series.items() replaces .iteritems(), which newer pandas versions no longer provide
for name, info in df.groupby('name').apply(lambda x: x.to_dict()).items():
    stats = {}
    for key, values in info.items():
        if key != 'name':
            # each column maps {row_index: value}; take the single value
            value = list(values.values())[0]
            stats[key] = value
    resulting_dict[name] = stats

Try this:
import pandas as pd
df = pd.read_excel('path/to/file')
df[df['name']=='anna'] #Get all details of anna
df[df['name']=='petr'] #Get all details of petr
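
If the end goal is a dictionary that can be queried by name, the dictionary from the question already supports that: with orient='index' the outer keys are the index values, so the names are the keys of the dict rather than of the DataFrame. A minimal sketch, with an inline frame standing in for the Excel file:

```python
import pandas as pd

# Inline stand-in for pd.read_excel('path/to/file'), using the question's rows.
df = pd.DataFrame({'name': ['anna', 'petr'],
                   'age': [35, 27],
                   'status': ['single', 'married']})

# orient='index' keys the outer dict by the index, i.e. by name.
people = df.set_index('name').to_dict(orient='index')

print(list(people.keys()))   # the names
print(people['anna'])        # all fields for one name
```

Note that `df.keys()` lists the DataFrame's columns, which is why 'name' disappears from it once it becomes the index.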
