How to flatten a nested JSON into a pandas dataframe - python

I have a bit of a tricky JSON I want to put into a dataframe.
{'A': {'name': 'A',
       'left_foot': [{'toes': '5'}],
       'right_foot': [{'toes': '4'}]},
 'B': {'name': 'B',
       'left_foot': [{'toes': '3'}],
       'right_foot': [{'toes': '5'}]},
 ...
}
I don't need the first layer with A and B, as it is duplicated in name. There will always be only one left_foot and one right_foot.
The data I want is as follows:
name left_foot.toes right_foot.toes
0 A 5 4
1 B 3 5
Using this post I was able to get the feet and toes, but only if I specify data["A"]. Is there an easier way?
EDIT
I have something like this, but I need to specify "A" in the first line.
df = pd.json_normalize(tickers["A"]).pipe(
    lambda x: x.drop('left_foot', 1).join(
        x.left_foot.apply(lambda y: pd.Series(merge(y)))
    )
).rename(columns={"toes": "left_foot.toes"}).pipe(
    lambda x: x.drop('right_foot', 1).join(
        x.right_foot.apply(lambda y: pd.Series(merge(y)))
    )
).rename(columns={"toes": "right_foot.toes"})

Given your data, each top level key (e.g. 'A' and 'B') is repeated as a value in 'name', therefore it will be easier to use pandas.json_normalize on only the values of the dict.
The 'left_foot' and 'right_foot' columns need to be exploded to remove each dict from its list
The final step converts the columns of dicts to a dataframe and joins it back to df
It's not necessarily less code, but this should be significantly faster than the multiple applies used in the current code.
See this timing analysis comparing .apply(pd.Series) to just using pandas.DataFrame to convert a column.
If there are issues because your dataframe has NaN (e.g. missing dicts or lists) in the columns to be exploded and converted to a dataframe, see How to json_normalize a column with NaNs
import pandas as pd
# test data
data = {'A': {'name': 'A', 'left_foot': [{'toes': '5'}], 'right_foot': [{'toes': '4'}]},
        'B': {'name': 'B', 'left_foot': [{'toes': '3'}], 'right_foot': [{'toes': '5'}]},
        'C': {'name': 'C', 'left_foot': [{'toes': '5'}], 'right_foot': [{'toes': '4'}]},
        'D': {'name': 'D', 'left_foot': [{'toes': '3'}], 'right_foot': [{'toes': '5'}]}}
# normalize data.values and explode the dicts out of the lists
df = pd.json_normalize(data.values()).apply(pd.Series.explode).reset_index(drop=True)
# display(df)
name left_foot right_foot
0 A {'toes': '5'} {'toes': '4'}
1 B {'toes': '3'} {'toes': '5'}
2 C {'toes': '5'} {'toes': '4'}
3 D {'toes': '3'} {'toes': '5'}
# extract the values from the dicts and create toe columns
df = df.join(pd.DataFrame(df.pop('left_foot').values.tolist())).rename(columns={'toes': 'lf_toes'})
df = df.join(pd.DataFrame(df.pop('right_foot').values.tolist())).rename(columns={'toes': 'rf_toes'})
# display(df)
name lf_toes rf_toes
0 A 5 4
1 B 3 5
2 C 5 4
3 D 3 5
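Since the question states there will always be exactly one left_foot and one right_foot, a shorter sketch (using a two-entry version of the test data above) can skip the explode step entirely and pull the single dict out of each list with the .str accessor:

```python
import pandas as pd

data = {'A': {'name': 'A', 'left_foot': [{'toes': '5'}], 'right_foot': [{'toes': '4'}]},
        'B': {'name': 'B', 'left_foot': [{'toes': '3'}], 'right_foot': [{'toes': '5'}]}}

df = pd.json_normalize(list(data.values()))
# each cell is a one-element list of dicts; .str[0] takes the dict, .str['toes'] the value
df['left_foot.toes'] = df['left_foot'].str[0].str['toes']
df['right_foot.toes'] = df['right_foot'].str[0].str['toes']
df = df.drop(columns=['left_foot', 'right_foot'])
```

This relies on the one-element-list assumption; if a list could ever hold more than one dict, the explode-based approach above is the safer choice.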

Related

python pandas - add unique Ids in column from master df back in to processed dfs stored in list of dataframes

I have a single df that includes multiple json strings per row that need reading and normalizing.
I can read out the json info and normalize the columns by storing each row as a new dataframe in a list, which I have done with the code below.
However I need to append the original unique Id in the original df (i.e. 'id': ['9clpa','g659am']) - which is lost in my current code.
The expected output is a list of dataframes per Id that include the exploded json info, with an additional column including Id (which will be repeated for each row of the final df).
I hope that makes sense; any suggestions are very welcome. Thanks so much!
dataframe
df = pd.DataFrame(data={'id': ['9clpa','g659am'],'i2': [('{"t":"unique678","q":[{"qi":"01","answers":[{"answer":"M","value":"1"},{"answer":"F","value":"2"},{"answer":"G","value":"3"},{"answer":"V","value":"4"}]},{"qi":"02","answers":[{"answer":"M","value":"1"},{"answer":"F","value":"2"},{"answer":"A","value":"3"},{"answer":"B","value":"4"},{"answer":"G","value":"5"},{"answer":"NC","value":"6"},{"answer":"O","value":"7"} ]}]}'),('{"t":"unique428","q":[{"qi":"01","answers":[{"answer":"M","value":"1"},{"answer":"F","value":"2"},{"answer":"G","value":"3"},{"answer":"V","value":"4"}]},{"qi":"02","answers":[{"answer":"M","value":"1"},{"answer":"F","value":"2"},{"answer":"A","value":"3"},{"answer":"B","value":"4"},{"answer":"G","value":"5"},{"answer":"NC","value":"6"},{"answer":"O","value":"7"} ]}]}')]})
current code
out = {}
for i in range(len(df)):
    out[i] = pd.read_json(df.i2[i])
    out[i] = pd.json_normalize(out[i].q)
expected output
pd.DataFrame(data={'id': ['9clpa','9clpa'],'qi': ['01','02'], 'answers': ['{"answer":"M","value":"1"},{"answer":"F","value":"2"},{"answer":"G","value":"3"},{"answer":"V","value":"4"}', '"answer":"M","value":"1"},{"answer":"F","value":"2"},{"answer":"A","value":"3"},{"answer":"B","value":"4"},{"answer":"G","value":"5"},{"answer":"NC","value":"6"},{"answer":"O","value":"7"']})
pd.DataFrame(data={'id': ['g659am','g659am'],'qi': ['01','02'], 'answers': ['{"answer":"M","value":"1"},{"answer":"F","value":"2"},{"answer":"G","value":"3"},{"answer":"V","value":"4"}', '"answer":"M","value":"1"},{"answer":"F","value":"2"},{"answer":"A","value":"3"},{"answer":"B","value":"4"},{"answer":"G","value":"5"},{"answer":"NC","value":"6"},{"answer":"O","value":"7"']})
df = pd.DataFrame(data={'id': ['9clpa','g659am'],'i2': [('{"t":"unique678","q":[{"qi":"01","answers":[{"answer":"M","value":"1"},{"answer":"F","value":"2"},{"answer":"G","value":"3"},{"answer":"V","value":"4"}]},{"qi":"02","answers":[{"answer":"M","value":"1"},{"answer":"F","value":"2"},{"answer":"A","value":"3"},{"answer":"B","value":"4"},{"answer":"G","value":"5"},{"answer":"NC","value":"6"},{"answer":"O","value":"7"} ]}]}'),('{"t":"unique428","q":[{"qi":"01","answers":[{"answer":"M","value":"1"},{"answer":"F","value":"2"},{"answer":"G","value":"3"},{"answer":"V","value":"4"}]},{"qi":"02","answers":[{"answer":"M","value":"1"},{"answer":"F","value":"2"},{"answer":"A","value":"3"},{"answer":"B","value":"4"},{"answer":"G","value":"5"},{"answer":"NC","value":"6"},{"answer":"O","value":"7"} ]}]}')]})
out = {}
columns1 = ['id', 'qi', 'answers']
for i in range(len(df)):
    out[i] = pd.read_json(df.i2[i])
    out[i] = pd.json_normalize(out[i].q)
    df_new = pd.DataFrame(data=out[i], columns=columns1)
    df_new = df_new.assign(id=lambda x: df.id[i])
    display(df_new)
You can add a lambda function which will assign the value of 'id' to the new df formed.
Edit: You can set the location of the 'id' column in columns1 to define where you want it to appear when you create the dataframe.
You are just missing assigning the id to your dataframe after you normalize the columns:
out = {}
for i in range(len(df)):
    out[i] = pd.read_json(df.i2[i])
    out[i] = pd.json_normalize(out[i].q)
    out[i]['id'] = df.id[i]
    out[i] = out[i].loc[:, ['id', 'qi', 'answers']]
Output:
>>> out[0]
id qi answers
0 9clpa 01 [{'answer': 'M', 'value': '1'}, {'answer': 'F', 'value': '2'}, {'answer': 'G', 'value': '3'}, {'answer': 'V', 'value': '4'}]
1 9clpa 02 [{'answer': 'M', 'value': '1'}, {'answer': 'F', 'value': '2'}, {'answer': 'A', 'value': '3'}, {'answer': 'B', 'value': '4'}, {'answer': 'G', 'value': '5'}, {'answer': 'NC', 'value': '6'}, {'answer': 'O', 'value': '7'}]
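A related sketch, in case one combined dataframe is preferable to a dict of frames: attach the id before concatenating. The JSON strings here are simplified stand-ins for the ones in the question, so the column contents differ, but the id-preserving pattern is the same:

```python
import json
import pandas as pd

df = pd.DataFrame({'id': ['9clpa', 'g659am'],
                   'i2': ['{"t": "u1", "q": [{"qi": "01", "answers": []}, {"qi": "02", "answers": []}]}',
                          '{"t": "u2", "q": [{"qi": "01", "answers": []}]}']})

frames = []
for _, row in df.iterrows():
    qs = pd.json_normalize(json.loads(row['i2'])['q'])
    qs.insert(0, 'id', row['id'])  # keep the original id on every normalized row
    frames.append(qs)

out = pd.concat(frames, ignore_index=True)
```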
You can use .json_normalize (doc here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html)
(from https://medium.com/swlh/converting-nested-json-structures-to-pandas-dataframes-e8106c59976e)

Python one liner to merge dictionary which has common values

What I have:
a=[{'name':'a','vals':1,'required':'yes'},{'name':'b','vals':2},{'name':'d','vals':3}]
b=[{'name':'a','type':'car'},{'name':'b','type':'bike'},{'name':'c','type':'van'}]
What I tried:
[[i]+[j] for i in b for j in a if i['name']==j['name']]
What I got:
[[{'name': 'a', 'type': 'car'}, {'name': 'a', 'vals': 1}], [{'name': 'b', 'type': 'bike'}, {'name': 'b', 'vals': 2}]]
What I want:
[{'name': 'a', 'type': 'car','vals': 1},{'name': 'b', 'type': 'bike','vals': 2}]
Note:
I need to merge dicts into one dict.
It should merge only those have common 'name' in both a and b.
I want a Python one-liner answer.
For Python 3, you can do this:
a=[{'name':'a','vals':1},{'name':'b','vals':2},{'name':'d','vals':3}]
b=[{'name':'a','type':'car'},{'name':'b','type':'bike'},{'name':'c','type':'van'}]
print([{**i,**j} for i in b for j in a if i['name']==j['name']])
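The nested comprehension above compares every pair, so it is O(len(a) * len(b)). For larger lists, a variant sketch that first builds a lookup dict over a keeps the merge itself a one-liner while staying linear:

```python
a = [{'name': 'a', 'vals': 1}, {'name': 'b', 'vals': 2}, {'name': 'd', 'vals': 3}]
b = [{'name': 'a', 'type': 'car'}, {'name': 'b', 'type': 'bike'}, {'name': 'c', 'type': 'van'}]

# index a by name once, then merge only the names common to both lists
lookup = {d['name']: d for d in a}
merged = [{**i, **lookup[i['name']]} for i in b if i['name'] in lookup]
```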

Separating nested list and dictionary in separate columns

I created a function to gather the following sample list below:
full_list = ['Group1', [{'a':'1', 'b':'2'}, {'c':'3', 'x':'1'}],
             'Group2', [{'d':'7', 'e':'18'}],
             'Group3', [{'m':'21'}, {'n':'44','p':'13'}]]
As you can see some of the elements inside the lists are made up of key-value pair dictionaries.
And these dictionaries are of different sizes (number of kv pairs).
Can anyone suggest what to use in python to display this list in separate columns?
Group1 Group2 Group3
{'a':'1', 'b':'2'} {'d':'7', 'e':'18'} {'m':'21'}
{'c':'3', 'x':'1'} {'n':'44','p':'13'}
I am not after a solution but rather a point in the right direction for a novice like me.
I have briefly looked at itertools and pandas dataframes.
Thanks in advance
Here is one way:
First extract the columns and the data:
import pandas as pd
columns = full_list[::2]
#['Group1', 'Group2', 'Group3']
data = full_list[1::2]
#[[{'a': '1', 'b': '2'}, {'c': '3', 'x': '1'}],
# [{'d': '7', 'e': '18'}],
# [{'m': '21'}, {'n': '44', 'p': '13'}]]
Here [::2] means iterate from beginning to end but take only every second item (starting at index 0), and [1::2] does the same but starts from index 1 (the second position)
Then create a pd.DataFrame:
df = pd.DataFrame(data)
#0 {'a': '1', 'b': '2'} {'c': '3', 'x': '1'}
#1 {'d': '7', 'e': '18'} None
#2 {'m': '21'} {'n': '44', 'p': '13'}
Oops, the rows and columns are transposed, so we need to transpose it back:
df = df.T
Then add the columns:
df.columns = columns
And there we have it:
Group1 Group2 Group3
0 {'a': '1', 'b': '2'} {'d': '7', 'e': '18'} {'m': '21'}
1 {'c': '3', 'x': '1'} None {'n': '44', 'p': '13'}
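The steps above can be combined into one runnable sketch by passing the group names as the index and transposing in one go:

```python
import pandas as pd

full_list = ['Group1', [{'a': '1', 'b': '2'}, {'c': '3', 'x': '1'}],
             'Group2', [{'d': '7', 'e': '18'}],
             'Group3', [{'m': '21'}, {'n': '44', 'p': '13'}]]

# odd positions hold the data, even positions become the column names
df = pd.DataFrame(full_list[1::2], index=full_list[::2]).T
```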

Making new columns of keys with values stores as a list of values from a list of dicts?

I have a data frame (10 million rows) which looks like the following. For better understanding, I have simplified it.
user_id event_params
10 [{'key': 'x', 'value': '1'}, {'key': 'y', 'value': '3'}, {'key': 'z', 'value': '4'}]
11 [{'key': 'y', 'value': '5'}, {'key': 'z', 'value': '9'}]
12 [{'key': 'a', 'value': '5'}]
I want to make new columns for all the unique keys in the dataframe, with the values stored under the respective keys. The output should look like below:
user_id x y z a
10 1 3 4 NA
11 NA 5 9 NA
12 NA NA NA 5
Just create a new dataframe and append rows to it via the append function (note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on newer versions use pd.concat instead). You can find more alternatives here.
import pandas as pd

df = pd.DataFrame()
data = [
    [12, [{'key': 'x', 'value': '1'}, {'key': 'y', 'value': '3'}, {'key': 'z', 'value': '4'}]],
    [13, [{'key': 'a', 'value': '5'}]]
]
for user_id, event_params in data:
    record = {e['key']: e['value'] for e in event_params}
    record['user_id'] = user_id
    df = df.append(record, ignore_index=True)
df
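Appending row by row is quadratic, and DataFrame.append is gone in pandas 2.x. A sketch that builds all records in one pass (using the sample frame from the question) should scale far better to 10 million rows:

```python
import pandas as pd

df = pd.DataFrame({'user_id': [10, 11, 12],
                   'event_params': [[{'key': 'x', 'value': '1'}, {'key': 'y', 'value': '3'}, {'key': 'z', 'value': '4'}],
                                    [{'key': 'y', 'value': '5'}, {'key': 'z', 'value': '9'}],
                                    [{'key': 'a', 'value': '5'}]]})

# turn each list of {'key': ..., 'value': ...} dicts into one flat record
wide = pd.DataFrame([{e['key']: e['value'] for e in params}
                     for params in df['event_params']])
out = pd.concat([df[['user_id']], wide], axis=1)
```

Missing keys come out as NaN, matching the NA cells in the expected output.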

How do I remove decimals from Pandas to_dict() output

The gist of this post is that I have "23" in my original data, and I want "23" in my resulting dict (not "23.0"). Here's how I've tried to handle it with Pandas.
My Excel worksheet has a coded Region column:
23
11
27
(blank)
25
Initially, I created a dataframe and Pandas set the dtype of Region to float64.
import pandas as pd
filepath = 'data_file.xlsx'
df = pd.read_excel(filepath, sheetname=0, header=0)
df
23.0
11.0
27.0
NaN
25.0
Pandas will convert the dtype to object if I use fillna() to replace NaN's with blanks which seems to eliminate the decimals.
df.fillna('', inplace=True)
df
23
11
27
(blank)
25
Except I still get decimals when I convert the dataframe to a dict:
data = df.to_dict('records')
data
[{'region': 23.0,},
{'region': 27.0,},
{'region': 11.0,},
{'region': '',},
{'region': 25.0,}]
Is there a way I can create the dict without the decimal places? By the way, I'm writing a generic utility, so I won't always know the column names and/or value types, which means I'm looking for a generic solution (vs. explicitly handling Region).
Any help is much appreciated, thanks!
The problem is that after fillna('') your underlying values are still float, despite the column being of type object:
s = pd.Series([23., 11., 27., np.nan, 25.])
s.fillna('').iloc[0]
23.0
Instead, apply a formatter, then replace:
s.apply('{:0.0f}'.format).replace('nan', '').to_dict()
{0: '23', 1: '11', 2: '27', 3: '', 4: '25'}
Using a custom function takes care of integers and keeps strings as strings:
import pandas as pd
import pprint

def func(x):
    try:
        return int(x)
    except ValueError:
        return x

df = pd.DataFrame({'region': [1, 2, 3, float('nan')],
                   'col2': ['a', 'b', 'c', float('nan')]})
df.fillna('', inplace=True)
pprint.pprint(df.applymap(func).to_dict('records'))
Output:
[{'col2': 'a', 'region': 1},
{'col2': 'b', 'region': 2},
{'col2': 'c', 'region': 3},
{'col2': '', 'region': ''}]
A variation that also keeps floats as floats:
import pandas as pd
import pprint

def func(x):
    try:
        if int(x) == x:
            return int(x)
        else:
            return x
    except ValueError:
        return x

df = pd.DataFrame({'region1': [1, 2, 3, float('nan')],
                   'region2': [1.5, 2.7, 3, float('nan')],
                   'region3': ['a', 'b', 'c', float('nan')]})
df.fillna('', inplace=True)
pprint.pprint(df.applymap(func).to_dict('records'))
Output:
[{'region1': 1, 'region2': 1.5, 'region3': 'a'},
{'region1': 2, 'region2': 2.7, 'region3': 'b'},
{'region1': 3, 'region2': 3, 'region3': 'c'},
{'region1': '', 'region2': '', 'region3': ''}]
You could add dtype=str:
import pandas as pd
filepath = 'data_file.xlsx'
df = pd.read_excel(filepath, sheetname=0, header=0, dtype=str)
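On newer pandas versions (likely newer than the ones in the answers above), the nullable Int64 dtype is another option: whole-number floats become proper integers and missing values stay as <NA> instead of forcing the column back to float. A minimal sketch on a stand-in Series:

```python
import pandas as pd

# stand-in for the Region column read from Excel
s = pd.Series([23.0, 11.0, 27.0, float('nan'), 25.0], name='region')
# 'Int64' (capital I) is the nullable integer dtype; NaN becomes <NA>
d = s.astype('Int64').to_dict()
```

This only works per-column when every non-null value is a whole number, so a generic utility would still need to try/except the cast column by column.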
