pandas DataFrame: normalize one JSON column and merge with other columns - python

I have a pandas DataFrame containing one column that holds multiple JSON items as a list of dicts. I want to normalize the JSON column and duplicate the non-JSON columns:
# creating the dataframe
import json
import pandas as pd

df_actions = pd.DataFrame(columns=['id', 'actions'])
rows = [[12, json.loads('[{"type": "a", "value": "17"}, {"type": "b", "value": "19"}]')],
        [15, json.loads('[{"type": "a", "value": "1"}, {"type": "b", "value": "3"}, {"type": "c", "value": "5"}]')]]
df_actions.loc[0] = rows[0]
df_actions.loc[1] = rows[1]
>>> df_actions
   id                                            actions
0  12  [{'type': 'a', 'value': '17'}, {'type': 'b', '...
1  15  [{'type': 'a', 'value': '1'}, {'type': 'b', 'v...
I want:
>>> df_actions_parsed
id type value
12 a 17
12 b 19
15 a 1
15 b 3
15 c 5
I can normalize JSON data using:
pd.concat([pd.json_normalize(x) for x in df_actions['actions']], ignore_index=True)
but I don't know how to join that back to the id column of the original DataFrame.

You can use concat with a dict comprehension, using pop to extract the column, then remove the second index level and join back to the original:
df1 = (pd.concat({i: pd.DataFrame(x) for i, x in df_actions.pop('actions').items()})
         .reset_index(level=1, drop=True)
         .join(df_actions)
         .reset_index(drop=True))
Which is the same as:
df1 = (pd.concat({i: pd.json_normalize(x) for i, x in df_actions.pop('actions').items()})
         .reset_index(level=1, drop=True)
         .join(df_actions)
         .reset_index(drop=True))
print(df1)
  type value  id
0    a    17  12
1    b    19  12
2    a     1  15
3    b     3  15
4    c     5  15
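To see why the reset_index(level=1, drop=True) step is needed, it helps to inspect the intermediate result of the concat (a sketch run on a fresh frame, before 'actions' is popped; the outer index level carries the original row labels, the inner level the list positions):
# Build the same dict-of-frames concat without pop, so df_actions stays intact.
intermediate = pd.concat({i: pd.DataFrame(x) for i, x in df_actions['actions'].items()})
print(intermediate)
#     type value
# 0 0    a    17
#   1    b    19
# 1 0    a     1
#   1    b     3
#   2    c     5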
Another solution if performance is important:
L = [{'i': k, **y} for k, v in df_actions.pop('actions').items() for y in v]
df_actions = df_actions.join(pd.DataFrame(L).set_index('i')).reset_index(drop=True)
print(df_actions)
   id type value
0  12    a    17
1  12    b    19
2  15    a     1
3  15    b     3
4  15    c     5
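If you want to verify the performance claim on your own data, here is a rough timeit sketch over the toy data (df_base is a rebuilt copy of the example; pop consumes the column, so each run works on a fresh copy, and the numbers will vary with data size and machine):
import json
import timeit

import pandas as pd

df_base = pd.DataFrame({
    'id': [12, 15],
    'actions': [
        json.loads('[{"type": "a", "value": "17"}, {"type": "b", "value": "19"}]'),
        json.loads('[{"type": "a", "value": "1"}, {"type": "b", "value": "3"}, {"type": "c", "value": "5"}]'),
    ],
})

def concat_approach():
    df = df_base.copy()
    return (pd.concat({i: pd.DataFrame(x) for i, x in df.pop('actions').items()})
              .reset_index(level=1, drop=True)
              .join(df)
              .reset_index(drop=True))

def listcomp_approach():
    df = df_base.copy()
    L = [{'i': k, **y} for k, v in df.pop('actions').items() for y in v]
    return df.join(pd.DataFrame(L).set_index('i')).reset_index(drop=True)

print(timeit.timeit(concat_approach, number=1000))
print(timeit.timeit(listcomp_approach, number=1000))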

Here's another solution that uses explode (available since pandas 0.25) and pd.json_normalize:
exploded = df_actions.explode("actions")
pd.concat([exploded["id"].reset_index(drop=True),
           pd.json_normalize(exploded["actions"])], axis=1)
Here's the result:
   id type value
0  12    a    17
1  12    b    19
2  15    a     1
3  15    b     3
4  15    c     5
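As an aside: if you still have the raw records before they were loaded into a DataFrame, pd.json_normalize can build the flat table in a single call via record_path and meta (a sketch assuming records shaped like the example data):
# Hypothetical raw records, shaped like the example before loading into a frame.
raw = [
    {'id': 12, 'actions': [{'type': 'a', 'value': '17'}, {'type': 'b', 'value': '19'}]},
    {'id': 15, 'actions': [{'type': 'a', 'value': '1'}, {'type': 'b', 'value': '3'},
                           {'type': 'c', 'value': '5'}]},
]
# record_path walks into each 'actions' list; meta carries 'id' along for every row.
print(pd.json_normalize(raw, record_path='actions', meta='id'))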

Related

pandas - How to unstack dict column values to 2 new columns with the same name to all?

I currently have the dataframe below, with a dict inside the value column.
  variable     value
0      b44  {55: 20}
1      a11  {56: 19}
5      a34  {33: 19}
How do I transform the above df into one that looks like this:
  variable  id  value
0      b44  55     20
1      a11  56     19
5      a34  33     19
import pandas as pd

df = pd.DataFrame({'variable': ['b44', 'a11', 'a34'],
                   'value': [{55: 20}, {56: 19}, {33: 19}]})
# my attempt so far:
df = df.assign(**{'key': df.value.apply(lambda x: list(x.keys())[0]),
                  'value': df.value.apply(lambda x: list(x.values())[0])})
Use a list comprehension to build a list of tuples; DataFrame.pop extracts the value column so the new columns come out in the desired order:
df[['id', 'value']] = [list(x.items())[0] for x in df.pop('value')]
print(df)
  variable  id  value
0      b44  55     20
1      a11  56     19
5      a34  33     19
Or (both one-liners assume each dict holds exactly one key):
df[['id', 'value']] = [(*x.keys(), *x.values()) for x in df.pop('value')]
print(df)
  variable  id  value
0      b44  55     20
1      a11  56     19
5      a34  33     19
Try with stack after creating a new DataFrame:
s = pd.DataFrame(df.pop('value').tolist(), index=df.index).stack().reset_index(level=1)
s.columns = ['id', 'value']
df = df.join(s)
df
Out[82]:
  variable  id  value
0      b44  55   20.0
1      a11  56   19.0
5      a34  33   19.0
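Note the stacked values come back as floats (20.0, 19.0) because the intermediate wide frame holds NaNs; if you need the original integers, a cast after the join fixes it (assuming no missing values):
df['value'] = df['value'].astype(int)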
You can use the apply function:
df['id'] = df['value'].apply(lambda x: list(x.keys())[0])
df['value'] = df['value'].apply(lambda x: list(x.values())[0])

How to split dataframe cells using a delimiter into different dataframes, with conditions

There are other questions on the same topic and they helped but I have an extra twist.
I have a dataframe with multiple values in each (but not all) cells.
df = pd.DataFrame({'a': ["10-30-410", "20-40-500", "25-50"], 'b': ["5-8-9", "4", "99"]})
index  a          b
0      10-30-410  5-8-9
1      20-40-500  4
2      25-50      99
How can I split each cell by the dash "-" and create three new dataframes? Note that not all cells have multiple values, in which case the second and third dataframes get NA or blank (treating these as strings).
So I need df1 to be the first of those values:
index  a   b
0      10  5
1      20  4
2      25  99
And df2 would be:
index  a   b
0      30  8
1      40
2      50
And likewise for df3:
index  a    b
0      410  9
1      500
2
I got df1 with this:
df1 = df.replace(r'(\d+).*(\d+).*(\d+)+', r'\1', regex=True)
But df2 doesn't quite work. I get the second values but also 4 and 99, which should be blank (replace leaves a cell untouched when the pattern doesn't match at all):
df2 = df.replace(r'(\d+)-(\d+).*', r'\2', regex=True)
index  a   b
0      30  8
1      40  4    <- should be blank
2      50  99   <- should be blank
Is this the right approach? I'm pretty good on regex but fuzzy with groups. Thank you.
Use str.split + concat + stack to get the data in a more usable format:
new_df = pd.concat(
    (df['a'].str.split('-', expand=True),
     df['b'].str.split('-', expand=True)),
    keys=('a', 'b'),
    axis=1
).stack(dropna=False).droplevel(0)
new_df:
a b
0 10 5
1 30 8
2 410 9
0 20 4
1 40 None
2 500 None
0 25 99
1 50 None
2 None None
Expandable option for n cols:
cols = ['a', 'b']
new_df = pd.concat(
    (df[c].str.split('-', expand=True) for c in cols),
    keys=cols,
    axis=1
).stack(dropna=False).droplevel(0)
Then groupby level 0 + reset_index to create a list of dataframes:
dfs = [g.reset_index(drop=True) for _, g in new_df.groupby(level=0)]
dfs:
[ a b
0 10 5
1 20 4
2 25 99,
a b
0 30 8
1 40 None
2 50 None,
a b
0 410 9
1 500 None
2 None None]
Complete Working Example:
import pandas as pd

df = pd.DataFrame({
    'a': ["10-30-410", "20-40-500", "25-50"],
    'b': ["5-8-9", "4", "99"]
})
cols = ['a', 'b']
new_df = pd.concat(
    (df[c].str.split('-', expand=True) for c in cols),
    keys=cols,
    axis=1
).stack(dropna=False).droplevel(0)
dfs = [g.reset_index(drop=True) for _, g in new_df.groupby(level=0)]
print(dfs)
You can also try with filter:
k = pd.concat((df[c].str.split('-', expand=True).add_prefix(c + '-')
               for c in df.columns), axis=1).fillna('')
df1 = k.filter(like='0')
df2 = k.filter(like='1')
df3 = k.filter(like='2')
NOTE: To strip the digit from the column names, use: k.filter(like='0').rename(columns=lambda x: x.split('-')[0])
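A related sketch on the same df: split every column once, then pull position i out of each split for i = 0, 1, 2 (this assumes at most three parts per cell, as in the example):
# Split each column once; each entry of parts is a 3-column frame of the pieces.
parts = {c: df[c].str.split('-', expand=True) for c in df.columns}
# For position i, take column i from every split and stitch the pieces together.
df1, df2, df3 = (
    pd.concat({c: p[i] for c, p in parts.items()}, axis=1).fillna('')
    for i in range(3)
)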

Remove any 0 value from row, order values descending for row, for each non 0 value in row return the index, column name, and score to a new df

I'm looking for a more efficient way of doing the below (perhaps using boolean masks and vectorization).
I'm new to this forum so apologies if my first question is not quite what was expected.
# order each row by values descending
# remove any 0-value column from the row
# for each non-zero value, return the index, column name, and score to a new dataframe
import numpy as np
import pandas as pd

test_data = {'a': [1, 0, 8, 5],
             'b': [36, 2, 0, 6],
             'c': [2, 8, 100, 0],
             'd': [7, 8, 9, 50]}
df = pd.DataFrame(test_data, columns=['a', 'b', 'c', 'd'])

column_names = ['index_row', 'header', 'score']
# create empty df with final output columns
df_result = pd.DataFrame(columns=column_names)

row_index = list(df.index.values)
for row in row_index:
    working_row = row
    # change all 0 values to null and drop any extraneous columns
    subset_cols = (df.loc[[working_row], :].replace(0, np.nan)
                     .dropna(axis=1, how='any').columns.to_list())
    # order by score
    sub_df = df.loc[[working_row], subset_cols].sort_values(by=row, axis=1, ascending=False)
    s_cols = sub_df.columns.to_list()
    scores = sub_df.values.tolist()
    scores = scores[0]
    index_row = []
    header = []
    score = []
    for count, value in enumerate(scores):
        header.append(s_cols[count])
        score.append(value)
        index_row.append(row)
    data = {'index_row': index_row,
            'header': header,
            'score': score}
    result_frame = pd.DataFrame(data, columns=['index_row', 'header', 'score'])
    df_result = pd.concat([df_result, result_frame], ignore_index=True)
df_result
You could do it directly with melt and some additional processing:
df_result = (df.reset_index()
               .rename(columns={'index': 'index_row'})
               .melt(id_vars='index_row', var_name='header', value_name='score')
               .query("score != 0")
               .sort_values(['index_row', 'score'], ascending=[True, False])
               .reset_index(drop=True))
which gives, as expected:
index_row header score
0 0 b 36
1 0 d 7
2 0 c 2
3 0 a 1
4 1 c 8
5 1 d 8
6 1 b 2
7 2 c 100
8 2 d 9
9 2 a 8
10 3 d 50
11 3 b 6
12 3 a 5
Or loop row by row (DataFrame.append was removed in pandas 2.0, so pd.concat is used here):
df_result = pd.DataFrame(columns=['index_row', 'header', 'score'])
for index in df.index:
    temp_df = df.loc[index].reset_index().reset_index()
    temp_df.columns = ['index_row', 'header', 'score']
    temp_df['index_row'] = index
    temp_df.sort_values(by=['score'], ascending=False, inplace=True)
    df_result = pd.concat([df_result, temp_df[temp_df.score != 0]], ignore_index=True)
Or with melt (the mask variable here avoids shadowing the built-in filter):
test_data = {'a': [1, 0, 8, 5],
             'b': [36, 2, 0, 6],
             'c': [2, 8, 100, 0],
             'd': [7, 8, 9, 50]}
df = pd.DataFrame(test_data, columns=['a', 'b', 'c', 'd'])
df = df.reset_index()
results = pd.melt(df, id_vars='index', var_name='header', value_name='score')
mask = results['score'] != 0
print(results[mask].sort_values(by=['index', 'score'], ascending=[True, False]))
output:
index header score
4 0 b 36
12 0 d 7
8 0 c 2
0 0 a 1
9 1 c 8
13 1 d 8
5 1 b 2
10 2 c 100
14 2 d 9
2 2 a 8
15 3 d 50
7 3 b 6
3 3 a 5
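If you prefer a clean 0..n index like the first answer's output, tack reset_index(drop=True) onto the end:
out = (results[mask].sort_values(by=['index', 'score'], ascending=[True, False])
                    .reset_index(drop=True))
print(out)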

Insert a blank row between each grouping in a dataframe BUT only display the first header

The following code, courtesy of @jezrael, displays a blank row AND a header for each grouping:
data = {
    'MARKET_SECTOR_DES': ['A', 'A', 'B', 'B', 'B', 'B'],
    'count': [10, 9, 20, 19, 18, 17]
}
df = pd.DataFrame(data)
print(df)
print("")

# retrieve column headers
df2 = pd.DataFrame([[''] * len(df.columns), df.columns], columns=df.columns)

# for each grouping, append the separator and header rows
# (DataFrame.append; on pandas 2.0+ use pd.concat, see the sketch further down)
df1 = (df.groupby('MARKET_SECTOR_DES', group_keys=False)
         .apply(lambda d: d.append(df2))
         .iloc[:-2]
         .reset_index(drop=True))
print(df1)
Output:
MARKET_SECTOR_DES count
0 A 10
1 A 9
2
3 MARKET_SECTOR_DES count
4 B 20
5 B 19
6 B 18
7 B 17
Desired output:
MARKET_SECTOR_DES count
0 A 10
1 A 9
4 B 20
5 B 19
6 B 18
7 B 17
So only the single header at the top.
Change your df2 to:
df2 = pd.DataFrame([[''] * len(df.columns)], columns=df.columns)
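Putting the fix together (a sketch: df2 is now a single blank row, so the trailing slice becomes .iloc[:-1]; pd.concat replaces DataFrame.append, which was removed in pandas 2.0):
import pandas as pd

data = {
    'MARKET_SECTOR_DES': ['A', 'A', 'B', 'B', 'B', 'B'],
    'count': [10, 9, 20, 19, 18, 17]
}
df = pd.DataFrame(data)

# a single blank separator row, no repeated header row
df2 = pd.DataFrame([[''] * len(df.columns)], columns=df.columns)

df1 = (df.groupby('MARKET_SECTOR_DES', group_keys=False)
         .apply(lambda d: pd.concat([d, df2]))
         .iloc[:-1]                      # drop the trailing separator
         .reset_index(drop=True))
print(df1)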

Efficient way of looping through list of dictionaries and appending items into column in dataframe

Here is an MRE:
data = [
    {'1': 20},
    {'1': 10},
    {'1': 40},
    {'1': 14},
    {'1': 33}
]
What I am trying to do is loop through each dictionary and append each value to a column in a dataframe.
Right now I am doing:
import pandas as pd

lst = []
for item in data:
    lst.append(item['1'])
df = pd.DataFrame({"col1": lst})
outputting:
col1
0 20
1 10
2 40
3 14
4 33
Yes, this is what I want; however, I have over 1M dictionaries in the list. Is this the most efficient way?
EDIT:
pd.DataFrame(data).rename(columns={'1':'col1'})
works perfectly for the above case, but what if data looks like this?
data = [
    {'1': {'value': 20}},
    {'1': {'value': 10}},
    {'1': {'value': 40}},
    {'1': {'value': 14}},
    {'1': {'value': 33}}
]
So I would use:
lst = []
for item in data:
    lst.append(item['1']['value'])
df = pd.DataFrame({"col1": lst})
Is there a more efficient way for a list of dictionaries that contain dictionaries?
One idea is to pass data to the DataFrame constructor and then use rename:
df = pd.DataFrame(data).rename(columns={'1':'col1'})
print(df)
col1
0 20
1 10
2 40
3 14
4 33
If filtering is necessary, use a list comprehension and pass the columns parameter:
df = pd.DataFrame([x['1'] for x in data], columns=['col1'])
print(df)
col1
0 20
1 10
2 40
3 14
4 33
EDIT: For the new data use:
data = [
    {'1': {'value': 20}},
    {'1': {'value': 10}},
    {'1': {'value': 40}},
    {'1': {'value': 14}},
    {'1': {'value': 33}}
]
df = pd.DataFrame([x['1']['value'] for x in data], columns=['col1'])
df = pd.DataFrame([x['1']['value'] for x in data], columns=['col1'])
print(df)
col1
0 20
1 10
2 40
3 14
4 33
Or:
df = pd.DataFrame([x['1'] for x in data]).rename(columns={'value':'col1'})
print(df)
col1
0 20
1 10
2 40
3 14
4 33
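For the 1M-item scale mentioned in the question, here is a rough timeit sketch comparing the explicit loop with the list-comprehension route (numbers are machine-dependent):
import timeit

import pandas as pd

data = [{'1': {'value': i}} for i in range(1_000_000)]

def loop_append():
    lst = []
    for item in data:
        lst.append(item['1']['value'])
    return pd.DataFrame({'col1': lst})

def listcomp():
    return pd.DataFrame([x['1']['value'] for x in data], columns=['col1'])

print(timeit.timeit(loop_append, number=3))
print(timeit.timeit(listcomp, number=3))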
@jezrael's answer is correct, but to get the col prefix more directly:
df = pd.DataFrame(data)
print(df.add_prefix('col'))
Output:
col1
0 20
1 10
2 40
3 14
4 33
