Python List of Dictionaries Denormalization - python

I have a list of list of dictionaries such as the following:
[[{'ID': '1',
'Value': '100'},
{'ID': '2',
'Value': '200'}],
[{'ID': '2',
'Value': '300'},
{'ID': '2',
'Value': '300'}],
...]]
I want to convert it into a denormalized dataframe which would have new column for each key such as:
# ID Value ID Value
#0 1 100 2 100
#1 2 300 2 300
If one item has three ID/Value pairs, the extra columns should be null for the items that have fewer. Running pd.DataFrame(list) creates only one ID column and one Value column and stacks all the values under them. How can I get these as separate columns?

You can do it with the concat function:
data = [pd.DataFrame(i) for i in input_data]
out = pd.concat(data, axis=1)
print(out)
Prints:
ID Value ID Value
0 1 100 2 300
1 2 200 2 300
The key is the axis=1 which concatenates along the column axis.
Edit:
I just saw the requirement about filling the "shorter" columns. This code produces NaN for them; if you want zeros instead, the fillna() method takes care of it:
out = out.fillna(value=0)
Example:
import pandas as pd
input_data = [[{'ID': '1',
'Value': '100'},
{'ID': '2',
'Value': '200'}],
[{'ID': '2',
'Value': '300'},
{'ID': '2',
'Value': '300'}],
[{'ID': '2',
'Value': '300'},
{'ID': '2',
'Value': '300'},
{'ID': '3',
'Value': '300'}]]
data = [pd.DataFrame(i) for i in input_data]
out = pd.concat(data, axis=1)
out = out.fillna(value=0)
print(out)
prints:
ID Value ID Value ID Value
0 1 100 2 300 2 300
1 2 200 2 300 2 300
2 0 0 0 0 3 300
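Because the concatenated frame repeats the labels ID and Value, selecting out['ID'] returns every ID column at once. A minimal sketch, assuming you are free to rename the columns (the ID_0, Value_0 naming is my own choice, not something from the question), passes keys= to pd.concat and then flattens the resulting two-level labels:

```python
import pandas as pd

input_data = [
    [{'ID': '1', 'Value': '100'}, {'ID': '2', 'Value': '200'}],
    [{'ID': '2', 'Value': '300'}, {'ID': '2', 'Value': '300'}, {'ID': '3', 'Value': '300'}],
]

frames = [pd.DataFrame(item) for item in input_data]

# keys= adds an outer column level identifying which inner list each block came from
out = pd.concat(frames, axis=1, keys=list(range(len(frames))))

# flatten (key, name) pairs into unique labels such as ID_0, Value_0
out.columns = [f"{name}_{key}" for key, name in out.columns]
print(out)
```

Shorter inner lists are left as NaN here; chain .fillna(0) if zeros are preferred.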

Related

How to convert elements in Series to Dataframe in python?

I'm new to python.
I got a Dataframe like this:
df = pd.DataFrame({'column_a' : [1, 2, 3],
'conversions' : [[{'action_type': 'type1',
'value': '1',
'value_plus_10': '11'},
{'action_type': 'type2',
'value': '2',
'value_plus_10': '12'}],
np.nan,
[{'action_type': 'type3',
'value': '3',
'value_plus_10': '13'},
{'action_type': 'type4',
'value': '4',
'value_plus_10': '14'}]]} )
where values in the column conversions is either a list or a NaN.
values in conversions looks like this:
print(df['conversions'][0])
>>> [{'action_type': 'type1', 'value': '1', 'value_plus_10': '11'}, {'action_type': 'type2', 'value': '2', 'value_plus_10': '12'}]
But it's kinda hard to manipulate, so I want elements in conversions to be either a dataFrame or a NaN, like this:
print(df['conversions'][0])
>>>
action_type value value_plus_10
0 type1 1 11
1 type2 2 12
print(df['conversions'][1])
>>> nan
print(df['conversions'][2])
>>>
action_type value value_plus_10
0 type3 3 13
1 type4 4 14
Here's what I tried:
df['conversions'] = df['conversions'].apply(lambda x : pd.DataFrame(x) if type(x)=='list' else x)
which works, but nothing really changes.
I could only find ways to convert a series to a dataframe, but what I'm trying to do is converting elements in a series to dataframes.
Is it possible to do? Thanks a lot!
Edit: Sorry for the unclear expected output , hope it's clear now.
You can apply the DataFrame constructor to the conversions columns:
df['conversions'] = df['conversions'].apply(lambda x: pd.DataFrame(x) if isinstance(x, list) else x)
print(df['conversions'][0])
Output:
action_type value value_plus_10
0 type1 1 11
1 type2 2 12
Edit: it seems I misread your question (which is a bit unclear tbf) since you claim that this doesn't get the expected result. Are you trying to get all elements in one df? In that case you can use concat:
df_out = pd.concat([
pd.DataFrame(x) for x in df['conversions'] if isinstance(x, list)
])
print(df_out)
Output:
action_type value value_plus_10
0 type1 1 11
1 type2 2 12
0 type3 3 13
1 type4 4 14
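The combined frame above repeats the index 0, 1 and no longer shows which row of df each record came from. A sketch of one way to keep that link, assuming a two-level index is acceptable (the level names source_row and item are made up for illustration), is to pass a dict of frames to pd.concat:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'column_a': [1, 2, 3],
    'conversions': [
        [{'action_type': 'type1', 'value': '1', 'value_plus_10': '11'},
         {'action_type': 'type2', 'value': '2', 'value_plus_10': '12'}],
        np.nan,
        [{'action_type': 'type3', 'value': '3', 'value_plus_10': '13'},
         {'action_type': 'type4', 'value': '4', 'value_plus_10': '14'}],
    ],
})

# one small frame per list-valued row, keyed by the row label in df
frames = {
    idx: pd.DataFrame(x)
    for idx, x in df['conversions'].items()
    if isinstance(x, list)
}

# the dict keys become the outer index level
df_out = pd.concat(frames, names=['source_row', 'item'])
print(df_out)
```

Rows whose conversions value is NaN are simply skipped, so row 1 does not appear in the result.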

Extract specific value from a dictionary within a list in a column

I'm trying to extract values from a dictionary within a list in a column, my dataframe looks like,
id proteinIds
0 ENSG00000006194 [{'id': 'O14978', 'source': 'uniprot_swissprot...
1 ENSG00000007520 [{'id': 'Q9UJK0', 'source': 'uniprot_swissprot...
2 ENSG00000020922 [{'id': 'P49959', 'source': 'uniprot_swissprot...
3 ENSG00000036549 [{'id': 'Q8IYH5', 'source': 'uniprot_swissprot...
4 ENSG00000053524 [{'id': 'Q86YR7', 'source': 'uniprot_swissprot...
Each value in the proteinIds column holds multiple ids, like below. I'm trying to extract only the id whose source is uniprot_swissprot, and return None if uniprot_swissprot is not present in the list:
[{'id': 'O60284', 'source': 'uniprot_swissprot'},
{'id': 'E5RFE8', 'source': 'uniprot_trembl'},
{'id': 'E5RHS3', 'source': 'uniprot_trembl'},
{'id': 'E5RHY1', 'source': 'uniprot_trembl'},
{'id': 'E5RID0', 'source': 'uniprot_trembl'},
{'id': 'E5RK88', 'source': 'uniprot_trembl'},
{'id': 'Q17RY1', 'source': 'uniprot_obsolete'}]
Expected output
id proteinIds
0 ENSG00000006194 O14978
1 ENSG00000007520 Q9UJK0
2 ENSG00000020922 P49959
3 ENSG00000036549 Q8IYH5
4 ENSG00000053568 None
I tried using below code, but it was not returning the correct ids related to uniprot_swissprot, any help is appreciated, thanks.
df1 = pd.DataFrame([[y['id'] for y in x] if isinstance(x, list) else [None] for x in df['proteinIds']], index=df.index)
You can explode the lists in the proteinIds column into one row per dictionary, expand each dictionary into DataFrame columns, and then select the id values where source is uniprot_swissprot:
df['Ids'] = (df['proteinIds'].explode() # explode will keep the original index by default so we can safely assign it back
.apply(pd.Series)
.loc[lambda d: d['source'].eq('uniprot_swissprot'), 'id'])
print(df)
id \
0 ENSG00000006194
1 ENSG00000007520
proteinIds \
0 [{'id': 'O60284', 'source': 'uniprot_swissprot'}, {'id': 'E5RFE8', 'source': 'uniprot_trembl'}]
1 [{'id': 'E5RK88', 'source': 'uniprot_trembl'}, {'id': 'Q17RY1', 'source': 'uniprot_obsolete'}]
Ids
0 O60284
1 NaN
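As an alternative that does not build intermediate columns, a plain Python helper can pick the first matching entry per row. This is only a sketch: the function name swissprot_id and the output column name are my own, and it assumes at most one uniprot_swissprot entry is wanted per row.

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['ENSG00000006194', 'ENSG00000007520'],
    'proteinIds': [
        [{'id': 'O60284', 'source': 'uniprot_swissprot'},
         {'id': 'E5RFE8', 'source': 'uniprot_trembl'}],
        [{'id': 'E5RK88', 'source': 'uniprot_trembl'},
         {'id': 'Q17RY1', 'source': 'uniprot_obsolete'}],
    ],
})

def swissprot_id(entries):
    """Return the first id whose source is uniprot_swissprot, else None."""
    if not isinstance(entries, list):
        return None  # covers NaN or other non-list cells
    return next(
        (e['id'] for e in entries if e.get('source') == 'uniprot_swissprot'),
        None,
    )

df['swissprot_id'] = df['proteinIds'].apply(swissprot_id)
print(df[['id', 'swissprot_id']])
```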

Pandas: Merge contents of a dataframe into a single column (as a list of dict / json)

I want to introduce the contents of one df to another but as a list based on ID. I know to merge based on ID but I do not want duplicate rows for ID in the new dataframe. How do I get this done?
data1 = {'ID': ['AB01','AB02'],
'Name': ["toyota", "honda"],
'Age':[21,22]
}
df1 = pd.DataFrame.from_dict(data1)
data2 = {'ID': ['AB01','AB01','AB03','AB03'],
'Type': ["C",np.nan,"X","S"],
'Score':[87,98,45,82]
}
df2 = pd.DataFrame.from_dict(data2)
The result should look like this
You can make dict on the rows of df2 by .apply(), then group by ID and aggregate the dict of same ID into list by .groupby() + .agg().
Then, merge with df1 with .merge() by left join with ID as matching keys, as follows:
df2_info = (df2.apply(dict, axis=1)
.groupby(df2['ID'])
.agg(list)
.reset_index(name='Info')
)
df_out = df1.merge(df2_info, on='ID', how='left')
Result
print(df_out)
ID Name Age Info
0 AB01 toyota 21 [{'ID': 'AB01', 'Type': 'C', 'Score': 87}, {'ID': 'AB01', 'Type': nan, 'Score': 98}]
1 AB02 honda 22 NaN
For reference only, interim result of df2_info:
ID Info
0 AB01 [{'ID': 'AB01', 'Type': 'C', 'Score': 87}, {'ID': 'AB01', 'Type': nan, 'Score': 98}]
1 AB03 [{'ID': 'AB03', 'Type': 'X', 'Score': 45}, {'ID': 'AB03', 'Type': 'S', 'Score': 82}]
Try merge:
print(df1.merge(df2, on='ID', how='left').groupby(['ID', 'Name', 'Age']).apply(lambda x: a.to_dict('records') if (a:=x[['ID']].join(x.iloc[:, 3:])).dropna().any().any() else []).reset_index(name='Info'))
Output:
ID Name Age Info
0 AB01 toyota 21 [{'ID': 'AB01', 'Type': 'C', 'Score': 87.0}, {...
1 AB02 honda 22 []
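The two answers differ in what an unmatched ID receives: NaN in the first, an empty list in the second. A sketch that keeps the groupby approach but ends with empty lists, assuming that is the desired convention (the intermediate names records and info are illustrative):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'ID': ['AB01', 'AB02'],
                    'Name': ['toyota', 'honda'],
                    'Age': [21, 22]})
df2 = pd.DataFrame({'ID': ['AB01', 'AB01', 'AB03', 'AB03'],
                    'Type': ['C', np.nan, 'X', 'S'],
                    'Score': [87, 98, 45, 82]})

# one dict per row of df2, aligned with df2's index
records = pd.Series(df2.to_dict('records'), index=df2.index)

# collect the dicts of each ID into a list
info = records.groupby(df2['ID']).agg(list).rename('Info')

# left join on ID, then replace the NaN of unmatched IDs with []
df_out = df1.join(info, on='ID')
df_out['Info'] = df_out['Info'].apply(lambda v: v if isinstance(v, list) else [])
print(df_out)
```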

Extract value from dict in column as column value, with NA's present [duplicate]

This question already has an answer here:
How to json_normalize a column with NaNs
I'm currently trying to pull the value out of several dicts appearing in a series of columns, there are two issues:
Since there are 4 columns in question they were unpacked from a previous dict-in-column value via this line of code:
df = pd.concat([df.drop(['ids'], axis = 1), df['ids'].apply(pd.Series)], axis = 1)
What this did was unpack a dict in a column of the form:
d = {'a': {'id': 12}, 'b': {'id': 13}, 'c': {'id': 14}, 'd': {'id': 15}}
The dict d being of length between 0-4.
Before unpacking the dataframe the column I unpacked looked like this:
ids
406 {'a': {'id': '12'}}
408 None
409 {'a': {'id': '21'}, 'b': {'id': '23'}}
417 {'a': {'id': '53'}, 'b': {'id': '98'}, 'c': {'id': '45'}}
419 None
After Unpacking it now has the form
a b c
408 None {'id': '12'} None
409 {'id': '32'} {'id': '45'} {'id': '36'}
417 {'id': '09'} {'id': '31'} None
While that initially solved my first problem, I'm now trying to pull the values out of columns that have the dictionaries in them, and I'm kind of at a loss for this.
Potential solutions I've tried are just running the snippet above for each column (a,b,c), however that is both ugly and inefficient. At most I know an easy fix would be to pd.json_normalize the initial dataframe when I first start my program, however that would require a significant fix and refactor for something that seems that it could be solved trivially. For reference the ideal output would be this:
a b c
408 None 12 None
409 32 45 36
417 09 31 None
And the whole dataframe is several hundred thousand rows, with 20 columns that are in flux.
Using the solution from How to json_normalize a column with NaNs?
import pandas as pd
# setup dataframe
data = {'ids': [{'a': {'id': '12'}}, None, {'a': {'id': '21'}, 'b': {'id': '23'}}, {'a': {'id': '53'}, 'b': {'id': '98'}, 'c': {'id': '45'}}, None]}
df = pd.DataFrame(data)
# display(df)
ids
0 {'a': {'id': '12'}}
1 None
2 {'a': {'id': '21'}, 'b': {'id': '23'}}
3 {'a': {'id': '53'}, 'b': {'id': '98'}, 'c': {'id': '45'}}
4 None
# fill None with {}
df.ids = df.ids.fillna({i: {} for i in df.index})
# normalize the column
df = pd.json_normalize(df.ids).dropna(how='all')
# display(df)
a.id b.id c.id
0 12 NaN NaN
2 21 23 NaN
3 53 98 45
One option is to apply a custom function to each column:
def my_func(val):
    # pull the id out of dict cells; leave None/NaN cells untouched
    if isinstance(val, dict):
        return val['id']
    else:
        return val
for col in df.columns:
    df[col] = df[col].apply(my_func)
a b c
0 None 12 None
1 32 45 36
2 09 31 None
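The same extraction can also be written without a helper function, since Series.str.get accepts a dictionary key as well as a position. A small sketch, assuming every non-missing cell is a dict with an 'id' key (the sample frame below is rebuilt from the question's unpacked table):

```python
import pandas as pd

df = pd.DataFrame(
    {
        'a': [None, {'id': '32'}, {'id': '09'}],
        'b': [{'id': '12'}, {'id': '45'}, {'id': '31'}],
        'c': [None, {'id': '36'}, None],
    },
    index=[408, 409, 417],
)

# .str.get('id') looks up the key in each dict and leaves missing cells missing
out = df.apply(lambda col: col.str.get('id'))
print(out)
```

Missing cells come back as NaN rather than None, which may matter if later code tests for `is None`.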

How to convert data frame with dictionary columns into multi level data frame

I have DataFrame which contains dictionaries in the columns.
Can be created as below
lis = [
{'id': '1',
'author': {'self': 'A',
'displayName': 'A'},
'created': '2018-12-18',
'items': {'field': 'status',
'fromString': 'Backlog'}},
{'id': '2',
'author': {'self': 'B',
'displayName': 'B'},
'created': '2018-12-18',
'items': {'field': 'status',
'fromString': 'Funnel'}}]
pd.DataFrame(lis)
author created id items
0 {'self': 'A', 'displayName': 'A'} 2018-12-18 1 {'field': 'status', 'fromString': 'Backlog'}
1 {'self': 'B', 'displayName': 'B'} 2018-12-18 2 {'field': 'status', 'fromString': 'Funnel'}
I want to convert this info multi level DataFrame.
I have been trying with
pd.MultiIndex.from_product(lis)
pd.MultiIndex.from_frame(pd.DataFrame(lis))
But not able to get the result i am looking for.Basically i want like below:
author created id items
self displayName field fromString
A A 2018-12-18 1 status Backlog
B B 2018-12-18 2 status Funnel
Any suggestions on how i can achieve this ?
Thanks
You can use pd.json_normalize (exposed as pandas.io.json.json_normalize in older pandas versions), but the column names are flattened with a . separator:
import pandas as pd
lis = [
{'id': '1',
'author': {'self': 'A',
'displayName': 'A'},
'created': '2018-12-18',
'items': {'field': 'status',
'fromString': 'Backlog'}},
{'id': '2',
'author': {'self': 'B',
'displayName': 'B'},
'created': '2018-12-18',
'items': {'field': 'status',
'fromString': 'Funnel'}}]
df = pd.json_normalize(lis)
print (df)
id created author.self author.displayName items.field items.fromString
0 1 2018-12-18 A A status Backlog
1 2 2018-12-18 B B status Funnel
For a MultiIndex in both the columns and the index, first move all columns without a . into the index with DataFrame.set_index, then use str.split:
df = df.set_index(['id','created'])
df.columns = df.columns.str.split('.', expand=True)
print (df)
author items
self displayName field fromString
id created
1 2018-12-18 A A status Backlog
2 2018-12-18 B B status Funnel
If you need a MultiIndex in the columns only, that is possible, but you get missing values in the column names:
df.columns = df.columns.str.split('.', expand=True)
print (df)
id created author items
NaN NaN self displayName field fromString
0 1 2018-12-18 A A status Backlog
1 2 2018-12-18 B B status Funnel
Missing values should be replaced by empty string:
df = df.rename(columns= lambda x: '' if x != x else x)
print (df)
id created author items
self displayName field fromString
0 1 2018-12-18 A A status Backlog
1 2 2018-12-18 B B status Funnel
Try the below; hope this helps.
df = pd.json_normalize(lis)
# reorder the columns alphabetically so the new labels line up with the data
df = df[sorted(df.columns)]
print(list(df.columns))
tuple_list = [tuple(name.split(".")) if "." in name else (name, None) for name in df.columns]
df.columns = pd.MultiIndex.from_tuples(tuple_list)
print(df)
Output will be as given below:
author created id items
displayName self NaN NaN field fromString
A A 2018-12-18 1 status Backlog
B B 2018-12-18 2 status Funnel
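If the NaN placeholders in the second level are unwanted, the tuples can be built with an empty string from the start. A sketch assuming pandas 1.0 or later, where json_normalize is available as pd.json_normalize; the helper name to_pair is illustrative only:

```python
import pandas as pd

lis = [
    {'id': '1',
     'author': {'self': 'A', 'displayName': 'A'},
     'created': '2018-12-18',
     'items': {'field': 'status', 'fromString': 'Backlog'}},
    {'id': '2',
     'author': {'self': 'B', 'displayName': 'B'},
     'created': '2018-12-18',
     'items': {'field': 'status', 'fromString': 'Funnel'}},
]

def to_pair(name):
    """Split 'outer.inner' into (outer, inner); pad flat names with ''."""
    return tuple(name.split('.')) if '.' in name else (name, '')

df = pd.json_normalize(lis)

# the tuples are built in the existing column order, so labels stay aligned with data
df.columns = pd.MultiIndex.from_tuples([to_pair(c) for c in df.columns])
print(df)
```

Note that this only handles one level of nesting; deeper paths such as a.b.c would produce tuples of unequal length.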
