I have a pandas DataFrame where one of the columns is in JSON format. It contains lists of movie production companies for a given title. Below is the sample structure:
ID | production_companies
---------------
1 | "[{'name': 'Paramount Pictures', 'id': 4}, {'name': 'United Artists', 'id': 60}, {'name': 'Metro-Goldwyn-Mayer (MGM)', 'id': 8411}]"
2 | "[{'name': 'Walt Disney Pictures', 'id': 2}]"
3 | "[{'name': 'Bold Films', 'id': 2266}, {'name': 'Blumhouse Productions', 'id': 3172}, {'name': 'Right of Way Films', 'id': 32157}]"
4 | nan
5 | nan
6 | nan
7 | "[{'name': 'Ghost House Pictures', 'id': 768}, {'name': 'North Box Productions', 'id': 22637}]"
As you can see, one movie (row) can have multiple production companies. For each movie I want to create separate columns containing the producers' names, like name_1, name_2, name_3, etc. If there is no second or third producer, the value should be NaN.
I don't have much experience working with JSON, and the methods I've tried so far (iterators with lambda functions) are not even close to what I need.
I hope you can help!
EDIT:
The following code (movies is the main DataFrame):
from pandas.io.json import json_normalize
companies = list(movies['production_companies'])
json_normalize(companies)
gives me the following error:
AttributeError: 'str' object has no attribute 'values'
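The error happens because production_companies holds strings rather than parsed objects: json_normalize expects dicts (or lists of dicts), and a plain str has no .values attribute. As a minimal sketch of the missing parsing step (the strings use single quotes, so json.loads would reject them; ast.literal_eval parses them as Python literals instead, and pd.json_normalize is the top-level function available since pandas 1.0):

import ast
import pandas as pd

# parse each string into a real list of dicts, skipping the NaN rows
companies = movies['production_companies'].dropna().apply(ast.literal_eval)

# each cell is now a list of dicts, which json_normalize can flatten
flat = pd.json_normalize(companies.explode().dropna().tolist())

The answers below build on exactly this idea.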
Adding on to @Andy's answer above to answer the OP's question.
This part was by @Andy:
import pandas as pd
import numpy as np
import ast
import itertools
# dummy data
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "production_companies": ["[{'name': 'Paramount Pictures', 'id': 4}, {'name': 'United Artists', 'id': 60}, {'name': 'Metro-Goldwyn-Mayer (MGM)', 'id': 8411}]", np.nan, "[{'name': 'Walt Disney Pictures', 'id': 2}]"]
})
# remove the nans
df.dropna(inplace=True)
# convert the strings into lists
df["production_companies"] = df["production_companies"].apply(lambda x: ast.literal_eval(x))
My additions to answer OP's requirements:
tmp_lst = []
for idx, item in df.groupby(by='ID'):
    # crediting this part to @Andy above
    tmp_df = pd.DataFrame(list(itertools.chain(*item["production_companies"].values.tolist()))).drop(columns='id')
    # transpose the dataframe
    tmp_df = tmp_df.T
    # add the movie ID back to tmp_df
    tmp_df['ID'] = item['ID'].values
    # accumulate tmp_df from all unique movie IDs
    tmp_lst.append(tmp_df)
pd.concat(tmp_lst, sort=False)
Result:
0 1 2 ID
name Paramount Pictures United Artists Metro-Goldwyn-Mayer (MGM) 1
name Walt Disney Pictures NaN NaN 3
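To get the exact name_1, name_2, ... headers the question asks for, the positional columns of the concatenated result can be relabelled. A small sketch, where wide is an assumed variable holding the concatenated result above:

wide = pd.concat(tmp_lst, sort=False).reset_index(drop=True)

# columns 0, 1, 2, ... become name_1, name_2, name_3, ...
wide = wide.rename(columns={i: f'name_{i + 1}' for i in range(wide.shape[1] - 1)})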
This should do it:
import pandas as pd
import numpy as np
import ast
import itertools
# dummy data
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "production_companies": ["[{'name': 'Paramount Pictures', 'id': 4}, {'name': 'United Artists', 'id': 60}, {'name': 'Metro-Goldwyn-Mayer (MGM)', 'id': 8411}]", np.nan, "[{'name': 'Walt Disney Pictures', 'id': 2}]"]
})
# remove the nans
df.dropna(inplace=True)
# convert the strings into lists
df["production_companies"] = df["production_companies"].apply(lambda x: ast.literal_eval(x))
# flatten the column of lists into a single list, and convert to DataFrame
pd.DataFrame(list(itertools.chain(*df["production_companies"].values.tolist())))
Which yields:
id name
0 4 Paramount Pictures
1 60 United Artists
2 8411 Metro-Goldwyn-Mayer (MGM)
3 2 Walt Disney Pictures
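If the movie ID needs to survive the flattening (the OP's case), one variant repeats each ID once per company before flattening; a sketch building on the frame above:

# repeat each movie's ID once per production company, then flatten
flat = pd.DataFrame(
    list(itertools.chain(*df["production_companies"])),
    index=df["ID"].repeat(df["production_companies"].str.len()),
)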
Related
I am currently struggling with extracting/flattening data from a hugely nested dictionary: Flattening a nested dictionary with unique keys for each dictionary?. I received a somewhat acceptable response, but now have problems applying that methodology to another dictionary. So far I have gotten to the point where I have the following DataFrame.
First I would concatenate the values of "this_should_be_columns" + '_' + "child_column_name" (not a problem).
What I want is for all the unique values in ("this_should_be_columns"_"child_column_name") to become headers, and the rows should hold their corresponding values (column "0").
Any ideas/solutions would be much appreciated!
FYI, my dictionary looks as follows:
{'7454':
    {'coach':
        {'wyId': 562711, 'shortName': 'Name1', 'firstName': 'N1', 'middleName': '', 'lastName': 'N2',
         'birthDate': None,
         'birthArea': {'id': 208, 'alpha2code': 'DK', 'alpha3code': 'DNK', 'name': 'Denmark'},
         'passportArea': {'id': 208, 'alpha2code': 'DK', 'alpha3code': 'DNK', 'name': 'Denmark'},
         'currentTeamId': 7454, 'gender': 'male', 'status': 'active'}},
 '7453':
    {'coach':
        {'wyId': 56245, 'shortName': 'Name2', 'firstName': 'N3', 'middleName': '', 'lastName': 'N4',
         'birthDate': 'yyyy-mm-dd',
         'birthArea': {'id': 208, 'alpha2code': 'DK', 'alpha3code': 'DNK', 'name': 'Denmark'},
         'passportArea': {'id': 208, 'alpha2code': 'DK', 'alpha3code': 'DNK', 'name': 'Denmark'},
         'currentTeamId': 7453, 'gender': 'male', 'status': 'active'}}}
The code looks as follows:
df_test = pd.DataFrame(
    pd.Series(responses)
    .apply(pd.Series).stack()
    .apply(pd.Series).stack()
    .apply(pd.Series).stack()
    .apply(pd.Series).stack()
    .apply(pd.Series).stack()
    .apply(pd.Series).stack()
    .apply(pd.Series)
    .reset_index()
    .rename(columns={'level_0': 'teamId', 'level_1': 'type',
                     'level_2': 'this_should_be_columns', 'level_3': 'child_column_name',
                     'level_4': 'firstname', 'level_5': 'middleName', 'level_6': 'ignore'})
)
del df_test['firstname']
del df_test['middleName']
del df_test['ignore']
print(df_test)
The problem is that your dictionaries have a different number of levels. 'birthArea' and 'passportArea' contain dictionaries while the other keys simply contain values. You can use pd.json_normalize() to flatten the keys of the innermost dictionary as described in Flatten nested dictionaries, compressing keys.
In [37]: pd.DataFrame(responses).stack().apply(lambda x: pd.json_normalize(x, sep='_').to_dict(orient='records')[0]).apply(pd.Series).stack().reset_index()
Out[37]:
level_0 level_1 level_2 0
0 coach 7454 wyId 562711
1 coach 7454 shortName Name1
2 coach 7454 firstName N1
3 coach 7454 middleName
4 coach 7454 lastName N2
.. ... ... ... ...
28 coach 7453 birthArea_name Denmark
29 coach 7453 passportArea_id 208
30 coach 7453 passportArea_alpha2code DK
31 coach 7453 passportArea_alpha3code DNK
32 coach 7453 passportArea_name Denmark
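Incidentally, if the goal is for the flattened keys to become column headers rather than rows (as the question asked), stopping one step earlier, before the final .stack(), already gives that shape. A sketch based on the answer above:

flat = pd.DataFrame(responses).stack().apply(
    lambda x: pd.json_normalize(x, sep='_').to_dict(orient='records')[0]
).apply(pd.Series)

# the index is (type, teamId); the columns are wyId, shortName, ..., passportArea_name
flat = flat.rename_axis(['type', 'teamId']).reset_index()
print(flat)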
Here is a dataset with an unlimited number of keys in each dictionary. The Detail column in a row may contain different product information depending on the customer.
ID Name Detail
1 Sara [{"Personal":{"ID":"001","Name":"Sara","Type":"01","TypeName":"Book"},"Order":[{"ID":"0001","Date":"20200222","ProductID":"C0123","ProductName":"ABC", "Price":"4"}]},{"Personal":{"ID":"001","Name":"Sara","Type":"02","TypeName":"Food"},"Order":[{"ID":"0004","Date":"20200222","ProductID":"D0123","ProductName":"Small beef", "Price":"15"}]},{"Personal":{"ID":"001","Name":"Sara","Type":"02","TypeName":"Food"},"Order":[{"ID":"0005","Date":"20200222","ProductID":"D0200","ProductName":"Shrimp", "Price":"28"}]}]
2 Frank [{"Personal":{"ID":"002","Name":"Frank","Type":"02","TypeName":"Food"},"Order":[{"ID":"0008","Date":"20200228","ProductID":"D0288","ProductName":"Salmon", "Price":"24"}]}]
My expected output is
ID Name Personal_ID Personal_Name Personal_Type Personal_TypeName Personal_Order_ID Personal_Order_Date Personal_Order_ProductID Personal_Order_ProductName Personal_Order_Price
1 Sara 001 Sara 01 Book 0001 20200222 C0123 ABC 4
2 Sara 001 Sara 02 Food 0004 20200222 D0123 Small beef 15
3 Sara 001 Sara 02 Food 0005 20200222 D0200 Shrimp 28
4 Frank 002 Frank 02 Food 0008 20200228 D0288 Salmon 24
So basically you have nested JSON in your Detail column that you need to break out into a DataFrame and then merge with your original.
import pandas as pd
import json
from pandas import json_normalize
# create an empty df to hold the detail information
detailDf = pd.DataFrame()

# we will need to loop over each row to read each JSON
for ind, row in df.iterrows():
    # read the JSON, make it a DF, then append the information to the empty DF
    detailDf = detailDf.append(json_normalize(json.loads(row['Detail']),
                                              record_path='Order',
                                              meta=[['Personal', 'ID'], ['Personal', 'Name'],
                                                    ['Personal', 'Type'], ['Personal', 'TypeName']]))
# Personally, you don't really need the rest of the code, as the columns
# Personal.Name and Personal.ID carry the same information, but nonetheless:
# you will have to merge on name and ID
df = df.merge(detailDf, how='right',
              left_on=[df['Name'], df['ID']],
              right_on=[detailDf['Personal.Name'], detailDf['Personal.ID'].astype(int)])

# clean up
df.rename(columns={'ID_x': 'ID', 'ID_y': 'Personal_Order_ID'}, inplace=True)
df.drop(columns={'Detail', 'key_1', 'key_0'}, inplace=True)
If you look through my comments, I recommend using detailDf as your final df, as the merge really isn't necessary and that information is already in the Detail JSON.
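One caveat: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the loop above fails on recent versions. A sketch of the same idea that collects the frames in a list and concatenates once at the end (which is also faster):

frames = []
for ind, row in df.iterrows():
    frames.append(json_normalize(json.loads(row['Detail']),
                                 record_path='Order',
                                 meta=[['Personal', 'ID'], ['Personal', 'Name'],
                                       ['Personal', 'Type'], ['Personal', 'TypeName']]))
detailDf = pd.concat(frames, ignore_index=True)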
First you need to create a function that processes the list of dicts in each row of the Detail column. Briefly, pandas can treat a list of dicts as a dataframe. So all I am doing here is processing the list of dicts under the Personal and Order keys of each row to get mapped dataframes, which can be merged for each entry. This function, when applied:
def processdicts(x):
    personal = pd.DataFrame.from_dict(list(pd.DataFrame.from_dict(x)['Personal']))
    personal = personal.rename(columns={"ID": "Personal_ID"})
    personal['Personal_Name'] = personal['Name']
    orders = pd.DataFrame(list(pd.DataFrame.from_dict(list(pd.DataFrame.from_dict(x)['Order']))[0]))
    orders = orders.rename(columns={"ID": "Order_ID"})
    personDf = orders.merge(personal, left_index=True, right_index=True)
    return personDf
Create an empty dataframe that will hold the compiled data:
outcome=pd.DataFrame(columns=[],index=[])
Now process the data for each row of the DataFrame using the function created above. A simple for loop is used here to show the process; apply could also be used for greater efficiency, with a slight modification of the concat step. With an empty dataframe at hand to concat each row's data into, the loop is as simple as the two lines below:

for details in yourdataframe['Detail']:
    outcome = pd.concat([outcome, processdicts(details)])
Finally reset index:
outcome=outcome.reset_index(drop=True)
You may rename columns according to your requirement in the final dataframe. For example:
outcome=outcome.rename(columns={"TypeName": "Personal_TypeName","ProductName":"Personal_Order_ProductName","ProductID":"Personal_Order_ProductID","Price":"Personal_Order_Price","Date":"Personal_Order_Date","Order_ID":"Personal_Order_ID","Type":"Personal_Type"})
Order (or skip) the columns according to your requirement using:
outcome=outcome[['Name','Personal_ID','Personal_Name','Personal_Type','Personal_TypeName','Personal_Order_ID','Personal_Order_Date','Personal_Order_ProductID','Personal_Order_ProductName','Personal_Order_Price']]
Assign a name to the index of the dataframe:
outcome.index.name='ID'
This should help.
You can use explode to get all elements of the lists in Detail separately, and then use Shubham Sharma's answer:
import io
import pandas as pd
# creating the dataframe (two-plus spaces between columns, matching the \s\s+ separator):
s_e = '''
ID  Name
1   Sara
2   Frank
'''
df = pd.read_csv(io.StringIO(s_e), sep=r'\s\s+', engine='python')
df['Detail']=[[{"Personal":{"ID":"001","Name":"Sara","Type":"01","TypeName":"Book"},"Order":[{"ID":"0001","Date":"20200222","ProductID":"C0123","ProductName":"ABC", "Price":"4"}]},{"Personal":{"ID":"001","Name":"Sara","Type":"02","TypeName":"Food"},"Order":[{"ID":"0004","Date":"20200222","ProductID":"D0123","ProductName":"Small beef", "Price":"15"}]},{"Personal":{"ID":"001","Name":"Sara","Type":"02","TypeName":"Food"},"Order":[{"ID":"0005","Date":"20200222","ProductID":"D0200","ProductName":"Shrimp", "Price":"28"}]}],[{"Personal":{"ID":"002","Name":"Frank","Type":"02","TypeName":"Food"},"Order":[{"ID":"0008","Date":"20200228","ProductID":"D0288","ProductName":"Salmon", "Price":"24"}]}]]
# using explode
df = df.explode('Detail').reset_index()
df['Detail'] = df['Detail'].apply(lambda x: [x])
print('using explode:', df)

# retrieved from @Shubham Sharma's answer:
personal = df['Detail'].str[0].str.get('Personal').apply(pd.Series).add_prefix('Personal_')
order = df['Detail'].str[0].str.get('Order').str[0].apply(pd.Series).add_prefix('Personal_Order_')
result = pd.concat([df[['ID', 'Name']], personal, order], axis=1)

# reset ID
result['ID'] = [i + 1 for i in range(len(result.index))]
print(result)
Output:
#Using explode:
index ID Name Detail
0 0 1 Sara [{'Personal': {'ID': '001', 'Name': 'Sara', 'Type': '01', 'TypeName': 'Book'}, 'Order': [{'ID': '0001', 'Date': '20200222', 'ProductID': 'C0123', 'ProductName': 'ABC', 'Price': '4'}]}]
1 0 1 Sara [{'Personal': {'ID': '001', 'Name': 'Sara', 'Type': '02', 'TypeName': 'Food'}, 'Order': [{'ID': '0004', 'Date': '20200222', 'ProductID': 'D0123', 'ProductName': 'Small beef', 'Price': '15'}]}]
2 0 1 Sara [{'Personal': {'ID': '001', 'Name': 'Sara', 'Type': '02', 'TypeName': 'Food'}, 'Order': [{'ID': '0005', 'Date': '20200222', 'ProductID': 'D0200', 'ProductName': 'Shrimp', 'Price': '28'}]}]
3 1 2 Frank [{'Personal': {'ID': '002', 'Name': 'Frank', 'Type': '02', 'TypeName': 'Food'}, 'Order': [{'ID': '0008', 'Date': '20200228', 'ProductID': 'D0288', 'ProductName': 'Salmon', 'Price': '24'}]}]
#result:
ID Name Personal_ID Personal_Name Personal_Type Personal_TypeName Personal_Order_ID Personal_Order_Date Personal_Order_ProductID Personal_Order_ProductName Personal_Order_Price
0 1 Sara 001 Sara 01 Book 0001 20200222 C0123 ABC 4
1 2 Sara 001 Sara 02 Food 0004 20200222 D0123 Small beef 15
2 3 Sara 001 Sara 02 Food 0005 20200222 D0200 Shrimp 28
3 4 Frank 002 Frank 02 Food 0008 20200228 D0288 Salmon 24
I have a column "data" which has JSON objects as values. I would like to split them up.
source = {
    '_id': ['SE-DATA-BB3A', 'SE-DATA-BB3E', 'SE-DATA-BB3F'],
    'pi': ['BB3A_CAP_BMLS', 'BB3E_CAP_BMLS', 'BB3F_CAP_PAS'],
    'Datetime': ['190725-122500', '190725-122500', '190725-122500'],
    'data': [
        {'bb3a_bmls': [{'name': 'WAG 01', 'id': '105F', 'state': 'available', 'nodes': 3,
                        'volumes': [{'state': 'available', 'id': '330172', 'name': 'q_-4144d4e'},
                                    {'state': 'available', 'id': '275192', 'name': 'p_3089d821ae'}]}]},
        {'bb3b_bmls': [{'name': 'FEC 01', 'id': '382E', 'state': 'available', 'nodes': 4,
                        'volumes': [{'state': 'unavailable', 'id': '830172', 'name': 'w_-4144d4e'},
                                    {'state': 'unavailable', 'id': '223192', 'name': 'g_3089d821ae'}]}]},
        {'bb3c_bmls': [{'name': 'ASD 01', 'id': '303F', 'state': 'available', 'nodes': 6,
                        'volumes': [{'state': 'unavailable', 'id': '930172', 'name': 'e_-4144d4e'},
                                    {'state': 'unavailable', 'id': '245192', 'name': 'h_3089d821ae'}]}]},
    ],
}
input_df = pd.DataFrame(source)
My input_df is as below:
I'm expecting the output_df as below:
I managed to get the columns volume_id, volume_name, volume_state, name, id, state and nodes using the method below:
input_df['data'] = input_df['data'].apply(pd.Series)
which results in the below:

Test_df = pd.concat(
    [json_normalize(input_df['bb3a_bmls'][key], 'volumes',
                    ['name', 'id', 'state', 'nodes'], record_prefix='volume_')
     for key in input_df.index
     if isinstance(input_df['bb3a_bmls'][key], list)]
).reset_index(drop=True)

This produces the result for one "SERVER" (bb3a_bmls).
Now I have no idea how to get the parent columns "_id", "pi" and "Datetime" back.
The idea is to loop over each nested list or dict and build a list of dictionaries to pass to the DataFrame constructor:
out = []
zipped = zip(source['_id'], source['pi'], source['Datetime'], source['data'])
for a, b, c, d in zipped:
    for k1, v1 in d.items():
        for e in v1:
            # get all values of the dict, excluding volumes
            di = {k2: v2 for k2, v2 in e.items() if k2 != 'volumes'}
            # for each dict in volumes, add the volume_ prefix to its keys
            for f in e['volumes']:
                di1 = {f'volume_{k3}': v3 for k3, v3 in f.items()}
                # build a dict from the parent values
                di2 = {'_id': a, 'pi': b, 'Datetime': c, 'SERVER': k1}
                # add the merged dictionaries to the list
                out.append({**di2, **di1, **di})

df = pd.DataFrame(out)
print(df)
_id pi Datetime SERVER volume_state \
0 SE-DATA-BB3A BB3A_CAP_BMLS 190725-122500 bb3a_bmls available
1 SE-DATA-BB3A BB3A_CAP_BMLS 190725-122500 bb3a_bmls available
2 SE-DATA-BB3E BB3E_CAP_BMLS 190725-122500 bb3b_bmls unavailable
3 SE-DATA-BB3E BB3E_CAP_BMLS 190725-122500 bb3b_bmls unavailable
4 SE-DATA-BB3F BB3F_CAP_PAS 190725-122500 bb3c_bmls unavailable
5 SE-DATA-BB3F BB3F_CAP_PAS 190725-122500 bb3c_bmls unavailable
volume_id volume_name name id state nodes
0 330172 q_-4144d4e WAG 01 105F available 3
1 275192 p_3089d821ae WAG 01 105F available 3
2 830172 w_-4144d4e FEC 01 382E available 4
3 223192 g_3089d821ae FEC 01 382E available 4
4 930172 e_-4144d4e ASD 01 303F available 6
5 245192 h_3089d821ae ASD 01 303F available 6
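A roughly equivalent sketch using pd.json_normalize, which handles the volumes/meta split itself; the outer loop is still needed because every data cell is keyed by a different server name:

frames = []
for _id, pi, dt, d in zip(source['_id'], source['pi'], source['Datetime'], source['data']):
    for server, records in d.items():
        t = pd.json_normalize(records, record_path='volumes',
                              meta=['name', 'id', 'state', 'nodes'],
                              record_prefix='volume_')
        t.insert(0, '_id', _id)
        t.insert(1, 'pi', pi)
        t.insert(2, 'Datetime', dt)
        t.insert(3, 'SERVER', server)
        frames.append(t)

df = pd.concat(frames, ignore_index=True)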
I have a dataframe with lists (of dicts) as column values. My intention is to normalize the entire column (all rows). I found a way to normalize a single row; however, I'm unable to apply the same function to the entire dataframe or column.
data = {'COLUMN': [
    [{'name': 'WAG 01', 'id': '105F', 'state': 'available', 'nodes': 3,
      'volumes': [{'state': 'available', 'id': '330172', 'name': 'q_-4144d4e'},
                  {'state': 'available', 'id': '275192', 'name': 'p_3089d821ae'}]}],
    [{'name': 'FEC 01', 'id': '382E', 'state': 'available', 'nodes': 4,
      'volumes': [{'state': 'unavailable', 'id': '830172', 'name': 'w_-4144d4e'},
                  {'state': 'unavailable', 'id': '223192', 'name': 'g_3089d821ae'}]}],
    [{'name': 'ASD 01', 'id': '303F', 'state': 'available', 'nodes': 6,
      'volumes': [{'state': 'unavailable', 'id': '930172', 'name': 'e_-4144d4e'},
                  {'state': 'unavailable', 'id': '245192', 'name': 'h_3089d821ae'}]}],
]}
source_df = pd.DataFrame(data)
source_df looks like below:
As per https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html I managed to get output for one row.
Code to apply for one row:
Target_df = json_normalize(source_df['COLUMN'][0], 'volumes',
                           ['name', 'id', 'state', 'nodes'], record_prefix='volume_')
Output for the above code:
I would like to know how we can achieve the desired output for the entire column.
Expected output:
EDIT:
@lostCode, below is the input with NaN and empty lists.
You can do:
Target_df = pd.concat(
    [json_normalize(source_df['COLUMN'][key], 'volumes',
                    ['name', 'id', 'state', 'nodes'], record_prefix='volume_')
     for key in source_df.index]
).reset_index(drop=True)
Output:
volume_state volume_id volume_name name id state nodes
0 available 330172 q_-4144d4e WAG 01 105F available 3
1 available 275192 p_3089d821ae WAG 01 105F available 3
2 unavailable 830172 w_-4144d4e FEC 01 382E available 4
3 unavailable 223192 g_3089d821ae FEC 01 382E available 4
4 unavailable 930172 e_-4144d4e ASD 01 303F available 6
5 unavailable 245192 h_3089d821ae ASD 01 303F available 6
concat is used to concatenate a list of dataframes; here, the list generated by applying json_normalize to every row of source_df is concatenated into a single frame.
You can use isinstance to check the type of each element of source_df, so rows holding NaN or empty values are skipped:

Target_df = pd.concat(
    [json_normalize(source_df['COLUMN'][key], 'volumes',
                    ['name', 'id', 'state', 'nodes'], record_prefix='volume_')
     for key in source_df.index
     if isinstance(source_df['COLUMN'][key], list)]
).reset_index(drop=True)

Note that a bare Target_df = source_df.apply(json_normalize) does not produce the desired result, since json_normalize is never told to expand the nested volumes via the record path.
I am trying to extract the name from the below dictionary:
df = df[[x.get('Name') for x in df['Contact']]]
Given below is how my DataFrame looks:
data = [{'emp_id': 101,
'name': {'Name': 'Kevin',
'attributes': {'type': 'Contact',
'url': '/services/data/v38.0/sobjects/Contact/00985300000bt4HEG4'}}},
{'emp_id': 102,
'name': {'Name': 'Scott',
'attributes': {'type': 'Contact',
'url': '/services/data/v38.0/sobjects/Contact/00985300000yr5UTR9'}}}]
df = pd.DataFrame(data)
df
emp_id name
0 101 {'Name': 'Kevin', 'attributes': {'type': 'Cont...
1 102 {'Name': 'Scott', 'attributes': {'type': 'Cont...
I get an error:
AttributeError: 'NoneType' object has no attribute 'get'
If there are no NaNs, use json_normalize.
pd.io.json.json_normalize(df.name.tolist())['Name']
0 Kevin
1 Scott
Name: Name, dtype: object
If there are NaNs, you will need to drop them first. However, it is easy to retain the indices.
df
emp_id name
0 101.0 {'Name': 'Kevin', 'attributes': {'type': 'Cont...
1 102.0 NaN
2 103.0 {'Name': 'Scott', 'attributes': {'type': 'Cont...
idx = df.index[df.name.notna()]
names = pd.io.json.json_normalize(df.name.dropna().tolist())['Name']
names.index = idx
names
0 Kevin
2 Scott
Name: Name, dtype: object
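On recent pandas versions json_normalize lives at the top level (pd.json_normalize; the pd.io.json path is deprecated). For a single-key lookup like this, the .str accessor is also a NaN-safe one-liner; a sketch:

# .str.get works element-wise on a Series of dicts and returns NaN for non-dict cells
names = df['name'].str.get('Name')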
Use apply, and use tolist to make it a list:
print(df['name'].apply(lambda x: x.get('Name')).tolist())
Output:
['Kevin', 'Scott']
If you don't need a list and want a Series, use:
print(df['name'].apply(lambda x: x.get('Name')))
Output:
0 Kevin
1 Scott
Name: name, dtype: object
Update: if Name is nested inside attributes, use:
print(df['name'].apply(lambda x: x['attributes'].get('Name')).tolist())
Try the following line:
names = [name.get('Name') for name in df['name']]
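The 'NoneType' object has no attribute 'get' error from the question appears when some cells hold None/NaN; a guarded sketch of the same comprehension:

# skip cells that are not dicts (None, NaN, ...) instead of calling .get on them
names = [d.get('Name') if isinstance(d, dict) else None for d in df['name']]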