Creating a Pandas Dataframe from List of Dictionaries of Dictionaries - python

I have a list of dictionaries, where each dictionary represents a record. It is formatted as follows:
>>> ListOfData=[
... {'Name':'Andrew',
...  'number':4,
...  'contactinfo':{'Phone':'555-5555', 'Address':'123 Main St'}},
... {'Name':'Ben',
...  'number':6,
...  'contactinfo':{'Phone':'555-5554', 'Address':'124 2nd St'}},
... {'Name':'Cathy',
...  'number':1,
...  'contactinfo':{'Phone':'555-5556', 'Address':'126 3rd St'}}]
>>>
>>> import pprint
>>> pprint.pprint(ListOfData)
[{'Name': 'Andrew',
  'contactinfo': {'Address': '123 Main St', 'Phone': '555-5555'},
  'number': 4},
 {'Name': 'Ben',
  'contactinfo': {'Address': '124 2nd St', 'Phone': '555-5554'},
  'number': 6},
 {'Name': 'Cathy',
  'contactinfo': {'Address': '126 3rd St', 'Phone': '555-5556'},
  'number': 1}]
>>>
What is the best way to read this into a pandas DataFrame with MultiIndex columns for the attributes in the sub-dictionaries?
For example, I'd ideally have 'Phone' and 'Address' columns nested under the 'contactinfo' column.
I can read in the data as follows, but would like the contactinfo column to be broken into sub-columns:
>>> pd.DataFrame.from_dict(ListOfData)
     Name                                        contactinfo  number
0  Andrew  {u'Phone': u'555-5555', u'Address': u'123 Main...       4
1     Ben  {u'Phone': u'555-5554', u'Address': u'124 2nd ...       6
2   Cathy  {u'Phone': u'555-5556', u'Address': u'126 3rd ...       1
>>>

How about this: declare an empty DataFrame
df = pd.DataFrame(columns=('Name', 'contactinfo', 'number'))
then iterate over the list and add rows:
for row in ListOfData:
    df.loc[len(df)] = row
Complete code:
import pandas as pd
ListOfData = [
    {'Name': 'Andrew',
     'number': 4,
     'contactinfo': {'Phone': '555-5555', 'Address': '123 Main St'}},
    {'Name': 'Ben',
     'number': 6,
     'contactinfo': {'Phone': '555-5554', 'Address': '124 2nd St'}}]
df = pd.DataFrame(columns=('Name', 'contactinfo', 'number'))
for row in ListOfData:
    df.loc[len(df)] = row
print(df)
This prints:
     Name                                       contactinfo number
0  Andrew  {'Phone': '555-5555', 'Address': '123 Main St'}      4
1     Ben   {'Phone': '555-5554', 'Address': '124 2nd St'}      6

Here is a pretty clunky workaround that gets me what I need. I loop through the columns, find those that are made of dicts, divide each into multiple columns, and join those back to the dataframe. I'd appreciate hearing any ways to improve this code; ideally the dataframe would be constructed from the get-go without having dictionaries as values.
>>> df = pd.DataFrame.from_dict(ListOfData)
>>>
>>> # list() so we don't mutate df while iterating; items() replaces
>>> # iteritems(), which was removed in pandas 2.0
>>> for name, col in list(df.items()):
...     if any(isinstance(x, dict) for x in col.tolist()):
...         DividedDict = col.apply(pd.Series)
...         DividedDict.columns = pd.MultiIndex.from_tuples([(name, x) for x in DividedDict.columns.tolist()])
...         df = df.join(DividedDict)
...         df.drop(columns=name, inplace=True)
...
>>> print(df)
     Name  number (contactinfo, Address) (contactinfo, Phone)
0  Andrew       4            123 Main St             555-5555
1     Ben       6             124 2nd St             555-5554
2   Cathy       1             126 3rd St             555-5556
>>>

I don't know about best, but you could do it in two steps:
>>> df = pd.DataFrame(ListOfData)
>>> df = df.join(pd.DataFrame.from_records(df.pop("contactinfo")))
>>> df
     Name  number      Address     Phone
0  Andrew       4  123 Main St  555-5555
1     Ben       6   124 2nd St  555-5554
2   Cathy       1   126 3rd St  555-5556
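Note that this flattens the nesting away entirely. If you do want the MultiIndex columns the question asks for, here is a minimal sketch over the same data; the flatten helper is just an illustration, not a pandas API:
import pandas as pd

def flatten(record):
    # Map each key to an (outer, inner) tuple; plain scalars get an empty inner level
    out = {}
    for key, value in record.items():
        if isinstance(value, dict):
            for subkey, subvalue in value.items():
                out[(key, subkey)] = subvalue
        else:
            out[(key, '')] = value
    return out

df = pd.DataFrame([flatten(r) for r in ListOfData])
df.columns = pd.MultiIndex.from_tuples(df.columns)
After this, df['contactinfo', 'Phone'] selects a nested column directly.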

Related

Several dictionaries with duplicate keys but different values, and no limit on columns

Here is a dataset with an unlimited number of keys in the dictionaries. The Detail column in each row may hold different product information depending on the customer.
ID Name Detail
1 Sara [{"Personal":{"ID":"001","Name":"Sara","Type":"01","TypeName":"Book"},"Order":[{"ID":"0001","Date":"20200222","ProductID":"C0123","ProductName":"ABC", "Price":"4"}]},{"Personal":{"ID":"001","Name":"Sara","Type":"02","TypeName":"Food"},"Order":[{"ID":"0004","Date":"20200222","ProductID":"D0123","ProductName":"Small beef", "Price":"15"}]},{"Personal":{"ID":"001","Name":"Sara","Type":"02","TypeName":"Food"},"Order":[{"ID":"0005","Date":"20200222","ProductID":"D0200","ProductName":"Shrimp", "Price":"28"}]}]
2 Frank [{"Personal":{"ID":"002","Name":"Frank","Type":"02","TypeName":"Food"},"Order":[{"ID":"0008","Date":"20200228","ProductID":"D0288","ProductName":"Salmon", "Price":"24"}]}]
My expected output is
ID Name Personal_ID Personal_Name Personal_Type Personal_TypeName Personal_Order_ID Personal_Order_Date Personal_Order_ProductID Personal_Order_ProductName Personal_Order_Price
1 Sara 001 Sara 01 Book 0001 20200222 C0123 ABC 4
2 Sara 001 Sara 02 Food 0004 20200222 D0123 Small beef 15
3 Sara 001 Sara 02 Food 0005 20200222 D0200 Shrimp 28
4 Frank 002 Frank 02 Food 0008 20200228 D0288 Salmon 24
So basically you have nested JSON in your Detail column that you need to break out into a df, then merge with your original.
import pandas as pd
import json
from pandas import json_normalize

# Create an empty df to hold the detail information
detailDf = pd.DataFrame()

# We will need to loop over each row to read each JSON
# (DataFrame.append was removed in pandas 2.0; see the concat sketch below)
for ind, row in df.iterrows():
    # Read the JSON, make it a DF, then append the information to the empty DF
    detailDf = detailDf.append(json_normalize(json.loads(row['Detail']),
                                              record_path='Order',
                                              meta=[['Personal', 'ID'], ['Personal', 'Name'],
                                                    ['Personal', 'Type'], ['Personal', 'TypeName']]))

# Personally, you don't really need the rest of the code, as the columns
# Personal.Name and Personal.ID carry the same information, but nonetheless:
# you will have to merge on name and ID
df = df.merge(detailDf, how='right',
              left_on=[df['Name'], df['ID']],
              right_on=[detailDf['Personal.Name'], detailDf['Personal.ID'].astype(int)])

# Clean up
df.rename(columns={'ID_x': 'ID', 'ID_y': 'Personal_Order_ID'}, inplace=True)
df.drop(columns={'Detail', 'key_1', 'key_0'}, inplace=True)
If you look through my comments, I recommend using detailDf as your final df, as the merge really isn't necessary and that information is already in the Detail JSON.
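On pandas 2.0+, DataFrame.append is gone, so here is a minimal sketch of the same loop that collects the per-row frames in a list and concatenates once at the end (which is also faster):
frames = []
for ind, row in df.iterrows():
    frames.append(json_normalize(json.loads(row['Detail']),
                                 record_path='Order',
                                 meta=[['Personal', 'ID'], ['Personal', 'Name'],
                                       ['Personal', 'Type'], ['Personal', 'TypeName']]))
detailDf = pd.concat(frames, ignore_index=True)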
First you need to create a function that processes the list of dicts in each row of the Detail column. Briefly, pandas can process a list of dicts as a dataframe, so all I am doing here is processing the list of dicts in each row's Personal and Order entries to get mapped dataframes which can be merged for each entry. The function:
def processdicts(x):
    personal = pd.DataFrame.from_dict(list(pd.DataFrame.from_dict(x)['Personal']))
    personal = personal.rename(columns={"ID": "Personal_ID"})
    personal['Personal_Name'] = personal['Name']
    orders = pd.DataFrame(list(pd.DataFrame.from_dict(list(pd.DataFrame.from_dict(x)['Order']))[0]))
    orders = orders.rename(columns={"ID": "Order_ID"})
    personDf = orders.merge(personal, left_index=True, right_index=True)
    return personDf
Create an empty dataframe that will contain the compiled data:
outcome = pd.DataFrame(columns=[], index=[])
Now process the data for each row of the DataFrame using the function created above. A simple for loop is used here to show the process; 'apply' could also be called for greater efficiency, with a slight modification of the concat step (see the sketch below). With an empty dataframe at hand to concat each row's data into, the for loop is as simple as the two lines below:
for details in yourdataframe['Detail']:
    outcome = pd.concat([outcome, processdicts(details)])
Finally, reset the index:
outcome = outcome.reset_index(drop=True)
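A sketch of the apply-based variant mentioned above: it builds one frame per row and concatenates once instead of growing outcome inside the loop; ignore_index also makes the reset_index step unnecessary:
frames = yourdataframe['Detail'].apply(processdicts)
outcome = pd.concat(frames.tolist(), ignore_index=True)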
You may rename columns according to your requirements in the final dataframe. For example:
outcome = outcome.rename(columns={"TypeName": "Personal_TypeName",
                                  "ProductName": "Personal_Order_ProductName",
                                  "ProductID": "Personal_Order_ProductID",
                                  "Price": "Personal_Order_Price",
                                  "Date": "Personal_Order_Date",
                                  "Order_ID": "Personal_Order_ID",
                                  "Type": "Personal_Type"})
Order (or skip) the columns according to your requirements using:
outcome = outcome[['Name', 'Personal_ID', 'Personal_Name', 'Personal_Type',
                   'Personal_TypeName', 'Personal_Order_ID', 'Personal_Order_Date',
                   'Personal_Order_ProductID', 'Personal_Order_ProductName',
                   'Personal_Order_Price']]
Assign a name to the index of the dataframe:
outcome.index.name = 'ID'
This should help.
You can use explode to get all the elements of the lists in Detail separately, and then use Shubham Sharma's answer:
import io
import pandas as pd
#Creating dataframe:
s_e='''
ID Name
1 Sara
2 Frank
'''
df = pd.read_csv(io.StringIO(s_e), sep=r'\s\s+', engine='python')
df['Detail']=[[{"Personal":{"ID":"001","Name":"Sara","Type":"01","TypeName":"Book"},"Order":[{"ID":"0001","Date":"20200222","ProductID":"C0123","ProductName":"ABC", "Price":"4"}]},{"Personal":{"ID":"001","Name":"Sara","Type":"02","TypeName":"Food"},"Order":[{"ID":"0004","Date":"20200222","ProductID":"D0123","ProductName":"Small beef", "Price":"15"}]},{"Personal":{"ID":"001","Name":"Sara","Type":"02","TypeName":"Food"},"Order":[{"ID":"0005","Date":"20200222","ProductID":"D0200","ProductName":"Shrimp", "Price":"28"}]}],[{"Personal":{"ID":"002","Name":"Frank","Type":"02","TypeName":"Food"},"Order":[{"ID":"0008","Date":"20200228","ProductID":"D0288","ProductName":"Salmon", "Price":"24"}]}]]
#using explode
df = df.explode('Detail').reset_index()
df['Detail']=df['Detail'].apply(lambda x: [x])
print('using explode:', df)
#retrieved from #Shubham Sharma's answer:
personal = df['Detail'].str[0].str.get('Personal').apply(pd.Series).add_prefix('Personal_')
order = df['Detail'].str[0].str.get('Order').str[0].apply(pd.Series).add_prefix('Personal_Order_')
result = pd.concat([df[['ID', "Name"]], personal, order], axis=1)
#reset ID
result['ID']=[i+1 for i in range(len(result.index))]
print(result)
Output:
#Using explode:
index ID Name Detail
0 0 1 Sara [{'Personal': {'ID': '001', 'Name': 'Sara', 'Type': '01', 'TypeName': 'Book'}, 'Order': [{'ID': '0001', 'Date': '20200222', 'ProductID': 'C0123', 'ProductName': 'ABC', 'Price': '4'}]}]
1 0 1 Sara [{'Personal': {'ID': '001', 'Name': 'Sara', 'Type': '02', 'TypeName': 'Food'}, 'Order': [{'ID': '0004', 'Date': '20200222', 'ProductID': 'D0123', 'ProductName': 'Small beef', 'Price': '15'}]}]
2 0 1 Sara [{'Personal': {'ID': '001', 'Name': 'Sara', 'Type': '02', 'TypeName': 'Food'}, 'Order': [{'ID': '0005', 'Date': '20200222', 'ProductID': 'D0200', 'ProductName': 'Shrimp', 'Price': '28'}]}]
3 1 2 Frank [{'Personal': {'ID': '002', 'Name': 'Frank', 'Type': '02', 'TypeName': 'Food'}, 'Order': [{'ID': '0008', 'Date': '20200228', 'ProductID': 'D0288', 'ProductName': 'Salmon', 'Price': '24'}]}]
#result:
ID Name Personal_ID Personal_Name Personal_Type Personal_TypeName Personal_Order_ID Personal_Order_Date Personal_Order_ProductID Personal_Order_ProductName Personal_Order_Price
0 1 Sara 001 Sara 01 Book 0001 20200222 C0123 ABC 4
1 2 Sara 001 Sara 02 Food 0004 20200222 D0123 Small beef 15
2 3 Sara 001 Sara 02 Food 0005 20200222 D0200 Shrimp 28
3 4 Frank 002 Frank 02 Food 0008 20200228 D0288 Salmon 24

Pandas get column contains character in all rows

I want to get a list of the dataframe columns in which every row contains exactly 2 spaces.
Input:
import pandas as pd
import numpy as np
pd.options.display.max_columns = None
pd.options.display.max_rows = None
pd.options.display.expand_frame_repr = False
df = pd.DataFrame({'id': [101, 102, 103],
                   'full_name': ['John Brown', 'Bob Smith', 'Michael Smith'],
                   'comment_1': ['one two', 'qw er ty', 'one space'],
                   'comment_2': ['ab xfd xsxws', 'dsd sdd dwde', 'wdwd ofjpoej oihoe'],
                   'comment_3': ['ckdf cenfw cd', 'cewfwf wefep lwcpem', np.nan],
                   'birth_year': [1960, 1970, 1970]})
print(df)
Output:
    id      full_name  comment_1           comment_2            comment_3  birth_year
0  101     John Brown    one two        ab xfd xsxws        ckdf cenfw cd        1960
1  102      Bob Smith   qw er ty        dsd sdd dwde  cewfwf wefep lwcpem        1970
2  103  Michael Smith  one space  wdwd ofjpoej oihoe                  NaN        1970
Expected Output:
['comment_2', 'comment_3']
You can use series.str.count() to count the appearances of a substring or pattern in a string, .all() to check whether all items meet the criterion, and iterate over df.columns, keeping only the string columns via select_dtypes('object'):
[i for i in df.select_dtypes('object').columns if (df[i].dropna().str.count(' ')==2).all()]
['comment_2', 'comment_3']
Try:
res = []
for col in df.columns:
    if df[col].dtype == object:
        # strip everything except whitespace, then count what's left;
        # NaNs are filled with two spaces so they also qualify
        dftemp = df[col].fillna("  ").str.replace(r"[^\s]", "", regex=True).str.len()
        dftemp = dftemp.eq(2).all()
        if dftemp:
            res.append(col)
print(res)
Outputs:
['comment_2', 'comment_3']
It runs through all columns which might be strings (object dtype), removes all the non-space characters from these columns, then just counts the characters. If all rows have exactly 2 characters left, it adds the column name to the res array.
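For comparison, a more compact vectorized sketch of the same check (assuming the same df as above): count the spaces per cell and keep the object columns where every non-null cell has exactly two:
obj = df.select_dtypes('object')
mask = obj.apply(lambda s: s.dropna().str.count(' ').eq(2).all())
print(obj.columns[mask].tolist())  # ['comment_2', 'comment_3']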

How can I return a dataframe after dropping rows which do not contain a substring

I have a large dataset and would like to filter it to show only the rows which contain a particular substring (in the following example, 'George'). (Also, bonus points if you tell me how to pass multiple substrings.)
For example, if I start with the code
import pandas as pd

data = {
    'Employee': ['George Brazil', 'Tim Swak', 'Rajesh Mahta', 'Kristy Karns',
                 'Jamie Hyneman', 'Pamela Voss', 'Tyrone Johnson', 'Anton Lafreu'],
    'Director': ['Omay Wanja', 'Omay Wanja', 'George Stafford', 'Omay Wanja',
                 'George Stafford', 'Kristy Karns', 'Carissa Karns', 'Christie Karns'],
    'Supervisor': ['George Foreman', 'Mary Christiemas', 'Omay Wanja', 'CEO PERSON',
                   'CEO PERSON', 'CEO PERSON', 'CEO PERSON', 'George of the jungle'],
    'A series of ints to make this more complex': [1, 0, 1, 4, 1, 3, 3, 7]}
df = pd.DataFrame(data, index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
df
         Employee         Director            Supervisor  A series of ints to make this more complex
a   George Brazil       Omay Wanja        George Foreman                                            1
b        Tim Swak       Omay Wanja      Mary Christiemas                                            0
c    Rajesh Mahta  George Stafford            Omay Wanja                                            1
d    Kristy Karns       Omay Wanja            CEO PERSON                                            4
e   Jamie Hyneman  George Stafford            CEO PERSON                                            1
f     Pamela Voss     Kristy Karns            CEO PERSON                                            3
g  Tyrone Johnson    Carissa Karns            CEO PERSON                                            3
h    Anton Lafreu   Christie Karns  George of the jungle                                            7
I would like to then perform an operation such that it returns the dataframe but with only rows a, c, e, and h, because they are the only rows which contain the substring 'George'
Try this
filters = 'George'
df[df.apply(lambda row: row.astype(str).str.contains(filters).any(), axis=1)]
(Edited to return the subset.)
You can chain a separate `|` (or) condition for each column. There's probably a more elegant way to get it to work, but this will do.
df[df['Employee'].str.contains("George") | df['Director'].str.contains("George") | df['Supervisor'].str.contains("George")]
From your code, it seems you only want the rows that have 'George' in columns ['Employee', 'Director', 'Supervisor']. If so, try this:
# Lambda solution for first `n` columns
mask = df.iloc[:, 0:3].apply(lambda x: x.str.contains('George')).sum(axis=1) > 0
df[mask]
# Lambda solution with named columns
mask = df[['Employee','Director','Supervisor']].apply(lambda x: x.str.contains('George')).sum(axis=1) > 0
df[mask]
# Trivial solution
df[(df['Employee'].str.contains('George')) | (df['Director'].str.contains('George')) | (df['Supervisor'].str.contains('George'))]
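For the bonus question (multiple substrings), str.contains accepts a regex pattern, so you can join the substrings into an alternation. A sketch, where the names list is just illustrative:
names = ['George', 'Omay']  # illustrative substrings
pattern = '|'.join(names)
mask = df[['Employee', 'Director', 'Supervisor']].apply(lambda col: col.str.contains(pattern)).any(axis=1)
df[mask]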

split dictionary in pandas into separate columns [duplicate]

This question already has answers here:
Split / Explode a column of dictionaries into separate columns with pandas
(13 answers)
Closed 3 years ago.
I have a Series object containing 3 fields (name, code, value), which I get from the below function:
def get_fuzz(df, w):
    s = df['Legal_Name'].apply(lambda y: fuzz.token_set_ratio(y, w))
    idx = s.idxmax()
    return {'name': df['Legal_Name'].iloc[idx], 'lei': df['LEI'].iloc[idx], 'val': s.max()}

df1['Name'].apply(lambda x: get_fuzz(df, x))
The Series looks like this:
output
0 {'name': 'MGR Farms LLC', 'lei': '984500486BBD...
1 {'name': 'RAVENOL NORGE AS', 'lei': '549300D2O...
2 {'name': 'VCC Live Group Zártkörűen Működő Rés...
I can assign the output to my dataframe with the code below.
df1.assign(search=df1['Name'].apply(lambda x: get_fuzz(df, x)))
The Dataframe that I get looks like this
ID Name search
0 1 Marshalll {'name': 'MGR Farms LLC', 'lei': '984500486BBD...
1 2 JP Morgan {'name': 'RAVENOL NORGE AS', 'lei': '549300D2O...
Question
How can I split this column into 3 columns?
Final output wanted
ID Name Name_bis LEI Value
0 1 Marshalll MGR Farms LLC 984500486BBD 57
1 2 Zion ZION INVESTMENT 549300D2O 100
Assuming you have the dataframe set up as:-
>>> df
ID Name search
0 1 Marshalll {'name': 'MGR Farms LLC', 'lei': '984500486BBD...
1 2 JP Morgan {'name': 'RAVENOL NORGE AS', 'lei': '549300D2O...
you can use:-
>>> df = pd.concat([df.drop(['search'], axis=1), df['search'].apply(pd.Series)], axis=1)
>>> df
ID Name name lei value
0 1 Marshalll MGR Farms LLC 984500486BBD 57
1 2 JP Morgan RAVENOL NORGE AS 549300D2O 100
And then update the column names as needed:-
>>> df.columns = ['ID', 'Name', 'Name_bis', 'LEI', 'Value']
>>> df
ID Name Name_bis LEI Value
0 1 Marshalll MGR Farms LLC 984500486BBD 57
1 2 JP Morgan RAVENOL NORGE AS 549300D2O 100
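As a side note, here is a sketch of an alternative that is usually faster than .apply(pd.Series), starting again from the frame that still has the search column; it builds the expansion directly from the list of dicts and renames in one go:
expanded = pd.DataFrame(df.pop('search').tolist(), index=df.index)
df = df.join(expanded.rename(columns={'name': 'Name_bis', 'lei': 'LEI', 'val': 'Value'}))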

Python - Check if column contains value in list, return value

I have a df:
d = {'id': [1,2,3,4,5,6,7,8,9,10],
     'text': ['bill did this', 'jim did something', 'phil', 'bill did nothing',
              'carl was here', 'this is random', 'other name',
              'other bill', 'bill and carl', 'last one']}
df = pd.DataFrame(data=d)
And I would like to check if a column contains a value in a list, where the list is:
list = ['bill','carl']
I'd then like to return something like this:
id text contains
1 bill did this bill
2 jim did something
3 phil
4 bill did nothing bill
5 carl was here carl
6 this is random
7 other name
8 other bill bill
9 bill and carl bill
9 bill and carl carl
10 last one
Although the way to handle 2 or more names in the same row is open to change.
Any suggestions?
You can create a lambda function to check for every item in your list:
import pandas as pd

d = {'id': [1,2,3,4,5,6,7,8,9,10],
     'text': ['bill did this', 'jim did something', 'phil', 'bill did nothing',
              'carl was here', 'this is random', 'other name',
              'other bill', 'bill and carl', 'last one']}
df = pd.DataFrame(data=d)
l = ['bill', 'carl']
df['contains'] = df['text'].apply(lambda x: ','.join([i for i in l if i in x]))
You can remove the join if you want a list; otherwise it will just concatenate the values separated by a comma.
Output
>>> df['contains']
0 bill
1
2
3 bill
4 carl
5
6
7 bill
8 bill,carl
9
Name: contains, dtype: object
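If you would rather have one row per match, as in the expected output in the question, here is a sketch using explode; the or [''] keeps the rows with no match as an empty string:
df['contains'] = df['text'].apply(lambda x: [i for i in l if i in x] or [''])
df = df.explode('contains')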
