I have a Series whose values are dictionaries with three keys (name, lei, val), which I get from the function below:
def get_fuzz(df, w):
    s = df['Legal_Name'].apply(lambda y: fuzz.token_set_ratio(y, w))
    idx = s.idxmax()
    return {'name': df['Legal_Name'].iloc[idx], 'lei': df['LEI'].iloc[idx], 'val': s.max()}
df1['Name'].apply(lambda x: get_fuzz(df, x))
The Series looks like this:
output
0 {'name': 'MGR Farms LLC', 'lei': '984500486BBD...
1 {'name': 'RAVENOL NORGE AS', 'lei': '549300D2O...
2 {'name': 'VCC Live Group Zártkörűen Működő Rés...
I can assign the output to my dataframe with the code below.
df1.assign(search=df1['Name'].apply(lambda x: get_fuzz(df, x)))
The Dataframe that I get looks like this
ID Name search
0 1 Marshalll {'name': 'MGR Farms LLC', 'lei': '984500486BBD...
1 2 JP Morgan {'name': 'RAVENOL NORGE AS', 'lei': '549300D2O...
Question
How can I split this column into 3 columns?
Final output wanted
ID Name Name_bis LEI Value
0 1 Marshalll MGR Farms LLC 984500486BBD 57
1 2 Zion ZION INVESTMENT 549300D2O 100
Assuming you have the dataframe set up as:-
>>> df
ID Name search
0 1 Marshalll {'name': 'MGR Farms LLC', 'lei': '984500486BBD...
1 2 JP Morgan {'name': 'RAVENOL NORGE AS', 'lei': '549300D2O...
you can use:-
>>> df = pd.concat([df.drop(['search'], axis=1), df['search'].apply(pd.Series)], axis=1)
>>> df
ID Name name lei val
0 1 Marshalll MGR Farms LLC 984500486BBD 57
1 2 JP Morgan RAVENOL NORGE AS 549300D2O 100
And then update the column names as needed:-
>>> df.columns = ['ID', 'Name', 'Name_bis', 'LEI', 'Value']
>>> df
ID Name Name_bis LEI Value
0 1 Marshalll MGR Farms LLC 984500486BBD 57
1 2 JP Morgan RAVENOL NORGE AS 549300D2O 100
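A possible variation, sketched on a small hand-built frame: build the new columns straight from the list of dictionaries instead of calling apply(pd.Series), which tends to be slow on large frames. The sample values are copied from the truncated output shown above and only stand in for the real data; the rename mapping is an assumption about the column names you want.

```python
import pandas as pd

# Hand-built sample mirroring the frame above (LEI values are the truncated ones shown)
df = pd.DataFrame({
    "ID": [1, 2],
    "Name": ["Marshalll", "JP Morgan"],
    "search": [
        {"name": "MGR Farms LLC", "lei": "984500486BBD", "val": 57},
        {"name": "RAVENOL NORGE AS", "lei": "549300D2O", "val": 100},
    ],
})

# One DataFrame constructor call over the list of dicts, aligned on the original index
expanded = pd.DataFrame(df["search"].tolist(), index=df.index)

# Drop the dict column, attach the expanded columns, and rename by key
out = (
    df.drop(columns="search")
      .join(expanded)
      .rename(columns={"name": "Name_bis", "lei": "LEI", "val": "Value"})
)
print(out)
```

Renaming with a mapping rather than assigning df.columns keeps the result correct even if the dictionary keys come back in a different order.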
Related
I need help with pandas to group rows based on a specific condition. I have a dataset as follows:
Name Source Country Severity
ABC XYZ USA Low
DEF XYZ England High
ABC XYZ India Medium
EFG XYZ Algeria High
DEF XYZ UK Medium
I want to group these rows by the Name field so that the Country values of the grouped rows are joined into one string and Severity is set to the highest priority in the group.
After that, the output table should look like this:
Name Source Country Severity
ABC XYZ USA, India Medium
DEF XYZ England, UK High
EFG XYZ Algeria High
I am able to aggregate the first three columns using the code below, but I can't find a way to merge Severity.
df = df.groupby('Name').agg({'Source':'first', 'Country': ', '.join })
You should convert your Severity to an ordered Categorical.
This enables you to use a simple max:
df['Severity'] = pd.Categorical(df['Severity'],
                                categories=['Low', 'Medium', 'High'],
                                ordered=True)
out = (df
       .groupby('Name')
       .agg({'Source':'first',
             'Country': ', '.join,
             'Severity': 'max'})
      )
output:
Source Country Severity
Name
ABC XYZ USA, India Medium
DEF XYZ England, UK High
EFG XYZ Algeria High
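To see why the ordered Categorical matters, here is a small standalone sketch (the three sample values are made up for illustration, not taken from the frame above). Plain strings compare alphabetically, while an ordered Categorical compares by the position of each value in the category list.

```python
import pandas as pd

levels = ['Low', 'Medium', 'High']
raw = pd.Series(['Low', 'High', 'Medium'])

# Plain strings compare alphabetically, so 'Medium' wins over 'High'
alphabetical_max = raw.max()

# An ordered Categorical ranks values by their position in `levels`
sev = pd.Series(pd.Categorical(raw, categories=levels, ordered=True))
severity_max = sev.max()

# Comparisons against a category also follow that order
above_low = (sev > 'Low').tolist()

print(alphabetical_max, severity_max, above_low)
```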
You can try converting the Severity column to a number, then aggregating Severity-number based on max, and then converting the Severity-number column back to a word like so:
import pandas
dataframe = pandas.DataFrame({'Name': ['ABC', 'DEF', 'ABC', 'EFG', 'DEF'], 'Source': ['XYZ', 'XYZ', 'XYZ', 'XYZ', 'XYZ'], 'Country': ['USA', 'England', 'India', 'Algeria', 'UK'], 'Severity': ['Low', 'High', 'Medium', 'High', 'Medium']})
severity_to_number = {'Low': 1, 'Medium': 2, 'High': 3}
severity_to_word = {v: k for k, v in severity_to_number.items()}
dataframe['Severity-number'] = dataframe['Severity'].replace(severity_to_number)
dataframe = dataframe.groupby('Name').agg({'Source':'first', 'Country': ', '.join, 'Severity-number':'max' })
dataframe['Severity'] = dataframe['Severity-number'].replace(severity_to_word)
del dataframe['Severity-number']
print(dataframe)
Output
Source Country Severity
Name
ABC XYZ USA, India Medium
DEF XYZ England, UK High
EFG XYZ Algeria High
I would like to rename 'multi level columns' of a pandas dataframe to 'single level columns'. My code so far does not give any errors but does not rename either. Any suggestions for code improvements?
import pandas as pd
url = 'https://en.wikipedia.org/wiki/Gross_national_income'
df = pd.read_html(url)[3][[('Country', 'Country'), ('GDP[10]', 'GDP[10]')]]\
.rename(columns={('Country', 'Country'):'Country', ('GDP[10]', 'GDP[10]'): 'GDP'})
df
I prefer to use the rename method. df.columns = ['Country', 'GDP'] works but is not what I am looking for.
To do this with rename, first flatten the MultiIndex column labels with join, then build the rename dictionary by zipping the flattened names with the new column names:
url = 'https://en.wikipedia.org/wiki/Gross_national_income'
df = pd.read_html(url)[3]
df.columns = df.columns.map('_'.join)
old = ['No._No.', 'Country_Country', 'GNI (Atlas method)[8]_value (a)',
'GNI (Atlas method)[8]_a - GDP', 'GNI[9]_value (b)', 'GNI[9]_b - GDP',
'GDP[10]_GDP[10]']
new = ['No.','Country','GNI a','GDP a','GNI b', 'GNI b', 'GDP']
df = df.rename(columns=dict(zip(old, new)))
If you prefer to write the rename dictionary out explicitly:
d = {'No._No.': 'No.', 'Country_Country': 'Country', 'GNI (Atlas method)[8]_value (a)': 'GNI a', 'GNI (Atlas method)[8]_a - GDP': 'GDP a', 'GNI[9]_value (b)': 'GNI b', 'GNI[9]_b - GDP': 'GNI b', 'GDP[10]_GDP[10]': 'GDP'}
df = df.rename(columns=d)
print (df)
No. Country GNI a GDP a GNI b GNI b GDP
0 1 United States 20636317 91974 20837347 293004 20544343
1 2 China 13181372 -426779 13556853 -51298 13608151
2 3 Japan 5226599 255276 5155423 184100 4971323
3 4 Germany 3905321 -42299 4058030 110410 3947620
4 5 United Kingdom 2777405 -77891 2816805 -38491 2855296
5 6 France 2752034 -25501 2840071 62536 2777535
6 7 India 2727893 9161 2691040 -27692 2718732
7 8 Italy 2038376 -45488 2106525 22661 2083864
8 9 Brazil 1902286 16804 1832170 -53312 1885482
9 10 Canada 1665565 -47776 1694054 -19287 1713341
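The same flatten-then-rename idea can be tried without downloading anything. This is a minimal sketch on a two-column frame built by hand (the values are taken from the output above; the frame itself is an assumption standing in for the Wikipedia table).

```python
import pandas as pd

# Hand-built stand-in for the two-level columns returned by read_html
cols = pd.MultiIndex.from_tuples([('Country', 'Country'), ('GDP[10]', 'GDP[10]')])
df = pd.DataFrame([['United States', 20544343], ['China', 13608151]], columns=cols)

flat = df.copy()
# Join each (level0, level1) tuple into one string label
flat.columns = flat.columns.map('_'.join)

# Now the labels are plain strings, so rename can match them
flat = flat.rename(columns={'Country_Country': 'Country',
                            'GDP[10]_GDP[10]': 'GDP'})
print(flat)
```

Once the labels are single strings, rename behaves as it does on any ordinary frame.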
As an alternative to rename, you can use get_level_values():
df.columns = df.columns.get_level_values(0)
>>> print(df)
Country GDP[10]
0 United States 20544343
1 China 13608151
2 Japan 4971323
3 Germany 3947620
4 United Kingdom 2855296
5 France 2777535
6 India 2718732
7 Italy 2083864
8 Brazil 1885482
9 Canada 1713341
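If you would still like rename in the chain, one option (a sketch on a hand-built frame, not the live Wikipedia table) is to drop the second level first with droplevel and then rename the remaining single-level labels.

```python
import pandas as pd

# Hand-built stand-in for the selected two-level columns
cols = pd.MultiIndex.from_tuples([('Country', 'Country'), ('GDP[10]', 'GDP[10]')])
df = pd.DataFrame([['United States', 20544343], ['China', 13608151]], columns=cols)

# Remove level 1 of the column index, then rename what is left
single = df.droplevel(1, axis=1).rename(columns={'GDP[10]': 'GDP'})
print(single)
```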
I have a Yfinance dictionary like this:
{'zip': '94404',
'sector': 'Healthcare',
'fullTimeEmployees': 11800,
'circulatingSupply': None,
'startDate': None,
'regularMarketDayLow': 67.99,
'priceHint': 2,
'currency': 'USD'}
I want to convert it into a DataFrame, but the output has no row data:
Jupyter Notebook view:
Data = {'zip': '94404', 'sector': 'Healthcare', 'fullTimeEmployees': 11800, 'circulatingSupply': None, 'startDate': None, 'regularMarketDayLow': 67.99, 'priceHint': 2, 'currency': 'USD'}
Wrap the dictionary variable in square brackets, so that it becomes a one-element list and the dictionary is read as a single row:
df = pd.DataFrame([Data])
df
Result:
zip sector fullTimeEmployees circulatingSupply startDate regularMarketDayLow priceHint currency
0 94404 Healthcare 11800 None None 67.99 2 USD
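For comparison, a short sketch of two ways to get the same single-row frame from a dictionary of scalars, using the dictionary above. The variable names are only illustrative.

```python
import pandas as pd

data = {'zip': '94404', 'sector': 'Healthcare', 'fullTimeEmployees': 11800,
        'circulatingSupply': None, 'startDate': None,
        'regularMarketDayLow': 67.99, 'priceHint': 2, 'currency': 'USD'}

# A one-element list of dicts: each dict becomes one row
wide = pd.DataFrame([data])

# Alternative: pass the dict directly and supply an explicit one-row index
wide_alt = pd.DataFrame(data, index=[0])

print(wide.shape, wide_alt.shape)
```

Both give one row with one column per key; the list form extends naturally when you later have several such dictionaries.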
import pandas as pd
import yfinance as yf
# Data returned by yfinance
stock = yf.Ticker('AAPL')
# Store stock info as dictionary
stock_dict = stock.info
# Create DataFrame from non-compatible dictionary
stock_df = pd.DataFrame(list(stock_dict.items()))
0 1
0 zip 95014
1 sector Technology
2 fullTimeEmployees 137000
3 longBusinessSummary Apple Inc. designs, manufactures, and markets ...
4 city Cupertino
5 phone 408-996-1010
6 country United States
7 companyOfficers []
8 website http://www.apple.com
9 maxAge 1
10 address1 One Apple Park Way
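The long layout above can be made easier to query by naming the two columns. A minimal sketch, where the small dictionary is a stand-in for stock.info (so no network call is needed) and the column names 'field' and 'value' are my own choice:

```python
import pandas as pd

# Stand-in for stock.info, using three of the entries shown above
stock_dict = {'zip': '95014', 'sector': 'Technology', 'fullTimeEmployees': 137000}

# One row per key/value pair, with explicit column names
stock_df = pd.DataFrame(list(stock_dict.items()), columns=['field', 'value'])

# Index by field name to look values up directly
lookup = stock_df.set_index('field')['value']
print(lookup['sector'])
```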
Here is a dataset whose dictionaries can hold any number of keys. The Detail column of each row may contain different product information depending on the customer.
ID Name Detail
1 Sara [{"Personal":{"ID":"001","Name":"Sara","Type":"01","TypeName":"Book"},"Order":[{"ID":"0001","Date":"20200222","ProductID":"C0123","ProductName":"ABC", "Price":"4"}]},{"Personal":{"ID":"001","Name":"Sara","Type":"02","TypeName":"Food"},"Order":[{"ID":"0004","Date":"20200222","ProductID":"D0123","ProductName":"Small beef", "Price":"15"}]},{"Personal":{"ID":"001","Name":"Sara","Type":"02","TypeName":"Food"},"Order":[{"ID":"0005","Date":"20200222","ProductID":"D0200","ProductName":"Shrimp", "Price":"28"}]}]
2 Frank [{"Personal":{"ID":"002","Name":"Frank","Type":"02","TypeName":"Food"},"Order":[{"ID":"0008","Date":"20200228","ProductID":"D0288","ProductName":"Salmon", "Price":"24"}]}]
My expected output is
ID Name Personal_ID Personal_Name Personal_Type Personal_TypeName Personal_Order_ID Personal_Order_Date Personal_Order_ProductID Personal_Order_ProductName Personal_Order_Price
1 Sara 001 Sara 01 Book 0001 20200222 C0123 ABC 4
2 Sara 001 Sara 02 Food 0004 20200222 D0123 Small beef 15
3 Sara 001 Sara 02 Food 0005 20200222 D0200 Shrimp 28
4 Frank 002 Frank 02 Food 0008 20200228 D0288 Salmon 24
So basically you have a nested JSON in your detail column that you need to break out into a df then merge with your original.
import pandas as pd
import json
from pandas import json_normalize
#collect one small DataFrame per row, then combine them
frames = []
#We will need to loop over each row to read each JSON
for ind, row in df.iterrows():
    #Read the json and flatten it into a DF
    frames.append(json_normalize(json.loads(row['Detail']), record_path = 'Order', meta = [['Personal','ID'], ['Personal','Name'], ['Personal','Type'],['Personal','TypeName']]))
#DataFrame.append was removed in pandas 2.0, so combine the pieces with pd.concat
detailDf = pd.concat(frames, ignore_index = True)
# Strictly speaking, you don't need the rest of the code, because the columns
# Personal.Name and Personal.ID already hold the same information, but nonetheless:
# you will have to merge on name and ID
df = df.merge(detailDf, how = 'right', left_on = [df['Name'], df['ID']], right_on = [detailDf['Personal.Name'], detailDf['Personal.ID'].astype(int)])
#Clean up
df.rename(columns = {'ID_x':'ID', 'ID_y':'Personal_Order_ID'}, inplace = True)
df.drop(columns = {'Detail', 'key_1', 'key_0'}, inplace = True)
As noted in the code comments, I recommend using detailDf as your final df, since the merge really isn't necessary and that information is already in the Detail JSON.
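A loop-free variation on the same idea, as a sketch: parse every row, pool the entries into one list, and call json_normalize once. The frame below is a trimmed stand-in for the question's data (Sara with two entries, Frank with one), and the record_prefix and sep values are my choice so that the column names line up with the expected output.

```python
import json
import pandas as pd

# Trimmed stand-in for the question's frame; Detail holds JSON strings
df = pd.DataFrame({
    'ID': [1, 2],
    'Name': ['Sara', 'Frank'],
    'Detail': [
        '[{"Personal":{"ID":"001","Name":"Sara","Type":"01","TypeName":"Book"},'
        '"Order":[{"ID":"0001","Date":"20200222","ProductID":"C0123","ProductName":"ABC","Price":"4"}]},'
        '{"Personal":{"ID":"001","Name":"Sara","Type":"02","TypeName":"Food"},'
        '"Order":[{"ID":"0005","Date":"20200222","ProductID":"D0200","ProductName":"Shrimp","Price":"28"}]}]',
        '[{"Personal":{"ID":"002","Name":"Frank","Type":"02","TypeName":"Food"},'
        '"Order":[{"ID":"0008","Date":"20200228","ProductID":"D0288","ProductName":"Salmon","Price":"24"}]}]',
    ],
})

# Parse each row and pool all entries into one flat list of dicts
entries = [entry for detail in df['Detail'] for entry in json.loads(detail)]

# One row per order; the Personal fields are carried along as metadata
flat = pd.json_normalize(
    entries,
    record_path='Order',
    meta=[['Personal', 'ID'], ['Personal', 'Name'],
          ['Personal', 'Type'], ['Personal', 'TypeName']],
    record_prefix='Personal_Order_',
    sep='_',
)
print(flat)
```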
First, create a function that processes the list of dicts in each row of the Detail column. Briefly, pandas can turn a list of dicts into a dataframe, so all this function does is process the Personal and Order entries in each row's list of dicts to get matching dataframes that can be merged for each entry:
def processdicts(x):
    personal=pd.DataFrame.from_dict(list(pd.DataFrame.from_dict(x)['Personal']))
    personal=personal.rename(columns={"ID": "Personal_ID"})
    personal['Personal_Name']=personal['Name']
    orders=pd.DataFrame(list(pd.DataFrame.from_dict(list(pd.DataFrame.from_dict(x)['Order']))[0]))
    orders=orders.rename(columns={"ID": "Order_ID"})
    personDf=orders.merge(personal, left_index=True, right_index=True)
    return personDf
Create an empty dataframe that will contain the compiled data
outcome=pd.DataFrame(columns=[],index=[])
Now process the data for each row of the DataFrame using the function we created above. Using a simple for loop here to show the process. 'apply' function can also be called for greater efficiency but with slight modification of the concat process. With an empty dataframe at hand where we will concat the data from each row, for loop is as simple as 2 lines below:
for details in yourdataframe['Detail']:
    outcome=pd.concat([outcome,processdicts(details)])
Finally reset index:
outcome=outcome.reset_index(drop=True)
You may rename columns according to your requirement in the final dataframe. For example:
outcome=outcome.rename(columns={"TypeName": "Personal_TypeName","ProductName":"Personal_Order_ProductName","ProductID":"Personal_Order_ProductID","Price":"Personal_Order_Price","Date":"Personal_Order_Date","Order_ID":"Personal_Order_ID","Type":"Personal_Type"})
Order (or skip) the columns according to your requirement using:
outcome=outcome[['Name','Personal_ID','Personal_Name','Personal_Type','Personal_TypeName','Personal_Order_ID','Personal_Order_Date','Personal_Order_ProductID','Personal_Order_ProductName','Personal_Order_Price']]
Assign a name to the index of the dataframe:
outcome.index.name='ID'
This should help.
You can use explode to get each element of the lists in Detail on its own row, and then apply Shubham Sharma's answer:
import io
import pandas as pd
#Creating dataframe:
s_e='''
ID Name
1 Sara
2 Frank
'''
df = pd.read_csv(io.StringIO(s_e), sep=r'\s\s+', engine='python')
df['Detail']=[[{"Personal":{"ID":"001","Name":"Sara","Type":"01","TypeName":"Book"},"Order":[{"ID":"0001","Date":"20200222","ProductID":"C0123","ProductName":"ABC", "Price":"4"}]},{"Personal":{"ID":"001","Name":"Sara","Type":"02","TypeName":"Food"},"Order":[{"ID":"0004","Date":"20200222","ProductID":"D0123","ProductName":"Small beef", "Price":"15"}]},{"Personal":{"ID":"001","Name":"Sara","Type":"02","TypeName":"Food"},"Order":[{"ID":"0005","Date":"20200222","ProductID":"D0200","ProductName":"Shrimp", "Price":"28"}]}],[{"Personal":{"ID":"002","Name":"Frank","Type":"02","TypeName":"Food"},"Order":[{"ID":"0008","Date":"20200228","ProductID":"D0288","ProductName":"Salmon", "Price":"24"}]}]]
#using explode
df = df.explode('Detail').reset_index()
df['Detail']=df['Detail'].apply(lambda x: [x])
print('using explode:', df)
#retrieved from #Shubham Sharma's answer:
personal = df['Detail'].str[0].str.get('Personal').apply(pd.Series).add_prefix('Personal_')
order = df['Detail'].str[0].str.get('Order').str[0].apply(pd.Series).add_prefix('Personal_Order_')
result = pd.concat([df[['ID', "Name"]], personal, order], axis=1)
#reset ID
result['ID']=[i+1 for i in range(len(result.index))]
print(result)
Output:
#Using explode:
index ID Name Detail
0 0 1 Sara [{'Personal': {'ID': '001', 'Name': 'Sara', 'Type': '01', 'TypeName': 'Book'}, 'Order': [{'ID': '0001', 'Date': '20200222', 'ProductID': 'C0123', 'ProductName': 'ABC', 'Price': '4'}]}]
1 0 1 Sara [{'Personal': {'ID': '001', 'Name': 'Sara', 'Type': '02', 'TypeName': 'Food'}, 'Order': [{'ID': '0004', 'Date': '20200222', 'ProductID': 'D0123', 'ProductName': 'Small beef', 'Price': '15'}]}]
2 0 1 Sara [{'Personal': {'ID': '001', 'Name': 'Sara', 'Type': '02', 'TypeName': 'Food'}, 'Order': [{'ID': '0005', 'Date': '20200222', 'ProductID': 'D0200', 'ProductName': 'Shrimp', 'Price': '28'}]}]
3 1 2 Frank [{'Personal': {'ID': '002', 'Name': 'Frank', 'Type': '02', 'TypeName': 'Food'}, 'Order': [{'ID': '0008', 'Date': '20200228', 'ProductID': 'D0288', 'ProductName': 'Salmon', 'Price': '24'}]}]
#result:
ID Name Personal_ID Personal_Name Personal_Type Personal_TypeName Personal_Order_ID Personal_Order_Date Personal_Order_ProductID Personal_Order_ProductName Personal_Order_Price
0 1 Sara 001 Sara 01 Book 0001 20200222 C0123 ABC 4
1 2 Sara 001 Sara 02 Food 0004 20200222 D0123 Small beef 15
2 3 Sara 001 Sara 02 Food 0005 20200222 D0200 Shrimp 28
3 4 Frank 002 Frank 02 Food 0008 20200228 D0288 Salmon 24
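The explode step on its own may be easier to follow on a tiny frame. In this sketch the data is deliberately simplified (each dict has a single made-up 'Type' key), and json_normalize is swapped in for the .str accessors used above.

```python
import pandas as pd

# Simplified stand-in: each Detail cell is a list of small dicts
df = pd.DataFrame({
    'ID': [1, 2],
    'Name': ['Sara', 'Frank'],
    'Detail': [[{'Type': '01'}, {'Type': '02'}], [{'Type': '02'}]],
})

# One row per list element; ID and Name are repeated for each element
exploded = df.explode('Detail', ignore_index=True)

# Turn the dicts into columns and attach them by position
types = pd.json_normalize(exploded['Detail'].tolist())
result = exploded.drop(columns='Detail').join(types)
print(result)
```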
I have a list of dictionaries, where each dictionary represents a record. It is formatted as follows:
>>> ListOfData=[
... {'Name':'Andrew',
... 'number':4,
... 'contactinfo':{'Phone':'555-5555', 'Address':'123 Main St'}},
... {'Name':'Ben',
... 'number':6,
... 'contactinfo':{'Phone':'555-5554', 'Address':'124 2nd St'}},
... {'Name':'Cathy',
... 'number':1,
... 'contactinfo':{'Phone':'555-5556', 'Address':'126 3rd St'}}]
>>>
>>> import pprint
>>> pprint.pprint(ListOfData)
[{'Name': 'Andrew',
'contactinfo': {'Address': '123 Main St', 'Phone': '555-5555'},
'number': 4},
{'Name': 'Ben',
'contactinfo': {'Address': '124 2nd St', 'Phone': '555-5554'},
'number': 6},
{'Name': 'Cathy',
'contactinfo': {'Address': '126 3rd St', 'Phone': '555-5556'},
'number': 1}]
>>>
What is the best way to read this into a Pandas dataframe with multiindex columns for those attributes in the sub dictionaries?
For example, I'd ideally have 'Phone' and 'Address' columns nested under the 'contactinfo' columns.
I can read in the data as follows, but would like the contact info column to be broken into sub columns.
>>> pd.DataFrame.from_dict(ListOfData)
Name contactinfo number
0 Andrew {u'Phone': u'555-5555', u'Address': u'123 Main... 4
1 Ben {u'Phone': u'555-5554', u'Address': u'124 2nd ... 6
2 Cathy {u'Phone': u'555-5556', u'Address': u'126 3rd ... 1
>>>
How about this?
First declare an empty data frame:
df = pd.DataFrame(columns=('Name', 'contactinfo', 'number'))
Then iterate over the list and add the rows:
for row in ListOfData:
    df.loc[len(df)] = row
complete code
import pandas as pd
ListOfData=[
    {'Name':'Andrew',
     'number':4,
     'contactinfo':{'Phone':'555-5555', 'Address':'123 Main St'}},
    {'Name':'Ben',
     'number':6,
     'contactinfo':{'Phone':'555-5554', 'Address':'124 2nd St'}}]
df = pd.DataFrame(columns=('Name', 'contactinfo', 'number'))
for row in ListOfData:
    df.loc[len(df)] = row
print(df)
this prints
Name contactinfo number
0 Andrew {'Phone': '555-5555', 'Address': '123 Main St'} 4
1 Ben {'Phone': '555-5554', 'Address': '124 2nd St'} 6
Here is a pretty clunky workaround that I was able to get what I need. I loop through the columns, find those that are made of dicts and then divide it into multiple columns and merge it to the dataframe. I'd appreciate hearing any ways to improve this code. I'd imagine that ideally the dataframe would be constructed from the get-go without having dictionaries as values.
>>> df=pd.DataFrame.from_dict(ListOfData)
>>>
>>> for name,col in df.items():
...     if any(isinstance(x, dict) for x in col.tolist()):
...         DividedDict=col.apply(pd.Series)
...         DividedDict.columns=pd.MultiIndex.from_tuples([(name,x) for x in DividedDict.columns.tolist()])
...         df=df.join(DividedDict)
...         df.drop(columns=name, inplace=True)
...
>>> print(df)
Name number (contactinfo, Address) (contactinfo, Phone)
0 Andrew 4 123 Main St 555-5555
1 Ben 6 124 2nd St 555-5554
2 Cathy 1 126 3rd St 555-5556
>>>
Don't know about best or not, but you could do it in two steps:
>>> df = pd.DataFrame(ListOfData)
>>> df = df.join(pd.DataFrame.from_records(df.pop("contactinfo")))
>>> df
Name number Address Phone
0 Andrew 4 123 Main St 555-5555
1 Ben 6 124 2nd St 555-5554
2 Cathy 1 126 3rd St 555-5556
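If you do want Phone and Address nested under contactinfo as MultiIndex columns, one possible route (a sketch, not the only way) is to flatten with json_normalize and then split the dotted column names into tuples. Padding the top-level names with an empty second level is my own convention here.

```python
import pandas as pd

ListOfData = [
    {'Name': 'Andrew', 'number': 4,
     'contactinfo': {'Phone': '555-5555', 'Address': '123 Main St'}},
    {'Name': 'Ben', 'number': 6,
     'contactinfo': {'Phone': '555-5554', 'Address': '124 2nd St'}},
    {'Name': 'Cathy', 'number': 1,
     'contactinfo': {'Phone': '555-5556', 'Address': '126 3rd St'}},
]

# Nested keys become dotted names such as 'contactinfo.Phone'
flat = pd.json_normalize(ListOfData)

# Split dotted names into (outer, inner); pad plain names with ''
flat.columns = pd.MultiIndex.from_tuples(
    [tuple(c.split('.', 1)) if '.' in c else (c, '') for c in flat.columns]
)

# Selecting the outer label returns just the nested columns
contact = flat['contactinfo']
print(flat)
```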