I need help in pandas to group rows based on a specific condition. I have a dataset as follows:
Name Source Country Severity
ABC XYZ USA Low
DEF XYZ England High
ABC XYZ India Medium
EFG XYZ Algeria High
DEF XYZ UK Medium
I want to group these rows by the Name field so that the Country values in each group are joined into a single comma-separated string and Severity is set to the highest priority present in the group.
The output table should then look like this:
Name Source Country Severity
ABC XYZ USA, India Medium
DEF XYZ England, UK High
EFG XYZ Algeria High
I am able to aggregate the first three columns using the code below, but I have not found a solution for merging Severity.
df = df.groupby('Name').agg({'Source': 'first', 'Country': ', '.join})
You should convert your Severity to an ordered Categorical.
This enables you to use a simple max:
df['Severity'] = pd.Categorical(df['Severity'],
categories=['Low', 'Medium', 'High'],
ordered=True)
out = (df
.groupby('Name')
.agg({'Source':'first',
'Country': ', '.join,
'Severity': 'max'})
)
output:
Source Country Severity
Name
ABC XYZ USA, India Medium
DEF XYZ England, UK High
EFG XYZ Algeria High
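Note that groupby leaves Name as the index of out here; if Name should remain a regular column, as in the desired output, a reset_index (or passing as_index=False to groupby) restores it:
out = out.reset_index()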
You can try converting the Severity column to a number, aggregating that number with max, and then converting it back to a word, like so:
import pandas
dataframe = pandas.DataFrame({'Name': ['ABC', 'DEF', 'ABC', 'EFG', 'DEF'], 'Source': ['XYZ', 'XYZ', 'XYZ', 'XYZ', 'XYZ'], 'Country': ['USA', 'England', 'India', 'Algeria', 'UK'], 'Severity': ['Low', 'High', 'Medium', 'High', 'Medium']})
severity_to_number = {'Low': 1, 'Medium': 2, 'High': 3}
severity_to_word = {v: k for k, v in severity_to_number.items()}
dataframe['Severity-number'] = dataframe['Severity'].replace(severity_to_number)
dataframe = dataframe.groupby('Name').agg({'Source':'first', 'Country': ', '.join, 'Severity-number':'max' })
dataframe['Severity'] = dataframe['Severity-number'].replace(severity_to_word)
del dataframe['Severity-number']
print(dataframe)
Output
Source Country Severity
Name
ABC XYZ USA, India Medium
DEF XYZ England, UK High
EFG XYZ Algeria High
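For reference, the same number-then-word round trip can be written as one chained expression. A sketch, assuming the original ungrouped dataframe and the two mappings above (out is an illustrative name):
out = (dataframe.assign(rank=dataframe['Severity'].map(severity_to_number))
                .groupby('Name')
                .agg({'Source': 'first', 'Country': ', '.join, 'rank': 'max'})
                .assign(Severity=lambda d: d['rank'].map(severity_to_word))
                .drop(columns='rank'))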
I have a set of restaurant sales data whose structure comprises two dimensions, a location dimension and a food-type dimension, as well as a fact table that contains some measures. I am having trouble manipulating the table to perform analysis. In the end, it will likely be displayed in Excel.
Here is a toy dataset:
tuples = [('California', 'SF'), ('California', 'SD'),
('New York', 'Roch'), ('New York', 'NY'),
('Texas', 'Houst'), ('Texas', 'SA')]
measure1 = [5, 10,
30, 60,
10, 30]
measure2 = [50, 10,
30, 6,
1, 30]
tier1 = ['Burger',
'Burger',
'Burger',
'Pizza',
'Pizza',
'Burger']
tier2 = ['Beef',
'Chicken',
'Beef',
'Pep',
'Cheese',
'Beef']
index = pd.MultiIndex.from_tuples(tuples, names=['State', 'City'])
revenue = pd.Series(measure1, index=index)
revenue = revenue.reindex(index)
rev_df = pd.DataFrame({'Category': tier1,
                       'Subcategory': tier2,
                       'Revenue': revenue,
                       'NumOfOrders': [3, 5, 1, 3, 10, 20]})
rev_df
This code produces the following dataframe:
                 Category Subcategory  Revenue  NumOfOrders
State      City
California SF      Burger        Beef        5            3
           SD      Burger     Chicken       10            5
New York   Roch    Burger        Beef       30            1
           NY       Pizza         Pep       60            3
Texas      Houst    Pizza      Cheese       10           10
           SA      Burger        Beef       30           20
I want to do two things:
(1) Place Category and Subcategory as MultiIndex column headers and calculate NumOfOrders-weighted revenue by food subcategory and category, with subtotals and grand totals.
(2) Place the cities dimension on the y-axis and move the Category/Subcategory-by-measure combination to the x-axis.
For example:
(1)
Burger Total Burger Pizza....
Beef Chicken
California SF 4 5 9
SD 5 5 10
Total Califor 9 10 19
(2)
California California Total
SF SD
Total Burger WgtRev 9 10 19
Beef WgtRev 4 5 10
Chickn WgtRev 5 5 10
Total Pizza...
To start, my first attempt was to use a pivot_table
pivoted = rev_df.pivot_table(index = ['State','City'],
columns = ['Category','Subcategory'],
aggfunc = 'sum', #How to make it a weighted average?
margins = True, #How to subtotal and grandtotal?
margins_name = 'All',
fill_value = 0)
KeyError: "['State' 'City'] not in index"
As you can see, I get an error. What is the most pythonic way to manipulate this snowflake-esque data model?
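A likely cause of the KeyError is that State and City live in the row index of rev_df rather than in its columns, so pivot_table cannot find them by name. Below is a minimal sketch under that assumption; it also precomputes revenue times orders so a plain 'sum' aggregation yields weighted figures (flat, RevxOrders, and wgt_rev are illustrative names):
# Move the index levels into ordinary columns before pivoting
flat = rev_df.reset_index()
# Summing RevxOrders per cell and dividing by the summed order
# counts gives an order-weighted average revenue per cell
flat['RevxOrders'] = flat['Revenue'] * flat['NumOfOrders']
pivoted = flat.pivot_table(index=['State', 'City'],
                           columns=['Category', 'Subcategory'],
                           values=['RevxOrders', 'NumOfOrders'],
                           aggfunc='sum',
                           margins=True,
                           margins_name='All',
                           fill_value=0)
wgt_rev = pivoted['RevxOrders'] / pivoted['NumOfOrders']
Note that margins adds only grand totals ('All'); the per-state subtotals sketched in (1) would need an extra State-level aggregation concatenated in.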
Here is a dataset with an unbounded number of keys in a dictionary. The Detail column in each row may hold different product information depending on the customer.
ID Name Detail
1 Sara [{"Personal":{"ID":"001","Name":"Sara","Type":"01","TypeName":"Book"},"Order":[{"ID":"0001","Date":"20200222","ProductID":"C0123","ProductName":"ABC", "Price":"4"}]},{"Personal":{"ID":"001","Name":"Sara","Type":"02","TypeName":"Food"},"Order":[{"ID":"0004","Date":"20200222","ProductID":"D0123","ProductName":"Small beef", "Price":"15"}]},{"Personal":{"ID":"001","Name":"Sara","Type":"02","TypeName":"Food"},"Order":[{"ID":"0005","Date":"20200222","ProductID":"D0200","ProductName":"Shrimp", "Price":"28"}]}]
2 Frank [{"Personal":{"ID":"002","Name":"Frank","Type":"02","TypeName":"Food"},"Order":[{"ID":"0008","Date":"20200228","ProductID":"D0288","ProductName":"Salmon", "Price":"24"}]}]
My expected output is
ID Name Personal_ID Personal_Name Personal_Type Personal_TypeName Personal_Order_ID Personal_Order_Date Personal_Order_ProductID Personal_Order_ProductName Personal_Order_Price
1 Sara 001 Sara 01 Book 0001 20200222 C0123 ABC 4
2 Sara 001 Sara 02 Food 0004 20200222 D0123 Small beef 15
3 Sara 001 Sara 02 Food 0005 20200222 D0200 Shrimp 28
4 Frank 002 Frank 02 Food 0008 20200228 D0288 Salmon 24
So basically you have nested JSON in your Detail column that you need to break out into a df and then merge with your original.
import pandas as pd
import json
from pandas import json_normalize
# We will need to loop over each row to read each JSON, collecting one
# normalized frame per row (DataFrame.append was removed in pandas 2.0,
# so the pieces are gathered in a list and concatenated once at the end)
frames = []
for ind, row in df.iterrows():
    # Read the JSON, flatten each Order record into a row, and carry the
    # Personal fields along as metadata columns
    frames.append(json_normalize(json.loads(row['Detail']),
                                 record_path='Order',
                                 meta=[['Personal', 'ID'], ['Personal', 'Name'],
                                       ['Personal', 'Type'], ['Personal', 'TypeName']]))
detailDf = pd.concat(frames, ignore_index=True)
# Personally, you don't really need the rest of the code, as the columns
# Personal.Name and Personal.ID carry the same information, but nonetheless:
# you will have to merge on name and ID
df = df.merge(detailDf, how='right',
              left_on=[df['Name'], df['ID']],
              right_on=[detailDf['Personal.Name'], detailDf['Personal.ID'].astype(int)])
#Clean up
df.rename(columns = {'ID_x':'ID', 'ID_y':'Personal_Order_ID'}, inplace = True)
df.drop(columns = {'Detail', 'key_1', 'key_0'}, inplace = True)
If you look through my comments, I recommend using detailDf as your final df, as the merge really isn't necessary and that information is already in the Detail JSON.
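As an aside, the loop can be avoided entirely: json_normalize accepts a list of records, so the parsed lists from every row can be combined first. A sketch using the same record_path and meta arguments (records is an illustrative name):
# Flatten all rows' Detail lists into one list of records, then normalize once
records = [rec for detail in df['Detail'] for rec in json.loads(detail)]
detailDf = json_normalize(records,
                          record_path='Order',
                          meta=[['Personal', 'ID'], ['Personal', 'Name'],
                                ['Personal', 'Type'], ['Personal', 'TypeName']])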
First you need to create a function that processes the list of dicts in each row of the Detail column. Briefly, pandas can read a list of dicts as a dataframe, so all the function does is turn the Personal and Order entries of each row into small dataframes that are then merged for each entry. The function:
def processdicts(x):
    personal = pd.DataFrame.from_dict(list(pd.DataFrame.from_dict(x)['Personal']))
    personal = personal.rename(columns={"ID": "Personal_ID"})
    personal['Personal_Name'] = personal['Name']
    orders = pd.DataFrame(list(pd.DataFrame.from_dict(list(pd.DataFrame.from_dict(x)['Order']))[0]))
    orders = orders.rename(columns={"ID": "Order_ID"})
    personDf = orders.merge(personal, left_index=True, right_index=True)
    return personDf
Create an empty dataframe that will contain the compiled data
outcome=pd.DataFrame(columns=[],index=[])
Now process the data for each row of the DataFrame using the function created above. A simple for loop is used here to show the process; apply could be called instead for greater efficiency, with a slight modification of the concat step (a sketch of that variant appears at the end of this answer). With an empty dataframe at hand to collect the data from each row, the loop is as simple as the two lines below:
for details in yourdataframe['Detail']:
    outcome = pd.concat([outcome, processdicts(details)])
Finally reset index:
outcome=outcome.reset_index(drop=True)
You may rename columns according to your requirement in the final dataframe. For example:
outcome=outcome.rename(columns={"TypeName": "Personal_TypeName","ProductName":"Personal_Order_ProductName","ProductID":"Personal_Order_ProductID","Price":"Personal_Order_Price","Date":"Personal_Order_Date","Order_ID":"Personal_Order_ID","Type":"Personal_Type"})
Order (or skip) the columns according to your requirement using:
outcome=outcome[['Name','Personal_ID','Personal_Name','Personal_Type','Personal_TypeName','Personal_Order_ID','Personal_Order_Date','Personal_Order_ProductID','Personal_Order_ProductName','Personal_Order_Price']]
Assign a name to the index of the dataframe:
outcome.index.name='ID'
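The apply-based variant mentioned earlier, as a minimal sketch:
# Build one frame per row with apply, then concatenate them all at once
outcome = pd.concat(yourdataframe['Detail'].apply(processdicts).tolist(),
                    ignore_index=True)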
This should help.
You can use explode to get each element of the lists in Detail separately, and then you can use Shubham Sharma's answer:
import io
import pandas as pd
#Creating dataframe:
s_e='''
ID Name
1 Sara
2 Frank
'''
df = pd.read_csv(io.StringIO(s_e), sep=r'\s\s+', engine='python')
df['Detail']=[[{"Personal":{"ID":"001","Name":"Sara","Type":"01","TypeName":"Book"},"Order":[{"ID":"0001","Date":"20200222","ProductID":"C0123","ProductName":"ABC", "Price":"4"}]},{"Personal":{"ID":"001","Name":"Sara","Type":"02","TypeName":"Food"},"Order":[{"ID":"0004","Date":"20200222","ProductID":"D0123","ProductName":"Small beef", "Price":"15"}]},{"Personal":{"ID":"001","Name":"Sara","Type":"02","TypeName":"Food"},"Order":[{"ID":"0005","Date":"20200222","ProductID":"D0200","ProductName":"Shrimp", "Price":"28"}]}],[{"Personal":{"ID":"002","Name":"Frank","Type":"02","TypeName":"Food"},"Order":[{"ID":"0008","Date":"20200228","ProductID":"D0288","ProductName":"Salmon", "Price":"24"}]}]]
#using explode
df = df.explode('Detail').reset_index()
df['Detail']=df['Detail'].apply(lambda x: [x])
print('using explode:', df)
#retrieved from #Shubham Sharma's answer:
personal = df['Detail'].str[0].str.get('Personal').apply(pd.Series).add_prefix('Personal_')
order = df['Detail'].str[0].str.get('Order').str[0].apply(pd.Series).add_prefix('Personal_Order_')
result = pd.concat([df[['ID', "Name"]], personal, order], axis=1)
#reset ID
result['ID']=[i+1 for i in range(len(result.index))]
print(result)
Output:
#Using explode:
index ID Name Detail
0 0 1 Sara [{'Personal': {'ID': '001', 'Name': 'Sara', 'Type': '01', 'TypeName': 'Book'}, 'Order': [{'ID': '0001', 'Date': '20200222', 'ProductID': 'C0123', 'ProductName': 'ABC', 'Price': '4'}]}]
1 0 1 Sara [{'Personal': {'ID': '001', 'Name': 'Sara', 'Type': '02', 'TypeName': 'Food'}, 'Order': [{'ID': '0004', 'Date': '20200222', 'ProductID': 'D0123', 'ProductName': 'Small beef', 'Price': '15'}]}]
2 0 1 Sara [{'Personal': {'ID': '001', 'Name': 'Sara', 'Type': '02', 'TypeName': 'Food'}, 'Order': [{'ID': '0005', 'Date': '20200222', 'ProductID': 'D0200', 'ProductName': 'Shrimp', 'Price': '28'}]}]
3 1 2 Frank [{'Personal': {'ID': '002', 'Name': 'Frank', 'Type': '02', 'TypeName': 'Food'}, 'Order': [{'ID': '0008', 'Date': '20200228', 'ProductID': 'D0288', 'ProductName': 'Salmon', 'Price': '24'}]}]
#result:
ID Name Personal_ID Personal_Name Personal_Type Personal_TypeName Personal_Order_ID Personal_Order_Date Personal_Order_ProductID Personal_Order_ProductName Personal_Order_Price
0 1 Sara 001 Sara 01 Book 0001 20200222 C0123 ABC 4
1 2 Sara 001 Sara 02 Food 0004 20200222 D0123 Small beef 15
2 3 Sara 001 Sara 02 Food 0005 20200222 D0200 Shrimp 28
3 4 Frank 002 Frank 02 Food 0008 20200228 D0288 Salmon 24
I have a Series object containing dicts with 3 fields (name, lei, value), which I get from the below function:
def get_fuzz(df, w):
    s = df['Legal_Name'].apply(lambda y: fuzz.token_set_ratio(y, w))
    idx = s.idxmax()
    return {'name': df['Legal_Name'].iloc[idx], 'lei': df['LEI'].iloc[idx], 'val': s.max()}

df1['Name'].apply(lambda x: get_fuzz(df, x))
The Series looks like this:
output
0 {'name': 'MGR Farms LLC', 'lei': '984500486BBD...
1 {'name': 'RAVENOL NORGE AS', 'lei': '549300D2O...
2 {'name': 'VCC Live Group Zártkörűen Működő Rés...
I can assign the output to my dataframe with the code below.
df1.assign(search=df1['Name'].apply(lambda x: get_fuzz(df, x)))
The DataFrame that I get looks like this:
ID Name search
0 1 Marshalll {'name': 'MGR Farms LLC', 'lei': '984500486BBD...
1 2 JP Morgan {'name': 'RAVENOL NORGE AS', 'lei': '549300D2O...
Question
How can I split this column into 3 columns?
Final output wanted
ID Name Name_bis LEI Value
0 1 Marshalll MGR Farms LLC 984500486BBD 57
1 2 Zion ZION INVESTMENT 549300D2O 100
Assuming you have the dataframe set up as:-
>>> df
ID Name search
0 1 Marshalll {'name': 'MGR Farms LLC', 'lei': '984500486BBD...
1 2 JP Morgan {'name': 'RAVENOL NORGE AS', 'lei': '549300D2O...
you can use:-
>>> df = pd.concat([df.drop(['search'], axis=1), df['search'].apply(pd.Series)], axis=1)
>>> df
ID Name name lei value
0 1 Marshalll MGR Farms LLC 984500486BBD 57
1 2 JP Morgan RAVENOL NORGE AS 549300D2O 100
And then update the column names as needed:-
>>> df.columns = ['ID', 'Name', 'Name_bis', 'LEI', 'Value']
>>> df
ID Name Name_bis LEI Value
0 1 Marshalll MGR Farms LLC 984500486BBD 57
1 2 JP Morgan RAVENOL NORGE AS 549300D2O 100
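If the frame is large, apply(pd.Series) can be slow; an equivalent sketch that expands the dicts through a plain list instead (expanded is an illustrative name):
>>> expanded = pd.DataFrame(df['search'].tolist(), index=df.index)
>>> df = pd.concat([df.drop(columns='search'), expanded], axis=1)
>>> df.columns = ['ID', 'Name', 'Name_bis', 'LEI', 'Value']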
I have several tables that look like this:
ID YY ZZ
2 97 826
2 78 489
4 47 751
4 110 322
6 67 554
6 88 714
code:
raw = {'ID': [2, 2, 4, 4, 6, 6,],
'YY': [97,78,47,110,67,88],
'ZZ':[826,489,751,322,554,714]}
df = pd.DataFrame(raw)
For each of these dfs, I have to perform a number of operations. First, group by ID, extract the length and the mean of the ZZ column, and put the results in a new df that looks like this:
Cities length mean
Paris 0 0
Madrid 0 0
Berlin 0 0
Warsaw 0 0
London 0 0
code:
raw2 = {'Cities': ['Paris', 'Madrid', 'Berlin', 'Warsaw', 'London'],
'length': 0,
'mean': 0}
df2 = pd.DataFrame(raw2)
I pulled out the average and the size of individual groups
df_grouped = df.groupby('ID').ZZ.size()
df_grouped2 = df.groupby('ID').ZZ.mean()
The problem occurs when trying to transfer the results to the new table, because the grouped output does not contain all the cities, and the results must be matched according to the appropriate key.
I tried to use a dictionary:
dic_cities = {"Paris":df_grouped.loc[2],
"Madrid":df_grouped.loc[4],
"Warsaw":df_grouped.loc[6],
"Berlin":df_grouped.loc[8],
"London":df_grouped.loc[10]}
Unfortunately, I'm receiving KeyError: 8
I have 19 df's from which I have to extract this data and the final tables have to look like this:
Cities length mean
Paris 2 657.5
Madrid 2 536.5
Berlin 0 0.0
Warsaw 2 634.0
London 0 0.0
Does anyone know how to deal with it using groupby and the dictionary or knows a better way to do it?
First, you should index df2 on 'Cities':
raw2 = {'Cities': ['Paris', 'Madrid', 'Berlin', 'Warsaw', 'London'],
'length': 0,
'mean': 0}
df2 = pd.DataFrame(raw2).set_index('Cities')
Then you should reverse your dictionary:
dic_cities = {2: "Paris",
4: "Madrid",
6: "Warsaw",
8: "Berlin",
10: "London"}
Once this is done, the processing is as simple as a groupby:
import numpy as np

for i, sub in df.groupby('ID'):
    df2.loc[dic_cities[i]] = sub.ZZ.agg([len, np.mean]).tolist()
Which gives for df2:
length mean
Cities
Paris 2.0 657.5
Madrid 2.0 536.5
Berlin 0.0 0.0
Warsaw 2.0 634.0
London 0.0 0.0
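A loop-free sketch of the same idea: map the IDs to city names first, then aggregate and reindex onto the full city list so missing cities get zeros (stats is an illustrative name):
stats = (df.assign(City=df['ID'].map(dic_cities))
           .groupby('City')['ZZ']
           .agg(length='size', mean='mean')
           .reindex(['Paris', 'Madrid', 'Berlin', 'Warsaw', 'London'],
                    fill_value=0))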
See this:
import pandas as pd
# setup raw data
raw = {'ID': [2, 2, 4, 4, 6, 6,], 'YY': [97,78,47,110,67,88], 'ZZ':[826,489,751,322,554,714]}
df = pd.DataFrame(raw)
# get mean values
mean_values = df.groupby('ID').mean()
# drop column
mean_values = mean_values.drop(['YY'], axis=1)
# get occurrence number
occurrence = df.groupby('ID').size()
# save data
result = pd.concat([occurrence, mean_values], axis=1, sort=False)
# rename columns
result.rename(columns={0:'length', 'ZZ':'mean'}, inplace=True)
# city data
raw2 = {'Cities': ['Paris', 'Madrid', 'Berlin', 'Warsaw', 'London'], 'length': 0, 'mean': 0}
df2 = pd.DataFrame(raw2)
# rename indexes
df2 = df2.rename(index={0: 2, 1: 4, 2: 8, 3: 6, 4: 10})
# merge data
df2['length'] = result['length']
df2['mean'] = result['mean']
Output:
Cities length mean
2 Paris 2.0 657.5
4 Madrid 2.0 536.5
8 Berlin NaN NaN
6 Warsaw 2.0 634.0
10 London NaN NaN
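If zeros are preferred over NaN for the missing cities, as in the desired output, a final fill would do it:
df2 = df2.fillna(0)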