Pandas hierarchical sort - python

I have a dataframe of categories and amounts. Categories can be nested into subcategories to arbitrarily many levels using a colon-separated string. I wish to sort it by descending amount, but in hierarchical fashion, like shown.
How I need it sorted
CATEGORY AMOUNT
Transport 5000
Transport : Car 4900
Transport : Train 100
Household 1100
Household : Utilities 600
Household : Utilities : Water 400
Household : Utilities : Electric 200
Household : Cleaning 100
Household : Cleaning : Bathroom 75
Household : Cleaning : Kitchen 25
Household : Rent 400
Living 250
Living : Other 150
Living : Food 100
EDIT:
The data frame:
pd.DataFrame({
    "category": ["Transport", "Transport : Car", "Transport : Train", "Household", "Household : Utilities", "Household : Utilities : Water", "Household : Utilities : Electric", "Household : Cleaning", "Household : Cleaning : Bathroom", "Household : Cleaning : Kitchen", "Household : Rent", "Living", "Living : Other", "Living : Food"],
    "amount": [5000, 4900, 100, 1100, 600, 400, 200, 100, 75, 25, 400, 250, 150, 100]
})
Note: this is the order I want it in. It may be in any arbitrary order before the sort.
EDIT2:
If anyone is looking for a similar solution, I posted the one I settled on here: How to sort dataframe in pandas by value in hierarchical category structure

One way could be to first str.split the category column.
df_ = df['category'].str.split(' : ', expand=True)
print (df_.head())
0 1 2
0 Transport None None
1 Transport Car None
2 Transport Train None
3 Household None None
4 Household Utilities None
Then take the amount column; what you want is the maximum amount per group based on:
the first column alone,
then the first and second columns,
then the first, second, and third columns, ...
You can do this with groupby.transform with max, and concat each column created.
s = df['amount']
l_cols = list(df_.columns)
dfa = pd.concat([s.groupby([df_[col] for col in range(0, lv + 1)]).transform('max')
                 for lv in l_cols], keys=l_cols, axis=1)
print (dfa)
0 1 2
0 5000 NaN NaN
1 5000 4900.0 NaN
2 5000 100.0 NaN
3 1100 NaN NaN
4 1100 600.0 NaN
5 1100 600.0 400.0
6 1100 600.0 200.0
7 1100 100.0 NaN
8 1100 100.0 75.0
9 1100 100.0 25.0
10 1100 400.0 NaN
11 250 NaN NaN
12 250 150.0 NaN
13 250 100.0 NaN
Now you just need to sort_values on all the columns in order (first 0, then 1, then 2, ...) with ascending=False and na_position='first' (a parent row has NaN in the deeper levels, so it lands above its children), then take the index and use loc to reorder df the expected way.
dfa = dfa.sort_values(l_cols, na_position='first', ascending=False)
dfs = df.loc[dfa.index] #here you can reassign to df directly
print (dfs)
category amount
0 Transport 5000
1 Transport : Car 4900
2 Transport : Train 100
3 Household 1100
4 Household : Utilities 600
5 Household : Utilities : Water 400
6 Household : Utilities : Electric 200
10 Household : Rent 400 #here is the one difference with this data
7 Household : Cleaning 100
8 Household : Cleaning : Bathroom 75
9 Household : Cleaning : Kitchen 25
11 Living 250
12 Living : Other 150
13 Living : Food 100

I packaged @Ben.T's answer into a more generic function; hopefully this is clearer to read!
EDIT: I have made changes to the function to group by columns in order rather than one by one, to address potential issues noted by @Ben.T in the comments.
import pandas as pd
def category_sort_df(df, sep, category_col, numeric_col, ascending=False):
    '''Sorts dataframe by nested categories, using `sep` as the delimiter for `category_col`.
    Sorts numeric columns in descending order by default.
    Returns a copy.'''
    df = df.copy()
    try:
        to_sort = pd.to_numeric(df[numeric_col])
    except ValueError:
        print(f'Column `{numeric_col}` is not numeric!')
        raise
    categories = df[category_col].str.split(sep, expand=True)
    # Strip any whitespace before and after sep
    categories = categories.apply(lambda s: s.str.strip())
    levels = list(categories.columns)
    to_concat = []
    for level in levels:
        # Group by all columns up to this level, in order, rather than one at a time
        level_by = [categories[col] for col in range(0, level + 1)]
        gb = to_sort.groupby(level_by)
        to_concat.append(gb.transform('max'))
    dfa = pd.concat(to_concat, keys=levels, axis=1)
    ixs = dfa.sort_values(levels, na_position='first', ascending=ascending).index
    df = df.loc[ixs].copy()
    return df
Using Python 3.7.3, pandas 0.24.2
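For example, a hypothetical quick check on a slice of the sample frame from the question:
df = pd.DataFrame({
    "category": ["Transport", "Transport : Car", "Transport : Train"],
    "amount": [5000, 4900, 100],
})
print(category_sort_df(df, ' : ', 'category', 'amount'))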

To answer my own question: I found a way. Kind of long-winded, but here it is.
import numpy as np
import pandas as pd
def sort_tree_df(df, tree_column, sort_column):
    sort_key = sort_column + '_abs'
    df[sort_key] = df[sort_column].abs()
    df.index = pd.MultiIndex.from_frame(
        df[tree_column].str.split(":").apply(lambda x: [y.strip() for y in x]).apply(pd.Series))
    sort_columns = [df[tree_column].values, df[sort_key].values] + [
        df.groupby(level=list(range(0, x)))[sort_key].transform('max').values
        for x in range(df.index.nlevels - 1, 0, -1)
    ]
    sort_indexes = np.lexsort(sort_columns)
    df_sorted = df.iloc[sort_indexes[::-1]]
    df_sorted.reset_index(drop=True, inplace=True)
    df_sorted.drop(sort_key, axis=1, inplace=True)
    return df_sorted
sort_tree_df(df, 'category', 'amount')

If you don't mind adding an extra column, you can extract the main category from the category and then sort by main_category/amount/category, i.e.:
df['main_category'] = df.category.str.extract(r'^([^ ]+)')
df.sort_values(['main_category', 'amount', 'category'], ascending=False)[['category', 'amount']]
Output:
category amount
0 Transport 5000
1 Transport : Car 4900
2 Transport : Train 100
11 Living 250
12 Living : Other 150
13 Living : Food 100
3 Household 1100
4 Household : Utilities 600
5 Household : Utilities : Water 400
10 Household : Rent 400
6 Household : Utilities : Electric 200
7 Household : Cleaning 100
8 Household : Cleaning : Bathroom 75
9 Household : Cleaning : Kitchen 25
Note that this will work well only if your main categories are single words without spaces. Otherwise you will need to do it in a different way, i.e. extract everything before the first colon and strip the trailing space:
df['main_category'] = df.category.str.extract(r'^([^:]+)')
df['main_category'] = df.main_category.str.rstrip()
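Equivalently (a small variation, not from the original answer), a split gives the main category in one step and handles multi-word names too:
df['main_category'] = df.category.str.split(':').str[0].str.strip()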

Related

Pandas: index-derived column with specific increments based on other columns

I have the following data frame:
import pandas as pd
pandas_df = pd.DataFrame([
    ["SEX", "Male"],
    ["SEX", "Female"],
    ["EXACT_AGE", None],
    ["Country", "Afghanistan"],
    ["Country", "Albania"]],
    columns=['FullName', 'ResponseLabel'])
Now what I need to do is to add sort order to this dataframe. Each new "FullName" would increment it by 100 and each consecutive "ResponseLabel" for a given "FullName" would increment it by 1 (for this specific "FullName"). So I basically create two different sort orders that I sum later on.
pandas_full_name_increment = pandas_df[['FullName']].drop_duplicates()
pandas_full_name_increment = pandas_full_name_increment.reset_index()
pandas_full_name_increment.index += 1
pandas_full_name_increment['SortOrderFullName'] = pandas_full_name_increment.index * 100
pandas_df['SortOrderResponseLabel'] = pandas_df.groupby(['FullName']).cumcount() + 1
pandas_df = pd.merge(pandas_df, pandas_full_name_increment, on = ['FullName'], how = 'left')
Result:
FullName ResponseLabel SortOrderResponseLabel index SortOrderFullName SortOrder
0 SEX Male 1 0 100 101
1 SEX Female 2 0 100 102
2 EXACT_AGE NULL 1 2 200 201
3 Country Afghanistan 1 3 300 301
4 Country Albania 2 3 300 302
The result that I get on my "SortOrder" column is correct but I wonder if there is some better approach pandas-wise?
Thank you!
The best way to do this would be to use ngroup and cumcount:
name_group = pandas_df.groupby('FullName')
pandas_df['sort_order'] = (
    name_group.ngroup(ascending=False).add(1).mul(100) +
    name_group.cumcount().add(1)
)
Output
FullName ResponseLabel sort_order
0 SEX Male 101
1 SEX Female 102
2 EXACT_AGE None 201
3 Country Afghanistan 301
4 Country Albania 302
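One caveat: with the default sort=True, ngroup numbers the groups in sorted-key order, and ascending=False matches the order of appearance here only because the names happen to reverse-sort that way (SEX, EXACT_AGE, Country). If you want appearance order regardless of the names, a sketch with sort=False:
name_group = pandas_df.groupby('FullName', sort=False)  # groups in order of appearance
pandas_df['sort_order'] = (
    name_group.ngroup().add(1).mul(100) +  # 100, 200, 300, ... per FullName
    name_group.cumcount().add(1)           # 1, 2, ... within each FullName
)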

How to sort dataframe in pandas by value in hierarchical category structure

I have a data frame in pandas.
pd.DataFrame({
    "category": ["Transport", "Transport : Car", "Transport : Train", "Household", "Household : Utilities", "Household : Utilities : Water", "Household : Utilities : Electric", "Household : Cleaning", "Household : Cleaning : Bathroom", "Household : Cleaning : Kitchen", "Household : Rent", "Living", "Living : Other", "Living : Food", "Living : Something", "Living : Anitsomething"],
    "amount": [5000, 4900, 100, 1100, 600, 400, 200, 100, 75, 25, 400, 250, 150, 100, 1000, -1000]
})
Categories and subcategories are split by a colon.
I am trying to sort this data frame in descending amount (absolute value) order, whilst respecting the hierarchical grouping, i.e. the sorted result should look like:
Transport 5000
Transport : Car 4900
Transport : Train 100
Household 1600
Household : Utilities 600
Household : Utilities : Water 400
Household : Utilities : Electric 200
Household : Rent 400
Living 250
Living : Something 1000
Living : Antisomething -1000
Living : Other 150
Living : Food 100
I can do this recursively in an incredibly inefficient manner. Super slow but it works.
def sort_hierachical(self, full_df, name_column, sort_column, parent="", level=0):
    result_df = pd.DataFrame(columns=full_df.columns)
    part_df = full_df.loc[(full_df[name_column].str.count(':') == level) &
                          (full_df[name_column].str.startswith(parent)), :]
    part_df['abs'] = part_df[sort_column].abs()
    part_df = part_df.sort_values('abs', ascending=False)
    for _, row in part_df.iterrows():
        category = row[name_column]
        row_df = pd.DataFrame(columns=full_df.columns).append(row)
        child_rows = self.sort_hierachical(full_df, name_column, sort_column, category, level + 1)
        if not child_rows.empty:
            result_df = pd.concat([result_df, row_df], sort=False)
            result_df = pd.concat([result_df, child_rows], sort=False)
        else:
            result_df = pd.concat([result_df, row_df], sort=False)
    return result_df
df = self.sort_hierachical(df, "category", "amount")
My question: is there a nice, performant way to do such a thing in pandas? Some sort of groupby sort or MultiIndex trick?
Good karma will come to the ones who can solve this challenging problem :)
Edit:
This almost works... But the -1000, 1000 messes up the sort order.
def _sort_tree_df(self, df, tree_column, sort_column):
    sort_key = sort_column + '_abs'
    df[sort_key] = df[sort_column].abs()
    df.index = pd.MultiIndex.from_frame(
        df[tree_column].str.split(":").apply(lambda x: [y.strip() for y in x]).apply(pd.Series))
    sort_columns = [df[tree_column].values]
    sort_columns.append(df[sort_key].values)
    for x in range(df.index.nlevels, 0, -1):
        group_lvl = list(range(0, x))
        sort_columns.append(df.groupby(level=group_lvl)[sort_key].transform('max').values)
    sort_indexes = np.lexsort(sort_columns)
    df_sorted = df.iloc[sort_indexes[::-1]]
    df_sorted.reset_index(drop=True, inplace=True)
    df_sorted = df_sorted.drop(sort_key, axis=1)
    return df_sorted
Edit2:
OK, I think I've managed to make it work. I'm still very confused about how lexsort works; I made this work through educated trial and error. If you understand it, please feel free to explain it. Also feel free to post a better method.
def _sort_tree_df(self, df, tree_column, sort_column, delimeter=':'):
    df.index = pd.MultiIndex.from_frame(
        df[tree_column].str.split(delimeter).apply(lambda x: [y.strip() for y in x]).apply(pd.Series))
    sort_columns = [df[tree_column].values]
    sort_columns.append(df[sort_column].abs().values)
    for x in range(df.index.nlevels, 0, -1):
        group_lvl = list(range(0, x))
        sort_columns.append(df.groupby(level=group_lvl)[sort_column].transform('sum').abs().values)
    sort_indexes = np.lexsort(sort_columns)
    df_sorted = df.iloc[sort_indexes[::-1]]
    df_sorted.reset_index(drop=True, inplace=True)
    return df_sorted
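As far as I can tell, np.lexsort treats the last array in the sequence as the primary sort key and the earlier ones as successive tie-breakers, which is why the group-level aggregates are appended after the row-level values above. A minimal illustration:
import numpy as np

primary = np.array([1, 1, 2, 2])
secondary = np.array([40, 30, 20, 10])
# The last key is the primary one; 'secondary' only breaks ties within 'primary'.
print(np.lexsort([secondary, primary]))  # [1 0 3 2]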
Edit3:
Actually this doesn't always sort correctly :(
Edit4:
The problem is that I need a way to make the transform('sum') apply only to items where level == x-1,
i.e. something like:
df['level'] = df[tree_column].str.count(':')
sorting_by = df.groupby(level=group_lvl)[sort_column].transform('sum' if 'level' = x-1).abs().values
or
sorting_by = df.groupby(level=group_lvl).loc['level' = x-1: sort_column].transform('sum').abs().values
both of which are not valid
Anyone know how to do a conditional transform like this on a multi index df?
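One formulation that does seem valid (a sketch, using the same names as above): blank out the rows that should not contribute before grouping, since 'sum' skips NaN:
masked = df[sort_column].where(df['level'] == x - 1)  # NaN outside level x-1
sorting_by = masked.groupby(level=group_lvl).transform('sum').abs().values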
I am not sure I understood the question exactly, but I think you should split the column into subcategories and then do a value sort based on the hierarchy you want. Something like the following might do the job.
Use the following to create the new columns:
for _, row in df.iterrows():
    for item, col in zip(row.category.split(':'), ['cat', 'sub_cat', 'sub_sub_cat']):
        df.loc[_, col] = item
and then just sort them
df.sort_values(['cat', 'sub_cat', 'sub_sub_cat', 'amount'])
category amount cat sub_cat sub_sub_cat
3 Household 1100 Household NaN NaN
7 Household : Cleaning 100 Household Cleaning NaN
8 Household : Cleaning : Bathroom 75 Household Cleaning Bathroom
9 Household : Cleaning : Kitchen 25 Household Cleaning Kitchen
10 Household : Rent 400 Household Rent NaN
4 Household : Utilities 600 Household Utilities NaN
6 Household : Utilities : Electric 200 Household Utilities Electric
5 Household : Utilities : Water 400 Household Utilities Water
11 Living 250 Living NaN NaN
15 Living : Anitsomething -1000 Living Anitsomething NaN
13 Living : Food 100 Living Food NaN
12 Living : Other 150 Living Other NaN
14 Living : Something 1000 Living Something NaN
0 Transport 5000 Transport NaN NaN
1 Transport : Car 4900 Transport Car NaN
2 Transport : Train 100 Transport Train NaN
OK, took a while to nut out, but now I'm pretty sure this works. Much faster than the recursive method, too.
def _sort_tree_df(self, df, tree_column, sort_column, delimeter=':'):
    df = df.copy()
    parts = df[tree_column].str.split(delimeter).apply(lambda x: [y.strip() for y in x]).apply(pd.Series)
    for i, column in enumerate(parts.columns):
        df[column] = parts[column]
    sort_columns = [df[tree_column].values]
    sort_columns.append(df[sort_column].abs().values)
    df['level'] = df[tree_column].str.count(delimeter)
    for x in range(len(parts.columns), 0, -1):
        group_columns = list(range(0, x))
        sorting_by = df.copy()
        # Only rows at level x-1 should contribute to the group sums; blank out the rest
        sorting_by.loc[sorting_by['level'] != x - 1, sort_column] = np.nan
        sorting_by = sorting_by.groupby(group_columns)[sort_column].transform('sum').abs().values
        sort_columns.append(sorting_by)
    sort_indexes = np.lexsort(sort_columns)
    df_sorted = df.iloc[sort_indexes[::-1]]
    df_sorted.reset_index(drop=True, inplace=True)
    # Drop the helper columns from the sorted frame (dropping them from df would not affect df_sorted)
    df_sorted.drop([column for column in parts.columns], inplace=True, axis=1)
    df_sorted.drop('level', inplace=True, axis=1)
    return df_sorted

Pandas: Using Append Adds New Column and Makes Another All NaN

I just started learning pandas a week ago or so and I've been struggling with a pandas dataframe for a bit now. My data looks like this:
State NY CA Other Total
Year
2003 450 50 25 525
2004 300 75 5 380
2005 500 100 100 700
2006 250 50 100 400
I made this table from a dataset that included 30 or so values for the variable I'm representing as State here. If they weren't NY or CA, in the example, I summed them and put them in an 'Other' category. The years here were made from a normalized list of dates (originally mm/dd/yyyy and yyyy-mm-dd) like so, in case this is contributing to my issue:
dict = {'Date': pd.to_datetime(my_df.Date).dt.year}
and later:
my_df = my_df.rename_axis('Year')
I'm trying now to append a row at the bottom that shows the totals in each category:
final_df = my_df.append({'Year': 'Total',
                         'NY': my_df.NY.sum(),
                         'CA': my_df.CA.sum(),
                         'Other': my_df.Other.sum(),
                         'Total': my_df.Total.sum()},
                        ignore_index=True)
This does technically work, but it makes my table look like this:
NY CA Other Total State
0 450 50 25 525 NaN
1 300 75 5 380 NaN
2 500 100 100 700 NaN
3 250 50 100 400 NaN
4 a b c d Total
('a' and so forth are the actual totals of the columns.) It adds a column at the beginning and puts my 'Year' column at the end. In fact, it removes the 'Date' label as well, and turns all the years in the last column into NaNs.
Is there any way I can get this formatted properly? Thank you for your time.
I believe you need to create a Series by sum and rename it:
final_df = my_df.append(my_df.sum().rename('Total'))
print (final_df)
NY CA Other Total
State
2003 450 50 25 525
2004 300 75 5 380
2005 500 100 100 700
2006 250 50 100 400
Total 1500 275 230 2005
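Note: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; an equivalent with pd.concat would be (a sketch):
final_df = pd.concat([my_df, my_df.sum().rename('Total').to_frame().T])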
Another solution is to use loc for setting with enlargement:
my_df.loc['Total'] = my_df.sum()
print (my_df)
NY CA Other Total
State
2003 450 50 25 525
2004 300 75 5 380
2005 500 100 100 700
2006 250 50 100 400
Total 1500 275 230 2005

Specifying column order following groupby aggregation

The ordering of my age, height and weight columns is changing with each run of the code. I need to keep the order of my agg columns static because I ultimately refer to this output file according to the column locations. What can I do to make sure age, height and weight are output in the same order every time?
d = pd.read_csv(input_file, na_values=[''])
df = pd.DataFrame(d)
df.index_col = ['name', 'address']
df_out = df.groupby(df.index_col).agg({'age':np.mean, 'height':np.sum, 'weight':np.sum})
df_out.to_csv(output_file, sep=',')
I think you can use subset:
df_out = (df.groupby(df.index_col)
            .agg({'age': np.mean, 'height': np.sum, 'weight': np.sum})[['age', 'height', 'weight']])
You can also use string aliases and built-in functions instead of the numpy ones:
df_out = (df.groupby(df.index_col)
            .agg({'age': 'mean', 'height': sum, 'weight': sum})[['age', 'height', 'weight']])
Sample:
df = pd.DataFrame({'name': ['q', 'q', 'a', 'a'],
                   'address': ['a', 'a', 's', 's'],
                   'age': [7, 8, 9, 10],
                   'height': [1, 3, 5, 7],
                   'weight': [5, 3, 6, 8]})
print (df)
address age height name weight
0 a 7 1 q 5
1 a 8 3 q 3
2 s 9 5 a 6
3 s 10 7 a 8
df.index_col = ['name', 'address']
df_out = (df.groupby(df.index_col)
            .agg({'age': 'mean', 'height': sum, 'weight': sum})[['age', 'height', 'weight']])
print (df_out)
age height weight
name address
a s 9.5 12 14
q a 7.5 4 8
EDIT, by suggestion: add reset_index; as_index=False does not work here if you need the index values too:
df_out = (df.groupby(df.index_col)
            .agg({'age': 'mean', 'height': sum, 'weight': sum})[['age', 'height', 'weight']]
            .reset_index())
print (df_out)
name address age height weight
0 a s 9.5 12 14
1 q a 7.5 4 8
If you care mostly about the order when written to a file, and not while it's still in a DataFrame object, you can set the columns parameter of the to_csv() method:
>>> df = pd.DataFrame(
...     {'age': [28, 63, 28, 45],
...      'height': [183, 156, 170, 201],
...      'weight': [70.2, 62.5, 65.9, 81.0],
...      'name': ['Kim', 'Pat', 'Yuu', 'Sacha']},
...     columns=['name', 'age', 'weight', 'height'])
>>> df
name age weight height
0 Kim 28 70.2 183
1 Pat 63 62.5 156
2 Yuu 28 65.9 170
3 Sacha 45 81.0 201
>>> df_out = df.groupby(['age'], as_index=False).agg(
...     {'weight': sum, 'height': sum})
>>> df_out
age height weight
0 28 353 136.1
1 45 201 81.0
2 63 156 62.5
>>> df_out.to_csv('out.csv', sep=',', columns=['age','height','weight'])
out.csv then looks like this:
,age,height,weight
0,28,353,136.10000000000002
1,45,201,81.0
2,63,156,62.5
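If the float artifact (136.10000000000002) bothers you, to_csv also accepts a float_format (an optional tweak, not part of the original answer):
df_out.to_csv('out.csv', sep=',', columns=['age', 'height', 'weight'], float_format='%.1f')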

How to concat a dataframe in pandas?

I am fetching data from MongoDB into Python through pymongo and then converting it into a pandas dataframe:
df = pd.DataFrame(list(db.dataset2.find()))
This is how the data looks in MongoDB.
"dish" : [
{
"dish_id" : "005" ,
"dish_name" : "Sandwitch",
"dish_price" : 50,
"coupon_applied" : "Yes",
"coupon_type" : "Rs 20 off"
},
{
"dish_id" : "006" ,
"dish_name" : "Chicken Hundi",
"dish_price" : 125,
"coupon_applied" : "No",
"coupon_type" : "Null"
}
],
I want to separate the dish attributes into two rows in a pandas dataframe. Here is the code which does that. (There are 3 dish documents, so I am iterating through them with a for loop.)
for i in range(0, len(df.dish)):
    data_dish = json_normalize(df['dish'][i])
    print(data_dish)
But it gives me the output below:
coupon_applied coupon_type dish_id dish_name dish_price
0 Yes Rs 20 off 001 Chicken Biryani 120
1 No Null 001 Paneer Biryani 100
coupon_applied coupon_type dish_id dish_name dish_price
0 Yes Rs 40 off 002 Mutton Biryani 130
1 No Null 004 Aaloo tikki 95
coupon_applied coupon_type dish_id dish_name dish_price
0 Yes Rs 20 off 005 Sandwitch 50
1 No Null 006 Chicken Hundi 125
And I want the output in the following format:
coupon_applied coupon_type dish_id dish_name dish_price
0 Yes Rs 20 off 001 Chicken Biryani 120
1 No Null 001 Paneer Biryani 100
2 Yes Rs 40 off 002 Mutton Biryani 130
3 No Null 004 Aaloo tikki 95
4 Yes Rs 20 off 005 Sandwitch 50
5 No Null 006 Chicken Hundi 125
Can you help me with this? Thanks in advance :)
There is a simpler way:
dishes = [json_normalize(d) for d in df['dish']]
df = pd.concat(dishes, ignore_index=True)
You should be able to get a list of dataframes in a list and then concat them.
Initialize a new DataFrame:
df = pd.DataFrame()
Create an empty list of Dataframes:
dflist = []
Loop and append dataframes
for i in range(0, len(df.dish)):
    data_dish = json_normalize(df['dish'][i])
    dflist.append(data_dish)
Then concat the list into the full dataframe:
df = pd.concat(dflist, ignore_index=True)
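Alternatively (a sketch, assuming pandas >= 1.0, where json_normalize is available as pd.json_normalize), you can flatten all the dish lists first and normalize once:
import pandas as pd

# Each element of df['dish'] is a list of dish dicts; flatten, then normalize once.
all_dishes = [dish for dishes in df['dish'] for dish in dishes]
result = pd.json_normalize(all_dishes)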
