Pandas Dataframe: Accessing via composite index created by groupby operation - python

I want to calculate a group-specific ratio from two datasets.
The two Dataframes are read from a database with
leases = pd.read_sql_query(sql, connection)
sales = pd.read_sql_query(sql, connection)
one for real estate offered for sale, the other for rented objects.
Then I group both of them by their city and the category I'm interested in:
leasegroups = leases.groupby(['IDconjugate', "city"])
salegroups = sales.groupby(['IDconjugate', "city"])
Now I want to know the ratio between the cheapest rental object per category and city and the most expensively sold object to obtain a lower bound for possible return:
minlease = leasegroups['price'].min()
maxsale = salegroups['price'].max()
ratios = minlease*12/maxsale
I get an output like: Category - City: Ratio
But I cannot access the ratio object by city or category. I tried creating a new dataframe with:
newframe = pd.DataFrame({"Minleases" : minlease,"Maxsales" : maxsale,"Ratios" : ratios})
newframe = newframe.loc[newframe['Ratios'].notnull()]
which gives me the correct rows, and newframe.index returns the groups.
index.names gives ['IDconjugate', 'city'], but indexing results in a KeyError. How can I make an index out of the different groups: ID0+city1, ID0+city2, etc.?
EDIT:
The output looks like this:
                               Maxsales  Minleases    Ratios
IDconjugate city
1           argeles gazost        59500        337  0.067966
            chelles              129000        519  0.048279
            enghien-les-bains    143000        696  0.058406
            esbly                117990        495  0.050343
            foix                  58000        350  0.072414
The goal was to select the top ratios and plot them with bokeh, which takes a
dataframe object and plots a column versus an index as I understand it:
topselect = ratio.loc[ratio["Ratios"] > ratio["Ratios"].quantile(quant)]
dots = Dot(topselect, values='Ratios', label=topselect.index, tools=[hover,],
title="{}% best minimal Lease/Sale Ratios per City and Group".format(topperc*100), width=600)
I really only needed the index as a list in the original order, so the following worked:
ids = []
cities = []
for l in topselect.index:
    ids.append(str(int(l[0])))
    cities.append(l[1])
newind = [i+"_"+j for i,j in zip(ids, cities)]
topselect.index = newind
Now the plot shows 1_city1 ... 1_city2 ... n_cityX on the x-axis. But I figure there must be some obvious way inside the pandas framework that I'm missing.
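For completeness, pandas has idiomatic ways to do both the lookup and the flattening. A short sketch, assuming newframe as built above (the helper name flat is just for illustration):
# look up one (IDconjugate, city) group directly with a tuple
newframe.loc[(1, 'chelles')]
# take a cross-section of every row for one city, across all IDs
newframe.xs('chelles', level='city')
# turn the index levels back into ordinary columns
flat = newframe.reset_index()
# or build the "ID_city" labels without an explicit loop
newframe.index = newframe.index.map(lambda t: '{}_{}'.format(int(t[0]), t[1]))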

Related

Pandas - groupby and show aggregate on all "levels"

I am a Pandas newbie and I am trying to automate the processing of ticket data we get from our IT ticketing system. After experimenting I was able to get 80 percent of the way to the result I am looking for.
Currently I pull in the ticket data from a CSV into a "df" dataframe. I then want to summarize the data for the higher ups to review and get high level info like totals and average "age" of tickets (number of days between ticket creation date and current date).
Here's an example of the ticket data for "df" dataframe:
I then create "df2" dataframe to summarize df using:
df2 = df.groupby(["dept", "group", "assignee", "ticket_type"]).agg(task_count=('ticket_type', 'size'), mean_age_in_days=('age', 'mean'),)
And here's what I am getting if I print out df2, which is very close to what I need.
As you can see we look at the count of tickets assigned to each staff member, separated by type (incident, request), and also look at the average "age" of each ticket type (incident, request) for each staff member.
The roadblock that I am hitting now, and have been pulling my hair out about, is that I need to show the aggregates (count and average of ages) at all 3 levels (sorry if I am using the wrong jargon). Basically I need to show the count and average age for all tickets associated with a group, then the same thing for tickets at the department ("Division") level, and lastly the grand total and grand average in green, for all tickets in the entire organization (all tickets in all departments and groups).
Here's an example of the ideal result I am trying to get:
You will see in red I want the count of tickets and average age for tickets for a given group. Then, in blue I want the count and average age for all tickets on the dept/division level (all tickets for all groups belonging to a given dept./division). Lastly, I want the grand total and grand average for all tickets in the entire organization. In the end both the df2 (summary of ticket data) and df will be dumped to an Excel file on separate worksheets in the same workbook.
Please have mercy on me! Can someone show me how I could generate the desired "summary" with counts and average age at all levels (group, dept., and organization)? Thanks in advance for any assistance, I'd really, really appreciate it!
*Added link to CSV with sample ticket data below:
on Github
Also, here's raw CSV text for the sample ticket data:
,number,created_on,dept,group,assignee,ticket_type,age
0,14500,2021-02-19 11:48:28,IT_Services_Division,Helpdesk,Jane Doe,Incident,361
1,16890,2021-04-20 10:51:49,IT_Services_Division,Helpdesk,Jane Doe,Incident,120
2,16891,2021-04-20 11:51:00,IT_Services_Division,Helpdesk,Tilly James,Request,120
3,15700,2021-06-09 09:05:28,IT_Services_Division,Systems,Steve Lee,Incident,252
4,16000,2021-08-12 09:32:39,IT_Services_Division,Systems,Linda Nguyen,Request,188
5,16100,2021-08-18 17:43:54,IT_Services_Division,TechSupport,Joseph Wills,Incident,181
6,19000,2021-01-17 15:01:50,IT_Services_Division,TechSupport,Bill Gonzales,Request,30
7,18990,2021-01-10 13:00:01,IT_Services_Division,TechSupport,Bill Gonzales,Request,37
8,18800,2021-12-03 21:13:12,Data_Division,DataGroup,Bob Simpson,Incident,74
9,16880,2021-10-18 11:56:03,Data_Division,DataGroup,Bob Simpson,Request,119
10,18000,2021-11-09 14:28:44,IT_Services_Division,Systems,Veronica Paulson,Incident,98
Here's a different approach which is easier, but results in a different structure:
agg_df = df.copy()
#Add dept-level info to the department
gb = agg_df.groupby('dept')
task_counts = gb['ticket_type'].transform('count').astype(str)
mean_ages = gb['age'].transform('mean').round(2).astype(str)
agg_df['dept'] += ' ['+task_counts+' tasks, avg age= '+mean_ages+']'
#Add group-level info to the group label
gb = agg_df.groupby(['dept','group'])
task_counts = gb['ticket_type'].transform('count').astype(str)
mean_ages = gb['age'].transform('mean').round(2).astype(str)
agg_df['group'] += ' ['+task_counts+' tasks, avg age= '+mean_ages+']'
#Add org-level info
agg_df['org'] = 'Org [{} tasks, avg age = {}]'.format(len(agg_df),agg_df['age'].mean().round(2))
agg_df = (
    agg_df.groupby(['org','dept','group','assignee','ticket_type']).agg(
        task_count=('ticket_type','count'),
        mean_ticket_age=('age','mean'))
)
agg_df
Couldn't think of a cleaner way to get the structure you want, so I had to manually loop through the different groupby levels, adding one row at a time:
multi_ind = pd.MultiIndex.from_tuples([],names=('dept','group','assignee','ticket_type'))
agg_df = pd.DataFrame(index=multi_ind, columns=['task_count','mean_age_in_days'])
data = lambda df: {'task_count':len(df),'mean_age_in_days':df['age'].mean()}
for dept, dept_g in df.groupby('dept'):
    for group, group_g in dept_g.groupby('group'):
        for assignee, assignee_g in group_g.groupby('assignee'):
            for ticket_type, ticket_g in assignee_g.groupby('ticket_type'):
                #Add ticket totals
                agg_df.loc[(dept,group,assignee,ticket_type)] = data(ticket_g)
        #Add group totals
        agg_df.loc[(dept,group,assignee,'Group Total/Avg')] = data(group_g)
    #Add dept totals
    agg_df.loc[(dept,group,assignee,'Dept Total/Avg')] = data(dept_g)
#Add org totals
agg_df.loc[('','','','Org Total/Avg')] = data(df)
agg_df
Output
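A third option, sketched below, is to aggregate at each level separately and keep the pieces apart, e.g. to write them to separate Excel sheets. This is only a sketch reusing df and the column names from the question; the summarize helper is an illustrative name:
def summarize(frame, by):
    return frame.groupby(by).agg(task_count=('ticket_type', 'size'),
                                 mean_age_in_days=('age', 'mean'))

per_assignee = summarize(df, ['dept', 'group', 'assignee', 'ticket_type'])
per_group = summarize(df, ['dept', 'group'])
per_dept = summarize(df, ['dept'])
org_total = summarize(df.assign(org='Organization'), ['org'])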

How to iterate over column values for each group and track sum

I have 4 dataframes as given below
df_raw = pd.DataFrame(
{'stud_id' : [101, 101,101],
'prod_id':[12,13,16],
'total_qty':[100,1000,80],
'ques_date' : ['13/11/2020', '10/1/2018','11/11/2017']})
df_accu = pd.DataFrame(
{'stud_id' : [101,101,101],
'prod_id':[12,13,16],
'accu_qty':[10,500,10],
'accu_date' : ['13/08/2021','02/11/2019','17/12/2018']})
df_inv = pd.DataFrame(
{'stud_id' : [101,101,101],
'prod_id':[12,13,18],
'inv_qty':[5,100,15],
'inv_date' : ['16/02/2022', '22/11/2020','19/10/2019']})
df_bkl = pd.DataFrame(
{'stud_id' : [101,101,101,101],
'prod_id' :[12,12,12,17],
'bkl_qty' :[15,40,2,10],
'bkl_date':['16/01/2022', '22/10/2021','09/10/2020','25/06/2020']})
My objective is to find out the below
a) Get the date when threshold exceeds 50%
threshold is given by the formula below
threshold = (((df_inv['inv_qty']+df_bkl['bkl_qty']+df_accu['accu_qty'])/df_raw['total_qty'])*100)
We have to add in the same order: first inv_qty, then bkl_qty, and finally accu_qty. We do it this way in order to identify the correct date when they exceeded 50% of total_qty. Additionally, this has to be computed for each stud_id and prod_id.
But the problem is that df_bkl has multiple records for the same stud_id and prod_id, and that is by design; the real data also looks like this. In contrast, df_accu and df_inv will have only one row for each stud_id and prod_id.
In the above formula, we have to use each value of df_bkl['bkl_qty'] to compute the sum.
For example, let's take stud_id = 101 and prod_id = 12.
His total_qty = 100, inv_qty = 5, and accu_qty = 10, but he has three bkl_qty values: 15, 40 and 2. So the threshold has to be computed in a fashion like below:
5 (inv_qty) + 15 (1st bkl_qty) + 40 (2nd bkl_qty) + 2 (3rd bkl_qty) + 10 (accu_qty)
So, with the above, we know that his threshold exceeded 50% when his bkl_qty value was 40: 5 + 15 + 40 = 60, which is greater than 50% of total_qty (100).
I was trying something like below
df_stage_1 = df_raw.merge(df_inv,on=['stud_id','prod_id'], how='left').fillna(0)
df_stage_2 = df_stage_1.merge(df_bkl,on=['stud_id','prod_id'])
df_stage_3 = df_stage_2.merge(df_accu,on=['stud_id','prod_id'])
df_stage_3['threshold'] = ((df_stage_3['inv_qty'] + df_stage_3['bkl_qty'] + df_stage_3['accu_qty'])/df_stage_3['total_qty'])*100
But this is incorrect, as it does not go through bkl_qty value by value from df_bkl.
In this post I have shown only sample data with one stud_id = 101, but in reality I have thousands of stud_id and prod_id values.
Therefore, any elegant and efficient approach would be useful; we have to apply this logic to datasets with millions of records.
I expect my output to be as shown below: whenever the running sum exceeds 50% of total_qty, we need to get that corresponding date.
stud_id  prod_id  total_qty  threshold  threshold_date
101      12       100        72         22/10/2021
It can be achieved using groupby and cumsum, which does a cumulative sum within each group.
# add a cumulative sum column to df_bkl (do this before running the merges above, so that df_stage_3 carries the csum column)
df_bkl['csum'] = df_bkl.groupby(['stud_id','prod_id'])['bkl_qty'].cumsum()
# use the cumulative sum csum to compute the threshold instead of bkl_qty
df_stage_3['threshold'] = ((df_stage_3['inv_qty'] + df_stage_3['csum'] + df_stage_3['accu_qty'])/df_stage_3['total_qty'])*100
# check if inv_qty already exceeds threshold
df_stage_3.loc[df_stage_3.inv_qty > df_stage_3.total_qty/2, 'bkl_date'] = df_stage_3['inv_date']
# next doing some filter and merge to arrive at the desired df
gt_thres = df_stage_3[df_stage_3['threshold'] > df_stage_3['total_qty']/2]
df_f1 = gt_thres.groupby(['stud_id','prod_id','total_qty'])['threshold'].min().to_frame(name='threshold').reset_index()
df_f2 = gt_thres.groupby(['stud_id','prod_id','total_qty'])['threshold'].max().to_frame(name='threshold_max').reset_index()
df = pd.merge(df_f1, df_stage_3, on=['stud_id','prod_id','total_qty','threshold'], how='inner')
df2 = pd.merge(df,df_f2, on=['stud_id','prod_id','total_qty'], how='inner')
df2 = df2[['stud_id','prod_id','total_qty','threshold','bkl_date']].rename(columns={'threshold_max':'threshold', 'bkl_date':'threshold_date'})
print(df2)
provides the output as:
stud_id prod_id total_qty threshold threshold_date
0 101 12 100 72.0 22/10/2021
Does this work?
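For reference, the same idea can also be written as a single pass: stack the quantities in the required order (inv, then bkl, then accu), take a cumulative sum per stud_id/prod_id, and keep the first row that crosses 50%. This is only a sketch assuming the column names from the question; qty, date, order, csum and running_pct are helper names introduced here, and groups that never cross 50% simply drop out:
parts = pd.concat([
    df_inv.rename(columns={'inv_qty': 'qty', 'inv_date': 'date'}).assign(order=0),
    df_bkl.rename(columns={'bkl_qty': 'qty', 'bkl_date': 'date'}).assign(order=1),
    df_accu.rename(columns={'accu_qty': 'qty', 'accu_date': 'date'}).assign(order=2),
], ignore_index=True)[['stud_id', 'prod_id', 'qty', 'date', 'order']]
parts = parts.sort_values(['stud_id', 'prod_id', 'order'], kind='stable')
# running total per (stud_id, prod_id) in the required order
parts['csum'] = parts.groupby(['stud_id', 'prod_id'])['qty'].cumsum()
parts = parts.merge(df_raw[['stud_id', 'prod_id', 'total_qty']], on=['stud_id', 'prod_id'])
parts['running_pct'] = parts['csum'] / parts['total_qty'] * 100
# first row per group where the running total exceeds 50% of total_qty
first_cross = (parts[parts['running_pct'] > 50]
               .groupby(['stud_id', 'prod_id'], as_index=False)
               .first()
               .rename(columns={'date': 'threshold_date'}))
print(first_cross[['stud_id', 'prod_id', 'total_qty', 'running_pct', 'threshold_date']])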

How to Merge Multilevel Column Dataframes on a Low Level Column

I have several small datasets from a database listing genes in different biological pathways. My end goal is to find which genes show up in different datasets. For this reason, I tried to make multilevel dataframes from each dataset and merge them on a single column. However, it is getting nowhere.
Test samples: https://www.mediafire.com/file/bks9i9unfci0h1f/sample.rar/file
Making multilevel columns:
import pandas as pd
df1 = pd.read_csv("Bacterial invasion of epithelial cells.csv")
df2 = pd.read_csv("C-type lectin receptor signaling pathway.csv")
df3 = pd.read_csv("Endocytosis.csv")
title1 = "Bacterial invasion of epithelial cells"
title2 = "C-type lectin receptor signaling pathway"
title3 = "Endocytosis"
final1 = pd.concat({title1: df1}, axis = 1)
final2 = pd.concat({title2: df2}, axis = 1)
final3 = pd.concat({title3: df3}, axis = 1)
I tried to use pandas.merge() to merge the dataframes on the "User ID" column:
pd.merge(final1, final2, on = "User ID", how = "outer")
But I get an error. I cannot use droplevel(), because I need the title on top so I can see which dataset each sample belongs to.
Any suggestion?
Seeing as you want to see which genes appear in different datasets, it sounds like an inner join might be more useful? With User ID as just a single row index.
df1 = pd.read_csv("Bacterial invasion of epithelial cells.csv").set_index('User ID')
df2 = pd.read_csv("C-type lectin receptor signaling pathway.csv").set_index('User ID')
df3 = pd.read_csv("Endocytosis.csv").set_index('User ID')
final1 = pd.concat({"Bacterial invasion of epithelial cells": df1}, axis = 1)
final2 = pd.concat({"C-type lectin receptor signaling pathway": df2}, axis = 1)
final3 = pd.concat({"Endocytosis": df3}, axis = 1)
final1.merge(final3, left_index=True, right_index=True)#.merge(final2, left_index=True, right_index=True)
Output:
         Bacterial invasion of epithelial cells                      Endocytosis
         Gene Symbol   Gene Name    Entrez Gene   Score    Gene Symbol   Gene Name    Entrez Gene   Score
User ID
P51636   CAV2          caveolin 2   858           1.3911   CAV2          caveolin 2   858           1.3911
Q03135   CAV1          caveolin 1   857           1.5935   CAV1          caveolin 1   857           1.5935
(I've commented out the second merge operation with final2 as there aren't any overlapping genes between it and the other two, but you can repeat that process with as many datasets as you like.)
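A related shortcut: if the goal is just the genes common to all of the tables, pd.concat can build the same two-level columns and do the inner join in one call. A sketch, using the indexed frames from above (with this particular sample the result would be empty, since no gene appears in all three files):
common = pd.concat(
    {"Bacterial invasion of epithelial cells": df1,
     "C-type lectin receptor signaling pathway": df2,
     "Endocytosis": df3},
    axis=1, join='inner')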

Pandas DataFrame: Adding a new column with the average price sold of an Author

I have this dataframe data where I have about 10,000 records of sold items for 201 authors.
I want to add a column to this dataframe which is the average price for each author.
First I create this new column average_price and then I create another dataframe df
with 201 rows of authors and their average price (at least I think this is the right way to do this).
data["average_price"] = 0
df = data.groupby('Author Name', as_index=False)['price'].mean()
df looks like this
Author Name price
0 Agnes Cleve 107444.444444
1 Akseli Gallen-Kallela 32100.384615
2 Albert Edelfelt 207859.302326
3 Albert Johansson 30012.000000
4 Albin Amelin 44400.000000
... ... ...
196 Waldemar Lorentzon 152730.000000
197 Wilhelm von Gegerfelt 25808.510638
198 Yrjö Edelmann 53268.928571
199 Åke Göransson 87333.333333
200 Öyvind Fahlström 351345.454545
Now I want to use this df to populate the average_price column in the larger dataframe data.
I could not figure out how to do this, so I tried a for loop, which is not working. (And I know you should avoid for loops when working with dataframes.)
for index, row in data.iterrows():
    for ind, r in df.iterrows():
        if row["Author Name"] == r["Author Name"]:
            row["average_price"] = r["price"]
So I wonder how this should be done?
You can use transform and groupby to add a new column:
data['average price'] = data.groupby('Author Name')['price'].transform('mean')
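transform returns a result aligned to the original index of data, so it can be assigned directly without any merge. A tiny self-contained example with made-up rows (column names chosen to match the question):
import pandas as pd

data = pd.DataFrame({'Author Name': ['A', 'B', 'A'], 'price': [10, 30, 20]})
data['average_price'] = data.groupby('Author Name')['price'].transform('mean')
print(data)
#   Author Name  price  average_price
# 0           A     10           15.0
# 1           B     30           30.0
# 2           A     20           15.0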
I think based on what you described, you should use the .join method on a Pandas dataframe. You don't need to create the 'average_price' column manually. This should work for your case (join matches on the index of the right frame, so set the author name as its index first):
df = data.groupby('Author Name', as_index=False)['price'].mean().rename(columns={'price':'average_price'})
data = data.join(df.set_index('Author Name'), on="Author Name")
Now you can get the average price from the data['average_price'] column.
Hope this could help!
I think the easiest way to do that would be using a merge (pandas.merge):
df_data = pd.DataFrame([...]) # your data here
df_agg_data = df_data.groupby('Author Name', as_index=False)['price'].mean().rename(columns={'price': 'average_price'})
df_data = df_data.merge(df_agg_data, on="Author Name")
print(df_data)

How to create a column in a data frame based on the values of another two columns?

I am pre-formatting some data for a tax filing and I am using python to automate some of the excel work. I have a data frame with three columns: Account; Opposite Account; Amount. I only have the names of the opposite account and the values, but for a matching account / opposite-account pair the amounts are the same apart from the sign. For example:
Account   Opposite Acc.   Amount
          Cash            -240.56
          Supplies         240.56
          Dentist          -10.45
          Gum               10.45
From that, I can deduce that Cash is the opposite of Supplies and Dentist is the opposite to Gum, so I would like my output to be:
Account   Opposite Acc.   Amount
Supplies  Cash            -240.56
Cash      Supplies         240.56
Gum       Dentist          -10.45
Dentist   Gum               10.45
Right now I am doing this manually using str.contains:
df = df.assign(en_accounts = df['Opposite Acc.'])
df['Account'] = df['Account'].fillna("0")
df.loc[df['Account'].str.contains('Cash'), 'Account'] = 'Supplies'
But there are many variables and I wonder if there is a way to automate this process in python. One strategy could be: if two rows add up to 0, the accounts are a match; therefore, when item A (such as Supplies) appears in "Opposite Acc.", item B (such as Cash) is put in the same row but in "Account".
This is what I have so far:
df['Amount'] = np.abs(df["Amount"])
c1 = df['Amount']
c2 = df['Opposing Acc.']
for i in range(1,len(c1)-1):
    p = c1[i-1]
    x = c1[i]
    n = c1[i+1]
    if p == x:
        for i in range(1,len(c2)-1):
            a = c2[i-1]
            df.loc[df['en_account']] = a
But I get the following error: "None of [Index[....]\n dtype='object', length=28554)] are in the [index]"
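As a possible way forward, the pairing idea described above (two rows whose amounts cancel out belong together) can be expressed without explicit loops by grouping on the absolute amount. A sketch only, assuming every absolute amount occurs in exactly one pair of rows and that the paired amounts match exactly (real data may need rounding first):
import pandas as pd

df = pd.DataFrame({'Opposite Acc.': ['Cash', 'Supplies', 'Dentist', 'Gum'],
                   'Amount': [-240.56, 240.56, -10.45, 10.45]})
# within each pair (same absolute amount), the Account is the other row's Opposite Acc.
df['Account'] = (df.groupby(df['Amount'].abs())['Opposite Acc.']
                   .transform(lambda s: s.iloc[::-1].to_numpy()))
print(df[['Account', 'Opposite Acc.', 'Amount']])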
