I want to get the sum of the 'Amount' for each 'Group' that has at least one 'Customer' with an 'Active' 'Bail'.
Sample input:
Customer ID Group Bail Amount
0 23453 NAFNAF Active 200
1 23849 LINDT Active 350
2 23847 NAFNAF Inactive 100
3 84759 CARROUF Inactive 20
For example, 'NAFNAF' has 2 customers, including one with an active Bail.
Expected output:
NAFNAF : 300
LINDT : 350
TOTAL ACTIVE: 650
I don't want to change the original dataframe.
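For reference, the sample frame can be built like this (a sketch; column names taken from the table above):
import pandas as pd

df = pd.DataFrame({
    'Customer ID': [23453, 23849, 23847, 84759],
    'Group': ['NAFNAF', 'LINDT', 'NAFNAF', 'CARROUF'],
    'Bail': ['Active', 'Active', 'Inactive', 'Inactive'],
    'Amount': [200, 350, 100, 20],
})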
You can use:
(df.assign(Bail=df.Bail.eq('Active'))                 # flag active bails as True (summed as 1)
   .groupby('Group')[['Bail', 'Amount']].agg('sum')
   .loc[lambda d: d['Bail'].ge(1), ['Amount']]        # keep groups with at least one active bail
)
output:
Amount
Group
LINDT 350
NAFNAF 300
Full output with total:
df2 = (
    df.assign(Bail=df.Bail.eq('Active'))
      .groupby('Group')[['Bail', 'Amount']].agg('sum')
      .loc[lambda d: d['Bail'].ge(1), ['Amount']]
)
# append a TOTAL row
df2 = pd.concat([df2, df2.sum().to_frame('TOTAL').T])
output:
Amount
LINDT 350
NAFNAF 300
TOTAL 650
Create a boolean mask of the Groups with at least one 'Active' Bail:
m = df['Group'].isin(df.loc[df['Bail'].eq('Active'), 'Group'])
out = df[m]
At this point, your filtered dataframe looks like:
>>> out
Customer ID Group Bail Amount
0 23453 NAFNAF Active 200
1 23849 LINDT Active 350
2 23847 NAFNAF Inactive 100
Now you can use groupby and sum:
out = df[m].groupby('Group')['Amount'].sum()
out = pd.concat([out, pd.Series(out.sum(), index=['TOTAL ACTIVE'])])
# Output
LINDT 350
NAFNAF 300
TOTAL ACTIVE 650
dtype: int64
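If you don't need the intermediate mask, groupby().filter() is a more concise alternative (a sketch, not part of the answers above):
out = (df.groupby('Group')
         .filter(lambda g: g['Bail'].eq('Active').any())   # keep groups with an active bail
         .groupby('Group')['Amount'].sum())
out.loc['TOTAL ACTIVE'] = out.sum()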
Related
I have a dataframe mortgage_data with columns name, mortgage_amount and month (in ascending order).
mortgage_amount_paid = 1000
mortgage_data:
name mortgage_amount month
mark 400 1
mark 500 2
mark 200 3
How do I deduct mortgage_amount_paid from mortgage_amount row by row, in ascending order of month,
and add a column paid_status marking whether each month's amount was fully covered ('full'), partially covered ('partial'), or not covered at all ('zero'), like this:
if mortgage_amount_paid = 1000
mortgage_data:
name mortgage_amount month mortgage_amount_updated paid_status
mark 400 1 0 full
mark 500 2 0 full
mark 200 3 100 partial
Another example:
if mortgage_amount_paid = 600
mortgage_data:
name mortgage_amount month mortgage_amount_updated paid_status
mark 400 1 0 full
mark 500 2 300 partial
mark 200 3 200 zero
I tried this:
import numpy as np
mortgage_amount_paid = 1000
df['mortgage_amount_updated'] = np.where(mortgage_amount_paid - df['mortgage_amount'].cumsum() >=0 , 0, df['mortgage_amount'].cumsum() - mortgage_amount_paid)
df['paid_status'] = np.where(df['mortgage_amount_updated'],'full','partial')
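For reproducibility, the sample frame can be built like this (a sketch based on the table in the question):
import pandas as pd

df = pd.DataFrame({
    'name': ['mark', 'mark', 'mark'],
    'mortgage_amount': [400, 500, 200],
    'month': [1, 2, 3],
})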
IIUC, you can use masks:
mortgage_amount_paid = 600
# amount saved - debt
m1 = df['mortgage_amount'].cumsum().sub(mortgage_amount_paid)
# is it positive?
m2 = m1>0
# is the previous month also positive?
m3 = m2.shift(fill_value=False)
df['mortgage_amount_updated'] = (m1.clip(0, mortgage_amount_paid)
                                   .mask(m3, df['mortgage_amount']))
df['paid_status'] = np.select([m3, m2], ['zero', 'partial'], 'full')
output:
name mortgage_amount month mortgage_amount_updated paid_status
0 mark 400 1 0 full
1 mark 500 2 300 partial
2 mark 200 3 200 zero
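Running the same masks with mortgage_amount_paid = 1000 reproduces the first expected output:
mortgage_amount_paid = 1000
m1 = df['mortgage_amount'].cumsum().sub(mortgage_amount_paid)   # -600, -100, 100
m2 = m1 > 0                                                     # False, False, True
m3 = m2.shift(fill_value=False)                                 # False, False, False
df['mortgage_amount_updated'] = m1.clip(0, mortgage_amount_paid).mask(m3, df['mortgage_amount'])
df['paid_status'] = np.select([m3, m2], ['zero', 'partial'], 'full')
# -> mortgage_amount_updated: 0, 0, 100; paid_status: full, full, partial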
The idea is that the cumulative sum before the 'partial' row must be less than mortgage_amount_paid, and there can be at most one 'partial' row.
mortgage_amount_paid = 600
m = df['mortgage_amount'].cumsum()
df['paid_status'] = np.select(
    [m <= mortgage_amount_paid,
     (m > mortgage_amount_paid) & (m.shift() < mortgage_amount_paid)],
    ['full', 'partial'],
    default='zero'
)
df['mortgage_amount_updated'] = np.select(
    [df['paid_status'].eq('full'),
     df['paid_status'].eq('partial')],
    [0, m - mortgage_amount_paid],
    default=df['mortgage_amount']
)
print(df)
name mortgage_amount month paid_status mortgage_amount_updated
0 mark 400 1 full 0
1 mark 500 2 partial 300
2 mark 200 3 zero 200
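If the real data contains several names, the same idea works per name by taking the cumulative sum within each group; a sketch, assuming a single mortgage_amount_paid applies to every name:
m = df.groupby('name')['mortgage_amount'].cumsum()
prev = m - df['mortgage_amount']   # cumulative sum of the previous rows within each name
df['paid_status'] = np.select(
    [m <= mortgage_amount_paid,
     (m > mortgage_amount_paid) & (prev < mortgage_amount_paid)],
    ['full', 'partial'],
    default='zero'
)
df['mortgage_amount_updated'] = np.select(
    [df['paid_status'].eq('full'), df['paid_status'].eq('partial')],
    [0, m - mortgage_amount_paid],
    default=df['mortgage_amount']
)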
I have two dataframes. The first one looks like this:
id orderdate orderid amount camp1 camp2 camp3
1 2020-01-01 100 100 1 0 0
2 2020-02-01 120 200 1 0 1
3 2019-12-01 130 500 0 1 0
4 2019-11-01 150 750 0 1 0
5 2020-01-01 160 1000 1 1 1
camp1, camp2 and camp3 show whether the customer participated in that campaign,
and the campaigns have a period dictionary:
from datetime import datetime

camp_periods = {
    'camp1': [datetime.strptime('2019-04-08', '%Y-%m-%d'),
              datetime.strptime('2019-06-06', '%Y-%m-%d')],
    'camp2': [datetime.strptime('2019-09-15', '%Y-%m-%d'),
              datetime.strptime('2019-09-28', '%Y-%m-%d')],
    'camp3': [datetime.strptime('2019-11-15', '%Y-%m-%d'),
              datetime.strptime('2019-12-28', '%Y-%m-%d')],
}
I would like to create a table giving the number of orders and the total order amount per customer, if the orderdate is between the campaign periods in the camp_periods dictionary and the customer participated in that campaign.
I'm not sure I understood your question correctly; I guess that by "the number of orders and total of order amounts" you mean the first n orders whose cumulative amount is less than or equal to a given total. Here is my approach:
data example:
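A sketch reproducing the question's table (orderdate is parsed to datetime, which between() requires):
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'orderdate': pd.to_datetime(['2020-01-01', '2020-02-01', '2019-12-01',
                                 '2019-11-01', '2020-01-01']),
    'orderid': [100, 120, 130, 150, 160],
    'amount': [100, 200, 500, 750, 1000],
    'camp1': [1, 1, 0, 0, 1],
    'camp2': [0, 0, 1, 1, 1],
    'camp3': [0, 1, 0, 0, 1],
})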
from operator import or_
from functools import reduce
number_orders = 2
total_order_amounts = 3000
# keep rows where orderdate is between the campaign periods in the camp_periods
# dictionary and the customer participated in that campaign
cond = [(df[k].astype('bool') & df['orderdate'].between(*v)) for k, v in camp_periods.items()]
cond = reduce(or_, cond)
df_cond = df[cond]
df_final = df_cond[df_cond['amount'].cumsum() <= total_order_amounts].head(number_orders)
df_final
I process a single file which contains the columns date, id, product, sale, delivery.
I want to sum up multiple rows if they occurred on the same day for the same id and product.
data = data.groupby(['Date', 'Id', 'Product']).agg({'sale': 'sum'}).reset_index()
How can I also sum up the deliveries in the line above?
eg:
Input
Date Id Product Sale deliveries
01/09 1000 A 1000 500
01/09 1000 A 350 0
02/09 1001 B 1100 555
02/09 1001 B 333 222
output
Date Id Product Sale deliveries
01/09 1000 A 1350 500
02/09 1001 B 1433 777
You can specify the columns in a list for aggregation:
data = data.groupby(['Date', 'Id', 'Product'])[['Sale', 'deliveries']].sum().reset_index()
If you need to process all the other columns, an aggregation is necessary for each of them, e.g. mean, first or last:
data = (data.groupby(['Date', 'Id', 'Product'])
            .agg({'Sale': 'sum', 'deliveries': 'sum', 'another col': 'mean'})
            .reset_index())
Just do a sum:
df.groupby(["Date","Id","Product"])["Sale","deliveries"].sum().reset_index()
#
Date Id Product Sale deliveries
0 01/09 1000 A 1350 500
1 02/09 1001 B 1433 777
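On recent pandas versions (0.25+), named aggregation is another option and avoids the deprecated tuple indexing entirely (a sketch):
data = (data.groupby(['Date', 'Id', 'Product'], as_index=False)
            .agg(Sale=('Sale', 'sum'), deliveries=('deliveries', 'sum')))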
My Dataframe looks like this:
campaign_name campaign_id event_name clicks installs conversions
campaign_1 1234 registration 100 5 1
campaign_1 1234 hv_users_r 100 5 2
campaign_2 2345 registration 500 10 3
campaign_2 2345 hv_users_w 500 10 2
campaign_3 3456 registration 1000 50 10
campaign_4 3456 hv_users_r 1000 50 15
campaign_4 3456 hv_users_w 1000 50 25
I want to categorize all the event_name values into 2 new columns, where the 1st new column represents "registration" and the 2nd new column represents "hv_users", which will be the sum of all rows with an event_name of "hv_users_r" or "hv_users_w".
To keep this simple - "registration" column will have rows which only have event_name as "registration". All non "registration" event_names would go into the new column "hv_users".
This is my expected new Dataframe:
campaign_name campaign_id clicks installs registrations hv_users
campaign_1 1234 100 5 1 2
campaign_2 2345 500 10 3 2
campaign_3 3456 1000 50 10 40
Can someone please give me directions on how to go from the input DataFrame to the output DataFrame?
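For reference, the input frame can be reproduced like this (a sketch of the table above):
import pandas as pd

df = pd.DataFrame({
    'campaign_name': ['campaign_1', 'campaign_1', 'campaign_2', 'campaign_2',
                      'campaign_3', 'campaign_4', 'campaign_4'],
    'campaign_id': [1234, 1234, 2345, 2345, 3456, 3456, 3456],
    'event_name': ['registration', 'hv_users_r', 'registration', 'hv_users_w',
                   'registration', 'hv_users_r', 'hv_users_w'],
    'clicks': [100, 100, 500, 500, 1000, 1000, 1000],
    'installs': [5, 5, 10, 10, 50, 50, 50],
    'conversions': [1, 2, 3, 2, 10, 15, 25],
})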
df['hv_users'] = df.conversions.where(df.event_name.str.match(r'hv_users_[rw]'), 0)
df['registrations'] = df.conversions.where(df.event_name == 'registration', 0)
df.hv_users = df.groupby('campaign_id').hv_users.transform('sum')
df = df.groupby('campaign_id').head(1).drop('event_name', axis=1)
You can use split + join, then groupby + unstack:
df.assign(event_name=df['event_name'].apply(lambda x: "_".join(x.split("_", 2)[:2]))).\
    groupby(['campaign_name', 'campaign_id', 'clicks', 'installs', 'event_name'])['conversions'].sum().\
    unstack(fill_value=0).reset_index()
Out[302]:
event_name campaign_name campaign_id clicks installs hv_users registration
0 campaign_1 1234 100 5 2 1
1 campaign_2 2345 500 10 2 3
2 campaign_3 3456 1000 50 0 10
3 campaign_4 3456 1000 50 40 0
pd.crosstab() and pd.pivot_table() should do the trick.
#df is your input dataframe
replacement = {'hv_users_w': 'hv_users', 'hv_users_r': 'hv_users', 'registration': 'registration'}
df.event_name = df.event_name.map(replacement)
# sum conversions per campaign and event type (without values/aggfunc, crosstab would only count rows)
df1 = pd.crosstab(df.campaign_name, df.event_name, values=df.conversions, aggfunc='sum')
df2 = pd.pivot_table(df, index='campaign_name', values=['campaign_id', 'clicks', 'installs'])
output = pd.concat([df1, df2], axis=1)
Try using pivot_table:
mask = df['event_name'].str.contains('_')
df.loc[mask, 'event_name'] = df.loc[mask, 'event_name'].str.extract('(.*_.*)_.*', expand=False)
new_df = df.pivot_table(index=['campaign_name', 'campaign_id', 'clicks', 'installs'],
                        columns='event_name', values='conversions',
                        aggfunc='sum', fill_value=0).reset_index().rename_axis(None, axis=1)
campaign_name campaign_id clicks installs hv_users registration
0 campaign_1 1234 100 5 2 1
1 campaign_2 2345 500 10 2 3
2 campaign_3 3456 1000 50 0 10
3 campaign_4 3456 1000 50 40 0
I'm using Pandas 0.10.1
Considering this Dataframe:
Date State City SalesToday SalesMTD SalesYTD
20130320 stA ctA 20 400 1000
20130320 stA ctB 30 500 1100
20130320 stB ctC 10 500 900
20130320 stB ctD 40 200 1300
20130320 stC ctF 30 300 800
How can I group subtotals per state?
State City SalesToday SalesMTD SalesYTD
stA ALL 50 900 2100
stA ctA 20 400 1000
stA ctB 30 500 1100
I tried with a pivot table but I can only get subtotals in columns:
table = pivot_table(df, values=['SalesToday', 'SalesMTD','SalesYTD'],\
rows=['State','City'], aggfunc=np.sum, margins=True)
I can achieve this in Excel with a pivot table.
If you put State in the rows and City in the columns rather than both in the rows, you'll get separate margins. Reshape and you get the table you're after:
In [10]: table = pivot_table(df, values=['SalesToday', 'SalesMTD','SalesYTD'],\
rows=['State'], cols=['City'], aggfunc=np.sum, margins=True)
In [11]: table.stack('City')
Out[11]:
SalesMTD SalesToday SalesYTD
State City
stA All 900 50 2100
ctA 400 20 1000
ctB 500 30 1100
stB All 700 50 2200
ctC 500 10 900
ctD 200 40 1300
stC All 300 30 800
ctF 300 30 800
All All 1900 130 5100
ctA 400 20 1000
ctB 500 30 1100
ctC 500 10 900
ctD 200 40 1300
ctF 300 30 800
I admit this isn't totally obvious.
You can get the summarized values by using groupby() on the State column.
Let's make some sample data first:
import pandas as pd
import StringIO
incsv = StringIO.StringIO("""Date,State,City,SalesToday,SalesMTD,SalesYTD
20130320,stA,ctA,20,400,1000
20130320,stA,ctB,30,500,1100
20130320,stB,ctC,10,500,900
20130320,stB,ctD,40,200,1300
20130320,stC,ctF,30,300,800""")
df = pd.read_csv(incsv, index_col=['Date'], parse_dates=True)
Then apply the groupby function and add a column City:
dfsum = df.groupby('State', as_index=False).sum()
dfsum['City'] = 'All'
print dfsum
State SalesToday SalesMTD SalesYTD City
0 stA 50 900 2100 All
1 stB 50 700 2200 All
2 stC 30 300 800 All
We can append the original data to the summed df by using append:
dfsum = dfsum.append(df).set_index(['State','City']).sort_index()
print dfsum
SalesMTD SalesToday SalesYTD
State City
stA All 900 50 2100
ctA 400 20 1000
ctB 500 30 1100
stB All 700 50 2200
ctC 500 10 900
ctD 200 40 1300
stC All 300 30 800
ctF 300 30 800
I added the set_index and sort_index to make it look more like your example output; it's not strictly necessary to get the results.
I think this subtotal example code is what you want (similar to Excel's subtotal).
I assume that you want to group by columns A, B, C, D, then count the values of column E.
main_df.groupby(['A', 'B', 'C']).apply(
    lambda sub_df: sub_df.pivot_table(index=['D'], values=['E'],
                                      aggfunc='count', margins=True))
output:
E
A B C D
a a a a 1
b 2
c 2
all 5
b b a a 3
b 2
c 2
all 7
b b b a 3
b 6
c 2
d 3
all 14
How about this one?
table = pd.pivot_table(data, index=['State'], columns=['City'],
                       values=['SalesToday', 'SalesMTD', 'SalesYTD'],
                       aggfunc=np.sum, margins=True)
If you are interested, I have just created a little function to make this easier, as you might want to apply 'subtotal' to many tables. It works for tables created via both pivot_table() and groupby(). An example of a table to use it on is provided in this Stack Overflow question: Sub Total in pandas pivot Table
def get_subtotal(table, sub_total='subtotal', get_total=False, total='TOTAL'):
    """
    Parameters
    ----------
    table : DataFrame
        Table with a two-level index, resulting from pd.pivot_table() or
        df.groupby().
    sub_total : str, optional
        Name given to the subtotal. The default is 'subtotal'.
    get_total : boolean, optional
        Whether to add the final total (in case you used groupby()).
        The default is False.
    total : str, optional
        Name given to the total. The default is 'TOTAL'.

    Returns
    -------
    A table with the total and subtotal added.
    """
    index_name1 = table.index.names[0]
    index_name2 = table.index.names[1]
    pvt = table.unstack(0)
    # ignore a pre-existing 'All' margin when summing
    mask = pvt.columns.get_level_values(index_name1) != 'All'
    pvt.loc[sub_total] = pvt.loc[:, mask].sum()
    pvt = pvt.stack().swaplevel(0, 1).sort_index()
    # move the first column to the end (keeps an 'All' margin last)
    pvt = pvt[pvt.columns[1:].tolist() + pvt.columns[:1].tolist()]
    if get_total:
        mask = pvt.index.get_level_values(index_name2) != sub_total
        pvt.loc[(total, ''), :] = pvt.loc[mask].sum()
    print(pvt)
    return pvt
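For example, applied to a two-level groupby result built from the State/City data above (a usage sketch, not part of the original answer):
table = df.groupby(['State', 'City'])[['SalesToday', 'SalesMTD', 'SalesYTD']].sum()
result = get_subtotal(table, get_total=True)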
table = pd.pivot_table(df, index=['A'], values=['B', 'C'], columns=['D', 'E'],
                       fill_value='0', aggfunc=np.sum,  # or 'count', 'mean', etc.
                       margins=True, margins_name='Total')
print(table)