Using custom functions on pandas group by aggregating - python

I have a dataframe like this,
>>> data = {
'year':[2019, 2020, 2020, 2019, 2020, 2019],
'provider':['X', 'X', 'Y', 'Z', 'Z', 'T'],
'price':[100, 122, 0, 150, 120, 80],
'count':[20, 15, 24, 16, 24, 10]
}
>>> df = pd.DataFrame(data)
>>> df
year provider price count
0 2019 X 100 20
1 2020 X 122 15
2 2020 Y 0 24
3 2019 Z 150 16
4 2020 Z 120 24
5 2019 T 80 10
And this is expected output:
provider price_rate count_rate
0 X 0.22 -0.25
1 Z -0.20 0.50
I want to group by provider and find the relative change in price and count between 2019 and 2020.
If a provider has no price or count record for either 2019 or 2020, it should not appear in the output.

Assuming there are at most two rows per provider, we can first sort_values on year to make sure 2019 comes before 2020.
Then we groupby on provider, divide each row of price and count by the previous row, and subtract 1.
df = df.sort_values('year')
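grp = (
    # group_keys=False keeps the original row index, so the join below aligns on newer pandas versions
    df.groupby('provider', group_keys=False)
      .apply(lambda x: x[['price', 'count']].div(x[['price', 'count']].shift()).sub(1))
)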
dfnew = df[['provider']].join(grp).dropna()
provider price count
1 X 0.22 -0.25
4 Z -0.20 0.50
Or, using only vectorized methods:
dfnew = df[df['provider'].duplicated(keep=False)].sort_values(['provider', 'year'])
dfnew[['price', 'count']] = (
dfnew[['price', 'count']].div(dfnew[['price', 'count']].shift()).sub(1)
)
dfnew = dfnew[dfnew['provider'].eq(dfnew['provider'].shift())].drop('year', axis=1)
provider price count
1 X 0.22 -0.25
4 Z -0.20 0.50

You can try:
final = (df.set_index(['provider','year']).groupby(level=0)
         .pct_change().dropna().droplevel(1).add_suffix('_rate').reset_index())
provider price_rate count_rate
0 X 0.22 -0.25
1 Z -0.20 0.50
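For comparison, here is a minimal pivot-based sketch of the same calculation. It is not taken from the answers above; it assumes each provider has at most one row per year and that the two years of interest are exactly 2019 and 2020. The names wide, rates and result are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'year': [2019, 2020, 2020, 2019, 2020, 2019],
    'provider': ['X', 'X', 'Y', 'Z', 'Z', 'T'],
    'price': [100, 122, 0, 150, 120, 80],
    'count': [20, 15, 24, 16, 24, 10],
})

# One row per provider, one column per (measure, year); a missing year becomes NaN.
wide = df.pivot(index='provider', columns='year', values=['price', 'count'])

# Relative change from 2019 to 2020 for each measure.
rates = wide.xs(2020, axis=1, level='year') / wide.xs(2019, axis=1, level='year') - 1

# Providers missing either year produce NaN and are dropped here.
result = rates.dropna().add_suffix('_rate').round(2).reset_index()
print(result)
```

Unlike the shift-based versions, this does not depend on row order, because the years are matched by label rather than by position.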


Calculate Overlapping & Non-Overlapping Data Points in a Data frame across Years

I have a single DataFrame and need to find, for each toy, how many colors stay the same and how many are new from one year to the next.
For example: Toy1's color stays the same from 2019 to 2020, but in 2021 there are two toys, one red and one green. So from 2019 to 2020 the overlap count is 1 and the new count is 0. From 2020 to 2021 the overlap count is still 1 (red), and the new count is 1 (the added green toy).
Sample data is attached; the original data has millions of records.
Input data -
input_data = pd.DataFrame({'Toy': ['Toy1', 'Toy1', 'Toy1', 'Toy1', 'Toy2', 'Toy2', 'Toy2', 'Toy2', 'Toy2', 'Toy3', 'Toy3', 'Toy3'],
'Toy_year': [2019, 2020, 2021, 2021, 2019, 2020, 2020, 2021, 2021, 2019, 2020, 2021],
'Color': ['Red', 'Red', 'Red', 'Green ', 'Green ', 'Green ', 'Red', 'Green ', 'Red', 'Blue', 'Yellow', 'Yellow']})
Output data -
output_data = pd.DataFrame({'Year': ['2019-2020', '2019-2020', '2019-2020', '2020-2021', '2020-2021', '2020-2021'],
'Toy': ['Toy1', 'Toy2', 'Toy3', 'Toy1', 'Toy2', 'Toy3'],
'overlap_count': [1, 1, 0, 1, 1, 1],
'new_count': [0, 1, 1, 1, 1, 0]})
I am trying the below method but it is very slow -
toy_list = ['Toy1','Toy2','Toy3']
year_list = [2019,2020]
for i in toy_list:
    for j in year_list:
        y1 = j
        y2 = j+1
        x1 = input_data[(input_data['Toy']==i)&(input_data['Toy_year']==y1)]
        x2 = input_data[(input_data['Toy']==i)&(input_data['Toy_year']==y2)]
        z1 = list(set(x1.Color) & set(x2.Color))
        print (x1)
        print (x2)
        print (z1)
Any leads are really appreciated.
A few steps here. First we unstack the data to have a cross table of toy/year vs color, where 1 indicates that that color was in force for that toy/year
df1 = input_data.assign(count=1).set_index(['Toy','Toy_year','Color']).unstack(level=2)
df1
df1 looks like this:
count
Color Blue Green Red Yellow
Toy Toy_year
Toy1 2019 NaN NaN 1.0 NaN
2020 NaN NaN 1.0 NaN
2021 NaN 1.0 1.0 NaN
Toy2 2019 NaN 1.0 NaN NaN
2020 NaN 1.0 1.0 NaN
2021 NaN 1.0 1.0 NaN
Toy3 2019 1.0 NaN NaN NaN
2020 NaN NaN NaN 1.0
2021 NaN NaN NaN 1.0
Now we can aggregate these, row by row, to get the summary statistics 'overlap_count' and 'new_count'. overlap_count is the number of colors a row shares with the next row (within each Toy group), and new_count is the number of colors in the next row minus that overlap.
ccols= df1.columns
df2 = df1.copy()
df2['overlap_count'] = df1.groupby(['Toy'], group_keys = False).apply(lambda g: (g[ccols] == g[ccols].shift(-1)).sum(axis=1))
df2['new_count']= df2.groupby(['Toy'], group_keys = False).apply(lambda g: g[ccols].shift(-1).sum(axis=1) - g['overlap_count'])
Now we just massage the result into the required form:
df3 = df2[['overlap_count','new_count']].reset_index().droplevel(1,axis=1)
df3['Year'] = df3['Toy_year'].astype(str) + '-' + df3['Toy_year'].astype(str).shift(-1)
df3 = df3[df3['Toy_year'] != 2021].drop(columns = ['Toy_year'])
df3
output:
Toy overlap_count new_count Year
-- ----- --------------- ----------- ---------
0 Toy1 1 0 2019-2020
1 Toy1 1 1 2020-2021
3 Toy2 1 1 2019-2020
4 Toy2 2 0 2020-2021
6 Toy3 0 1 2019-2020
7 Toy3 1 0 2020-2021
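If the unstacked table gets too wide for millions of records, a self-merge may be worth trying instead. The following is only a sketch, not the method above: it assumes the years are consecutive integers and that a color counts once per toy and year. The names cur, nxt, merged and out are illustrative.

```python
import pandas as pd

input_data = pd.DataFrame({
    'Toy': ['Toy1', 'Toy1', 'Toy1', 'Toy1', 'Toy2', 'Toy2', 'Toy2', 'Toy2', 'Toy2', 'Toy3', 'Toy3', 'Toy3'],
    'Toy_year': [2019, 2020, 2021, 2021, 2019, 2020, 2020, 2021, 2021, 2019, 2020, 2021],
    'Color': ['Red', 'Red', 'Red', 'Green ', 'Green ', 'Green ', 'Red', 'Green ', 'Red', 'Blue', 'Yellow', 'Yellow'],
})

# Distinct (Toy, year, Color) combinations.
cur = input_data.drop_duplicates()

# Relabel each record with the previous year so it lines up with that year's rows.
nxt = cur.assign(Toy_year=cur['Toy_year'] - 1)

# 'both' = color also present in the earlier year, 'left_only' = color is new.
merged = nxt.merge(cur, on=['Toy', 'Toy_year', 'Color'], how='left', indicator=True)

# Drop the rows that were shifted to before the first year in the data.
merged = merged[merged['Toy_year'] >= cur['Toy_year'].min()]

out = (
    merged.assign(
        overlap_count=(merged['_merge'] == 'both').astype(int),
        new_count=(merged['_merge'] == 'left_only').astype(int),
    )
    .groupby(['Toy', 'Toy_year'], as_index=False)[['overlap_count', 'new_count']]
    .sum()
)
out['Year'] = out['Toy_year'].astype(str) + '-' + (out['Toy_year'] + 1).astype(str)
out = out.drop(columns='Toy_year')
print(out)
```

On the sample data this gives the same counts as the table above, including an overlap of 2 for Toy2 in 2020-2021, where both Green and Red carry over.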

Groupby Sum returns the wrong sum value as it has been multiplied in Pandas

Here's a sample code:
import pandas as pd
data = {'Date': ['10/10/21', '10/10/21', '13/10/21', '11/10/21', '11/10/21', '11/10/21', '11/10/21', '11/10/21', '13/10/21', '13/10/21', '13/10/21', '10/10/21', '10/10/21'],
'ID': [1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
'TotalTimeSpentInMinutes': [19, 6, 14, 17, 51, 53, 66, 19, 14, 28, 44, 22, 41],
'Vehicle': ['V3', 'V1', 'V3', 'V1','V1','V1','V1','V1','V1','V1','V1','V1','V1']
}
df = pd.DataFrame(data)
prices = {
'V1': 9.99,
'V2': 9.99,
'V3': 14.00,
}
default_price = 9.99
df = df.sort_values('ID')
df['OrdersPD'] = df.groupby(['ID', 'Date', 'Vehicle'])['ID'].transform('count')
df['MinutesPD'] = df.groupby(['ID', 'Date', 'Vehicle'])['TotalTimeSpentInMinutes'].transform(sum)
df['HoursPD'] = df['MinutesPD'] / 60
df['Pay excl extra'] = df.apply(lambda x: prices.get(x['Vehicle'], default_price)*x['HoursPD'], axis=1).round(2)
extra = 1.20
df['Extra Pay'] = df.apply(lambda x: extra*x['OrdersPD'], axis=1)
df['Total_pay'] = df['Pay excl extra'] + df['Extra Pay'].round(2)
df['Total Pay PD'] = df.groupby(['ID'])['Total_pay'].transform(sum)
#Returns wrong sum
df['Total Courier Hours'] = df.groupby(['ID'])['HoursPD'].transform(sum)
#Returns wrong sum
df['ABS Final Pay'] = df.groupby(['ID'])['Total Pay PD'].transform(sum)
#Returns wrong sum
df.drop_duplicates((['ID','Date','Vehicle']), inplace=True)
print(df)
I'm trying to find the total sum per ID for 2 things: Hours and Pay.
Here's my code to find the total for hours and pay
Hours:
df['Total Courier Hours'] = df.groupby(['ID'])['HoursPD'].transform(sum)
#I've also tried with just .sum() but it returns an empty column
Pay:
df['ABS Final Pay'] = df.groupby(['ID'])['Total Pay PD'].transform(sum)
Output Example for ID 1: - ABS Final Pay
Date ID Vehicle OrdersPD HoursPD PayExclExtra ExtraPay
10/10/21 1 V1 1 0.1 1 1.20
10/10/21 1 V3 1 0.3166 4.43 1.20
13/10/21 1 V3 1 0.2333 3.27 1.20
Total_pay Total Pay PD Total Courier Hours ABS Final Pay
2.20 12.30 0.65 36.90
5.63 12.30 0.65 36.90
4.47 12.30 0.65 36.90
The 2 columns Total Courier Hours and ABS Final Pay are wrong because right now the code calculates the total by doing this:
ABS Final Pay = Total Pay PD * OrdersPD per count of ID
Example: for 10/10/21 - it does 12.30 * 2 = 24.60
for 13/10/21 - it does 12.30 * 1 = 12.30
ABS Final Pay returns 36.90 when it should be 12.30 (7.83 + 4.47 from the 2 days)
Total Pay PD for ID 1 is also wrong as it should show the sum of pay per date, example of expected output:
Date ID Vehicle OrdersPD Total PD
10/10/21 1 V1 1 7.83
10/10/21 1 V3 1 7.83
13/10/21 1 V1 1 4.47
Total Courier Hours seems to be fine for ID 1 when it's split into 3 rows with 1 order per row but when it has more than 1 order, it calculates it wrong as it multiplies it.
Example for ID 2 - Total Courier Hours
It calculates it doing this sum:
Total Courier Hours = HoursPD * OrdersPD per count of ID
Example: 11/10/21 - ID 2 had 5 orders, 2.85 * 5 = 14.25
13/10/21 - 3 orders, 2.01 * 3 = 6.03
10/10/21 - 2 orders, 1.05 * 2 = 2.1
Total Courier Hours returns 22.38 when it should be 5.91 (2.85 + 2.01 + 1.05 from the 3 days)
Sorry for the long post, I hope this makes sense and thanks in advance.
The drop_duplicates line may have been the issue. Once I removed the code:
df.drop_duplicates((['ID','Date','Vehicle']), inplace=True)
I was able to calculate the totals more accurately line by line instead of having to do calculations to the columns within the code.
To keep things separate, I wrote each groupby result to a different Excel sheet.
Example:
per_courier = (
df.groupby(['ID'])['Total Pay']
.agg(sum)
)
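One likely reason the totals in the question came out multiplied: transform writes the per-day total onto every order row, so a day with 5 orders carries its HoursPD five times, and a later sum over the rows counts it five times. A minimal sketch of an alternative (not the fix described above) is to aggregate to one row per ID, Date and Vehicle first and only then total per ID. The per_day name is illustrative, and pay is left out to keep it short.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['10/10/21', '10/10/21', '13/10/21', '11/10/21', '11/10/21', '11/10/21', '11/10/21',
             '11/10/21', '13/10/21', '13/10/21', '13/10/21', '10/10/21', '10/10/21'],
    'ID': [1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
    'TotalTimeSpentInMinutes': [19, 6, 14, 17, 51, 53, 66, 19, 14, 28, 44, 22, 41],
    'Vehicle': ['V3', 'V1', 'V3', 'V1', 'V1', 'V1', 'V1', 'V1', 'V1', 'V1', 'V1', 'V1', 'V1'],
})

# One row per (ID, Date, Vehicle): number of orders and total minutes that day.
per_day = (
    df.groupby(['ID', 'Date', 'Vehicle'], as_index=False)
      .agg(OrdersPD=('TotalTimeSpentInMinutes', 'size'),
           MinutesPD=('TotalTimeSpentInMinutes', 'sum'))
)
per_day['HoursPD'] = per_day['MinutesPD'] / 60

# Each day's hours now appear exactly once, so the per-ID total is not inflated.
per_day['Total Courier Hours'] = per_day.groupby('ID')['HoursPD'].transform('sum')
print(per_day)
```

With this layout there is no need for drop_duplicates afterwards, since the frame already has one row per group.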

Getting % Rate using Pandas Group By and .sum()

I'd like to get some % rates based on a .groupby() in pandas. My goal is to take an indicator column Ind and get the Rate of A (numerator) divided by the total (A+B) in that year
Example Data:
import pandas as pd
import numpy as np
df: pd.DataFrame = pd.DataFrame([['2011','A',1,2,3], ['2011','B',4,5,6],['2012','A',15,20,4],['2012','B',17,12,12]], columns=["Year","Ind","X", "Y", "Z"])
print(df)
Year Ind X Y Z
0 2011 A 1 2 3
1 2011 B 4 5 6
2 2012 A 15 20 4
3 2012 B 17 12 12
Example for year 2011: XRate would be the sum of the A indicators for X (which is 1) divided by the total (A+B), which is 5, so I would get an XRate of 0.20.
I would like to do this for all columns X, Y, Z to get the rates. I've tried doing lambda applys but can't quite get the desired results.
Desired Results:
Year XRate YRate ZRate
0 2011 0.20 0.29 0.33
1 2012 0.47 0.63 0.25
You can group the dataframe on Year and aggregate using sum:
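cols = ['X', 'Y', 'Z']  # sum only the numeric columns; on newer pandas, summing Ind would concatenate its strings
s1 = df.groupby('Year')[cols].sum()
s2 = df.query("Ind == 'A'").groupby('Year')[cols].sum()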
s2.div(s1).round(2).add_suffix('Rate')
XRate YRate ZRate
Year
2011 0.20 0.29 0.33
2012 0.47 0.62 0.25
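A variant of the same idea, sketched here without the query step: zero out the non-A rows with a 0/1 mask and group once for the numerator and once for the denominator. This assumes X, Y and Z are the only value columns; cols, is_a, numer, denom and rates are illustrative names.

```python
import pandas as pd

df = pd.DataFrame(
    [['2011', 'A', 1, 2, 3], ['2011', 'B', 4, 5, 6],
     ['2012', 'A', 15, 20, 4], ['2012', 'B', 17, 12, 12]],
    columns=['Year', 'Ind', 'X', 'Y', 'Z'],
)

cols = ['X', 'Y', 'Z']

# 1 for rows with Ind == 'A', 0 otherwise; multiplying zeroes out the B rows.
is_a = df['Ind'].eq('A').astype(int)

numer = df[cols].mul(is_a, axis=0).groupby(df['Year']).sum()  # A only
denom = df[cols].groupby(df['Year']).sum()                    # A + B

rates = numer.div(denom).add_suffix('Rate')
print(rates)
```

The unrounded YRate for 2012 is 20/32 = 0.625, which sits exactly on the rounding boundary; that is why the rounded output above shows 0.62 while the desired table shows 0.63.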

Python Pandas - Dividing data in one column by another if both are summarized using the groupby function

I have the following code that generates four columns as I intended
df['revenue'] = pd.to_numeric(df['revenue']) #not exactly sure what this does
df['Date'] = pd.to_datetime(df['Date'], unit='s')
df['Year'] = df['Date'].dt.year
df['First Purchase Date'] = pd.to_datetime(df['First Purchase Date'], unit='s')
df['number_existing_customers'] = df.groupby(df['Year'])[['Existing Customer']].sum()
df['number_new_customers'] = df.groupby(df['Year'])[['New Customer']].sum()
df['Rate'] = df['number_new_customers']/df['number_existing_customers']
Table = df.groupby(df['Year'])[['New Customer', 'Existing Customer', 'Rate', 'revenue']].sum()
print(Table)
I want to be able to divide one column by another (new customers by existing) but I seem to be getting zeros when creating the new column (see output below).
>>> print(Table)
New Customer Existing Customer Rate revenue
Year
2014 7.00 2.00 0.00 11,869.47
2015 1.00 3.00 0.00 9,853.93
2016 5.00 3.00 0.00 4,058.53
2017 9.00 3.00 0.00 8,056.37
2018 12.00 7.00 0.00 22,031.23
2019 16.00 10.00 0.00 97,142.42
All you need to do is compute the column on the aggregated Table using the corresponding operator, in this case /:
Table['Rate'] = Table['New Customer']/Table['Existing Customer']
In this example I'm copying your Table output and using the code I've posted:
import pandas as pd
import numpy as np
data = {'Year':[2014,2015,2016,2017,2018,2019],'New Customer':[7,1,5,9,12,16],'Existing Customer':[2,3,3,3,7,10],'revenue':[1000,1000,1000,1001,1100,1200]}
Table = pd.DataFrame(data).set_index('Year')
Table['Rate'] = Table['New Customer']/Table['Existing Customer']
print(Table)
Output:
New Customer Existing Customer revenue Rate
Year
2014 7 2 1000 3.500000
2015 1 3 1000 0.333333
2016 5 3 1000 1.666667
2017 9 3 1001 3.000000
2018 12 7 1100 1.714286
2019 16 10 1200 1.600000
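As for why the original Rate was all zeros, a likely explanation is index alignment: the groupby sums are indexed by Year, while df is indexed by row number, so assigning them back to df mostly yields NaN, and those NaN values sum to 0. A minimal sketch of the aggregate-then-divide order, using made-up row-level data since the real frame is not shown in the question:

```python
import pandas as pd

# Hypothetical row-level data; only the column names come from the question.
df = pd.DataFrame({
    'Year': [2014, 2014, 2014, 2015, 2015],
    'New Customer': [1, 1, 0, 0, 1],
    'Existing Customer': [0, 0, 1, 1, 0],
    'revenue': [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Aggregate first: one row per year.
Table = df.groupby('Year')[['New Customer', 'Existing Customer', 'revenue']].sum()

# Then divide the yearly totals; both columns share the Year index, so they align.
Table['Rate'] = Table['New Customer'] / Table['Existing Customer']
print(Table)
```

A rate is generally not something to sum across rows, so it is computed here from the yearly totals rather than included in the sum.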

Cumulative Multiplication in Pandas Python

I have a dataframe such as the following:
What's the best way to calculate a cumulative return to fill the NaN values? The logic of each cell is shown.
Following is the intended result:
import pandas as pd
import numpy as np  # needed for np.nan below
df = pd.DataFrame({"DATE":[2018,2019,2020,2021,2022,2023,2024],"RATIO":[0.03,0.04,0.05,0.06,0.07,0.08,0.09],"PROFIT":[10,20,np.nan,np.nan,np.nan,np.nan,np.nan]})
df.loc[df['DATE']==2020, ['PROFIT']] = 20000*(1+0.04)
df.loc[df['DATE']==2021, ['PROFIT']] = 20000*(1+0.04)*(1+0.050)
df.loc[df['DATE']==2022, ['PROFIT']] = 20000*(1+0.04)*(1+0.050)*(1+0.060)
df.loc[df['DATE']==2023, ['PROFIT']] = 20000*(1+0.04)*(1+0.050)*(1+0.060)*(1+0.070)
df.loc[df['DATE']==2024, ['PROFIT']] = 20000*(1+0.04)*(1+0.050)*(1+0.060)*(1+0.070)*(1+0.080)
df
You are looking for cumprod
df['PROFIT']=df['PROFIT'].fillna(df.RATIO.shift().add(1).iloc[2:].cumprod()*20000)
df
Out[30]:
DATE RATIO PROFIT
0 2018 0.03 10.00000
1 2019 0.04 20.00000
2 2020 0.05 20800.00000
3 2021 0.06 21840.00000
4 2022 0.07 23150.40000
5 2023 0.08 24770.92800
6 2024 0.09 26752.60224
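The same fill can be written without the hard-coded iloc[2:] by compounding only over the rows where PROFIT is missing. This is a sketch under two assumptions: the missing rows form one trailing block, and the starting value is the 20000 used in the question's formulas (kept as a separate base variable here, since it does not match the PROFIT column). The names base, growth and missing are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "DATE": [2018, 2019, 2020, 2021, 2022, 2023, 2024],
    "RATIO": [0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09],
    "PROFIT": [10, 20, np.nan, np.nan, np.nan, np.nan, np.nan],
})

base = 20000  # assumed starting value, taken from the formulas in the question

# Growth factor for each row is 1 + the previous year's RATIO.
growth = df["RATIO"].shift().add(1)

# Compound the growth factors over the missing rows only and scale by the base.
missing = df["PROFIT"].isna()
df.loc[missing, "PROFIT"] = base * growth[missing].cumprod()
print(df)
```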
