Year over Year difference and selecting maximum row in pandas - python

I have a dataframe given as below:
ID YEAR NPS
500 2020 0
500 2021 0
500 2022 0
501 2020 32
501 2021 52
501 2022 99
503 2021 1
503 2022 4
504 2020 45
504 2021 55
504 2022 50
I have to calculate the year-over-year difference as given below:
ID YEAR NPS nps_gain_yoy
500 2020 0 0
500 2021 0 0
500 2022 0 0
501 2020 32 0
501 2021 52 20
501 2022 99 47
503 2021 1 0
503 2022 4 3
504 2020 45 0
504 2021 55 10
504 2022 50 -5
In the above output, for the starting year 2020 (or the first occurrence of an ID) nps_gain_yoy needs to be zero; then for 2021 nps_gain_yoy is the difference between the NPS of 2021 and 2020, i.e. 52 - 32 = 20, as shown in the output for ID 501 for year 2021, and so on.
After this I need to pick the maximum difference, i.e. the maximum nps_gain_yoy, for each ID, as given in the output below:
ID YEAR NPS nps_gain_yoy
500 2022 0 0
501 2022 99 47
503 2022 4 3
504 2021 55 10
Here 47 is the maximum NPS gain for ID 501, reached in year 2022; similarly 3 for ID 503 and 10 for ID 504.

If years are consecutive per ID, first use DataFrameGroupBy.diff:
df = df.sort_values(['ID','YEAR'])
df['nps_gain_yoy'] = df.groupby('ID')['NPS'].diff().fillna(0)
print (df)
ID YEAR NPS nps_gain_yoy
0 500 2020 0 0.0
1 500 2021 0 0.0
2 500 2022 0 0.0
3 501 2020 32 0.0
4 501 2021 52 20.0
5 501 2022 99 47.0
6 503 2021 1 0.0
7 503 2022 4 3.0
8 504 2020 45 0.0
9 504 2021 55 10.0
10 504 2022 50 -5.0
And then DataFrameGroupBy.idxmax with DataFrame.loc:
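#reversed rows mean idxmax picks the last (most recent) row if the maximum repeats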
df1 = df.loc[df.iloc[::-1].groupby('ID')['nps_gain_yoy'].idxmax()]
#alternative solution
#df1 = df.sort_values(['ID','nps_gain_yoy']).drop_duplicates('ID', keep='last')
print (df1)
ID YEAR NPS nps_gain_yoy
2 500 2022 0 0.0
5 501 2022 99 47.0
7 503 2022 4 3.0
9 504 2021 55 10.0
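If the years per ID may have gaps, diff would compare non-adjacent years as if they were consecutive. A minimal sketch of a self-merge on the previous year that handles gaps, assuming the same column names:
#look up each (ID, YEAR - 1) pair explicitly instead of diffing adjacent rows
prev = df[['ID', 'YEAR', 'NPS']].assign(YEAR=lambda d: d['YEAR'] + 1)
out = df.merge(prev, on=['ID', 'YEAR'], how='left', suffixes=('', '_prev'))
out['nps_gain_yoy'] = (out['NPS'] - out['NPS_prev']).fillna(0)
out = out.drop(columns='NPS_prev')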

Related

Flatten multiindex columns into separate column

I have a DataFrame that looks similar to this:
Date Close Open
AAP AWS BGG ... AAP AWS BGG ...
2020 10 50 13 ... 100 500 13 ...
2021 11 41 7 ... 111 41 7 ...
2022 12 50 13 ... 122 50 13 ...
and want to turn it into
Date Close Open Index2
2020 10 100 AAP
2021 11 111 AAP
2022 12 122 AAP
2020 50 500 AWS
...
How can I achieve it using pandas?
You can use set_index and stack to get the expected dataframe:
>>> (df.set_index('Date').stack(level=1)
...     .rename_axis(index=['Date', 'Ticker'])
...     .reset_index())
Date Ticker Close Open
0 2020 AAP 10 100
1 2020 AWS 50 500
2 2020 BGG 13 13
3 2021 AAP 11 111
4 2021 AWS 41 41
5 2021 BGG 7 7
6 2022 AAP 12 122
7 2022 AWS 50 50
8 2022 BGG 13 13
My input dataframe:
>>> df
Date Close Open
AAP AWS BGG AAP AWS BGG
0 2020 10 50 13 100 500 13
1 2021 11 41 7 111 41 7
2 2022 12 50 13 122 50 13
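For reference, a minimal sketch that rebuilds this input, assuming two-level columns with the flat Date column stored as the tuple ('Date', ''), which is how pandas keeps a flat label next to a MultiIndex:
import pandas as pd
cols = pd.MultiIndex.from_tuples([('Date', '')] +
                                 [(p, t) for p in ['Close', 'Open']
                                         for t in ['AAP', 'AWS', 'BGG']])
df = pd.DataFrame([[2020, 10, 50, 13, 100, 500, 13],
                   [2021, 11, 41, 7, 111, 41, 7],
                   [2022, 12, 50, 13, 122, 50, 13]], columns=cols)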
You could also use wide_to_long:
pd.wide_to_long(df.set_axis(df.columns.map('_'.join).str.rstrip('_'), axis=1),
                ['Close', 'Open'], 'Date', 'Ticker', '_', '\\w+').reset_index()
Date Ticker Close Open
0 2020 AAP 10 100
1 2021 AAP 11 111
2 2022 AAP 12 122
3 2020 AWS 50 500
4 2021 AWS 41 41
5 2022 AWS 50 50
6 2020 BGG 13 13
7 2021 BGG 7 7
8 2022 BGG 13 13

How to assign the groupby results to a series in pandas

I have a df which looks like this:
Date Value
2020 0
2020 100
2020 200
2020 300
2021 100
2021 150
2021 0
I want to get the average of the grouped Value by Date where Value > 0. When I tried:
df['Yearly AVG'] = df[df['Value']>0].groupby('Date')['Value'].mean()
I get NaN values. When I print the line above I get what I need, but indexed by Date:
Date
2020 200
2021 125
How can I have the following:
Date Value Yearly AVG
2020 0 200
2020 100 200
2020 200 200
2020 300 200
2021 100 125
2021 150 125
2021 0 125
Here is a trick: replace non-matched values with missing values and then use GroupBy.transform to get a new column filled with the aggregated values:
df['Yearly AVG'] = df['Value'].where(df['Value']>0).groupby(df['Date']).transform('mean')
print (df)
Date Value Yearly AVG
0 2020 0 200.0
1 2020 100 200.0
2 2020 200 200.0
3 2020 300 200.0
4 2021 100 125.0
5 2021 150 125.0
6 2021 0 125.0
Detail:
print (df['Value'].where(df['Value']>0))
0 NaN
1 100.0
2 200.0
3 300.0
4 100.0
5 150.0
6 NaN
Name: Value, dtype: float64
Your solution fails because the aggregated Series is indexed by Date while df has a default RangeIndex, so the assignment cannot align; map the per-Date means back onto the Date column instead:
df['Yearly AVG'] = df['Date'].map(df[df['Value']>0].groupby('Date')['Value'].mean())
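Detail: map uses the aggregated Series as a lookup table aligned on its Date index, which is why it works where direct assignment did not:
print (df[df['Value']>0].groupby('Date')['Value'].mean())
Date
2020    200.0
2021    125.0
Name: Value, dtype: float64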

How to find ChangeCol1/ChangeCol2 and %ChangeCol1/%ChangeCol2 of DF

I have data that looks like this.
Year Quarter Quantity Price TotalRevenue
0 2000 1 23 142 3266
1 2000 2 23 144 3312
2 2000 3 23 147 3381
3 2000 4 23 151 3473
4 2001 1 22 160 3520
5 2001 2 22 183 4026
6 2001 3 22 186 4092
7 2001 4 22 186 4092
8 2002 1 21 212 4452
9 2002 2 19 232 4408
10 2002 3 19 223 4237
I'm trying to figure out how to get the 'MarginalRevenue', where:
MR = (∆TR/∆Q)
MarginalRevenue = (Change in TotalRevenue) / (Change in Quantity)
I found: df.pct_change()
But that seems to get the percentage change for everything.
Also, I'm trying to figure out how to get something related:
ElasticityPrice = (%ΔQuantity/%ΔPrice)
Do you mean something like this?
df['MarginalRevenue'] = df['TotalRevenue'].pct_change() / df['Quantity'].pct_change()
or
df['MarginalRevenue'] = df['TotalRevenue'].diff() / df['Quantity'].diff()
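For the ElasticityPrice formula from the question, the same pct_change pattern applies (a direct sketch of %ΔQuantity/%ΔPrice):
df['ElasticityPrice'] = df['Quantity'].pct_change() / df['Price'].pct_change()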

How to apply a function to multiple columns that iterates through each row

Data
I have a dataset that shows up-to-date bookings data grouped by company and month (empty values are NaNs):
company month year_ly bookings_ly year_ty bookings_ty
company a 1 2018 432 2019 253
company a 2 2018 265 2019 635
company a 3 2018 345 2019 525
company a 4 2018 233 2019
company a 5 2018 7664 2019
... ... ... ... ... ...
company a 12 2018 224 2019 321
company b 1 2018 543 2019 576
company b 2 2018 23 2019 43
company b 3 2018 64 2019 156
company b 4 2018 143 2019
company b 5 2018 41 2019
company b 6 2018 90 2019
... ... ... ... ... ...
What I want
I'd like to create a column or update the bookings_ty column where value is NaN (whichever is easier) that applies the following calculation for each row (grouped by company):
((SUM of previous 3 rows (or months) of bookings_ty)
/(SUM of previous 3 rows (or months) of bookings_ly))
* bookings_ly
Where a row's bookings_ty is NaN, I'd like that iteration of the formula to take the newly calculated field as part of its bookings_ty, so essentially the formula should populate the NaN values in bookings_ty.
My attempt
df_bkgs.set_index(['operator', 'month'], inplace=True)

def calc(df_bkgs):
    df_bkgs['bookings_calc'] = df_bkgs['bookings_ty'].copy()
    df_bkgs['bookings_ty_l3m'] = df_bkgs.groupby(level=0)['bookings_ty'].transform(lambda x: x.shift(1) + x.shift(2) + x.shift(3))
    df_bkgs['bookings_ly_l3m'] = df_bkgs.groupby(level=0)['bookings_ly'].transform(lambda x: x.shift(1) + x.shift(2) + x.shift(3))
    df_bkgs['bookings_factor'] = df_bkgs['bookings_ty_l3m'] / df_bkgs['bookings_ly_l3m']
    df_bkgs['bookings_calc'] = df_bkgs['bookings_factor'] * df_bkgs['bookings_ly']
    return df_bkgs

df_bkgs.groupby(level=0).apply(calc)

import numpy as np
df['bookings_calc'] = np.where(df['bookings_ty'].isna(), df['bookings_calc'], df['bookings_ty'])
The issue with this code is that it generates the calculated field only for the first empty/NaN bookings_ty. What I'd like is an iteration or loop-type process that takes the previous 3 rows in the group and, where a row's bookings_ty is empty/NaN, uses that row's calculated field instead.
Thanks
You can try this. I made a function which finds the last 3 records in your dataframe by row. Note that I had to create a column named index to do this, as you can't access the index (as far as I know) within an apply statement.
# dataframe is named f
company month year_ly bookings_ly year_ty bookings_ty
0 a 1 2018 432 2019 253.0
1 a 2 2018 265 2019 635.0
2 a 3 2018 345 2019 525.0
3 a 4 2018 233 2019 NaN
4 a 5 2018 7664 2019 NaN
5 a 12 2018 224 2019 321.0
6 b 1 2018 543 2019 576.0
7 b 2 2018 23 2019 43.0
8 b 3 2018 64 2019 156.0
9 b 4 2018 143 2019 NaN
10 b 5 2018 41 2019 NaN
11 b 6 2018 90 2019 NaN
f.reset_index(inplace=True)

def aggFunct(row, df, last=3):
    series = df.loc[(df['index'] < row['index']) & (df['index'] >= row['index'] - last), 'bookings_ty'].fillna(0)
    ssum = series.sum()
    return ssum

f.loc[f['bookings_ty'].isna(), 'bookings_ty'] = f[f['bookings_ty'].isna()].apply(aggFunct, df=f, axis=1)
f.drop('index', axis=1, inplace=True)
f
company month year_ly bookings_ly year_ty bookings_ty
0 a 1 2018 432 2019 253.0
1 a 2 2018 265 2019 635.0
2 a 3 2018 345 2019 525.0
3 a 4 2018 233 2019 1413.0
4 a 5 2018 7664 2019 1160.0
5 a 12 2018 224 2019 321.0
6 b 1 2018 543 2019 576.0
7 b 2 2018 23 2019 43.0
8 b 3 2018 64 2019 156.0
9 b 4 2018 143 2019 775.0
10 b 5 2018 41 2019 199.0
11 b 6 2018 90 2019 156.0
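A rolling-sum sketch of the same "sum of the previous 3 rows" fill, assuming f is sorted by company and month; like the function above it fills with the plain sum (treating NaNs as 0), not the ratio formula:
prev3 = (f.groupby('company')['bookings_ty']
          .transform(lambda s: s.fillna(0).rolling(3, min_periods=1).sum().shift(1)))
f['bookings_ty'] = f['bookings_ty'].fillna(prev3)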
Depending on how many companies you have in your table, I might be inclined to run this in Excel as opposed to doing it in pandas. Iterating through the rows might be slow, but if speed is not a concern, the following solution should work:
import numpy as np
import pandas as pd

df = pd.read_excel('data_file.xlsx')  # <-- name of your file.
companies = pd.unique(df.company)
months = pd.unique(df.month)

for c in companies:
    for m in months:
        # slice a single row
        df_row = df[(df['company'] == c) & (df['month'] == m)]
        val = df_row.bookings_ty.values[0]
        if np.isnan(val):
            # get the index of the row
            idx = df_row.index[0]
            df1 = df.copy()
            df1 = df1[(df1['company'] == c) & (df1['month'].isin(range(m - 3, m)))]
            ratio = df1.bookings_ty.sum() / df1.bookings_ly.sum()
            projected_value = df_row.bookings_ly.values[0] * ratio
            df.loc[idx, 'bookings_ty'] = projected_value
        else:
            pass

print(df)
If we can assume that the DataFrame is always sorted by 'company' and then by 'month', then we can use the following approach; there is a 20-fold improvement (0.003s vs. 0.07s) over the loop above with my sample data of 24 rows.
import numpy as np
import pandas as pd

df = pd.read_excel('data_file.xlsx')  # your input file
ly = df.bookings_ly.values.tolist()
ty = df.bookings_ty.values.tolist()

for val in ty:
    if np.isnan(val):
        idx = ty.index(val)  # returns the index of the first 'nan' found
        ratio = sum(ty[idx - 3:idx]) / sum(ly[idx - 3:idx])
        ty[idx] = ratio * ly[idx]

df['bookings_ty'] = ty
Here is a solution:
import numpy as np
import pandas as pd

#sort values if not sorted already
df = df.sort_values(['company', 'year_ty', 'month']).reset_index(drop=True)

def process(x):
    #loop until every NaN is filled, so newly computed values feed later rows
    while x['bookings_ty'].isnull().any():
        x['bookings_ty'] = np.where(x['bookings_ty'].isnull(),
                                    (x['bookings_ty'].shift(1) +
                                     x['bookings_ty'].shift(2) +
                                     x['bookings_ty'].shift(3)) /
                                    (x['bookings_ly'].shift(1) +
                                     x['bookings_ly'].shift(2) +
                                     x['bookings_ly'].shift(3)) *
                                    x['bookings_ly'], x['bookings_ty'])
    return x

df = df.groupby(['company']).apply(lambda x: process(x))

#convert to int64 if needed or stay with float values
df['bookings_ty'] = df['bookings_ty'].astype(np.int64)
print(df)
initial DF:
company month year_ly bookings_ly year_ty bookings_ty
0 company_a 1 2018 432 2019 253
1 company_a 2 2018 265 2019 635
2 company_a 3 2018 345 2019 525
3 company_a 4 2018 233 2019 NaN
4 company_a 5 2018 7664 2019 NaN
5 company_a 12 2018 224 2019 321
6 company_b 1 2018 543 2019 576
7 company_b 2 2018 23 2019 43
8 company_b 3 2018 64 2019 156
9 company_b 4 2018 143 2019 NaN
10 company_b 5 2018 41 2019 NaN
11 company_b 6 2018 90 2019 NaN
result:
company month year_ly bookings_ly year_ty bookings_ty
0 company_a 1 2018 432 2019 253
1 company_a 2 2018 265 2019 635
2 company_a 3 2018 345 2019 525
3 company_a 4 2018 233 2019 315 **
4 company_a 5 2018 7664 2019 13418 **
5 company_a 12 2018 224 2019 321
6 company_b 1 2018 543 2019 576
7 company_b 2 2018 23 2019 43
8 company_b 3 2018 64 2019 156
9 company_b 4 2018 143 2019 175 **
10 company_b 5 2018 41 2019 66 **
11 company_b 6 2018 90 2019 144 **
In case you want a different rolling window, or a NaN value could exist at the beginning of a company's rows, you could use this generic solution:
df = df.sort_values(['company', 'year_ty', 'month']).reset_index(drop=True)

def process(x, m):
    idx = x.loc[x['bookings_ty'].isnull()].index.to_list()
    for i in idx:
        id = i - x.index[0]  #position of the row within its group
        start = 0 if id < m else id - m
        sum_ty = sum(x['bookings_ty'].to_list()[start:id])
        sum_ly = sum(x['bookings_ly'].to_list()[start:id])
        ly = x.at[i, 'bookings_ly']
        x.at[i, 'bookings_ty'] = sum_ty / sum_ly * ly
    return x

rolling_month = 3
df = df.groupby(['company']).apply(lambda x: process(x, rolling_month))
df['bookings_ty'] = df['bookings_ty'].astype(np.int64)
print(df)
initial df:
company month year_ly bookings_ly year_ty bookings_ty
0 company_a 1 2018 432 2019 253.0
1 company_a 2 2018 265 2019 635.0
2 company_a 3 2018 345 2019 NaN
3 company_a 4 2018 233 2019 NaN
4 company_a 5 2018 7664 2019 NaN
5 company_a 12 2018 224 2019 321.0
6 company_b 1 2018 543 2019 576.0
7 company_b 2 2018 23 2019 43.0
8 company_b 3 2018 64 2019 156.0
9 company_b 4 2018 143 2019 NaN
10 company_b 5 2018 41 2019 NaN
11 company_b 6 2018 90 2019 NaN
final result:
company month year_ly bookings_ly year_ty bookings_ty
0 company_a 1 2018 432 2019 253
1 company_a 2 2018 265 2019 635
2 company_a 3 2018 345 2019 439 ** works with only the 2 previous rows
3 company_a 4 2018 233 2019 296 **
4 company_a 5 2018 7664 2019 12467 **
5 company_a 12 2018 224 2019 321
6 company_b 1 2018 543 2019 576
7 company_b 2 2018 23 2019 43
8 company_b 3 2018 64 2019 156
9 company_b 4 2018 143 2019 175 **
10 company_b 5 2018 41 2019 66 **
11 company_b 6 2018 90 2019 144 **
If you want to speed up the process you could try:
df.set_index(['company'], inplace=True)
df = df.groupby(level=0).apply(lambda x: process(x, rolling_month))
instead of
df = df.groupby(['company']).apply(lambda x: process(x, rolling_month))

How to use groupby and grouper properly for accumulating column 'A' and averaging column 'B', month by month

I have a pandas DataFrame with 3 columns: date (from 1/1/2018 up until 8/23/2019), column A and column B.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0,10,size=(600, 2)), columns=list('AB'))
df['date'] = pd.DataFrame(pd.date_range(start='1/1/2018', end='8/23/2019'))
df = df.set_index('date')
df is as follows:
date A B
2018-01-01 7 4
2018-01-02 5 4
2018-01-03 3 1
2018-01-04 9 3
2018-01-05 7 8
2018-01-06 0 0
2018-01-07 6 8
2018-01-08 3 7
...
...
...
2019-08-18 1 0
2019-08-19 8 1
2019-08-20 5 9
2019-08-21 0 7
2019-08-22 3 6
2019-08-23 8 6
I want monthly accumulated values of column A and monthly averaged values of column B. The final output will become a df with 20 rows (12 months of year 2018 and 8 months of year 2019) and 4 columns, representing monthly accumulated values of column A, monthly averaged values of column B, month number and year number, just like below:
month year monthly_accumulated_of_A monthly_averaged_of_B
0 1 2018 176 1.747947
1 2 2018 110 2.399476
2 3 2018 131 3.976747
3 4 2018 227 2.314923
4 5 2018 234 0.464097
5 6 2018 249 1.662753
6 7 2018 121 1.588865
7 8 2018 165 2.318268
8 9 2018 219 1.060595
9 10 2018 131 0.577268
10 11 2018 179 3.948414
11 12 2018 115 1.750346
12 1 2019 190 3.364003
13 2 2019 215 0.864792
14 3 2019 231 3.219739
15 4 2019 186 2.904413
16 5 2019 232 0.324695
17 6 2019 163 1.334139
18 7 2019 238 1.670644
19 8 2019 112 1.316442
How can I achieve this in pandas?
Use DataFrameGroupBy.agg with DatetimeIndex.month and DatetimeIndex.year; for the requested ordering add sort_index, and last use reset_index to convert the MultiIndex into columns:
import pandas as pd
import numpy as np
np.random.seed(2018)
#changed 300 to 600
df = pd.DataFrame(np.random.randint(0,10,size=(600, 2)), columns=list('AB'))
df['date'] = pd.DataFrame(pd.date_range(start='1/1/2018', end='8/23/2019'))
df = df.set_index('date')
df1 = (df.groupby([df.index.month.rename('month'),
                   df.index.year.rename('year')])
         .agg({'A':'sum', 'B':'mean'})
         .sort_index(level=['year', 'month'])
         .reset_index())
print (df1)
month year A B
0 1 2018 147 4.838710
1 2 2018 120 3.678571
2 3 2018 114 4.387097
3 4 2018 143 3.800000
4 5 2018 124 3.870968
5 6 2018 129 4.700000
6 7 2018 143 3.935484
7 8 2018 118 5.483871
8 9 2018 150 5.500000
9 10 2018 139 4.225806
10 11 2018 136 4.933333
11 12 2018 141 4.548387
12 1 2019 137 4.709677
13 2 2019 120 4.964286
14 3 2019 167 4.935484
15 4 2019 121 4.200000
16 5 2019 133 4.129032
17 6 2019 140 5.066667
18 7 2019 189 4.677419
19 8 2019 100 3.695652
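Since the title mentions Grouper: an equivalent sketch using pd.Grouper on the DatetimeIndex (note freq='M' means month-end; newer pandas spells it 'ME'):
df2 = (df.groupby(pd.Grouper(freq='M'))
         .agg({'A':'sum', 'B':'mean'})
         .reset_index())
df2['month'] = df2['date'].dt.month
df2['year'] = df2['date'].dt.year
df2 = df2[['month', 'year', 'A', 'B']]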
