I have a dataframe that looks like this:
import pandas as pd

df = pd.DataFrame([
    ['BILING', 2017, 7, 1406],
    ['BILWPL', 2017, 7, 199],
    ['BKCLUB', 2017, 7, 9417],
    ['LEAVEN', 2017, 7, 4773],
    ['MAILORDER', 2017, 7, 10487]
], columns=['Branch', 'Year', 'Month', 'count'])
df
Out[1]:
Branch Year Month count
0 BILING 2017 7 1406
1 BILWPL 2017 7 199
2 BKCLUB 2017 7 9417
10 LEAVEN 2017 7 4773
18 MAILORDER 2017 7 10487
It contains the same month across different years, so that the same time of year can be compared over time.
The desired output would look something like:
Branch Month 2017 2019 Mean(ave) percent_diff
BILING 7 1406 1501 1480 5%
BILWPL 7 199 87 102 -40%
BKCLUB 7 9417 8002 7503 -3%
LEAVEN 7 4773 5009 4509 -15%
MAILORDER 7 10487 11032 9004 8%
My question is how to aggregate by branch so the years display across the columns, and add two more columns: the mean, and the percent difference between the mean and the newest year.
**** UPDATE ****
This is close but is missing some columns [ Thanks G. Anderson ]:
df.pivot_table(
    values='count', index='Branch', columns='Year',
    fill_value=0, aggfunc='mean')
Produces:
Year 2017 2018 2019
Branch
BILING 1406 1280 4
BILWPL 199 117 239
BKCLUB 94 161 238
This is very close, but I'm hoping to tack on columns for the mean and the percent difference.
* UPDATE 2 *
circ_pivot = df.pivot_table(
    values='count', index='Branch', columns='Year',
    fill_value=0)
circ_pivot['Mean'] = circ_pivot[[2017, 2018, 2019]].mean(axis=1)
circ_pivot['Change'] = ((circ_pivot[2019] - circ_pivot[2018]) / circ_pivot[2018]) * 100
circ_pivot['Change_mean'] = ((circ_pivot[2019] - circ_pivot['Mean']) / circ_pivot['Mean']) * 100
Output:
Year 2017 2018 2019 Mean Change Change_mean
Branch
BILING 1406 1280 4 896.666667 -99.687500 -99.553903
BILWPL 199 117 239 185.000000 104.273504 29.189189
BKCLUB 94 161 238 164.333333 47.826087 44.827586
This is the solution I ended up with.
# margins=True adds the 'All' totals column
circ_pivot = df.pivot_table(
    values='count', index='Branch', columns='Year',
    fill_value=0, aggfunc='sum', margins=True)
circ_pivot['Mean'] = round(circ_pivot[[2017, 2018, 2019]].mean(axis=1))
circ_pivot['Change'] = round(((circ_pivot[2019] - circ_pivot[2018]) / circ_pivot[2018]) * 100)
circ_pivot['Change_mean'] = round(((circ_pivot[2019] - circ_pivot['Mean']) / circ_pivot['Mean']) * 100)
print(circ_pivot)
Output:
Year 2017 2018 2019 All Mean Change Change_mean
Branch
BILING 1406 1280 4 2690.0 897.0 -100.0 -100.0
BILWPL 199 117 239 555.0 185.0 104.0 29.0
BKCLUB 94 161 238 493.0 164.0 48.0 45.0
Improvements would be:
Relative dates instead of hard-coded year columns (see the sketch below).
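A minimal sketch of that improvement, assuming the df and circ_pivot from above: derive the year labels from the data instead of hard-coding 2017-2019.

# Derive the year columns from the source data rather than hard-coding them
years = sorted(df['Year'].unique())
latest, previous = years[-1], years[-2]

circ_pivot['Mean'] = round(circ_pivot[years].mean(axis=1))
circ_pivot['Change'] = round((circ_pivot[latest] - circ_pivot[previous]) / circ_pivot[previous] * 100)
circ_pivot['Change_mean'] = round((circ_pivot[latest] - circ_pivot['Mean']) / circ_pivot['Mean'] * 100)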
I have this dataframe (sorry, not sure how to format it nicely here):
SRC SRCDate Ticker Coupon Vintage Bal ($bn) WAC WAM WALA LNSZ ... FICO Refi% Month_Assessed CPR Month_key SRC_year SRC_month Year Month Interest_Rate
JPM 02/05/2021 FNCI 1.5 2020 28.7 2.25 175 4 293 / 286 ... 777 91 Apr 7.536801 M+2 2021 2 2021 2 2.24
JPM 03/05/2021 FNCI 1.5 2020 28.7 2.25 175 4 293 / 286 ... 777 91 Apr 5.131145 M+1 2021 3 2021 3 2.39
JPM 04/07/2021 FNCI 1.5 2020 28 2.25 173 6 292 / 281 ... 777 91 Apr 7.233214 M 2021 4 2021 4 2.36
JPM 05/07/2021 FNCI 1.5 2020 27.6 2.25 171 7 292 / 279 ... 777 91 Apr 8.900000 M-1 2021 5 2021 5 2.28
And use this code:
cols = ['SRC_year', 'Bal ($bn)', 'WAC', 'WAM', 'WALA', 'LTV', 'FICO', 'Refi%', 'Interest_Rate']
jpm_2021[cols] = jpm_2021[cols].apply(pd.to_numeric, downcast='float', errors='coerce')

for col in cols:
    jpm_2021[col] = jpm_2021.groupby(['SRC_year', 'Ticker', 'Coupon', 'Vintage', 'Month_Assessed'])[col].transform('mean')
to normalize the values of all the cols to their respective group means. The reason for this is to be able to create a pivoted table with this code:
jpm_final = jpm_2021.pivot_table(
    index=['SRC', 'Ticker', 'Coupon', 'Vintage', 'Month_Assessed', 'Bal ($bn)', 'WAC',
           'WAM', 'WALA', 'LTV', 'FICO', 'Refi%', 'Interest_Rate'],
    columns="Month_key", values="CPR").rename_axis(columns=None).reset_index()
The problem is, taking the mean of all of those columns (especially Interest Rate) renders the resulting table less than insightful. Instead, what I'd like to do is broadcast the values from the rows where Month_key is M to all the other rows in the same group defined in the groupby above. Any tips on how to do that?
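One possible approach (a sketch, assuming the jpm_2021 frame above and exactly one Month_key == 'M' row per group): blank out every value except the M rows, then spread each group's M value with transform('first').

group_cols = ['SRC_year', 'Ticker', 'Coupon', 'Vintage', 'Month_Assessed']
is_m = jpm_2021['Month_key'].eq('M')

for col in cols:
    # Keep only the value from the Month_key == 'M' row (NaN elsewhere),
    # then broadcast it to every row of the same group; 'first' skips NaN.
    jpm_2021[col] = (jpm_2021[col]
                     .where(is_m)
                     .groupby([jpm_2021[c] for c in group_cols])
                     .transform('first'))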
I have the following DataFrame:
id x y timestamp sensorTime
1 32 30 1031 2002
1 4 105 1035 2005
1 8 110 1050 2006
2 18 10 1500 3600
2 40 20 1550 3610
2 80 10 1450 3620
....
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([[1,1,1,2,2,2], [32,4,8,18,40,80], [30,105,110,10,20,10], [1031,1035,1050,1500,1550,1450], [2002, 2005, 2006, 3600, 3610, 3620]])).T
df.columns = ['id', 'x', 'y', 'timestamp', 'sensorTime']
For each group (grouped by id) I would like to add the successive differences of sensorTime to the group's first timestamp value. Something like the following:
start = df.iloc[0]['timestamp']
df['sensorTime'] -= df.iloc[0]['sensorTime']
df['sensorTime'] += start
But I would like to do this for each id group separately.
The resulting DataFrame should be:
id x y timestamp sensorTime
1 32 30 1031 1031
1 4 105 1035 1034
1 8 110 1050 1035
2 18 10 1500 1500
2 40 20 1550 1510
2 80 10 1450 1520
....
How can this operation be done per group?
df
id x y timestamp sensorTime
0 1 32 30 1031 2002
1 1 4 105 1035 2005
2 1 8 110 1050 2006
3 2 18 10 1500 3600
4 2 40 20 1550 3610
5 2 80 10 1450 3620
You can group by id and select both timestamp and sensorTime. Then use diff to get the differences of sensorTime; the first value of each group will be NaN, which you can replace with that group's first timestamp. A cumsum then gives the desired output.
def func(x):
    # Differences of sensorTime within the group; the first entry is NaN
    diff = x['sensorTime'].diff()
    # Seed the running sum with the group's first timestamp
    diff.iloc[0] = x['timestamp'].iloc[0]
    return diff.cumsum().to_frame()

df['sensorTime'] = df.groupby('id')[['timestamp', 'sensorTime']].apply(func)
df
id x y timestamp sensorTime
0 1 32 30 1031 1031.0
1 1 4 105 1035 1034.0
2 1 8 110 1050 1035.0
3 2 18 10 1500 1500.0
4 2 40 20 1550 1510.0
5 2 80 10 1450 1520.0
You could run a groupby twice: first to get the difference in sensorTime, then a second time to do the cumulative sum:
box = df.groupby("id").sensorTime.transform("diff")
df.assign(
    new_sensorTime=np.where(box.isna(), df.timestamp, box),
    new=lambda x: x.groupby("id")["new_sensorTime"].cumsum(),
).drop(columns="new_sensorTime")
id x y timestamp sensorTime new
0 1 32 30 1031 2002 1031.0
1 1 4 105 1035 2005 1034.0
2 1 8 110 1050 2006 1035.0
3 2 18 10 1500 3600 1500.0
4 2 40 20 1550 3610 1510.0
5 2 80 10 1450 3620 1520.0
I have data that looks like this.
Year Quarter Quantity Price TotalRevenue
0 2000 1 23 142 3266
1 2000 2 23 144 3312
2 2000 3 23 147 3381
3 2000 4 23 151 3473
4 2001 1 22 160 3520
5 2001 2 22 183 4026
6 2001 3 22 186 4092
7 2001 4 22 186 4092
8 2002 1 21 212 4452
9 2002 2 19 232 4408
10 2002 3 19 223 4237
I'm trying to figure out how to get the 'MarginalRevenue', where:
MR = (∆TR/∆Q)
MarginalRevenue = (Change in TotalRevenue) / (Change in Quantity)
I found: df.pct_change()
But that seems to get the percentage change for everything.
Also, I'm trying to figure out how to get something related:
ElasticityPrice = (%ΔQuantity/%ΔPrice)
Do you mean something like this?
df['MarginalRevenue'] = df['TotalRevenue'].pct_change() / df['Quantity'].pct_change()
or
df['MarginalRevenue'] = df['TotalRevenue'].diff() / df['Quantity'].diff()
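For the price elasticity in the question, a similar one-liner sketch (assuming the row-over-row definition above):

# Percent change in quantity divided by percent change in price;
# rows where Price does not change will produce inf or NaN
df['ElasticityPrice'] = df['Quantity'].pct_change() / df['Price'].pct_change()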
I have a pandas DataFrame with 3 columns: date (from 1/1/2018 up until 8/23/2019), column A, and column B.
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0, 10, size=(600, 2)), columns=list('AB'))
df['date'] = pd.DataFrame(pd.date_range(start='1/1/2018', end='8/23/2019'))
df = df.set_index('date')
df is as follows:
date A B
2018-01-01 7 4
2018-01-02 5 4
2018-01-03 3 1
2018-01-04 9 3
2018-01-05 7 8
2018-01-06 0 0
2018-01-07 6 8
2018-01-08 3 7
...
...
...
2019-08-18 1 0
2019-08-19 8 1
2019-08-20 5 9
2019-08-21 0 7
2019-08-22 3 6
2019-08-23 8 6
I want monthly accumulated values of column A and monthly averaged values of column B. The final output would be a df with 20 rows (12 months of 2018 and 8 months of 2019) and 4 columns, holding the monthly accumulated values of column A, the monthly averaged values of column B, the month number, and the year number, just like below:
month year monthly_accumulated_of_A monthly_averaged_of_B
0 1 2018 176 1.747947
1 2 2018 110 2.399476
2 3 2018 131 3.976747
3 4 2018 227 2.314923
4 5 2018 234 0.464097
5 6 2018 249 1.662753
6 7 2018 121 1.588865
7 8 2018 165 2.318268
8 9 2018 219 1.060595
9 10 2018 131 0.577268
10 11 2018 179 3.948414
11 12 2018 115 1.750346
12 1 2019 190 3.364003
13 2 2019 215 0.864792
14 3 2019 231 3.219739
15 4 2019 186 2.904413
16 5 2019 232 0.324695
17 6 2019 163 1.334139
18 7 2019 238 1.670644
19 8 2019 112 1.316442
How can I achieve this in pandas?
Use DataFrameGroupBy.agg with DatetimeIndex.month and DatetimeIndex.year; for ordering add sort_index, and last use reset_index to turn the MultiIndex levels into columns:
import pandas as pd
import numpy as np
np.random.seed(2018)
#changed 300 to 600
df = pd.DataFrame(np.random.randint(0,10,size=(600, 2)), columns=list('AB'))
df['date'] = pd.DataFrame(pd.date_range(start='1/1/2018', end='8/23/2019'))
df = df.set_index('date')
df1 = (df.groupby([df.index.month.rename('month'),
                   df.index.year.rename('year')])
         .agg({'A': 'sum', 'B': 'mean'})
         .sort_index(level=['year', 'month'])
         .reset_index())
print(df1)
month year A B
0 1 2018 147 4.838710
1 2 2018 120 3.678571
2 3 2018 114 4.387097
3 4 2018 143 3.800000
4 5 2018 124 3.870968
5 6 2018 129 4.700000
6 7 2018 143 3.935484
7 8 2018 118 5.483871
8 9 2018 150 5.500000
9 10 2018 139 4.225806
10 11 2018 136 4.933333
11 12 2018 141 4.548387
12 1 2019 137 4.709677
13 2 2019 120 4.964286
14 3 2019 167 4.935484
15 4 2019 121 4.200000
16 5 2019 133 4.129032
17 6 2019 140 5.066667
18 7 2019 189 4.677419
19 8 2019 100 3.695652
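If you want the column names from the desired output, a final rename (not part of the original answer) could be:

df1 = df1.rename(columns={'A': 'monthly_accumulated_of_A',
                          'B': 'monthly_averaged_of_B'})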
I just picked up pandas. I have a dataframe as follows:
DEST MONTH PRICE SOUR TYPE YEAR
0 DEST7 8 159 SOUR4 WEEKEND 2015
1 DEST2 9 391 SOUR1 WEEKEND 2010
2 DEST5 5 612 SOUR1 WEEKDAY 2013
3 DEST4 10 836 SOUR4 WEEKEND 2013
4 DEST4 4 689 SOUR3 WEEKEND 2013
5 DEST7 3 862 SOUR4 WEEKDAY 2014
6 DEST4 5 483 SOUR4 WEEKEND 2016
7 DEST2 2 489 SOUR3 WEEKEND 2017
8 DEST4 7 207 SOUR1 WEEKDAY 2012
9 DEST3 11 374 SOUR2 WEEKDAY 2015
10 DEST1 2 959 SOUR2 WEEKEND 2017
11 DEST5 10 969 SOUR3 WEEKDAY 2011
12 DEST8 3 645 SOUR4 WEEKEND 2013
13 DEST6 7 258 SOUR4 WEEKEND 2013
14 DEST8 5 955 SOUR4 WEEKDAY 2010
15 DEST1 3 568 SOUR4 WEEKEND 2013
16 DEST5 5 601 SOUR4 WEEKDAY 2016
17 DEST1 6 159 SOUR3 WEEKDAY 2011
18 DEST3 11 322 SOUR4 WEEKDAY 2013
19 DEST2 10 103 SOUR2 WEEKDAY 2012
I've put the code below; feel free to generate your own random dataframe:
import pandas as pd
import random
import numpy as np
df = pd.DataFrame({
    "YEAR": np.random.choice([2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017], 20, replace=True),
    "MONTH": np.random.choice(list(range(1, 13)), 20, replace=True),
    "TYPE": np.random.choice(['WEEKDAY', 'WEEKEND'], 20, replace=True),
    "SOUR": np.random.choice(['SOUR1', 'SOUR2', 'SOUR3', 'SOUR4'], 20, replace=True),
    "DEST": np.random.choice(['DEST1', 'DEST2', 'DEST3', 'DEST4', 'DEST5', 'DEST6', 'DEST7', 'DEST8'], 20, replace=True),
    "PRICE": np.random.choice(list(range(100, 999)), 20, replace=True)})
print(df)
I want to generate min, max, mean, median, ... as new columns and add them to the dataframe. This is the aggregation code I tried:
aggregation = {
    "PRICE": {
        "MIN": lambda x: x.min(skipna=True),
        "MAX": lambda x: x.max(skipna=True),
        "MEDIAN": lambda x: x.median(skipna=True),
        "MEAN": lambda x: x.mean(skipna=True)
    }
}
df1 = df.groupby(["YEAR", "MONTH", "TYPE", "SOUR", "DEST"]).agg(aggregation).reset_index()
df1
But the output doesn't calculate any min, max, median, mean at all:
YEAR MONTH TYPE SOUR DEST PRICE
MIN MAX MEDIAN MEAN
0 2010 5 WEEKDAY SOUR4 DEST8 955 955 955 955
1 2010 9 WEEKEND SOUR1 DEST2 391 391 391 391
2 2011 6 WEEKDAY SOUR3 DEST1 159 159 159 159
3 2011 10 WEEKDAY SOUR3 DEST5 969 969 969 969
4 2012 7 WEEKDAY SOUR1 DEST4 207 207 207 207
5 2012 10 WEEKDAY SOUR2 DEST2 103 103 103 103
6 2013 3 WEEKEND SOUR4 DEST1 568 568 568 568
7 2013 3 WEEKEND SOUR4 DEST8 645 645 645 645
8 2013 4 WEEKEND SOUR3 DEST4 689 689 689 689
9 2013 5 WEEKDAY SOUR1 DEST5 612 612 612 612
10 2013 7 WEEKEND SOUR4 DEST6 258 258 258 258
11 2013 10 WEEKEND SOUR4 DEST4 836 836 836 836
12 2013 11 WEEKDAY SOUR4 DEST3 322 322 322 322
13 2014 3 WEEKDAY SOUR4 DEST7 862 862 862 862
14 2015 8 WEEKEND SOUR4 DEST7 159 159 159 159
15 2015 11 WEEKDAY SOUR2 DEST3 374 374 374 374
16 2016 5 WEEKDAY SOUR4 DEST5 601 601 601 601
17 2016 5 WEEKEND SOUR4 DEST4 483 483 483 483
18 2017 2 WEEKEND SOUR2 DEST1 959 959 959 959
19 2017 2 WEEKEND SOUR3 DEST2 489 489 489 489
How could I modify the Python code to give the correct output? Thanks.
And another question: if I want to add another column that calculates the average price grouped only by TYPE, SOUR, DEST (not including MONTH or YEAR), how do I generate it while keeping the grouping of TYPE, SOUR, DEST, MONTH, YEAR? My expected output:
YEAR MONTH TYPE SOUR DEST PRICE
MIN MAX MEDIAN MEAN AVG
0 2010 5 WEEKDAY SOUR4 DEST8 ... ... ... ... 500
1 2010 9 WEEKEND SOUR1 DEST2 ... ... ... ...
2 2011 6 WEEKDAY SOUR3 DEST5 ... ... ... ... 720
3 2011 10 WEEKDAY SOUR3 DEST5 ... ... ... ... 720
4 2012 7 WEEKDAY SOUR1 DEST4 ... ... ... ...
5 2012 10 WEEKDAY SOUR2 DEST2 ... ... ... ...
6 2013 3 WEEKEND SOUR4 DEST1 ... ... ... ...
7 2013 3 WEEKDAY SOUR4 DEST8 ... ... ... ... 500
8 2013 4 WEEKEND SOUR3 DEST4 ... ... ... ...
9 2013 5 WEEKDAY SOUR1 DEST5 ... ... ... ...
10 2013 7 WEEKEND SOUR4 DEST6 ... ... ... ...
...
Your code actually does calculate the min, max, median and mean. However, since you're grouping on 5 columns, the chance of 2 rows sharing the same values in all 5 columns within only 20 rows is very small.
Either increase the amount of data, so the groupby actually groups rows together, or group on fewer columns at a time.
To add a column with the AVG (mean) using only 3 columns for the groupby, do that groupby separately and merge the results on the three columns.
df1 = df.groupby(["YEAR", "MONTH", "TYPE", "SOUR", "DEST"]).agg(aggregation).reset_index()
df2 = df.groupby(["TYPE", "SOUR", "DEST"]).agg({"PRICE": {"avg": "mean"}}).reset_index()
df3 = pd.merge(df1, df2, on=["TYPE", "SOUR", "DEST"], how='left')
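An alternative sketch that avoids the merge, assuming the same df: transform broadcasts the 3-column group mean back onto every row.

# Mean PRICE per (TYPE, SOUR, DEST), repeated on every matching row
df['AVG'] = df.groupby(["TYPE", "SOUR", "DEST"])["PRICE"].transform('mean')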
To apply multiple functions within your aggregate, you can use named aggregation:
animals = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
                        'height': [9.1, 6.0, 9.5, 34.0],
                        'weight': [7.9, 7.5, 9.9, 198.0]})

animals.groupby("kind").agg(
    min_height=pd.NamedAgg(column='height', aggfunc='min'),
    max_height=pd.NamedAgg(column='height', aggfunc='max'),
    average_weight=pd.NamedAgg(column='weight', aggfunc='mean'),
)
The output looks like:
min_height max_height average_weight
kind
cat 9.1 9.5 8.90
dog 6.0 34.0 102.75
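Named aggregation requires pandas 0.25 or newer; the tuple shorthand is equivalent to the pd.NamedAgg form:

animals.groupby("kind").agg(
    min_height=('height', 'min'),
    max_height=('height', 'max'),
    average_weight=('weight', 'mean'),
)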