Pandas calculate percent growth over rows - python

I've created the following pandas dataframe and am trying to calculate the growth in % between the years given in Col2:
Col1  Col2  Jan   Feb   Mrz   Total
A     2019   100   200   300    600
A     2020   200   300   400    900
B     2019    10    20    30     60
B     2020    20    30    40     90
C     2019  1000  2000  3000   6000
C     2020  2000  3000  4000   9000
The table including the results should look like this (see last 3 rows):
Col1  Col2             Jan   Feb   Mrz   Total
A     2019              100   200   300    600
A     2020              200   300   400    900
B     2019               10    20    30     60
B     2020               20    30    40     90
C     2019             1000  2000  3000   6000
C     2020             2000  3000  4000   9000
A     GrowthInPercent   100    50    33     50
B     GrowthInPercent   100    50    33     50
C     GrowthInPercent   100    50    33     50
Is there a way to calculate the GrowthInPercent values using a pandas function?
I can't figure it out ;-(

You can use pct_change with groupby
u = (df[['Col1']]
       .join(df.drop("Col2", axis=1).groupby('Col1').pct_change()
               .mul(100).round())
       .dropna()
       .assign(Col2="Growth%"))
out = df.append(u, ignore_index=True)
print(out)
Col1 Col2 Feb Jan Mrz Total
0 A 2019 200.0 100.0 300.0 600.0
1 A 2020 300.0 200.0 400.0 900.0
2 B 2019 20.0 10.0 30.0 60.0
3 B 2020 30.0 20.0 40.0 90.0
4 C 2019 2000.0 1000.0 3000.0 6000.0
5 C 2020 3000.0 2000.0 4000.0 9000.0
6 A Growth% 50.0 100.0 33.0 50.0
7 B Growth% 50.0 100.0 33.0 50.0
8 C Growth% 50.0 100.0 33.0 50.0
Note - this is assuming the data is sorted by Col1 and Col2 , if not you can use df = df.sort_values(by=['Col1','Col2']) first to sort the data.
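For newer pandas (2.x), where DataFrame.append has been removed, a sketch along the same lines using pd.concat (column names taken from the question) could look like this:
import pandas as pd

# growth per Col1 group, row on row, as whole percentages
growth = (df.groupby('Col1')[['Jan', 'Feb', 'Mrz', 'Total']]
            .pct_change()
            .mul(100).round()
            .dropna()
            .assign(Col1=df['Col1'], Col2='GrowthInPercent'))

out = pd.concat([df, growth], ignore_index=True)
print(out)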

Related

How to calculate last two week sum for each group ID

I have the following input dataframe:
ID  Date        Amount
A   2021-08-03     100
A   2021-08-04     100
A   2021-08-06      20
A   2021-08-07     100
A   2021-08-09     300
A   2021-08-11     100
A   2021-08-12     100
A   2021-08-13      10
A   2021-08-23      10
A   2021-08-24      10
A   2021-08-26      10
A   2021-08-28      10
Desired output dataframe:
ID  Date        Amount  TwoWeekSum
A   2021-08-03     100         320
A   2021-08-04     100         320
A   2021-08-06      20         320
A   2021-08-07     100         320
A   2021-08-09     300         830
A   2021-08-11     100         830
A   2021-08-12     100         830
A   2021-08-13      10         830
A   2021-08-23      10          40
A   2021-08-24      10          40
A   2021-08-26      10          40
A   2021-08-28      10          40
I want to calculate a two-week total for each row: twoWeekSum = current week's total + previous week's total, i.e. if the current week is 34, then twoWeekSum is the week 34 total plus the week 33 total.
Please help me get this into the output dataframe shown above so I can use it for further analysis.
Thank you, folks!
Use:
#convert values to datetimes
df['Date'] = pd.to_datetime(df['Date'])
#convert dates to ISO week numbers
df['week'] = df['Date'].dt.isocalendar().week
#aggregate sum per ID and week, add missing weeks, then sum over a rolling window of 2 weeks
f = lambda x: (x.reindex(range(x.index.min(), x.index.max() + 1))
                .rolling(2, min_periods=1).sum())
df1 = df.groupby(['ID', 'week'])['Amount'].sum().reset_index(level=0).groupby('ID').apply(f)
print (df1)
Amount
ID week
A 31 320.0
32 830.0
33 510.0
34 40.0
#finally, add the result to the original DataFrame per ID and week
df = df.join(df1.rename(columns={'Amount': 'TwoWeekSum'}), on=['ID', 'week']).drop('week', axis=1)
print (df)
ID Date Amount TwoWeekSum
0 A 2021-08-03 100 320.0
1 A 2021-08-04 100 320.0
2 A 2021-08-06 20 320.0
3 A 2021-08-07 100 320.0
4 A 2021-08-09 300 830.0
5 A 2021-08-11 100 830.0
6 A 2021-08-12 100 830.0
7 A 2021-08-13 10 830.0
8 A 2021-08-23 10 40.0
9 A 2021-08-24 10 40.0
10 A 2021-08-26 10 40.0
11 A 2021-08-28 10 40.0
Another option is to build weekly totals over weekly periods, take a rolling sum over two weeks, and map the result back onto the rows:
per = pd.period_range(df['Date'].min(), df['Date'].max(), freq='w')
mapper = (df.groupby(df['Date'].astype('Period[W]'))['Amount'].sum()
            .reindex(per, fill_value=0)
            .rolling(2, 1).sum())
out = df['Date'].astype('Period[W]').map(mapper)
out
0 320.0
1 320.0
2 320.0
3 320.0
4 830.0
5 830.0
6 830.0
7 830.0
8 40.0
9 40.0
10 40.0
11 40.0
Name: Date, dtype: float64
Then assign out as the TwoWeekSum column:
df.assign(TwoWeekSum=out)
ID Date Amount TwoWeekSum
0 A 2021-08-03 100 320.0
1 A 2021-08-04 100 320.0
2 A 2021-08-06 20 320.0
3 A 2021-08-07 100 320.0
4 A 2021-08-09 300 830.0
5 A 2021-08-11 100 830.0
6 A 2021-08-12 100 830.0
7 A 2021-08-13 10 830.0
8 A 2021-08-23 10 40.0
9 A 2021-08-24 10 40.0
10 A 2021-08-26 10 40.0
11 A 2021-08-28 10 40.0
Update
If there are multiple IDs, group by ID as well and merge:
per = pd.period_range(df['Date'].min(), df['Date'].max(), freq='w')
s = df['Date'].astype('Period[W]')
idx = pd.MultiIndex.from_product([df['ID'].unique(), per])
df1 = (df.groupby(['ID', s])['Amount'].sum()
         .reindex(idx, fill_value=0)
         .rolling(2, 1).sum()
         .reset_index()
         .set_axis(['ID', 'period', 'TwoWeekSum'], axis=1))
df.assign(period=s).merge(df1, how='left').drop('period', axis=1)
Try using groupby to group the dataframe by dt.week (each week), then use transform('sum') to add up the values within each week and broadcast the weekly total back to the rows:
df['TwoWeekSum'] = df.groupby(df['Date'].dt.week)['Amount'].transform('sum')
And then:
print(df)
Gives:
ID Date Amount TwoWeekSum
0 A 2021-08-03 100 320
1 A 2021-08-04 100 320
2 A 2021-08-06 20 320
3 A 2021-08-07 100 320
4 A 2021-08-09 300 830
5 A 2021-08-11 100 830
6 A 2021-08-12 100 830
7 A 2021-08-13 10 830
8 A 2021-08-23 10 40
9 A 2021-08-24 10 40
10 A 2021-08-26 10 40
11 A 2021-08-28 10 40
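Note that Series.dt.week is deprecated in recent pandas in favour of dt.isocalendar().week, and a plain weekly transform('sum') only covers the current week (with this sample, week 32 would give 510 rather than 830). A sketch that also folds in the previous week's total, assuming a single ID as in the sample data:
df['Date'] = pd.to_datetime(df['Date'])            # ensure Date is datetime
week = df['Date'].dt.isocalendar().week.astype(int)
weekly = df.groupby(week)['Amount'].sum()
# current week plus previous week, treating missing weeks as 0
two_week = (weekly.reindex(range(weekly.index.min(), weekly.index.max() + 1), fill_value=0)
                  .rolling(2, min_periods=1).sum())
df['TwoWeekSum'] = week.map(two_week)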

How to get top n records from each category in a Python dataframe?

The data is sorted in descending order on column 'id' in the following dataframe -
id Name version copies price
6 MSFT 10.0 5 100
6 TSLA 10.0 10 200
6 ORCL 10.0 15 300
5 MSFT 10.0 20 400
5 TSLA 10.0 25 500
5 ORCL 10.0 30 600
4 MSFT 10.0 35 700
4 TSLA 10.0 40 800
4 ORCL 10.0 45 900
3 MSFT 5.0 50 1000
3 TSLA 5.0 55 1100
3 ORCL 5.0 60 1200
2 MSFT 5.0 65 1300
2 TSLA 5.0 70 1400
2 ORCL 5.0 75 1500
1 MSFT 15.0 80 1600
1 TSLA 15.0 85 1700
1 ORCL 15.0 90 1800
...
Based on the input 'n', I would like to filter above data such that, if input is '2', the resulting dataframe should look like -
Name version copies price
MSFT 10.0 5 100
TSLA 10.0 10 200
ORCL 10.0 15 300
MSFT 10.0 20 400
TSLA 10.0 25 500
ORCL 10.0 30 600
MSFT 5.0 50 1000
TSLA 5.0 55 1100
ORCL 5.0 60 1200
MSFT 5.0 65 1300
TSLA 5.0 70 1400
ORCL 5.0 75 1500
MSFT 15.0 80 1600
TSLA 15.0 85 1700
ORCL 15.0 90 1800
Basically, only the top n groups of 'id' for each version should be present in the resulting dataframe. If a version has fewer than n id groups (e.g. version 15.0 has only one group, with id = 1), then all of its id groups should be kept.
I tried using groupby and head, but it didn't work for me, and I have no other clue how to get this to work.
I really appreciate any help with this, thank you.
You can use groupby.transform on the column version and factorize the column id to get an incremental value (0, 1, 2, ...) for each id within the group, then compare it to your n and use loc with this mask to select the wanted rows.
n = 2
print(df.loc[df.groupby('version')['id'].transform(lambda x: pd.factorize(x)[0])<n])
id Name version copies price
0 6 MSFT 10.0 5 100
1 6 TSLA 10.0 10 200
2 6 ORCL 10.0 15 300
3 5 MSFT 10.0 20 400
4 5 TSLA 10.0 25 500
5 5 ORCL 10.0 30 600
9 3 MSFT 5.0 50 1000
10 3 TSLA 5.0 55 1100
11 3 ORCL 5.0 60 1200
12 2 MSFT 5.0 65 1300
13 2 TSLA 5.0 70 1400
14 2 ORCL 5.0 75 1500
15 1 MSFT 15.0 80 1600
16 1 TSLA 15.0 85 1700
17 1 ORCL 15.0 90 1800
Another option is to use groupby.head after drop_duplicates to keep the unique version-id pairs, then select those pairs in a merge:
df.merge(df[['version','id']].drop_duplicates().groupby('version').head(n))
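As a further alternative (a sketch, not from the answers above): since the data is sorted descending on id, a dense rank of id within each version (ties share a rank) can select the top n id groups directly:
n = 2
top_ids = df.groupby('version')['id'].rank(method='dense', ascending=False)
print(df[top_ids <= n].drop(columns='id'))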

pandas create column as lagged difference of two other columns grouped by key

I have the following dataframe (df)
AmountNeeded AmountAvailable
Source Target
1 2 290.0 600.0
4 300.0 600.0
6 200.0 600.0
3 2 290.0 450.0
5 100.0 450.0
7 8 0.0 500.0
I would like to compute the remaining availability per source:
AmountNeeded AmountAvailable RemainingAvailability
Source Target
1 2 290.0 600.0 600
4 300.0 600.0 310
6 200.0 600.0 10
3 2 290.0 450.0 450
5 100.0 450.0 160
7 8 0.0 500.0 500
So if a Source appears more than once, I need to subtract the sum of the lagged values of AmountNeeded for that particular Source.
If we take Source 1 and Target 4, the remaining amount should be AmountAvailable - AmountNeeded(previous row) = 600 - 290 = 310.
If we move to Source 1 and Target 6, this becomes 600 - (290 + 300) = 10.
This can also be computed as RemainingAvailability - AmountNeeded = 310 - 300 = 10.
I tried to use different combinations of groupby and diff but without much success.
Use Series.sub with a helper Series built per group by a lambda combining Series.shift and the cumulative sum Series.cumsum:
s = df.groupby(level=0)['AmountNeeded'].apply(lambda x: x.shift(fill_value=0).cumsum())
df['RemainingAvailability'] = df['AmountAvailable'].sub(s)
print (df)
AmountNeeded AmountAvailable RemainingAvailability
Source Target
1 2 290.0 600.0 600.0
4 300.0 600.0 310.0
6 200.0 600.0 10.0
3 2 290.0 450.0 450.0
5 100.0 450.0 160.0
7 8 0.0 500.0 500.0
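For reference, a sketch that avoids the lambda: within each Source group, the cumulative sum minus the current row equals the sum of the preceding AmountNeeded values:
# sum of all previous AmountNeeded rows within the same Source group
prev_needed = df.groupby(level=0)['AmountNeeded'].cumsum() - df['AmountNeeded']
df['RemainingAvailability'] = df['AmountAvailable'] - prev_needed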

Pandas Dataframe: shift/merge multiple rows sharing the same column values into one row

Sorry for any possible confusion with the title; I will describe my question better with the following code and tables.
I have a dataframe with multiple columns. The rows are sorted by the first two columns, 'Route' and 'ID' (sorry about the formatting; all the rows here have a 'Route' value of 100 and 'ID' from 1 to 3).
df1.head(9)
Route ID Year Vol Truck_Vol Truck_%
0 100 1 2017.0 7016 635.0 9.1
1 100 1 2014.0 6835 NaN NaN
2 100 1 2011.0 5959 352.0 5.9
3 100 2 2018.0 15828 NaN NaN
4 100 2 2015.0 13114 2964.0 22.6
5 100 2 2009.0 11844 1280.0 10.8
6 100 3 2016.0 15434 NaN NaN
7 100 3 2013.0 18699 2015.0 10.8
8 100 3 2010.0 15903 NaN NaN
What I want to have is
Route ID Year Vol1 Truck_Vol1 Truck_%1 Year2 Vol2 Truck_Vol2 Truck_%2 Year3 Vol3 Truck_Vol3 Truck_%3
0 100 1 2017 7016 635.0 9.1 2014 6835 NaN NaN 2011 5959 352.0 5.9
1 100 2 2018 15828 NaN NaN 2015 13114 2964.0 22.6 2009 11844 1280.0 10.8
2 100 3 2016 15434 NaN NaN 2013 18699 2015.0 10.8 2010 15903 NaN NaN
Again, sorry for the messy formatting. Let me try a simplified version.
Input:
Route ID Year Vol T_%
0 100 1 2017 100 1.0
1 100 1 2014 200 NaN
2 100 1 2011 300 2.0
3 100 2 2018 400 NaN
4 100 2 2015 500 3.0
5 100 2 2009 600 4.0
Desired Output:
Route ID Year Vol T_% Year.1 Vol.1 T_%.1 Year.2 Vol.2 T_%.2
0 100 1 2017 100 1.0 2014 200 NaN 2011 300 2
1 100 2 2018 400 NaN 2015 500 3.0 2009 600 4
So basically the cells of the later rows in each group are just moved up next to the group's first row, as shown in the desired output.
I am stumped here. The names for the newly generated columns don't matter.
For this current dataframe I have three rows per group, as shown in the code. It would be great if the answer could accommodate any number of rows in each group.
Thanks for your time.
with groupby + cumcount + set_index + unstack
df1 = (df.assign(cid=df.groupby(['Route', 'ID']).cumcount())
         .set_index(['Route', 'ID', 'cid'])
         .unstack(-1)
         .sort_index(axis=1, level=1))
df1.columns = [f'{x}{y}' for x, y in df1.columns]
df1 = df1.reset_index()
Output df1:
Route ID T_%0 Vol0 Year0 T_%1 Vol1 Year1 T_%2 Vol2 Year2
0 100 1 1.0 100 2017 NaN 200 2014 2.0 300 2011
1 100 2 NaN 400 2018 3.0 500 2015 4.0 600 2009
melt + pivot_table
v = df.melt(id_vars=['Route', 'ID'])
v['variable'] += v.groupby(['Route', 'ID', 'variable']).cumcount().astype(str)
res = v.pivot_table(index=['Route', 'ID'], columns='variable', values='value')
variable T_% 0 T_% 1 T_% 2 Vol 0 Vol 1 Vol 2 Year 0 Year 1 Year 2
Route ID
100 1 1.0 NaN 2.0 100.0 200.0 300.0 2017.0 2014.0 2011.0
2 NaN 3.0 4.0 400.0 500.0 600.0 2018.0 2015.0 2009.0
If you want to sort these numerically by the trailing counter:
import numpy as np

c = res.columns.str.extract(r'(\d+)')[0].values.astype(int)
res.iloc[:, np.argsort(c)]
variable T_%0 Vol0 Year0 T_%1 Vol1 Year1 T_%2 Vol2 Year2
Route ID
100 1 1.0 100.0 2017.0 NaN 200.0 2014.0 2.0 300.0 2011.0
2 NaN 400.0 2018.0 3.0 500.0 2015.0 4.0 600.0 2009.0
You asked about why I used cumcount. To explain, here is what v looks like from above:
Route ID variable value
0 100 1 Year 2017.0
1 100 1 Year 2014.0
2 100 1 Year 2011.0
3 100 2 Year 2018.0
4 100 2 Year 2015.0
5 100 2 Year 2009.0
6 100 1 Vol 100.0
7 100 1 Vol 200.0
8 100 1 Vol 300.0
9 100 2 Vol 400.0
10 100 2 Vol 500.0
11 100 2 Vol 600.0
12 100 1 T_% 1.0
13 100 1 T_% NaN
14 100 1 T_% 2.0
15 100 2 T_% NaN
16 100 2 T_% 3.0
17 100 2 T_% 4.0
If I used pivot_table on this DataFrame, you would end up with something like this:
variable T_% Vol Year
Route ID
100 1 1.5 200.0 2014.0
2 3.5 500.0 2014.0
Obviously you are losing data here. cumcount is the solution, as it turns the variable series into this:
Route ID variable value
0 100 1 Year0 2017.0
1 100 1 Year1 2014.0
2 100 1 Year2 2011.0
3 100 2 Year0 2018.0
4 100 2 Year1 2015.0
5 100 2 Year2 2009.0
6 100 1 Vol0 100.0
7 100 1 Vol1 200.0
8 100 1 Vol2 300.0
9 100 2 Vol0 400.0
10 100 2 Vol1 500.0
11 100 2 Vol2 600.0
12 100 1 T_%0 1.0
13 100 1 T_%1 NaN
14 100 1 T_%2 2.0
15 100 2 T_%0 NaN
16 100 2 T_%1 3.0
17 100 2 T_%2 4.0
Where you have a count of repeated elements per unique Route and ID.

Shift element by 2 when there is a change in value in a column and then forward fill using pandas

I have a pandas dataframe with a date index and 100 columns of stock prices.
For each stock, whenever there is a price change, I want the change to appear with a lag of 2 rows, and the values to be forward filled after that.
Eg data of 2 columns (subset of my data):
Stock A Stock B
1/1/2000 100 50
1/2/2000 100 50
1/3/2000 100 50
1/4/2000 350 50
1/5/2000 350 50
1/6/2000 350 50
1/7/2000 350 25
1/8/2000 350 25
1/9/2000 500 25
1/10/2000 500 25
1/11/2000 500 25
1/12/2000 500 150
1/1/2001 250 150
1/2/2001 250 150
1/3/2001 250 150
1/4/2001 250 150
1/5/2001 250 150
1/6/2001 250 150
1/7/2001 250 150
1/8/2001 75 150
1/9/2001 75 150
1/10/2001 75 25
1/11/2001 75 25
1/12/2001 75 25
1/1/2002 75 25
Now the output I desire is this:
Stock A Stock B
1/1/2000
1/2/2000
1/3/2000
1/4/2000
1/5/2000 100
1/6/2000 100
1/7/2000 100
1/8/2000 100 50
1/9/2000 100 50
1/10/2000 350 50
1/11/2000 350 50
1/12/2000 350 50
1/1/2001 350 25
1/2/2001 500 25
1/3/2001 500 25
1/4/2001 500 25
1/5/2001 500 25
1/6/2001 500 25
1/7/2001 500 25
1/8/2001 500 25
1/9/2001 250 25
1/10/2001 250 25
1/11/2001 250 150
1/12/2001 250 150
1/1/2002 250 150
Example for stock A:
When stock A changed for the first time (100 to 350), the previous value (100) was assigned 2 days ahead (1/5/2000). Then, when it changed again from 350 to 500, 350 was assigned 2 days ahead (1/10/2000), and so on. After that, a forward fill takes place.
Any help would be appreciated.
df.where(df.diff(-1).fillna(0).ne(0)).shift(2).ffill()
A B
2000-01-01 NaN NaN
2000-02-01 NaN NaN
2000-03-01 NaN NaN
2000-04-01 NaN NaN
2000-05-01 100.0 NaN
2000-06-01 100.0 NaN
2000-07-01 100.0 NaN
2000-08-01 100.0 50.0
2000-09-01 100.0 50.0
2000-10-01 350.0 50.0
2000-11-01 350.0 50.0
2000-12-01 350.0 50.0
2001-01-01 350.0 25.0
2001-02-01 500.0 25.0
2001-03-01 500.0 25.0
2001-04-01 500.0 25.0
2001-05-01 500.0 25.0
2001-06-01 500.0 25.0
2001-07-01 500.0 25.0
2001-08-01 500.0 25.0
2001-09-01 250.0 25.0
2001-10-01 250.0 25.0
2001-11-01 250.0 150.0
2001-12-01 250.0 150.0
2002-01-01 250.0 150.0
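Broken into steps, the one-liner does the following (a sketch with intermediate names added only for illustration):
changed = df.diff(-1).fillna(0).ne(0)   # True on the last row of each run of equal prices
masked = df.where(changed)              # keep only those rows, NaN everywhere else
out = masked.shift(2).ffill()           # push them two rows ahead, then forward fill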
