Pandas Month-To-Date rolling sum - python

We can apply a 30-day rolling sum operation as:
df.rolling("30D").sum()
However, how can I achieve a month-to-date (or even year-to-date) rolling sum in a similar fashion?
Month-to-date means that we only sum from the beginning of the month up to the current date (or row).

Consider the following DataFrame:
Year Month week Revenue
0 2020 1 1 10
1 2020 1 2 20
2 2020 1 3 10
3 2020 1 4 20
4 2020 2 1 10
5 2020 2 2 20
6 2020 2 3 10
7 2020 2 4 20
8 2020 3 1 10
9 2020 3 2 20
10 2020 3 3 10
11 2020 3 4 20
12 2021 1 1 10
13 2021 1 2 20
14 2021 1 3 10
15 2021 1 4 20
16 2021 2 1 10
17 2021 2 2 20
18 2021 2 3 10
19 2021 2 4 20
20 2021 3 1 10
21 2021 3 2 20
22 2021 3 3 10
23 2021 3 4 20
You could use a combination of groupby + cumsum to get what you want:
df['Year_To_date'] = df.groupby('Year')['Revenue'].cumsum()
df['Month_To_date'] = df.groupby(['Year', 'Month'])['Revenue'].cumsum()
Results:
Year Month week Revenue Year_To_date Month_To_date
0 2020 1 1 10 10 10
1 2020 1 2 20 30 30
2 2020 1 3 10 40 40
3 2020 1 4 20 60 60
4 2020 2 1 10 70 10
5 2020 2 2 20 90 30
6 2020 2 3 10 100 40
7 2020 2 4 20 120 60
8 2020 3 1 10 130 10
9 2020 3 2 20 150 30
10 2020 3 3 10 160 40
11 2020 3 4 20 180 60
12 2021 1 1 10 10 10
13 2021 1 2 20 30 30
14 2021 1 3 10 40 40
15 2021 1 4 20 60 60
16 2021 2 1 10 70 10
17 2021 2 2 20 90 30
18 2021 2 3 10 100 40
19 2021 2 4 20 120 60
20 2021 3 1 10 130 10
21 2021 3 2 20 150 30
22 2021 3 3 10 160 40
23 2021 3 4 20 180 60
Note that Month-to-date makes sense only if you have a week/date column in your data model.
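If your data is indexed by actual dates rather than separate Year/Month columns, the same idea applies by grouping on the index's year and month. A minimal sketch, assuming a DataFrame with a DatetimeIndex and a Revenue column (the names here are illustrative, not from the question):
import pandas as pd

# Illustrative frame: weekly revenue on a DatetimeIndex
idx = pd.date_range('2020-01-01', periods=10, freq='W')
dfd = pd.DataFrame({'Revenue': range(10)}, index=idx)

# Group on the index's year/month instead of separate columns
dfd['Month_To_date'] = dfd['Revenue'].groupby(
    [dfd.index.year, dfd.index.month]).cumsum()
dfd['Year_To_date'] = dfd['Revenue'].groupby(dfd.index.year).cumsum()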
EXTRAS:
The goal of cumsum is to compute the cumulative sum over dates within each period. However, if the index of the original DataFrame is not ordered in the desired sequence, cumsum is computed by the original index within each group, because pandas operates row by row in index order.
Thus, the DataFrame first needs to be sorted in the desired order ([Year, Month, week] or [Date]), followed by resetting the index so it matches that order. The output is then summed up per period group, in chronological order.
df = df.sort_values(['Year', 'Month', 'week']).reset_index(drop=True)

Related

Appending DataFrame columns to another DataFrame at an index/location that meets conditions [duplicate]

I have a one_sec_flt DataFrame that has 300,000+ rows and a flasks DataFrame that has 230 rows. Both DataFrames have time columns (hour, minute, second). I want to append each flasks row to the one_sec_flt row that was taken at the same time.
Flasks DataFrame
year month day hour minute second... gas1 gas2 gas3
0 2018 4 8 16 27 48... 10 25 191
1 2018 4 8 16 40 20... 45 34 257
...
229 2018 5 12 14 10 05... 3 72 108
one_sec_flt DataFrame
Year Month Day Hour Min Second... temp wind
0 2018 4 8 14 30 20... 300 10
1 2018 4 8 14 45 15... 310 8
...
305,212 2018 5 12 14 10 05... 308 24
I have this code I started with but I don't know how to append one DataFrame to another at that exact timestamp.
for i in range(len(flasks)):
    for j in range(len(one_sec_flt)):
        if flasks.hour.iloc[i] == one_sec_flt.Hour.iloc[j]:
            if flasks.minute.iloc[i] == one_sec_flt.Min.iloc[j]:
                if flasks.second.iloc[i] == one_sec_flt.Second.iloc[j]:
                    print('match')
My output goal would look like:
Year Month Day Hour Min Second... temp wind gas1 gas2 gas3
0 2018 4 8 14 30 20... 300 10 nan nan nan
1 2018 4 8 14 45 15... 310 8 nan nan nan
2 2018 4 8 15 15 47... ... ... nan nan nan
3 2018 4 8 16 27 48... ... ... 10 25 191
4 2018 4 8 16 30 11... ... ... nan nan nan
5 2018 4 8 16 40 20... ... ... 45 34 257
... ... ... ... ... ... ... ... ... ... ... ...
305,212 2018 5 12 14 10 05... 308 24 3 72 108
If you can concatenate both DataFrames, Flasks and one_sec_flt, and then sort by the time columns, it might achieve what you are looking for (at least, if I understood the problem statement correctly).
Flasks
Out[13]:
year month day hour minute second
0 2018 4 8 16 27 48
1 2018 4 8 16 40 20
one_sec
Out[14]:
year month day hour minute second
0 2018 4 8 14 30 20
1 2018 4 8 14 45 15
df_res = pd.concat([Flasks,one_sec])
df_res
Out[16]:
year month day hour minute second
0 2018 4 8 16 27 48
1 2018 4 8 16 40 20
0 2018 4 8 14 30 20
1 2018 4 8 14 45 15
df_res.sort_values(by=['year','month','day','hour','minute','second'])
Out[17]:
year month day hour minute second
0 2018 4 8 14 30 20
1 2018 4 8 14 45 15
0 2018 4 8 16 27 48
1 2018 4 8 16 40 20
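If the goal is the merged table shown in the question (gas columns attached at matching timestamps, NaN elsewhere), a left merge on the time columns may get closer than concat + sort. A sketch under the column names shown above; the rename mapping is an assumption to reconcile the differing capitalizations:
import pandas as pd

# Align flasks' lowercase time columns with one_sec_flt's names (assumed mapping)
flasks_aligned = flasks.rename(columns={
    'year': 'Year', 'month': 'Month', 'day': 'Day',
    'hour': 'Hour', 'minute': 'Min', 'second': 'Second'})

# A left merge keeps every one_sec_flt row; gas1/gas2/gas3 are NaN
# wherever no flask sample was taken at that second
merged = one_sec_flt.merge(
    flasks_aligned,
    on=['Year', 'Month', 'Day', 'Hour', 'Min', 'Second'],
    how='left')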

Fill column with value from previous year from the same month

How can I use the value from the same month in the previous year to fill values in the following table for 2020:
Category Month Year Value
A 1 2019 15
B 2 2019 20
A 2 2019 90
A 3 2019 50
B 4 2019 40
A 5 2019 20
A 6 2019 15
A 7 2019 17
A 8 2019 18
A 9 2019 12
A 10 2019 11
A 11 2019 19
A 12 2019 15
A 1 2020 18
A 2 2020 53
A 3 2020 80
The final desired result is the following:
Category Month Year Value
A 1 2019 15
B 2 2019 20
A 2 2019 90
A 3 2019 50
B 4 2019 40
A 4 2019 40
A 5 2019 20
A 6 2019 15
A 7 2019 17
A 8 2019 18
A 9 2019 12
A 10 2019 11
A 11 2019 19
A 12 2019 15
A 1 2020 18
A 2 2020 53
A 3 2020 80
B 4 2020 40
A 4 2020 40
A 5 2020 20
A 6 2020 15
A 7 2020 17
A 8 2020 18
A 9 2020 12
A 10 2020 11
A 11 2020 19
A 12 2020 15
I tried using pandas groupby but not sure if that is the right approach.
IIUC, we can use pivot_table, then ffill within each Category, then stack:
s = (df.pivot_table(index=['Category', 'Year'], columns='Month', values='Value')
       .groupby(level=0)
       .ffill()
       .stack()
       .reset_index())
Category Year level_2 0
0 A 2019 1 15.0
1 A 2019 2 90.0
2 A 2019 3 50.0
3 A 2019 5 20.0
4 A 2019 6 15.0
5 A 2019 7 17.0
6 A 2019 8 18.0
7 A 2019 9 12.0
8 A 2019 10 11.0
9 A 2019 11 19.0
10 A 2019 12 15.0
11 A 2020 1 18.0
12 A 2020 2 53.0
13 A 2020 3 80.0
14 A 2020 5 20.0
15 A 2020 6 15.0
16 A 2020 7 17.0
17 A 2020 8 18.0
18 A 2020 9 12.0
19 A 2020 10 11.0
20 A 2020 11 19.0
21 A 2020 12 15.0
22 B 2019 2 20.0
23 B 2019 4 40.0
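To restore meaningful column names after stack (level_2 and 0 above are pandas defaults), a short follow-up, assuming the result is in s:
s.columns = ['Category', 'Year', 'Month', 'Value']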
You can accomplish this with a combination of loc, concat, and drop_duplicates.
The idea here is to concatenate the dataframe with a copy of the 2019 data where year is changed to 2020, and then only keeping the first value for Category, Month, Year.
df2 = df.loc[df['Year'] == 2019, :].copy()  # copy to avoid mutating a view of df
df2['Year'] = 2020
pd.concat([df, df2]).drop_duplicates(subset=['Category', 'Month', 'Year'], keep='first')
Output
Category Month Year Value
0 A 1 2019 15
1 B 2 2019 20
2 A 2 2019 90
3 A 3 2019 50
4 B 4 2019 40
5 A 5 2019 20
6 A 6 2019 15
7 A 7 2019 17
8 A 8 2019 18
9 A 9 2019 12
10 A 10 2019 11
11 A 11 2019 19
12 A 12 2019 15
13 A 1 2020 18
14 A 2 2020 53
15 A 3 2020 80
1 B 2 2020 20
4 B 4 2020 40
5 A 5 2020 20
6 A 6 2020 15
7 A 7 2020 17
8 A 8 2020 18
9 A 9 2020 12
10 A 10 2020 11
11 A 11 2020 19
12 A 12 2020 15
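Because df comes first in pd.concat, keep='first' gives the genuine 2020 rows precedence; only the (Category, Month, Year) combinations missing from 2020 are filled from the relabeled 2019 copy.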

How can I calculate a delta column in a DataFrame?

A B C Delta
**1 Jan 10 0**
1 Feb 20 10
1 Mar 40 30
**2 Jan 10 0**
2 Feb 30 20
2 Mar 20 10
2 Oct 40 30
**3 Jan 10 0**
3 Feb 20 10
3 Mar 30 20
3 Oct 40 30
3 Dec 50 40
How can I calculate the Delta column? I couldn't find an answer anywhere. Please let me know how to calculate it.
Subtract column C by Series.sub, using the first value of each group repeated via GroupBy.transform and GroupBy.first:
df['Delta'] = df['C'].sub(df.groupby('A')['C'].transform('first'))
print (df)
A B C Delta
0 1 Jan 10 0
1 1 Feb 20 10
2 1 Mar 40 30
3 2 Jan 10 0
4 2 Feb 30 20
5 2 Mar 20 10
6 2 Oct 40 30
7 3 Jan 10 0
8 3 Feb 20 10
9 3 Mar 30 20
10 3 Oct 40 30
11 3 Dec 50 40
Detail:
print (df.groupby('A')['C'].transform('first'))
0 10
1 10
2 10
3 10
4 10
5 10
6 10
7 10
8 10
9 10
10 10
11 10
Name: C, dtype: int64
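Equivalently, since transform returns a Series aligned with df, plain subtraction works as well:
df['Delta'] = df['C'] - df.groupby('A')['C'].transform('first')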

calculating mean and sum in pivot_table in pandas sorted by two separate desired col values

I have a data set from 2015-2018 which has months and days as the second and third columns, like below:
Year Month Day rain temp humidity snow
2015 1 1 0 20 60 0
2015 1 2 2 18 58 0
2015 1 3 0 20 62 2
2015 1 4 5 15 62 0
2015 1 5 2 18 61 1
2015 1 6 0 19 60 2
2015 1 7 3 20 59 0
2015 1 8 2 17 65 0
2015 1 9 1 17 61 0
I wanted to use pivot_table to calculate something like the mean of temperature for year 2016 and months (1, 2, 3).
I was wondering if anyone could help me with this?
You can do this with pd.cut, then groupby:
df.temp.groupby(
    [df.Year,
     pd.cut(df.Month, [0, 3, 6, 9, 12],
            labels=['Winter', 'Spring', 'Summer', 'Autumn'],
            right=False)]  # note: with right=False, month 12 falls outside the last bin
).mean()
Out[93]:
Year Month
2015 Winter 18.222222
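For the specific ask (mean temperature for year 2016, months 1-3), a pivot_table sketch, assuming the full 2015-2018 data is present (the sample above only covers 2015, so the 2016 row would not exist there):
# Mean temp per Year/Month, then select year 2016, months 1-3
pt = df.pivot_table(index='Year', columns='Month', values='temp', aggfunc='mean')
pt.loc[2016, [1, 2, 3]]         # three monthly means
pt.loc[2016, [1, 2, 3]].mean()  # mean of those means (not weighted by days per month)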

Calculating YTD totals in Pandas

I have a DataFrame that looks like this:
FinancialYearStart MonthOfFinancialYear SalesTotal
0 2015 1 10
1 2015 2 10
2 2015 5 10
3 2015 6 50
4 2016 1 10
5 2016 3 20
6 2016 2 30
7 2017 6 70
8 2017 7 80
And I would like to calculate the YTD Sales total for each month, producing a table that looks like this:
FinancialYearStart MonthOfFinancialYear SalesTotal YTDTotal
0 2015 1 10 10
1 2015 2 10 20
2 2015 5 10 30
3 2015 6 50 50
4 2016 1 10 60
5 2016 3 20 80
6 2016 2 30 110
7 2017 6 70 70
8 2017 7 80 150
How might I achieve this?
More specifically, I actually need to calculate this on a group by group basis.
For example:
Year Month Customer TotalMonthlySales
2015 1 Dog 10
2015 2 Dog 10
2015 3 Cat 20
2015 4 Dog 30
2015 5 Cat 10
2015 7 Cat 20
2015 7 Dog 10
2016 1 Dog 40
2016 2 Dog 20
2016 3 Cat 70
2016 4 Dog 30
2016 5 Cat 10
2016 6 Cat 20
2016 7 Dog 10
Would give:
Year Month Customer TotalMonthlySales YTDSales
2015 1 Dog 10 10
2015 2 Dog 10 20
2015 3 Cat 20 20
2015 4 Dog 30 50
2015 5 Cat 10 30
2015 7 Cat 20 40
2015 7 Dog 10 60
2016 1 Dog 40 40
2016 2 Dog 20 60
2016 3 Cat 70 70
2016 4 Dog 30 90
2016 5 Cat 10 80
2016 6 Cat 20 100
2016 7 Dog 10 100
Use groupby + cumsum:
df['YTDSales'] = df.groupby(['Year','Customer'])['TotalMonthlySales'].cumsum()
print (df)
Year Month Customer TotalMonthlySales YTDSales
0 2015 1 Dog 10 10
1 2015 2 Dog 10 20
2 2015 3 Cat 20 20
3 2015 4 Dog 30 50
4 2015 5 Cat 10 30
5 2015 7 Cat 20 50
6 2015 7 Dog 10 60
7 2016 1 Dog 40 40
8 2016 2 Dog 20 60
9 2016 3 Cat 70 70
10 2016 4 Dog 30 90
11 2016 5 Cat 10 80
12 2016 6 Cat 20 100
13 2016 7 Dog 10 100
For the first DataFrame:
df['YTDTotal'] = df.groupby('FinancialYearStart')['SalesTotal'].cumsum()
print (df)
FinancialYearStart MonthOfFinancialYear SalesTotal YTDTotal
0 2015 1 10 10
1 2015 2 10 20
2 2015 5 10 30
3 2015 6 50 80
4 2016 1 10 10
5 2016 3 20 30
6 2016 2 30 60
7 2017 6 70 70
8 2017 7 80 150
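As noted in the EXTRAS above, cumsum follows row order, so if the rows can be out of chronological order (the first table has month 3 before month 2 in 2016), sorting first is a safe guard:
df = df.sort_values(['FinancialYearStart', 'MonthOfFinancialYear']).reset_index(drop=True)
df['YTDTotal'] = df.groupby('FinancialYearStart')['SalesTotal'].cumsum()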
