We can apply a 30-day rolling sum as:
df.rolling("30D").sum()
However, how can I achieve a month-to-date (or even year-to-date) rolling sum in a similar fashion?
Month-to-date means that we sum only from the beginning of the month up to the current date (or row).
Consider the following DataFrame:
Year Month week Revenue
0 2020 1 1 10
1 2020 1 2 20
2 2020 1 3 10
3 2020 1 4 20
4 2020 2 1 10
5 2020 2 2 20
6 2020 2 3 10
7 2020 2 4 20
8 2020 3 1 10
9 2020 3 2 20
10 2020 3 3 10
11 2020 3 4 20
12 2021 1 1 10
13 2021 1 2 20
14 2021 1 3 10
15 2021 1 4 20
16 2021 2 1 10
17 2021 2 2 20
18 2021 2 3 10
19 2021 2 4 20
20 2021 3 1 10
21 2021 3 2 20
22 2021 3 3 10
23 2021 3 4 20
You could use a combination of groupby + cumsum to get what you want:
df['Year_To_date'] = df.groupby('Year')['Revenue'].cumsum()
df['Month_To_date'] = df.groupby(['Year', 'Month'])['Revenue'].cumsum()
Results:
Year Month week Revenue Year_To_date Month_To_date
0 2020 1 1 10 10 10
1 2020 1 2 20 30 30
2 2020 1 3 10 40 40
3 2020 1 4 20 60 60
4 2020 2 1 10 70 10
5 2020 2 2 20 90 30
6 2020 2 3 10 100 40
7 2020 2 4 20 120 60
8 2020 3 1 10 130 10
9 2020 3 2 20 150 30
10 2020 3 3 10 160 40
11 2020 3 4 20 180 60
12 2021 1 1 10 10 10
13 2021 1 2 20 30 30
14 2021 1 3 10 40 40
15 2021 1 4 20 60 60
16 2021 2 1 10 70 10
17 2021 2 2 20 90 30
18 2021 2 3 10 100 40
19 2021 2 4 20 120 60
20 2021 3 1 10 130 10
21 2021 3 2 20 150 30
22 2021 3 3 10 160 40
23 2021 3 4 20 180 60
Note that Month-to-date makes sense only if you have a week/date column in your data model.
EXTRAS:
The goal of cumsum here is to compute the cumulative sum over time within each period. However, cumsum accumulates rows in the order in which they appear within each group, so if the DataFrame is not already in chronological order, the running totals will not be chronological either.
Thus, the DataFrame first needs to be sorted in the desired order ([Year, Month, week] or [Date]); resetting the index afterwards keeps the index in step with the new row order. The output is then accumulated per period group, in chronological order.
df = df.sort_values(['Year', 'Month', 'week']).reset_index(drop=True)
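If the data carries a real date column instead of separate Year/Month/week columns, the same groupby + cumsum idea could be sketched as follows. The Date column and the sample values are assumptions made up for illustration; they are not part of the original data.

```python
import pandas as pd

# Hypothetical sample with an actual date column, deliberately out of order.
df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2020-01-20", "2020-01-06", "2020-02-17", "2020-02-03", "2021-01-04"]
    ),
    "Revenue": [20, 10, 20, 10, 10],
})

# cumsum follows row order, so sort chronologically first.
df = df.sort_values("Date").reset_index(drop=True)

# Month-to-date: restart the running total at each calendar month.
df["Month_To_date"] = df.groupby(df["Date"].dt.to_period("M"))["Revenue"].cumsum()

# Year-to-date: restart the running total at each calendar year.
df["Year_To_date"] = df.groupby(df["Date"].dt.year)["Revenue"].cumsum()

print(df)
```

Grouping by a monthly period rather than by the month number alone keeps January 2020 and January 2021 in separate groups.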
I have a DataFrame like the one below, and I need to create a new column, year_val, that takes its value from one of the columns col2016 through col2019 depending on the Years column: for each row, year_val should equal the value of col#### where #### matches that row's Years value.
import pandas as pd
sampleDF = pd.DataFrame({'Years':[2016,2016,2017,2017,2018,2018,2019,2019],
'col2016':[1,2,3,4,5,6,7,8],
'col2017':[9,10,11,12,13,14,15,16],
'col2018':[17,18,19,20,21,22,23,24],
'col2019':[25,26,27,28,29,30,31,32]})
sampleDF['year_val'] = ?????
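Use DataFrame.lookup, building the column labels by prepending col to the Years values cast to string (note that DataFrame.lookup was deprecated in pandas 1.2 and removed in pandas 2.0, so this works only on older versions):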
sampleDF['year_val'] = sampleDF.lookup(sampleDF.index, 'col' + sampleDF['Years'].astype(str))
print (sampleDF)
Years col2016 col2017 col2018 col2019 year_val
0 2016 1 9 17 25 1
1 2016 2 10 18 26 2
2 2017 3 11 19 27 11
3 2017 4 12 20 28 12
4 2018 5 13 21 29 21
5 2018 6 14 22 30 22
6 2019 7 15 23 31 31
7 2019 8 16 24 32 32
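On pandas versions where DataFrame.lookup is not available, one possible replacement is to factorize the column labels and index the underlying NumPy array. This is a sketch rather than the answer's original method; the variable names codes and labels are chosen here for illustration.

```python
import numpy as np
import pandas as pd

sampleDF = pd.DataFrame({'Years': [2016, 2016, 2017, 2017, 2018, 2018, 2019, 2019],
                         'col2016': [1, 2, 3, 4, 5, 6, 7, 8],
                         'col2017': [9, 10, 11, 12, 13, 14, 15, 16],
                         'col2018': [17, 18, 19, 20, 21, 22, 23, 24],
                         'col2019': [25, 26, 27, 28, 29, 30, 31, 32]})

# codes[i] is the position of row i's column label within the unique labels.
codes, labels = pd.factorize('col' + sampleDF['Years'].astype(str))

# Reorder the columns to match the unique labels, then pick one value per row.
values = sampleDF.reindex(labels, axis=1).to_numpy()
sampleDF['year_val'] = values[np.arange(len(sampleDF)), codes]

print(sampleDF)
```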
EDIT: If you check the definition of the lookup function:
result = [df.get_value(row, col) for row, col in zip(row_labels, col_labels)]
you can reimplement it with DataFrame.at inside a try-except statement, which handles missing columns and also prevents this warning:
FutureWarning: get_value is deprecated and will be removed in a future release. Please use .at[] or .iat[] accessors instead
sampleDF = pd.DataFrame({'Years':[2015,2016,2017,2017,2018,2018,2019,2019],
'col2016':[1,2,3,4,5,6,7,8],
'col2017':[9,10,11,12,13,14,15,16],
'col2018':[17,18,19,20,21,22,23,24],
'col2019':[25,26,27,28,29,30,31,32]})
print (sampleDF)
Years col2016 col2017 col2018 col2019
0 2015 1 9 17 25
1 2016 2 10 18 26
2 2017 3 11 19 27
3 2017 4 12 20 28
4 2018 5 13 21 29
5 2018 6 14 22 30
6 2019 7 15 23 31
7 2019 8 16 24 32
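import numpy as np

out = []
for row, col in zip(sampleDF.index, 'col' + sampleDF['Years'].astype(str)):
    try:
        # look up the single value at (row, col)
        out.append(sampleDF.at[row, col])
    except KeyError:
        # no such column (e.g. col2015): fall back to NaN
        out.append(np.nan)
sampleDF['year_val'] = out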
print (sampleDF)
Years col2016 col2017 col2018 col2019 year_val
0 2015 1 9 17 25 NaN
1 2016 2 10 18 26 2.0
2 2017 3 11 19 27 11.0
3 2017 4 12 20 28 12.0
4 2018 5 13 21 29 21.0
5 2018 6 14 22 30 22.0
6 2019 7 15 23 31 31.0
7 2019 8 16 24 32 32.0
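For larger frames, the Python-level loop could be replaced by a vectorized lookup. The sketch below assumes the same sample data; it uses Index.get_indexer, which returns -1 for labels that do not exist, and masks those rows with NaN.

```python
import numpy as np
import pandas as pd

sampleDF = pd.DataFrame({'Years': [2015, 2016, 2017, 2017, 2018, 2018, 2019, 2019],
                         'col2016': [1, 2, 3, 4, 5, 6, 7, 8],
                         'col2017': [9, 10, 11, 12, 13, 14, 15, 16],
                         'col2018': [17, 18, 19, 20, 21, 22, 23, 24],
                         'col2019': [25, 26, 27, 28, 29, 30, 31, 32]})

# Position of each row's target column; -1 marks a label with no column.
col_labels = 'col' + sampleDF['Years'].astype(str)
col_pos = sampleDF.columns.get_indexer(col_labels)

# Pick one value per row, as float so that NaN can be stored.
values = sampleDF.to_numpy(dtype=float)[np.arange(len(sampleDF)), col_pos]

# A position of -1 silently selected the last column above; overwrite it.
values[col_pos == -1] = np.nan

sampleDF['year_val'] = values
print(sampleDF)
```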
I have a DataFrame with three columns: SoldDate, Model and TotalSoldCount. How do I create a new column, 'CountSoldbyMonth', which gives the number of times each model was sold in each month? Sample data is given below.
The ‘CountSoldbyMonth’ should always be less than the ‘TotalSoldCount’.
I am new to Python.
Date Model TotalSoldCount
Jan 19 A 4
Jan 19 A 4
Jan 19 A 4
Jan 19 B 6
Jan 19 C 2
Jan 19 C 2
Feb 19 A 4
Feb 19 B 6
Feb 19 B 6
Feb 19 B 6
Mar 19 B 6
Mar 19 B 6
The new df should look like this.
Date Model TotalSoldCount CountSoldbyMonth
Jan 19 A 4 3
Jan 19 A 4 3
Jan 19 A 4 3
Jan 19 B 6 1
Jan 19 C 2 2
Jan 19 C 2 2
Feb 19 A 4 1
Feb 19 B 6 3
Feb 19 B 6 3
Feb 19 B 6 3
Mar 19 B 6 2
Mar 19 B 6 2
I tried doing
df['CountSoldbyMonth'] = df.groupby(['date','model']).totalsoldcount.transform('sum')
but it is generating a different value.
Suppose you have this data set:
date model totalsoldcount
0 Jan 19 A 110
1 Jan 19 A 110
2 Jan 19 A 110
3 Jan 19 B 50
4 Jan 19 C 70
5 Jan 19 C 70
6 Feb 19 A 110
7 Feb 19 B 50
8 Feb 19 B 50
9 Feb 19 B 50
10 Mar 19 B 50
11 Mar 19 B 50
And you want to define a new column, countsoldbymonth. You can group by the date and model columns, sum totalsoldcount within each group with transform, and assign the result to the new column:
s['countsoldbymonth'] = s.groupby([
'date',
'model'
]).totalsoldcount.transform('sum')
print(s)
date model totalsoldcount countsoldbymonth
0 Jan 19 A 110 330
1 Jan 19 A 110 330
2 Jan 19 A 110 330
3 Jan 19 B 50 50
4 Jan 19 C 70 140
5 Jan 19 C 70 140
6 Feb 19 A 110 110
7 Feb 19 B 50 150
8 Feb 19 B 50 150
9 Feb 19 B 50 150
10 Mar 19 B 50 100
11 Mar 19 B 50 100
Or, if you just want to see the sums without creating a new column you can use sum instead of transform like this:
print(s.groupby([
'date',
'model'
]).totalsoldcount.sum())
date model
Feb 19 A 110
B 150
Jan 19 A 330
B 50
C 140
Mar 19 B 100
Edit
If you just want to know how many sales were made in each month, you can do the same groupby, but use count instead of sum:
df['CountSoldByMonth'] = df.groupby([
'Date',
'Model'
]).TotalSoldCount.transform('count')
print(df)
Date Model TotalSoldCount CountSoldByMonth
0 Jan 19 A 4 3
1 Jan 19 A 4 3
2 Jan 19 A 4 3
3 Jan 19 B 6 1
4 Jan 19 C 2 2
5 Jan 19 C 2 2
6 Feb 19 A 4 1
7 Feb 19 B 6 3
8 Feb 19 B 6 3
9 Feb 19 B 6 3
10 Mar 19 B 6 2
11 Mar 19 B 6 2
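To make the difference between the two aggregations concrete, here is a small self-contained sketch that rebuilds the question's data and computes both. The SumOfTotals column name is made up for this comparison only.

```python
import pandas as pd

# Rebuild the sample data from the question.
df = pd.DataFrame({
    "Date": ["Jan 19"] * 6 + ["Feb 19"] * 4 + ["Mar 19"] * 2,
    "Model": list("AAABCC") + list("ABBB") + list("BB"),
    "TotalSoldCount": [4, 4, 4, 6, 2, 2, 4, 6, 6, 6, 6, 6],
})

grouped = df.groupby(["Date", "Model"])["TotalSoldCount"]

# 'count' gives the number of rows per (Date, Model) group.
df["CountSoldbyMonth"] = grouped.transform("count")

# 'sum' adds up the repeated totals instead, which is why it looked wrong.
df["SumOfTotals"] = grouped.transform("sum")

print(df)
```

Because transform returns a result with the same index as the input, both columns line up with the original rows without a merge.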
It's easier to help if you give code that lets the user experiment. In this case, I'd think taking your DataFrame (df) and doing the following should work:
df['CountSoldbyMonth'] = df.groupby(['Date','Model'])['TotalSoldCount'].transform('sum')
I have a DataFrame that looks like this:
FinancialYearStart MonthOfFinancialYear SalesTotal
0 2015 1 10
1 2015 2 10
2 2015 5 10
3 2015 6 50
4 2016 1 10
5 2016 3 20
6 2016 2 30
7 2017 6 70
8 2017 7 80
And I would like to calculate the YTD Sales total for each month, producing a table that looks like this:
FinancialYearStart MonthOfFinancialYear SalesTotal YTDTotal
0 2015 1 10 10
1 2015 2 10 20
2 2015 5 10 30
3 2015 6 50 80
4 2016 1 10 10
5 2016 3 20 30
6 2016 2 30 60
7 2017 6 70 70
8 2017 7 80 150
How might I achieve this?
More specifically, I actually need to calculate this on a group by group basis.
For example:
Year Month Customer TotalMonthlySales
2015 1 Dog 10
2015 2 Dog 10
2015 3 Cat 20
2015 4 Dog 30
2015 5 Cat 10
2015 7 Cat 20
2015 7 Dog 10
2016 1 Dog 40
2016 2 Dog 20
2016 3 Cat 70
2016 4 Dog 30
2016 5 Cat 10
2016 6 Cat 20
2016 7 Dog 10
Would give:
Year Month Customer TotalMonthlySales YTDSales
2015 1 Dog 10 10
2015 2 Dog 10 20
2015 3 Cat 20 20
2015 4 Dog 30 50
2015 5 Cat 10 30
2015 7 Cat 20 50
2015 7 Dog 10 60
2016 1 Dog 40 40
2016 2 Dog 20 60
2016 3 Cat 70 70
2016 4 Dog 30 90
2016 5 Cat 10 80
2016 6 Cat 20 100
2016 7 Dog 10 100
Use groupby + cumsum:
df['YTDSales'] = df.groupby(['Year','Customer'])['TotalMonthlySales'].cumsum()
print (df)
Year Month Customer TotalMonthlySales YTDSales
0 2015 1 Dog 10 10
1 2015 2 Dog 10 20
2 2015 3 Cat 20 20
3 2015 4 Dog 30 50
4 2015 5 Cat 10 30
5 2015 7 Cat 20 50
6 2015 7 Dog 10 60
7 2016 1 Dog 40 40
8 2016 2 Dog 20 60
9 2016 3 Cat 70 70
10 2016 4 Dog 30 90
11 2016 5 Cat 10 80
12 2016 6 Cat 20 100
13 2016 7 Dog 10 100
For the first DataFrame:
df['YTDTotal'] = df.groupby('FinancialYearStart')['SalesTotal'].cumsum()
print (df)
FinancialYearStart MonthOfFinancialYear SalesTotal YTDTotal
0 2015 1 10 10
1 2015 2 10 20
2 2015 5 10 30
3 2015 6 50 80
4 2016 1 10 10
5 2016 3 20 30
6 2016 2 30 60
7 2017 6 70 70
8 2017 7 80 150
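Note that in the sample data the 2016 rows are not in month order (month 3 comes before month 2), and cumsum accumulates in row order. If the running total should follow the calendar instead, one option is to sort before accumulating. This is a sketch under that assumption, not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({
    "FinancialYearStart": [2015, 2015, 2015, 2015, 2016, 2016, 2016, 2017, 2017],
    "MonthOfFinancialYear": [1, 2, 5, 6, 1, 3, 2, 6, 7],
    "SalesTotal": [10, 10, 10, 50, 10, 20, 30, 70, 80],
})

# Accumulate in chronological order within each financial year.
ytd = (
    df.sort_values(["FinancialYearStart", "MonthOfFinancialYear"])
      .groupby("FinancialYearStart")["SalesTotal"]
      .cumsum()
)

# Assignment aligns on the index, so the original row order is preserved.
df["YTDTotal"] = ytd
print(df)
```

With this approach the 2016 month-2 row gets 40 and the month-3 row gets 60, whereas accumulating in the unsorted row order gives 60 and 30 respectively.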
I have the following dates dataframe:
dates
0 2012 10 4
1
2 2012 01 19
3 20 6 11
4 20 10 7
5 19 11 12
6
7 2013 03 19
8 2016 2 5
9 2011 2 19
10
11 2011 05 23
12 2012 04 5
How can I normalize the dates column into:
dates
0 2012 10 04
1
2 2012 01 19
3 2020 06 11
4 2020 10 07
5 2019 11 12
6
7 2013 03 19
8 2016 02 05
9 2011 02 19
10
11 2011 05 23
12 2012 04 05
I tried using regex, splitting, and tweaking each column separately, but I am overcomplicating the task. Is it possible to normalize this into the latter DataFrame? The rule is to add a leading 0 if the month or day has a single digit, or a 20 at the beginning of the string if the year is incomplete; the format is yyyymmdd.
Solution:
x = (df.loc[df.dates.str.contains(r'\d+\s*\d+\s*\d+'), 'dates']
.str.split(expand=True)
.rename(columns={0:'year',1:'month',2:'day'})
.astype(int)
)
x.loc[x.year <= 50, 'year'] += 2000
df['new'] = pd.to_datetime(x, errors='coerce').dt.strftime('%Y%m%d')
Result:
In [148]: df
Out[148]:
dates new
0 2012 10 4 20121004
1 NaN
2 2012 01 19 20120119
3 20 6 11 20200611
4 20 10 7 20201007
5 19 11 12 20191112
6 NaN
7 2013 03 19 20130319
8 2016 2 5 20160205
9 2011 2 19 20110219
10 NaN
11 2011 05 23 20110523
12 2012 04 5 20120405
Explanation:
In [149]: df.loc[df.dates.str.contains(r'\d+\s*\d+\s*\d+'), 'dates']
Out[149]:
0 2012 10 4
2 2012 01 19
3 20 6 11
4 20 10 7
5 19 11 12
7 2013 03 19
8 2016 2 5
9 2011 2 19
11 2011 05 23
12 2012 04 5
Name: dates, dtype: object
In [152]: (df.loc[df.dates.str.contains(r'\d+\s*\d+\s*\d+'), 'dates']
...: .str.split(expand=True)
...: .rename(columns={0:'year',1:'month',2:'day'})
...: .astype(int))
Out[152]:
year month day
0 2012 10 4
2 2012 1 19
3 20 6 11
4 20 10 7
5 19 11 12
7 2013 3 19
8 2016 2 5
9 2011 2 19
11 2011 5 23
12 2012 4 5
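If the space-separated layout shown in the question is wanted rather than a yyyymmdd string, a plain string-based variant could look like the sketch below. The helper name normalize_date and the shortened sample are made up for illustration, and the sketch assumes every two-digit year belongs to the 2000s.

```python
import pandas as pd

def normalize_date(text):
    """Pad month and day to two digits and expand two-digit years to 20xx."""
    parts = text.split()
    if len(parts) != 3:
        # Leave empty or malformed entries untouched.
        return text
    year, month, day = parts
    if len(year) == 2:
        year = "20" + year
    return f"{year} {month.zfill(2)} {day.zfill(2)}"

# Shortened, hypothetical version of the dates column.
df = pd.DataFrame({"dates": ["2012 10 4", "", "20 6 11", "19 11 12", "2012 04 5"]})
df["normalized"] = df["dates"].map(normalize_date)
print(df)
```

Unlike the pd.to_datetime route, this does not validate that the result is a real calendar date, so it suits cases where only the padding matters.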