We can apply a 30D monthly rolling sum operations as:
df.rolling("30D").sum()
However, how can I achieve a month-to-date (or even year-to-date) rolling sum in a similar fashion?
Month-to-date meaning that we only sum from the beginning of the month up to the current date (or row)?
Consider the following database:
Year Month week Revenue
0 2020 1 1 10
1 2020 1 2 20
2 2020 1 3 10
3 2020 1 4 20
4 2020 2 1 10
5 2020 2 2 20
6 2020 2 3 10
7 2020 2 4 20
8 2020 3 1 10
9 2020 3 2 20
10 2020 3 3 10
11 2020 3 4 20
12 2021 1 1 10
13 2021 1 2 20
14 2021 1 3 10
15 2021 1 4 20
16 2021 2 1 10
17 2021 2 2 20
18 2021 2 3 10
19 2021 2 4 20
20 2021 3 1 10
21 2021 3 2 20
22 2021 3 3 10
23 2021 3 4 20
You could use a combination of group_by + cumsum to get what you want:
df['Year_To_date'] = df.groupby('Year')['Revenue'].cumsum()
df['Month_To_date'] = df.groupby(['Year', 'Month'])['Revenue'].cumsum()
Results:
Year Month week Revenue Year_To_date Month_To_date
0 2020 1 1 10 10 10
1 2020 1 2 20 30 30
2 2020 1 3 10 40 40
3 2020 1 4 20 60 60
4 2020 2 1 10 70 10
5 2020 2 2 20 90 30
6 2020 2 3 10 100 40
7 2020 2 4 20 120 60
8 2020 3 1 10 130 10
9 2020 3 2 20 150 30
10 2020 3 3 10 160 40
11 2020 3 4 20 180 60
12 2021 1 1 10 10 10
13 2021 1 2 20 30 30
14 2021 1 3 10 40 40
15 2021 1 4 20 60 60
16 2021 2 1 10 70 10
17 2021 2 2 20 90 30
18 2021 2 3 10 100 40
19 2021 2 4 20 120 60
20 2021 3 1 10 130 10
21 2021 3 2 20 150 30
22 2021 3 3 10 160 40
23 2021 3 4 20 180 60
Note that Month-to-date makes sense only if you have a week/date column in your data model.
EXTRAS:
The goal of cumsum is to compute the cumulative sum over date by different periods. However, if the index of the original data frame is not ordered in the desired sequence,cumsum is computed by the original index within a group.That's because Pandas operates sequence by row indexes.
Thus, data frame first needs to be sorted by the desired order([Year,Month,Week] or [Date]), followed by resetting the index to match the order of the variable of interest. Now, the output is summed up by group of periods , in the chronological order.
df=df.sort_values(['Year', 'Month','Week']).reset_index(drop=True)
It is posible to convert a dataframe on Pandas like that:
Into a time series where each year its behind the last one
This is likely what df.unstack(level=1) is meant for.
np.random.seed(111) # reproducibility
df = pd.DataFrame(
data={
"2009": np.random.randn(12),
"2010": np.random.randn(12),
"2011": np.random.randn(12),
},
index=range(1, 13)
)
print(df)
Out[45]:
2009 2010 2011
1 -1.133838 -1.440585 0.570594
2 0.384319 0.773703 0.915420
3 1.496554 -1.027967 -1.669341
4 -0.355382 -0.090986 0.482714
5 -0.787534 0.492003 -0.310473
6 -0.459439 0.424672 2.394690
7 -0.059169 1.283049 1.550931
8 -0.354174 0.315986 -0.646465
9 -0.735523 -0.408082 -0.928937
10 -1.183940 -0.067948 -1.654976
11 0.238894 -0.952427 0.350193
12 -0.589920 -0.110677 -0.141757
df_out = df.unstack(1).reset_index()
df_out.columns = ["year", "month", "value"]
print(df_out)
Out[46]:
year month value
0 2009 1 -1.133838
1 2009 2 0.384319
2 2009 3 1.496554
3 2009 4 -0.355382
4 2009 5 -0.787534
5 2009 6 -0.459439
6 2009 7 -0.059169
7 2009 8 -0.354174
8 2009 9 -0.735523
9 2009 10 -1.183940
10 2009 11 0.238894
11 2009 12 -0.589920
12 2010 1 -1.440585
13 2010 2 0.773703
14 2010 3 -1.027967
15 2010 4 -0.090986
16 2010 5 0.492003
17 2010 6 0.424672
18 2010 7 1.283049
19 2010 8 0.315986
20 2010 9 -0.408082
21 2010 10 -0.067948
22 2010 11 -0.952427
23 2010 12 -0.110677
24 2011 1 0.570594
25 2011 2 0.915420
26 2011 3 -1.669341
27 2011 4 0.482714
28 2011 5 -0.310473
29 2011 6 2.394690
30 2011 7 1.550931
31 2011 8 -0.646465
32 2011 9 -0.928937
33 2011 10 -1.654976
34 2011 11 0.350193
35 2011 12 -0.141757
Here's the data in csv format:
Name 2012 2013 2014 2015 2016 2017 2018 2019 2020
Jack 1 15 25 3 5 11 5 8 3
Jill 5 10 32 5 5 14 6 8 7
I don't want Name column to be include as it gives an error.
I tried
df.cumsum()
Try with set_index and reset_index to keep the name column:
df.set_index('Name').cumsum().reset_index()
Output:
Name 2012 2013 2014 2015 2016 2017 2018 2019 2020
0 Jack 1 15 25 3 5 11 5 8 3
1 Jill 6 25 57 8 10 25 11 16 10
I have a dataframe like the one below, and I have to create a new column year_val that is equal to the values of col2016 through col2019 based on the Years column, so that the value for year_val will be the value of col#### when Years is equal to the suffix of col####
import pandas as pd
sampleDF = pd.DataFrame({'Years':[2016,2016,2017,2017,2018,2018,2019,2019],
'col2016':[1,2,3,4,5,6,7,8],
'col2017':[9,10,11,12,13,14,15,16],
'col2018':[17,18,19,20,21,22,23,24],
'col2019':[25,26,27,28,29,30,31,32]})
sampleDF['year_val'] = ?????
Use DataFrame.lookup with change values in Years column with prepend col and cast to string:
sampleDF['year_val'] = sampleDF.lookup(sampleDF.index, 'col' + sampleDF['Years'].astype(str))
print (sampleDF)
Years col2016 col2017 col2018 col2019 year_val
0 2016 1 9 17 25 1
1 2016 2 10 18 26 2
2 2017 3 11 19 27 11
3 2017 4 12 20 28 12
4 2018 5 13 21 29 21
5 2018 6 14 22 30 22
6 2019 7 15 23 31 31
7 2019 8 16 24 32 32
EDIT: If check definition of lookup function:
result = [df.get_value(row, col) for row, col in zip(row_labels, col_labels)]
you can modify it with try-except statement with Series.at for prevent:
FutureWarning: get_value is deprecated and will be removed in a future release. Please use .at[] or .iat[] accessors instead
oup.append(sampleDF.at[row, col] )
sampleDF = pd.DataFrame({'Years':[2015,2016,2017,2017,2018,2018,2019,2019],
'col2016':[1,2,3,4,5,6,7,8],
'col2017':[9,10,11,12,13,14,15,16],
'col2018':[17,18,19,20,21,22,23,24],
'col2019':[25,26,27,28,29,30,31,32]})
print (sampleDF)
Years col2016 col2017 col2018 col2019
0 2015 1 9 17 25
1 2016 2 10 18 26
2 2017 3 11 19 27
3 2017 4 12 20 28
4 2018 5 13 21 29
5 2018 6 14 22 30
6 2019 7 15 23 31
7 2019 8 16 24 32
out= []
for row, col in zip(sampleDF.index, 'col' + sampleDF['Years'].astype(str)):
try:
out.append(sampleDF.at[row, col] )
except KeyError:
out.append(np.nan)
sampleDF['year_val'] = out
print (sampleDF)
Years col2016 col2017 col2018 col2019 year_val
0 2015 1 9 17 25 NaN
1 2016 2 10 18 26 2.0
2 2017 3 11 19 27 11.0
3 2017 4 12 20 28 12.0
4 2018 5 13 21 29 21.0
5 2018 6 14 22 30 22.0
6 2019 7 15 23 31 31.0
7 2019 8 16 24 32 32.0
I have the following dates dataframe:
dates
0 2012 10 4
1
2 2012 01 19
3 20 6 11
4 20 10 7
5 19 11 12
6
7 2013 03 19
8 2016 2 5
9 2011 2 19
10
11 2011 05 23
12 2012 04 5
How can I normalize the dates column into:
dates
0 2012 10 04
1
2 2012 01 19
3 2020 06 11
4 2020 10 07
5 2019 11 12
6
7 2013 03 19
8 2016 02 05
9 2011 02 19
10
11 2011 05 23
12 2012 04 05
I tried with regex and splitting and tweaking each column separately. However I am complicating the task. Is it possible to normalize this into the latter dataframe?. The rule is to add a 0 if the year is incomplete or a 20 at the beggining of the string if the year is incomplete, the format is yyyymmdd.
Solution:
x = (df.loc[df.dates.str.contains(r'\d+\s*\d+\s*\d+'), 'dates']
.str.split(expand=True)
.rename(columns={0:'year',1:'month',2:'day'})
.astype(int)
)
x.loc[x.year <= 50, 'year'] += 2000
df['new'] = pd.to_datetime(x, errors='coerce').dt.strftime('%Y%m%d')
Result:
In [148]: df
Out[148]:
dates new
0 2012 10 4 20121004
1 NaN
2 2012 01 19 20120119
3 20 6 11 20200611
4 20 10 7 20201007
5 19 11 12 20191112
6 NaN
7 2013 03 19 20130319
8 2016 2 5 20160205
9 2011 2 19 20110219
10 NaN
11 2011 05 23 20110523
12 2012 04 5 20120405
Explanation:
In [149]: df.loc[df.dates.str.contains(r'\d+\s*\d+\s*\d+'), 'dates']
Out[149]:
0 2012 10 4
2 2012 01 19
3 20 6 11
4 20 10 7
5 19 11 12
7 2013 03 19
8 2016 2 5
9 2011 2 19
11 2011 05 23
12 2012 04 5
Name: dates, dtype: object
In [152]: (df.loc[df.dates.str.contains(r'\d+\s*\d+\s*\d+'), 'dates']
...: .str.split(expand=True)
...: .rename(columns={0:'year',1:'month',2:'day'})
...: .astype(int))
Out[152]:
year month day
0 2012 10 4
2 2012 1 19
3 20 6 11
4 20 10 7
5 19 11 12
7 2013 3 19
8 2016 2 5
9 2011 2 19
11 2011 5 23
12 2012 4 5