Grouping data series by day intervals with Pandas - python

I have to perform some data analysis on a seasonal basis.
I have roughly one and a half years' worth of hourly measurements, from the end of 2015 to the second half of 2017. I want to split this data into seasons.
Here's an example of the data I am working with:
Date,Year,Month,Day,Day week,Hour,Holiday,Week Day,Impulse,Power (kW),Temperature (C)
04/12/2015,2015,12,4,6,18,0,6,2968,1781,16.2
04/12/2015,2015,12,4,6,19,0,6,2437,1462,16.2
19/04/2016,2016,4,19,3,3,0,3,1348,809,14.4
19/04/2016,2016,4,19,3,4,0,3,1353,812,14.1
11/06/2016,2016,6,11,7,19,0,7,1395,837,18.8
11/06/2016,2016,6,11,7,20,0,7,1370,822,17.4
11/06/2016,2016,6,11,7,21,0,7,1364,818,17
11/06/2016,2016,6,11,7,22,0,7,1433,860,17.5
04/12/2016,2016,12,4,1,17,0,1,1425,855,14.6
04/12/2016,2016,12,4,1,18,0,1,1466,880,14.4
07/03/2017,2017,3,7,3,14,0,3,3668,2201,14.2
07/03/2017,2017,3,7,3,15,0,3,3666,2200,14
24/04/2017,2017,4,24,2,5,0,2,1347,808,11.4
24/04/2017,2017,4,24,2,6,0,2,1816,1090,11.5
24/04/2017,2017,4,24,2,7,0,2,2918,1751,12.4
15/06/2017,2017,6,15,5,13,1,1,2590,1554,22.5
15/06/2017,2017,6,15,5,14,1,1,2629,1577,22.5
15/06/2017,2017,6,15,5,15,1,1,2656,1594,22.1
15/11/2017,2017,11,15,4,13,0,4,3765,2259,15.6
15/11/2017,2017,11,15,4,14,0,4,3873,2324,15.9
15/11/2017,2017,11,15,4,15,0,4,3905,2343,15.8
15/11/2017,2017,11,15,4,16,0,4,3861,2317,15.3
As you can see, the data spans three different years.
My idea is to convert the first column with pd.to_datetime() and then group the rows by day and month, ignoring the year, using dd/mm intervals. For example, if winter runs from 21/12 to 21/03, I want a new DataFrame containing every row whose date falls in that interval, whatever the year. However, I could not find a way to ignore the year, which makes things more complicated.
EDIT:
A desired output would be:
df_spring
Date,Year,Month,Day,Day week,Hour,Holiday,Week Day,Impulse,Power (kW),Temperature (C)
19/04/2016,2016,4,19,3,3,0,3,1348,809,14.4
19/04/2016,2016,4,19,3,4,0,3,1353,812,14.1
07/03/2017,2017,3,7,3,14,0,3,3668,2201,14.2
07/03/2017,2017,3,7,3,15,0,3,3666,2200,14
24/04/2017,2017,4,24,2,5,0,2,1347,808,11.4
24/04/2017,2017,4,24,2,6,0,2,1816,1090,11.5
24/04/2017,2017,4,24,2,7,0,2,2918,1751,12.4
df_autumn
Date,Year,Month,Day,Day week,Hour,Holiday,Week Day,Impulse,Power (kW),Temperature (C)
04/12/2015,2015,12,4,6,18,0,6,2968,1781,16.2
04/12/2015,2015,12,4,6,19,0,6,2437,1462,16.2
04/12/2016,2016,12,4,1,17,0,1,1425,855,14.6
04/12/2016,2016,12,4,1,18,0,1,1466,880,14.4
15/11/2017,2017,11,15,4,13,0,4,3765,2259,15.6
15/11/2017,2017,11,15,4,14,0,4,3873,2324,15.9
15/11/2017,2017,11,15,4,15,0,4,3905,2343,15.8
15/11/2017,2017,11,15,4,16,0,4,3861,2317,15.3
And so on for the remaining seasons.

Define each season by filtering the relevant rows on the Day and Month columns, as shown here for winter:
df_winter = df.loc[((df['Day'] >= 21) & (df['Month'] == 12)) | (df['Month'] == 1) | (df['Month'] == 2) | ((df['Day'] <= 21) & (df['Month'] == 3))]
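The same idea can be extended to all four seasons at once. The following is a minimal sketch, not a definitive implementation: it assumes each season starts on the 21st of March, June, September and December (adjust the cut-offs to your own definition), and the helper name season_of and the Season column are made up for illustration. Month and day are combined into a single integer so the year never enters the comparison.

```python
import pandas as pd

# Small made-up sample using the same dd/mm/yyyy format as the question.
df = pd.DataFrame({"Date": ["04/12/2015", "19/04/2016", "11/06/2016",
                            "07/03/2017", "15/11/2017", "25/12/2016"]})

dates = pd.to_datetime(df["Date"], format="%d/%m/%Y")

# Encode month and day as one integer (21 March -> 321), ignoring the year.
month_day = dates.dt.month * 100 + dates.dt.day

def season_of(md):
    """Map a month*100+day value to a season (assumed boundaries: the 21st)."""
    if 321 <= md < 621:
        return "spring"
    if 621 <= md < 921:
        return "summer"
    if 921 <= md < 1221:
        return "autumn"
    return "winter"  # 21 December through 20 March wraps around the new year

df["Season"] = month_day.map(season_of)

# One DataFrame per season, keyed by name (e.g. seasons["winter"]).
seasons = {name: group for name, group in df.groupby("Season")}
```

Keeping the season as a column also lets you call df.groupby("Season") directly instead of maintaining four separate DataFrames.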

You can simply filter your DataFrame by month using isin(). Note that this splits on whole months rather than on exact dates:
# spring
df[df['Month'].isin([3,4])]
Date Year Month Day Day week Hour Holiday Week Day Impulse Power (kW) Temperature (C)
2 19/04/2016 2016 4 19 3 3 0 3 1348 809 14.4
3 19/04/2016 2016 4 19 3 4 0 3 1353 812 14.1
10 07/03/2017 2017 3 7 3 14 0 3 3668 2201 14.2
11 07/03/2017 2017 3 7 3 15 0 3 3666 2200 14.0
12 24/04/2017 2017 4 24 2 5 0 2 1347 808 11.4
13 24/04/2017 2017 4 24 2 6 0 2 1816 1090 11.5
14 24/04/2017 2017 4 24 2 7 0 2 2918 1751 12.4
# autumn
df[df['Month'].isin([11,12])]
Date Year Month Day Day week Hour Holiday Week Day Impulse Power (kW) Temperature (C)
0 04/12/2015 2015 12 4 6 18 0 6 2968 1781 16.2
1 04/12/2015 2015 12 4 6 19 0 6 2437 1462 16.2
8 04/12/2016 2016 12 4 1 17 0 1 1425 855 14.6
9 04/12/2016 2016 12 4 1 18 0 1 1466 880 14.4
18 15/11/2017 2017 11 15 4 13 0 4 3765 2259 15.6
19 15/11/2017 2017 11 15 4 14 0 4 3873 2324 15.9
20 15/11/2017 2017 11 15 4 15 0 4 3905 2343 15.8
21 15/11/2017 2017 11 15 4 16 0 4 3861 2317 15.3

Related

Pandas Month-To-Date rolling sum

We can apply a 30-day rolling sum as:
df.rolling("30D").sum()
However, how can I achieve a month-to-date (or even year-to-date) rolling sum in a similar fashion?
Month-to-date meaning that we only sum from the beginning of the month up to the current date (or row)?
Consider the following dataset:
Year Month week Revenue
0 2020 1 1 10
1 2020 1 2 20
2 2020 1 3 10
3 2020 1 4 20
4 2020 2 1 10
5 2020 2 2 20
6 2020 2 3 10
7 2020 2 4 20
8 2020 3 1 10
9 2020 3 2 20
10 2020 3 3 10
11 2020 3 4 20
12 2021 1 1 10
13 2021 1 2 20
14 2021 1 3 10
15 2021 1 4 20
16 2021 2 1 10
17 2021 2 2 20
18 2021 2 3 10
19 2021 2 4 20
20 2021 3 1 10
21 2021 3 2 20
22 2021 3 3 10
23 2021 3 4 20
You could use a combination of groupby + cumsum to get what you want:
df['Year_To_date'] = df.groupby('Year')['Revenue'].cumsum()
df['Month_To_date'] = df.groupby(['Year', 'Month'])['Revenue'].cumsum()
Results:
Year Month week Revenue Year_To_date Month_To_date
0 2020 1 1 10 10 10
1 2020 1 2 20 30 30
2 2020 1 3 10 40 40
3 2020 1 4 20 60 60
4 2020 2 1 10 70 10
5 2020 2 2 20 90 30
6 2020 2 3 10 100 40
7 2020 2 4 20 120 60
8 2020 3 1 10 130 10
9 2020 3 2 20 150 30
10 2020 3 3 10 160 40
11 2020 3 4 20 180 60
12 2021 1 1 10 10 10
13 2021 1 2 20 30 30
14 2021 1 3 10 40 40
15 2021 1 4 20 60 60
16 2021 2 1 10 70 10
17 2021 2 2 20 90 30
18 2021 2 3 10 100 40
19 2021 2 4 20 120 60
20 2021 3 1 10 130 10
21 2021 3 2 20 150 30
22 2021 3 3 10 160 40
23 2021 3 4 20 180 60
Note that Month-to-date makes sense only if you have a week/date column in your data model.
EXTRAS:
The goal of cumsum here is to compute a running total over time within each period. cumsum accumulates rows in the order in which they appear inside each group, so if the DataFrame is not already in chronological order, the running totals will not be either.
The DataFrame therefore needs to be sorted in the desired order first ([Year, Month, week] or [Date]), and the index reset to match. After that, the totals accumulate chronologically within each period.
df = df.sort_values(['Year', 'Month', 'week']).reset_index(drop=True)
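If the data carries real dates rather than separate Year/Month/week columns, the same groupby + cumsum pattern can be keyed on the DatetimeIndex. This is a small sketch under that assumption; the sample dates and values are made up.

```python
import pandas as pd

# Hypothetical daily revenue that crosses a month boundary.
idx = pd.date_range("2020-01-30", periods=4, freq="D")
df = pd.DataFrame({"Revenue": [10, 20, 10, 20]}, index=idx)

# Make sure rows are in chronological order before accumulating.
df = df.sort_index()

# Month-to-date: restart the running total at the start of every calendar month.
df["Month_To_date"] = df.groupby(df.index.to_period("M"))["Revenue"].cumsum()

# Year-to-date: restart the running total at the start of every calendar year.
df["Year_To_date"] = df.groupby(df.index.year)["Revenue"].cumsum()

print(df)
```

Unlike rolling("30D"), the window here is anchored to the calendar, so the month-to-date total resets on the first row of each month.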

Appending DataFrame columns to another DataFrame at an index/location that meets conditions [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 2 years ago.
I have a one_sec_flt DataFrame that has 300,000+ points and a flasks DataFrame that has 230 points. Both DataFrames have hour, minute and second columns. I want to attach each flasks row to the one_sec_flt row recorded at the same time.
Flasks DataFrame
year month day hour minute second... gas1 gas2 gas3
0 2018 4 8 16 27 48... 10 25 191
1 2018 4 8 16 40 20... 45 34 257
...
229 2018 5 12 14 10 05... 3 72 108
one_sec_flt DataFrame
Year Month Day Hour Min Second... temp wind
0 2018 4 8 14 30 20... 300 10
1 2018 4 8 14 45 15... 310 8
...
305,212 2018 5 12 14 10 05... 308 24
I have this code I started with but I don't know how to append one DataFrame to another at that exact timestamp.
for i in range(len(flasks)):
    for j in range(len(one_sec_flt)):
        if (flasks.hour.iloc[i] == one_sec_flt.Hour.iloc[j]):
            if (flasks.minute.iloc[i] == one_sec_flt.Min.iloc[j]):
                if (flasks.second.iloc[i] == one_sec_flt.Second.iloc[j]):
                    print('match')
My output goal would look like:
Year Month Day Hour Min Second... temp wind gas1 gas2 gas3
0 2018 4 8 14 30 20... 300 10 nan nan nan
1 2018 4 8 14 45 15... 310 8 nan nan nan
2 2018 4 8 15 15 47... ... ... nan nan nan
3 2018 4 8 16 27 48... ... ... 10 25 191
4 2018 4 8 16 30 11... ... ... nan nan nan
5 2018 4 8 16 40 20... ... ... 45 34 257
... ... ... ... ... ... ... ... ... ... ... ...
305,212 2018 5 12 14 10 05... 308 24 3 72 108
If you can concatenate the two DataFrames, Flasks and one_sec_flt, and then sort by the time columns, that might achieve what you are looking for (at least, if I understood the problem statement correctly).
Flasks
Out[13]:
year month day hour minute second
0 2018 4 8 16 27 48
1 2018 4 8 16 40 20
one_sec
Out[14]:
year month day hour minute second
0 2018 4 8 14 30 20
1 2018 4 8 14 45 15
df_res = pd.concat([Flasks,one_sec])
df_res
Out[16]:
year month day hour minute second
0 2018 4 8 16 27 48
1 2018 4 8 16 40 20
0 2018 4 8 14 30 20
1 2018 4 8 14 45 15
df_res.sort_values(by=['year','month','day','hour','minute','second'])
Out[17]:
year month day hour minute second
0 2018 4 8 14 30 20
1 2018 4 8 14 45 15
0 2018 4 8 16 27 48
1 2018 4 8 16 40 20
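Since the question was closed as a duplicate of "Pandas Merging 101", a left merge on the time columns is another option worth trying. It keeps every one_sec_flt row and fills the gas columns with NaN where no flask was taken, which is the layout shown in the desired output. The sketch below uses made-up sample rows and assumes the column names shown in the question (year/month/day/hour/minute/second versus Year/Month/Day/Hour/Min/Second).

```python
import pandas as pd

# Made-up stand-ins for the two DataFrames in the question.
one_sec_flt = pd.DataFrame({
    "Year": [2018, 2018, 2018], "Month": [4, 4, 4], "Day": [8, 8, 8],
    "Hour": [14, 16, 16], "Min": [30, 27, 30], "Second": [20, 48, 11],
    "temp": [300, 305, 307],
})
flasks = pd.DataFrame({
    "year": [2018], "month": [4], "day": [8],
    "hour": [16], "minute": [27], "second": [48],
    "gas1": [10], "gas2": [25], "gas3": [191],
})

# Align the key column names so both frames can be merged on the same keys.
keys = ["Year", "Month", "Day", "Hour", "Min", "Second"]
flasks_renamed = flasks.rename(columns={
    "year": "Year", "month": "Month", "day": "Day",
    "hour": "Hour", "minute": "Min", "second": "Second",
})

# how="left" keeps all one_sec_flt rows; unmatched rows get NaN gas values.
merged = one_sec_flt.merge(flasks_renamed, on=keys, how="left")
print(merged)
```

This replaces the nested loops with a single vectorised join, which matters with 300,000+ rows.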

Creating a Box-Plot but by value_counts() [Number of events occurred]

I have the following dataframe. Each entry is an event that occurred [550624 events]. Suppose we are interested in a box-plot of the number of events occurring per day each month.
print(df)
Month Day
0 4 1
1 4 1
2 4 1
3 4 1
4 4 1
... ...
550619 10 31
550620 10 31
550621 10 31
550622 10 31
550623 10 31
[550624 rows x 2 columns]
df2 = df.groupby('Month')['Day'].value_counts().sort_index()
Month Day
4 1 2162
2 1564
3 1973
4 1620
5 1860
10 27 2022
28 1606
29 1316
30 1674
31 1726
sns.boxplot(x = df2.index.get_level_values('Month'), y = df2)
Output of sns.boxplot
Is this the most efficient and direct way to create this plot, or am I taking a roundabout route?
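One possible variation, offered as a sketch rather than a definitive answer, is to count events with groupby(...).size() and keep the result as a flat DataFrame with a named count column. The sample data and the column name events are made up; the seaborn call is left as a comment because it needs a plotting backend.

```python
import pandas as pd

# Made-up events: one row per event, as in the question.
df = pd.DataFrame({"Month": [4, 4, 4, 10, 10],
                   "Day":   [1, 1, 2, 31, 31]})

# Number of events per (Month, Day), as a flat DataFrame.
counts = df.groupby(["Month", "Day"]).size().reset_index(name="events")
print(counts)

# Plotting step (assumes seaborn is installed):
# import seaborn as sns
# sns.boxplot(data=counts, x="Month", y="events")
```

As with value_counts(), days with no events at all do not appear in counts, so they are not represented in the box plot.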

How to use groupby and grouper properly for accumulating column 'A' and averaging column 'B', month by month

I have a pandas DataFrame with 3 columns: date (from 1/1/2018 up until 8/23/2019), column A and column B.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,10,size=(600, 2)), columns=list('AB'))
df['date'] = pd.date_range(start='1/1/2018', end='8/23/2019')
df = df.set_index('date')
df is as follows:
date A B
2018-01-01 7 4
2018-01-02 5 4
2018-01-03 3 1
2018-01-04 9 3
2018-01-05 7 8
2018-01-06 0 0
2018-01-07 6 8
2018-01-08 3 7
...
...
...
2019-08-18 1 0
2019-08-19 8 1
2019-08-20 5 9
2019-08-21 0 7
2019-08-22 3 6
2019-08-23 8 6
I want monthly accumulated values of column A and monthly averaged values of column B. The final output will become a df with 20 rows ( 12 months of year 2018 and 8 months of year 2019 ) and 4 columns, representing monthly accumulated values of column A, monthly averaged values of column B, month number and year number just like below:
month year monthly_accumulated_of_A monthly_averaged_of_B
0 1 2018 176 1.747947
1 2 2018 110 2.399476
2 3 2018 131 3.976747
3 4 2018 227 2.314923
4 5 2018 234 0.464097
5 6 2018 249 1.662753
6 7 2018 121 1.588865
7 8 2018 165 2.318268
8 9 2018 219 1.060595
9 10 2018 131 0.577268
10 11 2018 179 3.948414
11 12 2018 115 1.750346
12 1 2019 190 3.364003
13 2 2019 215 0.864792
14 3 2019 231 3.219739
15 4 2019 186 2.904413
16 5 2019 232 0.324695
17 6 2019 163 1.334139
18 7 2019 238 1.670644
19 8 2019 112 1.316442
How can I achieve this in pandas?
Use DataFrameGroupBy.agg, grouping by DatetimeIndex.month and DatetimeIndex.year. Add sort_index to order the result by year and then month, and finish with reset_index to turn the MultiIndex levels into columns:
import pandas as pd
import numpy as np
np.random.seed(2018)
#changed 300 to 600
df = pd.DataFrame(np.random.randint(0,10,size=(600, 2)), columns=list('AB'))
df['date'] = pd.DataFrame(pd.date_range(start='1/1/2018', end='8/23/2019'))
df = df.set_index('date')
df1 = (df.groupby([df.index.month.rename('month'),
                   df.index.year.rename('year')])
         .agg({'A':'sum', 'B':'mean'})
         .sort_index(level=['year', 'month'])
         .reset_index())
print(df1)
month year A B
0 1 2018 147 4.838710
1 2 2018 120 3.678571
2 3 2018 114 4.387097
3 4 2018 143 3.800000
4 5 2018 124 3.870968
5 6 2018 129 4.700000
6 7 2018 143 3.935484
7 8 2018 118 5.483871
8 9 2018 150 5.500000
9 10 2018 139 4.225806
10 11 2018 136 4.933333
11 12 2018 141 4.548387
12 1 2019 137 4.709677
13 2 2019 120 4.964286
14 3 2019 167 4.935484
15 4 2019 121 4.200000
16 5 2019 133 4.129032
17 6 2019 140 5.066667
18 7 2019 189 4.677419
19 8 2019 100 3.695652
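Because the question asks about Grouper specifically, here is a hedged sketch of the same aggregation with pd.Grouper. It assumes a DatetimeIndex and uses the month-start frequency "MS" so that each bin is labelled with the first day of its month. The output column names follow the ones in the question, and the random data is illustrative only.

```python
import numpy as np
import pandas as pd

# Illustrative data: one row per day over the range given in the question.
rng = np.random.default_rng(0)
dates = pd.date_range("2018-01-01", "2019-08-23", freq="D")
df = pd.DataFrame(rng.integers(0, 10, size=(len(dates), 2)),
                  columns=list("AB"), index=dates)

# One bin per calendar month; named aggregation sets the output column names.
monthly = df.groupby(pd.Grouper(freq="MS")).agg(
    monthly_accumulated_of_A=("A", "sum"),
    monthly_averaged_of_B=("B", "mean"),
)

# Recover month and year numbers from the bin labels, then flatten the index.
monthly.insert(0, "year", monthly.index.year)
monthly.insert(0, "month", monthly.index.month)
monthly = monthly.reset_index(drop=True)
print(monthly)
```

Grouping on a single monthly bin keeps the result in chronological order without an explicit sort_index.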

Calculate difference from previous year/forecast in pandas dataframe

I wish to compare the output of multiple model runs, calculating these values:
Difference between current period revenue and previous period
Difference between actual current period revenue and forecasted current period revenue
I have experimented with multi-indexes, and suspect the answer lies in that direction with some creative shift(). However, I'm afraid I've mangled the problem through a haphazard application of various pivot/melt/groupby experiments. Perhaps you can help me figure out how to turn this:
import pandas as pd
ids = [1,2,3] * 5
year = ['2013', '2013', '2013', '2014', '2014', '2014', '2014', '2014', '2014', '2015', '2015', '2015', '2015', '2015', '2015']
run = ['actual','actual','actual','forecast','forecast','forecast','actual','actual','actual','forecast','forecast','forecast','actual','actual','actual']
revenue = [10,20,20,30,50,90,10,40,50,120,210,150,130,100,190]
change_from_previous_year = ['NA','NA','NA',20,30,70,0,20,30,90,160,60,120,60,140]
change_from_forecast = ['NA','NA','NA','NA','NA','NA',-20,-10,-40,'NA','NA','NA',30,-110,40]
d = {'ids':ids, 'year':year, 'run':run, 'revenue':revenue}
df = pd.DataFrame(data=d, columns=['ids','year','run','revenue'])
print(df)
ids year run revenue
0 1 2013 actual 10
1 2 2013 actual 20
2 3 2013 actual 20
3 1 2014 forecast 30
4 2 2014 forecast 50
5 3 2014 forecast 90
6 1 2014 actual 10
7 2 2014 actual 40
8 3 2014 actual 50
9 1 2015 forecast 120
10 2 2015 forecast 210
11 3 2015 forecast 150
12 1 2015 actual 130
13 2 2015 actual 100
14 3 2015 actual 190
....into this:
ids year run revenue chg_from_prev_year chg_from_forecast
0 1 2013 actual 10 NA NA
1 2 2013 actual 20 NA NA
2 3 2013 actual 20 NA NA
3 1 2014 forecast 30 20 NA
4 2 2014 forecast 50 30 NA
5 3 2014 forecast 90 70 NA
6 1 2014 actual 10 0 -20
7 2 2014 actual 40 20 -10
8 3 2014 actual 50 30 -40
9 1 2015 forecast 120 90 NA
10 2 2015 forecast 210 160 NA
11 3 2015 forecast 150 60 NA
12 1 2015 actual 130 120 30
13 2 2015 actual 100 60 -110
14 3 2015 actual 190 140 40
EDIT-- I get pretty close with this:
df['prev_year'] = df.groupby(['ids','run']).shift(1)['revenue']
df['chg_from_prev_year'] = df['revenue'] - df['prev_year']
df['curr_forecast'] = df.groupby(['ids','year']).shift(1)['revenue']
df['chg_from_forecast'] = df['revenue'] - df['curr_forecast']
The only thing missed (as expected) is the comparison between 2014 forecast & 2013 actual. I could just duplicate the 2013 run in the dataset, calculate the chg_from_prev_year for 2014 forecast, and hide/delete the unwanted data from the final dataframe.
Firstly to get the change from previous year, do a shift on each of the groups:
In [11]: g = df.groupby(['ids', 'run'])
In [12]: df['chg_from_prev_year'] = g['revenue'].diff()  # same as x - x.shift() within each group
The next part is more complicated; I think you need a pivot_table for it:
In [13]: df1 = df.pivot_table('revenue', ['ids', 'year'], 'run')
In [14]: df1
Out[14]:
run actual forecast
ids year
1 2013 10 NaN
2014 10 30
2015 130 120
2 2013 20 NaN
2014 40 50
2015 100 210
3 2013 20 NaN
2014 50 90
2015 190 150
In [15]: g1 = df1.groupby(level='ids', as_index=False)
In [16]: out_by = g1.apply(lambda x: x['actual'] - x['forecast'])
In [17]: out_by # hello levels bug, fixed in 0.13/master... yesterday :)
Out[17]:
ids ids year
1 1 2013 NaN
2014 -20
2015 10
2 2 2013 NaN
2014 -10
2015 -110
3 3 2013 NaN
2014 -40
2015 40
dtype: float64
These are the results you want, but not in the right format (see [31] below if you're not too fussed). The following seems like a bit of a hack (to put it mildly), but here goes:
In [21]: df2 = df.set_index(['ids', 'year', 'run'])
In [22]: out_by.index = out_by.index.droplevel(0)
In [23]: out_by_df = pd.DataFrame(out_by, columns=['revenue'])
In [24]: out_by_df['run'] = 'forecast'
In [25]: df2['chg_from_forecast'] = out_by_df.set_index('run', append=True)['revenue']
and we're done...
In [26]: df2.reset_index()
Out[26]:
ids year run revenue chg_from_prev_year chg_from_forecast
0 1 2013 actual 10 NaN NaN
1 2 2013 actual 20 NaN NaN
2 3 2013 actual 20 NaN NaN
3 1 2014 forecast 30 NaN -20
4 2 2014 forecast 50 NaN -10
5 3 2014 forecast 90 NaN -40
6 1 2014 actual 10 0 NaN
7 2 2014 actual 40 20 NaN
8 3 2014 actual 50 30 NaN
9 1 2015 forecast 120 90 10
10 2 2015 forecast 210 160 -110
11 3 2015 forecast 150 60 40
12 1 2015 actual 130 120 NaN
13 2 2015 actual 100 60 NaN
14 3 2015 actual 190 140 NaN
Note: I think the first 6 results of chg_from_prev_year should be NaN.
However, I think you may be better off keeping it as a pivot:
In [31]: df3 = df.pivot_table(['revenue', 'chg_from_prev_year'], ['ids', 'year'], 'run')
In [32]: df3['chg_from_forecast'] = g1.apply(lambda x: x['actual'] - x['forecast']).values
In [33]: df3
Out[33]:
revenue chg_from_prev_year chg_from_forecast
run actual forecast actual forecast
ids year
1 2013 10 NaN NaN NaN NaN
2014 10 30 0 NaN -20
2015 130 120 120 90 10
2 2013 20 NaN NaN NaN NaN
2014 40 50 20 NaN -10
2015 100 210 60 160 -110
3 2013 20 NaN NaN NaN NaN
2014 50 90 30 NaN -40
2015 190 150 140 60 40
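If you would rather stay in the original long format, a self-merge is one alternative to the pivot. The sketch below is not the method used above: it looks up the forecast for the same ids and year with a left merge and records the difference on the actual rows only, as in the desired output. The helper column forecast_revenue is made up for illustration.

```python
import pandas as pd

# Data from the question.
df = pd.DataFrame({
    "ids": [1, 2, 3] * 5,
    "year": ["2013"] * 3 + ["2014"] * 6 + ["2015"] * 6,
    "run": (["actual"] * 3 + ["forecast"] * 3 + ["actual"] * 3
            + ["forecast"] * 3 + ["actual"] * 3),
    "revenue": [10, 20, 20, 30, 50, 90, 10, 40, 50,
                120, 210, 150, 130, 100, 190],
})

# Forecast revenue per (ids, year), under a temporary column name.
forecast = (df.loc[df["run"] == "forecast", ["ids", "year", "revenue"]]
              .rename(columns={"revenue": "forecast_revenue"}))

# Left merge keeps every original row in its original order.
out = df.merge(forecast, on=["ids", "year"], how="left")

# Keep the difference on 'actual' rows only; everything else stays NaN.
is_actual = out["run"] == "actual"
out["chg_from_forecast"] = (out["revenue"] - out["forecast_revenue"]).where(is_actual)
out = out.drop(columns="forecast_revenue")
print(out)
```

Years without a forecast, such as 2013, simply get NaN because the merge finds no matching row.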
