Grouping data series by day intervals with Pandas - python
I have to perform some data analysis on a seasonal basis. I have circa one and a half years' worth of hourly measurements, taken between the end of 2015 and the second half of 2017. What I want to do is group this data by season.
Here's an example of the data I am working with:
Date,Year,Month,Day,Day week,Hour,Holiday,Week Day,Impulse,Power (kW),Temperature (C)
04/12/2015,2015,12,4,6,18,0,6,2968,1781,16.2
04/12/2015,2015,12,4,6,19,0,6,2437,1462,16.2
19/04/2016,2016,4,19,3,3,0,3,1348,809,14.4
19/04/2016,2016,4,19,3,4,0,3,1353,812,14.1
11/06/2016,2016,6,11,7,19,0,7,1395,837,18.8
11/06/2016,2016,6,11,7,20,0,7,1370,822,17.4
11/06/2016,2016,6,11,7,21,0,7,1364,818,17
11/06/2016,2016,6,11,7,22,0,7,1433,860,17.5
04/12/2016,2016,12,4,1,17,0,1,1425,855,14.6
04/12/2016,2016,12,4,1,18,0,1,1466,880,14.4
07/03/2017,2017,3,7,3,14,0,3,3668,2201,14.2
07/03/2017,2017,3,7,3,15,0,3,3666,2200,14
24/04/2017,2017,4,24,2,5,0,2,1347,808,11.4
24/04/2017,2017,4,24,2,6,0,2,1816,1090,11.5
24/04/2017,2017,4,24,2,7,0,2,2918,1751,12.4
15/06/2017,2017,6,15,5,13,1,1,2590,1554,22.5
15/06/2017,2017,6,15,5,14,1,1,2629,1577,22.5
15/06/2017,2017,6,15,5,15,1,1,2656,1594,22.1
15/11/2017,2017,11,15,4,13,0,4,3765,2259,15.6
15/11/2017,2017,11,15,4,14,0,4,3873,2324,15.9
15/11/2017,2017,11,15,4,15,0,4,3905,2343,15.8
15/11/2017,2017,11,15,4,16,0,4,3861,2317,15.3
As you can see, I have data from three different years.
My plan was to convert the first column with pd.to_datetime() and then group the rows by day/month (dd/mm) intervals, regardless of the year: if winter runs from 21/12 to 21/03, build a new dataframe containing every row whose date falls inside that interval, whatever its year. However, I couldn't find a way to make the comparison ignore the year, and that is what makes this complicated.
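For reference, this sketch is roughly as far as I got before getting stuck (the file name is just a placeholder):

import pandas as pd

df = pd.read_csv('measurements.csv')  # placeholder file name
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')  # dd/mm/yyyy dates
# A plain date filter is tied to specific years, e.g. the winter of 2015/16:
# df_winter = df[(df['Date'] >= '2015-12-21') & (df['Date'] <= '2016-03-21')]
# ...but I want one interval that matches every year at once.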
EDIT:
A desired output would be:
df_spring
Date,Year,Month,Day,Day week,Hour,Holiday,Week Day,Impulse,Power (kW),Temperature (C)
19/04/2016,2016,4,19,3,3,0,3,1348,809,14.4
19/04/2016,2016,4,19,3,4,0,3,1353,812,14.1
07/03/2017,2017,3,7,3,14,0,3,3668,2201,14.2
07/03/2017,2017,3,7,3,15,0,3,3666,2200,14
24/04/2017,2017,4,24,2,5,0,2,1347,808,11.4
24/04/2017,2017,4,24,2,6,0,2,1816,1090,11.5
24/04/2017,2017,4,24,2,7,0,2,2918,1751,12.4
df_autumn
Date,Year,Month,Day,Day week,Hour,Holiday,Week Day,Impulse,Power (kW),Temperature (C)
04/12/2015,2015,12,4,6,18,0,6,2968,1781,16.2
04/12/2015,2015,12,4,6,19,0,6,2437,1462,16.2
04/12/2016,2016,12,4,1,17,0,1,1425,855,14.6
04/12/2016,2016,12,4,1,18,0,1,1466,880,14.4
15/11/2017,2017,11,15,4,13,0,4,3765,2259,15.6
15/11/2017,2017,11,15,4,14,0,4,3873,2324,15.9
15/11/2017,2017,11,15,4,15,0,4,3905,2343,15.8
15/11/2017,2017,11,15,4,16,0,4,3861,2317,15.3
And so on for the remaining seasons.
Define each season by filtering the relevant rows on the Day and Month columns, as shown here for winter:
df_winter = df.loc[((df['Day'] >= 21) & (df['Month'] == 12)) | (df['Month'] == 1) | (df['Month'] == 2) | ((df['Day'] <= 21) & (df['Month'] == 3))]
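The other three seasons follow the same pattern; a sketch, keeping the 21st-of-the-month boundaries used above (the exact cut-off days are an assumption you may want to adjust):

# spring: 22/03 - 21/06
df_spring = df.loc[((df['Day'] >= 22) & (df['Month'] == 3)) |
                   df['Month'].isin([4, 5]) |
                   ((df['Day'] <= 21) & (df['Month'] == 6))]
# summer: 22/06 - 21/09
df_summer = df.loc[((df['Day'] >= 22) & (df['Month'] == 6)) |
                   df['Month'].isin([7, 8]) |
                   ((df['Day'] <= 21) & (df['Month'] == 9))]
# autumn: 22/09 - 20/12 (ends where the winter filter above begins)
df_autumn = df.loc[((df['Day'] >= 22) & (df['Month'] == 9)) |
                   df['Month'].isin([10, 11]) |
                   ((df['Day'] <= 20) & (df['Month'] == 12))]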
You can simply filter your dataframe with isin() on the Month column (the lists below use just the months present in your sample):
# spring
df[df['Month'].isin([3,4])]
Date Year Month Day Day week Hour Holiday Week Day Impulse Power (kW) Temperature (C)
2 19/04/2016 2016 4 19 3 3 0 3 1348 809 14.4
3 19/04/2016 2016 4 19 3 4 0 3 1353 812 14.1
10 07/03/2017 2017 3 7 3 14 0 3 3668 2201 14.2
11 07/03/2017 2017 3 7 3 15 0 3 3666 2200 14.0
12 24/04/2017 2017 4 24 2 5 0 2 1347 808 11.4
13 24/04/2017 2017 4 24 2 6 0 2 1816 1090 11.5
14 24/04/2017 2017 4 24 2 7 0 2 2918 1751 12.4
# autumn
df[df['Month'].isin([11,12])]
Date Year Month Day Day week Hour Holiday Week Day Impulse Power (kW) Temperature (C)
0 04/12/2015 2015 12 4 6 18 0 6 2968 1781 16.2
1 04/12/2015 2015 12 4 6 19 0 6 2437 1462 16.2
8 04/12/2016 2016 12 4 1 17 0 1 1425 855 14.6
9 04/12/2016 2016 12 4 1 18 0 1 1466 880 14.4
18 15/11/2017 2017 11 15 4 13 0 4 3765 2259 15.6
19 15/11/2017 2017 11 15 4 14 0 4 3873 2324 15.9
20 15/11/2017 2017 11 15 4 15 0 4 3905 2343 15.8
21 15/11/2017 2017 11 15 4 16 0 4 3861 2317 15.3
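If you want all four seasons in one pass, one option is to map months to season labels and group on that; a sketch (it assigns whole months, so the mid-month boundaries from the other answer are ignored):

season_of = {12: 'winter', 1: 'winter', 2: 'winter',
             3: 'spring', 4: 'spring', 5: 'spring',
             6: 'summer', 7: 'summer', 8: 'summer',
             9: 'autumn', 10: 'autumn', 11: 'autumn'}
df['Season'] = df['Month'].map(season_of)
seasons = dict(tuple(df.groupby('Season')))  # one dataframe per season, all years pooled
df_spring = seasons['spring']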
Related
Pandas Month-To-Date rolling sum
We can apply a 30D rolling sum operation as df.rolling("30D").sum(). However, how can I achieve a month-to-date (or even year-to-date) rolling sum in a similar fashion? Month-to-date meaning that we only sum from the beginning of the month up to the current date (or row).
Consider the following DataFrame:

    Year  Month  week  Revenue
0   2020      1     1       10
1   2020      1     2       20
2   2020      1     3       10
3   2020      1     4       20
4   2020      2     1       10
5   2020      2     2       20
6   2020      2     3       10
7   2020      2     4       20
8   2020      3     1       10
9   2020      3     2       20
10  2020      3     3       10
11  2020      3     4       20
12  2021      1     1       10
13  2021      1     2       20
14  2021      1     3       10
15  2021      1     4       20
16  2021      2     1       10
17  2021      2     2       20
18  2021      2     3       10
19  2021      2     4       20
20  2021      3     1       10
21  2021      3     2       20
22  2021      3     3       10
23  2021      3     4       20

You could use a combination of groupby + cumsum to get what you want:

df['Year_To_date'] = df.groupby('Year')['Revenue'].cumsum()
df['Month_To_date'] = df.groupby(['Year', 'Month'])['Revenue'].cumsum()

Results:

    Year  Month  week  Revenue  Year_To_date  Month_To_date
0   2020      1     1       10            10             10
1   2020      1     2       20            30             30
2   2020      1     3       10            40             40
3   2020      1     4       20            60             60
4   2020      2     1       10            70             10
5   2020      2     2       20            90             30
6   2020      2     3       10           100             40
7   2020      2     4       20           120             60
8   2020      3     1       10           130             10
9   2020      3     2       20           150             30
10  2020      3     3       10           160             40
11  2020      3     4       20           180             60
12  2021      1     1       10            10             10
13  2021      1     2       20            30             30
14  2021      1     3       10            40             40
15  2021      1     4       20            60             60
16  2021      2     1       10            70             10
17  2021      2     2       20            90             30
18  2021      2     3       10           100             40
19  2021      2     4       20           120             60
20  2021      3     1       10           130             10
21  2021      3     2       20           150             30
22  2021      3     3       10           160             40
23  2021      3     4       20           180             60

Note that month-to-date only makes sense if you have a week/date column in your data model.

EXTRAS: The goal of cumsum is to compute the cumulative sum over dates by period. However, if the index of the original dataframe is not ordered in the desired sequence, cumsum is computed by the original index within each group, because pandas operates row by row over the index. The dataframe therefore first needs to be sorted in the desired order ([Year, Month, week] or [Date]), with the index reset to match, so that the output is summed per period in chronological order:

df = df.sort_values(['Year', 'Month', 'week']).reset_index(drop=True)
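If the frame has a DatetimeIndex rather than explicit Year/Month columns, the same idea can be written with to_period; a minimal sketch on made-up daily data:

import numpy as np
import pandas as pd

idx = pd.date_range('2020-01-01', '2020-03-31', freq='D')
df = pd.DataFrame({'Revenue': np.full(len(idx), 10)}, index=idx)

# restart the cumulative sum at the first row of each month / year
df['Month_To_date'] = df['Revenue'].groupby(df.index.to_period('M')).cumsum()
df['Year_To_date'] = df['Revenue'].groupby(df.index.to_period('Y')).cumsum()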
Appending DataFrame columns to another DataFrame at an index/location that meets conditions [duplicate]
This question already has answers here: Pandas Merging 101 (8 answers). Closed 2 years ago.

I have a one_sec_flt DataFrame that has 300,000+ points and a flask DataFrame that has 230 points. Both DataFrames have hour, minute, and second columns. I want to append the flask DataFrame to the one_sec_flt data at the time each flask sample was taken.

Flasks DataFrame:

     year  month  day  hour  minute  second ...  gas1  gas2  gas3
0    2018      4    8    16      27      48 ...    10    25   191
1    2018      4    8    16      40      20 ...    45    34   257
...
229  2018      5   12    14      10      05 ...     3    72   108

one_sec_flt DataFrame:

         Year  Month  Day  Hour  Min  Second ...  temp  wind
0        2018      4    8    14   30      20 ...   300    10
1        2018      4    8    14   45      15 ...   310     8
...
305,212  2018      5   12    14   10      05 ...   308    24

I have this code I started with, but I don't know how to append one DataFrame to another at that exact timestamp:

for i in range(len(flasks)):
    for j in range(len(one_sec_flt)):
        if (flasks.hour.iloc[i] == one_sec_flt.Hour.iloc[j]):
            if (flasks.minute.iloc[i] == one_sec_flt.Min.iloc[j]):
                if (flasks.second.iloc[i] == one_sec_flt.Sec.iloc[j]):
                    print('match')

My output goal would look like:

         Year  Month  Day  Hour  Min  Second ...  temp  wind  gas1  gas2  gas3
0        2018      4    8    14   30      20 ...   300    10   nan   nan   nan
1        2018      4    8    14   45      15 ...   310     8   nan   nan   nan
2        2018      4    8    15   15      47 ...   ...   ...   nan   nan   nan
3        2018      4    8    16   27      48 ...   ...   ...    10    25   191
4        2018      4    8    16   30      11 ...   ...   ...   nan   nan   nan
5        2018      4    8    16   40      20 ...   ...   ...    45    34   257
...       ...    ...  ...   ...  ...     ... ...   ...   ...   ...   ...   ...
305,212  2018      5   12    14   10      05 ...   308    24     3    72   108
If you can concatenate both dataframes, Flasks and one_sec_flt, and then sort by the time columns, that might achieve what you are looking for (at least, if I understood the problem statement correctly):

Flasks
Out[13]:
   year  month  day  hour  minute  second
0  2018      4    8    16      27      48
1  2018      4    8    16      40      20

one_sec
Out[14]:
   year  month  day  hour  minute  second
0  2018      4    8    14      30      20
1  2018      4    8    14      45      15

df_res = pd.concat([Flasks, one_sec])

df_res
Out[16]:
   year  month  day  hour  minute  second
0  2018      4    8    16      27      48
1  2018      4    8    16      40      20
0  2018      4    8    14      30      20
1  2018      4    8    14      45      15

df_res.sort_values(by=['year', 'month', 'day', 'hour', 'minute', 'second'])
Out[17]:
   year  month  day  hour  minute  second
0  2018      4    8    14      30      20
1  2018      4    8    14      45      15
0  2018      4    8    16      27      48
1  2018      4    8    16      40      20
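Given the NaN-padded output the question asks for, a left merge on the time columns may get closer than concat; a sketch, assuming the column names printed in the question:

# align flask samples with the one-second data on the shared timestamp columns;
# one_sec_flt rows without a matching flask sample keep NaN in gas1/gas2/gas3
flasks_renamed = flasks.rename(columns={'year': 'Year', 'month': 'Month',
                                        'day': 'Day', 'hour': 'Hour',
                                        'minute': 'Min', 'second': 'Second'})
merged = one_sec_flt.merge(flasks_renamed,
                           on=['Year', 'Month', 'Day', 'Hour', 'Min', 'Second'],
                           how='left')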
Creating a Box-Plot but by value_counts() [Number of events occurred]
I have the following dataframe. Each entry is an event that occurred [550624 events]. Suppose we are interested in a box plot of the number of events occurring per day, for each month.

print(df)

        Month  Day
0           4    1
1           4    1
2           4    1
3           4    1
4           4    1
...       ...  ...
550619     10   31
550620     10   31
550621     10   31
550622     10   31
550623     10   31

[550624 rows x 2 columns]

df2 = df.groupby('Month')['Day'].value_counts().sort_index()

Month  Day
4      1     2162
       2     1564
       3     1973
       4     1620
       5     1860
...
10     27    2022
       28    1606
       29    1316
       30    1674
       31    1726

sns.boxplot(x=df2.index.get_level_values('Month'), y=df2)

[image: output of sns.boxplot]

My question is whether this is the most efficient/direct way to create this visual, or whether I am taking a roundabout route to it. Is there a more direct way to achieve this?
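For comparison, one marginally more direct spelling of the same counts is groupby(...).size(), which avoids the MultiIndex juggling in the plot call; a sketch, reusing the df above:

import seaborn as sns

# number of events per (Month, Day), as a flat dataframe
counts = df.groupby(['Month', 'Day']).size().rename('events').reset_index()
sns.boxplot(x='Month', y='events', data=counts)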
How to use groupby and grouper properly for accumulating column 'A' and averaging column 'B', month by month
I have a pandas dataframe with 3 columns: date (from 1/1/2018 up until 8/23/2019), column A and column B.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 10, size=(600, 2)), columns=list('AB'))
df['date'] = pd.DataFrame(pd.date_range(start='1/1/2018', end='8/23/2019'))
df = df.set_index('date')

df is as follows:

date        A  B
2018-01-01  7  4
2018-01-02  5  4
2018-01-03  3  1
2018-01-04  9  3
2018-01-05  7  8
2018-01-06  0  0
2018-01-07  6  8
2018-01-08  3  7
...
2019-08-18  1  0
2019-08-19  8  1
2019-08-20  5  9
2019-08-21  0  7
2019-08-22  3  6
2019-08-23  8  6

I want monthly accumulated values of column A and monthly averaged values of column B. The final output should be a dataframe with 20 rows (12 months of 2018 and 8 months of 2019) and 4 columns: the month number, the year number, the monthly accumulated values of column A, and the monthly averaged values of column B, just like below:

    month  year  monthly_accumulated_of_A  monthly_averaged_of_B
0       1  2018                       176               1.747947
1       2  2018                       110               2.399476
2       3  2018                       131               3.976747
3       4  2018                       227               2.314923
4       5  2018                       234               0.464097
5       6  2018                       249               1.662753
6       7  2018                       121               1.588865
7       8  2018                       165               2.318268
8       9  2018                       219               1.060595
9      10  2018                       131               0.577268
10     11  2018                       179               3.948414
11     12  2018                       115               1.750346
12      1  2019                       190               3.364003
13      2  2019                       215               0.864792
14      3  2019                       231               3.219739
15      4  2019                       186               2.904413
16      5  2019                       232               0.324695
17      6  2019                       163               1.334139
18      7  2019                       238               1.670644
19      8  2019                       112               1.316442

How can I achieve this in pandas?
Use DataFrameGroupBy.agg with DatetimeIndex.month and DatetimeIndex.year; for ordering, add sort_index, and finally use reset_index to turn the MultiIndex levels into columns:

import pandas as pd
import numpy as np

np.random.seed(2018)

# changed 300 to 600
df = pd.DataFrame(np.random.randint(0, 10, size=(600, 2)), columns=list('AB'))
df['date'] = pd.DataFrame(pd.date_range(start='1/1/2018', end='8/23/2019'))
df = df.set_index('date')

df1 = (df.groupby([df.index.month.rename('month'),
                   df.index.year.rename('year')])
         .agg({'A': 'sum', 'B': 'mean'})
         .sort_index(level=['year', 'month'])
         .reset_index())
print(df1)

    month  year    A         B
0       1  2018  147  4.838710
1       2  2018  120  3.678571
2       3  2018  114  4.387097
3       4  2018  143  3.800000
4       5  2018  124  3.870968
5       6  2018  129  4.700000
6       7  2018  143  3.935484
7       8  2018  118  5.483871
8       9  2018  150  5.500000
9      10  2018  139  4.225806
10     11  2018  136  4.933333
11     12  2018  141  4.548387
12      1  2019  137  4.709677
13      2  2019  120  4.964286
14      3  2019  167  4.935484
15      4  2019  121  4.200000
16      5  2019  133  4.129032
17      6  2019  140  5.066667
18      7  2019  189  4.677419
19      8  2019  100  3.695652
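Since the question's title also mentions Grouper: an equivalent formulation groups on the date column with pd.Grouper, which avoids renaming the index attributes; a sketch, continuing from the df above:

df2 = (df.reset_index()                              # turn the 'date' index back into a column
         .groupby(pd.Grouper(key='date', freq='M'))  # 'M' = month-end periods ('ME' in newer pandas)
         .agg({'A': 'sum', 'B': 'mean'})
         .reset_index())
df2['month'] = df2['date'].dt.month
df2['year'] = df2['date'].dt.year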
Calculate difference from previous year/forecast in pandas dataframe
I wish to compare the output of multiple model runs, calculating these values:

- Difference between current period revenue and previous period revenue
- Difference between actual current period revenue and forecasted current period revenue

I have experimented with multi-indexes, and suspect the answer lies in that direction with some creative shift(). However, I'm afraid I've mangled the problem through a haphazard application of various pivot/melt/groupby experiments. Perhaps you can help me figure out how to turn this:

import pandas as pd

ids = [1, 2, 3] * 5
year = ['2013', '2013', '2013', '2014', '2014', '2014', '2014', '2014', '2014',
        '2015', '2015', '2015', '2015', '2015', '2015']
run = ['actual', 'actual', 'actual', 'forecast', 'forecast', 'forecast',
       'actual', 'actual', 'actual', 'forecast', 'forecast', 'forecast',
       'actual', 'actual', 'actual']
revenue = [10, 20, 20, 30, 50, 90, 10, 40, 50, 120, 210, 150, 130, 100, 190]
change_from_previous_year = ['NA', 'NA', 'NA', 20, 30, 70, 0, 20, 30, 90, 160, 60, 120, 60, 140]
change_from_forecast = ['NA', 'NA', 'NA', 'NA', 'NA', 'NA', -20, -10, -40, 'NA', 'NA', 'NA', 30, -110, 40]

d = {'ids': ids, 'year': year, 'run': run, 'revenue': revenue}
df = pd.DataFrame(data=d, columns=['ids', 'year', 'run', 'revenue'])
print(df)

    ids  year       run  revenue
0     1  2013    actual       10
1     2  2013    actual       20
2     3  2013    actual       20
3     1  2014  forecast       30
4     2  2014  forecast       50
5     3  2014  forecast       90
6     1  2014    actual       10
7     2  2014    actual       40
8     3  2014    actual       50
9     1  2015  forecast      120
10    2  2015  forecast      210
11    3  2015  forecast      150
12    1  2015    actual      130
13    2  2015    actual      100
14    3  2015    actual      190

....into this:

    ids  year       run  revenue  chg_from_prev_year  chg_from_forecast
0     1  2013    actual       10                  NA                 NA
1     2  2013    actual       20                  NA                 NA
2     3  2013    actual       20                  NA                 NA
3     1  2014  forecast       30                  20                 NA
4     2  2014  forecast       50                  30                 NA
5     3  2014  forecast       90                  70                 NA
6     1  2014    actual       10                   0                -20
7     2  2014    actual       40                  20                -10
8     3  2014    actual       50                  30                -40
9     1  2015  forecast      120                  90                 NA
10    2  2015  forecast      210                 160                 NA
11    3  2015  forecast      150                  60                 NA
12    1  2015    actual      130                 120                 30
13    2  2015    actual      100                  60               -110
14    3  2015    actual      190                 140                 40

EDIT -- I get pretty close with this:

df['prev_year'] = df.groupby(['ids', 'run']).shift(1)['revenue']
df['chg_from_prev_year'] = df['revenue'] - df['prev_year']
df['curr_forecast'] = df.groupby(['ids', 'year']).shift(1)['revenue']
df['chg_from_forecast'] = df['revenue'] - df['curr_forecast']

The only thing missed (as expected) is the comparison between the 2014 forecast and the 2013 actual. I could just duplicate the 2013 run in the dataset, calculate chg_from_prev_year for the 2014 forecast, and hide/delete the unwanted data from the final dataframe.
Firstly, to get the change from the previous year, do a shift on each of the groups:

In [11]: g = df.groupby(['ids', 'run'])

In [12]: df['chg_from_prev_year'] = g['revenue'].apply(lambda x: x - x.shift())

The next part is more complicated; I think you need to do a pivot_table:

In [13]: df1 = df.pivot_table('revenue', ['ids', 'year'], 'run')

In [14]: df1
Out[14]:
run       actual  forecast
ids year
1   2013      10       NaN
    2014      10        30
    2015     130       120
2   2013      20       NaN
    2014      40        50
    2015     100       210
3   2013      20       NaN
    2014      50        90
    2015     190       150

In [15]: g1 = df1.groupby(level='ids', as_index=False)

In [16]: out_by = g1.apply(lambda x: x['actual'] - x['forecast'])

In [17]: out_by  # hello levels bug, fixed in 0.13/master... yesterday :)
Out[17]:
ids  ids  year
1    1    2013     NaN
          2014     -20
          2015      10
2    2    2013     NaN
          2014     -10
          2015    -110
3    3    2013     NaN
          2014     -40
          2015      40
dtype: float64

Which is the result you want, but not in the correct format (see [31] below if you're not too fussed). The following seems like a bit of a hack (to put it mildly), but here goes:

In [21]: df2 = df.set_index(['ids', 'year', 'run'])

In [22]: out_by.index = out_by.index.droplevel(0)

In [23]: out_by_df = pd.DataFrame(out_by, columns=['revenue'])

In [24]: out_by_df['run'] = 'forecast'

In [25]: df2['chg_from_forecast'] = out_by_df.set_index('run', append=True)['revenue']

and we're done...

In [26]: df2.reset_index()
Out[26]:
    ids  year       run  revenue  chg_from_prev_year  chg_from_forecast
0     1  2013    actual       10                 NaN                NaN
1     2  2013    actual       20                 NaN                NaN
2     3  2013    actual       20                 NaN                NaN
3     1  2014  forecast       30                 NaN                -20
4     2  2014  forecast       50                 NaN                -10
5     3  2014  forecast       90                 NaN                -40
6     1  2014    actual       10                   0                NaN
7     2  2014    actual       40                  20                NaN
8     3  2014    actual       50                  30                NaN
9     1  2015  forecast      120                  90                 10
10    2  2015  forecast      210                 160               -110
11    3  2015  forecast      150                  60                 40
12    1  2015    actual      130                 120                NaN
13    2  2015    actual      100                  60                NaN
14    3  2015    actual      190                 140                NaN

Note: I think the first 6 results of chg_from_prev_year should be NaN.

However, I think you may be better off keeping it as a pivot:

In [31]: df3 = df.pivot_table(['revenue', 'chg_from_prev_year'], ['ids', 'year'], 'run')

In [32]: df3['chg_from_forecast'] = g1.apply(lambda x: x['actual'] - x['forecast']).values

In [33]: df3
Out[33]:
          revenue            chg_from_prev_year            chg_from_forecast
run        actual  forecast              actual  forecast
ids year
1   2013       10       NaN                 NaN       NaN               NaN
    2014       10        30                   0       NaN               -20
    2015      130       120                 120        90                10
2   2013       20       NaN                 NaN       NaN               NaN
    2014       40        50                  20       NaN               -10
    2015      100       210                  60       160              -110
3   2013       20       NaN                 NaN       NaN               NaN
    2014       50        90                  30       NaN               -40
    2015      190       150                 140        60                40
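For what it's worth, a less hack-ish route to chg_from_forecast, closer to the asker's desired output (the change attached to the actual rows), is to split out the forecast rows and merge them back; a sketch on the same df:

# pull the forecast revenue out as its own column, keyed by (ids, year)
forecast = (df.loc[df['run'] == 'forecast', ['ids', 'year', 'revenue']]
              .rename(columns={'revenue': 'forecast_revenue'}))
out = df.merge(forecast, on=['ids', 'year'], how='left')
# the comparison only makes sense on the actual rows; others stay NaN
out['chg_from_forecast'] = (out['revenue'] - out['forecast_revenue']).where(out['run'] == 'actual')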