Calculate difference from previous year/forecast in pandas dataframe - python
I wish to compare the output of multiple model runs, calculating these values:
Difference between current period revenue and previous period
Difference between actual current period revenue and forecasted current period revenue
I have experimented with multi-indexes, and suspect the answer lies in that direction with some creative shift(). However, I'm afraid I've mangled the problem through a haphazard application of various pivot/melt/groupby experiments. Perhaps you can help me figure out how to turn this:
import pandas as pd
ids = [1,2,3] * 5
year = ['2013', '2013', '2013', '2014', '2014', '2014', '2014', '2014', '2014', '2015', '2015', '2015', '2015', '2015', '2015']
run = ['actual','actual','actual','forecast','forecast','forecast','actual','actual','actual','forecast','forecast','forecast','actual','actual','actual']
revenue = [10,20,20,30,50,90,10,40,50,120,210,150,130,100,190]
change_from_previous_year = ['NA','NA','NA',20,30,70,0,20,30,90,160,60,120,60,140]
change_from_forecast = ['NA','NA','NA','NA','NA','NA',-20,-10,-40,'NA','NA','NA',10,-110,40]
d = {'ids':ids, 'year':year, 'run':run, 'revenue':revenue}
df = pd.DataFrame(data=d, columns=['ids','year','run','revenue'])
print(df)
ids year run revenue
0 1 2013 actual 10
1 2 2013 actual 20
2 3 2013 actual 20
3 1 2014 forecast 30
4 2 2014 forecast 50
5 3 2014 forecast 90
6 1 2014 actual 10
7 2 2014 actual 40
8 3 2014 actual 50
9 1 2015 forecast 120
10 2 2015 forecast 210
11 3 2015 forecast 150
12 1 2015 actual 130
13 2 2015 actual 100
14 3 2015 actual 190
...into this:
ids year run revenue chg_from_prev_year chg_from_forecast
0 1 2013 actual 10 NA NA
1 2 2013 actual 20 NA NA
2 3 2013 actual 20 NA NA
3 1 2014 forecast 30 20 NA
4 2 2014 forecast 50 30 NA
5 3 2014 forecast 90 70 NA
6 1 2014 actual 10 0 -20
7 2 2014 actual 40 20 -10
8 3 2014 actual 50 30 -40
9 1 2015 forecast 120 90 NA
10 2 2015 forecast 210 160 NA
11 3 2015 forecast 150 60 NA
12 1 2015 actual 130 120 10
13 2 2015 actual 100 60 -110
14 3 2015 actual 190 140 40
EDIT-- I get pretty close with this:
df['prev_year'] = df.groupby(['ids','run']).shift(1)['revenue']
df['chg_from_prev_year'] = df['revenue'] - df['prev_year']
df['curr_forecast'] = df.groupby(['ids','year']).shift(1)['revenue']
df['chg_from_forecast'] = df['revenue'] - df['curr_forecast']
The only thing missing (as expected) is the comparison between the 2014 forecast and the 2013 actual. I could just duplicate the 2013 run in the dataset, calculate chg_from_prev_year for the 2014 forecast, and hide/delete the unwanted data from the final dataframe.
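A merge-based sketch of that duplication idea, without actually duplicating rows (prev and prev_actual are helper names introduced here, not part of the original code):

# Look up each id's previous-year actual and use it to fill the gaps
# that the same-run shift leaves behind (the first forecast year).
prev = df.loc[df['run'] == 'actual', ['ids', 'year', 'revenue']].copy()
prev['year'] = (prev['year'].astype(int) + 1).astype(str)  # baseline for the *next* year
prev = prev.rename(columns={'revenue': 'prev_actual'})
out = df.merge(prev, on=['ids', 'year'], how='left')
out['chg_from_prev_year'] = out['chg_from_prev_year'].fillna(out['revenue'] - out['prev_actual'])

On the sample data this reproduces the 20/30/70 values from the desired table for the 2014 forecast rows, while leaving the shift-based values untouched everywhere else.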
Firstly, to get the change from the previous year, do a shift on each of the groups:
In [11]: g = df.groupby(['ids', 'run'])
In [12]: df['chg_from_prev_year'] = g['revenue'].apply(lambda x: x - x.shift())
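For what it's worth, more recent pandas versions have a groupwise diff that gives the same column more directly (a sketch):

df['chg_from_prev_year'] = df.groupby(['ids', 'run'])['revenue'].diff()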
The next part is more complicated; I think you need a pivot_table here:
In [13]: df1 = df.pivot_table('revenue', ['ids', 'year'], 'run')
In [14]: df1
Out[14]:
run actual forecast
ids year
1 2013 10 NaN
2014 10 30
2015 130 120
2 2013 20 NaN
2014 40 50
2015 100 210
3 2013 20 NaN
2014 50 90
2015 190 150
In [15]: g1 = df1.groupby(level='ids', as_index=False)
In [16]: out_by = g1.apply(lambda x: x['actual'] - x['forecast'])
In [17]: out_by # hello levels bug, fixed in 0.13/master... yesterday :)
Out[17]:
ids ids year
1 1 2013 NaN
2014 -20
2015 10
2 2 2013 NaN
2014 -10
2015 -110
3 3 2013 NaN
2014 -40
2015 40
dtype: float64
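(Aside: with that levels bug fixed, the groupby is not actually needed here; df1 is already aligned on (ids, year), so a plain column subtraction gives the same numbers without the duplicated ids level. If you use this form, skip the droplevel step in [22] below.)

out_by = df1['actual'] - df1['forecast']  # same values, single (ids, year) index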
These are the results you want, but not in the correct format (see [31] below if you're not too fussed). The following seems like a bit of a hack (to put it mildly), but here goes:
In [21]: df2 = df.set_index(['ids', 'year', 'run'])
In [22]: out_by.index = out_by.index.droplevel(0)
In [23]: out_by_df = pd.DataFrame(out_by, columns=['revenue'])
In [24]: out_by_df['run'] = 'forecast'
In [25]: df2['chg_from_forecast'] = out_by_df.set_index('run', append=True)['revenue']
and we're done...
In [26]: df2.reset_index()
Out[26]:
ids year run revenue chg_from_prev_year chg_from_forecast
0 1 2013 actual 10 NaN NaN
1 2 2013 actual 20 NaN NaN
2 3 2013 actual 20 NaN NaN
3 1 2014 forecast 30 NaN -20
4 2 2014 forecast 50 NaN -10
5 3 2014 forecast 90 NaN -40
6 1 2014 actual 10 0 NaN
7 2 2014 actual 40 20 NaN
8 3 2014 actual 50 30 NaN
9 1 2015 forecast 120 90 10
10 2 2015 forecast 210 160 -110
11 3 2015 forecast 150 60 40
12 1 2015 actual 130 120 NaN
13 2 2015 actual 100 60 NaN
14 3 2015 actual 190 140 NaN
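Also note that this attaches chg_from_forecast to the forecast rows, whereas your target table has it on the actual rows; assigning the other run label in [24] flips that (a one-line tweak):

out_by_df['run'] = 'actual'  # attach the deltas to the actual rows instead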
Note: I think the first 6 results of chg_from_prev_year (rows 0-5 in your desired output) should be NaN.
However, I think you may be better off keeping it as a pivot:
In [31]: df3 = df.pivot_table(['revenue', 'chg_from_prev_year'], ['ids', 'year'], 'run')
In [32]: df3['chg_from_forecast'] = g1.apply(lambda x: x['actual'] - x['forecast']).values
In [33]: df3
Out[33]:
revenue chg_from_prev_year chg_from_forecast
run actual forecast actual forecast
ids year
1 2013 10 NaN NaN NaN NaN
2014 10 30 0 NaN -20
2015 130 120 120 90 10
2 2013 20 NaN NaN NaN NaN
2014 40 50 20 NaN -10
2015 100 210 60 160 -110
3 2013 20 NaN NaN NaN NaN
2014 50 90 30 NaN -40
2015 190 150 140 60 40
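One caveat with [32]: assigning via .values relies on the row order matching. Since the subtraction is already aligned on (ids, year), an index-aligned version avoids that (a sketch):

df3['chg_from_forecast'] = df3[('revenue', 'actual')] - df3[('revenue', 'forecast')]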
Related
Appending DataFrame columns to another DataFrame at an index/location that meets conditions [duplicate]
This question already has answers here: Pandas Merging 101 (8 answers).

I have a one_sec_flt DataFrame that has 300,000+ points and a flask DataFrame that has 230 points. Both DataFrames have columns Hour, Minute, Second. I want to append the flask DataFrame to the same time it was taken in the one_sec_flt data.

Flasks DataFrame

     year  month  day  hour  minute  second ...  gas1  gas2  gas3
0    2018      4    8    16      27      48 ...    10    25   191
1    2018      4    8    16      40      20 ...    45    34   257
...
229  2018      5   12    14      10      05 ...     3    72   108

one_sec_flt DataFrame

         Year  Month  Day  Hour  Min  Second ...  temp  wind
0        2018      4    8    14   30      20 ...   300    10
1        2018      4    8    14   45      15 ...   310     8
...
305,212  2018      5   12    14   10      05 ...   308    24

I have this code I started with, but I don't know how to append one DataFrame to another at that exact timestamp:

for i in range(len(flasks)):
    for j in range(len(one_sec_flt)):
        if (flasks.hour.iloc[i] == one_sec_flt.Hour.iloc[j]):
            if (flasks.minute.iloc[i] == one_sec_flt.Min.iloc[j]):
                if (flasks.second.iloc[i] == one_sec_flt.Sec.iloc[j]):
                    print('match')

My output goal would look like:

         Year  Month  Day  Hour  Min  Second ...  temp  wind  gas1  gas2  gas3
0        2018      4    8    14   30      20 ...   300    10   nan   nan   nan
1        2018      4    8    14   45      15 ...   310     8   nan   nan   nan
2        2018      4    8    15   15      47 ...   ...   ...   nan   nan   nan
3        2018      4    8    16   27      48 ...   ...   ...    10    25   191
4        2018      4    8    16   30      11 ...   ...   ...   nan   nan   nan
5        2018      4    8    16   40      20 ...   ...   ...    45    34   257
...       ...    ...  ...   ...  ...     ... ...   ...   ...   ...   ...   ...
305,212  2018      5   12    14   10      05 ...   308    24     3    72   108
If you concatenate both dataframes (Flasks and one_sec_flt) and then sort by the time columns, it might achieve what you are looking for (at least, if I understood the problem statement correctly).

Flasks
Out[13]:
   year  month  day  hour  minute  second
0  2018      4    8    16      27      48
1  2018      4    8    16      40      20

one_sec
Out[14]:
   year  month  day  hour  minute  second
0  2018      4    8    14      30      20
1  2018      4    8    14      45      15

df_res = pd.concat([Flasks, one_sec])

df_res
Out[16]:
   year  month  day  hour  minute  second
0  2018      4    8    16      27      48
1  2018      4    8    16      40      20
0  2018      4    8    14      30      20
1  2018      4    8    14      45      15

df_res.sort_values(by=['year','month','day','hour','minute','second'])
Out[17]:
   year  month  day  hour  minute  second
0  2018      4    8    14      30      20
1  2018      4    8    14      45      15
0  2018      4    8    16      27      48
1  2018      4    8    16      40      20
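Alternatively, a left merge on the shared time columns (the approach behind the linked Pandas Merging 101) produces the goal table directly; a sketch, assuming the column names shown in the question:

merged = one_sec_flt.merge(
    flasks.rename(columns={'year': 'Year', 'month': 'Month', 'day': 'Day',
                           'hour': 'Hour', 'minute': 'Min', 'second': 'Second'}),
    on=['Year', 'Month', 'Day', 'Hour', 'Min', 'Second'],
    how='left')

Rows without a matching flask reading keep NaN in the gas columns, which matches the goal table.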
How to replace values in a column of panel data with values from a list in Python?
I have a database in panel data form:

Date  id  variable1  variable2
2015   1         10        200
2016   1         17        300
2017   1          8        400
2018   1         11        500
2015   2         12        150
2016   2         19        350
2017   2         15        250
2018   2          9        450
2015   3         20        100
2016   3          8        220
2017   3         12        310
2018   3         14        350

And I have a list with the labels for the ids:

List = ['Argentina', 'Brazil', 'Chile']

I want to replace the values of id with the labels from my list, like this:

Date  id         variable1  variable2
2015  Argentina         10        200
2016  Argentina         17        300
2017  Argentina          8        400
2018  Argentina         11        500
2015  Brazil            12        150
2016  Brazil            19        350
2017  Brazil            15        250
2018  Brazil             9        450
2015  Chile             20        100
2016  Chile              8        220
2017  Chile             12        310
2018  Chile             14        350

Thanks in advance.
map is the way to go, with enumerate:

d = {k: v for k, v in enumerate(List, start=1)}
df['id'] = df['id'].map(d)

Output:

    Date  id         variable1  variable2
0   2015  Argentina         10        200
1   2016  Argentina         17        300
2   2017  Argentina          8        400
3   2018  Argentina         11        500
4   2015  Brazil            12        150
5   2016  Brazil            19        350
6   2017  Brazil            15        250
7   2018  Brazil             9        450
8   2015  Chile             20        100
9   2016  Chile              8        220
10  2017  Chile             12        310
11  2018  Chile             14        350
Try

df['id'] = df['id'].map({1: 'Argentina', 2: 'Brazil', 3: 'Chile'})

or

df['id'] = df['id'].map({k+1: v for k, v in enumerate(List)})
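A close equivalent is replace (a sketch); the practical difference is that map turns unmatched ids into NaN, while replace leaves them untouched:

df['id'] = df['id'].replace({1: 'Argentina', 2: 'Brazil', 3: 'Chile'})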
Pandas Rolling mean with GroupBy and Sort
I have a DataFrame that looks like:

f_period  f_year  f_month  subject  month  year  value
20140102    2014        1        a      1  2018     10
20140109    2014        1        a      1  2018     12
20140116    2014        1        a      1  2018      8
20140202    2014        2        a      1  2018     20
20140209    2014        2        a      1  2018     15
20140102    2014        1        b      1  2018     10
20140109    2014        1        b      1  2018     12
20140116    2014        1        b      1  2018      8
20140202    2014        2        b      1  2018     20
20140209    2014        2        b      1  2018     15

f_period is the date when a forecast for a SKU (column subject) was made; the month and year columns give the period the forecast was made for. For example, the first row says that on 01/02/2018, the model was forecasting to set 10 units of product a in month 1 of year 2018.

I am trying to create a rolling average prediction by subject, by month, for 2 f_months. The DataFrame should look like:

f_period  f_year  f_month  subject  month  year  value  mnthly_avg  rolling_2_avg
20140102    2014        1        a      1  2018     10          10             13
20140109    2014        1        a      1  2018     12          10             13
20140116    2014        1        a      1  2018      8          10             13
20140202    2014        2        a      1  2018     20        17.5           null
20140209    2014        2        a      1  2018     15        17.5           null
20140102    2014        1        b      1  2018     10          10             13
20140109    2014        1        b      1  2018     12          10             13
20140116    2014        1        b      1  2018      8          10             13
20140202    2014        2        b      1  2018     20        17.5           null
20140209    2014        2        b      1  2018     15        17.5           null

Things I tried: I was able to get mnthly_avg with:

data_df['monthly_avg'] = data_df.groupby(['f_month', 'f_year', 'year', 'month', 'period', 'subject']).\
    value.transform('mean')

I tried getting rolling_2_avg with:

rolling_monthly_df = data_df[['f_year', 'f_month', 'subject', 'month', 'year', 'value', 'f_period']].\
    groupby(['f_year', 'f_month', 'subject', 'month', 'year']).value.mean().reset_index()
rolling_monthly_df['rolling_2_avg'] = rolling_monthly_df.groupby(['subject', 'month']).\
    value.rolling(2).mean().reset_index(drop=True)

This gave me an unexpected output; I don't understand how it calculated the values for rolling_2_avg. How do I group by subject and month, then sort by f_month, and take the rolling two-month average?
Unless I'm misunderstanding, it seems simpler than what you've done. What about this?

grp = pd.DataFrame(df.groupby(['subject', 'month', 'f_month'])['value'].sum())
grp['rolling'] = grp.rolling(window=2).mean()
grp

Output:

                       value  rolling
subject month f_month
a       1     1           30      NaN
              2           35     32.5
b       1     1           30     32.5
              2           35     32.5
I would be a bit careful with Josh's solution. If you want to group by subject, you can't use the rolling function like that, as it will roll across subjects (i.e. it will eventually take the mean of a month from subject a and a month from subject b, rather than giving the null you might prefer). An alternative is to split the dataframe and run the rolling calculation on each subject individually (I noticed that you want the nulls at the end of the dataframe, so you may want to sort it before and after):

for unique_subject in df['subject'].unique():
    df_subject = df[df['subject'] == unique_subject].copy()  # copy to avoid SettingWithCopyWarning
    df_subject['rolling'] = df_subject['value'].rolling(window=2).mean()
    print(df_subject)  # just to print; you may want to concatenate these instead
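The same per-subject windows can also stay inside pandas with a groupby-level rolling, which prevents the window from crossing subject boundaries (a sketch using the question's column names):

df['rolling'] = (df.groupby('subject')['value']
                   .rolling(window=2).mean()
                   .reset_index(level='subject', drop=True))

The reset_index call drops the subject level that groupby-rolling adds, so the result aligns back to the original rows.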
How to use groupby and grouper properly for accumulating column 'A' and averaging column 'B', month by month
I have a pandas dataframe with 3 columns: date (from 1/1/2018 up until 8/23/2019), column A, and column B.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 10, size=(600, 2)), columns=list('AB'))
df['date'] = pd.DataFrame(pd.date_range(start='1/1/2018', end='8/23/2019'))
df = df.set_index('date')  # set_index returns a new frame, so assign it back

df is as follows:

date        A  B
2018-01-01  7  4
2018-01-02  5  4
2018-01-03  3  1
2018-01-04  9  3
2018-01-05  7  8
2018-01-06  0  0
2018-01-07  6  8
2018-01-08  3  7
...        .. ..
2019-08-18  1  0
2019-08-19  8  1
2019-08-20  5  9
2019-08-21  0  7
2019-08-22  3  6
2019-08-23  8  6

I want monthly accumulated values of column A and monthly averaged values of column B. The final output will be a df with 20 rows (12 months of 2018 and 8 months of 2019) and 4 columns: month number, year number, monthly accumulated values of A, and monthly averaged values of B, just like below:

    month  year  monthly_accumulated_of_A  monthly_averaged_of_B
0       1  2018                       176               1.747947
1       2  2018                       110               2.399476
2       3  2018                       131               3.976747
3       4  2018                       227               2.314923
4       5  2018                       234               0.464097
5       6  2018                       249               1.662753
6       7  2018                       121               1.588865
7       8  2018                       165               2.318268
8       9  2018                       219               1.060595
9      10  2018                       131               0.577268
10     11  2018                       179               3.948414
11     12  2018                       115               1.750346
12      1  2019                       190               3.364003
13      2  2019                       215               0.864792
14      3  2019                       231               3.219739
15      4  2019                       186               2.904413
16      5  2019                       232               0.324695
17      6  2019                       163               1.334139
18      7  2019                       238               1.670644
19      8  2019                       112               1.316442

How can I achieve this in pandas?
Use DataFrameGroupBy.agg with DatetimeIndex.month and DatetimeIndex.year; for ordering add sort_index, and finally use reset_index to turn the MultiIndex into columns:

import numpy as np
import pandas as pd

np.random.seed(2018)

# changed 300 to 600
df = pd.DataFrame(np.random.randint(0, 10, size=(600, 2)), columns=list('AB'))
df['date'] = pd.DataFrame(pd.date_range(start='1/1/2018', end='8/23/2019'))
df = df.set_index('date')

df1 = (df.groupby([df.index.month.rename('month'),
                   df.index.year.rename('year')])
         .agg({'A': 'sum', 'B': 'mean'})
         .sort_index(level=['year', 'month'])
         .reset_index())
print(df1)

    month  year    A         B
0       1  2018  147  4.838710
1       2  2018  120  3.678571
2       3  2018  114  4.387097
3       4  2018  143  3.800000
4       5  2018  124  3.870968
5       6  2018  129  4.700000
6       7  2018  143  3.935484
7       8  2018  118  5.483871
8       9  2018  150  5.500000
9      10  2018  139  4.225806
10     11  2018  136  4.933333
11     12  2018  141  4.548387
12      1  2019  137  4.709677
13      2  2019  120  4.964286
14      3  2019  167  4.935484
15      4  2019  121  4.200000
16      5  2019  133  4.129032
17      6  2019  140  5.066667
18      7  2019  189  4.677419
19      8  2019  100  3.695652
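Since the title asks about Grouper: pd.Grouper can drive the same aggregation straight from the DatetimeIndex; a sketch (month and year are extracted from the resulting month-end index, and df2 is a helper name introduced here):

df2 = df.groupby(pd.Grouper(freq='M')).agg({'A': 'sum', 'B': 'mean'})
df2.insert(0, 'month', df2.index.month)
df2.insert(1, 'year', df2.index.year)
df2 = df2.reset_index(drop=True)  # rows are already in chronological order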
Grouping data series by day intervals with Pandas
I have to perform some data analysis on a seasonal basis. I have circa one and a half years' worth of hourly measurements, from the end of 2015 to the second half of 2017, and I want to sort this data into seasons. Here's an example of the data I am working with:

Date,Year,Month,Day,Day week,Hour,Holiday,Week Day,Impulse,Power (kW),Temperature (C)
04/12/2015,2015,12,4,6,18,0,6,2968,1781,16.2
04/12/2015,2015,12,4,6,19,0,6,2437,1462,16.2
19/04/2016,2016,4,19,3,3,0,3,1348,809,14.4
19/04/2016,2016,4,19,3,4,0,3,1353,812,14.1
11/06/2016,2016,6,11,7,19,0,7,1395,837,18.8
11/06/2016,2016,6,11,7,20,0,7,1370,822,17.4
11/06/2016,2016,6,11,7,21,0,7,1364,818,17
11/06/2016,2016,6,11,7,22,0,7,1433,860,17.5
04/12/2016,2016,12,4,1,17,0,1,1425,855,14.6
04/12/2016,2016,12,4,1,18,0,1,1466,880,14.4
07/03/2017,2017,3,7,3,14,0,3,3668,2201,14.2
07/03/2017,2017,3,7,3,15,0,3,3666,2200,14
24/04/2017,2017,4,24,2,5,0,2,1347,808,11.4
24/04/2017,2017,4,24,2,6,0,2,1816,1090,11.5
24/04/2017,2017,4,24,2,7,0,2,2918,1751,12.4
15/06/2017,2017,6,15,5,13,1,1,2590,1554,22.5
15/06/2017,2017,6,15,5,14,1,1,2629,1577,22.5
15/06/2017,2017,6,15,5,15,1,1,2656,1594,22.1
15/11/2017,2017,11,15,4,13,0,4,3765,2259,15.6
15/11/2017,2017,11,15,4,14,0,4,3873,2324,15.9
15/11/2017,2017,11,15,4,15,0,4,3905,2343,15.8
15/11/2017,2017,11,15,4,16,0,4,3861,2317,15.3

As you can see, I have data from three different years. What I was thinking of doing is converting the first column with the pd.to_datetime() command and then grouping the rows according to the day/month, regardless of the year, in dd/mm intervals (if winter goes from 21/12 to 21/03, create a new dataframe with all of the rows whose date falls in this interval, regardless of the year). But I couldn't manage to neglect the year, which makes things more complicated.

EDIT: A desired output would be:

df_spring

Date,Year,Month,Day,Day week,Hour,Holiday,Week Day,Impulse,Power (kW),Temperature (C)
19/04/2016,2016,4,19,3,3,0,3,1348,809,14.4
19/04/2016,2016,4,19,3,4,0,3,1353,812,14.1
07/03/2017,2017,3,7,3,14,0,3,3668,2201,14.2
07/03/2017,2017,3,7,3,15,0,3,3666,2200,14
24/04/2017,2017,4,24,2,5,0,2,1347,808,11.4
24/04/2017,2017,4,24,2,6,0,2,1816,1090,11.5
24/04/2017,2017,4,24,2,7,0,2,2918,1751,12.4

df_autumn

Date,Year,Month,Day,Day week,Hour,Holiday,Week Day,Impulse,Power (kW),Temperature (C)
04/12/2015,2015,12,4,6,18,0,6,2968,1781,16.2
04/12/2015,2015,12,4,6,19,0,6,2437,1462,16.2
04/12/2016,2016,12,4,1,17,0,1,1425,855,14.6
04/12/2016,2016,12,4,1,18,0,1,1466,880,14.4
15/11/2017,2017,11,15,4,13,0,4,3765,2259,15.6
15/11/2017,2017,11,15,4,14,0,4,3873,2324,15.9
15/11/2017,2017,11,15,4,15,0,4,3905,2343,15.8
15/11/2017,2017,11,15,4,16,0,4,3861,2317,15.3

And so on for the remaining seasons.
Define each season by filtering the relevant rows using the Day and Month columns, as shown here for winter:

df_winter = df.loc[((df['Day'] >= 21) & (df['Month'] == 12)) |
                   (df['Month'] == 1) |
                   (df['Month'] == 2) |
                   ((df['Day'] <= 21) & (df['Month'] == 3))]
You can simply filter your dataframe with Month.isin():

# spring
df[df['Month'].isin([3,4])]

          Date  Year  Month  Day  Day week  Hour  Holiday  Week Day  Impulse  Power (kW)  Temperature (C)
2   19/04/2016  2016      4   19         3     3        0         3     1348         809             14.4
3   19/04/2016  2016      4   19         3     4        0         3     1353         812             14.1
10  07/03/2017  2017      3    7         3    14        0         3     3668        2201             14.2
11  07/03/2017  2017      3    7         3    15        0         3     3666        2200             14.0
12  24/04/2017  2017      4   24         2     5        0         2     1347         808             11.4
13  24/04/2017  2017      4   24         2     6        0         2     1816        1090             11.5
14  24/04/2017  2017      4   24         2     7        0         2     2918        1751             12.4

# autumn
df[df['Month'].isin([11,12])]

          Date  Year  Month  Day  Day week  Hour  Holiday  Week Day  Impulse  Power (kW)  Temperature (C)
0   04/12/2015  2015     12    4         6    18        0         6     2968        1781             16.2
1   04/12/2015  2015     12    4         6    19        0         6     2437        1462             16.2
8   04/12/2016  2016     12    4         1    17        0         1     1425         855             14.6
9   04/12/2016  2016     12    4         1    18        0         1     1466         880             14.4
18  15/11/2017  2017     11   15         4    13        0         4     3765        2259             15.6
19  15/11/2017  2017     11   15         4    14        0         4     3873        2324             15.9
20  15/11/2017  2017     11   15         4    15        0         4     3905        2343             15.8
21  15/11/2017  2017     11   15         4    16        0         4     3861        2317             15.3
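If you need all four seasons at once, a month-to-season mapping keeps it to one pass; a sketch that ignores the exact solstice/equinox day boundaries (season_of and season_frames are helper names introduced here):

season_of = {12: 'winter', 1: 'winter', 2: 'winter',
             3: 'spring', 4: 'spring', 5: 'spring',
             6: 'summer', 7: 'summer', 8: 'summer',
             9: 'autumn', 10: 'autumn', 11: 'autumn'}
df['season'] = df['Month'].map(season_of)
season_frames = dict(tuple(df.groupby('season')))  # e.g. season_frames['spring']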