Pandas - Determine if Churn occurs with missing years - python
I have a large pandas dataframe which contains ids, years, spend values, and a slew of other columns, as shown below:
id year spend .... n_columns
1 2015 321 ... ...
1 2016 342 ... ...
1 2017 843
1 2018 483
2 2015 234
2 2018 321
2 2019 232 ... ...
I am trying to create a new column which classifies each year based on the next year's value. Something akin to:
id year spend cat
1 2015 321 increase
1 2016 342 increase
1 2017 843 decrease
1 2018 483 churned #as there is no 2019 data
2 2015 234 churned #as there is no 2016 data
2 2018 321 decrease
2 2019 232 decrease
2 2020 200 nan #max data only goes up to 2020
I have been trying to do this with something like the below, to get the difference between years to determine the category:
def categorize(x):
    if abs(x['diff']) == x['spend']:
        return "churned"
    elif x['diff'] < 0:
        return "decrease"
    elif x['diff'] > 0:
        return "increase"
    else:
        return None

df = df.sort_values(['id', 'year'], ascending=True)
df['diff'] = df.groupby('id')['spend'].diff(-1)
df['cat'] = df.apply(categorize, axis=1)
However, this method and all similar methods seem to fail because there are years missing for some ids (such as id = 2 and year = 2015 above). Is there an easy way to ensure all ids contain all of the years, even if the values are zeroed or nulled out? Is there a better way to determine whether a year is an increase/decrease/churn than how I am doing it?
Thanks!
Here is one way to solve it:
Expand the dataframe to include the missing rows of years; I'll use the complete function from pyjanitor for this - it makes the implicitly missing (id, year) rows explicit:
# pip install pyjanitor
import janitor
import numpy as np
import pandas as pd

tempo = (df.complete(columns=["id",
                              {"year": lambda df: np.arange(df.year.min(),
                                                            df.year.max() + 1)}]
                     )
           # forward-fill spend so the newly added years carry the last seen value;
           # diff(-1) then gives current spend minus next year's spend
           .assign(temp=lambda df: df.spend.ffill(),
                   temp_diff=lambda df: df.temp.diff(-1)
                   )
         )
tempo
id year spend temp temp_diff
0 1 2015 321.0 321.0 -21.0
1 1 2016 342.0 342.0 -501.0
2 1 2017 843.0 843.0 360.0
3 1 2018 483.0 483.0 0.0
4 1 2019 NaN 483.0 249.0
5 2 2015 234.0 234.0 0.0
6 2 2016 NaN 234.0 0.0
7 2 2017 NaN 234.0 -87.0
8 2 2018 321.0 321.0 89.0
9 2 2019 232.0 232.0 NaN
The next step is to create the conditions and combine them with np.select:
# next year's spend exists and spend went up -> increase
cond1 = (tempo.spend.shift(-1).notna()) & (tempo.temp_diff.lt(0))
# next year's spend exists and spend went down (or stayed flat) -> decrease
cond2 = (tempo.spend.shift(-1).notna()) & (tempo.temp_diff.ge(0))
# next year's spend is missing and the forward-filled value did not change -> churn
cond3 = (tempo.spend.shift(-1).isna()) & (tempo.temp_diff.eq(0))
tempo["cat"] = np.select([cond1, cond2, cond3],
                         ["increase", "decrease", "churn"],
                         np.nan)
id year spend temp temp_diff cat
0 1 2015 321.0 321.0 -21.0 increase
1 1 2016 342.0 342.0 -501.0 increase
2 1 2017 843.0 843.0 360.0 decrease
3 1 2018 483.0 483.0 0.0 churn
4 1 2019 NaN 483.0 249.0 decrease
5 2 2015 234.0 234.0 0.0 churn
6 2 2016 NaN 234.0 0.0 churn
7 2 2017 NaN 234.0 -87.0 increase
8 2 2018 321.0 321.0 89.0 decrease
9 2 2019 232.0 232.0 NaN nan
Finally, filter out the null rows in the spend column:
tempo.query("spend.notna()").drop(columns = ['temp_diff', 'temp'])
id year spend cat
0 1 2015 321.0 increase
1 1 2016 342.0 increase
2 1 2017 843.0 decrease
3 1 2018 483.0 churn
5 2 2015 234.0 churn
8 2 2018 321.0 decrease
9 2 2019 232.0 nan
I used your original dataframe (which stops at 2019); let me know how it goes.
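If you would rather not add the pyjanitor dependency, the same expansion can be done in plain pandas by reindexing each id against the full range of years. This is only a minimal sketch, assuming the frame is the one from the question with columns id, year and spend; from there the same ffill/diff/np.select steps above apply unchanged:

import numpy as np
import pandas as pd

years = np.arange(df.year.min(), df.year.max() + 1)
full_index = pd.MultiIndex.from_product([df.id.unique(), years],
                                        names=["id", "year"])
# reindex against every (id, year) combination; missing years get NaN spend
tempo = (df.set_index(["id", "year"])
           .reindex(full_index)
           .reset_index())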
Related
How do I plot the frequency of an event over time with pandas?
I was trying to plot some data from a pandas dataframe. My table contains roughly 10,000 films and, for each of them, two pieces of information: the year it was published and a rating from 0 to 3. I am having a hard time plotting a graph with the pandas library that shows the number of films that received a particular rating (3 in my case) every year. I have tried to use .value_counts(), but it didn't work as I hoped, since I can't isolate a single value while keeping the rating linked to its year. I hope I explained my problem decently, since this is the first time I am asking for help on Stack Overflow. This is the code I used to get my dataframe, in case it is useful in any way:

import json
import requests
import pandas as pd
import numpy as np

request = requests.get("http://bechdeltest.com/api/v1/getAllMovies").json()
data = pd.DataFrame(request)

P.S. Thank you for the precious help!
You can filter by rating and use Series.value_counts:

s = data.loc[data['rating'].eq(3), 'year'].value_counts()

But there are many years of films:

print(len(s))
108

So for the plot I filter only counts greater than 30, which leaves 40 years:

print(s.gt(30).sum())
40

So filter again and plot:

s[s.gt(30)].plot.bar()

EDIT: Solution with percentages:

s = data.loc[data['rating'].eq(3), 'year'].value_counts(normalize=True).sort_index().mul(100)
print(s)
1899    0.018218
1910    0.018218
1916    0.054655
1917    0.054655
1918    0.054655
          ...
2018    3.169976
2019    3.188195
2020    2.040445
2021    1.840044
2022    0.765167
Name: year, Length: 108, dtype: float64

print(s[s.gt(3)])
2007    3.042449
2009    3.588996
2010    3.825833
2011    4.299508
2012    4.153762
2013    4.937147
2014    4.335945
2015    3.771179
2016    3.752960
2017    3.388595
2018    3.169976
2019    3.188195
Name: year, dtype: float64

s[s.gt(3)].plot.bar()

EDIT1: Here is a solution counting years vs ratings:

df = pd.crosstab(data['year'], data.rating)
print(df)
rating   0   1   2    3
year
1874     1   0   0    0
1877     1   0   0    0
1878     2   0   0    0
1881     1   0   0    0
1883     1   0   0    0
...     ..  ..  ..  ...
2018    19  44  24  174
2019    16  47  18  175
2020    10  17  11  112
2021    11  22  13  101
2022     3  14   5   42

[141 rows x 4 columns]

EDIT2:

df = pd.crosstab(data['year'], data.rating, normalize='index').mul(100)
print(df)
rating           0          1         2          3
year
1874    100.000000   0.000000  0.000000   0.000000
1877    100.000000   0.000000  0.000000   0.000000
1878    100.000000   0.000000  0.000000   0.000000
1881    100.000000   0.000000  0.000000   0.000000
1883    100.000000   0.000000  0.000000   0.000000
...            ...        ...       ...        ...
2018      7.279693  16.858238  9.195402  66.666667
2019      6.250000  18.359375  7.031250  68.359375
2020      6.666667  11.333333  7.333333  74.666667
2021      7.482993  14.965986  8.843537  68.707483
2022      4.687500  21.875000  7.812500  65.625000

[141 rows x 4 columns]

There are a lot of values, so here is e.g. a filter on column 3 for values above 60%:

print(df[3].gt(60).sum())
26

df[df[3].gt(60)].plot.bar()
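A groupby is another way to get the per-year counts of a single rating while keeping the year as the axis. This is only a rough sketch, assuming the same data frame built from the bechdeltest API as in the question:

import pandas as pd
import requests

request = requests.get("http://bechdeltest.com/api/v1/getAllMovies").json()
data = pd.DataFrame(request)

# count the films that got rating 3, per publication year
counts = data['rating'].eq(3).groupby(data['year']).sum().sort_index()
counts[counts.gt(30)].plot.bar()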
Pandas Python - How to create new columns with MultiIndex from pivot table
I have created a pivot table with 2 different types of values: i) Number of Apples from 2017-2020, ii) Number of People from 2017-2020. I want to create additional columns to calculate iii) Apples per person from 2017-2020. How can I do so?

Current code for the pivot table:

tdf = df.pivot_table(index="States",
                     columns="Year",
                     values=["Number of Apples", "Number of People"],
                     aggfunc=lambda x: len(x.unique()),
                     margins=True)
tdf

Here is my current pivot table:

              Number of Apples      Number of People
              2017 2018 2019 2020   2017 2018 2019 2020
California      10   18   20   25      2    3    4    5
West Virginia    8   35   25   12      2    5    5    4
...

I want my pivot table to look like this, where I add additional columns that divide Number of Apples by Number of People:

              Number of Apples      Number of People      Number of Apples per Person
              2017 2018 2019 2020   2017 2018 2019 2020   2017 2018 2019 2020
California      10   18   20   25      2    3    4    5      5    6    5    5
West Virginia    8   35   25   12      2    5    5    4      4    7    5    3

I've tried a few things. Creating a new column by assigning new column names, which does not work with a multiple column index:

tdf["Number of Apples per Person"][2017] = tdf["Number of Apples"][2017] / tdf["Number of People"][2017]

I also tried the other assignment method:

tdf.assign(tdf["Number of Apples per Person"][2017] = tdf["Enrollment ID"][2017] / tdf["Student ID"][2017])

and got this error:

SyntaxError: expression cannot contain assignment, perhaps you meant "=="?

Appreciate any help! Thanks
What you can do here is stack(), do your thing, and then unstack():

s = df.stack()
s['Number of Apples per Person'] = s['Number of Apples'] / s['Number of People']
df = s.unstack()

Output:

>>> df
              Number of Apples      Number of People      Number of Apples per Person
              2017 2018 2019 2020   2017 2018 2019 2020   2017 2018 2019 2020
California      10   18   20   25      2    3    4    5    5.0  6.0  5.0  5.0
West Virginia    8   35   25   12      2    5    5    4    4.0  7.0  5.0  3.0

One-liner:

df = df.stack().pipe(lambda x: x.assign(**{'Number of Apples per Person': x['Number of Apples'] / x['Number of People']})).unstack()
Given df:

              Number of Apples      Number of People
              2017 2018 2019 2020   2017 2018 2019 2020
California      10   18   20   25      2    3    4    5
West Virginia    8   35   25   12      2    5    5    4

You can index on the first level to get sub-frames and then divide. The division will be auto-aligned on the columns:

df['Number of Apples'] / df['Number of People']

               2017 2018 2019 2020
California      5.0  6.0  5.0  5.0
West Virginia   4.0  7.0  5.0  3.0

Append this back to your DataFrame:

pd.concat([df,
           pd.concat([df['Number of Apples'] / df['Number of People']],
                     keys=['Result'],
                     axis=1)],
          axis=1)

              Number of Apples      Number of People      Result
              2017 2018 2019 2020   2017 2018 2019 2020   2017 2018 2019 2020
California      10   18   20   25      2    3    4    5    5.0  6.0  5.0  5.0
West Virginia    8   35   25   12      2    5    5    4    4.0  7.0  5.0  3.0

This is fast since it is completely vectorized.
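For reference, here is a small self-contained sketch of the concat-with-keys idea on a toy frame with the same two-level column layout (the state names, years and numbers are made up for illustration only):

import pandas as pd

cols = pd.MultiIndex.from_product(
    [["Number of Apples", "Number of People"], [2017, 2018]])
df = pd.DataFrame([[10, 18, 2, 3],
                   [8, 35, 2, 5]],
                  index=["California", "West Virginia"],
                  columns=cols)

# the division aligns on the shared year level of the columns
ratio = df["Number of Apples"] / df["Number of People"]
out = pd.concat([df, pd.concat([ratio],
                               keys=["Number of Apples per Person"],
                               axis=1)],
                axis=1)
print(out)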
How to apply a function to multiple columns that iterates through each row
Data

I have a dataset that shows up-to-date bookings data grouped by company and month (empty values are NaNs):

company    month  year_ly  bookings_ly  year_ty  bookings_ty
company a      1     2018          432     2019          253
company a      2     2018          265     2019          635
company a      3     2018          345     2019          525
company a      4     2018          233     2019
company a      5     2018         7664     2019
...          ...      ...          ...      ...          ...
company a     12     2018          224     2019          321
company b      1     2018          543     2019          576
company b      2     2018           23     2019           43
company b      3     2018           64     2019          156
company b      4     2018          143     2019
company b      5     2018           41     2019
company b      6     2018           90     2019
...          ...      ...          ...      ...          ...

What I want

I'd like to create a column, or update the bookings_ty column where the value is NaN (whichever is easier), that applies the following calculation for each row (grouped by company):

((SUM of previous 3 rows (or months) of bookings_ty)
 / (SUM of previous 3 rows (or months) of bookings_ly)) * bookings_ly

Where a row's bookings_ty is NaN, I'd like that iteration of the formula to take the newly calculated field as part of its bookings_ty, so essentially the formula should populate the NaN values in bookings_ty.

My attempt

df_bkgs.set_index(['operator', 'month'], inplace=True)

def calc(df_bkgs):
    df_bkgs['bookings_calc'] = df_bkgs['bookings_ty'].copy()
    df_bkgs['bookings_ty_l3m'] = df_bkgs.groupby(level=0)['bookings_ty'].transform(lambda x: x.shift(1) + x.shift(2) + x.shift(3))
    df_bkgs['bookings_ly_l3m'] = df_bkgs.groupby(level=0)['bookings_ly'].transform(lambda x: x.shift(1) + x.shift(2) + x.shift(3))
    df_bkgs['bookings_factor'] = df_bkgs['bookings_ty_l3m'] / df_bkgs['bookings_ly_l3m']
    df_bkgs['bookings_calc'] = df_bkgs['bookings_factor'] * df_bkgs['bookings_ly']
    return df_bkgs

df_bkgs.groupby(level=0).apply(calc)

import numpy as np
df['bookings_calc'] = np.where(df['bookings_ty'].isna(), df['bookings_calc'], df['bookings_ty'])

The issue with this code is that it generates the calculated field only for the first empty/NaN bookings_ty. What I'd like is an iteration or loop-type process that takes the previous 3 rows in the group and, if bookings_ty is empty/NaN, uses the calculated field of that row. Thanks
You can try this. I made a function which finds the last 3 records in your dataframe by row. Note I had to create a column named index to do this, as you can't access the index (as far as I know) within an apply statement.

# dataframe is named f
   company  month  year_ly  bookings_ly  year_ty  bookings_ty
0        a      1     2018          432     2019        253.0
1        a      2     2018          265     2019        635.0
2        a      3     2018          345     2019        525.0
3        a      4     2018          233     2019          NaN
4        a      5     2018         7664     2019          NaN
5        a     12     2018          224     2019        321.0
6        b      1     2018          543     2019        576.0
7        b      2     2018           23     2019         43.0
8        b      3     2018           64     2019        156.0
9        b      4     2018          143     2019          NaN
10       b      5     2018           41     2019          NaN
11       b      6     2018           90     2019          NaN

f.reset_index(inplace=True)

def aggFunct(row, df, last=3):
    series = df.loc[(df['index'] < row['index']) &
                    (df['index'] >= row['index'] - last), 'bookings_ty'].fillna(0)
    ssum = series.sum()
    return ssum

f.loc[f['bookings_ty'].isna(), 'bookings_ty'] = f[f['bookings_ty'].isna()].apply(aggFunct, df=f, axis=1)
f.drop('index', axis=1, inplace=True)
f

   company  month  year_ly  bookings_ly  year_ty  bookings_ty
0        a      1     2018          432     2019        253.0
1        a      2     2018          265     2019        635.0
2        a      3     2018          345     2019        525.0
3        a      4     2018          233     2019       1413.0
4        a      5     2018         7664     2019       1160.0
5        a     12     2018          224     2019        321.0
6        b      1     2018          543     2019        576.0
7        b      2     2018           23     2019         43.0
8        b      3     2018           64     2019        156.0
9        b      4     2018          143     2019        775.0
10       b      5     2018           41     2019        199.0
11       b      6     2018           90     2019        156.0
Depending on how many companies you have in your table, I might be inclined to run this in Excel as opposed to doing it in pandas. Iterating through the rows might be slow, but if speed is not a concern, the following solution should work:

import numpy as np
import pandas as pd

df = pd.read_excel('data_file.xlsx')  # <-- name of your file.
companies = pd.unique(df.company)
months = pd.unique(df.month)

for c in companies:
    for m in months:
        # slice a single row
        df_row = df[(df['company'] == c) & (df['month'] == m)]
        val = df_row.bookings_ty.values[0]
        if np.isnan(val):
            # get the index of the row
            idx = df_row.index[0]
            df1 = df.copy()
            df1 = df1[(df1['company'] == c) & (df1['month'].isin([m for m in range(m - 3, m)]))]
            ratio = df1.bookings_ty.sum() / df1.bookings_ly.sum()
            projected_value = df_row.bookings_ly.values[0] * ratio
            df.loc[idx, 'bookings_ty'] = projected_value
        else:
            pass
print(df)

If we can assume that the DataFrame is always sorted by 'company' and then by 'month', we can use the following approach instead; there is a 20-fold improvement (0.003s vs. 0.07s) with my sample data of 24 rows:

df = pd.read_excel('data_file.xlsx')  # your input file
ly = df.bookings_ly.values.tolist()
ty = df.bookings_ty.values.tolist()

for val in ty:
    if np.isnan(val):
        idx = ty.index(val)  # returns the index of the first 'nan' found
        ratio = sum(ty[idx - 3:idx]) / sum(ly[idx - 3:idx])
        ty[idx] = ratio * ly[idx]

df['bookings_ty'] = ty
Here is a solution:

import numpy as np
import pandas as pd

# sort values if not already sorted
df = df.sort_values(['company', 'year_ty', 'month']).reset_index(drop=True)

def process(x):
    while x['bookings_ty'].isnull().any():
        x['bookings_ty'] = np.where(
            (x['bookings_ty'].isnull()),
            (x['bookings_ty'].shift(1) + x['bookings_ty'].shift(2) + x['bookings_ty'].shift(3)) /
            (x['bookings_ly'].shift(1) + x['bookings_ly'].shift(2) + x['bookings_ly'].shift(3)) * x['bookings_ly'],
            x['bookings_ty'])
    return x

df = df.groupby(['company']).apply(lambda x: process(x))

# convert to int64 if needed, or stay with float values
df['bookings_ty'] = df['bookings_ty'].astype(np.int64)
print(df)

Initial df:

      company  month  year_ly  bookings_ly  year_ty  bookings_ty
0   company_a      1     2018          432     2019          253
1   company_a      2     2018          265     2019          635
2   company_a      3     2018          345     2019          525
3   company_a      4     2018          233     2019          NaN
4   company_a      5     2018         7664     2019          NaN
5   company_a     12     2018          224     2019          321
6   company_b      1     2018          543     2019          576
7   company_b      2     2018           23     2019           43
8   company_b      3     2018           64     2019          156
9   company_b      4     2018          143     2019          NaN
10  company_b      5     2018           41     2019          NaN
11  company_b      6     2018           90     2019          NaN

Result:

      company  month  year_ly  bookings_ly  year_ty  bookings_ty
0   company_a      1     2018          432     2019          253
1   company_a      2     2018          265     2019          635
2   company_a      3     2018          345     2019          525
3   company_a      4     2018          233     2019          315  **
4   company_a      5     2018         7664     2019        13418  **
5   company_a     12     2018          224     2019          321
6   company_b      1     2018          543     2019          576
7   company_b      2     2018           23     2019           43
8   company_b      3     2018           64     2019          156
9   company_b      4     2018          143     2019          175  **
10  company_b      5     2018           41     2019           66  **
11  company_b      6     2018           90     2019          144  **

In case you want a different rolling month, or a NaN value could exist at the beginning of a company, you could use this generic solution:

df = df.sort_values(['company', 'year_ty', 'month']).reset_index(drop=True)

def process(x, m):
    idx = (x.loc[x['bookings_ty'].isnull()].index.to_list())
    for i in idx:
        id = i - x.index[0]
        start = 0 if id < m else id - m
        sum_ty = sum(x['bookings_ty'].to_list()[start:id])
        sum_ly = sum(x['bookings_ly'].to_list()[start:id])
        ly = x.at[i, 'bookings_ly']
        x.at[i, 'bookings_ty'] = sum_ty / sum_ly * ly
    return x

rolling_month = 3
df = df.groupby(['company']).apply(lambda x: process(x, rolling_month))
df['bookings_ty'] = df['bookings_ty'].astype(np.int64)
print(df)

Initial df:

      company  month  year_ly  bookings_ly  year_ty  bookings_ty
0   company_a      1     2018          432     2019        253.0
1   company_a      2     2018          265     2019        635.0
2   company_a      3     2018          345     2019          NaN
3   company_a      4     2018          233     2019          NaN
4   company_a      5     2018         7664     2019          NaN
5   company_a     12     2018          224     2019        321.0
6   company_b      1     2018          543     2019        576.0
7   company_b      2     2018           23     2019         43.0
8   company_b      3     2018           64     2019        156.0
9   company_b      4     2018          143     2019          NaN
10  company_b      5     2018           41     2019          NaN
11  company_b      6     2018           90     2019          NaN

Final result:

      company  month  year_ly  bookings_ly  year_ty  bookings_ty
0   company_a      1     2018          432     2019          253
1   company_a      2     2018          265     2019          635
2   company_a      3     2018          345     2019          439  ** works with only the 2 previous rows
3   company_a      4     2018          233     2019          296  **
4   company_a      5     2018         7664     2019        12467  **
5   company_a     12     2018          224     2019          321
6   company_b      1     2018          543     2019          576
7   company_b      2     2018           23     2019           43
8   company_b      3     2018           64     2019          156
9   company_b      4     2018          143     2019          175  **
10  company_b      5     2018           41     2019           66  **
11  company_b      6     2018           90     2019          144  **

If you want to speed up the process you could try:

df.set_index(['company'], inplace=True)
df = df.groupby(level=(0)).apply(lambda x: process(x))

instead of:

df = df.groupby(['company']).apply(lambda x: process(x))
Grouping data series by day intervals with Pandas
I have to perform some data analysis on a seasonal basis. I have circa one and a half years' worth of hourly measurements, from the end of 2015 to the second half of 2017. What I want to do is sort this data into seasons. Here's an example of the data I am working with:

Date,Year,Month,Day,Day week,Hour,Holiday,Week Day,Impulse,Power (kW),Temperature (C)
04/12/2015,2015,12,4,6,18,0,6,2968,1781,16.2
04/12/2015,2015,12,4,6,19,0,6,2437,1462,16.2
19/04/2016,2016,4,19,3,3,0,3,1348,809,14.4
19/04/2016,2016,4,19,3,4,0,3,1353,812,14.1
11/06/2016,2016,6,11,7,19,0,7,1395,837,18.8
11/06/2016,2016,6,11,7,20,0,7,1370,822,17.4
11/06/2016,2016,6,11,7,21,0,7,1364,818,17
11/06/2016,2016,6,11,7,22,0,7,1433,860,17.5
04/12/2016,2016,12,4,1,17,0,1,1425,855,14.6
04/12/2016,2016,12,4,1,18,0,1,1466,880,14.4
07/03/2017,2017,3,7,3,14,0,3,3668,2201,14.2
07/03/2017,2017,3,7,3,15,0,3,3666,2200,14
24/04/2017,2017,4,24,2,5,0,2,1347,808,11.4
24/04/2017,2017,4,24,2,6,0,2,1816,1090,11.5
24/04/2017,2017,4,24,2,7,0,2,2918,1751,12.4
15/06/2017,2017,6,15,5,13,1,1,2590,1554,22.5
15/06/2017,2017,6,15,5,14,1,1,2629,1577,22.5
15/06/2017,2017,6,15,5,15,1,1,2656,1594,22.1
15/11/2017,2017,11,15,4,13,0,4,3765,2259,15.6
15/11/2017,2017,11,15,4,14,0,4,3873,2324,15.9
15/11/2017,2017,11,15,4,15,0,4,3905,2343,15.8
15/11/2017,2017,11,15,4,16,0,4,3861,2317,15.3

As you can see, I have data on three different years. What I was thinking of doing is converting the first column with the pd.to_datetime() command, then grouping the rows according to the day/month, regardless of the year, in dd/mm intervals (for example, if winter goes from 21/12 to 21/03, create a new dataframe with all of the rows whose date falls in this interval, regardless of the year). However, I couldn't do it while neglecting the year, which makes things more complicated.

EDIT: A desired output would be:

df_spring

Date,Year,Month,Day,Day week,Hour,Holiday,Week Day,Impulse,Power (kW),Temperature (C)
19/04/2016,2016,4,19,3,3,0,3,1348,809,14.4
19/04/2016,2016,4,19,3,4,0,3,1353,812,14.1
07/03/2017,2017,3,7,3,14,0,3,3668,2201,14.2
07/03/2017,2017,3,7,3,15,0,3,3666,2200,14
24/04/2017,2017,4,24,2,5,0,2,1347,808,11.4
24/04/2017,2017,4,24,2,6,0,2,1816,1090,11.5
24/04/2017,2017,4,24,2,7,0,2,2918,1751,12.4

df_autumn

Date,Year,Month,Day,Day week,Hour,Holiday,Week Day,Impulse,Power (kW),Temperature (C)
04/12/2015,2015,12,4,6,18,0,6,2968,1781,16.2
04/12/2015,2015,12,4,6,19,0,6,2437,1462,16.2
04/12/2016,2016,12,4,1,17,0,1,1425,855,14.6
04/12/2016,2016,12,4,1,18,0,1,1466,880,14.4
15/11/2017,2017,11,15,4,13,0,4,3765,2259,15.6
15/11/2017,2017,11,15,4,14,0,4,3873,2324,15.9
15/11/2017,2017,11,15,4,15,0,4,3905,2343,15.8
15/11/2017,2017,11,15,4,16,0,4,3861,2317,15.3

And so on for the remaining seasons.
Define each season by filtering the relevant rows using Day and Month columns, as presented for winter:

df_winter = df.loc[((df['Day'] >= 21) & (df['Month'] == 12)) |
                   (df['Month'] == 1) |
                   (df['Month'] == 2) |
                   ((df['Day'] <= 21) & (df['Month'] == 3))]
You can simply filter your dataframe by month.isin():

# spring
df[df['Month'].isin([3, 4])]

          Date  Year  Month  Day  Day week  Hour  Holiday  Week Day  Impulse  Power (kW)  Temperature (C)
2   19/04/2016  2016      4   19         3     3        0         3     1348         809             14.4
3   19/04/2016  2016      4   19         3     4        0         3     1353         812             14.1
10  07/03/2017  2017      3    7         3    14        0         3     3668        2201             14.2
11  07/03/2017  2017      3    7         3    15        0         3     3666        2200             14.0
12  24/04/2017  2017      4   24         2     5        0         2     1347         808             11.4
13  24/04/2017  2017      4   24         2     6        0         2     1816        1090             11.5
14  24/04/2017  2017      4   24         2     7        0         2     2918        1751             12.4

# autumn
df[df['Month'].isin([11, 12])]

          Date  Year  Month  Day  Day week  Hour  Holiday  Week Day  Impulse  Power (kW)  Temperature (C)
0   04/12/2015  2015     12    4         6    18        0         6     2968        1781             16.2
1   04/12/2015  2015     12    4         6    19        0         6     2437        1462             16.2
8   04/12/2016  2016     12    4         1    17        0         1     1425         855             14.6
9   04/12/2016  2016     12    4         1    18        0         1     1466         880             14.4
18  15/11/2017  2017     11   15         4    13        0         4     3765        2259             15.6
19  15/11/2017  2017     11   15         4    14        0         4     3873        2324             15.9
20  15/11/2017  2017     11   15         4    15        0         4     3905        2343             15.8
21  15/11/2017  2017     11   15         4    16        0         4     3861        2317             15.3
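If you need the exact dd/mm boundaries from the question rather than whole months, one option is to label each row with a season computed from its Month and Day values and then filter or group on that label. This is only a minimal sketch, assuming the column names from the question; the 21st-of-the-month cut-offs are the ones the asker mentioned and can be adjusted:

def season(month, day):
    # boundaries follow the 21st-of-the-month convention from the question;
    # adjust the cut-off days if your definition differs
    if (month == 12 and day >= 21) or month in (1, 2) or (month == 3 and day < 21):
        return "winter"
    if (month == 3 and day >= 21) or month in (4, 5) or (month == 6 and day < 21):
        return "spring"
    if (month == 6 and day >= 21) or month in (7, 8) or (month == 9 and day < 21):
        return "summer"
    return "autumn"

df["Season"] = [season(m, d) for m, d in zip(df["Month"], df["Day"])]
df_spring = df[df["Season"].eq("spring")]
df_autumn = df[df["Season"].eq("autumn")]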
Calculate difference from previous year/forecast in pandas dataframe
I wish to compare the output of multiple model runs, calculating these values:

Difference between current period revenue and previous period
Difference between actual current period revenue and forecasted current period revenue

I have experimented with multi-indexes, and suspect the answer lies in that direction with some creative shift(). However, I'm afraid I've mangled the problem through a haphazard application of various pivot/melt/groupby experiments. Perhaps you can help me figure out how to turn this:

import pandas as pd

ids = [1, 2, 3] * 5
year = ['2013', '2013', '2013', '2014', '2014', '2014', '2014', '2014', '2014',
        '2015', '2015', '2015', '2015', '2015', '2015']
run = ['actual', 'actual', 'actual', 'forecast', 'forecast', 'forecast',
       'actual', 'actual', 'actual', 'forecast', 'forecast', 'forecast',
       'actual', 'actual', 'actual']
revenue = [10, 20, 20, 30, 50, 90, 10, 40, 50, 120, 210, 150, 130, 100, 190]
change_from_previous_year = ['NA', 'NA', 'NA', 20, 30, 70, 0, 20, 30, 90, 160, 60, 120, 60, 140]
change_from_forecast = ['NA', 'NA', 'NA', 'NA', 'NA', 'NA', -20, -10, -40, 'NA', 'NA', 'NA', 30, -110, 40]

d = {'ids': ids, 'year': year, 'run': run, 'revenue': revenue}
df = pd.DataFrame(data=d, columns=['ids', 'year', 'run', 'revenue'])
print df

    ids  year       run  revenue
0     1  2013    actual       10
1     2  2013    actual       20
2     3  2013    actual       20
3     1  2014  forecast       30
4     2  2014  forecast       50
5     3  2014  forecast       90
6     1  2014    actual       10
7     2  2014    actual       40
8     3  2014    actual       50
9     1  2015  forecast      120
10    2  2015  forecast      210
11    3  2015  forecast      150
12    1  2015    actual      130
13    2  2015    actual      100
14    3  2015    actual      190

...into this:

    ids  year       run  revenue  chg_from_prev_year  chg_from_forecast
0     1  2013    actual       10                  NA                 NA
1     2  2013    actual       20                  NA                 NA
2     3  2013    actual       20                  NA                 NA
3     1  2014  forecast       30                  20                 NA
4     2  2014  forecast       50                  30                 NA
5     3  2014  forecast       90                  70                 NA
6     1  2014    actual       10                   0                -20
7     2  2014    actual       40                  20                -10
8     3  2014    actual       50                  30                -40
9     1  2015  forecast      120                  90                 NA
10    2  2015  forecast      210                 160                 NA
11    3  2015  forecast      150                  60                 NA
12    1  2015    actual      130                 120                 30
13    2  2015    actual      100                  60               -110
14    3  2015    actual      190                 140                 40

EDIT -- I get pretty close with this:

df['prev_year'] = df.groupby(['ids', 'run']).shift(1)['revenue']
df['chg_from_prev_year'] = df['revenue'] - df['prev_year']
df['curr_forecast'] = df.groupby(['ids', 'year']).shift(1)['revenue']
df['chg_from_forecast'] = df['revenue'] - df['curr_forecast']

The only thing missed (as expected) is the comparison between the 2014 forecast and the 2013 actual. I could just duplicate the 2013 run in the dataset, calculate chg_from_prev_year for the 2014 forecast, and hide/delete the unwanted data from the final dataframe.
Firstly, to get the change from the previous year, do a shift on each of the groups:

In [11]: g = df.groupby(['ids', 'run'])

In [12]: df['chg_from_prev_year'] = g['revenue'].apply(lambda x: x - x.shift())

The next part is more complicated; I think you need to do a pivot_table for it:

In [13]: df1 = df.pivot_table('revenue', ['ids', 'year'], 'run')

In [14]: df1
Out[14]:
run       actual  forecast
ids year
1   2013      10       NaN
    2014      10        30
    2015     130       120
2   2013      20       NaN
    2014      40        50
    2015     100       210
3   2013      20       NaN
    2014      50        90
    2015     190       150

In [15]: g1 = df1.groupby(level='ids', as_index=False)

In [16]: out_by = g1.apply(lambda x: x['actual'] - x['forecast'])

In [17]: out_by  # hello levels bug, fixed in 0.13/master... yesterday :)
Out[17]:
ids  ids  year
1    1    2013     NaN
          2014     -20
          2015      10
2    2    2013     NaN
          2014     -10
          2015    -110
3    3    2013     NaN
          2014     -40
          2015      40
dtype: float64

Which is the result you want, but not in the correct format (see [31] below if you're not too fussed)... the following seems like a bit of a hack (to put it mildly), but here goes:

In [21]: df2 = df.set_index(['ids', 'year', 'run'])

In [22]: out_by.index = out_by.index.droplevel(0)

In [23]: out_by_df = pd.DataFrame(out_by, columns=['revenue'])

In [24]: out_by_df['run'] = 'forecast'

In [25]: df2['chg_from_forecast'] = out_by_df.set_index('run', append=True)['revenue']

and we're done...

In [26]: df2.reset_index()
Out[26]:
    ids  year       run  revenue  chg_from_prev_year  chg_from_forecast
0     1  2013    actual       10                 NaN                NaN
1     2  2013    actual       20                 NaN                NaN
2     3  2013    actual       20                 NaN                NaN
3     1  2014  forecast       30                 NaN                -20
4     2  2014  forecast       50                 NaN                -10
5     3  2014  forecast       90                 NaN                -40
6     1  2014    actual       10                   0                NaN
7     2  2014    actual       40                  20                NaN
8     3  2014    actual       50                  30                NaN
9     1  2015  forecast      120                  90                 10
10    2  2015  forecast      210                 160               -110
11    3  2015  forecast      150                  60                 40
12    1  2015    actual      130                 120                NaN
13    2  2015    actual      100                  60                NaN
14    3  2015    actual      190                 140                NaN

Note: I think the first 6 results of chg_from_prev_year should be NaN.

However, I think you may be better off keeping it as a pivot:

In [31]: df3 = df.pivot_table(['revenue', 'chg_from_prev_year'], ['ids', 'year'], 'run')

In [32]: df3['chg_from_forecast'] = g1.apply(lambda x: x['actual'] - x['forecast']).values

In [33]: df3
Out[33]:
          revenue            chg_from_prev_year            chg_from_forecast
run        actual  forecast              actual  forecast
ids year
1   2013       10       NaN                 NaN       NaN               NaN
    2014       10        30                   0       NaN               -20
    2015      130       120                 120        90                10
2   2013       20       NaN                 NaN       NaN               NaN
    2014       40        50                  20       NaN               -10
    2015      100       210                  60       160              -110
3   2013       20       NaN                 NaN       NaN               NaN
    2014       50        90                  30       NaN               -40
    2015      190       150                 140        60                40
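For comparison, here is a rough self-contained sketch of another way to get the chg_from_forecast column: merge the forecast rows back onto the frame on ids/year. This is only an alternative sketch, not the answer's method; it uses the data built in the question and reproduces the values in the answer's Out[26] column (the question's expected 30 for ids 1, 2015 appears to be a typo, since 130 - 120 = 10):

import pandas as pd

ids = [1, 2, 3] * 5
year = ['2013'] * 3 + ['2014'] * 6 + ['2015'] * 6
run = (['actual'] * 3 + ['forecast'] * 3 + ['actual'] * 3 +
       ['forecast'] * 3 + ['actual'] * 3)
revenue = [10, 20, 20, 30, 50, 90, 10, 40, 50, 120, 210, 150, 130, 100, 190]
df = pd.DataFrame({'ids': ids, 'year': year, 'run': run, 'revenue': revenue})

# change vs. the forecast for the same ids/year, via a self-merge
forecast = df.loc[df['run'].eq('forecast'), ['ids', 'year', 'revenue']]
merged = df.merge(forecast, on=['ids', 'year'], how='left', suffixes=('', '_forecast'))
merged['chg_from_forecast'] = merged['revenue'] - merged['revenue_forecast']
# leave the forecast rows themselves blank, as in the desired output
merged.loc[merged['run'].eq('forecast'), 'chg_from_forecast'] = float('nan')
print(merged.drop(columns='revenue_forecast'))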