Pandas - Determine if Churn occurs with missing years - python

I have a large pandas dataframe which contains ids, years, spend values, and a slew of other columns, as shown below:
id year spend .... n_columns
1 2015 321 ... ...
1 2016 342 ... ...
1 2017 843
1 2018 483
2 2015 234
2 2018 321
2 2019 232 ... ...
I am trying to create a new column which classifies each year based on the next year's value. Something akin to:
id year spend cat
1 2015 321 increase
1 2016 342 increase
1 2017 843 decrease
1 2018 483 churned #as there is no 2019 data
2 2015 234 churned #as there is no 2016 data
2 2018 321 decrease
2 2019 232 decrease
2 2020 200 nan #max data only goes up to 2020
I have been trying to do this with something like the below, to get the difference between years to determine the category:
def categorize(x):
    if abs(x['diff']) == x['spend']:
        return "churned"
    elif x['diff'] < 0:
        return "decrease"
    elif x['diff'] > 0:
        return "increase"
    else:
        return None

df = df.sort_values(['id', 'year'], ascending=True)
df['diff'] = df.groupby('id')['spend'].diff(-1)
df['cat'] = df.apply(categorize, axis=1)
However, this method and all similar approaches fail because some ids are missing years (for example id = 2 above, where 2016 and 2017 are missing, so the 2015 row ends up compared with 2018). Is there an easy way to ensure every id contains all of the years, even if the values are zeroed or nulled out? Is there a better way to determine whether a year is an increase/decrease/churn than how I am doing it?
Thanks!

Here is one way to solve it:
Expand the dataframe to include the missing years; I'll use the complete function from pyjanitor for this - it makes the implicitly missing rows explicit:
# pip install pyjanitor
import janitor
import numpy as np

tempo = (df.complete(columns=["id",
                              {"year": lambda df: np.arange(df.year.min(),
                                                            df.year.max() + 1)}])
           .assign(temp=lambda df: df.spend.ffill(),
                   temp_diff=lambda df: df.temp.diff(-1))
        )
tempo
id year spend temp temp_diff
0 1 2015 321.0 321.0 -21.0
1 1 2016 342.0 342.0 -501.0
2 1 2017 843.0 843.0 360.0
3 1 2018 483.0 483.0 0.0
4 1 2019 NaN 483.0 249.0
5 2 2015 234.0 234.0 0.0
6 2 2016 NaN 234.0 0.0
7 2 2017 NaN 234.0 -87.0
8 2 2018 321.0 321.0 89.0
9 2 2019 232.0 232.0 NaN
The next step is to create the conditions and combine them with np.select:
cond2 = (tempo.spend.shift(-1).notna()) & (tempo.temp_diff.ge(0))
cond1 = (tempo.spend.shift(-1).notna()) & (tempo.temp_diff.lt(0))
cond3 = (tempo.spend.shift(-1).isna()) & (tempo.temp_diff.eq(0))
tempo["cat"] = np.select([cond1, cond2, cond3],
                         ["increase", "decrease", "churn"],
                         np.nan)
id year spend temp temp_diff cat
0 1 2015 321.0 321.0 -21.0 increase
1 1 2016 342.0 342.0 -501.0 increase
2 1 2017 843.0 843.0 360.0 decrease
3 1 2018 483.0 483.0 0.0 churn
4 1 2019 NaN 483.0 249.0 decrease
5 2 2015 234.0 234.0 0.0 churn
6 2 2016 NaN 234.0 0.0 churn
7 2 2017 NaN 234.0 -87.0 increase
8 2 2018 321.0 321.0 89.0 decrease
9 2 2019 232.0 232.0 NaN nan
Filter out the null rows in the spend column:
tempo.query("spend.notna()").drop(columns = ['temp_diff', 'temp'])
id year spend cat
0 1 2015 321.0 increase
1 1 2016 342.0 increase
2 1 2017 843.0 decrease
3 1 2018 483.0 churn
5 2 2015 234.0 churn
8 2 2018 321.0 decrease
9 2 2019 232.0 nan
I used your original dataframe (which stopped at 2019); let me know how it goes.
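If you prefer to stay in plain pandas (no pyjanitor dependency), the expansion step can also be done with a reindex; a sketch, assuming df has the id/year/spend columns shown in the question:
import numpy as np
import pandas as pd

# Build the full (id, year) grid and reindex onto it; the missing years show up
# as NaN rows, after which the same ffill/diff/np.select logic from above applies.
years = np.arange(df['year'].min(), df['year'].max() + 1)
full_index = pd.MultiIndex.from_product([df['id'].unique(), years],
                                        names=['id', 'year'])
tempo = (df.set_index(['id', 'year'])
           .reindex(full_index)
           .reset_index())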

Related

How do I plot the frequency of an event over time with pandas?

I was trying to plot some data from a pandas dataframe.
My table contains roughly 10,000 films and, for each of them, two pieces of information: the year it was published and a rating from 0 to 3.
I am having a hard time trying to plot a graph with the pandas library that shows the number of films that received a particular rating (3 in my case) each year.
I have tried to use .value_counts(), but it didn't work as I hoped, since I can't isolate a single value while keeping the rating linked to its year.
I hope I explained my problem decently, since it is the first time I am asking for help on Stack Overflow.
This is the code I used to get my dataframe, in case it is useful in any way.
import json
import requests
import pandas as pd
import numpy as np
request = requests.get("http://bechdeltest.com/api/v1/getAllMovies").json()
data = pd.DataFrame(request)
P.S. Thank you for the precious help!
You can filter by rating and use Series.value_counts:
s = data.loc[data['rating'].eq(3), 'year'].value_counts()
But there are many years of films:
print (len(s))
108
So for the plot I filter only the counts greater than 30, which leaves 40 years here:
print (s.gt(30).sum())
40
So filter again and plot:
s[s.gt(30)].plot.bar()
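One optional tweak (not part of the original answer): value_counts returns the Series ordered by count, so if you want the bars in chronological order, sort by the index (the year) before plotting:
s[s.gt(30)].sort_index().plot.bar()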
EDIT: Solution with percentages:
s = data.loc[data['rating'].eq(3), 'year'].value_counts(normalize=True).sort_index().mul(100)
print (s)
1899 0.018218
1910 0.018218
1916 0.054655
1917 0.054655
1918 0.054655
2018 3.169976
2019 3.188195
2020 2.040445
2021 1.840044
2022 0.765167
Name: year, Length: 108, dtype: float64
print (s[s.gt(3)])
2007 3.042449
2009 3.588996
2010 3.825833
2011 4.299508
2012 4.153762
2013 4.937147
2014 4.335945
2015 3.771179
2016 3.752960
2017 3.388595
2018 3.169976
2019 3.188195
Name: year, dtype: float64
s[s.gt(3)].plot.bar()
EDIT1: Here is a solution for counting years vs. ratings:
df = pd.crosstab(data['year'], data.rating)
print (df)
rating 0 1 2 3
year
1874 1 0 0 0
1877 1 0 0 0
1878 2 0 0 0
1881 1 0 0 0
1883 1 0 0 0
.. .. .. ...
2018 19 44 24 174
2019 16 47 18 175
2020 10 17 11 112
2021 11 22 13 101
2022 3 14 5 42
[141 rows x 4 columns]
EDIT2: The same crosstab, normalized per year to percentages:
df = pd.crosstab(data['year'], data.rating, normalize='index').mul(100)
print (df)
rating 0 1 2 3
year
1874 100.000000 0.000000 0.000000 0.000000
1877 100.000000 0.000000 0.000000 0.000000
1878 100.000000 0.000000 0.000000 0.000000
1881 100.000000 0.000000 0.000000 0.000000
1883 100.000000 0.000000 0.000000 0.000000
... ... ... ...
2018 7.279693 16.858238 9.195402 66.666667
2019 6.250000 18.359375 7.031250 68.359375
2020 6.666667 11.333333 7.333333 74.666667
2021 7.482993 14.965986 8.843537 68.707483
2022 4.687500 21.875000 7.812500 65.625000
[141 rows x 4 columns]
There are a lot of values, so here is, for example, a filter of column 3 for values greater than 60%:
print (df[3].gt(60).sum())
26
df[df[3].gt(60)].plot.bar()

Pandas Python - How to create new columns with MultiIndex from pivot table

I have created a pivot table with 2 different types of values i) Number of apples from 2017-2020, ii) Number of people from 2017-2020. I want to create additional columns to calculate iii) Apples per person from 2017-2020. How can I do so?
Current code for pivot table:
tdf = df.pivot_table(index="States",
                     columns="Year",
                     values=["Number of Apples", "Number of People"],
                     aggfunc=lambda x: len(x.unique()),
                     margins=True)
tdf
Here is my current pivot table:
Number of Apples Number of People
2017 2018 2019 2020 2017 2018 2019 2020
California 10 18 20 25 2 3 4 5
West Virginia 8 35 25 12 2 5 5 4
...
I want my pivot table to look like this, where I add additional columns to divide Number of Apples by Number of People.
Number of Apples Number of People Number of Apples per Person
2017 2018 2019 2020 2017 2018 2019 2020 2017 2018 2019 2020
California 10 18 20 25 2 3 4 5 5 6 5 5
West Virginia 8 35 25 12 2 5 5 4 4 7 5 3
I've tried a few things, such as:
Creating a new column by assigning to a new column name, but this does not work with a MultiIndex on the columns: tdf["Number of Apples per Person"][2017] = tdf["Number of Apples"][2017] / tdf["Number of People"][2017]
I tried the other assignment method tdf.assign(tdf["Number of Apples per Person"][2017] = tdf["Enrollment ID"][2017] / tdf["Student ID"][2017]) and got this error: SyntaxError: expression cannot contain assignment, perhaps you meant "=="?
Appreciate any help! Thanks
What you can do here is stack(), do your thing, and then unstack():
s = df.stack()
s['Number of Apples per Person'] = s['Number of Apples'] / s['Number of People']
df = s.unstack()
Output:
>>> df
Number of Apples Number of People Number of Apples per Person
2017 2018 2019 2020 2017 2018 2019 2020 2017 2018 2019 2020
California 10 18 20 25 2 3 4 5 5.0 6.0 5.0 5.0
West Virginia 8 35 25 12 2 5 5 4 4.0 7.0 5.0 3.0
One-liner:
df = df.stack().pipe(lambda x: x.assign(**{'Number of Apples per Person': x['Number of Apples'] / x['Number of People']})).unstack()
Given
df
Number of Apples Number of People
2017 2018 2019 2020 2017 2018 2019 2020
California 10 18 20 25 2 3 4 5
West Virginia 8 35 25 12 2 5 5 4
You can index on the first level to get sub-frames and then divide. The division will be auto-aligned on the columns.
df['Number of Apples'] / df['Number of People']
2017 2018 2019 2020
California 5.0 6.0 5.0 5.0
West Virginia 4.0 7.0 5.0 3.0
Append this back to your DataFrame:
pd.concat([df, pd.concat([df['Number of Apples'] / df['Number of People']], keys=['Result'], axis=1)], axis=1)
Number of Apples Number of People Result
2017 2018 2019 2020 2017 2018 2019 2020 2017 2018 2019 2020
California 10 18 20 25 2 3 4 5 5.0 6.0 5.0 5.0
West Virginia 8 35 25 12 2 5 5 4 4.0 7.0 5.0 3.0
This is fast since it is completely vectorized.
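A small variation on the concat step above (my addition): pass the column name from the question as the key instead of 'Result', so the new block is labelled directly:
ratio = df['Number of Apples'] / df['Number of People']
pd.concat([df, pd.concat([ratio], keys=['Number of Apples per Person'], axis=1)], axis=1)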

How to apply a function to multiple columns that iterates through each row

Data
I have a dataset that shows up-to-date bookings data grouped by company and month (empty values are NaNs)
company month year_ly bookings_ly year_ty bookings_ty
company a 1 2018 432 2019 253
company a 2 2018 265 2019 635
company a 3 2018 345 2019 525
company a 4 2018 233 2019
company a 5 2018 7664 2019
... ... ... ... ... ...
company a 12 2018 224 2019 321
company b 1 2018 543 2019 576
company b 2 2018 23 2019 43
company b 3 2018 64 2019 156
company b 4 2018 143 2019
company b 5 2018 41 2019
company b 6 2018 90 2019
... ... ... ... ... ...
What I want
I'd like to create a new column, or update the bookings_ty column where the value is NaN (whichever is easier), that applies the following calculation for each row (grouped by company):
((SUM of previous 3 rows (or months) of bookings_ty)
/(SUM of previous 3 rows (or months) of bookings_ly))
* bookings_ly
Where a row's bookings_ty is NaN, I'd like that iteration of the formula to use the newly calculated value as part of its bookings_ty, so essentially the formula should progressively populate the NaN values in bookings_ty.
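For example, with the numbers above, the first missing value for company a (month 4) would be filled as (253 + 635 + 525) / (432 + 265 + 345) * 233 = 1413 / 1042 * 233 ≈ 315.97, and that result would then feed into the calculation for month 5.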
My attempt
import numpy as np

df_bkgs.set_index(['operator', 'month'], inplace=True)

def calc(df_bkgs):
    df_bkgs['bookings_calc'] = df_bkgs['bookings_ty'].copy()
    df_bkgs['bookings_ty_l3m'] = df_bkgs.groupby(level=0)['bookings_ty'].transform(lambda x: x.shift(1) + x.shift(2) + x.shift(3))
    df_bkgs['bookings_ly_l3m'] = df_bkgs.groupby(level=0)['bookings_ly'].transform(lambda x: x.shift(1) + x.shift(2) + x.shift(3))
    df_bkgs['bookings_factor'] = df_bkgs['bookings_ty_l3m'] / df_bkgs['bookings_ly_l3m']
    df_bkgs['bookings_calc'] = df_bkgs['bookings_factor'] * df_bkgs['bookings_ly']
    return df_bkgs

df_bkgs.groupby(level=0).apply(calc)

df['bookings_calc'] = np.where(df['bookings_ty'].isna(), df['bookings_calc'], df['bookings_ty'])
The issue with this code is that it generates the calculated field only for the first empty/NaN bookings_ty. What I'd like is an iterative or loop-type process that takes the previous 3 rows in the group and, where a previous row's bookings_ty is empty/NaN, uses that row's calculated field instead.
Thanks
You can try this. I made a function which finds the last 3 records in your dataframe for each row. Note that I had to create a column named index to do this, as you can't access the index (as far as I know) within an apply statement.
# dataframe is named f
company month year_ly bookings_ly year_ty bookings_ty
0 a 1 2018 432 2019 253.0
1 a 2 2018 265 2019 635.0
2 a 3 2018 345 2019 525.0
3 a 4 2018 233 2019 NaN
4 a 5 2018 7664 2019 NaN
5 a 12 2018 224 2019 321.0
6 b 1 2018 543 2019 576.0
7 b 2 2018 23 2019 43.0
8 b 3 2018 64 2019 156.0
9 b 4 2018 143 2019 NaN
10 b 5 2018 41 2019 NaN
11 b 6 2018 90 2019 NaN
f.reset_index(inplace=True)

def aggFunct(row, df, last=3):
    series = df.loc[(df['index'] < row['index']) &
                    (df['index'] >= row['index'] - last), 'bookings_ty'].fillna(0)
    ssum = series.sum()
    return ssum

f.loc[f['bookings_ty'].isna(), 'bookings_ty'] = f[f['bookings_ty'].isna()].apply(aggFunct, df=f, axis=1)
f.drop('index', axis=1, inplace=True)
f
f
company month year_ly bookings_ly year_ty bookings_ty
0 a 1 2018 432 2019 253.0
1 a 2 2018 265 2019 635.0
2 a 3 2018 345 2019 525.0
3 a 4 2018 233 2019 1413.0
4 a 5 2018 7664 2019 1160.0
5 a 12 2018 224 2019 321.0
6 b 1 2018 543 2019 576.0
7 b 2 2018 23 2019 43.0
8 b 3 2018 64 2019 156.0
9 b 4 2018 143 2019 775.0
10 b 5 2018 41 2019 199.0
11 b 6 2018 90 2019 156.0
Depending on how many companies you have in your table, I might be inclined to do this in Excel rather than in pandas. Iterating through the rows might be slow, but if speed is not a concern, the following solution should work:
import numpy as np
import pandas as pd

df = pd.read_excel('data_file.xlsx')  # <-- name of your file.
companies = pd.unique(df.company)
months = pd.unique(df.month)
for c in companies:
    for m in months:
        # slice a single row
        df_row = df[(df['company'] == c) & (df['month'] == m)]
        if df_row.empty:  # this company has no row for this month
            continue
        val = df_row.bookings_ty.values[0]
        if np.isnan(val):
            # get the index of the row
            idx = df_row.index[0]
            # previous three months for this company
            df1 = df[(df['company'] == c) & (df['month'].isin(range(m - 3, m)))]
            ratio = df1.bookings_ty.sum() / df1.bookings_ly.sum()
            projected_value = df_row.bookings_ly.values[0] * ratio
            df.loc[idx, 'bookings_ty'] = projected_value
print(df)
If we can assume that the DataFrame is always sorted by 'company' and then by 'month', we can use the following approach; with my sample data of 24 rows there is a roughly 20-fold improvement (0.003 s vs. 0.07 s).
df = pd.read_excel('data_file.xlsx')  # your input file
ly = df.bookings_ly.values.tolist()
ty = df.bookings_ty.values.tolist()
for val in ty:
    if np.isnan(val):
        idx = ty.index(val)  # index of this NaN in the list
        ratio = sum(ty[idx - 3:idx]) / sum(ly[idx - 3:idx])
        ty[idx] = ratio * ly[idx]
df['bookings_ty'] = ty
Here is a solution:
import numpy as np
import pandas as pd

# sort values if not already sorted
df = df.sort_values(['company', 'year_ty', 'month']).reset_index(drop=True)

def process(x):
    while x['bookings_ty'].isnull().any():
        x['bookings_ty'] = np.where((x['bookings_ty'].isnull()),
                                    (x['bookings_ty'].shift(1) +
                                     x['bookings_ty'].shift(2) +
                                     x['bookings_ty'].shift(3)) /
                                    (x['bookings_ly'].shift(1) +
                                     x['bookings_ly'].shift(2) +
                                     x['bookings_ly'].shift(3)) *
                                    x['bookings_ly'],
                                    x['bookings_ty'])
    return x

df = df.groupby(['company']).apply(lambda x: process(x))
# convert to int64 if needed, or keep the float values
df['bookings_ty'] = df['bookings_ty'].astype(np.int64)
print(df)
initial DF:
company month year_ly bookings_ly year_ty bookings_ty
0 company_a 1 2018 432 2019 253
1 company_a 2 2018 265 2019 635
2 company_a 3 2018 345 2019 525
3 company_a 4 2018 233 2019 NaN
4 company_a 5 2018 7664 2019 NaN
5 company_a 12 2018 224 2019 321
6 company_b 1 2018 543 2019 576
7 company_b 2 2018 23 2019 43
8 company_b 3 2018 64 2019 156
9 company_b 4 2018 143 2019 NaN
10 company_b 5 2018 41 2019 NaN
11 company_b 6 2018 90 2019 NaN
result:
company month year_ly bookings_ly year_ty bookings_ty
0 company_a 1 2018 432 2019 253
1 company_a 2 2018 265 2019 635
2 company_a 3 2018 345 2019 525
3 company_a 4 2018 233 2019 315 **
4 company_a 5 2018 7664 2019 13418 **
5 company_a 12 2018 224 2019 321
6 company_b 1 2018 543 2019 576
7 company_b 2 2018 23 2019 43
8 company_b 3 2018 64 2019 156
9 company_b 4 2018 143 2019 175 **
10 company_b 5 2018 41 2019 66 **
11 company_b 6 2018 90 2019 144 **
In case you want a different rolling window, or a NaN value may occur at the beginning of a company's rows, you could use this more generic solution:
df = df.sort_values(['company', 'year_ty', 'month']).reset_index(drop=True)

def process(x, m):
    idx = x.loc[x['bookings_ty'].isnull()].index.to_list()
    for i in idx:
        id = i - x.index[0]
        start = 0 if id < m else id - m
        sum_ty = sum(x['bookings_ty'].to_list()[start:id])
        sum_ly = sum(x['bookings_ly'].to_list()[start:id])
        ly = x.at[i, 'bookings_ly']
        x.at[i, 'bookings_ty'] = sum_ty / sum_ly * ly
    return x

rolling_month = 3
df = df.groupby(['company']).apply(lambda x: process(x, rolling_month))
df['bookings_ty'] = df['bookings_ty'].astype(np.int64)
print(df)
initial df:
company month year_ly bookings_ly year_ty bookings_ty
0 company_a 1 2018 432 2019 253.0
1 company_a 2 2018 265 2019 635.0
2 company_a 3 2018 345 2019 NaN
3 company_a 4 2018 233 2019 NaN
4 company_a 5 2018 7664 2019 NaN
5 company_a 12 2018 224 2019 321.0
6 company_b 1 2018 543 2019 576.0
7 company_b 2 2018 23 2019 43.0
8 company_b 3 2018 64 2019 156.0
9 company_b 4 2018 143 2019 NaN
10 company_b 5 2018 41 2019 NaN
11 company_b 6 2018 90 2019 NaN
final result:
company month year_ly bookings_ly year_ty bookings_ty
0 company_a 1 2018 432 2019 253
1 company_a 2 2018 265 2019 635
2 company_a 3 2018 345 2019 439 ** computed from only the 2 previous rows
3 company_a 4 2018 233 2019 296 **
4 company_a 5 2018 7664 2019 12467 **
5 company_a 12 2018 224 2019 321
6 company_b 1 2018 543 2019 576
7 company_b 2 2018 23 2019 43
8 company_b 3 2018 64 2019 156
9 company_b 4 2018 143 2019 175 **
10 company_b 5 2018 41 2019 66 **
11 company_b 6 2018 90 2019 144 **
If you want to speed up the process, you could try:
df.set_index(['company'], inplace=True)
df = df.groupby(level=(0)).apply(lambda x: process(x))
instead of
df = df.groupby(['company']).apply(lambda x: process(x))

Grouping data series by day intervals with Pandas

I have to perform some data analysis on a seasonal basis.
I have circa one and a half years' worth of hourly measurements, from the end of 2015 to the second half of 2017. What I want to do is sort this data into seasons.
Here's an example of the data I am working with:
Date,Year,Month,Day,Day week,Hour,Holiday,Week Day,Impulse,Power (kW),Temperature (C)
04/12/2015,2015,12,4,6,18,0,6,2968,1781,16.2
04/12/2015,2015,12,4,6,19,0,6,2437,1462,16.2
19/04/2016,2016,4,19,3,3,0,3,1348,809,14.4
19/04/2016,2016,4,19,3,4,0,3,1353,812,14.1
11/06/2016,2016,6,11,7,19,0,7,1395,837,18.8
11/06/2016,2016,6,11,7,20,0,7,1370,822,17.4
11/06/2016,2016,6,11,7,21,0,7,1364,818,17
11/06/2016,2016,6,11,7,22,0,7,1433,860,17.5
04/12/2016,2016,12,4,1,17,0,1,1425,855,14.6
04/12/2016,2016,12,4,1,18,0,1,1466,880,14.4
07/03/2017,2017,3,7,3,14,0,3,3668,2201,14.2
07/03/2017,2017,3,7,3,15,0,3,3666,2200,14
24/04/2017,2017,4,24,2,5,0,2,1347,808,11.4
24/04/2017,2017,4,24,2,6,0,2,1816,1090,11.5
24/04/2017,2017,4,24,2,7,0,2,2918,1751,12.4
15/06/2017,2017,6,15,5,13,1,1,2590,1554,22.5
15/06/2017,2017,6,15,5,14,1,1,2629,1577,22.5
15/06/2017,2017,6,15,5,15,1,1,2656,1594,22.1
15/11/2017,2017,11,15,4,13,0,4,3765,2259,15.6
15/11/2017,2017,11,15,4,14,0,4,3873,2324,15.9
15/11/2017,2017,11,15,4,15,0,4,3905,2343,15.8
15/11/2017,2017,11,15,4,16,0,4,3861,2317,15.3
As you can see I have data on three different years.
What I was thinking of doing is converting the first column with the pd.to_datetime() command, and then grouping the rows by day/month in dd/mm intervals, regardless of the year (if winter goes from 21/12 to 21/03, create a new dataframe with all of the rows whose date falls within this interval, whatever the year). However, I couldn't find a way to do this while ignoring the year, which makes things more complicated.
EDIT:
A desired output would be:
df_spring
Date,Year,Month,Day,Day week,Hour,Holiday,Week Day,Impulse,Power (kW),Temperature (C)
19/04/2016,2016,4,19,3,3,0,3,1348,809,14.4
19/04/2016,2016,4,19,3,4,0,3,1353,812,14.1
07/03/2017,2017,3,7,3,14,0,3,3668,2201,14.2
07/03/2017,2017,3,7,3,15,0,3,3666,2200,14
24/04/2017,2017,4,24,2,5,0,2,1347,808,11.4
24/04/2017,2017,4,24,2,6,0,2,1816,1090,11.5
24/04/2017,2017,4,24,2,7,0,2,2918,1751,12.4
df_autumn
Date,Year,Month,Day,Day week,Hour,Holiday,Week Day,Impulse,Power (kW),Temperature (C)
04/12/2015,2015,12,4,6,18,0,6,2968,1781,16.2
04/12/2015,2015,12,4,6,19,0,6,2437,1462,16.2
04/12/2016,2016,12,4,1,17,0,1,1425,855,14.6
04/12/2016,2016,12,4,1,18,0,1,1466,880,14.4
15/11/2017,2017,11,15,4,13,0,4,3765,2259,15.6
15/11/2017,2017,11,15,4,14,0,4,3873,2324,15.9
15/11/2017,2017,11,15,4,15,0,4,3905,2343,15.8
15/11/2017,2017,11,15,4,16,0,4,3861,2317,15.3
And so on for the remaining seasons.
Define each season by filtering the relevant rows using the Day and Month columns, as shown here for winter:
df_winter = df.loc[((df['Day'] >= 21) & (df['Month'] == 12)) | (df['Month'] == 1) | (df['Month'] == 2) | ((df['Day'] <= 21) & (df['Month'] == 3))]
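The same pattern extends to the other seasons; a sketch, assuming every season boundary falls on the 21st as in the winter example (adjust the cutoffs if you use different dates, and note that the 21st itself is caught by two adjacent filters):
df_spring = df.loc[((df['Day'] >= 21) & (df['Month'] == 3)) | (df['Month'] == 4) | (df['Month'] == 5) | ((df['Day'] <= 21) & (df['Month'] == 6))]
df_summer = df.loc[((df['Day'] >= 21) & (df['Month'] == 6)) | (df['Month'] == 7) | (df['Month'] == 8) | ((df['Day'] <= 21) & (df['Month'] == 9))]
df_autumn = df.loc[((df['Day'] >= 21) & (df['Month'] == 9)) | (df['Month'] == 10) | (df['Month'] == 11) | ((df['Day'] <= 21) & (df['Month'] == 12))]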
Alternatively, you can simply filter your dataframe with Month.isin():
# spring
df[df['Month'].isin([3,4])]
Date Year Month Day Day week Hour Holiday Week Day Impulse Power (kW) Temperature (C)
2 19/04/2016 2016 4 19 3 3 0 3 1348 809 14.4
3 19/04/2016 2016 4 19 3 4 0 3 1353 812 14.1
10 07/03/2017 2017 3 7 3 14 0 3 3668 2201 14.2
11 07/03/2017 2017 3 7 3 15 0 3 3666 2200 14.0
12 24/04/2017 2017 4 24 2 5 0 2 1347 808 11.4
13 24/04/2017 2017 4 24 2 6 0 2 1816 1090 11.5
14 24/04/2017 2017 4 24 2 7 0 2 2918 1751 12.4
# autumn
df[df['Month'].isin([11,12])]
Date Year Month Day Day week Hour Holiday Week Day Impulse Power (kW) Temperature (C)
0 04/12/2015 2015 12 4 6 18 0 6 2968 1781 16.2
1 04/12/2015 2015 12 4 6 19 0 6 2437 1462 16.2
8 04/12/2016 2016 12 4 1 17 0 1 1425 855 14.6
9 04/12/2016 2016 12 4 1 18 0 1 1466 880 14.4
18 15/11/2017 2017 11 15 4 13 0 4 3765 2259 15.6
19 15/11/2017 2017 11 15 4 14 0 4 3873 2324 15.9
20 15/11/2017 2017 11 15 4 15 0 4 3905 2343 15.8
21 15/11/2017 2017 11 15 4 16 0 4 3861 2317 15.3

Calculate difference from previous year/forecast in pandas dataframe

I wish to compare the output of multiple model runs, calculating these values:
Difference between current period revenue and previous period
Difference between actual current period revenue and forecasted current period revenue
I have experimented with multi-indexes, and suspect the answer lies in that direction with some creative shift(). However, I'm afraid I've mangled the problem through a haphazard application of various pivot/melt/groupby experiments. Perhaps you can help me figure out how to turn this:
import pandas as pd
ids = [1,2,3] * 5
year = ['2013', '2013', '2013', '2014', '2014', '2014', '2014', '2014', '2014', '2015', '2015', '2015', '2015', '2015', '2015']
run = ['actual','actual','actual','forecast','forecast','forecast','actual','actual','actual','forecast','forecast','forecast','actual','actual','actual']
revenue = [10,20,20,30,50,90,10,40,50,120,210,150,130,100,190]
change_from_previous_year = ['NA','NA','NA',20,30,70,0,20,30,90,160,60,120,60,140]
change_from_forecast = ['NA','NA','NA','NA','NA','NA',-20,-10,-40,'NA','NA','NA',30,-110,40]
d = {'ids':ids, 'year':year, 'run':run, 'revenue':revenue}
df = pd.DataFrame(data=d, columns=['ids','year','run','revenue'])
print df
ids year run revenue
0 1 2013 actual 10
1 2 2013 actual 20
2 3 2013 actual 20
3 1 2014 forecast 30
4 2 2014 forecast 50
5 3 2014 forecast 90
6 1 2014 actual 10
7 2 2014 actual 40
8 3 2014 actual 50
9 1 2015 forecast 120
10 2 2015 forecast 210
11 3 2015 forecast 150
12 1 2015 actual 130
13 2 2015 actual 100
14 3 2015 actual 190
....into this:
ids year run revenue chg_from_prev_year chg_from_forecast
0 1 2013 actual 10 NA NA
1 2 2013 actual 20 NA NA
2 3 2013 actual 20 NA NA
3 1 2014 forecast 30 20 NA
4 2 2014 forecast 50 30 NA
5 3 2014 forecast 90 70 NA
6 1 2014 actual 10 0 -20
7 2 2014 actual 40 20 -10
8 3 2014 actual 50 30 -40
9 1 2015 forecast 120 90 NA
10 2 2015 forecast 210 160 NA
11 3 2015 forecast 150 60 NA
12 1 2015 actual 130 120 30
13 2 2015 actual 100 60 -110
14 3 2015 actual 190 140 40
EDIT-- I get pretty close with this:
df['prev_year'] = df.groupby(['ids','run']).shift(1)['revenue']
df['chg_from_prev_year'] = df['revenue'] - df['prev_year']
df['curr_forecast'] = df.groupby(['ids','year']).shift(1)['revenue']
df['chg_from_forecast'] = df['revenue'] - df['curr_forecast']
The only thing missing (as expected) is the comparison between the 2014 forecast and the 2013 actual. I could just duplicate the 2013 run in the dataset, calculate chg_from_prev_year for the 2014 forecast, and hide/delete the unwanted data from the final dataframe.
Firstly to get the change from previous year, do a shift on each of the groups:
In [11]: g = df.groupby(['ids', 'run'])
In [12]: df['chg_from_prev_year'] = g['revenue'].apply(lambda x: x - x.shift())
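As a side note (not part of the original answer), current pandas versions let you write the same thing with the groupby diff method:
df['chg_from_prev_year'] = g['revenue'].diff()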
The next part is more complicated, I think you need to do a pivot_table for the next part:
In [13]: df1 = df.pivot_table('revenue', ['ids', 'year'], 'run')
In [14]: df1
Out[14]:
run actual forecast
ids year
1 2013 10 NaN
2014 10 30
2015 130 120
2 2013 20 NaN
2014 40 50
2015 100 210
3 2013 20 NaN
2014 50 90
2015 190 150
In [15]: g1 = df1.groupby(level='ids', as_index=False)
In [16]: out_by = g1.apply(lambda x: x['actual'] - x['forecast'])
In [17]: out_by # hello levels bug, fixed in 0.13/master... yesterday :)
Out[17]:
ids ids year
1 1 2013 NaN
2014 -20
2015 10
2 2 2013 NaN
2014 -10
2015 -110
3 3 2013 NaN
2014 -40
2015 40
dtype: float64
This gives the results you want, but not in the correct format (see [31] below if you're not too fussed)... the following seems like a bit of a hack (to put it mildly), but here goes:
In [21]: df2 = df.set_index(['ids', 'year', 'run'])
In [22]: out_by.index = out_by.index.droplevel(0)
In [23]: out_by_df = pd.DataFrame(out_by, columns=['revenue'])
In [24]: out_by_df['run'] = 'forecast'
In [25]: df2['chg_from_forecast'] = out_by_df.set_index('run', append=True)['revenue']
and we're done...
In [26]: df2.reset_index()
Out[26]:
ids year run revenue chg_from_prev_year chg_from_forecast
0 1 2013 actual 10 NaN NaN
1 2 2013 actual 20 NaN NaN
2 3 2013 actual 20 NaN NaN
3 1 2014 forecast 30 NaN -20
4 2 2014 forecast 50 NaN -10
5 3 2014 forecast 90 NaN -40
6 1 2014 actual 10 0 NaN
7 2 2014 actual 40 20 NaN
8 3 2014 actual 50 30 NaN
9 1 2015 forecast 120 90 10
10 2 2015 forecast 210 160 -110
11 3 2015 forecast 150 60 40
12 1 2015 actual 130 120 NaN
13 2 2015 actual 100 60 NaN
14 3 2015 actual 190 140 NaN
Note: I think the first 6 results of chg_from_prev_year should be NaN.
However, I think you may be better off keeping it as a pivot:
In [31]: df3 = df.pivot_table(['revenue', 'chg_from_prev_year'], ['ids', 'year'], 'run')
In [32]: df3['chg_from_forecast'] = g1.apply(lambda x: x['actual'] - x['forecast']).values
In [33]: df3
Out[33]:
revenue chg_from_prev_year chg_from_forecast
run actual forecast actual forecast
ids year
1 2013 10 NaN NaN NaN NaN
2014 10 30 0 NaN -20
2015 130 120 120 90 10
2 2013 20 NaN NaN NaN NaN
2014 40 50 20 NaN -10
2015 100 210 60 160 -110
3 2013 20 NaN NaN NaN NaN
2014 50 90 30 NaN -40
2015 190 150 140 60 40
