Time interval calculation for consecutive days in rows - python

I have a dataframe that looks like this:
Path_Version commitdates Year-Month API Age api_spec_id
168 NaN 2018-10-19 2018-10 39 521
169 NaN 2018-10-19 2018-10 39 521
170 NaN 2018-10-12 2018-10 39 521
171 NaN 2018-10-12 2018-10 39 521
172 NaN 2018-10-12 2018-10 39 521
173 NaN 2018-10-11 2018-10 39 521
174 NaN 2018-10-11 2018-10 39 521
175 NaN 2018-10-11 2018-10 39 521
176 NaN 2018-10-11 2018-10 39 521
177 NaN 2018-10-11 2018-10 39 521
178 NaN 2018-09-26 2018-09 39 521
179 NaN 2018-09-25 2018-09 39 521
I want to calculate the days elapsed from the first commit date to the last (after sorting the commit dates first), so something like this:
Path_Version commitdates Year-Month API Age api_spec_id Days_difference
168 NaN 2018-10-19 2018-10 39 521 25
169 NaN 2018-10-19 2018-10 39 521 25
170 NaN 2018-10-12 2018-10 39 521 18
171 NaN 2018-10-12 2018-10 39 521 18
172 NaN 2018-10-12 2018-10 39 521 18
173 NaN 2018-10-11 2018-10 39 521 16
174 NaN 2018-10-11 2018-10 39 521 16
175 NaN 2018-10-11 2018-10 39 521 16
176 NaN 2018-10-11 2018-10 39 521 16
177 NaN 2018-10-11 2018-10 39 521 16
178 NaN 2018-09-26 2018-09 39 521 1
179 NaN 2018-09-25 2018-09 39 521 0
I tried first sorting the commit dates within each api_spec_id (it is unique for every API) and then calculating the diff:
final_api['commitdates'] = final_api.groupby('api_spec_id')['commitdate'].apply(lambda x: x.sort_values())
final_api['diff'] = final_api.groupby('api_spec_id')['commitdates'].diff() / np.timedelta64(1, 'D')
final_api['diff'] = final_api['diff'].fillna(0)
It just returns zero for the entire column. I don't want to group them; I only want to calculate the difference based on the sorted commit dates, from the first commit date to the last in the entire dataset, in days.
Any idea how I can achieve this?

Use pandas.to_datetime, sub, min and dt.days:
t = pd.to_datetime(df['commitdates'])
df['Days_difference'] = t.sub(t.min()).dt.days
If you need to group per API:
t = pd.to_datetime(df['commitdates'])
df['Days_difference'] = t.sub(t.groupby(df['api_spec_id']).transform('min')).dt.days
Output:
Path_Version commitdates Year-Month API Age api_spec_id Days_difference
168 NaN 2018-10-19 2018-10 39 521 24
169 NaN 2018-10-19 2018-10 39 521 24
170 NaN 2018-10-12 2018-10 39 521 17
171 NaN 2018-10-12 2018-10 39 521 17
172 NaN 2018-10-12 2018-10 39 521 17
173 NaN 2018-10-11 2018-10 39 521 16
174 NaN 2018-10-11 2018-10 39 521 16
175 NaN 2018-10-11 2018-10 39 521 16
176 NaN 2018-10-11 2018-10 39 521 16
177 NaN 2018-10-11 2018-10 39 521 16
178 NaN 2018-09-26 2018-09 39 521 1
179 NaN 2018-09-25 2018-09 39 521 0
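As a quick sanity check, here is a minimal self-contained sketch of the same idea (the values are assumed for illustration, not taken from the real dataset):
import pandas as pd

df = pd.DataFrame({
    'commitdates': ['2018-10-19', '2018-10-12', '2018-09-25'],
    'api_spec_id': [521, 521, 521],
})

t = pd.to_datetime(df['commitdates'])
# days elapsed since the earliest commit date in the whole frame
df['Days_difference'] = t.sub(t.min()).dt.days
print(df)
#   commitdates  api_spec_id  Days_difference
# 0  2018-10-19          521               24
# 1  2018-10-12          521               17
# 2  2018-09-25          521                0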


Calculate mean of data rows in dataframe with date-headers, dictated by a 'datetime'-column

I have a dataframe with client IDs and their expenses for 2014-2018. I want the mean of the expenses per ID, but only the years before a certain date may be taken into account (so the 'Date' column dictates which year columns count towards the mean).
Example: for index 0 (ID: 12), the date states '2016-03-08', so the mean should be taken from the columns 'y_2014' and 'y_2015', giving 111.0 for this index.
If the date is too early (e.g. somewhere in 2014 or earlier), NaN should be returned (see index 6 and 9).
Initial dataframe:
y_2014 y_2015 y_2016 y_2017 y_2018 Date ID
0 100.0 122.0 324 632 NaN 2016-03-08 12
1 120.0 159.0 54 452 541.0 2015-04-09 96
2 NaN 164.0 687 165 245.0 2016-02-15 20
3 180.0 421.0 512 184 953.0 2018-05-01 73
4 110.0 654.0 913 173 103.0 2017-08-04 84
5 130.0 NaN 754 124 207.0 2016-07-03 26
6 170.0 256.0 843 97 806.0 2013-02-04 87
7 140.0 754.0 95 101 541.0 2016-06-08 64
8 80.0 985.0 184 84 90.0 2019-03-05 11
9 96.0 65.0 127 130 421.0 2014-05-14 34
Desired output:
y_2014 y_2015 y_2016 y_2017 y_2018 Date ID mean
0 100.0 122.0 324 632 NaN 2016-03-08 12 111.0
1 120.0 159.0 54 452 541.0 2015-04-09 96 120.0
2 NaN 164.0 687 165 245.0 2016-02-15 20 164.0
3 180.0 421.0 512 184 953.0 2018-05-01 73 324.25
4 110.0 654.0 913 173 103.0 2017-08-04 84 559.0
5 130.0 NaN 754 124 207.0 2016-07-03 26 130.0
6 170.0 256.0 843 97 806.0 2013-02-04 87 NaN
7 140.0 754.0 95 101 541.0 2016-06-08 64 447
8 80.0 985.0 184 84 90.0 2019-03-05 11 284.6
9 96.0 65.0 127 130 421.0 2014-05-14 34 NaN
Tried code: I'm still working on it, as I don't really know where to start. I have only built the dataframe so far; probably something with the datetime package is needed to get the desired result?
import pandas as pd
import numpy as np
import datetime

df = pd.DataFrame({"ID": [12, 96, 20, 73, 84, 26, 87, 64, 11, 34],
                   "y_2014": [100, 120, np.nan, 180, 110, 130, 170, 140, 80, 96],
                   "y_2015": [122, 159, 164, 421, 654, np.nan, 256, 754, 985, 65],
                   "y_2016": [324, 54, 687, 512, 913, 754, 843, 95, 184, 127],
                   "y_2017": [632, 452, 165, 184, 173, 124, 97, 101, 84, 130],
                   "y_2018": [np.nan, 541, 245, 953, 103, 207, 806, 541, 90, 421],
                   "Date": ['2016-03-08', '2015-04-09', '2016-02-15', '2018-05-01', '2017-08-04',
                            '2016-07-03', '2013-02-04', '2016-06-08', '2019-03-05', '2014-05-14']})
print(df)
Due to your naming convention, you need to extract the years from the column names for comparison purposes. Then you can mask the data and take the mean:
# the years from the column names (as a plain array so the comparison broadcasts)
data = df.filter(like='y_')
data_years = data.columns.str.extract(r'(\d+)')[0].astype(int).to_numpy()
# the years from Date
years = pd.to_datetime(df.Date).dt.year.to_numpy()
# keep only the values whose year is strictly before the row's Date year, then take the mean
df['mean'] = data.where(data_years < years[:, None]).mean(1)
Output:
y_2014 y_2015 y_2016 y_2017 y_2018 Date ID mean
0 100.0 122.0 324 632 NaN 2016-03-08 12 111.00
1 120.0 159.0 54 452 541.0 2015-04-09 96 120.00
2 NaN 164.0 687 165 245.0 2016-02-15 20 164.00
3 180.0 421.0 512 184 953.0 2018-05-01 73 324.25
4 110.0 654.0 913 173 103.0 2017-08-04 84 559.00
5 130.0 NaN 754 124 207.0 2016-07-03 26 130.00
6 170.0 256.0 843 97 806.0 2013-02-04 87 NaN
7 140.0 754.0 95 101 541.0 2016-06-08 64 447.00
8 80.0 985.0 184 84 90.0 2019-03-05 11 284.60
9 96.0 65.0 127 130 421.0 2014-05-14 34 NaN
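The comparison data_years < years[:, None] relies on NumPy broadcasting; a tiny sketch of what it produces (toy values, assumed for illustration):
import numpy as np

data_years = np.array([2014, 2015, 2016, 2017, 2018])   # from the column names
years = np.array([2016, 2015])                          # Date years of two rows
print(data_years < years[:, None])
# [[ True  True False False False]    -> y_2014 and y_2015 count for a 2016 date
#  [ True False False False False]]   -> only y_2014 counts for a 2015 date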
One more answer:
import pandas as pd
import numpy as np

df = pd.DataFrame({"ID": [12, 96, 20, 73, 84, 26, 87, 64, 11, 34],
                   "y_2014": [100, 120, np.nan, 180, 110, 130, 170, 140, 80, 96],
                   "y_2015": [122, 159, 164, 421, 654, np.nan, 256, 754, 985, 65],
                   "y_2016": [324, 54, 687, 512, 913, 754, 843, 95, 184, 127],
                   "y_2017": [632, 452, 165, 184, 173, 124, 97, 101, 84, 130],
                   "y_2018": [np.nan, 541, 245, 953, 103, 207, 806, 541, 90, 421],
                   "Date": ['2016-03-08', '2015-04-09', '2016-02-15', '2018-05-01', '2017-08-04',
                            '2016-07-03', '2013-02-04', '2016-06-08', '2019-03-05', '2014-05-14']})
# Subset of the expense columns used to compute the mean
subset = df.loc[:, ['y_2014', 'y_2015', 'y_2016', 'y_2017', 'y_2018']]
# An expense value only counts once its year has fully passed, so each column is
# relabelled with the first day of the *following* year (e.g. '2015-01-01' for
# 'y_2014') and compared against the 'Date' column.
subset.columns = ['2015-01-01', '2016-01-01', '2017-01-01', '2018-01-01', '2019-01-01']

# Boolean mask: which columns lie strictly before each row's Date
s = subset.columns.values < df.Date.values[:, None]
# Turn False into NaN so masked-out values are ignored by mean()
t = s.astype(float)
t[t == 0] = np.nan
df['mean'] = (subset * t).mean(1)

print(df)

# Additionally: the sum of expenses before the date in the 'Date' column
df['sum'] = (subset * t).sum(1)

print(df)

Append category to column if date range is between start and end date

I'm sure this is simple, but I can't wrap my head around it. Essentially I have two dataframes, a large df that contains process data every six hours and a smaller df that contains a condition number, a start date and an end date. I need to fill the condition column of the large dataframe with the condition number that corresponds to the date range, or else leave it blank if the dates do not fall between any date range in the small df. So my two frames would look like this:
Large df
Date P1 P2
7/1/2019 11:00 102 240
7/1/2019 17:00 102 247
7/1/2019 23:00 100 219
7/2/2019 5:00 107 213
7/2/2019 11:00 100 226
7/2/2019 17:00 104 239
7/2/2019 23:00 110 240
7/3/2019 5:00 110 232
7/3/2019 11:00 102 215
7/3/2019 17:00 103 219
7/3/2019 23:00 107 243
7/4/2019 5:00 107 246
7/4/2019 11:00 103 219
7/4/2019 17:00 105 220
7/4/2019 23:00 107 220
7/5/2019 5:00 107 227
7/5/2019 11:00 108 208
7/5/2019 17:00 110 248
7/5/2019 23:00 107 235
Small df
Condition Start Time End Time
A 7/1/2019 11:00 7/2/2019 5:00
B 7/3/2019 5:00 7/3/2019 23:00
C 7/4/2019 23:00 7/5/2019 17:00
And I need the result to look like this:
Date P1 P2 Cond
7/1/2019 11:00 102 240 A
7/1/2019 17:00 102 247 A
7/1/2019 23:00 100 219 A
7/2/2019 5:00 107 213 A
7/2/2019 11:00 100 226
7/2/2019 17:00 104 239
7/2/2019 23:00 110 240
7/3/2019 5:00 110 232 B
7/3/2019 11:00 102 215 B
7/3/2019 17:00 103 219 B
7/3/2019 23:00 107 243 B
7/4/2019 5:00 107 246
7/4/2019 11:00 103 219
7/4/2019 17:00 105 220
7/4/2019 23:00 107 220 C
7/5/2019 5:00 107 227 C
7/5/2019 11:00 108 208 C
7/5/2019 17:00 110 248 C
7/5/2019 23:00 107 235
You need to iterate over the rows of the small dataframe (sdf below) and label the matching date ranges in the large one:
for i, row in sdf.iterrows():
    df.loc[df['Date'].between(row['Start Time'], row['End Time']), 'Cond'] = row['Condition']
Output:
Date P1 P2 Cond
0 2019-07-01 11:00:00 102 240 A
1 2019-07-01 17:00:00 102 247 A
2 2019-07-01 23:00:00 100 219 A
3 2019-07-02 05:00:00 107 213 A
4 2019-07-02 11:00:00 100 226 NaN
5 2019-07-02 17:00:00 104 239 NaN
6 2019-07-02 23:00:00 110 240 NaN
7 2019-07-03 05:00:00 110 232 B
8 2019-07-03 11:00:00 102 215 B
9 2019-07-03 17:00:00 103 219 B
10 2019-07-03 23:00:00 107 243 B
11 2019-07-04 05:00:00 107 246 NaN
12 2019-07-04 11:00:00 103 219 NaN
13 2019-07-04 17:00:00 105 220 NaN
14 2019-07-04 23:00:00 107 220 C
15 2019-07-05 05:00:00 107 227 C
16 2019-07-05 11:00:00 108 208 C
17 2019-07-05 17:00:00 110 248 C
18 2019-07-05 23:00:00 107 235 NaN
You may try pd.IntervalIndex and map as follows:
inx = pd.IntervalIndex.from_arrays(df2['Start Time'], df2['End Time'], closed='both')
df2.index = inx
df1['cond'] = df1.Date.map(df2.Condition)
Out[423]:
Date P1 P2 cond
0 2019-07-01 11:00:00 102 240 A
1 2019-07-01 17:00:00 102 247 A
2 2019-07-01 23:00:00 100 219 A
3 2019-07-02 05:00:00 107 213 A
4 2019-07-02 11:00:00 100 226 NaN
5 2019-07-02 17:00:00 104 239 NaN
6 2019-07-02 23:00:00 110 240 NaN
7 2019-07-03 05:00:00 110 232 B
8 2019-07-03 11:00:00 102 215 B
9 2019-07-03 17:00:00 103 219 B
10 2019-07-03 23:00:00 107 243 B
11 2019-07-04 05:00:00 107 246 NaN
12 2019-07-04 11:00:00 103 219 NaN
13 2019-07-04 17:00:00 105 220 NaN
14 2019-07-04 23:00:00 107 220 C
15 2019-07-05 05:00:00 107 227 C
16 2019-07-05 11:00:00 108 208 C
17 2019-07-05 17:00:00 110 248 C
18 2019-07-05 23:00:00 107 235 NaN
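Note that map leaves any Date that falls outside every interval as NaN, and the lookup assumes the intervals do not overlap. A minimal self-contained version of the same idea (toy timestamps, assumed for illustration):
import pandas as pd

df2 = pd.DataFrame({'Condition': ['A', 'B'],
                    'Start Time': pd.to_datetime(['2019-07-01 11:00', '2019-07-03 05:00']),
                    'End Time': pd.to_datetime(['2019-07-02 05:00', '2019-07-03 23:00'])})
df1 = pd.DataFrame({'Date': pd.to_datetime(['2019-07-01 17:00',
                                            '2019-07-02 11:00',
                                            '2019-07-03 05:00'])})

inx = pd.IntervalIndex.from_arrays(df2['Start Time'], df2['End Time'], closed='both')
df2.index = inx
df1['cond'] = df1.Date.map(df2.Condition)
print(df1)   # the 2019-07-02 11:00 row gets NaN: it is outside both intervals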
You could do something like the following, where s1 and s2 hold the raw text of the large and small tables shown above:
import io
import pandas as pd

df1 = pd.read_csv(io.StringIO(s1), sep=r'\s\s+', engine='python',
                  converters={'Date': pd.to_datetime})
df2 = pd.read_csv(io.StringIO(s2), sep=r'\s\s+', engine='python',
                  converters={'Start Time': pd.to_datetime, 'End Time': pd.to_datetime})

# Reshape the small frame to one row per boundary timestamp, then match each
# Date to the most recent boundary at or before it.
df2 = df2.set_index('Condition').stack().reset_index()
df = pd.merge_asof(df1, df2, left_on='Date', right_on=0, direction='backward')
# Dates that fall after an 'End Time' boundary are outside every range
df.loc[(df['level_1'].eq('End Time')) & (df['Date'] > df[0]), 'Condition'] = ''
print(df.iloc[:, :-2])
Date P1 P2 Condition
0 2019-07-01 11:00:00 102 240 A
1 2019-07-01 17:00:00 102 247 A
2 2019-07-01 23:00:00 100 219 A
3 2019-07-02 05:00:00 107 213 A
4 2019-07-02 11:00:00 100 226
5 2019-07-02 17:00:00 104 239
6 2019-07-02 23:00:00 110 240
7 2019-07-03 05:00:00 110 232 B
8 2019-07-03 11:00:00 102 215 B
9 2019-07-03 17:00:00 103 219 B
10 2019-07-03 23:00:00 107 243 B
11 2019-07-04 05:00:00 107 246
12 2019-07-04 11:00:00 103 219
13 2019-07-04 17:00:00 105 220
14 2019-07-04 23:00:00 107 220 C
15 2019-07-05 05:00:00 107 227 C
16 2019-07-05 11:00:00 108 208 C
17 2019-07-05 17:00:00 110 248 C
18 2019-07-05 23:00:00 107 235
df1.insert(3, "Cond", [None] * len(df1))
for i in range(len(df2)):
    # element-wise AND of the two boolean masks (use & rather than *)
    mask = (df1["Date"] >= df2["Start Time"].loc[i]) & (df1["Date"] <= df2["End Time"].loc[i])
    df1.loc[mask, "Cond"] = df2["Condition"].loc[i]

How to calculate a moving average in my dataset? [duplicate]

I have a data frame like this which is imported from a CSV.
stock pop
Date
2016-01-04 325.316 82
2016-01-11 320.036 83
2016-01-18 299.169 79
2016-01-25 296.579 84
2016-02-01 295.334 82
2016-02-08 309.777 81
2016-02-15 317.397 75
2016-02-22 328.005 80
2016-02-29 315.504 81
2016-03-07 328.802 81
2016-03-14 339.559 86
2016-03-21 352.160 82
2016-03-28 348.773 84
2016-04-04 346.482 83
2016-04-11 346.980 80
2016-04-18 357.140 75
2016-04-25 357.439 77
2016-05-02 356.443 78
2016-05-09 365.158 78
2016-05-16 352.160 72
2016-05-23 344.540 74
2016-05-30 354.998 81
2016-06-06 347.428 77
2016-06-13 341.053 78
2016-06-20 363.515 80
2016-06-27 349.669 80
2016-07-04 371.583 82
2016-07-11 358.335 81
2016-07-18 362.021 79
2016-07-25 368.844 77
... ... ...
I wanted to add a new column MA that holds the rolling mean of the pop column. I tried the following:
df['MA']=data.rolling(5,on='pop').mean()
I get an error
ValueError: Wrong number of items passed 2, placement implies 1
So I tried it without assigning to a new column. I used:
data.rolling(5,on='pop').mean()
I got the output
stock pop
Date
2016-01-04 NaN 82
2016-01-11 NaN 83
2016-01-18 NaN 79
2016-01-25 NaN 84
2016-02-01 307.2868 82
2016-02-08 304.1790 81
2016-02-15 303.6512 75
2016-02-22 309.4184 80
2016-02-29 313.2034 81
2016-03-07 319.8970 81
2016-03-14 325.8534 86
2016-03-21 332.8060 82
2016-03-28 336.9596 84
2016-04-04 343.1552 83
2016-04-11 346.7908 80
2016-04-18 350.3070 75
2016-04-25 351.3628 77
2016-05-02 352.8968 78
2016-05-09 356.6320 78
2016-05-16 357.6680 72
2016-05-23 355.1480 74
2016-05-30 354.6598 81
2016-06-06 352.8568 77
2016-06-13 348.0358 78
2016-06-20 350.3068 80
2016-06-27 351.3326 80
2016-07-04 354.6496 82
2016-07-11 356.8310 81
2016-07-18 361.0246 79
2016-07-25 362.0904 77
... ... ...
I can't seem to apply a rolling mean to the pop column. What am I doing wrong?
To assign a column, you can create a rolling object based on your Series:
df['new_col'] = data['column'].rolling(5).mean()
The answer posted by ac2001 is not the most performant way of doing this: it calculates a rolling mean over every column in the dataframe and then assigns the "ma" column from the "pop" column. The first of the two methods below is much more efficient:
%timeit df['ma'] = data['pop'].rolling(5).mean()
%timeit df['ma_2'] = data.rolling(5).mean()['pop']
1000 loops, best of 3: 497 µs per loop
100 loops, best of 3: 2.6 ms per loop
I would not recommend using the second method unless you need to store computed rolling means on all other columns.
Edit: pd.rolling_mean is deprecated and has been removed from pandas. Using the .rolling() method instead, you can do:
df['MA'] = df['pop'].rolling(window=5,center=False).mean()
for a dataframe df:
Date stock pop
0 2016-01-04 325.316 82
1 2016-01-11 320.036 83
2 2016-01-18 299.169 79
3 2016-01-25 296.579 84
4 2016-02-01 295.334 82
5 2016-02-08 309.777 81
6 2016-02-15 317.397 75
7 2016-02-22 328.005 80
8 2016-02-29 315.504 81
9 2016-03-07 328.802 81
To get:
Date stock pop MA
0 2016-01-04 325.316 82 NaN
1 2016-01-11 320.036 83 NaN
2 2016-01-18 299.169 79 NaN
3 2016-01-25 296.579 84 NaN
4 2016-02-01 295.334 82 82.0
5 2016-02-08 309.777 81 81.8
6 2016-02-15 317.397 75 80.2
7 2016-02-22 328.005 80 80.4
8 2016-02-29 315.504 81 79.8
9 2016-03-07 328.802 81 79.6
Documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rolling.html
Old: on pandas versions that still ship it, you could use:
df['MA']=pd.rolling_mean(df['pop'], window=5)
to get:
Date stock pop MA
0 2016-01-04 325.316 82 NaN
1 2016-01-11 320.036 83 NaN
2 2016-01-18 299.169 79 NaN
3 2016-01-25 296.579 84 NaN
4 2016-02-01 295.334 82 82.0
5 2016-02-08 309.777 81 81.8
6 2016-02-15 317.397 75 80.2
7 2016-02-22 328.005 80 80.4
8 2016-02-29 315.504 81 79.8
9 2016-03-07 328.802 81 79.6
Documentation: http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.rolling_mean.html
This solution worked for me.
data['MA'] = data.rolling(5).mean()['pop']
I think the issue is that on='pop' merely changes which column the rolling window is computed along (instead of the index); it does not restrict the computation to that column.
From the docstring: "For a DataFrame, column on which to calculate the rolling window, rather than the index"
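To see what on= actually does, a small sketch (toy values, assumed): with on='pop', 'pop' becomes the window axis and is passed through untouched, while every other column gets averaged, which is exactly the output shown in the question.
import pandas as pd

df = pd.DataFrame({'stock': [325.3, 320.0, 299.2, 296.6, 295.3],
                   'pop': [82, 83, 79, 84, 82]})

# 'pop' is used as the window axis and returned unchanged; 'stock' is averaged
print(df.rolling(5, on='pop').mean())

# to smooth 'pop' itself, roll over the Series instead
df['MA'] = df['pop'].rolling(5).mean()
print(df)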

Pandas if/then aggregation

I've been searching SO and haven't figured this out yet. Hoping someone can aide this python newb to solving my problem.
I'm trying to figure out how to write an if/then statement in python and perform an aggregation off that if/then statement. My end goal: if the date is 2017-01-07, use the value in the "fake" column; for every other date, average the two columns together.
Here is what I have so far:
import pandas as pd
import numpy as np
import datetime

np.random.seed(42)
dte = pd.date_range(start=datetime.date(2017, 1, 1), end=datetime.date(2017, 1, 15))
fake = np.random.randint(15, 100, size=15)
fake2 = np.random.randint(300, 1000, size=15)
so_df = pd.DataFrame({'date': dte,
                      'fake': fake,
                      'fake2': fake2})
so_df['avg'] = so_df[['fake', 'fake2']].mean(axis=1)
so_df.head()
Assuming you have already computed the average column:
so_df['fake'].where(so_df['date']=='20170107', so_df['avg'])
Out:
0 375.5
1 260.0
2 331.0
3 267.5
4 397.0
5 355.0
6 89.0
7 320.5
8 449.0
9 395.5
10 197.0
11 438.5
12 498.5
13 409.5
14 525.5
Name: fake, dtype: float64
If not, you can replace the column reference with the same calculation:
so_df['fake'].where(so_df['date']=='20170107', so_df[['fake','fake2']].mean(axis=1))
To check for multiple dates, you need to use the element-wise version of the or operator (which is pipe: |). Otherwise it will raise an error.
so_df['fake'].where((so_df['date']=='20170107') | (so_df['date']=='20170109'), so_df['avg'])
The above checks for two dates. In the case of 3 or more, you may want to use isin with a list:
so_df['fake'].where(so_df['date'].isin(['20170107', '20170109', '20170112']), so_df['avg'])
Out[42]:
0 375.5
1 260.0
2 331.0
3 267.5
4 397.0
5 355.0
6 89.0
7 320.5
8 38.0
9 395.5
10 197.0
11 67.0
12 498.5
13 409.5
14 525.5
Name: fake, dtype: float64
Let's use np.where:
so_df['avg'] = np.where(so_df['date'] == pd.to_datetime('2017-01-07'),
                        so_df['fake'],
                        so_df[['fake', 'fake2']].mean(1))
Output:
date fake fake2 avg
0 2017-01-01 66 685 375.5
1 2017-01-02 29 491 260.0
2 2017-01-03 86 576 331.0
3 2017-01-04 75 460 267.5
4 2017-01-05 35 759 397.0
5 2017-01-06 97 613 355.0
6 2017-01-07 89 321 89.0
7 2017-01-08 89 552 320.5
8 2017-01-09 38 860 449.0
9 2017-01-10 17 774 395.5
10 2017-01-11 36 358 197.0
11 2017-01-12 67 810 438.5
12 2017-01-13 16 981 498.5
13 2017-01-14 44 775 409.5
14 2017-01-15 52 999 525.5
One way to do if-else in pandas is with np.where. It takes three arguments: the condition, the value if true, and the value if false.
so_df['avg'] = np.where(so_df['date'] == '2017-01-07',
                        so_df['fake'],
                        so_df[['fake', 'fake2']].mean(axis=1))
date fake fake2 avg
0 2017-01-01 66 685 375.5
1 2017-01-02 29 491 260.0
2 2017-01-03 86 576 331.0
3 2017-01-04 75 460 267.5
4 2017-01-05 35 759 397.0
5 2017-01-06 97 613 355.0
6 2017-01-07 89 321 89.0
7 2017-01-08 89 552 320.5
8 2017-01-09 38 860 449.0
9 2017-01-10 17 774 395.5
10 2017-01-11 36 358 197.0
11 2017-01-12 67 810 438.5
12 2017-01-13 16 981 498.5
13 2017-01-14 44 775 409.5
14 2017-01-15 52 999 525.5
We can also use the Series.where() method:
In [141]: so_df['avg'] = (so_df['fake']
     ...:                 .where(so_df['date'].isin(['2017-01-07', '2017-01-09']))
     ...:                 .fillna(so_df[['fake', 'fake2']].mean(1)))
In [142]: so_df
Out[142]:
date fake fake2 avg
0 2017-01-01 66 685 375.5
1 2017-01-02 29 491 260.0
2 2017-01-03 86 576 331.0
3 2017-01-04 75 460 267.5
4 2017-01-05 35 759 397.0
5 2017-01-06 97 613 355.0
6 2017-01-07 89 321 89.0
7 2017-01-08 89 552 320.5
8 2017-01-09 38 860 38.0
9 2017-01-10 17 774 395.5
10 2017-01-11 36 358 197.0
11 2017-01-12 67 810 438.5
12 2017-01-13 16 981 498.5
13 2017-01-14 44 775 409.5
14 2017-01-15 52 999 525.5
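If you ever need more than two branches, np.select generalizes np.where; a sketch under the same setup (the second branch is hypothetical, just to show the shape of the call):
import numpy as np

conditions = [so_df['date'] == '2017-01-07',
              so_df['date'] == '2017-01-09']
choices = [so_df['fake'],     # branch for 2017-01-07
           so_df['fake2']]    # hypothetical branch for 2017-01-09
so_df['avg'] = np.select(conditions, choices,
                         default=so_df[['fake', 'fake2']].mean(axis=1))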

Python Pandas Dataframe assignment

I am following a Lynda tutorial where they use the following code:
import pandas as pd
import seaborn
flights = seaborn.load_dataset('flights')
flights_indexed = flights.set_index(['year','month'])
flights_unstacked = flights_indexed.unstack()
flights_unstacked['passengers','total'] = flights_unstacked.sum(axis=1)
and it works perfectly there. However, in my case the last line fails with an error:
TypeError: cannot insert an item into a CategoricalIndex that is not already an existing category
I know that in the video they are using Python 2, whereas I have Python 3 since I am learning for work (which uses Python 3). Most of the differences I have been able to figure out; however, I cannot figure out how to create this new 'total' column with the passenger sums.
The root cause of this error message is the categorical nature of the month column:
In [42]: flights.dtypes
Out[42]:
year int64
month category
passengers int64
dtype: object
In [43]: flights.month.cat.categories
Out[43]: Index(['January', 'February', 'March', 'April', 'May', 'June', 'July',
                'August', 'September', 'October', 'November', 'December'],
               dtype='object')
and you are trying to add a category total - Pandas doesn't like that.
Workaround:
In [45]: flights.month.cat.add_categories('total', inplace=True)
(in recent pandas, where inplace= has been removed from the .cat accessor, assign instead: flights['month'] = flights.month.cat.add_categories('total'))
In [46]: x = flights.pivot(index='year', columns='month', values='passengers')
In [47]: x['total'] = x.sum(1)
In [48]: x
Out[48]:
month January February March April May June July August September October November December total
year
1949 112.0 118.0 132.0 129.0 121.0 135.0 148.0 148.0 136.0 119.0 104.0 118.0 1520.0
1950 115.0 126.0 141.0 135.0 125.0 149.0 170.0 170.0 158.0 133.0 114.0 140.0 1676.0
1951 145.0 150.0 178.0 163.0 172.0 178.0 199.0 199.0 184.0 162.0 146.0 166.0 2042.0
1952 171.0 180.0 193.0 181.0 183.0 218.0 230.0 242.0 209.0 191.0 172.0 194.0 2364.0
1953 196.0 196.0 236.0 235.0 229.0 243.0 264.0 272.0 237.0 211.0 180.0 201.0 2700.0
1954 204.0 188.0 235.0 227.0 234.0 264.0 302.0 293.0 259.0 229.0 203.0 229.0 2867.0
1955 242.0 233.0 267.0 269.0 270.0 315.0 364.0 347.0 312.0 274.0 237.0 278.0 3408.0
1956 284.0 277.0 317.0 313.0 318.0 374.0 413.0 405.0 355.0 306.0 271.0 306.0 3939.0
1957 315.0 301.0 356.0 348.0 355.0 422.0 465.0 467.0 404.0 347.0 305.0 336.0 4421.0
1958 340.0 318.0 362.0 348.0 363.0 435.0 491.0 505.0 404.0 359.0 310.0 337.0 4572.0
1959 360.0 342.0 406.0 396.0 420.0 472.0 548.0 559.0 463.0 407.0 362.0 405.0 5140.0
1960 417.0 391.0 419.0 461.0 472.0 535.0 622.0 606.0 508.0 461.0 390.0 432.0 5714.0
UPDATE: alternatively, if you don't want to touch the original DF, you can get rid of the categorical column level in the flights_unstacked DF:
In [76]: flights_unstacked.columns = (
    ...:     flights_unstacked.columns
    ...:     .set_levels(flights_unstacked.columns.get_level_values(1).categories,
    ...:                 level=1))
In [77]: flights_unstacked['passengers','total'] = flights_unstacked.sum(axis=1)
In [78]: flights_unstacked
Out[78]:
passengers
month January February March April May June July August September October November December total
year
1949 112 118 132 129 121 135 148 148 136 119 104 118 1520
1950 115 126 141 135 125 149 170 170 158 133 114 140 1676
1951 145 150 178 163 172 178 199 199 184 162 146 166 2042
1952 171 180 193 181 183 218 230 242 209 191 172 194 2364
1953 196 196 236 235 229 243 264 272 237 211 180 201 2700
1954 204 188 235 227 234 264 302 293 259 229 203 229 2867
1955 242 233 267 269 270 315 364 347 312 274 237 278 3408
1956 284 277 317 313 318 374 413 405 355 306 271 306 3939
1957 315 301 356 348 355 422 465 467 404 347 305 336 4421
1958 340 318 362 348 363 435 491 505 404 359 310 337 4572
1959 360 342 406 396 420 472 548 559 463 407 362 405 5140
1960 417 391 419 461 472 535 622 606 508 461 390 432 5714
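A simpler workaround, if you don't mind losing the calendar ordering of the month columns, is to drop the categorical dtype up front (a sketch, not from the original answer):
import seaborn
flights = seaborn.load_dataset('flights')

# plain strings instead of a CategoricalIndex: the new 'total' label can then
# be inserted freely, but the month columns will sort alphabetically
flights['month'] = flights['month'].astype(str)
flights_unstacked = flights.set_index(['year', 'month']).unstack()
flights_unstacked['passengers', 'total'] = flights_unstacked.sum(axis=1)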
