Generate weeks from column with dates - python

I have a large dataset containing a date column that starts in 2019. I want to generate, in a separate column, the number of weeks spanned by those dates.
Here is what the date column looks like:
import pandas as pd
data = {'date': ['2019-09-10', 'NaN', '2019-10-07', '2019-11-04', '2019-11-28',
'2019-12-02', '2020-01-24', '2020-01-29', '2020-02-05',
'2020-02-12', '2020-02-14', '2020-02-24', '2020-03-11',
'2020-03-16', '2020-03-17', '2020-03-18', '2021-09-14',
'2021-09-30', '2021-10-07', '2021-10-08', '2021-10-12',
'2021-10-14', '2021-10-15', '2021-10-19', '2021-10-21',
'2021-10-26', '2021-10-28', '2021-10-29', '2021-11-02',
'2021-11-15', '2021-11-16', '2021-12-01', '2021-12-07',
'2021-12-09', '2021-12-10', '2021-12-14', '2021-12-15',
'2022-01-13', '2022-01-14', '2022-01-21', '2022-01-24',
'2022-01-25', '2022-01-27', '2022-01-31', '2022-02-01',
'2022-02-10', '2022-02-11', '2022-02-16', '2022-02-24']}
df = pd.DataFrame(data)
Starting from the first day the data was collected, I want to count 7 days using the date column and treat that as one week; for example, if the first week contains the first 7 dates, I create a column and label those rows week one. I want to repeat this process until the last week of the data.
It may also be a good idea to sort the dates in order, from the first date to the most recent one.
I have tried this, but it is not generating weeks in order; it actually produces repeated week numbers:
pd.to_datetime(df['date'], errors='coerce').dt.week
My intention is: starting from the first date the data was collected, count 7 days and store that as week one, then continue incrementally until the last week, say week number 66.
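I suspect the repetition happens because .dt.week is the ISO week-of-year, which restarts at 1 every January, so the same numbers recur across 2019-2022 (the accessor is also deprecated in favor of .dt.isocalendar().week):
s = pd.to_datetime(df['date'], errors='coerce')
print(s.dt.isocalendar().week.head())  # week-of-year, not weeks since the start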
Here is the expected column of weeks created from the date column
import pandas as pd
week_df = {'weeks': ['1', '2', '3', '5', '6']}
df_weeks = pd.DataFrame(week_df)

IIUC use:
df['date'] = pd.to_datetime(df['date'])
df['week'] = df['date'].sub(df['date'].iat[0]).dt.days // 7 + 1
print(df.head(10))
date week
0 2019-09-10 1.0
1 NaT NaN
2 2019-10-07 4.0
3 2019-11-04 8.0
4 2019-11-28 12.0
5 2019-12-02 12.0
6 2020-01-24 20.0
7 2020-01-29 21.0
8 2020-02-05 22.0
9 2020-02-12 23.0
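A side note, not part of the original answer: the weeks print as floats because the NaT row forces the column to float. If whole-number labels are preferred, a nullable integer cast should do it:
# optional: integer week labels that still tolerate the missing date
df['week'] = df['week'].astype('Int64')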

You have more than 66 weeks here, so either you want the real week count since the beginning or you want a dummy week rank. See below for both solutions:
# convert to week period
s = pd.to_datetime(df['date']).dt.to_period('W')
# get real week number
df['week'] = s.sub(s.iloc[0]).dropna().apply(lambda x: x.n).add(1)
# get dummy week rank
df['week2'] = s.rank(method='dense')
output:
date week week2
0 2019-09-10 1.0 1.0
1 NaN NaN NaN
2 2019-10-07 5.0 2.0
3 2019-11-04 9.0 3.0
4 2019-11-28 12.0 4.0
5 2019-12-02 13.0 5.0
6 2020-01-24 20.0 6.0
7 2020-01-29 21.0 7.0
8 2020-02-05 22.0 8.0
9 2020-02-12 23.0 9.0
10 2020-02-14 23.0 9.0
11 2020-02-24 25.0 10.0
12 2020-03-11 27.0 11.0
13 2020-03-16 28.0 12.0
14 2020-03-17 28.0 12.0
15 2020-03-18 28.0 12.0
16 2021-09-14 106.0 13.0
17 2021-09-30 108.0 14.0
18 2021-10-07 109.0 15.0
19 2021-10-08 109.0 15.0
...
42 2022-01-27 125.0 26.0
43 2022-01-31 126.0 27.0
44 2022-02-01 126.0 27.0
45 2022-02-10 127.0 28.0
46 2022-02-11 127.0 28.0
47 2022-02-16 128.0 29.0
48 2022-02-24 129.0 30.0
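For reference, to_period('W') buckets each date into a weekly period with the default W-SUN anchor (Monday through Sunday), so two dates in the same calendar week map to the same period; that is what week2 ranks. A quick check (my own illustration):
print(pd.to_datetime('2019-09-10').to_period('W'))
# Period('2019-09-09/2019-09-15', 'W-SUN')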

Related

Stick the dataframe rows and column in one row+ replace the nan values with the day before or after

I have a df and I want to stick its values together. First, I want to select a specific time and replace the NaN values with the value from the day before. Here is a simple example: I only want to choose the values in 2020, stick their values together based on the time, and also replace each NaN value with the same value from the day before.
import numpy as np
import pandas as pd

df = pd.DataFrame()
df['day'] =[ '2020-01-01', '2019-01-01', '2020-01-02','2020-01-03', '2018-01-01', '2020-01-15','2020-03-01', '2020-02-01', '2017-01-01' ]
df['value_1'] = [ 1, np.nan, 32, 48, 5, -1, 5,10,2]
df['value_2'] = [ np.nan, 121, 23, 34, 15, 21, 15, 12, 39]
df
day value_1 value_2
0 2020-01-01 1.0 NaN
1 2019-01-01 NaN 121.0
2 2020-01-02 32.0 23.0
3 2020-01-03 48.0 34.0
4 2018-01-01 5.0 15.0
5 2020-01-15 -1.0 21.0
6 2020-03-01 5.0 15.0
7 2020-02-01 10.0 12.0
8 2017-01-01 2.0 39.0
The output:
_1 _2 _3 _4 _5 _6 _7 _8 _9 _10 _11 _12
0 1 121 1 23 48 34 -1 21 10 12 -1 21
I have tried to use the following code, but it does not solve my problem:
val_cols = df.filter(like='value_').columns
output = (df.pivot('day', val_cols).groupby(level=0, axis=1).apply(lambda x:x.ffill(axis=1).bfill(axis=1)).sort_index(axis=1, level=1))
I don't know what the output is supposed to be, but I think this should do at least part of what you're trying to do:
df['day'] = pd.to_datetime(df['day'], format='%Y-%m-%d')
df = df.sort_values(by=['day'])
filter_2020 = df['day'].dt.year == 2020
val_cols = df.filter(like='value_').columns
df.loc[filter_2020, val_cols] = df.loc[:,val_cols].ffill().loc[filter_2020]
print(df)
day value_1 value_2
8 2017-01-01 2.0 39.0
4 2018-01-01 5.0 15.0
1 2019-01-01 NaN 121.0
0 2020-01-01 1.0 121.0
2 2020-01-02 32.0 23.0
3 2020-01-03 48.0 34.0
5 2020-01-15 -1.0 21.0
7 2020-02-01 10.0 12.0
6 2020-03-01 5.0 15.0
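If the final goal really is the single wide row shown in the question, one way to finish (a guess on my part, assuming the target is the filled 2020 rows' value columns laid out left to right in date order, with hypothetical column names _1, _2, ...) is:
# flatten the filled 2020 rows into one wide row
flat = df.loc[filter_2020, val_cols].to_numpy().ravel()
one_row = pd.DataFrame([flat], columns=[f'_{i+1}' for i in range(len(flat))])
print(one_row)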

pandas dataframe interpolate for Nans with groupby using window of discrete days of the year

The small reproducible example below sets up a dataframe that is 100 years in length, containing some randomly generated values. It then inserts 3 100-day stretches of missing values. Using this small example, I am attempting to sort out the pandas commands that will fill in the missing days using average values for that day of the year (hence the use of .groupby) with a condition. For example, if April 12th is missing, how can the last line of code be altered such that only the 10 nearest April 12ths are used to fill in the missing value? In other words, a missing April 12th value in 1920 would be filled in using the mean of the April 12th values from 1915 to 1925; a missing April 12th value in 2000 would be filled in with the mean of the April 12th values from 1995 to 2005, etc. I tried playing around with adding a .rolling() to the lambda function in the last line of the script, but was unsuccessful in my attempt.
Bonus question: The example below extends from 1918 to 2018. If a value is missing on April 12th 1919, for example, it would still be nice if ten April 12ths were used to fill in the missing value even though the window couldn't be 'centered' on the missing day because of its proximity to the beginning of the time series. Is there a solution to the first question above that would be flexible enough to still use a minimum of 10 values when missing values are close to the beginning and ending of the time series?
import pandas as pd
import numpy as np
import random
# create 100 yr time series
dates = pd.date_range(start="1918-01-01", end="2018-12-31").strftime("%Y-%m-%d")
vals = [random.randrange(1, 50, 1) for i in range(len(dates))]
# Create some arbitrary gaps
vals[100:200] = vals[9962:10062] = vals[35895:35995] = [np.nan] * 100
# Create dataframe
df = pd.DataFrame(dict(
list(
zip(["Date", "vals"],
[dates, vals])
)
))
# confirm missing vals
df.iloc[95:105]
df.iloc[35890:35900]
# set a date index (for use by groupby)
df.index = pd.DatetimeIndex(df['Date'])
df['Date'] = df.index
# Need help restricting the mean to the 10 nearest same-days-of-the-year:
df['vals'] = df.groupby([df.index.month, df.index.day])['vals'].transform(lambda x: x.fillna(x.mean()))
This answers both parts:
- build a DF dfr that holds the calculation you want
- the lambda function returns a dict {year: val, ...}
- make sure the indexes are named in a reasonable way
- expand the dict out into columns with apply(pd.Series)
- reshape by putting the year columns back into the index
- merge() the built DF with the original DF; the vals column contains the NaNs and column 0 holds the values to fill with
- finally, fillna()
import random
import numpy as np
import pandas as pd

# create 100 yr time series
dates = pd.date_range(start="1918-01-01", end="2018-12-31")
vals = [random.randrange(1, 50, 1) for i in range(len(dates))]
# Create some arbitrary gaps
vals[100:200] = vals[9962:10062] = vals[35895:35995] = [np.nan] * 100
# Create dataframe - simplified from question...
df = pd.DataFrame({"Date":dates,"vals":vals})
df[df.isna().any(axis=1)]
ystart = df.Date.dt.year.min()
# generate rolling means for month/day. bfill for when it's start of series
dfr = (df.groupby([df.Date.dt.month, df.Date.dt.day])["vals"]
.agg(lambda s: {y+ystart:v for y,v in enumerate(s.dropna().rolling(5).mean().bfill())})
.to_frame().rename_axis(["month","day"])
)
# expand dict into columns and reshape to by indexed by month,day,year
dfr = dfr.join(dfr.vals.apply(pd.Series)).drop(columns="vals").rename_axis("year",axis=1).stack().to_frame()
# get df index back, plus vals & fillna (column 0) can be seen alongside each other
dfm = df.merge(dfr, left_on=[df.Date.dt.month,df.Date.dt.day,df.Date.dt.year], right_index=True)
# finally what we really want to do - fill tha NaNs
df.fillna(dfm[0])
analysis
Taking the NaN for 11-Apr-1918: the default fill is 22, as it is backfilled from the first complete 5-year rolling window:
(12+2+47+47+2)/5 == 22
dfm.query("key_0==4 & key_1==11").head(7)
      key_0  key_1  key_2                 Date  vals     0
100       4     11   1918  1918-04-11 00:00:00   NaN  22.0
465       4     11   1919  1919-04-11 00:00:00  12.0  22.0
831       4     11   1920  1920-04-11 00:00:00   2.0  22.0
1196      4     11   1921  1921-04-11 00:00:00  47.0  27.0
1561      4     11   1922  1922-04-11 00:00:00  47.0  36.0
1926      4     11   1923  1923-04-11 00:00:00   2.0  34.6
2292      4     11   1924  1924-04-11 00:00:00  37.0  29.4
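A more direct variation on the main question (my own sketch, not taken from the answer above): a centered rolling window inside each month/day group, with min_periods=1, averages up to the 11 nearest years (5 on each side plus the year itself) and simply shrinks near the edges of the series:
# fill each missing value with the mean of the nearest same-day-of-year values
df['vals'] = (df.groupby([df.Date.dt.month, df.Date.dt.day])['vals']
                .transform(lambda s: s.fillna(
                    s.rolling(11, center=True, min_periods=1).mean())))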
I'm not sure how closely I've matched the intent of your question. The approach I've taken satisfies two requirements:
- use an arbitrary number of averages
- use those averages to fill in the NAs
Simply put, instead of filling in the NAs from the dates before and after, I fill them in with averages extracted from a sample of the year columns.
import pandas as pd
import numpy as np
import random
# create 100 yr time series
dates = pd.date_range(start="1918-01-01", end="2018-12-31").strftime("%Y-%m-%d")
vals = [random.randrange(1, 50, 1) for i in range(len(dates))]
# Create some arbitrary gaps
vals[100:200] = vals[9962:10062] = vals[35895:35995] = [np.nan] * 100
# Create dataframe
df = pd.DataFrame(dict(
list(
zip(["Date", "vals"],
[dates, vals])
)
))
df['Date'] = pd.to_datetime(df['Date'])
df['mm-dd'] = df['Date'].apply(lambda x:'{:02}-{:02}'.format(x.month, x.day))
df['yyyy'] = df['Date'].apply(lambda x:'{:04}'.format(x.year))
df = df.iloc[:,1:].pivot(index='mm-dd', columns='yyyy')
df.columns = df.columns.droplevel(0)
df['nans'] = df.isnull().sum(axis=1)
df['10n_mean'] = df.iloc[:,:-1].sample(n=10, axis=1).mean(axis=1)
df['10n_mean'] = df['10n_mean'].round(1)
df.loc[df['nans'] >= 1]
yyyy 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 ... 2011 2012 2013 2014 2015 2016 2017 2018 nans 10n_mean
mm-dd
02-29 NaN NaN 34.0 NaN NaN NaN 2.0 NaN NaN NaN ... NaN 49.0 NaN NaN NaN 32.0 NaN NaN 76 21.6
04-11 NaN 43.0 12.0 28.0 29.0 28.0 1.0 38.0 11.0 3.0 ... 17.0 35.0 8.0 17.0 34.0 NaN 5.0 33.0 3 29.7
04-12 NaN 19.0 38.0 34.0 48.0 46.0 28.0 29.0 29.0 14.0 ... 41.0 16.0 9.0 39.0 8.0 NaN 1.0 12.0 3 21.3
04-13 NaN 33.0 26.0 47.0 21.0 26.0 20.0 16.0 11.0 7.0 ... 5.0 11.0 34.0 28.0 27.0 NaN 2.0 46.0 3 21.3
04-14 NaN 36.0 19.0 6.0 45.0 41.0 24.0 39.0 1.0 11.0 ... 30.0 47.0 45.0 14.0 48.0 NaN 16.0 8.0 3 24.7
df_mean = df.T.fillna(df['10n_mean'], downcast='infer').T
df_mean.loc[df_mean['nans'] >= 1]
yyyy 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 ... 2011 2012 2013 2014 2015 2016 2017 2018 nans 10n_mean
mm-dd
02-29 21.6 21.6 34.0 21.6 21.6 21.6 2.0 21.6 21.6 21.6 ... 21.6 49.0 21.6 21.6 21.6 32.0 21.6 21.6 76.0 21.6
04-11 29.7 43.0 12.0 28.0 29.0 28.0 1.0 38.0 11.0 3.0 ... 17.0 35.0 8.0 17.0 34.0 29.7 5.0 33.0 3.0 29.7
04-12 21.3 19.0 38.0 34.0 48.0 46.0 28.0 29.0 29.0 14.0 ... 41.0 16.0 9.0 39.0 8.0 21.3 1.0 12.0 3.0 21.3
04-13 21.3 33.0 26.0 47.0 21.0 26.0 20.0 16.0 11.0 7.0 ... 5.0 11.0 34.0 28.0 27.0 21.3 2.0 46.0 3.0 21.3
04-14 24.7 36.0 19.0 6.0 45.0 41.0 24.0 39.0 1.0 11.0 ... 30.0 47.0 45.0 14.0 48.0 24.7 16.0 8.0 3.0 24.7
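One caveat on this approach (my observation): sample(n=10, axis=1) draws 10 random year columns rather than the 10 nearest years, and the draw changes on every run unless it is seeded. In place of the two 10n_mean lines above:
# seed the column draw so the fallback means are reproducible
year_cols = df.columns[:-1]  # everything except the helper 'nans' column
df['10n_mean'] = (df[year_cols]
                    .sample(n=10, axis=1, random_state=0)
                    .mean(axis=1)
                    .round(1))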

Pandas rolling function with only dates in the dataframe

Hey, I have a question about the pandas rolling function.
I am currently using it to get the mean of the last 10 days of my time-series data.
Example df:
column
2020-12-04 14
2020-12-05 15
2020-12-06 16
2020-12-07 17
2020-12-08 18
2020-12-09 19
2020-12-13 20
2020-12-14 11
2020-12-16 12
2020-12-17 13
Usage:
df['column'].rolling('10D').mean()
But the function calculates the rolling mean over 10 calendar days: if the current row's date is 2020-12-17, it only looks back as far as 2020-12-07.
However, I would like the rolling mean over the last 10 dates that actually appear in the data frame, i.e. going back to 2020-12-04.
How can I achieve it?
Edit: I can also have a datetime index at 15-minute intervals, so doing window=10 does not help in that case, though it works here.
As said in the comments by @cs95, if you want to consider only the rows that are in the dataframe, you can ignore that your data is part of a time series and just specify a window sized by a number of rows instead of a number of days. In essence:
df['column'].rolling(window=10).mean()
Just one little detail to remember: you have missing dates in your dataframe. You should fill those in, otherwise it will not be a 10-day window; instead you would have a 10-date rolling window, which would be pretty meaningless if dates are randomly missing.
r = pd.date_range(start=df1.Date.min(), end=df1.Date.max())
df1 = df1.set_index('Date').reindex(r).fillna(0).rename_axis('Date').reset_index()
which gives you the dataframe:
Date column
0 2020-12-04 14.0
1 2020-12-05 15.0
2 2020-12-06 16.0
3 2020-12-07 17.0
4 2020-12-08 18.0
5 2020-12-09 19.0
6 2020-12-10 0.0
7 2020-12-11 0.0
8 2020-12-12 0.0
9 2020-12-13 20.0
10 2020-12-14 11.0
11 2020-12-15 0.0
12 2020-12-16 12.0
13 2020-12-17 13.0
Then applying:
df1['Mean']=df1['column'].rolling(window=10).mean()
returns
Date column Mean
0 2020-12-04 14.0 NaN
1 2020-12-05 15.0 NaN
2 2020-12-06 16.0 NaN
3 2020-12-07 17.0 NaN
4 2020-12-08 18.0 NaN
5 2020-12-09 19.0 NaN
6 2020-12-10 0.0 NaN
7 2020-12-11 0.0 NaN
8 2020-12-12 0.0 NaN
9 2020-12-13 20.0 11.9
10 2020-12-14 11.0 11.6
11 2020-12-15 0.0 10.1
12 2020-12-16 12.0 9.7
13 2020-12-17 13.0 9.3
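For the 15-minute case mentioned in the question's edit, the same reindex idea should carry over by generating the range at the data's native frequency (a sketch under that assumption, for a fresh df1 with a Date column at 15-minute granularity):
# fill gaps at the native 15-minute frequency, then roll over a row count
r = pd.date_range(start=df1.Date.min(), end=df1.Date.max(), freq='15T')
df1 = (df1.set_index('Date').reindex(r, fill_value=0)
          .rename_axis('Date').reset_index())
df1['Mean'] = df1['column'].rolling(window=10).mean()  # last 10 slots = 2.5 hours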

Calculate sum of Column grouped by hour

I am trying to calculate the total cost of staffing requirements over a day. My attempt is to group the people required throughout the day and multiply by the cost. I then try to group this cost per hour, but my output isn't correct.
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as dates
d = ({
'Time' : ['0/1/1900 8:00:00','0/1/1900 9:59:00','0/1/1900 10:00:00','0/1/1900 12:29:00','0/1/1900 12:30:00','0/1/1900 13:00:00','0/1/1900 13:02:00','0/1/1900 13:15:00','0/1/1900 13:20:00','0/1/1900 18:10:00','0/1/1900 18:15:00','0/1/1900 18:20:00','0/1/1900 18:25:00','0/1/1900 18:45:00','0/1/1900 18:50:00','0/1/1900 19:05:00','0/1/1900 19:07:00','0/1/1900 21:57:00','0/1/1900 22:00:00','0/1/1900 22:30:00','0/1/1900 22:35:00','1/1/1900 3:00:00','1/1/1900 3:05:00','1/1/1900 3:20:00','1/1/1900 3:25:00'],
'People' : [1,1,2,2,3,3,2,2,3,3,4,4,3,3,2,2,3,3,4,4,3,3,2,2,1],
})
df = pd.DataFrame(data = d)
df['Time'] = ['/'.join([str(int(x.split('/')[0])+1)] + x.split('/')[1:]) for x in df['Time']]
df['Time'] = pd.to_datetime(df['Time'], format='%d/%m/%Y %H:%M:%S')
formatter = dates.DateFormatter('%Y-%m-%d %H:%M:%S')
df = df.groupby(pd.Grouper(freq='15T',key='Time'))['People'].max().ffill()
df = df.reset_index(level=['Time'])
df['Cost'] = df['People'] * 26
cost = df.groupby([df['Time'].dt.hour])['Cost'].sum()
#For reference. This plot displays people required throughout the day
fig, ax = plt.subplots(figsize = (10,5))
plt.plot(df['Time'], df['People'], color = 'blue')
plt.locator_params(axis='y', nbins=6)
ax.xaxis.set_major_formatter(formatter)
ax.xaxis.set_major_formatter(dates.DateFormatter('%H:%M:%S'))
plt.ylabel('People Required', labelpad = 10)
plt.xlabel('Time', labelpad = 10)
print(cost)
Out:
0 416.0
1 416.0
2 416.0
3 130.0
8 104.0
9 104.0
10 208.0
11 208.0
12 260.0
13 312.0
14 312.0
15 312.0
16 312.0
17 312.0
18 364.0
19 312.0
20 312.0
21 312.0
22 416.0
23 416.0
I have done the calculations manually and the total cost output should be:
$1456
I think the wrong numbers in your question are most likely caused by the incorrect datetime values you have. Once you have fixed those, you should get the correct numbers. Here's an attempt from my end, with a little tweak to the Time column.
import pandas as pd
df = pd.DataFrame({
'Time' : ['1/1/1900 8:00:00','1/1/1900 9:59:00','1/1/1900 10:00:00','1/1/1900 12:29:00','1/1/1900 12:30:00','1/1/1900 13:00:00','1/1/1900 13:02:00','1/1/1900 13:15:00','1/1/1900 13:20:00','1/1/1900 18:10:00','1/1/1900 18:15:00','1/1/1900 18:20:00','1/1/1900 18:25:00','1/1/1900 18:45:00','1/1/1900 18:50:00','1/1/1900 19:05:00','1/1/1900 19:07:00','1/1/1900 21:57:00','1/1/1900 22:00:00','1/1/1900 22:30:00','1/1/1900 22:35:00','1/2/1900 3:00:00','1/2/1900 3:05:00','1/2/1900 3:20:00','1/2/1900 3:25:00'],
'People' : [1,1,2,2,3,3,2,2,3,3,4,4,3,3,2,2,3,3,4,4,3,3,2,2,1],
})
>>>df
Time People
0 1/1/1900 8:00:00 1
1 1/1/1900 9:59:00 1
2 1/1/1900 10:00:00 2
3 1/1/1900 12:29:00 2
4 1/1/1900 12:30:00 3
5 1/1/1900 13:00:00 3
6 1/1/1900 13:02:00 2
7 1/1/1900 13:15:00 2
8 1/1/1900 13:20:00 3
9 1/1/1900 18:10:00 3
10 1/1/1900 18:15:00 4
11 1/1/1900 18:20:00 4
12 1/1/1900 18:25:00 3
13 1/1/1900 18:45:00 3
14 1/1/1900 18:50:00 2
15 1/1/1900 19:05:00 2
16 1/1/1900 19:07:00 3
17 1/1/1900 21:57:00 3
18 1/1/1900 22:00:00 4
19 1/1/1900 22:30:00 4
20 1/1/1900 22:35:00 3
21 1/2/1900 3:00:00 3
22 1/2/1900 3:05:00 2
23 1/2/1900 3:20:00 2
24 1/2/1900 3:25:00 1
df.Time = pd.to_datetime(df.Time)
df.set_index('Time', inplace=True)
df_group = df.resample('15T').max().ffill()
df_hour = df_group.resample('1h').max()
df_hour['Cost'] = df_hour['People'] * 26
>>>df_hour
People Cost
Time
1900-01-01 08:00:00 1.0 26.0
1900-01-01 09:00:00 1.0 26.0
1900-01-01 10:00:00 2.0 52.0
1900-01-01 11:00:00 2.0 52.0
1900-01-01 12:00:00 3.0 78.0
1900-01-01 13:00:00 3.0 78.0
1900-01-01 14:00:00 3.0 78.0
1900-01-01 15:00:00 3.0 78.0
1900-01-01 16:00:00 3.0 78.0
1900-01-01 17:00:00 3.0 78.0
1900-01-01 18:00:00 4.0 104.0
1900-01-01 19:00:00 3.0 78.0
1900-01-01 20:00:00 3.0 78.0
1900-01-01 21:00:00 3.0 78.0
1900-01-01 22:00:00 4.0 104.0
1900-01-01 23:00:00 4.0 104.0
1900-01-02 00:00:00 4.0 104.0
1900-01-02 01:00:00 4.0 104.0
1900-01-02 02:00:00 4.0 104.0
1900-01-02 03:00:00 3.0 78.0
>>>df_hour.sum()
People 60.0
Cost 1560.0
dtype: float64
Edit: it took me a second read to realize the methodology you're using. The incorrect number you got is likely due to grouping by sum() after you performed a ffill() on your aggregated People column. Since ffill() fills the holes from the last valid value, you actually overestimated your cost for those periods. You should use max() again to find the maximum headcount required for each hour.
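To make that concrete (my own illustration, reusing the Time-indexed df built above): summing the four forward-filled 15-minute slots within each hour charges that hour's headcount roughly four times, while taking the hourly max charges it once:
per_slot = df.resample('15T').max().ffill()
inflated = (per_slot['People'] * 26).resample('1h').sum()  # roughly 4x too high
correct = per_slot['People'].resample('1h').max() * 26     # one charge per hour
print(inflated.sum(), correct.sum())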

Want MultiIndex for rows and columns with read_csv

My .csv file looks like:
Area When Year Month Tickets
City Day 2015 1 14
City Night 2015 1 5
Rural Day 2015 1 18
Rural Night 2015 1 21
Suburbs Day 2015 1 15
Suburbs Night 2015 1 21
City Day 2015 2 13
containing 75 rows. I want both a row multiindex and column multiindex that looks like:
Area City Rural Suburbs
When Day Night Day Night Day Night
Year Month
2015 1 5.0 3.0 22.0 11.0 13.0 2.0
2 22.0 8.0 4.0 16.0 6.0 18.0
3 26.0 25.0 22.0 23.0 22.0 2.0
2016 1 20.0 25.0 39.0 14.0 3.0 10.0
2 4.0 14.0 16.0 26.0 1.0 24.0
3 22.0 17.0 7.0 24.0 12.0 20.0
I've read the .read_csv doc at https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
I can get the row multiindex with:
df2 = pd.read_csv('c:\\Data\\Tickets.csv', index_col=[2, 3])
I've tried:
df2 = pd.read_csv('c:\\Data\\Tickets.csv', index_col=[2, 3], header=[1, 3, 5])
thinking [1, 3, 5] fetches 'City', 'Rural', and 'Suburbs'. How do I get the desired column multiindex shown above?
Seems like you need pivot_table with multiple indexes and multiple columns.
Start with just reading your csv plainly:
df = pd.read_csv('Tickets.csv')
Then
df.pivot_table(index=['Year', 'Month'], columns=['Area', 'When'], values='Tickets')
With the input data you provided, you'd get
Area City Rural Suburbs
When Day Night Day Night Day Night
Year Month
2015 1 14.0 5.0 18.0 21.0 15.0 21.0
2 13.0 NaN NaN NaN NaN NaN
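An equivalent route without pivot_table (a small sketch of an alternative, not from the answer): since pivot_table averages duplicate (Year, Month, Area, When) combinations by default, the same result comes from a groupby mean followed by unstack:
# same shape via groupby + unstack
out = (df.groupby(['Year', 'Month', 'Area', 'When'])['Tickets']
         .mean()
         .unstack(['Area', 'When']))
print(out)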
