Pandas DataFrame Pivot Using Dates and Counts - python

I've taken a large data file and managed to use groupby and value_counts to get the dataframe below. However, I want to reshape it so the companies are on the left, the months are on top, and each cell holds the number of calls that month (currently the unnamed third column).
Here is my code to sort:
data = pd.DataFrame.from_csv('MYDATA.csv')
data[['recvd_dttm','CompanyName']]
data['recvd_dttm'].value_counts()
count = data.groupby(["recvd_dttm","CompanyName"]).size()
df = pd.DataFrame(count)
df.pivot(index='recvd_dttm', columns='CompanyName', values='NumberCalls')
Here is my output df=
recvd_dttm CompanyName
1/1/2015 11:42 Company 1 1
1/1/2015 14:29 Company 2 1
1/1/2015 8:12 Company 4 1
1/1/2015 9:53 Company 1 1
1/10/2015 11:38 Company 3 1
1/10/2015 11:31 Company 5 1
1/10/2015 12:04 Company 2 1
I want
Company Jan Feb Mar Apr May
Company 1 10 4 45 40 34
Company 2 2 5 56 5 57
Company 3 3 7 71 6 53
Company 4 4 4 38 32 2
Company 5 20 3 3 3 29
I know that there is a nifty pivot function for dataframes, documented at http://pandas.pydata.org/pandas-docs/stable/reshaping.html, so I've been trying to use df.pivot(index='recvd_dttm', columns='CompanyName', values='NumberCalls')
One problem is that the third column doesn't have a name, so I can't use it for values='NumberCalls'. The second problem is figuring out how to take the datetime format in my dataframe and display it by month only.
Edit:
CompanyName is the first column, recvd_dttm is the 15th column. This is my code after some more attempts:
data = pd.DataFrame.from_csv('MYDATA.csv')
data[['recvd_dttm','CompanyName']]
data['recvd_dttm'].value_counts()
RatedCustomerCallers = data['CompanyName'].value_counts()
count = data.groupby(["recvd_dttm","CompanyName"]).size()
df = pd.DataFrame(count).set_index('recvd_dttm').sort_index()
df.index = pd.to_datetime(df.index, format='%m/%d/%Y %H:%M')
result = df.groupby([lambda idx: idx.month, 'CompanyName']).agg({df.columns[1]: sum}).reset_index()
result.columns = ['Month', 'CompanyName', 'NumberCalls']
result.pivot(index='recvd_dttm', columns='CompanyName', values='NumberCalls')
It is throwing this error: KeyError: 'recvd_dttm' and won't get to the result line.

You need to aggregate the data before creating the pivot table. If the column has no name, you can either refer to it positionally with df.iloc[:, 1] (the 2nd column) or simply rename the column.
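For example, a minimal sketch of the renaming route (assuming data is loaded as in your question):
# give the unnamed size() column a name so pivot can reference it
count = data.groupby(['recvd_dttm', 'CompanyName']).size()
df = count.reset_index(name='NumberCalls')
The walk-through below takes the positional route instead, on simulated data.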
import pandas as pd
import numpy as np
# just simulate your data
np.random.seed(0)
dates = np.random.choice(pd.date_range('2015-01-01 00:00:00', '2015-06-30 00:00:00', freq='1h'), 10000)
company = np.random.choice(['company' + x for x in '1 2 3 4 5'.split()], 10000)
df = pd.DataFrame(dict(recvd_dttm=dates, CompanyName=company)).set_index('recvd_dttm').sort_index()
df['C'] = 1                        # one call per row, so counting reduces to a sum
df.columns = ['CompanyName', '']   # blank the count column's name to mimic your unnamed column
Out[34]:
CompanyName
recvd_dttm
2015-01-01 00:00:00 company2 1
2015-01-01 00:00:00 company2 1
2015-01-01 00:00:00 company1 1
2015-01-01 00:00:00 company2 1
2015-01-01 01:00:00 company4 1
2015-01-01 01:00:00 company2 1
2015-01-01 01:00:00 company5 1
2015-01-01 03:00:00 company3 1
2015-01-01 03:00:00 company2 1
2015-01-01 03:00:00 company3 1
2015-01-01 04:00:00 company4 1
2015-01-01 04:00:00 company1 1
2015-01-01 04:00:00 company3 1
2015-01-01 05:00:00 company2 1
2015-01-01 06:00:00 company5 1
... ... ..
2015-06-29 19:00:00 company2 1
2015-06-29 19:00:00 company2 1
2015-06-29 19:00:00 company3 1
2015-06-29 19:00:00 company3 1
2015-06-29 19:00:00 company5 1
2015-06-29 19:00:00 company5 1
2015-06-29 20:00:00 company1 1
2015-06-29 20:00:00 company4 1
2015-06-29 22:00:00 company1 1
2015-06-29 22:00:00 company2 1
2015-06-29 22:00:00 company4 1
2015-06-30 00:00:00 company1 1
2015-06-30 00:00:00 company2 1
2015-06-30 00:00:00 company1 1
2015-06-30 00:00:00 company4 1
[10000 rows x 2 columns]
# first group by month and company name, calculate the sum of calls, and reset the index
# since that column has no name, simply tell pandas it is the 2nd column we count on
result = df.groupby([lambda idx: idx.month, 'CompanyName']).agg({df.columns[1]: sum}).reset_index()
# rename the columns
result.columns = ['Month', 'CompanyName', 'counts']
Out[41]:
Month CompanyName counts
0 1 company1 328
1 1 company2 337
2 1 company3 342
3 1 company4 345
4 1 company5 331
5 2 company1 295
6 2 company2 300
7 2 company3 328
8 2 company4 304
9 2 company5 329
10 3 company1 366
11 3 company2 398
12 3 company3 339
13 3 company4 336
14 3 company5 345
15 4 company1 322
16 4 company2 348
17 4 company3 351
18 4 company4 340
19 4 company5 312
20 5 company1 347
21 5 company2 354
22 5 company3 347
23 5 company4 363
24 5 company5 312
25 6 company1 316
26 6 company2 311
27 6 company3 331
28 6 company4 307
29 6 company5 316
# create pivot table
result.pivot(index='CompanyName', columns='Month', values='counts')
Out[44]:
Month 1 2 3 4 5 6
CompanyName
company1 328 295 366 322 347 316
company2 337 300 398 348 354 311
company3 342 328 339 351 347 331
company4 345 304 336 340 363 307
company5 331 329 345 312 312 316
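For reference, once recvd_dttm is a DatetimeIndex the whole reshape can be sketched in a single line with pd.crosstab (an alternative, not what the steps above use):
# rows = companies, columns = month numbers, cells = call counts
pd.crosstab(df['CompanyName'], df.index.month)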

How to replicate INDEX/MATCH (with multiple criteria) in Python with multiple dataframes?

I am trying to replicate an Excel model in Python to automate it as I scale it up, but I am stuck on how to translate the complex formulas into Python.
I have information in three dataframes:
DF1:
ID type 1  ID type 2  Unit
a          1_a        400
b          1_b        26
c          1_c        23
d          1_b        45
e          1_d        24
f          1_b        85
g          1_a        98
DF2:
ID type 1  ID type 2  Tech
a          1_a        wind
b          1_b        solar
c          1_c        gas
d          1_b        coal
e          1_d        wind
f          1_b        gas
g          1_a        coal
And DF 3, the main DF:
Date        Time      ID type 1  ID type 2  Period  output  Unit *  Tech *
03/01/2022  02:30:00  a          1_a        1       254
03/01/2022  02:30:00  b          1_b        1       456
03/01/2022  02:30:00  c          1_c        2       3325
03/01/2022  02:30:00  d          1_b        2       1254
05/01/2022  02:30:00  e          1_d        3       489
05/01/2022  02:30:00  a          1_a        3       452
05/01/2022  02:30:00  b          1_b        4       12
05/01/2022  02:30:00  c          1_c        4       1
05/01/2022  03:00:00  d          1_b        35      54
05/01/2022  03:00:00  e          1_d        35      48
05/01/2022  03:00:00  a          1_a        48      56
I wish to bring the "Unit" and "Tech" information from DF1 and DF2 into DF3 for each ID type. The conditional statements I have in Excel at the moment are based on INDEX, MATCH and IFNA: some of the IDs in DF3 will match on either ID type 1 or ID type 2, so the function checks both columns and, on a positive match, yields the required result.
For more context, DF1 and DF2 do not change, but DF3 changes, and I need a function for that which I will explain later.
The Excel function I use to fill in Unit * from DF1 is (note I have replaced the Excel sheet name with DF1 to help conceptualize the problem):
=IFNA(INDEX('DF1'!$K$3:$K$1011,MATCH(N2,'DF1'!$E$3:$E$1011,0)),INDEX('DF1'!$K$3:$K$1011,MATCH(M2,'DF1'!$D$3:$D$1011,0)))
The Excel function I use to fill in Tech * is a bit more straightforward:
=IFNA(INDEX('DF2'$L:$L,MATCH(O3,'DF2'$K:$K,0)),INDEX('DF2'$L:$L,MATCH(N3,'DF2'$J:$J,0)))
That is the main stumbling block at the moment, but after this is achieved I need a function that for each day produces the following DF:
ID type 1  Tech   Period 1                                      Period 2  Period 3  Period 4  Period 5  Period 6  Period 7  …
a          wind   Sum of output for this ID type 1 and Period 1
b          solar
c          gas
d          coal
e          wind
a          gas
…          …
The idea here is that I can use a conditional function again to sum the "output" column of DF3 under conditions on date, ID type and period number.
EDIT: Output based on possible solution:
time settlementDate BM Unit ID 1 BM Unit ID 2 settlementPeriod \
0 00:00:00 03/01/2022 RCBKO-1 T_RCBKO-1 1
1 00:00:00 03/01/2022 LARYO-3 T_LARYW-3 1
2 00:00:00 03/01/2022 LAGA-1 T_LAGA-1 1
3 00:00:00 03/01/2022 CRMLW-1 T_CRMLW-1 1
4 00:00:00 03/01/2022 GRIFW-1 T_GRIFW-1 1
... ... ... ... ... ...
52533 23:30:00 08/01/2022 CRMLW-1 T_CRMLW-1 48
52534 23:30:00 08/01/2022 LARYO-4 T_LARYW-4 48
52535 23:30:00 08/01/2022 HOWBO-3 T_HOWBO-3 48
52536 23:30:00 08/01/2022 BETHW-1 E_BETHW-1 48
52537 23:30:00 08/01/2022 HMGTO-1 T_HMGTO-1 48
quantity Capacity_x Technology Technology_x \
0 278.658 NaN NaN WIND
1 162.940 NaN NaN WIND
2 262.200 NaN NaN CCGT
3 3.002 NaN NaN WIND
4 9.972 NaN NaN WIND
... ... ... ... ...
52533 8.506 NaN NaN WIND
52534 159.740 NaN NaN WIND
52535 32.554 NaN NaN NaN
52536 5.010 NaN NaN WIND
52537 92.094 NaN NaN WIND
Registered Resource Name_x Capacity_y Technology_y \
0 NaN NaN WIND
1 NaN NaN WIND
2 NaN NaN CCGT
3 NaN NaN WIND
4 NaN NaN WIND
... ... ... ...
52533 NaN NaN WIND
52534 NaN NaN WIND
52535 NaN NaN NaN
52536 NaN NaN WIND
52537 NaN NaN WIND
Registered Resource Name_y Capacity
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
... ... ...
52533 NaN NaN
52534 NaN NaN
52535 NaN NaN
52536 NaN NaN
52537 NaN NaN
[52538 rows x 14 columns]
EDIT: New query
ID Type 1  Tech  Period_1  Period_2  Period_3  Period_4  Period_35  Period_48
a          wind  450       0         0         0         0          0
b          wind  0         0         550       0         0          85
b          wind  0         0         895       0         452        0
(these values are means over all dates)
For the first part of your question you want to do a left merge on those 2 columns twice like this:
df3 = (
    df3
    .merge(df1, on=['ID type 1', 'ID type 2'], how='left')
    .merge(df2, on=['ID type 1', 'ID type 2'], how='left')
)
print(df3)
Date Time ID type 1 ID type 2 Period output Unit Tech
0 03/01/2022 02:30:00 a 1_a 1 254 400 wind
1 03/01/2022 02:30:00 b 1_b 1 456 26 solar
2 03/01/2022 02:30:00 c 1_c 2 3325 23 gas
3 03/01/2022 02:30:00 d 1_b 2 1254 45 coal
4 05/01/2022 02:30:00 e 1_d 3 489 24 wind
5 05/01/2022 02:30:00 a 1_a 3 452 400 wind
6 05/01/2022 02:30:00 b 1_b 4 12 26 solar
7 05/01/2022 02:30:00 c 1_c 4 1 23 gas
8 05/01/2022 03:00:00 d 1_b 35 54 45 coal
9 05/01/2022 03:00:00 e 1_d 35 48 24 wind
10 05/01/2022 03:00:00 a 1_a 48 56 400 wind
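Note that the merge above requires both ID columns to match together. If you need the Excel-style IFNA fallback (try ID type 2 first, then fall back to ID type 1), here is a sketch using two map lookups; drop_duplicates keeps the first match, the same way Excel's MATCH does:
unit_by_id2 = df1.drop_duplicates('ID type 2').set_index('ID type 2')['Unit']
unit_by_id1 = df1.drop_duplicates('ID type 1').set_index('ID type 1')['Unit']
df3['Unit'] = (df3['ID type 2'].map(unit_by_id2)
                 .fillna(df3['ID type 1'].map(unit_by_id1)))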
For the next part you could use a pandas.pivot_table.
out = (
    df3
    .pivot_table(
        index=['Date', 'ID type 1', 'Tech'],
        columns='Period',
        values='output',
        aggfunc='sum',
        fill_value=0)
    .add_prefix('Period_')
)
print(out)
Output:
Period Period_1 Period_2 Period_3 Period_4 Period_35 Period_48
Date ID type 1 Tech
03/01/2022 a wind 254 0 0 0 0 0
b solar 456 0 0 0 0 0
c gas 0 3325 0 0 0 0
d coal 0 1254 0 0 0 0
05/01/2022 a wind 0 0 452 0 0 56
b solar 0 0 0 12 0 0
c gas 0 0 0 1 0 0
d coal 0 0 0 0 54 0
e wind 0 0 489 0 48 0
I used fill_value to show you that option; without it you get NaN in those cells.
UPDATE:
From a question in the comments, to get the pivoted data for only one Technology (e.g. 'wind'):
out.loc[out.index.get_level_values('Tech')=='wind']
Period Period_1 Period_2 Period_3 Period_4 Period_35 Period_48
Date ID type 1 Tech
03/01/2022 a wind 254 0 0 0 0 0
05/01/2022 a wind 0 0 452 0 0 56
e wind 0 0 489 0 48 0
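For the later "mean of all dates" edit, the same pivot_table can be pointed at a mean instead. A sketch, assuming each Period should be averaged across days (so Date is dropped from the index):
mean_out = (
    df3
    .pivot_table(
        index=['ID type 1', 'Tech'],
        columns='Period',
        values='output',
        aggfunc='mean',
        fill_value=0)
    .add_prefix('Period_')
)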

Check if a value follows a criterion for subsequent timeseries values

I have a dataframe that looks like
ID DATE PROFIT
2342 2017-03-01 457
2342 2017-06-01 658
2342 2017-09-01 3456
2342 2017-12-01 345
2342 2018-03-01 235
2342 2018-06-01 23
808 2017-03-01 9346
808 2017-06-01 54
808 2017-09-01 314
808 2017-12-01 57
....
....
For each ID:
Let's say I want to find out if the Profit has stayed between 200 and 1000.
I want to do it in such a way that a counter (a new column) indicates how many quarters (the latest and previous ones) in succession have satisfied this condition.
If for some reason one of the intermediate quarters does not match the condition, the counter should reset.
I am thinking of using the shift functionality to access and condition on previous rows; however, if there is a better way to check this condition across datetime values, it would be good to know.
Solution if all datetimes are consecutive:
Use GroupBy.tail with 5 for the last and previous 4 quarters, compare with Series.lt, add the missing rows back with Series.reindex, and if necessary cast to integer to map True/False to 1/0:
df['flag'] = (df.groupby('ID')['PROFIT']
                .tail(5)
                .lt(1000)
                .reindex(df.index, fill_value=False)
                .astype(int))
print (df)
ID DATE PROFIT flag
0 2342 2017-03-01 457 0 # <- 6th value, outside tail(5)
1 2342 2017-06-01 658 1
2 2342 2017-09-01 3456 0
3 2342 2017-12-01 345 1
4 2342 2018-03-01 235 1
5 2342 2018-06-01 23 1
6 808 2017-03-01 9346 0
7 808 2017-06-01 54 1
8 808 2017-09-01 314 1
9 808 2017-12-01 57 1
EDIT: for a counter column, create the flag with Series.between, build consecutive-streak groups by comparing with the previous row via DataFrame.ne (!=) against DataFrame.shift and taking DataFrame.cumsum, and finally use GroupBy.cumcount, multiplying by the flag with Series.mul to reset the all-zero streaks to 0:
# flag rows whose PROFIT lies inside the 200-1000 band
df['flag'] = df['PROFIT'].between(200, 1000).astype(int)
# consecutive-group ids: increment whenever ID or flag changes from the previous row
df1 = df[['ID','flag']].ne(df[['ID','flag']].shift()).cumsum()
g = df.groupby([df1['ID'], df1['flag']])
# running position within each streak; multiplying by flag zeroes the non-matching streaks
df['counter1'] = g.cumcount().add(1).mul(df['flag'])
# the same count taken from the end of the streak
df['counter2'] = g.cumcount(ascending=False).add(1).mul(df['flag'])
print (df)
ID DATE PROFIT flag counter1 counter2
0 2342 2017-03-01 457 1 1 2
1 2342 2017-06-01 658 1 2 1
2 2342 2017-09-01 3456 0 0 0
3 2342 2017-12-01 345 1 1 3
4 2342 2018-03-01 235 1 2 2
5 2342 2018-06-01 230 1 3 1
6 808 2017-03-01 934 1 1 2
7 808 2017-06-01 540 1 2 1
8 808 2017-09-01 34 0 0 0
9 808 2017-12-01 57 0 0 0
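A more compact sketch of the same streak trick, equivalent to counter1 (assumes df as above):
flag = df['PROFIT'].between(200, 1000).astype(int)
# every non-matching row starts a new streak group within its ID
streak = flag.eq(0).groupby(df['ID']).cumsum()
df['counter'] = flag.groupby([df['ID'], streak]).cumsum()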

Parsing week of year to datetime objects with pandas

A B C D yearweek
0 245 95 60 30 2014-48
1 245 15 70 25 2014-49
2 150 275 385 175 2014-50
3 100 260 170 335 2014-51
4 580 925 535 2590 2015-02
5 630 126 485 2115 2015-03
6 425 90 905 1085 2015-04
7 210 670 655 945 2015-05
The last column contains the year along with the week number. Is it possible to convert this to a datetime column with pd.to_datetime?
I've tried:
pd.to_datetime(df.yearweek, format='%Y-%U')
0 2014-01-01
1 2014-01-01
2 2014-01-01
3 2014-01-01
4 2015-01-01
5 2015-01-01
6 2015-01-01
7 2015-01-01
Name: yearweek, dtype: datetime64[ns]
But that output is incorrect, and I believe %U should be the format string for the week number. What am I missing here?
The week number is ignored unless a weekday is also supplied, which is why everything collapses to January 1st. Add a directive for the day of the week - check this:
df = pd.to_datetime(df.yearweek.add('-0'), format='%Y-%W-%w')
print (df)
0 2014-12-07
1 2014-12-14
2 2014-12-21
3 2014-12-28
4 2015-01-18
5 2015-01-25
6 2015-02-01
7 2015-02-08
Name: yearweek, dtype: datetime64[ns]
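If the numbers are meant as ISO week numbers, recent Python (3.6+) and recent pandas also accept the ISO directives, which sidestep the %W/%U ambiguity - a sketch, applied to the original yearweek column:
# %G = ISO year, %V = ISO week, %u = ISO weekday (1 = Monday)
pd.to_datetime(df.yearweek.add('-1'), format='%G-%V-%u')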

Sum set of values from pandas dataframe within certain time frame

I have a fairly complicated question. I need to select rows from a data frame within a certain set of start and end dates, and then sum those values and put them in a new dataframe.
So I start off with a data frame, df:
import pandas as pd
import random
dates = pd.date_range('20150101 020000',periods=1000)
df = pd.DataFrame({'_id': random.choice(range(0, 1000)),
'time_stamp': dates,
'value': random.choice(range(2,60))
})
and define some start and end dates:
start_date = ["2-13-16", "2-23-16", "3-17-16", "3-24-16", "3-26-16", "5-17-16", "5-25-16", "10-10-16", "10-18-16", "10-23-16", "10-31-16", "11-7-16", "11-14-16", "11-22-16", "1-23-17", "1-29-17", "2-06-17", "3-11-17", "3-23-17", "6-21-17", "6-28-17"]
end_date = pd.DatetimeIndex(start_date) + pd.DateOffset(7)
Then what needs to happen is that I need to create a new data frame with a weekly_sum column, which sums the value column of df for the rows that occur between each start_date and end_date.
So for example, the first row of the new data frame would return the sum of the values between 2-13-16 and 2-20-16. I imagine I'd use groupby.sum() or something similar.
It might look like this:
id start_date end_date weekly_sum
65 2016-02-13 2016-02-20 100
Any direction is greatly appreciated!
P.S. I know my use of random.choice is a little wonky so if you have a better way of generating random numbers, I'd love to see it!
You can use:
def get_dates(x):
    # select the df values between the start and end datetimes
    n = df[(df['time_stamp'] > x['start']) & (df['time_stamp'] < x['end'])]
    # return the first id and the sum of the values
    return n['id'].values[0], n['value'].sum()
import numpy as np

dates = pd.date_range('20150101 020000', periods=1000)
df = pd.DataFrame({'id': np.random.randint(0, 1000, size=(1000,)),
                   'time_stamp': dates,
                   'value': np.random.randint(2, 60, size=(1000,))
                   })
ndf = pd.DataFrame({'start':pd.to_datetime(start_date),'end':end_date})
# Unpack and assign values to the id and value columns
ndf[['id','value']] = ndf.apply(lambda x : get_dates(x),1).apply(pd.Series)
print(df.head(5))
id time_stamp value
0 770 2015-01-01 02:00:00 59
1 781 2015-01-02 02:00:00 32
2 761 2015-01-03 02:00:00 40
3 317 2015-01-04 02:00:00 16
4 538 2015-01-05 02:00:00 20
print(ndf.head(5))
end start id value
0 2016-02-20 2016-02-13 569 221
1 2016-03-01 2016-02-23 28 216
2 2016-03-24 2016-03-17 152 258
3 2016-03-31 2016-03-24 892 265
4 2016-04-02 2016-03-26 606 244
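An equivalent sketch without the helper function, using Series.between per window (note between is inclusive at both ends, unlike the strict comparisons above):
starts = pd.to_datetime(start_date)
ndf = pd.DataFrame({'start': starts, 'end': end_date})
ndf['weekly_sum'] = [df.loc[df['time_stamp'].between(s, e), 'value'].sum()
                     for s, e in zip(ndf['start'], ndf['end'])]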
You can calculate a weekly summary with the following code; it anchors weeks on Monday.
import pandas as pd
import numpy as np
import random

dates = pd.date_range('20150101 020000', periods=1000)
df = pd.DataFrame({'_id': random.choice(range(0, 1000)),
                   'time_stamp': dates,
                   'value': random.choice(range(2, 60))
                   })
df['day_of_week'] = df['time_stamp'].dt.day_name()   # .dt.weekday_name in older pandas
df['start'] = np.where(df['day_of_week'] == 'Monday', 1, 0)   # mark Mondays
df['week'] = df['start'].cumsum()   # week id increments at each Monday
# It is based on Monday.
df.head(20)
# Out[109]:
# _id time_stamp value day_of_week start week
# 0 396 2015-01-01 02:00:00 59 Thursday 0 0
# 1 396 2015-01-02 02:00:00 59 Friday 0 0
# 2 396 2015-01-03 02:00:00 59 Saturday 0 0
# 3 396 2015-01-04 02:00:00 59 Sunday 0 0
# 4 396 2015-01-05 02:00:00 59 Monday 1 1
# 5 396 2015-01-06 02:00:00 59 Tuesday 0 1
# 6 396 2015-01-07 02:00:00 59 Wednesday 0 1
# 7 396 2015-01-08 02:00:00 59 Thursday 0 1
# 8 396 2015-01-09 02:00:00 59 Friday 0 1
# 9 396 2015-01-10 02:00:00 59 Saturday 0 1
# 10 396 2015-01-11 02:00:00 59 Sunday 0 1
# 11 396 2015-01-12 02:00:00 59 Monday 1 2
# 12 396 2015-01-13 02:00:00 59 Tuesday 0 2
# 13 396 2015-01-14 02:00:00 59 Wednesday 0 2
# 14 396 2015-01-15 02:00:00 59 Thursday 0 2
# 15 396 2015-01-16 02:00:00 59 Friday 0 2
# 16 396 2015-01-17 02:00:00 59 Saturday 0 2
# 17 396 2015-01-18 02:00:00 59 Sunday 0 2
# 18 396 2015-01-19 02:00:00 59 Monday 1 3
# 19 396 2015-01-20 02:00:00 59 Tuesday 0 3
aggfunc = {"time_stamp": [np.min, np.max], "value": [np.sum]}
df2 = df.groupby("week", as_index=False).agg(aggfunc)
df2.columns = ["week", "start_date", "end_date", "weekly_sum"]
df2.iloc[58:61]
# Out[110]:
# week start_date end_date weekly_sum
# 58 58 2016-02-08 02:00:00 2016-02-14 02:00:00 413
# 59 59 2016-02-15 02:00:00 2016-02-21 02:00:00 413
# 60 60 2016-02-22 02:00:00 2016-02-28 02:00:00 413
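If pure calendar binning is enough, the same Monday-based weekly sum can be sketched with resample; 'W-SUN' bins run Monday through Sunday and are labelled with the Sunday end date:
weekly = df.set_index('time_stamp')['value'].resample('W-SUN').sum()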

Grouping daily data by month in python/pandas while firstly grouping by user id

I have the table below in a Pandas dataframe:
date user_id whole_cost cost1
02/10/2012 00:00:00 1 1790 12
07/10/2012 00:00:00 1 364 15
30/01/2013 00:00:00 1 280 10
02/02/2013 00:00:00 1 259 24
05/03/2013 00:00:00 1 201 39
02/10/2012 00:00:00 3 623 1
07/12/2012 00:00:00 3 90 0
30/01/2013 00:00:00 3 312 90
02/02/2013 00:00:00 5 359 45
05/03/2013 00:00:00 5 301 34
02/02/2013 00:00:00 5 359 1
05/03/2013 00:00:00 5 801 12
..
The table was loaded from a csv file with the following code:
import pandas as pd
newnames = ['date','user_id', 'whole_cost', 'cost1']
df = pd.read_csv('expenses.csv', names = newnames, index_col = 'date')
I have to analyse the profile of my users, and for this purpose:
I would like to group queries by month (for each user - there are thousands of them), summing whole_cost over the entire month. E.g. if user_id=1 has a whole_cost of 1790 on 02/10/2012 (with cost1 12) and 364 on 07/10/2012, then the new table should have a single entry of 2154 (the whole cost) on 31/10/2012 (the end of the month as the end-point representing that month - all dates in the transformed table will be month ends representing the whole month to which they relate).
In 0.14 you'll be able to groupby monthly and another column at the same time:
In [11]: df
Out[11]:
user_id whole_cost cost1
2012-10-02 1 1790 12
2012-10-07 1 364 15
2013-01-30 1 280 10
2013-02-02 1 259 24
2013-03-05 1 201 39
2012-10-02 3 623 1
2012-12-07 3 90 0
2013-01-30 3 312 90
2013-02-02 5 359 45
2013-03-05 5 301 34
2013-02-02 5 359 1
2013-03-05 5 801 12
In [12]: df1 = df.sort_index() # requires sorted DatetimeIndex
In [13]: df1.groupby([pd.TimeGrouper(freq='M'), 'user_id'])['whole_cost'].sum()
Out[13]:
user_id
2012-10-31 1 2154
3 623
2012-12-31 3 90
2013-01-31 1 280
3 312
2013-02-28 1 259
5 718
2013-03-31 1 201
5 1102
Name: whole_cost, dtype: int64
until 0.14 I think you're stuck with doing two groupbys:
In [14]: g = df.groupby('user_id')['whole_cost']
In [15]: g.resample('M', how='sum').dropna()
Out[15]:
user_id
1 2012-10-31 2154
2013-01-31 280
2013-02-28 259
2013-03-31 201
3 2012-10-31 623
2012-12-31 90
2013-01-31 312
5 2013-02-28 718
2013-03-31 1102
dtype: float64
With TimeGrouper deprecated, you can replace it with pd.Grouper to get the same results (note that key='date' expects date as a regular column rather than the index):
df.groupby(['user_id', pd.Grouper(key='date', freq='M')]).agg({'whole_cost': 'sum'})
# or, grouping by day of week instead of month:
df.groupby(['user_id', df['date'].dt.dayofweek]).agg({'whole_cost': 'sum'})
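Putting it together in modern pandas (0.25+ named aggregation), a sketch; dayfirst=True matches the dd/mm/yyyy dates in the question:
df = df.reset_index()
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
monthly = (df.groupby(['user_id', pd.Grouper(key='date', freq='M')])
             .agg(whole_cost=('whole_cost', 'sum')))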
