How to group pandas DataFrame by varying dates?

I am trying to roll up daily data into fiscal quarter data. For example, I have a table with fiscal quarter end dates:
Company Period Quarter_End
M 2016Q1 05/02/2015
M 2016Q2 08/01/2015
M 2016Q3 10/31/2015
M 2016Q4 01/30/2016
WFM 2015Q2 04/12/2015
WFM 2015Q3 07/05/2015
WFM 2015Q4 09/27/2015
WFM 2016Q1 01/17/2016
and a table of daily data:
Company Date Price
M 06/20/2015 1.05
M 06/22/2015 4.05
M 07/10/2015 3.45
M 07/29/2015 1.86
M 08/24/2015 1.58
M 09/02/2015 8.64
M 09/22/2015 2.56
M 10/20/2015 5.42
M 11/02/2015 1.58
M 11/24/2015 4.58
M 12/03/2015 6.48
M 12/05/2015 4.56
M 01/03/2016 7.14
M 01/30/2016 6.34
WFM 06/20/2015 1.05
WFM 06/22/2015 4.05
WFM 07/10/2015 3.45
WFM 07/29/2015 1.86
WFM 08/24/2015 1.58
WFM 09/02/2015 8.64
WFM 09/22/2015 2.56
WFM 10/20/2015 5.42
WFM 11/02/2015 1.58
WFM 11/24/2015 4.58
WFM 12/03/2015 6.48
WFM 12/05/2015 4.56
WFM 01/03/2016 7.14
WFM 01/17/2016 6.34
And I would like to create the table below.
Company Period Quarter_end Sum(Price)
M 2016Q2 8/1/2015 10.41
M 2016Q3 10/31/2015 18.2
M 2016Q4 1/30/2016 30.68
WFM 2015Q3 7/5/2015 5.1
WFM 2015Q4 9/27/2015 18.09
WFM 2016Q1 1/17/2016 36.1
However, I don't know how to group by varying dates without looping through each record. Any help is greatly appreciated.
Thanks!

I think you can use merge_ordered:
#first convert columns to datetime
df1.Quarter_End = pd.to_datetime(df1.Quarter_End)
df2.Date = pd.to_datetime(df2.Date)
df = pd.merge_ordered(df1,
                      df2,
                      left_on=['Company','Quarter_End'],
                      right_on=['Company','Date'],
                      how='outer')
print (df)
Company Period Quarter_End Date Price
0 M 2016Q1 2015-05-02 NaT NaN
1 M NaN NaT 2015-06-20 1.05
2 M NaN NaT 2015-06-22 4.05
3 M NaN NaT 2015-07-10 3.45
4 M NaN NaT 2015-07-29 1.86
5 M 2016Q2 2015-08-01 NaT NaN
6 M NaN NaT 2015-08-24 1.58
7 M NaN NaT 2015-09-02 8.64
8 M NaN NaT 2015-09-22 2.56
9 M NaN NaT 2015-10-20 5.42
10 M 2016Q3 2015-10-31 NaT NaN
11 M NaN NaT 2015-11-02 1.58
12 M NaN NaT 2015-11-24 4.58
13 M NaN NaT 2015-12-03 6.48
14 M NaN NaT 2015-12-05 4.56
15 M NaN NaT 2016-01-03 7.14
16 M 2016Q4 2016-01-30 2016-01-30 6.34
17 WFM 2015Q2 2015-04-12 NaT NaN
18 WFM NaN NaT 2015-06-20 1.05
19 WFM NaN NaT 2015-06-22 4.05
20 WFM 2015Q3 2015-07-05 NaT NaN
21 WFM NaN NaT 2015-07-10 3.45
22 WFM NaN NaT 2015-07-29 1.86
23 WFM NaN NaT 2015-08-24 1.58
24 WFM NaN NaT 2015-09-02 8.64
25 WFM NaN NaT 2015-09-22 2.56
26 WFM 2015Q4 2015-09-27 NaT NaN
27 WFM NaN NaT 2015-10-20 5.42
28 WFM NaN NaT 2015-11-02 1.58
29 WFM NaN NaT 2015-11-24 4.58
30 WFM NaN NaT 2015-12-03 6.48
31 WFM NaN NaT 2015-12-05 4.56
32 WFM NaN NaT 2016-01-03 7.14
33 WFM 2016Q1 2016-01-17 2016-01-17 6.34
Then backfill the NaN values in columns Period and Quarter_End with bfill and aggregate with sum. If you need to remove all NaN values, add Series.dropna and finally reset_index:
df.Period = df.Period.bfill()
df.Quarter_End = df.Quarter_End.bfill()
print (df.groupby(['Company','Period','Quarter_End'])['Price'].sum().dropna().reset_index())
Company Period Quarter_End Price
0 M 2016Q2 2015-08-01 10.41
1 M 2016Q3 2015-10-31 18.20
2 M 2016Q4 2016-01-30 30.68
3 WFM 2015Q3 2015-07-05 5.10
4 WFM 2015Q4 2015-09-27 18.09
5 WFM 2016Q1 2016-01-17 36.10
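If your pandas version is 0.20 or newer, merge_asof may be a more direct fit; a minimal sketch, assuming df1 and df2 are the frames above with the date columns already converted to datetime:
#both frames must be sorted by the asof key; by='Company' keeps companies separate
df1 = df1.sort_values('Quarter_End')
df2 = df2.sort_values('Date')

#match each daily row to the nearest quarter end on or after its date
df = pd.merge_asof(df2, df1,
                   left_on='Date',
                   right_on='Quarter_End',
                   by='Company',
                   direction='forward')

print (df.groupby(['Company','Period','Quarter_End'])['Price'].sum().reset_index())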

Another option: set_index, pd.concat to align the indices, then groupby with agg:
prd_df = period_df.set_index(['Company', 'Quarter_End'])
prc_df = price_df.set_index(['Company', 'Date'], drop=False)
df = pd.concat([prd_df, prc_df], axis=1)
df.groupby([df.index.get_level_values(0), df.Period.bfill()]) \
  .agg(dict(Date='last', Price='sum')).dropna()
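The merge_ordered and concat approaches rely on the same alignment trick: once the quarter-end rows and the daily rows sit in a single frame ordered by date, back-filling the Period label assigns each daily row to the first quarter end on or after it.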


pandas - groupby multiple values?

I have a dataframe that contains cell phone minutes usage, logged by date of call and duration.
It looks like this (30 row sample):
id user_id call_date duration
0 1000_93 1000 2018-12-27 8.52
1 1000_145 1000 2018-12-27 13.66
2 1000_247 1000 2018-12-27 14.48
3 1000_309 1000 2018-12-28 5.76
4 1000_380 1000 2018-12-30 4.22
5 1000_388 1000 2018-12-31 2.20
6 1000_510 1000 2018-12-27 5.75
7 1000_521 1000 2018-12-28 14.18
8 1000_530 1000 2018-12-28 5.77
9 1000_544 1000 2018-12-26 4.40
10 1000_693 1000 2018-12-31 4.31
11 1000_705 1000 2018-12-31 12.78
12 1000_735 1000 2018-12-29 1.70
13 1000_778 1000 2018-12-28 3.29
14 1000_826 1000 2018-12-26 9.96
15 1000_842 1000 2018-12-27 5.85
16 1001_0 1001 2018-09-06 10.06
17 1001_1 1001 2018-10-12 1.00
18 1001_2 1001 2018-10-17 15.83
19 1001_4 1001 2018-12-05 0.00
20 1001_5 1001 2018-12-13 6.27
21 1001_6 1001 2018-12-04 7.19
22 1001_8 1001 2018-11-17 2.45
23 1001_9 1001 2018-11-19 2.40
24 1001_11 1001 2018-11-09 1.00
25 1001_13 1001 2018-12-24 0.00
26 1001_19 1001 2018-11-15 30.00
27 1001_20 1001 2018-09-21 5.75
28 1001_23 1001 2018-10-27 0.98
29 1001_26 1001 2018-10-28 5.90
30 1001_29 1001 2018-09-30 14.78
I want to group by user_id AND call_date with the ultimate goal of calculating the number of minutes used per month over the course of the year, per user.
I thought I could accomplish this by using:
calls.groupby(['user_id','call_date'])['duration'].sum()
but the results aren't what I expected:
user_id call_date
1000 2018-12-26 14.36
2018-12-27 48.26
2018-12-28 29.00
2018-12-29 1.70
2018-12-30 4.22
2018-12-31 19.29
1001 2018-08-14 13.86
2018-08-16 23.46
2018-08-17 8.11
2018-08-18 1.74
2018-08-19 10.73
2018-08-20 7.32
2018-08-21 0.00
2018-08-23 8.50
2018-08-24 8.63
2018-08-25 35.39
2018-08-27 10.57
2018-08-28 19.91
2018-08-29 0.54
2018-08-31 22.38
2018-09-01 7.53
2018-09-02 10.27
2018-09-03 30.66
2018-09-04 0.00
2018-09-05 9.09
2018-09-06 10.06
I'd hoped that it would be grouped like: user_id 1000, all calls for Jan with duration summed, all calls for Feb with duration summed, etc.
I am really new to Python and programming in general, and am not sure what my next step should be to get these grouped by user_id and month of the year.
Thanks in advance for any insight you can offer.
Regards,
Jared
Something is not quite right in your setup. First of all, both of your tables are the same, so I am not sure if this is a cut-and-paste error or something else. Here is what I do with your data. Load it up like so; note we explicitly convert call_date to datetime:
from io import StringIO
import pandas as pd
df = pd.read_csv(StringIO(
"""
id user_id call_date duration
0 1000_93 1000 2018-12-27 8.52
1 1000_145 1000 2018-12-27 13.66
2 1000_247 1000 2018-12-27 14.48
3 1000_309 1000 2018-12-28 5.76
4 1000_380 1000 2018-12-30 4.22
5 1000_388 1000 2018-12-31 2.20
6 1000_510 1000 2018-12-27 5.75
7 1000_521 1000 2018-12-28 14.18
8 1000_530 1000 2018-12-28 5.77
9 1000_544 1000 2018-12-26 4.40
10 1000_693 1000 2018-12-31 4.31
11 1000_705 1000 2018-12-31 12.78
12 1000_735 1000 2018-12-29 1.70
13 1000_778 1000 2018-12-28 3.29
14 1000_826 1000 2018-12-26 9.96
15 1000_842 1000 2018-12-27 5.85
16 1001_0 1001 2018-09-06 10.06
17 1001_1 1001 2018-10-12 1.00
18 1001_2 1001 2018-10-17 15.83
19 1001_4 1001 2018-12-05 0.00
20 1001_5 1001 2018-12-13 6.27
21 1001_6 1001 2018-12-04 7.19
22 1001_8 1001 2018-11-17 2.45
23 1001_9 1001 2018-11-19 2.40
24 1001_11 1001 2018-11-09 1.00
25 1001_13 1001 2018-12-24 0.00
26 1001_19 1001 2018-11-15 30.00
27 1001_20 1001 2018-09-21 5.75
28 1001_23 1001 2018-10-27 0.98
29 1001_26 1001 2018-10-28 5.90
30 1001_29 1001 2018-09-30 14.78
"""), delim_whitespace = True, index_col=0)
df['call_date'] = pd.to_datetime(df['call_date'])
Then using
df.groupby(['user_id','call_date'])['duration'].sum()
does the expected grouping by user and by each date:
user_id call_date
1000 2018-12-26 14.36
2018-12-27 48.26
2018-12-28 29.00
2018-12-29 1.70
2018-12-30 4.22
2018-12-31 19.29
1001 2018-09-06 10.06
2018-09-21 5.75
2018-09-30 14.78
2018-10-12 1.00
2018-10-17 15.83
2018-10-27 0.98
2018-10-28 5.90
2018-11-09 1.00
2018-11-15 30.00
2018-11-17 2.45
2018-11-19 2.40
2018-12-04 7.19
2018-12-05 0.00
2018-12-13 6.27
2018-12-24 0.00
If you want to group by month, as you seem to suggest, you can use the Grouper functionality:
df.groupby(['user_id',pd.Grouper(key='call_date', freq='1M')])['duration'].sum()
which produces
user_id call_date
1000 2018-12-31 116.83
1001 2018-09-30 30.59
2018-10-31 23.71
2018-11-30 35.85
2018-12-31 13.46
Let me know if you are getting different results from following these steps.
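An equivalent sketch uses dt.to_period, which labels each group with the calendar month (e.g. 2018-12) instead of the month-end date:
# same totals as above, keyed by year-month Period instead of month-end Timestamp
df.groupby(['user_id', df['call_date'].dt.to_period('M')])['duration'].sum()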

Create well structured pandas dataframe using dataframe

I have a pandas DataFrame with data from 2018 to 2020. I want to structure these data as follows.
Month | 2018 | 2019
Jan 115 73
Feb 112 63
....
up to December.
How can I solve this issue using pandas DataFrame syntax?
Date
2018-01-01 115.0
2018-02-01 112.0
2018-03-01 104.5
2018-04-01 91.1
2018-05-01 85.5
2018-06-01 76.5
2018-07-01 86.5
2018-08-01 77.9
2018-09-01 65.0
2018-10-01 71.0
2018-11-01 76.0
2018-12-01 72.5
2019-01-01 73.0
2019-02-01 63.0
2019-03-01 63.0
2019-04-01 61.0
2019-05-01 58.3
2019-06-01 59.0
2019-07-01 67.0
2019-08-01 64.0
2019-09-01 59.9
2019-10-01 70.4
2019-11-01 78.9
2019-12-01 75.0
2020-01-01 73.9
Name: Close, dtype: float64
This is more like a pivot, but done with crosstab:
s = pd.crosstab(df.index.strftime('%b'), df.index.year, df.values, aggfunc='sum')
Out[87]:
col_0 2018 2019 2020
row_0
Apr 91.1 61.0 NaN
Aug 77.9 64.0 NaN
Dec 72.5 75.0 NaN
Feb 112.0 63.0 NaN
Jan 115.0 73.0 73.9
Jul 86.5 67.0 NaN
Jun 76.5 59.0 NaN
Mar 104.5 63.0 NaN
May 85.5 58.3 NaN
Nov 76.0 78.9 NaN
Oct 71.0 70.4 NaN
Sep 65.0 59.9 NaN
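Note the months come back alphabetically sorted; a small sketch to restore calendar order with the standard library's month abbreviations:
import calendar

# reindex the rows into Jan..Dec order (month_abbr[0] is an empty string)
s = s.reindex(list(calendar.month_abbr)[1:])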
You can use groupby and unstack:
(s.groupby([s.index.month, s.index.year]).first().unstack()
.rename_axis(columns='Year',index='Month')
)
Output:
Year 2018 2019 2020
Month
1 115.0 73.0 73.9
2 112.0 63.0 NaN
3 104.5 63.0 NaN
4 91.1 61.0 NaN
5 85.5 58.3 NaN
6 76.5 59.0 NaN
7 86.5 67.0 NaN
8 77.9 64.0 NaN
9 65.0 59.9 NaN
10 71.0 70.4 NaN
11 76.0 78.9 NaN
12 72.5 75.0 NaN
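If you want month names like Jan and Feb in the index, as in the desired layout, one sketch maps the month numbers through calendar.month_abbr:
import calendar

out = (s.groupby([s.index.month, s.index.year]).first().unstack()
        .rename_axis(columns='Year', index='Month'))
# replace month numbers 1..12 with their abbreviated names
out.index = out.index.map(lambda m: calendar.month_abbr[m])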

Add to values of a DataFrame using coordinates

I have a dataframe a:
Out[68]:
p0_4 p5_7 p8_9 p10_14 p15 p16_17 p18_19 p20_24 p25_29 \
0 1360.0 921.0 676.0 1839.0 336.0 668.0 622.0 1190.0 1399.0
1 308.0 197.0 187.0 411.0 67.0 153.0 172.0 336.0 385.0
2 76.0 59.0 40.0 72.0 16.0 36.0 20.0 56.0 82.0
3 765.0 608.0 409.0 1077.0 220.0 359.0 342.0 873.0 911.0
4 1304.0 906.0 660.0 1921.0 375.0 725.0 645.0 1362.0 1474.0
5 195.0 135.0 78.0 262.0 44.0 97.0 100.0 265.0 229.0
6 1036.0 965.0 701.0 1802.0 335.0 701.0 662.0 1321.0 1102.0
7 5072.0 3798.0 2865.0 7334.0 1399.0 2732.0 2603.0 4976.0 4575.0
8 1360.0 962.0 722.0 1758.0 357.0 710.0 713.0 1761.0 1660.0
9 743.0 508.0 369.0 1118.0 286.0 615.0 429.0 738.0 885.0
10 1459.0 1015.0 679.0 1732.0 337.0 746.0 677.0 1493.0 1546.0
11 828.0 519.0 415.0 1057.0 190.0 439.0 379.0 788.0 1024.0
12 1042.0 690.0 503.0 1204.0 219.0 451.0 465.0 1193.0 1406.0
p30_44 p45_59 p60_64 p65_74 p75_84 p85_89 p90plus
0 4776.0 8315.0 2736.0 5463.0 2819.0 738.0 451.0
1 1004.0 2456.0 988.0 2007.0 1139.0 313.0 153.0
2 291.0 529.0 187.0 332.0 108.0 31.0 10.0
3 2807.0 5505.0 2060.0 4104.0 2129.0 516.0 252.0
4 4524.0 9406.0 3034.0 6003.0 3366.0 840.0 471.0
5 806.0 1490.0 606.0 1288.0 664.0 185.0 108.0
6 4127.0 8311.0 2911.0 6111.0 3525.0 1029.0 707.0
7 16917.0 27547.0 8145.0 15950.0 9510.0 2696.0 1714.0
8 5692.0 9380.0 3288.0 6458.0 3830.0 1050.0 577.0
9 2749.0 5696.0 2014.0 4165.0 2352.0 603.0 288.0
10 4676.0 7654.0 2502.0 5077.0 3004.0 754.0 461.0
11 2799.0 4880.0 1875.0 3951.0 2294.0 551.0 361.0
12 3288.0 5661.0 1974.0 4007.0 2343.0 623.0 303.0
and a series d:
Out[70]:
2 p45_59
10 p45_59
11 p45_59
Is there a simple way to add 1 to the numbers in a that have the same index and column labels as in d?
I have tried:
a[d] +=1
However, this adds 1 to every value in the column, not just the values with indices 2, 10 and 11.
Thanking you in advance.
You might want to try this.
a.loc[list(d.index), list(d.values)] += 1
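One caveat: .loc with two lists selects the full cross product of rows and columns, which is harmless here only because every value in d is the same column label. If d could contain different columns, a plain loop over the pairs is a safe sketch:
# increment exactly one cell per (index, column) pair
for idx, col in d.items():
    a.at[idx, col] += 1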

delete the integer in timeindex

This is part of a dataframe. As you can see, there are some integers in the time index that are not valid timestamps, so I want to delete them. How can we delete the records that have an integer as their time index?
rent_time rent_price_per_square_meter
0 2016-11-28 09:01:58 0.400000
1 2016-11-28 09:02:35 0.400000
2 2016-11-28 09:02:43 0.400000
3 2016-11-28 09:03:21 0.400000
4 2016-11-28 09:03:21 0.400000
5 2016-11-28 09:03:34 0.400000
6 2016-11-28 09:03:34 0.400000
7 2017-06-17 02:49:33 0.933333
8 2017-03-19 01:30:03 0.490196
9 2017-03-10 06:39:03 11.111111
10 2017-03-09 14:40:03 16.666667
11 908797 11.000000
12 2017-06-08 03:27:52 22.000000
13 2017-06-30 03:03:11 22.000000
14 2017-02-20 11:04:48 12.000000
15 2017-03-05 13:53:39 6.842105
16 2017-03-06 14:00:01 6.842105
17 2017-03-15 02:38:54 20.000000
18 2017-03-15 02:19:07 13.043478
19 2017-02-24 15:10:00 25.000000
20 2017-06-26 02:17:31 13.043478
21 82368 11.111111
22 2017-06-30 07:53:55 4.109589
23 2017-07-17 02:42:43 20.000000
24 2017-06-30 07:38:00 5.254237
25 2017-06-30 07:49:00 4.920635
26 2017-06-30 05:26:26 4.189189
You can use boolean indexing with to_datetime and the parameter errors='coerce', which returns NaT for values that cannot be parsed as datetimes, and then add notnull to keep only the rows that parsed:
df1 = df[pd.to_datetime(df['rent_time'], errors='coerce').notnull()]
print (df1)
rent_time rent_price_per_square_meter
0 2016-11-28 09:01:58 0.400000
1 2016-11-28 09:02:35 0.400000
2 2016-11-28 09:02:43 0.400000
3 2016-11-28 09:03:21 0.400000
4 2016-11-28 09:03:21 0.400000
5 2016-11-28 09:03:34 0.400000
6 2016-11-28 09:03:34 0.400000
7 2017-06-17 02:49:33 0.933333
8 2017-03-19 01:30:03 0.490196
9 2017-03-10 06:39:03 11.111111
10 2017-03-09 14:40:03 16.666667
12 2017-06-08 03:27:52 22.000000
13 2017-06-30 03:03:11 22.000000
14 2017-02-20 11:04:48 12.000000
15 2017-03-05 13:53:39 6.842105
16 2017-03-06 14:00:01 6.842105
17 2017-03-15 02:38:54 20.000000
18 2017-03-15 02:19:07 13.043478
19 2017-02-24 15:10:00 25.000000
20 2017-06-26 02:17:31 13.043478
22 2017-06-30 07:53:55 4.109589
23 2017-07-17 02:42:43 20.000000
24 2017-06-30 07:38:00 5.254237
25 2017-06-30 07:49:00 4.920635
26 2017-06-30 05:26:26 4.189189
EDIT:
If you need a DatetimeIndex for further data processing:
df['rent_time'] = pd.to_datetime(df['rent_time'], errors='coerce')
df = df.dropna(subset=['rent_time']).set_index('rent_time')
print (df)
rent_price_per_square_meter
rent_time
2016-11-28 09:01:58 0.400000
2016-11-28 09:02:35 0.400000
2016-11-28 09:02:43 0.400000
2016-11-28 09:03:21 0.400000
2016-11-28 09:03:21 0.400000
2016-11-28 09:03:34 0.400000
2016-11-28 09:03:34 0.400000
2017-06-17 02:49:33 0.933333
2017-03-19 01:30:03 0.490196
2017-03-10 06:39:03 11.111111
2017-03-09 14:40:03 16.666667
2017-06-08 03:27:52 22.000000
2017-06-30 03:03:11 22.000000
2017-02-20 11:04:48 12.000000
2017-03-05 13:53:39 6.842105
2017-03-06 14:00:01 6.842105
2017-03-15 02:38:54 20.000000
2017-03-15 02:19:07 13.043478
2017-02-24 15:10:00 25.000000
2017-06-26 02:17:31 13.043478
2017-06-30 07:53:55 4.109589
2017-07-17 02:42:43 20.000000
2017-06-30 07:38:00 5.254237
2017-06-30 07:49:00 4.920635
2017-06-30 05:26:26 4.189189
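If the column holds a mix of real Timestamp objects and plain integers (rather than strings), a type-based filter is another sketch:
# keep only rows whose rent_time is an actual Timestamp
df1 = df[df['rent_time'].map(lambda x: isinstance(x, pd.Timestamp))]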

How to split a column into multiple columns in pandas?

I have this data in a pandas dataframe:
name date close quantity daily_cumm_returns
0 AARTIIND 2000-01-03 3.84 21885.82 0.000000
1 AARTIIND 2000-01-04 3.60 56645.64 -0.062500
2 AARTIIND 2000-01-05 3.52 24460.62 -0.083333
3 AARTIIND 2000-01-06 3.58 42484.24 -0.067708
4 AARTIIND 2000-01-07 3.42 16736.21 -0.109375
5 AARTIIND 2000-01-10 3.42 20598.42 -0.109375
6 AARTIIND 2000-01-11 3.41 20598.42 -0.111979
7 AARTIIND 2000-01-12 3.27 100417.29 -0.148438
8 AARTIIND 2000-01-13 3.43 20598.42 -0.106771
9 AARTIIND 2000-01-14 3.60 5149.61 -0.062500
10 AARTIIND 2000-01-17 3.46 14161.42 -0.098958
11 AARTIIND 2000-01-18 3.50 136464.53 -0.088542
12 AARTIIND 2000-01-19 3.52 21885.82 -0.083333
13 AARTIIND 2000-01-20 3.73 75956.66 -0.028646
14 AARTIIND 2000-01-21 3.84 77244.07 0.000000
15 AARTIIND 2000-02-01 4.21 90118.08 0.000000
16 AARTIIND 2000-02-02 4.52 238169.21 0.073634
17 AARTIIND 2000-02-03 4.38 163499.94 0.040380
18 AARTIIND 2000-02-04 4.44 108141.71 0.054632
19 AARTIIND 2000-02-07 4.26 68232.27 0.011876
20 AARTIIND 2000-02-08 4.00 108141.71 -0.049881
21 AARTIIND 2000-02-09 3.96 32185.04 -0.059382
22 AARTIIND 2000-02-10 4.13 43771.63 -0.019002
23 AARTIIND 2000-02-11 3.96 3862.20 -0.059382
24 AARTIIND 2000-02-14 3.94 12874.01 -0.064133
25 AARTIIND 2000-02-15 3.90 33472.42 -0.073634
26 AARTIIND 2000-02-16 3.90 25748.02 -0.073634
27 AARTIIND 2000-02-17 3.90 60507.86 -0.073634
28 AARTIIND 2000-02-18 4.22 45059.04 0.002375
29 AARTIIND 2000-02-21 4.42 81106.27 0.049881
I wish to select every month's data and transpose it into a new row;
e.g. the first 15 rows should become one row with name AARTIIND, date 2000-01-03, and then 15 columns holding the daily cumulative returns.
name date first second third fourth fifth .... fifteenth
0 AARTIIND 2000-01-03 0.00 -0.062 -0.083 -0.067 -0.109 .... 0.00
To group the data month-wise I am using:
group = df.groupby([pd.Grouper(freq='1M', key='date'), 'name'])
Setting the rows individually using the code below is very slow, and my dataset has 1 million rows:
data = pd.DataFrame(columns = ('name', 'date', 'daily_zscore_1', 'daily_zscore_2', 'daily_zscore_3', 'daily_zscore_4', 'daily_zscore_5', 'daily_zscore_6', 'daily_zscore_7', 'daily_zscore_8', 'daily_zscore_9', 'daily_zscore_10', 'daily_zscore_11', 'daily_zscore_12', 'daily_zscore_13', 'daily_zscore_14', 'daily_zscore_15'))
data.loc[0] = [x['name'].iloc[0], x['date'].iloc[0]] + list(x['daily_cumm_returns'])
Is there any faster way to accomplish this? As I see it, this is just transposing one column and hence should be very fast. I tried pivot and melt but don't understand how to use them in this situation.
This is a bit sloppy but it gets the job done.
# grab AAPL data
from pandas_datareader import data
df = data.DataReader('AAPL', 'google', start='2014-01-01')[['Close', 'Volume']]
# add name column
df['name'] = 'AAPL'
# get daily return relative to first of month
df['daily_cumm_return'] = df.resample('M')['Close'].transform(lambda x: (x - x[0]) / x[0])
# get the first of the month for each date
df['first_month_date'] = df.assign(index_col=df.index).resample('M')['index_col'].transform('first')
# get a ranking of the days 1 to n
df['day_rank']= df.resample('M')['first_month_date'].rank(method='first')
# pivot to get final
df_final = df.pivot_table(index=['name', 'first_month_date'], columns='day_rank', values='daily_cumm_return')
Sample Output
day_rank 1.0 2.0 3.0 4.0 5.0 6.0 \
name first_month_date
AAPL 2014-01-02 0.0 -0.022020 -0.016705 -0.023665 -0.017464 -0.029992
2014-02-03 0.0 0.014375 0.022052 0.021912 0.036148 0.054710
2014-03-03 0.0 0.006632 0.008754 0.005704 0.005173 0.006102
2014-04-01 0.0 0.001680 -0.005299 -0.018222 -0.033600 -0.033600
2014-05-01 0.0 0.001775 0.015976 0.004970 0.001420 -0.005917
2014-06-02 0.0 0.014141 0.025721 0.029729 0.026834 0.043314
2014-07-01 0.0 -0.000428 0.005453 0.026198 0.019568 0.019996
day_rank 7.0 8.0 9.0 10.0 11.0 \
name first_month_date
AAPL 2014-01-02 -0.036573 -0.031511 -0.012149 0.007593 0.002025
2014-02-03 0.068667 0.068528 0.085555 0.084578 0.088625
2014-03-03 0.015785 0.016846 0.005571 -0.005704 -0.001857
2014-04-01 -0.020936 -0.033600 -0.040708 -0.036831 -0.043810
2014-05-01 -0.010059 0.002249 0.003787 0.004024 -0.004497
2014-06-02 0.049438 0.045095 0.027614 0.016368 0.026612
2014-07-01 0.016253 0.018178 0.031330 0.019247 0.013473
day_rank 12.0 13.0 14.0 15.0 16.0 \
name first_month_date
AAPL 2014-01-02 -0.022526 -0.007340 -0.002911 0.005442 -0.012782
2014-02-03 0.071458 0.059037 0.047313 0.051779 0.040893
2014-03-03 0.006897 0.006632 0.001857 0.009683 0.021754
2014-04-01 -0.041871 -0.030887 -0.019385 -0.018351 -0.031274
2014-05-01 0.010178 0.022130 0.022367 0.025089 0.026627
2014-06-02 0.025276 0.026389 0.022826 0.012248 0.011357
2014-07-01 -0.004598 0.009731 0.004491 0.012831 0.039243
day_rank 17.0 18.0 19.0 20.0 21.0 \
name first_month_date
AAPL 2014-01-02 -0.004809 -0.084282 -0.094660 -0.096431 -0.095039
2014-02-03 0.031542 0.052059 0.049267 NaN NaN
2014-03-03 0.032763 0.022815 0.018437 0.017244 0.017111
2014-04-01 0.048204 0.055958 0.096795 0.093564 0.089429
2014-05-01 0.038225 0.057751 0.054911 0.074201 0.070178
2014-06-02 0.005233 0.006124 0.012137 0.024162 0.034740
2014-07-01 0.037532 0.044376 0.058811 0.051967 0.049508
day_rank 22.0 23.0
name first_month_date
AAPL 2014-01-02 NaN NaN
2014-02-03 NaN NaN
2014-03-03 NaN NaN
2014-04-01 NaN NaN
2014-05-01 NaN NaN
2014-06-02 NaN NaN
2014-07-01 0.022241 NaN
Admittedly this does not get exactly what you want...
I think one way to handle this problem would be to create new columns of month and day based on the datetime (date) column, then set a multiindex on the month and name, then pivot the table.
df['month'] = df.date.dt.month
df['day'] = df.date.dt.day
df.set_index(['month', 'name'], inplace=True)
df[['day', 'daily_cumm_returns']].pivot(index=df.index, columns='day')
Result is:
daily_cumm_returns \
day 1 2 3 4 5
month name
1 AARTIIND NaN NaN 0.00000 -0.062500 -0.083333
2 AARTIIND 0.0 0.073634 0.04038 0.054632 NaN
I can't figure out a way to keep the first date of each month group as a column, otherwise I think this is more or less what you're after.
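For what it's worth, a sketch that does keep the first date of each month as a column, using cumcount to number the days within each (name, month) group (assuming date is already a datetime column):
grp = df.groupby(['name', df['date'].dt.to_period('M')])

# rank each row within its month (0, 1, 2, ...) and grab the month's first date
df['day_rank'] = grp.cumcount()
df['month_start'] = grp['date'].transform('min')

df_final = (df.pivot_table(index=['name', 'month_start'],
                           columns='day_rank',
                           values='daily_cumm_returns')
              .reset_index())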
