We need to find the average of columns in pandas - python

time1 x y z GMT- 5 key time2 a b c GMT cut_off time_diff new_column
1 1.674841e+09 -1.10 64.11 -1.33 2023-01-27 12:43:22 PM 0 1.674841e+09 2.96 606.270614 2.80 2023-01-27 12:43:24 PM 1.674841e+09 2.308100 NaN
2 1.674841e+09 -1.10 64.11 -1.33 2023-01-27 12:43:22 PM 0 1.674841e+09 2.96 584.696883 2.80 2023-01-27 12:43:26 PM 1.674841e+09 4.303636 NaN
3 1.674841e+09 -1.10 64.11 -1.33 2023-01-27 12:43:22 PM 0 1.674841e+09 2.96 615.295633 2.80 2023-01-27 12:43:28 PM 1.674841e+09 6.298568 NaN
4 1.674841e+09 -1.10 64.11 -1.33 2023-01-27 12:43:22 PM 0 1.674841e+09 2.96 587.050575 2.80 2023-01-27 12:43:30 PM 1.674841e+09 8.293623 NaN
5 1.674841e+09 -2.24 93.51 -2.36 2023-01-27 12:43:46 PM 0 1.674841e+09 2.96 584.700016 2.80 2023-01-27 12:43:46 PM 1.674841e+09 0.007554 0.007554
100 1.674842e+09 -1.24 84.73 -2.44 2023-01-27 12:49:07 PM 0 1.674843e+09 2.30 1024.363758 2.64 2023-01-27 01:13:11 PM 1.674843e+09 1444.068500 NaN
101 1.674842e+09 -1.24 84.73 -2.44 2023-01-27 12:49:07 PM 0 1.674843e+09 2.31 1011.438119 2.64 2023-01-27 01:13:13 PM 1.674843e+09 1446.063470 NaN
102 1.674842e+09 -1.24 84.73 -2.44 2023-01-27 12:49:07 PM 0 1.674843e+09 2.32 1005.181835 2.64 2023-01-27 01:13:15 PM 1.674843e+09 1448.058710 NaN
103 1.674842e+09 -1.24 84.73 -2.44 2023-01-27 12:49:07 PM 0 1.674843e+09 2.34 989.515657 2.64 2023-01-27 01:13:17 PM 1.674843e+09 1450.053643 NaN
104 1.674842e+09 -1.24 84.73 -2.44 2023-01-27 12:49:07 PM 0 1.674843e+09 2.34 1016.183097 2.64 2023-01-27 01:13:19 PM 1.674843e+09 1452.048679 NaN
105 1.674842e+09 -1.57 80.04 -1.96 2023-01-27 12:49:06 PM 0 1.674842e+09 2.02 1652.185708 2.88 2023-01-27 12:49:06 PM 1.674842e+09 0.001867 0.001867
We actually need the rows without a NaN value in the column 'new_column'; here those are rows 5 and 105.
But we need the averages of 'x', 'y', and 'z' over rows 1 to 5 and rows 100 to 105, placed in the 5th row and the 105th row.
The desired output:
time1 x y z GMT- 5 key time2 a b c GMT cut_off time_diff new_column
5 1.674841e+09 -1.328 69.99 -1.536 2023-01-27 12:43:46 PM 0 1.674841e+09 2.96 584.700016 2.80 2023-01-27 12:43:46 PM 1.674841e+09 0.007554 0.007554
105 1.674842e+09 -1.295 69.82 -2.36 2023-01-27 12:49:06 PM 0 1.674842e+09 2.02 1652.185708 2.88 2023-01-27 12:49:06 PM 1.674842e+09 0.001867 0.001867

First, let's create a group key. We can do this with a cumulative sum on "new_column": replace the NaN values with 0 and the others with 1, then shift the result down by 1.
df["binary"] = df["new_column"].fillna(0)
df.loc[df.binary!=0,"binary"] = 1
df["binary"] = df["binary"].shift(1,fill_value=0)
df["cumsum"] = df["binary"].cumsum()
time1 x y z ... time_diff new_column binary cumsum
0 1.670000e+09 -1.10 64.11 -1.33 ... 2.308100 NaN 0.0 0.0
1 1.670000e+09 -1.10 64.11 -1.33 ... 4.303636 NaN 0.0 0.0
2 1.670000e+09 -1.10 64.11 -1.33 ... 6.298568 NaN 0.0 0.0
3 1.670000e+09 -1.10 64.11 -1.33 ... 8.293623 NaN 0.0 0.0
4 1.670000e+09 -2.24 93.51 -2.36 ... 0.007554 0.007554 0.0 0.0
5 1.670000e+09 -1.24 84.73 -2.44 ... 1444.068500 NaN 1.0 1.0
6 1.670000e+09 -1.24 84.73 -2.44 ... 1446.063470 NaN 0.0 1.0
7 1.670000e+09 -1.24 84.73 -2.44 ... 1448.058710 NaN 0.0 1.0
8 1.670000e+09 -1.24 84.73 -2.44 ... 1450.053643 NaN 0.0 1.0
9 1.670000e+09 -1.24 84.73 -2.44 ... 1452.048679 NaN 0.0 1.0
10 1.670000e+09 -1.57 80.04 -1.96 ... 0.001867 0.001867 0.0 1.0
After this it is a simple groupby on the cumulative sum
G = df.groupby("cumsum")
df["x_avg"] = G['x'].transform('mean')
df["y_avg"] = G['y'].transform('mean')
df["z_avg"] = G['z'].transform('mean')
filtered_df = df[~pd.isna(df["new_column"])]
time1 x y z ... cumsum x_avg y_avg z_avg
4 1.670000e+09 -2.24 93.51 -2.36 ... 0.0 -1.328 69.990000 -1.536
10 1.670000e+09 -1.57 80.04 -1.96 ... 1.0 -1.295 83.948333 -2.360
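For reference, a more compact variant of the same idea (just a sketch, assuming the same df as above) builds the group key in one expression and writes the group means straight into the surviving rows:
key = df["new_column"].notna().shift(fill_value=False).cumsum()   # same grouping: each group ends at its non-NaN row
means = df.groupby(key)[["x", "y", "z"]].transform("mean")        # group averages aligned to df's index
result = df[df["new_column"].notna()].assign(x=means["x"], y=means["y"], z=means["z"])
This reproduces the desired output shape, with x, y and z replaced by their group averages in the rows that carry a value in new_column.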

This might be it:
df['new_column'] = df[['x', 'y', 'z']].mean(axis=1)

Related

pandas - groupby multiple values?

I have a dataframe that contains cell phone minutes usage logged by date of call and duration.
It looks like this (30 row sample):
id user_id call_date duration
0 1000_93 1000 2018-12-27 8.52
1 1000_145 1000 2018-12-27 13.66
2 1000_247 1000 2018-12-27 14.48
3 1000_309 1000 2018-12-28 5.76
4 1000_380 1000 2018-12-30 4.22
5 1000_388 1000 2018-12-31 2.20
6 1000_510 1000 2018-12-27 5.75
7 1000_521 1000 2018-12-28 14.18
8 1000_530 1000 2018-12-28 5.77
9 1000_544 1000 2018-12-26 4.40
10 1000_693 1000 2018-12-31 4.31
11 1000_705 1000 2018-12-31 12.78
12 1000_735 1000 2018-12-29 1.70
13 1000_778 1000 2018-12-28 3.29
14 1000_826 1000 2018-12-26 9.96
15 1000_842 1000 2018-12-27 5.85
16 1001_0 1001 2018-09-06 10.06
17 1001_1 1001 2018-10-12 1.00
18 1001_2 1001 2018-10-17 15.83
19 1001_4 1001 2018-12-05 0.00
20 1001_5 1001 2018-12-13 6.27
21 1001_6 1001 2018-12-04 7.19
22 1001_8 1001 2018-11-17 2.45
23 1001_9 1001 2018-11-19 2.40
24 1001_11 1001 2018-11-09 1.00
25 1001_13 1001 2018-12-24 0.00
26 1001_19 1001 2018-11-15 30.00
27 1001_20 1001 2018-09-21 5.75
28 1001_23 1001 2018-10-27 0.98
29 1001_26 1001 2018-10-28 5.90
30 1001_29 1001 2018-09-30 14.78
I want to group by user_id AND call_date with the ultimate goal of calculating the number of minutes used per month over the course of the year, per user.
I thought I could accomplish this by using:
calls.groupby(['user_id','call_date'])['duration'].sum()
but the results aren't what I expected:
user_id call_date
1000 2018-12-26 14.36
2018-12-27 48.26
2018-12-28 29.00
2018-12-29 1.70
2018-12-30 4.22
2018-12-31 19.29
1001 2018-08-14 13.86
2018-08-16 23.46
2018-08-17 8.11
2018-08-18 1.74
2018-08-19 10.73
2018-08-20 7.32
2018-08-21 0.00
2018-08-23 8.50
2018-08-24 8.63
2018-08-25 35.39
2018-08-27 10.57
2018-08-28 19.91
2018-08-29 0.54
2018-08-31 22.38
2018-09-01 7.53
2018-09-02 10.27
2018-09-03 30.66
2018-09-04 0.00
2018-09-05 9.09
2018-09-06 10.06
I'd hoped that it would be grouped like user_id 1000, all calls for Jan with duration summed, all calls for Feb with duration summed, etc.
I am really new to Python and programming in general and am not sure what my next step should be to get these grouped by user_id and month of the year.
Thanks in advance for any insight you can offer.
Regards,
Jared
Something is not quite right in your setup. First of all, both of your tables are the same, so I am not sure if this is a cut-and-paste error or something else. Here is what I do with your data. Load it up like so; note that we explicitly convert call_date to datetime:
from io import StringIO
import pandas as pd
df = pd.read_csv(StringIO(
"""
id user_id call_date duration
0 1000_93 1000 2018-12-27 8.52
1 1000_145 1000 2018-12-27 13.66
2 1000_247 1000 2018-12-27 14.48
3 1000_309 1000 2018-12-28 5.76
4 1000_380 1000 2018-12-30 4.22
5 1000_388 1000 2018-12-31 2.20
6 1000_510 1000 2018-12-27 5.75
7 1000_521 1000 2018-12-28 14.18
8 1000_530 1000 2018-12-28 5.77
9 1000_544 1000 2018-12-26 4.40
10 1000_693 1000 2018-12-31 4.31
11 1000_705 1000 2018-12-31 12.78
12 1000_735 1000 2018-12-29 1.70
13 1000_778 1000 2018-12-28 3.29
14 1000_826 1000 2018-12-26 9.96
15 1000_842 1000 2018-12-27 5.85
16 1001_0 1001 2018-09-06 10.06
17 1001_1 1001 2018-10-12 1.00
18 1001_2 1001 2018-10-17 15.83
19 1001_4 1001 2018-12-05 0.00
20 1001_5 1001 2018-12-13 6.27
21 1001_6 1001 2018-12-04 7.19
22 1001_8 1001 2018-11-17 2.45
23 1001_9 1001 2018-11-19 2.40
24 1001_11 1001 2018-11-09 1.00
25 1001_13 1001 2018-12-24 0.00
26 1001_19 1001 2018-11-15 30.00
27 1001_20 1001 2018-09-21 5.75
28 1001_23 1001 2018-10-27 0.98
29 1001_26 1001 2018-10-28 5.90
30 1001_29 1001 2018-09-30 14.78
"""), delim_whitespace = True, index_col=0)
df['call_date'] = pd.to_datetime(df['call_date'])
Then using
df.groupby(['user_id','call_date'])['duration'].sum()
does the expected grouping by user and by each date:
user_id call_date
1000 2018-12-26 14.36
2018-12-27 48.26
2018-12-28 29.00
2018-12-29 1.70
2018-12-30 4.22
2018-12-31 19.29
1001 2018-09-06 10.06
2018-09-21 5.75
2018-09-30 14.78
2018-10-12 1.00
2018-10-17 15.83
2018-10-27 0.98
2018-10-28 5.90
2018-11-09 1.00
2018-11-15 30.00
2018-11-17 2.45
2018-11-19 2.40
2018-12-04 7.19
2018-12-05 0.00
2018-12-13 6.27
2018-12-24 0.00
If you want to group by month as you seem to suggest you can use the Grouper functionality:
df.groupby(['user_id',pd.Grouper(key='call_date', freq='1M')])['duration'].sum()
which produces
user_id call_date
1000 2018-12-31 116.83
1001 2018-09-30 30.59
2018-10-31 23.71
2018-11-30 35.85
2018-12-31 13.46
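An equivalent monthly grouping, shown here only as a sketch and assuming call_date has already been converted with pd.to_datetime, groups on a month period instead of pd.Grouper:
# group each user's calls by calendar month via a Period key
df.groupby(['user_id', df['call_date'].dt.to_period('M')])['duration'].sum()
This gives the same per-month totals, indexed by user_id and a monthly Period rather than a month-end timestamp.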
Let me know if you are getting different results from following these steps.

Is there an option in pandas to see if value in column was less than another column in one row and then it changed over time?

I need to find cases where "price of y" was less than 3.5 until time 30:00,
and after that "price of x" jumped above 3.5.
I made a "Demical time" column to make it easier for me (less than 30:00 is less than 1800 seconds in decimal).
I tried to find all the cases in which price of y was under 3.5 (and above 0), but I failed to write code that gives the cases where price of y was under 3.5 AND price of x was greater than 3.5 after 30:00.
df1 = df[(df['price_of_Y']<3.5)&(df['price_of_Y']>0)& (df['Demical time']<1800)]
#the cases for price of y under 3.5 before time is 30:00 (Demical time =1800)
df2 = df[(df['price_of_X']>3.5) & (df['Demical time']>1800)]
#the cases for price of x above 3.5 after time is 30:00 (Demical time =1800)
# the question is how do i combine them to one line?
price_of_X time price_of_Y Demical time
0 3.30 0 4.28 0
1 3.30 0:00 4.28 0
2 3.30 0:00 4.28 0
3 3.30 0:00 4.28 0
4 3.30 0:00 4.28 0
5 3.30 0:00 4.28 0
6 3.30 0:00 4.28 0
7 3.30 0:00 4.28 0
8 3.30 0:00 4.28 0
9 3.30 0:00 4.28 0
10 3.30 0:00 4.28 0
11 3.25 0:26 4.28 26
12 3.40 1:43 4.28 103
13 3.25 3:00 4.28 180
14 3.25 4:16 4.28 256
15 3.40 5:34 4.28 334
16 3.40 6:52 4.28 412
17 3.40 8:09 4.28 489
18 3.40 9:31 4.28 571
19 5.00 10:58 8.57 658
20 5.00 12:13 8.57 733
21 5.00 13:31 7.38 811
22 5.00 14:47 7.82 887
23 5.00 16:01 7.82 961
24 5.00 17:18 7.38 1038
25 5.00 18:33 7.38 1113
26 5.00 19:50 7.38 1190
27 5.00 21:09 7.38 1269
28 5.00 22:22 7.38 1342
29 5.00 23:37 8.13 1417
... ... ... ... ...
18138 7.50 59:03:00 28.61 3543
18139 7.50 60:19:00 28.61 3619
18140 7.50 61:35:00 34.46 3695
18141 8.00 62:48:00 30.16 3768
18142 7.50 64:03:00 34.46 3843
18143 8.00 65:20:00 30.16 3920
18144 7.50 66:34:00 28.61 3994
18145 7.50 67:53:00 30.16 4073
18146 8.00 69:08:00 26.19 4148
18147 7.00 70:23:00 23.10 4223
18148 7.00 71:38:00 23.10 4298
18149 8.00 72:50:00 30.16 4370
18150 7.50 74:09:00 26.19 4449
18151 7.50 75:23:00 25.58 4523
18152 7.00 76:40:00 19.07 4600
18153 7.00 77:53:00 19.07 4673
18154 9.00 79:11:00 31.44 4751
18155 9.00 80:27:00 27.11 4827
18156 10.00 81:41:00 34.52 4901
18157 10.00 82:56:00 34.52 4976
18158 11.00 84:16:00 43.05 5056
18159 10.00 85:35:00 29.42 5135
18160 10.00 86:49:00 29.42 5209
18161 11.00 88:04:00 35.70 5284
18162 13.00 89:19:00 70.38 5359
18163 15.00 90:35:00 70.42 5435
18164 19.00 91:48:00 137.70 5508
18165 23.00 93:01:00 511.06 5581
18166 NaN NaN NaN 0
18167 NaN NaN NaN 0
[18168 rows x 4 columns]
This should solve it.
I have used slightly different data and condition values, but you should get the idea of what I am doing.
import pandas as pd
df = pd.DataFrame({'price_of_X': [3.30,3.25,3.40,3.25,3.25,3.40],
'price_of_Y': [2.28,1.28,4.28,4.28,1.18,3.28],
'Decimal_time': [0,26,103,180,256,334]
})
print(df)
df1 = df.loc[(df['price_of_Y']<3.5)&(df['price_of_X']>3.3)&(df['Decimal_time']>103),:]
print(df1)
output:
df
price_of_X price_of_Y Decimal_time
0 3.30 2.28 0
1 3.25 1.28 26
2 3.40 4.28 103
3 3.25 4.28 180
4 3.25 1.18 256
5 3.40 3.28 334
df1
price_of_X price_of_Y Decimal_time
5 3.4 3.28 334
Similar to what @IMCoins suggested in a comment, use two boolean masks to achieve the selection that you require.
mask1 = (df['price_of_Y'] < 3.5) & (df['price_of_Y'] > 0) & (df['Demical time'] < 1800)
mask2 = (df['price_of_X'] > 3.5) & (df['Demical time'] > 1800)
df[mask1 | mask2]
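If the goal is to keep the rows only when both phases actually occur (y under 3.5 before 1800 s and x above 3.5 after 1800 s), a possible follow-up, sketched with the same two masks, is:
# hypothetical follow-up reusing mask1 and mask2 from above
if mask1.any() and mask2.any():       # both phases are present somewhere in the data
    df_both = df[mask1 | mask2]       # rows belonging to either phase
else:
    df_both = df.iloc[0:0]            # empty frame: the pattern never happened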

How to compute a column depending on previous values of one and current values of another column

I have not too much experience with pandas, and I have the following DataFrame:
month A B
2/28/2017 0.7377573034 0
3/31/2017 0.7594787565 3.7973937824
4/30/2017 0.7508308808 3.7541544041
5/31/2017 0.7038814004 7.0388140044
6/30/2017 0.6920212254 11.0723396061
7/31/2017 0.6801610503 11.5627378556
8/31/2017 0.6683008753 10.6928140044
9/30/2017 0.7075915026 11.3214640415
10/31/2017 0.6989436269 7.6883798964
11/30/2017 0.6259514607 4.3816602247
12/31/2017 0.6119757303 3.671854382
1/31/2018 0.633 3.798
2/28/2018 0.598 4.784
3/31/2018 0.673 5.384
4/30/2018 0.673 1.346
5/31/2018 0.609 0
6/30/2018 0.609 0
7/31/2018 0.609 0
8/31/2018 0.609 0
9/30/2018 0.673 0
10/31/2018 0.673 0
11/30/2018 0.598 0
12/31/2018 0.598 0
I need to compute column C, which is basically column A times column B, where the value of column B is taken from the same month of the previous year. In addition, for rows whose month has no counterpart in the previous year, the value should be zero. To be more specific, this is what I expect C to be:
C
0 # these values are zero because the corresponding month in the previous year is not in column A
0
0
0
0
0
0
0
0
0
0
0
0 # 0.598 * 0
2.5556460155552 # 0.673 * 3.7973937824
2.5265459139593 # 0.673 * 3.7541544041
4.2866377286796 # 0.609 * 7.0388140044
6.7430548201149 # 0.609 * 11.0723396061
7.0417073540604 # 0.609 * 11.5627378556
6.5119237286796 # 0.609 * 10.6928140044
7.6193452999295 # 0.673 * 11.3214640415
5.1742796702772 # 0.673 * 7.6883798964
2.6202328143706 # 0.598 * 4.3816602247
2.195768920436 # 0.598 * 3.671854382
How can I achieve this? I am sure there must be a way to do it without using a for loop. Thanks in advance.
In [73]: (df.drop('B',1)
...: .merge(df.drop('A',1)
...: .assign(month=df.month + pd.offsets.MonthEnd(12)),
...: on='month', how='left')
...: .eval("C = A * B", inplace=False)
...: .fillna(0)
...: )
...:
Out[73]:
month A B C
0 2017-02-28 0.737757 0.000000 0.000000
1 2017-03-31 0.759479 0.000000 0.000000
2 2017-04-30 0.750831 0.000000 0.000000
3 2017-05-31 0.703881 0.000000 0.000000
4 2017-06-30 0.692021 0.000000 0.000000
5 2017-07-31 0.680161 0.000000 0.000000
6 2017-08-31 0.668301 0.000000 0.000000
7 2017-09-30 0.707592 0.000000 0.000000
8 2017-10-31 0.698944 0.000000 0.000000
9 2017-11-30 0.625951 0.000000 0.000000
10 2017-12-31 0.611976 0.000000 0.000000
11 2018-01-31 0.633000 0.000000 0.000000
12 2018-02-28 0.598000 0.000000 0.000000
13 2018-03-31 0.673000 3.797394 2.555646
14 2018-04-30 0.673000 3.754154 2.526546
15 2018-05-31 0.609000 7.038814 4.286638
16 2018-06-30 0.609000 11.072340 6.743055
17 2018-07-31 0.609000 11.562738 7.041707
18 2018-08-31 0.609000 10.692814 6.511924
19 2018-09-30 0.673000 11.321464 7.619345
20 2018-10-31 0.673000 7.688380 5.174280
21 2018-11-30 0.598000 4.381660 2.620233
22 2018-12-31 0.598000 3.671854 2.195769
Explanation:
we can generate a helper DF like this (we have added 12 months to the month column and dropped the A column):
In [77]: df.drop('A',1).assign(month=df.month + pd.offsets.MonthEnd(12))
Out[77]:
month B
0 2018-02-28 0.000000
1 2018-03-31 3.797394
2 2018-04-30 3.754154
3 2018-05-31 7.038814
4 2018-06-30 11.072340
5 2018-07-31 11.562738
6 2018-08-31 10.692814
7 2018-09-30 11.321464
8 2018-10-31 7.688380
9 2018-11-30 4.381660
10 2018-12-31 3.671854
11 2019-01-31 3.798000
12 2019-02-28 4.784000
13 2019-03-31 5.384000
14 2019-04-30 1.346000
15 2019-05-31 0.000000
16 2019-06-30 0.000000
17 2019-07-31 0.000000
18 2019-08-31 0.000000
19 2019-09-30 0.000000
20 2019-10-31 0.000000
21 2019-11-30 0.000000
22 2019-12-31 0.000000
now we can merge it with the original DF (we don't need the B column in the original DF):
In [79]: (df.drop('B',1)
...: .merge(df.drop('A',1)
...: .assign(month=df.month + pd.offsets.MonthEnd(12)),
...: on='month', how='left'))
Out[79]:
month A B
0 2017-02-28 0.737757 NaN
1 2017-03-31 0.759479 NaN
2 2017-04-30 0.750831 NaN
3 2017-05-31 0.703881 NaN
4 2017-06-30 0.692021 NaN
5 2017-07-31 0.680161 NaN
6 2017-08-31 0.668301 NaN
7 2017-09-30 0.707592 NaN
8 2017-10-31 0.698944 NaN
9 2017-11-30 0.625951 NaN
10 2017-12-31 0.611976 NaN
11 2018-01-31 0.633000 NaN
12 2018-02-28 0.598000 0.000000
13 2018-03-31 0.673000 3.797394
14 2018-04-30 0.673000 3.754154
15 2018-05-31 0.609000 7.038814
16 2018-06-30 0.609000 11.072340
17 2018-07-31 0.609000 11.562738
18 2018-08-31 0.609000 10.692814
19 2018-09-30 0.673000 11.321464
20 2018-10-31 0.673000 7.688380
21 2018-11-30 0.598000 4.381660
22 2018-12-31 0.598000 3.671854
then, using .eval("C = A * B", inplace=False), we can generate the new column "on the fly".
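A different route to the same result, sketched under the assumption that month is already a datetime column and that each month appears only once, is to shift every month back a year and look B up from a month-indexed Series:
# hypothetical alternative: prev_b is B indexed by month
prev_b = df.set_index('month')['B']
df['C'] = (df['A'] * (df['month'] - pd.offsets.MonthEnd(12)).map(prev_b)).fillna(0)
Months with no counterpart twelve months earlier map to NaN and are filled with 0, matching the expected output.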

How to split a column into multiple columns in pandas?

I have this data in a pandas dataframe,
name date close quantity daily_cumm_returns
0 AARTIIND 2000-01-03 3.84 21885.82 0.000000
1 AARTIIND 2000-01-04 3.60 56645.64 -0.062500
2 AARTIIND 2000-01-05 3.52 24460.62 -0.083333
3 AARTIIND 2000-01-06 3.58 42484.24 -0.067708
4 AARTIIND 2000-01-07 3.42 16736.21 -0.109375
5 AARTIIND 2000-01-10 3.42 20598.42 -0.109375
6 AARTIIND 2000-01-11 3.41 20598.42 -0.111979
7 AARTIIND 2000-01-12 3.27 100417.29 -0.148438
8 AARTIIND 2000-01-13 3.43 20598.42 -0.106771
9 AARTIIND 2000-01-14 3.60 5149.61 -0.062500
10 AARTIIND 2000-01-17 3.46 14161.42 -0.098958
11 AARTIIND 2000-01-18 3.50 136464.53 -0.088542
12 AARTIIND 2000-01-19 3.52 21885.82 -0.083333
13 AARTIIND 2000-01-20 3.73 75956.66 -0.028646
14 AARTIIND 2000-01-21 3.84 77244.07 0.000000
15 AARTIIND 2000-02-01 4.21 90118.08 0.000000
16 AARTIIND 2000-02-02 4.52 238169.21 0.073634
17 AARTIIND 2000-02-03 4.38 163499.94 0.040380
18 AARTIIND 2000-02-04 4.44 108141.71 0.054632
19 AARTIIND 2000-02-07 4.26 68232.27 0.011876
20 AARTIIND 2000-02-08 4.00 108141.71 -0.049881
21 AARTIIND 2000-02-09 3.96 32185.04 -0.059382
22 AARTIIND 2000-02-10 4.13 43771.63 -0.019002
23 AARTIIND 2000-02-11 3.96 3862.20 -0.059382
24 AARTIIND 2000-02-14 3.94 12874.01 -0.064133
25 AARTIIND 2000-02-15 3.90 33472.42 -0.073634
26 AARTIIND 2000-02-16 3.90 25748.02 -0.073634
27 AARTIIND 2000-02-17 3.90 60507.86 -0.073634
28 AARTIIND 2000-02-18 4.22 45059.04 0.002375
29 AARTIIND 2000-02-21 4.42 81106.27 0.049881
I wish to select every month's data and transpose it into a new row.
For example, the first 15 rows should become one row with name AARTIIND, date 2000-01-03, and then 15 columns holding the daily cumulative returns.
name date first second third fourth fifth .... fifteenth
0 AARTIIND 2000-01-03 0.00 -0.062 -0.083 -0.067 -0.109 .... 0.00
To group the data month-wise I am using:
group = df.groupby([pd.Grouper(freq='1M', key='date'), 'name'])
Setting the rows individually using the code below is very slow, and my dataset has 1 million rows:
data = pd.DataFrame(columns = ('name', 'date', 'daily_zscore_1', 'daily_zscore_2', 'daily_zscore_3', 'daily_zscore_4', 'daily_zscore_5', 'daily_zscore_6', 'daily_zscore_7', 'daily_zscore_8', 'daily_zscore_9', 'daily_zscore_10', 'daily_zscore_11', 'daily_zscore_12', 'daily_zscore_13', 'daily_zscore_14', 'daily_zscore_15'))
data.loc[0] = [x['name'].iloc[0], x['date'].iloc[0]].extend(x['daily_cumm_returns'])
Is there any other, faster way to accomplish this? As I see it, this is just transposing one column and hence should be very fast. I tried pivot and melt but don't understand how to use them in this situation.
This is a bit sloppy but it gets the job done.
# grab AAPL data
from pandas_datareader import data
df = data.DataReader('AAPL', 'google', start='2014-01-01')[['Close', 'Volume']]
# add name column
df['name'] = 'AAPL'
# get daily return relative to first of month
df['daily_cumm_return'] = df.resample('M')['Close'].transform(lambda x: (x - x[0]) / x[0])
# get the first of the month for each date
df['first_month_date'] = df.assign(index_col=df.index).resample('M')['index_col'].transform('first')
# get a ranking of the days 1 to n
df['day_rank']= df.resample('M')['first_month_date'].rank(method='first')
# pivot to get final
df_final = df.pivot_table(index=['name', 'first_month_date'], columns='day_rank', values='daily_cumm_return')
Sample Output
day_rank 1.0 2.0 3.0 4.0 5.0 6.0 \
name first_month_date
AAPL 2014-01-02 0.0 -0.022020 -0.016705 -0.023665 -0.017464 -0.029992
2014-02-03 0.0 0.014375 0.022052 0.021912 0.036148 0.054710
2014-03-03 0.0 0.006632 0.008754 0.005704 0.005173 0.006102
2014-04-01 0.0 0.001680 -0.005299 -0.018222 -0.033600 -0.033600
2014-05-01 0.0 0.001775 0.015976 0.004970 0.001420 -0.005917
2014-06-02 0.0 0.014141 0.025721 0.029729 0.026834 0.043314
2014-07-01 0.0 -0.000428 0.005453 0.026198 0.019568 0.019996
day_rank 7.0 8.0 9.0 10.0 11.0 \
name first_month_date
AAPL 2014-01-02 -0.036573 -0.031511 -0.012149 0.007593 0.002025
2014-02-03 0.068667 0.068528 0.085555 0.084578 0.088625
2014-03-03 0.015785 0.016846 0.005571 -0.005704 -0.001857
2014-04-01 -0.020936 -0.033600 -0.040708 -0.036831 -0.043810
2014-05-01 -0.010059 0.002249 0.003787 0.004024 -0.004497
2014-06-02 0.049438 0.045095 0.027614 0.016368 0.026612
2014-07-01 0.016253 0.018178 0.031330 0.019247 0.013473
day_rank 12.0 13.0 14.0 15.0 16.0 \
name first_month_date
AAPL 2014-01-02 -0.022526 -0.007340 -0.002911 0.005442 -0.012782
2014-02-03 0.071458 0.059037 0.047313 0.051779 0.040893
2014-03-03 0.006897 0.006632 0.001857 0.009683 0.021754
2014-04-01 -0.041871 -0.030887 -0.019385 -0.018351 -0.031274
2014-05-01 0.010178 0.022130 0.022367 0.025089 0.026627
2014-06-02 0.025276 0.026389 0.022826 0.012248 0.011357
2014-07-01 -0.004598 0.009731 0.004491 0.012831 0.039243
day_rank 17.0 18.0 19.0 20.0 21.0 \
name first_month_date
AAPL 2014-01-02 -0.004809 -0.084282 -0.094660 -0.096431 -0.095039
2014-02-03 0.031542 0.052059 0.049267 NaN NaN
2014-03-03 0.032763 0.022815 0.018437 0.017244 0.017111
2014-04-01 0.048204 0.055958 0.096795 0.093564 0.089429
2014-05-01 0.038225 0.057751 0.054911 0.074201 0.070178
2014-06-02 0.005233 0.006124 0.012137 0.024162 0.034740
2014-07-01 0.037532 0.044376 0.058811 0.051967 0.049508
day_rank 22.0 23.0
name first_month_date
AAPL 2014-01-02 NaN NaN
2014-02-03 NaN NaN
2014-03-03 NaN NaN
2014-04-01 NaN NaN
2014-05-01 NaN NaN
2014-06-02 NaN NaN
2014-07-01 0.022241 NaN
Admittedly this does not get exactly what you want...
I think one way to handle this problem would be to create new columns of month and day based on the datetime (date) column, then set a multiindex on the month and name, then pivot the table.
df['month'] = df.date.dt.month
df['day'] = df.date.dt.day
df.set_index(['month', 'name'], inplace=True)
df[['day', 'daily_cumm_returns']].pivot(index=df.index, columns='day')
Result is:
daily_cumm_returns \
day 1 2 3 4 5
month name
1 AARTIIND NaN NaN 0.00000 -0.062500 -0.083333
2 AARTIIND 0.0 0.073634 0.04038 0.054632 NaN
I can't figure out a way to keep the first date of each month group as a column, otherwise I think this is more or less what you're after.
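Assuming we start again from the original frame with date as a datetime column, a sketch that also keeps the first date of each month is to rank the days within each (name, month) group and pivot on that rank:
# hypothetical sketch: one row per (name, month), one column per trading day of the month
df['month_start'] = df.groupby(['name', df['date'].dt.to_period('M')])['date'].transform('first')
df['day_rank'] = df.groupby(['name', 'month_start']).cumcount() + 1
wide = df.pivot_table(index=['name', 'month_start'], columns='day_rank',
                      values='daily_cumm_returns')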

How to group pandas DataFrame by varying dates?

I am trying to roll up daily data into fiscal quarter data. For example, I have a table with fiscal quarter end dates:
Company Period Quarter_End
M 2016Q1 05/02/2015
M 2016Q2 08/01/2015
M 2016Q3 10/31/2015
M 2016Q4 01/30/2016
WFM 2015Q2 04/12/2015
WFM 2015Q3 07/05/2015
WFM 2015Q4 09/27/2015
WFM 2016Q1 01/17/2016
and a table of daily data:
Company Date Price
M 06/20/2015 1.05
M 06/22/2015 4.05
M 07/10/2015 3.45
M 07/29/2015 1.86
M 08/24/2015 1.58
M 09/02/2015 8.64
M 09/22/2015 2.56
M 10/20/2015 5.42
M 11/02/2015 1.58
M 11/24/2015 4.58
M 12/03/2015 6.48
M 12/05/2015 4.56
M 01/03/2016 7.14
M 01/30/2016 6.34
WFM 06/20/2015 1.05
WFM 06/22/2015 4.05
WFM 07/10/2015 3.45
WFM 07/29/2015 1.86
WFM 08/24/2015 1.58
WFM 09/02/2015 8.64
WFM 09/22/2015 2.56
WFM 10/20/2015 5.42
WFM 11/02/2015 1.58
WFM 11/24/2015 4.58
WFM 12/03/2015 6.48
WFM 12/05/2015 4.56
WFM 01/03/2016 7.14
WFM 01/17/2016 6.34
And I would like to create the table below.
Company Period Quarter_end Sum(Price)
M 2016Q2 8/1/2015 10.41
M 2016Q3 10/31/2015 18.2
M 2016Q4 1/30/2016 30.68
WFM 2015Q3 7/5/2015 5.1
WFM 2015Q4 9/27/2015 18.09
WFM 2016Q1 1/17/2016 36.1
However, I don't know how to group by varying dates without looping through each record. Any help is greatly appreciated.
Thanks!
I think you can use merge_ordered:
#first convert columns to datetime
df1.Quarter_End = pd.to_datetime(df1.Quarter_End)
df2.Date = pd.to_datetime(df2.Date)
df = pd.merge_ordered(df1,
df2,
left_on=['Company','Quarter_End'],
right_on=['Company','Date'],
how='outer')
print (df)
Company Period Quarter_End Date Price
0 M 2016Q1 2015-05-02 NaT NaN
1 M NaN NaT 2015-06-20 1.05
2 M NaN NaT 2015-06-22 4.05
3 M NaN NaT 2015-07-10 3.45
4 M NaN NaT 2015-07-29 1.86
5 M 2016Q2 2015-08-01 NaT NaN
6 M NaN NaT 2015-08-24 1.58
7 M NaN NaT 2015-09-02 8.64
8 M NaN NaT 2015-09-22 2.56
9 M NaN NaT 2015-10-20 5.42
10 M 2016Q3 2015-10-31 NaT NaN
11 M NaN NaT 2015-11-02 1.58
12 M NaN NaT 2015-11-24 4.58
13 M NaN NaT 2015-12-03 6.48
14 M NaN NaT 2015-12-05 4.56
15 M NaN NaT 2016-01-03 7.14
16 M 2016Q4 2016-01-30 2016-01-30 6.34
17 WFM 2015Q2 2015-04-12 NaT NaN
18 WFM NaN NaT 2015-06-20 1.05
19 WFM NaN NaT 2015-06-22 4.05
20 WFM 2015Q3 2015-07-05 NaT NaN
21 WFM NaN NaT 2015-07-10 3.45
22 WFM NaN NaT 2015-07-29 1.86
23 WFM NaN NaT 2015-08-24 1.58
24 WFM NaN NaT 2015-09-02 8.64
25 WFM NaN NaT 2015-09-22 2.56
26 WFM 2015Q4 2015-09-27 NaT NaN
27 WFM NaN NaT 2015-10-20 5.42
28 WFM NaN NaT 2015-11-02 1.58
29 WFM NaN NaT 2015-11-24 4.58
30 WFM NaN NaT 2015-12-03 6.48
31 WFM NaN NaT 2015-12-05 4.56
32 WFM NaN NaT 2016-01-03 7.14
33 WFM 2016Q1 2016-01-17 2016-01-17 6.34
Then backfill the NaN values in the Period and Quarter_End columns with bfill and aggregate with sum. If you need to remove all remaining NaN values, add Series.dropna and finally reset_index:
df.Period = df.Period.bfill()
df.Quarter_End = df.Quarter_End.bfill()
print (df.groupby(['Company','Period','Quarter_End'])['Price'].sum().dropna().reset_index())
Company Period Quarter_End Price
0 M 2016Q2 2015-08-01 10.41
1 M 2016Q3 2015-10-31 18.20
2 M 2016Q4 2016-01-30 30.68
3 WFM 2015Q3 2015-07-05 5.10
4 WFM 2015Q4 2015-09-27 18.09
5 WFM 2016Q1 2016-01-17 36.10
The steps are: set_index, pd.concat to align the indices, then groupby with agg:
prd_df = period_df.set_index(['Company', 'Quarter_End'])
prc_df = price_df.set_index(['Company', 'Date'], drop=False)
df = pd.concat([prd_df, prc_df], axis=1)
df.groupby([df.index.get_level_values(0), df.Period.bfill()]) \
.agg(dict(Date='last', Price='sum')).dropna()
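For completeness, a sketch of the same roll-up with merge_asof, assuming df1 is the quarter table, df2 is the daily table, and both date columns are already datetime: each daily row is matched to the next quarter end on or after its date, then the prices are summed per quarter.
df1 = df1.sort_values('Quarter_End')   # merge_asof needs both frames sorted on the merge keys
df2 = df2.sort_values('Date')
matched = pd.merge_asof(df2, df1, left_on='Date', right_on='Quarter_End',
                        by='Company', direction='forward')
out = (matched.groupby(['Company', 'Period', 'Quarter_End'])['Price']
              .sum().reset_index())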
