Getting NaN when using pandas groupby - python

I have a dataframe like so:
index date symbol stock_id open high low close volume vwap
0 0 2021-10-11 BVN 13 7.69 7.98 7.5600 7.61 879710 7.782174
1 1 2021-10-12 BVN 13 7.67 8.08 7.5803 8.02 794436 7.967061
2 2 2021-10-13 BVN 13 8.12 8.36 8.0900 8.16 716012 8.231286
3 3 2021-10-14 BVN 13 8.26 8.29 8.0500 8.28 586091 8.185899
4 4 2021-10-15 BVN 13 8.18 8.44 8.0600 8.44 1278409 8.284539
... ... ... ... ... ... ... ... ... ... ...
227774 227774 2022-10-04 ERIC 11000 6.27 6.32 6.2400 6.29 14655189 6.280157
227775 227775 2022-10-05 ERIC 11000 6.17 6.31 6.1500 6.29 10569193 6.219965
227776 227776 2022-10-06 ERIC 11000 6.20 6.25 6.1800 6.22 7918812 6.217198
227777 227777 2022-10-07 ERIC 11000 6.17 6.19 6.0800 6.10 9671252 6.135976
227778 227778 2022-10-10 ERIC 11000 6.13 6.15 6.0200 6.04 6310661 6.066256
[227779 rows x 10 columns]
And then a function that returns a boolean mask indicating whether the df is consolidating inside a range:
def is_consolidating(df, window=2, minp=2, percentage=0.95):
    rolling_min = pd.Series(df['close']).rolling(window=window, min_periods=minp).min()
    rolling_max = pd.Series(df['close']).rolling(window=window, min_periods=minp).max()
    consolidation = np.where((rolling_min / rolling_max) >= percentage, True, False)
    return consolidation
Which I then call like:
df['t'] = df.groupby("stock_id").apply(is_consolidating)
The problem is that when I print the df I am getting NaN for the values of my new column:
dan@danalgo:~/Documents/code/wolfhound$ python3 add_indicators_daily.py
index date symbol stock_id open high low close volume vwap t
0 0 2021-10-11 BVN 13 7.69 7.98 7.5600 7.61 879710 7.782174 NaN
1 1 2021-10-12 BVN 13 7.67 8.08 7.5803 8.02 794436 7.967061 NaN
2 2 2021-10-13 BVN 13 8.12 8.36 8.0900 8.16 716012 8.231286 NaN
3 3 2021-10-14 BVN 13 8.26 8.29 8.0500 8.28 586091 8.185899 NaN
4 4 2021-10-15 BVN 13 8.18 8.44 8.0600 8.44 1278409 8.284539 NaN
... ... ... ... ... ... ... ... ... ... ... ...
227774 227774 2022-10-04 ERIC 11000 6.27 6.32 6.2400 6.29 14655189 6.280157 NaN
227775 227775 2022-10-05 ERIC 11000 6.17 6.31 6.1500 6.29 10569193 6.219965 NaN
227776 227776 2022-10-06 ERIC 11000 6.20 6.25 6.1800 6.22 7918812 6.217198 NaN
227777 227777 2022-10-07 ERIC 11000 6.17 6.19 6.0800 6.10 9671252 6.135976 NaN
227778 227778 2022-10-10 ERIC 11000 6.13 6.15 6.0200 6.04 6310661 6.066256 NaN
[227779 rows x 11 columns]
Full code:
import pandas as pd
from IPython.display import display
import sqlite3 as sql
import numpy as np
conn = sql.connect('allStockData.db')
# get everything inside daily_ohlc and add to a dataframe
df = pd.read_sql_query("SELECT * from daily_ohlc_init", conn)
def is_consolidating(df, window=2, minp=2, percentage=0.95):
    rolling_min = pd.Series(df['close']).rolling(window=window, min_periods=minp).min()
    rolling_max = pd.Series(df['close']).rolling(window=window, min_periods=minp).max()
    consolidation = np.where((rolling_min / rolling_max) >= percentage, True, False)
    return consolidation
df['t'] = df.groupby("stock_id").apply(is_consolidating)
print(df)
df.to_sql('daily_ohlc_init_with_indicators', if_exists='replace', con=conn, index=True)

You could do it like this:
def is_consolidating(grp, window=2, minp=2, percentage=0.95):
    rolling_min = pd.Series(grp).rolling(window=window, min_periods=minp).min()
    rolling_max = pd.Series(grp).rolling(window=window, min_periods=minp).max()
    consolidation = np.where((rolling_min / rolling_max) >= percentage, True, False)
    return pd.Series(consolidation, index=grp.index)
df['t'] = df.groupby("stock_id")['close'].apply(is_consolidating)
print(df)
Output (part of it):
volume vwap t
0 879710 7.782174 False
1 794436 7.967061 False
2 716012 8.231286 True
3 586091 8.185899 True
4 1278409 8.284539 True
227774 14655189 6.280157 False
227775 10569193 6.219965 True
227776 7918812 6.217198 True
227777 9671252 6.135976 True
227778 6310661 6.066256 True
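The NaN values in the original attempt come from index alignment: df.groupby("stock_id").apply(is_consolidating) returns a plain NumPy array per group, so the result is a Series indexed by stock_id rather than by the original row labels, and assigning it to df['t'] cannot align with the dataframe's index. Returning a Series indexed like the group, as above, fixes that. As a minimal alternative sketch (assuming the same dataframe and column names), transform can be used instead, since it always returns a result aligned with the caller's index:
def consolidating_flags(close, window=2, minp=2, percentage=0.95):
    # rolling min/max over the close prices of a single stock
    rolling_min = close.rolling(window=window, min_periods=minp).min()
    rolling_max = close.rolling(window=window, min_periods=minp).max()
    # NaN comparisons evaluate to False, matching the np.where version above
    return (rolling_min / rolling_max) >= percentage

df['t'] = df.groupby("stock_id")['close'].transform(consolidating_flags)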

Related

NaN when converting df to a series

I have a dataframe with OHLC data. I need to get the close price into a pandas Series, using the timestamp column as the index.
I am reading from a sqlite db into my df:
conn = sql.connect('allStockData.db')
price = pd.read_sql_query("SELECT * from ohlc_minutes", conn)
price['timestamp'] = pd.to_datetime(price['timestamp'])
print(price)
Which returns:
timestamp open high low close volume trade_count vwap symbol volume_10_day
0 2022-09-16 08:00:00+00:00 3.19 3.570 3.19 3.350 66475 458 3.404240 AAOI NaN
1 2022-09-16 08:05:00+00:00 3.35 3.440 3.33 3.430 28925 298 3.381131 AAOI NaN
2 2022-09-16 08:10:00+00:00 3.44 3.520 3.35 3.400 62901 643 3.445096 AAOI NaN
3 2022-09-16 08:15:00+00:00 3.37 3.390 3.31 3.360 17943 184 3.339721 AAOI NaN
4 2022-09-16 08:20:00+00:00 3.36 3.410 3.34 3.400 29123 204 3.383370 AAOI NaN
... ... ... ... ... ... ... ... ... ... ...
8759 2022-09-08 23:35:00+00:00 1.35 1.360 1.35 1.355 3835 10 1.350613 RUBY 515994.5
8760 2022-09-08 23:40:00+00:00 1.36 1.360 1.35 1.350 2780 7 1.353687 RUBY 515994.5
8761 2022-09-08 23:45:00+00:00 1.35 1.355 1.35 1.355 7080 11 1.350424 RUBY 515994.5
8762 2022-09-08 23:50:00+00:00 1.35 1.360 1.33 1.360 11664 30 1.351104 RUBY 515994.5
8763 2022-09-08 23:55:00+00:00 1.36 1.360 1.33 1.340 21394 32 1.348223 RUBY 515994.5
[8764 rows x 10 columns]
When I try to get the close into a series with the timestamp:
price = pd.Series(price['close'], index=price['timestamp'])
It returns a bunch of NaNs:
2022-09-16 08:00:00+00:00 NaN
2022-09-16 08:05:00+00:00 NaN
2022-09-16 08:10:00+00:00 NaN
2022-09-16 08:15:00+00:00 NaN
2022-09-16 08:20:00+00:00 NaN
..
2022-09-08 23:35:00+00:00 NaN
2022-09-08 23:40:00+00:00 NaN
2022-09-08 23:45:00+00:00 NaN
2022-09-08 23:50:00+00:00 NaN
2022-09-08 23:55:00+00:00 NaN
Name: close, Length: 8764, dtype: float64
If I remove the index:
price = pd.Series(price['close'])
The close is returned normally:
0 3.350
1 3.430
2 3.400
3 3.360
4 3.400
...
8759 1.355
8760 1.350
8761 1.355
8762 1.360
8763 1.340
Name: close, Length: 8764, dtype: float64
How can I return the close column as a pandas series, using my timestamp column as the index?
It's because price['close'] has its own index, which is incompatible with the timestamp index. Try using .values instead:
price = pd.Series(price['close'].values, index=price['timestamp'])
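A small self-contained illustration of why the original call returned NaN (toy values, not the real data): when the data passed to pd.Series is itself a Series, pandas reindexes it against the new index instead of repositioning the values, and none of the timestamps exist in the old integer index.
import pandas as pd

s = pd.Series([3.35, 3.43])                                   # integer index, like price['close']
ts = pd.to_datetime(['2022-09-16 08:00', '2022-09-16 08:05'])

print(pd.Series(s, index=ts))         # reindexed by label -> all NaN
print(pd.Series(s.values, index=ts))  # raw values repositioned -> 3.35, 3.43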
I needed to set the timestamp as the index before getting the close as a series:
conn = sql.connect('allStockData.db')
price = pd.read_sql_query("SELECT * from ohlc_minutes", conn)
price['timestamp'] = pd.to_datetime(price['timestamp'])
price = price.set_index('timestamp')
print(price)
price = pd.Series(price['close'])
print(price)
Gives:
2022-09-16 08:00:00+00:00 3.350
2022-09-16 08:05:00+00:00 3.430
2022-09-16 08:10:00+00:00 3.400
2022-09-16 08:15:00+00:00 3.360
2022-09-16 08:20:00+00:00 3.400
...
2022-09-08 23:35:00+00:00 1.355
2022-09-08 23:40:00+00:00 1.350
2022-09-08 23:45:00+00:00 1.355
2022-09-08 23:50:00+00:00 1.360
2022-09-08 23:55:00+00:00 1.340
Name: close, Length: 8764, dtype: float64
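For what it's worth, the set_index step and the column selection can be chained in one expression; a minimal sketch under the same assumptions (the ohlc_minutes table has timestamp and close columns):
conn = sql.connect('allStockData.db')
price = pd.read_sql_query("SELECT * from ohlc_minutes", conn)
price['timestamp'] = pd.to_datetime(price['timestamp'])

# set the index and pull out close in one step
close = price.set_index('timestamp')['close']
print(close)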

pandas - groupby multiple values?

I have a dataframe that contains cell phone minutes usage, logged by date of call and duration.
It looks like this (30-row sample):
id user_id call_date duration
0 1000_93 1000 2018-12-27 8.52
1 1000_145 1000 2018-12-27 13.66
2 1000_247 1000 2018-12-27 14.48
3 1000_309 1000 2018-12-28 5.76
4 1000_380 1000 2018-12-30 4.22
5 1000_388 1000 2018-12-31 2.20
6 1000_510 1000 2018-12-27 5.75
7 1000_521 1000 2018-12-28 14.18
8 1000_530 1000 2018-12-28 5.77
9 1000_544 1000 2018-12-26 4.40
10 1000_693 1000 2018-12-31 4.31
11 1000_705 1000 2018-12-31 12.78
12 1000_735 1000 2018-12-29 1.70
13 1000_778 1000 2018-12-28 3.29
14 1000_826 1000 2018-12-26 9.96
15 1000_842 1000 2018-12-27 5.85
16 1001_0 1001 2018-09-06 10.06
17 1001_1 1001 2018-10-12 1.00
18 1001_2 1001 2018-10-17 15.83
19 1001_4 1001 2018-12-05 0.00
20 1001_5 1001 2018-12-13 6.27
21 1001_6 1001 2018-12-04 7.19
22 1001_8 1001 2018-11-17 2.45
23 1001_9 1001 2018-11-19 2.40
24 1001_11 1001 2018-11-09 1.00
25 1001_13 1001 2018-12-24 0.00
26 1001_19 1001 2018-11-15 30.00
27 1001_20 1001 2018-09-21 5.75
28 1001_23 1001 2018-10-27 0.98
29 1001_26 1001 2018-10-28 5.90
30 1001_29 1001 2018-09-30 14.78
I want to group by user_id AND call_date with the ultimate goal of calculating the number of minutes used per month over the course of the year, per user.
I thought I could accomplish this by using:
calls.groupby(['user_id','call_date'])['duration'].sum()
but the results aren't what I expected:
user_id call_date
1000 2018-12-26 14.36
2018-12-27 48.26
2018-12-28 29.00
2018-12-29 1.70
2018-12-30 4.22
2018-12-31 19.29
1001 2018-08-14 13.86
2018-08-16 23.46
2018-08-17 8.11
2018-08-18 1.74
2018-08-19 10.73
2018-08-20 7.32
2018-08-21 0.00
2018-08-23 8.50
2018-08-24 8.63
2018-08-25 35.39
2018-08-27 10.57
2018-08-28 19.91
2018-08-29 0.54
2018-08-31 22.38
2018-09-01 7.53
2018-09-02 10.27
2018-09-03 30.66
2018-09-04 0.00
2018-09-05 9.09
2018-09-06 10.06
I'd hoped that it would be grouped like: user_id 1000, all calls for January with duration summed, all calls for February with duration summed, etc.
I am really new to Python and programming in general and am not sure what my next step should be to get these grouped by user_id and month of the year.
Thanks in advance for any insight you can offer.
Regards,
Jared
Something is not quite right in your setup. First of all, both of your tables are the same, so I am not sure if this is a cut-and-paste error or something else. Here is what I do with your data. Load it up like so; note we explicitly convert call_date to datetime:
from io import StringIO
import pandas as pd
df = pd.read_csv(StringIO(
"""
id user_id call_date duration
0 1000_93 1000 2018-12-27 8.52
1 1000_145 1000 2018-12-27 13.66
2 1000_247 1000 2018-12-27 14.48
3 1000_309 1000 2018-12-28 5.76
4 1000_380 1000 2018-12-30 4.22
5 1000_388 1000 2018-12-31 2.20
6 1000_510 1000 2018-12-27 5.75
7 1000_521 1000 2018-12-28 14.18
8 1000_530 1000 2018-12-28 5.77
9 1000_544 1000 2018-12-26 4.40
10 1000_693 1000 2018-12-31 4.31
11 1000_705 1000 2018-12-31 12.78
12 1000_735 1000 2018-12-29 1.70
13 1000_778 1000 2018-12-28 3.29
14 1000_826 1000 2018-12-26 9.96
15 1000_842 1000 2018-12-27 5.85
16 1001_0 1001 2018-09-06 10.06
17 1001_1 1001 2018-10-12 1.00
18 1001_2 1001 2018-10-17 15.83
19 1001_4 1001 2018-12-05 0.00
20 1001_5 1001 2018-12-13 6.27
21 1001_6 1001 2018-12-04 7.19
22 1001_8 1001 2018-11-17 2.45
23 1001_9 1001 2018-11-19 2.40
24 1001_11 1001 2018-11-09 1.00
25 1001_13 1001 2018-12-24 0.00
26 1001_19 1001 2018-11-15 30.00
27 1001_20 1001 2018-09-21 5.75
28 1001_23 1001 2018-10-27 0.98
29 1001_26 1001 2018-10-28 5.90
30 1001_29 1001 2018-09-30 14.78
"""), delim_whitespace = True, index_col=0)
df['call_date'] = pd.to_datetime(df['call_date'])
Then using
df.groupby(['user_id','call_date'])['duration'].sum()
does the expected grouping by user and by each date:
user_id call_date
1000 2018-12-26 14.36
2018-12-27 48.26
2018-12-28 29.00
2018-12-29 1.70
2018-12-30 4.22
2018-12-31 19.29
1001 2018-09-06 10.06
2018-09-21 5.75
2018-09-30 14.78
2018-10-12 1.00
2018-10-17 15.83
2018-10-27 0.98
2018-10-28 5.90
2018-11-09 1.00
2018-11-15 30.00
2018-11-17 2.45
2018-11-19 2.40
2018-12-04 7.19
2018-12-05 0.00
2018-12-13 6.27
2018-12-24 0.00
If you want to group by month, as you seem to suggest, you can use the Grouper functionality:
df.groupby(['user_id',pd.Grouper(key='call_date', freq='1M')])['duration'].sum()
which produces
user_id call_date
1000 2018-12-31 116.83
1001 2018-09-30 30.59
2018-10-31 23.71
2018-11-30 35.85
2018-12-31 13.46
Let me know if you are getting different results from following these steps
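If you prefer month labels like 2018-12 over month-end timestamps, a roughly equivalent sketch (assuming call_date has already been converted to datetime as above) uses dt.to_period instead of pd.Grouper:
# group by user and calendar month, labelled as periods such as 2018-12
monthly = df.groupby(['user_id', df['call_date'].dt.to_period('M')])['duration'].sum()
print(monthly)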

Calculate average revenue per user per month

I have the following dataframe:
Timestamp userid Prices_USD
0 2016-12-01 6.270941895 1.08
1 2016-12-01 6.609813209 1.12
2 2016-12-01 6.632094115 9.70
3 2016-12-01 6.655789772 1.08
4 2016-12-01 6.764640751 9.33
... ... ... ...
1183 2017-03-27 6.529604089 1.08
1184 2017-03-27 6.682639674 6.72
1185 2017-03-27 6.773815105 10.0
I want to calculate, for each unique userid, their monthly spending.
I've tried the following:
sales_per_user.set_index('Timestamp',inplace=True)
sales_per_user.index = pd.to_datetime(sales_per_user.index)
m = sales_per_user.index.month
monthly_avg = sales_per_user.groupby(['userid', m]).Prices_USD.mean().to_frame()
But the resulting dataframe is this:
userid Timestamp Prices_USD
3.43964843 12 10.91
3.885813375 1 10.91
2 10.91
12 21.82
However, the Timestamp column doesn't have the desired format. Ideally I would like:
userid Timestamp Prices_USD
3.43964843 2016-12 10.91
3.885813375 2017-01 10.91
2017-02 10.91
2017-12 21.82
How do I fix that?
Try:
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
res = df.groupby([df['userid'], df['Timestamp'].dt.to_period('M')])['Prices_USD'].sum()
print(res)
Output
userid Timestamp
6.270942 2016-12 1.08
6.529604 2017-03 1.08
6.609813 2016-12 1.12
6.632094 2016-12 9.70
6.655790 2016-12 1.08
6.682640 2017-03 6.72
6.764641 2016-12 9.33
6.773815 2017-03 10.00
Name: Prices_USD, dtype: float64
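Since the question asks for the average rather than the total, note that the same grouping works with mean() in place of sum(); a sketch, assuming the same df and column names, with reset_index added to get a flat table:
df['Timestamp'] = pd.to_datetime(df['Timestamp'])

# average spend per userid per calendar month
monthly_avg = (df.groupby([df['userid'], df['Timestamp'].dt.to_period('M')])['Prices_USD']
                 .mean()
                 .reset_index())
print(monthly_avg)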

Is there an option in pandas to see if value in column was less than another column in one row and then it changed over time?

I need to find cases where "price of y" was less than 3.5 until time 30:00, and after that "price of x" jumps above 3.5.
I made a "Demical time" column to make it easier for me (less than 30:00 is less than 1800 seconds in decimal).
I tried to find all the cases in which price of y was under 3.5 (and above 0), but I failed to write code that gives the cases where price of y was under 3.5 AND price of x was greater than 3.5 after 30:00.
df1 = df[(df['price_of_Y'] < 3.5) & (df['price_of_Y'] > 0) & (df['Demical time'] < 1800)]
# the cases where price of y is under 3.5 before time 30:00 (Demical time = 1800)
df2 = df[(df['price_of_X'] > 3.5) & (df['Demical time'] > 1800)]
# the cases where price of x is above 3.5 after time 30:00 (Demical time = 1800)
# the question is: how do I combine them into one line?
price_of_X time price_of_Y Demical time
0 3.30 0 4.28 0
1 3.30 0:00 4.28 0
2 3.30 0:00 4.28 0
3 3.30 0:00 4.28 0
4 3.30 0:00 4.28 0
5 3.30 0:00 4.28 0
6 3.30 0:00 4.28 0
7 3.30 0:00 4.28 0
8 3.30 0:00 4.28 0
9 3.30 0:00 4.28 0
10 3.30 0:00 4.28 0
11 3.25 0:26 4.28 26
12 3.40 1:43 4.28 103
13 3.25 3:00 4.28 180
14 3.25 4:16 4.28 256
15 3.40 5:34 4.28 334
16 3.40 6:52 4.28 412
17 3.40 8:09 4.28 489
18 3.40 9:31 4.28 571
19 5.00 10:58 8.57 658
20 5.00 12:13 8.57 733
21 5.00 13:31 7.38 811
22 5.00 14:47 7.82 887
23 5.00 16:01 7.82 961
24 5.00 17:18 7.38 1038
25 5.00 18:33 7.38 1113
26 5.00 19:50 7.38 1190
27 5.00 21:09 7.38 1269
28 5.00 22:22 7.38 1342
29 5.00 23:37 8.13 1417
... ... ... ... ...
18138 7.50 59:03:00 28.61 3543
18139 7.50 60:19:00 28.61 3619
18140 7.50 61:35:00 34.46 3695
18141 8.00 62:48:00 30.16 3768
18142 7.50 64:03:00 34.46 3843
18143 8.00 65:20:00 30.16 3920
18144 7.50 66:34:00 28.61 3994
18145 7.50 67:53:00 30.16 4073
18146 8.00 69:08:00 26.19 4148
18147 7.00 70:23:00 23.10 4223
18148 7.00 71:38:00 23.10 4298
18149 8.00 72:50:00 30.16 4370
18150 7.50 74:09:00 26.19 4449
18151 7.50 75:23:00 25.58 4523
18152 7.00 76:40:00 19.07 4600
18153 7.00 77:53:00 19.07 4673
18154 9.00 79:11:00 31.44 4751
18155 9.00 80:27:00 27.11 4827
18156 10.00 81:41:00 34.52 4901
18157 10.00 82:56:00 34.52 4976
18158 11.00 84:16:00 43.05 5056
18159 10.00 85:35:00 29.42 5135
18160 10.00 86:49:00 29.42 5209
18161 11.00 88:04:00 35.70 5284
18162 13.00 89:19:00 70.38 5359
18163 15.00 90:35:00 70.42 5435
18164 19.00 91:48:00 137.70 5508
18165 23.00 93:01:00 511.06 5581
18166 NaN NaN NaN 0
18167 NaN NaN NaN 0
[18168 rows x 4 columns]
This should solve it.
I have used slightly different data and condition values, but you should get the idea of what I am doing.
import pandas as pd

df = pd.DataFrame({'price_of_X': [3.30, 3.25, 3.40, 3.25, 3.25, 3.40],
                   'price_of_Y': [2.28, 1.28, 4.28, 4.28, 1.18, 3.28],
                   'Decimal_time': [0, 26, 103, 180, 256, 334]})
print(df)
df1 = df.loc[(df['price_of_Y'] < 3.5) & (df['price_of_X'] > 3.3) & (df['Decimal_time'] > 103), :]
print(df1)
output:
df
price_of_X price_of_Y Decimal_time
0 3.30 2.28 0
1 3.25 1.28 26
2 3.40 4.28 103
3 3.25 4.28 180
4 3.25 1.18 256
5 3.40 3.28 334
df1
price_of_X price_of_Y Decimal_time
5 3.4 3.28 334
Similar to what @IMCoins suggested in a comment, use two boolean masks to achieve the selection that you require.
mask1 = (df['price_of_Y'] < 3.5) & (df['price_of_Y'] > 0) & (df['Demical time'] < 1800)
mask2 = (df['price_of_X'] > 3.5) & (df['Demical time'] > 1800)
df[mask1 | mask2]
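Depending on what "combine them to one line" means: since the two conditions refer to different time ranges, no single row can satisfy both, so a row-wise AND is always empty. If the intent is to check whether both events occurred somewhere in the data, a sketch using any() on each mask (reusing mask1 and mask2 from above):
# True only if price_of_Y dipped below 3.5 before 1800s AND
# price_of_X rose above 3.5 after 1800s, in any rows
both_happened = mask1.any() and mask2.any()
print(both_happened)

# rows involved in either event
print(df[mask1 | mask2])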

How to group pandas DataFrame by varying dates?

I am trying to roll up daily data into fiscal quarter data. For example, I have a table with fiscal quarter end dates:
Company Period Quarter_End
M 2016Q1 05/02/2015
M 2016Q2 08/01/2015
M 2016Q3 10/31/2015
M 2016Q4 01/30/2016
WFM 2015Q2 04/12/2015
WFM 2015Q3 07/05/2015
WFM 2015Q4 09/27/2015
WFM 2016Q1 01/17/2016
and a table of daily data:
Company Date Price
M 06/20/2015 1.05
M 06/22/2015 4.05
M 07/10/2015 3.45
M 07/29/2015 1.86
M 08/24/2015 1.58
M 09/02/2015 8.64
M 09/22/2015 2.56
M 10/20/2015 5.42
M 11/02/2015 1.58
M 11/24/2015 4.58
M 12/03/2015 6.48
M 12/05/2015 4.56
M 01/03/2016 7.14
M 01/30/2016 6.34
WFM 06/20/2015 1.05
WFM 06/22/2015 4.05
WFM 07/10/2015 3.45
WFM 07/29/2015 1.86
WFM 08/24/2015 1.58
WFM 09/02/2015 8.64
WFM 09/22/2015 2.56
WFM 10/20/2015 5.42
WFM 11/02/2015 1.58
WFM 11/24/2015 4.58
WFM 12/03/2015 6.48
WFM 12/05/2015 4.56
WFM 01/03/2016 7.14
WFM 01/17/2016 6.34
And I would like to create the table below.
Company Period Quarter_end Sum(Price)
M 2016Q2 8/1/2015 10.41
M 2016Q3 10/31/2015 18.2
M 2016Q4 1/30/2016 30.68
WFM 2015Q3 7/5/2015 5.1
WFM 2015Q4 9/27/2015 18.09
WFM 2016Q1 1/17/2016 36.1
However, I don't know how to group by varying dates without looping through each record. Any help is greatly appreciated.
Thanks!
I think you can use merge_ordered:
# first convert columns to datetime
df1.Quarter_End = pd.to_datetime(df1.Quarter_End)
df2.Date = pd.to_datetime(df2.Date)

df = pd.merge_ordered(df1,
                      df2,
                      left_on=['Company', 'Quarter_End'],
                      right_on=['Company', 'Date'],
                      how='outer')
print(df)
Company Period Quarter_End Date Price
0 M 2016Q1 2015-05-02 NaT NaN
1 M NaN NaT 2015-06-20 1.05
2 M NaN NaT 2015-06-22 4.05
3 M NaN NaT 2015-07-10 3.45
4 M NaN NaT 2015-07-29 1.86
5 M 2016Q2 2015-08-01 NaT NaN
6 M NaN NaT 2015-08-24 1.58
7 M NaN NaT 2015-09-02 8.64
8 M NaN NaT 2015-09-22 2.56
9 M NaN NaT 2015-10-20 5.42
10 M 2016Q3 2015-10-31 NaT NaN
11 M NaN NaT 2015-11-02 1.58
12 M NaN NaT 2015-11-24 4.58
13 M NaN NaT 2015-12-03 6.48
14 M NaN NaT 2015-12-05 4.56
15 M NaN NaT 2016-01-03 7.14
16 M 2016Q4 2016-01-30 2016-01-30 6.34
17 WFM 2015Q2 2015-04-12 NaT NaN
18 WFM NaN NaT 2015-06-20 1.05
19 WFM NaN NaT 2015-06-22 4.05
20 WFM 2015Q3 2015-07-05 NaT NaN
21 WFM NaN NaT 2015-07-10 3.45
22 WFM NaN NaT 2015-07-29 1.86
23 WFM NaN NaT 2015-08-24 1.58
24 WFM NaN NaT 2015-09-02 8.64
25 WFM NaN NaT 2015-09-22 2.56
26 WFM 2015Q4 2015-09-27 NaT NaN
27 WFM NaN NaT 2015-10-20 5.42
28 WFM NaN NaT 2015-11-02 1.58
29 WFM NaN NaT 2015-11-24 4.58
30 WFM NaN NaT 2015-12-03 6.48
31 WFM NaN NaT 2015-12-05 4.56
32 WFM NaN NaT 2016-01-03 7.14
33 WFM 2016Q1 2016-01-17 2016-01-17 6.34
Then backfill the NaN values in the Period and Quarter_End columns with bfill and aggregate the sum. If you need to remove all NaN values, add Series.dropna and finally reset_index:
df.Period = df.Period.bfill()
df.Quarter_End = df.Quarter_End.bfill()
print (df.groupby(['Company','Period','Quarter_End'])['Price'].sum().dropna().reset_index())
Company Period Quarter_End Price
0 M 2016Q2 2015-08-01 10.41
1 M 2016Q3 2015-10-31 18.20
2 M 2016Q4 2016-01-30 30.68
3 WFM 2015Q3 2015-07-05 5.10
4 WFM 2015Q4 2015-09-27 18.09
5 WFM 2016Q1 2016-01-17 36.10
Another approach: set_index, then pd.concat to align indices, then groupby with agg:
prd_df = period_df.set_index(['Company', 'Quarter_End'])
prc_df = price_df.set_index(['Company', 'Date'], drop=False)
df = pd.concat([prd_df, prc_df], axis=1)
df.groupby([df.index.get_level_values(0), df.Period.bfill()]) \
  .agg(dict(Date='last', Price='sum')).dropna()
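For reference, a sketch of flattening that result back into columns (same assumptions as the snippet above); reset_index turns the group keys into regular columns so the table resembles the desired output:
result = (df.groupby([df.index.get_level_values(0), df.Period.bfill()])
            .agg(dict(Date='last', Price='sum'))
            .dropna()
            .reset_index())
print(result)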
