I have the following data:
Date Qty
01/01/2019 4.15
02/01/2019 12.39
03/01/2019 14.15
04/01/2019 12.15
05/01/2019 3.26
06/01/2019 6.23
07/01/2019 15.89
08/01/2019 5.55
09/01/2019 12.49
10/01/2019 9.4
11/01/2019 9.11
12/01/2019 9.18
13/01/2019 13.45
14/01/2019 4.52
15/01/2019 0
16/01/2019 0
17/01/2019 8.41
18/01/2019 9.55
19/01/2019 15.43
20/01/2019 16.45
21/01/2019 9.28
22/01/2019 9.55
23/01/2019 7.87
24/01/2019 12.58
25/01/2019 6.12
26/01/2019 6.15
27/01/2019 6.07
28/01/2019 15.53
The output that I'm trying to achieve is this:
Date Window_Sum
01/01/2019
02/01/2019
03/01/2019
04/01/2019
05/01/2019
06/01/2019
07/01/2019
08/01/2019
09/01/2019
10/01/2019
11/01/2019 100.62
12/01/2019 109.8
13/01/2019 110.86
14/01/2019 101.23
15/01/2019 101.23
16/01/2019 101.23
17/01/2019 109.64
18/01/2019 103.78
19/01/2019 112.98
20/01/2019 107.99
21/01/2019 104.78
22/01/2019 104.93
23/01/2019 103.69
24/01/2019 107.09
25/01/2019 113.21
26/01/2019 101.39
27/01/2019 107.46
28/01/2019 105.03
Let me just briefly explain the logic to get the output:
So on 01/01/2019, the Qty is 4.15, and looking back there are no other values, so the cumulative sum is not greater than 100. Hence, the output value is a NULL.
Fast forward to 10/01/2019, the Qty is 9.4, and looking back the cumulative sum is 95.66. Since the cumulative sum is not greater than 100, the output will be a NULL value.
Next, we'll look at 11/01/2019. The Qty here is 9.11, and looking back the cumulative sum is 100.62. It is 100.62 and not 104.77 because the sum of Qty from 11/01/2019 back to 02/01/2019 reaches 100 (or slightly above 100) first.
Similarly, at 12/01/2019 the Qty is 9.18, and looking back the cumulative sum is 109.8, because the sum of Qty from 12/01/2019 back to 02/01/2019 reaches 100 (or slightly above 100) first.
Is there a solution which allows a loop into the pandas rolling sum function to achieve this result?
What I'm trying to achieve here is to ensure that once the cumulative sum reaches 100 or slightly over 100, then I will take the value and append it into the "Window_Sum".
Update: Managed to get the code running with help. Here's the solution:
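# walk backwards from the last row (assumes the default 0..n-1 index)
for i in range(len(data) - 1, -1, -1):
    # initialise cumulative sum for this row
    cumsum = 0
    j = i
    # look back until the running sum reaches 100 or the data runs out
    while cumsum < 100 and j >= 0:
        cumsum += data.loc[j, 'Qty']
        j -= 1
    # not enough history to reach 100: leave the value empty
    data.loc[i, 'Window_Sum'] = cumsum if cumsum >= 100 else None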
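For longer series, the same backward look can probably be done without nested loops. The following is a minimal sketch, assuming Qty is never negative (so the running total never decreases). The helper name backward_threshold_sum is hypothetical and not part of pandas. It builds prefix sums with np.cumsum and uses np.searchsorted to find, for each row, the latest starting row whose window sum reaches the threshold.

```python
import numpy as np
import pandas as pd


def backward_threshold_sum(qty: pd.Series, threshold: float = 100.0) -> pd.Series:
    """Hypothetical helper: shortest backward-looking sum that reaches `threshold`.

    Assumes every value in `qty` is >= 0, so the prefix sums are sorted.
    """
    values = qty.to_numpy(dtype=float)
    # prefix[k] is the sum of the first k values; prefix[0] is 0
    prefix = np.concatenate(([0.0], np.cumsum(values)))
    # the window ending at row i and starting at row j sums to prefix[i + 1] - prefix[j],
    # so we need the largest j with prefix[j] <= prefix[i + 1] - threshold
    targets = prefix[1:] - threshold
    starts = np.searchsorted(prefix, targets, side="right") - 1
    sums = prefix[1:] - prefix[np.clip(starts, 0, None)]
    # starts == -1 means even the full history stays below the threshold
    return pd.Series(np.where(starts >= 0, sums, np.nan), index=qty.index)


# first 14 Qty values from the question
qty = pd.Series([4.15, 12.39, 14.15, 12.15, 3.26, 6.23, 15.89,
                 5.55, 12.49, 9.4, 9.11, 9.18, 13.45, 4.52])
result = backward_threshold_sum(qty)
print(result.round(2))
```

Because the comparison is done on differences of prefix sums, a window that lands exactly on the threshold may be affected by floating-point rounding, so this is best treated as a starting point rather than a drop-in replacement.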
Just use the cumsum() function:
In [7]: df['Window_Sum'] = df['Qty'].cumsum()
In [8]: df.head()
Out[8]:
Date Qty Window_Sum
0 01-Jan-19 4.0 4.0
1 02-Jan-19 1.0 5.0
2 03-Jan-19 6.0 11.0
3 04-Jan-19 3.0 14.0
4 05-Jan-19 3.0 17.0
Hope this is what you were looking for!
I have a dataset of games with critic scores and categorical data of whether the game was featured in a publication - take the following as a simplified version of the dataset:
df = pd.DataFrame(
    {'titleName': ['game_A', 'game_B', 'game_C', 'game_D'],
     'reviewScore': [88.1, 70.3, 91.3, 66.1],
     'mediaAppearances': [['Pub_A', 'Pub_C'], ['Pub_B'], ['Pub_B', 'Pub_C'], ['Pub_A', 'Pub_B', 'Pub_C']]}
)
mediaAppearances is a categorical feature with multiple potential values for any record - it captures whether the game appeared in a given Publisher's reporting. The feature is one-hot encoded to produce discrete boolean columns for each Publisher (i.e. 'True' if the game appeared on that Publisher, 'False' if it didn't):
final_df = pd.concat([df,pd.get_dummies(df['mediaAppearances'].apply(pd.Series).stack()).groupby(level=0).sum()], axis=1)
This produces the following DataFrame:
titleName  reviewScore  mediaAppearances       Pub_A  Pub_B  Pub_C
game_A     88.1         "Pub_A, Pub_C"         1      0      1
game_B     70.3         "Pub_B"                0      1      0
game_C     91.3         "Pub_B, Pub_C"         0      1      1
game_D     66.1         "Pub_A, Pub_B, Pub_C"  1      1      1
I want to group the DataFrame by each Publisher, so I can analyze reviewScore for games that were featured by a specific Publisher. The end result should have three groups (where Pub_n equals True) where reviewScore can be further aggregated / analyzed.
I can independently filter the Data Frame by each unique Publisher as follows:
for publisher in ['Pub_A', 'Pub_B', 'Pub_C']:
    _mean = final_df[final_df[publisher] == True]['reviewScore'].mean()
    print(f"Mean reviewScore for games appearing in {publisher}: {_mean:.1f}")
Output:
Mean reviewScore for games appearing in Pub_A: 77.1
Mean reviewScore for games appearing in Pub_B: 75.9
Mean reviewScore for games appearing in Pub_C: 81.8
This works fine for calculating single summary stats; however, the workflow is burdensome when attempting to use custom aggregation functions, analyze multiple summary stats at once (e.g. with pandas' describe function), or quickly switch between grouping by Publisher vs. other dataset variables.
Ideally, I'd be able to use pandas' standard groupby and aggregate syntax; however, since Publisher values are not mutually exclusive after one-hot encoding the mediaAppearances variable, grouping by the Publisher columns yields an unwieldy matrix with all unique True/False combinations:
final_df.groupby(['Pub_A', 'Pub_B', 'Pub_C']).describe().reset_index()
Output:
Pub_A  Pub_B  Pub_C  count  mean  std  min   25%   50%   75%   max
0      1      0      1.0    70.3  NaN  70.3  70.3  70.3  70.3  70.3
0      1      1      1.0    91.3  NaN  91.3  91.3  91.3  91.3  91.3
1      0      1      1.0    88.1  NaN  88.1  88.1  88.1  88.1  88.1
1      1      1      1.0    66.1  NaN  66.1  66.1  66.1  66.1  66.1
Is there a groupBy query that will produce a single grouped DataFrame that allows application of aggregation functions to the entire frame vs. requiring aggregation of individually filtered data frames? For example, is there a query that would produce the following for pandas builtin describe function?
Publisher  count  mean  std   min   25%   50%   75%   max
Pub_A      2.0    77.1  15.6  66.1  71.6  77.1  82.6  88.1
Pub_B      3.0    75.9  13.5  66.1  68.2  70.3  80.8  91.3
Pub_C      3.0    81.8  13.7  66.1  77.1  88.1  89.7  91.3
Not using groupby, but this alternative method might help.
In the one-hot encoded columns, you can replace each 1 with the reviewScore and each 0 with NaN.
Based on your version of final_df:
Input:
final_df_2 = final_df.copy()
pub_list = ['Pub_A', 'Pub_B', 'Pub_C']
final_df_2[pub_list] = final_df_2[pub_list].replace(0, np.nan)
for pub_n in pub_list:
    final_df_2[pub_n] = final_df_2[pub_n] * final_df_2['reviewScore']
final_df_2
Output:
titleName reviewScore mediaAppearances Pub_A Pub_B Pub_C
0 game_A 88.10 [Pub_A, Pub_C] 88.10 NaN 88.10
1 game_B 70.30 [Pub_B] NaN 70.30 NaN
2 game_C 91.30 [Pub_B, Pub_C] NaN 91.30 91.30
3 game_D 66.10 [Pub_A, Pub_B, Pub_C] 66.10 66.10 66.10
And then use describe:
Input:
final_df_2.describe()
Output:
reviewScore Pub_A Pub_B Pub_C
count 4.00 2.00 3.00 3.00
mean 78.95 77.10 75.90 81.83
std 12.60 15.56 13.50 13.72
min 66.10 66.10 66.10 66.10
25% 69.25 71.60 68.20 77.10
50% 79.20 77.10 70.30 88.10
75% 88.90 82.60 80.80 89.70
max 91.30 88.10 91.30 91.30
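If a groupby-based result is still preferred, one possible alternative (not the method used in the answer above) is to reshape the list column to long form with DataFrame.explode, so that each game/publisher pair gets its own row, and then group by the exploded column. A minimal sketch using the question's sample data; the names long_df and summary are illustrative only:

```python
import pandas as pd

df = pd.DataFrame(
    {'titleName': ['game_A', 'game_B', 'game_C', 'game_D'],
     'reviewScore': [88.1, 70.3, 91.3, 66.1],
     'mediaAppearances': [['Pub_A', 'Pub_C'], ['Pub_B'],
                          ['Pub_B', 'Pub_C'], ['Pub_A', 'Pub_B', 'Pub_C']]}
)

# one row per (game, publisher) pair; a game in three publications appears three times
long_df = df.explode('mediaAppearances')

# standard groupby syntax now works, including describe() or custom aggregations
summary = long_df.groupby('mediaAppearances')['reviewScore'].describe()
print(summary)
```

Since each game is repeated once per publication, statistics across the whole long_df would count some games several times; the per-publisher groups themselves are unaffected.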
I have data of the following format:
station_number date river_height river_flow
0 1 2005-01-01 08:09:00 0.285233 0.782065
1 1 2005-01-01 11:28:12 0.129994 0.386652
2 4 2005-01-01 17:33:36 0.457168 0.167025
3 2 2005-01-01 23:21:00 0.359086 0.851716
4 4 2005-01-02 04:18:36 0.332998 0.830749
5 1 2005-01-02 09:28:12 0.867262 0.855507
6 3 2005-01-02 13:15:36 0.352409 0.023737
7 2 2005-01-02 17:31:12 0.696562 0.846762
8 1 2005-01-02 21:15:36 0.910944 0.096999
9 4 2005-01-03 02:13:12 0.981430 0.152109
I need to calculate a daily average of the river height and river flow per unique station number, so as a result something like this:
station_number date river_height river_flow
0 1 2005-01-01 0.285 0.782
1 1 2005-01-02 0.233 0.753
2 2 2005-01-01 0.129 0.386
3 2 2005-01-02 0.994 0.386
4 3 2005-01-01 0.457 0.167
5 3 2005-01-02 0.168 0.134
6 4 2005-01-01 0.356 0.321
7 4 2005-01-02 0.086 0.716
Keep in mind that the above numbers are random, and not actually the averages I'm looking for. I need an entry for each day for each station. I hope I have clarified what I need!
I have tried aggregating using groupby such as below:
monthly_flow_data_mean = df.groupby(pd.PeriodIndex(df['date'], freq="M"))['river_flow'].mean()
But this obviously just takes all river_flow measurements not considering the station numbers. I have had trouble finding what combination of groupby and aggregations I need to properly achieve what I need.
I tried this as well:
daily_flow_df = df.groupby(pd.PeriodIndex(df['date'], freq="D")).agg({"river_flow": "mean", "river_height": "mean", "station_number": "first"})
But I am pretty sure this also doesn't really work as we are not really using the station number to aggregate, but merely choosing how to aggregate it while aggregating all river flow measurements.
I can obviously also just split the dataframe into 4 classes and then do the aggregation per dataframe, and merge it back together. But I am wondering if there is some smart little groupby trick that can help me achieve this in less lines, as it will be useful later in my project(s) as well where I might have way more classes in the data.
You can use either of the following solutions to group by 'station_number' and by day of the 'date' column, using pd.Grouper or dt.normalize:
df.groupby(['station_number', pd.Grouper(key='date', freq='D')]).mean()
or
df.groupby(['station_number', df['date'].dt.normalize()]).mean()
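As a rough end-to-end illustration, here is the pd.Grouper variant applied to the first six rows of the question's sample data. It assumes the 'date' column has already been converted with pd.to_datetime; the name daily is illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'station_number': [1, 1, 4, 2, 4, 1],
    'date': pd.to_datetime([
        '2005-01-01 08:09:00', '2005-01-01 11:28:12', '2005-01-01 17:33:36',
        '2005-01-01 23:21:00', '2005-01-02 04:18:36', '2005-01-02 09:28:12',
    ]),
    'river_height': [0.285233, 0.129994, 0.457168, 0.359086, 0.332998, 0.867262],
    'river_flow': [0.782065, 0.386652, 0.167025, 0.851716, 0.830749, 0.855507],
})

# group by station and by calendar day, then average both measurements
daily = (
    df.groupby(['station_number', pd.Grouper(key='date', freq='D')])
      [['river_height', 'river_flow']]
      .mean()
      .reset_index()
)
print(daily)
```

Changing freq to 'M' (or another offset alias) would give the per-station monthly version of the same aggregation.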
For example, let's consider the following dataframe:
Restaurant_ID Floor Cust_Arrival_Datetime
0 100 1 2021-11-17 17:20:00
1 100 1 2021-11-17 17:22:00
2 100 1 2021-11-17 17:25:00
3 100 1 2021-11-17 17:30:00
4 100 1 2021-11-17 17:50:00
5 100 1 2021-11-17 17:51:00
6 100 2 2021-11-17 17:25:00
7 100 2 2021-11-17 18:00:00
8 100 2 2021-11-17 18:50:00
9 100 2 2021-11-17 18:56:00
For the above toy example we can consider that the Cust_Arrival_Datetime is sorted as well as grouped by store and floor (as seen above). How could we, now, calculate things such as the median time interval that passes for a customer arrival for each unique store and floor group?
The desired output would be:
Restaurant_ID Floor Median Arrival Interval(in minutes)
0 100 1 3
1 100 2 35
The Median Arrival Interval is calculated as follows: for the first floor of the store we can see that by the time the second customer arrives 2 minutes have already passed since the first one arrived. Similarly, 3 minutes have elapsed between the 2nd and the 3rd customer and 5 minutes for the 3rd and 4th customer etc. The median for floor 1 and restaurant 100 would be 3.
I have tried something like this:
df.groupby(['Restaurant_ID', 'Floor'].apply(lambda row: row['Customer_Arrival_Datetime'].shift() - row['Customer_Arrival_Datetime']).apply(np.median)
but this does not work!
Any help is welcome!
IIUC, you can do
(df.groupby(['Restaurant_ID', 'Floor'])['Cust_Arrival_Datetime']
   .agg(lambda x: x.diff().dt.total_seconds().median()/60))
and you get
Restaurant_ID Floor
100 1 3.0
2 35.0
Name: Cust_Arrival_Datetime, dtype: float64
You can chain this with reset_index() if needed.
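A two-step variant of the same idea, which may be easier to debug, is to compute the per-group gaps first with groupby(...).diff() and then take the median of that column. This is a sketch on the question's sample data; the column name gap_min and the result name medians are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    'Restaurant_ID': [100] * 10,
    'Floor': [1] * 6 + [2] * 4,
    'Cust_Arrival_Datetime': pd.to_datetime([
        '2021-11-17 17:20:00', '2021-11-17 17:22:00', '2021-11-17 17:25:00',
        '2021-11-17 17:30:00', '2021-11-17 17:50:00', '2021-11-17 17:51:00',
        '2021-11-17 17:25:00', '2021-11-17 18:00:00', '2021-11-17 18:50:00',
        '2021-11-17 18:56:00',
    ]),
})

keys = ['Restaurant_ID', 'Floor']
# gap to the previous arrival within the same restaurant/floor, in minutes
# (the first arrival of each group has no predecessor and becomes NaN)
df['gap_min'] = df.groupby(keys)['Cust_Arrival_Datetime'].diff().dt.total_seconds() / 60
# median() skips the NaN rows
medians = df.groupby(keys)['gap_min'].median()
print(medians)
```

Keeping gap_min as a column also makes it straightforward to inspect individual intervals or apply other statistics later.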
Consider the following data frame:
df = pd.DataFrame({
    'group': [1,1,1,2,2,2],
    'time': pd.to_datetime(
        ['14:14', '14:17', '14:25', '17:29', '17:40','17:43']
    )
})
Suppose, you'd like to apply a range of transformations:
def stats(group):
    diffs = group.diff().dt.total_seconds()/60
    return {
        'min': diffs.min(),
        'mean': diffs.mean(),
        'median': diffs.median(),
        'max': diffs.max()
    }
Then you simply have to apply these:
>>> df.groupby('group')['time'].agg(stats).apply(pd.Series)
min mean median max
group
1 3.0 5.5 5.5 8.0
2 3.0 7.0 7.0 11.0
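When every statistic is one of pandas' built-in aggregations, a possible alternative to returning a dict is to compute the gaps once and pass a list of aggregation names to agg. The sketch below uses full timestamps (an assumption made here so the parsing does not depend on the current date) and the illustrative names gap_min and summary.

```python
import pandas as pd

df = pd.DataFrame({
    'group': [1, 1, 1, 2, 2, 2],
    'time': pd.to_datetime([
        '2021-01-01 14:14', '2021-01-01 14:17', '2021-01-01 14:25',
        '2021-01-01 17:29', '2021-01-01 17:40', '2021-01-01 17:43',
    ]),
})

# minutes since the previous row of the same group (NaN for each group's first row)
gap_min = df.groupby('group')['time'].diff().dt.total_seconds() / 60
# a list of names yields one output column per statistic
summary = gap_min.groupby(df['group']).agg(['min', 'mean', 'median', 'max'])
print(summary)
```

A custom function is still needed for statistics that have no built-in name, in which case the dict-returning approach above remains an option.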
I have a csv file with bid/ask prices of many bonds (using ISIN identifiers) for the past 1 yr. Using these historical prices, I'm trying to calculate the historical volatility for each bond. Although it should be typically an easy task, the issue is not all bonds have exactly same number of days of trading price data, while they're all in same column and not stacked. Hence if I need to calculate a rolling std deviation, I can't choose a standard rolling window of 252 days for 1 yr.
The data set has this format-
BusinessDate  ISIN   Bid  Ask
Date 1        ISIN1  P1   P2
Date 2        ISIN1  P1   P2
Date 252      ISIN1  P1   P2
Date 1        ISIN2  P1   P2
Date 2        ISIN2  P1   P2
......
& so on.
My current code is as follows-
vol_df = pd.read_csv('hist_prices.csv')
vol_df['BusinessDate'] = pd.to_datetime(vol_df['BusinessDate'])
vol_df['Mid Price'] = vol_df[['Bid', 'Ask']].mean(axis = 1)
vol_df['log_return'] = vol_df.groupby('ISIN')['Mid Price'].apply(lambda x: np.log(x) - np.log(x.shift(1)))
vol_df['hist_vol'] = vol_df['log_return'].std() * np.sqrt(252)
The last line of code seems to be giving all NaN values in the column. This is most likely because the operation for calculating the std deviation is happening on the same row number and not for a list of numbers. I tried replacing the last line to use rolling_std-
vol_df.set_index('BusinessDate').groupby('ISIN').rolling(window = 1, freq = 'A').std()['log_return']
But this doesn't help either. It gives 2 numbers for each ISIN. I also tried to use pivot() to place the ISINs in columns and BusinessDate as index, and the Prices as "values". But it gives an error. Also I've close to 9,000 different ISINs and hence putting them in columns to calculate std() for each column may not be the best way. Any clues on how I can sort this out?
I was able to resolve this in a crude way-
vol_df_2 = vol_df.groupby('ISIN')['logret'].std()
vol_df_3 = vol_df_2.to_frame()
vol_df_3.rename(columns = {'logret':'daily_std'}, inplace = True)
The first line above was returning a series and the std deviation column named as 'logret'. So the 2nd and 3rd line of code converts it into a dataframe and renames the daily std deviation as such. And finally the annual vol can be calculated using sqrt(252).
If anyone has a better way to do it in the same dataframe instead of creating a series, that'd be great.
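One way this might be kept in the same dataframe is groupby(...).transform('std'), which broadcasts each ISIN's standard deviation back onto that ISIN's rows. The sketch below uses made-up prices and string ISIN labels, assumes rows are already sorted by date within each ISIN, and uses the full available history per ISIN rather than a fixed 252-day window.

```python
import numpy as np
import pandas as pd

# illustrative data only; 'A' and 'B' stand in for real ISINs
vol_df = pd.DataFrame({
    'ISIN': ['A', 'A', 'A', 'A', 'B', 'B', 'B'],
    'Bid': [0.295, 0.295, 0.296, 0.300, 2.50, 2.60, 2.71],
    'Ask': [0.301, 0.305, 0.315, 0.370, 2.80, 2.70, 2.77],
})
vol_df['Mid Price'] = vol_df[['Bid', 'Ask']].mean(axis=1)

# log return within each ISIN; the first row of every ISIN is NaN
vol_df['log_return'] = np.log(vol_df['Mid Price']).groupby(vol_df['ISIN']).diff()

# per-ISIN daily std, repeated on every row of that ISIN
vol_df['daily_std'] = vol_df.groupby('ISIN')['log_return'].transform('std')
# annualise with the usual sqrt(252) convention
vol_df['hist_vol'] = vol_df['daily_std'] * np.sqrt(252)
print(vol_df)
```

Because transform returns a result aligned to the original index, no separate series, rename, or merge step is needed.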
OK, this almost works now.
It does need some math per ISIN to figure out the rolling period. I just used 3 and 2 in my example; you probably need to count how many trading days each ISIN has in the year and fix the window at that per ISIN somehow.
You then need to figure out how to merge the data back. The output actually has errors because it is updating a copy, but that is roughly what I was looking for here. I am sure someone who knows more could fix it at this point; I can't get the merge working.
import numpy as np
import pandas as pd

toy_data={'BusinessDate': ['10/5/2020','10/6/2020','10/7/2020','10/8/2020','10/9/2020',
                           '10/12/2020','10/13/2020','10/14/2020','10/15/2020','10/16/2020',
                           '10/5/2020','10/6/2020','10/7/2020','10/8/2020'],
          'ISIN': [1,1,1,1,1, 1,1,1,1,1, 2,2,2,2],
          'Bid': [0.295,0.295,0.295,0.295,0.295,
                  0.296, 0.296,0.297,0.298,0.3,
                  2.5,2.6,2.71,2.8],
          'Ask': [0.301,0.305,0.306,0.307,0.308,
                  0.315,0.326,0.337,0.348,0.37,
                  2.8,2.7,2.77,2.82]}
#vol_df = pd.read_csv('hist_prices.csv')
vol_df = pd.DataFrame(toy_data)
vol_df['BusinessDate'] = pd.to_datetime(vol_df['BusinessDate'])
vol_df['Mid Price'] = vol_df[['Bid', 'Ask']].mean(axis = 1)
vol_df['log_return'] = vol_df.groupby('ISIN')['Mid Price'].apply(lambda x: np.log(x) - np.log(x.shift(1)))
vol_df.dropna(subset = ['log_return'], inplace=True)
# do some math here to calculate how many days you want to roll for an ISIN
# maybe count how many days over a 1 year period exist???
# not really sure how you'd miss days unless stuff just doesnt trade
# (but I don't need to understand it anyway)
rolling = {1: 3, 2: 2}
for isin in vol_df['ISIN'].unique():
    roll = rolling[isin]
    print(f'isin={isin}, roll={roll}')
    df_single = vol_df[vol_df['ISIN']==isin]
    df_single['rolling'] = df_single['log_return'].rolling(roll).std()
    # i can't get the right syntax to merge data back, but this shows it
    vol_df[isin, 'rolling'] = df_single['rolling']
    print(df_single)
print(vol_df)
which outputs (minus the warning errors):
isin=1, roll=3
BusinessDate ISIN Bid Ask Mid Price log_return rolling
1 2020-10-06 1 0.295 0.305 0.3000 0.006689 NaN
2 2020-10-07 1 0.295 0.306 0.3005 0.001665 NaN
3 2020-10-08 1 0.295 0.307 0.3010 0.001663 0.002901
4 2020-10-09 1 0.295 0.308 0.3015 0.001660 0.000003
5 2020-10-12 1 0.296 0.315 0.3055 0.013180 0.006650
6 2020-10-13 1 0.296 0.326 0.3110 0.017843 0.008330
7 2020-10-14 1 0.297 0.337 0.3170 0.019109 0.003123
8 2020-10-15 1 0.298 0.348 0.3230 0.018751 0.000652
9 2020-10-16 1 0.300 0.370 0.3350 0.036478 0.010133
isin=2, roll=2
BusinessDate ISIN Bid ... log_return (1, rolling) rolling
11 2020-10-06 2 2.60 ... 2.220446e-16 NaN NaN
12 2020-10-07 2 2.71 ... 3.339828e-02 NaN 0.023616
13 2020-10-08 2 2.80 ... 2.522656e-02 NaN 0.005778
[3 rows x 8 columns]
BusinessDate ISIN Bid ... log_return (1, rolling) (2, rolling)
1 2020-10-06 1 0.295 ... 6.688988e-03 NaN NaN
2 2020-10-07 1 0.295 ... 1.665279e-03 NaN NaN
3 2020-10-08 1 0.295 ... 1.662511e-03 0.002901 NaN
4 2020-10-09 1 0.295 ... 1.659751e-03 0.000003 NaN
5 2020-10-12 1 0.296 ... 1.317976e-02 0.006650 NaN
6 2020-10-13 1 0.296 ... 1.784313e-02 0.008330 NaN
7 2020-10-14 1 0.297 ... 1.910886e-02 0.003123 NaN
8 2020-10-15 1 0.298 ... 1.875055e-02 0.000652 NaN
9 2020-10-16 1 0.300 ... 3.647821e-02 0.010133 NaN
11 2020-10-06 2 2.600 ... 2.220446e-16 NaN NaN
12 2020-10-07 2 2.710 ... 3.339828e-02 NaN 0.023616
13 2020-10-08 2 2.800 ... 2.522656e-02 NaN 0.005778
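One possible way to get all the per-ISIN rolling values into a single column is to compute each ISIN's rolling series separately and combine them with pd.concat; assigning the combined series back aligns on the original row index. This is only a sketch: the mid prices are a shortened version of the toy data above, the name windows is illustrative, and the window sizes are the same placeholder values (3 and 2).

```python
import numpy as np
import pandas as pd

vol_df = pd.DataFrame({
    'ISIN': [1] * 6 + [2] * 4,
    'Mid Price': [0.298, 0.300, 0.3005, 0.301, 0.3015, 0.3055,
                  2.65, 2.65, 2.74, 2.81],
})
# log return within each ISIN; each ISIN's first row is NaN
vol_df['log_return'] = np.log(vol_df['Mid Price']).groupby(vol_df['ISIN']).diff()

# placeholder window length per ISIN
windows = {1: 3, 2: 2}

# rolling std per ISIN, each piece keeping its original row labels
pieces = [
    group['log_return'].rolling(windows[isin]).std()
    for isin, group in vol_df.groupby('ISIN')
]
# concat stacks the pieces; assignment aligns them back by index
vol_df['rolling'] = pd.concat(pieces)
print(vol_df)
```

Working on group['log_return'] directly, instead of adding a column to a filtered copy, also avoids the copy warning mentioned above.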
I have a df like the one below, and I want to create a dayshigh column. This column should show the number of rows counted back until a higher value is found.
date high
05-06-20 1.85
08-06-20 1.88
09-06-20 2
10-06-20 2.11
11-06-20 2.21
12-06-20 2.17
15-06-20 1.99
16-06-20 2.15
17-06-20 16
18-06-20 9
19-06-20 14.67
should be like:
date high dayshigh
05-06-20 1.85 nan
08-06-20 1.88 1
09-06-20 2 2
10-06-20 2.11 3
11-06-20 2.21 4
12-06-20 2.17 0
15-06-20 1.99 0
16-06-20 2.15 1
17-06-20 16 8
18-06-20 9 0
19-06-20 14.67 1
I am using the code below, but it raises an error:
df["DaysHigh"] = np.repeat(0, len(df))
for i in range(0, len(df)):
    for j in range(df["DaysHigh"][i].index, len(df)):
        if df["high"][i] > df["high"][i-1]:
            df["DaysHigh"][i] = df["DaysHigh"][i-1] + 1
        else:
            df["DaysHigh"][i] = 0
Where am I going wrong? Thank you
Is the dayshigh number for 17-06-20 supposed to be 2 instead of 8? If so, you can basically use the code you had already written here. There are three changes I'm making below:
1. starting i from 1 instead of 0, to avoid trying to access the -1th element
2. removing the loop over j (it doesn't seem to be necessary)
3. using loc to set the values instead of df["DaysHigh"][i]; this should resolve the warnings about copies and slices.
Keeping first line same as before,
for i in range(1, len(df)):
    if df["high"][i] > df["high"][i-1]:
        df.loc[i,"DaysHigh"] = df["DaysHigh"][i-1] + 1
    else:
        df.loc[i,"DaysHigh"] = 0
Procedure:
1. Use shift() to compare each row with the previous row, and store the result in a temporary column.
2. Calculate the cumulative sum of that column, restarting after every row that is not an increase.
3. Delete the temporary column if it is not needed.
df['tmp'] = np.where(df['high'] >= df['high'].shift(), 1, np.nan)
df['dayshigh'] = df['tmp'].groupby(df['tmp'].isna().cumsum()).cumsum()
df.drop('tmp', axis=1, inplace=True)
df
date high dayshigh
0 05-06-20 1.85 NaN
1 08-06-20 1.88 1.0
2 09-06-20 2.00 2.0
3 10-06-20 2.11 3.0
4 11-06-20 2.21 4.0
5 12-06-20 2.17 NaN
6 15-06-20 1.99 NaN
7 16-06-20 2.15 1.0
8 17-06-20 16.00 2.0
9 18-06-20 9.00 NaN
10 19-06-20 14.67 1.0
Well, I think I did, here is my solution:
df["DaysHigh"] = np.repeat(0, len(df))
for i in range(0, len(df)):
    #for i in range(len(df)-1000, len(df)):
    # count how many consecutive earlier rows have a lower high
    for j in reversed(range(i)):
        if df["high"][i] > df["high"][j]:
            df.loc[i, "DaysHigh"] = df["DaysHigh"][i] + 1
        else:
            break
print(df)
date high dayshigh
05-06-20 1.85 nan
08-06-20 1.88 1
09-06-20 2.00 2
10-06-20 2.11 3
11-06-20 2.21 4
12-06-20 2.17 0
15-06-20 1.99 0
16-06-20 2.15 1
17-06-20 16.00 8
18-06-20 9.00 0
19-06-20 14.67 1
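The nested loop above can take quadratic time on long series. A monotonic stack is one common way to get the same counts in a single pass. The sketch below is an alternative technique, not the code from the posts above; the function name days_high is hypothetical, and the first row gets 0 rather than nan.

```python
def days_high(high):
    """Hypothetical helper: for each value, count the consecutive
    preceding values that are strictly lower."""
    counts = []
    stack = []  # indices whose value has not yet been exceeded, values non-increasing
    for i, value in enumerate(high):
        # discard earlier rows that the current value exceeds
        while stack and high[stack[-1]] < value:
            stack.pop()
        # nearest earlier row that is at least as high, or -1 if there is none
        prev = stack[-1] if stack else -1
        counts.append(i - prev - 1)
        stack.append(i)
    return counts


high = [1.85, 1.88, 2, 2.11, 2.21, 2.17, 1.99, 2.15, 16, 9, 14.67]
print(days_high(high))
```

With a DataFrame, the result could be attached with something like df["DaysHigh"] = days_high(df["high"].tolist()).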