Pandas to calculate rolling aggregate rate - python

I'm trying to calculate a rolling aggregate rate for a time series.
The way to think about the data is that it is the results of a bunch of multi-game series against different teams. We don't know who wins a series until its last game. I'm trying to calculate the win rate as it evolves against each opposing team.
series_id      date  opposing_team  won_series
        1  1/1/2000              a           0
        1  1/3/2000              a           0
        1  1/5/2000              a           1
        2  1/4/2000              a           0
        2  1/7/2000              a           0
        2  1/9/2000              a           0
        3  1/6/2000              b           0
Becomes:
series_id      date  opposing_team  won_series  percent_win_against_team
        1  1/1/2000              a           0                        NA
        1  1/3/2000              a           0                        NA
        1  1/5/2000              a           1                       100
        2  1/4/2000              a           0                        NA
        2  1/7/2000              a           0                       100
        2  1/9/2000              a           0                        50
        3  1/6/2000              b           0                         0

I still don't feel like I understand the rule for deciding when a series is over. Is 3 over? Why is it NA? I would have thought 1/3rd. Still, here is a way to keep track of the number of completed series and a running win rate.
Define 26472215table.csv:
series_id,date,opposing_team,won_series
1,1/1/2000,a,0
1,1/3/2000,a,0
1,1/5/2000,a,1
2,1/4/2000,a,0
2,1/7/2000,a,0
2,1/9/2000,a,0
3,1/6/2000,b,0
Code:
import pandas as pd

df = pd.read_csv('26472215table.csv')

# last game date of each series, joined back onto every row
sr = df.groupby('series_id')['date'].max()
sr.name = 'LastGame'
df2 = df.join(sr, on='series_id', how='left')
df2 = df2.sort_values('date')

# a series is complete on the date of its last game
df2['series_comp'] = (df2['date'] == df2['LastGame']).astype(int)

# running counts of completed series and of series wins, per opposing team
df2['running_sr_cnt'] = df2.groupby('opposing_team')['series_comp'].cumsum()
df2['running_win_cnt'] = df2.groupby('opposing_team')['won_series'].cumsum()

# win rate = wins so far / completed series so far (undefined until a series completes)
winrate = lambda r: r['running_win_cnt'] / r['running_sr_cnt'] if r['running_sr_cnt'] > 0 else None
df2['winrate'] = df2[['running_sr_cnt', 'running_win_cnt']].apply(winrate, axis=1)
Results df2[['date', 'winrate']]:
       date  winrate
0  1/1/2000      NaN
1  1/3/2000      NaN
3  1/4/2000      NaN
2  1/5/2000      1.0
6  1/6/2000      0.0
4  1/7/2000      1.0
5  1/9/2000      0.5
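
For reference, here is a more compact sketch of the same bookkeeping using transform and a vectorized division (same completion rule: a series counts once its last game has been played, and 0/0 shows up as NaN):
import pandas as pd

df = pd.read_csv('26472215table.csv')
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values('date')

# a series is complete on the date of its last game
df['series_comp'] = (df['date'] == df.groupby('series_id')['date'].transform('max')).astype(int)

grp = df.groupby('opposing_team')
# completed series so far / wins so far against this team
df['winrate'] = grp['won_series'].cumsum() / grp['series_comp'].cumsum()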

Related

pandas groupby with condition depending on same column

I have a dataframe with ID, date and number columns and would like to create a new column that takes the mean of all numbers for this specific ID, but only includes numbers from rows whose date is smaller than the date of the current row. How would I do this?
df = (pd.DataFrame({'ID': ['1','1','1','1','2','2'],
                    'number': ['1','4','1','4','2','5'],
                    'date': ['2021-10-19','2021-10-16','2021-10-16','2021-10-15','2021-10-19','2021-10-10']})
      .assign(date=lambda x: pd.to_datetime(x.date))
      .assign(mean_no_from_previous_dts=lambda x: x[x.date < ??].groupby('ID').number.transform('mean'))
      )
This is what I would like to get as output:
ID number date mean_no_from_previous_dts
0 1 1 2021-10-19 3.0 = mean(4+1+4)
1 1 4 2021-10-16 2.5 = mean(4+1)
2 1 1 2021-10-16 4.0 = mean(1)
3 1 4 2021-10-15 0.0 = 0 (as it's the first entry for this date and ID - this number doesn't matter, can be something else)
4 2 2 2021-10-19 5.0 = mean(5)
5 2 5 2021-10-10 0.0 = 0 (as it's the first entry for this date and ID)
So, for example, the first entry of the column mean_no_from_previous_dts is the mean of (4+1+4): the first 4 comes from the number column of the 2nd row, because 2021-10-16 (the date in the 2nd row) is smaller than 2021-10-19 (the date in the 1st row). The 1 comes from the 3rd row because 2021-10-16 is smaller than 2021-10-19. The second 4 comes from the 4th row because 2021-10-15 is smaller than 2021-10-19. This is for ID = 1; the same applies for ID = 2.
Here is a solution with numpy broadcasting per group:
import numpy as np
import pandas as pd

df = (pd.DataFrame({'ID': ['1','1','1','1','2','2'],
                    'number': ['1','4','1','4','2','5'],
                    'date': ['2021-10-19','2021-10-16','2021-10-16','2021-10-15','2021-10-19','2021-10-10']})
      .assign(date=lambda x: pd.to_datetime(x.date), number=lambda x: x['number'].astype(int))
      )
def f(x):
    arr = x['date'].to_numpy()
    m = arr <= arr[:, None]
    # exclude each row's comparison with itself - set the diagonal to False
    np.fill_diagonal(m, False)
    # set greater values to NaN and take the mean ignoring NaNs
    m = np.nanmean(np.where(m, x['number'].to_numpy(), np.nan).astype(float), axis=1)
    # assign to new column
    x['no_of_previous_dts'] = m
    return x

# the earliest row per group has no previous dates, so it is set to 0
df = df.groupby('ID').apply(f).fillna({'no_of_previous_dts': 0})
print (df)
ID number date no_of_previous_dts
0 1 1 2021-10-19 3.0
1 1 4 2021-10-16 2.5
2 1 1 2021-10-16 4.0
3 1 4 2021-10-15 0.0
4 2 2 2021-10-19 5.0
5 2 5 2021-10-10 0.0
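
For comparison, here is a slower but more literal per-row sketch of the same rule (using the question's column name mean_no_from_previous_dts): for each row, average the other rows of the same ID whose date is not after this row's date.
def mean_of_previous(g):
    # for each row in the group, average the *other* rows whose date is <= this row's date
    out = []
    for i, row in g.iterrows():
        mask = (g['date'] <= row['date']) & (g.index != i)
        out.append(g.loc[mask, 'number'].mean())
    return pd.Series(out, index=g.index)

df['mean_no_from_previous_dts'] = (
    df.groupby('ID', group_keys=False).apply(mean_of_previous).fillna(0)
)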

Change numeric column based on category after group by

I have a df like this below:
dff = pd.DataFrame({'id':[1,1,2,2], 'categ':['A','B','A','B'],'cost':[20,5, 30,10] })
dff
id categ cost
0 1 A 20
1 1 B 5
2 2 A 30
3 2 B 10
What I want is to make a new df where, grouping by id, category B's cost gains 20% of category A's cost while category A loses that same amount (e.g. for id 1: 20% of 20 is 4, so A becomes 16 and B becomes 9). I would like my desired output to be like this:
id category price
0 1 A 16
1 1 B 9
2 2 A 24
3 2 B 16
I have done this below, but it only reduces the price of A by 20%. Any idea how to do what I want?
dff['price'] = np.where(dff['category'] == 'A', dff['price'] * 0.8, dff['price'])
Do pivot then modify and stack back
s = df.pivot(*df)   # unpacks the column names positionally: pivot('id', 'categ', 'cost')
s['B'] = s['B'] + s['A'] * 0.2
s['A'] *= 0.8
s = s.stack().reset_index(name='cost')
s
Out[446]:
id categ cost
0 1 A 16.0
1 1 B 9.0
2 2 A 24.0
3 2 B 16.0
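
On newer pandas versions pivot's arguments are keyword-only, so the unpacking trick may not work; here is a sketch of the same reshape written out explicitly (assuming df is the dff from the question):
s = df.pivot(index='id', columns='categ', values='cost')   # one column per category
s['B'] = s['B'] + s['A'] * 0.2                              # B gains 20% of A
s['A'] *= 0.8                                               # A loses the same amount
s = s.stack().reset_index(name='cost')                      # back to long format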
You can use transform to broadcast the 'A' value to every row in its group and take 20% of it. Then, using map, subtract it for 'A' and add it for 'B':
s = df['cost'].where(df.categ.eq('A')).groupby(df['id']).transform('first')*0.2
df['cost'] = df['cost'] + s*df['categ'].map({'A': -1, 'B': 1})
id categ cost
0 1 A 16.0
1 1 B 9.0
2 2 A 24.0
3 2 B 16.0

How to generate a new column whose values are between two existing columns

I need to add a new column based on the values of two existing columns.
My data set looks like this:
Date Bid Ask Last Volume
0 2021.02.01 00:01:02.327 1.21291 1.21336 0.0 0
1 2021.02.01 00:01:21.445 1.21290 1.21336 0.0 0
2 2021.02.01 00:01:31.912 1.21287 1.21336 0.0 0
3 2021.02.01 00:01:32.600 1.21290 1.21336 0.0 0
4 2021.02.01 00:02:08.920 1.21290 1.21338 0.0 0
... ... ... ... ... ...
80356 2021.02.01 23:58:54.332 1.20603 1.20605 0.0 0
I need to generate a new column named "New" whose value for each row is a random number between the "Bid" and "Ask" columns. Each value of "New" has to be in the range from Bid to Ask (it can equal Bid or Ask).
I have tried this:
df['rand_between'] = df.apply(lambda x: np.random.randint(x.Ask,x.Bid), axis=1)
But I got this error:
Exception has occurred: ValueError
low >= high
I am new to Python.
Use np.random.uniform so you get a random float with equal probability between your low and high bounds, drawn from the half-open interval [low_bound, high_bound).
Also ditch the apply; np.random.uniform can generate the numbers directly from arrays of bounds. (I added a row at the bottom to make this obvious.)
import numpy as np
df['New'] = np.random.uniform(df.Bid, df.Ask, len(df))
Date Bid Ask Last Volume New
0 2021.02.01 00:01:02.327 1.21291 1.21336 0.0 0 1.213114
1 2021.02.01 00:01:21.445 1.21290 1.21336 0.0 0 1.212969
2 2021.02.01 00:01:31.912 1.21287 1.21336 0.0 0 1.213342
3 2021.02.01 00:01:32.600 1.21290 1.21336 0.0 0 1.212933
4 2021.02.01 00:02:08.920 1.21290 1.21338 0.0 0 1.212948
5 2021.02.01 00:02:08.920 100.00000 115.00000 0.0 0 100.552836
All you need to do is switch the order of x.Ask and x.Bid in your code. In your dataframe the ask prices are always higher than the bid; that's why you are getting the error:
df['rand_between'] = df.apply(lambda x: np.random.randint(x.Bid,x.Ask), axis=1)
If your ask value is sometimes greater and sometimes less than the bid, use:
df['rand_between'] = df.apply(lambda x: np.random.randint(x.Bid,x.Ask) if x.Ask > x.Bid else np.random.randint(x.Ask,x.Bid), axis=1)
Finally, if it is possible for ask to be greater than, less than, or equal to bid, use:
def helper(x):
    if x.Ask > x.Bid:
        return np.random.randint(x.Bid, x.Ask)
    elif x.Bid > x.Ask:
        return np.random.randint(x.Ask, x.Bid)
    else:
        return None
df['rand_between'] = df.apply(helper, axis=1)
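Note that with float prices this close together, randint truncates the bounds to integers and can still raise low >= high; here is a sketch of the same per-row helper using np.random.uniform instead:
import numpy as np

def uniform_helper(x):
    lo, hi = sorted((x.Bid, x.Ask))   # handle either ordering of bid and ask
    return np.random.uniform(lo, hi)  # random float in [lo, hi)

df['rand_between'] = df.apply(uniform_helper, axis=1)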
You can loop through the rows using apply and then use your randint function (for floats you might want to use random.uniform). For example:
In [1]: import pandas as pd
...: from random import randint
...: df = pd.DataFrame({'bid':range(10),'ask':range(0,20,2)})
...:
...: df['new'] = df.apply(lambda x: randint(x['bid'],x['ask']), axis=1)
...: df
Out[1]:
bid ask new
0 0 0 0
1 1 2 1
2 2 4 4
3 3 6 6
4 4 8 6
5 5 10 9
6 6 12 9
7 7 14 12
8 8 16 13
9 9 18 9
The axis=1 is telling the apply function to loop over rows, not columns.

python pandas - transform custom aggregation

I have the following data frame of user activity across 2 days:
user score
0 A 10
1 A 0
2 B 5
I would like to calculate each user's average score per day over that period and transform the result onto all of their rows:
import pandas as pd
df = pd.DataFrame({'user': ['A', 'A', 'B'],
                   'score': [10, 0, 5]})
df["avg"] = df.groupby(['user']).transform("sum")["score"]
df.head()
This gives me the sum for each user:
user score avg
0 A 10 10
1 A 0 10
2 B 5 5
And now I would like to divide each user's sum by the number of days (2) to get:
user score avg
0 A 10 5
1 A 0 5
2 B 5 2.5
Can this be done on the same line where I calculated the sum?
You can divide the output Series by 2:
df = pd.DataFrame({'user': ['A', 'A', 'B'],
                   'score': [10, 0, 5]})
df["avg"] = df.groupby(['user']).transform("sum")["score"] / 2
print (df)
user score avg
0 A 10 5.0
1 A 0 5.0
2 B 5 2.5
Here you can do something like this:
df["avg"] = df.groupby(['user']).transform("sum")["score"]/2
In [54]: df.head()
Out[54]:
user score avg
0 A 10 5.0
1 A 0 5.0
2 B 5 2.5
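
If the number of days shouldn't be hardcoded, a small sketch that derives it from the data, assuming a hypothetical day column recording which day each row belongs to (selecting the score column before transform also avoids summing every column):
# 'day' is a hypothetical column marking which day each row belongs to
n_days = df['day'].nunique()
df['avg'] = df.groupby('user')['score'].transform('sum') / n_days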

How to calculate the distance between two points, for many consecutive points, within groups in Python

I have the following df
id xx yy time
0 1 553343.041098 4.178420e+06 1
1 1 553343.069815 4.178415e+06 2
2 1 553343.069815 4.178415e+06 3
3 2 553343.950755 4.178415e+06 1
4 2 553341.343829 4.178410e+06 6
The xx and yy columns give the position of each id at a certain point in time.
I would like to create an extra column in this df that holds the distance travelled from one point in time to the next (going from the smallest value of time to the next larger one, and so on), within each id group.
Is there a Pythonic way of doing so?
You can do it as below.
I didn't compute df['distance_meters'] because it is straightforward.
df['xx_diff']=df.groupby('id')['xx'].diff()**2
df['yy_diff']=df.groupby('id')['yy'].diff()**2
If you don't need ['xx_diff'] & ['yy_diff'] columns in your dataframe, you can directly use the code below.
import numpy as np
df['distance'] = np.sqrt(df.groupby('id')['xx'].diff()**2 + df.groupby('id')['yy'].diff()**2)
Output
id xx yy time xx_diff yy_diff distance
0 1 553343.041098 4178420.0 1 NaN NaN NaN
1 1 553343.069815 4178415.0 2 0.000825 25.0 5.000082
2 1 553343.069815 4178415.0 3 0.000000 0.0 0.000000
3 2 553343.950755 4178415.0 1 NaN NaN NaN
4 2 553341.343829 4178410.0 6 6.796063 25.0 5.638800
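
Note that diff follows row order; if the rows are not already ordered by time within each id, a small sketch that sorts first:
import numpy as np

# order rows by time within each id so diff compares consecutive time points
df = df.sort_values(['id', 'time'])
df['distance'] = np.sqrt(df.groupby('id')['xx'].diff()**2
                         + df.groupby('id')['yy'].diff()**2)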
I don't know if there is a more efficient way to do this, but here is a solution:
import numpy as np
df['xx_diff'] = df.groupby('id')['xx'].rolling(window=2).apply(lambda x: (x[1] - x[0])**2, raw=True).reset_index(drop=True)
df['yy_diff'] = df.groupby('id')['yy'].rolling(window=2).apply(lambda x: (x[1] - x[0])**2, raw=True).reset_index(drop=True)
df['distance_meters'] = np.sqrt(df['xx_diff'] + df['yy_diff'])
A more pythonic answer will be accepted :)
Try This:
import pandas as pd
import math

def calc_distance(values):
    values.sort_values('id', inplace=True)
    values['distance_diff'] = 0
    values.reset_index(drop=True, inplace=True)
    for i in range(values.shape[0] - 1):
        p1 = list(values.loc[i, ['xx', 'yy']])
        p2 = list(values.loc[i + 1, ['xx', 'yy']])
        values.loc[i, 'distance_diff'] = math.sqrt(((p1[0] - p2[0])**2) + ((p1[1] - p2[1])**2))
    return values

lt = []
lt.append(df.groupby(['id']).apply(calc_distance))
print(pd.concat(lt, ignore_index=True))
Output:
id xx yy time distance_diff
0 1 553343.041098 4178420.0 1 5.000082
1 1 553343.069815 4178415.0 2 0.000000
2 1 553343.069815 4178415.0 3 0.000000
3 2 553343.950755 4178415.0 1 5.638800
4 2 553341.343829 4178410.0 6 0.000000
