I have the following dataframe in pandas:
Race_ID Athlete_ID Finish_time
0 1.0 1.0 56.1
1 1.0 3.0 60.2
2 1.0 2.0 57.1
3 1.0 4.0 57.2
4 2.0 2.0 56.2
5 2.0 1.0 56.3
6 2.0 3.0 56.4
7 2.0 4.0 56.5
8 3.0 1.0 61.2
9 3.0 2.0 62.1
10 3.0 3.0 60.4
11 3.0 4.0 60.0
12 4.0 2.0 55.0
13 4.0 1.0 54.0
14 4.0 3.0 53.0
15 4.0 4.0 52.0
where Race_ID is in descending order of time (i.e. 1 is the most recent race and 4 is the oldest race).
And I want to add a new column Relative_time#t-1 which is the Athlete's Finish_time in the last race relative to the fastest time in the last race. Hence the output would look something like
Race_ID Athlete_ID Finish_time Relative_time#t-1
0 1.0 1.0 56.1 56.3/56.2
1 1.0 3.0 60.2 56.4/56.2
2 1.0 2.0 57.1 56.2/56.2
3 1.0 4.0 57.2 56.5/56.2
4 2.0 2.0 56.2 62.1/60
5 2.0 1.0 56.3 61.2/60
6 2.0 3.0 56.4 60.4/60
7 2.0 4.0 56.5 60/60
8 3.0 1.0 61.2 54/52
9 3.0 2.0 62.1 55/52
10 3.0 3.0 60.4 53/52
11 3.0 4.0 60.0 52/52
12 4.0 2.0 55.0 0
13 4.0 1.0 54.0 0
14 4.0 3.0 53.0 0
15 4.0 4.0 52.0 0
Here's the code:
import pandas as pd
import numpy as np

data = [[1,1,56.1,'56.3/56.2'],
[1,3,60.2,'56.4/56.2'],
[1,2,57.1,'56.2/56.2'],
[1,4,57.2,'56.5/56.2'],
[2,2,56.2,'62.1/60'],
[2,1,56.3,'61.2/60'],
[2,3,56.4,'60.4/60'],
[2,4,56.5,'60/60'],
[3,1,61.2,'54/52'],
[3,2,62.1,'55/52'],
[3,3,60.4,'53/52'],
[3,4,60,'52/52'],
[4,2,55,'0'],
[4,1,54,'0'],
[4,3,53,'0'],
[4,4,52,'0']]
df = pd.DataFrame(data, columns=['Race_ID','Athlete_ID','Finish_time','Relative_time#t-1'])
# cast only the numeric columns; a blanket dtype=float would try to cast the formula strings too
df[['Race_ID','Athlete_ID','Finish_time']] = df[['Race_ID','Athlete_ID','Finish_time']].astype(float)
I intentionally made Relative_time#t-1 a str instead of a float to show the formula.
Here is what I have tried:
df.sort_values(by = ['Race_ID', 'Athlete_ID'], ascending=[True, True], inplace=True)
df['Finish_time#t-1'] = df.groupby('Athlete_ID')['Finish_time'].shift(-1)
df['Finish_time#t-1'] = df['Finish_time#t-1'].replace(np.nan, 0, regex = True)
So I get the numerator for the new column but I don't know how to get the minimum time for each Race_ID (i.e. the value in the denominator)
Thank you in advance.
Try this:
(df.groupby('Athlete_ID')['Finish_time']
.shift(-1)
.div(df['Race_ID'].map(
df.groupby('Race_ID')['Finish_time']
.min()
.shift(-1)))
.fillna(0))
Output:
0 1.001779
1 1.003559
2 1.000000
3 1.005338
4 1.035000
5 1.020000
6 1.006667
7 1.000000
8 1.038462
9 1.057692
10 1.019231
11 1.000000
12 0.000000
13 0.000000
14 0.000000
15 0.000000
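If it helps to see the chain above with named steps, here is an equivalent sketch that also writes the result back into the frame (assuming df is still sorted by Race_ID as in the question; the variable names are just for illustration):
prev_time = df.groupby('Athlete_ID')['Finish_time'].shift(-1)    # athlete's time in the previous race
prev_min = df.groupby('Race_ID')['Finish_time'].min().shift(-1)  # maps each race to the winning time of the race before it
df['Relative_time#t-1'] = prev_time.div(df['Race_ID'].map(prev_min)).fillna(0)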
I have a dataframe with a column that randomly starts a "count" back at 1. My goal is to produce a new_col that divides my current column by the last value in each count. See below for an example.
This is my current DataFrame:
col
0 1.0
1 2.0
2 3.0
3 1.0
4 2.0
5 1.0
6 2.0
7 3.0
8 4.0
9 5.0
10 1.0
11 2.0
12 3.0
Trying to get an output like so:
col new_col
0 1.0 0.333
1 2.0 0.667
2 3.0 1.000
3 1.0 0.500
4 2.0 1.000
5 1.0 0.200
6 2.0 0.400
7 3.0 0.600
8 4.0 0.800
9 5.0 1.000
10 1.0 0.333
11 2.0 0.667
12 3.0 1.000
This is what I have tried so far:
df['col_bool'] = pd.DataFrame(df['col'] == 1.0)
idx_lst = [x - 2 for x in df.index[df['col_bool']].tolist()]
idx_lst = idx_lst[1:]
mask = (df['col'] != 1.0)
df_valid = df[mask]
for i in idx_lst:
    df['new_col'] = 1.0 / df_valid.iloc[i]['col']
    df.loc[mask, 'new_col'] = df_valid['col'] / df_valid.iloc[i]['col']
This understandably results in an index error. Maybe I need to make a copy of the DataFrame each time and concat. I believe this would work, but I want to ask if I am missing any shortcuts here?
Try:
df['new_col'] = df['col'].div(df.groupby(df['col'].eq(1).cumsum())['col'].transform('last'))
Output:
col new_col
0 1.0 0.333333
1 2.0 0.666667
2 3.0 1.000000
3 1.0 0.500000
4 2.0 1.000000
5 1.0 0.200000
6 2.0 0.400000
7 3.0 0.600000
8 4.0 0.800000
9 5.0 1.000000
10 1.0 0.333333
11 2.0 0.666667
12 3.0 1.000000
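The grouping key is the trick here: (df['col'] == 1).cumsum() increments at every restart, so each run of the count gets its own label, and transform('last') broadcasts each run's final value back over its rows. A quick sketch with the sample data:
groups = (df['col'] == 1).cumsum()
print(groups.tolist())
# [1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4]
# -> the divisors become 3, 2, 5 and 3 respectively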
You can try:
df['new_col'] = df.groupby(df.col.ne(df.col.shift().add(1)).cumsum())['col'].transform(
    lambda x: x.div(len(x)))
Or:
df['new_col'] = df.col.div(
    df.groupby(df.col.ne(df.col.shift().add(1)).cumsum())['col'].transform('count'))
Output:
col new_col
0 1.0 0.333333
1 2.0 0.666667
2 3.0 1.000000
3 1.0 0.500000
4 2.0 1.000000
5 1.0 0.200000
6 2.0 0.400000
7 3.0 0.600000
8 4.0 0.800000
9 5.0 1.000000
10 1.0 0.333333
11 2.0 0.666667
12 3.0 1.000000
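Note the grouper in these two variants is built differently: df.col.ne(df.col.shift().add(1)) is True wherever a value is not the previous value plus one, i.e. at every restart, and cumsum turns those break points into the same group labels as above. Dividing by the group size only matches dividing by the last value because col counts 1, 2, 3, ... with no gaps; a small check on the sample data:
breaks = df.col.ne(df.col.shift().add(1))
print(breaks.cumsum().tolist())
# [1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4]
# within each group len(x) == x.iloc[-1], so 'count' equals 'last' here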
I have the following dataframe, with cumulative results quarter by quarter that reset at each 1°Q.
I need the quarterly net variation, so for every column except the 1°Q ones I need to subtract the previous quarter's column.
from pandas import DataFrame
data = {'Financials': ['EPS','Earnings','Sales','Margin'],
'1°Q19': [1,2,3,4],
'2°Q19': [2,4,6,8],
'3°Q19': [3,6,9,12],
'4°Q19': [4,8,12,16],
'1°Q20': [1,2,3,4],
'2°Q20': [2,4,6,8],
'3°Q20': [3,6,9,12],
'4°Q20': [4,8,12,16]
}
df = DataFrame(data,columns=['Financials','1°Q19','2°Q19','3°Q19','4°Q19',
'1°Q20','2°Q20','3°Q20','4°Q20'])
print(df)
Financials 1°Q19 2°Q19 3°Q19 4°Q19 1°Q20 2°Q20 3°Q20 4°Q20
0 EPS 1 2 3 4 1 2 3 4
1 Earnings 2 4 6 8 2 4 6 8
2 Sales 3 6 9 12 3 6 9 12
3 Margin 4 8 12 16 4 8 12 16
I've started like this and then I got stuck big time:
if ~df.columns.str.contains('1°Q'):
# here I want to substract (1°Q remains unchanged), 2°Q - 1°Q, 3°Q - 2°Q, 4°Q - 3°Q
In order to get this desired result:
Financials 1°Q19 2°Q19 3°Q19 4°Q19 1°Q20 2°Q20 3°Q20 4°Q20
0 EPS 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1 Earnings 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
2 Sales 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
3 Margin 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0
I've tried
new_df = df.diff(axis=1).fillna(df)
print(new_df)
But the result in this case is not the desired one for the 1°Q20 column:
Financials 1°Q19 2°Q19 3°Q19 4°Q19 1°Q20 2°Q20 3°Q20 4°Q20
0 EPS 1.0 1.0 1.0 1.0 -3.0 1.0 1.0 1.0
1 Earnings 2.0 2.0 2.0 2.0 -6.0 2.0 2.0 2.0
2 Sales 3.0 3.0 3.0 3.0 -9.0 3.0 3.0 3.0
3 Margin 4.0 4.0 4.0 4.0 -12.0 4.0 4.0 4.0
IIUC, use DataFrame.diff with axis=1 and then fill the NaNs with DataFrame.fillna. For a single year of columns this is all you need:
new_df = df.diff(axis=1).fillna(df)
print(new_df)
Financials 1°Q 2°Q 3°Q 4°Q
0 EPS 1.0 1.0 1.0 1.0
1 Earnings 2.0 2.0 2.0 2.0
2 Sales 3.0 3.0 3.0 3.0
3 Margin 4.0 4.0 4.0 4.0
For the expected integer output:
new_df = new_df.astype(int)
EDIT: to handle the reset at each 1°Q across years, group the columns by year first:
df.groupby(df.columns.str.contains('1°Q').cumsum(),axis=1).diff(axis=1).fillna(df)
Financials 1°Q19 2°Q19 3°Q19 4°Q19 1°Q20 2°Q20 3°Q20 4°Q20
0 EPS 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1 Earnings 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
2 Sales 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
3 Margin 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0
or
df.diff(axis=1).T.mask(df.columns.to_series().str.contains('1°Q')).T.fillna(df)
You can leverage df.shift for the subtraction, and fillna to fix the NaN values left by the shift (like the first approach, this treats the columns as one continuous run, so it fits a single year of quarters):
df = df.set_index('Financials')
df - df.shift(1, axis=1).fillna(0)
1°Q 2°Q 3°Q 4°Q
Financials
EPS 1.0 1.0 1.0 1.0
Earnings 2.0 2.0 2.0 2.0
Sales 3.0 3.0 3.0 3.0
Margin 4.0 4.0 4.0 4.0
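As a side note, groupby(..., axis=1) used in the EDIT above is deprecated in pandas 2.x and removed in 3.0. A transpose-based sketch of the same per-year diff, assuming the column labels always end in the two-digit year, and starting again from the question's original df:
t = df.set_index('Financials').T             # quarters become the row index
year = t.index.str[-2:]                      # '19', '20' -> one group per year
new_df = t.groupby(year).diff().fillna(t).T  # diff within each year; 1°Q stays as-is
print(new_df)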
I am trying to calculate percentages of first down from a dataframe.
Here is the dataframe
down distance
1 1.0 10.0
2 2.0 13.0
3 3.0 15.0
4 3.0 20.0
5 4.0 1.0
6 1.0 10.0
7 2.0 9.0
8 3.0 3.0
9 1.0 10.0
I would like to calculate the percentage relative to first down: for second down, what percentage of yards has been gained since first down, and likewise for third down, still measured against first down.
For example, I would like to have the following output.
down distance percentage
1 1.0 10.0 NaN
2 2.0 13.0 (13-10)/13
3 3.0 15.0 (15-10)/15
4 3.0 20.0 (20-10)/20
5 4.0 1.0 (1-10)/1
6 1.0 10.0 NaN # New calculation
7 2.0 9.0 (9-10)/9
8 3.0 3.0 (3-10)/3
9 1.0 10.0 NaN
Thanks
Here's a vectorised solution:
# define condition
cond = df['down'] == 1
# calculate value to subtract
first = df['distance'].where(cond).ffill().mask(cond)
# perform calculation
df['percentage'] = (df['distance'] - first) / df['distance']
print(df)
down distance percentage
1 1.0 10.0 NaN
2 2.0 13.0 0.230769
3 3.0 15.0 0.333333
4 3.0 20.0 0.500000
5 4.0 1.0 -9.000000
6 1.0 10.0 NaN
7 2.0 9.0 -0.111111
8 3.0 3.0 -2.333333
9 1.0 10.0 NaN
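To see what the chain builds, here are the intermediate values for the sample frame (every first-down distance happens to be 10.0):
cond = df['down'] == 1
step1 = df['distance'].where(cond)  # 10, NaN, NaN, NaN, NaN, 10, NaN, NaN, 10
step2 = step1.ffill()               # 10 carried forward over every row
first = step2.mask(cond)            # NaN, 10, 10, 10, 10, NaN, 10, 10, NaN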
Using groupby and transform:
s = df.groupby(df.down.eq(1).cumsum()).distance.transform('first')
s = df.distance.sub(s).div(df.distance)
df['percentage'] = s.mask(s.eq(0))
down distance percentage
1 1.0 10.0 NaN
2 2.0 13.0 0.230769
3 3.0 15.0 0.333333
4 3.0 20.0 0.500000
5 4.0 1.0 -9.000000
6 1.0 10.0 NaN
7 2.0 9.0 -0.111111
8 3.0 3.0 -2.333333
9 1.0 10.0 NaN
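One caveat with the final s.mask(s.eq(0)): it blanks every row whose percentage is exactly zero, not only the first downs. If a genuine 0% is possible in your data, masking on the condition itself is safer, e.g.:
df['percentage'] = s.mask(df.down.eq(1))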
With NumPy bits
Should be pretty zippy!
m = df.down.values == 1                # mask where down equals 1
i = np.flatnonzero(m)                  # positions where down equals 1
d = df.distance.values                 # NumPy array of distances
j = np.diff(np.append(i, len(df)))     # run lengths between first downs
                                       # (len(df) appended as a terminal value)
k = i.repeat(j)                        # repeat each first-down position over its run
p = np.where(m, np.nan, 1 - d[k] / d)  # reduced % formula while masking first downs
df.assign(percentage=p)
down distance percentage
1 1.0 10.0 NaN
2 2.0 13.0 0.230769
3 3.0 15.0 0.333333
4 3.0 20.0 0.500000
5 4.0 1.0 -9.000000
6 1.0 10.0 NaN
7 2.0 9.0 -0.111111
8 3.0 3.0 -2.333333
9 1.0 10.0 NaN
Use groupby to group the rows each time down is equal to 1, then transform with your desired calculation. Then you can find where down is 1 again, and convert those to NaN (as the calculation is meaningless there, as per your example):
df['percentage'] = (df.groupby(df.down.eq(1).cumsum())['distance']
.transform(lambda x: (x-x.iloc[0])/x))
df.loc[df.down.eq(1),'percentage'] = np.nan
>>> df
down distance percentage
1 1.0 10.0 NaN
2 2.0 13.0 0.230769
3 3.0 15.0 0.333333
4 3.0 20.0 0.500000
5 4.0 1.0 -9.000000
6 1.0 10.0 NaN
7 2.0 9.0 -0.111111
8 3.0 3.0 -2.333333
9 1.0 10.0 NaN
I have created two dataframes; let's call them df1 and df2, where
print(df1)
a b
1 0.375241 1.0
2 NaN 2.0
3 0.448792 3.0
4 NaN 4.0
df1 = df1[np.isfinite(df1['a'])]
print(df1)
a b
1 0.375241 1.0
3 0.448792 3.0
print(df2)
aa bb
1 0.047606 1.0
2 0.202927 1.0
3 0.205663 1.0
4 NaN 1.0
5 1.388080 1.0
6 0.097084 1.0
7 0.136873 1.0
8 NaN 1.0
9 NaN 1.0
10 NaN 1.0
12 0.084676 2.0
13 0.236850 2.0
14 0.532835 2.0
15 NaN 2.0
16 NaN 2.0
17 0.035106 2.0
18 NaN 2.0
19 NaN 2.0
20 NaN 2.0
22 0.419956 3.0
23 0.267132 3.0
24 0.944217 3.0
25 0.403024 3.0
26 NaN 3.0
27 0.184425 3.0
28 0.473998 3.0
29 NaN 3.0
30 NaN 3.0
31 NaN 3.0
33 0.465454 4.0
34 0.240867 4.0
35 NaN 4.0
36 NaN 4.0
37 0.323195 4.0
38 0.193764 4.0
39 NaN 4.0
40 NaN 4.0
41 NaN 4.0
42 NaN 4.0
based on the results from df1['b'], where I only have 1 and 3 as valid numbers now, how would I go about keeping the df2['bb'] groups 1 and 3 and dropping all other rows in df2, so that this would be the final product:
df2 = df2[np.isfinite(df2['aa'])]
print(df2)
aa bb
1 0.047606 1.0
2 0.202927 1.0
3 0.205663 1.0
5 1.388080 1.0
6 0.097084 1.0
7 0.136873 1.0
22 0.419956 3.0
23 0.267132 3.0
24 0.944217 3.0
25 0.403024 3.0
27 0.184425 3.0
28 0.473998 3.0
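One way to get there (a sketch using Series.isin; the variable name valid is just for illustration): keep the rows of df2 whose bb appears in the surviving df1['b'] values, combined with the same np.isfinite filter on aa:
import numpy as np

valid = df1['b'].unique()  # array([1., 3.]) after the df1 filtering above
df2 = df2[df2['bb'].isin(valid) & np.isfinite(df2['aa'])]
print(df2)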
I have a pandas DataFrame; it looks like this:
# Output
# A B C D
# 0 3.0 6.0 7.0 4.0
# 1 42.0 44.0 1.0 3.0
# 2 4.0 2.0 3.0 62.0
# 3 90.0 83.0 53.0 23.0
# 4 22.0 23.0 24.0 NaN
# 5 5.0 2.0 5.0 34.0
# 6 NaN NaN NaN NaN
# 7 NaN NaN NaN NaN
# 8 2.0 12.0 65.0 1.0
# 9 5.0 7.0 32.0 7.0
# 10 2.0 13.0 6.0 12.0
# 11 NaN NaN NaN NaN
# 12 23.0 NaN 23.0 34.0
# 13 61.0 NaN 63.0 3.0
# 14 32.0 43.0 12.0 76.0
# 15 24.0 2.0 34.0 2.0
What I would like to do is fill the NaNs in columns A, B, and C with the nearest preceding row's non-null B value. For column D, I would like the NaNs replaced with zeros.
I've looked into ffill and fillna, but neither seems to be able to do the job on its own.
My solution so far:
def fix_abc(row, column, df):
    # If the row/column value is null/nan
    if pd.isnull(row[column]):
        # Get the value of column B from the row before
        prior = row.name
        value = df[prior-1:prior]['B'].values[0]
        # If that value is empty too, go to the row before that
        while pd.isnull(value) and prior >= 1:
            prior = prior - 1
            value = df[prior-1:prior]['B'].values[0]
    else:
        value = row[column]
    return value

df['A'] = df.apply(lambda x: fix_abc(x, 'A', df), axis=1)
df['B'] = df.apply(lambda x: fix_abc(x, 'B', df), axis=1)
df['C'] = df.apply(lambda x: fix_abc(x, 'C', df), axis=1)

def fix_d(x):
    if pd.isnull(x['D']):
        return 0
    return x['D']

df['D'] = df.apply(lambda x: fix_d(x), axis=1)
This feels quite inefficient and slow, so I'm wondering if there is a quicker, more efficient way to do it.
Example output:
# A B C D
# 0 3.0 6.0 7.0 4.0
# 1 42.0 44.0 1.0 3.0
# 2 4.0 2.0 3.0 62.0
# 3 90.0 83.0 53.0 23.0
# 4 22.0 23.0 24.0 0.0
# 5 5.0 2.0 5.0 34.0
# 6 2.0 2.0 2.0 0.0
# 7 2.0 2.0 2.0 0.0
# 8 2.0 12.0 65.0 1.0
# 9 5.0 7.0 32.0 7.0
# 10 2.0 13.0 6.0 12.0
# 11 13.0 13.0 13.0 0.0
# 12 23.0 13.0 23.0 34.0
# 13 61.0 13.0 63.0 3.0
# 14 32.0 43.0 12.0 76.0
# 15 24.0 2.0 34.0 2.0
I have dumped the code, including the data for the dataframe, into a Python fiddle available (here).
fillna allows for various ways to do the filling. In this case, column D can just fill with 0. Column B can fill via pad. And then columns A and C can fill from column B, like:
Code:
df['D'] = df.D.fillna(0)
df['B'] = df.B.fillna(method='pad')
df['A'] = df.A.fillna(df['B'])
df['C'] = df.C.fillna(df['B'])
Test Code:
df = pd.read_fwf(StringIO(u"""
A B C D
3.0 6.0 7.0 4.0
42.0 44.0 1.0 3.0
4.0 2.0 3.0 62.0
90.0 83.0 53.0 23.0
22.0 23.0 24.0 NaN
5.0 2.0 5.0 34.0
NaN NaN NaN NaN
NaN NaN NaN NaN
2.0 12.0 65.0 1.0
5.0 7.0 32.0 7.0
2.0 13.0 6.0 12.0
NaN NaN NaN NaN
23.0 NaN 23.0 34.0
61.0 NaN 63.0 3.0
32.0 43.0 12.0 76.0
24.0 2.0 34.0 2.0"""), header=1)
print(df)
df['D'] = df.D.fillna(0)
df['B'] = df.B.fillna(method='pad')
df['A'] = df.A.fillna(df['B'])
df['C'] = df.C.fillna(df['B'])
print(df)
Results:
A B C D
0 3.0 6.0 7.0 4.0
1 42.0 44.0 1.0 3.0
2 4.0 2.0 3.0 62.0
3 90.0 83.0 53.0 23.0
4 22.0 23.0 24.0 NaN
5 5.0 2.0 5.0 34.0
6 NaN NaN NaN NaN
7 NaN NaN NaN NaN
8 2.0 12.0 65.0 1.0
9 5.0 7.0 32.0 7.0
10 2.0 13.0 6.0 12.0
11 NaN NaN NaN NaN
12 23.0 NaN 23.0 34.0
13 61.0 NaN 63.0 3.0
14 32.0 43.0 12.0 76.0
15 24.0 2.0 34.0 2.0
A B C D
0 3.0 6.0 7.0 4.0
1 42.0 44.0 1.0 3.0
2 4.0 2.0 3.0 62.0
3 90.0 83.0 53.0 23.0
4 22.0 23.0 24.0 0.0
5 5.0 2.0 5.0 34.0
6 2.0 2.0 2.0 0.0
7 2.0 2.0 2.0 0.0
8 2.0 12.0 65.0 1.0
9 5.0 7.0 32.0 7.0
10 2.0 13.0 6.0 12.0
11 13.0 13.0 13.0 0.0
12 23.0 13.0 23.0 34.0
13 61.0 13.0 63.0 3.0
14 32.0 43.0 12.0 76.0
15 24.0 2.0 34.0 2.0
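The same fills in the modern spelling, for what it's worth: fillna(method='pad') is deprecated since pandas 2.1, and ffill() is the drop-in replacement:
df['D'] = df['D'].fillna(0)
df['B'] = df['B'].ffill()          # was: df.B.fillna(method='pad')
df['A'] = df['A'].fillna(df['B'])  # Series.fillna(Series) aligns on the index
df['C'] = df['C'].fillna(df['B'])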