Adding column to Pandas DataFrame based on dynamic indexing condition - python

I have a dataframe with a column that randomly starts a "count" back at 1. My goal is to produce a new_col that divides my current column by the last value in a count. See below for an example.
This is my current DataFrame:
col
0 1.0
1 2.0
2 3.0
3 1.0
4 2.0
5 1.0
6 2.0
7 3.0
8 4.0
9 5.0
10 1.0
11 2.0
12 3.0
Trying to get an output like so:
col new_col
0 1.0 0.333
1 2.0 0.667
2 3.0 1.000
3 1.0 0.500
4 2.0 1.000
5 1.0 0.200
6 2.0 0.400
7 3.0 0.600
8 4.0 0.800
9 5.0 1.000
10 1.0 0.333
11 2.0 0.667
12 3.0 1.000
This is what I have tried so far:
df['col_bool'] = pd.DataFrame(df['col'] == 1.0)
idx_lst = [x - 2 for x in df.index[df['col_bool']].tolist()]
idx_lst = idx_lst[1:]
mask = (df['col'] != 1.0)
df_valid = df[mask]
for i in idx_lst:
    df['new_col'] = 1.0 / df_valid.iloc[i]['col']
    df.loc[mask, 'new_col'] = df_valid['col'] / df_valid.iloc[i]['col']
This understandably results in an index error. Maybe I need to make a copy of a DataFrame each time and concat. I believe this would work but I want to ask if I am missing any shortcuts here?

Try:
df['new_col'] = df['col'].div(df.groupby((df['col'] == 1).cumsum())['col'].transform('last'))
Output:
col new_col
0 1.0 0.333333
1 2.0 0.666667
2 3.0 1.000000
3 1.0 0.500000
4 2.0 1.000000
5 1.0 0.200000
6 2.0 0.400000
7 3.0 0.600000
8 4.0 0.800000
9 5.0 1.000000
10 1.0 0.333333
11 2.0 0.666667
12 3.0 1.000000
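To see why this works, it can help to look at the grouping key on its own. The following is a minimal sketch, assuming the question's column is rebuilt by hand; the names `key` and `last_in_run` are illustrative and not part of the answer.

```python
import pandas as pd

# Rebuild the question's column by hand (assumed sample data).
df = pd.DataFrame({'col': [1.0, 2.0, 3.0, 1.0, 2.0, 1.0, 2.0,
                           3.0, 4.0, 5.0, 1.0, 2.0, 3.0]})

# Every 1.0 starts a new run; the cumulative sum of that boolean
# labels the runs 1, 2, 3, ...
key = (df['col'] == 1).cumsum()

# Broadcast the last value of each run back onto all rows of that run.
last_in_run = df.groupby(key)['col'].transform('last')

df['new_col'] = df['col'] / last_in_run
print(df.assign(key=key, last=last_in_run))
```

Selecting `['col']` before `transform` keeps the result a Series, so the division stays Series-by-Series even if the frame carries other columns.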

You can try:
df['new_col'] = df.groupby((df.col.ne(df.col.shift().add(1))).cumsum())[
'col'].transform(lambda x: x.div(len(x)))
Or:
df['new_col'] = df.col.div(df.groupby((df.col.ne(df.col.shift().add(1))).cumsum())
['col'].transform('count'))
OUTPUT:
col new_col
0 1.0 0.333333
1 2.0 0.666667
2 3.0 1.000000
3 1.0 0.500000
4 2.0 1.000000
5 1.0 0.200000
6 2.0 0.400000
7 3.0 0.600000
8 4.0 0.800000
9 5.0 1.000000
10 1.0 0.333333
11 2.0 0.666667
12 3.0 1.000000
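Note that both variants start a new group whenever a value is not exactly one more than the previous value, and they divide by the group size rather than by the last value of the run. That appears to match the expected output only when every run counts 1, 2, 3, ... without gaps. A small sketch with made-up data (the Series `s` below is an assumption, not taken from the question) shows where this differs from dividing by the last value:

```python
import pandas as pd

# Assumed data: the second run jumps from 1.0 straight to 3.0.
s = pd.Series([1.0, 2.0, 3.0, 1.0, 3.0], name='col')

# New group whenever the value is not "previous value + 1".
key = s.ne(s.shift().add(1)).cumsum()

# Divide by the size of each group (the approach of this answer).
by_count = s / s.groupby(key).transform('count')

# Divide by the last value of each run that starts at 1.0.
by_last = s / s.groupby((s == 1).cumsum()).transform('last')

print(pd.DataFrame({'col': s, 'key': key,
                    'by_count': by_count, 'by_last': by_last}))
```

On the question's data the two approaches agree, because each run is a gap-free count starting at 1.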

Related

Relative minimum values in pandas

I have the following dataframe in pandas:
Race_ID Athlete_ID Finish_time
0 1.0 1.0 56.1
1 1.0 3.0 60.2
2 1.0 2.0 57.1
3 1.0 4.0 57.2
4 2.0 2.0 56.2
5 2.0 1.0 56.3
6 2.0 3.0 56.4
7 2.0 4.0 56.5
8 3.0 1.0 61.2
9 3.0 2.0 62.1
10 3.0 3.0 60.4
11 3.0 4.0 60.0
12 4.0 2.0 55.0
13 4.0 1.0 54.0
14 4.0 3.0 53.0
15 4.0 4.0 52.0
where Race_ID is in descending order of time (i.e. 1 is the most recent race and 4 is the oldest race).
And I want to add a new column Relative_time#t-1 which is the Athlete's Finish_time in the last race relative to the fastest time in the last race. Hence the output would look something like
Race_ID Athlete_ID Finish_time Relative_time#t-1
0 1.0 1.0 56.1 56.3/56.2
1 1.0 3.0 60.2 56.4/56.2
2 1.0 2.0 57.1 56.2/56.2
3 1.0 4.0 57.2 56.5/56.2
4 2.0 2.0 56.2 62.1/60
5 2.0 1.0 56.3 61.2/60
6 2.0 3.0 56.4 60.4/60
7 2.0 4.0 56.5 60/60
8 3.0 1.0 61.2 54/52
9 3.0 2.0 62.1 55/52
10 3.0 3.0 60.4 53/52
11 3.0 4.0 60.0 52/52
12 4.0 2.0 55.0 0
13 4.0 1.0 54.0 0
14 4.0 3.0 53.0 0
15 4.0 4.0 52.0 0
Here's the code:
data = [[1,1,56.1,'56.3/56.2'],
[1,3,60.2,'56.4/56.2'],
[1,2,57.1,'56.2/56.2'],
[1,4,57.2,'56.5/56.2'],
[2,2,56.2,'62.1/60'],
[2,1,56.3,'61.2/60'],
[2,3,56.4,'60.4/60'],
[2,4,56.5,'60/60'],
[3,1,61.2,'54/52'],
[3,2,62.1,'55/52'],
[3,3,60.4,'53/52'],
[3,4,60,'52/52'],
[4,2,55,'0'],
[4,1,54,'0'],
[4,3,53,'0'],
[4,4,52,'0']]
df = pd.DataFrame(data,columns=['Race_ID','Athlete_ID','Finish_time','Relative_time#t-1'],dtype=float)
I intentionally made the Relative_time#t-1 as str instead of int to show the formula.
Here is what I have tried:
df.sort_values(by = ['Race_ID', 'Athlete_ID'], ascending=[True, True], inplace=True)
df['Finish_time#t-1'] = df.groupby('Athlete_ID')['Finish_time'].shift(-1)
df['Finish_time#t-1'] = df['Finish_time#t-1'].replace(np.nan, 0, regex = True)
So I get the numerator for the new column but I don't know how to get the minimum time for each Race_ID (i.e. the value in the denominator)
Thank you in advance.
Try this:
(df.groupby('Athlete_ID')['Finish_time']
   .shift(-1)
   .div(df['Race_ID'].map(
       df.groupby('Race_ID')['Finish_time']
         .min()
         .shift(-1)))
   .fillna(0))
Output:
0 1.001779
1 1.003559
2 1.000000
3 1.005338
4 1.035000
5 1.020000
6 1.006667
7 1.000000
8 1.038462
9 1.057692
10 1.019231
11 1.000000
12 0.000000
13 0.000000
14 0.000000
15 0.000000
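The chained expression can be easier to follow when split into numerator and denominator. This is a rough sketch on a hand-picked subset of the question's rows (two athletes, three races); `prev_time` and `prev_best` are illustrative names, and the sketch assumes, as the answer does, that rows are ordered from the most recent race to the oldest.

```python
import pandas as pd

# Assumed subset of the question's data (Race_ID 1 is the most recent race).
df = pd.DataFrame({
    'Race_ID':     [1, 1, 2, 2, 3, 3],
    'Athlete_ID':  [1, 2, 2, 1, 1, 2],
    'Finish_time': [56.1, 57.1, 56.2, 56.3, 61.2, 62.1],
})

# Numerator: each athlete's time in the next-listed (i.e. previous) race.
prev_time = df.groupby('Athlete_ID')['Finish_time'].shift(-1)

# Denominator: fastest time per race, shifted so that race r
# sees the minimum of race r + 1.
prev_best = df.groupby('Race_ID')['Finish_time'].min().shift(-1)

# Map the per-race minimum onto the rows, divide, and use 0 where
# there is no earlier race.
df['Relative_time#t-1'] = prev_time.div(df['Race_ID'].map(prev_best)).fillna(0)
print(df)
```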

Pandas shift based on different values to calculate percentages

I am trying to calculate percentages of first down from a dataframe.
Here is the dataframe
down distance
1 1.0 10.0
2 2.0 13.0
3 3.0 15.0
4 3.0 20.0
5 4.0 1.0
6 1.0 10.0
7 2.0 9.0
8 3.0 3.0
9 1.0 10.0
I would like to calculate the percentage relative to first down: for second down, the percentage of yards gained since first down; for third down, the percentage for third down based on first down.
For example, I would like to have the following output.
down distance percentage
1 1.0 10.0 NaN
2 2.0 13.0 (13-10)/13
3 3.0 15.0 (15-10)/15
4 3.0 20.0 (20-10)/20
5 4.0 1.0 (1-10)/1
6 1.0 10.0 NaN # New calculation
7 2.0 9.0 (9-10)/9
8 3.0 3.0 (3-10)/3
9 1.0 10.0 NaN
Thanks
Current solutions all work correctly for the first question.
Here's a vectorised solution:
# define condition
cond = df['down'] == 1
# calculate value to subtract
first = df['distance'].where(cond).ffill().mask(cond)
# perform calculation
df['percentage'] = (df['distance'] - first) / df['distance']
print(df)
down distance percentage
1 1.0 10.0 NaN
2 2.0 13.0 0.230769
3 3.0 15.0 0.333333
4 3.0 20.0 0.500000
5 4.0 1.0 -9.000000
6 1.0 10.0 NaN
7 2.0 9.0 -0.111111
8 3.0 3.0 -2.333333
9 1.0 10.0 NaN
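The `where(...).ffill().mask(...)` chain does three things in a row. A short sketch that names each intermediate step (`step1` and `step2` are illustrative names, and the frame is rebuilt by hand from the question):

```python
import pandas as pd

# Rebuild the question's data by hand (index 1..9 as in the question).
df = pd.DataFrame({'down': [1.0, 2.0, 3.0, 3.0, 4.0, 1.0, 2.0, 3.0, 1.0],
                   'distance': [10.0, 13.0, 15.0, 20.0, 1.0, 10.0, 9.0, 3.0, 10.0]},
                  index=range(1, 10))

cond = df['down'] == 1

step1 = df['distance'].where(cond)  # keep distance on first downs, NaN elsewhere
step2 = step1.ffill()               # carry each first-down distance forward
first = step2.mask(cond)            # blank out the first-down rows themselves

df['percentage'] = (df['distance'] - first) / df['distance']
print(df.assign(step1=step1, step2=step2, first=first))
```

Because `first` is NaN on the first-down rows, the subtraction yields NaN there without any extra masking step.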
Using groupby and transform:
s = df.groupby(df.down.eq(1).cumsum()).distance.transform('first')
s = df.distance.sub(s).div(df.distance)
df['percentage'] = s.mask(s.eq(0))
down distance percentage
1 1.0 10.0 NaN
2 2.0 13.0 0.230769
3 3.0 15.0 0.333333
4 3.0 20.0 0.500000
5 4.0 1.0 -9.000000
6 1.0 10.0 NaN
7 2.0 9.0 -0.111111
8 3.0 3.0 -2.333333
9 1.0 10.0 NaN
With Numpy Bits
Should be pretty zippy!
m = df.down.values == 1 # mask where equal to 1
i = np.flatnonzero(m) # positions where equal to 1
d = df.distance.values # Numpy array of distances
j = np.diff(np.append(i, len(df))) # use diff to find distances between
# values equal to 1. Note that I append
# the length of the df as a terminal value
k = i.repeat(j) # I repeat the positions where equal to 1
# a number of times in order to fill in.
p = np.where(m, np.nan, 1 - d[k] / d) # reduction of % formula while masking
df.assign(percentage=p)
down distance percentage
1 1.0 10.0 NaN
2 2.0 13.0 0.230769
3 3.0 15.0 0.333333
4 3.0 20.0 0.500000
5 4.0 1.0 -9.000000
6 1.0 10.0 NaN
7 2.0 9.0 -0.111111
8 3.0 3.0 -2.333333
9 1.0 10.0 NaN
Use groupby to start a new group each time down is equal to 1, then transform with your desired calculation. Then you can find where down is 1 again and convert those rows to NaN (as the calculation is meaningless there, as per your example):
df['percentage'] = (df.groupby(df.down.eq(1).cumsum())['distance']
.transform(lambda x: (x-x.iloc[0])/x))
df.loc[df.down.eq(1),'percentage'] = np.nan
>>> df
down distance percentage
1 1.0 10.0 NaN
2 2.0 13.0 0.230769
3 3.0 15.0 0.333333
4 3.0 20.0 0.500000
5 4.0 1.0 -9.000000
6 1.0 10.0 NaN
7 2.0 9.0 -0.111111
8 3.0 3.0 -2.333333
9 1.0 10.0 NaN

How to fill a particular value with mean value of the column between first row and the corresponding row in pandas dataframe

I have a df like this,
A B C D E
1 2 3 0 2
2 0 7 1 1
3 4 0 3 0
0 0 3 4 3
I am trying to replace all the 0 with mean() value between the first row and the 0 value row for the corresponding column,
My expected output is,
A B C D E
1.0 2.00 3.000000 0.0 2.0
2.0 1.00 7.000000 1.0 1.0
3.0 4.00 3.333333 3.0 1.0
1.5 1.75 3.000000 4.0 3.0
The main problem here is that when a column contains multiple zeros, each replacement depends on the previously filled mean, so a vectorized solution is really problematic:
def f(x):
    for i, v in enumerate(x):
        if v == 0:
            x.iloc[i] = x.iloc[:i+1].mean()
    return x
df1 = df.astype(float).apply(f)
print (df1)
A B C D E
0 1.0 2.00 3.000000 0.0 2.0
1 2.0 1.00 7.000000 1.0 1.0
2 3.0 4.00 3.333333 3.0 1.0
3 1.5 1.75 3.000000 4.0 3.0
Better solution:
# collect the positions of zero values in a helper DataFrame
a, b = np.where(df.values == 0)
df1 = pd.DataFrame({'rows':a, 'cols':b})
# no mean needs to be computed for the first row
df1 = df1[df1['rows'] != 0]
print (df1)
rows cols
1 1 1
2 2 2
3 2 4
4 3 0
5 3 1
# loop over each row of the helper df and assign the means
for i in df1.itertuples():
    df.iloc[i.rows, i.cols] = df.iloc[:i.rows+1, i.cols].mean()
print (df)
A B C D E
0 1.0 2.00 3.000000 0 2.0
1 2.0 1.00 7.000000 1 1.0
2 3.0 4.00 3.333333 3 1.0
3 1.5 1.75 3.000000 4 3.0
Another similar solution (with mean of all pairs):
for i, j in zip(*np.where(df.values == 0)):
    df.iloc[i, j] = df.iloc[:i+1, j].mean()
print (df)
A B C D E
0 1.0 2.00 3.000000 0.0 2.0
1 2.0 1.00 7.000000 1.0 1.0
2 3.0 4.00 3.333333 3.0 1.0
3 1.5 1.75 3.000000 4.0 3.0
IIUC
def f(x):
    for z in range(x.size):
        if x[z] == 0: x[z] = np.mean(x[:z+1])
    return x
df.astype(float).apply(f)
A B C D E
0 1.0 2.00 3.000000 0.0 2.0
1 2.0 1.00 7.000000 1.0 1.0
2 3.0 4.00 3.333333 3.0 1.0
3 1.5 1.75 3.000000 4.0 3.0
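All of the approaches above fill zeros from top to bottom, so a later zero sees the already replaced earlier values. A minimal sketch of that sequential dependence on plain Python lists (`fill_zeros_with_running_mean` is a hypothetical helper name, not from any answer):

```python
import pandas as pd

def fill_zeros_with_running_mean(values):
    """Hypothetical helper: replace each 0 with the mean of all values
    from the first row up to and including that row, using values
    that were already replaced earlier in the same column."""
    out = [float(v) for v in values]
    for i, v in enumerate(out):
        if v == 0:
            out[i] = sum(out[:i + 1]) / (i + 1)
    return out

# The question's data, rebuilt by hand.
df = pd.DataFrame({'A': [1, 2, 3, 0], 'B': [2, 0, 4, 0], 'C': [3, 7, 0, 3],
                   'D': [0, 1, 3, 4], 'E': [2, 1, 0, 3]})

# Apply column by column, returning a new frame instead of mutating df.
result = df.apply(
    lambda col: pd.Series(fill_zeros_with_running_mean(col), index=col.index))
print(result)
```

In column B the second zero becomes (2 + 1 + 4 + 0) / 4 = 1.75 only because the first zero was already replaced by 1, which is why the fill has to run in order.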

Basic Python: How do I normalize a data series?

I have a dataframe with 5 columns indexed by date. I would like to normalize these data series by the first item in their lists.
A B C D E
1/1/2017 3 4 1 2 3
1/2/2017 7 4 4 3 3
1/3/2017 2 5 5 4 3
1/4/2017 2 5 3 6 3
1/5/2017 2 2 2 6 6
For example, in column A, I would like to divide everything by 3, the first item in the list. The same goes for columns B to E.
Thanks for your help!
In [100]: df.div(df.iloc[0])
Out[100]:
A B C D E
1/1/2017 1.000000 1.00 1.0 1.0 1.0
1/2/2017 2.333333 1.00 4.0 1.5 1.0
1/3/2017 0.666667 1.25 5.0 2.0 1.0
1/4/2017 0.666667 1.25 3.0 3.0 1.0
1/5/2017 0.666667 0.50 2.0 3.0 2.0
or
In [101]: df / df.iloc[0]
Out[101]:
A B C D E
1/1/2017 1.000000 1.00 1.0 1.0 1.0
1/2/2017 2.333333 1.00 4.0 1.5 1.0
1/3/2017 0.666667 1.25 5.0 2.0 1.0
1/4/2017 0.666667 1.25 3.0 3.0 1.0
1/5/2017 0.666667 0.50 2.0 3.0 2.0
By using div
df.div(df.iloc[0,:],1)
Out[496]:
A B C D E
1/1/2017 1.000000 1.00 1.0 1.0 1.0
1/2/2017 2.333333 1.00 4.0 1.5 1.0
1/3/2017 0.666667 1.25 5.0 2.0 1.0
1/4/2017 0.666667 1.25 3.0 3.0 1.0
1/5/2017 0.666667 0.50 2.0 3.0 2.0
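All three forms rely on the same label alignment: `df.iloc[0]` is a Series indexed by column name, and DataFrame arithmetic matches a Series against the column labels by default. A self-contained sketch, with the question's data typed in by hand and the dates kept as plain strings (an assumption for brevity):

```python
import pandas as pd

# The question's data, rebuilt by hand; dates kept as plain strings.
df = pd.DataFrame({'A': [3, 7, 2, 2, 2], 'B': [4, 4, 5, 5, 2],
                   'C': [1, 4, 5, 3, 2], 'D': [2, 3, 4, 6, 6],
                   'E': [3, 3, 3, 3, 6]},
                  index=['1/1/2017', '1/2/2017', '1/3/2017',
                         '1/4/2017', '1/5/2017'])

first_row = df.iloc[0]          # Series indexed by 'A'..'E'
normalized = df.div(first_row)  # aligned on column labels (the default axis)
print(normalized)
```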

Pandas fillna() not working as expected

I'm trying to replace NaN values in my dataframe with means from the same row.
sample_df = pd.DataFrame({'A':[1.0,np.nan,5.0],
'B':[1.0,4.0,5.0],
'C':[1.0,1.0,4.0],
'D':[6.0,5.0,5.0],
'E':[1.0,1.0,4.0],
'F':[1.0,np.nan,4.0]})
sample_mean = sample_df.apply(lambda x: np.mean(x.dropna().values.tolist()) ,axis=1)
Produces:
0 1.833333
1 2.750000
2 4.500000
dtype: float64
But when I try to use fillna() to fill the missing dataframe values with values from the series, it doesn't seem to work.
sample_df.fillna(sample_mean, inplace=True)
A B C D E F
0 1.0 1.0 1.0 6.0 1.0 1.0
1 NaN 4.0 1.0 5.0 1.0 NaN
2 5.0 5.0 4.0 5.0 4.0 4.0
What I expect is:
A B C D E F
0 1.0 1.0 1.0 6.0 1.0 1.0
1 2.75 4.0 1.0 5.0 1.0 2.75
2 5.0 5.0 4.0 5.0 4.0 4.0
I've reviewed the other similar questions and can't seem to uncover the issue. Thanks in advance for your help.
By using pandas
sample_df.T.fillna(sample_df.T.mean()).T
Out[1284]:
A B C D E F
0 1.00 1.0 1.0 6.0 1.0 1.00
1 2.75 4.0 1.0 5.0 1.0 2.75
2 5.00 5.0 4.0 5.0 4.0 4.00
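A likely reason the original call had no effect: when `DataFrame.fillna` is given a Series, it matches the Series index against the column labels, and the row means are indexed 0, 1, 2 rather than 'A' to 'F'. Transposing turns the row labels into column labels, which is why the line above works. A small sketch of both cases (`row_means`, `unchanged` and `filled` are illustrative names):

```python
import numpy as np
import pandas as pd

sample_df = pd.DataFrame({'A': [1.0, np.nan, 5.0],
                          'B': [1.0, 4.0, 5.0],
                          'C': [1.0, 1.0, 4.0],
                          'D': [6.0, 5.0, 5.0],
                          'E': [1.0, 1.0, 4.0],
                          'F': [1.0, np.nan, 4.0]})

row_means = sample_df.mean(axis=1)  # indexed by row labels 0, 1, 2

# The Series index (0, 1, 2) is matched against column labels 'A'..'F';
# nothing matches, so no NaN is filled.
unchanged = sample_df.fillna(row_means)

# After transposing, the columns are 0, 1, 2, so the same Series lines up.
filled = sample_df.T.fillna(row_means).T

print(unchanged)
print(filled)
```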
Here's one way -
sample_df[:] = np.where(np.isnan(sample_df), sample_df.mean(axis=1).to_numpy()[:, None], sample_df)
Sample output -
sample_df
Out[61]:
A B C D E F
0 1.00 1.0 1.0 6.0 1.0 1.00
1 2.75 4.0 1.0 5.0 1.0 2.75
2 5.00 5.0 4.0 5.0 4.0 4.00
Another pandas way:
>>> sample_df.where(pd.notnull(sample_df), sample_df.mean(axis=1), axis='rows')
A B C D E F
0 1.00 1.0 1.0 6.0 1.0 1.00
1 2.75 4.0 1.0 5.0 1.0 2.75
2 5.00 5.0 4.0 5.0 4.0 4.00
This works like an element-wise if/else: where elements of pd.notnull(sample_df) are True, use the corresponding elements from sample_df; otherwise use the elements from sample_df.mean(axis=1), aligned along axis='rows'.
