I have the following dataframe, with cumulative results quarter by quarter and resets at 1°Q.
I need the Quarter net variation, so I need to subtract column over column except the ones with 1°Q.
from pandas import DataFrame
data = {'Financials': ['EPS','Earnings','Sales','Margin'],
'1°Q19': [1,2,3,4],
'2°Q19': [2,4,6,8],
'3°Q19': [3,6,9,12],
'4°Q19': [4,8,12,16],
'1°Q20': [1,2,3,4],
'2°Q20': [2,4,6,8],
'3°Q20': [3,6,9,12],
'4°Q20': [4,8,12,16]
}
df = DataFrame(data,columns=['Financials','1°Q19','2°Q19','3°Q19','4°Q19',
'1°Q20','2°Q20','3°Q20','4°Q20'])
print(df)
Financials 1°Q19 2°Q19 3°Q19 4°Q19 1°Q20 2°Q20 3°Q20 4°Q20
0 EPS 1 2 3 4 1 2 3 4
1 Earnings 2 4 6 8 2 4 6 8
2 Sales 3 6 9 12 3 6 9 12
3 Margin 4 8 12 16 4 8 12 16
I've started like this and then I got stuck big time:
if ~df.columns.str.contains('1°Q'):
# here I want to substract (1°Q remains unchanged), 2°Q - 1°Q, 3°Q - 2°Q, 4°Q - 3°Q
In order to get this desired result:
Financials 1°Q19 2°Q19 3°Q19 4°Q19 1°Q20 2°Q20 3°Q20 4°Q20
0 EPS 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1 Earnings 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
2 Sales 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
3 Margin 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0
I've tried
new_df = df.diff(axis=1).fillna(df)
print(new_df)
But the result in this case is not the desired one for de 1°Q20:
Financials 1°Q19 2°Q19 3°Q19 4°Q19 1°Q20 2°Q20 3°Q20 4°Q20
0 EPS 1.0 1.0 1.0 1.0 -3.0 1.0 1.0 1.0
1 Earnings 2.0 2.0 2.0 2.0 -6.0 2.0 2.0 2.0
2 Sales 3.0 3.0 3.0 3.0 -9.0 3.0 3.0 3.0
3 Margin 4.0 4.0 4.0 4.0 -12.0 4.0 4.0 4.0
IIUC, DataFrame.diff with axis=1 and then fill NaN with
DataFrame.fillna
new_df = df.diff(axis=1).fillna(df)
print(new_df)
Financials 1°Q 2°Q 3°Q 4°Q
0 EPS 1.0 1.0 1.0 1.0
1 Earnings 2.0 2.0 2.0 2.0
2 Sales 3.0 3.0 3.0 3.0
3 Margin 4.0 4.0 4.0 4.0
for expected output:
new_df = new_df.astype(int)
EDIT
df.groupby(df.columns.str.contains('1°Q').cumsum(),axis=1).diff(axis=1).fillna(df)
Financials 1°Q19 2°Q19 3°Q19 4°Q19 1°Q20 2°Q20 3°Q20 4°Q20
0 EPS 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1 Earnings 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
2 Sales 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
3 Margin 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0
or
df.diff(axis=1).T.mask(df.columns.to_series().str.contains('1°Q')).T.fillna(df)
You can leverage df.shift for the subtraction, and fillna to fix the NaN values left from the shift
df=df.set_index('Financials')
df-(df.shift(1, axis=1).fillna(0))
1°Q 2°Q 3°Q 4°Q
Financials
EPS 1.0 1.0 1.0 1.0
Earnings 2.0 2.0 2.0 2.0
Sales 3.0 3.0 3.0 3.0
Margin 4.0 4.0 4.0 4.0
Related
i have the following dataframe in pandas:
Race_ID Athlete_ID Finish_time
0 1.0 1.0 56.1
1 1.0 3.0 60.2
2 1.0 2.0 57.1
3 1.0 4.0 57.2
4 2.0 2.0 56.2
5 2.0 1.0 56.3
6 2.0 3.0 56.4
7 2.0 4.0 56.5
8 3.0 1.0 61.2
9 3.0 2.0 62.1
10 3.0 3.0 60.4
11 3.0 4.0 60.0
12 4.0 2.0 55.0
13 4.0 1.0 54.0
14 4.0 3.0 53.0
15 4.0 4.0 52.0
where Race_ID is in descending order of time. (i.e. 1 is the most current race nad 4 is the oldest race)
And I want to add a new column Relative_time#t-1 which is the Athlete's Finish_time in the last race relative to the fastest time in the last race. Hence the output would look something like
Race_ID Athlete_ID Finish_time Relative_time#t-1
0 1.0 1.0 56.1 56.3/56.2
1 1.0 3.0 60.2 56.4/56.2
2 1.0 2.0 57.1 56.2/56.2
3 1.0 4.0 57.2 56.5/56.2
4 2.0 2.0 56.2 62.1/60
5 2.0 1.0 56.3 61.2/60
6 2.0 3.0 56.4 60.4/60
7 2.0 4.0 56.5 60/60
8 3.0 1.0 61.2 54/52
9 3.0 2.0 62.1 55/52
10 3.0 3.0 60.4 53/52
11 3.0 4.0 60.0 52/52
12 4.0 2.0 55.0 0
13 4.0 1.0 54.0 0
14 4.0 3.0 53.0 0
15 4.0 4.0 52.0 0
Here's the code:
data = [[1,1,56.1,'56.3/56.2'],
[1,3,60.2,'56.4/56.2'],
[1,2,57.1,'56.2/56.2'],
[1,4,57.2,'56.5/56.2'],
[2,2,56.2,'62.1/60'],
[2,1,56.3,'61.2/60'],
[2,3,56.4,'60.4/60'],
[2,4,56.5,'60/60'],
[3,1,61.2,'54/52'],
[3,2,62.1,'55/52'],
[3,3,60.4,'53/52'],
[3,4,60,'52/52'],
[4,2,55,'0'],
[4,1,54,'0'],
[4,3,53,'0'],
[4,4,52,'0']]
df = pd.DataFrame(data,columns=['Race_ID','Athlete_ID','Finish_time','Relative_time#t-1'],dtype=float)
I intentionally made the Relative_time#t-1 as str instead of int to show the formula.
Here is what I have tried:
df.sort_values(by = ['Race_ID', 'Athlete_ID'], ascending=[True, True], inplace=True)
df['Finish_time#t-1'] = df.groupby('Athlete_ID')['Finish_time'].shift(-1)
df['Finish_time#t-1'] = df['Finish_time#t-1'].replace(np.nan, 0, regex = True)
So I get the numerator for the new column but I don't know how to get the minimum time for each Race_ID (i.e. the value in the denominator)
Thank you in advance.
Try this:
(df.groupby('Athlete_ID')['Finish_time']
.shift(-1)
.div(df['Race_ID'].map(
df.groupby('Race_ID')['Finish_time']
.min()
.shift(-1)))
.fillna(0))
Output:
0 1.001779
1 1.003559
2 1.000000
3 1.005338
4 1.035000
5 1.020000
6 1.006667
7 1.000000
8 1.038462
9 1.057692
10 1.019231
11 1.000000
12 0.000000
13 0.000000
14 0.000000
15 0.000000
I'm trying to multiply every 2nd row by -1 but in a specified column only. Using below, I'm hoping to multiply every 2nd row in column c by -1.
df = pd.DataFrame({
'a' : [2.0,1.0,3.5,2.0,5.0,3.0,1.0,1.0],
'b' : [1.0,-1.0,3.5,3.0,4.0,2.0,3.0,2.0],
'c' : [2.0,2.0,2.0,2.0,-1.0,-1.0,-2.0,-2.0],
})
df['c'] = df['c'][::2] * -1
Intended output:
a b c
0 2.0 1.0 2.0
1 1.0 -1.0 -2.0
2 3.5 3.5 2.0
3 2.0 3.0 -2.0
4 5.0 4.0 -1.0
5 3.0 2.0 1.0
6 1.0 3.0 -2.0
7 1.0 2.0 2.0
One way using pandas.DataFrame.update:
df.update(df['c'][1::2] * -1)
print(df)
Output:
a b c
0 2.0 1.0 2.0
1 1.0 -1.0 -2.0
2 3.5 3.5 2.0
3 2.0 3.0 -2.0
4 5.0 4.0 -1.0
5 3.0 2.0 1.0
6 1.0 3.0 -2.0
7 1.0 2.0 2.0
Use DataFrame.iloc for slicing with Index.get_loc for position of column c:
df.iloc[1::2, df.columns.get_loc('c')] *= -1
#working same like
#df.iloc[1::2, df.columns.get_loc('c')] = df.iloc[1::2, df.columns.get_loc('c')] * -1
Or use DataFrame.loc with select values in df.index:
df.loc[df.index[1::2], 'c'] *= -1
Or:
df.loc[df.index % 2 == 1, 'c'] *= -1
print (df)
a b c
0 2.0 1.0 2.0
1 1.0 -1.0 -2.0
2 3.5 3.5 2.0
3 2.0 3.0 -2.0
4 5.0 4.0 -1.0
5 3.0 2.0 1.0
6 1.0 3.0 -2.0
7 1.0 2.0 2.0
Or you can write your own function:
def multiple(df):
new_df = pd.DataFrame()
for i in range(0, len(df)):
if i // 2 == 0:
new_row = pd.DataFrame(data = df.iloc[i]*(-1)).T
new_df = new_df.append(new_row, ignore_index=True)
else:
new_row = pd.DataFrame(data = df.iloc[i]).T
new_df = new_df.append(new_row, ignore_index=True)
i+=1
return new_df
You can use this code :
a b c
0 2.0 1.0 2.0
1 1.0 -1.0 2.0
2 3.5 3.5 2.0
3 2.0 3.0 2.0
4 5.0 4.0 -1.0
5 3.0 2.0 -1.0
6 1.0 3.0 -2.0
7 1.0 2.0 -2.0
df.loc[df.index % 2 == 1, "c" ] = df.c * - 1
a b c
0 2.0 1.0 2.0
1 1.0 -1.0 -2.0
2 3.5 3.5 2.0
3 2.0 3.0 -2.0
4 5.0 4.0 -1.0
5 3.0 2.0 1.0
6 1.0 3.0 -2.0
7 1.0 2.0 2.0
You can use divmod with a series:
s = 2*np.arange(len(df))%2 - 1
df["c"] = -df.c*s
a b c
0 2.0 1.0 2.0
1 1.0 -1.0 -2.0
2 3.5 3.5 2.0
3 2.0 3.0 -2.0
4 5.0 4.0 -1.0
5 3.0 2.0 1.0
6 1.0 3.0 -2.0
7 1.0 2.0 2.0
I am trying to calculate percentages of first down from a dataframe.
Here is the dataframe
down distance
1 1.0 10.0
2 2.0 13.0
3 3.0 15.0
4 3.0 20.0
5 4.0 1.0
6 1.0 10.0
7 2.0 9.0
8 3.0 3.0
9 1.0 10.0
I would like to calculate the percent from first down, meaning for second down, what is the percent of yards gained. For third down, perc of third based on first.
For example, I would like to have the following output.
down distance percentage
1 1.0 10.0 NaN
2 2.0 13.0 (13-10)/13
3 3.0 15.0 (15-10)/15
4 3.0 20.0 (20-10)/20
5 4.0 1.0 (1-10)/20
6 1.0 10.0 NaN # New calculation
7 2.0 9.0 (9-10)/9
8 3.0 3.0 (3-10)/3
9 1.0 10.0 NaN
Thanks
Current solutions all work correctly for the first question.
Here's a vectorised solution:
# define condition
cond = df['down'] == 1
# calculate value to subtract
first = df['distance'].where(cond).ffill().mask(cond)
# perform calculation
df['percentage'] = (df['distance'] - first) / df['distance']
print(df)
down distance percentage
1 1.0 10.0 NaN
2 2.0 13.0 0.230769
3 3.0 15.0 0.333333
4 3.0 20.0 0.500000
5 4.0 1.0 -9.000000
6 1.0 10.0 NaN
7 2.0 9.0 -0.111111
8 3.0 3.0 -2.333333
9 1.0 10.0 NaN
Using groupby and transform:
s = df.groupby(df.down.eq(1).cumsum()).distance.transform('first')
s = df.distance.sub(s).div(df.distance)
df['percentage'] = s.mask(s.eq(0))
down distance percentage
1 1.0 10.0 NaN
2 2.0 13.0 0.230769
3 3.0 15.0 0.333333
4 3.0 20.0 0.500000
5 4.0 1.0 -9.000000
6 1.0 10.0 NaN
7 2.0 9.0 -0.111111
8 3.0 3.0 -2.333333
9 1.0 10.0 NaN
With Numpy Bits
Should be pretty zippy!
m = df.down.values == 1 # mask where equal to 1
i = np.flatnonzero(m) # positions where equal to 1
d = df.distance.values # Numpy array of distances
j = np.diff(np.append(i, len(df))) # use diff to find distances between
# values equal to 1. Note that I append
# the length of the df as a terminal value
k = i.repeat(j) # I repeat the positions where equal to 1
# a number of times in order to fill in.
p = np.where(m, np.nan, 1 - d[k] / d) # reduction of % formula while masking
df.assign(percentage=p)
down distance percentage
1 1.0 10.0 NaN
2 2.0 13.0 0.230769
3 3.0 15.0 0.333333
4 3.0 20.0 0.500000
5 4.0 1.0 -9.000000
6 1.0 10.0 NaN
7 2.0 9.0 -0.111111
8 3.0 3.0 -2.333333
9 1.0 10.0 NaN
use groupby to group by each time down is equal to 1, than transform with your desired calculation. Then you can find where down is 1 again, and convert to NaN (as the calculation is meaningless there, as per your example):
df['percentage'] = (df.groupby(df.down.eq(1).cumsum())['distance']
.transform(lambda x: (x-x.iloc[0])/x))
df.loc[df.down.eq(1),'percentage'] = np.nan
>>> df
down distance percentage
1 1.0 10.0 NaN
2 2.0 13.0 0.230769
3 3.0 15.0 0.333333
4 3.0 20.0 0.500000
5 4.0 1.0 -9.000000
6 1.0 10.0 NaN
7 2.0 9.0 -0.111111
8 3.0 3.0 -2.333333
9 1.0 10.0 NaN
I couldn't find an efficient away of doing that.
I have below DataFrame in Python with columns from A to Z
A B C ... Z
0 2.0 8.0 1.0 ... 5.0
1 3.0 9.0 0.0 ... 4.0
2 4.0 9.0 0.0 ... 3.0
3 5.0 8.0 1.0 ... 2.0
4 6.0 8.0 0.0 ... 1.0
5 7.0 9.0 1.0 ... 0.0
I need to multiply each of the columns from B to Z by A, (B x A, C x A, ..., Z x A), and save the results on new columns (R1, R2 ..., R25).
I would have something like this:
A B C ... Z R1 R2 ... R25
0 2.0 8.0 1.0 ... 5.0 16.0 2.0 ... 10.0
1 3.0 9.0 0.0 ... 4.0 27.0 0.0 ... 12.0
2 4.0 9.0 0.0 ... 3.0 36.0 0.0 ... 12.0
3 5.0 8.0 1.0 ... 2.0 40.0 5.0 ... 10.0
4 6.0 8.0 0.0 ... 1.0 48.0 0.0 ... 6.0
5 7.0 9.0 1.0 ... 0.0 63.0 7.0 ... 0.0
I was able to calculate the results using below code, but from here I would need to merge with original df. Doesn't sound efficient. There must be a simple/clean way of doing that.
df.loc[:,'B':'D'].multiply(df['A'], axis="index")
That's an example, my real DataFrame has 160 columns x 16k rows.
Create new columns names by list comprehension and then join to original:
df1 = df.loc[:,'B':'D'].multiply(df['A'], axis="index")
df1.columns = ['R{}'.format(x) for x in range(1, len(df1.columns) + 1)]
df = df.join(df1)
print (df)
A B C Z R1 R2
0 2.0 8.0 1.0 5.0 16.0 2.0
1 3.0 9.0 0.0 4.0 27.0 0.0
2 4.0 9.0 0.0 3.0 36.0 0.0
3 5.0 8.0 1.0 2.0 40.0 5.0
4 6.0 8.0 0.0 1.0 48.0 0.0
5 7.0 9.0 1.0 0.0 63.0 7.0
I'm trying to replace NaN values in my dataframe with means from the same row.
sample_df = pd.DataFrame({'A':[1.0,np.nan,5.0],
'B':[1.0,4.0,5.0],
'C':[1.0,1.0,4.0],
'D':[6.0,5.0,5.0],
'E':[1.0,1.0,4.0],
'F':[1.0,np.nan,4.0]})
sample_mean = sample_df.apply(lambda x: np.mean(x.dropna().values.tolist()) ,axis=1)
Produces:
0 1.833333
1 2.750000
2 4.500000
dtype: float64
But when I try to use fillna() to fill the missing dataframe values with values from the series, it doesn't seem to work.
sample_df.fillna(sample_mean, inplace=True)
A B C D E F
0 1.0 1.0 1.0 6.0 1.0 1.0
1 NaN 4.0 1.0 5.0 1.0 NaN
2 5.0 5.0 4.0 5.0 4.0 4.0
What I expect is:
A B C D E F
0 1.0 1.0 1.0 6.0 1.0 1.0
1 2.75 4.0 1.0 5.0 1.0 2.75
2 5.0 5.0 4.0 5.0 4.0 4.0
I've reviewed the other similar questions and can't seem to uncover the issue. Thanks in advance for your help.
By using pandas
sample_df.T.fillna(sample_df.T.mean()).T
Out[1284]:
A B C D E F
0 1.00 1.0 1.0 6.0 1.0 1.00
1 2.75 4.0 1.0 5.0 1.0 2.75
2 5.00 5.0 4.0 5.0 4.0 4.00
Here's one way -
sample_df[:] = np.where(np.isnan(sample_df), sample_df.mean(1)[:,None], sample_df)
Sample output -
sample_df
Out[61]:
A B C D E F
0 1.00 1.0 1.0 6.0 1.0 1.00
1 2.75 4.0 1.0 5.0 1.0 2.75
2 5.00 5.0 4.0 5.0 4.0 4.00
Another pandas way:
>>> sample_df.where(pd.notnull(sample_df), sample_df.mean(axis=1), axis='rows')
A B C D E F
0 1.00 1.0 1.0 6.0 1.0 1.00
1 2.75 4.0 1.0 5.0 1.0 2.75
2 5.00 5.0 4.0 5.0 4.0 4.00
An if condition is True is in operation here: Where elements of pd.notnull(sample_df) are True use the corresponding elements from sample_df else use the elements from sample_df.mean(axis=1) and perform this logic along axis='rows'.