I'm trying to multiply every 2nd row by -1 but in a specified column only. Using below, I'm hoping to multiply every 2nd row in column c by -1.
df = pd.DataFrame({
'a' : [2.0,1.0,3.5,2.0,5.0,3.0,1.0,1.0],
'b' : [1.0,-1.0,3.5,3.0,4.0,2.0,3.0,2.0],
'c' : [2.0,2.0,2.0,2.0,-1.0,-1.0,-2.0,-2.0],
})
df['c'] = df['c'][::2] * -1
Intended output:
a b c
0 2.0 1.0 2.0
1 1.0 -1.0 -2.0
2 3.5 3.5 2.0
3 2.0 3.0 -2.0
4 5.0 4.0 -1.0
5 3.0 2.0 1.0
6 1.0 3.0 -2.0
7 1.0 2.0 2.0
One way using pandas.DataFrame.update:
df.update(df['c'][1::2] * -1)
print(df)
Output:
a b c
0 2.0 1.0 2.0
1 1.0 -1.0 -2.0
2 3.5 3.5 2.0
3 2.0 3.0 -2.0
4 5.0 4.0 -1.0
5 3.0 2.0 1.0
6 1.0 3.0 -2.0
7 1.0 2.0 2.0
Use DataFrame.iloc for slicing with Index.get_loc for position of column c:
df.iloc[1::2, df.columns.get_loc('c')] *= -1
#working same like
#df.iloc[1::2, df.columns.get_loc('c')] = df.iloc[1::2, df.columns.get_loc('c')] * -1
Or use DataFrame.loc with select values in df.index:
df.loc[df.index[1::2], 'c'] *= -1
Or:
df.loc[df.index % 2 == 1, 'c'] *= -1
print (df)
a b c
0 2.0 1.0 2.0
1 1.0 -1.0 -2.0
2 3.5 3.5 2.0
3 2.0 3.0 -2.0
4 5.0 4.0 -1.0
5 3.0 2.0 1.0
6 1.0 3.0 -2.0
7 1.0 2.0 2.0
Or you can write your own function:
def multiple(df):
new_df = pd.DataFrame()
for i in range(0, len(df)):
if i // 2 == 0:
new_row = pd.DataFrame(data = df.iloc[i]*(-1)).T
new_df = new_df.append(new_row, ignore_index=True)
else:
new_row = pd.DataFrame(data = df.iloc[i]).T
new_df = new_df.append(new_row, ignore_index=True)
i+=1
return new_df
You can use this code :
a b c
0 2.0 1.0 2.0
1 1.0 -1.0 2.0
2 3.5 3.5 2.0
3 2.0 3.0 2.0
4 5.0 4.0 -1.0
5 3.0 2.0 -1.0
6 1.0 3.0 -2.0
7 1.0 2.0 -2.0
df.loc[df.index % 2 == 1, "c" ] = df.c * - 1
a b c
0 2.0 1.0 2.0
1 1.0 -1.0 -2.0
2 3.5 3.5 2.0
3 2.0 3.0 -2.0
4 5.0 4.0 -1.0
5 3.0 2.0 1.0
6 1.0 3.0 -2.0
7 1.0 2.0 2.0
You can use divmod with a series:
s = 2*np.arange(len(df))%2 - 1
df["c"] = -df.c*s
a b c
0 2.0 1.0 2.0
1 1.0 -1.0 -2.0
2 3.5 3.5 2.0
3 2.0 3.0 -2.0
4 5.0 4.0 -1.0
5 3.0 2.0 1.0
6 1.0 3.0 -2.0
7 1.0 2.0 2.0
Related
I have the following dataframe, with cumulative results quarter by quarter and resets at 1°Q.
I need the Quarter net variation, so I need to subtract column over column except the ones with 1°Q.
from pandas import DataFrame
data = {'Financials': ['EPS','Earnings','Sales','Margin'],
'1°Q19': [1,2,3,4],
'2°Q19': [2,4,6,8],
'3°Q19': [3,6,9,12],
'4°Q19': [4,8,12,16],
'1°Q20': [1,2,3,4],
'2°Q20': [2,4,6,8],
'3°Q20': [3,6,9,12],
'4°Q20': [4,8,12,16]
}
df = DataFrame(data,columns=['Financials','1°Q19','2°Q19','3°Q19','4°Q19',
'1°Q20','2°Q20','3°Q20','4°Q20'])
print(df)
Financials 1°Q19 2°Q19 3°Q19 4°Q19 1°Q20 2°Q20 3°Q20 4°Q20
0 EPS 1 2 3 4 1 2 3 4
1 Earnings 2 4 6 8 2 4 6 8
2 Sales 3 6 9 12 3 6 9 12
3 Margin 4 8 12 16 4 8 12 16
I've started like this and then I got stuck big time:
if ~df.columns.str.contains('1°Q'):
# here I want to substract (1°Q remains unchanged), 2°Q - 1°Q, 3°Q - 2°Q, 4°Q - 3°Q
In order to get this desired result:
Financials 1°Q19 2°Q19 3°Q19 4°Q19 1°Q20 2°Q20 3°Q20 4°Q20
0 EPS 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1 Earnings 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
2 Sales 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
3 Margin 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0
I've tried
new_df = df.diff(axis=1).fillna(df)
print(new_df)
But the result in this case is not the desired one for de 1°Q20:
Financials 1°Q19 2°Q19 3°Q19 4°Q19 1°Q20 2°Q20 3°Q20 4°Q20
0 EPS 1.0 1.0 1.0 1.0 -3.0 1.0 1.0 1.0
1 Earnings 2.0 2.0 2.0 2.0 -6.0 2.0 2.0 2.0
2 Sales 3.0 3.0 3.0 3.0 -9.0 3.0 3.0 3.0
3 Margin 4.0 4.0 4.0 4.0 -12.0 4.0 4.0 4.0
IIUC, DataFrame.diff with axis=1 and then fill NaN with
DataFrame.fillna
new_df = df.diff(axis=1).fillna(df)
print(new_df)
Financials 1°Q 2°Q 3°Q 4°Q
0 EPS 1.0 1.0 1.0 1.0
1 Earnings 2.0 2.0 2.0 2.0
2 Sales 3.0 3.0 3.0 3.0
3 Margin 4.0 4.0 4.0 4.0
for expected output:
new_df = new_df.astype(int)
EDIT
df.groupby(df.columns.str.contains('1°Q').cumsum(),axis=1).diff(axis=1).fillna(df)
Financials 1°Q19 2°Q19 3°Q19 4°Q19 1°Q20 2°Q20 3°Q20 4°Q20
0 EPS 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1 Earnings 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
2 Sales 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
3 Margin 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0
or
df.diff(axis=1).T.mask(df.columns.to_series().str.contains('1°Q')).T.fillna(df)
You can leverage df.shift for the subtraction, and fillna to fix the NaN values left from the shift
df=df.set_index('Financials')
df-(df.shift(1, axis=1).fillna(0))
1°Q 2°Q 3°Q 4°Q
Financials
EPS 1.0 1.0 1.0 1.0
Earnings 2.0 2.0 2.0 2.0
Sales 3.0 3.0 3.0 3.0
Margin 4.0 4.0 4.0 4.0
Say I have the following pandas dataframe:
df = pd.DataFrame([[3, 2, np.nan, 0],
[5, 4, 2, np.nan],
[7, np.nan, np.nan, 5],
[9, 3, np.nan, 4]],
columns=list('ABCD'))
which returns this:
A B C D
0 3 2.0 NaN 0.0
1 5 4.0 2.0 NaN
2 7 NaN NaN 5.0
3 9 3.0 NaN 4.0
I'd like that if a np.nan is found, that the value is replaced by a value in the A column. So that would mean the result to be this:
A B C D
0 3 2.0 3.0 0.0
1 5 4.0 2.0 5.0
2 7 7.0 7.0 5.0
3 9 3.0 9.0 4.0
I've tried multiple things, but I could not get anything to work. Can anyone help?
Here is necessary double transpose:
cols = ['B','C', 'D']
df[cols] = df[cols].T.fillna(df['A']).T
print(df)
A B C D
0 3 2.0 3.0 0.0
1 5 4.0 2.0 5.0
2 7 7.0 7.0 5.0
3 9 3.0 9.0 4.0
because:
df[cols] = df[cols].fillna(df['A'], axis=1)
print(df)
NotImplementedError: Currently only can fill with dict/Series column by column
Another solution with numpy.where and broadcasting column A:
df = pd.DataFrame(np.where(df.isnull(), df['A'].values[:, None], df),
index=df.index,
columns=df.columns)
print (df)
A B C D
0 3.0 2.0 3.0 0.0
1 5.0 4.0 2.0 5.0
2 7.0 7.0 7.0 5.0
3 9.0 3.0 9.0 4.0
Thank you #pir for another solution:
df = pd.DataFrame(np.where(df.isnull(), df[['A']], df),
index=df.index,
columns=df.columns)
Currently, fillna doesn't allow for broadcasting a series across columns while aligning the indices.
pandas.DataFrame.mask
This functions exactly like what we'd want fillna to do. Finds the the nulls, fills it in with df.A along axis=0
df.mask(df.isna(), df.A, axis=0)
A B C D
0 3 2.0 3.0 0.0
1 5 4.0 2.0 5.0
2 7 7.0 7.0 5.0
3 9 3.0 9.0 4.0
pandas.DataFrame.fillna using a dictionary
However, you can pass a dictionary to fillna that tells it what to do for each column.
df.fillna({k: df.A for k in df})
A B C D
0 3 2.0 3.0 0.0
1 5 4.0 2.0 5.0
2 7 7.0 7.0 5.0
3 9 3.0 9.0 4.0
DO fillna with reindex
df.fillna(df[['A']].reindex(columns=df.columns).ffill(1))
Out[20]:
A B C D
0 3 2.0 3.0 0.0
1 5 4.0 2.0 5.0
2 7 7.0 7.0 5.0
3 9 3.0 9.0 4.0
Or combine_first
df.combine_first(df.fillna(0).add(df.A,0))
Out[35]:
A B C D
0 3 2.0 3.0 0.0
1 5 4.0 2.0 5.0
2 7 7.0 7.0 5.0
3 9 3.0 9.0 4.0
# for each column...
for col in df.columns:
# I select the np.nan and I replace then with the value of A
df.loc[df[col].isnull(), col] = df["A"]
I am trying to calculate percentages of first down from a dataframe.
Here is the dataframe
down distance
1 1.0 10.0
2 2.0 13.0
3 3.0 15.0
4 3.0 20.0
5 4.0 1.0
6 1.0 10.0
7 2.0 9.0
8 3.0 3.0
9 1.0 10.0
I would like to calculate the percent from first down, meaning for second down, what is the percent of yards gained. For third down, perc of third based on first.
For example, I would like to have the following output.
down distance percentage
1 1.0 10.0 NaN
2 2.0 13.0 (13-10)/13
3 3.0 15.0 (15-10)/15
4 3.0 20.0 (20-10)/20
5 4.0 1.0 (1-10)/20
6 1.0 10.0 NaN # New calculation
7 2.0 9.0 (9-10)/9
8 3.0 3.0 (3-10)/3
9 1.0 10.0 NaN
Thanks
Current solutions all work correctly for the first question.
Here's a vectorised solution:
# define condition
cond = df['down'] == 1
# calculate value to subtract
first = df['distance'].where(cond).ffill().mask(cond)
# perform calculation
df['percentage'] = (df['distance'] - first) / df['distance']
print(df)
down distance percentage
1 1.0 10.0 NaN
2 2.0 13.0 0.230769
3 3.0 15.0 0.333333
4 3.0 20.0 0.500000
5 4.0 1.0 -9.000000
6 1.0 10.0 NaN
7 2.0 9.0 -0.111111
8 3.0 3.0 -2.333333
9 1.0 10.0 NaN
Using groupby and transform:
s = df.groupby(df.down.eq(1).cumsum()).distance.transform('first')
s = df.distance.sub(s).div(df.distance)
df['percentage'] = s.mask(s.eq(0))
down distance percentage
1 1.0 10.0 NaN
2 2.0 13.0 0.230769
3 3.0 15.0 0.333333
4 3.0 20.0 0.500000
5 4.0 1.0 -9.000000
6 1.0 10.0 NaN
7 2.0 9.0 -0.111111
8 3.0 3.0 -2.333333
9 1.0 10.0 NaN
With Numpy Bits
Should be pretty zippy!
m = df.down.values == 1 # mask where equal to 1
i = np.flatnonzero(m) # positions where equal to 1
d = df.distance.values # Numpy array of distances
j = np.diff(np.append(i, len(df))) # use diff to find distances between
# values equal to 1. Note that I append
# the length of the df as a terminal value
k = i.repeat(j) # I repeat the positions where equal to 1
# a number of times in order to fill in.
p = np.where(m, np.nan, 1 - d[k] / d) # reduction of % formula while masking
df.assign(percentage=p)
down distance percentage
1 1.0 10.0 NaN
2 2.0 13.0 0.230769
3 3.0 15.0 0.333333
4 3.0 20.0 0.500000
5 4.0 1.0 -9.000000
6 1.0 10.0 NaN
7 2.0 9.0 -0.111111
8 3.0 3.0 -2.333333
9 1.0 10.0 NaN
use groupby to group by each time down is equal to 1, than transform with your desired calculation. Then you can find where down is 1 again, and convert to NaN (as the calculation is meaningless there, as per your example):
df['percentage'] = (df.groupby(df.down.eq(1).cumsum())['distance']
.transform(lambda x: (x-x.iloc[0])/x))
df.loc[df.down.eq(1),'percentage'] = np.nan
>>> df
down distance percentage
1 1.0 10.0 NaN
2 2.0 13.0 0.230769
3 3.0 15.0 0.333333
4 3.0 20.0 0.500000
5 4.0 1.0 -9.000000
6 1.0 10.0 NaN
7 2.0 9.0 -0.111111
8 3.0 3.0 -2.333333
9 1.0 10.0 NaN
I couldn't find an efficient away of doing that.
I have below DataFrame in Python with columns from A to Z
A B C ... Z
0 2.0 8.0 1.0 ... 5.0
1 3.0 9.0 0.0 ... 4.0
2 4.0 9.0 0.0 ... 3.0
3 5.0 8.0 1.0 ... 2.0
4 6.0 8.0 0.0 ... 1.0
5 7.0 9.0 1.0 ... 0.0
I need to multiply each of the columns from B to Z by A, (B x A, C x A, ..., Z x A), and save the results on new columns (R1, R2 ..., R25).
I would have something like this:
A B C ... Z R1 R2 ... R25
0 2.0 8.0 1.0 ... 5.0 16.0 2.0 ... 10.0
1 3.0 9.0 0.0 ... 4.0 27.0 0.0 ... 12.0
2 4.0 9.0 0.0 ... 3.0 36.0 0.0 ... 12.0
3 5.0 8.0 1.0 ... 2.0 40.0 5.0 ... 10.0
4 6.0 8.0 0.0 ... 1.0 48.0 0.0 ... 6.0
5 7.0 9.0 1.0 ... 0.0 63.0 7.0 ... 0.0
I was able to calculate the results using below code, but from here I would need to merge with original df. Doesn't sound efficient. There must be a simple/clean way of doing that.
df.loc[:,'B':'D'].multiply(df['A'], axis="index")
That's an example, my real DataFrame has 160 columns x 16k rows.
Create new columns names by list comprehension and then join to original:
df1 = df.loc[:,'B':'D'].multiply(df['A'], axis="index")
df1.columns = ['R{}'.format(x) for x in range(1, len(df1.columns) + 1)]
df = df.join(df1)
print (df)
A B C Z R1 R2
0 2.0 8.0 1.0 5.0 16.0 2.0
1 3.0 9.0 0.0 4.0 27.0 0.0
2 4.0 9.0 0.0 3.0 36.0 0.0
3 5.0 8.0 1.0 2.0 40.0 5.0
4 6.0 8.0 0.0 1.0 48.0 0.0
5 7.0 9.0 1.0 0.0 63.0 7.0
I'm trying to replace NaN values in my dataframe with means from the same row.
sample_df = pd.DataFrame({'A':[1.0,np.nan,5.0],
'B':[1.0,4.0,5.0],
'C':[1.0,1.0,4.0],
'D':[6.0,5.0,5.0],
'E':[1.0,1.0,4.0],
'F':[1.0,np.nan,4.0]})
sample_mean = sample_df.apply(lambda x: np.mean(x.dropna().values.tolist()) ,axis=1)
Produces:
0 1.833333
1 2.750000
2 4.500000
dtype: float64
But when I try to use fillna() to fill the missing dataframe values with values from the series, it doesn't seem to work.
sample_df.fillna(sample_mean, inplace=True)
A B C D E F
0 1.0 1.0 1.0 6.0 1.0 1.0
1 NaN 4.0 1.0 5.0 1.0 NaN
2 5.0 5.0 4.0 5.0 4.0 4.0
What I expect is:
A B C D E F
0 1.0 1.0 1.0 6.0 1.0 1.0
1 2.75 4.0 1.0 5.0 1.0 2.75
2 5.0 5.0 4.0 5.0 4.0 4.0
I've reviewed the other similar questions and can't seem to uncover the issue. Thanks in advance for your help.
By using pandas
sample_df.T.fillna(sample_df.T.mean()).T
Out[1284]:
A B C D E F
0 1.00 1.0 1.0 6.0 1.0 1.00
1 2.75 4.0 1.0 5.0 1.0 2.75
2 5.00 5.0 4.0 5.0 4.0 4.00
Here's one way -
sample_df[:] = np.where(np.isnan(sample_df), sample_df.mean(1)[:,None], sample_df)
Sample output -
sample_df
Out[61]:
A B C D E F
0 1.00 1.0 1.0 6.0 1.0 1.00
1 2.75 4.0 1.0 5.0 1.0 2.75
2 5.00 5.0 4.0 5.0 4.0 4.00
Another pandas way:
>>> sample_df.where(pd.notnull(sample_df), sample_df.mean(axis=1), axis='rows')
A B C D E F
0 1.00 1.0 1.0 6.0 1.0 1.00
1 2.75 4.0 1.0 5.0 1.0 2.75
2 5.00 5.0 4.0 5.0 4.0 4.00
An if condition is True is in operation here: Where elements of pd.notnull(sample_df) are True use the corresponding elements from sample_df else use the elements from sample_df.mean(axis=1) and perform this logic along axis='rows'.