I'm trying to replace NaN values in my dataframe with means from the same row.
import numpy as np
import pandas as pd

sample_df = pd.DataFrame({'A': [1.0, np.nan, 5.0],
                          'B': [1.0, 4.0, 5.0],
                          'C': [1.0, 1.0, 4.0],
                          'D': [6.0, 5.0, 5.0],
                          'E': [1.0, 1.0, 4.0],
                          'F': [1.0, np.nan, 4.0]})
sample_mean = sample_df.apply(lambda x: np.mean(x.dropna().values.tolist()), axis=1)
Produces:
0 1.833333
1 2.750000
2 4.500000
dtype: float64
But when I try to use fillna() to fill the missing dataframe values with values from the series, it doesn't seem to work.
sample_df.fillna(sample_mean, inplace=True)
A B C D E F
0 1.0 1.0 1.0 6.0 1.0 1.0
1 NaN 4.0 1.0 5.0 1.0 NaN
2 5.0 5.0 4.0 5.0 4.0 4.0
What I expect is:
A B C D E F
0 1.0 1.0 1.0 6.0 1.0 1.0
1 2.75 4.0 1.0 5.0 1.0 2.75
2 5.0 5.0 4.0 5.0 4.0 4.0
I've reviewed the other similar questions and can't seem to uncover the issue. Thanks in advance for your help.
Using pandas: transpose, fill with the (now column-wise) means, and transpose back.
sample_df.T.fillna(sample_df.T.mean()).T
Out[1284]:
A B C D E F
0 1.00 1.0 1.0 6.0 1.0 1.00
1 2.75 4.0 1.0 5.0 1.0 2.75
2 5.00 5.0 4.0 5.0 4.0 4.00
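For what it's worth, the reason the plain fillna call appears to do nothing: when a Series is passed to DataFrame.fillna, its index is matched against the DataFrame's columns, and the row labels 0, 1, 2 never match 'A'..'F'. Transposing first turns the rows into columns so the row means line up. A minimal sketch of the same idea (note that mean(axis=1) already skips NaN, so the apply/dropna step isn't needed):

row_means = sample_df.mean(axis=1)        # NaN is skipped by default
filled = sample_df.T.fillna(row_means).T  # means now align with the transposed columns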
Here's one way -
sample_df[:] = np.where(np.isnan(sample_df), sample_df.mean(1)[:,None], sample_df)
Sample output -
sample_df
Out[61]:
A B C D E F
0 1.00 1.0 1.0 6.0 1.0 1.00
1 2.75 4.0 1.0 5.0 1.0 2.75
2 5.00 5.0 4.0 5.0 4.0 4.00
Another pandas way:
>>> sample_df.where(pd.notnull(sample_df), sample_df.mean(axis=1), axis='rows')
A B C D E F
0 1.00 1.0 1.0 6.0 1.0 1.00
1 2.75 4.0 1.0 5.0 1.0 2.75
2 5.00 5.0 4.0 5.0 4.0 4.00
The logic is an element-wise if/else: where the elements of pd.notnull(sample_df) are True, keep the corresponding elements of sample_df; otherwise use the elements from sample_df.mean(axis=1), aligned along axis='rows'.
Related
I want to multiply column values by a specific scalar based on the name of the column:
if the column name is "Math", all the values in the "Math" column should be multiplied by 5;
if the column name is "Physique", the values in that column should be multiplied by 4;
if the column name is "Bio", the values in that column should be multiplied by 3;
all the remaining columns should be multiplied by 2.
What I have and what I should get were posted as screenshots of the grades table (not reproduced here).
listm = ['Math', 'Physique', 'Bio']

def note_coef(row):
    for m in listm:
        if 'Math' in listm:
            result = df['Math']*5
    return result

df2 = df.apply(note_coef)
df2
Note that I stopped with only one if to test my code, but the outcome is not what I expected. I am quite new to programming and to this site.
I think the most elegant solution is to define a dictionary (or a pandas.Series) with the multiplying factor for each column of your DataFrame (factors). Then you can multiply all the columns with the corresponding factor simply using df *= factors.
The multiplication is done via column axis alignment, i.e. by aligning the df.columns with the dictionary keys.
For instance, given the following DataFrame
import pandas as pd
import numpy as np
df = pd.DataFrame(np.ones(shape=(4, 5)), columns=['Math', 'Physique', 'Bio', 'Algo', 'Archi'])
>>> df
Math Physique Bio Algo Archi
0 1.0 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0 1.0
3 1.0 1.0 1.0 1.0 1.0
You can do:
factors = {'Math': 5, 'Physique': 4, 'Bio': 3}
default_factor = 2
factors.update({col: default_factor for col in df.columns if col not in factors})
df *= factors
print(df)
Output:
Math Physique Bio Algo Archi
0 5.0 4.0 3.0 2.0 2.0
1 5.0 4.0 3.0 2.0 2.0
2 5.0 4.0 3.0 2.0 2.0
3 5.0 4.0 3.0 2.0 2.0
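The pandas.Series variant mentioned above works the same way; a sketch using reindex so every column not listed falls back to the default factor:

factors = pd.Series({'Math': 5, 'Physique': 4, 'Bio': 3})
df *= factors.reindex(df.columns, fill_value=default_factor)  # default_factor = 2, as above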
Fake data
n=5
d = {'a':np.ones(n),
'b':np.ones(n),
'c':np.ones(n),
'd':np.ones(n)}
df = pd.DataFrame(d)
print(df)
Select the columns and multiply by a tuple.
df[['a','c']] = df[['a','c']] * (2,4)
print(df)
a b c d
0 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0
3 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0
a b c d
0 2.0 1.0 4.0 1.0
1 2.0 1.0 4.0 1.0
2 2.0 1.0 4.0 1.0
3 2.0 1.0 4.0 1.0
4 2.0 1.0 4.0 1.0
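Note that the tuple is applied positionally, so it has to match the order of the selected columns. If you would rather tie each factor to a column name, a Series keyed by the labels does the same job (a small sketch):

df[['a', 'c']] = df[['a', 'c']] * pd.Series({'a': 2, 'c': 4})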
You can use df['col_name'].multiply(value) to apply a factor to a whole column. The remaining columns can then be handled in a loop over all columns not in listm.
listm = ['Math', 'Physique', 'Bio']
for i, head in enumerate(listm):
    df[head] = df[head].multiply(5 - i)   # Math*5, Physique*4, Bio*3

for head in df.columns:                   # every other column gets *2
    if head not in listm:
        df[head] = df[head].multiply(2)
Here is another way to do it, using array multiplication.
The data was not provided as text, so I created test data following the pattern of the screenshot.
mul = [5, 4, 3, 2, 2, 2, 2, 1]              # one multiplier per column after 'source'; total kept as-is
df1 = df.iloc[:, 1:].mul(mul)               # multiply the grade columns
df1['total'] = df1.iloc[:, :7].sum(axis=1)  # recompute total from the weighted grades
df.update(df1, join='left', overwrite=True) # write the results back into df
df
source Math Physics Bio Algo Archi Sport eng total
0 A 50.0 60.0 60.0 50.0 60.0 70.0 80.0 430.0
1 B 55.0 64.0 63.0 52.0 62.0 72.0 82.0 450.0
2 C 5.5 8.4 9.3 NaN NaN NaN NaN 23.2
3 D NaN NaN NaN 22.0 42.0 62.0 82.0 208.0
4 E 6.0 8.8 9.6 NaN NaN NaN NaN 24.4
5 F NaN NaN NaN 24.0 44.0 64.0 84.0 216.0
TEST DATA
data_out = [
    ['A', 10, 15, 20, 25, 30, 35, 40],
    ['B', 11, 16, 21, 26, 31, 36, 41],
    ['C', 1.1, 2.1, 3.1],
    ['D', np.nan, np.nan, np.nan, 11, 21, 31, 41],
    ['E', 1.2, 2.2, 3.2],
    ['F', np.nan, np.nan, np.nan, 12, 22, 32, 42],
]
df = pd.DataFrame(data_out, columns=['source', 'Math', 'Physics', 'Bio', 'Algo', 'Archi', 'Sport', 'eng'])
df['total'] = df.iloc[:, 1:].sum(axis=1)
source Math Physics Bio Algo Archi Sport eng total
0 A 10.0 15.0 20.0 25.0 30.0 35.0 40.0 175.0
1 B 11.0 16.0 21.0 26.0 31.0 36.0 41.0 182.0
2 C 1.1 2.1 3.1 NaN NaN NaN NaN 6.3
3 D NaN NaN NaN 11.0 21.0 31.0 41.0 104.0
4 E 1.2 2.2 3.2 NaN NaN NaN NaN 6.6
5 F NaN NaN NaN 12.0 22.0 32.0 42.0 108.0
I have the following dataframe, with cumulative results quarter by quarter that reset at each 1°Q.
I need the quarter-over-quarter net variation, so I need to subtract each column from the previous one, except the 1°Q columns, which should stay unchanged.
from pandas import DataFrame
data = {'Financials': ['EPS','Earnings','Sales','Margin'],
'1°Q19': [1,2,3,4],
'2°Q19': [2,4,6,8],
'3°Q19': [3,6,9,12],
'4°Q19': [4,8,12,16],
'1°Q20': [1,2,3,4],
'2°Q20': [2,4,6,8],
'3°Q20': [3,6,9,12],
'4°Q20': [4,8,12,16]
}
df = DataFrame(data,columns=['Financials','1°Q19','2°Q19','3°Q19','4°Q19',
'1°Q20','2°Q20','3°Q20','4°Q20'])
print(df)
Financials 1°Q19 2°Q19 3°Q19 4°Q19 1°Q20 2°Q20 3°Q20 4°Q20
0 EPS 1 2 3 4 1 2 3 4
1 Earnings 2 4 6 8 2 4 6 8
2 Sales 3 6 9 12 3 6 9 12
3 Margin 4 8 12 16 4 8 12 16
I've started like this and then I got stuck big time:
if ~df.columns.str.contains('1°Q'):
# here I want to substract (1°Q remains unchanged), 2°Q - 1°Q, 3°Q - 2°Q, 4°Q - 3°Q
In order to get this desired result:
Financials 1°Q19 2°Q19 3°Q19 4°Q19 1°Q20 2°Q20 3°Q20 4°Q20
0 EPS 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1 Earnings 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
2 Sales 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
3 Margin 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0
I've tried
new_df = df.diff(axis=1).fillna(df)
print(new_df)
But the result in this case is not the desired one for the 1°Q20 column:
Financials 1°Q19 2°Q19 3°Q19 4°Q19 1°Q20 2°Q20 3°Q20 4°Q20
0 EPS 1.0 1.0 1.0 1.0 -3.0 1.0 1.0 1.0
1 Earnings 2.0 2.0 2.0 2.0 -6.0 2.0 2.0 2.0
2 Sales 3.0 3.0 3.0 3.0 -9.0 3.0 3.0 3.0
3 Margin 4.0 4.0 4.0 4.0 -12.0 4.0 4.0 4.0
IIUC, use DataFrame.diff with axis=1 and then fill the NaN with DataFrame.fillna:
new_df = df.diff(axis=1).fillna(df)
print(new_df)
Financials 1°Q 2°Q 3°Q 4°Q
0 EPS 1.0 1.0 1.0 1.0
1 Earnings 2.0 2.0 2.0 2.0
2 Sales 3.0 3.0 3.0 3.0
3 Margin 4.0 4.0 4.0 4.0
for expected output:
new_df = new_df.astype(int)
EDIT
df.groupby(df.columns.str.contains('1°Q').cumsum(),axis=1).diff(axis=1).fillna(df)
Financials 1°Q19 2°Q19 3°Q19 4°Q19 1°Q20 2°Q20 3°Q20 4°Q20
0 EPS 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1 Earnings 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
2 Sales 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
3 Margin 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0
or
df.diff(axis=1).T.mask(df.columns.to_series().str.contains('1°Q')).T.fillna(df)
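Why the groupby version resets at each year: the cumulative sum of the '1°Q' mask gives one group label per year, so the diff runs within each year and every 1°Q column comes back as NaN, which the trailing fillna(df) restores. A quick way to inspect the labels:

labels = df.columns.str.contains('1°Q').cumsum()
# Financials -> 0, 1°Q19..4°Q19 -> 1, 1°Q20..4°Q20 -> 2
print(dict(zip(df.columns, labels)))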
You can leverage df.shift for the subtraction, and fillna to fix the NaN values left by the shift:
df=df.set_index('Financials')
df-(df.shift(1, axis=1).fillna(0))
1°Q 2°Q 3°Q 4°Q
Financials
EPS 1.0 1.0 1.0 1.0
Earnings 2.0 2.0 2.0 2.0
Sales 3.0 3.0 3.0 3.0
Margin 4.0 4.0 4.0 4.0
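If you also need the 1°Q20 reset from the question, a sketch that keeps the shift idea but puts the 1°Q columns back afterwards (assuming 'Financials' has already been set as the index, as above):

out = df - df.shift(1, axis=1).fillna(0)
q1 = df.columns.str.contains('1°Q')   # mask for the reset quarters
out.loc[:, q1] = df.loc[:, q1]        # 1°Q columns stay unchanged
out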
When I try to use fillna to replace the NaNs in a column with its mean, the column changes from float64 to object, showing:
bound method Series.mean of 0 NaN\n1
Here is the code:
mean = df['texture_mean'].mean
df['texture_mean'] = df['texture_mean'].fillna(mean)
You cannot use mean = df['texture_mean'].mean. Without the parentheses this assigns the bound method object itself rather than its result, and fillna then writes that object into the column, which is why the dtype turns into object and you see bound method Series.mean.... Call the method instead: df['texture_mean'].mean(). The following code will work -
df=pd.DataFrame({'texture_mean':[2,4,None,6,1,None],'A':[1,2,3,4,5,None]}) # Example
df
A texture_mean
0 1.0 2.0
1 2.0 4.0
2 3.0 NaN
3 4.0 6.0
4 5.0 1.0
5 NaN NaN
df['texture_mean']=df['texture_mean'].fillna(df['texture_mean'].mean())
df
A texture_mean
0 1.0 2.00
1 2.0 4.00
2 3.0 3.25
3 4.0 6.00
4 5.0 1.00
5 NaN 3.25
In case you want to replace all the NaNs with the respective means of that column in all columns, then just do this -
df=df.fillna(df.mean())
df
A texture_mean
0 1.0 2.00
1 2.0 4.00
2 3.0 3.25
3 4.0 6.00
4 5.0 1.00
5 3.0 3.25
Let me know if this is what you want.
I couldn't find an efficient way of doing this.
I have the below DataFrame in Python, with columns from A to Z:
A B C ... Z
0 2.0 8.0 1.0 ... 5.0
1 3.0 9.0 0.0 ... 4.0
2 4.0 9.0 0.0 ... 3.0
3 5.0 8.0 1.0 ... 2.0
4 6.0 8.0 0.0 ... 1.0
5 7.0 9.0 1.0 ... 0.0
I need to multiply each of the columns from B to Z by A (B x A, C x A, ..., Z x A) and save the results in new columns (R1, R2, ..., R25).
I would have something like this:
A B C ... Z R1 R2 ... R25
0 2.0 8.0 1.0 ... 5.0 16.0 2.0 ... 10.0
1 3.0 9.0 0.0 ... 4.0 27.0 0.0 ... 12.0
2 4.0 9.0 0.0 ... 3.0 36.0 0.0 ... 12.0
3 5.0 8.0 1.0 ... 2.0 40.0 5.0 ... 10.0
4 6.0 8.0 0.0 ... 1.0 48.0 0.0 ... 6.0
5 7.0 9.0 1.0 ... 0.0 63.0 7.0 ... 0.0
I was able to calculate the results using the code below, but from here I would need to merge it back with the original df, which doesn't sound efficient. There must be a simpler/cleaner way of doing this.
df.loc[:,'B':'D'].multiply(df['A'], axis="index")
That's just an example; my real DataFrame has 160 columns x 16k rows.
Create the new column names with a list comprehension and then join to the original:
df1 = df.loc[:,'B':'D'].multiply(df['A'], axis="index")
df1.columns = ['R{}'.format(x) for x in range(1, len(df1.columns) + 1)]
df = df.join(df1)
print (df)
A B C Z R1 R2
0 2.0 8.0 1.0 5.0 16.0 2.0
1 3.0 9.0 0.0 4.0 27.0 0.0
2 4.0 9.0 0.0 3.0 36.0 0.0
3 5.0 8.0 1.0 2.0 40.0 5.0
4 6.0 8.0 0.0 1.0 48.0 0.0
5 7.0 9.0 1.0 0.0 63.0 7.0
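For the real frame in the question (columns B through Z), the same pattern applies, just with a wider slice (a sketch, assuming the columns are laid out as described):

df1 = df.loc[:, 'B':'Z'].multiply(df['A'], axis='index')
df1.columns = ['R{}'.format(x) for x in range(1, len(df1.columns) + 1)]
df = df.join(df1)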
I have a dataframe with 5 columns indexed by date. I would like to normalize these data series by the first item in their lists.
A B C D E
1/1/2017 3 4 1 2 3
1/2/2017 7 4 4 3 3
1/3/2017 2 5 5 4 3
1/4/2017 2 5 3 6 3
1/5/2017 2 2 2 6 6
For example, in column A I would like to divide everything by 3, the first item in that column; the same for columns B to E.
Thanks for your help!
In [100]: df.div(df.iloc[0])
Out[100]:
A B C D E
1/1/2017 1.000000 1.00 1.0 1.0 1.0
1/2/2017 2.333333 1.00 4.0 1.5 1.0
1/3/2017 0.666667 1.25 5.0 2.0 1.0
1/4/2017 0.666667 1.25 3.0 3.0 1.0
1/5/2017 0.666667 0.50 2.0 3.0 2.0
or
In [101]: df / df.iloc[0]
Out[101]:
A B C D E
1/1/2017 1.000000 1.00 1.0 1.0 1.0
1/2/2017 2.333333 1.00 4.0 1.5 1.0
1/3/2017 0.666667 1.25 5.0 2.0 1.0
1/4/2017 0.666667 1.25 3.0 3.0 1.0
1/5/2017 0.666667 0.50 2.0 3.0 2.0
Using div:
df.div(df.iloc[0,:],1)
Out[496]:
A B C D E
1/1/2017 1.000000 1.00 1.0 1.0 1.0
1/2/2017 2.333333 1.00 4.0 1.5 1.0
1/3/2017 0.666667 1.25 5.0 2.0 1.0
1/4/2017 0.666667 1.25 3.0 3.0 1.0
1/5/2017 0.666667 0.50 2.0 3.0 2.0