I want to fill missing values of a specific column only if a condition is met.
e.g.

     A  B
   NaN  0
   NaN  0
     0  0
   NaN  1
   NaN  1
   ...  ...
In the above case I want to fill the NaN values in column A only when the corresponding value in column B is 0. The remaining NaN values in A (where B is not 0) should not change.
Use mask with fillna:
df['A'] = df['A'].mask(df['B'] == 0, df['A'].fillna(3))
Alternatives with loc, numpy.where:
df.loc[df['B'] == 0, 'A'] = df['A'].fillna(3)
df['A'] = np.where(df['B'] == 0, df['A'].fillna(3), df['A'])
print (df)
A B
0 3.0 0
1 3.0 0
2 0.0 0
3 NaN 1
4 NaN 1
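The fill value 3 above is just a placeholder from the example. If you prefer, the same update can also be written as a single loc assignment on the combined mask (a sketch):

df.loc[df['A'].isna() & (df['B'] == 0), 'A'] = 3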
np.where is a quick and simple solution. Note the parentheses around df['B'] == 0: & binds more tightly than ==, so they are required.
In [47]: df['A'] = np.where(np.isnan(df['A']) & (df['B'] == 0), 3, df['A'])
In [48]: df
Out[48]:
A B
0 3.0 0
1 3.0 0
2 0.0 0
3 NaN 1
4 NaN 1
You could use a loop over all elements, something like this:
for i in range(len(A)):
    if numpy.isnan(A[i]) and B[i] == 0:
        A[i] = value
There are nicer ways to implement these loops, but I don't know what structures you are using.
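If A and B happen to be NumPy arrays, for example, the loop collapses to a single vectorized assignment (a sketch; value is the fill constant from above):

import numpy as np

# Fill A where it is NaN and the corresponding B entry is 0.
A = np.where(np.isnan(A) & (B == 0), value, A)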
Related
The wording of the title may be confusing, but I will explain in the code. Say I have a dataframe df:
In [1]: import pandas as pd
df = pd.DataFrame([[20, 20], [20, 0], [0, 20], [0, 0]], columns=['a', 'b'])
df
Out[1]:
a b
0 20 20
1 20 0
2 0 20
3 0 0
Now I want to create a new dataframe "df_new" based on 2 conditions, for example:
If 'a' is greater than 10, then check 'b'. If 'b' is greater than 5, fill values with NaN or cut out data (doesn't matter). If 'b' is less than 5, return the data.
If 'a' is less than 10, return the data regardless of the value of 'b'.
Here's my attempt with df.where -- it does not return what I would like.
In [2]: df_new = df.where((df['a'] < 10) & (df['b'] < 5))
df_new
Out[2]:
a b
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 0.0 0.0
This is how I would like df_new to look:
Out[3]:
a b
0 NaN NaN
1 20.0 0.0
2 0.0 20.0
3 0.0 0.0
I know df.where is doing exactly what I told it to do, but I am not sure how to make the check on 'b' depend on the value of 'a' with df.where. I am trying to avoid a loop since my actual dataframe is quite large.
Just use this condition (df.a < 10) | (df.b < 5):
df[(df.a < 10) | (df.b < 5)]
a b
1 20 0
2 0 20
3 0 0
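If you prefer the NaN-filled shape from your expected output rather than dropping rows, the same condition works with df.where (a sketch):

df_new = df.where((df['a'] < 10) | (df['b'] < 5))

      a     b
0   NaN   NaN
1  20.0   0.0
2   0.0  20.0
3   0.0   0.0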
Given this df:
Name  i  j  k
A     1  0  3
B     0  5  4
C     0  0  4
D     0     5
My goal is to add a column "Final" that takes values in the order i, j, k:
Name  i  j  k  Final
A     1  0  3  1
B     0  5  4  5
C     0  0  4  4
D     0     5        <-- this one is tricky. We do count the null for the j column here.
Here is my attempt: df['Final'] = df[['i', 'j', 'k']].bfill(axis=1).iloc[:, 0]. This doesn't work, since bfill only fills NaN (not zeros), so it always takes the value of the first column. Any help would be appreciated. :)
Many thanks!
If by "taking values in column order", you mean "taking the first non-zero value in each row, or zero if all values are zero", you could use DataFrame.lookup after doing a boolean comparison:
In [113]: df["final"] = df.lookup(df.index,(df[["i","j","k"]] != 0).idxmax(axis=1))
In [114]: df
Out[114]:
Name i j k final
0 A 1 0.0 3 1.0
1 B 0 5.0 4 5.0
2 C 0 0.0 4 4.0
3 D 0 NaN 5 NaN
where first we compare everything with zero:
In [115]: df[["i","j","k"]] != 0
Out[115]:
i j k
0 True False True
1 False True True
2 False False True
3 False True True
and then we use idxmax to find the first True (or the first False if you have a row of zeroes):
In [116]: (df[["i","j","k"]] != 0).idxmax(axis=1)
Out[116]:
0 i
1 j
2 k
3 j
dtype: object
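Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. On recent versions the same row-wise lookup can be done with NumPy fancy indexing (a sketch under that assumption):

import numpy as np

sub = df[["i", "j", "k"]]
first_nonzero = (sub != 0).idxmax(axis=1)  # column label per row
df["final"] = sub.to_numpy()[np.arange(len(sub)), sub.columns.get_indexer(first_nonzero)]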
Is this what you need?
df['Final'] = df[['i', 'j', 'k']].mask((df == '') | (df == 0)).bfill(axis=1).iloc[:, 0][(df != '').all(1)]
df
Out[1290]:
  Name  i  j  k  Final
0    A  1  0  3    1.0
1    B  0  5  4    5.0
2    C  0  0  4    4.0
3    D  0     5    NaN
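Here mask turns both 0 and empty values into NaN, bfill(axis=1) then pulls the first surviving value in each row into the leftmost column, iloc[:, 0] picks it up, and the trailing [(df != '').all(1)] indexer leaves rows containing an empty cell as NaN.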
Using pandas.Series.nonzero, the solution can be expressed succinctly.
df['Final'] = df.apply(lambda x: x.iloc[x.nonzero()[0][0]], axis=1)
How this works:
nonzero() returns the indices of elements that are not zero (and will match np.nan as well).
We take the first index location and return the value at that location to construct the Final Column.
We apply this on the dataframe using axis=1 to apply it row by row.
A benefit of this approach is that it does not depend on naming the individual columns ['i', 'j', 'k'].
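Note that Series.nonzero was deprecated in pandas 0.24 and removed in 1.0. On recent versions, np.flatnonzero expresses the same idea (a sketch, assuming 'Name' is the index so each row is purely numeric):

import numpy as np

# First value in each row that compares non-zero (NaN also counts as non-zero).
df['Final'] = df.apply(lambda row: row.iloc[np.flatnonzero(row)[0]], axis=1)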
I am trying to change all NaN elements in column b to 1 if column a is not NaN in the same row, e.g. if a == 1 and b == NaN, change b to 1. Here is my code:
raw_data['b'] = ((raw_data['a'], raw_data['b']).apply(condition))

def condition(a, b):
    if a != None and b == None:
        return 1
And I got an AttributeError: 'tuple' object has no attribute 'apply'. What other methods can I use in this situation?
First create a boolean mask by chaining conditions with & using the isnull and notnull functions.
Then there are several possible ways to assign the 1 - with mask, loc or numpy.where:
mask = raw_data['a'].notnull() & raw_data['b'].isnull()
raw_data['b'] = raw_data['b'].mask(mask, 1)
Or:
raw_data.loc[mask, 'b'] = 1
Or:
raw_data['b'] = np.where(mask, 1, raw_data['b'])
Sample:
raw_data = pd.DataFrame({
'a': [1,np.nan, np.nan],
'b': [np.nan, np.nan,2]
})
print (raw_data)
a b
0 1.0 NaN
1 NaN NaN
2 NaN 2.0
mask = raw_data['a'].notnull() & raw_data['b'].isnull()
print (mask)
0 True
1 False
2 False
dtype: bool
raw_data.loc[mask, 'b'] = 1
print (raw_data)
a b
0 1.0 1.0
1 NaN NaN
2 NaN 2.0
EDIT:
If you want to use a custom function (really slow for larger data), you need apply with axis=1 to process row by row:
def condition(x):
    if pd.notnull(x.a) and pd.isnull(x.b):
        return 1
    else:
        return x.b
raw_data['b'] = raw_data.apply(condition, axis=1)
print (raw_data)
a b
0 1.0 1.0
1 NaN NaN
2 NaN 2.0
I want to eliminate all rows that contain certain values (or values in a certain range) from a dataframe with a large number of columns. For example, if I had the following dataframe:
a b
0 1 0
1 2 1
2 3 2
3 0 3
and wanted to remove all rows containing 0, I could use:
a_df[(a_df['a'] != 0) & (a_df['b'] !=0)]
but this becomes a pain when you're dealing with a large number of columns. It could be done as:
for i in a_df.columns.values:
    a_df = a_df[a_df[i] != 0]
but this seems inefficient. Is there a better way to do this?
Just do it for the whole df and call dropna:
In [45]:
df[df != 0].dropna()
Out[45]:
a b
1 2 1
2 3 2
The condition df != 0 produces a boolean mask:
In [47]:
df != 0
Out[47]:
a b
0 True False
1 True True
2 True True
3 False True
When this is combined with the df it produces NaN values where the condition is not met:
In [48]:
df[df != 0]
Out[48]:
a b
0 1 NaN
1 2 1
2 3 2
3 NaN 3
Calling dropna then drops any row with a NaN value.
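Put together, a self-contained version of the above, recreating the question's frame (a sketch):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 0], 'b': [0, 1, 2, 3]})
clean = df[df != 0].dropna()  # keeps only rows 1 and 2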
Here's a variant of EdChum's approach. You could do df != 0 and then use all to make your selector:
>>> (df != 0).all(axis=1)
0 False
1 True
2 True
3 False
dtype: bool
and then use this to select:
>>> df.loc[(df != 0).all(axis=1)]
a b
1 2 1
2 3 2
The advantage of this is that it keeps NaNs if you want, e.g.
>>> df
a b
0 1 0
1 2 NaN
2 3 2
3 0 3
>>> df.loc[(df != 0).all(axis=1)]
a b
1 2 NaN
2 3 2
>>> df[(df != 0)].dropna()
a b
2 3 2
As you mentioned in your question, you may also need to drop rows whose values fall in a certain range or set. You can do that as follows; suppose the values to exclude are 0, 10 and 20:
import pandas as pd

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
mask = frame.applymap(lambda x: x not in [0, 10, 20])
frame[mask.all(axis=1)]
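A vectorized alternative to the applymap round trip is DataFrame.isin (a sketch with the same values):

# Keep rows in which no cell is one of the excluded values.
frame[~frame.isin([0, 10, 20]).any(axis=1)]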
Hard to explain in words but the example should be clear:
df = pd.DataFrame({'x': [0, 1], 'y': [np.nan, 0], 'z': [0, np.nan]}, index=['a', 'b'])
x y z
a 0 NaN 0
b 1 0 NaN
I want to replace all non-NaN values with a '1', if there is a '1' anywhere in that row. Just like this:
x y z
a 0 NaN 0
b 1 1 NaN
This sort of works, but unfortunately it overwrites the NaNs:
df[ df.any(1) ] = 1
x y z
a 0 NaN 0
b 1 1 1
I thought there might be some non-reducing form of any (like cumsum is a non-reducing form of sum), but I can't find anything like that so far...
You could combine a multiplication by zero (to give a frame of zeros that still remembers the NaN locations) with an add on axis=0:
>>> df
x y z
a 0 NaN 0
b 1 0 NaN
>>> (df * 0).add(df.any(1), axis=0)
x y z
a 0 NaN 0
b 1 1 NaN
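An equivalent route is to mask the non-NaN cells with the row-wise any flag; note this sketch assumes mask's axis=0 broadcasting of a Series across rows:

# Replace every non-NaN cell with the row's any() flag; NaNs stay NaN.
df.mask(df.notna(), df.any(axis=1).astype(int), axis=0)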