I am trying to change every NaN element in column b to 1 if column a is not NaN in the same row, e.g. if a == 1 and b == NaN, change b to 1. Here is my code:
def condition(a, b):
    if a != None and b == None:
        return 1

raw_data['b'] = (raw_data['a'], raw_data['b']).apply(condition)
And I got an AttributeError: 'tuple' object has no attribute 'apply'. What other methods can I use in this situation?
(raw_data['a'], raw_data['b']) creates a plain Python tuple, and tuples have no apply method. Instead, first create a boolean mask by chaining conditions with & using the functions isnull and notnull.
Then there are several possible ways to assign the 1 - with mask, loc or numpy.where:
mask = raw_data['a'].notnull() & raw_data['b'].isnull()
raw_data['b'] = raw_data['b'].mask(mask, 1)
Or:
raw_data.loc[mask, 'b'] = 1
Or:
raw_data['b'] = np.where(mask, 1, raw_data['b'])
Sample:
raw_data = pd.DataFrame({
    'a': [1, np.nan, np.nan],
    'b': [np.nan, np.nan, 2]
})
print (raw_data)
a b
0 1.0 NaN
1 NaN NaN
2 NaN 2.0
mask = raw_data['a'].notnull() & raw_data['b'].isnull()
print (mask)
0 True
1 False
2 False
dtype: bool
raw_data.loc[mask, 'b'] = 1
print (raw_data)
a b
0 1.0 1.0
1 NaN NaN
2 NaN 2.0
EDIT:
If you want to use a custom function (really slow on larger data), you need apply with axis=1 to process row by row:
def condition(x):
    if pd.notnull(x.a) and pd.isnull(x.b):
        return 1
    else:
        return x.b
raw_data['b'] = raw_data.apply(condition, axis=1)
print (raw_data)
a b
0 1.0 1.0
1 NaN NaN
2 NaN 2.0
Related
I have a table:
A  B   C
x  1   NA
y  NA  4
z  2   NA
p  NA  5
t  6   7
I want to create a new column D which should combine columns B and C if one of the columns is empty (NA):
A  B   C   D
x  1   NA  1
y  NA  4   4
z  2   NA  2
p  NA  5   5
t  6   7   error
In case both columns contain a value, it should return the text 'error' inside the cell.
You could first calculate a mask of rows where both values are present, then fill NA values of, say, column B with values from column C. Using the mask calculated in the first step, simply assign NA where needed.
error_mask = df['B'].notna() & df['C'].notna()
df['D'] = df['B'].fillna(df['C'])
df.loc[error_mask, 'D'] = pd.NA
df
   A     B     C     D
0  x     1  <NA>     1
1  y  <NA>     4     4
2  z     2  <NA>     2
3  p  <NA>     5     5
4  t     6     7  <NA>
Or:
df['D'] = df['D'].astype(str)
df.loc[error_mask, 'D'] = 'error'
I would suggest against assigning the string 'error' where both values are present, since that would make the whole D column an object dtype.
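To illustrate the dtype point, a minimal sketch (d is a hypothetical throwaway copy; error_mask as computed above):
d = df['B'].fillna(df['C'])
print(d.dtype)               # still a numeric dtype
d = d.astype(str)
d.loc[error_mask] = 'error'
print(d.dtype)               # object - numeric operations on the column no longer work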
There are several ways to achieve this.
Using fillna and mask
df['D'] = df['B'].fillna(df['C']).mask(df['B'].notna() & df['C'].notna(), 'error')
Or numpy.select:
m1 = df['B'].notna()
m2 = df['C'].notna()
df['D'] = np.select([m1&m2, m1], ['error', df['B']], df['C'])
Output:
A B C D
0 x 1.0 NaN 1.0
1 y NaN 4.0 4.0
2 z 2.0 NaN 2.0
3 p NaN 5.0 5.0
4 t 6.0 7.0 error
Adding to the previous answer, you can address this with a series of .apply() methods paired with lambda functions.
Consider the dataframe that you presented, with np.nan as the NA values:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'B': [1, np.nan, 2, np.nan, 6],
    'C': [np.nan, 4, np.nan, 5, 7]})
First generate a list of the elements from the series in question:
df['D'] = df.apply(lambda x: list(x), axis=1)
This will net you a pd.Series with a list of values as elements, e.g. [1.0, nan] for the first row. Next, remove all np.nan elements, using the fact that np.nan != np.nan in numpy (see also an answer here: How can I remove Nan from list Python/NumPy):
df['E'] = df['D'].apply(lambda x: [i for i in x if i == i])
Finally, create the error by filtering based on length.
df['F'] = df['E'].apply(lambda x: x[0] if len(x) == 1 else 'error')
The resulting dataframe looks like this:
B C D E F
0 1.0 NaN [1.0, nan] [1.0] 1.0
1 NaN 4.0 [nan, 4.0] [4.0] 4.0
2 2.0 NaN [2.0, nan] [2.0] 2.0
3 NaN 5.0 [nan, 5.0] [5.0] 5.0
4 6.0 7.0 [6.0, 7.0] [6.0, 7.0] error
Of course you could chain all this together in a not-so-pythonic, yet single-line answer:
a = df.apply(lambda x: list(x), axis=1).apply(lambda x: [i for i in x if i == i]).apply(lambda x: x[0] if len(x) == 1 else 'error')
Have a look at the function combine_first:
df['C'].combine_first(df['B']).mask(df['B'].notna() & df['C'].notna(), 'error')
Output:
0 1.0
1 4.0
2 2.0
3 5.0
4 error
Name: C, dtype: object
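combine_first fills NaN values in the calling Series with values from the argument, so df['C'].combine_first(df['B']) is the mirror image of the fillna approach above. A small usage sketch (assuming you want the result stored in a new column D):
df['D'] = df['C'].combine_first(df['B']).mask(df['B'].notna() & df['C'].notna(), 'error')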
I have the following dataframe:
a b
0 3.0 10.0
1 2.0 9.0
2 NaN 8.0
For each row, I need to drop (and replace with NaN) all values, excluding the first non-null one.
This is the expected output:
a b
0 3.0 NaN
1 2.0 NaN
2 NaN 8.0
I know that using the justify function I can identify the first n non-null values, but I need to keep the same structure of the original dataframe.
One way to go would be:
import pandas as pd
data = {'a': {0: 3.0, 1: 2.0, 2: None}, 'b': {0: 10.0, 1: 9.0, 2: 8.0}}
df = pd.DataFrame(data)
def keep_first_valid(x):
    first_valid = x.first_valid_index()
    return x.mask(x.index != first_valid)
df = df.apply(lambda x: keep_first_valid(x), axis=1)
df
a b
0 3.0 NaN
1 2.0 NaN
2 NaN 8.0
So, the first x passed to the function would consist of pd.Series([3.0, 10.0], index=['a', 'b']).
Inside the function, first_valid = x.first_valid_index() will store 'a'; see df.first_valid_index.
Finally, we apply x.mask to get pd.Series([3.0, None], index=['a', 'b']), which we assign back to the df.
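To see the masking step in isolation, here is a small sketch with the first row of the example:
import pandas as pd

x = pd.Series([3.0, 10.0], index=['a', 'b'])
print(x.first_valid_index())   # 'a'
print(x.mask(x.index != 'a'))  # a 3.0, b NaN - everything except the first valid cell is masked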
try this:
f = df.copy()
f[:] = f.columns
fv_idx = df.apply(pd.Series.first_valid_index, axis=1).values[:, None]
res = df.where(f == fv_idx)
print(res)
>>>
a b
0 3.0 NaN
1 2.0 NaN
2 NaN 8.0
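A fully vectorized alternative (a sketch, not part of either answer above; it assumes NaN is the only missing marker): count non-null values cumulatively across each row and keep only the cells where that running count is 1:
# a cell survives only if it is the first non-null value in its row
res = df.where(df.notna().cumsum(axis=1).eq(1))
print(res)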
I want to split a column into two separate columns based on whether the value in column A is true or false.
Example:
A X
True 3
False 6
True 2
False 4
Expected output:
A      Y   Z
True   3
False      6
True   2
False      4
I've found examples online of this with string manipulation, but I'm working with integers.
I can combine the columns by df[Y] + df[Z], but can't find a way to split them.
Use double numpy.where:
df['Y'] = np.where(df.A, df.X, np.nan)
df['Z'] = np.where(~df.A, df.X, np.nan)
Or Series.where with Series.mask:
df['Y'] = df.X.where(df.A)
df['Z'] = df.X.mask(df.A)
print (df)
A X Y Z
0 True 3 3.0 NaN
1 False 6 NaN 6.0
2 True 2 2.0 NaN
3 False 4 NaN 4.0
Or numpy.select, wrapping both masks in a tuple:
df['Y'], df['Z'] = np.select([(df.A, ~df.A)], [df.X], default=np.nan)
print (df)
A X Y Z
0 True 3 3.0 NaN
1 False 6 NaN 6.0
2 True 2 2.0 NaN
3 False 4 NaN 4.0
If you want empty strings, change NaN to '' - but note that this creates object columns, so any further numeric processing will fail:
df['Y'], df['Z'] = np.select([(df.A, ~df.A)], [df.X], default='')
Or:
df['Y'] = np.where(df.A, df.X, '')
df['Z'] = np.where(~df.A, df.X, '')
print (df)
A X Y Z
0 True 3 3
1 False 6 6
2 True 2 2
3 False 4 4
df['Y'] = df['X'][df['A']==True]
df['Z'] = df['X'][df['A']==False]
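For reference, this relies on index alignment: each filtered Series covers only part of the index, and pandas fills the unmatched rows with NaN on assignment, which also upcasts the integer X values to float. A quick check (same df as above):
print(df)         # Y and Z hold NaN in the rows the filter excluded
print(df.dtypes)  # Y and Z are float64, because NaN forces the upcast from int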
I want to fill missing values of a specific column only if a condition is met.
e.g.

A    B
NaN  0
NaN  0
0    0
NaN  1
NaN  1
...
In the above case I want to fill NaN values in column A only when the corresponding value in column B is 0. The rest of the values in A (with NaN) should not change.
Use mask with fillna:
df['A'] = df['A'].mask(df['B'] == 0, df['A'].fillna(3))
Alternatives with loc, numpy.where:
df.loc[df['B'] == 0, 'A'] = df['A'].fillna(3)
df['A'] = np.where(df['B'] == 0, df['A'].fillna(3), df['A'])
print (df)
A B
0 3.0 0
1 3.0 0
2 0.0 0
3 NaN 1
4 NaN 1
np.where is a quick and simple solution.
In [47]: df['A'] = np.where(np.isnan(df['A']) & (df['B'] == 0), 3, df['A'])
In [48]: df
Out[48]:
A B
0 3.0 0
1 3.0 0
2 0.0 0
3 NaN 1
4 NaN 1
You could use a loop over all elements, something like this:
for i in range(len(A)):
    if numpy.isnan(A[i]) and B[i] == 0:
        A[i] = value
There are nicer ways to implement these loops, but I don't know what structures you are using.
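If the data lives in a dataframe, the same condition can be written as a vectorized one-liner instead of a loop (a sketch assuming columns named 'A' and 'B' and the fill value 3 used in the answers above):
# fill A with 3 only where A is NaN and B equals 0
df.loc[df['A'].isna() & df['B'].eq(0), 'A'] = 3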
Is there a way to remove NaN values from a pandas Series? I have a Series that may or may not have some NaN values in it, and I'd like to return a copy of the Series with all the NaNs removed.
>>> s = pd.Series([1, 2, 3, 4, np.nan, 5, np.nan])
>>> s[~s.isnull()]
0 1
1 2
2 3
3 4
5 5
Update: an even better approach, as @DSM suggested in the comments, is using pandas.Series.dropna():
>>> s.dropna()
0 1
1 2
2 3
3 4
5 5
A small use of the fact that np.nan != np.nan:
s[s==s]
Out[953]:
0 1.0
1 2.0
2 3.0
3 4.0
5 5.0
dtype: float64
More Info
np.nan == np.nan
Out[954]: False
If you have a pandas Series with NaN and want to remove them (without losing the index):
serie = serie.dropna()
# create data for example
data = np.array(['g', 'e', 'e', 'k', 's'])
ser = pd.Series(data)
ser = ser.replace('e', np.nan)  # assign back - replace does not modify in place
print(ser)
0 g
1 NaN
2 NaN
3 k
4 s
dtype: object
# the code
ser = ser.dropna()
print(ser)
0 g
3 k
4 s
dtype: object