Pandas NaN value causing trouble when changing values depending on other columns - python

Why are pandas NaN values sometimes typed as numpy.float64 and sometimes as float?
This is confusing when I want to use a function to change values in a dataframe depending on other columns.
example:
   A    B    C
0  1  NaN    d
1  2    a    s
2  2    b    s
3  3    c  NaN
I have a function to change the value of column C:
def change_val(df):
    if df.A==1 and df.B==np.nan:
        return df.C
    else:
        return df.B
Then I apply this function to column C:
df['C']=df.apply(lambda x: change_val(x),axis=1)
Things go wrong on df.B==np.nan. How do I express this check correctly?
Desired result:
   A    B  C
0  1  NaN  d
1  2    a  a
2  2    b  b
3  3    c  c

Use numpy.where or loc; for checking missing values, use the dedicated function Series.isna:
mask = (df.A==1) & (df.B.isna())
# older pandas versions:
# mask = (df.A==1) & (df.B.isnull())
df['C'] = np.where(mask, df.C, df.B)
Or:
df.loc[~mask, 'C'] = df.B
print(df)
   A    B  C
0  1  NaN  d
1  2    a  a
2  2    b  b
3  3    c  c
For more information about working with missing data, check the docs.
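For reference, the sample frame can be rebuilt like this to try the answer end to end (a minimal sketch; the values are read straight off the question's table):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2, 3],
                   'B': [np.nan, 'a', 'b', 'c'],
                   'C': ['d', 's', 's', np.nan]})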

def change_val(df):
    if df.A==1 and pd.isnull(df.B):
        return df.C
    else:
        return df.B
NaN is not a value and will not compare equal to anything, not even NaN itself, so use isnull()/isna().
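A quick illustration of why the == check silently fails (a minimal sketch, plain numpy/pandas):
import numpy as np
import pandas as pd

print(np.nan == np.nan)   # False - NaN never compares equal, not even to itself
print(pd.isna(np.nan))    # True - the dedicated missing-value check
print(pd.isnull(np.nan))  # True - isnull is an alias of isna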

How can a duplicate row be dropped with some condition [duplicate]

Simple DataFrame:
df = pd.DataFrame({'A': [1,1,2,2], 'B': [0,1,2,3], 'C': ['a','b','c','d']})
df
A B C
0 1 0 a
1 1 1 b
2 2 2 c
3 2 3 d
For every value of column A (groupby), I wish to get the value of column C for which column B is maximum. For example, for group 1 of column A, the maximum of column B is 1, so I want the value "b" of column C:
A C
0 1 b
1 2 d
There is no need to assume column B is sorted; performance is the top priority, followed by elegance.
Check with sort_values + drop_duplicates:
df.sort_values('B').drop_duplicates(['A'],keep='last')
Out[127]:
A B C
1 1 1 b
3 2 3 d
df.groupby('A').apply(lambda x: x.loc[x['B'].idxmax(), 'C'])
# A
# 1    b
# 2    d
Use idxmax to find the index where B is maximal, then select column C within that group (using a lambda function).
Here's a little fun with groupby and nlargest:
(df.set_index('C')
   .groupby('A')['B']
   .nlargest(1)
   .index
   .to_frame()
   .reset_index(drop=True))
A C
0 1 b
1 2 d
Or, sort_values, groupby, and last:
df.sort_values('B').groupby('A')['C'].last().reset_index()
A C
0 1 b
1 2 d
Similar solution to @Jondiedoop's, but it avoids the apply:
u = df.groupby('A')['B'].idxmax()
df.loc[u, ['A', 'C']].reset_index(drop=1)
A C
0 1 b
1 2 d
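Since performance is the top priority here, a rough timing sketch on a larger random frame can compare the two fastest candidates (the frame, its size, and the variable names are made up for illustration; exact numbers will vary by machine and data):
import numpy as np
import pandas as pd
from timeit import timeit

big = pd.DataFrame({'A': np.random.randint(0, 1000, 100_000),
                    'B': np.random.rand(100_000),
                    'C': np.random.choice(list('abcd'), 100_000)})

# sort + drop_duplicates versus groupby + idxmax
t_sort = timeit(lambda: big.sort_values('B').drop_duplicates('A', keep='last'), number=10)
t_idx  = timeit(lambda: big.loc[big.groupby('A')['B'].idxmax(), ['A', 'C']], number=10)
print(t_sort, t_idx)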

How to create conditional pandas series/column?

Here is a sample df:
A B C D E (New Column)
0 1 2 a n ?
1 3 3 p d ?
2 5 9 f z ?
If column A == column B, pick column C's value for column E;
otherwise pick column D's value for column E.
I have tried many ways but failed. I am new, please teach me how to do it, thank you!
Note: the value needs to be picked from col C or col D in this case, so no specific fill values are provided for col E (this is the main difference from other similar questions).
Use numpy.where:
df['E'] = np.where(df['A'] == df['B'],
                   df['C'],
                   df['D'])
df
A B C D E
0 1 2 a n n
1 3 3 p d p
2 5 9 f z z
Try pandas where:
df['E'] = df['C'].where(df['A'].eq(df['B']), df['D'])
df
Out[570]:
A B C D E
0 1 2 a n n
1 3 3 p d p
2 5 9 f z z
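For completeness, Series.mask expresses the same pick from the other direction (a sketch, equivalent to the answers above on the same data):
# start from D and swap in C's value wherever A equals B
df['E'] = df['D'].mask(df['A'].eq(df['B']), df['C'])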

Pandas: optimise iterating with a condition on both row and column with large file

I have the following data, and what I would like is to fill in col E with values from another row (let’s call it the target row) in col D only when the following conditions are met:
col E has no value
the string in col A of the target row is the same as that in col A
the value in col B for the target row is the same as the value in col C
      A  B  C  D  E
1   XXZ  a  d  1
2  YXXZ  b  a  2
3  YXXZ  c  b  3  2
4  YXXZ  d  c  4  5
5   XXZ  e  a  4
What I would get would be something like this:
    A  B  C  D    E
  XXZ  a  d  1    1
 YXXZ  b  a  2    2
 YXXZ  c  b  3    2
 YXXZ  d  c  4    5
  XXZ  e  a  4  NaN
The answer from @ralubrusto below works, but is clearly not efficient for large files. Are there any suggestions on how to make it faster?
Since you have multiple and different conditions, you might wanna do something like:
# Find missing E values
missing = df.E.isna()
for id in df[missing].index:
    original = df.loc[id]
    # Second condition
    equal_A = df[df['A'] == original['A']]
    # Third condition
    the_one = equal_A[equal_A['C'] == original['B']]
    # Assigning
    if len(the_one) > 0:
        df.at[id, 'E'] = the_one.iloc[0]['D']
The answer for your example data would be:
      A  B  C  D    E
0   XXZ  a  d  1  4.0
1  YXXZ  b  a  2  3.0
2  YXXZ  c  b  3  2.0
3  YXXZ  d  c  4  5.0
4   XXZ  e  a  4  NaN
Edit: Thanks for your patience. I've tried a few different approaches to accomplish this task, and most of them are pretty inefficient, as a perfplot comparison shows (it's not a perfect plot, but you can get the general idea).
I've tried approaches using groupby, apply, for loops (the previous answer) and finally a merge, which is by far the fastest one.
Here is its code:
_df = (df.reset_index()
         .merge(df, left_on=['A', 'B'],
                right_on=['A', 'C'],
                how='inner',
                suffixes=['_ori', '_target']))
_df.loc[_df.E_ori.isna(), 'E_ori'] = _df.loc[_df.E_ori.isna(), 'D_target']
_df.set_index('index', inplace=True)
df.loc[_df.index, 'E'] = _df['E_ori']
It's really more efficient than the previous solution, so please try it out using your dataset and tell us if you have any further issues.
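To try it end to end, the sample frame can be rebuilt like this (a sketch; the values are read off the tables above, with empty E cells as NaN):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['XXZ', 'YXXZ', 'YXXZ', 'YXXZ', 'XXZ'],
                   'B': ['a', 'b', 'c', 'd', 'e'],
                   'C': ['d', 'a', 'b', 'c', 'a'],
                   'D': [1, 2, 3, 4, 4],
                   'E': [np.nan, np.nan, 2, 5, np.nan]})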

Map only first occurrence of key/value match in dataframe

Is it possible to map only the first occurrence of key in a dataframe?
Ex:
testDict = {'A': 1, 'B': 2}
df
Name Num
A
A
B
B
Expected output
Name Num
A 1
A
B 2
B
Use duplicated to find the first occurrence and then map:
df['Num'] = df.Name[df.Name.duplicated(keep='last')].map(testDict)
print(df)
Output
Name Num
0 A 1.0
1 A NaN
2 B 2.0
3 B NaN
To remove the NaN values, if you wish, do:
df = df.fillna('')
map onto the result of drop_duplicates, assuming you have a unique Index for alignment (probably best to keep NaN so the column remains numeric):
df['Num'] = df['Name'].drop_duplicates().map(testDict)
Name Num
0 A 1.0
1 A NaN
2 B 2.0
3 B NaN
You can use duplicated and map:
df['Num'] = np.where(~df['Name'].duplicated(), df['Name'].map(testDict), '')
Output:
Name Num
0 A 1
1 A
2 B 2
3 B
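A variant along the same lines, not from the original answers (a sketch): mask the duplicated names first, so only first occurrences survive the map.
# duplicated() marks everything after the first occurrence; mask turns those into NaN
df['Num'] = df['Name'].mask(df['Name'].duplicated()).map(testDict)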

How to replace a value in a pandas dataframe with column name based on a condition?

I have a dataframe that looks something like the one below. I want to replace all 1's in the range A:D with the name of the column. How can I do that?
You can recreate my dataframe with this:
dfz = pd.DataFrame({'A' : [1,0,0,1,0,0],
                    'B' : [1,0,0,1,0,1],
                    'C' : [1,0,0,1,3,1],
                    'D' : [1,0,0,1,0,0],
                    'E' : [22.0,15.0,None,10.,None,557.0]})
One way could be to use replace and pass in a Series mapping column labels to values (those same labels in this case):
>>> dfz.loc[:, 'A':'D'].replace(1, pd.Series(dfz.columns, dfz.columns))
A B C D
0 A B C D
1 0 0 0 0
2 0 0 0 0
3 A B C D
4 0 0 3 0
5 0 B C 0
To make the change permanent, you'd assign the returned DataFrame back to dfz.loc[:, 'A':'D'].
Solutions aside, it's useful to keep in mind that you may lose a lot of performance benefits when you mix numeric and string types in columns, as pandas is forced to use the generic 'object' dtype to hold the values.
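To see that cost, a quick check of the dtypes before and after the replacement (a sketch, assuming the dfz defined above):
mixed = dfz.loc[:, 'A':'D'].replace(1, pd.Series(dfz.columns, dfz.columns))
print(dfz['A'].dtype)    # int64 - the original numeric dtype
print(mixed['A'].dtype)  # object - mixed ints and strings fall back to object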
A solution using where:
>>> dfz.where(dfz != 1, dfz.columns.to_series(), axis=1)
A B C D E
0 A B C D 22.0
1 0 0 0 0 15.0
2 0 0 0 0 NaN
3 A B C D 10.0
4 0 0 3 0 NaN
5 0 B C 0 557.0
Maybe it's not so elegant, but... just loop through the columns and replace:
for i in dfz[['A','B','C','D']].columns:
    dfz[i].replace(1, i, inplace=True)
I do prefer the very elegant solution from @ajcr.
In case you have column names that you can't use that easily for slicing, here is my solution:
# select the target columns by regex; .loc replaces the long-deprecated .ix indexer
cols = dfz.filter(regex=r'(A|B|C|D)').columns.tolist()
dfz.loc[:, cols] = dfz[dfz != 1].loc[:, cols].apply(lambda x: x.fillna(x.name))
Output:
In [207]: dfz
Out[207]:
A B C D E
0 A B C D 22.0
1 0 0 0 0 15.0
2 0 0 0 0 NaN
3 A B C D 10.0
4 0 0 3 0 NaN
5 0 B C 0 557.0
