Why isn't this replace in DataFrame doing what I intended? - python

I'm trying to replace the NaN values in train_df with the values at the corresponding indexes in dff. I can't understand what I'm doing wrong.
train_df.replace(to_replace=train_df["Age"].values,
                 value=dff["Age"].values,
                 inplace=True,
                 regex=False,
                 limit=None)
dff.Age.mean()
Output: 30.128401985359698
train_df.Age.mean()
Output: 28.96758312013303

You replace everything in train_df, not just the NaN values.
The replace docs say:
Replace values given in to_replace with value.
If you just want to replace the NaN values, take a look at fillna, or use boolean indexing with isna.
fillna Docs
isna Docs
Example with fillna
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2, np.nan, 4]})
df2 = pd.DataFrame({"a": [5, 5, 3, 5]})
df1.fillna(df2, inplace=True)
Example with isna (an alternative, starting from a fresh df1)
df1[pd.isna(df1)] = df2
Results
>>> df1
     a
0  1.0
1  2.0
2  3.0
3  4.0
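Applied to the question's DataFrames, a minimal sketch (assuming train_df and dff share the same index):
train_df["Age"] = train_df["Age"].fillna(dff["Age"])
fillna with a Series aligns on the index, so each NaN age in train_df is filled from the row with the same label in dff.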

Related

What is the fastest way to compare values across columns in pandas (Python)?

I have the following dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.array([[1, 1, 1, 1], [1, 1, np.nan, 1], [1, np.nan, 1, 1]]),
                  columns=['t', 't_1', 't_2', 't_3'])
In reality it has ~10 million rows.
I need a fast way to find, for each row, the last consecutive column that has a non-null value.
Taking this df as an example, the result would be:
df_result = pd.DataFrame(np.array([[1, 1, 1, 1], [1, 1, np.nan, np.nan], [1, np.nan, np.nan, np.nan]]),
                         columns=['t', 't_1', 't_2', 't_3'])
Currently I'm doing this with the following function applied row by row, but it is too slow:
def second_to_last_null(*args):
    # Return NaN as soon as any value in the sequence is NaN; otherwise return the last value.
    for i in range(len(args)):
        if np.isnan(args[i]):
            return np.nan
    return args[-1]

df_result = pd.DataFrame()
df_result['t'] = df['t']
df_result['t_1_consecutive'] = df[['t', 't_1']].apply(lambda x: second_to_last_null(x.t, x.t_1), axis=1)
df_result['t_2_consecutive'] = df[['t', 't_1', 't_2']].apply(lambda x: second_to_last_null(x.t, x.t_1, x.t_2), axis=1)
df_result['t_3_consecutive'] = df[['t', 't_1', 't_2', 't_3']].apply(lambda x: second_to_last_null(x.t, x.t_1, x.t_2, x.t_3), axis=1)
Can somebody suggest the fastest way to do this in pandas or NumPy, together with a simple technical explanation of why that method is better than mine?
Try cumsum on isna, then mask
df_result = df.mask(df.isna().cumsum(axis=1) >= 1)
Output:
     t  t_1  t_2  t_3
0  1.0  1.0  1.0  1.0
1  1.0  1.0  NaN  NaN
2  1.0  NaN  NaN  NaN
Explanation: df.isna() marks each NaN as True and everything else as False. Taking cumsum(axis=1) then gives the running count of NaNs so far across each row, so a cumulative sum >= 1 means there is a NaN at or before that position, and mask hides exactly those cells. This is much faster than the row-wise apply because isna, cumsum, and mask are vectorized operations that run over whole arrays at C speed, rather than calling a Python function once per row.
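A quick sketch of the intermediate steps on the example df, to make the mechanics concrete:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 1, 1, 1], [1, 1, np.nan, 1], [1, np.nan, 1, 1]]),
                  columns=['t', 't_1', 't_2', 't_3'])
print(df.isna())                   # True wherever a value is NaN
print(df.isna().cumsum(axis=1))    # row 1: [0, 0, 1, 1]; row 2: [0, 1, 1, 1]
df_result = df.mask(df.isna().cumsum(axis=1) >= 1)  # hide every cell with a NaN at or before it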

Pandas replace NaN values with respect to the columns

I have the following data frame.
df = pd.DataFrame({'A': [2, np.nan], 'B': [1, np.nan]})
df.fillna(0) replaces all the NaN values with 0. But
I want to replace the NaN values in column 'A' with 1 and in column 'B' with 0, simultaneously. How can I do that?
Use:
df["A"].fillna(1, inplace=True)  # for col A: NaN -> 1
df["B"].fillna(0, inplace=True)  # for col B: NaN -> 0
This does it in one line:
(df['A'].fillna(1, inplace=True), df['B'].fillna(0, inplace=True))
print(df)
The fillna method also exists on Series objects:
df["A"] = df["A"].fillna(1)
df["B"] = df["B"].fillna(0)

Filter for rows in pandas dataframe where values in a column are greater than x or NaN

I'm trying to figure out how to filter a pandas DataFrame so that the values in a certain column are either greater than a certain value or are NaN. Let's say my DataFrame looks like this:
df = pd.DataFrame({"col1":[1, 2, 3, 4], "col2": [4, 5, np.nan, 7]})
I've tried:
df = df[df["col2"] >= 5 | df["col2"] == np.nan]
and:
df = df[df["col2"] >= 5 | np.isnan(df["col2"])]
But the first causes an error, and the second excludes rows where the value is NaN. How can I get the result to be this:
pd.DataFrame({"col1":[2, 3, 4], "col2":[5, np.nan, 7]})
Try:
df[df.col2.isna() | df.col2.gt(4)]
   col1  col2
1     2   5.0
2     3   NaN
3     4   7.0
Also, you can fill the NaN with the threshold and compare on the column:
df[df['col2'].fillna(5) >= 5]
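For reference, the original attempts fail because | binds more tightly than >=, so each comparison needs its own parentheses, and because == np.nan is always False (NaN never compares equal to anything, including itself). A corrected version of the first attempt:
df = df[(df["col2"] >= 5) | (df["col2"].isna())]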

How to re-order a dataframe in pandas using a string in a cell?

I have been trying to re-order a DataFrame in pandas by locating the cell that contains the string negative, deleting that row, and adding a new column that distinguishes the rows: the rows before the cell containing negative should be labelled 'POSITIVE', and the rows that used to come after it 'NEGATIVE'.
This is a minimal example of the dataframe:
d = {'col1': [1, 2, 'negative', 4, 5],
     'col2': [6, 7, None, 8, 9],
     'col3': [10, 11, None, 12, 13]}
data = pd.DataFrame(data=d)
This is what I have been trying to achieve:
target = {'col1': [1, 2, 4, 5],
          'col2': [6, 7, 8, 9],
          'col3': [10, 11, 12, 13],
          'col4': ['POSITIVE', 'POSITIVE', 'NEGATIVE', 'NEGATIVE']}
target = pd.DataFrame(data=target)
I have tried to split the dataframe, then remove the row and then add the new column and finally join them again. I was wondering if there is a better way in pandas to achieve this?
First compare col1 against 'negative' for the first mask, then take the cumulative sum and test it for 0 to flag the rows before the marker as mask1, pass that to numpy.where, and finally remove the negative row by inverting the first mask with ~ in boolean indexing:
mask = data['col1'].eq('negative')
mask1 = mask.cumsum().eq(0)
data['col4'] = np.where(mask1, 'POSITIVE', 'NEGATIVE')
data = data[~mask].copy()
print (data)
  col1  col2  col3      col4
0    1   6.0  10.0  POSITIVE
1    2   7.0  11.0  POSITIVE
3    4   8.0  12.0  NEGATIVE
4    5   9.0  13.0  NEGATIVE
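To see how the two masks line up on the example data:
print(mask.tolist())   # [False, False, True, False, False] -> the 'negative' marker row
print(mask1.tolist())  # [True, True, False, False, False]  -> rows before the marker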

Short way to replace values in a Series based on values in another Series?

In the code below, I am replacing all NaN values in column b with a blank string if the corresponding value in column a is 1.
The code works, but I have to type df.loc[df.a == 1, 'b'] twice.
Is there a shorter/better way to do it?
import pandas as pd
df = pd.DataFrame({
    'a': [1, None, 3],
    'b': [None, 5, 6],
})
filtered = df.loc[df.a == 1, 'b']
filtered.fillna('', inplace=True)
df.loc[df.a == 1, 'b'] = filtered
print(df)
How about using a numpy where clause to check the values in a and b and replace? See a mock-up below; I have used column 'c' to illustrate.
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'a': [1, None, 3],
    'b': [None, 5, 6],
})
# replace the b value where the corresponding value in column a is 1 and column b is NaN
df['c'] = np.where((df['a'] == 1) & (df['b'].isna()), df['a'], df['b'])
df
original dataframe
     a    b
0  1.0  NaN
1  NaN  5.0
2  3.0  6.0
result:
     a    b    c
0  1.0  NaN  1.0
1  NaN  5.0  5.0
2  3.0  6.0  6.0
Use where() to do it in one line:
import numpy as np
df['b'] = np.where(df['b'].isnull() & (df['a'] == 1), '', df['b'])
Use Series.fillna only on the rows matched by the condition:
df.loc[df.a == 1, 'b'] = df['b'].fillna('')
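A one-line alternative, sketched with Series.mask, which replaces values wherever the condition holds:
df['b'] = df['b'].mask((df.a == 1) & df.b.isna(), '')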
