Replace numerical values with NaN in Python

I want to replace all numerical values in a DataFrame column with NaN
Input
A B C
test foo xyz
hit bar 10
hit fish 90
hit NaN abc
test val 20
test val 90
Desired Output:
A B C
test foo xyz
hit bar NaN
hit fish NaN
hit NaN abc
test val NaN
test val NaN
I tried the following:
db_old.loc[db_old['Current Value'].istype(float), db_old['Current Value']] = np.nan
but it returns:
AttributeError: 'Series' object has no attribute 'istype'
Any suggestions?
Thanks

You can mask numeric values using to_numeric:
df['C'] = df['C'].mask(pd.to_numeric(df['C'], errors='coerce').notna())
df
A B C
0 test foo xyz
1 hit bar NaN
2 hit fish NaN
3 hit NaN abc
4 test val NaN
5 test val NaN
to_numeric is the most general solution and should work regardless of whether you have a column of strings or mixed objects.
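For reference, here is a self-contained sketch of this approach, with the frame rebuilt from the question's sample data:

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["test", "hit", "hit", "hit", "test", "test"],
    "B": ["foo", "bar", "fish", None, "val", "val"],
    "C": ["xyz", 10, 90, "abc", 20, 90],
})
# to_numeric coerces non-numeric strings to NaN, so notna() flags exactly
# the numeric entries, which mask() then replaces with NaN
df["C"] = df["C"].mask(pd.to_numeric(df["C"], errors="coerce").notna())
```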
If it is a column of strings and you're only trying to retain strings of letters, str.isalpha may suffice:
df['C'] = df['C'].where(df['C'].str.isalpha())
df
A B C
0 test foo xyz
1 hit bar NaN
2 hit fish NaN
3 hit NaN abc
4 test val NaN
5 test val NaN
Note that this keeps only strings made up entirely of letters; a mixed value like 'abc123' would also be replaced with NaN.
If you have a column of mixed objects, here is yet another solution using str.match (any str method with a na flag, really) with na=False:
df['C'] = ['xyz', 10, 90, 'abc', 20, 90]
df['C'] = df['C'].where(df['C'].str.match(r'\D+$', na=False))
df
A B C
0 test foo xyz
1 hit bar NaN
2 hit fish NaN
3 hit NaN abc
4 test val NaN
5 test val NaN
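To see why na=False matters: the str accessor returns NaN for non-string elements, which .where cannot use as a boolean condition. With na=False those elements become False and are masked along with the digit strings. A small sketch:

```python
import pandas as pd

s = pd.Series(["xyz", 10, 90, "abc", 20, 90])
# na=False maps the non-string elements (which would otherwise yield NaN)
# to False, giving a clean boolean condition for .where
cond = s.str.match(r"\D+$", na=False)
out = s.where(cond)
```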

Related

Combine numerical and boolean indexing

I have the following dataframe:
import pandas as pd
df = pd.DataFrame(index=[0, 1, 2], columns=["test1", "test2"])
df.at[1, "test1"] = 3
df.at[2, "test2"] = 5
print(df)
test1 test2
0 NaN NaN
1 3 NaN
2 NaN 5
I tried the following line in order to set all NaN values at indices 1 and 2 to False:
df.loc[[1, 2] & pd.isna(df)] = False
However, this gives me an error.
My expected output would be:
test1 test2
0 NaN NaN
1 3 False
2 False 5
You can do this:
In [917]: df.loc[1:2] = df.loc[1:2].fillna(False)
In [918]: df
Out[918]:
test1 test2
0 NaN NaN
1 3 False
2 False 5
pd.isna(df) is a mask the shape of your DataFrame, and you can't use it as a slice in a .loc call. In this case you want to selectively update the null values of your DataFrame on specific rows, so fillna the sliced rows and use update to assign the changes back.
df.update(df.loc[[1, 2]].fillna(False))
print(df)
test1 test2
0 NaN NaN
1 3 False
2 False 5
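A self-contained sketch of the update approach, rebuilding the question's frame:

```python
import pandas as pd

df = pd.DataFrame(index=[0, 1, 2], columns=["test1", "test2"])
df.at[1, "test1"] = 3
df.at[2, "test2"] = 5
# fillna on the sliced rows returns a new frame; update writes its
# non-NA values back into the original, leaving row 0 untouched
df.update(df.loc[[1, 2]].fillna(False))
```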
Or try fillna with a dict keyed by the row labels, transposing so those labels become columns:
df = df.T.fillna(dict.fromkeys(df.index[1:], False), axis=0).T
test1 test2
0 NaN NaN
1 3 False
2 False 5

Replacing certain values in column with string

This is my current data frame:
sports_gpa music_gpa Activity Sport
2 3 nan nan
0 2 nan nan
3 3.5 nan nan
2 1 nan nan
I have the following condition:
If the 'sports_gpa' is greater than 0 and the 'music_gpa' is greater than the 'sports_gpa', fill the 'Activity' column with the 'sports_gpa' value and fill the 'Sport' column with the str 'basketball'.
Expected output:
sports_gpa music_gpa Activity Sport
2 3 2 basketball
0 2 nan nan
3 3.5 3 basketball
2 1 nan nan
To do this I would use the following statement...
df['Activity'], df['Sport'] = np.where(((df['sports_gpa'] > 0) & (df['music_gpa'] > df['sports_gpa'])), (df['sports_gpa'], 'basketball'), (df['Activity'], df['Sport']))
This of course gives an error that operands could not be broadcast together with shapes.
To fix this I could add a column to the data frame..
df.loc[:,'str'] = 'basketball'
df['Activity'], df['Sport'] = np.where(((df['sports_gpa'] > 0) & (df['music_gpa'] > df['sports_gpa'])), (df['sports_gpa'], df['str']), (df['Activity'], df['Sport']))
This gives me my expected output.
I am wondering if there is a way to fix this error without having to create a new column in order to add the str value 'basketball' to the 'Sport' column in the np.where statement.
Use np.where + Series.fillna:
where = df['sports_gpa'].ne(0) & (df['sports_gpa'] < df['music_gpa'])
df['Activity'], df['Sport'] = np.where(where, (df['sports_gpa'], df['Sport'].fillna('basketball')), (df['Activity'], df['Sport']))
You can also use Series.where + Series.mask:
df['Activity']=df['sports_gpa'].where(where)
df['Sport']=df['Sport'].mask(where,'basketball')
print(df)
sports_gpa music_gpa Activity Sport
0 2 3.0 2.0 basketball
1 0 2.0 NaN NaN
2 3 3.5 3.0 basketball
3 2 1.0 NaN NaN
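A runnable sketch of the where/mask variant, with the frame rebuilt from the question's data (the 'Sport' column initialized as object dtype so the string fill is clean):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"sports_gpa": [2, 0, 3, 2], "music_gpa": [3, 2, 3.5, 1]})
df["Activity"] = np.nan
df["Sport"] = pd.Series([np.nan] * len(df), dtype="object")

cond = (df["sports_gpa"] > 0) & (df["music_gpa"] > df["sports_gpa"])
# where keeps values only where cond holds; mask overwrites where it holds
df["Activity"] = df["sports_gpa"].where(cond)
df["Sport"] = df["Sport"].mask(cond, "basketball")
```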
Just figured out I could do:
df['Activity'], df['Sport'] = np.where(((df['sports_gpa'] > 0) & (df['music_gpa'] > df['sports_gpa'])), (df['sports_gpa'],df['Sport'].astype(str).replace({"nan": "basketball"})), (df['Activity'], df['Sport']))

pandas replace is not replacing value even with inplace=True

My data looks like this. I would like to replace marital_status 'Missing' with 'Married' if 'no_of_children' is not nan.
cust_data_df[['marital_status','no_of_children']]
marital_status no_of_children
0 Married NaN
1 Married NaN
2 Missing 1
3 Missing 2
4 Single NaN
5 Single NaN
6 Married NaN
7 Single NaN
8 Married NaN
9 Married NaN
10 Single NaN
This is what I tried:
cust_data_df.loc[cust_data_df['no_of_children'].notna()==True, 'marital_status'].replace({'Missing':'Married'},inplace=True)
But this is not doing anything.
Assign the replaced values back to avoid chained assignment:
m = cust_data_df['no_of_children'].notna()
d = {'Missing':'Married'}
cust_data_df.loc[m, 'marital_status'] = cust_data_df.loc[m, 'marital_status'].replace(d)
If you need to set all the matched rows to one value:
cust_data_df.loc[m, 'marital_status'] = 'Married'
EDIT:
Thanks @Quickbeam2k1 for the explanation:
cust_data_df.loc[cust_data_df['no_of_children'].notna()==True, 'marital_status'] is just a new object with no reference back to the original. Replacing there will leave the original object unchanged.
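A minimal runnable version of the assign-back pattern, reconstructing a small slice of the question's frame:

```python
import numpy as np
import pandas as pd

cust = pd.DataFrame({
    "marital_status": ["Married", "Missing", "Missing", "Single"],
    "no_of_children": [np.nan, 1, 2, np.nan],
})
m = cust["no_of_children"].notna()
# .replace on the .loc slice returns a new object, so the result
# must be assigned back through .loc to reach the original frame
cust.loc[m, "marital_status"] = cust.loc[m, "marital_status"].replace({"Missing": "Married"})
```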

Remove columns that have NA values for rows - Python

Suppose I have a dataframe as follows,
import numpy as np
import pandas as pd
columns=['A','B','C','D', 'E', 'F']
index=['1','2','3','4','5','6']
df = pd.DataFrame(columns=columns,index=index)
df.loc['1', 'D'] = 1
df['E'] = 1
df.loc['1', 'F'] = 1
df.loc['2', 'A'] = 1
df.loc['3', 'B'] = 1
df.loc['4', 'C'] = 1
df.loc['5', 'A'] = 1
df.loc['5', 'B'] = 1
df.loc['5', 'C'] = 1
df.loc['6', 'D'] = 1
df.loc['6', 'F'] = 1
df
A B C D E F
1 NaN NaN NaN 1 1 1
2 1 NaN NaN NaN 1 NaN
3 NaN 1 NaN NaN 1 NaN
4 NaN NaN 1 NaN 1 NaN
5 1 1 1 NaN 1 NaN
6 NaN NaN NaN 1 1 1
My condition is: I want to remove the columns that are mutually exclusive with A, B, C, i.e. columns that only have values on rows where A, B and C are all empty, and keep the columns that have values only when A or B or C has values. The output here would be to remove columns D and F. But my real dataframe has 400 columns, and I need a way to check A, B, C against all the remaining columns.
One way I can think is,
Remove NA rows from A,B,C
df = df[np.isfinite(df['A'])]
df = df[np.isfinite(df['B'])]
df = df[np.isfinite(df['C'])]
and get NA count of all columns and check with the total number of rows,
df.isnull().sum()
and remove the counts that match.
Is there a better and efficient way to do this?
Thanks
Rather than filtering one column at a time, build a single mask of the rows where A, B and C are all NaN and select the others:
mask = df[["A", "B", "C"]].isnull().all(axis=1)
df = df[~mask]
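Continuing that idea to the actual column removal: restrict the frame to the rows where at least one of A, B, C has a value, and drop any column that is entirely empty there. A sketch using the question's data:

```python
import pandas as pd

df = pd.DataFrame({
    "A": [None, 1, None, None, 1, None],
    "B": [None, None, 1, None, 1, None],
    "C": [None, None, None, 1, 1, None],
    "D": [1, None, None, None, None, 1],
    "E": [1, 1, 1, 1, 1, 1],
    "F": [1, None, None, None, None, 1],
}, index=list("123456"))

# rows where at least one of A, B, C has a value
has_abc = df[["A", "B", "C"]].notna().any(axis=1)
# columns entirely empty on those rows only co-occur with missing A/B/C
to_drop = df.columns[df.loc[has_abc].isna().all()]
df = df.drop(columns=to_drop)
```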

Union of two variables to form a new one in Pandas Data Frame - Python

I have a data frame, and I want to create a new column based on the values of two columns. The pair of values would always be: 'x' and 'x', or 'x' and NaN, or NaN and 'x', or NaN and NaN. So for the first three cases the value of the new column would be 'x', and for the last one it would be NaN (NaN is a missing value).
The pandas data frame is:
Name Company nameC.antiguo Company nameC.completado
ssd X X
adf B NaN
dsf C C
eee NaN NaN
wqe NaN C
I tried the following code but it doesn't work at all.
pruebaofempat['bn']= pruebaofempat['Company nameC.antiguo'] + pruebaofempat['Company nameC.completado']
So, how can I create the new variable correctly?
use .fillna:
>>> df
Name antiguo completado
0 ssd X X
1 adf B NaN
2 dsf C C
3 eee NaN NaN
4 wqe NaN C
>>> df['antiguo'].fillna(df['completado'])
0 X
1 B
2 C
3 NaN
4 C
Name: antiguo, dtype: object
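Written out as a self-contained example that stores the result in the new column the question asks for (column names shortened as in the answer above):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["ssd", "adf", "dsf", "eee", "wqe"],
    "antiguo": ["X", "B", "C", None, None],
    "completado": ["X", None, "C", None, "C"],
})
# take antiguo where present, otherwise fall back to completado;
# only rows missing in both remain NaN
df["bn"] = df["antiguo"].fillna(df["completado"])
```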
