I have a dataframe in which some rows contain only NaNs and 0s across all columns. I am trying to extract those rows so that I can process them further. Some of the columns are object and some are float. I am trying the code below to extract such rows, but because some of the columns are object, it's not giving me the desired result.
I could work around this by substituting some arbitrary value for NaN and using it in the .isin statement, but that also changes the datatype of my columns, and I would have to convert them back.
Can somebody please help me with a workaround/solution to this?
Thanks.
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':[np.nan,0,np.nan,1,'abc',np.nan], 'b':[0,np.nan,np.nan,1,np.nan,1]})
df
     a    b
0  NaN  0.0
1    0  NaN
2  NaN  NaN
3    1  1.0
4  abc  NaN
5  NaN  1.0
values = [np.nan,0]
df_all_empty = df[df.isin(values).all(1)]
df_all_empty
Expected Output:
     a    b
0  NaN  0.0
1    0  NaN
2  NaN  NaN
Actual Output:
     a    b
0  NaN  0.0
Change it to:
df_all_empty = df[(df.isnull() | df.isin([0])).all(axis=1)]
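This works because df.isnull() flags the missing values directly, while isin([0]) flags the zeros; matching NaN through isin is unreliable on object columns, since NaN does not compare equal to itself. A quick check against the sample frame above:
df[(df.isnull() | df.isin([0])).all(axis=1)]
     a    b
0  NaN  0.0
1    0  NaN
2  NaN  NaN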
The code below will let you select those rows.
df_sel = df.loc[(df.a.isnull()) |
                (df.b.isnull()) |
                (df.a == 0) |
                (df.b == 0)]
If you want to set column 'a' in those rows to some flag value, say -9999, you can use:
df.loc[(df.a.isnull()) |
       (df.b.isnull()) |
       (df.a == 0) |
       (df.b == 0), 'a'] = -9999
For reference, see the official documentation on boolean indexing:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing
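If you have many columns, spelling out every per-column condition gets unwieldy. The same selection can be written column-agnostically; a sketch:
df_sel = df.loc[(df.isnull() | (df == 0)).any(axis=1)]
This keeps any row in which at least one column is NaN or 0, which is what the explicit | chain above does for columns a and b.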
You can use df.query together with the trick of comparing a value to itself to detect NaN (NaN is the only value that does not equal itself).
Write something like this:
df.query("(a!=a or a==0) and (b!=b or b==0)")
And the output is:
     a    b
0  NaN  0.0
1    0  NaN
2  NaN  NaN
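One caveat: df.query evaluates with the numexpr engine by default when numexpr is installed, and that engine may not cope with an object-dtype column such as a here. If you run into that, forcing the slower Python engine should behave the same:
df.query("(a != a or a == 0) and (b != b or b == 0)", engine="python")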
Related
I have a similar question to this one.
I have a dataframe with several rows, which looks like this:
Name  TypA  TypB  ...  TypF  TypA_value  TypB_value  ...  TypF_value  Divider
1     1     1     ...  NaN   10          5           ...  NaN         5
2     NaN   2     ...  NaN   NaN         20          ...  NaN         10
and I want to divide all columns ending in "value" by the column "Divider". How can I do so? One trick would be to use sorting so I could apply the answer from above, but is there a direct way, so that I do not need to sort the dataframe?
The outcome would be:
Name  TypA  TypB  ...  TypF  TypA_value  TypB_value  ...  TypF_value  Divider
1     1     1     ...  NaN   2           1           ...  0           5
2     NaN   2     ...  NaN   0           2           ...  0           10
So a NaN will lead to a 0.
Use DataFrame.filter to select the value columns from the dataframe, DataFrame.div with axis=0 to divide them by the column Divider, and finally DataFrame.update to write the values back into the dataframe:
d = df.filter(like='_value').div(df['Divider'], axis=0).fillna(0)
df.update(d)
Result:
   Name  TypA  TypB  TypF  TypA_value  TypB_value  TypF_value  Divider
0     1   1.0     1   NaN         2.0         1.0         0.0        5
1     2   NaN     2   NaN         0.0         2.0         0.0       10
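Since d holds exactly the _value columns, a direct assignment back is an equivalent sketch here:
df[d.columns] = d
The difference in general: DataFrame.update only writes non-NaN values from its argument, while plain assignment writes everything; since d was already filled with 0, the two agree in this case.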
You could select the columns of interest using DataFrame.filter, and divide as:
value_cols = df.filter(regex=r'_value$').columns
df[value_cols] /= df['Divider'].to_numpy()[:,None]
# df[value_cols] = df[value_cols].fillna(0)
print(df)
   Name  TypA  TypB  TypF  TypA_value  TypB_value  TypF_value  Divider
0     1   1.0     1   NaN         2.0         1.0         NaN        5
1     2   NaN     2   NaN         NaN         2.0         NaN       10
Taking two sample columns, A and B:
import pandas as pd
import numpy as np
a = {'Name': [1, 2],
     'TypA': [1, np.nan],
     'TypB': [1, 2],
     'TypA_value': [10, np.nan],
     'TypB_value': [5, 20],
     'Divider': [5, 10]}
df = pd.DataFrame(a)
cols_all = df.columns
Find the columns for which calculations are to be done, assuming they all contain '_value':
cols_to_calc = [c for c in cols_all if '_value' in c]
For these columns, first divide by the Divider column, then replace NaN with 0:
for c in cols_to_calc:
    df[c] = df[c] / df.Divider
    df[c] = df[c].fillna(0)
This is my current data frame:
sports_gpa  music_gpa  Activity  Sport
         2          3       nan    nan
         0          2       nan    nan
         3        3.5       nan    nan
         2          1       nan    nan
I have the following condition:
If the 'sports_gpa' is greater than 0 and the 'music_gpa' is greater than the 'sports_gpa', fill the 'Activity' column with the 'sports_gpa' value and fill the 'Sport' column with the str 'basketball'.
Expected output:
sports_gpa  music_gpa  Activity  Sport
         2          3         2  basketball
         0          2       nan  nan
         3        3.5         3  basketball
         2          1       nan  nan
To do this I would use the following statement...
df['Activity'], df['Sport'] = np.where(
    (df['sports_gpa'] > 0) & (df['music_gpa'] > df['sports_gpa']),
    (df['sports_gpa'], 'basketball'),
    (df['Activity'], df['Sport']))
This of course gives an error that operands could not be broadcast together with shapes.
To fix this I could add a column to the data frame...
df.loc[:, 'str'] = 'basketball'
df['Activity'], df['Sport'] = np.where(
    (df['sports_gpa'] > 0) & (df['music_gpa'] > df['sports_gpa']),
    (df['sports_gpa'], df['str']),
    (df['Activity'], df['Sport']))
This gives me my expected output.
I am wondering if there is a way to fix this error without having to create a new column in order to add the str value 'basketball' to the 'Sport' column in the np.where statement.
Use np.where + Series.fillna:
where = df['sports_gpa'].ne(0) & (df['sports_gpa'] < df['music_gpa'])
df['Activity'], df['Sport'] = np.where(
    where,
    (df['sports_gpa'], df['Sport'].fillna('basketball')),
    (df['Activity'], df['Sport']))
You can also use Series.where + Series.mask:
df['Activity'] = df['sports_gpa'].where(where)
df['Sport'] = df['Sport'].mask(where, 'basketball')
print(df)
   sports_gpa  music_gpa  Activity       Sport
0           2        3.0       2.0  basketball
1           0        2.0       NaN         NaN
2           3        3.5       3.0  basketball
3           2        1.0       NaN         NaN
Just figured out I could do:
df['Activity'], df['Sport'] = np.where(
    (df['sports_gpa'] > 0) & (df['music_gpa'] > df['sports_gpa']),
    (df['sports_gpa'], df['Sport'].astype(str).replace({"nan": "basketball"})),
    (df['Activity'], df['Sport']))
My data looks like this. I would like to replace marital_status 'Missing' with 'Married' if 'no_of_children' is not nan.
cust_data_df[['marital_status','no_of_children']]
   marital_status  no_of_children
0         Married             NaN
1         Married             NaN
2         Missing               1
3         Missing               2
4          Single             NaN
5          Single             NaN
6         Married             NaN
7          Single             NaN
8         Married             NaN
9         Married             NaN
10         Single             NaN
This is what I tried:
cust_data_df.loc[cust_data_df['no_of_children'].notna()==True, 'marital_status'].replace({'Missing':'Married'},inplace=True)
But this is not doing anything.
Assign the replaced values back to avoid chained assignment:
m = cust_data_df['no_of_children'].notna()
d = {'Missing':'Married'}
cust_data_df.loc[m, 'marital_status'] = cust_data_df.loc[m, 'marital_status'].replace(d)
If you need to set all matched rows:
cust_data_df.loc[m, 'marital_status'] = 'Married'
EDIT:
Thanks @Quickbeam2k1 for the explanation:
cust_data_df.loc[cust_data_df['no_of_children'].notna()==True, 'marital_status'] is just a new object with no reference to the original. Replacing there will leave the original object unchanged.
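For completeness, the same fix can be written with Series.mask instead of replace; a sketch, assuming the frame above:
m = cust_data_df['no_of_children'].notna()
missing = cust_data_df['marital_status'].eq('Missing')
cust_data_df['marital_status'] = cust_data_df['marital_status'].mask(m & missing, 'Married')
Assigning the whole column back avoids the chained-assignment trap entirely.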
I would like to fill N/A values in a DataFrame in a selective manner. In particular, if there is a sequence of consecutive NaNs within a column, I want them to be filled by the preceding non-NaN value, but only if the length of the NaN sequence is below a specified threshold. For example, if the threshold is 3, then a within-column sequence of 3 or fewer NaNs will be filled with the preceding non-NaN value, whereas a sequence of 4 or more NaNs will be left as is.
That is, if the input DataFrame is
2 5 4
nan nan nan
nan nan nan
5 nan nan
9 3 nan
7 9 1
I want the output to be:
2 5 4
2 5 nan
2 5 nan
5 5 nan
9 3 nan
7 9 1
The fillna function, when applied to a DataFrame, has the method and limit options. But these are unfortunately not sufficient to achieve the task. I tried to specify method='ffill' and limit=3, but that fills in the first 3 NaNs of any sequence, not selectively as described above.
I suppose this can be coded by going column by column with some conditional statements, but I suspect there must be something more Pythonic. Any suggestions on an efficient way to achieve this?
Working with contiguous groups is still a little awkward in pandas... or at least I don't know of a slick way to do this, which isn't at all the same thing. :-)
One way to get what you want would be to use the compare-cumsum-groupby pattern:
In [68]: nulls = df.isnull()
...: groups = (nulls != nulls.shift()).cumsum()
...: to_fill = groups.apply(lambda x: x.groupby(x).transform(len) <= 3)
...: df.where(~to_fill, df.ffill())
...:
Out[68]:
0 1 2
0 2.0 5.0 4.0
1 2.0 5.0 NaN
2 2.0 5.0 NaN
3 5.0 5.0 NaN
4 9.0 3.0 NaN
5 7.0 9.0 1.0
Okay, another alternative which I don't like because it's too tricky:
def method_2(df):
    nulls = df.isnull()
    # fill at most the first 3 NaNs of every run
    filled = df.ffill(limit=3)
    # cells still NaN after the limited fill: the 4th and later
    # elements of runs longer than 3
    unfilled = nulls & filled.isnull()
    # 2.0 marks non-null cells, NaN marks null cells
    nf = nulls.replace({False: 2.0, True: np.nan})
    # backfilling lets each null cell see whether its run reaches an
    # unfilled cell (1) or ends at a non-null cell (2)
    do_not_fill = nf.combine_first(unfilled.replace(False, np.nan)).bfill() == 1
    return df.where(do_not_fill, df.ffill())
This doesn't use any groupby tools and so should be faster. Note that a different approach would be to manually (using shifts) determine which elements are to be filled because they're a group of length 1, 2, or 3.
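For what it's worth, here is a sketch of that manual shift-based idea (ffill_short_runs and thresh are names of my own choosing): a NaN run is "long" exactly when it contains a window of thresh + 1 consecutive NaNs, and every element of a long run lies in at least one such window.
def ffill_short_runs(df, thresh=3):
    nulls = df.isnull()
    # a rolling window of thresh + 1 NaNs flags the end of each such window
    window_end = nulls.astype(int).rolling(thresh + 1).sum().eq(thresh + 1)
    # spread the flag back over every position covered by some window
    in_long_run = window_end.copy()
    for k in range(1, thresh + 1):
        in_long_run |= window_end.shift(-k, fill_value=False)
    # keep long-run NaNs as they are, forward-fill everything else
    return df.where(in_long_run, df.ffill())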
Suppose I have a dataframe as follows,
import pandas as pd
import numpy as np
columns = ['A', 'B', 'C', 'D', 'E', 'F']
index = ['1', '2', '3', '4', '5', '6']
df = pd.DataFrame(columns=columns, index=index)
df.loc['1', 'D'] = 1
df['E'] = 1
df.loc['1', 'F'] = 1
df.loc['2', 'A'] = 1
df.loc['3', 'B'] = 1
df.loc['4', 'C'] = 1
df.loc['5', 'A'] = 1
df.loc['5', 'B'] = 1
df.loc['5', 'C'] = 1
df.loc['6', 'D'] = 1
df.loc['6', 'F'] = 1
df
     A    B    C    D  E    F
1  NaN  NaN  NaN    1  1    1
2    1  NaN  NaN  NaN  1  NaN
3  NaN    1  NaN  NaN  1  NaN
4  NaN  NaN    1  NaN  1  NaN
5    1    1    1  NaN  1  NaN
6  NaN  NaN  NaN    1  1    1
My condition is: I want to remove the columns that have values only when A, B, C (together) have no value, i.e. to find the columns that are mutually exclusive with the A, B, C columns. Put differently, I want to keep only the columns that have values when A or B or C has a value. The output here would be to remove columns D and F. But my dataframe has 400 columns, and I want a way to check this for A, B, C versus all the rest.
One way I can think of is:
Remove the NA rows from A, B, C:
df = df[np.isfinite(df['A'])]
df = df[np.isfinite(df['B'])]
df = df[np.isfinite(df['C'])]
then get the NA count of all columns and check it against the total number of rows,
df.isnull().sum()
and remove the columns whose counts match.
Is there a better and efficient way to do this?
Thanks
Rather than deleting rows, just select the others, where A, B and C are not all NaN at the same time:
mask = df[["A", "B", "C"]].isnull().all(axis=1)
df = df[~mask]
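To then answer the original question (dropping D and F), you can build on that mask; a sketch:
# rows where at least one of A, B, C has a value
sub = df[~mask]
# columns that are all-NaN on those rows only ever carry values
# when A, B and C are empty, i.e. they are mutually exclusive
to_drop = sub.columns[sub.isnull().all(axis=0)]
df = df.drop(columns=to_drop)
On the sample frame this keeps A, B, C, E and drops D and F.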