I have a dataframe in which some rows contain only NaNs and 0s across all columns. I am trying to extract those rows so that I can process them further. Some of the columns are object and some are float. I am trying the code below to extract such rows, but because some of the columns are object, it's not giving me the desired result.
I could work around this by substituting some arbitrary value for NaN and using it in the .isin statement, but that also changes the datatype of my columns, and I would have to convert them back.
Can somebody please help me with a workaround/solution to this?
Thanks.
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':[np.nan,0,np.nan,1,'abc',np.nan], 'b':[0,np.nan,np.nan,1,np.nan,1]})
df
     a    b
0  NaN  0.0
1    0  NaN
2  NaN  NaN
3    1  1.0
4  abc  NaN
5  NaN  1.0
values = [np.nan,0]
df_all_empty = df[df.isin(values).all(1)]
df_all_empty
Expected Output:
     a    b
0  NaN  0.0
1    0  NaN
2  NaN  NaN
Actual Output:
     a    b
0  NaN  0.0
Change it to:
df_all_empty = df[(df.isnull() | df.isin([0])).all(axis=1)]
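This works because df.isnull() flags the missing values directly, while isin([0]) flags the zeros; matching NaN through isin is unreliable on object columns, since NaN does not compare equal to itself. A quick check against the sample frame above:
df[(df.isnull() | df.isin([0])).all(axis=1)]
     a    b
0  NaN  0.0
1    0  NaN
2  NaN  NaN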
The code below will let you select those rows.
df_sel = df.loc[(df.a.isnull()) |
                (df.b.isnull()) |
                (df.a == 0) |
                (df.b == 0)]
If you want to set column 'a' in those rows to some flag value, say -9999, you can use:
df.loc[(df.a.isnull()) |
       (df.b.isnull()) |
       (df.a == 0) |
       (df.b == 0), 'a'] = -9999
For reference, see the official documentation on boolean indexing:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing
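If you have many columns, spelling out every per-column condition gets unwieldy. The same selection can be written column-agnostically; a sketch:
df_sel = df.loc[(df.isnull() | (df == 0)).any(axis=1)]
This keeps any row in which at least one column is NaN or 0, which is what the explicit | chain above does for columns a and b.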
You can use df.query together with the trick of comparing a value to itself to detect NaN (NaN is the only value that does not equal itself).
Write something like this:
df.query("(a!=a or a==0) and (b!=b or b==0)")
And the output is:
     a    b
0  NaN  0.0
1    0  NaN
2  NaN  NaN
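One caveat: df.query evaluates with the numexpr engine by default when numexpr is installed, and that engine may not cope with an object-dtype column such as a here. If you run into that, forcing the slower Python engine should behave the same:
df.query("(a != a or a == 0) and (b != b or b == 0)", engine="python")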
Related
I have a similar question to this one.
I have a dataframe with several rows, which looks like this:
Name  TypA  TypB  ...  TypF  TypA_value  TypB_value  ...  TypF_value  Divider
1     1     1     ...  NaN   10          5           ...  NaN         5
2     NaN   2     ...  NaN   NaN         20          ...  NaN         10
and I want to divide all columns ending in "value" by the column "Divider". How can I do so? One trick would be to use sorting so I could apply the answer from above, but is there a direct way, so that I do not need to sort the dataframe?
The outcome would be:
Name  TypA  TypB  ...  TypF  TypA_value  TypB_value  ...  TypF_value  Divider
1     1     1     ...  NaN   2           1           ...  0           5
2     NaN   2     ...  NaN   0           2           ...  0           10
So a NaN will lead to a 0.
Use DataFrame.filter to select the value columns from the dataframe, DataFrame.div with axis=0 to divide them by the column Divider, and finally DataFrame.update to write the values back into the dataframe:
d = df.filter(like='_value').div(df['Divider'], axis=0).fillna(0)
df.update(d)
Result:
   Name  TypA  TypB  TypF  TypA_value  TypB_value  TypF_value  Divider
0     1   1.0     1   NaN         2.0         1.0         0.0        5
1     2   NaN     2   NaN         0.0         2.0         0.0       10
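Since d holds exactly the _value columns, a direct assignment back is an equivalent sketch here:
df[d.columns] = d
The difference in general: DataFrame.update only writes non-NaN values from its argument, while plain assignment writes everything; since d was already filled with 0, the two agree in this case.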
You could select the columns of interest using DataFrame.filter, and divide as:
value_cols = df.filter(regex=r'_value$').columns
df[value_cols] /= df['Divider'].to_numpy()[:,None]
# df[value_cols] = df[value_cols].fillna(0)
print(df)
   Name  TypA  TypB  TypF  TypA_value  TypB_value  TypF_value  Divider
0     1   1.0     1   NaN         2.0         1.0         NaN        5
1     2   NaN     2   NaN         NaN         2.0         NaN       10
Taking two sample columns, A and B:
import pandas as pd
import numpy as np
a = {'Name': [1, 2],
     'TypA': [1, np.nan],
     'TypB': [1, 2],
     'TypA_value': [10, np.nan],
     'TypB_value': [5, 20],
     'Divider': [5, 10]}
df = pd.DataFrame(a)
cols_all = df.columns
Find the columns for which calculations are to be done, assuming they all contain '_value':
cols_to_calc = [c for c in cols_all if '_value' in c]
For these columns, first divide by the Divider column, then replace NaN with 0:
for c in cols_to_calc:
    df[c] = df[c] / df.Divider
    df[c] = df[c].fillna(0)
This is my current data frame:
sports_gpa  music_gpa  Activity  Sport
         2          3       nan    nan
         0          2       nan    nan
         3        3.5       nan    nan
         2          1       nan    nan
I have the following condition:
If the 'sports_gpa' is greater than 0 and the 'music_gpa' is greater than the 'sports_gpa', fill the 'Activity' column with the 'sports_gpa' value and fill the 'Sport' column with the str 'basketball'.
Expected output:
sports_gpa  music_gpa  Activity  Sport
         2          3         2  basketball
         0          2       nan  nan
         3        3.5         3  basketball
         2          1       nan  nan
To do this I would use the following statement...
df['Activity'], df['Sport'] = np.where(
    (df['sports_gpa'] > 0) & (df['music_gpa'] > df['sports_gpa']),
    (df['sports_gpa'], 'basketball'),
    (df['Activity'], df['Sport']))
This of course gives an error that operands could not be broadcast together with shapes.
To fix this I could add a column to the data frame...
df.loc[:, 'str'] = 'basketball'
df['Activity'], df['Sport'] = np.where(
    (df['sports_gpa'] > 0) & (df['music_gpa'] > df['sports_gpa']),
    (df['sports_gpa'], df['str']),
    (df['Activity'], df['Sport']))
This gives me my expected output.
I am wondering if there is a way to fix this error without having to create a new column in order to add the str value 'basketball' to the 'Sport' column in the np.where statement.
Use np.where + Series.fillna:
where = df['sports_gpa'].ne(0) & (df['sports_gpa'] < df['music_gpa'])
df['Activity'], df['Sport'] = np.where(
    where,
    (df['sports_gpa'], df['Sport'].fillna('basketball')),
    (df['Activity'], df['Sport']))
You can also use Series.where + Series.mask:
df['Activity'] = df['sports_gpa'].where(where)
df['Sport'] = df['Sport'].mask(where, 'basketball')
print(df)
   sports_gpa  music_gpa  Activity       Sport
0           2        3.0       2.0  basketball
1           0        2.0       NaN         NaN
2           3        3.5       3.0  basketball
3           2        1.0       NaN         NaN
Just figured out I could do:
df['Activity'], df['Sport'] = np.where(
    (df['sports_gpa'] > 0) & (df['music_gpa'] > df['sports_gpa']),
    (df['sports_gpa'], df['Sport'].astype(str).replace({"nan": "basketball"})),
    (df['Activity'], df['Sport']))
My data looks like this. I would like to replace marital_status 'Missing' with 'Married' if 'no_of_children' is not nan.
cust_data_df[['marital_status','no_of_children']]
   marital_status  no_of_children
0         Married             NaN
1         Married             NaN
2         Missing               1
3         Missing               2
4          Single             NaN
5          Single             NaN
6         Married             NaN
7          Single             NaN
8         Married             NaN
9         Married             NaN
10         Single             NaN
This is what I tried:
cust_data_df.loc[cust_data_df['no_of_children'].notna()==True, 'marital_status'].replace({'Missing':'Married'},inplace=True)
But this is not doing anything.
Assign the replaced values back to avoid chained assignment:
m = cust_data_df['no_of_children'].notna()
d = {'Missing':'Married'}
cust_data_df.loc[m, 'marital_status'] = cust_data_df.loc[m, 'marital_status'].replace(d)
If you need to set all matched rows:
cust_data_df.loc[m, 'marital_status'] = 'Married'
EDIT:
Thanks @Quickbeam2k1 for the explanation:
cust_data_df.loc[cust_data_df['no_of_children'].notna()==True, 'marital_status'] is just a new object with no reference to the original. Replacing there will leave the original object unchanged.
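For completeness, the same fix can be written with Series.mask instead of replace; a sketch, assuming the frame above:
m = cust_data_df['no_of_children'].notna()
missing = cust_data_df['marital_status'].eq('Missing')
cust_data_df['marital_status'] = cust_data_df['marital_status'].mask(m & missing, 'Married')
Assigning the whole column back avoids the chained-assignment trap entirely.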
I would like to fill N/A values in a DataFrame in a selective manner. In particular, if there is a sequence of consecutive NaNs within a column, I want them to be filled by the preceding non-NaN value, but only if the length of the NaN sequence is below a specified threshold. For example, if the threshold is 3, then a within-column sequence of 3 or fewer NaNs will be filled with the preceding non-NaN value, whereas a sequence of 4 or more NaNs will be left as is.
That is, if the input DataFrame is
2 5 4
nan nan nan
nan nan nan
5 nan nan
9 3 nan
7 9 1
I want the output to be:
2 5 4
2 5 nan
2 5 nan
5 5 nan
9 3 nan
7 9 1
The fillna function, when applied to a DataFrame, has the method and limit options. But these are unfortunately not sufficient to achieve the task. I tried to specify method='ffill' and limit=3, but that fills in the first 3 NaNs of any sequence, not selectively as described above.
I suppose this can be coded by going column by column with some conditional statements, but I suspect there must be something more Pythonic. Any suggestions on an efficient way to achieve this?
Working with contiguous groups is still a little awkward in pandas... or at least I don't know of a slick way to do this, which isn't at all the same thing. :-)
One way to get what you want would be to use the compare-cumsum-groupby pattern:
In [68]: nulls = df.isnull()
...: groups = (nulls != nulls.shift()).cumsum()
...: to_fill = groups.apply(lambda x: x.groupby(x).transform(len) <= 3)
...: df.where(~to_fill, df.ffill())
...:
Out[68]:
0 1 2
0 2.0 5.0 4.0
1 2.0 5.0 NaN
2 2.0 5.0 NaN
3 5.0 5.0 NaN
4 9.0 3.0 NaN
5 7.0 9.0 1.0
Okay, another alternative which I don't like because it's too tricky:
def method_2(df):
    nulls = df.isnull()
    # fill at most the first 3 NaNs of every run
    filled = df.ffill(limit=3)
    # cells still NaN after the limited fill: the 4th and later
    # elements of runs longer than 3
    unfilled = nulls & filled.isnull()
    # 2.0 marks non-null cells, NaN marks null cells
    nf = nulls.replace({False: 2.0, True: np.nan})
    # backfilling lets each null cell see whether its run reaches an
    # unfilled cell (1) or ends at a non-null cell (2)
    do_not_fill = nf.combine_first(unfilled.replace(False, np.nan)).bfill() == 1
    return df.where(do_not_fill, df.ffill())
This doesn't use any groupby tools and so should be faster. Note that a different approach would be to manually (using shifts) determine which elements are to be filled because they're a group of length 1, 2, or 3.
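For what it's worth, here is a sketch of that manual shift-based idea (ffill_short_runs and thresh are names of my own choosing): a NaN run is "long" exactly when it contains a window of thresh + 1 consecutive NaNs, and every element of a long run lies in at least one such window.
def ffill_short_runs(df, thresh=3):
    nulls = df.isnull()
    # a rolling window of thresh + 1 NaNs flags the end of each such window
    window_end = nulls.astype(int).rolling(thresh + 1).sum().eq(thresh + 1)
    # spread the flag back over every position covered by some window
    in_long_run = window_end.copy()
    for k in range(1, thresh + 1):
        in_long_run |= window_end.shift(-k, fill_value=False)
    # keep long-run NaNs as they are, forward-fill everything else
    return df.where(in_long_run, df.ffill())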
Suppose I have a dataframe as follows,
import pandas as pd
import numpy as np
columns = ['A', 'B', 'C', 'D', 'E', 'F']
index = ['1', '2', '3', '4', '5', '6']
df = pd.DataFrame(columns=columns, index=index)
df.loc['1', 'D'] = 1
df['E'] = 1
df.loc['1', 'F'] = 1
df.loc['2', 'A'] = 1
df.loc['3', 'B'] = 1
df.loc['4', 'C'] = 1
df.loc['5', 'A'] = 1
df.loc['5', 'B'] = 1
df.loc['5', 'C'] = 1
df.loc['6', 'D'] = 1
df.loc['6', 'F'] = 1
df
     A    B    C    D  E    F
1  NaN  NaN  NaN    1  1    1
2    1  NaN  NaN  NaN  1  NaN
3  NaN    1  NaN  NaN  1  NaN
4  NaN  NaN    1  NaN  1  NaN
5    1    1    1  NaN  1  NaN
6  NaN  NaN  NaN    1  1    1
My condition is: I want to remove the columns that have values only when A, B, C (together) have no value, i.e. to find the columns that are mutually exclusive with the A, B, C columns. Put differently, I want to keep only the columns that have values when A or B or C has a value. The output here would be to remove columns D and F. But my dataframe has 400 columns, and I want a way to check this for A, B, C versus all the rest.
One way I can think of is:
Remove the NA rows from A, B, C:
df = df[np.isfinite(df['A'])]
df = df[np.isfinite(df['B'])]
df = df[np.isfinite(df['C'])]
then get the NA count of all columns and check it against the total number of rows,
df.isnull().sum()
and remove the columns whose counts match.
Is there a better and efficient way to do this?
Thanks
Rather than deleting rows, just select the others, where A, B and C are not all NaN at the same time:
mask = df[["A", "B", "C"]].isnull().all(axis=1)
df = df[~mask]
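To then answer the original question (dropping D and F), you can build on that mask; a sketch:
# rows where at least one of A, B, C has a value
sub = df[~mask]
# columns that are all-NaN on those rows only ever carry values
# when A, B and C are empty, i.e. they are mutually exclusive
to_drop = sub.columns[sub.isnull().all(axis=0)]
df = df.drop(columns=to_drop)
On the sample frame this keeps A, B, C, E and drops D and F.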