Is there a way to replace NaN values in both categorical and numerical columns at once?
A very simplistic example:
import numpy as np
import pandas as pd
data = {'col_1': [3, np.nan, 1, 2], 'col_2': ['a', 'a', np.nan, 'd']}
df = pd.DataFrame.from_dict(data)
Dataframe:
   col_1 col_2
0    3.0     a
1    NaN     a
2    1.0   NaN
3    2.0     d
Goal:
To replace col_1's NaN with the mean of col_1, and col_2's NaN with the mode ('a') of col_2.
Right now, I have to fill each column individually. If all columns were numeric, or all categorical, it would be easy because the operation could be applied to the whole data frame, but I couldn't find a way to do it in one line for a mixed data frame.
mean only works for numeric types, so fill with the means first, then fill the remaining NaNs with the modes:
df.fillna(df.mean()).fillna(df.mode().iloc[0])
#    col_1 col_2
# 0    3.0     a
# 1    2.0     a
# 2    1.0     a
# 3    2.0     d
If you have ties, the mode will be the one that is sorted first.
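Note that in pandas 2.0+, df.mean() raises a TypeError on non-numeric columns instead of skipping them, so pass numeric_only=True (a small adjustment to the one-liner above):
import numpy as np
import pandas as pd

data = {'col_1': [3, np.nan, 1, 2], 'col_2': ['a', 'a', np.nan, 'd']}
df = pd.DataFrame(data)

# numeric_only=True makes mean() skip the object column instead of raising
df.fillna(df.mean(numeric_only=True)).fillna(df.mode().iloc[0])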
What I would do (this relies on older pandas behaviour, where agg produces NaN for aggregations that fail, such as 'mean' on a string column; bfill then lets the mean row pick up the mode for the non-numeric column):
df.fillna(df.agg(['mean', lambda x: x.value_counts().index[0]]).bfill().iloc[0])
   col_1 col_2
0    3.0     a
1    2.0     a
2    1.0     a
3    2.0     d
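For comparison, here is a more explicit sketch (not from either answer above) that builds one fill value per column, using the mean for numeric dtypes and the mode for everything else:
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

data = {'col_1': [3, np.nan, 1, 2], 'col_2': ['a', 'a', np.nan, 'd']}
df = pd.DataFrame(data)

# One fill value per column: mean for numeric columns, mode otherwise
fill_values = {col: df[col].mean() if is_numeric_dtype(df[col])
               else df[col].mode().iloc[0]
               for col in df.columns}
df.fillna(fill_values)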
Related
How can I set all values in df1 to missing if their position equivalent is a missing value in df2?
Data df1:
Index  Data
    1     3
    2     8
    3     9
Data df2:
Index  Data
    1   NaN
    2     2
    3   NaN
Desired output:
Index  Data
    1   NaN
    2     8
    3   NaN
So I would like to keep the data of df1, but only for the positions where df2 also has data entries. For all NaNs in df2, I would like to replace the corresponding value of df1 with NaN as well.
I tried the following, but it replaced all data points with NaN:
df1 = df1.where(df2 == np.nan, np.nan)
Thank you very much for your help.
Use mask, which does exactly the inverse of where:
df3 = df1.mask(df2.isna())
output:
   Index  Data
0      1   NaN
1      2   8.0
2      3   NaN
In your case, you were setting all elements matching a non-NaN to NaN, and because equality is not the correct way to check for NaN (np.nan == np.nan yields False), the condition was False everywhere and you were setting everything to NaN.
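A minimal reproduction of both behaviours, with the frames reconstructed from the tables above (column names assumed):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Index': [1, 2, 3], 'Data': [3, 8, 9]})
df2 = pd.DataFrame({'Index': [1, 2, 3], 'Data': [np.nan, 2, np.nan]})

print(np.nan == np.nan)          # False: NaN never compares equal
print(df1.where(df2 == np.nan))  # condition is False everywhere, so all NaN
print(df1.mask(df2.isna()))      # NaN exactly where df2 is NaN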
Change df2 == np.nan to df2.notna():
df3 = df1.where(df2.notna(), np.nan)
print(df3)
# Output
   Index  Data
0      1   NaN
1      2   8.0
2      3   NaN
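Since NaN is already the default replacement value of where(), the second argument can also be dropped: df3 = df1.where(df2.notna()) gives the same result.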
I have the following dataframe:
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, 5, np.nan],
                   [np.nan, 3, np.nan, 4]],
                  columns=list('ABCD'))
I want to do a ffill() on column B with df["B"].ffill(inplace=True) which results in the following df:
     A    B    C    D
0  NaN  2.0  NaN  0.0
1  3.0  4.0  NaN  1.0
2  NaN  4.0  5.0  NaN
3  NaN  3.0  NaN  4.0
Now I want to replace all NaN values with their corresponding value from column B. The documentation states that you can give fillna() a Series, so I tried df.fillna(df["B"], inplace=True). This results in the exact same dataframe as above.
However, if I put in a simple value (e.g. df.fillna(0, inplace=True)), then it does work:
     A    B    C    D
0  0.0  2.0  0.0  0.0
1  3.0  4.0  0.0  1.0
2  0.0  4.0  5.0  0.0
3  0.0  3.0  0.0  4.0
The funny thing is that fillna() does seem to work with a Series as the value parameter when operated on another Series object. For example, df["A"].fillna(df["B"], inplace=True) results in:
     A    B    C    D
0  2.0  2.0  NaN  0.0
1  3.0  4.0  NaN  1.0
2  4.0  4.0  5.0  NaN
3  3.0  3.0  NaN  4.0
My real dataframe has a lot of columns and I would hate to manually fillna() all of them. Am I overlooking something here? Didn't I understand the docs correctly perhaps?
EDIT: I have clarified my example in such a way that ffill with axis=1 does not work for me. In reality, my dataframe has many, many columns (hundreds), and I am looking for a way to avoid mentioning all of them explicitly.
Try changing the axis to 1 (columns):
df = df.ffill(axis=1).bfill(axis=1)
If you need to specify the columns, you can do something like this:
df[["B","C"]] = df[["B","C"]].ffill(1)
EDIT:
Since you need something more general and df.fillna(df.B, axis=1) is not implemented yet, you can try:
df = df.T.fillna(df.B).T
Note that an in-place variant (df.T.fillna(df.B, inplace=True)) is not a reliable equivalent: whether df.T shares data with df depends on the dtypes and on pandas' copy semantics, so assign the result of the first form back instead.
This works because the index of df.B coincides with the columns of df.T, so pandas knows how to align the replacement values. From the docs:
value: scalar, dict, Series, or DataFrame.
Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). Values not in the dict/Series/DataFrame will not be filled. This value cannot be a list.
So, for example, the NaN in column 0 at row A (in df.T) will be replaced with the value at index 0 in df.B.
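Putting it together, a self-contained sketch of the transpose trick on the question's data:
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, 5, np.nan],
                   [np.nan, 3, np.nan, 4]],
                  columns=list('ABCD'))
df['B'] = df['B'].ffill()  # forward-fill column B first, as in the question

# df.B's index (0..3) lines up with the columns of df.T, so every row's
# remaining NaNs are filled with that row's B value
result = df.T.fillna(df['B']).T
print(result)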
How do I combine values from two rows with an identical index that have no overlapping values?
import pandas as pd
df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, 5, 6]], index=['a', 'b', 'b'])
df
# input
     0    1    2
a  1.0  2.0  3.0
b  4.0  NaN  NaN
b  NaN  5.0  6.0
Desired output:
     0    1    2
a  1.0  2.0  3.0
b  4.0  5.0  6.0
Use stack(), which drops all NaNs, then unstack():
df.stack().unstack()
If a simpler solution is possible, take the first non-missing value per index label with GroupBy.first:
df1 = df.groupby(level=0).first()
Summing per label happens to give the same output for this sample data:
df1 = df.groupby(level=0).sum()  # df.sum(level=0) in older pandas
If there are multiple non-missing values per group, you need to specify the expected output; that is obviously more complicated.
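A side-by-side sketch of the three options on the sample data:
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, 5, 6]],
                  index=['a', 'b', 'b'])

print(df.stack().unstack())         # drop NaNs, then pivot back
print(df.groupby(level=0).first())  # first non-missing value per label
print(df.groupby(level=0).sum())    # per-label sum (same result here)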
I have a pandas DataFrame with multiple features, where I would like to insert rows of NaNs corresponding to only the first feature: for each value of feat1, a header-like row with a placeholder in feat2 and NaN in the remaining columns (the original post showed the input and desired output as screenshots; the answer's output below illustrates the target).
As I will be dealing with large datasets, speed is important.
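A reconstruction of the assumed input, consistent with the output shown in the answer below:
import numpy as np
import pandas as pd

# Reconstructed input (assumed): three feat1 groups, each with feat2
# values x, y, z and a numeric var column
df = pd.DataFrame({'feat1': list('AAABBBCCC'),
                   'feat2': list('xyz') * 3,
                   'var': range(9)})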
For a general solution that also works with more columns, build a new DataFrame with DataFrame.drop_duplicates, select the feature columns, and overwrite the data in feat2, so that when you concat, all other columns are filled with missing values. Last, restore the correct order with DataFrame.sort_values:
df1 = df.drop_duplicates('feat1')[['feat1','feat2']].assign(feat2='-')
df2 = (pd.concat([df1, df], sort=False, ignore_index=True)
.sort_values('feat1'))
print(df2)
   feat1 feat2  var
0      A     -  NaN
3      A     x  0.0
4      A     y  1.0
5      A     z  2.0
1      B     -  NaN
6      B     x  3.0
7      B     y  4.0
8      B     z  5.0
2      C     -  NaN
9      C     x  6.0
10     C     y  7.0
11     C     z  8.0
My code:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

column_names = ["age", "workclass", "fnlwgt", "education", "education-num",
                "marital-status", "occupation", "relationship", "race", "sex",
                "capital-gain", "capital-loss", "hrs-per-week", "native-country", "income"]
# raw string for the regex separator; a regex sep requires the python engine
adult_train = pd.read_csv("adult.data", header=None, sep=r',\s',
                          na_values=["?"], engine='python')
adult_train.columns = column_names
adult_train.fillna('NA', inplace=True)
I want the index of the rows which have the value 'NA' in more than one column. Is there an inbuilt method or I have to iterate row wise and check values at each column?
Here is a snapshot of the data (screenshot omitted): I want the index of rows like 398 and 409 (missing values in columns B and G), not of rows like 394 (missing value only in column N).
Use isnull().any(axis=1) or isnull().sum(axis=1) to get a boolean mask, then select the rows to get the index, i.e.
df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [np.nan, 4, 5, np.nan, 8],
                   'C': [2, 4, np.nan, 3, 5],
                   'D': [np.nan, np.nan, np.nan, np.nan, 5]})
   A    B    C    D
0  1  NaN  2.0  NaN
1  2  4.0  4.0  NaN
2  3  5.0  NaN  NaN
3  4  NaN  3.0  NaN
4  5  8.0  5.0  5.0
# If you want rows with NaN values in columns B and C
df.loc[df[['B','C']].isnull().any(axis=1)].index
Int64Index([0, 2, 3], dtype='int64')
# If you want rows with more than one NaN
df.loc[df.isnull().sum(axis=1) > 1].index
Int64Index([0, 2, 3], dtype='int64')
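One caveat for the question as posted: the NaNs were already replaced with the string 'NA' by fillna('NA'), so isnull() will no longer find them. Either run the check before filling, or compare against the string instead (a sketch using the question's variable):
# After fillna('NA'), missing entries are the string 'NA', not NaN
rows = adult_train.index[(adult_train == 'NA').sum(axis=1) > 1]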