Remove columns that have NA values for rows - Python

Suppose I have a dataframe as follows,
import pandas as pd
columns = ['A', 'B', 'C', 'D', 'E', 'F']
index = ['1', '2', '3', '4', '5', '6']
df = pd.DataFrame(columns=columns, index=index)
df.loc['1', 'D'] = 1
df['E'] = 1
df.loc['1', 'F'] = 1
df.loc['2', 'A'] = 1
df.loc['3', 'B'] = 1
df.loc['4', 'C'] = 1
df.loc['5', 'A'] = 1
df.loc['5', 'B'] = 1
df.loc['5', 'C'] = 1
df.loc['6', 'D'] = 1
df.loc['6', 'F'] = 1
df
     A    B    C    D  E    F
1  NaN  NaN  NaN    1  1    1
2    1  NaN  NaN  NaN  1  NaN
3  NaN    1  NaN  NaN  1  NaN
4  NaN  NaN    1  NaN  1  NaN
5    1    1    1  NaN  1  NaN
6  NaN  NaN  NaN    1  1    1
My condition is: I want to remove the columns that have values only in rows where A, B, C (together) have no values. That is, I want to find which columns are mutually exclusive with the A, B, C columns taken together; I am interested in keeping only the columns that have values when A or B or C has a value. The output here would be to remove columns D and F. But my dataframe has 400 columns, and I want a way to check this for A, B, C against all the remaining columns.
One way I can think of is to first remove the rows that have NA in A, B, C:
import numpy as np
df = df[np.isfinite(df['A'])]
df = df[np.isfinite(df['B'])]
df = df[np.isfinite(df['C'])]
then get the NA count of every column and compare it with the total number of rows,
df.isnull().sum()
and remove the columns whose counts match.
Is there a better, more efficient way to do this?
Thanks

Rather than deleting rows, just select the other rows, the ones where A, B, C are not all NaN at the same time:
mask = df[["A", "B", "C"]].isnull().all(axis=1)
df = df[~mask]
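Building on this, a minimal sketch of the remaining step (an assumption about the intended workflow, following the NA-count idea from the question): after the filtering, any column that is entirely NaN on the surviving rows is mutually exclusive with A, B, C and can be dropped.
to_drop = df.columns[df.isnull().all()]  # columns empty on every surviving row
df = df.drop(columns=to_drop)            # drops D and F in the example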

Related

Pandas Dataframe - (Column restructure)

I have a dataframe that has n columns. These contain letters; the number of letters a column contains varies, and a letter can appear in several columns. I need pandas code that converts the sheet to columns starting with the letters, where the rows contain the numbers of the columns that each letter was in.
Link to example problem
[Image: example input columns of letters alongside the desired output, where each output column is headed by a letter and its rows list the numbers of the input columns containing that letter.]
The image describes my problem better. Thank you in advance for any help.
Use DataFrame.stack with DataFrame.reset_index to reshape, then DataFrame.sort_values and aggregate lists, and last create the DataFrame with the constructor and a transpose:
s = df.stack().reset_index(name='a').sort_values('level_1').groupby('a')['level_1'].agg(list)
df1 = pd.DataFrame(s.tolist(), index=s.index).T
print (df1)
a a b c d e f
0 1 1 1 1 3 2
1 3 3 2 4 4 None
2 None 4 None None None None
Or use GroupBy.cumcount as a counter and reshape with DataFrame.pivot:
df2 = df.stack().reset_index(name='a').sort_values('level_1')
df2['g'] = df2.groupby('a').cumcount()
df2 = df2.pivot(index='g', columns='a', values='level_1')
print (df2)
a a b c d e f
g
0 1 1 1 1 3 2
1 3 3 2 4 4 NaN
2 NaN 4 NaN NaN NaN NaN
Finally, if necessary, remove the index and column names:
df1 = df1.rename_axis(index=None)
df2 = df2.rename_axis(index=None, columns=None)
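Since the original image is unavailable, here is a hypothetical input frame, reconstructed so that it is consistent with the outputs above (an assumption, given for reproducibility):
import pandas as pd
df = pd.DataFrame({1: ['a', 'b', 'c', 'd'],
                   2: ['c', 'f', None, None],
                   3: ['a', 'b', 'e', None],
                   4: ['b', 'd', 'e', None]})
Running either approach on this frame reproduces the results shown above: the letters become the columns, and the values are the numbers of the input columns in which each letter appears.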

Divide several columns with the same column name ending by one other column in python

I have a similar question to this one.
I have a dataframe with several rows, which looks like this:
Name  TypA  TypB  ...  TypF  TypA_value  TypB_value  ...  TypF_value  Divider
1     1     1     ...  NaN   10          5           ...  NaN         5
2     NaN   2     ...  NaN   NaN         20          ...  NaN         10
and I want to divide all columns ending in "value" by the column "Divider". How can I do so? One trick would be to use sorting and apply the answer from above, but is there a direct way that does not require sorting the dataframe?
The outcome would be:
Name  TypA  TypB  ...  TypF  TypA_value  TypB_value  ...  TypF_value  Divider
1     1     1     ...  NaN   2           1           ...  0           5
2     NaN   2     ...  NaN   0           2           ...  0           10
So a NaN will lead to a 0.
Use DataFrame.filter to select the value-like columns from the dataframe, then DataFrame.div along axis=0 to divide them by the column Divider, and finally DataFrame.update to write the values back into the dataframe:
d = df.filter(like='_value').div(df['Divider'], axis=0).fillna(0)
df.update(d)
Result:
Name TypA TypB TypF TypA_value TypB_value TypF_value Divider
0 1 1.0 1 NaN 2.0 1.0 0.0 5
1 2 NaN 2 NaN 0.0 2.0 0.0 10
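Note that DataFrame.update only writes non-NA values from the frame passed to it, which is why fillna(0) is applied before the update; without it, the NaN division results would leave the original values untouched.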
You could select the columns of interest using DataFrame.filter, and divide as:
value_cols = df.filter(regex=r'_value$').columns
df[value_cols] /= df['Divider'].to_numpy()[:,None]
# df[value_cols] = df[value_cols].fillna(0)
print(df)
Name TypA TypB TypF TypA_value TypB_value TypF_value Divider
0 1 1.0 1 NaN 2.0 1.0 NaN 5
1 2 NaN 2 NaN NaN 2.0 NaN 10
Taking two sample columns A and B:
import pandas as pd
import numpy as np

a = {'Name': [1, 2],
     'TypA': [1, np.nan],
     'TypB': [1, 2],
     'TypA_value': [10, np.nan],
     'TypB_value': [5, 20],
     'Divider': [5, 10]}
df = pd.DataFrame(a)
cols_all = df.columns
Find the columns for which calculations are to be done, assuming they all contain 'value' with an underscore:
cols_to_calc = [c for c in cols_all if '_value' in c]
For these columns: first divide by the Divider column, then replace NaN with 0 in those columns.
for c in cols_to_calc:
    df[c] = df[c] / df.Divider
    df[c] = df[c].fillna(0)

Unable to update Pandas row in For loop

I am using bnp-paribas-cardif-claims-management from Kaggle.
Dataset : https://www.kaggle.com/c/bnp-paribas-cardif-claims-management/data
df=pd.read_csv('F:\\Data\\Paribas_Claim\\train.csv',nrows=5000)
df.info() gives
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Columns: 133 entries, ID to v131
dtypes: float64(108), int64(6), object(19)
memory usage: 5.1+ MB
My requirement:
I am trying to fill the null values of the int and object columns, based on the target column.
My code is
df_obj = df.select_dtypes(['object','int64']).columns.to_list()
for cols in df_obj:
    df[(df['target'] == 1) & (df[cols].isnull())][cols] = df[df['target'] == 1][cols].mode()
    df[(df['target'] == 0) & (df[cols].isnull())][cols] = df[df['target'] == 0][cols].mode()
I am able to get output from the print statement below:
df[( df['target'] == 1 )&( df[cols].isnull() )][cols]
I am also able to print the values of df[df['target'] == 0][cols].mode() if I substitute cols.
But I am unable to replace the null values with the mode values.
I tried df.loc and df.at instead of df[], and df[...] == np.nan instead of df[...].isnull(), but to no avail.
Please advise if I need to make any changes to the code. Thanks.
The problem here is selecting integer columns: they cannot contain missing values (NaN is a float), so there is nothing to replace in them. A possible solution is to select all numeric columns and, in a loop, set the first mode value per condition, using DataFrame.loc to avoid chained indexing and Series.iat to return only the first value (mode can sometimes return 2 values):
df = pd.read_csv('train.csv', nrows=5000)
# only numeric columns
df_obj = df.select_dtypes(np.number).columns.to_list()
# all columns
# df_obj = df.columns.to_list()
# print(df_obj)
for cols in df_obj:
    m1 = df['target'] == 1
    m2 = df['target'] == 0
    df.loc[m1 & (df[cols].isnull()), cols] = df.loc[m1, cols].mode().iat[0]
    df.loc[m2 & (df[cols].isnull()), cols] = df.loc[m2, cols].mode().iat[0]
Another solution replaces the missing values with Series.fillna:
for cols in df_obj:
    m1 = df['target'] == 1
    m2 = df['target'] == 0
    df.loc[m1, cols] = df.loc[m1, cols].fillna(df.loc[m1, cols].mode().iat[0])
    df.loc[m2, cols] = df.loc[m2, cols].fillna(df.loc[m2, cols].mode().iat[0])
print (df.head())
ID target v1 v2 v3 v4 v5 v6 \
0 3 1 1.335739e+00 8.727474 C 3.921026 7.915266 2.599278e+00
1 4 1 -9.543625e-07 1.245405 C 0.586622 9.191265 2.126825e-07
2 5 1 9.438769e-01 5.310079 C 4.410969 5.326159 3.979592e+00
3 6 1 7.974146e-01 8.304757 C 4.225930 11.627438 2.097700e+00
4 8 1 -9.543625e-07 1.245405 C 0.586622 2.151983 2.126825e-07
v7 v8 ... v122 v123 v124 v125 \
0 3.176895e+00 1.294147e-02 ... 8.000000 1.989780 3.575369e-02 AU
1 -9.468765e-07 2.301630e+00 ... 1.499437 0.149135 5.988956e-01 AF
2 3.928571e+00 1.964513e-02 ... 9.333333 2.477596 1.345191e-02 AE
3 1.987549e+00 1.719467e-01 ... 7.018256 1.812795 2.267384e-03 CJ
4 -9.468765e-07 -7.783778e-07 ... 1.499437 0.149135 -9.962319e-07 Z
v126 v127 v128 v129 v130 v131
0 1.804126e+00 3.113719e+00 2.024285 0 0.636365 2.857144e+00
1 5.521558e-07 3.066310e-07 1.957825 0 0.173913 -9.932825e-07
2 1.773709e+00 3.922193e+00 1.120468 2 0.883118 1.176472e+00
3 1.415230e+00 2.954381e+00 1.990847 1 1.677108 1.034483e+00
4 5.521558e-07 3.066310e-07 0.100455 0 0.173913 -9.932825e-07
[5 rows x 133 columns]
You don't have sample data, so I'll just give the methods I think you can use to solve your problem.
Try reading your DataFrame with na_filter=False; that way the fields that would become np.nan are left as blanks instead.
Then, during your loop, use '' as your identifier for null values. It is easier to test for than the type of the value you are parsing.
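A minimal sketch of that approach (the file path and the mode-based fill are assumptions carried over from the question):
import pandas as pd
df = pd.read_csv('train.csv', nrows=5000, na_filter=False)
# With na_filter=False, empty fields stay as '' and affected columns keep object dtype.
for col in df.select_dtypes('object').columns:
    m = df[col] == ''  # '' now marks a missing field
    if m.any() and not m.all():
        df.loc[m, col] = df.loc[~m, col].mode().iat[0]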
I think DataFrame.fillna should help.
# random dataset
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 2, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4]],
                  columns=list('ABCD'))
print(df)
A B C D
0 NaN 2.0 NaN 0
1 3.0 2.0 NaN 1
2 NaN NaN NaN 5
3 NaN 3.0 NaN 4
Assuming you want to replace missing values with the mode value of a given column, I'd just use:
df.fillna({'A':df.A.mode()[0],'B':df.B.mode()[0]})
A B C D
0 3.0 2.0 NaN 0
1 3.0 2.0 NaN 1
2 3.0 2.0 NaN 5
3 3.0 3.0 NaN 4
This would also work if you needed the mode of a subset of values from a given column to fill NaNs with.
# let's add a 'type' column
     A    B   C  D  type
0  NaN  2.0 NaN  0     1
1  3.0  2.0 NaN  1     1
2  NaN  NaN NaN  5     2
3  NaN  3.0 NaN  4     2
For example, if you want to fill the NaNs in df['B'] with the mode of the rows where df['type'] equals 2:
df.fillna({
    'B': df.loc[df.type.eq(2)].B.mode()[0]  # type 2
})
A B C D type
0 NaN 2.0 NaN 0 1
1 3.0 2.0 NaN 1 1
2 NaN 3.0 NaN 5 2
3 NaN 3.0 NaN 4 2
# ↑ this would have been 2.0 had we not filtered the column with df.loc[]
Your problem is this:
df[( df['target'] == 1 )&( df[cols].isnull() )][cols] = ...
Do NOT chain index, especially when assigning. See the "Why does assignment fail when using chained indexing?" section in this doc.
Instead use loc:
df.loc[(df['target'] == 1) & (df[cols].isnull()), cols] = \
    df.loc[df['target'] == 1, cols].mode().iat[0]
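(The chained form fails because df[mask][cols] returns a temporary copy, so the assignment never reaches the original df. Note also that mode() returns a Series, hence the .iat[0] above to take its first value, as in the earlier answer.)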

Merge unaligned DataFrames while filling with empty string

I have multiple DataFrames that I want to merge, where I would like the fill value to be an empty string rather than NaN. Some of the DataFrames already have NaN values in them. concat sort of does what I want but fills the unaligned entries with NaN. How does one avoid filling them with NaN, or specify the fill_value, to achieve something like this:
>>> df1
Value1
0 1
1 NaN
2 3
>>> df2
Value2
1 5
2 NaN
3 7
>>> merge_multiple_without_nan([df1,df2])
Value1 Value2
0 1
1 NaN 5
2 3 NaN
3 7
This is what concat does:
>>> concat([df1,df2], axis=1)
Value1 Value2
0 1 NaN
1 NaN 5
2 3 NaN
3 NaN 7
Well, I couldn't find any function in concat or merge that would handle this by itself, but the code below works without much hassle:
df1 = pd.DataFrame({'Value1': [1, np.nan, 3]}, index=[0, 1, 2])
df2 = pd.DataFrame({'Value2': [5, np.nan, 7]}, index=[1, 2, 3])
# Replace the existing NaN values with temporary placeholders before concatenating.
df = pd.concat([df1.fillna('X'), df2.fillna('Y')], axis=1)
df =
  Value1 Value2
0      1    NaN
1      X      5
2      3      Y
3    NaN      7
Step 2:
df.fillna('', inplace=True)
df =
  Value1 Value2
0      1
1      X      5
2      3      Y
3             7
Step 3:
df.replace(to_replace=['X','Y'], value=np.nan, inplace=True)
df =
  Value1 Value2
0      1
1    NaN      5
2      3    NaN
3             7
After using concat, you can iterate over the DataFrames you merged, find the indices that are missing, and fill them in with an empty string. This should work for concatenating an arbitrary number of DataFrames, as long as your column names are unique.
# Concatenate all of the DataFrames.
merge_dfs = [df1, df2]
full_df = pd.concat(merge_dfs, axis=1)
# Find missing indices for each merged frame, fill with an empty string.
for partial_df in merge_dfs:
    missing_idx = full_df.index.difference(partial_df.index)
    full_df.loc[missing_idx, partial_df.columns] = ''
The resulting output using your sample data:
Value1 Value2
0 1
1 NaN 5
2 3 NaN
3 7
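A more direct variant is also possible (a sketch of my own, under the same setup): reindex each frame onto the union of the indices with fill_value='' before concatenating, so only the alignment gaps become empty strings while pre-existing NaNs survive.
idx = df1.index.union(df2.index)
full_df = pd.concat([d.reindex(idx, fill_value='') for d in (df1, df2)], axis=1)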

pandas: create column on a subset of a dataframe, set null on other rows?

I've got a pandas dataframe and I want to calculate percentiles based on the value of the calc_value column, unless calc_value is null, in which case percentile should also be null.
I'm using scipy's rankdata to calculate the percentiles, because it handles repeated values better than pandas's qcut.
However, rankdata has one flaw, which is that it will happily include null values, and there doesn't seem to be an option to exclude them.
from scipy.stats import rankdata
df = pd.DataFrame({'calc_value': [0, 0.081928, 0.94444, None, None]})
df['rank_val'] = rankdata(df.calc_value.values, method='min')
df.rank_val = df.rank_val - 1
df['percentile'] = (df.rank_val / float(len(df) - 1)) * 100
This produces obviously wrong results:
calc_value rank_val percentile
0 0.000000 0 0
1 0.081928 1 25
2 0.944440 2 50
3 NaN 3 75
4 NaN 4 100
I can calculate the percentiles for all non-null values by slicing the dataframe, and doing the same calculations on the slice:
df_without_nan = df[df.calc_value.notnull()]
But what I don't know is how to push these values back into the main dataframe as df['percentile'], setting percentile and rank_val to be null on any rows where calc_value is also null.
Can anyone advise? I'm looking for the following results:
calc_value rank_val percentile
0 0.000000 0 0
1 0.081928 1 25
2 0.944440 2 50
3 NaN NaN NaN
4 NaN NaN NaN
Use pd.merge:
from scipy import stats
df_nonan = df[df['calc_value'].notnull()]
df_nonan['rank_val'] = stats.rankdata(df_nonan.calc_value.values, method='min')
df_nonan['rank_val'] = df_nonan['rank_val'] - 1
df_nonan['percentile'] = (df_nonan.rank_val / float(len(df) - 1)) * 100
df_merge = pd.merge(df, df_nonan, left_index=True, right_index=True, how='left')
(This will give a SettingWithCopyWarning; if that's a problem, you can call reset_index on both dataframes, merge on the generated index column with pd.merge(df, df_nonan, on='index', how='left'), and drop that column after the merge.) The merged dataframe at this point is
calc_value_x calc_value_y rank_val percentile
0 0.000000 0.000000 0 0
1 0.081928 0.081928 1 25
2 0.944440 0.944440 2 50
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
Then do a bit of cleanup on the redundant columns:
del df_merge['calc_value_x']
df_merge = df_merge.rename(columns = {'calc_value_y' : 'calc_value'})
to wind up with
calc_value rank_val percentile
0 0.000000 0 0
1 0.081928 1 25
2 0.944440 2 50
3 NaN NaN NaN
4 NaN NaN NaN
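An alternative worth noting (a sketch of my own, not from the original answer): compute on the non-null slice and assign back with .loc, so rows with a null calc_value stay NaN automatically and no merge or column cleanup is needed.
import pandas as pd
from scipy.stats import rankdata
df = pd.DataFrame({'calc_value': [0, 0.081928, 0.94444, None, None]})
mask = df['calc_value'].notnull()
# .loc-assignment creates the new columns and leaves the unmatched rows as NaN
df.loc[mask, 'rank_val'] = rankdata(df.loc[mask, 'calc_value'], method='min') - 1
df.loc[mask, 'percentile'] = df.loc[mask, 'rank_val'] / float(len(df) - 1) * 100
print(df)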
