Dropping columns with >N NaNs excluding specific columns

Dropping columns with >N NaNs excluding specific columns - python

I'm wondering if the there is a consice way to do exclude all columns with more than N NaNs, excluding one column from this subset.
For example:
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
[3, 4, np.nan, 1],
[np.nan, np.nan, np.nan, 5]],
columns=list('ABCD'))
Results in:
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
2 NaN NaN NaN 5
Running the following, I get:
df.dropna(thresh=2, axis=1)
B D
0 2.0 0
1 4.0 1
2 NaN 5
I would like to keep column 'C'. I.e., to perform this thresholding except on column 'C'.
Is that possible?

You can put the column back once you've done the thresholding. If you do this all on one line, you don't even need to store a reference to the column.
import pandas as pd
import numpy as np
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
[3, 4, np.nan, 1],
[np.nan, np.nan, np.nan, 5]],
columns=list('ABCD'))
df.dropna(thresh=2, axis=1).assign(C=df['C'])
You could also do
C = df['C']
df.dropna(thresh=2, axis=1)
df.assign(C=C)
As suggested by #Wen, you can also do an indexing operation that won't remove column C to begin with.
threshold = 2
df = df.loc[:, (df.isnull().sum(0) < threshold) | (df.columns == 'C')]
The index here for the column will select columns that have fewer than threshold NaN values, or whose name is C. If you wanted to include more than just one column in the exception, you can chain more conditions with the "or" operator |. For example:
df = df.loc[
:,
(df.isnull().sum(0) < threshold) |
(df.columns == 'C') |
(df.columns == 'D')]

df.loc[:,(df.isnull().sum(0)<=1)|(df.isnull().sum(0)==len(df))]
Out[415]:
B C D
0 2.0 NaN 0
1 4.0 NaN 1
2 NaN NaN 5
As per Zero's suggestion
df.loc[:,(df.isnull().sum(0)<=1)|(df.isnull().all(0))]
EDIT :
df.loc[:,(df.isnull().sum(0)<=1)|(df.columns=='C')]

Another take that blends some concepts from other answers.
df.loc[:, df.isnull().assign(C=False).sum().lt(2)]
B C D
0 2.0 NaN 0
1 4.0 NaN 1
2 NaN NaN 5

Related

Groupby and agg produce NaNs when used with diff

I have an indexed dataset like this
np.random.seed(1)
df = pd.DataFrame({'A': [1, 1, 2, 2],
'B': [1, 2, 3, 4],
'C': np.random.randn(4)},
index = [5,242,12,634])
Now I'm trying to get the difference of C by group like so
df.groupby('A').agg('diff')
which gives me the output
B C
5 NaN NaN
242 1.0 -2.492028
12 NaN NaN
634 1.0 -0.455332
I'm trying to get a resulting dataframe with only 2 rows, which contain the differences like so
B C
1.0 -2.492028
1.0 -0.455332
How can I achieve this?

First diff is not a agg function which will return the same length of out put same as original dataframe , if you would like the diff without NaN we should do dropna
out = df.groupby('A').diff().dropna()

Is there a pythonic way of shifting pandas dataframe cells to the left, while pushing out or overwriting any nan?

I have a pandas dataframe (starting_df) with nan values in the left-hand columns. I'd like to shift all values over to the left for a left-aligned dataframe. My Dataframe is 24x24, but for argument's sake, I'm just posting a 4x4 version.
After some cool initial answers here, I modified the dataframe to also include a non-leading nan, who's position I'd like to preserve.
I have a piece of code that accomplishes what I want, but it relies on nested for-loops and suppressing an IndexError, which does not feel very pythonic. I have no experience with error handling in general, but simply suppressing an error does not seem to be the right strategy.
Starting dataframe and desired final dataframe:
Here is the code that (poorly) accomplishes the goal.
import pandas as pd
import numpy as np
def get_left_aligned(starting_df):
"""take a starting df with right-aligned numbers and nan, and
turn it into a left aligned table."""
left_aligned_df = pd.DataFrame()
for temp_index_1 in range(0, starting_df.shape[0]):
temp_series = []
for temp_index_2 in range(0, starting_df.shape[0]):
try:
temp_series.append(starting_df.iloc[temp_index_2, temp_index_2 + temp_index_1])
temp_index_2 += 1
except IndexError:
pass
temp_series = pd.DataFrame(temp_series, columns=['col'+str(temp_index_1 + 1)])
left_aligned_df = pd.concat([left_aligned_df, temp_series], axis=1)
return left_aligned_df
df = pd.DataFrame(dict(col1=[1, np.nan, np.nan, np.nan],
col2=[5, 2, np.nan, np.nan],
col3=[7, np.nan, 3, np.nan],
col4=[9, 8, 6, 4]))
df_expected = pd.DataFrame(dict(col1=[1, 2, 3, 4],
col2=[5, np.nan, 6, np.nan],
col3=[7, 8, np.nan, np.nan],
col4=[9, np.nan, np.nan, np.nan]))
df_left = get_left_aligned(df)
I appreciate any help with this.
Thanks!

or transpose the df and use shift to shift by column, when the NA num is increasing 1 by 1.
dfn = df.T.copy()
for i, col in enumerate(dfn.columns):
dfn[col] = dfn[col].shift(-i)
dfn = dfn.T
print(dfn)
col1 col2 col3 col4
0 1.0 5.0 7.0 9.0
1 2.0 NaN 8.0 NaN
2 3.0 6.0 NaN NaN
3 4.0 NaN NaN NaN

One way to resolve your challenge is to move the data into numpy territory, sort the data, then return it as a pandas DataFrame:
Numpy converts pandas NA to object data type; pd.to_numeric resolves that to data types numpy can work with.
pd.DataFrame(
np.sort(df.transform(pd.to_numeric).to_numpy(), axis=1),
columns=df.columns,
dtype="Int64",
)
col1 col2 col3
0 1 4 6
1 2 5 <NA>
2 3 <NA> <NA>

You can sort values on the row based on their positions, keeping the nan values at the end, giving them a very high value (np.nan, for example), rather than their actual position.
df.T.apply(
lambda x: [z[1] for z in sorted(enumerate(x), key=(lambda k: np.inf if pd.isna(k[1]) else k[0]), reverse=False)],
axis=0
).T
Here an example:
import numpy as np
import pandas as pd
df = pd.DataFrame(
data = [
[np.nan, 2, 4, 7],
[np.nan, np.nan, 6, 9],
[np.nan, np.nan, np.nan, 10],
[np.nan, np.nan, np.nan, np.nan],
],
columns=['A', 'B', 'C', 'D']
)
df2 = df.T.apply(
lambda x: [z[1] for z in sorted(enumerate(x), key=(lambda k: np.inf if pd.isna(k[1]) else k[0]), reverse=False)],
axis=0
).T
And this id df2:
A B C D
0 2.0 4.0 7.0 NaN
1 6.0 9.0 NaN NaN
2 10.0 NaN NaN NaN
3 NaN NaN NaN NaN
EDIT
If you have rows with NaNs after the first not NaN value, you can use this approach based on first_valid_index:
df.apply(
lambda x: x.shift(-list(x.index).index(x.first_valid_index() or x.index[0])),
axis=1,
)
An example for this case:
import numpy as np
import pandas as pd
df = pd.DataFrame(
data = [
[np.nan, 2, 4, 7],
[np.nan, np.nan, 6, 9],
[np.nan, np.nan, np.nan, 10],
[np.nan, np.nan, np.nan, np.nan],
[np.nan, 5, np.nan, 3],
],
columns=['A', 'B', 'C', 'D']
)
df3 = df.apply(
lambda x: x.shift(-list(x.index).index(x.first_valid_index() or x.index[0])),
axis=1,
)
And df3 is:
A B C D
0 2.0 4.0 7.0 NaN
1 6.0 9.0 NaN NaN
2 10.0 NaN NaN NaN
3 NaN NaN NaN NaN
4 5.0 NaN 3.0 NaN

Pandas merging mixed length datasets without duplicate columns

I'm trying to merge several mixed dataframes, with some missing values which sometimes exist in other dataframes to one combined dataset, some dataframes also might contain extra columns, then those should be added and all other rows have NaN as values.
This based on one or several columns, the row index has no meaning, the true dataset has many columns so manually removing anything is very much less than desirable.
So essentially, merging several dataframes based on one or several columns, prioritizing any non NaN value, or if two conflicting non NaN values would exist then prioritize the existing value in the base dataframe and not the one being merged in.
df1 = pd.DataFrame({
'id': [1, 2, 4],
'data_one': [np.nan, 3, np.nan],
'data_two': [4, np.nan, np.nan],
})
id data_one data_two
0 1 NaN 4.0
1 2 3.0 NaN
2 4 NaN NaN
df2 = pd.DataFrame({
'id': [1, 3],
'data_one': [8, np.nan],
'data_two': [np.nan, 4],
'data_three': [np.nan, 100]
})
id data_one data_two data_three
0 1 8.0 NaN NaN
1 3 NaN 4.0 100.0
# Desired result
res = pd.DataFrame({
'id': [1, 2, 3, 4],
'data_one': [8, 3, np.nan, np.nan],
'data_two': [4, np.nan, 4, np.nan],
'data_three': [np.nan, np.nan, 100, np.nan],
})
id data_one data_two data_three
0 1 8.0 4.0 NaN
1 2 3.0 NaN NaN
2 3 NaN 4.0 100.0
3 4 NaN NaN NaN
The functions I have been experimenting with so far is pd.merge(), pd.join(), pd.combine_first() but haven't had any success, maybe missing something easy.

You can do a groupby() coupled with fillna():
pd.concat([df1,df2]).groupby('id').apply(lambda x: x.ffill().bfill()).drop_duplicates()
results in:
id data_one data_two data_three
0 1 8.0 4.0 NaN
1 2 3.0 NaN NaN
1 3 NaN 4.0 100.0
2 4 NaN NaN NaN
Note that it is gonna return separate rows for the positions where df1 and df2 both have a non-null values. Which is intentional since I don't know what you want to do with in such cases.

Replace missing data based on certain conditions

Let's say I have data:
a b
0 1.0 NaN
1 6.0 1
2 3.0 NaN
3 1.0 NaN
I would like to iterate over this data to see,
if Data[i] == NaN **and** column['a'] == 1.0 then replace NAN with 4 instead of replace by 4 in any NaN you see. How shall I go about it? I tried every for if function and it didn't work. I also did
for i in df.itertuples():
but the problem is df.itertuples() doesn't have a replace functionality and the other methods I've seen were to do it one by one.
End Result looking for:
a b
0 1.0 4
1 6.0 1
2 3.0 NaN
3 1.0 4

def func(x):
if x['a'] == 1 and pd.isna(x['b']):
x['b'] = 4
return x
df = pd.DataFrame.from_dict({'a': [1.0, 6.0, 3.0, 1.0], 'b': [np.nan, 1, np.nan, np.nan]})
df.apply(func, axis=1)
Instead of iterrows(), apply() may be a better option.

You can create a mask and then fill in the intended NaNs using that mask:
df = pd.DataFrame({'a': [1,6,3,1], 'b': [np.nan, 1, np.nan, np.nan]})
mask = df[['a', 'b']].apply(lambda x: (x[0] == 1) and (pd.isna(x[1])), axis=1)
df['b'] = df['b'].mask(mask, df['b'].fillna(4))
print(df)
a b
0 1 4.0
1 6 1.0
2 3 NaN
3 1 4.0

df2 = df[df['a']==1.0].fillna(4.0)
df2.combine_first(df)
Can this help you?

Like you said, you can achieve this by combining 2 conditions: a==1 and b==Nan.
To combine two conditions in python you can use &.
In your example:
import pandas as pd
import numpy as np
# Create sample data
d = {'a': [1, 6, 3, 1], 'b': [np.nan, 1, np.nan, np.nan]}
df = pd.DataFrame(data=d)
# Convert to numeric
df = df.apply(pd.to_numeric, errors='coerce')
print(df)
# Replace Nans
df[ (df['a'] == 1 ) & np.isnan(df['b']) ] = 4
print(df)
Should do the trick.

Iterate through two data frames and update a column of the first data frame with a column of the second data frame in pandas

I am converting a piece of code written in R to python. The following code is in R. df1 and df2 are the dataframes. id, case, feature, feature_value are column names. The code in R is
for(i in 1:dim(df1)[1]){
temp = subset(df2,df2$id == df1$case[i],select = df1$feature[i])
df1$feature_value[i] = temp[,df1$feature[i]]
}
My code in python is as follows.
for i in range(0,len(df1)):
temp=np.where(df1['case'].iloc[i]==df2['id']),df1['feature'].iloc[i]
df1['feature_value'].iloc[i]=temp[:,df1['feature'].iloc[i]]
but it gives
TypeError: tuple indices must be integers or slices, not tuple
How to rectify this error? Appreciate any help.

Unfortunately, R and Pandas handle dataframes pretty differently. If you'll be using Pandas a lot, it would probably be worth going through a tutorial on it.
I'm not too familiar with R so this is what I think you want to do:
Find rows in df1 where the 'case' matches an 'id' in df2. If such a row is found, add the "feature" in df1 to a new df1 column called "feature_value."
If so, you can do this with the following:
#create a sample df1 and df2
>>> df1 = pd.DataFrame({'case': [1, 2, 3], 'feature': [3, 4, 5]})
>>> df1
case feature
0 1 3
1 2 4
2 3 5
>>> df2 = pd.DataFrame({'id': [1, 3, 7], 'age': [45, 63, 39]})
>>> df2
id age
0 1 45
1 3 63
2 7 39
#create a list with all the "id" values of df2
>>> df2_list = df2['id'].to_list()
>>> df2_list
[1, 3, 7]
#lambda allows small functions; in this case, the value of df1['feature_value']
#for each row is assigned df1['feature'] if df1['case'] is in df2_list,
#and otherwise it is assigned np.nan.
>>> df1['feature_value'] = df1.apply(lambda x: x['feature'] if x['case'] in df2_list else np.nan, axis=1)
>>> df1
case feature feature_value
0 1 3 3.0
1 2 4 NaN
2 3 5 5.0
Instead of lamda, a full function can be created, which may be easier to understand:
def get_feature_values(df, id_list):
if df['case'] in id_list:
feature_value = df['feature']
else:
feature_value = np.nan
return feature_value
df1['feature_value'] = df1.apply(get_feature_values, id_list=df2_list, axis=1)
Another way of going about this would involve merging df1 and df2 to find rows where the "case" value in df1 matches an "id" value in df2 (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html)
===================
To address the follow-up question in the comments:
You can do this by merging the databases and then creating a function.
#create example dataframes
>>> df1 = pd.DataFrame({'case': [1, 2, 3], 'feature': [3, 4, 5], 'names': ['a', 'b', 'c']})
>>> df2 = pd.DataFrame({'id': [1, 3, 7], 'age': [45, 63, 39], 'a': [30, 31, 32], 'b': [40, 41, 42], 'c': [50, 51, 52]})
#merge the dataframes
>>> df1 = df1.merge(df2, how='left', left_on='case', right_on='id')
>>> df1
case feature names id age a b c
0 1 3 a 1.0 45.0 30.0 40.0 50.0
1 2 4 b NaN NaN NaN NaN NaN
2 3 5 c 3.0 63.0 31.0 41.0 51.0
Then you can create the following function:
def get_feature_values_2(df):
if pd.notnull(df['id']):
feature_value = df['feature']
column_of_interest = df['names']
feature_extended_value = df[column_of_interest]
else:
feature_value = np.nan
feature_extended_value = np.nan
return feature_value, feature_extended_value
# "result_type='expand'" allows multiple values to be returned from the function
df1[['feature_value', 'feature_extended_value']] = df1.apply(get_feature_values_2, result_type='expand', axis=1)
#This results in the following dataframe:
case feature names id age a b c feature_value \
0 1 3 a 1.0 45.0 30.0 40.0 50.0 3.0
1 2 4 b NaN NaN NaN NaN NaN NaN
2 3 5 c 3.0 63.0 31.0 41.0 51.0 5.0
feature_extended_value
0 30.0
1 NaN
2 51.0
#To keep only a subset of the columns:
#First create a copy-pasteable list of the column names
list(df1.columns)
['case', 'feature', 'names', 'id', 'age', 'a', 'b', 'c', 'feature_value', 'feature_extended_value']
#Choose the subset of columns you would like to keep
df1 = df1[['case', 'feature', 'names', 'feature_value', 'feature_extended_value']]
df1
case feature names feature_value feature_extended_value
0 1 3 a 3.0 30.0
1 2 4 b NaN NaN
2 3 5 c 5.0 51.0

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Dropping columns with >N NaNs excluding specific columns - python

df.loc[:,(df.isnull().sum(0)<=1)|(df.isnull().sum(0)==len(df))] Out[415]: B C D 0 2.0 NaN 0 1 4.0 NaN 1 2 NaN NaN 5 As per Zero's suggestion df.loc[:,(df.isnull().sum(0)<=1)|(df.isnull().all(0))] EDIT : df.loc[:,(df.isnull().sum(0)<=1)|(df.columns=='C')]

Another take that blends some concepts from other answers. df.loc[:, df.isnull().assign(C=False).sum().lt(2)] B C D 0 2.0 NaN 0 1 4.0 NaN 1 2 NaN NaN 5

Related

Groupby and agg produce NaNs when used with diff

Is there a pythonic way of shifting pandas dataframe cells to the left, while pushing out or overwriting any nan?

Pandas merging mixed length datasets without duplicate columns

Replace missing data based on certain conditions

Iterate through two data frames and update a column of the first data frame with a column of the second data frame in pandas

Categories

Resources