Pandas excel: write rows with NaN values in a separate DataFrame - python

I have a df with 2 columns: Name and Number.
I need to write any row that has a NaN cell to a new DataFrame.
path = 'Files/Directory.xlsx'
df = pd.read_excel(path)
I've tried so many different things, spent 3 days and still can't get it.

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Alex", "Bob", "Jim", np.nan, np.nan],
        "Number": [1, 2, np.nan, 3, np.nan],
    }
)
df
   Name  Number
0  Alex     1.0
1   Bob     2.0
2   Jim     NaN
3   NaN     3.0
4   NaN     NaN
It depends on whether you want to write rows with any NaN values to the new DataFrame, or only rows where all values are NaN.
If any, the following should work:
df_nan = df.loc[df.isnull().any(axis=1)]
df_nan
   Name  Number
2   Jim     NaN
3   NaN     3.0
4   NaN     NaN
If all, this should work:
df_nan = df.loc[df.isnull().all(axis=1)]
df_nan
  Name  Number
4  NaN     NaN
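Putting it together with the Excel file from the question (a minimal sketch of the "any NaN" variant; the output filename 'Files/nan_rows.xlsx' is just an assumed name):
import pandas as pd

path = 'Files/Directory.xlsx'
df = pd.read_excel(path)

# rows with at least one NaN go into a separate DataFrame
df_nan = df.loc[df.isnull().any(axis=1)]

# optionally write those rows out to their own Excel file
df_nan.to_excel('Files/nan_rows.xlsx', index=False)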

Related

Pandas merging mixed length datasets without duplicate columns

I'm trying to merge several mixed dataframes, with some missing values that sometimes exist in other dataframes, into one combined dataset. Some dataframes may also contain extra columns; those should be added, and all other rows get NaN as the value.
The merge is based on one or several columns; the row index has no meaning. The true dataset has many columns, so manually removing anything is very much undesirable.
So essentially: merge several dataframes based on one or several columns, prioritizing any non-NaN value; if two conflicting non-NaN values exist, prioritize the value already in the base dataframe rather than the one being merged in.
df1 = pd.DataFrame({
    'id': [1, 2, 4],
    'data_one': [np.nan, 3, np.nan],
    'data_two': [4, np.nan, np.nan],
})
id data_one data_two
0 1 NaN 4.0
1 2 3.0 NaN
2 4 NaN NaN
df2 = pd.DataFrame({
    'id': [1, 3],
    'data_one': [8, np.nan],
    'data_two': [np.nan, 4],
    'data_three': [np.nan, 100]
})
id data_one data_two data_three
0 1 8.0 NaN NaN
1 3 NaN 4.0 100.0
# Desired result
res = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'data_one': [8, 3, np.nan, np.nan],
    'data_two': [4, np.nan, 4, np.nan],
    'data_three': [np.nan, np.nan, 100, np.nan],
})
id data_one data_two data_three
0 1 8.0 4.0 NaN
1 2 3.0 NaN NaN
2 3 NaN 4.0 100.0
3 4 NaN NaN NaN
The functions I have been experimenting with so far are pd.merge(), pd.join(), and pd.combine_first(), but I haven't had any success; maybe I'm missing something easy.
You can do a groupby() coupled with ffill() and bfill():
pd.concat([df1,df2]).groupby('id').apply(lambda x: x.ffill().bfill()).drop_duplicates()
results in:
id data_one data_two data_three
0 1 8.0 4.0 NaN
1 2 3.0 NaN NaN
1 3 NaN 4.0 100.0
2 4 NaN NaN NaN
Note that it will return separate rows for positions where df1 and df2 both have non-null values. That is intentional, since I don't know what you want to do in such cases.
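Since you mentioned combine_first(): a minimal alternative sketch is to align both frames on 'id' and let df1 win any conflicts, which matches the "prioritize the base dataframe" requirement (column order in the result may differ):
res = (
    df1.set_index('id')
       .combine_first(df2.set_index('id'))
       .reset_index()
)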

Iterate through two data frames and update a column of the first data frame with a column of the second data frame in pandas

I am converting a piece of code written in R to Python. df1 and df2 are the dataframes; id, case, feature, and feature_value are column names. The code in R is:
for(i in 1:dim(df1)[1]){
  temp = subset(df2, df2$id == df1$case[i], select = df1$feature[i])
  df1$feature_value[i] = temp[, df1$feature[i]]
}
My code in python is as follows.
for i in range(0, len(df1)):
    temp = np.where(df1['case'].iloc[i] == df2['id']), df1['feature'].iloc[i]
    df1['feature_value'].iloc[i] = temp[:, df1['feature'].iloc[i]]
but it gives
TypeError: tuple indices must be integers or slices, not tuple
How can I rectify this error? I'd appreciate any help.
Unfortunately, R and Pandas handle dataframes pretty differently. If you'll be using Pandas a lot, it would probably be worth going through a tutorial on it.
I'm not too familiar with R so this is what I think you want to do:
Find rows in df1 where the 'case' matches an 'id' in df2. If such a row is found, add the "feature" in df1 to a new df1 column called "feature_value."
If so, you can do this with the following:
#create a sample df1 and df2
>>> df1 = pd.DataFrame({'case': [1, 2, 3], 'feature': [3, 4, 5]})
>>> df1
case feature
0 1 3
1 2 4
2 3 5
>>> df2 = pd.DataFrame({'id': [1, 3, 7], 'age': [45, 63, 39]})
>>> df2
id age
0 1 45
1 3 63
2 7 39
#create a list with all the "id" values of df2
>>> df2_list = df2['id'].to_list()
>>> df2_list
[1, 3, 7]
#lambda allows small functions; in this case, the value of df1['feature_value']
#for each row is assigned df1['feature'] if df1['case'] is in df2_list,
#and otherwise it is assigned np.nan.
>>> df1['feature_value'] = df1.apply(lambda x: x['feature'] if x['case'] in df2_list else np.nan, axis=1)
>>> df1
case feature feature_value
0 1 3 3.0
1 2 4 NaN
2 3 5 5.0
Instead of a lambda, a full function can be created, which may be easier to understand:
def get_feature_values(df, id_list):
    if df['case'] in id_list:
        feature_value = df['feature']
    else:
        feature_value = np.nan
    return feature_value

df1['feature_value'] = df1.apply(get_feature_values, id_list=df2_list, axis=1)
Another way of going about this would involve merging df1 and df2 to find rows where the "case" value in df1 matches an "id" value in df2 (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html)
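A rough sketch of that merge-based approach (assuming the 'id' values in df2 are unique, so the left merge keeps one row per df1 row; '_merge' comes from pd.merge's indicator option):
merged = df1.merge(df2, how='left', left_on='case', right_on='id', indicator=True)
df1['feature_value'] = df1['feature'].where(merged['_merge'] == 'both')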
===================
To address the follow-up question in the comments:
You can do this by merging the dataframes and then creating a function.
#create example dataframes
>>> df1 = pd.DataFrame({'case': [1, 2, 3], 'feature': [3, 4, 5], 'names': ['a', 'b', 'c']})
>>> df2 = pd.DataFrame({'id': [1, 3, 7], 'age': [45, 63, 39], 'a': [30, 31, 32], 'b': [40, 41, 42], 'c': [50, 51, 52]})
#merge the dataframes
>>> df1 = df1.merge(df2, how='left', left_on='case', right_on='id')
>>> df1
case feature names id age a b c
0 1 3 a 1.0 45.0 30.0 40.0 50.0
1 2 4 b NaN NaN NaN NaN NaN
2 3 5 c 3.0 63.0 31.0 41.0 51.0
Then you can create the following function:
def get_feature_values_2(df):
    if pd.notnull(df['id']):
        feature_value = df['feature']
        column_of_interest = df['names']
        feature_extended_value = df[column_of_interest]
    else:
        feature_value = np.nan
        feature_extended_value = np.nan
    return feature_value, feature_extended_value
# "result_type='expand'" allows multiple values to be returned from the function
df1[['feature_value', 'feature_extended_value']] = df1.apply(get_feature_values_2, result_type='expand', axis=1)
#This results in the following dataframe:
case feature names id age a b c feature_value \
0 1 3 a 1.0 45.0 30.0 40.0 50.0 3.0
1 2 4 b NaN NaN NaN NaN NaN NaN
2 3 5 c 3.0 63.0 31.0 41.0 51.0 5.0
feature_extended_value
0 30.0
1 NaN
2 51.0
#To keep only a subset of the columns:
#First create a copy-pasteable list of the column names
list(df1.columns)
['case', 'feature', 'names', 'id', 'age', 'a', 'b', 'c', 'feature_value', 'feature_extended_value']
#Choose the subset of columns you would like to keep
df1 = df1[['case', 'feature', 'names', 'feature_value', 'feature_extended_value']]
df1
case feature names feature_value feature_extended_value
0 1 3 a 3.0 30.0
1 2 4 b NaN NaN
2 3 5 c 5.0 51.0

Pandas Fill Not Null data with First Row

Let's say a dataset has values as per below:
import pandas as pd
import numpy as np

df = pd.DataFrame({'DATA1': ['OK', np.nan, '1', np.nan],
                   'DATA2': ['KO', '2', np.nan, np.nan]})
df
The data will show as per below:
  DATA1 DATA2
0    OK    KO
1   NaN     2
2     1   NaN
3   NaN   NaN
My objective is to replace every cell that has a value (not null) with the value from the first row of that column.
I know that I can change the data directly, but I want to find a better solution since I may have thousands of columns and rows.
You can also use np.where():
final=pd.DataFrame(np.where(df.notnull(),df.iloc[0],df),df.index,df.columns)
DATA1 DATA2
0 OK KO
1 NaN KO
2 OK NaN
3 NaN NaN
Use DataFrame.mask with DataFrame.iloc to select the first row:
df = df.mask(df.notna(), df.iloc[0], axis=1)
print (df)
DATA1 DATA2
0 OK KO
1 NaN KO
2 OK NaN
3 NaN NaN
To replace with the first non-missing value when the first row itself contains NaN, add a backfill:
df = pd.DataFrame({'DATA1': [np.nan, 'OK', '1', np.nan],
                   'DATA2': ['KO', '2', np.nan, np.nan]})
print (df)
DATA1 DATA2
0 NaN KO
1 OK 2
2 1 NaN
3 NaN NaN
df = df.mask(df.notna(), df.bfill(axis=1).iloc[0], axis=1)
print (df)
DATA1 DATA2
0 NaN KO
1 KO KO
2 KO NaN
3 NaN NaN

Combining several columns with pandas

I have a pandas dataframe:
Name    A1   A2   A3
Andy     1  NaN  NaN
Brian  NaN  NaN  NaN
Carlos NaN    2  NaN
David  NaN  NaN    3
Frank    2  NaN  NaN
In each row, at most one of the three columns A1, A2, and A3 has a non-NaN value. I want to merge them into one column and remove the rows that are all NaN, so the above dataframe becomes:
Name    A  A-ID
Andy    1     1
Carlos  2     2
David   3     3
Frank   2     1
A-ID will store the original column (A1, A2 or A3). The row with Brian is removed because all 3 columns are NaN.
Naively I can write a for loop to do the task, but is there a more pythonic and faster way?
This method should achieve the desired result:
import pandas as pd
import numpy as np
d = {"Name": ["Andy", "Brian", "Carlos", "David", "Frank"],
"A1": [1,np.nan,np.nan,np.nan,2],
"A2": [np.nan,np.nan,2,np.nan,np.nan],
"A3": [np.nan,np.nan,np.nan,3,np.nan]}
df = pd.DataFrame(data=d)
#Drops rows where all A* values are NaN
df = df.dropna(subset = ['A1', 'A2', 'A3'], how="all")
#Sums the A* values to produce the result (restrict to the A* columns so the Name strings are not included)
df["A"] = df[["A1", "A2", "A3"]].sum(axis=1)
#Alternative method for getting 'A'
#df["A"] = df[["A1", "A2", "A3"]].bfill(axis=1).iloc[:, 0]
#Returns final char of column name of first non-NaN column
df["A-ID"] = df[["A1", "A2", "A3"]].apply(lambda row: row.first_valid_index()[-1], axis=1)
#Dropping old A* columns
df = df.drop(["A1", "A2", "A3"], axis=1)
print(df)
Name A A-ID
0 Andy 1.0 1
2 Carlos 2.0 2
3 David 3.0 3
4 Frank 2.0 1
There are several ways to do that. Probably the simplest is defining a new column which is the row-wise sum of the other columns. Note that a plain df["A1"] + df["A2"] + df["A3"] propagates NaN, so use sum() with min_count=1, which skips NaN but still yields NaN when all three values are missing:
df["B"] = df[["A1", "A2", "A3"]].sum(axis=1, min_count=1)
Then you keep only the rows with B not null:
df = df[df.B.notnull()]
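Another compact alternative (a sketch, rebuilding the frame from the dict d defined in the first answer): stack() drops NaN automatically, so all-NaN rows disappear on their own and the originating column label is kept for A-ID.
df = pd.DataFrame(data=d)                               # rebuild the original frame
s = df.set_index("Name")[["A1", "A2", "A3"]].stack()    # NaN cells are dropped here
out = s.reset_index()
out.columns = ["Name", "A-ID", "A"]
out["A-ID"] = out["A-ID"].str[-1]                       # keep only the trailing digit of A1/A2/A3
out = out[["Name", "A", "A-ID"]]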

Pandas: How to drop multiple columns with nan as col name?

As per the title here's a reproducible example:
import numpy as np
import pandas as pd

raw_data = {'x': ['this', 'that', 'this', 'that', 'this'],
            np.nan: [np.nan, np.nan, np.nan, np.nan, np.nan],
            'y': [np.nan, np.nan, np.nan, np.nan, np.nan],
            np.nan: [np.nan, np.nan, np.nan, np.nan, np.nan]}
df = pd.DataFrame(raw_data, columns=['x', np.nan, 'y', np.nan])
df
x NaN y NaN
0 this NaN NaN NaN
1 that NaN NaN NaN
2 this NaN NaN NaN
3 that NaN NaN NaN
4 this NaN NaN NaN
The aim is to drop only the columns with NaN as the column name (so keep column y). dropna() doesn't work, as it conditions on the NaN values inside a column, not on NaN as the column name.
df.drop(np.nan, axis=1, inplace=True) works if there's a single column in the data with NaN as the column name, but not with multiple such columns, as in my data.
So how do I drop multiple columns where the column name is NaN?
In [218]: df = df.loc[:, df.columns.notna()]
In [219]: df
Out[219]:
x y
0 this NaN
1 that NaN
2 this NaN
3 that NaN
4 this NaN
You can try
df.columns = df.columns.fillna('to_drop')
df.drop('to_drop', axis = 1, inplace = True)
As of pandas 1.4.0, df.drop is the simplest solution, as it now handles multiple NaN headers properly:
df = df.drop(columns=np.nan)
# x y
# 0 this NaN
# 1 that NaN
# 2 this NaN
# 3 that NaN
# 4 this NaN
Or the equivalent axis syntax:
df = df.drop(np.nan, axis=1)
Note that it's possible to use inplace instead of assigning back to df, but inplace is not recommended and will eventually be deprecated.
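For completeness, the in-place form that note refers to looks like this (assigning back, as above, is the preferred style):
df.drop(columns=np.nan, inplace=True)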
