Say a dataset has values as per below:
import numpy as np
import pandas as pd

df = pd.DataFrame({'DATA1': ['OK', np.nan, '1', np.nan],
                   'DATA2': ['KO', '2', np.nan, np.nan]})
df
My objective is to replace every non-null value with the first-row value of the same column. I know that I can change the data directly, but I want a better solution in case I have thousands of columns and rows.
You can also use np.where():
final = pd.DataFrame(np.where(df.notnull(), df.iloc[0], df), df.index, df.columns)
DATA1 DATA2
0 OK KO
1 NaN KO
2 OK NaN
3 NaN NaN
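Since np.where() returns a plain NumPy array, the result is wrapped back into a DataFrame with the original index and columns. A minimal sketch of the same idea, spelled out step by step (using the df defined above):

arr = np.where(df.notnull(),  # condition, evaluated cell by cell
               df.iloc[0],    # value where True: the first row, broadcast down each column
               df)            # value where False: keep the original cell (here NaN)
final = pd.DataFrame(arr, index=df.index, columns=df.columns)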
Use DataFrame.mask with DataFrame.iloc to select the first row; mask replaces values wherever the condition is True, and axis=1 aligns the first-row Series on the columns:
df = df.mask(df.notna(), df.iloc[0], axis=1)
print (df)
DATA1 DATA2
0 OK KO
1 NaN KO
2 OK NaN
3 NaN NaN
To replace with the first row's values while falling back to the next non-missing value in that row, add a backfill along the columns:
df = pd.DataFrame({'DATA1': [np.nan, 'OK', '1', np.nan],
                   'DATA2': ['KO', '2', np.nan, np.nan]})
print (df)
DATA1 DATA2
0 NaN KO
1 OK 2
2 1 NaN
3 NaN NaN
df = df.mask(df.notna(), df.bfill(axis=1).iloc[0], axis=1)
print (df)
DATA1 DATA2
0 NaN KO
1 KO KO
2 KO NaN
3 NaN NaN
I have a df with 2 columns: Name and Number.
I need to write rows that have NaN in a cell to a new DataFrame.
path = 'Files/Directory.xlsx'
df = pd.read_excel(path)
I've tried so many different things, spent 3 days and still can't get it.
df = pd.DataFrame(
{
"Name": ["Alex", "Bob", "Jim", np.nan, np.nan],
"Number": [1, 2, np.nan, 3, np.nan],
}
)
df
   Name  Number
0  Alex     1.0
1   Bob     2.0
2   Jim     NaN
3   NaN     3.0
4   NaN     NaN
So it depends on whether you want to write rows with any NaN values to a new DataFrame, or only rows where all values are NaN.
If any, the following should work:
df_nan = df.loc[df.isnull().any(axis=1)]
df_nan
  Name  Number
2  Jim     NaN
3  NaN     3.0
4  NaN     NaN
If all, this should work:
df_nan = df.loc[df.isnull().all(axis=1)]
df_nan
  Name  Number
4  NaN     NaN
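An end-to-end sketch tying this back to the Excel file from the question; the output file name here is hypothetical:

import pandas as pd

path = 'Files/Directory.xlsx'
df = pd.read_excel(path)
df_nan = df.loc[df.isnull().any(axis=1)]                   # rows with at least one NaN
df_nan.to_excel('Files/Directory_nan.xlsx', index=False)   # hypothetical output path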
Imagine you have the following df:
d = {'description#1': ['happy', 'coding', np.nan], 'description#2': [np.nan, np.nan, np.nan], 'description#3': [np.nan, np.nan, np.nan]}
dffinalselection = pd.DataFrame(data=d)
dffinalselection
description#1 description#2 description#3
0 happy NaN NaN
1 coding NaN NaN
2 NaN NaN NaN
I want to fill each row's NaN cells with that row's description#1 value:
filldesc = dffinalselection.filter(like='description')
filldesc = filldesc.fillna(dffinalselection['description#1'], axis=1)
filldesc
However, I get the following error:
NotImplementedError: Currently only can fill with dict/Series column by column
How to workaround?
desired output:
description#1 description#2 description#3
0 happy happy happy
1 coding coding coding
2 NaN NaN NaN
Please help!
You can use apply() on rows with axis=1, then use Series.fillna() to fill the NaN values.
import pandas as pd
import numpy as np
d = {'description#1': ['happy', 'coding', np.nan], 'description#2': [np.nan, 'tokeep', np.nan], 'description#3': [np.nan, np.nan, np.nan]}
dffinalselection = pd.DataFrame(data=d)
df_ = dffinalselection.apply(lambda row: row.fillna(row[0]), axis=1)
print(df_)
description#1 description#2 description#3
0 happy happy happy
1 coding tokeep coding
2 NaN NaN NaN
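One caveat: in recent pandas versions, positional access like row[0] on a label-indexed Series is deprecated, so spelling out the position is safer (same technique, only the indexing changes):

df_ = dffinalselection.apply(lambda row: row.fillna(row.iloc[0]), axis=1)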
Use the ffill method with axis=1:
dffinalselection.ffill(axis=1)
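For the original frame from the question, where only description#1 holds values, this produces the desired output:

dffinalselection.ffill(axis=1)
#   description#1 description#2 description#3
# 0         happy         happy         happy
# 1        coding        coding        coding
# 2           NaN           NaN           NaN

Note that ffill propagates the nearest non-null value from the left, not necessarily description#1, so it only matches the fillna approach when the later columns start out empty.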
Input DF:
df = pd.DataFrame({'A': ['one', np.nan, 'two', np.nan],
                   'B': [np.nan, 22, np.nan, 44],
                   'group': [0, 0, 1, 1]})
print(df)
A B group
0 one NaN 0
1 NaN 22.0 0
2 two NaN 1
3 NaN 44.0 1
I want to merge those rows into one, combining all cells of the same column, but taking the groups into account.
Currently I have:
df = df.agg(lambda x: ','.join(x.dropna().astype(str))).to_frame().T
print(df)
A B group
0 one,two 22.0,44.0 0,0,1,1
but this aggregates across all rows, not per group.
Expected Output:
A B
0 one 22.0
1 two 44.0
If a simplified solution is acceptable, taking the first non-missing value per group, use:
df = df.groupby('group').first()
print(df)
A B
group
0 one 22.0
1 two 44.0
If not, and a general solution is needed:
df = pd.DataFrame({'A': ['one', np.nan, 'two', np.nan],
                   'B': [np.nan, 22, np.nan, 44],
                   'group': [0, 0, 0, 1]})

def f(x):
    return x.apply(lambda x: pd.Series(x.dropna().to_numpy()))

df = df.set_index('group').groupby('group').apply(f).reset_index(level=1, drop=True).reset_index()
print(df)
group A B
0 0 one 22.0
1 0 two NaN
2 1 NaN 44.0
df_a = df.drop('B', axis=1).dropna()   # rows where A is present
df_b = df.drop('A', axis=1).dropna()   # rows where B is present
pd.merge(df_a, df_b, on='group')
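For the first sample input (groups [0, 0, 1, 1]) this pairs the non-null A and B values through the group key:

     A  group     B
0  one      0  22.0
1  two      1  44.0

Note this relies on each group contributing exactly one non-null value per column; with several values per group the merge would produce a cross product within each group.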
As per the title, here's a reproducible example:
raw_data = {'x': ['this', 'that', 'this', 'that', 'this'],
np.nan: [np.nan, np.nan, np.nan, np.nan, np.nan],
'y': [np.nan, np.nan, np.nan, np.nan, np.nan],
np.nan: [np.nan, np.nan, np.nan, np.nan, np.nan]}
df = pd.DataFrame(raw_data, columns = ['x', np.nan, 'y', np.nan])
df
x NaN y NaN
0 this NaN NaN NaN
1 that NaN NaN NaN
2 this NaN NaN NaN
3 that NaN NaN NaN
4 this NaN NaN NaN
The aim is to drop only the columns with NaN as the column name (so keep column y). dropna() doesn't work, as it conditions on the NaN values in the column, not on NaN as the column name.
df.drop(np.nan, axis=1, inplace=True) works if there's a single column with NaN as the name, but not with multiple NaN-named columns, as in my data.
So how do I drop multiple columns where the column name is NaN?
In [218]: df = df.loc[:, df.columns.notna()]
In [219]: df
Out[219]:
x y
0 this NaN
1 that NaN
2 this NaN
3 that NaN
4 this NaN
You can try renaming the NaN headers to a sentinel and dropping that (assuming no real column is already named 'to_drop'):
df.columns = df.columns.fillna('to_drop')
df.drop('to_drop', axis=1, inplace=True)
As of pandas 1.4.0, df.drop is the simplest solution, as it now handles multiple NaN headers properly:
df = df.drop(columns=np.nan)
# x y
# 0 this NaN
# 1 that NaN
# 2 this NaN
# 3 that NaN
# 4 this NaN
Or the equivalent axis syntax:
df = df.drop(np.nan, axis=1)
Note that it's possible to use inplace instead of assigning back to df, but inplace is not recommended and will eventually be deprecated.
I want to know the first year with incoming revenue for various projects.
Given the following dataframe:
ID   Y1   Y2   Y3
0   NaN    8    4
1   NaN  NaN    1
2   NaN  NaN  NaN
3     5    3  NaN
I would like to return the name of the first column with a non-null value by row.
In this case, I would want to return:
['Y2','Y3',NaN,'Y1']
My goal is to add this as a column to the original dataframe.
The following code mostly works, but is really clunky.
import pandas as pd
import numpy as np
df = pd.DataFrame({'Y1':[np.nan, np.nan, np.nan, 5],'Y2':[8, np.nan, np.nan, 3], 'Y3':[4, 1, np.nan, np.nan]})
df['first'] = np.nan
for ID in df.index:
    row = df.loc[ID, ]
    for i in range(0, len(row)):
        if ~pd.isnull(row[i]):
            df.loc[ID, 'first'] = row.index[i]
            break
returns:
Y1 Y2 Y3 first
0 NaN 8 4 Y2
1 NaN NaN 1 Y3
2 NaN NaN NaN first
3 5 3 NaN Y1
Does anyone know a more elegant solution?
You can apply first_valid_index to each row in the dataframe using a lambda expression with axis=1 to specify rows.
>>> df.apply(lambda row: row.first_valid_index(), axis=1)
ID
0 Y2
1 Y3
2 None
3 Y1
dtype: object
To apply it to your dataframe:
df = df.assign(first = df.apply(lambda row: row.first_valid_index(), axis=1))
>>> df
Y1 Y2 Y3 first
ID
0 NaN 8 4 Y2
1 NaN NaN 1 Y3
2 NaN NaN NaN None
3 5 3 NaN Y1
Avoiding apply is preferable, as it's not vectorized. The following is vectorized; it was tested with pandas 1.1.
Setup
import numpy as np
import pandas as pd
df = pd.DataFrame({'Y1':[np.nan, np.nan, np.nan, 5],'Y2':[8, np.nan, np.nan, 3], 'Y3':[4, 1, np.nan, np.nan]})
# df.dropna(how='all', inplace=True) # Optional but cleaner
# For ranking only:
col_ranks = pd.DataFrame(index=df.columns, data=np.arange(1, 1 + len(df.columns)), columns=['first_notna_rank'], dtype='UInt8') # UInt8 supports max value of 255.
To find the name of the first non-null column
df['first_notna_name'] = df.dropna(how='all').notna().idxmax(axis=1).astype('string')
If df is guaranteed to have no rows with all nulls, the .dropna operation above can optionally be removed.
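In that case a minimal equivalent (using the setup df above) is:

df['first_notna_name'] = df.notna().idxmax(axis=1).astype('string')

idxmax(axis=1) returns the first column label where the mask is True; the dropna guard exists only because an all-null row would otherwise be mislabeled with the first column instead of <NA>.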
To then find the first non-null value
Using bfill:
df['first_notna_value'] = df[df.columns.difference(['first_notna_name'])].bfill(axis=1).iloc[:, 0]
Using melt:
df['first_notna_value'] = (df.melt(id_vars='first_notna_name',
                                   value_vars=df.columns.difference(['first_notna_name']),
                                   ignore_index=False)
                             .query('first_notna_name == variable')
                             .merge(df[[]], how='right', left_index=True, right_index=True)
                             .loc[df.index, 'value'])
If df is guaranteed to have no rows with all nulls, the .merge operation above can optionally be removed.
To rank the name
df = df.merge(col_ranks, how='left', left_on='first_notna_name', right_index=True)
Is there a better way?
Output
Y1 Y2 Y3 first_notna_name first_notna_value first_notna_rank
0 NaN 8.0 4.0 Y2 8.0 2
1 NaN NaN 1.0 Y3 1.0 3
2 NaN NaN NaN <NA> NaN <NA>
3 5.0 3.0 NaN Y1 5.0 1
Partial credit: answers by me, piRSquared, and Andy
Applied to a dataframe with only one row, this returns the last column in the row that contains a non-null value:
row.columns[~row.isna().all()][-1]
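A quick check, with a hypothetical one-row frame:

row = pd.DataFrame({'Y1': [5.0], 'Y2': [np.nan], 'Y3': [3.0]})
row.columns[~row.isna().all()][-1]   # 'Y3': the last column holding a non-null value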