As per the title, here's a reproducible example:
import numpy as np
import pandas as pd

raw_data = {'x': ['this', 'that', 'this', 'that', 'this'],
            np.nan: [np.nan, np.nan, np.nan, np.nan, np.nan],
            'y': [np.nan, np.nan, np.nan, np.nan, np.nan],
            np.nan: [np.nan, np.nan, np.nan, np.nan, np.nan]}
df = pd.DataFrame(raw_data, columns=['x', np.nan, 'y', np.nan])
df
x NaN y NaN
0 this NaN NaN NaN
1 that NaN NaN NaN
2 this NaN NaN NaN
3 that NaN NaN NaN
4 this NaN NaN NaN
The aim is to drop only the columns with nan as the column name (so keep column y). dropna() doesn't work, as it conditions on the nan values within a column, not on nan as the column name.
df.drop(np.nan, axis=1, inplace=True) works if there's a single column with nan as its name, but not with multiple nan-named columns, as in my data.
So how do I drop multiple columns where the column name is nan?
In [218]: df = df.loc[:, df.columns.notna()]
In [219]: df
Out[219]:
x y
0 this NaN
1 that NaN
2 this NaN
3 that NaN
4 this NaN
You can try
df.columns = df.columns.fillna('to_drop')
df.drop('to_drop', axis = 1, inplace = True)
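For instance, run end-to-end on a small frame (the 'to_drop' sentinel is arbitrary; any label not already used by a real column works):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns=['x', np.nan, np.nan])

# Replace every NaN column label with a sentinel, then drop all columns
# carrying that label (drop removes every matching duplicate).
df.columns = df.columns.fillna('to_drop')
df = df.drop('to_drop', axis=1)
print(list(df.columns))  # ['x']
```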
As of pandas 1.4.0, df.drop is the simplest solution, as it now handles multiple NaN headers properly:
df = df.drop(columns=np.nan)
# x y
# 0 this NaN
# 1 that NaN
# 2 this NaN
# 3 that NaN
# 4 this NaN
Or the equivalent axis syntax:
df = df.drop(np.nan, axis=1)
Note that it's possible to use inplace instead of assigning back to df, but inplace is not recommended and will eventually be deprecated.
Imagine you have the following df:
import numpy as np
import pandas as pd

d = {'description#1': ['happy', 'coding', np.nan],
     'description#2': [np.nan, np.nan, np.nan],
     'description#3': [np.nan, np.nan, np.nan]}
dffinalselection = pd.DataFrame(data=d)
dffinalselection
description#1 description#2 description#3
0 happy NaN NaN
1 coding NaN NaN
2 NaN NaN NaN
I want to fill the NaN values in each row with that row's description#1 column value:
filldesc = dffinalselection.filter(like='description')
filldesc = filldesc.fillna(dffinalselection['description#1'], axis=1)
filldesc
However, getting the following error:
NotImplementedError: Currently only can fill with dict/Series column by column
How can I work around this?
desired output:
description#1 description#2 description#3
0 happy happy happy
1 coding coding coding
2 NaN NaN NaN
Please help!
You can use apply() on rows with axis=1, then use Series.fillna() to fill the nan values.
import pandas as pd
import numpy as np
d = {'description#1': ['happy', 'coding', np.nan], 'description#2': [np.nan, 'tokeep', np.nan], 'description#3': [np.nan, np.nan, np.nan]}
dffinalselection = pd.DataFrame(data=d)
df_ = dffinalselection.apply(lambda row: row.fillna(row.iloc[0]), axis=1)
print(df_)
description#1 description#2 description#3
0 happy happy happy
1 coding tokeep coding
2 NaN NaN NaN
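A loop-free alternative (my sketch, not part of the answer above): it relies on the documented behavior that DataFrame.fillna with a Series matches the Series index against the frame's column labels, so transposing first lets each original row be filled with its own description#1 value.

```python
import numpy as np
import pandas as pd

d = {'description#1': ['happy', 'coding', np.nan],
     'description#2': [np.nan, 'tokeep', np.nan],
     'description#3': [np.nan, np.nan, np.nan]}
df = pd.DataFrame(data=d)

# After transposing, column i holds original row i; fillna with the
# description#1 Series (indexed 0..n-1) fills each such column with
# that row's first value, then we transpose back.
filled = df.T.fillna(df['description#1']).T
# row 0 -> happy/happy/happy, row 1 -> coding/tokeep/coding, row 2 stays NaN
```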
Use the ffill method with axis=1:
dffinalselection.ffill(axis=1)
Note that this forward-fills from the nearest non-null column to the left, not specifically from description#1, so it matches the desired output only while the intervening columns are NaN.
I have a pandas dataframe (starting_df) with nan values in the left-hand columns. I'd like to shift all values over to the left for a left-aligned dataframe. My dataframe is 24x24, but for argument's sake I'm just posting a 4x4 version.
After some cool initial answers here, I modified the dataframe to also include a non-leading nan, whose position I'd like to preserve.
I have a piece of code that accomplishes what I want, but it relies on nested for-loops and suppressing an IndexError, which does not feel very pythonic. I have no experience with error handling in general, but simply suppressing an error does not seem like the right strategy.
Starting dataframe and desired final dataframe:
Here is the code that (poorly) accomplishes the goal.
import pandas as pd
import numpy as np
def get_left_aligned(starting_df):
    """take a starting df with right-aligned numbers and nan, and
    turn it into a left aligned table."""
    left_aligned_df = pd.DataFrame()
    for temp_index_1 in range(0, starting_df.shape[0]):
        temp_series = []
        for temp_index_2 in range(0, starting_df.shape[0]):
            try:
                temp_series.append(starting_df.iloc[temp_index_2, temp_index_2 + temp_index_1])
            except IndexError:
                pass
        temp_series = pd.DataFrame(temp_series, columns=['col' + str(temp_index_1 + 1)])
        left_aligned_df = pd.concat([left_aligned_df, temp_series], axis=1)
    return left_aligned_df
df = pd.DataFrame(dict(col1=[1, np.nan, np.nan, np.nan],
                       col2=[5, 2, np.nan, np.nan],
                       col3=[7, np.nan, 3, np.nan],
                       col4=[9, 8, 6, 4]))
df_expected = pd.DataFrame(dict(col1=[1, 2, 3, 4],
                                col2=[5, np.nan, 6, np.nan],
                                col3=[7, 8, np.nan, np.nan],
                                col4=[9, np.nan, np.nan, np.nan]))
df_left = get_left_aligned(df)
I appreciate any help with this.
Thanks!
Or transpose the df and use shift to shift by column, since the number of leading NAs increases by 1 from each column to the next:
dfn = df.T.copy()
for i, col in enumerate(dfn.columns):
    dfn[col] = dfn[col].shift(-i)
dfn = dfn.T
print(dfn)
col1 col2 col3 col4
0 1.0 5.0 7.0 9.0
1 2.0 NaN 8.0 NaN
2 3.0 6.0 NaN NaN
3 4.0 NaN NaN NaN
One way to resolve your challenge is to move the data into numpy territory, sort each row there, then return it as a pandas DataFrame. NumPy converts pandas NA to the object data type; pd.to_numeric first resolves that to data types numpy can work with. (The output below is from a smaller three-column example.)
pd.DataFrame(
    np.sort(df.transform(pd.to_numeric).to_numpy(), axis=1),
    columns=df.columns,
    dtype="Int64",
)
col1 col2 col3
0 1 4 6
1 2 5 <NA>
2 3 <NA> <NA>
You can sort the values in each row based on their positions, keeping the nan values at the end by giving them a very high key (np.inf) rather than their actual position.
df.T.apply(
    lambda x: [z[1] for z in sorted(enumerate(x), key=(lambda k: np.inf if pd.isna(k[1]) else k[0]), reverse=False)],
    axis=0
).T
Here is an example:
import numpy as np
import pandas as pd
df = pd.DataFrame(
    data=[
        [np.nan, 2, 4, 7],
        [np.nan, np.nan, 6, 9],
        [np.nan, np.nan, np.nan, 10],
        [np.nan, np.nan, np.nan, np.nan],
    ],
    columns=['A', 'B', 'C', 'D']
)
df2 = df.T.apply(
    lambda x: [z[1] for z in sorted(enumerate(x), key=(lambda k: np.inf if pd.isna(k[1]) else k[0]), reverse=False)],
    axis=0
).T
And this is df2:
A B C D
0 2.0 4.0 7.0 NaN
1 6.0 9.0 NaN NaN
2 10.0 NaN NaN NaN
3 NaN NaN NaN NaN
EDIT
If you have rows with NaNs after the first non-NaN value, you can use this approach based on first_valid_index:
df.apply(
    lambda x: x.shift(-list(x.index).index(x.first_valid_index() or x.index[0])),
    axis=1,
)
An example for this case:
import numpy as np
import pandas as pd
df = pd.DataFrame(
    data=[
        [np.nan, 2, 4, 7],
        [np.nan, np.nan, 6, 9],
        [np.nan, np.nan, np.nan, 10],
        [np.nan, np.nan, np.nan, np.nan],
        [np.nan, 5, np.nan, 3],
    ],
    columns=['A', 'B', 'C', 'D']
)
df3 = df.apply(
    lambda x: x.shift(-list(x.index).index(x.first_valid_index() or x.index[0])),
    axis=1,
)
And df3 is:
A B C D
0 2.0 4.0 7.0 NaN
1 6.0 9.0 NaN NaN
2 10.0 NaN NaN NaN
3 NaN NaN NaN NaN
4 5.0 NaN 3.0 NaN
I'm trying to merge several mixed dataframes, with some missing values that sometimes exist in other dataframes, into one combined dataset. Some dataframes also contain extra columns; those should be added, with NaN as the value for all rows that lack them.
The merge is based on one or several columns; the row index has no meaning. The true dataset has many columns, so manually removing anything is very much undesirable.
So essentially: merge several dataframes on one or several key columns, prioritizing any non-NaN value; if two conflicting non-NaN values exist, prioritize the existing value in the base dataframe over the one being merged in.
df1 = pd.DataFrame({
    'id': [1, 2, 4],
    'data_one': [np.nan, 3, np.nan],
    'data_two': [4, np.nan, np.nan],
})
id data_one data_two
0 1 NaN 4.0
1 2 3.0 NaN
2 4 NaN NaN
df2 = pd.DataFrame({
    'id': [1, 3],
    'data_one': [8, np.nan],
    'data_two': [np.nan, 4],
    'data_three': [np.nan, 100]
})
id data_one data_two data_three
0 1 8.0 NaN NaN
1 3 NaN 4.0 100.0
# Desired result
res = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'data_one': [8, 3, np.nan, np.nan],
    'data_two': [4, np.nan, 4, np.nan],
    'data_three': [np.nan, np.nan, 100, np.nan],
})
id data_one data_two data_three
0 1 8.0 4.0 NaN
1 2 3.0 NaN NaN
2 3 NaN 4.0 100.0
3 4 NaN NaN NaN
The functions I have been experimenting with so far are pd.merge(), df.join(), and df.combine_first(), but I haven't had any success; maybe I'm missing something easy.
You can do a groupby() coupled with ffill() and bfill():
pd.concat([df1,df2]).groupby('id').apply(lambda x: x.ffill().bfill()).drop_duplicates()
results in:
id data_one data_two data_three
0 1 8.0 4.0 NaN
1 2 3.0 NaN NaN
1 3 NaN 4.0 100.0
2 4 NaN NaN NaN
Note that it will return separate rows for the positions where df1 and df2 both have non-null values. This is intentional, since I don't know what you want to do in such cases.
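Since combine_first was one of the functions being tried: it can also produce the desired result once id is moved into the index, because the calling frame's values take priority on conflicts, which matches the "prioritize the base dataframe" rule. A sketch:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 4],
                    'data_one': [np.nan, 3, np.nan],
                    'data_two': [4, np.nan, np.nan]})
df2 = pd.DataFrame({'id': [1, 3],
                    'data_one': [8, np.nan],
                    'data_two': [np.nan, 4],
                    'data_three': [np.nan, 100]})

# Align both frames on id; df1 (the caller) wins wherever both have a value,
# while df2 fills the holes and contributes ids/columns that df1 lacks.
res = df1.set_index('id').combine_first(df2.set_index('id')).reset_index()
```

Note that combine_first may return the data columns in sorted order; reindex with res[['id', 'data_one', 'data_two', 'data_three']] if the original order matters.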
Let's say a dataset has values as per below:
import pandas as pd
import numpy as np

df = pd.DataFrame({'DATA1': ['OK', np.nan, '1', np.nan],
                   'DATA2': ['KO', '2', np.nan, np.nan]})
df
The data shows as per below:
My objective is to replace every non-null value in a row with the first row's value, as per the sample below:
I know that I can change the data directly, but I want to find a better solution for when there are thousands of columns and rows.
You can also use np.where():
final = pd.DataFrame(np.where(df.notnull(), df.iloc[0], df), df.index, df.columns)
DATA1 DATA2
0 OK KO
1 NaN KO
2 OK NaN
3 NaN NaN
Use DataFrame.mask with DataFrame.iloc to select the first row:
df = df.mask(df.notna(), df.iloc[0], axis=1)
print (df)
DATA1 DATA2
0 OK KO
1 NaN KO
2 OK NaN
3 NaN NaN
To replace using the first row's first non-missing values instead (back-filling the first row in case it contains NaN), add a backfill:
df = pd.DataFrame({'DATA1': [np.nan, 'OK', '1', np.nan],
                   'DATA2': ['KO', '2', np.nan, np.nan]})
print (df)
DATA1 DATA2
0 NaN KO
1 OK 2
2 1 NaN
3 NaN NaN
df = df.mask(df.notna(), df.bfill(axis=1).iloc[0], axis=1)
print (df)
DATA1 DATA2
0 NaN KO
1 KO KO
2 KO NaN
3 NaN NaN
As per the pandas 0.19.2 documentation, I can provide a keys argument to create a resulting multi-index DataFrame. An example (from the pandas documentation) is:
result = pd.concat(frames, keys=['x', 'y', 'z'])
How would I concat the dataframes so that I can provide the keys at the column level instead of the index level?
What I basically need is something like this:
where df1 and df2 are to be concatenated.
This is supported by the keys parameter of pd.concat when specifying axis=1:
df1 = pd.DataFrame(np.random.random((4, 4)), columns=list('ABCD'))
df2 = pd.DataFrame(np.random.random((4, 3)), columns=list('BDF'), index=[2, 3, 6, 7])
df = pd.concat([df1, df2], keys=['X', 'Y'], axis=1)
The resulting output:
X Y
A B C D B D F
0 0.654406 0.495906 0.601100 0.309276 NaN NaN NaN
1 0.020527 0.814065 0.907590 0.924307 NaN NaN NaN
2 0.239598 0.089270 0.033585 0.870829 0.882028 0.626650 0.622856
3 0.983942 0.103573 0.370121 0.070442 0.986487 0.848203 0.089874
6 NaN NaN NaN NaN 0.664507 0.319789 0.868133
7 NaN NaN NaN NaN 0.341145 0.308469 0.884074