I want to know the first year with incoming revenue for various projects.
Given the following dataframe:
ID Y1 Y2 Y3
0 NaN 8 4
1 NaN NaN 1
2 NaN NaN NaN
3 5 3 NaN
I would like to return the name of the first column with a non-null value by row.
In this case, I would want to return:
['Y2','Y3',NaN,'Y1']
My goal is to add this as a column to the original dataframe.
The following code mostly works, but is really clunky.
import pandas as pd
import numpy as np
df = pd.DataFrame({'Y1':[np.nan, np.nan, np.nan, 5],'Y2':[8, np.nan, np.nan, 3], 'Y3':[4, 1, np.nan, np.nan]})
df['first'] = np.nan
for ID in df.index:
    row = df.loc[ID,]
    for i in range(0, len(row)):
        if ~pd.isnull(row[i]):
            df.loc[ID, 'first'] = row.index[i]
            break
returns:
Y1 Y2 Y3 first
0 NaN 8 4 Y2
1 NaN NaN 1 Y3
2 NaN NaN NaN first
3 5 3 NaN Y1
Does anyone know a more elegant solution?
You can apply first_valid_index to each row in the dataframe using a lambda expression with axis=1 to specify rows.
>>> df.apply(lambda row: row.first_valid_index(), axis=1)
ID
0 Y2
1 Y3
2 None
3 Y1
dtype: object
To apply it to your dataframe:
df = df.assign(first = df.apply(lambda row: row.first_valid_index(), axis=1))
>>> df
Y1 Y2 Y3 first
ID
0 NaN 8 4 Y2
1 NaN NaN 1 Y3
2 NaN NaN NaN None
3 5 3 NaN Y1
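As a side note (my addition, not part of the original answer): the lambda can be dropped by passing the unbound method directly, which behaves identically:
df.apply(pd.Series.first_valid_index, axis=1)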
Avoiding apply is preferable, as it's not vectorized. The following is vectorized; it was tested with pandas 1.1.
Setup
import numpy as np
import pandas as pd
df = pd.DataFrame({'Y1':[np.nan, np.nan, np.nan, 5],'Y2':[8, np.nan, np.nan, 3], 'Y3':[4, 1, np.nan, np.nan]})
# df.dropna(how='all', inplace=True) # Optional but cleaner
# For ranking only:
col_ranks = pd.DataFrame(index=df.columns,
                         data=np.arange(1, 1 + len(df.columns)),
                         columns=['first_notna_rank'],
                         dtype='UInt8')  # UInt8 supports a max value of 255.
To find the name of the first non-null column
df['first_notna_name'] = df.dropna(how='all').notna().idxmax(axis=1).astype('string')
If df is guaranteed to have no rows with all nulls, the .dropna operation above can optionally be removed.
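To see why the dropna matters: notna().idxmax(axis=1) on an all-null row evaluates idxmax over an all-False row, which silently returns the first label. A minimal illustration (my own example):
all_null = pd.Series([np.nan, np.nan, np.nan], index=['Y1', 'Y2', 'Y3'])
print(all_null.notna().idxmax())  # prints 'Y1' even though every value is null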
To then find the first non-null value
Using bfill:
df['first_notna_value'] = df[df.columns.difference(['first_notna_name'])].bfill(axis=1).iloc[:, 0]
Using melt:
df['first_notna_value'] = (
    df.melt(id_vars='first_notna_name',
            value_vars=df.columns.difference(['first_notna_name']),
            ignore_index=False)
      .query('first_notna_name == variable')
      .merge(df[[]], how='right', left_index=True, right_index=True)
      .loc[df.index, 'value']
)
If df is guaranteed to have no rows with all nulls, the .merge operation above can optionally be removed.
To rank the name
df = df.merge(col_ranks, how='left', left_on='first_notna_name', right_index=True)
Is there a better way?
Output
Y1 Y2 Y3 first_notna_name first_notna_value first_notna_rank
0 NaN 8.0 4.0 Y2 8.0 2
1 NaN NaN 1.0 Y3 1.0 3
2 NaN NaN NaN <NA> NaN <NA>
3 5.0 3.0 NaN Y1 5.0 1
Partial credit: answers by me, piRSquared, and Andy
Apply this code to a dataframe with only one row to get the name of the last column in the row that holds a non-null value:
row.columns[~row.isna().all()][-1]
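For example, a minimal demonstration with a hypothetical one-row frame (my own example):
import numpy as np
import pandas as pd

row = pd.DataFrame({'Y1': [np.nan], 'Y2': [8], 'Y3': [4]})
print(row.columns[~row.isna().all()][-1])  # 'Y3', the last column holding a non-null value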
Related
I have 2 dataframes:
df = pd.DataFrame({'A': [1, 2, 3],
'B': [4, 5, 6]},
index=['X', 'Y', 'Z'])
and
df1 = pd.DataFrame({'M': [10, 20, 30],
'N': [40, 50, 60]},
index=['S', 'T', 'U'])
I want to append a row from the df dataframe to df1.
I use the following code to extract the row:
row = df.loc['Y']
When I print this I get:
A 2
B 5
Name: Y, dtype: int64
A and B are the keys, or column heading names, so I transpose this with
row_25 = row.transpose()
I print row_25 and get:
A 2
B 5
Name: Y, dtype: int64
This is the same as row, so it seems the transpose didn't happen.
I then add this code to append the row to df1:
result = pd.concat([df1, row_25], axis=0, ignore_index=False)
print(result)
When I print result I get:
M N 0
S 10.0 40.0 NaN
T 20.0 50.0 NaN
U 30.0 60.0 NaN
A NaN NaN 2.0
B NaN NaN 5.0
I want A and B to be column headings (keys) and the name of the row (Y) to be the row index.
What am I doing wrong?
Try
pd.concat([df1, df.loc[['Y']]])
It generates:
M N A B
S 10.0 40.0 NaN NaN
T 20.0 50.0 NaN NaN
U 30.0 60.0 NaN NaN
Y NaN NaN 2.0 5.0
Not sure if this is what you want.
To exclude column names 'M' and 'N' from the result you can rename the columns beforehand:
>>> df1.columns = ['A', 'B']
>>> pd.concat([df1, df.loc[['Y']]])
A B
S 10 40
T 20 50
U 30 60
Y 2 5
The reason why you need double square brackets is that single square brackets return a 1D Series, which cannot be transposed, while double brackets return a 2D DataFrame (in general, double brackets are used to reference several columns, like df1.loc[['X', 'Y']]; this is called 'fancy indexing' in NumPy).
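A quick way to see the difference (my addition):
>>> type(df.loc['Y'])
<class 'pandas.core.series.Series'>
>>> type(df.loc[['Y']])
<class 'pandas.core.frame.DataFrame'>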
If you are allergic to double brackets, use
pd.concat([df1.rename(columns={'M': 'A', 'N': 'B'}),
           df.filter(['Y'], axis=0)])
Finally, if you really want to transpose something, you can convert the series to a frame and transpose it:
>>> df.loc['Y'].to_frame().T
A B
Y 2 5
Let's say a dataset has values as per below:
import numpy as np
import pandas as pd
df = pd.DataFrame({'DATA1': ['OK', np.nan, '1', np.nan],
                   'DATA2': ['KO', '2', np.nan, np.nan]})
df
My objective is to replace every non-null value with the corresponding value from the first row.
I know that I can change the data directly, but I want a solution that scales to thousands of columns and rows.
You can also use np.where(), which takes the value from the first row (df.iloc[0]) wherever the mask df.notnull() is True and keeps df's own null values elsewhere:
final = pd.DataFrame(np.where(df.notnull(), df.iloc[0], df), df.index, df.columns)
DATA1 DATA2
0 OK KO
1 NaN KO
2 OK NaN
3 NaN NaN
Use DataFrame.mask with DataFrame.iloc to select the first row:
df = df.mask(df.notna(), df.iloc[0], axis=1)
print (df)
DATA1 DATA2
0 OK KO
1 NaN KO
2 OK NaN
3 NaN NaN
To replace using the first row back-filled across columns (so its missing entries are taken from the next column), add bfill:
df = pd.DataFrame({'DATA1': [ np.nan, 'OK','1', np.nan],
'DATA2' : ['KO','2', np.nan, np.nan]})
print (df)
DATA1 DATA2
0 NaN KO
1 OK 2
2 1 NaN
3 NaN NaN
df = df.mask(df.notna(), df.bfill(axis=1).iloc[0], axis=1)
print (df)
DATA1 DATA2
0 NaN KO
1 KO KO
2 KO NaN
3 NaN NaN
Looking to preserve NaN values when changing the shape of the dataframe.
These two questions may be related:
How to preserve NaN instead of filling with zeros in pivot table?
How to make two NaN as NaN after the operation instead of making it zero?
but I have not been able to apply the answers provided. Can I set a min count for np.sum somehow?
import pandas as pd
import numpy as np
df = pd.DataFrame([['Y1', np.nan], ['Y2', np.nan], ['Y1', 6], ['Y2',8]], columns=['A', 'B'], index=['1988-01-01','1988-01-01', '1988-01-04', '1988-01-04'])
df.index.name = 'Date'
df
pivot_df = pd.pivot_table(df, values='B', index=['Date'], columns=['A'], aggfunc=np.sum)
pivot_df
The output is:
A Y1 Y2
Date
1988-01-01 0.0 0.0
1988-01-04 6.0 8.0
and the desired output is:
A Y1 Y2
Date
1988-01-01 NaN NaN
1988-01-04 6.0 8.0
From the helpful comments the following solution meets my requirements:
pivot_df_2 = pd.pivot_table(df, values='B', index=['Date'], columns=['A'], aggfunc=min, dropna=False)
pivot_df_2
Values are supposed to be unique per slot, so replacing the sum function with a min function shouldn't make a difference (in my case).
If you have no duplicate entries, use set_index + unstack
df.set_index('A', append=True)['B'].unstack(-1)
A Y1 Y2
Date
1988-01-01 NaN NaN
1988-01-04 6.0 8.0
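If the Date/A pairs are unique, DataFrame.pivot is an equivalent spelling (my addition, not from the original answer); it reshapes without aggregating, so the NaN values survive:
df.reset_index().pivot(index='Date', columns='A', values='B')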
If you have duplicates, use a groupby with min_count
>>> df
A B
Date
1988-01-01 Y1 NaN
1988-01-01 Y2 NaN
1988-01-04 Y1 6.0
1988-01-04 Y2 8.0
1988-01-01 Y1 NaN
1988-01-01 Y2 NaN
1988-01-04 Y1 6.0
1988-01-04 Y2 8.0
df.set_index('A', append=True).groupby(level=[0, 1])['B'].sum(min_count=1).unstack(-1)
A Y1 Y2
Date
1988-01-01 NaN NaN
1988-01-04 12.0 16.0
In this case, I would solve it with groupby:
(df.groupby(['Date', 'A']).B
.apply(lambda x: np.nan if x.isna().all() else x.sum())
.unstack('A')
)
output:
A Y1 Y2
Date
1988-01-01 NaN NaN
1988-01-04 6.0 8.0
Change isna().all() to isna().any() if you want NaN whenever any value in the group is missing.
It is possible to count the values, and drop when 0 (or less than the expected count):
pivot_df = pd.pivot_table(df, values='B', index=['Date'], columns=['A'],
                          aggfunc=['sum', 'count'])
# build the mask from count
mask = (pivot_df.xs('count', axis=1) == 0) # or ...<min_limit
#build the actual pivot_df from sum
pivot_df = pivot_df.xs('sum', axis=1)
# and reset to NaN when not enough values
pivot_df[mask] = np.nan
It gives as expected:
A Y1 Y2
Date
1988-01-01 NaN NaN
1988-01-04 6.0 8.0
This one gives a sensible result when you sum more than one value.
Try adding dropna=False to your pivot_table call?
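To answer the literal "min count for np.sum" question, one option (my own sketch, not from the thread) is to pass a lambda with min_count as the aggfunc; dropna=False keeps the all-null row in the result:
pivot_df = pd.pivot_table(df, values='B', index=['Date'], columns=['A'],
                          aggfunc=lambda s: s.sum(min_count=1), dropna=False)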
I'm wondering if there is a concise way to exclude all columns with more than N NaNs, while exempting one column from this rule.
For example:
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5]],
                  columns=list('ABCD'))
Results in:
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
2 NaN NaN NaN 5
Running the following, I get:
df.dropna(thresh=2, axis=1)
B D
0 2.0 0
1 4.0 1
2 NaN 5
I would like to keep column 'C'. I.e., to perform this thresholding except on column 'C'.
Is that possible?
You can put the column back once you've done the thresholding. If you do this all on one line, you don't even need to store a reference to the column.
import pandas as pd
import numpy as np
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5]],
                  columns=list('ABCD'))
df.dropna(thresh=2, axis=1).assign(C=df['C'])
You could also do
C = df['C']
df = df.dropna(thresh=2, axis=1)
df = df.assign(C=C)
As suggested by #Wen, you can also do an indexing operation that won't remove column C to begin with.
threshold = 2
df = df.loc[:, (df.isnull().sum(0) < threshold) | (df.columns == 'C')]
The indexer here selects columns that have fewer than threshold NaN values, or whose name is 'C'. If you wanted to include more than just one column in the exception, you can chain more conditions with the "or" operator |. For example:
df = df.loc[
    :,
    (df.isnull().sum(0) < threshold) |
    (df.columns == 'C') |
    (df.columns == 'D')]
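A tidier equivalent when exempting several columns (my addition, not from the original answer) is Index.isin:
df = df.loc[:, (df.isnull().sum(0) < threshold) | df.columns.isin(['C', 'D'])]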
df.loc[:, (df.isnull().sum(0) <= 1) | (df.isnull().sum(0) == len(df))]
Out[415]:
B C D
0 2.0 NaN 0
1 4.0 NaN 1
2 NaN NaN 5
As per Zero's suggestion:
df.loc[:, (df.isnull().sum(0) <= 1) | (df.isnull().all(0))]
EDIT:
df.loc[:, (df.isnull().sum(0) <= 1) | (df.columns == 'C')]
Another take that blends some concepts from other answers: mark column C as having no nulls before counting, so it always passes the threshold.
df.loc[:, df.isnull().assign(C=False).sum().lt(2)]
B C D
0 2.0 NaN 0
1 4.0 NaN 1
2 NaN NaN 5
I'm trying to find the difference between the first valid value and the last valid value in a DataFrame per row.
I have working code with a for loop and am looking for something faster.
Here's an example of what I'm doing currently:
import pandas as pd
import numpy as np
df = pd.DataFrame(
    np.arange(16, dtype=float).reshape(4, 4),
    columns=['a', 'b', 'c', 'd'])
# Fill some NaN
df.loc[0, ['a', 'd']] = np.nan
df.loc[1, ['c', 'd']] = np.nan
df.loc[2, 'b'] = np.nan
df.loc[3, :] = np.nan
print(df)
# a b c d
# 0 NaN 1.0 2.0 NaN
# 1 4.0 5.0 NaN NaN
# 2 8.0 NaN 10.0 11.0
# 3 NaN NaN NaN NaN
diffs = pd.Series(index=df.index, dtype=float)  # explicit dtype avoids the object-dtype default
for i in df.index:
    row = df.loc[i]
    min_i = row.first_valid_index()
    max_i = row.last_valid_index()
    if min_i is None or min_i == max_i:  # 0 or 1 valid values
        continue
    diffs[i] = df.loc[i, max_i] - df.loc[i, min_i]
df['diff'] = diffs
print(df)
# a b c d diff
# 0 NaN 1.0 2.0 NaN 1.0
# 1 4.0 5.0 NaN NaN 1.0
# 2 8.0 NaN 10.0 11.0 3.0
# 3 NaN NaN NaN NaN NaN
One way would be to back and forward fill the missing values, and then just compare the first and last rows.
df2 = df.ffill(axis=1).bfill(axis=1)
df['diff'] = df2.iloc[:, -1] - df2.iloc[:, 0]
If you want to do it in one line, without creating a new dataframe:
df['diff'] = df.ffill(axis=1).bfill(axis=1).apply(lambda r: r.d - r.a, axis=1)
Pandas making your life easy, one method (first_valid_index()) at a time. Note that you'll have to delete any rows whose values are all NaN first (no point in having these anyway):
For first valid values:
a = [df.loc[x, i] for x, i in zip(df.index, df.apply(lambda row: row.first_valid_index(), axis=1))]
For last valid values:
b = [df.loc[x, i] for x, i in zip(df.index, df.apply(lambda row: row.last_valid_index(), axis=1))]
Subtract to get the final result (plain lists don't subtract, so convert to arrays; the order is last minus first):
np.array(b) - np.array(a)
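If you want to avoid apply entirely, here is a NumPy-based sketch (my own, assuming the df built in the question with columns a through d):
vals = df[['a', 'b', 'c', 'd']].to_numpy()
valid = ~np.isnan(vals)
rows = np.arange(len(vals))
first = valid.argmax(axis=1)  # position of the first valid value per row
last = vals.shape[1] - 1 - valid[:, ::-1].argmax(axis=1)  # position of the last valid value
diff = vals[rows, last] - vals[rows, first]
diff[~valid.any(axis=1)] = np.nan  # rows with no valid values get NaN
df['diff'] = diff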