Pandas: select the second-to-last column which is also not NaN - python

I've cleaned my data as much as I can and read it into a Pandas DataFrame. The problem is that different files have different numbers of columns, but the column I want is always the second-to-last non-NaN column. Is there any way to pick it out? Here is an example of the data.
... f g h l
0 ... 39994 29.568 29.569 NaN
1 ... 39994 29.568 29.569 NaN
2 ... 39994 29.568 29.569 NaN
So I want column g in this case. In other files it could be f or any other column, depending on how many NaN columns there are at the end, but what I need is always the second-to-last non-NaN column. Thanks in advance for the help.

Similar idea to #piRSquared's. Essentially, use loc to keep the non-null columns, then use iloc to select the second-to-last one.
df.loc[:, ~df.isnull().all()].iloc[:, -2]
Sample input:
a b c d e f g h i j
0 0 3 6 9 12 15 18 21 NaN NaN
1 1 4 7 10 13 16 19 22 NaN NaN
2 2 5 8 11 14 17 20 23 NaN NaN
Sample output:
0 18
1 19
2 20
Name: g, dtype: int32
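If you want to check this end to end, here is a minimal, self-contained sketch that rebuilds the sample frame above and applies the same selection (the values and column names are just the ones shown; the exact dtype may vary by platform):
import numpy as np
import pandas as pd

# Rebuild the sample: columns a-h hold 0..23 column-wise, i and j are all NaN
df = pd.DataFrame(np.arange(24).reshape(8, 3).T, columns=list("abcdefgh"))
df["i"] = np.nan
df["j"] = np.nan

# keep the columns that are not entirely null, then take the second-to-last
print(df.loc[:, ~df.isnull().all()].iloc[:, -2])
# 0    18
# 1    19
# 2    20
# Name: g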

One-liner
df.loc[:, :df.columns[(df.columns == df.isnull().all().idxmax()).argmax() - 2]]
... f g
0 ... 39994 29.568
1 ... 39994 29.568
2 ... 39994 29.568
Readable
# identify null columns
nullcols = df.isnull().all()
# find the column heading for the first null column
nullcol = nullcols.idxmax()
# find the position of that null column
nullcol_position = (df.columns == nullcol).argmax()
# get column 2 positions prior
col_2_prior_to_null_col = df.columns[nullcol_position - 2]
# get dataframe
print(df.loc[:, :col_2_prior_to_null_col])
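The same steps can be wrapped in a reusable helper; this is just a sketch (second_to_last_non_nan is a hypothetical name), and it assumes the all-NaN columns sit at the end of the frame:
def second_to_last_non_nan(df):
    # if nothing is all-NaN, there is nothing to trim
    nullcols = df.isnull().all()
    if not nullcols.any():
        return df
    # position of the first all-NaN column, then slice up to 2 columns before it
    nullcol_position = (df.columns == nullcols.idxmax()).argmax()
    return df.loc[:, :df.columns[nullcol_position - 2]]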

Related

How to fill missing values in a few columns at the same time

I need to fill missing values in a few columns. I wrote this to do it one by one:
df2['A'].fillna(df1['A'].mean(), inplace=True)
df2['B'].fillna(df1['B'].mean(), inplace=True)
df2['C'].fillna(df1['C'].mean(), inplace=True)
Any other ways I can fill them all in one line of code?
You can use a single instruction:
cols = ['A', 'B', 'C']
df[cols] = df[cols].fillna(df[cols].mean())
Or, to apply it to all numeric columns, use select_dtypes:
cols = df.select_dtypes('number').columns
df[cols] = df[cols].fillna(df[cols].mean())
Note: I strongly discourage using the inplace parameter; it will probably disappear in a future version of Pandas.
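For the original two-frame setup (means computed from df1, holes filled in df2), the same pattern works because fillna aligns on column labels; a minimal sketch with made-up values:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})
df2 = pd.DataFrame({"A": [1.0, np.nan], "B": [np.nan, 5.0], "C": [np.nan, 9.0]})

cols = ["A", "B", "C"]
df2[cols] = df2[cols].fillna(df1[cols].mean())  # means: A=2, B=5, C=8
print(df2)
#      A    B    C
# 0  1.0  5.0  8.0
# 1  2.0  5.0  9.0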
The lambda in this comprehension is never actually called, so nothing gets filled; dropping it makes the statement execute (though relying on inplace inside a comprehension is still questionable):
[df2[c].fillna(df1[c].mean(), inplace=True) for c in df2.columns]
There are a few options for working with NaNs in a DataFrame. I'll explain some of them...
Given this example df:
     A    B     C
0  1.0  5.0  10.0
1  2.0  NaN  11.0
2  NaN  NaN  12.0
3  4.0  8.0   NaN
4  NaN  9.0  14.0
Example 1: fill all columns with mean
df = df.fillna(df.mean())
Result:
   A        B        C
0  1        5        10
1  2        7.33333  11
2  2.33333  7.33333  12
3  4        8        11.75
4  2.33333  9        14
Example 2: fill some columns with median
df[["A","B"]] = df[["A","B"]].fillna(df.median())
Result:
   A  B  C
0  1  5  10
1  2  8  11
2  2  8  12
3  4  8  NaN
4  2  9  14
Example 3: fill all columns using ffill()
Explanation: Missing values are replaced with the most recent available value in the same column. So, the value of the preceding row in the same column is used to fill in the blanks.
df = df.ffill()  # equivalent to the older fillna(method='ffill'), which is deprecated
Result:
   A  B  C
0  1  5  10
1  2  5  11
2  2  5  12
3  4  8  12
4  4  9  14
Example 4: fill all columns using bfill()
Explanation: Missing values in a column are filled using the value of the next row going up, meaning the values are filled from the bottom to the top. Basically, you're replacing the missing values with the next known non-missing value.
df = df.bfill()  # equivalent to the older fillna(method='bfill'), which is deprecated
Result:
   A    B  C
0  1    5  10
1  2    8  11
2  4    8  12
3  4    8  14
4  NaN  9  14
If you want to DROP (not fill) the missing values, you can do this:
Option 1: remove rows with one or more missing values
df = df.dropna(how="any")
Result:
   A  B  C
0  1  5  10
Option 2: remove rows with all missing values
df = df.dropna(how="all")
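For reference, here is a self-contained sketch reproducing the example frame and each variant above; every operation is applied to a fresh copy, since the examples are independent:
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, np.nan, 4, np.nan],
                   "B": [5, np.nan, np.nan, 8, 9],
                   "C": [10, 11, 12, np.nan, 14]})

print(df.fillna(df.mean()))      # Example 1: fill with column means
df2 = df.copy()
df2[["A", "B"]] = df2[["A", "B"]].fillna(df.median())  # Example 2: medians for A and B
print(df2)
print(df.ffill())                # Example 3: forward fill
print(df.bfill())                # Example 4: backward fill
print(df.dropna(how="any"))      # Option 1: drop rows with any NaN
print(df.dropna(how="all"))      # Option 2: drop rows that are entirely NaN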

Create new columns where row values are NaN

I have a column of data containing rows where NaN exists (sample below). I want to split it where values are NaN and start a new column wherever a value emerges after a run of NaNs. For instance, a new column should begin at row 7 and take the subsequent rows, since they follow NaN values in the column. I have tried this, but it crams the data together.
Col1
0 Start
1 65
2 oft
3 23:59:02
4 12-Feb-99
5 NaN
6 NaN
7 17
8 Sparkle
9 10
I have used the code below to break them into groups.
df['group_no'] = df.Col1.isnull().cumsum()
        Col1  group_no
0      Start         0
1         65         0
2        oft         0
3   23:59:02         0
4  12-Feb-99         0
5        NaN         1
6        NaN         2
7         17         2
8    Sparkle         2
9         10         2
I now intend to stack the data into different columns based on the group numbers:
        Col1     Col2  Col3  ...  ColN
0      Start      NaN   NaN  ...
1         65       17   ...
2        oft  Sparkle   ...
3   23:59:02       10   ...
4  12-Feb-99
I suggest slicing the pandas DataFrame manually instead of using numpy to slice.
import numpy as np

# Get the index positions of the null values
index = df.index[df.col.isna()].to_list()
# each run of non-null values starts right after a NaN and ends right before one
starting_index = [0] + [i + 1 for i in index]
ending_index = [i - 1 for i in index] + [len(df) - 1]
n = 0
for i, j in zip(starting_index, ending_index):
    if i <= j:  # skip empty runs (consecutive NaNs)
        n += 1
        df[f"col{n}"] = np.nan
        df.loc[: j - i, f"col{n}"] = df.loc[i:j, "col"].values
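As a quick check, running the snippet above on the question's sample (with the column renamed to col so it matches the code) should give one new column per group of consecutive non-null values:
import numpy as np
import pandas as pd

df = pd.DataFrame({"col": ["Start", 65, "oft", "23:59:02", "12-Feb-99",
                           np.nan, np.nan, 17, "Sparkle", 10]})
index = df.index[df.col.isna()].to_list()          # NaN positions: [5, 6]
starting_index = [0] + [i + 1 for i in index]
ending_index = [i - 1 for i in index] + [len(df) - 1]
n = 0
for i, j in zip(starting_index, ending_index):
    if i <= j:
        n += 1
        df[f"col{n}"] = np.nan
        df.loc[: j - i, f"col{n}"] = df.loc[i:j, "col"].values
print(df[["col1", "col2"]].head(5))
#         col1     col2
# 0      Start       17
# 1         65  Sparkle
# 2        oft       10
# 3   23:59:02      NaN
# 4  12-Feb-99      NaN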

How to remove columns after a NaN value appears in any row of a Python pandas DataFrame

Toy example code
Let's say I have following DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame({"A":[11,21,31], "B":[12,22,32], "C":[np.nan,23,33], "D":[np.nan,24,34], "E":[15,25,35]})
Which would return:
>>> df
A B C D E
0 11 12 NaN NaN 15
1 21 22 23.0 24.0 25
2 31 32 33.0 34.0 35
Remove all columns with NaN values
I know how to remove all the columns that have a NaN value in any row, like this:
out1 = df.dropna(axis=1, how="any")
Which returns:
>>> out1
A B E
0 11 12 15
1 21 22 25
2 31 32 35
Expected output
However, what I want is to remove the column in which a NaN is first found, together with every column after it. In the toy example the expected output would be:
A B
0 11 12
1 21 22
2 31 32
Question
How can I remove all columns from the point where a NaN is first found in any row of a pandas DataFrame?
What I would do:
check every element for being null/not null
cumulative sum every row across the columns
check any for every column, across the rows
use that result as an indexer:
df.loc[:, ~df.isna().cumsum(axis=1).any(axis=0)]
Which gives me:
A B
0 11 12
1 21 22
2 31 32
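Printing the intermediate mask makes the trick visible; a quick sketch with the toy frame:
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [11, 21, 31], "B": [12, 22, 32],
                   "C": [np.nan, 23, 33], "D": [np.nan, 24, 34],
                   "E": [15, 25, 35]})

mask = df.isna().cumsum(axis=1).any(axis=0)
print(mask)
# A    False
# B    False
# C     True
# D     True
# E     True   <- E has no NaN itself, but the cumulative count has reached it
print(df.loc[:, ~mask])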
I found a way as follows to get the expected output:
colFirstNaN = df.isna().any(axis=0).idxmax() # Find column that has first NaN element in any row
indexColLastValue = df.columns.tolist().index(colFirstNaN) - 1
ColLastValue = df.columns[indexColLastValue]
out2 = df.loc[:, :ColLastValue]
And the output would be then:
>>> out2
A B
0 11 12
1 21 22
2 31 32
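Wrapped as a small helper for reuse (just a sketch; drop_cols_from_first_nan is a hypothetical name, and df below is the toy frame from the question):
def drop_cols_from_first_nan(df):
    # boolean Series: which columns contain at least one NaN
    has_nan = df.isna().any(axis=0)
    if not has_nan.any():
        return df  # no NaN anywhere: keep every column
    # slice stops just before the first NaN-bearing column
    return df.iloc[:, :df.columns.get_loc(has_nan.idxmax())]

print(drop_cols_from_first_nan(df))
#     A   B
# 0  11  12
# 1  21  22
# 2  31  32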

Update column NaN values with mean of filtered rows

I have the following DataFrame
VOTES CITY
24 A
22 A
20 B
NaN A
NaN A
30 B
NaN C
I need to fill the NaN values with the mean of the values where CITY is 'A' or 'C'.
The following code I tried only updated the first row of VOTES; the rest were all set to NaN.
train['VOTES'][((train['VOTES'].isna()) & (train['CITY'].isin(['A','C'])))]=train['VOTES'].loc[((~train['VOTES'].isna()) & (train['CITY'].isin(['A','C'])))].astype(int).mean(axis=0)
After running this, every value in 'VOTES' is NaN except the one record at index 0. The mean is calculated correctly, though.
Use Series.fillna only on the filtered rows, with the mean of the filtered rows:
# extract the numeric part first, in case VOTES contains stray characters
train['VOTES_EN'] = train['VOTES'].astype(str).str.extract(r'(-?\d+\.?\d*)', expand=False).astype(float)
m = train['CITY'].isin(['A','C'])
mean = train.loc[m, 'VOTES_EN'].mean()
train.loc[m, 'VOTES_EN'] = train.loc[m, 'VOTES_EN'].fillna(mean)
train['VOTES_EN'] = train['VOTES_EN'].astype(int)
print(train)
VOTES CITY VOTES_EN
0 24.0 A 24
1 22.0 A 22
2 20.0 B 20
3 NaN A 23
4 NaN A 23
5 30.0 B 30
6 NaN C 23
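If VOTES is already numeric, as in the sample shown, the extraction step can be skipped and the whole fill reduces to a few lines; a minimal sketch:
import numpy as np
import pandas as pd

train = pd.DataFrame({"VOTES": [24, 22, 20, np.nan, np.nan, 30, np.nan],
                      "CITY": list("AABAABC")})

m = train["CITY"].isin(["A", "C"])
train.loc[m, "VOTES"] = train.loc[m, "VOTES"].fillna(train.loc[m, "VOTES"].mean())
print(train)
#    VOTES CITY
# 0   24.0    A
# 1   22.0    A
# 2   20.0    B
# 3   23.0    A
# 4   23.0    A
# 5   30.0    B
# 6   23.0    C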

Pandas - remove row similar to other row

I need to remove all rows from a pandas.DataFrame that satisfy an unusual condition.
If there is another row that is exactly the same except that this row has a NaN value in column "C", I want to remove this row.
Given a table:
A B C D
1 2 NaN 3
1 2 50 3
10 20 NaN 30
5 6 7 8
I need to remove the first row, since it has NaN in column C but there is an otherwise identical row (the second) with a real value in column C.
However, the 3rd row must stay, because no other row has the same A, B and D values.
How do I perform this using pandas? Thank you!
You can achieve this using drop_duplicates.
Initial DataFrame:
df=pd.DataFrame(columns=['a','b','c','d'], data=[[1,2,None,3],[1,2,50,3],[10,20,None,30],[5,6,7,8]])
df
a b c d
0 1 2 NaN 3
1 1 2 50 3
2 10 20 NaN 30
3 5 6 7 8
Then you can sort the DataFrame by column C. This will push the NaNs to the bottom of the column:
df = df.sort_values(['c'])
df
a b c d
3 5 6 7 8
1 1 2 50 3
0 1 2 NaN 3
2 10 20 NaN 30
Then remove duplicates, taking into account all columns except C and keeping the first row encountered:
df1 = df.drop_duplicates(['a','b','d'], keep='first')
a b c d
3 5 6 7 8
1 1 2 50 3
2 10 20 NaN 30
But this is only valid if the NaNs are confined to column C.
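The three steps collapse into one expression; sort_index at the end restores the original row order (a sketch):
df1 = (df.sort_values("c")
         .drop_duplicates(["a", "b", "d"], keep="first")
         .sort_index())
#     a   b     c   d
# 1   1   2  50.0   3
# 2  10  20   NaN  30
# 3   5   6   7.0   8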
You can try fillna along with drop_duplicates:
df.loc[df.bfill().ffill().drop_duplicates(subset=['A', 'B', 'D'], keep='last').index]
Filling first makes the NaN row compare equal to its non-NaN twin; indexing back into the original df keeps the original values. This also handles the scenario where the A, B and D values are the same but C has non-NaN values in both rows.
You get
    A   B     C   D
1   1   2  50.0   3
2  10  20   NaN  30
3   5   6   7.0   8
This feels right to me
notdups = ~df.duplicated(df.columns.difference(['C']), keep=False)
notnans = df.C.notnull()
df[notdups | notnans]
A B C D
1 1 2 50.0 3
2 10 20 NaN 30
3 5 6 7.0 8
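A self-contained version for checking, using the uppercase column names and values from the question's table:
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 10, 5], "B": [2, 2, 20, 6],
                   "C": [np.nan, 50, np.nan, 7], "D": [3, 3, 30, 8]})

# keep=False marks every member of a duplicate group, not just the later ones
notdups = ~df.duplicated(df.columns.difference(["C"]), keep=False)
notnans = df["C"].notnull()
print(df[notdups | notnans])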
