Pandas - remove row similar to other row - python

I need to remove all rows from a pandas.DataFrame, which satisfy an unusual condition.
In case there is an exactly the same row, except for it has Nan value in column "C", I want to remove this row.
Given a table:
A B C D
1 2 NaN 3
1 2 50 3
10 20 NaN 30
5 6 7 8
I need to remove the first row, since it has Nan in column C, but there is absolutely same row (second) with real value in column C.
However, 3rd row must stay, because there're no rows with same A, B and D values as it has.
How do you perform this using pandas? Thank you!

You can achieve in using drop_duplicates.
Initial DataFrame:
df=pd.DataFrame(columns=['a','b','c','d'], data=[[1,2,None,3],[1,2,50,3],[10,20,None,30],[5,6,7,8]])
df
a b c d
0 1 2 NaN 3
1 1 2 50 3
2 10 20 NaN 30
3 5 6 7 8
Then you can sort DataFrame by column C. This will drop NaNs to the bottom of column:
df = df.sort_values(['c'])
df
a b c d
3 5 6 7 8
1 1 2 50 3
0 1 2 NaN 3
2 10 20 NaN 30
And then remove duplicates selecting taken into account columns ignoring C and keeping first catched row:
df1 = df.drop_duplicates(['a','b','d'], keep='first')
a b c d
3 5 6 7 8
1 1 2 50 3
2 10 20 NaN 30
But it will be valid only if NaNs are in column C.

You can try fillna along with drop_duplicates
df.bfill().ffill().drop_duplicates(subset=['A', 'B', 'D'], keep = 'last')
This will handle the scenario such as A, B and D values are same but C has non-NaN values in both the rows.
You get
A B C D
1 1 2 50 3
2 10 20 Nan 30
3 5 6 7 8

This feels right to me
notdups = ~df.duplicated(df.columns.difference(['C']), keep=False)
notnans = df.C.notnull()
df[notdups | notnans]
A B C D
1 1 2 50.0 3
2 10 20 NaN 30
3 5 6 7.0 8

Related

How to fill missing value in a few columns at the same time

I need to drop missing values in a few columns. I wrote this to do it one by one:
df2['A'].fillna(df1['A'].mean(), inplace=True)
df2['B'].fillna(df1['B'].mean(), inplace=True)
df2['C'].fillna(df1['C'].mean(), inplace=True)
Any other ways I can fill them all in one line of code?
You can use a single instructions:
cols = ['A', 'B', 'C']
df[cols] = df[cols].fillna(df[cols].mean())
Or for apply on all numeric columns, use select_dtypes:
cols = df.select_dtypes('number').columns
df[cols] = df[cols].fillna(df[cols].mean())
Note: I strongly discourage you to use inplace parameter. It will probably disappear in Pandas 2
[lambda c: df2[c].fillna(df1[c].mean(), inplace=True) for c in df2.columns]
There are few options to work with nans in a df. I'll explain some of them...
Given this example df:
A
B
C
0
1
5
10
1
2
nan
11
2
nan
nan
12
3
4
8
nan
4
nan
9
14
Example 1: fill all columns with mean
df = df.fillna(df.mean())
Result:
A
B
C
0
1
5
10
1
2
7.33333
11
2
2.33333
7.33333
12
3
4
8
11.75
4
2.33333
9
14
Example 2: fill some columns with median
df[["A","B"]] = df[["A","B"]].fillna(df.median())
Result:
A
B
C
0
1
5
10
1
2
8
11
2
2
8
12
3
4
8
nan
4
2
9
14
Example 3: fill all columns using ffill()
Explanation: Missing values are replaced with the most recent available value in the same column. So, the value of the preceding row in the same column is used to fill in the blanks.
df = df.fillna(method='ffill')
Result:
A
B
C
0
1
5
10
1
2
8
11
2
2
8
12
3
4
8
12
4
2
9
14
Example 4: fill all columns using bfill()
Explanation: Missing values in a column are filled using the value of the next row going up, meaning the values are filled from the bottom to the top. Basically, you're replacing the missing values with the next known non-missing value.
df = df.fillna(method='bfill')
Result:
A
B
C
0
1
5
10
1
2
8
11
2
4
8
12
3
4
8
14
4
nan
9
14
If you wanted to DROP (no fill) the missing values. You can do this:
Option 1: remove rows with one or more missing values
df = df.dropna(how="any")
Result:
A
B
C
0
1
5
10
Option 2: remove rows with all missing values
df = df.dropna(how="all")

Filling na in a column in pandas from multiple columns conition

I would love to fill na in pandas dataframe where two columns in the dataframe both has on the same row.
A B C
2 3 5
Nan nan 7
4 7 9
Nan 4 9
12 5 8
Nan Nan 6
In the above dataframe, I would love to replace just row where both column A and B has Nan with "Not Available"
Thus:
A B C
2 3 5
Not available not available 7
4 7 9
Nan 4 9
12 5 8
Not available not available 6
I have tried multiple approach but I am getting undesirable result
If want test only A and B columns use DataFrame.loc with mask test it missing value with DataFrame.all for test if both are match:
m = df[['A','B']].isna().all(axis=1)
df.loc[m, ['A','B']] = 'Not Available'
If need test any 2 columns first count number of missing values and is 2 use fillna:
m = df.isna().sum(axis=1).eq(1)
df = df.where(m, df.fillna('Not Available'))
print (df)
A B C
0 2 3 5
1 Not Available Not Available 7
2 4 7 9
3 NaN 4 9
4 12 5 8
5 Not Available Not Available 6

Add all column values repeated of one data frame to other in pandas

Having two data frames:
df1 = pd.DataFrame({'a':[1,2,3],'b':[4,5,6]})
a b
0 1 4
1 2 5
2 3 6
df2 = pd.DataFrame({'c':[7],'d':[8]})
c d
0 7 8
The goal is to add all df2 column values to df1, repeated and create the following result. It is assumed that both data frames do not share any column names.
a b c d
0 1 4 7 8
1 2 5 7 8
2 3 6 7 8
If there are strings columns names is possible use DataFrame.assign with unpack Series created by selecing first row of df2:
df = df1.assign(**df2.iloc[0])
print (df)
a b c d
0 1 4 7 8
1 2 5 7 8
2 3 6 7 8
Another idea is repeat values by df1.index with DataFrame.reindex and use DataFrame.join (here first index value of df2 is same like first index value of df1.index):
df = df1.join(df2.reindex(df1.index, method='ffill'))
print (df)
a b c d
0 1 4 7 8
1 2 5 7 8
2 3 6 7 8
If no missing values in original df is possible use forward filling missing values in last step, but also are types changed to floats, thanks #Dishin H Goyan:
df = df1.join(df2).ffill()
print (df)
a b c d
0 1 4 7.0 8.0
1 2 5 7.0 8.0
2 3 6 7.0 8.0

Adding and multiplying values of a dataframe in Python

I have a dataset with multiple columns and rows. The rows are supposed to be summed up based on the unique value in a column. I tried .groupby but I want to retain the whole dataset and not just summed up columns based on one unique column. I further need to multiple these individual columns(values) with another column.
For example:
id A B C D E
11 2 1 2 4 100
11 2 2 1 1 100
12 1 3 2 2 200
13 3 1 1 4 190
14 Nan 1 2 2 300
I would like to sum up columns B, C & D based on the unique id and then multiply the result by column A and E in a new column F. I do not want to sum up the values of column A & E
I would like the resultant dataframe to be something like this, which also deals with NaN and while calculating skips the NaN value and moves onto further calculation:
id A B C D E F
11 2 3 3 5 100 9000
12 1 3 2 2 200 2400
13 3 1 1 4 190 2280
14 Nan 1 2 2 300 1200
If the above is unachievable then I would like something as, where the rows are same but the calculation is what I have stated above based on the same id:
id A B C D E F
11 2 3 3 5 100 9000
11 2 2 1 1 100 9000
12 1 3 2 2 200 2400
13 3 1 1 4 190 2280
14 Nan 1 2 2 300 1200
My logic earlier was to apply groupby on the columns B, C, D and then multiply but that is not working out for me. If the above dataframes are unachieavable then please let me know how can i perform this calculation and then merge/join the results with the original file with just E column.
You must first sum verticaly the columns B, C and D for common id, then take the horizontal product:
result = df.groupby('id').agg({'A': 'first', 'B':'sum', 'C': 'sum', 'D': 'sum',
'E': 'first'})
result['F'] = result.fillna(1).astype('int64').agg('prod', axis=1)
It gives:
A B C D E F
id
11 2.0 3 3 5 100 9000
12 1.0 3 2 2 200 2400
13 3.0 1 1 4 190 2280
14 NaN 1 2 2 300 1200
Beware: id is the index here - use reset_index if you want it to be a normal column.

Refer to next index in pandas

If I had a simple pandas DataFrame like this:
frame = pd.DataFrame(np.arange(12).reshape((3,4)), columns=list('abcd'), index=list('123'))
I want find the max value from each row, and use this to find the next value in the column and add this value to a new column.
So the above DataFrame looks like this (with d2 changed to 3):
a b c d
1 1 2 3 4
2 5 6 7 3
3 9 10 11 12
So, conceptually the first row should be scanned, 4 is identified as the largest number, then 3 is found as the number within the same column but in the next index. Similarly for the row 2, 7 is the largest number, and 11 is the next number in that column. So 3 and 11 should get added to a new column like this:
a b c d Next
1 1 2 3 4 NaN
2 5 6 7 3 3
3 9 10 11 12 11
I started by making a function like this, but it only finds the max values.
f = lambda x: x.max()
max = frame.apply(f, axis='columns')
frame['Next'] = max
Based on your edit, you can use np.argsort:
i = np.arange(len(df))
j = pd.Series(np.argmax(df.values, axis=1))
df['next'] = df.shift(-1).values[i, j]
a b c d next
1 1 2 3 4 3.0
2 5 6 7 3 11.0
3 9 10 11 12 NaN

Categories