I have an existing dataframe and a new dataframe whose rows may already be present in the existing frame but may also be genuinely new. I have struggled to find a reliable way to drop the already-existing rows from the new dataframe by comparing it with the existing one.
I've done my homework, and the usual suggestion seems to be isin(). However, I find that it has hidden dangers, covered in questions such as:
pandas get rows which are NOT in other dataframe
Pandas cannot compute isin with a duplicate axis
Pandas promotes int to float when filtering
Is there a way to reliably filter out rows from one dataframe based on membership/containment in another dataframe? A simple use case (which doesn't capture the corner cases) is shown below. Note that I want to remove the rows in new that are already in existing, so that new only contains rows not present in existing. The simpler problem of updating existing with the new rows from new can be achieved with pd.merge() + DataFrame.drop_duplicates() (a minimal sketch of that follows).
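For reference, a minimal sketch of that simpler update. The existing/new frames here are hypothetical, and pd.concat + drop_duplicates is just one common way to spell it (an alternative to the merge-based route mentioned above):
import pandas as pd

existing = pd.DataFrame({'col1': [1, 2], 'col2': [10, 11]})
new = pd.DataFrame({'col1': [2, 3], 'col2': [11, 12]})

# Stack both frames and drop rows that occur in both;
# ignore_index rebuilds a clean 0..n-1 index on the result.
updated = pd.concat([existing, new]).drop_duplicates(ignore_index=True)
print(updated)
#    col1  col2
# 0     1    10
# 1     2    11
# 2     3    12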
In [53]: df1 = pd.DataFrame(data = {'col1' : [1, 2, 3, 4, 5], 'col2' : [10, 11, 12, 13, 14]})
...: df2 = pd.DataFrame(data = {'col1' : [1, 2, 3], 'col2' : [10, 11, 12]})
In [54]: df1
Out[54]:
col1 col2
0 1 10
1 2 11
2 3 12
3 4 13
4 5 14
In [55]: df2
Out[55]:
col1 col2
0 1 10
1 2 11
2 3 12
In [56]: df1[~df1.isin(df2)]
Out[56]:
col1 col2
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 4.0 13.0
4 5.0 14.0
In [57]: df1[~df1.isin(df2)].dropna()
Out[57]:
col1 col2
3 4.0 13.0
4 5.0 14.0
We can use DataFrame.merge with indicator=True, followed by DataFrame.query and DataFrame.drop:
df_filtered = (df1.merge(df2, how='outer', indicator=True)
                  .query("_merge == 'left_only'")
                  .drop('_merge', axis=1))
print(df_filtered)
col1 col2
3 4 13
4 5 14
If, for example, we now change a value in row 0:
df1.iat[0, 0] = 3
Row 0 is no longer filtered out:
df_filtered = (df1.merge(df2, how='outer', indicator=True)
                  .query("_merge == 'left_only'")
                  .drop('_merge', axis=1))
print(df_filtered)
col1 col2
0 3 10
3 4 13
4 5 14
Step by step
df_filtered = df1.merge(df2, how='outer', indicator=True)
print(df_filtered)
col1 col2 _merge
0 3 10 left_only
1 2 11 both
2 3 12 both
3 4 13 left_only
4 5 14 left_only
5 1 10 right_only
df_filtered = (df1.merge(df2, how='outer', indicator=True)
                  .query("_merge == 'left_only'"))
print(df_filtered)
col1 col2 _merge
0 3 10 left_only
3 4 13 left_only
4 5 14 left_only
df_filtered = (df1.merge(df2, how='outer', indicator=True)
                  .query("_merge == 'left_only'")
                  .drop('_merge', axis=1))
print(df_filtered)
col1 col2
0 3 10
3 4 13
4 5 14
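As a quick check, this merge-based route also sidesteps the int-to-float promotion that turned the isin() output into NaNs and floats (assuming the default int64 dtypes of the frames above):
print(df_filtered.dtypes)
# col1    int64
# col2    int64
# dtype: object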
You may try Series.isin. It is independent of the index, i.e. it only compares values. You just need to convert the columns of each dataframe into a Series of tuples to create the mask:
s1 = df1.agg(tuple, axis=1)
s2 = df2.agg(tuple, axis=1)
df1[~s1.isin(s2)]
Out[538]:
col1 col2
3 4 13
4 5 14
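As a quick sanity check that the comparison really ignores the index, the same mask works if df1's rows are shuffled (this reuses s2 from above; sample(frac=1) simply reorders the rows, and the random_state is arbitrary):
shuffled = df1.sample(frac=1, random_state=0)  # same rows, scrambled order
mask = shuffled.agg(tuple, axis=1).isin(s2)
shuffled[~mask]  # still only the rows absent from df2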
Related
I have to combine some dataframes in Python. I've tried to combine them using the concat operation, but I am getting NaN values because each dataframe has a different number of rows. For example:
DATAFRAME 1:
col1
0 1
DATAFRAME 2:
col2
0 5
DATAFRAME 3:
col3
0 7
1 8
2 9
COMBINED DATAFRAME:
col1 col2 col3
0 1.0 5.0 7
1 NaN NaN 8
2 NaN NaN 9
In this example, dataframe 1 and dataframe 2 only have 1 row. However, dataframe 3 has 3 rows. When I combine these 3 dataframes, I get NaN values for columns col1 and col2 in the new dataframe. I'd like to get a dataframe where the values for col1 and col2 are always the same. In this case, the expected dataframe would look like this:
EXPECTED DATAFRAME:
col1 col2 col3
0 1 5 7
1 1 5 8
2 1 5 9
Any idea? Thanks in advance
Use concat and ffill:
df = pd.concat([df1, df2, df3], axis=1).ffill()
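A minimal runnable sketch with frames matching the example above. Note that col1 and col2 come out as floats, because the intermediate concat introduces NaN before ffill fills them in:
import pandas as pd

df1 = pd.DataFrame({'col1': [1]})
df2 = pd.DataFrame({'col2': [5]})
df3 = pd.DataFrame({'col3': [7, 8, 9]})

df = pd.concat([df1, df2, df3], axis=1).ffill()
print(df)
#    col1  col2  col3
# 0   1.0   5.0     7
# 1   1.0   5.0     8
# 2   1.0   5.0     9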
You can use ffill() on your merged dataframe to fill in the blanks with the previous value:
df.ffill()
col1 col2 col3
0 1 5 7
1 1 5 8
2 1 5 9
I have the following 2 dataframes..
First dataframe df1:
import pandas as pd
import numpy as np
d1 = {'id': [1, 2, 3, 4], 'col1': [13, np.nan, 15, np.nan], 'col2': [23, np.nan, np.nan, np.nan]}
df1 = pd.DataFrame(data=d1)
df1
id col1 col2
0 1 13.0 23.0
1 2 NaN NaN
2 3 15.0 NaN
3 4 NaN NaN
And the second dataframe df2:
d2 = {'id': [2, 3, 4], 'col1': [ 14, 150, 16], 'col2': [24, 250, np.nan]}
df2 = pd.DataFrame(data=d2)
df2
id col1 col2
0 2 14 24.0
1 3 150 250.0
2 4 16 NaN
I need to replace the NaN fields in df1 with the non-NaN values from df2, where possible. But there are some conditions...
Condition 1) The id column in each dataframe consists of unique values. When replacing any NaN value in df1 with a value from df2, the id column values need to match.
Condition 2) Dataframes do not necessarily have the same size.
Condition 3) NaN values will only be looked for in col1 or col2 in any of the dataframes. The id column cannot be NaN in any row. There might be other columns in the dataframes, with or without NaN values. But for replacing the data, we will only be looking at col1 and col2 columns.
Condition 4) For a row in df1 to be replaced, it is enough that either col1 or col2 has a NaN value in that row. When a NaN value is detected in any row of df1, the entire row is replaced by the row with the same id value from df2, as long as all values of col1 and col2 in that df2 row are non-NaN. In other words, if the row with the same id value in df2 has a NaN in either col1 or col2, do not replace any data in df1.
After doing this operation, the df1 should look like the following:
id col1 col2
0 1 13.0 23.0
1 2 14 24
2 3 150.0 250.0 # Note that the entire row is replaced!
3 4 NaN NaN # This row is not replaced because the col2 value is NaN in df2 for the same id
How can this be done in the most elegant way? Python offers a lot of functions that I may not be aware of, which may solve this problem in a few lines instead of requiring very complex logic.
You can drop the rows containing NaN values from df2, concatenate its clean rows in front of df1, and then take the first non-null value per column within each id group using groupby + first:
pd.concat([df2.dropna(), df1]).groupby('id', as_index=False).first()
Output:
id col1 col2
0 1 13.0 23.0
1 2 14.0 24.0
2 3 150.0 250.0
3 4 NaN NaN
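For reference, a hedged alternative sketch (not the answer's method) using combine_first: index both frames by id, and let the cleaned df2's values take precedence wherever that frame has a row, falling back to df1 elsewhere:
# combine_first keeps the caller's values where present and fills the rest from df1
df2.dropna().set_index('id').combine_first(df1.set_index('id')).reset_index()
#    id   col1   col2
# 0   1   13.0   23.0
# 1   2   14.0   24.0
# 2   3  150.0  250.0
# 3   4    NaN    NaN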
Here is another way, using fillna:
df1 = df1.set_index('id').fillna(df2.dropna().set_index('id')).reset_index()
Output:
id col1 col2
0 1 13.0 23.0
1 2 14.0 24.0
2 3 15.0 250.0
3 4 NaN NaN
I would like to merge two data frames, a big one and a small one. An example of the data frames follows:
# small data frame construction
>>> d1 = {'col1': ['A', 'B'], 'col2': [3, 4]}
>>> df1 = pd.DataFrame(data=d1)
>>> df1
col1 col2
0 A 3
1 B 4
# big data frame construction
>>> d2 = {'col1': ['A', 'B', 'C', 'D', 'E'], 'col2': [3, 4, 6, 7, 8]}
>>> df2 = pd.DataFrame(data=d2)
>>> df2
col1 col2
0 A 3
1 B 4
2 C 6
3 D 7
4 E 8
The code I am looking for should produce the following output (a data frame with the big data frame's shape and column names, and NaNs in the rows that were not matched by the small data frame):
col1 col2
0 A 3
1 B 4
2 NA NA
3 NA NA
4 NA NA
The code I have tried:
>>> print(pd.merge(df1, df2, left_index=True, right_index=True, how='right', sort=False))
col1_x col2_x col1_y col2_y
0 A 3.0 A 3
1 B 4.0 B 4
2 NaN NaN C 6
3 NaN NaN D 7
4 NaN NaN E 8
You can pass the suffixes parameter so that the columns coming from the right frame get a trailing _, and then drop those added columns using Series.str.endswith, an inverted mask (~) and boolean indexing with loc:
df = pd.merge(df1, df2,
              left_index=True,
              right_index=True,
              how='right',
              sort=False,
              suffixes=('', '_'))
print(df)
col1 col2 col1_ col2_
0 A 3.0 A 3
1 B 4.0 B 4
2 NaN NaN C 6
3 NaN NaN D 7
4 NaN NaN E 8
df = df.loc[:, ~df.columns.str.endswith('_')]
print(df)
col1 col2
0 A 3.0
1 B 4.0
2 NaN NaN
3 NaN NaN
4 NaN NaN
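For reference, a shorter alternative sketch (not the answer's method) that produces the same big-frame-shaped result without suffix juggling: reindex the small frame onto the big frame's index, so the missing rows become NaN automatically (this assumes both frames use the default RangeIndex, as in this example):
df = df1.reindex(df2.index)
print(df)
#   col1  col2
# 0    A   3.0
# 1    B   4.0
# 2  NaN   NaN
# 3  NaN   NaN
# 4  NaN   NaN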
Question: How do you group a df by a variable and make a computation using a for loop?
The task is to make a conditional computation based on the value in a column, where the computational constants depend on the value in a reference column. Given this df:
In [55]: df = pd.DataFrame({
...: 'col1' : ['A', 'A', 'B', np.nan, 'D', 'C'],
...: 'col2' : [2, 1, 9, 8, 7, 4],
...: 'col3': [0, 1, 9, 4, 2, 3],
...: })
In [56]: df
Out[56]:
col1 col2 col3
0 A 2 0
1 A 1 1
2 B 9 9
3 NaN 8 4
4 D 7 2
5 C 4 3
I've used the solution here to insert a 'math' column that takes the balance from col3 and adds 10. But now I want to iterate over a list to set the computational variable dependent upon the values in col1. Here's the result:
In [57]: items = ['A', 'D']
In [58]: for item in items:
...: df.loc[:, 'math'] = df.loc[df['col1'] == item, 'col3']
...:
In [59]: df
Out[59]:
col1 col2 col3 math
0 A 2 0 NaN
1 A 1 1 NaN
2 B 9 9 NaN
3 NaN 8 4 NaN
4 D 7 2 2.0
5 C 4 3 NaN
The obvious issue is that the df is overwritten on each iteration. The math column got values for indices 0 and 1 on the first iteration, but they are removed on the second; the resulting df only reflects the last element of the list.
I could go through and add code to iterate over each index value, but that seems more pathetic than pythonic.
Expected Output for the .mul() example
In [100]: df
Out[100]:
col1 col2 col3 math
0 A 2 0 0.0
1 A 1 1 10.0
2 B 9 9 NaN
3 NaN 8 4 NaN
4 D 7 2 20.0
5 C 4 3 NaN
The problem with your current method is the output of each subsequent iteration overwrites the output of the one before it. So you'd end up with output for just the last item and nothing more.
Select all rows whose col1 is in items and assign, the same way you did before.
df['math'] = df.loc[df.col1.isin(items), 'col3'] * 10
Or,
df['math'] = df.query("col1 in @items").col3 * 10
Or even,
df['math'] = df.col3.where(df.col1.isin(items)) * 10
df
col1 col2 col3 math
0 A 2 0 0.0
1 A 1 1 10.0
2 B 9 9 NaN
3 NaN 8 4 NaN
4 D 7 2 20.0
5 C 4 3 NaN
The reason the assignment fails is that each pass of the for loop assigns math a brand-new Series, like the two below, so only the last one survives and is reflected in the result after the loop:
0 0.0
1 10.0
2 NaN
3 NaN
4 NaN
5 NaN
Name: col3, dtype: float64
0 NaN
1 NaN
2 NaN
3 NaN
4 20.0
5 NaN
Name: col3, dtype: float64
You can do it with the following:
df.loc[df.col1.isin(items), 'math'] = df.col3 * 10
df
Out[85]:
col1 col2 col3 math
0 A 2 0 0.0
1 A 1 1 10.0
2 B 9 9 NaN
3 NaN 8 4 NaN
4 D 7 2 20.0
5 C 4 3 NaN
I'm new to pandas, and I want to create a new column in my pandas dataframe. I'd like to groupby one column, and then divide two other columns together.
This perfectly works:
df['new_col'] = (df.col2/df.col3)
However, when I groupby another column, what I have doesn't work:
df['new_col'] = df.groupby('col1')(df.col2/df.col3)
Does anyone know how I can rewrite the above code? Thanks.
Setup
df = pd.DataFrame(dict(
Col1=list('AAAABBBB'),
Col2=range(1, 9, 1),
Col3=range(9, 1, -1)
))
df
Col1 Col2 Col3
0 A 1 9
1 A 2 8
2 A 3 7
3 A 4 6
4 B 5 5
5 B 6 4
6 B 7 3
7 B 8 2
Solution
Using pd.DataFrame.eval
We can use eval to create new columns in a pipeline
df.groupby('Col1', as_index=False).sum().eval('Col4 = Col2 / Col3')
Col1 Col2 Col3 Col4
0 A 10 30 0.333333
1 B 26 14 1.857143
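If the goal is instead to keep one row per original row (rather than one row per group), a hedged alternative is groupby.transform, sketched here against the same Setup frame:
g = df.groupby('Col1')
# transform('sum') broadcasts each group's sum back onto the original rows,
# so the ratio can be assigned row by row without collapsing the frame.
df['Col4'] = g['Col2'].transform('sum') / g['Col3'].transform('sum')
# Rows in group A get 10/30 = 0.333333, rows in group B get 26/14 = 1.857143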
This may be what you are looking for:
import pandas as pd
df = pd.DataFrame([['A', 4, 3], ['B', 2, 4], ['C', 5, 1], ['A', 5, 1], ['B', 2, 7]],
columns=['Col1', 'Col2', 'Col3'])
# Col1 Col2 Col3
# 0 A 4 3
# 1 B 2 4
# 2 C 5 1
# 3 A 5 1
# 4 B 2 7
df['Col4'] = df['Col2'] / df['Col3']
df = df.sort_values('Col1')
# Col1 Col2 Col3 Col4
# 0 A 4 3 1.333333
# 3 A 5 1 5.000000
# 1 B 2 4 0.500000
# 4 B 2 7 0.285714
# 2 C 5 1 5.000000
Or if you need to perform a groupby.sum first:
df = df.groupby('Col1', as_index=False).sum()
df['Col4'] = df['Col2'] / df['Col3']
# Col1 Col2 Col3 Col4
# 0 A 9 4 2.250000
# 1 B 4 11 0.363636
# 2 C 5 1 5.000000