I have two DataFrames and want to remove the rows in df1 that have the same value in column 'a' as rows in df2. Moreover, each occurrence of a common value in df2 should remove only one row.
df1 = pd.DataFrame({'a':[1,1,2,3,4,4],'b':[1,2,3,4,5,6],'c':[6,5,4,3,2,1]})
df2 = pd.DataFrame({'a':[2,4,2],'b':[1,2,3],'c':[6,5,4]})
result = pd.DataFrame({'a':[1,1,3,4],'b':[1,2,4,6],'c':[6,5,3,1]})
Use Series.isin + Series.duplicated to create a boolean mask and use this mask to filter the rows from df1:
m = df1['a'].isin(df2['a']) & ~df1['a'].duplicated()
df = df1[~m]
Result:
print(df)
a b c
0 1 1 6
1 1 2 5
3 3 4 3
5 4 6 1
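For reference, printing the intermediate mask on the sample data shows which rows are flagged for removal (the first 2 and the first 4):
print(m)
0    False
1    False
2     True
3    False
4     True
5    False
Name: a, dtype: bool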
Try This:
import pandas as pd
df1=pd.DataFrame({'a':[1,1,2,3,4,4],'b':[1,2,3,4,5,6],'c':[6,5,4,3,2,1]})
df2=pd.DataFrame({'a':[2,4,2],'b':[1,2,3],'c':[6,5,4]})
df2a = df2['a'].tolist()
def remove_df2_dup(x):
    # drop the row on the first match and consume that value from the list,
    # so each occurrence in df2 removes at most one row of df1
    if x in df2a:
        df2a.remove(x)
        return False
    return True

df1[df1.a.apply(remove_df2_dup)]
It creates a list from df2['a'], then checks each value of df1['a'] against that list, removing a value from the list on every match so that each occurrence in df2 drops at most one row.
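A variant of the same idea, sketched with collections.Counter so lookups are O(1) instead of scanning a list (the names df1, df2 and column 'a' are taken from the snippet above):
from collections import Counter

counts = Counter(df2['a'])  # how many rows each value may remove

def keep(x):
    if counts[x] > 0:    # missing keys simply count as 0
        counts[x] -= 1   # consume one removal per occurrence in df2
        return False
    return True

df1[df1['a'].apply(keep)]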
Try this:
df1 = pd.DataFrame({'a':[1,1,2,3,4,4],'b':[1,2,3,4,5,6],'c':[6,5,4,3,2,1]})
df2 = pd.DataFrame({'a':[2,4,2],'b':[1,2,3],'c':[6,5,4]})
for x in df2.a:
    if x in df1.a.values:  # note .values: `in` on a Series checks the index, not the values
        df1.drop(df1[df1.a == x].index[0], inplace=True)  # drop only the first matching row
print(df1)
I have a DataFrame with 100 columns (however I provide only three columns here) and I want to build a new DataFrame with two columns. Here is the DataFrame:
import pandas as pd
df = pd.DataFrame()
df['id'] = [1,2,3]
df['c1'] = [1,5,1]
df['c2'] = [-1,6,5]
df
I want to stack the values of all columns for each id and put them in one column. For example, for id=1 I want to stack 1 and -1 in one column, next to the id. Here is the DataFrame that I want:
id c
0 1 1
1 1 -1
2 2 5
3 2 6
4 3 1
5 3 5
Note: df.melt alone does not solve my problem, since I want to keep the ids as well.
Note 2: I already tried stack and reset_index, but it does not help:
df = df.stack().reset_index()
df.columns = ['id','c']
df
You could first set_index with "id"; then stack + reset_index:
out = (df.set_index('id').stack()
.droplevel(1).reset_index(name='c'))
Output:
id c
0 1 1
1 1 -1
2 2 5
3 2 6
4 3 1
5 3 5
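As a side note, melt can keep the ids too if you pass id_vars; a minimal sketch on the original df (the stable sort is only there to group the rows by id like the output above):
out = (df.melt(id_vars='id', value_name='c')
         .drop(columns='variable')
         .sort_values('id', kind='mergesort')
         .reset_index(drop=True))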
I have two dataframes:
First:
tif_pobrany
0 65926_504019_N-33-127-B-d-3-4.tif
1 65926_504618_N-33-139-D-b-1-3.tif
2 65926_504670_N-33-140-A-a-2-3.tif
3 66533_595038_N-33-79-C-b-3-3.tif
4 66533_595135_N-33-79-D-d-3-4.tif
Second:
url godlo ... row_num nazwa_tifa
0 https://opendata.geoportal.gov.pl/ortofotomapa... M-34-68-C-a-1-2 ... 48004 73231_904142_M-34-68-C-a-1-2.tif
1 https://opendata.geoportal.gov.pl/ortofotomapa... M-34-68-C-a-3-1 ... 48011 73231_904127_M-34-68-C-a-3-1.tif
2 https://opendata.geoportal.gov.pl/ortofotomapa... M-34-68-C-a-3-2 ... 48012 73231_904336_M-34-68-C-a-3-2.tif
3 https://opendata.geoportal.gov.pl/ortofotomapa... M-34-68-C-a-3-3 ... 48013 73231_904286_M-34-68-C-a-3-3.tif
4 https://opendata.geoportal.gov.pl/ortofotomapa... M-34-68-C-a-4-2 ... 48016 73231_904263_M-34-68-C-a-4-2.tif
How can I delete the rows in the second dataframe whose 'nazwa_tifa' value also appears in the first dataframe's 'tif_pobrany' column?
Something like this:
for index, row in second.iterrows():
    for index2, row2 in first.iterrows():
        if row['nazwa_tifa'] == row2['tif_pobrany']:
            del row
but it didn't work.
Try this with your data:
import pandas as pd
df1 = pd.DataFrame({"col1":[1,2,3,4,5]})
df2 = pd.DataFrame({"col2":[1,3,4,9,8]})
df1.drop(df1[df1.col1.isin(df2.col2)].index, inplace = True)
print(df1)
Output:
col1
1 2
4 5
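Applied to the dataframes from the question (assuming they are named first and second, as in the attempted loop):
second.drop(second[second['nazwa_tifa'].isin(first['tif_pobrany'])].index, inplace=True)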
Considering df1 and df2 are the names of your dataframes respectively (note the ~, since you want to drop the matching rows):
df2 = df2[~df2['nazwa_tifa'].isin(df1['tif_pobrany'])]
How does it work?
The isin function checks whether each value of a Pandas Series is present in another Series, and the ~ operator inverts the resulting mask.
That array of True/False values is then passed to df2, so only the rows where the condition is True are kept, i.e. the rows whose 'nazwa_tifa' does not appear in df1['tif_pobrany'].
Finally, the assignment replaces df2 with the filtered dataframe.
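If you would rather keep df2 untouched and get a clean index on the result, the same mask also works on a copy (a sketch using the column names from the question):
mask = df2['nazwa_tifa'].isin(df1['tif_pobrany'])
df2_clean = df2[~mask].reset_index(drop=True)  # fresh 0..n-1 index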
I have the following dataframe:
df
Group Dist
0 A 5
1 B 2
2 A 3
3 B 1
4 B 0
5 A 5
I am trying to drop all rows that match Group if the Dist column equals zero. This works to delete row 4:
df = df[df.Dist != 0]
however I also want to delete rows 1 and 3 so I am left with:
df
Group Dist
0 A 5
2 A 3
5 A 5
Any ideas on how to drop the group based on this condition?
Thanks!
First get all Group values where Dist == 0, and then filter them out by checking the Group column against the mask inverted with ~:
df1 = df[~df['Group'].isin(df.loc[df.Dist == 0, 'Group'])]
print (df1)
Group Dist
0 A 5
2 A 3
5 A 5
Or you can use GroupBy.transform with GroupBy.all for test if groups has no 0 values:
df1 = df[(df.Dist != 0).groupby(df['Group']).transform('all')]
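To see why this works, print the transformed mask: each group's verdict (all Dist values non-zero) is broadcast back to every row of that group:
print ((df.Dist != 0).groupby(df['Group']).transform('all'))
0     True
1    False
2     True
3    False
4    False
5     True
Name: Dist, dtype: bool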
EDIT: To remove all groups with missing values:
df2 = df[df['Dist'].notna().groupby(df['Group']).transform('all')]
To test for missing values:
print (df[df['Dist'].isna()])
If it returns nothing, there are no missing values (NaN or None).
You can also check a single scalar, e.g. the value in the row with index 10:
print (df.loc[10, 'Dist'])
print (type(df.loc[10, 'Dist']))
You can use groupby and the method filter:
df.groupby('Group').filter(lambda x: x['Dist'].ne(0).all())
Output:
Group Dist
0 A 5
2 A 3
5 A 5
If you want to filter out groups with missing values:
df.groupby('Group').filter(lambda x: x['Dist'].notna().all())
I have an excel dataframe which I am trying to populate with fields from another excel file, like so:
df = pd.read_excel("file1.xlsx")
df_new = df.join(conv.set_index('id'), on='id', how='inner')
df_new['PersonalN'] = df_new['PersonalN'].apply(lambda x: "" if x==0 else x) # if id==0, its same as nan
df_new = df_new.dropna() # drop nan
df_new['PersonalN'] = df_new['PersonalN'].apply(lambda x: str(int(x))) # convert id to string
df_new = df_new.drop_duplicates() # drop duplicates, if any
it is clear that df_new should be a subset of df; however, when I run the following code:
len(df[df['id'].isin(df_new['id'].values)]) # length of this should be same as len(df_new)
len(df_new)
I get different results (there are 6 more rows in df_new than in df). How can that be? I have checked all dataframes for duplicates and none of them contain any. Interestingly, the following code does give the expected results:
len(df_new[df_new['id'].isin(df['id'].values)])
len(df_new)
These both print the same number.
Edit:
I have also tried the following: others = df[~df['id'].isin(df_new['id'].values)], and checked whether others has the same length as len(df) - len(df_new), but again, the dataframe others has 6 more rows than expected.
The problem comes from your conv dataframe. Assume that the df that comes from file1 is:
id PersonalN
0 1
And conv is
id other_col
0 'abc'
0 'def'
After the join you will get:
id PersonalN other_col
0 1 'abc'
0 1 'def'
The size of df_new is larger than that of df, and drop_duplicates() or dropna() will not help you reduce the shape of the resulting dataframe.
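A quick way to confirm this in your case, assuming the key column of conv is called 'id' as in your join, is to look for duplicated keys before joining:
print(conv[conv['id'].duplicated(keep=False)])  # non-empty output means the join can add rows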
It's hard to know without the data, but even if there are no duplicates in either dataframe, the result of an inner join can be larger than the original dataframe. Consider the following example:
df1 = pd.DataFrame(range(10), columns=["id_"])
df2 = pd.DataFrame({"id_": list(range(10)) + [1] * 3, "something": range(13)})
df2.drop_duplicates(inplace = True)
print(len(df1), len(df2))
==> 10 13
df_new = df1.join(df2.set_index("id_"), on = "id_")
len(df_new)
==> 13
print(df_new)
id_ something
0 0 0
1 1 1
1 1 10
1 1 11
1 1 12
2 2 2
...
The reason is of course that the ids of the other dataframe are not unique, and a single id in the original dataframe (df1 in my example) is joined to several rows on the other dataframe (df2 in my example, conv in yours).
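If you only want a single match per id, one option (a sketch that arbitrarily keeps the first matching row of df2) is to deduplicate on the key before joining:
df_new = df1.join(df2.drop_duplicates(subset="id_").set_index("id_"), on="id_")
len(df_new)
==> 10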
I have the following code, which compares 2 columns in 2 dataframes. It returns the rows that are different between the two dataframes, but I want to get the rows that are different from one side only, for example from df1, not from both:
df1 = pd.DataFrame([('a','b','src'), ('a','b','src'), ('c','b','src'),('a','d','src')],columns=['col1','col2','origin'])
df2 = df1.copy(deep=True)
df2['origin'] = 'tgt'
df1['col1'][3] = 't'
df2['col2'][2] = 't'
df1[(df1['col1'] != df2['col1']) | (df1['col2'] != df2['col2'])]
which gives the following output:
col1 col2 origin
2 c b src
3 t d src
Now, over here I do see the 2 differences, but the origin column is always src. What I want is the count of rows which are different, but only from the source, i.e. df1.
Because both DataFrames have the same columns and the same index, it is possible to compare them directly.
To count the rows that are not equal between df1 and df2, sum the Trues in the boolean mask:
mask = (df1['col1'] != df2['col1']) | (df1['col2'] != df2['col2'])
print (mask)
0 False
1 False
2 True
3 True
dtype: bool
out = mask.sum()
print (out)
2
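On pandas 1.1 or newer you can also get the differing cells directly with DataFrame.compare; a sketch (origin is dropped first because it differs in every row by construction):
diff = df1.drop(columns='origin').compare(df2.drop(columns='origin'))
print(diff)
print(len(diff))  # 2, the same count as mask.sum()
The result lists, for each differing row, the self (df1) and other (df2) values side by side.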