I have two tables and I would like to append them so that all the data in table A is retained and data from table B is added only if its key is unique (key values are unique within table A and within table B, but in some cases a key will occur in both).
I think the way to do this involves some sort of filtering join (anti-join) to get the values in table B that do not occur in table A, then appending the two tables.
I am familiar with R; this is the code I would use there:
library("dplyr")
## Filtering join to remove values already in "TableA" from "TableB"
FilteredTableB <- anti_join(TableB,TableA, by = "Key")
## Append "FilteredTableB" to "TableA"
CombinedTable <- bind_rows(TableA,FilteredTableB)
How would I achieve this in Python?
Passing indicator = True to the merge command tells you which table each row came from by creating a new column _merge with three possible values:
left_only
right_only
both
Keep the right_only and left_only rows. That is it.
outer_join = TableA.merge(TableB, how = 'outer', indicator = True)
anti_join = outer_join[~(outer_join._merge == 'both')].drop('_merge', axis = 1)
easy!
Here is a comparison with a solution from piRSquared:
1) When run on this example, matching on one column, piRSquared's solution is faster.
2) But it only works for matching on one column. If you want to match on several columns, my solution works just as well as with one column.
So it's up to you to decide.
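For example, here is a minimal sketch of the same indicator trick with a two-column key (the frames and column names below are made up for illustration):

import pandas as pd

# hypothetical frames with a composite key (Key1, Key2)
TableA = pd.DataFrame({'Key1': [1, 2, 3], 'Key2': ['x', 'y', 'z'], 'V': [10, 20, 30]})
TableB = pd.DataFrame({'Key1': [2, 4], 'Key2': ['y', 'w'], 'V': [99, 40]})

outer_join = TableA.merge(TableB, how='outer', on=['Key1', 'Key2'], indicator=True)
anti_join = outer_join[outer_join._merge != 'both'].drop('_merge', axis=1)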
Consider the following dataframes
import numpy as np
import pandas as pd

TableA = pd.DataFrame(np.random.rand(4, 3),
                      pd.Index(list('abcd'), name='Key'),
                      ['A', 'B', 'C']).reset_index()
TableB = pd.DataFrame(np.random.rand(4, 3),
                      pd.Index(list('aecf'), name='Key'),
                      ['A', 'B', 'C']).reset_index()
TableA
TableB
This is one way to do what you want
Method 1
# Identify the keys that are in TableB and not in TableA
key_diff = set(TableB.Key).difference(TableA.Key)
where_diff = TableB.Key.isin(key_diff)

# Slice TableB accordingly and append to TableA
# (DataFrame.append was removed in pandas 2.0; pd.concat is the modern equivalent)
pd.concat([TableA, TableB[where_diff]], ignore_index=True)
Method 2
rows = []
for i, row in TableB.iterrows():
    if row.Key not in TableA.Key.values:
        rows.append(row)
pd.concat([TableA.T] + rows, axis=1).T
Timing
4 rows with 2 overlapping keys: Method 1 is much quicker.
10,000 rows with 5,000 overlapping keys: the gap only widens; loops are bad.
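For anyone who wants to reproduce the larger benchmark, here is a minimal sketch of a possible setup (the sizes and key ranges are assumptions, not the original benchmark code):

import numpy as np
import pandas as pd

# two 10,000-row tables whose keys overlap by ~5,000
big_a = pd.DataFrame({'Key': np.arange(10000), 'A': np.random.rand(10000)})
big_b = pd.DataFrame({'Key': np.arange(5000, 15000), 'A': np.random.rand(10000)})

# Method 1, timed with e.g. %timeit in IPython
key_diff = set(big_b.Key).difference(big_a.Key)
combined = pd.concat([big_a, big_b[big_b.Key.isin(key_diff)]], ignore_index=True)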
I had the same problem. The answer above using how='outer' and indicator=True in merge inspired me to come up with this solution:
import pandas as pd
import numpy as np
TableA = pd.DataFrame(np.random.rand(4, 3),
                      pd.Index(list('abcd'), name='Key'),
                      ['A', 'B', 'C']).reset_index()
TableB = pd.DataFrame(np.random.rand(4, 3),
                      pd.Index(list('aecf'), name='Key'),
                      ['A', 'B', 'C']).reset_index()
print('TableA', TableA, sep='\n')
print('TableB', TableB, sep='\n')
TableB_only = pd.merge(
    TableA, TableB,
    how='outer', on='Key', indicator=True, suffixes=('_foo', '')).query(
        '_merge == "right_only"')
print('TableB_only', TableB_only, sep='\n')
# join='inner' keeps only the columns common to both frames (Key, A, B, C),
# dropping the _foo columns and _merge automatically
Table_concatenated = pd.concat((TableA, TableB_only), join='inner')
print('Table_concatenated', Table_concatenated, sep='\n')
Which prints this output:
TableA
Key A B C
0 a 0.035548 0.344711 0.860918
1 b 0.640194 0.212250 0.277359
2 c 0.592234 0.113492 0.037444
3 d 0.112271 0.205245 0.227157
TableB
Key A B C
0 a 0.754538 0.692902 0.537704
1 e 0.499092 0.864145 0.004559
2 c 0.082087 0.682573 0.421654
3 f 0.768914 0.281617 0.924693
TableB_only
Key A_foo B_foo C_foo A B C _merge
4 e NaN NaN NaN 0.499092 0.864145 0.004559 right_only
5 f NaN NaN NaN 0.768914 0.281617 0.924693 right_only
Table_concatenated
Key A B C
0 a 0.035548 0.344711 0.860918
1 b 0.640194 0.212250 0.277359
2 c 0.592234 0.113492 0.037444
3 d 0.112271 0.205245 0.227157
4 e 0.499092 0.864145 0.004559
5 f 0.768914 0.281617 0.924693
Easiest answer imaginable:
# a constant marker column, aligned to tableB's index so every row gets a 1
tableB = pd.concat([tableB, pd.Series(1, index=tableB.index)], axis=1)
mergedTable = tableA.merge(tableB, how="left", on="key")
answer = mergedTable[mergedTable.iloc[:, -1].isnull()][tableA.columns.tolist()]
Should be the fastest proposed as well.
One liner
# DataFrame.append was removed in pandas 2.0; pd.concat is the one-line equivalent
pd.concat([TableA, TableB.loc[~TableB.Key.isin(TableA.Key)]], ignore_index=True)
%%timeit gives about the same timing as the accepted answer.
You'll have two tables, TableA and TableB, where each DataFrame has a key column whose values are unique within its own table, but some values occur in both tables.
Then, we want to append to TableA the rows of TableB that don't match any row in TableA on a 'Key' column. The concept is to picture it as comparing two series of variable length and combining the rows of one series sA with the other sB where sB's values don't match sA's. The following code solves this exercise:
import pandas as pd

TableA = pd.DataFrame([[2, 3, 4], [5, 6, 7], [8, 9, 10]])
TableB = pd.DataFrame([[1, 3, 4], [5, 7, 8], [9, 10, 0]])

removeTheseIndexes = []
keyColumnA = TableA.iloc[:, 1]  # your 'Key' column here
keyColumnB = TableB.iloc[:, 1]  # same

for i in range(len(keyColumnA)):
    firstValue = keyColumnA[i]
    for j in range(len(keyColumnB)):
        copycat = keyColumnB[j]
        if firstValue == copycat:
            removeTheseIndexes.append(j)

TableB.drop(removeTheseIndexes, inplace=True)
TableA = pd.concat([TableA, TableB])  # DataFrame.append was removed in pandas 2.0
TableA = TableA.reset_index(drop=True)
Note this modifies TableB's data as well. Alternatively, you can use inplace=False, re-assign the result to a newTable, and concatenate that onto TableA instead (see the sketch after the output below).
# Table A
0 1 2
0 2 3 4
1 5 6 7
2 8 9 10
# Table B
0 1 2
0 1 3 4
1 5 7 8
2 9 10 0
# With column 1 as the 'Key' column, running the script gives:
# Table A
0 1 2
0 2 3 4
1 5 6 7
2 8 9 10
3 5 7 8
4 9 10 0
# Table B
0 1 2
1 5 7 8
2 9 10 0
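A minimal sketch of the inplace=False alternative mentioned above (run in place of the inplace drop; newTable is just an illustrative name):

newTable = TableB.drop(removeTheseIndexes, inplace=False)
TableA = pd.concat([TableA, newTable], ignore_index=True)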
Based on one of the other suggestions, here's a function that should do it, using only pandas functions and no looping. You can use multiple columns as the key as well. If you change the line
output = merged.loc[merged.dummy_col.isna(), tableA.columns.tolist()]
to
output = merged.loc[~merged.dummy_col.isna(), tableA.columns.tolist()]
you have a semi_join.
def anti_join(tableA, tableB, on):
    # if joining on the index, make it into a column
    if tableB.index.name is not None:
        dummy = tableB.reset_index()[on]
    else:
        dummy = tableB[on]

    # create a dummy column of 1s
    if isinstance(dummy, pd.Series):
        dummy = dummy.to_frame()
    dummy.loc[:, 'dummy_col'] = 1

    # preserve the index of tableA if it has one
    if tableA.index.name is not None:
        idx_name = tableA.index.name
        tableA = tableA.reset_index(drop=False)
    else:
        idx_name = None

    # do a left join
    merged = tableA.merge(dummy, on=on, how='left')

    # keep only the non-matches
    output = merged.loc[merged.dummy_col.isna(), tableA.columns.tolist()]

    # restore the index (if applicable)
    if idx_name is not None:
        output = output.set_index(idx_name)

    return output
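For instance, to reproduce the dplyr snippet from the question with this function (a sketch using the TableA/TableB frames defined earlier):

# rows of TableB whose Key does not appear in TableA
FilteredTableB = anti_join(TableB, TableA, on='Key')
# append them to TableA
CombinedTable = pd.concat([TableA, FilteredTableB], ignore_index=True)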
Related
I'm trying to look up the values in a certain column and copy the remaining columns based on that lookup. The thing is, this operation involves more than 20 million rows.
I tried to run the code, but it did not stop for about 8 hours, so I stopped it. My questions are:
Is my algorithm correct? If it is correct, is the never-ending run caused by my inefficient algorithm?
Here's my code and tables to illustrate:
Table 1

 A    B
12  abc
13  def
28  ghi
50  jkl

Table 2 (lookup to this table)

  B  C  D
abc  4  7
def  3  3
ghi  6  2
jkl  8  1

Targeted result

 A    B  C  D
12  abc  4  7
13  def  3  3
28  ghi  6  2
50  jkl  8  1
So columns C and D will be added to Table 1 as well, looked up from Table 2 via column B.
The Table 1 values live in different CSV files, so I also loop through the files in a folder; I call the list of files all_files in the code.
My code:
import pandas as pd

df = pd.DataFrame()
for f in all_files:
    Table1 = pd.read_csv(f)  # f is already the file path, not an index into all_files
    for j in range(len(Table1)):
        u = Table1.loc[j, 'B']
        for z in range(len(Table2)):
            if u == Table2.loc[z, 'B']:
                Table1.loc[j, 'C'] = Table2.loc[z, 'C']
                Table1.loc[j, 'D'] = Table2.loc[z, 'D']
                break
    df = pd.concat([df, Table1], axis=0)
I used that break to stop the inner loop as soon as it finds a match, and then Table1 is concatenated to df. Still, this code didn't work for me; it loops continuously and never stops.
Can anyone help? Any help will be very much appreciated!
I hope this is the solution you are looking for:
Firstly, I would join all the CSVs for table_1 together into a single DataFrame. Then I would merge table_2 into table_1 on the key column B. Sample code:
df = pd.DataFrame()
for file in all_files:
    df_tmp = pd.read_csv(file)
    df = pd.concat([df, df_tmp])

df_merge = pd.merge(df, table_2, on="B", how="left")
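A slightly more compact sketch of the same gathering step, building all the pieces before a single concat (this avoids growing the DataFrame inside the loop):

df = pd.concat((pd.read_csv(f) for f in all_files), ignore_index=True)
df_merge = pd.merge(df, table_2, on="B", how="left")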
When we use a for-loop with pandas, there is a 98% chance that we are doing it wrong. Pandas is designed so that you don't have to use loops.
Solutions with increasing performance:
import pandas as pd
table_1 = pd.DataFrame({'A': [12, 13, 28, 50], 'B': ['abc', 'def', 'ghi','jkl']})
table_2 = pd.DataFrame({'B': ['abc', 'def', 'ghi','jkl'], 'C': [4, 3, 6, 8], 'D': [7, 3, 2, 1]})
# simple merge
table = pd.merge(table_1, table_2, how='inner', on='B')
# gain speed by using indexing
table_1 = table_1.set_index('B')
table_2 = table_2.set_index('B')
table = pd.merge(table_1, table_2, how='inner', left_index=True, right_index=True)
# there is also join, but it's slower than merge
table = table_1.join(table_2, on="B").reset_index()
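As another loop-free sketch (assuming the original, un-indexed table_1 and table_2 from above), a dictionary-style lookup with Series.map does the same thing one column at a time:

lookup = table_2.set_index('B')
table_1['C'] = table_1['B'].map(lookup['C'])
table_1['D'] = table_1['B'].map(lookup['D'])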
I am trying to compare two columns in pandas. I know I can do:
# either using Pandas' equals()
df1[col].equals(df2[col])
# or this
df1[col] == df2[col]
However, what I am looking for is to compare these columns element-wise and, when they don't match, print out both values. I have tried:
if df1[col] != df2[col]:
    print(df1[col])
    print(df2[col])
which gives the error 'The truth value of a Series is ambiguous'.
I believe this is because the comparison produces a Series of boolean values, which causes the ambiguity. I also tried various forms of for loops, which did not resolve the issue.
Can anyone point me to how I should go about doing what I described?
This might work for you:
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({'col1': [1, 2, 9, 4, 7]})

if not df2[df2['col1'] != df1['col1']].empty:
    print(df1[df1['col1'] != df2['col1']])
    print(df2[df2['col1'] != df1['col1']])
Output:
col1
2 3
4 5
col1
2 9
4 7
You need to get hold of the indices where the column values don't match. Once you have those indices, you can query the individual DFs to get the values.
Please try the following and see if it helps:
for ind in df1.loc[df1['col1'] != df2['col1']].index:
    x = df1.loc[df1.index == ind, 'col1'].values[0]
    y = df2.loc[df2.index == ind, 'col1'].values[0]
    print(x, y)
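A slightly more concise sketch of the same idea, reusing the boolean mask directly:

mask = df1['col1'] != df2['col1']
for x, y in zip(df1.loc[mask, 'col1'], df2.loc[mask, 'col1']):
    print(x, y)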
Solution
Try this. You could use any of the following one-line solutions.
# Option-1
df.loc[df.apply(lambda row: row[col1] != row[col2], axis=1), [col1, col2]]
# Option-2
df.loc[df[col1]!=df[col2], [col1, col2]]
Logic:
Option-1: We use pandas.DataFrame.apply() to evaluate the target columns row by row; the resulting boolean mask is passed to df.loc[mask, [col1, col2]], which returns the required set of rows where col1 != col2.
Option-2: We build the boolean mask directly with df[col1] != df[col2]; the rest of the logic is the same as Option-1.
Dummy Data
I made the dummy data such that at indices 2, 6, and 8, columns 'a' and 'c' differ. Thus, we want only those rows returned by the solution.
import numpy as np
import pandas as pd
a = np.arange(10)
c = a.copy()
c[[2,6,8]] = [0,20,40]
df = pd.DataFrame({'a': a, 'b': a**2, 'c': c})
print(df)
Output:
a b c
0 0 0 0
1 1 1 1
2 2 4 0
3 3 9 3
4 4 16 4
5 5 25 5
6 6 36 20
7 7 49 7
8 8 64 40
9 9 81 9
Applying the solution to the dummy data
We see that the solution proposed returns the result as expected.
col1, col2 = 'a', 'c'
result = df.loc[df.apply(lambda row: row[col1] != row[col2], axis=1), [col1, col2]]
print(result)
Output:
a c
2 2 0
6 6 20
8 8 40
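As a side note, if your pandas version is 1.1 or newer, Series.compare gives a similar self/other view of the mismatches out of the box (a sketch using the same dummy df):

# requires pandas >= 1.1
print(df['a'].compare(df['c']))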
I work in python and pandas.
Let's suppose that I have a dataframe like this (INPUT):
A B C
0 2 8 6
1 5 2 5
2 3 4 9
3 5 1 1
I want to process it to finally get a new dataframe which looks like this (EXPECTED OUTPUT):
A B C
0 2 7 NaN
1 5 1 1
2 3 3 NaN
3 5 0 NaN
To manage this I do the following:
import numpy as np
import pandas as pd

columns = ['A', 'B', 'C']
data_1 = [[2, 5, 3, 5], [8, 2, 4, 1], [6, 5, 9, 1]]
data_1 = np.array(data_1).T
df_1 = pd.DataFrame(data=data_1, columns=columns)
df_2 = df_1
df_2['B'] -= 1
df_2['C'] = np.nan
df_2 looks like this for now:
A B C
0 2 7 NaN
1 5 1 NaN
2 3 3 NaN
3 5 0 NaN
Now I want to do a matching/merging between df_1 and df_2 using columns A and B as keys.
I tried to do this with isin():
df_temp = df_1[df_1[['A', 'B']].isin(df_2[['A', 'B']])]
df_2.iloc[df_temp.index] = df_temp
but it gives me back the same df_2 as before, without matching the common row (5, 1, 1) for A, B, C respectively:
A B C
0 2 7 NaN
1 5 1 NaN
2 3 3 NaN
3 5 0 NaN
How can I do this properly?
By the way, just to be clear, the matching should not be done like:
1st row of df1 - 1st row of df2
2nd row of df1 - 2nd row of df2
3rd row of df1 - 3rd row of df2
...
But it has to be done as:
any row of df1 - any row of df2
based on the specified columns as keys.
I think this is why my isin() code above does not work: it does the filtering/matching in the former way.
On the other hand, .merge() can do the matching in the latter way but it does not preserve the order of the rows in the way I want and it is pretty tricky or inefficient to fix that.
Finally, keep in mind that my actual dataframes will use far more than 2 columns (e.g. 15) as keys for the matching, so it is better to come up with something concise that scales to bigger dataframes.
P.S.
See my answer below.
Here's my suggestion using a lambda function in apply. Should be easily scalable to more columns to compare (just adjust cols_to_compare accordingly). By the way, when generating df_2, be sure to copy df_1, otherwise changes in df_2 will carry over to df_1 as well.
So generating the data first:
columns = ['A', 'B', 'C']
data_1 = [[2, 5, 3, 5], [8, 2, 4, 1], [6, 5, 9, 1]]
data_1 = np.array(data_1).T
df_1 = pd.DataFrame(data=data_1, columns=columns)
df_2 = df_1.copy() # Be sure to create a copy here
df_2['B'] -= 1
df_2['C'] = np.nan
and now we 'scan' df_1 for the rows of interest:
cols_to_compare = ['A', 'B']
df_2['C'] = df_2.apply(
    lambda x: 1 if any((df_1.loc[:, cols_to_compare].values[:] == x[cols_to_compare].values).all(1))
    else np.nan, axis=1)
What it does is check whether the values in the current row also appear, in the columns of interest, in any row of df_1.
The output is:
A B C
0 2 7 NaN
1 5 1 1.0
2 3 3 NaN
3 5 0 NaN
Someone (I don't remember their username) suggested the following, which I think works, and then deleted their post for some reason (??!):
df_2=df_2.set_index(['A','B'])
temp = df_1.set_index(['A','B'])
df_2.update(temp)
df_2.reset_index(inplace=True)
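For what it's worth, update() works here because both frames share the (A, B) MultiIndex after set_index, so the matching row in df_2 gets its NaN in C filled from temp while all other rows are left alone. With the sample data from the question, this should print the expected result:

print(df_2)
#    A  B    C
# 0  2  7  NaN
# 1  5  1  1.0
# 2  3  3  NaN
# 3  5  0  NaN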
You can accomplish this using two for loops:
for row in df_2.iterrows():
    for row2 in df_1.iterrows():
        if [row[1]['A'], row[1]['B']] == [row2[1]['A'], row2[1]['B']]:
            df_2.loc[row[0], 'C'] = row2[1]['C']  # .loc avoids chained-assignment pitfalls
Just replace this line:
df_temp = df_1[df_1[['A', 'B']].isin(df_2[['A', 'B']])]
with:
df_1[df_1['A'].isin(df_2['A']) & df_1['B'].isin(df_2['B'])]
It works fine!!
I am trying to use a loop function to create a matrix of whether a product was seen in a particular week.
Each row in the df (representing a product) has a close_date (the date the product closed) and a week_diff (the number of weeks the product was listed).
import pandas
mydata = [{'subid': 'A', 'Close_date_wk': 25, 'week_diff': 3},
          {'subid': 'B', 'Close_date_wk': 26, 'week_diff': 2},
          {'subid': 'C', 'Close_date_wk': 27, 'week_diff': 2}]
df = pandas.DataFrame(mydata)
My goal is to see how many alternative products were listed for each product in each date_range
I have set up the following loop:
for index, row in df.iterrows():
    i = 0
    max_range = row['Close_date_wk']
    min_range = int(row['Close_date_wk'] - row['week_diff'])
    for i in range(min_range, max_range):
        col_head = 'job_week_' + str(i)
        row[col_head] = 1
Can you please help explain why the row[col_head] = 1 line neither adds a column nor sets a value in that column for that row?
For example, if:
row A has date range 1,2,3
row B has date range 2,3
row C has date range 3,4,5
then ideally I would like to end up with
row A has 0 alternative products in week 1
1 alternative products in week 2
2 alternative products in week 3
row B has 1 alternative products in week 2
2 alternative products in week 3
etc.
You can't mutate the df using row here to add a new column; you'd either refer to the original df or use .loc or .iloc, for example:
In [29]:
df = pd.DataFrame(columns=list('abc'), data = np.random.randn(5,3))
df
Out[29]:
a b c
0 -1.525011 0.778190 -1.010391
1 0.619824 0.790439 -0.692568
2 1.272323 1.620728 0.192169
3 0.193523 0.070921 1.067544
4 0.057110 -1.007442 1.706704
In [30]:
for index, row in df.iterrows():
    df.loc[index, 'd'] = np.random.randint(0, 10)
df
Out[30]:
a b c d
0 -1.525011 0.778190 -1.010391 9
1 0.619824 0.790439 -0.692568 9
2 1.272323 1.620728 0.192169 1
3 0.193523 0.070921 1.067544 0
4 0.057110 -1.007442 1.706704 9
You can modify existing rows:
In [31]:
# reset the df by slicing
df = df[list('abc')]
for index, row in df.iterrows():
    row['b'] = np.random.randint(0, 10)
df
Out[31]:
a b c
0 -1.525011 8 -1.010391
1 0.619824 2 -0.692568
2 1.272323 8 0.192169
3 0.193523 2 1.067544
4 0.057110 3 1.706704
But adding a new column using row won't work:
In [35]:
df = df[list('abc')]
for index, row in df.iterrows():
    row['d'] = np.random.randint(0, 10)
df
Out[35]:
a b c
0 -1.525011 8 -1.010391
1 0.619824 2 -0.692568
2 1.272323 8 0.192169
3 0.193523 2 1.067544
4 0.057110 3 1.706704
Instead of
row[col_head] = 1
please try the line below:
df.at[index, col_head] = 1
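Put into the loop from the question, that would look something like the sketch below (df.at writes through to the original frame and creates the column if it doesn't exist yet):

for index, row in df.iterrows():
    max_range = row['Close_date_wk']
    min_range = int(row['Close_date_wk'] - row['week_diff'])
    for i in range(min_range, max_range):
        df.at[index, 'job_week_' + str(i)] = 1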
I have a list of lists as below
[[1, 2], [1, 3]]
The DataFrame is similar to
A B C
0 1 2 4
1 0 1 2
2 1 3 0
I would like a DataFrame that keeps a row only if the value in column A equals the first element of one of the nested lists and the value in column B of the same row equals the second element of that same nested list.
Thus the resulting DataFrame should be
A B C
0 1 2 4
2 1 3 0
The code below does what you need:
import pandas

tmp_filter = pandas.DataFrame(None)  # the DataFrame you want

# Create your list and your DataFrame
tmp_list = [[1, 2], [1, 3]]
tmp_df = pandas.DataFrame([[1, 2, 4], [0, 1, 2], [1, 3, 0]], columns=['A', 'B', 'C'])

# This function walks the df column by column and
# only keeps the rows with the values you want
def pass_true_df(df, cond):
    for i, c in enumerate(cond):
        df = df[df.iloc[:, i] == c]
    return df

# Pass through your list and add the rows you want to keep
for i in tmp_list:
    tmp_filter = pandas.concat([tmp_filter, pass_true_df(tmp_df, i)])
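As a side note, the same filter can be written without the helper function by turning the list into a DataFrame and doing an inner merge; a minimal sketch under the same tmp_list and tmp_df:

key_df = pandas.DataFrame(tmp_list, columns=['A', 'B'])
tmp_filter = tmp_df.merge(key_df, on=['A', 'B'], how='inner')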
import pandas

df = pandas.DataFrame([[1, 2, 4], [0, 1, 2], [1, 3, 0], [0, 2, 5], [1, 4, 0]],
                      columns=['A', 'B', 'C'])
filt = pandas.DataFrame([[1, 2], [1, 3], [0, 2]],
                        columns=['A', 'B'])

accum = []
# grouped to-filter
data_g = df.groupby('A')
for k2, v2 in data_g:
    accum.append(v2[v2.B.isin(filt.B[filt.A == k2])])

print(pandas.concat(accum))
result:
A B C
3 0 2 5
0 1 2 4
2 1 3 0
(I made the data and filter a little more complicated as a test.)