Find similarity between two dataframes, row by row - python

I have two dataframes, df1 and df2, with the same columns. I would like to find the similarity between these two datasets. I have been trying the two approaches below.
The first one was to append one of the two dataframes to the other and select the duplicates:
df=pd.concat([df1,df2],join='inner')
mask = df.Check.duplicated(keep=False)
df[mask] # it gives me duplicated rows
The second one is to use a threshold value so that, for each row of df1, a potential match can be found among the rows of df2.
Sample data (please note that the datasets have different lengths):
For df1
Check
how to join to first row
large data work flows
I have two dataframes
fix grammatical or spelling errors
indent code by 4 spaces
why are you posting here?
add language identifier
my dad loves watching football
and for df2
Check
small data work flows
I have tried to puzze out an answer
mix grammatical or spelling errors
indent code by 2 spaces
indent code by 8 spaces
put returns between paragraphs
add curry on the chicken curry
mom!! mom!! mom!!
create code fences with backticks
are you crazy?
Trump did not win the last presidential election
In order to do this, I am using the following function:
from fuzzywuzzy import fuzz

def check(df1, thres, col):
    matches = df1.apply(lambda row: (fuzz.ratio(row['Check'], col) / 100.0) >= thres, axis=1)
    return [df1.Check[i] for i, x in enumerate(matches) if x]
This should allow me to find rows which match.
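For example, with the sample data above, one sentence from df2 can be checked against all of df1 (0.5 here is just an illustrative threshold):

check(df1, 0.5, 'small data work flows')
# expected: ['large data work flows']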
The problem with the second approach (the one I am most interested in) is that it does not actually take both dataframes into account.
My expected output from the first approach would be two dataframes, one for df1 and one for df2, each with an extra column listing the similar rows found in the other dataframe; from the second approach, I would only assign a similarity value to each row (I would end up with as many columns as there are rows).
Please let me know if you need any further information or if you need more code.
Maybe there is an easier way to determine this similarity, but unfortunately I have not found it yet.
Any suggestion is welcome.
Expected output:
(this is just an example; since I am setting a threshold, the output may change)
df1
Check sim
how to join to first row []
large data work flows [small data work flows]
I have two dataframes []
fix grammatical or spelling errors [mix grammatical or spelling errors]
indent code by 4 spaces [indent code by 2 spaces, indent code by 8 spaces]
why are you posting here? []
add language identifier []
my dad loves watching football []
df2
Check sim
small data work flows [large data work flows]
I have tried to puzze out an answer []
mix grammatical or spelling errors [fix grammatical or spelling errors]
indent code by 2 spaces [indent code by 4 spaces]
indent code by 8 spaces [indent code by 4 spaces]
put returns between paragraphs []
add curry on the chicken curry []
mom!! mom!! mom!! []
create code fences with backticks []
are you crazy? []
Trump did not win the last presidential election []

I think your fuzzywuzzy solution is pretty good, and I've implemented what you're looking for below. Note that the intermediate result grows as len(df1)*len(df2), so it is fairly memory intensive, but at least it should be reasonably clear. You may find the output of gen_scores useful as well.
import pandas as pd
from fuzzywuzzy import fuzz
from itertools import product

def gen_scores(df1, df2):
    # generates a score for every row combination between the dfs
    df_score = pd.DataFrame(product(df1.Check, df2.Check), columns=["c1", "c2"])
    df_score["score"] = df_score.apply(lambda row: fuzz.ratio(row["c1"], row["c2"]) / 100.0, axis=1)
    return df_score

def get_matches(df1, df2, thresh=0.5):
    # get all matches above a threshold, appended as a list to each df
    df = gen_scores(df1, df2)
    df = df[df.score > thresh]

    matches = df.groupby("c1").c2.apply(list)
    df1 = pd.merge(df1, matches, how="left", left_on="Check", right_on="c1")
    df1 = df1.rename(columns={"c2": "matches"})
    df1.loc[df1.matches.isnull(), "matches"] = df1.loc[df1.matches.isnull(), "matches"].apply(lambda x: [])

    matches = df.groupby("c2").c1.apply(list)
    df2 = pd.merge(df2, matches, how="left", left_on="Check", right_on="c2")
    df2 = df2.rename(columns={"c1": "matches"})
    df2.loc[df2.matches.isnull(), "matches"] = df2.loc[df2.matches.isnull(), "matches"].apply(lambda x: [])

    return (df1, df2)

# call code:
df1_match, df2_match = get_matches(df1, df2, thresh=0.5)
output:
Check matches
0 how to join to first row []
1 large data work flows [small data work flows]
2 I have two dataframes []
3 fix grammatical or spelling errors [mix gramma... [mix grammatical or spelling errors]
4 indent code by 4 spaces [indent code by 2 spaces, indent code by 8 spa...
5 why are you posting here? [are you crazy?]
6 add language identifier []
7 my dad loves watching football []

Related

How to search specific rows of a dataframe to find a match in a second dataframe?

I have a large data set and I want to do the following task in an efficient way. Suppose we have two data frames. For each element in df2, I want to search df1, but only in the rows where the first two letters are in common; among those, the word with the most tokens in common is chosen.
Let's see an example:
df1:
common work co
summer hot su
apple ap
colorful fall co
support it su
could comp co
df2:
condition work it co
common mistakes co
could comp work co
summer su
Take the first row of df2 as an example (condition work it). I want to find a row in df1 that has the same first_two and the most tokens in common.
The first_two of condition work it is co, so I search in df1 where first_two is co. The search is therefore done among: common work, colorful fall, could comp. Since condition work it has one token in common with common work, that row is selected.
output:
df2:
name first_two match
condition work it co `common work`
common mistakes co `common work`
could comp work co `could comp`
summer su `summer hot`
appears ap None
The last row is None since there is no word in common between appears and apple.
I did the following:
df3 = (df1.groupby(['first_two'])
          .agg({'name': lambda x: ",".join(x)})
          .reset_index())
merge_ = df3.merge(df2, on='first_two', how='inner')
But now I have to search in name_x for each name_y. How do I find the element of name_x which has the most tokens in common with name_y?
You have pretty much explained the most efficient method already:
1. Extract the first 2 letters using .str[:2] on the series and assign the result to a new column in both dataframes.
2. Extract the unique values of the 2-letter column from df2.
3. Inner join the result from #2 onto df1.
4. Group by and count the result of #3, sort descending on the count, and drop duplicates to get the most repeated item per 2-letter value.
5. Left join the result of #4 onto df2 (see the sketch below).
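A minimal sketch of that join-and-rank pipeline, using a shared-token count as the ranking criterion; the column names name and first_two follow the question's sample, and df2's word column is assumed to also be called name:

import pandas as pd

df1 = pd.DataFrame({'name': ['common work', 'summer hot', 'apple',
                             'colorful fall', 'support it', 'could comp']})
df2 = pd.DataFrame({'name': ['condition work it', 'common mistakes',
                             'could comp work', 'summer']})

# 1. extract the first two letters in both dataframes
df1['first_two'] = df1['name'].str[:2]
df2['first_two'] = df2['name'].str[:2]

# 2./3. inner join df1's candidates onto df2 via the two-letter key
cand = df2.merge(df1, on='first_two', suffixes=('', '_cand'))

# 4. count shared tokens per pair, sort descending, keep the best per df2 row
cand['common'] = cand.apply(
    lambda r: len(set(r['name'].split()) & set(r['name_cand'].split())), axis=1)
best = (cand[cand['common'] > 0]
        .sort_values('common', ascending=False)
        .drop_duplicates('name'))

# 5. left join the winners back onto df2 (rows with no common token get NaN)
out = df2.merge(best[['name', 'name_cand']].rename(columns={'name_cand': 'match'}),
                on='name', how='left')
print(out)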

Comparing a column in a data frame with many items in a single string to a separate list and picking out common elements

I'm relatively new to python and am really having trouble working with lists.
I have a dataframe (df1) with a column for 'actors' with many actors in a string and I have a separate dataframe (df2) that lists actors that have received an award.
I want to add a column to df1 that will indicate whether an actor has received an award or not, so for example 1=award, 0=no award.
I am trying to use for loops but it is not iterating in the way I want.
In my example, only 'Betty' has an award, so the 'actors_with_awards' column should display a 0 for the first row and 1 for the second, but the result is a 1 for both rows.
I suspect this is because it's looking at the string in its entirety, e.g. checking whether 'Alexander, Ann' is in the list rather than whether 'Alexander' or 'Ann' is. I thought splitting the strings would solve this (maybe I did that step wrong?), so I'm not sure how to fix it.
My full code is below:
import pandas as pd

# Creating sample dataframes
df1 = pd.DataFrame()
df1['cast'] = ['Alexander, Ann', 'Bob, Bill, Benedict, Betty']
df2 = pd.DataFrame()
df2['awards'] = ['Betty']

# Creating lists of actors, and splitting up the string
actor_split = []
for x in df1['cast']:
    actor_split.append(x.split(','))

# Creating a list of actors who have received an award
award = []
for x in df2['awards']:
    award.append(x)

# Attempting to create a list of actors in df1 who have received an award
actors_with_awards = []
for item in actor_split:
    if x in item not in award:
        actors_with_awards.append(0)
    else:
        actors_with_awards.append(1)

df1['actors_with_awards'] = actors_with_awards
df1
Current Output Df1:
cast                          actors_with_awards
Alexander, Ann                1
Bob, Bill, Benedict, Betty    1

Expected Output Df1:
cast                          actors_with_awards
Alexander, Ann                0
Bob, Bill, Benedict, Betty    1
When trying out your program, a couple of things popped up. The first was your comparison of x to see if it was contained in the awards database:
for item in actor_split:
    if x in item not in award:
        actors_with_awards.append(0)
    else:
        actors_with_awards.append(1)
The issue here is that x contains the value "Betty" left over from populating the awards array; it is not the x value for each split actor set. The other issue was that, when checking whether an item did or did not exist in the awards array, leading and/or trailing spaces in the actor names were throwing off the comparison.
With that in mind I made a few tweaks to your code to address those situations as follows in the code snippet.
import pandas as pd

# Creating sample dataframes
df1 = pd.DataFrame()
df1['cast'] = ['Alexander, Ann', 'Bob, Bill, Benedict, Betty']
df2 = pd.DataFrame()
df2['awards'] = ['Betty']

# Creating lists of actors, and splitting up the string
actor_split = []
for x in df1['cast']:
    actor_split.append(x.split(','))

# Creating a list of actors who have received an award
award = []
for x in df2['awards']:
    award.append(x.strip())  # Make sure no leading or trailing spaces exist for the subsequent test

# Creating a list of actors in df1 who have received an award
actors_with_awards = []
for item in actor_split:
    y = 0
    for x in item:  # Reworked this so that "x" is associated with the selected actor set
        if x.strip() not in award:  # Again, make sure no leading or trailing spaces are in the comparison
            y = 0
        else:
            y = 1
            break  # stop once an award winner is found in this cast
    actors_with_awards.append(y)

df1['actors_with_awards'] = actors_with_awards
print(df1)  # Changed this so as to print out the data to a terminal
To ensure that leading or trailing spaces would not trip up the comparisons or list checks, I added the .strip() function where needed so that just the name value and nothing more is stored. Secondly, so that the proper name value is placed into the variable x, an additional for loop was added, along with a working variable to be populated with the proper 0 or 1 value. Adding those tweaks resulted in the following raw data output on the terminal.
cast actors_with_awards
0 Alexander, Ann 0
1 Bob, Bill, Benedict, Betty 1
You may want to give that a try. Please note that this may be just one way to address this issue.
One possible solution is to convert the actors with awards from df2 to a set, split the column df1['cast'], and check the intersection between each row and the set:
awards = set(df2["awards"].values)

df1["actors_with_awards"] = [
    int(bool(awards.intersection(a)))
    for a in df1["cast"].str.split(r"\s*,\s*", regex=True)
]
print(df1)
Prints:
cast actors_with_awards
0 Alexander, Ann 0
1 Bob, Bill, Benedict, Betty 1

Dropping Rows that Contain a Specific String wrapped in square brackets?

I'm trying to drop rows whose value in a column is a string wrapped in square brackets. I want to drop all values that contain the strings '[removed]' or '[deleted]'.
My df looks like this:
Comments
1 The main thing is the price appreciation of the token (this determines the gains or losses more
than anything). Followed by the ecosystem for the liquid staking asset, the more opportunities
and protocols that accept the asset as collateral, the better. Finally, the yield for staking
comes into play.
2 [deleted]
3 [removed]
4 I could be totally wrong, but sounds like destroying an asset and claiming a loss, which I
believe is fraudulent. Like someone else said, get a tax guy - for this year anyway and then
you'll know for sure. Peace of mind has value too.
I have tried df[df["Comments"].str.contains("removed")==False]
But when I try to save the dataframe, the rows are still not removed.
EDIT:
My full code
import pandas as pd
sol2020 = pd.read_csv("Solana_2020_Comments_Time_Adjusted.csv")
sol2021 = pd.read_csv("Solana_2021_Comments_Time_Adjusted.csv")
df = pd.concat([sol2021, sol2020], ignore_index=True, sort=False)
df[df["Comments"].str.contains("deleted")==False]
df[df["Comments"].str.contains("removed")==False]
Try this. I created a data frame with my own comments for the Comments column, but it should work for you. Note that your original lines never assign the filtered result back to df (or to a new variable), which is why the rows are still there when you save.
import pandas as pd
sample_data = { 'Comments': ['first comment whatever','[deleted]','[removed]','last comments whatever']}
df = pd.DataFrame(sample_data)
data = df[df["Comments"].str.contains("deleted|removed")==False]
print(data)
output I got
Comments
0 first comment whatever
3 last comments whatever
You can do it like this:
new_df = df[~(df['Comments'].str.startswith('[') & df['Comments'].str.endswith(']'))].reset_index(drop=True)
Output:
>>> new_df
Comments
0 The main thing is the price appreciation of th...
3 I could be totally wrong, but sounds like dest...
That will remove all rows where the value of the Comments column for that row starts with [ and ends with ].

How do I systematically compare all rows in two Pandas dataframes using specific columns and return the differences?

I have two large dataframes from different sources, largely of the same structure but of different lengths and in a different order. Most of the data is comparable, but not all. The rows represent individuals and the columns contain data about those individuals. I want to check certain column values of one dataframe, row by row, against the 'master' dataframe and then return the omissions so these can be added to it.
I have been using the df.query() method to check individual cases using my own inputs because I can search the master dataframe using multiple columns - so, something like df.query('surname == "JONES" and initials == "D V" and town == "LONDON"'). I want to do something like this but by creating a query of each row of the other dataframe using data from specific columns.
I think I can work out how I might do this using for loops and if statements, but that's going to be wildly inefficient and obviously not ideal. List comprehension might be more efficient, but I can't work out the dataframe comparison part unless I create a new column whose data is built from the values I want to compare ('JONES-DV-LONDON', but that seems wrong).
There is an answer in here, I think, but it relies on the dataframes being more or less identical (which mine aren't, hence my wish to compare only certain columns).
I have been unable to find an example of someone doing the same, which might be my failure again. I am a novice and I have a feeling I might be thinking about this in completely the wrong way. I would very much value any thoughts and pointers...
EDIT - some sample data (not exactly what I'm using but hopefully helps show what I am trying to do)
df1 (my master list)
surname initials town
JONES D V LONDON
DAVIES H G BIRMINGHAM
df2
surname initials town
DAVIES H G BIRMINGHAM
HARRIS P J SOUTHAMPTON
JONES D V LONDON
I would like to identify the columns to use in the comparison (surname, initials, town here - assume there are more with data that cannot be matched) and then return the unique results from df2 - so in this case a dataframe containing:
surname initials town
HARRIS P J SOUTHAMPTON
Define the columns to join on:
cols = ['surname', 'initials', 'town']
Then you can use merge with the parameter indicator=True, which shows where each row appears (left_only, right_only, or both):
df_res = df1.merge(df2, 'outer', on=cols, indicator=True)
and exclude the rows that appear in both dataframes:
df_res = df_res[df_res['_merge'] != 'both']
print(df_res)
surname initials town _merge
2 HARRIS P J SOUTHAMPTON right_only
You can then filter by left_only or right_only.
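For instance, to reproduce the expected output above (the rows unique to df2), one way might be to keep only the right_only rows and the join columns:

only_in_df2 = df_res.loc[df_res['_merge'] == 'right_only', cols]
print(only_in_df2)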

python pandas how to merge/join two tables based on substring?

Let's say I have two dataframes, and the column names for both are:
table 1 columns:
[ShipNumber, TrackNumber, Comment, ShipDate, Quantity, Weight]
table 2 columns:
[ShipNumber, TrackNumber, AmountReceived]
I want to merge the two tables when either 'ShipNumber' or 'TrackNumber' from table 2 can be found in 'Comment' from table 1.
Also, I'll explain why
merged = pd.merge(df1,df2,how='left',left_on='Comment',right_on='ShipNumber')
does not work in this case.
"Comment" column is a block of texts that can contain anything, so I cannot do an exact match like tab2.ShipNumber == tab1.Comment, because tab2.ShipNumber or tab2.TrackNumber can be found as a substring in tab1.Comment.
The desired output table should have all the unique columns from two tables:
output table column names:
[ShipNumber, TrackNumber, Comment, ShipDate, Quantity, Weight, AmountReceived]
I hope my question makes sense...
Any help is really really appreciated!
Note: the ultimate goal is to merge the two sets with (shipnumber == shipnumber | tracknumber == tracknumber | shipnumber in comments | tracknumber in comments), but I've created two subsets for the first two conditions, and now I'm working on the 3rd and 4th.
Why not do something like the following (note that this pairs the rows of df1 and df2 by position):
count = 0

def merge_function(row):
    # copy the amount across when either identifier from the matching
    # df2 row appears as a substring of the df1 comment
    global count
    df2_row = df2.iloc[count]
    if (str(df2_row['ShipNumber']) in row['Comment']
            or str(df2_row['TrackNumber']) in row['Comment']):
        row['AmountReceived'] = df2_row['AmountReceived']
    count += 1
    return row

df1['AmountReceived'] = 0  # fill with zeros
new_df = df1.apply(merge_function, axis=1)
Here is an example based on some made-up data. Ignore the complete nonsense I've put in the dataframes; I was just typing in random stuff to get a sample df to play with.
import pandas as pd

x = pd.DataFrame({'Location': ['Chicago', 'Houston', 'Los Angeles', 'Boston', 'NYC', 'blah'],
                  'Comments': ['chicago is winter', 'la is summer', 'boston is winter',
                               'dallas is spring', 'NYC is spring', 'seattle foo'],
                  'Dir': ['N', 'S', 'E', 'W', 'S', 'E']})
y = pd.DataFrame({'Location': ['Miami', 'Dallas'],
                  'Season': ['Spring', 'Fall']})

def findval(row):
    # lowercase all three fields, then check whether the location or
    # season appears as a substring of the comment
    comment, location, season = map(lambda x: str(x).lower(), row)
    return location in comment or season in comment

merged = pd.concat([x, y])
merged['Helper'] = merged[['Comments', 'Location', 'Season']].apply(findval, axis=1)
print(merged)

filtered = merged[merged['Helper'] == True]
print(filtered)
Rather than joining, you can concatenate the dataframes and then create a helper column to see if the string of one column is found in another. Once you have that helper column, just filter on the Trues.
You could index the comments field using a library like Whoosh and then do a text search for each shipment number that you want to search by.
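If it helps, here is a rough sketch of that idea; the index directory, the Whoosh schema, and the sample frames (Comment, ShipNumber columns) are all assumptions made for illustration, not part of the original answer:

import os
import pandas as pd
from whoosh.index import create_in
from whoosh.fields import Schema, TEXT, ID
from whoosh.qparser import QueryParser

# hypothetical frames following the question's column names
df1 = pd.DataFrame({'Comment': ['received shipment SN123 yesterday',
                                'tracking TN987 delayed in transit']})
df2 = pd.DataFrame({'ShipNumber': ['SN123'], 'TrackNumber': ['TN555']})

# index every comment once, keyed by its df1 row position
schema = Schema(row_id=ID(stored=True), comment=TEXT(stored=True))
os.makedirs('comment_index', exist_ok=True)
ix = create_in('comment_index', schema)
writer = ix.writer()
for i, text in df1['Comment'].items():
    writer.add_document(row_id=str(i), comment=text)
writer.commit()

# search the index for each shipment number
with ix.searcher() as searcher:
    parser = QueryParser('comment', ix.schema)
    for ship in df2['ShipNumber'].astype(str):
        hits = searcher.search(parser.parse(ship))
        print(ship, '->', [hit['row_id'] for hit in hits])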
