Drop Dataframe Rows Based on a Similarity Measure Pandas - python

I want to eliminate repeated rows in my dataframe.
I know that the drop_duplicates() method works for dropping rows with identical subcolumn values. However, I want to drop rows that aren't identical but similar. For example, I have the following two rows:
Title                 Area    Price
Apartment at Boston   100     150000
Apt at Boston         105     149000
I want to be able to eliminate these two rows based on some similarity measure, such as if Title, Area, and Price differ by less than 5%. Say, I could delete rows whose similarity measure > 0.95. This would be particularly useful for large data sets, instead of manually inspecting row by row. How can I achieve this?

Here is a function using difflib. I got the similar() function from here. You may also want to check out some of the answers on that page to determine the best similarity metric for your use case.
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'Title': ['Apartment at Boston', 'Apt at Boston'],
                    'Area': [100, 105],
                    'Price': [150000, 149000]})

def string_ratio(df, col, ratio):
    from difflib import SequenceMatcher

    def similar(a, b):
        return SequenceMatcher(None, a, b).ratio()

    ratios = []
    for i, x in enumerate(df[col]):
        # similarity of this row's string against every row in the column
        a = np.array([similar(x, row) for row in df[col]])
        # indices of rows that fall below the similarity threshold
        a = np.where(a < ratio)[0]
        # True if every other row is at least `ratio` similar to this one
        ratios.append(len(a[a != i]) == 0)
    return pd.Series(ratios)

def numeric_ratio(df, col, ratio):
    ratios = []
    for i, x in enumerate(df[col]):
        # ratio of the smaller to the larger value, this row against every row
        a = np.array([min(x, row) / max(x, row) for row in df[col]])
        a = np.where(a < ratio)[0]
        ratios.append(len(a[a != i]) == 0)
    return pd.Series(ratios)

mask = ~((string_ratio(df1, 'Title', .95))
         & (numeric_ratio(df1, 'Area', .95))
         & (numeric_ratio(df1, 'Price', .95)))
df1[mask]
It should be able to weed out most of the similar data, though you might want to tweak the string_ratio function if it doesn't suit your case.
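For instance, here is a minimal sketch of one possible tweak (my own assumption, not part of the original answer): replace the character-level SequenceMatcher with a token-level Jaccard similarity, so word order and casing matter less.

# Hypothetical drop-in replacement for similar() inside string_ratio.
# Token-level Jaccard similarity: intersection over union of the word sets.
def similar_tokens(a, b):
    a_tokens, b_tokens = set(a.lower().split()), set(b.lower().split())
    if not a_tokens or not b_tokens:
        return 0.0
    return len(a_tokens & b_tokens) / len(a_tokens | b_tokens)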

See if this meets your needs
import pandas as pd

Title = ['Apartment at Boston', 'Apt at Boston', 'Apt at Chicago', 'Apt at Seattle', 'Apt at Seattle', 'Apt at Chicago']
Area = [100, 105, 100, 102, 101, 101]
Price = [150000, 149000, 150200, 150300, 150000, 150000]

data = dict(Title=Title, Area=Area, Price=Price)
df = pd.DataFrame(data, columns=data.keys())
The df created is as below
Title Area Price
0 Apartment at Boston 100 150000
1 Apt at Boston 105 149000
2 Apt at Chicago 100 150200
3 Apt at Seattle 102 150300
4 Apt at Seattle 101 150000
5 Apt at Chicago 101 150000
Now, we run the code as below
from fuzzywuzzy import fuzz

def fuzzy_compare(a, b):
    return fuzz.partial_ratio(a, b)

tl = df["Title"].tolist()
i = 0

def do_the_thing(i):
    # compare row i against every later row; drop the later row when the
    # title fuzz score is above 80 and Area/Price are within roughly 5%
    itered = i + 1
    while itered < len(tl):
        val = fuzzy_compare(tl[i], tl[itered])
        if val > 80:
            area_ratio = df.loc[i, 'Area'] / df.loc[itered, 'Area']
            price_ratio = df.loc[i, 'Price'] / df.loc[itered, 'Price']
            if 0.94 < abs(area_ratio) < 1.05 and 0.94 < abs(price_ratio) < 1.05:
                df.drop(itered, inplace=True)
                df.reset_index()
        itered = itered + 1

while i < len(tl) - 1:
    try:
        do_the_thing(i)
        i = i + 1
    except:
        # rows already dropped raise a KeyError on df.loc; skip and move on
        i = i + 1
The output is the df below. The repeated Boston & Seattle items are removed when the fuzzy match is more than 80 and the values of Area & Price are within 5% of each other.
Title Area Price
0 Apartment at Boston 100 150000
2 Apt at Chicago 100 150200
3 Apt at Seattle 102 150300

Related

Apply fuzzy string matching of two columns in two Pandas dataframes while preserving a similarity score and output a Pandas DataFrame

I have two data frames that I'm trying to merge, based on a primary & foreign key of company name. One data set has ~50,000 unique company names, the other one has about 5,000. Duplicate company names are possible within each list.
To that end, I've tried to follow along with the first solution from Figure out if a business name is very similar to another one - Python. Here's an MWE:
import pandas as pd

mwe1 = pd.DataFrame({'company_name': ['Deloitte',
                                      'PriceWaterhouseCoopers',
                                      'KPMG',
                                      'Ernst & Young',
                                      'intentionall typo company XYZ'],
                     'revenue': [100, 200, 300, 250, 400]})

mwe2 = pd.DataFrame({'salesforce_name': ['Deloite',
                                         'PriceWaterhouseCooper'],
                     'CEO': ['John', 'Jane']})
I am trying to get the following code from Figure out if a business name is very similar to another one - Python to work:
# token2frequency is just a word counter of all words in all names
# in the dataset
def sequence_uniqueness(seq, token2frequency):
    return sum(1/token2frequency(t)**0.5 for t in seq)

def name_similarity(a, b, token2frequency):
    a_tokens = set(a.split())
    b_tokens = set(b.split())
    a_uniq = sequence_uniqueness(a_tokens)
    b_uniq = sequence_uniqueness(b_tokens)
    return sequence_uniqueness(a.intersection(b))/(a_uniq * b_uniq) ** 0.5
How do I apply those two functions to produce a similarity score between each possible combination of mwe1 and mwe2, then filter down to the most probable matches?
For example, I'm looking for something like this (I'm just making up the scores in the similarity_score column):
company_name                    revenue  salesforce_name        CEO   similarity_score
Deloitte                        100      Deloite                John  98
PriceWaterhouseCoopers          200      Deloite                John  0
KPMG                            300      Deloite                John  15
Ernst & Young                   250      Deloite                John  10
intentionall typo company XYZ   400      Deloite                John  2
Deloitte                        100      PriceWaterhouseCooper  Jane  20
PriceWaterhouseCoopers          200      PriceWaterhouseCooper  Jane  97
KPMG                            300      PriceWaterhouseCooper  Jane  5
Ernst & Young                   250      PriceWaterhouseCooper  Jane  7
intentionall typo company XYZ   400      PriceWaterhouseCooper  Jane  3
I'm also open to better end-states, if you can think of one. Then, I'd filter that table above to get something like:
company_name             revenue  salesforce_name        CEO   similarity_score
Deloitte                 100      Deloite                John  98
PriceWaterhouseCoopers   200      PriceWaterhouseCooper  Jane  97
Here's what I've tried:
name_similarity(a = mwe1['company_name'], b = mwe2['salesforce_name'], token2frequency = 10)
AttributeError: 'Series' object has no attribute 'split'
I'm familiar with using lambda functions but not sure how to make it work when iterating through two columns in two Pandas data frames.
Here is a class I wrote using difflib; it should be close to what you need.
import difflib
import pandas as pd

class FuzzyMerge:
    """
    Works like pandas merge except merges on approximate matches.
    """
    def __init__(self, **kwargs):
        self.left = kwargs.get("left")
        self.right = kwargs.get("right")
        self.left_on = kwargs.get("left_on")
        self.right_on = kwargs.get("right_on")
        self.how = kwargs.get("how", "inner")
        self.cutoff = kwargs.get("cutoff", 0.8)

    def merge(self) -> pd.DataFrame:
        temp = self.right.copy()
        # replace each right-hand key with its closest match from the left-hand key column
        temp[self.left_on] = [
            self.get_closest_match(x, self.left[self.left_on]) for x in temp[self.right_on]
        ]
        df = self.left.merge(temp, on=self.left_on, how=self.how)
        df["similarity_percent"] = df.apply(lambda x: self.similarity_score(x[self.left_on], x[self.right_on]), axis=1)
        return df

    def get_closest_match(self, left: str, right: pd.Series) -> str or None:
        matches = difflib.get_close_matches(left, right, cutoff=self.cutoff)
        return matches[0] if matches else None

    @staticmethod
    def similarity_score(left: str, right: str) -> int:
        return int(round(difflib.SequenceMatcher(a=left, b=right).ratio(), 2) * 100)
Call it with:
df = FuzzyMerge(left=df1, right=df2, left_on="column from df1", right_on="column from df2", how="inner", cutoff=0.8).merge()
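For the MWE from the question, a call might look like the following (the lower cutoff of 0.6 is just a guess so the misspelled names still match):

# Sketch only: fuzzy-merge the question's two frames on company name.
merged = FuzzyMerge(left=mwe1, right=mwe2,
                    left_on="company_name", right_on="salesforce_name",
                    how="inner", cutoff=0.6).merge()
print(merged[["company_name", "revenue", "salesforce_name", "CEO", "similarity_percent"]])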

Pandas Fuzzy Matching

I want to check the accuracy of a column of addresses in my dataframe against a column of addresses in another dataframe, to see if they match and how well they match. However, it seems that it takes a long time to go through the addresses and perform the calculations. There are 15000+ addresses in my main dataframe and around 50 addresses in my reference dataframe. It ran for 5 minutes and still hadn't finished.
My code is:
import pandas as pd
from fuzzywuzzy import fuzz, process

### Main dataframe
df = pd.read_csv("adressess.csv", encoding="cp1252")

#### Reference dataframe
ref_df = pd.read_csv("ref_addresses.csv", encoding="cp1252")

### Variable for accuracy scoring
accuracy = 0

for index, value in df["address"].iteritems():
    ### This gathers the index from the correct address column in the reference df
    ref_index = ref_df["correct_address"][
        ref_df["correct_address"]
        == process.extractOne(value, ref_df["correct_address"])[0]
    ].index.tolist()[0]

    ### if each row can score a max total of 1, the ratio must be divided by 100
    accuracy += (
        fuzz.ratio(df["address"][index], ref_df["correct_address"][ref_index]) / 100
    )
Is this the best way to loop through a column in a dataframe and fuzzy match it against another? I want the score to be a ratio because later I will then output an excel file with the correct values and a background colour to indicate what values were wrong and changed.
I don't believe fuzzywuzzy has a method that allows you to pull the index, value and ratio into one tuple - just the value and ratio of the match.
Hopefully the below code (with links to dummy data) helps show what is possible. I tried to use street addresses to mock up a similar situation so it is easier to compare with your dataset; obviously it is nowhere near as big.
You can pull the csv text from the links in the comments and run it and see what could work on your larger sample.
For five addresses in the reference frame and 100 contacts in the other its execution timings are:
CPU times: user 107 ms, sys: 21 ms, total: 128 ms
Wall time: 137 ms
The below code should be quicker than .iteritems() etc.
Code:
# %%time
import pandas as pd
from fuzzywuzzy import fuzz, process
import difflib

# create 100-contacts.csv from data at: https://pastebin.pl/view/3a216455
df = pd.read_csv('100-contacts.csv')

# create ref_addresses.csv from data at: https://pastebin.pl/view/6e992fe8
ref_df = pd.read_csv('ref_addresses.csv')

# function used for fuzzywuzzy matching
def match_addresses(add, list_add, min_score=0):
    max_score = -1
    max_add = ''
    for x in list_add:
        score = fuzz.ratio(add, x)
        if (score > min_score) & (score > max_score):
            max_add = x
            max_score = score
    return (max_add, max_score)

# given current row of ref_df (via apply) and series (df['address'])
# return the fuzzywuzzy score
def scoringMatches(x, s):
    o = process.extractOne(x, s, score_cutoff=60)
    if o != None:
        return o[1]

# creating two lists from address column of both dataframes
contacts_addresses = list(df.address.unique())
ref_addresses = list(ref_df.correct_address.unique())

# via fuzzywuzzy matching and using scoringMatches() above
# return a dictionary of addresses where there is a match
# the keys are the address from ref_df and the associated value is from df (i.e., 'huge' frame)
# example:
# {'86 Nw 66th Street #8673': '86 Nw 66th St #8673', '1 Central Avenue': '1 Central Ave'}
names = []
for x in ref_addresses:
    match = match_addresses(x, contacts_addresses, 75)
    if match[1] >= 75:
        name = (str(x), str(match[0]))
        names.append(name)
name_dict = dict(names)

# create new frame from fuzzywuzzy address matches dictionary
match_df = pd.DataFrame(name_dict.items(), columns=['ref_address', 'matched_address'])

# add fuzzywuzzy scoring to original ref_df
ref_df['fuzzywuzzy_score'] = ref_df.apply(lambda x: scoringMatches(x['correct_address'], df['address']), axis=1)

# merge the fuzzywuzzy address matches frame with the reference frame
compare_df = pd.concat([match_df, ref_df], axis=1)
compare_df = compare_df[['ref_address', 'matched_address', 'correct_address', 'fuzzywuzzy_score']].copy()

# add difflib scoring for a bit of interest.
# a random thought passed through my head maybe this is interesting?
compare_df['difflib_score'] = compare_df.apply(lambda x: difflib.SequenceMatcher(
    None, x['ref_address'], x['matched_address']).ratio(), axis=1)

# clean up column ordering ('correct_address' and 'ref_address' are basically
# copies of each other, but shown for completeness)
compare_df = compare_df[['correct_address', 'ref_address', 'matched_address',
                         'fuzzywuzzy_score', 'difflib_score']]

# see what we've got
print(compare_df)
# remember: correct_address and ref_address are copies
# so just pick one to compare to matched_address
correct_address ref_address matched_address \
0 86 Nw 66th Street #8673 86 Nw 66th Street #8673 86 Nw 66th St #8673
1 2737 Pistorio Rd #9230 2737 Pistorio Rd #9230 2737 Pistorio Rd #9230
2 6649 N Blue Gum St 6649 N Blue Gum St 6649 N Blue Gum St
3 59 n Groesbeck Hwy 59 n Groesbeck Hwy 59 N Groesbeck Hwy
4 1 Central Avenue 1 Central Avenue 1 Central Ave
fuzzywuzzy_score difflib_score
0 90 0.904762
1 100 1.000000
2 100 1.000000
3 100 0.944444
4 90 0.896552
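If the single accuracy figure from the question is still wanted, one way (my reading of the intent) is to roll the scores up afterwards:

# each matched address can score at most 1, so divide the fuzzywuzzy ratio by 100
accuracy = (compare_df['fuzzywuzzy_score'] / 100).sum()
print(f"accuracy: {accuracy:.2f} out of a possible {len(compare_df)}")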

Python: fuzzywuzzy, the output of the first value is correct, the others are NaN

I'm stuck on a very strange problem:
I have two dfs and I have to match strings of one df with the strings of the other df, by similarity.
The target column is the name of the television program (program_name_1 & program_name_2).
To let the algorithm choose from a limited set of data, I also used the 'Channel' column as a filter.
The function applies the fuzzy algorithm and gives as result the match of the elements from the columns program_name_1 with program_name_2 and the score similarity between them.
The really strange thing is that the output works fine for the first channel, but not for any of the later channels. The first column (program_name_1, taken from scorer_test_1) is always correct, but program_name_2 (which should come from scorer_test_2) and the similarity column are NaN.
I did a lot of checks on the dfs: I am sure that the column names are the same as the names in the lists, and that the other channels contain all the data I'm asking for.
The strangest thing is that the first channel and all the other channels are in the same df, so there are no differences between the channels' data.
I will show you toy dfs to let you understand the problem better:
df1 = {'Channel': ['1','1','1','2','2','2','3','4'], 'program_name_1': ['party','animals','gucci','the simpson', 'cars', 'mathematics', 'bikes', 'chef']}
df2 = {'Channel': ['1','1','1','2','2','2','3','4'], 'program_name_2': ['parties','gucci_gucci','animal','simpsons', 'math', 'the car', 'bike', 'cooking']}
df1 = pd.DataFrame(df1, columns = ['Channel','program_name_1'])
df2 = pd.DataFrame(df2, columns = ['Channel','program_name_2'])
that will print for the df1:
Channel program_name_1
1 party
1 animals
1 gucci
2 the simpson
2 cars
2 mathematics
3 bikes
4 chef
and for the df2:
Channel program_name_2
1 parties
1 gucci_gucci
1 animal
2 simpsons
2 math
2 the car
3 bike
4 cooking
and here the code:
scorer_test_1 = df_1.loc[(df_1['Channel'] == '1')]['program_name_1']
scorer_test_2 = df_2.loc[(df_2['Channel'] == '1')]['program_name_2']

# creation of a function for the score
def scorer_tester_function(x):
    matching_list = []
    similarity = []
    # iterate on the rows
    for i in scorer_test_1:
        if pd.isnull(i):
            matching_list.append(np.nan)
            similarity.append(np.nan)
        else:
            ratio = process.extract(i, scorer_test_2, limit=5, scorer=scorer_dict[x])
            matching_list.append(ratio[0][0])
            similarity.append(ratio[0][1])
    my_df = pd.DataFrame()
    my_df['program_name_1'] = scorer_test_1
    my_df['program_name_2'] = pd.Series(matching_list)
    my_df['similarity'] = pd.Series(similarity)
    return my_df

print(scorer_tester_function('R').head())
The output that I would like to get for all the channels (but which I only get when I pass the first channel) is this:
for the channel[1]:
program_name_1 program_name_2 similarity
party parties 95
animals animal 95
gucci gucci_gucci 75
for the channel[2]:
program_name_1 program_name_2 similarity
the simpson simpsons 85
cars the car 75
mathematics math 70
This is the output I get if I ask for the channel 2 or next:
code:
scorer_test_1 = df_1.loc[(df_1['Channel'] == '2')]['program_name_1']
scorer_test_2 = df_2.loc[(df_2['Channel'] == '2')]['program_name_2']
output:
Channel program_name_1 program_name_2 similarity
2 the simpson NaN NaN
2 cars NaN NaN
2 mathematics NaN NaN
I hope someone can help me :)
Thanks!
This was an index mismatch: my_df takes its row labels from the channel slice (e.g. 3, 4, 5 for channel 2), while pd.Series(matching_list) gets a fresh 0-based index, so the column assignment aligns on the index and fills NaN. Resetting the index after adding the first data series does the trick!
def scorer_tester_function(x):
    matching_list = []
    similarity = []
    # iterate on the rows
    for i in scorer_test_1:
        if pd.isnull(i):
            matching_list.append(np.nan)
            similarity.append(np.nan)
        else:
            ratio = process.extract(i, scorer_test_2, limit=5)  # , scorer=scorer_dict[x])
            matching_list.append(ratio[0][0])
            similarity.append(ratio[0][1])
    my_df = pd.DataFrame()
    my_df['program_name_1'] = scorer_test_1
    print(my_df.index)
    my_df.reset_index(inplace=True)   # realign to a fresh 0-based index
    print(my_df.index)
    my_df['program_name_2'] = pd.Series(matching_list)
    my_df['similarity'] = pd.Series(similarity)
    return my_df
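To see why the reset matters, here is a minimal sketch with made-up values: the channel-2 slice keeps its original labels (3, 4, 5), while pd.Series(matching_list) gets a fresh 0-based index, so nothing lines up and you get NaN.

import pandas as pd

s = pd.Series(['the simpson', 'cars', 'mathematics'], index=[3, 4, 5])  # slice keeps labels 3..5
matches = ['simpsons', 'the car', 'math']

out = pd.DataFrame({'program_name_1': s})
out['program_name_2'] = pd.Series(matches)   # new Series has index 0..2 -> no overlap -> all NaN
print(out)

out = out.reset_index(drop=True)
out['program_name_2'] = pd.Series(matches)   # indices now line up -> values land correctly
print(out)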

How to group data in a DataFrame and also show the number of row in that group?

First of all, I have no background in computer languages and I am learning Python.
I'm trying to group some data in a data frame.
[dataframe "cafe_df_merged"]
Actually, I want to create a new data frame that shows the 'city_number', the 'city' (which is a name), and the number of cafes in that city. So, it should have 3 columns: 'city_number', 'city' and 'number_of_cafe'.
However, when I tried to use groupby, the result did not come out as I expected.
city_directory = cafe_df_merged[['city_number', 'city']]
city_directory = city_directory.groupby('city').count()
city_directory
[the result]
How should I do this? Please help, thanks.
There are likely other ways of doing this as well, but something like this should work:
import pandas as pd
import numpy as np

# Create a reproducible example
places = [[['starbucks', 'new_york', '1234']]*5, [['bean_dream', 'boston', '3456']]*4,
          [['coffee_today', 'jersey', '7643']]*3, [['coffee_today', 'DC', '8902']]*3,
          [['starbucks', 'nowwhere', '2674']]*2]
places = [p for sub in places for p in sub]

# a dataframe containing all information
city_directory = pd.DataFrame(places, columns=['shop', 'city', 'id'])

# make a new dataframe with just cities and ids
# drop duplicate rows
city_info = city_directory.loc[:, ['city', 'id']].drop_duplicates()

# get the cafe counts (number of cafes)
cafe_count = city_directory.groupby('city').count().iloc[:, 0]

# add the cafe counts to the dataframe
city_info['cafe_count'] = cafe_count[city_info['city']].to_numpy()

# reset the index
city_info = city_info.reset_index(drop=True)
city_info now yields the following:
city id cafe_count
0 new_york 1234 5
1 boston 3456 4
2 jersey 7643 3
3 DC 8902 3
4 nowwhere 2674 2
And part of the example dataframe, city_directory.tail(), looks like this:
shop city id
12 coffee_today DC 8902
13 coffee_today DC 8902
14 coffee_today DC 8902
15 starbucks nowwhere 2674
16 starbucks nowwhere 2674
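For what it's worth, a more compact variant of the same idea (not part of the answer above, and assuming the same example city_directory frame) folds the counting and de-duplication into a single groupby:

# one group per unique (city, id) pair; size() counts the cafes in each group
city_info = (city_directory
             .groupby(['city', 'id'], sort=False)
             .size()
             .reset_index(name='cafe_count'))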
Opinion: As a side note, it might be easier to get comfortable with regular Python first before diving deep into the world of pandas and numpy. Otherwise, it might be a bit overwhelming.

Find max halflife values relative to their temperature value in the same array

Basically I load the excel file into a pandas dataframe here:
dv = pd.read_excel('data.xlsx')
Then I clean it up and rename it to "cleaned" which is not important for this reproducible example, just mentioning for clarity:
if (selected_x.title() == "Viscosity" or selected_y.title() == "Viscosity"):
    cleaned = cleaned[cleaned.Study != "Yanqing Wang 2017"]
    cleaned = cleaned[cleaned.Study != "Thakore 2020"]
From there, I separate the cleaned dataframe into separate studies, this project is a composition of literature. I will include an example of two below:
yan = cleaned[cleaned.Study == "Yanqing Wang 2017"]
tha = cleaned[cleaned.Study == "Thakore 2020"]
Finally, I load each of the individual studies into traces, and display them in a graph. Selected y and selected x are strings, such as "Temperature (C) " and "Halflife (Min)":
trace1 = go.Scatter(y=tha[selected_y], x=tha[selected_x])
trace2 = go.Scatter(y=yan[selected_y], x=yan[selected_x])
What I need to do is, after splitting the dataframe into individual studies, find the maximum halflife at each temperature (0, 50, 100, 150, 200, 250, 300), compile those into separate lists, then take the whole row for each maximum and append it to a single list. I have tried to do this using code like:
yan50 = yanq[yanq['Temperature (C) '] == 50]
yan100 = yanq[yanq['Temperature (C) '] == 100]
yan150 = yanq[yanq['Temperature (C) '] == 150]
yan200 = yanq[yanq['Temperature (C) '] == 200]
yan250 = yanq[yanq['Temperature (C) '] == 250]
yan300 = yanq[yanq['Temperature (C) '] == 300]
to split the study into per-temperature lists. I am currently stuck where I have to find the max value in the halflife column of each list and add the whole corresponding row to a new list. This is what I am trying:
yan = pd.DataFrame(columns=["Study", "Gas", "Surfactant", "Surfactant Concentration", "Additive",
                            "Additive Concentration", "LiquidPhase", "Quality", "Pressure (Psi)",
                            "Temperature (C) ", "Shear Rate (/Sec)", "Halflife (Min)", "Viscosity", "Color"])

if (len(yan50) > 0):
    yan50.loc[yan50['Halflife (Min)'].idxmax()]
    yan50 = yan50.dropna()
    yan.append(yan50)
if (len(yan100) > 0):
    yan100.loc[yan100['Halflife (Min)'].idxmax()]
    yan100 = yan100.dropna()
    yan.append(yan100)
if (len(yan150) > 0):
    yan150.loc[yan150['Halflife (Min)'].idxmax()]
    yan150 = yan150.dropna()
    yan.append(yan150)
if (len(yan200) > 0):
    yan200.loc[yan200['Halflife (Min)'].idxmax()]
    yan200 = yan200.dropna()
    yan.append(yan200)
if (len(yan250) > 0):
    yan250.loc[yan250['Halflife (Min)'].idxmax()]
    yan250 = yan250.dropna()
    yan.append(yan250)
if (len(yan300) > 0):
    yan300.loc[yan300['Halflife (Min)'].idxmax()]
    yan300 = yan300.dropna()
    yan.append(yan300)

yan50.iloc[yan50['Halflife (Min)'].idxmax()]
The error I am getting is that the individual temperature lists are empty.
I also got a bunch of NaN values for the separate temperature lists I compiled, and I am unsure if I am splitting the list correctly. I am not too strong with Pandas. Recommendations needed!
Link to CSV of data
------------Edit-------------
What I have: all the studies placed on the same temperature points (50, 100, etc.). I want to find the maximum value of halflife so that only the topmost point shows. The reason I am doing this is to aid data visualization. Future plans beyond this topic include connecting the max-value dots with a line and comparing the trends of the separate studies' halflife values.
IIUC, what you need is
df2 = df.groupby(['Study','Temperature (C) '])['Halflife (Min)'].max().reset_index(name='Max_halflife')
This will result in
Study Temperature (C) Max_halflife
0 Thakore 2020 50 120.00
1 Thakore 2020 100 2.40
2 Thakore 2020 150 0.20
3 Yanqing Wang 2017 50 123.00
4 Yanqing Wang 2017 100 3.20
5 Yanqing Wang 2017 150 0.31
Then the code below should get you the graph you want.
import seaborn as sns
import matplotlib.pyplot as plt

df2 = df.groupby(['Study','Temperature (C) '])['Halflife (Min)'].max().reset_index(name='Max_halflife')

fig = plt.figure(figsize=(8, 5))
sns.scatterplot(x='Temperature (C) ', y='Max_halflife', data=df2, hue='Study')
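Since the question builds plotly traces per study, the same grouped maxima could also be fed back into go.Scatter; here is a rough sketch under that assumption:

import plotly.graph_objects as go

# df2 comes from the groupby above: one trace per study, one point per temperature's max halflife
fig = go.Figure()
for study, grp in df2.groupby('Study'):
    fig.add_trace(go.Scatter(x=grp['Temperature (C) '], y=grp['Max_halflife'],
                             mode='lines+markers', name=study))
fig.show()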
