Pandas Fuzzy Matching - python

I want to check the accuracy of a column of addresses in my dataframe against a column of addresses in another dataframe, to see if they match and how well they match. However, it seems that it takes a long time to go through the addresses and perform the calculations. There are 15000+ addresses in my main dataframe and around 50 addresses in my reference dataframe. It ran for 5 minutes and still hadn't finished.
My code is:
import pandas as pd
from fuzzywuzzy import fuzz, process
### Main dataframe
df = pd.read_csv("adressess.csv", encoding="cp1252")
#### Reference dataframe
ref_df = pd.read_csv("ref_addresses.csv", encoding="cp1252")
### Variable for accuracy scoring
accuracy = 0
for index, value in df["address"].iteritems():
    ### This gathers the index from the correct address column in the reference df
    ref_index = ref_df["correct_address"][
        ref_df["correct_address"]
        == process.extractOne(value, ref_df["correct_address"])[0]
    ].index.tolist()[0]
    ### if each row can score a max total of 1, the ratio must be divided by 100
    accuracy += (
        fuzz.ratio(df["address"][index], ref_df["correct_address"][ref_index]) / 100
    )
Is this the best way to loop through a column in a dataframe and fuzzy match it against another? I want the score to be a ratio because later I will then output an excel file with the correct values and a background colour to indicate what values were wrong and changed.
I don't believe fuzzywuzzy has a method that allows you to pull the index, value and ratio into one tuple - just the value and ratio of the match.

Hopefully the code below (with links to dummy data) helps show what is possible. I tried to use street addresses to mock up a similar situation so it is easier to compare with your dataset; obviously it is nowhere near as big.
You can pull the csv text from the links in the comments, run it, and see what could work on your larger sample.
For five addresses in the reference frame and 100 contacts in the other, its execution timings are:
CPU times: user 107 ms, sys: 21 ms, total: 128 ms
Wall time: 137 ms
The code below should also be quicker than iterating with .iteritems().
Code:
# %%time
import pandas as pd
from fuzzywuzzy import fuzz, process
import difflib
# create 100-contacts.csv from data at: https://pastebin.pl/view/3a216455
df = pd.read_csv('100-contacts.csv')
# create ref_addresses.csv from data at: https://pastebin.pl/view/6e992fe8
ref_df = pd.read_csv('ref_addresses.csv')
# function used for fuzzywuzzy matching
def match_addresses(add, list_add, min_score=0):
    max_score = -1
    max_add = ''
    for x in list_add:
        score = fuzz.ratio(add, x)
        if score > min_score and score > max_score:
            max_add = x
            max_score = score
    return (max_add, max_score)
# given current row of ref_df (via Apply) and series (df['address'])
# return the fuzzywuzzy score
def scoringMatches(x, s):
    o = process.extractOne(x, s, score_cutoff=60)
    if o is not None:
        return o[1]
# creating two lists from address column of both dataframes
contacts_addresses = list(df.address.unique())
ref_addresses = list(ref_df.correct_address.unique())
# via fuzzywuzzy matching and using scoringMatches() above
# return a dictionary of addresses where there is a match
# the keys are the address from ref_df and the associated value is from df (i.e., 'huge' frame)
# example:
# {'86 Nw 66th Street #8673': '86 Nw 66th St #8673', '1 Central Avenue': '1 Central Ave'}
names = []
for x in ref_addresses:
    match = match_addresses(x, contacts_addresses, 75)
    if match[1] >= 75:
        name = (str(x), str(match[0]))
        names.append(name)
name_dict = dict(names)
# create new frame from fuzzywuzzy address matches dictionary
match_df = pd.DataFrame(name_dict.items(), columns=['ref_address', 'matched_address'])
# add fuzzywuzzy scoring to original ref_df
ref_df['fuzzywuzzy_score'] = ref_df.apply(lambda x: scoringMatches(x['correct_address'], df['address']), axis=1)
# merge the fuzzywuzzy address matches frame with the reference frame
compare_df = pd.concat([match_df, ref_df], axis=1)
compare_df = compare_df[['ref_address', 'matched_address', 'correct_address', 'fuzzywuzzy_score']].copy()
# add difflib scoring for a bit of interest.
# a random thought passed through my head maybe this is interesting?
compare_df['difflib_score'] = compare_df.apply(lambda x : difflib.SequenceMatcher\
(None, x['ref_address'], x['matched_address']).ratio(),axis=1)
# clean up column ordering ('correct_address' and 'ref_address' are basically
# copies of each other, but shown for completeness)
compare_df = compare_df[['correct_address', 'ref_address', 'matched_address',\
'fuzzywuzzy_score', 'difflib_score']]
# see what we've got
print(compare_df)
# remember: correct_address and ref_address are copies
# so just pick one to compare to matched_address
correct_address ref_address matched_address \
0 86 Nw 66th Street #8673 86 Nw 66th Street #8673 86 Nw 66th St #8673
1 2737 Pistorio Rd #9230 2737 Pistorio Rd #9230 2737 Pistorio Rd #9230
2 6649 N Blue Gum St 6649 N Blue Gum St 6649 N Blue Gum St
3 59 n Groesbeck Hwy 59 n Groesbeck Hwy 59 N Groesbeck Hwy
4 1 Central Avenue 1 Central Avenue 1 Central Ave
fuzzywuzzy_score difflib_score
0 90 0.904762
1 100 1.000000
2 100 1.000000
3 100 0.944444
4 90 0.896552
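Two follow-up notes, hedged because they go slightly beyond the code above. First, fuzzywuzzy's process.extractOne treats dict-like choices as a mapping, and a pandas Series exposes .items(), so passing the Series directly should already give you a 3-tuple of (matched value, score, index). Second, if scoring 15,000+ addresses is still too slow, rapidfuzz (a faster, largely API-compatible rewrite of fuzzywuzzy) can score the whole cross-product in one call. A minimal sketch, assuming the same files and column names as the question:
import pandas as pd
from rapidfuzz import fuzz, process

df = pd.read_csv("adressess.csv", encoding="cp1252")
ref_df = pd.read_csv("ref_addresses.csv", encoding="cp1252")

# matrix of fuzz.ratio scores: one row per df address, one column per reference address
scores = process.cdist(df["address"].tolist(), ref_df["correct_address"].tolist(),
                       scorer=fuzz.ratio, workers=-1)

best = scores.argmax(axis=1)                      # column of the best reference match per address
df["best_match"] = ref_df["correct_address"].to_numpy()[best]
df["score"] = scores.max(axis=1) / 100            # same 0-1 scale as the question
accuracy = df["score"].sum()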

Related

group by similar value of column in dataframe

I'm using a DataFrame that contains sample data on rocks and soils. I want to create 2 separate plots, one for rocks and one for soils, showing SO3 composition relative to SiO2. I created a dictionary of rocks only, but there are still 90+ samples. As shown in the figure, some have similar names; for example, 'Adirondack' appears 3 times. I could manually go through them all, but that would take a while (P.S. I did, but I would still like to know an easier way than if ... elif ... statements, since I had to manually create a legend entry to avoid many duplicate entries).
How can I just group together the ones with the same x letters and save them in a new dataframe or my dictionary as just 'Adirondack (all)', for example (take the part of the name before the '_' perhaps, so that it will appear in the legend that way), and have the three sets of values for 'Adirondack_' etc. in one dictionary entry.
Rocks = APXSData[APXSData.Type.str.contains('R')]
RockLabels = Rocks['Sample'].to_list()
RockDict = {}
for i in RockLabels:
    SiO2val = np.extract(Rocks["Sample"]==i, Rocks["SiO2"])
    SO3val = np.extract(Rocks["Sample"]==i, Rocks["SO3"])
    newKey = i
    RockDict[newKey] = {'SiO2':SiO2val, 'SO3':SO3val}
DatabyRockSample = pd.DataFrame(RockDict)
fig = plt.figure()
for i in RockLabels:
    plt.scatter(
        DatabyRockSample[i]["SiO2"],
        DatabyRockSample[i]["SO3"],
        marker='o',
        label = i) #, color = colors[count], edgecolors = edgecolor[count],
plt.xlabel("SiO$_2$", labelpad = 10)
plt.ylabel("SO$_3$", labelpad = 10)
plt.title('Composition of all rocks \n at Gusev Crater')
plt.legend()
Let's prepare some dummy data:
df = pd.DataFrame({
    'Sol': [14,18,33,34,41],
    'Type': ['SU','RU','RB','RR','SU'],
    'Sample': ['Gusev_Soil','Adirondack_asis','Adirondack_brush','Adirondack_RAT','Gusev_Other'],
    'N': [45,126,129,128,76],
    'Na2O': [2.8,2.3,2.8,2.4,2.7],
    # ...
})
So here's our data frame:
Sol Type Sample N Na2O
0 14 SU Gusev_Soil 45 2.8
1 18 RU Adirondack_asis 126 2.3
2 33 RB Adirondack_brush 129 2.8
3 34 RR Adirondack_RAT 128 2.4
4 41 SU Gusev_Other 76 2.7
We can handle this with grouping.
If the only option we have is matching the first n letters, then:
n = 5
grouper = df['Sample'].str[:n]
groups = {name: group for name, group in df.groupby(grouper)}
If we can extract meaningful data by splitting, which I think is better, then:
# in this case we can split by '_' and get the first word
grouper = df['Sample'].str.split('_').str.get(0)
groups = {name: group for name, group in df.groupby(grouper)}
If splitting isn't that simple, say our words are separated by a space, underscore or hyphen, then we could use the str.extract method (expand=False keeps the result a Series so it can be used directly as a grouper):
grouper = df['Sample'].str.extract(r'\A(.*?)(?=[ _-])', expand=False)
groups = {name: group for name, group in df.groupby(grouper)}
We can also avoid creating dictionaries. Let's see how we can iterate over the groups obtained by splitting as an example:
grouper = df['Sample'].str.split('_').str.get(0)
groups = df.groupby(grouper)
for name, dataframe in groups:
    print(f'name: {name}')
    print(dataframe, '\n')
Output:
name: Adirondack
Sol Type Sample N Na2O
1 18 RU Adirondack_asis 126 2.3
2 33 RB Adirondack_brush 129 2.8
3 34 RR Adirondack_RAT 128 2.4
name: Gusev
Sol Type Sample N Na2O
0 14 SU Gusev_Soil 45 2.8
4 41 SU Gusev_Other 76 2.7
The same with rocks. IMO we can do better than APXSData.Type.str.contains('R'):
APXSData['Type'].str[0] == 'R'
APXSData['Type'].str.startswith('R')
Let's separate rocks and group them by the leading name:
is_rock = df['Type'].str.startswith('R')
grouper = df['Sample'].str.split('_').str.get(0)
groups_of_rocks = df[is_rock].groupby(grouper)
for k, v in groups_of_rocks:
    print(k)
    print(v)
Output:
Adirondack
Sol Type Sample N Na2O
1 18 RU Adirondack_asis 126 2.3
2 33 RB Adirondack_brush 129 2.8
3 34 RR Adirondack_RAT 128 2.4
To plot data for some group of interest only, we can use get_group(name):
groups.get_group('Adirondack').plot.bar(x='Sample', y=['N','Na2O'])
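To recreate the question's scatter of SO3 against SiO2 with one legend entry per grouped name, something like the sketch below should work; it assumes the original APXSData frame with 'SiO2' and 'SO3' columns (the dummy frame above does not have them):
import matplotlib.pyplot as plt

rocks = APXSData[APXSData['Type'].str.startswith('R')]
grouper = rocks['Sample'].str.split('_').str.get(0)

fig, ax = plt.subplots()
for name, group in rocks.groupby(grouper):
    ax.scatter(group['SiO2'], group['SO3'], marker='o', label=name)
ax.set_xlabel("SiO$_2$", labelpad=10)
ax.set_ylabel("SO$_3$", labelpad=10)
ax.set_title('Composition of all rocks\nat Gusev Crater')
ax.legend()
plt.show()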
See also:
detail about str in pandas
pandas.Series.str.split
pandas.Series.str.get
pandas.Series.str.extract
regex in python
run help('pandas.core.strings.StringMethods') to see help offline

first attempt at python, error ("IndexError: index 8 is out of bounds for axis 0 with size 8") and efficiency question

Learning Python, just began last week; I haven't otherwise coded for about 20 years and was never that advanced to begin with. I got the hello world thing down. Now I'm trying to backtest FX pairs. Any help up the learning curve is appreciated, and of course I'm scouring this site while on my Lynda vids.
I'm getting a funky error, and I'm also wondering if there are blatantly more efficient ways to loop through columns of Excel data than the way I am.
The spreadsheet being read is simple ... 56 FX pairs down column A, and 8 columns over where the column headers are dates, and the cells in each column are the respective FX pair's closing price on that date. The strategy starts at the top of the 2nd column (so that there is a return % that can be calc'd vs the prior period) and calcs out period/period % returns for each pair, identifying which is the 'maximum value', and then "goes long" that highest performer ... whose performance in the subsequent period is recorded as PnL to the portfolio ("p" in the code). It loops through that until the current, most recent column is read.
The error relates to using 8 columns instead of 7 ... it works when I limit the loop to 7 columns but not 8. When I use 8 I get a wall of text concluding with "IndexError: index 8 is out of bounds for axis 0 with size 8". I get a similar error when I use too many rows, 56 instead of 55; I think I'm missing the bottom row.
Here's my code:
#set up imports
import pandas as pd
#import spreadsheet
x1 = pd.ExcelFile(r"C:\Users\Gamblor\Desktop\Python\test2020.xlsx")
df = pd.read_excel(x1, "Sheet1", header=1)
#define counters for loops
o = 1 # observation counter
c = 3 # column counter
r = 0 # active row counter for sorting through for max
#define identifiers for the portfolio
rpos = 0 # static row, for identifying which currency pair is in column 0 of that row
p = 100 # portfolio size starts at $100
#define the stuff we are evaluating for
pair = df.iat[r,0] # starting pair at 0,0 where each loop will begin
pair_pct_rtn = 0 # starts out at zero, becomes something at first evaluation, then gets compared to each subsequent eval
pair_pct_rtn_calc = 0 # a second version of above, for comparison to prior return
#runs a loop starting at the top to find the max period/period % return in a specific column
while (c < 8): # manually limiting this to 5 columns left to right
    while (r < 55): # i am manually limiting this to 55 data rows per the spreadsheet ... would be better if automatic
        pair_pct_rtn_calc = ((df.iat[r,c])/(df.iat[r,c-1]) - 1)
        if pair_pct_rtn_calc > pair_pct_rtn: # if its a higher return, it must be the "max" to that point
            pair = df.iat[r,0] # identifies the max pair for this column observation, so far
            pair_pct_rtn = pair_pct_rtn_calc # sets pair_pct_rtn as the new max
            rpos = r # identifies the max pair's ROW for this column observation, so far
        r = r + 1 # adds to r in order to jump down and calc the next row
    print('in obs #', o ,', ', pair ,'did best at' ,pair_pct_rtn ,'.')
    o = o + 1
    # now adjust the portfolio by however well USDMXN did in the subsequent week
    p = p * ( 1 + ((df.iat[rpos,c+1])/(df.iat[rpos,c]) - 1))
    print('then the subsequent period it did: ',(df.iat[rpos,c+1])/(df.iat[rpos,c]) - 1)
    print('resulting in portfolio value of', p)
    rpos = 0
    r = 0
    pair_pct_rtn = 0
    c = c + 1 # adds to c in order to move to the next period to the right
print(p)
Since indices are labelled from 0 onwards, an axis of size 8 has valid indices 0 through 7, so the 8th element you are looking for has index 7 and index 8 is out of bounds. Likewise, row index 55 (the 56th row) will be your last row. Note also that the loop reads df.iat[rpos, c+1], so c itself has to stop one column short of the last column index.
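As a follow-up, here is a minimal, hedged sketch of the same backtest with the loop bounds derived from the frame itself instead of the hard-coded 8 and 55, using a toy frame with hypothetical column names:
import pandas as pd

# toy stand-in for the spreadsheet: pairs down the first column, closing prices in the rest
df = pd.DataFrame({
    'pair': ['EURUSD', 'USDMXN', 'GBPUSD'],
    'd1':   [1.10, 19.0, 1.30],
    'd2':   [1.12, 18.5, 1.29],
    'd3':   [1.11, 18.9, 1.31],
})

n_rows, n_cols = df.shape
p = 100.0                                   # starting portfolio value

for c in range(2, n_cols - 1):              # need a prior column (c-1) and a next column (c+1)
    returns = df.iloc[:, c] / df.iloc[:, c - 1] - 1
    rpos = returns.idxmax()                 # row of the best performer this period
    nxt = df.iat[rpos, c + 1] / df.iat[rpos, c] - 1
    p *= 1 + nxt
    print(f"{df.iat[rpos, 0]} did best at {returns[rpos]:.4f}; "
          f"next period {nxt:.4f}; portfolio now {p:.2f}")

print(p)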

Find max halflife values relative to their temperature value in the same array

Basically I load the excel file into a pandas dataframe here:
dv = pd.read_excel('data.xlsx')
Then I clean it up and rename it to "cleaned" which is not important for this reproducible example, just mentioning for clarity:
if (selected_x.title()=="Viscosity" or selected_y.title()=="Viscosity"):
    cleaned = cleaned[cleaned.Study != "Yanqing Wang 2017"]
    cleaned = cleaned[cleaned.Study != "Thakore 2020"]
From there, I separate the cleaned dataframe into separate studies; this project is a composition of literature. I will include an example of two below:
yan = cleaned[cleaned.Study == "Yanqing Wang 2017"]
tha = cleaned[cleaned.Study == "Thakore 2020"]
Finally, I load each of the individual studies into traces, and display them in a graph. Selected y and selected x are strings, such as "Temperature (C) " and "Halflife (Min)":
trace1 = go.Scatter(y=tha[selected_y], x=tha[selected_x])
trace2 = go.Scatter(y=yan[selected_y], x=yan[selected_x])
What I need to do is, after splitting the array into individual studies, find the maximum halflife relative to each temperature (0,50,100,150,200,250,300) and compile them into separate lists, then find the max value of these lists, take the whole row and append them into the same list. I have tried to do this using stuff like:
yan50 = yanq[yanq['Temperature (C) '] == 50]
yan100 = yanq[yanq['Temperature (C) '] == 100]
yan150 = yanq[yanq['Temperature (C) '] == 150]
yan200 = yanq[yanq['Temperature (C) '] == 200]
yan250 = yanq[yanq['Temperature (C) '] == 250]
yan300 = yanq[yanq['Temperature (C) '] == 300]
To split the study into the varying degree lists. I am currently stuck where I have to find the max value in halflife column of each list and add the whole corresponding row into a new list. This is what I am trying:
yan = pd.DataFrame(columns=["Study","Gas","Surfactant","Surfactant Concentration","Additive","Additive Concentration","LiquidPhase","Quality","Pressure (Psi)","Temperature (C) ","Shear Rate (/Sec)","Halflife (Min)","Viscosity","Color"])
if (len(yan50) > 0):
    yan50.loc[yan50['Halflife (Min)'].idxmax()]
    yan50 = yan50.dropna()
    yan.append(yan50)
if (len(yan100) > 0):
    yan100.loc[yan100['Halflife (Min)'].idxmax()]
    yan100 = yan100.dropna()
    yan.append(yan100)
if (len(yan150) > 0):
    yan150.loc[yan150['Halflife (Min)'].idxmax()]
    yan150 = yan150.dropna()
    yan.append(yan150)
if (len(yan200) > 0):
    yan200.loc[yan200['Halflife (Min)'].idxmax()]
    yan200 = yan200.dropna()
    yan.append(yan200)
if (len(yan250) > 0):
    yan250.loc[yan250['Halflife (Min)'].idxmax()]
    yan250 = yan250.dropna()
    yan.append(yan250)
if (len(yan300) > 0):
    yan300.loc[yan300['Halflife (Min)'].idxmax()]
    yan300 = yan300.dropna()
    yan.append(yan300)
The error I am getting is that the individual temperature lists are empty.
I also got a bunch of NaN values in the separate temperature lists I compiled, and I am unsure if I am splitting the list correctly. I am not too strong with Pandas. Recommendations needed!
Link to CSV of data
------------Edit-------------
What I have: all the studies placed on the same temperature points (50, 100, etc.). I want to find the maximum value of halflife so that only the topmost point shows. The reason I am doing this is to aid data visualization. Future plans beyond this topic include connecting the max-value dots with a line and comparing the trends of the separate studies' halflife values.
IIUC, what you need is
df2 = df.groupby(['Study','Temperature (C) '])['Halflife (Min)'].max().reset_index(name='Max_halflife')
This will result in
Study Temperature (C) Max_halflife
0 Thakore 2020 50 120.00
1 Thakore 2020 100 2.40
2 Thakore 2020 150 0.20
3 Yanqing Wang 2017 50 123.00
4 Yanqing Wang 2017 100 3.20
5 Yanqing Wang 2017 150 0.31
Then the code below should get you the graph you want.
import seaborn as sns
import matplotlib.pyplot as plt

df2 = df.groupby(['Study','Temperature (C) '])['Halflife (Min)'].max().reset_index(name='Max_halflife')
fig = plt.figure(figsize=(8, 5))
sns.scatterplot(x='Temperature (C) ', y='Max_halflife', data=df2, hue='Study')
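If the whole row of the best observation is needed (the question mentions appending the entire row, not just the value), a short sketch using idxmax on the same grouping should work, assuming no group is all-NaN in 'Halflife (Min)':
# keep every column of the row with the maximum halflife per study/temperature
best_idx = df.groupby(['Study', 'Temperature (C) '])['Halflife (Min)'].idxmax()
best_rows = df.loc[best_idx].reset_index(drop=True)
print(best_rows)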

How to get multiple drive time between two addresses using arcGIS api

I'm using the ArcGIS API (ArcMap 10.5.1), and I'm trying to get the drive time between two addresses. I can get the drive time between two points, but I don't know how to iterate over multiple points. I have one hundred addresses. I keep getting
AttributeError: 'NoneType' object has no attribute '_tools'
This is the Pandas dataframe I'm working with. I have two columns with indexes. Column 1 is the origin address and column 2 is the second address. If possible, I would love to add a new column with the drive time.
df2
Address_1 Address_2
0 1600 Pennsylvania Ave NW, Washington, DC 20500 2 15th St NW, Washington
1 400 Broad St, Seattle, WA 98109 325 5th Ave N, Seattle
This is the link where I grabbed the code from
https://developers.arcgis.com/python/guide/performing-route-analyses/
I tried hacking this code. Specifically the code below.
def pairwise(iterable):
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

#empty list - will be used to store calculated distances
list = [0]

# Loop through each row in the data frame using pairwise
for (i1, row1), (i2, row2) in pairwise(df.iterrows()):
https://medium.com/how-to-use-google-distance-matrix-api-in-python/how-to-use-google-distance-matrix-api-in-python-ef9cd895303c
I looked up what NoneType means, so I tried printing things out to see if anything would print, and that works fine. I mostly use R and don't use Python much.
for (i, j) in pairwise(df2.iterrows()):
    print(i)
    print(j)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from copy import deepcopy
from datetime import datetime
from IPython.display import HTML
import json
from arcgis.gis import GIS
import arcgis.network as network
import arcgis.geocoding as geocoding
from itertools import tee
user_name = 'username'
password = 'password'
my_gis = GIS('https://www.arcgis.com', user_name, password)
route_service_url = my_gis.properties.helperServices.route.url
route_service = network.RouteLayer(route_service_url, gis=my_gis)
for (i, j) in pairwise(df2.iterrows()):
    stop1_geocoded = geocoding.geocode(i)
    stop2_geocoded = geocoding.geocode(j)
    stops = '{0},{1}; {2},{3}'.format(stop1_geocoded[0]['attributes']['X'],
                                      stop1_geocoded[0]['attributes']['Y'],
                                      stop2_geocoded[0]['attributes']['X'],
                                      stop2_geocoded[0]['attributes']['Y'])
    route_layer = network.RouteLayer(route_service_url, gis=my_gis)
    result = route_layer.solve(stops=stops, return_directions=False, return_routes=True,
                               output_lines='esriNAOutputLineNone', return_barriers=False,
                               return_polygon_barriers=False, return_polyline_barriers=False)
    travel_time = result['routes']['features'][0]['attributes']['Total_TravelTime']
    print("Total travel time is {0:.2f} min".format(travel_time))
The expected output is a list of drive times. I tried appending them all to a dataframe, and that would be ideal. So the ideal output would be 3 columns: address 1, address 2, and drive time. The code does work with one address pair at a time (instead of i, j it's just two addresses as strings and no for statement).
example:
Address_1 Address_2
0 1600 Pennsylvania Ave NW, Washington, DC 20500 2 15th St NW, Washington
1 400 Broad St, Seattle, WA 98109 325 5th Ave N, Seattle
drive_time
0 7 minutes
1 3 minutes
Your use of the pairwise function is unnecessary. Just wrap the ArcGIS code in a function that returns the time; that way you can map the values onto a new column of your dataframe.
Also make sure that you import the time library, which is not noted in the ArcGIS documentation but is needed to run this.
def getTime(row):
    try:
        stop1_geocoded = geocoding.geocode(row.df_column_1)
        stop2_geocoded = geocoding.geocode(row.df_column_2)
        stops = '{0},{1}; {2},{3}'.format(stop1_geocoded[0]['attributes']['X'],
                                          stop1_geocoded[0]['attributes']['Y'],
                                          stop2_geocoded[0]['attributes']['X'],
                                          stop2_geocoded[0]['attributes']['Y'])
        route_layer = network.RouteLayer(route_service_url, gis=my_gis)
        result = route_layer.solve(stops=stops, return_directions=False, return_routes=True,
                                   output_lines='esriNAOutputLineNone', return_barriers=False,
                                   return_polygon_barriers=False, return_polyline_barriers=False)
        travel_time = result['routes']['features'][0]['attributes']['Total_TravelTime']
        time = "Total travel time is {0:.2f} min".format(travel_time)
        return time
    except RuntimeError:
        return

streets['travel_time'] = streets.apply(getTime, axis=1)
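Applied to the frame in the question, the same idea would look roughly like the sketch below; it is a hypothetical adaptation that only swaps in the question's column names and reuses the my_gis and route_service_url objects and the geocoding and solve calls already shown above:
def get_drive_time(row):
    try:
        stop1 = geocoding.geocode(row['Address_1'])
        stop2 = geocoding.geocode(row['Address_2'])
        stops = '{0},{1}; {2},{3}'.format(stop1[0]['attributes']['X'],
                                          stop1[0]['attributes']['Y'],
                                          stop2[0]['attributes']['X'],
                                          stop2[0]['attributes']['Y'])
        route_layer = network.RouteLayer(route_service_url, gis=my_gis)
        result = route_layer.solve(stops=stops, return_directions=False, return_routes=True,
                                   output_lines='esriNAOutputLineNone')
        # travel time in minutes for this address pair
        return result['routes']['features'][0]['attributes']['Total_TravelTime']
    except RuntimeError:
        return None

df2['drive_time'] = df2.apply(get_drive_time, axis=1)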

Drop Dataframe Rows Based on a Similarity Measure Pandas

I want to eliminate repeated rows in my dataframe.
I know that the drop_duplicates() method works for dropping rows with identical subcolumn values. However, I want to drop rows that aren't identical but similar. For example, I have the following two rows:
Title | Area | Price
Apartment at Boston 100 150000
Apt at Boston 105 149000
I want to be able to eliminate these two rows based on some similarity measure, such as whether Title, Area, and Price differ by less than 5%. Say, I could delete rows whose similarity measure > 0.95. This would be particularly useful for large data sets, instead of manually inspecting row by row. How can I achieve this?
Here is a function using difflib. I got the similar function from here. You may also want to check out some of the answers on that page to determine the best similarity metric for your use case.
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'Title':['Apartment at Boston','Apt at Boston'],
                    'Area':[100,105],
                    'Price':[150000,149000]})

def string_ratio(df, col, ratio):
    from difflib import SequenceMatcher

    def similar(a, b):
        return SequenceMatcher(None, a, b).ratio()

    ratios = []
    for i, x in enumerate(df[col]):
        a = np.array([similar(x, row) for row in df[col]])
        a = np.where(a < ratio)[0]
        ratios.append(len(a[a != i]) == 0)
    return pd.Series(ratios)

def numeric_ratio(df, col, ratio):
    ratios = []
    for i, x in enumerate(df[col]):
        a = np.array([min(x, row)/max(x, row) for row in df[col]])
        a = np.where(a < ratio)[0]
        ratios.append(len(a[a != i]) == 0)
    return pd.Series(ratios)

mask = ~((string_ratio(df1,'Title',.95)) & (numeric_ratio(df1,'Area',.95)) & (numeric_ratio(df1,'Price',.95)))
df1[mask]
It should be able to weed out most of the similar data, though you might want to tweak the string_ratio function if it doesn't suit your case.
See if this meets your needs
import pandas as pd

Title = ['Apartment at Boston', 'Apt at Boston', 'Apt at Chicago','Apt at Seattle','Apt at Seattle','Apt at Chicago']
Area = [100, 105, 100, 102,101,101]
Price = [150000, 149000,150200,150300,150000,150000]
data = dict(Title=Title, Area=Area, Price=Price)
df = pd.DataFrame(data, columns=data.keys())
The df created is as below
Title Area Price
0 Apartment at Boston 100 150000
1 Apt at Boston 105 149000
2 Apt at Chicago 100 150200
3 Apt at Seattle 102 150300
4 Apt at Seattle 101 150000
5 Apt at Chicago 101 150000
Now, we run the code as below
from fuzzywuzzy import fuzz

def fuzzy_compare(a, b):
    val = fuzz.partial_ratio(a, b)
    return val

tl = df["Title"].tolist()
itered = 1
i = 0

def do_the_thing(i):
    itered = i + 1
    while itered < len(tl):
        val = fuzzy_compare(tl[i], tl[itered])
        if val > 80:
            if abs((df.loc[i,'Area'])/(df.loc[itered,'Area'])) > 0.94 and abs((df.loc[i,'Area'])/(df.loc[itered,'Area'])) < 1.05:
                if abs((df.loc[i,'Price'])/(df.loc[itered,'Price'])) > 0.94 and abs((df.loc[i,'Price'])/(df.loc[itered,'Price'])) < 1.05:
                    df.drop(itered, inplace=True)
                    df.reset_index()
        itered = itered + 1

while i < len(tl) - 1:
    try:
        do_the_thing(i)
        i = i + 1
    except:
        i = i + 1
The output is the df below. Repeating Boston & Seattle items are removed when the fuzzy match is more than 80 and the values of Area & Price are within 5% of each other.
Title Area Price
0 Apartment at Boston 100 150000
2 Apt at Chicago 100 150200
3 Apt at Seattle 102 150300
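For reference, a more compact sketch of the same idea (not the answer's code): keep a row only if no previously kept row has a similar title and Area/Price within 5%. It assumes the six-row df constructed earlier in this answer; run it on a fresh copy, since the loop above drops rows in place.
from fuzzywuzzy import fuzz

kept = []                                   # indices of rows we decide to keep
for i, row in df.iterrows():
    is_dup = any(
        fuzz.partial_ratio(row['Title'], df.loc[j, 'Title']) > 80
        and 0.95 < row['Area'] / df.loc[j, 'Area'] < 1.05
        and 0.95 < row['Price'] / df.loc[j, 'Price'] < 1.05
        for j in kept
    )
    if not is_dup:
        kept.append(i)

deduped = df.loc[kept]
print(deduped)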
