Efficient Way to Find Partial Duplicates in Dataset - python

I have a large dataset (~20,000 rows) consisting of persons and their info, and am looking for a way to identify potential duplicate persons within this dataset. These duplicates are not necessarily perfect matches since they have been entered manually and some contain typos.
Example:
   LastName    MiddleName  FirstName  DOB
1  Farmer      Berry       Dave       1/1/2004
2  Place       D.          Tom        8/4/2001
3  Famrer      B.          Dave       01/01/2004
4  Ander       Kate        Linda      12/26/1954
5  Place jr.   David       Tom        8/4/2001
...
In this case rows 1 and 3, and rows 2 and 5, would need to be flagged as duplicates. The only solution I have been able to come up with is O(n^2): iterating through the entire dataset for each record, comparing fields for partial matches and flagging rows when the matching criteria are met.
Is there a more elegant solution for this?
Edit: for clarity, it is possible that none of the fields contains an exact match. People are entering all of these individuals manually and very quickly, so there is a lot of opportunity for typos and incorrect information.
   LastName    MiddleName  FirstName  DOB
1  John-adams  T.          Samuel     1/15/2021
2  Jhon-adams  Tom         Sam        10/15/2021
These 2 rows should be flagged as potential duplicates.
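One common way to beat the full O(n^2) scan (a sketch, not a definitive answer, assuming the rapidfuzz library and the column names from the example) is blocking: group records on a cheap key such as the parsed DOB so that only records sharing a key are compared pairwise, then run a fuzzy name comparison inside each block. Since the edit's example shows the DOB itself can be mistyped, a real pass would repeat this with a second blocking key (say, a last-name prefix) and union the results. The candidate_pairs helper and the threshold of 85 are illustrative choices:

from itertools import combinations

import pandas as pd
from rapidfuzz import fuzz

def candidate_pairs(df, threshold=85):
    """Return index pairs that look like duplicates within a DOB block."""
    pairs = []
    # Blocking key: parsing normalizes 1/1/2004 and 01/01/2004 to the same
    # date; rows whose DOB fails to parse are simply skipped by groupby.
    dob = pd.to_datetime(df['DOB'], errors='coerce')
    for _, block in df.groupby(dob):
        for i, j in combinations(block.index, 2):
            a = f"{df.at[i, 'FirstName']} {df.at[i, 'LastName']}"
            b = f"{df.at[j, 'FirstName']} {df.at[j, 'LastName']}"
            # token_sort_ratio tolerates transposed letters and extra
            # tokens such as 'jr.'.
            if fuzz.token_sort_ratio(a, b) >= threshold:
                pairs.append((i, j))
    return pairs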

Related

Pandas: Check values between columns in different dataframes and return list of multiple possible values into a new column

I am trying to compare two columns from two different dataframes and return all possible matches, using Python (kind of an XLOOKUP in Excel, but with multiple possible matches).
Please see the details below for sample dataframes and work I attempted.
An explanation of the datasets below: Mark does not own any cars; however, several are listed under his name, and we know that none of them belong to him. I am attempting to look at dataframe 1 (Marks), compare it against the larger dataset that has all other owners and their cars, dataframe 2 (claimed), and return possible owners for Mark's cars as shown below.
Dataframe 1 : Marks
Marks = pd.DataFrame({'Car Brand': ['Jeep', 'Jeep', 'BMW', 'Volvo'],
                      'Owner Name': ['Mark', 'Mark', 'Mark', 'Mark']})
  Car Brand Owner Name
0      Jeep       Mark
1      Jeep       Mark
2       BMW       Mark
3     Volvo       Mark
Dataframe 2: claimed
claimed = pd.DataFrame({'Car Brand': ['Dodge', 'Jeep', 'BMW', 'Merc', 'Volvo', 'Jeep', 'Volvo'],
                        'Owner Name': ['Chris', 'Frank', 'Rob', 'Kelly', 'John', 'Chris', 'Kelly']})
  Car Brand Owner Name
0     Dodge      Chris
1      Jeep      Frank
2       BMW        Rob
3      Merc      Kelly
4     Volvo       John
5      Jeep      Chris
6     Volvo      Kelly
The data does have duplicate car brand names; HOWEVER, the owner names are unique, meaning that even though Kelly is mentioned twice, she IS THE SAME PERSON. The same goes for Chris, etc.
I want Mark's dataframe to have a new column that looks like this:
  Car Brand Owner Name Possible Owners
0      Jeep       Mark  [Frank, Chris]
1      Jeep       Mark  [Frank, Chris]
2       BMW       Mark             Rob
3     Volvo       Mark   [John, Kelly]
I have tried the code below:
possible_owners = list()
for cars in Marks['Car Brand']:
    for car_brands in claimed['Car Brand']:
        if Marks.loc[Marks['Car Brand'].isin(claimed['Car Brand'])]:
            sub = list()
            sub.append()
            possible_owners.append(sub)
        else:
            not_found = 'No possible Owners Identified'
            possible_owners.append(not_found)
# Then I will add possible_owners as a new column to Marks
#Then I will add possible_owners as a new column to Marks
The error:
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I have also tried a merge and an Excel XLOOKUP (but that has many limitations), and I am stuck trying to understand how to return possible matches, even when there are multiple, and line them up in one row.
Question: how can I compare the two frames, return possible values from the Owner Name column and put these values in a new column in Marks' table?
Excuse my code, I am fairly new to Python.
You could pre-process the claimed dataframe and then merge:
# Map each car brand to the list of owners who claimed it.
lookup = claimed.groupby('Car Brand').apply(lambda x: x['Owner Name'].to_list()).to_frame()
# Left-merge on the brand; the grouped column arrives named 0, so rename it.
df_m = Marks.merge(lookup, on='Car Brand', how='left').rename(columns={0: 'Possible Owners'})
print(df_m)
Result
  Car Brand Owner Name Possible Owners
0      Jeep       Mark  [Frank, Chris]
1      Jeep       Mark  [Frank, Chris]
2       BMW       Mark           [Rob]
3     Volvo       Mark   [John, Kelly]
You can always use a list comprehension with pd.Series.isin to do the work.
# For each of Mark's brands, collect the owners who claimed that brand.
result = [claimed[claimed['Car Brand'].isin([i])]['Owner Name'].to_numpy()
          for i in Marks['Car Brand']]
Marks['Possible Owners'] = result
  Car Brand Owner Name Possible Owners
0      Jeep       Mark  [Frank, Chris]
1      Jeep       Mark  [Frank, Chris]
2       BMW       Mark           [Rob]
3     Volvo       Mark   [John, Kelly]
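A caveat with the merge approach (a hypothetical follow-up, reusing the placeholder string from the question): a brand that appears in Marks but not in claimed comes back as NaN after the left merge, so you may want to fill those in explicitly:

# Replace unmatched brands (NaN after the left merge) with the
# placeholder string from the question.
df_m['Possible Owners'] = df_m['Possible Owners'].apply(
    lambda owners: owners if isinstance(owners, list)
    else 'No possible Owners Identified')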

Merging/concatenating two datasets on a specific column (different lengths) [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 1 year ago.
I have two different datasets
df1
Name    Surname  Age  Address
Julian  Ross     34   Main Street
Mary    Jane     52   Cook Road
len(df1) is 1200.
df2
Name    Country  Telephone
Julian  US       NA
len(df2) is 800.
df1 contains the full list of unique names; df2 contains fewer rows, as many names were not added.
I would like to get a final dataset with the full list of names in df1 (and all the fields that are there) plus the fields in df2. I would then expect a final dataset of length 1200, with some empty fields corresponding to the names missing from df2.
I have tried as follows:
pd.concat([df1.set_index('Name'),df2.set_index('Name')], axis=1, join='inner')
but it returns the length of the smaller dataset (i.e. 800).
I have also tried
df1.merge(df2, how = 'inner', on = ['Name'])
... same result.
I am not totally familiar with the joining/merging/concatenating functions, even after reading the documentation at https://pandas.pydata.org/docs/user_guide/merging.html.
I know that this question will probably be a duplicate of some others, and I will be happy to delete it if necessary, but I would be really grateful if you could provide some help and explain how to get the expected result:
df
Name    Surname  Age  Address      Country  Telephone
Julian  Ross     34   Main Street  US       NA
Mary    Jane     52   Cook Road
IIUC, use pd.merge like below:
>>> df1.merge(df2, how='left', on='Name')
     Name Surname  Age      Address Country Telephone
0  Julian    Ross   34  Main Street      US       NaN
1    Mary    Jane   52    Cook Road     NaN       NaN
If you want to keep the number of rows of df1, you have to use how='left'; this preserves df1's length only when there are no duplicate names in df2.
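If you want pandas to verify that assumption rather than silently multiplying rows, merge accepts a validate argument (a sketch using the question's frames):

# Raises pandas.errors.MergeError if 'Name' is not unique on both sides,
# instead of quietly returning more than 1200 rows.
df = df1.merge(df2, how='left', on='Name', validate='one_to_one')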
Read Pandas Merging 101

dataframe duplicate with conditions?

OK, I have a bit of a humdinger.
I have a dataframe that can be upwards of 120,000 entries.
The frames will be similar to this:
ID  UID   NAME  DATE
1   1234  Bob   02/02/2020
2   1235  Jim   02/04/2020
3   1234  Bob
4   1234  Bob   02/02/2020
5   1236  Jan   20/03/2020
6   1235  Jim
I need to be able to eliminate all duplicates; however, within each group of duplicates, if any row has a date, then one of the rows that does have a date is the one kept, and all the others are removed. If no row in a group has a date, then just keep whichever is easiest.
I am struggling to come up with a way to do this elegantly.
My thought is:
iterate through all entries; for each entry, create a temp DF and place all its duplicates in it, then iterate through THAT df, and if I find a date, save the index and delete each entry that is not that one. But that seems VERY bulky and slow.
Any better suggestions?
Since the blanks are empty strings (''), you can do this:
(sample_df.sort_values(['UID', 'NAME', 'DATE'],
                       ascending=(True, True, False))
          .groupby(['UID', 'NAME'])
          .first()
          .reset_index())
which gives you:
    UID NAME        DATE
0  1234  Bob  02/02/2020
1  1235  Jim  02/04/2020
2  1236  Jan  20/03/2020
Note the ascending flag in sort_values. Pandas sorts strings lexicographically, and the empty string sorts before any non-empty string, so to have non-empty DATE values come before empty ones (i.e. ''), you need to sort that column in descending order.
After sorting, you can simply group each pair of (UID, NAME) and keep the first entry.
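An equivalent route (a sketch, not part of the original answer) is to treat the blanks as real missing values and let drop_duplicates keep the first row per key:

import numpy as np

# Turn empty strings into NaN, push them below the dated rows,
# then keep the first row of each (UID, NAME) pair.
out = (sample_df.replace('', np.nan)
                .sort_values('DATE', na_position='last')
                .drop_duplicates(['UID', 'NAME'])
                .sort_index())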

Python / Pandas / Pulp Optimization Duplicates

I am trying to optimize a grouping/selection of trial members with limited space, and am running into some trouble. I have the pandas data frames ready for optimization and can run the linear optimization with no problem, except for one constraint I need to add. I am using binary variables to select from a large list (but I am not tied to that for any reason, so if a different method would resolve this, I could switch).

I need to minimize combined trial time for selection in the next round of trials, but some subjects already ran multiple trials due to the nature of the experiment. I would like to select the best combination of subjects based on minimizing time, but allow some subjects to be in the list multiple times for the optimization (so I do not have to remove them manually beforehand). For instance:
Name         Trial  ID     Time (ms)  Selected?
Mary Smith   A      11001  33         1
John Doe     A      11002  24         0
James Smith  B      11003  52         0
Stacey Doe   A      11004  21         1
John Doe     B      11002  19         1
Is there some way to allow 2 John Doe entries for the optimization but constrain the output to only one selection of him? Thanks for your time!
If you have a requirement to record all the values you want to remove, you could use the duplicated function, like this
# First sort your dataframe
df.sort_values(['Name', 'Time (ms)'], inplace=True)
# Make a new column of duplicated values based only on name
df['duplicated'] = df.duplicated(subset=['Name'])
# You can then access the duplicates, but still have a log of the rejects
df.query('not duplicated')
#           Name Trial     ID  Time (ms)  Selected?  duplicated
# 2  James Smith     B  11003         52          0       False
# 4     John Doe     B  11002         19          1       False
# 0   Mary Smith     A  11001         33          1       False
# 3   Stacey Doe     A  11004         21          1       False
df.query('duplicated')
#        Name Trial     ID  Time (ms)  Selected?  duplicated
# 1  John Doe     A  11002         24          0        True
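For completeness, inside the LP itself the usual way to express "this person may appear on several rows but can be selected at most once" is one binary per row plus a per-person constraint. A minimal sketch assuming PuLP, a dataframe named df with the columns above, and omitting the rest of the model (the space limit and whatever forces selections to happen):

import pulp

# One binary decision variable per trial row, duplicates included.
x = pulp.LpVariable.dicts('select', df.index, cat='Binary')

prob = pulp.LpProblem('trial_selection', pulp.LpMinimize)
# Objective: combined trial time of the selected rows.
prob += pulp.lpSum(df.loc[i, 'Time (ms)'] * x[i] for i in df.index)
# At most one selected row per person, however many rows they occupy.
for _, grp in df.groupby('Name'):
    prob += pulp.lpSum(x[i] for i in grp.index) <= 1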

Filling in a pandas column based on existing number of strings

I have a pandas data-frame that looks like this:
ID  Hobby     Name
1   Travel    Kevin
2   Photo     Andrew
3   Travel    Kevin
4   Cars      NaN
5   Photo     Andrew
6   Football  NaN
... (1303 rows)
The number of distinct names might be larger than 2 as well. I would like to end up with the entire Name column filled, with the missing rows distributed equally among the names (or one extra for some names when the rows don't divide evenly). I already store the total number of names in a variable; in the case above it's 2. I tried filtering and counting by each name, but I don't know how to do this when the number of names is dynamic.
Expected Dataframe:
ID  Hobby     Name
1   Travel    Kevin
2   Photo     Andrew
3   Travel    Kevin
4   Cars      Kevin
5   Photo     Andrew
6   Football  Andrew
I tried replacing NaN with 0 in the Name column using fillna, filtering the column to get a dataframe containing only the NaN fields, and then using len(df) to count the NaNs; from there I created 2 dataframes, each containing half of the df. But I think this approach is completely wrong, as I do not always have 2 names; there could be 2, 3, 4, etc. (this is given by a dictionary).
Any help highly appreciated
Thanks.
It's difficult to tell, but I think you need ffill:
df['Name'] = df['Name'].ffill()
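If the goal really is an equal split among the existing names rather than copying the previous value down, one way (a sketch under that reading of the question, with df as the frame shown above) is to cycle the distinct names over the missing positions:

from itertools import cycle, islice

# Distribute the missing rows round-robin over the known names,
# so each name ends up with a near-equal share.
names = df['Name'].dropna().unique()
mask = df['Name'].isna()
df.loc[mask, 'Name'] = list(islice(cycle(names), int(mask.sum())))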
