Find similarities using NLP/spaCy - Python

I have a large dataframe that I need to compare against another dataframe in order to correct the id column. I'll illustrate my problem with this simple example.
import spacy
import pandas as pd

nlp = spacy.load('en_core_web_sm', disable=['ner'])
ruler = nlp.add_pipe("entity_ruler", config={"phrase_matcher_attr": "LOWER"})

df = pd.DataFrame({'id': ['nan', 'nan', 'nan'],
                   'description': ['JOHN HAS 25 YEAR OLD LIVES IN At/12',
                                   'STEVE has 50 OLD LIVES IN At.14',
                                   'ALICIE HAS 10 YEAR OLD LIVES IN AT13']})
print(df)

df1 = pd.DataFrame({'id': [1203, 1205, 1045],
                    'description': ['JOHN HAS 25year OLD LIVES IN At 2',
                                    'STEVE has 50year OLD LIVES IN At 14',
                                    'ALICIE HAS 10year OLD LIVES IN At 13']})
print(df1)

age = ["50year", "25year", "10year"]
for a in age:
    ruler.add_patterns([{"label": "age", "pattern": a}])

names = ["JOHN", "STEVE", "ALICIE"]
for n in names:
    ruler.add_patterns([{"label": "name", "pattern": n}])

ref = ["AT 2", "At 13", "At 14"]
for r in ref:
    ruler.add_patterns([{"label": "ref", "pattern": r}])

# example to check the text difference
doc = nlp("JOHN has 25 YEAR OLD LIVES IN At.12 ")
for ent in doc.ents:
    print(ent, ent.label_)
There are small textual differences between df and the reference dataframe df1, as shown in the tables below.
I don't know how to get a 100% similarity match in this case.
I tried spaCy, but I don't know how to handle the differences and fill in the correct id in df.
This is my first dataframe (the one whose ids need to be filled in):
id description
0 nan STEVE has 50 OLD LIVES IN At.14
1 nan JOHN HAS 25 YEAR OLD LIVES IN At/12
2 nan ALICIE HAS 10 YEAR OLD LIVES IN AT15
This is my reference dataframe:
id description
0 1203 STEVEN HAS 25year OLD lives in At 6
1 1205 JOHN HAS 25year OLD LIVES IN At 2
2 1045 ALICIE HAS 50year OLD LIVES IN At 13
3 3045 STEVE HAS 50year OLD LIVES IN At 14
4 3465 ALICIE HAS 10year OLD LIVES IN At 13
My expected output:
id description
0 3045 STEVE has 50 OLD LIVES IN At.14
1 1205 JOHN HAS 25 YEAR OLD LIVES IN At/12
2 3465 ALICIE HAS 10year OLD LIVES IN AT15
NB: the sentences are not in the same order, and the dataframes do not have equal lengths.

If the batch size is very large (and because using fuzzywuzzy is slow), we might be able to construct a KNN index using NMSLIB on some substring ngram embeddings (idea lifted from this article and this follow-up):
import re
import pandas as pd
import nmslib
from sklearn.feature_extraction.text import TfidfVectorizer

def ngrams(description, n=3):
    # lowercase, strip some punctuation, then build character n-grams
    description = description.lower()
    description = re.sub(r'[,-./]|\sBD', r'', description)
    ngrams = zip(*[description[i:] for i in range(n)])
    return [''.join(ngram) for ngram in ngrams]

def build_index(descriptions, vectorizer):
    ref_vectors = vectorizer.fit_transform(descriptions)
    index = nmslib.init(method='hnsw',
                        space='cosinesimil_sparse',
                        data_type=nmslib.DataType.SPARSE_VECTOR)
    index.addDataPointBatch(ref_vectors)
    index.createIndex()
    return index

def search_index(queries, vectorizer, index):
    query_vectors = vectorizer.transform(queries)
    results = index.knnQueryBatch(query_vectors, k=1)
    return [res[0][0] for res in results]

# ref_df = df1, query_df = df
vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams)
index = build_index(ref_df['description'], vectorizer)
results = search_index(query_df['description'], vectorizer, index)

query_ids = [ref_df['id'].iloc[ref_idx] for ref_idx in results]
query_df['id'] = query_ids
print(query_df)
This gives:
id description
0 3045 STEVE has 50 OLD LIVES IN At.14
1 1205 JOHN HAS 25 YEAR OLD LIVES IN At/12
2 3465 ALICIE HAS 10 YEAR OLD LIVES IN AT13
We can do more pre-processing in ngrams, e.g. stop-word removal, handling symbols, etc.

As your strings are "almost" identical, here is a simpler suggestion using the string matching module fuzzywuzzy which, as the name suggests, performs fuzzy string matching.
It offers a number of functions to compute string similarity; you can try out different ones and pick the one that seems to work best. Given your example dataframes...
id description
0 nan STEVE has 50 OLD LIVES IN At.14
1 nan JOHN HAS 25 YEAR OLD LIVES IN At/12
2 nan ALICIE HAS 10 YEAR OLD LIVES IN AT15
id description
0 1203 STEVEN HAS 25year OLD lives in At 6
1 1205 JOHN HAS 25year OLD LIVES IN At 2
2 1045 ALICIE HAS 50year OLD LIVES IN At 13
3 3045 STEVE HAS 50year OLD LIVES IN At 14
4 3465 ALICIE HAS 10year OLD LIVES IN At 13
...even the most basic ratio function seems to give us the correct result.
from fuzzywuzzy import fuzz
import numpy as np
import pandas as pd
fuzzy_ratio = np.vectorize(fuzz.ratio)
dist_matrix = fuzzy_ratio(df.description.values[:, None], df1.description.values)
dist_df = pd.DataFrame(dist_matrix, df.index, df1.index)
Result:
0 1 2 3 4
0 52 59 66 82 63
1 49 82 65 66 62
2 39 58 78 65 81
The row-wise maximum values suggest the following mappings:
'STEVE has 50 OLD LIVES IN At.14', 'STEVE HAS 50year OLD LIVES IN At 14'
'JOHN HAS 25 YEAR OLD LIVES IN At/12', 'JOHN HAS 25year OLD LIVES IN At 2'
'ALICIE HAS 10 YEAR OLD LIVES IN AT15', 'ALICIE HAS 10year OLD LIVES IN At 13'
Note, however, that it's a very close call in the last case, so this is not guaranteed to always be correct. Depending on what your data looks like, you might need more sophisticated heuristics. If all else fails, you might even give vector-based similarity metrics like word mover's distance a try, but it seems like overkill if the strings aren't really all that different.
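If you do want to try word mover's distance, a rough sketch with gensim might look like this (this assumes gensim, a downloadable pretrained vector model, and the POT/pyemd dependency are available; it is only an illustration, not part of the fuzzywuzzy approach above):
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")  # small pretrained word vectors
a = "JOHN HAS 25 YEAR OLD LIVES IN At/12".lower().split()
b = "JOHN HAS 25year OLD LIVES IN At 2".lower().split()
print(vectors.wmdistance(a, b))  # smaller distance = more similar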

Since you're looking for almost-identical strings, spaCy is not really the right tool for this. Word vectors are about meaning, but you're looking for string similarity.
Maybe this is just possible because of your simplified example, but you can normalize your strings by removing stuff that doesn't make a difference. For example,
text = "Alice lives at..."
text = text.replace(" ", "") # remove spaces
text = text.replace("/", "") # remove slashes
text = text.replace("year", "") # remove "year"
text = text.lower()
It seems like in most (all?) of your examples that would make your strings identical. You can then match strings by using their normalized forms as keys for a dictionary, for example.
This approach has an important advantage over the fuzzy matching described in the prior answer. Once you have two candidate strings, using a string distance measure to see if they're close enough is fine, but you really don't want to compute string distances for every pair of entries across both tables. If you normalize strings like I've suggested here, you can find matches without comparing each string with every string in the other table.
If the normalization strategy here doesn't work, look at simhash or other locality sensitive hashing techniques. A simplified version would be to use rare words, like the names in your example data, to create "buckets" of entries. Computing the string similarity of everything in a bucket is somewhat slow, but better than using the whole table.
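A minimal sketch of the dictionary-key idea, using df/df1 from the question (normalize is a made-up helper; this only resolves rows where normalization really makes the strings identical, and leaves the rest as None):
import re

def normalize(text):
    # lowercase, then drop spaces, slashes, dots, hyphens and the word "year"
    text = text.lower()
    text = re.sub(r"[ /.\-]", "", text)
    return text.replace("year", "")

# build a lookup from normalized reference description -> id, then resolve
# each query row without any pairwise comparisons
lookup = {normalize(desc): ref_id for ref_id, desc in zip(df1["id"], df1["description"])}
df["id"] = [lookup.get(normalize(desc)) for desc in df["description"]]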

I think spaCy is not the right tool here. What you need is (1) a regex clean-up and (2) a Jaccard match. Since most of your tokens are supposed to match exactly, a Jaccard match, which measures how many words two sentences have in common, should work well. For the regex part, I would use something like the following:
import re

def text_clean(text):
    # replace everything except letters (and dots) with spaces
    text = re.sub('[^A-Za-z.]', ' ', text)
    text = text.lower()
    return text
Applied to all the strings, the function above will strip out the digits and characters such as '/', replacing them with spaces. If you then apply a Jaccard similarity, you should get good matches.
I have suggested removing the digits because, in one of your examples, /12 turned into 2 and you still wanted a match, which suggests you are mainly concerned with the words rather than the exact digits.
You may not get 100% accuracy using just the Jaccard match. You will not get a perfect Jaccard score for every true match, so you will have to put a cut-off on the Jaccard value above which you consider a pair a match.
You may want to come up with a more complex approach using both spaCy and a Jaccard match on the cleaned strings, then putting custom cut-offs on both match scores and picking your matches.
Also, I noted that in some cases two words occur joined together. Does that only happen with digits, such as At13, or does it happen with two words as well? To use the Jaccard match efficiently you will need to resolve that too, but that is a whole other process and a bit out of scope for this answer.
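Putting the two steps together, here is a minimal sketch of the Jaccard match on the cleaned strings, using text_clean from above and df/df1 from the question (the 0.8 cut-off is only a placeholder to tune against your data):
def jaccard(a, b):
    # word-level Jaccard similarity of two cleaned strings
    set_a, set_b = set(a.split()), set(b.split())
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

# for each row of df, pick the df1 row with the highest score and only
# accept it above the cut-off
for q in df["description"]:
    scores = [(jaccard(text_clean(q), text_clean(r)), ref_id)
              for ref_id, r in zip(df1["id"], df1["description"])]
    score, ref_id = max(scores)
    print(q, "->", ref_id if score >= 0.8 else None)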

Related

Removing rows from a DataFrame based on words in a string

Novice programmer here seeking help.
I have a Dataframe that looks like this:
Current
0 "Invest in $APPL, $FB and $AMZN"
1 "Long $AAPL, Short $AMZN"
2 "$AAPL earnings announcement soon"
3 "$FB is releasing a new product. Will $FB's product be good?"
4 "$Fb doing good today"
5 "$AMZN high today. Will $amzn continue like this?"
I also have a list with all the cashtags: cashtags = ["$AAPL", "$FB", "$AMZN"]
Basically, I want to go through all the lines in this column of the DataFrame, keep the rows that mention exactly one cashtag (regardless of whether it is in caps or not), and delete all others.
Desired Output:
Desired
2 "$AAPL earnings announcement soon"
3 "$FB is releasing a new product. Will $FB's product be good?"
4 "$Fb doing good today"
5 "$AMZN high today. Will $amzn continue like this?"
I've tried to basically count how many times the word appears in the string and add that value to a new column so that I can delete the rows based on the number.
for i in range(0, len(df)-1):
    print(i, end="\r")
    tweet = df["Current"][i]
    count = 0
    for word in cashtags:
        count += str(tweet).count(word)
    df["Word_count"][i] = count
However, if I do this I will be deleting rows that I don't want to delete, for example rows where the unique cashtag is mentioned several times ([3], [5]).
How can I achieve my desired output?
Rather than summing the count of each cashtag, you should sum its presence or absence, since you don't care how many times each cashtag occurs, only how many cashtags.
for tag in cashtags:
    count += tag in tweet
Or more succinctly: sum(tag in tweet for tag in cashtags)
To make the comparison case insensitive, you can upper case the tweets beforehand. Additionally, it would be more idiomatic to filter on a temporary series and avoid explicitly looping over the dataframe (though you may need to read up more about Pandas to understand how this works):
df[df.Current.apply(lambda tweet: sum(tag in tweet.upper() for tag in cashtags)) == 1]
If you ever want to generalise your question to any tag, then this is a good place for a regular expression.
You want to match against (\$\w+)(?!.*\1); see e.g. here for a detailed explanation, but the general structure is:
\$\w+: find a dollar sign followed by one or more letters/numbers (or
an _). If you just wanted to count how many tags you had, this is all you would need,
e.g.
df.Current.str.count(r'\$\w+')
will print
0 3
1 2
2 1
3 2
4 1
5 2
but this counts repeated occurrences of the same tag separately, so you need to add a negative lookahead so that each distinct tag is counted only once.
(?!.*\1) is a negative lookahead: don't match if the same captured tag appears again later in the string. This means only the last occurrence of each tag is counted.
Using this, you can then use the pandas Series.str methods, specifically Series.str.count (the re.I makes the match case insensitive):
import re
df[df.Current.str.count(r'(\$\w+)(?!.*\1)', re.I) == 1]
which will give you your desired output
Current
2 $AAPL earnings announcement soon
3 $FB is releasing a new product. Will $FB's pro...
4 $Fb doing good today
5 $AMZN high today. Will $amzn continue like this?

How to remove duplicated words in csv rows in python?

I am working with a csv file in which many rows contain duplicated words, and I want to remove the duplicates (without losing the order of the sentences).
csv file example (userID and description are the columns name):
userID, description
12, hello world hello world
13, I will keep the 2000 followers same I will keep the 2000 followers same
14, I paid $2000 to the car I paid $2000 to the car I paid $2000 to the car
.
.
I would like to have the output as:
userID, description
12, hello world
13, I will keep the 2000 followers same
14, I paid $2000 to the car
.
.
I already tried posts such as 1, 2 and 3, but none of them fixed my problem or changed anything. (The order in my output file matters, since I don't want to lose it.) It would be great if you could provide a code sample that I can run on my side and learn from.
Thank you
[I am using Python 3.7]
To remove duplicates, I'd suggest a solution involving the OrderedDict data structure:
from collections import OrderedDict

df['description'] = (df['description'].str.split()
                     .apply(lambda words: list(OrderedDict.fromkeys(words)))
                     .str.join(' '))
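As a side note, the question mentions Python 3.7; from 3.7 on a plain dict preserves insertion order as well, so a hedged alternative without the extra import is:
df['description'] = (df['description'].str.split()
                     .apply(lambda words: ' '.join(dict.fromkeys(words))))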
The code below works for me:
a = pd.Series(["hello world hello world",
"I will keep the 2000 followers same I will keep the 2000 followers same",
"I paid $2000 to the car I paid $2000 to the car I paid $2000 to the car"])
a.apply(lambda x: " ".join([w for i, w in enumerate(x.split()) if x.split().index(w) == i]))
Basically, the idea is to keep each word only if its position is the first occurrence in the list (split from the string on spaces). That means that if the word occurs a second (or later) time, the .index() call returns an index smaller than the position of the current occurrence, and the word is eliminated.
This will give you:
0 hello world
1 I will keep the 2000 followers same
2 I paid $2000 to the car
dtype: object
Solution taken from here:
def principal_period(s):
    i = (s + s).find(s, 1)
    return s[:i]

df['description'].apply(principal_period)
Output:
0 hello world
1 I will keep the 2000 followers the same
2 I paid $2000 to the car
Name: description, dtype: object
Since this uses apply on string, it might be slow.
Answer taken from How can I tell if a string repeats itself in Python?
import pandas as pd

def principal_period(s):
    s += ' '
    i = (s + s).find(s, 1, -1)
    return None if i == -1 else s[:i]

df = pd.read_csv(r'path\to\filename_in.csv')
df['description'] = df['description'].apply(principal_period)
df.to_csv(r'output\path\filename_out.csv')
Explanation:
I have added a space at the end to account for the fact that the repeating strings are delimited by a space. The function then looks for the second occurrence of the string (excluding the first and last characters, to avoid matching at position 0 and at the very end when there is no repetition) within the string added to itself. This efficiently finds the position where the second occurrence starts, i.e. where the first, shortest repeating unit ends. That repeating unit is then returned.

python how to dynamically find a persons name in a string

I'm working on a project where I have to use speech-to-text as input to determine who to call. However, the speech-to-text can give some unexpected results, so I want a bit of dynamic matching of the strings. I'm starting small and trying to match a single name. My name is Nick Vaes, and I try to match my name against the spoken text, but I also want it to match when the text is, for example, "Nik" or something similar. Ideally I would like something that matches whenever only one letter is wrong, so
Nick
ick
nik
nic
nck
would all match my name. The current simple code I have is:
def user_to_call(s):
    redirect = None
    if "NICK" in s.upper() or "NIK" in s.upper():
        redirect = "Nick"
    if redirect:
        return redirect
For a 4-letter name it's possible to put all the possibilities in the filter, but for names with 12 letters that is overkill, since I'm pretty sure it can be done far more efficiently.
You need to use the Levenshtein distance.
A Python implementation is available in NLTK:
import nltk
nltk.edit_distance("humpty", "dumpty")
What you basically need is fuzzy string matching, see:
https://en.wikipedia.org/wiki/Approximate_string_matching
https://www.datacamp.com/community/tutorials/fuzzy-string-python
Based on that, you can check how similar the input is to the entries in your dictionary:
from fuzzywuzzy import fuzz

name = "nick"
tomatch = ["Nick", "ick", "nik", "nic", "nck", "nickey", "njick", "nickk", "nickn"]
for candidate in tomatch:
    ratio = fuzz.ratio(candidate.lower(), name.lower())
    print(ratio)
This code will produce the following output:
100
86
86
86
86
80
89
89
89
You will have to experiment with different ratio cut-offs and check which one suits your requirement of missing only one letter.
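For example, a minimal sketch of such a cut-off check (the value 85 is only an assumption and should be tuned on real transcripts):
from fuzzywuzzy import fuzz

def matches_name(candidate, name="nick", cutoff=85):
    # accept anything whose similarity ratio clears the cut-off
    return fuzz.ratio(candidate.lower(), name.lower()) >= cutoff

print(matches_name("nik"))     # True  (ratio 86, see above)
print(matches_name("nickey"))  # False (ratio 80, see above)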
From what I understand, you are not looking at any fuzzy matching. (Because you did not upvote other responses).
If you are just trying to evaluate what you specified in your request, here is the code. I have put some additional conditions where I printed the appropriate message. Feel free to remove them.
def wordmatch(baseword, wordtoMatch, lengthOfMatch):
    lis_of_baseword = list(baseword.lower())
    lis_of_wordtoMatch = list(wordtoMatch.lower())
    sum = 0
    for index_i, i in enumerate(lis_of_wordtoMatch):
        for index_j, j in enumerate(lis_of_baseword):
            if i in lis_of_baseword:
                if i == j and index_i <= index_j:
                    sum = sum + 1
                    break
                else:
                    pass
            else:
                print("word to match has characters which are not in baseword")
                return 0
    if sum >= lengthOfMatch and len(wordtoMatch) <= len(baseword):
        return 1
    elif sum >= lengthOfMatch and len(wordtoMatch) > len(baseword):
        print("word to match has no of characters more than that of baseword")
        return 0
    else:
        return 0

base = "Nick"
tomatch = ["Nick", "ick", "nik", "nic", "nck", "nickey", "njick", "nickk", "nickn"]
wordlength_match = 3  # how many characters of the base word must match; 3 in your case
for t_word in tomatch:
    print(wordmatch(base, t_word, wordlength_match))
the output looks like this
1
1
1
1
1
word to match has characters which are not in baseword
0
word to match has characters which are not in baseword
0
word to match has no of characters more than that of baseword
0
word to match has no of characters more than that of baseword
0
Let me know if this served your purpose.

Processing malformed text data with machine learning or NLP

I'm trying to extract data from a few large textfiles containing entries about people. The problem is, though, I cannot control the way the data comes to me.
It is usually in a format like this:
LASTNAME, Firstname Middlename (Maybe a Nickname)Why is this text hereJanuary, 25, 2012
Firstname Lastname 2001 Some text that I don't care about
Lastname, Firstname blah blah ... January 25, 2012 ...
Currently, I am using a huge regex that splits apart all CamelCase-joined words, all words that have a month name tacked onto the end, and a lot of special cases for names. Then I use more regex to extract a lot of combinations for the name and date.
This seems sub-optimal.
Are there any machine-learning libraries for Python that can parse malformed data that is somewhat structured?
I've tried NLTK, but it could not handle my dirty data. I'm tinkering with Orange right now and I like its OOP style, but I'm not sure if I'm wasting my time.
Ideally, I'd like to do something like this to train a parser (with many input/output pairs):
training_data = (
    'LASTNAME, Firstname Middlename (Maybe a Nickname)FooBarJanuary 25, 2012',
    ['LASTNAME', 'Firstname', 'Middlename', 'Maybe a Nickname', 'January 25, 2012']
)
Is something like this possible or am I overestimating machine learning? Any suggestions will be appreciated, as I'd like to learn more about this topic.
I ended up implementing a somewhat-complicated series of exhaustive regexes that encompassed every possible use case using text-based "filters" that were substituted with the appropriate regexes when the parser loaded.
If anyone's interested in the code, I'll edit it into this answer.
Here's basically what I used. To construct the regular expressions out of my "language", I had to make replacement classes:
class Replacer(object):
    def __call__(self, match):
        group = match.group(0)
        if group[1:].lower().endswith('_nm'):
            return '(?:' + Matcher(group).regex[1:]
        else:
            return '(?P<' + group[1:] + '>' + Matcher(group).regex[1:]
Then, I made a generic Matcher class, which constructed a regex for a particular pattern given the pattern name:
class Matcher(object):
    name_component = r"([A-Z][A-Za-z|'|\-]+|[A-Z][a-z]{2,})"
    name_component_upper = r"([A-Z][A-Z|'|\-]+|[A-Z]{2,})"
    year = r'(1[89][0-9]{2}|20[0-9]{2})'
    year_upper = year
    age = r'([1-9][0-9]|1[01][0-9])'
    age_upper = age
    ordinal = r'([1-9][0-9]|1[01][0-9])\s*(?:th|rd|nd|st|TH|RD|ND|ST)'
    ordinal_upper = ordinal
    # months and months_short (month-name lists) are defined elsewhere in the full source
    date = r'((?:{0})\.? [0-9]{{1,2}}(?:th|rd|nd|st|TH|RD|ND|ST)?,? \d{{2,4}}|[0-9]{{1,2}} (?:{0}),? \d{{2,4}}|[0-9]{{1,2}}[\-/\.][0-9]{{1,2}}[\-/\.][0-9]{{2,4}})'.format('|'.join(months + months_short) + '|' + '|'.join(months + months_short).upper())
    date_upper = date

    matchers = [
        'name_component',
        'year',
        'age',
        'ordinal',
        'date',
    ]

    def __init__(self, match=''):
        capitalized = '_upper' if match.isupper() else ''
        match = match.lower()[1:]
        if match.endswith('_instant'):
            match = match[:-8]
        if match in self.matchers:
            self.regex = getattr(self, match + capitalized)
        elif len(match) == 1:
            pass  # (handling of single-character names omitted from this excerpt)
        elif 'year' in match:
            self.regex = getattr(self, 'year')
        else:
            self.regex = getattr(self, 'name_component' + capitalized)
Finally, there's the generic Pattern object:
class Pattern(object):
    def __init__(self, text='', escape=None):
        self.text = text
        self.matchers = []
        escape = not self.text.startswith('!') if escape is None else False
        if escape:
            self.regex = re.sub(r'([\[\].?+\-()\^\\])', r'\\\1', self.text)
        else:
            self.regex = self.text[1:]
        self.size = len(re.findall(r'(\$[A-Za-z0-9\-_]+)', self.regex))
        self.regex = re.sub(r'(\$[A-Za-z0-9\-_]+)', Replacer(), self.regex)
        self.regex = re.sub(r'\s+', r'\\s+', self.regex)

    def search(self, text):
        return re.search(self.regex, text)

    def findall(self, text, max_depth=1.0):
        results = []
        length = float(len(text))
        for result in re.finditer(self.regex, text):
            if result.start() / length < max_depth:
                results.extend(result.groups())
        return results

    def match(self, text):
        result = map(lambda x: (x.groupdict(), x.start()), re.finditer(self.regex, text))
        if result:
            return result
        else:
            return []
It got pretty complicated, but it worked. I'm not going to post all of the source code, but this should get someone started. In the end, it converted a file like this:
$LASTNAME, $FirstName $I. said on $date
Into a compiled regex with named capturing groups.
I have a similar problem, mainly caused by exporting data from Microsoft Office 2010: the result is a join between two consecutive words at somewhat regular intervals. The domain area is morphological operations, like a spell-checker. You can jump to a machine learning solution or create a heuristic solution like I did.
The easy solution is to assume that the newly-formed word is a combination of proper names (with the first character capitalized).
The second, additional solution is to have a dictionary of valid words and to try a set of partition locations which generate two (or at least one) valid words. Another problem may arise when one of them is a proper name, which by definition is out of vocabulary for that dictionary. Perhaps one way is to use word-length statistics, which can help identify whether a word is a mistakenly-formed word or actually a legitimate one.
In my case, this is part of a manual correction of a large corpus of text (human-in-the-loop verification), but the only thing that can be automated is the selection of probably-malformed words and a recommended correction.
Regarding the concatenated words, you can split them using a tokenizer:
The OpenNLP Tokenizers segment an input character sequence into tokens. Tokens are usually words, punctuation, numbers, etc.
For example:
Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
is tokenized into:
Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
OpenNLP has a "learnable tokenizer" that you can train. If that doesn't work, you can try the answers to: Detect most likely words from text without spaces / combined words.
When splitting is done, you can eliminate the punctuation and pass it to a NER system such as CoreNLP:
Johnson John Doe Maybe a Nickname Why is this text here January 25 2012
which outputs:
Tokens
Id Word Lemma Char begin Char end POS NER Normalized NER
1 Johnson Johnson 0 7 NNP PERSON
2 John John 8 12 NNP PERSON
3 Doe Doe 13 16 NNP PERSON
4 Maybe maybe 17 22 RB O
5 a a 23 24 DT O
6 Nickname nickname 25 33 NN MISC
7 Why why 34 37 WRB MISC
8 is be 38 40 VBZ O
9 this this 41 45 DT O
10 text text 46 50 NN O
11 here here 51 55 RB O
12 January January 56 63 NNP DATE 2012-01-25
13 25 25 64 66 CD DATE 2012-01-25
14 2012 2012 67 71 CD DATE 2012-01-25
One part of your problem: "all words that have a month name tacked onto the end,"
If, as appears to be the case, you have a date in the format Monthname 1-or-2-digit-day-number, yyyy at the end of the string, you should use a regex to munch that off first. You then have a much simpler job on the remainder of the input string.
Note: Otherwise you could run into problems with given names which are also month names e.g. April, May, June, August. Also March is a surname which could be used as a "middle name" e.g. SMITH, John March.
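A hedged sketch of that first pass (the month alternation and the tolerated punctuation are assumptions based on the examples above):
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
# a date glued to the end of the line: month name, optional comma, 1-2 digit day, 4 digit year
DATE_AT_END = re.compile(r"(?P<date>(?:%s),?\s*\d{1,2},?\s*\d{4})\s*$" % MONTHS)

line = "LASTNAME, Firstname Middlename (Maybe a Nickname)Why is this text hereJanuary, 25, 2012"
m = DATE_AT_END.search(line)
if m:
    date = m.group("date")   # 'January, 25, 2012'
    rest = line[:m.start()]  # the now much simpler remainder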
Your use of the "last/first/middle" terminology is "interesting". There are potential problems if your data includes non-Anglo names like these:
Mao Zedong aka Mao Ze Dong aka Mao Tse Tung
Sima Qian aka Ssu-ma Ch'ien
Saddam Hussein Abd al-Majid al-Tikriti
Noda Yoshihiko
Kossuth Lajos
José Luis Rodríguez Zapatero
Pedro Manuel Mamede Passos Coelho
Sukarno
A few pointers, to get you started:
for date parsing, you could start with a couple of regexes, and then you could use chronic or jChronic
for names, these OpenNlp models should work
As for training a machine learning model yourself, this is not so straightforward, especially regarding training data (work effort)...

Figure out if a business name is very similar to another one - Python

I'm working with a large database of businesses.
I'd like to be able to compare two business names for similarity to see if they possibly might be duplicates.
Below is a list of business names that should test as having a high probability of being duplicates. What is a good way to go about this?
George Washington Middle Schl
George Washington School
Santa Fe East Inc
Santa Fe East
Chop't Creative Salad Co
Chop't Creative Salad Company
Manny and Olga's Pizza
Manny's & Olga's Pizza
Ray's Hell Burger Too
Ray's Hell Burgers
El Sol
El Sol de America
Olney Theatre Center for the Arts
Olney Theatre
21 M Lounge
21M Lounge
Holiday Inn Hotel Washington
Holiday Inn Washington-Georgetown
Residence Inn Washington,DC/Dupont Circle
Residence Inn Marriott Dupont Circle
Jimmy John's Gourmet Sandwiches
Jimmy John's
Omni Shoreham Hotel at Washington D.C.
Omni Shoreham Hotel
I've recently done a similar task, although I was matching new data to existing names in a database, rather than looking for duplicates within one set. Name matching is actually a well-studied task, with a number of factors beyond what you'd consider for matching generic strings.
First, I'd recommend taking a look at a paper, How to play the “Names Game”: Patent retrieval comparing different heuristics by Raffo and Lhuillery. The published version is here, and a PDF is freely available here. The authors provide a nice summary, comparing a number of different matching strategies. They consider three stages, which they call parsing, matching, and filtering.
Parsing consists of applying various cleaning techniques. Some examples:
Standardizing lettercase (e.g., all lowercase)
Standardizing punctuation (e.g., commas must be followed by spaces)
Standardizing whitespace (e.g., converting all runs of whitespace to single spaces)
Standardizing accented and special characters (e.g., converting accented letters to ASCII equivalents)
Standardizing legal control terms (e.g., converting "Co." to "Company")
In my case, I folded all letters to lowercase, replaced all punctuation with whitespace, replaced accented characters by unaccented counterparts, removed all other special characters, and removed legal control terms from the beginning and ends of the names following a list.
Matching is the comparison of the parsed names. This could be simple string matching, edit distance, Soundex or Metaphone, comparison of the sets of words making up the names, or comparison of sets of letters or n-grams (letter sequences of length n). The n-gram approach is actually quite nice for names, as it ignores word order, helping a lot with things like "department of examples" vs. "examples department". In fact, comparing bigrams (2-grams, character pairs) using something simple like the Jaccard index is very effective. In contrast to several other suggestions, Levenshtein distance is one of the poorer approaches when it comes to name matching.
In my case, I did the matching in two steps, first with comparing the parsed names for equality and then using the Jaccard index for the sets of bigrams on the remaining. Rather than actually calculating all the Jaccard index values for all pairs of names, I first put a bound on the maximum possible value for the Jaccard index for two sets of given size, and only computed the Jaccard index if that upper bound was high enough to potentially be useful. Most of the name pairs were still dissimilar enough that they weren't matches, but it dramatically reduced the number of comparisons made.
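A rough sketch of that bigram/Jaccard comparison, including the size-based upper bound (the 0.8 threshold is a placeholder):
def bigrams(name):
    # character bigrams of an already-parsed (cleaned) name
    return {name[i:i + 2] for i in range(len(name) - 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

def maybe_match(name1, name2, threshold=0.8):
    g1, g2 = bigrams(name1), bigrams(name2)
    if not g1 or not g2:
        return False
    # the Jaccard index of two sets can never exceed min(size)/max(size),
    # so skip the full computation when that bound is below the threshold
    if min(len(g1), len(g2)) / max(len(g1), len(g2)) < threshold:
        return False
    return jaccard(g1, g2) >= threshold

print(maybe_match("chopt creative salad co", "chopt creative salad company"))  # True (about 0.81)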
Filtering is the use of auxiliary data to reject false positives from the parsing and matching stages. A simple version would be to see if matching names correspond to businesses in different cities, and thus different businesses. That example could be applied before matching, as a kind of pre-filtering. More complicated or time-consuming checks might be applied afterwards.
I didn't do much filtering. I checked the countries for the firms to see if they were the same, and that was it. There weren't really that many possibilities in the data, some time constraints ruled out any extensive search for additional data to augment the filtering, and there was a manual checking planned, anyway.
I'd like to add some examples to the excellent accepted answer. Tested in Python 2.7.
Parsing
Let's use this odd name as an example.
name = "THE | big,- Pharma: LLC" # example of a company name
We can start with removing legal control terms (here LLC). To do that, there is an awesome cleanco Python library, which does exactly that:
from cleanco import cleanco
name = cleanco(name).clean_name() # 'THE | big,- Pharma'
Remove all punctuation:
name = name.translate(None, string.punctuation) # 'THE big Pharma'
(for unicode strings, the following code works instead (source, regex):
import regex
name = regex.sub(ur"[[:punct:]]+", "", name) # u'THE big Pharma'
Split the name into tokens using NLTK:
import nltk
tokens = nltk.word_tokenize(name) # ['THE', 'big', 'Pharma']
Lowercase all tokens:
tokens = [t.lower() for t in tokens] # ['the', 'big', 'pharma']
Remove stop words. Note that this might cause problems with companies like On Mars, which would be incorrectly matched to Mars, because On is a stopword.
from nltk.corpus import stopwords
tokens = [t for t in tokens if t not in stopwords.words('english')] # ['big', 'pharma']
I don't cover accented and special characters here (improvements welcome).
Matching
Now, when we have mapped all company names to tokens, we want to find the matching pairs. Arguably, Jaccard (or Jaro-Winkler) similarity is better than Levenshtein for this task, but it is still not good enough. The reason is that it does not take into account the importance of words in the name (the way TF-IDF does), so common words like "Company" influence the score just as much as words that might uniquely identify a company name.
To improve on that, you can use a name similarity trick suggested in this awesome series of posts (not mine). Here is a code example from it:
# token2frequency is just a word counter of all words in all names
# in the dataset
def sequence_uniqueness(seq, token2frequency):
    return sum(1/token2frequency(t)**0.5 for t in seq)

def name_similarity(a, b, token2frequency):
    a_tokens = set(a.split())
    b_tokens = set(b.split())
    a_uniq = sequence_uniqueness(a_tokens)
    b_uniq = sequence_uniqueness(b_tokens)
    return sequence_uniqueness(a.intersection(b))/(a_uniq * b_uniq) ** 0.5
Using that, you can match names with similarity exceeding certain threshold. As a more complex approach, you can also take several scores (say, this uniqueness score, Jaccard and Jaro-Winkler) and train a binary classification model using some labeled data, which will, given a number of scores, output if the candidate pair is a match or not. More on this can be found in the same blog post.
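For illustration, a hedged sketch of that classifier idea with scikit-learn; the score vectors and labels below are made up:
from sklearn.linear_model import LogisticRegression

# each row: [uniqueness_score, jaccard_score, jaro_winkler_score] for a candidate pair,
# each label: 1 if the pair was hand-labelled as the same company, else 0
X = [[0.91, 0.85, 0.93],
     [0.15, 0.10, 0.55],
     [0.80, 0.70, 0.88],
     [0.20, 0.25, 0.60]]
y = [1, 0, 1, 0]

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[0.70, 0.60, 0.80]])[0][1])  # probability the pair is a match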
You could use the Levenshtein distance, which measures the difference between two sequences (it is basically an edit distance).
Levenshtein Distance in Python
def levenshtein_distance(a, b):
    n, m = len(a), len(b)
    if n > m:
        # Make sure n <= m, to use O(min(n,m)) space
        a, b = b, a
        n, m = m, n
    current = list(range(n + 1))
    for i in range(1, m + 1):
        previous, current = current, [i] + [0] * n
        for j in range(1, n + 1):
            add, delete = previous[j] + 1, current[j - 1] + 1
            change = previous[j - 1]
            if a[j - 1] != b[i - 1]:
                change = change + 1
            current[j] = min(add, delete, change)
    return current[n]

if __name__ == "__main__":
    from sys import argv
    print(levenshtein_distance(argv[1], argv[2]))
There is a great library for searching for similar/fuzzy strings in Python: fuzzywuzzy. It's a nice wrapper library on top of the Levenshtein distance measure mentioned above.
Here is how your names could be analysed:
#!/usr/bin/env python
from fuzzywuzzy import fuzz
names = [
("George Washington Middle Schl",
"George Washington School"),
("Santa Fe East Inc",
"Santa Fe East"),
("Chop't Creative Salad Co",
"Chop't Creative Salad Company"),
("Manny and Olga's Pizza",
"Manny's & Olga's Pizza"),
("Ray's Hell Burger Too",
"Ray's Hell Burgers"),
("El Sol",
"El Sol de America"),
("Olney Theatre Center for the Arts",
"Olney Theatre"),
("21 M Lounge",
"21M Lounge"),
("Holiday Inn Hotel Washington",
"Holiday Inn Washington-Georgetown"),
("Residence Inn Washington,DC/Dupont Circle",
"Residence Inn Marriott Dupont Circle"),
("Jimmy John's Gourmet Sandwiches",
"Jimmy John's"),
("Omni Shoreham Hotel at Washington D.C.",
"Omni Shoreham Hotel"),
]
if __name__ == '__main__':
    for pair in names:
        print("{:>3} :: {}".format(fuzz.partial_ratio(*pair), pair))
>>> 79 :: ('George Washington Middle Schl', 'George Washington School')
>>> 100 :: ('Santa Fe East Inc', 'Santa Fe East')
>>> 100 :: ("Chop't Creative Salad Co", "Chop't Creative Salad Company")
>>> 86 :: ("Manny and Olga's Pizza", "Manny's & Olga's Pizza")
>>> 94 :: ("Ray's Hell Burger Too", "Ray's Hell Burgers")
>>> 100 :: ('El Sol', 'El Sol de America')
>>> 100 :: ('Olney Theatre Center for the Arts', 'Olney Theatre')
>>> 90 :: ('21 M Lounge', '21M Lounge')
>>> 79 :: ('Holiday Inn Hotel Washington', 'Holiday Inn Washington-Georgetown')
>>> 69 :: ('Residence Inn Washington,DC/Dupont Circle', 'Residence Inn Marriott Dupont Circle')
>>> 100 :: ("Jimmy John's Gourmet Sandwiches", "Jimmy John's")
>>> 100 :: ('Omni Shoreham Hotel at Washington D.C.', 'Omni Shoreham Hotel')
Another way of solving this kind of problem could be Elasticsearch, which also supports fuzzy searches.
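For instance, a fuzzy match query body might look roughly like this (the index and field names are assumptions, and this is independent of any particular client library):
# "fuzziness": "AUTO" lets Elasticsearch pick an allowed edit distance per term length
query_body = {
    "query": {
        "match": {
            "name": {
                "query": "Chopt Creative Salad Co",
                "fuzziness": "AUTO"
            }
        }
    }
}
# e.g. es.search(index="businesses", body=query_body) with the official Python client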
I searched for "python edit distance" and this library came as the first result: http://www.mindrot.org/projects/py-editdist/
Another Python library that does the same job is here: http://pypi.python.org/pypi/python-Levenshtein/
An edit distance represents the amount of work you need to carry out to convert one string to another by following only simple, usually character-based, edit operations. Every operation (substitution, deletion, insertion; sometimes transposition) has an associated cost, and the minimum edit distance between two strings is a measure of how dissimilar the two are.
In your particular case you may want to order the strings so that you find the distance to go from the longer to the shorter, and penalize character deletions less (because I see that in many cases one of the strings is almost a substring of the other). So deletions shouldn't be penalized a lot.
You could also make use of this sample code: http://norvig.com/spell-correct.html
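Following on from the deletion-cost idea above, a minimal sketch of an edit distance with a cheaper deletion cost (the cost values are placeholders to tune):
def weighted_edit_distance(source, target, del_cost=0.5, ins_cost=1.0, sub_cost=1.0):
    # standard dynamic-programming edit distance, but deleting characters from
    # `source` is penalized less than inserting or substituting
    n, m = len(source), len(target)
    prev = [j * ins_cost for j in range(m + 1)]
    for i in range(1, n + 1):
        curr = [i * del_cost] + [0.0] * m
        for j in range(1, m + 1):
            curr[j] = min(prev[j] + del_cost,      # delete source[i-1]
                          curr[j - 1] + ins_cost,  # insert target[j-1]
                          prev[j - 1] + (0.0 if source[i - 1] == target[j - 1] else sub_cost))
        prev = curr
    return prev[m]

print(weighted_edit_distance("Jimmy John's Gourmet Sandwiches", "Jimmy John's"))  # 9.5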
This is a bit of an update to Dennis's answer. That answer was really helpful, as were the links he posted, but I couldn't get them to work right off. After trying the fuzzywuzzy search, I found this gave me a much better set of answers. I have a large list of merchants and I just want to group them together. Eventually I'll have a table I can use to try some machine learning to play around with, but for now this takes a lot of the effort out of it.
I only had to update his code a little bit and add a function to create the token2frequency dictionary. The original article didn't have that either, and then the functions didn't reference it correctly.
import pandas as pd
from collections import Counter
from cleanco import cleanco
import regex
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

# token2frequency is just a Counter of all words in all names
# in the dataset
def sequence_uniqueness(seq, token2frequency):
    return sum(1/token2frequency[t]**0.5 for t in seq)

def name_similarity(a, b, token2frequency):
    a_tokens = set(a)
    b_tokens = set(b)
    a_uniq = sequence_uniqueness(a, token2frequency)
    b_uniq = sequence_uniqueness(b, token2frequency)
    if a_uniq == 0 or b_uniq == 0:
        return 0
    else:
        return sequence_uniqueness(a_tokens.intersection(b_tokens), token2frequency)/(a_uniq * b_uniq) ** 0.5

def parse_name(name):
    name = cleanco(name).clean_name()
    #name = name.translate(None, string.punctuation)
    name = regex.sub(r"[[:punct:]]+", "", name)
    tokens = nltk.word_tokenize(name)
    tokens = [t.lower() for t in tokens]
    tokens = [t for t in tokens if t not in stopwords.words('english')]
    return tokens

def build_token2frequency(names):
    alltokens = []
    for tokens in names.values():
        alltokens += tokens
    return Counter(alltokens)

with open('marchants.json') as merchantfile:
    merchants = pd.read_json(merchantfile)
merchants = merchants.unique()

parsed_names = {merchant: parse_name(merchant) for merchant in merchants}
token2frequency = build_token2frequency(parsed_names)

grouping = {}
for merchant, tokens in parsed_names.items():
    grouping[merchant] = {merchant2: name_similarity(tokens, tokens2, token2frequency)
                          for merchant2, tokens2 in parsed_names.items()}

filtered_matches = {}
for merchant in pcard_merchants:
    filtered_matches[merchant] = {merchant1: ratio
                                  for merchant1, ratio in grouping[merchant].items()
                                  if ratio > 0.3}
This will give you a final filtered list of names and the other names they match up to. It's the same basic code as the other post, just with a couple of missing pieces filled in. This also runs on Python 3.8.
Consider using the Diff-Match-Patch library. You'd be interested in the Diff process - applying a diff on your text can give you a good idea of the differences, along with a programmatic representation of them.
What you can do is separate the words by whitespace, commas, etc., count the number of words one name has in common with another, and set a word-count threshold before a pair is considered "similar".
The other way is to do the same thing, but split the words into their individual characters. Then for each word you check whether the letters are found in the same order (from both sides) for a certain number of characters (or a percentage); if so, you can say that the word is similar too.
Example: you have sqre and square.
You check character by character and find that the letters of sqre all appear in square and in the same order, so it's a similar word.
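A minimal sketch of that character-order check:
def chars_in_order(short, long_word):
    # True if every character of `short` appears in `long_word` in the same order
    it = iter(long_word)
    return all(ch in it for ch in short)

print(chars_in_order("sqre", "square"))   # True
print(chars_in_order("sqre", "serious"))  # False (no 'q')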
Algorithms based on the Levenshtein distance are good (not perfect), but their main disadvantage is that they are very slow for each comparison, and you would have to compare every possible combination.
Another way of working the problem would be to use embeddings or a bag of words to transform each company name (after some cleaning and preprocessing) into a vector of numbers, and then apply an unsupervised or supervised ML method, depending on what is available.
I created matchkraft (https://github.com/MatchKraft/matchkraft-python). It works on top of fuzzy-wuzzy and you can fuzzy match company names in one list.
It is very easy to use. Here is an example in python:
from matchkraft import MatchKraft
import time

mk = MatchKraft('<YOUR API TOKEN HERE>')

job_id = mk.highlight_duplicates(
    name='Stackoverflow Job',
    primary_list=[
        'George Washington Middle Schl',
        'George Washington School',
        'Santa Fe East Inc',
        'Santa Fe East',
        'Rays Hell Burger Too',
        'El Sol de America',
        'microsoft',
        'Olney Theatre',
        'El Sol'
    ]
)
print(job_id)

mk.execute_job(job_id=job_id)

job = mk.get_job_information(job_id=job_id)
print(job.status)

while job.status != 'Completed':
    print(job.status)
    time.sleep(10)
    job = mk.get_job_information(job_id=job_id)

results = mk.get_results_information(job_id=job_id)
if isinstance(results, list):
    for r in results:
        print(r.master_record + ' --> ' + r.match_record)
else:
    print("No Results Found")
