I have a list of several thousand locations and a list of millions of sentences. My objective is to return a list of tuples reporting the comment that was matched and the location mentioned within it. For example:
locations = ['Turin', 'Milan']
state_init = ['NY', 'OK', 'CA']
sent = ['This is a sent about turin. ok?', 'This is a sent about Melan.', 'Alan Turing was not from the state of OK.']
result = [('Turin', 'This is a sent about turin. ok?'), ('Milan', 'This is a sent about Melan.'), ('OK', 'Alan Turing was not from the state of OK.')]
In words, I do not want to match on locations embedded within other words, I do not want to match state initials if they are not capitalized. If possible, I would like to catch misspellings or fuzzy matches of locations that either omit a correct letter, replace one correct letter with an incorrect letter or have one error in the ordering of all of the correct letters. For example:
Milan
should match
Melan, Mlian, or Mlan but not Milano
The function below works very well at doing everything except the fuzzy matching and returning tuples, but I do not know how to do either of these things without a for loop. Not that I am against using a for loop, but I still would not know how to implement this in a way that is computationally efficient.
Is there a way to add these functionalities that I am interested in having or am I trying to do too much in a single function?
import re

def find_keyword_comments(sents, locations, state_init):
    # Build alternation patterns from the keyword lists
    keywords = '|'.join(locations)
    keywords1 = '|'.join(state_init)
    word = re.compile(r"^.*\b({})\b.*$".format(keywords), re.I)
    word1 = re.compile(r"^.*\b({})\b.*$".format(keywords1))  # case-sensitive for state initials
    newlist = filter(word.match, sents)
    newlist1 = filter(word1.match, sents)
    final = list(newlist) + list(newlist1)
    return final
I would recommend you look at metrics for fuzzy matching; the main one you are interested in is the Levenshtein distance (sometimes called the edit distance).
You can write one in pure Python (a minimal sketch follows), but you can also leverage a few modules to make your life easier:
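This is a toy dynamic-programming version, not an optimized implementation. Note that the question's transposition case ("Mlian") would need the Damerau-Levenshtein variant, which counts an adjacent swap as one edit instead of two:

def levenshtein(a, b):
    # Classic DP over a rolling row: before each update, prev[j] holds
    # the distance between a[:i-1] and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("Milan", "Melan"))  # 1 (one substitution)
print(levenshtein("Milan", "Mlian"))  # 2 (a swap costs two plain edits)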
fuzzywuzzy is a very common (pip-installable) package which implements this distance for what they call the pure ratio. It provides a bit more functionality than you may be looking for (partial string matching, ignoring punctuation, token-order insensitivity, and so on). The only drawback is that the ratio takes the length of the strings into account as well. See this response for further basic usage:
from fuzzywuzzy import fuzz
fuzz.ratio("this is a test", "this is a test!") # 96
python-Levenshtein is a pretty fast package because it is essentially a Python wrapper around a C implementation. The documentation is not the nicest, but it should work. It is now back in the PyPI index, so it is pip-installable.
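A quick sketch of its use (assuming the package is installed):

import Levenshtein

print(Levenshtein.distance("Milan", "Melan"))                   # 1
print(Levenshtein.ratio("this is a test", "this is a test!"))   # ~0.97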
Related
I am not just grouping the words by similarity but also by meaning. Say that I have the following list:
func = ['Police Man','Police Officer','Police','Admin','Administrator','Dietitian','Food specialist','Administrative Assistant','Economist','Economy Consultant']
I want to find words with similar meaning and function. I tried fuzzywuzzy but it does not achieve what I want:
from fuzzywuzzy import fuzz

for i in func:
    for n in func:
        print(i, ":", n)
        print(fuzz.ratio(i, n))
This is part of the fuzzy-matching output, and it does not do the job:
Dietitian : Dietitian
100
Dietitian : Food specialist
25
I believe I should use library nltk or stemming? What is the best approach to find relevant words and functions in a list?
I believe I should use ... stemming?
You definitely don't want to use stemming. Stemming will only take words to their roots, so stem("running") = "run". It doesn't do anything based on meaning, so stem("sprinting") = "sprint" != "run". :(
I believe I should use nltk ...
WordNet will let you search for sets of synonyms called "synsets" and you can access it through nltk or even through a web interface. It's not great at compound words, though. :( It's mostly just individual words.
So, you can look up "officer" and "policeman" and see that they have an overlapping meaning. Of course, "officer" also has OTHER meanings; how close do words have to be to qualify for your search? E.g. if "Food Specialist" is the same as "Dietician", is "Food Specialist" also the same as "Chef"?
If WordNet does seem like a useful tool, check out their Python API. You'd want something like:
from nltk.corpus import wordnet as wn

def share_a_sense(word_a, word_b):
    # True if the two words have at least one synset in common
    common = [synset for synset in wn.synsets(word_a)
              if synset in wn.synsets(word_b)]
    return len(common) > 0

print(share_a_sense("officer", "policeman"))  # True
I want to remove repeating characters from a sentence while making sure the words still retain their meaning (if they have any). For example: I'm so haaappppyyyy about offline school
should become I'm so happy about offline school. See, haaappppyyyy became happy, while offline & school stay the same instead of becoming ofline & schol.
I've tried two solutions, using re and itertools, but neither really fits what I'm searching for.
Using Regex :
tweet = "I'm so haaappppyyyy about offline school"
repeat_char = re.compile(r"(.)\1{1,}", re.IGNORECASE)
tweet = repeat_char.sub(r"\1\1", tweet)
tweet = re.sub(r"(.)\1{2,}", r"\1", tweet)
output :
I'm so haappyy about offline school  # it keeps 2 chars for every repeated char
using itertools :
tweet = "I'm so happy about offline school"
tweet = ''.join(ch for ch, _ in itertools.groupby(tweet))
output :
I'm so hapy about ofline schol
How can I fix this? Should I make a list of words I want to exclude?
In addition, I want it to be able to reduce words that follow a pattern to their base form. For example:
wkwk (base form)
wkwkwkwk
wkwkwkwkwkwkwk
I want to make the second and the third word into the first word, the base form
You can combine regex and NLP here by iterating over all words in a string; once you find one with identical consecutive letters, reduce them to at most two consecutive occurrences of the same letter and run an automatic spellcheck to fix the spelling.
See an example Python code:
import re
from textblob import Word

tweet = "I'm so haaappppyyyy about offline school"

# Three or more identical consecutive letters
rx = re.compile(r'([^\W\d_])\1{2,}')

# For each word: if it contains a tripled letter, squash runs to two
# letters and run the spellchecker; otherwise leave the word alone.
print(re.sub(r'[^\W\d_]+',
             lambda x: Word(rx.sub(r'\1\1', x.group())).correct()
                       if rx.search(x.group()) else x.group(),
             tweet))
# => "I'm so happy about offline school"
The code uses the Textblob library, but you may use any you like.
Note that ([^\W\d_])\1{2,} matches three or more identical consecutive letters, and [^\W\d_]+ matches one or more letters.
This answer was originally written for Regex to reduce repeated chars in a string which was closed as duplicate before I could submit my post. So I "recycled" it here.
Regex is not always the best solution
Regex for validation of formats or input
A regex is often used for low-level pattern recognition and substitution.
It may be useful for validation of formats. You can see it as "dumb" automation.
Linguistics (NLP)
When it comes to natural language (NLP), or here spelling (dictionary) the semantics may play a role. Depending on the context "ass" and "as" may both be correctly spelled, although the semantics are very different.
(I apologize for the rude examples, but I am not a native-speaker and those two had the most distinct meaning depending on re-duplication).
For those cases a regex or simple pattern recognition may not be sufficient. Applying it correctly can take more effort than researching a language-specific library or solution (including building a basic application).
Examples for spelling that a regex may struggle with
Like the difference between "haappy" (orthographically invalid; only the vowels "aa" are wrongly duplicated, not the consonants "pp"), "yeees" (contains no duplicates in its correct spelling) and "kiss" (correctly spelled with duplicate consonants)
Spelling correction requires more
For example, a dictionary to look up whether duplicated characters (vowels or consonants) are valid in the correctly spelled form of the word.
Consider a spelling-correction module
You could use textblob module for spelling correction:
To install:
pip install textblob
Example for some test-cases (independent words):
from textblob import TextBlob
incorrect_words = ["cmputr", "yeees", "haappy"] # incorrect spelling
text = ",".join(incorrect_words) # join them as comma separated list
print(f"original words: {text}")
b = TextBlob(text)
# prints the corrected spelling
print(f"corrected words: {b.correct()}")
Prints:
original words: cmputr,yeees,haappy
corrected words: computer,eyes,happy
Surprise: you might have expected "yes" (so did I). But the correction did not simply remove the duplicated vowels "ee"; it rearranged the letters to keep almost all of them (5 of 6, dropping only one "e").
Example for the given sentence:
from textblob import TextBlob
tweet = "I'm so haaappppyyyy about offline school" # either escape or use different quotes when a single-quote (') is enclosed
print(TextBlob(tweet).correct())
Prints:
I'm so haaappppyyyy about office school
Unfortunately the result is considerably worse:
not "happy"
semantically out of scope: "office" instead of "offline"
Apparently a preceding cleaning step using regex, as Wiktor suggests, may improve the result.
See also:
Stackabuse: Spelling Correction in Python with TextBlob, tutorial
documentation: TextBlob: Simplified Text Processing
Well, first of all you need a list (or set) of all allowed words to compare against.
I'd approach it with the assumption (which might be wrong) that no word contains sequences of more than two repeating characters. So for each word, generate a list of all potential candidates; for example, "haaappppppyyyy" would yield ["haappyy", "happyy", "happy", etc]. Then it's just a matter of checking which of those candidates actually exists by comparing against the allowed word list.
The time complexity of this is quite high, though, so if it needs to go fast then throw a hash table (a set) on it or something :) A sketch of the idea follows.
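A rough sketch of that candidate-generation idea (the allowed set here is a toy assumption; you would load a real word list):

import itertools

def candidates(word, max_repeat=2):
    # Split the word into runs of identical characters, then try every
    # way of shortening each run to between 1 and max_repeat occurrences.
    runs = [(ch, len(list(grp))) for ch, grp in itertools.groupby(word)]
    choices = [range(1, min(n, max_repeat) + 1) for _, n in runs]
    for counts in itertools.product(*choices):
        yield ''.join(ch * k for (ch, _), k in zip(runs, counts))

allowed = {"happy", "offline", "school"}  # toy allowed-word set

def reduce_word(word):
    # Prefer the longest candidate that is actually an allowed word.
    for cand in sorted(candidates(word), key=len, reverse=True):
        if cand.lower() in allowed:
            return cand
    return word

print(reduce_word("haaappppyyyy"))  # -> happy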
I have a 50mb regex trie that I'm using to split phrases apart.
Here is the relevant code:
import io
import re

with io.open('REGEXES.rx.txt', encoding='latin-1') as myfile:
    regex = myfile.read()

while True:
    Password = input("Enter a phrase to be split: ")
    Words = re.findall(regex, Password)
    print(Words)
Since the regex is so large, this takes forever!
Here is the code I'm trying now, with re.compile(TempRegex):
import io
import re

with io.open('REGEXES.rx.txt', encoding='latin-1') as myfile:
    TempRegex = myfile.read()

regex = re.compile(TempRegex)

while True:
    Password = input("Enter a phrase to be split: ")
    Words = re.findall(regex, Password)
    print(Words)
What I'm trying to do is I'm trying to check to see if an entered phrase is a combination of names. For example, the phrase "johnsmith123" to return ['john', 'smith', '123']. The regex file was created by a tool from a word list of every first and last name from Facebook. I want to see if an entered phrase is a combination of words from that wordlist essentially ... If johns and mith are names in the list, then I would want "johnsmith123" to return ['john', 'smith', '123', 'johns', 'mith'].
I don't think that regex is the way to go here. It seems to me that all you are trying to do is to find a list of all of the substrings of a given string that happen to be names.
If the user's input is a password or passphrase, that implies a relatively short string. It's easy to break that string up into the set of possible substrings, and then test that set against another set containing the names.
The number of substrings in a string of length n is n(n+1)/2. Assuming that no one is going to enter more than say 40 characters you are only looking at 820 substrings, many of which could be eliminated as being too short. Here is some code to do that:
def substrings(s, min_length=1):
    for start in range(len(s)):
        for length in range(min_length, len(s) - start + 1):
            yield s[start:start + length]
So the problem then is loading the names into a suitable data structure. Your regex is 50MB, but considering the snippet that you showed in one of your comments, the amount of actual data is going to be a lot smaller than that due to the overhead of the regex syntax.
If you just used text files with one name per line you could do this:
names = set(word.strip().lower() for word in open('names.txt'))

def substrings(s, min_length=1):
    for start in range(len(s)):
        for length in range(min_length, len(s) - start + 1):
            yield s[start:start + length]

s = 'johnsmith123'
print(sorted(names.intersection(substrings(s))))
Might give output:
['jo', 'john', 'johns', 'mi', 'smith']
I doubt that there will be memory issues given the likely small data set, but if you find that there's not enough memory to load the full data set at once, you could look at using sqlite3 with a simple table to store the names. This will be slower to query, but it does not need to fit in memory.
Another way could be to use the shelve module to create a persistent dictionary with names as keys.
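A minimal sketch of the shelve approach (assuming a names.txt with one name per line):

import shelve

# One-time build: store each name as a key; the values are just placeholders.
with shelve.open('names.db') as db:
    for word in open('names.txt'):
        db[word.strip().lower()] = True

# Later lookups hit the on-disk store instead of an in-memory set.
with shelve.open('names.db', flag='r') as db:
    print('smith' in db)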
Python's re engine does not actually implement regular expressions in the formal sense, since it includes features such as lookbehind, capture groups and back references, and uses backtracking to match the leftmost valid branch instead of the longest.
If you use a true regex engine, you will almost always get better results if your regex does not require those features.
One of the most important qualities of a true regular expression engine is that it will always return a result in time proportional to the length of the input, using only constant memory.
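You can see the cost of backtracking with a classic pathological pattern; this is just an illustration of the failure mode, and a DFA-based engine handles the same input in linear time:

import re
import time

# Nested quantifiers force the backtracking engine to try exponentially
# many ways to split the run of 'a's before the match finally fails.
pattern = re.compile(r'(a+)+$')
start = time.perf_counter()
pattern.match('a' * 25 + 'b')
print(f"{time.perf_counter() - start:.2f}s")  # grows rapidly as 25 increases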
I've written one myself using a DFA implemented in C (but usable from Python via cffi), which has optimal asymptotic performance, though I haven't tried constant-factor improvements such as vectorization and assembly generation. I didn't make a generally usable API, since I only need to call it from within my library, but it shouldn't be too hard to figure out from the examples. (Note that search can be implemented as match with .* up front, then matching backward, but for my purposes I would rather return a single character as an error token.) Link to my project
You might also consider building the DFA offline and using it for multiple runs of your program - but this is what flex does, so there was no point in doing that for my project; maybe just use flex if you're comfortable with C? Of course, you'd almost certainly have to write a fair bit of custom C code to use my project anyway ...
If you compile it, the regex pattern is compiled into bytecode once and then run by the matching engine. If you don't compile it, the pattern has to be looked up and prepared again every time it is called. That's why the compiled version is much faster when you use the same regex for many different records.
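A minimal sketch of the difference (note that re does keep an internal cache of recently used patterns, so the gap is smaller than a full recompile per call, but compiling once up front is still the cleaner and faster option):

import re

# Compiled once, reused for every record.
name_re = re.compile(r'\b(?:john|smith|jane)\b')

for phrase in ['john smith 123', 'no names here']:
    print(name_re.findall(phrase))  # ['john', 'smith'] then []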
I'm working on a project that searches specific users' Twitter streams from my followers list and retweets them. The code below works fine, but if the string appears inside a longer word (for instance, if the desired string was only "man" but they wrote "manager"), it would get retweeted. I'm still pretty new to Python, but my hunch is regex will be the way to go; my attempts have proved useless thus far.
if tweet["user"]["screen_name"] in friends:
for phrase in list:
if phrase in tweet["text"].lower():
print tweet
api.retweet(tweet["id"])
return True
Since you only want to match whole words, the easiest way to get Python to do this is to split the tweet text into a list of words and then test for the presence of each of your words using in.
There's an optimization you can use because position isn't important: by building a set from the word list you make searching much faster (technically, O(1) rather than O(n)) because of the fast hashed access used by sets and dicts (thank you Tim Peters, also author of The Zen of Python).
The full solution is:
if tweet["user"]["screen_name"] in friends:
tweet_words = set(tweet["text"].lower().split())
for phrase in list:
if phrase in tweet_words:
print tweet
api.retweet(tweet["id"])
return True
This is not a complete solution. Really you should be taking care of things like purging leading and trailing punctuation. You could write a function to do that, and call it with the tweet text as an argument instead of using a plain .split() call; a sketch of such a helper follows.
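For example (words_of is just an illustrative name):

import string

def words_of(text):
    # Lower-case, split on whitespace, and strip leading/trailing
    # punctuation from each token; discard anything left empty.
    return {w.strip(string.punctuation) for w in text.lower().split()} - {''}

tweet_words = words_of("Goodness me she's a great Perl programmer!")
print('programmer' in tweet_words)  # True despite the trailing '!'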
Given that optimization, it occurred to me that the iteration in Python could be avoided altogether if the phrases were a set as well (the iteration still happens, but at C speed rather than Python speed). So in the code that follows, let's suppose that during initialization you have executed the code
tweet_words = set(l.lower() for l in list)
By the way, list is a terrible name for a variable, since by using it you make the Python list type unavailable under its usual name (though you can still get at it with tricks like type([])). Perhaps better to call it word_list or something else both more meaningful and not an existing name. You will have to adapt this code to your needs, it's just to give you the idea. Note that tweet_words only has to be set once.
list = ['Python', 'Perl', 'COBOL']
tweets = [
"This vacation just isn't worth the bother",
"Goodness me she's a great Perl programmer",
"This one slides by under the radar",
"I used to program COBOL but I'm all right now",
"A visit to the doctor is not reported"
]
tweet_words = set(w.lower() for w in list)
for tweet in tweets:
    if set(tweet.lower().split()) & tweet_words:
        print(tweet)
If you want to use regexes to do this, look for a pattern that is of the form \b<string>\b. In your case this would be:
pattern = re.compile(r"\bman\b")
if re.search(pattern, tweet["text"].lower()):
    # do your thing
\b matches a word boundary in regex, so prefixing and suffixing your pattern with it will match only the whole word. Hope it helps.
I'm trying to come up with a method of finding duplicate addresses, based on a similarity score. Consider these duplicate addresses:
addr_1 = '# 3 FAIRMONT LINK SOUTH'
addr_2 = '3 FAIRMONT LINK S'
addr_3 = '5703 - 48TH AVE'
addr_4 = '5703- 48 AVENUE'
I'm planning on applying some string transformations to abbreviate long words (like NORTH -> N) and remove all spaces, commas, dashes and pound symbols. Now, having this output, how can I compare addr_3 with the rest of the addresses and detect similar ones? What percentage of similarity would be safe? Could you provide simple Python code for this?
addr_1 = '3FAIRMONTLINKS'
addr_2 = '3FAIRMONTLINKS'
addr_3 = '570348THAV'
addr_4 = '570348AV'
Thanks,
Eduardo
First, simplify the address string by collapsing all whitespace to a single space between each word, and forcing everything to lower case (or upper case if you prefer):
adr = " ".join(adr.tolower().split())
Then, I would strip out things like "st" in "41st Street" or "nd" in "42nd Street":
adr = re.sub("1st(\b|$)", r'1', adr)
adr = re.sub("([2-9])\s?nd(\b|$)", r'\1', adr)
Note that the second sub() will work with a space between the "2" and the "nd", but I didn't set the first one to do that; because I'm not sure how you can tell the difference between "41 St Ave" and "41 St" (that second one is "41 Street" abbreviated).
Be sure to read all the help for the re module; it's powerful but cryptic.
Then, I would split what you have left into a list of words, and apply the Soundex algorithm to list items that don't look like numbers:
http://en.wikipedia.org/wiki/Soundex
http://wwwhomes.uni-bielefeld.de/gibbon/Forms/Python/SEARCH/soundex.html
adrlist = [word if word.isdigit() else soundex(word) for word in adr.split()]
Then you can work with the list or join it back to a string as you think best.
The whole idea of the Soundex thing is to handle misspelled addresses. That may not be what you want, in which case just ignore this Soundex idea.
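If you do want to try Soundex without pulling in one of the linked implementations, here is a simplified sketch (it omits the full algorithm's special handling of 'h' and 'w' separators):

def soundex(word):
    # Keep the first letter, encode the rest as digits, collapse adjacent
    # duplicate codes, drop vowels/h/w/y, and pad to four characters.
    codes = {}
    for digit, letters in enumerate(('bfpv', 'cgjkqszx', 'dt', 'l', 'mn', 'r'), 1):
        for ch in letters:
            codes[ch] = str(digit)
    word = word.lower()
    encoded = word[0].upper()
    prev = codes.get(word[0], '')
    for ch in word[1:]:
        code = codes.get(ch, '')
        if code and code != prev:
            encoded += code
        prev = code
    return (encoded + '000')[:4]

print(soundex("smith"), soundex("smyth"))  # S530 S530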
Good luck.
Removing spaces, commas and dashes will be ambiguous. It is better to replace them with a single space.
Take for example this address
56 5th avenue
And this
5, 65th avenue
with your method both of them will be:
565THAV
What you can do is write a good address-shortening algorithm and then use string comparison to detect duplicates. This should be enough to detect duplicates in the general case. A general similarity algorithm won't work, because a difference of one number can mean a huge change in addresses.
The algorithm can go like this (a sketch follows below):
Replace all commas, dashes and pound signs with spaces; use the translate method for that.
Build a dictionary of words and their abbreviated forms.
Remove the TH part if it follows a number.
This should be helpful in building your dictionary of abbreviations:
https://pe.usps.com/text/pub28/28apc_002.htm
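A minimal sketch of those steps (the abbreviation table here is a toy assumption; the USPS link above has the real list):

import re

ABBREV = {'avenue': 'ave', 'street': 'st', 'north': 'n', 'south': 's'}  # toy table

def shorten(addr):
    # 1. Replace commas, dashes and pound signs with spaces.
    addr = addr.lower().translate(str.maketrans(',-#', '   '))
    # 2. Drop ordinal suffixes that follow a number: 48th -> 48.
    addr = re.sub(r'(\d+)\s*(?:st|nd|rd|th)\b', r'\1', addr)
    # 3. Map long words to their abbreviated forms.
    return ' '.join(ABBREV.get(w, w) for w in addr.split())

print(shorten('5703 - 48TH AVE'))  # -> 5703 48 ave
print(shorten('5703- 48 AVENUE'))  # -> 5703 48 ave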
I regularly inspect addresses for duplication where I work, and I have to say, I find Soundex highly unsuitable. It's both too slow and too eager to match things. I have similar issues with Levenshtein distance.
What has worked best for me is to sanitize and tokenize the addresses (get rid of punctuation, split things up into words) and then just see how many tokens match up. Because addresses typically have several tokens, you can develop a level of confidence in terms of a combination of (1) how many tokens were matched, (2) how many numeric tokens were matched, and (3) how many tokens are available. For example, if all tokens in the shorter address are in the longer address, the confidence of a match is pretty high. Likewise, if you match 5 tokens including at least one that's numeric, even if the addresses each have 8, that's still a high-confidence match.
It's definitely useful to do some tweaking, like substituting some common abbreviations. The USPS lists help, though I wouldn't go gung-ho trying to implement all of them, and some of the most valuable substitutions aren't on those lists. For example, 'JFK' should be a match for 'JOHN F KENNEDY', and there are a number of common ways to shorten 'MARTIN LUTHER KING JR'.
Maybe it goes without saying but I'll say it anyway, for completeness: Don't forget to just do a straight string comparison on the whole address before messing with more complicated things! This should be a very cheap test, and thus is probably a no-brainer first pass.
Obviously, the more time you're willing and able to spend (both on programming/testing and on run time), the better you'll be able to do. Fuzzy string matching techniques (faster and less generalized kinds than Levenshtein) can be useful, as a separate pass from the token approach (I wouldn't try to fuzzy match individual tokens against each other). I find that fuzzy string matching doesn't give me enough bang for my buck on addresses (though I will use it on names).
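As a rough illustration of the token approach (the scoring here is a toy, not a tuned confidence model):

def token_match_confidence(addr_a, addr_b):
    a, b = (set(addr.lower().split()) for addr in (addr_a, addr_b))
    shorter, longer = sorted((a, b), key=len)
    matched = shorter & longer
    # All tokens of the shorter address matched, including a numeric one:
    # treat that as a high-confidence duplicate.
    if matched == shorter and any(t.isdigit() for t in matched):
        return 1.0
    return len(matched) / len(shorter) if shorter else 0.0

print(token_match_confidence('3 FAIRMONT LINK S', '# 3 FAIRMONT LINK SOUTH'))  # 0.75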
In order to do this right, you need to standardize your addresses according to USPS standards (your address examples appear to be US based). There are many direct marketing service providers that offer CASS (Coding Accuracy Support System) certification of postal addresses. The CASS process will standardize all of your addresses and append zip + 4 to them. Any undeliverable addresses will be flagged which will further reduce your postal mailing costs, if that is your intent. Once all of your addresses are standardized, eliminating duplicates will be trivial.
I had to do this once. I converted everything to lowercase, computed each address's Levenshtein distance to every other address, and ordered the results. It worked very well, but it was quite time-consuming.
You'll want to use an implementation of Levenshtein in C rather than in Python if you have a large data set. Mine was a few tens of thousands and took the better part of a day to run, I think.