I'm trying to come up with a method of finding duplicate addresses, based on a similarity score. Consider these duplicate addresses:
addr_1 = '# 3 FAIRMONT LINK SOUTH'
addr_2 = '3 FAIRMONT LINK S'
addr_3 = '5703 - 48TH AVE'
addr_4 = '5703- 48 AVENUE'
I'm planning to apply some string transformations to abbreviate long words (like NORTH -> N) and to remove all spaces, commas, dashes and pound symbols. Given the output below, how can I compare addr_3 with the rest of the addresses and detect similar ones? What percentage of similarity would be safe? Could you provide some simple Python code for this?
addr_1 = '3FAIRMONTLINKS'
addr_2 = '3FAIRMONTLINKS'
addr_3 = '570348THAV'
addr_4 = '570348AV'
Thankful,
Eduardo
First, simplify the address string by collapsing all whitespace to a single space between each word, and forcing everything to lower case (or upper case if you prefer):
adr = " ".join(adr.lower().split())
Then, I would strip out things like "st" in "41st Street" or "nd" in "42nd Street":
adr = re.sub(r"1st(\b|$)", r'1', adr)
adr = re.sub(r"([2-9])\s?nd(\b|$)", r'\1', adr)
Note that the second sub() will work with a space between the "2" and the "nd", but I didn't set the first one to do that; because I'm not sure how you can tell the difference between "41 St Ave" and "41 St" (that second one is "41 Street" abbreviated).
Be sure to read all the help for the re module; it's powerful but cryptic.
Then, I would split what you have left into a list of words, and apply the Soundex algorithm to list items that don't look like numbers:
http://en.wikipedia.org/wiki/Soundex
http://wwwhomes.uni-bielefeld.de/gibbon/Forms/Python/SEARCH/soundex.html
adrlist = [word if word.isdigit() else soundex(word) for word in adr.split()]
Then you can work with the list or join it back to a string as you think best.
The whole idea of the Soundex thing is to handle misspelled addresses. That may not be what you want, in which case just ignore this Soundex idea.
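In case the linked implementation is unavailable, here is a minimal sketch of the standard American Soundex rules as I understand them. The helper name encode_address is made up for illustration, and the sketch assumes plain ASCII words.

```python
def soundex(word):
    """Return the 4-character American Soundex code of a word."""
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            codes[ch] = digit

    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""

    result = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = codes.get(ch, "")  # vowels get an empty code
        if code and code != prev:
            result += code
        prev = code
    # Pad with zeros, then keep the letter plus three digits.
    return (result + "000")[:4]


def encode_address(adr):
    # Hypothetical helper: numbers stay as they are, words become Soundex codes.
    return [word if word.isdigit() else soundex(word) for word in adr.split()]


print(encode_address("3 fairmont link south"))
```

Words that sound alike collapse to the same code (for instance "Robert" and "Rupert"), which is exactly why this can be too generous for addresses.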
Good luck.
Removing spaces, commas and dashes will make the result ambiguous. It is better to replace them with a single space.
Take for example this address
56 5th avenue
And this
5, 65th avenue
with your method both of them will be:
565THAV
What you can do is write a good address shortening algorithm and then use plain string comparison to detect duplicates. This should be enough to detect duplicates in the general case. A general similarity algorithm won't work, because a difference of a single digit can mean a completely different address.
The algorithm can go like this:
Replace all commas and dashes with spaces. Use the translate method for that.
Build a dictionary with words and their abbreviated form.
Remove the TH part if it follows a number.
This should be helpful in building your dictionary of abbreviations:
https://pe.usps.com/text/pub28/28apc_002.htm
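The steps above could be sketched roughly like this. The abbreviation table here is a tiny made-up sample (a real one would be built from the USPS list), the function name shorten_address is hypothetical, and I also treat '#' as a separator and strip ST/ND/RD as well as TH, which are my own assumptions.

```python
import re

# Deliberately tiny sample table; extend it from the USPS abbreviation list.
ABBREVIATIONS = {
    "NORTH": "N", "SOUTH": "S", "EAST": "E", "WEST": "W",
    "AVENUE": "AVE", "STREET": "ST",
}

# Map commas, dashes and pound signs to spaces instead of deleting them.
PUNCT_TO_SPACE = str.maketrans({",": " ", "-": " ", "#": " "})


def shorten_address(address):
    words = address.upper().translate(PUNCT_TO_SPACE).split()
    shortened = []
    for word in words:
        # Drop an ordinal suffix that directly follows digits: 48TH -> 48.
        word = re.sub(r"^(\d+)(ST|ND|RD|TH)$", r"\1", word)
        shortened.append(ABBREVIATIONS.get(word, word))
    return " ".join(shortened)


print(shorten_address("# 3 FAIRMONT LINK SOUTH"))
print(shorten_address("5703- 48 AVENUE"))
```

After shortening, duplicates can be found with a plain equality test, and the two addresses from the ambiguity example stay distinct because the spaces are kept.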
I regularly inspect addresses for duplication where I work, and I have to say, I find Soundex highly unsuitable. It's both too slow and too eager to match things. I have similar issues with Levenshtein distance.
What has worked best for me is to sanitize and tokenize the addresses (get rid of punctuation, split things up into words) and then just see how many tokens match up. Because addresses typically have several tokens, you can develop a level of confidence in terms of a combination of (1) how many tokens were matched, (2) how many numeric tokens were matched, and (3) how many tokens are available. For example, if all tokens in the shorter address are in the longer address, the confidence of a match is pretty high. Likewise, if you match 5 tokens including at least one that's numeric, even if the addresses each have 8, that's still a high-confidence match.
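A rough sketch of that token-counting idea follows. The function names and the specific rule in looks_like_duplicate (all tokens of the shorter address matched, at least one of them numeric) are illustrative assumptions, not the exact thresholds I use; tune them on your own data.

```python
import re


def tokenize(address):
    # Uppercase, turn every run of non-alphanumerics into a space, then split.
    return set(re.sub(r"[^A-Z0-9]+", " ", address.upper()).split())


def match_summary(addr_a, addr_b):
    a, b = tokenize(addr_a), tokenize(addr_b)
    shared = a & b
    return {
        "matched": len(shared),                                   # (1)
        "numeric_matched": sum(1 for t in shared if t.isdigit()),  # (2)
        "shorter": min(len(a), len(b)),                           # (3)
    }


def looks_like_duplicate(addr_a, addr_b):
    s = match_summary(addr_a, addr_b)
    if s["shorter"] == 0:
        return False
    # Assumed rule: every token of the shorter address appears in the
    # longer one, and at least one of the shared tokens is a number.
    return s["matched"] == s["shorter"] and s["numeric_matched"] >= 1


print(match_summary("3 FAIRMONT LINK S", "# 3 FAIRMONT LINK S, UNIT 2"))
print(looks_like_duplicate("56 5 AVE", "5 65 AVE"))
```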
It's definitely useful to do some tweaking, like substituting some common abbreviations. The USPS lists help, though I wouldn't go gung-ho trying to implement all of them, and some of the most valuable substitutions aren't on those lists. For example, 'JFK' should be a match for 'JOHN F KENNEDY', and there are a number of common ways to shorten 'MARTIN LUTHER KING JR'.
Maybe it goes without saying but I'll say it anyway, for completeness: Don't forget to just do a straight string comparison on the whole address before messing with more complicated things! This should be a very cheap test, and thus is probably a no-brainer first pass.
Obviously, the more time you're willing and able to spend (both on programming/testing and on run time), the better you'll be able to do. Fuzzy string matching techniques (faster and less generalized kinds than Levenshtein) can be useful, as a separate pass from the token approach (I wouldn't try to fuzzy match individual tokens against each other). I find that fuzzy string matching doesn't give me enough bang for my buck on addresses (though I will use it on names).
In order to do this right, you need to standardize your addresses according to USPS standards (your address examples appear to be US based). There are many direct marketing service providers that offer CASS (Coding Accuracy Support System) certification of postal addresses. The CASS process will standardize all of your addresses and append zip + 4 to them. Any undeliverable addresses will be flagged which will further reduce your postal mailing costs, if that is your intent. Once all of your addresses are standardized, eliminating duplicates will be trivial.
I had to do this once. I converted everything to lowercase, computed each address's Levenshtein distance to every other address, and ordered the results. It worked very well, but it was quite time-consuming.
You'll want to use an implementation of Levenshtein in C rather than in Python if you have a large data set. Mine was a few tens of thousands and took the better part of a day to run, I think.
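For reference, a plain-Python sketch of the distance and a similarity ratio derived from it. This is the textbook dynamic-programming version, not the code I actually ran, and the similarity function (1 minus distance over the longer length) is just one common way to turn the distance into a percentage.

```python
def levenshtein(a, b):
    """Classic edit distance, keeping only two rows of the DP table."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """1.0 for identical strings, approaching 0.0 as they diverge."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


print(similarity("5703 48 ave", "5703 48 avenue"))
```

Comparing every address with every other one is quadratic in the number of addresses, which is where the run time goes.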
I have texts where delimiter can be anything in the list [;,.?]
txt1 = "Kids of today have started selling drugs or taken drugs at this age, then we are finished as parent,what generation are we going to have when our generation is no more,am sick to my stomach, it means we do not have tomorrow leaders or future leader, drugs at this stage woowowow parent and Guidance's fasten your belt if not we will wake up someday to see what we never thought could happen"
txt2 = "There was a clear warning sign, and this person chose to take a risk regardless. It was quite a stupid decision to climb the fence, but even this is probably a common activity that generally never results in death. More of a freak accident than a definite way for someone to die. At the very most, the only changes that should be made by the airport / authorities would be to the fence design, making it more difficult for people to climb up. Barricading the area off completely and banning people from the area would be comparable to fencing off a scenic mountain path that hundreds of people like to climb and enjoy safely, but which does produce the occasional fatality when people slip. Just because this area carries a (clearly communicated) risk shouldn't be a reason for the authorities to step in and make adjustments. People take risks and are responsible for their own safety in areas like this. One fatality is a tiny drop in the bucket compared to the hundreds of people doing this each month without incident."
How to break multiline sentences into independent sentences, depending on the presence of delimiter. For example, in txt1, delimiter should be ','(comma) whereas, in txt2, delimiter should be '.'(dot).
I have used re.split() for this, but I am not getting desired results. I used:
print(re.split(';|,|.|?',txt1))
You have to add an escape character (\) in front of . and ?
print(re.split(r';|,|\.|\?',txt1))
To avoid the blank/empty strings, use a list comprehension:
[x for x in re.split(r';|,|\.|\?',txt1) if x]
Both dot and question mark are regex metacharacters, which means that these characters, when used unescaped, have a special meaning and do not stand for their literal values. One quick fix to your problem would be to split on a character class, inside which these characters are treated literally:
print(re.split('[;,.?]', txt1))
try this:
import re
DATA = "sample, text"
print(re.split(r'[;,.?]+', DATA))
You can directly pass the list of delimiters if you have it.
Create a string out of list you have in the form of '[your delimiters]'
del_list = '[your delimiters]'
print(re.split('{0}'.format(del_list), txt1))
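If the delimiters really are held in a Python list, one way to build that bracketed string safely is to escape each delimiter first. This is a small sketch; the name split_sentences is made up, and the trailing + (to swallow runs of delimiters) and the stripping of whitespace are my own additions.

```python
import re

delimiters = [";", ",", ".", "?"]

# re.escape keeps any special characters literal inside the character class.
pattern = "[" + "".join(re.escape(d) for d in delimiters) + "]+"


def split_sentences(text):
    # Split on runs of delimiters and drop empty or whitespace-only pieces.
    return [part.strip() for part in re.split(pattern, text) if part.strip()]


print(split_sentences("one, two. three?four;"))
```

Note that splitting on '.' will also cut a decimal number such as 88.4 in two, so this only suits text without such numbers.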
Given a very long string -
"Given the large category of plants, the split ratio was determined to be 88.4. However, we're not sure if the split ratio was consistent across all subcategories or just a calculated average. If however, it deviated, it would be nonetheless, quite strange.
The words of interest are 'split ratio'. In the output, I want them to appear as split-ratio (as a single word), and I also only want to retain the sentences where these words occur. So in this case, only the first two sentences.
Is this possible?
You can use replace in a list comprehension:
s = """Given the large category of plants, the split ratio was
determined to be 88.4. However, we're not sure
if the split ratio was consistent across all subcategories
or just a calculated average. If however, it deviated,
it would be nonetheless, quite strange."""
print('. '.join([x.replace('split ratio', 'split-ratio') for x in s.split('. ') if 'split ratio' in x]) + '.')
This prints only the sentences that contain 'split ratio', with each occurrence converted to 'split-ratio'.
Since python is in the tag line I expect you want it in that language right? And to be clear a simple find-replace in a normal text editor isn't going to solve this issue I suppose, you need actual logic to apply onto something.
I would have to stop and look up python for a bit. But in any language the easiest way I can think of is to just parse out the file/stream and make the changes as you go. Read in the stream and look for the pattern you want a match for = "split ratio" - regardless, as you are reading in the stream, write out a new one that favors your changes. But do it in the block size (or string length) of the pattern you are matching.
When you find true for the pattern you are constantly comparing, stop. Don't output that string, instead output the one you want to replace it with into the new target stream/file.
However, a search for python search and replace algorithm gives me this:
https://www.geeksforgeeks.org/python-string-replace/
Someone did the hard work for you already. Love that super high level programming language that leaves folks in the dark as to what is actually happening. Oh well.
Enjoy.
atomkey.
I am working on a Python project in which I need to filter profane words, and I already have a filter in place. The only problem is that if a user switches a character with a visually similar character (e.g. hello and h311o), the filter does not pick it up. Is there some way that I could detect these words without hard-coding every combination?
What about translating l331sp33ch to leetspeech and applying a simple Levenshtein distance? (You need to pip install editdistance first.)
import editdistance
try:
    from string import maketrans # python 2
except ImportError:
    maketrans = str.maketrans # python 3
t = maketrans("01345", "oleas")
editdistance.eval("h3110".translate(t), 'hello')
results in 0
Maybe build a relationship between the visually similar characters and what they can represent i.e.
subs = {'3': 'e', '1': 'l', '0': 'o'} #etc....
and then you can use this to test against your database of forbidden words.
e.g.
input: he11
For each character, check whether it has an entry in subs:
subs['h'] #does not exist
subs['e'] #does not exist
subs['1'] = 'l'
subs['1'] = 'l'
Put this together to form a word and then search your forbidden list. I don't know if this is the fastest way of doing it, but it is "a" way.
I'm interested to see what others come up with.
*disclaimer: I've done a year or so of Perl and am starting out learning Python right now. When I get the time. Which is very hard to come by.
Linear Replacement
You will want something adaptable to innovative orthographers. For a start, pattern-match the alphabetic characters to your lexicon of banned words, using other characters as wild cards. For instance, your example would get translated to "h...o", which you would match to your proposed taboo word, "hello".
Next, you would compare the non-alpha characters to a dictionary of substitutions, allowing common wild-card chars to stand for anything. For instance, asterisk, hyphen, and period could stand for anything; '4' and '#' could stand for 'A', and so on. However, you'll do this checking from the strength of the taboo word, not from generating all possibilities: the translation goes the other way.
You will have a little ambiguity, as some characters stand for multiple letters. "#" can be used in place of 'O' if you're getting crafty. Also note that not all the letters will be in your usual set: you'll want to deal with monetary symbols (Euro, Yen, and Pound are all derived from letters), as well as foreign letters that happen to resemble Latin letters.
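As a rough sketch of this matching direction: letters in the input must match the taboo word exactly, and each non-letter must either be a universal wild card or be listed as a stand-in for the letter at that position. The tables below are small illustrative samples (the '1' entry in particular is my assumption), and the names plausible and find_banned are made up.

```python
# Which letters each symbol may stand for (sample entries only).
SUBSTITUTES = {"3": "e", "1": "il", "0": "o", "4": "a", "#": "ao"}
# Characters that may stand for any letter at all.
UNIVERSAL = set("*-.")

BANNED = ["hello"]  # placeholder taboo list, using the question's example


def plausible(candidate, word):
    """True if candidate could be a disguised spelling of word."""
    if len(candidate) != len(word):
        return False
    for c, w in zip(candidate.lower(), word):
        if c.isalpha():
            if c != w:  # letters must match the taboo word exactly
                return False
        elif c not in UNIVERSAL and w not in SUBSTITUTES.get(c, ""):
            return False  # this symbol cannot stand for this letter
    return True


def find_banned(candidate):
    return [word for word in BANNED if plausible(candidate, word)]


print(find_banned("h311o"))
```

Because the check starts from each taboo word, nothing has to enumerate every possible disguise of it.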
Multi-character replacements
That handles only the words that have the same length as the taboo word. Can you also handle abbreviations? There are a lot of combinations of the form "h-bomb", where the banned word appears as the first letter only: the effect is profane, but the match is more difficult, especially where the 'b's are replaced with a scharfes-S (German), the 'm' with a Hebrew or Cyrillic character, and the 'o' with anything round from the entire font.
Context
There is also the problem that some words are perfectly legitimate in one context, but profane in a slang context. Are you also planning to match phrases, perhaps parsing a sentence for trigger words?
Training a solution
If you need a comprehensive solution, consider training a neural network with phrases and words you label as "okay" and "taboo", and let it run for a day. This can take a lot of the adaptation work off your shoulders, and enhancing the model isn't a difficult problem: add your new differentiating text and continue the training from the point where you left off.
Thank you to all who posted an answer to this question. More answers are welcome, as they may help others. I ended up going off of David Zemens' comment on the question.
I'd use a dictionary or list of common variants ("sh1t", etc.) which you could persist as a plain text file or json etc., and read in to memory. This would allow you to add new entries as needed, independently of the code itself. If you're only concerned about profanities, then the list should be reasonably small to maintain, and new variations unlikely. I've used a hard-coded dict to represent statistical t-table (with 1500 key/value pairs) in the past, seems like your problem would not require nearly that many keys.
While this still means that all the words will be hard-coded, it will allow me to update the list more easily.
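A minimal sketch of what that might look like, assuming the variants are stored as a JSON object mapping each disguised spelling to the word it stands for. The JSON content is inlined here so the example runs on its own; in practice it would be read from a file, and the names load_variants and flag_words are made up.

```python
import json

# Stand-in for the contents of the variants file (hypothetical entries).
VARIANTS_JSON = '{"h311o": "hello", "he11o": "hello"}'


def load_variants(text):
    # Lowercase the keys so lookups can be case-insensitive.
    return {key.lower(): value for key, value in json.loads(text).items()}


def flag_words(message, variants):
    # Return the underlying word for every listed variant in the message.
    return [variants[word] for word in message.lower().split() if word in variants]


variants = load_variants(VARIANTS_JSON)
print(flag_words("Well H311o there", variants))
```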
Alright, this one's a bit of a pain. I'm doing some scraping with Python, trying to get an address out of a few lines of poorly tagged HTML. Here's a sample of the format:
256-555-5555<br/>
1234 Fake Ave S<br/>
Gotham (Lower Ward)<br/>
I'd like to retrieve only 1234 Fake Ave S, Gotham. Any ideas? I've been doing regex's all night and now my brain is mush...
Edit:
More detail about the possible scenarios of how the data will arrive. Sometimes the first line will be there, sometimes not. All of the addresses I have seen have Ave, Way, or St in them, although I would prefer not to use that as a factor in the selection, as I am not certain they will always be that way. The second and third lines are always there.
What I had in mind was something that
Selects everything on 2nd to last line (so, second line if there are three lines, first line if just two when there isn't a phone number).
Selects everything on last line that isn't in parentheses.
Combine the 2nd to last line and last line, adding a ", " in between the two.
I'm using Scrapy to acquire the HTML code. The address is all in the same div, I want to use regex to further break the data up into appropriate sections. Now how to do that is what I'm unable to figure out.
Edit2:
As per Ofir's comment, I should mention that I have already made expressions to isolate the phone number and parentheses section.
Phone (or possible email or website):
((1[-. ])?[0-9]{3}[-. ])?\(?([0-9]{3}[-. ][A?([0-9]{4})|([\w\.-]+#[\w\.-]+)|(www.+)|([\w\.-]*(?:com|net|org|us))
parentheses:
\((.*?)\)
I'm not sure how to use those to construct a everything-but-these statement.
It is possible that in your case it is easier to focus on what you don't want:
html tags (<br>)
phone numbers
everything in parentheses
Each of which can be matched easily with simple regular expressions, making it easy to construct one to match the rest (presumably - the address)
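A minimal sketch of that subtractive approach, assuming simplified patterns of my own (the phone pattern only covers NNN-NNN-NNNN style numbers and is not the asker's expression; the name extract_address is made up):

```python
import re

TAG = re.compile(r"<br\s*/?>")
PHONE = re.compile(r"\(?\d{3}\)?[-. ]\d{3}[-. ]\d{4}")  # simplified assumption
PARENS = re.compile(r"\([^)]*\)")


def extract_address(html):
    # Remove the phone number first, so a "(256)" prefix is not half-eaten
    # by the parentheses pattern.
    text = PHONE.sub("", html)
    text = PARENS.sub("", text)
    # Split on the tags and keep whatever non-empty lines are left.
    lines = [line.strip() for line in TAG.split(text)]
    return ", ".join(line for line in lines if line)


sample = "256-555-5555<br/>\n1234 Fake Ave S<br/>\nGotham (Lower Ward)<br/>\n"
print(extract_address(sample))
```

The same function copes with the case where the phone line is missing, since nothing depends on the number of lines.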
This attempts to isolate the last two lines out of the string:
>>> import re
>>> s="""256-555-5555<br/>
... 1234 Fake Ave S<br/>
... Gotham (Lower Ward)<br/>
... """
>>> m = re.search(r'((?!<br/>).*)<br/>\n((?!<br/>).*)<br/>$', s)
>>> print(m.group(1))
1234 Fake Ave S
Trimming the parentheses is probably best left to a separate line of code, rather than complicating the regular expression further.
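For completeness, here is roughly how the trimming and the final join could look as a standalone script. It uses a simplified pattern without the lookaheads, and the function name last_two_lines is made up.

```python
import re


def last_two_lines(s):
    # Two consecutive "<br/>"-terminated lines at the very end of the string.
    m = re.search(r'(.*)<br/>\n(.*)<br/>$', s)
    if m is None:
        return None
    street = m.group(1).strip()
    # Trim the parenthesised part as a separate step.
    city = re.sub(r'\s*\(.*?\)', '', m.group(2)).strip()
    return ", ".join([street, city])


s = """256-555-5555<br/>
1234 Fake Ave S<br/>
Gotham (Lower Ward)<br/>
"""
print(last_two_lines(s))
```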
As far as I understand your problem, I think you are going about solving it the wrong way.
Regexes are not a magical tool that could extract pertinent data from a pulp and jumble of undifferentiated elements of text. It is a tool that can only extract data from a text having variable parts but also a minimum of stable structure acting as anchors relatively to which the variable parts can be localized.
In your treatment, it seems to me that you first isolated this part containing possible phone number followed by address on 1/2 lines. But doing so, you lost information: what is before and what is after is anchoring information, you shouldn't try to find something in the remaining section obtained after having eliminated this information.
Moreover, I presume that you don't want only to catch a phone number and an address: you may want to extract other pieces of information lying before and after this section. With a good shaped regex, you could capture all the pieces in one shot.
So, please, give more of the text, with enough characters before and after the limited section to allow a correct and easier regex strategy that catches all the data you want. triplee has already asked you for that, and you didn't provide it. Why?
I wondered how you would go about tokenizing strings in English (or other western languages) if whitespaces were removed?
The inspiration for the question is the Sheep Man character in the Murakami novel 'Dance Dance Dance'
In the novel, the Sheep Man is translated as saying things like:
"likewesaid, we'lldowhatwecan. Trytoreconnectyou, towhatyouwant," said the Sheep Man. "Butwecan'tdoit-alone. Yougottaworktoo."
So, some punctuation is kept, but not all. Enough for a human to read, but somewhat arbitrary.
What would be your strategy for building a parser for this? Common combinations of letters, syllable counts, conditional grammars, look-ahead/behind regexps etc.?
Specifically, python-wise, how would you structure a (forgiving) translation flow? Not asking for a completed answer, just more how your thought process would go about breaking the problem down.
I ask this in a frivolous manner, but I think it's a question that might get some interesting (nlp/crypto/frequency/social) answers.
Thanks!
I actually did something like this for work about eight months ago. I just used a dictionary of English words in a hashtable (for O(1) lookup times). I'd go letter by letter matching whole words. It works well, but there are numerous ambiguities. (asshit can be ass hit or as shit). To resolve those ambiguities would require much more sophisticated grammar analysis.
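To give an idea of the approach (this is a reconstruction for illustration, not the code from work), here is a sketch that returns every way of splitting a string into dictionary words. The word set is a toy placeholder, and the recursion is exponential in the worst case, so real input would want memoization.

```python
def segmentations(text, words):
    """Return every way to split text into words from the given set."""
    if not text:
        return [[]]  # one way to split the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        if prefix in words:  # set membership is O(1) on average
            for rest in segmentations(text[end:], words):
                results.append([prefix] + rest)
    return results


# Toy dictionary; a real one would hold a full English word list.
WORDS = {"no", "now", "here", "where", "we", "can", "do", "what"}

print(segmentations("wedowhatwecan", WORDS))
print(segmentations("nowhere", WORDS))
```

The second call shows the ambiguity problem: both "no where" and "now here" are valid splits, and the dictionary alone cannot choose between them.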
First of all, I think you need a dictionary of English words -- you could try some methods that rely solely on some statistical analysis, but I think a dictionary has better chances of good results.
Once you have the words, you have two possible approaches:
You could categorize the words into grammar categories and use a formal grammar to parse the sentences -- obviously, you would sometimes get no match or multiple matches -- I'm not familiar with techniques that would allow you to loosen the grammar rules in case of no match, but I'm sure there must be some.
On the other hand, you could just take some large corpus of English text and compute relative probabilities of certain words being next to each other -- getting a list of pair and triples of words. Since that data structure would be rather big, you could use word categories (grammatical and/or based on meaning) to simplify it. Then you just build an automaton and choose the most probable transitions between the words.
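A small sketch of the word-pair idea, with heavy simplifications: the corpus is a made-up toy sentence, smoothing is plain add-one, and instead of building an automaton it just scores candidate splits that were produced some other way and keeps the best one.

```python
import math
from collections import Counter

# Toy corpus; a real one would be a large body of English text.
CORPUS = "we are now here and we said we can do what we can".split()
UNIGRAMS = Counter(CORPUS)
BIGRAMS = Counter(zip(CORPUS, CORPUS[1:]))
VOCAB = len(UNIGRAMS)


def score(words):
    """Log-probability of a word sequence under an add-one bigram model."""
    total = 0.0
    # "<s>" is a start marker, so the first word is scored as well.
    for prev, cur in zip(["<s>"] + words, words):
        total += math.log((BIGRAMS[(prev, cur)] + 1) / (UNIGRAMS[prev] + VOCAB))
    return total


def most_probable(candidates):
    return max(candidates, key=score)


print(most_probable([["no", "where"], ["now", "here"]]))
```

With this toy corpus the pair "now here" has been seen and "no where" has not, so the former wins; with a real corpus the counts would decide.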
I am sure there are many more possible approaches. You can even combine the two I mentioned, building some kind of grammar with weight attached to its rules. It's a rich field for experimenting.
I don't know if this is of much help to you, but you might be able to make use of this spelling corrector in some way.
This is just some quick code I wrote out that I think would work fairly well to extract words from a snippet like the one you gave. It's not fully thought out, but I think something along these lines would work if you can't find a pre-packaged solution.
# placeholder word list; swap in a real English dictionary (or an API call) here
english_words = {"like", "we", "said"}
# the Sheep Man's speech, without the narration between the quotes
textstring = "likewesaid, we'lldowhatwecan. Trytoreconnectyou, towhatyouwant, Butwecan'tdoit-alone. Yougottaworktoo."
indiv_characters = list(textstring) #splits string into individual characters
teststring = ''
sequential_indiv_word_list = []
for cur_char in indiv_characters:
    teststring = teststring + cur_char
    # test teststring against the dictionary: True if it exists as an entry
    in_english_dict = teststring in english_words
    if in_english_dict:
        sequential_indiv_word_list.append(teststring)
        teststring = ''
#at the end just assemble a sentence from the pieces of sequential_indiv_word_list by putting a space between each word
print(' '.join(sequential_indiv_word_list))
There are some more issues to be worked out. If the accumulated string never matches a dictionary entry, this will obviously not work, because it just keeps adding more characters without ever finding a match. However, since your demo string has some spaces, you could have it recognize these too and automatically start over at each of them.
Also you need to account for punctuation, write conditionals like
if cur_char == ',' or cur_char == '.':
    #do action to start new "word" automatically