Alright, this one's a bit of a pain. I'm doing some scraping with Python, trying to get an address out of a few lines of poorly tagged HTML. Here's a sample of the format:
256-555-5555<br/>
1234 Fake Ave S<br/>
Gotham (Lower Ward)<br/>
I'd like to retrieve only 1234 Fake Ave S, Gotham. Any ideas? I've been doing regexes all night and now my brain is mush...
Edit:
More detail about the possible scenarios for how the data will arrive. Sometimes the first line will be there, sometimes not. All of the addresses I have seen have Ave, Way, or St in them, although I would prefer not to use that as a factor in the selection, as I am not certain they will always be that way. The second and third lines are always present.
What I had in mind was something that:
Selects everything on 2nd to last line (so, second line if there are three lines, first line if just two when there isn't a phone number).
Selects everything on last line that isn't in parentheses.
Combine the 2nd to last line and last line, adding a ", " in between the two.
I'm using Scrapy to acquire the HTML code. The address is all in the same div, I want to use regex to further break the data up into appropriate sections. Now how to do that is what I'm unable to figure out.
Edit2:
As per Ofir's comment, I should mention that I have already made expressions to isolate the phone number and parentheses section.
Phone (or possible email or website):
((1[-. ])?[0-9]{3}[-. ])?\(?([0-9]{3}[-. ][A?([0-9]{4})|([\w\.-]+@[\w\.-]+)|(www.+)|([\w\.-]*(?:com|net|org|us))
parentheses:
\((.*?)\)
I'm not sure how to use those to construct an everything-but-these statement.
It is possible that in your case it is easier to focus on what you don't want:
html tags (<br>)
phone numbers
everything in parentheses
Each of which can be matched easily with simple regular expressions, making it easy to construct one to match the rest (presumably - the address)
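A minimal sketch of that "remove what you don't want" idea, under some assumptions: the PHONE pattern below is a simplified stand-in (not the asker's full expression), and address_from is a hypothetical helper name. It only shows the shape of the approach.

```python
import re

# Simplified patterns for the three kinds of unwanted text (assumptions).
TAG = re.compile(r"<br\s*/?>")
PHONE = re.compile(r"(?:1[-. ])?\(?\d{3}\)?[-. ]\d{3}[-. ]\d{4}")
PARENS = re.compile(r"\s*\(.*?\)")

def address_from(block):
    """Drop tags, phone numbers and parenthesised text; join what is left."""
    kept = []
    for raw in TAG.split(block):
        line = PARENS.sub("", PHONE.sub("", raw)).strip()
        if line:  # skip lines that are empty after cleaning
            kept.append(line)
    return ", ".join(kept)

sample = "256-555-5555<br/>\n1234 Fake Ave S<br/>\nGotham (Lower Ward)<br/>\n"
print(address_from(sample))
```

Email or website lines would need their own patterns added in the same way.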
This attempts to isolate the last two lines out of the string:
>>> s="""256-555-5555<br/>
... 1234 Fake Ave S<br/>
... Gotham (Lower Ward)<br/>
... """
>>> import re
>>> m = re.search(r'((?!<br/>).*)<br/>\n((?!<br/>).*)<br/>$', s)
>>> print(m.group(1))
1234 Fake Ave S
Trimming the parentheses is probably best left to a separate line of code, rather than complicating the regular expression further.
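For comparison, here is a hedged sketch that follows the three steps from the question without one large expression: split on the <br/> tags, keep the last two non-empty lines, and trim the parentheses separately. The function name is hypothetical, and it assumes the block always holds at least two non-empty lines.

```python
import re

def last_two_lines_address(block):
    """Join the second-to-last and last lines, minus any parenthesised part."""
    lines = [ln.strip() for ln in re.split(r"<br\s*/?>", block) if ln.strip()]
    street, city = lines[-2:]  # raises ValueError if fewer than two lines
    city = re.sub(r"\s*\(.*?\)", "", city).strip()
    return street + ", " + city

with_phone = "256-555-5555<br/>\n1234 Fake Ave S<br/>\nGotham (Lower Ward)<br/>\n"
without_phone = "1234 Fake Ave S<br/>\nGotham (Lower Ward)<br/>\n"
print(last_two_lines_address(with_phone))
print(last_two_lines_address(without_phone))
```

Because it counts from the end, the optional first line (phone, email or website) never needs to be recognised.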
As far as I understand your problem, I think you are going about it the wrong way.
Regexes are not a magical tool that can extract pertinent data from a jumble of undifferentiated text. A regex can only extract data from text that has variable parts plus a minimum of stable structure, which acts as anchors relative to which the variable parts can be located.
It seems to me that you first isolated the part containing a possible phone number followed by the address on one or two lines. But in doing so you lost information: what comes before and after that part is anchoring information, and you shouldn't try to find something in the section that remains after eliminating it.
Moreover, I presume you don't only want to catch a phone number and an address: you may want to extract other pieces of information lying before and after this section. With a well-shaped regex, you could capture all the pieces in one shot.
So, please, give more of the text, with enough characters before and after the limited section to allow writing a correct and simpler regex strategy that catches all the data you want. triplee has already asked you for that and you didn't provide it. Why?
I am trying to remove anything starting with \ud
My text:
onceuponadollhouse: "Iconic apart and better together \ud83d\udc6fâ€â™€ï¸The Caboodles® x Barbieâ„¢ collection has us thinking about our Doll Code \ud83c\udf80 We stand for one another by sharing our lessons
The answer I am looking for:
onceuponadollhouse: "Iconic apart and better together â€â™€ï¸The Caboodles® x Barbieâ„¢ collection has us thinking about our Doll Code We stand for one another by sharing our lessons
So the ideal way would be to take a step back, work out where in the process the encoding is getting mangled, then fix it. Somehow you're getting (a) surrogate pairs, which are the pairs of characters starting with \ud; and (b) UTF-8 interpreted as Latin-1 or some similar encoding, like the â„¢ after "Barbie".
Taking a step back and making sure that your input text is interpreted correctly would be ideal; here you're losing the emojis "woman with bunny ears" and "ribbon"; another time it might be somebody's name or other piece of important information.
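If the mangled text is all you have, a partial repair is sometimes possible. The sketch below rests on two assumptions that may not hold for your data: the \ud... characters are intact UTF-16 surrogate pairs, and the mojibake came from UTF-8 bytes decoded as Windows-1252. Both helper names are hypothetical.

```python
def join_surrogates(text):
    """Recombine UTF-16 surrogate pairs into the code points they encode."""
    # 'surrogatepass' lets the lone surrogates through the encoder; decoding
    # then pairs them up. An unpaired surrogate raises UnicodeDecodeError.
    return text.encode("utf-16", "surrogatepass").decode("utf-16")

def undo_mojibake(text):
    """Reverse 'UTF-8 bytes read as Windows-1252' (assumed cause)."""
    return text.encode("cp1252").decode("utf-8")

# ascii() keeps the output printable on consoles that cannot show emoji.
print(ascii(join_surrogates("better together \ud83d\udc6f")))
print(ascii(undo_mojibake("Barbieâ„¢")))
```

The second helper will fail on fragments where bytes were lost along the way, which looks likely for parts of the sample text, so treat it as a diagnostic rather than a fix.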
If you're in a situation where you can't do it properly, and you need to strip the surrogate pairs, you can use re.sub:
import re
text = 'onceuponadollhouse: "Iconic apart and better together \ud83d\udc6fâ€â™€ï¸The Caboodles® x Barbieâ„¢ collection has us thinking about our Doll Code \ud83c\udf80 We stand for one another by sharing our lessons'
stripped = re.sub('[\ud800-\udfff]+', '', text)
print(stripped)
Depending on your purpose, it might be useful to replace those characters with a placeholder; since they always come in pairs, you might do something like this:
import re
text = 'onceuponadollhouse: "Iconic apart and better together \ud83d\udc6fâ€â™€ï¸The Caboodles® x Barbieâ„¢ collection has us thinking about our Doll Code \ud83c\udf80 We stand for one another by sharing our lessons'
stripped = re.sub('[\ud800-\udfff]{2}', '<unknown character>', text)
print(stripped)
Check out the emot Python package. I discovered it this morning from this article: https://towardsdatascience.com/5-python-libraries-that-you-dont-know-but-you-should-fd6f810773a7
The examples given in the documentation only interpret emoticons and emojis, but it also gives their location, so it wouldn't be too much of a stretch to replace them.
My project (unrelated to this question, just context) is a ML classifier, I'm trying to improve it and have found that when I stripped URLS from the text given to it, some of the URLS have been broken by spaces. For example:
https:// twitter.com/username/sta tus/ID
After I remove links that are not broken, I am left with things like www website com. I removed those with the following regular expression in Python:
tweet = re.sub('(www|http).*?(org |net |edu |com |be |tt |me |ms )','',tweet);
I've put a space after every one of them because this happens after the regular strip and text processing (so only working with parts of a URL separated by spaces) and theoretically we should only pick up the remainders of a broken link... not something like
http website strangeTLD .... communication
It's not perfect but it works, however I just thought that I might try to preemptively remove URLS from twitter only, since I know that the spaces that break the regular URL strip will always be in the same places, hoping this improves my classifier accuracy? This will get rid of the string of characters that occurs after a link... specifically pictures, which is a lot of my data.
Specifically, is there a way to select the entity surrounding/after:
pic.twitter.com/
or, in reference to the example I gave earlier, select the entity after the username broken by the space in status (I'm just guessing at this regex)...
http.*?twitter.com/*?/sta tus/
Thank you in advance! And for the record, I was given this dataset to work with; I am not sure why the URLs are almost all broken by spaces.
Yes, what you are talking about is called Positive Lookbehind and works using (?<=...), where the ellipsis should be replaced by what you want to skip.
E.g. if you want to select whatever comes after username in https://twitter.com/username/status/ID, just use
(?<=https:\/\/twitter\.com\/username\/).*
and you will get status/ID, like you can see with this live demo.
In this case I escaped the slashes / with backslashes; regex flavours that use / as a delimiter require this, while in Python's re module it is optional but harmless. I also used the Kleene star operator, i.e. the asterisk, to match any number of occurrences of . (any character), just like you did.
What a positive lookbehind combination does is specifying some mandatory text before the current position of your cursor; in other words, it puts the cursor after the expression you feed it (if the said text exists).
Of course this is not enough in your case, since username won't be a fixed string but a variable one. That is an extra complication, because lookbehinds in Python's re module do not work with variable-length patterns.
So you can just skip https://twitter.com/
(?<=https:\/\/twitter\.com\/).*
And then, via Python, create a substring
currentText = "username/status/ID"
result = currentText.split("/",1)[1] # returns status/ID
Test it in this demo (click "Execute"); a simple explanation of how this works is in the answer to this question (in short, you just split the string at the first slash character).
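Put together in Python, the two steps might look like the sketch below. The URL is the placeholder from the example above, and the capture-group variant at the end is an alternative I am adding for comparison, not part of the original suggestion.

```python
import re

url = "https://twitter.com/username/status/ID"

# Step 1: fixed-width lookbehind skips the scheme and host.
rest = re.search(r"(?<=https://twitter\.com/).*", url).group(0)

# Step 2: split once at the first slash to drop the variable username.
after_username = rest.split("/", 1)[1]
print(after_username)  # status/ID

# Alternative: a capture group copes with the variable-length username
# directly, so no lookbehind is needed.
same_result = re.search(r"https://twitter\.com/[^/]+/(.*)", url).group(1)
print(same_result)  # status/ID
```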
As a sidenote, blanks/spaces aren't allowed in URLs and if necessary are usually encoded as %20 or + (see e.g. this answer). In other words, every URL you got can be safely stripped of spaces before processing, so... why didn't they do it?
I'm trying to write code that parses a large text file. However, in order to get said text file, I run the original PDF file through pdfminer. While this works, it also returns my text file with many random spaces (see below)
SM ITH , JO HN , PHD
1234 S N O RT H AV E
Is there any easy way in Python to remove only certain spaces so words aren't separated? For the sample above, I want it to look like
SMITH, JOHN, PHD
1234 S NORTH AVE
Thanks.
Most likely what you're trying to do is impossible to do perfectly, and very hard to do well enough to satisfy you. I'll explain below.
But there's a good chance you shouldn't be doing it in the first place. pdfminer is highly configurable, and something like just specifying a smaller -M value will give you the text you wanted in the first place. You'll need to do a bit of trial and error, but if this works, it'll be far easier than trying to post-process things after the fact.
If you want to do this, you need to come up with a rule that determines which spaces are "random extra spaces" and which are real spaces before you can code that in Python. And I don't know that there is any such rule.
In your example, you can handle most of them by just turning multiple spaces into single spaces, and single spaces into nothing. It should be obvious how to do that. Even if you can't think of a clever solution, a triple replace works fine:
s = re.sub(r'\s\s+', r'<space>', s)
s = re.sub(r'\s', r'', s)
s = re.sub(r'<space>', r' ', s)
However, this rule isn't quite right, because in JO HN , PHD, the space after the comma isn't a random extra space, but it's not showing up as two or more spaces. And the same for the space in "1234 S". And, most likely, the same thing is true in lots of other cases for your real data.
A different somewhat close rule is that you only remove single spaces between letters. Again, if that works, it's easy to code. For example:
s = re.sub(r'(\w)\s(\w)', r'\1\2', s)
s = re.sub(r'\s+', r' ', s)
But now that leaves a space before the comma after SMITH and JOHN.
Maybe you need to put in a little information about English punctuation—strip the spaces around punctuation, then add back in the spaces after a comma or period, around quotes, etc.
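A rough sketch of that punctuation-aware rule, assuming the only punctuation that matters is , . ; : and that single spaces between word characters are always spurious. The function name tidy is hypothetical.

```python
import re

def tidy(s):
    """Join split words, then normalise spacing around punctuation."""
    s = re.sub(r"(\w)\s(\w)", r"\1\2", s)     # remove single spaces inside words
    s = re.sub(r"\s+([,.;:])", r"\1", s)      # no space before punctuation
    s = re.sub(r"([,.;:])(?=\S)", r"\1 ", s)  # one space after punctuation
    return re.sub(r"\s+", " ", s).strip()     # collapse whatever is left

print(tidy("SM ITH , JO HN , PHD"))  # SMITH, JOHN, PHD
```

This handles the name line from the question, but the address line still comes out wrong, because the real space in "1234 S" looks exactly like a spurious one.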
Or… well, nobody but you can know what your data look like and figure it out.
If you can't come up with a good rule, the only option is to build some complicated heuristics around looking up possible words in a dictionary and guessing which one is more likely—which still won't get everything right (e.g., how do you know whether "B OO K M AR K" is "BOOK MARK" or "BOOKMARK"?), but it's the best you could possibly do.
What you are trying to do is impossible, e.g., should "DESK TOP" be "DESK TOP" or "DESKTOP"?
I have a Python script that we're using to parse CSV files with user-entered phone numbers in it - ergo, there are quite a few weird format/errors. We need to parse these numbers into their separate components, as well as fix some common entry errors.
Our phone numbers are for Sydney or Melbourne (Australia), or Auckland (New Zealand), given in international format.
Our standard Sydney number looks like:
+61(2)8328-1972
We have the international prefix +61, followed by a single digit area code in brackets, 2, followed by the two halves of the local component, separated by a hyphen, 8328-1972.
Melbourne numbers simply have 3 instead of 2 in the area code, e.g.
+61(3)8328-1972
The Auckland numbers are similar, but they have a 7-digit local component (3 then 4 numbers), instead of the normal 8 digits.
+64(9)842-1000
We also have matches for a number of common errors. I've separated the regex expressions into their own class.
import re

class PhoneNumberFormats():
    """Provides compiled regex objects for different phone number formats. We put these in their own class for performance reasons - there's no point recompiling the same pattern for each Employee"""
    standard_format = re.compile(r'^\+(?P<intl_prefix>\d{2})\((?P<area_code>\d)\)(?P<local_first_half>\d{3,4})-(?P<local_second_half>\d{4})')
    extra_zero = re.compile(r'^\+(?P<intl_prefix>\d{2})\(0(?P<area_code>\d)\)(?P<local_first_half>\d{3,4})-(?P<local_second_half>\d{4})')
    # No leading 0 in the area code here: this pattern only covers the missing hyphen.
    missing_hyphen = re.compile(r'^\+(?P<intl_prefix>\d{2})\((?P<area_code>\d)\)(?P<local_first_half>\d{3,4})(?P<local_second_half>\d{4})')
    space_instead_of_hyphen = re.compile(r'^\+(?P<intl_prefix>\d{2})\((?P<area_code>\d)\)(?P<local_first_half>\d{3,4}) (?P<local_second_half>\d{4})')
We have one for standard_format numbers, then others for various common error cases, e.g. putting an extra zero before the area code (02 instead of 2), or a missing hyphen in the local component (e.g. 83281972 instead of 8328-1972), etc.
We then call these from cascaded if/elifs:
def clean_phone_number(self):
    """Perform some rudimentary checks and corrections, to make sure numbers are in the right format.
    Numbers should be in the form 0XYYYYYYYY, where X is the area code, and Y is the local number."""
    if not self.telephoneNumber:
        self.PHFull = ''
        self.PHFull_message = 'Missing phone number.'
    else:
        if PhoneNumberFormats.standard_format.search(self.telephoneNumber):
            result = PhoneNumberFormats.standard_format.search(self.telephoneNumber)
            self.PHFull = '0' + result.group('area_code') + result.group('local_first_half') + result.group('local_second_half')
            self.PHFull_message = ''
        elif PhoneNumberFormats.extra_zero.search(self.telephoneNumber):
            result = PhoneNumberFormats.extra_zero.search(self.telephoneNumber)
            self.PHFull = '0' + result.group('area_code') + result.group('local_first_half') + result.group('local_second_half')
            self.PHFull_message = 'Extra zero in area code - ask user to remediate.'
        elif PhoneNumberFormats.missing_hyphen.search(self.telephoneNumber):
            result = PhoneNumberFormats.missing_hyphen.search(self.telephoneNumber)
            self.PHFull = '0' + result.group('area_code') + result.group('local_first_half') + result.group('local_second_half')
            self.PHFull_message = 'Missing hyphen in local component - ask user to remediate.'
        elif PhoneNumberFormats.space_instead_of_hyphen.search(self.telephoneNumber):
            # Search with the pattern that was just tested, not missing_hyphen.
            result = PhoneNumberFormats.space_instead_of_hyphen.search(self.telephoneNumber)
            self.PHFull = '0' + result.group('area_code') + result.group('local_first_half') + result.group('local_second_half')
            self.PHFull_message = 'Space instead of hyphen in local component - ask user to remediate.'
        else:
            self.PHFull = ''
            self.PHFull_message = 'Number didn\'t match recognised format. Original text is: ' + self.telephoneNumber
My aim is to make the matching as tight as possible, yet still at least catch the common errors.
There are a number of problems with what I've done above though:
I'm using \d{3,4} to match the first half of the local component. Ideally, however, we only really want to catch a 3-digit first half if it's a New Zealand number (i.e. starts with +64(9)). That way, we can flag Sydney/Melbourne numbers that are missing a digit. I could separate out auckland_number into its own regex pattern in PhoneNumberFormats; however, that means it wouldn't catch a New Zealand number combined with the error cases (extra_zero, missing_hyphen, space_instead_of_hyphen). So unless I recreate versions of them just for Auckland, like auckland_extra_zero, which seems pointlessly repetitive, I can't see how to address this easily.
We don't pick up combinations of errors - e.g. if they have an extra zero and a missing hyphen, we won't pick this up. Is there an easy way to do this using regex, without explicitly creating permutations of the different errors?
I'd like to address the above two issues, and hopefully tighten it up a bit to catch anything that I've missed. Is there a smarter way to do what I've attempted to do above?
Cheers,
Victor
Additional Comments:
The following is just to provide some context:
This script is for a global company, with one office in Sydney, one in Melbourne and one in Auckland.
The numbers come from an internal Active Directory listing of employees (i.e. it's not a customer listing, but our own office phones).
Hence, we're not looking for a general Australian phone number matching script; rather, we're looking for a general script to parse numbers from three specific offices. Generally, it's only the last 4 numbers that should differ.
Mobile phones aren't required.
The script is designed to parse a CSV dump of the Active Directory, and reformat the numbers into an acceptable format for another program (QuickComm)
This program is from a external vendor, and requires numbers in the exact format that I've produced in the code above - that's why the numbers are spat out like 0283433422.
The script I've written can't change the records, it only works on a CSV dump of them - the records are stored in Active Directory, and the only way to access them to get them fixed is to email the employee and ask them to login and change their own records.
So this script is run by a PA, to produce the output required by this program. She/he will also get a list of people who have incorrectly formatted numbers - hence the messages about asking the user to remediate. In theory, there should only be a small number of these. We then email/ring these employees, asking them to fix their records. The script is run once a month (numbers may change), and we also need to flag new employees who manage to enter their records wrongly.
@John Macklin: Are you recommending I scrap regexes, and just try to pull specific-position digits out of the string?
I was looking for a way to catch the common error cases, in combinations (e.g. space instead of hyphen, combined with an extra zero), but is this not easily feasible?
Don't use complicated regexes. Delete EVERYTHING except digits -- non-digits are error-prone cruft. If the third digit is 0, delete it.
Expect 61 followed by valid AUS area code ([23478] for generality NB 4 is for mobiles) then 8 digits
or 64 followed by valid NZL area code (whatever that is) followed by 7 digits. Anything else is bad. In the good stuff, insert the +()- at the appropriate places.
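A hedged sketch of that digits-only approach. The Australian area codes are the ones listed above; the NZ area code 9 is taken from the question's Auckland example rather than from the numbering plan, and normalise is a hypothetical name.

```python
import re

def normalise(raw):
    """Return the number as +CC(A)XXXX-XXXX, or None if it looks wrong."""
    digits = re.sub(r"\D", "", raw)           # keep digits only
    if len(digits) > 2 and digits[2] == "0":  # stray 0 before the area code
        digits = digits[:2] + digits[3:]
    if re.fullmatch(r"61[23478]\d{8}", digits):  # Australia: 8-digit local part
        return "+61({}){}-{}".format(digits[2], digits[3:7], digits[7:])
    if re.fullmatch(r"649\d{7}", digits):        # Auckland: 7-digit local part
        return "+64(9){}-{}".format(digits[3:6], digits[6:])
    return None

print(normalise("+61(02) 8328 1972"))          # +61(2)8328-1972
print(normalise("+64(9)8421000"))              # +64(9)842-1000
print(normalise("+61(3)1234-567890whoopsie"))  # None
```

Combinations of errors (extra zero plus a space instead of a hyphen, say) need no special handling, because the punctuation is discarded before anything is checked.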
By the way (1) area code 2 is for the whole of NSW+ACT, not just Sydney, 3 is for VIC+TAS (2) lots of people these days don't have landlines, just mobiles, and people tend to retain the same mobile phone number longer than they maintain the same landline phone number or the same postal address, so mobile phone number is great for fuzzy matching customer records -- so I'm more than a little curious why you don't include them.
The following tell you all you ever wanted to know, plus a whole lot more, about the Australian and New Zealand phone numbering schemes.
Comment on the regexes:
(1) You are using the search method with a "^" prefix. Using the match method with no prefix is somewhat less inelegant.
(2) You don't seem to be checking for trailing rubbish in your phone number field:
>>> import re
>>> standard_format = re.compile(r'^\+(?P<intl_prefix>\d{2})\((?P<area_code>\d)\)(?P<local_first_half>\d{3,4})-(?P<local_second_half>\d{4})')
>>> m = standard_format.search("+61(3)1234-567890whoopsie")
>>> m.groups()
('61', '3', '1234', '5678')
>>>
You may like to (a) end some of your regexes with \Z (NOT $) so that they don't match OK when there is trailing rubbish or (b) introduce another group to catch trailing rubbish.
and a social engineering comment: Have you yet tested the user reaction to a staff member carrying out this directive: "Space instead of hyphen in local component - ask user to remediate"? Can't the script just fix it and carry on?
and some comments on the code:
the self.PHFull code
(a) is terribly repetitive (if you must have regexes put them in a list with corresponding action codes and error messages and iterate over the list)
(b) is the same for "error" cases as for standard cases (so why are you asking the users to "remediate"???)
(c) throws away the country code and substitutes a 0 i.e. your standard +61(2)1234-5678 is being kept as 0212345678 aarrgghhh ... even if you have the country stored with the address that's no good if an NZer migrates to Aus and the address gets updated but not the phone number and please don't say that you are relying on the current (no NZ customers outside the Auckland area???) non-overlap of area codes ...
Update after full story revealed
Keep it SIMPLE for both you and the staff. Instructions to staff using Active Directory should be (depending on which office) "Fill in +61(2)9876-7 followed by your 3-digit extension number". If they can't get that right after a couple of attempts, it's time they got the DCM.
So you use one regex per office, filling in the constant part, so that say the SYD offices have numbers of the form +61(2)9876-7ddd you use the regex r"\+61\(2\)9876-7\d{3,3}\Z". If a regex matches, then you remove all non-digits and use "0" + the_digits[2:] for the next app. If no regexes match, send a rocket.
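As a sketch of the one-regex-per-office idea: only the SYD pattern below comes from the example above; the MEL and AKL prefixes are made up for illustration, and to_vendor_format is a hypothetical name.

```python
import re

# One anchored pattern per office. \Z rejects trailing rubbish.
OFFICE_PATTERNS = [
    re.compile(r"\+61\(2\)9876-7\d{3}\Z"),  # SYD, from the example above
    re.compile(r"\+61\(3\)9123-4\d{3}\Z"),  # MEL, hypothetical prefix
    re.compile(r"\+64\(9\)842-1\d{3}\Z"),   # AKL, hypothetical prefix
]

def to_vendor_format(number):
    """Return '0' + area code + local digits, or None if no office matches."""
    if any(p.match(number) for p in OFFICE_PATTERNS):
        digits = re.sub(r"\D", "", number)  # strip the + ( ) - punctuation
        return "0" + digits[2:]             # drop the two-digit country code
    return None

print(to_vendor_format("+61(2)9876-7123"))   # 0298767123
print(to_vendor_format("+61(2)9876-7123x"))  # None
```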
+1 for #John Machin's recommendations.
The World Telephone Number Guide is quite useful for national numbering plans, especially the exceptions.
The ITU has freely available standards for lots of stuff too.
Phone numbers are formatted that way to make them easier to remember for people-- there's no reason that I can see for storing them like that. Why not split by commas and parse each number by simply ignoring anything that's not a digit?
>>> import string
>>> def parse_number(number):
...     n = ''
...     for x in number:
...         if x in string.digits:
...             n += x
...     return n
Once you've got it like that you can do verification based on the intl prefix and area code (e.g. if the 3rd digit is 9, the Auckland area code, then there should be 7 more digits, etc.).
After it's verified, splitting into components is easy. The first two digits are the prefix, the next is the area code, etc. You can do a check for all the common mistakes without using regex. Outputting is also pretty easy in this case.
I'm trying to come up with a method of finding duplicate addresses, based on a similarity score. Consider these duplicate addresses:
addr_1 = '# 3 FAIRMONT LINK SOUTH'
addr_2 = '3 FAIRMONT LINK S'
addr_3 = '5703 - 48TH AVE'
addr_4 = '5703- 48 AVENUE'
I'm planning on applying some string transformation to make long words abbreviated, like NORTH -> N, remove all spaces, commas and dashes and pound symbols. Now, having this output, how can I compare addr_3 with the rest of addresses and detect similar? What percentage of similarity would be safe? Could you provide a simple python code for this?
addr_1 = '3FAIRMONTLINKS'
addr_2 = '3FAIRMONTLINKS'
addr_3 = '570348THAV'
addr_4 = '570348AV'
Thankful,
Eduardo
First, simplify the address string by collapsing all whitespace to a single space between each word, and forcing everything to lower case (or upper case if you prefer):
adr = " ".join(adr.lower().split())
Then, I would strip out things like "st" in "41st Street" or "nd" in "42nd Street":
adr = re.sub(r"1st(\b|$)", r'1', adr)
adr = re.sub(r"([2-9])\s?nd(\b|$)", r'\1', adr)
Note that the second sub() will work with a space between the "2" and the "nd", but I didn't set the first one to do that; because I'm not sure how you can tell the difference between "41 St Ave" and "41 St" (that second one is "41 Street" abbreviated).
Be sure to read all the help for the re module; it's powerful but cryptic.
Then, I would split what you have left into a list of words, and apply the Soundex algorithm to list items that don't look like numbers:
http://en.wikipedia.org/wiki/Soundex
http://wwwhomes.uni-bielefeld.de/gibbon/Forms/Python/SEARCH/soundex.html
adrlist = [word if word.isdigit() else soundex(word) for word in adr.split()]
Then you can work with the list or join it back to a string as you think best.
The whole idea of the Soundex thing is to handle misspelled addresses. That may not be what you want, in which case just ignore this Soundex idea.
Good luck.
Removing spaces, commas and dashes will be ambiguous. It would be better to replace them with a single space.
Take for example this address
56 5th avenue
And this
5, 65th avenue
with your method both of them will be:
565THAV
What you can do is write a good address shortening algorithm and then use string comparison to detect duplicates. This should be enough to detect duplicates in the general case. A general similarity algorithm won't work, because a difference of one number can mean a huge change in addresses.
The algorithm can go like this:
Replace all commas and dashes with spaces. Use the translate method for that.
Build a dictionary with words and their abbreviated form
Remove the TH part if it was following a number.
This should be helpful in building your dictionary of abbreviations:
https://pe.usps.com/text/pub28/28apc_002.htm
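The steps above can be sketched as follows. The abbreviation table is a tiny illustrative subset, not the USPS list; treating # as punctuation comes from the question; and stripping ST/ND/RD as well as TH is my extension of the third step.

```python
import re

# Tiny illustrative subset; a real table would be built from the USPS list.
ABBREVIATIONS = {"NORTH": "N", "SOUTH": "S", "EAST": "E", "WEST": "W",
                 "AVENUE": "AVE", "STREET": "ST"}
PUNCT_TO_SPACE = str.maketrans({",": " ", "-": " ", "#": " "})

def shorten(address):
    """Normalise an address to upper-case, space-separated short forms."""
    words = address.upper().translate(PUNCT_TO_SPACE).split()
    # Drop the ordinal suffix only when it follows a number (48TH -> 48).
    words = [re.sub(r"^(\d+)(ST|ND|RD|TH)$", r"\1", w) for w in words]
    return " ".join(ABBREVIATIONS.get(w, w) for w in words)

print(shorten("# 3 FAIRMONT LINK SOUTH") == shorten("3 FAIRMONT LINK S"))  # True
print(shorten("5703 - 48TH AVE") == shorten("5703- 48 AVENUE"))            # True
print(shorten("56 5th avenue") == shorten("5, 65th avenue"))               # False
```

Keeping the spaces is what separates the last pair, which collapse to the same string when all separators are removed.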
I regularly inspect addresses for duplication where I work, and I have to say, I find Soundex highly unsuitable. It's both too slow and too eager to match things. I have similar issues with Levenshtein distance.
What has worked best for me is to sanitize and tokenize the addresses (get rid of punctuation, split things up into words) and then just see how many tokens match up. Because addresses typically have several tokens, you can develop a level of confidence in terms of a combination of (1) how many tokens were matched, (2) how many numeric tokens were matched, and (3) how many tokens are available. For example, if all tokens in the shorter address are in the longer address, the confidence of a match is pretty high. Likewise, if you match 5 tokens including at least one that's numeric, even if the addresses each have 8, that's still a high-confidence match.
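A minimal sketch of the token comparison, with hypothetical function names. It only gathers the three counts mentioned above plus a subset check; how to turn them into a confidence level is left to you, since the thresholds depend on the data.

```python
import re

def tokens(address):
    """Upper-case the address and split it into alphanumeric tokens."""
    return set(re.findall(r"[A-Z0-9]+", address.upper()))

def match_stats(a, b):
    """Collect the counts a confidence rule could be built on."""
    ta, tb = tokens(a), tokens(b)
    shared = ta & tb
    return {
        "shared": len(shared),                               # tokens matched
        "shared_numeric": sum(t.isdigit() for t in shared),  # numeric matches
        "available": min(len(ta), len(tb)),                  # tokens available
        "subset": ta <= tb or tb <= ta,                      # shorter inside longer
    }

print(match_stats("# 3 FAIRMONT LINK SOUTH", "3 FAIRMONT LINK S"))
```

Here three of four tokens match, one of them numeric; substituting SOUTH with S first would make one address a subset of the other.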
It's definitely useful to do some tweaking, like substituting some common abbreviations. The USPS lists help, though I wouldn't go gung-ho trying to implement all of them, and some of the most valuable substitutions aren't on those lists. For example, 'JFK' should be a match for 'JOHN F KENNEDY', and there are a number of common ways to shorten 'MARTIN LUTHER KING JR'.
Maybe it goes without saying but I'll say it anyway, for completeness: Don't forget to just do a straight string comparison on the whole address before messing with more complicated things! This should be a very cheap test, and thus is probably a no-brainer first pass.
Obviously, the more time you're willing and able to spend (both on programming/testing and on run time), the better you'll be able to do. Fuzzy string matching techniques (faster and less generalized kinds than Levenshtein) can be useful, as a separate pass from the token approach (I wouldn't try to fuzzy match individual tokens against each other). I find that fuzzy string matching doesn't give me enough bang for my buck on addresses (though I will use it on names).
In order to do this right, you need to standardize your addresses according to USPS standards (your address examples appear to be US based). There are many direct marketing service providers that offer CASS (Coding Accuracy Support System) certification of postal addresses. The CASS process will standardize all of your addresses and append zip + 4 to them. Any undeliverable addresses will be flagged which will further reduce your postal mailing costs, if that is your intent. Once all of your addresses are standardized, eliminating duplicates will be trivial.
I had to do this once. I converted everything to lowercase, computed each address's Levenshtein distance to every other address, and ordered the results. It worked very well, but it was quite time-consuming.
You'll want to use an implementation of Levenshtein in C rather than in Python if you have a large data set. Mine was a few tens of thousands and took the better part of a day to run, I think.
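For reference, a plain-Python version of the distance is short, though as noted it will be slow on large sets. This is the textbook two-row dynamic-programming form, not the code I ran at the time.

```python
def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or keep
        prev = cur
    return prev[-1]

print(levenshtein("570348thav", "570348av"))  # 2
```

Comparing every address with every other one is quadratic in the number of addresses, which is where most of the run time goes.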