Consider the following original strings showed in the first columns of the following table:
Original String Parsed String Desired String
'W. & J. JOHNSON LMT.COM' #W J JOHNSON LIMITED #WJ JOHNSON LIMITED
'NORTH ROOF & WORKS CO. LTD.' #NORTH ROOF WORKS CO LTD #NORTH ROOF WORKS CO LTD
'DAVID DOE & CO., LIMITED' #DAVID DOE CO LIMITED #DAVID DOE CO LIMITED
'GEORGE TV & APPLIANCE LTD.' #GEORGE TV APPLIANCE LTD #GEORGE TV APPLIANCE LTD
'LOVE BROS. & OTHERS LTD.' #LOVE BROS OTHERS LTD #LOVE BROS OTHERS LTD
'A. B. & MICHAEL CLEAN CO. LTD.'#A B MICHAEL CLEAN CO LTD #AB MICHAEL CLEAN CO LTD
'C.M. & B.B. CLEANER INC.' #C M B B CLEANER INC #CMBB CLEANER INC
Punctuation needs to be removed which I have done as follows:
def transform(word):
word = re.sub(r'(?<=[A-Za-z])\'(?=[A-Za-z])[A-Z]|[^\w\s]|(.com|COM)',' ',word)
However, there is one last point which I have not been able to get. After removing punctuations I ended up with lots of spaces. How can I have a regular expression that put together initials and keep single spaces for regular words (no initials)?
Is this a bad approach to substitute the mentioned characters to get the desired strings?
Thanks for allowing me to continue learning :)
I think it's simpler to do this in parts. First, remove .com and any punctuation other than space or &. Then, remove a space or & surrounded by only one letter. Finally, replace any remaining sequence of space or & with a single space:
import re
strings = ['W. & J. JOHNSON LMT.COM',
'NORTH ROOF & WORKS CO. LTD.',
'DAVID DOE & CO., LIMITED',
'GEORGE TV & APPLIANCE LTD.',
'LOVE BROS. & OTHERS LTD.',
'A. B. & MICHAEL CLEAN CO. LTD.',
'C.M. & B.B. CLEANER INC.'
]
for s in strings:
s = re.sub(r'\.COM|[^a-zA-Z& ]+', '', s, 0, re.IGNORECASE)
s = re.sub(r'(?<=\b\w)\s*[ &]\s*(?=\w\b)', '', s)
s = re.sub(r'\s*[& ]\s*', ' ', s)
print s
Output
WJ JOHNSON LMT
NORTH ROOF WORKS CO LTD
DAVID DOE CO LIMITED
GEORGE TV APPLIANCE LTD
LOVE BROS OTHERS LTD
AB MICHAEL CLEAN CO LTD
CM BB CLEANER INC
Demo on rextester
Update
This was written before the edit to the question changing the required result for the last data. Given the edit, the above code can be simplified to
for s in strings:
s = re.sub(r'\.COM|[^a-zA-Z ]+|\s(?=&)|(?<!\w\w)\s+(?!\w\w)', '', s, 0, re.IGNORECASE)
print s
Demo on rextester
Doing this in regex alone won't be pretty and is not the best solution, yet, here it is! You're better off doing a multiple step approach. What I've done is identified all the cases that are possible and opted to find a solution where there's no replacement string since you're not always replacing the characters with spaces.
Rules
Non "Stacked" Abbreviations
These are locations like A. B. or W. & J., but not C.M. & B.B.
I've identified these as locations where an abbreviation part (e.g. A.) exists before and after, but the latter is not followed by another alpha character
Preceding Space
These locations don't exist in your text but could if a space preceded a non-alpha character without a space following it (say at the end of a line)
We match the characters after the first space in these cases
Proceeding Space
These are locations like & and the dot in J.
We match the character before the last space in those examples
No Spaces
These are locations like 'LOVE (the apostrophe in that string)
We only match the non-alpha-non-whitespace characters
Regex
An all-in-one regex that accomplishes this is as follows:
See regex in use here
(?<=\b[a-z])[^a-z]+(?=[a-z]\b(?![^a-z][a-z]))|(?<= ) *(?:\.com\b|[^a-z\s]+) *| *(?:\.com\b|[^a-z\s]+) *(?= )|(?<! )(?:\.com\b|[^a-z\s]+)(?! )
Works as follows (broken into each alternation):
(?<=\b[a-z])[^a-z]+(?=[a-z]\b(?![^a-z][a-z])) matches non-alpha characters between A. and B. but not A. and B.B
(?<=\b[a-z]) positive lookbehind ensuring what precedes is an alpha character and assert a word boundary position to its left
[^a-z]+ match any non-alpha character one or more times
(?=[a-z]\b(?![^a-z][a-z])) positive lookahead ensuring the following exists
[a-z]\b match any alpha character and assert a word boundary position to its right
(?![^a-z][a-z]) negative lookahead ensuring what follows is not a non-alpha character followed by an alpha character
(?<= ) *(?:\.com\b|[^a-z\s]+) * ensures a space precedes, then matches any spaces, .com or any non-word-non-whitespace characters one or more times, then any spaces
(?<= ) positive lookbehind ensuring a space precedes
* match any number of spaces
(?:\.com\b|[^a-z\s]+) match .com and ensure a non-word character follows, or match any non-word-non-whitespace character one or more times
* match any number of spaces
*(?:\.com\b|[^a-z\s]+) *(?= ) matches any spaces, .com or any non-word-non-whitespace characters one or more times, then any spaces, then ensures a space follows
Same as previous but instead of the positive lookbehind at the beginning, there's a positive lookahead at the end
(?<! )(?:\.com\b|[^a-z\s]+)(?! ) matches .com or any non-alpha-non-whitespace characters one or more times ensuring no spaces surround it
Same as previous two options but uses negative lookbehind and negative lookahead
Code
See code in use here
import re
strings = [
"'W. & J. JOHNSON LMT.COM'",
"'NORTH ROOF & WORKS CO. LTD.'",
"'DAVID DOE & CO., LIMITED'",
"'GEORGE TV & APPLIANCE LTD.'",
"'LOVE BROS. & OTHERS LTD.'",
"'A. B. & MICHAEL CLEAN CO. LTD.'",
"'C.M. & B.B. CLEANER INC.'"
]
r = re.compile(r'(?<=\b[a-z])[^a-z]+(?=[a-z]\b(?![^a-z][a-z]))|(?<= ) *(?:\.com\b|[^a-z\s]+) *| *(?:\.com\b|[^a-z\s]+) *(?= )|(?<! )(?:\.com\b|[^a-z\s]+)(?! )', re.IGNORECASE)
def transform(word):
return re.sub(r, '', word)
for s in strings:
print(transform(s))
Outputs:
WJ JOHNSON LMT
NORTH ROOF WORKS CO LTD
DAVID DOE CO LIMITED
GEORGE TV APPLIANCE LTD
LOVE BROS OTHERS LTD
AB MICHAEL CLEAN CO LTD
CM BB CLEANER INC
Edit
Using a callback, you can extend this logic to include special cases as mentioned in a comment below my answer to match specific cases and have conditional replacements.
These special cases include:
FONTAINE'S to FONTAINE
PREMIUM-FIT AUTO to PREMIUM FIT AUTO
62325 W.C. to 62325 WC
I added a new alternation to the regex: (\b[\'-]\b(?:[a-z\d] )?) to capture 'S or - between letters (also -S or similar) and replace it with a space using the callback (if the capture group exists).
I still suggest using multiple regular expressions to accomplish this, but I wanted to show that it is possible with a single pattern.
See code in use here
import re
strings = [
"'W. & J. JOHNSON LMT.COM'",
"'NORTH ROOF & WORKS CO. LTD.'",
"'DAVID DOE & CO., LIMITED'",
"'GEORGE TV & APPLIANCE LTD.'",
"'LOVE BROS. & OTHERS LTD.'",
"'A. B. & MICHAEL CLEAN CO. LTD.'",
"'C.M. & B.B. CLEANER INC.'",
"'FONTAINE'S PREMIUM-FIT AUTO 62325 W.C.'"
]
r = re.compile(r'(?<=\b[a-z\d])[^a-z\d]+(?=[a-z\d]\b(?![^a-z\d][a-z\d]))|(?<= ) *(?:\.com\b|[^a-z\d\s]+) *| *(?:\.com\b|[^a-z\d\s]+) *(?= )|(\b[\'-]\b(?:[a-z\d] )?)|(?<! )(?:\.com\b|[^a-z\d\s]+)(?! )', re.IGNORECASE)
def repl(m):
return ' ' if m.group(1) else ''
for s in strings:
print(r.sub(repl, s))
Here's the simplest I could get it with one regex pattern:
\.COM|(?<![A-Z]{2}) (?![A-Z]{2})|[.&,]| (?>)&
Basically, it removes characters that fit 3 criteria:
Literal ".COM"
Spaces that are not preceded or followed by 2 capital letters
Dots, ampersands, and commas, regardless of where they appear
Spaces followed by ampersands
Demo: https://regex101.com/r/EMHxq9/2
Related
I am trying to get the span of the city name from some addresses, however I am struggling with the required regex. Examples of the address format is below.
flat 1, tower block, 34 long road, Major city
flat 1, tower block, 34 long road, town and parking space
34 short road, village on the river and carpark (7X3 8RG)
The expected text to be captured in each case is "Major city", "town" and "village on the river". The issue is that sometimes "and parking space" or a variant is included in the address. Using a regex such as "(?<=,\s)\w+" would return "town and parking space" in the case of example 2.
The city is always after the last comma of the address.
I have tried to re-work this question but have not successfuly managed to exclude the "and parking space" section.
I have already created a regex that excludes the postcodes this is just included as an answer would ideally allow for that part of the regex to be bolted on the end.
How would I create a regex that starts after the last comma and runs to the end of the address but stops at any "and parking" or postcodes?
You can capture these strings using
,\s*((?:(?!\sand\s)[^,])*)(?=[^,]*$)
,\s*([^,]*?)(?=(?:\sand\s[^,]*)?$)
.*,\s*((?:(?!\sand\s)[^,])*)
.*,\s*([^,]*?)(?=(?:\sand\s[^,]*)?$)
See this regex demo or this regex demo.
Details:
, - a comma ]
\s* - zero or more whitespaces
((?:(?!\sand\s)[^,])*) - Group 1: any char other than a comma, zero or more occurrences, that does not start whitespace + and + whitespace char sequence
(?=[^,]*$) - there must be any zero or more chars other than a comma till end of string.
In Python, you would use
m = re.search(r'.*,\s*([^,]*?)(?=(?:\sand\s[^,]*)?$)', text)
if m:
print(m.group(1))
See the demo:
import re
texts = ['flat 1, tower block, 34 long road, Major city',
'flat 1, tower block, 34 long road, town and parking space',
'34 short road, village on the river and carpark (7X3 8RG)']
rx = re.compile(r'.*,\s*([^,]*?)(?=(?:\sand\s[^,]*)?$)')
for text in texts:
m = re.search(rx, text)
if m:
print(m.group(1))
Output:
Major city
town
village on the river
I would do:
import re
exp = ['flat 1, tower block, 34 long road, Major city',
'flat 1, tower block, 34 long road, town and parking space',
'34 short road, village on the river and carpark (7X3 8RG)']
for e in (re.split(',\s*', x)[-1] for x in exp):
print(re.sub(r'(?:\s+and car.*)|(?:\s+and parking.*)','',e))
Prints:
Major city
town
village on the river
Works like this:
Split the string on ,\s* and take the last portion;
Remove anything from the end of that string that starts with the specified (?:\s+and car.*)|(?:\s+and parking.*)
You can easily add addition clauses to remove with this approach.
Suppose I have a sentence:
Meet me at 201 South First St. at noon
And I want to get the address like this:
South First
What would be the appropriate Regex expression for it ? I currently have this, but it is not working:
x = re.search(r"\d+\s?=([A-Z][a-z]*)\s(Rd.|Dr.|Ave.|St.)",searchstring)
Where searchstring is the sentence. The address is always preceded by 1 or more digits followed by a space and followed by either Rd. Dr. Ave. or St. The address also always starts with a capital letter.
The first group, the part where you try to match the address is [A-Z][a-z]*, it means one uppercase letter followed by any lowercase letters. Probably what you want is any uppercase or lowercase letter or space: [A-Za-z ]*. Also note that the dots in the second group mean any character and not the literal ., so you have to escape it. The solution would look like this:
>>> re.search(r'\d+\s?([A-Za-z ]*)\s+(Rd|Dr|Ave|St)\.', 'Meet me at 201 South First St. at noon')[1]
'South First'
Or just use . to accept anything.
>>> re.search(r'\d+\s?(.*?)\s+(Rd|Dr|Ave|St)\.', 'Meet me at 201 South First St. at noon')[1]
'South First'
You may use
\d+\s*([A-Z].*?)\s+(?:Rd|Dr|Ave|St)\.
See the regex demo.
Details
\d+ - one or more digits
\s* - 0 or more whitespaces
([A-Z].*?) - capturing group #1: an uppercase ASCII letter and then any 0 or more chars other than line break chars as few as possible
\s+ - 1+ whitespaces
(?:Rd|Dr|Ave|St) - Rd, Dr, Ave or St
\. - a dot
See a Python demo:
m = re.search(r'\d+\s*([A-Z].*?)\s+(?:Rd|Dr|Ave|St)\.', text)
if m:
print(m.group(1))
Output: South First.
Here is how:
import re
s = 'Meet me at 201 South First St. at noon'
print(re.findall('(?<=\d )[A-Z].*(?= d.|Dr.|Ave.|St.)', s)[0])
Output:
'South First'
Basically, I want to remove the certain phrase patterns embedded in my text data:
Starts with an upper case letter and ends with an Em Dash "—"
Starts with an Em Dash "—" and ends with a "Read Next"
Say, I've got the following data:
CEBU CITY—The widow of slain human rights lawyer .... citing figures from the NUPL that showed that 34 lawyers had been killed in the past two years. —WITH REPORTS FROM JULIE M. AURELIO AND DJ YAPRead Next
and
Manila, Philippines—President .... but justice will eventually push its way through their walls of impunity, ... —REPORTS FROM MELVIN GASCON, JULIE M. AURELIO, DELFIN T. MALLARI JR., JEROME ANING, JOVIC YEE, GABRIEL PABICO LALU, PATHRICIA ANN V. ROXAS, DJ YAP, AFP, APRead Next
I want to remove the following phrases:
"CEBU CITY—"
"—WITH REPORTS FROM JULIE M. AURELIO AND DJ YAPRead Next"
"Manila, Philippines—"
"—REPORTS FROM MELVIN GASCON, JULIE M. AURELIO, DELFIN T. MALLARI JR., JEROME ANING, JOVIC YEE, GABRIEL PABICO LALU, PATHRICIA ANN V. ROXAS, DJ YAP, AFP, APRead Next"
I am assuming this would be needing two regex for each patterns enumerated above.
The regex: —[A-Z].*Read Next\s*$ may work on the pattern # 2 but only when there are no other em dashes in the text data. It will not work when pattern # 1 occurs as it will remove the chunk from the first em dash it has seen until the "Read Next" string.
I have tried the following regex for pattern # 1:
^[A-Z]([A-Za-z]).+(—)$
But how come it does not work. That regex was supposed to look for a phrase that starts with any upper case letter, followed by any length of string as long as it ends with an "—".
What you are considering a hyphen - is not indeed a hyphen instead called Em Dash, hence you need to use this regex which has em dash instead of hyphen in start,
^—[A-Z].*Read Next\s*$
Here is the explanation for this regex,
^ --> Start of input
— --> Matches a literal Em Dash whose Unicode Decimal Code is 8212
[A-Z] --> Matches an upper case letter
.* --> Matches any character zero or more times
Read Next --> Matches these literal words
\s* --> This is for matching any optional white space that might be present at the end of line
$ --> End of input
Online demo
The regex that should take care of this -
^—[A-Z]+(.)*(Read Next)$
You can try implementing this regex on your data and see if it works out.
I want to extract postal codes of Alberta (Canada) region from an address string.
For example:
addr = '12345-67 Ave, Edmonton, AB T1A 2B3, Canada'
Should extract T1A 2B3.
The regular expression to match the postal code is [T]\d[A-Z] *\d[A-Z]\d. However, I do not know that given an entire address, how can I extract only the postal code? I guess it has to do something with backreferences () but I cannot figure it out.
How can I achieve this in Python?
Extracting just the substring that matched the regexp is easy enough:
test = re.compile(r'[T]\d[A-Z] *\d[A-Z]\d')
addr = '12345-67 Ave, Edmonton, AB T1A 2B3, Canada'
test.search(addr).group()
test.search will return a match object, which has all kinds of stuff you can extract.
Building on #Peter's Answer here is how you can do it for some more postal codes:
US:
addr= 'Statue of liberty, New York, NY 10004, USA'
test = re.compile(r'\d{5}')
test.search(addr).group()
UK:
addr= 'Olympic Park, Montfichet Rd, London E20 1EJ, United Kingdom'
test = re.compile(r'[A-Z]\d\d\s\d[A-Z]\d')
Canada:
addr= 'Toronto City Hall, 100 Queen St W, Toronto, ON M5H 2N2'
test = re.compile(r'[A-Z]\d[A-Z]\s\d[A-Z]\d')
[A-Z] Matches any uppercase letter in range A-Z
[a-zA-Z] Matches any uppercase letter in range A-Z (case insensitive)
\d matches any digit
\d{n} matches any occurrence of n digits
\s matches any whitespace character
You can also use Regex101, which is a very helpful tool for testing Regexes.
I have a text file containing entries like this:
#markwarner VIRGINIA - Mark Warner
#senatorleahy VERMONT - Patrick Leahy NO
#senatorsanders VERMONT - Bernie Sanders
#orrinhatch UTAH - Orrin Hatch NO
#jimdemint SOUTH CAROLINA - Jim DeMint NO
#senmikelee UTAH -- Mike Lee
#kaybaileyhutch TEXAS - Kay Hutchison
#johncornyn TEXAS - John Cornyn
#senalexander TENNESSEE - Lamar Alexander
I have written the following to remove the 'NO' and the dashes using regular expressions:
import re
politicians = open('testfile.txt')
text = politicians.read()
# Grab the 'no' votes
# Should be 11 entries
regex = re.compile(r'(no\s#[\w+\d+\.]*\s\w+\s?\w+?\s?\W+\s\w+\s?\w+)', re.I)
no = regex.findall(text)
## Make the list a string
newlist = ' '.join(no)
## Replace the dashes in the string with a space
deldash = re.compile('\s-*\s')
a = deldash.sub(' ', newlist)
# Delete 'NO' in the string
delno = re.compile('NO\s')
b = delno.sub('', a)
# make the string into a list
# problem with #jimdemint SOUTH CAROLINA Jim DeMint
regex2 = re.compile(r'(#[\w\d\.]*\s[\w\d\.]*\s?[\w\d\.]\s?[\w\d\.]*?\s+?\w+)', re.I)
lst1 = regex2.findall(b)
for i in lst1:
print i
When I run the code, it captures the twitter handle, state and full names other than the surname of Jim DeMint. I have stated that I want to ignore case for the regex.
Any ideas? Why is the expression not capturing this surname?
It's missing it because his state name contains two words: SOUTH CAROLINA
Have your second regex be this, it should help
(#[\w\d\.]*\s[\w\d\.]*\s?[\w\d\.]\s?[\w\d\.]*?\s+?\w+(?:\s\w+)?)
I added
(?:\s\w+)?
Which is a optional, non capturing group matching a space followed by one or more alphanumeric underscore characters
http://regexr.com?31fv5 shows that it properly matches the input with the NOs and dashes stripped
EDIT:
If you want one master regex to capture and split everything properly, after you remove the Nos and dashes, use
((#[\w]+?\s)((?:(?:[\w]+?)\s){1,2})((?:[\w]+?\s){2}))
Which you can play with here: http://regexr.com?31fvk
The full match is available in $1, the Twitter handle in $2, the State in $3 And the name in $4
Each capturing group works as follows:
(#[\w]+?\s)
This matches an # sign followed by at least one but as few characters as possible until a space.
((?:(?:[\w]+?)\s){1,2})
This matches and captures 1 or two words, which should be the state. This only works because of the next piece, which MUST have two words
((?:[\w]+?\s){2})
Matches and captures exactly two words, which is defined as few characters as possible followed by a space
text=re.sub(' (NO|-+)(?= |$)','',text)
And to capture everything:
re.findall('(#\w+) ([A-Z ]+[A-Z]) (.+?(?= #|$))',text)
Or all at once:
re.findall('(#\w+) ([A-Z ]+[A-Z])(?: NO| -+)? (.+?(?= #|$))',text)