The strings have two types.
The first type:
'The Five College Region of Western Massachusetts:'
#Doesn't contain "("
The second type:
'Tuskegee (Tuskegee University)[5]'
#Containing "("
If the string contains "(", remove everything from "(" onward, along with the white space before "(".
If not, keep the whole string.
I have figured out how to extract the second type of string:
r'(.+) \('
You don't need regex for this.
university = 'Tuskegee (Tuskegee University)[5]'
print(university.split("(", 1)[0].strip())
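A minimal sketch of the same idea applied to both kinds of string. The helper name strip_parenthetical is made up for illustration. str.partition returns the whole string as its first element when the separator is absent, so strings without "(" pass through unchanged.

```python
def strip_parenthetical(s):
    # Keep everything before the first "(" and drop the whitespace before it.
    head, _, _ = s.partition("(")
    return head.rstrip()

print(strip_parenthetical('Tuskegee (Tuskegee University)[5]'))
print(strip_parenthetical('The Five College Region of Western Massachusetts:'))
```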
Use re.sub to remove everything after ( if you want to use regex:
import re
re.sub(r' \(.*', '', 'Tuskegee (Tuskegee University)[5]')
# 'Tuskegee'
re.sub(r' \(.*', '', 'The Five College Region of Western Massachusetts:')
# 'The Five College Region of Western Massachusetts:'
You can use a regex such as re.sub(r'\s*\(.*', ...) to match optional whitespace, a "(" and everything after it.
If this matches, it will replace this with the empty string. If not, nothing is replaced.
import re
re.sub(r'\s*\(.*', '', 'The Five College Region of Western Massachusetts:')
# 'The Five College Region of Western Massachusetts:'
re.sub(r'\s*\(.*', '', 'Tuskegee (Tuskegee University)[5]')
# 'Tuskegee'
Suppose I have a sentence:
Meet me at 201 South First St. at noon
And I want to get the address like this:
South First
What would be the appropriate regex for it? I currently have this, but it is not working:
x = re.search(r"\d+\s?=([A-Z][a-z]*)\s(Rd.|Dr.|Ave.|St.)",searchstring)
Where searchstring is the sentence. The address is always preceded by 1 or more digits followed by a space and followed by either Rd. Dr. Ave. or St. The address also always starts with a capital letter.
The first group, where you try to match the address, is [A-Z][a-z]*: one uppercase letter followed by any number of lowercase letters, so it cannot span two words. What you probably want is any uppercase or lowercase letter or space: [A-Za-z ]*. The = after \s? is matched literally and has to go. Also note that an unescaped dot in the second group means any character rather than a literal ., so you have to escape it. The solution would look like this:
>>> re.search(r'\d+\s?([A-Za-z ]*)\s+(Rd|Dr|Ave|St)\.', 'Meet me at 201 South First St. at noon')[1]
'South First'
Or just use .*? to accept anything:
>>> re.search(r'\d+\s?(.*?)\s+(Rd|Dr|Ave|St)\.', 'Meet me at 201 South First St. at noon')[1]
'South First'
You may use
\d+\s*([A-Z].*?)\s+(?:Rd|Dr|Ave|St)\.
Details
\d+ - one or more digits
\s* - 0 or more whitespaces
([A-Z].*?) - capturing group #1: an uppercase ASCII letter and then any 0 or more chars other than line break chars as few as possible
\s+ - 1+ whitespaces
(?:Rd|Dr|Ave|St) - Rd, Dr, Ave or St
\. - a dot
See a Python demo:
import re
text = 'Meet me at 201 South First St. at noon'
m = re.search(r'\d+\s*([A-Z].*?)\s+(?:Rd|Dr|Ave|St)\.', text)
if m:
    print(m.group(1))
Output: South First.
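The breakdown above can also be carried inside the pattern itself. This is a sketch using re.VERBOSE, which lets whitespace and comments sit inside the pattern; the names ADDRESS and street_name are illustrative, not from the original answer.

```python
import re

ADDRESS = re.compile(r"""
    \d+                  # one or more digits (the house number)
    \s*                  # zero or more whitespace characters
    ([A-Z].*?)           # group 1: uppercase letter, then as few chars as possible
    \s+                  # one or more whitespace characters
    (?:Rd|Dr|Ave|St)     # street suffix
    \.                   # a literal dot
""", re.VERBOSE)

def street_name(sentence):
    # Return the captured street name, or None when nothing matches.
    m = ADDRESS.search(sentence)
    return m.group(1) if m else None

print(street_name('Meet me at 201 South First St. at noon'))
```

Returning None for a non-match avoids the AttributeError you would get from calling .group() on a failed search.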
Here is how:
import re
s = 'Meet me at 201 South First St. at noon'
print(re.findall(r'(?<=\d )[A-Z].*?(?= (?:Rd|Dr|Ave|St)\.)', s)[0])
Output:
South First
Consider the original strings shown in the first column of the following table:
Original String                   Parsed String               Desired String
'W. & J. JOHNSON LMT.COM'         #W J JOHNSON LIMITED        #WJ JOHNSON LIMITED
'NORTH ROOF & WORKS CO. LTD.'     #NORTH ROOF WORKS CO LTD    #NORTH ROOF WORKS CO LTD
'DAVID DOE & CO., LIMITED'        #DAVID DOE CO LIMITED       #DAVID DOE CO LIMITED
'GEORGE TV & APPLIANCE LTD.'      #GEORGE TV APPLIANCE LTD    #GEORGE TV APPLIANCE LTD
'LOVE BROS. & OTHERS LTD.'        #LOVE BROS OTHERS LTD       #LOVE BROS OTHERS LTD
'A. B. & MICHAEL CLEAN CO. LTD.'  #A B MICHAEL CLEAN CO LTD   #AB MICHAEL CLEAN CO LTD
'C.M. & B.B. CLEANER INC.'        #C M B B CLEANER INC        #CMBB CLEANER INC
Punctuation needs to be removed, which I have done as follows:
def transform(word):
    return re.sub(r'(?<=[A-Za-z])\'(?=[A-Za-z])[A-Z]|[^\w\s]|(.com|COM)', ' ', word)
However, there is one last point I have not been able to solve. After removing the punctuation I end up with lots of spaces. How can I write a regular expression that joins initials together and keeps single spaces between regular words (non-initials)?
Is this a bad approach to substitute the mentioned characters to get the desired strings?
Thanks for allowing me to continue learning :)
I think it's simpler to do this in parts. First, remove .com and any character other than letters, spaces or &. Then, remove a space or & that sits between two single-letter words. Finally, replace any remaining sequence of spaces or & with a single space:
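import re

strings = ['W. & J. JOHNSON LMT.COM',
           'NORTH ROOF & WORKS CO. LTD.',
           'DAVID DOE & CO., LIMITED',
           'GEORGE TV & APPLIANCE LTD.',
           'LOVE BROS. & OTHERS LTD.',
           'A. B. & MICHAEL CLEAN CO. LTD.',
           'C.M. & B.B. CLEANER INC.'
           ]
for s in strings:
    # 1. Drop ".COM" and every character that is not a letter, "&" or space.
    s = re.sub(r'\.COM|[^a-zA-Z& ]+', '', s, flags=re.IGNORECASE)
    # 2. Join single-letter initials separated by a space or "&".
    s = re.sub(r'(?<=\b\w)\s*[ &]\s*(?=\w\b)', '', s)
    # 3. Collapse whatever separators are left into single spaces.
    s = re.sub(r'\s*[& ]\s*', ' ', s)
    print(s)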
Output
WJ JOHNSON LMT
NORTH ROOF WORKS CO LTD
DAVID DOE CO LIMITED
GEORGE TV APPLIANCE LTD
LOVE BROS OTHERS LTD
AB MICHAEL CLEAN CO LTD
CM BB CLEANER INC
Update
This was written before the question was edited to change the required result for the last row. Given the edit, the above code can be simplified to
for s in strings:
    s = re.sub(r'\.COM|[^a-zA-Z ]+|\s(?=&)|(?<!\w\w)\s+(?!\w\w)', '', s, flags=re.IGNORECASE)
    print(s)
Doing this in regex alone won't be pretty and is not the best solution, yet, here it is! You're better off doing a multiple step approach. What I've done is identified all the cases that are possible and opted to find a solution where there's no replacement string since you're not always replacing the characters with spaces.
Rules
Non "Stacked" Abbreviations
These are locations like A. B. or W. & J., but not C.M. & B.B.
I've identified these as locations where an abbreviation part (e.g. A.) exists before and after, but the latter is not followed by another alpha character
Preceding Space
These locations don't exist in your text but could if a space preceded a non-alpha character without a space following it (say at the end of a line)
We match the characters after the first space in these cases
Succeeding Space
These are locations like & and the dot in J.
We match the character before the last space in those examples
No Spaces
These are locations like 'LOVE (the apostrophe in that string)
We only match the non-alpha-non-whitespace characters
Regex
An all-in-one regex that accomplishes this is as follows:
(?<=\b[a-z])[^a-z]+(?=[a-z]\b(?![^a-z][a-z]))|(?<= ) *(?:\.com\b|[^a-z\s]+) *| *(?:\.com\b|[^a-z\s]+) *(?= )|(?<! )(?:\.com\b|[^a-z\s]+)(?! )
Works as follows (broken into each alternation):
(?<=\b[a-z])[^a-z]+(?=[a-z]\b(?![^a-z][a-z])) matches non-alpha characters between A. and B. but not A. and B.B
(?<=\b[a-z]) positive lookbehind ensuring what precedes is an alpha character and assert a word boundary position to its left
[^a-z]+ match any non-alpha character one or more times
(?=[a-z]\b(?![^a-z][a-z])) positive lookahead ensuring the following exists
[a-z]\b match any alpha character and assert a word boundary position to its right
(?![^a-z][a-z]) negative lookahead ensuring what follows is not a non-alpha character followed by an alpha character
(?<= ) *(?:\.com\b|[^a-z\s]+) * ensures a space precedes, then matches any spaces, .com or any non-alpha-non-whitespace characters one or more times, then any spaces
(?<= ) positive lookbehind ensuring a space precedes
' *' (a space followed by *) match any number of spaces
(?:\.com\b|[^a-z\s]+) match .com followed by a word boundary, or match any non-alpha-non-whitespace character one or more times
' *' (a space followed by *) match any number of spaces
' *(?:\.com\b|[^a-z\s]+) *(?= )' (note the leading space) matches any spaces, .com or any non-alpha-non-whitespace characters one or more times, then any spaces, then ensures a space follows
Same as previous but instead of the positive lookbehind at the beginning, there's a positive lookahead at the end
(?<! )(?:\.com\b|[^a-z\s]+)(?! ) matches .com or any non-alpha-non-whitespace characters one or more times ensuring no spaces surround it
Same as previous two options but uses negative lookbehind and negative lookahead
Code
import re

strings = [
    "'W. & J. JOHNSON LMT.COM'",
    "'NORTH ROOF & WORKS CO. LTD.'",
    "'DAVID DOE & CO., LIMITED'",
    "'GEORGE TV & APPLIANCE LTD.'",
    "'LOVE BROS. & OTHERS LTD.'",
    "'A. B. & MICHAEL CLEAN CO. LTD.'",
    "'C.M. & B.B. CLEANER INC.'"
]

r = re.compile(r'(?<=\b[a-z])[^a-z]+(?=[a-z]\b(?![^a-z][a-z]))|(?<= ) *(?:\.com\b|[^a-z\s]+) *| *(?:\.com\b|[^a-z\s]+) *(?= )|(?<! )(?:\.com\b|[^a-z\s]+)(?! )', re.IGNORECASE)

def transform(word):
    return re.sub(r, '', word)

for s in strings:
    print(transform(s))
Outputs:
WJ JOHNSON LMT
NORTH ROOF WORKS CO LTD
DAVID DOE CO LIMITED
GEORGE TV APPLIANCE LTD
LOVE BROS OTHERS LTD
AB MICHAEL CLEAN CO LTD
CM BB CLEANER INC
Edit
Using a callback, you can extend this logic to include special cases as mentioned in a comment below my answer to match specific cases and have conditional replacements.
These special cases include:
FONTAINE'S to FONTAINE
PREMIUM-FIT AUTO to PREMIUM FIT AUTO
62325 W.C. to 62325 WC
I added a new alternation to the regex: (\b[\'-]\b(?:[a-z\d] )?) to capture 'S or - between letters (also -S or similar) and replace it with a space using the callback (if the capture group exists).
I still suggest using multiple regular expressions to accomplish this, but I wanted to show that it is possible with a single pattern.
import re

strings = [
    "'W. & J. JOHNSON LMT.COM'",
    "'NORTH ROOF & WORKS CO. LTD.'",
    "'DAVID DOE & CO., LIMITED'",
    "'GEORGE TV & APPLIANCE LTD.'",
    "'LOVE BROS. & OTHERS LTD.'",
    "'A. B. & MICHAEL CLEAN CO. LTD.'",
    "'C.M. & B.B. CLEANER INC.'",
    "'FONTAINE'S PREMIUM-FIT AUTO 62325 W.C.'"
]

r = re.compile(r'(?<=\b[a-z\d])[^a-z\d]+(?=[a-z\d]\b(?![^a-z\d][a-z\d]))|(?<= ) *(?:\.com\b|[^a-z\d\s]+) *| *(?:\.com\b|[^a-z\d\s]+) *(?= )|(\b[\'-]\b(?:[a-z\d] )?)|(?<! )(?:\.com\b|[^a-z\d\s]+)(?! )', re.IGNORECASE)

def repl(m):
    # Group 1 only participates for 'S or - between letters: replace with a space.
    return ' ' if m.group(1) else ''

for s in strings:
    print(r.sub(repl, s))
Here's the simplest I could get it with one regex pattern:
\.COM|(?<![A-Z]{2}) (?![A-Z]{2})|[.&,]| (?>)&
Basically, it removes characters that fit any of four criteria:
Literal ".COM"
Spaces that are not preceded or followed by 2 capital letters
Dots, ampersands, and commas, regardless of where they appear
Spaces followed by ampersands
Demo: https://regex101.com/r/EMHxq9/2
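A hedged Python adaptation of that pattern: (?>) is an empty atomic group, which Python's re module only accepts from version 3.11 on, so the sketch below simply drops it (an empty group matches nothing, so the alternative still means "a space followed by &"). The helper name squeeze is made up for illustration.

```python
import re

# Same alternation as above with the empty (?>) group removed.
PATTERN = re.compile(r'\.COM|(?<![A-Z]{2}) (?![A-Z]{2})|[.&,]| &')

def squeeze(name):
    # Delete every match; nothing is ever replaced by a space.
    return PATTERN.sub('', name)

for s in ['W. & J. JOHNSON LMT.COM',
          'NORTH ROOF & WORKS CO. LTD.',
          'DAVID DOE & CO., LIMITED',
          'C.M. & B.B. CLEANER INC.']:
    print(squeeze(s))
```

Because the pattern is case-sensitive, it assumes upper-case input like the sample data.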
I have the following string which contains a repeating pattern of text followed by parentheses with an ID number.
The New York Yankees (12980261666)\n\nRedsox (1901659429)\nMets (NYC)
(21135721896)\nKansas City Royals (they are 7-1) (222497247812331)\n\n
other team (618006)\n
I'm struggling to write a regex that would return:
The New York Yankees (12980261666)
Redsox (1901659429)
Mets (NYC) (21135721896)
Kansas City Royals (they are 7-1) (222497247812331)
other team (618006)
The newline characters could be removed later with string.replace('\n', '').
Use a negated character class to match runs of characters that are not newlines:
String pat = "([^\\n]+)";
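For comparison, here is a Python sketch that produces the desired list directly. It assumes the wrapped lines of the sample are joined by newlines and that an entry always ends at the first parenthesised group containing only digits; a name with something like "(7)" in it would break that assumption.

```python
import re

# Assumption: the line wraps in the sample above are newline characters.
text = ("The New York Yankees (12980261666)\n\nRedsox (1901659429)\nMets (NYC)\n"
        "(21135721896)\nKansas City Royals (they are 7-1) (222497247812331)\n\n"
        "other team (618006)\n")

# From the first non-space character up to the first "(digits)" group.
ENTRY = re.compile(r'\S.*?\(\d+\)', re.DOTALL)

# Collapse any embedded newlines or repeated spaces into single spaces.
entries = [' '.join(m.split()) for m in ENTRY.findall(text)]
for entry in entries:
    print(entry)
```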
I want to extract postal codes of Alberta (Canada) region from an address string.
For example:
addr = '12345-67 Ave, Edmonton, AB T1A 2B3, Canada'
Should extract T1A 2B3.
The regular expression to match the postal code is [T]\d[A-Z] *\d[A-Z]\d. However, given an entire address, I do not know how to extract only the postal code. I guess it has something to do with backreferences (), but I cannot figure it out.
How can I achieve this in Python?
Extracting just the substring that matched the regexp is easy enough:
import re
test = re.compile(r'[T]\d[A-Z] *\d[A-Z]\d')
addr = '12345-67 Ave, Edmonton, AB T1A 2B3, Canada'
test.search(addr).group()
# 'T1A 2B3'
test.search will return a match object, which has all kinds of stuff you can extract.
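A short sketch of what the match object offers, and of guarding against addresses with no postal code (search returns None in that case). The names POSTAL and find_postal_code are illustrative.

```python
import re

POSTAL = re.compile(r'T\d[A-Z] *\d[A-Z]\d')

def find_postal_code(address):
    # search() returns None when nothing matches, so check before .group().
    m = POSTAL.search(address)
    return m.group() if m else None

addr = '12345-67 Ave, Edmonton, AB T1A 2B3, Canada'
m = POSTAL.search(addr)
print(m.group())            # the matched text
print(m.start(), m.end())   # where it sits inside addr
print(find_postal_code('no postal code here'))
```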
Building on Peter's answer, here is how you can do it for some more postal codes:
US:
addr= 'Statue of liberty, New York, NY 10004, USA'
test = re.compile(r'\d{5}')
test.search(addr).group()
UK:
addr= 'Olympic Park, Montfichet Rd, London E20 1EJ, United Kingdom'
test = re.compile(r'[A-Z]\d\d\s\d[A-Z]{2}')  # the code ends in two letters, e.g. 1EJ
Canada:
addr= 'Toronto City Hall, 100 Queen St W, Toronto, ON M5H 2N2'
test = re.compile(r'[A-Z]\d[A-Z]\s\d[A-Z]\d')
[A-Z] Matches any uppercase letter in range A-Z
[a-zA-Z] Matches any letter, uppercase or lowercase
\d matches any digit
\d{n} matches any occurrence of n digits
\s matches any whitespace character
You can also use Regex101, which is a very helpful tool for testing Regexes.
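The per-country patterns can be collected in one lookup table. This is only a sketch: the dictionary, the country keys and extract_code are hypothetical names, the UK pattern covers just the shape of the example (letter, two digits, space, digit, two letters), and \b is added so that a pattern does not fire inside a longer run of characters.

```python
import re

# Illustrative patterns only; real postal code formats have more variants.
PATTERNS = {
    'US': re.compile(r'\b\d{5}\b'),
    'UK': re.compile(r'\b[A-Z]\d\d\s\d[A-Z]{2}\b'),
    'CA': re.compile(r'\b[A-Z]\d[A-Z]\s\d[A-Z]\d\b'),
}

def extract_code(address, country):
    # Return the first match for the given country, or None.
    m = PATTERNS[country].search(address)
    return m.group() if m else None

print(extract_code('Statue of liberty, New York, NY 10004, USA', 'US'))
print(extract_code('Olympic Park, Montfichet Rd, London E20 1EJ, United Kingdom', 'UK'))
print(extract_code('Toronto City Hall, 100 Queen St W, Toronto, ON M5H 2N2', 'CA'))
```

Note that the US pattern takes the first five-digit run it finds, so a five-digit street number earlier in the address would be returned instead.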
I have a text file containing entries like this:
#markwarner VIRGINIA - Mark Warner
#senatorleahy VERMONT - Patrick Leahy NO
#senatorsanders VERMONT - Bernie Sanders
#orrinhatch UTAH - Orrin Hatch NO
#jimdemint SOUTH CAROLINA - Jim DeMint NO
#senmikelee UTAH -- Mike Lee
#kaybaileyhutch TEXAS - Kay Hutchison
#johncornyn TEXAS - John Cornyn
#senalexander TENNESSEE - Lamar Alexander
I have written the following to remove the 'NO' and the dashes using regular expressions:
import re
politicians = open('testfile.txt')
text = politicians.read()
# Grab the 'no' votes
# Should be 11 entries
regex = re.compile(r'(no\s#[\w+\d+\.]*\s\w+\s?\w+?\s?\W+\s\w+\s?\w+)', re.I)
no = regex.findall(text)
## Make the list a string
newlist = ' '.join(no)
## Replace the dashes in the string with a space
deldash = re.compile('\s-*\s')
a = deldash.sub(' ', newlist)
# Delete 'NO' in the string
delno = re.compile('NO\s')
b = delno.sub('', a)
# make the string into a list
# problem with #jimdemint SOUTH CAROLINA Jim DeMint
regex2 = re.compile(r'(#[\w\d\.]*\s[\w\d\.]*\s?[\w\d\.]\s?[\w\d\.]*?\s+?\w+)', re.I)
lst1 = regex2.findall(b)
for i in lst1:
    print(i)
When I run the code, it captures the twitter handle, state and full names other than the surname of Jim DeMint. I have stated that I want to ignore case for the regex.
Any ideas? Why is the expression not capturing this surname?
It's missing it because his state name contains two words: SOUTH CAROLINA
Have your second regex be this, it should help
(#[\w\d\.]*\s[\w\d\.]*\s?[\w\d\.]\s?[\w\d\.]*?\s+?\w+(?:\s\w+)?)
I added
(?:\s\w+)?
Which is an optional, non-capturing group matching a space followed by one or more word characters (letters, digits or underscores)
http://regexr.com?31fv5 shows that it properly matches the input with the NOs and dashes stripped
EDIT:
If you want one master regex to capture and split everything properly, after you remove the Nos and dashes, use
((#[\w]+?\s)((?:(?:[\w]+?)\s){1,2})((?:[\w]+?\s){2}))
Which you can play with here: http://regexr.com?31fvk
The full match is available in $1, the Twitter handle in $2, the state in $3 and the name in $4.
Each capturing group works as follows:
(#[\w]+?\s)
This matches a # sign followed by at least one word character, as few as possible, up to and including the next whitespace character.
((?:(?:[\w]+?)\s){1,2})
This matches and captures one or two words, which should be the state. It only works because the next piece MUST match two words.
((?:[\w]+?\s){2})
Matches and captures exactly two words, where a word is as few word characters as possible followed by a whitespace character.
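In Python the $1 to $4 groups are reached through the match object. A minimal sketch, assuming the NOs and dashes are already stripped and every entry ends in whitespace (here a newline); the sample text is cut down to two entries, and a three-word state or a three-word name would not fit the pattern.

```python
import re

# Assumption: input already cleaned, one entry per line, each ending in "\n".
cleaned = ("#markwarner VIRGINIA Mark Warner\n"
           "#jimdemint SOUTH CAROLINA Jim DeMint\n")

master = re.compile(r'((#[\w]+?\s)((?:(?:[\w]+?)\s){1,2})((?:[\w]+?\s){2}))')

rows = []
for m in master.finditer(cleaned):
    # Groups 2, 3 and 4 are handle, state and name; strip the trailing whitespace.
    handle, state, name = (g.strip() for g in m.group(2, 3, 4))
    rows.append((handle, state, name))

for row in rows:
    print(row)
```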
First strip the NOs and the dashes:
text = re.sub(r' (NO|-+)(?= |$)', '', text)
And to capture everything:
re.findall(r'(#\w+) ([A-Z ]+[A-Z]) (.+?(?= #|$))', text)
Or all at once:
re.findall(r'(#\w+) ([A-Z ]+[A-Z])(?: NO| -+)? (.+?(?= #|$))', text)
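A sketch of the two-step version on a few of the sample entries. It assumes the file's lines have been joined with single spaces, since without re.MULTILINE the $ in these patterns only matches at the very end of the text.

```python
import re

# Assumption: the lines of the file are joined into one space-separated string.
text = ("#markwarner VIRGINIA - Mark Warner "
        "#senatorleahy VERMONT - Patrick Leahy NO "
        "#jimdemint SOUTH CAROLINA - Jim DeMint NO "
        "#senmikelee UTAH -- Mike Lee")

# Step 1: drop " NO" and runs of dashes that are followed by a space or the end.
cleaned = re.sub(r' (NO|-+)(?= |$)', '', text)

# Step 2: handle, upper-case state (one or more words), then the name
# up to the next " #" or the end of the text.
rows = re.findall(r'(#\w+) ([A-Z ]+[A-Z]) (.+?(?= #|$))', cleaned)
for handle, state, name in rows:
    print(handle, state, name, sep=' | ')
```

The state group relies on state names being fully upper-case and personal names not being, which holds for the sample data.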