Why doesn't this regular expression work in all cases?

Why doesn't this regular expression work in all cases? - python

I have a text file containing entries like this:
#markwarner VIRGINIA - Mark Warner
#senatorleahy VERMONT - Patrick Leahy NO
#senatorsanders VERMONT - Bernie Sanders
#orrinhatch UTAH - Orrin Hatch NO
#jimdemint SOUTH CAROLINA - Jim DeMint NO
#senmikelee UTAH -- Mike Lee
#kaybaileyhutch TEXAS - Kay Hutchison
#johncornyn TEXAS - John Cornyn
#senalexander TENNESSEE - Lamar Alexander
I have written the following to remove the 'NO' and the dashes using regular expressions:
import re
politicians = open('testfile.txt')
text = politicians.read()
# Grab the 'no' votes
# Should be 11 entries
regex = re.compile(r'(no\s#[\w+\d+\.]*\s\w+\s?\w+?\s?\W+\s\w+\s?\w+)', re.I)
no = regex.findall(text)
## Make the list a string
newlist = ' '.join(no)
## Replace the dashes in the string with a space
deldash = re.compile('\s-*\s')
a = deldash.sub(' ', newlist)
# Delete 'NO' in the string
delno = re.compile('NO\s')
b = delno.sub('', a)
# make the string into a list
# problem with #jimdemint SOUTH CAROLINA Jim DeMint
regex2 = re.compile(r'(#[\w\d\.]*\s[\w\d\.]*\s?[\w\d\.]\s?[\w\d\.]*?\s+?\w+)', re.I)
lst1 = regex2.findall(b)
for i in lst1:
print i
When I run the code, it captures the twitter handle, state and full names other than the surname of Jim DeMint. I have stated that I want to ignore case for the regex.
Any ideas? Why is the expression not capturing this surname?

It's missing it because his state name contains two words: SOUTH CAROLINA
Have your second regex be this, it should help
(#[\w\d\.]*\s[\w\d\.]*\s?[\w\d\.]\s?[\w\d\.]*?\s+?\w+(?:\s\w+)?)
I added
(?:\s\w+)?
Which is a optional, non capturing group matching a space followed by one or more alphanumeric underscore characters
http://regexr.com?31fv5 shows that it properly matches the input with the NOs and dashes stripped
EDIT:
If you want one master regex to capture and split everything properly, after you remove the Nos and dashes, use
((#[\w]+?\s)((?:(?:[\w]+?)\s){1,2})((?:[\w]+?\s){2}))
Which you can play with here: http://regexr.com?31fvk
The full match is available in $1, the Twitter handle in $2, the State in $3 And the name in $4
Each capturing group works as follows:
(#[\w]+?\s)
This matches an # sign followed by at least one but as few characters as possible until a space.
((?:(?:[\w]+?)\s){1,2})
This matches and captures 1 or two words, which should be the state. This only works because of the next piece, which MUST have two words
((?:[\w]+?\s){2})
Matches and captures exactly two words, which is defined as few characters as possible followed by a space

text=re.sub(' (NO|-+)(?= |$)','',text)
And to capture everything:
re.findall('(#\w+) ([A-Z ]+[A-Z]) (.+?(?= #|$))',text)
Or all at once:
re.findall('(#\w+) ([A-Z ]+[A-Z])(?: NO| -+)? (.+?(?= #|$))',text)

Related

Regex for python: how do I extract a string between words?

Suppose I have a sentence:
Meet me at 201 South First St. at noon
And I want to get the address like this:
South First
What would be the appropriate Regex expression for it ? I currently have this, but it is not working:
x = re.search(r"\d+\s?=([A-Z][a-z]*)\s(Rd.|Dr.|Ave.|St.)",searchstring)
Where searchstring is the sentence. The address is always preceded by 1 or more digits followed by a space and followed by either Rd. Dr. Ave. or St. The address also always starts with a capital letter.

The first group, the part where you try to match the address is [A-Z][a-z]*, it means one uppercase letter followed by any lowercase letters. Probably what you want is any uppercase or lowercase letter or space: [A-Za-z ]*. Also note that the dots in the second group mean any character and not the literal ., so you have to escape it. The solution would look like this:
>>> re.search(r'\d+\s?([A-Za-z ]*)\s+(Rd|Dr|Ave|St)\.', 'Meet me at 201 South First St. at noon')[1]
'South First'
Or just use . to accept anything.
>>> re.search(r'\d+\s?(.*?)\s+(Rd|Dr|Ave|St)\.', 'Meet me at 201 South First St. at noon')[1]
'South First'

You may use
\d+\s*([A-Z].*?)\s+(?:Rd|Dr|Ave|St)\.
See the regex demo.
Details
\d+ - one or more digits
\s* - 0 or more whitespaces
([A-Z].*?) - capturing group #1: an uppercase ASCII letter and then any 0 or more chars other than line break chars as few as possible
\s+ - 1+ whitespaces
(?:Rd|Dr|Ave|St) - Rd, Dr, Ave or St
\. - a dot
See a Python demo:
m = re.search(r'\d+\s*([A-Z].*?)\s+(?:Rd|Dr|Ave|St)\.', text)
if m:
print(m.group(1))
Output: South First.

Here is how:
import re
s = 'Meet me at 201 South First St. at noon'
print(re.findall('(?<=\d )[A-Z].*(?= d.|Dr.|Ave.|St.)', s)[0])
Output:
'South First'

Removing varying text phrases through RegEx in a Python Data frame

Basically, I want to remove the certain phrase patterns embedded in my text data:
Starts with an upper case letter and ends with an Em Dash "—"
Starts with an Em Dash "—" and ends with a "Read Next"
Say, I've got the following data:
CEBU CITY—The widow of slain human rights lawyer .... citing figures from the NUPL that showed that 34 lawyers had been killed in the past two years. —WITH REPORTS FROM JULIE M. AURELIO AND DJ YAPRead Next
and
Manila, Philippines—President .... but justice will eventually push its way through their walls of impunity, ... —REPORTS FROM MELVIN GASCON, JULIE M. AURELIO, DELFIN T. MALLARI JR., JEROME ANING, JOVIC YEE, GABRIEL PABICO LALU, PATHRICIA ANN V. ROXAS, DJ YAP, AFP, APRead Next
I want to remove the following phrases:
"CEBU CITY—"
"—WITH REPORTS FROM JULIE M. AURELIO AND DJ YAPRead Next"
"Manila, Philippines—"
"—REPORTS FROM MELVIN GASCON, JULIE M. AURELIO, DELFIN T. MALLARI JR., JEROME ANING, JOVIC YEE, GABRIEL PABICO LALU, PATHRICIA ANN V. ROXAS, DJ YAP, AFP, APRead Next"
I am assuming this would be needing two regex for each patterns enumerated above.
The regex: —[A-Z].*Read Next\s*$ may work on the pattern # 2 but only when there are no other em dashes in the text data. It will not work when pattern # 1 occurs as it will remove the chunk from the first em dash it has seen until the "Read Next" string.
I have tried the following regex for pattern # 1:
^[A-Z]([A-Za-z]).+(—)$
But how come it does not work. That regex was supposed to look for a phrase that starts with any upper case letter, followed by any length of string as long as it ends with an "—".

What you are considering a hyphen - is not indeed a hyphen instead called Em Dash, hence you need to use this regex which has em dash instead of hyphen in start,
^—[A-Z].*Read Next\s*$
Here is the explanation for this regex,
^ --> Start of input
— --> Matches a literal Em Dash whose Unicode Decimal Code is 8212
[A-Z] --> Matches an upper case letter
.* --> Matches any character zero or more times
Read Next --> Matches these literal words
\s* --> This is for matching any optional white space that might be present at the end of line
$ --> End of input
Online demo

The regex that should take care of this -
^—[A-Z]+(.)*(Read Next)$
You can try implementing this regex on your data and see if it works out.

Regex to find name in sentence

I have some sentence like
1:
"RLB shows Oubre Jr. (WAS) legally ties up Nurkic (POR), and a held
ball is correctly called."
2:
"Nurkic (POR) maintains legal
guarding position and makes incidental contact with Wall (WAS) that
does not affect his driving shot attempt."
I need to use Python regex to find the name "Oubre Jr." ,"Nurkic" and "Nurkic", "Wall".
p = r'\s*(\w+?)\s[(]'
use this pattern,
I can find "['Nurkic', 'Wall']", but in sentence 1, I just can find ['Nurkic'], missed "Oubre Jr."
Who can help me?

You can use the following regex:
(?:[A-Z][a-z][\s\.a-z]*)+(?=\s\()
|-----Main Pattern-----|
Details:
(?:) - Creates a non-capturing group
[A-Z] - Captures 1 uppercase letter
[a-z] - Captures 1 lowercase letter
[\s\.a-z]* - Captures spaces (' '), periods ('.') or lowercase letters 0+ times
(?=\s\() - Captures the main pattern if it is only followed by ' (' string
str = '''RLB shows Oubre Jr. (WAS) legally ties up Nurkic (POR), and a held ball is correctly called.
Nurkic (POR) maintains legal guarding position and makes incidental contact with Wall (WAS) that does not affect his driving shot attempt.'''
res = re.findall( r'(?:[A-Z][a-z][\s\.a-z]*)+(?=\s\()', str )
print(res)
Demo: https://repl.it/#RahulVerma8/OvalRequiredAdvance?language=python3
Match: https://regex101.com/r/OsLTrY/1

Here is one approach:
line = "RLB shows Oubre Jr (WAS) legally ties up Nurkic (POR), and a held ball is correctly called."
results = re.findall( r'([A-Z][\w+'](?: [JS][r][.]?)?)(?= \([A-Z]+\))', line, re.M|re.I)
print(results)
['Oubre Jr', 'Nurkic']
The above logic will attempt to match one name, beginning with a capital letter, which is possibly followed by either the suffix Jr. or Sr., which in turn is followed by a ([A-Z]+) term.

You need a pattern that you can match - for your sentence you cou try to match things before (XXX) and include a list of possible "suffixes" to include as well - you would need to extract them from your sources
import re
suffs = ["Jr."] # append more to list
rsu = r"(?:"+"|".join(suffs)+")? ?"
# combine with suffixes
regex = r"(\w+ "+rsu+")\(\w{3}\)"
test_str = "RLB shows Oubre Jr. (WAS) legally ties up Nurkic (POR), and a held ball is correctly called. Nurkic (POR) maintains legal guarding position and makes incidental contact with Wall (WAS) that does not affect his driving shot attempt."
matches = re.finditer(regex, test_str, re.MULTILINE)
names = []
for matchNum, match in enumerate(matches,1):
for groupNum in range(0, len(match.groups())):
names.extend(match.groups(groupNum))
print(names)
Output:
['Oubre Jr.', 'Nurkic ', 'Nurkic ', 'Wall ']
This should work as long as you do not have Names with non-\w in them. If you need to adapt the regex, use https://regex101.com/r/pRr9ZU/1 as starting point.
Explanation:
r"(?:"+"|".join(suffs)+")? ?" --> all items in the list suffs are strung together via | (OR) as non grouping (?:...) and made optional followed by optional space.
r"(\w+ "+rsu+")\(\w{3}\)" --> the regex looks for any word characters followed by optional suffs group we just build, followed by literal ( then three word characters followed by another literal )

Python create acronym from first characters of each word and include the numbers

I have a string as follows:
theatre = 'Regal Crown Center Stadium 14'
I would like to break this into an acronym based on the first letter in each word but also include both numbers:
desired output = 'RCCS14'
My code attempts below:
acronym = "".join(word[0] for word in theatre.lower().split())
acronym = "".join(word[0].lower() for word in re.findall("(\w+)", theatre))
acronym = "".join(word[0].lower() for word in re.findall("(\w+ | \d{1,2})", theatre))
acronym = re.search(r"\b(\w+ | \d{1,2})", theatre)
In which I wind up with something like: rccs1 but can't seem to capture that last number. There could be instances when the number is in the middle of the name as well: 'Regal Crown Center 14 Stadium' as well. TIA!

See regex in use here
(?:(?<=\s)|^)(?:[a-z]|\d+)
(?:(?<=\s)|^) Ensure what precedes is either a space or the start of the line
(?:[a-z]|\d+) Match either a single letter or one or more digits
The i flag (re.I in python) allows [a-z] to match its uppercase variants.
See code in use here
import re
r = re.compile(r"(?:(?<=\s)|^)(?:[a-z]|\d+)", re.I)
s = 'Regal Crown Center Stadium 14'
print(''.join(r.findall(s)))
The code above finds all instances where the regex matches and joins the list items into a single string.
Result: RCCS14

You can use re.sub() to remove all lowercase letters and spaces.
Regex: [a-z ]+
Details:
[]+ Match a single character present in the list between one and
unlimited times
Python code:
re.sub(r'[a-z ]+', '', theatre)
Output: RCCS14
Code demo

I can't comment since I don't have enough reputation, but S. Jovan answer isn't satisfying since it assumes that each word starts with a capital letter and that each word has one and only one capital letter.
re.sub(r'[a-z ]+', '', "Regal Crown Center Stadium YB FIEUBFB DBUUFG FUEH 14")
will returns 'RCCSYBFIEUBFBDBUUFGFUEH14'
However ctwheels answers will be able to work in this case :
r = re.compile(r"\b(?:[a-z]|\d+)", re.I)
s = 'Regal Crown Center Stadium YB FIEUBFB DBUUFG FUEH 14'
print(''.join(r.findall(s)))
will print
RCCSYFDF14

import re
theatre = 'Regal Crown Center Stadium 14'
r = re.findall("\s(\d+|\S)", ' '+theatre)
print(''.join(r))
Gives me RCCS14

Using python regex with backreference matches

I have a doubt about regex with backreference.
I need to match strings, I try this regex (\w)\1{1,} to capture repeated values of my string, but this regex only capture consecutive repeated strings; I'm stuck to improve my regex to capture all repeated values, below some examples:
import re
str = 'capitals'
re.search(r'(\w)\1{1,}', str)
Output None
import re
str = 'butterfly'
re.search(r'(\w)\1{1,}', str)
<_sre.SRE_Match object; span=(2, 4), match='tt'>

I would use r'(\w).*\1 so that it allows any repeated character even if there are special characters or spaces in between.
However this wont work for strings with repeated characters overlapping the contents of groups like the string abcdabcd, in which it only recognizes the first group, ignoring the other repeated characters enclosed in the first group (b,c,d)
Check the demo: https://regex101.com/r/m5UfAe/1
So an alternative (and depending on your needs) is to sort the string analyzed:
import re
str = 'abcdabcde'
re.findall(r'(\w).*\1', ''.join(sorted(str)))
returning the array with the repeated characters ['a','b','c','d']

Hope the code below will help you understand the Backreference concept of Python RegEx
There are two sets of information available in the given string str
Employee Basic Info:
starting with #employeename and ends with employeename
eg: #daniel dxc chennai 45000 male daniel
Employee designation
starting with %employeename then designation and ends with employeename%
eg: %daniel python developer daniel%
import re
#sample input
str="""
#daniel dxc chennai 45000 male daniel #henry infosys bengaluru 29000 male hobby-
swimming henry
#raja zoho chennai 37000 male raja #ramu infosys bengaluru 99000 male hobby-badminton
ramu
%daniel python developer daniel% %henry database admin henry%
%raja Testing lead raja% %ramu Manager ramu%
"""
#backreferencing employee name (\w+) <---- \1
#----------------------------------------------
basic_info=re.findall(r'#+(\w+)(.*?)\1',str)
print(basic_info)
#(%) <-- \1 and (\w+) <--- \2
#-------------------------------
designation=re.findall(r'(%)+(\w+)(.*?)\2\1',str)
print(designation)
for i in range(len(designation)):
designation[i]=(designation[i][1],designation[i][2])
print(designation)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.