Regex to find name in sentence - python

I have some sentence like
1:
"RLB shows Oubre Jr. (WAS) legally ties up Nurkic (POR), and a held
ball is correctly called."
2:
"Nurkic (POR) maintains legal
guarding position and makes incidental contact with Wall (WAS) that
does not affect his driving shot attempt."
I need to use Python regex to find the name "Oubre Jr." ,"Nurkic" and "Nurkic", "Wall".
p = r'\s*(\w+?)\s[(]'
use this pattern,
I can find "['Nurkic', 'Wall']", but in sentence 1, I just can find ['Nurkic'], missed "Oubre Jr."
Who can help me?

You can use the following regex:
(?:[A-Z][a-z][\s\.a-z]*)+(?=\s\()
|-----Main Pattern-----|
Details:
(?:) - Creates a non-capturing group
[A-Z] - Captures 1 uppercase letter
[a-z] - Captures 1 lowercase letter
[\s\.a-z]* - Captures spaces (' '), periods ('.') or lowercase letters 0+ times
(?=\s\() - Captures the main pattern if it is only followed by ' (' string
str = '''RLB shows Oubre Jr. (WAS) legally ties up Nurkic (POR), and a held ball is correctly called.
Nurkic (POR) maintains legal guarding position and makes incidental contact with Wall (WAS) that does not affect his driving shot attempt.'''
res = re.findall( r'(?:[A-Z][a-z][\s\.a-z]*)+(?=\s\()', str )
print(res)
Demo: https://repl.it/#RahulVerma8/OvalRequiredAdvance?language=python3
Match: https://regex101.com/r/OsLTrY/1

Here is one approach:
line = "RLB shows Oubre Jr (WAS) legally ties up Nurkic (POR), and a held ball is correctly called."
results = re.findall( r'([A-Z][\w+'](?: [JS][r][.]?)?)(?= \([A-Z]+\))', line, re.M|re.I)
print(results)
['Oubre Jr', 'Nurkic']
The above logic will attempt to match one name, beginning with a capital letter, which is possibly followed by either the suffix Jr. or Sr., which in turn is followed by a ([A-Z]+) term.

You need a pattern that you can match - for your sentence you cou try to match things before (XXX) and include a list of possible "suffixes" to include as well - you would need to extract them from your sources
import re
suffs = ["Jr."] # append more to list
rsu = r"(?:"+"|".join(suffs)+")? ?"
# combine with suffixes
regex = r"(\w+ "+rsu+")\(\w{3}\)"
test_str = "RLB shows Oubre Jr. (WAS) legally ties up Nurkic (POR), and a held ball is correctly called. Nurkic (POR) maintains legal guarding position and makes incidental contact with Wall (WAS) that does not affect his driving shot attempt."
matches = re.finditer(regex, test_str, re.MULTILINE)
names = []
for matchNum, match in enumerate(matches,1):
for groupNum in range(0, len(match.groups())):
names.extend(match.groups(groupNum))
print(names)
Output:
['Oubre Jr.', 'Nurkic ', 'Nurkic ', 'Wall ']
This should work as long as you do not have Names with non-\w in them. If you need to adapt the regex, use https://regex101.com/r/pRr9ZU/1 as starting point.
Explanation:
r"(?:"+"|".join(suffs)+")? ?" --> all items in the list suffs are strung together via | (OR) as non grouping (?:...) and made optional followed by optional space.
r"(\w+ "+rsu+")\(\w{3}\)" --> the regex looks for any word characters followed by optional suffs group we just build, followed by literal ( then three word characters followed by another literal )

Related

Using regex to match adjacent words using first letter of abbreviations in text python

I have a list of abbreviations that I am trying to find in my text using regex. However I am struggling to find adjacent words by matching letters and have only achieved this with word matching. Here is my text
text = '''They posted out the United States Navy Seals (USNS) to the area.
Entrance was through an underground facility (UGF) as they has to bypass a no-fly-zone (NFZ).
I found an assault-rifle (AR) in the armoury.'''
My list is as such: [USNS, UGF, NFZ, AR]
I would like to find the corresponding long forms in the text using the first letter of each abbreviation. It would also need to be non-case sensitive. My attempt so far has been as such:
re.search(r'\bUnited\W+?States\b\W+?Navy\b\W+?Seals\b', text)
which returns United States Navy Seals however when I try and just use the first letter:
re.search(r'\bU\W+?S\b\W+?N\b\W+?S\b', text)
It then returns nothing. Furthermore some of the abbreviations contain more than the initial of a word in the text such as UGF - underground facility.
My actual goal is to eventually replace all abbreviations in the text (USNS, UGF, NFZ, AR) with their corresponding long forms (United States Navy Seals, underground facility, no-fly-zone, assault-rifle).
In your last regex [1]
re.search(r'\bU\W+?S\b\W+?N\b\W+?S\b', text)
you get no match because you made several mistakes:
\w+ means one or more word characters, \W+ is for one or more non-word characters.
the \b boundary anchor is sometimes in the wrong place (i.e. between the initial letter and the rest of the word)
re.search(r'\bU\w+\sS\w+?\sN\w+?\sS\w+', text)
should match.
And, well,
print(re.search(r'\bu\w+?g\w+\sf\w+', text))
matches of course underground facility but in a long text, there will be much more irrelevant matches.
Approach to generalization
Finally I built a little "machine" that dynamically creates regular expressions from the known abbreviations:
import re
text = '''They posted out the United States Navy Seals (USNS) to the area.
Entrance was through an underground facility (UGF) as they has to bypass a no-fly-zone (NFZ).
I found an assault-rifle (AR) in the armoury.'''
abbrs = ['USNS', 'UGF', 'NFZ', 'AR']
for abbr in abbrs:
pattern = ''.join(map(lambda i: '['+i.upper()+i.lower()+'][a-z]+[ a-z-]', abbr))
print(pattern)
print(re.search(pattern, text, flags=re.IGNORECASE))
The output of above script is:
[Uu][a-z]+[ a-z-][Ss][a-z]+[ a-z-][Nn][a-z]+[ a-z-][Ss][a-z]+[ a-z-]
<re.Match object; span=(20, 45), match='United States Navy Seals '>
[Uu][a-z]+[ a-z-][Gg][a-z]+[ a-z-][Ff][a-z]+[ a-z-]
<re.Match object; span=(89, 110), match='underground facility '>
[Nn][a-z]+[ a-z-][Ff][a-z]+[ a-z-][Zz][a-z]+[ a-z-]
<re.Match object; span=(140, 152), match='no-fly-zone '>
[Aa][a-z]+[ a-z-][Rr][a-z]+[ a-z-]
<re.Match object; span=(170, 184), match='assault-rifle '>
Further generalization
If we assume that in a text each abbreviation is introduced after the first occurrence of the corresponding long form, and we further assume that the way it is written definitely starts with a word boundary and definitely ends with a word boundary (no assumptions about capitalization and the use of hyphens), we can try to extract a glossary automatically like this:
import re
text = '''They posted out the United States Navy Seals (USNS) to the area.
Entrance was through an underground facility (UGF) as they has to bypass a no-fly-zone (NFZ).
I found an assault-rifle (AR) in the armoury.'''
# build a regex for an initial
def init_re(i):
return f'[{i.upper()+i.lower()}][a-z]+[ -]??'
# build a regex for an abbreviation
def abbr_re(abbr):
return r'\b'+''.join([init_re(i) for i in abbr])+r'\b'
# build an inverse glossary from a text
def inverse_glossary(text):
abbreviations = set(re.findall('\([A-Z]+\)', text))
igloss = dict()
for pabbr in abbreviations:
abbr = pabbr[1:-1]
pattern = '('+abbr_re(abbr)+') '+r'\('+abbr+r'\)'
m = re.search(pattern, text)
if m:
longform = m.group(1)
igloss[longform] = abbr
return igloss
igloss = inverse_glossary(text)
for long in igloss:
print('{} -> {}'.format(long, igloss[long]))
The output is
no-fly-zone -> NFZ
United States Navy Seals -> USNS
assault-rifle -> AR
underground facility -> UGF
By using an inverse glossary you may easily replace all long forms into their corresponding abbreviation. A bit harder is it to do for all but the first occurrence. There is much space for refinement, for example to correctly handle line breaks within long forms (also to use re.compile).
As to replace the abbreviations with the long forms, you have to build a normal glossary instead of an inverse one:
# build a glossary from a text
def glossary(text):
abbreviations = set(re.findall('\([A-Z]+\)', text))
gloss = dict()
for pabbr in abbreviations:
abbr = pabbr[1:-1]
pattern = '('+abbr_re(abbr)+') '+r'\('+abbr+r'\)'
m = re.search(pattern, text)
if m:
longform = m.group(1)
gloss[abbr] = longform
return gloss
gloss = glossary(text)
for abbr in gloss:
print('{}: {}'.format(abbr, gloss[abbr]))
The output here is
AR: assault-rifle
NFZ: no-fly-zone
UGF: underground facility
USNS: United States Navy Seals
The replacement itself is left to the reader.
[1]
Let's take a closer look at your first regex again:
re.search(r'\bUnited\W+?States\b\W+?Navy\b\W+?Seals\b', text)
The boundary anchors (\b) are redundant. They can be removed without changing anything in the result because \W+? means at least one non-word character after the last character of States and Navy. They cause no problems here but I guess that they led to the confusion when you started by modifying from it to get a more general one.
You could use the below regex which would take care of the case sensitivity as well. Click here.
This would just find United States Navy Seals.
\s[u|U].*?[s|S].*?[n|N].*?[s|S]\w+
Similarly, for UF,
You can use - \s[u|U].*?[g|G].*?[f|F]\w+
Please find a pattern above. The characters are just joined with .*? and each character is used as [a|A] which would match either lower case or upper case. The start would be \s since it should be a word and the end would \w+.
Play around.

How to extract set of substrings from a paragraph of string

Say I have a string:
output='[{ "id":"b678792277461" ,"Responses":{"SUCCESS":{"sh xyz":"sh xyz\\n Name Age Height Weight\\n Ana \\u003c15 \\u003e 163 47\\n 43\\n DEB \\u003c23 \\u003e 155 \\n Grey \\u003c53 \\u003e 143 54\\n 63\\n Sch#"},"FAILURE":{},"BLACKLISTED":{}}}]'
This is just an example but I have much longer output which is response from an api call.
I want to extract all names (ana, dab, grey) and put in a separate list.
how can I do it?
json_data = json.loads(output)
json_data = [{'id': 'b678792277461', 'Responses': {'SUCCESS': {'sh xyz': 'sh xyz\n Name Age Height Weight\n Ana <15 > 163 47\n 43\n DEB <23 > 155 \n Grey <53 > 143 54\n 63\n Sch#'}, 'FAILURE': {}, 'BLACKLISTED': {}}}]
1) I have tried re.findall('\\n(.+)\\u',output)
but this didn't work because it says "incomplete sequence u"
2)
start = output.find('\\n')
end = output.find('\\u', start)
x=output[start:end]
But I couldn't figure out how to run this piece of code in loop to extract names
Thanks
The \u object is not a letter and it cannot be matched. It is a part of a Unicode sequence. The following regex works, but it is kind of quirky. It looks for the beginning of each line, except for the first one, until the first space.
output = json_data[0]['Responses']['SUCCESS']['sh xyz']
pattern = "\n\s*([a-z]+)\s+"
result = re.findall(pattern, output, re.M | re.I)
#['Name', 'Ana', 'DEB', 'Grey']
Explanation of the pattern:
start at a new line (\n)
skip all spaces, if any (\s*)
collect one or more letters ([a-z]+)
skip at least one space (\s+)
Unfortunately, "Name" is also recognized as a name. If you know that it is always present in the first line, slice the list of the results:
result[1:]
#['Ana', 'DEB', 'Grey']
I use regexr.com and play around with the regular expression until I get it right and then covert that into Python.
https://regexr.com/
I'm assuming the \n is the newline character here and I'll bet your \u error is caused by a line break. To use the multiline match in Python, you need to use that flag when you compile.
\n(.*)\n - this will be greedy and grab as many matches as possible (In the example it would grab the entire \nAna through 54\n
[{ "id":"678792277461" ,"Responses": {Name Age Height Weight\n Ana \u00315 \u003163 47\n 43\n Deb \u00323 \u003155 60 \n Grey \u00353 \u003144 54\n }]
import re
a = re.compile("\\n(.*)\\n", re.MULTILINE)
for responses in a.match(source):
match = responses.split("\n")
# match[0] should be " Ana \u00315 \u003163 47"
# match[1] should be " Deb \u00323 \u003155 60" etc.

Python create acronym from first characters of each word and include the numbers

I have a string as follows:
theatre = 'Regal Crown Center Stadium 14'
I would like to break this into an acronym based on the first letter in each word but also include both numbers:
desired output = 'RCCS14'
My code attempts below:
acronym = "".join(word[0] for word in theatre.lower().split())
acronym = "".join(word[0].lower() for word in re.findall("(\w+)", theatre))
acronym = "".join(word[0].lower() for word in re.findall("(\w+ | \d{1,2})", theatre))
acronym = re.search(r"\b(\w+ | \d{1,2})", theatre)
In which I wind up with something like: rccs1 but can't seem to capture that last number. There could be instances when the number is in the middle of the name as well: 'Regal Crown Center 14 Stadium' as well. TIA!
See regex in use here
(?:(?<=\s)|^)(?:[a-z]|\d+)
(?:(?<=\s)|^) Ensure what precedes is either a space or the start of the line
(?:[a-z]|\d+) Match either a single letter or one or more digits
The i flag (re.I in python) allows [a-z] to match its uppercase variants.
See code in use here
import re
r = re.compile(r"(?:(?<=\s)|^)(?:[a-z]|\d+)", re.I)
s = 'Regal Crown Center Stadium 14'
print(''.join(r.findall(s)))
The code above finds all instances where the regex matches and joins the list items into a single string.
Result: RCCS14
You can use re.sub() to remove all lowercase letters and spaces.
Regex: [a-z ]+
Details:
[]+ Match a single character present in the list between one and
unlimited times
Python code:
re.sub(r'[a-z ]+', '', theatre)
Output: RCCS14
Code demo
I can't comment since I don't have enough reputation, but S. Jovan answer isn't satisfying since it assumes that each word starts with a capital letter and that each word has one and only one capital letter.
re.sub(r'[a-z ]+', '', "Regal Crown Center Stadium YB FIEUBFB DBUUFG FUEH 14")
will returns 'RCCSYBFIEUBFBDBUUFGFUEH14'
However ctwheels answers will be able to work in this case :
r = re.compile(r"\b(?:[a-z]|\d+)", re.I)
s = 'Regal Crown Center Stadium YB FIEUBFB DBUUFG FUEH 14'
print(''.join(r.findall(s)))
will print
RCCSYFDF14
import re
theatre = 'Regal Crown Center Stadium 14'
r = re.findall("\s(\d+|\S)", ' '+theatre)
print(''.join(r))
Gives me RCCS14

getting words between m and n characters

I am trying to get all names that start with a capital letter and ends with a full-stop on the same line where the number of characters are between 3 and 5
My text is as follows:
King. Great happinesse
Rosse. That now Sweno, the Norwayes King,
Craues composition:
Nor would we deigne him buriall of his men,
Till he disbursed, at Saint Colmes ynch,
Ten thousand Dollars, to our generall vse
King. No more that Thane of Cawdor shall deceiue
Our Bosome interest: Goe pronounce his present death,
And with his former Title greet Macbeth
Rosse. Ile see it done
King. What he hath lost, Noble Macbeth hath wonne.
I am testing it out on this link. I am trying to get all words between 3 and 5 but haven't succeeded.
Does this produce your desired output?
import re
re.findall(r'[A-Z].{2,4}\.', text)
When text contains the text in your question it will produce this output:
['King.', 'Rosse.', 'King.', 'Rosse.', 'King.']
The regex pattern matches any sequence of characters following an initial capital letter. You can tighten that up if required, e.g. using [a-z] in the pattern [A-Z][a-z]{2,4}\. would match an upper case character followed by between 2 to 4 lowercase characters followed by a literal dot/period.
If you don't want duplicates you can use a set to get rid of them:
>>> set(re.findall(r'[A-Z].{2,4}\.', text))
set(['Rosse.', 'King.'])
You may have your own reasons for wanting to use regexs here, but Python provides a rich set of string methods and (IMO) it's easier to understand the code using these:
matched_words = []
for line in open('text.txt'):
words = line.split()
for word in words:
if word[0].isupper() and word[-1] == '.' and 3 <= len(word)-1 <=5:
matched_words.append(word)
print matched_words

Why doesn't this regular expression work in all cases?

I have a text file containing entries like this:
#markwarner VIRGINIA - Mark Warner
#senatorleahy VERMONT - Patrick Leahy NO
#senatorsanders VERMONT - Bernie Sanders
#orrinhatch UTAH - Orrin Hatch NO
#jimdemint SOUTH CAROLINA - Jim DeMint NO
#senmikelee UTAH -- Mike Lee
#kaybaileyhutch TEXAS - Kay Hutchison
#johncornyn TEXAS - John Cornyn
#senalexander TENNESSEE - Lamar Alexander
I have written the following to remove the 'NO' and the dashes using regular expressions:
import re
politicians = open('testfile.txt')
text = politicians.read()
# Grab the 'no' votes
# Should be 11 entries
regex = re.compile(r'(no\s#[\w+\d+\.]*\s\w+\s?\w+?\s?\W+\s\w+\s?\w+)', re.I)
no = regex.findall(text)
## Make the list a string
newlist = ' '.join(no)
## Replace the dashes in the string with a space
deldash = re.compile('\s-*\s')
a = deldash.sub(' ', newlist)
# Delete 'NO' in the string
delno = re.compile('NO\s')
b = delno.sub('', a)
# make the string into a list
# problem with #jimdemint SOUTH CAROLINA Jim DeMint
regex2 = re.compile(r'(#[\w\d\.]*\s[\w\d\.]*\s?[\w\d\.]\s?[\w\d\.]*?\s+?\w+)', re.I)
lst1 = regex2.findall(b)
for i in lst1:
print i
When I run the code, it captures the twitter handle, state and full names other than the surname of Jim DeMint. I have stated that I want to ignore case for the regex.
Any ideas? Why is the expression not capturing this surname?
It's missing it because his state name contains two words: SOUTH CAROLINA
Have your second regex be this, it should help
(#[\w\d\.]*\s[\w\d\.]*\s?[\w\d\.]\s?[\w\d\.]*?\s+?\w+(?:\s\w+)?)
I added
(?:\s\w+)?
Which is a optional, non capturing group matching a space followed by one or more alphanumeric underscore characters
http://regexr.com?31fv5 shows that it properly matches the input with the NOs and dashes stripped
EDIT:
If you want one master regex to capture and split everything properly, after you remove the Nos and dashes, use
((#[\w]+?\s)((?:(?:[\w]+?)\s){1,2})((?:[\w]+?\s){2}))
Which you can play with here: http://regexr.com?31fvk
The full match is available in $1, the Twitter handle in $2, the State in $3 And the name in $4
Each capturing group works as follows:
(#[\w]+?\s)
This matches an # sign followed by at least one but as few characters as possible until a space.
((?:(?:[\w]+?)\s){1,2})
This matches and captures 1 or two words, which should be the state. This only works because of the next piece, which MUST have two words
((?:[\w]+?\s){2})
Matches and captures exactly two words, which is defined as few characters as possible followed by a space
text=re.sub(' (NO|-+)(?= |$)','',text)
And to capture everything:
re.findall('(#\w+) ([A-Z ]+[A-Z]) (.+?(?= #|$))',text)
Or all at once:
re.findall('(#\w+) ([A-Z ]+[A-Z])(?: NO| -+)? (.+?(?= #|$))',text)

Categories