Regex to extract titles from the text - python

Can anyone help with a regex to extract the text phrases after 'Title:' from the following text? (I have bolded the text to show the portion to be extracted.)
Title: Anorectal Fistula (Fistula-in-Ano) Procedure Code(s):
Effective date: 7/1/07
Title:
2003247
or previous effective dates)
Title:
ST2 Assay for Chronic Heart Failure
Description/Background
Heart Failure
HF is one among many cardiovascular diseases that comprises a major cause of morbidity
and mortality worldwide. The term “heart failure” (HF) refers to a complex clinical syndrome .
I am using the regex: (?:Title: \n+(.*))|(?:Title:\n+(.*))|(?<=Title: )(.*)(?=Procedure)
However, it doesn't seem to capture the terms correctly! I am using Python 2.7.12

I suggest using
Title:\s*(.*?)\s*Procedure|Title:\s*(.*)
See the regex demo.
Details:
Title: - literal text Title:
\s* - 0+ whitespaces
(.*?) - Group 1: any 0+ chars other than linebreak symbols as few as possible up to the first
\s*Procedure - 0+ whitespaces + the string Procedure
| - or
Title:\s* - Title: string + 0+ whitespaces
(.*) - Group 2: zero or more any chars other than linebreak symbols as many as possible (the rest of the line).
Python code:
import re
regex = r"Title:\s*(.*?)\s*Procedure|Title:\s*(.*)"
test_str = ("Title: Anorectal Fistula (Fistula-in-Ano) Procedure Code(s):\n\n"
"Effective date: 7/1/07\n\n"
"Title:\n\n"
"2003247\n\n"
"or previous effective dates)\n\n"
"Title:\n\n"
"ST2 Assay for Chronic Heart Failure\n\n"
"Description/Background\n\n"
"Heart Failure\n\n"
"HF is one among many cardiovascular diseases that comprises a major cause of morbidity and mortality worldwide. The term “heart failure” (HF) refers to a complex clinical syndrome .")
res = []
for m in re.finditer(regex, test_str):
    if m.group(1):
        res.append(m.group(1))
    else:
        res.append(m.group(2))
print(res)
# => ['Anorectal Fistula (Fistula-in-Ano)', '2003247', 'ST2 Assay for Chronic Heart Failure']
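As a variant, the two alternatives can be folded into a single capturing group so that re.findall returns the titles directly. This is only a minimal sketch, assuming Python 3 and that a title never spans more than one line; the pattern and the names TITLE_RE and extract_titles are illustrative and not part of the answer above.

```python
import re

# One group: stop lazily at "Procedure" or, failing that, at the end of the line.
# re.MULTILINE makes "$" match at every line end, not only at the end of the string.
TITLE_RE = re.compile(r"Title:\s*(.*?)\s*(?:Procedure|$)", re.MULTILINE)


def extract_titles(text):
    """Return the phrase that follows each 'Title:' marker (hypothetical helper)."""
    return TITLE_RE.findall(text)


sample = (
    "Title: Anorectal Fistula (Fistula-in-Ano) Procedure Code(s):\n\n"
    "Effective date: 7/1/07\n\n"
    "Title:\n\n"
    "2003247\n\n"
    "Title:\n\n"
    "ST2 Assay for Chronic Heart Failure\n\n"
    "Description/Background\n"
)
titles = extract_titles(sample)
print(titles)
```

With a single group there is no need to check which alternative matched, at the cost of a slightly less obvious pattern.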

Related

Python regex for matching arbitrary number of elements between 2 substrings?

I'm trying to write a regex which finds all characters between a starting token ('MS' or 'PhD') and an ending token ('.' or '!'). What makes this tricky is that it's fairly common for both starting tokens to be present in my text data; I'm only interested in the characters bounded by the last starting token and the first ending token (and all such occurrences).
start = 'MS|PhD'
end = '.|!'
input1 = "Candidate with MS or PhD in Statistics, Computer Science, or similar field."
output1 = "in Statistics, Computer Science, or similar field"
input2 = "Applicant with MS in Biology or Chemistry desired."
output2 = "in Biology or Chemistry desired"
Here's my best attempt, which is currently returning an empty list:
# start any char end
pattern = r'^(MS|PhD) .* (\.|!)$'
re.findall(pattern,"candidate with MS in Chemistry.")
>>>
[]
Could someone point me in the right direction?
You could use a capturing group and match MS or PhD and the . or ! outside of the group.
\b(?:MS|PhD)\s*((?:(?!\b(?:MS|PhD)\b).)*)[.,]
\b(?:MS|PhD)\s* A word boundary, then either MS or PhD, followed by 0+ whitespace chars, matched here so that they are not captured in the group
( Capture group 1, which contains the desired value
(?: Non-capture group
(?!\b(?:MS|PhD)\b). Match any char except a newline, provided that the whole word MS or PhD does not start at that position
)* Close the non-capture group and repeat it 0+ times
)[.,] Close group 1 and match either . or ,
import re
regex = r"\b(?:MS|PhD)\s*((?:(?!\b(?:MS|PhD)\b).)*)[.,]"
s = ("Candidate with MS or PhD in Statistics, Computer Science, or similar field.\n"
"Applicant with MS in Biology or Chemistry desired.")
matches = re.findall(regex, s)
print(matches)
Output
['in Statistics, Computer Science, or similar field', 'in Biology or Chemistry desired']
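Note that the pattern above ends on [.,] and repeats greedily, so it runs up to the last period or comma on the line, while the question asks for '.' or '!' as end tokens and for the first ending token. A possible variant, sketched here under the assumption that the captured text itself never contains '.' or '!', excludes the end tokens from the repeated part and ends on [.!]. The third sample sentence is made up for illustration.

```python
import re

# The repeated part refuses to step over a whole-word MS/PhD and never
# consumes "." or "!", so the capture runs from the last start token
# to the first end token.
pattern = re.compile(r"\b(?:MS|PhD)\b\s*((?:(?!\b(?:MS|PhD)\b)[^.!])*)[.!]")

examples = [
    "Candidate with MS or PhD in Statistics, Computer Science, or similar field.",
    "Applicant with MS in Biology or Chemistry desired.",
    "PhD in Physics preferred! MS in Math accepted.",  # illustrative input
]
results = [pattern.findall(s) for s in examples]
print(results)
```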

finding an element between a tag and a list of tags using regex

I want to find elements between two different tags but the catch is the first tag is constant but the second tag can be any tag belonging to a particular list.
for example a string
'TRSF BOOK TRANSFER CREDIT SND= abcd bank , 123 ORG= qwer123 OGB= qwerasd OBI= 123433'
I have a list of tags ['TRSF','SND=','ORG=','OGB=','OBI=']
edit : added the availability of '=' in the list itself
My output should look something like this
TRSF : BOOK TRANSFER CREDIT
SND : abcd bank , 123
ORG : qwer123
OGB : qwerasd
OBI : 123433
The order of the tags, as well as which tags are present, may change, and new tags may also come into the picture.
Until now I was writing separate regex and string parsing code for each type, but that seems impractical as the number of combinations is unbounded.
Here is what I was doing :
org = re.findall("ORG=(.*?) OGB=",string_1)
snd = re.findall("SND=(.*?) ORG=",string_1)
_, _, obi = string_1.partition('OBI=')
Is there any way to do it like
<tag>(.*?)<tag in list>
or any other method ?
If the tag list is complete, you can use a regex like
\b(TRSF|SND|ORG|OGB|OBI)\b=?\s*(.*?)(?=\s*\b(?:TRSF|SND|ORG|OGB|OBI)\b|\Z)
See the regex demo. Details:
\b - a word boundary
(TRSF|SND|ORG|OGB|OBI) - a tag captured into Group 1
\b - a word boundary
=? - an optional =
\s* - 0+ whitespaces
(.*?) - Group 2: any zero or more chars, as few as possible
(?=\s*\b(?:TRSF|SND|ORG|OGB|OBI)\b|\Z) - either end of string (\Z) or zero or more whitespaces followed with a tag as a whole word.
See the Python demo:
import re
s='TRSF BOOK TRANSFER CREDIT SND= abcd bank , 123 ORG= qwer123 OGB= qwerasd OBI= 123433'
tags = ['TRSF','SND','ORG','OGB','OBI']
print( dict(re.findall(fr'\b({"|".join(tags)})\b=?\s*(.*?)(?=\s*\b(?:{"|".join(tags)})\b|\Z)', s.strip(), re.DOTALL)) )
# => {'TRSF': 'BOOK TRANSFER CREDIT', 'SND': 'abcd bank , 123', 'ORG': 'qwer123', 'OGB': 'qwerasd', 'OBI': '123433'}
Note the re.DOTALL (equal to re.S) makes the . match any chars including line break chars.
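Since the tag list may change, the one-liner above can be wrapped in a small helper that builds the pattern from whatever list is passed in. This is a sketch under a few assumptions: Python 3.6+ (f-strings), tags given with or without the trailing '=', and no tag appearing as a whole word inside a value. The name split_by_tags is hypothetical.

```python
import re


def split_by_tags(text, tags):
    """Map each tag found in text to the value that follows it (hypothetical helper)."""
    # Accept both "SND=" and "SND", and escape the names so that regex
    # metacharacters in future tags are matched literally.
    names = [re.escape(t.rstrip("=")) for t in tags]
    # Longer names first, so a tag that is a prefix of another cannot win.
    alternation = "|".join(sorted(names, key=len, reverse=True))
    pattern = re.compile(
        rf"\b({alternation})\b=?\s*(.*?)(?=\s*\b(?:{alternation})\b|\Z)",
        re.DOTALL,
    )
    return dict(pattern.findall(text.strip()))


s = 'TRSF BOOK TRANSFER CREDIT SND= abcd bank , 123 ORG= qwer123 OGB= qwerasd OBI= 123433'
parsed = split_by_tags(s, ['TRSF', 'SND=', 'ORG=', 'OGB=', 'OBI='])
print(parsed)
```

Because the result is a dict, a tag that occurs twice keeps only its last value; returning the list from findall instead would preserve duplicates.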

Removing varying text phrases through RegEx in a Python Data frame

Basically, I want to remove certain phrase patterns embedded in my text data:
Starts with an upper case letter and ends with an Em Dash "—"
Starts with an Em Dash "—" and ends with a "Read Next"
Say, I've got the following data:
CEBU CITY—The widow of slain human rights lawyer .... citing figures from the NUPL that showed that 34 lawyers had been killed in the past two years. —WITH REPORTS FROM JULIE M. AURELIO AND DJ YAPRead Next
and
Manila, Philippines—President .... but justice will eventually push its way through their walls of impunity, ... —REPORTS FROM MELVIN GASCON, JULIE M. AURELIO, DELFIN T. MALLARI JR., JEROME ANING, JOVIC YEE, GABRIEL PABICO LALU, PATHRICIA ANN V. ROXAS, DJ YAP, AFP, APRead Next
I want to remove the following phrases:
"CEBU CITY—"
"—WITH REPORTS FROM JULIE M. AURELIO AND DJ YAPRead Next"
"Manila, Philippines—"
"—REPORTS FROM MELVIN GASCON, JULIE M. AURELIO, DELFIN T. MALLARI JR., JEROME ANING, JOVIC YEE, GABRIEL PABICO LALU, PATHRICIA ANN V. ROXAS, DJ YAP, AFP, APRead Next"
I assume this needs two regexes, one for each of the patterns listed above.
The regex: —[A-Z].*Read Next\s*$ may work on the pattern # 2 but only when there are no other em dashes in the text data. It will not work when pattern # 1 occurs as it will remove the chunk from the first em dash it has seen until the "Read Next" string.
I have tried the following regex for pattern # 1:
^[A-Z]([A-Za-z]).+(—)$
But it does not work, and I do not understand why. That regex was supposed to look for a phrase that starts with any upper case letter, followed by a string of any length, as long as it ends with an "—".
The character you are treating as a hyphen (-) is not a hyphen but an em dash, so the regex needs an em dash rather than a hyphen at the start:
^—[A-Z].*Read Next\s*$
Here is the explanation for this regex,
^ --> Start of input
— --> Matches a literal Em Dash whose Unicode Decimal Code is 8212
[A-Z] --> Matches an upper case letter
.* --> Matches any character zero or more times
Read Next --> Matches these literal words
\s* --> This is for matching any optional white space that might be present at the end of line
$ --> End of input
The regex that should take care of this -
^—[A-Z]+(.)*(Read Next)$
You can try implementing this regex on your data and see if it works out.
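Both patterns above are anchored with ^, so they can only match when the em dash is the very first character of the string; in the sample data the credit line sits at the end of the text, after the article body. Likewise, the asker's pattern for case 1 ends in $ and therefore requires the em dash to be the last character. A possible alternative, sketched here under the assumption that the dateline contains no em dash and that the credit line starts at the last em dash of the text, handles the two cases with two separate substitutions. The em dash is written as the escape \u2014, the sample string is a shortened stand-in for the real data, and the function name is hypothetical.

```python
import re

EM_DASH = "\u2014"

# Pattern 1: an upper-case letter, any non-dash characters, then the first em dash.
LEADING = re.compile(rf"^[A-Z][^{EM_DASH}]*{EM_DASH}")
# Pattern 2: the last em dash, any non-dash characters, then "Read Next" at the end.
TRAILING = re.compile(rf"{EM_DASH}[^{EM_DASH}]*Read Next\s*$")


def strip_dateline_and_credits(text):
    """Remove a leading dateline and a trailing credit line (hypothetical helper)."""
    text = LEADING.sub("", text, count=1)
    return TRAILING.sub("", text).rstrip()


sample = (
    "CEBU CITY" + EM_DASH + "The widow of a slain lawyer spoke. "
    + EM_DASH + "WITH REPORTS FROM JULIE M. AURELIO AND DJ YAPRead Next"
)
cleaned = strip_dateline_and_credits(sample)
print(cleaned)
```

On a data frame, such a function could be applied to the text column with pandas' Series.apply. Note that a text without a dateline but with an em dash further in would lose everything up to that dash, so capping the length of the leading part may be worth considering.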

Regex to find name in sentence

I have some sentences like
1:
"RLB shows Oubre Jr. (WAS) legally ties up Nurkic (POR), and a held
ball is correctly called."
2:
"Nurkic (POR) maintains legal
guarding position and makes incidental contact with Wall (WAS) that
does not affect his driving shot attempt."
I need to use a Python regex to find the names "Oubre Jr." and "Nurkic" in sentence 1, and "Nurkic" and "Wall" in sentence 2.
p = r'\s*(\w+?)\s[(]'
Using this pattern, I can find ['Nurkic', 'Wall'] in sentence 2, but in sentence 1 I only find ['Nurkic'] and miss "Oubre Jr.".
Can anyone help?
You can use the following regex:
(?:[A-Z][a-z][\s\.a-z]*)+(?=\s\()
|-----Main Pattern-----|
Details:
(?:) - Creates a non-capturing group
[A-Z] - Matches 1 uppercase letter
[a-z] - Matches 1 lowercase letter
[\s\.a-z]* - Matches whitespace, periods ('.') or lowercase letters 0+ times
(?=\s\() - Positive lookahead: the main pattern matches only if it is followed by a whitespace character and '(', which are not included in the match
import re
# named "text" rather than "str" to avoid shadowing the built-in type
text = '''RLB shows Oubre Jr. (WAS) legally ties up Nurkic (POR), and a held ball is correctly called.
Nurkic (POR) maintains legal guarding position and makes incidental contact with Wall (WAS) that does not affect his driving shot attempt.'''
res = re.findall(r'(?:[A-Z][a-z][\s\.a-z]*)+(?=\s\()', text)
print(res)
Demo: https://repl.it/#RahulVerma8/OvalRequiredAdvance?language=python3
Match: https://regex101.com/r/OsLTrY/1
Here is one approach:
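import re
line = "RLB shows Oubre Jr (WAS) legally ties up Nurkic (POR), and a held ball is correctly called."
# the pattern is double-quoted so that the ' inside the character class does not end the string;
# the class is repeated with + so that the whole name is matched, not just two characters
results = re.findall(r"([A-Z][\w+']+(?: [JS]r[.]?)?)(?= \([A-Z]+\))", line)
print(results)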
['Oubre Jr', 'Nurkic']
The above logic will attempt to match one name, beginning with a capital letter, which is possibly followed by either the suffix Jr. or Sr., which in turn is followed by a ([A-Z]+) term.
You need a pattern that you can match. For your sentences you could try to match things before (XXX) and include a list of possible "suffixes" as well; you would need to extract those from your sources.
import re
suffs = ["Jr."] # append more to list
# escape each suffix so that "." is matched literally
rsu = r"(?:" + "|".join(re.escape(s) for s in suffs) + r")? ?"
# combine with suffixes
regex = r"(\w+ " + rsu + r")\(\w{3}\)"
test_str = "RLB shows Oubre Jr. (WAS) legally ties up Nurkic (POR), and a held ball is correctly called. Nurkic (POR) maintains legal guarding position and makes incidental contact with Wall (WAS) that does not affect his driving shot attempt."
# group 1 includes the space before "(", so strip it from each name
names = [m.group(1).strip() for m in re.finditer(regex, test_str)]
print(names)
Output:
['Oubre Jr.', 'Nurkic', 'Nurkic', 'Wall']
This should work as long as you do not have names with non-\w characters in them. If you need to adapt the regex, use https://regex101.com/r/pRr9ZU/1 as starting point.
Explanation:
r"(?:" + "|".join(re.escape(s) for s in suffs) + r")? ?" --> all items in the list suffs are escaped and joined with | (OR) inside a non-capturing group (?:...), which is made optional and followed by an optional space.
r"(\w+ " + rsu + r")\(\w{3}\)" --> the regex looks for word characters followed by a space and the optional suffix group we just built, followed by a literal (, three word characters, and a literal )

How do I delimit my input by this capture group?

For this regular expression:
(?<!Mr|Dr|Ms|Jr|Sr)[.?!]+(\s)[A-Z0-9]
I want the input string to be split by the captured matching \s character - the green matches as seen over here.
However, when I run this:
import re
p = re.compile(r'(?<!Mr|Dr|Ms|Jr|Sr)[.?!]+(\s)[A-Z0-9]')
test_str = "Mr. Smith bought cheapsite.com for 1.5 million dollars i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.23 is the ish. My name is! Why wouldn't you... this is.\nAndrew"
re.split(p, test_str)
It seems to split the string at the regions given by [.?!]+ and [A-Z0-9] (thus incorrectly omitting them) and leaves \s in the results.
To clarify:
Input: he paid a lot for it. Did he mind
Received Output: ['he paid a lot for it','\s','id he mind']
Expected Output: ['he paid a lot for it.','Did he mind']
You need to remove the capturing group from around (\s) and put the last character class into a look-ahead to exclude it from the match:
p = re.compile(r'(?<!Mr|Dr|Ms|Jr|Sr)[.?!]+\s(?=[A-Z0-9])')
#                                         ^^^^^        ^
test_str = "Mr. Smith bought cheapsite.com for 1.5 million dollars i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.23 is the ish. My name is! Why wouldn't you... this is.\nAndrew"
print(p.split(test_str))
See IDEONE demo and the regex demo.
Any capturing group in a regex pattern will create an additional element in the resulting array during re.split.
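The effect of a capturing group on re.split, and a way to keep the punctuation while still splitting, can be sketched as follows. This is an illustration rather than the answer's own code: it assumes Python 3, a single whitespace character between sentences, and a short made-up input. The second pattern moves everything except the whitespace into zero-width lookarounds, with the abbreviations written with their dot so that every lookbehind alternative has the same width.

```python
import re

text = "Mr. Smith paid a lot for it. Did he mind? Yes."

# 1) A capturing group in the separator is returned as an extra list item,
#    and the consumed punctuation and capital letter are lost.
with_group = re.split(r"[.?!]+(\s)[A-Z0-9]", text)

# 2) Zero-width lookarounds consume only the whitespace, so the punctuation
#    stays with the sentence on the left and the capital with the one on the right.
boundary = re.compile(r"(?<!Mr\.|Dr\.|Ms\.|Jr\.|Sr\.)(?<=[.?!])\s(?=[A-Z0-9])")
sentences = boundary.split(text)

print(with_group)
print(sentences)
```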
To force the punctuation to appear inside the "sentences", you can use this matching regex with re.findall:
import re
p = re.compile(r'\s*((?:(?:Mr|Dr|Ms|Jr|Sr)\.|\.(?!\s+[A-Z0-9])|[^.!?])*[.?!]|[^.!?]+)')
test_str = "Mr. Smith bought cheapsite.com for 1.5 million dollars i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.23 is the ish. My name is! Why wouldn't you... this is.\nAndrew"
print(p.findall(test_str))
See IDEONE demo
Results:
['Mr. Smith bought cheapsite.com for 1.5 million dollars i.e. he paid a lot for it.', 'Did he mind?', "Adam Jones Jr. thinks he didn't.", "In any case, this isn't true...", "Well, with a probability of .9 it isn't.23 is the ish.", 'My name is!', "Why wouldn't you... this is.", 'Andrew']
The regex follows the rules in your original pattern:
\s* - matches 0 or more whitespace to omit from the result
((?:(?:Mr|Dr|Ms|Jr|Sr)\.|\.(?!\s+[A-Z0-9])|[^.!?])*[.?!]|[^.!?]+) - 2 alternatives that are captured and returned by re.findall:
(?:(?:Mr|Dr|Ms|Jr|Sr)\.|\.(?!\s+[A-Z0-9])|[^.!?])* - 0 or more sequences of...
(?:Mr|Dr|Ms|Jr|Sr)\. - abbreviated titles
\.(?!\s+[A-Z0-9]) - matches a dot not followed by 1 or more whitespace and then uppercase letters or digits
[^.!?] - any character but a ., !, and ?
or...
[^.!?]+ - any one or more characters but a ., !, and ?
