Python regex for matching arbitrary number of elements between 2 substrings? - python

I'm trying to write a regex which finds all characters between a starting token ('MS' or 'PhD') and an ending token ('.' or '!'). What makes this tricky is that both starting tokens are often present in my text data, and I'm only interested in the characters bounded by the last starting token and the first ending token (and in all such occurrences).
start = 'MS|PhD'
end = '.|!'
input1 = "Candidate with MS or PhD in Statistics, Computer Science, or similar field."
output1 = "in Statistics, Computer Science, or similar field"
input2 = "Applicant with MS in Biology or Chemistry desired."
output2 = "in Biology or Chemistry desired"
Here's my best attempt, which is currently returning an empty list:
# start any char end
pattern = r'^(MS|PhD) .* (\.|!)$'
re.findall(pattern,"candidate with MS in Chemistry.")
>>>
[]
Could someone point me in the right direction?

You could use a capturing group and match MS or PhD and the . or ! outside of the group.
\b(?:MS|PhD)\s*((?:(?!\b(?:MS|PhD)\b).)*)[.!]
\b(?:MS|PhD)\s* A word boundary, match either MS or PhD followed by 0+ whitespace chars, so the leading whitespace is not captured in the group
( Capture group 1, which contains the desired value
(?: Non capture group
(?!\b(?:MS|PhD)\b). Match any char except a newline, as long as MS or PhD does not start at that position
)* Close the non capture group and repeat it 0+ times
)[.!] Close group 1 and match either . or !
Regex demo | Python demo
import re
regex = r"\b(?:MS|PhD)\s*((?:(?!\b(?:MS|PhD)\b).)*)[.!]"
s = ("Candidate with MS or PhD in Statistics, Computer Science, or similar field.\n"
"Applicant with MS in Biology or Chemistry desired.")
matches = re.findall(regex, s)
print(matches)
Output
['in Statistics, Computer Science, or similar field', 'in Biology or Chemistry desired']
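If the start and end tokens live in Python lists, the same idea can be built programmatically. This is only a sketch under a few assumptions: the helper name between_tokens is made up for illustration, the tokens are escaped with re.escape, and the repeat is made lazy (*?) so the match stops at the first ending token rather than the last one on the line.

```python
import re

def between_tokens(text, starts, ends):
    """Hypothetical helper: text between the last start token and the first end token."""
    start_alt = "|".join(map(re.escape, starts))
    end_alt = "|".join(map(re.escape, ends))
    # The tempered dot refuses to step onto another start token,
    # so a match can only begin at the last start token before an end token.
    pattern = rf"\b(?:{start_alt})\b\s*((?:(?!\b(?:{start_alt})\b).)*?)(?:{end_alt})"
    return re.findall(pattern, text)

input1 = "Candidate with MS or PhD in Statistics, Computer Science, or similar field."
input2 = "Applicant with MS in Biology or Chemistry desired."
print(between_tokens(input1, ["MS", "PhD"], [".", "!"]))
# ['in Statistics, Computer Science, or similar field']
print(between_tokens(input2, ["MS", "PhD"], [".", "!"]))
# ['in Biology or Chemistry desired']
```

The attempt in the question returns an empty list because ^ and $ anchor the pattern to the whole string, and the sample string does not start with MS or PhD.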

Related

Capture the n previous words when matching a string

Let's say I have this text:
abcdefg Mark Jones (PP) etc etc
akslaskAS Taylor Daniel Lautner (PMB) blabla
etcetc Allan Stewart Konigsberg Farrow (PRTW)
I want to capture these personal names:
Mark Jones, Taylor Daniel Lautner, Allan Stewart Konigsberg Farrow.
Basically, when we find (P followed by any capital letter, we capture the n previous words that start with a capital letter.
What I have achieved so far is to capture just one previous word with this code: \w+(?=\s+(\(P+[A-Z])). But I couldn't evolve from that.
I appreciate it if someone can help :)
Regex pattern
\b((?:[A-Z]\w+\s?)+)\s\(P[A-Z]
In order to find all matching occurrences of the above regex pattern we can use re.findall
import re
text = """abcdefg Mark Jones (PP) etc etc
akslaskAS Taylor Daniel Lautner (PMB) blabla
etcetc Allan Stewart Konigsberg Farrow (PRTW)
"""
matches = re.findall(r'\b((?:[A-Z]\w+\s?)+)\s\(P[A-Z]', text)
>>> matches
['Mark Jones', 'Taylor Daniel Lautner', 'Allan Stewart Konigsberg Farrow']
Regex details
\b : Word boundary to prevent partial matches
((?:[A-Z]\w+\s?)+): First Capturing group
(?:[A-Z]\w+\s?)+: Non capturing group matches one or more times
[A-Z]: Matches a single uppercase letter from A to Z
\w+: Matches any word character one or more times
\s? : Matches any whitespace character zero or one times
\s : Matches a single whitespace character
\(: Matches the character ( literally
P : Matches the character P literally
[A-Z] : Matches a single uppercase letter from A to Z
See the online regex demo
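The breakdown above can also be kept next to the pattern itself with re.VERBOSE, which ignores unescaped whitespace in the pattern and allows # comments. A minimal sketch, where the name NAME_BEFORE_TAG is only an illustrative choice:

```python
import re

# Same pattern as above, spread over several lines with comments.
NAME_BEFORE_TAG = re.compile(r"""
    \b                    # word boundary to prevent partial matches
    ((?:[A-Z]\w+\s?)+)    # group 1: one or more capitalized words
    \s                    # a single whitespace character
    \(P[A-Z]              # opening parenthesis, P, then an uppercase letter
""", re.VERBOSE)

text = """abcdefg Mark Jones (PP) etc etc
akslaskAS Taylor Daniel Lautner (PMB) blabla
etcetc Allan Stewart Konigsberg Farrow (PRTW)
"""
names = NAME_BEFORE_TAG.findall(text)
print(names)
# ['Mark Jones', 'Taylor Daniel Lautner', 'Allan Stewart Konigsberg Farrow']
```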
With your shown samples, you could try the following, using Python's re library. First, findall with (.*?)\s+\((?=P[A-Z]) captures everything on a line before a ( that is followed by P and a capital letter, and the results are stored in the list lst. Then re.sub removes the first run of non-whitespace characters, together with the whitespace after it, from each item, which leaves only the names.
import re
var="""abcdefg Mark Jones (PP) etc etc
akslaskAS Taylor Daniel Lautner (PMB) blabla
etcetc Allan Stewart Konigsberg Farrow (PRTW)"""
lst = re.findall(r'(.*?)\s+\((?=P[A-Z])',var)
[re.sub(r'^\S+\s+','',s) for s in lst]
Output will be as follows:
['Mark Jones', 'Taylor Daniel Lautner', 'Allan Stewart Konigsberg Farrow']

Pandas Regex: Separate name from string that starts with word or start of string, and ends in certain words

I have a pandas series that contains rows of share names amongst other details:
Netflix DIVIDEND
Apple Inc (All Sessions) COMM
Intel Corporation CONS
Correction Netflix Section 31 Fee
I'm trying to use a regex to retrieve the stock name, which I did with this look ahead:
transactions_df["Share Name"] = transactions_df["MarketName"].str.extract(r"(^.*?(?=DIVIDEND|\(All|CONS|COMM|Section))")
The only thing I'm having trouble with is the row Correction Netflix Section 31 Fee, where my regex is getting the sharename as Correction Netflix. I don't want the word "Correction".
I need my regular expression to check for either the start of the string, OR the word "Correction ".
I tried a few things, such as an OR | with the start of string character ^. I also tried a look behind to check for ^ or Correction but the error says they need to be constant length.
r"((^|Correction ).*?(?=DIVIDEND|\(All|CONS|COMM|Section))"
gives an error; ValueError: Wrong number of items passed 2, placement implies 1. I'm new to regex so I don't really know what this means.
You could use an optional part, and instead of lookarounds use a capture group with a match:
^(?:Correction\s*)?(\S.*?)\s*(?:\([^()]*\)|DIVIDEND|All|CONS|COMM|Section)
^ Start of string
(?:Correction\s*)? Optionally match Correction followed by 0+ whitespace chars
(\S.*?)\s* Capture in group 1, matching a non whitespace char and then as few chars as possible, and match (not capture) 0+ whitespace chars
(?: Non capture group for the alternation |
\([^()]*\) Match from ( till )
| Or
DIVIDEND|All|CONS|COMM|Section Match any of the words
) Close group
Regex demo
import pandas as pd
data = ["Netflix DIVIDEND", "Apple Inc (All Sessions) COMM", "Intel Corporation CONS", "Correction Netflix Section 31 Fee"]
pattern = r"^(?:Correction\s*)?(\S.*?)\s*(?:\([^()]*\)|DIVIDEND|All|CONS|COMM|Section)"
transactions_df = pd.DataFrame(data, columns = ['MarketName'])
transactions_df["Share Name"] = transactions_df["MarketName"].str.extract(pattern)
print(transactions_df)
Output
                          MarketName         Share Name
0                   Netflix DIVIDEND            Netflix
1      Apple Inc (All Sessions) COMM          Apple Inc
2             Intel Corporation CONS  Intel Corporation
3  Correction Netflix Section 31 Fee            Netflix
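As for the ValueError in the question: str.extract returns one column per capture group, so a pattern with two capture groups cannot be assigned to a single column. A small sketch with the standard re module (no pandas needed; the variable names are only for illustration) makes the group counts visible:

```python
import re

# Pattern from the question: the outer group plus (^|Correction ) gives two groups.
two_groups = re.compile(r"((^|Correction ).*?(?=DIVIDEND|\(All|CONS|COMM|Section))")
# Pattern from the answer: (?:...) groups do not capture, so only one group remains.
one_group = re.compile(r"^(?:Correction\s*)?(\S.*?)\s*(?:\([^()]*\)|DIVIDEND|All|CONS|COMM|Section)")

print(two_groups.groups)  # 2
print(one_group.groups)   # 1

rows = ["Netflix DIVIDEND", "Apple Inc (All Sessions) COMM",
        "Intel Corporation CONS", "Correction Netflix Section 31 Fee"]
names = [one_group.search(row).group(1) for row in rows]
print(names)
# ['Netflix', 'Apple Inc', 'Intel Corporation', 'Netflix']
```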

finding an element between a tag and a list of tags using regex

I want to find elements between two different tags but the catch is the first tag is constant but the second tag can be any tag belonging to a particular list.
for example a string
'TRSF BOOK TRANSFER CREDIT SND= abcd bank , 123 ORG= qwer123 OGB= qwerasd OBI= 123433'
I have a list of tags ['TRSF','SND=','ORG=','OGB=','OBI=']
edit : added the availability of '=' in the list itself
My output should look some what like this
TRSF : BOOK TRANSFER CREDIT
SND : abcd bank , 123
ORG : qwer123
OGB : qwerasd
OBI : 123433
The order of the tags, as well as their availability, may change, and new tags may also come into the picture.
Until now I was writing separate regex and string parsing code for each type, but that seems impractical as the number of combinations can be infinite.
Here is what I was doing :
org = re.findall("ORG=(.*?) OGB=",string_1)
snd = re.findall("SND=(.*?) ORG=",string_1)
_, _, obi = string_1.partition('OBI=')
Is there any way to do it like
<tag>(.*?)<tag in list>
or any other method ?
If the tag list is complete, you can use a regex like
\b(TRSF|SND|ORG|OGB|OBI)\b=?\s*(.*?)(?=\s*\b(?:TRSF|SND|ORG|OGB|OBI)\b|\Z)
See the regex demo. Details:
\b - a word boundary
(TRSF|SND|ORG|OGB|OBI) - a tag captured into Group 1
\b - a word boundary
=? - an optional =
\s* - 0+ whitespaces
(.*?) - Group 2: any zero or more chars, as few as possible
(?=\s*\b(?:TRSF|SND|ORG|OGB|OBI)\b|\Z) - either end of string (\Z) or zero or more whitespaces followed with a tag as a whole word.
See the Python demo:
import re
s='TRSF BOOK TRANSFER CREDIT SND= abcd bank , 123 ORG= qwer123 OGB= qwerasd OBI= 123433'
tags = ['TRSF','SND','ORG','OGB','OBI']
print( dict(re.findall(fr'\b({"|".join(tags)})\b=?\s*(.*?)(?=\s*\b(?:{"|".join(tags)})\b|\Z)', s.strip(), re.DOTALL)) )
# => {'TRSF': 'BOOK TRANSFER CREDIT', 'SND': 'abcd bank , 123', 'ORG': 'qwer123', 'OGB': 'qwerasd', 'OBI': '123433'}
Note the re.DOTALL (equal to re.S) makes the . match any chars including line break chars.
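Since the tag list in the question carries the = signs and may grow over time, the pattern can be wrapped in a small function. This is a sketch, not part of the answer above: split_by_tags is a hypothetical name, trailing = signs are stripped from the tags, and each tag is passed through re.escape in case a future tag contains regex metacharacters.

```python
import re

def split_by_tags(text, tags):
    """Hypothetical helper: map each tag to the text that follows it."""
    # Accept both 'SND' and 'SND=' in the tag list.
    alt = "|".join(re.escape(tag.rstrip("=")) for tag in tags)
    pattern = rf"\b({alt})\b=?\s*(.*?)(?=\s*\b(?:{alt})\b|\Z)"
    return dict(re.findall(pattern, text.strip(), re.DOTALL))

s = 'TRSF BOOK TRANSFER CREDIT SND= abcd bank , 123 ORG= qwer123 OGB= qwerasd OBI= 123433'
tags = ['TRSF', 'SND=', 'ORG=', 'OGB=', 'OBI=']
result = split_by_tags(s, tags)
for tag, value in result.items():
    print(tag, ':', value)
```

Because the lookahead accepts any tag from the list, the order of the tags in the input does not matter, and a missing tag simply does not appear in the result.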

Regex to find name in sentence

I have some sentence like
1:
"RLB shows Oubre Jr. (WAS) legally ties up Nurkic (POR), and a held
ball is correctly called."
2:
"Nurkic (POR) maintains legal
guarding position and makes incidental contact with Wall (WAS) that
does not affect his driving shot attempt."
I need to use Python regex to find the name "Oubre Jr." ,"Nurkic" and "Nurkic", "Wall".
p = r'\s*(\w+?)\s[(]'
Using this pattern,
I can find ['Nurkic', 'Wall'] in sentence 2, but in sentence 1 I only find ['Nurkic'] and miss "Oubre Jr."
Can anyone help me?
You can use the following regex:
(?:[A-Z][a-z][\s\.a-z]*)+(?=\s\()
|-----Main Pattern-----|
Details:
(?:) - Creates a non-capturing group
[A-Z] - Matches 1 uppercase letter
[a-z] - Matches 1 lowercase letter
[\s\.a-z]* - Matches whitespace, periods ('.') or lowercase letters 0+ times
(?=\s\() - Positive lookahead: the main pattern only matches if it is followed by the ' (' string, which is not part of the match
import re
# named text rather than str, to avoid shadowing the built-in
text = '''RLB shows Oubre Jr. (WAS) legally ties up Nurkic (POR), and a held ball is correctly called.
Nurkic (POR) maintains legal guarding position and makes incidental contact with Wall (WAS) that does not affect his driving shot attempt.'''
res = re.findall(r'(?:[A-Z][a-z][\s\.a-z]*)+(?=\s\()', text)
print(res)
Demo: https://repl.it/#RahulVerma8/OvalRequiredAdvance?language=python3
Match: https://regex101.com/r/OsLTrY/1
Here is one approach:
import re
line = "RLB shows Oubre Jr (WAS) legally ties up Nurkic (POR), and a held ball is correctly called."
results = re.findall(r"([A-Z][\w']+(?: [JS]r[.]?)?)(?= \([A-Z]+\))", line)
print(results)
['Oubre Jr', 'Nurkic']
The above logic will attempt to match one name, beginning with a capital letter, which is possibly followed by either the suffix Jr. or Sr., which in turn is followed by a ([A-Z]+) term.
You need a pattern that you can match. For your sentences you could try to match the word before (XXX) and also include a list of possible "suffixes"; you would need to extract those from your sources.
import re
suffs = ["Jr."]  # append more suffixes to this list
# optional suffix (escaped, so "." is literal) followed by an optional space
rsu = r"(?:" + "|".join(map(re.escape, suffs)) + r")? ?"
# combine with suffixes
regex = r"(\w+ " + rsu + r")\(\w{3}\)"
test_str = "RLB shows Oubre Jr. (WAS) legally ties up Nurkic (POR), and a held ball is correctly called. Nurkic (POR) maintains legal guarding position and makes incidental contact with Wall (WAS) that does not affect his driving shot attempt."
names = []
for match in re.finditer(regex, test_str):
    # group 1 includes the space before "(", so strip it
    names.append(match.group(1).strip())
print(names)
Output:
['Oubre Jr.', 'Nurkic', 'Nurkic', 'Wall']
This should work as long as you do not have names with non-\w characters in them. If you need to adapt the regex, use https://regex101.com/r/pRr9ZU/1 as a starting point.
Explanation:
rsu --> all items in the list suffs are joined with | (OR) inside a non-capturing group (?:...), which is made optional and followed by an optional space.
regex --> looks for one or more word characters and a space, followed by the optional suffix part we just built, followed by a literal (, then three word characters, followed by a literal ).

Regex to extract titles from the text

Can anyone help with the regex to extract the text phrases after 'Title:' from the following text: (have just bolded the text to clearly depict the portion to be extracted)
Title: Anorectal Fistula (Fistula-in-Ano) Procedure Code(s):
Effective date: 7/1/07
Title:
2003247
or previous effective dates)
Title:
ST2 Assay for Chronic Heart Failure
Description/Background
Heart Failure
HF is one among many cardiovascular diseases that comprises a major cause of morbidity
and mortality worldwide. The term “heart failure” (HF) refers to a complex clinical syndrome .
I am using the regex: (?:Title: \n+(.*))|(?:Title:\n+(.*))|(?<=Title: )(.*)(?=Procedure)
However, it doesn't seem to capture the terms correctly! I am using Python 2.7.12
I suggest using
Title:\s*(.*?)\s*Procedure|Title:\s*(.*)
See the regex demo.
Details:
Title: - literal text Title:
\s* - 0+ whitespaces
(.*?) - Group 1: any 0+ chars other than linebreak symbols as few as possible up to the first
\s*Procedure - 0+ whitespaces + the string Procedure
| - or
Title:\s* - Title: string + 0+ whitespaces
(.*) - Group 2: zero or more any chars other than linebreak symbols as many as possible (the rest of the line).
Python code:
import re
regex = r"Title:\s*(.*?)\s*Procedure|Title:\s*(.*)"
test_str = ("Title: Anorectal Fistula (Fistula-in-Ano) Procedure Code(s):\n\n"
"Effective date: 7/1/07\n\n"
"Title:\n\n"
"2003247\n\n"
"or previous effective dates)\n\n"
"Title:\n\n"
"ST2 Assay for Chronic Heart Failure\n\n"
"Description/Background\n\n"
"Heart Failure\n\n"
"HF is one among many cardiovascular diseases that comprises a major cause of morbidity and mortality worldwide. The term “heart failure” (HF) refers to a complex clinical syndrome .")
res = []
for m in re.finditer(regex, test_str):
    if m.group(1):
        res.append(m.group(1))
    else:
        res.append(m.group(2))
print(res)
# => ['Anorectal Fistula (Fistula-in-Ano)', '2003247', 'ST2 Assay for Chronic Heart Failure']
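If you would rather get a flat list straight from re.findall, a single capture group with a lookahead is one possible variant. This is a sketch that swaps in a different pattern from the one suggested above, and it assumes each title ends either before the word Procedure or at the end of its line (re.M makes $ match at line ends):

```python
import re

# One capture group: stop lazily before "Procedure" or at the end of the line.
single_group = r"Title:\s*(.*?)(?=\s*Procedure|$)"

sample = ("Title: Anorectal Fistula (Fistula-in-Ano) Procedure Code(s):\n\n"
          "Effective date: 7/1/07\n\n"
          "Title:\n\n"
          "2003247\n\n"
          "or previous effective dates)\n\n"
          "Title:\n\n"
          "ST2 Assay for Chronic Heart Failure\n\n"
          "Description/Background\n")

titles = re.findall(single_group, sample, re.M)
print(titles)
# ['Anorectal Fistula (Fistula-in-Ano)', '2003247', 'ST2 Assay for Chronic Heart Failure']
```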
