Extract data if between substrings else full string - python

I have strings like these:
Beginning through June 18, 2022 at Noon standard time\n
Jan 20, 2022
Beginning through April 26, 2022 at 12:01 a.m. standard time
I want to extract the date part that appears after the word "through" and before the word "at", using a Python regex.
June 18, 2022
Jan 20, 2022
April 26, 2022
I can extract for the long text using re group.
s ="Beginning through June 18, 2022 at Noon standard time"
re.search(r'(.*through)(.*) (at.*)', s).group(2)
However it will not work for
s ="June 18, 2022"
Can anyone help me with that?

You may use this regex with a capture group:
(?:.* through |^)(.+?)(?: at |$)
RegEx Demo
RegEx Details:
(?:.* through |^): Match anything followed by " through " or start position
(.+?): Match 1+ of any character and capture it in group #1
(?: at |$): Match " at " or end of string
Code:
import re
arr = ['Beginning through June 18, 2022 at Noon standard time',
'Jan 20, 2022',
'Beginning through April 26, 2022 at 12:01 a.m. standard time']
for i in arr:
    print(re.findall(r'(?:.* through |^)(.+?)(?: at |$)', i))
Output:
['June 18, 2022']
['Jan 20, 2022']
['April 26, 2022']
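If a single value per string is more convenient than the list that re.findall returns, the same pattern can be wrapped in a small helper. This is only a sketch: the names DATE_RE and extract_date are made up for illustration, and the strip() call is an assumption to cope with a trailing newline like the one in the first sample.

```python
import re

# Same pattern as in the answer above: an optional "... through " prefix
# and an optional " at ..." suffix around the captured part.
DATE_RE = re.compile(r'(?:.* through |^)(.+?)(?: at |$)')

def extract_date(text):
    """Return the part between ' through ' and ' at ', else the whole string."""
    match = DATE_RE.search(text.strip())  # strip() drops a trailing newline
    return match.group(1) if match else None

samples = [
    'Beginning through June 18, 2022 at Noon standard time\n',
    'Jan 20, 2022',
    'Beginning through April 26, 2022 at 12:01 a.m. standard time',
]
results = [extract_date(s) for s in samples]
print(results)
# ['June 18, 2022', 'Jan 20, 2022', 'April 26, 2022']
```

An empty string gives None here, because (.+?) needs at least one character.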

How about playing with optional groups and backtracking?
^(?:.*?through )?(.*?)(?: at.*)?$
See this demo at regex101 or a Python demo at tio.run
Note that if just one of the substrings is present, it will match either from the first substring to the end of the string, or from the start of the string to the second substring. If neither is present, it will match the full string.
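The four cases described in that note can be checked quickly with the standard re module. This is a minimal sketch; the sample strings with only one of the two substrings are invented for the test, and OPTIONAL_RE is just an illustrative name.

```python
import re

# Optional-group pattern from above, unchanged.
OPTIONAL_RE = re.compile(r'^(?:.*?through )?(.*?)(?: at.*)?$')

cases = [
    'Beginning through June 18, 2022 at Noon standard time',  # both present
    'Beginning through June 18, 2022',                        # only "through"
    'June 18, 2022 at Noon standard time',                    # only "at"
    'Jan 20, 2022',                                           # neither
]
extracted = [OPTIONAL_RE.match(text).group(1) for text in cases]
for text, value in zip(cases, extracted):
    print(repr(text), '->', repr(value))
```

The first three strings give 'June 18, 2022' and the last one comes back unchanged.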
Another idea could be to use PyPI regex which supports branch reset groups.
^(?|.*?through (.+?) at|(.+))
This one extracts the part between the two substrings if both are present, else the full string. As far as I know, the regex module is largely compatible with Python's re functions; just use import regex as re instead.
Demo at regex101 or Python demo at tio.run

Related

Python Regex to extract meeting invite from Gmail Subject

I'm trying to extract the meeting date / time from meeting invites within Gmail's subject. Below is an example of a subject for a meeting invite:
Invitation: Bob / Carol Meeting # Tue Oct 25, 2022 11:30am - 12pm (CST) (bob#example.org)
What I would like to extract:
Tue Oct 25, 2022 11:30am - 12pm (CST)
I think the pattern could simply start with the space after the "#" and end with the ")". My Regex is very rusty so would appreciate any help :)
Many thanks!
Try this - it should match everything after the "# " and up to the end of the timezone ")"
import re
string = (
'Invitation: Bob / Carol Meeting # Tue Oct 25, 2022 11:30am - 12pm (CST) (bob#example.org)'
)
pattern = re.compile(r'(?<=# )[^)]+\)')
matches = re.findall(pattern, string)
print(matches)
# => ['Tue Oct 25, 2022 11:30am - 12pm (CST)']
See here for a breakdown of the RegEx I used. Bear in mind that re.findall returns a list of matches, which is helpful if you want to scan a long multiline string of text and get all the matches at once. If you only care about the 1st match, you can get it by index e.g. print(matches[0]).
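If only the first match matters, re.search is an alternative to indexing into the findall list. A small sketch with the same pattern; the variable names subject and meeting_time are only illustrative.

```python
import re

subject = (
    'Invitation: Bob / Carol Meeting # Tue Oct 25, 2022 11:30am - 12pm (CST) '
    '(bob#example.org)'
)

# re.search returns a match object for the first match, or None if there is none.
match = re.search(r'(?<=# )[^)]+\)', subject)
meeting_time = match.group(0) if match else None
print(meeting_time)
# Tue Oct 25, 2022 11:30am - 12pm (CST)
```

Checking for None avoids the IndexError that matches[0] would raise on a subject without a date.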
It looks like you don't technically need regex for this.
Try the following:
>>> s = 'Invitation: Bob / Carol Meeting # Tue Oct 25, 2022 11:30am - 12pm (CST) (bob#example.org)'
>>> s[s.index('#') + 1 : s.rindex('(')].strip()
'Tue Oct 25, 2022 11:30am - 12pm (CST)'

spaCy matcher unable to identify the pattern besides the first

I can't find where my pattern goes wrong to cause this outcome.
The sentence I want to find: "#1 – January 31, 2015", and any date that follows this format.
The pattern pattern1=[{'ORTH':'#'},{'is_digital':True},{'is_space':True},{'ORTH':'-'},{'is_space':True},{'is_alpha':True},{'is_space':True},{'is_digital':True},{'is_punct':True},{'is_space':True},{'is_digital':True}]
The print code:print("Matches1:", [doc[start:end].text for match_id, start, end in matches1])
The result: ['#', '#', '#']
Expected result: ['#1 – January 31, 2015','#5 – March 15, 2017','#177 – November 22, 2019']
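spaCy's Matcher operates over tokens, and single spaces in the sentence do not yield tokens, so the is_space entries in your pattern never match.
Also, several different characters resemble a hyphen (dashes, minus signs and so on), so one has to be careful: your sentence contains an en dash (–), while your pattern looks for a hyphen (-).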
The following code works:
import spacy
nlp = spacy.load('en_core_web_lg')
from spacy.matcher import Matcher
pattern1=[{'ORTH':'#'},{'IS_DIGIT':True},{'ORTH':'–'},{'is_alpha':True},{'IS_DIGIT':True},{'is_punct':True},{'IS_DIGIT':True}]
doc = nlp("#1 – January 31, 2015")
matcher = Matcher(nlp.vocab)
# spaCy v2 signature; in spaCy v3 this call is matcher.add("p1", [pattern1])
matcher.add("p1", None, pattern1)
matches1 = matcher(doc)
print(" Matches1:", [doc[start:end].text for match_id, start, end in matches1])
# Matches1: ['#1 – January 31, 2015']

Unexpected result in regex - what am I missing?

I am trying to extract immunization records of this form:
Immunization: Tetanus
Other: Booster
Method: Injection
Date Received: 07 Jan 2013
and also of this form:
Immunization: TETANUS DIPTHERIA (TD-ADULT)
Date Received: 07 Dec 2012 # 1155
Location: PORTLAND (OR) VAMC
Reaction:* None Reported
Comments: 1234567
Here is my pattern string:
"Immunization:(.*?)\n[.\n*?]*?Date Received:(.*?)\n"
This identifies the second pattern and extracts the vaccination name and date, but not the first. I thought that [.\n*?]*? would take care of the two possibilities (that there are other fields between the vaccination name and the vaccination date, or not), but this doesn't seem to do the trick. What is wrong with my regex and how can I fix it?
You can use:
import re
matches = re.findall(r"Immunization:\s+(.*?)\s+.*?Date Received:\s+(.*?)$", subject, re.IGNORECASE | re.DOTALL | re.MULTILINE)
Tested this on pythex with MULTILINE and DOTALL:
Input
Immunization: Tetanus
Other: Booster
Method: Injection
Date Received: 07 Jan 2013
Immunization: TETANUS DIPTHERIA (TD-ADULT)
Date Received: 07 Dec 2012 # 1155
Location: PORTLAND (OR) VAMC
Reaction:* None Reported
Comments: 1234567
Pattern: Immunization:\s+(\w+).*?Date Received:\s+([^\n]+)
Match 1
Tetanus
07 Jan 2013
Match 2
TETANUS
07 Dec 2012 # 1155
Pythex
Pythex with different grouping
The . in [.\n] is taken as a literal '.', not as a symbol for any-character. This is why the date line immediately following the immunisation is accepted but you fail to jump across a character that is not a newline or a dot.
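This is easy to confirm in the interpreter. A minimal sketch: inside the brackets, '.', '*' and '?' all lose their special meaning, so the class [.\n*?] only accepts those three characters and a newline.

```python
import re

char_class = re.compile(r'[.\n*?]')

# Inside a character class these are ordinary characters.
matches_dot = bool(char_class.fullmatch('.'))       # True
matches_newline = bool(char_class.fullmatch('\n'))  # True
matches_star = bool(char_class.fullmatch('*'))      # True
matches_letter = bool(char_class.fullmatch('a'))    # False: 'a' is not in the class

# Outside a class, '.' is the any-character wildcard.
wildcard_letter = bool(re.fullmatch(r'.', 'a'))     # True

print(matches_dot, matches_newline, matches_star, matches_letter, wildcard_letter)
```

So a line such as "Other: Booster" cannot be skipped by [.\n*?]*?, which is why the first record fails.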
(.*\n)* comes to mind as the closest fix to what you already have. However, nesting that many * quantifiers is a bit unfortunate: the regex engine may need a lot of backtracking to parse a record, and as a human I also find it harder to understand. It may be preferable to start every loop with a literal, which makes it easier to decide whether the loop should be entered or continued at all.
If I did not mess it up, then
Immunization:(.*?)(\n.*)*\nDate Received:(.*)\n
would do it without left recursion, and "Date Received" would only be detected at the beginning of a line.
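One caveat when both records sit in the same string: as far as I can tell, the greedy (\n.*)* runs on to the last "Date Received" line, so the first name gets paired with the second date. Below is a sketch of a variant that makes the line-skipping loop lazy and non-capturing; the [ \t]* parts are my own addition to drop the space after each colon, and the name record_re is only illustrative.

```python
import re

records = "\n".join([
    "Immunization: Tetanus",
    "Other: Booster",
    "Method: Injection",
    "Date Received: 07 Jan 2013",
    "Immunization: TETANUS DIPTHERIA (TD-ADULT)",
    "Date Received: 07 Dec 2012 # 1155",
    "Location: PORTLAND (OR) VAMC",
    "Reaction:* None Reported",
    "Comments: 1234567",
]) + "\n"

# (?:\n.*)*? skips as few whole lines as possible before "Date Received:",
# so each name is paired with the nearest following date.
record_re = re.compile(
    r"Immunization:[ \t]*(.*)(?:\n.*)*?\nDate Received:[ \t]*(.*)"
)
pairs = record_re.findall(records)
print(pairs)
# [('Tetanus', '07 Jan 2013'), ('TETANUS DIPTHERIA (TD-ADULT)', '07 Dec 2012 # 1155')]
```

No flags are needed, since '.' stops at the end of a line by default.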

Extracting words next to a location or Duration in python

How can I extract the words next to a location or duration? What is the best possible regex in Python to do this?
Example:-
Karthick Kumar, Bangalore who was a great person and lived from 29th March 1980 - 21 Dec 2014.
In the above example I want to extract the words before the location and the words before the duration. The location and duration are not fixed. What would be the best possible regex for this in Python? Or can we do this using nltk?
Desired output:-
Output-1: Karthick Kumar (Keyword here is Location)
Output-2: who was a great person and lived from (Keyword here is duration)
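I suggest using lookaheads.
In your example, assuming you want the words before Bangalore and before 29th March 1980 - 21 Dec 2014, you could use lookaheads (and lookbehinds) to get the relevant matches.
I've used this regex: (.*)(?>Bangalore)(.+)(?=29th March 1980 - 21 Dec 2014) and captured the text in parentheses, which can be accessed by using \1 and \2. Note that (?>...) is an atomic group, which Python's re module only accepts from version 3.11 on; on older versions, writing Bangalore without the group matches the same text here.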
DEMO
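In Python this could look roughly like the sketch below. It assumes the location and duration strings are already known (for example from a gazetteer or a date parser, which is outside what the regex itself can do), and the names location, duration and context_re are made up for illustration. re.escape keeps characters in those strings from being read as regex syntax.

```python
import re

text = ("Karthick Kumar, Bangalore who was a great person and lived "
        "from 29th March 1980 - 21 Dec 2014.")
location = "Bangalore"                      # assumed to be known in advance
duration = "29th March 1980 - 21 Dec 2014"  # assumed to be known in advance

# Group 1: text before the location; group 2: text between the location
# and the duration. The lookahead checks for the duration without consuming it.
context_re = re.compile(
    r"(.*?)\s*" + re.escape(location) + r"\s*(.+?)\s*(?=" + re.escape(duration) + r")"
)
match = context_re.search(text)
before_location = match.group(1).rstrip(", ")  # drop the trailing comma
before_duration = match.group(2)
print(before_location)
print(before_duration)
```

This prints "Karthick Kumar" and "who was a great person and lived from" for the example sentence.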

Python Regex - Different Results in findall and sub

I am trying to replace occurrences of the word 'brunch' with 'BRUNCH'. I am using a regex which correctly identifies the occurrence, but when I try to use re.sub it replaces more text than re.findall identifies. The regex that I am using is:
re.compile(r'(?:^|\.)(?![^.]*saturday)(?![^.]*sunday)(?![^.]*weekend)[^.]*(brunch)',re.IGNORECASE)
The string is
str = 'Valid only for dine-in January 2 - March 31, 2015. Excludes brunch, happy hour, holidays, and February 13 - 15, 2015.'
I want it to produce:
'Valid only for dine-in January 2 - March 31, 2015. Excludes BRUNCH, happy hour, holidays, and February 13 - 15, 2015.'
The steps:
>>> reg.findall(str)
['brunch']
>>> reg.sub('BRUNCH',str)
'Valid only for dine-in January 2 - March 31, 2015BRUNCH, happy hour, holidays, and February 13 - 15, 2015.'
Edit:
The final solution that I used was:
reg = re.compile(r'((?:^|\.))(?![^.]*saturday)(?![^.]*sunday)(?![^.]*weekend)([^.]*)(brunch)',re.IGNORECASE)
reg.sub(r'\g<1>\g<2>BRUNCH',str)
For re.sub use
(^|\.)(?![^.]*saturday)(?![^.]*sunday)(?![^.]*weekend)([^.]*)(brunch)
Replace by \1\2BRUNCH. See demo.
https://regex101.com/r/eZ0yP4/16
Through regex:
(^|\.)(?![^.]*saturday)(?![^.]*sunday)(?![^.]*weekend)([^.]*)brunch
DEMO
Replace the matched characters by \1\2BRUNCH
Why does it match more than brunch?
Because your regex actually does match more than brunch.
See link on how the regex matches.
Why doesn't it show in findall?
Because you have wrapped only brunch in parentheses:
>>> reg = re.compile(r'(?:^|\.)(?![^.]*saturday)(?![^.]*sunday)(?![^.]*weekend)[^.]*(brunch)',re.IGNORECASE)
>>> reg.findall(str)
['brunch']
After wrapping the entire ([^.]*brunch) in parentheses:
>>> reg = re.compile(r'(?:^|\.)(?![^.]*saturday)(?![^.]*sunday)(?![^.]*weekend)([^.]*brunch)',re.IGNORECASE)
>>> reg.findall(str)
[' Excludes brunch']
re.findall returns only the captured groups, so text matched outside them is not shown.
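Putting the pieces together, here is a short sketch that shows the whole match next to the captured group, and then the replacement with the extra groups. The variable is called text rather than str only to avoid shadowing the built-in.

```python
import re

text = ('Valid only for dine-in January 2 - March 31, 2015. '
        'Excludes brunch, happy hour, holidays, and February 13 - 15, 2015.')

original = re.compile(
    r'(?:^|\.)(?![^.]*saturday)(?![^.]*sunday)(?![^.]*weekend)[^.]*(brunch)',
    re.IGNORECASE)

whole_match = original.search(text).group(0)  # what re.sub replaces
captured = original.findall(text)             # what re.findall reports
print(repr(whole_match))  # '. Excludes brunch'
print(captured)           # ['brunch']

# Capture the text before brunch as well, and put it back in the replacement.
fixed = re.compile(
    r'(^|\.)(?![^.]*saturday)(?![^.]*sunday)(?![^.]*weekend)([^.]*)brunch',
    re.IGNORECASE)
result = fixed.sub(r'\1\2BRUNCH', text)
print(result)
```

The last line prints the sentence with only brunch upper-cased, because groups 1 and 2 restore the period and " Excludes ".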
