I need to check whether a string is a valid date range. The string can contain month names as text and years as numbers, in no specific order (there is no fixed format such as MM-YYYY-DD).
A valid string could be:
February 2016 - March 2019
September 2015 to August 2019
April 2015 to present
September 2018 - present
Invalid string:
George Mason University august 2019
Stratusburg university February 2018
Some text and month followed by year
I already looked into issues such as
a) Constructing Regular Expressions to match numeric ranges
b) Regex to match month name followed by year
and many others, but most of the input strings in those questions seem to have the luxury of a fixed pattern for the month and year, which I don't have.
I tried this regex in python:
import re
pat = r"(\b\d{1,2}\D{0,3})?\b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(Nov|Dec)(?:ember)?)\D?(\d{1,2}(st|nd|rd|th)?)?(([,.\-\/])\D?)?((19[7-9]\d|20\d{2})|\d{2})*"
st = "University of Pennsylvania February 2018"
re.search(pat, st)
but it matches both the valid and the invalid strings from my example. I want to exclude invalid strings from my eventual output.
For the input "University of Pennsylvania February 2018" the expected output is False.
For "February 2018 to Present", the output must be True.
This regex validates date ranges that follow the format MONTH YEAR (MONTH YEAR | PRESENT):
import re
# the first line deliberately contains two valid ranges, to make things harder
text = """
February 2016 - March 2019 February 2017 - March 2019
September 2015 to August 2019
April 2015 to present
September 2018 - present
George Mason University august 2019
Stratusburg university February 2018
Some text and month followed by year
"""
# writing the regex on one line would make it very ugly
MONTHS_RE = ['Jan(?:uary)?', 'Feb(?:ruary)?', 'Mar(?:ch)?', 'Apr(?:il)?', 'May', 'Jun(?:e)?', 'Jul(?:y)?', 'Aug(?:ust)?',
             'Sep(?:tember)?', 'Oct(?:ober)?', '(?:Nov|Dec)(?:ember)?']
# match a month name and capture it: (Jan(?:uary)?|Feb(?:ruary)?|...|(?:Nov|Dec)(?:ember)?)
RE_MONTH = '({})'.format('|'.join(MONTHS_RE))
# match a month followed by a year of 2 to 4 digits; used twice in the final regex
RE_DATE = r'{RE_MONTH}(?:[\s]+)(\d{{2,4}})'.format(RE_MONTH=RE_MONTH)
# final regex
RE_VALID_RANGE = re.compile(r'{RE_DATE}.+?(?:{RE_DATE}|(present))'.format(RE_DATE=RE_DATE), flags=re.IGNORECASE)
# if you want to collect both the valid and the invalid lines
valid_ranges = []
invalid_ranges = []
for line in text.split('\n'):
    if line:
        groups = re.findall(RE_VALID_RANGE, line)
        if groups:
            # do something with the ranges here if you want;
            # groups holds every valid range found on the line (1 or 2 in this example)
            # every tuple has 5 elements because there are 5 capturing groups
            # either M2 and Y2 are filled and present is empty, or the reverse (because of (?:{RE_DATE}|(present)))
            M1, Y1, M2, Y2, present = groups[0]  # loop over groups if you want to verify the values further
            valid_ranges.append(line)
        else:
            invalid_ranges.append(line)
print('VALID: ', valid_ranges)
print('INVALID:', invalid_ranges)
# this yields only the valid ranges; a line containing two ranges yields two matches
# if you are working line by line, this is not what you want
valid_ranges = []
for match in re.finditer(RE_VALID_RANGE, text):
    # if you want to check the ranges
    M1, Y1, M2, Y2, present = match.groups()
    valid_ranges.append(match.group(0))  # the matched text is returned here
print('VALID USING <finditer>: ', valid_ranges)
Output:
VALID: ['February 2016 - March 2019 February 2017 - March 2019', 'September 2015 to August 2019', 'April 2015 to present', 'September 2018 - present']
INVALID: ['George Mason University august 2019', 'Stratusburg university February 2018', 'Some text and month followed by year']
VALID USING <finditer>: ['February 2016 - March 2019', 'February 2017 - March 2019', 'September 2015 to August 2019', 'April 2015 to present', 'September 2018 - present']
I dislike writing a long regular expression in a single string variable; I prefer to break it up so that I can understand what it does when I read my code six months later. Note how finditer splits the first line into two valid range strings.
If you just want to extract the ranges, you can use this:
valid_ranges = re.findall(RE_VALID_RANGE, text)
But this returns the groups [(M1, Y1, M2, Y2, present), ...], not the text:
[('February', '2016', 'March', '2019', ''), ('February', '2017', 'March', '2019', ''), ('September', '2015', 'August', '2019', ''), ('April', '2015', '', '', 'present'), ('September', '2018', '', '', 'present')]
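For the True/False answer the question asks for, one option is to anchor the pattern to the entire string with re.fullmatch instead of searching inside it. The following is a minimal sketch, not the answer's own code: the names MONTH, DATE, RANGE and is_valid_range are made up for illustration, and it assumes four-digit years and only "-" or "to" as separators.

```python
import re

# Illustrative names; assumes 4-digit years and "-" or "to" as the separator.
MONTH = (r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|"
         r"Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)")
DATE = MONTH + r"\s+\d{4}"
RANGE = re.compile(r"{d}\s*(?:-|to)\s*(?:{d}|present)".format(d=DATE), re.IGNORECASE)


def is_valid_range(s):
    """Return True only if the whole string is 'MONTH YEAR (-|to) MONTH YEAR|present'."""
    return RANGE.fullmatch(s.strip()) is not None


for s in ["February 2018 to Present", "University of Pennsylvania February 2018"]:
    print(is_valid_range(s), s)
```

Because fullmatch must consume the whole string, any leading text such as a university name makes the check fail, which is the behaviour the question describes.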
Maybe you could loosen your expression and use something simpler, such as:
(?i)^\S+\s+(\d{2})?(\d{2})\s*(?:[-_]|to)\s*(present|\S+)\s*(\d{2})?(\d{2})?$
or maybe,
(?i)\S+\s+(\d{2})?(\d{2})\s*(?:[-_]|to)\s*(present|\S+)\s*(\d{2})?(\d{2})?
Test
import re
regex = r"(?i)^\S+\s+(\d{2})?(\d{2})\s*(?:[-_]|to)\s*(present|\S+)\s*(\d{2})?(\d{2})?$"
string = """
February 2016 - March 2019
September 2015 to August 2019
April 2015 to present
September 2018 - present
Feb. 2016 - March 2019
Sept 2015 to Aug. 2019
April 2015 to present
Nov. 2018 - present
Invalid string:
George Mason University august 2019
Stratusburg university February 2018
Some text and month followed by year
"""
print(re.findall(regex, string, re.M))
Output
[('20', '16', 'March', '20', '19'), ('20', '15', 'August', '20', '19'), ('20', '15', 'present', '', ''), ('20', '18', 'present', '', ''), ('20', '16', 'March', '20', '19'), ('20', '15', 'Aug.', '20', '19'), ('20', '15', 'present', '', ''), ('20', '18', 'present', '', '')]
If you wish to simplify, modify, or explore the expression, it is explained in the top-right panel of regex101.com, where you can also watch how it matches against some sample inputs.
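One caveat with these looser patterns: \S+ accepts any run of non-space characters where a month name is expected, so they check the shape of the string rather than the month itself. A quick check, using the anchored variant unchanged and sample strings of which the second is made up:

```python
import re

# the anchored pattern from the answer, unchanged
regex = r"(?i)^\S+\s+(\d{2})?(\d{2})\s*(?:[-_]|to)\s*(present|\S+)\s*(\d{2})?(\d{2})?$"

samples = [
    "February 2016 - March 2019",                # a real range
    "Foo 2016 - Bar 2019",                       # made-up words in the month slots
    "University of Pennsylvania February 2018",  # invalid example from the question
]
results = {s: bool(re.search(regex, s)) for s in samples}
for s, ok in results.items():
    print(ok, s)
```

If month names have to be validated as well, the \S+ parts would need to be replaced by an explicit month alternation.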
I know there are many questions about this type of sorting. I tried many times by referring to those questions and by going through the re documentation for Python too.
My question is:
class Example(models.Model):
    _inherit = 'sorting.example'
    def unable_to_sort(self):
        data_list = ['Abigail Peterson Jan 25','Paul Williams Feb 1','Anita Oliver Jan 24','Ernest Reed Jan 28']
        self.update({'list_of_birthday_week': ','.join(r for r in data_list)})
I need the list sorted by month and day, like:
data_list = ['Anita Oliver Jan 24','Abigail Peterson Jan 25','Ernest Reed Jan 28','Paul Williams Feb 1']
Is there any way to achieve this?
Use a regex to extract the date, then use it as the key of the sorted function.
import re
from datetime import datetime
pattern = r'(\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\D?(?:\d{1,2}\D?))'
sort_by_date = lambda x: datetime.strptime(re.search(pattern, x).group(0), '%b %d')
out = sorted(data_list, key=sort_by_date)
Output:
>>> out
['Anita Oliver Jan 24',
'Abigail Peterson Jan 25',
'Ernest Reed Jan 28',
'Paul Williams Feb 1']
Input:
>>> data_list
['Abigail Peterson Jan 25',
'Paul Williams Feb 1',
'Ernest Reed Jan 28',
'Anita Oliver Jan 24']
You need to extract the date part from the string, and then turn the date string into a comparable format. For the first task, regexen would be a decent choice here, and for the second part, datetime.strptime would be appropriate:
>>> import re
>>> from datetime import datetime
>>>
>>> re.search(r'\w+ \d+$', 'Abigail Peterson Jan 25')
<re.Match object; span=(17, 23), match='Jan 25'>
>>> re.search(r'\w+ \d+$', 'Abigail Peterson Jan 25')[0]
'Jan 25'
>>>
>>> datetime.strptime('Jan 25', '%b %d')
datetime.datetime(1900, 1, 25, 0, 0)
>>> datetime.strptime(re.search(r'\w+ \d+$', 'Abigail Peterson Jan 25')[0], '%b %d')
datetime.datetime(1900, 1, 25, 0, 0)
Then turn that into a callback for list.sort:
>>> data_list.sort(key=lambda i: datetime.strptime(re.search(r'\w+ \d+$', i)[0], '%b %d'))
>>> data_list
['Anita Oliver Jan 24', 'Abigail Peterson Jan 25', 'Ernest Reed Jan 28', 'Paul Williams Feb 1']
You can also use split() to accomplish that.
from datetime import datetime
...
def unable_to_sort(self):
    data_list = ['Abigail Peterson Jan 25','Paul Williams Feb 1','Anita Oliver Jan 24','Ernest Reed Jan 28']
    def get_date(data):
        # the last two space-separated tokens are the month and the day
        month, day = data.split(" ")[-2:]
        return datetime.strptime(f"{month} {day}", "%b %d")
    sorted_data_list = sorted(data_list, key=get_date)
    self.update({'list_of_birthday_week': ','.join(r for r in sorted_data_list)})
You can use the sorted function with a key built from datetime.strptime() and the day value.
from datetime import datetime
data_list = ['Anita Oliver Jan 24','Abigail Peterson Jan 25','Ernest Reed Jan 28','Paul Williams Feb 1']
k = [x.split() for x in data_list]
# compare the day as an integer; as a string, '9' would sort after '10'
days_sorted = sorted(k, key=lambda x: (datetime.strptime(x[2], '%b'), int(x[3])))
[['Anita', 'Oliver', 'Jan', '24'],
['Abigail', 'Peterson', 'Jan', '25'],
['Ernest', 'Reed', 'Jan', '28'],
['Paul', 'Williams', 'Feb', '1']]
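The strptime-based keys above parse a date without a year, so strptime falls back to the year 1900, and an entry such as 'Feb 29' can raise ValueError because 1900 is not a leap year. If that matters for birthdays, a plain (month, day) tuple avoids date parsing entirely. A minimal sketch, assuming every entry ends in an abbreviated English month and a day number; MONTHS and month_day_key are illustrative names:

```python
# Map abbreviated month names to their number (1-12); illustrative helper.
MONTHS = {name: number for number, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}


def month_day_key(entry):
    # the last two whitespace-separated tokens are the month and the day
    *_, month, day = entry.split()
    return MONTHS[month], int(day)


data_list = ['Abigail Peterson Jan 25', 'Paul Williams Feb 1',
             'Anita Oliver Jan 24', 'Ernest Reed Jan 28']
sorted_list = sorted(data_list, key=month_day_key)
print(sorted_list)
```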
I have this list which I convert into a dataframe.
labels = ['Airport',
'Amusement',
'Bridge',
'Campus',
'Casino',
'Commercial',
'Concert',
'Convention',
'Education',
'Entertainment',
'Government',
'Hospital',
'Hotel',
'Library',
'Mall',
'Manufacturing',
'Museum',
'Residential',
'Retail',
'School',
'University',
'Theater',
'Tunnel',
'Warehouse']
labels = pd.DataFrame(labels, columns=['lookup'])
labels
I have this dataframe.
df = pd.DataFrame({'Year':[2020, 2020, 2019, 2019, 2019],
'Name':['Dulles_Airport', 'Syracuse_University', 'Reagan_Library', 'AMC Theater', 'Reagan High School']})
How can I clean the items in the df, based on matches in labels? My 'labels' is totally clean and my 'df' is very messy. I would like to see the df like this.
df = pd.DataFrame({'Year':[2020, 2020, 2019, 2019, 2019],
'Name':['Airport', 'University', 'Library', 'Theater', 'School']})
df
You can use Series.str.extract and NaN replacement:
labels = ['Airport', 'Amusement', 'Bridge', 'Campus', 'Casino', 'Commercial', 'Concert', 'Convention',
'Education', 'Entertainment', 'Government', 'Hospital', 'Hotel', 'Library', 'Mall', 'Manufacturing',
'Museum', 'Residential', 'Retail', 'School', 'University', 'Theater', 'Tunnel', 'Warehouse']
import pandas as pd
df = pd.DataFrame({
'Year': [2020, 2020, 2019, 2019, 2019, 1954],
'Name': ['Dulles_Airport', 'Syracuse_University', 'Reagan_Library', 'AMC Theater', 'Reagan High School', 'Shake, Rattle and Roll']
})
df['Match'] = df['Name'].str.extract(f"({'|'.join(labels)})")
The resulting DataFrame will be
Year Name Match
0 2020 Dulles_Airport Airport
1 2020 Syracuse_University University
2 2019 Reagan_Library Library
3 2019 AMC Theater Theater
4 2019 Reagan High School School
5 1954 Shake, Rattle and Roll NaN
If you want to keep the non-matching cells, do this:
df['Match'] = df['Name'].str.extract(f"({'|'.join(labels)})")
df.loc[df['Match'].isnull(), 'Match'] = df['Name'][df['Match'].isnull()]
The resulting DataFrame will be
Year Name Match
0 2020 Dulles_Airport Airport
1 2020 Syracuse_University University
2 2019 Reagan_Library Library
3 2019 AMC Theater Theater
4 2019 Reagan High School School
5 1954 Shake, Rattle and Roll Shake, Rattle and Roll
If you want to remove the non-matching cells, do this:
df['Match'] = df['Name'].str.extract(f"({'|'.join(labels)})")
df = df.dropna()
The resulting DataFrame will be
Year Name Match
0 2020 Dulles_Airport Airport
1 2020 Syracuse_University University
2 2019 Reagan_Library Library
3 2019 AMC Theater Theater
4 2019 Reagan High School School
Not the most pure pandas answer, but you could write a function that checks the string against your labels list and apply it to the Name column, i.e.
def clean_labels(name):
    labels = ['Airport','Amusement','Bridge','Campus',
              'Casino','Commercial','Concert','Convention',
              'Education','Entertainment','Government','Hospital',
              'Hotel','Library','Mall','Manufacturing','Museum',
              'Residential','Retail','School','University', 'Theater',
              'Tunnel','Warehouse']
    for item in labels:
        if item in name:
            return item
>>> df.Name.apply(clean_labels)
0 Airport
1 University
2 Library
3 Theater
4 School
I'm assuming here that there aren't any typos when comparing the strings; the function returns None for anything that doesn't match.
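Both answers match case-sensitively, and the str.extract answer joins the labels into a regex without escaping them. If the messy names can differ in case, or a label could contain regex metacharacters, a compiled pattern built with re.escape and re.IGNORECASE is one way to handle it. A minimal sketch with a shortened label list; LABEL_RE, CANONICAL and first_label are illustrative names, not taken from the answers:

```python
import re

# Shortened label list for the example; the full list works the same way.
labels = ['Airport', 'Library', 'School', 'Theater', 'University']

# Longest labels first, so a longer label wins over a shorter one it contains.
LABEL_RE = re.compile(
    "|".join(re.escape(label) for label in sorted(labels, key=len, reverse=True)),
    re.IGNORECASE,
)
# Map the lower-cased match back to the clean spelling from the list.
CANONICAL = {label.lower(): label for label in labels}


def first_label(name):
    """Return the first clean label found in name, or None if there is none."""
    m = LABEL_RE.search(name)
    return CANONICAL[m.group(0).lower()] if m else None


print(first_label('Dulles_Airport'), first_label('reagan high SCHOOL'))
```

The function can be applied to the column with df['Name'].apply(first_label), in the same way as clean_labels.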
I'm trying to parse dates from individual health records. Since the entries appear to be manual, the date formats are all over the place. My regex patterns are apparently not making the cut for several observations. Here's the list of tasks I need to accomplish along with the accompanying code. Dataframe has been subsetted to 15 observations for convenience.
Parse dates:
#Create DF:
health_records = ['08/11/78 CPT Code: 90801 - Psychiatric Diagnosis Interview',
'Lithium 0.25 (7/11/77). LFTS wnl. Urine tox neg. Serum tox + fluoxetine 500; otherwise neg. TSH 3.28. BUN/Cr: 16/0.83. Lipids unremarkable. B12 363, Folate >20. CBC: 4.9/36/308 Pertinent Medical Review of Systems Constitutional:',
'28 Sep 2015 Primary Care Doctor:',
'06 Mar 1974 Primary Care Doctor:',
'none; but currently has appt with new HJH PCP Rachel Salas, MD on October. 11, 2013 Other Agency Involvement: No',
'.Came back to US on Jan 24 1986, saw Dr. Quackenbush at Beaufort Memorial Hospital. Checked VPA level and found it to be therapeutic and confirmed BPAD dx. Also, has a general physician exam and found to be in good general health, except for being slightly overwt',
'September. 15, 2011 Total time of visit (in minutes):',
'sLanguage based learning disorder, dyslexia. Placed on IEP in 1st grade through Westerbrook HS prior to transitioning to VFA in 8th grade. Graduated from VF Academy in May 2004. Attended 1.5 years college at Arcadia.Employment Currently employed: Yes',
') - Zoloft 100 mg daily: February, 2010 : self-discontinued due to side effects (unknown)',
'6/1998 Primary Care Doctor:',
'12/2008 Primary Care Doctor:',
'ran own business for 35 years, sold in 1985',
'011/14/83 Audit C Score Current:',
'safter evicted in February 1976, hospitalized at Pemberly for 1 mo.Hx of Outpatient Treatment: No',
'. Age 16, 1991, frontal impact. out for two weeks from sports.',
's Mr. Moss is a 27-year-old, Caucasian, engaged veteran of the Navy. He was previously scheduled for an intake at the Southton Sanitorium in January, 2013 but cancelled due to ongoing therapy (see Psychiatric History for more details). He presents to the current intake with primary complaints of sleep difficulties, depressive symptoms, and PTSD.']
import numpy as np
import pandas as pd
df = pd.DataFrame(health_records, columns=['records'])
#Date parsing pattern 1:
df['new'] = (df['records'].str.findall(r'\d{1,2}.\d{1,2}.\d{2,4}')
             .astype(str).str.replace(r'\[|\]|\(|\)|,|\'', '', regex=True).str.strip())
#Date parsing pattern 2:
df['new2'] = (df['records'].str.findall(r'(?:\d{2} )?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* (?:\d{2}, )?\d{4}')
              .astype(str).str.replace(r'\[|\]|\'|\,', '', regex=True).str.strip())
df['date'] = df['new']+df['new2']
and here is the output:
df['date']
0 08/11/78
1 7/11/77 16/0.83 4.9/36
2 28 Sep 2015
3 06 Mar 1974
4
5 24 1986
6
7 May 2004
8
9 6/1998
10 12/2008
11
12 011/14
13 February 1976
14
15
As you can see in some places the code works perfectly, but in complex sentences my pattern is not working or spitting out inaccurate results. Here is a list of all possible date combinations:
04/20/2009; 04/20/09; 4/20/09; 4/3/09;
Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;
20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009; Mar 20th, 2009;
Mar 21st, 2009; Mar 22nd, 2009;
Feb 2009; Sep 2009; Oct 2010; 6/2008; 12/2009
2009; 2010
Clean dates
Next I tried to clean the dates using a solution provided here. It should work, since my format is similar to the one in that problem, but it doesn't.
#Clean dates to date format
df['clean_date'] = df.date.apply(
lambda x: pd.to_datetime(x).strftime('%m/%d/%Y'))
df['clean_date']
The above code does not work. Any help would be deeply appreciated. Thanks for your time!
Well, I figured it out on my own. I still had to make some manual adjustments.
df['new'] = (df['header'].str.findall(r'\d{1,2}.\d{1,2}.\d{2,4}')
             .astype(str).str.replace(r'\[|\]|\(|\)|,|\'', '', regex=True).str.strip())
df['new2'] = (df['header'].str.findall(r'(?:\d{2} )?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* (?:\d{2}, )?\d{4}')
              .astype(str).str.replace(r'\[|\]|\'|\,', '', regex=True).str.strip())
df['new3'] = (df['header'][455:501].str.findall(r'\d{4}')
              .astype(str).str.replace(r'\[|\]|\(|\)|,|\'', '', regex=True).str.strip())
#Coerce dates data to date-format
df['date1'] = df['new'].str.strip() + df['new2'].str.strip() + df['new3'].str.strip()
df['date1'] = pd.to_datetime(df['date1'], errors='coerce')
df[['date1', 'header']].sort_values(by ='date1')
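One reason the first pattern picks up values such as 16/0.83 is that `.` matches any character, including the decimal point. A possible alternative is an ordered list of stricter patterns, tried from most to least specific, keeping the first hit. The sketch below is an illustration under stated assumptions, not the poster's solution: MONTH, PATTERNS and extract_date are made-up names, only "/" is accepted in numeric dates, and bare years are limited to 19xx and 20xx.

```python
import re

# Month name or abbreviation, optionally followed by a period (assumed English).
MONTH = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.?"

# Ordered from most to least specific; the first pattern that matches wins.
PATTERNS = [
    r"\b\d{1,2}/\d{1,2}/(?:\d{4}|\d{2})\b",                     # 04/20/2009, 4/3/09
    rf"\b\d{{1,2}} {MONTH},? \d{{4}}\b",                        # 20 Mar 2009, 20 March, 2009
    rf"\b{MONTH}[ -]\d{{1,2}}(?:st|nd|rd|th)?,?[ -]\d{{4}}\b",  # Mar 20th, 2009, Mar-20-2009
    rf"\b{MONTH},? \d{{4}}\b",                                  # Feb 2009, February, 2010
    r"\b\d{1,2}/\d{4}\b",                                       # 6/2008
    r"\b(?:19|20)\d{2}\b",                                      # 2009
]
COMPILED = [re.compile(p) for p in PATTERNS]


def extract_date(text):
    """Return the text of the most specific date found, or None."""
    for rx in COMPILED:
        m = rx.search(text)
        if m:
            return m.group(0)
    return None


print(extract_date('Lithium 0.25 (7/11/77). BUN/Cr: 16/0.83. CBC: 4.9/36/308'))
```

Applied per row, for example with df['records'].apply(extract_date), this leaves None where nothing matches; the extracted strings would still need to be converted to dates afterwards.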
I am trying to extract dates from text in Python. These are the possible texts and the date patterns in them.
"Auction details: 14 December 2016, Pukekohe Park"
"Auction details: 17 Feb 2017, Gold Sacs Road"
"Auction details: Wednesday 27 Apr 1:00 p.m. (On site)(2016)"
"Auction details: Wednesday 27 Apr 1:00 p.m. (In Rooms - 923 Whangaa Rd, Man)(2016)"
"Auction details: Wed 27 Apr 2:00 p.m., 48 Viaduct Harbour Ave, Auckland, (2016)"
"Auction details: November 16 Wednesday 2:00pm at 48 Viaduct Harbour Ave, Auckland(2016)"
"Auction details: Thursday, 28th February '19"
"Auction details: Friday, 1st February '19"
This is what I have written so far,
mon = ' (?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(Nov|Dec)(?:ember)?) '
day1 = r'\d{1,2}'
day_test = r'\d{1,2}(?:th)|\d{1,2}(?:st)'
year1 = r'\d{4}'
year2 = r'\(\d{4}\)'
dummy = r'.*'
This captures cases 1 and 2.
match = re.search(day1 + mon + year1, "Auction details: 14 December 2016, Pukekohe Park")
print(match.group())
This somewhat captures cases 3, 4 and 5, but it prints everything up to the year. In the case below I want 25 Nov 2016, but the pattern gives me 25 Nov 3:00 p.m. (On Site)(2016).
Question 1: How do I get only the date here?
match = re.search(day1 + mon + dummy + year2, "Friday 25 Nov 3:00 p.m. (On Site)(2016)")
print(match.group())
Question 2: Similarly, how do I capture cases 6, 7 and 8? What should the regex be for those?
If that is not possible, is there a better way to capture the date from these formats?
You may try
((?:(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)\s+\d{1,2}(?:st|nd|rd|th)?|\d{1,2}(?:st|nd|rd|th)?\s+(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)))(?:.*(\b\d{2}(?:\d{2})?\b))?
See the regex demo.
Note that I made all groups in the regex blocks non-capturing ((Nov|Dec) -> (?:Nov|Dec)), added an optional (?:st|nd|rd|th)? group after the day digit pattern, changed the year pattern to \b\d{2}(?:\d{2})?\b so that it only matches 4- or 2-digit chunks as whole words, and created an alternation group to account for dates where the day comes before the month and vice versa.
The day and month are captured into Group 1 and the year into Group 2, so the result is the concatenation of both.
NOTE: If you need to match years in a safer way, you may want to make the year pattern more precise. E.g., if you want to avoid matching 4- or 2-digit whole words after :, add a negative lookbehind:
year1 = r'\b(?<!:)\d{2}(?:\d{2})?\b'
^^^^^^
Also, you may add word boundaries around the whole pattern to ensure a whole word match.
Here is the Python demo:
import re
mon = r'(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)'
day1 = r'\d{1,2}(?:st|nd|rd|th)?'
year1 = r'\b\d{2}(?:\d{2})?\b'
dummy = r'.*'
rx = r"((?:{smon}\s+{sday1}|{sday1}\s+{smon}))(?:{sdummy}({syear1}))?".format(smon=mon, sday1=day1, sdummy=dummy, syear1=year1)
# Or, try this if a partial number before a date is parsed as day:
# rx = r"\b((?:{smon}\s+{sday1}|{sday1}\s+{smon}))(?:{sdummy}({syear1}))?".format(smon=mon, sday1=day1, sdummy=dummy, syear1=year1)
strs = [
    "Auction details: 14 December 2016, Pukekohe Park",
    "Auction details: 17 Feb 2017, Gold Sacs Road",
    "Auction details: Wednesday 27 Apr 1:00 p.m. (On site)(2016)",
    "Auction details: Wednesday 27 Apr 1:00 p.m. (In Rooms - 923 Whangaa Rd, Man)(2016)",
    "Auction details: Wed 27 Apr 2:00 p.m., 48 Viaduct Harbour Ave, Auckland, (2016)",
    "Auction details: November 16 Wednesday 2:00pm at 48 Viaduct Harbour Ave, Auckland(2016)",
    "Auction details: Thursday, 28th February '19",
    "Auction details: Friday, 1st February '19",
    "Friday 25 Nov 3:00 p.m. (On Site)(2016)",
]
for s in strs:
    print(s)
    m = re.search(rx, s)
    if m:
        print("{} {}".format(m.group(1), m.group(2)))
    else:
        print("NO MATCH")
Output:
Auction details: 14 December 2016, Pukekohe Park
14 December 2016
Auction details: 17 Feb 2017, Gold Sacs Road
17 Feb 2017
Auction details: Wednesday 27 Apr 1:00 p.m. (On site)(2016)
27 Apr 2016
Auction details: Wednesday 27 Apr 1:00 p.m. (In Rooms - 923 Whangaa Rd, Man)(2016)
27 Apr 2016
Auction details: Wed 27 Apr 2:00 p.m., 48 Viaduct Harbour Ave, Auckland, (2016)
27 Apr 2016
Auction details: November 16 Wednesday 2:00pm at 48 Viaduct Harbour Ave, Auckland(2016)
November 16 2016
Auction details: Thursday, 28th February '19
28th February 19
Auction details: Friday, 1st February '19
1st February 19
Friday 25 Nov 3:00 p.m. (On Site)(2016)
25 Nov 2016
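If an actual date object is needed rather than the concatenated text, the two groups can be normalized and handed to datetime.strptime. This is a rough sketch under stated assumptions: to_date is a hypothetical helper, month names are English (strptime's %b and %B depend on the locale), and two-digit years follow strptime's %y rule.

```python
import re
from datetime import date, datetime

# Formats to try in order: 4-digit years first, then 2-digit years.
FORMATS = ("%d %B %Y", "%d %b %Y", "%B %d %Y", "%b %d %Y",
           "%d %B %y", "%d %b %y", "%B %d %y", "%b %d %y")


def to_date(day_month, year):
    """Combine Group 1 (day and month) and Group 2 (year) into a date, or None."""
    # drop ordinal suffixes that directly follow a digit: 28th -> 28, 1st -> 1
    cleaned = re.sub(r"(?<=\d)(?:st|nd|rd|th)\b", "", day_month)
    text = "{} {}".format(cleaned, year)
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None


print(to_date("28th February", "19"), to_date("November 16", "2016"))
```

Returning None instead of raising keeps the helper usable when the optional year group did not match.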
import re
pattern = r"(Mon|Tues|Wednes|Thurs|Fri)day, (February|March) [0-9]{2}, [0-9]{4}\s*Day [0-9]{1}"
line = """
Wednesday, February 28, 2018
Day 4 3:00 Dismissal
All Day
Thursday, March 01, 2018
Day 5 1:30PM Dismissal
All Day
Friday, March 02, 2018
Day 6 3:00 Dismissal
All Day
Monday, March 05, 2018
Day 1 1:30 Dismissal
All Day
Tuesday, March 06, 2018
Day 2 3:00 Dismissal
All Day
Tuesday, March 06, 2018"""
result = re.findall(pattern, line)
print(result)
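This won't work: when the pattern contains capturing groups, findall returns only the contents of those groups, not the whole match.
If you want to capture only the key parts, group them properly: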
pattern = r"((?:Mon|Tues|Wednes|Thurs|Fri)day), (February|March) ([0-9]{2}), ([0-9]{4})\s*Day ([0-9]{1})"
This gives:
[('Wednesday', 'February', '28', '2018', '4'), ('Thursday', 'March', '01', '2018', '5'), ('Friday', 'March', '02', '2018', '6'), ('Monday', 'March', '05', '2018', '1'), ('Tuesday', 'March', '06', '2018', '2')]
If you want to catch the whole matched string, don't use capturing groups (as @ekhumoro said, put ?: at the start of a group):
pattern = r"(?:Mon|Tues|Wednes|Thurs|Fri)day, (?:February|March) [0-9]{2}, [0-9]{4}\s*Day [0-9]{1}"
This gives a list of strings:
['Wednesday, February 28, 2018 \nDay 4', 'Thursday, March 01, 2018 \nDay 5', 'Friday, March 02, 2018 \nDay 6', 'Monday, March 05, 2018 \nDay 1', 'Tuesday, March 06, 2018 \nDay 2']
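To turn the matches into values a program can work with, named groups combined with datetime.strptime are one option. A minimal sketch on a shortened sample; the group names date and cycle_day and the list name schedule are made up for illustration, and English day and month names are assumed:

```python
import re
from datetime import date, datetime

pattern = re.compile(
    r"(?P<date>(?:Mon|Tues|Wednes|Thurs|Fri)day, (?:February|March) [0-9]{2}, [0-9]{4})"
    r"\s*Day (?P<cycle_day>[0-9])"
)

line = """
Wednesday, February 28, 2018
Day 4 3:00 Dismissal
All Day
Thursday, March 01, 2018
Day 5 1:30PM Dismissal
"""

# one (date, day number) pair per match
schedule = [
    (datetime.strptime(m.group("date"), "%A, %B %d, %Y").date(), int(m.group("cycle_day")))
    for m in pattern.finditer(line)
]
print(schedule)
```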