python regex: string search for a date - python

I am searching for a specific string within a document that will have known words before and after a date, and I want to extract the date. For example, if the substring is "dated as of 29 Jan 2017 to the schedule", I want to extract "29 Jan 2017".
My code is:
m = re.search(r'dated as of \w+\s+(.+?)+to the schedule', text, re.IGNORECASE)
if m:
    items["date"] = m.group(1)
But - this just gives me "Jan 2017" - it misses the day.
I have tried various variations on the regex search string, but still can't get the day. Any thoughts?

Your capturing group (the parentheses) does not enclose the first part of the date, which is matched by \w+ outside the group.
Try combining a capturing group (for the whole date) with a non-capturing group in place of your current parentheses:
r'dated as of (\w+\s+(?:.+?)+) to the schedule'
As you can see, a single group with no repetition now encloses both \w+ and your previous parentheses.
Your previous parentheses were changed to a non-capturing group by adding ?: just inside them.
Better yet, the existing parentheses and the combination of +? and + don't serve a purpose here, so you can remove them:
r'dated as of (\w+\s+.+) to the schedule'
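A minimal sketch of the corrected search in context. The sample sentence is made up for illustration, and the lazy .+? is a small variation on the pattern above (an assumption on my part) so that the match stops at the first "to the schedule" if the phrase appears more than once.

```python
import re

# Hypothetical sample text; only the phrase around the date comes from the question.
text = "This amendment, dated as of 29 Jan 2017 to the schedule, applies."

# One capturing group encloses the day as well as the rest of the date.
pattern = re.compile(r'dated as of (\w+\s+.+?) to the schedule', re.IGNORECASE)

m = pattern.search(text)
extracted = m.group(1) if m else None
print(extracted)
```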

The "re" module, included with Python, is primarily used for string searching and manipulation.
\w = a word character (alphanumeric, including "_")
\d = any digit
+ = matches 1 or more of the preceding item
re.findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result.
import re
data = "dated as of 29 Jan 2017 to the schedule"
match = re.findall(r'\d+ \w+ \d{4}', data)
print(match[0])
output:
29 Jan 2017
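The note about groups in the findall description can be illustrated with a short sketch. The second pattern, with three groups, is my own variation for illustration and is not part of the answer above.

```python
import re

data = "dated as of 29 Jan 2017 to the schedule"

# No groups: each match is returned as one string.
whole = re.findall(r'\d+ \w+ \d{4}', data)

# Three groups: each match is returned as a tuple (day, month, year).
parts = re.findall(r'(\d+) (\w+) (\d{4})', data)

print(whole)
print(parts)
```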

This works fine:
import re
text = "dated as of 29 Jan 2017"
m = re.search(r'\d\d\s(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s\d{4}',
              text, re.IGNORECASE)
if m:
    print(m.group(0))

Related

regular expression for matching everything until a word is found

I have a piece of text that is repeated several times. Here you have a sample of that text:
DEMO of the text
The idea is to have a regular expression with three groups and repeat this for any match along with the text. Here you have an example of a possible match:
group1 = HORIZON-CL5-2021-D1-01
group2 (Opening) = 15 Apr 2021
group3 (Deadlines(s)) = 07 Sep 2021
group1 = HORIZON-CL5-2022-D1-01-two-stage
group2 (Opening) = 04 Nov 2021
group3 (Deadlines(s)) = 15 Feb 2022 (First Stage), 07 Sep 2022 (Second Stage)
I am trying with this regular expression:
\n(HORIZON-\S+-[A-Z]{1}\d{1}-\d{2}).*?^Opening
It almost works. I need the regular expression to handle two more things:
In some cases, extra text appears after the last number of HORIZON..., as in the second case:
HORIZON-CL5-2022-D1-01-two-stage
I need to capture everything until the word 'Opening:' appears at the beginning of a line. I thought I was doing this with the .*?^Opening part of the expression, but it seems that is not correct.
How can I solve this?
To get the -two-stage in group 1, you can add matching 0+ non whitespace chars \S* to the existing group.
You don't need the s modifier to make the dot match a newline. Instead, you can match all lines that do not start with Opening using a negative lookahead, and then match Opening and capture the date and the deadline part in a capture group.
Note that you can omit {1}
^(HORIZON-\S+-[A-Z]\d-\d{2}\S*)(?:\r?\n(?!Opening\b).*)*\r?\nOpening: (.+)\r?\nDeadline\(s\): (.+)
You could make the groups that start with a date-like part as specific as you want, as .+ is a broad match.
For example:
^(HORIZON-\S+-[A-Z]\d-\d{2}\S*)(?:\r?\n(?!Opening\b).*)*\r?\nOpening: (\d{2} [A-Z][a-z]{2} \d{4})\r?\nDeadline\(s\): (\d{2} [A-Z][a-z]{2} \d{4}.*)
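A minimal Python sketch of the first pattern above. The sample text is an assumption: the original input is not shown, so the layout (an identifier line, free text, then "Opening:" and "Deadline(s):" lines) is reconstructed from the expected groups in the question. re.MULTILINE is needed so that ^ matches at the start of each line.

```python
import re

# Assumed layout; the description lines are invented for illustration.
sample = (
    "HORIZON-CL5-2021-D1-01\n"
    "Some description line\n"
    "Opening: 15 Apr 2021\n"
    "Deadline(s): 07 Sep 2021\n"
    "HORIZON-CL5-2022-D1-01-two-stage\n"
    "Another description line\n"
    "Opening: 04 Nov 2021\n"
    "Deadline(s): 15 Feb 2022 (First Stage), 07 Sep 2022 (Second Stage)\n"
)

pattern = re.compile(
    r"^(HORIZON-\S+-[A-Z]\d-\d{2}\S*)"   # group 1: identifier plus optional suffix
    r"(?:\r?\n(?!Opening\b).*)*"         # skip lines that do not start with Opening
    r"\r?\nOpening: (.+)"                # group 2: opening date
    r"\r?\nDeadline\(s\): (.+)",         # group 3: deadline(s)
    re.MULTILINE,
)

rows = pattern.findall(sample)
for name, opening, deadlines in rows:
    print(name, "|", opening, "|", deadlines)
```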
In your pattern, \S+ can also match across a repeated HORIZON-... within the first group, e.g. HORIZON-()-A1-11HORIZON-+-B2-33. Since this should not appear in your input, it should not be a problem.
Opening is required by your pattern; I would replace it with a positive lookahead (?=Opening|$), where $ denotes the end of the input (or the end of a line when the multiline flag is set).
You do not seem to be doing anything with the parts of the string you retrieve, so judging from your examples I think you could simply match non-space characters.
const pattern = /\n(HORIZON-\S+)\s*(.*?)\s*(?=Opening|$)/
If you want to keep the original pattern and capture the rest of the text in a separate group, it would be /\n(HORIZON-\S+-[A-Z]{1}\d{1}-\d{2})(\S*)\s*(.*?)\s*(?=Opening|$)/.
An expression beginning with '\n' does not match the first line; you could change it to /^(HORIZON-\S+-[A-Z]{1}\d{1}-\d{2})(\S*)\s*(.*?)\s*(?=Opening|$)/.
You can have something like this: HORIZON-\S+-[A-Z]{1}\d{1}-\d{2}(-[^\s]*)? . I added the (-[^\s]*)? part. Here I am telling the regex to match something that starts with - until a white space (\s) is found. The ? makes this part optional so it can show up once or not at all.
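A short sketch of how the optional suffix group behaves, written in Python for consistency with the other examples here. The redundant {1} quantifiers are dropped, as noted earlier; the two test strings are the identifiers from the question.

```python
import re

# The trailing group is optional: it captures "-suffix" when present, else None.
pattern = re.compile(r"HORIZON-\S+-[A-Z]\d-\d{2}(-[^\s]*)?")

plain = pattern.search("HORIZON-CL5-2021-D1-01")
staged = pattern.search("HORIZON-CL5-2022-D1-01-two-stage")

print(plain.group(0), plain.group(1))
print(staged.group(0), staged.group(1))
```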

Using Regex to search for a string unless it finds another string first

Hello I'm trying to use regex to search through a markdown file for a date and only get a match if it finds an instance of a specific string before it finds another date.
This is what I have right now and it definitely doesn't work.
(\d{2}\/\d{2}\/\d{2})(string)?(^(\d{2}\/\d{2}\/\d{2}))
So in this instance it should produce a match, since the string comes before the next date:
01/20/20
string
01/21/20
Here it shouldn't match since the string is after the next date:
01/20/20
this isn't the phrase you're looking for
01/21/20
string
Any help on this would be greatly appreciated.
You could match a date-like pattern. Then use a tempered greedy token approach (?:(?!\d{2}\/\d{2}\/\d{2}).)* to match string without crossing another date first.
Once string is matched, use a non-greedy dot .*? to match up to the first occurrence of the next date.
\d{2}\/\d{2}\/\d{2}(?:(?!\d{2}\/\d{2}\/\d{2}).)*string.*?\d{2}\/\d{2}\/\d{2}
For example (using re.DOTALL to make the dot match a newline)
import re
regex = r"\d{2}\/\d{2}\/\d{2}(?:(?!\d{2}\/\d{2}\/\d{2}).)*string(?:(?!string|\d{2}\/\d{2}\/\d{2}).)*\d{2}\/\d{2}\/\d{2}"
# Adjacent string literals are concatenated into one test string.
test_str = ("01/20/20\n\n"
            "string\n\n"
            "01/21/20\n\n"
            "01/20/20\n\n"
            "this isn't the phrase you're looking for\n\n"
            "01/21/20\n\n"
            "string")
print(re.findall(regex, test_str, re.DOTALL))
Output
['01/20/20\n\nstring\n\n01/21/20']
If the string must not occur twice between the dates, you might use
\d{2}\/\d{2}\/\d{2}(?:(?!\d{2}\/\d{2}\/\d{2}|string).)*string(?:(?!string|\d{2}\/\d{2}\/\d{2}).)*\d{2}\/\d{2}\/\d{2}
Note that if you don't want the string and the dates to be part of a larger word, you could add word boundaries \b
One approach here would be to use a tempered dot to ensure that the regex engine does not cross over the ending date while trying to find the string after the starting date. For example:
import re
inp = """01/20/20
string # <-- this is matched
01/21/20
01/20/20
01/21/20
string""" # <-- this is not matched
matches = re.findall(r'01/20/20(?:(?!\b01/21/20\b).)*?(\bstring\b).*?\b01/21/20\b', inp, flags=re.DOTALL)
print(matches)
This prints string only once, that match being the first occurrence, which legitimately sits in between the starting and ending dates.

Extract date from inside a string with Python

I have the following string; the leading letters can differ, and there can be two, three, or four of them.
PR191030.213101.ABD
I want to extract the 191030 and convert that to a valid date.
filename_without_ending.split(".")[0][-6:]
This solution is not valid, since the format might also differ from time to time. Sometimes it looks like this:
PZA191030_392001_USB
The only REAL pattern is the first six digits.
How do I do this?
Thank you!
You could get the first 6 digits using a pattern and a capturing group
^[A-Z]{2,4}(\d{6})\.
^ Start of string
[A-Z]{2,4} Match 2, 3 or 4 uppercase chars
( Capture group 1
\d{6} Match 6 digits
)\. Close group and match trailing dot
For example
import re
regex = r"^[A-Z]{2,4}(\d{6})\."
test_str = "PR191030.213101.ABD"
matches = re.search(regex, test_str)
if matches:
print(matches.group(1))
Output
191030
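The question also asks to turn the digits into a date, and mentions a variant where an underscore follows them. A hedged sketch covering both: it drops the trailing \. from the pattern and assumes the six digits are in YYMMDD order, which the question does not state explicitly. The helper name extract_date is hypothetical.

```python
import re
from datetime import datetime

def extract_date(name):
    """Hypothetical helper: return the leading six digits as a date, or None."""
    # 2-4 uppercase letters, then six digits; the separator after them is not checked.
    m = re.match(r"[A-Z]{2,4}(\d{6})", name)
    if m is None:
        return None
    try:
        # Assumption: the digits are YYMMDD (191030 -> 30 Oct 2019).
        return datetime.strptime(m.group(1), "%y%m%d").date()
    except ValueError:
        return None  # six digits that do not form a valid date

print(extract_date("PR191030.213101.ABD"))
print(extract_date("PZA191030_392001_USB"))
```

If the digits use another order, only the format string passed to strptime needs to change.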
You can do:
a = 'PR191030.213101.ABD'
int(''.join([c for c in a if c.isdigit()][:6]))
Output:
191030
This can also be done by:
filename_without_ending.split(".")[0][2::]
This slices the part before the first dot from the 3rd character to the end, so it only works when the prefix is exactly two letters.
Since the leading letters can differ, we have to ignore the alphabetic characters and extract the digits.
Using the re module (for regular expressions), apply a regex pattern to the string; it returns the part of the string that matches.
\d matches a digit [0-9], and the + operator matches one or more of them.
findall() finds all occurrences of the pattern in a given string, while search() finds only the first occurrence.
import re
s = "PR191030.213101.ABD"  # named s to avoid shadowing the built-in str
print(re.findall(r"\d+", s)[0])
print(re.search(r"\d+", s).group())

Regular expression to find a date substring Python 3.7

I'm trying to write a regular expression to find a specific substring within a string.
I'm looking for dates in the following format:
"January 1, 2018"
I have already done some research but have not been able to figure out how to make a regular expression for my specific case.
The current version of my regular expression is
re.search("[\w*?\s\d\d\s\d\d\d\d]", my_string)
I'm fairly inexperienced with regular expression but from reading the documentation this is what I could come up with as far as matching the date format I'm working with.
Here is my thought process behind my regular expression:
\w should match to any unicode word character and * should repeat the previous match so these together should match some thing like this "January". ? makes * not greedy so it won't try to match anything in the form of January 20 as in it should stop at the first whitespace character.
\s should match white space.
\d\d and \d\d\d\d should match a two digit and four digit number respectively.
Here's a testable sample of my code:
import re
my_string = "January 01, 1990\n By SomeAuthor"
print(re.search("[\w*?\s\d\d\s\d\d\d\d]", my_string))
EDIT:
I have also tried: [A-Za-z]\s\d{1,2}\s\d{2, 4}
Your pattern may be a bit greedy in certain areas like in the month name. Also, you're missing the optional comma. Finally, you can use the ignore case flag to simplify your pattern. Here is an example using re in verbose mode.
import re
text = "New years day was on January 1, 2018, and boy was it a good time!"
pattern = re.compile(r"""
    [a-z]+   # one or more ASCII letters (the ignore-case flag is used)
    \s       # one whitespace character
    \d\d?    # one or two digits
    ,?       # an optional comma
    \s       # one whitespace character
    \d{4}    # four digits (year)
    """, re.IGNORECASE | re.VERBOSE)
result = pattern.search(text).group()
print(result)
output
January 1, 2018
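It may also help to see why the pattern in the question finds so little. Square brackets define a character class, so everything inside them describes one character rather than a sequence. A small sketch using the question's own test string; the second pattern is just one possible sequence-style rewrite, not the only one.

```python
import re

my_string = "January 01, 1990\n By SomeAuthor"

# Square brackets make a character class, so this matches a single character.
single = re.search(r"[\w*?\s\d\d\s\d\d\d\d]", my_string)

# Written as a sequence (no enclosing brackets), with the comma included.
full = re.search(r"[A-Za-z]+\s\d{1,2},\s\d{4}", my_string)

print(repr(single.group()))
print(repr(full.group()))
```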
Try
In [992]: my_string = "January 01, 1990\n By SomeAuthor"
...: print(re.search("[A-Z][a-z]+\s+\d{1,2},\s+\d{4}", my_string))
...:
<_sre.SRE_Match object; span=(0, 16), match='January 01, 1990'>
[A-Z] is any uppercase letter
[a-z]+ is 1 or more lowercase letters
\s+ is 1 or more space characters
\d{1,2} is at least 1 and at most 2 digits
here:
re.search("\w+\s+\d\d?\s*,\s*\d{4}",date_string)
import re
my_string = "January 01, 1990\n By SomeAuthor"
regex = re.compile(r'\w+\s+\d+, \d{4}')
result = regex.search(my_string)
result is a match object (or None if nothing matched); result.group() gives the matched text and result.span() the character span.

Why won't my for-loop match any of my defined regular expressions?

We have just started to learn about regular expressions, but I'm not able to find any matches in my string when I try to use the regex to search. What am I doing wrong?
I created six separate strings (just for the look of it) and concatenated them into one. Then I try to loop through the words in the split string and search for one of the regexes I declared.
Below, I translated the string, on the right side, just so you know what it says.
myString1 = "Skal vi moetes neste torsdag?" # - Shall we meet next Thursday
myString2 = "Hva med aa heller moetes mandag?" # - How about Monday
myString3 = "Hvordan gikk moetet forrige mandag?" # - How did the meeting on monday go?
myString4 = "Det gikk bra, vi skal moetes igjen tirsdag onsdag fredag lørdag søndag 13. september." # - It went well, we are meeting again on Sunday September 13th.
myString5 = "Altsaa, 13/09/2014?"
myString6 = "Ja, Sunday 13. september 2014." # - Yes, Sunday, September 13th 2014
myStringAll = (myString1 + myString2 + myString3 + myString4 + myString5 + myString6)
myWords = myStringAll.split()
regWeekDays = re.compile(r'^(man|tirs|ons|tors|fre|lør|søn)dag$', re.IGNORECASE)
regNextLast = re.compile(r'^[neste].$', re.IGNORECASE)
regDay = re.compile(r'^([0-2][0-9]|3[0-1])$')
regYear = re.compile(r'^([1-2][0-9][0-9][0-9])$')
for words in myWords:
    matches = re.findall(regNextLast, words)
    if matches:
        print(words)
There are several problems with your regular expressions, but the main one is that you use ^ and $ at the beginning and end of each expression. ^ means to match the beginning of the string and $ matches the end of a string. Unless a string consists of exactly what the expression describes, from start to end, findall won't match anything.
An Example:
In [55]: re.findall(r'^a$', 'abcdefghijkl')
Out[55]: [] # "a" is not matched!
^ and $ should only be used to explicitly match the beginning and end of a string, respectively (or the end of a line in some cases, see the documentation). Strip these out and your expressions should start matching.
Here are some more specific problems:
In ^(man|tirs|ons|tors|fre|lør|søn)dag$ only the first part (man|tirs|ons|tors|fre|lør|søn) will be captured and returned by findall. Change this to a non capturing group so that the entire expression is returned: (?:man|tirs|ons|tors|fre|lør|søn)dag
In ^[neste].$, I assume you want to capture the string "neste". Currently you have a set [neste] which will match one of the following characters: n, e, s, or t. Change this to simply neste. The documentation on sets can be found here.
^([0-2][0-9]|3[0-1])$ is mostly fine. Aside from removing the ^ and $, you can omit the hyphen between 0 and 1, drop the parentheses, and use the digit symbol \d (equivalent to [0-9] for ASCII digits): [0-2]\d|3[01]
Finally in '^([1-2][0-9][0-9][0-9])$', (again, aside from the ^ and $) the expression should work as expected, but you can make it more succinct. You can use the curly bracket syntax to specify repeats. Thus a string matching any year from 1000-2999 becomes:
[12]\d{3}
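Putting the suggestions above together, a minimal sketch. The sample text is assembled from phrases in the question, and the \b word boundaries on the day and year patterns are my own addition (an assumption), so that a two-digit day is not found inside a four-digit year.

```python
import re

# Sample text built from phrases in the question.
text = "Skal vi moetes neste torsdag? Ja, 13. september 2014."

weekday = re.compile(r'(?:man|tirs|ons|tors|fre|lør|søn)dag', re.IGNORECASE)
next_word = re.compile(r'neste', re.IGNORECASE)
day = re.compile(r'\b(?:[0-2]\d|3[01])\b')   # \b added here as an assumption
year = re.compile(r'\b[12]\d{3}\b')          # \b added here as an assumption

found = {
    "weekday": weekday.findall(text),
    "next": next_word.findall(text),
    "day": day.findall(text),
    "year": year.findall(text),
}
print(found)
```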
I recommend that you peruse the HOWTO on Regular Expressions.
r'^[neste].$' is probably not what you meant. This regular expression looks for a string two characters in length where the first character is in ('e', 'n', 's', 't') followed by any other single character. None of the two-character substrings you've split out of your larger string match this pattern.
Maybe you could benefit from a tutorial: http://www.regular-expressions.info/tutorial.html
