regular expression for matching everything until a word is found - python

I have a piece of text that is repeated several times. Here you have a sample of that text:
DEMO of the text
The idea is to have a regular expression with three groups and apply it to every match throughout the text. Here is an example of the possible matches:
group1 = HORIZON-CL5-2021-D1-01
group2 (Opening) = 15 Apr 2021
group3 (Deadlines(s)) = 07 Sep 2021
group1 = HORIZON-CL5-2022-D1-01-two-stage
group2 (Opening) = 04 Nov 2021
group3 (Deadlines(s)) = 15 Feb 2022 (First Stage), 07 Sep 2022 (Second Stage)
I am trying with this regular expression:
\n(HORIZON-\S+-[A-Z]{1}\d{1}-\d{2}).*?^Opening
It almost works, but the regular expression needs to handle two more things:
First, in some cases extra text follows the last number of HORIZON..., as in the second case:
HORIZON-CL5-2022-D1-01 -two-stage
Second, it should capture everything until the word 'Opening:' appears at the beginning of a line. I thought I was doing this with the .*?^Opening part of the expression, but it does not seem to be correct.
How can I solve this?

To get the -two-stage part into group 1, you can add \S* (0 or more non-whitespace characters) to the existing group.
You don't need the s modifier to make the dot match a newline. Instead, you can match all lines that do not start with Opening using a negative lookahead, then match Opening and capture the date and the deadline part in their own capture groups. In Python, use re.MULTILINE so that ^ matches at the start of each line.
Note that you can omit {1}, since a single occurrence is the default.
^(HORIZON-\S+-[A-Z]\d-\d{2}\S*)(?:\r?\n(?!Opening\b).*)*\r?\nOpening: (.+)\r?\nDeadline\(s\): (.+)
You could make the groups that start with a date-like part as specific as you want, since .+ is a broad match.
For example:
^(HORIZON-\S+-[A-Z]\d-\d{2}\S*)(?:\r?\n(?!Opening\b).*)*\r?\nOpening: (\d{2} [A-Z][a-z]{2} \d{4})\r?\nDeadline\(s\): (\d{2} [A-Z][a-z]{2} \d{4}.*)
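As a minimal sketch of how the first pattern could be applied in Python: the sample text below is invented for illustration (it only imitates the blocks described in the question), and re.MULTILINE is assumed so that ^ matches at the start of every line.

```python
import re

# Hypothetical sample text, made up to resemble the blocks described in the question.
text = "\n".join([
    "HORIZON-CL5-2021-D1-01",
    "Some descriptive line",
    "Opening: 15 Apr 2021",
    "Deadline(s): 07 Sep 2021",
    "HORIZON-CL5-2022-D1-01-two-stage",
    "Another descriptive line",
    "spread over two lines",
    "Opening: 04 Nov 2021",
    "Deadline(s): 15 Feb 2022 (First Stage), 07 Sep 2022 (Second Stage)",
])

pattern = re.compile(
    r"^(HORIZON-\S+-[A-Z]\d-\d{2}\S*)"   # group 1: the call identifier
    r"(?:\r?\n(?!Opening\b).*)*"         # skip lines that do not start with Opening
    r"\r?\nOpening: (.+)"                # group 2: opening date
    r"\r?\nDeadline\(s\): (.+)",         # group 3: deadline(s)
    re.MULTILINE,
)

# findall returns one tuple of the three groups per matched block.
matches = pattern.findall(text)
for call_id, opening, deadlines in matches:
    print(call_id, "|", opening, "|", deadlines)
```

Because the pattern has three groups, findall yields a list of 3-tuples, which maps directly onto group1, group2 and group3 from the question.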

In your pattern, \S+ lets the first group match a repeated HORIZON-... part, e.g. HORIZON-()-A1-11HORIZON-+-B2-33. Since this should not appear in your input, it should not be a problem.
The Opening is required in your pattern; I would replace it with a positive lookahead (?=Opening|$), where $ denotes the end of the input (or the end of a line in multiline mode).
It seems you are not doing anything with the parts of the string you are retrieving; from your examples, I think you could simply match non-spaces.
const pattern = /\n(HORIZON-\S+)\s*(.*?)\s*(?=Opening|$)/
If you want to keep the original pattern and capture the rest of the text in a separate group, it would be /\n(HORIZON-\S+-[A-Z]{1}\d{1}-\d{2})(\S*)\s*(.*?)\s*(?=Opening|$)/.
An expression beginning with '\n' does not match the first line; you could change it to /^(HORIZON-\S+-[A-Z]{1}\d{1}-\d{2})(\S*)\s*(.*?)\s*(?=Opening|$)/.

You can have something like this: HORIZON-\S+-[A-Z]{1}\d{1}-\d{2}(-[^\s]*)? . I added the (-[^\s]*)? part. Here I am telling the regex to match something that starts with - until a white space (\s) is found. The ? makes this part optional so it can show up once or not at all.
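A small sketch of how the optional suffix group behaves in Python. The two identifiers are taken from the question's example matches, written without a space before the suffix; the variable names are only illustrative.

```python
import re

# Group 1 is the optional "-suffix" part added at the end of the pattern.
pattern = re.compile(r"HORIZON-\S+-[A-Z]{1}\d{1}-\d{2}(-[^\s]*)?")

with_suffix = pattern.search("HORIZON-CL5-2022-D1-01-two-stage")
without_suffix = pattern.search("HORIZON-CL5-2021-D1-01")

# group(0) is the whole match; group(1) is None when the suffix is absent.
print(with_suffix.group(0), with_suffix.group(1))
print(without_suffix.group(0), without_suffix.group(1))
```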

Related

How to make this Regex pattern work for both strings

I have the strings 'amount $165' and 'amount on 04/20' (and a few other variations I have no issues with so far). I want to run an expression that returns the numerical amount if available (165 in the first string), returns nothing if it is not available, and does not confuse the amount with a date (second string). If I write the code as follows, it returns the 165 but also the 04 from the second string.
amount_search = re.findall(r'amount.*?(\d+)[^\d?/]?', string)
If I write it as follows, it returns neither:
amount_search = re.findall(r'amount.*?(\d+)[^\d?/]', string)
How can I change what I have to return 165 but not 04?
To capture the whole number in a group, you could match amount followed by matching all chars except digits or newlines if the value can not cross newline boundaries.
Capture the first encountered digits in a group and assert a whitespace boundary at the right.
\bamount [^\d\r\n]*(\d+)(?!\S)
In parts
\bamount Match amount followed by a space and preceded with a word boundary
[^\d\r\n]* Match 0 or more times any char except a digit or newlines
(\d+) Capture group 1, match 1 or more digits
(?!\S) Assert a whitespace boundary on the right
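A minimal sketch of this pattern applied to the two strings from the question, using re.findall as in the original attempt (the names samples and results are only illustrative):

```python
import re

pattern = re.compile(r"\bamount [^\d\r\n]*(\d+)(?!\S)")

samples = ["amount $165", "amount on 04/20"]

# The date-like "04/20" is rejected because "04" is followed by "/",
# which is a non-whitespace character, so the (?!\S) assertion fails.
results = {s: pattern.findall(s) for s in samples}
print(results)
```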
Try this: ^amount\W*\$([\d]{1,})$
The final $ indicates the end of the line; from what I have tested, using .* or ? in its place also works.
Because the group contains only digits, the / of the date format cannot be part of it.
Hope this helps :)
Try this:
from re import sub
# string is the input text; keep only tokens that are digits after stripping a leading $
your_digit_list = [int(sub(r'[^0-9]', '', s)) for s in string.split() if s.lstrip('$').isdigit()]

Pandas: str extract text every thing except the last part of the string

I have a dataframe with a column known as "msg".
In the "msg" column, all rows look something like the example below. User xxxx is 6 or 7 characters long. xx.xx.xx.xx and yy.yy.yy.yy are IP addresses, so every octet can be 1 to 3 digits.
User xxxxxx is attempting to restart primary host xxx.xx.xxx.xx (id=1) for managed host yyy.yy.yyy.yy (id=4) at Dec 30, 2019, 6:08:87 PM
I need a rule to extract everything in each cell before "at Dec 30, 2019, 6:08:87 PM", i.e. I want to drop all characters from "at \w\w\w \d\d, \d\d\d\d, \d:\d\d:\d\d ....." onwards.
My current code is below, but I am not sure how to fill in the pat.
Test = df['msg'].str.extract(pat='...')
Respond to comments below:
Matthew: yes. The format after the 2nd (id=xx) are the same.
Jon: either way is OK.
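You could use a positive lookahead regex here. Note that str.extract requires at least one capture group, so the part to keep is wrapped in parentheses:
Test = df['msg'].str.extract(pat=r'^(.*)(?=\s+at [A-Za-z]{3} \d{2}, \d{4}, [\d:]+ (?:AM|PM)$)')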
Here is a regex demo showing that the above pattern is working:
Demo
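To check the lookahead idea without a DataFrame, here is a sketch using only the standard re module, with a capture group around the part to keep. The message is a made-up row in the format described in the question; the same pattern string should be usable as pat in str.extract, though that is not tested here.

```python
import re

# Hypothetical message in the format described in the question.
msg = ("User abcdef is attempting to restart primary host 10.1.22.3 (id=1) "
       "for managed host 172.16.0.5 (id=4) at Dec 30, 2019, 6:08:87 PM")

# Group 1 holds everything before the trailing " at <timestamp>";
# the lookahead checks for the timestamp without consuming it.
pattern = re.compile(
    r"^(.*)(?=\s+at [A-Za-z]{3} \d{2}, \d{4}, [\d:]+ (?:AM|PM)$)"
)

m = pattern.search(msg)
prefix = m.group(1) if m else None
print(prefix)
```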
string='I ate an apple (id=1) and an orange (id=4) at Dec 30, 2019, 6:08:87 PM'
string = string[:string.rfind('at')]
Here, I assume the word 'at' always comes right before the date. So I found the last occurrence of 'at' using rfind() and sliced the string up to that position.
Please try:
df.msg.str.extractall(r'(?<=\s)([a-z]*\s[A-Z0-9]\S*\s[0-9,].+)')
Explanation
(?<=\s) asserts that the match is preceded by a whitespace character.
[a-z]*\s matches zero or more lowercase letters followed by a whitespace character (here "at ").
[A-Z0-9]\S*\s matches an uppercase letter or digit, any non-whitespace characters after it, and a whitespace character (here "Dec ").
[0-9,] matches a digit or a comma, and .+ greedily matches the rest of the line.
On the sample row this captures the trailing part, at Dec 30, 2019, 6:08:87 PM.

python regex: string search for a date

I am searching for a specific string within a document that will have known words before and after a date, and I want to extract the date. For example, if the substring is "dated as of 29 Jan 2017 to the schedule", I want to extract "29 Jan 2017".
My code is:
m = re.search(r'dated as of \w+\s+(.+?)+to the schedule', text, re.IGNORECASE)
if m:
    items["date"] = m.group(1)
But - this just gives me "Jan 2017" - it misses the day.
I have tried various variations on the regex search string, but still can't get the day. Any thoughts?
Your capturing group (the parentheses) does not enclose the first part, which is matched by \w+.
Try combining a capturing group (for the whole date) with a non-capturing group in place of your current parentheses:
r'dated as of (\w+\s+(?:.+?)+) to the schedule'
As you can see, we have a simple grouping with no repetition that encloses both \w+ and your previous parentheses.
And your previous parentheses were changed to non-capturing group with ?: just inside them.
Better yet, your existing parentheses and the combination of +? and + don't make much sense, so you can just remove them:
r'dated as of (\w+\s+.+) to the schedule'
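A quick sketch of the simplified pattern in use. The surrounding sentence is invented for illustration; only the "dated as of ... to the schedule" part comes from the question.

```python
import re

# Hypothetical document text containing the phrase from the question.
text = "This agreement is dated as of 29 Jan 2017 to the schedule attached."

m = re.search(r"dated as of (\w+\s+.+) to the schedule", text, re.IGNORECASE)

# group(1) now covers the day as well as the month and year.
date = m.group(1) if m else None
print(date)
```

Since .+ is greedy, a document containing "to the schedule" more than once on the same line would make the group run up to the last occurrence; .+? could be used instead if that matters.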
"re" module included with Python primarily used for string searching and manipulation
\w = a word character (alphanumeric or "_")
\d = a digit
+ = one or more of the preceding item
re.findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result.
import re
data = "dated as of 29 Jan 2017 to the schedule"
match = re.findall(r'\d+ \w+ \d{4}', data)
print (match[0])
output:
29 Jan 2017
This works fine:
text = "dated as of 29 Jan 2017"
m = re.search(r'\d\d\s(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s\d{4}',
              text, re.IGNORECASE)
if m:
    print(m.group(0))

Match a date with the same character separating the values

I have to find dates in multiple formats in a text.
I have some regex like this one:
# Detection of:
# 25/02/2014 or 25/02/14 or 25.02.14
regex = r'\b(0?[1-9]|[12]\d|3[01])[-/\._](0?[1-9]|1[012])[-/\._]((?:19|20)\d\d|\d\d)\b'
The problem is that it also matches dates like 25.02/14 which is not good because the splitting character is not the same.
I could of course do multiple regex with a different splitting character for every regex, or do a post-treatment on the matching results, but I would prefer a complete solution using only one good regex. Is there a way to do so?
In addition to my comment (the original word boundary approach lets the pattern match "dates" that are in fact parts of other entities, like IPs, serial numbers, product IDs, etc.), see the improved version of your regex in comparison with yours:
import re
s = '25.02.19.35 6666-20-03-16-67875 25.02/2014 25.02/14 11/12/98 11/12/1998 14/12-2014 14-12-2014 14.12.1998'
found_dates = [m.group() for m in re.finditer(r'\b(?:0?[1-9]|[12]\d|3[01])([./-])(?:0?[1-9]|1[012])\1(?:19|20)?\d\d\b', s)]
print(found_dates) # initial regex
# => ['25.02.19', '20-03-16', '11/12/98', '11/12/1998', '14-12-2014', '14.12.1998']
found_dates = [m.group() for m in re.finditer(r'(?<![\d.-])(?:0?[1-9]|[12]\d|3[01])([./-])(?:0?[1-9]|1[012])\1(?:19|20)?\d\d(?!\1\d)', s)]
print(found_dates) # fixed boundaries
# => ['11/12/98', '11/12/1998', '14-12-2014', '14.12.1998']
See, your regex extracts '25.02.19' (part of a potential IP) and '20-03-16' (part of a potential serial number/product ID).
Note I also shortened the regex and extraction code a bit.
Pattern details:
(?<![\d.-]) - a negative lookbehind making sure there is no digit, . and - immediately to the left of the current location (/ has been discarded since dates are often found inside URLs)
(?:0?[1-9]|[12]\d|3[01]) - 01 / 1 to 31 (day part)
([./-]) - Group 1 (technical group to hold the separator value) matching either ., or / or -
(?:0?[1-9]|1[012]) - month part: 01 / 1 to 12
\1 - backreference to the Group 1 value to make sure the same separator comes here
(?:19|20)?\d\d - year part: 19 or 20 (optional values) and then any two digits.
(?!\1\d) - negative lookahead making sure there is no separator (captured into Group 1) followed with any digit immediately to the right of the current location.
Based on the comment of Rawing, this did the trick:
regex = r'\b(0?[1-9]|[12]\d|3[01])([./-])(0?[1-9]|1[012])\2((?:19|20)\d\d|\d\d)\b'
So, the complete code is:
import re
s = '25.02/2014 25.02/14 11/12/98 11/12/1998 14/12-2014 14-12-2014 14.12.1998'
found_dates = []
for m in re.finditer(r'\b(0?[1-9]|[12]\d|3[01])([./-])(0?[1-9]|1[012])\2((?:19|20)\d\d|\d\d)\b', s):
    found_dates.append(m.group(0))
print(found_dates)
The output is, as desired :
['11/12/98', '11/12/1998', '14-12-2014', '14.12.1998']

python regex match optional square brackets

I have the following strings:
1 "R J BRUCE & OTHERS V B J & W L A EDWARDS And Ors CA CA19/02 27 February 2003",
2 "H v DIRECTOR OF PROCEEDINGS [2014] NZHC 1031 [16 May 2014]",
3 '''GREGORY LANCASTER AND JOHN HENRY HUNTER V CULLEN INVESTMENTS LIMITED AND
ERIC JOHN WATSON CA CA51/03 26 May 2003'''
I am trying to find a regular expression which matches all of them. I don't know how to match optional square brackets around the date at the end of the string eg [16 May 2014].
casename = re.compile(r'(^[A-Z][A-Za-z\'\(\) ]+\b[v|V]\b[A-Za-z\'\(\) ]+(.*?)[ \[ ]\d+ \w+ \d\d\d\d[\] ])', re.S)
The date regex at the end only matches cases with dates in square bracket but not the ones without.
Thanks to everybody who answered. @Matt Clarkson: what I am trying to match is a judicial decision 'handle' in a much larger text. There is a large variation within those handles, but they all start at the beginning of a line, have 'v' for versus between the party names, and end with a date. The names of the parties are mostly, but not exclusively, in capitals. I am trying to have only one match per document and no false positives.
I got all of them to match using this (You'll need to add the case-insensitive flag):
(^[a-z][a-z\'&\(\) ]+\bv\b[a-z&\'\(\) ]+(?:.*?) \[?\d+ \w+ \d{4}\]?)
Explanation:
( Begin capture group
[a-z\'&\(\) ]+ Match one or more of the characters in this group
\b Match a word boundary
v Match the character 'v' literally
\b Match a word boundary
[a-z&\'\(\) ]+ Match one or more of the characters in this group
(?: Begin non-capturing group
.*? Match anything
) End non-capturing group
\[?\d+ \w+ \d{4}\]? Match a date, optionally surrounded by brackets
) End capture group
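The pattern can be tried against the three strings from the question with a sketch like the following. re.IGNORECASE is the flag the answer asks for; re.DOTALL is an assumption added here so that .*? can cross the line break in the third string (the question's own attempt used re.S). Each string is searched on its own, so this says nothing about false positives in a larger document.

```python
import re

# re.IGNORECASE as the answer suggests; re.DOTALL is an assumption so that
# .*? can span the newline inside the third handle.
casename = re.compile(
    r"(^[a-z][a-z\'&\(\) ]+\bv\b[a-z&\'\(\) ]+(?:.*?) \[?\d+ \w+ \d{4}\]?)",
    re.IGNORECASE | re.DOTALL,
)

handles = [
    "R J BRUCE & OTHERS V B J & W L A EDWARDS And Ors CA CA19/02 27 February 2003",
    "H v DIRECTOR OF PROCEEDINGS [2014] NZHC 1031 [16 May 2014]",
    "GREGORY LANCASTER AND JOHN HENRY HUNTER V CULLEN INVESTMENTS LIMITED AND\n"
    "ERIC JOHN WATSON CA CA51/03 26 May 2003",
]

# One search per handle; group(1) is the whole captured handle.
matched = [casename.search(h) for h in handles]
for m in matched:
    print(repr(m.group(1)) if m else None)
```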
Square brackets can be made optional like this:
[\[]* where the * makes the opening [ optional (zero or more occurrences); \[? would allow at most one.
A few recommendations, if I may:
\d\d\d\d can also be written as \d{4}.
In [v|V], everything inside [] is already an alternative, so the | is not necessary: [vV].
And here is an online demo
Using your regex and input strings, it looks like you will match only the 2nd line (if you get rid of the '^' at the beginning of the regex). I've added inline comments to each section of the regular expression you provided to make it more clear.
Can you indicate what you are trying to capture from each line? Do you want the entire string? Only the word immediately preceding the lone letter 'v'? Do you want the date captured separately?
Depending on the portions that you wish to capture, each section can be broken apart into their respective match groups: regex101.com example. This is a little looser than yours (capturing the entire section between quotation marks instead of only the single word immediately preceding the lone 'v'), and broken apart to help readability (each "group" on its own line).
This example also assumes the newline is intentional, and supports the newline component (warning: it COULD suck up more than you intend, depending on whether the date at the end gets matched or not).
