Regular Expression to Match Text into Multiple Groups - python

I'm trying to set up a regular expression to match text and I'd like a particular string to match with a separate group from the rest of the text if it is present.
For instance, if my string is this is a test, I would like this is a to match the first group and test to match the second group. I am using the Python regex library. Here are a few more examples of the results I would like:
this is a test - group 1: this is a, group 2: test
one day at a time - group 1: one day at a time, group 2:
one day test is - group 1: one day, group 2: test
testing, 1,2,3 - no match
this is not a drill - group 1: this is not a drill, group 2:
In those cases, the particular string I'm matching in the second group is test. I'm not sure how to set up a regular expression to match these particular cases correctly.

You can try this:
^(?:(?!test))(?:(.*)(?=\btest\b)(\btest\b)|(.*))
Explanation
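^(?:(?!test)) - Negative lookahead anchored at the start: do not match a string that starts with test.
(.*) - Capturing group 1: matches any characters except newlines.
(?=\btest\b) - Positive lookahead: asserts that the whole word test follows, without consuming it.
(\btest\b) - Capturing group 2: matches the whole word test.
| - Alternation; works the same as a logical OR.
(.*) - Capturing group 3: matches any characters except newlines (used when the word test is absent).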
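As a rough sketch of how this pattern behaves on the examples from the question, the snippet below wraps it in a small helper. The name split_on_test is made up for illustration. Note that when test is absent the text lands in group 3 rather than group 1, and that group 1 keeps the space before test, so the helper strips it.

```python
import re

# Pattern from the answer above, unchanged.
PATTERN = re.compile(r'^(?:(?!test))(?:(.*)(?=\btest\b)(\btest\b)|(.*))')


def split_on_test(text):
    """Hypothetical helper: return (prefix, keyword), or None if there is no match."""
    m = PATTERN.match(text)
    if m is None:
        return None
    before, keyword, whole = m.groups()
    if keyword is not None:
        # First alternative matched: group 1 holds the text before "test".
        return before.rstrip(), keyword
    # Second alternative matched: the whole line is in group 3.
    return whole, ''


for sample in ['this is a test', 'one day at a time', 'one day test is',
               'testing, 1,2,3', 'this is not a drill']:
    print(repr(sample), '->', split_on_test(sample))
```

Merging groups 1 and 3 in a helper like this is one way to get the "group 1 / group 2" shape the question asks for without changing the pattern itself.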

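You can try the following regular expression. Note that it only matches strings that begin with this, and it captures test only when test is the very end of the string, so it does not handle the examples that begin with one day: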
^(this.*?)(test)?$
Explanation of the regular expression:
NODE                     EXPLANATION
--------------------------------------------------------------------------------
  ^                        the beginning of the string
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    this                     'this'
--------------------------------------------------------------------------------
    .*?                      any character except \n (0 or more times
                             (matching the least amount possible))
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  (                        group and capture to \2 (optional
                           (matching the most amount possible)):
--------------------------------------------------------------------------------
    test                     'test'
--------------------------------------------------------------------------------
  )?                       end of \2 (NOTE: because you are using a
                           quantifier on this capture, only the LAST
                           repetition of the captured pattern will be
                           stored in \2)
--------------------------------------------------------------------------------
  $                        before an optional \n, and the end of the
                           string

Related

How to create optional capture groups in Python regex

I have examined a previous question relating to optional capture groups in Python, but it has not been helpful. My attempt at following it is below:
import re
c = re.compile(r'(?P<Prelude>.*?)'
               r'(?:Discussion:(?P<Discussion>.+?))?'
               r'(?:References:(?P<References>.*?))?',
               re.M|re.S)
test_text = r"""Prelude strings
Discussion: this is some
text.
References: My bad, I have none.
"""
test_text2 = r"""Prelude strings
Discussion: this is some
text.
"""
print(c.match(test_text).groups())
print(c.match(test_text2).groups())
Both print ('Prelude strings', None, None) instead of capturing the two groups. I am unable to determine why.
The expected result is ('Prelude strings', ' this is some\ntext.', ' My bad, I have none.') for the first, and the second the same but with None as the third capture group. It should also be possible to delete the Discussion lines and still capture References.
You can use
c = re.compile(r'^(?P<Prelude>.*?)'
               r'(?:Discussion:\s*(?P<Discussion>.*?)\s*)?'
               r'(?:References:\s*(?P<References>.*?))?$',
               re.S)
One-line regex pattern as a string:
(?s)^(?P<Prelude>.*?)(?:Discussion:\s*(?P<Discussion>.*?)\s*)?(?:References:\s*(?P<References>.*?))?$
Details:
(?s) - same as re.S, makes . match line break chars
^ - start of the whole string (note that it no longer matches start of any line, since I removed the re.M flag)
(?P<Prelude>.*?) - Group "Prelude": any zero or more chars as few as possible
(?:Discussion:\s*(?P<Discussion>.*?)\s*)? - an optional non-capturing group matching one or zero occurrences of the following sequence:
Discussion: - a fixed string
\s* - zero or more whitespaces
(?P<Discussion>.*?) - Group "Discussion": zero or more chars as few as possible
\s* - zero or more whitespaces
(?:References:\s*(?P<References>.*?))? - an optional non-capturing group matching one or zero occurrences of the following sequence:
References: - a fixed string
\s* - zero or more whitespaces
(?P<References>.*?) - Group "References": any zero or more chars as few as possible
$ - end of the string.
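The breakdown above can be checked with a short script. This is only a sketch: parse_sections is a hypothetical helper name, and the third sample (References without Discussion) is constructed here from the question's description. Because nothing in the pattern consumes the newline before Discussion: or References:, the Prelude group keeps it, so the helper strips each captured value.

```python
import re

# Pattern from the answer above, unchanged.
SECTIONS = re.compile(
    r'^(?P<Prelude>.*?)'
    r'(?:Discussion:\s*(?P<Discussion>.*?)\s*)?'
    r'(?:References:\s*(?P<References>.*?))?$',
    re.S,
)


def parse_sections(text):
    """Hypothetical helper: map each section name to its stripped text or None."""
    m = SECTIONS.match(text)
    if m is None:
        return None
    return {name: (value.strip() if value is not None else None)
            for name, value in m.groupdict().items()}


full = "Prelude strings\nDiscussion: this is some\ntext.\nReferences: My bad, I have none.\n"
no_refs = "Prelude strings\nDiscussion: this is some\ntext.\n"
no_disc = "Prelude strings\nReferences: My bad, I have none.\n"  # assumed variant

for sample in (full, no_refs, no_disc):
    print(parse_sections(sample))
```

The trailing $ is what forces the lazy groups to expand: without an anchor at the end, every lazy or optional part can succeed while matching nothing.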

Add a custom exception to the regex expression

This question is related to my previous question, for which I got an answer.
Now I need to add an exception to the recommended regex. The regex (?<!\s)-\s+ should be applied only if the word after - is not to. If the next word is to, then the - and the whitespace after it should be replaced with a single space instead.
I tried to use a negative lookbehind (?<!to) to add the condition on to.
import re
s = "refer- ences har- ness Stand- ard Re- quired www.mypo- rtal.test.com A - it is a document, move- to store"
re.sub(r"(?<!\s)-\s+(?<!to)", "", s)
But it still returns moveto store instead of move to store.
The expected output:
references harness Standard Required www.myportal.test.com A - it is a document, move to store
You can use
import re
s = "refer- ences har- ness Stand- ard Re- quired www.mypo- rtal.test.com A - it is a document, move- to store"
print(re.sub(r"(?<!\s)-(?:(\s)+(to)\b|\s+)", r"\1\2", s))
# => references harness Standard Required www.myportal.test.com A - it is a document, move to store
Details
(?<!\s) - a location with no whitespace immediately on the left
- - a hyphen
(?:(\s)+(to)\b|\s+) - a non-capturing group matching either of the two patterns:
(\s)+(to)\b - a whitespace captured into Group 1 (the group value is referred to with the \1 placeholder, called a replacement backreference, from the replacement pattern), repeated one or more times (so that only the last one lands in the Group 1 memory buffer) and then a whole word to (since \b is a word boundary) that is captured into Group 2 (\2 in the replacement pattern)
| - or
\s+ - 1+ whitespaces.
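The replacement is a concatenation of Group 1 and Group 2. When the first alternative in the non-capturing group does not match, \1 and \2 are replaced with empty strings (this holds in Python 3.5 and later; earlier versions raise an error for unmatched groups in the replacement), so the result is as expected in both cases.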
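An alternative that does not rely on how unmatched groups are expanded is to pass a function as the replacement. The sketch below is a variation on the answer, not the answer's own code: the pattern makes the word to an optional group, and the hypothetical helper join_hyphenated decides what to emit.

```python
import re

# Hyphen not preceded by whitespace, then whitespace, then optionally the word "to".
HYPHEN_BREAK = re.compile(r"(?<!\s)-\s+(to\b)?")


def join_hyphenated(text):
    """Hypothetical helper: join words split by '- ', but keep a space before 'to'."""
    def repl(match):
        # Group 1 is set only when the next word is exactly "to".
        return " to" if match.group(1) else ""
    return HYPHEN_BREAK.sub(repl, text)


s = ("refer- ences har- ness Stand- ard Re- quired www.mypo- rtal.test.com "
     "A - it is a document, move- to store")
print(join_hyphenated(s))
```

The original attempt failed because (?<!to) is a lookbehind: it inspects the text before the current position (the hyphen and whitespace), not the word that follows.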

regex storing matches in wrong capture group

I am trying to build a Python regex with an optional capture group. My regex works for most cases but fails to put the matches in the right group in one of the test cases.
I want to match and capture the following cases:
namespace::tool_name::1.0.1
namespace::tool_name
tool_name::1.0.1
tool_name
Here is the regex I have so far:
(?:(?P<namespace>^[^:]+)::)?(?P<name>[^:]*)(?:::(?P<version>[0-9\.]+))?
This regex matches all 4 of my test cases, but the problem is that in case 3 the tool_name is captured in the namespace group and the 1.0.1 is captured in the name group. I would like them to be captured in the right groups, name and version respectively.
Thanks
You may make tool_name regex part obligatory by replacing * with + (it looks like it always is present) and restrict this pattern from matching three dot-separated digit chunks with a negative lookahead:
^(?:(?P<namespace>[^:]+)::)?(?!\d+(?:\.\d+){2})(?P<name>[^:]+)(?:::(?P<version>\d+(?:\.\d+){2}))?
Details
^ - start of string
(?:(?P<namespace>[^:]+)::)? - an optional non-capturing group matching any 1+ chars other than : into Group "namespace" and then just matches ::
(?!\d+(?:\.\d+){2}) - a negative lookahead that does not allow digits.digits.digits pattern to appear right after the current position
(?P<name>[^:]+) - Group "name": any 1 or more chars other than :
(?:::(?P<version>\d+(?:\.\d+){2}))? - an optional non-capturing group matching :: and then Group "version" captures 1+ digits and 2 repetitions of . and 1+ digits.
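To see the effect of the lookahead, here is a minimal sketch that runs the pattern over the four cases from the question. The helper name parse_tool_spec is an assumption made for this example.

```python
import re

# Pattern from the answer above, unchanged.
TOOL_SPEC = re.compile(
    r'^(?:(?P<namespace>[^:]+)::)?'
    r'(?!\d+(?:\.\d+){2})'
    r'(?P<name>[^:]+)'
    r'(?:::(?P<version>\d+(?:\.\d+){2}))?'
)


def parse_tool_spec(spec):
    """Hypothetical helper: return the named groups as a dict, or None."""
    m = TOOL_SPEC.match(spec)
    return m.groupdict() if m else None


for spec in ['namespace::tool_name::1.0.1', 'namespace::tool_name',
             'tool_name::1.0.1', 'tool_name']:
    print(spec, '->', parse_tool_spec(spec))
```

In the third case the engine first tries tool_name as the namespace, the lookahead then rejects 1.0.1 as a name, and backtracking leaves the namespace group unset.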

Split string by number of whitespaces

I have a string that looks like one of these three examples:
1: Name = astring Some comments
2: Typ = one two thee Must be "sand", "mud" or "bedload"
3: RDW = 0.02 [ - ] Some comment about RDW
I first split the variable name and rest like so:
re.findall(r'\s*([a-zA-Z0-9_]+)\s*=\s*(.*)', line)
I then want to split the right part of the string into a part containing the values and a part containing the comments (if there are any). I want to do this by looking at the number of whitespaces. If it exceeds, say, 4, then I assume the comments start there.
Any idea on how to do this?
I currently have
re.findall(r'(?:(\S+)\s{0,3})+', dataString)
However if I test this using the string:
'aa aa23r234rf2134213^$&$%& bb'
Then it also selects 'bb'
You may use a single regex with re.findall:
^\s*(\w+)\s*=\s*(.*?)(?:(?:\s{4,}|\[)(.*))?$
Details:
^ - start of string
\s* - 0+ whitespaces
(\w+) - capturing group #1 matching 1 or more letters/digits/underscores
\s*=\s* - = enclosed with 0+ whitespaces
(.*?) - capturing group #2 matching any 0+ chars, as few as possible, up to the first...
(?:(?:\s{4,}|\[)(.*))? - an optional group matching
(?:\s{4,}|\[) - 4 or more whitespaces or a [
(.*) - capturing group #3 matching 0+ chars up to
$ - the end of string.
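A small sketch of the pattern in use follows. The sample lines are assumptions: they reuse the wording from the question with four spaces placed before the comment, since that is the spacing the pattern looks for. split_line is a hypothetical helper name.

```python
import re

# Pattern from the answer above, unchanged.
LINE = re.compile(r'^\s*(\w+)\s*=\s*(.*?)(?:(?:\s{4,}|\[)(.*))?$')


def split_line(line):
    """Hypothetical helper: return (name, value, comment) or None if no '=' line."""
    m = LINE.match(line)
    if m is None:
        return None
    name, value, comment = m.groups()
    # Group 3 is None when there is no comment part.
    return name, value, comment or ''


samples = [
    'Name = astring    Some comments',
    'Typ  = one two thee    Must be "sand", "mud" or "bedload"',
    'Name = astring',
]
for line in samples:
    print(split_line(line))
```

With re.findall the same pattern returns a list of 3-tuples, and a missing comment shows up as an empty string instead of None.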

Return the next nth result \w+ after a hyphen globally

Just getting to the next stage of understanding regex, hoping the community can help...
string = These.Final.Hours-AUSVERSION.2013-TEST-TESTAGAIN-YIFY.cp(tt123456).MiLLENiUM.mp4
There are multiple test names preceded by a '-' hyphen which I derive from regex
/(?<=-)\w+/g
Result:
AUSVERSION
TEST
TESTAGAIN
YIFY
I can parse the very last result using greediness with regex /(?!.*-)(?<=-)\w+/g
Result:
YIFY (4th & last result)
Can you please help me parse either the 1st, 2nd, or 3rd result Globally using the same string?
In Python, you can get these matches with a simple -\s*(\w+) regex and re.findall and then access any match with the appropriate index:
import re
s = 'These.Final.Hours-AUSVERSION.2013-TEST-TESTAGAIN-YIFY.cp(tt123456).MiLLENiUM.mp4'
r = re.findall(r'-\s*(\w+)', s)
print(r[0]) # => AUSVERSION
print(r[1]) # => TEST
print(r[2]) # => TESTAGAIN
print(r[3]) # => YIFY
The -\s*(\w+) pattern searches for a hyphen, followed by 0+ whitespaces, and then captures 1+ digits, letters or underscores. When the pattern contains a capturing group, re.findall returns only the captured texts, so you only get the Group 1 values captured with (\w+).
To get these matches one by one, with re.search, you can use ^(?:.*?-\s*(\w+)){n}, where n is the 1-based index of the match you want.
A quick Python demo (in real code, assign the result of re.search and only access Group 1 value after checking if there was a match):
import re
s = "These.Final.Hours-AUSVERSION.2013-TEST-TESTAGAIN- YIFY.cp(tt123456).MiLLENiUM.mp4"
print(re.search(r'^(?:.*?-\s*(\w+))', s).group(1))     # => AUSVERSION
print(re.search(r'^(?:.*?-\s*(\w+)){2}', s).group(1))  # => TEST
print(re.search(r'^(?:.*?-\s*(\w+)){3}', s).group(1))  # => TESTAGAIN
print(re.search(r'^(?:.*?-\s*(\w+)){4}', s).group(1))  # => YIFY
Explanation of the pattern:
^ - start of string
(?:.*?-\s*(\w+)){2} - a non-capturing group that matches (here) 2 sequences of:
.*? - 0+ any characters other than a newline (since no re.DOTALL modifier is used) up to the first...
- - hyphen
\s* - 0 or more whitespaces
(\w+) - Group 1 capturing 1+ word characters (letters, digits or underscores).
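Following the advice to check for a match before reading Group 1, the pattern can be wrapped in a function that takes the index as a parameter. This is a minimal sketch; nth_after_hyphen is a hypothetical name, and n is assumed to be 1-based as in the explanation above.

```python
import re


def nth_after_hyphen(text, n):
    """Hypothetical helper: return the n-th (1-based) word after a hyphen, or None."""
    if n < 1:
        raise ValueError("n must be 1 or greater")
    # Build the {n} quantifier from the requested index.
    match = re.search(r'^(?:.*?-\s*(\w+)){%d}' % n, text)
    # A repeated group keeps only its last repetition, which is the n-th word.
    return match.group(1) if match else None


s = 'These.Final.Hours-AUSVERSION.2013-TEST-TESTAGAIN-YIFY.cp(tt123456).MiLLENiUM.mp4'
print(nth_after_hyphen(s, 1))
print(nth_after_hyphen(s, 3))
print(nth_after_hyphen(s, 5))  # there are only four hyphens, so no match
```

For a single string, indexing the re.findall result is simpler; the quantified form is mainly useful when only one regex call is available.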
