regex storing matches in wrong capture group

I am trying to build a Python regex with optional capture groups. My regex works for most cases but fails to put the matches in the right groups in one of the test cases.
I want to match and capture the following cases:
namespace::tool_name::1.0.1
namespace::tool_name
tool_name::1.0.1
tool_name
Here is the regex I have so far:
(?:(?P<namespace>^[^:]+)::)?(?P<name>[^:]*)(?:::(?P<version>[0-9\.]+))?
This regex matches all 4 of my test cases, but in case 3 the tool_name is captured in the namespace group and 1.0.1 is captured in the name group. I would like them to be captured in the right groups, name and version respectively.
Thanks

You may make tool_name regex part obligatory by replacing * with + (it looks like it always is present) and restrict this pattern from matching three dot-separated digit chunks with a negative lookahead:
^(?:(?P<namespace>[^:]+)::)?(?!\d+(?:\.\d+){2})(?P<name>[^:]+)(?:::(?P<version>\d+(?:\.\d+){2}))?
See the regex demo
Details
^ - start of string
(?:(?P<namespace>[^:]+)::)? - an optional non-capturing group matching any 1+ chars other than : into Group "namespace" and then just matches ::
(?!\d+(?:\.\d+){2}) - a negative lookahead that does not allow digits.digits.digits pattern to appear right after the current position
(?P<name>[^:]+) - Group "name": any 1 or more chars other than :
(?:::(?P<version>\d+(?:\.\d+){2}))? - an optional non-capturing group matching :: and then Group "version" captures 1+ digits and 2 repetitions of . and 1+ digits.
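As a quick check, the pattern above can be run against the four inputs from the question. This is only a sketch: the helper name parse_tool and the constant TOOL_RE are made up for illustration and are not part of the answer.

```python
import re

# Pattern copied from the answer above, split over several lines for readability.
TOOL_RE = re.compile(
    r'^(?:(?P<namespace>[^:]+)::)?'
    r'(?!\d+(?:\.\d+){2})(?P<name>[^:]+)'
    r'(?:::(?P<version>\d+(?:\.\d+){2}))?'
)

def parse_tool(spec):
    """Return (namespace, name, version); parts that are absent are None."""
    m = TOOL_RE.match(spec)
    if m is None:
        return None
    return (m.group('namespace'), m.group('name'), m.group('version'))

for spec in ('namespace::tool_name::1.0.1', 'namespace::tool_name',
             'tool_name::1.0.1', 'tool_name'):
    print(spec, '->', parse_tool(spec))
```

The third input is the one that went wrong with the original regex: the lookahead stops 1.0.1 from being taken as the name, so the engine backtracks and leaves the namespace group unset.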

Related

How to create optional capture groups in Python regex

I have examined a previous question relating to optional capture groups in Python, but this has not been helpful. Attempting to follow, the code I have is below:
import re
c = re.compile(r'(?P<Prelude>.*?)'
               r'(?:Discussion:(?P<Discussion>.+?))?'
               r'(?:References:(?P<References>.*?))?',
               re.M|re.S)
test_text = r"""Prelude strings
Discussion: this is some
text.
References: My bad, I have none.
"""
test_text2 = r"""Prelude strings
Discussion: this is some
text.
"""
print(c.match(test_text).groups())
print(c.match(test_text2).groups())
Both print ('Prelude strings', None, None) instead of capturing the two groups. I am unable to determine why.
The expected result is ('Prelude strings', ' this is some\ntext.', ' My bad, I have none.') for the first, and the second the same but with None as the third capture group. It should also be possible to delete the Discussion lines and still capture References.

You can use
c = re.compile(r'^(?P<Prelude>.*?)'
               r'(?:Discussion:\s*(?P<Discussion>.*?)\s*)?'
               r'(?:References:\s*(?P<References>.*?))?$',
               re.S)
One-line regex pattern as a string:
(?s)^(?P<Prelude>.*?)(?:Discussion:\s*(?P<Discussion>.*?)\s*)?(?:References:\s*(?P<References>.*?))?$
See the regex demo.
Details:
(?s) - same as re.S, makes . match line break chars
^ - start of the whole string (note that it no longer matches start of any line, since I removed the re.M flag)
(?P<Prelude>.*?) - Group "Prelude": any zero or more chars as few as possible
(?:Discussion:\s*(?P<Discussion>.*?)\s*)? - an optional non-capturing group matching one or zero occurrences of the following sequence:
Discussion: - a fixed string
\s* - zero or more whitespaces
(?P<Discussion>.*?) - Group "Discussion": zero or more chars as few as possible
\s* - zero or more whitespaces
(?:References:\s*(?P<References>.*?))? - an optional non-capturing group matching one or zero occurrences of the following sequence:
References: - a fixed string
\s* - zero or more whitespaces
(?P<References>.*?) - Group "References": any zero or more chars as few as possible
$ - end of the string.
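A minimal sketch of how the anchored pattern behaves on the texts from the question, plus a third text without a Discussion section. The names SECTIONS_RE, sections, full, no_refs and no_discussion are illustrative only. Note that the Prelude group keeps the newline that precedes the next section, so the example strips it before printing.

```python
import re

# Pattern copied from the answer above.
SECTIONS_RE = re.compile(
    r'^(?P<Prelude>.*?)'
    r'(?:Discussion:\s*(?P<Discussion>.*?)\s*)?'
    r'(?:References:\s*(?P<References>.*?))?$',
    re.S,
)

full = "Prelude strings\nDiscussion: this is some\ntext.\nReferences: My bad, I have none.\n"
no_refs = "Prelude strings\nDiscussion: this is some\ntext.\n"
no_discussion = "Prelude strings\nReferences: My bad, I have none.\n"

def sections(text):
    """Return the named groups as a dict, or None if the text does not match."""
    m = SECTIONS_RE.match(text)
    return m.groupdict() if m else None

for text in (full, no_refs, no_discussion):
    found = sections(text)
    print(found['Prelude'].strip(), '|', found['Discussion'], '|', found['References'])
```

The trailing $ is what forces the lazy groups to expand; without it every part of the pattern can match the empty string at position 0.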

Python: find a string between 2 strings in text

I have a text like this
s = """
...
(1) Literature
1. a.
2. b.
3. c.
...
"""
I want to cut out the Literature section, but I have a problem detecting it.
Here I use
re.search("(1) Literature\n\n(.*).\n\n", s).group(1)
but search returns None.
The desired output is
(1) Literature
1. a.
2. b.
3. c.
What did I do wrong?

You could match (1) Literature and 2 newlines, and then capture all lines that start with digits followed by a dot.
\(1\) Literature\n\n((?:\d+\..*(?:\n|$))+)
The pattern matches:
\(1\) Literature\n\n Match (1) Literature and 2 newlines
( Capture group 1
(?: Non capture group
\d+\..*(?:\n|$) Match 1+ digits and a dot, then the rest of the line, followed by either a newline or end of string
)+ Close non capture group and repeat it 1 or more times to match all the lines
) Close group 1
Regex demo
Another option is to capture all following lines that do not start with ( digits ) using a negative lookahead, and then trim the leading and trailing whitespaces.
\(1\) Literature((?:\n(?!\(\d+\)).*)*)
Regex demo
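Both options can be tried side by side. The sample string below is an assumption: it puts a blank line after the heading (as the \n\n in the question's own pattern suggests) and adds a hypothetical "(2) Other" heading so the second pattern has something to stop at.

```python
import re

# Assumed sample text; blank lines and the "(2) Other" heading are made up.
s = "...\n(1) Literature\n\n1. a.\n2. b.\n3. c.\n\n(2) Other\n..."

# Option 1: capture only the numbered lines that follow the heading.
numbered = re.search(r'\(1\) Literature\n\n((?:\d+\..*(?:\n|$))+)', s)

# Option 2: capture every line up to the next "(digits)" heading, then trim.
until_next = re.search(r'\(1\) Literature((?:\n(?!\(\d+\)).*)*)', s)

items = numbered.group(1).strip()
block = until_next.group(1).strip()
print(items)
print(block)
```

On this input both variables hold the same three lines; they differ when the section contains lines that do not start with a number.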

Parentheses have a special meaning in regex. They are used to group matches.
(1) - Capture 1 as the first capturing group.
Since the string contains literal parentheses, the unescaped pattern does not match. Also, .* stops at the end of a line, because . does not match a newline by default.
Check Demo
Based on your regex, I assumed you wanted to capture the line with the word Literature, 5 lines below it. Here is a regex to do so.
\(1\) Literature(.*\n){5}
Regex Demo
Note the escape characters used on the parentheses around 1.
EDIT
Based on zr0gravity7's comment, I came up with this regex to capture the middle section on the string.
\(1\)\sLiterature\n+((.*\n){3})
This regex will capture the below string in capturing group 1.
1. a.
2. b.
3. c.
Regex Demo

You may use this regex with a capture group:
r'\(1\)\s+Literature\s+((?:.+\n)+)'
RegEx Demo
Explanation:
\(1\): Match (1) text
\s+: Match 1+ whitespaces
Literature: Match the literal text Literature
\s+: Match 1+ whitespaces
(: Start capture group #1
(?:.+\n)+: Match a line with 1+ character followed by newline. Repeat this 1 or more times to allow it to match multiple such lines
): End capture group #1

Regex for capturing the generic question with that structure:
\(\d+\)\s+(\w+)\s+((?:\d+\.\s.+\n)+)
It will capture the title "Literature", then the choices in another group (for a total of 2 groups).
A repeated capture group only keeps its last repetition, so in order to get each of your "1. a." items separately you would have to match the second group from above again, with this pattern:
(\d+\.\s+.+) then globally match to get all items.
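The two-step idea can be sketched as follows: one search for the title and the block of items, then a global match over that block. The sample string is assumed, and the second step uses re.findall with the item pattern without a capture group so that each whole item is returned.

```python
import re

# Assumed sample text with a blank line after the heading.
s = "(1) Literature\n\n1. a.\n2. b.\n3. c.\n"

# Step 1: group 1 is the title, group 2 is the whole block of numbered lines.
section = re.search(r'\(\d+\)\s+(\w+)\s+((?:\d+\.\s.+\n)+)', s)
title, body = section.groups()

# Step 2: match globally inside the block to get one entry per item.
items = re.findall(r'\d+\.\s+.+', body)

print(title)
print(items)
```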

Regex - How do i find this specific slice of string inside a bigger whole string

Following my previous question (How do i find multiple occurences of this specific string and split them into a list?), I'm now asking a follow-up since the rules have changed.
Here's the string, and the bold words are the ones that I want to extract.
text|p1_1_1120170AS074192161A0Z20|C M E -
Rectifier|#|text|p1_2_1120170AS074192161A0Z20|Huawei|#|text|p1_3_1120170AS074192161A0Z20|Rectifier
Module 3KW|#|text|p1_4_1120170AS074192161A0Z20|Shuangdeng
6-FMX-170|#|text|p1_5_1120170AS074192161A0Z20|24021665|#|text|p1_6_1120170AS074192161A0Z20|1120170AS074192161A0Z20|#|text|p1_7_1120170AS074192161A0Z20|OK|#|text|p1_8_1120170AS074192161A0Z20||#|text|p1_9_1120170AS074192161A0Z20|ACTIVE|#|text|p1_10_1120170AS074192161A0Z20|-OK|#|text|site_id|20MJK110|#|text|barcode_flag|auto|#|text|movement_flag||#|text|unit_of_measurement||#|text|flag_waste|no|#|text|req_qty_db|2|#|text|req_qty|2
Here's my current regex:
(?<=p1\_1\_.*)[^|]+(?=\|\#\|.*|$)
After trying it out at https://regexr.com/, I got this result instead:
text|p1_1_1120170AS074192161A0Z20|C M E -
Rectifier|#|text|p1_2_1120170AS074192161A0Z20|Huawei|#|text|p1_3_1120170AS074192161A0Z20|Rectifier
Module 3KW|#|text|p1_4_1120170AS074192161A0Z20|Shuangdeng
6-FMX-170|#|text|p1_5_1120170AS074192161A0Z20|24021665|#|text|p1_6_1120170AS074192161A0Z20|1120170AS074192161A0Z20|#|text|p1_7_1120170AS074192161A0Z20|OK|#|text|p1_8_1120170AS074192161A0Z20||#|text|p1_9_1120170AS074192161A0Z20|ACTIVE|#|text|p1_10_1120170AS074192161A0Z20|-OK|#|text|site_id|20MJK110|#|text|barcode_flag|auto|#|text|movement_flag||#|text|unit_of_measurement||#|text|flag_waste|no|#|text|req_qty_db|2|#|text|req_qty|2
The question remains: "Why doesn't it just return the first matched occurrence?"
Consider the case where the value in the first "bar section" is empty: the regex then returns the value of the next bar section.
Example:
text|p1_1_1120170AS074192161A0Z20||#|text|p1_2_1120170AS074192161A0Z20|Huawei|#|text . . .
I don't want that. In that case it should return nothing (no match).
What's the correct regex to acquire such a match?
Thank you :).

This data looks more structured than you are giving it credit for. A regular expression is great for e.g. extracting email addresses from unstructured text, but this data seems delimited in a straightforward manner.
If there is structure it will be simpler, faster, and more reliable to just split on | and perhaps #:
text = 'text|p1_1_1120170AS074192161A0Z20|C M E - Rectifier|#|text|p1_2_1120170AS074192161A0Z20|Huawei|#|text|p1_3_1120170AS074192161A0Z20|Rectifier Module 3KW|#|text|p1_4_11201...'
lines = text.split('|#|')
words = [line.split('|')[-1] for line in lines]
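Building on the split idea, here is a rough sketch that returns only the value of the p1_1_ field and gives None when that value is empty. It assumes every record has the shape text|key|value; the function name first_p1_1_value and the shortened sample strings are made up for illustration.

```python
def first_p1_1_value(text):
    """Return the value of the first p1_1_ field, or None if it is empty or absent."""
    for record in text.split('|#|'):
        parts = record.split('|')
        # Assumption: each record looks like text|<key>|<value>.
        if len(parts) == 3 and parts[1].startswith('p1_1_'):
            return parts[2] or None
    return None

ID = '1120170AS074192161A0Z20'
filled = f'text|p1_1_{ID}|C M E - Rectifier|#|text|p1_2_{ID}|Huawei|#|text|site_id|20MJK110'
empty = f'text|p1_1_{ID}||#|text|p1_2_{ID}|Huawei|#|text|site_id|20MJK110'

print(first_p1_1_value(filled))
print(first_p1_1_value(empty))
```

Because each record is inspected on its own, an empty first value can never pick up the value of the next record.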

import re
doc = 'text|p1_1_1120170AS074192161A0Z20|C M E - Rectifier|#|text|p1_2_1120170AS074192161A0Z20|Huawei|#|text|...'
re.findall(r'[^|]+(?=\|\#\|)', doc)
In the re expression:
[^|]+ finds chunks of text not containing the separator
(?=...) is a "lookahead assertion" (the text must follow the match but is not included in the result)

About the pattern you tried
This part of the pattern [^|]+ states to match any char other than |
Then (?=\|\#\|.*|$) asserts using a positive lookahead what is on the right is |#|.* or the end of the string.
The positive lookbehind (?<=p1\_1\_.*) asserts what is on the left is p1_1_ followed by any char except a newline using a quantifier in the lookbehind.
As the pattern is not anchored, you will get all the matches for this logic, because the p1_1_ assertion is true for every value that precedes a |#| part.
Note that using the quantifier in the lookbehind will require the pypi regex module.
If you want the first match using a quantifier in the positive lookbehind you could for example use an anchor in combination with a negative lookahead to not cross the |#| or match || in case it is empty:
(?<=^.*?p1_1_(?:(?!\|#\|).|\|{2})*\|)[^|]+(?=\|\#\||$)
Python demo
You could use your original pattern using re.search getting the first match.
(?<=p1_1_.*)[^|]+(?=\|\#\||$)
Note that you don't have to escape the underscore in your original pattern and you can omit .* from the positive lookahead
Python demo
But to get the first match you don't have to use a positive lookbehind. You could also use an anchor, match and capturing group.
^.*?p1_1_(?:(?!\|#\|).|\|{2})*\|([^|]+)(?:\|#\||$)
^ Start of string
.*? Match any char except a newline, as few times as possible
p1_1_ Match literally
(?: Non capturing group
(?!\|#\|).|\|{2} If what is on the right is not |#| match any char, or match 2 times ||
)* Close non capturing group and repeat 0+ times
\| Match |
( Capture group 1 (this will contain your value)
[^|]+ Match 1+ times any char except |
) Close group
(?:\|#\||$) Match either |#| or assert the end of the string
Regex demo
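A small sketch of the capture-group variant with the standard re module, which needs no lookbehind. The sample is a shortened copy of the data from the question with a non-empty first value; the behaviour for an empty first value is not exercised here.

```python
import re

# Pattern copied from the answer above.
FIRST_VALUE_RE = re.compile(r'^.*?p1_1_(?:(?!\|#\|).|\|{2})*\|([^|]+)(?:\|#\||$)')

# Shortened sample; the first value spans two lines, as in the question.
sample = ('text|p1_1_1120170AS074192161A0Z20|C M E -\nRectifier'
          '|#|text|p1_2_1120170AS074192161A0Z20|Huawei'
          '|#|text|site_id|20MJK110')

m = FIRST_VALUE_RE.search(sample)
value = m.group(1) if m else None
print(repr(value))
```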

Find something between parentheses

I got a string like that:
LD(_030S.F.IN)OR(_080T_SAF_OUT)COIL(xxSF[4].Flt[120].0)
I want to match only OR(_080T_SAF_OUT)COIL(xxSF[4].Flt[120].0), but the OR could be LD as well. _080T_SAF_OUT can vary; it is always alphanumeric, sometimes with underscores. COIL(xxSF[4].Flt[120].0) must always be in the format COIL(xxSF["digits"].Flt["digits"]."digits")
I am trying to use the re library of Python 2.7.
m = re.search('\OR|\LD'+'\('+'.+'+'\)'+'+'\COIL+'\('+'\xxSF+'\['+'\d+'+'\].'+ Flt\['+'\d+'+'\]'+'\.'+'\d+', Text)
My Output:
OR(abc_TEST_X)LD(xxSF[16].Flt[0].22
OR
LD(TEST_X_dsfa)OR(WASS_READY)COIL(xxSF[16].Flt[11].10
The first one is the right one; I want to discard the second and the third ones.
I think that the problem is here:
'\('+'.+'+'\)'
Because I just want to find something alphanumeric, possibly with symbols, between the first pair of parentheses, and I am not filtering for that.

You should group alternations like (?:LD|OR), and to match any chars other than ( and ) you may use [^()]* rather than .+ (.+ matches any chars, as many as possible, hence it matches across parentheses).
Here is a Python demo:
import re
Text = 'LD(_030S.F.IN)OR(_080T_SAF_OUT)COIL(xxSF[4].Flt[120].0)'
m = re.search(r'(?:OR|LD)\([^()]*\)COIL\(xxSF\[\d+]\.Flt\[\d+]\.\d+', Text)
if m:
    print(m.group()) # => OR(_080T_SAF_OUT)COIL(xxSF[4].Flt[120].0
Pattern details
(?:OR|LD) - a non-capturing group matching OR or LD
\( - a ( char
[^()]* - a negated character class matching 0+ chars other than ( and )
\)COIL\(xxSF\[ - )COIL(xxSF[ substring
\d+ - 1+ digits
]\.Flt\[ - ].Flt[ substring
\d+]\.\d+ - 1+ digits, ]. substring and 1+ digits
See the regex demo.
TIP Add a \b before (?:OR|LD) to match them as whole words (not as part of NOR and NLD).
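To illustrate the tip, the sketch below compares the pattern with and without the leading \b. The NOR(...) input is a hypothetical string invented for this comparison.

```python
import re

TAIL = r'\([^()]*\)COIL\(xxSF\[\d+]\.Flt\[\d+]\.\d+'
plain = re.compile(r'(?:OR|LD)' + TAIL)
whole_word = re.compile(r'\b(?:OR|LD)' + TAIL)

text = 'LD(_030S.F.IN)OR(_080T_SAF_OUT)COIL(xxSF[4].Flt[120].0)'
nor_text = 'NOR(_080T_SAF_OUT)COIL(xxSF[4].Flt[120].0)'  # hypothetical input

for name, rx in (('plain', plain), ('whole_word', whole_word)):
    for sample in (text, nor_text):
        m = rx.search(sample)
        print(name, '->', m.group() if m else None)
```

With \b in place the OR inside NOR is skipped, while the OR that follows a closing parenthesis still matches.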

Thanks, I am capturing everything I want. There is just one more thing to filter. Take a look at some outputs:
OR(_1B21_A53021_2_En)OR(_1_A21_Z53021_2)COIL(xxSF[9].Flt[15].3);
LD(_1B21_A53021_2_En)LD(_1_A21_Z53021_2)COIL(xxSF[9].Flt[15].3);
I only want to capture the last "LD" or "OR", as follows:
OR(_1_A21_Z53021_2)COIL(xxSF[9].Flt[15].3);
LD(_1_A21_Z53021_2)COIL(xxSF[9].Flt[15].3);

Regular Expression to Match Text into Multiple Groups

I'm trying to set up a regular expression to match text and I'd like a particular string to match with a separate group from the rest of the text if it is present.
For instance, if my string is this is a test, I would like this is a to match the first group and test to match the second group. I am using the python regex library. Here are a few more examples of what result I would like
this is a test - group 1: this is a, group 2: test
one day at a time - group 1: one day at a time, group 2:
one day test is - group 1: one day, group 2: test
testing, 1,2,3 - no match
this is not a drill - group 1: this is not a drill, group 2:
In those cases, the particular string I'm matching in the second group is test. I'm not sure how to set up a regular expression to match these particular cases correctly.

You can try this mate
^(?:(?!test))(?:(.*)(?=\btest\b)(\btest\b)|(.*))
Explanation
^(?:(?!test)) - Negative lookahead. Don't match anything that starts with test.
(.*) - Matches anything except newline.
(?=\btest\b) - Positive lookahead. Matches test between word boundaries.
(\btest\b) - Capturing group matches test.
| - Alternation works same as logical OR.
(.*) - Matches anything except newline.
Demo
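A quick way to see what this pattern returns for the examples in the question. The wrapper split_on_test is a hypothetical helper; note that the pattern has three groups, so the "no test" case lands in group 3, and the text before test keeps its trailing space.

```python
import re

# Pattern copied from the answer above.
PATTERN = re.compile(r'^(?:(?!test))(?:(.*)(?=\btest\b)(\btest\b)|(.*))')

def split_on_test(text):
    """Return (before, keyword) or None when the text starts with 'test'."""
    m = PATTERN.match(text)
    if m is None:
        return None
    before, keyword, whole = m.groups()
    # Either groups 1 and 2 are set (keyword found) or only group 3 is.
    return (before, keyword) if keyword else (whole, None)

for text in ('this is a test', 'one day at a time', 'one day test is',
             'testing, 1,2,3', 'this is not a drill'):
    print(repr(text), '->', split_on_test(text))
```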

You can try the following regular expression:
^(this.*?)(test)?$
Explanation of the regular expression:
NODE                     EXPLANATION
--------------------------------------------------------------------------------
  ^                        the beginning of the string
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    this                     'this'
--------------------------------------------------------------------------------
    .*?                      any character except \n (0 or more times
                             (matching the least amount possible))
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  (                        group and capture to \2 (optional
                           (matching the most amount possible)):
--------------------------------------------------------------------------------
    test                     'test'
--------------------------------------------------------------------------------
  )?                       end of \2 (NOTE: because you are using a
                           quantifier on this capture, only the LAST
                           repetition of the captured pattern will be
                           stored in \2)
--------------------------------------------------------------------------------
  $                        before an optional \n, and the end of the
                           string
