Regex extract group inside optional group

I have strings of the form "identifier STEP=10" where the "STEP=10" part is optional. The goal is to detect lines both with and without the STEP part and to extract the numerical value of STEP in cases where it is part of the line. Matching both cases is easy enough:
import re
pattern = ".*(STEP=[0-9]+)?"
re.match(pattern, "identifier STEP=10")
re.match(pattern, "identifier")
This detects both cases without problem. But I fail to extract the numerical value in one go. I tried the following,
import re
pattern = ".*(STEP=([0-9]+))?"
group0 = re.search(pattern, "identifier STEP=10").groups()
group1 = re.search(pattern, "identifier").groups()
And while it still does detect the lines, I only get
group0 = (None, None)
group1 = (None, None)
While I hoped to get something like
group0 = (None, "10")
group1 = (None, None)
Is regex not suited to do this in one go, or am I simply using it wrong? I am curious whether there is a single regex call that returns what I want without doing a second pass after I have matched the line.

A possible solution will look like
import re
pattern = "^.*?(?:STEP=([0-9]+))?$"
group0 = re.search(pattern, "identifier STEP=10").groups()
group1 = re.search(pattern, "identifier").groups()
print(*group0)
print(*group1)
See the Python demo.
The ^.*?(?:STEP=([0-9]+))?$ regex matches
^ - start of string
.*? - zero or more chars other than line break chars as few as possible (i.e. the regex engine skips this pattern first and tries the subsequent patterns, and only comes back to use this when the subsequent patterns fail to match)
(?:STEP=([0-9]+))? - an optional non-capturing group: STEP= and then Group 1 capturing one or more ASCII digits
$ - end of string.
The .*(STEP=[0-9]+)? regex matches like this:
.* - grabs the whole line, from start to end
(STEP=[0-9]+)? - the group is quantified with ? (meaning one or zero occurrences of the quantified pattern), so the regex engine, with its index at the end of the line now, cannot match STEP= there and skips the group: it matches an empty string at the end of the string and the match is returned, with Group 1 not participating in the match (its value is None in Python).
To be able to resolve such issues you must understand backtracking in regex (for example, see this YT video of mine to learn more about it).
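To see the difference between the two patterns in practice, here is a minimal sketch that runs both against the sample lines from the question. The helper name extract_step is made up for illustration and is not part of the original answer.

```python
import re

# Greedy pattern from the question: .* swallows the whole line first,
# so the optional group ends up matching nothing.
GREEDY = re.compile(r".*(STEP=([0-9]+))?")
# Lazy, anchored pattern from the answer.
LAZY = re.compile(r"^.*?(?:STEP=([0-9]+))?$")


def extract_step(line):
    """Return STEP as an int, or None when the line has no STEP part.

    extract_step is a hypothetical helper name used only for this sketch.
    """
    m = LAZY.search(line)
    if m is None or m.group(1) is None:
        return None
    return int(m.group(1))


for line in ("identifier STEP=10", "identifier"):
    # Show the groups of the greedy pattern next to the lazy result.
    print(repr(line), GREEDY.search(line).groups(), extract_step(line))
```

The greedy pattern reports (None, None) for both lines, while the lazy one yields 10 for the first line and None for the second.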

Related

(Python) How to check a long string against several regex?

I want to ensure that a long string can match with several regex at once.
I have a long multi line string containing a list of files and some content of the file.
DIR1\FILE1.EXT1 CONTENT11
DIR1\FILE1.EXT1 CONTENT12
DIR1\FILE1.EXT1 CONTENT13
DIR1\FILE2.EXT1 CONTENT21
DIR2\FILE3.EXT2 CONTENT31
DIR3\FILE3.EXT2 CONTENT11
The list typically contains hundreds of thousands of lines, sometimes several millions.
I want to check that the list contains predefined couples file/content:
FILE1 CONTENT11
FILE1 CONTENT12
FILE3 CONTENT11
I know that I can check that the string contains all of these couples by matching the string against some regexes
"^\S*FILE1\S*\tCONTENT11$"
"^\S*FILE1\S*\tCONTENT12$"
"^\S*FILE3\S*\tCONTENT11$"
import re
def all_matching(s, rxs):
    res = True
    for rx in rxs:
        p = re.compile(rx, re.M)
        res = res and p.search(s)
    return res
input1 = """DIR1\\FILE1.EXT1\tCONTENT11
DIR1\\FILE1.EXT1\tCONTENT12
DIR1\\FILE1.EXT1\tCONTENT13
DIR1\\FILE2.EXT1\tCONTENT21
DIR2\\FILE3.EXT2\tCONTENT31
DIR3\\FILE3.EXT2\tCONTENT11"""
input2 = """DIR1\\FILE1.EXT1\tCONTENT11
DIR1\\FILE1.EXT1\tCONTENT12
DIR1\\FILE1.EXT1\tCONTENT13
DIR1\\FILE2.EXT1\tCONTENT21
DIR2\\FILE3.EXT2\tCONTENT31"""
rxs = [r"^\S*FILE1\S*\tCONTENT11$",r"^\S*FILE1\S*\tCONTENT12$",r"^\S*FILE3\S*\tCONTENT11$"]
if all_matching(input1, rxs):
    print("input1 matches all rxs")  # expected
else:
    print("input1 does not match all rxs")
if all_matching(input2, rxs):
    print("input2 matches all rxs")
else:
    print("input2 does not match all rxs")  # expected because input2 doesn't match rxs[2]
ideone is available here
However, as the input string is very long in my case, I'd rather avoid launching search many times...
I feel like it should be possible to change the all_matching function in that way.
Any help will be much appreciated!
EDIT
clarified the problem and provided sample code

You may build a single regex from the regex strings you have that will require all the regexes to find a match in the input string.
The resulting regex will look like
(?m)\A(?=(?:.*\n)*?\S*FILE1\S*\tCONTENT11$)(?=(?:.*\n)*?\S*FILE1\S*\tCONTENT12$)(?=(?:.*\n)*?\S*FILE3\S*\tCONTENT11$)
See the regex demo.
Basically, it will match:
(?m) - a re.M / re.MULTILINE embedded flag option
\A - start of string (not start of a line!), all the lookaheads below will be triggered one by one, checking the string from the start, until one of them fails
(?=(?:.*\n)*?\S*FILE1\S*\tCONTENT11$) - a positive lookahead that, immediately to the right of the current location, requires the presence of
(?:.*\n)*? - 0 or more whole lines (any 0+ chars other than line break chars followed by a newline), as few as possible: further lines are only consumed when the subsequent subpatterns do not match at the current line
\S* - 0+ non-whitespaces
FILE1 - a string
\S* - 0+ non-whitespaces
\tCONTENT11 - tab and CONTENT11 substring
$ - end of line (since (?m) allows $ to match end of lines)
(?=(?:.*\n)*?\S*FILE1\S*\tCONTENT12$) - a lookahead working similarly as the preceding one, requiring FILE1 and CONTENT12 substrings on the line
(?=(?:.*\n)*?\S*FILE3\S*\tCONTENT11$) - a lookahead working similarly as the preceding one, requiring FILE3 and CONTENT11 substrings on the line.
In Python, it will look like
rxs = [r"^\S*FILE1\S*\tCONTENT11$",r"^\S*FILE1\S*\tCONTENT12$",r"^\S*FILE3\S*\tCONTENT11$"]
pat = re.compile( r"(?m)\A(?=(?:.*\n)*?{})".format(r")(?=(?:.*\n)*?".join([rx[1:] for rx in rxs])) )
Then, the check method will look like
def all_matching(s, pat):
    return pat.search(s)
See full Python demo online.
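Putting the pieces above together, a self-contained sketch could look like the following. The function name build_all_of and the way the two inputs are assembled are assumptions made for this example. It also assumes that every regex describes one whole line and may start with ^.

```python
import re

RXS = [
    r"^\S*FILE1\S*\tCONTENT11$",
    r"^\S*FILE1\S*\tCONTENT12$",
    r"^\S*FILE3\S*\tCONTENT11$",
]


def build_all_of(rxs):
    """Combine line regexes into one pattern that requires all of them.

    build_all_of is a hypothetical name. A leading ^ is stripped from each
    regex because the line-skipping prefix already stops at a line start.
    """
    bodies = [rx[1:] if rx.startswith("^") else rx for rx in rxs]
    # One lookahead per regex, each free to skip whole lines first.
    lookaheads = "".join(r"(?=(?:.*\n)*?" + body + ")" for body in bodies)
    return re.compile(r"(?m)\A" + lookaheads)


lines = [
    "DIR1\\FILE1.EXT1\tCONTENT11",
    "DIR1\\FILE1.EXT1\tCONTENT12",
    "DIR1\\FILE1.EXT1\tCONTENT13",
    "DIR1\\FILE2.EXT1\tCONTENT21",
    "DIR2\\FILE3.EXT2\tCONTENT31",
    "DIR3\\FILE3.EXT2\tCONTENT11",
]
input1 = "\n".join(lines)        # contains all three couples
input2 = "\n".join(lines[:-1])   # lacks the FILE3/CONTENT11 couple

pat = build_all_of(RXS)
print(bool(pat.search(input1)), bool(pat.search(input2)))
```

This is still one search call per input, but each lookahead scans from the start of the string on its own, so it is worth timing against the original loop on realistic data before assuming it is faster.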

Match all identifiers in a string

Problem:
I am looking for a way to match certain identifiers in a given line
that starts with certain words. The ID consists of
characters, possibly followed by digits, followed by a dash then some
more digits. An ID should only be matched on lines where the
starting word is one of the following: Closes, Fixes, Resolves. If a
line contains more than one IDs, those will be separated by
the string and. Any number of IDs can be present on a
line.
Example Test String:
Closes PD-1 # Match: PD-1
Related to PD-2 # No match, line doesn't start with an allowed word
Closes
NPD-1 # No match, as the identifier is in a new line
Fixes PD-21 and PD-22 # Match: PD-21, PD-22
Closes PD-31, also PD-32 and PD-33 # Match: PD-31 - the rest is not captured because of ", also"
Resolves PD4-41 and PD4-42 and PD4-43 and PD4-44 # Match: PD4-41, PD4-42, PD4-43, PD4-44
Resolves something related to N-2 # No match, the identifier is not directly after 'Resolves'
What I tried:
Using regular expressions to get all the matches, I always come up short in some regard. For example, one of the regexps I tried is this:
^(?:Closes|Fixes|Resolves) (\w+-\d+)(?:(?: and )(\w+-\d+))*
I intended to have a non-capturing group where the line needs to
start with one of the allowed words, followed by a single space:
^(?:Closes|Fixes|Resolves)
Then at least one ID needs to follow the starting word,
which I intend to capture: (\w+-\d+)
Finally, zero or more ID can follow the first one, which are
separated by the string and, but I only want to capture the
IDs here, not the separator: (?:(?: and )(\w+-\d+))*
Result of this regexp in python:
import re
test_string = """
Closes PD-1 # Match: PD-1
Related to PD-2 # No match, line doesn't start with an allowed word
Closes
NPD-1 # No match, as the identifier is in a new line
Fixes PD-21 and PD-22 # Match: PD-21, PD-22
Closes PD-31, also PD-32 and PD-33 # Match: PD-31 - the rest is not captured because of ", also"
Resolves PD4-41 and PD4-42 and PD4-43 and PD4-44 # Match: PD4-41, PD4-42, PD4-43, PD4-44
Resolves something related to N-2 # No match, the identifier is not directly after 'Resolves'
"""
ids = []
for match in re.findall(r"^(?:Closes|Fixes|Resolves) (\w+-\d+)(?:(?: and )(\w+-\d+))*", test_string, re.M):
    for group in match:
        if group:
            ids.append(group)
print(ids)
['PD-1', 'PD-21', 'PD-22', 'PD-31', 'PD4-41', 'PD4-44']
Also, here is the result on regex101.com. If more than one ID follows the initial one, unfortunately it only captures the last match, not all of them. I read that a repeated capturing group will only capture the last iteration, and I should put a capturing group around the repeated group to capture all iterations, but I couldn't make it work.
Summary:
Is there a solution for this with regular expressions, something similar to what I tried but which captures all the occurrences of the IDs? Or is there a better way to parse this string for the IDs, using Python?

You could use a single capturing group: inside it, match the first ID and then repeat the same ID pattern 0 or more times, each repetition preceded by the separator " and " (the word and with a space on either side).
The values are in group 1.
To get the separate values, split group 1 on " and ".
^(?:Closes|Fixes|Resolves) (\w+-\d+(?: and \w+-\d+)*)
Regex demo
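As a rough sketch of how that pattern could be used from Python (the function name extract_ids and the shortened sample text are assumptions for illustration only):

```python
import re

# Group 1 holds the whole "ID and ID and ..." run after the keyword.
PATTERN = re.compile(
    r"^(?:Closes|Fixes|Resolves) (\w+-\d+(?: and \w+-\d+)*)", re.M
)


def extract_ids(text):
    """Collect every ID by splitting group 1 of each match on ' and '.

    extract_ids is a hypothetical helper name.
    """
    ids = []
    for m in PATTERN.finditer(text):
        ids.extend(m.group(1).split(" and "))
    return ids


sample = "\n".join([
    "Closes PD-1",
    "Related to PD-2",
    "Closes",
    "NPD-1",
    "Fixes PD-21 and PD-22",
    "Closes PD-31, also PD-32 and PD-33",
    "Resolves PD4-41 and PD4-42 and PD4-43 and PD4-44",
    "Resolves something related to N-2",
])
print(extract_ids(sample))
```

Splitting on the literal separator is safe here because the ID pattern \w+-\d+ cannot contain spaces.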

It might be easier with the two-stage approach, such as:
import re

def get_matches(test):  # assume test is a list of strings
    regex1 = re.compile(r'^(?:Closes|Fixes|Resolves) \w+-\d+')
    regex2 = re.compile(r'\w+-\d+')
    results = []
    for line in test:
        if regex1.search(line):
            results.extend(regex2.findall(line))
    return results
gives:
['PD-1','PD-21','PD-22','PD-31','PD-32',
'PD-33','PD4-41','PD4-42','PD4-43','PD4-44']

If you need to work with repeated capturing groups, you can install the PyPI regex module with pip install regex and use
import regex
test_string = "your string here"
ids = []
for match in regex.finditer(r"^(?:Closes|Fixes|Resolves) (?P<id>\w+-\d+)(?:(?: and )(?P<id>\w+-\d+))*", test_string, regex.M):
    ids.extend(match.captures("id"))
print(ids)
# => ['PD-1', 'PD-21', 'PD-22', 'PD-31', 'PD4-41', 'PD4-42', 'PD4-43', 'PD4-44']
See the Python demo
The capture stack for each group is accessible via match.captures(X).
The regex you have is fine to use as is, but it is more user-friendly with a named capturing group here.

Python regex to match after the text and the dot [duplicate]

I am using Python and would like to match all the words after test till a period (full-stop) or space is encountered.
text = "test : match this."
At the moment, I am using :
import re
re.match('(?<=test :).*',text)
The above code doesn't match anything. I need match this as my output.

Everything after test, including test
test.*
Everything after test, without test
(?<=test).*
Example here on regexr.com

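You need to use re.search, since re.match only tries to match at the beginning of the string. To match until a space or period is encountered (note the space after the colon inside the lookbehind; without it the match would start at that space and come back empty):
re.search(r'(?<=test : )[^.\s]*',text)
To match all the chars until a period is encountered,
re.search(r'(?<=test : )[^.]*',text)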
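The difference between re.match and re.search with a lookbehind can be checked with a short sketch like the one below. The \s* and the capturing group are additions made for this illustration, so that the blank after the colon is skipped regardless of how many spaces there are.

```python
import re

text = "test : match this."

# re.match only tries index 0, where the lookbehind cannot succeed.
no_match = re.match(r"(?<=test :).*", text)
print(no_match)

# re.search scans forward; \s* skips the blank after the colon and
# group 1 collects everything up to (not including) the first period.
m = re.search(r"(?<=test :)\s*([^.]*)", text)
if m:
    print(m.group(1))
```

The first call prints None, the second prints match this.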

In a general case, as the title mentions, you may capture with (.*) pattern any 0 or more chars other than newline after any pattern(s) you want:
import re
p = re.compile(r'test\s*:\s*(.*)')
s = "test : match this."
m = p.search(s) # Run a regex search anywhere inside a string
if m: # If there is a match
    print(m.group(1)) # Print Group 1 value
If you want . to match across multiple lines, compile the regex with re.DOTALL or re.S flag (or add (?s) before the pattern):
p = re.compile(r'test\s*:\s*(.*)', re.DOTALL)
p = re.compile(r'(?s)test\s*:\s*(.*)')
However, it will return match this. (with the trailing period included). See also a regex demo.
You can add \. pattern after (.*) to make the regex engine stop before the last . on that line:
test\s*:\s*(.*)\.
Watch out for re.match() since it will only look for a match at the beginning of the string (Avinash already pointed that out, but it is a very important note!).
See the regex demo and a sample Python code snippet:
import re
p = re.compile(r'test\s*:\s*(.*)\.')
s = "test : match this."
m = p.search(s) # Run a regex search anywhere inside a string
if m: # If there is a match
    print(m.group(1)) # Print Group 1 value
If you want to make sure test is matched as a whole word, add \b before it (do not remove the r prefix from the string literal, or '\b' will match a BACKSPACE char!) - r'\btest\s*:\s*(.*)\.'.

I don't see why you want to use regex if you're just getting a subset from a string.
This works the same way:
if line.startswith('test:'):
    print(line[5:line.find('.')])
example:
>>> line = "test: match this."
>>> print(line[5:line.find('.')])
match this
Regex is slow, awkward to design, and difficult to debug. There are definitely occasions to use it, but if you just want to extract the text between test: and ., then I don't think this is one of those occasions.
See: https://softwareengineering.stackexchange.com/questions/113237/when-you-should-not-use-regular-expressions
For more flexibility (for example if you are looping through a list of strings you want to find at the beginning of a string and then index out) replace 5 (the length of 'test:') in the index with len(str_you_looked_for).
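A possible generalisation of the string-method approach is sketched below. The function name text_after_prefix and its stop parameter are hypothetical. It uses str.partition instead of str.find because find returns -1 when no period is present, which would silently cut off the last character of the slice.

```python
def text_after_prefix(line, prefix, stop="."):
    """Return the text between prefix and the first stop string.

    Hypothetical helper: returns None when line does not start with
    prefix, and everything after the prefix when stop never occurs.
    """
    if not line.startswith(prefix):
        return None
    rest = line[len(prefix):]
    # partition returns the whole string as head when stop is absent.
    head, _sep, _tail = rest.partition(stop)
    return head.strip()


for candidate in ("test: match this.", "test: no period here", "other: x."):
    print(repr(candidate), "->", repr(text_after_prefix(candidate, "test:")))
```

The strip() call is a design choice for this sketch: it drops the blank that follows the colon, which the plain slice would keep.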

Regex pattern to match substring

Would like to find the following pattern in a string:
word-word-word++ or -word-word-word++
So that it iterates the -word or word- pattern until the end of the substring.
the string is quite large and contains many words with those^ patterns.
The following has been tried:
p = re.compile('(?:\w+\-)*\w+\s+=', re.IGNORECASE)
result = p.match(data)
but it returns NONE. Does anyone know the answer?

Your regex will only match the first pattern, match() will only find one occurrence, and that only if it is immediately followed by some whitespace and an equals sign.
Also, in your example you implied you wanted three or more words, so here's a version that was changed in the following ways:
match both patterns (note the leading -?)
match only if there are at least three words to the pattern ({2,} instead of +)
match even if there's nothing after the pattern (the \b matches a word boundary. It is not really necessary here, since the preceding \w+ guarantees we are at a word boundary anyway)
returns all matches instead of only the first one.
Here's the code:
#!/usr/bin/python
import re
data=r"foo-bar-baz not-this -this-neither nope double-dash--so-nope -yeah-this-even-at-end-of-string"
p = re.compile(r'-?(?:\w+-){2,}\w+\b', re.IGNORECASE)
print(p.findall(data))
# prints ['foo-bar-baz', '-yeah-this-even-at-end-of-string']

Regex: skip the first match of a character in group?

From this string
s = 'stringalading-0.26.0-1'
I'd like to extract the part 0.26.0-1. I can think of various ways to achieve this, using split or a regular expression using a pattern like this
pattern = r'\d+\.\d+\.\d+\-\d+'
I also tried to use a group of characters, like so:
pattern = r'[.\-\d]+'
This gives me:
In [30]: re.findall(pattern, s)
Out[30]: ['-0.26.0-1']
So I wondered: is it possible to skip the first occurrence of a character in a group, in this case the first occurrence of -?

is it possible to skip the first occurrence of a character in a group, in this case the first occurrence of -?
NO, because when matching, the regex engine processes the string from left to right, and once the matching pattern is found, the matched chunk of text is written to the match buffer. Thus, either write a regex that only matches what you need, or post-process the found result by stripping unwanted characters from the left.
I think you do not need a regex here. You can split the string with - and pass the maxsplit argument set to 1, then just access the second item:
s = 'stringalading-0.26.0-1'
print(s.split("-", 1)[1]) # => '0.26.0-1'
See the Python demo
Also, your first regex works well:
import re
s = 'stringalading-0.26.0-1'
pat = r'\d+\.\d+\.\d+-\d+'
print(re.findall(pat, s)) # => ['0.26.0-1']
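The post-processing option mentioned above (stripping unwanted characters from the left of the match) could be sketched like this, together with a variant of the character-class pattern that simply requires the match to start with a digit. Both are illustrations, not part of the original answer.

```python
import re

s = 'stringalading-0.26.0-1'

# Option 1: keep the character-class pattern, then strip leading dashes.
raw = re.findall(r'[.\-\d]+', s)
cleaned = [chunk.lstrip('-') for chunk in raw]
print(raw, cleaned)

# Option 2: require a digit first, so the leading dash is never matched.
digit_first = re.findall(r'\d[.\-\d]*', s)
print(digit_first)
```

Option 2 follows the first suggestion in the answer: write the regex so that it only matches what you need.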

Do:
-(.*)
and get captured group 1.
Example:
In [9]: s = 'stringalading-0.26.0-1'
In [10]: re.search(r'-(.*)', s).group(1)
Out[10]: '0.26.0-1'
