Regex capture groups divided by a number in between


I need to capture 3 groups from a string. The string has the following form:
{phrase 1} {optional number} {optional phrase 2}
A few examples of this are:
Battery Bank 1
Battery Bank 1 Segments
Battery Bank 1 Warranty Logger
Battery Bank 10
Battery Bank 10 Segments
Battery Bank 10 Warranty Logger
BSU
BSU 1
PCS 3
PCS 1
System
System Meter
As you can see, the only mandatory group is the first one, which consists of word characters and spaces up to a number of at least 1 digit. Then, optionally, comes another group of word characters and spaces.
This is what I have so far, but it's not working properly. It matches across lines; it should produce one match per line.
([a-zA-Z\s]+)(\d+)?(\w)?
Here's a regex101 link to play with:
https://regex101.com/r/tSGIEm/2

You may use this regex with optional groups:
([a-zA-Z]+(?:[ \t]+[a-zA-Z]+)*)(?:[ \t]+(\d+)(?:[ \t]+(.+))?)?
Updated RegEx Demo
RegEx Details:
(: Start capture group #1
[a-zA-Z]+: Match a word of 1+ letters
(?:[ \t]+[a-zA-Z]+)*: Match 0 or more words separated by 1+ spaces/tabs
): End capture group #1
(?:: Start non-capture group #1
[ \t]+: Match 1+ spaces or tabs
(\d+): Match 1+ digits and capture in group #2
(?:: Start non-capture group #2
[ \t]+: Match 1+ spaces or tabs
(.+): Match 1+ of any characters and capture in group #3
)?: End optional non-capture group #2
)?: End optional non-capture group #1
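As a quick check, here is a minimal sketch that applies the pattern above to a few of the sample names. The helper name split_name is made up for illustration, and using fullmatch on one line at a time is an assumption about how the input is fed in.

```python
import re

# Pattern copied from the answer above.
PATTERN = re.compile(
    r'([a-zA-Z]+(?:[ \t]+[a-zA-Z]+)*)(?:[ \t]+(\d+)(?:[ \t]+(.+))?)?'
)

def split_name(line):
    """Return (phrase 1, number, phrase 2); parts that are absent are None.

    split_name is a hypothetical helper; fullmatch forces the whole line
    to be consumed, so a match can never spill over into another line.
    """
    m = PATTERN.fullmatch(line)
    return m.groups() if m else None

for line in ["Battery Bank 10 Warranty Logger", "BSU 1", "System Meter"]:
    print(split_name(line))
```

Group 2 and group 3 come back as None when the number or the trailing phrase is missing, so the caller can tell "absent" apart from "empty".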

You may use
^(.*?)(?: +(\d+) *(.*))?$
See the regex demo.
Details
^ - start of string
(.*?) - Group 1: any zero or more chars other than line break chars, as few as possible
(?: +(\d+) *(.*))? - an optional non-capturing group matching 1 or 0 occurrences of:
 + (a space followed by +) - 1+ spaces
(\d+) - Group 2: one or more digits
 * (a space followed by *) - 0+ spaces
(.*) - Group 3: any zero or more chars other than line break chars, as many as possible
$ - end of string.
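To get one match per line from a multi-line string, the anchors need re.M so that ^ and $ apply at every line. A minimal sketch, assuming the names arrive as one newline-separated string (the text value below is made up from the question's examples):

```python
import re

# A few of the sample names from the question, joined into one string.
text = "Battery Bank 1 Segments\nBSU\nPCS 3\nSystem Meter"

# re.M makes ^ and $ anchor at the start and end of every line.
pattern = re.compile(r'^(.*?)(?: +(\d+) *(.*))?$', re.M)

rows = [m.groups() for m in pattern.finditer(text)]
for row in rows:
    print(row)
```

Note that with this pattern group 3 is an empty string, not None, when a number is present but nothing follows it (as in PCS 3).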

Related

How to exclude two words from regex?

I have this regex:
\]\s*(AN|AV)\s*1\s*([\w\s]+)\s*2\s*([\w\s]+)\s*3\s*([\w\s][^cui]+)
That should match
] AN 1 words 2 words 3 words
or
] AV 1 words 2 words 3 words
The words after 3 should exclude "da cui", so "da\scui", but it doesn't work. Try it here: https://regex101.com/r/kI7Tan/1
What am I doing wrong?
Sample string:
campo] AN 1 campo 2 prato con penna B sps a 1 3 da cui campo con penna C as a 1 cfr Nota filologica
Expected output: it won't match it because of the "da cui". So basically I want to match all words without the string "da cui".

The final capture group of the regex ( ([\w\s][^cui]+) ) matches ...
Exactly 1 word character due to the first character class.
This class does not match a whitespace due to the preceding \s* in the regex.
Any number of characters other than c, u, i.
If you want to exclude matches contingent on the word(s) da cui, use a negative lookahead.
\]\s*(AN|AV)\s*1\s*([\w\s]+)\s*2\s*([\w\s]+)\s*3\s*(?!.*da cui)(.*)
See the demo (regex101).
Update
Capture group reintroduced to the regex.
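A small sketch of the negative-lookahead pattern in Python. The rejected string is the sample from the question; the accepted string is a made-up variant without "da cui", used only for illustration.

```python
import re

# Pattern copied from the answer above.
pattern = re.compile(
    r'\]\s*(AN|AV)\s*1\s*([\w\s]+)\s*2\s*([\w\s]+)\s*3\s*(?!.*da cui)(.*)'
)

# Sample string from the question: contains "da cui" after the 3.
rejected = ("campo] AN 1 campo 2 prato con penna B sps a 1 3 da cui campo "
            "con penna C as a 1 cfr Nota filologica")
# Hypothetical string without "da cui".
accepted = "campo] AN 1 campo 2 prato 3 campo con penna C"

no_match = pattern.search(rejected)
match = pattern.search(accepted)

print(no_match)         # the lookahead rejects the whole match
print(match.group(4))   # text after the 3
```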

You may use either of the two:
\]\s*(AN|AV)\s*1\s*([\w\s]+?)\s*2\s*([\w\s]+?)\s*3\s*((?:(?!cui).)*)
\]\s*(AN|AV)\s*1\s*([\w\s]+?)\s*2\s*([\w\s]+?)\s*3\s*(.*?)(?=cui|$)
See the regex demo
The (?:(?!cui).)* is a tempered greedy token that matches any char, 0 or more occurrences, as many as possible, that does not start a cui char sequence. The (.*?)(?=cui|$) pattern captures 0+ chars other than line break chars, as few as possible, up to the cui char sequence or end of string.
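For comparison, a sketch that runs both patterns on the question's sample string. Splitting the shared part into a PREFIX constant is only a convenience here; the concatenated patterns are identical to the two shown above. Unlike the lookahead-rejection approach, these patterns still match and simply stop capturing before cui.

```python
import re

# Shared part of both patterns (everything up to the final group).
PREFIX = r'\]\s*(AN|AV)\s*1\s*([\w\s]+?)\s*2\s*([\w\s]+?)\s*3\s*'

tempered = re.compile(PREFIX + r'((?:(?!cui).)*)')   # tempered greedy token
lazy = re.compile(PREFIX + r'(.*?)(?=cui|$)')        # lazy dot + lookahead

sample = ("campo] AN 1 campo 2 prato con penna B sps a 1 3 da cui campo "
          "con penna C as a 1 cfr Nota filologica")

results = [p.search(sample).group(4) for p in (tempered, lazy)]
print(results)  # both capture only the text before "cui"
```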

My interpretation of the question, as it concerns the string that follows one or more spaces after 3 (to the end of the line), is that if the string da cui is present in that string an empty string is to be saved to capture group 4, else that string is to be saved to capture group 4.
You could use the following regular expression.
\]\s*(AN|AV)\s+1\s+([\w\s]+)\s+2\s+([\w\s]+)\s+3\s+((?=.*\bda cui\b)|(?!.*\bda cui\b).*)
Demo
This replaces 3\s*([\w\s][^cui]+) in the OP's regex with 3\s+((?=.*\bda cui\b)|(?!.*\bda cui\b).*).
Python's regex engine performs the following steps after matching 3.
\s+ match 1+ spaces
( begin capture group 4
(?=.*\bda cui\b) positive lookahead: assert that 0+ chars followed by 'da cui' lie ahead
| or
(?!.*\bda cui\b) negative lookahead: assert that 0+ chars followed by 'da cui' do not lie ahead
.* match rest of line
) end capture group 4
If the positive lookahead succeeds an empty string is saved to the capture group.
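A minimal sketch of this approach in Python. The negative lookahead is spelled (?!...) here, which is the standard syntax; the second input string is invented for illustration.

```python
import re

pattern = re.compile(
    r'\]\s*(AN|AV)\s+1\s+([\w\s]+)\s+2\s+([\w\s]+)\s+3\s+'
    r'((?=.*\bda cui\b)|(?!.*\bda cui\b).*)'
)

# Sample string from the question: "da cui" is present.
with_da_cui = ("campo] AN 1 campo 2 prato con penna B sps a 1 3 da cui campo "
               "con penna C as a 1 cfr Nota filologica")
# Hypothetical string without "da cui".
without_da_cui = "x] AV 1 a 2 b 3 c d"

empty_capture = pattern.search(with_da_cui).group(4)
full_capture = pattern.search(without_da_cui).group(4)

print(repr(empty_capture))  # empty string: the positive lookahead branch won
print(repr(full_capture))   # rest of the line
```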

Greedy Python RegEx capturing group to include "and"

I need some help writing regex expressions. I need an expression that can match the following patterns (including words and digits, spaces and commas):
Line 145
Line3544354
Lines 10,12
Line items 45,10,26
Lines 10 and 45
Thus far, I wrote one expression which includes the first three patterns and all case variations:
r'(?i)(line item[\.*\,*\s*\d+]+]+|line[\.*\,*\s*\d+]+|lines[\.*\,*\s*\d+]+|line items[\.*\,*\s*\d+]+)'
I would like to include the last two patterns listed, but I'm not sure how. I wrote this expression for the pattern matching "Lines 10 and 45" by modifying the capturing group as follows:
r'(Lines[\.*\,*\w*\s*\d+]+)'
However, it does not work as expected: it selects all word characters in the string. I would like to keep my expressions greedy, but I'm not sure how to implement the last two patterns in the list.
Any suggestions please?

You may use
(?i)lines?(?:\s+items?)?\s*\d+(?:\.\d+)?(?:\s*(?:,|and)\s*\d+(?:\.\d+)?)*
See the regex demo.
Pattern details:
(?i) - ignore case inline modifier
lines? - line or lines (? quantifier makes the preceding pattern optional, matching 1 or 0 occurrences)
(?:\s+items?)? - an optional non-capturing group matching 1 or 0 occurrences of 1+ whitespaces followed with item and an optional s char
\s* - 0+ whitespaces
\d+(?:\.\d+)? - 1+ digits followed with an optional sequence of . and 1+ digits
(?:\s*(?:,|and)\s*\d+(?:\.\d+)?)* - 0 or more repetitions of
\s* - 0+ whitespaces
(?:,|and) - , or and char sequence
\s* - 0+ whitespaces
\d+(?:\.\d+)? - 1+ digits followed with an optional sequence of . and 1+ digits
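A short sketch that checks the pattern against the five examples from the question and then extracts references from a sentence. The sentence is made up for illustration.

```python
import re

# Pattern copied from the answer above.
PATTERN = re.compile(
    r'(?i)lines?(?:\s+items?)?\s*\d+(?:\.\d+)?(?:\s*(?:,|and)\s*\d+(?:\.\d+)?)*'
)

examples = [
    "Line 145",
    "Line3544354",
    "Lines 10,12",
    "Line items 45,10,26",
    "Lines 10 and 45",
]
# Every example from the question should be matched in full.
all_match = all(PATTERN.fullmatch(e) for e in examples)

# Hypothetical sentence; findall returns whole matches because the
# pattern contains no capturing groups.
sentence = "Refer to lines 10 and 45, then Line items 45,10,26."
found = PATTERN.findall(sentence)

print(all_match)
print(found)
```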

Regular Expression to Match Text into Multiple Groups

I'm trying to set up a regular expression to match text and I'd like a particular string to match with a separate group from the rest of the text if it is present.
For instance, if my string is this is a test, I would like this is a to match the first group and test to match the second group. I am using the Python regex library. Here are a few more examples of the results I would like:
this is a test - group 1: this is a, group 2: test
one day at a time - group 1: one day at a time, group 2:
one day test is - group 1: one day, group 2: test
testing, 1,2,3 - no match
this is not a drill - group 1: this is not a drill, group 2:
In those cases, the particular string I'm matching in the second group is test. I'm not sure how to set up a regular expression to match these particular cases correctly.

You can try this mate
^(?:(?!test))(?:(.*)(?=\btest\b)(\btest\b)|(.*))
Explanation
^(?:(?!test)) - Negative lookahead. Don't match anything that starts with test.
(.*) - Matches anything except newline.
(?=\btest\b) - Positive lookahead. Matches test between word boundaries.
(\btest\b) - Capturing group matches test.
| - Alternation works same as logical OR.
(.*) - Matches anything except newline.
Demo
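Because this pattern has three capturing groups (two in the first branch, one in the second), the caller has to pick the right ones. The sketch below wraps that in a hypothetical helper, split_on_test, and strips the trailing space that the first group keeps; both choices are assumptions, not part of the answer.

```python
import re

# Pattern copied from the answer above.
PATTERN = re.compile(r'^(?:(?!test))(?:(.*)(?=\btest\b)(\btest\b)|(.*))')

def split_on_test(s):
    """Return (group 1, group 2) in the form the question asks for.

    Hypothetical helper: returns None when the string starts with "test".
    """
    m = PATTERN.match(s)
    if m is None:
        return None
    if m.group(2) is not None:
        # First branch matched: text before "test", plus "test" itself.
        return m.group(1).rstrip(), m.group(2)
    # Second branch matched: whole string, empty second group.
    return m.group(3), ""

for s in ["this is a test", "one day at a time", "one day test is",
          "testing, 1,2,3", "this is not a drill"]:
    print(s, "->", split_on_test(s))
```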

You can try the following regular expression:
^(this.*?)(test)?$
Explanation of the regular expression:
^ - the beginning of the string
( - group and capture to \1:
this - 'this'
.*? - any character except \n (0 or more times, matching the least amount possible)
) - end of \1
( - group and capture to \2 (optional, matching the most amount possible):
test - 'test'
)? - end of \2 (NOTE: because you are using a quantifier on this capture, only the LAST repetition of the captured pattern will be stored in \2)
$ - before an optional \n, and the end of the string
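A brief sketch of how this pattern behaves on three of the question's examples. As written, it anchors on the literal word this, so it only applies to inputs that begin with it.

```python
import re

# Pattern copied from the answer above.
PATTERN = re.compile(r'^(this.*?)(test)?$')

def groups_or_none(s):
    """Return the captured groups, or None when the pattern does not match."""
    m = PATTERN.match(s)
    return m.groups() if m else None

print(groups_or_none("this is a test"))        # group 1 keeps its trailing space
print(groups_or_none("this is not a drill"))   # group 2 is None
print(groups_or_none("one day at a time"))     # does not start with "this"
```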

select group based on same value in regular Expression

I have the following content:
ONE
1234234534564 123
34erewrwer323 123
123fsgrt43232 123
TWO
42433412133fr 234
fafafd3234132 342
THREE
sfafdfe345233 3234
FOUR
324ereffdf343 4323
fvdafasf34nhj 4323
fsfnhjdgh342g 4323
Consider ONE, TWO, THREE and FOUR to be separate groups. I want to match only ONE and FOUR: a group should match only if it has more than one line and the second value on every line of the group is the same. How can I do that with a regular expression?
I have already tried the following regex, but it's not up to the mark:
\w+\n\w+\t(\d+)(\n\w+\t\1){2,}

You may use
r'(?m)^[A-Z]+\r?\n\S+\s+(\d+)(?:\r?\n\S+\s+\1)+$'
See the regex demo.
Details
(?m) - enable re.MULTILINE mode to make ^ / $ match start and end of lines respectively
^ - start of a line
[A-Z]+ - 1+ uppercase ASCII letters (adjust as you see fit)
\r?\n - a line break like CRLF or LF
\S+ - 1+ non-whitespace chars
\s+ - 1+ whitespaces (or use \t if a tab is the field separator)
(\d+) - Capturing group 1, one or more digits
(?:\r?\n\S+\s+\1)+ - one or more repetitions of a line break followed with 1+ non-whitespaces, 1+ whitespaces and the same value as in Group 1 since \1 is a backreference to the value stored in that group
$ - end of line.
In Python, use re.finditer:
import re
# text holds the content shown in the question
for m in re.finditer(r'(?m)^[A-Z]+\r?\n\S+\s+(\d+)(?:\r?\n\S+\s+\1)+$', text):
    print(m.group())
See the Python demo.
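For a self-contained run, here is a sketch with the question's data embedded. It assumes a single space separates the two fields, and the variable names are chosen for illustration.

```python
import re

# Content from the question, with a space as the field separator (assumed).
text = "\n".join([
    "ONE",
    "1234234534564 123",
    "34erewrwer323 123",
    "123fsgrt43232 123",
    "TWO",
    "42433412133fr 234",
    "fafafd3234132 342",
    "THREE",
    "sfafdfe345233 3234",
    "FOUR",
    "324ereffdf343 4323",
    "fvdafasf34nhj 4323",
    "fsfnhjdgh342g 4323",
])

PATTERN = re.compile(r'(?m)^[A-Z]+\r?\n\S+\s+(\d+)(?:\r?\n\S+\s+\1)+$')

# Keep only the header line of each matching group for a compact summary.
headers = [m.group().splitlines()[0] for m in PATTERN.finditer(text)]
print(headers)
```

TWO is skipped because its second values differ, and THREE is skipped because it has only one line.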

How to extract data from string

My code is
import regex
word = '\x02|1280|SELECT|35;36|="214554"'.encode('ascii')
pattern = r'^(\x02)\|(\d{1,4})\|(SELECT|UPDATE|INSERT)\|(\d{1,2}+|;*)\|="(\w+)"'.encode('ascii')
print(regex.match(pattern, word).group(4))
and I'm interested in group 4 -> (\d{1,2}+|;*) that can have following pattern
|one digit number|
|two-digit number|
|one/two-digit number; one/two-digit number; ... ; one/two-digit number|
I have tried different combinations, but as I'm new to regex, none of them returns data from the group.

How about changing the pattern for group 4 to: (\d{1,2}(?:;\d{1,2})*)?
\d{1,2} represents one or two digits
(?:;\d{1,2})* represents zero or more repetitions of a non-capturing group consisting of a semicolon ; followed by a one- or two-digit number
Important to mark the group as non-capturing by adding a (?: at the start
Regex101 Demo
Hope this helps!

The \d{1,2}+|;* pattern matches either 1 or 2 digits possessively, or 0+ semicolons. So, it is not what you need.
You need to write the pattern like this:
r'^(\x02)\|(\d{1,4})\|(SELECT|UPDATE|INSERT)\|(\d{1,2}(?:;\d{1,2})*)\|="(\w+)"'
See the Python demo.
The Group 4 pattern will look like (\d{1,2}(?:;\d{1,2})*):
\d{1,2} - 1 or 2 digits
(?: - start of a non-capturing group that matches sequences of....
; - a semi-colon
\d{1,2} - 1 or 2 digits
)* - .... zero or more occurrences.
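A minimal sketch using the corrected pattern with the standard re module (the possessive quantifier is no longer needed, so the third-party regex module is optional). The helper field_numbers and the second test message are made up for illustration, and plain str input is assumed in place of the encoded bytes from the question.

```python
import re

# Pattern copied from the answer above.
PATTERN = re.compile(
    r'^(\x02)\|(\d{1,4})\|(SELECT|UPDATE|INSERT)\|(\d{1,2}(?:;\d{1,2})*)\|="(\w+)"'
)

def field_numbers(message):
    """Return the numbers held in group 4 as a list of ints, or None.

    field_numbers is a hypothetical helper, not part of the original code.
    """
    m = PATTERN.match(message)
    if m is None:
        return None
    return [int(n) for n in m.group(4).split(';')]

print(field_numbers('\x02|1280|SELECT|35;36|="214554"'))
print(field_numbers('\x02|1280|UPDATE|7|="214554"'))
```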
