select group based on same value in regular Expression - python

I have the following content:
ONE
1234234534564 123
34erewrwer323 123
123fsgrt43232 123
TWO
42433412133fr 234
fafafd3234132 342
THREE
sfafdfe345233 3234
FOUR
324ereffdf343 4323
fvdafasf34nhj 4323
fsfnhjdgh342g 4323
Consider ONE, TWO, THREE and FOUR as separate groups. I want to match only ONE and FOUR: a group should match only if it has more than one line and the second value on every line of the group is the same. How can I do that with a regular expression?
I have already tried the following regex, but it does not do the job:
\w+\n\w+\t(\d+)(\n\w+\t\1){2,}

You may use
r'(?m)^[A-Z]+\r?\n\S+\s+(\d+)(?:\r?\n\S+\s+\1)+$'
See the regex demo.
Details
(?m) - enable re.MULTILINE mode to make ^ / $ match start and end of lines respectively
^ - start of a line
[A-Z]+ - 1+ uppercase ASCII letters (adjust as you see fit)
\r?\n - a line break like CRLF or LF
\S+ - 1+ non-whitespace chars
\s+ - 1+ whitespaces (or use \t if a tab is the field separator)
(\d+) - Capturing group 1, one or more digits
(?:\r?\n\S+\s+\1)+ - one or more repetitions of a line break followed with 1+ non-whitespaces, 1+ whitespaces and the same value as in Group 1 since \1 is a backreference to the value stored in that group
$ - end of line.
In Python, use re.finditer:
import re
# text holds the content shown in the question
for m in re.finditer(r'(?m)^[A-Z]+\r?\n\S+\s+(\d+)(?:\r?\n\S+\s+\1)+$', text):
    print(m.group())
See the Python demo.
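As a quick check, the pattern above can be run against the sample from the question. This is only a minimal sketch: it assumes a tab separates the two fields (the original spacing is not visible in the post), and the names text, pattern and matched_headers are illustrative.

```python
import re

# Sample data from the question; a tab between the two fields is an assumption.
text = (
    "ONE\n"
    "1234234534564\t123\n"
    "34erewrwer323\t123\n"
    "123fsgrt43232\t123\n"
    "TWO\n"
    "42433412133fr\t234\n"
    "fafafd3234132\t342\n"
    "THREE\n"
    "sfafdfe345233\t3234\n"
    "FOUR\n"
    "324ereffdf343\t4323\n"
    "fvdafasf34nhj\t4323\n"
    "fsfnhjdgh342g\t4323\n"
)

pattern = re.compile(r'(?m)^[A-Z]+\r?\n\S+\s+(\d+)(?:\r?\n\S+\s+\1)+$')

# Keep only the header line (first line) of every matched block.
matched_headers = [m.group().splitlines()[0] for m in pattern.finditer(text)]
print(matched_headers)  # ['ONE', 'FOUR']
```

TWO is skipped because its second values differ, and THREE is skipped because the (?:...)+ part requires at least one further line with the same value.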

Related

Regex capture groups divided by a number in between

I need to capture 3 groups from a string. The string has the following form:
{phrase 1} {optional number} {optional phrase 2}
A few examples of this are:
Battery Bank 1
Battery Bank 1 Segments
Battery Bank 1 Warranty Logger
Battery Bank 10
Battery Bank 10 Segments
Battery Bank 10 Warranty Logger
BSU
BSU 1
PCS 3
PCS 1
System
System Meter
As you can see, the only mandatory group is the first one which is comprised of word characters and spaces until a number of at least 1 digit appears. Then, optionally, another group of word and spaces characters.
This is what I have so far, but it's not working properly: it matches across lines, when it should match once per line.
([a-zA-Z\s]+)(\d+)?(\w)?
Here's a regex101 link to play with:
https://regex101.com/r/tSGIEm/2
You may use this regex with optional groups:
([a-zA-Z]+(?:[ \t]+[a-zA-Z]+)*)(?:[ \t]+(\d+)(?:[ \t]+(.+))?)?
Updated RegEx Demo
RegEx Details:
(: Start capture group #1
[a-zA-Z]+: Match a word of 1+ letters
(?:[ \t]+[a-zA-Z]+)*: Match 0 or more words separated by 1+ spaces/tabs
): End capture group #1
(?:: Start non-capture group #1
[ \t]+: Match 1+ spaces or tabs
(\d+): Match 1+ digits and capture in group #2
(?:: Start non-capture group #2
[ \t]+: Match 1+ spaces or tabs
(.+): Match 1+ of any characters and capture in group #3
)?: End optional non-capture group #2
)?: End optional non-capture group #1
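To see which groups are filled for the different input shapes, the pattern can be applied line by line. The sketch below assumes each label is matched as a whole with fullmatch; split_label is a hypothetical helper name, not part of the original answer.

```python
import re

LABEL_RE = re.compile(
    r'([a-zA-Z]+(?:[ \t]+[a-zA-Z]+)*)(?:[ \t]+(\d+)(?:[ \t]+(.+))?)?'
)

def split_label(line):
    """Return (phrase 1, number, phrase 2); absent optional parts are None."""
    m = LABEL_RE.fullmatch(line)
    return m.groups() if m else None

for sample in ["Battery Bank 1", "Battery Bank 10 Warranty Logger", "BSU", "System Meter"]:
    print(sample, "->", split_label(sample))
```

With this pattern, a missing number or a missing second phrase shows up as None in the corresponding group.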
You may use
^(.*?)(?: +(\d+) *(.*))?$
See the regex demo.
Details
^ - start of string
(.*?) - Group 1: any zero or more chars other than line break chars, as few as possible
(?: +(\d+) *(.*))? - an optional group matching 1 or 0 occurrences of:
" +" - 1+ spaces
(\d+) - Group 2: one or more digits
" *" - 0+ spaces
(.*) - Group 3: any zero or more chars other than line break chars, as many as possible
$ - end of string.
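A short sketch of how this second pattern behaves when applied to one line at a time. The helper name parse is illustrative only. One detail worth noticing is that Group 3 is an empty string, not None, whenever a number is present but nothing follows it, because (.*) can match zero characters.

```python
import re

PATTERN = re.compile(r'^(.*?)(?: +(\d+) *(.*))?$')

def parse(line):
    """Return the three groups for a single line, or None if there is no match."""
    m = PATTERN.match(line)
    return m.groups() if m else None

print(parse("Battery Bank 10 Warranty Logger"))  # ('Battery Bank', '10', 'Warranty Logger')
print(parse("PCS 3"))                            # ('PCS', '3', '')
print(parse("System Meter"))                     # ('System Meter', None, None)
```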

Greedy Python RegEx capturing group to include "and"

I need some help writing regex expressions. I need an expression that can match the following patterns (including words and digits, spaces and commas):
Line 145
Line3544354
Lines 10,12
Line items 45,10,26
Lines 10 and 45
Thus far, I wrote one expression which includes the first three patterns and all case variations:
r'(?i)(line item[\.*\,*\s*\d+]+]+|line[\.*\,*\s*\d+]+|lines[\.*\,*\s*\d+]+|line items[\.*\,*\s*\d+]+)'
I would like to include the last two patterns listed, but I am not sure how. I wrote this expression for the pattern matching "Lines 10 and 45" by modifying the capturing group as follows:
r'(Lines[\.*\,*\w*\s*\d+]+)'
However, it does not work as expected: it selects all word characters in the string. I would like to keep my expressions greedy, but I am not sure how to implement the last two patterns in the list.
Any suggestions please?
You may use
(?i)lines?(?:\s+items?)?\s*\d+(?:\.\d+)?(?:\s*(?:,|and)\s*\d+(?:\.\d+)?)*
See the regex demo.
Pattern details:
(?i) - ignore case inline modifier
lines? - line or lines (? quantifier makes the preceding pattern optional, matching 1 or 0 occurrences)
(?:\s+items?)? - an optional non-capturing group matching 1 or 0 occurrences of 1+ whitespaces followed with item and an optional s char
\s* - 0+ whitespaces
\d+(?:\.\d+)? - 1+ digits followed with an optional sequence of . and 1+ digits
(?:\s*(?:,|and)\s*\d+(?:\.\d+)?)* - 0 or more repetitions of
\s* - 0+ whitespaces
(?:,|and) - a comma or the char sequence and
\s* - 0+ whitespaces
\d+(?:\.\d+)? - 1+ digits followed with an optional sequence of . and 1+ digits
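The pattern can be checked against the five samples from the question and against a longer sentence. This is a minimal sketch; LINE_REF, samples and sentence are made-up names and test data, not part of the original answer.

```python
import re

LINE_REF = re.compile(
    r'(?i)lines?(?:\s+items?)?\s*\d+(?:\.\d+)?(?:\s*(?:,|and)\s*\d+(?:\.\d+)?)*'
)

samples = ["Line 145", "Line3544354", "Lines 10,12", "Line items 45,10,26", "Lines 10 and 45"]
all_match = all(LINE_REF.fullmatch(s) for s in samples)
print(all_match)  # True

# The pattern has no capturing groups, so findall returns the whole matches.
sentence = "Refer to Lines 10 and 45 and to line items 45,10,26 for details."
found = LINE_REF.findall(sentence)
print(found)  # ['Lines 10 and 45', 'line items 45,10,26']
```

The match stays greedy over the number list but stops at the second "and", because that "and" is not followed by a digit.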

How to extract data from string

My code is
import regex
word = '\x02|1280|SELECT|35;36|="214554"'.encode('ascii')
pattern = r'^(\x02)\|(\d{1,4})\|(SELECT|UPDATE|INSERT)\|(\d{1,2}+|;*)\|="(\w+)"'.encode('ascii')
print(regex.match(pattern, word).group(4))
and I'm interested in group 4 -> (\d{1,2}+|;*) that can have following pattern
|one digit number|
|two-digit number|
|one/two-digit number; one/two-digit number; ... ; one/two-digit number|
I have tried different combinations, but as I'm new to regex, none of them returns data from the group.
How about changing the pattern for group 4 to (\d{1,2}(?:;\d{1,2})*)
\d{1,2} represents one or two digits
(?:;\d{1,2})* represents zero or more repetitions of a non-capturing group that matches a semicolon ; followed by a one- or two-digit number
It is important to mark the inner group as non-capturing by adding ?: after the opening parenthesis, so that the numbering of the remaining capturing groups does not shift
Regex101 Demo
Hope this helps!
The \d{1,2}+|;* pattern matches either 1 or 2 digits (possessively) or 0+ semicolons, so it cannot match a value such as 35;36. So, it is not what you need.
You need to write the pattern like this:
r'^(\x02)\|(\d{1,4})\|(SELECT|UPDATE|INSERT)\|(\d{1,2}(?:;\d{1,2})*)\|="(\w+)"'
See the Python demo.
The Group 4 pattern will look like (\d{1,2}(?:;\d{1,2})*):
\d{1,2} - 1 or 2 digits
(?: - start of a non-capturing group that matches sequences of....
; - a semi-colon
\d{1,2} - 1 or 2 digits
)* - .... zero or more occurrences.
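A minimal sketch of the corrected pattern applied to the byte string from the question. Two things here are assumptions rather than part of the original code: it uses the standard library re module instead of the third-party regex module, and it writes the input and pattern as bytes literals instead of calling .encode('ascii'). The variable codes is illustrative.

```python
import re

word = b'\x02|1280|SELECT|35;36|="214554"'
pattern = re.compile(
    rb'^(\x02)\|(\d{1,4})\|(SELECT|UPDATE|INSERT)\|(\d{1,2}(?:;\d{1,2})*)\|="(\w+)"'
)

m = pattern.match(word)
codes = []
if m:  # only read the groups after confirming there was a match
    # Group 4 holds the whole list, e.g. b'35;36'; split it to get each number.
    codes = [int(part) for part in m.group(4).split(b';')]
print(codes)  # [35, 36]
```

Because the inner group is non-capturing, the quoted value is still available as group 5.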

Split string by number of whitespaces

I have a string that looks like either of these three examples:
1: Name = astring Some comments
2: Typ = one two thee Must be "sand", "mud" or "bedload"
3: RDW = 0.02 [ - ] Some comment about RDW
I first split the variable name and rest like so:
re.findall(r'\s*([a-zA-Z0-9_]+)\s*=\s*(.*)', line)
I then want to split the right part of the string into a part containing the values and a part containing the comments (if there are any). I want to do this by looking at the number of whitespace characters: if it exceeds, say, 4, then I assume the comments start.
Any idea on how to do this?
I currently have
re.findall(r'(?:(\S+)\s{0,3})+', dataString)
However if I test this using the string:
'aa aa23r234rf2134213^$&$%& bb'
Then it also selects 'bb'
You may use a single regex with re.findall:
^\s*(\w+)\s*=\s*(.*?)(?:(?:\s{4,}|\[)(.*))?$
See the regex demo.
Details:
^ - start of string
\s* - 0+ whitespaces
(\w+) - capturing group #1 matching 1 or more letters/digits/underscores
\s*=\s* - = enclosed with 0+ whitespaces
(.*?) - capturing group #2 matching any 0+ chars, as few as possible, up to the first...
(?:(?:\s{4,}|\[)(.*))? - an optional group matching
(?:\s{4,}|\[) - 4 or more whitespaces or a [
(.*) - capturing group #3 matching 0+ chars up to
$ - the end of string.
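A small sketch of this pattern in use. The spacing in the sample lines is an assumption (the post does not preserve the original whitespace, so four spaces are used before each comment), the "Depth = 12" line is made up to show the no-comment case, and parse_line is a hypothetical helper name.

```python
import re

LINE_RE = re.compile(r'^\s*(\w+)\s*=\s*(.*?)(?:(?:\s{4,}|\[)(.*))?$')

def parse_line(line):
    """Return (name, value, comment); comment is None when the line has none."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    name, value, comment = m.groups()
    # Strip surrounding whitespace left over around the value and the comment.
    return name, value.strip(), (comment.strip() if comment is not None else None)

print(parse_line('Name = astring    Some comments'))
print(parse_line('Typ = one two thee    Must be "sand", "mud" or "bedload"'))
print(parse_line('RDW = 0.02 [ - ]    Some comment about RDW'))
print(parse_line('Depth = 12'))
```

Note that on the RDW line the \[ alternative fires first, so the opening bracket is consumed and the rest of the unit ends up at the start of the comment group.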

Return the next nth result \w+ after a hyphen globally

Just getting to the next stage of understanding regex, hoping the community can help...
string = These.Final.Hours-AUSVERSION.2013-TEST-TESTAGAIN-YIFY.cp(tt123456).MiLLENiUM.mp4
There are multiple test names preceded by a '-' hyphen which I derive from regex
/(?<=-)\w+/g
Result:
AUSVERSION
TEST
TESTAGAIN
YIFY
I can parse the very last result using greediness with regex /(?!.*-)(?<=-)\w+/g
Result:
YIFY (4th & last result)
Can you please help me parse either the 1st, 2nd, or 3rd result Globally using the same string?
In Python, you can get these matches with a simple -\s*(\w+) regex and re.findall and then access any match with the appropriate index:
See IDEONE demo:
import re
s = 'These.Final.Hours-AUSVERSION.2013-TEST-TESTAGAIN-YIFY.cp(tt123456).MiLLENiUM.mp4'
r = re.findall(r'-\s*(\w+)', s)
print(r[0]) # => AUSVERSION
print(r[1]) # => TEST
print(r[2]) # => TESTAGAIN
print(r[3]) # => YIFY
The -\s*(\w+) pattern searches for a hyphen, followed with 0+ whitespaces, and then captures 1+ digits, letters or underscores. re.findall only returns the texts captured with capturing groups, so you only get those Group 1 values captured with (\w+).
To get these matches one by one, with re.search, you can use ^(?:.*?-\s*(\w+)){n}, where n is the match index you want. Here is a regex demo.
A quick Python demo (in real code, assign the result of re.search and only access Group 1 value after checking if there was a match):
import re
s = "These.Final.Hours-AUSVERSION.2013-TEST-TESTAGAIN- YIFY.cp(tt123456).MiLLENiUM.mp4"
print(re.search(r'^(?:.*?-\s*(\w+))', s).group(1))     # => AUSVERSION
print(re.search(r'^(?:.*?-\s*(\w+)){2}', s).group(1))  # => TEST
print(re.search(r'^(?:.*?-\s*(\w+)){3}', s).group(1))  # => TESTAGAIN
print(re.search(r'^(?:.*?-\s*(\w+)){4}', s).group(1))  # => YIFY
Explanation of the pattern:
^ - start of string
(?:.*?-\s*(\w+)){2} - a non-capturing group that matches (here) 2 sequences of:
.*? - 0+ any characters other than a newline (since no re.DOTALL modifier is used) up to the first...
- - hyphen
\s* - 0 or more whitespaces
(\w+) - Group 1 capturing 1+ word characters (letters, digits or underscores).
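The {n} idea can be wrapped in a small function that also handles the no-match case. This is a sketch under the assumption that n is a positive integer; nth_after_hyphen is a hypothetical name, not something from the original answer.

```python
import re

def nth_after_hyphen(s, n):
    """Return the n-th (1-based) word that follows a hyphen, or None if absent."""
    # A repeated capturing group keeps only its last iteration,
    # so Group 1 holds the n-th hyphenated word.
    m = re.search(r'^(?:.*?-\s*(\w+)){%d}' % n, s)
    return m.group(1) if m else None

s = 'These.Final.Hours-AUSVERSION.2013-TEST-TESTAGAIN-YIFY.cp(tt123456).MiLLENiUM.mp4'
print(nth_after_hyphen(s, 2))  # TEST
print(nth_after_hyphen(s, 5))  # None (there are only four hyphens)
```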
