Multiple capturing groups within non-capturing group using Python regexes

I have the following code using multiple capturing groups within a non-capturing group:
>>> regex = r'(?:a ([ac]+)|b ([bd]+))'
>>> re.match(regex, 'a caca').groups()
('caca', None)
>>> re.match(regex, 'b bdbd').groups()
(None, 'bdbd')
How can I change the code so it outputs either ('caca') or ('bdbd')?

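You are close.
To always get the capture as group 1, you can use a lookahead to check the alternative and then a separate capturing group to capture the text:
(?:a (?=[ac]+)|b (?=[bd]+))(.*)
Note that (.*) captures the rest of the line, so the lookahead only guarantees that the capture starts with at least one allowed character.
In Python 3:
>>> import re
>>> regex = r'(?:a (?=[ac]+)|b (?=[bd]+))(.*)'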
>>> re.match(regex, 'a caca').groups()
('caca',)
>>> re.match(regex, 'b bdbd').groups()
('bdbd',)
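If the capture has to stay restricted to each branch's character class, one alternative that needs only the standard library is to keep the original two-group pattern and pick whichever group took part in the match. The following is a minimal sketch; the helper name matched_group is made up for illustration and is not from the question.

```python
import re

# Original pattern from the question: one capturing group per alternative.
PATTERN = re.compile(r'(?:a ([ac]+)|b ([bd]+))')

def matched_group(text):
    """Return a 1-tuple holding the group that matched, or None if nothing matched."""
    m = PATTERN.match(text)
    if m is None:
        return None
    # Match.lastindex is the index of the last capturing group that participated
    # in the match; only one of the two groups can participate here.
    return (m.group(m.lastindex),)

print(matched_group('a caca'))  # ('caca',)
print(matched_group('b bdbd'))  # ('bdbd',)
```

This keeps the regex unchanged and moves the "which branch matched" decision into Python.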

Another option is to get the matches using a lookbehind without a capturing group:
(?<=a )[ac]+|(?<=b )[bd]+
For example:
import re
pattern = r'(?<=a )[ac]+|(?<=b )[bd]+'
print(re.search(pattern, 'a caca').group())
print(re.search(pattern, 'b bdbd').group())
Output:
caca
bdbd

You may use a branch reset group with the PyPI regex module (a third-party package, not the standard re module):
Alternatives inside a branch reset group share the same capturing groups. The syntax is (?|regex) where (?| opens the group and regex is any regular expression. If you don’t use any alternation or capturing groups inside the branch reset group, then its special function doesn’t come into play. It then acts as a non-capturing group.
The regex will look like
(?|a ([ac]+)|b ([bd]+))
In Python 3:
import regex
rx = r'(?|a ([ac]+)|b ([bd]+))'
print(regex.search(rx, 'a caca').groups()) # => ('caca',)
print(regex.search(rx, 'b bdbd').groups()) # => ('bdbd',)

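See the problem the other way around: use one capturing group around the whole alternation and make the inner groups non-capturing. The capture then includes the leading "a " or "b ".
((?:a [ac]+)|(?:b [bd]+))
^^          ^^
||          |+-- other exact match
||          +--- OR
|+-------------- not capturing for exact match
+--------------- capture everything
An easier look: https://regex101.com/r/e3bK2B/1/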

Related

get ALL capturing groups that matches with findall

I need a regexp that captures ALL the groups that match a text.
I have the following text
"abc"
and this regexp
compiled = re.compile("(?P<group1>abc)|(?P<group2>abc)")
compiled.findall("abc")
but the output is the following:
[('abc', '')]
The output I expect is the following
[('abc', 'abc')] # one match per capturing group that matches
EDITED:
What I need to achieve
I have around 500 groups of things, and I want to assign a text to each of these groups it belongs to, so I created a capturing group with a regexp for each one. This way I can run one big regexp once and use the indexes of the matched groups to know which groups matched.
For example, I have ingredients of desserts, and want to know which desserts a text may belong to:
test = re.compile('(?P<dessert1>(?:apple))|(?P<dessert2>(?:apple|banana))|(?P<others>(?:other))')
Then if I have the string
apple
I would want to get the groups "dessert1" and "dessert2".
I can't run a separate regexp for each dessert for performance reasons.

You might use a positive lookahead that contains one of the capturing groups, and then match the same text again in the other group with a backreference:
(?=(?P<group1>abc))(?P<group2>\1)
For example:
import re
regex = r"(?=(?P<group1>abc))(?P<group2>(?P=group1))"
test_str = "abc"
print(re.findall(regex, test_str))
Output
[('abc', 'abc')]
Or, to be more explicit, use (?P=group1) instead of the backreference \1 to match the same text as the capturing group named group1:
(?=(?P<group1>abc))(?P<group2>(?P=group1))
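The lookahead idea can be stretched to the dessert example from the question. The sketch below is only one possible approach, not the answerer's code: each category is wrapped in its own lookahead with an optional named group, so every category that matches at a given position gets filled in. The categories dict and the function name matching_categories are assumptions made for illustration, and nothing here has been measured against 500 groups.

```python
import re

# Hypothetical category table, modelled on the pattern in the question.
categories = {
    "dessert1": r"apple",
    "dessert2": r"apple|banana",
    "others": r"other",
}

# One lookahead per category. The trailing ? makes each group optional, so the
# lookahead always succeeds and the group is set only when its regex matches.
pattern = re.compile(
    "".join(f"(?=(?P<{name}>{rx})?)" for name, rx in categories.items())
)

def matching_categories(text):
    """Return the sorted names of all categories that match somewhere in text."""
    found = set()
    # The pattern consumes nothing, so finditer yields one empty match per position.
    for m in pattern.finditer(text):
        found.update(name for name, value in m.groupdict().items() if value is not None)
    return sorted(found)

print(matching_categories("apple"))  # ['dessert1', 'dessert2']
```

Because the overall match is always empty, the useful information is in groupdict() rather than in the matched text.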

regular expressions excluding words that begin with a semi colon

I have been trying to figure out how to include certain word groups and exclude others. I have this string for example:
string1="HI:MYDLKJL:ajkld? :JKLJBLKJD:DKJL? app?"
I want to find HI:MYDLKJL:ajkld? and app? but not :JKLJBLKJD:DKJL? because it begins with a colon. I have written this code, but it still matches JKLJBLKJD:DKJL? and merely drops the leading colon:
match3=re.findall(r"[A-Za-z]{1,15}[:]{0,1}[A-Za-z]{0,15}[:]{0,1}[A-Za-z]{0,15}[:]{0,1}[A-Za-z]{0,15}[\?]{1}",string1)

The actual pattern is pretty simple to specify. But, you'll also need to specify a look-behind to handle the second term appropriately.
>>> re.findall(r'(?:(?<=\s)|(?<=^))[^:]\S+\?', string1)
['HI:MYDLKJL:ajkld?', 'app?']
The regex means "any token that does not start with a colon but ends with a question mark":
(?:          # group of two alternative lookbehinds
    (?<=\s)  # preceded by whitespace
    |        # OR
    (?<=^)   # at the start of the string
)
[^:]         # any character that is not a colon
\S+          # one or more non-whitespace characters
\?           # literal question mark
A simple word boundary does not work because \b will also match the boundary between : and JKLJBLKJD... no bueno, hence the lookbehind.
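The two lookbehinds can also be folded into a single negative lookbehind. This is a variation on the pattern above rather than the answerer's own regex: (?<!\S) succeeds both at the start of the string and after whitespace, and [^:\s] additionally stops the first character from being whitespace.

```python
import re

string1 = "HI:MYDLKJL:ajkld? :JKLJBLKJD:DKJL? app?"

# (?<!\S)  not preceded by a non-whitespace character
#          (true at the start of the string and after whitespace)
# [^:\s]   first character is neither a colon nor whitespace
# \S*\?    rest of the token, ending in a literal question mark
tokens = re.findall(r'(?<!\S)[^:\s]\S*\?', string1)
print(tokens)  # ['HI:MYDLKJL:ajkld?', 'app?']
```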

Alternate approach
>>> string1="HI:MYDLKJL:ajkld? :JKLJBLKJD:DKJL? app?"
>>> string1.split()
['HI:MYDLKJL:ajkld?', ':JKLJBLKJD:DKJL?', 'app?']
# filter out elements not needed
>>> [s for s in string1.split() if not s.startswith(':')]
['HI:MYDLKJL:ajkld?', 'app?']
Or, using the regex module
>>> string1="HI:MYDLKJL:ajkld? :JKLJBLKJD:DKJL? app?"
>>> regex.findall(r'(?:^|\s):\S+(*SKIP)(*F)|\S+', string1)
['HI:MYDLKJL:ajkld?', 'app?']
(?:^|\s):\S+(*SKIP)(*F) will effectively ignore strings starting with :
(?: means non-capturing group

Extract string within parentheses - PYTHON

I have a string "Name(something)" and I am trying to extract the portion of the string within the parentheses!
I've tried the following solutions but don't seem to be getting the results I'm looking for.
n.split('()')
name, something = n.split('()')

You can use a simple regex to catch everything between the parentheses:
>>> import re
>>> s = 'Name(something)'
>>> re.search(r'\(([^)]+)', s).group(1)
'something'
The regex matches the first "(", then everything up to the next ")":
\( matches the character "(" literally
the capturing group ([^)]+) greedily matches anything that's not a ")"

As an improvement on @Maroun Maroun's answer:
re.findall(r'\(([^)]+)', s)
It finds all instances of strings in between parentheses.
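A small sketch of the findall variant on a string with more than one pair of parentheses; the sample string is made up for illustration.

```python
import re

# Hypothetical input with two parenthesised parts.
s = 'Name(something) Other(thing)'

# findall returns the text of the single capturing group for every match.
inside = re.findall(r'\(([^)]+)', s)
print(inside)  # ['something', 'thing']
```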

You can use split as in your example but this way
val = s.split('(', 1)[1].split(')')[0]
or using regex

You can use re.match:
>>> import re
>>> s = "name(something)"
>>> na, so = re.match(r"(.*)\((.*)\)" ,s).groups()
>>> na, so
('name', 'something')
The pattern has two (.*) groups, each matching any text; the second one must sit between the literal parentheses \( and \).

You can look for ( and ) (these need to be escaped with a backslash in a regex) and match every character between them using .* (capturing it in a group).
Example:
import re
s = "name(something)"
regex = r'\((.*)\)'
# re.search is needed here: re.match only matches at the start of the string
text_inside_parentheses = re.search(regex, s).group(1)
print(text_inside_parentheses)
Outputs:
something
Without regex you can do the following:
text_inside_parentheses = s[s.find('(')+1:s.find(')')]
Outputs:
something

Regular expression in python doesn't seem to be working like I expect

My code doesn't seem to be working like it's supposed to:
x = "engniu4nwi5u"
print(re.sub(r"\D(\d)\D", r"\1abc", x))
My desired output is: engniuabcnwiabcu
But the output actually given is: engni4abcw5abc

You are grouping the wrong characters. It should be written as
>>> x = "engniu4nwi5u"
>>> re.sub(r"(\D)\d(\D)", r"\1abc\2", x)
'engniuabcnwiabcu'
(\D) Matches a non-digit and captures it in \1
\d Matches the digit
(\D) Matches the following non-digit and captures it in \2
How does it match?
engniu4nwi5u
     |
     \D => \1
engniu4nwi5u
      |
      \d
engniu4nwi5u
       |
       \D => \2
Another Solution
You can also use lookarounds to do the same:
>>> x = "engniu4nwi5u"
>>> re.sub(r"(?<=\D)\d(?=\D)", r"abc", x)
'engniuabcnwiabcu'
(?<=\D) Lookbehind assertion. Checks that the digit is preceded by a non-digit, without capturing it.
\d Matches the digit
(?=\D) Lookahead assertion. Checks that the digit is followed by a non-digit. Also not captured.

This is because you replaced the wrong part.
Let's consider the first match. \D(\d)\D matches the following:
engniu4nwi5u
     ^^^
4 is captured as \1. Then you replace the whole match with \1abc, which becomes 4abc.
You have a couple of solutions here:
Capture what you want to keep: (\D)\d(\D) and replace it with \1abc\2
Use lookarounds: (?<=\D)\d(?=\D) and replace this with abc

Based on your regexp:
>>> re.sub(r"(\D)\d", r"\1abc", x)
'engniuabcnwiabcu'
Although I would do this instead, if every digit should be replaced regardless of its neighbours:
>>> re.sub(r"\d", "abc", x)
'engniuabcnwiabcu'

If you also plan to handle digits at the beginning and end of the string, you need to add ^ and $ to the regex:
(\D|^)\d(?=$|\D)
And replace with \1abc.
Sample code:
import re
p = re.compile(r'(\D|^)\d(?=$|\D)')
test_str = "1engniu4nwi5u"
subst = r"\1abc"  # raw string, so \1 is a backreference and not the character \x01
print(re.sub(p, subst, test_str))
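To see where the approaches above differ, here is a short comparison on an input that starts with a digit. It is only a sketch using the patterns already quoted in the answers; the variable names inner_only and with_edges are made up.

```python
import re

x = "1engniu4nwi5u"

# Lookarounds: only digits that have a non-digit on both sides are replaced,
# so the leading 1 (nothing before it) is left alone.
inner_only = re.sub(r"(?<=\D)\d(?=\D)", "abc", x)

# Anchored variant: ^ and $ also count as valid neighbours,
# so the leading 1 is replaced as well.
with_edges = re.sub(r"(\D|^)\d(?=$|\D)", r"\1abc", x)

print(inner_only)  # 1engniuabcnwiabcu
print(with_edges)  # abcengniuabcnwiabcu
```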

Regular expression to return all characters between two special characters

How would I go about using regex to return all characters between two brackets?
Here is an example:
foobar['infoNeededHere']ddd
needs to return infoNeededHere
I found a regex to do it between curly brackets but all attempts at making it work with square brackets have failed. Here is that regex: (?<={)[^}]*(?=}) and here is my attempt to hack it
(?<=[)[^}]*(?=])
Final Solution:
import re
s = "foobar['InfoNeeded'],"  # renamed from str to avoid shadowing the built-in
match = re.match(r"^.*\['(.*)'\].*$", s)
print(match.group(1))

If you're new to REG(ular) EX(pressions), you can learn about them in the Python docs. Or, if you want a gentler introduction, you can check out the HOWTO. They use Perl-style syntax.
Regex
The expression that you need is .*?\[(.*)\].*. The group that you want will be \1.
- .*?: . matches any character but a newline. * is a meta-character and means Repeat this 0 or more times. ? makes the * non-greedy, i.e., . will match as few chars as possible before hitting a '['.
- \[: \ escapes special meta-characters, which in this case is [. If we didn't do that, [ would start a character class instead.
- (.*): Parentheses 'group' whatever is inside them, and you can later retrieve the groups by their numeric IDs or names (if they're given one).
- \].*: You should know enough by now to know what this means.
Implementation
First, import the re module wherever you want to use the expression; it is part of the standard library but is not available without an import.
Then, use re.search(regex_pattern, string_to_be_tested) to search for the pattern in the string to be tested. This will return a match object which you can store in a temporary variable. You should then call its group() method and pass 1 as an argument (to see the 'Group 1' we captured using parentheses earlier). It should now look like:
>>> import re
>>> pat = r'.*?\[(.*)].*' #See Note at the bottom of the answer
>>> s = "foobar['infoNeededHere']ddd"
>>> match = re.search(pat, s)
>>> match.group(1)
"'infoNeededHere'"
An Alternative
You can also use findall() to find all the non-overlapping matches by modifying the regex to (?<=\[).+?(?=\]).
- (?<=\[): (?<=) is called a look-behind assertion and checks for an expression preceding the actual match.
- .+?: + is just like * except that it matches one or more repetitions. It is made non-greedy by ?.
- (?=\]): (?=) is a look-ahead assertion and checks for an expression following the match without capturing it.
Your code should now look like:
>>> import re
>>> pat = r'(?<=\[).+?(?=\])' #See Note at the bottom of the answer
>>> s = "foobar['infoNeededHere']ddd[andHere] [andOverHereToo[]"
>>> re.findall(pat, s)
["'infoNeededHere'", 'andHere', 'andOverHereToo[']
Note: Always use raw Python strings by adding an 'r' before the string (E.g.: r'blah blah blah').
Thanks for reading! I wrote this answer when there were no accepted ones yet, but by the time I finished it, 2 more came up and one got accepted. :(

^.*\['(.*)'\].*$ will match a line and capture what you want in a group.
You have to escape the [ and ] with \
The documentation at the rubular.com proof link will explain how the expression is formed.

If there's only one of these [.....] tokens per line, then you don't need to use regular expressions at all:
In [7]: mystring = "Bacon, [eggs], and spam"
In [8]: mystring[ mystring.find("[")+1 : mystring.find("]") ]
Out[8]: 'eggs'
If there's more than one of these per line, then you'll need to modify Jarrod's regex ^.*\['(.*)'\].*$ to match multiple times per line, and to be non greedy. (Use the .*? quantifier instead of the .* quantifier.)
In [15]: mystring = "[Bacon], [eggs], and [spam]."
In [16]: re.findall(r"\[(.*?)\]",mystring)
Out[16]: ['Bacon', 'eggs', 'spam']
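For the input in the question, where the value sits in quotes inside the brackets, the quotes can be kept out of the capture by making them part of the pattern. This is a sketch that combines the ideas above, not code from any of the answers; the second bracketed item in the sample string is invented to show multiple matches.

```python
import re

# First item is from the question; ['second'] is added for illustration.
s = "foobar['infoNeededHere']ddd['second']"

# \['    literal [ followed by a single quote
# ([^']*) capture everything up to the next single quote
# '\]    closing quote followed by a literal ]
values = re.findall(r"\['([^']*)'\]", s)
print(values)  # ['infoNeededHere', 'second']
```

Using [^']* instead of .* keeps each match inside its own pair of quotes without relying on non-greedy matching.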
