Named non-capturing group in python? - python

Is it possible to have a named non-capturing group in Python? For example, I want to match strings of this form (including the quotes):
"a=b"
'bird=angel'
I can do the following:
import re
s = '"bird=angel"'
myre = re.compile(r'(?P<quote>[\'"])(\w+)=(\w+)(?P=quote)')
m = myre.search(s)
m.groups()
# ('"', 'bird', 'angel')
The result captures the quote group, which is not desirable here.

No, named groups are always capturing groups. From the documentation of the re module:
Extensions usually do not create a new group; (?P<name>...) is the
only exception to this rule.
And regarding the named group extension:
Similar to regular parentheses, but the substring matched by the group
is accessible within the rest of the regular expression via the
symbolic group name name
Where regular parentheses means (...), in contrast with (?:...).

You do need a capturing group in order to match the same quote: there is no other mechanism in re that allows you to do this, short of explicitly distinguishing the two quotes:
myre = re.compile('"{0}"|\'{0}\''.format(r'(\w+)=(\w+)'))
(which has the downside of giving you four groups, two for each style of quotes).
Note that one does not need to give a name to the quotes, though:
myre = re.compile(r'([\'"])(\w+)=(\w+)\1')
works as well.
In conclusion, you are better off using groups()[1:] in order to get only what you need, if at all possible.
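A minimal sketch of that advice, assuming the sample string from the question. The group names key and value are illustrative only and do not come from the original post. The quote stays an ordinary numbered group so that \1 can refer back to it, while the parts of interest are named:

```python
import re

# The quote is captured (needed for the \1 backreference) but left unnamed;
# only the two interesting parts get names.
pattern = re.compile(r'''(['"])(?P<key>\w+)=(?P<value>\w+)\1''')

m = pattern.search('"bird=angel"')
print(m.groups()[1:])            # ('bird', 'angel')  - slice off the quote group
print(m.group('key', 'value'))   # ('bird', 'angel')  - ask for named groups only
print(m.groupdict())             # {'key': 'bird', 'value': 'angel'}

# Mismatched quotes are rejected because \1 must repeat the opening quote.
print(pattern.search('"bird=angel\''))   # None
```

Since groupdict() only reports named groups, the quote group never shows up there even though it is still captured.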

Related

`(?P<name>...)` and `\g<quote>` in re module

While reading the documentation for Python's re module, I kept getting confused by (?P<name>...).
I understand that the P here stands for nothing in particular, much like placeholder names such as foo, bar, zoo, from the answers to Named regular expression group "(?P<group_name>regexp)": what does "P" stand for? on Stack Overflow.
(?P<name>...)
Similar to regular parentheses, but the substring matched by the group is accessible via the symbolic group name name. Group names must be valid Python identifiers, and each group name must be defined only once within a regular expression. A symbolic group is also a numbered group, just as if the group were not named.
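Named groups can be referenced in three contexts. If the pattern is (?P<quote>['"]).*?(?P=quote) (i.e. matching a string quoted with either single or double quotes):
in the same pattern itself: (?P=quote) or \1
when processing match object m: m.group('quote'), m.end('quote') (etc.)
in a string passed to the repl argument of re.sub(): \g<quote>, \g<1> or \1
I often make mistakes when using \g<quote> in the repl argument of re.sub.
I try to be Pythonic and to explain things to others from first principles, so I would like to know:
why is it g rather than P in \g<quote>, or why not G in (?P<name>...)?
Then the syntax would be consistent rather than chaotic.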
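For reference, here is a small sketch that exercises the three contexts with the quote pattern from the documentation. The sample text is made up for illustration. It does not explain the choice of letters, only where each spelling is valid: (?P<name>...) and (?P=name) belong to pattern syntax, while \g<name> belongs to the replacement template of re.sub.

```python
import re

# 1. Inside the pattern itself: (?P=quote) must match the same quote character.
pattern = re.compile(r'''(?P<quote>['"]).*?(?P=quote)''')
text = 'say "hello" now'   # made-up sample input

# 2. On the match object: look the group up by name.
m = pattern.search(text)
print(m.group('quote'))    # "
print(m.group(0))          # "hello"

# 3. In the repl argument of re.sub: \g<quote> inserts the group's text.
print(pattern.sub(r'\g<quote>X\g<quote>', text))   # say "X" now
```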

?= and ?P combined in a regex

In short:
I want to use the lookahead technique in Python together with the ?P<name> convention (details here) to get the groups by name.
More details:
I discovered the lookahead trick here; e.g. the following regex...
/^(?=.*Tim)(?=.*stupid).+
... makes it possible to detect strings like "Tim stupid" or "stupid Tim", regardless of the order of the words.
I can't figure out how to combine the ?= "operator" with the ?P one; the following regex obviously doesn't do the trick, but it gives an idea of what I want:
/^(?=?P<word1>.*Tim)(?=?P<word2>.*stupid).+
The ?P<word1> in your regex resembles a named capture group:
The syntax for a named group is one of the Python-specific extensions: (?P<name>...). *name* is, obviously, the name of the group. Named groups also behave exactly like capturing groups, and additionally associate a name with a group.
So most probably you want positive lookaheads anchored at the start, so that the string must satisfy both patterns, with a named capturing group inside each lookahead:
^(?=(?P<word1>.*Tim))(?=(?P<word2>.*stupid)).+
    ^^^^^^^^^^     ^    ^^^^^^^^^^        ^
Note that if you do not need the string itself, .+ is redundant and can be removed. You might want to re-adjust the borders of the named capture groups if necessary.
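A short sketch of how this behaves in Python, using the sample strings from the question plus one made-up negative case. The variable names wide and tight are illustrative. The second pattern shows one way of re-adjusting the group borders so that each group holds only the word itself:

```python
import re

# Groups wrap the whole lookahead body, so they include the leading ".*" text.
wide = re.compile(r'^(?=(?P<word1>.*Tim))(?=(?P<word2>.*stupid))')
# Groups wrap only the words; the ".*" stays outside the capture.
tight = re.compile(r'^(?=.*(?P<word1>Tim))(?=.*(?P<word2>stupid))')

for s in ('Tim stupid', 'stupid Tim', 'only Tim'):
    m = wide.search(s)
    print(s, '->', m.groupdict() if m else None)
# Tim stupid -> {'word1': 'Tim', 'word2': 'Tim stupid'}
# stupid Tim -> {'word1': 'stupid Tim', 'word2': 'stupid'}
# only Tim -> None

print(tight.search('stupid Tim').groupdict())
# {'word1': 'Tim', 'word2': 'stupid'}
```

Lookaheads are zero-width, so the overall match is empty here; the captured groups still keep their text.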

not returning the whole pattern in regex in python

I have the following code:
import re
haystack = "aaa months(3) bbb"
needle = re.compile(r'(months|days)\([\d]*\)')
instances = list(set(needle.findall(haystack)))
print(instances)
I'd expect it to print months(3) but instead I just get months. Is there any reason for this?
needle = re.compile(r'((?:months|days)\([\d]*\))')
fixes your problem.
You were capturing only the months|days part.
In this specific situation, this regex is a bit better:
needle = re.compile(r'((?:months|days)\(\d+\))')
This way you only get results that contain a number; previously a string like months() would also match. If you want to ignore case for options like Months or Days, also add the re.IGNORECASE flag, like this:
re.compile(r'((?:months|days)\(\d+\))', re.IGNORECASE)
Some explanation for the OP:
A regular expression is made up of many elements, chief among them the capturing group, written "()". Sometimes we want to group without capturing, so we use "(?:)". There are many other forms of groups, but these are the most common.
In this case we surround the entire regular expression with a capturing group because you want to capture everything. Normally, when a pattern contains no capturing group, findall returns the whole match, as if the pattern were wrapped in one implicit group. Here you specified a group explicitly, so findall returned only that group's text.
Now that the entire regular expression is surrounded by a capturing group, we turn the inner group into a non-capturing one by adding ?: to its beginning, as shown above. We could also have skipped the outer group and only made the inner group non-capturing, since findall returns the whole match when no capturing group is present. I personally prefer being explicit.
Further information about regular expressions can be found here: http://docs.python.org/library/re.html
Parens are not just for grouping, but also for forming capture groups. What you want is re.compile(r'(?:months|days)\(\d+\)'). That uses a non-capturing group for the or condition, and will not get you a bunch of subgroup matches you don't appear to want when using findall.
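To make the difference concrete, here is a small sketch comparing what findall returns for the three group arrangements. The haystack is extended with a second, made-up item (days(10)) so that the list shape is visible:

```python
import re

haystack = "aaa months(3) bbb days(10)"   # second item added for illustration

# One capturing group: findall returns only that group's text.
print(re.findall(r'(months|days)\(\d+\)', haystack))
# ['months', 'days']

# No capturing group: findall returns each whole match.
print(re.findall(r'(?:months|days)\(\d+\)', haystack))
# ['months(3)', 'days(10)']

# Several capturing groups: findall returns one tuple per match.
print(re.findall(r'(months|days)\((\d+)\)', haystack))
# [('months', '3'), ('days', '10')]
```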

Parentheses in regular expression pattern when splitting a string

I would like to know the reason behind the following behaviour:
>>> re.compile("(b)").split("abc")[1]
'b'
>>> re.compile("b").split("abc")[1]
'c'
It seems that when I add parentheses around the splitting pattern, re adds the separator to the resulting list. But why? Is this consistent behaviour, or simply an isolated feature of regular expressions?
It's a feature of re.split, according to the documentation:
If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.
In general, parentheses denote capture groups and are used to extract certain parts of a string. Read more about capture groups.
In a regular expression, plain parentheses (...) denote a capture group. Capture groups are typically used to extract values from the matched string (in conjunction with re.match or re.search). For details, refer to the official documentation (search for (...)).
re.split inserts the text matched by the groups between the split values:
If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.
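A quick sketch of the documented behaviour, using the string from the question. The last line is a made-up example of when keeping the separators is useful:

```python
import re

print(re.split(r'b', 'abc'))       # ['a', 'c']       - separator dropped
print(re.split(r'(b)', 'abc'))     # ['a', 'b', 'c']  - captured separator kept
print(re.split(r'(?:b)', 'abc'))   # ['a', 'c']       - non-capturing, dropped again

# Keeping the operators while discarding the surrounding whitespace.
print(re.split(r'\s*([+-])\s*', '1 + 2 - 3'))   # ['1', '+', '2', '-', '3']
```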

What does this Django regular expression mean? `?P`

I have the following regular expression (regex) in my urls.py and I'd like to know what it means. Specifically the (?P<category_slug> portion of the regex.
r'^category/(?P<category_slug>[-\w]+)/$'
In Django, named capturing groups are passed to your view as keyword arguments.
Unnamed capturing groups (just parentheses) are passed to your view as positional arguments.
(?P<name>...) is a named capturing group, as opposed to an unnamed capturing group.
http://docs.python.org/library/re.html
(?P<name>...) Similar to regular parentheses, but the substring
matched by the group is accessible within the rest of the regular
expression via the symbolic group name name. Group names must be valid
Python identifiers, and each group name must be defined only once
within a regular expression. A symbolic group is also a numbered
group, just as if the group were not named. So the group named id in
the example below can also be referenced as the numbered group 1.
(?P<name>regex) - Round brackets group the regex between them. They capture the text matched by the regex inside them that can be referenced by the name between the sharp brackets. The name may consist of letters and digits.
Copy paste from: http://www.regular-expressions.info/refext.html
(?P<category_slug>) creates a match group named category_slug.
The regex itself matches a string starting with category/ and then a mix of alphanumeric characters, the dash - and the underscore _, followed by a trailing slash.
Example URLs accepted by the regex:
category/foo/
category/foo_bar-baz/
category/12345/
category/q1e2_asdf/
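The idea can be sketched with the plain re module, without Django. This only imitates how named groups end up as keyword arguments; it is not Django's actual dispatch code, and show_category is a hypothetical view function invented for the example:

```python
import re

url_pattern = re.compile(r'^category/(?P<category_slug>[-\w]+)/$')

def show_category(category_slug):
    # Hypothetical view: receives the named group as a keyword argument.
    return 'category: ' + category_slug

m = url_pattern.match('category/foo_bar-baz/')
print(m.groupdict())                    # {'category_slug': 'foo_bar-baz'}
print(show_category(**m.groupdict()))   # category: foo_bar-baz

# A space is not covered by [-\w], so this path is rejected.
print(url_pattern.match('category/foo bar/'))   # None
```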
In URL pattern matching, use this pattern to pass a string:
(?P<username2>[-\w]+)
and this one for an integer value:
(?P<user_id>[0-9]+)
(?P<name>...)
Similar to regular parentheses, but the substring matched by the group is accessible via the symbolic group name name. Group names must be valid Python identifiers, and each group name must be defined only once within a regular expression. A symbolic group is also a numbered group, just as if the group were not named.
copy paste from Python3Regex
