Extract substring using python re.match - python

I have a string as
sg_ts_feature_name_01_some_xyz
In this, I want to extract the two words that come after the prefix sg_ts, keeping the underscore separator between them.
It must be,
feature_name
This regex,
import re
st = 'sg_ts_my_feature_01'
a = re.match('sg_ts_([a-zA-Z_]*)_*', st)
print(a.group())
returns,
sg_ts_my_feature_
whereas I expect,
my_feature

The problem is that you are asking for the whole match, not just the capture group. From the manual:
group([group1, ...])
Returns one or more subgroups of the match. If there is a single argument, the result is a single string; if there are multiple arguments, the result is a tuple with one item per argument. Without arguments, group1 defaults to zero (the whole match is returned). If a groupN argument is zero, the corresponding return value is the entire matching string; if it is in the inclusive range [1..99], it is the string matching the corresponding parenthesized group.
and you asked for a.group() which is equivalent to a.group(0) which is the whole match. Asking for a.group(1) will give you only the capture group in the parentheses.

You can ask for the group surrounded by the parentheses, 'a.group(1)', which returns
'my_feature_'
In addition, if your string is always in this form, you could also anchor the pattern with the end-of-string character $ and make the inner match lazy instead of greedy (so it doesn't swallow the trailing _).
a = re.match('sg_ts_([a-zA-Z_]*?)[_0-9]*$',st)
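To make the difference between the two answers concrete, here is a minimal sketch comparing group(), group(1) and the lazy, anchored pattern on the sample string. The last pattern is an assumption that is not taken from the answers: it presumes "two words" means exactly two runs of letters joined by a single underscore.

```python
import re

st = 'sg_ts_my_feature_01'

m = re.match(r'sg_ts_([a-zA-Z_]*)_*', st)
whole = m.group()      # same as m.group(0): the entire match
captured = m.group(1)  # only the text inside the parentheses

# Lazy capture plus an anchored tail, so the trailing "_01" is not captured.
lazy = re.match(r'sg_ts_([a-zA-Z_]*?)[_0-9]*$', st).group(1)

# Assumption: exactly two letter-only words separated by one underscore.
two_words = re.match(
    r'sg_ts_([a-zA-Z]+_[a-zA-Z]+)', 'sg_ts_feature_name_01_some_xyz'
).group(1)

print(whole)      # sg_ts_my_feature_
print(captured)   # my_feature_
print(lazy)       # my_feature
print(two_words)  # feature_name
```

Note that the anchored pattern only matches when nothing but digits and underscores follows the captured words, so it would not match the longer string from the question.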

Related

Regex how to match an optional character in front of a greedy capture?

I'm using python re. I have a string in the following format:
<root>.<entry_2>.<entry_3>.<entry_4>.<entry_5>...<entry_n>-<op_1>:<op_2>=<value>
I would like to capture four groups: .<entry_2>.<entry_3>...<entry_n> in one group, <op_1> in a second group, <op_2> in a third group, and <value> in the fourth group. However, I would also like -<op_1> to be optional. So, if - doesn't exist, then the second group returns empty. My current matching expression is ^.+?(\..+)[-](.*):(.*)=(.*). But [-] and [:] require those characters in order to match. And making them optional forces the first capture to overrun the - and : characters if they do exist. Is there a better way to approach this?
>>> s = '<root>.<entry_2>.<entry_3>.<entry_4>.<entry_5>...<entry_10>-<op_1>:<op_2>=<value>'
>>> re.findall(r'(\.<entry_.*entry_\d+>)(?:-(<op_\d+>))?:(<op_\d+>)=(<[^>]+>)', s)
[('.<entry_2>.<entry_3>.<entry_4>.<entry_5>...<entry_10>', '<op_1>', '<op_2>', '<value>')]
>>> s = '<root>.<entry_2>.<entry_3>.<entry_4>.<entry_5>...<entry_10>:<op_2>=<value>'
>>> re.findall(r'(\.<entry_.*entry_\d+>)(?:-(<op_\d+>))?:(<op_\d+>)=(<[^>]+>)', s)
[('.<entry_2>.<entry_3>.<entry_4>.<entry_5>...<entry_10>', '', '<op_2>', '<value>')]
I have changed entry_n to entry_10 so that it has digits instead of n for the code snippet to work.
^\+spm_.+? isn't present in input sample, so I didn't include it, but you can add it if you need it
The four groups are:
(\.<entry_.*entry_\d+>)
(?:-(<op_\d+>))? --> optional group
:(<op_\d+>)
=(<[^>]+>)
You can also use re.search(r'pat', s).groups(), but by default you will get None instead of an empty string for the optional group. To change that, use .groups(default='') to get an empty string instead of None.
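As a small sketch of that last point, the snippet below runs the same pattern with re.search on a shortened input that has no -<op_N> part. The shortened sample string is made up for illustration.

```python
import re

pattern = re.compile(
    r'(\.<entry_.*entry_\d+>)(?:-(<op_\d+>))?:(<op_\d+>)=(<[^>]+>)'
)

# Hypothetical shortened input without the optional "-<op_1>" part.
s = '<root>.<entry_2>.<entry_3>:<op_2>=<value>'

groups_none = pattern.search(s).groups()             # optional group is None
groups_empty = pattern.search(s).groups(default='')  # optional group is ''

print(groups_none)
print(groups_empty)
```

The default argument only affects groups that did not take part in the match, so the other three groups are the same in both tuples.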

Python parentheses and returning only certain part of regex

I have a list of strings that I'm looping through. I have the following regular expression (item is the string I'm looping through at any given moment):
regularexpression = re.compile(r'set(\d+)e', re.IGNORECASE)
number = re.search(regularexpression,item).group(1)
What I want it to do is return numbers that have the word set before them and the letter e after them.
However, I also want it to return numbers that have set before them and x after them. If I use the following code:
regularexpression = re.compile(r'set(\d+)(e|x)', re.IGNORECASE)
number = re.search(regularexpression,item).group(1)
Instead of returning just the number, it also returns e or x. Is there a way to use parentheses to group my regular expression into bits without it returning everything in the parentheses?
Your example code seems fine already, but to answer your question, you can make a non-capturing group using the (?:...) syntax, e.g.:
set(\d+)(?:e|x)
Additionally, in this specific example you can just use a character class:
set(\d+)[ex]
It appears you are looking at more than just .group(1); you have two capturing groups defined in your regular expression.
You can make the second group non-capturing by using (?:...) instead of (...):
regularexpression = re.compile(r'set(\d+)(?:e|x)', re.IGNORECASE)
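A short sketch of both suggestions applied to a list, assuming some made-up item strings. Items that do not match are skipped here rather than raising an AttributeError on .group(1).

```python
import re

# Hypothetical sample items; the last one should not match.
items = ['set12e', 'SET7x', 'set3y']

non_capturing = re.compile(r'set(\d+)(?:e|x)', re.IGNORECASE)
char_class = re.compile(r'set(\d+)[ex]', re.IGNORECASE)

def extract_numbers(regex, strings):
    """Return group(1) for every string the regex matches."""
    numbers = []
    for item in strings:
        match = regex.search(item)
        if match:  # re.search returns None when nothing matches
            numbers.append(match.group(1))
    return numbers

with_group = extract_numbers(non_capturing, items)
with_class = extract_numbers(char_class, items)
print(with_group, with_class)
print(non_capturing.groups)  # number of capturing groups in the pattern
```

Both patterns define a single capturing group, so group(1) is always the number.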

Python regexp: get all group's sequence

I have a regex like this '^(a|ab|1|2)+$' and want to get the whole sequence of matched alternatives.
For example, for re.search(reg, 'ab1') I want to get ('ab','1').
I can get an equivalent result with the pattern '^(a|ab|1|2)(a|ab|1|2)$',
but I don't know how many blocks will be matched by (pattern)+.
Is this possible, and if yes, how?
Try this:
import re
r = re.compile('(ab|a|1|2)')
for i in r.findall('ab1'):
    print(i)
The ab option has been moved to be first, so it will match ab in favor of just a.
The findall method applies your regular expression repeatedly and returns a list of the matched groups. In this simple example you get back just a list of strings, one string per match. If the pattern had more groups, you would get back a list of tuples, each containing one string per group.
This should work for your second example:
import re
pattern = '(7325189|7325|9087|087|18)'
text = '7325189087'
res = re.compile(pattern).findall(text)
print(pattern, text, res, [i for i in res])
I'm removing the ^$ signs from the pattern because if findall has to find more than one substring, then it should search anywhere in text. Then I've removed + so it matches single occurrences of those options in pattern.
Your original expression does match the way you want to, it just matches the entire string and doesn't capture individual groups for each separate match. Using a repetition operator ('+', '*', '{m,n}'), the group gets overwritten each time, and only the final match is saved. This is alluded to in the documentation:
If a group matches multiple times, only the last match is accessible.
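The behaviour described in that quote can be checked directly. This is a minimal sketch contrasting the repeated group from the question with the findall approach suggested above.

```python
import re

# Repeated group: the whole string matches, but group(1) only keeps
# the text from the last repetition.
m = re.search(r'^(a|ab|1|2)+$', 'ab1')
whole = m.group(0)
last_repetition = m.group(1)

# findall without anchors and without "+": one list item per match.
# "ab" is listed before "a" so the longer alternative is tried first.
pieces = re.findall(r'(ab|a|1|2)', 'ab1')

print(whole)            # ab1
print(last_repetition)  # 1
print(pieces)           # ['ab', '1']
```

Unlike the anchored original, the findall version silently skips characters that match none of the alternatives, so it does not validate the whole string.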
I think you don't need regexes for this problem;
you need some kind of recursive graph search function.

regex: string with optional parts

I am trying to parse some docstrings.
An example docstring is:
Test if a column field is larger than a given value
This function can also be called as an operator using the '>' syntax
Arguments:
- DbColumn self
- string or float value: the value to compare to
in case of string: lexicographic comparison
in case of float: numeric comparison
Returns:
DbWhere object
Both the Arguments and Returns parts are optional. I want my regex to return as groups the description (first lines), the Arguments part (if present) and the Returns part (if present).
The regex I have now is:
m = re.search('(.*)(Arguments:.*)(Returns:.*)', s, re.DOTALL)
It works when all three parts are present but fails as soon as the Arguments or the Returns part is missing. I have tried several variations with non-greedy modifiers like ??, but to no avail.
Edit: When the Arguments and Returns parts are present, I actually would only like to match the text after Arguments: and Returns: respectively.
Thanks!
Try with:
re.search('^(.*?)(Arguments:.*?)?(Returns:.*)?$', s, re.DOTALL)
This just makes the second and third groups optional by appending a ?, and makes the quantifiers of the first two groups non-greedy by (again) appending a ? to them (yes, confusing).
Also, if you use the non-greedy modifier on the first group of the pattern, it'll match the shortest possible substring, which for .* is the empty string. You can overcome this by adding the end-of-string anchor ($) at the end of the pattern, which forces the first group to match as few characters as possible while still satisfying the whole pattern, i.e. the whole string when there are no Arguments and Returns sections, and everything before those sections when they are present.
Edit: OK, if you just want to capture the text after the Arguments: and Returns: tokens, you'll have to tuck in a couple more groups. We're not going to use all of the groups, so naming them, with the (?P<name>...) notation (another question mark, argh!), is starting to make sense:
>>> m = re.search('^(?P<description>.*?)(Arguments:(?P<arguments>.*?))?(Returns:(?P<returns>.*))?$', s, re.DOTALL)
>>> m.groupdict()['description']
"Test if a column field is larger than a given value\n This function can also be called as an operator using the '>' syntax\n\n "
>>> m.groupdict()['arguments']
'\n - DbColumn self\n - string or float value: the value to compare to\n in case of string: lexicographic comparison\n in case of float: numeric comparison\n '
>>> m.groupdict()['returns']
'\n DbWhere object'
>>>
If you want to match the text after the optional Arguments: and Returns: sections, and you don't want to use (?P<name>...) to name your capture groups, you can also use (?:...), the non-capturing version of regular parentheses.
The regex would look like this:
m = re.search('^(.*?)(?:Arguments:(.*?))?(?:Returns:(.*?))?$', doc, re.DOTALL)
#                     ^^                  ^^
According to the Python3 documentation:
(?:...)
A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.
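The two answers can be combined: non-capturing outer groups for the optional sections and named inner groups for the text. The sketch below assumes short made-up docstrings, and the helper name split_docstring is hypothetical.

```python
import re

DOC_PATTERN = re.compile(
    r'^(?P<description>.*?)'
    r'(?:Arguments:(?P<arguments>.*?))?'   # optional section, not captured itself
    r'(?:Returns:(?P<returns>.*?))?$',     # optional section, not captured itself
    re.DOTALL,
)

def split_docstring(doc):
    """Return the three parts, stripped; missing sections are None."""
    match = DOC_PATTERN.search(doc)
    return {
        name: (value.strip() if value is not None else None)
        for name, value in match.groupdict().items()
    }

# Hypothetical sample docstrings.
full = split_docstring(
    "Test a column.\nArguments:\n- DbColumn self\nReturns:\nDbWhere object"
)
only_description = split_docstring("Just a description.")
only_returns = split_docstring("Describe.\nReturns:\nDbWhere object")

print(full)
print(only_description)
print(only_returns)
```

Stripping the whitespace is a design choice of this sketch; drop the strip() call if the original indentation matters.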

Why is there an extra result handed back to me during this Python regex example?

Code:
re.findall(r'(/\d\d\d\d)?','/2000')
Result:
['/2000', '']
Code:
re.findall(r'/\d\d\d\d?','/2000')
Result:
['/2000']
Why is the extra '' returned in the first example?
I am using the first example for Django URL configuration. Is there a way I can prevent matching of ''?
Because the parentheses define a group, and the ? then asks for 0 or 1 repetitions of that group. Thus both the empty string and /2000 match.
The ? operator matches 0 or 1 repetitions of the preceding expression. In the first case the preceding expression is (/\d\d\d\d), while in the second it is only the last \d.
Therefore, in the first case the empty string "" is matched, since it contains zero repetitions of the expression (/\d\d\d\d).
Here is what is happening: The regex engine starts off with its pointer before the first char in the target string. It greedily consumes the whole string and places the match result in the first list element. This leaves the internal pointer at the end of the string. But since the regex pattern can match nothingness, it successfully matches at the position at the end of the string too. Thus, there are two elements in the list.
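A small sketch reproducing the extra empty match and two possible ways to avoid it. Neither workaround comes from the answers above: one drops the optional quantifier, the other filters out empty results, and which one fits depends on whether the year part really has to be optional.

```python
import re

# The optional group can match nothing, so findall also reports an
# empty match at the end of the string.
with_empty = re.findall(r'(/\d\d\d\d)?', '/2000')

# Option 1 (assumption: the year is required): no "?" on the group.
required = re.findall(r'/\d{4}', '/2000')

# Option 2: keep the pattern and discard the empty matches afterwards.
filtered = [m for m in re.findall(r'(/\d\d\d\d)?', 'x/2000') if m]

print(with_empty)  # ['/2000', '']
print(required)    # ['/2000']
print(filtered)    # ['/2000']
```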
