regex: string with optional parts - python

I am trying to parse some docstrings.
An example docstring is:
Test if a column field is larger than a given value
This function can also be called as an operator using the '>' syntax
Arguments:
- DbColumn self
- string or float value: the value to compare to
in case of string: lexicographic comparison
in case of float: numeric comparison
Returns:
DbWhere object
Both the Arguments and Returns parts are optional. I want my regex to return as groups the description (first lines), the Arguments part (if present) and the Returns part (if present).
The regex I have now is:
m = re.search('(.*)(Arguments:.*)(Returns:.*)', s, re.DOTALL)
and works in case all three parts are present, but fails as soon as the Arguments or the Returns parts are not available. I have tried several variations with the non-greedy modifiers like ?? but to no avail.
Edit: When the Arguments and Returns parts are present, I actually would only like to match the text after Arguments: and Returns: respectively.
Thanks!

Try with:
re.search('^(.*?)(Arguments:.*?)?(Returns:.*)?$', s, re.DOTALL)
Just make the second and third groups optional by appending a ?, and make the quantifiers of the first two groups non-greedy by (again) appending a ? to them (yes, confusing).
Also, if you use the non-greedy modifier on the first group of the pattern, it'll match the shortest possible substring, which for .* is the empty string. You can overcome this by adding the end-of-string anchor ($) at the end of the pattern, which forces the first group to match as few characters as are needed to satisfy the whole pattern, i.e. the whole string when there are no Arguments and no Returns sections, and everything before those sections when they are present.
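To make the role of the trailing $ concrete, here is a small sketch that runs the same pattern with and without it. The sample docstrings and the names FULL, NO_ARGS, BARE, ANCHORED and UNANCHORED are made up for illustration and are not from the question.

```python
import re

# Hypothetical sample docstrings, invented for this illustration.
FULL = "Compare a column.\nArguments:\n- value\nReturns:\nDbWhere object"
NO_ARGS = "Compare a column.\nReturns:\nDbWhere object"
BARE = "Compare a column."

# Same pattern as in the answer, once with and once without the trailing $.
ANCHORED = re.compile(r'^(.*?)(Arguments:.*?)?(Returns:.*)?$', re.DOTALL)
UNANCHORED = re.compile(r'^(.*?)(Arguments:.*?)?(Returns:.*)?', re.DOTALL)

for text in (FULL, NO_ARGS, BARE):
    # Missing sections show up as None in the tuple of groups.
    print(ANCHORED.search(text).groups())

# Without $, the lazy first group is satisfied by the empty string
# and both optional groups are simply skipped.
print(UNANCHORED.search(FULL).groups())  # ('', None, None)
```

Note that $ also matches just before a trailing newline, so a docstring ending in "\n" may leave that newline out of the last group.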
Edit: OK, if you just want to capture the text after the Arguments: and Returns: tokens, you'll have to tuck in a couple more groups. We're not going to use all of the groups, so naming them with the (?P<name>...) notation (another question mark, argh!) is starting to make sense:
>>> m = re.search('^(?P<description>.*?)(Arguments:(?P<arguments>.*?))?(Returns:(?P<returns>.*))?$', s, re.DOTALL)
>>> m.groupdict()['description']
"Test if a column field is larger than a given value\n This function can also be called as an operator using the '>' syntax\n\n "
>>> m.groupdict()['arguments']
'\n - DbColumn self\n - string or float value: the value to compare to\n in case of string: lexicographic comparison\n in case of float: numeric comparison\n '
>>> m.groupdict()['returns']
'\n DbWhere object'
>>>

If you want to match the text after the optional Arguments: and Returns: sections, AND you don't want to use (?P<name>...) to name your capture groups, you can also use (?:...), the non-capturing version of regular parentheses.
The regex would look like this:
m = re.search('^(.*?)(?:Arguments:(.*?))?(?:Returns:(.*?))?$', doc, re.DOTALL)
#                     ^^                  ^^
According to the Python3 documentation:
(?:...)
A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.
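As a rough sketch of how the non-capturing version could be wrapped for reuse: the helper name split_docstring, the constant DOC_RE and the sample text are assumptions made for this example, and stripping whitespace is a design choice rather than something the question asked for.

```python
import re

# Non-capturing (?:...) wrappers keep only three groups:
# description, text after "Arguments:", text after "Returns:".
DOC_RE = re.compile(r'^(.*?)(?:Arguments:(.*?))?(?:Returns:(.*?))?$', re.DOTALL)

def split_docstring(doc):
    """Return (description, arguments, returns); missing parts become ''."""
    # groups(default='') turns unmatched optional groups into empty strings.
    description, arguments, returns = DOC_RE.search(doc).groups(default='')
    return description.strip(), arguments.strip(), returns.strip()

# Hypothetical sample text, invented for this illustration.
FULL_DOC = "Test a field.\nArguments:\n- value\nReturns:\nDbWhere object"

print(split_docstring(FULL_DOC))         # ('Test a field.', '- value', 'DbWhere object')
print(split_docstring("Test a field."))  # ('Test a field.', '', '')
```

Because the optional sections are wrapped in (?:...), the group numbers stay at 1, 2 and 3 whether or not a section is present.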


Regex how to match an optional character in front of a greedy capture?

I'm using python re. I have a string in the following format:
<root>.<entry_2>.<entry_3>.<entry_4>.<entry_5>...<entry_n>-<op_1>:<op_2>=<value>
I would like to capture four groups: .<entry_2>.<entry_3>...<entry_n> in one group, <op_1> in a second group, <op_2> in a third group, and <value> in the fourth group. However, I would also like -<op_1> to be optional. So, if - doesn't exist, then the second group returns empty. My current matching expression is ^.+?(\..+)[-](.*):(.*)=(.*). But [-] and [:] require those characters in order to match. And making them optional forces the first capture to overrun the - and : characters if they do exist. Is there a better way to approach this?
>>> s = '<root>.<entry_2>.<entry_3>.<entry_4>.<entry_5>...<entry_10>-<op_1>:<op_2>=<value>'
>>> re.findall(r'(\.<entry_.*entry_\d+>)(?:-(<op_\d+>))?:(<op_\d+>)=(<[^>]+>)', s)
[('.<entry_2>.<entry_3>.<entry_4>.<entry_5>...<entry_10>', '<op_1>', '<op_2>', '<value>')]
>>> s = '<root>.<entry_2>.<entry_3>.<entry_4>.<entry_5>...<entry_10>:<op_2>=<value>'
>>> re.findall(r'(\.<entry_.*entry_\d+>)(?:-(<op_\d+>))?:(<op_\d+>)=(<[^>]+>)', s)
[('.<entry_2>.<entry_3>.<entry_4>.<entry_5>...<entry_10>', '', '<op_2>', '<value>')]
I have changed entry_n to entry_10 so that it has digits instead of n for the code snippet to work.
^\+spm_.+? isn't present in input sample, so I didn't include it, but you can add it if you need it
The four groups are:
(\.<entry_.*entry_\d+>)
(?:-(<op_\d+>))? --> optional group
:(<op_\d+>)
=(<[^>]+>)
You can also use re.search(r'pat', s).groups(), but you will get None instead of an empty string for the optional group. To get an empty string instead, use .groups(default='').
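The difference between the three ways of reading the optional group can be seen in a short sketch. The shortened input string and the names PATTERN and without_op1 are invented for this example.

```python
import re

PATTERN = re.compile(r'(\.<entry_.*entry_\d+>)(?:-(<op_\d+>))?:(<op_\d+>)=(<[^>]+>)')

# Hypothetical shortened input without the optional -<op_1> part.
without_op1 = '<root>.<entry_2>.<entry_3>:<op_2>=<value>'

match = PATTERN.search(without_op1)
print(match.groups())             # optional group comes back as None
print(match.groups(default=''))   # optional group comes back as ''
print(PATTERN.findall(without_op1))  # findall already uses '' for unmatched groups
```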

expression for capturing aligned, fixed width fields of integers with failure on invalid succeeding characters

The main goal
The desired regex should fail for a given fixed width field if THE CONTENTS OF THE ENTIRE FIELD does NOT match the pattern of:
an integer
non-zero
optional +/- signs
optional padding in front (e.g., ' 1')
My current pattern succeeds in the case that any front portion matches the pattern. However it should fail unless the entire string matches this pattern.
Some examples of the kind of strings that should match (for a field of width 5; all match results are 5 characters long):
'12345' # matches up with '12345'
'+2345678' # matches up with '+2345'
'-2345678' # matches up with '-2345'
' +2345678' # matches up with ' +234'
My current attempt looks like this, which works for all of the above examples:
>>> re.match('(?= *[[+-]?[1-9][0-9]*]?)(?P<X>.{5})', ' +2-345678').group('X')
' +2-3' # should not work here!
However I want the expression to fail on the above match attempt due to the fact that the pattern found in the look ahead is interrupted by a non-integer character; in this case, -.
An additional string that is causing a problem is the following one, which should also fail:
'     1'
This one should fail because there are five spaces at the front prior to the integer. Currently the expression allows any number of spaces. I understand this is because ' *' allows any number of spaces to occur, but it is not solved by doing ' {0,5}' instead (results in a match for '     ').
An additional example that ought to fail:
'    +1'
In this case, the numerical character does not appear until the sixth position. Therefore the field of width 5 should not produce a match because the matching string '....+' (where dots are spaces) isn't a valid integer.
Some more details
Since "why do you want to do this?" is a typical question around here:
I have line types from a file containing fields of various specified lengths.
Sometimes the line fields are right-aligned, sometimes left-aligned (with the remaining characters padded with spaces). Sometimes fields contain integers (with optional +/- signs), sometimes floats OR integers (with optional +/- and optional decimals), and sometimes an arbitrary string. And for some of these fields, the entire field is allowed to be spaces, while for others they are not allowed to be empty.
All of the above details for any given field are known in advance. That is, for any specific line definition (ie combination of fields described above), I know in advance the order of each field in the line, its width, and the kind of information it contains (int, float, float OR int, or any string), including whether or not it is allowed to be blank.
What I am attempting to do is write a single regex for each line definition with pattern labels (using the (?P<NAME>EXPR) syntax) so the results can be accessed by name, like so:
m = re.match('(?P<SOMELABEL>SOME_PATTERN)', 'SOME_STRING')
m.group('SOMELABEL')
I am having trouble finding a way to prevent my regex from succeeding for a number of these field types and for a number of edge cases that I want it to fail, including the above case.
Here is a recipe you can use to force a regex expression to match a certain number of characters:
(?=.{LENGTH}(.*))EXPRESSION(?=\1$)
Example:
>>> # match 5 digits followed by "abc"
>>> pattern = re.compile(r'(?=.{5}(.*))\d+(?=\1$)abc')
>>> pattern.match('12345abc')
<_sre.SRE_Match object; span=(0, 8), match='12345abc'>
>>> pattern.match('123456abc')
>>>
If we combine this with a regex for non-zero integers padded with spaces on either side (\s*[+-]?0*[1-9]\d*\s*), it passes all the given test cases:
>>> pattern = re.compile(r'(?=.{5}(.*))\s*[+-]?0*[1-9]\d*\s*(?=\1$)')
>>> pattern.match('12345').group()
'12345'
>>> pattern.match('+2345678').group()
'+2345'
>>> pattern.match('-2345678').group()
'-2345'
>>> pattern.match(' +2345678').group()
' +234'
>>> pattern.match(' +2-345678')
>>> pattern.match('     1')
>>>
What is this sorcery?
Let's take a closer look at this recipe:
(?=.{LENGTH}(.*))EXPRESSION(?=\1$)
First, the lookahead (?=.{LENGTH}(.*)) skips LENGTH characters with .{LENGTH}. Then (.*) captures all the remaining text in group 1. In other words, we've captured all the remaining text minus the first LENGTH characters.
Afterwards, EXPRESSION matches and (hopefully) consumes exactly LENGTH characters.
Finally, we use (?=\1$) to assert that capture group 1 matches. Since group 1 contains all remaining text minus LENGTH characters, this will only match if EXPRESSION has consumed exactly LENGTH characters. We have thus forced EXPRESSION to an exact length of LENGTH characters.
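The recipe can be wrapped in a small helper. This is only a sketch: the function name fixed_width and the constant INT_FIELD are made up here, and the helper assumes it is used at the start of the pattern so that its own capture becomes group 1 (any groups inside the wrapped expression are numbered from 2).

```python
import re

def fixed_width(expression, length):
    """Wrap expression so that it must consume exactly `length` characters.

    Group 1 is reserved for the "rest of the text" capture used by the
    recipe, so numbered groups inside `expression` start at 2.
    """
    return r'(?=.{%d}(.*))(?:%s)(?=\1$)' % (length, expression)

# Non-zero integer with optional sign and optional space padding, width 5.
INT_FIELD = re.compile(fixed_width(r'\s*[+-]?0*[1-9]\d*\s*', 5))

print(INT_FIELD.match('12345').group())     # '12345'
print(INT_FIELD.match('+2345678').group())  # '+2345'
print(INT_FIELD.match(' +2-345678'))        # None: '-' interrupts the integer
print(INT_FIELD.match('     1'))            # None: the field is five spaces
```

Since . does not match a newline by default, this sketch is meant for single-line input.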
You have two issues with your current regular expression:
Additional brackets (which will match a field like [2345678):
(?= *[[+-]?[1-9][0-9]*]?)(?P<X>.{5})
     ^                ^
Rule #1 (the field must be an integer) is not applied strictly.
Fixing both will result in a shorter, working regex:
(?= {0,4}[+-]?[1-9]\d*$)(?P<X>.{5})
Update
Following the clarifications in the comments, the regex needs a small modification so that the field must end in a digit:
^(?= {0,4}[+-]?[1-9]\d*$)(?P<X>.{4}\d)
                               ^^^^^^
Although the larger problem I have to solve is a little bit more complicated, the solution to the specific issue of how to limit matches to fields of fixed width in the context of a larger string turns out to be somewhat simple using some additional features of the re module, i.e. the optional pos and endpos arguments of the fullmatch method.
Ignoring the issue of the field width momentarily: for arbitrary length integers, exclusive of 0, which therefore cannot start with 0 but can optionally start with a + or - sign, and ignoring up to four spaces of padding up front, we can compile this expression for later use:
>>> r = ' {0,4}(?P<X>[-+]?[1-9][0-9]*)'
>>> c = re.compile(r)
Now we use the fullmatch method of the compiled object to match at any arbitrary position in a longer string. Like so:
>>> starting_pos = 5
>>> s = '    a    1'
>>> c.fullmatch(s, pos=starting_pos, endpos=starting_pos+5).group('X')
'1'
The fullmatch method will reliably fail unless the entire partial string is matched, which is the desired behavior.
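A minimal sketch of how this might look for a whole line of width-5 fields. The helper name read_int_field and the sample line are invented for this example, and the pattern only handles right-aligned, non-zero integers.

```python
import re

# Up to four spaces of padding, then a non-zero integer with optional sign.
INT_FIELD = re.compile(r' {0,4}(?P<X>[-+]?[1-9][0-9]*)')

def read_int_field(line, start, width=5):
    """Return the integer text in line[start:start+width], or None if invalid."""
    m = INT_FIELD.fullmatch(line, pos=start, endpos=start + width)
    return m.group('X') if m else None

# Hypothetical line made of three width-5 fields: '   12', '  +34', ' +5-6'.
line = '   12  +34 +5-6'

print(read_int_field(line, 0))   # '12'
print(read_int_field(line, 5))   # '+34'
print(read_int_field(line, 10))  # None: ' +5-6' is not a valid integer
```

Using pos and endpos avoids slicing the line, and fullmatch rejects the field unless the pattern covers all of it.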

Extract substring using python re.match

I have a string as
sg_ts_feature_name_01_some_xyz
In this, I want to extract the two words that come after the pattern sg_ts, with the underscore separation between them.
It must be,
feature_name
This regex,
st = 'sg_ts_my_feature_01'
a = re.match('sg_ts_([a-zA-Z_]*)_*', st)
print(a.group())
returns,
sg_ts_my_feature_
whereas I expect,
my_feature
The problem is that you are asking for the whole match, not just the capture group. From the manual:
group([group1, ...])
Returns one or more subgroups of the match. If there is a single argument, the result is a single string; if there are multiple arguments, the result is a tuple with one item per argument. Without arguments, group1 defaults to zero (the whole match is returned). If a groupN argument is zero, the corresponding return value is the entire matching string; if it is in the inclusive range [1..99], it is the string matching the corresponding parenthesized group.
and you asked for a.group() which is equivalent to a.group(0) which is the whole match. Asking for a.group(1) will give you only the capture group in the parentheses.
You can ask for the group surrounded by the parentheses, 'a.group(1)', which returns
'my_feature_'
In addition, if your string is always in this form, you could also use the end-of-string anchor $ and make the inner match lazy instead of greedy (so it doesn't swallow the trailing _).
a = re.match('sg_ts_([a-zA-Z_]*?)[_0-9]*$',st)
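Putting the variants side by side may help. The last pattern is an extra sketch that assumes exactly two letter-only words are wanted after sg_ts_; it is not taken from the answers above.

```python
import re

st = 'sg_ts_my_feature_01'

greedy = re.match(r'sg_ts_([a-zA-Z_]*)_*', st)
print(greedy.group(0))  # 'sg_ts_my_feature_' (the whole match)
print(greedy.group(1))  # 'my_feature_' (greedy group swallows the trailing _)

lazy = re.match(r'sg_ts_([a-zA-Z_]*?)[_0-9]*$', st)
print(lazy.group(1))    # 'my_feature'

# Assumption: exactly two letter-only words follow the sg_ts_ prefix.
two_words = re.match(r'sg_ts_([a-zA-Z]+_[a-zA-Z]+)', 'sg_ts_feature_name_01_some_xyz')
print(two_words.group(1))  # 'feature_name'
```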

Replace pairs of characters at start of string with a single character

I only want this done at the start of the string. Some examples (I want to replace "--" with "-"):
"--foo" -> "-foo"
"-----foo" -> "---foo"
"foo--bar" -> "foo--bar"
I can't simply use s.replace("--", "-") because of the third case. I also tried a regex, but I can't get it to work specifically with replacing pairs. I get as far as trying to replace r"^(?:(-){2})+" with r"\1", but that tries to replace the full block of dashes at the start, and I can't figure how to get it to replace only pairs within that block.
Final regex was:
re.sub(r'^(-+)\1', r'\1', "------foo--bar")
^ - match start
(-+) - match at least one -, but...
\1 - an equal number must exist outside the capture group.
and finally, replace with that number of hyphens, effectively cutting the number of hyphens in half.
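A quick check of this regex against the examples from the question. The wrapper name halve_leading_hyphens is made up for this sketch.

```python
import re

def halve_leading_hyphens(text):
    """Replace each leading pair of hyphens with a single hyphen."""
    # (-+) captures half of the longest even-length run of leading hyphens,
    # \1 requires the same number again, and the replacement keeps one half.
    return re.sub(r'^(-+)\1', r'\1', text)

print(halve_leading_hyphens('--foo'))           # '-foo'
print(halve_leading_hyphens('-----foo'))        # '---foo'
print(halve_leading_hyphens('foo--bar'))        # 'foo--bar'
print(halve_leading_hyphens('------foo--bar'))  # '---foo--bar'
```

With an odd number of leading hyphens, the last hyphen has no partner and is left as it is.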
import re
print(re.sub(r'\--', '', "--foo"))
print(re.sub(r'\--', '', "-----foo"))
Output:
foo
-foo
EDIT this answer is for the OP before it was completely edited and changed.
Here's it all written out for anyone else who comes this way.
>>> foo = '--foo'
>>> bar = '-----foo'
>>> foobar = 'foo--bar'
>>> foobaz = '-----foo--bar'
>>> re.sub('^(-+)\\1', '\\1', foo)
'-foo'
>>> re.sub('^(-+)\\1', '\\1', bar)
'---foo'
>>> re.sub('^(-+)\\1', '\\1', foobar)
'foo--bar'
>>> re.sub('^(-+)\\1', '\\1', foobaz)
'---foo--bar'
The pattern for re.sub() is:
re.sub(pattern, replacement, string)
therefore in this case we want to replace -- with -. HOWEVER, the issue comes when we have -- that we don't want to replace, given by some circumstances.
In this case we only want to match -- at the beginning of a string. In regular expressions for python, the ^ character, when used in the pattern string, will only match the given pattern at the beginning of the string - just what we were looking for!
Note that the ^ character behaves differently when used within square brackets.
Square brackets can be used to indicate a set of chars, so [abc] matches 'a' or 'b' or 'c'... An up-hat (^) at the start of a square-bracket set inverts it, so [^ab] means any char except 'a' or 'b'.
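The two meanings of ^ can be compared directly in a short sketch; the sample strings are arbitrary.

```python
import re

# Outside square brackets, ^ anchors the match to the start of the string.
print(re.findall(r'^-', '-a-b'))          # ['-']

# At the start of a square-bracket set, ^ negates the set.
print(re.findall(r'[^-]', '-a-b'))        # ['a', 'b']

# Both uses together: strip a leading run of non-hyphen characters.
print(re.sub(r'^[^-]+', '', 'foo--bar'))  # '--bar'
```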
Getting back to what we were talking about. The parentheses in the pattern represent a "group"; this group can then be referenced with \\1, meaning the first group. If there were a second set of parentheses, we could then reference that sub-pattern with \\2. The extra \ is there to escape the next backslash. The pattern can also be written as the raw string r'^(-+)\1', where the r preceding the string tells Python to interpret it as a raw string, thereby eliminating the need to double the backslashes.
Now that the pattern is all set up, you just make the replacement whatever you want to replace the pattern with, and put in the string that you are searching through.
A link that I keep handy when dealing with regular expressions, is Google's developer's notes on them.

Python parentheses and returning only certain part of regex

I have a list of strings that I'm looping through. I have the following regular expression (item is the string I'm looping through at any given moment):
regularexpression = re.compile(r'set(\d+)e', re.IGNORECASE)
number = re.search(regularexpression,item).group(1)
What I want it to do is return numbers that have the word set before them and the letter e after them.
However, I also want it to return numbers that have set before them and x after them. If I use the following code:
regularexpression = re.compile(r'set(\d+)(e|x)', re.IGNORECASE)
number = re.search(regularexpression,item).group(1)
Instead of returning just the number, it also returns e or x. Is there a way to use parentheses to group my regular expression into bits without it returning everything in the parentheses?
Your example code seems fine already, but to answer your question, you can make a non-capturing group using the (?:...) syntax, e.g.:
set(\d+)(?:e|x)
Additionally, in this specific example you can just use a character class:
set(\d+)[ex]
It appears you are looking at more than just .group(1); you have two capturing groups defined in your regular expression.
You can make the second group non-capturing by using (?:...) instead of (...):
regularexpression = re.compile(r'set(\d+)(?:e|x)', re.IGNORECASE)
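A small comparison of the three spellings. The sample list items and the variable names are invented for this sketch; the loop also guards against items that do not match, where re.search returns None.

```python
import re

capturing = re.compile(r'set(\d+)(e|x)', re.IGNORECASE)
non_capturing = re.compile(r'set(\d+)(?:e|x)', re.IGNORECASE)
char_class = re.compile(r'set(\d+)[ex]', re.IGNORECASE)

print(capturing.search('set12e').groups())      # ('12', 'e')
print(non_capturing.search('set12e').groups())  # ('12',)
print(char_class.search('set12e').groups())     # ('12',)

# Hypothetical list of strings; 'set9z' does not match and is skipped.
items = ['set12e', 'SET7x', 'set9z']
numbers = []
for item in items:
    m = non_capturing.search(item)
    if m:
        numbers.append(m.group(1))
print(numbers)  # ['12', '7']
```

In all three cases group(1) is just the number; only groups() differs.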
