I would like to know the reason behind the following behaviour:
>>> re.compile("(b)").split("abc")[1]
'b'
>>> re.compile("b").split("abc")[1]
'c'
It seems that when I add parentheses around the splitting pattern, re adds the separator to the resulting list. But why? Is it something consistent, or simply an isolated feature of regular expressions?
It's a feature of re.split, according to the documentation:
If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.
In general, parentheses denote capture groups and are used to extract certain parts of a string. Read more about capture groups.
In any regular expression, parentheses denote a capture group. Capture groups are typically used to extract values from the matched string (in conjunction with re.match or re.search). For details, refer to the official documentation (search for (...)).
re.split inserts the matched groups between the split values:
If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.
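A minimal sketch of the documented behaviour, reusing the strings from the question; the variable names are only illustrative:

```python
import re

# No group: only the pieces between separators are returned.
without_group = re.split(r"b", "abc")

# Capturing group: the separator text is returned as well.
with_group = re.split(r"(b)", "abc")

# Non-capturing group: grouping only, so the separator is dropped again.
non_capturing = re.split(r"(?:b)", "abc")

print(without_group)   # ['a', 'c']
print(with_group)      # ['a', 'b', 'c']
print(non_capturing)   # ['a', 'c']
```

This is why index 1 is 'b' with the parentheses and 'c' without them.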
Suppose I want to match the following regular expression (matching e.g. https://user:passwd#localhost:8080), but want to get as much context around the matched substring as possible:
[a-zA-Z]{3,10}://[^/\\s:#]{3,32}:[^/\\s:#]{3,32}#[a-zA-Z0-9][a-zA-Z0-9-]{0,62}(?:\\.[a-zA-Z0-9][a-zA-Z0-9-]{0,62})*(?::[0-9]{1,5})?
If I use eager matching for the context, e.g. (.{0,25}), then it will possibly consume some of the characters that could be matched by [a-zA-Z]{3,10}. Conversely, if I use lazy matching, e.g. (.{0,25}?), then I will get no context at all:
>>> re.search('(.{0,25})([a-zA-Z]{3,10}://[^/\\s:#]{3,32}:[^/\\s:#]{3,32}#[a-zA-Z0-9][a-zA-Z0-9-]{0,62}(?:\\.[a-zA-Z0-9][a-zA-Z0-9-]{0,62})*(?::[0-9]{1,5})?)(.{0,25})', 'XXXXhttps://user:passwd#localhost:8080XXX').groups()
('XXXXht', 'tps://user:passwd#localhost:8080', 'XXX')
In the above example, I'd want 'ht' to be part of the matched URL, so that the URL group (the second one) would be:
'https://user:passwd#localhost:8080'
How can I specify eager matching for the context, but say that a regular expression should take precedence over neighbouring regular expressions and match as much as possible?
Rather than trying to update your regular expression to include "context" (not well-defined in your example), it seems easier to use the match object's .start() and .end() methods to get the indices in the original string to which the match corresponds. Then you can manipulate those indices as desired to read some characters before and after the match.
Note that if you want to get the start/end index of a specific capture group inside the pattern, you can use .start(group_number).
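As a rough sketch of that idea: the helper name search_with_context, the width parameter, and the sample sentence below are assumptions made for illustration. The sample uses spaces around the URL, because letters directly adjacent to the scheme would themselves be matched by [a-zA-Z]{3,10}.

```python
import re

# The URL pattern from the question, written as a raw string.
URL_RE = re.compile(
    r"[a-zA-Z]{3,10}://[^/\s:#]{3,32}:[^/\s:#]{3,32}#"
    r"[a-zA-Z0-9][a-zA-Z0-9-]{0,62}"
    r"(?:\.[a-zA-Z0-9][a-zA-Z0-9-]{0,62})*(?::[0-9]{1,5})?"
)


def search_with_context(text, width=25):
    """Return (before, match, after) for the first URL, or None."""
    m = URL_RE.search(text)
    if m is None:
        return None
    # Slice the original string around the match instead of
    # capturing the context inside the regular expression.
    before = text[max(0, m.start() - width):m.start()]
    after = text[m.end():m.end() + width]
    return before, m.group(0), after


result = search_with_context("see https://user:passwd#localhost:8080 for details")
print(result)
```

The pattern itself stays free of context groups, so the URL part is always matched as greedily as it can be.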
I'm working in Python and trying to handle statsmodels' GLM output. I'm relatively new to regular expressions.
I have strings such as
string_1 = "C(State)[T.Kansas]"
string_2 = "C(State, Treatment('Alaska'))[T.Kansas]"
I wrote the following regex:
pattern = re.compile(r'C\((.+?)\)\[T\.(.+?)\]')
print(pattern.search(string_1).group(1))
#State
print(pattern.search(string_2).group(1))
#State, Treatment('Alaska')
So both of these strings match the pattern. But we want to get State in both cases. Basically we want to get rid of everything after the comma (including the comma itself) inside the first parentheses.
How can we distinguish the string_2 pattern from string_1's and extract only State without , Treatment?
You can add an optional non-capturing group instead of just allowing all characters:
pattern = re.compile(r'C\((.+?)(?:, .+?)?\)\[T\.(.+?)\]')
(?:...) groups the contents together without capturing it. The trailing ? makes the group optional.
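A quick check of this pattern against the two strings from the question (a sketch only; the names groups_1 and groups_2 are illustrative):

```python
import re

pattern = re.compile(r"C\((.+?)(?:, .+?)?\)\[T\.(.+?)\]")

string_1 = "C(State)[T.Kansas]"
string_2 = "C(State, Treatment('Alaska'))[T.Kansas]"

# groups() lists only the capturing groups; the optional
# (?:, .+?) part is matched but not reported.
groups_1 = pattern.search(string_1).groups()
groups_2 = pattern.search(string_2).groups()

print(groups_1)  # ('State', 'Kansas')
print(groups_2)  # ('State', 'Kansas')
```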
You may use this regex, which relies on negated character classes:
C\((\w+)[^[]*\[T\.([^]]+)\]
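The same idea as a small runnable sketch. The brackets inside the character classes are escaped here, which is a deliberate variation on the pattern above: newer Python versions may emit a FutureWarning for an unescaped [ inside a set.

```python
import re

# (\w+) stops at the first non-word character (the comma or the ')'),
# [^\[]* then skips everything up to the opening '['.
pattern = re.compile(r"C\((\w+)[^\[]*\[T\.([^\]]+)\]")

string_1 = "C(State)[T.Kansas]"
string_2 = "C(State, Treatment('Alaska'))[T.Kansas]"

match_1 = pattern.search(string_1)
match_2 = pattern.search(string_2)

print(match_1.group(1), match_1.group(2))  # State Kansas
print(match_2.group(1), match_2.group(2))  # State Kansas
```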
I am trying to learn some regular expressions in Python. The following does not produce the output I expected:
with open('ex06-11.html') as f:
    a = re.findall("<div[^>]*id\\s*=\\s*([\"\'])header\\1[^>]*>(.*?)</div>", f.read())
# output: [('"', 'Some random text')]
The output I was expecting (same code, but without the backreference):
with open('ex06-11.html') as f:
    print(re.findall("<div[^>]*id\\s*=\\s*[\"\']header[\"\'][^>]*>(.*?)</div>", f.read()))
# output: ['Some random text']
The question really boils down to: why is there a quotation mark in my first output, but not in my second? I thought that ([abc]) ... \1 == [abc] ... [abc]. Am I incorrect?
From the docs on re.findall:
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
If you want the entire match to be returned, remove the capturing groups or change them to non-capturing groups by adding ?: after the opening paren. For example you would change (foo) in your regex to (?:foo).
Of course in this case you need the capturing group for the backreference, so your best bet is to keep your current regex and then use a list comprehension with re.finditer() to get a list of only the second group:
regex = re.compile(r"""<div[^>]*id\s*=\s*(["'])header\1[^>]*>(.*?)</div>""")
with open('ex06-11.html') as f:
    a = [m.group(2) for m in regex.finditer(f.read())]
A couple of side notes: you should really consider using an HTML parser like BeautifulSoup instead of regex. You should also use triple-quoted strings if you need to include single or double quotes within your string, and use raw string literals when writing regular expressions so that you don't need to escape the backslashes.
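To see the difference without the file, here is a small sketch that uses an inline HTML string (an assumption standing in for the contents of ex06-11.html):

```python
import re

# Stand-in for the file contents; the real file is not shown in the question.
html = """<div id="header">Some random text</div>"""

regex = re.compile(r"""<div[^>]*id\s*=\s*(["'])header\1[^>]*>(.*?)</div>""")

# findall returns one tuple per match, containing every capturing group.
tuples = regex.findall(html)

# finditer yields match objects, so a single group can be selected.
texts = [m.group(2) for m in regex.finditer(html)]

print(tuples)  # [('"', 'Some random text')]
print(texts)   # ['Some random text']
```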
The behaviour is clearly documented. See re.findall:
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found.
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
So, if your pattern contains capture groups, findall returns the text of those groups for each match (as tuples when there is more than one group) instead of the whole match, group(0).
So, either you use a non-capturing group - (?:[\"\']), or don't use any group at all, as in your 2nd case.
P.S.: Use raw string literals for your regex pattern to avoid escaping your backslashes. Also, compile your regex outside the loop, so that it is not re-compiled on every iteration. Use re.compile for that.
When I asked this question I was just starting with regular expressions. I have since read the docs completely, and I just wanted to share what I found out.
Firstly, as Rohit and F.J suggested, use raw strings (to make the regex more readable and less error-prone) and compile your regex beforehand using re.compile. Suppose we want to match an HTML div whose id is 'header':
s = "<div id='header'>Some random text</div>"
We would need a regex like:
p = re.compile(r'<div[^>]*id\s*=\s*([\"\'])header\1[^>]*>(.*?)</div>')
In the Python implementation of regex, a capturing group is made by enclosing part of your regex in parentheses (...). Capturing groups capture the span of text that they match. They are also needed for backreferencing. So in my regex above, I have two capturing groups: ([\"\']) and (.*?). The first one is needed to make the backreference \1 possible. The use of backreferences (and the fact that they refer back to a capturing group) has consequences, however. As pointed out in the other answers to this question, when using findall on my pattern p, findall will return the matches from all groups and put them in a list of tuples:
print(p.findall(s))
# [("'", 'Some random text')]
Since we only want the plain text from between the HTML tags, this is not the output we're looking for.
(Arguably, we could use:
print(p.findall(s)[0][1])
# Some random text
But this may be a bit contrived.)
So in order to return only the text from between the HTML tags (captured by the second group), we call group() on the match object returned by p.search():
print(p.search(s).group(2))
# Some random text
I'm fully aware that all but the most simple HTML should not be handled by regex, and that instead you should use a parser. But this was just a tutorial example for me to grasp the basics of regex in Python.
I couldn't understand why this regular expression,
re.findall(r"(do|re|mi)+","mimi rere midore"),
generates this result,
['mi', 're', 're'].
My expected result is ['mimi', 'rere', 'midore']...
However, when I use this regular expression,
re.findall(r"(?:do|re|mi)+","mimi rere midore"),
it generates the result as expected.
Can you tell me the difference between the two regular expressions?
Thank you.
The difference is in the capturing group. With a capturing group, findall() returns only what was captured. Without a capturing group, the whole match is returned.
In your first example, the group captures only one two-letter alternative at a time, and when the group repeats, only its last repetition is kept. In the second example, the whole match includes any repetitions.
The re.findall() documentation is quite clear on the difference:
Return all non-overlapping matches of pattern in string, as a list of strings. […] If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
If your (do|re|mi)+ pattern is part of a larger pattern and you want findall() to only return the full repeated set of characters, use a non-capturing group for the two-letter options with a capturing group around the whole:
r'Some example text: ((?:do|re|mi)+)'
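The three variants side by side, as a minimal sketch; the "notes: " prefix in the last call is a made-up stand-in for a larger surrounding pattern:

```python
import re

text = "mimi rere midore"

# Repeated capturing group: only the last repetition is kept.
last_only = re.findall(r"(do|re|mi)+", text)

# Non-capturing group: findall returns the whole match.
whole = re.findall(r"(?:do|re|mi)+", text)

# Inside a larger pattern: capture the whole repeated run explicitly.
embedded = re.findall(r"notes: ((?:do|re|mi)+)", "notes: midore")

print(last_only)  # ['mi', 're', 're']
print(whole)      # ['mimi', 'rere', 'midore']
print(embedded)   # ['midore']
```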
I have the following code:
import re

haystack = "aaa months(3) bbb"
needle = re.compile(r'(months|days)\([\d]*\)')
instances = list(set(needle.findall(haystack)))
print(instances)
I'd expect it to print months(3) but instead I just get months. Is there any reason for this?
needle = re.compile(r'((?:months|days)\([\d]*\))')
fixes your problem.
You were capturing only the months|days part.
In this specific situation, this regex is a bit better:
needle = re.compile(r'((?:months|days)\(\d+\))')
This way you only get results that contain a number; previously, a string like months() would also match. If you want to ignore case for options like Months or Days, also add the re.IGNORECASE flag, like this:
re.compile(r'((?:months|days)\(\d+\))', re.IGNORECASE)
Some explanation for the OP:
A regular expression is composed of many elements, chief among them the capturing group, written (...). Sometimes we want to group without capturing, so we use (?:...). There are other kinds of groups, but these are the most common.
In this case, we surround the entire regular expression with a capturing group because you want to capture everything. When a pattern contains no capturing group, findall returns the whole match, as if the pattern were wrapped in one; because you specified a group explicitly, findall returned only that group.
Having surrounded the entire regular expression with a capturing group, we turn the inner group into a non-capturing one by adding ?: after its opening parenthesis, as shown above. We could also have skipped the outer group and only made the inner group non-capturing, since findall returns the whole match when no capturing group is present. I personally prefer explicit coding.
Further information about regular expressions can be found here: http://docs.python.org/library/re.html
Parens are not just for grouping, but also for forming capture groups. What you want is re.compile(r'(?:months|days)\(\d+\)'). That uses a non-capturing group for the or condition, and will not get you a bunch of subgroup matches you don't appear to want when using findall.
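A short comparison of the patterns discussed in this thread. The haystack is extended with a hypothetical "Days(10)" entry so that the re.IGNORECASE variant has something extra to match:

```python
import re

# "Days(10)" is added for illustration; the question only has months(3).
haystack = "aaa months(3) bbb Days(10)"

# Original pattern: findall returns only the captured alternative.
captured = re.findall(r"(months|days)\([\d]*\)", haystack)

# Non-capturing alternation: findall returns the whole match.
whole = re.findall(r"(?:months|days)\(\d+\)", haystack)

# Same pattern, ignoring case.
any_case = re.findall(r"(?:months|days)\(\d+\)", haystack, re.IGNORECASE)

print(captured)  # ['months']
print(whole)     # ['months(3)']
print(any_case)  # ['months(3)', 'Days(10)']
```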