Python regular expression not matching properly

I have a string
"aaabbbbccc"
I want to retrieve
["aaa", "bbbb", "ccc"]
According to this post
What regex can match sequences of the same character?
In [8]: re.findall('(\w)\1+', s)
Out[8]: []
I think I successfully retrieved this pattern using an online regex tester.

There are two things you should consider here:
1) Use raw string literals when defining regex (or double escape the \ inside the pattern so that \1 could be parsed as a backreference and not as an octal character notation), and
2) Use re.finditer here to get whole match values since re.findall will fetch only the values captured with capturing groups:
import re
s = 'aaabbbbccc'
print([x.group() for x in re.finditer(r'(\w)\1+', s)])
Here, x.group() is the whole match stored inside the match object (re.Match in Python 3) that re.finditer yields for each match.
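To make the two points above concrete, here is a minimal sketch comparing the non-raw pattern, re.findall with a raw pattern, and re.finditer. It assumes Python 3, and the variable names are illustrative only:

```python
import re

s = "aaabbbbccc"

# In a non-raw string, '\1' is the control character chr(1), not a backreference.
# ('\\w' is doubled here only to avoid an invalid-escape warning.)
non_raw = re.findall('(\\w)\1+', s)

# With a raw string the backreference works, but findall returns only group 1.
captured_only = re.findall(r'(\w)\1+', s)

# finditer yields match objects, so .group() gives the whole match.
whole_matches = [m.group() for m in re.finditer(r'(\w)\1+', s)]

# Alternative: wrap everything in an outer group and pick it out of each tuple.
via_outer_group = [whole for whole, _ in re.findall(r'((\w)\2+)', s)]

print(non_raw)          # []
print(captured_only)    # ['a', 'b', 'c']
print(whole_matches)    # ['aaa', 'bbbb', 'ccc']
print(via_outer_group)  # ['aaa', 'bbbb', 'ccc']
```

The outer-group variant keeps re.findall but renumbers the backreference to \2, which is easy to get wrong, so re.finditer is probably the clearer choice.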

Related

Regular expressions: distinguish strings including/excluding a given word

I'm working in Python and try to handle StatsModel's GLM output. I'm relatively new to regular expressions.
I have strings such as
string_1 = "C(State)[T.Kansas]"
string_2 = "C(State, Treatment('Alaska'))[T.Kansas]"
I wrote the following regex:
pattern = re.compile(r'C\((.+?)\)\[T\.(.+?)\]')
print(pattern.search(string_1).group(1))
#State
print(pattern.search(string_2).group(1))
#State, Treatment('Alaska')
So both of these strings match the pattern, but we want to get State in both cases. Basically, we want to get rid of everything from the comma onward inside the first pair of parentheses.
How can we distinguish the string_2 pattern from string_1's and extract only State without , Treatment?
You can add an optional non-capturing group instead of just allowing all characters:
pattern = re.compile(r'C\((.+?)(?:, .+?)?\)\[T\.(.+?)\]')
(?:...) groups the contents together without capturing it. The trailing ? makes the group optional.
You may also use this regex, which relies on negated character classes:
C\((\w+)[^[]*\[T\.([^]]+)\]
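As a quick check, both suggested patterns can be run against the two strings from the question. This is a sketch assuming Python 3; the names optional_group, negated_classes and results are made up for the illustration:

```python
import re

string_1 = "C(State)[T.Kansas]"
string_2 = "C(State, Treatment('Alaska'))[T.Kansas]"

# Pattern with an optional non-capturing group for the ", ..." part.
optional_group = re.compile(r"C\((.+?)(?:, .+?)?\)\[T\.(.+?)\]")

# Pattern based on negated character classes.
negated_classes = re.compile(r"C\((\w+)[^[]*\[T\.([^]]+)\]")

results = []
for pattern in (optional_group, negated_classes):
    for text in (string_1, string_2):
        # groups() returns (factor name, level) for each string
        results.append(pattern.search(text).groups())

for groups in results:
    print(groups)  # ('State', 'Kansas') each time
```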

insert string at beginning of regex in python

In below code, the string "Graph" is replacing the matched regex:
htmlText = re.sub("[0-9]*/index.html", 'Graph', htmlText, re.MULTILINE|re.DOTALL)
But the problem is, I want to prepend 'Graph' to the beginning of the matched '[0-9]*/index.html' expression, not replace it.
You want to capture the match (by surrounding your regex with parentheses), then backreference it (via \1), using a raw string (via r before the replacement string) to prevent the backslash from being treated as an escape character. Note also that the fourth positional argument of re.sub is count, not flags, so the flags have to be passed with the flags= keyword:
In [1]: import re
In [2]: htmlText = "5/index.html"
In [3]: re.sub("([0-9]*/index.html)", r'Graph\g<1>', htmlText, flags=re.MULTILINE|re.DOTALL)
Out[3]: 'Graph5/index.html'
Edit: Changed r'Graph\1' to r'Graph\g<1>' above, since that is more reliable if someone uses this answer in a context where the backreference is followed by another digit. See the docs, https://docs.python.org/2/library/re.html#re.sub, which state:
\g<2> is therefore equivalent to \2, but isn’t ambiguous in a replacement such as \g<2>0
Note: Example above uses Python 2.7.6.
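The difference between \1 and \g<1> can be seen in a short sketch. It assumes Python 3; the sample strings and variable names are invented for the illustration:

```python
import re

html_text = "see 5/index.html and 12/index.html"

# \g<1> re-inserts the captured text after the literal prefix.
prefixed = re.sub(r"([0-9]*/index\.html)", r"Graph\g<1>", html_text)
print(prefixed)  # see Graph5/index.html and Graph12/index.html

# When a digit must follow the backreference, \g<1>0 is unambiguous...
appended = re.sub(r"(\d+)", r"\g<1>0", "7 apples")
print(appended)  # 70 apples

# ...whereas \10 is read as a reference to group 10, which does not exist here.
try:
    re.sub(r"(\d+)", r"\10", "7 apples")
    ambiguous_failed = False
except re.error:
    ambiguous_failed = True
print(ambiguous_failed)  # True
```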

Is there a way to refer to the entire matched expression in re.sub without the use of a group?

Suppose I want to prepend all occurrences of a particular expression with a character such as \.
In sed, it would look like this.
echo '__^^^%%%__FooBar' | sed 's/[_^%]/\\&/g'
Note that the & character is used to represent the original matched expression.
I have looked through the regex docs and the regex howto, but I do not see an equivalent to the & character that can be used to substitute in the matched expression.
The only workaround I have found is to use an extra set of () to group the expression and then reference the group, as follows.
import re
line = "__^^^%%%__FooBar"
print(re.sub("([_%^$])", r"\\\1", line))
Is there a clean way to reference the entire matched expression without the extra group creation?
From the docs:
The backreference \g<0> substitutes in the entire substring matched by the RE.
Example:
>>> print(re.sub("[_%^$]", r"\\\g<0>", line))
\_\_\^\^\^\%\%\%\_\_FooBar
You could also get the same result by using a positive lookahead.
>>> print(re.sub("(?=[_%^$])", r"\\", line))
\_\_\^\^\^\%\%\%\_\_FooBar
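For reference, the approaches above can be put side by side in one Python 3 sketch. The callable replacement is an extra option that the answers do not mention, and the variable names are illustrative:

```python
import re

line = "__^^^%%%__FooBar"

# \g<0> stands for the entire match, so no capturing group is needed.
with_group_zero = re.sub(r"[_%^$]", r"\\\g<0>", line)

# A zero-width lookahead matches the position before each special character.
with_lookahead = re.sub(r"(?=[_%^$])", r"\\", line)

# A function can also build the replacement from the match object.
with_callable = re.sub(r"[_%^$]", lambda m: "\\" + m.group(0), line)

print(with_group_zero)  # \_\_\^\^\^\%\%\%\_\_FooBar
print(with_lookahead)   # \_\_\^\^\^\%\%\%\_\_FooBar
print(with_callable)    # \_\_\^\^\^\%\%\%\_\_FooBar
```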

Backreferencing in Python: findall() method output for HTML string

I am trying to learn some regular expressions in Python. The following does not produce the output I expected:
with open('ex06-11.html') as f:
    a = re.findall("<div[^>]*id\\s*=\\s*([\"\'])header\\1[^>]*>(.*?)</div>", f.read())
# output: [('"', 'Some random text')]
The output I was expecting (same code, but without the backreference):
with open('ex06-11.html') as f:
    print(re.findall("<div[^>]*id\\s*=\\s*[\"\']header[\"\'][^>]*>(.*?)</div>", f.read()))
# output: ['Some random text']
The question really boils down to: why is there a quotation mark in my first output, but not in my second? I thought that ([abc]) ... \1 == [abc] ... [abc]. Am I incorrect?
From the docs on re.findall:
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
If you want the entire match to be returned, remove the capturing groups or change them to non-capturing groups by adding ?: after the opening paren. For example you would change (foo) in your regex to (?:foo).
Of course in this case you need the capturing group for the backreference, so your best bet is to keep your current regex and then use a list comprehension with re.finditer() to get a list of only the second group:
regex = re.compile(r"""<div[^>]*id\s*=\s*(["'])header\1[^>]*>(.*?)</div>""")
with open('ex06-11.html') as f:
    a = [m.group(2) for m in regex.finditer(f.read())]
A couple of side notes: you should really consider using an HTML parser like BeautifulSoup instead of regex. You should also use triple-quoted strings if you need to include single or double quotes within your string, and use raw string literals when writing regular expressions so that you don't need to escape the backslashes.
The behaviour is clearly documented. See re.findall:
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found.
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
So, if you have capture groups in your regex pattern, the findall method returns only the captured groups for each match: a list of strings for a single group, or a list of tuples for several groups. The whole match, group(0), is not included.
So, either you use a non-capturing group - (?:[\"\']), or don't use any group at all, as in your 2nd case.
P.S.: Use raw string literals for your regex pattern to avoid escaping your backslashes. Also, compile your regex outside the loop so that it is not re-compiled on every iteration. Use re.compile for that.
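The documented behaviour is easy to see by varying the number of groups. The following is a sketch for Python 3; the sample string and names are assumptions, since the contents of ex06-11.html are not shown:

```python
import re

s = "<div id='header'>Some random text</div>"

# No groups: findall returns the whole matches.
no_groups = re.findall(r"<div[^>]*>", s)

# One group: findall returns a list of that group's text.
one_group = re.findall(r"<div[^>]*>(.*?)</div>", s)

# Two groups: findall returns a list of tuples, one item per group.
two_group_re = re.compile(r"""<div[^>]*id\s*=\s*(["'])header\1[^>]*>(.*?)</div>""")
two_groups = two_group_re.findall(s)

# finditer lets you keep the backreference and still pick a single group.
second_only = [m.group(2) for m in two_group_re.finditer(s)]

print(no_groups)    # ["<div id='header'>"]
print(one_group)    # ['Some random text']
print(two_groups)   # [("'", 'Some random text')]
print(second_only)  # ['Some random text']
```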
When I asked this question I was just starting with regular expressions. I have since read the docs completely, and I just wanted to share what I found out.
Firstly, what Rohit and F.J suggested, use raw strings (to make the regex more readable and less error-prone) and compile your regex beforehand using re.compile. To match an HTML string whose id is 'header':
s = "<div id='header'>Some random text</div>"
We would need a regex like:
p = re.compile(r'<div[^>]*id\s*=\s*([\"\'])header\1[^>]*>(.*?)</div>')
In the Python implementation of regex, a capturing group is made by enclosing part of your regex in parentheses (...). Capturing groups capture the span of text that they match. They are also needed for backreferencing. So in my regex above, I have two capturing groups: ([\"\']) and (.*?). The first one is needed to make the backreference \1 possible. The use of backreferences (and the fact that they refer back to a capturing group) has consequences, however. As pointed out in the other answers to this question, when using findall on my pattern p, findall will return the matches from all groups and put them in a list of tuples:
print(p.findall(s))
# [("'", 'Some random text')]
Since we only want the plain text from between the HTML tags, this is not the output we're looking for.
(Arguably, we could use:
print(p.findall(s)[0][1])
# Some random text
But this may be a bit contrived.)
So in order to return only the text from between the HTML tags (captured by the second group), we use the group() method on p.search():
print(p.search(s).group(2))
# Some random text
I'm fully aware that all but the most simple HTML should not be handled by regex, and that instead you should use a parser. But this was just a tutorial example for me to grasp the basics of regex in Python.
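For comparison, here is roughly what the parser route could look like. The answers recommend BeautifulSoup; this sketch swaps in the standard library's html.parser so that it runs without extra packages. The class name HeaderDivText and the sample markup are hypothetical:

```python
from html.parser import HTMLParser


class HeaderDivText(HTMLParser):
    """Collect the text inside <div id="header"> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # nesting depth of <div> tags inside the target div
        self.texts = []  # collected text, one entry per target div

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            # a nested <div> inside the target: track it so its end tag is paired
            self.depth += 1
        elif dict(attrs).get("id") == "header":
            self.depth = 1
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.texts[-1] += data


parser = HeaderDivText()
parser.feed("<div id='header'>Some random text</div><div id=\"other\">skip</div>")
parser.close()
header_texts = parser.texts
print(header_texts)  # ['Some random text']
```

The parser handles either quote style around the attribute value on its own, so no backreference is needed.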

Extracting Data with Python Regular Expressions

I am having some trouble wrapping my head around Python regular expressions to come up with a regular expression to extract specific values.
The page I am trying to parse has a number of productIds which appear in the following format
\"productId\":\"111111\"
I need to extract all the values, 111111 in this case.
import re
t = "\"productId\":\"111111\""
m = re.match(r"\W*productId[^:]*:\D*(\d+)", t)
if m:
    print(m.group(1))
This means: match non-word characters (\W*), then productId followed by non-colon characters ([^:]*) and a :. Then match non-digits (\D*), and match and capture the digits that follow ((\d+)).
Output
111111
Something like this:
In [13]: s=r'\"productId\":\"111111\"'
In [14]: print(s)
\"productId\":\"111111\"
In [15]: import re
In [16]: re.findall(r'\d+', s)
Out[16]: ['111111']
The backslashes here might add to the confusion, because they are used as an escape character both by (non-raw) Python strings and by the regexp syntax.
This extracts the product ids from the format you posted:
re_prodId = re.compile(r'\\"productId\\":\\"([^"]+)\\"')
The raw string r'...' does away with one level of backslash escaping; the use of a single quote as the string delimiter does away with the need to escape double quotes; and finally the backslashes are doubled (only once) because of their special meaning in the regexp language.
You can use the regexp object's findall() method to find all matches in some text:
re_prodId.findall(text_to_search)
This will return a list of all product ids.
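A small Python 3 sketch of this approach follows. The sample text is invented, and the second pattern (lenient) is a hypothetical variant that also accepts input where the quotes are not backslash-escaped:

```python
import re

# Invented sample: the backslashes are literally present in the page source.
text = r'{\"productId\":\"111111\",\"name\":\"A\"},{\"productId\":\"222222\",\"name\":\"B\"}'

# Pattern from the answer: each \\ matches one literal backslash.
re_prod_id = re.compile(r'\\"productId\\":\\"([^"]+)\\"')
product_ids = re_prod_id.findall(text)
print(product_ids)  # ['111111', '222222']

# Hypothetical variant: make each backslash optional and capture digits only.
lenient = re.compile(r'\\?"productId\\?":\\?"(\d+)')
lenient_escaped = lenient.findall(text)
lenient_plain = lenient.findall('{"productId":"333333"}')
print(lenient_escaped)  # ['111111', '222222']
print(lenient_plain)    # ['333333']
```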
Try this,
:\\"(\d*)\\"
Give more examples of your data if this doesn't do what you want.
