Replacing single words with re.sub in Python

Spoiler: yes, this is an assignment. It is solved, but out of personal interest I want to understand the following.
I'm currently working on a syntax highlighter for an assignment: we read in a file and, using a dictionary of regexes, colour the keywords accordingly.
I'm having some issues, though.
for i in iterations:
    pass
For the first line above, the regex
r'(\t*for.*in.*?:.?)' works, but it colours the entire line. While that is allowed, I would really like it to mark only for/in.
Trying r'(\bfor\b|\bin\b)' does not work for me, and neither does r'(for)' or r'(\sfor\s)'.
I read the whole code into one string and use re.sub() to replace all occurrences with colour + r'\1' + colour_end, where colour and colour_end are the colour sequences.

You may use capturing and backreferences:
^(\t*)(for\b)(.*)\b(in)\b(.*?:)
Replace with $1<color>$2</color>$3<color>$4</color>$5. In Python's re.sub, write the backreferences as \1 to \5 (or \g<1> to \g<5>), and pass re.MULTILINE so that ^ matches at the start of every line of the string.
Here, the expression is split into 5 subparts with (...) capturing groups. In the replacement pattern, the captured values are referred to with backreferences of the form $n, where n is the number of the capturing group in the pattern.
If you cannot run a single regex with multiple capturing groups, run two in sequence:
^(\t*)for\b(?=.*\bin\b.*?:) --> $1<color>for</color>
^(\t*for\b.*)\bin\b(?=.*?:) --> $1<color>in</color>
Apply the in replacement before the for replacement: once for has been wrapped, the line no longer starts with for, so the second pattern would not match.
The single capturing group is around the part before the word, and the part after the word is not matched but checked with a positive lookahead.
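In Python, the five-group pattern above might be applied roughly as follows. This is only a sketch: the <color> markers and the sample source string are placeholders, and re.MULTILINE is assumed because the whole file is held in one string.

```python
import re

# Placeholder markers; real colour sequences would go here.
START, END = "<color>", "</color>"

# Sample input standing in for the file contents (an assumption).
source = "for i in iterations:\n\tpass\n"

# re.MULTILINE lets ^ match at the start of every line, not just the string.
pattern = re.compile(r'^(\t*)(for\b)(.*)\b(in)\b(.*?:)', re.MULTILINE)

# \g<n> is Python's unambiguous spelling of the $n backreference.
replacement = r'\g<1>{s}\g<2>{e}\g<3>{s}\g<4>{e}\g<5>'.format(s=START, e=END)

highlighted = pattern.sub(replacement, source)
print(highlighted)
```

Only the for line is touched; the indented pass line has no for ... in ...: structure and is left as it is.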

Here's my solution:
import re
STR = """
for i in iterations:
pass
"""
pattern = r'(\b)(for|in|pass)(\b)'
change = r'\1<COLOR>\2</COLOR>\3'
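print(re.sub(pattern, change, STR))
So I'm capturing the keywords in group 2 and wrapping them in the colour tags. Note that \b matches a zero-width word boundary, not whitespace, so groups 1 and 3 are always empty; r'\b(for|in|pass)\b' with the replacement r'<COLOR>\1</COLOR>' does the same thing.
This gives: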
<COLOR>for</COLOR> i <COLOR>in</COLOR> iterations:
<COLOR>pass</COLOR>
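The same idea can be wrapped in a small helper. This is a sketch under assumptions: the function name highlight, its parameters and the default markers are made up for illustration. A function is used as the replacement so that backslashes inside the colour sequences are never interpreted as template escapes.

```python
import re

def highlight(code, keywords, start="<COLOR>", end="</COLOR>"):
    """Wrap each whole-word keyword in start/end markers (hypothetical helper)."""
    # re.escape guards against keywords containing regex metacharacters.
    alternation = '|'.join(re.escape(word) for word in keywords)
    pattern = r'\b(' + alternation + r')\b'
    # A callable replacement inserts the markers literally.
    return re.sub(pattern, lambda m: start + m.group(1) + end, code)

sample = "for i in iterations:\n    print(i)\n"
result = highlight(sample, ["for", "in"])
print(result)
```

Because of the \b anchors, the "in" inside print and the "for"-like fragments inside longer identifiers are not coloured.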

Related

Why doesn't \0 work in Python regexp substitutions, i.e. with sub() or expand(), while match.group(0) does, and also \1, \2, ...?

Why doesn't \0 work (i.e. to return the full match) in Python regexp substitutions, i.e. with sub() or match.expand(), while match.group(0) does, and also \1, \2, ... ?
This simple example (executed in Python 3.7) says it all:
import re
subject = '123'
regexp_pattern = r'\d(2)\d'
expand_template_full = r'\0'
expand_template_group = r'\1'
regexp_obj = re.compile(regexp_pattern)
match = regexp_obj.search(subject)
if match:
    print('Full match, by method: {}'.format(match.group(0)))
    print('Full match, by template: {}'.format(match.expand(expand_template_full)))
    print('Capture group 1, by method: {}'.format(match.group(1)))
    print('Capture group 1, by template: {}'.format(match.expand(expand_template_group)))
The output from this is:
Full match, by method: 123
Full match, by template:
Capture group 1, by method: 2
Capture group 1, by template: 2
Is there any other sequence I can use in the replacement/expansion template to get the full match? If not, for the love of god, why?
Is this a Python bug?
Huh, you're right, that is annoying!
Fortunately, Python's way ahead of you. The docs for sub say this:
In string-type repl arguments, in addition to the character escapes and backreferences described above, \g<name> will use the substring matched by the group named name, as defined by the (?P<name>...) syntax. \g<number> uses the corresponding group number.... The backreference \g<0> substitutes in the entire substring matched by the RE.
So your code example can be:
import re
subject = '123'
regexp_pattern = r'\d(2)\d'
expand_template_full = r'\g<0>'
regexp_obj = re.compile(regexp_pattern)
match = regexp_obj.search(subject)
if match:
    print('Full match, by template: {}'.format(match.expand(expand_template_full)))
You also asked the far more interesting question of "why?". The rationale in the docs explains that you can use this to replace with more than 10 capture groups, because it's not clear whether \10 should be substituted with the 10th group, or with the first capture group followed by a zero, but doesn't explain why \0 doesn't work. I've not been able to find a PEP explaining the rationale, but here's my guess:
We want the repl argument to re.sub to use the same capture group backreferencing syntax as in regex matching. When regex matching, the concept of \0 "backreferencing" to the entire matched string is nonsensical; the hypothetical regex r'A\0' would match an infinitely long string of A characters and nothing else. So we cannot allow \0 to exist as a backreference. If you can't match with a backreference that looks like that, you shouldn't be able to replace with it either.
I can't say I agree with this logic, \g<> is already an arbitrary extension, but it's an argument that I can see someone making.
If you look in the docs, you will find the following:
The backreference \g<0> substitutes in the entire substring matched by the RE.
Going further back in the docs (as early as 2003), you will find this tip:
There is a group 0, which is the entire matched pattern, but it can't be referenced with \0; instead, use \g<0>.
So follow this recommendation and use \g<0>:
expand_template_full = r'\g<0>'
Quoting from https://docs.python.org/3/library/re.html
\number
Matches the contents of the group of the same number. Groups are numbered starting from 1. For example, (.+) \1 matches 'the the' or '55 55', but not 'thethe' (note the space after the group). This special sequence can only be used to match one of the first 99 groups. If the first digit of number is 0, or number is 3 octal digits long, it will not be interpreted as a group match, but as the character with octal value number. Inside the '[' and ']' of a character class, all numeric escapes are treated as characters.
To summarize:
Use \1, \2 up to \99 provided no more digits are present after the numbered backreference
Use \g<0>, \g<1>, etc (not limited to 99) to robustly backreference a group
As far as I know, \g<0> is useful in the replacement section to refer to the entire matched portion, but it wouldn't make sense in the search section.
If you use the third-party regex module, then (?0) is useful in the search section as well, for example to create recursively matching patterns.
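The points above can be checked with a few calls to re.sub. This is a small sketch with made-up sample strings; the variable names are only for illustration.

```python
import re

# \g<0> stands for the whole match in a replacement template.
wrapped = re.sub(r'\d+', r'[\g<0>]', 'a1b22')

# \g<1>0 means "group 1, then a literal 0".
padded = re.sub(r'(\d)', r'\g<1>0', 'a1b2')

# \10 is read as a reference to group 10, which does not exist here.
try:
    re.sub(r'(\d)', r'\10', 'a1b2')
    error = None
except re.error as exc:
    error = exc

# \0 is parsed as an octal escape, so it yields a NUL character.
nul = re.sub(r'\d', r'\0', 'a1')

print(wrapped, padded, repr(nul), error)
```

This is why the "Full match, by template" line in the question looks empty: the template produced a NUL character, not an empty string.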

Regular expressions: distinguish strings including/excluding a given word

I'm working in Python and trying to handle statsmodels' GLM output. I'm relatively new to regular expressions.
I have strings such as
string_1 = "C(State)[T.Kansas]"
string_2 = "C(State, Treatment('Alaska'))[T.Kansas]"
I wrote the following regex:
pattern = re.compile(r'C\((.+?)\)\[T\.(.+?)\]')
print(pattern.search(string_1).group(1))
#State
print(pattern.search(string_2).group(1))
#State, Treatment('Alaska')
So both of these strings match the pattern. But we want to get State in both cases. Basically, we want to get rid of everything from the comma onwards inside the first pair of parentheses.
How can we distinguish the string_2 pattern from string_1's and extract only State, without ", Treatment('Alaska')"?
You can add an optional non-capturing group instead of just allowing all characters:
pattern = re.compile(r'C\((.+?)(?:, .+?)?\)\[T\.(.+?)\]')
(?:...) groups the contents together without capturing it. The trailing ? makes the group optional.
You may use this regex using negative character classes:
C\((\w+)[^[]*\[T\.([^]]+)\]
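Both suggested patterns can be tried against the two sample strings. In this sketch the brackets inside the character classes are escaped, which matches the same text as the unescaped form; the variable names are only illustrative.

```python
import re

names = [
    "C(State)[T.Kansas]",
    "C(State, Treatment('Alaska'))[T.Kansas]",
]

# Variant 1: optional non-capturing group swallows ", Treatment(...)".
optional_group = re.compile(r"C\((.+?)(?:, .+?)?\)\[T\.(.+?)\]")

# Variant 2: \w+ stops at the comma, then everything up to "[" is skipped.
negated_class = re.compile(r"C\((\w+)[^\[]*\[T\.([^\]]+)\]")

results_optional = [optional_group.search(n).groups() for n in names]
results_negated = [negated_class.search(n).groups() for n in names]

print(results_optional)
print(results_negated)
```

Both variants yield ('State', 'Kansas') for each of the two strings.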

re.sub part of string: (?: ...) mystery [duplicate]

I have a character string:
temp = '4424396.6\t1\tk__Bacteria\tp__Firmicutes\tc__Erysipelotrichi\to__Erysipelotrichales'
And I need to get rid of the tabs only between the taxonomy terms.
I tried
re.sub(r'(?:\D{1})\t', ',', temp)
It came quite close, but also replaced the letter before tabs:
'4424396.6\t1\tk__Bacteri,p__Firmicute,c__Erysipelotrich,o__Erysipelotrichales'
I am confused, because the re documentation for (?:...) says:
...the substring matched by the group cannot be retrieved after
performing a match or referenced later in the pattern.
The last letter was within the parenthesis, so how could it be replaced?
PS
I used re.sub(r'(?<=\D{1})(\t)', ',', temp) and it works perfectly fine, but I can't understand what's wrong with the first regexp
The text matched by (?:...) does not form a capture group, as does (...), and therefore cannot be referred to later with a backreference such as \1. However, it's still part of the overall match, and is part of the text that re.sub() will replace.
The point of non-capturing groups is that they are slightly more efficient, and may be required in uses such as re.split() where the mere existence of capturing groups will affect the output.
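The re.split() case mentioned above is easy to see on a toy string. This is a minimal sketch; the input is invented for illustration.

```python
import re

# With a capturing group, the separators are included in the result.
with_capture = re.split(r'(,)', 'a,b,c')

# With a non-capturing group, only the pieces between separators remain.
without_capture = re.split(r'(?:,)', 'a,b,c')

print(with_capture)
print(without_capture)
```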
According to the documentation, (?:...) specifies a non-capturing group. It explains:
Sometimes you’ll want to use a group to collect a part of a regular expression, but aren’t interested in retrieving the group’s contents.
What this means is that anything that matches the ... expression (in your case, the preceding letter) will not be captured as a group but will still be part of the match. The only thing special about this is that you won't be able to access the part of the input captured by this group using match.group:
Except for the fact that you can’t retrieve the contents of what the group matched, a non-capturing group behaves exactly the same as a capturing group
In contrast, (?<=...) is a positive lookbehind assertion; the regular expression will check to make sure any matches are preceded by text matching ..., but won't capture that part.
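The difference can be seen side by side on a shortened version of the string from the question. The third variant, which captures the letter and puts it back with \1, is an extra option added here for comparison and was not part of the original answers.

```python
import re

temp = '4424396.6\t1\tk__Bacteria\tp__Firmicutes'

# Non-capturing group: the letter is still part of the match, so it is replaced.
consumed = re.sub(r'(?:\D)\t', ',', temp)

# Lookbehind: the letter is only checked, not matched, so it survives.
lookbehind = re.sub(r'(?<=\D)\t', ',', temp)

# Capturing group plus backreference: the letter is matched and written back.
restored = re.sub(r'(\D)\t', r'\1,', temp)

print(repr(consumed))
print(repr(lookbehind))
print(repr(restored))
```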

How to search part of pattern in regex python

I can match pattern as it is. But can I search only part of the pattern? or I have to send it separately again.
e.g. pattern = r'/(\w+)/(.+?)'
I can search this pattern using re.search and then use group to get individual groups.
But can I search only for say (\w+) ?
e.g.
pattern = r'/(\w+)/(.+?)'
pattern_match = re.search(pattern, string)
print(pattern_match.group(1))
Can I just search for part of pattern. e.g. pattern.group(1) or something
You can make any part of a regular expression optional by wrapping it in a non-capturing group followed by a ?, i.e. (?: ... )?.
pattern = r'/(\w+)(?:/(.+))?'
This will match /abc/def as well as /abc.
In both examples pattern_match.group(1) will be abc, but pattern_match.group(2) will be def in the first one and None in the second one, because in Python a group that does not take part in the match is reported as None.
For further reference, have a look at (?:x) in the special characters table at https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions
EDIT
Changed the second group to (.+), since I assume you want to match more than one character. .+ is called a "greedy" match, which will try to match as much as possible. .+? on the other hand is a "lazy" match that will only match the minimum number of characters necessary. In case of /abc/def, this will only match the d from def.
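A quick check of the optional group and of the greedy/lazy difference, as a sketch with the two sample paths from above. In Python's re module, a group that did not take part in the match is reported as None.

```python
import re

pattern = re.compile(r'/(\w+)(?:/(.+))?')

# Both parts present: both groups are filled.
both = pattern.search('/abc/def').groups()

# Second part missing: the optional group did not participate.
first_only = pattern.search('/abc').groups()

# A lazy quantifier at the very end of a pattern matches as little as it can.
lazy_tail = re.search(r'/(\w+)/(.+?)', '/abc/def').group(2)

print(both, first_only, lazy_tail)
```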
That pattern is merely a character string; send the needed slice however you want. For instance:
re.search(pattern[:6], string)
uses only the first 6 characters of your pattern. If you need to detect the end of the first pattern -- and you have no intervening right-parens -- you can use
rparen_pos = pattern.index(')')
re.search(pattern[:rparen_pos+1], string)
Another possibility is
pat1 = r'/(\w+)'
pat2 = r'/(.+?)'
big_match = re.search(pat1+pat2, string)
small_match = re.search(pat1, string)
You can get more innovative with expression variables ($1, $2, etc.); see the links below for more help.
http://flockhart.virtualave.net/RBIF0100/regexp.html
https://docs.python.org/2/howto/regex.html

Backreferencing in Python: findall() method output for HTML string

I am trying to learn some regular expressions in Python. The following does not produce the output I expected:
import re
with open('ex06-11.html') as f:
    a = re.findall("<div[^>]*id\\s*=\\s*([\"\'])header\\1[^>]*>(.*?)</div>", f.read())
# output: [('"', 'Some random text')]
The output I was expecting (same code, but without the backreference):
with open('ex06-11.html') as f:
    print(re.findall("<div[^>]*id\\s*=\\s*[\"\']header[\"\'][^>]*>(.*?)</div>", f.read()))
# output: ['Some random text']
The question really boils down to: why is there a quotation mark in my first output, but not in my second? I thought that ([abc]) ... \1 == [abc] ... [abc]. Am I incorrect?
From the docs on re.findall:
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
If you want the entire match to be returned, remove the capturing groups or change them to non-capturing groups by adding ?: after the opening paren. For example you would change (foo) in your regex to (?:foo).
Of course in this case you need the capturing group for the backreference, so your best bet is to keep your current regex and then use a list comprehension with re.finditer() to get a list of only the second group:
regex = re.compile(r"""<div[^>]*id\s*=\s*(["'])header\1[^>]*>(.*?)</div>""")
with open('ex06-11.html') as f:
    a = [m.group(2) for m in regex.finditer(f.read())]
A couple of side notes: you should really consider using an HTML parser like BeautifulSoup instead of regex. You should also use triple-quoted strings if you need to include single or double quotes within your string, and use raw string literals when writing regular expressions so that you don't need to escape the backslashes.
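As a rough illustration of the parser route, here is a sketch that uses the standard library's html.parser instead of BeautifulSoup, so that it runs without third-party packages. The class name HeaderText and the sample markup are assumptions made for this example, and the class only handles this simple case.

```python
from html.parser import HTMLParser

class HeaderText(HTMLParser):
    """Collect the text inside <div id="header"> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # > 0 while inside the header div
        self.chunks = []    # text fragments found inside it

    def handle_starttag(self, tag, attrs):
        if tag == 'div':
            # Enter on the header div, and track divs nested inside it.
            if self.depth or dict(attrs).get('id') == 'header':
                self.depth += 1

    def handle_endtag(self, tag):
        if tag == 'div' and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)

parser = HeaderText()
parser.feed("<div id='header'>Some random text</div><div>other</div>")
parser.close()
header_text = ''.join(parser.chunks)
print(header_text)
```

The parser deals with the quoting style of the id attribute itself, so no backreference is needed.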
The behaviour is clearly documented. See re.findall:
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found.
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
So, if you have capture groups in your regex pattern, findall returns the captured groups for each match instead of the whole match (group(0) is not included): a list of strings for a single group, or a list of tuples for more than one.
So, either use a non-capturing group, (?:[\"\']), or don't use any group at all, as in your 2nd case.
P.S.: Use raw string literals for your regex pattern, to avoid escaping your backslashes. Also, compile your regex outside any loop, so that it is not re-compiled on every iteration. Use re.compile for that.
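How the shape of the findall result depends on the number of capturing groups can be seen on a toy input. This is a minimal sketch; the text and patterns are invented for illustration.

```python
import re

text = 'a1 b2'

# No groups: a list of whole matches.
no_groups = re.findall(r'[a-z]\d', text)

# One group: a list of that group's contents.
one_group = re.findall(r'([a-z])\d', text)

# Two groups: a list of tuples, one tuple per match.
two_groups = re.findall(r'([a-z])(\d)', text)

# Non-capturing group: behaves like the no-group case.
non_capturing = re.findall(r'(?:[a-z])\d', text)

print(no_groups, one_group, two_groups, non_capturing)
```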
When I asked this question I was just starting with regular expressions. I have since read the docs completely, and I just wanted to share what I found out.
Firstly, what Rohit and F.J suggested, use raw strings (to make the regex more readable and less error-prone) and compile your regex beforehand using re.compile. To match an HTML string whose id is 'header':
s = "<div id='header'>Some random text</div>"
We would need a regex like:
p = re.compile(r'<div[^>]*id\s*=\s*([\"\'])header\1[^>]*>(.*?)</div>')
In the Python implementation of regex, a capturing group is made by enclosing part of your regex in parentheses (...). Capturing groups capture the span of text that they match. They are also needed for backreferencing. So in my regex above, I have two capturing groups: ([\"\']) and (.*?). The first one is needed to make the backreference \1 possible. The use of backreferences (and the fact that they refer back to a capturing group) has consequences, however. As pointed out in the other answers to this question, when using findall on my pattern p, findall will return the matches from all groups and put them in a list of tuples:
print(p.findall(s))
# [("'", 'Some random text')]
Since we only want the plain text from between the HTML tags, this is not the output we're looking for.
(Arguably, we could use:
print(p.findall(s)[0][1])
# Some random text
But this may be a bit contrived.)
So in order to return only the text from between the HTML tags (captured by the second group), we use the group() method on p.search():
print(p.search(s).group(2))
# Some random text
I'm fully aware that all but the most simple HTML should not be handled by regex, and that instead you should use a parser. But this was just a tutorial example for me to grasp the basics of regex in Python.
