Python readability regex [duplicate] - python

Is there a cleaner way to write long regex patterns in Python? I saw this approach somewhere, but Python's re doesn't accept a list of patterns.
patterns = [
    re.compile(r'<!--([^->]|(-+[^->])|(-?>))*-{2,}>'),
    re.compile(r'\n+|\s{2}')
]

You can use verbose mode to write more readable regular expressions. In this mode:
Whitespace within the pattern is ignored, except when in a character class or preceded by an unescaped backslash.
When a line contains a '#' neither in a character class or preceded by an unescaped backslash, all characters from the leftmost such '#' through the end of the line are ignored.
The following two statements are equivalent:
a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)
b = re.compile(r"\d+\.\d*")
(Taken from the documentation of verbose mode)

Though @Ayman's suggestion about re.VERBOSE is a better idea, if all you want is what you're showing, just do:
patterns = re.compile(
    r'<!--([^->]|(-+[^->])|(-?>))*-{2,}>'
    r'\n+|\s{2}'
)
and Python's automatic concatenation of adjacent string literals (much like C's, btw) will do the rest;-).
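A minimal sketch of the idea (note: an explicit `|` is added between the two lines here to keep the sub-patterns as separate alternatives; plain concatenation as shown above splices the second pattern directly onto the first):

```python
import re

# Adjacent string literals are concatenated at compile time,
# so a long pattern can be split across lines for readability.
pattern = re.compile(
    r'<!--([^->]|(-+[^->])|(-?>))*-{2,}>'  # HTML comment
    r'|'                                   # or (added explicitly here)
    r'\n+|\s{2}'                           # newlines / double spaces
)
print(bool(pattern.search('<!-- a comment -->')))  # True
```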

You can use comments in regexes, which makes them much more readable. Taking an example from http://gnosis.cx/publish/programming/regular_expressions.html (the /.../ delimiters are Perl-style; in Python you would pass the body to re.compile with re.VERBOSE):
/                 # identify URLs within a text file
[^="]             # do not match URLs in IMG tags like:
                  #   <img src="http://mysite.com/mypic.png">
http|ftp|gopher   # make sure we find a resource type
:\/\/             # ...needs to be followed by colon-slash-slash
[^ \n\r]+         # stuff other than space, newline, tab is in URL
(?=[\s\.,])       # assert: followed by whitespace/period/comma
/
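A Python translation of that Perl-style example, using re.VERBOSE so the comments survive (the sample text below is made up):

```python
import re

url_re = re.compile(r"""
    [^="]              # do not match URLs in IMG tags like:
                       #   <img src="http://mysite.com/mypic.png">
    (http|ftp|gopher)  # make sure we find a resource type
    ://                # ...needs to be followed by colon-slash-slash
    [^ \n\r]+          # stuff other than space, newline, tab is in URL
    (?=[\s.,])         # assert: followed by whitespace/period/comma
""", re.VERBOSE)

m = url_re.search('see http://mysite.com/page.html for details')
print(m.group(1))  # http
```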

Related

How to make Python ignore indentation and \n in re expressions

I need to match a very large string like this:
"""
A=...B=...C=...D=...
"""
I used such an re expression to match the string:
'''
A=.+?
[^ABCD]+?
C=.+?
'''
As the re expression is very long, I split it across several lines according to PEP 8. But Python then matches the implicit \n as well, i.e. it is trying to match:
A=(.+?)\n[^ABCD]+?\nC=(.+?)\n
which is definitely not what I want. Furthermore, I cannot use pretty indentation because Python matches that whitespace as well.
So how can I get out of this? I want both pretty indentation and correct matching.
Specify the re.VERBOSE (also known as re.X) flag when creating the Regex:
pattern = re.compile('''
    A=.+?
    [^ABCD]+?
    C=.+?
''', re.VERBOSE)
From the docs:
This flag allows you to write regular expressions that look nicer.
Whitespace within the pattern is ignored, except when in a character
class or preceded by an unescaped backslash, and, when a line contains
a '#' neither in a character class or preceded by an unescaped
backslash, all characters from the leftmost such '#' through the end
of the line are ignored.
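A quick sanity check that the verbose pattern compiles down to the compact one, on a made-up input:

```python
import re

verbose = re.compile('''
    A=.+?
    [^ABCD]+?
    C=.+?
''', re.VERBOSE)
compact = re.compile('A=.+?[^ABCD]+?C=.+?')

text = 'A=1;x;C=2;'
print(bool(verbose.search(text)), bool(compact.search(text)))  # True True
```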
You need to use the re.VERBOSE mode; you can even put comments in your re:
a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)
b = re.compile(r"\d+\.\d*")

Is there a Python equivalent to the Perl "/x" modifier for regular expressions?

Perl makes it easy to construct readable regular expressions using the /x modifier. This modifier allows writing regular expression strings with all whitespace in them ignored. In other words, logical parts of the regular expression can be separated by whitespace or even line breaks, greatly improving readability. In Python, the only way I see of doing this is to construct such a regular expression string, remove the whitespace from it in an intermediate step, and then use the resulting string for matching. Is there a more elegant way of doing this?
Yes, by setting the re.X / re.VERBOSE flag:
This flag allows you to write regular expressions that look nicer. Whitespace within the pattern is ignored, except when in a character class, or when preceded by an unescaped backslash, or within tokens like *?, (?: or (?P<...>. When a line contains a # that is not in a character class and is not preceded by an unescaped backslash, all characters from the leftmost such # through the end of the line are ignored.
That means that the two following regular expression objects that match a decimal number are functionally equal:
a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)
b = re.compile(r"\d+\.\d*")
This is pretty much exactly like the /x Perl flag.
You can control the same flag in a subsection of your pattern with the (?x:...) (enable) and (?-x:...) (disable) groups.
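As an illustration (scoped flag groups like (?x:...) are supported in Python's re module since 3.6; the patterns here are made up):

```python
import re

# Whitespace is ignored only inside the (?x:...) group;
# the '-' and '[a-z]+' outside it are matched normally.
p = re.compile(r'(?x: \d+ \s \d+ )-[a-z]+')
print(bool(p.match('12 34-abc')))  # True

# With re.X set globally, (?-x:...) makes whitespace literal again inside it.
q = re.compile(r'\d+ (?-x:- )\d+', re.X)
print(bool(q.match('12- 34')))  # True
```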
In addition, inline modifiers can be placed within a regular expression to enforce the relevant matching behavior for the given expression. In Python's re module, inline modifiers apply to the entire regular expression, and negated inline modifiers such as (?-ismx) are not supported:
pattern = re.compile(r'''(?x)
    \d+ (?# Some numbers)
    \s+ (?# Whitespace)
    \d+ (?# More numbers)
''')
The way around that is to use the third-party regex module, in which inline modifiers apply to the end of the enclosing group or pattern, and can be turned on or off:
import regex
pattern = regex.compile(r'(?x) \d+ (?-x)[a-z]+(?x) \d+', regex.V1)

Extracting text between two markers and handling Backslashes in Regex for Python

I have a few URLs in a file, some of them are embedded between specific start and end tags whereas others are not. I only need to extract the ones which are embedded in between the start and end tags.
A line in my inputfile.txt looks like the following:
some gibberish data-start=\"https:\/\/cdn.net\/hphotos-ak-xfa1\/1.jpg\" data-end this is useless text, some gibberishhh data-start=\"https:\/\/cdn.net\/hphotos-xaf1\/2.jpg\" data-end some gibberish fake-data-start=\"https:\/\/cdn.net\/hphotos-xaf1\/2.jpg\" fake-data-end
The start and end tags of the URLs that I need are data-start and data-end as opposed to fake-data-start and fake-data-end.
Now I'm using the following regex in Python to extract the aforementioned URLs:
(?<=\ data-start=\\\")([^"]+\.[^"]+\.[^"]+)(?=\"\ data-end)
I believe the above regex works (I tested it at this link), and my Python code is:
import re
import string
import sys

s = re.compile('(?<=\ data-start=\\\")([^"]+\.[^"]+\.[^"]+)(?=\"\ data-end)')
fin = open('inputfile.txt')
for line in fin:
    m = s.findall(line)
    if m:
        print(m)
However, my Python code is unable to find the URLs, on the other hand if I remove all backslashes from my file then the above code works fine. I haven't been able to explain this difference.
Backslash serves as an escape character. Therefore, for every single backslash (\) you need two backslashes (\\). You can use the following regular expression here:
(?<=data-start=\\").*?(?=\\" data-end)
Explanation:
(?<=          # look behind to see if there is:
data-start=   #   'data-start='
\\            #   '\'
"             #   '"'
)             # end of look-behind
.*?           # any character except \n (0 or more times)
(?=           # look ahead to see if there is:
\\            #   '\'
" data-end    #   '" data-end'
)             # end of look-ahead
Note: If your data spans multi-line, use the inline (?s) modifier forcing the dot to match newline characters.
(?s)(?<=data-start=\\").*?(?=\\" data-end)
Final solution:
import re

myfile = open('inputfile.txt', 'r')
regex = re.compile(r'(?<=data-start=\\").*?(?=\\" data-end)')
for line in myfile:
    matches = regex.findall(line)
    for m in matches:
        print(m)
Output
https:\/\/cdn.net\/hphotos-ak-xfa1\/1.jpg
https:\/\/cdn.net\/hphotos-xaf1\/2.jpg
You seem to have too many backslashes. It looks to me like you could simplify your regex to something like:
(?<= data-start=\\")([^".]+\.[^".]+\.[^"\\]+)
Please note that the original [^"]+\., which means any character that is not a double quote, then a dot, will first eat all the dots, then backtrack, which is why I added the dots in the character classes.
In Python, something like:
s = re.compile(r'(?<= data-start=\\")([^".]+\.[^".]+\.[^"\\]+)')
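A quick test of that simplified pattern on a made-up line in the same escaped format:

```python
import re

line = r'noise data-start=\"https:\/\/cdn.net\/pics\/1.jpg\" data-end tail'
s = re.compile(r'(?<= data-start=\\")([^".]+\.[^".]+\.[^"\\]+)')
print(s.findall(line))  # one match: the escaped URL
```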

Commenting Regular expressions in python

This answer to a question regarding the maintainability of regular expressions mentions the ability of .NET users to implement comments in their regular expressions (I am particularly interested in the second example)
Is there an easy, native way to reproduce this in Python, preferably without having to install a third-party library or write my own comment-stripping algorithm?
What I currently do is similar to the first example in that answer: I concatenate the regular expression over multiple lines and comment each line, as in the following example:
regexString  = '(?:'    # Non-capturing group matching the beginning of a comment
regexString += '/\*\*'
regexString += ')'
You're looking for the VERBOSE flag in the re module. Example from its documentation:
a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)
r"""
(?:      # Match the regular expression below
   /     # Match the character “/” literally
   \*    # Match the character “*” literally
   \*    # Match the character “*” literally
)
"""
You can also add comments inside a regex like this:
(?#The following regex matches /** in a non-capture group :D)(?:/\*\*)
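For example (hypothetical input):

```python
import re

# (?#...) embeds a comment directly in the pattern, no VERBOSE flag needed
p = re.compile(r'(?#match /** in a non-capture group)(?:/\*\*)')
print(bool(p.search('text /** docblock')))  # True
```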

Regular expression sub

I have a question about regular expression sub in Python. I have some lines of code and want to replace all floating-point values, e.g. 2.0f, -1.0f, etc., with doubles: 2.0, -1.0. I came up with the regular expression '[-+]?[0-9]*\.?[0-9]+f' and it finds what I need, but I am not sure how to do the replacement.
so here's what I have:
# check if a floating point value exists
if re.findall('[-+]?[0-9]*\.?[0-9]+f', line):
    line = re.sub('[-+]?[0-9]*\.?[0-9]+f', ?????, line)
I am not sure what to put in place of ????? so that it replaces what '[-+]?[0-9]*\.?[0-9]+f' matched, minus the trailing f.
Also there might be more than one floating-point value per line, which is why I used re.findall.
Any help would be great. Thanks.
Capture the part of the text you want to save in a capturing group and use the \1 substitution operator:
line = re.sub(r'([-+]?[0-9]*\.?[0-9]+)f', r'\1' ,line)
Note that findall (or any kind of searching) is unnecessary since re.sub will look for the pattern itself and return the string unchanged if there are no matches.
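For example, on a made-up input line:

```python
import re

line = 'x = 2.0f; y = -1.5f;'
# \1 substitutes the captured number, dropping the trailing 'f'
line = re.sub(r'([-+]?[0-9]*\.?[0-9]+)f', r'\1', line)
print(line)  # x = 2.0; y = -1.5;
```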
Now, for several regular expression writing tips:
Always use raw strings (r'...') for regular expressions and substitution strings, otherwise you will need to double your backslashes to escape them from Python's string parser. It is only by accident that you didn't need to do this for \., since . is not part of an escape sequence in Python strings.
Use \d instead of [0-9] to match a digit. They are equivalent, but \d is easier to recognize for "digit", while [0-9] needs to be visually verified.
Your regular expression will not recognize 10.f, which is likely a valid decimal number in your input. Matching floating-point numbers in various formats is trickier than it seems at first, but simple googling will reveal many reasonably complete solutions for this.
The re.X flag will allow you to add arbitrary whitespace and even comments to your regexp. With small regexps that can seem downright silly, but for large expressions the added clarity is a life-saver. (Your regular expression is close to the threshold.)
Here is an example of an extended regular expression that implements the above style tips:
line = re.sub(r'''
    ( [-+]?
      (?: \d+ (?: \.\d* )?   # 12 or 12. or 12.34
        |
          \.\d+              # .12
      )
    ) f''',
    r'\1', line, flags=re.X)
((?:...) is a non-capturing group, only used for precedence.)
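Exercising the extended expression above on a made-up line shows it also handles the 10. form mentioned earlier:

```python
import re

line = re.sub(r'''
    ( [-+]?
      (?: \d+ (?: \.\d* )?   # 12 or 12. or 12.34
        |
          \.\d+              # .12
      )
    ) f''',
    r'\1', 'a = 10.f + .5f;', flags=re.X)
print(line)  # a = 10. + .5;
```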
This is my go-to reference for all things regex:
http://www.regular-expressions.info/named.html
The result should be something like:
line = re.sub(r'(?P<first>[-+]?[0-9]*\.?[0-9]+)f', r'\g<first>', line)
Surround the part of the regex you want to "keep" in a "capture group", e.g.
'([-+]?[0-9]*\.?[0-9]+)f'
 ^                    ^
And then you can refer to these capture groups using \1 in your substitution:
r'\1'
For future reference, you can have many capture groups, i.e. \2, \3, etc. by order of the opening parentheses.
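For instance, swapping two captured groups (illustrative pattern):

```python
import re

# \1 and \2 refer to the first and second capture groups
s = re.sub(r'(\w+)=(\w+)', r'\2: \1', 'color=red size=xl')
print(s)  # red: color xl: size
```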
