This answer to a question regarding the maintainability of regular expressions mentions the ability of .NET users to put comments in their regular expressions (I am particularly interested in the second example).
Is there an easy native way to reproduce this in Python, preferably without having to install a third-party library or write my own comment-stripping algorithm?
What I currently do is similar to the first example in that answer: I concatenate the regular expression over multiple lines and comment each line, like in the following example:
regexString = '(?:' # Non-capturing group matching the beginning of a comment
regexString += '/\*\*'
regexString += ')'
You're looking for the VERBOSE flag in the re module. Example from its documentation:
a = re.compile(r"""\d + # the integral part
\. # the decimal point
\d * # some fractional digits""", re.X)
r"""
(?: # Match the regular expression below
/ # Match the character “/” literally
\* # Match the character “*” literally
\* # Match the character “*” literally
)
"""
You can also add comments into regex like this:
(?#The following regex matches /** in a non-capture group :D)(?:/\*\*)
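Both commenting styles can be tried side by side. This is only a minimal sketch: the names verbose_form, inline_form and sample are made up for illustration, and the sample text is an assumption.

```python
import re

# Verbose form: whitespace and '#' comments are ignored by the engine.
verbose_form = re.compile(r"""
    (?:      # non-capturing group matching the beginning of a comment
        /    # a literal slash
        \*   # a literal asterisk
        \*   # a second literal asterisk
    )
""", re.VERBOSE)

# Inline form: (?#...) is a comment; its text cannot contain a ')'.
inline_form = re.compile(r"(?#matches /** in a non-capturing group)(?:/\*\*)")

sample = "code(); /** doc comment */"
print(verbose_form.search(sample).group())  # /**
print(inline_form.search(sample).group())   # /**
```

The verbose form needs the re.VERBOSE (re.X) flag, while the (?#...) form works without any flag.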
I am trying to extract parts of a MySQL query to get the information I want.
I used this code / regex in Python:
import re
query = "SELECT `asd`.`ssss` as `column1`, `ss`.`wwwwwww` from `table`"
table_and_columns = re.findall('\`.*?`[.]\`.*?`',query)
My expected output:
['`asd`.`ssss`', '`ss`.`wwwwwww`']
My real output:
['`asd`.`ssss`', '`column1`, `ss`.`wwwwwww`']
Can anybody help me and explain where I went wrong?
The regex should only find the ones that have two strings like asd and a dot in the middle.
PS: I know that this is not a valid query.
The dot . can also match a backtick, so the pattern starts by matching a backtick and can then match any characters, backticks included, until it reaches the literal dot in [.]
There is no need for non-greedy quantifiers; a negated character class is enough to prevent crossing the backtick boundary.
`[^`]*`\.`[^`]*`
The asterisk * matches 0 or more times. If there has to be at least a single char, and newlines and spaces are unwanted, you could add \s to the negated class to prevent matching whitespace chars and use + to match 1 or more times.
`[^`\s]+`\.`[^`\s]+`
For example
import re
query = "SELECT `asd`.`ssss` as `column1`, `ss`.`wwwwwww` from `table`"
table_and_columns = re.findall(r'`[^`\s]+`\.`[^`\s]+`', query)
print(table_and_columns)
Output
['`asd`.`ssss`', '`ss`.`wwwwwww`']
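To see the difference side by side, here is a small sketch that runs the original lazy pattern and the negated-character-class pattern on the same query. The variable names lazy and negated are only illustrative.

```python
import re

query = "SELECT `asd`.`ssss` as `column1`, `ss`.`wwwwwww` from `table`"

# Lazy dot: still allowed to cross backticks to reach a "`.`" further right.
lazy = re.findall(r"`.*?`[.]`.*?`", query)

# Negated class: can never cross a backtick, so each name stays separate.
negated = re.findall(r"`[^`]*`\.`[^`]*`", query)

print(lazy)     # ['`asd`.`ssss`', '`column1`, `ss`.`wwwwwww`']
print(negated)  # ['`asd`.`ssss`', '`ss`.`wwwwwww`']
```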
Please try the regex below. The problem is that .*? can also match backticks, so a match can run from one quoted name into the next.
Instead you should search for [^`]*
`[^`]*?`\.`[^`]*?`
The thing is that
.*? matches any character (except for line terminators), including whitespace.
Also note that the ? after * does not mean "optional" here: it makes the quantifier lazy, and a lazy .* can still run across whitespace to reach a match.
So this seems to work:
\`\S+\`[.]\`\S+\`
where \S is any non-whitespace character.
You can always check your regexes using https://regex101.com
I am trying to write a Python regex.
First I read each line of the file into a list. Then I loop through the list.
Q1. I want to capture lines where the arithmetic operators do not have spaces around them. Something like
Capture:
a = a+5
Does not capture:
a = a + 5
For this, I have written something like:
for i in array:
pattern = re.search(r"\S(\+|\-|\*|\\)\S",i)
\S : anything other than space
(+|-|*|\):mathematical operators
\S : anything other than space
But the problem is that it also captures the lines where post-increment operators are used.
Captures :
a = a++
How could I write a regex such that it does not capture the lines where post-increment operators are used?
Q2. I want to capture where multi-line comments are used in a file.
I tried the expression below, but it fails to capture them. I don't know what I have done wrong. Kindly help.
for i in array:
pattern = re.search(r"\/\*[A-Za-z0-9_]\*\/",i)
For question 1, you want a plus not preceded by a space or a plus and not followed by a space or a plus. This can be written
r"(?<![+\s])[+](?![+\s])"
You can do the same expression with minus instead of plus, or star, or slash. Then join these expressions with the | sign.
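Joining the per-operator expressions could look like the sketch below. The helper name unspaced_operator_regex and the operator set "+-*/" are assumptions for illustration, and the result is a rough check: it may also flag other tokens that happen to contain these characters.

```python
import re

def unspaced_operator_regex(operators="+-*/"):
    """Build one lookaround expression per operator and join them with '|'."""
    parts = []
    for op in operators:
        esc = re.escape(op)
        # The operator must not be preceded or followed by itself or whitespace.
        parts.append(rf"(?<![{esc}\s]){esc}(?![{esc}\s])")
    return re.compile("|".join(parts))

UNSPACED = unspaced_operator_regex()

for line in ["a = a+5", "a = a + 5", "a = a++"]:
    print(repr(line), bool(UNSPACED.search(line)))
```

Only the first line is reported; the spaced version and the post-increment are skipped.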
For question 2, you can try
r"[/][*](?:[^*]|[*](?![/]))*[*][/]"
Of course it won't handle nested multiline comments. For these, a simple regex won't suffice.
The first issue can be solved with the help of negated character classes, at least for the current example strings and maybe some more cases. The problem you showed is due to the fact that \S matches any non-whitespace character. To match any char that is not whitespace, -, /, + and *, and maybe even ( and ), use [^\s+*/()-] negated character class. Your first regex - note that division operator should be /, not \ - can be written as
pat = r"[^\s+*/()-]([+*/-])[^\s+*/()-]"
The second one is a solved issue.
pat = r"/\*[^*]*\*+(?:[^/*][^*]*\*+)*/"
Details
/\* - comment start
[^*]*\*+ - match 0+ characters other than * followed with 1+ literal *
(?:[^/*][^*]*\*+)* - 0+ sequences of:
[^/*][^*]*\*+ - not a / or * (matched with [^/*]) followed with 0+ non-asterisk characters ([^*]*) followed with 1+ asterisks (\*+)
/ - closing /
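Because a multi-line comment spans several lines, it has to be searched for in the whole text rather than line by line. A minimal sketch, assuming the file content is already in a single string (the sample source text is made up):

```python
import re

COMMENT = re.compile(r"/\*[^*]*\*+(?:[^/*][^*]*\*+)*/")

# Hypothetical file content held as one string, not as a list of lines.
source = "int a; /* first\n   comment */ a++; /** doc **/ b = 1;"

comments = COMMENT.findall(source)
for comment in comments:
    print(repr(comment))
```

With a list of lines, as in the question, joining them back together first (for example with "".join(array)) would give the same kind of string.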
Is there a cleaner way to write long regex patterns in python? I saw this approach somewhere but regex in python doesn't allow lists.
patterns = [
re.compile(r'<!--([^->]|(-+[^->])|(-?>))*-{2,}>'),
re.compile(r'\n+|\s{2}')
]
You can use verbose mode to write more readable regular expressions. In this mode:
Whitespace within the pattern is ignored, except when in a character class or preceded by an unescaped backslash.
When a line contains a '#' that is neither in a character class nor preceded by an unescaped backslash, all characters from the leftmost such '#' through the end of the line are ignored.
The following two statements are equivalent:
a = re.compile(r"""\d + # the integral part
\. # the decimal point
\d * # some fractional digits""", re.X)
b = re.compile(r"\d+\.\d*")
(Taken from the documentation of verbose mode)
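The two exceptions listed above matter as soon as the pattern itself needs a space or a '#'. A small sketch (the name tag and the sample text are made up for illustration):

```python
import re

tag = re.compile(r"""
    \#      # an escaped hash is a literal character, not a comment
    [ ]     # a space inside a character class is kept
    \w+     # the word that follows
""", re.VERBOSE)

print(tag.search("see # todo later").group())  # '# todo'
```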
Though @Ayman's suggestion about re.VERBOSE is a better idea, if all you want is what you're showing, just do:
patterns = re.compile(
r'<!--([^->]|(-+[^->])|(-?>))*-{2,}>'
r'\n+|\s{2}'
)
and Python's automatic concatenation of adjacent string literals (much like C's, btw) will do the rest;-).
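One thing to keep in mind is that adjacent literals are simply glued together, so two patterns that are meant to be alternatives need an explicit '|' between them. A hedged sketch of that variant, with a per-piece comment (the sample text is made up):

```python
import re

# Adjacent string literals are joined into one pattern string, so every
# piece can carry its own comment. The explicit '|' keeps the alternatives apart.
pattern = re.compile(
    r'<!--([^->]|(-+[^->])|(-?>))*-{2,}>'   # an HTML comment
    r'|'
    r'\n+|\s{2}'                            # newlines, or two whitespace characters
)

cleaned = pattern.sub('', "a<!-- note -->b\n\nc")
print(cleaned)  # abc
```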
You can use comments in regex's, which make them much more readable. Taking an example from http://gnosis.cx/publish/programming/regular_expressions.html :
/ # identify URLs within a text file
[^="] # do not match URLs in IMG tags like:
# <img src="http://mysite.com/mypic.png">
http|ftp|gopher # make sure we find a resource type
:\/\/ # ...needs to be followed by colon-slash-slash
[^ \n\r]+ # stuff other than space, newline, tab is in URL
(?=[\s\.,]) # assert: followed by whitespace/period/comma
/
Perl makes it easy to construct readable regular expressions using the /x modifier. This modifier lets you write regular expression strings in which all whitespace is ignored. In other words, logical parts of the regular expression can be separated by whitespace or even line breaks, which greatly improves readability. In Python, the only way I see of doing this is to construct such a regular expression string, remove the whitespace from it in an intermediate step, and then use the resulting string for matching. Is there a more elegant way of doing this?
Yes, by setting the re.X / re.VERBOSE flag:
This flag allows you to write regular expressions that look nicer. Whitespace within the pattern is ignored, except when in a character class, or when preceded by an unescaped backslash, or within tokens like *?, (?: or (?P<...>. When a line contains a # that is not in a character class and is not preceded by an unescaped backslash, all characters from the leftmost such # through the end of the line are ignored.
That means that the two following regular expression objects that match a decimal number are functionally equal:
a = re.compile(r"""\d + # the integral part
\. # the decimal point
\d * # some fractional digits""", re.X)
b = re.compile(r"\d+\.\d*")
This is pretty much exactly like the /x Perl flag.
You can control the same flag in a subsection of your pattern within the (?x:...) (enable) and (?-x:...) (disable) groupings.
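As a rough sketch of the scoped form, assuming a reasonably recent Python 3 (the pattern and the name length are made up for illustration):

```python
import re

# Only the part inside (?x: ... ) is parsed in verbose mode;
# the " cm" after the group is matched literally, space included.
length = re.compile(r"(?x: \d+ \. \d* ) cm")

print(bool(length.search("3.14 cm")))  # True
print(bool(length.search("3.14cm")))   # False
```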
To add: inline modifiers can be placed within a regular expression to enforce the relevant matching behavior. In Python's re module, a standalone inline modifier such as (?x) applies to the entire regular expression (and from Python 3.11 on it must appear at the very start of the pattern), and standalone negating modifiers such as (?-ismx) are not supported.
pattern = re.compile(r'''(?x)
\d+ (?# Some numbers)
\s+ (?# Whitespace)
\d+ (?# More numbers)
''')
The way around that would be to import Python's regex module in which the inline modifiers apply to the end of the group or pattern, and they can be turned on or off.
import regex
pattern = regex.compile(r'(?x) \d+ (?-x)[a-z]+(?x) \d+', regex.V1)
I have a question about regular expression sub in python. So, I have some lines of code and what I want is to replace all floating point values eg: 2.0f,-1.0f...etc..to doubles 2.0,-1.0. I came up with this regular expression '[-+]?[0-9]*\.?[0-9]+f' and it finds what I need but I am not sure how to replace it?
so here's what I have:
# check if floating point value exists
if re.findall('[-+]?[0-9]*\.?[0-9]+f', line):
line = re.sub('[-+]?[0-9]*\.?[0-9]+f', ????? ,line)
I am not sure what to put under ????? such that it will replace what I found in '[-+]?[0-9]*\.?[0-9]+f' without the char f in the end of the string.
Also there might be more than one floating point values, which is why I used re.findall
Any help would be great. Thanks
Capture the part of the text you want to save in a capturing group and use the \1 substitution operator:
line = re.sub(r'([-+]?[0-9]*\.?[0-9]+)f', r'\1' ,line)
Note that findall (or any kind of searching) is unnecessary since re.sub will look for the pattern itself and return the string unchanged if there are no matches.
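A short sketch of that behavior, with a made-up helper name strip_float_suffix and made-up sample lines:

```python
import re

FLOAT_F = re.compile(r'([-+]?[0-9]*\.?[0-9]+)f')

def strip_float_suffix(line):
    """Drop the trailing 'f' from every float literal in the line."""
    return FLOAT_F.sub(r'\1', line)

print(strip_float_suffix("x = 2.0f + -1.0f;"))  # x = 2.0 + -1.0;
print(strip_float_suffix("no floats here"))     # no floats here
```

All matches in the line are replaced in one call, and a line without matches comes back unchanged.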
Now, for several regular expression writing tips:
Always use raw strings (r'...') for regular expressions and substitution strings, otherwise you will need to double your backslashes to escape them from Python's string parser. It is only by accident that you didn't need to do this for \., since . is not part of an escape sequence in Python strings.
Use \d instead of [0-9] to match a digit. They are equivalent, but \d is easier to recognize for "digit", while [0-9] needs to be visually verified.
Your regular expression will not recognize 10.f, which is likely a valid decimal number in your input. Matching floating-point numbers in various formats is trickier than it seems at first, but simple googling will reveal many reasonably complete solutions for this.
The re.X flag will allow you to add arbitrary whitespace and even comments to your regexp. With small regexps that can seem downright silly, but for large expressions the added clarity is a life-saver. (Your regular expression is close to the threshold.)
Here is an example of an extended regular expression that implements the above style tips:
line = re.sub(r'''
( [-+]?
(?: \d+ (?: \.\d* )? # 12 or 12. or 12.34
|
\.\d+ # .12
)
) f''',
r'\1', line, flags=re.X)
((?:...) is a non-capturing group, only used for precedence.)
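The same extended pattern can be compiled once and tried on a few formats. This is only a sketch; the name FLOAT_WITH_SUFFIX and the sample line are assumptions.

```python
import re

FLOAT_WITH_SUFFIX = re.compile(r'''
    ( [-+]?
      (?: \d+ (?: \.\d* )?   # 12 or 12. or 12.34
        |
          \.\d+              # .12
      )
    ) f''', re.X)

line = "a = 10.f; b = .5f; c = -12.34f; d = 7f;"
converted = FLOAT_WITH_SUFFIX.sub(r'\1', line)
print(converted)  # a = 10.; b = .5; c = -12.34; d = 7;
```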
This is my go-to reference for all things regex.
http://www.regular-expressions.info/named.html
The result should be something like:
line = re.sub(r'(?P<first>[-+]?[0-9]*\.?[0-9]+)f', r'\g<first>', line)
Surround the part of the regex you want to "keep" in a "capture group", e.g.
'([-+]?[0-9]*\.?[0-9]+)f'
^ ^
And then you can refer to these capture groups using \1 in your substitution:
r'\1'
For future reference, you can have many capture groups, i.e. \2, \3, etc. by order of the opening parentheses.
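A small illustration of the numbering, using made-up patterns and strings:

```python
import re

# Groups are numbered by the position of their opening parenthesis,
# so the outer group is 1 and the two inner groups are 2 and 3.
groups = re.match(r'((\d+)\.(\d+))f', "12.5f").groups()
print(groups)   # ('12.5', '12', '5')

# \1 and \2 can be reordered freely in the replacement string.
swapped = re.sub(r'(\w+)=(\w+)', r'\2=\1', "key=value")
print(swapped)  # value=key
```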