How to find superfluous escapes in regex patterns - python

How can I find and remove all the unneeded backslash escapes in Python regular expressions?
For example, in r'\{\"*' all the escapes are unnecessary, and the pattern has the same meaning as r'{"*'. But in r'\[a-b]\{2}\Z\'\+' removing any of the escapes would change how the regex is interpreted by the regex engine (or cause a syntax error).
Given the pattern, is there an easy way to remove superfluous escapes programmatically in Python, i.e. something other than parsing the whole regex string and looking for escapes on non-special characters?

Here is the code that I came up with:
from contextlib import redirect_stdout
from io import StringIO
from re import compile, DEBUG, error, MULTILINE, VERBOSE


def unescape(pattern: str, flags: int):
    """Remove any escape that does not change the regex meaning"""
    # Capture the debug dump of the original pattern as the reference.
    strio = StringIO()
    with redirect_stdout(strio):
        compile(pattern, DEBUG | flags)
    original_debug = strio.getvalue()
    # Walk the pattern right to left, trying to drop one backslash at a time.
    index = len(pattern)
    while index > 0:
        index -= 1
        character = pattern[index]
        if character != '\\':
            continue
        removed_escape = pattern[:index] + pattern[index+1:]
        strio = StringIO()
        with redirect_stdout(strio):
            try:
                compile(removed_escape, DEBUG | flags)
            except error:
                continue
        # Keep the shorter pattern only if the debug dump is unchanged.
        if original_debug == strio.getvalue():
            pattern = removed_escape
    return pattern


def print_unescaped_raw(regex: str, flags: int = 0):
    """Print an unescaped raw-string representation for regex."""
    print(
        ("r'%s'" % unescape(regex, flags)
         .replace("'", r'\'')
         .replace('\n', r'\n'))
    )


print_unescaped_raw(r'\{\"*')  # r'{"*'
One can also use sre_parse.parse directly, but the SubPatterns and tuples in the result may contain nested SubPatterns. And SubPattern instances don't have __eq__ method defined for them, so a recursive comparison subroutine might be required.
P.S.
Unfortunately, this method does not work with the regex module because in regex you get different debug output for escaped characters:
regex.compile(r'{', regex.DEBUG)
LITERAL MATCH '{'
regex.compile(r'\{', regex.DEBUG)
CHARACTER MATCH '{'
Unlike re that gives:
re.compile(r'{', re.DEBUG)
LITERAL 123
re.compile(r'\{', re.DEBUG)
LITERAL 123

I will not do the whole implementation, but I can give you some hints for a viable heuristic/algorithm:
Initial hypothesis: for each regex that you are going to modify, you have a list of input strings and expected outputs to validate its behavior.
1. Use http://www.rexegg.com/regex-quickstart.html to build the list of characters that must stay escaped with a backslash \, i.e. the elements that should not be replaced.
2. Parse your regex and replace every \X, where X is a character that is not in the list from the previous step, by X.
3. Run the initial regex and the new regex on the same input strings and compare their respective outputs.
4. If all of the results are the same, you can use the new, simplified regex.
5. If at least one output differs, throw away the new regex and proceed with local replacements. Select one of the \X in your initial regex (randomly, or round robin) whose X is not in the list built at step 1, replace it by X, and compare the output with the initial regex's output for each input string. If it matches, keep that regex and repeat step 5 until no further progress is possible. If the output differs, remove that \X from the candidates you might be able to replace and repeat step 5 with your previous regex. Once the list of possible local replacements is empty, you can use the new regex instead of the old one.
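The heuristic above can be sketched roughly as follows. This is only an illustration, not a complete implementation: the names KEEP_ESCAPED, behaviour, candidate_positions and simplify are made up for the example, the keep-list is a conservative guess of mine rather than the rexegg list verbatim, and the sketch goes straight to the one-at-a-time replacements of step 5 instead of first trying them all at once. It is only as reliable as the sample strings you feed it.

```python
import re
import string

# Assumption: an escape followed by one of these characters is never touched.
# Letters and digits form classes or backreferences (\d, \b, \1); the rest are
# metacharacters, plus the backslash itself.
KEEP_ESCAPED = set(string.ascii_letters + string.digits + r".^$*+?{}[]()|\-")


def behaviour(pattern, samples):
    """Return the match spans for every sample, or None if pattern is invalid."""
    try:
        compiled = re.compile(pattern)
    except re.error:
        return None
    return [[m.span() for m in compiled.finditer(s)] for s in samples]


def candidate_positions(pattern):
    """Return indexes of backslashes whose next character is not in KEEP_ESCAPED."""
    positions = []
    i = 0
    while i < len(pattern) - 1:
        if pattern[i] == "\\":
            if pattern[i + 1] not in KEEP_ESCAPED:
                positions.append(i)
            i += 2  # skip the escaped character so r'\\' is treated as a unit
        else:
            i += 1
    return positions


def simplify(pattern, samples):
    """Drop each candidate escape that leaves the behaviour on samples unchanged."""
    expected = behaviour(pattern, samples)
    # Go right to left so the remaining indexes stay valid after a removal.
    for i in reversed(candidate_positions(pattern)):
        trial = pattern[:i] + pattern[i + 1:]
        if behaviour(trial, samples) == expected:
            pattern = trial
    return pattern


print(simplify(r'\"foo\"\.\!', ['"foo".!', 'foo.!', '"foo"x!']))  # "foo"\.!
```

Note that the keep-list makes this more cautious than the debug-output approach: an escape like \{ is left alone even where it happens to be unnecessary.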

Related

Python: if a substring exists in a string, get its context between delimiters

I have a list of strings that follow a pattern such that in some position in the string there may be a substring RAM.
ex:
sdfjhsk_sdkjfhs_RAM_lkfdgjls
Sometimes this string may have another character after it.
ex:
aaaa_RAMA_sfsffgd
I'd need to have the whole context between the nearest underscores, so RAM in the first case, RAMA in the second.
And it may not even exist at all in the string
ex:
sfdks_sdfsdf_sdfsdf_sdfsdfsdf
Matches at the start or end of the string are allowed:
RAMsdoa_saeorfioa_noutd -> RAMsdoa
aetu_eaei_sdsdf_RAMSdoa -> RAMSdoa
as are matches in strings without underscores:
sdasids -> nothing
sdfRAMso -> sdfRAMso
What is the best way to check whether the string contains the pattern RAM and, if it does, grab everything between the nearest delimiters _ (or the start or end of the string, if nearer)?
You can use a regular expression here. You need to match RAM, plus any non-_ characters before and after:
import re

def find_ram_context(inputtext):
    match = re.search(r'[^_]*RAM[^_]*', inputtext)
    if match:
        return match.group(0)
[^...] is a negative character-set match; anything not in that set would match. Here that's _, and * means that there should be zero or more such characters. So any character before or after RAM that's not an underscore would be pulled into the matched text.
The function above returns the matched context, or None if the word RAM is not present:
>>> find_ram_context('sdfjhsk_sdkjfhs_RAM_lkfdgjls')
'RAM'
>>> find_ram_context('aaaa_RAMA_sfsffgd')
'RAMA'
>>> find_ram_context('sfdks_sdfsdf_sdfsdf_sdfsdfsdf') is None
True
Online demo of the regex with your test cases at https://regex101.com/r/6VcLrC/1
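If a string can contain RAM in more than one chunk, the same pattern works with findall. The following is a small sketch along those lines; find_all_ram_contexts is a hypothetical helper name, not part of the answer above. It also runs the start-of-string, end-of-string and no-underscore cases from the question.

```python
import re

# Same pattern as above: RAM plus any neighbouring non-underscore characters.
RAM_CONTEXT = re.compile(r'[^_]*RAM[^_]*')


def find_all_ram_contexts(inputtext):
    """Return every underscore-delimited chunk that contains RAM."""
    return RAM_CONTEXT.findall(inputtext)


samples = (
    'RAMsdoa_saeorfioa_noutd',   # match at the start of the string
    'aetu_eaei_sdsdf_RAMSdoa',   # match at the end of the string
    'sdfRAMso',                  # no underscores at all
    'sdasids',                   # no RAM: empty list
    'a_RAM_b_xRAMy',             # two separate chunks
)
for sample in samples:
    print(sample, '->', find_all_ram_contexts(sample))
```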

Proper replacement of "beginning" non-alphanumeric characters, in python, using regular expressions

NOTE: This post is not the same as the post "Re.sub not working for me".
That post is about matching and replacing ANY non-alphanumeric substring in a string.
This question is specifically about matching and replacing non-alphanumeric substrings that explicitly show up at the beginning of a string.
The following method attempts to match any non-alphanumeric character string "AT THE BEGINNING" of a string and replace it with a new string "BEGINNING_"
def m_getWebSafeString(self, dirtyAttributeName):
    cleanAttributeName = ''.join(dirtyAttributeName)
    # Deal with beginning of string...
    cleanAttributeName = re.sub('^[^a-zA-z]*',"BEGINNING_",cleanAttributeName)
    # Deal with end of string...
    if "BEGINNING_" in cleanAttributeName:
        print(' ** ** ** D: "{}" ** ** ** C: "{}"'.format(dirtyAttributeName, cleanAttributeName))
PROBLEM DESCRIPTION: The method seems to not only replace non-alphanumeric characters, but it also incorrectly inserts the "BEGINNING_" string at the beginning of all strings that are passed into it. In other words...
GOOD RESULT: If the method is passed the string *##$ThisIsMyString1, it correctly returns BEGINNING_ThisIsMyString1
BAD/UNWANTED RESULT: However, if the method is passed the string ThisIsMyString2, it incorrectly (and always) inserts the replacement string (BEGINNING_), even if there are no non-alphanumeric characters, and yields the result BEGINNING_ThisIsMyString2
MY QUESTION: What is the correct way to write the re.sub() line so that it only replaces the non-alphanumeric characters at the beginning of the string and does not always insert the replacement string at the beginning of the original input string?
You're matching 0 or more instances of non-alphabetic characters by using the * quantifier, which means it'll always be picked up by your pattern. You can replace what you have with
re.sub('^[^a-zA-Z]+', ...)
to ensure that only 1 or more instances are matched.
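Replace
re.sub('^[^a-zA-z]*',"BEGINNING_",cleanAttributeName)
with
re.sub('^[^a-zA-Z]+',"BEGINNING_",cleanAttributeName)
The + requires at least one non-alphabetic character. Note also A-Z instead of A-z: the range A-z additionally covers the characters [ \ ] ^ _ and `, which sit between Z and a in ASCII.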
There is a more elegant solution. You can use this:
re.sub(r'^\W+', 'BEGINNING_', cleanAttributeName)
\W matches any non-word character; for ASCII text this is equivalent to the class [^a-zA-Z0-9_], so leading digits and underscores are left alone.
>>> re.sub(r'^\W+', 'BEGINNING_', '##$ThisIsMyString1')
'BEGINNING_ThisIsMyString1'
>>> re.sub(r'^\W+', 'BEGINNING_', 'ThisIsMyString2')
'ThisIsMyString2'
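To see why the * version always inserts the prefix, it may help to look at what the pattern actually matches. In this small sketch (mark_beginning is a hypothetical helper, used only to compare the two quantifiers), the * pattern succeeds with an empty match at position 0, and re.sub replaces that empty match with the replacement string.

```python
import re


def mark_beginning(text, quantifier):
    """Replace the leading run of non-letters, using the given quantifier."""
    pattern = r'^[^a-zA-Z]' + quantifier
    return re.sub(pattern, 'BEGINNING_', text)


# With *, the pattern matches the empty string at the start of the text ...
print(repr(re.match(r'^[^a-zA-Z]*', 'ThisIsMyString2').group(0)))  # ''
# ... so the replacement is inserted even though nothing was removed.
print(mark_beginning('ThisIsMyString2', '*'))      # BEGINNING_ThisIsMyString2
# With +, there is no match at all and the string is returned unchanged.
print(mark_beginning('ThisIsMyString2', '+'))      # ThisIsMyString2
print(mark_beginning('*##$ThisIsMyString1', '+'))  # BEGINNING_ThisIsMyString1
```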

How to use a Python regex to find the matched strings?

For the string "//div[#id~'objectnavigator-card-list']//li[#class~'outbound-alert-settings']", I want to find substrings of the form "#..'...'" such as "#id~'objectnavigator-card-list'" or "#class~'outbound-alert-settings'". But when I use the regex ((#.+)\~(\'.*?\')), it finds "#id~'objectnavigator-card-list']//li[#class~'outbound-alert-settings'". How should I modify the regex so that it finds each string separately?
Use non-capturing groups for the inner brackets, a non-greedy quantifier, and negated character classes that stop at the terminating character, e.g.:
re.findall(r"((?:#[^\~]+)\~(?:\'[^\]]*?\'))", test)
On your test string returns:
["#id~'objectnavigator-card-list'", "#class~'outbound-alert-settings'"]
Limit the characters you want to match between the quotes to not match the quote:
>>> re.findall(r'#[a-z]+~\'[-a-z]*\'', x)
I find it's much easier to look for only the characters I know are going to be in a matching section rather than omitting characters from more permissive matches.
For your current test string's input you can try this pattern:
import re
a = "//div[#id~'objectnavigator-card-list']//li[#class~'outbound-alert-settings']"
# find everything that begins with '#' and runs up to, but not including, ']'
regex = re.compile(r'(#[^\]]+)')
strings = re.findall(regex, a)
# Or simply:
# strings = re.findall('(#[^\\]]+)', a)
print(strings)
Output:
["#id~'objectnavigator-card-list'", "#class~'outbound-alert-settings'"]
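For comparison, here is a short sketch that puts the greedy pattern from the question next to a bounded one. The bounded pattern is just one possible variant of the answers above (it stops at ~ and at the closing quote); the variable names are only for the example.

```python
import re

text = "//div[#id~'objectnavigator-card-list']//li[#class~'outbound-alert-settings']"

# Greedy: .+ runs to the end of the string and backtracks only as far as the
# LAST ~, so both attributes end up inside a single match.
greedy = re.search(r"#.+~'.*?'", text).group(0)
print(greedy)

# Bounded: [^~]+ cannot cross a ~ and [^']* cannot cross a quote, so each
# attribute is matched on its own.
bounded = re.findall(r"#[^~]+~'[^']*'", text)
print(bounded)
```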

Python regular expression to replace everything but specific words

I am trying to do the following with a regular expression:
import re
x = re.compile('[^(going)|^(you)]') # words to replace
s = 'I am going home now, thank you.' # string to modify
print(re.sub(x, '_', s))
The result I get is:
'_____going__o___no______n__you_'
The result I want is:
'_____going_________________you_'
Since the ^ can only be used inside brackets [], this result makes sense, but I'm not sure how else to go about it.
I even tried '([^g][^o][^i][^n][^g])|([^y][^o][^u])' but it yields '_g_h___y_'.
Not quite as easy as it first appears, since there is no "not" in REs except ^ inside [ ] which only matches one character (as you found). Here is my solution:
import re

def subit(m):
    stuff, word = m.groups()
    return ("_" * len(stuff)) + word

s = 'I am going home now, thank you.' # string to modify
print(re.sub(r'(.+?)(going|you|$)', subit, s))
Gives:
_____going_________________you_
To explain: the RE itself (I always use raw strings) matches one or more of any character (.+), but non-greedily (?). This is captured in the first group (the first pair of parentheses). That is followed by either "going" or "you" or the end of the line ($), captured in the second group.
subit is a function (you can call it anything within reason) which is called for each substitution. A match object is passed, from which we can retrieve the captured groups. The first group we just need the length of, since we are replacing each character with an underscore. The returned string is substituted for that matching the pattern.
Here is a one regex approach:
>>> re.sub(r'(?!going|you)\b([\S\s]+?)(\b|$)', lambda x: (x.end() - x.start())*'_', s)
'_____going_________________you_'
The idea is that regex engines (most of them use a traditional NFA) analyze strings character by character, so when you want to exclude whole words you have to say so explicitly. Here the negative lookahead (?!going|you) rejects the two words, and the word boundaries \b define the allowed matches as whole words. Because sub replaces each match with the replacement string, you can't just pass '_': a part like "I am" would be replaced by only 3 underscores (one each for 'I', ' ' and 'am'). Instead, pass a function as the second argument of sub and multiply '_' by the length of the matched string.
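A different technique from the two answers above, shown here only as a sketch: put the words to keep in a capturing group, add "any single character" as a second alternative, and let the replacement function decide. The helper name mask_except is made up for the example, and it assumes keep contains at least one word.

```python
import re


def mask_except(text, keep):
    """Replace every character with '_' except whole-word occurrences of keep."""
    words = '|'.join(re.escape(w) for w in keep)
    # First alternative: one of the kept words (group 1). Second: any character.
    pattern = re.compile(r'\b(%s)\b|.' % words, re.DOTALL)
    # group(1) is None whenever the second alternative matched.
    return pattern.sub(lambda m: m.group(1) or '_', text)


s = 'I am going home now, thank you.'
print(mask_except(s, ['going', 'you']))  # _____going_________________you_
```

Because of the \b on both sides, a word such as "ongoing" is masked completely; drop the boundaries if you want to keep the words inside longer words too.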

Replace pairs of characters at start of string with a single character

I only want this done at the start of the string. Some examples (I want to replace "--" with "-"):
"--foo" -> "-foo"
"-----foo" -> "---foo"
"foo--bar" -> "foo--bar"
I can't simply use s.replace("--", "-") because of the third case. I also tried a regex, but I can't get it to work specifically with replacing pairs. I get as far as trying to replace r"^(?:(-){2})+" with r"\1", but that tries to replace the full block of dashes at the start, and I can't figure how to get it to replace only pairs within that block.
Final regex was:
re.sub(r'^(-+)\1', r'\1', "------foo--bar")
^ - match start
(-+) - match at least one -, but...
\1 - an equal number must exist outside the capture group.
and finally, replace with that number of hyphens, effectively cutting the number of hyphens in half.
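Here is that regex wrapped in a function and run on the examples from the question, as a quick sanity check. The function name halve_leading_dashes is only for illustration.

```python
import re


def halve_leading_dashes(text):
    """Replace each pair of leading hyphens with a single hyphen."""
    # (-+) captures n hyphens, \1 requires n more; the 2n matched hyphens are
    # replaced by the n captured ones. An odd leftover hyphen is not matched.
    return re.sub(r'^(-+)\1', r'\1', text)


for sample in ('--foo', '-----foo', 'foo--bar', '------foo--bar', '-foo'):
    print(sample, '->', halve_leading_dashes(sample))
```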
import re
print(re.sub(r'\--', '',"--foo"))
print(re.sub(r'\--', '',"-----foo"))
Output:
foo
-foo
EDIT: this answer was written for the original question, before it was completely edited and changed.
Here's it all written out for anyone else who comes this way.
>>> foo = '--foo'
>>> bar = '-----foo'
>>> foobar = 'foo--bar'
>>> foobaz = '-----foo--bar'
>>> re.sub('^(-+)\\1', '\\1', foo)
'-foo'
>>> re.sub('^(-+)\\1', '\\1', bar)
'---foo'
>>> re.sub('^(-+)\\1', '\\1', foobar)
'foo--bar'
>>> re.sub('^(-+)\\1', '\\1', foobaz)
'---foo--bar'
The signature of re.sub() is:
re.sub(pattern, replacement, string)
so in this case we want to replace -- with -. However, the issue comes when there is a -- that we don't want to replace.
In this case we only want to match -- at the beginning of a string. In Python regular expressions, the ^ character, when used at the start of the pattern string, anchors the match to the beginning of the string, which is just what we were looking for!
Note that the ^ character behaves differently when used within square brackets.
Square brackets can be used to indicate a set of chars, so [abc] matches 'a' or 'b' or 'c'... An up-hat (^) at the start of a square-bracket set inverts it, so [^ab] means any char except 'a' or 'b'.
Getting back to what we were talking about. The parentheses in the pattern represent a "group"; this group can then be referenced with \\1, meaning the first group, both later in the pattern and in the replacement. If there were a second set of parentheses, we could reference that sub-pattern with \\2. The extra \ is there to escape the next backslash. The call can also be written as re.sub(r'^(-+)\1', r'\1', foo), which makes Python treat the strings as raw strings, as denoted by the r prefix, thereby eliminating the need to double the backslashes. Using \1 as the replacement puts back as many hyphens as the group captured, i.e. half of the matched run.
Now that the pattern is all set up, you just make the replacement whatever you want to replace the pattern with, and put in the string that you are searching through.
A link that I keep handy when dealing with regular expressions, is Google's developer's notes on them.
