python regex match and replace - python

I need to find, process and remove (one by one) any substrings that match a rather long regex:
# p is a compiled regex
# s is a string
while 1:
    m = p.match(s)
    if m is None:
        break
    process(m.group(0))  # do something with the matched pattern
    s = re.sub(m.group(0), '', s)  # remove it from string s
The code above is not good for 2 reasons:
It doesn't work if m.group(0) happens to contain any regex-special characters (like *, +, etc.).
It feels like I'm duplicating the work: first I search the string for the regular expression, and then I have to kinda go look for it again to remove it.
What's a good way to do this?

The re.sub function can take a function as an argument so you can combine the replacement and processing steps if you wish:
# p is a compiled regex
# s is a string
def process_match(m):
    # Process the match here.
    return ''
s = p.sub(process_match, s)
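A minimal runnable sketch of this callback approach follows. The pattern, the sample text and the processed list are made up for illustration; the matches deliberately contain regex-special characters to show that nothing needs escaping, because the matched text is never reused as a pattern. Note that sub finds matches anywhere in the string, whereas the loop in the question used match, which only looks at the start.

```python
import re

# Hypothetical pattern: anything in square brackets.
p = re.compile(r"\[[^\]]*\]")
processed = []

def process_match(m):
    # "Process" the match (here: just record it), then remove it
    # by returning an empty replacement string.
    processed.append(m.group(0))
    return ""

s = p.sub(process_match, "keep [a*] this [b+] text")
print(processed)  # ['[a*]', '[b+]']
print(repr(s))    # 'keep  this  text'
```

Each match is handled exactly once, in left-to-right order, so the string is scanned only a single time.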

Related

How to use a Python regex to find the matched strings?

For the string "//div[#id~'objectnavigator-card-list']//li[#class~'outbound-alert-settings']", I want to find each "#..'...'" piece, such as "#id~'objectnavigator-card-list'" or "#class~'outbound-alert-settings'". But when I use the regex ((#.+)\~(\'.*?\')), it finds "#id~'objectnavigator-card-list']//li[#class~'outbound-alert-settings'". How should I modify the regex so that it finds each piece separately?
Make the inner groups non-capturing, and instead of a greedy .+ match only characters that are not the terminating character (a negated character class), e.g.:
re.findall(r"((?:#[^\~]+)\~(?:\'[^\]]*?\'))", test)
On your test string returns:
["#id~'objectnavigator-card-list'", "#class~'outbound-alert-settings'"]
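To see why the original pattern over-matches, the two approaches can be compared side by side. This is a simplified sketch: the patterns below drop the capture groups from the question and use a negated class for the quoted part, so they are not the exact expressions above.

```python
import re

test = "//div[#id~'objectnavigator-card-list']//li[#class~'outbound-alert-settings']"

# Greedy .+ runs to the end of the string and backtracks only as far as
# the LAST '~', so both attributes end up in a single match.
greedy = re.search(r"#.+~'.*?'", test).group(0)

# Negated character classes cannot run past the first '~' or the closing quote.
fixed = re.findall(r"#[^~]+~'[^']*'", test)

print(greedy)
print(fixed)
```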
Limit the characters you want to match between the quotes to not match the quote:
>>> re.findall(r'#[a-z]+~\'[-a-z]*\'', x)
I find it's much easier to look for only the characters I know are going to be in a matching section rather than omitting characters from more permissive matches.
For your current test string's input you can try this pattern:
import re
a = "//div[#id~'objectnavigator-card-list']//li[#class~'outbound-alert-settings']"
# find everything that begins with '#' and stops before the next ']'
regex = re.compile(r'(#[^\]]+)')
strings = re.findall(regex, a)
# Or simply:
# strings = re.findall('(#[^\\]]+)', a)
print(strings)
Output:
["#id~'objectnavigator-card-list'", "#class~'outbound-alert-settings'"]

Python regular expression to replace everything but specific words

I am trying to do the following with a regular expression:
import re
x = re.compile('[^(going)|^(you)]') # words to replace
s = 'I am going home now, thank you.' # string to modify
print(re.sub(x, '_', s))
The result I get is:
'_____going__o___no______n__you_'
The result I want is:
'_____going_________________you_'
Since ^ negates only inside brackets [] (and only one character at a time), this result makes sense, but I'm not sure how else to go about it.
I even tried '([^g][^o][^i][^n][^g])|([^y][^o][^u])' but it yields '_g_h___y_'.
Not quite as easy as it first appears, since ^ inside [ ] negates only a single character (as you found) and there is no simple "not this word" operator in REs. Here is my solution:
import re
def subit(m):
    stuff, word = m.groups()
    return ("_" * len(stuff)) + word
s = 'I am going home now, thank you.' # string to modify
print(re.sub(r'(.+?)(going|you|$)', subit, s))
Gives:
_____going_________________you_
To explain: the RE itself (I always use raw strings) matches one or more of any character (.+) but is non-greedy (?). This is captured in the first group (the first pair of parentheses). That is followed by either "going" or "you" or the end of the string ($), captured in the second group.
subit is a function (you can call it anything within reason) which is called for each match. A match object is passed, from which we can retrieve the captured groups. We only need the length of the first group, since we are replacing each of its characters with an underscore. The returned string replaces the text that matched the pattern.
Here is a one regex approach:
>>> re.sub(r'(?!going|you)\b([\S\s]+?)(\b|$)', lambda x: (x.end() - x.start())*'_', s)
'_____going_________________you_'
The idea is that when you want to exclude whole words, you need to remember that most regex engines (most of them use a traditional NFA) analyze the string character by character. Since you want to exclude two words with a negative lookahead, you need to define the allowed pieces in terms of words (using word boundaries). And since sub replaces each match with the replacement string, you can't just pass '_': a part like "I am" would then become 3 underscores (one each for 'I', ' ' and 'am') instead of 4. So pass a function as the second argument of sub and multiply '_' by the length of the matched string.
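As a further variation (not taken from the answers above), the kept words can be tried first in an alternation, with every other character replaced one at a time. This is a minimal sketch, assuming the words should only be kept when they stand alone as whole words; the name mask_except and its parameters are made up for illustration.

```python
import re

def mask_except(text, words, fill="_"):
    # Try the kept words first (as whole words); otherwise consume one character.
    keep = "|".join(re.escape(w) for w in words)
    pattern = re.compile(rf"\b({keep})\b|.", re.DOTALL)
    # group(1) is set only when a kept word matched.
    return pattern.sub(lambda m: m.group(1) or fill, text)

s = 'I am going home now, thank you.'
print(mask_except(s, ["going", "you"]))  # _____going_________________you_
```

Because of the word boundaries, a longer word such as "younger" is masked in full rather than keeping its "you" prefix.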

Python regex, how to delete all matches from a string

I have a list of regex patterns.
rgx_list = ['pattern_1', 'pattern_2', 'pattern_3']
And I am using a function to loop through the list, compile the regexes, and apply findall to grab the matched terms. I would then like a way of deleting those terms from the text.
def clean_text(rgx_list, text):
    matches = []
    for r in rgx_list:
        rgx = re.compile(r)
        found_matches = re.findall(rgx, text)
        matches.append(found_matches)
I want to do something like text.delete(matches) so that all of the matches will be deleted from the text and then I can return the cleansed text.
Does anyone know how to do this? My current code will only work for one match of each pattern, but the text may have more than one occurrence of the same pattern and I would like to eliminate all matches.
Use sub to replace matched patterns with an empty string. No need to separately find the matches first.
def clean_text(rgx_list, text):
    new_text = text
    for rgx_match in rgx_list:
        new_text = re.sub(rgx_match, '', new_text)
    return new_text
For simple regexes you can OR the expressions together using "|". There are examples of combining regexes with OR on Stack Overflow.
For really complex regexes I would loop through the list as above, since a combined complex regex can run slowly enough to cause timeouts.
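The OR-combination mentioned above could be sketched as follows. The function name clean_text_combined and the sample patterns are made up for illustration. Each pattern is wrapped in a non-capturing group so that any "|" inside one pattern does not interfere with the others.

```python
import re

def clean_text_combined(rgx_list, text):
    # Join all patterns into one alternation and remove every match in a single pass.
    combined = re.compile("|".join(f"(?:{r})" for r in rgx_list))
    return combined.sub("", text)

cleaned = clean_text_combined([r"\d+", r"\s+foo"], "abc123 foo def45")
print(cleaned)  # abc def
```

One difference from the loop version: a single pass never matches text that only becomes adjacent after an earlier removal, so the two approaches can give different results for patterns that interact.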

Best way to split a string for the last space

I'm wondering what the best way is to split a space-separated string at the last space that is not inside [, {, ( or ". For instance I could have:
a = 'a b c d e f "something else here"'
b = 'another parse option {(["gets confusing"])}'
For a it should parse into ['a', 'b', 'c', 'd', 'e', 'f'], ["something else here"]
and b should parse into ['another', 'parse', 'option'], ['([{"gets confusing"}])']
Right now I have this:
import sys
def getMin(aList):
    min = sys.maxsize
    for item in aList:
        # find() returns -1 when the character is absent, so skip those
        if item < min and item != -1:
            min = item
    return min
myList = []
myList.append(b.find('['))
myList.append(b.find('{'))
myList.append(b.find('('))
myList.append(b.find('"'))
myMin = getMin(myList)
print(b[:myMin], b[myMin:])
I'm sure there's better ways to do this and I'm open to all suggestions
Matching vs. Splitting
There is an easy solution. The key is to understand that matching and splitting are two sides of the same coin. When you say "match all", that means "split on what I don't want to match", and vice-versa. Instead of splitting, we're going to match, and you'll end up with the same result.
The Reduced, Simple Version
Let's start with the simplest version of the regex so you don't get scared by something long:
{[^{}]*}|\S+
This matches all the items of your second string—the same as if we were splitting (see demo)
The left side of the | alternation matches complete sets of {braces}.
The right side of the | matches any characters that are not whitespace characters.
It's that simple!
The Full Regex
We also need to match "full quotes", (full parentheses) and [full brackets]. No problem: we just add them to the alternation. Just for clarity, I'm throwing them together in a non-capture group (?: so that the \S+ pops out on its own, but there is no need.
(?:{[^{}]*}|"[^"]*"|\([^()]*\)|\[[^][]*\])|\S+
See demo.
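In Python, this matching approach might look like the following sketch. It assumes, as in the question, that the quoted or bracketed group is the last token; the name split_last_group is made up for illustration, and the braces and brackets are escaped here only for readability.

```python
import re

# Same alternation as above: one complete group, or a run of non-whitespace.
TOKEN = re.compile(r'(?:\{[^{}]*\}|"[^"]*"|\([^()]*\)|\[[^\[\]]*\])|\S+')

def split_last_group(text):
    tokens = TOKEN.findall(text)
    # Everything except the last token, then the last token in its own list.
    return tokens[:-1], tokens[-1:]

a = 'a b c d e f "something else here"'
b = 'another parse option {(["gets confusing"])}'
print(split_last_group(a))
print(split_last_group(b))
```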
Notes and Potential Improvements
We could replace the quoted string regex by one that accepts escaped quotes
We could replace the brace, brackets and parentheses expressions by recursive expressions to allow nested constructions, but you'd have to use Matthew Barnett's (awesome) regex module instead of re
The technique is related to a simple and beautiful trick to Match (or replace) a pattern except when...
Let me know if you have questions!
You can use regular expressions:
import re
def parse(text):
    # the '[' inside the character class is escaped to avoid a "nested set" warning
    m = re.search(r'(.*) ([\[({"].*)', text)
    if not m:
        return None
    return m.group(1).split(), [m.group(2)]
The first part (.*) catches everything up to the section in quotes or parenthesis, and the second part catches anything starting at a character in ([{".
If you need something more robust, this has a more complicated regular expression, but it will make sure that the opening token is matched, and it makes the last expression optional.
def parse(text):
    m = re.search(r'(.*?)(?: ("[^"]*"|\([^)]*\)|\[[^]]*\]|\{[^}]*\}))?$', text)
    if not m:
        return None
    # group(2) is None when there is no trailing quoted or bracketed section
    return m.group(1).split(), [m.group(2)] if m.group(2) else []
Perhaps this link will help:
Split a string by spaces -- preserving quoted substrings -- in Python
It explains how to preserve quoted substrings when splitting a string by spaces.

Find the indexes of all regex matches?

I'm parsing strings that could have any number of quoted strings inside them (I'm parsing code, and trying to avoid PLY). I want to find out if a substring is quoted, and I have the substring's index. My initial thought was to use re to find all the matches and then figure out the range of indexes they represent.
It seems like I should use re with a regex like \"[^\"]+\"|'[^']+' (I'm avoiding dealing with triple quoted and such strings at the moment). When I use findall() I get a list of the matching strings, which is somewhat nice, but I need indexes.
My substring might be as simple as c, and I need to figure out if this particular c is actually quoted or not.
This is what you want, from the re module documentation:
re.finditer(pattern, string[, flags])
Return an iterator yielding MatchObject instances over all
non-overlapping matches for the RE pattern in string. The string is
scanned left-to-right, and matches are returned in the order found. Empty
matches are included in the result unless they touch the beginning of
another match.
You can then get the start and end positions from the MatchObjects.
e.g.
[(m.start(0), m.end(0)) for m in re.finditer(pattern, string)]
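Applied to the original question, those spans can be used to check whether a given index falls inside a quoted string. This is a minimal sketch using the regex from the question; the names quoted_spans and is_quoted and the sample line of code are made up for illustration.

```python
import re

# Regex from the question: double- or single-quoted strings, no escapes handled.
QUOTED = re.compile(r'"[^"]+"|\'[^\']+\'')

def quoted_spans(text):
    # (start, end) pairs, with end exclusive as returned by the match object.
    return [(m.start(), m.end()) for m in QUOTED.finditer(text)]

def is_quoted(text, index):
    # True if the character at `index` lies inside any quoted span.
    return any(start <= index < end for start, end in quoted_spans(text))

code = 'c = "abc" + c'
print(quoted_spans(code))   # [(4, 9)]
print(is_quoted(code, 7))   # True: the c inside "abc"
print(is_quoted(code, 12))  # False: the trailing bare c
```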
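To get the indices of all occurrences:
import re
S = input() # Source string
k = input() # Pattern to search for
pattern = re.compile(k)
r = pattern.search(S)
if not r:
    print("(-1, -1)")
while r:
    # end() is exclusive, so subtract 1 to print an inclusive end index
    print("({0}, {1})".format(r.start(), r.end() - 1))
    # resume one character after the previous start so overlapping matches are found too
    r = pattern.search(S, r.start() + 1)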
This should solve your issue:
pattern = r"(?=(\"[^\"]+\"|'[^']+'))"
Then use the following to get the indices of all matches, including overlapping ones (text is the string being searched; the end index is inclusive):
indicesTuple = [(mObj.start(1), mObj.end(1) - 1) for mObj in re.finditer(pattern, text)]
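The effect of wrapping the pattern in a lookahead is easier to see on a small example. The following sketch uses a made-up string and pattern rather than the quote regex, and reports exclusive end indices.

```python
import re

text = "aaaa"

# A plain pattern consumes each match, so matches cannot overlap.
plain = [(m.start(), m.end()) for m in re.finditer(r"aa", text)]

# A lookahead consumes nothing, so the scan advances one character at a
# time and group 1 captures a match starting at every position.
overlapping = [(m.start(1), m.end(1)) for m in re.finditer(r"(?=(aa))", text)]

print(plain)        # [(0, 2), (2, 4)]
print(overlapping)  # [(0, 2), (1, 3), (2, 4)]
```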
