I am trying to extract the strings inside nested brackets and take their product.
Let's say I have the following string:
string = "(A((B|C)D|E|F))"
According to the answer in Extract string inside nested brackets,
I can extract the strings inside nested brackets, but my case is different because there is a "D" right after a closing bracket. This is the result from that code, which is far from my desired output:
['B|C', 'D|E|F', 'A']
This is my desired output
[[['A'],['B|C'],['D']], [['A'],['E|F']]] # '|' means OR
Do you have any recommendation? Should I implement this using a regular expression, or just run through the whole given string?
That should lead to my final result, which is
"ABD"
"ACD"
"AE"
"AF"
At this point, I will use itertools.product.
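For illustration, here is a minimal sketch of that last step, assuming the intermediate structure from the desired output above has already been built. The helper name expand is made up for this example.

```python
from itertools import product

def expand(groups):
    """Expand one group such as [['A'], ['B|C'], ['D']] into ['ABD', 'ACD']."""
    # Each element holds a single string; '|' separates its alternatives.
    choices = [element[0].split('|') for element in groups]
    return [''.join(combo) for combo in product(*choices)]

desired = [[['A'], ['B|C'], ['D']], [['A'], ['E|F']]]
results = [s for groups in desired for s in expand(groups)]
print(results)  # ['ABD', 'ACD', 'AE', 'AF']
```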
You didn't specify the language precisely, but it looks like arbitrarily nested brackets are allowed, so it's not a regular language. I wouldn't recommend parsing it with regular expressions (it might be possible, as regular expressions in Python are not truly regular, but even if it is, it'll probably be a mess).
I'd recommend defining a context-free grammar for your language and parsing that instead. Here's how you can do it:
EXPR -> A EXPR (an expression is an alphabetic character followed by an expression)
EXPR -> (LIST) EXPR (an expression is a list followed by an expression)
EXPR -> "" (an expression can be an empty string)
LIST -> EXPR | LIST (a list is an expression followed by "|" followed by a list)
LIST -> EXPR (or just one expression)
This grammar can be parsed by a simple top-down recursive parser which works in linear time. Here's a sample implementation:
class Parser:
    def __init__(self, data):
        self.data = data
        self.pos = 0

    def get_cur_char(self):
        """
        Returns the current character or None if the input is over
        """
        return None if self.pos == len(self.data) else self.data[self.pos]

    def advance(self):
        """
        Moves to the next character of the input if the input is not over.
        """
        if self.pos < len(self.data):
            self.pos += 1

    def get_and_advance(self):
        """
        Returns the current character and moves to the next one.
        """
        res = self.get_cur_char()
        self.advance()
        return res

    def parse_expr(self):
        """
        Parses the EXPR according to the specified grammar.
        """
        cur_char = self.get_cur_char()
        if cur_char == '(':
            # EXPR -> (LIST) EXPR rule
            self.advance()
            # Parses the list and the rest of the expression and combines
            # the results.
            prefixes = self.parse_list()
            suffices = self.parse_expr()
            return [p + s for p in prefixes for s in suffices]
        elif not cur_char or cur_char == ')' or cur_char == '|':
            # EXPR -> Empty rule. Returns a list with an empty string without
            # consuming the input.
            return ['']
        else:
            # EXPR -> A EXPR rule.
            # Parses the rest of the expression and prepends the current
            # character.
            self.advance()
            return [cur_char + s for s in self.parse_expr()]

    def parse_list(self):
        """
        Parses the LIST according to the specified grammar.
        """
        first_expr = self.parse_expr()
        # Uses the LIST -> EXPR | LIST rule if the next character is | and
        # LIST -> EXPR otherwise
        return first_expr + (self.parse_list() if self.get_and_advance() == '|' else [])

if __name__ == '__main__':
    string = "(A((B|C)D|E|F))"
    parser = Parser(string)
    print('\n'.join(parser.parse_expr()))
If you're not familiar with this technique, you can read more about it here.
This implementation is not the most efficient one (for instance, it uses lists explicitly instead of iterators), but it's a good starting point.
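As a rough sketch of the iterator-based direction mentioned above (not the answer's own code): one option is to parse the input into a nested structure first and then expand it with a generator. The names parse_alternatives and expand are invented for this example, and balanced brackets are assumed.

```python
from itertools import product

def parse_alternatives(s, pos=0):
    """Parse alternatives separated by '|' up to a ')' or the end of input.

    Returns (alternatives, new_pos), where each alternative is a list whose
    items are either single characters or nested lists of alternatives.
    """
    alternatives, sequence = [], []
    while pos < len(s) and s[pos] != ')':
        ch = s[pos]
        pos += 1
        if ch == '|':
            alternatives.append(sequence)
            sequence = []
        elif ch == '(':
            nested, pos = parse_alternatives(s, pos)
            pos += 1  # skip the closing ')'
            sequence.append(nested)
        else:
            sequence.append(ch)
    alternatives.append(sequence)
    return alternatives, pos

def expand(alternatives):
    """Yield every string described by the parsed structure, one at a time."""
    for sequence in alternatives:
        choices = [expand(item) if isinstance(item, list) else [item]
                   for item in sequence]
        for combo in product(*choices):
            yield ''.join(combo)

tree, _ = parse_alternatives("(A((B|C)D|E|F))")
for s in expand(tree):
    print(s)
```

Note that itertools.product still materializes each nested group internally, so this is only lazy at the outermost level.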
I would suggest going for a solution that targets the final result directly, that is, a function that makes this transformation:
input: "(A((B|C)D|E|F))"
output: ['ABD', 'ACD', 'AE', 'AF']
Here is the code I would propose:
import re

def tokenize(text):
    return re.findall(r'[()|]|\w+', text)

def product(a, b):
    return [x+y for x in a for y in b] if a and b else a or b

def parse(text):
    tokens = tokenize(text)
    def recurse(tokens, i):
        factor = []
        term = []
        while i < len(tokens) and tokens[i] != ')':
            token = tokens[i]
            i += 1
            if token == '|':
                term.extend(factor)
                factor = []
            else:
                if token == '(':
                    expr, i = recurse(tokens, i)
                else:
                    expr = [token]
                factor = product(factor, expr)
        return term+factor, i+1
    return recurse(tokens, 0)[0]

string = "(A((B|C)D|E|F))"
print(parse(string))
See it run on repl.it
As the title suggests, I'm trying to parse a piece of code into a tree or a list.
First off, thank you for any contribution and time spent on this.
So far my code does what I expect, yet I am not sure that this is the optimal or most generic way to do it.
Problem
1. I want a more generic solution, since in the future I am going to need further analysis of this syntax.
2. Right now I am unable to separate operators like '=' or '>=', as you can see in the output I share below.
In the future I might change the content of the list / tree from strings to tuples so I can identify the kind of token (parameter, comparison operator like = or >=, ...). But this is not a real need right now.
Research
My first attempt was parsing the text character by character, but my code was getting too messy and barely readable, so I assumed that I was doing something wrong there (I don't have that code to share here anymore).
So I started looking around at how people were doing it and found some approaches that didn't necessarily fulfil the requirements of being simple and generic.
I would share the links to the sites, but I didn't keep track of them.
The Syntax of the code
The syntax is pretty simple; after all, I'm not interested in types or any further detail, just the functions and parameters.
Strings are written as 'my string', variables as !variable, and numbers as in any other language.
Here is a sample of code:
db('1', '2', if(ATTRS('Dim 1', !Element Structure, 'ID') = '3','4','5'), 6)
My Output
Here my output is partially correct, since I'm still unable to separate the "= '3'" part (I have to separate it because in this case it's a comparison operator and not part of a string).
[{'db': ["'1'", "'2'", {'if': [{'ATTRS': ["'Dim 1'", '!Element Structure', "'ID'"]}, "= '3'", "'4'", "'5'"]}, '6']}]
Desired Output
[{'db': ["'1'", "'2'", {'if': [{'ATTRS': ["'Dim 1'", '!Element Structure', "'ID'"]}, "=", "'3'", "'4'", "'5'"]}, '6']}]
My code so far
The parseRecursive method is the entry point.
import re

class FileParser:
    # order is important to avoid mis-splits, so this is a list, not a set
    COMPARATOR_SIGN = [
        '#='
        ,'#<>'
        ,'<>'
        ,'>='
        ,'<='
        ,'='
        ,'>'
        ,'<'
    ]

    def __init__(self):
        pass

    def __charExistsInOccurences(self, current_needle, needles, text):
        """
        check if other needles are present in text
        current_needle : string -> the current needle being evaluated
        needles : list -> list of needles
        text : string/list<string> -> a string or a list of strings to evaluate
        """
        # if text is a string convert it to a list of strings
        text = text if isinstance(text, list) else [text]
        exists = False
        for t in text:
            # check if needle is inside text value
            for needle in needles:
                # don't check the same key
                if needle != current_needle:
                    regex_search_needle = split_regex = r'\s*' + r'\s*'.join(needle) + r'\s*'
                    # list of 1's and 0's: 1 if another character is found in the string
                    found = [1 if re.search(regex_search_needle, x) else 0 for x in t]
                    if sum(found) > 0:
                        exists = True
                        break
        return exists

    def findOperator(self, needles, haystack):
        """
        split parameters from operators
        needles : list -> list of operators
        haystack : string
        """
        string_open = haystack.find("'")
        # if no string has been found set the index to 0
        if string_open < 0:
            string_open = 0
        occurences = []
        string_closure = haystack.rfind("'")
        operator = ''
        for needle in needles:
            # regex to ignore the possible spaces between characters of the needle
            split_regex = r'\s*' + r'\s*'.join(needle) + r'\s*'
            # parse parameters before and after the string
            before_string = re.split(split_regex, haystack[0:string_open])
            after_string = re.split(split_regex, haystack[string_closure+1:])
            # check if any other needle exists in the results found
            before_string_exists = self.__charExistsInOccurences(needle, needles, before_string)
            after_string_exists = self.__charExistsInOccurences(needle, needles, after_string)
            # if the operator has been found, merge the results with the occurrences and assign the operator
            if not before_string_exists and not after_string_exists:
                occurences.extend(before_string)
                occurences.extend([haystack[string_open:string_closure+1]])
                occurences.extend(after_string)
                operator = needle
        # filter out the blank strings generated
        occurences = list(filter(lambda x: len(x.strip()) > 0, occurences))
        result_check = [1 if x == haystack else 0 for x in occurences]
        # if the haystack was originally a simple string like '1', the occurrences list
        # is filled with the same value over and over by the before/after string parts
        if len(result_check) == sum(result_check):
            occurences = [haystack]
            operator = ''
        return operator, occurences

    def parseRecursive(self, text):
        """
        parse a block of text
        text : string (may be empty when nothing follows a closing parenthesis)
        """
        function_open = text.find('(')
        accumulated_params = []
        if function_open > -1:
            # there is another function nested
            text_prev_function = text[0:function_open]
            # find the last space, comma or equals sign to retrieve the function name
            last_space = -1
            for j in range(len(text_prev_function)-1, 0, -1):
                if text_prev_function[j] == ' ' or text_prev_function[j] == ',' or text_prev_function[j] == '=':
                    last_space = j
                    break
            func_name = ''
            if last_space > -1:
                # there is something else before the function name
                func_name = text_prev_function[last_space+1:]
                # no parenthesis before, so the characters preceding the function name are parameters
                text_prev_func_params = list(filter(lambda x: len(x.strip()) > 0, text_prev_function[:last_space+1].split(',')))
                text_prev_func_params = [x.strip() for x in text_prev_func_params]
                for itext_prev in text_prev_func_params:
                    operator, text_prev_operator = self.findOperator(self.COMPARATOR_SIGN, itext_prev)
                    if operator == '':
                        accumulated_params.extend(text_prev_operator)
                    else:
                        text_prev_operator.append(operator)
                        accumulated_params.extend(text_prev_operator)
            else:
                # function name is the start of the string
                func_name = text_prev_function[0:].strip()
            # find the closing parenthesis
            function_close = text.rfind(')')
            # parse the next function and extend the current list of parameters
            next_func = text[function_open+1:function_close]
            func_params = {func_name: self.parseRecursive(next_func)}
            accumulated_params.append(func_params)
            # parameters after the function
            new_text = text[function_close+1:]
            accumulated_params.extend(self.parseRecursive(new_text))
        else:
            # there is no other function nested
            split_text = text.split(',')
            current_func_params = list(filter(lambda x: len(x.strip()) > 0, split_text))
            current_func_params = [x.strip() for x in current_func_params]
            accumulated_params.extend(current_func_params)
        return accumulated_params

text = "db('1', '2', if(ATTRS('Dim 1', !Element Structure, 'ID') = '3','4','5'), 6)"
obj = FileParser()
print(obj.parseRecursive(text))
You can use pyparsing to deal with such a case.
* pyparsing can be installed with pip install pyparsing
Code:
import pyparsing as pp

# A parsing pattern
w = pp.Regex(r'(?:![^(),]+)|[^(), ]+') ^ pp.Suppress(',')
pattern = w + pp.nested_expr('(', ')', content=w)

# A recursive function to transform a pyparsing result into your desired format
def transform(elements):
    stack = []
    for e in elements:
        if isinstance(e, list):
            key = stack.pop()
            stack.append({key: transform(e)})
        else:
            stack.append(e)
    return stack

# A sample
string = "db('1', '2', if(ATTRS('Dim 1', !Element Structure, 'ID') = '3','4','5'), 6)"

# Operations to parse the sample string
elements = pattern.parse_string(string).as_list()
result = transform(elements)

# Assertion
assert result == [{'db': ["'1'", "'2'", {'if': [{'ATTRS': ["'Dim 1'", '!Element Structure', "'ID'"]}, '=', "'3'", "'4'", "'5'"]}, '6']}]

# Show the result
print(result)
Output:
[{'db': ["'1'", "'2'", {'if': [{'ATTRS': ["'Dim 1'", '!Element Structure', "'ID'"]}, '=', "'3'", "'4'", "'5'"]}, '6']}]
Note:
If there are unbalanced parentheses inside () (for example a(b(c), a(b)c), etc.), you get an unexpected result or an IndexError is raised, so be careful in such cases.
At the moment, only a single sample is available for building the parsing pattern, so if you encounter a parsing error, please provide more examples in your question.
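If a third-party dependency is not an option, a similar result can be sketched with only the standard library: a regular-expression tokenizer followed by a small recursive builder. This is an illustration under assumptions, not a full parser: the names TOKEN, tokenize and parse are invented here, the token rules are guessed from the single sample, and balanced parentheses are assumed.

```python
import re

# One token per match: a quoted string, a comparison operator, punctuation,
# or a bare word such as a function name, a number or a !variable
# (bare words may contain inner spaces).
TOKEN = re.compile(
    r"\s*('[^']*'|#<>|#=|<>|>=|<=|[=<>(),]|[^(),'=<>#\s][^(),'=<>#]*)"
)

def tokenize(text):
    return [m.group(1).strip() for m in TOKEN.finditer(text)]

def parse(text):
    tokens = tokenize(text)

    def parse_args(i):
        items = []
        while i < len(tokens) and tokens[i] != ')':
            tok = tokens[i]
            i += 1
            if tok == ',':
                continue
            if tok == '(':
                # The item just before '(' is taken to be the function name.
                name = items.pop()
                args, i = parse_args(i)
                i += 1  # skip the closing ')'
                items.append({name: args})
            else:
                items.append(tok)
        return items, i

    return parse_args(0)[0]

sample = "db('1', '2', if(ATTRS('Dim 1', !Element Structure, 'ID') = '3','4','5'), 6)"
print(parse(sample))
```

Because the operators are separate token alternatives, '=' comes out as its own list element; multi-character operators are listed before single-character ones so that '>=' is not split into '>' and '='.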
I am trying to make a function that takes an equation as input and evaluates it based on the operations. The rule is that the operators (*, +, -, %, ^) must appear between correct mathematical expressions. Examples:
Input: 6**8
Result: Not correct
Reason: * has another * next to it instead of a digit or a mathematical expression
Input: -6+2
Result: Not correct
Reason: "-" was in the beginning and it didn't fall between two numbers.
Input: 6*(2+3)
Result: Correct
Reason: "*" was next to a mathematically correct expression "(2+3)"
1. Option: eval
Evaluate the expression with eval inside a try-except:
try:
    result = eval(expression)
    correct_sign = True
except SyntaxError:
    correct_sign = False
Advantages:
Very easy and fast
Disadvantages:
Python accepts expressions that you probably don't want (e.g. ** is valid in Python)
eval is not secure
2. Option: Algorithm
Compilers use algorithms to turn a math expression into a form the machine can process. These algorithms can also be used to check whether the expression is valid.
I won't explain these algorithms here; there are plenty of resources available.
This is a very brief structure of what you can do:
Parsing an infix expression
Converting infix expression to a postfix expression
Evaluating the postfix expression
You need to understand what postfix and infix expressions mean.
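To make the three steps concrete, here is a minimal sketch of the shunting-yard approach with validation built in. It is only an illustration: the function names, the precedence table and the choice to treat ^ as right-associative exponentiation are assumptions of this example, and unary signs are rejected on purpose to match the rules in the question.

```python
import operator
import re

# operator -> (precedence, is_right_associative, implementation)
OPS = {
    '+': (1, False, operator.add),
    '-': (1, False, operator.sub),
    '*': (2, False, operator.mul),
    '/': (2, False, operator.truediv),
    '%': (2, False, operator.mod),
    '^': (3, True, operator.pow),
}

def tokenize(expr):
    tokens = re.findall(r'\d+(?:\.\d+)?|[-+*/%^()]', expr)
    if ''.join(tokens) != expr.replace(' ', ''):
        raise ValueError('unexpected character')
    return tokens

def to_postfix(tokens):
    output, stack = [], []
    expect_operand = True  # True at the start and right after an operator or '('
    for tok in tokens:
        if tok in OPS:
            if expect_operand:
                raise ValueError('operator without a left operand')
            prec, right_assoc, _ = OPS[tok]
            while stack and stack[-1] in OPS and (
                    OPS[stack[-1]][0] > prec
                    or (OPS[stack[-1]][0] == prec and not right_assoc)):
                output.append(stack.pop())
            stack.append(tok)
            expect_operand = True
        elif tok == '(':
            if not expect_operand:
                raise ValueError('missing operator before "("')
            stack.append(tok)
        elif tok == ')':
            if expect_operand:
                raise ValueError('missing operand before ")"')
            while stack and stack[-1] != '(':
                output.append(stack.pop())
            if not stack:
                raise ValueError('unbalanced ")"')
            stack.pop()  # discard the matching '('
            expect_operand = False
        else:
            if not expect_operand:
                raise ValueError('missing operator between operands')
            output.append(float(tok))
            expect_operand = False
    if expect_operand:
        raise ValueError('expression is empty or ends with an operator')
    while stack:
        top = stack.pop()
        if top == '(':
            raise ValueError('unbalanced "("')
        output.append(top)
    return output

def eval_postfix(postfix):
    stack = []
    for tok in postfix:
        if isinstance(tok, float):
            stack.append(tok)
        else:
            right, left = stack.pop(), stack.pop()
            stack.append(OPS[tok][2](left, right))
    return stack[0]

def evaluate(expr):
    """Return the value of expr, or None if it is not a valid expression."""
    try:
        return eval_postfix(to_postfix(tokenize(expr)))
    except ValueError:
        return None

for expr in ('6**8', '-6+2', '6*(2+3)'):
    print(expr, '->', evaluate(expr))
```

Division by zero is not handled here and would still raise ZeroDivisionError.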
Resources:
Shunting yard algorithm: https://en.wikipedia.org/wiki/Shunting-yard_algorithm
Reverse polish notation/ post fix notation: https://en.wikipedia.org/wiki/Reverse_Polish_notation
Python builtin tokenizer: https://docs.python.org/3.7/library/tokenize.html
Advantages:
Reliable
Works for complicated expressions
You don't have to reinvent the wheel
Disadvantages:
Complicated to understand
Complicated to implement
As mentioned in comments, this is called parsing and requires a grammar.
See an example with parsimonious, a PEG parser:
from parsimonious.grammar import Grammar
from parsimonious.nodes import NodeVisitor
from parsimonious.exceptions import ParseError

grammar = Grammar(
    r"""
    expr     = (term operator term)+
    term     = (lpar factor rpar) / number
    factor   = (number operator number)
    operator = ws? (div / mult / sub / add) ws?
    add      = "+"
    sub      = "-"
    mult     = "*"
    div      = "/"
    number   = ~"\d+(?:\.\d+)?"
    lpar     = ws? "(" ws?
    rpar     = ws? ")" ws?
    ws       = ~"\s+"
    """
)

class SimpleCalculator(NodeVisitor):
    def generic_visit(self, node, children):
        return children or node

    def visit_expr(self, node, children):
        return self.calc(children[0])

    def visit_operator(self, node, children):
        _, operator, *_ = node
        return operator

    def visit_term(self, node, children):
        child = children[0]
        if isinstance(child, list):
            _, factor, *_ = child
            return factor
        else:
            return child

    def visit_factor(self, node, children):
        return self.calc(children)

    def calc(self, params):
        """ Calculates the actual equation. """
        x, op, y = params
        op = op.text
        if not isinstance(x, float):
            x = float(x.text)
        if not isinstance(y, float):
            y = float(y.text)
        if op == "+":
            return x+y
        elif op == "-":
            return x-y
        elif op == "/":
            return x/y
        elif op == "*":
            return x*y

equations = ["6 *(2+3)", "2+2", "4*8", "123-23", "-1+1", "100/10", "6**6"]
c = SimpleCalculator()
for equation in equations:
    try:
        tree = grammar.parse(equation)
        result = c.visit(tree)
        print("{} = {}".format(equation, result))
    except ParseError:
        print("The equation {} could not be parsed.".format(equation))
This yields
6 *(2+3) = 30.0
2+2 = 4.0
4*8 = 32.0
123-23 = 100.0
The equation -1+1 could not be parsed.
100/10 = 10.0
The equation 6**6 could not be parsed.
You need to use the right data structures and algorithms to parse a mathematical equation and evaluate it.
You also have to be familiar with two concepts for creating a parser: stacks and trees.
I think the best approach you can use is RPN (Reverse Polish Notation).
For issue #1, you could always strip out the parentheses before evaluating.
input_string = "6*(2+3)"
it = filter(lambda x: x != '(' and x != ')', input_string)
after = ' '.join(list(it))
print(after)
# prints "6 * 2 + 3"
It looks like you might just be starting to use python. There are always many ways to solve a problem. One interesting one to sort of get you jump started would be to consider splitting the equation based on the operators.
For example the following uses what's called a regular expression to split the formula:
>>> import re
>>> formula2 = '6+3+5--5'
>>> re.split(r'\*|\/|\%|\^|\+|\-',formula2)
['6', '3', '5', '', '5']
>>> formula3 = '-2+5'
>>> re.split(r'\*|\/|\%|\^|\+|\-',formula3)
['', '2', '5']
It may look complex, but in the r'\*|\/|\%|\^|\+|\-' piece the \ means to take the next character literally and the | means 'or' so it evaluates to split on any one of those operators.
In this case you'd notice that any time there are two operators together, or when a formula starts with an operator you will have a blank value in your list - one for the second - in the first formula and one for the leading - in the second formula.
Based on that you could say something like:
if '' in re.split(r'\*|\/|\%|\^|\+|\-',formula):
    correctsign = False
Maybe this can serve as a good starting point to get the brain thinking about interesting ways to solve the problem.
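Wrapped up as a function, the idea could look like the sketch below. The name has_valid_signs is made up for this example, and the check is deliberately shallow: it only detects operators that are doubled or sit at either end, so something like '6*(+3)' still passes.

```python
import re

# Any single operator character; the character class is equivalent to the
# alternation used above.
OPERATORS = re.compile(r'[*/%^+\-]')

def has_valid_signs(formula):
    """Return False if an operator starts or ends the formula or touches another one."""
    # Drop spaces first so that '6 + + 3' is treated like '6++3'.
    return '' not in OPERATORS.split(formula.replace(' ', ''))

for formula in ('6**8', '-6+2', '6*(2+3)'):
    print(formula, has_valid_signs(formula))
```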
Important to first mention that ** stands for exponentiation, i.e 6**8: 6 to the power of 8.
The logic behind your algorithm is wrong because in your code the response depends only on whether the last digit/sign satisfies your conditions. This is because once the loop is complete, your boolean correctsigns defaults to True or False based on the last digit/sign.
You can also use elif instead of nested else statements for cleaner code.
Without changing your core algorithm, your code would look something like this:
def checksigns(equation):
    signs = ["*","/","%","^","+","-"]
    for i in signs:
        if i in equation:
            index = equation.index(i)
            if (equation[index] == equation[0]):
                return "Not correct"
            elif (equation[index] == equation[len(equation) - 1]):
                return "Not correct"
            elif (equation[index + 1].isdigit() and equation[index - 1].isdigit()):
                return "Correct"
            else:
                return "Not correct"
You can use Python's ast module for parsing the expression:
import ast
import itertools as it

def check(expr):
    allowed = (ast.Add, ast.Sub, ast.Mult, ast.Mod)
    try:
        # skip the first two nodes (Module and Expr), which wrap every expression
        for node in it.islice(ast.walk(ast.parse(expr)), 2, None):
            # ast.Constant is what current Python versions produce for number literals
            if isinstance(node, (ast.BinOp, ast.Constant)):
                continue
            if not isinstance(node, allowed):
                return False
    except SyntaxError:
        return False
    return True

print(check('6**8'))  # False
print(check('-6+2'))  # False
print(check('6*(2+3)'))  # True
The first case 6**8 evaluates as False because it is represented by ast.Pow node and the second one because -6 corresponds to ast.UnaryOp.
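To see which node types an expression produces, one can list them directly. This is a small illustrative helper (node_names is a made-up name); the exact node names can differ slightly between Python versions.

```python
import ast

def node_names(expr):
    """Return the names of all AST node types found in a single expression."""
    tree = ast.parse(expr, mode='eval')
    return [type(node).__name__ for node in ast.walk(tree)]

for expr in ('6**8', '-6+2', '6*(2+3)'):
    print(expr, node_names(expr))
```

The first expression contains a Pow node and the second a UnaryOp node, which is why both are rejected.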
I have a string in which every marked substring within < and >
has to be reversed (the brackets don't nest). For example,
"hello <wolfrevokcats>, how <t uoy era>oday?"
should become
"hello stackoverflow, how are you today?"
My current idea is to loop over the string and find pairs of indices
where < and > are. Then simply slice the string and put the slices
together again with everything that was in between the markers reversed.
Is this a correct approach? Is there an obvious/better solution?
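For reference, the index-and-slice idea described above could be sketched like this. The function name reverse_between_markers is invented for the example, and an unmatched < is simply left in place.

```python
def reverse_between_markers(text):
    parts = []
    pos = 0
    while True:
        start = text.find('<', pos)
        if start == -1:
            break
        end = text.find('>', start + 1)
        if end == -1:
            break
        parts.append(text[pos:start])            # text before the marker, unchanged
        parts.append(text[start + 1:end][::-1])  # marked text, reversed
        pos = end + 1
    parts.append(text[pos:])                     # whatever follows the last marker
    return ''.join(parts)

print(reverse_between_markers("hello <wolfrevokcats>, how <t uoy era>oday?"))
```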
It's pretty simple with regular expressions. re.sub takes a function as an argument to which the match object is passed.
>>> import re
>>> s = 'hello <wolfrevokcats>, how <t uoy era>oday?'
>>> re.sub('<(.*?)>', lambda m: m.group(1)[::-1], s)
'hello stackoverflow, how are you today?'
Explanation of the regex:
<(.*?)> will match everything between < and > in matching group 1. To ensure that the regex engine will stop at the first > symbol occurrence, the lazy quantifier *? is used.
The function lambda m: m.group(1)[::-1] that is passed to re.sub takes the match object, extracts group 1, and reverses the string. Finally re.sub inserts this return value.
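A quick way to see why the lazy quantifier matters is to compare it with the greedy version on the same input (a small illustration, not part of the solution itself):

```python
import re

s = 'hello <wolfrevokcats>, how <t uoy era>oday?'

# Lazy: each match stops at the first '>' after its '<'.
lazy = re.findall(r'<(.*?)>', s)

# Greedy: a single match runs from the first '<' to the last '>'.
greedy = re.findall(r'<(.*)>', s)

print(lazy)    # ['wolfrevokcats', 't uoy era']
print(greedy)  # ['wolfrevokcats>, how <t uoy era']
```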
Or, use re.sub() and a replacing function:
>>> import re
>>> s = 'hello <wolfrevokcats>, how <t uoy era>oday?'
>>> re.sub(r"<(.*?)>", lambda match: match.group(1)[::-1], s)
'hello stackoverflow, how are you today?'
where .*? matches any characters any number of times in a non-greedy fashion. The parentheses around it capture the match in a group, which we then refer to in the replacing function as match.group(1). The [::-1] slice notation reverses a string.
I'm going to assume this is a coursework assignment and the use of regular expressions isn't allowed. So I'm going to offer a solution that doesn't use it.
content = "hello <wolfrevokcats>, how <t uoy era>oday?"

insert_pos = -1
result = []
placeholder_count = 0

for pos, ch in enumerate(content):
    if ch == '<':
        insert_pos = pos
    elif ch == '>':
        insert_pos = -1
        placeholder_count += 1
    elif insert_pos >= 0:
        result.insert(insert_pos - (placeholder_count * 2), ch)
    else:
        result.append(ch)

print("".join(result))
The gist of the code is to have just a single pass at the string one character at a time. When outside the brackets, simply append the character at the end of the result string. When inside the brackets, insert the character at the position of the opening bracket (i.e. pre-pend the character).
I agree that regular expressions are the proper tool to solve this problem, and I like the gist of Dmitry B.'s answer. However, I used this question to practice generators and functional programming, and I am posting my solution just to share it.
msg = "<,woN> hello <wolfrevokcats>, how <t uoy era>oday?"

def traverse(s, d=">"):
    for c in s:
        if c in "<>": d = c
        else: yield c, d

def group(tt, dc=None):
    for c, d in tt:
        if d != dc:
            if dc is not None:
                yield dc, l
            l = [c]
            dc = d
        else:
            l.append(c)
    else: yield dc, l

def direct(groups):
    func = lambda d: list if d == ">" else reversed
    fst = lambda t: t[0]
    snd = lambda t: t[1]
    for gr in groups:
        yield func(fst(gr))(snd(gr))

def concat(groups):
    return "".join("".join(gr) for gr in groups)

print(concat(direct(group(traverse(msg)))))
# Now, hello stackoverflow, how are you today?
Here's another one without using regular expressions:
def reverse_marked(str0):
    separators = ['<', '>']
    reverse = 0
    str1 = ['', str0]
    res = ''

    while len(str1) == 2:
        str1 = str1[1].split(separators[reverse], maxsplit=1)
        res = ''.join((res, str1[0][::-1] if reverse else str1[0]))
        reverse = 1 - reverse  # toggle 0 - 1 - 0 ...
    return res

print(reverse_marked('hello <wolfrevokcats>, how <t uoy era>oday?'))
Output:
hello stackoverflow, how are you today?
How can I find the position of a substring in a string without using str.find() in Python? How should I loop it?
def find_substring(string, substring):
    for i in range(len(string)):
        if string[i] == substring[0]:
            print(i)
        else:
            print(False)
For example, when string = "ATACGTG" and substring = "ACGT", it should return 2. I want to understand how str.find() works
You can use Boyer-Moore or Knuth-Morris-Pratt. Both create tables to precalculate faster moves on each miss. The B-M page has a python implementation. And both pages refer to other string-searching algorithms.
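As an illustration of the table idea, here is a compact sketch of Knuth-Morris-Pratt. The function names are chosen for this example; it returns the index of the first match or -1, mirroring str.find.

```python
def build_failure_table(pattern):
    """table[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    table = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table

def kmp_find(text, pattern):
    if not pattern:
        return 0
    table = build_failure_table(pattern)
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        # On a mismatch, fall back using the table instead of re-reading text.
        while k > 0 and ch != pattern[k]:
            k = table[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1

print(kmp_find("ATACGTG", "ACGT"))  # 2
```

The text is scanned once and never re-read, which is where the speed-up over the naive loop comes from.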
I can't think of a way to do it without any built-in functions at all.
I can:
def find_substring(string, substring):
    def starts_with(string, substring):
        while True:
            if substring == '':
                return True
            if string == '' or string[0] != substring[0]:
                return False
            string, substring = string[1:], substring[1:]

    n = 0
    while string != '' and substring != '':
        if starts_with(string, substring):
            return n
        string = string[1:]
        n += 1
    return -1

print(find_substring('ATACGTG', 'ACGT'))
I.e. avoiding built-ins len(), range(), etc. By not using built-in len() we lose some efficiency in that we could have finished sooner. The OP specified iteration, which the above uses, but the recursive variant is a bit more compact:
def find_substring(string, substring, n=0):
    def starts_with(string, substring):
        if substring == '':
            return True
        if string == '' or string[0] != substring[0]:
            return False
        return starts_with(string[1:], substring[1:])

    if string == '' or substring == '':
        return -1
    if starts_with(string, substring):
        return n
    return find_substring(string[1:], substring, n + 1)

print(find_substring('ATACGTG', 'ACGT'))
Under the constraint of not using find, you can use str.index instead, which raises a ValueError if the substring is not found:
def find_substring(a_string, substring):
    try:
        print(a_string.index(substring))
    except ValueError:
        print('Not Found')
and usage:
>>> find_substring('foo bar baz', 'bar')
4
>>> find_substring('foo bar baz', 'quux')
Not Found
If you must loop, you can slide along the string and, wherever the first character matches, check whether the rest of the string starts with the substring; if it does, that is a match:
def find_substring(a_string, substring):
    for i, c in enumerate(a_string):
        if c == substring[0] and a_string[i:].startswith(substring):
            print(i)
            return
    else:
        print(False)
To do it with no string methods:
def find_substring(a_string, substring):
    for i in range(len(a_string)):
        if a_string[i] == substring[0] and a_string[i:i+len(substring)] == substring:
            print(i)
            return
    else:
        print(False)
I can't think of a way to do it without any built-in functions at all.
I am trying to replace the Nth appearance of a needle in a haystack. I want to do this simply via re.sub(), but cannot seem to come up with an appropriate regex to solve this. I am trying to adapt: http://docstore.mik.ua/orelly/perl/cookbook/ch06_06.htm but am failing at spanning multilines, I suppose.
My current method is an iterative approach that finds the position of each occurrence from the beginning after each mutation. This is pretty inefficient and I would like to get some input. Thanks!
I think you mean re.sub. You could pass a function and keep track of how often it was called so far:
def replaceNthWith(n, replacement):
    def replace(match, c=[0]):
        c[0] += 1
        return replacement if c[0] == n else match.group(0)
    return replace
Usage:
re.sub(pattern, replaceNthWith(n, replacement), str)
But this approach feels a bit hacky, maybe there are more elegant ways.
Something like this regex should help you. Though I'm not sure how efficient it is:
# N = 3: skip the first N-1 = 2 occurrences, then replace the third
re.sub(
    r'^((?:.*?mytexttoreplace){2}.*?)mytexttoreplace',
    r'\1yourreplacementtext.',
    'mystring',
    flags=re.DOTALL
)
The DOTALL flag is important.
I've been struggling for a while with this, but I found a solution that I think is pretty pythonic:
>>> def nth_matcher(n, replacement):
...     def alternate(n):
...         i = 0
...         while True:
...             i += 1
...             yield i % n == 0
...     gen = alternate(n)
...     def match(m):
...         replace = next(gen)
...         if replace:
...             return replacement
...         else:
...             return m.group(0)
...     return match
...
>>> re.sub("([0-9])", nth_matcher(3, "X"), "1234567890")
'12X45X78X0'
EDIT: the matcher consists of two parts:
The alternate(n) function. This returns a generator that yields an infinite sequence of True/False values, where every nth value is True. Think of alternate(3) as yielding False, False, True, False, False, True, False, ...
The match(m) function. This is the function that gets passed to re.sub: it gets the next value from alternate(n) (next(gen)) and if it's True it replaces the matched value; otherwise, it keeps it unchanged (replaces it with itself).
I hope this is clear enough. If my explanation is hazy, please say so and I'll improve it.
Could you do it using re.finditer with MatchObject.start() and MatchObject.end()?
Find all occurrences of the pattern in the string with re.finditer (re.findall only returns strings, not match objects), get the indices of the Nth occurrence with .start()/.end(), and build a new string with the replacement value using those indices?
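A minimal sketch of that idea, using re.finditer since it yields match objects that carry start() and end(). The function name replace_nth is made up for this example, and n is assumed to be 1-based and positive.

```python
import re
from itertools import islice

def replace_nth(pattern, replacement, string, n, flags=0):
    """Replace only the n-th (1-based) match of pattern; return string unchanged
    if there are fewer than n matches."""
    matches = re.finditer(pattern, string, flags)
    match = next(islice(matches, n - 1, None), None)
    if match is None:
        return string
    # match.expand() resolves group references such as \1 in the replacement.
    return string[:match.start()] + match.expand(replacement) + string[match.end():]

print(replace_nth(r'\d', 'X', '1234567890', 3))  # 12X4567890
```

Scanning stops as soon as the n-th match is found, so the rest of the string is never searched.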
If the pattern ("needle") or replacement is a complex regular expression, you can't assume anything. The function "nth_occurrence_sub" is what I came up with as a more general solution:
import re

def nth_match_end(pattern, string, n, flags):
    for i, match_object in enumerate(re.finditer(pattern, string, flags)):
        if i + 1 == n:
            return match_object.end()

def nth_occurrence_sub(pattern, repl, string, n=0, flags=0):
    max_n = len(re.findall(pattern, string, flags))
    if abs(n) > max_n or n == 0:
        return string
    if n < 0:
        # negative n counts from the last occurrence
        n = max_n + n + 1
    sub_n_times = re.sub(pattern, repl, string, count=n, flags=flags)
    if n == 1:
        return sub_n_times
    nm1_end = nth_match_end(pattern, string, n - 1, flags)
    sub_nm1_times = re.sub(pattern, repl, string, count=n - 1, flags=flags)
    sub_nm1_change = sub_nm1_times[:-1 * len(string[nm1_end:])]
    components = [
        string[:nm1_end],
        sub_n_times[len(sub_nm1_change):]
    ]
    return ''.join(components)
I have a similar function I wrote to do this. I was trying to replicate SQL REGEXP_REPLACE() functionality. I ended up with:
def sql_regexp_replace(txt, pattern, replacement='', position=1, occurrence=0, regexp_modifier='c'):
    class ReplWrapper(object):
        def __init__(self, replacement, occurrence):
            self.count = 0
            self.replacement = replacement
            self.occurrence = occurrence

        def repl(self, match):
            self.count += 1
            if self.occurrence == 0 or self.occurrence == self.count:
                return match.expand(self.replacement)
            else:
                try:
                    return match.group(0)
                except IndexError:
                    return match.group(0)

    occurrence = 0 if occurrence < 0 else occurrence
    flags = regexp_flags(regexp_modifier)
    rx = re.compile(pattern, flags)
    replw = ReplWrapper(replacement, occurrence)
    return txt[0:position-1] + rx.sub(replw.repl, txt[position-1:])
One important note that I haven't seen mentioned is that you need to return match.expand() otherwise it won't expand the \1 templates properly and will treat them as literals.
If you want this to work you'll need to handle the flags differently (or take it from my github, it's simple to implement and you can dummy it for a test by setting it to 0 and ignoring my call to regexp_flags()).