RegEx removing paired parentheses only - python

How can I remove only the paired parentheses from this expression? I tried many approaches before deciding to post my question here.
Code:
expression = ')([()])('
pattern = r'[(\(.*\)]'
nothing = ''
print(re.sub(pattern, nothing, expression)) # Expected to be ')[]('
Other expressions to validate:
// True
<s>HTML is a programming language</s>
(1+2) * (3+4) / 4!
{1, 2, 3, ..., 10}
([{<>}])
// False
<}>
)[}({>]<
<[{}]><
As you may guess, I want to solve a classic problem in a new way. Not only parentheses but also other paired punctuation marks, such as brackets, angle brackets, and braces, should be removed. (Use re.sub(r'[^\(\)\[\]\{\}\<\>]', '', expr) to strip all other characters first.)
I would like to drop them in one step, but all answers are welcome.

Based on How to remove all text between the outer parentheses in a string?:
import re
def rem_parens(text):
    n = 1  # run at least once
    while n:
        text, n = re.subn(r'\(([^()]*)\)', r'\1', text)
    return text
print(rem_parens(")([()])("))
Results: )[](
See Python proof
How to extend to accept more bracket types
Add alternatives to the expression and backreferences to the replace:
re.subn(r'\(([^()]*)\)|\[([^][]*)]|<([^<>]*)>|\{([^{}]*)}', r'\1\2\3\4', text)
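One thing to watch with the combined pattern: each content class excludes only its own bracket type, so interleaved input such as ([)] is also reduced to an empty string. For the validation part of the question, a minimal sketch (the names is_balanced, NON_BRACKETS and EMPTY_PAIR are made up for illustration) is to strip everything that is not a bracket first, as the question suggests, and then repeatedly delete only adjacent empty pairs:

```python
import re

# Everything that is not one of the eight bracket characters.
NON_BRACKETS = re.compile(r'[^()\[\]{}<>]')
# An opening bracket immediately followed by its own closing bracket.
EMPTY_PAIR = re.compile(r'\(\)|\[\]|\{\}|<>')

def is_balanced(expr):
    """Return True if the brackets in expr are correctly paired and nested."""
    s = NON_BRACKETS.sub('', expr)
    n = 1  # run at least once
    while n:
        s, n = EMPTY_PAIR.subn('', s)
    return s == ''

for sample in ['([{<>}])', '(1+2) * (3+4) / 4!', '<}>', '<[{}]><', '([)]']:
    print(repr(sample), is_balanced(sample))
```

Whatever is left after the loop is the unmatched part, so the same loop could also report which brackets are unpaired.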

Related

Python regex match anything enclosed in either quotations brackets braces or parenthesis

UPDATE
This is still not a complete solution. It only works when the closing character is repeated at the end (e.g. )), ]], }}). I'm still looking for a way to capture enclosed contents and will update this.
Code:
>>> import re
>>> re.search(r'(\(.+?[?<!)]\))', '((x(y)z))', re.DOTALL).groups()
('((x(y)z))',)
Details:
r'(\(.+?[?<!)]\))'
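( ) - The capturing group.
\( and \) - The opening and closing characters (e.g. ', ", (), {}, [])
.+? - Lazily match any characters (use with the re.DOTALL flag so that newlines are included)
[?<!)] - Intended as a negative lookbehind for ), but square brackets make this a character class that matches a single ?, <, ! or ) character (replace the ) with the matching closing character). In the example it consumes the ) just before the final \), which is why the pattern only works when the closing character is repeated. A real negative lookbehind is written (?<!...).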
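To make the difference between the two notations concrete, here is a small sketch. The variable names and the sample strings are only for illustration; the second pattern is an example of an actual negative lookbehind, not a fix for the capture problem above.

```python
import re

# Square brackets define a character class: one of ?, <, ! or ).
char_class = re.compile(r'[?<!)]')
class_hits = char_class.findall('a?b<c!d)e')

# A real negative lookbehind: a ) that is not immediately preceded by another ).
not_after_paren = re.compile(r'(?<!\))\)')
lookbehind_starts = [m.start() for m in not_after_paren.finditer('((x(y)z))')]

print(class_hits)         # the four punctuation characters, matched one at a time
print(lookbehind_starts)  # indexes of the ) characters that do not follow a )
```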
I was trying to parse something like a variable assignment statement for this lexer thing I'm working with, just trying to get the basic logic behind interpreters/compilers.
Here's the basic assignment statements and literals I'm dealing with:
az = none
az_ = true
az09 = false
az09_ = +0.9
az_09 = 'az09_'
_az09 = "az09_"
_az = [
"az",
0.9
]
_09 = {
0: az
1: 0.9
}
_ = (
true
)
Somehow, I managed to parse those simple assignments like none, true, false, and numeric literals. Here's where I'm currently stuck at:
import sys
import re
# validate command-line arguments
if len(sys.argv) != 2:
    raise ValueError('usage: parse <script>')
# parse the variable name and its value
def handle_assignment(index, source):
    # TODO: handle quotations, brackets, braces, and parenthesis values
    variable = re.search(r'[\S\D]([\w]+)\s+?=\s+?(none|true|false|[-+]?\d+\.?\d+|[\'\"].*[\'\"])', source[index:])
    if variable is not None:
        print('{}={}'.format(variable.group(1), variable.group(2)))
        index += source[index:].index(variable.group(2))
    return index
# parse through the source element by element
with open(sys.argv[1]) as file:
    source = file.read()
index = 0
while index < len(source):
    # checks if the line matches a variable assignment statement
    if re.match(r'[\S\D][\w]+\s+?=', source[index:]):
        index = handle_assignment(index, source)
    index += 1
I am looking for a way to capture values enclosed in quotations, brackets, braces, and parentheses.
I will update this post if I find an answer.
Use a regexp with multiple alternatives for each matching pair.
re.match(r'\'.*?\'|".*?"|\(.*?\)|\[.*?\]|\{.*?\}', s)
Note, however, that if there are nested brackets, this will match the first ending bracket, e.g. if the input is
(words (and some more words))
the result will be
(words (and some more words)
Regular expressions are not appropriate for matching nested structures; you should use a more powerful parsing technique.
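As one example of such a technique, here is a minimal sketch of a stack-based scanner. The function name match_enclosed and the PAIRS table are hypothetical, and the sketch deliberately ignores quotes, so a bracket inside a string literal would still be counted.

```python
# Opening bracket -> the closing bracket it expects.
PAIRS = {'(': ')', '[': ']', '{': '}'}

def match_enclosed(s, start=0):
    """Return the balanced bracketed substring beginning at s[start], or None."""
    if start >= len(s) or s[start] not in PAIRS:
        return None
    stack = []  # closing brackets we still expect, innermost last
    for i in range(start, len(s)):
        ch = s[i]
        if ch in PAIRS:
            stack.append(PAIRS[ch])
        elif ch in PAIRS.values():
            if not stack or stack.pop() != ch:
                return None  # wrong or unexpected closing bracket
            if not stack:
                return s[start:i + 1]  # the outermost bracket just closed
    return None  # ran out of input before everything was closed

print(match_enclosed('(words (and some more words))'))
```

Unlike the lazy \(.*?\) alternative, this keeps going until the bracket opened at start is actually closed.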
A solution for the nested (recursive) brackets that @Barmar mentions, using the third-party regex module:
pip install regex
python3
>>> import regex
>>> recurParentheses = regex.compile(r'[(](?:[^()]|(?R))*[)]')
>>> recurParentheses.findall('(z(x(y)z)x) ((x)(y)(z))')
['(z(x(y)z)x)', '((x)(y)(z))']
>>> recurCurlyBraces = regex.compile(r'[{](?:[^{}]|(?R))*[}]')
>>> recurCurlyBraces.findall('{z{x{y}z}x} {{x}{y}{z}}')
['{z{x{y}z}x}', '{{x}{y}{z}}']
>>> recurSquareBrackets = regex.compile(r'[[](?:[^][]|(?R))*[]]')
>>> recurSquareBrackets.findall('[z[x[y]z]x] [[x][y][z]]')
['[z[x[y]z]x]', '[[x][y][z]]']
For recursion inside string literals, I suggest taking a look at this.

Replace commas enclosed in curly braces

I am trying to replace commas enclosed in curly braces with semicolons.
Sample string:
text = "a,b,{'c','d','e','f'},g,h"
I am aware that it comes down to lookbehinds and lookaheads, but somehow it won't work like I want it to:
substr = re.sub(r"(?<=\{)(.+?)(,)(?=.+\})",r"\1;", text)
It returns:
a,b,{'c';'d','e','f'},g,h
However, I am aiming for the following:
a,b,{'c';'d';'e';'f'},g,h
Any idea how I can achieve this?
Any help much appreciated :)
You can match the whole block {...} (with {[^{}]+}) and replace commas inside it only with a lambda:
import re
text = "a,b,{'c','d','e','f'},g,h"
print(re.sub(r"{[^{}]+}", lambda x: x.group(0).replace(",", ";"), text))
See IDEONE demo
Output: a,b,{'c';'d';'e';'f'},g,h
By declaring lambda x we can get access to each match object, and get the whole match value using x.group(0). Then, all we need is replace a comma with a semi-colon.
Python's built-in re module does not support recursive patterns, so this handles only non-nested braces. To use a recursive pattern, you need the PyPI regex module. Something like m = regex.sub(r"\{(?:[^{}]|(?R))*}", lambda x: x.group(0).replace(",", ";"), text) should work.
Below I have posted a solution that does not rely on a regular expression. It uses a stack (list) to determine whether a character is inside a curly bracket {. Regular expressions are more elegant; however, they can be harder to modify when requirements change. Please note that the example below also works for nested brackets.
text = "a,b,{'c','d','e','f'},g,h"
output = ''
stack = []
for char in text:
    if char == '{':
        stack.append(char)
    elif char == '}':
        stack.pop()
    # Check if we are inside a curly bracket
    if len(stack) > 0 and char == ',':
        output += ';'
    else:
        output += char
print(output)
This gives:
a,b,{'c';'d';'e';'f'},g,h
You can also rewrite this as a map function if you use a global variable for the stack:
stack = []
def replace_comma_in_curly_brackets(char):
    if char == '{':
        stack.append(char)
    elif char == '}':
        stack.pop()
    # Check if we are inside a curly bracket
    if len(stack) > 0 and char == ',':
        return ';'
    return char
text = "a,b,{'c','d','e','f'},g,h"
print(''.join(map(str, map(replace_comma_in_curly_brackets, text))))
Regarding performance, when running the above two methods and the regular expression solution proposed by @stribizhev on the test string at the end of this post, I get the following timings:
Regular expression (@stribizhev): 0.38 seconds
Map function: 26.3 seconds
For loop: 251 seconds
This is the test string, which is roughly 55 million characters long:
text = "a,able,about,across,after,all,almost,{also,am,among,an,and,any,are,as,at,be,because},been,but,by,can,cannot,could,dear,did,do,does,either,else,ever,every,for,from,get,got,had,has,have,he,her,hers,him,his,how,however,i,if,in,into,is,it,its,just,least,let,like,likely,may,me,might,most,must,my,neither,no,nor,not,of,off,often,on,only,or,other,our,own,rather,said,say,says,she,should,since,so,some,than,that,the,their,them,then,there,these,they,this,tis,to,too,twas,us,wants,was,we,were,what,when,where,which,while,who,whom,why,will,with,would,yet,you,your" * 100000
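Since only curly braces are tracked, the stack can be replaced by a plain depth counter, and collecting the pieces in a list avoids building the output string one character at a time. The following is only a sketch (the function name is made up, and I have not timed it against the numbers above); unlike the versions above, it ignores an unmatched } instead of raising an error.

```python
def replace_commas_in_braces(text):
    """Replace ',' with ';' wherever the comma sits inside {...}, nested or not."""
    depth = 0  # how many braces are currently open
    out = []
    for ch in text:
        if ch == '{':
            depth += 1
        elif ch == '}' and depth > 0:
            depth -= 1
        out.append(';' if ch == ',' and depth > 0 else ch)
    return ''.join(out)

print(replace_commas_in_braces("a,b,{'c','d','e','f'},g,h"))
```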
If you don't have nested braces, it may be enough to look ahead from each , and check that a closing } follows with no opening { in between. Search for
,(?=[^{]*})
and replace with ;
, matches a comma literally
(?=...) is the lookahead that checks
whether [^{]* (any number of characters that are not {)
is followed by a closing curly brace }
See demo at regex101
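In Python the lookahead pattern can be passed straight to re.sub. A minimal sketch follows; the wrapper name commas_to_semicolons is only for illustration, and, as said above, it assumes the braces are not nested.

```python
import re

# A comma that is followed by a } with no { in between.
COMMA_IN_BRACES = re.compile(r",(?=[^{]*})")

def commas_to_semicolons(text):
    return COMMA_IN_BRACES.sub(";", text)

print(commas_to_semicolons("a,b,{'c','d','e','f'},g,h"))
print(commas_to_semicolons("x,{a,b},y,{c,d}"))
```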

How to use regular expression to detect parenthesis at the end of a string?

I am using if/elif statements in Python to match some strings, but I need help matching one particular type of string. I want all strings that end with a parenthesized part '()' to match the same if condition. For example, string = "Tennis (5.5)" or string = "Football (6.3)".
def method(string):
    if (string has parenthesis at the end):
Can I use a regular expression for this? I am not sure how to go about it.
I think you mean this,
if re.search(r'(?m)\([^()]*\)$', line):
$ asserts that we are at the end of a line.
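Wrapped in a function, the check could look like the sketch below. The name ends_with_parens and the extra test strings are my own; the pattern is the one from the line above.

```python
import re

# "(", anything except parentheses, ")", then end of line.
TRAILING_PARENS = re.compile(r'(?m)\([^()]*\)$')

def ends_with_parens(line):
    return TRAILING_PARENS.search(line) is not None

for s in ["Tennis (5.5)", "Football (6.3)", "Tennis", "Tennis (5.5) pro"]:
    print(repr(s), ends_with_parens(s))
```

Because of (?m), a multi-line string counts as a match if any of its lines ends with a parenthesized part; drop the flag if only the end of the whole string matters.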
If you'd prefer a regex, this is probably the simplest solution. It checks whether there is a closing parenthesis at the end of the line; trailing blanks are ignored because the string is stripped first:
"\)$"
For example:
import re
test1 = "Tennis (5.5) "
test2 = "Football (6.3)"
res1 = bool(re.search(r"\)$", test1.strip()))
res2 = bool(re.search(r"\)$", test2.strip()))
print(res1, res2, sep='\n')
>>> True
>>> True

pythonic string syntax corrector

I wrote a script to catch and correct commands before they are read by a parser. The parser requires comparison operators (equal, not equal, greater, etc.) to be separated from their operands by commas, such as:
'test(a>=b)' is wrong
'test(a,>=,b)' is correct
The script I wrote works fine, but I would love to know if there's a more efficient way to do this.
Here's my script:
# Correction routine
def corrector(exp):
    def rep(exp, a, b):
        foo = ''
        while True:
            foo = exp.replace(a, b)
            if foo == exp:
                return exp
            exp = foo
    # Replace all instances with a unique identifier. Do it in a specific order
    # so for example we catch an instance of '>=' before we get to '='
    items = ['>=', '<=', '!=', '==', '>', '<', '=']
    for i in range(len(items)):
        exp = rep(exp, items[i], '###%s###' % i)
    # Re-add items with commas
    for i in range(len(items)):
        exp = exp.replace('###%s###' % i, ',%s,' % items[i])
    # Remove accidental double commas we may have added
    return exp.replace(',,', ',')
print(corrector('wrong_syntax(b>=c) correct_syntax(b,>=,c)'))
// RESULT: wrong_syntax(b,>=,c) correct_syntax(b,>=,c)
thanks!
As mentioned in the comments, one approach would be to use a regular expression. The following regex matches any of your operators when they are not surrounded by commas, and replaces them with the same string with the commas inserted:
import re
inputstring = 'wrong_syntax(b>=c) correct_syntax(b,>=,c)'
regex = r"([^,])(>=|<=|!=|==|>|<|=)([^,])"
replace = r"\1,\2,\3"
result = re.sub(regex, replace, inputstring)
print(result)
Simple regexes are relatively easy, but they can get complicated quickly. Check out the docs for more info:
http://docs.python.org/2/library/re.html
Here is a regex that will do what you asked:
import re
regex = re.compile(r'''
    (?<!,)          # Negative lookbehind
    (!=|[><=]=?)
    (?!,)           # Negative lookahead
''', re.VERBOSE)
print(regex.sub(r',\1,', 'wrong_expression(b>=c) or right_expression(b,>=,c)'))
outputs
wrong_expression(b,>=,c) or right_expression(b,>=,c)
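To check the lookaround version against more of the operators, here is a small sketch. The add_commas wrapper and the test strings are my own, not from the question.

```python
import re

OPERATOR = re.compile(r'''
    (?<!,)          # not already preceded by a comma
    (!=|[><=]=?)    # !=, >=, <=, ==, >, < or =
    (?!,)           # not already followed by a comma
''', re.VERBOSE)

def add_commas(expr):
    return OPERATOR.sub(r',\1,', expr)

for expr in ['test(a>=b)', 'test(a,>=,b)', 'f(a!=b) g(x<y)', 'h(p==q)']:
    print(expr, '->', add_commas(expr))
```

Input that already has the commas is left untouched, so the function can be applied more than once without adding double commas.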

replacing all regex matches in single line

I have a dynamic regexp, so I don't know in advance how many groups it has.
I would like to wrap every matched group in XML tags.
Example:
re.sub("(this).*(string)","this is my string",'<markup>\anygroup</markup>')
>> "<markup>this</markup> is my <markup>string</markup>"
Is that even possible in a single line?
For a constant regexp like in your example, do
re.sub("(this)(.*)(string)",
       r'<markup>\1</markup>\2<markup>\3</markup>',
       text)
Note that you need to enclose .* in parentheses as well if you don't want to lose it.
Now if you don't know what the regexp looks like, it's more difficult, but should be doable.
pattern = "(this)(.*)(string)"
re.sub(pattern,
       lambda m: ''.join('<markup>%s</markup>' % s if n % 2 == 0
                         else s for n, s in enumerate(m.groups())),
       text)
If the first thing matched by your pattern doesn't necessarily have to be marked up, use this instead, with the first group optionally matching some prefix text that should be left alone:
pattern = "()(this)(.*)(string)"
re.sub(pattern,
       lambda m: ''.join('<markup>%s</markup>' % s if n % 2 == 1
                         else s for n, s in enumerate(m.groups())),
       text)
You get the idea.
If your regexps are complicated and you're not sure you can make everything part of a group, where only every second group needs to be marked up, you might do something smarter with a more complicated function:
pattern = "(this).*(string)"
def replacement(m):
    s = m.group()
    offset = m.start()  # spans index the whole string, but s is only the match
    n_groups = len(m.groups())
    # assume groups do not overlap and are listed left-to-right
    for i in range(n_groups, 0, -1):
        lo, hi = m.span(i)
        lo, hi = lo - offset, hi - offset
        s = s[:lo] + '<markup>' + s[lo:hi] + '</markup>' + s[hi:]
    return s
re.sub(pattern, replacement, text)
If you need to handle overlapping groups, you're on your own, but it should be doable.
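For reference, here is a runnable sketch of the span-based idea with a concrete input. The function names and the sample text are hypothetical. It converts the group spans to offsets inside the match and skips optional groups that did not take part in the match; it still assumes that groups do not overlap.

```python
import re

def mark_groups(match, tag="markup"):
    """Wrap every group that took part in the match in <tag>...</tag>."""
    s = match.group()
    base = match.start()  # group spans are offsets into the whole string
    # Walk right to left so earlier offsets stay valid after each insertion.
    for i in range(len(match.groups()), 0, -1):
        lo, hi = match.span(i)
        if lo == -1:  # optional group that did not take part in the match
            continue
        lo, hi = lo - base, hi - base
        s = "%s<%s>%s</%s>%s" % (s[:lo], tag, s[lo:hi], tag, s[hi:])
    return s

def markup_all_groups(pattern, text):
    return re.sub(pattern, mark_groups, text)

print(markup_all_groups("(this).*(string)", "Well, this is my string"))
```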
re.sub() will replace everything it can. If you pass it a function for repl then you can do even more.
Yes, this can be done in a single line.
>>> re.sub(r"\b(this|string)\b", r"<markup>\1</markup>", "this is my string")
'<markup>this</markup> is my <markup>string</markup>'
\b ensures that only complete words are matched.
So if you have a list of words that you need to mark up, you could do the following:
>>> mywords = ["this", "string", "words"]
>>> myre = r"\b(" + "|".join(mywords) + r")\b"
>>> re.sub(myre, r"<markup>\1</markup>", "this is my string with many words!")
'<markup>this</markup> is my <markup>string</markup> with many <markup>words</markup>!'
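If the words come from user input, they may contain characters that are special in a regex, so it is probably safer to escape them before joining. A minimal sketch, assuming a hypothetical helper named mark_words:

```python
import re

def mark_words(words, text, tag="markup"):
    """Wrap each whole-word occurrence of any of the given words in <tag>...</tag>."""
    # re.escape keeps characters such as "." or "+" from being read as regex syntax.
    alternation = "|".join(re.escape(w) for w in words)
    pattern = r"\b(" + alternation + r")\b"
    return re.sub(pattern, r"<%s>\1</%s>" % (tag, tag), text)

print(mark_words(["this", "string", "words"], "this is my string with many words!"))
```

Note that \b only matches next to a word character, so entries that begin or end with punctuation may need a different boundary check.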
