Using pyparsing, how can I match a keyword immediately before or after a special character (like "{" or "}")? The code below shows that my keyword "msg" is not matched unless it is preceded by whitespace (or at start):
import pyparsing as pp
openBrace = pp.Suppress(pp.Keyword("{"))
closeBrace = pp.Suppress(pp.Keyword("}"))
messageKw = pp.Keyword("msg")
messageExpr = pp.Forward()
messageExpr << messageKw + openBrace +\
    pp.ZeroOrMore(messageExpr) + closeBrace
try:
    result = messageExpr.parseString("msg { msg { } }")
    print(result.dump(), "\n")
    result = messageExpr.parseString("msg {msg { } }")
    print(result.dump())
except pp.ParseException as pe:
    print(pe, "\n", "Text: ", pe.line)
I'm sure there's a way to do this, but I have been unable to find it.
Thanks in advance
openBrace = pp.Suppress(pp.Keyword("{"))
closeBrace = pp.Suppress(pp.Keyword("}"))
should be:
openBrace = pp.Suppress(pp.Literal("{"))
closeBrace = pp.Suppress(pp.Literal("}"))
or even just:
openBrace = pp.Suppress("{")
closeBrace = pp.Suppress("}")
(Most pyparsing classes will auto-promote a string argument "arg" to Literal("arg").)
When I have parsers with many punctuation marks, rather than have a big ugly chunk of statements like this, I'll collapse them down to something like:
OBRACE, CBRACE, OPAR, CPAR, SEMI, COMMA = map(pp.Suppress, "{}();,")
The problem you are seeing is that Keyword looks at the immediately surrounding characters to make sure that the current string is not being accidentally matched when it is really embedded in a larger identifier-like string. Keyword('{') therefore only matches when no adjoining character could be read as part of a larger word. In "{msg" the "m" directly after the brace looks like the continuation of a word, so the match is rejected. Since '{' itself is not a typical keyword character, using Keyword('{') is not a good use of that class.
Only use Keyword for strings that could be misinterpreted as identifiers. For matching characters that are not in the set of typical keyword characters (by "keyword characters" I mean alphanumerics + '_'), use Literal.
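To make the boundary check concrete, here is a rough pure-Python sketch of the idea. It is not pyparsing's actual implementation: the helper names keyword_matches_at and literal_matches_at and the KEYWORD_CHARS set are made up for illustration, and the character set simply follows the "alphanumerics + '_'" description above.

```python
import string

# Assumption: "keyword characters" are alphanumerics plus "_", as described above.
KEYWORD_CHARS = set(string.ascii_letters + string.digits + "_")

def keyword_matches_at(text, pos, word):
    """Keyword-style match: `word` must not touch a keyword character on either side."""
    end = pos + len(word)
    if text[pos:end] != word:
        return False
    glued_before = pos > 0 and text[pos - 1] in KEYWORD_CHARS
    glued_after = end < len(text) and text[end] in KEYWORD_CHARS
    return not (glued_before or glued_after)

def literal_matches_at(text, pos, word):
    """Literal-style match: only the characters of `word` itself are checked."""
    return text.startswith(word, pos)

sample = "msg {msg { } }"
print(keyword_matches_at(sample, 4, "{"))    # False: "m" follows the brace
print(literal_matches_at(sample, 4, "{"))    # True
print(keyword_matches_at(sample, 5, "msg"))  # True: "{" before, " " after
```

Under this model the brace in "{msg" is rejected by the keyword-style check but accepted by the literal-style one, which is the behaviour the question ran into.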
I am trying to replace commas with semicolons, but only where they are enclosed in curly braces.
Sample string:
text = "a,b,{'c','d','e','f'},g,h"
I am aware that it comes down to lookbehinds and lookaheads, but somehow it won't work like I want it to:
substr = re.sub(r"(?<=\{)(.+?)(,)(?=.+\})",r"\1;", text)
It returns:
a,b,{'c';'d','e','f'},g,h
However, I am aiming for the following:
a,b,{'c';'d';'e';'f'},g,h
Any idea how I can achieve this?
Any help much appreciated :)
You can match the whole block {...} (with {[^{}]+}) and replace commas inside it only with a lambda:
import re
text = "a,b,{'c','d','e','f'},g,h"
print(re.sub(r"{[^{}]+}", lambda x: x.group(0).replace(",", ";"), text))
Output: a,b,{'c';'d';'e';'f'},g,h
By declaring lambda x we get access to each match object and can retrieve the whole match with x.group(0). Then all we need to do is replace each comma with a semicolon.
The pattern above does not handle nested braces, because Python's built-in re module does not support recursive patterns. To use a recursive pattern, you need the regex module from PyPI. Something like m = regex.sub(r"\{(?:[^{}]|(?R))*}", lambda x: x.group(0).replace(",", ";"), text) should work.
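To see what the non-recursive pattern does with nested braces, here is a small check using only the standard re module. The function name and the nested sample string are made up for illustration.

```python
import re

def replace_in_flat_blocks(text):
    # Match a {...} block that contains no further braces, then swap its commas.
    return re.sub(r"\{[^{}]+\}", lambda m: m.group(0).replace(",", ";"), text)

flat = replace_in_flat_blocks("a,b,{'c','d','e','f'},g,h")
nested = replace_in_flat_blocks("a,{b,{c,d},e},f")

print(flat)    # a,b,{'c';'d';'e';'f'},g,h
print(nested)  # a,{b,{c;d},e},f
```

Only the innermost block is rewritten in the nested case, so the commas around it in the outer block are left untouched.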
Below I have posted a solution that does not rely on a regular expression. It uses a stack (list) to determine whether a character is inside curly brackets. Regular expressions are more elegant, but they can be harder to modify when requirements change. Please note that the example below also works for nested brackets.
text = "a,b,{'c','d','e','f'},g,h"
output = ''
stack = []
for char in text:
    if char == '{':
        stack.append(char)
    elif char == '}':
        stack.pop()
    # Check if we are inside a curly bracket
    if len(stack) > 0 and char == ',':
        output += ';'
    else:
        output += char
print(output)
This gives:
a,b,{'c';'d';'e';'f'},g,h
You can also rewrite this as a map function if you use a global variable for the stack:
stack = []
def replace_comma_in_curly_brackets(char):
    if char == '{':
        stack.append(char)
    elif char == '}':
        stack.pop()
    # Check if we are inside a curly bracket
    if len(stack) > 0 and char == ',':
        return ';'
    return char
text = "a,b,{'c','d','e','f'},g,h"
print(''.join(map(replace_comma_in_curly_brackets, text)))
Regarding performance, when running the above two methods and the regular expression solution proposed by @stribizhev on the test string at the end of this post, I get the following timings:
Regular expression (@stribizhev): 0.38 seconds
Map function: 26.3 seconds
For loop: 251 seconds
This is the test string, which is 55,300,000 characters long:
text = "a,able,about,across,after,all,almost,{also,am,among,an,and,any,are,as,at,be,because},been,but,by,can,cannot,could,dear,did,do,does,either,else,ever,every,for,from,get,got,had,has,have,he,her,hers,him,his,how,however,i,if,in,into,is,it,its,just,least,let,like,likely,may,me,might,most,must,my,neither,no,nor,not,of,off,often,on,only,or,other,our,own,rather,said,say,says,she,should,since,so,some,than,that,the,their,them,then,there,these,they,this,tis,to,too,twas,us,wants,was,we,were,what,when,where,which,while,who,whom,why,will,with,would,yet,you,your" * 100000
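The same idea can also be written without a global variable, using a depth counter in place of the stack and collecting the output in a list that is joined once at the end. This is only a sketch: the function name is made up, it ignores stray closing braces as a simplification, and I have not timed it against the numbers above.

```python
def replace_commas_in_braces(text):
    depth = 0     # how many "{" are currently open
    pieces = []   # output characters, joined once at the end
    for char in text:
        if char == "{":
            depth += 1
        elif char == "}":
            depth = max(depth - 1, 0)  # assumption: ignore stray closing braces
        pieces.append(";" if depth > 0 and char == "," else char)
    return "".join(pieces)

print(replace_commas_in_braces("a,b,{'c','d','e','f'},g,h"))  # a,b,{'c';'d';'e';'f'},g,h
print(replace_commas_in_braces("a,{b,{c,d},e},f"))            # a,{b;{c;d};e},f
```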
If you don't have nested braces, it might be enough to look ahead from each comma and check that a closing } follows without any opening { in between. Search for
,(?=[^{]*})
and replace with ;
, match a comma literally
(?=...) the lookahead to check
if there's ahead [^{]* any amount of characters, that are not {
followed by a closing curly brace }
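In Python, the lookahead pattern can be used directly with re.sub. A minimal sketch, with a made-up function name and a made-up nested sample to show where the approach stops working:

```python
import re

def semicolons_inside_braces(text):
    # Replace a comma only if a "}" follows before any "{" does.
    return re.sub(r",(?=[^{]*\})", ";", text)

print(semicolons_inside_braces("a,b,{'c','d','e','f'},g,h"))  # a,b,{'c';'d';'e';'f'},g,h
print(semicolons_inside_braces("a,{b,{c,d},e},f"))            # a,{b,{c;d};e},f
```

With nested braces, the comma after b is missed because an opening brace comes before the next closing one, so this pattern is only suitable for flat input.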
I'm new to Python and wondering what the best way is to translate the Perl code below into Python:
if ($line =~ /(\d)/) {
    $a = $1
}
elsif ($line =~ /(\d\d)/) {
    $b = $1
}
elsif ($line =~ /(\d\d\d)/) {
    $c = $1
}
What I want to do is retrieve a specific part of each line within a large set of lines. In Python, all I can come up with is the code below, which is very ugly.
res = re.search(r'(\d)', line)
if res:
    a = res.group(1)
else:
    res = re.search(r'(\d\d)', line)
    if res:
        b = res.group(1)
    else:
        res = re.search(r'(\d\d\d)', line)
        if res:
            c = res.group(1)
Does anyone know a better way to write the same thing using only built-in modules?
EDIT:
How would you write this if you need to parse lines using very different regexes?
My point is that it should be simple enough that anyone can understand what the code is doing.
In perl, we can write:
if ($line =~ /^this is a sample line (.+) and contain single value$/) {
    $name = $1
}
elsif ($line =~ /^this is another sample: (.+):(.+) two values here$/) {
    ($address, $call) = ($1, $2)
}
elsif ($line =~ /^ahhhh thiiiss isiss (\d+) last sample line$/) {
    $description = $1
}
In my view, this kind of Perl code is very simple and easy to understand.
EDIT2:
I found the same discussion here:
http://bytes.com/topic/python/answers/750203-checking-string-against-multiple-patterns
So there's no way to write this in Python as simply as in Perl.
You could write yourself a helper function that stores the result of the match at an outer scope (here, as an attribute on the function itself), so that you don't need to re-run the regex inside the if statement:
import re
def search(patt, text):
    search.result = re.search(patt, text)
    return search.result
if search(r'(\d)', line):
    a = search.result.group(1)
elif search(r'(\d\d)', line):
    b = search.result.group(1)
elif search(r'(\d\d\d)', line):
    c = search.result.group(1)
In Python 3.8 and later, you can use an assignment expression (:=):
if res := re.search(r'(\d)', line):
    a = res.group(1)
elif res := re.search(r'(\d\d)', line):
    b = res.group(1)
elif res := re.search(r'(\d\d\d)', line):
    c = res.group(1)
The order of the alternatives is very important. With the pattern (\d)|(\d\d)|(\d\d\d), the first group alone matches any digit, so the regex engine never tries the other two alternatives. Put the longest alternative first instead:
res = re.search(r'(\d\d\d)|(\d\d)|(\d)', line)
if res:
    a, b, c = res.group(3), res.group(2), res.group(1)
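The effect of the ordering is easy to check with a short experiment. The sample line is made up for illustration:

```python
import re

line = "value 123"

# Shortest alternative first: the one-digit group wins immediately.
short_first = re.search(r"(\d)|(\d\d)|(\d\d\d)", line).groups()
# Longest alternative first: the three-digit group gets the first try.
long_first = re.search(r"(\d\d\d)|(\d\d)|(\d)", line).groups()

print(short_first)  # ('1', None, None)
print(long_first)   # ('123', None, None)
```

Only one of the three groups is ever filled, so the others come back as None.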
This is similar to Perl, except that it uses 'elif' instead of 'elsif', a ':' after the test, indentation instead of curly braces, and optional parentheses. Many resources on the web describe Python statements and are easy to find with a Google search.
if re.search(r'(\d)', line):
    a = re.search(r'(\d)', line).group(1)
elif re.search(r'(\d\d)', line):
    b = re.search(r'(\d\d)', line).group(1)
elif re.search(r'(\d\d\d)', line):
    c = re.search(r'(\d\d\d)', line).group(1)
Of course the logic of the code is flawed since 'b' and 'c' never get set but I think this is the syntax you were looking for.
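For the case in the question's edit, where each line may match one of several very different patterns, another option is to keep the patterns in a table and loop over it. This is only a sketch: RULES, parse_line and the field names are made up, modeled on the Perl sample lines in the question.

```python
import re

# Hypothetical rule table: each entry pairs a compiled pattern with the
# names to give its captured groups.
RULES = [
    (re.compile(r"^this is a sample line (.+) and contain single value$"), ("name",)),
    (re.compile(r"^this is another sample: (.+):(.+) two values here$"), ("address", "call")),
    (re.compile(r"^ahhhh thiiiss isiss (\d+) last sample line$"), ("description",)),
]

def parse_line(line):
    """Return the named captures of the first rule that matches, or {}."""
    for pattern, fields in RULES:
        match = pattern.search(line)
        if match:
            return dict(zip(fields, match.groups()))
    return {}

print(parse_line("this is another sample: here:there two values here"))
print(parse_line("ahhhh thiiiss isiss 42 last sample line"))
```

Adding a new line format then means adding one entry to the table rather than another elif branch.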
I wrote a script to catch and correct commands before they are read by a parser. The parser requires equal, not equal, greater, etc, entries to be separated by commas, such as:
'test(a>=b)' is wrong
'test(a,>=,b)' is correct
The script I wrote works fine, but I would love to know if there's a more efficient way to do this.
Here's my script:
# Correction routine
def corrector(exp):
    def rep(exp, a, b):
        foo = ''
        while True:
            foo = exp.replace(a, b)
            if foo == exp:
                return exp
            exp = foo
    # Replace all instances with a unique identifier. Do it in a specific order
    # so for example we catch an instance of '>=' before we get to '='
    items = ['>=', '<=', '!=', '==', '>', '<', '=']
    for i in range(len(items)):
        exp = rep(exp, items[i], '###%s###' % i)
    # Re-add items with commas
    for i in range(len(items)):
        exp = exp.replace('###%s###' % i, ',%s,' % items[i])
    # Remove accidental double commas we may have added
    return exp.replace(',,', ',')
print(corrector('wrong_syntax(b>=c) correct_syntax(b,>=,c)'))
# RESULT: wrong_syntax(b,>=,c) correct_syntax(b,>=,c)
thanks!
As mentioned in the comments, one approach would be to use a regular expression. The following regex matches any of your operators when they are not surrounded by commas, and replaces them with the same string with the commas inserted:
import re
inputstring = 'wrong_syntax(b>=c) correct_syntax(b,>=,c)'
regex = r"([^,])(>=|<=|!=|==|>|<|=)([^,])"
replace = r"\1,\2,\3"
result = re.sub(regex, replace, inputstring)
print(result)
Simple regexes are relatively easy, but they can get complicated quickly. Check out the docs for more info:
http://docs.python.org/2/library/re.html
Here is a regex that will do what you asked:
import re
regex = re.compile(r'''
    (?<!,)          # Negative lookbehind
    (!=|[><=]=?)
    (?!,)           # Negative lookahead
''', re.VERBOSE)
print(regex.sub(r',\1,', 'wrong_expression(b>=c) or right_expression(b,>=,c)'))
outputs
wrong_expression(b,>=,c) or right_expression(b,>=,c)
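Wrapped in a function, the lookaround pattern can be checked against a few more inputs. The names OPERATOR and add_commas and the extra test strings are made up for illustration:

```python
import re

# An operator that has no comma directly before it and none directly after it.
OPERATOR = re.compile(r"(?<!,)(!=|[><=]=?)(?!,)")

def add_commas(expression):
    return OPERATOR.sub(r",\1,", expression)

print(add_commas("test(a>=b)"))               # test(a,>=,b)
print(add_commas("test(a,>=,b)"))             # test(a,>=,b)
print(add_commas("f(a==b) g(c<d) h(e!=f)"))   # f(a,==,b) g(c,<,d) h(e,!=,f)
print(add_commas("test(a,>=b)"))              # test(a,>,=,b)
```

The last call shows a limitation: when a comma is present on only one side of a two-character operator, the pattern matches the second character alone and splits the operator. Input is assumed to be either fully separated or not separated at all.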
I have a string like this:
a = '{CGPoint={CGPoint=d{CGPoint=dd}}}{CGSize=dd}dd{CSize=aa}'
Currently I am using this re statement to get desired result:
filter(None, re.split("\\{(.*?)\\}", a))
But this gives me:
['CGPoint={CGPoint=d{CGPoint=dd', '}}', 'CGSize=dd', 'dd', 'CSize=aa']
which is incorrect for my current situation, I need a list like this:
['CGPoint={CGPoint=d{CGPoint=dd}}', 'CGSize=dd', 'dd', 'CSize=aa']
As @m.buettner points out in the comments, Python's implementation of regular expressions can't match pairs of symbols nested to an arbitrary degree. (Other languages can, notably current versions of Perl.) The Pythonic thing to do when you have text that regexes can't parse is to use a recursive-descent parser.
There's no need to reinvent the wheel by writing your own, however; there are a number of easy-to-use parsing libraries out there. I recommend pyparsing which lets you define a grammar directly in your code and easily attach actions to matched tokens. Your code would look something like this:
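from pyparsing import Combine, Forward, Literal, Suppress, Word, printables
lbrace = Literal('{')
rbrace = Literal('}')
contents = Word(printables)
expr = Forward()
expr << Combine(Suppress(lbrace) + contents + Suppress(rbrace) + expr)
# `lines` is assumed to be an iterable of input strings
for line in lines:
    results = expr.parseString(line)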
There's an alternative regex module for Python I really like that supports recursive patterns:
https://pypi.python.org/pypi/regex
pip install regex
Then you can use a recursive pattern in your regex as demonstrated in this script:
import regex
from pprint import pprint
thestr = '{CGPoint={CGPoint=d{CGPoint=dd}}}{CGSize=dd}dd{CSize=aa}'
theregex = r'''
    (
        {
            (?<match>
                [^{}]*
                (?:
                    (?1)
                    [^{}]*
                )+
                |
                [^{}]+
            )
        }
        |
        (?<match>
            [^{}]+
        )
    )
'''
matches = regex.findall(theregex, thestr, regex.X)
print('all matches:\n')
pprint(matches)
print('\ndesired matches:\n')
print([match[1] for match in matches])
This outputs:
all matches:
[('{CGPoint={CGPoint=d{CGPoint=dd}}}', 'CGPoint={CGPoint=d{CGPoint=dd}}'),
('{CGSize=dd}', 'CGSize=dd'),
('dd', 'dd'),
('{CSize=aa}', 'CSize=aa')]
desired matches:
['CGPoint={CGPoint=d{CGPoint=dd}}', 'CGSize=dd', 'dd', 'CSize=aa']
pyparsing has a nestedExpr function for matching nested expressions:
import pyparsing as pp
ident = pp.Word(pp.alphanums)
expr = pp.nestedExpr("{", "}") | ident
thestr = '{CGPoint={CGPoint=d{CGPoint=dd}}}{CGSize=dd}dd{CSize=aa}'
for result in expr.searchString(thestr):
    print(result)
yields
[['CGPoint=', ['CGPoint=d', ['CGPoint=dd']]]]
[['CGSize=dd']]
['dd']
[['CSize=aa']]
Here is some pseudocode. It keeps a stack of strings and pops one whenever a closing brace is encountered. Some extra logic handles the fact that the outermost braces are not included in the results.
String source = "{CGPoint={CGPoint=d{CGPoint=dd}}}{CGSize=dd}dd{CSize=aa}";
Array results;
Stack stack;
foreach (match in source.match("[{}]|[^{}]+")) {
    switch (match) {
        case '{':
            if (stack.size == 0) stack.push(new String()); // top level: start a new empty string
            else stack.push('{'); // child, so include the matched brace
            break;
        case '}':
            if (stack.size == 1) results.add(stack.pop()); // top-level group is complete, add it to the results
            else { child = stack.pop(); stack.last += child + '}'; } // append the finished child to its parent
            break;
        default:
            if (stack.size == 0) results.add(match); // loose text, add to results
            else stack.last += match; // append to the latest member
    }
}
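The pseudocode above translates fairly directly into Python using only the standard library. This is a sketch under a few assumptions: the function name split_top_level is made up, the tokens come from re.findall, and unbalanced braces raise a ValueError, which the pseudocode does not specify.

```python
import re

def split_top_level(source):
    results = []
    stack = []  # one partial string per currently open brace
    for token in re.findall(r"[{}]|[^{}]+", source):
        if token == "{":
            # The outermost brace is dropped; nested ones are kept in the text.
            stack.append("" if not stack else "{")
        elif token == "}":
            if not stack:
                raise ValueError("unbalanced closing brace")
            finished = stack.pop()
            if stack:
                stack[-1] += finished + "}"   # fold the child back into its parent
            else:
                results.append(finished)      # a top-level group is complete
        elif stack:
            stack[-1] += token                # text inside braces
        else:
            results.append(token)             # loose text between groups
    if stack:
        raise ValueError("unbalanced opening brace")
    return results

a = '{CGPoint={CGPoint=d{CGPoint=dd}}}{CGSize=dd}dd{CSize=aa}'
print(split_top_level(a))
# ['CGPoint={CGPoint=d{CGPoint=dd}}', 'CGSize=dd', 'dd', 'CSize=aa']
```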