Parsing keyword next to special character (pyparsing) - python

Using pyparsing, how can I match a keyword immediately before or after a special character (like "{" or "}")? The code below shows that my keyword "msg" is not matched unless it is preceded by whitespace (or at start):
import pyparsing as pp
openBrace = pp.Suppress(pp.Keyword("{"))
closeBrace = pp.Suppress(pp.Keyword("}"))
messageKw = pp.Keyword("msg")
messageExpr = pp.Forward()
messageExpr << messageKw + openBrace +\
pp.ZeroOrMore(messageExpr) + closeBrace
try:
result = messageExpr.parseString("msg { msg { } }")
print result.dump(), "\n"
result = messageExpr.parseString("msg {msg { } }")
print result.dump()
except pp.ParseException as pe:
print pe, "\n", "Text: ", pe.line
I'm sure there's a way to do this, but I have been unable to find it.
Thanks in advance

openBrace = pp.Suppress(pp.Keyword("{"))
closeBrace = pp.Suppress(pp.Keyword("}"))
should be:
openBrace = pp.Suppress(pp.Literal("{"))
closeBrace = pp.Suppress(pp.Literal("}"))
or even just:
openBrace = pp.Suppress("{")
closeBrace = pp.Suppress("}")
(Most pyparsing classes will auto-promote a string argument "arg" to Literal("arg").)
When I have parsers with many punctuation marks, rather than have a big ugly chunk of statements like this, I'll collapse them down to something like:
OBRACE, CBRACE, OPAR, CPAR, SEMI, COMMA = map(pp.Suppress, "{}();,")
The problem you are seeing is that Keyword looks at the immediately-surrounding characters, to make sure that the current string is not being accidentally matched when it is really embedded in a larger identifier-like string. In Keyword('{'), this will only work if there is no adjoining character that could be confused as part of a larger word. Since '{' itself is not really a typical keyword character, using Keyword('{') is not a good use of that class.
Only use Keyword for strings that could be misinterpreted as identifiers. For matching characters that are not in the set of typical keyword characters (by "keyword characters" I mean alphanumerics + '_'), use Literal.

Related

How to setup a grammar that can handle ambiguity

I'm trying to create a grammar to parse some Excel-like formulas I have devised, where a special character in the beginning of a string signifies a different source. For example, $ can signify a string, so "$This is text" would be treated as a string input in the program and & can signify a function, so &foo() can be treated as a call to the internal function foo.
The problem I'm facing is how to construct the grammar properly. For example, This is a simplified version as a MWE:
grammar = r'''start: instruction
?instruction: simple
| func
STARTSYMBOL: "!"|"#"|"$"|"&"|"~"
SINGLESTR: (LETTER+|DIGIT+|"_"|" ")*
simple: STARTSYMBOL [SINGLESTR] (WORDSEP SINGLESTR)*
ARGSEP: ",," // argument separator
WORDSEP: "," // word separator
CONDSEP: ";;" // condition separator
STAR: "*"
func: STARTSYMBOL SINGLESTR "(" [simple|func] (ARGSEP simple|func)* ")"
%import common.LETTER
%import common.WORD
%import common.DIGIT
%ignore ARGSEP
%ignore WORDSEP
'''
parser = lark.Lark(grammar, parser='earley')
So, with this grammar, things like: $This is a string, &foo(), &foo(#arg1), &foo($arg1,,#arg2) and &foo(!w1,w2,w3,,!w4,w5,w6) are all parsed as expected. But if I'd like to add more flexibility to my simple terminal, then I need to start fiddling around with the SINGLESTR token definition which is not convenient.
What have I tried
The part that I cannot get past is that if I want to have a string including parentheses (which are literals of func), then I cannot handle them in my current situation.
If I add the parentheses in SINGLESTR, then I get Expected STARTSYMBOL, because it's getting mixed up with the func definition and it thinks that a function argument should be passed, which makes sense.
If I redefine the grammar to reserve the ampersand symbol for functions only and add the parentheses in SINGLESTR, then I can parse a string with parentheses, but every function I'm trying to parse gives Expected LPAR.
My intent is that anything starting with a $ would be parsed as a SINGLESTR token and then I could parse things like &foo($first arg (has) parentheses,,$second arg).
My solution, for now, is that I'm using 'escape' words like LEFTPAR and RIGHTPAR in my strings and I've written helper functions to change those into parentheses when I process the tree. So, $This is a LEFTPARtestRIGHTPAR produces the correct tree and when I process it, then this gets translated to This is a (test).
To formulate a general question: Can I define my grammar in such a way that some characters that are special to the grammar are treated as normal characters in some situations and as special in any other case?
EDIT 1
Based on a comment from jbndlr I revised my grammar to create individual modes based on the start symbol:
grammar = r'''start: instruction
?instruction: simple
| func
SINGLESTR: (LETTER+|DIGIT+|"_"|" ") (LETTER+|DIGIT+|"_"|" "|"("|")")*
FUNCNAME: (LETTER+) (LETTER+|DIGIT+|"_")* // no parentheses allowed in the func name
DB: "!" SINGLESTR (WORDSEP SINGLESTR)*
TEXT: "$" SINGLESTR
MD: "#" SINGLESTR
simple: TEXT|DB|MD
ARGSEP: ",," // argument separator
WORDSEP: "," // word separator
CONDSEP: ";;" // condition separator
STAR: "*"
func: "&" FUNCNAME "(" [simple|func] (ARGSEP simple|func)* ")"
%import common.LETTER
%import common.WORD
%import common.DIGIT
%ignore ARGSEP
%ignore WORDSEP
'''
This falls (somewhat) under my second test case. I can parse all the simple types of strings (TEXT, MD or DB tokens that can contain parentheses) and functions that are empty; for example, &foo() or &foo(&bar()) parse correctly. The moment I put an argument within a function (no matter which type), I get an UnexpectedEOF Error: Expected ampersand, RPAR or ARGSEP. As a proof of concept, if I remove the parentheses from the definition of SINGLESTR in the new grammar above, then everything works as it should, but I'm back to square one.
import lark
grammar = r'''start: instruction
?instruction: simple
| func
MIDTEXTRPAR: /\)+(?!(\)|,,|$))/
SINGLESTR: (LETTER+|DIGIT+|"_"|" ") (LETTER+|DIGIT+|"_"|" "|"("|MIDTEXTRPAR)*
FUNCNAME: (LETTER+) (LETTER+|DIGIT+|"_")* // no parentheses allowed in the func name
DB: "!" SINGLESTR (WORDSEP SINGLESTR)*
TEXT: "$" SINGLESTR
MD: "#" SINGLESTR
simple: TEXT|DB|MD
ARGSEP: ",," // argument separator
WORDSEP: "," // word separator
CONDSEP: ";;" // condition separator
STAR: "*"
func: "&" FUNCNAME "(" [simple|func] (ARGSEP simple|func)* ")"
%import common.LETTER
%import common.WORD
%import common.DIGIT
%ignore ARGSEP
%ignore WORDSEP
'''
parser = lark.Lark(grammar, parser='earley')
parser.parse("&foo($first arg (has) parentheses,,$second arg)")
Output:
Tree(start, [Tree(func, [Token(FUNCNAME, 'foo'), Tree(simple, [Token(TEXT, '$first arg (has) parentheses')]), Token(ARGSEP, ',,'), Tree(simple, [Token(TEXT, '$second arg')])])])
I hope it's what you were looking for.
Those have been crazy few days. I tried lark and failed. I also tried persimonious and pyparsing. All of these different parsers all had the same problem with the 'argument' token consuming the right parenthesis that was part of the function, eventually failing because the function's parentheses weren't closed.
The trick was to figure out how do you define a right parenthesis that's "not special". See the regular expression for MIDTEXTRPAR in the code above. I defined it as a right parenthesis that is not followed by argument separation or by end of string. I did that by using the regular expression extension (?!...) which matches only if it's not followed by ... but doesn't consume characters. Luckily it even allows matching end of string inside this special regular expression extension.
EDIT:
The above mentioned method only works if you don't have an argument ending with a ), because then the MIDTEXTRPAR regular expression won't catch that ) and will think that's the end of the function even though there are more arguments to process. Also, there may be ambiguities such as ...asdf),,..., it may be an end of a function declaration inside an argument, or a 'text-like' ) inside an argument and the function declaration goes on.
This problem is related to the fact that what you describe in your question is not a context-free grammar (https://en.wikipedia.org/wiki/Context-free_grammar) for which parsers such as lark exist. Instead it is a context-sensitive grammar (https://en.wikipedia.org/wiki/Context-sensitive_grammar).
The reason for it being a context sensitive grammar is because you need the parser to 'remember' that it is nested inside a function, and how many levels of nesting there are, and have this memory available inside the grammar's syntax in some way.
EDIT2:
Also take a look at the following parser that is context-sensitive, and seems to solve the problem, but has an exponential time complexity in the number of nested functions, as it tries to parse all possible function barriers until it finds one that works. I believe it has to have an exponential complexity has since it's not context-free.
_funcPrefix = '&'
_debug = False
class ParseException(Exception):
pass
def GetRecursive(c):
if isinstance(c,ParserBase):
return c.GetRecursive()
else:
return c
class ParserBase:
def __str__(self):
return type(self).__name__ + ": [" + ','.join(str(x) for x in self.contents) +"]"
def GetRecursive(self):
return (type(self).__name__,[GetRecursive(c) for c in self.contents])
class Simple(ParserBase):
def __init__(self,s):
self.contents = [s]
class MD(Simple):
pass
class DB(ParserBase):
def __init__(self,s):
self.contents = s.split(',')
class Func(ParserBase):
def __init__(self,s):
if s[-1] != ')':
raise ParseException("Can't find right parenthesis: '%s'" % s)
lparInd = s.find('(')
if lparInd < 0:
raise ParseException("Can't find left parenthesis: '%s'" % s)
self.contents = [s[:lparInd]]
argsStr = s[(lparInd+1):-1]
args = list(argsStr.split(',,'))
i = 0
while i<len(args):
a = args[i]
if a[0] != _funcPrefix:
self.contents.append(Parse(a))
i += 1
else:
j = i+1
while j<=len(args):
nestedFunc = ',,'.join(args[i:j])
if _debug:
print(nestedFunc)
try:
self.contents.append(Parse(nestedFunc))
break
except ParseException as PE:
if _debug:
print(PE)
j += 1
if j>len(args):
raise ParseException("Can't parse nested function: '%s'" % (',,'.join(args[i:])))
i = j
def Parse(arg):
if arg[0] not in _starterSymbols:
raise ParseException("Bad prefix: " + arg[0])
return _starterSymbols[arg[0]](arg[1:])
_starterSymbols = {_funcPrefix:Func,'$':Simple,'!':DB,'#':MD}
P = Parse("&foo($first arg (has)) parentheses,,&f($asdf,,&nested2($23423))),,&second(!arg,wer))")
print(P)
import pprint
pprint.pprint(P.GetRecursive())
Problem is arguments of function are enclosed in parenthesis where one of the arguments may contain parenthesis.
One of the possible solution is use backspace \ before ( or ) when it is a part of String
SINGLESTR: (LETTER+|DIGIT+|"_"|" ") (LETTER+|DIGIT+|"_"|" "|"\("|"\)")*
Similar solution used by C, to include double quotes(") as a part of string constant where string constant is enclosed in double quotes.
example_string1='&f(!g\()'
example_string2='&f(#g)'
print(parser.parse(example_string1).pretty())
print(parser.parse(example_string2).pretty())
Output is
start
func
f
simple !g\(
start
func
f
simple #g

How to parse ~{expr} inside string with lark ebnf

I am trying to write a lark grammar for a dsl, but having trouble with this string interpolation syntax:
" abc " <- normal string
" xyz~{expression}abc " <- string with interpolation
so a ~{ switches from string to expression, and a } terminates that expression. I think this is close:
string : "\"" (string_interp|not_string_interp)* "\""
string_interp: "~{" expression "}"
not_string_interp: /([^~][^{])+/
But the regex will only match even numbers of characters, and if the ~{ straddles an even boundary, it will be missed.
not_string_interp: /(.?|([^~][^{])+)/
This is about as far as I could get, but still seems wrong. Can I use lookaheads? I also want to keep %ignore WS on, as it keeps the noise down massively, so a solution accounting for that would be great!
Thanks
Test cases:
""
"a"
"~{1}"
" ~{1} "
"a bc~{1}c d"
"a b~{1}c d"
I think this does it. Sadly any ~ not followed by { will split the string up, but I can reconstruct them later. I am getting fooled by the equal precedence of rules, and the greediness of regexes.
/[^"~]+/ anything that is not ~ or " (regular string)
"~{" expression "}" the normal expression
/~(?!{)/ handle ~ without {. Use ?! because we must not consume next char (it could be " or another ~)
from lark import Lark
print (Lark(r"""
string: "\"" string_thing* "\""
string_thing: /[^"~]+/
| "~{" expression "}"
| /~(?!{)/
expression: /[^}]+/
""", start='string', ambiguity="explicit").parse(
# '"a"'
'"a~b{}c}d~{1}g"'
# '"~abc~"'
# '"~{1}~~{1}~~~{1}"'
).pretty())
Here is a solution to your problem using a positive lookbehind.
(?<=~{)[^}]+
It looks for the beginning of the expression ~{ and captures everything until the closing brace }

In PyParsing, how to stop a Regex from consuming the entire string

I'm trying to write a function parse such that, for example,
assert parse("file://foo:bar.txt:r+") == ("foo:bar.txt", "r+")
The string consists of a fixed prefix file://, followed by a file name (which can consist of one or more of any character), followed by a colon and a string representing access flags.
Here is one implementation using regular expressions:
import re
def parse(string):
SCHEME = r"file://" # File prefix
PATH_PATTERN = r"(?P<path>.+)" # One or more of any character
FLAGS_PATTERN = r"(?P<flags>[rwab+0-9]+)" # The letters r, w, a, b, a '+' symbol, or any digit
FILE_RESOURCE_PATTERN = SCHEME + PATH_PATTERN + r":" + FLAGS_PATTERN + r"$" # The full pattern including the end of line character
tokens = re.match(FILE_RESOURCE_PATTERN, string).groupdict()
return tokens['path'], tokens['flags']
I would prefer to use PyParsing, however, because it typically gives more detailed error messages if the string doesn't match the expression (rather than re.match which simply returns None), and I would eventually like to make the flags optional.
Following Paul McGuire's answer in python regex in pyparsing, I made the following attempt:
from pyparsing import Word, alphas, nums, StringEnd, Regex, FollowedBy, Suppress, Literal
def parse(string):
scheme = Literal("file://")
path = Regex(".+")
flags = Word(alphas + nums + "+")
expression = Suppress(scheme) + (~(Suppress(":") + flags + StringEnd()) + path("path") + Suppress(":") + flags("flags") + StringEnd())
tokens = expression.parseString(string)
return tokens['path'], tokens['flags']
In the second part of the expression, I'm basically trying the negative lookahead (~suffix + path + suffix), where suffix is ":" + flags + StringEnd(). However, when trying to parse "file://foo:bar.txt:r+", I run into the following error:
pyparsing.ParseException: Expected ":" (at char 21), (line:1, col:22)
Since the string is 21 characters long, I interpret this as that the Regex has 'consumed' the entire string so that the suffix is no longer 'found'.
How can I fix the parse method using pyparsing?
Try this:
s="file://foo:bar.txt:r+"
path,flag=re.sub(r'.*\/\/(.*):(.*$)',r'\1,\2',s)

Python Regular expression must strip whitespace except between quotes

I need a way to remove all whitespace from a string, except when that whitespace is between quotes.
result = re.sub('".*?"', "", content)
This will match anything between quotes, but now it needs to ignore that match and add matches for whitespace..
I don't think you're going to be able to do that with a single regex. One way to do it is to split the string on quotes, apply the whitespace-stripping regex to every other item of the resulting list, and then re-join the list.
import re
def stripwhite(text):
lst = text.split('"')
for i, item in enumerate(lst):
if not i % 2:
lst[i] = re.sub("\s+", "", item)
return '"'.join(lst)
print stripwhite('This is a string with some "text in quotes."')
Here is a one-liner version, based on #kindall's idea - yet it does not use regex at all! First split on ", then split() every other item and re-join them, that takes care of whitespaces:
stripWS = lambda txt:'"'.join( it if i%2 else ''.join(it.split())
for i,it in enumerate(txt.split('"')) )
Usage example:
>>> stripWS('This is a string with some "text in quotes."')
'Thisisastringwithsome"text in quotes."'
You can use shlex.split for a quotation-aware split, and join the result using " ".join. E.g.
print " ".join(shlex.split('Hello "world this is" a test'))
Oli, resurrecting this question because it had a simple regex solution that wasn't mentioned. (Found your question while doing some research for a regex bounty quest.)
Here's the small regex:
"[^"]*"|(\s+)
The left side of the alternation matches complete "quoted strings". We will ignore these matches. The right side matches and captures spaces to Group 1, and we know they are the right spaces because they were not matched by the expression on the left.
Here is working code (and an online demo):
import re
subject = 'Remove Spaces Here "But Not Here" Thank You'
regex = re.compile(r'"[^"]*"|(\s+)')
def myreplacement(m):
if m.group(1):
return ""
else:
return m.group(0)
replaced = regex.sub(myreplacement, subject)
print(replaced)
Reference
How to match pattern except in situations s1, s2, s3
How to match a pattern unless...
Here little longish version with check for quote without pair. Only deals with one style of start and end string (adaptable for example for example start,end='()')
start, end = '"', '"'
for test in ('Hello "world this is" atest',
'This is a string with some " text inside in quotes."',
'This is without quote.',
'This is sentence with bad "quote'):
result = ''
while start in test :
clean, _, test = test.partition(start)
clean = clean.replace(' ','') + start
inside, tag, test = test.partition(end)
if not tag:
raise SyntaxError, 'Missing end quote %s' % end
else:
clean += inside + tag # inside not removing of white space
result += clean
result += test.replace(' ','')
print result

How can I strip first and last double quotes?

I want to strip double quotes from:
string = '"" " " ""\\1" " "" ""'
to obtain:
string = '" " " ""\\1" " "" "'
I tried to use rstrip, lstrip and strip('[^\"]|[\"$]') but it did not work.
How can I do this?
If the quotes you want to strip are always going to be "first and last" as you said, then you could simply use:
string = string[1:-1]
If you can't assume that all the strings you process have double quotes you can use something like this:
if string.startswith('"') and string.endswith('"'):
string = string[1:-1]
Edit:
I'm sure that you just used string as the variable name for exemplification here and in your real code it has a useful name, but I feel obliged to warn you that there is a module named string in the standard libraries. It's not loaded automatically, but if you ever use import string make sure your variable doesn't eclipse it.
IMPORTANT: I'm extending the question/answer to strip either single or double quotes. And I interpret the question to mean that BOTH quotes must be present, and matching, to perform the strip. Otherwise, the string is returned unchanged.
To "dequote" a string representation, that might have either single or double quotes around it (this is an extension of #tgray's answer):
def dequote(s):
"""
If a string has single or double quotes around it, remove them.
Make sure the pair of quotes match.
If a matching pair of quotes is not found,
or there are less than 2 characters, return the string unchanged.
"""
if (len(s) >= 2 and s[0] == s[-1]) and s.startswith(("'", '"')):
return s[1:-1]
return s
Explanation:
startswith can take a tuple, to match any of several alternatives. The reason for the DOUBLED parentheses (( and )) is so that we pass ONE parameter ("'", '"') to startswith(), to specify the permitted prefixes, rather than TWO parameters "'" and '"', which would be interpreted as a prefix and an (invalid) start position.
s[-1] is the last character in the string.
Testing:
print( dequote("\"he\"l'lo\"") )
print( dequote("'he\"l'lo'") )
print( dequote("he\"l'lo") )
print( dequote("'he\"l'lo\"") )
=>
he"l'lo
he"l'lo
he"l'lo
'he"l'lo"
(For me, regex expressions are non-obvious to read, so I didn't try to extend #Alex's answer.)
To remove the first and last characters, and in each case do the removal only if the character in question is a double quote:
import re
s = re.sub(r'^"|"$', '', s)
Note that the RE pattern is different than the one you had given, and the operation is sub ("substitute") with an empty replacement string (strip is a string method but does something pretty different from your requirements, as other answers have indicated).
If string is always as you show:
string[1:-1]
Almost done. Quoting from http://docs.python.org/library/stdtypes.html?highlight=strip#str.strip
The chars argument is a string
specifying the set of characters to be
removed.
[...]
The chars argument is not a prefix or
suffix; rather, all combinations of
its values are stripped:
So the argument is not a regexp.
>>> string = '"" " " ""\\1" " "" ""'
>>> string.strip('"')
' " " ""\\1" " "" '
>>>
Note, that this is not exactly what you requested, because it eats multiple quotes from both end of the string!
Remove a determinated string from start and end from a string.
s = '""Hello World""'
s.strip('""')
> 'Hello World'
Starting in Python 3.9, you can use removeprefix and removesuffix:
'"" " " ""\\1" " "" ""'.removeprefix('"').removesuffix('"')
# '" " " ""\\1" " "" "'
If you are sure there is a " at the beginning and at the end, which you want to remove, just do:
string = string[1:len(string)-1]
or
string = string[1:-1]
I have some code that needs to strip single or double quotes, and I can't simply ast.literal_eval it.
if len(arg) > 1 and arg[0] in ('"\'') and arg[-1] == arg[0]:
arg = arg[1:-1]
This is similar to ToolmakerSteve's answer, but it allows 0 length strings, and doesn't turn the single character " into an empty string.
in your example you could use strip but you have to provide the space
string = '"" " " ""\\1" " "" ""'
string.strip('" ') # output '\\1'
note the \' in the output is the standard python quotes for string output
the value of your variable is '\\1'
Below function will strip the empty spces and return the strings without quotes. If there are no quotes then it will return same string(stripped)
def removeQuote(str):
str = str.strip()
if re.search("^[\'\"].*[\'\"]$",str):
str = str[1:-1]
print("Removed Quotes",str)
else:
print("Same String",str)
return str
find the position of the first and the last " in your string
>>> s = '"" " " ""\\1" " "" ""'
>>> l = s.find('"')
>>> r = s.rfind('"')
>>> s[l+1:r]
'" " " ""\\1" " "" "'

Categories