I'm using lark, an excellent python parsing library.
It provides Earley and LALR(1) parsers, and grammars are defined in a custom EBNF format. (EBNF stands for Extended Backus–Naur form.)
Lowercase definitions are rules; uppercase definitions are terminals. Lark also lets you attach a priority to uppercase definitions (the .2 in NAME.2 below) to prioritize the matching.
I'm trying to define a grammar but I'm stuck with a behavior I can't seem to balance.
I have some rules with unnamed literals (the strings or characters between double-quotes):
directives: directive+
directive: "#" NAME arguments ?
directive_definition: description? "directive" "#" NAME arguments? "on" directive_locations
directive_locations: "SCALAR" | "OBJECT" | "ENUM"
arguments: "(" argument+ ")"
argument: NAME ":" value
union_type_definition: description? "union" NAME directives? union_member_types?
union_member_types: "=" NAME ("|" NAME)*
description: STRING | LONG_STRING
STRING: /("(?!"").*?(?<!\\)(\\\\)*?"|'(?!'').*?(?<!\\)(\\\\)*?')/i
LONG_STRING: /(""".*?(?<!\\)(\\\\)*?"""|'''.*?(?<!\\)(\\\\)*?''')/is
NAME.2: /[_A-Za-z][_0-9A-Za-z]*/
It works well for 99% of use cases. But if, in the parsed language, I use a directive that is itself called directive, everything breaks:
union Foo #something(test: 42) = Bar | Baz # This works
union Foo #directive(test: 42) = Bar | Baz # This fails
Here, the string directive is matched as the unnamed "directive" literal from the directive_definition rule, when it should match the NAME terminal.
How can I balance / adjust this so there is no possible ambiguity for the LALR(1) parser?
Author of Lark here.
This misinterpretation happens because "directive" can be two different tokens: The "directive" string, or NAME. By default, Lark's LALR lexer always chooses the more specific one, namely the string.
So how can we let the lexer know that #directive is a name, and not just two constant strings?
Solution 1 - Use the Contextual Lexer
What would probably help in this situation (it's hard to be sure without the full grammar), is to use the contextual lexer, instead of the standard LALR(1) lexer.
The contextual lexer can communicate to some degree with the parser, to figure out which terminal makes more sense at each point. This is an algorithm that is unique to Lark, and you can use it like this:
parser = Lark(grammar, parser="lalr", lexer="contextual")
(This lexer can do anything the standard lexer can do and more, so in future versions it might become the default lexer.)
Solution 2 - Prefix the terminal
If the contextual lexer doesn't solve your collision, a more "classic" solution to this situation would be to define a directive token, something like:
DIRECTIVE: "#" NAME
Unlike your directive rule, this leaves no ambiguity for the lexer: there is a clear distinction between a DIRECTIVE token and the "directive" string (or the NAME terminal).
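For illustration, here is a sketch of how the grammar from the question might change (I'm only adapting the rules shown above; terminals in Lark can be built from other terminals):

DIRECTIVE: "#" NAME
directive: DIRECTIVE arguments?
directive_definition: description? "directive" DIRECTIVE arguments? "on" directive_locations

Since #directive is now lexed as a single DIRECTIVE token, the "directive" keyword can no longer swallow the name after the #. Note that the tree will then contain the raw #name text, so you may want to strip the leading # in a transformer.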
And if all else fails, you can always use the Earley parser, which, at the price of performance, will work with any grammar you give it, regardless of how many collisions there might be.
Hope this helps!
Edit: I'd just like to point out that the contextual lexer is the default for LALR now, so it's enough to call:
parser = Lark(grammar, parser="lalr")
Related
I've been experimenting with lark and I came across a little problem.
Suppose I have the following grammar.
parser = Lark('''
?start: value
| start "or" value -> or
?value: DIGIT -> digit
| ID -> id
DIGIT: /[1-9]\d*/
%import common.CNAME -> ID
%import common.WS
%ignore WS
''', parser='lalr')
Let's say I want to parse 1orfoo:
print(parser.parse("1orfoo").pretty())
I would expect lark to see it as the digit 1 followed by the identifier orfoo (and thus throw an error, because the grammar does not accept this kind of expression).
However, the parser runs without error and outputs this:
or
digit 1
id foo
As you can see, lark splits the identifier and sees the expression as an or statement.
Why is this happening? Am I missing something? How can I prevent this kind of behavior?
Thank you in advance.
Lark can use different lexers to structure the input text into tokens. The default is "auto", which chooses a lexer based on the parser. For LALR, the "contextual" lexer is chosen (reference). The contextual lexer uses the LALR look-ahead to discard token choices that do not fit with the grammar (reference):
The contextual lexer communicates with the parser, and uses the parser's lookahead prediction to narrow its choice of tokens. So at each point, the lexer only matches the subgroup of terminals that are legal at that parser state, instead of all of the terminals. It’s surprisingly effective at resolving common terminal collisions, and allows to parse languages that LALR(1) was previously incapable of parsing.
In your code, since you use the lalr parser, the contextual lexer is used. The lexer first creates a DIGIT token for 1. Next, the lexer has to decide whether to create a token for the or literal or an ID token. Since the parsing state does not expect an ID token, the lexer eliminates the latter choice and tokenizes or.
To change this behavior, you can select the standard lexer instead:
parser = Lark('''...''', parser='lalr', lexer='standard')
On your example, it will generate:
lark.exceptions.UnexpectedToken: Unexpected token Token(ID, 'orfoo') at line 1, column 2.
Expected one of:
* OR
* $END
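Putting it together, a minimal sketch (note that in recent Lark versions the standard lexer has been renamed "basic"):

from lark import Lark
from lark.exceptions import UnexpectedToken

parser = Lark(r'''
    ?start: value
          | start "or" value -> or
    ?value: DIGIT -> digit
          | ID -> id
    DIGIT: /[1-9]\d*/
    %import common.CNAME -> ID
    %import common.WS
    %ignore WS
''', parser='lalr', lexer='standard')  # use lexer='basic' on Lark 1.0+

try:
    print(parser.parse("1orfoo").pretty())
except UnexpectedToken as e:
    print(e)  # Unexpected token Token('ID', 'orfoo') at line 1, column 2. ...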
This is possibly a duplicate of this question, but for me that one is not specific enough.
The Python grammar is claimed to be LL(1), but I've noticed some productions in the Python grammar that really confuse me, for example the arguments in the following function calls:
foo(a)
foo(a=a)
corresponds to the following grammar:
argument: ( test [comp_for] |
test '=' test |
'**' test |
'*' test )
test appears twice in the first position of the alternatives. It means that by looking only at test, Python cannot determine whether it's test [comp_for] or test '=' test.
More examples:
comp_op: '<'|'>'|'=='|'>='|'<='|'<>'|'!='|'in'|'not' 'in'|'is'|'is' 'not'
Note 'is' and 'is' 'not'
subscript: test | [test] ':' [test] [sliceop]
test also appears twice.
Is my understanding of LL(1) wrong? Does Python do some workaround for the grammar during lexing or parsing to make it LL(1) processable? Thank you all in advance.
The grammar presented in the Python documentation (and used to generate the Python parser) is written in a form of Extended BNF which includes "operators" such as optionality ([a]) and Kleene closure ((a b c)*). LL(1), however, is a category which applies only to simple context-free grammars, which do not have such operators. So asking whether that particular grammar is LL(1) or not is a category error.
In order to make the question meaningful, the grammar would have to be transformed into a simple context-free grammar. This is, of course, possible but there is no canonical transformation and the Python documentation does not explain the precise transformation used. Some transformations may produce LL(1) grammars and other ones might not. (Indeed, naive translation of the Kleene star can easily lead to ambiguity, which is by definition not LL(k) for any k.)
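To see the Kleene-star point concretely: one naive expansion of x: a* into plain CFG rules is

x: x x | a | ε

which is ambiguous (even the empty string has infinitely many derivations), while the right-recursive expansion x: a x | ε is unambiguous and LL(1).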
In practice, the Python parsing apparatus transforms the grammar into an executable parser, not into a context-free grammar. For Python's pragmatic purposes, it is sufficient to be able to build a predictive parser with a lookahead of just one token. Because a predictive parser can use control structures like conditional statements and loops, a complete transformation into a context-free grammar is unnecessary. Thus, it is possible to use EBNF productions -- as with the documented grammar -- which are not fully left-factored, and even EBNF productions whose transformation to LL(1) is non-trivial:
simple_stmt: small_stmt (';' small_stmt)* [';'] NEWLINE
In the above production, the repetition of (';' small_stmt)* may be followed by a ';', which means that a simple while loop will not correctly represent the production. I don't know how this production is handled by the Python parser generator, but it is possible to transform it into a CFG by left-factoring after expanding the repetition:
simple_stmt: small_stmt rest_A
rest_A : ';' rest_B
| NEWLINE
rest_B : small_stmt rest_A
| NEWLINE
Similarly, the entire EBNF can be transformed into an LL(1) grammar. That is not done because the exercise is neither useful for parsing nor for explaining the syntax. It would be hard to read, and the EBNF can be directly transformed into a parser.
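As a concrete sketch (my own toy code, not CPython's actual parser), a predictive parser can implement that production directly from the EBNF with a loop and one token of lookahead; note the extra check needed inside the loop for the optional trailing ';':

def parse_simple_stmt(toks):
    # simple_stmt: small_stmt (';' small_stmt)* [';'] NEWLINE
    # toks is a flat token list; 'stmt' stands in for a parsed small_stmt.
    i = 0
    assert toks[i] == 'stmt'
    stmts, i = ['stmt'], 1
    while toks[i] == ';':
        i += 1                    # consume ';'
        if toks[i] == 'NEWLINE':  # it was the optional trailing ';'
            break
        assert toks[i] == 'stmt'  # otherwise another small_stmt must follow
        stmts.append('stmt')
        i += 1
    assert toks[i] == 'NEWLINE'
    return stmts

print(parse_simple_stmt(['stmt', ';', 'stmt', ';', 'NEWLINE']))  # ['stmt', 'stmt']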
This is slightly independent of the question of whether Python is LL(1), because a language is LL(1) precisely if an LL(1) grammar exists for the language. There will always be an infinitude of possible grammars for a language, including grammars which are not LL(k) for any k and even grammars which are not context-free, but that is irrelevant to the question of whether the language is LL(1): the language is LL(1) if even one LL(1) grammar exists. (I'm aware that this is not the original question, so I won't pursue this any further.)
You're correct that constructs like 'is' | 'is' 'not' aren't LL(1). They can be left-factored to LL(1) quite easily by changing it to 'is' notOpt where notOpt: 'not' | ϵ or, if you allow EBNF syntax, just 'is' 'not'? (or 'is' ['not'] depending on the flavor of EBNF).
So the language is LL(1), but the grammar technically is not. I assume the Python designers decided that this was okay because the left-factored version would be more difficult to read without much benefit and the current version can still be used as the basis for an LL(1) parser without much difficulty.
I am looking for a way to get a better grasp of the Python grammar.
My experience is that a railroad diagram for the grammar may be helpful.
Python documentation contains the grammar in a text form:
https://docs.python.org/3/reference/grammar.html
But that is not very easy to digest for someone who is just starting with software engineering.
Does anybody have good beginner material?
There is a Railroad Diagram Generator that I might be able to use, but I was not able to find an EBNF syntax for the Python grammar, that would be accepted by that generator.
A link to such a grammar would be very helpful as well.
To convert the Python grammar found at, e.g., https://docs.python.org/3/reference/grammar.html, to EBNF, you basically need to do three things:
Replace all #... comments with /*...*/ (or just delete them)
Use ::= instead of : for defining production rules
Use (...)? to indicate optional elements instead of [...].
For example, instead of
# The function statement
funcdef: 'def' NAME parameters ['->' test] ':' suite
you would use
/* The function statement */
funcdef ::= 'def' NAME parameters ('->' test)? ':' suite
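If you want to script the conversion, here is a minimal sketch (my own throwaway code; it assumes '#' never occurs inside quoted literals, and handles nested [...] by rewriting innermost groups first):

import re

def grammar_to_ebnf(text):
    lines = []
    for line in text.splitlines():
        # 1. Turn '#...' comments into '/*...*/' comments
        line = re.sub(r"#\s?(.*)$", lambda m: "/* %s */" % m.group(1).rstrip(), line)
        # 2. Use '::=' instead of ':' for rule definitions
        line = re.sub(r"^(\s*\w+)\s*:(?!:)", r"\1 ::=", line, count=1)
        lines.append(line)
    text = "\n".join(lines)
    # 3. Rewrite optional [...] groups as (...)?
    optional = re.compile(r"\[([^\[\]]*)\]")
    while optional.search(text):
        text = optional.sub(r"(\1)?", text)
    return text

print(grammar_to_ebnf("funcdef: 'def' NAME parameters ['->' test] ':' suite"))
# funcdef ::= 'def' NAME parameters ('->' test)? ':' suite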
How should I do real escaping in Python for SQLite3?
If I google for it (or search Stack Overflow), there are tons of questions about this, and every time the response is something like:
dbcursor.execute("SELECT * FROM `foo` WHERE `bar` like ?", ["foobar"])
This helps against SQL injection, and it is enough if I just do comparisons with "=", but of course it doesn't strip wildcards.
So if I do
cursor.execute(u"UPDATE `cookies` set `count`=? WHERE `nickname` ilike ?", (cookies, name))
some user could supply "%" as a nickname and would update all of the cookie entries with one statement.
I could filter it myself (ugh… I will probably forget one of those lesser-known wildcards anyway), or I could lowercase both nick and nickname and replace "ilike" with "=", but what I would really like to do is something along the lines of:
foo = sqlescape(nick)+"%"
cursor.execute(u"UPDATE `cookies` set `count`=? WHERE `nickname` ilike ?", (cookies, foo))
? parameters are intended to avoid formatting problems for SQL strings (and other problematic data types like floating-point numbers and blobs).
LIKE/GLOB wildcards work on a different level; they are always part of the string itself.
SQL allows escaping them, but there is no default escape character; you have to choose one with the ESCAPE clause:
escaped_foo = my_like_escape(foo, "\\")
c.execute("UPDATE cookies SET count = ? WHERE nickname LIKE ? ESCAPE '\\'",
          (cookies, escaped_foo))
(And you have to write your own my_like_escape function for % and _ (LIKE) or * and ? (GLOB).)
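A minimal sketch of such a function for LIKE (escaping the escape character itself first, so the output stays unambiguous):

def my_like_escape(s, esc="\\"):
    # Escape the escape char first, then the LIKE wildcards % and _.
    for ch in (esc, "%", "_"):
        s = s.replace(ch, esc + ch)
    return s

print(my_like_escape("50%_off"))  # 50\%\_off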
You've avoided outright code injection by using parametrized queries. Now it seems you're trying to do a pattern match with user-supplied data, but you want the user-supplied portion of the data to be treated as literal data (hence no wildcards). You have several options:
Just filter the input. SQLite's LIKE only understands % and _ as wildcards, so it's pretty hard to get it wrong. Just make sure to always filter inputs. (My preferred method: Filter just before the query is constructed, not when user input is read).
In general, a "whitelist" approach is considered safer and easier than removing specific dangerous characters. That is, instead of deleting % and _ from your string (and any "lesser-known wildcards", as you say), scan your string and keep only the characters you want. E.g., if your "nicknames" can contain ASCII letters, digits, "-" and ".", it can be sanitized like this:
name = re.sub(r"[^A-Za-z\d.-]", "", name)
This solution is specific to the particular field you are matching against, and works well for key fields and other identifiers. I would definitely do it this way if I had to search with RLIKE, which accepts full regular expressions, so there are a lot more characters to watch out for.
If you don't want the user to be able to supply a wildcard, why would you use LIKE in your query anyway? If the inputs to your queries come from many places in the code (or maybe you're even writing a library), you'll make your query safer if you can avoid LIKE altogether:
Here's case insensitive matching:
SELECT * FROM ... WHERE name = 'someone' COLLATE NOCASE
In your example you use prefix matching (sqlescape(nick) + "%"). Here's how to do the same thing with an exact comparison instead:
size = len(nick)
cursor.execute(u"UPDATE `cookies` set `count`=? WHERE substr(`nickname`, 1, ?) = ?",
(cookies, size, nick))
Umm, normally you'd want to just replace 'ilike' with a normal '=' comparison that doesn't interpret '%' in any special way. Escaping (effectively blacklisting bad patterns) is error-prone: even if you manage to escape all known patterns in the version of SQLite you use, any future upgrade can put you at risk, etc.
It's not clear to me why you'd want to mass-update cookies based on a fuzzy match on user name.
If you really want to do that, my preferred approach would be to SELECT the list first and decide what to UPDATE at the application level to maintain a maximum level of control.
There are several very fun ways to do this with string formatting.
From Python's Documentation:
The built-in str and unicode classes provide the ability to do complex variable substitutions and value formatting via the str.format() method:
s = "string"
c = "Cool"
print "This is a {0}. {1}, huh?".format(s,c)
#=> This is a string. Cool, huh?
Other nifty tricks you can do with string formatting:
"First, thou shalt count to {0}".format(3) # References first positional argument
"Bring me a {}".format("shrubbery!") # Implicitly references the first positional argument
"From {} to {}".format('Africa','Mercia') # Same as "From {0} to {1}"
"My quest is {name}" # References keyword argument 'name'
"Weight in tons {0.weight}" # 'weight' attribute of first positional arg
"Units destroyed: {players[0]}" # First element of keyword argument 'players'.`
The t_error() function is used to handle lexing errors that occur when illegal characters are detected. My question is: how can I use this function to get more specific information about errors, like the error type, or the rule or section in which the error appears?
In general, there is only very limited information available to the t_error() function. As input, it receives a token object where the value has been set to the remaining input text. Analysis of that text is entirely up to you. You can use the t.lexer.skip(n) function to have the lexer skip ahead by a certain number of characters and that's about it.
There is no notion of an "error type" other than the fact that there is an input character that does not match the regular expression of any known token. Since the lexer is decoupled from the parser, there is no direct way to get any information about the state of the parsing engine or to find out what grammar rule is being parsed. Even if you could get the state (which would simply be the underlying state number of the LALR state machine), interpretation of it would likely be very difficult since the parser could be in the intermediate stages of matching dozens of possible grammar rules looking for reduce actions.
My advice is as follows: If you need additional information in the t_error() function, you should set up some kind of object that is shared between the lexer and parser components of your code. You should explicitly make different parts of your compiler update that object as needed (e.g., it could be updated in specific grammar rules).
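A minimal sketch of that shared-object idea (the ErrorContext class and its fields are my own invention, not part of PLY):

import ply.lex as lex

class ErrorContext(object):
    def __init__(self):
        self.section = "start"  # updated by grammar rules / driver code
        self.errors = []

ctx = ErrorContext()

tokens = ("NAME", "NUMBER")
t_NAME = r"[A-Za-z_][A-Za-z0-9_]*"
t_NUMBER = r"\d+"
t_ignore = " \t"

def t_newline(t):
    r"\n+"
    t.lexer.lineno += len(t.value)

def t_error(t):
    # Record the position plus whatever context the other stages stored.
    ctx.errors.append("illegal character %r at line %d (while in %s)"
                      % (t.value[0], t.lexer.lineno, ctx.section))
    t.lexer.skip(1)  # discard the bad character and keep lexing

lexer = lex.lex()
lexer.input("foo 42 $ bar")
for tok in lexer:
    pass
print(ctx.errors)  # ["illegal character '$' at line 1 (while in start)"]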
Just as an aside, there are usually very few courses of action for a bad token. Essentially, you're getting input text that doesn't match any known part of the language alphabet (e.g., no known symbol). As such, there's not even any kind of token value you can give to the parser. Usually, the only course of action is to report the bad input, throw it out, and continue.
As a followup to Raymond's answer, I would also not advise modifying any attribute of the lexer object in t_error().
Ply includes an example ANSI-C style lexer in a file called cpp.py. It has an example of how to extract some information out of t_error():
def t_error(t):
    t.type = t.value[0]   # report the offending character itself as the token type
    t.value = t.value[0]  # and make it the token's value
    t.lexer.skip(1)       # advance the lexer past the bad character
    return t              # pass the one-character token through to the parser
In that function, you can also access the lexer's public attributes:
lineno - Current line number
lexpos - Current position in the input string
There are also some other attributes that aren't listed as public but may provide some useful diagnostics:
lexstate - Current lexer state
lexstatestack - Stack of lexer states
lexstateinfo - State information
lexerrorf - Error rule (if any)
There is indeed a way of managing errors in PLY; take a look at this very interesting presentation:
http://www.slideshare.net/dabeaz/writing-parsers-and-compilers-with-ply
and at chapter 6.8.1 of
http://www.dabeaz.com/ply/ply.html#ply_nn3