Is Python's grammar LL(1)? - python

This may be a duplicate of this question, but for me the existing answers are not specific enough.
The Python grammar is claimed to be LL(1), but I've noticed some productions in the grammar that really confuse me. For example, the arguments in the following function calls:
foo(a)
foo(a=a)
correspond to the following grammar rule:
argument: ( test [comp_for] |
test '=' test |
'**' test |
'*' test )
test appears in the first position of two alternatives. It means that by looking only at test, the parser cannot determine whether it is parsing test [comp_for] or test '=' test.
More examples:
comp_op: '<'|'>'|'=='|'>='|'<='|'<>'|'!='|'in'|'not' 'in'|'is'|'is' 'not'
Note 'is' and 'is' 'not'
subscript: test | [test] ':' [test] [sliceop]
test also appears twice.
Is my understanding of LL(1) wrong? Does Python do some workaround for the grammar during lexing or parsing to make it LL(1) processable? Thank you all in advance.

The grammar presented in the Python documentation (and used to generate the Python parser) is written in a form of Extended BNF which includes "operators" such as optionality ([a]) and Kleene closure ((a b c)*). LL(1), however, is a category which applies only to simple context-free grammars, which do not have such operators. So asking whether that particular grammar is LL(1) or not is a category error.
In order to make the question meaningful, the grammar would have to be transformed into a simple context-free grammar. This is, of course, possible but there is no canonical transformation and the Python documentation does not explain the precise transformation used. Some transformations may produce LL(1) grammars and other ones might not. (Indeed, naive translation of the Kleene star can easily lead to ambiguity, which is by definition not LL(k) for any k.)
In practice, the Python parsing apparatus transforms the grammar into an executable parser, not into a context-free grammar. For Python's pragmatic purposes, it is sufficient to be able to build a predictive parser with a lookahead of just one token. Because a predictive parser can use control structures like conditional statements and loops, a complete transformation into a context-free grammar is unnecessary. Thus, it is possible to use EBNF productions -- as with the documented grammar -- which are not fully left-factored, and even EBNF productions whose transformation to LL(1) is non-trivial:
simple_stmt: small_stmt (';' small_stmt)* [';'] NEWLINE
In the above production, the repetition of (';' small_stmt)* may be followed by a ';', which means that a simple while loop will not correctly represent the production. I don't know how this production is handled by the Python parser generator, but it is possible to transform it into a CFG by left-factoring after expanding the repetition:
simple_stmt: small_stmt rest_A
rest_A : ';' rest_B
| NEWLINE
rest_B : small_stmt rest_A
| NEWLINE
Similarly, the entire EBNF can be transformed into an LL(1) grammar. That is not done because the exercise is neither useful for parsing nor for explaining the syntax. It would be hard to read, and the EBNF can be directly transformed into a parser.
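As a concrete illustration of the point about predictive parsers and loops, here is a minimal, hedged sketch (my own, not CPython's actual parser generator) of a hand-written routine for this production; one token of lookahead inside the loop decides whether a ';' separates statements or is the optional trailing ';':
def parse_simple_stmt(tokens):
    """tokens: a flat list such as ['stmt', ';', 'stmt', ';', 'NEWLINE']."""
    pos = 0

    def peek():
        return tokens[pos]

    def expect(kind):
        nonlocal pos
        assert tokens[pos] == kind, f"expected {kind}, got {tokens[pos]}"
        pos += 1

    stmts = []
    expect('stmt')                   # stand-in for parsing a real small_stmt
    stmts.append('stmt')
    while peek() == ';':
        expect(';')
        if peek() == 'NEWLINE':      # the optional trailing ';' case
            break
        expect('stmt')
        stmts.append('stmt')
    expect('NEWLINE')
    return stmts

print(parse_simple_stmt(['stmt', ';', 'stmt', ';', 'NEWLINE']))   # ['stmt', 'stmt']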
This is slightly independent of the question of whether Python is LL(1), because a language is LL(1) precisely if an LL(1) grammar exists for the language. There will always be an infinitude of possible grammars for a language, including grammars which are not LL(k) for any k and even grammars which are not context-free, but that is irrelevant to the question of whether the language is LL(1): the language is LL(1) if even one LL(1) grammar exists. (I'm aware that this is not the original question, so I won't pursue this any further.)

You're correct that constructs like 'is' | 'is' 'not' aren't LL(1). They can be left-factored to LL(1) quite easily by changing the construct to 'is' notOpt where notOpt: 'not' | ϵ or, if you allow EBNF syntax, just 'is' 'not'? (or 'is' ['not'], depending on the flavor of EBNF).
So the language is LL(1), but the grammar technically is not. I assume the Python designers decided that this was okay because the left-factored version would be more difficult to read without much benefit and the current version can still be used as the basis for an LL(1) parser without much difficulty.
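As an illustration (my own sketch, not CPython code), a hand-written parser with one token of lookahead can resolve these alternatives without factoring the written grammar at all:
def parse_comp_op(tokens, pos):
    """Return (operator, new_position); tokens is a plain list of strings."""
    tok = tokens[pos]
    if tok == 'is' and pos + 1 < len(tokens) and tokens[pos + 1] == 'not':
        return 'is not', pos + 2
    if tok == 'not' and pos + 1 < len(tokens) and tokens[pos + 1] == 'in':
        return 'not in', pos + 2
    return tok, pos + 1             # '<', '>', '==', 'in', 'is', ...

print(parse_comp_op(['is', 'not', 'x'], 0))   # ('is not', 2)
print(parse_comp_op(['is', 'x'], 0))          # ('is', 1)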

Related

How to balance rules and terminals in python lark parser?

I'm using lark, an excellent python parsing library.
It provides Earley and LALR(1) parsers, and grammars are defined in a custom EBNF format (EBNF stands for Extended Backus–Naur form).
Lowercase definitions are rules, uppercase definitions are terminals. Lark also lets you attach a priority to uppercase definitions (the .2 in NAME.2 below) to prioritize the matching.
I'm trying to define a grammar but I'm stuck with a behavior I can't seem to balance.
I have some rules with unnamed literals (the strings or characters between double-quotes):
directives: directive+
directive: "#" NAME arguments ?
directive_definition: description? "directive" "#" NAME arguments? "on" directive_locations
directive_locations: "SCALAR" | "OBJECT" | "ENUM"
arguments: "(" argument+ ")"
argument: NAME ":" value
union_type_definition: description? "union" NAME directives? union_member_types?
union_member_types: "=" NAME ("|" NAME)*
description: STRING | LONG_STRING
STRING: /("(?!"").*?(?<!\\)(\\\\)*?"|'(?!'').*?(?<!\\)(\\\\)*?')/i
LONG_STRING: /(""".*?(?<!\\)(\\\\)*?"""|'''.*?(?<!\\)(\\\\)*?''')/is
NAME.2: /[_A-Za-z][_0-9A-Za-z]*/
It works well for 99% of use cases. But if, in my parsed language, I use a directive which is itself called directive, everything breaks:
union Foo #something(test: 42) = Bar | Baz # This works
union Foo #directive(test: 42) = Bar | Baz # This fails
Here, the string directive is matched as the unnamed "directive" literal from the directive_definition rule when it should be matched as the NAME.2 terminal.
How can I balance / adjust this so that there is no possible ambiguity for the LALR(1) parser?
Author of Lark here.
This misinterpretation happens because "directive" can be two different tokens: The "directive" string, or NAME. By default, Lark's LALR lexer always chooses the more specific one, namely the string.
So how can we let the lexer know that #directive is a name, and not just two constant strings?
Solution 1 - Use the Contextual Lexer
What would probably help in this situation (it's hard to be sure without the full grammar) is to use the contextual lexer instead of the standard LALR(1) lexer.
The contextual lexer can communicate to some degree with the parser, to figure out which terminal makes more sense at each point. This is an algorithm that is unique to Lark, and you can use it like this:
parser = Lark(grammar, parser="lalr", lexer="contextual")
(This lexer can do anything the standard lexer can do and more, so in future versions it might become the default lexer.)
Solution 2 - Prefix the terminal
If the contextual lexer doesn't solve your collision, a more "classic" solution to this situation would be to define a directive token, something like:
DIRECTIVE: "#" NAME
Unlike your directive rule, this leaves no ambiguity to the lexer. There is a clear distinction between a directive, and the "directive" string (or NAME terminal).
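For example, a stripped-down, hedged sketch of this approach (my own toy grammar, not the one from the question) might look like the following; with DIRECTIVE defined as a terminal, #directive reaches the parser as a single token and can no longer collide with a "directive" keyword elsewhere in the grammar:
from lark import Lark

grammar = r"""
    start: directive+
    directive: DIRECTIVE "(" NAME ":" NUMBER ")"

    DIRECTIVE: "#" NAME
    NAME: /[_A-Za-z][_0-9A-Za-z]*/

    %import common.NUMBER
    %import common.WS
    %ignore WS
"""

parser = Lark(grammar, parser="lalr")
print(parser.parse("#directive(test: 42)").pretty())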
And if all else fails, you can always use the Earley parser, which, at the price of performance, will work with any grammar you give it, regardless of how many collisions there might be.
Hope this helps!
Edit: I'd just like to point out that the contextual lexer is the default for LALR now, so it's enough to call:
parser = Lark(grammar, parser="lalr")

Simple parser, but not a calculator

I am trying to write a very simple parser. I read similar questions here on SO and on the Internet, but all I could find was limited to "arithmetic like" things.
I have a very simple DSL, for example:
ELEMENT TYPE<TYPE> elemName {
    TYPE<TYPE> memberName;
}
Where the <TYPE> part is optional and valid only for some types.
Following what I read, I tried to write a recursive descent parser in Python, but there are a few things that I can't seem to understand:
How do I look for tokens that are longer than 1 char?
How do I break up the text in the different parts? For example, after a TYPE I can have a whitespace or a < or a whitespace followed by a <. How do I address that?
Short answer
All your questions boil down to the fact that you are not tokenizing your string before parsing it.
Long answer
The process of parsing is actually split into two distinct parts: lexing and parsing.
Lexing
What seems to be missing in the way you think about parsing is called tokenizing or lexing. It is the process of converting a string into a stream of tokens, i.e. words. That is what you are looking for when asking "How do I break up the text in the different parts?"
You can do it by yourself by checking your string against a list of regexps using re, or you can use some well-known library such as PLY. Although if you are using Python 3, I will be biased toward a lexing-parsing library that I wrote, ComPyl.
So proceeding with ComPyl, the syntax you are looking for seems to be the following.
from compyl.lexer import Lexer

rules = [
    (r'\s+', None),
    (r'\w+', 'ID'),
    (r'< *\w+ *>', 'TYPE'),   # Will match your <TYPE> token with inner whitespaces
    (r'{', 'L_BRACKET'),
    (r'}', 'R_BRACKET'),
]

lexer = Lexer(rules=rules, line_rule='\n')
# See the ComPyl doc to figure out how to proceed from here
Notice that the first rule, (r'\s+', None), is actually what solves your whitespace issue. It basically tells the lexer to match any whitespace characters and ignore them. Of course, if you do not want to use a lexing tool, you can simply add a similar rule in your own re implementation.
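If you prefer the plain-re route, a minimal hand-rolled tokenizer might look like the sketch below. The token names mirror the rules above and are only my assumptions about your DSL:
import re

TOKEN_SPEC = [
    (r'\s+',         None),         # skip whitespace
    (r'<\s*\w+\s*>', 'TYPE'),       # matches <TYPE>, allowing inner spaces
    (r'\w+',         'ID'),
    (r'\{',          'L_BRACKET'),
    (r'\}',          'R_BRACKET'),
    (r';',           'SEMICOLON'),
]

def tokenize(text):
    pos = 0
    while pos < len(text):
        for pattern, name in TOKEN_SPEC:
            m = re.match(pattern, text[pos:])
            if m:
                if name is not None:            # None means "ignore this token"
                    yield (name, m.group())
                pos += m.end()
                break
        else:
            raise SyntaxError(f"Unexpected character {text[pos]!r} at position {pos}")

print(list(tokenize('ELEMENT TYPE<TYPE> elemName { TYPE<TYPE> memberName; }')))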
Parsing
You seem to want to write your own LL(1) parser, so I will be brief on that part. Just know that there exist a lot of tools that can do that for you (the PLY and ComPyl libraries offer LR(1) parsers, which are more powerful but harder to hand-write; see the difference between LL(1) and LR(1) here).
Simply notice that now that you know how to tokenize your string, the issue of "How do I look for tokens that are longer than 1 char?" has been solved. You are now parsing, not a stream of characters, but a stream of tokens that encapsulate the matched words.
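To make that concrete, here is a tiny, hedged sketch of a recursive-descent routine working over (kind, text) token pairs such as the ones produced by the tokenizer sketch above; the rule it implements is only my guess at your member syntax:
def parse_member(tokens):
    """Parses  TYPE [<TYPE>] memberName ;  from a list of (kind, text) pairs."""
    kinds = [kind for kind, _ in tokens]
    texts = [text for _, text in tokens]
    pos = 0

    assert kinds[pos] == 'ID'                      # the member's type
    type_name = texts[pos]
    pos += 1

    params = None
    if pos < len(kinds) and kinds[pos] == 'TYPE':  # the optional <TYPE> part
        params = texts[pos]
        pos += 1

    assert kinds[pos] == 'ID'                      # the member's name
    member_name = texts[pos]
    pos += 1

    assert kinds[pos] == 'SEMICOLON'
    return type_name, params, member_name

print(parse_member([('ID', 'TYPE'), ('TYPE', '<TYPE>'),
                    ('ID', 'memberName'), ('SEMICOLON', ';')]))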
Olivier's answer regarding lexing/tokenizing and then parsing is helpful.
However, for relatively simple cases, some parsing tools are able to handle your kind of requirements without needing a separate tokenizing step. parsy is one of those. You build up parsers from smaller building blocks - there is good documentation to help.
An example of a parser done with parsy for your kind of grammar is here: http://parsy.readthedocs.io/en/latest/howto/other_examples.html#proto-file-parser .
It is significantly more complex than yours, but shows what is possible. Where whitespace is allowed (but not required), it uses the lexeme utility (defined at the top) to consume optional whitespace.
You may need to tighten up your understanding of where whitespace is necessary and where it is optional, and what kind of whitespace you really mean.
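For a rough idea of the style, here is a minimal, hedged parsy sketch for the declaration in the question (not the linked proto-file example); the whitespace handling and rule names are my assumptions:
from parsy import regex, string, seq

whitespace = regex(r'\s*')
lexeme = lambda p: p << whitespace                 # a parser, then optional whitespace

name = lexeme(regex(r'\w+'))
type_params = lexeme(string('<') >> regex(r'\w+') << string('>')).optional()
member = seq(name, type_params, name << lexeme(string(';')))

element = seq(
    lexeme(string('ELEMENT')) >> name,             # the element's type
    type_params,                                   # the optional <TYPE> part
    name,                                          # the element's name
    lexeme(string('{')) >> member.many() << lexeme(string('}')),
)

print(element.parse('ELEMENT TYPE<TYPE> elemName { TYPE<TYPE> memberName; }'))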

Issue creating LL(1) grammar

I'm learning how parsers work by creating a simple recursive descent parser. However, I'm having trouble defining my grammar so that it is LL(1). I want to be able to parse the following two statements:
a = 1
a + 1
To do this I've created the following grammar rules:
statement: assignment | expression
assignment: NAME EQUALS expression
expression: term [(PLUS|MINUS) term]
term: NAME | NUMBER
However, this leads to ambiguity when using an LL(1) parser: when a NAME token is encountered in the statement rule, the parser doesn't know whether it is an assignment or an expression without further look-ahead.
Python's grammar is LL(1), so I know this is possible to do, but I can't figure out how to do it. I've looked at Python's grammar rules found here (https://docs.python.org/3/reference/grammar.html) but I'm still not sure how they implement this.
Any help would be greatly appreciated :)
Just treat = as an operator with very low precedence. However (unless you want a language like C, where = really is an operator with very low precedence), you need to exclude it from internal (e.g. parenthesized) expressions.
If you had only multiplication and addition, you could use:
expression: factor ['+' factor]
factor: term ['*' term]
term: ID | NUMBER | '(' expression ')'
That is the standard pattern for operator precedence: '*' has higher precedence than '+' because the arguments to '+' can include '*' expressions, but not vice versa. So we could just add assignment:
statement: expression ['=' expression]
Unfortunately, that would allow, for example:
(a + 1) = b
which is undesirable. So it needs to be eliminated, but it is possible to eliminate it when the production is accepted (by a check of the form of the first expression), rather than in the grammar itself. As I understand it, that's what the Python parser does; see the long comment about test and keywords.
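A minimal, hedged sketch of that approach (my own toy parser, not Python's): parse statement: expression ['=' expression] with one token of lookahead, then validate the left-hand side after the fact:
import re

def tokenize(text):
    return re.findall(r'\w+|[-+*/=()]', text) + ['EOF']

class Parser:
    def __init__(self, text):
        self.toks = tokenize(text)
        self.i = 0

    def peek(self):
        return self.toks[self.i]

    def next(self):
        tok = self.toks[self.i]
        self.i += 1
        return tok

    def statement(self):                  # statement: expression ['=' expression]
        left = self.expression()
        if self.peek() == '=':
            self.next()
            # The check described above, done after parsing rather than in the grammar:
            if not (isinstance(left, str) and left.isidentifier()):
                raise SyntaxError("cannot assign to an expression")
            return ('assign', left, self.expression())
        return left

    def expression(self):                 # expression: term (('+'|'-') term)*
        node = self.term()
        while self.peek() in ('+', '-'):
            node = (self.next(), node, self.term())
        return node

    def term(self):                       # term: NAME | NUMBER | '(' expression ')'
        if self.peek() == '(':
            self.next()
            node = self.expression()
            assert self.next() == ')'
            return node
        return self.next()

print(Parser('a = 1').statement())        # ('assign', 'a', '1')
print(Parser('a + 1').statement())        # ('+', 'a', '1')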
If you used an LR(1) parser instead, you wouldn't have this problem.

Railroad diagram for Python grammar

I am looking for a way to get a better grasp of the Python grammar.
My experience is that a railroad diagram for the grammar may be helpful.
Python documentation contains the grammar in a text form:
https://docs.python.org/3/reference/grammar.html
But that is not very easy to digest for someone who is just starting with software engineering.
Does anybody have good beginner material?
There is a Railroad Diagram Generator that I might be able to use, but I was not able to find an EBNF version of the Python grammar that would be accepted by that generator.
A link to such a grammar would be very helpful as well.
To convert the Python grammar found at, e.g., https://docs.python.org/3/reference/grammar.html, to EBNF, you basically need to do three things:
Replace all #... comments with /*...*/ (or just delete them)
Use ::= instead of : for defining production rules
Use (...)? to indicate optional elements instead of [...].
For example, instead of
# The function statement
funcdef: 'def' NAME parameters ['->' test] ':' suite
you would use
/* The function statement */
funcdef ::= 'def' NAME parameters ('->' test)? ':' suite
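If you want to automate this, a rough, hedged sketch using nothing but re is shown below. It handles the common cases (one pass over non-nested [...] groups); nested optional parts and quoted '[' or ':' tokens in the grammar file would need extra care:
import re

def pgen_to_ebnf(text):
    text = re.sub(r'#(.*)', r'/*\1 */', text)                        # 1. comments
    text = re.sub(r'^(\s*\w+)\s*:', r'\1 ::=', text, flags=re.M)     # 2. rule definitions
    text = re.sub(r'\[([^\[\]]*)\]', r'(\1)?', text)                 # 3. optional elements
    return text

print(pgen_to_ebnf("funcdef: 'def' NAME parameters ['->' test] ':' suite"))
# funcdef ::= 'def' NAME parameters ('->' test)? ':' suite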

is there need for a more declarative way of expressing regular expressions ? :)

I am trying to create a Python function that can take a plain English description of a regular expression and return the regular expression to the caller.
Currently I am thinking of the description in YAML format.
So, we can store the description as a raw string variable, which is passed to another function, and the output of that function is then passed to the re module. Following is a rather simplistic example:
# a(b|c)d+e*
re1 = """
- literal: 'a'
- one_of: 'b,c'
- one_or_more_of: 'd'
- zero_or_more_of: 'e'
"""
myre = re.compile(getRegex(re1))
myre.search(...)
etc.
Does anyone think something of this sort would be of wider use?
Do you know already existing packages that can do it?
What are the limitations that you see to this approach?
Does anyone think that having the declarative string in code would make it more maintainable?
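For concreteness, here is a rough, hedged sketch of what getRegex could look like for exactly this example, assuming the YAML shape shown above and only these four verbs (it uses PyYAML):
import re
import yaml    # PyYAML

BUILDERS = {
    'literal':         lambda v: re.escape(v),
    'one_of':          lambda v: '(' + '|'.join(map(re.escape, v.split(','))) + ')',
    'one_or_more_of':  lambda v: re.escape(v) + '+',
    'zero_or_more_of': lambda v: re.escape(v) + '*',
}

def getRegex(description):
    parts = yaml.safe_load(description)        # a list of one-key dicts
    return ''.join(BUILDERS[verb](value)
                   for part in parts
                   for verb, value in part.items())

re1 = """
- literal: 'a'
- one_of: 'b,c'
- one_or_more_of: 'd'
- zero_or_more_of: 'e'
"""

print(getRegex(re1))                   # a(b|c)d+e*
myre = re.compile(getRegex(re1))
print(bool(myre.search('xxabddd')))    # True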
This is actually pretty similar (identical?) to how a lexer/parser works. If you had a defined grammar then you could probably write a parser with not too much trouble. For instance, you could write something like this:
<expression> :: == <rule> | <rule> <expression> | <rule> " followed by " <expression>
<rule> :: == <val> | <qty> <val>
<qty> :: == "literal" | "one" | "one of" | "one or more of" | "zero or more of"
<val> :: == "a" | "b" | "c" | "d" | ... | "Z" |
That's nowhere near a perfect description. For more info, take a look at this BNF of the regex language. You could then look at lexing and parsing the expression.
If you did it this way you could probably get a little closer to Natural Language/English versions of regexes.
I can see a tool like this being useful, but as was previously said, mainly for beginners.
The main limitation to this approach would be in the amount of code you have to write to translate the language into regex (and/or vice versa). On the other hand, I think a two-way translation tool would actually be more ideal and see more use. Being able to take a regex and turn it into English might be a lot more helpful to spot errors.
Of course, it doesn't take too long to pick up regex, as the syntax is usually terse and most of the meanings are pretty self-explanatory, at least if you use | or || as OR in your language and you think of * as matching 0-N times and + as matching 1-N times.
Though sometimes I wouldn't mind typing "find one or more 'a' followed by three digits or 'b' then 'c'"
Please take a look at pyparsing. Many of the issues that you describe with RE's are the same ones that inspired me to write that package.
Here are some specific features of pyparsing from the O'Reilly e-book chapter "What's so special about pyparsing?".
For developers trying to write regular expressions that are easy to grok and maintain, I wonder whether this sort of approach would offer anything that re.VERBOSE does not provide already.
For beginners, your idea might have some appeal. However, before you go down this path, you might try to mock up what your declarative syntax would look like for more complicated regular expressions using capturing groups, anchors, look-ahead assertions, and so forth. One challenge is that you might end up with a declarative syntax that is just as difficult to remember as the regex language itself.
You might also think about alternative ways to express things. For example, the first thought that occurred to me was to express a regex using functions with short, easy-to-remember names:
from refunc import *
pattern = Compile(
    'a',
    Capture(
        Choices('b', 'c'),
        N_of('d', 1, Infin()),
        N_of('e', 0, Infin()),
    ),
    Look_ahead('foo'),
)
But when I see that in action, it looks like a pain to me. There are many aspects of regex that are quite intuitive -- for example, + to mean "one or more". One option would be a hybrid approach, allowing your user to mix those parts of regex that are already simple with functions for the more esoteric bits.
pattern = Compile(
    'a',
    Capture(
        '[bc]',
        'd+',
        'e*',
    ),
    Look_ahead('foo'),
)
I would add that in my experience, regular expressions are about learning a thought process. Getting comfortable with the syntax is the easy part.
Maybe not exactly what you are asking for, but there is a way to write regexes in a more readable way (the VERBOSE flag, or X for short):
import re

rex_name = re.compile("""
    [A-Za-z]   # first letter
    [a-z]+     # the rest
    """, re.X)
rex_name.match('Joe')
