Issue creating LL(1) grammar - python

I'm learning how parsers work by creating a simple recursive descent parser. However I'm having a problem defining my grammar to be LL(1). I want to be able to parse the following two statements:
a = 1
a + 1
To do this I've created the following grammar rules:
statement: assignment | expression
assignment: NAME EQUALS expression
expression: term [(PLUS|MINUS) term]
term: NAME | NUMBER
However, this leads to a problem for an LL(1) parser: when a NAME token is encountered in the statement rule, the parser cannot tell whether it begins an assignment or an expression without additional look-ahead.
Python's grammar is LL(1) so I know this is possible to do but I can't figure out how to do it. I've looked at Python's grammar rules found here (https://docs.python.org/3/reference/grammar.html) but I'm still not sure how they implement this.
Any help would be greatly appreciated :)

Just treat = as an operator with very low precedence. However (unless you want a language like C where = really is an operator with very low precedence), you need to exclude it from internal (e.g. parenthetic) expressions.
If you had only multiplication and addition, you could use:
expression: factor ['+' factor]
factor: term ['*' term]
term: ID | NUMBER | '(' expression ')'
That is the standard pattern for operator precedence: '*' has higher precedence than '+' because the arguments to '+' can include '*' expressions (factors) but not vice versa. So we could just add assignment:
statement: expression ['=' expression]
Unfortunately, that would allow, for example:
(a + 1) = b
which is undesirable. So it needs to be eliminated, but it is possible to eliminate it when the production is accepted (by a check of the form of the first expression), rather than in the grammar itself. As I understand it, that's what the Python parser does; see the long comment about test and keywords.
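In a recursive descent parser, the accept-then-check approach is straightforward. Below is a minimal sketch in Python, assuming a token stream of (kind, value) pairs; all names are illustrative, and this is not how Python itself implements it. The parser accepts statement: expression ['=' expression] and then rejects any assignment whose left side is not a bare NAME:

def parse_statement(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else ('EOF', None)

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def term():
        kind, _ = peek()
        if kind in ('NAME', 'NUMBER'):
            return advance()
        raise SyntaxError(f'expected NAME or NUMBER, got {kind}')

    def expression():
        node = term()
        while peek() in (('OP', '+'), ('OP', '-')):
            op = advance()
            node = (op[1], node, term())
        return node

    # statement: expression ['=' expression]
    left = expression()
    if peek() == ('OP', '='):
        advance()
        if left[0] != 'NAME':  # the post-parse check on the left-hand side
            raise SyntaxError('cannot assign to an expression')
        return ('assign', left, expression())
    return ('expr', left)

print(parse_statement([('NAME', 'a'), ('OP', '='), ('NUMBER', '1')]))
# -> ('assign', ('NAME', 'a'), ('NUMBER', '1'))
print(parse_statement([('NAME', 'a'), ('OP', '+'), ('NUMBER', '1')]))
# -> ('expr', ('+', ('NAME', 'a'), ('NUMBER', '1')))

With this structure, a compound left-hand side such as a + 1 parses as an operator node rather than a NAME token, so the check rejects it.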
If you used an LR(1) parser instead, you wouldn't have this problem.

Is Python's grammar LL(1)?

There is a possible duplicate for this question, but it isn't specific enough for me.
Python's grammar is claimed to be LL(1), but some productions in it really confuse me. For example, the arguments in the following function calls:
foo(a)
foo(a=a)
correspond to the following grammar rule:
argument: ( test [comp_for] |
            test '=' test |
            '**' test |
            '*' test )
test appears twice in the first position of the alternatives. That means that by looking only at a test, Python cannot determine whether it is parsing test [comp_for] or test '=' test.
More examples:
comp_op: '<'|'>'|'=='|'>='|'<='|'<>'|'!='|'in'|'not' 'in'|'is'|'is' 'not'
Note the alternatives 'is' and 'is' 'not'.
subscript: test | [test] ':' [test] [sliceop]
test also appears twice.
Is my understanding of LL(1) wrong? Does Python do some workaround for the grammar during lexing or parsing to make it LL(1) processable? Thank you all in advance.
The grammar presented in the Python documentation (and used to generate the Python parser) is written in a form of Extended BNF which includes "operators" such as optionality ([a]) and Kleene closure ((a b c)*). LL(1), however, is a category which applies only to simple context-free grammars, which do not have such operators. So asking whether that particular grammar is LL(1) or not is a category error.
In order to make the question meaningful, the grammar would have to be transformed into a simple context-free grammar. This is, of course, possible but there is no canonical transformation and the Python documentation does not explain the precise transformation used. Some transformations may produce LL(1) grammars and other ones might not. (Indeed, naive translation of the Kleene star can easily lead to ambiguity, which is by definition not LL(k) for any k.)
In practice, the Python parsing apparatus transforms the grammar into an executable parser, not into a context-free grammar. For Python's pragmatic purposes, it is sufficient to be able to build a predictive parser with a lookahead of just one token. Because a predictive parser can use control structures like conditional statements and loops, a complete transformation into a context-free grammar is unnecessary. Thus, it is possible to use EBNF productions -- as with the documented grammar -- which are not fully left-factored, and even EBNF productions whose transformation to LL(1) is non-trivial:
simple_stmt: small_stmt (';' small_stmt)* [';'] NEWLINE
In the above production, the repetition of (';' small_stmt)* may be followed by a ';', which means that a simple while loop will not correctly represent the production. I don't know how this production is handled by the Python parser generator, but it is possible to transform it into CFG by left-factoring after expanding the repetition:
simple_stmt: small_stmt rest_A
rest_A: ';' rest_B
      | NEWLINE
rest_B: small_stmt rest_A
      | NEWLINE
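In a hand-written predictive parser, the production can also be handled directly with one token of lookahead, without explicit left-factoring. A sketch, assuming hypothetical peek, expect and parse_small_stmt helpers:

# simple_stmt: small_stmt (';' small_stmt)* [';'] NEWLINE
def parse_simple_stmt(self):
    stmts = [self.parse_small_stmt()]
    while self.peek() == ';':
        self.expect(';')
        if self.peek() == 'NEWLINE':  # the ';' just consumed was the optional trailing one
            break
        stmts.append(self.parse_small_stmt())
    self.expect('NEWLINE')
    return stmts

The conditional inside the loop is exactly the left-factoring decision, expressed as control flow instead of grammar rules.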
Similarly, the entire EBNF can be transformed into an LL(1) grammar. That is not done because the exercise is neither useful for parsing nor for explaining the syntax: the result would be hard to read, and the EBNF can be directly transformed into a parser.
This is slightly independent of the question of whether Python is LL(1), because a language is LL(1) precisely if an LL(1) grammar exists for the language. There will always be an infinitude of possible grammars for a language, including grammars which are not LL(k) for any k and even grammars which are not context-free, but that is irrelevant to the question of whether the language is LL(1): the language is LL(1) if even one LL(1) grammar exists. (I'm aware that this is not the original question, so I won't pursue this any further.)
You're correct that constructs like 'is' | 'is' 'not' aren't LL(1). They can be left-factored quite easily by changing them to 'is' notOpt where notOpt: 'not' | ϵ or, if you allow EBNF syntax, just 'is' 'not'? (or 'is' ['not'], depending on the flavor of EBNF).
So the language is LL(1), but the grammar technically is not. I assume the Python designers decided that this was okay because the left-factored version would be more difficult to read without much benefit and the current version can still be used as the basis for an LL(1) parser without much difficulty.

When/How to solve this PLY Shift-Reduce conflict in parser generator for COOL language?

I am making a compiler for Cool: the Classroom Object-Oriented Language. I followed the manual (Cool-manual.pdf) to create a grammar, and using PLY 3.10 on Python 3.5.2 I managed to create a lexer and a parser, but it reports a shift-reduce conflict on this production:
expr : LET ID COLON TYPE assign_optional let_assignments IN expr
expr is a non-terminal symbol that produces several things, among them the LET production above. Because that production ends with "... IN expr", the LALR(1) parser generated by yacc does not know whether it should SHIFT what comes after that last expr or REDUCE it.
For instance:
That last expr could be something like:
a + 2
and as COOL allows I could also do this:
expr : expr PLUS expr  # that is, expr produces expr PLUS expr
and therefore, after that last expr in the LET I could have a + (or anything allowed). For instance, in this line
LET a:Int <- 3 IN a + 2 + c
could mean that a + 2 + c is the expression inside the LET, or that c is added to the result of the LET expression.
My main issue is:
Should I redo the grammar to avoid this conflict? If so, how would I do it? I don't think I am doing anything wrong in following the COOL specification.
Otherwise, should I handle this in the semantic analysis phase? Any thoughts on how to do that?
Thanks in advance.
I hope I explained myself. If not, please let me know and I will try to explain better.
I don't know why let is not in the list of expression precedences in section 11 of the manual, but section 7 says
The <expr> of a let extends as far (encompasses as many tokens) as the grammar allows.
which is tantamount to saying that it has the lowest possible precedence (the same as assignment).
I presume you are using precedence declarations for your expr rules, so just add let to the list.
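In PLY that might look like the sketch below; the token names are assumptions based on a typical COOL grammar. It relies on yacc's convention that a production takes the precedence of its last terminal, which for the LET rule is IN, so placing IN at the bottom of the table tells the parser to keep shifting and lets the let body extend as far right as possible:

# hedged sketch: exact token names depend on your lexer
precedence = (
    ('right', 'IN'),      # lowest: LET ... IN expr swallows everything that follows
    ('right', 'ASSIGN'),  # <-
    ('left', 'PLUS', 'MINUS'),
    ('left', 'TIMES', 'DIVIDE'),
    # ... the rest of the operators from section 11 of the manual
)

With that in place, LET a:Int <- 3 IN a + 2 + c parses the whole a + 2 + c as the let body.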

Railroad diagram for Python grammar

I am looking for a way to get a better grasp of the Python grammar.
My experience is that a railroad diagram for the grammar may be helpful.
Python documentation contains the grammar in a text form:
https://docs.python.org/3/reference/grammar.html
But that is not very easy to digest for someone who is just starting with software engineering.
Does anybody have good beginner material?
There is a Railroad Diagram Generator that I might be able to use, but I was not able to find an EBNF version of the Python grammar that would be accepted by that generator.
A link to such a grammar would be very helpful as well.
To convert the Python grammar found at, e.g., https://docs.python.org/3/reference/grammar.html, to EBNF, you basically need to do three things:
Replace all #... comments with /*...*/ (or just delete them)
Use ::= instead of : for defining production rules
Use (...)? to indicate optional elements instead of [...].
For example, instead of
# The function statement
funcdef: 'def' NAME parameters ['->' test] ':' suite
you would use
/* The function statement */
funcdef ::= 'def' NAME parameters ('->' test)? ':' suite
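The three steps are mechanical enough to script. A rough Python sketch (it ignores corner cases such as brackets inside quoted tokens, and handles nested optional groups by repeating the substitution until nothing changes):

import re

def grammar_to_ebnf(text):
    # 1. Replace #... comments with /*...*/
    text = re.sub(r'#(.*)', r'/*\1 */', text)
    # 2. Use '::=' instead of ':' after a rule name at the start of a line
    text = re.sub(r'^(\w+):', r'\1 ::=', text, flags=re.MULTILINE)
    # 3. Turn optional [...] groups into (...)?; loop to reach nested groups
    while True:
        new = re.sub(r'\[([^\[\]]*)\]', r'(\1)?', text)
        if new == text:
            return text
        text = new

print(grammar_to_ebnf("funcdef: 'def' NAME parameters ['->' test] ':' suite"))
# funcdef ::= 'def' NAME parameters ('->' test)? ':' suite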

Python - calculator program and strings

I am new to Python and am trying to write a calculator program. I have been trying to do the following but with no success, so please point me in the right direction:
I would like to input an equation as a user, for example:
f(t) = 2x^5 + 8
the program should recognize the different parts of a string and in this case make a variable f(t) and assign 2x^5 + 8 to it.
Though, if I input an equation followed by an equals sign, for example
2x^5 + 8 =
the program will instead just output the answer.
I am not asking how to code for the math-logic of solving the equation, just how to get the program to recognize the different parts of a string and make decisions accordingly.
I'm sorry I don't have any code to show as an attempt; I'm not sure how to go about this and am looking for a bit of help to get started.
Thank you.
For a little bit of context: The problem you're describing is more generally known as parsing, and it can get rather complicated, depending on the grammar. The grammar is the description of the language; the language, in your case, is the set of all valid formulas for your calculator.
The first recommended step, even before you start coding, is to formalize your grammar. This is mainly for your own benefit, as it will make the programming easier. A well-established way to do this is to describe the grammar using EBNF, and there exist tools like PLY for Python that you can use to generate parsers for such languages.
Let's try a simplified version of your calculator grammar:
digit := "0" | "1" # our numbers are in binary
number := digit | number digit # these numbers are all nonnegative
variable := "x" | "y" # we recognize two variable names
operator := "+" | "-" # we could have more operators
expression := number | variable | "(" expression operator expression ")"
definition := variable "=" expression
evaluation := expression "="
Note that there are multiple problems with this grammar. For example:
What about whitespace?
What about negative numbers?
What do you do about inputs like x = x (this is a valid definition)?
The first two are probably problems with the grammar itself, while the last one might need to be handled at a later stage (is the language perhaps context sensitive?).
But anyway, given such a grammar, a tool like PLY can generate a parser for you, leaving it up to you to handle any additional logic (like x = x). First, however, I'd suggest you try to implement it on your own. One idea is to write a so-called top-down parser using recursion, as in the sketch below.
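To make that concrete, here is a minimal recursive-descent parser for the toy grammar above; all names are illustrative. It builds a small parse tree and distinguishes definitions from evaluations:

def parse(s):
    tokens = [c for c in s if c != ' ']  # one character per token
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat(expected=None):
        nonlocal pos
        tok = peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f'expected {expected!r}, got {tok!r}')
        pos += 1
        return tok

    def expression():
        tok = peek()
        if tok in ('0', '1'):  # number := digit | number digit
            digits = ''
            while peek() in ('0', '1'):
                digits += eat()
            return ('num', digits)
        if tok in ('x', 'y'):  # variable
            return ('var', eat())
        if tok == '(':  # "(" expression operator expression ")"
            eat('(')
            left = expression()
            op = eat()
            if op not in ('+', '-'):
                raise SyntaxError(f'bad operator {op!r}')
            right = expression()
            eat(')')
            return (op, left, right)
        raise SyntaxError(f'unexpected token {tok!r}')

    # A definition needs a second token of lookahead to be told apart from
    # an evaluation of a bare variable such as "x =".
    if peek() in ('x', 'y') and tokens[pos + 1:pos + 2] == ['='] and pos + 2 < len(tokens):
        name = eat()
        eat('=')
        return ('define', name, expression())
    expr = expression()
    eat('=')
    return ('evaluate', expr)

print(parse('x = (10 + y)'))  # ('define', 'x', ('+', ('num', '10'), ('var', 'y')))
print(parse('(1 - x) ='))     # ('evaluate', ('-', ('num', '1'), ('var', 'x')))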

Is there a need for a more declarative way of expressing regular expressions? :)

I am trying to create a Python function that can take a plain-English description of a regular expression and return the regular expression to the caller.
Currently I am thinking of the description in YAML format.
So we can store the description as a raw string variable that is passed to another function, whose output is then passed to the re module. The following is a rather simplistic example:
# a(b|c)d+e*
re1 = """
- literal: 'a'
- one_of: 'b,c'
- one_or_more_of: 'd'
- zero_or_more_of: 'e'
"""
myre = re.compile(getRegex(re1))
myre.search(...)
etc.
Does anyone think something of this sort would be of wider use?
Do you know already existing packages that can do it?
What are the limitations that you see to this approach?
Does anyone think that having the declarative string in code would make it more maintainable?
This is actually pretty similar (identical?) to how a lexer/parser works. If you had a defined grammar then you could probably write a parser with not too much trouble. For instance, you could write something like this:
<expression> ::= <rule> | <rule> <expression> | <rule> " followed by " <expression>
<rule>       ::= <val> | <qty> <val>
<qty>        ::= "literal" | "one" | "one of" | "one or more of" | "zero or more of"
<val>        ::= "a" | "b" | "c" | "d" | ... | "Z"
That's nowhere near a perfect description. For more info, take a look at this BNF of the regex language. You could then look at lexing and parsing the expression.
If you did it this way you could probably get a little closer to Natural Language/English versions of regexes.
I can see a tool like this being useful, but as was previously said, mainly for beginners.
The main limitation to this approach would be in the amount of code you have to write to translate the language into regex (and/or vice versa). On the other hand, I think a two-way translation tool would actually be more ideal and see more use. Being able to take a regex and turn it into English might be a lot more helpful to spot errors.
Of course, it doesn't take too long to pick up regex, as the syntax is usually terse and most of the meanings are pretty self-explanatory, at least if you already use | or || as OR in your language, and you think of * as matching zero or more repetitions and + as matching one or more.
Though sometimes I wouldn't mind typing "find one or more 'a' followed by three digits or 'b' then 'c'"
Please take a look at pyparsing. Many of the issues that you describe with RE's are the same ones that inspired me to write that package.
Here are some specific features of pyparsing from the O'Reilly e-book chapter "What's so special about pyparsing?".
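For a taste of the library, the a(b|c)d+e* example from the question could be written with pyparsing combinators roughly like this (a sketch, not taken from the book chapter):

from pyparsing import Literal, OneOrMore, ZeroOrMore

# a(b|c)d+e* expressed as combinators; plain strings are promoted to literals
pattern = Literal('a') + (Literal('b') | Literal('c')) + OneOrMore('d') + ZeroOrMore('e')
print(pattern.parseString('abdde'))  # ['a', 'b', 'd', 'd', 'e']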
For developers trying to write regular expressions that are easy to grok and maintain, I wonder whether this sort of approach would offer anything that re.VERBOSE does not provide already.
For beginners, your idea might have some appeal. However, before you go down this path, you might try to mock up what your declarative syntax would look like for more complicated regular expressions using capturing groups, anchors, look-ahead assertions, and so forth. One challenge is that you might end up with a declarative syntax that is just as difficult to remember as the regex language itself.
You might also think about alternative ways to express things. For example, the first thought that occurred to me was to express a regex using functions with short, easy-to-remember names. For example:
from refunc import *

pattern = Compile(
    'a',
    Capture(
        Choices('b', 'c'),
        N_of('d', 1, Infin()),
        N_of('e', 0, Infin()),
    ),
    Look_ahead('foo'),
)
But when I see that in action, it looks like a pain to me. There are many aspects of regex that are quite intuitive -- for example, + to mean "one or more". One option would be a hybrid approach, allowing your user to mix those parts of regex that are already simple with functions for the more esoteric bits.
pattern = Compile(
    'a',
    Capture(
        '[bc]',
        'd+',
        'e*',
    ),
    Look_ahead('foo'),
)
I would add that in my experience, regular expressions are about learning a thought process. Getting comfortable with the syntax is the easy part.
Maybe not exactly what you are asking for, but there is a way to write regexes in a more readable way: VERBOSE mode (the X flag for short):
import re

rex_name = re.compile("""
    [A-Za-z]   # first letter
    [a-z]+     # the rest
    """, re.X)
rex_name.match('Joe')
