I am looking for a way to get a better grasp of the Python grammar.
In my experience, a railroad diagram of the grammar may be helpful.
Python documentation contains the grammar in a text form:
https://docs.python.org/3/reference/grammar.html
But that is not very easy to digest for someone who is just starting with software engineering.
Does anybody have good beginner material?
There is a Railroad Diagram Generator that I might be able to use, but I was not able to find an EBNF version of the Python grammar that would be accepted by that generator.
A link to such a grammar would be very helpful as well.
To convert the Python grammar found at, e.g., https://docs.python.org/3/reference/grammar.html, to EBNF, you basically need to do three things:
Replace all #... comments with /*...*/ (or just delete them)
Use ::= instead of : for defining production rules
Use (...)? to indicate optional elements instead of [...].
For example, instead of
# The function statement
funcdef: 'def' NAME parameters ['->' test] ':' suite
you would use
/* The function statement */
funcdef ::= 'def' NAME parameters ('->' test)? ':' suite
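If you want to script the conversion instead of doing it by hand, a quick-and-dirty pass over the grammar text can apply the three substitutions above. This is only a sketch: it assumes you have saved the grammar locally as grammar.txt (a made-up filename), and it will not catch every corner case of the notation.
import re

with open("grammar.txt") as f:
    grammar = f.read()

# 1. Drop '#...' comments (alternatively, wrap them in /*...*/)
grammar = re.sub(r"#[^\n]*", "", grammar)

# 2. Turn 'name: ...' into 'name ::= ...' at the start of each rule
grammar = re.sub(r"(?m)^(\s*\w+)\s*:", r"\1 ::=", grammar)

# 3. Turn optional '[x]' into '(x)?'; loop so that nested brackets are handled
while re.search(r"\[[^\[\]]*\]", grammar):
    grammar = re.sub(r"\[([^\[\]]*)\]", r"(\1)?", grammar)

print(grammar)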
Related
There is a possible duplicate for this question, but it is not specific enough for me.
The Python grammar is claimed to be LL(1), but I've noticed some productions in the Python grammar that really confuse me, for example the arguments in the following function calls:
foo(a)
foo(a=a)
correspond to the following grammar rule:
argument: ( test [comp_for] |
test '=' test |
'**' test |
'*' test )
test appears in the first position of two alternatives. This means that by looking only at test, the parser cannot determine whether it's test [comp_for] or test '=' test.
More examples:
comp_op: '<'|'>'|'=='|'>='|'<='|'<>'|'!='|'in'|'not' 'in'|'is'|'is' 'not'
Note 'is' and 'is' 'not'
subscript: test | [test] ':' [test] [sliceop]
test also appears twice.
Is my understanding of LL(1) wrong? Does Python do some workaround for the grammar during lexing or parsing to make it LL(1) processable? Thank you all in advance.
The grammar presented in the Python documentation (and used to generate the Python parser) is written in a form of Extended BNF which includes "operators" such as optionality ([a]) and Kleene closure ((a b c)*). LL(1), however, is a category which applies only to simple context-free grammars, which do not have such operators. So asking whether that particular grammar is LL(1) or not is a category error.
In order to make the question meaningful, the grammar would have to be transformed into a simple context-free grammar. This is, of course, possible but there is no canonical transformation and the Python documentation does not explain the precise transformation used. Some transformations may produce LL(1) grammars and other ones might not. (Indeed, naive translation of the Kleene star can easily lead to ambiguity, which is by definition not LL(k) for any k.)
In practice, the Python parsing apparatus transforms the grammar into an executable parser, not into a context-free grammar. For Python's pragmatic purposes, it is sufficient to be able to build a predictive parser with a lookahead of just one token. Because a predictive parser can use control structures like conditional statements and loops, a complete transformation into a context-free grammar is unnecessary. Thus, it is possible to use EBNF productions -- as with the documented grammar -- which are not fully left-factored, and even EBNF productions whose transformation to LL(1) is non-trivial:
simple_stmt: small_stmt (';' small_stmt)* [';'] NEWLINE
In the above production, the repetition of (';' small_stmt)* may be followed by a ';', which means that a simple while loop will not correctly represent the production. I don't know how this production is handled by the Python parser generator, but it is possible to transform it into a CFG by left-factoring after expanding the repetition:
simple_stmt: small_stmt rest_A
rest_A : ';' rest_B
| NEWLINE
rest_B : small_stmt rest_A
| NEWLINE
Similarly, the entire EBNF can be transformed into an LL(1) grammar. That is not done because the exercise is neither useful for parsing nor for explaining the syntax. It would be hard to read, and the EBNF can be directly transformed into a parser.
This is slightly independent of the question of whether Python is LL(1), because a language is LL(1) precisely if an LL(1) grammar exists for the language. There will always be an infinitude of possible grammars for a language, including grammars which are not LL(k) for any k and even grammars which are not context-free, but that is irrelevant to the question of whether the language is LL(1): the language is LL(1) if even one LL(1) grammar exists. (I'm aware that this is not the original question, so I won't pursue this any further.)
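To make the point about predictive parsers concrete, here is a small hand-written sketch (this is not the actual CPython machinery; the coarse token list and helper are invented for illustration) that parses the simple_stmt production above with a loop, a conditional, and one token of lookahead:
def parse_simple_stmt(tokens, pos=0):
    # simple_stmt: small_stmt (';' small_stmt)* [';'] NEWLINE
    stmts = []
    pos = parse_small_stmt(tokens, pos, stmts)
    while tokens[pos] == ';':
        pos += 1                          # consume ';'
        if tokens[pos] == 'NEWLINE':      # it was the optional trailing ';'
            break
        pos = parse_small_stmt(tokens, pos, stmts)
    assert tokens[pos] == 'NEWLINE', "expected NEWLINE"
    return stmts, pos + 1

def parse_small_stmt(tokens, pos, stmts):
    # Stand-in: treat any single token as a complete small_stmt.
    stmts.append(tokens[pos])
    return pos + 1

# "stmt1; stmt2;" followed by a newline, tokenized very coarsely:
print(parse_simple_stmt(['stmt1', ';', 'stmt2', ';', 'NEWLINE']))
# -> (['stmt1', 'stmt2'], 5)
The single if inside the loop is exactly the one-token lookahead decision that distinguishes a separating ';' from the optional trailing ';'.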
You're correct that constructs like 'is' | 'is' 'not' aren't LL(1). They can be left-factored to LL(1) quite easily by changing it to 'is' notOpt where notOpt: 'not' | ϵ or, if you allow EBNF syntax, just 'is' 'not'? (or 'is' ['not'] depending on the flavor of EBNF).
So the language is LL(1), but the grammar technically is not. I assume the Python designers decided that this was okay because the left-factored version would be more difficult to read without much benefit and the current version can still be used as the basis for an LL(1) parser without much difficulty.
I'm using Lark, an excellent Python parsing library.
It provides Earley and LALR(1) parsers, and grammars are defined in a custom EBNF format (EBNF stands for Extended Backus–Naur form).
Lowercase definitions are rules, uppercase definitions are terminals. Lark also lets you attach a priority to uppercase definitions (e.g. NAME.2) to influence which terminal is matched.
I'm trying to define a grammar but I'm stuck with a behavior I can't seem to balance.
I have some rules with unnamed literals (the strings or characters between double-quotes):
directives: directive+
directive: "#" NAME arguments?
directive_definition: description? "directive" "#" NAME arguments? "on" directive_locations
directive_locations: "SCALAR" | "OBJECT" | "ENUM"
arguments: "(" argument+ ")"
argument: NAME ":" value
union_type_definition: description? "union" NAME directives? union_member_types?
union_member_types: "=" NAME ("|" NAME)*
description: STRING | LONG_STRING
STRING: /("(?!"").*?(?<!\\)(\\\\)*?"|'(?!'').*?(?<!\\)(\\\\)*?')/i
LONG_STRING: /(""".*?(?<!\\)(\\\\)*?"""|'''.*?(?<!\\)(\\\\)*?''')/is
NAME.2: /[_A-Za-z][_0-9A-Za-z]*/
It works well for 99% of use cases. But if, in my parsed language, I use a directive which is called directive, everything breaks:
union Foo #something(test: 42) = Bar | Baz # This works
union Foo #directive(test: 42) = Bar | Baz # This fails
Here, the directive string is matched against the unnamed "directive" literal in the directive_definition rule when it should be matched as the NAME terminal.
How can I adjust this so there is no possible ambiguity for the LALR(1) parser?
Author of Lark here.
This misinterpretation happens because "directive" can be two different tokens: The "directive" string, or NAME. By default, Lark's LALR lexer always chooses the more specific one, namely the string.
So how can we let the lexer know that #directive is a name, and not just two constant strings?
Solution 1 - Use the Contextual Lexer
What would probably help in this situation (it's hard to be sure without the full grammar) is to use the contextual lexer instead of the standard LALR(1) lexer.
The contextual lexer can communicate to some degree with the parser, to figure out which terminal makes more sense at each point. This is an algorithm that is unique to Lark, and you can use it like this:
parser = Lark(grammar, parser="lalr", lexer="contextual")
(This lexer can do anything the standard lexer can do and more, so in future versions it might become the default lexer.)
Solution 2 - Prefix the terminal
If the contextual lexer doesn't solve your collision, a more "classic" solution to this situation would be to define a directive token, something like:
DIRECTIVE: "#" NAME
Unlike your directive rule, this leaves no ambiguity to the lexer. There is a clear distinction between a directive and the "directive" string (or NAME terminal).
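For example (just a sketch, since the full grammar isn't shown here), the relevant rules from the question might become something like:
DIRECTIVE: "#" NAME

directive: DIRECTIVE arguments?
directive_definition: description? "directive" DIRECTIVE arguments? "on" directive_locations
Now #directive is lexed as a single DIRECTIVE token, so it can never be confused with the bare "directive" keyword.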
And if all else fails, you can always use the Earley parser, which, at the price of performance, will work with any grammar you give it, regardless of how many collisions there might be.
Hope this helps!
Edit: I'd just like to point out that the contextual lexer is the default for LALR now, so it's enough to call:
parser = Lark(grammar, parser="lalr")
I'm learning how parsers work by creating a simple recursive descent parser. However, I'm having a problem defining my grammar so that it is LL(1). I want to be able to parse the following two statements:
a = 1
a + 1
To do this I've created the following grammar rules:
statement: assignment | expression
assignment: NAME EQUALS expression
expression: term [(PLUS|MINUS) term]
term: NAME | NUMBER
However, this leads to a conflict for an LL(1) parser: when a NAME token is encountered in the statement rule, it cannot tell whether it starts an assignment or an expression without more than one token of look-ahead.
Python's grammar is LL(1) so I know this is possible to do but I can't figure out how to do it. I've looked at Python's grammar rules found here (https://docs.python.org/3/reference/grammar.html) but I'm still not sure how they implement this.
Any help would be greatly appreciated :)
Just treat = as an operator with very low precedence. However (unless you want a language like C where = really is an operator with very low precedence), you need to exclude it from internal (e.g. parenthetic) expressions.
If you had only multiplication and addition, you could use:
expression: factor ['+' factor]
factor: term ['*' term]
term: ID | NUMBER | '(' expression ')'
That ordering encodes operator precedence: '*' has higher precedence than '+' because the arguments to '+' can include '*' expressions (factors) but not vice versa. So we could just add assignment:
statement: expression ['=' expression]
Unfortunately, that would allow, for example:
(a + 1) = b
which is undesirable. So it needs to be eliminated, but it is possible to eliminate it when the production is accepted (by a check of the form of the first expression), rather than in the grammar itself. As I understand it, that's what the Python parser does; see the long comment about test and keywords.
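Here is a minimal recursive-descent sketch of that idea in Python (the token format and helper names are invented for illustration; this is not the asker's code or Python's actual parser): parse statement as expression ['=' expression], then check the form of the left-hand side once the '=' is seen.
def parse_statement(tokens, pos=0):
    # statement: expression ['=' expression]
    left, pos = parse_expression(tokens, pos)
    if pos < len(tokens) and tokens[pos] == ('OP', '='):
        # The check happens here, after parsing, not in the grammar:
        if left[0] != 'NAME':
            raise SyntaxError("cannot assign to this expression")
        right, pos = parse_expression(tokens, pos + 1)
        return ('assign', left, right), pos
    return ('expr', left), pos

def parse_expression(tokens, pos):
    # expression: term ((PLUS|MINUS) term)*
    left, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] in (('OP', '+'), ('OP', '-')):
        op = tokens[pos][1]
        right, pos = parse_term(tokens, pos + 1)
        left = ('binop', op, left, right)
    return left, pos

def parse_term(tokens, pos):
    # term: NAME | NUMBER
    kind, value = tokens[pos]
    if kind in ('NAME', 'NUMBER'):
        return (kind, value), pos + 1
    raise SyntaxError("expected NAME or NUMBER")

print(parse_statement([('NAME', 'a'), ('OP', '='), ('NUMBER', '1')]))   # a = 1
print(parse_statement([('NAME', 'a'), ('OP', '+'), ('NUMBER', '1')]))   # a + 1
Both statements start with a NAME, but the decision between assignment and expression is only made when (and if) the '=' token shows up, so one token of lookahead is always enough.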
If you used an LR(1) parser instead, you wouldn't have this problem.
I'm working with a jison file and converting it to a Python parser using the lex module from PLY.
I've noticed that in this jison file, certain tokens have multiple rules associated with them. For example, for the token CONTENT, the file specifies the following three rules:
[^\x00]*?/("{{") {
if(yytext.slice(-2) === "\\\\") {
strip(0,1);
this.begin("mu");
} else if(yytext.slice(-1) === "\\") {
strip(0,1);
this.begin("emu");
} else {
this.begin("mu");
}
if(yytext) return 'CONTENT';
}
[^\x00]+ return 'CONTENT';
// marks CONTENT up to the next mustache or escaped mustache
<emu>[^\x00]{2,}?/("{{"|"\\{{"|"\\\\{{"|<<EOF>>) {
this.popState();
return 'CONTENT';
}
In another case, there are multiple rules for the COMMENT token:
<com>[\s\S]*?"--}}" strip(0,4); this.popState(); return 'COMMENT';
<mu>"{{!--" this.popState(); this.begin('com');
<mu>"{{!"[\s\S]*?"}}" strip(3,5); this.popState(); return 'COMMENT';
It seems easy enough to distinguish the rules when they apply to different states, but what about when they apply to the same state?
How can I translate this jison to python rules using ply.lex?
Edit:
In case it helps, this jison file is part of the handlebars.js source code. See: https://github.com/wycats/handlebars.js/blob/master/src/handlebars.l
This question is difficult to answer; it is also two questions in one.
Jison (that's the language that the handlebars parser is written in, not bison) has some features not found in other lexers, and in particular not found in PLY. This makes it difficult to convert the lexical code you have shown from Jison to PLY. However, this is not the question you were focused on. It is possible to answer your base question, how multiple regular expressions can return a single token in PLY, but this would not give you a way to implement the code you chose as your example!
First, let's address the question you asked. Returning one token for multiple regular expressions in PLY can be accomplished with the @TOKEN decorator, as shown in the PLY manual (section 4.11).
For example, we can do the following:
from ply.lex import TOKEN

comment1 = r'[^\x00]*?/("{{")'
comment2 = r'[^\x00]+'
comment = r'(' + comment1 + r'|' + comment2 + r')'

@TOKEN(comment)
def t_COMMENT(t):
    ...
However, this won't really work as written for the rules you have from jison, because they rely on two lexer features. The first is start conditions (see the Jison Manual): this.begin introduces a state name, which is then used as a <state> prefix to restrict when patterns apply; that is where the <mu>, <emu> and <com> come from. PLY has a comparable mechanism, lexer states ("conditional lexing" in the PLY manual). The second is flex-style trailing context, the /("{{") part of the first pattern, and PLY has no direct equivalent for that, although a (?=...) lookahead can often approximate it.
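That said, PLY's lexer states can mimic the this.begin / popState mechanics to some degree. The following is only a rough sketch of the idea, not a faithful handlebars lexer: the token names and regexes are simplified inventions, and the trailing-context rules above are not reproduced.
import ply.lex as lex

tokens = ('CONTENT', 'COMMENT', 'OPEN', 'OPEN_COMMENT')

# Roughly equivalent to jison's <mu> and <com> start conditions
states = (
    ('mu', 'exclusive'),
    ('com', 'exclusive'),
)

def t_OPEN(t):
    r'\{\{'
    t.lexer.begin('mu')              # like this.begin("mu")
    return t

def t_CONTENT(t):
    r'[^\x00{]+'
    return t

def t_mu_OPEN_COMMENT(t):
    r'!--'
    t.lexer.begin('com')             # like this.begin('com'); no token returned

def t_com_COMMENT(t):
    r'[\s\S]*?--\}\}'
    t.value = t.value[:-4]           # like strip(0, 4)
    t.lexer.begin('INITIAL')         # back to the default state
    return t

t_mu_ignore = ''
t_com_ignore = ''

def t_ANY_error(t):
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input('hello {{!-- a comment --}} world')
print([(tok.type, tok.value) for tok in lexer])
# [('CONTENT', 'hello '), ('OPEN', '{{'), ('COMMENT', ' a comment '), ('CONTENT', ' world')]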
To match these lexemes faithfully, it is really necessary to go back to the syntax of the handlebars/moustache notation and create new regular expressions. Somehow I feel that completely re-implementing the whole of the handlebars lexer for you in an SO answer is perhaps a step too far.
However, I have identified the steps to a solution for you, and anyone else who treads this path.
I am trying to create a Python function that can take a plain English description of a regular expression and return the regular expression to the caller.
Currently I am thinking of writing the description in YAML format.
So we can store the description in a raw string variable, pass it to another function, and then pass that function's output to the 're' module. The following is a rather simplistic example:
# a(b|c)d+e*
re1 = """
- literal: 'a'
- one_of: 'b,c'
- one_or_more_of: 'd'
- zero_or_more_of: 'e'
"""
myre = re.compile(getRegex(re1))
myre.search(...)
etc.
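For concreteness, here is a rough sketch of what such a getRegex function could look like, assuming PyYAML for parsing and only the four directives used in the example; it is an illustration of the idea, not a finished design.
import re
import yaml  # PyYAML

def getRegex(description):
    # Map each directive in the YAML list to a regex fragment
    handlers = {
        'literal':         lambda v: re.escape(v),
        'one_of':          lambda v: '(' + '|'.join(re.escape(x) for x in v.split(',')) + ')',
        'one_or_more_of':  lambda v: re.escape(v) + '+',
        'zero_or_more_of': lambda v: re.escape(v) + '*',
    }
    parts = []
    for item in yaml.safe_load(description):
        (kind, value), = item.items()
        parts.append(handlers[kind](value))
    return ''.join(parts)

print(getRegex(re1))   # using re1 from above -> a(b|c)d+e*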
Does anyone think something of this sort would be of wider use?
Do you know already existing packages that can do it?
What are the limitations that you see to this approach?
Does anyone think that having the declarative string in the code would make it more maintainable?
This is actually pretty similar (identical?) to how a lexer/parser works. If you had a defined grammar then you could probably write a parser with not too much trouble. For instance, you could write something like this:
<expression> ::= <rule> | <rule> <expression> | <rule> " followed by " <expression>
<rule> ::= <val> | <qty> <val>
<qty> ::= "literal" | "one" | "one of" | "one or more of" | "zero or more of"
<val> ::= "a" | "b" | "c" | "d" | ... | "Z"
That's nowhere near a perfect description. For more info, take a look at this BNF of the regex language. You could then look at lexing and parsing the expression.
If you did it this way you could probably get a little closer to Natural Language/English versions of regexes.
I can see a tool like this being useful, but as was previously said, mainly for beginners.
The main limitation to this approach would be in the amount of code you have to write to translate the language into regex (and/or vice versa). On the other hand, I think a two-way translation tool would actually be more ideal and see more use. Being able to take a regex and turn it into English might be a lot more helpful to spot errors.
Of course it doesn't take too long to pick up regex, as the syntax is usually terse and most of the meanings are pretty self-explanatory, at least if you use | or || as OR in your language, and you think of * as repeating 0-N times and + as repeating 1-N times.
Though sometimes I wouldn't mind typing "find one or more 'a' followed by three digits or 'b' then 'c'"
Please take a look at pyparsing. Many of the issues that you describe with RE's are the same ones that inspired me to write that package.
Here are some specific features of pyparsing from the O'Reilly e-book chapter "What's so special about pyparsing?".
For developers trying to write regular expressions that are easy to grok and maintain, I wonder whether this sort of approach would offer anything that re.VERBOSE does not provide already.
For beginners, your idea might have some appeal. However, before you go down this path, you might try to mock up what your declarative syntax would look like for more complicated regular expressions using capturing groups, anchors, look-ahead assertions, and so forth. One challenge is that you might end up with a declarative syntax that is just as difficult to remember as the regex language itself.
You might also think about alternative ways to express things. For example, the first thought that occurred to me was to express a regex using functions with short, easy-to-remember names. For example:
from refunc import *
pattern = Compile(
'a',
Capture(
Choices('b', 'c'),
N_of( 'd', 1, Infin() ),
N_of( 'e', 0, Infin() ),
),
Look_ahead('foo'),
)
But when I see that in action, it looks like a pain to me. There are many aspects of regex that are quite intuitive -- for example, + to mean "one or more". One option would be a hybrid approach, allowing your user to mix those parts of regex that are already simple with functions for the more esoteric bits.
pattern = Compile(
'a',
Capture(
'[bc]',
'd+',
'e*',
),
Look_ahead('foo'),
)
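For what it's worth, the helpers above would not need much machinery. A throwaway sketch of these hypothetical functions (they are not an existing library) might be:
import re

def Infin():
    return None                      # stands in for "no upper bound"

def Choices(*alts):
    return '(?:' + '|'.join(re.escape(a) for a in alts) + ')'

def N_of(piece, lo, hi):
    # 'piece' is assumed to already be a regex fragment
    upper = '' if hi is None else str(hi)
    return '(?:' + piece + '){' + str(lo) + ',' + upper + '}'

def Capture(*pieces):
    return '(' + ''.join(pieces) + ')'

def Look_ahead(piece):
    return '(?=' + re.escape(piece) + ')'

def Compile(*pieces):
    return re.compile(''.join(pieces))

pattern = Compile(
    'a',
    Capture(Choices('b', 'c'), N_of('d', 1, Infin()), N_of('e', 0, Infin())),
    Look_ahead('foo'),
)
print(pattern.pattern)   # a((?:b|c)(?:d){1,}(?:e){0,})(?=foo)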
I would add that in my experience, regular expressions are about learning a thought process. Getting comfortable with the syntax is the easy part.
Maybe not exactly what you are asking for, but there is a way to write regexes in a more readable way (VERBOSE, the X flag for short):
rex_name = re.compile("""
[A-Za-z] # first letter
[a-z]+ # the rest
""", re.X)
rex_name.match('Joe')