How to handle multiple rules for one token with PLY

I'm working with a jison file and converting it to a parser generator using the lex module from python PLY.
I've noticed that in this jison file, certain tokens have multiple rules associated with them. For example, for the token CONTENT, the file specifies the following three rules:
[^\x00]*?/("{{") {
  if(yytext.slice(-2) === "\\\\") {
    strip(0,1);
    this.begin("mu");
  } else if(yytext.slice(-1) === "\\") {
    strip(0,1);
    this.begin("emu");
  } else {
    this.begin("mu");
  }
  if(yytext) return 'CONTENT';
}
[^\x00]+ return 'CONTENT';
// marks CONTENT up to the next mustache or escaped mustache
<emu>[^\x00]{2,}?/("{{"|"\\{{"|"\\\\{{"|<<EOF>>) {
  this.popState();
  return 'CONTENT';
}
In another case, there are multiple rules for the COMMENT token:
<com>[\s\S]*?"--}}" strip(0,4); this.popState(); return 'COMMENT';
<mu>"{{!--" this.popState(); this.begin('com');
<mu>"{{!"[\s\S]*?"}}" strip(3,5); this.popState(); return 'COMMENT';
It seems easy enough to distinguish the rules when they apply to different states, but what about when they apply to the same state?
How can I translate this jison to python rules using ply.lex?
edit
In case it helps, this jison file is part of the handlebars.js source code. See: https://github.com/wycats/handlebars.js/blob/master/src/handlebars.l

This question is difficult to answer; it is also two questions in one.
Jison (that's the language that the handlebars parser is written in, not bison) has some features not found in other lexers, and in particular not found in PLY. This makes it difficult to convert the lexical code you have shown from Jison to PLY. However, this is not the question you were focussed on. It is possible to answer your base question, how can multiple regular expressions return a single token in PLY, but this would not give you the solution to implementing the code you chose as your example!
First, let's address the question you asked. Returning one token for multiple regular expressions in PLY can be accomplished with the @TOKEN decorator, as shown in the PLY manual (section 4.11).
For example, we can do the following:
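from ply.lex import TOKEN
# Python's re has no Jison-style trailing context (/), so a lookahead is used instead.
content1 = r'[^\x00]+?(?=\{\{)'
content2 = r'[^\x00]+'
content = r'(' + content1 + r'|' + content2 + r')'
@TOKEN(content)
def t_CONTENT(t):
    ...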
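However, this won't work on its own for the rules you have from Jison, because they rely on a lexer feature called start conditions (see the Jison manual). Here, this.begin switches the lexer into a named state, and a rule prefixed with that name applies only while the state is active. This is where the <mu>, <emu> and <com> come from. PLY's closest counterpart is its conditional lexing support (a states declaration plus t.lexer.begin(), push_state() and pop_state()), but the Jison rules also depend on the trailing-context operator / and on the strip() helper, neither of which carries over directly, so the rules cannot simply be copied across.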
To match these lexemes, it is really necessary to go back to the syntax of the handlebars/moustache language/notation and create new regular expressions. Somehow I feel that completely re-implementing the whole of handlebars for you in a SO answer is perhaps a step too far.
However, I have identified the steps to a solution for you, and anyone else who treads this path.

Related

Simple parser, but not a calculator

I am trying to write a very simple parser. I read similar questions here on SO and on the Internet, but all I could find was limited to "arithmetic like" things.
I have a very simple DSL, for example:
ELEMENT TYPE<TYPE> elemName {
    TYPE<TYPE> memberName;
}
Where the <TYPE> part is optional and valid only for some types.
Following what I read, I tried to write a recursive descent parser in Python, but there are a few things that I can't seem to understand:
How do I look for tokens that are longer than 1 char?
How do I break up the text in the different parts? For example, after a TYPE I can have a whitespace or a < or a whitespace followed by a <. How do I address that?

Short answer
All your questions boil down to the fact that you are not tokenizing your string before parsing it.
Long answer
The process of parsing is actually split in two distinct parts: lexing and parsing.
Lexing
What seems to be missing in the way you think about parsing is called tokenizing or lexing. It is the process of converting a string into a stream of tokens, i.e. words. That is what you are looking for when asking How do I break up the text in the different parts?
You can do it yourself by checking your string against a list of regular expressions using re, or you can use a well-known library such as PLY. If you are using Python 3, though, I am biased toward a lexing and parsing library that I wrote, ComPyl.
So proceeding with ComPyl, the syntax you are looking for seems to be the following.
from compyl.lexer import Lexer
rules = [
    (r'\s+', None),
    (r'\w+', 'ID'),
    (r'< *\w+ *>', 'TYPE'), # Will match your <TYPE> token with inner whitespaces
    (r'{', 'L_BRACKET'),
    (r'}', 'R_BRACKET'),
]
lexer = Lexer(rules=rules, line_rule='\n')
# See ComPyl doc to figure how to proceed from here
Notice that the first rule (r'\s+', None), is actually what solves your issue about whitespace. It basically tells the lexer to match any whitespace character and to ignore them. Of course if you do not want to use a lexing tool, you can simply add a similar rule in your own re implementation.
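If you would rather not depend on a lexing library, the same kind of rule table can be driven by the standard re module. The following is only a minimal sketch: the tokenize function, the RULES table and the SEMICOLON token are made up for this example (the ComPyl rules above do not list a semicolon rule), and real code would probably want line numbers in its error messages.

```python
import re

# Hypothetical rule table modelled on the ComPyl rules above.
# A token name of None means "match the text but emit nothing".
RULES = [
    (r'\s+', None),
    (r'\w+', 'ID'),
    (r'< *\w+ *>', 'TYPE'),
    (r'\{', 'L_BRACKET'),
    (r'\}', 'R_BRACKET'),
    (r';', 'SEMICOLON'),  # assumption: added for the ';' in the sample DSL
]
COMPILED = [(re.compile(pattern), name) for pattern, name in RULES]


def tokenize(text):
    """Return a list of (token_name, matched_text) pairs."""
    tokens = []
    pos = 0
    while pos < len(text):
        # Try each rule in order at the current position; first match wins.
        for regex, name in COMPILED:
            match = regex.match(text, pos)
            if match:
                if name is not None:
                    tokens.append((name, match.group()))
                pos = match.end()
                break
        else:
            raise ValueError(
                'unexpected character %r at position %d' % (text[pos], pos))
    return tokens


print(tokenize('ELEMENT TYPE<TYPE> elemName { }'))
```

Because the whitespace rule comes first and emits nothing, the caller never sees whitespace, whether or not it appears between TYPE and the opening <.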
Parsing
You seem to want to write your own LL(1) parser, so I will be brief on that part. Just know that there are a lot of tools that can do that for you (the PLY and ComPyl libraries offer LR-family parsers, LALR(1) in PLY's case, which are more powerful but harder to write by hand; see the difference between LL(1) and LR(1) here).
Simply notice that now that you know how to tokenize your string, the issue of How do I look for tokens that are longer than 1 char? has been solved. You are now parsing, not a stream of characters, but a stream of tokens that encapsulate the matched words.
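To make the "stream of tokens" idea concrete, here is a rough sketch of a hand-written recursive descent parser for the DSL in the question. The grammar it assumes (an ELEMENT keyword, a type with an optional <TYPE> part, a name, then a braced list of members each ending in a semicolon) is a guess from the sample, and every name in it (tokenize, Parser, parse_type, parse_element) is invented for illustration.

```python
import re

# One regex with named groups; leading whitespace is skipped by \s*.
TOKEN_RE = re.compile(
    r'\s*(?:(?P<ID>\w+)|(?P<TYPE><\s*\w+\s*>)|(?P<PUNCT>[{};]))')


def tokenize(text):
    """Split text into (kind, value) pairs."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if not match:
            raise SyntaxError('bad input at position %d' % pos)
        tokens.append((match.lastgroup, match.group(match.lastgroup)))
        pos = match.end()
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        """Look at the next token without consuming it."""
        if self.pos < len(self.tokens):
            return self.tokens[self.pos]
        return (None, None)

    def expect(self, kind, value=None):
        """Consume the next token, checking its kind (and value if given)."""
        tok_kind, tok_value = self.peek()
        if tok_kind != kind or (value is not None and tok_value != value):
            raise SyntaxError(
                'expected %s %r, got %r' % (kind, value, tok_value))
        self.pos += 1
        return tok_value

    def parse_type(self):
        # type := ID [TYPE]
        name = self.expect('ID')
        param = None
        if self.peek()[0] == 'TYPE':
            param = self.expect('TYPE').strip('<> ')
        return (name, param)

    def parse_element(self):
        # element := 'ELEMENT' type ID '{' (type ID ';')* '}'
        self.expect('ID', 'ELEMENT')
        elem_type = self.parse_type()
        elem_name = self.expect('ID')
        self.expect('PUNCT', '{')
        members = []
        while self.peek() != ('PUNCT', '}'):
            member_type = self.parse_type()
            member_name = self.expect('ID')
            self.expect('PUNCT', ';')
            members.append((member_type, member_name))
        self.expect('PUNCT', '}')
        return {'type': elem_type, 'name': elem_name, 'members': members}


source = 'ELEMENT List<Int> numbers { Map <String> index; Int count; }'
print(Parser(tokenize(source)).parse_element())
```

The parser only ever asks "what kind of token is next?", so multi-character words and optional whitespace are no longer its problem.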

Olivier's answer regarding lexing/tokenizing and then parsing is helpful.
However, for relatively simple cases, some parsing tools are able to handle your kind of requirements without needing a separate tokenizing step. parsy is one of those. You build up parsers from smaller building blocks - there is good documentation to help.
An example of a parser done with parsy for your kind of grammar is here: http://parsy.readthedocs.io/en/latest/howto/other_examples.html#proto-file-parser .
It is significantly more complex than yours, but shows what is possible. Where whitespace is allowed (but not required), it uses the lexeme utility (defined at the top) to consume optional whitespace.
You may need to tighten up your understanding of where whitespace is necessary and where it is optional, and what kind of whitespace you really mean.

Railroad diagram for Python grammar

I am looking for a way to get better grasp on the Python grammar.
My experience is that a railroad diagram for the grammar may be helpful.
Python documentation contains the grammar in a text form:
https://docs.python.org/3/reference/grammar.html
But that is not very easy to digest for someone who is just starting with software engineering.
Does anybody have good beginner material?
There is a Railroad Diagram Generator that I might be able to use, but I was not able to find an EBNF syntax for the Python grammar, that would be accepted by that generator.
A link to such a grammar would be very helpful as well.

To convert the Python grammar found at, e.g., https://docs.python.org/3/reference/grammar.html, to EBNF, you basically need to do three things:
Replace all #... comments with /*...*/ (or just delete them)
Use ::= instead of : for defining production rules
Use (...)? to indicate optional elements instead of [...].
For example, instead of
# The function statement
funcdef: 'def' NAME parameters ['->' test] ':' suite
you would use
/* The function statement */
funcdef ::= 'def' NAME parameters ('->' test)? ':' suite
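The three rewrites can be automated with a short script. This is a sketch only: to_ebnf and convert_brackets are names invented here, it assumes the older grammar file layout shown in the funcdef example (one rule name at the start of a line, whole-line # comments), and it has not been checked against what any particular diagram generator accepts.

```python
import re


def convert_brackets(line):
    """Rewrite optional groups [...] as (...)?, leaving quoted text alone."""
    out = []
    quote = None  # the quote character we are currently inside, if any
    for ch in line:
        if quote:
            if ch == quote:
                quote = None
            out.append(ch)
        elif ch in '\'"':
            quote = ch
            out.append(ch)
        elif ch == '[':
            out.append('(')
        elif ch == ']':
            out.append(')?')
        else:
            out.append(ch)
    return ''.join(out)


def to_ebnf(grammar_text):
    """Apply the three rewrites described above, line by line."""
    out = []
    for line in grammar_text.splitlines():
        # 1. A whole-line "# ..." comment becomes "/* ... */".
        comment = re.match(r'^\s*#\s?(.*)$', line)
        if comment:
            out.append('/* %s */' % comment.group(1).rstrip())
            continue
        # 2. The ":" after a rule name at the start of a line becomes "::=".
        line = re.sub(r'^(\w+)\s*:', r'\1 ::=', line, count=1)
        # 3. Optional groups use (...)? instead of [...].
        out.append(convert_brackets(line))
    return '\n'.join(out)


sample = ("# The function statement\n"
          "funcdef: 'def' NAME parameters ['->' test] ':' suite")
print(to_ebnf(sample))
```

Tracking quotes in convert_brackets matters because the grammar also contains literal bracket tokens such as '[' and ']', which must not be rewritten.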

lexer error-handling PLY Python

The t_error() function is used to handle lexing errors that occur when illegal characters are detected. My question is: How can I use this function to get more specific information on errors? Like error type, in which rule or section the error appears, etc.

In general, there is only very limited information available to the t_error() function. As input, it receives a token object where the value has been set to the remaining input text. Analysis of that text is entirely up to you. You can use the t.lexer.skip(n) function to have the lexer skip ahead by a certain number of characters and that's about it.
There is no notion of an "error type" other than the fact that there is an input character that does not match the regular expression of any known token. Since the lexer is decoupled from the parser, there is no direct way to get any information about the state of the parsing engine or to find out what grammar rule is being parsed. Even if you could get the state (which would simply be the underlying state number of the LALR state machine), interpretation of it would likely be very difficult since the parser could be in the intermediate stages of matching dozens of possible grammar rules looking for reduce actions.
My advice is as follows: If you need additional information in the t_error() function, you should set up some kind of object that is shared between the lexer and parser components of your code. You should explicitly make different parts of your compiler update that object as needed (e.g., it could be updated in specific grammar rules).
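One possible shape for such a shared object is sketched below. ErrorContext, last_rule and FakeLexer are hypothetical names made up for this example; the only PLY features relied on are the value, lineno, lexpos and lexer attributes of the token passed to t_error() and the lexer's skip() method. The fake lexer exists only so that the sketch runs without PLY installed.

```python
from types import SimpleNamespace


class ErrorContext:
    """Hypothetical object shared by the lexer and parser parts of a compiler."""

    def __init__(self):
        self.last_rule = None  # grammar rule functions would update this
        self.errors = []


context = ErrorContext()


def t_error(t):
    # t.value holds the remaining input; only its first character is
    # known to be bad.
    context.errors.append({
        'char': t.value[0],
        'line': t.lineno,
        'pos': t.lexpos,
        'rule': context.last_rule,
    })
    t.lexer.skip(1)  # discard the offending character and carry on


class FakeLexer:
    """Stand-in for a PLY lexer, used only to exercise t_error here."""

    def __init__(self):
        self.skipped = 0

    def skip(self, n):
        self.skipped += n


# A grammar rule function (say, p_statement) would set this when it runs.
context.last_rule = 'statement'
fake_lexer = FakeLexer()
t_error(SimpleNamespace(value='$ = 1', lineno=3, lexpos=17, lexer=fake_lexer))
print(context.errors)
```

Bear in mind that a grammar rule function only runs when its rule is reduced, so last_rule records the most recently completed rule, not necessarily the one the parser is in the middle of.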
Just as an aside, there are usually very few courses of action for a bad token. Essentially, you're getting input text that doesn't match any known part of the language alphabet (e.g., no known symbol). As such, there's not even any kind of token value you can give to the parser. Usually, the only course of action is to report the bad input, throw it out, and continue.
As a followup to Raymond's answer, I would also not advise modifying any attribute of the lexer object in t_error().

Ply includes an example ANSI-C style lexer in a file called cpp.py. It has an example of how to extract some information out of t_error():
def t_error(t):
    t.type = t.value[0]
    t.value = t.value[0]
    t.lexer.skip(1)
    return t
In that function, you can also access the lexer's public attributes:
lineno - Current line number
lexpos - Current position in the input string
There are also some other attributes that aren't listed as public but may provide some useful diagnostics:
lexstate - Current lexer state
lexstatestack - Stack of lexer states
lexstateinfo - State information
lexerrorf - Error rule (if any)

There is indeed a way of managing errors in PLY. Take a look at this very interesting presentation:
http://www.slideshare.net/dabeaz/writing-parsers-and-compilers-with-ply
and at chapter 6.8.1 of
http://www.dabeaz.com/ply/ply.html#ply_nn3

Python 2.x: how to automate enforcing unicode instead of string?

How can I automate a test to enforce that a body of Python 2.x code contains no string instances (only unicode instances)?
Eg.
Can I do it from within the code?
Is there a static analysis tool that has this feature?
Edit:
I wanted this for an application in Python 2.5, but it turns out this is not really possible because:
2.5 doesn't support unicode_literals
kwargs dictionary keys can't be unicode objects, only strings
So I'm accepting the answer that says it's not possible, even though it's for different reasons :)

You can't enforce that all strings are Unicode; even with from __future__ import unicode_literals in a module, byte strings can be written as b'...', as they can in Python 3.
There was an option that could be used to get the same effect as unicode_literals globally: the command-line option -U. However it was abandoned early in the 2.x series because it basically broke every script.
What is your purpose for this? It is not desirable to abolish byte strings. They are not “bad” and Unicode strings are not universally “better”; they are two separate animals and you will need both of them. Byte strings will certainly be needed to talk to binary files and network services.
If you want to be prepared to transition to Python 3, the best tack is to write b'...' for all the strings you really mean to be bytes, and u'...' for the strings that are inherently Unicode. The default '...' string format can be used for everything else: places where you don't care whether Python 3 changes the default string type.
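As a small illustration of that convention, the snippet below marks each literal with the type it is meant to have. The variable names are arbitrary, and it assumes an interpreter that accepts both prefixes (Python 2.6 or later, or Python 3.3 or later).

```python
# Bytes: data that goes to a binary file or over the network.
raw = b'\x89PNG'

# Text: characters, independent of any particular encoding.
text = u'caf\xe9'

# Crossing the boundary is always an explicit encode or decode.
encoded = text.encode('utf-8')
decoded = encoded.decode('utf-8')

print(repr(raw), repr(encoded), decoded == text)
```

Keeping the conversion explicit at the edges of the program is what makes the later move to Python 3 mostly mechanical.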

It seems to me like you really need to parse the code with an honest to goodness python parser. Then you will need to dig through the AST your parser produces to see if it contains any string literals.
It looks like Python comes with a parser out of the box. From this documentation I got this code sample working:
import parser
from token import tok_name
def checkForNonUnicode(codeString):
    return checkForNonUnicodeHelper(parser.suite(codeString).tolist())
def checkForNonUnicodeHelper(lst):
    returnValue = True
    nodeType = lst[0]
    if nodeType in tok_name and tok_name[nodeType] == 'STRING':
        stringValue = lst[1]
        if stringValue[0] != "u": # Kind of hacky. Does this always work?
            print "%s is not unicode!" % stringValue
            returnValue = False
    else:
        for subNode in [lst[n] for n in range(1, len(lst))]:
            if isinstance(subNode, list):
                returnValue = returnValue and checkForNonUnicodeHelper(subNode)
    return returnValue
print checkForNonUnicode("""
def foo():
    a = 'This should blow up!'
""")
print checkForNonUnicode("""
def bar():
    b = u'although this is ok.'
""")
which prints out
'This should blow up!' is not unicode!
False
True
Now doc strings aren't unicode but should be allowed, so you might have to do something more complicated like from symbol import sym_name where you can look up which node types are for class and function definitions. Then the first sub-node that's simply a string, i.e. not part of an assignment or whatever, should be allowed to not be unicode.
Good question!
Edit
Just a follow up comment. Conveniently for your purposes, parser.suite does not actually evaluate your python code. This means that you can run this parser over your Python files without worrying about naming or import errors. For example, let's say you have myObscureUtilityFile.py that contains
from ..obscure.relative.path import whatever
You can
checkForNonUnicode(open('/whoah/softlink/myObscureUtilityFile.py').read())
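The parser module used above was removed in Python 3.10, so if the check has to run under a current interpreter, a similar lexical test can be sketched with the standard tokenize module instead. This is a different technique from the one above (it looks at tokens, not a parse tree), find_non_u_strings is a name invented here, and it assumes the file being checked tokenizes under the Python that runs the check.

```python
import io
import tokenize


def find_non_u_strings(source):
    """Return (line_number, literal) for each string literal without a u prefix."""
    hits = []
    reader = io.StringIO(source).readline
    for tok in tokenize.generate_tokens(reader):
        # tok.string is the literal exactly as written, prefix included.
        if tok.type == tokenize.STRING and not tok.string.lower().startswith('u'):
            hits.append((tok.start[0], tok.string))
    return hits


source = (
    "def foo():\n"
    "    a = 'This should blow up!'\n"
    "    b = u'although this is ok.'\n"
)
print(find_non_u_strings(source))
```

Like parser.suite, tokenizing never runs the code being inspected, so broken imports in the inspected file do not matter. Docstrings would still be reported and need the same special handling discussed above.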

Our SD Source Code Search Engine (SCSE) can provide this result directly.
The SCSE provides a way to search extremely quickly across large sets of files using some of the language structure to enable precise queries and minimize false positives. It handles a wide array of languages, even at the same time, including Python. A GUI shows search hits and a page of actual text from the file containing a selected hit.
It uses lexical information from the source languages as the basis for queries, composed of various language keywords and pattern tokens that match varying language elements. SCSE knows the types of lexemes available in the language. One can search for a generic identifier (using query token I) or an identifier matching some regular expression. Similarly, one can search for a generic string (using query token "S" for "any kind of string literal") or for a specific type of string (for Python including "UnicodeStrings", non-unicode strings, etc., which collectively make up the set of Python things comprising "S").
So a search:
'for' ... I=ij*
finds the keyword 'for' near ("...") an identifier whose prefix is "ij" and shows you all the hits. (Language-specific whitespace, including line breaks and comments, is ignored.)
A trivial search:
S
finds all string literals. This is often a pretty big set :-}
A search
UnicodeStrings
finds all string literals that are lexically defined as Unicode Strings (u"...")
What you want are all strings that aren't UnicodeStrings. The SCSE provides a "subtract" operator that subtracts hits of one kind that overlap hits of another. So your question, "what strings aren't unicode" is expressed concisely as:
S-UnicodeStrings
All hits shown will be the strings that aren't unicode strings, your precise question.
The SCSE provides logging facilities so that you can record hits. You can run SCSE from a command line, enabling a scripted query for your answer. Putting this into a command script would provide a tool that gives your answer directly.

avoid regex [python]

I'd like to know if it's a good idea to avoid regex.
So far I have avoided it in every case, and some people have advised me that I shouldn't, since once you know what everything means, like:
[] '|' \A \B \d \D \W \w \S \Z $ * ? ...
it would be easy to read, right? But I feel that by avoiding regex I would have more readable code.
It gets more unreadable as it gets bigger, for example in validators.py:
email_re = re.compile(
    r"(^[-!#$%&'*+/=?^_`{}|~0-9A-Z]+(\.[-!#$%&'*+/=?^_`{}|~0-9A-Z]+)*"  # dot-atom
    r'|^"([\001-\010\013\014\016-\037!#-\[\]-\177]|\\[\001-011\013\014\016-\177])*"'  # quoted-string
    r')@(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+[A-Z]{2,6}\.?$', re.IGNORECASE)  # domain
So, I'd like to know: is there a reason not to avoid regex?

No, don't avoid regular expressions. They're actually quite a nifty little tool and will save you a lot of work if you use them wisely.
What you do need to avoid is trying to use it for everything, a malaise that appears to strike those new to regular expressions before they become a little more tempered and a little less enamoured :-)
For example, don't use it to validate email addresses. The way you validate an email address is to send an email to it with a link that the receiver has to click on to complete the "transaction".
There are billions of valid email addresses (according to the RFCs) that have no physical email receiver behind them. The only way to be certain that there is a receiver is to send an email and wait for proof positive that it was received and acted upon.
If I find myself writing a regular expression that's more than, let's say, 60 characters, I step back to see if there's a more readable way. Similarly, if I write a regular expression and come back a week later and can't instantly recognise what it does, I think about replacing it. This particular paragraph consists of my opinions of course, but they've served me well :-)

Regular expressions are a tool. They are perfectly suited to some tasks and not to others. Like any tool, use them when they are the right tool for the job. Don't just avoid them because somebody said they were bad. Learn how to use them and then you can decide for yourself rather than depending on someone else's dogma.

If you choose to use a more general parsing approach, like pyparsing or PLY, you will never require regular expressions (which can only match a small subset of the languages matchable with such general parsers). However, lexers such as the one in PLY are typically built around regular expressions (which are a perfect match for a lexer's needs!), so you will probably have to avoid that (as well as powerful tools such as BeautifulSoup when any "normal" user would be able to keep using and enjoying it by simply passing a regular expression object as the selector, since BeautifulSoup fully supports that) and will have to recode a lot of such existing parsers with your chosen general-purpose parsing package.
Performance may suffer greatly, of course, by using extremely general tools in cases where simpler, highly optimized and concise ones would be a perfect solution -- and the size of your code may "blow up" to being very large in many common cases. But if you don't mind having programs twice as big and twice as slow, and are determined to avoid regular expressions at all costs, you can do that.
On the other hand, if your main concern is with readability (quite an understandable and commendable concern, too), then the re.VERBOSE option, by allowing abundant use of whitespace and comments within the RE's pattern, can really do wonders for that goal without removing any of REs' advantages (except by diluting a sometimes-excessive conciseness;-). You WILL want to also keep at least one general-purpose parsing system under your belt, of course (rather than stretch REs to do tasks they're wrong for, as so many people unfortunately do!) -- but a minimal command of REs will serve you well in so many cases (including, for example, full use of BeautifulSoup and many other tools which can accept REs as parameters to apply them appropriately) that I think it's quite to be recommended.
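As a quick illustration of re.VERBOSE, here is a deliberately loose address pattern laid out one piece per line. The pattern and the name EMAIL_RE are only an example of the layout, not a recommended validator.

```python
import re

EMAIL_RE = re.compile(r"""
    ^
    [\w.+-]+        # local part: word characters, dot, plus, hyphen
    @
    (?:[\w-]+\.)+   # one or more domain labels, each followed by a dot
    [a-zA-Z]{2,}    # final label of at least two letters
    $
    """, re.VERBOSE)

for candidate in ('and.bun@webben.de', 'me@somewhere'):
    print(candidate, bool(EMAIL_RE.match(candidate)))
```

With re.VERBOSE, whitespace in the pattern is ignored and # starts a comment, except inside a character class or when escaped, so a literal space or # in the pattern has to be escaped or placed in a class.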

Just for comparison, here is my version of an email format check without regexp (with test cases), and one readable regexp offered to me as an alternative (though sending an email after the address is accepted is a great idea):
# -*- coding: utf8 -*-
import string
print("Valid letters in this computer are: "+string.letters)
import re
def validateEmail(a):
    sep=[x for x in a if not (x.isalpha() or
                              x.isdigit() or
                              x in r"!#$%&'*+-/=?^_`{|}~]") ]
    sepjoined=''.join(sep)
    ## sep joined must be ..@.... form
    if len(a)>255 or sepjoined.strip('.') != '@': return False
    end=a
    for i in sep:
        part,i,end=end.partition(i)
        if len(part)<2: return False
    return len(end)>1
def emailval(address):
    pattern = "[\.\w]{2,}[@]\w+[.]\w+"
    return re.match(pattern, address)
if __name__ == '__main__':
    emails = [ "test.@web.com","test+john@web.museum", "test+john@web.m",
               "a@n.dk", "and.bun@webben.de","marjaliisa.hämäläinen@hel.fi",
               "marja-liisa.hämäläinen@hel.fi", "marjaliisah@hel.",'tony@localhost',
               '1234@23.45','me@somewhere']
    print('\n\t'.join(["Valid emails are:"] +
                      filter(validateEmail,emails)))
    print('\n\t'.join(["Regexp gives wrong answer:"] +
                      filter(emailval,emails)))
""" Output:
Valid letters in this computer are: abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
Valid emails are:
    test+john@web.museum
    and.bun@webben.de
    tony@localhost
    1234@23.45
    me@somewhere
Regexp gives wrong answer:
    test.@web.com
    and.bun@webben.de
    1234@23.45
"""
EDIT: cleaned up the regex filter function from this ancient code, and edited it into a more permissive version based on @detly's link. Good enough for me as a first check on form input before sending the confirmation email. Finally added the 255 character length limit check mentioned in the comments.
This code on purpose does not accept the normal a@b as a valid email address, but does accept me@somewhere. Another thing is that it depends on what isalpha returns. So this output, which is from Ideone.com, has not accepted the Scandinavian öä even though they are valid nowadays. When run on my home computer, those are accepted. This is the case even with the coding line present.

(Deleted a regular expression which purported to be an "official" one but is in fact not found in the RFC it claimed to be from.)
This regex may be amusing as it is an attempt to precisely match the e-mail address grammar provided in an older version of the Internet mail standards.

Regular expressions are likely the right tool for extracting/validating email addresses...
To extract one or more email addresses from raw text:
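import re
pat_e = re.compile(r'(?P<email>[\w.+-]+@(?:[\w-]+\.)+[a-zA-Z]{2,})')
def extract_emails(text):
    # extract_emails is an illustrative name; it collects every address-like match in text.
    emails = []
    for r in pat_e.finditer(text):
        emails.append(r.group('email'))
    return emails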
To see if a single piece of text is a valid email:
import re
pat_m = re.compile(r'([\w.+-]+@(?:[\w-]+\.)+[a-zA-Z]{2,}$)')
def is_valid_email(text):
    # is_valid_email is an illustrative name; True if text as a whole looks like an address.
    return bool(pat_m.match(text))
