The t_error() function is used to handle lexing errors that occur when illegal characters are detected. My question is: How can I use this function to get more specific information on errors? Like error type, in which rule or section the error appears, etc.
In general, there is only very limited information available to the t_error() function. As input, it receives a token object where the value has been set to the remaining input text. Analysis of that text is entirely up to you. You can use the t.lexer.skip(n) function to have the lexer skip ahead by a certain number of characters and that's about it.
There is no notion of an "error type" other than the fact that there is an input character that does not match the regular expression of any known token. Since the lexer is decoupled from the parser, there is no direct way to get any information about the state of the parsing engine or to find out what grammar rule is being parsed. Even if you could get the state (which would simply be the underlying state number of the LALR state machine), interpretation of it would likely be very difficult since the parser could be in the intermediate stages of matching dozens of possible grammar rules looking for reduce actions.
My advice is as follows: If you need additional information in the t_error() function, you should set up some kind of object that is shared between the lexer and parser components of your code. You should explicitly make different parts of your compiler update that object as needed (e.g., it could be updated in specific grammar rules).
Just as an aside, there are usually very few courses of action for a bad token. Essentially, you're getting input text that doesn't match any known part of the language alphabet (e.g., no known symbol). As such, there's not even any kind of token value you can give to the parser. Usually, the only course of action is to report the bad input, throw it out, and continue.
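A minimal sketch of the shared-object idea, assuming a hypothetical ErrorContext class and a hypothetical "section" field that your own parser or driver code would keep up to date. Only t.value, t.lineno, t.lexpos and t.lexer.skip() come from PLY; the FakeLexer and the SimpleNamespace token are stand-ins so the snippet runs without PLY installed.

```python
from types import SimpleNamespace


class ErrorContext:
    """Hypothetical shared state that lexer and parser code both update."""

    def __init__(self):
        self.section = None   # set by grammar rules or driver code
        self.errors = []      # filled in by t_error


context = ErrorContext()


def t_error(t):
    """Record the illegal character and where it occurred, then skip it."""
    context.errors.append({
        "char": t.value[0],          # t.value holds the remaining input text
        "lineno": t.lineno,
        "lexpos": t.lexpos,
        "section": context.section,  # whatever the parser side last stored
    })
    t.lexer.skip(1)                  # throw the bad character away and continue


class FakeLexer:
    """Stand-in for the PLY lexer; it only counts skipped characters."""

    def __init__(self):
        self.skipped = 0

    def skip(self, n):
        self.skipped += n


fake_lexer = FakeLexer()
context.section = "declarations"
t_error(SimpleNamespace(value="$ = 3", lineno=2, lexpos=14, lexer=fake_lexer))
```

In a real grammar, the assignment to context.section would live inside whichever grammar rules or driver code you want to track.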
As a followup to Raymond's answer, I would also not advise modifying any attribute of the lexer object in t_error().
Ply includes an example ANSI-C style lexer in a file called cpp.py. It has an example of how to extract some information out of t_error():
def t_error(t):
    t.type = t.value[0]
    t.value = t.value[0]
    t.lexer.skip(1)
    return t
In that function, you can also access the lexer's public attributes:
lineno - Current line number
lexpos - Current position in the input string
There are also some other attributes that aren't listed as public but may provide some useful diagnostics:
lexstate - Current lexer state
lexstatestack - Stack of lexer states
lexstateinfo - State information
lexerrorf - Error rule (if any)
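As a rough illustration of what lexpos is good for, the position can be turned into a column number for error messages. The helper name find_column and the sample text are assumptions for this sketch; in PLY the input text would normally come from the lexer rather than a local string.

```python
def find_column(text, lexpos):
    """Return the 1-based column of lexpos within its line of text."""
    # rfind returns -1 when there is no earlier newline, so line_start is 0
    line_start = text.rfind("\n", 0, lexpos) + 1
    return lexpos - line_start + 1


source = "x = 1\ny = $\n"
bad_pos = source.index("$")          # position of the illegal character
column = find_column(source, bad_pos)
```

Combined with lineno, this gives a line and column pair to report for the bad character.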
There is indeed a way of managing errors in PLY. Take a look at this very interesting presentation:
http://www.slideshare.net/dabeaz/writing-parsers-and-compilers-with-ply
and at chapter 6.8.1 of
http://www.dabeaz.com/ply/ply.html#ply_nn3
The error code D401 for pydocstyle reads: First line should be in imperative mood.
I often run into cases where I write a docstring, have this error thrown by my linter, and rewrite it -- but the two docstrings are semantically identical. Why is it important to have imperative mood for docstrings?
From the docstring of check_imperative_mood itself:
"""D401: First line should be in imperative mood: 'Do', not 'Does'.
[Docstring] prescribes the function or method's effect as a command:
("Do this", "Return that"), not as a description; e.g. don't write
"Returns the pathname ...".
(We'll ignore the irony that this docstring itself would fail the test.)
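To make the 'Do', not 'Does' distinction concrete, here is a deliberately naive heuristic. It is not pydocstyle's actual algorithm; the function name looks_descriptive and the "ends in s but not ss" rule are assumptions made purely for illustration, and the rule misfires on imperative verbs such as "Focus".

```python
def looks_descriptive(docstring):
    """Guess whether the first word is third-person ('Returns') rather than imperative."""
    words = docstring.strip().split()
    if not words:
        return False
    first = words[0].rstrip(".,:;").lower()
    # Naive rule: 'returns', 'makes', 'does' end in a single 's';
    # 'process', 'pass' end in 'ss' and are left alone.
    return first.endswith("s") and not first.endswith("ss")


flagged = looks_descriptive("Returns the pathname of the KOS root directory.")
accepted = looks_descriptive("Return the pathname of the KOS root directory.")
```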
Consider the following example candidate for a docstring:
Make a row from a given bit string or with the given number of columns.
In English, this is a complete, grammatical sentence that begins with a capital letter and ends with a period. It's a sentence because it has a subject (implicitly, "you"), an object, "row," and a predicate (verb), "make."
Now consider an alternative:
Makes a row from a given bit string or with the given number of columns.
In English, this is ungrammatical. It is an adjectival phrase, therefore it should not begin with a capital letter and should not end with a period. Let's fix that problem:
makes a row from a given bit string or with the given number of columns
As an adjectival phrase, its antecedent --- its target --- is not explicit. Of course, we know it's the item being "docstringed," but, grammatically, it's dangling. That's a problem. Aesthetically, it's ugly, and that's another problem. So we fixed one problem and added two more.
For people who care about clear, unambiguous communication in grammatical English, the first proposal is clearly superior. I will guess that's the reason the Pythonistas chose the first proposal. In summary, "Docstrings shall be complete, grammatical sentences, specifically in the imperative mood."
It is not complete to just say it is about a convention or a consistency (otherwise, the follow-up question would be, "consistent with what?").
It is actually an explicit requirement from - albeit buried down deep in - the canonical PEP 257 Docstring Conventions. Quoted below:
def kos_root():
    """Return the pathname of the KOS root directory."""
    ...
Notes:
The docstring is a phrase ending in a period. It prescribes the
function or method's effect as a command ("Do this", "Return that"),
not as a description; e.g. don't write "Returns the pathname ...".
That pydocstyle docstring was actually quoted from the PEP 257 paragraph above.
It is more important to have a consistent style within a project or in a company.
The whole idea comes from PEP 257, which says:
The docstring is a phrase ending in a period. It prescribes the
function or method’s effect as a command (“Do this”, “Return that”),
not as a description; e.g. don’t write “Returns the pathname …”.
But for example the Google Python Style Guide states the total opposite of this:
The docstring should be descriptive-style ("""Fetches rows from a
Bigtable.""") rather than imperative-style ("""Fetch rows from a
Bigtable.""").
It is also worth mentioning that both the Oracle Java style guide and the Microsoft .NET guides prefer the descriptive style.
Use 3rd person (descriptive) not 2nd person (prescriptive).
The description is in 3rd person declarative rather than 2nd person
imperative.
Gets the label. (preferred)
Get the label. (avoid)
So it looks like this preference of imperative style is Python-specific.
Why is it important? Because that's the explicit convention for Python docstrings, as detailed in PEP 257. There's nothing particularly special about it - it doesn't seem obvious to me that one of "Multiplies two integers and returns the product" and "Multiply two integers and return the product" is clearly better than the other. But it is explicitly specified in the documentation.
For consistency. It might stem from the fact that the commit messages git automatically creates, like those for merge commits, also use the imperative mood.
I find the grammatical argument compelling.
I use the imperative style for names and docstrings. In my experience, the imperative style works better in review. People are more likely to comment on an obvious untruth when the "what" of the docstring disagrees with the "how" of the code:
def _update_calories(meal):
    return sum(item.calories for item in meal)  # where is the update?
I use the descriptive style for some inline comments when I want to highlight an unexpected behaviour. In my experience, people tend to read descriptive phrases as authoritative:
def _update_calories(meal):
    meal.calories = sum(item.calories for item in meal)
    return meal.calories
# warning: changes pizza.calories, easy to miss side-effect
calories = _update_calories(pizza)
I am trying to write a very simple parser. I read similar questions here on SO and on the Internet, but all I could find was limited to "arithmetic like" things.
I have a very simple DSL, for example:
ELEMENT TYPE<TYPE> elemName {
TYPE<TYPE> memberName;
}
Where the <TYPE> part is optional and valid only for some types.
Following what I read, I tried to write a recursive descent parser in Python, but there are a few things that I can't seem to understand:
How do I look for tokens that are longer than 1 char?
How do I break up the text in the different parts? For example, after a TYPE I can have a whitespace or a < or a whitespace followed by a <. How do I address that?
Short answer
All your questions boil down to the fact that you are not tokenizing your string before parsing it.
Long answer
The process of parsing is actually split in two distinct parts: lexing and parsing.
Lexing
What seems to be missing in the way you think about parsing is called tokenizing or lexing. It is the process of converting a string into a stream of tokens, i.e. words. That is what you are looking for when asking How do I break up the text in the different parts?
You can do it by yourself by checking your string against a list of regexps using re, or you can use a well-known library such as PLY. Although if you are using Python 3, I will be biased toward a lexing-parsing library that I wrote, which is ComPyl.
So proceeding with ComPyl, the syntax you are looking for seems to be the following.
from compyl.lexer import Lexer
rules = [
    (r'\s+', None),
    (r'\w+', 'ID'),
    (r'< *\w+ *>', 'TYPE'),  # Will match your <TYPE> token with inner whitespaces
    (r'{', 'L_BRACKET'),
    (r'}', 'R_BRACKET'),
]
lexer = Lexer(rules=rules, line_rule='\n')
# See ComPyl doc to figure how to proceed from here
Notice that the first rule (r'\s+', None), is actually what solves your issue about whitespace. It basically tells the lexer to match any whitespace character and to ignore them. Of course if you do not want to use a lexing tool, you can simply add a similar rule in your own re implementation.
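For the do-it-yourself route with re, a tokenizer could look roughly like the sketch below. The token names (WS, ID, LT, GT, L_BRACE, R_BRACE, SEMI) and the sample input are assumptions for illustration. Unlike the ComPyl rules above, this version emits "<" and ">" as separate tokens instead of one TYPE token.

```python
import re

# Order matters only where patterns could overlap; these do not.
TOKEN_SPEC = [
    ("WS", r"\s+"),        # whitespace: matched, then dropped
    ("ID", r"\w+"),        # multi-character words such as ELEMENT or elemName
    ("LT", r"<"),
    ("GT", r">"),
    ("L_BRACE", r"\{"),
    ("R_BRACE", r"\}"),
    ("SEMI", r";"),
]
MASTER = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))


def tokenize(text):
    """Turn text into a list of (kind, value) pairs, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise SyntaxError("illegal character %r at position %d" % (text[pos], pos))
        if match.lastgroup != "WS":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


tokens = tokenize("ELEMENT List<Int> items {\n    Int count;\n}")
```

Because whitespace is consumed by its own rule, "List<Int>", "List <Int>" and "List < Int >" all produce the same token stream.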
Parsing
You seem to want to write your own LL(1) parser, so I will be brief on that part. Just know that there exist a lot of tools that can do that for you (the PLY and ComPyl libraries offer LR(1) parsers, which are more powerful but harder to hand-write; see the difference between LL(1) and LR(1) here).
Simply notice that now that you know how to tokenize your string, the issue of How do I look for tokens that are longer than 1 char? has been solved. You are now parsing, not a stream of characters, but a stream of tokens that encapsulate the matched words.
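To show what parsing a stream of tokens means in practice, here is a small recursive descent sketch for the DSL in the question. It assumes tokens arrive as (kind, text) pairs with hypothetical kind names (ID, LT, GT, L_BRACE, R_BRACE, SEMI), and the shape of the returned dictionary is likewise just one possible choice.

```python
class Parser:
    """Recursive descent over a list of (kind, text) token pairs."""

    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.pos = 0

    def peek(self):
        """Return the kind of the next token, or 'EOF' when input is exhausted."""
        return self.tokens[self.pos][0] if self.pos < len(self.tokens) else "EOF"

    def expect(self, kind):
        """Consume one token of the given kind and return its text."""
        if self.peek() != kind:
            raise SyntaxError("expected %s but found %s at token %d"
                              % (kind, self.peek(), self.pos))
        text = self.tokens[self.pos][1]
        self.pos += 1
        return text

    def parse_type(self):
        """type := ID [ '<' ID '>' ]  (the bracketed part is optional)."""
        name = self.expect("ID")
        param = None
        if self.peek() == "LT":
            self.expect("LT")
            param = self.expect("ID")
            self.expect("GT")
        return (name, param)

    def parse_element(self):
        """element := 'ELEMENT' type ID '{' { type ID ';' } '}'."""
        if self.expect("ID") != "ELEMENT":
            raise SyntaxError("expected the ELEMENT keyword")
        elem_type = self.parse_type()
        elem_name = self.expect("ID")
        self.expect("L_BRACE")
        members = []
        while self.peek() != "R_BRACE":
            member_type = self.parse_type()
            member_name = self.expect("ID")
            self.expect("SEMI")
            members.append((member_type, member_name))
        self.expect("R_BRACE")
        return {"type": elem_type, "name": elem_name, "members": members}


sample_tokens = [
    ("ID", "ELEMENT"), ("ID", "List"), ("LT", "<"), ("ID", "Int"), ("GT", ">"),
    ("ID", "items"), ("L_BRACE", "{"), ("ID", "Int"), ("ID", "count"),
    ("SEMI", ";"), ("R_BRACE", "}"),
]
element = Parser(sample_tokens).parse_element()
```

One token of lookahead (peek) is enough to decide whether the optional "<TYPE>" part is present, which is what makes this grammar comfortable for a hand-written LL(1) parser.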
Olivier's answer regarding lexing/tokenizing and then parsing is helpful.
However, for relatively simple cases, some parsing tools are able to handle your kind of requirements without needing a separate tokenizing step. parsy is one of those. You build up parsers from smaller building blocks - there is good documentation to help.
An example of a parser done with parsy for your kind of grammar is here: http://parsy.readthedocs.io/en/latest/howto/other_examples.html#proto-file-parser .
It is significantly more complex than yours, but shows what is possible. Where whitespace is allowed (but not required), it uses the lexeme utility (defined at the top) to consume optional whitespace.
You may need to tighten up your understanding of where whitespace is necessary and where it is optional, and what kind of whitespace you really mean.
I'm using grako (a PEG parser generator library for python) to parse a simple declarative language where a document can contain one or more protocols.
Originally, I had the root rule for document written as:
document = {protocol}+ ;
This appropriately returns a list of protocols, but only gives helpful errors if a syntax error is in the first protocol. Otherwise, it silently discards the invalid protocol and everything after it.
I have also tried a few variations on:
document = protocol document | $ ;
But this doesn't result in a list if there's only one protocol, and doesn't give helpful error messages either, saying only no available options: (...) document if any of the protocols contains an error.
How do I write a rule that does both of the following?:
Always returns a list, even if there's only one protocol
Displays helpful error messages about the unsuccessful match, instead of just saying it's an invalid document or silently dropping the damaged protocol
This is the solution:
document = {protocol ~ }+ $ ;
If you don't add the $ for the parser to see the end of file, the parse will succeed with one or more protocol, even if there are more to parse.
Adding the cut expression (~) makes the parser commit to what was parsed in the closest option/choice in the parse (a closure is an option of X = a X|();). Additional cut expressions within what's parsed by protocol will make the error messages be closer to the expected points of failure in the input.
I want to take advantage of the cStyleComment variable, but rather than just ignoring these comments I want to process them specially. Is there any way to make pyparsing call my handler on the piece of input, which it recognizes as a comment, before it's going to be thrown away?
I'm processing some C code, which contain some "special" directives inside comments.
There is nothing inherent in any of the xxxStyleComment expressions that are defined in pyparsing that causes them to be ignored. They are there as a convenience, especially since some comment formats are easy to get wrong. They don't get ignored unless you call the ignore method on your larger grammar, as in:
cHeaderParser.ignore(cStyleComment)
(where cHeaderParser might be something you wrote to read through .h files to extract API information, for instance.)
And having pyparsing call back to a handler is built in: just use cStyleComment.setParseAction(commentHandler). Pyparsing can handle parse actions with any of these signatures:
def commentHandler(inputString, locn, tokens):
def commentHandler(locn, tokens):
def commentHandler(tokens):
def commentHandler():
If your commentHandler returns a string or list of strings, or a new ParseResults, these will be used to replace the input tokens - if it returns None, or omits the return statement, then the tokens object is used. You can also modify the tokens object in place (such as adding new results names).
So you could write something like this that would uppercase your comments:
def commentHandler(tokens):
    return tokens[0].upper()
cStyleComment.setParseAction(commentHandler)
(a parse action as simple as this could even be written cStyleComment.setParseAction(lambda t:t[0].upper()))
When writing a transforming parse action like this, one would likely use transformString rather than parseString,
print(cStyleComment.transformString(source))
This will print the original source, but all of the comments will be uppercased.
How can I automate a test to enforce that a body of Python 2.x code contains no string instances (only unicode instances)?
Eg.
Can I do it from within the code?
Is there a static analysis tool that has this feature?
Edit:
I wanted this for an application in Python 2.5, but it turns out this is not really possible because:
2.5 doesn't support unicode_literals
kwargs dictionary keys can't be unicode objects, only strings
So I'm accepting the answer that says it's not possible, even though it's for different reasons :)
You can't enforce that all strings are Unicode; even with from __future__ import unicode_literals in a module, byte strings can be written as b'...', as they can in Python 3.
There was an option that could be used to get the same effect as unicode_literals globally: the command-line option -U. However it was abandoned early in the 2.x series because it basically broke every script.
What is your purpose for this? It is not desirable to abolish byte strings. They are not “bad” and Unicode strings are not universally “better”; they are two separate animals and you will need both of them. Byte strings will certainly be needed to talk to binary files and network services.
If you want to be prepared to transition to Python 3, the best tack is to write b'...' for all the strings you really mean to be bytes, and u'...' for the strings that are inherently Unicode. The default string '...' format can be used for everything else: places where you don't care whether Python 3 changes the default string type.
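A short sketch of that convention, with made-up variable names and sample values. The b'...' and u'...' prefixes are accepted by Python 2.6+ and by Python 3.3+, so lines like these can be shared between both.

```python
payload = b"\x89PNG"                # really bytes: binary file or network data
label = u"caf\u00e9"                # inherently text: four Unicode characters

# Crossing the boundary between the two is always an explicit step.
encoded = label.encode("utf-8")     # text -> bytes
decoded = encoded.decode("utf-8")   # bytes -> text
```

Keeping the conversion explicit is what makes the two kinds of string easy to tell apart later.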
It seems to me like you really need to parse the code with an honest to goodness python parser. Then you will need to dig through the AST your parser produces to see if it contains any string literals.
It looks like Python comes with a parser out of the box. From this documentation I got this code sample working:
import parser
from token import tok_name
def checkForNonUnicode(codeString):
    return checkForNonUnicodeHelper(parser.suite(codeString).tolist())
def checkForNonUnicodeHelper(lst):
    returnValue = True
    nodeType = lst[0]
    if nodeType in tok_name and tok_name[nodeType] == 'STRING':
        stringValue = lst[1]
        if stringValue[0] != "u":  # Kind of hacky. Does this always work?
            print "%s is not unicode!" % stringValue
            returnValue = False
    else:
        for subNode in [lst[n] for n in range(1, len(lst))]:
            if isinstance(subNode, list):
                returnValue = returnValue and checkForNonUnicodeHelper(subNode)
    return returnValue
print checkForNonUnicode("""
def foo():
    a = 'This should blow up!'
""")
print checkForNonUnicode("""
def bar():
    b = u'although this is ok.'
""")
which prints out
'This should blow up!' is not unicode!
False
True
Now doc strings aren't unicode but should be allowed, so you might have to do something more complicated like from symbol import sym_name where you can look up which node types are for class and function definitions. Then the first sub-node that's simply a string, i.e. not part of an assignment or whatever, should be allowed to not be unicode.
Good question!
Edit
Just a follow up comment. Conveniently for your purposes, parser.suite does not actually evaluate your python code. This means that you can run this parser over your Python files without worrying about naming or import errors. For example, let's say you have myObscureUtilityFile.py that contains
from ..obscure.relative.path import whatever
You can
checkForNonUnicode(open('/whoah/softlink/myObscureUtilityFile.py').read())
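A different technique with the same goal, sketched here with the standard tokenize module instead of the parser module used above. Note that this sketch is written for Python 3, where it scans source text without executing it; the function name non_unicode_literals and the sample source are assumptions. Like the code above, it does not special-case docstrings.

```python
import io
import tokenize


def non_unicode_literals(source):
    """Return (line, literal) pairs for string literals without a u prefix."""
    hits = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type != tokenize.STRING:
            continue
        # The prefix is everything before the first quote character.
        quote_index = min(i for i in (tok.string.find("'"), tok.string.find('"'))
                          if i != -1)
        prefix = tok.string[:quote_index].lower()
        if "u" not in prefix:
            hits.append((tok.start[0], tok.string))
    return hits


sample = "a = 'plain'\nb = u'ok'\nc = b'raw'\n"
hits = non_unicode_literals(sample)
```

Looking at the whole prefix rather than only the first character avoids the "kind of hacky" check on stringValue[0], since prefixes such as U or combined prefixes are handled as well.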
Our SD Source Code Search Engine (SCSE) can provide this result directly.
The SCSE provides a way to search extremely quickly across large sets of files using some of the language structure to enable precise queries and minimize false positives. It handles a wide array of languages, even at the same time, including Python. A GUI shows search hits and a page of actual text from the file containing a selected hit.
It uses lexical information from the source languages as the basis for queries, comprised of various language keywords and pattern tokens that match varying content language elements. SCSE knows the types of lexemes available in the language. One can search for a generic identifier (using query token I) or an identifier matching some regular expression. Similarly, one can search for a generic string (using query token "S" for "any kind of string literal") or for a specific type of string (for Python including "UnicodeStrings", non-unicode strings, etc., which collectively make up the set of Python things comprising "S").
So a search:
'for' ... I=ij*
finds the keyword 'for' near ("...") an identifier whose prefix is "ij" and shows you all the hits. (Language-specific whitespace, including line breaks, and comments are ignored.)
A trivial search:
S
finds all string literals. This is often a pretty big set :-}
A search
UnicodeStrings
finds all string literals that are lexically defined as Unicode Strings (u"...")
What you want are all strings that aren't UnicodeStrings. The SCSE provides a "subtract" operator that subtracts hits of one kind that overlap hits of another. So your question, "what strings aren't unicode" is expressed concisely as:
S-UnicodeStrings
All hits shown will be the strings that aren't unicode strings, your precise question.
The SCSE provides logging facilities so that you can record hits. You can run SCSE from a command line, enabling a scripted query for your answer. Putting this into a command script would provide a tool that gives your answer directly.