Pyparsing: how to implement special processing of C-style comments? - python

I want to take advantage of the cStyleComment variable, but rather than just ignoring these comments I want to process them specially. Is there any way to make pyparsing call my handler on the piece of input, which it recognizes as a comment, before it's going to be thrown away?
I'm processing some C code, which contains some "special" directives inside comments.

There is nothing inherent in any of the xxxStyleComment expressions that are defined in pyparsing that causes them to be ignored. They are there as a convenience, especially since some comment formats are easy to get wrong. They don't get ignored unless you call the ignore method on your larger grammar, as in:
cHeaderParser.ignore(cStyleComment)
(where cHeaderParser might be something you wrote to read through .h files to extract API information, for instance.)
And having pyparsing call back to a handler is built in: just use cStyleComment.setParseAction(commentHandler). Pyparsing can handle parse actions with any of these signatures:
def commentHandler(inputString, locn, tokens):
def commentHandler(locn, tokens):
def commentHandler(tokens):
def commentHandler():
If your commentHandler returns a string or list of strings, or a new ParseResults, these will be used to replace the input tokens - if it returns None, or omits the return statement, then the tokens object is used. You can also modify the tokens object in place (such as adding new results names).
So you could write something like this that would uppercase your comments:
def commentHandler(tokens):
    return tokens[0].upper()

cStyleComment.setParseAction(commentHandler)
(a parse action as simple as this could even be written cStyleComment.setParseAction(lambda t:t[0].upper()))
When writing a transforming parse action like this, one would likely use transformString rather than parseString:
print cStyleComment.transformString(source)
This will print the original source, but all of the comments will be uppercased.
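Putting the pieces together, here is a minimal runnable sketch (the C source snippet and the `seen` list are made up for illustration) that records each comment via the parse action while uppercasing it in the transformed output:

```python
from pyparsing import cStyleComment

seen = []  # comments captured by the handler

def commentHandler(tokens):
    seen.append(tokens[0])    # tokens[0] is the full comment text
    return tokens[0].upper()  # replacement used by transformString

cStyleComment.setParseAction(commentHandler)

source = """int x = 0; /* special: init */
/* plain comment */
"""
result = cStyleComment.transformString(source)
print(result)
```

From here, the handler could just as well scan `tokens[0]` for your "special" directives instead of uppercasing.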

Related

How to deconstruct a Python-like function call?

Suppose I had a function call as a string, like "log(2, floor(9.4))". I want to deconstruct the call in a way that lets me access the function name and the arguments of the outermost call, and that correctly treats a nested function call such as floor(9.4) as a single argument.
For example, the arguments when deconstructing the string above would come to [2, floor(9.4)]
I've already tried to use some string parsing techniques (e.g. splitting on commas), but it doesn't appear to be working.
You can use the ast module:
import ast
data = "log(2, floor(9.4))"
parse_tree = ast.parse(data)
# ast.unparse() is for 3.9+ only.
# If using an earlier version, use the astunparse package instead.
result = [ast.unparse(node) for node in parse_tree.body[0].value.args]
print(result)
This outputs:
['2', 'floor(9.4)']
I pulled the value to iterate over from manually inspecting the output of ast.dump(parse_tree).
Note that I've written something a bit quick and dirty, since there's only one string to parse. If you're looking to parse a lot of these strings (or a larger program), you should create a subclass of ast.NodeVisitor. If you want to also make modifications to the source code, you should create a subclass of ast.NodeTransformer instead.
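For instance, a minimal NodeVisitor subclass (the class and attribute names here are my own) that records the function name and argument sources of every call, including nested ones, might look like:

```python
import ast

class CallCollector(ast.NodeVisitor):
    """Record (function name, argument sources) for every call found."""
    def __init__(self):
        self.calls = []

    def visit_Call(self, node):
        # ast.unparse requires Python 3.9+
        name = ast.unparse(node.func)
        args = [ast.unparse(arg) for arg in node.args]
        self.calls.append((name, args))
        self.generic_visit(node)  # descend into nested calls too

collector = CallCollector()
collector.visit(ast.parse("log(2, floor(9.4))"))
print(collector.calls)  # [('log', ['2', 'floor(9.4)']), ('floor', ['9.4'])]
```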

How to use gettext with python >3.6 f-strings

Previously you would use gettext as following:
_('Hey {},').format(username)
but what about Python's new f-strings?
f'Hey {username}'
'Hey {},' is contained in your translation dictionary as is.
If you use f'Hey {username},', that creates another string, which won't be translated.
In that case, the format method remains the only usable one, but you can approximate the f-string behaviour by using named parameters:
_('Hey {username},').format(username=username)
or if you have a dictionary containing your data, this cool trick where format picks the required information in the input dictionary:
d = {"username":"John", "city":"New York", "unused":"doesn't matter"}
_('Hey {username} from {city},').format(**d)
My solution is to make a function f() which performs the f-string interpolation after gettext has been called.
from inspect import currentframe

def f(s):
    frame = currentframe().f_back
    return eval(f"f{s!r}", frame.f_globals, frame.f_locals)
Now you just wrap _(...) in f() and don’t preface the string with an f:
f(_('Hey, {username}'))
Note of caution
I’m usually against the use of eval as it could make the function potentially unsafe, but I personally think it should be justified here, so long as you’re aware of what’s being formatted. That said use at your own risk.
Remember
This isn’t a perfect solution, it is just my solution. As PEP 498 notes, the existing formatting approaches each “have their advantages, but in addition have disadvantages”, and this one is no exception.
For example, if you change the expression inside the string, it will no longer match the entry in your .po file and therefore won’t be translated until you update that file too. Also, if you’re not the one translating the strings and you use an expression whose outcome is hard to decipher, that can cause miscommunication or other issues in translation.
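A self-contained sketch of this approach (with a no-op stand-in for gettext's _, since no translation catalog is loaded here):

```python
from inspect import currentframe

def f(s):
    """Interpolate s as if it were an f-string, in the caller's scope."""
    frame = currentframe().f_back
    return eval(f"f{s!r}", frame.f_globals, frame.f_locals)

_ = lambda s: s  # stand-in for gettext's _; a real app would install gettext

def greet():
    username = "Ada"
    return f(_("Hey, {username}"))

print(greet())  # Hey, Ada
```

The eval rebuilds the translated string as a real f-string literal, so it is evaluated with the caller's globals and locals, which is what makes `{username}` resolve.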

PyYAML variables in multiline

I'm trying to get a multi-line comment to use variables in PyYAML but not sure if this is even possible.
So, in YAML, you can assign a variable like:
current_host: &hostname myhost
But it doesn't seem to expand in the following:
test: |
  Hello, this is my string
  which is running on *hostname
Is this at all possible or am I going to have to use Python to parse it?
The anchor (&some_id) and alias (*some_id) mechanism is essentially meant to share complete nodes between parts of the tree that a YAML text represents. This is e.g. necessary in order to have one and the same complex item (a sequence/list resp. mapping/dict) that occurs twice in a list load as one and the same item (instead of as two copies with the same values).
So yes, you need to do the parsing in Python. You could start with the mechanism I provided in this answer and change the test
if node.value and node.value.startswith(self.d['escape'])
to find the escape character in any place in the scalar and take appropriate action.
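As a sketch of doing the substitution in Python after loading instead (using a str.format-style placeholder rather than a YAML alias, which is purely my own convention here, and assuming PyYAML is installed):

```python
import yaml

doc = yaml.safe_load("""
current_host: &hostname myhost
test: |
  Hello, this is my string
  which is running on {hostname}
""")

# An alias like *hostname cannot expand inside a scalar, so expand a
# format-style placeholder manually after loading:
expanded = doc["test"].format(hostname=doc["current_host"])
print(expanded)
```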
You can also find an answer to this elsewhere: just use a + between the lines, and your strings need to be enclosed in 's.

lexer error-handling PLY Python

The t_error() function is used to handle lexing errors that occur when illegal characters are detected. My question is: How can I use this function to get more specific information on errors? Like error type, in which rule or section the error appears, etc.
In general, there is only very limited information available to the t_error() function. As input, it receives a token object where the value has been set to the remaining input text. Analysis of that text is entirely up to you. You can use the t.lexer.skip(n) function to have the lexer skip ahead by a certain number of characters and that's about it.
There is no notion of an "error type" other than the fact that there is an input character that does not match the regular expression of any known token. Since the lexer is decoupled from the parser, there is no direct way to get any information about the state of the parsing engine or to find out what grammar rule is being parsed. Even if you could get the state (which would simply be the underlying state number of the LALR state machine), interpretation of it would likely be very difficult since the parser could be in the intermediate stages of matching dozens of possible grammar rules looking for reduce actions.
My advice is as follows: If you need additional information in the t_error() function, you should set up some kind of object that is shared between the lexer and parser components of your code. You should explicitly make different parts of your compiler update that object as needed (e.g., it could be updated in specific grammar rules).
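That shared object might be sketched like this (all names below are invented for illustration; nothing here is part of the PLY API):

```python
class ParseContext:
    """State shared between the lexer and parser callbacks."""
    def __init__(self):
        self.current_rule = None  # updated by grammar rule actions
        self.errors = []

    def record_error(self, value, lineno):
        self.errors.append("illegal character %r at line %d while parsing %r"
                           % (value, lineno, self.current_rule))

ctx = ParseContext()

# Inside a p_declaration(p) grammar rule you would set:
ctx.current_rule = "declaration"
# and inside t_error(t) you would call something like:
ctx.record_error("$", 3)
print(ctx.errors[0])
```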
Just as an aside, there are usually very few courses of action for a bad token. Essentially, you're getting input text that doesn't match any known part of the language alphabet (e.g., no known symbol). As such, there's not even any kind of token value you can give to the parser. Usually, the only course of action is to report the bad input, throw it out, and continue.
As a followup to Raymond's answer, I would also not advise modifying any attribute of the lexer object in t_error().
Ply includes an example ANSI-C style lexer in a file called cpp.py. It has an example of how to extract some information out of t_error():
def t_error(t):
    t.type = t.value[0]
    t.value = t.value[0]
    t.lexer.skip(1)
    return t
In that function, you can also access the lexer's public attributes:
lineno - Current line number
lexpos - Current position in the input string
There are also some other attributes that aren't listed as public but may provide some useful diagnostics:
lexstate - Current lexer state
lexstatestack - Stack of lexer states
lexstateinfo - State information
lexerrorf - Error rule (if any)
There is indeed a way of managing errors in PLY; take a look at this very interesting presentation:
http://www.slideshare.net/dabeaz/writing-parsers-and-compilers-with-ply
and at chapter 6.8.1 of
http://www.dabeaz.com/ply/ply.html#ply_nn3

Python 2.x: how to automate enforcing unicode instead of string?

How can I automate a test to enforce that a body of Python 2.x code contains no string instances (only unicode instances)?
Eg.
Can I do it from within the code?
Is there a static analysis tool that has this feature?
Edit:
I wanted this for an application in Python 2.5, but it turns out this is not really possible because:
2.5 doesn't support unicode_literals
kwargs dictionary keys can't be unicode objects, only strings
So I'm accepting the answer that says it's not possible, even though it's for different reasons :)
You can't enforce that all strings are Unicode; even with from __future__ import unicode_literals in a module, byte strings can be written as b'...', as they can in Python 3.
There was an option that could be used to get the same effect as unicode_literals globally: the command-line option -U. However it was abandoned early in the 2.x series because it basically broke every script.
What is your purpose for this? It is not desirable to abolish byte strings. They are not “bad” and Unicode strings are not universally “better”; they are two separate animals and you will need both of them. Byte strings will certainly be needed to talk to binary files and network services.
If you want to be prepared for the transition to Python 3, the best tack is to write b'...' for all the strings you really mean to be bytes, and u'...' for the strings that are inherently Unicode. The default '...' form can be used for everything else, in places where you don't care and/or where Python 3's change of the default string type doesn't matter.
It seems to me like you really need to parse the code with an honest to goodness python parser. Then you will need to dig through the AST your parser produces to see if it contains any string literals.
It looks like Python comes with a parser out of the box. From this documentation I got this code sample working:
import parser
from token import tok_name
def checkForNonUnicode(codeString):
    return checkForNonUnicodeHelper(parser.suite(codeString).tolist())

def checkForNonUnicodeHelper(lst):
    returnValue = True
    nodeType = lst[0]
    if nodeType in tok_name and tok_name[nodeType] == 'STRING':
        stringValue = lst[1]
        if stringValue[0] != "u": # Kind of hacky. Does this always work?
            print "%s is not unicode!" % stringValue
            returnValue = False
    else:
        for subNode in [lst[n] for n in range(1, len(lst))]:
            if isinstance(subNode, list):
                returnValue = returnValue and checkForNonUnicodeHelper(subNode)
    return returnValue

print checkForNonUnicode("""
def foo():
    a = 'This should blow up!'
""")
print checkForNonUnicode("""
def bar():
    b = u'although this is ok.'
""")
which prints out
'This should blow up!' is not unicode!
False
True
Now doc strings aren't unicode but should be allowed, so you might have to do something more complicated like from symbol import sym_name where you can look up which node types are for class and function definitions. Then the first sub-node that's simply a string, i.e. not part of an assignment or whatever, should be allowed to not be unicode.
Good question!
Edit
Just a follow up comment. Conveniently for your purposes, parser.suite does not actually evaluate your python code. This means that you can run this parser over your Python files without worrying about naming or import errors. For example, let's say you have myObscureUtilityFile.py that contains
from ..obscure.relative.path import whatever
You can
checkForNonUnicode(open('/whoah/softlink/myObscureUtilityFile.py').read())
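Note that the parser module used above was removed in Python 3.10. On later Pythons, a similar check can be sketched with the stdlib tokenize module (the function name is my own; the u-prefix test mirrors the hack above, and correctly flags unprefixed and b'...' literals, which are both str in Python 2):

```python
import io
import tokenize

def non_unicode_literals(source):
    """Return string literals lacking a u prefix (str rather than unicode, in Python 2 terms)."""
    hits = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.STRING and not tok.string.startswith(("u", "U")):
            hits.append(tok.string)
    return hits

print(non_unicode_literals("a = 'This should blow up!'\nb = u'although this is ok.'\n"))
```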
Our SD Source Code Search Engine (SCSE) can provide this result directly.
The SCSE provides a way to search extremely quickly across large sets of files using some of the language structure to enable precise queries and minimize false positives. It handles a wide array
of languages, even at the same time, including Python. A GUI shows search hits and a page of actual text from the file containing a selected hit.
It uses lexical information from the source languages as the basis for queries, composed of various language keywords and pattern tokens that match the varying content of language elements. SCSE knows the types of lexemes available in the language. One can search for a generic identifier (using the query token I) or for an identifier matching some regular expression. Similarly, one can search for a generic string (using the query token "S" for "any kind of string literal") or for a specific type of string (for Python, this includes "UnicodeStrings", non-Unicode strings, etc., which collectively make up the set of Python things comprising "S").
So a search:
'for' ... I=ij*
finds the keyword 'for' near ("...") an identifier whose prefix is "ij" and shows you all the hits. (Language-specific whitespace, including line breaks and comments, is ignored.)
A trivial search:
S
finds all string literals. This is often a pretty big set :-}
A search
UnicodeStrings
finds all string literals that are lexically defined as Unicode Strings (u"...")
What you want are all strings that aren't UnicodeStrings. The SCSE provides a "subtract" operator that subtracts hits of one kind that overlap hits of another. So your question, "what strings aren't unicode" is expressed concisely as:
S-UnicodeStrings
All hits shown will be the strings that aren't Unicode strings, which is precisely what you asked for.
The SCSE provides logging facilities so that you can record hits. You can run SCSE from a command line, enabling a scripted query for your answer. Putting this into a command script would provide a tool that gives your answer directly.
