I'm trying to get my head around how to write a grammar that will first parse an input for strings, and then, when strings are found, parse those strings themselves.
For example, if I had an input such as:
var1 = "world"
someVariable = "hello {{var1}}"
The result I want is for someVariable to be equal to "hello world".
Now, I understand how to write the grammar to set a variable to a string, but what I cannot figure out is how to parse that string for the mustache syntax in order to inject the value inside of var1.
Thanks in advance!
It's easier to do this in two steps:
Parse the input as usual (i.e. to determine the assignments, without analyzing the contents of the strings)
Then evaluate the assignments
While assigning a string to a variable, parse its contents with another parser (or perhaps even just with a regex, if the syntax is simple enough) to determine any replacements; a rough sketch follows below.
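As a rough sketch of that second step in Python (the variables dict and assign helper are illustrative names only; the actual assignments would come out of your parse tree):

import re

variables = {}

def assign(name, raw_string):
    # Replace every {{name}} occurrence with the value already stored for it.
    variables[name] = re.sub(
        r'\{\{(\w+)\}\}',
        lambda m: variables[m.group(1)],
        raw_string,
    )

assign('var1', 'world')
assign('someVariable', 'hello {{var1}}')
print(variables['someVariable'])  # hello world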
That is not what ANTLR does. ANTLR can surely parse your input, and even tokenise "hello {{var1}}" separately [1], but it does not evaluate var1 and substitute it. That is something you will need to do after ANTLR is done parsing [2].
[1] Check out the docs on lexical modes: https://github.com/antlr/antlr4/blob/master/doc/lexer-rules.md#lexical-modes
[2] This Q&A shows a simple example of how to evaluate something using a visitor: If/else statements in ANTLR using listeners
I can create a multi-line string using this syntax:
string = str("Some chars "
"Some more chars")
This will produce the following string:
"Some chars Some more chars"
Is Python joining these two separate strings or is the editor/compiler treating them as a single string?
P.S.: I just want to understand the internals. I know there are other ways to declare or create multi-line strings.
Read the reference manual; it's in there.
Specifically:
Multiple adjacent string or bytes literals (delimited by whitespace), possibly using different quoting conventions, are allowed, and their meaning is the same as their concatenation. Thus, "hello" 'world' is equivalent to "helloworld". This feature can be used to reduce the number of backslashes needed, to split long strings conveniently across long lines, or even to add comments to parts of strings.
(emphasis mine)
This is why:
string = str("Some chars "
"Some more chars")
is exactly the same as: str("Some chars Some more chars").
This action is performed wherever a string literal might appear: list initializations, function calls (as is the case with str above), et cetera.
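For instance (a small illustration of my own, not from the original answer), inside a list initialization only the adjacent literals fuse; a comma still separates elements:

names = ['some '
         'name',     # adjacent literals: one element, 'some name'
         'another']  # the comma starts a second element
assert names == ['some name', 'another']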
The only caveat is when a string literal is not contained within one of the grouping delimiters (), {} or [] but instead spans two separate physical lines. In that case we can alternatively use the backslash character to join these lines and get the same result:
string = "Some chars " \
"Some more chars"
Of course, concatenation of strings on the same physical line does not require the backslash. (string = "Hello " "World" is just fine)
Is Python joining these two separate strings or is the editor/compiler treating them as a single string?
Python is; now, when exactly Python does this is where things get interesting.
From what I could gather (take this with a pinch of salt; I'm not a parsing expert), this happens when Python transforms the parse tree (from its LL(1) parser) for a given expression into its corresponding AST (Abstract Syntax Tree).
You can get a view of the parsed tree via the parser module:
import parser

expr = """
str("Hello "
    "World")
"""
pexpr = parser.expr(expr)
parser.st2list(pexpr)
This dumps a pretty big and confusing list that represents the concrete syntax tree parsed from the expression in expr:
-- rest snipped for brevity --
[322,
 [323,
  [3, '"Hello "'],
  [3, '"World"']]]]]]]]]]]]]]]]]],
-- rest snipped for brevity --
The numbers correspond to either symbols or tokens in the parse tree; the mappings from symbol number to grammar rule and from token number to constant are in Lib/symbol.py and Lib/token.py respectively.
As you can see in the snipped version I added, there are two different entries corresponding to the two different string literals in the parsed expression.
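If you want to make that dump readable, a small helper of my own (not part of the original answer) can map each numeric node id to its name using those two modules:

import symbol
import token

def translate(node):
    # Replace each node's numeric id with its symbolic name.
    if isinstance(node, list):
        code = node[0]
        name = symbol.sym_name.get(code, token.tok_name.get(code, code))
        return [name] + [translate(child) for child in node[1:]]
    return node

translate(parser.st2list(pexpr))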
Next, we can view the AST produced for the same expression via the ast module provided in the Standard Library:
import ast

p = ast.parse(expr)
ast.dump(p)
# this prints out the following:
"Module(body=[Expr(value=Call(func=Name(id='str', ctx=Load()), args=[Str(s='Hello World')], keywords=[]))])"
The output is more user-friendly in this case; you can see that the args for the function call is the single concatenated string Hello World.
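You can also confirm the fusion happens at compile time, before any bytecode runs; this dis snippet is my own illustration:

import dis

# A single LOAD_CONST for 'Hello World' shows the two adjacent
# literals were merged into one constant during compilation.
dis.dis(compile('str("Hello " "World")', '<test>', 'eval'))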
In addition, I also stumbled upon a cool module that generates a visualization of the tree for ast nodes. Using it, the output of the expression expr is visualized like this:
(Image cropped to show only the relevant part for the expression.)
As you can see, in the terminal leaf node we have a single str object, the joined string for "Hello " and "World", i.e. "Hello World".
If you are feeling brave enough, dig into the source: the code for transforming expressions into a parse tree is located in Parser/pgen.c, while the code transforming the parse tree into an Abstract Syntax Tree is in Python/ast.c.
This information is for Python 3.5, and I'm pretty sure that unless you're using some really old version (< 2.5) the functionality and locations should be similar.
Additionally, if you are interested in the whole compilation sequence Python follows, a good gentle intro is provided by one of the core contributors, Brett Cannon, in the video "From Source to Code: How CPython's Compiler Works".
I have a list that I have successfully converted into a Python statement, for example:
from operator import mul,add,sub,abs
l = ['add(4,mul(3,abs(-3)))']
I was wondering what I would use to RUN this string as actual Python code. I should be expecting an output of 13. I want to input the 0th value of the list into a function that is able to run this value as actual Python code.
You don't want to run this as Python code. You're trying to parse expressions in some language that isn't Python, even if it may be superficially similar. Even if it's a proper subset of Python, using eval or exec is a bad idea, unless, say, __import__('os').system('rm -rf /') happens to be a string your language wants to handle by erasing the hard drive.
If the grammar is a proper subset of Python, you can still use the ast module for parsing, and then write your own interpreter for the parsed nodes.
However, I think what you really want to do here is build a very simple parser for your very simple language. This is a great opportunity to learn how to use a parsing library like pyparsing or a parser-generator tool like pybison, or to build a simple recursive-descent parser from scratch. But for something this simple, even basic string operations (splitting on/finding parentheses) should be sufficient.
Here's an intentionally stupid example (which you definitely shouldn't turn in if you want a good grade) to show how easy it is:
import operator

# Map names like 'add' and 'mul' directly to the operator module's functions.
OPERATORS = operator.__dict__

def evaluate_expression(expr):
    # Base case: the remaining expression is just an integer literal.
    try:
        return int(expr)
    except ValueError:
        pass
    # Split at the rightmost '(' to isolate the innermost call and its args.
    op, _, args = expr.rpartition('(')
    rest, _, thisop = op.rpartition(',')
    args = args.rstrip(')').split(',')
    argvalues = map(int, args)
    thisvalue = OPERATORS[thisop](*argvalues)
    if rest:
        # Splice the result back into the string and evaluate what's left.
        return evaluate_expression('{},{}'.format(rest, thisvalue))
    return thisvalue

while True:
    expr = input()
    print(evaluate_expression(expr))
Normally, you want to find the outermost expression, then evaluate it recursively—that's a lot easier than finding the rightmost, substituting it into the string, and evaluating the result recursively. Again, I'm just showing how easy it is to do even if you don't do it the easy way.
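For comparison, here is a sketch of that outermost-first approach (split_args and evaluate are my own illustrative names, not from the answer above):

import operator

OPERATORS = operator.__dict__

def split_args(body):
    # Split 'a,b(c,d),e' on commas at parenthesis depth 0 only.
    args, depth, start = [], 0, 0
    for i, ch in enumerate(body):
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
        elif ch == ',' and depth == 0:
            args.append(body[start:i])
            start = i + 1
    args.append(body[start:])
    return args

def evaluate(expr):
    expr = expr.strip()
    try:
        return int(expr)  # base case: a bare integer
    except ValueError:
        pass
    name, _, body = expr.partition('(')
    body = body[:-1]  # drop the matching closing ')'
    return OPERATORS[name](*(evaluate(arg) for arg in split_args(body)))

print(evaluate('add(4,mul(3,abs(-3)))'))  # 13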
Use exec like this:
exec('add(4,mul(3,abs(-3)))')
That should work, though note that exec runs the call and discards its value; to actually see the 13 you'd need eval, as the other answers show.
More about exec.
If you want to evaluate a Python expression, use eval. This returns the value of the evaluated expression. So, for example:
>>> eval(l[0])
13
>>> results = [eval(expr) for expr in l]
>>> results
[13]
However, any time you find yourself using eval (or exec or related functionality), you're almost always doing something wrong. This blog post explains some of the reasons why.
Since you're evaluating an expression, eval suits you better than exec. Example:
x = -3
y = eval('add(4,mul(3,abs(x)))')
print(y)
Note the security implications of exec and eval: since they can execute arbitrary code, they can, for example, delete all files you have access to, install Trojans in your documents, and so on.
Check out also ast.literal_eval for Python 2.6+.
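ast.literal_eval alone will not handle function calls, but the "parse with ast, then interpret the nodes yourself" route mentioned earlier can. Here is a sketch under stated assumptions: the whitelist and helper names are mine, and it targets Python 3.8+, where numbers parse as ast.Constant:

import ast
import operator

# Only these call names are allowed; everything else raises.
ALLOWED = {'add': operator.add, 'mul': operator.mul,
           'sub': operator.sub, 'abs': abs}

def safe_eval(source):
    return _eval(ast.parse(source, mode='eval').body)

def _eval(node):
    if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
        func = ALLOWED[node.func.id]  # KeyError for anything not whitelisted
        return func(*(_eval(arg) for arg in node.args))
    if isinstance(node, ast.Constant) and isinstance(node.value, int):
        return node.value
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_eval(node.operand)  # literals like -3 parse as USub(3)
    raise ValueError('disallowed expression')

print(safe_eval('add(4,mul(3,abs(-3)))'))  # 13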
I cannot find a way to serially search a string and append replacements. Let's say I am implementing a templating language. A simplified template looks something like this:
Hello words on #DATE# in #COUNTRY# on this beautiful day.
Imagine a very long template with many #SOMETHING# tags. Now I want to use a regex to parse through this and, every time I find #SOMETHING#, do some Python logic, replace it with some string, append it, and continue. All I have found is that I can break the string up into tokens and matches and then reassemble it. Is there something better, without generating all those string chunks? Maybe I am trying to optimize too early, but in Java we have the
appendReplacement(StringBuffer,String) and appendTail(StringBuffer)
methods, and I was wondering if something similar can be done in Python.
See http://docs.oracle.com/javase/tutorial/essential/regex/matcher.html
You can use a function as the "replacement" in re.sub. Then re.sub will invoke your function for every match in the string, and the return value of the function will be the replacement in the string.
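A minimal sketch (the values dict and the tag pattern are my own assumptions about the templating syntax):

import re

values = {'DATE': 'March 1st', 'COUNTRY': 'Norway'}  # hypothetical lookups

template = 'Hello words on #DATE# in #COUNTRY# on this beautiful day.'

def expand(match):
    # Called once per match; its return value is spliced into the result.
    return values[match.group(1)]

print(re.sub(r'#(\w+)#', expand, template))
# Hello words on March 1st in Norway on this beautiful day.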
How can I automate a test to enforce that a body of Python 2.x code contains no string instances (only unicode instances)?
E.g.:
Can I do it from within the code?
Is there a static analysis tool that has this feature?
Edit:
I wanted this for an application in Python 2.5, but it turns out this is not really possible because:
2.5 doesn't support unicode_literals
kwargs dictionary keys can't be unicode objects, only strings
So I'm accepting the answer that says it's not possible, even though it's for different reasons :)
You can't enforce that all strings are Unicode; even with from __future__ import unicode_literals in a module, byte strings can be written as b'...', as they can in Python 3.
There was an option that could be used to get the same effect as unicode_literals globally: the command-line option -U. However it was abandoned early in the 2.x series because it basically broke every script.
What is your purpose for this? It is not desirable to abolish byte strings. They are not “bad” and Unicode strings are not universally “better”; they are two separate animals and you will need both of them. Byte strings will certainly be needed to talk to binary files and network services.
If you want to be prepared to transition to Python 3, the best tack is to write b'...' for all the strings you really mean to be bytes, and u'...' for the strings that are inherently Unicode. The default string '...' format can be used for everything else: places where you don't care, or where it doesn't matter that Python 3 changes the default string type.
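To illustrate that convention on Python 2.6+ (2.5 lacks the future import, as noted above), a small sketch of my own, with arbitrary names:

from __future__ import unicode_literals  # plain '...' literals become unicode

payload = b'\x00\x01'     # explicitly bytes: files, sockets, wire formats
greeting = u'hello'       # explicitly text
label = 'anything else'   # also unicode, thanks to the future import
assert isinstance(label, unicode)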
It seems to me like you really need to parse the code with an honest-to-goodness Python parser. Then you will need to dig through the tree your parser produces to see if it contains any string literals.
It looks like Python comes with a parser out of the box. From this documentation I got this code sample working:
import parser
from token import tok_name

def checkForNonUnicode(codeString):
    return checkForNonUnicodeHelper(parser.suite(codeString).tolist())

def checkForNonUnicodeHelper(lst):
    returnValue = True
    nodeType = lst[0]
    if nodeType in tok_name and tok_name[nodeType] == 'STRING':
        stringValue = lst[1]
        if stringValue[0] != "u":  # Kind of hacky. Does this always work?
            print "%s is not unicode!" % stringValue
            returnValue = False
    else:
        for subNode in [lst[n] for n in range(1, len(lst))]:
            if isinstance(subNode, list):
                returnValue = returnValue and checkForNonUnicodeHelper(subNode)
    return returnValue

print checkForNonUnicode("""
def foo():
    a = 'This should blow up!'
""")

print checkForNonUnicode("""
def bar():
    b = u'although this is ok.'
""")
which prints out
'This should blow up!' is not unicode!
False
True
Now, docstrings aren't unicode but should be allowed, so you might have to do something more complicated, like from symbol import sym_name, where you can look up which node types are class and function definitions. Then the first sub-node that's simply a string, i.e. not part of an assignment or whatever, should be allowed to not be unicode.
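For example (my own fragment, not from the answer), the numeric codes for those definition nodes can be looked up like this:

from symbol import sym_name

# Invert the mapping so the walker can test node ids by name,
# e.g. to skip the first bare STRING under a funcdef (the docstring).
name_to_code = dict((name, code) for code, name in sym_name.items())
FUNCDEF = name_to_code['funcdef']
CLASSDEF = name_to_code['classdef']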
Good question!
Edit
Just a follow-up comment. Conveniently for your purposes, parser.suite does not actually evaluate your Python code. This means that you can run this parser over your Python files without worrying about naming or import errors. For example, let's say you have myObscureUtilityFile.py that contains
from ..obscure.relative.path import whatever
You can
checkForNonUnicode(open('/whoah/softlink/myObscureUtilityFile.py').read())
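Extending that idea (a hypothetical driver, not from the answer; 'src' is an assumed root directory), you could sweep a whole source tree:

import os

for dirpath, dirnames, filenames in os.walk('src'):
    for filename in filenames:
        if filename.endswith('.py'):
            path = os.path.join(dirpath, filename)
            checkForNonUnicode(open(path).read())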
Our SD Source Code Search Engine (SCSE) can provide this result directly.
The SCSE provides a way to search extremely quickly across large sets of files, using some of the language structure to enable precise queries and minimize false positives. It handles a wide array of languages, even at the same time, including Python. A GUI shows search hits and a page of actual text from the file containing a selected hit.
It uses lexical information from the source languages as the basis for queries, composed of various language keywords and pattern tokens that match varying language elements. SCSE knows the types of lexemes available in the language. One can search for a generic identifier (using the query token I) or an identifier matching some regular expression. Similarly, one can search for a generic string (using the query token "S" for "any kind of string literal") or for a specific type of string (for Python, including "UnicodeStrings", non-Unicode strings, etc., which collectively make up the set of Python things comprising "S").
So a search:
'for' ... I=ij*
finds the keyword 'for' near ("...") an identifier whose prefix is "ij" and shows you all the hits. (Language-specific whitespace, including line breaks and comments, is ignored.)
A trivial search:
S
finds all string literals. This is often a pretty big set :-}
A search
UnicodeStrings
finds all string literals that are lexically defined as Unicode Strings (u"...")
What you want is all strings that aren't UnicodeStrings. The SCSE provides a "subtract" operator that subtracts hits of one kind that overlap hits of another. So your question, "what strings aren't unicode", is expressed concisely as:
S-UnicodeStrings
All hits shown will be the strings that aren't unicode strings, your precise question.
The SCSE provides logging facilities so that you can record hits. You can run SCSE from a command line, enabling a scripted query for your answer. Putting this into a command script would provide a tool that gives your answer directly.