Python string literal concatenation

I can create a multi-line string using this syntax:
string = str("Some chars "
"Some more chars")
This will produce the following string:
"Some chars Some more chars"
Is Python joining these two separate strings or is the editor/compiler treating them as a single string?
P.s: I just want to understand the internals. I know there are other ways to declare or create multi-line strings.

Read the reference manual; it's in there.
Specifically:
Multiple adjacent string or bytes literals (delimited by whitespace), possibly using different quoting conventions, are allowed, and their meaning is the same as their concatenation. Thus, "hello" 'world' is equivalent to "helloworld". This feature can be used to reduce the number of backslashes needed, to split long strings conveniently across long lines, or even to add comments to parts of strings.
(emphasis mine)
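The follow-on example in the manual illustrates the comment use: implicit concatenation lets you annotate individual pieces of a long regex (this is the docs' own snippet):
import re

re.compile("[A-Za-z_]"       # letter or underscore
           "[A-Za-z0-9_]*"   # letter, digit or underscore
          )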
This is why:
string = str("Some chars "
"Some more chars")
is exactly the same as: str("Some chars Some more chars").
This happens wherever a string literal might appear: list initializations, function calls (as is the case with str above), et cetera.
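For instance, a minimal sketch of the same behaviour inside a list literal (the names here are chosen just for illustration):
names = ["Ada "
         "Lovelace",          # a single element: 'Ada Lovelace'
         "Grace " 'Hopper']   # quoting styles may even be mixed
print(len(names))  # 2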
The only caveat is when a string literal is not contained within one of the grouping delimiters (), {} or [] but instead spans two separate physical lines. In that case we can alternatively use the backslash character to join the lines and get the same result:
string = "Some chars " \
"Some more chars"
Of course, concatenation of strings on the same physical line does not require the backslash (string = "Hello " "World" is just fine).
Is Python joining these two separate strings or is the editor/compiler treating them as a single string?
Python is; exactly when Python does this is where things get interesting.
From what I could gather (take this with a pinch of salt; I'm not a parsing expert), this happens when Python transforms the parse tree (produced by its LL(1) parser) for a given expression into its corresponding AST (Abstract Syntax Tree).
You can get a view of the parsed tree via the parser module:
import parser
expr = """
str("Hello "
"World")
"""
pexpr = parser.expr(expr)
parser.st2list(pexpr)
This dumps a pretty big and confusing list that represents the concrete syntax tree parsed from the expression in expr:
-- rest snipped for brevity --
[322,
[323,
[3, '"hello"'],
[3, '"world"']]]]]]]]]]]]]]]]]],
-- rest snipped for brevity --
The numbers correspond to either symbols or tokens in the parse tree, and the mappings from symbol number to grammar rule and from token number to constant are in Lib/symbol.py and Lib/token.py respectively.
As you can see in the snipped version I added, you have two different entries corresponding to the two different str literals in the expression parsed.
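To make those numbers readable, you can translate them with the mappings those modules export. A small sketch (this works on versions that still ship the parser and symbol modules; both were removed in Python 3.10):
import symbol, token

def name_of(num):
    # Translate a CST node number into its grammar-symbol or token name.
    return symbol.sym_name.get(num) or token.tok_name.get(num)

print(name_of(3))  # 'STRING'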
Next, we can view the AST produced for the previous expression via the ast module provided in the Standard Library:
import ast

p = ast.parse(expr)
ast.dump(p)
# this prints out the following:
"Module(body=[Expr(value=Call(func=Name(id='str', ctx=Load()), args=[Str(s='Hello World')], keywords=[]))])"
The output is more user friendly in this case; you can see that the single argument for the function call is the already-concatenated string 'Hello World'.
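The joining can also be confirmed from the compiled bytecode, which carries the single joined constant and performs no runtime concatenation (exact opcode names vary across versions):
import dis

dis.dis(compile('str("Hello " "World")', '<test>', 'eval'))
# the disassembly loads one constant, 'Hello World'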
In addition, I also stumbled upon a cool module that generates a visualization of the tree for ast nodes. Using it, the output of the expression expr is visualized like this:
Image cropped to show only the relevant part for the expression.
As you can see, in the terminal leaf node we have a single str object, the joined string for "Hello " and "World", i.e. "Hello World".
If you are feeling brave enough, dig into the source: the code for transforming expressions into a parse tree is located in Parser/pgen.c, while the code transforming the parse tree into an Abstract Syntax Tree is in Python/ast.c.
This information applies to Python 3.5, and unless you're using a really old version (< 2.5) the functionality and locations should be similar.
Additionally, if you are interested in the whole compilation pipeline Python follows, a good gentle intro is provided by one of the core contributors, Brett Cannon, in the video From Source to Code: How CPython's Compiler Works.

Related

Why do two or more string arguments without commas separating them results in concatenation when given in argument list [duplicate]


Parsing Tokens inside of Strings in Antlr4

I'm trying to get my head around how to write a grammar which will first parse an input for strings, and then, when strings are found, parse those strings.
For example, if I had an input such as:
var1 = "world"
someVariable = "hello {{var1}}"
The result I want is for someVariable to be equal to "hello world".
Now, I understand how to write the grammar to set a variable to a string, but what I cannot figure out is how to parse that string for the mustache syntax in order to inject the value inside of var1.
Thanks in advance!
It's easier to do this in two steps:
1. Parse the input as usual (i.e. determine the assignments without analyzing the contents of the strings).
2. Then evaluate the assignments: while assigning a string to a variable, parse its contents with another parser (or perhaps even just with a regex, if the syntax is simple enough) to determine any replacements, as in the sketch below.
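If the placeholder syntax really is just {{name}}, that second pass can be a plain regex substitution. A minimal Python sketch (interpolate and env are hypothetical names, not part of any ANTLR API):
import re

def interpolate(template, env):
    # Replace each {{name}} placeholder with its value from env.
    return re.sub(r'\{\{(\w+)\}\}', lambda m: env[m.group(1)], template)

env = {'var1': 'world'}
print(interpolate('hello {{var1}}', env))  # hello world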
That is not what ANTLR does. ANTLR can surely parse your input, and even tokenise "hello {{var1}}" separately [1], but it does not evaluate var1 and substitute it. That is something you will need to do after ANTLR is done parsing [2].
[1] Check out the docs on lexical modes: https://github.com/antlr/antlr4/blob/master/doc/lexer-rules.md#lexical-modes
[2] This Q&A shows a simple example of how to evaluate something using a visitor: If/else statements in ANTLR using listeners

"""Different""", "quote", 'types' in function?

I am trying to figure out if the different quote types make a difference functionally. I have seen people say it's a matter of preference between "" and '', but what about """ """? I tested it in a simple piece of code to see if it would work, and it does. I was wondering if """triple quotes""" have a functional purpose for defined function arguments, or if they are just another quote option that can be used interchangeably, like "" and ''.
While I have seen many people post about "" and '', I have not seen a post about """ """ or ''' ''' being used in functions.
My question is: does the triple quote have a unique use as an argument, or is it simply interchangeable with "" and ''? The reason I think it might have a unique function is that it is a multi-line quote, and I was wondering whether it would allow a multi-line argument to be submitted. I am not sure something like that would even be useful, but it could be.
Here is an example that prints out what you would expect using all the quote types I know of.
def myFun(var1="""different""",var2="quote",var3='types'):
return var1, var2, var3
print (myFun('All','''for''','one!'))
Result:
('All', 'for', 'one!')
EDIT:
After some more testing of the triple quote I did find some variation in how it works using return vs printing in the function.
def myFun(var1="""different""", var2="""quote""", var3='types'):
    return (var1, var2, var3)

print(myFun('This',
            '''Can
Be
Multi''',
            'line!'))
Result:
('This', 'Can\nBe\nMulti', 'line!')
Or:
def myFun(var1="""different""", var2="""quote""", var3='types'):
    print(var1, var2, var3)

myFun('This',
      '''Can
Be
Multi''',
      'line!')
Result:
This Can
Be
Multi line!
From the docs:
String literals can be enclosed in matching single quotes (') or double quotes ("). They can also be enclosed in matching groups of three single or double quotes (these are generally referred to as triple-quoted strings). [...other rules applying identically to all string literal types omitted...]
In triple-quoted strings, unescaped newlines and quotes are allowed (and are retained), except that three unescaped quotes in a row terminate the string. (A “quote” is the character used to open the string, i.e. either ' or ".)
Thus, triple-quoted string literals can span multiple lines and can contain literal quotes without escape sequences, but are otherwise identical to string literals expressed with other quoting types (including those using escape sequences such as \n or \' to express the same content).
Also see the Python 3 documentation: Bytes and String Literals -- which expresses an effectively identical set of rules with slightly different verbiage.
A more gentle introduction is also available in the language tutorial, which explicitly introduces triple-quotes as a way to permit strings to span multiple lines:
String literals can span multiple lines. One way is using triple-quotes: """...""" or '''...'''. End of lines are automatically included in the string, but it’s possible to prevent this by adding a \ at the end of the line. The following example:
print("""\
Usage: thingy [OPTIONS]
-h Display this usage message
-H hostname Hostname to connect to
""")
produces the following output (note that the initial newline is not included):
Usage: thingy [OPTIONS]
-h Display this usage message
-H hostname Hostname to connect to
To be clear, though: these are different syntaxes, but the string literals they create are indistinguishable from each other. That is to say, given the following code:
s1 = '''foo
'bar'
baz
'''
s2 = 'foo\n\'bar\'\nbaz\n'
there's no possible way to tell s1 and s2 apart from each other by looking at their values: s1 == s2 is true, and so is repr(s1) == repr(s2). The Python interpreter is even allowed to intern them to the same value, so it may (or may not) make id(s1) == id(s2) true depending on details (such as whether the code was run at the REPL or imported as a module).
FWIW, my understanding is that there's a convention whereby """ """ or ''' ''' is used for docstrings, which are kind of like a # comment but form a recallable attribute that can be referenced later. https://www.python.org/dev/peps/pep-0257/
I'm a beginner too, but my understanding is that using triple quotes for ordinary strings is not the best idea, even if there's little functional difference. Sticking with conventions helps others read and use your code, and some conventions will bite you if you don't follow them. In this case, a badly placed triple-quoted string may be interpreted as a docstring and not throw an error, and you'll need to search through a bunch of code to find the issue.

Deobfuscation: Simplify Python3 Expressions

I'm trying to learn how to deobfuscate some code that is unnecessarily complicated. For example, I would like to be able to rewrite this line of code:
return ('d' + chr(101) + chr(97) + chr(200 - 100)) # returns 'dead'
to:
return 'dead'
So basically, I need to evaluate all literals within the py file, including complicated expressions that evaluate to simple integers. How do I go about writing this reader / is there something that exists that can do this? Thanks!
What you want is a program transformation system (PTS).
This is a tool for parsing source code to an AST, transforming the tree, and then regenerating valid source code from the tree. See my SO answer on rewriting Python text for some background.
With a PTS like (my company's) DMS Software Reengineering Toolkit, you can write rules to do constant folding, which essentially means doing compile-time arithmetic.
For the example you show, the following rules accomplish the rewrite:
rule fold_subtract_naturals(n:NATURAL, m:NATURAL): sum->sum =
  " \n - \m " -> " \subtract_naturals\(\n\,\m\) ";

rule convert_chr_to_string(c:NATURAL): term->term =
  " chr(\c) " -> make_string_from_natural(c);

rule convert_character_literal_to_string(c:CHARACTER): term->term =
  " \c " -> make_string_from_character(c);

rule fold_concatenate_strings(s1:STRING, s2:STRING): sum->sum =
  " \s1 + \s2 " -> " \concatenate_strings\(\s1\,\s2\) ";

ruleset fold_strings = {
  fold_subtract_naturals,
  convert_chr_to_string,
  convert_character_literal_to_string,
  fold_concatenate_strings };
Each of the individual rules matches corresponding syntax/trees. They are written in such a way that they only apply to literal constants.
fold_subtract_naturals finds pairs of NATURAL constants joined by a subtract operation, and replaces them by the difference using a built-in function that subtracts two values and produces a literal value node containing the result.
convert_chr_to_string converts chr(c) to the corresponding string literal.
convert_character_literal_to_string converts 'C' to the corresponding string "C".
fold_concatenate_strings combines two literal strings separated by an add operator. It works analogously to the way fold_subtract_naturals works.
subtract_naturals and concatenate_strings are built into DMS. convert_chr_to_string and convert_character_literal_to_string need to be custom-coded in DMS's metaprogramming language, PARLANSE, but these routines are pretty simple (maybe 10 lines each).
The ruleset packages up the set of rules so they can all be applied.
Not shown is the basic code to open a file, call the parser, invoke the ruleset transformer (which applies rules until no rule applies). The last step is to call the prettyprinter to reprint the modified AST.
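For readers without a PTS, the same constant-folding idea can be sketched in pure Python with the standard ast module. This is not DMS, just an illustrative NodeTransformer; it folds subtrees by eval-ing them, so only run it on code you trust, and ast.unparse requires Python 3.9+:
import ast

class FoldConstants(ast.NodeTransformer):
    # Best-effort folding of side-effect-free constant expressions.
    SAFE_CALLS = {'chr', 'ord'}

    def _try_fold(self, node):
        # Evaluate the subtree in a tiny sandbox; on any failure, leave it alone.
        try:
            value = eval(compile(ast.Expression(node), '<fold>', 'eval'),
                         {'__builtins__': {'chr': chr, 'ord': ord}})
        except Exception:
            return node
        if isinstance(value, (int, float, str, bytes)):
            return ast.copy_location(ast.Constant(value), node)
        return node

    def visit_BinOp(self, node):
        self.generic_visit(node)  # fold children first
        return self._try_fold(node)

    def visit_Call(self, node):
        self.generic_visit(node)
        if isinstance(node.func, ast.Name) and node.func.id in self.SAFE_CALLS:
            return self._try_fold(node)
        return node

src = "'d' + chr(101) + chr(97) + chr(200 - 100)"
tree = FoldConstants().visit(ast.parse(src, mode='eval'))
print(ast.unparse(ast.fix_missing_locations(tree)))  # 'dead'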
Many other PTS offer similar facilities.

Python 2.x: how to automate enforcing unicode instead of string?

How can I automate a test to enforce that a body of Python 2.x code contains no string instances (only unicode instances)?
Eg.
Can I do it from within the code?
Is there a static analysis tool that has this feature?
Edit:
I wanted this for an application in Python 2.5, but it turns out this is not really possible because:
2.5 doesn't support unicode_literals
kwargs dictionary keys can't be unicode objects, only strings
So I'm accepting the answer that says it's not possible, even though it's for different reasons :)
You can't enforce that all strings are Unicode; even with from __future__ import unicode_literals in a module, byte strings can be written as b'...', as they can in Python 3.
There was an option that could be used to get the same effect as unicode_literals globally: the command-line option -U. However it was abandoned early in the 2.x series because it basically broke every script.
What is your purpose for this? It is not desirable to abolish byte strings. They are not “bad” and Unicode strings are not universally “better”; they are two separate animals and you will need both of them. Byte strings will certainly be needed to talk to binary files and network services.
If you want to be prepared for a transition to Python 3, the best tack is to write b'...' for all the strings you really mean to be bytes, and u'...' for the strings that are inherently Unicode. The default '...' form can be used for everything else: places where you don't care, or where it doesn't matter that Python 3 changes the default string type.
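A small illustration of that convention (assuming Python 2.6+, where the b'...' prefix and the unicode_literals future import are available):
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

payload = b'\x00\x01'  # genuinely bytes, in Python 2 and 3 alike
label = u'caf\xe9'     # genuinely text
name = 'default'       # unicode here, because of the __future__ import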
It seems to me like you really need to parse the code with an honest-to-goodness Python parser. Then you will need to dig through the syntax tree your parser produces to see if it contains any string literals.
It looks like Python comes with a parser out of the box. From this documentation I got this code sample working:
import parser
from token import tok_name
def checkForNonUnicode(codeString):
    return checkForNonUnicodeHelper(parser.suite(codeString).tolist())

def checkForNonUnicodeHelper(lst):
    returnValue = True
    nodeType = lst[0]
    if nodeType in tok_name and tok_name[nodeType] == 'STRING':
        stringValue = lst[1]
        if stringValue[0] != "u": # Kind of hacky. Does this always work?
            print "%s is not unicode!" % stringValue
            returnValue = False
    else:
        for subNode in [lst[n] for n in range(1, len(lst))]:
            if isinstance(subNode, list):
                returnValue = returnValue and checkForNonUnicodeHelper(subNode)
    return returnValue

print checkForNonUnicode("""
def foo():
    a = 'This should blow up!'
""")

print checkForNonUnicode("""
def bar():
    b = u'although this is ok.'
""")
which prints out
'This should blow up!' is not unicode!
False
True
Now, docstrings aren't unicode but should be allowed, so you might have to do something more complicated, like from symbol import sym_name, where you can look up which node types are for class and function definitions. Then the first sub-node that's simply a string, i.e. not part of an assignment or whatever, should be allowed not to be unicode.
Good question!
Edit
Just a follow up comment. Conveniently for your purposes, parser.suite does not actually evaluate your python code. This means that you can run this parser over your Python files without worrying about naming or import errors. For example, let's say you have myObscureUtilityFile.py that contains
from ..obscure.relative.path import whatever
You can
checkForNonUnicode(open('/whoah/softlink/myObscureUtilityFile.py').read())
Our SD Source Code Search Engine (SCSE) can provide this result directly.
The SCSE provides a way to search extremely quickly across large sets of files, using some of the language structure to enable precise queries and minimize false positives. It handles a wide array of languages, even at the same time, including Python. A GUI shows search hits and a page of actual text from the file containing a selected hit.
It uses lexical information from the source languages as the basis for queries, comprised of various language keywords and pattern tokens that match varying language elements. SCSE knows the types of lexemes available in the language. One can search for a generic identifier (using the query token I) or an identifier matching some regular expression. Similarly, one can search for a generic string (using the query token S for "any kind of string literal") or for a specific type of string (for Python, including UnicodeStrings, non-unicode strings, etc., which collectively make up the set of Python things comprising S).
So a search:
'for' ... I=ij*
finds the keyword 'for' near ("...") an identifier whose prefix is "ij" and shows you all the hits. (Language-specific whitespace, including line breaks and comments, is ignored.)
A trivial search:
S
finds all string literals. This is often a pretty big set :-}
A search
UnicodeStrings
finds all string literals that are lexically defined as Unicode Strings (u"...")
What you want are all strings that aren't UnicodeStrings. The SCSE provides a "subtract" operator that subtracts hits of one kind that overlap hits of another. So your question, "what strings aren't unicode" is expressed concisely as:
S-UnicodeStrings
All hits shown will be the strings that aren't unicode strings, your precise question.
The SCSE provides logging facilities so that you can record hits. You can run SCSE from a command line, enabling a scripted query for your answer. Putting this into a command script would provide a tool that gives your answer directly.
