How to use PyParsing's QuotedString?

I'm trying to parse a string which contains several quoted values. Here is what I have so far:
from pyparsing import Word, Literal, printables
package_line = "package: name='com.sec.android.app.camera.shootingmode.dual' versionCode='6' versionName='1.003' platformBuildVersionName='5.0.1-1624448'"
package_name = Word(printables)("name")
versionCode = Word(printables)("versionCode")
versionName = Word(printables)("versionName")
platformBuildVersionName = Word(printables)("platformBuildVersionName")
expression = Literal("package:") + "name=" + package_name + "versionCode=" + versionCode \
+ "versionName=" + versionName + "platformBuildVersionName=" + platformBuildVersionName
tokens = expression.parseString(package_line)
print(tokens['name'])
print(tokens['versionCode'])
print(tokens['versionName'])
print(tokens['platformBuildVersionName'])
which prints
'com.sec.android.app.camera.shootingmode.dual'
'6'
'1.003'
'5.0.1-1624448'
Note that all the extracted tokens are still enclosed in single quotes. I would like to remove these, and it seems that the QuotedString class is meant for this purpose. However, I'm having difficulty adapting this snippet to use QuotedString; in particular, its constructor doesn't seem to take printables.
How might I go about removing the single quotes?

Importing QuotedString from pyparsing and replacing the expressions with the following:
package_name = QuotedString(quoteChar="'")("name")
versionCode = QuotedString(quoteChar="'")("versionCode")
versionName = QuotedString(quoteChar="'")("versionName")
platformBuildVersionName = QuotedString(quoteChar="'")("platformBuildVersionName")
seems to work. Now the script prints the output
com.sec.android.app.camera.shootingmode.dual
6
1.003
5.0.1-1624448
without quotation marks.
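For reference, here is a self-contained sketch of the whole script with that change applied. The shared `quoted` helper is my own naming, not part of the original answer, and the quote character is passed positionally so the call does not depend on how the keyword argument is spelled.

```python
from pyparsing import Literal, QuotedString

package_line = (
    "package: name='com.sec.android.app.camera.shootingmode.dual' "
    "versionCode='6' versionName='1.003' "
    "platformBuildVersionName='5.0.1-1624448'"
)

# One QuotedString expression, reused under different results names.
# Calling an expression with a name returns a named copy of it.
quoted = QuotedString("'")

expression = (
    Literal("package:")
    + "name=" + quoted("name")
    + "versionCode=" + quoted("versionCode")
    + "versionName=" + quoted("versionName")
    + "platformBuildVersionName=" + quoted("platformBuildVersionName")
)

tokens = expression.parseString(package_line)
for key in ("name", "versionCode", "versionName", "platformBuildVersionName"):
    # The surrounding single quotes are stripped by QuotedString.
    print(key, "->", tokens[key])
```

QuotedString matches the opening quote, everything up to the closing quote, and returns only the text in between, which is why no separate clean-up step is needed.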


Parametrized complex string working with gpkg in python

I need to construct a chain of text like this:
out = 'ogr:dbname=\'C:\\output\\2020.gpkg\' table=\"2020\" (geom) sql='
Here is my code:
import glob, time, sys, threading, os
from datetime import date, timedelta, datetime
import time, threading
#Parameters
layer = 'C:\\layer.gpkg'
n ='2020'
outdir = 'C:\\output'
#Process
l = os.path.realpath(layer)
pn = os.path.realpath(outdir + '/' + n + '.gpkg')
p = f"'{pn}'"
f = f"'{n}'"
o = f'ogr:dbname={p} table={f} (geom) sql='
#Test
out = 'ogr:dbname=\'C:\\output\\2020.gpkg\' table=\"2020\" (geom) sql='
o == out
The goal is to get o == out.
What do I need to change in the #Process part in order to get this as True ?
Moreover I need to run this either in linux or windows.
My final goal is to create a function that, given 3 strings, returns the complex string shown above.
Assuming you are using Python 3.6 or above, you can use formatted string literals (also known as f-strings) to construct strings from variables. Start the string with the letter "f" and put whatever variables you want in curly brackets {}. Also, if you use single quotes as the outer quote, you don't have to escape double quotes inside the string, and vice versa.
Code:
db_name = "'home/user/output/prueba.gpkg'"
table_name = '"prueba"'
outputlayer = f'ogr:dbname={db_name} table={table_name} (geom) sql='
outputlayer
Output:
'ogr:dbname=\'home/user/output/prueba.gpkg\' table="prueba" (geom) sql='
I think one reason this isn't working is the path construction here: pn = os.path.realpath(outdir + '/' + n + '.gpkg'). This combines the UNIX path separator / with the Windows separator \\. A more portable solution between Linux and Windows is to use the os.path.join function.
Additionally, when Python displays a string it only escapes the quote character that the displayed literal is delimited with (' or "); those backslashes are part of the display, not of the string itself. Since the target string needs single quotes around the path and double quotes around the table name, you're probably better off hard-coding both into the f-string instead of setting 2 new variables with different quote types.
import glob, time, sys, threading, os
from datetime import date, timedelta, datetime
import time, threading
#Parameters
layer = 'C:\\layer.gpkg'
n ='2020'
outdir = 'C:\\output'
#Process
l = os.path.realpath(layer)
pn = os.path.realpath(os.path.join(outdir, f"{n}.gpkg"))
o = f'ogr:dbname=\'{pn}\' table=\"{n}\" (geom) sql='
#Test
out = 'ogr:dbname=\'C:\\output\\2020.gpkg\' table=\"2020\" (geom) sql='
o == out
A version of this (different path) has been tested to work on my linux machine.
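Since the stated goal is a function that works on both Linux and Windows, here is a minimal sketch of one way to wrap this up. The function name `build_ogr_source` and the `path_cls` parameter are hypothetical, introduced only for illustration; it uses pathlib instead of os.path.join, and unlike os.path.realpath it never touches the filesystem.

```python
from pathlib import PurePath, PurePosixPath, PureWindowsPath


def build_ogr_source(outdir, name, path_cls=PurePath):
    """Return the OGR source string for <outdir>/<name>.gpkg.

    path_cls is an assumption of this sketch: PurePath picks the flavour
    of the current platform, while the explicit classes force one.
    """
    db_path = path_cls(outdir) / f"{name}.gpkg"
    # Single quotes around the path, double quotes around the table name.
    return f"ogr:dbname='{db_path}' table=\"{name}\" (geom) sql="


# Forcing the flavour makes the result independent of the machine running it.
windows_result = build_ogr_source("C:\\output", "2020", PureWindowsPath)
posix_result = build_ogr_source("/home/user/output", "prueba", PurePosixPath)

print(windows_result)
print(posix_result)
```

In everyday use the third argument would simply be omitted, so the separator of the running platform is used.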
Another option is to use a triple quoted string:
dbname = """/home/user/output/prueba.gpkg"""
outputlayer = """ogr:dbname='"""+dbname+"""' table="prueba" (geom) sql="""
which gives:
'ogr:dbname=\'/home/user/output/prueba.gpkg\' table="prueba" (geom) sql='

Make an Optional expression throw an error if it was present but didn't match

I'm using the PyParsing library to define my own SQL-like DSL. Here is the relevant part from it:
argument_separator = pp.Suppress(",")
brace_open = pp.Suppress("(")
brace_close = pp.Suppress(")")
argument_string = pp.quotedString.setParseAction(pp.removeQuotes)
keyword_author = pp.CaselessKeyword("author")
keyword_date = pp.CaselessKeyword("date")
function_matches = pp.CaselessLiteral("matches")
function_in = pp.CaselessLiteral("in")
function_between = pp.CaselessLiteral("between")
author_matches = pp.Group(keyword_author + function_matches + brace_open +
pp.Group(argument_string) +
brace_close)
author_in = pp.Group(keyword_author + function_in + brace_open +
pp.Group(argument_string + pp.OneOrMore(argument_separator + argument_string)) +
brace_close)
date_between = pp.Group(keyword_date + function_between + brace_open +
pp.Group(argument_string + argument_separator + argument_string) +
brace_close)
expression = pp.Optional(author_matches | author_in) & pp.Optional(date_between)
Examples:
# These all match:
author in("Lukas", "Timo", "Michae1")
author matches("hallo@welt.de")
date between("two days ago", "today")
author matches("knuth") date between("two days ago", "today")
# This does (and should) not.
date between(today)
The last expression doesn't match but doesn't throw an exception either. It just returns an empty result.
My goal: A "query" in my DSL consists of multiple expressions of the form
[column] [operator]([parameter],...)
No duplicates are allowed. Furthermore, all possible expressions are optional (so an empty query is perfectly legal).
My problem: The current approach doesn't throw an error if one of these expressions is malformed. Because they are all Optional, if they don't match exactly, they're just ignored. That is confusing to the user, since he doesn't get an error, but the result is wrong.
So, what I need is an expression that is optional (so can be completely omitted), but will throw a ParseException, if it was malformed.
Try setting parseAll to True when parsing each line, e.g. expression.parseString(line, parseAll=True). This will raise a ParseException if the entire line wasn't matched.
See the "Using the pyparsing module" page for a bit more info.
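A reduced sketch of the effect, using a cut-down version of the grammar from the question (only `author matches` and `date between`; the helper name `parse_query` is hypothetical). Without parseAll the malformed input yields an empty result; with it, the leftover text triggers the exception.

```python
import pyparsing as pp

# Work on a copy so the module-level quotedString is not modified.
quoted = pp.quotedString.copy().setParseAction(pp.removeQuotes)
lpar, rpar, comma = map(pp.Suppress, "(),")

author_matches = pp.Group(
    pp.CaselessKeyword("author") + pp.CaselessKeyword("matches")
    + lpar + quoted + rpar
)
date_between = pp.Group(
    pp.CaselessKeyword("date") + pp.CaselessKeyword("between")
    + lpar + quoted + comma + quoted + rpar
)
expression = pp.Optional(author_matches) & pp.Optional(date_between)


def parse_query(text):
    # parseAll=True requires the grammar to consume the whole input.
    return expression.parseString(text, parseAll=True)


good = parse_query('date between("two days ago", "today")').asList()

# Without parseAll the malformed query is silently skipped.
lenient = expression.parseString("date between(today)").asList()

try:
    parse_query("date between(today)")
    strict_failed = False
except pp.ParseException as exc:
    strict_failed = True
    print("rejected:", exc)
```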

Writing multiple values in a text file using python

I want to write multiple values in a text file using python.
I wrote the following line in my code:
text_file.write("sA" + str(chart_count) + ".Name = " + str(State_name.groups())[2:-3] + "\n")
Note: State_name.groups() is a regex captured word. So it is captured as a tuple and to remove the ( ) brackets from the tuple I have used string slicing.
Now the output comes as:
sA0.Name = GLASS_OPEN
No problem here
But I want the output to be like this:
sA0.Name = 'GLASS_HATCH_OPENED_PROTECTION_FCT'
I want the variable value to be enclosed inside the single quotes.
Does this work for you?
text_file.write("sA" + str(chart_count) + ".Name = '" + str(State_name.groups())[2:-3] + "'\n")
# ^single quote here and here^
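As a side note, slicing the string form of the tuple is fragile. A minimal sketch of an alternative, assuming the regex has one capturing group: take the value with group(1) and let an f-string add the quotes. The pattern, the sample input, and the helper name `format_state_line` are made up for illustration, and io.StringIO stands in for the real text file.

```python
import io
import re


def format_state_line(chart_count, state_name):
    # The single quotes are part of the template, not of the value.
    return f"sA{chart_count}.Name = '{state_name}'\n"


# Hypothetical pattern and input, standing in for the real regex.
State_name = re.search(r"STATE\s+(\w+)", "STATE GLASS_HATCH_OPENED_PROTECTION_FCT")

text_file = io.StringIO()
chart_count = 0
if State_name is not None:
    # group(1) returns the captured text directly, no tuple slicing needed.
    text_file.write(format_state_line(chart_count, State_name.group(1)))

print(text_file.getvalue(), end="")
```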

Contextual parsing of structured text configuration files using pyparsing

I'm trying to build a simple parser using pyparsing.
My example file looks as follows:
# comment
# comment
name1 = value1
name2 = value2
example_list a
foo
bar
grp1 (
example_list2 aa
foo
bar
example_list3 bb
foo
bar
)
grp2 (
example_list4 x
foo
bar
example_list5 x
foo
bar
example_list6 x
foo
bar
)
The parser I've come up with so far looks like this:
#!/usr/bin/python
import sys
from pyparsing import *
blank_line = lineStart + restOfLine
comment = Suppress("#") + restOfLine
alias = Word(alphas, alphanums)
name = Word(alphas, alphanums + "_")
value = Word(printables)
parameter = name + Suppress("=") + value
flag = Literal("*") | Literal("#") | Literal("!")
list_item = Optional(flag) + value
list = name + alias + lineEnd + OneOrMore(list_item) + blank_line
group = alias + Suppress("(") + lineEnd + OneOrMore(list) + lineStart + Suppress(")")
script = ZeroOrMore(Suppress(blank_line) | Suppress(comment) | parameter ^ list ^ group)
if __name__ == "__main__":
print(script.parseFile(sys.argv[1]))
but of course it doesn't work.
What I think I need is some way for the parser to know that if we have a string followed by an equals sign, that only then can we expect just one more string.
If we have a string followed by a bracket, then we've started a group.
And if we have two strings, then we've started a list.
How do I do this?
Also, comments could conceivably also appear on the end of lines...
I'm not sure if you are settled on your file format, but your file could easily be expressed as an RSON file (see http://code.google.com/p/rson/). The RSON format (and associated parser) was developed to be a "readable" version of JSON. I'm using the python RSON parser in some of my projects.
If you are doing this to learn how to parse a file like this, you may still be able to glean some info from the RSON parser.

Can’t fix pyparsing error…

Overview
So, I’m in the middle of refactoring a project, and I’m separating out a bunch of parsing code. The code I’m concerned with is pyparsing.
I have a very poor understanding of pyparsing, even after spending a lot of time reading through the official documentation. I’m having trouble because (1) pyparsing takes a (deliberately) unorthodox approach to parsing, and (2) I’m working on code I didn’t write, with poor comments, and a non-elementary set of existing grammars.
(I can’t get in touch with the original author, either.)
Failing Test
I’m using PyVows to test my code. One of my tests is as follows (I think this is clear even if you’re unfamiliar with PyVows; let me know if it isn’t):
def test_multiline_command_ends(self, topic):
output = parsed_input('multiline command ends\n\n',topic)
expect(output).to_equal(
r'''['multiline', 'command ends', '\n', '\n']
- args: command ends
- multiline_command: multiline
- statement: ['multiline', 'command ends', '\n', '\n']
- args: command ends
- multiline_command: multiline
- terminator: ['\n', '\n']
- terminator: ['\n', '\n']''')
But when I run the test, I get the following in the terminal:
Failed Test Results
Expected topic("['multiline', 'command ends']\n- args: command ends\n- command: multiline\n- statement: ['multiline', 'command ends']\n - args: command ends\n - command: multiline")
to equal "['multiline', 'command ends', '\\n', '\\n']\n- args: command ends\n- multiline_command: multiline\n- statement: ['multiline', 'command ends', '\\n', '\\n']\n - args: command ends\n - multiline_command: multiline\n - terminator: ['\\n', '\\n']\n- terminator: ['\\n', '\\n']"
Note:
Since the output is to a Terminal, the expected output (the second one) has extra backslashes. This is normal. The test ran without issue before this piece of refactoring began.
Expected Behavior
The first line of output should match the second, but it doesn’t. Specifically, it’s not including the two newline characters in that first list object.
So I’m getting this:
"['multiline', 'command ends']\n- args: command ends\n- command: multiline\n- statement: ['multiline', 'command ends']\n - args: command ends\n - command: multiline"
When I should be getting this:
"['multiline', 'command ends', '\\n', '\\n']\n- args: command ends\n- multiline_command: multiline\n- statement: ['multiline', 'command ends', '\\n', '\\n']\n - args: command ends\n - multiline_command: multiline\n - terminator: ['\\n', '\\n']\n- terminator: ['\\n', '\\n']"
Earlier in the code, there is also this statement:
pyparsing.ParserElement.setDefaultWhitespaceChars(' \t')
…Which I think should prevent exactly this kind of error. But I’m not sure.
Even if the problem can’t be identified with certainty, simply narrowing down where the problem is would be a HUGE help.
Please let me know how I might take a step or two towards fixing this.
Edit: So, uh, I should post the parser code for this, shouldn’t I? (Thanks for the tip, @andrew cooke!)
Parser code
Here’s the __init__ for my parser object.
I know it’s a nightmare. That’s why I’m refactoring the project. ☺
def __init__(self, Cmd_object=None, *args, **kwargs):
# #NOTE
# This is one of the biggest pain points of the existing code.
# To aid in readability, I CAPITALIZED all variables that are
# not set on `self`.
#
# That means that CAPITALIZED variables aren't
# used outside of this method.
#
# Doing this has allowed me to more easily read what
# variables become a part of other variables during the
# building-up of the various parsers.
#
# I realize the capitalized variables is unorthodox
# and potentially anti-convention. But after reaching out
# to the project's creator several times over roughly 5
# months, I'm still working on this project alone...
# And without help, this is the only way I can move forward.
#
# I have a very poor understanding of the parser's
# control flow when the user types a command and hits ENTER,
# and until the author (or another pyparsing expert)
# explains what's happening to me, I have to do silly
# things like this. :-|
#
# Of course, if the impossible happens and this code
# gets cleaned up, then the variables will be restored to
# proper capitalization.
#
# —Zearin
# http://github.com/zearin/
# 2012 Mar 26
if Cmd_object is not None:
self.Cmd_object = Cmd_object
else:
raise Exception('Cmd_object must be provided to Parser.__init__().')
# #FIXME
# Refactor methods into this class later
preparse = self.Cmd_object.preparse
postparse = self.Cmd_object.postparse
self._allow_blank_lines = False
self.abbrev = True # Recognize abbreviated commands
self.case_insensitive = True # Commands recognized regardless of case
# make sure your terminators are not in legal_chars!
self.legal_chars = u'!#$%.:?@_' + PYP.alphanums + PYP.alphas8bit
self.multiln_commands = [] if 'multiline_commands' not in kwargs else kwargs['multiln_commands']
self.no_special_parse = {'ed','edit','exit','set'}
self.redirector = '>' # for sending output to file
self.reserved_words = []
self.shortcuts = { '?' : 'help' ,
'!' : 'shell',
'@' : 'load' ,
'@@': '_relative_load'
}
# self._init_grammars()
#
# def _init_grammars(self):
# #FIXME
# Add Docstring
# ----------------------------
# Tell PYP how to parse
# file input from '< filename'
# ----------------------------
FILENAME = PYP.Word(self.legal_chars + '/\\')
INPUT_MARK = PYP.Literal('<')
INPUT_MARK.setParseAction(lambda x: '')
INPUT_FROM = FILENAME('INPUT_FROM')
INPUT_FROM.setParseAction( self.Cmd_object.replace_with_file_contents )
# ----------------------------
#OUTPUT_PARSER = (PYP.Literal('>>') | (PYP.WordStart() + '>') | PYP.Regex('[^=]>'))('output')
OUTPUT_PARSER = (PYP.Literal( 2 * self.redirector) | \
(PYP.WordStart() + self.redirector) | \
PYP.Regex('[^=]' + self.redirector))('output')
PIPE = PYP.Keyword('|', identChars='|')
STRING_END = PYP.stringEnd ^ '\nEOF'
TERMINATORS = [';']
TERMINATOR_PARSER = PYP.Or([
(hasattr(t, 'parseString') and t)
or
PYP.Literal(t) for t in TERMINATORS
])('terminator')
self.comment_grammars = PYP.Or([ PYP.pythonStyleComment,
PYP.cStyleComment ])
self.comment_grammars.ignore(PYP.quotedString)
self.comment_grammars.setParseAction(lambda x: '')
self.comment_grammars.addParseAction(lambda x: '')
self.comment_in_progress = '/*' + PYP.SkipTo(PYP.stringEnd ^ '*/')
# QuickRef: Pyparsing Operators
# ----------------------------
# ~ creates NotAny using the expression after the operator
#
# + creates And using the expressions before and after the operator
#
# | creates MatchFirst (first left-to-right match) using the
# expressions before and after the operator
#
# ^ creates Or (longest match) using the expressions before and
# after the operator
#
# & creates Each using the expressions before and after the operator
#
# * creates And by multiplying the expression by the integer operand;
# if expression is multiplied by a 2-tuple, creates an And of
# (min,max) expressions (similar to "{min,max}" form in
# regular expressions); if min is None, interpret as (0,max);
# if max is None, interpret as expr*min + ZeroOrMore(expr)
#
# - like + but with no backup and retry of alternatives
#
# * repetition of expression
#
# == matching expression to string; returns True if the string
# matches the given expression
#
# << inserts the expression following the operator as the body of the
# Forward expression before the operator
# ----------------------------
DO_NOT_PARSE = self.comment_grammars | \
self.comment_in_progress | \
PYP.quotedString
# moved here from class-level variable
self.URLRE = re.compile('(https?://[-\\w\\./]+)')
self.keywords = self.reserved_words + [fname[3:] for fname in dir( self.Cmd_object ) if fname.startswith('do_')]
# not to be confused with `multiln_parser` (below)
self.multiln_command = PYP.Or([
PYP.Keyword(c, caseless=self.case_insensitive)
for c in self.multiln_commands
])('multiline_command')
ONELN_COMMAND = ( ~self.multiln_command +
PYP.Word(self.legal_chars)
)('command')
#self.multiln_command.setDebug(True)
# Configure according to `allow_blank_lines` setting
if self._allow_blank_lines:
self.blankln_termination_parser = PYP.NoMatch
else:
BLANKLN_TERMINATOR = (2 * PYP.lineEnd)('terminator')
#BLANKLN_TERMINATOR('terminator')
self.blankln_termination_parser = (
(self.multiln_command ^ ONELN_COMMAND)
+ PYP.SkipTo(
BLANKLN_TERMINATOR,
ignore=DO_NOT_PARSE
).setParseAction(lambda x: x[0].strip())('args')
+ BLANKLN_TERMINATOR
)('statement')
# CASE SENSITIVITY for
# ONELN_COMMAND and self.multiln_command
if self.case_insensitive:
# Set parsers to account for case insensitivity (if appropriate)
self.multiln_command.setParseAction(lambda x: x[0].lower())
ONELN_COMMAND.setParseAction(lambda x: x[0].lower())
self.save_parser = ( PYP.Optional(PYP.Word(PYP.nums)^'*')('idx')
+ PYP.Optional(PYP.Word(self.legal_chars + '/\\'))('fname')
+ PYP.stringEnd)
AFTER_ELEMENTS = PYP.Optional(PIPE +
PYP.SkipTo(
OUTPUT_PARSER ^ STRING_END,
ignore=DO_NOT_PARSE
)('pipeTo')
) + \
PYP.Optional(OUTPUT_PARSER +
PYP.SkipTo(
STRING_END,
ignore=DO_NOT_PARSE
).setParseAction(lambda x: x[0].strip())('outputTo')
)
self.multiln_parser = (((self.multiln_command ^ ONELN_COMMAND)
+ PYP.SkipTo(
TERMINATOR_PARSER,
ignore=DO_NOT_PARSE
).setParseAction(lambda x: x[0].strip())('args')
+ TERMINATOR_PARSER)('statement')
+ PYP.SkipTo(
OUTPUT_PARSER ^ PIPE ^ STRING_END,
ignore=DO_NOT_PARSE
).setParseAction(lambda x: x[0].strip())('suffix')
+ AFTER_ELEMENTS
)
#self.multiln_parser.setDebug(True)
self.multiln_parser.ignore(self.comment_in_progress)
self.singleln_parser = (
( ONELN_COMMAND + PYP.SkipTo(
TERMINATOR_PARSER
^ STRING_END
^ PIPE
^ OUTPUT_PARSER,
ignore=DO_NOT_PARSE
).setParseAction(lambda x:x[0].strip())('args'))('statement')
+ PYP.Optional(TERMINATOR_PARSER)
+ AFTER_ELEMENTS)
#self.multiln_parser = self.multiln_parser('multiln_parser')
#self.singleln_parser = self.singleln_parser('singleln_parser')
self.prefix_parser = PYP.Empty()
self.parser = self.prefix_parser + (STRING_END |
self.multiln_parser |
self.singleln_parser |
self.blankln_termination_parser |
self.multiln_command +
PYP.SkipTo(
STRING_END,
ignore=DO_NOT_PARSE)
)
self.parser.ignore(self.comment_grammars)
# a not-entirely-satisfactory way of distinguishing
# '<' as in "import from" from
# '<' as in "lesser than"
self.input_parser = INPUT_MARK + \
PYP.Optional(INPUT_FROM) + \
PYP.Optional('>') + \
PYP.Optional(FILENAME) + \
(PYP.stringEnd | '|')
self.input_parser.ignore(self.comment_in_progress)
I suspect that the problem is pyparsing's built-in whitespace skipping, which skips over newlines by default. Even though setDefaultWhitespaceChars is used to tell pyparsing that newlines are significant, this setting only affects expressions that are created after the call to setDefaultWhitespaceChars. The problem is that pyparsing tries to help by defining a number of convenience expressions when it is imported, like empty for Empty(), lineEnd for LineEnd() and so on. But since these are all created at import time, they are defined with the original default whitespace characters, which include '\n'.
I should probably just do this in setDefaultWhitespaceChars, but you can clean this up for yourself too. Right after calling setDefaultWhitespaceChars, redefine these module-level expressions in pyparsing:
PYP.ParserElement.setDefaultWhitespaceChars(' \t')
# redefine module-level constants to use new default whitespace chars
PYP.empty = PYP.Empty()
PYP.lineEnd = PYP.LineEnd()
PYP.stringEnd = PYP.StringEnd()
I think this will help restore the significance of your embedded newlines.
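A small sketch of the "only affects expressions created afterwards" behaviour, using two user-defined Word expressions rather than the module-level constants. The variable names are mine, and the defaults are restored at the end so the demonstration does not leak into other code.

```python
import pyparsing as pp

# Created while newlines still count as skippable whitespace.
word_before = pp.Word(pp.alphas)

pp.ParserElement.setDefaultWhitespaceChars(" \t")

# Created afterwards: only spaces and tabs are skipped.
word_after = pp.Word(pp.alphas)

text = "\nabc"

# The older expression silently skips the leading newline.
before_tokens = word_before.parseString(text).asList()

# The newer one treats the newline as significant and fails on it.
try:
    word_after.parseString(text)
    after_failed = False
except pp.ParseException:
    after_failed = True

print(before_tokens, after_failed)

# Restore the library default so later code is unaffected.
pp.ParserElement.setDefaultWhitespaceChars(" \n\t\r")
```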
Some other bits on your parser code:
self.blankln_termination_parser = PYP.NoMatch
should be
self.blankln_termination_parser = PYP.NoMatch()
Your original author might have been overly aggressive with using '^' over '|'. Only use '^' if there is some potential for parsing one expression accidentally when you would really have parsed a longer one that follows later in the list of alternatives. For instance, in:
self.save_parser = ( PYP.Optional(PYP.Word(PYP.nums)^'*')('idx')
There is no possible confusion between a Word of numeric digits and a lone '*'. Or (the '^' operator) tells pyparsing to try to evaluate all of the alternatives, and then pick the longest matching one; in case of a tie, it chooses the left-most alternative in the list. If you parse '*', there is no need to see if that might also match a longer integer, or if you parse an integer, no need to see if it might also pass as a lone '*'. So change this to:
self.save_parser = ( PYP.Optional(PYP.Word(PYP.nums)|'*')('idx')
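To make the difference concrete, here is a short sketch with two deliberately overlapping literals (chosen for illustration, they are not from the parser above). With '|' the first alternative that matches wins; with '^' the longest match wins. For the digits-or-star case both operators give the same result, so the cheaper '|' is enough.

```python
import pyparsing as pp

# "in" is a prefix of "int", so the order of alternatives matters for '|'.
first_match = pp.Literal("in") | pp.Literal("int")    # MatchFirst
longest_match = pp.Literal("in") ^ pp.Literal("int")  # Or

first_tokens = first_match.parseString("int").asList()
longest_tokens = longest_match.parseString("int").asList()
print(first_tokens, longest_tokens)

# No overlap between digits and '*': both operators behave identically.
idx_or = pp.Word(pp.nums) ^ "*"
idx_first = pp.Word(pp.nums) | "*"
same_results = all(
    idx_or.parseString(s).asList() == idx_first.parseString(s).asList()
    for s in ("42", "*")
)
print(same_results)
```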
Using a parse action to replace a string with '' is more simply written using a PYP.Suppress wrapper, or if you prefer, call expr.suppress() which returns Suppress(expr). Combined with preference for '|' over '^', this:
self.comment_grammars = PYP.Or([ PYP.pythonStyleComment,
PYP.cStyleComment ])
self.comment_grammars.ignore(PYP.quotedString)
self.comment_grammars.setParseAction(lambda x: '')
becomes:
self.comment_grammars = (PYP.pythonStyleComment | PYP.cStyleComment
).ignore(PYP.quotedString).suppress()
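A quick sketch of why suppress() is the cleaner choice: a parse action that returns '' still leaves an empty-string token in the results, while a suppressed expression contributes nothing. The copies are used here only so the module-level pythonStyleComment is left untouched.

```python
import pyparsing as pp

word = pp.Word(pp.alphas)

# Variant 1: replace the matched comment with an empty string.
blanked = pp.pythonStyleComment.copy().setParseAction(lambda tokens: "")

# Variant 2: drop the matched comment from the results entirely.
suppressed = pp.pythonStyleComment.copy().suppress()

source = "abc # note"
blanked_tokens = (word + blanked).parseString(source).asList()
suppressed_tokens = (word + suppressed).parseString(source).asList()

print(blanked_tokens)     # the comment is blanked but still occupies a slot
print(suppressed_tokens)  # the comment is gone
```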
Keywords have built-in logic to automatically avoid ambiguity, so Or is completely unnecessary with them:
self.multiln_command = PYP.Or([
PYP.Keyword(c, caseless=self.case_insensitive)
for c in self.multiln_commands
])('multiline_command')
should be:
self.multiln_command = PYP.MatchFirst([
PYP.Keyword(c, caseless=self.case_insensitive)
for c in self.multiln_commands
])('multiline_command')
(In the next release, I'll loosen up those initializers to accept generator expressions so that the []'s will become unnecessary.)
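The Keyword behaviour can be checked with two hypothetical command names where one is a prefix of the other. A Keyword refuses to match when it is immediately followed by more identifier characters, so MatchFirst falls through to the right alternative; plain Literals do not have that protection.

```python
import pyparsing as pp

# Hypothetical command names; "run" is a prefix of "runner".
commands = ["run", "runner"]

keyword_choice = pp.MatchFirst(
    [pp.Keyword(c) for c in commands]
)("multiline_command")
literal_choice = pp.MatchFirst([pp.Literal(c) for c in commands])

# Keyword("run") rejects "runner", so the second alternative is tried.
keyword_tokens = keyword_choice.parseString("runner").asList()

# Literal("run") happily matches the first three characters.
literal_tokens = literal_choice.parseString("runner").asList()

print(keyword_tokens, literal_tokens)
```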
That's all I can see for now. Hope this helps.
I fixed it!
Pyparsing was not at fault!
I was. ☹
By separating out the parsing code into a different object, I created the problem. Originally, an attribute used to “update itself” based on the contents of a second attribute. Since this all used to be contained in one “god class”, it worked fine.
Simply by separating the code into another object, the first attribute was set at instantiation, but no longer “updated itself” if the second attribute it depended on changed.
Specifics
The attribute multiln_command (not to be confused with multiln_commands—aargh, what confusing naming!) was a pyparsing grammar definition. The multiln_command attribute should have updated its grammar if multiln_commands ever changed.
Although I knew these two attributes had similar names but very different purposes, the similarity definitely made it harder to track the problem down. I have now renamed multiln_command to multiln_grammar.
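One way to keep such a grammar in sync, sketched under my own assumptions (the class `CommandParser` is a hypothetical stand-in, not the actual refactored code): build the grammar in a property, so it is derived from the command list on every access instead of being frozen at instantiation.

```python
import pyparsing as pp


class CommandParser:
    """Hypothetical stand-in for the refactored parser object."""

    def __init__(self, multiln_commands=None):
        self.multiln_commands = list(multiln_commands or [])

    @property
    def multiln_grammar(self):
        # Rebuilt on each access, so it always reflects the current list.
        if not self.multiln_commands:
            return pp.NoMatch()
        return pp.MatchFirst(
            [pp.Keyword(c, caseless=True) for c in self.multiln_commands]
        )("multiline_command")


parser = CommandParser()

# The list changes after instantiation, as in the original "god class".
parser.multiln_commands.append("multiline")

tokens = parser.multiln_grammar.parseString("multiline command ends")
print(tokens["multiline_command"])
```

Rebuilding on every access trades a little speed for correctness; if the grammar is expensive to build, caching it and invalidating the cache whenever the list is replaced would be the next step.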
However! ☺
I am grateful to @Paul McGuire’s awesome answer, and I hope it saves me (and others) some grief in the future. Although I feel a bit foolish that I caused the problem (and misdiagnosed it as a pyparsing issue), I’m happy some good (in the form of Paul’s advice) came of asking this question.
Happy parsing, everybody. :)
