I'm trying to use pyparsing to parse command-line-style strings in which the arguments themselves may contain backslash line continuations, such as the value for -arg4 in the following example:
import pyparsing as pp
cmd = r"""shellcmd -arg1 val1 -arg2 val2 \
-arg3 val3 \
-arg4 'quoted \
line-continued \
string \
'"""
continuation = '\\' + pp.LineEnd()
option = pp.Word('-', pp.alphanums)
arg1 = ~pp.Literal('-') + pp.Word(pp.printables)
arg2 = pp.quotedString
arg2.ignore(continuation)
arg = arg1 | arg2
command = pp.Word(pp.alphas) + pp.ZeroOrMore(pp.Group(option + pp.Optional(arg)))
command.ignore(continuation)
print command.parseString(cmd)
The result is:
['shellcmd', ['-arg1', 'val1'], ['-arg2', 'val2'], ['-arg3', 'val3'], ['-arg4', "'quoted"]]
when what I want is something like this:
['shellcmd', ['-arg1', 'val1'], ['-arg2', 'val2'], ['-arg3', 'val3'], ['-arg4', 'quoted line-continued string']]
I would very much appreciate your help in pointing out my error and the fix.
Using cmd as you've posted it above, I would parse it like this:
from pyparsing import *
continuation = ('\\' + LineEnd()).suppress()
name = Word(alphanums)
# Parse out the multiline quoted string
def QString(s,loc,tokens):
    text = Word(alphanums+'-') + Optional(continuation)
    g = Combine(ZeroOrMore(text), adjacent=False, joinString=" ")
    return g.parseString(tokens[0])
arg = name + Optional(continuation)
qarg = QuotedString("\'",multiline=True)
qarg.setParseAction(QString)
vals = Group(ZeroOrMore(arg | qarg))
option = Literal("-").suppress() + Group(name + vals)
grammar = name + ZeroOrMore(option)
sol = grammar.parseString(cmd)
print sol
Giving:
['shellcmd', ['arg1', ['val1']], ['arg2', ['val2']], ['arg3', ['val3']], ['arg4', ['quoted line-continued string']]]
The real key here is using the QuotedString option multiline=True, which saves a lot of headache. This solution is also a bit more flexible than the one you proposed: it can handle multiple arguments per option, e.g. -arg a b c or even -arg a b 'long-string-with-dashes' c d e.
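As a quick (untested) check of that flexibility, reusing the grammar above on a single-line command:
print grammar.parseString("shellcmd -arg a b 'long-string-with-dashes' c d e")
# roughly:
# ['shellcmd', ['arg', ['a', 'b', 'long-string-with-dashes', 'c', 'd', 'e']]]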
I am trying to create a JSON string that I can send in a PUT request, but I want to build it dynamically. For example, once I am done, I'd like the string to look like this:
{
    "request-id": 1045058,
    "db-connections":
    [
        {
            "db-name":"Sales",
            "table-name":"customer"
        }
    ]
}
I'd like to use constants for the keys; for example, for request-id, I'd like to use CONST_REQUEST_ID. So the keys will be:
CONST_REQUEST_ID = "request-id"
CONST_DB_CONNECTIONS = "db_connections"
CONST_DB_NAME = "db-name"
CONST_TABLE_NAME = "table-name"
The values for the keys will be taken from various variables. For example, the value "Sales" would come from a variable called dbname.
I have tried json.load and I am getting an exception.
Help would be appreciated, as I am a bit new to Python.
Create a normal Python dictionary, then convert it to JSON.
import json

raw_data = {
    CONST_DB_NAME: getDbNameValue()
}
# ...
json_data = json.dumps(raw_data)
# Use json_data in your PUT request.
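For the exact structure in the question, a fuller sketch might look like this (request_id, db_name and table_name are stand-ins for wherever your values actually come from, and I am using "db-connections" as the constant's value so it matches the desired output):
import json

CONST_REQUEST_ID = "request-id"
CONST_DB_CONNECTIONS = "db-connections"
CONST_DB_NAME = "db-name"
CONST_TABLE_NAME = "table-name"

request_id = 1045058
db_name = "Sales"
table_name = "customer"

raw_data = {
    CONST_REQUEST_ID: request_id,
    CONST_DB_CONNECTIONS: [
        {
            CONST_DB_NAME: db_name,
            CONST_TABLE_NAME: table_name,
        }
    ],
}

# json.dumps takes care of all the quoting and nesting
print json.dumps(raw_data)
# prints something like (key order may differ):
# {"request-id": 1045058, "db-connections": [{"db-name": "Sales", "table-name": "customer"}]}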
You can concatenate the strings using + with variables.
CONST_REQUEST_ID = "request-id"
CONST_DB_CONNECTIONS = "db_connections"
CONST_DB_NAME = "db-name"
CONST_TABLE_NAME = "table-name"
request_id = "1045058"
db_name = "Sales"
table_name = 'customer'
json_string = '{' + \
'"' + CONST_REQUEST_ID + '": ' + request_id \
+ ',' + \
'"db-connections":' \
+ '[' \
+ '{' \
+ '"' + CONST_DB_NAME +'":"' + db_name + '",' \
+ '"' + CONST_TABLE_NAME + '":"' + table_name + '"' \
+ '}' \
+ ']' \
+ '}'
print json_string
And this is the result:
python st_ans.py
{"request-id": 1045058,"db-connections":[{"db-name":"Sales","table-name":"customer"}]}
So I want to do this (but using pyparsing):
Package:numpy11 Package:scipy
will be split into
[["Package:", "numpy11"], ["Package:", "scipy"]]
My code so far is
package_header = Literal("Package:")
single_package = Word(printables + " ") + ~Literal("Package:")
full_parser = OneOrMore( pp.Group( package_header + single_package ) )
The current output is this
([(['Package:', 'numpy11 Package:scipy'], {})], {})
I was hoping for something like this
([(['Package:', 'numpy11'], {})], [(['Package:', 'scipy'], {})], {})
Essentially the rest of the text matches pp.printables. I am aware that I can use Word, but I want to match all printables except the literal "Package:". How do I accomplish this? Thank you.
You shouldn't need the negative lookahead, since Word(printables) will stop at the first whitespace on its own, i.e. this:
from pyparsing import *
package_header = Literal("Package:")
single_package = Word(printables)
full_parser = OneOrMore( Group( package_header + single_package ) )
print full_parser.parseString("Package:numpy11 Package:scipy")
prints:
[['Package:', 'numpy11'], ['Package:', 'scipy']]
Update: to parse packages delimited by | you can use the delimitedList() function (now you can also have spaces in package names):
from pyparsing import *
package_header = Literal("Package:")
package_name = Regex(r'[^|]+') # | is a printable, so create a regex that excludes it.
package = Group(package_header + package_name)
full_parser = delimitedList(package, delim="|" )
print full_parser.parseString("Package:numpy11 foo|Package:scipy")
prints:
[['Package:', 'numpy11 foo'], ['Package:', 'scipy']]
I'm trying to build a grammar to parse an Erlang tagged tuple list, and map this to a Dict in pyparsing. I'm having problems when I have a list of Dicts. The grammar works if the Dict has just one element, but when I add a second I can't work out how to get it to parse.
Current (simplified) grammar code (I removed the bits of the language not needed in this case):
#!/usr/bin/env python2.7
from pyparsing import *
# Erlang config file definition:
erlangAtom = Word( alphas + '_')
erlangString = dblQuotedString.setParseAction( removeQuotes )
erlangValue = Forward()
erlangList = Forward()
erlangElements = delimitedList( erlangValue )
erlangCSList = Suppress('[') + erlangElements + Suppress(']')
erlangList <<= Group( erlangCSList )
erlangTaggedTuple = Group( Suppress('{') + erlangAtom + Suppress(',') +
erlangValue + Suppress('}') )
erlangDict = Dict( Suppress('[') + delimitedList( erlangTaggedTuple ) +
Suppress(']') )
erlangValue <<= ( erlangAtom | erlangString |
erlangTaggedTuple |
erlangDict | erlangList )
if __name__ == "__main__":
working = """
[{foo,"bar"}, {baz, "bar2"}]
"""
broken = """
[
[{foo,"bar"}, {baz, "bar2"}],
[{foo,"bob"}, {baz, "fez"}]
]
"""
w = erlangValue.parseString(working)
print w.dump()
b = erlangValue.parseString(broken)
print "b[0]:", b[0].dump()
print "b[1]:", b[1].dump()
This gives:
[['foo', 'bar'], ['baz', 'bar2']]
- baz: bar2
- foo: bar
b[0]: [['foo', 'bar'], ['baz', 'bar2'], ['foo', 'bob'], ['baz', 'fez']]
- baz: fez
- foo: bob
b[1]:
Traceback (most recent call last):
File "./erl_testcase.py", line 39, in <module>
print "b[1]:", b[1].dump()
File "/Library/Python/2.7/site-packages/pyparsing.py", line 317, in __getitem__
return self.__toklist[i]
IndexError: list index out of range
i.e. working works, but broken doesn't parse as two lists.
Any ideas?
Edit: Tweaked testcase to be more explicit about expected output.
OK, I have never worked with pyparsing before, so excuse me if my solution does not make sense. Here we go:
As far as I understand, what you need is three main structures. The most common mistake you made was grouping delimitedLists. They are already grouped, so you have an issue of double grouping. Here are my definitions:
for {a,"b"}:
erlangTaggedTuple = Dict(Group(Suppress('{') + erlangAtom + Suppress(',') + erlangValue + Suppress('}') ))
for [{a,"b"}, {c,"d"}]:
erlangDict = Suppress('[') + delimitedList( erlangTaggedTuple ) + Suppress(']')
for the rest:
erlangList <<= Suppress('[') + delimitedList( Group(erlangDict|erlangList) ) + Suppress(']')
So my fix for your code is:
#!/usr/bin/env python2.7
from pyparsing import *
# Erlang config file definition:
erlangAtom = Word( alphas + '_')
erlangString = dblQuotedString.setParseAction( removeQuotes )
erlangValue = Forward()
erlangList = Forward()
erlangTaggedTuple = Dict(Group(Suppress('{') + erlangAtom + Suppress(',') +
erlangValue + Suppress('}') ))
erlangDict = Suppress('[') + delimitedList( erlangTaggedTuple ) + Suppress(']')
erlangList <<= Suppress('[') + delimitedList( Group(erlangDict|erlangList) ) + Suppress(']')
erlangValue <<= ( erlangAtom | erlangString |
erlangTaggedTuple |
erlangDict| erlangList )
if __name__ == "__main__":
working = """
[{foo,"bar"}, {baz, "bar2"}]
"""
broken = """
[
[{foo,"bar"}, {baz, "bar2"}],
[{foo,"bob"}, {baz, "fez"}]
]
"""
w = erlangValue.parseString(working)
print w.dump()
b = erlangValue.parseString(broken)
print "b[0]:", b[0].dump()
print "b[1]:", b[1].dump()
Which gives the output:
[['foo', 'bar'], ['baz', 'bar2']]
- baz: bar2
- foo: bar
b[0]: [['foo', 'bar'], ['baz', 'bar2']]
- baz: bar2
- foo: bar
b[1]: [['foo', 'bob'], ['baz', 'fez']]
- baz: fez
- foo: bob
Hope that helps, cheers!
I can't understand why it's not working, because your code looks very much like the JSON example, which handles nested lists just fine.
But the problem seems to happen at this line
erlangElements = delimitedList( erlangValue )
where if the erlangValues are lists, they get appended instead of cons'd. You can kludge around this with
erlangElements = delimitedList( Group(erlangValue) )
which adds an extra layer of list around the top-most element, but keeps your sub-lists from merging.
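With just that one change (a sketch, not fully tested, the rest of your grammar unchanged), the broken input should come back as two separate tagged-tuple lists:
erlangElements = delimitedList( Group(erlangValue) )
# ...
b = erlangValue.parseString(broken)
# b[0] now holds the two dicts as separate sub-lists, roughly:
# [[['foo', 'bar'], ['baz', 'bar2']], [['foo', 'bob'], ['baz', 'fez']]]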
I am trying to parse a file using the amazing python library pyparsing but I am having a lot of problems...
The file I am trying to parse is something like:
sectionOne:
    list:
        - XXitem
        - XXanotherItem
    key1: value1
    product: milk
    release: now
    subSection:
        skey : sval
        slist:
            - XXitem
        mods:
            - XXone
            - XXtwo
    version: last
sectionTwo:
    base: base-0.1
    config: config-7.0-7
As you can see, it is an indented configuration file, and this is more or less how I have tried to define the grammar:
The file can have one or more sections
Each section is formed by a section name and a section content.
Each section has indented content.
Each section content can have one or more pairs of key/value or a subsection.
Each value can be just a single word or a list of items.
A list of items is a group of one or more items.
Each item is a HYPHEN + a name starting with 'XX'.
I have tried to create this grammar using pyparsing but with no success.
import pprint
import pyparsing
NEWLINE = pyparsing.LineEnd().suppress()
VALID_CHARACTERS = pyparsing.srange("[a-zA-Z0-9_\-\.]")
COLON = pyparsing.Suppress(pyparsing.Literal(":"))
HYPHEN = pyparsing.Suppress(pyparsing.Literal("-"))
XX = pyparsing.Literal("XX")
list_item = HYPHEN + pyparsing.Combine(XX + pyparsing.Word(VALID_CHARACTERS))
list_of_items = pyparsing.Group(pyparsing.OneOrMore(list_item))
key = pyparsing.Word(VALID_CHARACTERS) + COLON
pair_value = pyparsing.Word(VALID_CHARACTERS) + NEWLINE
value = (pair_value | list_of_items)
pair = pyparsing.Group(key + value)
indentStack = [1]
section = pyparsing.Forward()
section_name = pyparsing.Word(VALID_CHARACTERS) + COLON
section_value = pyparsing.OneOrMore(pair | section)
section_content = pyparsing.indentedBlock(section_value, indentStack, True)
section << pyparsing.Group(section_name + section_content)
parser = pyparsing.OneOrMore(section)
def main():
    try:
        with open('simple.info', 'r') as content_file:
            content = content_file.read()
        print "content:\n", content
        print "\n"
        result = parser.parseString(content)
        print "result1:\n", result
        print "len", len(result)
        pprint.pprint(result.asList())
    except pyparsing.ParseException, err:
        print err.line
        print " " * (err.column - 1) + "^"
        print err
    except pyparsing.ParseFatalException, err:
        print err.line
        print " " * (err.column - 1) + "^"
        print err

if __name__ == '__main__':
    main()
This is the result:
result1:
[['sectionOne', [[['list', ['XXitem', 'XXanotherItem']], ['key1', 'value1'], ['product', 'milk'], ['release', 'now'], ['subSection', [[['skey', 'sval'], ['slist', ['XXitem']], ['mods', ['XXone', 'XXtwo']], ['version', 'last']]]]]]], ['sectionTwo', [[['base', 'base-0.1'], ['config', 'config-7.0-7']]]]]
len 2
[
 ['sectionOne',
  [[
    ['list', ['XXitem', 'XXanotherItem']],
    ['key1', 'value1'],
    ['product', 'milk'],
    ['release', 'now'],
    ['subSection',
     [[
       ['skey', 'sval'],
       ['slist', ['XXitem']],
       ['mods', ['XXone', 'XXtwo']],
       ['version', 'last']
     ]]
    ]
  ]]
 ],
 ['sectionTwo',
  [[
    ['base', 'base-0.1'],
    ['config', 'config-7.0-7']
  ]]
 ]
]
As you can see, I have two main problems:
1. Each section content is nested twice into a list.
2. The key "version" is parsed inside "subSection" when it belongs to "sectionOne".
My real goal is to get a structure of nested Python dictionaries with the keys and values, so I can easily extract the info for each field, but pyparsing.Dict is still obscure to me.
Could anyone please help me ?
Thanks in advance
( sorry for the long post )
You really are pretty close - congrats, indented parsers are not the easiest to write with pyparsing.
Look at the commented changes. Those marked with 'A' are changes to fix your two stated problems. Those marked with 'B' add Dict constructs so that you can access the parsed data as a nested structure using the names in the config.
The biggest culprit is that indentedBlock does some extra Group'ing for you, which gets in the way of Dict's name-value associations. Using ungroup to peel that away lets Dict see the underlying pairs.
Best of luck with pyparsing!
import pprint
import pyparsing
NEWLINE = pyparsing.LineEnd().suppress()
VALID_CHARACTERS = pyparsing.srange("[a-zA-Z0-9_\-\.]")
COLON = pyparsing.Suppress(pyparsing.Literal(":"))
HYPHEN = pyparsing.Suppress(pyparsing.Literal("-"))
XX = pyparsing.Literal("XX")
list_item = HYPHEN + pyparsing.Combine(XX + pyparsing.Word(VALID_CHARACTERS))
list_of_items = pyparsing.Group(pyparsing.OneOrMore(list_item))
key = pyparsing.Word(VALID_CHARACTERS) + COLON
pair_value = pyparsing.Word(VALID_CHARACTERS) + NEWLINE
value = (pair_value | list_of_items)
#~ A: pair = pyparsing.Group(key + value)
pair = (key + value)
indentStack = [1]
section = pyparsing.Forward()
section_name = pyparsing.Word(VALID_CHARACTERS) + COLON
#~ A: section_value = pyparsing.OneOrMore(pair | section)
section_value = (pair | section)
#~ B: section_content = pyparsing.indentedBlock(section_value, indentStack, True)
section_content = pyparsing.Dict(pyparsing.ungroup(pyparsing.indentedBlock(section_value, indentStack, True)))
#~ A: section << Group(section_name + section_content)
section << (section_name + section_content)
#~ B: parser = pyparsing.OneOrMore(section)
parser = pyparsing.Dict(pyparsing.OneOrMore(pyparsing.Group(section)))
Now instead of pprint(result.asList()) you can write:
print (result.dump())
to show the Dict hierarchy:
[['sectionOne', ['list', ['XXitem', 'XXanotherItem']], ... etc. ...
- sectionOne: [['list', ['XXitem', 'XXanotherItem']], ... etc. ...
  - key1: value1
  - list: ['XXitem', 'XXanotherItem']
  - mods: ['XXone', 'XXtwo']
  - product: milk
  - release: now
  - subSection: [['skey', 'sval'], ['slist', ['XXitem']]]
    - skey: sval
    - slist: ['XXitem']
  - version: last
- sectionTwo: [['base', 'base-0.1'], ['config', 'config-7.0-7']]
  - base: base-0.1
  - config: config-7.0-7
allowing you to write statements like:
print (result.sectionTwo.base)
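And if the end goal is plain nested Python dictionaries, ParseResults also has an asDict() method you can try (a sketch; depending on your pyparsing version the nested values may come back as ParseResults rather than plain dicts, but they can still be indexed by key):
result = parser.parseString(content)

cfg = result.asDict()
print (cfg['sectionTwo']['base'])          # base-0.1
print (result.sectionOne.subSection.skey)  # sval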
I have a 'make'-like style file that needs to be parsed. The grammar is something like:
samtools=/path/to/samtools
picard=/path/to/picard
task1:
    des: description
    path: /path/to/task1
    para: [$global.samtools,
           $args.input,
           $path
          ]
task2: task1
Here $global contains the variables defined in the global scope, $path is a 'local' variable, and $args contains the key/value pairs passed in by users.
I would like to parse this file with some Python library, ideally getting back a parse tree, and with errors reported if there are any. I found these: CodeTalker and yeanpypa. Can they be used in this case? Any other recommendations?
I had to guess what your makefile structure allows based on your example, but this should get you close:
from pyparsing import *
# elements of the makefile are delimited by line, so we must
# define skippable whitespace to include just spaces and tabs
ParserElement.setDefaultWhitespaceChars(' \t')
NL = LineEnd().suppress()
EQ,COLON,LBRACK,RBRACK = map(Suppress, "=:[]")
identifier = Word(alphas+'_', alphanums)
symbol_assignment = Group(identifier("name") + EQ + empty +
restOfLine("value"))("symbol_assignment")
symbol_ref = Word("$",alphanums+"_.")
def only_column_one(s,l,t):
    if col(l,s) != 1:
        raise ParseException(s,l,"not in column 1")
# task identifiers have to start in column 1
task_identifier = identifier.copy().setParseAction(only_column_one)
task_description = "des:" + empty + restOfLine("des")
task_path = "path:" + empty + restOfLine("path")
task_para_body = delimitedList(symbol_ref)
task_para = "para:" + LBRACK + task_para_body("para") + RBRACK
task_para.ignore(NL)
task_definition = Group(task_identifier("target") + COLON +
Optional(delimitedList(identifier))("deps") + NL +
(
Optional(task_description + NL) &
Optional(task_path + NL) &
Optional(task_para + NL)
)
)("task_definition")
makefile_parser = ZeroOrMore(
symbol_assignment |
task_definition |
NL
)
if __name__ == "__main__":
test = """\
samtools=/path/to/samtools
picard=/path/to/picard
task1:
des: description
path: /path/to/task1
para: [$global.samtools,
$args.input,
$path
]
task2: task1
"""
# dump out what we parsed, including results names
for element in makefile_parser.parseString(test):
print element.getName()
print element.dump()
print
Prints:
symbol_assignment
['samtools', '/path/to/samtools']
- name: samtools
- value: /path/to/samtools
symbol_assignment
['picard', '/path/to/picard']
- name: picard
- value: /path/to/picard
task_definition
['task1', 'des:', 'description ', 'path:', '/path/to/task1 ', 'para:',
'$global.samtools', '$args.input', '$path']
- des: description
- para: ['$global.samtools', '$args.input', '$path']
- path: /path/to/task1
- target: task1
task_definition
['task2', 'task1']
- deps: ['task1']
- target: task2
The dump() output shows you what names you can use to get at the fields within the parsed elements, or to distinguish what kind of element you have. dump() is a handy, generic tool to output whatever pyparsing has parsed. Here is some code that is more specific to your particular parser, showing how to use the field names as either dotted object references (element.target, element.deps, element.name, etc.) or dict-style references (element[key]):
for element in makefile_parser.parseString(test):
    if element.getName() == 'task_definition':
        print "TASK:", element.target,
        if element.deps:
            print "DEPS:(" + ','.join(element.deps) + ")"
        else:
            print
        for key in ('des', 'path', 'para'):
            if key in element:
                print "  ", key.upper()+":", element[key]
    elif element.getName() == 'symbol_assignment':
        print "SYM:", element.name, "->", element.value
prints:
SYM: samtools -> /path/to/samtools
SYM: picard -> /path/to/picard
TASK: task1
DES: description
PATH: /path/to/task1
PARA: ['$global.samtools', '$args.input', '$path']
TASK: task2 DEPS:(task1)
I've used pyparsing in the past and been immensely pleased with it (q.v., the pyparsing project site).