Splits based on pyparsing - python

so I want to do this (but using pyparsing)
Package:numpy11 Package:scipy
will be split into
[["Package:", "numpy11"], ["Package:", "scipy"]]
My code so far is
package_header = Literal("Package:")
single_package = Word(printables + " ") + ~Literal("Package:")
full_parser = OneOrMore( pp.Group( package_header + single_package ) )
The current output is this
([(['Package:', 'numpy11 Package:scipy'], {})], {})
I was hoping for something like this
([(['Package:', 'numpy11'], {})], [(['Package:', 'scipy'], {})], {})
Essentially the rest of the text matches pp.printables
I am aware that I can use Words but I want to do
all printables but not the Literal
How do I accomplish this? Thank you.

You shouldn't need the negative lookahead, ie. this:
from pyparsing import *
package_header = Literal("Package:")
single_package = Word(printables)
full_parser = OneOrMore( Group( package_header + single_package ) )
print full_parser.parseString("Package:numpy11 Package:scipy")
prints:
[['Package:', 'numpy11'], ['Package:', 'scipy']]
Update: to parse packages delimited by | you can use the delimitedList() function (now you can also have spaces in package names):
from pyparsing import *
package_header = Literal("Package:")
package_name = Regex(r'[^|]+') # | is a printable, so create a regex that excludes it.
package = Group(package_header + package_name)
full_parser = delimitedList(package, delim="|" )
print full_parser.parseString("Package:numpy11 foo|Package:scipy")
prints:
[['Package:', 'numpy11 foo'], ['Package:', 'scipy']]

Related

Pyparsing nested transformString

I had something working for a little while to transform a tag from lua to hmtl, but recently I got a special case where those tags could be nested. Here is a quick sample out of my code :
from pyparsing import Literal, Word, Suppress, SkipTo, LineEnd, hexnums
text = "|c71d5FFFFI'm saying something in color|cFFFFFFFF then in white |r|r"
def colorize (t):
hexRGB = "".join(list(t.hex)[:6])
return "<span style=\"color:#{};\">{}</span>".format(hexRGB, t.content)
vbar = Literal("|")
eol = LineEnd().suppress()
endTag = ((vbar + (Literal("r")|Literal("R"))|eol))
parser = (
Suppress(vbar + (Literal("c")|Literal("C"))) +
Word(hexnums, exact=8).setResultsName("hex") +
SkipTo(endTag).setResultsName("content") +
Suppress(endTag)
).addParseAction(colorize)
result = parser.transformString(text)
print (result)
I saw an another similar question Pyparsing: nested Markdown emphasis, but my problem is a bit different, sometime there is no closetag and lineEnd is acting as one.
You can add a while loop to iterate over result until all the colors are found:
from pyparsing import Literal, Word, Suppress, SkipTo, LineEnd, hexnums
def colorize (t):
hexRGB = "".join(list(t.hex)[:6])
return "<span style=\"color:#{};\">{}</span>".format(hexRGB, t.content)
vbar = Literal("|")
eol = LineEnd().suppress()
endTag = ((vbar + (Literal("r")|Literal("R"))|eol))
parser = (
Suppress(vbar + (Literal("c")|Literal("C"))) +
Word(hexnums, exact=8).setResultsName("hex") +
SkipTo(endTag).setResultsName("content") +
Suppress(endTag)
).addParseAction(colorize)
result = parser.transformString(text)
new_result = parser.transformString(result)
while(result != new_result):
result = new_result
new_result = parser.transformString(result)
print (result)
when text = "|c71d5FFFFI'm saying something in color|cFFFFFFFF then in white |r|r":
output:
<span style="color:#71d5FF;">I'm saying something in color<span style="color:#FFFFFF;"> then in white</span></span>
when text = "|c71d5FFFFI'm saying something in color"
output:
<span style="color:#71d5FF;">I'm saying something in color</span>

Extracting the values within the square braces

I have a file testfile with the set of server names as below.
app-server-l11[2-5].test.com
server-l34[5-8].test.com
dd-server-l[2-4].test.com
Can you please help in getting output to be as follow.
app-server-l112.test.com
app-server-l113.test.com
app-server-l114.test.com
app-server-l115.test.com
server-l345.test.com
server-l346.test.com
server-l347.test.com
server-l348.test.com
dd-server-l2.test.com
dd-server-l3.test.com
dd-server-l4.test.com
With GNU awk for the 3rd arg to match():
$ awk 'match($0,/(.*)\[([0-9]+)-([0-9]+)\](.*)/,a){for (i=a[2]; i<=a[3]; i++) print a[1] i a[4]}' file
app-server-l112.test.com
app-server-l113.test.com
app-server-l114.test.com
app-server-l115.test.com
server-l345.test.com
server-l346.test.com
server-l347.test.com
server-l348.test.com
dd-server-l2.test.com
dd-server-l3.test.com
dd-server-l4.test.com
In GNU awk:
$ awk -F"[][]" '{split($2,a,"-"); for(i=a[1];i<=a[2];i++) print $1 i $3}' file
app-server-l112.test.com
app-server-l113.test.com
app-server-l114.test.com
app-server-l115.test.com
server-l345.test.com
server-l346.test.com
server-l347.test.com
server-l348.test.com
dd-server-l2.test.com
dd-server-l3.test.com
dd-server-l4.test.com
split to fields by [ and ] using FS
use split the get the range start (a[1]) and end (a[2])
iterate the range with for and output
There is no checking whether there was a range or not. It could be implemented with something like: print (NF==3 ? $1 i $3 : $1 ).
Worst and ugliest example:
var='app-server-l11[2-5].test.com'
for i in range(int(var[(var.find('[') +1)]), int(var[(var.find("]") - 1)])+1):
print 'app-server-l11' + str(i) + '.test.com'
Use your imagination!
ser_nm = ['app-server-l11[2-5].test.com','server-134[5-8].test.com','dd-server-[2-4].test.com']
for nm in ser_nm:
for i in range(int(nm[nm.find('[')+1 : nm.find('-',(nm.find('[')+1))]), int(nm[nm.find('-',(nm.find('[')+1))+1:nm.find(']') ] )+1):
print(nm[:nm.find('[')] + str(i) + nm[nm.find(']')+1:])
This will also take care of cases where server names are like this:
'server-134[52-823].test.com'
not the best solution, but it works...
inp = open('input.txt', 'r+').read()
print(inp)
result= ''
for i in inp.split('\n'):
if len(i) > 1:
print(repr(i))
f1 = i.find('[')
f2 = i.find(']')+1
b1 = i[:f1]
b2 = i[f2:]
ins = i[f1:f2]
ins = ins[1:-1]
for j in range(int(ins.split("-")[0]),int(ins.split("-")[1])+1):
result+=b1+str(j)+b2+'\n'
outp = open('output.txt', 'w')
outp.write(result)
outp.close()
You can use the below command for the required output without any complex statement.
awk -f test.awk file.txt
test.awk must contains the below lines:
{
if(a=match($0,"\\["))
{
start=strtonum(substr($0,a+1,1));
end=strtonum(substr($0,a+3,1));
copy=$0;
for(i=start;i<=end;i++)
{
sub("\\[[0-9]{1,}-[0-9]{1,}\\]",i,copy);
print copy;
copy = $0;
}
}
else
{
print $0;
}
}
file.txt contains your input file like below lines:
app-server-l11[2-5].test.com
server-l34[5-8].test.com
dd-server-l[2-4].test.com
output:
app-server-l112.test.com
app-server-l113.test.com
app-server-l114.test.com
app-server-l115.test.com
server-l345.test.com
server-l346.test.com
server-l347.test.com
server-l348.test.com
dd-server-l2.test.com
dd-server-l3.test.com
dd-server-l4.test.com
As this sounds like a school assignment I'm going to be fairly vague.
I would use a regular expression to extract the numeric range and the rest of the address components, then use a loop to iterate over the extracted numeric range to build each address (using the other captured address components).
Since it's been over a week:
import re
inputs = [ "app-server-l11[2-5].test.com", "server-l34[5-8].test.com", "dd-server-l[2-4].test.com" ]
pattern = r"\s*(?P<subdomain>[a-zA-Z0-9-_.]+)\[(?P<range_start>\d+)-(?P<range_end>\d+)\](?P<domain>\S+)"
expr = re.compile( pattern )
def expand_domain( domain ):
mo = expr.match( domain )
if mo is not None:
groups = mo.groupdict()
subdomain = groups[ "subdomain" ]
domain = groups[ "domain" ]
range_start = int( groups[ "range_start" ] )
range_end = int( groups[ "range_end" ] )
result = [ "{}{:d}{}".format( subdomain, index, domain ) for index in range( range_start, range_end + 1 ) ]
return result
else:
raise ValueError( "'{}' does not match the expected input.".format( domain ) )
for domain in inputs:
print( "'{}':".format( domain ) )
for exp_dom in expand_domain( domain ):
print( "--> {}".format( exp_dom ) )

Cannot parse correctly this file with pyparsing

I am trying to parse a file using the amazing python library pyparsing but I am having a lot of problems...
The file I am trying to parse is something like:
sectionOne:
list:
- XXitem
- XXanotherItem
key1: value1
product: milk
release: now
subSection:
skey : sval
slist:
- XXitem
mods:
- XXone
- XXtwo
version: last
sectionTwo:
base: base-0.1
config: config-7.0-7
As you can see is an indented configuration file, and this is more or less how I have tried to define the grammar
The file can have one or more sections
Each section is formed by a section name and a section content.
Each section have an indented content
Each section content can have one or more pairs of key/value or a subsection.
Each value can be just a single word or a list of items.
A list of items is a group of one or more items.
Each item is an HYPHEN + a name starting with 'XX'
I have tried to create this grammar using pyparsing but with no success.
import pprint
import pyparsing
NEWLINE = pyparsing.LineEnd().suppress()
VALID_CHARACTERS = pyparsing.srange("[a-zA-Z0-9_\-\.]")
COLON = pyparsing.Suppress(pyparsing.Literal(":"))
HYPHEN = pyparsing.Suppress(pyparsing.Literal("-"))
XX = pyparsing.Literal("XX")
list_item = HYPHEN + pyparsing.Combine(XX + pyparsing.Word(VALID_CHARACTERS))
list_of_items = pyparsing.Group(pyparsing.OneOrMore(list_item))
key = pyparsing.Word(VALID_CHARACTERS) + COLON
pair_value = pyparsing.Word(VALID_CHARACTERS) + NEWLINE
value = (pair_value | list_of_items)
pair = pyparsing.Group(key + value)
indentStack = [1]
section = pyparsing.Forward()
section_name = pyparsing.Word(VALID_CHARACTERS) + COLON
section_value = pyparsing.OneOrMore(pair | section)
section_content = pyparsing.indentedBlock(section_value, indentStack, True)
section << pyparsing.Group(section_name + section_content)
parser = pyparsing.OneOrMore(section)
def main():
try:
with open('simple.info', 'r') as content_file:
content = content_file.read()
print "content:\n", content
print "\n"
result = parser.parseString(content)
print "result1:\n", result
print "len", len(result)
pprint.pprint(result.asList())
except pyparsing.ParseException, err:
print err.line
print " " * (err.column - 1) + "^"
print err
except pyparsing.ParseFatalException, err:
print err.line
print " " * (err.column - 1) + "^"
print err
if __name__ == '__main__':
main()
This is the result :
result1:
[['sectionOne', [[['list', ['XXitem', 'XXanotherItem']], ['key1', 'value1'], ['product', 'milk'], ['release', 'now'], ['subSection', [[['skey', 'sval'], ['slist', ['XXitem']], ['mods', ['XXone', 'XXtwo']], ['version', 'last']]]]]]], ['sectionTwo', [[['base', 'base-0.1'], ['config', 'config-7.0-7']]]]]
len 2
[
['sectionOne',
[[
['list', ['XXitem', 'XXanotherItem']],
['key1', 'value1'],
['product', 'milk'],
['release', 'now'],
['subSection',
[[
['skey', 'sval'],
['slist', ['XXitem']],
['mods', ['XXone', 'XXtwo']],
['version', 'last']
]]
]
]]
],
['sectionTwo',
[[
['base', 'base-0.1'],
['config', 'config-7.0-7']
]]
]
]
As you can see I have two main problems:
1.- Each section content is nested twice into a list
2.- the key "version" is parsed inside the "subSection" when it belongs to the "sectionOne"
My real target is to be able to get a structure of python nested dictionaries with the keys and values to easily extract the info for each field, but the pyparsing.Dict is something obscure to me.
Could anyone please help me ?
Thanks in advance
( sorry for the long post )
You really are pretty close - congrats, indented parsers are not the easiest to write with pyparsing.
Look at the commented changes. Those marked with 'A' are changes to fix your two stated problems. Those marked with 'B' add Dict constructs so that you can access the parsed data as a nested structure using the names in the config.
The biggest culprit is that indentedBlock does some extra Group'ing for you, which gets in the way of Dict's name-value associations. Using ungroup to peel that away lets Dict see the underlying pairs.
Best of luck with pyparsing!
import pprint
import pyparsing
NEWLINE = pyparsing.LineEnd().suppress()
VALID_CHARACTERS = pyparsing.srange("[a-zA-Z0-9_\-\.]")
COLON = pyparsing.Suppress(pyparsing.Literal(":"))
HYPHEN = pyparsing.Suppress(pyparsing.Literal("-"))
XX = pyparsing.Literal("XX")
list_item = HYPHEN + pyparsing.Combine(XX + pyparsing.Word(VALID_CHARACTERS))
list_of_items = pyparsing.Group(pyparsing.OneOrMore(list_item))
key = pyparsing.Word(VALID_CHARACTERS) + COLON
pair_value = pyparsing.Word(VALID_CHARACTERS) + NEWLINE
value = (pair_value | list_of_items)
#~ A: pair = pyparsing.Group(key + value)
pair = (key + value)
indentStack = [1]
section = pyparsing.Forward()
section_name = pyparsing.Word(VALID_CHARACTERS) + COLON
#~ A: section_value = pyparsing.OneOrMore(pair | section)
section_value = (pair | section)
#~ B: section_content = pyparsing.indentedBlock(section_value, indentStack, True)
section_content = pyparsing.Dict(pyparsing.ungroup(pyparsing.indentedBlock(section_value, indentStack, True)))
#~ A: section << Group(section_name + section_content)
section << (section_name + section_content)
#~ B: parser = pyparsing.OneOrMore(section)
parser = pyparsing.Dict(pyparsing.OneOrMore(pyparsing.Group(section)))
Now instead of pprint(result.asList()) you can write:
print (result.dump())
to show the Dict hierarchy:
[['sectionOne', ['list', ['XXitem', 'XXanotherItem']], ... etc. ...
- sectionOne: [['list', ['XXitem', 'XXanotherItem']], ... etc. ...
- key1: value1
- list: ['XXitem', 'XXanotherItem']
- mods: ['XXone', 'XXtwo']
- product: milk
- release: now
- subSection: [['skey', 'sval'], ['slist', ['XXitem']]]
- skey: sval
- slist: ['XXitem']
- version: last
- sectionTwo: [['base', 'base-0.1'], ['config', 'config-7.0-7']]
- base: base-0.1
- config: config-7.0-7
allowing you to write statements like:
print (result.sectionTwo.base)

pyparsing command line strings with line continuations

I'm trying to use pyparsing to parse command line style strings in which the arguments themselves may contain backslash line continuations, such as the value for -arg4 in the following example:
import pyparsing as pp
cmd = r"""shellcmd -arg1 val1 -arg2 val2 \
-arg3 val3 \
-arg4 'quoted \
line-continued \
string \
'"""
continuation = '\\' + pp.LineEnd()
option = pp.Word('-', pp.alphanums)
arg1 = ~pp.Literal('-') + pp.Word(pp.printables)
arg2 = pp.quotedString
arg2.ignore(continuation)
arg = arg1 | arg2
command = pp.Word(pp.alphas) + pp.ZeroOrMore(pp.Group(option + pp.Optional(arg)))
command.ignore(continuation)
print command.parseString(cmd)
The result is:
['shellcmd', ['-arg1', 'val1'], ['-arg2', 'val2'], ['-arg3', 'val3'], ['-arg4', "'quoted"]]
when what I want is something like this:
['shellcmd', ['-arg1', 'val1'], ['-arg2', 'val2'], ['-arg3', 'val3'], ['-arg4', 'quoted line-continued string']]
I would very much appreciate your help in pointing out my error and the fix.
Using cmd as you've posted it above, I would parse it like this:
from pyparsing import *
continuation = ('\\' + LineEnd()).suppress()
name = Word(alphanums)
# Parse out the multiline quoted string
def QString(s,loc,tokens):
text = Word(alphanums+'-') + Optional(continuation)
g = Combine(ZeroOrMore(text),adjacent=False, joinString=" ")
return g.parseString(tokens[0])
arg = name + Optional(continuation)
qarg = QuotedString("\'",multiline=True)
qarg.setParseAction(QString)
vals = Group(ZeroOrMore(arg | qarg))
option = Literal("-").suppress() + Group(name + vals)
grammar = name + ZeroOrMore(option)
sol = grammar.parseString(cmd)
print sol
Giving:
['shellcmd', ['arg1', ['val1']], ['arg2', ['val2']], ['arg3', ['val3']], ['arg4', ['quoted line-continued string']]]
The real key here is using the QuotedString option multiline=True which saves a lot of headache. This solution is a bit more flexible than the one you proposed, being able to handle multiple arguments i.e. -arg a b c or even -arg a b 'long-string-with-dashes' c d e.

Any python module for customized BNF parser?

friends.
I have a 'make'-like style file needed to be parsed. The grammar is something like:
samtools=/path/to/samtools
picard=/path/to/picard
task1:
des: description
path: /path/to/task1
para: [$global.samtools,
$args.input,
$path
]
task2: task1
Where $global contains the variables defined in a global scope. $path is a 'local' variable. $args contains the key/pair values passed in by users.
I would like to parse this file by some python libraries. Better to return some parse tree. If there are some errors, better to report them. I found this one: CodeTalker and yeanpypa. Can they be used in this case? Any other recommendations?
I had to guess what your makefile structure allows based on your example, but this should get you close:
from pyparsing import *
# elements of the makefile are delimited by line, so we must
# define skippable whitespace to include just spaces and tabs
ParserElement.setDefaultWhitespaceChars(' \t')
NL = LineEnd().suppress()
EQ,COLON,LBRACK,RBRACK = map(Suppress, "=:[]")
identifier = Word(alphas+'_', alphanums)
symbol_assignment = Group(identifier("name") + EQ + empty +
restOfLine("value"))("symbol_assignment")
symbol_ref = Word("$",alphanums+"_.")
def only_column_one(s,l,t):
if col(l,s) != 1:
raise ParseException(s,l,"not in column 1")
# task identifiers have to start in column 1
task_identifier = identifier.copy().setParseAction(only_column_one)
task_description = "des:" + empty + restOfLine("des")
task_path = "path:" + empty + restOfLine("path")
task_para_body = delimitedList(symbol_ref)
task_para = "para:" + LBRACK + task_para_body("para") + RBRACK
task_para.ignore(NL)
task_definition = Group(task_identifier("target") + COLON +
Optional(delimitedList(identifier))("deps") + NL +
(
Optional(task_description + NL) &
Optional(task_path + NL) &
Optional(task_para + NL)
)
)("task_definition")
makefile_parser = ZeroOrMore(
symbol_assignment |
task_definition |
NL
)
if __name__ == "__main__":
test = """\
samtools=/path/to/samtools
picard=/path/to/picard
task1:
des: description
path: /path/to/task1
para: [$global.samtools,
$args.input,
$path
]
task2: task1
"""
# dump out what we parsed, including results names
for element in makefile_parser.parseString(test):
print element.getName()
print element.dump()
print
Prints:
symbol_assignment
['samtools', '/path/to/samtools']
- name: samtools
- value: /path/to/samtools
symbol_assignment
['picard', '/path/to/picard']
- name: picard
- value: /path/to/picard
task_definition
['task1', 'des:', 'description ', 'path:', '/path/to/task1 ', 'para:',
'$global.samtools', '$args.input', '$path']
- des: description
- para: ['$global.samtools', '$args.input', '$path']
- path: /path/to/task1
- target: task1
task_definition
['task2', 'task1']
- deps: ['task1']
- target: task2
The dump() output shows you what names you can use to get at the fields within the parsed elements, or to distinguish what kind of element you have. dump() is a handy, generic tool to output whatever pyparsing has parsed. Here is some code that is more specific to your particular parser, showing how to use the field names as either dotted object references (element.target, element.deps, element.name, etc.) or dict-style references (element[key]):
for element in makefile_parser.parseString(test):
if element.getName() == 'task_definition':
print "TASK:", element.target,
if element.deps:
print "DEPS:(" + ','.join(element.deps) + ")"
else:
print
for key in ('des', 'path', 'para'):
if key in element:
print " ", key.upper()+":", element[key]
elif element.getName() == 'symbol_assignment':
print "SYM:", element.name, "->", element.value
prints:
SYM: samtools -> /path/to/samtools
SYM: picard -> /path/to/picard
TASK: task1
DES: description
PATH: /path/to/task1
PARA: ['$global.samtools', '$args.input', '$path']
TASK: task2 DEPS:(task1)
I've used pyparsing in the past and been immensely pleased with it (q.v., the pyparsing project site).

Categories