Get correct brace grouping from string - python

I have files with incorrect JSON that I want to start fixing by getting it into properly grouped chunks.
The brace grouping {{ {} {} } } {{}} {{{}}} should already be correct
How can I grab all the top-level braces, correctly grouped, as separate strings?

If you don't want to install any extra modules, a simple function will do:
def top_level(s):
    depth = 0
    start = -1
    for i, c in enumerate(s):
        if c == '{':
            if depth == 0:
                start = i
            depth += 1
        elif c == '}' and depth:
            depth -= 1
            if depth == 0:
                yield s[start:i+1]

print(list(top_level('{{ {} {} } } {{}} {{{}}}')))
Output:
['{{ {} {} } }', '{{}}', '{{{}}}']
It will skip invalid braces but could be easily modified to report an error when they are spotted.
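One possible way to make that modification is sketched below. The name top_level_strict and the wording of the error messages are made up for illustration; the scanning logic is otherwise the same as in the function above.

```python
def top_level_strict(s):
    """Yield top-level {...} groups; raise ValueError on unbalanced braces."""
    depth = 0
    start = -1
    for i, c in enumerate(s):
        if c == '{':
            if depth == 0:
                start = i          # remember where the outermost group opens
            depth += 1
        elif c == '}':
            if depth == 0:
                # A closing brace with nothing open is reported, not skipped.
                raise ValueError("unmatched '}' at index %d" % i)
            depth -= 1
            if depth == 0:
                yield s[start:i + 1]
    if depth:
        # Input ended while the group opened at 'start' was still open.
        raise ValueError("unclosed '{' opened at index %d" % start)


print(list(top_level_strict('{{ {} {} } } {{}} {{{}}}')))
```

Because this is a generator, the error is only raised once iteration reaches the bad brace, so any groups found before it are still produced.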

Using the regex module:
In [1]: import regex
In [2]: braces = regex.compile(r"\{(?:[^{}]++|(?R))*\}")
In [3]: braces.findall("{{ {} {} } } {{}} {{{}}}")
Out[3]: ['{{ {} {} } }', '{{}}', '{{{}}}']

pyparsing can be really helpful here. It will handle pathological cases where you have braces inside strings, etc. It might be a little tricky to do all of this work yourself, but fortunately, somebody (the author of the library) has already done the hard stuff for us.... I'll reproduce the code here to prevent link-rot:
# jsonParser.py
#
# Implementation of a simple JSON parser, returning a hierarchical
# ParseResults object support both list- and dict-style data access.
#
# Copyright 2006, by Paul McGuire
#
# Updated 8 Jan 2007 - fixed dict grouping bug, and made elements and
# members optional in array and object collections
#
json_bnf = """
object
    { members }
    {}
members
    string : value
    members , string : value
array
    [ elements ]
    []
elements
    value
    elements , value
value
    string
    number
    object
    array
    true
    false
    null
"""
from pyparsing import *
TRUE = Keyword("true").setParseAction( replaceWith(True) )
FALSE = Keyword("false").setParseAction( replaceWith(False) )
NULL = Keyword("null").setParseAction( replaceWith(None) )
jsonString = dblQuotedString.setParseAction( removeQuotes )
jsonNumber = Combine( Optional('-') + ( '0' | Word('123456789',nums) ) +
                      Optional( '.' + Word(nums) ) +
                      Optional( Word('eE',exact=1) + Word(nums+'+-',nums) ) )
jsonObject = Forward()
jsonValue = Forward()
jsonElements = delimitedList( jsonValue )
jsonArray = Group(Suppress('[') + Optional(jsonElements) + Suppress(']') )
jsonValue << ( jsonString | jsonNumber | Group(jsonObject) | jsonArray | TRUE | FALSE | NULL )
memberDef = Group( jsonString + Suppress(':') + jsonValue )
jsonMembers = delimitedList( memberDef )
jsonObject << Dict( Suppress('{') + Optional(jsonMembers) + Suppress('}') )
jsonComment = cppStyleComment
jsonObject.ignore( jsonComment )
def convertNumbers(s,l,toks):
    n = toks[0]
    try:
        return int(n)
    except ValueError:
        return float(n)
jsonNumber.setParseAction( convertNumbers )
Phew! That's a lot ... Now how do we use it? The general strategy here will be to scan the string for matches and then slice those matches out of the original string. Each scan result is a tuple of the form (lex-tokens, start_index, stop_index). For our use, we don't care about the lex-tokens, just the start and stop. We could do string[result[1]:result[2]] and it would work. We can also do string[slice(*result[1:])] -- take your pick.
results = jsonObject.scanString(testdata)
for result in results:
    print('*' * 80)
    print(testdata[slice(*result[1:])])
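The pathological case mentioned above, braces inside string literals, can also be covered without pyparsing if that is the only complication. The following is a rough sketch rather than a JSON parser: the name top_level_json is made up, and it assumes strings are double-quoted with backslash escapes, as in JSON. It is the plain depth counter with a small amount of state added so that braces inside strings are not counted.

```python
def top_level_json(s):
    """Yield top-level {...} groups, ignoring braces inside "..." strings."""
    depth = 0
    start = -1
    in_string = False   # currently inside a double-quoted string?
    escaped = False     # was the previous character a backslash in a string?
    for i, c in enumerate(s):
        if in_string:
            if escaped:
                escaped = False        # this character is escaped, skip it
            elif c == '\\':
                escaped = True
            elif c == '"':
                in_string = False      # closing quote
        elif c == '"':
            in_string = True           # opening quote
        elif c == '{':
            if depth == 0:
                start = i
            depth += 1
        elif c == '}' and depth:
            depth -= 1
            if depth == 0:
                yield s[start:i + 1]


print(list(top_level_json('{"a": "}"} {"b": {"c": "{"}}')))
```

Like the simple function, it silently skips stray closing braces, and it does not validate anything else about the input.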

Related

Parsing dbus monitor output messages

I am trying to parse the dbus-monitor output messages. Most of the messages are multi-line entries (including parameters). I need to parse each log message and concatenate it into a single-line entry.
The dbus-monitor output messages appear as below,
method call time=462.117843 sender=:1.62 -> destination=org.freedesktop.filehandler serial=122 path=/org/freedesktop/filehandler/routing; interface=org.freedesktop.filehandler.routing; member=start
int16 29877
uint16 0
method return time=462.117844 sender=org.freedesktop.filehandler -> destination=:1.62 serial=2210 reply_serial=122
int16 29877
uint16 0
method call time=462.117845 sender=:1.62 -> destination=org.freedesktop.filehandler serial=123 path=/org/freedesktop/filehandler/routing; interface=org.freedesktop.filehandler.routing; member=comment
string "starting .."
string "routing"
method return time=462.117846 sender=:1.19 -> destination=:1.62 serial=2212 reply_serial=123
int12 -23145
signal time=463.11223 sender=:1.64 -> destination=(null destination) serial=124 path=/org/freedesktop/fileserver; interface=org.freedesktop.DBus.Properties; member=PropertiesChanged
string "com.freedesktop.Systemserver"
array[
dict entry(
string "SystemTime"
variant struct{
byte 12
byte 9
byte 0
}
)
]
array [
]
This is the regex I tried for grouping the dbus messages (parameters are not grouped):
\b(signal|method call|method return)\b time=([\d,.]*) sender=([\w,.,:,(,), ]*) -> destination=([\w,.,:,(,), ]*) serial=([(,),\w]*) (?:path=([\w,\/]*); interface=([\w,.]*); member=([\w,_,-]*))?(?:reply_serial=([\d]*))?
I expect the output in the below format,
C [sender,serial] path interface+member (parameter1, parameter2, ...)
R [destination,reply_serial] interface+member (parameter1, parameter2, ...)
S [sender, serial] path interface+member (parameter1, parameter2, ...)
A sample output for the above dbus-monitor messages is shown below,
C [:1.62,122] /org/freedesktop/filehandler/routing org.freedesktop.filehandler.routing.start (29877,0)
R [:1.62,122] org.freedesktop.filehandler.routing.start (29877,0)
C [:1.62,123] /org/freedesktop/filehandler/routing org.freedesktop.filehandler.routing.comment ("starting", "routing")
R [:1.62,123] org.freedesktop.filehandler.routing.comment (-23145)
S [:1.64, 124] /org/freedesktop/fileserver org.freedesktop.DBus.Properties.PropertiesChanged ("com.freedesktop.Systemserver"[("SystemTime",{12,9,0})][])
How can the above expected result be achieved when the entries are usually multi-line? Also, the signals have multiple levels of encapsulation, making it difficult to access the parameters. Can someone help with parsing these dbus messages into the expected format?
Can you suggest how the code can be rewritten to process line by line?
Here I rearranged it accordingly:
import re
import sys
regex = r'\b(signal|method call|method return)\b time=([\d,.]*) sender=([\w,.,:,(,), ]*) -> destination=([\w,.,:,(,), ]*) serial=([(,),\w]*) (?:path=([\w,\/]*); interface=([\w,.]*); member=([\w,_,-]*))?(?:reply_serial=([\d]*))?'
remember = dict()
sep = None
for line in open('dbusl.in'):
    m = re.match(regex, line)
    if m:
        if sep is not None: print ")"   # end the previous parameter group
        m = list(m.groups())            # each match is 9 capturing groups
        if m[0] == 'method call':
            print "C [{2},{4}] {5} {6}.{7}".format(*m),
            remember[m[4]] = m[6:8]     # store interface+member for return
        if m[0] == 'method return':
            m[6:8] = remember.pop(m[8]) # recall stored interface+member
            print "R [{3},{8}] {6}.{7}".format(*m),
        if m[0] == 'signal':
            print "S [{2}, {4}] {5} {6}.{7}".format(*m),
        sep = "("
    else:
        p = line.rstrip()               # now handle parameters
        if p[-1] in "[](){}":           # with "encapsulations":
            p = p[-1]                   # delete spaces, "array", "dict ..."
        p = re.sub('^\s*\w*\s*', '', p) # delete spaces and data type
        if p[-1] in "])}":
            sep = ''                    # no separator before closing
        print sep+p,
        sys.stdout.softspace=0
        if p[-1] in "[](){}": sep = ''
        else: sep = ', '                # separator after data item
print ")"   # end the previous parameter group
Note that I also changed m[6:8] = remember[m[8]] to m[6:8] = remember.pop(m[8]) in order to free the memory of no longer needed interface+member data.
If you absolutely have to use dbus-monitor, it’s probably best to use its PCAP output mode by passing the --pcap option to it. That outputs in a well-documented structured format which can be read by libpcap.
As you already have a usable regex, you can build on it by using it with re.split to get the needed message parts. Note that this yields a separate string for each capture group plus one string with the parameters, for each message entry. This example assumes that all the messages are in the string messages:
import re
import sys
regex = r'\b(signal|method call|method return)\b time=([\d,.]*) sender=([\w,.,:,(,), ]*) -> destination=([\w,.,:,(,), ]*) serial=([(,),\w]*) (?:path=([\w,\/]*); interface=([\w,.]*); member=([\w,_,-]*))?(?:reply_serial=([\d]*))?'
m = re.split(regex, messages)
m = m[1:] # discard empty? text before first match
remember = dict()
while m:    # each match group is 9 capturing groups + 1 parameter group
    if m[0] == 'method call':
        print "C [{2},{4}] {5} {6}.{7}".format(*m),
        remember[m[4]] = m[6:8]     # store interface+member for return
    if m[0] == 'method return':
        m[6:8] = remember[m[8]]     # recall stored interface+member
        print "R [{3},{8}] {6}.{7}".format(*m),
    if m[0] == 'signal':
        print "S [{2}, {4}] {5} {6}.{7}".format(*m),
    # now handle parameters
    sep = "("
    for p in m[9].split('\n')[1:-1]:    # except empty string at start and end
        if p[-1] in "[](){}":           # with "encapsulations":
            p = p[-1]                   # delete spaces, "array", "dict ..."
        p = re.sub('^\s*\w*\s*', '', p) # delete spaces and data type
        if p[-1] in "])}":
            sep = ''                    # no separator before closing
        print sep+p,
        sys.stdout.softspace=0
        if p[-1] in "[](){}": sep = ''
        else: sep = ', '                # separator after data item
    print ")"
    m = m[10:]  # delete the processed match group of 10
The output with your sample data is:
C [:1.62,122] /org/freedesktop/filehandler/routing org.freedesktop.filehandler.routing.start (29877, 0)
R [:1.62,122] org.freedesktop.filehandler.routing.start (29877, 0)
C [:1.62,123] /org/freedesktop/filehandler/routing org.freedesktop.filehandler.routing.comment ("starting ..", "routing")
R [:1.62,123] org.freedesktop.filehandler.routing.comment (-23145)
S [:1.64, 124] /org/freedesktop/fileserver org.freedesktop.DBus.Properties.PropertiesChanged ("com.freedesktop.Systemserver", [("SystemTime", {12, 9, 0})][])
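Both scripts above are Python 2 and rely on the print statement's trailing comma. For Python 3, the same line-by-line idea could be sketched roughly as follows. This is not a drop-in port: the names HEADER, summarize and join_params are made up, the header pattern is a simplified stand-in for the regex in the question, and it assumes each parameter sits on its own line as in the sample.

```python
import re

# Simplified header pattern (an assumption, stricter than the question's regex).
HEADER = re.compile(
    r'(signal|method call|method return) time=[\d.]+ '
    r'sender=(\S+) -> destination=(\(null destination\)|\S+) serial=(\d+)'
    r'(?: path=([^;]+); interface=([^;]+); member=(\S+))?'
    r'(?: reply_serial=(\d+))?'
)
BRACKETS = set("[](){}")
CLOSERS = set("])}")


def join_params(tokens):
    """Join parameter tokens; no comma next to a bracket, as in the answer."""
    out = ""
    prev = None
    for tok in tokens:
        # Comma only after a data item and only if no closing bracket follows.
        if prev is not None and tok not in CLOSERS and prev not in BRACKETS:
            out += ", "
        out += tok
        prev = tok
    return out


def summarize(lines):
    """Return one summary string per dbus-monitor message."""
    pending = {}    # call serial -> "interface.member", awaiting its return
    records = []    # one [header_text, [parameter tokens]] pair per message
    for line in lines:
        m = HEADER.match(line)
        if m:
            kind, sender, dest, serial, path, iface, member, reply = m.groups()
            if kind == "method call":
                pending[serial] = iface + "." + member
                head = "C [{},{}] {} {}.{}".format(sender, serial, path, iface, member)
            elif kind == "method return":
                # "?" marks a return whose call was never seen.
                head = "R [{},{}] {}".format(dest, reply, pending.pop(reply, "?"))
            else:
                head = "S [{}, {}] {} {}.{}".format(sender, serial, path, iface, member)
            records.append([head, []])
        elif records and line.strip():
            p = line.strip()
            if p[-1] in BRACKETS and not p.startswith("string "):
                p = p[-1]                       # keep only the bracket itself
            else:
                p = re.sub(r"^\w+\s+", "", p)   # drop the leading type name
            records[-1][1].append(p)
    return [head + " (" + join_params(params) + ")" for head, params in records]
```

Calling summarize(open('dbusl.in')) and printing each returned string gives one line per message. Collecting tokens per message and joining them at the end avoids the softspace trick entirely.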

How do I Use re() in Python and Return Capture Groups within an "If" Statement?

Although I've been using Perl for many years, I've always had trouble with anything more than fairly basic use of regular expressions in the language. The situation is only worse now that I'm trying to learn Python, where the use of re() is even less clear to me.
I'm trying to check whether a substring is in a string using re(), and I'm also using capture groups to extract some info from the match. However, I can't get things to work in a couple of contexts: making the re() call and assigning the returned values all within an "if" statement, and handling the case where the .groups items are not defined in the match object (when no match is made).
So, what follows are examples of what I'm trying to do coded in Perl and Python, with their respective outputs.
I'd appreciate any pointers on how I might better approach the problem using Python.
Perl Code:
use strict;
use warnings;
my ($idx, $dvalue);
while (my $rec = <DATA>) {
    chomp($rec);
    if ( ($idx, $dvalue) = ($rec =~ /^XA([0-9]+)=(.*?)!/) ) {
        printf(" Matched:\n");
        printf(" rec: >%s<\n", $rec);
        printf(" index = >%s< value = >%s<\n", $idx, $dvalue);
    } elsif ( ($idx, $dvalue) = ($rec =~ /^PZ([0-9]+)=(.*?[^#])!/) ) {
        printf(" Matched:\n");
        printf(" rec: >%s<\n", $rec);
        printf(" index = >%s< value = >%s<\n", $idx, $dvalue);
    } else {
        printf("\n Unknown Record format, \\%s\\\n\n", $rec);
    }
}
close(DATA);
exit(0)
__DATA__
DUD=ABC!QUEUE=D23!
XA32=7!P^=32!
PZ112=123^!PQ=ABC!
Perl Output:
Unknown Record format, \DUD=ABC!QUEUE=D23!\
Matched:
rec: >XA32=7!P^=32!<
index = >32< value = >7<
Matched:
rec: >PZ112=123^!PQ=ABC!<
index = >112< value = >123^<
Python Code:
import re
string = 'XA32=7!P^=32!'
with open('data.dat', 'r') as fh:
    for rec in fh:
        orec = ' rec: >' + rec.rstrip('\n') + '<'
        print(orec)
        # always using 'string' at least lets this program run
        (index, dvalue) = re.search(r'^XA([0-9]+)=(.*?[^#])!', string).groups()
        # The following works when there is a match... but fails with an error when
        # a match is NOT found, viz:-
        # ...
        # (index, dvalue) = re.search(r'^XA([0-9]+)=(.*?[^#])!', rec).groups()
        #
        # Traceback (most recent call last):
        # File "T:\tmp\a.py", line 13, in <module>
        # (index, dvalue) = re.search(r'^XA([0-9]+)=(.*?[^#])!', rec).groups()
        # AttributeError: 'NoneType' object has no attribute 'groups'
        #
        buf = ' index = >' + index + '<' + ' value = >' + dvalue + '<'
        print(buf)
exit(0)
data.dat contents:
DUD=ABC!QUEUE=D23!
XA32=7!P^=32!
PZ112=123^!PQ=ABC!
Python Output:
rec: >DUD=ABC!QUEUE=D23!<
index = >32< value = >7<
rec: >XA32=7!P^=32!<
index = >32< value = >7<
rec: >PZ112=123^!PQ=ABC!<
index = >32< value = >7<
Another development: Some more code to help me understand this better... but I'm unsure about when/how to use the match.group() or match.groups() ...
Python Code:
import re
rec = 'XA22=11^!S^=64!ABC=0,0!PX=0!SP=12B!'
print("rec = >{}<".format(rec))
# ----
index = 0 ; dvalue = 0 ; x = 0
match = re.match(r'XA([0-9]+)=(.*?[^#])!(.*?)!', rec)
if match:
    (index, dvalue, x) = match.groups()
print("3 (): index = >{}< value = >{}< x = >{}<".format(index, dvalue, x))
# ----
index = 0 ; dvalue = 0 ; x = 0
match = re.match(r'XA([0-9]+)=(.*?[^#])!', rec)
if match:
    (index, dvalue) = match.groups()
print("2 (): index = >{}< value = >{}< x = >{}<".format(index, dvalue, x))
# ----
index = 0 ; dvalue = 0 ; x = 0
match = re.match(r'XA([0-9]+)=', rec)
if match:
    #(index) = match.groups() # Why doesn't this work like above examples!?
    (index, ) = match.groups() # ...and yet this works!?
    # Does match.groups ALWAYS returns a tuple!?
    #(index) = match.group(1) # This also works; 0 = entire matched string?
print("1 (): index = >{}< value = >{}< x = >{}<".format(index, dvalue, x))
# ----
index = 0 ; dvalue = 0 ; x = 0
match = re.search(r'S\^=([0-9]+)!', rec)
if match:
    (index, ) = match.groups() # Returns tuple(?!)
print("1 (): index = >{}< value = >{}< x = >{}<".format(index, dvalue, x))
Again, I'd appreciate any thoughts on which is the 'preferred' way.. or if there's another way to deal with the groups.
You need to check for a match first, then use the groups. I.e.:
1. compile the regexes (optional for most cases nowadays, according to the documentation)
2. apply each regex to the string to generate a match object
   - match() only matches at the beginning of a string, i.e. with an implicit ^ anchor
   - search() matches anywhere in the string
3. check if the match object is valid
4. extract the groups
5. skip to next loop iteration
# works with Python 2 and Python 3
import re
with open('dummy.txt', 'r') as fh:
    for rec in fh:
        orec = ' rec: >' + rec.rstrip('\n') + '<'
        print(orec)
        match = re.match(r'XA([0-9]+)=(.*?[^#])!', rec)
        if match:
            (index, dvalue) = match.groups()
            print(" index = >{}< value = >{}<".format(index, dvalue))
            continue
        match = re.match(r'PZ([0-9]+)=(.*?[^#])!', rec)
        if match:
            (index, dvalue) = match.groups()
            print(" index = >{}< value = >{}<".format(index, dvalue))
            continue
        print(" Unknown Record format")
Output:
$ python dummy.py
rec: >DUD=ABC!QUEUE=D23!<
Unknown Record format
rec: >XA32=7!P^=32!<
index = >32< value = >7<
rec: >PZ112=123^!PQ=ABC!<
index = >112< value = >123^<
But I'm wondering why you don't simplify your Perl & Python code to just use a single regex instead? E.g.:
match = re.match(r'(?:XA|PZ)([0-9]+)=(.*?[^#])!', rec)
if match:
    (index, dvalue) = match.groups()
    print(" index = >{}< value = >{}<".format(index, dvalue))
else:
    print(" Unknown Record format")
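On Python 3.8 and later, an assignment expression (the := operator) comes closest to the Perl idiom of matching and capturing inside the if itself. The sketch below is one way to write it; the function name classify and the layout of the returned tuple are made up for illustration, and the two patterns are the ones from the Perl code. It also shows that groups() always returns a tuple, even for a single group, which is why (index, ) = match.groups() works while (index) = match.groups() just binds the whole tuple.

```python
import re


def classify(rec):
    """Return (kind, index, value) for a known record, or None otherwise."""
    if m := re.match(r'XA([0-9]+)=(.*?)!', rec):
        index, dvalue = m.groups()
        return ("XA", index, dvalue)
    elif m := re.match(r'PZ([0-9]+)=(.*?[^#])!', rec):
        index, dvalue = m.groups()
        return ("PZ", index, dvalue)
    return None   # no pattern matched; the caller decides what to report


for rec in ['DUD=ABC!QUEUE=D23!', 'XA32=7!P^=32!', 'PZ112=123^!PQ=ABC!']:
    print(rec, '->', classify(rec))

# groups() is always a tuple; group(n) returns one group as a string.
single = re.match(r'XA([0-9]+)=', 'XA22=11^!')
print(single.groups())   # ('22',)  a 1-tuple, hence the trailing comma
print(single.group(1))   # 22       the first capture group
print(single.group(0))   # XA22=    the entire matched text
```

On older Python versions the two-step form shown above (assign, then test) remains the way to go.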

Extracting the values within the square braces

I have a file testfile with the set of server names as below.
app-server-l11[2-5].test.com
server-l34[5-8].test.com
dd-server-l[2-4].test.com
Can you please help me get the output below?
app-server-l112.test.com
app-server-l113.test.com
app-server-l114.test.com
app-server-l115.test.com
server-l345.test.com
server-l346.test.com
server-l347.test.com
server-l348.test.com
dd-server-l2.test.com
dd-server-l3.test.com
dd-server-l4.test.com
With GNU awk for the 3rd arg to match():
$ awk 'match($0,/(.*)\[([0-9]+)-([0-9]+)\](.*)/,a){for (i=a[2]; i<=a[3]; i++) print a[1] i a[4]}' file
app-server-l112.test.com
app-server-l113.test.com
app-server-l114.test.com
app-server-l115.test.com
server-l345.test.com
server-l346.test.com
server-l347.test.com
server-l348.test.com
dd-server-l2.test.com
dd-server-l3.test.com
dd-server-l4.test.com
In GNU awk:
$ awk -F"[][]" '{split($2,a,"-"); for(i=a[1];i<=a[2];i++) print $1 i $3}' file
app-server-l112.test.com
app-server-l113.test.com
app-server-l114.test.com
app-server-l115.test.com
server-l345.test.com
server-l346.test.com
server-l347.test.com
server-l348.test.com
dd-server-l2.test.com
dd-server-l3.test.com
dd-server-l4.test.com
split into fields on [ and ] using FS
use split to get the range start (a[1]) and end (a[2])
iterate over the range with for and output each name
There is no check for whether a line contains a range at all. That could be implemented with something like: print (NF==3 ? $1 i $3 : $1 ).
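For comparison, here is how the same split-and-iterate approach, including the check for lines without a range, might look in Python. The names RANGE and expand are made up for this sketch, and it assumes at most one [start-end] range per name.

```python
import re

# prefix, range start, range end, suffix
RANGE = re.compile(r'^(.*)\[(\d+)-(\d+)\](.*)$')


def expand(name):
    """Yield every host name described by an optional [start-end] range."""
    m = RANGE.match(name)
    if m is None:
        yield name              # no range: pass the name through unchanged
        return
    prefix, start, end, suffix = m.groups()
    for n in range(int(start), int(end) + 1):
        yield "{}{}{}".format(prefix, n, suffix)


for line in ['dd-server-l[2-4].test.com', 'plain.test.com']:
    for host in expand(line):
        print(host)
```

Because the bounds are captured with \d+, multi-digit ranges such as [98-101] are handled as well.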
Worst and ugliest example:
var='app-server-l11[2-5].test.com'
for i in range(int(var[(var.find('[') +1)]), int(var[(var.find("]") - 1)])+1):
    print('app-server-l11' + str(i) + '.test.com')
Use your imagination!
ser_nm = ['app-server-l11[2-5].test.com','server-134[5-8].test.com','dd-server-[2-4].test.com']
for nm in ser_nm:
    for i in range(int(nm[nm.find('[')+1 : nm.find('-',(nm.find('[')+1))]), int(nm[nm.find('-',(nm.find('[')+1))+1:nm.find(']') ] )+1):
        print(nm[:nm.find('[')] + str(i) + nm[nm.find(']')+1:])
This will also take care of cases where server names are like this:
'server-134[52-823].test.com'
not the best solution, but it works...
inp = open('input.txt', 'r+').read()
print(inp)
result= ''
for i in inp.split('\n'):
    if len(i) > 1:
        print(repr(i))
        f1 = i.find('[')
        f2 = i.find(']')+1
        b1 = i[:f1]
        b2 = i[f2:]
        ins = i[f1:f2]
        ins = ins[1:-1]
        for j in range(int(ins.split("-")[0]),int(ins.split("-")[1])+1):
            result+=b1+str(j)+b2+'\n'
outp = open('output.txt', 'w')
outp.write(result)
outp.close()
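You can use the command below to get the required output without any complex statement.
awk -f test.awk file.txt
test.awk must contain the lines below. Note that substr() reads exactly one character on each side of the hyphen, so only single-digit range bounds are handled, and strtonum() requires GNU awk: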
{
    if(a=match($0,"\\["))
    {
        start=strtonum(substr($0,a+1,1));
        end=strtonum(substr($0,a+3,1));
        copy=$0;
        for(i=start;i<=end;i++)
        {
            sub("\\[[0-9]{1,}-[0-9]{1,}\\]",i,copy);
            print copy;
            copy = $0;
        }
    }
    else
    {
        print $0;
    }
}
file.txt contains your input, for example:
app-server-l11[2-5].test.com
server-l34[5-8].test.com
dd-server-l[2-4].test.com
output:
app-server-l112.test.com
app-server-l113.test.com
app-server-l114.test.com
app-server-l115.test.com
server-l345.test.com
server-l346.test.com
server-l347.test.com
server-l348.test.com
dd-server-l2.test.com
dd-server-l3.test.com
dd-server-l4.test.com
As this sounds like a school assignment I'm going to be fairly vague.
I would use a regular expression to extract the numeric range and the rest of the address components, then use a loop to iterate over the extracted numeric range to build each address (using the other captured address components).
Since it's been over a week:
import re
inputs = [ "app-server-l11[2-5].test.com", "server-l34[5-8].test.com", "dd-server-l[2-4].test.com" ]
pattern = r"\s*(?P<subdomain>[a-zA-Z0-9-_.]+)\[(?P<range_start>\d+)-(?P<range_end>\d+)\](?P<domain>\S+)"
expr = re.compile( pattern )
def expand_domain( domain ):
    mo = expr.match( domain )
    if mo is not None:
        groups = mo.groupdict()
        subdomain = groups[ "subdomain" ]
        domain = groups[ "domain" ]
        range_start = int( groups[ "range_start" ] )
        range_end = int( groups[ "range_end" ] )
        result = [ "{}{:d}{}".format( subdomain, index, domain ) for index in range( range_start, range_end + 1 ) ]
        return result
    else:
        raise ValueError( "'{}' does not match the expected input.".format( domain ) )

for domain in inputs:
    print( "'{}':".format( domain ) )
    for exp_dom in expand_domain( domain ):
        print( "--> {}".format( exp_dom ) )

pyparsing nested structure not working as expected

I'm trying to parse a simple JSON-like structure into Python dicts and then turn it into a proper JSON structure. The block is as follows:
###################################################
# HEADER TEXT
# HEADER TEXT
###################################################
NAME => {
NAME => VALUE,
NAME => VALUE,
NAME => VALUE,
NAME => {
NAME => {
NAME => VALUE, NAME => VALUE, NAME => VALUE,
},
} # comment
}, # more comments
and repeating. Rules:
NAME = alphanums and _
VALUE = decimal(6) | hex (0xA) | list of hex ([0x1,0x2]) | text in brackets([A]) | string("A")
I set up the following grammar:
cfgName = Word(alphanums+"_")
cfgString = dblQuotedString().setParseAction(removeQuotes)
cfgNumber = Word("0123456789ABCDEFx")
LBRACK, RBRACK, LBRACE, RBRACE = map(Suppress, "[]{}")
EQUAL = Literal('=>').suppress()
cfgObject = Forward()
cfgValue = Forward()
cfgElements = delimitedList(cfgValue)
cfgArray = Group(LBRACK + Optional(cfgElements, []) + RBRACK)
cfgValue << (cfgString | cfgNumber | cfgArray | cfgName | Group(cfgObject))
memberDef = Group(cfgName + EQUAL + cfgValue)
cfgMembers = delimitedList(memberDef)
cfgObject << Dict(LBRACE + Optional(cfgMembers) + RBRACE)
cfgComment = pythonStyleComment
cfgObject.ignore(cfgComment)
EDIT: I've managed to isolate the problem. Proper JSON is
{member,member,member}
however my structure is:
{member,member,member,}
The last element in every nested structure is followed by a trailing comma, and I don't know how to account for that in the grammar.
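Within pyparsing, allowing an optional suppressed comma after each delimited list is one thing to try. A different approach, sketched here with the standard library only, is to normalize the text before it reaches the grammar: strip the # comments, then drop any comma that directly precedes a closing brace or bracket. The function name strip_trailing_commas is made up, and the sketch assumes that neither # nor ",}" ever appears inside a quoted string value.

```python
import re


def strip_trailing_commas(text):
    """Remove '#' comments, then any comma that directly precedes '}' or ']'."""
    # Comments run from '#' to the end of the line.
    no_comments = re.sub(r'#[^\n]*', '', text)
    # Drop a comma when only whitespace separates it from a closing bracket.
    return re.sub(r',(\s*[}\]])', r'\1', no_comments)


print(strip_trailing_commas('{x => {y => 1, z => [0x1,0x2,],},}'))
```

The cleaned text can then be fed to the existing cfgObject grammar, which no longer needs the cfgObject.ignore(cfgComment) call since the comments are already gone.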

Cannot parse correctly this file with pyparsing

I am trying to parse a file using the amazing python library pyparsing but I am having a lot of problems...
The file I am trying to parse is something like:
sectionOne:
    list:
        - XXitem
        - XXanotherItem
    key1: value1
    product: milk
    release: now
    subSection:
        skey : sval
        slist:
            - XXitem
    mods:
        - XXone
        - XXtwo
    version: last
sectionTwo:
    base: base-0.1
    config: config-7.0-7
As you can see, it is an indented configuration file, and this is more or less how I have tried to define the grammar:
The file can have one or more sections.
Each section is formed by a section name and a section content.
Each section has indented content.
Each section content can have one or more key/value pairs or subsections.
Each value can be just a single word or a list of items.
A list of items is a group of one or more items.
Each item is a HYPHEN + a name starting with 'XX'.
I have tried to create this grammar using pyparsing but with no success.
import pprint
import pyparsing
NEWLINE = pyparsing.LineEnd().suppress()
VALID_CHARACTERS = pyparsing.srange("[a-zA-Z0-9_\-\.]")
COLON = pyparsing.Suppress(pyparsing.Literal(":"))
HYPHEN = pyparsing.Suppress(pyparsing.Literal("-"))
XX = pyparsing.Literal("XX")
list_item = HYPHEN + pyparsing.Combine(XX + pyparsing.Word(VALID_CHARACTERS))
list_of_items = pyparsing.Group(pyparsing.OneOrMore(list_item))
key = pyparsing.Word(VALID_CHARACTERS) + COLON
pair_value = pyparsing.Word(VALID_CHARACTERS) + NEWLINE
value = (pair_value | list_of_items)
pair = pyparsing.Group(key + value)
indentStack = [1]
section = pyparsing.Forward()
section_name = pyparsing.Word(VALID_CHARACTERS) + COLON
section_value = pyparsing.OneOrMore(pair | section)
section_content = pyparsing.indentedBlock(section_value, indentStack, True)
section << pyparsing.Group(section_name + section_content)
parser = pyparsing.OneOrMore(section)
def main():
    try:
        with open('simple.info', 'r') as content_file:
            content = content_file.read()
        print "content:\n", content
        print "\n"
        result = parser.parseString(content)
        print "result1:\n", result
        print "len", len(result)
        pprint.pprint(result.asList())
    except pyparsing.ParseException, err:
        print err.line
        print " " * (err.column - 1) + "^"
        print err
    except pyparsing.ParseFatalException, err:
        print err.line
        print " " * (err.column - 1) + "^"
        print err

if __name__ == '__main__':
    main()
This is the result :
result1:
[['sectionOne', [[['list', ['XXitem', 'XXanotherItem']], ['key1', 'value1'], ['product', 'milk'], ['release', 'now'], ['subSection', [[['skey', 'sval'], ['slist', ['XXitem']], ['mods', ['XXone', 'XXtwo']], ['version', 'last']]]]]]], ['sectionTwo', [[['base', 'base-0.1'], ['config', 'config-7.0-7']]]]]
len 2
[
['sectionOne',
[[
['list', ['XXitem', 'XXanotherItem']],
['key1', 'value1'],
['product', 'milk'],
['release', 'now'],
['subSection',
[[
['skey', 'sval'],
['slist', ['XXitem']],
['mods', ['XXone', 'XXtwo']],
['version', 'last']
]]
]
]]
],
['sectionTwo',
[[
['base', 'base-0.1'],
['config', 'config-7.0-7']
]]
]
]
As you can see I have two main problems:
1.- Each section content is nested twice into a list
2.- the key "version" is parsed inside the "subSection" when it belongs to the "sectionOne"
My real target is to be able to get a structure of python nested dictionaries with the keys and values to easily extract the info for each field, but the pyparsing.Dict is something obscure to me.
Could anyone please help me ?
Thanks in advance
( sorry for the long post )
You really are pretty close - congrats, indented parsers are not the easiest to write with pyparsing.
Look at the commented changes. Those marked with 'A' are changes to fix your two stated problems. Those marked with 'B' add Dict constructs so that you can access the parsed data as a nested structure using the names in the config.
The biggest culprit is that indentedBlock does some extra Group'ing for you, which gets in the way of Dict's name-value associations. Using ungroup to peel that away lets Dict see the underlying pairs.
Best of luck with pyparsing!
import pprint
import pyparsing
NEWLINE = pyparsing.LineEnd().suppress()
VALID_CHARACTERS = pyparsing.srange("[a-zA-Z0-9_\-\.]")
COLON = pyparsing.Suppress(pyparsing.Literal(":"))
HYPHEN = pyparsing.Suppress(pyparsing.Literal("-"))
XX = pyparsing.Literal("XX")
list_item = HYPHEN + pyparsing.Combine(XX + pyparsing.Word(VALID_CHARACTERS))
list_of_items = pyparsing.Group(pyparsing.OneOrMore(list_item))
key = pyparsing.Word(VALID_CHARACTERS) + COLON
pair_value = pyparsing.Word(VALID_CHARACTERS) + NEWLINE
value = (pair_value | list_of_items)
#~ A: pair = pyparsing.Group(key + value)
pair = (key + value)
indentStack = [1]
section = pyparsing.Forward()
section_name = pyparsing.Word(VALID_CHARACTERS) + COLON
#~ A: section_value = pyparsing.OneOrMore(pair | section)
section_value = (pair | section)
#~ B: section_content = pyparsing.indentedBlock(section_value, indentStack, True)
section_content = pyparsing.Dict(pyparsing.ungroup(pyparsing.indentedBlock(section_value, indentStack, True)))
#~ A: section << Group(section_name + section_content)
section << (section_name + section_content)
#~ B: parser = pyparsing.OneOrMore(section)
parser = pyparsing.Dict(pyparsing.OneOrMore(pyparsing.Group(section)))
Now instead of pprint(result.asList()) you can write:
print (result.dump())
to show the Dict hierarchy:
[['sectionOne', ['list', ['XXitem', 'XXanotherItem']], ... etc. ...
- sectionOne: [['list', ['XXitem', 'XXanotherItem']], ... etc. ...
  - key1: value1
  - list: ['XXitem', 'XXanotherItem']
  - mods: ['XXone', 'XXtwo']
  - product: milk
  - release: now
  - subSection: [['skey', 'sval'], ['slist', ['XXitem']]]
    - skey: sval
    - slist: ['XXitem']
  - version: last
- sectionTwo: [['base', 'base-0.1'], ['config', 'config-7.0-7']]
  - base: base-0.1
  - config: config-7.0-7
allowing you to write statements like:
print (result.sectionTwo.base)
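If pyparsing is not a hard requirement, the nested-dictionary target from the question can also be reached with a small hand-written indentation tracker. This is a different technique from the indentedBlock grammar above, not a replacement for it, and it rests on assumptions the question does not state: the name parse_indented is made up, list items are indented deeper than their key, a "key:" line with nothing after the colon opens either a list or a subsection depending on the next line, and no validation of the XX prefix is done.

```python
def parse_indented(text):
    """Turn an indented 'key: value' file into nested dicts and lists."""
    root = {}
    stack = [(-1, root)]    # (indent of the key that owns it, container)
    lines = [ln for ln in text.splitlines() if ln.strip()]
    for pos, line in enumerate(lines):
        indent = len(line) - len(line.lstrip())
        body = line.strip()
        # Leave every block whose key is indented at least as deep as this line.
        while stack[-1][0] >= indent:
            stack.pop()
        parent = stack[-1][1]
        if body.startswith("-"):
            parent.append(body[1:].strip())     # list item
            continue
        key, _, value = body.partition(":")
        key, value = key.strip(), value.strip()
        if value:
            parent[key] = value                 # plain key/value pair
        else:
            # Peek at the next line to decide between a list and a subsection.
            nxt = lines[pos + 1].strip() if pos + 1 < len(lines) else ""
            child = [] if nxt.startswith("-") else {}
            parent[key] = child
            stack.append((indent, child))
    return root


example = "top:\n    items:\n        - XXa\n        - XXb\n    name: demo\n"
print(parse_indented(example))
```

The result is made of ordinary dicts and lists, so a lookup such as result['sectionTwo']['base'] plays the role of result.sectionTwo.base.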
