I am trying to parse dbus-monitor output. Most messages are multi-line entries (a header line followed by parameter lines), and I need to parse each message and concatenate it into a single-line entry.
The dbus-monitor output looks like this:
method call time=462.117843 sender=:1.62 -> destination=org.freedesktop.filehandler serial=122 path=/org/freedesktop/filehandler/routing; interface=org.freedesktop.filehandler.routing; member=start
int16 29877
uint16 0
method return time=462.117844 sender=org.freedesktop.filehandler -> destination=:1.62 serial=2210 reply_serial=122
int16 29877
uint16 0
method call time=462.117845 sender=:1.62 -> destination=org.freedesktop.filehandler serial=123 path=/org/freedesktop/filehandler/routing; interface=org.freedesktop.filehandler.routing; member=comment
string "starting .."
string "routing"
method return time=462.117846 sender=:1.19 -> destination=:1.62 serial=2212 reply_serial=123
int12 -23145
signal time=463.11223 sender=:1.64 -> destination=(null destination) serial=124 path=/org/freedesktop/fileserver; interface=org.freedesktop.DBus.Properties; member=PropertiesChanged
string "com.freedesktop.Systemserver"
array[
dict entry(
string "SystemTime"
variant struct{
byte 12
byte 9
byte 0
}
)
]
array [
]
This is the regex I tried for grouping the dbus message headers (the parameters are not grouped):
\b(signal|method call|method return)\b time=([\d,.]*) sender=([\w,.,:,(,), ]*) -> destination=([\w,.,:,(,), ]*) serial=([(,),\w]*) (?:path=([\w,\/]*); interface=([\w,.]*); member=([\w,_,-]*))?(?:reply_serial=([\d]*))?
I expect the output in the below format,
C [sender,serial] path interface+member (parameter1, parameter2, ...)
R [destination,reply_serial] interface+member (parameter1, parameter2, ...)
S [sender, serial] path interface+member (parameter1, parameter2, ...)
A sample output for the above dbus-monitor messages is shown below,
C [:1.62,122] /org/freedesktop/filehandler/routing org.freedesktop.filehandler.routing.start (29877,0)
R [:1.62,122] org.freedesktop.filehandler.routing.start (29877,0)
C [:1.62,123] /org/freedesktop/filehandler/routing org.freedesktop.filehandler.routing.comment ("starting", "routing")
R [:1.62,123] org.freedesktop.filehandler.routing.comment (-23145)
S [:1.64, 124] /org/freedesktop/fileserver org.freedesktop.DBus.Properties.PropertiesChanged ("com.freedesktop.Systemserver"[("SystemTime",{12,9,0})][])
How can the expected result above be achieved when the entries are usually multi-line? Also, the signals have multiple levels of nesting, which makes their parameters difficult to access. Can someone help with parsing these dbus messages into the expected format?
Can you suggest how the code can be rewritten to process the input line by line?
Here I rearranged it accordingly:
import re
import sys
regex = r'\b(signal|method call|method return)\b time=([\d,.]*) sender=([\w,.,:,(,), ]*) -> destination=([\w,.,:,(,), ]*) serial=([(,),\w]*) (?:path=([\w,\/]*); interface=([\w,.]*); member=([\w,_,-]*))?(?:reply_serial=([\d]*))?'
remember = dict()
sep = None
for line in open('dbusl.in'):
    m = re.match(regex, line)
    if m:
        if sep is not None: print ")"   # end the previous parameter group
        m = list(m.groups())            # each match is 9 capturing groups
        if m[0] == 'method call':
            print "C [{2},{4}] {5} {6}.{7}".format(*m),
            remember[m[4]] = m[6:8]     # store interface+member for return
        if m[0] == 'method return':
            m[6:8] = remember.pop(m[8]) # recall stored interface+member
            print "R [{3},{8}] {6}.{7}".format(*m),
        if m[0] == 'signal':
            print "S [{2}, {4}] {5} {6}.{7}".format(*m),
        sep = "("
    else:
        p = line.rstrip()               # now handle parameters
        if p[-1] in "[](){}":           # with "encapsulations":
            p = p[-1]                   # delete spaces, "array", "dict ..."
        p = re.sub('^\s*\w*\s*', '', p) # delete spaces and data type
        if p[-1] in "])}":
            sep = ''                    # no separator before closing
        print sep+p,
        sys.stdout.softspace=0
        if p[-1] in "[](){}": sep = ''
        else: sep = ', '                # separator after data item
print ")"                               # end the previous parameter group
Note that I also changed m[6:8] = remember[m[8]] to m[6:8] = remember.pop(m[8]) in order to free the memory of no longer needed interface+member data.
If you absolutely have to use dbus-monitor, it’s probably best to use its PCAP output mode by passing the --pcap option to it. That outputs in a well-documented structured format which can be read by libpcap.
As you already have a usable regex, you can build on it by using it with re.split to get the needed message parts. Note that this yields a separate string for each capture group plus one string with the parameters, for each message entry. This example assumes that all the messages are in the string messages:
import re
import sys
regex = r'\b(signal|method call|method return)\b time=([\d,.]*) sender=([\w,.,:,(,), ]*) -> destination=([\w,.,:,(,), ]*) serial=([(,),\w]*) (?:path=([\w,\/]*); interface=([\w,.]*); member=([\w,_,-]*))?(?:reply_serial=([\d]*))?'
m = re.split(regex, messages)
m = m[1:] # discard empty? text before first match
remember = dict()
while m:    # each match group is 9 capturing groups + 1 parameter group
    if m[0] == 'method call':
        print "C [{2},{4}] {5} {6}.{7}".format(*m),
        remember[m[4]] = m[6:8]     # store interface+member for return
    if m[0] == 'method return':
        m[6:8] = remember[m[8]]     # recall stored interface+member
        print "R [{3},{8}] {6}.{7}".format(*m),
    if m[0] == 'signal':
        print "S [{2}, {4}] {5} {6}.{7}".format(*m),
    # now handle parameters
    sep = "("
    for p in m[9].split('\n')[1:-1]:    # except empty string at start and end
        if p[-1] in "[](){}":           # with "encapsulations":
            p = p[-1]                   # delete spaces, "array", "dict ..."
        p = re.sub('^\s*\w*\s*', '', p) # delete spaces and data type
        if p[-1] in "])}":
            sep = ''                    # no separator before closing
        print sep+p,
        sys.stdout.softspace=0
        if p[-1] in "[](){}": sep = ''
        else: sep = ', '                # separator after data item
    print ")"
    m = m[10:]                          # delete the processed match group of 10
The output with your sample data is:
C [:1.62,122] /org/freedesktop/filehandler/routing org.freedesktop.filehandler.routing.start (29877, 0)
R [:1.62,122] org.freedesktop.filehandler.routing.start (29877, 0)
C [:1.62,123] /org/freedesktop/filehandler/routing org.freedesktop.filehandler.routing.comment ("starting ..", "routing")
R [:1.62,123] org.freedesktop.filehandler.routing.comment (-23145)
S [:1.64, 124] /org/freedesktop/fileserver org.freedesktop.DBus.Properties.PropertiesChanged ("com.freedesktop.Systemserver", [("SystemTime", {12, 9, 0})][])
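For readers on Python 3, the same line-by-line idea can be sketched as a function that returns the formatted entries instead of printing them. This is only an illustration under some assumptions: the names HEADER, format_params and format_log are made up here, the header regex is a simplified variant of the one in the question, type names are assumed to be a single word, and pending calls are keyed by (caller, serial) rather than by serial alone.

```python
import re

# Simplified header pattern (an assumption, not the regex from the question).
HEADER = re.compile(
    r'(signal|method call|method return) time=\S+ sender=(\S+) -> '
    r'destination=(\(null destination\)|\S+) serial=(\d+)'
    r'(?: path=([^;\s]+); interface=([^;\s]+); member=(\S+))?'
    r'(?: reply_serial=(\d+))?'
)
BRACKETS = set('[](){}')
CLOSERS = set('])}')


def format_params(param_lines):
    """Join the parameter lines of one message into '(a, b, ...)'."""
    out, sep = '', ''
    for raw in param_lines:
        p = raw.strip()
        if not p:
            continue
        if p[-1] in BRACKETS:
            p = p[-1]                       # keep only the bracket itself
        else:
            p = re.sub(r'^\w+\s+', '', p)   # drop the leading type name
        if p in CLOSERS:
            sep = ''                        # no separator before a closer
        out += sep + p
        sep = '' if p in BRACKETS else ', '
    return '(' + out + ')'


def format_log(lines):
    """Return one formatted string per dbus-monitor message."""
    calls = {}       # (caller, serial) -> "interface.member"
    messages = []    # (head, [parameter lines]) in input order
    for line in lines:
        m = HEADER.match(line)
        if m is None:
            if messages:                    # skip text before the first header
                messages[-1][1].append(line)
            continue
        kind, sender, dest, serial, path, iface, member, reply = m.groups()
        if kind == 'method call':
            calls[(sender, serial)] = iface + '.' + member
            head = 'C [%s,%s] %s %s.%s' % (sender, serial, path, iface, member)
        elif kind == 'method return':
            name = calls.pop((dest, reply), '?')   # '?' if the call was not seen
            head = 'R [%s,%s] %s' % (dest, reply, name)
        else:
            head = 'S [%s, %s] %s %s.%s' % (sender, serial, path, iface, member)
        messages.append((head, []))
    return [head + ' ' + format_params(params) for head, params in messages]
```

Calling format_log(open('dbusl.in')) would then give a list of single-line entries that can be printed or written to a file.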
Although I've been using Perl for many years, I've always had trouble with anything beyond fairly basic use of regular expressions in the language. The situation is only worse now that I'm trying to learn Python, where the use of the re module is even less clear to me.
I'm trying to check whether a pattern matches a string using re, and I'm also using capture groups to extract some information from the match. However, I can't get things to work in a couple of contexts: when making an re call and assigning the returned values all within an "if" statement, and when handling the case where .groups() is unavailable because no match was made.
What follows are examples of what I'm trying to do, coded in Perl and Python, with their respective outputs.
I'd appreciate any pointers on how I might better approach the problem using Python.
Perl Code:
use strict;
use warnings;
my ($idx, $dvalue);
while (my $rec = <DATA>) {
    chomp($rec);
    if ( ($idx, $dvalue) = ($rec =~ /^XA([0-9]+)=(.*?)!/) ) {
        printf(" Matched:\n");
        printf(" rec: >%s<\n", $rec);
        printf(" index = >%s< value = >%s<\n", $idx, $dvalue);
    } elsif ( ($idx, $dvalue) = ($rec =~ /^PZ([0-9]+)=(.*?[^#])!/) ) {
        printf(" Matched:\n");
        printf(" rec: >%s<\n", $rec);
        printf(" index = >%s< value = >%s<\n", $idx, $dvalue);
    } else {
        printf("\n Unknown Record format, \\%s\\\n\n", $rec);
    }
}
close(DATA);
exit(0);
__DATA__
DUD=ABC!QUEUE=D23!
XA32=7!P^=32!
PZ112=123^!PQ=ABC!
Perl Output:
Unknown Record format, \DUD=ABC!QUEUE=D23!\
Matched:
rec: >XA32=7!P^=32!<
index = >32< value = >7<
Matched:
rec: >PZ112=123^!PQ=ABC!<
index = >112< value = >123^<
Python Code:
import re
string = 'XA32=7!P^=32!'
with open('data.dat', 'r') as fh:
    for rec in fh:
        orec = ' rec: >' + rec.rstrip('\n') + '<'
        print(orec)
        # always using 'string' at least lets this program run
        (index, dvalue) = re.search(r'^XA([0-9]+)=(.*?[^#])!', string).groups()
        # The following works when there is a match... but fails with an error when
        # a match is NOT found, viz:-
        # ...
        # (index, dvalue) = re.search(r'^XA([0-9]+)=(.*?[^#])!', rec).groups()
        #
        # Traceback (most recent call last):
        # File "T:\tmp\a.py", line 13, in <module>
        # (index, dvalue) = re.search(r'^XA([0-9]+)=(.*?[^#])!', rec).groups()
        # AttributeError: 'NoneType' object has no attribute 'groups'
        #
        buf = ' index = >' + index + '<' + ' value = >' + dvalue + '<'
        print(buf)
exit(0)
data.dat contents:
DUD=ABC!QUEUE=D23!
XA32=7!P^=32!
PZ112=123^!PQ=ABC!
Python Output:
rec: >DUD=ABC!QUEUE=D23!<
index = >32< value = >7<
rec: >XA32=7!P^=32!<
index = >32< value = >7<
rec: >PZ112=123^!PQ=ABC!<
index = >32< value = >7<
Another development: Some more code to help me understand this better... but I'm unsure about when/how to use the match.group() or match.groups() ...
Python Code:
import re
rec = 'XA22=11^!S^=64!ABC=0,0!PX=0!SP=12B!'
print("rec = >{}<".format(rec))
# ----
index = 0 ; dvalue = 0 ; x = 0
match = re.match(r'XA([0-9]+)=(.*?[^#])!(.*?)!', rec)
if match:
    (index, dvalue, x) = match.groups()
print("3 (): index = >{}< value = >{}< x = >{}<".format(index, dvalue, x))
# ----
index = 0 ; dvalue = 0 ; x = 0
match = re.match(r'XA([0-9]+)=(.*?[^#])!', rec)
if match:
    (index, dvalue) = match.groups()
print("2 (): index = >{}< value = >{}< x = >{}<".format(index, dvalue, x))
# ----
index = 0 ; dvalue = 0 ; x = 0
match = re.match(r'XA([0-9]+)=', rec)
if match:
    #(index) = match.groups() # Why doesn't this work like above examples!?
    (index, ) = match.groups() # ...and yet this works!?
    # Does match.groups ALWAYS returns a tuple!?
    #(index) = match.group(1) # This also works; 0 = entire matched string?
print("1 (): index = >{}< value = >{}< x = >{}<".format(index, dvalue, x))
# ----
index = 0 ; dvalue = 0 ; x = 0
match = re.search(r'S\^=([0-9]+)!', rec)
if match:
    (index, ) = match.groups() # Returns tuple(?!)
print("1 (): index = >{}< value = >{}< x = >{}<".format(index, dvalue, x))
Again, I'd appreciate any thoughts on which is the 'preferred' way.. or if there's another way to deal with the groups.
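On the group() versus groups() question raised in the comments above, a small experiment may help. It is only a sketch using the record format from this post; the variable names are illustrative.

```python
import re

m = re.match(r'XA([0-9]+)=', 'XA22=11^!S^=64!')

# groups() always returns a tuple, even when the pattern has a single group.
all_groups = m.groups()

# A trailing comma makes the left-hand side a one-element tuple, so it unpacks.
(index,) = m.groups()

# Plain parentheses do not create a tuple, so this just binds the whole tuple.
(not_unpacked) = m.groups()

# group(n) returns one group as a string; group(0) is the entire match.
first = m.group(1)
whole = m.group(0)
```

So `(index) = match.groups()` does run, but it leaves a tuple in index, whereas `match.group(1)` or the `(index, )` form yields the string.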
You need to check for a match first, then use the groups. That is:
1. Compile the regexes (optional in most cases nowadays, according to the documentation).
2. Apply each regex to the string to generate a match object:
   - match() only matches at the beginning of a string, i.e. with an implicit ^ anchor
   - search() matches anywhere in the string
3. Check whether the match object is valid.
4. Extract the groups.
5. Skip to the next loop iteration.
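The difference between match() and search() mentioned in the steps above can be checked with a short snippet. The record string is taken from the question; the variable names are just for illustration.

```python
import re

rec = 'XA22=11^!S^=64!ABC=0,0!'
pattern = re.compile(r'S\^=([0-9]+)!')

# match() only tries at position 0, where the record starts with 'XA', so it fails.
at_start = pattern.match(rec)

# search() scans the whole string and finds the 'S^=64!' field.
anywhere = pattern.search(rec)
value = anywhere.group(1) if anywhere else None
```

Both calls return None when nothing matches, which is why the match object has to be tested before calling groups() on it.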
# works with Python 2 and Python 3
import re
with open('dummy.txt', 'r') as fh:
    for rec in fh:
        orec = ' rec: >' + rec.rstrip('\n') + '<'
        print(orec)
        match = re.match(r'XA([0-9]+)=(.*?[^#])!', rec)
        if match:
            (index, dvalue) = match.groups()
            print(" index = >{}< value = >{}<".format(index, dvalue))
            continue
        match = re.match(r'PZ([0-9]+)=(.*?[^#])!', rec)
        if match:
            (index, dvalue) = match.groups()
            print(" index = >{}< value = >{}<".format(index, dvalue))
            continue
        print(" Unknown Record format")
Output:
$ python dummy.py
rec: >DUD=ABC!QUEUE=D23!<
Unknown Record format
rec: >XA32=7!P^=32!<
index = >32< value = >7<
rec: >PZ112=123^!PQ=ABC!<
index = >112< value = >123^<
But I'm wondering why you don't simplify your Perl & Python code to just use a single regex instead? E.g.:
match = re.match(r'(?:XA|PZ)([0-9]+)=(.*?[^#])!', rec)
if match:
    (index, dvalue) = match.groups()
    print(" index = >{}< value = >{}<".format(index, dvalue))
else:
    print(" Unknown Record format")
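If Python 3.8 or later is available, assignment expressions (the := operator) allow a layout that stays close to the Perl if/elsif chain from the question. The sketch below is one possible arrangement, not the only one; the function name classify is made up for this example, and the two patterns are the ones from the Perl code.

```python
import re


def classify(rec):
    """Return (index, value) for a known record format, else None."""
    # := assigns the match object and tests it in the same expression.
    if (m := re.match(r'XA([0-9]+)=(.*?)!', rec)):
        return m.groups()
    elif (m := re.match(r'PZ([0-9]+)=(.*?[^#])!', rec)):
        return m.groups()
    return None


for rec in ('DUD=ABC!QUEUE=D23!', 'XA32=7!P^=32!', 'PZ112=123^!PQ=ABC!'):
    result = classify(rec)
    if result is None:
        print(' Unknown Record format, {!r}'.format(rec))
    else:
        print(' index = >{}< value = >{}<'.format(*result))
```

On older interpreters the explicit "assign, then test" form shown earlier is the equivalent.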
I have files with incorrect JSON that I want to start fixing by getting it into properly grouped chunks.
The brace grouping {{ {} {} } } {{}} {{{}}} should already be correct
How can I grab all the top-level braces, correctly grouped, as separate strings?
If you don't want to install any extra modules, a simple function will do:
def top_level(s):
    depth = 0
    start = -1
    for i, c in enumerate(s):
        if c == '{':
            if depth == 0:
                start = i
            depth += 1
        elif c == '}' and depth:
            depth -= 1
            if depth == 0:
                yield s[start:i+1]
print(list(top_level('{{ {} {} } } {{}} {{{}}}')))
Output:
['{{ {} {} } }', '{{}}', '{{{}}}']
It will skip invalid braces but could be easily modified to report an error when they are spotted.
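One way that modification might look is sketched below. The name top_level_strict and the choice of ValueError are assumptions made for this example; like the original, it does not know about braces that appear inside JSON string literals.

```python
def top_level_strict(s):
    """Yield top-level {...} groups, raising ValueError on unbalanced braces."""
    depth = 0
    start = -1
    for i, c in enumerate(s):
        if c == '{':
            if depth == 0:
                start = i
            depth += 1
        elif c == '}':
            if depth == 0:
                # a closing brace with nothing open
                raise ValueError('unmatched "}" at index %d' % i)
            depth -= 1
            if depth == 0:
                yield s[start:i + 1]
    if depth:
        # input ended while a top-level group was still open
        raise ValueError('unclosed "{" opened at index %d' % start)
```

Because this is a generator, the error only surfaces once the iteration reaches the offending brace, so wrap the consuming loop (or a list() call) in try/except.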
Using the regex module:
In [1]: import regex
In [2]: braces = regex.compile(r"\{(?:[^{}]++|(?R))*\}")
In [3]: braces.findall("{{ {} {} } } {{}} {{{}}}")
Out[3]: ['{{ {} {} } }', '{{}}', '{{{}}}']
pyparsing can be really helpful here. It will handle pathological cases where you have braces inside strings, etc. It might be a little tricky to do all of this work yourself, but fortunately, somebody (the author of the library) has already done the hard stuff for us.... I'll reproduce the code here to prevent link-rot:
# jsonParser.py
#
# Implementation of a simple JSON parser, returning a hierarchical
# ParseResults object support both list- and dict-style data access.
#
# Copyright 2006, by Paul McGuire
#
# Updated 8 Jan 2007 - fixed dict grouping bug, and made elements and
# members optional in array and object collections
#
json_bnf = """
object
{ members }
{}
members
string : value
members , string : value
array
[ elements ]
[]
elements
value
elements , value
value
string
number
object
array
true
false
null
"""
from pyparsing import *
TRUE = Keyword("true").setParseAction( replaceWith(True) )
FALSE = Keyword("false").setParseAction( replaceWith(False) )
NULL = Keyword("null").setParseAction( replaceWith(None) )
jsonString = dblQuotedString.setParseAction( removeQuotes )
jsonNumber = Combine( Optional('-') + ( '0' | Word('123456789',nums) ) +
Optional( '.' + Word(nums) ) +
Optional( Word('eE',exact=1) + Word(nums+'+-',nums) ) )
jsonObject = Forward()
jsonValue = Forward()
jsonElements = delimitedList( jsonValue )
jsonArray = Group(Suppress('[') + Optional(jsonElements) + Suppress(']') )
jsonValue << ( jsonString | jsonNumber | Group(jsonObject) | jsonArray | TRUE | FALSE | NULL )
memberDef = Group( jsonString + Suppress(':') + jsonValue )
jsonMembers = delimitedList( memberDef )
jsonObject << Dict( Suppress('{') + Optional(jsonMembers) + Suppress('}') )
jsonComment = cppStyleComment
jsonObject.ignore( jsonComment )
def convertNumbers(s,l,toks):
    n = toks[0]
    try:
        return int(n)
    except ValueError:
        # not an integer literal (e.g. "1.5" or "2e3"), so fall back to float
        return float(n)
jsonNumber.setParseAction( convertNumbers )
Phew! That's a lot ... Now how do we use it? The general strategy here is to scan the string for matches and then slice those matches out of the original string. Each scan result is a tuple of the form (lex-tokens, start_index, stop_index). For our use, we don't care about the lex-tokens, just the start and stop. We could write string[result[1]:result[2]] and it would work. We can also write string[slice(*result[1:])]. Take your pick.
results = jsonObject.scanString(testdata)
for result in results:
    print '*' * 80
    print testdata[slice(*result[1:])]
I am trying to parse a file using the amazing python library pyparsing but I am having a lot of problems...
The file I am trying to parse is something like:
sectionOne:
    list:
    - XXitem
    - XXanotherItem
    key1: value1
    product: milk
    release: now
    subSection:
        skey : sval
        slist:
        - XXitem
    mods:
    - XXone
    - XXtwo
    version: last
sectionTwo:
    base: base-0.1
    config: config-7.0-7
As you can see, it is an indented configuration file, and this is more or less how I have tried to define the grammar:
The file can have one or more sections.
Each section is formed by a section name and a section content.
Each section has indented content.
Each section content can have one or more key/value pairs or a subsection.
Each value can be just a single word or a list of items.
A list of items is a group of one or more items.
Each item is a HYPHEN followed by a name starting with 'XX'.
I have tried to create this grammar using pyparsing, but with no success.
import pprint
import pyparsing
NEWLINE = pyparsing.LineEnd().suppress()
VALID_CHARACTERS = pyparsing.srange("[a-zA-Z0-9_\-\.]")
COLON = pyparsing.Suppress(pyparsing.Literal(":"))
HYPHEN = pyparsing.Suppress(pyparsing.Literal("-"))
XX = pyparsing.Literal("XX")
list_item = HYPHEN + pyparsing.Combine(XX + pyparsing.Word(VALID_CHARACTERS))
list_of_items = pyparsing.Group(pyparsing.OneOrMore(list_item))
key = pyparsing.Word(VALID_CHARACTERS) + COLON
pair_value = pyparsing.Word(VALID_CHARACTERS) + NEWLINE
value = (pair_value | list_of_items)
pair = pyparsing.Group(key + value)
indentStack = [1]
section = pyparsing.Forward()
section_name = pyparsing.Word(VALID_CHARACTERS) + COLON
section_value = pyparsing.OneOrMore(pair | section)
section_content = pyparsing.indentedBlock(section_value, indentStack, True)
section << pyparsing.Group(section_name + section_content)
parser = pyparsing.OneOrMore(section)
def main():
    try:
        with open('simple.info', 'r') as content_file:
            content = content_file.read()
        print "content:\n", content
        print "\n"
        result = parser.parseString(content)
        print "result1:\n", result
        print "len", len(result)
        pprint.pprint(result.asList())
    except pyparsing.ParseException, err:
        print err.line
        print " " * (err.column - 1) + "^"
        print err
    except pyparsing.ParseFatalException, err:
        print err.line
        print " " * (err.column - 1) + "^"
        print err
if __name__ == '__main__':
    main()
This is the result :
result1:
[['sectionOne', [[['list', ['XXitem', 'XXanotherItem']], ['key1', 'value1'], ['product', 'milk'], ['release', 'now'], ['subSection', [[['skey', 'sval'], ['slist', ['XXitem']], ['mods', ['XXone', 'XXtwo']], ['version', 'last']]]]]]], ['sectionTwo', [[['base', 'base-0.1'], ['config', 'config-7.0-7']]]]]
len 2
[
['sectionOne',
[[
['list', ['XXitem', 'XXanotherItem']],
['key1', 'value1'],
['product', 'milk'],
['release', 'now'],
['subSection',
[[
['skey', 'sval'],
['slist', ['XXitem']],
['mods', ['XXone', 'XXtwo']],
['version', 'last']
]]
]
]]
],
['sectionTwo',
[[
['base', 'base-0.1'],
['config', 'config-7.0-7']
]]
]
]
As you can see I have two main problems:
1.- Each section content is nested twice into a list
2.- the key "version" is parsed inside the "subSection" when it belongs to the "sectionOne"
My real target is to be able to get a structure of python nested dictionaries with the keys and values to easily extract the info for each field, but the pyparsing.Dict is something obscure to me.
Could anyone please help me ?
Thanks in advance
( sorry for the long post )
You really are pretty close - congrats, indented parsers are not the easiest to write with pyparsing.
Look at the commented changes. Those marked with 'A' are changes to fix your two stated problems. Those marked with 'B' add Dict constructs so that you can access the parsed data as a nested structure using the names in the config.
The biggest culprit is that indentedBlock does some extra Group'ing for you, which gets in the way of Dict's name-value associations. Using ungroup to peel that away lets Dict see the underlying pairs.
Best of luck with pyparsing!
import pprint
import pyparsing
NEWLINE = pyparsing.LineEnd().suppress()
VALID_CHARACTERS = pyparsing.srange("[a-zA-Z0-9_\-\.]")
COLON = pyparsing.Suppress(pyparsing.Literal(":"))
HYPHEN = pyparsing.Suppress(pyparsing.Literal("-"))
XX = pyparsing.Literal("XX")
list_item = HYPHEN + pyparsing.Combine(XX + pyparsing.Word(VALID_CHARACTERS))
list_of_items = pyparsing.Group(pyparsing.OneOrMore(list_item))
key = pyparsing.Word(VALID_CHARACTERS) + COLON
pair_value = pyparsing.Word(VALID_CHARACTERS) + NEWLINE
value = (pair_value | list_of_items)
#~ A: pair = pyparsing.Group(key + value)
pair = (key + value)
indentStack = [1]
section = pyparsing.Forward()
section_name = pyparsing.Word(VALID_CHARACTERS) + COLON
#~ A: section_value = pyparsing.OneOrMore(pair | section)
section_value = (pair | section)
#~ B: section_content = pyparsing.indentedBlock(section_value, indentStack, True)
section_content = pyparsing.Dict(pyparsing.ungroup(pyparsing.indentedBlock(section_value, indentStack, True)))
#~ A: section << Group(section_name + section_content)
section << (section_name + section_content)
#~ B: parser = pyparsing.OneOrMore(section)
parser = pyparsing.Dict(pyparsing.OneOrMore(pyparsing.Group(section)))
Now instead of pprint(result.asList()) you can write:
print (result.dump())
to show the Dict hierarchy:
[['sectionOne', ['list', ['XXitem', 'XXanotherItem']], ... etc. ...
- sectionOne: [['list', ['XXitem', 'XXanotherItem']], ... etc. ...
  - key1: value1
  - list: ['XXitem', 'XXanotherItem']
  - mods: ['XXone', 'XXtwo']
  - product: milk
  - release: now
  - subSection: [['skey', 'sval'], ['slist', ['XXitem']]]
    - skey: sval
    - slist: ['XXitem']
  - version: last
- sectionTwo: [['base', 'base-0.1'], ['config', 'config-7.0-7']]
  - base: base-0.1
  - config: config-7.0-7
allowing you to write statements like:
print (result.sectionTwo.base)
I have a 'make'-like file that needs to be parsed. The grammar is something like:
samtools=/path/to/samtools
picard=/path/to/picard
task1:
    des: description
    path: /path/to/task1
    para: [$global.samtools,
           $args.input,
           $path
          ]
task2: task1
Here $global contains the variables defined in the global scope, $path is a 'local' variable, and $args contains the key/value pairs passed in by users.
I would like to parse this file with a Python library, preferably one that returns a parse tree and reports any errors it finds. I found CodeTalker and yeanpypa. Can they be used in this case? Are there any other recommendations?
I had to guess what your makefile structure allows based on your example, but this should get you close:
from pyparsing import *
# elements of the makefile are delimited by line, so we must
# define skippable whitespace to include just spaces and tabs
ParserElement.setDefaultWhitespaceChars(' \t')
NL = LineEnd().suppress()
EQ,COLON,LBRACK,RBRACK = map(Suppress, "=:[]")
identifier = Word(alphas+'_', alphanums)
symbol_assignment = Group(identifier("name") + EQ + empty +
restOfLine("value"))("symbol_assignment")
symbol_ref = Word("$",alphanums+"_.")
def only_column_one(s,l,t):
    if col(l,s) != 1:
        raise ParseException(s,l,"not in column 1")
# task identifiers have to start in column 1
task_identifier = identifier.copy().setParseAction(only_column_one)
task_description = "des:" + empty + restOfLine("des")
task_path = "path:" + empty + restOfLine("path")
task_para_body = delimitedList(symbol_ref)
task_para = "para:" + LBRACK + task_para_body("para") + RBRACK
task_para.ignore(NL)
task_definition = Group(task_identifier("target") + COLON +
Optional(delimitedList(identifier))("deps") + NL +
(
Optional(task_description + NL) &
Optional(task_path + NL) &
Optional(task_para + NL)
)
)("task_definition")
makefile_parser = ZeroOrMore(
symbol_assignment |
task_definition |
NL
)
if __name__ == "__main__":
    test = """\
samtools=/path/to/samtools
picard=/path/to/picard
task1:
    des: description
    path: /path/to/task1
    para: [$global.samtools,
           $args.input,
           $path
          ]
task2: task1
"""
    # dump out what we parsed, including results names
    for element in makefile_parser.parseString(test):
        print element.getName()
        print element.dump()
        print
Prints:
symbol_assignment
['samtools', '/path/to/samtools']
- name: samtools
- value: /path/to/samtools
symbol_assignment
['picard', '/path/to/picard']
- name: picard
- value: /path/to/picard
task_definition
['task1', 'des:', 'description ', 'path:', '/path/to/task1 ', 'para:',
'$global.samtools', '$args.input', '$path']
- des: description
- para: ['$global.samtools', '$args.input', '$path']
- path: /path/to/task1
- target: task1
task_definition
['task2', 'task1']
- deps: ['task1']
- target: task2
The dump() output shows you what names you can use to get at the fields within the parsed elements, or to distinguish what kind of element you have. dump() is a handy, generic tool to output whatever pyparsing has parsed. Here is some code that is more specific to your particular parser, showing how to use the field names as either dotted object references (element.target, element.deps, element.name, etc.) or dict-style references (element[key]):
for element in makefile_parser.parseString(test):
    if element.getName() == 'task_definition':
        print "TASK:", element.target,
        if element.deps:
            print "DEPS:(" + ','.join(element.deps) + ")"
        else:
            print
        for key in ('des', 'path', 'para'):
            if key in element:
                print " ", key.upper()+":", element[key]
    elif element.getName() == 'symbol_assignment':
        print "SYM:", element.name, "->", element.value
prints:
SYM: samtools -> /path/to/samtools
SYM: picard -> /path/to/picard
TASK: task1
DES: description
PATH: /path/to/task1
PARA: ['$global.samtools', '$args.input', '$path']
TASK: task2 DEPS:(task1)
I've used pyparsing in the past and been immensely pleased with it (q.v., the pyparsing project site).
How does one truncate a string to 75 characters in Python?
This is how it is done in JavaScript:
var data="saddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddsaddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddsadddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddd"
var info = (data.length > 75) ? data.substring(0,75) + '..' : data;
info = (data[:75] + '..') if len(data) > 75 else data
Even more concise:
data = data[:75]
If it is less than 75 characters there will be no change.
Even shorter :
info = data[:75] + (data[75:] and '..')
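The one-liner above relies on two facts: slicing past the end of a string never raises, and an empty string is falsy. Wrapped in a small helper (the name truncate and its parameters are illustrative only), the behaviour at the boundary is easy to check:

```python
def truncate(data, limit=75, suffix='..'):
    # data[limit:] is '' (falsy) when nothing is cut off, so `and` yields ''
    # and no suffix is appended; otherwise `and` yields the suffix.
    return data[:limit] + (data[limit:] and suffix)


print(truncate('short'))           # unchanged
print(len(truncate('x' * 75)))     # exactly at the limit: still 75 characters
print(len(truncate('x' * 76)))     # one over: 75 characters plus the suffix
```

Note that the result can be up to limit + len(suffix) characters long, the same as in the accepted one-liner.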
If you are using Python 3.4+, you can use textwrap.shorten from the standard library:
Collapse and truncate the given text to fit in the given width.
First the whitespace in text is collapsed (all whitespace is replaced
by single spaces). If the result fits in the width, it is returned.
Otherwise, enough words are dropped from the end so that the remaining
words plus the placeholder fit within width:
>>> textwrap.shorten("Hello world!", width=12)
'Hello world!'
>>> textwrap.shorten("Hello world!", width=11)
'Hello [...]'
>>> textwrap.shorten("Hello world", width=10, placeholder="...")
'Hello...'
For a Django solution (which has not been mentioned in the question):
from django.utils.text import Truncator
value = Truncator(value).chars(75)
Have a look at Truncator's source code to appreciate the problem:
https://github.com/django/django/blob/master/django/utils/text.py#L66
Concerning truncation with Django:
Django HTML truncation
With regex:
re.sub(r'^(.{75}).*$', '\g<1>...', data)
Long strings are truncated:
>>> data="11111111112222222222333333333344444444445555555555666666666677777777778888888888"
>>> re.sub(r'^(.{75}).*$', '\g<1>...', data)
'111111111122222222223333333333444444444455555555556666666666777777777788888...'
Shorter strings never get truncated:
>>> data="11111111112222222222333333"
>>> re.sub(r'^(.{75}).*$', '\g<1>...', data)
'11111111112222222222333333'
This way, you can also "cut" the middle part of the string, which is nicer in some cases:
re.sub(r'^(.{5}).*(.{5})$', '\g<1>...\g<2>', data)
>>> data="11111111112222222222333333333344444444445555555555666666666677777777778888888888"
>>> re.sub(r'^(.{5}).*(.{5})$', '\g<1>...\g<2>', data)
'11111...88888'
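One edge case in the regex approach is worth checking: because `.*` can match the empty string, a string of exactly 75 characters still matches and gets the ellipsis appended. A possible variant, sketched here with a made-up helper name truncate_re, requires at least one extra character via `.+` and uses re.DOTALL so that embedded newlines do not prevent a match.

```python
import re


def truncate_re(data, limit=75, suffix='...'):
    # .+ (instead of .*) means the pattern only matches when there is
    # something beyond the first `limit` characters to cut off.
    pattern = r'^(.{%d}).+$' % limit
    # A function replacement avoids backslash escapes in `suffix` being interpreted.
    return re.sub(pattern, lambda m: m.group(1) + suffix, data, flags=re.DOTALL)


exact = 'a' * 75
original_result = re.sub(r'^(.{75}).*$', r'\g<1>...', exact)   # grows to 78 chars
variant_result = truncate_re(exact)                             # stays at 75 chars
```

Whether the exact-length case matters depends on the use; if it does not, the original pattern is fine as it stands.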
limit = 75
info = data[:limit] + '..' * (len(data) > limit)
This method doesn't use any if:
data[:75] + bool(data[75:]) * '..'
This just in:
n = 8
s = '123'
print s[:n-3] + (s[n-3:], '...')[len(s) > n]
s = '12345678'
print s[:n-3] + (s[n-3:], '...')[len(s) > n]
s = '123456789'
print s[:n-3] + (s[n-3:], '...')[len(s) > n]
s = '123456789012345'
print s[:n-3] + (s[n-3:], '...')[len(s) > n]
123
12345678
12345...
12345...
info = data[:75] + ('..' if len(data) > 75 else '')
info = data[:min(len(data), 75)]
You can't actually "truncate" a Python string like you can do a dynamically allocated C string. Strings in Python are immutable. What you can do is slice a string as described in other answers, yielding a new string containing only the characters defined by the slice offsets and step.
In some (non-practical) cases this can be a little annoying, such as when you choose Python as your interview language and the interviewer asks you to remove duplicate characters from a string in-place. Doh.
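A short demonstration of the point about immutability, as a sketch: slicing produces a new string, item assignment fails, and the usual workaround for "in-place" style edits is to go through a list of characters.

```python
s = 'hello world'

# Slicing builds a new string; the original is untouched.
t = s[:5]

# Strings do not support item assignment.
try:
    s[0] = 'J'
except TypeError as exc:
    error = str(exc)

# To edit "in place", convert to a mutable list first and join afterwards.
chars = list(s)
chars[0] = 'J'
edited = ''.join(chars)
```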
Yet another solution. With True and False you get a little feedback about the test at the end.
data = {True: data[:75] + '..', False: data}[len(data) > 75]
Coming very late to the party I want to add my solution to trim text at character level that also handles whitespaces properly.
def trim_string(s: str, limit: int, ellipsis='…') -> str:
    s = s.strip()
    if len(s) > limit:
        return s[:limit-1].strip() + ellipsis
    return s
Simple, but it makes sure that hello world with limit=6 does not result in an ugly hello … but in hello… instead.
It also removes leading and trailing whitespace, but not spaces inside. If you also want to remove spaces inside, check out this Stack Overflow post.
>>> info = lambda data: len(data)>10 and data[:10]+'...' or data
>>> info('sdfsdfsdfsdfsdfsdfsdfsdfsdfsdfsdf')
'sdfsdfsdfs...'
>>> info('sdfsdf')
'sdfsdf'
>>>
Simple and short helper function:
def truncate_string(value, max_length=255, suffix='...'):
    string_value = str(value)
    if len(string_value) <= max_length:
        return string_value  # short enough: return unchanged
    # reserve room for the suffix before cutting
    return string_value[:max(0, max_length - len(suffix))] + suffix
Usage examples:
# Example 1 (default):
long_string = ""
for number in range(1, 1000):
    long_string += str(number) + ','
result = truncate_string(long_string)
print(result)
# Example 2 (custom length):
short_string = 'Hello world'
result = truncate_string(short_string, 8)
print(result) # > Hello...
# Example 3 (not truncated):
short_string = 'Hello world'
result = truncate_string(short_string)
print(result) # > Hello world
If you want more sophisticated string truncation, you can adopt the approach scikit-learn implements in:
sklearn.base.BaseEstimator.__repr__
(See Original full code at: https://github.com/scikit-learn/scikit-learn/blob/f3f51f9b6/sklearn/base.py#L262)
It adds benefits such as avoiding truncation in the middle of a word.
import re

def truncate_string(data, N_CHAR_MAX=70):
    # N_CHAR_MAX is the (approximate) maximum number of non-blank
    # characters to render. We pass it as an optional parameter to ease
    # the tests.
    n_nonblank = len("".join(data.split()))
    if n_nonblank <= N_CHAR_MAX:
        # Short enough already. Without this guard, re.match below returns
        # None for inputs with fewer than lim non-blank characters.
        return data
    lim = N_CHAR_MAX // 2  # apprx number of chars to keep on both ends
    regex = r"^(\s*\S){%d}" % lim
    # The regex '^(\s*\S){%d}' % n
    # matches from the start of the string until the nth non-blank
    # character:
    # - ^ matches the start of string
    # - (pattern){n} matches n repetitions of pattern
    # - \s*\S matches a non-blank char following zero or more blanks
    left_lim = re.match(regex, data).end()
    right_lim = re.match(regex, data[::-1]).end()
    if "\n" in data[left_lim:-right_lim]:
        # The left side and right side aren't on the same line.
        # To avoid weird cuts, e.g.:
        # categoric...ore',
        # we need to start the right side with an appropriate newline
        # character so that it renders properly as:
        # categoric...
        # handle_unknown='ignore',
        # so we add [^\n]*\n which matches until the next \n
        regex += r"[^\n]*\n"
        right_lim = re.match(regex, data[::-1]).end()
    ellipsis = "..."
    if left_lim + len(ellipsis) < len(data) - right_lim:
        # Only add ellipsis if it results in a shorter repr
        data = data[:left_lim] + "..." + data[-right_lim:]
    return data
There's no need for a regular expression but you do want to use string formatting rather than the string concatenation in the accepted answer.
This is probably the most canonical, Pythonic way to truncate the string data at 75 characters.
>>> data = "11111111112222222222333333333344444444445555555555666666666677777777778888888888"
>>> info = "{}..".format(data[:75]) if len(data) > 75 else data
>>> info
'111111111122222222223333333333444444444455555555556666666666777777777788888..'
Here's a function I wrote as part of a new String class. It can append a suffix when the truncated text is long enough to hold it, although you don't have to enforce the absolute size.
I was in the middle of changing a few things around, so some of the logic is now redundant ( the later checks on _truncate, for instance ) because there is an early return at the top.
Still, it is a good function for truncating data.
##
## Truncate a string after the _max_len'th character, if necessary... If _max_len is less than 0, don't truncate anything...
## Note: If you attach a suffix, and you enable absolute max length then the suffix length is subtracted from max length...
## Note: If the suffix is not shorter than the max length then no suffix is used...
##
## Usage: Where _text = 'Testing', _width = 4
## _data = String.Truncate( _text, _width ) == Test
## _data = String.Truncate( _text, _width, '..', True ) == Te..
##
## Equivalent Alternates: Where _text = 'Testing', _width = 4
## _data = String.SubStr( _text, 0, _width ) == Test
## _data = _text[ : _width ] == Test
## _data = ( _text )[ : _width ] == Test
##
def Truncate( _text, _max_len = -1, _suffix = False, _absolute_max_len = True ):
    ## Length of the string we are considering for truncation
    _len = len( _text )
    ## Whether or not we have to truncate... A negative _max_len means never truncate
    _truncate = ( _max_len >= 0 and _len > _max_len )
    ## Note: If we don't need to truncate, there's no point in proceeding...
    if ( not _truncate ):
        return _text
    ## The suffix in string form
    _suffix_str = ( '', str( _suffix ) )[ _truncate and _suffix != False ]
    ## The suffix length
    _len_suffix = len( _suffix_str )
    ## Whether or not we add the suffix
    _add_suffix = ( False, True )[ _truncate and _suffix != False and _max_len > _len_suffix ]
    ## Suffix Offset
    _suffix_offset = _max_len - _len_suffix
    _suffix_offset = ( _max_len, _suffix_offset )[ _add_suffix and _absolute_max_len != False and _suffix_offset > 0 ]
    ## The truncate point.... If not necessary, then length of string.. If necessary then the max length with or without subtracting the suffix length... Note: It may be easier ( less logic cost ) to simply add the suffix to the calculated point, then truncate - if point is negative then the suffix will be destroyed anyway.
    ## If we don't need to truncate, then the length is the length of the string.. If we do need to truncate, then the length depends on whether we add the suffix and offset the length of the suffix or not...
    _len_truncate = ( _len, _max_len )[ _truncate ]
    _len_truncate = ( _len_truncate, _max_len )[ _len_truncate <= _max_len ]
    ## If we add the suffix, add it... Suffix won't be added if the suffix is the same length as the text being output...
    if ( _add_suffix ):
        _text = _text[ 0 : _suffix_offset ] + _suffix_str + _text[ _suffix_offset: ]
    ## Return the text after truncating...
    return _text[ : _len_truncate ]
Suppose that stryng is a string which we wish to truncate and that nchars is the number of characters desired in the output string.
stryng = "sadddddddddddddddddddddddddddddddddddddddddddddddddd"
nchars = 10
We can truncate the string as follows:
def truncate(stryng: str, nchars: int):
    return (stryng[:nchars - 6] + " [...]")[:min(len(stryng), nchars)]
The results for certain test cases are shown below:
s = "sadddddddddddddddddddddddddddddd!"
s = "sa" + 30*"d" + "!"
truncate(s, 2) == sa
truncate(s, 4) == sadd
truncate(s, 10) == sadd [...]
truncate(s, len(s)//2) == sadddddddd [...]
My solution produces reasonable results for the test cases above.
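However, some pathological cases are shown below. They are written with the trailing () call syntax of the class-based version given further down, which returns the same results for this input: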
truncate(s, len(s) - 3)() == sadddddddddddddddddddddd [...]
truncate(s, len(s) - 2)() == saddddddddddddddddddddddd [...]
truncate(s, len(s) - 1)() == sadddddddddddddddddddddddd [...]
truncate(s, len(s) + 0)() == saddddddddddddddddddddddddd [...]
truncate(s, len(s) + 1)() == sadddddddddddddddddddddddddd [...
truncate(s, len(s) + 2)() == saddddddddddddddddddddddddddd [..
truncate(s, len(s) + 3)() == sadddddddddddddddddddddddddddd [.
truncate(s, len(s) + 4)() == saddddddddddddddddddddddddddddd [
truncate(s, len(s) + 5)() == sadddddddddddddddddddddddddddddd
truncate(s, len(s) + 6)() == sadddddddddddddddddddddddddddddd!
truncate(s, len(s) + 7)() == sadddddddddddddddddddddddddddddd!
truncate(s, 9999)() == sadddddddddddddddddddddddddddddd!
Notably:
When the string contains newline characters (\n), there could be an issue.
When nchars > len(s), we should print the string s unchanged, without trying to append the "[...]" marker.
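One way to avoid the second problem is to return the string untouched whenever it already fits, and to fall back to a plain slice when there is no room for the marker. The following is a sketch under those assumptions, not the answer's own code: truncate_safe and its marker parameter are hypothetical names, and the newline issue is not addressed here.

```python
def truncate_safe(text: str, nchars: int, marker: str = " [...]") -> str:
    """Shorten `text` to at most `nchars` characters.

    Hypothetical variant of truncate(): the marker is only used when the
    text is actually cut and there is room left for it.
    """
    if nchars < 0:
        raise ValueError("nchars must be non-negative")
    if len(text) <= nchars:
        # Nothing to cut, so never append the marker.
        return text
    if nchars <= len(marker):
        # Too little room for the marker: fall back to a plain slice.
        return text[:nchars]
    return text[:nchars - len(marker)] + marker


if __name__ == "__main__":
    s = "sa" + 30 * "d" + "!"
    for n in (2, 4, 10, len(s) - 1, len(s), len(s) + 3):
        print(n, repr(truncate_safe(s, n)))
```

With this variant, every nchars >= len(s) returns s in full, so the partially printed " [..." endings from the pathological cases above do not occur.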
Below is some more code:
import io
class truncate:
    r"""
    Example of Code Which Uses truncate:
    ```
    s = "\r<class\n 'builtin_function_or_method'>"
    s = truncate(s, 10)()
    print(s)
    ```
    Examples of Inputs and Outputs:
    truncate(s, 2)() == \r
    truncate(s, 4)() == \r<c
    truncate(s, 10)() == \r<c [...]
    truncate(s, 20)() == \r<class\n 'bu [...]
    truncate(s, 999)() == \r<class\n 'builtin_function_or_method'>
    Other Notes:
    Returns a modified copy of string input
    Does not modify the original string
    """
    def __init__(self, x_stryng: str, x_nchars: int) -> None:
        """
        This initializer mostly exists to sanitize function inputs
        """
        try:
            # repr() escapes control characters such as \r and \n;
            # [1:-1] strips the surrounding quotes.
            stryng = repr("".join(str(ch) for ch in x_stryng))[1:-1]
            nchars = int(str(x_nchars))
        except Exception:
            invalid_stryng = str(x_stryng)
            invalid_stryng_truncated = repr(type(self)(invalid_stryng, 20)())
            invalid_x_nchars = str(x_nchars)
            invalid_x_nchars_truncated = repr(type(self)(invalid_x_nchars, 20)())
            strm = io.StringIO()
            print("Invalid Function Inputs", file=strm)
            print(type(self).__name__, "(",
                  invalid_stryng_truncated,
                  ", ",
                  invalid_x_nchars_truncated, ")", sep="", file=strm)
            msg = strm.getvalue()
            raise ValueError(msg) from None
        self._stryng = stryng
        self._nchars = nchars

    def __call__(self) -> str:
        stryng = self._stryng
        nchars = self._nchars
        return (stryng[:nchars - 6] + " [...]")[:min(len(stryng), nchars)]
Here's a simple function that will truncate a given string from either side:
def truncate(string, length=75, beginning=True, insert='..'):
    '''Shorten the given string to the given length.
    An ellipsis will be added to the section trimmed.
    :Parameters:
        length (int) = The maximum allowed length before truncating.
        beginning (bool) = Trim starting chars, else; ending.
        insert (str) = Chars to add at the trimmed area. (default: ellipsis)
    :Return:
        (str)
    ex. call: truncate('12345678', 4)
    returns: '..5678'
    '''
    if len(string) > length:
        if beginning:  # trim starting chars.
            string = insert + string[-length:]
        else:  # trim ending chars.
            string = string[:length] + insert
    return string
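Note that the function above keeps length characters and then adds the insert, so the result can be longer than length (as in '..5678'). If the total has to stay within the limit, the insert can be counted against it. The sketch below assumes that requirement; truncate_within and from_start are hypothetical names, not part of the answer.

```python
def truncate_within(text: str, length: int = 75, from_start: bool = True,
                    insert: str = "..") -> str:
    """Shorten `text` so the result, including `insert`, is at most `length` chars.

    Hypothetical variant: trims the start when from_start is True, else the end.
    """
    if len(text) <= length:
        return text
    keep = max(length - len(insert), 0)  # chars of the original text to keep
    if keep == 0:
        # No room for any text; text[-0:] would return the whole string,
        # so handle this case explicitly.
        return insert[:max(length, 0)]
    if from_start:
        return insert + text[-keep:]
    return text[:keep] + insert


if __name__ == "__main__":
    print(truncate_within("12345678", 4))                    # start trimmed
    print(truncate_within("12345678", 4, from_start=False))  # end trimmed
```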
Here I use textwrap.shorten and handle more edge cases. It also includes part of the last word when that word would take up more than 50% of the max width.
import textwrap
def shorten(text: str, width=30, placeholder="..."):
    """Collapse and truncate the given text to fit in the given width.
    The text first has its whitespace collapsed. If it then fits in the *width*, it is returned as is.
    Otherwise, as many words as possible are joined and then the placeholder is appended.
    """
    if not text or not isinstance(text, str):
        return str(text)
    t = text.strip()
    if len(t) <= width:
        return t
    # textwrap.shorten also throws ValueError if placeholder too large for max width
    shorten_words = textwrap.shorten(t, width=width, placeholder=placeholder)
    # textwrap.shorten doesn't split words, so if the text contains a long word without spaces, the result may be too short without this word.
    # Here we use a different way to include the start of this word in case shorten_words is less than 50% of `width`
    if len(shorten_words) - len(placeholder) < (width - len(placeholder)) * 0.5:
        return t[:width - len(placeholder)].strip() + placeholder
    return shorten_words
Tests:
>>> shorten("123 456", width=7, placeholder="...")
'123 456'
>>> shorten("1 23 45 678 9", width=12, placeholder="...")
'1 23 45...'
>>> shorten("1 23 45 678 9", width=10, placeholder="...")
'1 23 45...'
>>> shorten("01 23456789", width=10, placeholder="...")
'01 2345...'
>>> shorten("012 3 45678901234567", width=17, placeholder="...")
'012 3 45678901...'
>>> shorten("1 23 45 678 9", width=9, placeholder="...")
'1 23...'
>>> shorten("1 23456", width=5, placeholder="...")
'1...'
>>> shorten("123 456", width=5, placeholder="...")
'12...'
>>> shorten("123 456", width=6, placeholder="...")
'123...'
>>> shorten("12 3456789", width=9, placeholder="...")
'12 345...'
>>> shorten(" 12 3456789 ", width=9, placeholder="...")
'12 345...'
>>> shorten('123 45', width=4, placeholder="...")
'1...'
>>> shorten('123 45', width=3, placeholder="...")
'...'
>>> shorten("123456", width=3, placeholder="...")
'...'
>>> shorten([1], width=9, placeholder="...")
'[1]'
>>> shorten(None, width=5, placeholder="...")
'None'
>>> shorten("", width=9, placeholder="...")
''