I have a rather large configuration file that consists of blocks delimited by
#start <some-name> ... #end <some-name>, where some-name has to be the same for a given block. A block can appear multiple times, but is never nested within itself. Only some other blocks may appear inside certain blocks. I'm not interested in those contained blocks, but in the blocks at the second level.
In the real file the names do not start with blockX; they are very different from each other.
An example:
#start block1
#start block2
/* string but no more name2 or name1 in here */
#end block2
#start block3
/* configuration data */
#end block3
#end block1
This is being parsed with regex and is, when run without a debugger attached, quite fast: 0.23s for a 2.7MB file with simple rules like:
blocks2 = re.findall(r'#start block2\s+(.*?)#end block2', contents, re.DOTALL)
I tried parsing this with pyparsing, but the speed is VERY slow even without a debugger attached: it took 16s for the same file.
My approach was to write pyparsing code that mimics the simple parsing from the regex, so I can reuse some of the other code for now and avoid having to fully parse every block yet. The complete grammar is quite extensive.
Here is what I tried:
block = [Group(Keyword(x) + SkipTo(Keyword('#end') + Keyword(x)) + Keyword('#end') - x )(x + '*') for x in ['block3', 'block4', 'block5', 'block6', 'block7', 'block8']]
blocks = Keyword('#start') + block
x = OneOrMore(blocks).searchString(contents) # I also tried parseString() but the results were similar.
What am I doing wrong? How can I optimize this to come anywhere close to the speed achieved by the regex implementation?
Edit: The previous example was way too easy compared to the real data, so I created a proper one now:
/* all comments are C comments */
VERSION 1 0
#start PROJECT project_name "what is it about"
/* why not another comment here too! */
#start SECTION where_the_wild_things_are "explain this section"
/* I need all sections at this level */
/* In the real data there are about 10k of such blocks.
There are around 10 different names (types) of blocks */
#start INTERFACE_SPEC
There can be anything in the section. Not Really but i want to skip anything until the matching (hash)end.
/* can also have comments */
#end INTERFACE_SPEC
#start some_other_section
name 'section name'
#start with_inner_section
number_of_points 3 /* can have comments anywhere */
#end with_inner_section
#end some_other_section /* basically comments can be anywhere */
#start some_other_section
name 'section name'
other_section_attribute X
ref_to_section another_section
#end some_other_section
#start another_section
degrees
#start section_i_do_not_care_about_at_the_moment
ref_to some_other_section
/* of course can have comments */
#end section_i_do_not_care_about_at_the_moment
#end another_section
#end SECTION
#end PROJECT
For this I had to expand your original suggestion. I hard-coded the two outer blocks (PROJECT and SECTION) because they MUST exist.
With this version the time is still at ~16s:
def test_parse(f):
    import pyparsing as pp
    import io

    comment = pp.cStyleComment
    start = pp.Literal("#start")
    end = pp.Literal("#end")
    ident = pp.Word(pp.alphas + "_", pp.printables)
    inner_ident = ident.copy()
    inner_start = start + inner_ident
    inner_end = end + pp.matchPreviousLiteral(inner_ident)
    inner_block = pp.Group(inner_start + pp.SkipTo(inner_end) + inner_end)

    version = (pp.Literal('VERSION')
               - pp.Word(pp.nums)('major_version')
               - pp.Word(pp.nums)('minor_version'))
    project = (pp.Keyword('#start') - pp.Keyword('PROJECT')
               - pp.Word(pp.alphas + "_", pp.printables)('project_name')
               - pp.dblQuotedString + pp.ZeroOrMore(comment)
               - pp.Keyword('#start') - pp.Keyword('SECTION')
               - pp.Word(pp.alphas, pp.printables)('section_name')
               - pp.dblQuotedString + pp.ZeroOrMore(comment)
               - pp.OneOrMore(inner_block)
               + pp.Keyword('#end') - pp.Keyword('SECTION')
               + pp.ZeroOrMore(comment) - pp.Keyword('#end') - pp.Keyword('PROJECT'))
    grammar = pp.ZeroOrMore(comment) - version.ignore(comment) - project.ignore(comment)

    with io.open(f) as ff:
        return grammar.parseString(ff.read())
Edit: Typo: I said it was a 2k file, but it is actually a 2.7MB file.
First of all, this code as posted doesn't work for me:
blocks = Keyword('#start') + block
Changing to this:
blocks = Keyword('#start') + MatchFirst(block)
at least runs against your sample text.
Rather than hard-code all the keywords, you can try using one of pyparsing's adaptive expressions, matchPreviousLiteral:
(EDITED)
def grammar():
    import pyparsing as pp

    comment = pp.cStyleComment
    start = pp.Keyword("#start")
    end = pp.Keyword('#end')
    ident = pp.Word(pp.alphas + "_", pp.printables)
    integer = pp.Word(pp.nums)
    inner_ident = ident.copy()
    inner_start = start + inner_ident
    inner_end = end + pp.matchPreviousLiteral(inner_ident)
    inner_block = pp.Group(inner_start + pp.SkipTo(inner_end) + inner_end)

    VERSION, PROJECT, SECTION = map(pp.Keyword, "VERSION PROJECT SECTION".split())
    version = VERSION - pp.Group(integer('major_version') + integer('minor_version'))
    project = (start - PROJECT + ident('project_name') + pp.dblQuotedString
               + start + SECTION + ident('section_name') + pp.dblQuotedString
               + pp.OneOrMore(inner_block)('blocks')
               + end + SECTION
               + end + PROJECT)
    grammar = version + project
    grammar.ignore(comment)
    return grammar
It is only necessary to call ignore() on the topmost expression in your grammar - it will propagate down to all internal expressions. Also, it should be unnecessary to sprinkle ZeroOrMore(comment)s in your grammar, if you have already called ignore().
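To illustrate the propagation, here is a minimal sketch (the names are made up for illustration, not taken from the grammar above):
import pyparsing as pp

word = pp.Word(pp.alphas)
pair = pp.Group(word + word)
top = pp.OneOrMore(pair)
top.ignore(pp.cStyleComment)  # set once at the top; propagates into pair and word

print(top.parseString("hello /* comment */ world foo bar"))
# -> [['hello', 'world'], ['foo', 'bar']]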
I parsed a 2MB input string (containing 10,000 inner blocks) in about 16 seconds, so a 2K file should only take about 1/1000th as long.
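If the grammar still spends a lot of time backtracking, packrat memoization is another standard pyparsing knob worth trying; whether it actually helps on your data is untested:
import pyparsing as pp
pp.ParserElement.enablePackrat()  # enable memoization once, before building the grammar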
Related
I'm trying to use pyparsing to parse key:value pairs from the comments in a document. A key starts at the beginning of a line, and a value follows. Values may be continued on multiple lines that begin with whitespace.
import pyparsing as pp

instring = """
-- This is (a) #%^& comment
/*
name1: val
name2: val2 with $*&##) junk
name3: val3: with #)(*% multi-
   line: content
*/
"""
comment1 = pp.Literal("--") + pp.originalTextFor(pp.SkipTo(pp.LineEnd())).setDebug()
identifier = pp.Word(pp.alphanums + "_").setDebug()
meta1 = pp.LineStart() + identifier + pp.Literal(":") + pp.SkipTo(pp.LineEnd())
meta2 = pp.LineStart() + pp.White() + pp.SkipTo(pp.LineEnd())
metaval = meta1 + pp.ZeroOrMore(meta2)
metalist = pp.ZeroOrMore(comment1) + pp.Literal("/*") + pp.OneOrMore(metaval) + pp.Literal("*/")

if __name__ == "__main__":
    p = metalist.parseString(instring)
    print(p)
Fails with:
Matched {Empty SkipTo:(LineEnd) Empty} -> ['This is (a) #%^& comment']
File "C:\Users\user\py3\lib\site-packages\pyparsing.py", line 2305, in parseImpl
raise ParseException(instring, loc, self.errmsg, self)
pyparsing.ParseException: Expected start of line (at char 32), (line:4, col:1)
The answer to pyparsing whitespace match issues says
LineStart has always been difficult to work with, but ...
If the parser is at line 4 column 1 (the first key:value pair), then why is it not finding a start of line? What is the correct pyparsing syntax to recognize lines beginning with no whitespace and lines beginning with whitespace?
I think the confusion I have with LineStart is that, for LineEnd, I can look for a '\n' character, but there is no separate character for LineStart. So in LineStart I look to see if the current parser location is positioned just after a '\n'; or if it is currently on a '\n', move past it and still continue. Unfortunately, I implemented this in a place that messes up the reporting location, so you get those weird errors that read like "failed to find a start of line on line X col 1," which really does sound like it should be a successfully matched start of a line. Also, I think I need to revisit this implicit newline-skipping, or for that matter, all whitespace-skipping in general for LineStart.
For now, I've gotten your code to work by expanding your line-starting expression slightly, as:
LS = pp.Optional(pp.LineEnd()) + pp.LineStart()
and replaced the LineStart references in meta1 and meta2 with LS:
comment1 = pp.Literal("--") + pp.originalTextFor(pp.SkipTo(pp.LineEnd())).setDebug()
identifier = pp.Word(pp.alphanums + "_").setDebug()
meta1 = LS + identifier + pp.Literal(":") + pp.SkipTo(pp.LineEnd())
meta2 = LS + pp.White() + pp.SkipTo(pp.LineEnd())
metaval = meta1 + pp.ZeroOrMore(meta2)
metalist = pp.ZeroOrMore(comment1) + pp.Literal("/*") + pp.OneOrMore(metaval) + pp.Literal("*/")
If this situation with LineStart leaves you uncomfortable, here is another tactic you can try: using a parse-time condition to only accept identifiers that start in column 1:
comment1 = pp.Literal("--") + pp.originalTextFor(pp.SkipTo(pp.LineEnd())).setDebug()
identifier = pp.Word(pp.alphanums + "_").setName("identifier")
identifier.addCondition(lambda instring,loc,toks: pp.col(loc,instring) == 1)
meta1 = identifier + pp.Literal(":") + pp.SkipTo(pp.LineEnd()).setDebug()
meta2 = pp.White().setDebug() + pp.SkipTo(pp.LineEnd()).setDebug()
metaval = meta1 + pp.ZeroOrMore(meta2, stopOn=pp.Literal('*/'))
metalist = pp.ZeroOrMore(comment1) + pp.Literal("/*") + pp.LineEnd() + pp.OneOrMore(metaval) + pp.Literal("*/")
This code does away with LineStart completely, while I figure out just what I want this particular token to do. I also had to modify the ZeroOrMore repetition in metaval so that */ would not be accidentally processed as continued comment content.
Thanks for your patience with this - I am not keen to quickly put out a patched LineStart change, only to find that I have overlooked compatibility or edge cases that put me right back in the current less-than-great state of this class. But I'll put some effort into clarifying this behavior before putting out 2.1.10.
I am trying to make a snippet that will help me choose the right revision
number for a migration, by reading all migration files from
application/migrations.
What I have managed so far: the filenames are filtered while I am typing,
and when only one match is left, its revision number (always the first 14
chars of the filename) is inserted at the cursor position.
The problem is that when I hit TAB to select, I am also left with whatever
I had typed so far to search for that revision number, giving me something
like remo20160812110447.
The question is: how do I get rid of that remo in this case!?
NOTE: The example uses hardcoded values for easier testing; those will later
be replaced by the lst = os.listdir('application/migrations') line.
An added bonus would be to present those 20160710171947 values in a
human-readable date format while choosing, but insert the original value
after hitting TAB.
global !p
import datetime
def complete(t, opts):
    if t:
        opts = [m for m in opts if t in m]
    if len(opts) == 1:
        return opts[0][:14]
    return "(" + '|'.join(opts) + ')'
endglobal

snippet cimigration "Inserts desired migration number, obtained via filenames"
$1`!p import os
# lst = os.listdir('application/migrations')
lst = [
    '20160710171947_create.php',
    '20160810112347_delete.php',
    '20160812110447_remove.php'
]
snip.rv = complete(t[1], lst)`
endsnippet
This can definitely be done in pure vimscript.
Here is a working prototype. It works, but has some portability issues: global variables, reliance on iskeyword, and two key bindings instead of one. Still, it was put together in an hour or so:
set iskeyword=#,48-57,_,-,.,192-255

let g:wordidx = 0
let g:word = ''
let g:match = 0

function! Suggest()
    let l:glob = globpath('application/migrations', '*.php')
    let l:files = map(split(l:glob), 'fnamemodify(v:val, ":t")')
    let l:char = getline('.')[col('.')-1]
    let l:word = ''
    let l:suggestions = []
    if l:char =~# '[a-zA-Z0-9_]'
        if g:word ==# ''
            let g:word = expand('<cword>')
            let g:match = matchadd('ErrorMsg', g:word)
        endif
        let l:word = g:word
        "let l:reg = '^' . l:word
        let l:suggestions = filter(l:files, 'v:val =~ l:word')
        if !empty(l:suggestions)
            call add(l:suggestions, l:word)
            "echo l:suggestions
            let l:change = l:suggestions[g:wordidx]
            let g:wordidx = (g:wordidx + 1) % len(l:suggestions)
            "echo g:wordidx + 10
            execute "normal! mqviwc" . l:change . "\<esc>`q"
        endif
    endif
    "echo [l:word, l:suggestions]
endfunction

function! SuggestClear()
    call matchdelete(g:match)
    let g:wordidx = 0
    let g:word = ''
    let g:match = 0
endfunction

nnoremap <leader><tab> :call Suggest()<cr>
nnoremap <leader><cr> :call SuggestClear()<cr>
Adding this to your ~/.vimrc will let you step through search matches with <leader><tab>. It will highlight the part that is being matched; to drop the highlight, type <leader><cr>.
You should always drop the highlight after use, because the original search word is kept internally until you destroy it. Using <leader><tab> again before clearing the match will substitute suggestions from the previous match.
Screencast (my leader is -):
If you have more Vim questions, join vi.SE, the Vi and Vim site of the Stack Exchange network. You can probably get better answers there.
This can be achieved using post-expand-actions: https://github.com/SirVer/ultisnips/blob/master/doc/UltiSnips.txt#L1602
I recently acquired a trial version of some source code to check MISRA compliance before purchasing. I ran PC-lint over the C code to verify compliance and got an output with a huge number of violations. I wanted to tidy up the generated HTML so that I can sort the violations. I tried googling for an existing tool to do this, with little yield, so instead I began writing a Python script...
In short, the script iterates through every line of the HTML output multiple times in order to check for a particular string. Of course this takes a ridiculously long time to execute. I have been unable to find an elegant solution, but I'm hoping I'm missing something obvious that someone could point out... otherwise, perhaps another language would be more appropriate and would execute faster. Cheers!
#!/usr/bin/env python
import re

rule_search = re.compile("Required Rule (.*?),", re.DOTALL | re.M)
rule_search2 = re.compile("MISRA 2004 Rule (.*?)]", re.DOTALL | re.M)
line_search = re.compile("<br>(.*?)<br>", re.DOTALL | re.M)

data = open('lint-all.html').read()
unique_rules = list(set(rule_search.findall(data)))
unique_rules2 = list(set(rule_search2.findall(data)))
MISRA_Rules = unique_rules + unique_rules2
count = [0] * len(MISRA_Rules)
page_lines = {}
pages = {}

counts = open("pages/counts.html", 'w')
counts.write("<h2>Violated Rules Count</h2><h3><ol>")
counts.close()

for i in range(len(MISRA_Rules)):
    pages[i] = open("pages/" + str(MISRA_Rules[i]).translate(None, '.') + ".html", 'w')
    pages[i].close()
    counts = open("pages/counts.html", 'a+')
    counts.write("<a href=" + str(MISRA_Rules[i]).translate(None, '.') + ".html>" + str(MISRA_Rules[i]) + "</a>: <font size='3'> 0 </font> ")
    if i % 4 == 0 and i != 0:
        counts.write("<br />")

counts.write("<br /><a href=sorted.html>Total:</a> " + "<font size='3'>" + str(count) + "</font>")
counts.write("</h3>")

for i in range(len(MISRA_Rules)):
    pages[i] = open("pages/" + str(MISRA_Rules[i]).translate(None, '.') + ".html", 'a+')
    pages[i].write("<h1>MISRA Rule " + str(MISRA_Rules[i]) + "</h1>")
    pages[i].write("""<link rel="import" href="counts.html">""")
    for j in range(len(line_search.findall(data))):
        if "Rule " + str(MISRA_Rules[i]) in line_search.findall(data)[j]:
            count[i] += 1
            pages[i].write("<br>")
            pages[i].write(line_search.findall(data)[j])
            pages[i].write("</br>")
    print "out"

new_html = open('pages/sorted.html', 'w')
counts = """<h2>Violated Rules Count</h2><h3><ol>"""
for i in range(len(MISRA_Rules)):
    counts += """""" + str(MISRA_Rules[i]) + """: <font size="3">""" + str(count[i]) + """</font> """
    if i % 4 == 0 and i != 0:
        counts += """<br />"""
counts += """<br /><a href=sorted.html>Total:</a> """ + """<font size="3">""" + str(count) + """</font>"""
counts += """</h3>"""
new_html.write(counts)
new_html.write(data)
new_html.close()
Several approaches are possible.
The first is to optimize the existing code. It's difficult to say what exactly is slow just by reading it; in such cases, go to the cProfile docs and set up a profiler. There you'll see the bottlenecks.
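For example, a minimal profiling sketch - this assumes you first wrap the script body in a main() function, which the posted code does not have yet:
import cProfile
import pstats

cProfile.run('main()', 'lint_report.prof')
pstats.Stats('lint_report.prof').sort_stats('cumulative').print_stats(10)  # top 10 hot spots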
The second approach (the most preferable one, in my opinion): parse the data in Python, but leave the HTML generation to specialized tools, such as the jinja2 template engine, which is used extensively in web development. A simpler alternative to jinja2 is mustache; most likely it won't require any installation.
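A minimal jinja2 sketch (the template and data here are made up for illustration, not taken from your script):
from jinja2 import Template

template = Template("""
<h2>Violated Rules Count</h2>
<ul>
{% for rule, n in counts %}
  <li><a href="{{ rule }}.html">{{ rule }}</a>: {{ n }}</li>
{% endfor %}
</ul>
""")
print template.render(counts=[('12.4', 3), ('19.7', 11)])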
The third approach is to do all this work in the browser. Add jQuery for the DOM manipulation (to introduce new tags and classes) and a CSS stylesheet (to determine how the new tags and classes should look).
Paul McGuire, the author of pyparsing, was kind enough to help a lot with a problem I'm trying to solve. We're on 1st down with a yard to goal, but I can't even punt it across the goal line. Confucius said if he gave a student 1/4 of the solution and the student did not return with the other 3/4, then he would not teach that student again. So it is after almost a week of frustration, and with great anxiety, that I ask this...
How do I open an input file for pyparsing and print the output to another file?
Here is what I've got so far, but it's really all his work:
from pyparsing import *

datafile = open('test.txt')

# Backus-Naur Form
num = Word(nums)
accessionDate = Combine(num + "/" + num + "/" + num)("accDate")
accessionNumber = Combine("S" + num + "-" + num)("accNum")
patMedicalRecordNum = Combine(num + "/" + num + "-" + num + "-" + num)("patientNum")
gleason = Group("GLEASON" + Optional("SCORE:") + num("left") + "+" + num("right") + "=" + num("total"))
patientData = Group(accessionDate + accessionNumber + patMedicalRecordNum)
partMatch = patientData("patientData") | gleason("gleason")

lastPatientData = None

# PARSE ACTIONS
def patientRecord(datafile):
    for match in partMatch.searchString(datafile):
        if match.patientData:
            lastPatientData = match
        elif match.gleason:
            if lastPatientData is None:
                print "bad!"
                continue
            print "{0.accDate}: {0.accNum} {0.patientNum} Gleason({1.left}+{1.right}={1.total})".format(
                lastPatientData.patientData, match.gleason
            )

patientData.setParseAction(lastPatientData)

# MAIN PROGRAM
if __name__ == "__main__":
    patientRecord()
It looks like you need to call datafile.read() in order to read the contents of the file. Right now you are trying to call searchString on the file object itself, not the text in the file. You should really look at the Python tutorial (particularly this section) to get up to speed on how to read files, etc.
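A minimal sketch of that fix, reusing the names from your code:
datafile = open('test.txt')
contents = datafile.read()   # searchString() wants the text, not the file object
datafile.close()

for match in partMatch.searchString(contents):
    print match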
It seems like you need some help putting it together. The advice of @BrenBarn is spot-on: work on a problem of simpler complexity before you put it all together. I can help by giving you a minimal example of what you are trying to do, with a much simpler grammar. You can use this as a template to learn how to read/write a file in Python. Consider the input text file data.txt:
cat 3
dog 5
foo 7
Let's parse this file and output the results. To have some fun, let's multiply the second column by 2:
from pyparsing import *

# Read the input data
filename = "data.txt"
FIN = open(filename)
TEXT = FIN.read()

# Define a simple grammar for the text; multiply the second column by 2
digits = Word(nums)
digits.setParseAction(lambda x: int(x[0]) * 2)
blocks = Group(Word(alphas) + digits)
grammar = OneOrMore(blocks)

# Parse the results
result = grammar.parseString(TEXT)

# This gives a list of lists
# [['cat', 6], ['dog', 10], ['foo', 14]]

# Open up a new file for the output
filename2 = "data2.txt"
FOUT = open(filename2, 'w')

# Walk through the results and write to the file
for item in result:
    print item
    FOUT.write("%s %i\n" % (item[0], item[1]))
FOUT.close()
This produces the following in data2.txt:
cat 6
dog 10
foo 14
Break each piece down until you understand it. From here, you can slowly adapt this minimal example to your more complex problem above. It's OK to read the file in (as long as it is relatively small) since Paul himself notes:
parseFile is really just a simple shortcut around parseString, pretty
much the equivalent of expr.parseString(open(filename).read()).
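So, equivalently, the explicit open/read above could be written with that shortcut (same grammar object as in the example):
result = grammar.parseFile("data.txt")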
Overview
So, I’m in the middle of refactoring a project, and I’m separating out a bunch of parsing code. The code I’m concerned with here uses pyparsing.
I have a very poor understanding of pyparsing, even after spending a lot of time reading through the official documentation. I’m having trouble because (1) pyparsing takes a (deliberately) unorthodox approach to parsing, and (2) I’m working on code I didn’t write, with poor comments, and a non-elementary set of existing grammars.
(I can’t get in touch with the original author, either.)
Failing Test
I’m using PyVows to test my code. One of my tests is as follows (I think this is clear even if you’re unfamiliar with PyVows; let me know if it isn’t):
def test_multiline_command_ends(self, topic):
    output = parsed_input('multiline command ends\n\n', topic)
    expect(output).to_equal(
        r'''['multiline', 'command ends', '\n', '\n']
- args: command ends
- multiline_command: multiline
- statement: ['multiline', 'command ends', '\n', '\n']
  - args: command ends
  - multiline_command: multiline
  - terminator: ['\n', '\n']
- terminator: ['\n', '\n']''')
But when I run the test, I get the following in the terminal:
Failed Test Results
Expected topic("['multiline', 'command ends']\n- args: command ends\n- command: multiline\n- statement: ['multiline', 'command ends']\n - args: command ends\n - command: multiline")
to equal "['multiline', 'command ends', '\\n', '\\n']\n- args: command ends\n- multiline_command: multiline\n- statement: ['multiline', 'command ends', '\\n', '\\n']\n - args: command ends\n - multiline_command: multiline\n - terminator: ['\\n', '\\n']\n- terminator: ['\\n', '\\n']"
Note:
Since the output is to a Terminal, the expected output (the second one) has extra backslashes. This is normal. The test ran without issue before this piece of refactoring began.
Expected Behavior
The first line of output should match the second, but it doesn’t. Specifically, it’s not including the two newline characters in that first list object.
So I’m getting this:
"['multiline', 'command ends']\n- args: command ends\n- command: multiline\n- statement: ['multiline', 'command ends']\n - args: command ends\n - command: multiline"
When I should be getting this:
"['multiline', 'command ends', '\\n', '\\n']\n- args: command ends\n- multiline_command: multiline\n- statement: ['multiline', 'command ends', '\\n', '\\n']\n - args: command ends\n - multiline_command: multiline\n - terminator: ['\\n', '\\n']\n- terminator: ['\\n', '\\n']"
Earlier in the code, there is also this statement:
pyparsing.ParserElement.setDefaultWhitespaceChars(' \t')
…Which I think should prevent exactly this kind of error. But I’m not sure.
Even if the problem can’t be identified with certainty, simply narrowing down where the problem is would be a HUGE help.
Please let me know how I might take a step or two towards fixing this.
Edit: So, uh, I should post the parser code for this, shouldn’t I? (Thanks for the tip, @andrew cooke!)
Parser code
Here’s the __init__ for my parser object.
I know it’s a nightmare. That’s why I’m refactoring the project. ☺
def __init__(self, Cmd_object=None, *args, **kwargs):
    # #NOTE
    # This is one of the biggest pain points of the existing code.
    # To aid in readability, I CAPITALIZED all variables that are
    # not set on `self`.
    #
    # That means that CAPITALIZED variables aren't
    # used outside of this method.
    #
    # Doing this has allowed me to more easily read what
    # variables become a part of other variables during the
    # building-up of the various parsers.
    #
    # I realize the capitalized variables are unorthodox
    # and potentially anti-convention. But after reaching out
    # to the project's creator several times over roughly 5
    # months, I'm still working on this project alone...
    # And without help, this is the only way I can move forward.
    #
    # I have a very poor understanding of the parser's
    # control flow when the user types a command and hits ENTER,
    # and until the author (or another pyparsing expert)
    # explains what's happening to me, I have to do silly
    # things like this. :-|
    #
    # Of course, if the impossible happens and this code
    # gets cleaned up, then the variables will be restored to
    # proper capitalization.
    #
    # —Zearin
    # http://github.com/zearin/
    # 2012 Mar 26

    if Cmd_object is not None:
        self.Cmd_object = Cmd_object
    else:
        raise Exception('Cmd_object must be provided to Parser.__init__().')

    # #FIXME
    # Refactor methods into this class later
    preparse = self.Cmd_object.preparse
    postparse = self.Cmd_object.postparse

    self._allow_blank_lines = False
    self.abbrev = True              # Recognize abbreviated commands
    self.case_insensitive = True    # Commands recognized regardless of case
    # make sure your terminators are not in legal_chars!
    self.legal_chars = u'!#$%.:?#_' + PYP.alphanums + PYP.alphas8bit
    self.multiln_commands = [] if 'multiline_commands' not in kwargs else kwargs['multiln_commands']
    self.no_special_parse = {'ed', 'edit', 'exit', 'set'}
    self.redirector = '>'           # for sending output to file
    self.reserved_words = []
    self.shortcuts = {'?' : 'help',
                      '!' : 'shell',
                      '#' : 'load',
                      '##': '_relative_load'}

    # self._init_grammars()
    #
    # def _init_grammars(self):
    #     #FIXME
    #     Add Docstring

    # ----------------------------
    # Tell PYP how to parse
    # file input from '< filename'
    # ----------------------------
    FILENAME = PYP.Word(self.legal_chars + '/\\')
    INPUT_MARK = PYP.Literal('<')
    INPUT_MARK.setParseAction(lambda x: '')
    INPUT_FROM = FILENAME('INPUT_FROM')
    INPUT_FROM.setParseAction(self.Cmd_object.replace_with_file_contents)
    # ----------------------------

    #OUTPUT_PARSER = (PYP.Literal('>>') | (PYP.WordStart() + '>') | PYP.Regex('[^=]>'))('output')
    OUTPUT_PARSER = (PYP.Literal(2 * self.redirector) |
                     (PYP.WordStart() + self.redirector) |
                     PYP.Regex('[^=]' + self.redirector))('output')

    PIPE = PYP.Keyword('|', identChars='|')
    STRING_END = PYP.stringEnd ^ '\nEOF'
    TERMINATORS = [';']
    TERMINATOR_PARSER = PYP.Or([
        (hasattr(t, 'parseString') and t)
        or
        PYP.Literal(t) for t in TERMINATORS
    ])('terminator')

    self.comment_grammars = PYP.Or([PYP.pythonStyleComment,
                                    PYP.cStyleComment])
    self.comment_grammars.ignore(PYP.quotedString)
    self.comment_grammars.setParseAction(lambda x: '')
    self.comment_grammars.addParseAction(lambda x: '')
    self.comment_in_progress = '/*' + PYP.SkipTo(PYP.stringEnd ^ '*/')

    # QuickRef: Pyparsing Operators
    # ----------------------------
    # ~     creates NotAny using the expression after the operator
    #
    # +     creates And using the expressions before and after the operator
    #
    # |     creates MatchFirst (first left-to-right match) using the
    #       expressions before and after the operator
    #
    # ^     creates Or (longest match) using the expressions before and
    #       after the operator
    #
    # &     creates Each using the expressions before and after the operator
    #
    # *     creates And by multiplying the expression by the integer operand;
    #       if expression is multiplied by a 2-tuple, creates an And of
    #       (min,max) expressions (similar to "{min,max}" form in
    #       regular expressions); if min is None, interpret as (0,max);
    #       if max is None, interpret as expr*min + ZeroOrMore(expr)
    #
    # -     like + but with no backup and retry of alternatives
    #
    # *     repetition of expression
    #
    # ==    matching expression to string; returns True if the string
    #       matches the given expression
    #
    # <<    inserts the expression following the operator as the body of the
    #       Forward expression before the operator
    # ----------------------------

    DO_NOT_PARSE = (self.comment_grammars |
                    self.comment_in_progress |
                    PYP.quotedString)

    # moved here from class-level variable
    self.URLRE = re.compile('(https?://[-\\w\\./]+)')
    self.keywords = self.reserved_words + [fname[3:] for fname in dir(self.Cmd_object) if fname.startswith('do_')]

    # not to be confused with `multiln_parser` (below)
    self.multiln_command = PYP.Or([
        PYP.Keyword(c, caseless=self.case_insensitive)
        for c in self.multiln_commands
    ])('multiline_command')
    ONELN_COMMAND = (~self.multiln_command +
                     PYP.Word(self.legal_chars))('command')
    #self.multiln_command.setDebug(True)

    # Configure according to `allow_blank_lines` setting
    if self._allow_blank_lines:
        self.blankln_termination_parser = PYP.NoMatch
    else:
        BLANKLN_TERMINATOR = (2 * PYP.lineEnd)('terminator')
        #BLANKLN_TERMINATOR('terminator')
        self.blankln_termination_parser = (
            (self.multiln_command ^ ONELN_COMMAND)
            + PYP.SkipTo(
                BLANKLN_TERMINATOR,
                ignore=DO_NOT_PARSE
            ).setParseAction(lambda x: x[0].strip())('args')
            + BLANKLN_TERMINATOR
        )('statement')

    # CASE SENSITIVITY for
    # ONELN_COMMAND and self.multiln_command
    if self.case_insensitive:
        # Set parsers to account for case insensitivity (if appropriate)
        self.multiln_command.setParseAction(lambda x: x[0].lower())
        ONELN_COMMAND.setParseAction(lambda x: x[0].lower())

    self.save_parser = (PYP.Optional(PYP.Word(PYP.nums) ^ '*')('idx')
                        + PYP.Optional(PYP.Word(self.legal_chars + '/\\'))('fname')
                        + PYP.stringEnd)

    AFTER_ELEMENTS = (PYP.Optional(PIPE +
                                   PYP.SkipTo(
                                       OUTPUT_PARSER ^ STRING_END,
                                       ignore=DO_NOT_PARSE
                                   )('pipeTo'))
                      + PYP.Optional(OUTPUT_PARSER +
                                     PYP.SkipTo(
                                         STRING_END,
                                         ignore=DO_NOT_PARSE
                                     ).setParseAction(lambda x: x[0].strip())('outputTo')))

    self.multiln_parser = (((self.multiln_command ^ ONELN_COMMAND)
                            + PYP.SkipTo(
                                TERMINATOR_PARSER,
                                ignore=DO_NOT_PARSE
                            ).setParseAction(lambda x: x[0].strip())('args')
                            + TERMINATOR_PARSER)('statement')
                           + PYP.SkipTo(
                               OUTPUT_PARSER ^ PIPE ^ STRING_END,
                               ignore=DO_NOT_PARSE
                           ).setParseAction(lambda x: x[0].strip())('suffix')
                           + AFTER_ELEMENTS)
    #self.multiln_parser.setDebug(True)
    self.multiln_parser.ignore(self.comment_in_progress)

    self.singleln_parser = (
        (ONELN_COMMAND + PYP.SkipTo(
            TERMINATOR_PARSER
            ^ STRING_END
            ^ PIPE
            ^ OUTPUT_PARSER,
            ignore=DO_NOT_PARSE
        ).setParseAction(lambda x: x[0].strip())('args'))('statement')
        + PYP.Optional(TERMINATOR_PARSER)
        + AFTER_ELEMENTS)
    #self.multiln_parser = self.multiln_parser('multiln_parser')
    #self.singleln_parser = self.singleln_parser('singleln_parser')

    self.prefix_parser = PYP.Empty()
    self.parser = self.prefix_parser + (STRING_END |
                                        self.multiln_parser |
                                        self.singleln_parser |
                                        self.blankln_termination_parser |
                                        self.multiln_command +
                                        PYP.SkipTo(
                                            STRING_END,
                                            ignore=DO_NOT_PARSE)
                                        )
    self.parser.ignore(self.comment_grammars)

    # a not-entirely-satisfactory way of distinguishing
    # '<' as in "import from" from
    # '<' as in "lesser than"
    self.input_parser = (INPUT_MARK +
                         PYP.Optional(INPUT_FROM) +
                         PYP.Optional('>') +
                         PYP.Optional(FILENAME) +
                         (PYP.stringEnd | '|'))
    self.input_parser.ignore(self.comment_in_progress)
I suspect that the problem is pyparsing's builtin whitespace skipping, which will skip over newlines by default. Even though setDefaultWhitespaceChars is used to tell pyparsing that newlines are significant, this setting only affects all expressions that are created after the call to setDefaultWhitespaceChars. The problem is that pyparsing tries to help by defining a number of convenience expressions when it is imported, like empty for Empty(), lineEnd for LineEnd() and so on. But since these are all created at import time, they are defined with the original default whitespace characters, which include '\n'.
I should probably just do this in setDefaultWhitespaceChars, but you can clean this up for yourself too. Right after calling setDefaultWhitespaceChars, redefine these module-level expressions in pyparsing:
PYP.ParserElement.setDefaultWhitespaceChars(' \t')
# redefine module-level constants to use new default whitespace chars
PYP.empty = PYP.Empty()
PYP.lineEnd = PYP.LineEnd()
PYP.stringEnd = PYP.StringEnd()
I think this will help restore the significance of your embedded newlines.
Some other bits on your parser code:
self.blankln_termination_parser = PYP.NoMatch
should be
self.blankln_termination_parser = PYP.NoMatch()
Your original author might have been overly aggressive with using '^' over '|'. Only use '^' if there is some potential for parsing one expression accidentally when you would really have parsed a longer one that follows later in the list of alternatives. For instance, in:
self.save_parser = ( PYP.Optional(PYP.Word(PYP.nums)^'*')('idx')
There is no possible confusion between a Word of numeric digits and a lone '*'. Or (the '^' operator) tells pyparsing to try to evaluate all of the alternatives and then pick the longest matching one - in case of a tie, choose the left-most alternative in the list. If you parse '*', there is no need to check whether it might also match a longer integer, and if you parse an integer, no need to check whether it might also pass as a lone '*'. So change this to:
self.save_parser = ( PYP.Optional(PYP.Word(PYP.nums)|'*')('idx')
Using a parse action to replace a string with '' is more simply written using a PYP.Suppress wrapper, or if you prefer, call expr.suppress() which returns Suppress(expr). Combined with preference for '|' over '^', this:
self.comment_grammars = PYP.Or([PYP.pythonStyleComment,
                                PYP.cStyleComment])
self.comment_grammars.ignore(PYP.quotedString)
self.comment_grammars.setParseAction(lambda x: '')
becomes:
self.comment_grammars = (PYP.pythonStyleComment | PYP.cStyleComment
                         ).ignore(PYP.quotedString).suppress()
Keywords have built-in logic to automatically avoid ambiguity, so Or is completely unnecessary with them:
self.multiln_command = PYP.Or([
    PYP.Keyword(c, caseless=self.case_insensitive)
    for c in self.multiln_commands
])('multiline_command')
should be:
self.multiln_command = PYP.MatchFirst([
    PYP.Keyword(c, caseless=self.case_insensitive)
    for c in self.multiln_commands
])('multiline_command')
(In the next release, I'll loosen up those initializers to accept generator expressions so that the []'s will become unnecessary.)
That's all I can see for now. Hope this helps.
I fixed it!
Pyparsing was not at fault!
I was. ☹
By separating out the parsing code into a different object, I created the problem. Originally, an attribute used to “update itself” based on the contents of a second attribute. Since this all used to be contained in one “god class”, it worked fine.
Simply by separating the code into another object, the first attribute was set at instantiation, but no longer “updated itself” if the second attribute it depended on changed.
Specifics
The attribute multiln_command (not to be confused with multiln_commands—aargh, what confusing naming!) was a pyparsing grammar definition. The multiln_command attribute should have updated its grammar if multiln_commands ever changed.
Although I knew these two attributes had similar names but very different purposes, the similarity definitely made it harder to track the problem down. I have now renamed multiln_command to multiln_grammar.
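For anyone who hits the same trap: one way to keep such an attribute in sync is to rebuild it on access with a property. A rough sketch with illustrative names (not the project's actual code):
import pyparsing as PYP

class Parser(object):
    def __init__(self, multiln_commands=None):
        self.multiln_commands = multiln_commands or []

    @property
    def multiln_grammar(self):
        # Rebuilt on every access, so it always reflects the current
        # contents of self.multiln_commands.
        return PYP.MatchFirst([
            PYP.Keyword(c, caseless=True) for c in self.multiln_commands
        ])('multiline_command')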
However! ☺
I am grateful for @Paul McGuire’s awesome answer, and I hope it saves me (and others) some grief in the future. Although I feel a bit foolish that I caused the problem (and misdiagnosed it as a pyparsing issue), I’m happy some good (in the form of Paul’s advice) came of asking this question.
Happy parsing, everybody. :)