I'm trying to use pyparsing to parse key:value pairs from the comments in a document. A key starts at the beginning of a line, and a value follows. Values may be continued on multiple lines that begin with whitespace.
import pyparsing as pp
instring = """
-- This is (a) #%^& comment
/*
name1: val
name2: val2 with $*&##) junk
name3: val3: with #)(*% multi-
line: content
*/
"""
comment1 = pp.Literal("--") + pp.originalTextFor(pp.SkipTo(pp.LineEnd())).setDebug()
identifier = pp.Word(pp.alphanums + "_").setDebug()
meta1 = pp.LineStart() + identifier + pp.Literal(":") + pp.SkipTo(pp.LineEnd())
meta2 = pp.LineStart() + pp.White() + pp.SkipTo(pp.LineEnd())
metaval = meta1 + pp.ZeroOrMore(meta2)
metalist = pp.ZeroOrMore(comment1) + pp.Literal("/*") + pp.OneOrMore(metaval) + pp.Literal("*/")
if __name__ == "__main__":
    p = metalist.parseString(instring)
    print(p)
Fails with:
Matched {Empty SkipTo:(LineEnd) Empty} -> ['This is (a) #%^& comment']
File "C:\Users\user\py3\lib\site-packages\pyparsing.py", line 2305, in parseImpl
raise ParseException(instring, loc, self.errmsg, self)
pyparsing.ParseException: Expected start of line (at char 32), (line:4, col:1)
The answer to pyparsing whitespace match issues says
LineStart has always been difficult to work with, but ...
If the parser is at line 4 column 1 (the first key:value pair), then why is it not finding a start of line? What is the correct pyparsing syntax to recognize lines beginning with no whitespace and lines beginning with whitespace?
I think the confusion I have with LineStart is that, for LineEnd, I can look for a '\n' character, but there is no separate character for LineStart. So in LineStart I look to see if the current parser location is positioned just after a '\n'; or if it is currently on a '\n', move past it and still continue. Unfortunately, I implemented this in a place that messes up the reporting location, so you get those weird errors that read like "failed to find a start of line on line X col 1," which really does sound like it should be a successfully matched start of a line. Also, I think I need to revisit this implicit newline-skipping, or for that matter, all whitespace-skipping in general for LineStart.
For now, I've gotten your code to work by expanding your line-starting expression slightly, as:
LS = pp.Optional(pp.LineEnd()) + pp.LineStart()
and replaced the LineStart references in meta1 and meta2 with LS:
comment1 = pp.Literal("--") + pp.originalTextFor(pp.SkipTo(pp.LineEnd())).setDebug()
identifier = pp.Word(pp.alphanums + "_").setDebug()
meta1 = LS + identifier + pp.Literal(":") + pp.SkipTo(pp.LineEnd())
meta2 = LS + pp.White() + pp.SkipTo(pp.LineEnd())
metaval = meta1 + pp.ZeroOrMore(meta2)
metalist = pp.ZeroOrMore(comment1) + pp.Literal("/*") + pp.OneOrMore(metaval) + pp.Literal("*/")
If this situation with LineStart leaves you uncomfortable, here is another tactic you can try: using a parse-time condition to only accept identifiers that start in column 1:
comment1 = pp.Literal("--") + pp.originalTextFor(pp.SkipTo(pp.LineEnd())).setDebug()
identifier = pp.Word(pp.alphanums + "_").setName("identifier")
identifier.addCondition(lambda instring,loc,toks: pp.col(loc,instring) == 1)
meta1 = identifier + pp.Literal(":") + pp.SkipTo(pp.LineEnd()).setDebug()
meta2 = pp.White().setDebug() + pp.SkipTo(pp.LineEnd()).setDebug()
metaval = meta1 + pp.ZeroOrMore(meta2, stopOn=pp.Literal('*/'))
metalist = pp.ZeroOrMore(comment1) + pp.Literal("/*") + pp.LineEnd() + pp.OneOrMore(metaval) + pp.Literal("*/")
This code does away with LineStart completely, while I figure out just what I want this particular token to do. I also had to modify the ZeroOrMore repetition in metaval so that */ would not be accidentally processed as continued comment content.
Thanks for your patience with this - I am not keen to quickly put out a patched LineStart change and then find that I have overlooked other compatibility or other edge cases that just put me back in the current less-than-great state on this class. But I'll put some effort into clarifying this behavior before putting out 2.1.10.
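As a cross-check on the rule itself (a key starts in column 1, a continuation line starts with whitespace), here is a minimal sketch in plain Python that does not use pyparsing at all. The names KEY_LINE and parse_meta and the sample text are made up for illustration, and the sketch assumes the comment delimiters have already been stripped from the block.

```python
import re

# Hypothetical helper names; this is plain Python, not pyparsing.
KEY_LINE = re.compile(r"^(\w+):\s*(.*)$")

def parse_meta(block):
    """Collect (key, value) pairs; indented lines continue the previous value."""
    pairs = []
    for line in block.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line[0].isspace():
            # Continuation line: append its text to the most recent value.
            if pairs:
                key, value = pairs[-1]
                pairs[-1] = (key, value + " " + line.strip())
            continue
        match = KEY_LINE.match(line)
        if match:
            pairs.append((match.group(1), match.group(2)))
    return pairs

sample = "name1: val\nname3: val3: with multi-\n  line: content\n"
pairs = parse_meta(sample)
print(pairs)
```

Because only the first colon on an unindented line separates key from value, later colons (as in "val3: with") stay part of the value.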
I've tried to find or build a solution, but it's too complicated for me at the moment. I have a text from SAP (stored in tkinter's ScrolledText widget):
session.findById("wnd[0]").sendVKey 4
session.findById("wnd[1]/tbar[0]/btn[17]").press
session.findById("wnd[1]/usr/tabsG_SELONETABSTRIP/tabpTAB001/ssubSUBSCR_PRESEL:SAPLSDH4:0220/sub:SAPLSDH4:0220/ctxtG_SELFLD_TAB-LOW[3,24]").setFocus
session.findById("wnd[1]").sendVKey 0
session.findById("wnd[1]/usr/lbl[1,1]").setFocus
session.findById("wnd[1]").sendVKey 33
I would like to find every line where .sendVKey occurs and put new text at the start of that line (or on the line above it). In the example there are 3 such lines:
Added new text before sendkey text occurs!
session.findById("wnd[0]").sendVKey 4
(...)
Added new text before sendkey text occurs!
session.findById("wnd[1]").sendVKey 0
(...)
I tried using re.sub, re.findall and replace, but I don't think this is a good approach. My working code is below, but it is not dynamic: the window index in ("wnd[1]") before .sendVKey can differ (0, 1, 2, 3 etc.). Any hints or solutions? Please help :(
import re
import tkinter as tk

def multiple_replace(string, rep_dict):
    # Longest keys first, so that at any position the longest matching key is tried first.
    pattern = re.compile("|".join([re.escape(k) for k in sorted(rep_dict, key=len, reverse=True)]), flags=re.DOTALL)
    return pattern.sub(lambda x: rep_dict[x.group(0)], string)

date_div = RPAcode.get("1.0", tk.END)
delay_vba_function = " Added new text before sendkey text occurs"
replaced_text = multiple_replace(date_div, {'.sendVKey' : '\n' + delay_vba_function + '\n' + '.sendVKey',
    'session.findById("wnd[1]").sendVKey' : '\n' + delay_vba_function + '\n' + 'session.findById("wnd[1]").sendVKey',
    'session.findById("wnd[2]").sendVKey' : '\n' + delay_vba_function + '\n' + 'session.findById("wnd[2]").sendVKey',
    'session.findById("wnd[3]").sendVKey' : '\n' + delay_vba_function + '\n' + 'session.findById("wnd[3]").sendVKey'})
#print(replaced_text)
RPAcode.delete('0.0', tk.END) #Deletes previous data
RPAcode.insert(1.0, replaced_text)
The simplest way to check whether a substring is in a string is the in operator. Consider the following:
text = 'session.findById("wnd[0]").sendVKey 4'
if '.sendVKey' in text:
text = '(...)' + text
text
If you run it in the terminal, the output is gonna be:
'(...)session.findById("wnd[0]").sendVKey 4'
So, as you can see, the script already adds some text in front of the line. If you write text = '(...)\n' + text instead of text = '(...)' + text, the (...) shows up on a line above the actual text. If you put the above code sample into a for loop iterating through all text lines, I think that approach might solve your problem.
Edit:
You first have to split the text into lines before iterating through the lines. I guess that is precisely what you need:
text = """session.findById("wnd[0]").sendVKey 4
session.findById("wnd[1]/tbar[0]/btn[17]").press
session.findById("wnd[1]/usr/tabsG_SELONETABSTRIP/tabpTAB001/ssubSUBSCR_PRESEL:SAPLSDH4:0220
session.findById("wnd[1]").sendVKey 0
session.findById("wnd[1]/usr/lbl[1,1]").setFocus
session.findById("wnd[1]").sendVKey 33"""
query_text = ".sendVKey"
addition = "(...) "
def indicate_lines(text, query_text, indication):
    result = ''
    text = text.splitlines()
    for line in text:
        if query_text in line:
            result = result + indication + line + "\n"
        else:
            result = result + line + "\n"
    return result
result = indicate_lines(text, query_text, addition)
print(result)
The result is gonna be:
(...) session.findById("wnd[0]").sendVKey 4
session.findById("wnd[1]/tbar[0]/btn[17]").press
session.findById("wnd[1]/usr/tabsG_SELONETABSTRIP/tabpTAB001
/ssubSUBSCR_PRESEL:SAPLSDH4:0220
(...) session.findById("wnd[1]").sendVKey 0
session.findById("wnd[1]/usr/lbl[1,1]").setFocus
(...) session.findById("wnd[1]").sendVKey 33
Note that I would expect regex to perform faster in case you want to have a scalable solution (because for-loops are comparatively slow in Python). But for most medium sized applications that will do the job.
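For comparison, here is a minimal sketch of the regex route mentioned above. It makes no claim about speed; it only shows how a single re.sub call can insert a line above every line containing a marker. The function name prepend_line and the shortened script text are assumptions made for illustration.

```python
import re

def prepend_line(text, marker, new_line):
    """Insert new_line above every line of text that contains marker."""
    # ^ with re.MULTILINE matches at the start of every line; the lookahead
    # checks that the marker occurs somewhere on that same line.
    pattern = re.compile(r"^(?=.*" + re.escape(marker) + r")", flags=re.MULTILINE)
    # A function replacement avoids backslash processing in new_line.
    return pattern.sub(lambda m: new_line + "\n", text)

script = (
    'session.findById("wnd[0]").sendVKey 4\n'
    'session.findById("wnd[1]/tbar[0]/btn[17]").press\n'
    'session.findById("wnd[1]").sendVKey 0'
)
result = prepend_line(script, ".sendVKey", "Added new text before sendkey text occurs!")
print(result)
```

Since the pattern never mentions the window index, it works for wnd[0], wnd[1], wnd[2] and so on without listing each one.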
I need to parse the following three lines:
Uptime is 1w2d
Last reset at 23:05:56
Reason: reload
But last two lines are not always there, output could look like this prior to 1st reboot:
Uptime is 1w2d
Last reset
My parser looks like this:
parser = (SkipTo(Literal('is'), include=True)('uptime') +
          delimitedList(Suppress(SkipTo(Literal('at'), include=True))('reset') +
                        SkipTo(Literal(':'), include=True) +
                        SkipTo(lineEnd)('reason'), combine=True)
          )
It works in the first case with 3 lines, but doesn't work in the second case.
Assuming the order of the lines matters, I would use this syntax for the output you've posted:
from pyparsing import Literal, Word, alphanums, nums, alphas, Optional, delimitedList
def createParser():
    firstLine = Literal('Uptime is') + Word(alphanums)
    secLine = Literal('Last reset at') + delimitedList(Word(nums) + Literal(':') + Word(nums) + Literal(':') + Word(nums))
    thirdLine = Literal('Reason:') + Word(alphas)
    return firstLine + secLine + Optional(thirdLine)
if __name__ == '__main__':
    parser = createParser()
    firstText = """Uptime is 1w2d\n
Last reset at 23:05:56\n
Reason: reload"""
    print(parser.parseString(firstText))
Declaring a parsing element as Optional lets the parser skip it when it is not present, without raising an error.
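The same idea can be pushed one step further so that the short output from the question ("Last reset" with no timestamp and no reason) also parses. The following is only a sketch under that assumption: the element names and the results names uptime, reset and reason are chosen for illustration, and the timestamp is wrapped in its own Optional.

```python
import pyparsing as pp

# Results names ("uptime", "reset", "reason") are illustrative choices.
uptime = pp.Suppress("Uptime is") + pp.Word(pp.alphanums)("uptime")
# Combine joins HH, ':', MM, ':', SS into a single token.
reset_time = pp.Suppress("at") + pp.Combine(
    pp.Word(pp.nums) + ":" + pp.Word(pp.nums) + ":" + pp.Word(pp.nums)
)("reset")
# Both the timestamp and the reason line may be absent.
reset = pp.Suppress("Last reset") + pp.Optional(reset_time)
reason = pp.Suppress("Reason:") + pp.Word(pp.alphas)("reason")
parser = uptime + reset + pp.Optional(reason)

full = parser.parseString("Uptime is 1w2d\nLast reset at 23:05:56\nReason: reload")
short = parser.parseString("Uptime is 1w2d\nLast reset")
print(full.dump())
print(short.dump())
```

With named results, the caller can test for a missing field with short.get("reset") instead of counting tokens.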
Was wondering whether anyone has a clever solution for fixing bad
insert statements in Python, exported by a not so clever program. It didn't add
two single quotes for strings with a single quote in the string. To
make it a bit easier all the values being inserted are strings.
So it has:
INSERT INTO addresses VALUES ('1','1','CUCKOO'S NEST','CUCKOO'S NEST STREET');
instead of:
INSERT INTO addresses VALUES ('1','1','CUCKOO''S NEST','CUCKOO''S NEST STREET');
Obviously there are multiple lines of this and I don't want to replace
the enclosing single quotes as well.
Was thinking of using split and join, but I'm not sure how to easily update the split values while looping in a loop. Sorry I'm a noob. Something like the below, where I'm not sure how to do #update bit
import sys
fileIN = open('a.sql', "r")
line = fileIN.readline()
while line:
    bits = line.split("','")
    for bit in bits:
        if bit.find("'") > -1:
            #update bit
    line_out = "','".join(bits)
    sys.stdout.write(line_out)
    line = fileIN.readline()
Thanks
Based on katrielalex's suggestion, how about this:
>>> import re
>>> s = "INSERT INTO addresses VALUES ('1','1','CUCKOO'S NEST','CUCKOO'S NEST STREET');"
>>> def repl(m):
...     if m.group(1) in ('(', ',') or m.group(2) in (',', ')'):
...         return m.group(0)
...     return m.group(1) + "''" + m.group(2)
...
>>> re.sub("(.)'(.)", repl, s)
"INSERT INTO addresses VALUES ('1','1','CUCKOO''S NEST','CUCKOO''S NEST STREET');"
and if you're into negative lookbehind assertions, this is the headache inducing pure regex version:
re.sub("((?<![(,])'(?![,)]))", "''", s)
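To make the lookaround version easier to try, here is a small self-contained sketch that wraps it in a function and runs it on the example statement. The names INNER_QUOTE and fix_quotes are made up for illustration; the pattern is the one shown above, without the redundant outer group.

```python
import re

# A quote counts as "inner" unless it has '(' or ',' directly before it
# or ',' or ')' directly after it.
INNER_QUOTE = re.compile(r"(?<![(,])'(?![,)])")

def fix_quotes(line):
    """Double every inner single quote in one INSERT statement."""
    return INNER_QUOTE.sub("''", line)

s = "INSERT INTO addresses VALUES ('1','1','CUCKOO'S NEST','CUCKOO'S NEST STREET');"
fixed = fix_quotes(s)
print(fixed)
```

Two caveats, both consequences of the pattern rather than of the wrapper: a value whose apostrophe sits directly next to a comma or parenthesis is left alone, and running the function twice doubles the already doubled quotes again.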
while line:
    # Restrict line2 to the text inside the parentheses
    line1, rest = line.split('(')
    line2, line3 = rest.split(')')
    # A bit cleaner
    new_bits = []
    for bit in line2.split(','):
        # Remove the enclosing ' characters
        bit = bit[1:-1]
        # Double the ones inside
        if "'" in bit:
            bit = bit.replace("'", "''")
        # Re-add the enclosing '
        new_bits.append("'" + bit + "'")
    sys.stdout.write(line1 + '(' + ','.join(new_bits) + ')' + line3)
    line = fileIN.readline()
Warning: This depends way too much on the formatting of the SQL statement. However, if your input is only ever going to have the format "statements (params) end" then this will work every time.
import sys
fileIN = open('a.sql', "r")
line = fileIN.readline()
while line:
    #split out the parameters (between the ()'s)
    start, temp = line.split("(")
    params, end = temp.split(")")
    #replace the "'"s in the parameters (without the start and end quote)
    newParams = "','".join([x.replace("'", "''") for x in params[1:-1].split("','")])
    #join the statement back together
    line_out = start + "('" + newParams + "')" + end
    #next line
    sys.stdout.write(line_out)
    line = fileIN.readline()
Explanation:
Split the string into 3 parts: The query start, the parameters, and the end.
The list comprehension takes the parameters (without the starting and ending 's), splits them on ',' (quote, comma, quote), and, for every element of the resulting list (the individual data entries), replaces each ' with ''.
The last line then joins the query start, the new params (with the parenthesis and quotes that were removed previously), and the end of the statement.
Another answer:
a = "INSERT INTO addresses VALUES ('1','1','CUCKOO'S NEST','CUCKOO'S NEST STREET');"
open_par = a.find("(")
close_par = a.find(")")
b = a[open_par+1:close_par]
c = b.split(",")
d = map(lambda x: '"' + x.strip().strip("'") + '"',c)
result = a[:open_par+1] + ",".join(d) + a[close_par:]
Went with:
import sys
import re
def repl(m):
    if m.group(1) in ('(', ',') or m.group(2) in (',', ')'):
        return m.group(0)
    return m.group(1) + "''" + m.group(2)
fileIN = open('a.sql', "r")
line = fileIN.readline()
while line:
    line_out = re.sub("(.)'(.)", repl, line)
    sys.stdout.write(line_out)
    # Next line.
    line = fileIN.readline()
The following code gives me the error 'no such attribute _ParseResults__tokdict' when run on an input with more than one line.
With single-line files, there is no error. If I comment out either the second or third line shown here, then I don't get that error either, no matter how long the file is.
for line in input:
    final = delimitedList(expr).parseString(line)
    notid = delimitedList(notid).parseString(line)
    dash_tags = ', '.join(format_tree(notid))
    print final.lineId + ": " + dash_tags
Does anyone know what's going on here?
EDIT: As suggested, I'm adding the complete code to allow others to reproduce the error.
from pyparsing import *
#first are the basic elements of the expression
#number at the beginning of the line, unique for each line
#top-level category for a sentiment
#semicolon should eventually become a line break
lineId = Word(nums)
topicString = Word(alphanums+'-'+' '+"'")
semicolon = Literal(';')
#call variable early to allow for recursion
#recursive function allowing for a line id at first, then the topic,
#then any subtopics, and so on. Finally, optional semicolon and repeat.
#set results name lineId.lineId here
expr = Forward()
expr << Optional(lineId.setResultsName("lineId")) + topicString.setResultsName("topicString") + \
Optional(nestedExpr(content=delimitedList(expr))).setResultsName("parenthetical") + \
Optional(Suppress(semicolon).setResultsName("semicolon") + expr.setResultsName("subsequentlines"))
notid = Suppress(lineId) + topicString + \
Optional(nestedExpr(content=delimitedList(expr))) + \
Optional(Suppress(semicolon) + expr)
#naming the parenthetical portion for independent reference later
parenthetical = nestedExpr(content=delimitedList(expr))
#open files for read and write
input = open('parserinput.txt')
output = open('parseroutput.txt', 'w')
#defining functions
#takes nested list output of parser grammar and translates it into
#strings suited for the final output
def format_tree(tree):
    prefix = ''
    for node in tree:
        if isinstance(node, basestring):
            prefix = node
            yield node
        else:
            for elt in format_tree(node):
                yield prefix + '_' + elt
#function for passing tokens from setResultsName
def id_number(tokens):
    #print tokens.dump()
    lineId = tokens
    lineId["lineId"] = lineId.lineId
def topic_string(tokens):
    topicString = tokens
    topicString["topicString"] = topicString.topicString
def parenthetical_fun(tokens):
    parenthetical = tokens
    parenthetical["parenthetical"] = parenthetical.parenthetical
#function for splitting line at semicolon and appending numberId
#not currently in use
def split_and_prepend(tokens):
    return '\n' + final.lineId
#setting parse actions
lineId.setParseAction(id_number)
topicString.setParseAction(topic_string)
parenthetical.setParseAction(parenthetical)
#reads each line in the input file
#calls the grammar expressed in 'expr' and uses it to read the line and assign names to the tokens for later use
#calls the 'notid' variant to easily return the other elements in the line aside from the lineId
#applies the format tree function and joins the tokens in a comma-separated string
#prints the lineId + the tokens from that line
for line in input:
    final = delimitedList(expr).parseString(line)
    notid = delimitedList(notid).parseString(line)
    dash_tags = ', '.join(format_tree(notid))
    print final.lineId + ": " + dash_tags
The input file is a txt document with the following two lines:
1768 dummy; data
1768 dummy data; price
Reassigning notid inside the loop breaks the second iteration. The third line of the loop overwrites the notid expression defined earlier in the code with the ParseResults returned by parseString, so delimitedList(notid) only works on the first iteration. Use a different name for the result of the notid parse.
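A stripped-down sketch of the fix, written for Python 3 and a much simpler grammar than the one in the question (both are assumptions made to keep the example short): the grammar lives in not_id, and each parse result goes into a separately named variable, so the grammar is still intact on the next iteration.

```python
import pyparsing as pp

# Simplified stand-in for the question's grammar: a numeric id followed by
# semicolon-separated topics.
line_id = pp.Word(pp.nums)
topic = pp.Word(pp.alphas)
not_id = pp.Suppress(line_id) + pp.delimitedList(topic, delim=";")

lines = ["1768 dummy; data", "1769 price; cost"]
results = []
for line in lines:
    # 'parsed' holds the ParseResults; 'not_id' keeps pointing at the grammar.
    parsed = not_id.parseString(line)
    results.append(parsed.asList())

print(results)
```

Writing not_id = not_id.parseString(line) in the loop instead would replace the grammar with a ParseResults object after the first line, which is the situation described in the question.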
I'm trying to remove multiple lines containing an obsolete code fragment from various files with the help of Python. I looked for some examples but could not really find what I was looking for. What I basically need is something that does in principle the following (contains non-Python syntax):
def cleanCode(filepath):
    """Clean out the obsolete or superfluous 'bar()' code."""
    with open(filepath, 'r') as foo_file:
        string = foo_file[index_of("bar("):]
        depth = 0
        for char in string:
            if char == "(": depth += 1
            if char == ")": depth -= 1
            if depth == 0: last_index = current_char_position
    with open(filepath,'w') as foo_file:
        foo_file.write(string)
The thing is that the construct I'm parsing for and want to replace could contain other nested statements that also need to be removed as part of the bar(...) removal.
Here is a sample of the code snippet to be cleaned:
annotation (
foo1(k=3),
bar(
x=0.29,
y=0,
bar1(
x=3, y=4),
width=0.71,
height=0.85),
foo2(System(...))
I would think that someone might have solved something similar before :)
Pyparsing has some built-ins for matching nested parenthetical text - in your case, you aren't really trying to extract the content of the parens, you just want the text between the outermost '(' and ')'.
from pyparsing import White, Keyword, nestedExpr, lineEnd, Suppress
insource = """
annotation (
foo1(k=3),
bar(
x=0.29,
y=0,
bar1(
x=3, y=4),
width=0.71,
height=0.85),
foo2(System(...))
"""
barRef = White(' \t') + Keyword('bar') + nestedExpr() + ',' + lineEnd
out = Suppress(barRef).transformString(insource)
print out
Prints
annotation (
foo1(k=3),
foo2(System(...))
EDIT: parse action to not strip bar() calls ending with '85':
from pyparsing import ParseException

barRef = White(' \t') + Keyword('bar') + nestedExpr()('barargs') + ','
def skipEndingIn85(tokens):
    if tokens.barargs[0][-1].endswith('85'):
        raise ParseException('ends with 85, skipping...')
barRef.setParseAction(skipEndingIn85)
try this :
clo=0
def remov(bar):
    global clo
    open_tag=strs.find('(',bar) # search for a '(' open tag
    close_tag=strs.find(')',bar)# search for a ')' close tag
    if open_tag > close_tag:
        clo=strs.find(')',close_tag+1)
    elif open_tag < close_tag and open_tag!=-1:
        remov(close_tag)
f=open('small.in')
strs="".join(f.readlines())
bar=strs.find('bar(')
remov(bar+4)
new_strs=strs[0:bar]+strs[clo+2:]
print(new_strs)
f.close()
output:
annotation (
foo1(k=3),
foo2(System(...))
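The depth-counting idea from the question's pseudocode can also be written out directly. The following is a minimal sketch rather than a replacement for the answers above: the function name remove_call is made up, it removes only the first occurrence, and it uses a plain substring search, so a name such as foobar( would also match when searching for bar(.

```python
def remove_call(source, name):
    """Remove the first name(...) call from source, nested parentheses included."""
    start = source.find(name + "(")
    if start == -1:
        return source  # nothing to remove
    depth = 0
    # Scan from the opening parenthesis until the depth returns to zero.
    for pos in range(start + len(name), len(source)):
        ch = source[pos]
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                end = pos + 1
                break
    else:
        return source  # unbalanced parentheses: leave the text unchanged
    rest = source[end:]
    # Drop one trailing comma and the whitespace that follows it.
    if rest.startswith(","):
        rest = rest[1:].lstrip()
    return source[:start] + rest

sample = "annotation (\nfoo1(k=3),\nbar(\nx=0.29,\nbar1(\nx=3, y=4),\nheight=0.85),\nfoo2(System(...))\n"
cleaned = remove_call(sample, "bar")
print(cleaned)
```

Counting depth per character avoids the need to guess which closing parenthesis belongs to which opening one, which is where approaches based on searching for the next ')' tend to go wrong with deeper nesting.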