I'm trying to remove multiple lines containing an obsolete code fragment from various files with the help of Python. I looked for some examples but could not really find what I was looking for. What I basically need is something that does, in principle, the following (contains non-Python syntax):
def cleanCode(filepath):
    """Clean out the obsolete or superfluous 'bar()' code."""
    with open(filepath, 'r') as foo_file:
        string = foo_file[index_of("bar("):]
        depth = 0
        for char in string:
            if char == "(": depth += 1
            if char == ")": depth -= 1
            if depth == 0: last_index = current_char_position
    with open(filepath,'w') as foo_file:
        foo_file.write(string)
The thing is that the construct I'm parsing for and want to replace could contain other nested statements that also need to be removed as part of the bar(...) removal.
Here is a sample of the code to be cleaned:
annotation (
foo1(k=3),
bar(
x=0.29,
y=0,
bar1(
x=3, y=4),
width=0.71,
height=0.85),
foo2(System(...))
I would think that someone might have solved something similar before :)
Pyparsing has some built-ins for matching nested parenthetical text - in your case, you aren't really trying to extract the content of the parens, you just want the text between the outermost '(' and ')'.
from pyparsing import White, Keyword, nestedExpr, lineEnd, Suppress
insource = """
annotation (
foo1(k=3),
bar(
x=0.29,
y=0,
bar1(
x=3, y=4),
width=0.71,
height=0.85),
foo2(System(...))
"""
barRef = White(' \t') + Keyword('bar') + nestedExpr() + ',' + lineEnd
out = Suppress(barRef).transformString(insource)
print(out)
Prints
annotation (
foo1(k=3),
foo2(System(...))
EDIT: parse action to not strip bar() calls ending with '85':
from pyparsing import ParseException
barRef = White(' \t') + Keyword('bar') + nestedExpr()('barargs') + ','
def skipEndingIn85(tokens):
    # raising ParseException makes this match fail, so the text is kept
    if tokens.barargs[0][-1].endswith('85'):
        raise ParseException('ends with 85, skipping...')
barRef.setParseAction(skipEndingIn85)
Try this:
clo=0
def remov(bar):
    global clo
    open_tag=strs.find('(',bar) # search for a '(' open tag
    close_tag=strs.find(')',bar)# search for a ')' close tag
    if open_tag > close_tag:
        clo=strs.find(')',close_tag+1)
    elif open_tag < close_tag and open_tag!=-1:
        remov(close_tag)
f=open('small.in')
strs="".join(f.readlines())
bar=strs.find('bar(')
remov(bar+4)
new_strs=strs[0:bar]+strs[clo+2:]
print(new_strs)
f.close()
output:
annotation (
foo1(k=3),
foo2(System(...))
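The depth-counting idea from the question's pseudocode can also be sketched in plain Python, without pyparsing. This is a minimal sketch rather than a parser: the function name remove_call and its name parameter are made up for illustration, only the first match is removed, and it assumes that parentheses are balanced and never appear inside string literals or comments. A plain find() would also match a longer identifier such as foobar(, so real input may need a stricter search.

```python
def remove_call(source, name="bar"):
    """Remove the first `name(...)` call, including any nested parentheses."""
    start = source.find(name + "(")
    if start == -1:
        return source  # nothing to remove

    depth = 0
    end = None
    # Start scanning at the opening parenthesis that follows the name.
    for pos in range(start + len(name), len(source)):
        ch = source[pos]
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                end = pos + 1  # index just past the matching ")"
                break
    if end is None:
        return source  # unbalanced parentheses: leave the text untouched

    # Also swallow the trailing comma and whitespace up to the next item.
    while end < len(source) and source[end] in ", \t\r\n":
        end += 1
    return source[:start] + source[end:]
```

Reading the file into a string, calling remove_call on it and writing the result back would complete the cleanCode() outline from the question.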
Related
I'm trying to use pyparsing to parse key:value pairs from the comments in a document. A key starts at the beginning of a line, and a value follows. Values may be continued on multiple lines that begin with whitespace.
import pyparsing as pp
instring = """
-- This is (a) #%^& comment
/*
name1: val
name2: val2 with $*&##) junk
name3: val3: with #)(*% multi-
line: content
*/
"""
comment1 = pp.Literal("--") + pp.originalTextFor(pp.SkipTo(pp.LineEnd())).setDebug()
identifier = pp.Word(pp.alphanums + "_").setDebug()
meta1 = pp.LineStart() + identifier + pp.Literal(":") + pp.SkipTo(pp.LineEnd())
meta2 = pp.LineStart() + pp.White() + pp.SkipTo(pp.LineEnd())
metaval = meta1 + pp.ZeroOrMore(meta2)
metalist = pp.ZeroOrMore(comment1) + pp.Literal("/*") + pp.OneOrMore(metaval) + pp.Literal("*/")
if __name__ == "__main__":
    p = metalist.parseString(instring)
    print(p)
Fails with:
Matched {Empty SkipTo:(LineEnd) Empty} -> ['This is (a) #%^& comment']
File "C:\Users\user\py3\lib\site-packages\pyparsing.py", line 2305, in parseImpl
raise ParseException(instring, loc, self.errmsg, self)
pyparsing.ParseException: Expected start of line (at char 32), (line:4, col:1)
The answer to pyparsing whitespace match issues says
LineStart has always been difficult to work with, but ...
If the parser is at line 4 column 1 (the first key:value pair), then why is it not finding a start of line? What is the correct pyparsing syntax to recognize lines beginning with no whitespace and lines beginning with whitespace?
I think the confusion I have with LineStart is that, for LineEnd, I can look for a '\n' character, but there is no separate character for LineStart. So in LineStart I look to see if the current parser location is positioned just after a '\n'; or if it is currently on a '\n', move past it and still continue. Unfortunately, I implemented this in a place that messes up the reporting location, so you get those weird errors that read like "failed to find a start of line on line X col 1," which really does sound like it should be a successfully matched start of a line. Also, I think I need to revisit this implicit newline-skipping, or for that matter, all whitespace-skipping in general for LineStart.
For now, I've gotten your code to work by expanding your line-starting expression slightly, as:
LS = pp.Optional(pp.LineEnd()) + pp.LineStart()
and replaced the LineStart references in meta1 and meta2 with LS:
comment1 = pp.Literal("--") + pp.originalTextFor(pp.SkipTo(pp.LineEnd())).setDebug()
identifier = pp.Word(pp.alphanums + "_").setDebug()
meta1 = LS + identifier + pp.Literal(":") + pp.SkipTo(pp.LineEnd())
meta2 = LS + pp.White() + pp.SkipTo(pp.LineEnd())
metaval = meta1 + pp.ZeroOrMore(meta2)
metalist = pp.ZeroOrMore(comment1) + pp.Literal("/*") + pp.OneOrMore(metaval) + pp.Literal("*/")
If this situation with LineStart leaves you uncomfortable, here is another tactic you can try: using a parse-time condition to only accept identifiers that start in column 1:
comment1 = pp.Literal("--") + pp.originalTextFor(pp.SkipTo(pp.LineEnd())).setDebug()
identifier = pp.Word(pp.alphanums + "_").setName("identifier")
identifier.addCondition(lambda instring,loc,toks: pp.col(loc,instring) == 1)
meta1 = identifier + pp.Literal(":") + pp.SkipTo(pp.LineEnd()).setDebug()
meta2 = pp.White().setDebug() + pp.SkipTo(pp.LineEnd()).setDebug()
metaval = meta1 + pp.ZeroOrMore(meta2, stopOn=pp.Literal('*/'))
metalist = pp.ZeroOrMore(comment1) + pp.Literal("/*") + pp.LineEnd() + pp.OneOrMore(metaval) + pp.Literal("*/")
This code does away with LineStart completely, while I figure out just what I want this particular token to do. I also had to modify the ZeroOrMore repetition in metaval so that */ would not be accidentally processed as continued comment content.
Thanks for your patience with this - I am not keen to quickly put out a patched LineStart change and then find that I have overlooked other compatibility or other edge cases that just put me back in the current less-than-great state on this class. But I'll put some effort into clarifying this behavior before putting out 2.1.10.
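For comparison, the rule from the question (a key starts in column 1, a continuation line starts with whitespace) can be sketched without pyparsing by walking the lines directly. This is a swapped-in technique, not a pyparsing feature: the function name parse_meta is hypothetical, and the sketch assumes that /* and */ each sit alone on their own line, as in the sample input.

```python
def parse_meta(text):
    """Collect (key, value) pairs from the lines between '/*' and '*/'."""
    pairs = []
    inside = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "/*":
            inside = True
            continue
        if stripped == "*/":
            inside = False
            continue
        if not inside or not stripped:
            continue  # outside the comment block, or a blank line
        if line[0].isspace():
            # Continuation line: append it to the previous value.
            if pairs:
                key, value = pairs[-1]
                pairs[-1] = (key, value + " " + stripped)
        else:
            # Key line: split on the first ':' only, so later colons stay in the value.
            key, sep, value = line.partition(":")
            if sep:
                pairs.append((key.strip(), value.strip()))
    return pairs
```

Because only the first colon on a key line is treated as the separator, a value such as "val3: with ..." keeps its own colons.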
Was wondering whether anyone has a clever solution for fixing bad
insert statements in Python, exported by a not so clever program. It didn't add
two single quotes for strings with a single quote in the string. To
make it a bit easier all the values being inserted are strings.
So it has:
INSERT INTO addresses VALUES ('1','1','CUCKOO'S NEST','CUCKOO'S NEST STREET');
instead of:
INSERT INTO addresses VALUES ('1','1','CUCKOO''S NEST','CUCKOO''S NEST STREET');
Obviously there are multiple lines of this and I don't want to replace
the enclosing single quotes as well.
I was thinking of using split and join, but I'm not sure how to update the split values while looping over them. Sorry, I'm a noob. Something like the below, where I'm not sure how to do the #update bit:
import sys
fileIN = open('a.sql', "r")
line = fileIN.readline()
while line:
    bits = line.split("','")
    for bit in bits:
        if bit.find("'") > -1:
            #update bit
    line_out = "','".join(bits)
    sys.stdout.write(line_out)
    line = fileIN.readline()
Thanks
Based on katrielalex's suggestion, how about this:
>>> import re
>>> s = "INSERT INTO addresses VALUES ('1','1','CUCKOO'S NEST','CUCKOO'S NEST STREET');"
>>> def repl(m):
...     if m.group(1) in ('(', ',') or m.group(2) in (',', ')'):
...         return m.group(0)
...     return m.group(1) + "''" + m.group(2)
...
>>> re.sub("(.)'(.)", repl, s)
"INSERT INTO addresses VALUES ('1','1','CUCKOO''S NEST','CUCKOO''S NEST STREET');"
and if you're into negative lookbehind assertions, this is the headache inducing pure regex version:
re.sub("((?<![(,])'(?![,)]))", "''", s)
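As a small self-contained sketch, the lookaround pattern can be compiled once and wrapped in a helper. The names INNER_QUOTE and escape_inner_quotes are made up for illustration. The pattern only looks at the characters next to each quote, so a quote that sits directly beside a comma or parenthesis inside a value would be left alone.

```python
import re

# A single quote that is not preceded by "(" or "," and not followed by "," or ")".
INNER_QUOTE = re.compile(r"(?<![(,])'(?![,)])")


def escape_inner_quotes(line):
    """Double every single quote that is not a delimiter of a value."""
    return INNER_QUOTE.sub("''", line)


fixed = escape_inner_quotes(
    "INSERT INTO addresses VALUES ('1','1','CUCKOO'S NEST','CUCKOO'S NEST STREET');"
)
print(fixed)
```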
while line:
    # Restrict line2 to the text inside the parentheses
    line1, rest = line.split('(')
    line2, line3 = rest.split(')')
    # A bit cleaner
    new_bits = []
    for bit in line2.split(','):
        # Remove border ' characters
        bit = bit[1:-1]
        # Duplicate the ones inside
        if "'" in bit:
            bit = bit.replace("'", "''")
        # Re-add border '
        new_bits.append("'" + bit + "'")
    sys.stdout.write(line1 + '(' + ','.join(new_bits) + ')' + line3)
    line = fileIN.readline()
Warning: This depends way too much on the formatting of the SQL statement. However, if your input is only ever going to have the format "statements (params) end" then this will work every time.
import sys
fileIN = open('a.sql', "r")
line = fileIN.readline()
while line:
    #split out the parameters (between the ()'s)
    start, temp = line.split("(")
    params, end = temp.split(")")
    #replace the "'"s in the parameters (without the start and end quote)
    newParams = "','".join([x.replace("'", "''") for x in params[1:-1].split("','")])
    #join the statement back together
    line_out = start + "('" + newParams + "')" + end
    #next line
    sys.stdout.write(line_out)
    line = fileIN.readline()
Explanation:
Split the string into 3 parts: The query start, the parameters, and the end.
The list comprehension takes the parameters (without the starting/ending 's), splits them on ',' (quote, comma, quote), and, for every element in the resulting list (the individual data entries), replaces each ' with ''.
The last line then joins the query start, the new params (with the parenthesis and quotes that were removed previously), and the end of the statement.
Another answer:
a = "INSERT INTO addresses VALUES ('1','1','CUCKOO'S NEST','CUCKOO'S NEST STREET');"
open_par = a.find("(")
close_par = a.find(")")
b = a[open_par+1:close_par]
c = b.split(",")
d = map(lambda x: '"' + x.strip().strip("'") + '"',c)
result = a[:open_par+1] + ",".join(d) + a[close_par:]
Went with:
import sys
import re
def repl(m):
    if m.group(1) in ('(', ',') or m.group(2) in (',', ')'):
        return m.group(0)
    return m.group(1) + "''" + m.group(2)
fileIN = open('a.sql', "r")
line = fileIN.readline()
while line:
    line_out = re.sub("(.)'(.)", repl, line)
    sys.stdout.write(line_out)
    # Next line.
    line = fileIN.readline()
In Python I have a large multiline text.
I need to get the text between {{book and }}.
I tried using a regular expression;
the problem is that the text in between spans multiple lines.
I tried {{book (.+) but it gives me only the text on the first line.
Then I tried {{book (.+) }} and this gives an error.
re.search("{{book .*?}", pagetext).group()
I have tried various expressions... the problem is how do I get the regular expression to continue onto the next line?
lot of other text {{book series |name = Twilight |image = [[File:The twilight saga hardback.jpg|260px|]] |language = English<!-- Do not link, per WP: OVERLINK --> |genre = [[Romance (novel)|Romance]], [[fantasy literature|fantasy]], [[young-adult fiction]] |publisher = [[Little, Brown and Company]] |pub_date = 2005–2008 |media_type = Print }} <lot of other text >
You need to use the re.DOTALL flag to allow . to capture newline characters. Also, you should escape the braces because they are special characters in Python regex syntax.
re.search(r"\{\{book .*?\}\}", pagetext, re.DOTALL)
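A minimal sketch of the difference the flag makes, using a shortened, made-up pagetext and a capture group to pull out just the inner text:

```python
import re

# Shortened sample text (an assumption for illustration) that spans several lines.
pagetext = "intro {{book series\n|name = Twilight\n|language = English\n}} outro"

pattern = r"\{\{book (.*?)\}\}"

# Without re.DOTALL, "." stops at the first newline, so nothing matches.
without_flag = re.search(pattern, pagetext)

# With re.DOTALL, "." also matches newlines and the lazy ".*?" stops at the first "}}".
with_flag = re.search(pattern, pagetext, re.DOTALL)
body = with_flag.group(1) if with_flag else None

print(without_flag)
print(repr(body))
```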
If there could be nested {{expr}} then regular expressions are not enough e.g.:
pagetext = "start {{book with {{n{{e}}st{{e}}d t{{e}}xt}} t{{e}}xt}} {{e}}nd"
#XXX doesn't work: the text is truncated
print("Wrong: %r" % re.search(r"\{\{book .*?\}\}", pagetext, re.DOTALL).group())
# -> Wrong: '{{book with {{n{{e}}'
Adapting my answer from the
get first paragraph from wikipedia article question:
# extract everything from the first "{{book " to matching "}}"
prefix, sep, rest = pagetext.partition("{{book ")
if sep: # found the first "{{"
    depth = 1
    prevc = None
    for i, c in enumerate(rest):
        if c == "{" and prevc == c: # found "{{"
            depth += 1
            prevc = None # match "{{{ " only once
        elif c == "}" and prevc == c: # found "}}"
            depth -= 1
            if depth == 0: # found matching "}}"
                pagetext = sep + rest[:i+1] # include "}}"
                break
            prevc = None # match "}}} " only once
        else:
            prevc = c
print(pagetext)
Output
{{book with {{n{{e}}st{{e}}d t{{e}}xt}} t{{e}}xt}}
I need to do a RegEx search and replace of all commas found inside of quote blocks.
i.e.
"thing1,blah","thing2,blah","thing3,blah",thing4
needs to become
"thing1\,blah","thing2\,blah","thing3\,blah",thing4
my code:
inFile = open(inFileName,'r')
inFileRl = inFile.readlines()
inFile.close()
p = re.compile(r'["]([^"]*)["]')
for line in inFileRl:
    pg = p.search(line)
    # found comment block
    if pg:
        q = re.compile(r'[^\\],')
        # found comma within comment block
        qg = q.search(pg.group(0))
        if qg:
            # Here I want to reconstitute the line and print it with the replaced text
            #print re.sub(r'([^\\])\,',r'\1\,',pg.group(0))
I need to filter only the columns I want based on a RegEx, filter further,
then do the RegEx replace, then reconstitute the line back.
How can I do this in Python?
The csv module is perfect for parsing data like this as csv.reader in the default dialect ignores quoted commas. csv.writer reinserts the quotes due to the presence of commas. I used StringIO to give a file like interface to a string.
import csv
import io  # on Python 2 this was the separate StringIO module
s = '''"thing1,blah","thing2,blah","thing3,blah"
"thing4,blah","thing5,blah","thing6,blah"'''
source = io.StringIO(s)
dest = io.StringIO()
rdr = csv.reader(source)
wtr = csv.writer(dest)
for row in rdr:
    # undo any existing escapes first so commas are never escaped twice
    wtr.writerow([item.replace('\\,',',').replace(',','\\,') for item in row])
print(dest.getvalue())
result:
"thing1\,blah","thing2\,blah","thing3\,blah"
"thing4\,blah","thing5\,blah","thing6\,blah"
General Edit
There was
"thing1\\,blah","thing2\\,blah","thing3\\,blah",thing4
in the question, and now it is not there anymore.
Moreover, I hadn't noticed r'[^\\],'.
So I have completely rewritten my answer.
"thing1,blah","thing2,blah","thing3,blah",thing4
and
"thing1\,blah","thing2\,blah","thing3\,blah",thing4
being displays of strings (I suppose)
import re
ss = '"thing1,blah","thing2,blah","thing3\\,blah",thing4 '
regx = re.compile('"[^"]*"')
def repl(mat, ri = re.compile('(?<!\\\\),') ):
    # replace each comma that is not already escaped with backslash + comma
    return ri.sub('\\\\,',mat.group())
print(ss)
print(repr(ss))
print()
print(regx.sub(repl, ss))
print(repr(regx.sub(repl, ss)))
result
"thing1,blah","thing2,blah","thing3\,blah",thing4
'"thing1,blah","thing2,blah","thing3\\,blah",thing4 '

"thing1\,blah","thing2\,blah","thing3\,blah",thing4
'"thing1\\,blah","thing2\\,blah","thing3\\,blah",thing4 '
You can try this regex.
>>> re.sub('(?<!"),(?!")', r"\\,",
'"thing1,blah","thing2,blah","thing3,blah",thing4')
#Gives "thing1\,blah","thing2\,blah","thing3\,blah",thing4
The logic behind this is to substitute a , with \, only if it is neither immediately preceded nor immediately followed by a ". Note that a comma between two unquoted fields would also match.
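To make that limitation concrete, here is a small sketch that puts the lookaround pattern next to a variant that first isolates each quoted block and only escapes commas inside it. The helper names escape_naive and escape_in_quotes are made up for illustration, and the second helper assumes that quoted fields never contain a double quote themselves.

```python
import re


def escape_naive(line):
    """Escape every comma that has no double quote directly beside it."""
    return re.sub(r'(?<!"),(?!")', r"\\,", line)


def escape_in_quotes(line):
    """Escape unescaped commas, but only inside "..." blocks."""
    def fix_block(match):
        # (?<!\\), matches a comma that is not already preceded by a backslash.
        return re.sub(r"(?<!\\),", r"\\,", match.group())
    return re.sub(r'"[^"]*"', fix_block, line)


sample = '"thing1,blah","thing2,blah","thing3,blah",thing4'
print(escape_naive(sample))
print(escape_in_quotes(sample))

# The two differ once unquoted fields sit next to each other.
print(escape_naive('a,b,"c,d"'))
print(escape_in_quotes('a,b,"c,d"'))
```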
I came up with an iterative solution using several regex functions:
finditer(), findall(), group(), start() and end()
There's a way to turn all this into a recursive function that calls itself.
Any takers?
outfile = open(outfileName,'w')
p = re.compile(r'["]([^"]*)["]')
q = re.compile(r'([^\\])(,)')
for line in outfileRl:
    pg = p.finditer(line)
    pglen = len(p.findall(line))
    if pglen > 0:
        mpgstart = 0;
        mpgend = 0;
        for i,mpg in enumerate(pg):
            if i == 0:
                outfile.write(line[:mpg.start()])
            qg = q.finditer(mpg.group(0))
            qglen = len(q.findall(mpg.group(0)))
            if i > 0 and i < pglen:
                outfile.write(line[mpgend:mpg.start()])
            if qglen > 0:
                for j,mqg in enumerate(qg):
                    if j == 0:
                        outfile.write( mpg.group(0)[:mqg.start()] )
                    outfile.write( re.sub(r'([^\\])(,)',r'\1\\\2',mqg.group(0)) )
                    if j == (qglen-1):
                        outfile.write( mpg.group(0)[mqg.end():] )
            else:
                outfile.write(mpg.group(0))
            if i == (pglen-1):
                outfile.write(line[mpg.end():])
            mpgstart = mpg.start()
            mpgend = mpg.end()
    else:
        outfile.write(line)
outfile.close()
have you looked into str.replace()?
str.replace(old, new[, count])
Return a copy of the string with all occurrences of substring old
replaced by new. If the optional argument count is given, only the
first count occurrences are replaced.
here is some documentation
hope this helps
I'm trying to test for a \t or a space character and I can't understand why this bit of code won't work. What I am doing is reading in a file, counting the lines of code (loc) for the file, and then recording the names of each function present within the file along with their individual lines of code. The bit of code below is where I attempt to count the loc for the functions.
import re
...
    else:
        loc += 1
        for line in infile:
            line_t = line.lstrip()
            if len(line_t) > 0 \
            and not line_t.startswith('#') \
            and not line_t.startswith('"""'):
                if not line.startswith('\s'):
                    print ('line = ' + repr(line))
                    loc += 1
                    return (loc, name)
                else:
                    loc += 1
            elif line_t.startswith('"""'):
                while True:
                    if line_t.rstrip().endswith('"""'):
                        break
                    line_t = infile.readline().rstrip()
        return(loc,name)
Output:
Enter the file name: test.txt
line = '\tloc = 0\n'
There were 19 lines of code in "test.txt"
Function names:
count_loc -- 2 lines of code
As you can see, my test print for the line shows a \t, but the if statement explicitly says (or so I thought) that it should only execute with no whitespace characters present.
Here is my full test file I have been using:
def count_loc(infile):
    """ Receives a file and then returns the amount
    of actual lines of code by not counting commented
    or blank lines """
    loc = 0
    for line in infile:
        line = line.strip()
        if len(line) > 0 \
        and not line.startswith('//') \
        and not line.startswith('/*'):
            loc += 1
            func_loc, func_name = checkForFunction(line);
        elif line.startswith('/*'):
            while True:
                if line.endswith('*/'):
                    break
                line = infile.readline().rstrip()
    return loc
if __name__ == "__main__":
    print ("Hi")
Function LOC = 15
File LOC = 19
\s is only whitespace to the re package when doing pattern matching.
For startswith, an ordinary method of ordinary strings, \s is nothing special. Not a pattern, just characters.
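A short sketch of two ways to make the intended test, with made-up helper names: str.startswith accepts a tuple of literal prefixes, while the re module is what actually understands \s.

```python
import re


def starts_with_whitespace(line):
    """Plain string test: startswith accepts a tuple of literal prefixes."""
    return line.startswith((" ", "\t"))


def starts_with_whitespace_re(line):
    """Regex test: \\s is a whitespace class only inside a pattern."""
    return re.match(r"\s", line) is not None


line = "\tloc = 0\n"
print(starts_with_whitespace(line))     # tab is one of the listed prefixes
print(starts_with_whitespace_re(line))  # \s matches the tab
print(line.startswith("\\s"))           # looks for a literal backslash followed by "s"
```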
Your question has already been answered and this is slightly off-topic, but...
If you want to parse code, it is often easier and less error-prone to use a parser. If your code is Python code, Python comes with a couple of parsers (tokenize, ast, parser). For other languages, you can find a lot of parsers on the internet. ANTLR is a well-known one with Python bindings.
As an example, the following couple of lines of code print all lines of a Python module that are not comments and not doc-strings:
import tokenize
ignored_tokens = [tokenize.NEWLINE,tokenize.COMMENT,tokenize.N_TOKENS
                 ,tokenize.STRING,tokenize.ENDMARKER,tokenize.INDENT
                 ,tokenize.DEDENT,tokenize.NL]
with open('test.py', 'r') as f:
    g = tokenize.generate_tokens(f.readline)
    line_num = 0
    for a_token in g:
        if a_token[2][0] != line_num and a_token[0] not in ignored_tokens:
            line_num = a_token[2][0]
            print(a_token)
As a_token above is already parsed, you can easily check for function definition, too. You can also keep track where the function ends by looking at the current column start a_token[2][1]. If you want to do more complex things, you should use ast.
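As a rough sketch of the ast route, the following collects each function's name and the number of physical lines it spans. The function name function_spans is made up for illustration, it relies on the end_lineno attribute (available from Python 3.8), and the span includes blank lines, comments and docstrings inside the function, so it is not the same count as the filtered loc in the question.

```python
import ast


def function_spans(source):
    """Map each function name to the number of physical lines it spans."""
    tree = ast.parse(source)
    spans = {}
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # lineno is the "def" line, end_lineno the last line of the body.
            spans[node.name] = node.end_lineno - node.lineno + 1
    return spans


example = "def f(x):\n    y = x + 1\n    return y\n\nprint(f(1))\n"
print(function_spans(example))
```

Combining these spans with the tokenize filter above would be one way to count only the non-comment lines inside each function.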
Your string literals aren't what you think they are.
You can specify a space or TAB like so:
space = ' '
tab = '\t'