Python RegEx nested search and replace

I need to do a RegEx search and replace of all commas found inside of quote blocks.
i.e.
"thing1,blah","thing2,blah","thing3,blah",thing4
needs to become
"thing1\,blah","thing2\,blah","thing3\,blah",thing4
my code:
inFile = open(inFileName,'r')
inFileRl = inFile.readlines()
inFile.close()
p = re.compile(r'["]([^"]*)["]')
for line in inFileRl:
    pg = p.search(line)
    # found comment block
    if pg:
        q = re.compile(r'[^\\],')
        # found comma within comment block
        qg = q.search(pg.group(0))
        if qg:
            # Here I want to reconstitute the line and print it with the replaced text
            #print re.sub(r'([^\\])\,',r'\1\,',pg.group(0))
I need to filter only the columns I want based on a RegEx, filter further,
then do the RegEx replace, then reconstitute the line back.
How can I do this in Python?

The csv module is perfect for parsing data like this as csv.reader in the default dialect ignores quoted commas. csv.writer reinserts the quotes due to the presence of commas. I used StringIO to give a file like interface to a string.
import csv
import StringIO
s = '''"thing1,blah","thing2,blah","thing3,blah"
"thing4,blah","thing5,blah","thing6,blah"'''
source = StringIO.StringIO(s)
dest = StringIO.StringIO()
rdr = csv.reader(source)
wtr = csv.writer(dest)
for row in rdr:
    wtr.writerow([item.replace('\\,',',').replace(',','\\,') for item in row])
print dest.getvalue()
result:
"thing1\,blah","thing2\,blah","thing3\,blah"
"thing4\,blah","thing5\,blah","thing6\,blah"

General Edit
There was
"thing1\\,blah","thing2\\,blah","thing3\\,blah",thing4
in the question, and now it is not there anymore.
Moreover, I hadn't noticed r'[^\\],'.
So I have completely rewritten my answer.
"thing1,blah","thing2,blah","thing3,blah",thing4
and
"thing1\,blah","thing2\,blah","thing3\,blah",thing4
being displays of strings (I suppose)
import re
ss = '"thing1,blah","thing2,blah","thing3\\,blah",thing4 '
regx = re.compile('"[^"]*"')
def repl(mat, ri = re.compile('(?<!\\\\),') ):
    # replace each comma that is not already escaped with backslash + comma
    return ri.sub('\\\\,',mat.group())
print ss
print repr(ss)
print
print regx.sub(repl, ss)
print repr(regx.sub(repl, ss))
result
"thing1,blah","thing2,blah","thing3\,blah",thing4
'"thing1,blah","thing2,blah","thing3\\,blah",thing4 '
"thing1\,blah","thing2\,blah","thing3\,blah",thing4
'"thing1\\,blah","thing2\\,blah","thing3\\,blah",thing4 '
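The same nested substitution can be sketched for Python 3 as follows. This is an illustration rather than the answer's exact code: the names QUOTED, UNESCAPED_COMMA and escape_quoted_commas are made up here, and it assumes a quoted block never contains an embedded double quote.

```python
import re

QUOTED = re.compile(r'"[^"]*"')            # one double-quoted block
UNESCAPED_COMMA = re.compile(r'(?<!\\),')  # a comma not already preceded by a backslash


def escape_quoted_commas(line):
    # The outer sub() finds each quoted block; the inner sub() runs only on that block.
    return QUOTED.sub(lambda m: UNESCAPED_COMMA.sub(r'\\,', m.group()), line)


line = r'"thing1,blah","thing2,blah","thing3\,blah",thing4,thing5'
print(escape_quoted_commas(line))
```

Commas outside the quotes (between thing4 and thing5 here) are left alone, because the inner pattern never sees them.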

You can try this regex.
>>> re.sub('(?<!"),(?!")', r"\\,",
'"thing1,blah","thing2,blah","thing3,blah",thing4')
#Gives "thing1\,blah","thing2\,blah","thing3\,blah",thing4
The logic behind this is to substitute a , with \, only if it is neither immediately preceded nor immediately followed by a ". Note that a comma between two unquoted fields would be escaped as well.

I came up with an iterative solution using several regex functions:
finditer(), findall(), group(), start() and end()
There's a way to turn all this into a recursive function that calls itself.
Any takers?
outfile = open(outfileName,'w')
p = re.compile(r'["]([^"]*)["]')
q = re.compile(r'([^\\])(,)')
for line in outfileRl:
    pg = p.finditer(line)
    pglen = len(p.findall(line))
    if pglen > 0:
        mpgstart = 0
        mpgend = 0
        for i,mpg in enumerate(pg):
            if i == 0:
                outfile.write(line[:mpg.start()])
            qg = q.finditer(mpg.group(0))
            qglen = len(q.findall(mpg.group(0)))
            if i > 0 and i < pglen:
                outfile.write(line[mpgend:mpg.start()])
            if qglen > 0:
                for j,mqg in enumerate(qg):
                    if j == 0:
                        outfile.write( mpg.group(0)[:mqg.start()] )
                    outfile.write( re.sub(r'([^\\])(,)',r'\1\\\2',mqg.group(0)) )
                    if j == (qglen-1):
                        outfile.write( mpg.group(0)[mqg.end():] )
            else:
                outfile.write(mpg.group(0))
            if i == (pglen-1):
                outfile.write(line[mpg.end():])
            mpgstart = mpg.start()
            mpgend = mpg.end()
    else:
        outfile.write(line)
outfile.close()

Have you looked into str.replace()?
str.replace(old, new[, count])
Return a copy of the string with all occurrences of substring old
replaced by new. If the optional argument count is given, only the
first count occurrences are replaced.
here is some documentation
hope this helps

Related

Python - How to make sure that a line being read from a file contain only a given string and nothing else

In order to make sure I start and stop reading a text file exactly where I want to, I am providing 'start1'<->'end1', 'start2'<->'end2' as tags in between the text file and providing that to my python script. In my script I read it as:
start_end = ['start1','end1']
line_num = []
with open(file_path) as fp1:
    for num, line in enumerate(fp1, 1):
        for i in start_end:
            if i in line:
                line_num.append(num)
fp1.close()
print '\nLine number: ', line_num
fp2 = open(file_path)
for k, line2 in enumerate(fp2):
    for x in range(line_num[0], line_num[1] - 1):
        if k == x:
            header.append(line2)
fp2.close()
This works well until I reach start10 <-> end10 and further. Eg. it checks if I have "start2" in the line and also reads the text that has "start21" and similarly for end tag as well. so providing "start1, end1" as input also reads "start10, end10". If I replace the line:
if i in line:
with
if i == line:
it throws an error.
How can I make sure that the script reads the line that contains ONLY "start1" and not "start10"?

import re
prog = re.compile('start1$')
if prog.match(line):
    print line
That should return None if there is no match and return a regex match object if the line matches the compiled regex. The '$' at the end of the regex says that's the end of the line, so 'start1' works but 'start10' doesn't.
or another way..
def test(line):
    import re
    prog = re.compile('start1$')
    return prog.match(line) is not None
> test('start1')
True
> test('start10')
False
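If a regular expression feels heavy for this, comparing the stripped line with the tag may be enough. The Python 3 sketch below assumes each tag sits alone on its own line; find_tag_lines is a hypothetical helper and the sample text is made up.

```python
def find_tag_lines(lines, tags):
    """Return 1-based numbers of lines that consist of exactly one of the tags."""
    wanted = set(tags)
    # strip() drops the trailing newline that lines read from a file still carry,
    # which a plain `i == line` comparison would trip over.
    return [num for num, line in enumerate(lines, 1) if line.strip() in wanted]


text = "start1\nalpha\nend1\nstart10\nbeta\nend10\n"
line_num = find_tag_lines(text.splitlines(keepends=True), ["start1", "end1"])
print(line_num)
```

Because the whole stripped line is compared, 'start10' and 'end10' are not picked up when looking for 'start1' and 'end1'.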

Since your markers are always at the end of the line, change:
start_end = ['start1','end1']
to:
start_end = ['start1\n','end1\n']

You probably want to look into regular expressions. The Python re library has some good regex tools. It would let you define a string to compare your line to and it has the ability to check for start and end of lines.

If you can control the input file, consider adding an underscore (or any non-number character) to the end of each tag.
'start1_'<->'end1_'
'start10_'<->'end10_'
The regular expression solution presented in other answers is more elegant, but requires using regular expressions.

You can do this with find():
for num, line in enumerate(fp1, 1):
    for i in start_end:
        if i in line:
            # make sure the tag isn't followed by another digit (e.g. 'start10')
            end = line.find(i) + len(i)
            if not line[end:end + 1].isdigit():
                line_num.append(num)

Replace single quotes with double quotes in python, for use with insert into database

Was wondering whether anyone has a clever solution for fixing bad
insert statements in Python, exported by a not so clever program. It didn't add
two single quotes for strings with a single quote in the string. To
make it a bit easier all the values being inserted are strings.
So it has:
INSERT INTO addresses VALUES ('1','1','CUCKOO'S NEST','CUCKOO'S NEST STREET');
instead of:
INSERT INTO addresses VALUES ('1','1','CUCKOO''S NEST','CUCKOO''S NEST STREET');
Obviously there are multiple lines of this and I don't want to replace
the enclosing single quotes as well.
Was thinking of using split and join, but I'm not sure how to easily update the split values while looping in a loop. Sorry I'm a noob. Something like the below, where I'm not sure how to do #update bit
import sys
fileIN = open('a.sql', "r")
line = fileIN.readline()
while line:
    bits = line.split("','")
    for bit in bits:
        if bit.find("'") > -1:
            #update bit
    line_out = "','".join(bits)
    sys.stdout.write(line_out)
    line = fileIN.readline()
Thanks

Based on katrielalex's suggestion, how about this:
>>> import re
>>> s = "INSERT INTO addresses VALUES ('1','1','CUCKOO'S NEST','CUCKOO'S NEST STREET');"
>>> def repl(m):
...     if m.group(1) in ('(', ',') or m.group(2) in (',', ')'):
...         return m.group(0)
...     return m.group(1) + "''" + m.group(2)
...
>>> re.sub("(.)'(.)", repl, s)
"INSERT INTO addresses VALUES ('1','1','CUCKOO''S NEST','CUCKOO''S NEST STREET');"
And if you're into lookbehind and lookahead assertions, this is the headache-inducing pure regex version:
re.sub("((?<![(,])'(?![,)]))", "''", s)
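For reference, here is the lookaround version as a small Python 3 sketch. It assumes, as in the question, that every value is a quoted string and that no apostrophe inside a value sits right next to a comma or parenthesis; INNER_QUOTE and double_inner_quotes are illustrative names.

```python
import re

# A quote counts as "inner" when it is not right after ( or , and not right before , or ).
INNER_QUOTE = re.compile(r"(?<![(,])'(?![,)])")


def double_inner_quotes(statement):
    return INNER_QUOTE.sub("''", statement)


s = "INSERT INTO addresses VALUES ('1','1','CUCKOO'S NEST','CUCKOO'S NEST STREET');"
print(double_inner_quotes(s))
```

The enclosing quotes are skipped because each of them touches a parenthesis or a comma on one side.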

while line:
    # Restrain line2 to inside parentheses
    line1, rest = line.split('(')
    line2, line3 = rest.split(')')
    # A bit cleaner
    new_bits = []
    for bit in line2.split(','):
        # Remove border ' characters
        bit = bit[1:-1]
        # Duplicate the ones inside
        if "'" in bit:
            bit = bit.replace("'", "''")
        # Re-add border '
        new_bits.append("'" + bit + "'")
    sys.stdout.write(line1 + '(' + ','.join(new_bits) + ')' + line3)
    line = fileIN.readline()

Warning: This depends way too much on the formatting of the SQL statement. However, if your input is only ever going to have the format "statements (params) end" then this will work every time.
import sys
fileIN = open('a.sql', "r")
line = fileIN.readline()
while line:
    #split out the parameters (between the ()'s)
    start, temp = line.split("(")
    params, end = temp.split(")")
    #replace the "'"s in the parameters (without the start and end quote)
    newParams = "','".join([x.replace("'", "''") for x in params[1:-1].split("','")])
    #join the statement back together
    line_out = start + "('" + newParams + "')" + end
    #next line
    sys.stdout.write(line_out)
    line = fileIN.readline()
Explanation:
Split the string into 3 parts: The query start, the parameters, and the end.
The list comprehension takes the parameters (without the starting/ending 's), splits them on ',', and, for every element in the resulting list (the individual data entries), replaces each ' with ''.
The last line then joins the query start, the new params (with the parenthesis and quotes that were removed previously), and the end of the statement.

Another answer:
a = "INSERT INTO addresses VALUES ('1','1','CUCKOO'S NEST','CUCKOO'S NEST STREET');"
open_par = a.find("(")
close_par = a.find(")")
b = a[open_par+1:close_par]
c = b.split(",")
d = map(lambda x: '"' + x.strip().strip("'") + '"',c)
result = a[:open_par+1] + ",".join(d) + a[close_par:]

Went with:
import sys
import re
def repl(m):
    if m.group(1) in ('(', ',') or m.group(2) in (',', ')'):
        return m.group(0)
    return m.group(1) + "''" + m.group(2)
fileIN = open('a.sql', "r")
line = fileIN.readline()
while line:
    line_out = re.sub("(.)'(.)", repl, line)
    sys.stdout.write(line_out)
    # Next line.
    line = fileIN.readline()

Python RegEx Woes

I'm not sure why this isn't working:
import re
import csv
def check(q, s):
    match = re.search(r'%s' % q, s, re.IGNORECASE)
    if match:
        return True
    else:
        return False
tstr = []
# test strings
tstr.append('testthisisnotworking')
tstr.append('This is a TEsT')
tstr.append('This is a TEST mon!')
f = open('testwords.txt', 'rU')
reader = csv.reader(f)
for type, term, exp in reader:
    for i in range(2):
        if check(exp, tstr[i]):
            print exp + " hit on " + tstr[i]
        else:
            print exp + " did NOT hit on " + tstr[i]
f.close()
testwords.txt contains this line:
blah, blah, test
So essentially 'test' is the RegEx pattern. Nothing complex, just a simple word. Here's the output:
test did NOT hit on testthisisnotworking
test hit on This is a TEsT
test hit on This is a TEST mon!
Why does it NOT hit on the first string? I also tried \s*test\s* with no luck. Help?

The csv module by default returns blank spaces around words in the input (this can be changed by using a different "dialect"). So exp contains " test" with a leading space.
A quick way to fix this would be to add:
exp = exp.strip()
after you read from the CSV file.
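Alternatively, the reader itself can drop the space after each delimiter through the skipinitialspace formatting parameter. A short Python 3 sketch, using an in-memory string in place of testwords.txt (the variable names are only for illustration):

```python
import csv
import io

data = io.StringIO("blah, blah, test\n")

# Default dialect: whitespace after the delimiter is kept as part of the field.
default_row = next(csv.reader(data))

data.seek(0)
# skipinitialspace=True ignores whitespace immediately following the delimiter.
trimmed_row = next(csv.reader(data, skipinitialspace=True))

print(default_row)
print(trimmed_row)
```

This only handles whitespace right after a comma; trailing whitespace in a field would still need strip().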

Adding a print repr(exp) to the top of the first for loop shows that exp is ' test', note the leading space.
This isn't that surprising since csv.reader() splits on commas, try changing your code to the following:
for type, term, exp in reader:
    exp = exp.strip()
    for s in tstr:
        if check(exp, s):
            print exp + " hit on " + s
        else:
            print exp + " did NOT hit on " + s
Note that in addition to the strip() call, which removes the leading and trailing whitespace, I changed your second for loop to loop directly over the strings in tstr instead of over a range. There was actually a bug in your code: tstr contains three values, but only the first two were checked, because for i in range(2) only yields i=0 and i=1.

How to remove an underscore ('_') along with the digits following it, in the first column only

I have been trying to remove underscore('_') with associated digits after it.
This is the first row in my text file.
JP_001033692.1_551 N -1 NO 99.5425% 0.0022875
I would like to remove "_551" from "JP_001033692.1_551" without removing other items from the subsequent columns.
Expected row would be:
JP_001033692.1 N -1 NO 99.5425% 0.0022875
Here is my code:
import re
fname = open(raw_input('Enter input filename: '),'r' )
outfile = open('decValues.txt','w')
for line in fname:
    line = re.sub('[\(\)\{\}\'\'\,<>]','', line)
    fields = line.rstrip("\n").split()
    outfile.write('%s %s %s %s %1.4f\n' % (fields[0],fields[1],fields[2],fields[3],(float(fields[5]))))
Thanks guys for helping out.
Kesh

str.rpartition(sep) splits the string at the last occurrence of sep:
>>> s = "this_is_a_string"
>>> split_s = s.rpartition('_')
>>> split_s
('this_is_a', '_', 'string')
>>> split_s[0]
'this_is_a'
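To apply this to the first column only, split that column off first. A minimal Python 3 sketch, assuming space-separated columns and that the part to drop is whatever follows the last underscore in the first column; strip_suffix is a hypothetical name.

```python
def strip_suffix(line):
    """Drop the trailing '_<suffix>' from the first space-separated column."""
    first, sep, rest = line.partition(" ")
    # rpartition splits at the LAST underscore: 'JP_001033692.1_551' -> 'JP_001033692.1'
    head, underscore, _tail = first.rpartition("_")
    if not underscore:  # no underscore at all: leave the column alone
        head = first
    return head + sep + rest


line = "JP_001033692.1_551 N -1 NO 99.5425% 0.0022875"
print(strip_suffix(line))
```

The remaining columns are passed through untouched, so underscores elsewhere on the line are not affected.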

That should do it:
re.sub(r"(\.\d+)_\d+", r"\1", line)

Does this do what you're looking for?
re.sub("(?P<x>(_.*)?)_\w*","\g<x>",str)

This should do what you want:
re.sub(r'^([^ ]*)(_[0-9]*)( +)', r'\1\3', line)
Test from the Python repl:
>>> import re
>>> line = 'JP_001033692.1_551 N -1 NO 99.5425% 0.0022875'
>>> re.sub(r'^([^ ]*)(_[0-9]*)( +)', r'\1\3', line)
'JP_001033692.1 N -1 NO 99.5425% 0.0022875'

str.startswith() not working as I intended

I'm trying to test for a \t (tab) or a space character and I can't understand why this bit of code won't work. What I am doing is reading in a file, counting the lines of code (loc) for the file, and then recording the names of each function present within the file along with their individual lines of code. The bit of code below is where I attempt to count the loc for the functions.
import re
...
    else:
        loc += 1
        for line in infile:
            line_t = line.lstrip()
            if len(line_t) > 0 \
            and not line_t.startswith('#') \
            and not line_t.startswith('"""'):
                if not line.startswith('\s'):
                    print ('line = ' + repr(line))
                    loc += 1
                    return (loc, name)
                else:
                    loc += 1
            elif line_t.startswith('"""'):
                while True:
                    if line_t.rstrip().endswith('"""'):
                        break
                    line_t = infile.readline().rstrip()
        return(loc,name)
Output:
Enter the file name: test.txt
line = '\tloc = 0\n'
There were 19 lines of code in "test.txt"
Function names:
count_loc -- 2 lines of code
As you can see, my test print for the line shows a \t, but the if statement explicitly says (or so I thought) that it should only execute with no whitespace characters present.
Here is my full test file I have been using:
def count_loc(infile):
    """ Receives a file and then returns the amount
        of actual lines of code by not counting commented
        or blank lines """
    loc = 0
    for line in infile:
        line = line.strip()
        if len(line) > 0 \
        and not line.startswith('//') \
        and not line.startswith('/*'):
            loc += 1
            func_loc, func_name = checkForFunction(line);
        elif line.startswith('/*'):
            while True:
                if line.endswith('*/'):
                    break
                line = infile.readline().rstrip()
    return loc
if __name__ == "__main__":
    print ("Hi")
Function LOC = 15
File LOC = 19

\s is only whitespace to the re package when doing pattern matching.
For startswith, an ordinary method of ordinary strings, \s is nothing special. Not a pattern, just characters.
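A small sketch of the difference: str.startswith accepts a tuple of literal prefixes, whereas \s only has meaning inside a regular expression. The sample line is the one from the question's output; the variable names are illustrative.

```python
import re

line = '\tloc = 0\n'

# '\\s' is just a backslash followed by 's', so this is False.
literal_check = line.startswith('\\s')

# startswith() accepts a tuple of literal prefixes.
prefix_check = line.startswith((' ', '\t'))

# With re, \s is the whitespace character class.
regex_check = re.match(r'\s', line) is not None

print(literal_check, prefix_check, regex_check)
```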

Your question has already been answered and this is slightly off-topic, but...
If you want to parse code, it is often easier and less error-prone to use a parser. If your code is Python code, Python comes with a couple of parsers (tokenize, ast, parser). For other languages, you can find a lot of parsers on the internet. ANTLR is a well-known one with Python bindings.
As an example, the following couple of lines of code print all lines of a Python module that are not comments and not doc-strings:
import tokenize
ignored_tokens = [tokenize.NEWLINE,tokenize.COMMENT,tokenize.N_TOKENS
                 ,tokenize.STRING,tokenize.ENDMARKER,tokenize.INDENT
                 ,tokenize.DEDENT,tokenize.NL]
with open('test.py', 'r') as f:
    g = tokenize.generate_tokens(f.readline)
    line_num = 0
    for a_token in g:
        if a_token[2][0] != line_num and a_token[0] not in ignored_tokens:
            line_num = a_token[2][0]
            print(a_token)
As a_token above is already parsed, you can easily check for function definition, too. You can also keep track where the function ends by looking at the current column start a_token[2][1]. If you want to do more complex things, you should use ast.

Your string literals aren't what you think they are.
You can specify a space or TAB like so:
space = ' '
tab = '\t'
