I want to process every line in my log file and extract the IP address if the line matches my pattern. There are several different types of messages; in the example below I am using p1 and p2.
I could read the file line by line and match each line against each pattern. But since there can be many more patterns, I would like to do it as efficiently as possible. I was hoping to compile those patterns into one object and do the match only once for each line:
import re
import sys
IP = r'(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'
p1 = 'Registration from ' + IP + ' - Wrong password'
p2 = 'Call from ' + IP + ' rejected because extension not found'
c = re.compile(r'(?:' + p1 + '|' + p2 + ')')
for line in sys.stdin:
    match = c.search(line)
    if match:
        print(match['ip'])
But the above code does not work: re.compile complains that the group name ip is used twice.
What is the most elegant way to achieve my goal?
EDIT:
I have modified my code based on the answer from @Dev Khadka.
But I am still struggling with how to properly handle the multiple ip matches. The code below prints all IPs that matched p1:
for line in sys.stdin:
    match = c.search(line)
    if match:
        print(match['ip1'])
But some lines don't match p1; they match p2. That is, I get:
1.2.3.4
None
2.3.4.5
...
How do I print the matching IP when I don't know whether it was p1, p2, ...? All I want is the IP. I don't care which pattern it matched.
You can consider installing the excellent regex module, which supports many advanced regex features, including branch reset groups, which are designed to solve exactly the problem you outlined in this question. Branch reset groups are denoted by (?|...). Capture groups at the same positions or with the same names in the different alternatives of a branch reset group share the same capture groups for output.
Notice that in the example below whichever alternative matches fills the named capture group ip, so you don't need to iterate over multiple groups searching for a non-empty one:
import sys
import regex
ip_pattern = r'(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'
patterns = [
    'Registration from {ip} - Wrong password',
    'Call from {ip} rejected because extension not found'
]
pattern = regex.compile('(?|%s)' % '|'.join(patterns).format(ip=ip_pattern))
for line in sys.stdin:
    match = pattern.search(line)
    if match:
        print(match['ip'])
Demo: https://repl.it/#blhsing/RegularEmbellishedBugs
Why don't you check which group matched? A group that did not take part in the match is None:
groups = match.groupdict()
if groups.get('ip1') is not None:
    print(groups['ip1'])
if groups.get('ip2') is not None:
    print(groups['ip2'])
or something like:
names = ['ip1', 'ip2', 'ip3']
for n in names:
    if groups.get(n) is not None:
        print(groups[n])
or even
num = 1000  # can easily handle millions of patterns =)
for i in range(num):
    name = 'ip%d' % i
    if groups.get(name) is not None:
        print(groups[name])
That's because you are using the same group name for two groups.
Try this; it gives the groups the names ip1 and ip2:
import re
IP = r'(?P<ip%d>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'
p1 = 'Registration from ' + IP % 1 + ' - Wrong password'
p2 = 'Call from ' + IP % 2 + ' rejected because extension not found'
c = re.compile(r'(?:' + p1 + '|' + p2 + ')')
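With numbered group names like these, the standard re module can still hand back "whichever IP matched" without checking each name. Here is a minimal sketch, assuming each alternative contains exactly one capture group; the names TEMPLATES, build_pattern and extract_ip are made up for illustration. It relies on Match.lastgroup, which holds the name of the last capture group that took part in the match.

```python
import re

IP = r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}'

# Hypothetical message templates; {ip} marks where the address appears.
TEMPLATES = [
    'Registration from {ip} - Wrong password',
    'Call from {ip} rejected because extension not found',
]


def build_pattern(templates):
    """Join all templates into one alternation with groups ip0, ip1, ..."""
    parts = [
        t.format(ip='(?P<ip%d>%s)' % (i, IP))
        for i, t in enumerate(templates)
    ]
    return re.compile('|'.join(parts))


def extract_ip(pattern, line):
    """Return the IP captured by whichever alternative matched, or None."""
    m = pattern.search(line)
    if m is None:
        return None
    # Only the matching alternative's group participates, so lastgroup names it.
    return m.group(m.lastgroup)


pattern = build_pattern(TEMPLATES)
print(extract_ip(pattern, 'Call from 2.3.4.5 rejected because extension not found'))
```

If an alternative ever needs more than one capture group, lastgroup is no longer enough and you would have to go back to scanning groupdict() for the first non-None value.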
Named capture groups must have distinct names, but since all of your capture groups are meant to capture the same pattern, it's better not to use named capture groups in this case but instead simply use regular capture groups and iterate through the groups from the match object to print the first group that is not empty:
import re
import sys
ip_pattern = r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'
patterns = [
    'Registration from {ip} - Wrong password',
    'Call from {ip} rejected because extension not found'
]
pattern = re.compile('|'.join(patterns).format(ip=ip_pattern))
for line in sys.stdin:
    match = pattern.search(line)
    if match:
        print(next(filter(None, match.groups())))
Demo: https://repl.it/#blhsing/UnevenCheerfulLight
Adding an IP address validity check to the already accepted answer.
Although the ipaddress and socket modules would be the ideal tools, this code parses the host octets by hand:
import regex as re
from io import StringIO
def valid_ip(address):
    try:
        host_bytes = address.split('.')
        valid = [int(b) for b in host_bytes]
        valid = [b for b in valid if b >= 0 and b <= 255]
        return len(host_bytes) == 4 and len(valid) == 4
    except ValueError:
        return False
ip_pattern = r'(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'
patterns = [
    'Registration from {ip} - Wrong password',
    'Call from {ip} rejected because extension not found'
]
file = StringIO('''
Registration from 259.1.1.1 - Wrong password,
Call from 1.1.2.2 rejected because extension not found
''')
pattern = re.compile('(?|%s)' % '|'.join(patterns).format(ip=ip_pattern))
list1 = []
list2 = []
for line in file:
    match = pattern.search(line)
    if match:
        list1.append(match['ip'])  # List of IP addresses
        list2.append(valid_ip(match['ip']))  # Boolean results of valid_ip
for i in range(len(list1)):
    if not list2[i]:
        print(f'{list1[i]} is invalid IP')
    else:
        print(list1[i])
259.1.1.1 is invalid IP
1.1.2.2
[Program finished]
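For comparison, here is a minimal sketch of the same check done with the standard-library ipaddress module that this answer mentions. The function name is_valid_ipv4 is made up for illustration; it could stand in for valid_ip.

```python
import ipaddress


def is_valid_ipv4(text):
    """Return True if text parses as a dotted-quad IPv4 address."""
    try:
        ipaddress.IPv4Address(text)
    except ValueError:
        # AddressValueError is a subclass of ValueError.
        return False
    return True


for candidate in ('259.1.1.1', '1.1.2.2'):
    print(candidate, is_valid_ipv4(candidate))
```

The regex still does the extraction; the module only decides whether the captured text is a real address.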
Related
I want to get the word 'MASTER_INACTIVE' from the string:
'p_esco_link->state = MASTER_INACTIVE; /*M-t10*/'
by searching for the regular expression 'p_esco_link->state =' and capturing the word that follows it.
I have to replace direct data accesses with API function calls. I tried a regular expression in Python 3.6, but it does not work.
pattern = '(?<=\bp_esco_link->state =\W)\w+'
if __name__ == "__main__":
    syslogger.info(sys.argv)
    if version_info.major != 3:
        raise Exception('Only works on Python 3.x')
    with open(cFile, encoding='utf-8') as file_obj:
        lineNum = 0
        for line in file_obj:
            print(len(line))
            re_obj = re.compile(pattern)
            result = re.search(pattern, line)
            lineNum += 1
            #print(result)
            if result:
                print(str(lineNum) + ' ' +str(result.span()) + ' ' + result.group())
I expected the re module to find the position of 'MASTER_INACTIVE' and put it into result.group().
Instead, the re module finds nothing.
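Your pattern is fine; it just needs to be a raw string. Without the r prefix, Python turns '\b' into a backspace character before the regex engine ever sees it, so the word boundary is lost.
Just change the line below in your code:
pattern = r'(?<=\bp_esco_link->state =\W)\w+' # add r prefix
Here is a working sample; I used your string as line.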
import re
pattern = r'(?<=\bp_esco_link->state =\W)\w+'
line = 'p_esco_link->state = MASTER_INACTIVE; /*M-t10*/'
re_obj = re.compile(pattern)
result = re_obj.search(line)
print(result.span()) # (21, 36)
print(result.group()) # 'MASTER_INACTIVE'
See the questions below to learn more about the 'r' prefix:
Python regex - r prefix
What exactly do “u” and “r” string flags do, and what are raw string literals?
What does preceding a string literal with “r” mean? [duplicate]
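To see the effect of the r prefix directly, here is a small sketch that runs both spellings of the pattern against the sample line. In the non-raw version only "\b" is left as a single backslash; the other backslashes are doubled purely to avoid invalid-escape warnings, which is an adjustment of mine and not part of the original code.

```python
import re

line = 'p_esco_link->state = MASTER_INACTIVE; /*M-t10*/'

# Non-raw literal: Python converts "\b" to a backspace (0x08) before re sees it.
plain = '(?<=\bp_esco_link->state =\\W)\\w+'
# Raw literal: the backslash survives, so re sees the word-boundary token \b.
raw = r'(?<=\bp_esco_link->state =\W)\w+'

plain_result = re.search(plain, line)
raw_result = re.search(raw, line)

print(plain[4] == '\x08')   # True: the fifth character is a backspace
print(plain_result)         # None: the line contains no backspace
print(raw_result.group())   # MASTER_INACTIVE
```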
I'm sure this is a basic question, but I have spent about an hour on it already and can't quite figure it out. I'm parsing smartctl output, and here is a sample of the data I'm working with:
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-2.6.32-39-pve] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Device Model: TOSHIBA MD04ACA500
Serial Number: Y9MYK6M4BS9K
LU WWN Device Id: 5 000039 5ebe01bc8
Firmware Version: FP2A
User Capacity: 5,000,981,078,016 bytes [5.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Thu Jul 2 11:24:08 2015 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
What I'm trying to achieve is pulling out the device model (some devices it's just one string, other devices, such as this one, it's two words), serial number, time, and a couple other fields. I assume it would be easiest to capture all data after the colon, but how to eliminate the variable amounts of spaces?
Here is the relevant code I currently came up with:
deviceModel = ""
serialNumber = ""
lines = infoMessage.split("\n")
for line in lines:
    parts = line.split()
    if str(parts):
        if parts[0] == "Device Model: ":
            deviceModel = parts[1]
        elif parts[0] == "Serial Number: ":
            serialNumber = parts[1]
vprint(3, "Device model: %s" %deviceModel)
vprint(3, "Serial number: %s" %serialNumber)
The error I keep getting is:
File "./tester.py", line 152, in parseOutput
if parts[0] == "Device Model: ":
IndexError: list index out of range
I get what the error is saying (kinda), but I'm not sure what else the range could be, or if I'm even attempting this in the right way. Looking for guidance to get me going in the right direction. Any help is greatly appreciated.
Thanks!
The IndexError occurs when split() returns an empty list and you access its first element. This happens when there is nothing to split (an empty line).
No need for regular expressions:
deviceModel = ""
serialNumber = ""
lines = infoMessage.split("\n")
for line in lines:
    if line.startswith("Device Model:"):
        deviceModel = line.split(":")[1].strip()
    elif line.startswith("Serial Number:"):
        serialNumber = line.split(":")[1].strip()
print("Device model: %s" %deviceModel)
print("Serial number: %s" %serialNumber)
I guess your problem is the empty line in the middle. Because,
>>> '\n'.split()
[]
You can do something like,
>>> f = open('a.txt')
>>> lines = f.readlines()
>>> deviceModel = [line for line in lines if 'Device Model' in line][0].split(':')[1].strip()
# 'TOSHIBA MD04ACA500'
>>> serialNumber = [line for line in lines if 'Serial Number' in line][0].split(':')[1].strip()
# 'Y9MYK6M4BS9K'
Try using regular expressions:
import re
r = re.compile(r"^[^:]*:\s+(.*)$")
m = r.match("Device Model: TOSHIBA MD04ACA500")
print(m.group(1)) # Prints "TOSHIBA MD04ACA500"
Not sure what version you're running, but on 2.7, line.split() is splitting the line by word, so
>>> parts = line.split()
parts = ['Device', 'Model:', 'TOSHIBA', 'MD04ACA500']
You can also try line.startswith() to find the lines you want https://docs.python.org/2/library/stdtypes.html#str.startswith
The way I would debug this is by printing out parts at every iteration. Try that and show us what the list is when it fails.
Edit: Your problem is most likely what @jonrsharpe said. parts is probably an empty list when it gets to an empty line, and str(parts) will just return '[]', which is truthy. Try to test that.
I think it would be far easier to use regular expressions here.
import re
for line in lines:
    # Splits the string into at most two parts
    # at the first colon which is followed by one or more spaces
    parts = re.split(r':\s+', line, maxsplit=1)
    # re.split always returns at least one element, so check for both parts
    if len(parts) == 2:
        if parts[0] == "Device Model":
            deviceModel = parts[1]
        elif parts[0] == "Serial Number":
            serialNumber = parts[1]
Mind you, if you only care about the two fields, startswith might be better.
When you split the blank line, parts is an empty list.
You try to accommodate that by checking for an empty list, but you turn the empty list into a string, which makes your conditional statement True.
>>> s = []
>>> bool(s)
False
>>> str(s)
'[]'
>>> bool(str(s))
True
>>>
Change if str(parts): to if parts:.
Many would say that using a try/except block would be idiomatic for Python:
for line in lines:
    parts = line.split()
    try:
        if parts[0] == "Device Model: ":
            deviceModel = parts[1]
        elif parts[0] == "Serial Number: ":
            serialNumber = parts[1]
    except IndexError:
        pass
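Pulling the suggestions in this thread together, here is a minimal sketch that collects every "Key: value" line into a dict, so that the variable amount of whitespace after the colon no longer matters. The name parse_info is made up for illustration, and the sample text is a shortened version of the smartctl output from the question.

```python
def parse_info(text):
    """Map each 'Key: value' line to a dict entry; the first occurrence wins."""
    fields = {}
    for line in text.splitlines():
        # partition splits at the first colon only, so times like 11:24:08 survive.
        key, sep, value = line.partition(':')
        if not sep:
            continue  # blank lines and lines without a colon
        fields.setdefault(key.strip(), value.strip())
    return fields


sample = (
    "=== START OF INFORMATION SECTION ===\n"
    "Device Model:     TOSHIBA MD04ACA500\n"
    "Serial Number:    Y9MYK6M4BS9K\n"
    "\n"
    "Local Time is:    Thu Jul  2 11:24:08 2015 EDT\n"
)
info = parse_info(sample)
print(info['Device Model'])
print(info['Serial Number'])
```

Any other line that happens to contain a colon (the copyright line with its URL, for instance) also ends up in the dict, which is harmless if you only look up the keys you care about.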
I am trying to open a text file. Parse the text file for specific regex patterns then when if I find that pattern I write the regex returned pattern to another text file.
Specifically a list of IP Addresses which I want to parse specific ones out of.
So the file may have
10.10.10.10
9.9.9.9
5.5.5.5
6.10.10.10
And say I want just the IPs that end in 10 (the regex I think I am good with). My example looks for the 10.180.43.XX or 10.180.41.XX IP hosts, but I will adjust as needed.
I've tried several methods and failed miserably at them all. It's days like this I know why I just never mastered any language. But I'm committed to Python so here goes.
import re
textfile = open("SymantecServers.txt", 'r')
matches = re.findall('^10.180\.4[3,1].\d\d',str(textfile))
print(matches)
This gives me empty brackets. I had to wrap the textfile in the str function or it just puked. I don't know if this is right.
This just failed all over the place no matter how I fine tuned it.
f = open("SymantecServers.txt","r")
o = open("JustIP.txt",'w', newline="\r\n")
for line in f:
    pattern = re.compile("^10.180\.4[3,1].\d\d")
    print(pattern)
    #o.write(pattern)
    #o.close()
    f.close()
I did get one working, but it just returned the entire line (including the netmask and other text like the hostname, which are all on the same line in the text file). I just want the IP.
Any help on how to read a text file and if it has a pattern of IP grab the full IP and write that into another text file so I end up with a text file with a list of just the IPs I want. I am 3 hours into it and behind on work so going to do the first file by hand...
I am just at a loss what I am missing. Sorry for being a newbie
here is it working:
>>> s = """10.10.10.10
... 9.9.9.9
... 5.5.5.5
... 10.180.43.99
... 6.10.10.10"""
>>> re.findall(r'10\.180\.4[31]\.\d\d', s)
['10.180.43.99']
you do not really need to add line boundaries, as you're matching a very specific IP address, if your file does not have weird things like '123.23.234.10.180.43.99.21354' that you don't want to match, it should be ok!
your syntax of [3,1] is matching either 3, 1 or , and you don't want to match against a comma ;-)
about your function:
r = re.compile(r'10\.180\.4[31]\.\d\d')
with open("SymantecServers.txt","r") as f:
    with open("JustIP.txt",'w', newline="\r\n") as o:
        for line in f:
            matches = r.findall(line)
            for match in matches:
                o.write(match + "\n")  # one IP per line
though if I were you, I'd extract IPs using:
r = re.compile(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')
with open("SymantecServers.txt","r") as f:
    with open("JustIP.txt",'w', newline="\r\n") as o:
        for line in f:
            matches = r.findall(line)
            for match in matches:
                a, b, c, d = match.split('.')
                if int(a) <= 255 and int(b) <= 255 and int(c) in (43, 41) and int(d) < 100:
                    o.write(match + "\n")  # one IP per line
or another way to do it:
r = re.compile(r'(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})')
with open("SymantecServers.txt","r") as f:
    with open("JustIP.txt",'w', newline="\r\n") as o:
        for line in f:
            m = r.match(line)
            if m:
                a, b, c, d = m.groups()
                if int(a) <= 255 and int(b) <= 255 and int(c) in (43, 41) and int(d) < 100:
                    o.write(m.group(0) + "\n")  # group(0) is the whole matched IP
which uses the regex to split the IP address into groups.
What you're missing is that you're doing a re.compile() which creates a Regular Expression object in Python. You're forgetting to match.
You could try:
# This isn't the best way to match IPs, but if it fits your use case keep it for now.
pattern = re.compile(r"^10\.180\.4[13]\.\d\d")
f = open("SymantecServers.txt",'r')
o = open("JustIP.txt",'w')
for line in f:
    m = pattern.match(line)
    if m is not None:
        print("Match: %s" % m.group(0))
        o.write(m.group(0) + "\n")
f.close()
o.close()
This compiles the pattern into a regular expression object, tries to match each line against it, and prints the current match. I avoid having to split my matches, but I have to pay attention to matching groups, hence group(0).
You can also look at re.search(), but if you're running the same regular expression many times, it becomes more worthwhile to use compile.
Also note that I moved the f.close() to the outside of the for loop.
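For lines that carry extra text such as a netmask or hostname, here is a minimal sketch that yields only the wanted addresses from any iterable of lines (an open file works). The name extract_ips is made up for illustration. Allowing one to three digits in the last octet is my assumption; the patterns in this thread use exactly two.

```python
import re

# \b keeps the pattern from matching inside a longer run of digits.
WANTED = re.compile(r'\b10\.180\.4[13]\.\d{1,3}\b')


def extract_ips(lines):
    """Yield every 10.180.41.x / 10.180.43.x address found in the lines."""
    for line in lines:
        for ip in WANTED.findall(line):
            yield ip


sample = [
    '10.180.43.99 255.255.255.0 host1\n',
    '9.9.9.9 255.255.255.0 host2\n',
    'host3 10.180.41.7 255.255.255.0\n',
]
print(list(extract_ips(sample)))
```

To write the result to a file, open both files with with-statements and write ip + "\n" for each yielded value.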
I've read all of the articles I could find, even understood a few of them but as a Python newb I'm still a little lost and hoping for help :)
I'm working on a script to parse items of interest out of an application specific log file, each line begins with a time stamp which I can match and I can define two things to identify what I want to capture, some partial content and a string that will be the termination of what I want to extract.
My issue is multi-line, in most cases every log line is terminated with a newline but some entries contain SQL that may have new lines within it and therefore creates new lines in the log.
So, in a simple case I may have this:
[8/21/13 11:30:33:557 PDT] 00000488 SystemOut O 21 Aug 2013 11:30:33:557 [WARN] [MXServerUI01] [CID-UIASYNC-17464] BMXAA6720W - USER = (ABCDEF) SPID = (2526) app (ITEM) object (ITEM) : select * from item where ((status != 'OBSOLETE' and itemsetid = 'ITEMSET1') and (exists (select 1 from maximo.invvendor where (exists (select 1 from maximo.companies where (( contains(name,' $AAAA ') > 0 )) and (company=invvendor.manufacturer and orgid=invvendor.orgid))) and (itemnum = item.itemnum and itemsetid = item.itemsetid)))) and (itemtype in (select value from synonymdomain where domainid='ITEMTYPE' and maxvalue = 'ITEM')) order by itemnum asc (execution took 2083 milliseconds)
This all appears as one line which I can match with this:
re.compile('\[(0?[1-9]|[12][0-9]|3[01])(\/)(0?[1-9]|[12][0-9]|3[01])(\/)([0-9]{2}).*(milliseconds)')
However in some cases there may be line breaks in the SQL, as such I want to still capture it (and potentially replace the line breaks with spaces). I am currently reading the file a line at a time which obviously isn't going to work so...
Do I need to process the whole file in one go? They are typically 20mb in size. How do I read the entire file and iterate through it looking for single or multi-line blocks?
How would I write a multi-line RegEx that would match either the whole thing on one line or if it is spread across multiple lines?
My overall goal is to parameterize this so I can use it for extracting log entries that match different patterns of the starting string (always the start of a line), the ending string (where I want to capture to) and a value that is between them as an identifier.
Thanks in advance for any help!
Chris.
import sys, getopt, os, re
sourceFolder = 'C:/MaxLogs'
logFileName = sourceFolder + "/Test.log"
lines = []
print "--- START ----"
lineStartsWith = re.compile('\[(0?[1-9]|[12][0-9]|3[01])(\/)(0?[1-9]|[12][0-9]|3[01])(\/)([0-9]{2})(\ )')
lineContains = re.compile('.*BMXAA6720W.*')
lineEndsWith = re.compile('(?:.*milliseconds.*)')
lines = []
with open(logFileName, 'r') as f:
    for line in f:
        if lineStartsWith.match(line) and lineContains.match(line):
            if lineEndsWith.match(line) :
                print 'Full Line Found'
                print line
                print "- Record Separator -"
            else:
                print 'Partial Line Found'
                print line
                print "- Record Separator -"
print "--- DONE ----"
Next step, for my partial line I'll continue reading until I find lineEndsWith and assemble the lines in to one block.
I'm no expert so suggestions are always welcome!
UPDATE - So I have it working, thanks to all the responses that helped direct things, I realize it isn't pretty and I need to clean up my if / elif mess and make it more efficient but IT's WORKING! Thanks for all the help.
import sys, getopt, os, re
sourceFolder = 'C:/MaxLogs'
logFileName = sourceFolder + "/Test.log"
print "--- START ----"
lineStartsWith = re.compile('\[(0?[1-9]|[12][0-9]|3[01])(\/)(0?[1-9]|[12][0-9]|3[01])(\/)([0-9]{2})(\ )')
lineContains = re.compile('.*BMXAA6720W.*')
lineEndsWith = re.compile('(?:.*milliseconds.*)')
lines = []
multiLine = False
with open(logFileName, 'r') as f:
    for line in f:
        if lineStartsWith.match(line) and lineContains.match(line) and lineEndsWith.match(line):
            lines.append(line.replace("\n", " "))
        elif lineStartsWith.match(line) and lineContains.match(line) and not multiLine:
            #Found the start of a multi-line entry
            multiLineString = line
            multiLine = True
        elif multiLine and not lineEndsWith.match(line):
            multiLineString = multiLineString + line
        elif multiLine and lineEndsWith.match(line):
            multiLineString = multiLineString + line
            multiLineString = multiLineString.replace("\n", " ")
            lines.append(multiLineString)
            multiLine = False
for line in lines:
    print line
Do I need to process the whole file in one go? They are typically 20mb in size. How do I read the entire file and iterate through it looking for single or multi-line blocks?
There are two options here.
You could read the file block by block, making sure to attach any "leftover" bit at the end of each block to the start of the next one, and search each block. Of course you will have to figure out what counts as "leftover" by looking at what your data format is and what your regex can match, and in theory it's possible for multiple blocks to all count as leftover…
Or you could just mmap the file. An mmap acts like a bytes (or like a str in Python 2.x), and leaves it up to the OS to handle paging blocks in and out as necessary. Unless you're trying to deal with absolutely huge files (gigabytes in 32-bit, even more in 64-bit), this is trivial and efficient:
import mmap
with open('bigfile', 'rb') as f:
    with mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ) as m:
        # In Python 3 the mmap acts like bytes, so compiled_re must be a bytes pattern
        for match in compiled_re.finditer(m):
            do_stuff(match)
In older versions of Python, mmap isn't a context manager, so you'll need to wrap contextlib.closing around it (or just use an explicit close if you prefer).
How would I write a multi-line RegEx that would match either the whole thing on one line or if it is spread across multiple lines?
You could use the DOTALL flag, which makes the . match newlines. You could instead use the MULTILINE flag and put appropriate $ and/or ^ characters in, but that makes simple cases a lot harder, and it's rarely necessary. Here's an example with DOTALL (using a simpler regexp to make it more obvious):
>>> s1 = """[8/21/13 11:30:33:557 PDT] 00000488 SystemOut O 21 Aug 2013 11:30:33:557 [WARN] [MXServerUI01] [CID-UIASYNC-17464] BMXAA6720W - USER = (ABCDEF) SPID = (2526) app (ITEM) object (ITEM) : select * from item where ((status != 'OBSOLETE' and itemsetid = 'ITEMSET1') and (exists (select 1 from maximo.invvendor where (exists (select 1 from maximo.companies where (( contains(name,' $AAAA ') > 0 )) and (company=invvendor.manufacturer and orgid=invvendor.orgid))) and (itemnum = item.itemnum and itemsetid = item.itemsetid)))) and (itemtype in (select value from synonymdomain where domainid='ITEMTYPE' and maxvalue = 'ITEM')) order by itemnum asc (execution took 2083 milliseconds)"""
>>> s2 = """[8/21/13 11:30:33:557 PDT] 00000488 SystemOut O 21 Aug 2013 11:30:33:557 [WARN] [MXServerUI01] [CID-UIASYNC-17464] BMXAA6720W - USER = (ABCDEF) SPID = (2526) app (ITEM) object (ITEM) : select * from item where ((status != 'OBSOLETE' and itemsetid = 'ITEMSET1') and
(exists (select 1 from maximo.invvendor where (exists (select 1 from maximo.companies where (( contains(name,' $AAAA ') > 0 )) and (company=invvendor.manufacturer and orgid=invvendor.orgid))) and (itemnum = item.itemnum and itemsetid = item.itemsetid)))) and (itemtype in (select value from synonymdomain where domainid='ITEMTYPE' and maxvalue = 'ITEM')) order by itemnum asc (execution took 2083 milliseconds)"""
>>> r = re.compile(r'\[(.*?)\].*?milliseconds\)', re.DOTALL)
>>> r.findall(s1)
['8/21/13 11:30:33:557 PDT']
>>> r.findall(s2)
['8/21/13 11:30:33:557 PDT']
As you can see the second .*? matched the newline just as easily as a space.
If you're just trying to treat a newline as whitespace, you don't need either; '\s' already catches newlines.
For example:
>>> s1 = 'abc def\nghi\n'
>>> s2 = 'abc\ndef\nghi\n'
>>> r = re.compile(r'abc\s+def')
>>> r.findall(s1)
['abc def']
>>> r.findall(s2)
['abc\ndef']
You can read an entire file into a string and then you can use re.split to make a list of all the entries separated by times. Here's an example:
f = open(...)
allLines = ''.join(f.readlines())
entries = re.split(regex, allLines)
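If you would rather keep reading the file line by line, another option is to start a new entry whenever a line begins with a timestamp and glue every other line onto the current entry. Here is a minimal sketch of that idea. The names ENTRY_START and iter_entries are made up for illustration, and the simplified timestamp pattern is an assumption based on the sample log line in the question.

```python
import re

# Simplified start-of-entry marker, e.g. "[8/21/13 " at the beginning of a line.
ENTRY_START = re.compile(r'\[\d{1,2}/\d{1,2}/\d{2} ')


def iter_entries(lines):
    """Yield complete log entries, joining continuation lines with spaces."""
    buffer = []
    for line in lines:
        if ENTRY_START.match(line) and buffer:
            # A new timestamp closes the entry collected so far.
            yield ' '.join(buffer)
            buffer = []
        buffer.append(line.rstrip('\n'))
    if buffer:
        yield ' '.join(buffer)


sample = [
    '[8/21/13 11:30:33:557 PDT] A select *\n',
    'from item (execution took 5 milliseconds)\n',
    '[8/21/13 11:30:34:000 PDT] B\n',
]
for entry in iter_entries(sample):
    print(entry)
```

Each yielded entry is a single string, so the existing "contains" and "ends with" patterns can be applied to it unchanged, and the file never has to be held in memory as a whole.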
I need to do a RegEx search and replace of all commas found inside of quote blocks.
i.e.
"thing1,blah","thing2,blah","thing3,blah",thing4
needs to become
"thing1\,blah","thing2\,blah","thing3\,blah",thing4
my code:
inFile = open(inFileName,'r')
inFileRl = inFile.readlines()
inFile.close()
p = re.compile(r'["]([^"]*)["]')
for line in inFileRl:
    pg = p.search(line)
    # found comment block
    if pg:
        q = re.compile(r'[^\\],')
        # found comma within comment block
        qg = q.search(pg.group(0))
        if qg:
            # Here I want to reconstitute the line and print it with the replaced text
            #print re.sub(r'([^\\])\,',r'\1\,',pg.group(0))
I need to filter only the columns I want based on a RegEx, filter further,
then do the RegEx replace, then reconstitute the line back.
How can I do this in Python?
The csv module is perfect for parsing data like this as csv.reader in the default dialect ignores quoted commas. csv.writer reinserts the quotes due to the presence of commas. I used StringIO to give a file like interface to a string.
import csv
import StringIO
s = '''"thing1,blah","thing2,blah","thing3,blah"
"thing4,blah","thing5,blah","thing6,blah"'''
source = StringIO.StringIO(s)
dest = StringIO.StringIO()
rdr = csv.reader(source)
wtr = csv.writer(dest)
for row in rdr:
    wtr.writerow([item.replace('\\,',',').replace(',','\\,') for item in row])
print dest.getvalue()
result:
"thing1\,blah","thing2\,blah","thing3\,blah"
"thing4\,blah","thing5\,blah","thing6\,blah"
General Edit
There was
"thing1\\,blah","thing2\\,blah","thing3\\,blah",thing4
in the question, and now it is not there anymore.
Moreover, I hadn't noticed r'[^\\],'.
So I have completely rewritten my answer.
"thing1,blah","thing2,blah","thing3,blah",thing4
and
"thing1\,blah","thing2\,blah","thing3\,blah",thing4
being displays of strings (I suppose)
import re
ss = '"thing1,blah","thing2,blah","thing3\,blah",thing4 '
regx = re.compile('"[^"]*"')
def repl(mat, ri = re.compile('(?<!\\\\),') ):
    # replace each comma that is not already escaped with backslash + comma
    return ri.sub('\\\\,',mat.group())
print ss
print repr(ss)
print
print regx.sub(repl, ss)
print repr(regx.sub(repl, ss))
result
"thing1,blah","thing2,blah","thing3\,blah",thing4
'"thing1,blah","thing2,blah","thing3\\,blah",thing4 '
"thing1\,blah","thing2\,blah","thing3\,blah",thing4
'"thing1\\,blah","thing2\\,blah","thing3\\,blah",thing4 '
You can try this regex.
>>> re.sub('(?<!"),(?!")', r"\\,",
...        '"thing1,blah","thing2,blah","thing3,blah",thing4')
# Gives "thing1\,blah","thing2\,blah","thing3\,blah",thing4
The logic behind this is to replace a , with \, only if it is neither immediately preceded nor immediately followed by a ".
I came up with an iterative solution using several regex functions:
finditer(), findall(), group(), start() and end()
There's a way to turn all this into a recursive function that calls itself.
Any takers?
outfile = open(outfileName,'w')
p = re.compile(r'["]([^"]*)["]')
q = re.compile(r'([^\\])(,)')
for line in outfileRl:
    pg = p.finditer(line)
    pglen = len(p.findall(line))
    if pglen > 0:
        mpgstart = 0;
        mpgend = 0;
        for i,mpg in enumerate(pg):
            if i == 0:
                outfile.write(line[:mpg.start()])
            qg = q.finditer(mpg.group(0))
            qglen = len(q.findall(mpg.group(0)))
            if i > 0 and i < pglen:
                outfile.write(line[mpgend:mpg.start()])
            if qglen > 0:
                for j,mqg in enumerate(qg):
                    if j == 0:
                        outfile.write( mpg.group(0)[:mqg.start()] )
                    outfile.write( re.sub(r'([^\\])(,)',r'\1\\\2',mqg.group(0)) )
                    if j == (qglen-1):
                        outfile.write( mpg.group(0)[mqg.end():] )
            else:
                outfile.write(mpg.group(0))
            if i == (pglen-1):
                outfile.write(line[mpg.end():])
            mpgstart = mpg.start()
            mpgend = mpg.end()
    else:
        outfile.write(line)
outfile.close()
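Not recursive, but the same bookkeeping can be left to re.sub by passing it a replacement function. Here is a minimal sketch, assuming quoted blocks never contain an embedded double quote; the names QUOTED, UNESCAPED_COMMA and escape_quoted_commas are made up for illustration.

```python
import re

QUOTED = re.compile(r'"[^"]*"')            # one quoted block
UNESCAPED_COMMA = re.compile(r'(?<!\\),')  # a comma not already preceded by a backslash


def escape_quoted_commas(line):
    """Put a backslash before every unescaped comma inside quoted blocks."""
    return QUOTED.sub(
        lambda m: UNESCAPED_COMMA.sub(r'\\,', m.group(0)),
        line,
    )


print(escape_quoted_commas('"thing1,blah","thing2,blah","thing3,blah",thing4'))
```

The outer substitution only ever sees the quoted blocks, so the separating commas and the unquoted thing4 are left alone, and running the function a second time changes nothing.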
have you looked into str.replace()?
str.replace(old, new[, count])
Return a copy of the string with all occurrences of substring old
replaced by new. If the optional argument count is given, only the
first count occurrences are replaced.
here is some documentation
hope this helps