I have checked and played with various examples and it appears that my problem is a bit more complex than what I have been able to find. What I need to do is search for a particular string and then delete the following line and keep deleting lines until another string is found. So an example would be the following:
a
b
color [
0 0 0,
1 1 1,
3 3 3,
] #color
y
z
Here, "color [" is match1, and "] #color" is match2. So then what is desired is the following:
a
b
color [
] #color
y
z
This "simple to follow" code example will get you started; you can tweak it as needed. Note that it processes the file line by line, so it will work with a file of any size.
start_marker = 'startdel'
end_marker = 'enddel'
with open('data.txt') as inf:
    ignoreLines = False
    for line in inf:
        if start_marker in line:
            print line,
            ignoreLines = True
        if end_marker in line:
            ignoreLines = False
        if not ignoreLines:
            print line,
It uses startdel and enddel as "markers" for starting and ending the ignoring of data.
Update:
Modified code based on a request in the comments, this will now include/print the lines that contain the "markers".
Given this input data (borrowed from @drewk):
Beginning of the file...
stuff
startdel
delete this line
delete this line also
enddel
stuff as well
the rest of the file...
it yields:
Beginning of the file...
stuff
startdel
enddel
stuff as well
the rest of the file...
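For the markers in the original question ("color [" and "] #color"), the same flag-based idea can be wrapped in a generator. This is only a minimal sketch, written for Python 3; the name drop_between and its parameters are hypothetical and not part of the answer above.

```python
def drop_between(lines, start_marker, end_marker):
    """Yield lines, skipping those strictly between the two marker lines."""
    skipping = False
    for line in lines:
        # Stop skipping as soon as the end marker shows up (and keep that line).
        if skipping and end_marker in line:
            skipping = False
        if not skipping:
            yield line
            # Start skipping after the start marker line has been emitted.
            if start_marker in line:
                skipping = True


sample = ["a", "b", "color [", "0 0 0,", "1 1 1,", "3 3 3,", "] #color", "y", "z"]
kept = list(drop_between(sample, "color [", "] #color"))
```

Because it is a generator, it can be fed an open file object directly, so the file is still processed one line at a time.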
You can do this with a single regex by using the non-greedy *? quantifier. E.g., assuming you want to keep both the "look for this line" and the "until this line is found" lines, and discard only the lines in between, you could do:
>>> import re
>>> my_regex = re.compile("(look for this line)" +
...                       ".*?" +  # match as few chars as possible
...                       "(until this line is found)",
...                       re.DOTALL)
>>> new_str = my_regex.sub(r"\1\n\2", old_str)
A few notes:
The re.DOTALL flag tells Python that "." can match newlines; by default it matches any character except a newline.
The parentheses define numbered match groups, which are referenced later as r"\1\n\2" so that the first and last lines are kept (the "\n" restores the line break that ".*?" consumed). Use a raw string here: in an ordinary string literal "\1" is the control character chr(1), not a group reference. If you want to discard either or both of those lines, drop the \1 and/or the \2. E.g., to keep the first but not the last use my_regex.sub(r"\1", old_str); to get rid of both use my_regex.sub("", old_str).
For further explanation, see: http://docs.python.org/library/re.html or search for "non-greedy regular expression" in your favorite search engine.
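To see why the non-greedy form matters, here is a small sketch comparing it with the greedy form on text that contains two blocks. The START and END markers and the variable names are made up for illustration.

```python
import re

text = "a\nSTART\nx\nEND\nb\nSTART\ny\nEND\nc\n"

# Non-greedy: each START..END pair is matched on its own.
lazy = re.compile(r"(START\n).*?(END)", re.DOTALL)
# Greedy: one match runs from the first START to the last END.
greedy = re.compile(r"(START\n).*(END)", re.DOTALL)

lazy_result = lazy.sub(r"\1\2", text)
greedy_result = greedy.sub(r"\1\2", text)
```

With the greedy pattern the line "b" between the two blocks is deleted as well, which is usually not what is wanted.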
This works:
s="""Beginning of the file...
stuff
look for this line
delete this line
delete this line also
until this line is found
stuff as well
the rest of the file... """
import re
print re.sub(r'(^look for this line$).*?(^until this line is found$)',
             r'\1\n\2', s, count=1, flags=re.DOTALL | re.MULTILINE)
prints:
Beginning of the file...
stuff
look for this line
until this line is found
stuff as well
the rest of the file...
You can also use list slices to do this:
mStart='look for this line'
mStop='until this line is found'
li=s.split('\n')
print '\n'.join(li[0:li.index(mStart)+1]+li[li.index(mStop):])
Same output.
I like re for this (being a Perl guy at heart...)
A file has n number of lines in blocks of logically defined strings. I'm parsing each line and capturing the required data based on some matching conditions.
I have read through each line and finding the blocks with this code:
#python
for lines in file.readlines():
    if re.match(r'block.+',lines)!= None:
        block_name = re.match(r'block.+', lines).group(0)
        # string matching code to be added here
Input File:
line1 select KT_TT=$TMTL/$SYSNAME.P1
line2 . $dhe/ISFUNC sprfl tm/tm1032 int 231
line3 select IT_TT=$TMTL/$SYSNAME.P2
line4 . $DHE/ISFUNC ptoic ca/ca256 tli 551
.....
.....
line89 CALLING IK02=$TMTL/$SYSNAME.P2
line90 CALLING KK01=$TMTL/$SYSNAME.P1
Matching conditions & expected output of each step:
1. While reading the lines, match the word "/ISFUNC", fetch the characters from the end of the line back to the last "/", and save them to a variable. Expected o/p -> tm1032 int 231, ca256 tli 551 (matching string found in line2 & line4, etc.)
2. Once ISFUNC is found, read the immediately preceding line and fetch the data from that line, starting from the last character back to the last "/", and save it to a variable. Expected o/p -> $SYSNAME.P1 & $SYSNAME.P2 (line1 & line3, etc.)
3. Continue reading the lines down and look for the line starting with "CALLING" whose last string after "/" matches the o/p of step 2 ($SYSNAME.P1 & $SYSNAME.P2). Just capture the data after the word CALLING and save it. Expected o/p -> KK01 (line90) & IK02 (line89)
final output should be like
FUNC            SYS          CALL
tm1032 int 231  $SYSNAME.P1  KK01
ca256 tli 551   $SYSNAME.P2  IK02
If all you need is the text after the last slash, you don't need a regex at all.
Simply use .split("/") on each line and take the last part after the slash:
sample = "$dhe/ISFUNC sprfl tm/tm1032 int 231"
sample.split("/")
will result in
['$dhe', 'ISFUNC sprfl tm', 'tm1032 int 231']
and then just access the last element of the list using -1 indexing to get the value
PS : Use the split function once you have found the corresponding line
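As a short illustration of the -1 indexing mentioned above, using the same sample string (the variable names are only for this sketch):

```python
sample = "$dhe/ISFUNC sprfl tm/tm1032 int 231"

# Split on every slash and take the final piece.
last_part = sample.split("/")[-1]

# rsplit with maxsplit=1 gives the same piece while splitting only once,
# starting from the right.
same_part = sample.rsplit("/", 1)[-1]
```

Both variables end up holding 'tm1032 int 231'.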
While reading the lines, match the word "/ISFUNC" and fetch the characters from the last till it matches a "/" and save it to a variable. Expected o/p->tm1032 int 231 (matching string found in line2)
char_list = re.findall(r'/ISFUNC.*/(.*)$', line)
if char_list:
    chars = char_list[0]
Once ISFUNC is found, read the immediate previous line and fetch the data from that line, start from the last character till it matches a "/" and save it to a variable. Expected o/p->$SYSNAME.P1 (line 1)
The ideal approach here is to either (a) read the lines into a list once and iterate through the list indices rather than the lines themselves (i.e. lines = file.readlines(), then for i in range(len(lines)): ... lines[i - 1]) or (b) maintain a copy of the last line (say, put last_line = line at the end of your for loop). Then, reference that last line for this expression (the \n in the character class keeps the trailing newline out of the capture):
data_list = re.findall(r'/([^/\n]*)$', last_line)
if data_list:
    data = data_list[0]
Continue reading the lines down and look for the line starting with "CALLING" and the last string after "/" should match with o/p of step 2($SYSNAME.P1). Just capture the data after CALLING word and save it. expected o/p -> KK01 (line 90)
Assuming, from your example, you mean just the data immediately after CALLING (i.e. up until the equals sign):
calling_list = re.findall(r'CALLING\s+(.*?)=.*/' + re.escape(data) + '$', line)
if calling_list:
    calling = calling_list[0]
You can move the parentheses around to change what from that line exactly you want to capture. re.findall() will output a list of matches, including only the bits inside the parentheses that were matched.
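Putting the three steps together, one possible sketch (Python 3) looks like this. The function name build_rows, the CALLING_RE pattern and the assumption that each system name appears in only one ISFUNC block are mine, not something stated in the question.

```python
import re

# Captures the name after CALLING and the text after the last slash.
CALLING_RE = re.compile(r'CALLING\s+(\w+)=.*/([^/]*)$')


def build_rows(lines):
    """Return [func, sys, call] rows in the order the /ISFUNC lines appear."""
    rows = []
    by_sys = {}      # sys name -> its row, so CALLING lines can fill in "call"
    last_line = ''
    for raw in lines:
        line = raw.rstrip()
        if '/ISFUNC' in line:
            func = line.rsplit('/', 1)[-1]            # step 1
            sys_name = last_line.rsplit('/', 1)[-1]   # step 2
            row = [func, sys_name, None]
            rows.append(row)
            by_sys[sys_name] = row
        else:
            m = CALLING_RE.search(line)               # step 3
            if m and m.group(2) in by_sys:
                by_sys[m.group(2)][2] = m.group(1)
        last_line = line
    return rows


sample = [
    "line1 select KT_TT=$TMTL/$SYSNAME.P1",
    "line2 . $dhe/ISFUNC sprfl tm/tm1032 int 231",
    "line3 select IT_TT=$TMTL/$SYSNAME.P2",
    "line4 . $DHE/ISFUNC ptoic ca/ca256 tli 551",
    "line89 CALLING IK02=$TMTL/$SYSNAME.P2",
    "line90 CALLING KK01=$TMTL/$SYSNAME.P1",
]
rows = build_rows(sample)
```

Keeping the previous line in last_line is option (b) above; the dictionary is only there so that CALLING lines can be matched in any order.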
In order to make sure I start and stop reading a text file exactly where I want to, I am providing 'start1'<->'end1', 'start2'<->'end2' as tags in between the text file and providing that to my python script. In my script I read it as:
start_end = ['start1','end1']
line_num = []
with open(file_path) as fp1:
    for num, line in enumerate(fp1, 1):
        for i in start_end:
            if i in line:
                line_num.append(num)
fp1.close()
print '\nLine number: ', line_num
fp2 = open(file_path)
for k, line2 in enumerate(fp2):
    for x in range(line_num[0], line_num[1] - 1):
        if k == x:
            header.append(line2)
fp2.close()
This works well until I reach start10 <-> end10 and further. Eg. it checks if I have "start2" in the line and also reads the text that has "start21" and similarly for end tag as well. so providing "start1, end1" as input also reads "start10, end10". If I replace the line:
if i in line:
with
if i == line:
it throws an error.
How can I make sure that the script reads the line that contains ONLY "start1" and not "start10"?
import re
prog = re.compile('start1$')
if prog.match(line):
    print line
That should return None if there is no match and return a regex match object if the line matches the compiled regex. The '$' at the end of the regex says that's the end of the line, so 'start1' works but 'start10' doesn't.
or another way..
def test(line):
    import re
    prog = re.compile('start1$')
    return prog.match(line) != None
> test('start1')
True
> test('start10')
False
Since your markers are always at the end of the line, change:
start_end = ['start1','end1']
to:
start_end = ['start1\n','end1\n']
You probably want to look into regular expressions. The Python re library has some good regex tools. It would let you define a string to compare your line to and it has the ability to check for start and end of lines.
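As one way to apply that suggestion, the sketch below (Python 3) accepts a tag only when it is not followed by another digit, so 'start1' no longer matches inside 'start10'. The helper name marker_line_numbers and the sample lines are hypothetical.

```python
import re


def marker_line_numbers(lines, tags):
    """Return 1-based numbers of lines containing a tag not followed by a digit."""
    # (?!\d) is a negative lookahead: the tag must not be followed by a digit.
    patterns = [re.compile(re.escape(tag) + r'(?!\d)') for tag in tags]
    return [num for num, line in enumerate(lines, 1)
            if any(p.search(line) for p in patterns)]


lines = ['intro\n', 'start1\n', 'body\n', 'end1\n', 'start10\n', 'x\n', 'end10\n']
first = marker_line_numbers(lines, ['start1', 'end1'])
tenth = marker_line_numbers(lines, ['start10', 'end10'])
```

Since search() is used rather than match(), the tag may appear anywhere in the line, as in the original "if i in line" test.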
If you can control the input file, consider adding an underscore (or any non-number character) to the end of each tag.
'start1_'<->'end1_'
'start10_'<->'end10_'
The regular expression solution presented in other answers is more elegant, but requires using regular expressions.
You can do this with find():
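for num, line in enumerate(fp1, 1):
    for i in start_end:
        if i in line:
            # make sure the marker is not followed by another digit
            # (e.g. 'start1' inside 'start10'); slicing avoids an
            # IndexError when the marker is the very end of the line
            end = line.find(i) + len(i)
            if not line[end:end + 1].isdigit():
                line_num.append(num)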
I am trying to open a text file, parse it for specific regex patterns, and, when I find a pattern, write the matched text to another text file.
Specifically a list of IP Addresses which I want to parse specific ones out of.
So the file may have
10.10.10.10
9.9.9.9
5.5.5.5
6.10.10.10
And say I want just the IPs that end in 10 (the regex I think I am good with). My example looks for the 10.180.42, or 41.XX IP hosts, but I will adjust as needed.
I've tried several methods and failed miserably at them all. It's days like this I know why I just never mastered any language. But I'm committed to Python so here goes.
import re
textfile = open("SymantecServers.txt", 'r')
matches = re.findall('^10.180\.4[3,1].\d\d',str(textfile))
print(matches)
This gives me empty brackets. I had to wrap the textfile in the str function or it just puked. I don't know if this is right.
This just failed all over the place no matter how I fine tuned it.
f = open("SymantecServers.txt","r")
o = open("JustIP.txt",'w', newline="\r\n")
for line in f:
    pattern = re.compile("^10.180\.4[3,1].\d\d")
    print(pattern)
    #o.write(pattern)
    #o.close()
    f.close()
I did get one working but it just returned the entire line (including netmask and other text like hostname, which are all on the same line in the text file). I just want the IP.
Any help on how to read a text file and if it has a pattern of IP grab the full IP and write that into another text file so I end up with a text file with a list of just the IPs I want. I am 3 hours into it and behind on work so going to do the first file by hand...
I am just at a loss what I am missing. Sorry for being a newbie
here is it working:
>>> s = """10.10.10.10
... 9.9.9.9
... 5.5.5.5
... 10.180.43.99
... 6.10.10.10"""
>>> re.findall(r'10\.180\.4[31]\.\d\d', s)
['10.180.43.99']
You do not really need to add line boundaries, as you're matching a very specific IP address. If your file does not have weird things like '123.23.234.10.180.43.99.21354' that you don't want to match, it should be OK!
Your syntax of [3,1] matches either 3, 1 or a comma, and you don't want to match against a comma ;-)
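The effect of the comma in the character class, and of the unescaped dots in the original pattern, can be checked with a small sketch. The candidate strings here are invented for the demonstration.

```python
import re

# Pattern from the question: unescaped dots match any character,
# and [3,1] also accepts a literal comma.
loose = re.compile(r'^10.180\.4[3,1].\d\d')
# Tightened pattern: every dot escaped, class limited to 3 or 1.
strict = re.compile(r'^10\.180\.4[31]\.\d\d')

candidates = ['10.180.43.99', '10.180.41.10', '10x180.4,y12', '10.180.42.10']
loose_hits = [c for c in candidates if loose.match(c)]
strict_hits = [c for c in candidates if strict.match(c)]
```

The loose pattern also accepts the nonsense string '10x180.4,y12', while the strict one keeps only the two real addresses.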
about your function:
r = re.compile(r'10\.180\.4[31]\.\d\d')
with open("SymantecServers.txt","r") as f:
    with open("JustIP.txt",'w', newline="\r\n") as o:
        for line in f:
            matches = r.findall(line)
            for match in matches:
                # write one IP per line; without "\n" they would run together
                o.write(match + "\n")
though if I were you, I'd extract IPs using:
r = re.compile(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')
with open("SymantecServers.txt","r") as f:
    with open("JustIP.txt",'w', newline="\r\n") as o:
        for line in f:
            matches = r.findall(line)
            for match in matches:
                a, b, c, d = match.split('.')
                # an octet can be at most 255
                if int(a) <= 255 and int(b) <= 255 and int(c) in (43, 41) and int(d) < 100:
                    o.write(match + "\n")
or another way to do it:
r = re.compile(r'(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})')
with open("SymantecServers.txt","r") as f:
    with open("JustIP.txt",'w', newline="\r\n") as o:
        for line in f:
            # match() only looks at the start of the line
            m = r.match(line)
            if m:
                a, b, c, d = m.groups()
                if int(a) <= 255 and int(b) <= 255 and int(c) in (43, 41) and int(d) < 100:
                    # group(0) is the whole matched IP
                    o.write(m.group(0) + "\n")
which uses the regex to split the IP address into groups.
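For lines that carry a netmask and hostname after the address, a variant that scans the whole line may be handy. This is a sketch under my own assumptions: the helper name wanted_ips, the 10.180 prefix check and the sample lines are illustrative, and the last octet is not restricted to two digits here.

```python
import re

IP_RE = re.compile(r'\b(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\b')


def wanted_ips(lines, third_octets=(41, 43)):
    """Collect 10.180.x.y addresses whose third octet is in third_octets."""
    found = []
    for line in lines:
        # finditer scans the whole line, so text after the IP is ignored.
        for m in IP_RE.finditer(line):
            a, b, c, d = (int(g) for g in m.groups())
            if max(a, b, c, d) <= 255 and (a, b) == (10, 180) and c in third_octets:
                found.append(m.group(0))
    return found


sample = [
    "10.10.10.10 255.255.255.0 hostA",
    "10.180.43.99 255.255.255.0 hostB",
    "10.180.41.7 255.255.255.0 hostC",
    "6.10.10.10",
]
ips = wanted_ips(sample)
```

Writing the result out is then a matter of joining the list with "\n" and writing it to the output file.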
What you're missing is that you're doing a re.compile() which creates a Regular Expression object in Python. You're forgetting to match.
You could try:
# This isn't the best way to match IPs, but if it fits your use case keep it for now.
import re
pattern = re.compile(r"^10\.180\.4[13]\.\d\d")
f = open("SymantecServers.txt",'r')
o = open("JustIP.txt",'w')
for line in f:
    m = pattern.match(line)
    if m is not None:
        print("Match: %s" % m.group(0))
        o.write(m.group(0) + "\n")
f.close()
o.close()
This compiles the pattern into a regular expression object, attempts to match each line against it, and then prints the current match. I can avoid having to split my matches, but I have to pay attention to match groups, hence group(0) for the whole match.
You can also call re.search() directly with the pattern string, but if you run the same regular expression many times, it becomes more worthwhile to compile it once.
Also note that I moved the f.close() to the outside of the for loop.
I am trying to parse a particular text file. I am trying to open the text file and, line by line, check whether a particular string is there (in the following example it's the presence of the number 01 in the curly brackets), then manipulate a particular string either forwards, backwards, or keep it the same. Here's that example, with one line named arbitrarily "go" (other lines in the full file have a similar format but have {01}, {00}, etc.):
go = 'USC_45774-1111-0 <hkxhk> {10} ; 78'
go = go.replace(go[22:24],go[23:21:-1])
>>> go
'USC_45774-1111-0 <khxkh> {10} ; 78'
I am trying to manipulate the first "hk" (go[22:24]) by replacing it with the same letters but backwards (go[23:21:-1]). What I want is to see khxhk but, as you can see, the result I am getting is that both are turned backwards to khxkh.
I am also having a problem executing the specific if statement for each line. Many lines that don't have {01} are being manipulated as if they did.
with open('c:/LG 1A.txt', 'r') as rfp:
    with open('C:/output5.txt', 'w') as wfp:
        for line in rfp.readlines():
            if "{01}" or "{-1}" in line:
                line = line.replace(line[25:27],line[26:24:-1])
                line = line.replace("<"," ")
                line = line.replace(">"," ")
                line = line.replace("x"," ")
                wfp.write(line)
            elif "{10}" or "{1-}" in line:
                line = line.replace(line[22:24],line[23:21:-1])
                line = line.replace("<"," ")
                line = line.replace(">"," ")
                line = line.replace("x"," ")
                wfp.write(line)
            elif "{11}" in line:
                line = line.replace(line[22:27],line[26:21:-1])
                line = line.replace("<"," ")
                line = line.replace(">"," ")
                line = line.replace("x"," ")
                wfp.write(line)
        wfp.close()
Am I missing something simple?
The string replace method does not replace characters by position, it replaces them by what characters they are.
>>> 'apple aardvark'.replace('a', '!')
'!pple !!rdv!rk'
So in your first case, you are telling to replace "hk" with "kh". It doesn't "know" that you want to only replace one of the occurrences; it just knows you want to replace "hk" with "kh", so it replaces all occurrences.
You can use the count argument of replace() to specify that you only want to replace the first occurrence:
>>> go = 'USC_45774-1111-0 <hkxhk> {10} ; 78'
>>> go.replace(go[22:24],go[23:21:-1],1)
'USC_45774-1111-0 <khxhk> {10} ; 78'
Note, though, that this will always replace the first occurrence, not necessarily the occurrence at the position in the string you specified. In this case I guess that's what you want, but it may not work directly for other similar tasks. (That is, there is no way to use this method as-is to replace the second occurrence or the third occurrence; you can only replace the first, or the first two, or the first three, etc. To replace the second or third occurrence you'd need to do a bit more.)
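If the goal really is to change the text at a given position rather than a given substring, slicing avoids replace() altogether. A minimal sketch; the helper name reverse_slice and the short sample string are made up here.

```python
def reverse_slice(text, start, stop):
    """Return text with the characters in text[start:stop] reversed in place."""
    return text[:start] + text[start:stop][::-1] + text[stop:]


tag = '<hkxhk>'
first_reversed = reverse_slice(tag, 1, 3)    # only the first "hk"
second_reversed = reverse_slice(tag, 4, 6)   # only the second "hk"
```

Because the string is rebuilt from three slices, other occurrences of the same letters elsewhere in the line are left alone.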
As for the second part of your question, you are misunderstanding what if "{01}" or "{-1}" in line means. It means, in layman's terms, if "{01}", or if "{-1}" in line. Since "{01}" on its own is always true (i.e., the string "{01}" is not a false value), the whole condition is always true. What you want is if "{01}" in line or "{-1}" in line.
I don't know what it is about Python, but your problem is one that gets posted here at least a couple times every day.
if "{01}" or "{-1}" in line:
This doesn't do what you think it does. It asks, "is "{01}" true"? Because it's a non-zero-length string, it is. Because or short-circuits, the rest of the condition is not tested because the first argument is true. Therefore the body of your if statement is always executed.
In other words, Python evaluates as if you'd written this:
if ("{01}") or ("{-1}" in line):
You want something like:
if "{01}" in line or "{-1}" in line:
Or if you have a lot of similar conditions:
if any(x in line for x in ("{01}", "{-1}")):
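The difference between the forms is easy to check side by side. In this sketch the function names are hypothetical and the test strings are invented.

```python
def always_true(line):
    # Parsed as ("{01}") or ("{-1}" in line); the non-empty string wins.
    return bool("{01}" or "{-1}" in line)


def explicit(line):
    return "{01}" in line or "{-1}" in line


def with_any(line, markers=("{01}", "{-1}")):
    return any(marker in line for marker in markers)


plain = "no markers here"
marked = "abc {-1} def"
results = [(always_true(s), explicit(s), with_any(s)) for s in (plain, marked)]
```

For the line without markers only always_true() reports a hit, which is exactly the behaviour described in the question.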
You can use the count argument of replace():
'USC_45774-1111-0 <hkxhk> {10} ; 78'.replace("hk","kh",1)
For your second question, you need to change the condition to:
if "{01}" in line or "{-1}" in line:
    ...
I am trying to use textwrap to format an import file that is quite particular in how it is formatted. Basically, it is as follows (line length shortened for simplicity):
abcdef <- Ok line
abcdef
 ghijk <- Note leading space to indicate wrapped line
 lm
Now, I have got code to work as follows:
wrapper = TextWrapper(width=80, subsequent_indent=' ', break_long_words=True, break_on_hyphens=False)
for l in lines:
    wrapline=wrapper.wrap(l)
This works nearly perfectly, however, the text wrapping code doesn't do a hard break at the 80 character mark, it tries to be smart and break on a space (at approx 20 chars in).
I have got round this by replacing all spaces in the string list with a unique character (#), wrapping them and then removing the character, but surely there must be a cleaner way?
N.B Any possible answers need to work on Python 2.4 - sorry!
A generator-based version might be a better solution for you, since it wouldn't need to load the entire string in memory at once:
def hard_wrap(input, width, indent=' '):
    for line in input:
        indent_width = width - len(indent)
        yield line[:width]
        line = line[width:]
        while line:
            yield '\n' + indent + line[:indent_width]
            line = line[indent_width:]
Use it like this:
from StringIO import StringIO # Makes strings look like files
s = """abcdefg
abcdefghijklmnopqrstuvwxyz"""
for line in hard_wrap(StringIO(s), 12):
    print line,
Which prints:
abcdefg
abcdefghijkl
 mnopqrstuvw
 xyz
It sounds like you are disabling most of the functionality of TextWrapper, and then trying to add a little of your own. I think you'd be better off writing your own function or class. If I understand you right, you're simply looking for lines longer than 80 chars, and breaking them at the 80-char mark, and indenting the remainder by one space.
For example, this:
s = """\
This line is fine.
This line is very long and should wrap, It'll end up on a few lines.
A short line.
"""
def hard_wrap(s, n, indent):
    wrapped = ""
    n_next = n - len(indent)
    for l in s.split('\n'):
        first, rest = l[:n], l[n:]
        wrapped += first + "\n"
        while rest:
            chunk, rest = rest[:n_next], rest[n_next:]
            wrapped += indent + chunk + "\n"
    return wrapped
print hard_wrap(s, 20, " ")
produces:
This line is fine.
This line is very lo
 ng and should wrap,
  It'll end up on a
 few lines.
A short line.