Loop the remaining elements within a loop - python

I have the following text:
ERROR: <C:\Includes\Library1.inc:123> This is the Error
Call Trace:
<C:\Includes\Library2.inc:456>
<C:\Includes\Library2.inc:789>
<C:\Code\Main.ext:12>
<Line:1>
ERROR: <C:\Includes\Library2.inc:2282> Another Error
Call Trace:
<C:\Code\Main.ext:34>
<C:\Code\Main.ext:56>
<C:\Code\Main.ext:78>
<Line:1>
ERROR: <C:\Code\Main.ext:90> Error Three
I would like to extract the following information:
line, Error = 12, This is the Error
line, Error = 34, Another Error
line, Error = 90, Error Three
Here is how far I got:
import re
theText = 'ERROR: ...'
ERROR_RE = re.compile(r'^ERROR: <(?P<path>.*):(?P<line>[0-9]+)> (?P<error>.*)$')
mainName = r'\Main.ext'
# Go through each line
for fullline in theText.splitlines():
    match = ERROR_RE.match(fullline)
    if match:
        path, line, error = match.group('path'), match.group('line'), match.group('error')
        if path.endswith(mainName):
            callSomething(line, error)
        # else check next line for 'Call Trace:'
        # check next lines for mainName and get the linenumber
        # callSomething(linenumber, error)
What is the pythonic way to loop the remaining elements within a loop?
Solution:
http://codepad.org/BcYmybin

The direct answer to your question, regarding how to loop over remaining lines, is: change the first line of the loop to
lines = theText.splitlines()
for (linenum, fullline) in enumerate(lines):
Then, after a match, you can get at the remaining lines by looking at lines[j] in an inner loop where j starts at linenum+1 and runs until the next match.
However, a slicker way to solve the problem is to first split the text into blocks. There are many ways to do this; being a former Perl user, my impulse is to use regular expressions.
# Split into blocks that start with /^ERROR/ and run until either the next
# /^ERROR/ or the end of the string.
#
# (?m) - lets '^' and '$' match the beginning/end of each line
# (?s) - lets '.' match newlines
# ^ERROR - triggers the beginning of the match
# .*? - grab characters in a non-greedy way, stopping when the following
#       expression matches
# (?=^ERROR|$(?!\n)) - match until the next /^ERROR/ or the end of string
# $(?!\n) - match end of string. Normally '$' suffices, but since we turned
#           on multiline mode with '(?m)' we have to use '$(?!\n)' to prevent
#           this from matching at the end of every line.
blocks = re.findall(r'(?ms)^ERROR.*?(?=^ERROR|$(?!\n))', theText)
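As a rough sketch, the block-splitting idea could be carried through to the requested output as follows. FRAME_RE, MAIN_NAME and first_main_lines are names invented for this illustration, and the sample text is the log from the question; treat it as one possible arrangement, not the only one.

```python
import re

SAMPLE_TEXT = r"""ERROR: <C:\Includes\Library1.inc:123> This is the Error
Call Trace:
<C:\Includes\Library2.inc:456>
<C:\Includes\Library2.inc:789>
<C:\Code\Main.ext:12>
<Line:1>
ERROR: <C:\Includes\Library2.inc:2282> Another Error
Call Trace:
<C:\Code\Main.ext:34>
<C:\Code\Main.ext:56>
<C:\Code\Main.ext:78>
<Line:1>
ERROR: <C:\Code\Main.ext:90> Error Three
"""

# Same block pattern as above, written as a raw string.
BLOCK_RE = re.compile(r'(?ms)^ERROR.*?(?=^ERROR|$(?!\n))')
ERROR_RE = re.compile(
    r'^ERROR: <(?P<path>.*):(?P<line>[0-9]+)> (?P<error>.*)$', re.MULTILINE)
# Hypothetical pattern for one call-trace frame such as <C:\Code\Main.ext:12>.
FRAME_RE = re.compile(r'^<(?P<path>.*):(?P<line>[0-9]+)>$', re.MULTILINE)
MAIN_NAME = r'\Main.ext'


def first_main_lines(text):
    """Return (line number, error message) for the first Main.ext hit per block."""
    results = []
    for block in BLOCK_RE.findall(text):
        header = ERROR_RE.match(block)
        if header is None:
            continue  # not a well-formed ERROR line; skip the block
        error = header.group('error')
        if header.group('path').endswith(MAIN_NAME):
            # The error itself was raised in Main.ext.
            results.append((int(header.group('line')), error))
            continue
        # Otherwise take the first call-trace frame that points into Main.ext.
        for frame in FRAME_RE.finditer(block):
            if frame.group('path').endswith(MAIN_NAME):
                results.append((int(frame.group('line')), error))
                break
    return results


for line_no, error in first_main_lines(SAMPLE_TEXT):
    print('line, Error = %d, %s' % (line_no, error))
```

Each block is handled independently, so no state has to be carried from one ERROR entry to the next.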

Replace this:
# else check next line for 'Call Trace:'
# check next lines for mainName and get the linenumber
# callSomething(linenumber, error)
With this:
    match = stackframe_re.match(fullline)
    if match and error: # if error is defined from earlier when you matched ERROR_RE
        path, line = match.group('path'), match.group('line')
        if path.endswith(mainName):
            callSomething(line, error)
            error = None # don't report this error again if you see main again
Note the indentation. Also initialize error = None before the loop begins and set error = None after the first call to callSomething. In general, the code I've suggested should work for properly-formatted data, but you might want to improve it so it doesn't give misleading results if the data does not match the format you expect.
You will have to write stackframe_re, but it should be an RE that matches, for example,
<C:\Includes\Library2.inc:789>
I don't really understand what you mean when you say "loop the remaining elements within a loop". A loop continues to the remaining elements by default.
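For reference, the single-pass approach described in this answer might look like the sketch below once stackframe_re is written out. The STACKFRAME_RE pattern and the scan function are assumptions made for illustration, and the results are collected in a list instead of being passed to callSomething.

```python
import re

ERROR_RE = re.compile(r'^ERROR: <(?P<path>.*):(?P<line>[0-9]+)> (?P<error>.*)$')
# Assumed shape of a call-trace frame: <path:line> and nothing else on the line.
STACKFRAME_RE = re.compile(r'^<(?P<path>.*):(?P<line>[0-9]+)>$')
MAIN_NAME = r'\Main.ext'


def scan(text):
    """Walk the lines once, remembering the pending error until Main.ext shows up."""
    found = []
    error = None  # message of the ERROR line still waiting for a Main.ext frame
    for fullline in text.splitlines():
        match = ERROR_RE.match(fullline)
        if match:
            error = match.group('error')
            if match.group('path').endswith(MAIN_NAME):
                found.append((match.group('line'), error))
                error = None
            continue
        match = STACKFRAME_RE.match(fullline)
        if match and error is not None:
            if match.group('path').endswith(MAIN_NAME):
                found.append((match.group('line'), error))
                error = None  # report only the first Main.ext frame
    return found
```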

Related

substring extract in a file using Python Regex

A file has n number of lines in blocks of logically defined strings. I'm parsing each line and capturing the required data based on some matching conditions.
I have read through each line and finding the blocks with this code:
#python
for lines in file.readlines():
    if re.match(r'block.+', lines) is not None:
        block_name = re.match(r'block.+', lines).group(0)
        # string matching code to be added here
Input File:
line1 select KT_TT=$TMTL/$SYSNAME.P1
line2 . $dhe/ISFUNC sprfl tm/tm1032 int 231
line3 select IT_TT=$TMTL/$SYSNAME.P2
line4 . $DHE/ISFUNC ptoic ca/ca256 tli 551
.....
.....
line89 CALLING IK02=$TMTL/$SYSNAME.P2
line90 CALLING KK01=$TMTL/$SYSNAME.P1
Matching conditions & expected output of each step:
1. While reading the lines, match the word "/ISFUNC" and fetch the characters from the end of the line back to the last "/", then save them to a variable. Expected o/p -> tm1032 int 231, ca256 tli 551 (matching strings found in line2, line4, etc.)
2. Once ISFUNC is found, read the immediately preceding line and fetch the data from that line, starting from the last character back to the last "/", then save it to a variable. Expected o/p -> $SYSNAME.P1 and $SYSNAME.P2 (line1, line3, etc.)
3. Continue reading the lines and look for a line starting with "CALLING" whose last string after "/" matches the o/p of step 2 ($SYSNAME.P1 and $SYSNAME.P2). Capture just the data after the word CALLING and save it. Expected o/p -> KK01 (line90) and IK02 (line89)
final output should be like
FUNC SYS CALL
tm1032 int 231 $SYSNAME.P1 KK01
ca256 tli 551 $SYSNAME.P2 IK02
If all you need is the text after the last slash, you do not need a regex at all.
Simply call .split("/") on each line and take the last part after the slash:
sample = "$dhe/ISFUNC sprfl tm/tm1032 int 231"
sample.split("/")
will result in
['$dhe', 'ISFUNC sprfl tm', 'tm1032 int 231']
and then just access the last element of the list using -1 indexing to get the value.
PS: Use the split function once you have found the corresponding line.
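A small sketch of the -1 indexing step, using the sample string from above. The rsplit variant is an alternative added here for comparison; it splits only once, from the right.

```python
sample = "$dhe/ISFUNC sprfl tm/tm1032 int 231"

# Split on every slash and take the final piece.
last_part = sample.split("/")[-1]

# Equivalent result, but only one split is performed.
same_part = sample.rsplit("/", 1)[-1]

print(last_part)
```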
While reading the lines, match the word "/ISFUNC" and fetch the characters from the last till it matches a "/" and save it to a variable. Expected o/p->tm1032 int 231 (matching string found in line2)
char_list = re.findall(r'/ISFUNC.*/(.*)$', line)
if char_list:
    chars = char_list[0]
Once ISFUNC is found, read the immediate previous line and fetch the data from that line, start from the last character till it matches a "/" and save it to a variable. Expected o/p->$SYSNAME.P1 (line 1)
The ideal approach here is to either (a) read the lines into a list once and iterate over the list indices rather than the lines themselves (i.e. lines = file.readlines(), then for i in range(len(lines)): ... lines[i - 1]) or (b) keep a copy of the previous line (say, put last_line = line at the end of your for loop). Then reference that previous line in this expression:
data_list = re.findall(r'/([^/]*)$', last_line)
if data_list:
    data = data_list[0]
Continue reading the lines down and look for the line starting with "CALLING" and the last string after "/" should match with o/p of step 2($SYSNAME.P1). Just capture the data after CALLING word and save it. expected o/p -> KK01 (line 90)
Assuming, from your example, that you mean just the data immediately after CALLING (i.e. up until the equals sign):
calling_list = re.findall(r'CALLING\s+(.*)=.*/' + re.escape(data) + '$', line)
if calling_list:
    calling = calling_list[0]
You can move the parentheses around to change what from that line exactly you want to capture. re.findall() will output a list of matches, including only the bits inside the parentheses that were matched.
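Putting the three steps together, a minimal sketch could look like this. It assumes the "line1", "line2" prefixes in the question are only labels and not part of the file, and that every CALLING line can be collected into a dictionary keyed by the text after its last slash; build_table and SAMPLE_LINES are names invented for the example.

```python
import re

# Sample input, without the "lineN" labels (assumed not to be in the real file).
SAMPLE_LINES = [
    "select KT_TT=$TMTL/$SYSNAME.P1",
    ". $dhe/ISFUNC sprfl tm/tm1032 int 231",
    "select IT_TT=$TMTL/$SYSNAME.P2",
    ". $DHE/ISFUNC ptoic ca/ca256 tli 551",
    "CALLING IK02=$TMTL/$SYSNAME.P2",
    "CALLING KK01=$TMTL/$SYSNAME.P1",
]


def build_table(lines):
    """Return (FUNC, SYS, CALL) rows; CALL is None when no CALLING line matches."""
    rows = []        # (func, sys) pairs in the order the ISFUNC lines appear
    calls = {}       # text after the last slash of a CALLING line -> call name
    last_line = ""   # previous line, needed for step 2
    for line in lines:
        func_match = re.search(r'/ISFUNC.*/(.*)$', line)
        if func_match:
            sys_name = last_line.rsplit('/', 1)[-1]
            rows.append((func_match.group(1), sys_name))
        call_match = re.match(r'CALLING\s+(\w+)=.*/([^/]*)$', line)
        if call_match:
            calls[call_match.group(2)] = call_match.group(1)
        last_line = line
    # Join the two halves once the whole input has been seen.
    return [(func, sys_name, calls.get(sys_name)) for func, sys_name in rows]


for row in build_table(SAMPLE_LINES):
    print(*row)
```

Collecting the CALLING lines in a dictionary avoids rescanning the file for each ISFUNC entry.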

Regex vs readline for text processing

I have text to process (router output) and want to generate a useful data structure from it (a dictionary with interface names as keys and packet counts as values). I have two approaches to the same task. I would like to know which one I should use for efficiency and which one looks more prone to fail on bigger data samples.
ReadLine1 gets a list of lines, processes the output, and writes into the dictionary with the interface name as key and the next three items as values.
Readline2 uses the re module to match groups, and from the groups it writes the dictionary keys and values.
The input self.output to these functions will be something like this:
message =
"""
Interface 1/1\n\t
input : 1234\n\t
output : 3456\n\t
dropped : 12\n
\n
Interface 1/2\n\t
input : 7123\n\t
output : 2345\n\t
dropped : 31\n\t
"""
def ReadLine1(self):
    lines = self.output.splitlines()
    for index, line in enumerate(lines):
        if "Interface" in line:
            valuelist = []
            for i in [1,2,3]:
                valuelist.append((lines[index+i].split(':'))[1].strip())
            self.IFlist[line.split()[1]] = valuelist
    return self.IFlist
def Readline2(self):
    #print repr(self.output)
    n = re.compile(r"\n*Interface (./.)\n\s*input : ([0-9]+)\n\s*output : ([0-9]+)\n\s*dropped : ([0-9]+)",re.MULTILINE|re.DOTALL)
    blocks = self.output.split('\n\n')
    for block in blocks:
        m_object = re.match(n, block)
        self.IFlist[m_object.group(1)] = [m_object.group(i) for i in (2,3,4)]
Both of your methods use specific aspects of the format to achieve the parsing you are trying to do, and if that format was changed / broken one of the methods could also break...
For example, if you added a space to the empty line between the two entries (which you cannot see), then blocks = self.output.split('\n\n') would fail to find two consecutive newline characters and the regex version would miss the second entry:
{'1/1': ['1234', '3456', '12']}
Or if you added an extra newline between input and output, like this:
Interface 1/2
input : 7123

output : 2345
dropped : 31
The regex \s* would deal with the extra whitespace fine, but the non-regex parsing assumes that lines[index+i].split(':') has an index [1], so it would raise an IndexError with that data.
Or if you added some extra space at the end of any line, then the regex would fail to see the newline right after the content and re.match(n, block) would return None, so the next line would raise an AttributeError: 'NoneType' object has no attribute 'group'.
Or if you changed Interface to interface for one of the entries (no longer a capital I), then the regex version would raise the same error as above, but the non-regex version would simply ignore that entry.
While I was testing it I found that the regex was easier to break with small edits to the sample message, but I also found that the version I made using a generator expression and str.partition was significantly more robust than both of them:
def readline3(self):
    gen_lines = (line for line in self.output.splitlines()
                 if line and not line.isspace())
    try:
        while True: #ended when next() throws a StopIteration
            start,_,key = next(gen_lines).partition(" ")
            if start == "Interface":
                self.IFlist[key] = [next(gen_lines).rpartition(" : ")[2]
                                    for _ in "123"]
    except StopIteration: # reached end of output
        return self.IFlist
This succeeded in every case mentioned above and a few more, and since the only method it relies on is str.partition, which always returns a 3-item tuple, there is nothing to raise any unexpected errors unless self.output is something other than a string.
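The same idea can be tried outside a class. The sketch below is a standalone variant, not the code that was benchmarked: parse_interfaces is a made-up name, it strips each line before partitioning, and it uses next() with a default so that a truncated final entry yields empty strings instead of raising.

```python
def parse_interfaces(output):
    """Map interface name -> [input, output, dropped] as strings."""
    counters = {}
    # Drop blank and whitespace-only lines, and trim the rest.
    lines = (line.strip() for line in output.splitlines() if line.strip())
    for line in lines:
        start, _, key = line.partition(" ")
        if start == "Interface":
            # Consume the next three non-blank lines; keep the text after ':'.
            counters[key] = [next(lines, "").rpartition(":")[2].strip()
                             for _ in range(3)]
    return counters


message = (
    "\nInterface 1/1\n\tinput : 1234\n\n\toutput : 3456  \n\tdropped : 12\n"
    " \nInterface 1/2\n\tinput : 7123\n\toutput : 2345\n\tdropped : 31\n\t"
)
print(parse_interfaces(message))
```

The sample message deliberately includes an extra blank line, a whitespace-only separator line and trailing spaces, which are the cases discussed above.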
Also, running a benchmark using timeit, your readline1 was consistently faster than readline2, and my readline3 was usually slightly slower than readline1:
#using the default 1000000 loops using 'message'
<function readline1 at 0x100756f28>
11.225649802014232
<function readline2 at 0x1057e3950>
14.838601427007234
<function readline3 at 0x1057e39d8>
11.693351223017089

converting matrix from logfile

I have a matrix written in this format inside a log file:
2014-09-08 14:10:20,107 - root - INFO - [[ 8.30857546 0.69993454 0.20645551
77.01797674 13.76705776]
[ 8.35205432 0.53417203 0.19969048 76.78598173 14.12810144]
[ 8.37066492 0.64428449 0.18623849 76.4181809 14.3806312 ]
[ 8.50493296 0.5110043 0.19731849 76.45838604 14.32835821]
[ 8.18900791 0.4955451 0.22524777 76.96966663 14.12053259]]
...some text
2014-09-08 14:12:22,211 - root - INFO - [[ 3.25142253e+01 1.11788106e+00 1.51065008e-02 6.16496299e+01
4.70315726e+00]
[ 3.31685887e+01 9.53522041e-01 1.49767860e-02 6.13449154e+01
4.51799710e+00]
[ 3.31101827e+01 1.09729703e+00 5.03347259e-03 6.11818594e+01
4.60562742e+00]
[ 3.32506957e+01 1.13837592e+00 1.51783456e-02 6.08651657e+01
4.73058437e+00]
[ 3.26809490e+01 1.06617279e+00 1.00110121e-02 6.17429172e+01
4.49994994e+00]]
I am writing this matrix using the python logging package:
logging.info(conf_mat)
However, logging.info does not show me a method to write the matrix in a float %.3f format. So I decided to parse the log file this way:
conf_mat = [[]]
cf = '[+-]?(?=\d*[.eE])(?=\.?\d)\d*\.?\d*(?:[eE][+-]?\d+)?'
with open(sys.argv[1]) as f:
    for line in f:
        epoch = re.findall(ep, line) # find lines starting with epoch for other stuff
        if epoch:
            error_line = next(f) # grab the next line, which is the error line
            error_value = error_line[error_line.rfind('=')+1:]
            data_points.append(map(float,epoch[0]+(error_value,))) #get the error value for the specific epoch
            for i in range(N):
                cnf_mline = next(f)
                match = re.findall(cf, cnf_mline)
                if match:
                    conf_mat[count].append(map(float,match))
                else:
                    conf_mat.append([])
                    count += 1
However, the regex does not catch the break in the line when looking at the matrix, when I try to convert the matrix using
conf_mtx = np.array(conf_mat)
Your regex string cf should be a raw string literal:
cf = r'[+-]?(?=\d*[.eE])(?=\.?\d)\d*\.?\d*(?:[eE][+-]?\d+)?'
Backslash \ characters can be interpreted as escape sequences in "regular" strings, which is not what you want in a regex. (This particular pattern only uses \d and \., which Python does not treat as escape sequences, so it happens to behave the same either way, but relying on that is fragile.) You can read about raw string literals at the top of the re module's documentation, and in this excellent SO answer. Alex Martelli explains them quite well, so I won't repeat everything he says here. Suffice it to say that were you not to use a raw literal, you'd have to escape each and every one of your backslashes with another backslash, and that just gets ugly and annoying fast.
As for the rest of your code, it won't run without more information. The N in for i in range(N): is undefined, as is count a few lines later. Calling cnf_mline = next(f) inside the loop is also a problem: each call consumes a line from the same file iterator that for line in f: is walking, so you can run out of lines (and skip some) before the outer loop has seen all of them. It's unclear whether your data really has that line break in the second half, where one of the members of the list is on the next line; I assume that's the case because of the next attempt.
I think you should first try to clean up your input file into a regular format; then you'll have a much easier time running regular expressions on it. In order to work on subsequent lines without exhausting the file iterator through repeated calls to next(), check out itertools.tee(). It returns n independent iterators from a single iterable, allowing you to advance the second a line ahead of the first. Alternatively, you could read your file's lines into a list and just operate using indices i and i+1. Just strip each line, join them together, and write to a new file or list. You can then rewrite your matching loop to simply pull each number of the appropriate format out and insert it into your matrix at the correct position. The good news is that your regex caught everything I threw at it, so you won't need to modify anything there.
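One way to do that clean-up, sketched under the assumption that every matrix starts with "[[" on a log line and ends with "]]" some lines later. parse_matrices is a hypothetical helper, not part of the original code; it returns nested lists of floats and leaves the conversion to a NumPy array to the caller.

```python
import re

FLOAT_RE = re.compile(r'[+-]?(?=\d*[.eE])(?=\.?\d)\d*\.?\d*(?:[eE][+-]?\d+)?')


def parse_matrices(lines):
    """Collect every [[ ... ]] block, even when rows wrap across lines."""
    matrices = []
    buffer = None  # text of the matrix currently being assembled, if any
    for line in lines:
        if buffer is None:
            start = line.find('[[')
            if start == -1:
                continue  # ordinary log line, nothing to do
            buffer = line[start:]  # drop the timestamp and logger prefix
        else:
            buffer += ' ' + line   # continuation of a wrapped row
        if ']]' in buffer:
            body = buffer[:buffer.index(']]')]
            # Each ']' closes one row; pull the floats out of every row.
            rows = [[float(x) for x in FLOAT_RE.findall(row)]
                    for row in body.split(']')]
            matrices.append([row for row in rows if row])
            buffer = None
    return matrices
```

Joining the wrapped lines before applying the regex means the line break inside a row no longer matters.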
Good luck!

finding and replacing a string within a line with an if statement

I am trying to parse a particular text file. I want to open the file and, line by line, check whether a particular string is present (in the following example, the number 01 in curly brackets), then reverse a particular substring or leave it unchanged. Here's that example, with one line arbitrarily named "go" (other lines in the full file have a similar format but have {01}, {00}, etc.):
>>> go = 'USC_45774-1111-0 <hkxhk> {10} ; 78'
>>> go = go.replace(go[22:24],go[23:21:-1])
>>> go
'USC_45774-1111-0 <khxkh> {10} ; 78'
I am trying to manipulate the first "hk" (go[22:24]) by replacing it with the same letters backwards (go[23:21:-1]). What I want to see is khxhk but, as you can see, both occurrences are reversed, giving khxkh.
I am also having a problem executing the specific if statement for each line. Many lines that don't have {01} are being manipulated as if they did...
with open('c:/LG 1A.txt', 'r') as rfp:
    with open('C:/output5.txt', 'w') as wfp:
        for line in rfp.readlines():
            if "{01}" or "{-1}" in line:
                line = line.replace(line[25:27],line[26:24:-1])
                line = line.replace("<"," ")
                line = line.replace(">"," ")
                line = line.replace("x"," ")
                wfp.write(line)
            elif "{10}" or "{1-}" in line:
                line = line.replace(line[22:24],line[23:21:-1])
                line = line.replace("<"," ")
                line = line.replace(">"," ")
                line = line.replace("x"," ")
                wfp.write(line)
            elif "{11}" in line:
                line = line.replace(line[22:27],line[26:21:-1])
                line = line.replace("<"," ")
                line = line.replace(">"," ")
                line = line.replace("x"," ")
                wfp.write(line)
        wfp.close()
Am I missing something simple?
The string replace method does not replace characters by position, it replaces them by what characters they are.
>>> 'apple aardvark'.replace('a', '!')
'!pple !!rdv!rk'
So in your first case, you are telling it to replace "hk" with "kh". It doesn't "know" that you want to replace only one of the occurrences; it just knows you want to replace "hk" with "kh", so it replaces all occurrences.
You can use the count argument of replace to specify that you only want to replace the first occurrence:
>>> go = 'USC_45774-1111-0 <hkxhk> {10} ; 78'
>>> go.replace(go[22:24],go[23:21:-1],1)
'USC_45774-1111-0 <khxhk> {10} ; 78'
Note, though, that this will always replace the first occurrence, not necessarily the occurrence at the position in the string you specified. In this case I guess that's what you want, but it may not work directly for other similar tasks. (That is, there is no way to use this method as-is to replace the second occurrence or the third occurrence; you can only replace the first, or the first two, or the first three, etc. To replace the second or third occurrence you'd need to do a bit more.)
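If the goal really is to change the characters at a given position, slicing avoids replace altogether. A minimal sketch, with reverse_span as an illustrative helper name and a shortened sample string:

```python
def reverse_span(text, start, stop):
    """Return text with text[start:stop] reversed and everything else untouched."""
    return text[:start] + text[start:stop][::-1] + text[stop:]


sample = "<hkxhk>"
print(reverse_span(sample, 1, 3))  # reverses only the first pair
print(reverse_span(sample, 4, 6))  # reverses only the second pair
```

Because the result is built from slices, it does not matter how often the same letters occur elsewhere in the string.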
As for the second part of your question, you are misunderstanding what if "{01}" or "{-1}" in line means. It means, in layman's terms, if "{01}" or if "{-1}" in line. Since if "{01}" is always true (i.e., the string "{01}" is not a false value), the whole condition is always true. What you want is if "{01}" in line or "{-1}" in line.
I don't know what it is about Python, but your problem is one that gets posted here at least a couple times every day.
if "{01}" or "{-1}" in line:
This doesn't do what you think it does. It asks, "is "{01}" true"? Because it's a non-zero-length string, it is. Because or short-circuits, the rest of the condition is not tested because the first argument is true. Therefore the body of your if statement is always executed.
In other words, Python evaluates as if you'd written this:
if ("{01}") or ("{-1}" in line):
You want something like:
if "{01}" in line or "{-1}" in line:
Or if you have a lot of similar conditions:
if any(x in line for x in ("{01}", "{-1}")):
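The difference is easy to check interactively. In this small sketch the sample line is the one from the question and the variable names are made up for illustration:

```python
line = "USC_45774-1111-0 <hkxhk> {10} ; 78"

# Buggy form: parsed as ("{01}") or ("{-1}" in line); a non-empty string is truthy.
buggy = bool("{01}" or "{-1}" in line)

# Intended form: test each tag against the line.
fixed = "{01}" in line or "{-1}" in line

# Same test written with any(), convenient when there are many tags.
any_form = any(tag in line for tag in ("{01}", "{-1}"))

print(buggy, fixed, any_form)
```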
You can use the count argument of replace():
'USC_45774-1111-0 <hkxhk> {10} ; 78'.replace("hk","kh",1)
For your second question, you need to change the condition to:
if "{01}" in line or "{-1}" in line:
    ...

delete only lines after match1 up to match2

I have checked and played with various examples and it appears that my problem is a bit more complex than what I have been able to find. What I need to do is search for a particular string and then delete the following line and keep deleting lines until another string is found. So an example would be the following:
a
b
color [
0 0 0,
1 1 1,
3 3 3,
] #color
y
z
Here, "color [" is match1, and "] #color" is match2. So then what is desired is the following:
a
b
color [
] #color
y
z
This "simple to follow" code example will get you started; you can tweak it as needed. Note that it processes the file line by line, so it will work with a file of any size.
start_marker = 'startdel'
end_marker = 'enddel'
with open('data.txt') as inf:
    ignoreLines = False
    for line in inf:
        if start_marker in line:
            print(line, end='')
            ignoreLines = True
        if end_marker in line:
            ignoreLines = False
        if not ignoreLines:
            print(line, end='')
It uses startdel and enddel as "markers" for starting and ending the ignoring of data.
Update:
Modified the code based on a request in the comments; it now includes/prints the lines that contain the "markers".
Given this input data (borrowed from @drewk):
Beginning of the file...
stuff
startdel
delete this line
delete this line also
enddel
stuff as well
the rest of the file...
it yields:
Beginning of the file...
stuff
startdel
enddel
stuff as well
the rest of the file...
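The same line-by-line logic can be wrapped in a generator so it works on any iterable of lines, not only an open file. This is a sketch; drop_between is a name chosen for illustration, and it is shown with the color example from the question.

```python
def drop_between(lines, start_marker, end_marker):
    """Yield lines, skipping those strictly between the two marker lines."""
    skipping = False
    for line in lines:
        if skipping:
            if end_marker in line:
                skipping = False
                yield line      # keep the closing marker line
        else:
            yield line          # normal line, or the opening marker line
            if start_marker in line:
                skipping = True


data = ["a", "b", "color [", "0 0 0,", "1 1 1,", "3 3 3,", "] #color", "y", "z"]
for kept in drop_between(data, "color [", "] #color"):
    print(kept)
```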
You can do this with a single regex by using the non-greedy *? quantifier. E.g., assuming you want to keep both the "look for this line" and the "until this line is found" lines, and discard only the lines in between, you could do:
>>> my_regex = re.compile("(look for this line)"+
...                       ".*?"+ # match as few chars as possible
...                       "(until this line is found)",
...                       re.DOTALL)
>>> new_str = my_regex.sub(r"\1\n\2", old_str)
A few notes:
The re.DOTALL flag tells Python that "." can match newlines; by default it matches any character except a newline.
The parentheses define "numbered match groups", which are then used later when I say r"\1\n\2" to make sure that we don't discard the first and last line. The replacement must be a raw string: in a normal string, "\1" is the character with code 1, not a backreference. The \n puts back the line break that .*? consumed. If you did want to discard either or both of those lines, then just get rid of the \1 and/or the \2. E.g., to keep the first but not the last use my_regex.sub(r"\1", old_str); or to get rid of both use my_regex.sub("", old_str).
For further explanation, see: http://docs.python.org/library/re.html or search for "non-greedy regular expression" in your favorite search engine.
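A short, self-contained check of the backreference behaviour, applied to the color example from the question. The pattern here is an assumption tailored to that sample: group 1 includes the newline after the opening line, so the replacement needs no extra line break.

```python
import re

old_str = "a\nb\ncolor [\n0 0 0,\n1 1 1,\n3 3 3,\n] #color\ny\nz\n"

# Group 1 keeps the opening line with its newline; group 2 keeps the closing line.
pattern = re.compile(r"(color \[\n).*?(\] #color)", re.DOTALL)
new_str = pattern.sub(r"\1\2", old_str)
print(new_str, end="")

# Without the r prefix, "\1\2" is two control characters, not two backreferences.
assert "\1\2" == "\x01\x02"
```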
This works:
s="""Beginning of the file...
stuff
look for this line
delete this line
delete this line also
until this line is found
stuff as well
the rest of the file... """
import re
print(re.sub(r'(^look for this line$).*?(^until this line is found$)',
             r'\1\n\2',s,count=1,flags=re.DOTALL | re.MULTILINE))
prints:
Beginning of the file...
stuff
look for this line
until this line is found
stuff as well
the rest of the file...
You can also use list slices to do this:
mStart='look for this line'
mStop='until this line is found'
li=s.split('\n')
print('\n'.join(li[0:li.index(mStart)+1]+li[li.index(mStop):]))
Same output.
I like re for this (being a Perl guy at heart...)
