File operation starts again from first while looping through the file - python

I'm trying to find a certain word in a file and want to print the next line when a condition is met.
f = open('/path/to/file.txt','r')
lines = f.readlines()
for line in lines:
    if 'P/E' in line:
        n = lines.index(line) #get index of current line
        print(lines[n+1]) #print the next line
f.close()
The string 'P/E' will be present 4 times in the file, each time in a different line.
When executed, the code prints the next line after the first 2 occurrences of 'P/E' normally. It then again goes back and prints the same first 2 occurrences again and exits. The loop is not proceeding after those first 2 occurrences; it kind of repeats the process and exits.
I checked the data file to see if my output is the actual result, but all next lines are different after 'P/E'.
How can I resolve this? Thanks.

list.index() with just one argument only finds the first occurrence. To find elements past the previous index you'd have to give it a starting point: list.index() takes a second argument that tells it where to start searching from.
However, you don't need to use lines.index(); that's very inefficient; it requires a full scan through the list, testing each line until a match is found.
Just use the enumerate() function to add indices as you loop:
for index, line in enumerate(lines):
    if 'P/E' in line:
        print(lines[index + 1])
Be careful: there is a chance index + 1 is not a valid index. If you find 'P/E' in the very last line of the lines list, you'll get an IndexError. You may have to add an and index + 1 < len(lines) test to the if condition.
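A minimal sketch of both points, using made-up data (the sample list and the helper name lines_after are assumptions for illustration, not part of the question): when identical lines repeat, list.index() keeps returning the first position, while enumerate() with a bounds check handles each occurrence once.

```python
def lines_after(lines, needle):
    """Return the line following each line that contains needle."""
    found = []
    for index, line in enumerate(lines):
        # The bounds check skips a match on the very last line.
        if needle in line and index + 1 < len(lines):
            found.append(lines[index + 1])
    return found


# Hypothetical data: the three 'P/E' lines are identical strings.
sample = ["P/E\n", "12.5\n", "P/E\n", "30.1\n", "P/E\n"]

# list.index() always reports the first equal element ...
first_only = [sample.index(line) for line in sample if "P/E" in line]
# ... unless it is given a start position as its second argument.
second = sample.index("P/E\n", 1)

print(first_only)
print(second)
print(lines_after(sample, "P/E"))
```

The last 'P/E' in the sample sits on the final line, so the bounds check drops it instead of raising IndexError.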
Note that using file.readlines() reads all of the file into memory in one go. Try to avoid this; you could loop directly over the file, and remember the previous line instead:
with open('/path/to/file.txt','r') as f:
    previous = ''
    for line in f:
        if 'P/E' in previous:
            print(line) # print this line
        previous = line # remember for the next iteration

Related

When iterating through lines in txt file, how can I capture multiple subsequent lines after a regex triggers?

I have a txt file:
This is the first line of block 1. It is always identifiable
Random
Stuff
This is the first line of block 2. It is always identifiable
Is
Always
This is the first line of block 3. It is always identifiable
In
Here!
I want to iterate through each line and look for the following code to trigger and capture a fixed amount of lines following:
for line in lines:
    match = re.compile(r'(.*)block 2.(.*)').search(line)
    if match:
        #capture current line and the following 2 lines
        pass
After parsing the txt file, I want to return:
This is the first line of block 2
Is
Always
In my particular example, the first line of my block is always identifiable. There is a consistent row count per block. The contents of lines >= 2 will always change and cannot reliably be returned when using regex.
You can call the next() function to get the next element in the iterator.
def get_block2(lines):
    for line in lines:
        match = re.compile(r'block 2\b').search(line) # match 'block 2' anywhere in the line
        if match:
            line2 = next(lines)
            line3 = next(lines)
            return line, line2, line3
This assumes lines is an iterator, so you can just grab the following lines from it.
block2 = re.compile(r'block 2\b')
for l in lines:
    if block2.search(l):
        res = [l, next(lines), next(lines)]
        break
print(res)
If lines isn't an iterator, you just have to add lines = iter(lines) to the code.
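As a self-contained variant of the same idea, here is a sketch that uses itertools.islice instead of bare next() calls, so a block cut short by the end of the input does not raise StopIteration. The function name capture_block and its extra parameter are hypothetical; the sample list mirrors the file from the question.

```python
import re
from itertools import islice


def capture_block(lines, pattern, extra=2):
    """Return the first matching line plus the `extra` lines after it."""
    lines = iter(lines)  # harmless if `lines` is already an iterator
    for line in lines:
        if pattern.search(line):
            # islice consumes up to `extra` further items and simply
            # stops if the input ends early.
            return [line] + list(islice(lines, extra))
    return []


sample = [
    "This is the first line of block 1. It is always identifiable\n",
    "Random\n",
    "Stuff\n",
    "This is the first line of block 2. It is always identifiable\n",
    "Is\n",
    "Always\n",
    "This is the first line of block 3. It is always identifiable\n",
    "In\n",
    "Here!\n",
]
block = capture_block(sample, re.compile(r"block 2\b"))
print("".join(block), end="")
```

Because the loop and islice share one iterator, the lines taken by islice are not visited again by the for loop.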

Error with .readlines()[n]

I'm a beginner with Python.
I tried to solve the problem: "If we have a file containing <1000 lines, how to print only the odd-numbered lines? ". That's my code:
with open(r'C:\Users\Savina\Desktop\rosalind_ini5.txt') as f:
    n=1
    num_lines=sum(1 for line in f)
    while n<num_lines:
        if n/2!=0:
            a=f.readlines()[n]
            print(a)
            break
        n=n+2
where n is a counter and num_lines calculates how many lines the file contains.
But when I try to execute the code, it says:
"a=f.readlines()[n]
IndexError: list index out of range"
Why doesn't it recognize n as a counter?
You have put the call to readlines inside a loop, but this is not its intended use, because readlines ingests the whole of the file at once, returning a LIST of newline-terminated strings.
You may want to save such a list and operate on it:
list_of_lines = open(filename).readlines() # CPython closes the file once the object is garbage-collected; a with block makes that explicit
odd = 1
for line in list_of_lines:
    if odd : print(line, end='')
    odd = 1-odd
Two remarks:
odd is alternating between 1 (hence true when argument of an if) or 0 (hence false when argument of an if),
the optional argument end='' to the print function is required because each line in list_of_lines is terminated by a new line character, if you omit the optional argument the print function will output a SECOND new line character at the end of each line.
Coming back to your code, you can fix its behavior using a
f.seek(0)
before the loop to rewind the file to its beginning position and using the
f.readline() method (note: readline, NOT readlines) inside the loop,
but rest assured that proceeding like this is, let's say, a bit unconventional...
Finally, it is possible to do everything you want with a one-liner
print(''.join(open(filename).readlines()[::2]))
that uses the slice notation for lists and the string method .join()
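If loading the whole file into a list is a concern, the same every-other-line selection can be sketched with itertools.islice, which steps through the file lazily. In this sketch io.StringIO stands in for an open file, and the sample text is made up.

```python
import io
from itertools import islice

# io.StringIO stands in for an open text file in this sketch.
fake_file = io.StringIO("line 1\nline 2\nline 3\nline 4\nline 5\n")

# Take items 0, 2, 4, ... which are the 1st, 3rd, 5th, ... lines.
odd_lines = list(islice(fake_file, 0, None, 2))
print("".join(odd_lines), end="")
```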
Well, I'd personally do it like this:
def print_odd_lines(some_file):
    with open(some_file) as my_file:
        for index, each_line in enumerate(my_file, start=1): # number the lines starting from 1
            if index % 2 == 1: # check if the line number is odd
                print(each_line) # if it is, print it
if __name__ == '__main__':
    print_odd_lines(r'C:\Users\Savina\Desktop\rosalind_ini5.txt') # raw string, so the backslashes are not treated as escapes
Be aware that this will leave a blank line after each printed line, because print() adds its own newline to the one already at the end of the line. I'm sure you can figure out how to get rid of it.
This code will do exactly as you asked:
with open(r'C:\Users\Savina\Desktop\rosalind_ini5.txt') as f:
    for i, line in enumerate(f.readlines()): # Iterate over each line and add an index (i) to it.
        if i % 2 == 0: # i starts at 0 in python, so if i is even, the line is odd
            print(line)
To explain what happens in your code:
A file object can only be read through once. After that it has to be rewound with f.seek(0), or closed and reopened.
You first iterate over the entire file in num_lines=sum(1 for line in f). Now the object f is exhausted.
Then, inside the loop, you call f.readlines(). This tries to go through all the lines again, but none are left in f, so it returns an empty list and indexing it with [n] raises the IndexError. Even with that fixed, calling readlines() on every pass would read the entire file each time. It is faster to go through it once (as in the solutions offered to your question).
As a fix, you need to type
f.close()
f = open(r'C:\Users\Savina\Desktop\rosalind_ini5.txt')
every time after you read through the file, in order to get back to the start.
As a side note, you should look up the modulus operator % for finding odd numbers. In Python 3, n/2!=0 is true for every n other than 0, so it does not test for oddness.
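The exhaustion effect is easy to reproduce. In this small sketch io.StringIO stands in for the real file (an assumption made to keep the example self-contained); counting the lines consumes the stream, readlines() then returns an empty list, and seek(0) makes the content readable again.

```python
import io

# io.StringIO behaves like an open text file for this demonstration.
f = io.StringIO("a\nb\nc\n")

num_lines = sum(1 for line in f)   # consumes the whole stream
after_count = f.readlines()        # nothing left: an empty list

f.seek(0)                          # rewind to the beginning
after_seek = f.readlines()         # all lines are available again

print(num_lines, after_count, after_seek)
```

Indexing after_count with any n is what produces "list index out of range".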

Reset iteration index after using next() Python [duplicate]

I am trying to edit a text file using fileinput.input(filename, inplace=1)
The text file has say 5 lines:
line 0
line 1
line 2
line 3
line 4
I wish to change data of line 1 based on info in line 2.
So I use a for loop
infile = fileinput.input(filename, inplace=1)
for line in infile:
    if(line2Data):
        #do something on line1
        print(line, end='')
    else:
        line1=next(infile)
        line2=next(infile)
        #do something with line2
Now my problem is that after the 1st iteration line is set to line2, so in the 2nd iteration line is set to line3. I want line to be set to line1 in the 2nd iteration. I have tried line = line but it doesn't work.
Can you please let me know how I can reset the iteration index on line, which gets changed due to next()?
PS: This is a simple example of a huge file and function I am working on.
As far as I know (and that is not much) there is no way of resetting an iterator. This SO question may be useful. Since you say the file is huge, what I can think of is to process only part of the data. Following nosklo's answer in this SO question, I would try something like this (but that is really just a first guess):
while True:
    for line in open('really_big_file.dat'):
        process_data(line)
        if some_condition:
            break # leave the for loop; the while loop then reopens the file from the start
OK, your requirement that you might want to start from the previous index is not captured by this attempt.
There is no way to reset the iterator, but there is nothing stopping you from doing some of your processing before you start your loop:
import fileinput
infile = fileinput.input("foo.txt")
first_lines = [next(infile) for x in range(3)]
first_lines[1] = first_lines[1].strip() + " this is line2 > " + first_lines[2]
print("".join(first_lines), end="") # each line already ends with a newline
for line in infile:
    print(line, end="")
This uses next() to read the first 3 lines into a list. It then updates line1 based on line2 and prints all of them. It then continues to print the rest of the file using a normal loop.
For your sample, the output would be:
line 0
line 1 this is line2 > line 2
line 2
line 3
line 4
Note, if you are trying to modify the first lines of the file itself, rather than just display them, you would need to write the whole file to a new file. Writing to a file does not work like in a word processor where all the lines move down when a line or character is added. It works as if you were in overwrite mode.
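Another way to avoid resetting anything is to look one line ahead while iterating, so each line is seen together with the line that follows it. The sketch below is one possible approach using itertools.tee and zip_longest. The helper name with_next and the rule applied in the loop are hypothetical, a plain list stands in for the file, and it has not been tried together with fileinput's inplace mode.

```python
from itertools import tee, zip_longest


def with_next(iterable):
    """Yield (current, following) pairs; following is None at the end."""
    current, following = tee(iterable)
    next(following, None)  # advance the second copy by one item
    return zip_longest(current, following)


lines = ["line 0\n", "line 1\n", "line 2\n", "line 3\n", "line 4\n"]
output = []
for line, following in with_next(lines):
    # Hypothetical rule: change a line when the next line mentions "2".
    if following is not None and "2" in following:
        note = " (changed because of: " + following.strip() + ")\n"
        line = line.rstrip("\n") + note
    output.append(line)
print("".join(output), end="")
```

Every line is still visited exactly once as the current line, so nothing is skipped the way it is when next() is called inside the loop.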

Python 3.4 - Capture block of text based on single string

I have searched far and wide and I hope someone can either point me to the link I missed or help me out with this logic.
We have a script that goes out and collects logs from various devices and places them in text files. Within these text files there is a time stamp and we need to collect the few lines of text before and after this time stamp.
I already have a script that matches the time stamps and removes them for certain reports (included below) but I cannot figure out how to match the time stamp and then capture the surrounding lines.
import re
pre_list = []
regex_time_stamp = re.compile(r'\d{2}:\d{2}:\d{2}|\d{1,2}y\d{1,2}w\d{1,2}d|\d{1,2}w\d{1,2}d|\d{1,2}d\d{1,2}h')
with open(filename, 'r') as f:
    h = f.readlines()
for line in h:
    if regex_time_stamp.search(line) is not None:
        new_line = re.sub(regex_time_stamp, '', line)
        pre_list.append(new_line)
    else:
        pre_list.append(line)
Any assistance would be greatly appreciated! Thanks for taking the time to read this.
The basic algorithm is to remember the three most recently read lines. When you match a header, read the next two lines and then combine them with the header and the last three lines that you've saved.
Alternatively, since you're saving all of the lines in a list, simply keep track of which element is the current element, and when you find a header you can go back and get the previous two and next two elements.
Catch with duplicated lines
I agree with the basic algorithm by @Bryan-Oakley and @TigerhawkT3, however there's a catch:
What if several lines match consecutively?
You could end up duplicating "context" lines by printing the last 2 lines of the first match, and then the last 2 lines of the second match... that would actually also contain the previous matched line.
The solution is to keep track of which line number was last printed, in order to print just enough lines before the current matched line.
Flexible context parameter
What if also you want to print 3 lines before and after instead of 2? Then you need to keep track of more lines.
What if you want only 1 ?
Then your number of lines to print needs to be a parameter and the algorithm needs to use it.
Sample input and output
Here's a file sample that contains the word MATCH instead of your timestamp, for clarity. The other lines contain NOT + line number
==
NOT 0
NOT 1
NOT 2
NOT 3
NOT 4
MATCH LINE 5
NOT 6
NOT 7
NOT 8
NOT 9
MATCH LINE 10
MATCH LINE 11
NOT 12
MATCH LINE 13
NOT 14
==
The output should be:
==
NOT 3
NOT 4
LINE 5
NOT 6
NOT 8
NOT 9
LINE 10
LINE 11
NOT 12
LINE 13
NOT 14
==
Solution
This solution iterates on the file and keeps track of:
what is the last line that was printed? This will take care of not duplicating "context" lines if matched lines come in sequence.
what is the last line that was matched? This will tell the program to print the current line if it is "close" to the last matched line. How close? This is determined by your "number of lines to print" parameter. Then we also set the last_line_printed variable to the current line index.
Here's a simplified algorithm in English:
When matching a line we will:
print the last N lines, from the last_line_printed variable to the current index
print the current line after stripping the timestamp
set the last_line_printed = last_line_matched = current line index
continue
When not matching a line we will:
print the current line if current_index < last_line_matched index + number_of_lines_to_print
Of course we're taking care of whether we're close to the beginning of file by checking limits
Not print but return an array
This solution doesn't print directly but returns an array with all the lines to print. That's just a bit classier.
I like to name my "return" variable result but that's just me. It makes it obvious what is the result variable during the whole algorithm.
Code
You can try this code with the input above, it'll print the same output.
def search_timestamps_context(filename, number_of_lines_to_print=2):
    import re
    result = []
    regex_time_stamp = re.compile(r'\d{2}:\d{2}:\d{2}|\d{1,2}y\d{1,2}w\d{1,2}d|\d{1,2}w\d{1,2}d|\d{1,2}d\d{1,2}h')
    # for my test
    regex_time_stamp = re.compile('MATCH')
    with open(filename, 'r') as f:
        h = f.readlines()
    # Remember which is the last line printed and matched
    last_line_printed = -1
    last_line_matched = -1
    for idx, line in enumerate(h):
        if regex_time_stamp.search(line) is not None:
            # We want to print the last "number_of_lines_to_print" lines
            # ...unless they were already printed
            # We want to return last lines from idx - number_of_lines_to_print
            # print ('** Matched ', line, idx, last_line_matched, last_line_printed)
            if last_line_printed == -1:
                lines_to_print = max(idx - number_of_lines_to_print, 0)
            else:
                # Unless we've already printed those lines because of a previous match, then we continue
                lines_to_print = max(idx - number_of_lines_to_print, last_line_printed + 1)
            for l in h[lines_to_print:idx]:
                result.append(l)
            # Now print the stripped line
            new_line = re.sub(regex_time_stamp, '', line)
            result.append(new_line)
            # Update the last line printed
            last_line_printed = last_line_matched = idx
        else:
            # If not a match, we still need to print the current line if we had a match N lines before
            if last_line_matched != -1 and idx < last_line_matched + number_of_lines_to_print:
                result.append(line)
                last_line_printed = idx
    return result
filename = 'test_match.txt'
lines = search_timestamps_context(filename, number_of_lines_to_print=2)
print (''.join(lines))
Improvements
The usage of readlines() is inefficient: we are reading the whole file before starting.
It would be more efficient to just iterate, but then we need to remember the last lines in case we need to print them. To achieve that, we would maintain a list of the last N lines, and not more.
That's an exercise left to the reader :)
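For readers who want a starting point, here is one possible sketch of that streaming version, built on collections.deque with a maxlen so that only the last N unprinted lines are remembered. It is not the code above rewritten line for line: the name grep_context is hypothetical, matched lines are yielded unchanged rather than stripped of the time stamp, and it emits exactly N lines after each match.

```python
import re
from collections import deque


def grep_context(lines, pattern, context=2):
    """Yield matching lines with `context` lines before and after each."""
    before = deque(maxlen=context)  # most recent lines not yet emitted
    remaining_after = 0             # trailing lines still owed
    for line in lines:
        if pattern.search(line):
            yield from before       # flush the stored leading context
            before.clear()
            yield line
            remaining_after = context
        elif remaining_after > 0:
            yield line
            remaining_after -= 1
        else:
            before.append(line)     # the deque drops the oldest line itself


sample = ["NOT %d\n" % i for i in range(15)]
for i in (5, 10, 11, 13):
    sample[i] = "MATCH LINE %d\n" % i
result = list(grep_context(sample, re.compile("MATCH")))
print("".join(result), end="")
```

Lines emitted as trailing context are never put back into the deque, which is what prevents duplicates when matches follow each other closely.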

How to grab a chunk of data from a file?

I want to grab a chunk of data from a file. I know the start line and the end line. I wrote the code but its incomplete and I don't know how to solve it further.
file = open(filename,'r')
end_line='### Leave a comment!'
star_line = 'Kill the master'
for line in file:
    if star_line in line:
        ??
startmarker = "ohai"
endmarker = "meheer?"
marking = False
result = []
with open("somefile") as f:
    for line in f:
        if line.startswith(startmarker): marking = True
        elif line.startswith(endmarker): marking = False
        if marking: result.append(line)
if len(result) > 1:
    print("".join(result[1:]))
Explanation: The with block is a nice way to use files -- it makes sure you don't forget to close() it later. The for walks each line and:
starts outputting when it sees a line that starts with 'ohai' (including that line)
stops outputting when it sees a line that starts with 'meheer?' (without outputting that line).
After the loop, result contains the part of the file that is needed, plus that initial marker. Rather than making the loop more complicated to ignore the marker, I just throw it out using a slice: result[1:] returns all elements in result starting at index 1; in other words, it excludes the first element (index 0).
Update to handle partial-line matches:
startmarker = "ohai"
endmarker = "meheer?"
marking = False
result = []
with open("somefile") as f:
    for line in f:
        if not marking:
            index = line.find(startmarker)
            if index != -1:
                marking = True
                result.append(line[index:])
        else:
            index = line.rfind(endmarker)
            if index != -1:
                marking = False
                result.append(line[:index + len(endmarker)])
            else:
                result.append(line)
print("".join(result))
Yet more explanation: marking still tells us whether we should be outputting whole lines, but I've changed the if statements for the start and end markers as follows:
if we're not (yet) marking, and we see the startmarker, then output the current line starting at the marker. The find method returns the position of the first occurrence of startmarker in this case. The line[index:] notation means 'the content of line starting at position index'.
while marking, just output the current line entirely unless it contains endmarker. Here, we use rfind to find the rightmost occurrence of endmarker, and the line[...] notation means 'the content of line up to position index (the start of the match) plus the marker itself.' Also: stop marking now :)
If reading the whole file is not a problem, I would use file.readlines() to read all the lines into a list of strings.
Then you can use list_of_lines.index(value) to find the indices of the first and last line, and select all the lines between these two indices. Note that index() only matches whole elements, so value must equal the entire line, including its trailing newline.
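A rough sketch of that approach follows. It assumes each marker makes up a whole line on its own, which the question does not state, and it compares against newline-stripped copies so the markers can be written without the trailing newline. The helper name slice_between and the sample lines are made up.

```python
def slice_between(lines, start_text, end_text):
    """Return the lines from start_text up to, but excluding, end_text."""
    stripped = [line.rstrip("\n") for line in lines]
    # list.index() needs an exact match, hence the stripped copies.
    start = stripped.index(start_text)
    # Search for the end marker only after the start marker.
    end = stripped.index(end_text, start + 1)
    return lines[start:end]


lines = ["intro\n", "Kill the master\n", "step 1\n", "step 2\n",
         "### Leave a comment!\n", "footer\n"]
chunk = slice_between(lines, "Kill the master", "### Leave a comment!")
print("".join(chunk), end="")
```

list.index() raises ValueError when a marker is missing, so a caller would need to handle that case.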
First, a test file (assuming Bash shell):
for i in {0..100}; do echo "line $i"; done > test_file.txt
That generates a 101-line file with lines line 0\nline 1\n ... line 100\n
This Python script captures the lines from mark1 (inclusive) up to mark2 (exclusive):
#!/usr/bin/env python
mark1 = "line 22"
mark2 = "line 26"
record=False
error=False
buf = []
with open("test_file.txt") as f:
    for line in f:
        if mark1==line.rstrip():
            if error==False and record==False:
                record=True
        if mark2==line.rstrip():
            if record==False:
                error=True
            else:
                record=False
        if record==True and error==False:
            buf.append(line)
if len(buf) > 1 and error==False:
    print("".join(buf))
else:
    print("There was an error in there...")
Prints:
line 22
line 23
line 24
line 25
in this case. If both marks are not found in the correct sequence, it will print an error.
If the size of the file between the marks is excessive, you may need some additional logic. You can also use a regex for each line instead of an exact match if that fits your use case.
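As an illustration of the regex variant, here is a sketch written as a generator, so the lines between the marks are produced one at a time instead of being collected in a buffer. The function name lines_between is hypothetical, the list stands in for test_file.txt, and unlike the script above it does not report marks that appear in the wrong order.

```python
import re


def lines_between(lines, start_re, end_re):
    """Yield lines from a start match up to, but excluding, an end match."""
    recording = False
    for line in lines:
        if not recording and start_re.search(line):
            recording = True
        elif recording and end_re.search(line):
            return  # stop at the first end marker
        if recording:
            yield line


sample = ["line %d\n" % i for i in range(101)]
start = re.compile(r"^line 22$")
end = re.compile(r"^line 26$")
chunk = list(lines_between(sample, start, end))
print("".join(chunk), end="")
```

The `$` anchor matches just before the trailing newline, so the patterns behave like the exact comparison against line.rstrip() used above.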
