I have searched far and wide and I hope someone can either point me to the link I missed or help me out with this logic.
We have a script that goes out and collects logs from various devices and places them in text files. Within these text files there is a time stamp, and we need to collect the few lines of text before and after this time stamp.
I already have a script that matches the time stamps and removes them for certain reports (included below) but I cannot figure out how to match the time stamp and then capture the surrounding lines.
regex_time_stamp = re.compile(r'\d{2}:\d{2}:\d{2}|\d{1,2}y\d{1,2}w\d{1,2}d|\d{1,2}w\d{1,2}d|\d{1,2}d\d{1,2}h')
with open(filename, 'r') as f:
    h = f.readlines()
for line in h:
    if regex_time_stamp.search(line) is not None:
        new_line = re.sub(regex_time_stamp, '', line)
        pre_list.append(new_line)
    else:
        pre_list.append(line)
Any assistance would be greatly appreciated! Thanks for taking the time to read this.
The basic algorithm is to remember the three most recently read lines. When you match a header, read the next two lines and then combine them with the header and the lines you've saved.
Alternately, since you're saving all of the lines in a list, simply keep track of which element is the current element, and when you find a header you can go back and get the previous two and next two elements.
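A minimal sketch of that second approach, assuming two lines of context and a simplified stand-in pattern (the real pattern from the question would drop in unchanged); note that this naive version will duplicate context lines when matches sit close together:

```python
import re

# Simplified stand-in for the timestamp pattern from the question.
regex_time_stamp = re.compile(r'\d{2}:\d{2}:\d{2}')

def context_lines(lines, context=2):
    """Return every matched line (timestamp stripped) with `context` lines around it."""
    result = []
    for i, line in enumerate(lines):
        if regex_time_stamp.search(line):
            result.extend(lines[max(i - context, 0):i])    # the lines before the match
            result.append(regex_time_stamp.sub('', line))  # the match, stripped
            result.extend(lines[i + 1:i + 1 + context])    # the lines after the match
    return result
```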
Catch with duplicated lines
Agreed with the basic algorithm by @Bryan-Oakley and @TigerhawkT3, however there's a catch:
What if several lines match consecutively?
You could end up duplicating "context" lines by printing the last 2 lines of the first match, and then the last 2 lines of the second match... that would actually also contain the previous matched line.
The solution is to keep track of which line number was last printed, in order to print just enough lines before the current matched line.
Flexible context parameter
What if also you want to print 3 lines before and after instead of 2? Then you need to keep track of more lines.
What if you want only 1?
Then your number of lines to print needs to be a parameter and the algorithm needs to use it.
Sample input and output
Here's a file sample that contains the word MATCH instead of your timestamp, for clarity. The other lines contain NOT + line number
==
NOT 0
NOT 1
NOT 2
NOT 3
NOT 4
MATCH LINE 5
NOT 6
NOT 7
NOT 8
NOT 9
MATCH LINE 10
MATCH LINE 11
NOT 12
MATCH LINE 13
NOT 14
==
The output should be:
==
NOT 3
NOT 4
LINE 5
NOT 6
NOT 8
NOT 9
LINE 10
LINE 11
NOT 12
LINE 13
NOT 14
==
Solution
This solution iterates on the file and keeps track of:
what is the last line that was printed? This will take care of not duplicating "context" lines if matched lines come in sequence.
what is the last line that was matched? This will tell the program to print the current line if it is "close" to the last matched line. How close? This is determined by your "number of lines to print" parameter. Then we also set the last_line_printed variable to the current line index.
Here's a simplified algorithm in English:
When matching a line we will:
print the last N lines, from the last_line_printed variable to the current index
print the current line after stripping the timestamp
set the last_line_printed = last_line_matched = current line index
continue
When not matching a line we will:
print the current line if current_index < last_line_matched index + number_of_lines_to_print
Of course, we take care of being close to the beginning of the file by checking the limits.
Return an array instead of printing
This solution doesn't print directly but returns an array with all the lines to print. That's just a bit classier.
I like to name my "return" variable result, but that's just me. It makes it obvious throughout the algorithm which variable holds the result.
Code
You can try this code with the input above, it'll print the same output.
def search_timestamps_context(filename, number_of_lines_to_print=2):
    import re
    result = []
    regex_time_stamp = re.compile(r'\d{2}:\d{2}:\d{2}|\d{1,2}y\d{1,2}w\d{1,2}d|\d{1,2}w\d{1,2}d|\d{1,2}d\d{1,2}h')
    # for my test
    regex_time_stamp = re.compile('MATCH')
    with open(filename, 'r') as f:
        h = f.readlines()
    # Remember which is the last line printed and matched
    last_line_printed = -1
    last_line_matched = -1
    for idx, line in enumerate(h):
        if regex_time_stamp.search(line) is not None:
            # We want to print the last "number_of_lines_to_print" lines
            # ...unless they were already printed
            # We want to return last lines from idx - number_of_lines_to_print
            # print('** Matched', line, idx, last_line_matched, last_line_printed)
            if last_line_printed == -1:
                lines_to_print = max(idx - number_of_lines_to_print, 0)
            else:
                # Unless we've already printed those lines because of a previous match
                lines_to_print = max(idx - number_of_lines_to_print, last_line_printed + 1)
            for l in h[lines_to_print:idx]:
                result.append(l)
            # Now print the stripped line
            new_line = re.sub(regex_time_stamp, '', line)
            result.append(new_line)
            # Update the last lines printed and matched
            last_line_printed = last_line_matched = idx
        else:
            # If not a match, we still need to print the current line
            # if we had a match N lines before
            if last_line_matched != -1 and idx < last_line_matched + number_of_lines_to_print:
                result.append(line)
                last_line_printed = idx
    return result

filename = 'test_match.txt'
lines = search_timestamps_context(filename, number_of_lines_to_print=2)
print(''.join(lines))
Improvements
The usage of readlines() is inefficient: we are reading the whole file before starting.
It would be more efficient to just iterate, but then we need to remember the last lines in case we need to print them. To achieve that, we would maintain a list of the last N lines, and not more.
That's an exercise left to the reader :)
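For the curious, here is one possible sketch of that streaming variant using collections.deque (the function name is mine; it also emits a full N trailing lines after each match, one more than the exact code above):

```python
import re
from collections import deque

regex_time_stamp = re.compile('MATCH')  # test pattern, as in the code above

def search_timestamps_stream(filename, number_of_lines_to_print=2):
    result = []
    last_lines = deque(maxlen=number_of_lines_to_print)  # only the last N lines kept
    after = 0  # how many context lines are still owed after the latest match
    with open(filename) as f:
        for line in f:
            if regex_time_stamp.search(line):
                result.extend(last_lines)  # pending leading context; never duplicated
                last_lines.clear()
                result.append(regex_time_stamp.sub('', line))
                after = number_of_lines_to_print
            elif after > 0:
                result.append(line)  # trailing context for the latest match
                after -= 1
            else:
                last_lines.append(line)  # candidate leading context
    return result
```

Memory use is bounded by N lines instead of the whole file.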
Related
I am trying to write a code to extract longest ORF in a fasta file. It is from Coursera Genomics data science course.
the file is a practice file: "dna.example.fasta"
Data is here:https://d396qusza40orc.cloudfront.net/genpython/data_sets/dna.example.fasta
Part of my code is below; it extracts reading frame 2 (starting from the second position of a sequence; e.g. for the sequence ATTGGG, reading frame 2 is TTGGG):
#!/usr/bin/python
import sys
import getopt

o, a = getopt.getopt(sys.argv[1:], 'h')
opts = dict()
for k, v in o:
    opts[k] = v
    if '-h' in k:
        print "--help\n"
if len(a) < 0:
    print "missing fasta file\n"

f = open(a[0], "r")
seq = dict()
for line in f:
    line = line.strip()
    if line.startswith(">"):
        name = line.split()[0]
        seq[name] = ''
    else:
        seq[name] = seq[name] + line[1:]
k = seq[">gi|142022655|gb|EQ086233.1|323"]
print len(k)
The length of this particular sequence should be 4804 bp. Therefore by using this sequence alone I could get the correct answer.
However, with the code, here in the dictionary, this particular sequence becomes only 4736 bp.
I am new to Python, so I cannot wrap my head around where those missing bp went.
Thank you,
Xio
Take another look at your data file
An example of some of the lines:
>gi|142022655|gb|EQ086233.1|43 marine metagenome JCVI_SCAF_1096627390048 genomic scaffold, whole genome shotgun sequence
TCGGGCGAAGGCGGCAGCAAGTCGTCCACGCGCAGCGCGGCACCGCGGGCCTCTGCCGTGCGCTGCTTGG
CCATGGCCTCCAGCGCACCGATCGGATCAAAGCCGCTGAAGCCTTCGCGCATCAGGCGGCCATAGTTGGC
Notice how the sequences start on the first value of each line.
Your addition line seq[name] = seq[name] + line[1:] is adding everything on that line after the first character, excluding the first (Python 2 indices are zero-based). It turns out your missing number of nucleotides is the number of lines it took to make that genome, because you're losing the first character every time.
The revised way is seq[name] = seq[name] + line which simply adds the line without losing that first character.
The quickest way to find these kind of debugging errors is to either use a formal debugger, or add a bunch of print statements on your code and test with a small portion of the file -- something that you can see the output of and check for yourself if it's coming out right. A short file with maybe 50 nucleotides instead of 5000 is much easier to evaluate by hand and make sure the code is doing what you want. That's what I did to come up with the answer to the problem in about 5 minutes.
Also for future reference, please mention the version of python you are using before hand. There are quite a few differences between python 2 (The one you're using) and python 3.
I did some additional testing with your code, and if you get any extra characters at the end, they might be whitespace. Make sure you use the .strip() method on each line before adding it to your string, which clears whitespace.
Addressing your comment,
To start from the 2nd position on the first line of the sequence only and then use the full lines until the following nucleotide, you can take advantage of the file's linear format and just add one more clause to your if statement, an elif. This will test if we're on the first line of the sequence, and if so, use the characters starting from the second, if we're on any other line, use the whole line.
if line.startswith(">"):
    name = line.split()[0]
    seq[name] = ''
# If it's the first line in the series, then the dict's value
# will be an empty string, so this elif means "If we're at the
# start of the series..."
elif seq[name] == '':
    seq[name] = seq[name] + line[1:]
else:
    seq[name] = seq[name] + line
This adaptation will start from the 2nd nucleotide of the sequence without losing the first character from every subsequent line.
I'm trying to find a certain word in a file and want to print the next line when a condition is met.
f = open('/path/to/file.txt', 'r')
lines = f.readlines()
for line in lines:
    if 'P/E' in line:
        n = lines.index(line)  # get index of current line
        print(lines[n+1])      # print the next line
f.close()
The string 'P/E' will be present 4 times in the file, each time in a different line.
When executed, the code prints the next line after the first 2 occurrences of 'P/E' normally. It then goes back, prints those same first 2 occurrences again, and exits. The loop does not proceed past those first 2 occurrences; it repeats the process and exits.
I checked the data file to see if my output is the actual result, but all next lines are different after 'P/E'.
How can I resolve this? Thanks.
list.index() with just one argument only finds the first occurrence. You'd have to give it a starting point to find elements past the previous index; list.index() takes a second argument that tells it where to start searching from.
However, you don't need to use lines.index(); that's very inefficient; it requires a full scan through the list, testing each line until a match is found.
Just use the enumerate() function to add indices as you loop:
for index, line in enumerate(lines):
    if 'P/E' in line:
        print(lines[index + 1])
Be careful: there is a chance index + 1 is not a valid index; if you find 'P/E' in the very last line of the lines list, you'll get an IndexError. You may have to add an index + 1 < len(lines) test.
Note that using file.readlines() reads all of the file into memory in one go. Try to avoid this; you could loop directly over the file, and remember the previous line instead:
with open('/path/to/file.txt', 'r') as f:
    previous = ''
    for line in f:
        if 'P/E' in previous:
            print(line)      # print this line
        previous = line      # remember it for the next iteration
I want to see only the data from line 5 up to 5 rows before the end of a file, while reading that file.
Code I have used as of now :
f = open("/home/auto/user/ip_file.txt")
lines = f.readlines()[5:]  # this will start from line 5, but how to set the end?
for line in lines:
    print("print line ", line)
Please advise; I am a newbie at Python.
Any suggestions are most welcome too.
You could use a neat feature of slicing: you can count from the end with a negative slice index (see also this question):
lines = f.readlines()[5:-5]
just make sure there are more than 10 lines:
all_lines = f.readlines()
lines = [] if len(all_lines) <= 10 else all_lines[5:-5]
(this is called a ternary operator)
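If the file is too large for readlines(), a possible streaming sketch (the function and parameter names are mine) that holds only 5 lines in memory:

```python
from collections import deque
from itertools import islice

def middle_lines(path, skip=5, drop_tail=5):
    """Yield the lines after the first `skip`, excluding the last `drop_tail`."""
    with open(path) as f:
        buffer = deque(maxlen=drop_tail)
        for line in islice(f, skip, None):   # skip the first `skip` lines
            if len(buffer) == buffer.maxlen:
                yield buffer.popleft()       # safe: at least `drop_tail` lines follow it
            buffer.append(line)
```

On a short file (10 lines or fewer) it simply yields nothing, matching the ternary above.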
New to the site so I apologize if I format this incorrectly.
So I'm searching a file for lines containing
Server[x] ip.ip.ip.ip response=235ms accepted....
where x can be any number greater than or equal to 0, then storing that information in a variable named line.
I'm then printing this content to a tkinter GUI, and it's way too much information for the window.
To resolve this I thought I would slice the information down with a return line[15:30] in the function but the info that I want off these lines does not always fall between 15 and 30.
To resolve this I tried to make a loop with
return line[cnt1:cnt2]
and checked cnt1 and cnt2 in a loop until cnt1 hit the "S" in "Server" and cnt2 hit the "a" in "accepted".
The problem is that I'm new to Python and I can't get the loop to work.
def serverlist(count):
    try:
        with open("file.txt", "r") as f:
            searchlines = f.readlines()
        if 'f' in locals():
            for i, line in enumerate(reversed(searchlines)):
                cnt = 90
                if "Server["+str(count)+"]" in line:
                    if line[cnt] == "t":
                        cnt += 1
                    return line[29:cnt]
    except WindowsError as fileerror:
        print(fileerror)
I did a reversed on the line reading because the lines I am looking for repeats over and over every couple of minutes in the text file.
Originally I wanted to scan from the bottom and stop when it got to server[0] but this loop wasn't working for me either.
I gave up and started just running serverlist(count) and specifying the server number I was looking for instead of just running serverlist().
Hopefully when I understand the problem with my original loop I can fix this.
End goal here:
file.txt has multiple lines with
<timestamp/date> Server[x] ip.ip.ip.ip response=<time> accepted <unneeded garbage>
I want to cut just the Server[x] and the response time out of that line and show it somewhere else using a variable.
The line can range from Server[0] to Server[999] and the same response times are checked every few minutes so I need to avoid duplicates and only get the latest entries at the bottom of the log.
I'm sorry this is lengthy and confusing.
EDIT:
Here is what I keep thinking should work but it doesn't:
def serverlist():
    ips = []
    cnt = 0
    with open("file.txt", "r") as f:
        for line in reversed(f.readlines()):
            while cnt >= 0:
                if "Server["+str(cnt)+"]" in line:
                    ips.append(line.split())  # split on spaces
                    cnt += 1
    return ips
My test log file has server[4] through server[0]. I would think that the above would read from the bottom of the file, print the server[4] line, then the server[3] line, etc., and stop when it hits 0. In theory this would keep it from reading every line in the file (runs faster), and it would give me only the latest data. BUT when I run this with while cnt >= 0 it gets stuck in a loop and runs forever. If I run it with any other value, like 1 or 2, then it returns a blank list []. I assume I am misunderstanding how this would work.
Here is my first approach:
def serverlist(count):
    with open("file.txt", "r") as f:
        for line in f.readlines():
            if "Server[" + str(count) + "]" in line:
                return line.split()[1]  # split on spaces
    return False
print serverlist(30)
# ip.ip.ip.ip
print serverlist(";-)")
# False
You can change the index in line.split()[1] to get the specific space separated string of the line.
Edit: Sure, just remove the if condition to get all ip's:
def serverlist():
    ips = []
    with open("file.txt", "r") as f:
        for line in f.readlines():
            if line.strip().startswith("Server["):
                ips.append(line.split()[1])  # split on spaces
    return ips
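To also cover the "latest entries only, no duplicates" end goal, one possible sketch scans from the bottom and keeps only the first hit per server; the regex here is an assumption based on the sample line format in the question:

```python
import re

# Assumed line shape: "<date> Server[x] ip.ip.ip.ip response=235ms accepted ..."
server_re = re.compile(r'(Server\[\d+\]).*?(response=\S+)')

def latest_responses(path):
    latest = {}
    with open(path) as f:
        lines = f.readlines()
    for line in reversed(lines):            # newest entries are at the bottom
        m = server_re.search(line)
        if m and m.group(1) not in latest:  # first hit from the bottom = latest
            latest[m.group(1)] = m.group(2)
    return latest
```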
I want to grab a chunk of data from a file. I know the start line and the end line. I wrote the code, but it's incomplete and I don't know how to take it further.
file = open(filename, 'r')
end_line = '### Leave a comment!'
star_line = 'Kill the master'
for line in file:
    if star_line in line:
        ??
startmarker = "ohai"
endmarker = "meheer?"
marking = False
result = []

with open("somefile") as f:
    for line in f:
        if line.startswith(startmarker): marking = True
        elif line.startswith(endmarker): marking = False
        if marking: result.append(line)

if len(result) > 1:
    print "".join(result[1:])
Explanation: The with block is a nice way to use files -- it makes sure you don't forget to close() it later. The for walks each line and:
starts outputting when it sees a line that starts with 'ohai' (including that line)
stops outputting when it sees a line that starts with 'meheer?' (without outputting that line).
After the loop, result contains the part of the file that is needed, plus that initial marker. Rather than making the loop more complicated to ignore the marker, I just throw it out using a slice: result[1:] returns all elements in result starting at index 1; in other words, it excludes the first element (index 0).
Update to add partial-line matches:
startmarker = "ohai"
endmarker = "meheer?"
marking = False
result = []

with open("somefile") as f:
    for line in f:
        if not marking:
            index = line.find(startmarker)
            if index != -1:
                marking = True
                result.append(line[index:])
        else:
            index = line.rfind(endmarker)
            if index != -1:
                marking = False
                result.append(line[:index + len(endmarker)])
            else:
                result.append(line)

print "".join(result)
Yet more explanation: marking still tells us whether we should be outputting whole lines, but I've changed the if statements for the start and end markers as follows:
if we're not (yet) marking, and we see the startmarker, then output the current line starting at the marker. The find method returns the position of the first occurrence of startmarker in this case. The line[index:] notation means 'the content of line starting at position index'.
while marking, just output the current line entirely unless it contains endmarker. Here, we use rfind to find the rightmost occurrence of endmarker, and the line[:index + len(endmarker)] notation means 'the content of line up to position index (the start of the match) plus the marker itself'. Also: stop marking now :)
If reading the whole file is not a problem, I would use file.readlines() to read all the lines into a list of strings.
then you can use list_of_lines.index(value) to find the indices of the first and last line, and then select all the lines between these two indices.
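A possible sketch of that approach (the marker strings are placeholders; note that list.index() compares whole lines, newline included, and raises ValueError if a marker is missing):

```python
def lines_between(path, start_marker, end_marker):
    """Return the lines from start_marker through end_marker, inclusive."""
    with open(path) as f:
        list_of_lines = f.readlines()
    start = list_of_lines.index(start_marker)         # first matching line
    end = list_of_lines.index(end_marker, start + 1)  # search after the start
    return list_of_lines[start:end + 1]
```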
First, a test file (assuming Bash shell):
for i in {0..100}; do echo "line $i"; done > test_file.txt
That generates a 101-line file with lines line 0\nline 1\n ... line 100\n
This Python script captures the lines from mark1 (inclusive) up to mark2 (exclusive):
#!/usr/bin/env python

mark1 = "line 22"
mark2 = "line 26"

record = False
error = False
buf = []

with open("test_file.txt") as f:
    for line in f:
        if mark1 == line.rstrip():
            if error == False and record == False:
                record = True
        if mark2 == line.rstrip():
            if record == False:
                error = True
            else:
                record = False
        if record == True and error == False:
            buf.append(line)

if len(buf) > 1 and error == False:
    print "".join(buf)
else:
    print "There was an error in there..."
Prints:
line 22
line 23
line 24
line 25
in this case. If both marks are not found in the correct sequence, it will print an error.
If the size of the file between the marks is excessive, you may need some additional logic. You can also use a regex for each line instead of an exact match if that fits your use case.
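As a sketch of that regex variant (the patterns and function name are mine), the two exact comparisons become searches:

```python
import re

def lines_between_re(path, start_pat, end_pat):
    """Collect lines from the first start match up to, and excluding, the end match."""
    start_re, end_re = re.compile(start_pat), re.compile(end_pat)
    buf, record = [], False
    with open(path) as f:
        for line in f:
            if not record and start_re.search(line):
                record = True   # the start line itself is kept below
            elif record and end_re.search(line):
                break           # the end line is excluded
            if record:
                buf.append(line)
    return buf
```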