I want to grab a chunk of data from a file. I know the start line and the end line. I wrote some code, but it's incomplete and I don't know how to take it further.
file = open(filename, 'r')
end_line = '### Leave a comment!'
start_line = 'Kill the master'
for line in file:
    if start_line in line:
        ??
startmarker = "ohai"
endmarker = "meheer?"
marking = False
result = []
with open("somefile") as f:
for line in f:
if line.startswith(startmarker): marking = True
elif line.startswith(endmarker): marking = False
if marking: result.append(line)
if len(result) > 1:
print "".join(result[1:])
Explanation: The with block is a nice way to use files -- it makes sure you don't forget to close() the file later. The for loop walks each line and:
starts outputting when it sees a line that starts with 'ohai' (including that line)
stops outputting when it sees a line that starts with 'meheer?' (without outputting that line).
After the loop, result contains the part of the file that is needed, plus that initial marker. Rather than making the loop more complicated to ignore the marker, I just throw it out using a slice: result[1:] returns all elements in result starting at index 1; in other words, it excludes the first element (index 0).
Update to handle partial-line matches:
startmarker = "ohai"
endmarker = "meheer?"
marking = False
result = []
with open("somefile") as f:
for line in f:
if not marking:
index = line.find(startmarker)
if index != -1:
marking = True
result.append(line[index:])
else:
index = line.rfind(endmarker)
if index != -1:
marking = False
result.append(line[:index + len(endmarker)])
else:
result.append(line)
print "".join(result)
Yet more explanation: marking still tells us whether we should be outputting whole lines, but I've changed the if statements for the start and end markers as follows:
if we're not (yet) marking, and we see the startmarker, then output the current line starting at the marker. The find method returns the position of the first occurrence of startmarker in this case. The line[index:] notation means 'the content of line starting at position index.'
while marking, just output the current line entirely unless it contains endmarker. Here, we use rfind to find the rightmost occurrence of endmarker, and the line[...] notation means 'the content of line up to position index (the start of the match) plus the marker itself.' Also: stop marking now :)
If reading the whole file is not a problem, I would use file.readlines() to read all the lines into a list of strings.
Then you can use list_of_lines.index(value) to find the indices of the first and last marker lines, and select all the lines between those two indices.
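A minimal sketch of that approach, reusing the marker lines from the question and assuming each marker occupies a whole line by itself (the question only tested for substrings, so treat this as illustrative):

with open(filename) as f:
    list_of_lines = f.readlines()

# index() compares whole lines, so the markers must include the trailing
# newline; a missing marker raises ValueError
start = list_of_lines.index('Kill the master\n')
end = list_of_lines.index('### Leave a comment!\n')

# the chunk between the two markers, excluding the markers themselves
print("".join(list_of_lines[start + 1:end]))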
First, a test file (assuming Bash shell):
for i in {0..100}; do echo "line $i"; done > test_file.txt
That generates a 101-line file containing line 0\nline 1\n ... line 100\n
This Python script captures the lines between and including mark1, up to and not including mark2:
#!/usr/bin/env python

mark1 = "line 22"
mark2 = "line 26"

record = False
error = False
buf = []

with open("test_file.txt") as f:
    for line in f:
        if mark1 == line.rstrip():
            if not error and not record:
                record = True
        if mark2 == line.rstrip():
            if not record:
                error = True
            else:
                record = False
        if record and not error:
            buf.append(line)

if len(buf) > 1 and not error:
    print "".join(buf)
else:
    print "There was an error in there..."
Prints:
line 22
line 23
line 24
line 25
in this case. If both marks are not found in the correct sequence, it will print an error.
If the size of the file between the marks is excessive, you may need some additional logic. You can also use a regex for each line instead of an exact match if that fits your use case.
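For instance, here is a sketch of the regex variant of the same script; the patterns simply mirror the exact-match markers above and would be adapted to real data, and the error handling is omitted for brevity:

import re

# Illustrative patterns; they mirror the exact-match markers above
mark1 = re.compile(r"^line 22$")
mark2 = re.compile(r"^line 26$")

record = False
buf = []
with open("test_file.txt") as f:
    for line in f:
        stripped = line.rstrip()
        if not record and mark1.match(stripped):
            record = True   # start capturing at (and including) mark1
        elif record and mark2.match(stripped):
            break           # stop before mark2
        if record:
            buf.append(line)

print("".join(buf))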
So I know similar questions have been asked before, but every method I have tried is not working...
Here is the ask: I have a text file (which is a log file) that I am parsing for any occurrence of "app.task2". The following are the 2 scenarios that can occur (as they appear in the text file, independent of my code):
Scenario 1:
Mar 23 10:28:24 dasd[116] <Notice>: app.task2.refresh:556A2D:[
{name: ApplicationPolicy, policyWeight: 50.000, response: {Decision: Can Proceed, Score: 0.45}}
] sumScores:68.785000, denominator:96.410000, FinalDecision: Can Proceed FinalScore: 0.713463}
Scenario 2:
Mar 23 10:35:56 dasd[116] <Notice>: 'app.task2.refresh:C6C2FE' CurrentScore: 0.636967, ThresholdScore: 0.410015 DecisionToRun:1
The problem I am facing is that with my current code below, I am not getting the entire log entry in the first case: it only pulls the first line of the entry, and it appears to stop at the newline escape character that occurs after ":[".
My Code:
all = []
with open(path_to_log) as f:
    for line in f:
        if "app.task2" in line:
            all.append(line)
print all
How can I get the entire log entry for the first case? I tried stripping escape characters with no luck. From here I should be able to parse the list of results returned for what I truly need, but this will help! ty!
OF NOTE: I need to be able to locate these types of log entries (which will then give us either scenario 1 or scenario 2) by the string "app.task2". So this needs to be incorporated, like in my example...
Before adding the line to all, check if it ends with [. If it does, keep reading and merge the lines until you get to ].
import re

all = []
with open(path_to_log) as f:
    for line in f:
        if "app.task2" in line:
            if re.search(r'\[\s*$', line):  # start of multiline log message
                for line2 in f:
                    line += line2
                    if re.search(r'^\s*\]', line2):  # end of multiline log message
                        break
            all.append(line)
print(all)
You are iterating over each line individually, which is why you only get the first line in scenario 1.
Either you can add a counter like this:
all = []
count = -1
with open(path_to_log) as f:
    for line in f:
        if count > 0:
            all.append(line)
            if count == 1:
                # merge the matched line and its continuation lines
                # (4 list entries in total) into a single log entry
                tmp = all[-4:]
                del all[-4:]
                all.append("".join(tmp))
            count -= 1
            continue
        if "app.task2" in line:
            all.append(line)
            if line.endswith('[\n'):
                count = 3
print all
In this case I think Barmar's solution would work just as well.
Or you can (preferably), when writing the log file, put some distinct delimiter between entries and simply split the file on that delimiter; a sketch of that idea follows.
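A minimal sketch, assuming a hypothetical ---ENTRY--- delimiter that the logging script would have to write between entries (real syslog output has no such marker):

with open(path_to_log) as f:
    entries = f.read().split("---ENTRY---")  # hypothetical delimiter

# keep only the entries that mention app.task2, each as one string
matches = [entry for entry in entries if "app.task2" in entry]
print(matches)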
I like @Barmar's solution with nested loops on the same file object, and may use that technique in the future. But prior to seeing it, I would have done it with a single loop, which may or may not be more readable:
all = []
keep = False
for line in open(path_to_log, "rt"):
    if "app.task2" in line:
        all.append(line)
        keep = line.rstrip().endswith("[")
    elif keep:
        all.append(line)
        keep = not line.lstrip().startswith("]")
print(all)
Or you can print it more nicely with:
print(*all,sep='\n')
I am trying to write code to extract the longest ORF from a FASTA file. It is from the Coursera Genomics Data Science course.
The file is a practice file: "dna.example.fasta"
The data is here: https://d396qusza40orc.cloudfront.net/genpython/data_sets/dna.example.fasta
Part of my code is below; it extracts reading frame 2 (starting from the second position of a sequence, e.g. for the seq ATTGGG, reading frame 2 is TTGGG):
#!/usr/bin/python
import sys
import getopt

o, a = getopt.getopt(sys.argv[1:], 'h')
opts = dict()
for k, v in o:
    opts[k] = v
    if '-h' in k:
        print "--help\n"
if len(a) < 1:
    print "missing fasta file\n"

f = open(a[0], "r")
seq = dict()
for line in f:
    line = line.strip()
    if line.startswith(">"):
        name = line.split()[0]
        seq[name] = ''
    else:
        seq[name] = seq[name] + line[1:]

k = seq[">gi|142022655|gb|EQ086233.1|323"]
print len(k)
The length of this particular sequence should be 4804 bp. Therefore by using this sequence alone I could get the correct answer.
However, with the code, here in the dictionary, this particular sequence becomes only 4736 bp.
I am new to Python, so I cannot wrap my head around where the missing bp went.
Thank you,
Xio
Take another look at your data file
An example of some of the lines:
>gi|142022655|gb|EQ086233.1|43 marine metagenome JCVI_SCAF_1096627390048 genomic scaffold, whole genome shotgun sequence
TCGGGCGAAGGCGGCAGCAAGTCGTCCACGCGCAGCGCGGCACCGCGGGCCTCTGCCGTGCGCTGCTTGG
CCATGGCCTCCAGCGCACCGATCGGATCAAAGCCGCTGAAGCCTTCGCGCATCAGGCGGCCATAGTTGGC
Notice how the sequences start at the first character of each line.
Your addition line seq[name] = seq[name] + line[1:] adds everything on that line after the first character, excluding the first (Python indices are zero-based). It turns out the number of missing nucleotides equals the number of lines that make up that sequence, because you're losing the first character of every line.
The revised way is seq[name] = seq[name] + line which simply adds the line without losing that first character.
The quickest way to find these kind of debugging errors is to either use a formal debugger, or add a bunch of print statements on your code and test with a small portion of the file -- something that you can see the output of and check for yourself if it's coming out right. A short file with maybe 50 nucleotides instead of 5000 is much easier to evaluate by hand and make sure the code is doing what you want. That's what I did to come up with the answer to the problem in about 5 minutes.
Also for future reference, please mention the version of Python you are using beforehand. There are quite a few differences between Python 2 (the one you're using) and Python 3.
I did some additional testing with your code, and if you get any extra characters at the end, they might be whitespace. Make sure you use the .strip() method on each line before adding it to your string, which clears whitespace.
Addressing your comment,
To start from the 2nd position on the first line of the sequence only and then use the full lines until the following nucleotide, you can take advantage of the file's linear format and just add one more clause to your if statement, an elif. This will test if we're on the first line of the sequence, and if so, use the characters starting from the second, if we're on any other line, use the whole line.
if line.startswith(">"):
name = line.split()[0]
seq[name] = ''
#If it's the first line in the series, then the dict's value
# will be an empty string, so this elif means "If we're at the
# start of the series..."
elif seq[name] == '':
seq[name] = seq[name] + line[1:]
else:
seq[name] = seq[name]
This adaptation will start from the 2nd nucleotide of the sequence without losing the first character of every other line.
I'm trying to find a certain word in a file and want to print the next line when a condition is met.
f = open('/path/to/file.txt', 'r')
lines = f.readlines()
for line in lines:
    if 'P/E' in line:
        n = lines.index(line)  # get index of current line
        print(lines[n + 1])    # print the next line
f.close()
The string 'P/E' will be present 4 times in the file, each time in a different line.
When executed, the code prints the line after each of the first 2 occurrences of 'P/E' normally. It then goes back, prints those same 2 lines again, and exits; the loop never proceeds past the first 2 occurrences.
I checked the data file to see whether my output reflects the actual contents, but the lines following each 'P/E' are all different.
How can I resolve this? Thanks.
list.index() with just one argument only finds the first occurrence. You'd have to give it a starting point to find elements past the previous index; list.index() takes a second argument that tells it where to start searching from.
However, you don't need to use lines.index(); that's very inefficient; it requires a full scan through the list, testing each line until a match is found.
Just use the enumerate() function to add indices as you loop:
for index, line in enumerate(lines):
    if 'P/E' in line:
        print(lines[index + 1])
Be careful: there is a chance index + 1 is not a valid index; if you find 'P/E' in the very last line of the lines list, you'll get an IndexError. You may have to add an and index + 1 < len(lines) test.
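For example, guarding the lookahead with the same names as above:

for index, line in enumerate(lines):
    # skip a match on the very last line rather than raise IndexError
    if 'P/E' in line and index + 1 < len(lines):
        print(lines[index + 1])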
Note that using file.readlines() reads all of the file into memory in one go. Try to avoid this; you could loop directly over the file, and remember the previous line instead:
with open('/path/to/file.txt', 'r') as f:
    previous = ''
    for line in f:
        if 'P/E' in previous:
            print(line)      # print the line after a match
        previous = line      # remember for the next iteration
I have searched far and wide and I hope someone can either point me to the link I missed or help me out with this logic.
We have a script that goes out and collects logs from various devices and places them in text files. Within these text files there is a time stamp, and we need to collect the few lines of text before and after this time stamp.
I already have a script that matches the time stamps and removes them for certain reports (included below) but I cannot figure out how to match the time stamp and then capture the surrounding lines.
regex_time_stamp = re.compile(r'\d{2}:\d{2}:\d{2}|\d{1,2}y\d{1,2}w\d{1,2}d|\d{1,2}w\d{1,2}d|\d{1,2}d\d{1,2}h')

with open(filename, 'r') as f:
    h = f.readlines()

for line in h:
    if regex_time_stamp.search(line) is not None:
        new_line = re.sub(regex_time_stamp, '', line)
        pre_list.append(new_line)
    else:
        pre_list.append(line)
Any assistance would be greatly appreciated! Thanks for taking the time to read this.
The basic algorithm is to remember the few most recently read lines. When you match a header, read the next two lines and combine them with the header and the lines you've saved.
Alternately, since you're saving all of the lines in a list, simply keep track of which element is the current element, and when you find a header you can go back and get the previous two and next two elements.
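A rough sketch of the first approach, reusing the regex from the question: it keeps the two most recently read lines in a deque and reads two lines ahead on a match. It does not handle matches whose context overlaps; the answer below addresses that.

import re
from collections import deque

regex_time_stamp = re.compile(r'\d{2}:\d{2}:\d{2}|\d{1,2}y\d{1,2}w\d{1,2}d|\d{1,2}w\d{1,2}d|\d{1,2}d\d{1,2}h')

captured = []
recent = deque(maxlen=2)  # the two most recently read lines
with open(filename, 'r') as f:
    for line in f:
        if regex_time_stamp.search(line):
            captured.extend(recent)   # the two lines before the match
            captured.append(line)     # the matched (timestamp) line itself
            for _ in range(2):        # and the two lines after it
                nxt = next(f, None)
                if nxt is None:
                    break             # hit end of file early
                captured.append(nxt)
            recent.clear()
        else:
            recent.append(line)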
Catch with duplicated lines
Agreed with the basic algorithm by @Bryan-Oakley and @TigerhawkT3; however, there's a catch:
What if several lines match consecutively?
You could end up duplicating "context" lines by printing the last 2 lines of the first match, and then the last 2 lines of the second match... that would actually also contain the previous matched line.
The solution is to keep track of which line number was last printed, in order to print just enough lines before the current matched line.
Flexible context parameter
What if also you want to print 3 lines before and after instead of 2? Then you need to keep track of more lines.
What if you want only 1?
Then your number of lines to print needs to be a parameter and the algorithm needs to use it.
Sample input and output
Here's a file sample that contains the word MATCH instead of your timestamp, for clarity. The other lines contain NOT plus the line number.
==
NOT 0
NOT 1
NOT 2
NOT 3
NOT 4
MATCH LINE 5
NOT 6
NOT 7
NOT 8
NOT 9
MATCH LINE 10
MATCH LINE 11
NOT 12
MATCH LINE 13
NOT 14
==
The output should be:
==
NOT 3
NOT 4
LINE 5
NOT 6
NOT 8
NOT 9
LINE 10
LINE 11
NOT 12
LINE 13
NOT 14
==
Solution
This solution iterates on the file and keeps track of:
what is the last line that was printed? This will take care of not duplicating "context" lines if matched lines come in sequence.
what is the last line that was matched? This will tell the program to print the current line if it is "close" to the last matched line. How close? This is determined by your "number of lines to print" parameter. Then we also set the last_line_printed variable to the current line index.
Here's a simplified algorithm in English:
When matching a line we will:
print the last N lines, from the last_line_printed variable to the current index
print the current line after stripping the timestamp
set the last_line_printed = last_line_matched = current line index
continue
When not matching a line we will:
print the current line if current_index < last_line_matched + number_of_lines_to_print
Of course we take care of being close to the beginning of the file by checking limits.
Not print but return an array
This solution doesn't print directly but returns an array with all the lines to print. That's just a bit classier.
I like to name my "return" variable result, but that's just me. It makes it obvious which variable holds the result throughout the algorithm.
Code
You can try this code with the input above, it'll print the same output.
def search_timestamps_context(filename, number_of_lines_to_print=2):
    import re
    result = []
    regex_time_stamp = re.compile(r'\d{2}:\d{2}:\d{2}|\d{1,2}y\d{1,2}w\d{1,2}d|\d{1,2}w\d{1,2}d|\d{1,2}d\d{1,2}h')
    # for my test
    regex_time_stamp = re.compile('MATCH')

    with open(filename, 'r') as f:
        h = f.readlines()

    # Remember which is the last line printed and matched
    last_line_printed = -1
    last_line_matched = -1

    for idx, line in enumerate(h):
        if regex_time_stamp.search(line) is not None:
            # We want to print the last "number_of_lines_to_print" lines
            # ...unless they were already printed
            # We want to return last lines from idx - number_of_lines_to_print
            # print('** Matched ', line, idx, last_line_matched, last_line_printed)
            if last_line_printed == -1:
                lines_to_print = max(idx - number_of_lines_to_print, 0)
            else:
                # Unless we've already printed those lines because of a
                # previous match, then we continue
                lines_to_print = max(idx - number_of_lines_to_print, last_line_printed + 1)
            for l in h[lines_to_print:idx]:
                result.append(l)
            # Now print the stripped line
            new_line = re.sub(regex_time_stamp, '', line)
            result.append(new_line)
            # Update the last line printed
            last_line_printed = last_line_matched = idx
        else:
            # If not a match, we still need to print the current line
            # if we had a match N lines before
            if last_line_matched != -1 and idx < last_line_matched + number_of_lines_to_print:
                result.append(line)
                last_line_printed = idx

    return result

filename = 'test_match.txt'
lines = search_timestamps_context(filename, number_of_lines_to_print=2)
print(''.join(lines))
Improvements
The usage of readlines() is inefficient: we are reading the whole file before starting.
It would be more efficient to just iterate, but then we need to remember the last lines in case we need to print them. To achieve that, we would maintain a list of the last N lines, and not more.
That's an exercise left to the reader :)
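For the curious, here is one possible shape of that exercise: a sketch that iterates the file lazily and keeps only the last number_of_lines_to_print lines in a deque. It assumes regex_time_stamp is compiled as in the question, and note it prints a full N lines after each match, which differs slightly from the code above.

from collections import deque

def search_timestamps_context_lazy(filename, number_of_lines_to_print=2):
    # assumes regex_time_stamp is defined at module level, as in the question
    result = []
    recent = deque(maxlen=number_of_lines_to_print)  # bounded memory
    remaining = 0  # context lines still owed after the last match
    with open(filename, 'r') as f:
        for line in f:
            if regex_time_stamp.search(line) is not None:
                result.extend(recent)  # context before the match
                recent.clear()         # so these lines can't be printed twice
                result.append(regex_time_stamp.sub('', line))
                remaining = number_of_lines_to_print
            elif remaining > 0:
                result.append(line)    # context after the last match
                remaining -= 1
            else:
                recent.append(line)
    return result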
I'm writing a short program in Python that will read a FASTA file which is usually in this format:
>gi|253795547|ref|NC_012960.1| Candidatus Hodgkinia cicadicola Dsem chromosome, 52 lines
GACGGCTTGTTTGCGTGCGACGAGTTTAGGATTGCTCTTTTGCTAAGCTTGGGGGTTGCGCCCAAAGTGA
TTAGATTTTCCGACAGCGTACGGCGCGCGCTGCTGAACGTGGCCACTGAGCTTACACCTCATTTCAGCGC
TCGCTTGCTGGCGAAGCTGGCAGCAGCTTGTTAATGCTAGTGTTGGGCTCGCCGAAAGCTGGCAGGTCGA
I've created another program that reads the first line (a.k.a. the header) of this FASTA file, and now I want this second program to start reading and printing from the sequence itself.
How would I do that?
So far I have this:
FASTA = open("test.txt", "r")
def readSeq(FASTA):
"""returns the DNA sequence of a FASTA file"""
for line in FASTA:
line = line.strip()
print line
readSeq(FASTA)
Thanks guys
-Noob
def readSeq(FASTA):
    """returns the DNA sequence of a FASTA file"""
    _unused = FASTA.next()  # skip heading record
    for line in FASTA:
        line = line.strip()
        print line
Read the docs on file.next() to see why you should be wary of mixing file.readline() with for line in file:
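As an aside, file.next() exists only in Python 2; a Python 3 sketch of the same answer would use the built-in next():

def readSeq(FASTA):
    """returns the DNA sequence of a FASTA file (Python 3)"""
    next(FASTA)  # skip the heading record
    for line in FASTA:
        print(line.strip())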
You should show your script. To read from the second line, do something like this:
f=open("file")
f.readline()
for line in f:
print line
f.close()
You might be interested in checking BioPython's handling of FASTA files (source).
def FastaIterator(handle, alphabet = single_letter_alphabet, title2ids = None):
    """Generator function to iterate over Fasta records (as SeqRecord objects).

    handle - input file
    alphabet - optional alphabet
    title2ids - A function that, when given the title of the FASTA
    file (without the beginning >), will return the id, name and
    description (in that order) for the record as a tuple of strings.
    If this is not given, then the entire title line will be used
    as the description, and the first word as the id and name.

    Note that use of title2ids matches that of Bio.Fasta.SequenceParser
    but the defaults are slightly different.
    """
    #Skip any text before the first record (e.g. blank lines, comments)
    while True:
        line = handle.readline()
        if line == "" : return #Premature end of file, or just empty?
        if line[0] == ">":
            break

    while True:
        if line[0]!=">":
            raise ValueError("Records in Fasta files should start with '>' character")
        if title2ids:
            id, name, descr = title2ids(line[1:].rstrip())
        else:
            descr = line[1:].rstrip()
            id = descr.split()[0]
            name = id

        lines = []
        line = handle.readline()
        while True:
            if not line : break
            if line[0] == ">": break
            #Remove trailing whitespace, and any internal spaces
            #(and any embedded \r which are possible in mangled files
            #when not opened in universal read lines mode)
            lines.append(line.rstrip().replace(" ","").replace("\r",""))
            line = handle.readline()

        #Return the record and then continue...
        yield SeqRecord(Seq("".join(lines), alphabet),
                        id = id, name = name, description = descr)

        if not line : return #StopIteration
    assert False, "Should not reach this line"
Good to see another bioinformatician :)
Just include an if clause within your for loop, above the line.strip() call:
def readSeq(FASTA):
    for line in FASTA:
        if line.startswith('>'):
            continue
        line = line.strip()
        print(line)
A pythonic and simple way to do this would be slice notation.
>>> f = open('filename')
>>> lines = f.readlines()
>>> lines[1:]
['TTAGATTTTCCGACAGCGTACGGCGCGCGCTGCTGAACGTGGCCACTGAGCTTACACCTCATTTCAGCGC\n', 'TCGCTTGCTGGCGAAGCTGGCAGCAGCTTGTTAATGCTAGTGTTGGGCTCGCCGAAAGCTGGCAGGTCGA']
That says "give me all elements of lines, from the second (index 1) to the end.
Other general uses of slice notation:
s[i:j] slice of s from i to j
s[i:j:k] slice of s from i to j with step k (k can be negative to go backward)
Either i or j can be omitted (to imply the beginning or the end), and j can be negative to indicate a number of elements from the end.
s[:-1] All but the last element.
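For instance:

>>> s = ['a', 'b', 'c', 'd', 'e']
>>> s[1:3]     # from index 1 up to, but not including, index 3
['b', 'c']
>>> s[::2]     # every second element
['a', 'c', 'e']
>>> s[:-1]     # all but the last element
['a', 'b', 'c', 'd']
>>> s[-2:]     # the last two elements
['d', 'e']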
Edit in response to gnibbler's comment:
If the file is truly massive you can use iterator slicing to get the same effect while making sure you don't get the whole thing in memory.
import itertools

f = open("filename")
# start at the second line, don't stop, stride by one
for line in itertools.islice(f, 1, None, 1):
    print line
"islicing" doesn't have the nice syntax or extra features of regular slicing, but it's a nice approach to remember.