f=open('sequence3.fasta', 'r')
str=''
for line in f:
    line2=line.rstrip('\n')
    if (line2[0]!='>'):
        str=str+line2
    elif (len(line)==0):
        break
str.rstrip('\n')
f.close()
The script is supposed to read three DNA sequences and join them into one sequence.
The problem is, I get this error:
IndexError: string index out of range
And when I write like this:
f=open('sequence3.fasta', 'r')
str=''
for line in f:
    line.rstrip('\n')
    if (line[0]!='>'):
        str=str+line
    elif (len(line)==0):
        break
str.rstrip('\n')
f.close()
It runs but there are spaces in between.
Thanks
The second version doesn't crash because the line line.rstrip('\n') is a no-op: rstrip returns a new string and doesn't modify the existing one (line).
The first version probably crashes because your input file contains empty lines, so line.rstrip('\n') returns an empty string and line2[0] raises the IndexError. Try this:
f=open('sequence3.fasta', 'r')
str=''
for line in f:
    line2=line.rstrip('\n')
    if line2 and line2[0]!='>':
        str=str+line2
    elif len(line)==0:
        break
if line2 is an equivalent of if len(line2) > 0. Similarly, you could replace your elif len(line)==0 with elif not line.
Your empty-line condition is in the wrong place. Try:
for line in f:
    line = line.rstrip('\n')
    if len(line) == 0: # or simply: if not line:
        break
    if line[0] != '>':
        str=str+line
Or another solution is to use the .startswith: if not line.startswith('>')
line.rstrip('\n')
Returns copy of line, and you do nothing with it. It doesn't change "line".
Exception "IndexError: string index out of range" means that "line[0]" cannot be referenced -- so "line" must be empty. Perhaps you should make it like this:
for line in f:
    line = line.rstrip('\n')
    if line:
        if (line[0]!='>'):
            str=str+line
    else:
        break
You shouldn't use your second code example, where you don't save the return value of rstrip. rstrip doesn't modify the string it is called on; as the documentation puts it: "Return a copy of the string with trailing characters removed."
Also, the first condition you check in your if/else statement should be for length 0; otherwise you'll get an error for indexing past the string's length.
Additionally, having a break in your if else statements will end your loop early if you have an empty line. Instead of breaking you could just not do anything if there is 0 length.
if (len(line2) != 0):
    if (line2[0] != '>'):
        str = str+line2
Also your line near the end str.rstrip('\n') isn't doing anything since the return value of rstrip isn't saved.
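Pulling the points from the answers above together, here is a minimal sketch (written for Python 3) of one way to join the sequence lines. The function name concat_fasta_sequences is hypothetical, and the sketch assumes that blank lines should be skipped rather than treated as the end of the data.

```python
def concat_fasta_sequences(lines):
    """Join all sequence lines, skipping header lines ('>') and blank lines."""
    parts = []
    for line in lines:
        line = line.rstrip('\r\n')   # rstrip returns a new string, so keep the result
        if not line:                 # blank line: nothing to index, so skip it
            continue
        if line.startswith('>'):     # FASTA header line
            continue
        parts.append(line)
    return ''.join(parts)


# Example with in-memory lines; an open file object could be passed the same way.
sample = ['>seq1\n', 'ACGT\n', '\n', '>seq2\n', 'TTGA\n', '>seq3\n', 'CC\n']
print(concat_fasta_sequences(sample))  # ACGTTTGACC
```

Collecting the pieces in a list and joining once also avoids reusing str as a variable name, which shadows the built-in type.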
Related
I have data in a .txt file, in the form of a comma-separated list. For example:
N12345678,B,A,D,D,C,B,D,A,C,C,D,B,A,B,A,C,B,D,A,C,A,A,B,D,D
N12345678,B,A,D,D,C,B,D,A,C,C,D,B,A,B,A,C,B,D,A,C,A,A,B,D,D
I want to be able to split it up, first by line, then by comma, so I'm able to process the data and validate it. I am getting "invalid" for all of the lines in my code, even though some of them should be valid because there should be 26 characters per line. Here is my code so far:
(filename+".txt").split("\n")
(filename+".txt").split(",")
with open(filename+".txt") as f:
    for line in f:
        if len(line) != 26:
            print ("invalid")
        else:
            print ("valid")
This code is quite far from working; it's syntactically valid Python, but it doesn't mean anything sensible.
# These two lines add two strings together, returning a string
# then they split the string into pieces into a list
# because the /filename/ has no newlines in it, and probably no commas
# that changes nothing
# then the return value isn't saved anywhere, so it gets thrown away
(filename+".txt").split("\n")
(filename+".txt").split(",")
# This opens the file and reads from it line by line,
# which means "line" is a string of text for each line in the file.
with open(filename+".txt") as f:
    for line in f:
        # This checks if the line in the file is not the /number/ 26
        # since the file contains /strings/ it never will be the number 26
        if line != 26:
            print ("invalid")
        # so this is never hit
        else:
            print ("valid")
[Edit: even in your updated code, the line is the whole text "N12345678,B,A,D..." and because of the commas, len(line) will be longer than 26 characters.]
It seems you want something more like: Drop the first two lines of your code completely, read through the file line by line (meaning you normally don't have to care about "\n" in your code). Then split each line by commas.
with open(filename+".txt") as f:
    for line in f:
        line_parts = line.split(",")
        if len(line_parts) != 26:
            print ("invalid")
        else:
            print ("valid")
        # line_parts is now a list of strings
        # ["N12345678" ,"B", "A", ...]
I think an easier way to do this would be to use csv module.
import csv
with open("C:/text.csv") as input:
    reader = csv.reader(input)
    for row in reader:
        if len(row) == 26:
            print("Valid")
        else:
            print("Invalid")
As far as I understand your question, you want this.
with open(filename, 'r') as f:
    for line in f:
        if len(line.split(',')) !=26:
            print("Invalid")
        else:
            print("Valid")
All it does is,
Open the file.
Read the file line by line.
For each line, split the line by ,
As str.split() returns a list, check if length of the list is 26 or not.
If length is 26, count as valid; otherwise not.
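The steps above can be sketched as a small function (Python 3). The name classify_lines and the expected_fields parameter are hypothetical, and the sketch assumes that blank lines should simply be ignored and that a valid row has exactly 26 comma-separated fields (the ID plus 25 answers).

```python
def classify_lines(lines, expected_fields=26):
    """Return (line_number, is_valid) for each non-blank line."""
    results = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()              # drop the trailing newline first
        if not line:
            continue
        fields = line.split(',')         # str.split returns a list of strings
        results.append((number, len(fields) == expected_fields))
    return results


# One row with 26 fields (ID + 25 answers) and one row that is too short.
good = 'N12345678,' + ','.join('ABCD' * 6 + 'A') + '\n'
bad = 'N12345678,B,A\n'
print(classify_lines([good, bad]))  # [(1, True), (2, False)]
```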
The test.txt would be
1
2
3
start
4
5
6
end
7
8
9
I would like the result to be
start
4
5
6
end
This is my code
file = open('test.txt','r')
line = file.readline()
start_keyword = 'start'
end_keyword = 'end'
lines = []
while line:
    line = file.readlines()
    for words_in_line in line:
        if start_keyword in words_in_line:
            lines.append(words_in_line)
file.close()
print entities
It returns
['start\n']
I have no idea what to add to the above code to achieve the result I want to get. I have been searching and changing the code around but I don't know how to get this to work as I want it to.
Use a flag. Try this:
file = open('test.txt','r')
start_keyword = 'start'
end_keyword = 'end'
in_range = False
entities = []
lines = file.readlines()
for line in lines:
    line = line.strip()
    if line == start_keyword:
        in_range = True
    elif line == end_keyword:
        in_range = False
    elif in_range:
        entities.append(line)
file.close()
# If you want to include the start/end tags
#entities = [start_keyword] + entities + [end_keyword]
print entities
About your code, notice that readlines already reads all lines in a file, so calling readline doesn't seem to make much sense, unless you are ignoring the first line. Also use strip to remove EOL characters from the strings. Notice how your code doesn't do what you expect it to:
# Reads ALL lines in the file as an array
line = file.readlines()
# You are not iterating words in a line, but rather all lines one by one
for words_in_line in line:
    # If a given line contains 'start', append it. This is why you only get ['start\n'], it's the only line you are adding as no other line contains that string
    if start_keyword in words_in_line:
        lines.append(words_in_line)
You need a state variable to decide whether you are storing the lines or not. Here is a simplistic example that will always store the line, and then will change its mind and discard it for the cases you don't want:
start_keyword = 'start'
end_keyword = 'end'
lines = []
reading = False
with open('test.txt', 'r') as f:
    for line in f:
        lines.append(line)
        if start_keyword in line:
            reading = True
        elif end_keyword in line:
            reading = False
        elif not reading:
            lines.pop()
print ''.join(lines)
If the file isn't too big (relative to how much RAM your computer has):
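start = 'start'
end = 'end'
with open('test.txt','r') as f:
    content = f.read()
begin = content.index(start)
# Look for the end marker only after the start marker, and include it in the slice
stop = content.index(end, begin) + len(end)
result = content[begin:stop]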
You can then print it with print(result), create a list by using result.split(), and so on.
If there are multiple start/stop points, and/or the file is very large:
start = 'start'
end = 'end'
running = False
result = []
with open('test.txt','r') as f:
    for line in f:
        if start in line:
            running = True
            result.append(line)
        elif end in line:
            running = False
            result.append(line)
        elif running:
            result.append(line)
This leaves you with a list, which you can join(), print(), write to a file, and so on.
You can use a flag that is set to true when you encounter start_keyword. While the flag is set, you add each line to the lines list, and you unset the flag when end_keyword is encountered (but only after end_keyword has been written into the lines list).
Also, use .strip() on words_in_line to remove the \n (and other leading and trailing whitespace) if you do not want it in the lines list; if you do want it, don't strip.
Example -
flag = False
for words_in_line in line:
    if start_keyword in words_in_line:
        flag = True
    if flag:
        lines.append(words_in_line.strip())
    if end_keyword in words_in_line:
        flag = False
Please note, this would add multiple start to end blocks into the lines list, I am guessing that is what you want.
A file object is its own iterator, so you don't need a while loop to read a file line by line; you can iterate over the file object itself. To catch the sections, just start an inner loop when you encounter a line with start, and break out of the inner loop when you hit end:
start = 'start'
end = 'end'
with open("in.txt") as f:
    out = []
    for line in f:
        if start in line:
            out.append(line)
            # the inner loop continues from the same position in the file
            for _line in f:
                out.append(_line)
                if end in _line:
                    break
Output:
['start\n', '4\n', '5\n', '6\n', 'end\n']
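The flag-based approach described in several of the answers above can also be written as a generator. This is only a sketch (Python 3): the name blocks_between is hypothetical, and it assumes that a marker must be the whole line after stripping, not merely a substring, and that the markers themselves should be included in the result.

```python
def blocks_between(lines, start='start', end='end'):
    """Yield each line from a start marker through its end marker, inclusive."""
    inside = False
    for line in lines:
        text = line.strip()
        if text == start:
            inside = True
        if inside:
            yield text          # includes the start and end markers themselves
        if text == end:
            inside = False


sample = ['1\n', '2\n', '3\n', 'start\n', '4\n', '5\n', '6\n', 'end\n', '7\n']
print(list(blocks_between(sample)))  # ['start', '4', '5', '6', 'end']
```

Because it only looks at one line at a time, the same function can be fed an open file object, and it handles several start/end blocks in one pass.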
I'd like to read a file in python line by line, but in some cases (based on an if condition) I'd also like to read the next line in the file, and then keep reading it the same way.
Example:
file_handler = open(fname, 'r')
for line in file_handler:
    if line[0] == '#':
        print line
    else:
        line2 = file_handler.readline()
        print line2
Basically, in this example I am trying to read the file line by line, but when a line does not start with # I'd like to read the next line, print it, and then keep reading from the line after line2. This is just an example that reproduces the error I get in my real code, but my goal is as stated in the title.
But I'd get an error like ValueError: Mixing iteration and read methods would lose data.
Would it be possible to do what I am trying to do in a smarter way?
If you just want to skip over lines not starting with #, there's a much easier way to do this:
file_handler = open(fname, 'r')
for line in file_handler:
    if line[0] != '#':
        continue
    # now do the regular logic
    print line
Obviously this kind of simplistic logic won't work in all possible cases. When it doesn't, you have to do exactly what the error implies: either use iteration consistently, or use read methods consistently. This is going to be more tedious and error-prone, but it's not that bad.
For example, with readline:
while True:
    line = file_handler.readline()
    if not line:
        break
    if line[0] == '#':
        print line
    else:
        line2 = file_handler.readline()
        print line2
Or, with iteration:
lines = file_handler
for line in file_handler:
    if line[0] == '#':
        print line
    else:
        print line
        print next(file_handler)
However, that last version is sort of "cheating". You're relying on the fact that the iterator in the for loop is the same thing as the iterable it was created from. This happens to be true for files, but not for, say, lists. So really, you should do the same kind of while True loop here, unless you want to add an explicit iter call (or at least a comment explaining why you don't need one).
And a better solution might be to write a generator function that transforms one iterator into another based on your rule, and then print out each value iterated by that generator:
def doublifier(iterable):
    it = iter(iterable)
    for line in it:
        if line.startswith('#'):
            # pair the line with the one after it; None if the input ends here
            yield line, next(it, None)
        else:
            yield (line,)
file_handler = open(fname, 'r')
for line in file_handler:
    if line.startswith('#'): # <<< comment 1
        print line
    else:
        line2 = next(file_handler) # <<< comment 2
        print line2
Discussion
Your code used a single equal sign, which is incorrect; comparison needs a double equal sign. I recommend using the .startswith() method for clarity.
Use the next() function to advance to the next line since you are using file_handler as an iterator.
add a flag value:
if flag is True:
    print line #or whatever
    flag = False
if line[0] == '#':
    flag = True
This is a versatile version :-)
You can save a bit of state information that tells you what to do with the next line:
want_next = False
for line in open(fname):
    if want_next:
        print line
        want_next = False
    elif line[0] == '#':
        print line
        want_next = True
I think what you are looking for is next rather than readline.
A few things. In your code, you use = rather than ==. I will use startswith instead. If you call next on an iterator, it will return the next item or throw a StopIteration exception.
The file
ewolf#~ $cat foo.txt
# zork zap
# woo hoo
here is
some line
# a line
with no haiku
The program
file_handler = open( 'foo.txt', 'r' )
for line in file_handler:
    line = line.strip()
    if line.startswith( '#' ):
        print "Not Skipped : " + line
    elif line is not None:
        try:
            l2 = file_handler.next()
            l2 = l2.strip()
            print "Skipping. Next line is : " + l2
        except StopIteration:
            # End of File
            pass
The output
Not Skipped : # zork zap
Not Skipped : # woo hoo
Skipping. Next line is : some line
Not Skipped : # a line
Skipping. Next line is :
try if line[0] == "#" instead of line[0] = "#"
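One of the answers above notes that calling next() inside a for loop works for files because a file is its own iterator, and that this is not true for lists. A short sketch (Python 3, with made-up sample data) of the explicit iter() call that makes the same pattern work for any iterable:

```python
lines = ['# a\n', 'b\n', 'c\n', '# d\n']

# A list is iterable but is not its own iterator, so next(lines) raises TypeError.
try:
    next(lines)
except TypeError:
    print('a list is not an iterator')

it = iter(lines)          # one explicit iterator shared by the loop and next()
seen = []
for line in it:
    if line.startswith('#'):
        seen.append(line.strip())
    else:
        # Consume the following line too; the default avoids StopIteration at the end.
        following = next(it, None)
        seen.append((line.strip(), following.strip() if following else None))

print(seen)  # ['# a', ('b', 'c'), '# d']
```

Passing a default to next() is one way to handle a final line that has no successor without a try/except block.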
I'm trying to format a tab delimited txt file that has rows and columns. I'm trying to simply ignore the rows that have any empty values in it when I write to the output file. I'm doing this by len(list) method where if the length of the list equals the number of columns, then that line gets written to output file. But when I check the length of the lines, they are all the same, even though I removed the empty strings! Very frustrating...
Here's my code:
import sys, os
inputFileName = sys.argv[1]
outputFileName = os.path.splitext(inputFileName)[0]+"_edited.txt"
try:
    infile = open(inputFileName,'r')
    outfile = open(outputFileName, 'w')
    line = infile.readline()
    outfile.write(line)
    for line in infile:
        lineList = line.split('\t')
        #print lineList
        if '' in lineList:
            lineList.remove('')
        #if len(lineList) < 9:
            #print len(lineList)
        #outfile.write(line)
    infile.close()
    #outfile.close()
except IOError:
    print inputFileName, "does not exist."
Thanks for any help. When I create an experimental list in the interactive window and use the if '' in list: then it removes it. When I run the code, the ' ' is still there!
I don't know any Python, but I can point out that you don't seem to be checking for whitespace characters. What about \r and \n, on top of the \t's? Why don't you try trimming the line and checking whether it is == ''?
I think that one of your problems is that list.remove only removes the first occurrence of the element. There could still be more empty strings in your list. From the documentation:
Remove the first item from the list whose value is x. It is an error if there is no such item.
To remove all the empty strings from your list you could use a list comprehension instead.
lineList = [x for x in lineList if x]
or filter with the identity function (by passing None as the first argument):
lineList = filter(None, lineList)
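A small illustration of the difference, using made-up field values. Note one assumption about the interpreter: in Python 3, filter() returns an iterator rather than a list, so it is wrapped in list() here.

```python
fields = ['a', '', 'b', '', '']

once = list(fields)
once.remove('')                      # removes only the first ''
print(once)                          # ['a', 'b', '', '']

cleaned = [x for x in fields if x]   # drops every empty string
print(cleaned)                       # ['a', 'b']

# In Python 3, filter() returns an iterator; wrap it in list() if a list is needed.
print(list(filter(None, fields)))    # ['a', 'b']
```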
The following does what you're asking with fewer lines of code and removes empty lines of any kind of whitespace thanks to the strip() call.
#!/usr/bin/env python
import sys, os
inputFileName = sys.argv[1]
outputFileName = os.path.splitext(inputFileName)[0]+"_edited.txt"
try:
    infile = open(inputFileName,'r')
    outfile = open(outputFileName, 'w')
    for line in infile.readlines():
        if line.strip():
            outfile.write(line)
    infile.close()
    outfile.close()
except IOError:
    print inputFileName, "does not exist."
EDIT:
For clarity, this reads each line of the input file then strips the line of leading and trailing whitespace (tabs, spaces, etc.) and writes the non-empty lines to the output file.
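If the goal is the one stated in the question, dropping rows in which any column is empty rather than only fully blank lines, a sketch along the following lines may be closer (Python 3). The function name rows_without_blanks is hypothetical, and it assumes a field containing only whitespace counts as empty.

```python
def rows_without_blanks(lines, delimiter='\t'):
    """Yield only the lines in which every delimited field is non-blank."""
    for line in lines:
        # Strip the line ending first so the last field is not just '\n'.
        fields = line.rstrip('\r\n').split(delimiter)
        if all(field.strip() for field in fields):
            yield line


sample = ['id\tname\tscore\n', '1\tann\t9\n', '2\t\t7\n', '3\tbob\t\n']
print(list(rows_without_blanks(sample)))
# ['id\tname\tscore\n', '1\tann\t9\n']
```

Checking the fields instead of removing empty strings from the list sidesteps the list.remove pitfall, and the original line is written out unchanged.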
Goal is to write a script which will copy a text file and exclude any line beginning with #.
My question is I seem to get an index error which is dependent upon the order of my if elif conditions. The only difference between the nonworking code and the working code (besides the suffix "_bad" to the nonworking function name) is that I test the "" condition first (works) vs testing the "#" condition first (doesn't work)
Base file is created by this script:
>>> testFileObj = open("test.dat","w")
>>> testFileObj.write("#line one\nline one\n#line two\nline two\n")
>>> testFileObj.close()
Code which works:
def copyAndWriteExcludingPoundSigns(origFile, origFileWithOutPounds):
    origFileObj = open(origFile,"r")
    modFileObj = open(origFileWithOutPounds,"w")
    while True:
        textObj = origFileObj.readline()
        if textObj == "":
            break
        elif textObj[0] == "#":
            continue
        else:
            modFileObj.write(textObj)
    origFileObj.close()
    modFileObj.close()
Code which doesn't work:
def copyAndWriteExcludingPoundSigns_Bad(origFile, origFileWithOutPounds):
    origFileObj = open(origFile,"r")
    modFileObj = open(origFileWithOutPounds,"w")
    while True:
        textObj = origFileObj.readline()
        if textObj[0] == "#":
            continue
        elif textObj == "":
            break
        else:
            modFileObj.write(textObj)
    origFileObj.close()
    modFileObj.close()
Which gives me this error:
Traceback (most recent call last):
  File "<pyshell#96>", line 1, in <module>
    copyAndWriteExcludingPoundSigns_Bad("test.dat","testOutput.dat")
  File "<pyshell#94>", line 6, in copyAndWriteExcludingPoundSigns_Bad
    if textObj[0] == "#":
IndexError: string index out of range
If you do if textObj[0] == "#": and textObj="" then there is no character at the zero index, because the string is empty, hence the index error.
The alternative is to do
if textObj.startswith("#"): which will work in both cases.
some tips (and please please read PEP8):
use a 'for' instead of a 'while' loop
no need to use readlines after python 2.4
test if the line is empty before testing for the first char
Untested:
def copy_and_write_excluding_pound_signs(original, filtered):
    original_file = open(original,"r")
    filtered_file = open(filtered,"w")
    for line in original_file:
        if line and line[0] == '#':
            continue
        filtered_file.write(line)
    original_file.close()
    filtered_file.close()
You may also want to filter out lines with some whitespace before the '#':
import re
def copy_and_write_excluding_pound_signs(original, filtered):
    pound_re = re.compile(r'^\s*#')
    original_file = open(original,"r")
    filtered_file = open(filtered,"w")
    for line in original_file:
        if pound_re.match(line):
            continue
        filtered_file.write(line)
    original_file.close()
    filtered_file.close()
You should use line.startswith('#') to check whether the string line starts with '#'. If the line is empty (such as line = ''), there would be no first character, and you'd get this error.
Also, the file isn't guaranteed to contain a line that is an empty string, so breaking out of the loop like that is inadvisable. Files in Python are iterable, so you can simply do a for line in file: loop.
The problem with your non-working code is that it encounters an empty string (readline() returns "" at the end of the file), which causes the IndexError when the statement if textObj[0] == "#": is evaluated ([0] refers to the first character of the string). The working code avoids doing that when the string is empty.
The simplest way I can think of to rewrite your function is to use for line in <fileobj>, so you won't have to worry about line ever being empty. Also, if you use the Python with statement, your files will be closed automatically. Anyway, here's what I suggest:
def copyAndWriteExcludingPoundSigns(origFile, origFileWithOutPounds):
    with open(origFile,"r") as origFileObj:
        with open(origFileWithOutPounds,"w") as modFileObj:
            for line in origFileObj:
                if line[0] != '#':
                    modFileObj.write(line)
The two with statements could be combined, but that would have made for a very long, harder-to-read line of code, so I broke it up.
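For comparison, here is a sketch of the combined form together with the startswith check suggested in the other answers (Python 3). The function and parameter names are hypothetical; shorter names keep the single with line readable.

```python
def copy_without_pound_lines(src_path, dst_path):
    """Copy src_path to dst_path, leaving out lines that start with '#'."""
    # One with statement manages both files; both are closed on exit.
    with open(src_path, 'r') as src, open(dst_path, 'w') as dst:
        for line in src:
            if not line.startswith('#'):   # safe even for an empty string
                dst.write(line)
```

Called as copy_without_pound_lines("test.dat", "testOutput.dat") on the test file from the question, this should leave only the two lines that do not begin with '#'.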