I recently asked a question about converting a list of values from a txt file into a dictionary; see my earlier question for details.
P883, Michael Smith, 1991
L672, Jane Collins, 1992(added)(empty line here)
L322, Randy Green, 1992
H732, Justin Wood, 1995(/added)
^key ^name ^year of birth
===============
This question has been answered, and I used the following code (the accepted answer), which works perfectly:
def load(filename):
    students = {}
    infile = open(filename)
    for line in infile:
        line = line.strip()
        parts = [p.strip() for p in line.split(",")]
        students[parts[0]] = (parts[1], parts[2])
    return students
However, when there is an empty line in the txt file (see the added parts), it no longer works and raises an error saying that the list index is out of range.
Check for empty lines inside your for-loop and skip them:
for line in infile:
    line = line.strip()
    if not line:
        continue
    parts = [p.strip() for p in line.split(",")]
    students[parts[0]] = (parts[1], parts[2])
Check for an empty line either by counting the elements of parts (if there are fewer than three elements, the line was empty or at least invalid) or by checking the trimmed value of line against the empty string. (Sorry, I can't code Python, so no code sample here...)
Remember: You should always check the size of a dynamically created array before indexing it.
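The length check described above might look like this in Python. This is only a sketch: parse_students is a hypothetical name, and it assumes every valid line has three comma-separated fields, as in the sample data from the question.

```python
import io

def parse_students(lines):
    """Build {key: (name, year)} from an iterable of lines, skipping bad ones."""
    students = {}
    for line in lines:
        parts = [p.strip() for p in line.split(",")]
        # A blank line splits into [''], so it has fewer than three parts.
        if len(parts) < 3:
            continue
        students[parts[0]] = (parts[1], parts[2])
    return students

# Sample data from the question, including the empty line.
sample = io.StringIO(
    "P883, Michael Smith, 1991\n"
    "L672, Jane Collins, 1992\n"
    "\n"
    "L322, Randy Green, 1992\n"
    "H732, Justin Wood, 1995\n"
)
students = parse_students(sample)
print(students["L322"])
```

Because the check is on the number of parts, it also skips lines that are not blank but are missing a field.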
lines = [line.strip().split(', ') for line in file if line.strip()]
result = dict((parts[0], parts[1:]) for parts in lines)
It's really straightforward to check the line for emptiness or length 0:
for line in infile:
    line = line.strip()
    if line:
        do_something()
    # or
    if len(line) > 0:
        do_something()
Here is info from the .txt file I am trying to access:
Movies: Drama
Possession, 2002
The Big Chill, 1983
Crimson Tide, 1995
Here is my code:
fp = open("Movies.txt", "r")
lines = fp.readlines()
for line in lines:
    values = line.split(", ")
    year = int(values[1])
    if year < 1990:
        print(values[0])
I get an error message "IndexError: list index out of range". Please explain why or how I can fix this. Thank you!
Assuming your .txt file includes the "Movies: Drama" line, as you listed, it's because the first line of the text file has no comma in it. Therefore splitting that first line on a comma only results in 1 element (element 0), NOT 2, and therefore there is no values[1] for the first line.
It's not unusual for data files to have a header line that doesn't contain actual data. Libraries such as pandas can typically handle this for you, but open() and readlines() treat the header like any other line.
The easiest thing to do is just slice your list variable (lines) so you don't include the first line in your loop:
fp = open("Movies.txt", "r")
lines = fp.readlines()
for line in lines[1:]:
    values = line.split(", ")
    year = int(values[1])
    if year < 1990:
        print(values[0])
Note the "lines[1:]" modification. This way you only loop starting from the second line (the first line is lines[0]) and go to the end.
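As an alternative to slicing lines, the header can be consumed from an iterator before the loop starts. The sketch below uses the standard csv module; titles_before is a hypothetical helper, and io.StringIO stands in for the real Movies.txt.

```python
import csv
import io

def titles_before(lines, cutoff_year):
    """Return titles whose year is earlier than cutoff_year, skipping the header."""
    # skipinitialspace drops the blank that follows each comma.
    reader = csv.reader(lines, skipinitialspace=True)
    next(reader, None)  # discard the header row ("Movies: Drama")
    titles = []
    for row in reader:
        if len(row) < 2:  # blank or malformed line
            continue
        if int(row[1]) < cutoff_year:
            titles.append(row[0])
    return titles

sample = io.StringIO(
    "Movies: Drama\n"
    "Possession, 2002\n"
    "The Big Chill, 1983\n"
    "Crimson Tide, 1995\n"
)
print(titles_before(sample, 1990))
```

This avoids reading the whole file into a list first, which only matters for large files.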
The first line of the text file does not have a ", ", so when you split on it, you get a list of size 1. When you access the 2nd element with values[1] then you are accessing outside the length of the array, hence the IndexError. You need to do a check on the line before making the assumption about the size of the list. Some options:
Check the length of values and continue if it's too short.
Check that ', ' is in the line before splitting on it.
Use a regex which will ensure the ', ' is there as well as can ensure that the contents after the comma represent a number.
Preemptively strip off the first line in lines if you know that it's the header.
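The regex option from the list above could be sketched as follows. The pattern assumes a four-digit year at the end of the line, which is my assumption rather than something the question states, and parse_movie is a hypothetical name.

```python
import re

# Title, then ", ", then a four-digit year and nothing else on the line.
MOVIE_LINE = re.compile(r"^(?P<title>.+), (?P<year>\d{4})\s*$")

def parse_movie(line):
    """Return (title, year) for a well-formed line, or None otherwise."""
    match = MOVIE_LINE.match(line)
    if match is None:
        return None
    return match.group("title"), int(match.group("year"))

lines = [
    "Movies: Drama\n",
    "Possession, 2002\n",
    "The Big Chill, 1983\n",
    "Crimson Tide, 1995\n",
]
for line in lines:
    movie = parse_movie(line)
    if movie is not None and movie[1] < 1990:
        print(movie[0])
```

Lines that do not match, such as the header, yield None and are skipped instead of raising IndexError.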
The first line of your txt file has no second field, so values[1] does not exist for it.
Simply change your code to catch the error:
fp = open("Movies.txt", "r")
lines = fp.readlines()
for line in lines:
    try:  # <---- here
        values = line.split(", ")
        year = int(values[1])
        if year < 1990:
            print(values[0])
    except (IndexError, ValueError):  # <---- and here: skip lines without a valid year
        pass
Please see the attached image showing the format of the text file. I need to extract the dimensions of the data matrix given in the first line of the file, here 49 * 70 * 1 for the case shown in the image. Note that the length of the name "gd_fac" can vary. How can I extract these numbers as integers? I am using Python 3.6.
The specification is not very clear. I am assuming that the information you want will always be in the first line, and always in parentheses. Given that:
with open(filename) as infile:
    line = infile.readline()
string = line[line.find('(')+1:line.find(')')]
lst = string.split('x')
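This will create the list lst = ['49', '70', '1']. The elements are strings; to get integers, convert each one, for example with [int(s) for s in lst].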
What is happening here:
First I open the file (you will need to replace filename with the name of your file, as a string). The with ... as ... structure ensures that the file is closed after use. Then I read the first line. After that, I select only the part of that line that falls after the open paren ( and before the close paren ). Finally, I break the string into parts, with the character x as the separator. This creates a list that contains the values in the first line of the file that fall between the parentheses and are separated by x.
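Since the question asks for integers, the steps above can be wrapped up with an int() conversion. This is a sketch only: parse_dimensions is a hypothetical name, and the sample first line is a guess, because the real file was shown only as an image.

```python
def parse_dimensions(first_line):
    """Return the numbers between the first '(' and the next ')' as ints."""
    start = first_line.find("(")
    end = first_line.find(")", start + 1)
    if start == -1 or end == -1:
        raise ValueError("no parenthesized dimensions found")
    # int() tolerates surrounding whitespace, so "49 x 70" style input also works.
    return [int(part) for part in first_line[start + 1:end].split("x")]

# Hypothetical first line of the file.
print(parse_dimensions("gd_fac (49x70x1)\n"))
```

With a real file, the argument would be the result of infile.readline().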
Since you have mentioned that the length of 'gd_fac' can vary, the best solution is to use a regular expression.
import re
with open("a.txt") as fh:
    for line in fh:
        if '(' in line and ')' in line:
            dimension = re.findall(r'.*\((.*)\)',line)[0]
            break
print(dimension)
Output:
'49x70x1'
This looks for "gd_fac"; if it is there, it removes everything that isn't needed and leaves just what you want.
with open('test.txt', 'r') as infile:
    for line in infile:
        if "gd_fac" in line:
            line = line.replace("gd_fac", "")
            line = line.replace("x", "*")
            line = line.replace("(", "")
            line = line.replace(")", "")
            print(line.strip())
            break
OUTPUT: "49*70*1"
Very new, please be nice and explain slowly and clearly. Thanks :)
I've tried searching how to extract a single line in python, but all the responses seem much more complicated (and confusing) than what I'm looking for. I have a file, it has a lot of lines, I want to pull out just the line that starts with #.
My file.txt:
"##STUFF"
"##STUFF"
#DATA 01 02 03 04 05
More lines here
More lines here
More lines here
My attempt at a script:
file = open("file.txt", "r")
splitdata = []
for line in file:
if line.startswith['#'] = data
splitdata = data.split()
print splitdata
#expected output:
#splitdata = [#DATA, 1, 2, 3, 4, 5]
The error I get:
line.startswith['#'] = data
TypeError: 'builtin_function_or_method' object does not support item assignment
That seems to mean it doesn't like my "= data", but I'm not sure how to tell it that I want to take the line that starts with # and save it separately.
Correct the if statement and the indentation:
for line in file:
    if line.startswith('#'):
        print(line)
Although you're relatively new, you should start learning to use list comprehensions; here is an example of how you can use one for your situation. The comments explain each part of the comprehension.
splitdata = [line.split() for line in file if line.startswith('#')]
# defines splitdata as a list because the comprehension is wrapped in []
# makes a for loop to iterate through file
# checks if the line "startswith" a '#'
# note: you should call functions/methods using () not []
# splits the line at spaces if the if statement returns True
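The comprehension returns a list of lists, one entry per matching line. If only the first matching line is wanted, a generator expression with next() is one option. In this sketch, first_hash_line is a hypothetical name, io.StringIO stands in for file.txt, and the quotes around ##STUFF are assumed to be literally in the file, as in the question's listing.

```python
import io

def first_hash_line(lines):
    """Return the split fields of the first line starting with '#', or None."""
    return next((line.split() for line in lines if line.startswith("#")), None)

sample = io.StringIO(
    '"##STUFF"\n'
    '"##STUFF"\n'
    "#DATA 01 02 03 04 05\n"
    "More lines here\n"
)
print(first_hash_line(sample))
```

The fields come back as strings; wrap them in int() if numbers are needed.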
That's an if condition, which expects a predicate expression, not an assignment.
if line.startswith('#'):
startswith(...)
S.startswith(prefix[, start[, end]]) -> bool
Return True if S starts with the specified prefix, False otherwise.
With optional start, test S beginning at that position.
With optional end, stop comparing S at that position.
prefix can also be a tuple of strings to try.
Textfile:
1
2
3
4
5
6
\n
\n
I know lines[-1] gets you the last line, but I want to disregard any new lines and get the last line of text (6 in this case).
The best approach regarding memory is to exhaust the file. Something like this:
with open('file.txt') as f:
    last = None
    for line in (line for line in f if line.rstrip('\n')):
        last = line
print(last)
It can be done more elegantly though. A slightly different approach:
with open('file.txt') as f:
    last = None
    for last in (line for line in f if line.rstrip('\n')):
        pass
print(last)
For a small file you can just read all of the lines, discarding any empty ones. Notice that I've used an inner generator to strip the lines before excluding them in the outer one.
with open(textfile) as fp:
    last_line = [l2 for l2 in (l1.strip() for l1 in fp) if l2][-1]
with open('file') as f:
    print([i for i in f.read().split('\n') if i != ''][-1])
This is just an edit to Avinash Raj's answer (but since I'm a new account, I can't comment on it). This will preserve any None values in your data (i.e. if the data in your last line is "None" it will work, though depending on your input this may not be an issue).
with open('path/to/file') as infile:
    for line in infile:
        if not line.strip('\n'):
            continue
        answer = line
print(answer)
This will print 6 with a newline at the end. You can decide how to strip that. Following are some options:
answer.rstrip('\n') removes trailing newlines
answer.rstrip() removes trailing whitespaces
answer.strip() removes any surrounding whitespaces
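The difference between the three options shows up when the line carries other whitespace as well. A small illustration, using a made-up value for answer:

```python
# Made-up example: leading spaces, a trailing space and tab, then a newline.
answer = "  6 \t\n"

print(repr(answer.rstrip("\n")))  # only the trailing newline is removed
print(repr(answer.rstrip()))      # all trailing whitespace is removed
print(repr(answer.strip()))       # leading and trailing whitespace are removed
```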
with open('file.txt') as myfile:
    last = None
    for num, line in enumerate(myfile):
        if line.strip():  # remember the most recent non-blank line
            last = line
print(last)
I'm writing a short program in Python that will read a FASTA file which is usually in this format:
>gi|253795547|ref|NC_012960.1| Candidatus Hodgkinia cicadicola Dsem chromosome, 52 lines
GACGGCTTGTTTGCGTGCGACGAGTTTAGGATTGCTCTTTTGCTAAGCTTGGGGGTTGCGCCCAAAGTGA
TTAGATTTTCCGACAGCGTACGGCGCGCGCTGCTGAACGTGGCCACTGAGCTTACACCTCATTTCAGCGC
TCGCTTGCTGGCGAAGCTGGCAGCAGCTTGTTAATGCTAGTGTTGGGCTCGCCGAAAGCTGGCAGGTCGA
I've created another program that reads the first line(aka header) of this FASTA file and now I want this second program to start reading and printing beginning from the sequence.
How would I do that?
so far i have this:
FASTA = open("test.txt", "r")
def readSeq(FASTA):
    """returns the DNA sequence of a FASTA file"""
    for line in FASTA:
        line = line.strip()
        print(line)
readSeq(FASTA)
Thanks guys
-Noob
def readSeq(FASTA):
    """returns the DNA sequence of a FASTA file"""
    _unused = next(FASTA)  # skip heading record
    for line in FASTA:
        line = line.strip()
        print(line)
Read the docs on file.next() (the Python 2 file iterator) to see why you should be wary of mixing file.readline() with for line in file:
You should show your script. To read from the second line, try something like this:
f = open("file")
f.readline()
for line in f:
    print(line)
f.close()
You might be interested in how Biopython handles FASTA files (source):
def FastaIterator(handle, alphabet = single_letter_alphabet, title2ids = None):
    """Generator function to iterate over Fasta records (as SeqRecord objects).
    handle - input file
    alphabet - optional alphabet
    title2ids - A function that, when given the title of the FASTA
    file (without the beginning >), will return the id, name and
    description (in that order) for the record as a tuple of strings.
    If this is not given, then the entire title line will be used
    as the description, and the first word as the id and name.
    Note that use of title2ids matches that of Bio.Fasta.SequenceParser
    but the defaults are slightly different.
    """
    #Skip any text before the first record (e.g. blank lines, comments)
    while True:
        line = handle.readline()
        if line == "" : return #Premature end of file, or just empty?
        if line[0] == ">":
            break
    while True:
        if line[0]!=">":
            raise ValueError("Records in Fasta files should start with '>' character")
        if title2ids:
            id, name, descr = title2ids(line[1:].rstrip())
        else:
            descr = line[1:].rstrip()
            id = descr.split()[0]
            name = id
        lines = []
        line = handle.readline()
        while True:
            if not line : break
            if line[0] == ">": break
            #Remove trailing whitespace, and any internal spaces
            #(and any embedded \r which are possible in mangled files
            #when not opened in universal read lines mode)
            lines.append(line.rstrip().replace(" ","").replace("\r",""))
            line = handle.readline()
        #Return the record and then continue...
        yield SeqRecord(Seq("".join(lines), alphabet),
                        id = id, name = name, description = descr)
        if not line : return #StopIteration
    assert False, "Should not reach this line"
Good to see another bioinformatician :)
Just include an if clause within your for loop, above the line.strip() call:
def readSeq(FASTA):
    for line in FASTA:
        if line.startswith('>'):
            continue
        line = line.strip()
        print(line)
A pythonic and simple way to do this would be slice notation.
>>> f = open('filename')
>>> lines = f.readlines()
>>> lines[1:]
['TTAGATTTTCCGACAGCGTACGGCGCGCGCTGCTGAACGTGGCCACTGAGCTTACACCTCATTTCAGCGC\n', 'TCGCTTGCTGGCGAAGCTGGCAGCAGCTTGTTAATGCTAGTGTTGGGCTCGCCGAAAGCTGGCAGGTCGA']
That says "give me all elements of lines, from the second (index 1) to the end."
Other general uses of slice notation:
s[i:j] slice of s from i to j
s[i:j:k] slice of s from i to j with step k (k can be negative to go backward)
Either i or j can be omitted (to imply the beginning or the end), and j can be negative to indicate a number of elements from the end.
s[:-1] All but the last element.
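A few concrete cases may make the slice forms above easier to read. The list below is made up for illustration:

```python
lines = ["header", "a", "b", "c", "d"]

print(lines[1:])    # from index 1 to the end: skips the header
print(lines[1:3])   # indexes 1 and 2 (the end index is excluded)
print(lines[::2])   # every second element, starting at index 0
print(lines[::-1])  # negative step: a reversed copy
print(lines[:-1])   # all but the last element
```

Each slice returns a new list and leaves lines unchanged.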
Edit in response to gnibbler's comment:
If the file is truly massive you can use iterator slicing to get the same effect while making sure you don't get the whole thing in memory.
import itertools
f = open("filename")
#start at the second line, don't stop, stride by one
for line in itertools.islice(f, 1, None, 1):
    print(line)
"islicing" doesn't have the nice syntax or extra features of regular slicing, but it's a nice approach to remember.