Meaningful IO error messages without overhead in Python - python

I have the following dilemma. I am parsing huge CSV files, which theoretically can contain invalid records, with python. To be able to fix an issue quickly I would like to see the line numbers in the error messages. However, as I am parsing many files and errors are very rare, I do not want my error handling adding overheads to the main pipeline. That is why I would not like to use enumerate or a similar approach.
In a nutshell, I am looking for a get_line_number function to work like this:
with open('file.csv', 'r') as f:
for line in f:
try:
process(line)
except:
line_no = get_line_number(f)
raise RuntimeError('Error while processing the line ' + line_no)
However, this seems to be complicated, as f.tell() will not work in this loop.
EDIT:
It seems like overheads are quite significant. In my real world case (which is painful, as the files are lists of pretty short records: single floats, int-float pairs or string-int pairs; the file.csv is about 800MB large and has around 80M lines), it is about 2.5 seconds per file read for enumerate. For some reason, fileinput is extremely slow.
import timeit
s = """
with open('file.csv', 'r') as f:
for line in f:
pass
"""
print(timeit.repeat(s, number = 10, repeat = 3))
s = """
with open('file.csv', 'r') as f:
for idx, line in enumerate(f):
pass
"""
print(timeit.repeat(s, number = 10, repeat = 3))
s = """
count = 0
with open('file.csv', 'r') as f:
for line in f:
count += 1
"""
print(timeit.repeat(s, number = 10, repeat = 3))
setup = """
import fileinput
"""
s = """
for line in fileinput.input('file.csv'):
pass
"""
print(timeit.repeat(s, setup = setup, number = 10, repeat = 3))
outputs
[45.790788270998746, 44.88589363079518, 44.93949336092919]
[70.25306860171258, 70.28569177398458, 70.2074502906762]
[75.43606997421011, 74.39759518811479, 75.02027251804247]
[325.1898657102138, 321.0400970801711, 326.23809849238023]
EDIT 2:
Getting close to the real-world scenario. The try-except clause is outside of the loop to reduce the overhead.
import timeit
setup = """
def process(line):
if float(line) < 0.5:
outliers += 1
"""
s = """
outliers = 0
with open('file.csv', 'r') as f:
for line in f:
process(line)
"""
print(timeit.repeat(s, setup = setup, number = 10, repeat = 3))
s = """
outliers = 0
with open('file.csv', 'r') as f:
try:
for idx, line in enumerate(f):
process(line)
except ValueError:
raise RuntimeError('Invalid value in line' + (idx + 1)) from None
"""
print(timeit.repeat(s, setup = setup, number = 10, repeat = 3))
outputs
[244.9097429071553, 242.84596176538616, 242.74369075801224
[293.32093235617504, 274.17732743313536, 274.00854821596295]
So, in my case, overhead from the enumerate is around 10%.

Do use enumerate
for line_ref, line in enumerate(f):
line_no = line_ref + 1 # enumerate starts at zero
It's not adding any significant overhead. The work involved in getting records out of a file vastly exceeds the work involved in keeping a counter, and the tuple assignment in the for statement is just a name-binding, not an extra copy of the data referred to by line
Replacement update:
Made a mistake in generating my test file. Have now pretty much confirmed the first timing test added to the question.
Personally I'd regard a 10% overhead on a worst(ish)-case file with 10-byte records as completely acceptable, given that the alternative is not knowing which of 80 million records were in error.

If you are sure adding debugging info is too much overhead (I do not want to argue on that topic), you could implement two versions of the function. High performance one and one with thorough checking and verbose debugging. The basic idea is:
try:
func_quick(args)
except Exception:
func_verbose(args)
The drawback is that the processing will start again when an error occurs. But if you have to manually correct the error, penalty of several seconds wasted in such case should not harm. Also the func_verbose() doesn't have to stop on first error and may check the whole file and list all errors.

The standard library fileinput module memory-efficiently processes large files and provides a built-in line number counter. It also automatically picks up multiple filenames for files to read from the command line arguments. However there doesn't seem to be a (simple?) way to use it with context handlers.
As for performance, you'd need to test it in comparison with other approaches.
import fileinput
for line in fileinput.input():
try:
process(line)
except:
line_no = fileinput.filelineno()
raise RuntimeError('Error while processing the line ' + line_no)
BTW I'd recommend catching only relevant exceptions, probably a custom one, otherwise you'll mask out unanticipated exceptions.

Related

Python filter larger text by quantile

Assume I am process a very large text file,
I have the following pseudocode
xx_valueList = []
lines=[]
with line in file:
xx_value = calc_xxValue(line)
xx_valueList.append(xx_value)
lines.append(lines)
# get_quantile_value is a function return the cutoff value with a specific quantile precent
cut_offvalue = get_quantile_value(xx_valueList, precent=0.05)
for line in lines:
if calc_xxValue(line) > cut_offvalue:
# do someting here
Note that the file is very large and may come from a pipe, so I don't want to read it twice.
We must read the entire file before we can get the cutoff to filter file
The above method can work, but it consumes too much memory, is there some algorithmic optimization that can improve efficiency and reduce memory consumption?
xx_value_list = []
cut_offvalue = 0
with open(file, 'r') as f:
for line in f:
xx_value = calc_xxValue(line)
xx_value_list.append(xx_value)
if len(xx_value_list) % 100 == 0:
cut_offvalue = get_quantile_value(xx_value_list, precent=0.05)
if xx_value < cut_offvalue:
# do something here
pass

Python: performance issues with islice

With the following code, I'm seeing longer and longer execution times as I increase the starting row in islice. For example, a start_row of 4 will execute in 1s but a start_row of 500004 will take 11s. Why does this happen and is there a faster way to do this? I want to be able to iterate over several ranges of rows in a large CSV file (several GB) and make some calculations.
import csv
import itertools
from collections import deque
import time
my_queue = deque()
start_row = 500004
stop_row = start_row + 50000
with open('test.csv', 'rb') as fin:
#load into csv's reader
csv_f = csv.reader(fin)
#start logging time for performance
start = time.time()
for row in itertools.islice(csv_f, start_row, stop_row):
my_queue.append(float(row[4])*float(row[10]))
#stop logging time
end = time.time()
#display performance
print "Initial queue populating time: %.2f" % (end-start)
For example, a start_row of 4 will execute in 1s but a start_row of
500004 will take 11s
That is islice being intelligent. Or lazy, depending on which term you prefer.
Thing is, files are "just" strings of bytes on your hard drive. They don't have any internal organization. \n is just another set of bytes in that long, long string. There is no way to access any particular line without looking at all of the information before it (unless your lines are of the exact same length, in which case you can use file.seek).
Line 4? Finding line 4 is fast, your computer just needs to find 3 \n. Line 50004? Your computer has to read through the file until it finds 500003 \n. No way around it, and if someone tells you otherwise, they either have some other sort of quantum computer or their computer is reading through the file just like every other computer in the world, just behind their back.
As for what you can do about it: Try to be smart when trying to grab lines to iterate over. Smart, and lazy. Arrange your requests so you're only iterating through the file once, and close the file as soon as you've pulled the data you need. (islice does all of this, by the way.)
In python
lines_I_want = [(start1, stop1), (start2, stop2),...]
with f as open(filename):
for i,j in enumerate(f):
if i >= lines_I_want[0][0]:
if i >= lines_I_want[0][1]:
lines_I_want.pop(0)
if not lines_I_want: #list is empty
break
else:
#j is a line I want. Do something
And if you have any control over making that file, make every line the same length so you can seek. Or use a database.
The problem with using islice() for what you're doing is that iterates through all the lines before the first one you want before returning anything. Obviously the larger the starting row, the longer this will take. Another is that you're using a csv.reader to read these lines, which incurs likely unnecessary overhead since one line of the csv file is often one row of it. The only time that's not true is when the csv file has string fields in it that contain embedded newline characters — which in my experience is uncommon.
If this is a valid assumption for your data, it would likely be much faster to first index the file and build a table of (filename, offset, number-of-rows) tuples indicating the approximately equally-sized logical chunks of lines/rows in the file. With that, you can process them relatively quickly by first seeking to the starting offset and then reading the specified number of csv rows from that point on.
Another advantage to this approach is it would allow you to process the chunks in parallel, which I suspect is is the real problem you're trying to solve based on a previous question of yours. So, even though you haven't mentioned multiprocessing here, this following has been written to be compatible with doing that, if that's the case.
import csv
from itertools import islice
import os
import sys
def open_binary_mode(filename, mode='r'):
""" Open a file proper way (depends on Python verion). """
kwargs = (dict(mode=mode+'b') if sys.version_info[0] == 2 else
dict(mode=mode, newline=''))
return open(filename, **kwargs)
def split(infilename, num_chunks):
infile_size = os.path.getsize(infilename)
chunk_size = infile_size // num_chunks
offset = 0
num_rows = 0
bytes_read = 0
chunks = []
with open_binary_mode(infilename, 'r') as infile:
for _ in range(num_chunks):
while bytes_read < chunk_size:
try:
bytes_read += len(next(infile))
num_rows += 1
except StopIteration: # end of infile
break
chunks.append((infilename, offset, num_rows))
offset += bytes_read
num_rows = 0
bytes_read = 0
return chunks
chunks = split('sample_simple.csv', num_chunks=4)
for filename, offset, rows in chunks:
print('processing: {} rows starting at offset {}'.format(rows, offset))
with open_binary_mode(filename, 'r') as fin:
fin.seek(offset)
for row in islice(csv.reader(fin), rows):
print(row)

Optimizing removing duplicates in large files in Python

I have one very large text file(27GB) that I am attempting to make smaller by removing lines that are duplicated in a second database that has several files of a more reasonable size(500MB-2GB). I have some functional code, what I am wondering is is there any way I can optimize this code to run faster, human clock time? At the moment, on a small test run with a 1.5GB input and 500MB filter, this takes 75~ seconds to complete.
I've gone through many iterations of this idea, this one is currently best for time, if anyone has ideas for making a better logical structure for the filter I'd love to hear it, past attempts that have all been worse than this one: Loading the filter into a set and cycling through the input searching for duplicates(about half as fast as this), loading the input into a set and running the filter through a difference_update(almost as fast as this but also doing the reverse of what I wanted), and loading both input and filter into sets in chunks and doing set differences(that was a horrible, horrible idea that would work(maybe?) if my filters were smaller so I didn't have to split them, too.)
So, those are all the things I've tried. All of these processes max out on CPU, and my final version runs at about 25-50% disk I/O, filter and output are on one physical disk, input is on another. I am running a dual core and have no idea if this particular script can be threaded, never done any multithreading before so if that's a possibility I'd love to be pointed in the right direction.
Information about the data! As said earlier, the input is many times larger than the filter. I am expecting a very small percentage of duplication. The data is in lines, all of which are under 20 ASCII characters long. The files are all sorted.
I've already changed the order of the three logical statements, based on the expectation that unique input lines will be the majority of the lines, then unique filter, then duplicates, which on the 'best' case of having no duplicates at all, saved me about 10% time.
Any suggestions?
def sortedfilter(input,filter,output):
file_input = open(input,'r')
file_filter = open(filter,'r')
file_output = open(output,'w')
inline = file_input.next()
filterline = file_filter.next()
try:
while inline and filterline:
if inline < filterline:
file_output.write(inline)
inline = file_input.next()
continue
if inline > filterline:
filterline = file_filter.next()
continue
if inline == filterline:
filterline = file_filter.next()
inline = file_input.next()
except StopIteration:
file_output.writelines(file_input.readlines())
finally:
file_filter.close()
file_input.close()
file_output.close()
I'd be interested in knowing how this compares; it's basically just playing with the order of iteration a bit.
def sortedfilter(in_fname, filter_fname, out_fname):
with open(in_fname) as inf, open(filter_fname) as fil, open(out_fname, 'w') as outf:
ins = inf.next()
try:
for fs in fil:
while ins < fs:
outf.write(ins)
ins = inf.next()
while ins == fs:
ins = inf.next()
except StopIteration:
# reached end of inf before end of fil
pass
else:
# reached end of fil first, pass rest of inf through
file_output.writelines(file_input.readlines())
You can do the string comparison operation only once per line by executing cmp(inline, filterline):
-1 means inline < filterline
0 means inline == filterline
+1 means inline < filterline.
This might get you a few extra percent. So something like:
while inline and filterline:
comparison = cmp(inline, filterline)
if comparison == -1:
file_output.write(inline)
inline = file_input.next()
continue
if comparison == 1:
filterline = file_filter.next()
continue
if comparison == 0:
filterline = file_filter.next()
inline = file_input.next()
seeing that your input files are sorted, the following should work. heapq's merge produces a sorted stream from sorted inputs upon which groupby operates; groups whose lengths are greater than 1 are discarded. This stream-based approach should have relatively low memory requirements
from itertools import groupby, repeat, izip
from heapq import merge
from operator import itemgetter
with open('input.txt', 'r') as file_input, open('filter.txt', 'r') as file_filter, open('output.txt', 'w') as file_output:
file_input_1 = izip(file_input, repeat(1))
file_filter_1 = izip(file_filter, repeat(2))
gen = merge(file_input_1, file_filter_1)
gen = ((k, list(g)) for (k, g) in groupby(gen, key=itemgetter(0)))
gen = (k for (k, g) in gen if len(g) == 1 and g[0][1] == 1)
for line in gen:
file_output.write(line)
Consider opening your files as binary so no conversion to unicode needs to be made:
with open(in_fname,'rb') as inf, open(filter_fname,'rb') as fil, open(out_fname, 'wb') as outf:

Upper memory limit?

Is there a limit to memory for python? I've been using a python script to calculate the average values from a file which is a minimum of 150mb big.
Depending on the size of the file I sometimes encounter a MemoryError.
Can more memory be assigned to the python so I don't encounter the error?
EDIT: Code now below
NOTE: The file sizes can vary greatly (up to 20GB) the minimum size of the a file is 150mb
file_A1_B1 = open("A1_B1_100000.txt", "r")
file_A2_B2 = open("A2_B2_100000.txt", "r")
file_A1_B2 = open("A1_B2_100000.txt", "r")
file_A2_B1 = open("A2_B1_100000.txt", "r")
file_write = open ("average_generations.txt", "w")
mutation_average = open("mutation_average", "w")
files = [file_A2_B2,file_A2_B2,file_A1_B2,file_A2_B1]
for u in files:
line = u.readlines()
list_of_lines = []
for i in line:
values = i.split('\t')
list_of_lines.append(values)
count = 0
for j in list_of_lines:
count +=1
for k in range(0,count):
list_of_lines[k].remove('\n')
length = len(list_of_lines[0])
print_counter = 4
for o in range(0,length):
total = 0
for p in range(0,count):
number = float(list_of_lines[p][o])
total = total + number
average = total/count
print average
if print_counter == 4:
file_write.write(str(average)+'\n')
print_counter = 0
print_counter +=1
file_write.write('\n')
(This is my third answer because I misunderstood what your code was doing in my original, and then made a small but crucial mistake in my second—hopefully three's a charm.
Edits: Since this seems to be a popular answer, I've made a few modifications to improve its implementation over the years—most not too major. This is so if folks use it as template, it will provide an even better basis.
As others have pointed out, your MemoryError problem is most likely because you're attempting to read the entire contents of huge files into memory and then, on top of that, effectively doubling the amount of memory needed by creating a list of lists of the string values from each line.
Python's memory limits are determined by how much physical ram and virtual memory disk space your computer and operating system have available. Even if you don't use it all up and your program "works", using it may be impractical because it takes too long.
Anyway, the most obvious way to avoid that is to process each file a single line at a time, which means you have to do the processing incrementally.
To accomplish this, a list of running totals for each of the fields is kept. When that is finished, the average value of each field can be calculated by dividing the corresponding total value by the count of total lines read. Once that is done, these averages can be printed out and some written to one of the output files. I've also made a conscious effort to use very descriptive variable names to try to make it understandable.
try:
from itertools import izip_longest
except ImportError: # Python 3
from itertools import zip_longest as izip_longest
GROUP_SIZE = 4
input_file_names = ["A1_B1_100000.txt", "A2_B2_100000.txt", "A1_B2_100000.txt",
"A2_B1_100000.txt"]
file_write = open("average_generations.txt", 'w')
mutation_average = open("mutation_average", 'w') # left in, but nothing written
for file_name in input_file_names:
with open(file_name, 'r') as input_file:
print('processing file: {}'.format(file_name))
totals = []
for count, fields in enumerate((line.split('\t') for line in input_file), 1):
totals = [sum(values) for values in
izip_longest(totals, map(float, fields), fillvalue=0)]
averages = [total/count for total in totals]
for print_counter, average in enumerate(averages):
print(' {:9.4f}'.format(average))
if print_counter % GROUP_SIZE == 0:
file_write.write(str(average)+'\n')
file_write.write('\n')
file_write.close()
mutation_average.close()
You're reading the entire file into memory (line = u.readlines()) which will fail of course if the file is too large (and you say that some are up to 20 GB), so that's your problem right there.
Better iterate over each line:
for current_line in u:
do_something_with(current_line)
is the recommended approach.
Later in your script, you're doing some very strange things like first counting all the items in a list, then constructing a for loop over the range of that count. Why not iterate over the list directly? What is the purpose of your script? I have the impression that this could be done much easier.
This is one of the advantages of high-level languages like Python (as opposed to C where you do have to do these housekeeping tasks yourself): Allow Python to handle iteration for you, and only collect in memory what you actually need to have in memory at any given time.
Also, as it seems that you're processing TSV files (tabulator-separated values), you should take a look at the csv module which will handle all the splitting, removing of \ns etc. for you.
Python can use all memory available to its environment. My simple "memory test" crashes on ActiveState Python 2.6 after using about
1959167 [MiB]
On jython 2.5 it crashes earlier:
239000 [MiB]
probably I can configure Jython to use more memory (it uses limits from JVM)
Test app:
import sys
sl = []
i = 0
# some magic 1024 - overhead of string object
fill_size = 1024
if sys.version.startswith('2.7'):
fill_size = 1003
if sys.version.startswith('3'):
fill_size = 497
print(fill_size)
MiB = 0
while True:
s = str(i).zfill(fill_size)
sl.append(s)
if i == 0:
try:
sys.stderr.write('size of one string %d\n' % (sys.getsizeof(s)))
except AttributeError:
pass
i += 1
if i % 1024 == 0:
MiB += 1
if MiB % 25 == 0:
sys.stderr.write('%d [MiB]\n' % (MiB))
In your app you read whole file at once. For such big files you should read the line by line.
No, there's no Python-specific limit on the memory usage of a Python application. I regularly work with Python applications that may use several gigabytes of memory. Most likely, your script actually uses more memory than available on the machine you're running on.
In that case, the solution is to rewrite the script to be more memory efficient, or to add more physical memory if the script is already optimized to minimize memory usage.
Edit:
Your script reads the entire contents of your files into memory at once (line = u.readlines()). Since you're processing files up to 20 GB in size, you're going to get memory errors with that approach unless you have huge amounts of memory in your machine.
A better approach would be to read the files one line at a time:
for u in files:
for line in u: # This will iterate over each line in the file
# Read values from the line, do necessary calculations
Not only are you reading the whole of each file into memory, but also you laboriously replicate the information in a table called list_of_lines.
You have a secondary problem: your choices of variable names severely obfuscate what you are doing.
Here is your script rewritten with the readlines() caper removed and with meaningful names:
file_A1_B1 = open("A1_B1_100000.txt", "r")
file_A2_B2 = open("A2_B2_100000.txt", "r")
file_A1_B2 = open("A1_B2_100000.txt", "r")
file_A2_B1 = open("A2_B1_100000.txt", "r")
file_write = open ("average_generations.txt", "w")
mutation_average = open("mutation_average", "w") # not used
files = [file_A2_B2,file_A2_B2,file_A1_B2,file_A2_B1]
for afile in files:
table = []
for aline in afile:
values = aline.split('\t')
values.remove('\n') # why?
table.append(values)
row_count = len(table)
row0length = len(table[0])
print_counter = 4
for column_index in range(row0length):
column_total = 0
for row_index in range(row_count):
number = float(table[row_index][column_index])
column_total = column_total + number
column_average = column_total/row_count
print column_average
if print_counter == 4:
file_write.write(str(column_average)+'\n')
print_counter = 0
print_counter +=1
file_write.write('\n')
It rapidly becomes apparent that (1) you are calculating column averages (2) the obfuscation led some others to think you were calculating row averages.
As you are calculating column averages, no output is required until the end of each file, and the amount of extra memory actually required is proportional to the number of columns.
Here is a revised version of the outer loop code:
for afile in files:
for row_count, aline in enumerate(afile, start=1):
values = aline.split('\t')
values.remove('\n') # why?
fvalues = map(float, values)
if row_count == 1:
row0length = len(fvalues)
column_index_range = range(row0length)
column_totals = fvalues
else:
assert len(fvalues) == row0length
for column_index in column_index_range:
column_totals[column_index] += fvalues[column_index]
print_counter = 4
for column_index in column_index_range:
column_average = column_totals[column_index] / row_count
print column_average
if print_counter == 4:
file_write.write(str(column_average)+'\n')
print_counter = 0
print_counter +=1

Loop over a file and write the next line if a condition is met

Having a hard time fixing this or finding any good hints about it.
I'm trying to loop over one file, modify each line slightly, and then loop over a different file. If the line in the second file starts with the line from the first then the following line in the second file should be written to a third file.
with open('ids.txt', 'rU') as f:
with open('seqres.txt', 'rU') as g:
for id in f:
id=id.lower()[0:4]+'_'+id[4]
with open(id + '.fasta', 'w') as h:
for line in g:
if line.startswith('>'+ id):
h.write(g.next())
All the correct files appear, but they are empty. Yes, I am sure the if has true cases. :-)
"seqres.txt" has lines with an ID number in a certain format, each followed by a line with data. The "ids.txt" has lines with the ID numbers of interest in a different format. I want each line of data with an interesting ID number in its own file.
Thanks a million to anyone with a little advice!
Here's a mostly flattened implementation. Depending on how many hits you're going to get for each ID, and how many entries there are in 'seqres' you could redesign it.
# Extract the IDs in the desired format and cache them
ids = [ x.lower()[0:4]+'_'+x[4] for x in open('ids.txt','rU')]
ids = set(ids)
# Create iterator for seqres.txt file and pull the first value
iseqres = iter(open('seqres.txt','rU'))
lineA = iseqres.next()
# iterate through the rest of seqres, staggering
for lineB in iseqres:
lineID = lineA[1:7]
if lineID in ids:
with open("%s.fasta" % lineID, 'a') as h:
h.write(lineB)
lineA = lineB
I think there is still progress to be made from the code you declare as final. You can make the result a little less nested and avoid a couple sort of silly things.
from contextlib import nested
from itertools import tee, izip
# Stole pairwise recipe from the itertools documentation
def pairwise(iterable):
"s -> (s0,s1), (s1,s2), (s2, s3), ..."
a, b = tee(iterable)
next(b, None)
return izip(a, b)
with nested(open('ids.txt', 'rU'), open('seqres.txt', 'rU')) as (f, g):
for id in f:
id = id.lower()[0:4] + '_' + id[4]
with open(id + '.fasta', 'w') as h:
g.seek(0) # start at the beginning of g each time
for line, next_line in pairwise(g):
if line.startswith('>' + id):
h.write(next_line)
This is an improvement over the final code you posted in that
It does not unnecessarily read the whole files into memory, but simple iterates over the file objects. (This may or may not be the best option for g, really. It definitely scales better.)
It does not contain the crash condition using gl[line+1] if we are already on the last line of gl
Depending on how g actually looks, there might be something more applicable than pairwise.
It is not as deeply nested.
It conforms to PEP8 for things like spaces around operators and indentation depth.
This algorithm is O(n * m), where n and m are the number of lines in f and g. If the length of f is unbounded, you can use a set of its ids to reduce the algorithm to O(n) (linear in the number of lines in g).
For speed, you really want to avoid looping over the same file multiple times. This means you've turned into an O(N*M) algorithm, when you could be a using an O(N+M) one.
To achieve this, read your list of ids into a fast lookup structure, like a set. Since there are only 4600 this in-memory form shouldn't be any problem.
The new solution is also reading the list into memory. Probably not a huge problem with just a few hundred thousand lines, but its wasting more memory than you need, since you can do the whole thing in a single pass, only reading the smaller ids.txt file into memory. You can just set a flag when the previous line was something interesting, which will signal the next line to write it out.
Here's a reworked version:
with open('ids.txt', 'rU') as f:
interesting_ids = set('>' + line.lower()[0:4] + "_" + line[4] for line in f) # Get all ids in a set.
found_id = None
with open('seqres.txt', 'rU') as g:
for line in g:
if found_id is not None:
with open(found_id+'.fasta','w') as h:
h.write(line)
id = line[:7]
if id in interesting_ids: found_id = id
else: found_id = None
The problem is that you are only looping through file g once - after you have read through it the first time the file index position is left at the end of the file, so any further reads will fail with EOF. You would need to reopen g every time round the loop.
However this will be massively inefficient - you are reading the same file repeatedly, once for every line in f. It will be orders of magnitude faster to read all of g into an array at the start and use that, so long as it will fit in memory.
After the first line in the ids.txt file has been processed, the file seqres.txt has been exhausted. There is something wrong with the nesting of your loops. Also, you're modifying the iterator inside the for line in g loop. Not a good idea.
If you really want to append the line that follows the line whose ID matches, then perhaps something like this might work better:
with open('ids.txt', 'rU') as f:
ids = f.readlines()
with open('seqres.txt', 'rU') as g:
seqres = g.readlines()
for id in ids:
id=id.lower()[0:4]+'_'+id[4]
with open(id + '.fasta', 'a') as h:
for line in seqres:
if line.startswith('>'+ id):
h.write(seqres.next())

Categories