I have a csv file with many millions of rows. I want to start iterating from the 10,000,000 row. At the moment I have the code:
with open(csv_file, encoding='UTF-8') as f:
r = csv.reader(f)
for row_number, row in enumerate(r):
if row_number < 10000000:
continue
else:
process_row(row)
This works, however take several seconds to run before the rows of interest appear. Presumably all the unrequired rows are loaded into python unnecessarily, slowing it down. Is there a way of starting the iteration process on a certain row - i.e. without the start of the data read in.
You could use islice:
from itertools import islice
with open(csv_file, encoding='UTF-8') as f:
r = csv.reader(f)
for row in islice(r, 10000000, None):
process_row(row)
It still iterates over all the rows but does it a lot more efficiently.
You could also use the consume recipe which calls functions that consume iterators at C speed, calling it on the file object before you pass it to the csv.reader, so you also avoid needlessly processing those lines with the reader:
import collections
from itertools import islice
def consume(iterator, n):
"Advance the iterator n-steps ahead. If n is none, consume entirely."
# Use functions that consume iterators at C speed.
if n is None:
# feed the entire iterator into a zero-length deque
collections.deque(iterator, maxlen=0)
else:
# advance to the empty slice starting at position n
next(islice(iterator, n, n), None)
with open(csv_file, encoding='UTF-8') as f:
consume(f, 9999999)
r = csv.reader(f)
for row in r:
process_row(row)
As Shadowranger commented, if a file could conatin embedded newlines then you would have to consume the reader and pass newline="" but if that is not the case then use do consume the file object as the performance difference will be considerable especially if you have a lot of columns.
Related
Due to the huge data size, we used pandas to process data, but a very strange phenomenon occurred. The pseudo code looks like this:
reader = pd.read_csv(IN_FILE, chunksize = 1000, engine='c')
for chunk in reader:
result = []
for line in chunk.tolist():
temp = complicated_process(chunk) # this involves a very complicated processing, so here is just a simplified version
result.append(temp)
chunk['new_series'] = pd.series(result)
chunk.to_csv(OUT_TILE, index=False, mode='a')
We can confirm each loop of result is not empty. But only in the first time of the loop, line chunk['new_series'] = pd.series(result) has result, the rest are empty. Therefore, only the first chunk of the output contains new_series, the rest are empty.
Did we miss anything here? Thanks in advance.
You should declare result above your loop, otherwise you are just re-initializing it with each chunk.
result = []
for chunk in reader:
...
You previous method is functionally equivalent to:
for chunk in reader:
del result # because it is being re-assigned on the following line.
result = []
result.append(something)
print(result) # Only shows result from last chunk in reader (the last loop).
Also, I would recommend:
chunk = chunk.assign(new_series=result) # Instead of `chunk['new_series'] = pd.series(result)`.
I am assuming you are doing something with the line variable in your for loop, even though it is not used in your example above.
A better solution would be this:
reader = pd.read_csv(IN_FILE, chunksize = 1000, engine='c')
for chunk in reader:
result = []
for line in chunk.tolist():
temp = complicated_process(chunk) # this involves a very complicated processing, so here is just a simplified version
result.append(temp)
new_chunk = chunk.reset_index()
new_chunk = new_chunk.assign(new_series=result)
new_chunk.to_csv(OUT_TILE, index=False, mode='a')
Notice: the index of each chunk is not individual, but is derived the whole file. If we append a new series from each loop, the chunk will inherit the index from the whole file. Therefore, the index of each chunk and the new series does not match.
The solution by #Alexander works, but the result might become huge, so it will occupy too much memory.
The new solution here will reset index for each chunk by doing new_chunk = chunk.reset_index(), and result will be reset within each loop. This saves a lot of memory.
With the following code, I'm seeing longer and longer execution times as I increase the starting row in islice. For example, a start_row of 4 will execute in 1s but a start_row of 500004 will take 11s. Why does this happen and is there a faster way to do this? I want to be able to iterate over several ranges of rows in a large CSV file (several GB) and make some calculations.
import csv
import itertools
from collections import deque
import time
my_queue = deque()
start_row = 500004
stop_row = start_row + 50000
with open('test.csv', 'rb') as fin:
#load into csv's reader
csv_f = csv.reader(fin)
#start logging time for performance
start = time.time()
for row in itertools.islice(csv_f, start_row, stop_row):
my_queue.append(float(row[4])*float(row[10]))
#stop logging time
end = time.time()
#display performance
print "Initial queue populating time: %.2f" % (end-start)
For example, a start_row of 4 will execute in 1s but a start_row of
500004 will take 11s
That is islice being intelligent. Or lazy, depending on which term you prefer.
Thing is, files are "just" strings of bytes on your hard drive. They don't have any internal organization. \n is just another set of bytes in that long, long string. There is no way to access any particular line without looking at all of the information before it (unless your lines are of the exact same length, in which case you can use file.seek).
Line 4? Finding line 4 is fast, your computer just needs to find 3 \n. Line 50004? Your computer has to read through the file until it finds 500003 \n. No way around it, and if someone tells you otherwise, they either have some other sort of quantum computer or their computer is reading through the file just like every other computer in the world, just behind their back.
As for what you can do about it: Try to be smart when trying to grab lines to iterate over. Smart, and lazy. Arrange your requests so you're only iterating through the file once, and close the file as soon as you've pulled the data you need. (islice does all of this, by the way.)
In python
lines_I_want = [(start1, stop1), (start2, stop2),...]
with f as open(filename):
for i,j in enumerate(f):
if i >= lines_I_want[0][0]:
if i >= lines_I_want[0][1]:
lines_I_want.pop(0)
if not lines_I_want: #list is empty
break
else:
#j is a line I want. Do something
And if you have any control over making that file, make every line the same length so you can seek. Or use a database.
The problem with using islice() for what you're doing is that iterates through all the lines before the first one you want before returning anything. Obviously the larger the starting row, the longer this will take. Another is that you're using a csv.reader to read these lines, which incurs likely unnecessary overhead since one line of the csv file is often one row of it. The only time that's not true is when the csv file has string fields in it that contain embedded newline characters — which in my experience is uncommon.
If this is a valid assumption for your data, it would likely be much faster to first index the file and build a table of (filename, offset, number-of-rows) tuples indicating the approximately equally-sized logical chunks of lines/rows in the file. With that, you can process them relatively quickly by first seeking to the starting offset and then reading the specified number of csv rows from that point on.
Another advantage to this approach is it would allow you to process the chunks in parallel, which I suspect is is the real problem you're trying to solve based on a previous question of yours. So, even though you haven't mentioned multiprocessing here, this following has been written to be compatible with doing that, if that's the case.
import csv
from itertools import islice
import os
import sys
def open_binary_mode(filename, mode='r'):
""" Open a file proper way (depends on Python verion). """
kwargs = (dict(mode=mode+'b') if sys.version_info[0] == 2 else
dict(mode=mode, newline=''))
return open(filename, **kwargs)
def split(infilename, num_chunks):
infile_size = os.path.getsize(infilename)
chunk_size = infile_size // num_chunks
offset = 0
num_rows = 0
bytes_read = 0
chunks = []
with open_binary_mode(infilename, 'r') as infile:
for _ in range(num_chunks):
while bytes_read < chunk_size:
try:
bytes_read += len(next(infile))
num_rows += 1
except StopIteration: # end of infile
break
chunks.append((infilename, offset, num_rows))
offset += bytes_read
num_rows = 0
bytes_read = 0
return chunks
chunks = split('sample_simple.csv', num_chunks=4)
for filename, offset, rows in chunks:
print('processing: {} rows starting at offset {}'.format(rows, offset))
with open_binary_mode(filename, 'r') as fin:
fin.seek(offset)
for row in islice(csv.reader(fin), rows):
print(row)
i have this code:
import csv
import collections
def do_work():
(data,counter)=get_file('thefile.csv')
b=samples_subset1(data, counter,'/pythonwork/samples_subset3.csv',500)
return
def get_file(start_file):
with open(start_file, 'rb') as f:
data = list(csv.reader(f))
counter = collections.defaultdict(int)
for row in data:
counter[row[10]] += 1
return (data,counter)
def samples_subset1(data,counter,output_file,sample_cutoff):
with open(output_file, 'wb') as outfile:
writer = csv.writer(outfile)
b_counter=0
b=[]
for row in data:
if counter[row[10]] >= sample_cutoff:
b.append(row)
writer.writerow(row)
b_counter+=1
return (b)
i recently started learning python, and would like to start off with good habits. therefore, i was wondering if you can help me get started to turn this code into classes. i dont know where to start.
Per my comment on the original post, I don't think a class is necessary here. Still, if other Python programmers will ever read this, I'd suggest getting it inline with PEP8, the Python style guide. Here's a quick rewrite:
import csv
import collections
def do_work():
data, counter = get_file('thefile.csv')
b = samples_subset1(data, counter, '/pythonwork/samples_subset3.csv', 500)
def get_file(start_file):
with open(start_file, 'rb') as f:
counter = collections.defaultdict(int)
data = list(csv.reader(f))
for row in data:
counter[row[10]] += 1
return (data, counter)
def samples_subset1(data, counter, output_file, sample_cutoff):
with open(output_file, 'wb') as outfile:
writer = csv.writer(outfile)
b = []
for row in data:
if counter[row[10]] >= sample_cutoff:
b.append(row)
writer.writerow(row)
return b
Notes:
No one uses more than 4 spaces to
indent ever. Use 2 - 4. And all
your levels of indentation should
match.
Use a single space after the commas between arguments
to functions ("F(a, b, c)" not
"F(a,b,c)")
Naked return statements at the end of a function
are meaningless. Functions without
return statements implicitly return
None
Single space around all
operators (a = 1, not a=1)
Do not
wrap single values in parentheses.
It looks like a tuple, but it isn't.
b_counter wasn't used at all, so I
removed it.
csv.reader returns an iterator, which you are casting to a list. That's usually a bad idea because it forces Python to load the entire file into memory at once, whereas the iterator will just return each line as needed. Understanding iterators is absolutely essential to writing efficient Python code. I've left data in for now, but you could rewrite to use an iterator everywhere you're using data, which is a list.
Well, I'm not sure what you want to turn into a class. Do you know what a class is? You want to make a class to represent some type of thing. If I understand your code correctly, you want to filter a CSV to show only those rows whose row[ 10 ] is shared by at least sample_cutoff other rows. Surely you could do that with an Excel filter much more easily than by reading through the file in Python?
What the guy in the other thread suggested is true, but not really applicable to your situation. You used a lot of global variables unnecessarily: if they'd been necessary to the code you should have put everything into a class and made them attributes, but as you didn't need them in the first place, there's no point in making a class.
Some tips on your code:
Don't cast the file to a list. That makes Python read the whole thing into memory at once, which is bad if you have a big file. Instead, simply iterate through the file itself: for row in csv.reader(f): Then, when you want to go through the file a second time, just do f.seek(0) to return to the top and start again.
Don't put return at the end of every function; that's just unnecessary. You don't need parentheses, either: return spam is fine.
Rewrite
import csv
import collections
def do_work():
with open( 'thefile.csv' ) as f:
# Open the file and count the rows.
data, counter = get_file(f)
# Go back to the start of the file.
f.seek(0)
# Filter to only common rows.
b = samples_subset1(data, counter,
'/pythonwork/samples_subset3.csv', 500)
return b
def get_file(f):
counter = collections.defaultdict(int)
data = csv.reader(f)
for row in data:
counter[row[10]] += 1
return data, counter
def samples_subset1(data, counter, output_file, sample_cutoff):
with open(output_file, 'wb') as outfile:
writer = csv.writer(outfile)
b = []
for row in data:
if counter[row[10]] >= sample_cutoff:
b.append(row)
writer.writerow(row)
return b
I have a csv DictReader object (using Python 3.1), but I would like to know the number of lines/rows contained in the reader before I iterate through it. Something like as follows...
myreader = csv.DictReader(open('myFile.csv', newline=''))
totalrows = ?
rowcount = 0
for row in myreader:
rowcount +=1
print("Row %d/%d" % (rowcount,totalrows))
I know I could get the total by iterating through the reader, but then I couldn't run the 'for' loop. I could iterate through a copy of the reader, but I cannot find how to copy an iterator.
I could also use
totalrows = len(open('myFile.csv').readlines())
but that seems an unnecessary re-opening of the file. I would rather get the count from the DictReader if possible.
Any help would be appreciated.
Alan
rows = list(myreader)
totalrows = len(rows)
for i, row in enumerate(rows):
print("Row %d/%d" % (i+1, totalrows))
You only need to open the file once:
import csv
f = open('myFile.csv', 'rb')
countrdr = csv.DictReader(f)
totalrows = 0
for row in countrdr:
totalrows += 1
f.seek(0) # You may not have to do this, I didn't check to see if DictReader did
myreader = csv.DictReader(f)
for row in myreader:
do_work
No matter what you do you have to make two passes (well, if your records are a fixed length - which is unlikely - you could just get the file size and divide, but lets presume that isn't the case). Opening the file again really doesn't cost you much, but you can avoid it as illustrated here. Converting to a list just to use len() is potentially going to waste tons of memory, and not be any faster.
Note: The 'Pythonic' way is to use enumerate instead of +=, but the UNPACK_TUPLE opcode is so expensive that it makes enumerate slower than incrementing a local. That being said, it's likely an unnecessary micro-optimization that you should probably avoid.
More Notes: If you really just want to generate some kind of progress indicator, it doesn't necessarily have to be record based. You can tell() on the file object in the loop and just report what % of the data you're through. It'll be a little uneven, but chances are on any file that's large enough to warrant a progress bar the deviation on record length will be lost in the noise.
I cannot find how to copy an
iterator.
Closest is itertools.tee, but simply making a list of it, as #J.F.Sebastian suggests, is best here, as itertools.tee's docs explain:
This itertool may require significant
auxiliary storage (depending on how
much temporary data needs to be
stored). In general, if one iterator
uses most or all of the data before
another iterator starts, it is faster
to use list() instead of tee().
As mentioned in the answer https://stackoverflow.com/a/2890569/8056572 you can get the number of lines by taking the length of the reader converted to a list. However, this will have an impact on the RAM consumption and you will loose the benefits of the reader (which is a generator).
The best solution in my opinion is to open the file 2 times:
count the number of lines:
total_rows = sum(1 for _ in open('myFile.csv')) # -1 if you want to remove the header from the count
Note: I am not using .readlines() to avoid to load all the lines in memory
iterate over the lines
According to your snippet you will have something like this:
import csv
totalrows = sum(1 for _ in open('myFile.csv'))
myreader = csv.DictReader(open('myFile.csv'))
for i, _ in enumerate(myreader, start=1):
print("Row %d/%d" % (i, totalrows))
Note: the start=1 in the enumerate indicates the first value of i. By default it is 0, if you keep this default value you have to use i + 1 in the print statement
If you really do not want to open the file two times you can use seek as mentioned in the answer https://stackoverflow.com/a/2891061/8056572
import csv
f = open('myFile.csv')
total_rows = sum(1 for _ in f)
f.seek(0)
myreader = csv.DictReader(f)
for i, _ in enumerate(myreader, start=1):
print("Row %d/%d" % (i, totalrows))
Having a hard time fixing this or finding any good hints about it.
I'm trying to loop over one file, modify each line slightly, and then loop over a different file. If the line in the second file starts with the line from the first then the following line in the second file should be written to a third file.
with open('ids.txt', 'rU') as f:
with open('seqres.txt', 'rU') as g:
for id in f:
id=id.lower()[0:4]+'_'+id[4]
with open(id + '.fasta', 'w') as h:
for line in g:
if line.startswith('>'+ id):
h.write(g.next())
All the correct files appear, but they are empty. Yes, I am sure the if has true cases. :-)
"seqres.txt" has lines with an ID number in a certain format, each followed by a line with data. The "ids.txt" has lines with the ID numbers of interest in a different format. I want each line of data with an interesting ID number in its own file.
Thanks a million to anyone with a little advice!
Here's a mostly flattened implementation. Depending on how many hits you're going to get for each ID, and how many entries there are in 'seqres' you could redesign it.
# Extract the IDs in the desired format and cache them
ids = [ x.lower()[0:4]+'_'+x[4] for x in open('ids.txt','rU')]
ids = set(ids)
# Create iterator for seqres.txt file and pull the first value
iseqres = iter(open('seqres.txt','rU'))
lineA = iseqres.next()
# iterate through the rest of seqres, staggering
for lineB in iseqres:
lineID = lineA[1:7]
if lineID in ids:
with open("%s.fasta" % lineID, 'a') as h:
h.write(lineB)
lineA = lineB
I think there is still progress to be made from the code you declare as final. You can make the result a little less nested and avoid a couple sort of silly things.
from contextlib import nested
from itertools import tee, izip
# Stole pairwise recipe from the itertools documentation
def pairwise(iterable):
"s -> (s0,s1), (s1,s2), (s2, s3), ..."
a, b = tee(iterable)
next(b, None)
return izip(a, b)
with nested(open('ids.txt', 'rU'), open('seqres.txt', 'rU')) as (f, g):
for id in f:
id = id.lower()[0:4] + '_' + id[4]
with open(id + '.fasta', 'w') as h:
g.seek(0) # start at the beginning of g each time
for line, next_line in pairwise(g):
if line.startswith('>' + id):
h.write(next_line)
This is an improvement over the final code you posted in that
It does not unnecessarily read the whole files into memory, but simple iterates over the file objects. (This may or may not be the best option for g, really. It definitely scales better.)
It does not contain the crash condition using gl[line+1] if we are already on the last line of gl
Depending on how g actually looks, there might be something more applicable than pairwise.
It is not as deeply nested.
It conforms to PEP8 for things like spaces around operators and indentation depth.
This algorithm is O(n * m), where n and m are the number of lines in f and g. If the length of f is unbounded, you can use a set of its ids to reduce the algorithm to O(n) (linear in the number of lines in g).
For speed, you really want to avoid looping over the same file multiple times. This means you've turned into an O(N*M) algorithm, when you could be a using an O(N+M) one.
To achieve this, read your list of ids into a fast lookup structure, like a set. Since there are only 4600 this in-memory form shouldn't be any problem.
The new solution is also reading the list into memory. Probably not a huge problem with just a few hundred thousand lines, but its wasting more memory than you need, since you can do the whole thing in a single pass, only reading the smaller ids.txt file into memory. You can just set a flag when the previous line was something interesting, which will signal the next line to write it out.
Here's a reworked version:
with open('ids.txt', 'rU') as f:
interesting_ids = set('>' + line.lower()[0:4] + "_" + line[4] for line in f) # Get all ids in a set.
found_id = None
with open('seqres.txt', 'rU') as g:
for line in g:
if found_id is not None:
with open(found_id+'.fasta','w') as h:
h.write(line)
id = line[:7]
if id in interesting_ids: found_id = id
else: found_id = None
The problem is that you are only looping through file g once - after you have read through it the first time the file index position is left at the end of the file, so any further reads will fail with EOF. You would need to reopen g every time round the loop.
However this will be massively inefficient - you are reading the same file repeatedly, once for every line in f. It will be orders of magnitude faster to read all of g into an array at the start and use that, so long as it will fit in memory.
After the first line in the ids.txt file has been processed, the file seqres.txt has been exhausted. There is something wrong with the nesting of your loops. Also, you're modifying the iterator inside the for line in g loop. Not a good idea.
If you really want to append the line that follows the line whose ID matches, then perhaps something like this might work better:
with open('ids.txt', 'rU') as f:
ids = f.readlines()
with open('seqres.txt', 'rU') as g:
seqres = g.readlines()
for id in ids:
id=id.lower()[0:4]+'_'+id[4]
with open(id + '.fasta', 'a') as h:
for line in seqres:
if line.startswith('>'+ id):
h.write(seqres.next())