Python Generator memory benefits for large readins? - python

I'm wondering about the memory benefits of python generators in this use case (if any). I wish to read in a large text file that must be shared between all objects. Because it only needs to be used once and the program finishes once the list is exhausted I was planning on using generators.
The "saved state" of a generator I believe lets it keep track of what is the next value to be passed to whatever object is calling it. I've read that generators also save memory usage by not returning all the values at once, but rather calculating them on the fly. I'm a little confused if I'd get any benefit in this use case though.
Example Code:
def bufferedFetch():
    while True:
        buffer = open("bigfile.txt","r").read().split('\n')
        for i in buffer:
            yield i
Considering that the buffer is going to be reading in the entire "bigfile.txt" anyway, wouldn't this be stored within the generator, for no memory benefit? Is there a better way to return the next value of a list that can be shared between all objects?
Thanks.

In this case no. You are reading the entire file into memory by doing .read().
What you ideally want to do instead is:
def bufferedFetch():
    with open("bigfile.txt","r") as f:
        for line in f:
            yield line
The Python file object takes care of line endings for you (system dependent), and its built-in iterator yields one line at a time as you iterate over it, without reading the entire file into memory.
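To make the sharing part of the question concrete, here is a minimal sketch, assuming a small temporary file stands in for "bigfile.txt". The function name buffered_fetch, its path parameter, and the sample contents are illustrative, not from the original post. A single generator object is created once and handed to every consumer; each call to next() advances the same underlying file position.

```python
import os
import tempfile


def buffered_fetch(path):
    """Yield lines one at a time; only the current line is held in memory."""
    with open(path, "r") as f:
        for line in f:
            # Iterating a file keeps the trailing newline, so strip it here.
            yield line.rstrip("\n")


# Hypothetical sample file, created only so the sketch is runnable.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("alpha\nbeta\ngamma\n")
    sample_path = tmp.name

shared_lines = buffered_fetch(sample_path)  # one generator shared by all consumers
first = next(shared_lines)       # consumer A takes a line
second = next(shared_lines)      # consumer B gets the following line
remaining = list(shared_lines)   # exhausting the generator closes the file

os.remove(sample_path)
```

Because the generator owns the open file, the file is closed as soon as the generator is exhausted (or garbage collected).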

Related

Python3 Dictionary BLANK after first operation

I am reading a CSV file into Dictreader and want to print its contents twice on the terminal. But it is printing only once. Is Dictreader BLANK after first print?
dictreader = csv.DictReader(reader)
for k in dictreader:
    print(k) # Prints all keys/values
for i in dictreader:
    print(i) # Doesn't print anything
Yes, if you look at the source for DictReader you'll see it's an iterator (has an implementation for __next__ and __iter__ returns self).
After going through it once it will be exhausted; subsequent iterations will simply not produce anything. You could create a list from it if you need to iterate through it more times.
All csv wrappers wrap file-like objects. File-like objects have state, specifically, the seek position (and pipes can't seek at all), so the wrappers allow the object to manage position, parsing whatever comes next.
Making iteration work twice in a row would mean the csv wrappers have to cache the file contents (consuming unbounded amounts of memory) or require them to seek back to the beginning of the underlying file (not possible for streaming file-like objects).
Thinking of the csv wrappers as semi-file-like makes this easier to grasp. You can't do for line in myfile: twice in a row without seeking, and similarly, you can't do for row in mycsv: twice in a row without seeking the underlying file-like object.
Assuming your reader is seekable, you could iterate it twice (without consuming unbounded memory) by doing:
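dictreader = csv.DictReader(reader)
for k in dictreader:
    print(k) # Prints all keys/values
reader.seek(0) # Restart from beginning
# Re-create the DictReader; the old one already has its fieldnames set
# and would return the header line as an ordinary data row
dictreader = csv.DictReader(reader)
for i in dictreader:
    print(i) # Prints all keys/values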
Or if the files are known to be small, you could cache:
# Cache reusable values
dictlines = tuple(csv.DictReader(reader))
for k in dictlines:
    print(k) # Prints all keys/values
for i in dictlines:
    print(i) # Prints all keys/values
You could also use itertools.tee for the same purpose, but that only helps if all iterators are going to be advanced (somewhat) in tandem; if you're running one to completion before starting the next, it's usually faster to just cache to list or tuple.
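The behaviour described in this answer can be checked with a short sketch. It assumes an in-memory io.StringIO stands in for the real CSV file, and the column names and rows are made up for illustration. It shows the exhausted second pass, a rewind that builds a fresh DictReader so the header line is parsed as the header again, and the cache-to-tuple alternative.

```python
import csv
import io

# Hypothetical CSV content; io.StringIO is seekable like a regular file.
reader = io.StringIO("name,qty\napple,3\npear,5\n")

dictreader = csv.DictReader(reader)
first_pass = list(dictreader)    # consumes every row
second_pass = list(dictreader)   # iterator is exhausted, so this is empty

# Rewind the underlying file and build a new DictReader on top of it.
reader.seek(0)
rewound = list(csv.DictReader(reader))

# Caching alternative for small inputs: materialize once, iterate many times.
cached = tuple(rewound)
names_once = [row["name"] for row in cached]
names_twice = [row["name"] for row in cached]
```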

Is there a way to shallow copy an existing file-object?

The use case for this would be creating multiple generators based on some file-object without any of them trampling each other's read state.
Originally I (thought I) had a working implementation using seek() and tell() where each generator was decorated by a meta-generator which maintained the file-handle position. This worked fine on things like StringIO, but failed on real files due to the read-ahead buffer mutilating the offset.
Using readline() or otherwise mocking the real file-object isn't viable as the reason for doing this was the excessively large files prompting a generator expression in the first place. So losing the read-ahead buffer isn't really a good option (as an aside, why was Python implemented this way in the first place? Shouldn't the buffer be like a cache and not actually exposed to the user? Proper encapsulation should have prevented this tell() issue in the first place...)
I then tried to use copy.copy, but that results in something like this: <closed file '<uninitialized file>', mode '<uninitialized file>' at 0x7f722ffda810>. Which appears unusable.
Does there exist an alternative way to copy? Is there a way to initialize a file-object? Or should I give up on this use case entirely because it is not possible in Python?
You are looking for itertools.tee.
from itertools import tee
with open("somefile.txt", "r") as fh:
    fh1, fh2, fh3 = tee(fh, 3)
Once you call tee, do not use the parent iterator again. The iterators returned from tee may be used freely and independently, however.
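As a rough illustration of how the tee'd iterators behave, here is a sketch that assumes a small temporary file in place of "somefile.txt"; the variable names and contents are illustrative. Note that tee has to buffer every line that one iterator has consumed and another has not, so if one iterator runs far ahead of the others the memory saving is lost.

```python
import os
import tempfile
from itertools import tee

# Hypothetical sample file, created only so the sketch is runnable.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("a\nb\nc\n")
    path = tmp.name

with open(path, "r") as fh:
    lines_a, lines_b = tee(fh, 2)   # do not touch fh directly after this
    first_a = next(lines_a)         # advances only lines_a
    all_b = list(lines_b)           # lines_b still starts at the first line
    rest_a = list(lines_a)          # lines_a continues where it left off

os.remove(path)
```

The tee'd iterators must be consumed while the file is still open, which is why all reads happen inside the with block.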
For file objects specifically (to keep file-specific methods like read), you can just open a file multiple times; each file object will maintain its own file pointer as it reads the file.
fh1, fh2, fh3 = [open("somefile.txt") for i in range(3)]
or, if you already have a file object fh:
fh1, fh2, fh3 = [open(fh.name) for i in range(3)]
This doesn't preserve an already advanced file pointer, but it's easy enough to jump ahead:
for x in fh1, fh2, fh3:
    x.seek(fh.tell())
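The reopen-and-seek approach can be sketched end to end as follows, assuming a temporary file in place of "somefile.txt" (names and contents are illustrative). The original handle is advanced with readline() rather than a for loop, because in Python 3 text mode tell() raises an error while iteration with next() is in progress.

```python
import os
import tempfile

# Hypothetical sample file, created only so the sketch is runnable.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("one\ntwo\nthree\n")
    path = tmp.name

fh = open(path)
fh.readline()  # advance the original handle past the first line

# Each new handle keeps its own position; seek() aligns it with fh.
copies = [open(fh.name) for _ in range(2)]
for copy in copies:
    copy.seek(fh.tell())

first_from_a = copies[0].readline()
first_from_b = copies[1].readline()  # unaffected by the read on copies[0]

for handle in [fh, *copies]:
    handle.close()
os.remove(path)
```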

Bufferization in GzipFile

Imagine the following simple script:
import gzip

def reader():
    for line in open('logfile.log'):
        # do some stuff here like splitting the line or filtering etc.
        yield some_new_line

def writer(stream):
    with gzip.GzipFile('some_output_file.gz', 'w') as fh:
        for _s in stream:
            fh.write(_s+'\n')

stream = reader()
writer(stream)
So pretty simple - read lines using generators and write some result into a gzip file.
But how to speed it up? The HDD seems to be a bottleneck. I saw I can use buffer size for reads - using open(file, mode, buffer) syntax. But I'm not quite sure it will work in my case (with generators).
Also I didn't find any bufferization parameter for the gzip.GzipFile call. From the code, it's based on some bufferized class, but I don't see any further docs on that.
I have a (crazy?) idea to create an explicit cache and replace open methods with it - so it will read the file in bigger chunks, say, by 8MB, and then perform splitting it by lines. As for writes, I thought to create a list of lines to write, collect them (say, 5000 lines), and then dump into the file.
Am I trying to re-invent the wheel? I'm not satisfied with the performance the script currently has, so I'm trying to speed it up as much as possible.
UPD. I have around 4-5 different parallel workers running. They all perform reads and writes. So I guess the HDD is jumping from one sector to another, and this is the reason why I want to implement some bufferization to dump the data periodically in big chunks.
Thanks!
I can just propose more compact code:
import gzip

def reader():
    for line in open('logfile.log'):
        # do some stuff here like splitting the line or filtering etc.
        yield some_new_line

def writer(stream):
    with gzip.GzipFile('some_output_file.gz', 'w') as fh:
        fh.writelines(stream)

writer(reader())
However, there is no actual speed-up. Python will manage the streams, but if you cannot spare memory for full file write, the speed-up will not be great.
The compression through gzip is the slowest step. The following function will give you only a ~3% speed-up (disregarding the generator's part).
def writer():
    f = open('logfile.log').read()
    gzip.GzipFile('some_output_file.gz', 'w').write(f)
writer()
So, if you need gzip, then you cannot do much.
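If compression really is the dominant cost, one knob that does exist is the compression level. The sketch below is an assumption-laden illustration, not a measured fix: it uses Python 3's gzip.open in text mode with compresslevel=1, which trades a larger output file for less CPU time. The helper names, the upper-casing "transformation", and the output path are all hypothetical, and whether this helps depends on whether the workload is CPU-bound or disk-bound.

```python
import gzip
import os
import tempfile


def transformed_lines(lines):
    """Stand-in for the per-line processing done by reader() in the question."""
    for line in lines:
        yield line.upper()


def write_gzip(path, lines, compresslevel=1):
    # compresslevel ranges from 1 (fastest) to 9 (smallest output, the default).
    with gzip.open(path, "wt", compresslevel=compresslevel) as fh:
        fh.writelines(lines)


# Hypothetical output location, used only so the sketch is runnable.
out_path = os.path.join(tempfile.mkdtemp(), "some_output_file.gz")
write_gzip(out_path, transformed_lines(["first\n", "second\n"]))

with gzip.open(out_path, "rt") as fh:
    round_trip = fh.read()
```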

Delete line after it has been read from file in Python

I have a function that reads lines from a file and processes them. However, I want to delete every line that I have read, but without using readlines(), which reads all of the lines at once and stores them in a list.
If the problem is that you run out of memory, then I suggest you use the for line in file syntax, as this will only load the lines one at a time:
bigFile = open('path/to/file.dat','r')
for line in bigFile:
    processLine(line)
If you can construct your system so that it can process the file line-by-line, then it won't run out of memory trying to read the whole file. The program will discard the copy it has made of the file contents when it moves onto the next line.
Why does this work when readlines doesn't?
In Python there are iterators, which provide an interface to supply one item of a collection at a time, iterating over the whole collection if next() is called repeatedly (written .next() in Python 2). Because you rarely need the whole collection at once, this allows the program to hold a single item in memory instead, and thus to process large files.
By contrast, the readlines function has to return a whole list, rather than an iterator object, so it cannot delay the processing of later lines like an iterator could. Since Python 2.3, the old xreadlines method has been deprecated in favour of for line in file, because the file object returned by open is itself an iterator over its lines.
This follows the functional paradigm called 'lazy evaluation', where you avoid doing any actual processing unless and until the result is needed.
More iterators
Iterators can be chained together (process the lines of this file, then that one), or otherwise combined using the excellent itertools module (included in Python). These are very powerful, and can allow you to separate out the way you combine files or inputs from the code that processes them.
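As a small sketch of that separation, assuming two in-memory io.StringIO objects stand in for real files (the contents and the process_lines helper are made up for illustration): itertools.chain decides how the inputs are combined, while the processing code only sees a stream of lines.

```python
import io
from itertools import chain


def process_lines(lines):
    """Processing code: knows nothing about where the lines come from."""
    for line in lines:
        stripped = line.strip()
        if stripped:  # skip blank lines
            yield stripped.upper()


# Hypothetical inputs; open file objects would work the same way.
first_file = io.StringIO("one\ntwo\n")
second_file = io.StringIO("\nthree\n")

# chain() lazily walks the first input to the end, then the second.
results = list(process_lines(chain(first_file, second_file)))
```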
First of all, deleting the first line of a file is a costly process. Actually, you are unlikely to be able to do it without rewriting most of the file.
You have multiple approaches that could solve your issue:
1. In Python, file objects are iterators over their lines; maybe you can use this to solve your memory issues:
document_count = 0
with open(filename) as handler:
    for index, line in enumerate(handler):
        # Each line keeps its trailing newline, so strip it before comparing
        if line.rstrip('\n') == '.':
            document_count += 1
2. Use an index. Reserve a fixed-size part of your file for the index (make sure to reserve enough space; let's say the first 100 KB of your file, that's about 100K entries), or even use a separate index file. Every time you add a document, put its starting position in the index. Once you know the document's position, just use the seek function to get there and start reading.
3. Read the file once and store every document position. This is very similar to the previous idea, except that the index lives in memory, which is better performance-wise, but you will have to repeat the process every time you run the application (no persistence).
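The in-memory variant (option 3) might look roughly like the sketch below. It assumes documents are separated by a line containing only ".", as the check in the snippet above suggests; the function names, the separator parameter, and the sample data are hypothetical. The file is opened in binary mode and read with readline() so that tell() returns plain byte offsets.

```python
import os
import tempfile


def index_documents(path, separator=b".\n"):
    """Scan the file once and return the byte offset where each document starts."""
    offsets = [0]
    with open(path, "rb") as fh:
        for line in iter(fh.readline, b""):
            if line == separator:
                offsets.append(fh.tell())  # next document starts after the separator
    return offsets


def read_document(path, offset, separator=b".\n"):
    """Jump straight to a document and read it up to the next separator."""
    lines = []
    with open(path, "rb") as fh:
        fh.seek(offset)
        for line in iter(fh.readline, b""):
            if line == separator:
                break
            lines.append(line)
    return b"".join(lines)


# Hypothetical sample file with two documents.
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"doc one\n.\ndoc two\nmore\n")
    path = tmp.name

offsets = index_documents(path)
second_doc = read_document(path, offsets[1])
os.remove(path)
```

Persisting the offsets list to a separate file would turn this into the index-file approach from option 2.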

with as and premature file closure

I have a python function that contains the following code:
with open(modelfilepath, "rb") as modelfile, open(vcffilepath, "rb") as vcffile:
    for row in gtf_getrow(modelfile):
        print row
        #add features as appropriate
        if row["feature"] == "transcript":
            addfeature(some args...)
        if row["feature"] == "exon":
            addfeature(some other args..., vcffile=vcffile)
Execution of the addfeature() function passes through several functions before returning to the for loop. In the "exon" case, the vcffile object is passed as an argument to successive functions which eventually write to the vcffile.
The problem is that after a few iterations, the vcffile object seems to close spontaneously, which crashes the program. If I hardcode the function that uses vcffile to access the filename directly, the problem does not occur, but this seems like an undesirable solution since it removes control of the file from the with block. Nor do I want to have to open and close the file each time I access it, since this program is parsing hundreds of megabytes worth of tabular data. Thanks in advance for your suggestions.
