I have a function that reads lines from a file and processes them. However, I want to delete every line once I have read it, but without using readlines(), which reads all of the lines at once and stores them in a list.
If the problem is that you run out of memory, then I suggest you use the for line in file syntax, as this will only load the lines one at a time:
bigFile = open('path/to/file.dat', 'r')
for line in bigFile:
    processLine(line)
bigFile.close()
If you can construct your system so that it can process the file line-by-line, then it won't run out of memory trying to read the whole file. The program will discard the copy it has made of the file contents when it moves onto the next line.
Why does this work when readlines doesn't?
In Python there are iterators, which provide an interface to supply one item of a collection at a time, iterating over the whole collection as next() is called repeatedly (the .next() method in Python 2). Because you rarely need the whole collection at once, this allows the program to work with a single item in memory instead, and thus allows large files to be processed.
By contrast, the readlines function has to return a whole list, rather than an iterator object, so it cannot delay the processing of later lines like an iterator could. Since Python 2.3, the old xreadlines read iterator was deprecated in favour of using for line in file, because the file object returned by open had been changed to return an iterator rather than a list.
This follows the functional paradigm called 'lazy evaluation', where you avoid doing any actual processing unless and until the result is needed.
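As a quick illustration (a small sketch with a made-up file name), you can drive a file's iterator by hand with next() and watch it hand back one line at a time:

```python
# Create a small sample file so the sketch is self-contained.
with open("test.txt", "w") as f:
    f.write("first\nsecond\nthird\n")

f = open("test.txt")
line1 = next(f)  # only this one line is read from the file
line2 = next(f)  # the iterator picks up where it left off
print(line1, line2)
f.close()
```

The `for line in f` syntax is simply this same protocol driven automatically, with StopIteration ending the loop.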
More iterators
Iterators can be chained together (process the lines of this file, then that one), or otherwise combined using the excellent itertools module (included in Python). These are very powerful, and can allow you to separate out the way you combine files or inputs from the code that processes them.
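For example (a sketch with made-up file names), itertools.chain lets you process several files as a single lazy stream of lines:

```python
import itertools

# Two small sample files for the sketch.
with open("a.txt", "w") as f:
    f.write("line1\nline2\n")
with open("b.txt", "w") as f:
    f.write("line3\n")

with open("a.txt") as fa, open("b.txt") as fb:
    # chain() yields every line of fa, then every line of fb, lazily.
    combined = [line.rstrip("\n") for line in itertools.chain(fa, fb)]

print(combined)  # -> ['line1', 'line2', 'line3']
```

The processing code never needs to know how many files there are, or where one ends and the next begins.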
First of all, deleting the first line of a file is a costly process. Actually, you are unlikely to be able to do it without rewriting most of the file.
You have multiple approaches that could solve your issue:
1. In Python, file objects are iterable over their lines; maybe you can use this to solve your memory issues:
document_count = 0
with open(filename) as handler:
    for index, line in enumerate(handler):
        if line.rstrip('\n') == '.':
            document_count += 1
2. Use an index. Reserve a certain part of your file for the index (fixed size; make sure to reserve enough space, say the first 100 KB of the file, which is room for roughly 100K entries), or even use a separate index file. Every time you add a document, record its starting position in the index. Once you know a document's position, just use the seek function to get there and start reading.
3. Read the file once and store every document's position. This is very similar to the previous idea, except the index lives in memory, which is better performance-wise, but you will have to repeat the scan every time you run the application (no persistence).
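A minimal sketch of idea 3 (the file name and the convention that documents are separated by a line containing a single '.' are assumptions): scan once with readline() so tell() stays usable, remember where each document starts, then seek() straight back later.

```python
# Sample data: two documents separated by a '.' line.
with open("docs.txt", "w") as f:
    f.write("doc one line\n.\ndoc two line\n.\n")

# Scan once, remembering the offset where each document starts.
# readline() is used instead of `for line in f` so that tell() stays usable.
doc_starts = []
with open("docs.txt") as f:
    start = f.tell()
    line = f.readline()
    while line:
        if line.rstrip("\n") == ".":
            doc_starts.append(start)  # the document that just ended began here
            start = f.tell()          # the next one begins after the separator
        line = f.readline()

# Jump straight to the second document:
with open("docs.txt") as f:
    f.seek(doc_starts[1])
    second_doc_first_line = f.readline()
print(second_doc_first_line)  # -> "doc two line\n"
```

For idea 2, you would pickle or otherwise persist doc_starts instead of rebuilding it on every run.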
Related
I have a Python program which does the following:
It takes a list of files as input
It iterates through the list several times, each time opening the files and then closing them
What I would like is some way to open each file at the beginning, and then when iterating through the files make a copy of each file handle. Essentially this would take the form of a copy operation on file handles that allows a file to be traversed independently by multiple handles. The reason for wanting to do this is because on Unix systems, if a program obtains a file handle and the corresponding file is then deleted, the program is still able to read the file. If I try reopening the files by name on each iteration, the files might have been renamed or deleted so it wouldn't work. If I try using f.seek(0), then that might affect another thread/generator/iterator.
I hope my question makes sense, and I would like to know if there is a way to do this.
If you really want to get a copy of a file handle, you need the POSIX dup system call. In Python, that is exposed as os.dup - see the docs. If you have a file object (e.g. from calling open()), call its fileno() method to get the underlying file descriptor.
So the entire code will look like this:
import os

with open("myfile") as f:
    fd = f.fileno()      # get the underlying descriptor
    fd2 = os.dup(fd)     # duplicate the descriptor
    f2 = os.fdopen(fd2)  # wrap it in a new file object
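As a sketch of why this helps on Unix (file name is made up; this behavior is Unix-specific and won't hold on Windows), an open descriptor keeps the data readable even after the name is unlinked:

```python
import os

# Create a file to demonstrate with.
with open("myfile.txt", "w") as f:
    f.write("still here\n")

f = open("myfile.txt")
f2 = os.fdopen(os.dup(f.fileno()))  # duplicate the underlying descriptor

os.remove("myfile.txt")  # the directory entry is gone...
data = f2.read()         # ...but the open descriptor still reads the data
print(data)              # -> "still here\n"

f.close()
f2.close()
```

One caveat: descriptors created with dup share a single file offset, so the two handles do not traverse the file independently; for fully independent traversal you would need to reopen the file (e.g. via /proc/self/fd on Linux).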
I'm reading through a large file, and processing it.
I want to be able to jump to the middle of the file without it taking a long time.
Right now I am doing:
import gzip

f = gzip.open(input_name)
for i in range(1000000):
    f.read()  # just skipping the first 1M rows
for line in f:
    do_something(line)
is there a faster way to skip the lines in the zipped file?
If I have to unzip it first, I'll do that, but there has to be a way.
It's of course a text file, with \n separating lines.
The nature of gzipping is such that there is no longer the concept of lines when the file is compressed -- it's just a binary blob. Check out this for an explanation of what gzip does.
To read the file, you'll need to decompress it -- the gzip module does a fine job of it. Like other answers, I'd also recommend itertools to do the jumping, as it will carefully make sure you don't pull things into memory, and it will get you there as fast as possible.
import gzip
import itertools

with gzip.open(filename) as f:
    # jumps to `initial_row`
    for line in itertools.islice(f, initial_row, None):
        ...  # have a party
Alternatively, if this is a CSV that you're going to be working with, you could also try benchmarking pandas' parser, as it can handle decompressing gzip. That would look like: parsed_csv = pd.read_csv(filename, compression='gzip').
Also, to be extra clear, when you iterate over file objects in python -- i.e. like the f variable above -- you iterate over lines. You do not need to think about the '\n' characters.
You can use itertools.islice, passing a file object f and starting point, it will still advance the iterator but more efficiently than calling next 1000000 times:
from itertools import islice

for line in islice(f, 1000000, None):
    print(line)
Not overly familiar with gzip, but I imagine f.read() reads the whole file, so the next 999,999 calls do nothing. If you wanted to manually advance the iterator you would call next on the file object, i.e. next(f).
Calling next(f) won't read all the lines into memory at once either; it advances the iterator one line at a time, so it can be useful for skipping a line or two, or a header.
The consume recipe that #wwii suggested is also worth checking out.
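For reference, the consume recipe from the itertools documentation looks roughly like this:

```python
import collections
import itertools

def consume(iterator, n=None):
    # Advance the iterator n steps ahead; if n is None, consume it entirely.
    if n is None:
        collections.deque(iterator, maxlen=0)  # exhaust at C speed
    else:
        # islice(it, n, n) yields nothing but advances the iterator n steps.
        next(itertools.islice(iterator, n, n), None)

it = iter(range(10))
consume(it, 3)   # skip the first three items
print(next(it))  # -> 3
```

Because it works on any iterator, the same call skips lines in a file object just as well as items in a range.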
Not really.
If you know the number of bytes you want to skip, you can use .seek(amount) on the file object, but in order to skip a number of lines, Python has to go through the file byte by byte to count the newline characters.
The only alternative that comes to my mind is if you handle a certain static file, that won't change. In that case, you can index it once, i.e. find out and remember the positions of each line. If you have that in e.g. a dictionary that you save and load with pickle, you can skip to it in quasi-constant time with seek.
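A minimal sketch of that indexing idea (file names are placeholders): record each line's start offset once, pickle the index, and on later runs seek straight to line N.

```python
import pickle

# A static sample file that won't change between runs.
with open("static.txt", "w") as f:
    f.write("alpha\nbeta\ngamma\n")

# Pass 1: record the start offset of every line.
# readline() keeps tell() usable, unlike `for line in f` in text mode.
offsets = []
with open("static.txt") as f:
    pos = f.tell()
    while f.readline():
        offsets.append(pos)
        pos = f.tell()

# Persist the index so later runs can skip the scan.
with open("static.idx", "wb") as f:
    pickle.dump(offsets, f)

# Later: load the index and jump to line 2 in quasi-constant time.
with open("static.idx", "rb") as f:
    offsets = pickle.load(f)
with open("static.txt") as f:
    f.seek(offsets[2])
    line_two = f.readline()
print(line_two)  # -> "gamma\n"
```

A list is enough here; a dictionary only buys you anything if you index by some sparser key than the line number.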
It is not possible to randomly seek within a gzip file. Gzip is a stream algorithm and so it must always be uncompressed from the start until where your data of interest lies.
It is not possible to jump to a specific line without an index. Lines can only be found by scanning forward, or by scanning backwards from the end of the file in successive chunks.
You should consider a different storage format for your needs. What are your needs?
I'm wondering about the memory benefits of python generators in this use case (if any). I wish to read in a large text file that must be shared between all objects. Because it only needs to be used once and the program finishes once the list is exhausted I was planning on using generators.
The "saved state" of a generator I believe lets it keep track of what is the next value to be passed to whatever object is calling it. I've read that generators also save memory usage by not returning all the values at once, but rather calculating them on the fly. I'm a little confused if I'd get any benefit in this use case though.
Example Code:
def bufferedFetch():
    while True:
        buffer = open("bigfile.txt", "r").read().split('\n')
        for i in buffer:
            yield i
Considering that the buffer is going to be reading in the entire "bigfile.txt" anyway, wouldn't this be stored within the generator, for no memory benefit? Is there a better way to return the next value of a list that can be shared between all objects?
Thanks.
In this case, no. You are reading the entire file into memory by calling .read().
What you ideally want to do instead is:
def bufferedFetch():
    with open("bigfile.txt", "r") as f:
        for line in f:
            yield line
The Python file object takes care of line endings for you (system dependent), and its built-in iterator yields lines one at a time as you iterate over it (without reading the entire file into memory).
So I seem to be doing something incredibly dumb and I can't seem to figure it out. I am trying to create script that will search a file for terms defined in another file. This seems pretty basic to me but for some reason the outside loop iteration is empty on the inside loop.
import re
import sys

if __name__ == "__main__":
    searchfile = open(sys.argv[1], "r")
    terms = open(sys.argv[2], "r")
    for line in searchfile:
        for term in terms:
            if re.match(term, line.rstrip()):
                print line
If I print line before the term loop it has the information. If I print line inside the term loop, it doesn't. What am I missing?
The issue here is that files are iterators that get exhausted - this means that once they have been iterated over once, they will not restart from the beginning.
You are probably used to lists - iterables that return a new iterator each time you loop over them, from the beginning.
Files are single-use iterables - once you loop over them, they are exhausted.
You can either use list() to construct a list you can iterate over multiple times, or open the file inside the loop, so that it is reopened each time, creating a new iterator from the beginning.
Which option is best will vary depending on the use case. Opening the file and reading from disk will be slower, but making a list will require all the data being held in memory - if your file is extremely large, this may be a problem.
It's also worth noting that you should use the with statement when opening files in Python.
with open(sys.argv[1], "r") as searchfile, open(sys.argv[2], "r") as terms:
    terms = list(terms)
    for line in searchfile:
        for term in terms:
            if re.match(term, line.rstrip()):
                print line
So what is happening: in the first iteration you read the first line of searchfile and compare it with every line in terms, reading the terms file all the way to the end. After that, terms has been fully consumed, so in every subsequent iteration of the searchfile loop the inner loop body never executes (terms is 'empty').
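You can see the exhaustion directly with a throwaway file:

```python
# A tiny demo file (the name is made up).
with open("terms.txt", "w") as f:
    f.write("a\nb\n")

terms = open("terms.txt")
first_pass = list(terms)   # consumes the whole file
second_pass = list(terms)  # the iterator is exhausted: nothing left
terms.close()

print(first_pass)   # -> ['a\n', 'b\n']
print(second_pass)  # -> []
```

This is exactly what happens silently inside the nested loop above: the second and later passes over terms yield nothing.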
I am currently programming a game that requires reading and writing lines in a text file. I was wondering if there is a way to read a specific line in the text file (i.e. the first line in the text file). Also, is there a way to write a line in a specific location (i.e. change the first line in the file, write a couple of other lines and then change the first line again)? I know that we can read lines sequentially by calling:
f.readline()
Edit: Based on responses, apparently there is no way to read specific lines if they are different lengths. I am only working on a small part of a large group project and to change the way I'm storing data would mean a lot of work.
But is there a method to change specifically the first line of the file? I know calling:
f.write('text')
Writes something into the file, but it writes the line at the end of the file instead of the beginning. Is there a way for me to specifically rewrite the text at the beginning?
If all your lines are guaranteed to be the same length, then you can use f.seek(N) to position the file pointer at the N'th byte (where N is LINESIZE*line_number) and then f.read(LINESIZE). Otherwise, I'm not aware of any way to do it in an ordinary ASCII file (which I think is what you're asking about).
Of course, you could store some sort of record information in the header of the file and read that first to let you know where to seek to in your file -- but at that point you're better off using some external library that has already done all that work for you.
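A sketch of the fixed-length approach with invented numbers: if every record is exactly LINESIZE bytes including the trailing newline, line N starts at byte N * LINESIZE.

```python
LINESIZE = 8  # 7 payload characters + '\n'; chosen for the demo

# A demo file of fixed-length records (the name is made up).
with open("records.dat", "w") as f:
    f.write("rec0000\nrec0001\nrec0002\n")

def read_record(path, line_number):
    # Jump straight to the record without scanning the earlier lines.
    with open(path) as f:
        f.seek(LINESIZE * line_number)
        return f.read(LINESIZE)

print(read_record("records.dat", 2))  # -> "rec0002\n"
```

Writing a record works the same way: seek to N * LINESIZE, then write exactly LINESIZE bytes so the following records are untouched.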
Unless your text file is really big, you can always store each line in a list:
with open('textfile', 'r') as f:
    lines = [L[:-1] for L in f.readlines()]
(note I've stripped off the newline so you don't have to remember to keep it around)
Then you can manipulate the list by adding entries, removing entries, changing entries, etc.
At the end of the day, you can write the list back to your text file:
with open('textfile', 'w') as f:
    f.write('\n'.join(lines))
Here's a little test which works for me on OS-X to replace only the first line.
test.dat
this line has n characters
this line also has n characters
test.py
# First, get the length of the first line -- if you already know it, skip this block
f = open('test.dat', 'r')
l = f.readline()
linelen = len(l) - 1
f.close()

# apparently mode='a+' doesn't work on all systems :( so use 'r+' instead
f = open('test.dat', 'r+')
f.seek(0)
f.write('a' * linelen + '\n')  # 'a'*linelen = 'aaaaaaaaa...'
f.close()
These days, jumping within files in an optimized fashion is a task for high performance applications that manage huge files.
Are you sure that your software project requires reading/writing random places in a file during runtime? I think you should consider changing the whole approach:
If the data is small, you can keep / modify / generate the data at runtime in memory within appropriate container formats (list or dict, for instance) and then write it entirely at once (on change, or only when your program exits). You could consider looking at simple databases. Also, there are nice data exchange formats like JSON, which would be the ideal format in case your data is stored in a dictionary at runtime.
An example, to make the concept more clear. Consider you already have data written to gamedata.dat:
[{"playtime": 25, "score": 13, "name": "rudolf"}, {"playtime": 300, "score": 1, "name": "peter"}]
This is utf-8-encoded and JSON-formatted data. Read the file during runtime of your Python game:
import json

with open("gamedata.dat", "rb") as f:
    s = f.read().decode("utf-8")
Convert the data to Python types:
gamedata = json.loads(s)
Modify the data (add a new user):
user = {"name": "john", "score": 1337, "playtime": 1}
gamedata.append(user)
John really is a 1337 gamer. However, at this point, you also could have deleted a user, changed the score of Rudolf or changed the name of Peter, ... In any case, after the modification, you can simply write the new data back to disk:
with open("gamedata.dat", "wb") as f:
    f.write(json.dumps(gamedata).encode("utf-8"))
The point is that you manage (create/modify/remove) data during runtime within appropriate container types. When writing data to disk, you write the entire data set in order to save the current state of the game.