Does enumerate() maintain buffered file reading? - python

I need to use an index to remember which line of a file I'm on, so that I can resume an operation if the program is interrupted. So far I've been using this:
checkpoint = 15
with open('file.dat', 'rb') as file:
    it = iter(file)
    for _ in range(checkpoint):
        next(it)
    try:
        while True:
            line = next(it)
            # do some stuff
            checkpoint += 1
    except StopIteration:
        print("EOF")
But this feels clunky and inefficient. I've been wondering whether enumerate() applied to a file, or to any iterator, preserves buffered reading so that the file isn't loaded into memory all at once. I am also now keeping track of positions in the file. I've been thinking of something like this:
file_offset = 589
with open('file.dat', 'rb') as file:
    file.seek(file_offset)  # beginning of unprocessed line
    for idx, line in enumerate(file):
        file_offset = file.tell()
        # do stuff
Is this a valid approach, and will enumerate work correctly here without loading the whole file into memory?

As indicated in the answers to "Memory-efficient way to iterate over part of a large file", enumerate() returns a lazy iterator that pulls one item at a time from the file object, so buffered reading of the file is maintained.
This means that for i, line in enumerate(file) will produce the desired result without loading the whole file into memory.
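A minimal sketch of the resumable loop, assuming the file is opened in binary mode so that len(line) is a byte count. The names process_from and handle are hypothetical, and the offset is tracked by summing line lengths rather than by calling tell():

```python
def process_from(path, start_offset, handle):
    """Process lines starting at byte start_offset; return the offset reached."""
    offset = start_offset
    with open(path, "rb") as f:
        f.seek(offset)  # beginning of the first unprocessed line
        for idx, line in enumerate(f):  # lazy: one line at a time
            handle(idx, line)
            offset += len(line)  # persist this value as the checkpoint
    return offset
```

Saving the returned offset (or the running value inside the loop) lets a later run pass it back in as start_offset.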

Related

Loading the nth line of a txt file in python without loading the whole file

I have a large txt file that is split into lines. I want to load the nth line of this file for use in a for loop in my code. I can't however load the whole array then slice it because of the RAM (the whole file is like 500GB!). Any help would be much appreciated.
You can use a for-loop to iterate over the lines.
with open("file.txt") as f:
    for line in f:
        print(line)
When iterating over the file, you only have the current line in memory.
Python's enumerate function is your go-to here, because it goes through the data line by line without loading everything into memory.
Example code to load the data from line 26:
line = 26
data = None
with open("file") as f:
    for i, item in enumerate(f):
        if i == line - 1:
            data = item
            break
if data is not None:
    print(data)

Loop over the list many times

Let's say I have a file source.txt containing a few rows.
I want to print rows over and over until I break the program manually.
file_source = 'source.txt'
source = open(file_source,'r')
while 1:
    for line in source:
        print(line)
source.close()
The easiest solution is to put open and close inside the while loop, but my feeling is that's not the best solution.
Can you suggest something better?
How to loop over variable source many times?
Regards
I wasn't sure this would work, but it appears you can just seek to the beginning of the file and then continue iterating:
file_source = 'source.txt'
source = open(file_source,'r')
while 1:
    for line in source:
        print(line)
    source.seek(0)
source.close()
And obviously if the file is small you could simply read the whole thing into a list in memory and iterate over that instead.
You can read the lines at first and save them into a list. So your file is closed after reading. Then you can proceed with your infinite loop:
lines = []
with open(file_source, 'rb') as f:
    lines = f.readlines()
while 1:
    for line in lines:
        print(line)
But, this is not advised if your file is very large since everything from the file will be read into the memory:
file.readlines([sizehint]):
Read until EOF using readline() and return a list containing the lines thus read.

Python huge file reading [duplicate]

This question already has answers here:
What is the idiomatic way to iterate over a binary file?
I need to read a big data file (~200GB) line by line using a Python script.
I have tried the regular line-by-line methods, but they use a large amount of memory. I want to be able to read the file chunk by chunk.
Is there a better way to load a large file line by line, say
a) by explicitly specifying the maximum number of lines that may be held in memory at any one time, or
b) by loading it in chunks of a given size, say 1024 bytes, provided the last line of each chunk is loaded completely without being truncated?
Instead of reading it all at once, try reading it line by line:
with open("myFile.txt") as f:
    for line in f:
        pass  # Do stuff with your line
Or, if you want to read N lines in at a time:
with open("myFile.txt") as myfile:
    head = [next(myfile) for x in range(N)]
print(head)
To handle the StopIteration error that comes from hitting the end of the file, a simple try/except is enough (although there are plenty of other ways).
head = []
try:
    for _ in range(N):
        head.append(next(myfile))
except StopIteration:
    pass  # fewer than N lines were left; head holds whatever was read
After that, head contains the last lines of the file, and you can process them however you want.
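Another way to take N lines at a time without handling StopIteration by hand is itertools.islice, which simply stops early when the file runs out. The following is a sketch; the function name read_in_batches is made up for illustration:

```python
from itertools import islice

def read_in_batches(path, n):
    """Yield lists of up to n lines; the final batch may be shorter."""
    with open(path) as f:
        while True:
            batch = list(islice(f, n))  # consumes at most n lines
            if not batch:
                break  # end of file
            yield batch
```

Only one batch is held in memory at a time, so n bounds the memory use.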
To iterate over the lines of a file, do not use readlines. Instead, iterate over the file itself (you will find versions using xreadlines; it is deprecated and simply returns the file object itself):
with open(the_path, 'r') as the_file:
    for line in the_file:
        pass  # Do stuff with the line
To read multiple lines at a time, you can use next on the file (it is an iterator), but you need to catch StopIteration, which indicates that there is no data left:
with open(the_path, 'r') as the_file:
    while True:
        the_lines = []
        done = False
        for i in range(number_of_lines):  # Use xrange on Python 2
            try:
                the_lines.append(next(the_file))
            except StopIteration:
                done = True  # Reached end of file
                break
        # Do stuff with the lines
        if done:
            break  # No data left
Of course, you can also load the file in chunks of a specified byte count:
with open(the_path, 'r') as the_file:
    while True:
        data = the_file.read(the_byte_count)
        if len(data) == 0:
            # All data is gone
            break
        # Do stuff with the data chunk
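The question also asked for fixed-size chunks in which no line is truncated. One possible sketch, assuming binary mode and "\n" line endings (iter_complete_lines and its default chunk size are illustrative, not from the answers above), carries the incomplete tail of each chunk over to the next read:

```python
def iter_complete_lines(path, chunk_size=1024):
    """Read chunk_size bytes at a time and yield only complete lines."""
    leftover = b""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            lines = (leftover + chunk).split(b"\n")
            leftover = lines.pop()  # last piece may be an incomplete line
            for line in lines:
                yield line
    if leftover:
        yield leftover  # final line without a trailing newline
```

The yielded lines have their newline stripped, and memory use is bounded by chunk_size plus the length of the longest line.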

Read file Again Python

I would like to read a file again and again, starting over whenever it reaches the end.
The file contains only numbers separated by commas.
I use Python, and I read in the docs that file.seek(0) can be used for this, but it doesn't work for me.
This is my script:
self.users = []
self.index = -1
infile = open(filename, "r")
for line in infile.readlines():
    if line != None:
        self.users.append(String.split((line),','))
    else:
        infile.seek(0)
        infile.read()
infile.close()
self.index= self._index +1
return self.users[self.index]
Thank you for your help
infile.read() will read in the whole of the file and then throw away the result. Why are you doing it?
When you call infile.readlines you have already read in the whole file. Then your loop iterates over the result, which is just a Python list. Moving to the start of the file will have no effect on that.
If your code did in fact move to the start of the file after reaching the end, it would simply loop for ever until it ran out of memory (because of the endlessly growing users list).
You could get the behaviour you're asking for by storing the result of readlines() in a variable and then putting the whole for line in all_lines: loop inside another while True:. (Or closing, re-opening and re-reading every time, if (a) you are worried that the file might be changed by another program or (b) you want to avoid reading it all in in a single gulp. For (b) you would replace for line in infile.readlines(): with for line in infile:. For (a), note that trying to read a file while something else might be writing to it is likely to be a bad idea no matter how you do it.)
I strongly suspect that the behaviour you're asking for is not what you really want. What's the goal you're trying to achieve by making your program keep reading the file over and over?
The 'else' branch will never be taken, because the for loop will iterate over all the lines of the file and then exit.
If you want the seek operation to be executed, you will have to put it outside the for loop:
self.users = []
self.index = -1
infile = open(filename, "r")
while True:
    for line in infile.readlines():
        self.users.append(String.split((line),','))
    infile.seek(0)
infile.close()
self.index= self._index +1
return self.users[self.index]
The problem is that if you loop forever you will exhaust the memory. If you want to read it only twice, copy and paste the for loop; otherwise decide on an exit condition and use a break statement.
readlines is already reading the entire file contents into an in-memory list, which you are free to iterate over again and again!
To re-read the file do:
infile = open('whatever')
while True:
    content = infile.readlines()
    # do something with list 'content'
    # re-read the file - why? I do not know
    infile.seek(0)
infile.close()
You can use itertools.cycle() here.
Here's an example:
import itertools
f = open(filename)
lines = f.readlines()
f.close()
for line in itertools.cycle(lines):
    print(line, end='')

How to jump to a particular line in a huge text file?

Are there any alternatives to the code below:
startFromLine = 141978 # or whatever line I need to jump to
urlsfile = open(filename, "rb", 0)
linesCounter = 1
for line in urlsfile:
    if linesCounter > startFromLine:
        DoSomethingWithThisLine(line)
    linesCounter += 1
I'm processing a huge text file (~15MB) with lines of unknown and varying length, and I need to jump to a particular line whose number I know in advance. I feel bad processing the lines one by one when I know I could ignore at least the first half of the file. I'm looking for a more elegant solution, if there is one.
You can't jump ahead without reading in the file at least once, since you don't know where the line breaks are. You could do something like:
# Read in the file once and build a list of line offsets
# (open the file in binary mode so that len(line) counts bytes)
line_offset = []
offset = 0
for line in file:
    line_offset.append(offset)
    offset += len(line)
file.seek(0)
# Now, to skip to line n (with the first line being line 0), just do
file.seek(line_offset[n])
linecache:
The linecache module allows one to get any line from a Python source file, while attempting to optimize internally, using a cache, the common case where many lines are read from a single file. This is used by the traceback module to retrieve source lines for inclusion in the formatted traceback...
You don't really have that many options if the lines are of different length... you sadly need to process the line ending characters to know when you've progressed to the next line.
You can, however, dramatically speed this up AND reduce memory usage by changing the last parameter to "open" to something not 0.
0 means the file reading operation is unbuffered, which is very slow and disk intensive. 1 means the file is line buffered, which would be an improvement. Anything above 1 (say 8 kB, i.e. 8192, or higher) reads chunks of the file into memory. You still access it through for line in open(etc):, but Python only reads a bit at a time, discarding each buffered chunk after it's processed.
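As a small illustration of the buffering argument, here is a sketch that assumes Python 3 and binary mode (in Python 3, buffering=0 is only permitted in binary mode). The function name count_lines and the 64 KiB buffer size are arbitrary choices for the example:

```python
def count_lines(path, buffer_size=64 * 1024):
    """Count lines while reading the file through a buffer of buffer_size bytes."""
    with open(path, "rb", buffering=buffer_size) as f:
        return sum(1 for _ in f)  # iteration stays line by line
```

The loop itself does not change; only the amount read from disk per underlying read call does.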
I am surprised no one mentioned islice:
import itertools
line = next(itertools.islice(Fhandle, index_of_interest, index_of_interest + 1), None)  # just the one line
or, if you want the whole rest of the file:
rest_of_file = itertools.islice(Fhandle, index_of_interest, None)
for line in rest_of_file:
    print(line)
or, if you want every other line from the file:
rest_of_file = itertools.islice(Fhandle, index_of_interest, None, 2)
for odd_line in rest_of_file:
    print(odd_line)
I'm probably spoiled by abundant ram, but 15 M is not huge. Reading into memory with readlines() is what I usually do with files of this size. Accessing a line after that is trivial.
Since there is no way to determine the length of all lines without reading them, you have no choice but to iterate over all lines before your starting line. All you can do is make it look nice. If the file is really huge then you might want to use a generator-based approach:
from itertools import dropwhile
def iterate_from_line(f, start_from_line):
    return (l for i, l in dropwhile(lambda x: x[0] < start_from_line, enumerate(f)))
for line in iterate_from_line(open(filename, "r"), 141978):
    DoSomethingWithThisLine(line)
Note: the index is zero-based in this approach.
I had the same problem (needing to retrieve a specific line from a huge file).
Of course, I can run through all the records in the file each time and stop when the counter equals the target line, but that is inefficient when you want to retrieve several specific rows. So the main issue to resolve was how to jump directly to the required place in the file.
I found the following solution:
First, I filled a dictionary with the start position of each line (the key is the line number, and the value is the cumulative length of the previous lines).
t = open(file, 'r')
dict_pos = {}
kolvo = 0
length = 0
for each in t:
    dict_pos[kolvo] = length
    length = length + len(each)
    kolvo = kolvo + 1
Finally, the target function:
def give_line(line_number):
    t.seek(dict_pos.get(line_number))
    line = t.readline()
    return line
t.seek(dict_pos.get(line_number)) moves the file position to the start of the requested line.
So the next readline() call returns your target line.
Using this approach I have saved a significant amount of time.
If you don't want to read the entire file into memory, you may need to come up with some format other than plain text.
Of course, it all depends on what you're trying to do, and how often you will jump across the file.
For instance, if you're gonna be jumping to lines many times in the same file, and you know that the file does not change while you're working with it, you can do this:
First, pass through the whole file and record the "seek location" of some key line numbers (such as every 1000 lines).
Then, if you want line 12005, jump to the recorded position of line 12000, read 5 lines, and you'll know you're at line 12005,
and so on.
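The checkpoint idea described above could be sketched as follows, assuming zero-based line numbers, binary mode, and a file that does not change between building the index and using it. The names build_sparse_index and get_line are hypothetical:

```python
def build_sparse_index(path, step=1000):
    """Return the byte offsets of lines 0, step, 2*step, ... (zero-based)."""
    checkpoints = []
    offset = 0
    with open(path, "rb") as f:
        for lineno, line in enumerate(f):
            if lineno % step == 0:
                checkpoints.append(offset)
            offset += len(line)
    return checkpoints

def get_line(path, checkpoints, lineno, step=1000):
    """Seek to the nearest checkpoint at or before lineno, then read forward."""
    with open(path, "rb") as f:
        f.seek(checkpoints[lineno // step])
        for _ in range(lineno % step):
            f.readline()  # skip the lines between the checkpoint and the target
        return f.readline()
```

A larger step keeps the index smaller at the cost of reading more lines after each seek.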
You may use mmap to find the offset of the lines. mmap seems to be the fastest way to process a file.
Example:
import mmap
with open('input_file', "rb") as f:
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    offsets = None
    pos = 0  # byte offset where the current line starts
    for i, line in enumerate(iter(mapped.readline, b""), start=1):
        if i == Line_I_want_to_jump:
            offsets = pos
            break
        pos = mapped.tell()
Then use f.seek(offsets) to move to the line you need.
None of the answers are particularly satisfactory, so here's a small snippet to help.
class LineSeekableFile:
    def __init__(self, seekable):
        self.fin = seekable
        self.line_map = list() # Map from line index -> file position.
        self.line_map.append(0)
        while seekable.readline():
            self.line_map.append(seekable.tell())
    def __getitem__(self, index):
        # NOTE: This assumes that you're not reading the file sequentially.
        # For that, just use 'for line in file'.
        self.fin.seek(self.line_map[index])
        return self.fin.readline()
Example usage:
In: !cat /tmp/test.txt
Out:
Line zero.
Line one!
Line three.
End of file, line four.
In:
with open("/tmp/test.txt", 'rt') as fin:
    seeker = LineSeekableFile(fin)
    print(seeker[1])
Out:
Line one!
This involves doing a lot of file seeks, but is useful for the cases where you can't fit the whole file in memory. It does one initial read to get the line locations (so it does read the whole file, but doesn't keep it all in memory), and then each access does a file seek after the fact.
I offer the snippet above under the MIT or Apache license at the discretion of the user.
If you know in advance the position in the file (rather than the line number), you can use file.seek() to go to that position.
Edit: you can use the linecache.getline(filename, lineno) function, which will return the contents of the line lineno, but only after reading the entire file into memory. Good if you're randomly accessing lines from within the file (as python itself might want to do to print a traceback) but not good for a 15MB file.
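For reference, a minimal use of linecache.getline might look like this. The wrapper name nth_line is made up; line numbers are 1-based, and an empty string comes back when the line does not exist:

```python
import linecache

def nth_line(path, lineno):
    """Return line number lineno (1-based), or '' if it does not exist."""
    return linecache.getline(path, lineno)
```

As noted above, the whole file is read and cached on first access, so this suits repeated random access to files that fit comfortably in memory.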
What generates the file you want to process? If it is something under your control, you could generate an index (which line is at which position) at the time the file is appended to. The index file can have a fixed line size (space-padded or zero-padded numbers) and will definitely be smaller, so it can be read and processed quickly.
Which line do you want?
Calculate byte offset of corresponding line number in index file(possible because line size of index file is constant).
Use seek or whatever to directly jump to get the line from index file.
Parse to get byte offset for corresponding line of actual file.
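The steps above can be sketched as follows. This assumes an index file in which every record is a 20-digit zero-padded byte offset followed by a newline; that layout, RECORD_WIDTH, and the function names build_index and read_line are all invented for the example, and here the index is built in a separate pass rather than while the data file is being appended to:

```python
RECORD_WIDTH = 21  # 20 zero-padded digits plus a newline (assumed layout)

def build_index(data_path, index_path):
    """Write one fixed-width record per line of the data file."""
    offset = 0
    with open(data_path, "rb") as data, open(index_path, "wb") as index:
        for line in data:
            index.write(("%020d\n" % offset).encode("ascii"))
            offset += len(line)

def read_line(data_path, index_path, lineno):
    """Return zero-based line lineno of the data file using the index."""
    with open(index_path, "rb") as index:
        index.seek(lineno * RECORD_WIDTH)  # constant record size makes this direct
        record = index.read(RECORD_WIDTH)
    if len(record) < RECORD_WIDTH:
        raise IndexError("line number out of range")
    with open(data_path, "rb") as data:
        data.seek(int(record.decode("ascii")))
        return data.readline()
```

Each lookup then costs two seeks and two short reads, regardless of how large the data file is.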
Do the lines themselves contain any index information? If the content of each line was something like "<line index>:Data", then the seek() approach could be used to do a binary search through the file, even if the amount of Data is variable. You'd seek to the midpoint of the file, read a line, check whether its index is higher or lower than the one you want, etc.
Otherwise, the best you can do is just readlines(). If you don't want to read all 15MB, you can use the sizehint argument to at least replace a lot of readline()s with a smaller number of calls to readlines().
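The binary-search idea for lines of the form "<line index>:Data" could look roughly like this. It is a sketch that assumes integer indexes sorted in ascending order, "\n" line endings, and binary mode; find_line and its helpers are hypothetical names:

```python
import os

def _key(line):
    """Parse the integer index in front of the first colon."""
    return int(line.split(b":", 1)[0].decode("ascii"))

def _line_at_or_after(f, pos):
    """Return the first complete line starting at byte offset >= pos (b'' at EOF)."""
    if pos == 0:
        f.seek(0)
    else:
        f.seek(pos - 1)
        f.readline()  # finish the line that contains byte pos - 1
    return f.readline()

def find_line(path, target):
    """Return the line whose index equals target, or None if there is none."""
    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path)  # search window in bytes
        while lo < hi:
            mid = (lo + hi) // 2
            line = _line_at_or_after(f, mid)
            if line and _key(line) < target:
                lo = mid + 1
            else:
                hi = mid
        line = _line_at_or_after(f, lo)
    if line and _key(line) == target:
        return line
    return None
```

The search runs over byte offsets rather than line numbers, so it needs only a logarithmic number of seeks and never scans the whole file.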
If you're dealing with a text file on a Linux system, you could use shell commands.
For me, this worked well!
import subprocess
def read_line(path, line=1):
    return subprocess.getoutput('head -%s %s | tail -1' % (line, path))
line_to_jump = 141978
read_line("path_to_large_text_file", line_to_jump)
Here's an example using readlines(sizehint) to read a chunk of lines at a time. DNS pointed out that solution. I wrote this example because the other examples here are single-line oriented.
def getlineno(filename, lineno):
    if lineno < 1:
        raise TypeError("First line is line 1")
    f = open(filename)
    lines_read = 0
    while 1:
        lines = f.readlines(100000)
        if not lines:
            return None
        if lines_read + len(lines) >= lineno:
            return lines[lineno-lines_read-1]
        lines_read += len(lines)
print(getlineno("nci_09425001_09450000.smi", 12000))
@george brilliantly suggested mmap, which presumably uses the mmap syscall. Here's another rendition.
import mmap
LINE = 2 # your desired line
with open('data.txt','rb') as i_file, mmap.mmap(i_file.fileno(), length=0, prot=mmap.PROT_READ) as data:
    for i,line in enumerate(iter(data.readline, b'')):
        if i!=LINE: continue
        pos = data.tell() - len(line)
        break
    # optionally copy data to `chunk`
    i_file.seek(pos)
    chunk = i_file.read(len(line))
print(f'line {i}')
print(f'byte {pos}')
print(f'data {line}')
print(f'data {chunk}')
You can use this function to return line n:
def skipton(infile, n):
    with open(infile,'r') as fi:
        for i in range(n-1):
            next(fi)
        return next(fi)
