Python huge file reading [duplicate]

This question already has answers here:
What is the idiomatic way to iterate over a binary file?
(5 answers)
Closed 8 years ago.
I need to read a big data file (~200 GB) line by line using a Python script.
I have tried the regular line-by-line methods, but they end up using a large amount of memory. I want to be able to read the file chunk by chunk.
Is there a better way to load a large file line by line, say
a) by explicitly specifying the maximum number of lines the file could load at any one time in memory? Or
b) by loading it in chunks of, say, 1024 bytes, provided the last line of each chunk loads completely without being truncated?

Instead of reading it all at once, try reading it line by line:
with open("myFile.txt") as f:
for line in f:
#Do stuff with your line
Or, if you want to read N lines at a time:
with open("myFile.txt") as myfile:
    head = [next(myfile) for x in range(N)]  # use xrange on Python 2
    print(head)
To handle the StopIteration that is raised when you hit the end of the file, a simple try/except works (although there are plenty of other ways).
try:
    head = [next(myfile) for x in range(N)]
except StopIteration:
    rest_of_lines = [line for line in myfile]
Or you can read those last lines in however you want.
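If you prefer to avoid the try/except entirely, itertools.islice hands you the next N lines as a (possibly shorter) list. A minimal sketch along those lines, assuming the same myFile.txt and some chunk size N:
from itertools import islice

N = 1000  # lines per chunk; an assumed value for illustration

with open("myFile.txt") as myfile:
    while True:
        chunk = list(islice(myfile, N))  # up to N lines; shorter (or empty) at EOF
        if not chunk:
            break
        # Do stuff with this chunk of lines
Each pass pulls at most N lines into memory, so peak memory use is bounded by the chunk size rather than the file size.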

To iterate over the lines of a file, do not use readlines. Instead, iterate over the file object itself (you may also find old code using xreadlines; it is deprecated and simply returns the file object):
with open(the_path, 'r') as the_file:
    for line in the_file:
        # Do stuff with the line
To read multiple lines at a time, you can use next on the file (it is an iterator), but you need to catch StopIteration, which indicates that there is no data left:
with open(the_path, 'r') as the_file:
    while True:
        the_lines = []
        done = False
        for i in range(number_of_lines):  # Use xrange on Python 2
            try:
                the_lines.append(next(the_file))
            except StopIteration:
                done = True  # Reached end of file
                break
        # Do stuff with the lines
        if done:
            break  # No data left
Of course, you can also load the file in chunks of a specified byte count:
with open(the_path, 'r') as the_file:
    while True:
        data = the_file.read(the_byte_count)
        if len(data) == 0:
            # All data is gone
            break
        # Do stuff with the data chunk
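To address part b) of the original question (fixed-size chunks that never cut a line in half), one hedged approach is to read a byte chunk and carry the trailing partial line over into the next read. A sketch under those assumptions (the function name and the 1024-byte default are just for illustration):
def read_in_line_aligned_chunks(path, chunk_size=1024):
    # Yield lists of complete lines, reading roughly chunk_size bytes at a time.
    with open(path, 'rb') as f:
        leftover = b''
        while True:
            data = f.read(chunk_size)
            if not data:
                if leftover:
                    yield [leftover]  # final line without a trailing newline
                break
            data = leftover + data
            lines = data.split(b'\n')
            leftover = lines.pop()  # the last element may be a truncated line
            yield [line + b'\n' for line in lines]

for lines in read_in_line_aligned_chunks(the_path):
    for line in lines:
        pass  # each line here is guaranteed to be complete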


Does enumerate() maintain buffered file reading?

I need to keep an index of the line number I'm on in a file, so that I can resume an operation if the program is interrupted. So far I've been using this:
checkpoint = 15
with open('file.dat', 'rb') as file:
    it = iter(file)
    for _ in range(checkpoint):
        next(it)
    try:
        while True:
            line = next(it)
            # do some stuff
            checkpoint += 1
    except StopIteration:
        print("EOF")
But this feels clunky and inefficient. I've been wondering whether enumerate() applied to a file (or any iterator) preserves the buffered reading property, so that the file isn't loaded into memory all at once. I am also now keeping a file offset to mark my position in the file. I've been thinking of something like this:
file_offset = 589
with open('file.dat', 'rb') as file:
    file.seek(file_offset)  # beginning of unprocessed line
    for idx, line in enumerate(file):
        file_offset = file.tell()
        # do stuff
Is this a valid approach, and will enumerate work correctly here without loading the whole file into memory?
As indicated in Memory-efficient way to iterate over part of a large file, based on the answer provided there, enumerate() is lazy (it simply wraps the underlying iterator), so the buffered reading of the file is maintained.
This means that for i, line in enumerate(file) will produce the desired result without loading the whole file into memory.
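Putting the two ideas together, here is a hedged sketch of a resumable loop that re-seeks to a saved offset and then persists file.tell() after each processed line. The checkpoint file name and the processing placeholder are assumptions for illustration:
import os

CHECKPOINT_PATH = 'file.dat.offset'  # assumed side file holding the last byte offset

def load_offset():
    # Resume from the saved offset if a previous run left one behind.
    if os.path.exists(CHECKPOINT_PATH):
        with open(CHECKPOINT_PATH) as cp:
            return int(cp.read().strip() or 0)
    return 0

def save_offset(offset):
    with open(CHECKPOINT_PATH, 'w') as cp:
        cp.write(str(offset))

with open('file.dat', 'rb') as f:
    f.seek(load_offset())        # beginning of the first unprocessed line
    for idx, line in enumerate(f):
        # do some stuff with line
        save_offset(f.tell())    # the next run restarts at the following line
Because enumerate only wraps the file iterator, lines are still read one at a time; the extra cost is rewriting the small checkpoint file after each line.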

Python refuses to iterate through lines in a file more than once [duplicate]

This question already has answers here:
Iterating on a file doesn't work the second time [duplicate]
(4 answers)
Closed 4 years ago.
I am writing a program that requires me to iterate through each line of a file multiple times:
loops = 0
file = open("somefile.txt")
while loops < 5:
    for line in file:
        print(line)
    loops = loops + 1
For the sake of brevity, I am assuming that I always need to loop through the file and print each line 5 times. That code has the same issue as the longer version I have implemented in my program: the file is only iterated through once. After that, the print(line) line does nothing. Why is this?
It's because the file = open("somefile.txt") line occurs only once, before the loop. This creates one cursor pointing to one location in the file, so when you reach the end of the first loop, the cursor is at the end of the file. Move it into the loop:
loops = 0
while loops < 5:
    file = open("somefile.txt")
    for line in file:
        print(line)
    loops = loops + 1
    file.close()
for loop in range(5):
    with open('somefile.txt') as fin:
        for line in fin:
            print(line)
This will re-open the file five times. You could seek() back to the beginning instead, if you like.
for line in file reads each line once. If you want to start over from the first line, you could for example close and reopen the file.
Python file objects are iterators. Like other iterators, they can only be iterated on once before becoming exhausted. Trying to iterate again results in the iterator raising StopIteration (the signal it has nothing left to yield) immediately.
That said, file objects do let you cheat a bit. Unlike most other iterators, you can rewind them using their seek method. Then you can iterate their contents again.
Another option would be to reopen the file each time you need to iterate on it. This is simple enough, but (ignoring the OS's disk cache) it might be a bit wasteful to read the file repeatedly.
A final option would be to read the whole contents of the file into a list at the start of the program and then do the iteration over the list instead of over the file directly. This is probably the most efficient option as long as the file is small enough that fitting its whole contents in memory at one time is not a problem.
Once you have iterated through the file, the file pointer sits at the end, so use
file.seek(0) to rewind instead of opening the file again and again in the loop:
with open('a.txt', 'r+') as f:
    for i in range(0, 5):
        for line in f:
            print(line)
        f.seek(0)
File objects are iterators: they are consumed as you iterate through them. If you want to iterate over the file multiple times line by line, you might want to read the file into something like a list first.
lines = open("somefile.txt").read().splitlines()
for line in lines:
    print(line)

Reading in a file, one chunk at a time [duplicate]

This question already has answers here:
Read multiple block of file between start and stop flags
(4 answers)
Closed 6 years ago.
I have a VERY large file formatted like this:
(mydelimiter)
line
line
(mydelimiter)
line
line
(mydelimiter)
Since the file is so large I can't read it all into memory at once. So I would like to read each chunk between "(mydelimiter)" at a time, perform some operations on it, then read in the next chunk.
This is the code I have so far:
with open(infile, 'r') as f:
    chunk = []
    for line in f:
        chunk.append(line)
Now, I'm not sure how to tell Python "keep appending lines UNTIL you hit another line with '(mydelimiter)' in it", and then save the line where it stopped and start there in the next iteration of the for loop.
Note: it's also not possible to read in a certain number of lines at a time since each chunk is variable length.
Aren't you perhaps overthinking this? Something as simple as the following can do the trick for you:
with open(infile, 'r') as f:
    chunk = []
    for line in f:
        if line.strip() == '(mydelimiter)':
            call_something(chunk)
            chunk = []
        else:
            chunk.append(line)
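A generator keeps the same idea reusable and makes the final chunk at end-of-file explicit. A minimal sketch, assuming the delimiter line is literally "(mydelimiter)" as in the sample above (the function name is just for illustration):
def chunks_between_delimiters(path, delimiter='(mydelimiter)'):
    # Yield lists of lines found between successive delimiter lines.
    chunk = []
    with open(path, 'r') as f:
        for line in f:
            if line.strip() == delimiter:
                if chunk:
                    yield chunk
                chunk = []
            else:
                chunk.append(line)
    if chunk:
        yield chunk  # last chunk, if the file doesn't end with a delimiter

for chunk in chunks_between_delimiters(infile):
    call_something(chunk)
Only one chunk is held in memory at a time, which matches the constraint in the question.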

Loop over the list many times

Let's say I have a file source.txt containing a few rows.
I want to print rows over and over until I break the program manually.
file_source = 'source.txt'
source = open(file_source, 'r')
while 1:
    for line in source:
        print(line)
source.close()
The easiest solution is to put the open and close inside the while loop, but my feeling is that's not the best solution.
Can you suggest something better?
How can I loop over the variable source many times?
Regards
I wasn't sure this would work, but it appears you can just seek to the beginning of the file and then continue iterating:
file_source = 'source.txt'
source = open(file_source, 'r')
while 1:
    for line in source:
        print(line)
    source.seek(0)
source.close()
And obviously if the file is small you could simply read the whole thing into a list in memory and iterate over that instead.
You can read the lines at first and save them into a list. So your file is closed after reading. Then you can proceed with your infinite loop:
lines = []
with open(file_source, 'rb') as f:
    lines = f.readlines()

while 1:
    for line in lines:
        print(line)
But this is not advised if your file is very large, since everything in the file will be read into memory:
file.readlines([sizehint]):
Read until EOF using readline() and return a list containing the lines thus read.

How to jump to a particular line in a huge text file?

Are there any alternatives to the code below:
startFromLine = 141978  # or whatever line I need to jump to

urlsfile = open(filename, "rb", 0)

linesCounter = 1

for line in urlsfile:
    if linesCounter > startFromLine:
        DoSomethingWithThisLine(line)
    linesCounter += 1
If I'm processing a huge text file (~15 MB) with lines of unknown but varying length, and I need to jump to a particular line whose number I know in advance? I feel bad processing the lines one by one when I know I could ignore at least the first half of the file. Looking for a more elegant solution if there is any.
You can't jump ahead without reading in the file at least once, since you don't know where the line breaks are. You could do something like:
# Read in the file once and build a list of line offsets
line_offset = []
offset = 0
for line in file:
    line_offset.append(offset)
    offset += len(line)
file.seek(0)

# Now, to skip to line n (with the first line being line 0), just do
file.seek(line_offset[n])
linecache:
The linecache module allows one to get any line from a Python source file, while attempting to optimize internally, using a cache, the common case where many lines are read from a single file. This is used by the traceback module to retrieve source lines for inclusion in the formatted traceback...
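A minimal linecache sketch, reusing the names from the question (note that linecache reads and caches the whole file on first use, so it suits repeated random access to a moderately sized file rather than a single lookup in a huge one):
import linecache

# Line numbers are 1-based; an out-of-range number returns ''.
line = linecache.getline(filename, 141978)
DoSomethingWithThisLine(line)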
You don't really have that many options if the lines are of different length... you sadly need to process the line ending characters to know when you've progressed to the next line.
You can, however, dramatically speed this up AND reduce memory usage by changing the last parameter to open to something other than 0.
0 means the file reading operation is unbuffered, which is very slow and disk intensive. 1 means the file is line buffered, which would be an improvement. Anything above 1 (say 8 kB, i.e. 8192, or higher) reads chunks of the file into memory. You still access it through for line in open(...):, but Python only pulls in a bit at a time, discarding each buffered chunk after it's processed.
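For example, a hedged tweak of the question's own loop, with an 8 KB read buffer instead of buffering=0 (the best buffer size depends on your disk and access pattern):
urlsfile = open(filename, "rb", 8192)  # buffered reads instead of unbuffered
linesCounter = 1
for line in urlsfile:
    if linesCounter > startFromLine:
        DoSomethingWithThisLine(line)
    linesCounter += 1
urlsfile.close()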
I am surprised no one mentioned itertools.islice:
import itertools

line = next(itertools.islice(Fhandle, index_of_interest, index_of_interest + 1), None)  # just the one line
or if you want the whole rest of the file:
rest_of_file = itertools.islice(Fhandle, index_of_interest, None)
for line in rest_of_file:
    print(line)
or if you want every other line from that point on:
rest_of_file = itertools.islice(Fhandle, index_of_interest, None, 2)
for odd_line in rest_of_file:
    print(odd_line)
I'm probably spoiled by abundant RAM, but 15 MB is not huge. Reading into memory with readlines() is what I usually do with files of this size. Accessing a line after that is trivial.
Since there is no way to determine the length of all lines without reading them, you have no choice but to iterate over all lines before your starting line. All you can do is make it look nice. If the file is really huge then you might want to use a generator-based approach:
from itertools import dropwhile

def iterate_from_line(f, start_from_line):
    return (l for i, l in dropwhile(lambda x: x[0] < start_from_line, enumerate(f)))

for line in iterate_from_line(open(filename, "r"), 141978):
    DoSomethingWithThisLine(line)
Note: the index is zero-based in this approach.
I have had the same problem (needing to retrieve a specific line from a huge file).
Of course, I could run through all the records in the file every time and stop when the counter equals the target line, but that is not effective when you want to obtain several specific rows. That leaves the main issue to be resolved: how to jump directly to the necessary place in the file.
This is the solution I found:
First, I fill a dictionary with the start position of each line (the key is the line number, and the value is the cumulated length of the previous lines).
t = open(file, 'r')
dict_pos = {}
kolvo = 0
length = 0
for each in t:
    dict_pos[kolvo] = length
    length = length + len(each)
    kolvo = kolvo + 1
Ultimately, the aim function:
def give_line(line_number):
    t.seek(dict_pos.get(line_number))
    line = t.readline()
    return line
t.seek(dict_pos.get(line_number)) moves the file pointer to the beginning of the desired line, so the following readline() returns your target line.
Using this approach I have saved a significant amount of time.
If you don't want to read the entire file into memory, you may need to come up with some format other than plain text.
Of course it all depends on what you're trying to do, and how often you will jump around the file.
For instance, if you're going to be jumping to lines many times in the same file, and you know that the file does not change while you are working with it, you can do this:
First, pass through the whole file and record the "seek location" of some key line numbers (say, every 1000 lines).
Then, if you want line 12005, jump to the position of line 12000 (which you've recorded), read 5 lines, and you'll know you're at line 12005.
And so on; a rough sketch of this idea follows below.
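A rough sketch of that sparse index, assuming a binary-mode file and an arbitrary granularity of 1000 lines (the function names and the step size are illustrative):
def build_sparse_index(path, step=1000):
    # Map 0-based line number -> byte offset, recorded for every `step`-th line.
    index = {0: 0}
    with open(path, 'rb') as f:
        for lineno, _ in enumerate(iter(f.readline, b''), start=1):
            if lineno % step == 0:
                index[lineno] = f.tell()  # offset of line `lineno`
    return index

def get_line(path, index, wanted, step=1000):
    # Jump to the nearest recorded offset at or before `wanted`, then read forward.
    base = (wanted // step) * step
    with open(path, 'rb') as f:
        f.seek(index[base])
        for _ in range(wanted - base):
            f.readline()
        return f.readline()
So get_line(path, index, 12005) seeks to the recorded offset of line 12000, skips just five lines, and returns the sixth, instead of reading twelve thousand.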
You may use mmap to find the offsets of the lines. mmap seems to be the fastest way to process a file.
Example:
import mmap

with open('input_file', "r+b") as f:
    mapped = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
    i = 1
    for line in iter(mapped.readline, b""):
        if i == Line_I_want_to_jump:
            offsets = mapped.tell() - len(line)  # start of this line
        i += 1
Then use f.seek(offsets) to move to the line you need.
None of the answers are particularly satisfactory, so here's a small snippet to help.
class LineSeekableFile:
    def __init__(self, seekable):
        self.fin = seekable
        self.line_map = list()  # Map from line index -> file position.
        self.line_map.append(0)
        while seekable.readline():
            self.line_map.append(seekable.tell())

    def __getitem__(self, index):
        # NOTE: This assumes that you're not reading the file sequentially.
        # For that, just use 'for line in file'.
        self.fin.seek(self.line_map[index])
        return self.fin.readline()
Example usage:
In: !cat /tmp/test.txt
Out:
Line zero.
Line one!
Line three.
End of file, line four.
In:
with open("/tmp/test.txt", 'rt') as fin:
seeker = LineSeekableFile(fin)
print(seeker[1])
Out:
Line one!
This involves doing a lot of file seeks, but is useful for the cases where you can't fit the whole file in memory. It does one initial read to get the line locations (so it does read the whole file, but doesn't keep it all in memory), and then each access does a file seek after the fact.
I offer the snippet above under the MIT or Apache license at the discretion of the user.
If you know in advance the position in the file (rather the line number), you can use file.seek() to go to that position.
Edit: you can use the linecache.getline(filename, lineno) function, which will return the contents of the line lineno, but only after reading the entire file into memory. Good if you're randomly accessing lines from within the file (as python itself might want to do to print a traceback) but not good for a 15MB file.
What generates the file you want to process? If it is something under your control, you could generate an index (which line is at which position) at the time the file is appended to. The index file can have a fixed record size (space-padded or zero-padded numbers) and will definitely be smaller, and thus can be read and processed quickly:
Decide which line you want.
Calculate the byte offset of the corresponding line number in the index file (possible because the record size of the index file is constant).
Use seek or similar to jump directly to that record in the index file.
Parse it to get the byte offset of the corresponding line in the actual file.
A rough sketch of this scheme is shown below.
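A hedged sketch of that scheme, assuming the data file is appended one line at a time and each index record is a zero-padded offset of fixed width (file names, widths, and helpers are illustrative, not a standard API):
RECORD = 12  # 11 zero-padded digits plus a newline per index entry

def append_line(text, data_path='data.txt', index_path='data.txt.idx'):
    # Append a line to the data file and record its byte offset in the index.
    with open(data_path, 'ab') as data, open(index_path, 'ab') as idx:
        idx.write(b'%011d\n' % data.tell())
        data.write(text.encode() + b'\n')

def read_line(n, data_path='data.txt', index_path='data.txt.idx'):
    # Fetch line n (0-based) via the fixed-width index.
    with open(index_path, 'rb') as idx:
        idx.seek(n * RECORD)       # constant record size makes this a direct jump
        offset = int(idx.readline())
    with open(data_path, 'rb') as data:
        data.seek(offset)
        return data.readline()
The index stays tiny (12 bytes per line here), so looking up any line costs two seeks and two short reads.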
Do the lines themselves contain any index information? If the content of each line is something like "<line index>:Data", then the seek() approach could be used to do a binary search through the file, even if the amount of Data is variable: seek to the midpoint of the file, read a line, check whether its index is higher or lower than the one you want, and so on (a rough sketch follows below).
Otherwise, the best you can do is just readlines(). If you don't want to read all 15MB, you can use the sizehint argument to at least replace a lot of readline()s with a smaller number of calls to readlines().
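A hedged sketch of the binary-search idea described above, assuming lines of the form "<line index>:Data" with the indexes in ascending order, and a file opened in binary mode (the function name is illustrative):
def find_indexed_line(path, wanted):
    # Binary-search a sorted '<index>:data' file for the line whose index is `wanted`.
    with open(path, 'rb') as f:
        first = f.readline()
        if not first:
            return None
        if int(first.split(b':', 1)[0]) == wanted:
            return first               # special case: mid-point seeks always skip a partial line
        lo, hi = 0, f.seek(0, 2)       # hi = file size in bytes
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid)
            f.readline()               # discard the (possibly partial) line containing `mid`
            pos = f.tell()             # start of the next complete line
            line = f.readline()
            if not line:               # ran past the end: search the left half
                hi = mid
                continue
            index = int(line.split(b':', 1)[0])
            if index == wanted:
                return line
            elif index < wanted:
                lo = pos               # the wanted line lies after this one
            else:
                hi = mid               # the wanted line starts at or before `mid`
    return None
This touches only O(log n) small regions of the file, which is the payoff of having the index embedded in each line.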
If you're dealing with a text file on a Linux-based system, you could use the Linux commands.
For me, this worked well!
import commands

def read_line(path, line=1):
    return commands.getoutput('head -%s %s | tail -1' % (line, path))

line_to_jump = 141978
read_line("path_to_large_text_file", line_to_jump)
Here's an example using readlines(sizehint) to read a chunk of lines at a time. DNS pointed out that solution. I wrote this example because the other examples here are single-line oriented.
def getlineno(filename, lineno):
    if lineno < 1:
        raise TypeError("First line is line 1")
    f = open(filename)
    lines_read = 0
    while 1:
        lines = f.readlines(100000)
        if not lines:
            return None
        if lines_read + len(lines) >= lineno:
            return lines[lineno - lines_read - 1]
        lines_read += len(lines)

print(getlineno("nci_09425001_09450000.smi", 12000))
@george brilliantly suggested mmap, which presumably uses the syscall mmap. Here's another rendition.
import mmap

LINE = 2  # your desired line

with open('data.txt', 'rb') as i_file, mmap.mmap(i_file.fileno(), length=0, prot=mmap.PROT_READ) as data:
    for i, line in enumerate(iter(data.readline, b'')):
        if i != LINE:
            continue
        pos = data.tell() - len(line)
        break

    # optionally copy data to `chunk`
    i_file.seek(pos)
    chunk = i_file.read(len(line))

print(f'line {i}')
print(f'byte {pos}')
print(f'data {line}')
print(f'data {chunk}')
You can use this function to return line n:
def skipton(infile, n):
    with open(infile, 'r') as fi:
        for i in range(n - 1):
            next(fi)
        return next(fi)
