Reading large compressed files - python

This might be a simple question, but I can't seem to find the answer or work out why it is not working in this specific case.
I want to read large files, which can be compressed or not. I used contextlib to write a contextmanager function to handle this, and then read the files in the main script using the with statement.
My problem is that the script uses a lot of memory and then gets killed (I am testing with a compressed file). What am I doing wrong? Should I approach this differently?
import gzip
import logging
from contextlib import contextmanager

def process_vcf(location):
    logging.info('Processing vcf')
    logging.debug(location)
    with read_compressed_or_not(location) as vcf:
        for line in vcf.readlines():
            if line.startswith('#'):
                logging.debug(line)

@contextmanager
def read_compressed_or_not(location):
    if location.endswith('.gz'):
        try:
            file = gzip.open(location)
            yield file
        finally:
            file.close()
    else:
        try:
            file = open(location, 'r')
            yield file
        finally:
            file.close()

The lowest impact solution is just to skip the readlines function. readlines returns a list containing every line in the file, so it does hold the entire file in memory. Iterating over the file object itself reads one line at a time, so the whole file never has to be in memory.
with read_compressed_or_not(location) as vcf:
    for line in vcf:
        if line.startswith('#'):
            logging.debug(line)

Instead of using for line in vcf.readlines(), you can do:
line = vcf.readline()
while line:
    # Do stuff
    line = vcf.readline()

This will only load a single line into memory at a time.

The file opening function is the only real difference between reading a gzip file and a non-gzip file, so you can assign the opener dynamically and then read the file. There is no need for a custom context manager.
import gzip

open_fn = gzip.open if location.endswith(".gz") else open
with open_fn(location, mode="rt") as vcf:
    for line in vcf:
        ...
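If this pattern is needed in more than one place, it can be wrapped in a small helper. The sketch below is only an illustration (open_maybe_gzip and the sample filename are made-up names, not part of the answer above):

import gzip

def open_maybe_gzip(location):
    # pick gzip.open for .gz files and the builtin open otherwise,
    # always returning a text-mode file object
    opener = gzip.open if location.endswith(".gz") else open
    return opener(location, mode="rt")

# hypothetical usage with a made-up filename
with open_maybe_gzip("variants.vcf.gz") as vcf:
    for line in vcf:
        if line.startswith("#"):
            print(line.rstrip())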

Related

How to modify and overwrite large files?

I want to make several modifications to some lines in the file and overwrite the file. I do not want to create a new file with the changes, and since the file is large (hundreds of MB), I don't want to read it all at once in memory.
datfile = 'C:/some_path/text.txt'

with open(datfile) as file:
    for line in file:
        if line.split()[0] == 'TABLE':
            # if this is true, I want to change the second word of the line
            # something like: line.split()[1] = 'new'
Please note that an important part of the problem is that the file is big. There are several solutions on the site that address similar problems, but they do not account for the size of the files.
Is there a way to do this in Python?
You can't replace part of a file's contents without rewriting the remainder of the file, and that is true regardless of Python. Each byte of a file lives at a fixed location on a disk or in flash memory. If you want to insert text that is shorter or longer than the text it replaces, you will need to move the remainder of the file. If your replacement is longer than the original text, you will probably want to write a new file to avoid overwriting data you have not read yet.
Given how file I/O works, and the operations you are already performing on the file, making a new file will not be as big of a problem as you think. You are already reading in the entire file line-by-line and parsing the content. Doing a buffered write of the replacement data will not be all that expensive.
from tempfile import NamedTemporaryFile
from os import remove, rename
from os.path import dirname

datfile = 'C:/some_path/text.txt'

try:
    with open(datfile) as file, NamedTemporaryFile(mode='wt', dir=dirname(datfile), delete=False) as output:
        tname = output.name
        for line in file:
            if line.startswith('TABLE'):
                ls = line.split()
                ls[1] = 'new'
                line = ' '.join(ls) + '\n'
            output.write(line)
except:
    remove(tname)
else:
    rename(tname, datfile)
Passing dir=dirname(datfile) to NamedTemporaryFile should guarantee that the final rename does not have to copy the file from one disk to another in most cases. Using delete=False allows you to do the rename if the operation succeeds. The temporary file is deleted by name if any problem occurs, and renamed to the original file otherwise.

How do I read data into Python but not entirely into memory?

I need to parse through a file of about 100,000 records. Is there a way to do this without loading the whole file into memory? Does the csv module already do this (i.e., not load the entire file into memory)? If it matters, I plan on doing this in IDLE.
I've never used the csv module, but you'll want to look into using a generator; this will allow you to process one record at a time without reading the entire file in at once. For example, with a plain file, you can do something like...
def read_file(some_file):
    for line in open(some_file):
        yield line

all_lines = read_file("foo")
results = process(all_lines)

all_lines will be a generator that yields one line at a time as you iterate over it, as in:

for line in all_lines:
    ...

I'd imagine you can do this with the csv module as well.
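For the csv part specifically, csv.reader also works lazily: it wraps an open file object and yields one parsed row per iteration, so the whole file never sits in memory. A minimal sketch, assuming a hypothetical records.csv and a placeholder process_record function:

import csv

with open('records.csv', newline='') as handle:
    reader = csv.reader(handle)   # reads lazily, one row at a time
    for row in reader:            # row is a list of column strings
        process_record(row)       # process_record is a placeholder for your own logic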

Re-read an open file Python

I have a script that reads a file and then completes tests based on that file. However, I am running into a problem because the file is reloaded after one hour, and I cannot get the script to re-read the file at or after that point in time.
So:
GETS NEW FILE TO READ
Reads file
performs tests on file
GET NEW FILE TO READ (with same name - but that can change if it is part of a solution)
Reads new file
perform same tests on new file
Can anyone suggest a way to get Python to re-read the file?
Either seek to the beginning of the file
with open(...) as fin:
    fin.read()   # read first time
    fin.seek(0)  # offset of 0
    fin.read()   # read again
or open the file again (I'd prefer this way since you are otherwise keeping the file open for an hour doing nothing between passes)
with open(...) as fin:
    fin.read()  # read first time

with open(...) as fin:
    fin.read()  # read again
Putting this together
while True:
    with open(...) as fin:
        for line in fin:
            # do something
    time.sleep(3600)
You can move the cursor to the beginning of the file the following way:
file.seek(0)
Then you can successfully read it.

How to efficiently append a new line to the starting of a large file?

I want to prepend a new line at the start of a 2GB+ file. I tried the following code, but it fails with an OUT OF MEMORY error.
myfile = open(tableTempFile, "r+")
myfile.read()  # read everything in the file
myfile.seek(0) # rewind
myfile.write("WRITE IN THE FIRST LINE ")
myfile.close()
What is the way to write to a file without loading the entire file into memory?
How do I prepend a new line at the start of the file?
Please note, there's no way to do this with any built-in function in Python.
You can do this easily on Linux using tail / cat etc.
To do it in Python you have to use an auxiliary file, and for very large files I think this method is the way to go:
import fileinput

def add_line_at_start(filename, line_to_be_added):
    f = fileinput.input(filename, inplace=1)
    for xline in f:
        if f.isfirstline():
            print(line_to_be_added.rstrip('\r\n') + '\n' + xline, end='')
        else:
            print(xline, end='')
    f.close()
NOTE:
Never use read() / readlines() when you are dealing with big files; these methods try to load the complete file into memory.
In your code, seek() takes you back to the start of the file, but everything you write from there overwrites the existing content instead of being inserted before it.
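To see the overwriting behaviour concretely, here is a small throwaway demonstration (demo.txt is just a hypothetical scratch file, not part of the question):

# writing after seek(0) overwrites bytes in place, it does not insert
with open('demo.txt', 'w') as f:
    f.write('original line\n')

with open('demo.txt', 'r+') as f:
    f.seek(0)
    f.write('NEW ')       # replaces the first four bytes ('orig')

with open('demo.txt') as f:
    print(f.read())       # prints 'NEW inal line' - nothing was inserted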
If you can afford having the entire file in memory at once:
first_line_update = "WRITE IN THE FIRST LINE \n"
with open(tableTempFile, 'r+') as f:
    lines = f.readlines()
    lines[0] = first_line_update
    f.seek(0)       # rewind before writing the modified lines back
    f.writelines(lines)
    f.truncate()    # in case the new content is shorter than the old
otherwise:
from shutil import copy
from itertools import islice, chain

# TODO: use a NamedTemporaryFile from the tempfile module
first_line_update = "WRITE IN THE FIRST LINE \n"
with open("inputfile", 'r') as infile, open("tmpfile", 'w+') as outfile:
    # replace the first line with the string provided:
    outfile.writelines(chain((first_line_update,), islice(infile, 1, None)))
    # if you don't want to replace the first line but to insert another line before it,
    # this simplifies to:
    # outfile.writelines(chain((first_line_update,), infile))
copy("tmpfile", "inputfile")
# TODO: remove the temporary file
Generally, you can't do that. A file is a sequence of bytes, not a sequence of lines. This data model doesn't allow for insertions at arbitrary points - you can either replace a byte with another or append bytes at the end.
You can either:
Replace the first X bytes in the file. This could work for you if you can make sure that the first line's length will never vary.
Truncate the file, write the first line, then rewrite all the rest after it. If you can't fit your whole file into memory, then (a sketch follows below):
create a temporary file (the tempfile module will help you)
write your new line to it
open your base file in 'r' mode and copy its contents, piece by piece, into the temporary file after the line you just wrote
close both files, then replace the input file with the temporary file
(Note that appending to the end of a file is much easier - all you need to do is open the file in append ('a') mode.)
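A minimal sketch of the temporary-file steps listed above, assuming a hypothetical input file big.txt and a helper name prepend_line (both illustrative, not from the question):

import os
import shutil
import tempfile

def prepend_line(path, new_line):
    # create the temp file in the same directory so the final replace
    # stays on one filesystem and does not turn into a copy
    dir_name = os.path.dirname(os.path.abspath(path))
    with tempfile.NamedTemporaryFile('w', dir=dir_name, delete=False) as tmp, \
         open(path, 'r') as src:
        tmp.write(new_line.rstrip('\r\n') + '\n')  # the new first line
        shutil.copyfileobj(src, tmp)               # copy the old content piece-wise
        tmp_name = tmp.name
    os.replace(tmp_name, path)                     # swap the result into place

prepend_line('big.txt', 'WRITE IN THE FIRST LINE')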

Re-open files in Python?

Say I have this simple python script:
file = open('C:\\some_text.txt')
print(file.readlines())
print(file.readlines())
When it is run, the first print prints a list containing the text of the file, while the second print prints a blank list. Not completely unexpected I guess. But is there a way to 'wind back' the file so that I can read it again? Or is the fastest way just to re-open it?
You can reset the file pointer by calling seek():
file.seek(0)
will do it. You need that line after your first readlines(). Note that file has to support random access for the above to work.
For small files, it's probably much faster to just keep the file's contents in memory:

file = open('C:\\some_text.txt')
fileContents = file.readlines()
print(fileContents)
print(fileContents)  # This line will work as well.
Of course, if it's a big file, this could put strain on your RAM.
Remember that you can always use the with statement to open and close files:
with open('C:\\some_text.txt') as file:
    data = file.readlines()
# File is now closed
for line in data:
    print(line)
