Limiting Python file space in memory

When you open a file in Python (e.g., open(filename, 'r')), does it load the entire file into memory? More importantly, is there a way to partially load a file into memory to save space (for larger systems), or am I overthinking this? Specifically, I'm trying to optimize this in a cloud environment where I only need ~1-2 lines of a large file, and I'd prefer not to load all of it into memory, since we pay for computation time.
General question, nothing was tested; I'm looking for opinions and such.

There is no argument to open() that limits how much of the file is loaded, but you can change how you read the lines from the file. For example:
# open the sample file
with open('test.txt') as file:
    # read all lines of the file into a list
    content = file.readlines()

# print the 10th line of the file
print("tenth line")
print(content[9])

# print the first three lines of the file
print("first three lines")
print(content[0:3])
You could also use the file.readline() method to read individual lines from a file.
Note that readlines() still reads the entire file into memory; the resulting list is not a compressed version of the file (if anything, a list of strings carries more overhead than the raw bytes on disk). If you want to avoid loading everything, iterate over the file object lazily instead.
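If you really only need a line or two, you can skip readlines() entirely and pull out just the lines you want with itertools.islice; a minimal sketch, reusing the test.txt example:

from itertools import islice

# the file object is consumed lazily, so only the first
# 10 lines are ever read in order to fetch line index 9
with open('test.txt') as f:
    tenth_line = next(islice(f, 9, 10), None)  # None if the file is shorter

print(tenth_line)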

Related

How to modify and overwrite large files?

I want to make several modifications to some lines in the file and overwrite the file. I do not want to create a new file with the changes, and since the file is large (hundreds of MB), I don't want to read it all into memory at once.
datfile = 'C:/some_path/text.txt'
with open(datfile) as file:
    for line in file:
        if line.split()[0] == 'TABLE':
            # if this is true, I want to change the second word of the line
            # something like: line.split()[1] = 'new'
Please note that an important part of the problem is that the file is big. There are several solutions on the site that address similar problems, but they do not account for the size of the files.
Is there a way to do this in Python?
You can't replace a portion of a file with text of a different length without rewriting the remainder of the file, and that is true regardless of Python. Each byte of a file lives at a fixed location on disk or flash memory. If you want to insert text that is shorter or longer than the text it replaces, you have to move the remainder of the file. And if your replacement is longer than the original, you will probably want to write a new file to avoid overwriting data you haven't read yet.
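The one case that does work without rewriting anything else is a same-length, in-place overwrite; a small illustration (the file name and offset are made up):

# replace exactly 3 bytes at byte offset 100; the file size
# and every other byte stay untouched
with open('data.bin', 'r+b') as f:
    f.seek(100)
    f.write(b'NEW')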
Given how file I/O works, and the operations you are already performing on the file, making a new file will not be as big a problem as you think. You are already reading in the entire file line by line and parsing the content; doing a buffered write of the replacement data will not be all that expensive.
from tempfile import NamedTemporaryFile
from os import remove, rename
from os.path import dirname

datfile = 'C:/some_path/text.txt'
try:
    with open(datfile) as file, \
         NamedTemporaryFile(mode='wt', dir=dirname(datfile), delete=False) as output:
        tname = output.name
        for line in file:
            if line.startswith('TABLE'):
                ls = line.split()
                ls[1] = 'new'
                line = ' '.join(ls) + '\n'
            output.write(line)
except:
    remove(tname)  # clean up the partial temp file
    raise          # re-raise so the error is not silently swallowed
else:
    rename(tname, datfile)
Passing dir=dirname(datfile) to NamedTemporaryFile should guarantee that the final rename does not have to copy the file from one disk to another in most cases. Using delete=False allows you to do the rename if the operation succeeds. The temporary file is deleted by name if any problem occurs, and renamed to the original file otherwise.

How to open/edit a file in Python for Windows without locking it?

I've installed pypiwin32 already, so I can use the win32file module, but I don't have much experience with Python.
How would I change my code below, which opens a couple of files (I'm not worried about locking the first one), reads a line, then replaces/writes part of a line in the second file? I don't want the second file to get locked while it's open/being written, hence utilizing the Win32 API.
with open("C:\\Temp\\Fileorg.txt", "rt") as fin:
with open("C:\\Temp\\File2.txt", "wt") as fout:
for line in fin:
fout.write(line.replace('part/list.txt', 'part/list.txt?id='+text))
The newline symbol is just the character '\n'. If the file is "line1\nline2" and it is changed to "line1X\nline2", then everything after the '\n' has to be rewritten, so the output file has to be locked while that happens.
In this scenario it is possible to use file sharing to read the characters before X, but it is a big challenge to read the characters at and after X in the right sequence.
The best option is to use a database.
A second option is to write to a temporary file, "temp.tmp". Once the operation is finished, copy the whole file from "temp.tmp" to "File2.txt"; copying files is fast. The reader can check whether the file is available; if it is not, it should wait one second for CopyFile to finish and try again, up to five times (the numbers 1 and 5 are made up), as sketched below.
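A rough sketch of that second option (the paths come from the question; build_content and the retry numbers are assumptions):

import shutil
import time

# writer: build the new contents in a temp file, then copy it over;
# File2.txt is only unavailable for the duration of the fast copy
with open('C:\\Temp\\temp.tmp', 'wt') as out:
    out.write(build_content())  # build_content is a hypothetical helper
shutil.copyfile('C:\\Temp\\temp.tmp', 'C:\\Temp\\File2.txt')

# reader: if the file is busy, wait a second and retry, up to 5 times
for attempt in range(5):
    try:
        with open('C:\\Temp\\File2.txt', 'rt') as f:
            data = f.read()
        break
    except OSError:
        time.sleep(1)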
A third option is to use formatted text. For example, the output file is
0line1####
0line2####
0line3####
You can modify this data only by changing the # characters. The 0 at the start of each line indicates that the line is not busy; while the writer is updating a line, it flips the 0 to 1 and back.
This way the file size doesn't change when data changes, you can use file sharing, and you can find data at a fixed place. You can add data, but deleting data would be harder. This would be a big project.
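A small sketch of that third option: every record has a fixed byte width, so record n starts at byte n * RECORD_SIZE and can be rewritten in place (the widths and file name are invented):

RECORD_SIZE = 11  # b'0' busy flag + 9 payload bytes padded with b'#' + b'\n'

def write_record(f, index, text):
    # pad/trim the payload to exactly 9 bytes, then overwrite in place
    data = text.encode().ljust(9, b'#')[:9]
    f.seek(index * RECORD_SIZE)
    f.write(b'1' + data + b'\n')  # flag 1 = line busy while updating...
    f.seek(index * RECORD_SIZE)
    f.write(b'0')                 # ...back to 0 once the update is done

# overwrite the third record without changing the file size
with open('records.txt', 'r+b') as f:
    write_record(f, 2, 'line3new')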

Modify only a specific part of a file

I have a Python script that is supposed to read a file. The issue is that the file is very large, so for efficiency I decided that my script should only read from line 650000 onward, since the previous lines do not contain relevant information.
Is there any way to only modify lines 650000 through EOF, so that, for example, if I read() this file, only those specific lines would appear?
Files are not line-oriented, they are blocks of bytes.
There's no way, short of reading the data in, to figure out how many bytes make up those first 650,000 lines, so you'd have to do that just in order to skip them.
Modifying a file starting at a certain offset is possible, but the offset will be in bytes, which is the addressing unit used by files.
Skipping lines can be done easily enough:
with open("myfile.txt", "w+t") as f:
for i in xrange(650000):
f.readline() # Read a line and throw it away
f.write("hello")
The truncate() call chops the file so that there is no data after the hello (but the 650,000 skipped lines are still there before it, of course).

MemoryError when trying to load 5GB text file

I want to read data stored in text format in a 5GB file. When I try to read the content of the file using this code:
file = open('../data/entries_en.txt', 'r')
data = file.readlines()
an error occurred:
data = file.readlines()
MemoryError
My laptop has 8GB of memory, and at least 4GB is free when I run the program. But when I monitor the system performance, the error happens once Python has used about 1.5GB of memory.
I'm using Python 2.7, but if it matters, please give the solution for both 2.x and 3.x.
What should I do to read this file?
The best way to handle large files is:
with open('../file.txt', 'r') as f:
    for line in f:
        # do stuff
readlines() errors out because you are trying to load too large a file directly into memory. The code above will also automatically close your file once you are done processing it.
If you want to process the lines in the file, you should instead use:
for line in file:
    # do something with the line
It will read the file line by line, instead of reading it all into RAM at once.
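If you only need certain lines, you can combine this streaming pattern with enumerate to stop early; a minimal sketch (the file name comes from the question, the line number is made up):

# stop as soon as the wanted line is reached, so at most
# wanted + 1 lines are ever read, one at a time
wanted = 100000  # zero-based index of the line to fetch
with open('../data/entries_en.txt', 'r') as f:
    for i, line in enumerate(f):
        if i == wanted:
            print(line)
            break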

How do I remove lines from a big file in Python, within a limited environment

Say I have a 10GB HDD Ubuntu VPS in the USA (and I live somewhere else), and I have a 9GB text file on the hard drive. I have 512MB of RAM and about the same amount of swap.
Given the fact that I cannot add more HDD space and cannot move the file to somewhere else to process, is there an efficient method to remove some lines from the file using Python (preferably, but any other language will be acceptable)?
How about this? It edits the file in place. I've tested it on some small text files (in Python 2.6.1), but I'm not sure how well it will perform on massive files because of all the jumping around, but still...
I've used an indefinite while loop with a manual EOF check, because for line in f: didn't work correctly (mixing seek() with the file's line iterator is unreliable, since the iterator uses an internal read-ahead buffer). There may be a better way to check for EOF, but I'm relatively new to Python, so someone please let me know if there is.
Also, you'll need to define the function isRequired(line).
writeLoc = 0
readLoc = 0
with open("filename", "r+") as f:
    while True:
        line = f.readline()
        # manual EOF check: readline() returns an empty
        # string only at end of file
        if line == "":
            break
        # save how far we've read
        readLoc = f.tell()
        # if we need this line, write it at the write
        # location and update that location
        if isRequired(line):
            f.seek(writeLoc)
            f.write(line)
            writeLoc = f.tell()
            f.seek(readLoc)
    # finally, chop off the rest of the file that's no longer needed
    f.truncate(writeLoc)
Try this:
currentReadPos = 0
removedLinesLength = 0
for line in file:
    currentReadPos = file.tell()
    if remove(line):
        removedLinesLength += len(line)
    else:
        file.seek(file.tell() - removedLinesLength)
        file.write(line)  # line already ends with '\n', so don't add another
        file.flush()
        file.seek(currentReadPos)
I have not run this, but the idea is to modify the file in place by overwriting the lines you want to remove with the lines you want to keep. I am not sure how the seeking and writing interact with iterating over the file.
Update:
I tried fileinput with inplace by creating a 1GB file. What happened was different from what I expected; I have now read the documentation properly.
Optional in-place filtering: if the keyword argument inplace=1 is passed to fileinput.input() or to the FileInput constructor, the file is moved to a backup file and standard output is directed to the input file (if a file of the same name as the backup file already exists, it will be replaced silently).
(from the fileinput docs)
Since the file is moved to a backup copy first, this needs about as much free disk space as the file itself, so it doesn't seem to be an option for you. Please check the other answers.
Before Edit:
If you are looking to edit the file in place, then check out Python's fileinput module - Docs.
I am really not sure about its efficiency when used with a 10GB file, but to me this seemed to be the only option you have using Python.
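For reference, the basic fileinput in-place pattern looks like this (should_keep is a hypothetical filter; as the update above explains, the backup copy means this needs roughly as much free disk space as the file itself):

import fileinput

# lines printed to stdout replace the file's contents;
# a backup copy of the original is created first
for line in fileinput.input('big.txt', inplace=True):
    if should_keep(line):    # should_keep is a made-up predicate
        print(line, end='')  # stdout is redirected into big.txt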
Just sequentially read and write to the files.
f.readlines() returns a list containing all the lines of data in the file. If given an optional parameter sizehint, it reads that many bytes from the file and enough more to complete a line, and returns the lines from that. This is often used to allow efficient reading of a large file by lines, but without having to load the entire file in memory. Only complete lines will be returned.
Source
Process the file in chunks of 10-20MB or more; this would be the fastest way.
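A sketch of that chunked approach using the sizehint parameter (the file names and the keep() filter are invented, and it assumes there is somewhere to write the output):

CHUNK = 20 * 1024 * 1024  # ask for roughly 20MB of complete lines per call

with open('input.txt') as src, open('output.txt', 'w') as dst:
    while True:
        lines = src.readlines(CHUNK)  # sizehint: stops after ~CHUNK bytes
        if not lines:
            break
        dst.writelines(line for line in lines if keep(line))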
Another way of doing this is to stream the file and filter it, using AWK for example.
Example pseudocode for the chunked approach, written out as a runnable Python sketch (is_cool stands in for whatever test decides which lines to keep):

def get_lines(f, offset, count=50):
    # read up to `count` lines starting at byte `offset`;
    # return them along with the offset where reading stopped
    f.seek(offset)
    lines = []
    for _ in range(count):
        line = f.readline()
        if not line:
            break
        lines.append(line)
    return lines, f.tell()

def write_back(f, lines, offset):
    # write the kept lines at `offset`; return the next write offset
    f.seek(offset)
    f.writelines(lines)
    return f.tell()

with open('filename', 'r+') as f:
    readOffset = 0
    writeOffset = 0
    while True:
        lines, readOffset = get_lines(f, readOffset)
        if not lines:
            break
        kept = [line for line in lines if is_cool(line)]
        writeOffset = write_back(f, kept, writeOffset)
    # to resize the file at the end, use truncate()
    f.truncate(writeOffset)
