How to modify and overwrite large files?

How to modify and overwrite large files? - python

I want to make several modifications to some lines in the file and overwrite the file. I do not want to create a new file with the changes, and since the file is large (hundreds of MB), I don't want to read it all at once in memory.
datfile = 'C:/some_path/text.txt'
with open(datfile) as file:
for line in file:
if line.split()[0] == 'TABLE':
# if this is true, I want to change the second word of the line
# something like: line.split()[1] = 'new'
Please note that an important part of the problem is that the file is big. There are several solutions on the site that address the similar problems but do not account for the size of the files.
Is there a way to do this in python?

You can't replace the contents of a portion of a file without rewriting the remainder of the file regardless of python. Each byte of a file lives in a fixed location on a disk or flash memory. If you want to insert text into the file that is shorter or longer than the text it replaces, you will need to move the remainder of the file. If your replacement is longer than the original text, you will probably want to write a new file to avoid overwriting the data.
Given how file I/O works, and the operations you are already performing on the file, making a new file will not be as big of a problem as you think. You are already reading in the entire file line-by-line and parsing the content. Doing a buffered write of the replacement data will not be all that expensive.
from tempfile import NamedTemporaryFile
from os import remove, rename
from os.path import dirname
datfile = 'C:/some_path/text.txt'
try:
with open(datfile) as file, NamedTemporaryFile(mode='wt', dir=dirname(datfile), delete=False) as output:
tname = output.name
for line in file:
if line.startswith('TABLE'):
ls = line.split()
ls[1] = 'new'
line = ls.join(' ') + '\n'
output.write(line)
except:
remove(tname)
else:
rename(tname, datfile)
Passing dir=dirname(datfile) to NamedTemporaryFile should guarantee that the final rename does not have to copy the file from one disk to another in most cases. Using delete=False allows you to do the rename if the operation succeeds. The temporary file is deleted by name if any problem occurs, and renamed to the original file otherwise.

Related

Python: read a line and write back to that same line

I am using python to make a template updater for html. I read a line and compare it with the template file to see if there are any changes that needs to be updated. Then I want to write any changes (if there are any) back to the same line I just read from.
Reading the file, my file pointer is positioned now on the next line after a readline(). Is there anyway I can write back to the same line without having to open two file handles for reading and writing?
Here is a code snippet of what I want to do:
cLine = fp.readline()
if cLine != templateLine:
# Here is where I would like to write back to the line I read from
# in cLine

Updating lines in place in text file - very difficult
Many questions in SO are trying to read the file and update it at once.
While this is technically possible, it is very difficult.
(text) files are not organized on disk by lines, but by bytes.
The problem is, that read number of bytes on old lines is very often different from new one, and this mess up the resulting file.
Update by creating a new file
While it sounds inefficient, it is the most effective way from programming point of view.
Just read from file on one side, write to another file on the other side, close the files and copy the content from newly created over the old one.
Or create the file in memory and finally do the writing over the old one after you close the old one.

At the OS level the things are a bit different from how it looks from Python - from Python a file looks almost like a list of strings, with each string having arbitrary length, so it seems to be easy to swap a line for something else without affecting the rest of the lines:
l = ["Hello", "world"]
l[0] = "Good bye"
In reality, though, any file is just a stream of bytes, with strings following each other without any "padding". So you can only overwrite the data in-place if the resulting string has exactly the same length as the source string - otherwise it'll simply overwrite the following lines.
If that is the case (your processing guarantees not to change the length of strings), you can "rewind" the file to the start of the line and overwrite the line with new data. The below script converts all lines in file to uppercase in-place:
def eof(f):
cur_loc = f.tell()
f.seek(0,2)
eof_loc = f.tell()
f.seek(cur_loc, 0)
if cur_loc >= eof_loc:
return True
return False
with open('testfile.txt', 'r+t') as fp:
while True:
last_pos = fp.tell()
line = fp.readline()
new_line = line.upper()
fp.seek(last_pos)
fp.write(new_line)
print "Read %s, Wrote %s" % (line, new_line)
if eof(fp):
break
Somewhat related: Undo a Python file readline() operation so file pointer is back in original state
This approach is only justified when your output lines are guaranteed to have the same length, and when, say, the file you're working with is really huge so you have to modify it in place.
In all other cases it would be much easier and more performant to just build the output in memory and write it back at once. Another option is to write to a temporary file, then delete the original and rename the temporary file so it replaces the original file.

Replacing a line on Python

I'm trying to convert PHP code to Python, and I have problems with replacing lines. Although I find it easier to do using Python, I'm absolutely lost; I can find the line to replace, I can add something to the end of the line, but I can't write the line again on the file.
file = open("cache.ucb", 'rb')
for line in file:
if line.split('~!')[0] == ex[4]:
line += "~!" + mask[0]
line = line.rstrip() + "\n"
# Write on the file here!
Basically, the file uses ~! as a separator, and I read each line. If the first token separated with ~! of the line starts with ex[4], which could be for example Catbuntu, I want to append mask[0], which could be Bousie, on the end of that line. Then I remove the new line characters and add one to the end.
And there's the problem. I want to write the file as it was, but changing only that line. Is that possible?

Assuming you're on python >=2.7, the following should work a treat
original = open(filename)
newfile = []
for line in original:
if line.split('~!')[0] == ex[4]:
line += "~!" + mask[0]
line = line.rstrip() + "\n"
newfile.append(line)
original.close()
amended.open(filename, "w")
amended.writeLines(newfile)
amended.close()
If for whatever reason you are on python 2.6 or lower, replace the second to last line with:
amended.write("".join(newfile))
EDIT: Fixed to replace a mistake copied from the question, factor out a filename.

You cannot modify a file in-place, at least not if you want to insert characters to a line. You'll just end up overwriting the start of the next line.
There are two different ways to do this:
Read the file into memory, close it, then write back the new version.
Write a new temporary file as you go along, then move it over the original version.
So, how do you choose between them? I'll try to summarize the differences, ordered so that each one typically trumps the ones below if it's important (but that's just "typically"—you have to think through your own use case):
2 doesn't require holding the entire thing in memory. If your file is, say, 20GB long, this is obviously a huge win; if it's 16KB, it doesn't matter.
2 makes the entire operation atomic. Even if it fails halfway through, or some other process tries to read the file while you're in the middle of changing it, there is no way anyone can see some invalid half-modified file; they will see either the original file, or the new one.
2 requires some free disk space (because there are, temporarily, two copies of the file at the same time).
2 is a huge pain in the neck if you care about both Windows and POSIX.
2 can involve copying across filesystems if the original file and the temp directory are on different filesystems, unless you're careful about it.
2 is simpler if neither of the above two are an issue.
Drakekin's answer tells you how to do #1.
Here's how to do #2 if you don't care about Windows or about cross-filesystem issues:
infile = open("cache.ucb", 'rb')
outfile = tempfile.NamedTemporaryFile(delete=False)
for line in infile:
if line.split('~!')[0] == ex[4]:
line += "~!" + mask[0]
line = line.rstrip() + "\n"
outfile.write(line)
infile.close()
os.rename(outfile.name, "cache.ucb")
outfile.close()
You can solve the cross-filesystem problem by, e.g., passing dir=os.path.dirname(original path) to the NamedTemporaryFile constructor, but only if you're sure you'll always have permissions to create a new file alongside the original (which isn't always guaranteed, just because you have permission to rewrite the original—UNIX permissions, Windows ACLs, the OS X sandbox, etc. all give ways that can be false).
To solve the Windows problem… well, start with Is an atomic file rename (with overwrite) possible on Windows, and similar discussions all over the internet.

Open the file in mode 'wb' and put file.write(line) at the end of your loop.

You don't have your file open for writing.
file = open("cache.ucb", 'rb')
This line opens a file for reading in binary mode. You need to open it for writing also.
Try opening the file in write mode, 'w' and writing the line back.
Or you can simply open your file for read/write at the beginning and write inside your loop:
file = open("cache.ucb", 'a+')

How to efficiently append a new line to the starting of a large file?

I want to append a new line in the starting of 2GB+ file. I tried following code but code OUT of MEMORY
error.
myfile = open(tableTempFile, "r+")
myfile.read() # read everything in the file
myfile.seek(0) # rewind
myfile.write("WRITE IN THE FIRST LINE ")
myfile.close();
What is the way to write in a file file without getting the entire file in memory?
How to append a new line at starting of the file?

Please note, there's no way to do this with any built-in functions in Python.
You can do this easily in LINUX using tail / cat etc.
For doing it via Python we must use an auxiliary file and for doing this with very large files, I think this method is the possibility:
def add_line_at_start(filename,line_to_be_added):
f = fileinput.input(filename,inplace=1)
for xline in f:
if f.isfirstline():
print line_to_be_added.rstrip('\r\n') + '\n' + xline,
else:
print xline
NOTE:
Never try to use read() / readlines() functions when you are dealing with big files. These methods tried load the complete file into your memory
In your given code, seek function is going to take you the starting point but then everything you write would overwrite the current content

If you can afford having the entire file in memory at once:
first_line_update = "WRITE IN THE FIRST LINE \n"
with open(tableTempFile, 'r+') as f:
lines = f.readlines()
lines[0] = first_line_update
f.writelines(lines)
otherwise:
from shutil import copy
from itertools import islice, chain
# TODO: use a NamedTemporaryFile from the tempfile module
first_line_update = "WRITE IN THE FIRST LINE \n"
with open("inputfile", 'r') as infile, open("tmpfile", 'w+') as outfile:
# replace the first line with the string provided:
outfile.writelines(
(line for line in chain((first_line_update,), islice(infile,1,None)))
# if you don't want to replace the first line but to insert another line before
# this simplifies to:
#outfile.writelines(line for line in chain((first_line_update,), infile))
copy("tmpfile", "infile")
# TODO: remove temporary file

Generally, you can't do that. A file is a sequence of bytes, not a sequence of lines. This data model doesn't allow for insertions in arbitrary points - you can either replace a byte by another or append bytes at the end.
You can either:
Replace the first X bytes in the file. This could work for you if you can make sure that the first line's length will never vary.
Truncate the file, write the first line, then rewrite all the rest after it. If you can't fit all your file into the memory, then:
create a temporary file (the tempfile module will help you)
write your line to it
open your base file in r and copy its contents after the first line to the temporary file, piece-wise
close both files, then replace the input file by the temporary file
(Note that appending to the end of a file is much easier - all you need to do is open the file in the append a mode.)

How to erase the file contents of text file in Python?

I have text file which I want to erase in Python. How do I do that?

In python:
open('file.txt', 'w').close()
Or alternatively, if you have already an opened file:
f = open('file.txt', 'r+')
f.truncate(0) # need '0' when using r+

Opening a file in "write" mode clears it, you don't specifically have to write to it:
open("filename", "w").close()
(you should close it as the timing of when the file gets closed automatically may be implementation specific)

Not a complete answer more of an extension to ondra's answer
When using truncate() ( my preferred method ) make sure your cursor is at the required position.
When a new file is opened for reading - open('FILE_NAME','r') it's cursor is at 0 by default.
But if you have parsed the file within your code, make sure to point at the beginning of the file again i.e truncate(0)
By default truncate() truncates the contents of a file starting from the current cusror position.
A simple example

As #jamylak suggested, a good alternative that includes the benefits of context managers is:
with open('filename.txt', 'w'):
pass

When using with open("myfile.txt", "r+") as my_file:, I get strange zeros in myfile.txt, especially since I am reading the file first. For it to work, I had to first change the pointer of my_file to the beginning of the file with my_file.seek(0). Then I could do my_file.truncate() to clear the file.

Writing and Reading file content
def writeTempFile(text = None):
filePath = "/temp/file1.txt"
if not text: # If not provided return file content
f = open(filePath, "r")
slug = f.read()
return slug
else:
f = open(filePath, "a") # Create a blank file
f.seek(0) # sets point at the beginning of the file
f.truncate() # Clear previous content
f.write(text) # Write file
f.close() # Close file
return text
It Worked for me

If security is important to you then opening the file for writing and closing it again will not be enough. At least some of the information will still be on the storage device and could be found, for example, by using a disc recovery utility.
Suppose, for example, the file you're erasing contains production passwords and needs to be deleted immediately after the present operation is complete.
Zero-filling the file once you've finished using it helps ensure the sensitive information is destroyed.
On a recent project we used the following code, which works well for small text files. It overwrites the existing contents with lines of zeros.
import os
def destroy_password_file(password_filename):
with open(password_filename) as password_file:
text = password_file.read()
lentext = len(text)
zero_fill_line_length = 40
zero_fill = ['0' * zero_fill_line_length
for _
in range(lentext // zero_fill_line_length + 1)]
zero_fill = os.linesep.join(zero_fill)
with open(password_filename, 'w') as password_file:
password_file.write(zero_fill)
Note that zero-filling will not guarantee your security. If you're really concerned, you'd be best to zero-fill and use a specialist utility like File Shredder or CCleaner to wipe clean the 'empty' space on your drive.

You have to overwrite the file. In C++:
#include <fstream>
std::ofstream("test.txt", std::ios::out).close();

You can also use this (based on a few of the above answers):
file = open('filename.txt', 'w')
file.close()
of course this is a really bad way to clear a file because it requires so many lines of code, but I just wrote this to show you that it can be done in this method too.
happy coding!

You cannot "erase" from a file in-place unless you need to erase the end. Either be content with an overwrite of an "empty" value, or read the parts of the file you care about and write it to another file.

Assigning the file pointer to null inside your program will just get rid of that reference to the file. The file's still there. I think the remove() function in the c stdio.h is what you're looking for there. Not sure about Python.

Since text files are sequential, you can't directly erase data on them. Your options are:
The most common way is to create a new file. Read from the original file and write everything on the new file, except the part you want to erase. When all the file has been written, delete the old file and rename the new file so it has the original name.
You can also truncate and rewrite the entire file from the point you want to change onwards. Seek to point you want to change, and read the rest of file to memory. Seek back to the same point, truncate the file, and write back the contents without the part you want to erase.
Another simple option is to overwrite the data with another data of same length. For that, seek to the exact position and write the new data. The limitation is that it must have exact same length.
Look at the seek/truncate function/method to implement any of the ideas above. Both Python and C have those functions.

This is my method:
open the file using r+ mode
read current data from the file using file.read()
move the pointer to the first line using file.seek(0)
remove old data from the file using file.truncate(0)
write new content and then content that we saved using file.read()
So full code will look like this:
with open(file_name, 'r+') as file:
old_data = file.read()
file.seek(0)
file.truncate(0)
file.write('my new content\n')
file.write(old_data)
Because we are using with open, file will automatically close.

How do I modify the last line of a file?

The last line of my file is:
29-dez,40,
How can I modify that line so that it reads:
29-Dez,40,90,100,50
Note: I don't want to write a new line. I want to take the same line and put new values after 29-Dez,40,
I'm new at python. I'm having a lot of trouble manipulating files and for me every example I look at seems difficult.

Unless the file is huge, you'll probably find it easier to read the entire file into a data structure (which might just be a list of lines), and then modify the data structure in memory, and finally write it back to the file.
On the other hand maybe your file is really huge - multiple GBs at least. In which case: the last line is probably terminated with a new line character, if you seek to that position you can overwrite it with the new text at the end of the last line.
So perhaps:
f = open("foo.file", "wb")
f.seek(-len(os.linesep), os.SEEK_END)
f.write("new text at end of last line" + os.linesep)
f.close()
(Modulo line endings on different platforms)

To expand on what Doug said, in order to read the file contents into a data structure you can use the readlines() method of the file object.
The below code sample reads the file into a list of "lines", edits the last line, then writes it back out to the file:
#!/usr/bin/python
MYFILE="file.txt"
# read the file into a list of lines
lines = open(MYFILE, 'r').readlines()
# now edit the last line of the list of lines
new_last_line = (lines[-1].rstrip() + ",90,100,50")
lines[-1] = new_last_line
# now write the modified list back out to the file
open(MYFILE, 'w').writelines(lines)
If the file is very large then this approach will not work well, because this reads all the file lines into memory each time and writes them back out to the file, which is very inefficient. For a small file however this will work fine.

Don't work with files directly, make a data structure that fits your needs in form of a class and make read from/write to file methods.

I recently wrote a script to do something very similar to this. It would traverse a project, find all module dependencies and add any missing import statements. I won't clutter this post up with the entire script, but I'll show how I went about modifying my files.
import os
from mmap import mmap
def insert_import(filename, text):
if len(text) < 1:
return
f = open(filename, 'r+')
m = mmap(f.fileno(), os.path.getsize(filename))
origSize = m.size()
m.resize(origSize + len(text))
pos = 0
while True:
l = m.readline()
if l.startswith(('import', 'from')):
continue
else:
pos = m.tell() - len(l)
break
m[pos+len(text):] = m[pos:origSize]
m[pos:pos+len(text)] = text
m.close()
f.close()
Summary: This snippet takes a filename and a blob of text to insert. It finds the last import statement already present, and sticks the text in at that location.
The part I suggest paying most attention to is the use of mmap. It lets you work with files in the same manner you may work with a string. Very handy.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to modify and overwrite large files? - python

Related

Python: read a line and write back to that same line

Replacing a line on Python

How to efficiently append a new line to the starting of a large file?

How to erase the file contents of text file in Python?

How do I modify the last line of a file?

Categories

Resources