How can I securely remove a file using python? The function os.remove(path) only removes the directory entry, but I want to securely remove the file, similar to the apple feature called "Secure Empty Trash" that randomly overwrites the file.
What function securely removes a file using this method?
You can use srm to securely remove files. You can use Python's os.system() function to call srm.
You can very easily write a function in Python to overwrite a file with random data, even repeatedly, then delete it. Something like this:
import os
def secure_delete(path, passes=1):
with open(path, "ba+") as delfile:
length = delfile.tell()
with open(path, "br+") as delfile:
for i in range(passes):
delfile.seek(0)
delfile.write(os.urandom(length))
os.remove(path)
Shelling out to srm is likely to be faster, however.
You can use srm, sure, you can always easily implement it in Python. Refer to wikipedia for the data to overwrite the file content with. Observe that depending on actual storage technology, data patterns may be quite different. Furthermore, if you file is located on a log-structured file system or even on a file system with copy-on-write optimisation, like btrfs, your goal may be unachievable from user space.
After you are done mashing up the disk area that was used to store the file, remove the file handle with os.remove().
If you also want to erase any trace of the file name, you can try to allocate and reallocate a whole bunch of randomly named files in the same directory, though depending on directory inode structure (linear, btree, hash, etc.) it may very tough to guarantee you actually overwrote the old file name.
So at least in Python 3 using #kindall's solution I only got it to append. Meaning the entire contents of the file were still intact and every pass just added to the overall size of the file. So it ended up being [Original Contents][Random Data of that Size][Random Data of that Size][Random Data of that Size] which is not the desired effect obviously.
This trickery worked for me though. I open the file in append to find the length, then reopen in r+ so that I can seek to the beginning (in append mode it seems like what caused the undesired effect is that it was not actually possible to seek to 0)
So check this out:
def secure_delete(path, passes=3):
with open(path, "ba+", buffering=0) as delfile:
length = delfile.tell()
delfile.close()
with open(path, "br+", buffering=0) as delfile:
#print("Length of file:%s" % length)
for i in range(passes):
delfile.seek(0,0)
delfile.write(os.urandom(length))
#wait = input("Pass %s Complete" % i)
#wait = input("All %s Passes Complete" % passes)
delfile.seek(0)
for x in range(length):
delfile.write(b'\x00')
#wait = input("Final Zero Pass Complete")
os.remove(path) #So note here that the TRUE shred actually renames to file to all zeros with the length of the filename considered to thwart metadata filename collection, here I didn't really care to implement
Un-comment the prompts to check the file after each pass, this looked good when I tested it with the caveat that the filename is not shredded like the real shred -zu does
The answers implementing a manual solution did not work for me. My solution is as follows, it seems to work okay.
import os
def secure_delete(path, passes=1):
length = os.path.getsize(path)
with open(path, "br+", buffering=-1) as f:
for i in range(passes):
f.seek(0)
f.write(os.urandom(length))
f.close()
Related
I have a python script that takes in two arguments, the name of the input and output files, i.e. it starts of like
inputFile=open(sys.argv[1],'r')
outFile=open(sys.argv[2],'w')
Then performs whatever operation reading from inputFile and writing to the outFile.
Now a few times through human error I've accidentally given the same argument twice, the result being that my input file is replaced with a blank line. Is there are a straight-forward way to stop this happening?
I thought it might be as simple as adding
if sys.argv[1]==sys.argv[2]:
inputFile.close()
outFile.close()
immediately after the first lines above, but this already leaves the file blank.
Simply do :
import os
if os.path.realpath(sys.argv[1]) != os.path.realpath(sys.argv[2]):
inputFile=open(sys.argv[1],'r')
outFile=open(sys.argv[2],'w')
else:
raise ValueError('Input and output files are the same')
This will prevent human mistakes by raising a welcomed error that won't destroy your input file.
os.path.realpath will transform any relative path to an absolute path, so that, even if the strings are different, you can raise the error when absolute paths are identical (thanks #Jean-François Fabre for reminding me this)
opening the file for writing immediately truncates the file, so the damage is already done when you compare the strings.
That said:
on windows filesystems, the protection is "built-in" since if the file is open as read mode, it cannot be open as write mode at the same time: good (there's a "grey area" for networked filesystems, though)
on Linux/Unix, the risk is there. But comparing the name isn't enough. What if both different paths point on the same file after all? (consider: foo/bar and /mydrive/foo/bar or foo/../bar and bar)
You could use os.path.realpath() on both files prior to comparing for instance to resolve relative paths that could be different (that wouldn't solve symbolic link problems, but it's better than nothing)
And for the windows "gray area" I was mentionning, comparing the lowercase version of the names would be a good idea.
The input file is becoming blank because open(filename, 'w') overwrites a file with whatever needs to be placed in it. 'w' is useful for file creation and then writing to that file. I'd suggest trying open(filename, 'a') for appending a pre-existing file. I can't quite remember if this creates a file if it's not already existing, but it sounds like you have 2 existing files already, so append should be what you need.
If you decide to go the if sys.argv[1] == sys.argv[2] method, try placing str() around each item you're comparing, just to be certain it's comparing them properly.
Im trying to replace the zero's with a value. So far this is my code, but what do i do next?
g = open("January.txt", "r+")
for i in range(3):
dat_month = g.readline()
Month: January
Item: Lawn
Total square metres purchased:
0
monthly value = 0
You could do that -
but that is not the usual approach, and certainly is not the correct approach for text files.
The correct way to do it is to write another file, with the information you want updated in place, and then rename the new file to the old one. That is the only sane way of doing this with text files, since the information size in bytes for the fields is variable.
As for the impresion that you are "writing 200 bytes to the disk" instead of a single byte, changing your value, don't let that fool you: at the Operating system level, all file access has to be done in blocks, which are usually a couple of kilobytes long (in special cases, and tunned filesystems it could be a couple hundred bytes). Anyway, you will never, in a user-space program, much less in a high level language like Python, trigger a diskwrite of less than a few hundred bytes.
Now, for the code:
import os
my_number = <number you want to place in the line you want to rewrite>
with open("January.txt", "r") as in_file, open("newfile.txt", "w") as out_file:
for line in in_file:
if line.strip() == "0":
out_file.write(str(my_number) + "\n")
else:
out_file.write(line)
os.unlink("January.txt")
os.rename("newfile.txt", "January.txt")
So - that is the general idea -
of course you should not write code with all values hardcoded in that way (i.e. the values to be checked and written fixed in the program code, as are the filenames).
As for the with statement - it is a special construct of the language wich is very appropriate to oppening files and manipulating then in a block, like in this case - but it is not needed.
Programing apart, the concept you have to keep in mind is this:
when you use an application that lets you edit a text file, a spreadsheet, an image, you, as user, may have the impression that after you are done and have saved your work, the updates are comitted to the same file. In the vast, vast majority of use cases, that is not what happens: the application uses internally a pattern like the one I presented above - a completly new file is written to disk and the old one is deleted, or renamed. The few exceptions could be simple database applications, which could replace fixed width fields inside the file itself on updates. Modern day databases certainly do not do that, resorting to appending the most recent, updated information, to the end of the file. PDF files are another kind that were not designed to be replaced entirely on each update, when being created: but also in that case, the updated information is written at the end of the file, even if the update is to take place in a page in the beginning of the rendered document.
dat_month = dat_month.replace("0", "45678")
To write to a file you do:
with open("Outfile.txt", "wt") as outfile:
And then
outfile.write(dat_month)
Try this:
import fileinput
import itertools
import sys
with fileinput.input('January.txt', inplace=True) as file:
beginning = tuple(itertools.islice(file, 3))
sys.stdout.writelines(beginning)
sys.stdout.write(next(file).replace('0', 'a value'))
sys.stdout.write(next(file).replace('0', 'a value'))
sys.stdout.writelines(file)
I want to copy the contents of a JSON file in another JSON file, with Python
Any ideas ?
Thank you :)
Given the lack of research effort, I normally wouldn't answer, but given the poor suggestions in comments, I'll bite and give a better option.
Now, this largely depends on what you mean, do you wish to overwrite the contents of one file with another, or insert? The latter can be done like so:
with open("from.json", "r") as from, open("to.json", "r") as to:
to_insert = json.load(from)
destination = json.load(to)
destination.append(to_insert) #The exact nature of this line varies. See below.
with open("to.json", "w") as to:
json.dump(to, destination)
This uses python's json module, which allows us to do this very easily.
We open the two files for reading, then open the destination file again in writing mode to truncate it and write to it.
The marked line depends on the JSON data structure, here I am appending it to the root list element (which could not exist), but you may want to place it at a particular dict key, or somesuch.
In the case of replacing the contents, it becomes easier:
with open("from.json", "r") as from, open("to.json", "w") as to:
to.write(from.read())
Here we literally just read the data out of one file and write it into the other file.
Of course, you may wish to check the data is JSON, in which case, you can use the JSON methods as in the first solution, which will throw exceptions on invalid data.
Another, arguably better, solution to this could also be shutil's copy methods, which would avoid actually reading or writing the file contents manually.
Using the with statement gives us the benefit of automatically closing our files - even if exceptions occur. It's best to always use them where we can.
Note that in versions of Python before 2.7, multiple context managers are not handled by the with statement, instead you will need to nest them:
with open("from.json", "r") as from:
with open("to.json", "r+") as to:
...
Say I have a 10GB HDD Ubuntu VPS in the USA (and I live in some where else), and I have a 9GB text file on the hard drive. I have 512MB of RAM, and about the same amount of swap.
Given the fact that I cannot add more HDD space and cannot move the file to somewhere else to process, is there an efficient method to remove some lines from the file using Python (preferably, but any other language will be acceptable)?
How about this? It edits the file in place. I've tested it on some small text files (in Python 2.6.1), but I'm not sure how well it will perform on massive files because of all the jumping around, but still...
I've used a indefinite while loop with a manual EOF check, because for line in f: didn't work correctly (presumably all the jumping around messes up the normal iteration). There may be a better way to check this, but I'm relatively new to Python, so someone please let me know if there is.
Also, you'll need to define the function isRequired(line).
writeLoc = 0
readLoc = 0
with open( "filename" , "r+" ) as f:
while True:
line = f.readline()
#manual EOF check; not sure of the correct
#Python way to do this manually...
if line == "":
break
#save how far we've read
readLoc = f.tell()
#if we need this line write it and
#update the write location
if isRequired(line):
f.seek( writeLoc )
f.write( line )
writeLoc = f.tell()
f.seek( readLoc )
#finally, chop off the rest of file that's no longer needed
f.truncate( writeLoc )
Try this:
currentReadPos = 0
removedLinesLength = 0
for line in file:
currentReadPos = file.tell()
if remove(line):
removedLinesLength += len(line)
else:
file.seek(file.tell() - removedLinesLength)
file.write(line + "\n")
file.flush()
file.seek(currentReadPos)
I have not run this, but the idea is to modify the file in place by overwriting the lines you want to remove with lines you want to keep. I am not sure how the seeking and modifying interacts with the iterating over the file.
Update:
I have tried fileinput with inplace by creating a 1GB file. What I expected was different from what happened. I read the documentation properly this time.
Optional in-place filtering: if the
keyword argument inplace=1 is passed
to fileinput.input() or to the
FileInput constructor, the file is
moved to a backup file and standard
output is directed to the input file
(if a file of the same name as the
backup file already exists, it will be
replaced silently).
from docs/fileinput
So, this doesn't seem to be an option now for you. Please check other answers.
Before Edit:
If you are looking for editing the file inplace, then check out Python's fileinput module - Docs.
I am really not sure about its efficiency when used with a 10gb file. But, to me, this seemed to be the only option you have using Python.
Just sequentially read and write to the files.
f.readlines() returns a list
containing all the lines of data in
the file. If given an optional
parameter sizehint, it reads that many
bytes from the file and enough more to
complete a line, and returns the lines
from that. This is often used to allow
efficient reading of a large file by
lines, but without having to load the
entire file in memory. Only complete
lines will be returned.
Source
Process the file getting 10/20 or more MB of chunks.
This would be the fastest way.
Other way of doing this is to stream this file and filter it using AWK for example.
example pseudo code:
file = open(rw)
linesCnt=50
newReadOffset=0
tmpWrtOffset=0
rule=1
processFile()
{
while(rule)
{
(lines,newoffset)=getLines(file, newReadOffset)
if lines:
[x for line in lines if line==cool: line]
tmpWrtOffset = writeBackToFile(file, x, tmpWrtOffset) #should return new offset to write for the next time
else:
rule=0
}
}
To resize file at the end use truncate(size=None)
I'm a beginner in python. I have a huge text file (hundreds of GB) and I want to convert the file into csv file. In my text file, I know the row delimiter is a string "<><><><><><><>". If a line contains that string, I want to replace it with ". Is there a way to do it without having to read the old file and rewriting a new file.
Normally I thought I need to do something like this:
fin = open("input", "r")
fout = open("outpout", "w")
line = f.readline
while line != "":
if line.contains("<><><><><><><>"):
fout.writeline("\"")
else:
fout.writeline(line)
line = f.readline
but copying hundreds of GB is wasteful. Also I don't know if open will eat lots of memory (does it treat file handler as a stream?)
Any help is greatly appreciated.
Note: an example of the file would be
file.txt
<><><><><><><>
abcdefeghsduai
asdjliwa
1231214 ""
<><><><><><><>
would be one row and one column in csv.
#richard-levasseur
I agree, sed seems like the right way to go. Here's a rough cut at what the OP describes:
sed -i -e's/<><><><><><><>/"/g' foo.txt
This will do the replacement in-place in the existing foo.txt. For that reason, I recommend having the original file under some sort of version control; any of the DVCS should fit the bill.
Yes, open() treats the file as a stream, as does readline(). It'll only read the next line. If you call read(), however, it'll read everything into memory.
Your example code looks ok at first glance. Almost every solution will require you to copy the file elsewhere. Its not exactly easy to modify the contents of a file inplace without a 1:1 replacement.
It may be faster to use some standard unix utilities (awk and sed most likely), but I lack the unix and bash-fu necessary to provide a full solution.
It's only wasteful if you don't have disk to spare. That is, fix it when it's a problem. Your solution looks ok as a first attempt.
It's not wasteful of memory because a file handler is a stream.
Reading lines is simply done using a file iterator:
for line in fin:
if line.contains("<><><><><><><>"):
fout.writeline("\"")
Also consider the CSV writer object to write CSV files, e.g:
import csv
writer = csv.writer(open("some.csv", "wb"))
writer.writerows(someiterable)
With python you will have to create a new file for safety sake, it will cause alot less headaches than trying to write in place.
The below listed reads your input 1 line at a time and buffers the columns (from what I understood of your test input file was 1 row) and then once the end of row delimiter is hit it will write that buffer to disk, flushing manually every 1000 lines of the original file. This will save some IO as well instead of writing every segment, 1000 writes of 32 bytes each will be faster than 4000 writes of 8 bytes.
fin = open(input_fn, "rb")
fout = open(output_fn, "wb")
row_delim = "<><><><><><><>"
write_buffer = []
for i, line in enumerate(fin):
if not i % 1000:
fout.flush()
if row_delim in line and i:
fout.write('"%s"\r\n'%'","'.join(write_buffer))
write_buffer = []
else:
write_buffer.append(line.strip())
Hope that helps.
EDIT: Forgot to mention, while using .readline() is not a bad thing don't use .readlines() which will go and read the entire content of the file into a list containing each line which is incredibly inefficient. Using the built in iterator that comes with a file object is the best memory usage and speed.
#Constatin suggests that if you would be satisfied with replacing '<><><><><><><>\n' by '" \n'
then the replacement string is the same length, and in that case you can craft a solution to in-place editing with mmap. You will need python 2.6. It's vital that the file is opened in the right mode!
import mmap, os
CHUNK = 2**20
oldStr = ''
newStr = '" '
strLen = len(oldStr)
assert strLen==len(newStr)
f = open("myfilename", "r+")
size = os.fstat(f.fileno()).st_size
for offset in range(0,size,CHUNK):
map = mmap.mmap(f.fileno(),
length=min(CHUNK+strLen,size-offset), # not beyond EOF
offset=offset)
index = 0 # start at beginning
while 1:
index = map.find(oldStr,index) # find next match
if index == -1: # no more matches in this map
break
map[index:index+strLen] = newStr
f.close()
This code is not debugged! It works for me on a 3 MB test case, but it may not work on a large ( > 2GB) file - the mmap module still seems a bit immature, so I wouldn't rely on it too much.
Looking at the bigger picture, from what you've posted it isn't clear that your file will end up as valid CSV. Also be aware that the tool you're planning to use to actually process the CSV may be flexible enough to deal with the file as it stands.
If you're delimiting fields with double quotes, it looks like you need to escape the double quotes you have occurring in your elements (for example 1231214 "" will need to be \n1231214 \"\").
Something like
fin = open("input", "r")
fout = open("output", "w")
for line in fin:
if line.contains("<><><><><><><>"):
fout.writeline("\"")
else:
fout.writeline(line.replace('"',r'\"')
fin.close()
fout.close()
[For the problem exactly as stated] There's no way that this can be done without copying the data, in python or any other language. If your processing always replaced substrings with new substrings of equal length, maybe you could do it in-place. But whenever you replace <><><><><><><> with " you are changing the position of all subsequent characters in the file. Copying from one place to another is the only way to handle this.
EDIT:
Note that the use of sed won't actually save any copying...sed doesn't really edit in-place either. From the GNU sed manual:
-i[SUFFIX]
--in-place[=SUFFIX]
This option specifies that files are to be edited in-place. GNU sed does this by creating a temporary file and sending output to this file rather than to the standard output.
(emphasis mine.)