I'm attempting to identify the blocks that belong to a specific file and write zeros to them. I've found several methods that do this for the free space on a disk, but so far I haven't found any solid suggestions for doing the following:
Identify the blocks for a file.
Write zeros to those blocks.
The purpose of this is for a virtualized system. This system can dedupe blocks that it identifies as identical, which reduces the space used by the guest OSes on the drive.
Currently this is done by using dd to write zeros to the free space on the drive. However, on VMware systems this has the side effect of making the guest OS drive consume the entire disk space it has been allocated, because from that point on the system thinks all of those bytes have been written to.
Writing code that can safely modify even an unmounted filesystem will require significant effort. It is to be avoided unless there is no other option.
You basically have two choices to make modifying the filesystem easy:
Run Python inside the virtual machine (the guest OS).
Mount the virtualized filesystem on the host. Most UNIX-like systems can do that, e.g. with the help of FUSE (which supports a lot of filesystem types) and loop devices.
This way you can use the (guest or host) OS's filesystem code instead of having to roll your own. :-) If you can use one of those options, the code fragment listed below will fill a file with zeroes:
import os

def overwrite(f):
    """Overwrite a file with zeroes.

    Arguments:
    f -- name of the file
    """
    stat = os.stat(f)
    # Open in binary mode so the zero bytes are written verbatim.
    with open(f, 'r+b') as of:
        of.write(b'\0' * stat.st_size)
        of.flush()
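For very large files, building the whole zero buffer in memory may be wasteful. The fragment above is what the answer proposes; the following is only my sketch of a chunked variant (the 1 MiB chunk size is an arbitrary choice):

import os

def overwrite_chunked(path, chunk_size=1024 * 1024):
    """Overwrite a file with zeroes one chunk at a time (sketch)."""
    remaining = os.stat(path).st_size
    zeros = b'\0' * chunk_size
    with open(path, 'r+b') as of:
        while remaining > 0:
            of.write(zeros[:min(chunk_size, remaining)])
            remaining -= chunk_size
        of.flush()
        # Push the zeroes to the disk itself, not just the page cache.
        os.fsync(of.fileno())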
Related
Files are being pushed to my server via FTP. I process them with PHP code in a Drupal module. The OS is Ubuntu and the FTP server is vsftpd.
At regular intervals I will check for new files, process them with SimpleXML and move them to a "Done" folder. How do I avoid processing a partially uploaded file?
vsftpd has lock_upload_files defaulted to yes. I thought of attempting to move the files first, expecting the move to fail for a file that is still uploading. That doesn't seem to happen, at least on the command line. If I start uploading a large file and move it, it just keeps growing in the new location. I guess the directory entry is not locked.
Should I try fopen with mode 'a' or 'r+' just to see if it succeeds before attempting to load into SimpleXML or is there a better way to do this? I guess I could just detect SimpleXML load failing but... that seems messy.
I don't have control of the sender. They won't do an upload and rename.
Thanks
Using the lock_upload_files configuration option of vsftpd locks files with the fcntl() function. This places advisory locks on uploaded files that are still in progress. Other programs are not required to honour advisory locks, and mv, for example, does not. Advisory locks are in general only advice for programs that care about such locks.
You need another command line tool like lockrun which respects advisory locks.
Note: lockrun must be compiled with the WAIT_AND_LOCK(fd) macro using the lockf() function rather than flock(), because only lockf() interacts with the locks that fcntl() sets under Linux. When lockrun is compiled to use lockf(), it will cooperate with the locks set by vsftpd.
With such pieces (lockrun, mv, lock_upload_files) you can build a shell script or similar that moves files one by one, checking whether each file is locked beforehand and holding an advisory lock on it while it is moved. If a file is locked by vsftpd, lockrun can skip the call to mv, so uploads that are still running are left alone.
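If you would rather do the check in code than with lockrun, a rough Python sketch of the same idea follows. It assumes Linux, lock_upload_files enabled, and made-up directory names:

import fcntl
import os
import shutil

INCOMING = '/srv/ftp/incoming'  # hypothetical upload directory
DONE = '/srv/ftp/done'          # hypothetical processing directory

def move_if_not_uploading(name):
    """Move a file unless vsftpd still holds its advisory upload lock.

    lock_upload_files makes vsftpd take an fcntl() write lock, so a
    non-blocking lockf() attempt should fail while the upload is running.
    """
    src = os.path.join(INCOMING, name)
    with open(src, 'r+b') as fh:
        try:
            fcntl.lockf(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            return False  # still being uploaded; try again later
        shutil.move(src, os.path.join(DONE, name))
        return True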
If locking doesn't work, I don't know of a solution as clean/simple as you'd like. You could make an educated guess by not processing files whose last modified time (which you can get with filemtime()) is within the past x minutes.
If you want a higher degree of confidence than that, you could check and store each file's size (using filesize()) in a simple database, and every x minutes check new size against its old size. If the size hasn't changed in x minutes, you can assume nothing more is being sent.
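That answer is phrased in PHP terms (filemtime(), filesize()), but the heuristic itself is language-agnostic. A rough Python sketch of the size-stability check, with a made-up state file and threshold:

import json
import os
import time

STATE_FILE = 'upload_sizes.json'  # hypothetical place to remember sizes
STABLE_AFTER = 5 * 60             # treat a file as complete after 5 quiet minutes

def stable_files(directory):
    """Return files whose size has not changed for STABLE_AFTER seconds."""
    try:
        with open(STATE_FILE) as fh:
            previous = json.load(fh)
    except (FileNotFoundError, ValueError):
        previous = {}
    now = time.time()
    current, stable = {}, []
    for name in os.listdir(directory):
        size = os.path.getsize(os.path.join(directory, name))
        old_size, first_seen = previous.get(name, (None, now))
        if size == old_size and now - first_seen >= STABLE_AFTER:
            stable.append(os.path.join(directory, name))
            continue
        # Keep the old timestamp only if the size is unchanged.
        current[name] = (size, first_seen if size == old_size else now)
    with open(STATE_FILE, 'w') as fh:
        json.dump(current, fh)
    return stable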
The lsof Linux command lists open files on your system. I suggest executing it with shell_exec() from PHP and parsing the output to see which files are still being used by your FTP server.
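If you go that route, the check amounts to asking lsof about one path and looking at its exit status; a small sketch in Python (subprocess standing in for PHP's shell_exec()):

import subprocess

def is_still_open(path):
    """Return True if some process still has `path` open, according to lsof.

    lsof exits with status 0 when it lists at least one open instance of
    the file and non-zero when it finds none.
    """
    result = subprocess.run(['lsof', path],
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0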
Picking up on the previous answer, you could copy the file over and then compare the sizes of the copied file and the original file at a fixed interval.
If the sizes match, the upload is done, delete the copy, work with the file.
If the sizes do not match, copy the file again.
Repeat.
Here's another idea: create a super (but hopefully not root) FTP user that can access some or all of the upload directories. Instead of your PHP code reading uploaded files right off the disk, make it connect to the local FTP server and download the files. This way vsftpd handles the locking for you (assuming you leave lock_upload_files enabled). You'll only be able to download a file once vsftpd releases the exclusive/write lock (once writing is complete).
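A hedged sketch of that approach using Python's ftplib; the account, password, and staging directory are placeholders:

from ftplib import FTP, error_perm

def fetch_uploaded(name, local_dir='/tmp/ftp-staging'):
    """Download an uploaded file through the local FTP server (sketch).

    If vsftpd still holds the upload lock, the RETR should fail and the
    file can simply be retried on the next run.
    """
    local_path = '%s/%s' % (local_dir, name)
    ftp = FTP('localhost')
    try:
        ftp.login('reader', 'secret')  # hypothetical read-only account
        with open(local_path, 'wb') as out:
            ftp.retrbinary('RETR ' + name, out.write)
    except error_perm:
        return None  # not retrievable yet
    finally:
        ftp.quit()
    return local_path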
You mentioned trying flock in your comment (and how it fails). It does indeed seem painful to try to match whatever locking vsftpd is doing, but dio_fcntl might be worth a shot.
I guess you solved your problem years ago, but still: if you use some pattern to find the files you need, you can ask the party uploading the files to use a different name and rename each file once its upload has completed.
You should check the HiddenStores directive in ProFTPD; more info here:
http://www.proftpd.org/docs/directives/linked/config_ref_HiddenStores.html
Suppose I do:
with open("temp.txt", "w" as f):
while True:
f.write(1)
What will happen when I come close to completely using up my disk space? It seems like a problem that might have been asked before, but unfortunately I didn't find anything. Thanks...
In case it matters, I'm on ubuntu.
When the disk is full (or when you exhaust your quota, if the filesystem supports quotas), the write will raise an IOError (an alias of OSError in Python 3). If that exception is not caught in a try block, it will terminate the script.
But bad things could happen. Most tools expect to have enough disk and memory available, and most systems implement multi-tasking. That means that if you exhaust the system disk, various system components could start to malfunction, especially if you are running under an admin account. Long story short: avoid that unless you are experimenting on a dedicated filesystem...
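A minimal sketch of catching the condition, if you really do want to write until the disk fills up (say, inside a throwaway VM):

import errno

try:
    with open("temp.txt", "w") as f:
        while True:
            f.write("1")
except OSError as exc:              # IOError is an alias of OSError in Python 3
    if exc.errno == errno.ENOSPC:   # "No space left on device"
        print("disk is full, stopping")
    else:
        raise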
Is there a limit to how many files can be created from Python code, the way there is a recursion limit? I have this bit of code here,
each = 0
while True:
    with open('filenamewhtever' + str(each) + '.txt', 'a') as file:
        file.write(str(each))
    each += 1
which seems to work just fine, although it quickly filled a lot of space in the folder. Could this, if unchecked, potentially have crashed my PC? Also, shouldn't the interpreter have a failsafe switch to prevent this?
There is typically an operating system defined limit to how many files you can have open at the same time. But because the with statement closes each file after you've written it, you don't run into this limit.
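On POSIX systems you can inspect that per-process limit from Python itself; a small sketch (the resource module is not available on Windows):

import resource

# RLIMIT_NOFILE is the maximum number of file descriptors this process may
# have open at once; the soft limit is enforced, the hard limit is the
# ceiling the soft limit can be raised to.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("open-file limit: soft=%d hard=%d" % (soft, hard))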
There might also be limits imposed by the file system on how many files can be in a single directory. You might see certain operations (like listing the files) become slow, well before you even come close to that limit.
And finally, you are obviously limited by disk space.
I've been working on a project in PHP which requires mmap'ing /dev/mem to gain access to the hardware peripheral registers. As PHP has no native mmap, the simplest way I could think of to achieve this was to spawn a Python subprocess, which communicates with the PHP app via stdin/stdout.
I have run into a strange issue which only occurs while reading addresses, not writing them. The subprocess functions correctly (for reading) with the following:
mem.write(sys.stdin.read(length))
So, I expected that I could conversely write memory segments back to the parent using the following:
sys.stdout.write(mem.read(length))
If I mmap a standard file, both commands work as expected (regardless of the length of the read/write). If I map the /dev/mem "file," I get nonsense back during the read. It's worth noting that the area I'm mapping is outside the physical memory address space and is used to access the peripheral registers.
The work-around I have in place is the following:
for x in range(0, length / 4):
    sys.stdout.write(str(struct.pack('L', struct.unpack_from('L', mem, mem.tell())[0])))
    mem.seek(4, os.SEEK_CUR)
This makes the reads behave as expected.
What I can't understand is why reading from the address using unpack_from should see anything different from reading it directly. The same (non-working) behaviour occurs if I try to simply assign a read to a variable.
In case additional context is helpful, I'm running this on a Raspberry Pi/Debian 8. The file that contains the above issue is here. The project that uses it is here.
I am scanning through a large number of files looking for some markers. I am becoming fairly confident that, once I have run through the code one time, Python is not rereading the actual files from disk. I find this behavior strange, because I was told that one reason I needed to structure my file access in the manner I have was so that the handle and file contents are flushed. But that can't be the case.
There are 9,568 file paths in the list I am reading from. If I shut down Python and reboot my computer it takes roughly 6 minutes to read the files and determine if there is anything returned from the regular expression.
However, if I run the code a second time it takes about 36 seconds. Just for grins, the average document has 53,000 words.
Therefore I am concluding that Python still has access to the file it read in the first iteration.
I want to also observe that the first time I do this I can hear the disk spin (E:\ - Python is on C:). E is just a spinning disk with 126 MB cache - I don't think the cache is big enough to hold the contents of these files. When I do it later I do not hear the disk spin.
Here is the code
import re

test_7A_re = re.compile(r'\n\s*ITEM\s*7\(*a\)*[.]*\s*-*\s*QUANT.*\n', re.IGNORECASE)

no7a = []
for path in path_list:
    path = path.strip()
    with open(path, 'r') as fh:
        string = fh.read()
    items = [item for item in re.finditer(test_7A_re, string)]
    if len(items) == 0:
        no7a.append(path)
        continue
I care about this for a number of reasons, one is that I was thinking about using multi-processing. But if the bottleneck is reading in the files I don't see that I will gain much. I also think this is a problem because I would be worried about the file being modified and not having the most recent version of the file available.
I am tagging this 2.7 because I have no idea whether this behavior persists across versions.
To confirm this behavior I modified my code to run as a .py file, and added some timing code. I then rebooted my computer - the first time it ran it took 5.6 minutes and the second time (without rebooting) the time was 36 seconds. Output is the same in both cases.
The really interesting thing is that even if I shut down IDLE (but do not reboot my computer) it still takes only 36 seconds to run the code.
All of this suggests to me that the files are not read from disk after the first time - this is amazing behavior to me but it seems dangerous.
To be clear, the results are the same in both runs. Given the timing tests I have run and the fact that I do not hear the disk spinning, I believe the files are somehow still accessible to Python without being read from disk.
This is caused by caching in Windows. It is not related to Python.
In order to stop Windows from caching your reads:
Disable the paging file in Windows and fill the RAM up to 90%.
Use a tool to disable file caching in Windows, like this one.
Run your code in a Linux VM on your Windows machine that has limited RAM. In Linux you can control the caching much better (see the sketch below).
Make the files much bigger, so that they won't fit in the cache.
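For the Linux option mentioned above, one way to force genuine disk reads on a per-file basis is to ask the kernel to drop its cached pages before reading; a hedged sketch (Linux-only, Python 3.3+):

import os

def read_uncached(path):
    """Read a file after asking the kernel to drop its cached pages.

    POSIX_FADV_DONTNEED is only a hint, but on Linux it normally evicts the
    file's pages from the page cache, so the next read goes to the disk.
    """
    with open(path, 'rb') as fh:
        os.posix_fadvise(fh.fileno(), 0, 0, os.POSIX_FADV_DONTNEED)
        return fh.read()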
I fail to see why this is a problem. I'm not 100% certain of how Windows handles file cache invalidation, but unless the "Last modified time" changes, you and I and Windows would assume that the file still holds the same content. If the file holds the same content, I don't see why reading from cache can be a problem.
I'm pretty sure that if you change the last modified date, say, by opening the file for write access and then closing it right away, Windows will have enough doubt about the file's content to invalidate the cache.