When does Python write a file to disk? - python

I have a library that interacts with a configuration file. When the library is imported, the initialization code reads the configuration file, possibly updates it, and then writes the updated contents back to the file (even if nothing was changed).
Very occasionally, I encounter a problem where the contents of the configuration file simply disappear. Specifically, this happens when I run many invocations of a short script (using the library), back-to-back, thousands of times. It never occurs in the same directories, which leads me to believe it's a somewhat random problem--specifically a race condition with IO.
This is a pain to debug, since I can never reliably reproduce the problem and it only happens on some systems. I have a suspicion about what might happen, but I wanted to see if my picture of file I/O in Python is correct.
So the question is, when does a Python program actually write file contents to a disk? I thought that the contents would make it to disk by the time that the file is closed, but then I can't explain this error. When Python closes a file, does it flush the contents to the disk itself, or simply queue it up to the filesystem? Is it possible that file contents can be written to disk after Python terminates? And can I avoid this issue by using fp.flush(); os.fsync(fp.fileno()) (where fp is the file handle)?
If it matters, I'm programming on a Unix system (Mac OS X, specifically). Edit: Also, keep in mind that the processes are not running concurrently.
Appendix: Here is the specific race condition that I suspect:
Process #1 is invoked.
Process #1 opens the configuration file in read mode and closes it when finished.
Process #1 opens the configuration file in write mode, erasing all of its contents. The erasing of the contents is synced to the disk.
Process #1 writes the new contents to the file handle and closes it.
Process #1: Upon closing the file, Python tells the OS to queue writing these contents to disk.
Process #1 closes and exits
Process #2 is invoked
Process #2 opens the configuration file in read mode, but new contents aren't synced yet. Process #2 sees an empty file.
The OS finally finishes writing the contents to disk, after process 2 reads the file
Process #2, thinking the file is empty, sets defaults for the configuration file.
Process #2 writes its version of the configuration file to disk, overwriting the last version.

It is almost certainly not Python's fault. If Python closes the file, OR exits cleanly (rather than being killed by a signal), then the OS will have the new contents for the file. Any subsequent open should return the new contents. There must be something more complicated going on. Here are some thoughts.
What you describe sounds more likely to be a filesystem bug than a Python bug, and a filesystem bug is pretty unlikely.
Filesystem bugs are far more likely if your files actually reside in a remote filesystem. Do they?
Do all the processes use the same file? Do "ls -li" on the file to see its inode number, and see if it ever changes. In your scenario, it should not. Is it possible that something is moving files, or moving directories, or deleting directories and recreating them? Are there symlinks involved?
Are you sure that there is no overlap in the running of your programs? Are any of them run from a shell with "&" at the end (i.e. in the background)? That could easily mean that a second one is started before the first one is finished.
Are there any other programs writing to the same file?
This isn't your question, but if you need atomic changes (so that any program running in parallel only sees either the old version or the new one, never the empty file), the way to achieve it is to write the new content to another file (e.g. "foo.tmp"), then do os.rename("foo.tmp", "foo"). Rename is atomic.
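The write-then-rename approach above can be sketched roughly as follows. This is a minimal sketch, not the asker's library code: the helper name atomic_write is made up for illustration, and it uses os.replace rather than os.rename because os.replace also overwrites an existing destination on Windows (on POSIX the two behave the same here). The flush/fsync pair is the one mentioned in the question.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace `path` with `data` so readers see the old or the new file, never an empty one.

    `atomic_write` is a hypothetical helper name used only for this sketch.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as fp:
            fp.write(data)
            fp.flush()              # push Python's buffer to the OS
            os.fsync(fp.fileno())   # ask the OS to push its cache to the disk
        os.replace(tmp_path, path)  # atomic on POSIX when both paths share a filesystem
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Note that tempfile.mkstemp creates the file readable and writable only by the owner, so the replaced file ends up with those permissions unless they are adjusted before the rename.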

Related

How to detect files in a directory if the files have finished copying/adding? [duplicate]

Files are being pushed to my server via FTP. I process them with PHP code in a Drupal module. O/S is Ubuntu and the FTP server is vsftp.
At regular intervals I will check for new files, process them with SimpleXML and move them to a "Done" folder. How do I avoid processing a partially uploaded file?
vsftp has lock_upload_files defaulted to yes. I thought of attempting to move the files first, expecting the move to fail on a currently uploading file. That doesn't seem to happen, at least on the command line. If I start uploading a large file and move, it just keeps growing in the new location. I guess the directory entry is not locked.
Should I try fopen with mode 'a' or 'r+' just to see if it succeeds before attempting to load into SimpleXML or is there a better way to do this? I guess I could just detect SimpleXML load failing but... that seems messy.
I don't have control of the sender. They won't do an upload and rename.
Thanks
Using the lock_upload_files configuration option of vsftpd leads to locking files with the fcntl() function. This places advisory lock(s) on uploaded file(s) which are in progress. Other programs don't need to consider advisory locks, and mv for example does not. Advisory locks are in general just advice for programs that care about such locks.
You need another command line tool like lockrun which respects advisory locks.
Note: lockrun must be compiled with the WAIT_AND_LOCK(fd) macro to use the lockf() and not the flock() function in order to work with locks that are set by fcntl() under Linux. So when lockrun is compiled with using lockf() then it will cooperate with the locks set by vsftpd.
With such features (lockrun, mv, lock_upload_files) you can build a shell script or similar that moves files one by one, checking if the file is locked beforehand and holding an advisory lock on it as long as the file is moved. If the file is locked by vsftpd then lockrun can skip the call to mv so that running uploads are skipped.
If locking doesn't work, I don't know of a solution as clean/simple as you'd like. You could make an educated guess by not processing files whose last modified time (which you can get with filemtime()) is within the past x minutes.
If you want a higher degree of confidence than that, you could check and store each file's size (using filesize()) in a simple database, and every x minutes check new size against its old size. If the size hasn't changed in x minutes, you can assume nothing more is being sent.
The lsof linux command lists opened files on your system. I suggest executing it with shell_exec() from PHP and parsing the output to see what files are still being used by your FTP server.
Picking up on the previous answer, you could copy the file over and then compare the sizes of the copied file and the original file at a fixed interval.
If the sizes match, the upload is done, delete the copy, work with the file.
If the sizes do not match, copy the file again.
repeat.
Here's another idea: create a super (but hopefully not root) FTP user that can access some or all of the upload directories. Instead of your PHP code reading uploaded files right off the disk, make it connect to the local FTP server and download files. This way vsftpd handles the locking for you (assuming you leave lock_upload_files enabled). You'll only be able to download a file once vsftp releases the exclusive/write lock (once writing is complete).
You mentioned trying flock in your comment (and how it fails). It does indeed seem painful to try to match whatever locking vsftpd is doing, but dio_fcntl might be worth a shot.
I guess you've solved your problem years ago but still.
If you use some pattern to find the files you need you can ask the party uploading the file to use different name and rename the file once the upload has completed.
You should check the Hidden Stores in proftp, more info here:
http://www.proftpd.org/docs/directives/linked/config_ref_HiddenStores.html

Pickup directories: How not to pickup files that are still being written?

I have a Python script that checks on a pickup directory and processes any files that it finds, and then deletes them.
How can I make sure not to pickup a file that is still being written by the process that drops files in that directory?
My test case is pretty simple. I copy-paste 300MB of files into the pickup directory, and frequently the script will grab a file that's still being written. It operates on only the partial file, then deletes it. This fires off a file operation error in the OS as the file it was writing to disappeared.
I've tried acquiring a lock on the file (using the FileLock module) before I open/process/delete it. But that hasn't helped.
I've considered checking the modification time on the file to avoid anything within X seconds of now. But that seems clunky.
My test is on OSX, but I'm trying to find a solution that will work across the major platforms.
I see a similar question here (How to check if a file is still being written?), but there was no clear solution.
Thank you
As a workaround, you could listen to file modified events (watchdog is cross-platform). The modified event (on OS X at least) isn't fired for each write, it's only fired on close. So when you detect a modified event you can assume all writes are complete.
Of course, if the file is being written in chunks, and being saved after each chunk this won't work.
One solution to this problem would be to change the program writing the files to write the files to a temporary file first, and then move that temporary file to the destination when it is done. On most operating systems, when the source and destination are on the same file system, move is atomic.
If you have no control over the writing portion, about all you can do is watch the file yourself, and when it stops growing for a certain amount of time, call it good. I have to use that method myself, and found 40 seconds is safe for my conditions.
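The "watch it until it stops growing" approach could look something like the sketch below. The function name wait_until_stable and its parameters are hypothetical; the 40-second default only mirrors the figure quoted above and would need tuning for other conditions. As with any such heuristic, a writer that stalls for longer than the quiet period will still fool it.

```python
import os
import time


def wait_until_stable(path, quiet_seconds=40.0, poll_interval=1.0):
    """Block until the size of `path` has not changed for `quiet_seconds`.

    Returns the final size. Name and defaults are assumptions for this sketch.
    """
    last_size = os.path.getsize(path)
    last_change = time.monotonic()
    while time.monotonic() - last_change < quiet_seconds:
        time.sleep(poll_interval)
        size = os.path.getsize(path)
        if size != last_size:
            # The file grew (or shrank): restart the quiet period.
            last_size = size
            last_change = time.monotonic()
    return last_size
```

A pickup script would call this on each candidate file before processing it, at the cost of delaying every file by at least the quiet period.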
Each OS will have a different solution, because file locking mechanisms are not portable.
On Windows, you can use OS locking.
On Linux you can have a peek at open files (similarly to how lsof does) and if the file is open, leave it.
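Have you tried opening the file before copying it? If the file is still in use, then open() should raise an exception (this can be expected on Windows when the writer holds the file exclusively; on Linux and OS X, open() normally succeeds even while another process is writing).
import shutil

try:
    # Opening the file is only a probe: it fails if the OS refuses access.
    with open(filename, "rb") as fp:
        pass
    # Copy the file (destination is a placeholder for the target path)
    shutil.copy(filename, destination)
except IOError:
    # Don't copy; the file is probably still in use
    pass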

Effectively reading a large, active Python log file

When my Python script is writing a large amount of logs to a text file line by line using the Python built-in logging library, in my Delphi-powered Windows program I want to effectively read all newly added logs (lines).
When the Python script is logging to the file, my Windows program will keep a read-only file handle to that log file;
I'll use the Windows API to get informed when the log file is changed; once the file is changed, it'll read the newly appended lines.
I'm new to Python, do you see any possible problem with this approach? Does the Python logging lib lock the entire log? Thanks!
It depends on the logging handler you use, of course, but as you can see from the source code, logging.FileHandler does not currently create any file locks. By default, it opens files in 'a' (append) mode, so as long as your Windows calls can handle that, you should be fine.
As ʇsәɹoɈ commented, the standard FileHandler logger does not lock the file, so it should work. However, if for some reason you cannot keep your handle on the file, then I'd recommend having your other app open the file periodically, record the position it has read to, and then seek back to that point later. I know the Linux DenyHosts program uses this approach when dealing with log files that it has to monitor for a long period of time. In those situations, simply holding the file open isn't feasible, since directories may move, the file may get rotated out, etc. It does complicate things, though, in that you then have to store filename + read position in persistent state somewhere.
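The "record the position and seek back" idea can be illustrated with a small sketch. It is written in Python only to show the technique (the reader in the question is a Delphi program), and the function name read_new_lines is made up. It assumes UTF-8 logs and treats a file that has become shorter than the stored offset as rotated or truncated.

```python
import os


def read_new_lines(path, offset):
    """Return (complete lines appended since `offset`, new offset).

    Hypothetical helper: the caller stores the returned offset between calls.
    """
    # If the file shrank, it was probably rotated or truncated: start over.
    if os.path.getsize(path) < offset:
        offset = 0
    with open(path, "rb") as fp:
        fp.seek(offset)
        data = fp.read()
    # Only consume complete lines; a partial last line is left for the next call.
    end = data.rfind(b"\n") + 1
    lines = data[:end].decode("utf-8", errors="replace").splitlines()
    return lines, offset + end
```

Because the file is opened and closed on every call, the reader never holds a handle across a rotation; the price is that the offset has to be persisted somewhere if the reader itself restarts.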

File copy completion?

In Linux, how can we know if a file has completed copying before reading it? In Windows, an OSError is raised.
You can use the inotify mechanisms (via pyinotify) to catch events like CREATE, WRITE, CLOSE and based on them you can assume whether the copy has finished or not.
However, since you provided no details on what you are trying to do, I can't tell if inotify would be suitable for you (btw, inotify is Linux specific so you can't use it on Windows or other platforms)
In Linux, you can open a file while another process is writing to it without Python throwing an OSError, so in general, you cannot know for sure whether the other side has finished writing into that file. You can try some hacks, though:
You can check the file size regularly to see whether it increased since the last check. If it hasn't increased in, say, five seconds, you might be safe to assume that the copy has finished. I'm saying might since this is not true in all circumstances. If the other process that is writing the file is blocked for whatever reason, it might temporarily stop writing to the file and resume it later. So this is not 100% fool-proof, but might work for local file copies if the system is never under a heavy load that would stall the writing process.
You can check the output of fuser (this is a shell command), which will list the IDs of all the processes that are holding a file handle to a given file name. If this list includes any process other than yours, you can assume that the copying process hasn't finished yet. However, you will have to make sure that fuser is installed on the target system in order to make it work.
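A rough sketch of the fuser check from Python might look like this. The helper name other_pids_using is made up, and the code assumes that fuser prints the process IDs on standard output (with any other decoration going to standard error), so anything that is not a plain number is ignored. It returns None when fuser is not installed, so the caller can fall back to the size check.

```python
import os
import shutil
import subprocess


def other_pids_using(path):
    """Return the set of PIDs (other than ours) that fuser reports for `path`.

    Returns None if fuser is not available. Hypothetical helper for this sketch.
    """
    if shutil.which("fuser") is None:
        return None
    proc = subprocess.run(["fuser", path], capture_output=True, text=True)
    # Keep only plain numeric tokens; fuser exits non-zero when nobody has the file open.
    pids = {int(token) for token in proc.stdout.split() if token.isdigit()}
    pids.discard(os.getpid())
    return pids
```

An empty set suggests that no other visible process has the file open. Without elevated privileges, fuser may not be able to see processes owned by other users, so an empty result is a hint rather than a guarantee.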

which inotify event signals the completion of a large file operation?

for large files or slow connections, copying files may take some time.
using pyinotify, i have been watching for the IN_CREATE event code. but this seems to occur at the start of a file transfer. i need to know when a file is completely copied - it aint much use if it's only half there.
when a file transfer is finished and completed, what inotify event is fired?
IN_CLOSE probably means the write is complete. This isn't for sure since some applications are bad actors and open and close files constantly while working with them, but if you know the app you're dealing with (file transfer, etc.) and understand its behaviour, you're probably fine. (Note, this doesn't mean the transfer completed successfully, obviously, it just means that the process that opened the file handle closed it).
IN_CLOSE catches both IN_CLOSE_WRITE and IN_CLOSE_NOWRITE, so make your own decisions about whether you want to just catch one of those. (You probably want them both - WRITE/NOWRITE reflect whether the file was opened for writing, not whether any writes were actually made).
There is more documentation (although annoyingly, not this piece of information) in Documentation/filesystems/inotify.txt.
For my case I wanted to execute a script after a file was fully uploaded. I was using WinSCP which writes large files with a .filepart extension till done.
I first started modifying my script to ignore files if they're themselves ending with .filepart or if there's another file existing in the same directory with the same name but .filepart extension, hence that means the upload is not fully completed yet.
But then I noticed that at the end of the upload, when all the parts have been finished, an IN_MOVED_TO notification gets triggered, which helped me run my script exactly when I wanted it.
If you want to know how your file uploader behaves, add this to the incrontab:
/your/directory/ IN_ALL_EVENTS echo "$$ $@ $# $% $&"
and then
tail -F /var/log/cron
and monitor all the events getting triggered to find out which one suits you best.
Good luck!
Why don't you add a dummy file at the end of the transfer? You can use the IN_CLOSE or IN_CREATE event code on the dummy. The important thing is that the dummy has to be transferred as the last file in the sequence.
I hope it'll help.
