When appending lines to a remote file via SFTP with pysftp:
import pysftp

with pysftp.Connection('192.168.0.2', username='root', password='') as sftp:
    with sftp.cd('/home/www/test'):
        with sftp.open('test.txt', 'a+') as f:
            for i in range(100):
                s = (("%04d" % i).encode() * 10000) + b'\n'  # 40'001 bytes
                f.write(s)
If I terminate the process in the middle of the operation, sometimes (if I'm lucky) the whole line s is written to the remote file.
In other cases, the last line is truncated at the point where the process was interrupted.
Is there a way to make the SFTP f.write(s) operation atomic? I.e. either it fails partway through and nothing is written, or it succeeds and the full 40'001-byte line is written?
I don't believe this is possible. First of all, in order for it to be possible at all, the remote system's write(2) syscall would have to guarantee that, and POSIX does not require that behavior. There are many reasons a write may be non-atomic, such as if the remote disk is full and you can only write part of the data to disk, or if the remote user has a quota and your full write would exceed that.
Additionally, you're trying to write over 40 kB over a network connection, and it's likely that doesn't fit in one packet. Consequently, it wouldn't make sense for any network software to write a packet that large.
If it's important to you to write a file completely or not at all, you can write to another file on the same disk and then rename it over the original file. This is how programs like Git guarantee atomic file updates. I believe for SFTP that requires that both sides support the posix-rename@openssh.com extension; OpenSSH does, but I don't know if pysftp exposes it, so you'd need to consult the documentation.
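As a hedged sketch of that write-then-rename approach, here is roughly what it could look like with paramiko's SFTPClient directly (posix_rename() is available in paramiko 2.2+; the host, credentials and paths below come from the question or are placeholders):

import paramiko

# Sketch only: replace a remote file atomically by writing a temp file and renaming it over the target.
transport = paramiko.Transport(('192.168.0.2', 22))
transport.connect(username='root', password='')
sftp = paramiko.SFTPClient.from_transport(transport)

data = (b'0000' * 10000) + b'\n'
tmp_path = '/home/www/test/test.txt.tmp'
final_path = '/home/www/test/test.txt'

with sftp.open(tmp_path, 'wb') as f:
    f.write(data)

# posix_rename() uses the posix-rename@openssh.com extension and overwrites the target;
# if the server doesn't support it, this raises and you can fall back to remove() + rename().
sftp.posix_rename(tmp_path, final_path)

sftp.close()
transport.close()

Note this replaces the whole file rather than appending a line, which is the price of the all-or-nothing guarantee.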
Related
Files are being pushed to my server via FTP. I process them with PHP code in a Drupal module. O/S is Ubuntu and the FTP server is vsftp.
At regular intervals I will check for new files, process them with SimpleXML and move them to a "Done" folder. How do I avoid processing a partially uploaded file?
vsftp has lock_upload_files defaulted to yes. I thought of attempting to move the files first, expecting the move to fail on a currently uploading file. That doesn't seem to happen, at least on the command line. If I start uploading a large file and move, it just keeps growing in the new location. I guess the directory entry is not locked.
Should I try fopen with mode 'a' or 'r+' just to see if it succeeds before attempting to load into SimpleXML or is there a better way to do this? I guess I could just detect SimpleXML load failing but... that seems messy.
I don't have control of the sender. They won't do an upload and rename.
Thanks
Using the lock_upload_files configuration option of vsftpd leads to locking files with the fcntl() function. This places advisory locks on uploaded files that are in progress. Other programs are not required to honour advisory locks, and mv, for example, does not. Advisory locks are in general just advice for programs that care about such locks.
You need another command line tool like lockrun which respects advisory locks.
Note: lockrun must be compiled with the WAIT_AND_LOCK(fd) macro using the lockf() function rather than flock(), in order to work with locks that are set by fcntl() under Linux. So when lockrun is compiled to use lockf(), it will cooperate with the locks set by vsftpd.
With such features (lockrun, mv, lock_upload_files) you can build a shell script or similar that moves files one by one, checking if the file is locked beforehand and holding an advisory lock on it as long as the file is moved. If the file is locked by vsftpd then lockrun can skip the call to mv so that running uploads are skipped.
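If you'd rather do that check from Python than from a compiled lockrun, a minimal sketch of the same idea, assuming vsftpd holds fcntl()-style advisory locks while uploading (paths are placeholders):

import fcntl
import os
import shutil

incoming = '/srv/ftp/incoming/report.xml'   # placeholder paths
done_dir = '/srv/ftp/done'

fd = os.open(incoming, os.O_RDWR)
try:
    # Non-blocking attempt to take the same kind of advisory lock vsftpd holds;
    # this fails with EAGAIN/EACCES while the upload is still in progress.
    fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
except OSError:
    print('still uploading, skipping')
else:
    shutil.move(incoming, os.path.join(done_dir, os.path.basename(incoming)))
finally:
    os.close(fd)    # closing the descriptor also releases the lock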
If locking doesn't work, I don't know of a solution as clean/simple as you'd like. You could make an educated guess by not processing files whose last modified time (which you can get with filemtime()) is within the past x minutes.
If you want a higher degree of confidence than that, you could check and store each file's size (using filesize()) in a simple database, and every x minutes check new size against its old size. If the size hasn't changed in x minutes, you can assume nothing more is being sent.
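A rough Python sketch of that size-stability heuristic (the answer mentions PHP's filesize(), but the idea is identical; the path and interval are placeholders):

import os
import time

path = '/srv/ftp/incoming/report.xml'   # placeholder
interval = 300                          # "x minutes", in seconds

size_before = os.path.getsize(path)
time.sleep(interval)
if os.path.getsize(path) == size_before:
    print('size stable, assuming the upload is finished')
else:
    print('still growing, check again later')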
The lsof Linux command lists open files on your system. I suggest executing it with shell_exec() from PHP and parsing the output to see which files are still being used by your FTP server.
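A rough sketch of the same check in Python (the answer suggests PHP's shell_exec(), but the idea is identical; lsof exits non-zero when no process has the file open):

import subprocess

def is_open(path):
    # lsof exits 0 if some process has `path` open, non-zero otherwise;
    # parsing its output for the owning process is left out of this sketch.
    result = subprocess.run(['lsof', path],
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0

if not is_open('/srv/ftp/incoming/report.xml'):   # placeholder path
    print('safe to process')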
Picking up on the previous answer, you could copy the file over and then compare the sizes of the copied file and the original file at a fixed interval.
If the sizes match, the upload is done; delete the copy and work with the file.
If the sizes do not match, copy the file again.
Repeat.
Here's another idea: create a super (but hopefully not root) FTP user that can access some or all of the upload directories. Instead of your PHP code reading uploaded files right off the disk, make it connect to the local FTP server and download files. This way vsftpd handles the locking for you (assuming you leave lock_upload_files enabled). You'll only be able to download a file once vsftp releases the exclusive/write lock (once writing is complete).
You mentioned trying flock in your comment (and how it fails). It does indeed seem painful to try to match whatever locking vsftpd is doing, but dio_fcntl might be worth a shot.
I guess you've solved your problem years ago but still.
If you use some pattern to find the files you need, you can ask the party uploading the files to use a different name and rename each file once its upload has completed.
You should check out HiddenStores in proftpd; more info here:
http://www.proftpd.org/docs/directives/linked/config_ref_HiddenStores.html
I'm working on a Python project that requires some file transferring. One side of the connection is highly available (RHEL 6) and always online. The other side (Windows 7) goes on and off, and the connection period is not guaranteed. Files are transferred in both directions, with sizes between 10 MB and 2 GB.
Is it possible to resume a file transfer with paramiko instead of transferring the entire file from the beginning?
I would like to use rsync, but one side is Windows and I would like to avoid cwRsync and DeltaCopy.
Paramiko doesn't offer an out-of-the-box 'resume' function. However, Syncrify, DeltaCopy's big successor, has retry built in, and if the backup goes down the server waits up to six hours for a reconnect. Pretty trusty, easy to use, and it diffs the data by default.
paramiko.sftp_client.SFTPClient provides an open function, which works much like Python's built-in open function.
You can use this to open both a local and remote file, and manually transfer data from one to the other, all the while recording how much data has been transferred. When the connection is interrupted, you should be able to pick up right where you left off (assuming that neither file has been changed by a 3rd party) by using the seek method.
Keep in mind that a naive implementation of this is likely to be slower than paramiko's get and put functions.
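A hedged sketch of that resume-by-seeking approach for the download direction (host, credentials and paths are placeholders; the same idea works in reverse for uploads):

import os
import paramiko

transport = paramiko.Transport(('host.example.com', 22))   # placeholder host
transport.connect(username='user', password='secret')
sftp = paramiko.SFTPClient.from_transport(transport)

remote_path = '/data/big.bin'   # placeholder paths
local_path = 'big.bin'

remote_size = sftp.stat(remote_path).st_size
local_size = os.path.getsize(local_path) if os.path.exists(local_path) else 0

with sftp.open(remote_path, 'rb') as rf, open(local_path, 'ab') as lf:
    rf.seek(local_size)          # skip the bytes we already have locally
    rf.prefetch()                # optional: pipeline the reads for speed
    while local_size < remote_size:
        chunk = rf.read(32768)
        if not chunk:
            break
        lf.write(chunk)
        local_size += len(chunk)

sftp.close()
transport.close()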
The following code essentially cats a file with select.select():
import os
import select
import time

f = open('node.py')
fd = f.fileno()

while True:
    r, w, e = select.select([fd], [], [])
    print('>', repr(os.read(fd, 10)))
    time.sleep(1)
When I try a similar thing with epoll I get an error:
self._impl.register(fd, events | self.ERROR)
IOError: [Errno 1] Operation not permitted
I've also read that epoll does not support disk files -- or perhaps that it doesn't make sense.
Epoll on regular files
But why does select() support disk files then? I looked at the implementation in selectmodule.c and it seems to just be going through to the operating system, i.e. Python is not adding any special support.
On a higher level I'm experimenting with the best way to serve static files in a nonblocking server. I guess I will try creating I/O threads that read from disk and feed data to the main event loop thread that writes to sockets.
select allows file descriptors pointing to regular files to be monitored; however, it will always report a regular file as readable/writable (i.e. it's somewhat useless here, as it doesn't tell you whether a read/write would actually block).
epoll simply disallows monitoring of regular files, as it has no mechanism (on Linux at least) to tell whether reading/writing a regular file would block.
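A small sketch that reproduces both behaviours on Linux (the file path is a placeholder; older Pythons raise IOError instead of PermissionError):

import errno
import select
import socket

ep = select.epoll()

# A socket can be registered without complaint...
srv = socket.socket()
ep.register(srv.fileno(), select.EPOLLIN)

# ...but a regular file is rejected with EPERM, which is exactly the error shown above.
f = open('/etc/hostname', 'rb')
try:
    ep.register(f.fileno(), select.EPOLLIN)
except PermissionError as exc:
    assert exc.errno == errno.EPERM

# select(), by contrast, accepts the file and immediately reports it readable.
r, w, e = select.select([f.fileno()], [], [], 0)
assert f.fileno() in r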
Is there a way to abort a python write operation in such a way that the OS doesn't feel it's necessary to flush the unwritten data to the disc?
I'm writing data to a USB device, typically many megabytes. I'm using 4096 bytes as my block size on the write, but it appears that Linux caches up a bunch of data early on and writes it out to the USB device slowly. If at some point during the write my user decides to cancel, I want the app to just stop writing immediately. I can see that there's a delay between when the data stops flowing from the application and when the USB activity light stops blinking: several seconds, up to about 10 seconds typically. I find that the app is hanging in the close() method, I assume waiting for the OS to finish writing the buffered data. I call flush() after every write, but that doesn't appear to have any impact on the delay. I've scoured the Python docs for an answer but have found nothing.
It's somewhat filesystem dependent, but in some filesystems, if you delete a file before (all of) it is allocated, the IO to write the blocks will never happen. This might also be true if you truncate it so that the part which is still being written is chopped off.
Not sure that you can really abort a write if you want to still access the data. Also the kinds of filesystems that support this (e.g. xfs, ext4) are not normally used on USB sticks.
If you want to flush data to the disc, use fdatasync(). Merely flushing your IO library's buffer into the OS one will not achieve any physical flushing.
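For reference, the distinction looks roughly like this in Python (the filename is a placeholder):

import os

with open('data.bin', 'wb') as f:
    f.write(b'\x00' * 4096)
    f.flush()                  # pushes Python's userspace buffer into the kernel page cache
    os.fdatasync(f.fileno())   # asks the kernel to actually write the data out to the device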
Assuming I'm understanding this correctly, you want to be able to 'abort' and NOT flush the data. This IS possible using ctypes and a little pokery. This is very OS-dependent, so I'll give you the OS X version and then what you can do to change it for Linux:
import ctypes

f = open('flibble1.txt', 'w')
f.write("hello world")

# Close the underlying file descriptor behind Python's back via libc's close(),
# so nothing buffered for this file ever gets flushed.
x = ctypes.cdll.LoadLibrary('/usr/lib/libc.dylib')
x.close(f.fileno())

try:
    del f
except IOError:
    pass
If you change /usr/lib/libc.dylib to the libc.so.6 in /usr/lib for Linux, then you should be good to go. Basically, by calling close() instead of fclose(), no call to fsync() is done and nothing is flushed.
Hope that's useful.
When you abort the write operation, try doing file.truncate(0) before closing it.
We have several cron jobs that ftp proxy logs to a centralized server. These files can be rather large and take some time to transfer. Part of the requirement of this project is to provide a logging mechanism in which we log the success or failure of these transfers. This is simple enough.
My question is: is there a way to check if a file is currently being written to? My first solution was to just check the file size twice within a given timeframe and compare. But a co-worker said that there might be a way to hook into the ext3 file system via Python and check the attributes to see if the file is currently being appended to. My Google-fu came up empty.
Is there a module for ext3, or something else, that would allow me to check the state of a file? The server is running Fedora Core 9 with an ext3 file system.
No need for ext3-specific hooks; just check lsof, or more exactly /proc/<pid>/fd/* and /proc/<pid>/fdinfo/* (that's where lsof gets its info, AFAICT). There you can check if the file is open, if it's writable, and the 'cursor' position.
That's not the whole picture, but anything more happens in userspace inside the writing process: its standard library buffers most writes and the kernel only sees bigger chunks of data, so an 'ext3-aware' monitor wouldn't see that either.
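A rough Python sketch of the /proc approach (Linux only; reading other processes' fd entries typically requires running as root or as the owning user):

import os

def file_is_open(path):
    # Return True if some process on this machine has `path` open.
    target = os.path.realpath(path)
    for pid in filter(str.isdigit, os.listdir('/proc')):
        fd_dir = os.path.join('/proc', pid, 'fd')
        try:
            fds = os.listdir(fd_dir)
        except OSError:            # process exited or permission denied
            continue
        for fd in fds:
            try:
                if os.readlink(os.path.join(fd_dir, fd)) == target:
                    return True
            except OSError:
                continue
    return False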
There are no ext3 hooks to check what you want directly.
I suppose you could dig through the source code of the fuser Linux command, replicate the part that finds which processes have a file open, and watch that resource. When no one has the file open any longer, it's done transferring.
Another approach:
Have your cron jobs signal that they're finished.
We have our file-transporting cron jobs write an empty filename.finished marker after filename has been transferred. Another approach is to transfer to a temporary name, e.g. filename.part, and then rename it to filename; renaming is atomic. In both cases you poll repeatedly for the presence of filename or filename.finished.
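A minimal Python sketch of the marker-file convention described above (directory names are placeholders):

import os
import shutil

INCOMING = '/var/ftp/incoming'   # placeholder directories
DONE = '/var/ftp/done'

for name in os.listdir(INCOMING):
    if name.endswith('.finished'):
        continue
    marker = os.path.join(INCOMING, name + '.finished')
    if os.path.exists(marker):
        # The sender has signalled completion, so the file is safe to process/move.
        shutil.move(os.path.join(INCOMING, name), os.path.join(DONE, name))
        os.remove(marker)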