I have previously written a Python script that monitors a Windows directory and uploads any new files to a remote server offsite. The intent is to run it at all times so that users can dump their files there to sync with the cloud directory.
When an added file is large enough that it is not transferred to the local drive all at once, Watchdog "sees" it while it is only partially copied and tries to upload the partial file, which fails. How can I ensure that these files are "complete" before they are uploaded? Again, I am on Windows, and cannot use anything but Windows to complete this task, or I would have used inotify. Is it even possible to check the "state" of a file in this way on Windows?
It looks like there is no easy way to do this. I think you can put something in place that checks the stats on the directory when the event triggers and only acts once the folder size hasn't changed for a given amount of time:
https://github.com/gorakhargosh/watchdog/issues/184
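For example, a rough sketch of that idea in Python (the 15-second interval and the upload() call are placeholders, not part of watchdog itself):
# Poll the file's size until it stops changing before handing it to the
# existing upload code. Interval and retry count are arbitrary.
import os
import time

def wait_until_stable(path, interval=15, checks=40):
    """Return True once the file's size is unchanged between two polls."""
    last_size = -1
    for _ in range(checks):
        try:
            size = os.path.getsize(path)
        except OSError:
            size = -1          # the file may be locked or not fully visible yet
        if size >= 0 and size == last_size:
            return True
        last_size = size
        time.sleep(interval)
    return False

# Inside a watchdog event handler it could be used like:
# def on_created(self, event):
#     if wait_until_stable(event.src_path):
#         upload(event.src_path)   # upload() is your existing function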
As a side note, I would check out Apache NiFi. I have used it with a lot of success, and it was pretty easy to get up and running:
https://nifi.apache.org/
Related
I need to use Python to configure a file's access rights (I don't know exactly what this is called; I attached a screenshot). I have an application installer that creates a folder on the system, and this folder needs to be protected from the user, i.e. made so that they cannot edit, view, or delete it. I've read a lot of articles and tried a lot of options, but so far to no avail. os.chmod was mentioned a lot, but all I got from it was a "read-only" file. How can I prevent the user from interacting with the folder and its files in a way they cannot undo, while still being able to interact with the files from Python scripts?
Our product has a file that was not properly versioned and was deleted from the server (FTP crash). Thing is, the cloud processes are still running, and I can actually submit Python jobs to them (we have a process management framework).
Is there any way to get the code from an in-memory module? If so, I can run that code and recover the file.
Is this even possible?
I'm writing a Python script which copies files from a server, performs a few operations on them, and deletes the files locally after processing.
The script is not supposed to modify the files on the server in any way.
However, since bugs may occur, I would like to make sure that I'm not modifying/deleting the original server files.
Is there a way to prevent a Python script from having write permissions on a specific folder? I work on Windows.
That is unrelated to Python; it comes down to the filesystem security provided by the OS. The key point is that permissions are not given to programs but to the user under which they run.
Windows provides the runas command, which lets you run a command (whatever language it uses) under a different user. There is even a /savecred option that lets you skip entering the password on each invocation and instead save it in the current user's profile.
So if you set up a dedicated user to run the script, give it only read permissions on the server folder, and run the script under that user, then even a bug in the script cannot tamper with that folder.
By the way, if the script is run as a scheduled task, you can directly specify which user should be used and give its password at configuration time.
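For example, a rough sketch of a wrapper that launches the script that way; the account name and paths are made up, and even with /savecred, runas will still prompt for the password the first time:
# Launch the copy script under a dedicated, read-only account via runas.
# "MYPC\syncuser" and the script path are placeholders for your own setup.
import subprocess

cmd = r'runas /savecred /user:MYPC\syncuser "py C:\scripts\copy_files.py"'
subprocess.run(cmd, shell=True, check=True)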
I'm trying to write a script to pick up video files (ranging from several MB to several GB) that are written to a shared folder on a Windows server.
Ideally, the script will run on a Linux machine watching the Windows shared folder at an interval of something like every 15-120 seconds, and upload any files that have fully finished writing to the shared folder to an FTP site.
I haven't been able to determine any criteria that let me know for certain whether a file has been fully written to the share. It seems like Windows reserves a spot on the share for the entire size of the file (so the file size does not grow incrementally), and the modified date seems to be the time the file started writing, but it is not updated as the file continues to grow. lsof and fuser do not seem to be aware of the file, and even the Samba tools don't seem to indicate it's locked, but I'm not sure if that's because I haven't mounted with the correct options. I've tried things like opening the file or renaming it, and the best I've been able to come up with is a "Text file busy" error code, but this seems to cause major delays in the file copy. Naively uploading the file without checking whether it has finished copying not only does not throw any kind of error, but actually seems to upload null or random bytes from the allocated space to the FTP site, resulting in a totally corrupt file (if the network writing process is slower than the FTP).
I have zero control over the writing process. It will take place on dozens of machines and consist pretty much exclusively of Windows OS file copies to a network share.
I can control the share options on the Windows server, and I have full control over the Linux box. Is there some method of checking locks on a Windows CIFS share that would allow me to be sure that the file has completely finished writing before I try to upload it via FTP? Or is the only possible solution to have the Linux server locally own the share?
Edit
The tl;dr: I'm really looking for the equivalent of something like lsof that works for a CIFS-mounted share. I don't care how low-level it is, though it would be ideal if it were something I could call from Python. I can't move the share or rename the files before they arrive.
I had this problem before. I'm not sure my way is the best way, and it's most definitely a hacky fix, but I used a sleep interval and a file-size check (I would expect the file to have grown if it was being written to...).
In my case I wanted to know not only that the file was not being written to, but also that the Windows share as a whole was not being written to...
My code is:
while [ "$(ls -la "$REMOTE_CSV_DIR"; sleep 15)" != "$(ls -la "$REMOTE_CSV_DIR")" ]; do
echo "File writing seems to be ocuring, waiting for files to finish copying..."
done
(ls -la includes file sizes in bytes...)
What about this:
Change the Windows share to point to an actual Linux directory reserved for the purpose. Then, with simple Linux scripts, you can readily determine whether any files there still have writers. Once a file has no writers, copy it to the Windows folder, if that is where it needs to be.
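A rough sketch of what that check could look like, assuming the share is backed by a local directory such as /srv/incoming (both paths are placeholders):
# Move a file onward only when no process has it open, using fuser.
import pathlib
import shutil
import subprocess

INCOMING = pathlib.Path("/srv/incoming")   # directory the Windows share points at
READY = pathlib.Path("/srv/ready")         # wherever finished files should go

def has_open_handles(path):
    # fuser exits with status 0 if some process has the file open
    return subprocess.run(["fuser", "-s", str(path)]).returncode == 0

for f in INCOMING.iterdir():
    if f.is_file() and not has_open_handles(f):
        shutil.move(str(f), str(READY / f.name))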
I was wondering whether there is a best practice for checking if an upload to your FTP server was successful.
The system I'm working with has an upload directory which contains subdirectories for every user where the files are uploaded.
Files in these directories are only temporary, they're disposed of once handled.
The system loops through each of these subdirectories and the new files in them, and for each file it checks whether the file has been modified in the last 10 seconds. If it hasn't been modified for 10 seconds, the system assumes the file was uploaded successfully.
I don't like the way the system currently handles these situations, because it will try to handle the file and fail if the upload was incomplete, instead of waiting and allowing the user to resume the upload until it's complete.
That might be fine for small files which don't take a lot of time to upload, but if the file is big I'd like to be able to resume the upload.
I also don't like looping over the directories and files; the system idles at high CPU usage, so I've implemented pyinotify to trigger an action when a file is written. I haven't really looked at its source code, but I can only assume it is more efficient than the current implementation (which does more than I've described).
However, I still need to check whether the file was successfully uploaded.
I know I can parse the xferlog to get all complete uploads. Like:
awk '($12 ~ /^i$/ && $NF ~ /^c$/){print $9}' /var/log/proftpd/xferlog
This would make pyinotify unnecessary since I can get the path for complete and incomplete uploads if I only tail the log.
So my solution would be to check the xferlog in my run-loop and only handle complete files (a rough sketch of what I mean is at the end of this question).
Unless there's a best practice or simply a better way to do this?
What would the disadvantages be with this method?
I run my app on a Debian server, and proftpd is installed on the same server. Also, I have no control over the clients sending the files.
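For reference, the check I have in mind would look roughly like this (field positions assumed to match the awk one-liner above; filenames containing spaces would need extra handling):
# Yield paths of incoming transfers whose completion status is 'c'.
def completed_uploads(logfile="/var/log/proftpd/xferlog"):
    with open(logfile) as fh:
        for line in fh:
            fields = line.split()
            # fields[11] is the transfer direction ('i' = incoming),
            # the last field is the completion status ('c' = complete)
            if len(fields) >= 12 and fields[11] == "i" and fields[-1] == "c":
                yield fields[8]            # path of the uploaded file

for path in completed_uploads():
    print(path)                            # stand-in for the actual handling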
Looking at the proftpd docs, I see http://www.proftpd.org/docs/directives/linked/config_ref_HiddenStores.html
The HiddenStores directive enables two-step file uploads: files are uploaded as ".in.filename." and once the upload is complete, renamed to just "filename". This provides a degree of atomicity and helps prevent 1) incomplete uploads and 2) files being used while they're still in the progress of being uploaded.
This should be the "better way" to solve the problem when you have control of proftpd, as it handles all the work for you: you can assume that any file whose name doesn't start with .in. is a completed upload. You can also safely delete any orphaned .in.* files after some arbitrary period of inactivity in a tidy-up script somewhere.
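A minimal sketch of the consumer side under that assumption (the upload directory and the inactivity threshold here are arbitrary placeholders):
# With HiddenStores enabled, anything not named ".in.*" is a finished upload;
# stale ".in.*" files are assumed abandoned and tidied up.
import os
import time

UPLOAD_DIR = "/srv/ftp/uploads"   # placeholder path
MAX_AGE = 24 * 60 * 60            # arbitrary "period of inactivity" (1 day)

def handle(path):
    print("processing", path)     # stand-in for your existing processing

for name in os.listdir(UPLOAD_DIR):
    path = os.path.join(UPLOAD_DIR, name)
    if not os.path.isfile(path):
        continue
    if name.startswith(".in."):
        if time.time() - os.path.getmtime(path) > MAX_AGE:
            os.remove(path)       # orphaned partial upload
    else:
        handle(path)              # safe: the upload has completed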
You can use pure-uploadscript if your pure-ftpd installation was compiled with the --with-uploadscript option.
It is used to launch a specified script after every upload is completely finished.
Set CallUploadScript to "yes".
Create a script, e.g. with touch /tmp/script.sh.
Write the code in it. In my example the script renames the uploaded file, adding "completed." before the file name:
#!/bin/bash
# pure-uploadscript passes the absolute path of the uploaded file as $1
fullpath=$1
filename=$(basename "$1")
dirname=${fullpath%/*}
# mark the finished upload by prefixing its name with "completed."
mv "$fullpath" "$dirname/completed.$filename"
Run chmod 755 /tmp/script.sh to make the script executable by pure-uploadscript.
Then run the command pure-uploadscript -B -r /tmp/script.sh, pointing it at the script you just created.
Now /tmp/script.sh will be launched after each completed upload.
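On the processing side, a small sketch of how a script might then pick up only the renamed files (the directory path here is a placeholder):
# Only files the upload script renamed to "completed.*" are picked up.
import glob
import os

def process(path):
    print("handling", path)       # stand-in for your own processing

for path in glob.glob("/home/ftpuser/uploads/completed.*"):
    process(path)
    os.remove(path)               # or move it elsewhere once done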