I'm writing a Python-based service that scans a specified drive for file changes and backs them up to a storage service. My concern is handling files which are open and being actively written to (primarily database files).
I will be running this cross-platform so Windows/Linux/OSX.
I do not want to have to tinker with volume shadow copy services. I am perfectly happy with throwing a notice to the user/log that a file had to be skipped or even retrying a copy operation x number of times in the event of an intermittent write lock on a small document or similar type of file.
Successfully copying out a file in an inconsistent state and not failing would certainly be a Bad Thing(TM).
The users of this service will be able to specify the path(s) they want backed up, so I have to be able to determine at runtime what to skip.
I am thinking I could just identify any file which has a read/write handle and try to obtain exclusive access to it during the archival process, but I think this might be too intrusive(?) if the user was actively using the system.
Ideas?
You could look for the file being closed and archive it then. The pyinotify library allows you to watch given files or directories for a number of events, including IN_CLOSE_WRITE, which lets you detect files that have been closed after being written to.
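On Linux, a minimal pyinotify sketch could look like this (the watch path is a placeholder and the actual archive step is left as a comment):

import pyinotify

class CloseWriteHandler(pyinotify.ProcessEvent):
    def process_IN_CLOSE_WRITE(self, event):
        # A file that was open for writing has just been closed:
        # this version should now be safe to archive.
        print("closed after write:", event.pathname)
        # archive(event.pathname)  # your backup/copy routine goes here

wm = pyinotify.WatchManager()
wm.add_watch('/path/to/watch', pyinotify.IN_CLOSE_WRITE, rec=True)  # placeholder path
notifier = pyinotify.Notifier(wm, CloseWriteHandler())
notifier.loop()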
I am attempting to download a small image file (e.g. https://cdn4.telesco.pe/file/some_long_string.jpg) in the shortest possible time.
My machine pings 200ms to the server, but I'm unable to achieve better than 650ms.
What's the science behind fast-downloading of a single file? What are the factors? Is a multipart download possible?
I find many resources for parallelizing downloads of multiple files, but nothing on optimizing for download-time on a single file.
It is not so easy to compare those two types of response time.
The command-line ping is a much lower-level and faster kind of exchange in the network stack between two devices, computers or servers.
With a Python script that requests a file from a remote web server there is much more overhead, and every layer adds some milliseconds: the speed of your local Python runtime, your operating system and the remote server's (Windows/OSX/Linux), the web server software used and its configuration (Apache/IIS/nginx), and so on.
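To see where those extra milliseconds go for a single request, a rough timing sketch like this one (the URL is a placeholder) separates the connection/header overhead from the actual body transfer:

import time
import requests

url = "https://example.com/small_image.jpg"  # placeholder URL

t0 = time.perf_counter()
with requests.get(url, stream=True, timeout=10) as r:
    t_headers = time.perf_counter()   # DNS + TCP/TLS + request + response headers
    body = r.raw.read()               # the actual payload transfer
t_done = time.perf_counter()

print(f"headers after {1000 * (t_headers - t0):.0f} ms, "
      f"complete after {1000 * (t_done - t0):.0f} ms, {len(body)} bytes")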
My application is keeping watch on a set of folders where users can upload files. When a file upload is finished I have to apply a treatment, but I don't know how to detect that a file has not finished uploading.
Any way to detect if a file is not released yet by the FTP server?
There's no generic solution to this problem.
Some FTP servers lock the file being uploaded, preventing you from accessing it, while the file is still being uploaded. For example IIS FTP server does that. Most other FTP servers do not. See my answer at Prevent file from being accessed as it's being uploaded.
There are some common workarounds to the problem (originally posted in SFTP file lock mechanism, but relevant for the FTP too):
You can have the client upload a "done" file once the upload finishes. Make your automated system wait for the "done" file to appear.
You can have a dedicated "upload" folder and have the client (atomically) move the uploaded file to a "done" folder. Make your automated system look to the "done" folder only.
Have a file naming convention for files being uploaded (".filepart") and have the client (atomically) rename the file after upload to its final name. Make your automated system ignore the ".filepart" files.
See (my) article Locking files while uploading / Upload to temporary file name for an example of implementing this approach.
Also, some FTP servers have this functionality built-in. For example ProFTPD with its HiddenStores directive.
A gross hack is to periodically check the file's attributes (size and time) and consider the upload finished if they have not changed for some time interval (a rough sketch of this follows below).
You can also make use of the fact that some file formats have a clear end-of-file marker (like XML or ZIP), so you can tell that such a file is still incomplete.
Some FTP servers allow you to configure a hook to be called, when an upload is finished. You can make use of that. For example ProFTPD has a mod_exec module (see the ExecOnCommand directive).
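If you go with the attribute-stability hack, a rough sketch over the standard library's ftplib could look like this (host, credentials, path and the 30-second quiet window are placeholder assumptions; SIZE/MDTM support varies by server):

import time
from ftplib import FTP

def upload_seems_finished(ftp, path, quiet_seconds=30, poll=5):
    """Return once SIZE and MDTM have stopped changing for quiet_seconds."""
    last, stable_since = None, None
    while True:
        size = ftp.size(path)                 # SIZE command
        mdtm = ftp.voidcmd("MDTM " + path)    # modification time (server-dependent)
        now = time.time()
        if (size, mdtm) != last:
            last, stable_since = (size, mdtm), now
        elif now - stable_since >= quiet_seconds:
            return True
        time.sleep(poll)

with FTP("ftp.example.com") as ftp:           # placeholder host
    ftp.login("user", "password")             # placeholder credentials
    ftp.voidcmd("TYPE I")                     # SIZE often requires binary mode
    if upload_seems_finished(ftp, "incoming/data.csv"):
        print("attributes stable, assuming the upload is complete")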
I use ftputil to implement this work-around:
connect to ftp server
list all files of the directory
call stat() on each file
wait N seconds
For each file: call stat() again. If the result is different, skip this file, since it was modified during the wait.
If stat() result is not different, then download the file.
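Roughly, the sketch looks like this (host, credentials and directory are placeholders; note that ftputil caches stat() results, so the cache is cleared before the second pass):

import time
import ftputil

with ftputil.FTPHost("ftp.example.com", "user", "password") as host:   # placeholders
    names = host.listdir("incoming")
    first = {name: host.stat("incoming/" + name) for name in names}

    time.sleep(30)                     # wait N seconds
    host.stat_cache.clear()            # ftputil caches stat() results

    for name in names:
        again = host.stat("incoming/" + name)
        old = first[name]
        if (again.st_size, again.st_mtime) != (old.st_size, old.st_mtime):
            continue                   # changed in the meantime -> still uploading
        host.download("incoming/" + name, name)   # unchanged -> download it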
This whole ftp-fetching is old and obsolete technology. I hope that the customer will use a modern http API the next time :-)
If you are reading files with particular extensions, then use WinSCP for the file transfer. It creates a temporary file with the .filepart extension and renames it to the actual file name once the transfer is complete.
I hope it helps someone.
This is a classic problem with FTP transfers. The only mostly reliable method I've found is to send a file, then send a second short "marker" file just to tell the recipient the transfer of the first is complete. You can use a file naming convention and just check for existence of the second file.
You might get fancy and make the content of the second file a checksum of the first file. Then you could verify the first file. (You don't have the problem with the second file because you just wait until file size = checksum size).
And of course this only works if you can get the sender to send a second file.
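On the receiving side, the checksum idea could look roughly like this (the file names and the MD5 convention are assumptions; the sender writes data.csv first and data.csv.md5 last):

import hashlib
import os
import time

data_file = "incoming/data.csv"           # placeholder paths
marker_file = data_file + ".md5"

while not os.path.exists(marker_file):    # the marker only appears after the data file
    time.sleep(5)

with open(marker_file) as f:
    expected = f.read().strip()

md5 = hashlib.md5()
with open(data_file, "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        md5.update(chunk)

if md5.hexdigest() == expected:
    print("transfer verified, safe to process")
else:
    print("checksum mismatch, transfer incomplete or corrupted")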
One CSV file is uploaded to Cloud Storage every day around 0200 hrs, but sometimes, due to a job failure or system crash, the upload happens very late. So I want to create a Cloud Function that can trigger my Python BigQuery load script whenever the file is uploaded to the bucket.
file_name : seller_data_{date}
bucket name : sale_bucket/
The question lacks a full description of the desired use case and of any issues the OP has faced. However, here are a few possible approaches that you might choose from, depending on the use case.
The simple way: Cloud Functions with Storage trigger.
This is probably the simplest and most efficient way of running a Python function whenever a file gets uploaded to your bucket.
The most basic tutorial is this.
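A minimal sketch of such a function (a 1st-gen background function on the google.storage.object.finalize trigger; the dataset/table name, CSV options and the seller_data_ prefix check are assumptions based on the question):

from google.cloud import bigquery

def load_to_bq(event, context):
    """Triggered by google.storage.object.finalize on the bucket."""
    bucket = event["bucket"]
    name = event["name"]
    if not name.startswith("seller_data_"):       # only react to the daily file
        return

    uri = f"gs://{bucket}/{name}"
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
    )
    job = client.load_table_from_uri(uri, "my_dataset.seller_data", job_config=job_config)
    job.result()                                   # wait for the load to finish
    print(f"Loaded {uri} into my_dataset.seller_data")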
The hard way: App Engine with a few tricks.
Host a basic Flask application on GAE (Standard or Flex), with an endpoint specifically to handle this check: see whether the file exists, download the object, manipulate it and then do something with it.
This route can act as a custom HTTP-triggered function: once it receives a request (which could come from a simple curl call, a visit from the browser, a Pub/Sub event, or even another Cloud Function), it downloads the object into the /tmp dir, processes it and then does whatever you need.
The small benefit of GAE over CF is that you can set a minimum of one instance to stay alive at all times, which means you avoid cold starts and the risk of the request timing out before the job is done.
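A minimal sketch of such an endpoint (Flask; bucket and object names are placeholders, and the actual processing is left as a comment):

from flask import Flask
from google.cloud import storage

app = Flask(__name__)

@app.route("/process", methods=["GET", "POST"])
def process():
    client = storage.Client()
    blob = client.bucket("sale_bucket").blob("seller_data_latest.csv")  # placeholders
    local_path = "/tmp/seller_data.csv"   # /tmp is the writable directory on GAE/CF
    blob.download_to_filename(local_path)
    # ... process the file here, e.g. start the BigQuery load ...
    return "done", 200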
The brutal/overkill way: Cloud Run.
Similar approach to App Engine, but with Cloud Run you'll also need to work with a Dockerfile, keep in mind that Cloud Run scales down to zero when there's no usage, and deal with the other minor things that apply to building any application on Cloud Run.
########################################
For all the above approaches, some additional things you might want to achieve are the same:
a) Downloading the object and doing some processing on it:
You will have to download it to the /tmp directory, as it's the directory both GAE and CF use to store temporary files. Cloud Run is a bit different here, but let's not get deep into it, as it's overkill by itself.
However, keep in mind that if your file is large you might cause a high memory usage.
And ALWAYS clean that directory after you have finished with the file. Also, when opening a file, always use with open(...), since that makes sure the file doesn't stay open.
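A tiny pattern for that (the file name is a placeholder):

import os

local_path = "/tmp/seller_data.csv"           # placeholder name
try:
    with open(local_path) as f:               # "with" guarantees the file is closed
        for line in f:
            pass                              # per-line processing goes here
finally:
    if os.path.exists(local_path):
        os.remove(local_path)                 # always free /tmp when done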
b) Downloading the latest object in the bucket:
This is a bit tricky and needs some extra custom code. There are many ways to achieve it, but the one I use (though always paying close attention to memory usage) is: upon creating the object I upload to the bucket, I take the current time and use a regex to turn it into a name like results_22_6.
What happens now is that once I list the objects from my other script, they are already listed in ascending order, so the last element in the list is the latest object.
So basically what I do then is check whether the filename I have in /tmp is the same as the name of the last object in the bucket listing. If yes, do nothing; if no, delete the old one and download the latest one from the bucket.
This might not be optimal, but for me it's kinda preferable.
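As a variation, instead of encoding the time in the object name, you can get the same "last element is the latest" effect by sorting on the blobs' time_created metadata. A rough sketch (bucket name and the seller_data_ prefix are placeholders):

import os
from google.cloud import storage

client = storage.Client()
blobs = list(client.list_blobs("sale_bucket"))        # placeholder bucket name
latest = max(blobs, key=lambda b: b.time_created)     # newest upload wins

local_path = "/tmp/" + os.path.basename(latest.name)
if not os.path.exists(local_path):                    # not downloaded yet
    for old in os.listdir("/tmp"):
        if old.startswith("seller_data_"):            # only remove our stale copies
            os.remove(os.path.join("/tmp", old))
    latest.download_to_filename(local_path)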
We are working on a web app using django that will allow for modification of files which are stored in a vcs repo (currently git).
Writing to the file in the local workspace happens for as long as the edit session is running in the browser.
Adding/committing happens when the user finishes editing (saves) or after a given time has elapsed.
Because there will be parallel web sessions running, I am concerned about concurrent access to the versioned files:
when reading / writing in the local workspace
as we want commit messages to be specific to each file's modification, we also need some kind of lock to prevent add/commit operations from interleaving.
So I guess we should use some kind of locking, and I am looking for a mechanism which is robust and compatible with the web app architecture:
I read about flock, but I guess it is not suited to a stateless application; I probably cannot hold a file handle easily, can I?
I could create some kind of filename.ext.lock file to handle mutual exclusion programmatically (see the sketch after this list)
Or I could have a dedicated table in db for the same goal
Another solution would be to delegate VCS access (file and repo) to a dedicated process, but I couldn't find anything for that yet; searching for "git daemon" only returns results that deal with operating on whole repos (clone / push / pull / ...), not with file-level operations
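Roughly what I have in mind for the lock-file approach (the naming convention and timeout are just placeholders):

import os
import time
from contextlib import contextmanager

@contextmanager
def file_lock(path, timeout=30, poll=0.2):
    lock_path = path + ".lock"
    deadline = time.time() + timeout
    while True:
        try:
            # O_CREAT | O_EXCL makes creation atomic: exactly one process wins.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.time() > deadline:
                raise TimeoutError(f"could not lock {path}")
            time.sleep(poll)
    try:
        os.write(fd, str(os.getpid()).encode())   # helps debugging stale locks
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)

# usage around the edit + add + commit sequence:
# with file_lock("workspace/somefile.txt"):
#     write_file(); git_add(); git_commit()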
Do you see other means than the ones above ?
Do you know of aspects we should specially take care of ?
I agree with Magnus Bäck. Unlike SVN you can create different copies of git-repos and merge them into one another without a server, just on your local machine. So you can have a bunch of copies of the same repository and each client/process gets its own folder. You can delete and copy them when you need more or less.
Also you can do
git remote add local_original file:///var/git/project.git
(source)
in each copy and use this for
git push local_original
to push the changes to this local repository to synchronize all users working on it.
I would go in favor of copying because it can scale better than locks.
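A rough sketch of that copy-per-session workflow with GitPython (paths, remote name, file name and commit message are all placeholders):

from git import Repo

central = "/var/git/project.git"                # the shared bare repository
workdir = "/srv/sessions/session-1234"          # one working copy per web session

repo = Repo.clone_from(central, workdir)        # each session gets its own copy
repo.create_remote("local_original", "file://" + central)

# ... the web app edits files inside workdir ...

repo.index.add(["edited_file.txt"])             # placeholder file name
repo.index.commit("Edit made via web session 1234")
repo.remotes.local_original.push(repo.active_branch.name)   # sync back to the shared repo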
Here is something that I implemented that does not have to scale but that I would implement with copying if it had to: https://github.com/niccokunzmann/gh-pages-edit
I have just noticed that when I have a running instance of my GAE application, nothing happens to the datastore file when I add or remove entries using Python code or the admin console. I can even remove the file and still have all the data safe and sound in the admin area and accessible from code. But when I restart my application, all that data obviously goes away and I have a blank datastore. So, the question: does GAE read all data from the file only when it starts, then work with it in memory and save the data after I stop the application? Does it make any requests to the datastore file while the application is running? If it doesn't save anything to the file while it's running, could data be lost if the application stops unexpectedly? Please make it clear for me if you know how it works in this respect.
How the datastore reads and writes its underlying files varies - the standard datastore is read on startup, and written progressively, journal-style, as the app modifies data. The SQLite backend uses a SQLite database.
You shouldn't have to care, though - neither backend is designed for robustness in the face of failure, as they're development backends. You shouldn't be modifying or deleting the underlying files, either.
By default the dev_appserver will store its data in a temporary location (which is why it disappears and you can't see anything changing).
If you don't want your data to disappear on restart set --datastore_path when running your dev server like:
dev_appserver.py --datastore_path /path/to/app/myapp.db /path/to/app
As Nick said, the dev server is not built to be bulletproof; it's designed to help you develop your app quickly. The production setup is very different and will not do anything unexpected when you are dealing with exceptional circumstances.