I'm looking to schedule FTP file transfers, but to conserve bandwidth, I would like to only upload files that have changed. What's a good reliable way to do this that will work on a variety of different hosting providers?
First, checking to see whether a local file has changed really doesn't have anything to do with FTP. You're stating that you're only going to open an FTP connection to upload a file if/when it has changed.
At a high level, the basic strategy is to keep track of when your application last checked for changes (the previous execution timestamp) and compare that to the modification timestamps of the files you are interested in uploading. If a file's timestamp is more recent, it has most likely changed. I say most likely because it is possible to update only the timestamp without touching the contents (e.g. touch on unix/linux).
Here's a quick example showing you how to check the modification time for all of the items in a specific directory:
import os

checkdir = "./"
for item in os.listdir(checkdir):
    item_path = os.path.join(checkdir, item)
    # Modification time, as seconds since the epoch
    mtime = os.path.getmtime(item_path)
    print("%s: %s" % (item_path, mtime))
Note that this does not differentiate between file types (e.g. regular file, directory, symlink). Read the docs on os.path to discover how to determine file type so you can skip certain types, if you so choose.
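For example, a minor variation on the loop above uses os.path.isfile to skip anything that is not a regular file:

import os

checkdir = "./"
for item in os.listdir(checkdir):
    item_path = os.path.join(checkdir, item)
    # Skip directories, sockets, etc.; note that symlinks which point at
    # regular files still pass this check.
    if not os.path.isfile(item_path):
        continue
    print("%s: %s" % (item_path, os.path.getmtime(item_path)))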
You'll still need to come up with the logic to store the time of the previous 'scan' so that you can refer to it in subsequent scans. A really simple way to do this would be to store a value in a file.
Make sure you use a locking strategy in case two 'scans' overlap. FTP uploads will take some time to complete.
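As a minimal sketch of both ideas (the file names here are just placeholders), you could persist the previous scan time in a small state file and use an exclusive-create lock file so that overlapping scans bail out instead of running twice:

import os

STATE_FILE = "last_scan.txt"   # hypothetical location for the stored timestamp
LOCK_FILE = "scan.lock"        # hypothetical lock marker

def read_last_scan():
    # Returns 0.0 on the first run, when no previous timestamp exists yet.
    try:
        with open(STATE_FILE) as fh:
            return float(fh.read().strip())
    except (FileNotFoundError, ValueError):
        return 0.0

def write_last_scan(timestamp):
    with open(STATE_FILE, "w") as fh:
        fh.write(str(timestamp))

def acquire_lock():
    # O_CREAT | O_EXCL fails if the lock file already exists, so an
    # overlapping scan simply gives up instead of running twice.
    try:
        os.close(os.open(LOCK_FILE, os.O_CREAT | os.O_EXCL | os.O_WRONLY))
        return True
    except FileExistsError:
        return False

def release_lock():
    os.remove(LOCK_FILE)

A typical run would acquire the lock, compare each file's mtime against read_last_scan(), upload what changed, then call write_last_scan(time.time()) and release the lock.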
Related
I'm currently working on a Python Tkinter Windows-based application where I need to get the last modified time of a disk partition. My aim is to get the latest time at which anything on the partition changed: the user might have created a file/folder, deleted some files, or made other changes to existing files. I have tried Python's os.stat(), but it only provides the modified date of existing files and fails to account for deleted files. The same is true for the PowerShell command Get-ChildItem | Sort-Object -Descending -Property LastWriteTime | select -First 1: it returns the most recent time with respect to the contents still present in the main directory, but does not reflect file/folder deletions.
In the application, I want to detect whether the partition state has changed, i.e. whether the user has made any changes to the disk partition since the last use of the application. Another option would be to calculate a hash value for the disk partition, but that is far too time-consuming; I need the result in just a few seconds.
This is my first interaction on StackOverflow as a questioner. Looking forward to getting helpful answers from the community.
We have a tool which is designed to allow vendors to deliver files to a company and update its database. These files (generally of predetermined types) are delivered through our web-based transport system; a new record is created in the db for each one, and the files are moved into a new structure when delivered.
We have a new request from a client to use this tool to pass through entire directories without parsing every record. Imagine the client makes digital cars: this tool allows the delivery of the digital nuts and bolts and tracks each part, but they also want to deliver a directory with all of the assets which went into creating a digital bolt, without adding each asset as a new record.
The issue is that the original code doesn't have a nice way to handle these passthrough folders, and would require a lot of rewriting to make it work. We'd obviously need to create a new function which runs around the time of the directory walk, pulls out each folder which matches this passthrough pattern, and then handles it separately. The problem is that all the tools which do the transport, db entry, and delivery expect files, not folders.
My thinking: what if we could treat that entire folder as a file? That way the current file-level tools don't need to be modified; we'd just need to add the "conversion" step. After generating the manifest, what if we used a library to turn the folder into a "file", sent that, and then turned it back into a "folder" after ingest? The most obvious way to do that is ZIP files - and the current delivery tool does handle ZIPs - but that is slow, and some of these deliveries are very large, which means that if anything goes wrong during transport the entire ZIP transfer fails.
Is there a method which we can use which doesn't necessarily compress the files but just somehow otherwise can treat a directory and all its contents like a file, so the rest of the code doesn't need to be rewritten? Or something else I'm missing entirely?
Thanks!
You could use tar files. Python has great support for them, and it is customary in *nix environments to use them as backup/archive files. For compression you could use gzip (also supported by the standard library and well suited to streaming).
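As a rough sketch (the paths are placeholders), packing a directory into a tar and unpacking it on the other side takes only a few lines with the standard library; use mode "w" instead of "w:gz" if you want to skip compression entirely for speed:

import tarfile

# Pack a delivery directory into a single archive "file".
with tarfile.open("delivery.tar.gz", "w:gz") as tar:
    tar.add("bolt_assets", arcname="bolt_assets")

# ...after transport, unpack it back into a folder on the receiving side.
# (Only extract archives from sources you trust.)
with tarfile.open("delivery.tar.gz", "r:gz") as tar:
    tar.extractall(path="ingest")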
Situation
I get a ton of json files from a remote data source. I organize these files into an archive, then read them into a database. The archive exists to rebuild the database, if necessary.
The json files are generated remotely and sent to my server periodically, and the reading-in process happens continuously. On more than one occasion we had a power loss to our servers overnight or over the weekend. This was a huge problem for database loading, since the processes halted and I didn't know what had been loaded and what hadn't, so I had to roll back to some previously known state and rebuild out of the archive.
To fix this problem, my master loader daemon (written in python) now uses the logging package to track what files it has loaded. The basic workflow of the loader daemon is
cp json file to archive
rm the original
insert the archived copy into the database (it's MariaDB)
commit to database
log filename of loaded json file
I'm not so much worried about duplicates in the database, but I don't want gaps; that is, things in the archive that are not in the database. This method has worked so far and seems guaranteed to prevent any gaps.
My logging basically looks like this. When the daemon starts up on a set of received file names, it checks for duplicates that have already been loaded into the destination database and then loads all the non-duplicates. It is possible to get duplicates from my remote data source.
import logging

def initialize_logs(filenames, destination):
    # Read the filenames already recorded for this destination database.
    try:
        with open("/data/dblogs/{0}.log".format(destination), 'r') as already_used:
            seen = set(line.rstrip("\n") for line in already_used)
    except FileNotFoundError:
        print("Log file for {0} not found. Repair database".format(destination))
        quit()

    # Logger that appends each newly loaded filename to the same log file.
    fnamelog = logging.getLogger('filename.log')
    fnamelog.setLevel(logging.INFO)
    fh = logging.FileHandler("/data/dblogs/{0}.log".format(destination))
    fh.setLevel(logging.INFO)
    fnamelog.addHandler(fh)
Then, as I process the json files, I log each file added using
fnamelog.info(filename)
The database loader runs in parallel, so I originally chose the logging package for its built-in concurrency protections. There are a variety of databases, and not every database pulls all data from the json files. Some databases with more information cover a shorter time span, usually one to two months. In this case it is nice to have a log file listing all the json files in a given database, so if I want to add more to it I don't have to worry about what is already in there; the log file keeps track.
Problem
A year has passed and I have kept getting json files; I am now receiving around a million files per month. Logging each filename as plain text as it is processed is clumsy, but it still works... for now. There are multiple databases, but for the largest ones the log file is over half a GB. I feel this logging solution will not work well for much longer.
What options are available in python to track which filenames have been inserted into a database, when there are over 10 million filenames per database, and rising?
One approach would be to log the files in a table in the database itself rather than in a text log file. If you added columns for things like import date or file name, that would give you some flexibility when you need to look information up in these logs, and it would also let you perform periodic maintenance (for example, deleting log records that are more than a few months old if you know you will never need to look at them).
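As a minimal sketch of that idea (table and column names are placeholders; cursor is any DB-API cursor from your existing MariaDB connection):

CREATE_TRACKING_TABLE = """
CREATE TABLE IF NOT EXISTS loaded_files (
    filename    VARCHAR(512) NOT NULL PRIMARY KEY,
    imported_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP
)
"""

def record_load(cursor, filename):
    # The PRIMARY KEY on filename makes duplicates fail; INSERT IGNORE turns
    # that failure into a silent skip, so re-processing a file is harmless.
    cursor.execute(
        "INSERT IGNORE INTO loaded_files (filename) VALUES (%s)", (filename,)
    )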
If you decide to keep using text-based log files, you might consider breaking them up so you don't wind up with a giant monolithic log file. When you install things like Apache that log lots of data, you'll see that they automatically set up log rotation to compress and archive log files periodically.
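If you stay with flat files, one very simple form of rotation (a sketch only; the path pattern is a placeholder) is to put the month in the log file name so that no single file grows without bound:

import datetime

def log_path(destination):
    # One log per destination per month, e.g. /data/dblogs/mydb.2018-05.log
    month = datetime.date.today().strftime("%Y-%m")
    return "/data/dblogs/{0}.{1}.log".format(destination, month)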
You don't say what type of database you are using, but the general approach to take is:
1) Make a hash of each json file. SHA-256 is widely available. If you are concerned about performance, see this post: https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed (a small helper is sketched after this list).
2) Make the hash field a unique key in your database, and before you do the other operations, try to insert it. If you can't, the record already exists and the transaction will abort.
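For step 1, a small hashing helper (the function name is just an example) that streams each file in chunks could look like this:

import hashlib

def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()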
Program 1:
Foreach file in the input directory:
    INSERT IGNORE into the database the MD5 of the file
    "mv" the file to the archive directory
Program 2, a "keep-alive" program.
It is run via cron every minute and tries to launch Program 1, but does not start it if it is already running.
Notes:
'mv' and 'cron' assume Unix. If using Windows, do the equivalent.
'mv' is atomic, so the file will be in either one directory or the other; there is no hassle of knowing whether it has been 'processed'. (So, I wonder why you even have a database table??)
Since the INSERT and mv cannot realistically be done "atomically", here is why my plan is safe: IGNORE.
The "is it running" can be handled in a number of ways, either in Program 1 or 2.
You could add a timestamp and/or filename to the table containing the md5; whatever you like.
Since it is not a good idea to have even 10K files in a directory, you should use something other than the pair of flat directories I have envisioned.
You are getting only about 1 file every 3 seconds. This is not a heavy load unless the files are huge. Then it becomes an I/O problem, not a database problem.
I have a feeling that either I am missing a hidden 'requirement', or you are being extra paranoid. I don't really understand what you need to do with the files.
So I made a Python phonebook program which allows the user to add contacts, change contact info, delete contacts, etc., and writes this data to a text file which I read every time the program is opened again to get the existing contact data. However, I write to the text file in a very specific format, so I know exactly how to read it back in. Since it is all formatted in a very specific manner, I want to prevent the user from opening the file and accidentally messing the data up with even just a simple space. How can I do this?
I want to prevent the user from opening the file and accidentally messing the data up...
I will advise you not to prevent users from accessing their own files. Messing with file permissions might result in some rogue files that the user won't be able to get rid of. Trust your user. If they delete or edit a sensitive file, it is their fault. Think of it this way - you have plenty of software installed on your own computer, but how often do you open them in an editor and make some damaging changes? Even if you do edit these files, does the application developer prevent you from doing so?
If you do intend to allow users to change/modify that file, give them good documentation on how to do it. This is the most apt thing to do. Also, make a backup file at run-time (see tempfile below) as an added layer of safety. Backups are almost always a good idea.
However, you can take some precautions to hide that data, so that users can't accidentally open them in an editor by double-clicking on it. There are plenty of options to do this including
Creating a binary file in a custom format
Zipping the text file using the zipfile module.
Using the tempfile module to create a temporary file, and zipping it using the previous option (easy to discard if no changes need to be saved).
Encryption
Everything from here on is not about preventing access, but about hiding the contents of your file
Note that the above options don't have to be mutually exclusive. The advantage of using a zip file is that it saves some space and is not easy to read and edit in a text editor (binary data). It can be easily manipulated in your Python script:
from zipfile import ZipFile

with ZipFile('spam.zip') as myzip:
    with myzip.open('eggs.txt') as myfile:
        print(myfile.read())
It is as simple as that! A temp file, on the other hand, is a volatile file (delete=True/False) and can be discarded once you are done with it. You can easily copy its contents to another file, or zip it before you close it as mentioned above.
import tempfile

with tempfile.NamedTemporaryFile() as temp:
    temp.write(b"Binary Data")
Again, another easy process. However, you must zip or encrypt it to achieve the final result. Now, moving on to encryption. The easiest way is an XOR cipher. Since we are simply trying to prevent 'readability' and not concerned about security, you can do the following:
recommended solution (XOR cipher):
from itertools import cycle

def xorcize(data, key):
    """Return a string of xor mutated data."""
    return "".join(chr(ord(a) ^ ord(b)) for a, b in zip(data, cycle(key)))

data = "Something came in the mail today"
key = "Deez Nuts"
encdata = xorcize(data, key)
decdata = xorcize(encdata, key)
print(data, encdata, decdata, sep="\n")
Notice how small that function is? It is quite convenient to include in any of your scripts. All your data can be encrypted before it is written to a file, saved with a file extension such as ".dat" or ".contacts" or any custom name you choose. Make sure the extension is not one that opens in an editor by default (such as ".txt" or ".nfo").
It is difficult to completely prevent user access to your data storage. However, you can either make it more difficult for the user to access the data, or make it easier for them not to break it. In the second case, your intention would be to make it clear to the user what the rules are and hope that not destroying the data is in the user's own best interest. Some examples:
Using a well established, human-readable serialization format, e.g. JSON (see the sketch after this list). This is often the best solution, as it allows an experienced user to easily inspect the data, or even modify it. Inexperienced users are unlikely to mess with the data anyway, and an experienced user who knows the format will follow the rules. At the same time, your parser will detect inconsistencies in the file structure.
Using a non-human readable, binary format, such as Pickle. Those files are likely to be left alone by the user as it is pretty clear that they are not meant to be modified outside the program.
Using a database, such as MySQL. Databases provide special protocols for data access which can be used to ensure data consistency and also make it easier to prevent unwanted access.
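For the JSON option, a minimal sketch (the file name and structure are just examples) of saving and loading the phonebook:

import json

def save_contacts(contacts, path="contacts.json"):
    # indent=4 keeps the file human-readable, which is the point of using JSON.
    with open(path, "w") as fh:
        json.dump(contacts, fh, indent=4)

def load_contacts(path="contacts.json"):
    # Raises a ValueError (JSONDecodeError) if the file was edited into an
    # invalid state, so corruption is detected instead of silently misread.
    with open(path) as fh:
        return json.load(fh)

contacts = {"Alice": {"phone": "555-0100"}, "Bob": {"phone": "555-0199"}}
save_contacts(contacts)
print(load_contacts())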
Assuming that your file format has a comment character, or can be modified to have one, add these lines to the top of your text file:
# Do not edit this file. This file was automatically generated.
# Any change, no matter how slight, may corrupt this file beyond repair.
The contact file belongs to your user, not to you. The best you can do is to inform the user. The best you can hope for is that the user will make intelligent use of your product.
I think the best thing to do in your case is just to choose a new file extension for your format.
It obviously doesn't prevent editing, but it clearly signals to the user that the file has a specific format and probably shouldn't be edited manually. And the GUI probably won't open it by default (it will ask what to open it with).
And that would be enough for any case I can imagine, if what you're worried about is the user messing up their own data. I don't think you can win against a user who actively tries to mess up their own data, and I doubt any program does much more. The usual "contract" is that the user's data is, well, the user's, so it can be destroyed by the user.
If you actually want to prevent editing, you could change permissions to forbid it, with os.chmod for example. The user would still be able to lift them manually, and there will be a time window while you are actually writing, so it will be neither clean nor significantly more effective. I would expect more trouble than benefit from such a solution.
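A minimal sketch of that approach (the file name is a placeholder, and the user can always undo it):

import os
import stat

# After writing, make the contacts file read-only for everyone.
os.chmod("contacts.dat", stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)

# ...and make it writable again just before the program needs to update it.
os.chmod("contacts.dat", stat.S_IRUSR | stat.S_IWUSR)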
If you want to make it truly impossible for a user to read/edit the file, you can run your process as a different user (or use something heavier like SELinux or another MAC mechanism) and so make it genuinely impossible to damage the data (with the user's permissions). But it is not worth the effort if it is only about protecting the user from the not-so-catastrophic effects of being careless.
I need to write some output to a log file every time certain events in my application are triggered. Right now I'm doing this in the simplest way imaginable:
with open('file.log', 'a+') as file:
    file.write('some info')
This works, but the problem is I don't want the file to grow indefinitely. I want a hard size cutoff of, say 25 MB. But I don't want to simply clear the file when it reaches that size.
Instead, I want it to function how I believe other log files work, clearing the oldest data from the top of the file and appending the new data to the end of the file. What would be a good approach to achieving this in Python? Note that I can't use Python's logging library because this is for a Celery application and (a) I already have logging set up for purposes unrelated to this and (b) Celery and the logging library do not play well together AT ALL.
import os
statinfo = os.stat('somefile.txt')
This returns a stat result whose st_size attribute is the size of somefile.txt in bytes.
Then you can do something like:
if statinfo.st_size > 25000000:
You can implement this however you want. You could read up to the number of lines that are going to be replaced, delete those, and save the remainder to a temporary file; then write the data you wish and append the temporary file's contents. This would not be very efficient, but it would work.
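Building on that os.stat size check, a sketch of the whole approach (it rewrites the entire file after each append, which is fine for occasional writes but inefficient for very frequent logging) might look like this:

import os

MAX_BYTES = 25 * 1024 * 1024  # the ~25 MB hard cutoff from the question

def append_with_cap(path, text, max_bytes=MAX_BYTES):
    # Append first, exactly like the original snippet.
    with open(path, "a+") as fh:
        fh.write(text)
    # If the file is still under the cap, nothing more to do.
    if os.path.getsize(path) <= max_bytes:
        return
    # Otherwise drop the oldest lines from the top until it fits. Counting
    # characters approximates bytes closely enough for mostly-ASCII logs.
    with open(path, "r") as fh:
        lines = fh.readlines()
    size = sum(len(line) for line in lines)
    while lines and size > max_bytes:
        size -= len(lines.pop(0))
    with open(path, "w") as fh:
        fh.writelines(lines)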