I'm currently working on a Python Tkinter Windows-based application where I need to get the last modified time of a disk partition. My main aim is to get the latest updated time of a partition, where the user of the host system might have created files/folders, deleted some files, or made other changes to existing files. I have tried this with Python's os.stat(), but it only provides the modified date of existing files and fails in the case of a deleted file. The same is true of the PowerShell command Get-ChildItem | Sort-Object -Descending -Property LastWriteTime | select -First 1: it provides the last write time of the contents present in the main directory but does not account for file/folder deletions.
In the application, I want to compare the partition's state across runs, i.e. detect whether the user has made any changes to the disk partition since the last use of the application. Another way to get this result would be to calculate a hash value for the disk partition, but that is far too time-consuming; I need the result within a few seconds.
This is my first interaction on StackOverflow as a questioner. Looking forward to getting helpful answers from the community.
Related
As I have written a few times before here, I am writing a programme to archive user-specified files in a certain time interval. The user also specifies when these files shall be deleted. Hence each file has a different archive and delete time interval associated with it.
I have written pretty much everything, including extracting the timings for each file in the list and working out when the next archive/delete time would be (relevant to the current time).
I am struggling with putting it all together, i.e. with actually scheduling these two processes (archive and delete archive) for each file with its individual time intervals. I guess these two functions have to be running in the background, but only execute when the clock strikes the required time.
I have looked into scheduler, timeloop, threading.Timer, but I don't see how I can set a different time interval for each file in the list, and make it run for both archive and delete processes without interfering. I came across the concept of 'cron jobs' - can anyone let me know if this might be on the right track? I'm just looking for some ideas from more experienced programmers for what I might be missing/what I should look into to get me on the right track.
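For what it's worth, here is a rough sketch of one way to give every file its own pair of timers with threading.Timer; archive_file(), delete_archive() and the intervals dictionary are placeholders for your own code, not anything from the question:

import threading

# Sketch only: archive_file(), delete_archive() and `intervals` are assumed
# placeholders; each file carries its own archive interval and delete interval
# (both in seconds).
def schedule_file(path, archive_interval, delete_interval):
    def run_archive():
        archive_file(path)                                        # assumed helper
        threading.Timer(delete_interval, delete_archive, args=[path]).start()
        threading.Timer(archive_interval, run_archive).start()    # re-arm for the next cycle
    threading.Timer(archive_interval, run_archive).start()

# intervals = {"/path/to/file.txt": (3600, 86400), ...}
for path, (a_secs, d_secs) in intervals.items():
    schedule_file(path, a_secs, d_secs)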
Situation
I get a ton of json files from a remote data source. I organize these files into an archive, then read them into a database. The archive exists to rebuild the database, if necessary.
The json files are generated remotely and sent to my server periodically, and the reading-in process happens continuously. On more than one occasion we had a power loss to our servers overnight or over the weekend. This was a huge problem for database loading, since the processes halted and I didn't know what had been loaded and what hadn't, so I had to roll back to some previously known state and rebuild from the archive.
To fix this problem, my master loader daemon (written in Python) now uses the logging package to track which files it has loaded. The basic workflow of the loader daemon is (sketched in code after the list):
cp json file to archive
rm original
insert archived copy into database (it's MariaDB)
commit to database
log filename of loaded json file
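A condensed sketch of those five steps, where insert_json() and conn stand in for the real MariaDB loading code (they are assumptions, not part of the question) and fnamelog is the logger set up below:

import os
import shutil

def load_one(json_path, archive_dir, conn, fnamelog):
    archived = os.path.join(archive_dir, os.path.basename(json_path))
    shutil.copy2(json_path, archived)            # cp json file to archive
    os.remove(json_path)                         # rm original
    insert_json(archived, conn)                  # insert archived copy (assumed helper)
    conn.commit()                                # commit to database
    fnamelog.info(os.path.basename(archived))    # log filename of loaded json file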
I'm not so much worried about duplicates in the database, but I don't want gaps; that is, things in the archive that are not in the database. This method has worked so far and seems guaranteed to prevent any gaps.
For my logging, it basically looks like this. When the daemon starts up on a set of received files' names, it checks for duplicates that have already been loaded to the destination database and then loads all the non-duplicates. It is possible to get duplicates from my remote data source.
import logging

def initialize_logs(filenames, destination):
    # Read back the filenames that have already been loaded into this database.
    try:
        with open("/data/dblogs/{0}.log".format(destination), 'r') as already_used:
            seen = set(line.rstrip("\n") for line in already_used)
    except FileNotFoundError:
        print("Log file for {0} not found. Repair database".format(destination))
        quit()

    # Append each newly loaded filename to the same log file.
    fnamelog = logging.getLogger('filename.log')
    fnamelog.setLevel(logging.INFO)
    fh = logging.FileHandler("/data/dblogs/{0}.log".format(destination))
    fh.setLevel(logging.INFO)
    fnamelog.addHandler(fh)
    return seen, fnamelog
Then, as I process the json files, I log each added file using
fnamelog.info(filename)
The database loader runs parallelized, so I originally chose the logging package for its built-in concurrency protections. There are a variety of databases, and not every database pulls all data from the json files. Some databases with more information cover a shorter span of time, usually one to two months. In this case, it is nice to have a log file listing all json files in a given database, so if I want to add more to it, I don't have to worry about what is already in there; the log file keeps track.
Problem
A year has passed. I have kept getting json files, and I am now receiving around a million files per month. Text-logging each filename as it is processed is clumsy, but it still works...for now. There are multiple databases, and for the largest ones the log file is over half a GB. I feel this logging solution will not work well for much longer.
What options are available in python to track which filenames have been inserted into a database, when there are over 10 million filenames per database, and rising?
One approach would be to log the files in a table in the database itself rather than in a text log file. If you added columns for things like import date or file name, that would give you some flexibility when you need to find information in these logs, and it would also let you perform periodic maintenance (for example, deleting log records that are more than a few months old if you know you will never need to look at them).
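A minimal sketch of that idea, assuming a MariaDB/MySQL connection through mysql.connector and a table name (loaded_files) chosen here purely for illustration:

import mysql.connector  # assumption: any DB-API driver would do

conn = mysql.connector.connect(user="loader", database="archive_meta")  # placeholder credentials
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS loaded_files (
        filename    VARCHAR(512) NOT NULL,
        imported_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
        PRIMARY KEY (filename)
    )
""")

def mark_loaded(filename):
    # INSERT IGNORE (MariaDB/MySQL syntax) silently skips filenames already recorded.
    cur.execute("INSERT IGNORE INTO loaded_files (filename) VALUES (%s)", (filename,))
    conn.commit()

def already_loaded(filename):
    cur.execute("SELECT 1 FROM loaded_files WHERE filename = %s", (filename,))
    return cur.fetchone() is not None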
If you decide to keep using text-based log files, you might consider breaking them up so you don't wind up with a giant monolithic log file. When you install things like Apache that log lots of data, you'll see they automatically set up log rotation to compress and archive log files periodically...
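In Python, the standard library already covers rotation; here is a sketch using logging.handlers.TimedRotatingFileHandler as a drop-in for the plain FileHandler above. The path and retention period are just examples, and note that the startup duplicate check would then have to read all rotated files, not just the current one:

import logging
from logging.handlers import TimedRotatingFileHandler

fnamelog = logging.getLogger('filename.log')
fnamelog.setLevel(logging.INFO)

# Roll the log over at midnight and keep roughly six months of daily files.
fh = TimedRotatingFileHandler("/data/dblogs/mydb.log", when="midnight", backupCount=180)
fh.setLevel(logging.INFO)
fnamelog.addHandler(fh)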
You don't say what type of database you are using, but the general approach to take is (a short sketch follows the two steps):
1) Make a hash of each json file. SHA256 is widely available. If you are concerned about performance, see this post: https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed
2) Make the hash field a unique key in your database, and before you do the other operations try to insert it. If you can't, the record already exists and the transaction will abort.
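Something along these lines, where the table and column names are invented for the example and the cursor is assumed to come from whatever driver you use:

import hashlib

def file_sha256(path, chunk_size=65536):
    # Stream the file so large json files never have to fit in memory.
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def insert_if_new(cursor, path):
    # Relies on a UNIQUE key on file_hash.sha256; the INSERT fails (or is
    # ignored, depending on syntax) when the hash is already present.
    cursor.execute("INSERT INTO file_hash (sha256) VALUES (%s)", (file_sha256(path),))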
Program 1:
Foreach file in input directory
INSERT IGNORE into database the MD5 of the file
"mv" the file to the archive directory
Program 2, a "keep-alive" program.
It is run via cron every minute and tries to launch Program 1, but doesn't start it if it is already running.
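A rough sketch of Program 1 combined with a simple "already running" guard, assuming a Unix host (fcntl is not available on Windows); hash_file(), insert_ignore_hash() and the directory paths are placeholders:

import fcntl
import os
import shutil
import sys

INCOMING = "/data/incoming"
ARCHIVE = "/data/archive"

# If a previous run still holds the lock, exit quietly and let cron try again later.
lock = open("/var/run/json_loader.lock", "w")
try:
    fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
except (IOError, OSError):
    sys.exit(0)

for name in os.listdir(INCOMING):
    src = os.path.join(INCOMING, name)
    insert_ignore_hash(hash_file(src))               # assumed helpers: INSERT IGNORE the file's hash
    shutil.move(src, os.path.join(ARCHIVE, name))    # "mv" the file to the archive directory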
Notes:
'mv' and 'cron' assume Unix. If using Windows, do the equivalent.
'mv' is atomic, so the file will be either in one directory or the other; there is no hassle of knowing whether it is 'processed'. (So, I wonder why you even have a database table??)
Since the INSERT and mv cannot realistically be done "atomically", here is why my plan is safe: IGNORE.
The "is it running" can be handled in a number of ways, either in Program 1 or 2.
You could add a timestamp and/or filename to the table containing the md5; whatever you like.
Since it is not a good idea to have even 10K files in a directory, you should use something other than the pair of flat directories I have envisioned.
You are getting only about 1 file every 3 seconds. This is not a heavy load unless the files are huge. Then it becomes an I/O problem, not a database problem.
I have a feeling that either I am missing a hidden 'requirement', or you are being extra paranoid. I don't really understand what you need to do with the files.
I found this answer, and it somewhat provides what I needed, but I wanted to ask about any problems that can occur when storing and pulling files from Dropbox based on date.
I have an employee list with the filename empList.txt sitting in a Dropbox folder named empList-20171106_183150. The folder name has the year, month, day and time, right down to the second, appended to it (YYYYMMDD_HHMMSS).
Locally, I have a Python script that keeps a log (txt) which just contains the date of the last time the script went and downloaded the updated list. The log looks like this if the script last ran on Nov 01 2017 at 9 am:
20171101_090020
If I used Dropbox and a script written in Python to download the latest version based on the date/time, are there any disadvantages to doing this?
I just compare the date stored in the log to the date appended to the folder. If the date on the folder in Dropbox is newer, a download is needed. My only concern is that during the date comparison and download, one of the managers might upload a new list, meaning I would have to run the script again.
How do complete programs such as MalwareBytes or internet security software manage a user downloading an update when they make a new update available at the same time? For me, I just run the update again to make sure that a new update wasn't made available while I was checking/updating.
I wouldn't recommend using date comparisons, because of potential issues with race conditions, like you mentioned, etc.
The Dropbox API exposes ways to tell if things have changed instead. Specifically, when downloading the file, you should store the metadata for the version of the file you downloaded. In particular, FileMetadata.rev or FileMetadata.content_hash would be useful.
If you then check again later and either of those values are different than the last one you downloaded, you know something has changed, so you should re-download.
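A minimal sketch with the Dropbox Python SDK v2; the access token, remote path and the small state file that remembers the last downloaded rev are all placeholders:

import json
import dropbox

dbx = dropbox.Dropbox("ACCESS_TOKEN")                     # placeholder token
REMOTE_PATH = "/empList-20171106_183150/empList.txt"      # example path
STATE_FILE = "emplist_state.json"                         # remembers the last rev we saw

try:
    with open(STATE_FILE) as fh:
        last_rev = json.load(fh).get("rev")
except (IOError, ValueError):
    last_rev = None

metadata = dbx.files_get_metadata(REMOTE_PATH)
if metadata.rev != last_rev:
    # Something changed since the last download, so fetch the new copy.
    metadata, resp = dbx.files_download(REMOTE_PATH)
    with open("empList.txt", "wb") as fh:
        fh.write(resp.content)
    with open(STATE_FILE, "w") as fh:
        json.dump({"rev": metadata.rev}, fh)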
I'm writing a Python app that connects to Perforce on a daily basis. The app gets the contents of an Excel file on Perforce, parses it, and copies some data to a database. The file is rather big, so I would like to keep track, in the database, of which revision of the file the app last read; that way I can check whether the revision number is higher and avoid reading the file if it has not changed.
I could make do with getting the revision number, or the changelist number from when the file was last checked in/changed. I'm also open to any other suggestion on how to accomplish my goal of avoiding an unnecessary read of the file.
I'm using python 2.7 and the perforce-python API
Several options come to mind.
The simplest approach would be to always let your program use the same client and let it sync the file. You could let your program call p4 sync and see if you get a new version or not. Let it continue if you get a new version. This approach has the advantage that you don't need to remember any states/version from the previous run of your program.
If you don't like using a fixed client you could let your program always check the current head revision of the file in question:
p4 fstat //depot/path/yourfile | grep headRev | sed 's/.*headRev \(.*\)/\1/'
You could store that version for the next run of your program in some temp file and compare versions each time.
If you run your program at fixed times (e.g. via cron) you could check the last modification time (either with p4 filelog or with p4 fstat) and if the time is between the time of the last run and the current time then you need to process the file. This option is a bit intricate since you need to parse those different time formats.
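For the second option, here is a rough sketch with P4Python (compatible with Python 2.7); the depot path is an example, and read_last_rev() / store_last_rev() stand in for however you persist the last processed revision in your database:

from P4 import P4

p4 = P4()
p4.connect()
try:
    # fstat with tagged output returns a list of dicts; headRev is the field we want.
    head_rev = int(p4.run_fstat("//depot/path/yourfile.xlsx")[0]["headRev"])
finally:
    p4.disconnect()

last_rev = read_last_rev()            # assumed helper backed by your database
if head_rev > last_rev:
    process_excel_file()              # assumed: sync, parse and load the file
    store_last_rev(head_rev)          # assumed helper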
I'm looking to schedule FTP file transfers, but to conserve bandwidth, I would like to only upload files that have changed. What's a good reliable way to do this that will work on a variety of different hosting providers?
First, checking to see whether a local file has changed really doesn't have anything to do with FTP. You're stating that you're only going to open an FTP connection to upload a file if/when it has changed.
At a high level, the basic strategy you're going to need to employ is to keep track of when your application last checked for changes (the previous execution timestamp) and compare that to the timestamps of the files you are interested in uploading. If the timestamp on a file is more recent, it has most likely changed. I say most likely because it is possible to update only the timestamp (e.g. touch on unix/linux).
Here's a quick example showing you how to check the modification time for all of the items in a specific directory:
import os, time

checkdir = "./"
for item in os.listdir(checkdir):
    item_path = os.path.join(checkdir, item)
    # getmtime returns the modification time as seconds since the epoch
    mtime = os.path.getmtime(item_path)
    print("%s: %s" % (item_path, mtime))
Note that this does not differentiate between file types (e.g. regular file, directory, symlink). Read the docs on os.path to discover how to determine file type so you can skip certain types, if you so choose.
You'll still need to come up with the logic to store the time of the previous 'scan' so that you refer to it in subsequent scans. A really simple way to do this would be to store a value in a file.
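For example, something as small as this would do; the state-file name is arbitrary and checkdir matches the directory used in the listing above:

import os
import time

STATE_FILE = "last_scan.txt"   # arbitrary location for the previous scan time
checkdir = "./"

def read_last_scan():
    try:
        with open(STATE_FILE) as fh:
            return float(fh.read().strip())
    except (IOError, ValueError):
        return 0.0               # first run: treat every file as changed

def write_last_scan(timestamp):
    with open(STATE_FILE, "w") as fh:
        fh.write(str(timestamp))

last_scan = read_last_scan()
scan_started = time.time()
changed = [f for f in os.listdir(checkdir)
           if os.path.getmtime(os.path.join(checkdir, f)) > last_scan]
# ... upload the files in `changed` over FTP ...
write_last_scan(scan_started)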
Make sure you use a locking strategy in case two 'scans' overlap. FTP uploads will take some time to complete.