As I have written a few times before here, I am writing a program to archive user-specified files at a certain time interval. The user also specifies when these files should be deleted, so each file has its own archive and delete time intervals associated with it.
I have written pretty much everything, including extracting the timings for each file in the list and working out when the next archive/delete time would be (relative to the current time).
I am struggling with putting it all together, i.e. with actually scheduling these two processes (archive and delete archive) for each file with its individual time intervals. I guess these two functions have to be running in the background, but only execute when the clock strikes the required time.
I have looked into scheduler, timeloop, and threading.Timer, but I don't see how I can set a different time interval for each file in the list and run both the archive and delete processes without them interfering with each other. I came across the concept of 'cron jobs' - can anyone tell me whether that might be on the right track? I'm just looking for ideas from more experienced programmers about what I might be missing or what I should look into.
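For illustration, here is a minimal sketch of one way to give each file its own intervals using the standard-library sched module. The file names, intervals, and the bodies of archive_file/delete_archive below are placeholders for your own logic:

# Minimal sketch: one pending event per (file, action), each with its own interval.
# File names, intervals, and the function bodies are placeholders.
import sched
import time

scheduler = sched.scheduler(time.time, time.sleep)

def archive_file(path, interval):
    print(f"archiving {path}")
    # re-schedule itself so the archive repeats every `interval` seconds
    scheduler.enter(interval, 1, archive_file, (path, interval))

def delete_archive(path, interval):
    print(f"deleting archive of {path}")
    scheduler.enter(interval, 1, delete_archive, (path, interval))

# each file: (archive interval, delete interval) in seconds
files = {"a.txt": (60, 3600), "b.txt": (300, 7200)}

for path, (archive_every, delete_every) in files.items():
    scheduler.enter(archive_every, 1, archive_file, (path, archive_every))
    scheduler.enter(delete_every, 1, delete_archive, (path, delete_every))

scheduler.run()  # blocks; run it in a background thread if the rest of the program must stay responsive

A cron job (or the Windows Task Scheduler) is the same idea pushed outside your process: instead of keeping a scheduler running, the OS re-runs your script on a fixed schedule and the script works out which files are due each time it wakes up.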
I'm currently working on a Python Tkinter Windows-based application where I need to get the last modified time of a disk partition. My main aim is to get the latest update time of a partition, where the host system user might have created a file/folder, deleted some files, or made other changes to some files. I have tried this using Python's os.stat(), but it only provides the modified date of existing files; it fails in the case of a deleted file. The same is true of the PowerShell command Get-ChildItem | Sort-Object -Descending -Property LastWriteTime | select -First 1: it provides the last write time with respect to the contents currently present in the main directory, but it does not reflect file/folder deletions.
In the application, I want to compare the partition's state, i.e. determine whether the user has made any changes to the disk partition since the last use of the application. Another option would be to calculate a hash value for the disk partition, but that is far too time-consuming; I need the result in just a few seconds.
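A minimal sketch of one possible shortcut, assuming NTFS updates a directory's own last-write time when a file or folder directly inside it is created, renamed, or deleted (so deletions are reflected in the parent directory's timestamp); the drive letter below is a placeholder:

# Minimal sketch: take the newest LastWriteTime over all directories in the tree.
# Assumes NTFS updates a directory's mtime when an entry directly inside it changes.
import os

def latest_change(root):
    return max(os.stat(dirpath).st_mtime
               for dirpath, dirnames, filenames in os.walk(root))

print(latest_change("D:\\"))

This still walks the tree, but it only stats directories rather than every file, so it tends to be far faster than hashing the partition's contents.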
This is my first interaction on StackOverflow as a questioner. Looking forward to getting helpful answers from the community.
Problem
I want to create a program to monitor a directory and all sub-directories using Python. My research has shown me multiple ways to do this using Watchdog.
If I understand correctly, Watchdog requires that you run a script 24/7 (it could be in the background), and it essentially checks the monitored directory for changes at certain intervals (please correct me if this is wrong).
I do not want to use Watchdog if I don't have to because I would like to avoid running a python process 24/7 specifically for file monitoring.
I would instead like to run some sort of 'directory change detector' function manually whenever I execute the program.
Thoughts
I know that to accomplish my goal of avoiding Watchdog (and background processes / daemons) I could take a snapshot of the directory by recursively walking through it with os.walk() (maybe there are better functions for this) and then save that output to a file (I'll call this file last_dir_state.txt).
I could also pickle this file, but I'll avoid going into that right now for this explanation...
I could then load last_dir_state.txt the next time I run my program, re-run the os.walk() (or comparable function), and then compare the new function output to my old file.
I think this would work, but the problem is it will be VERY SLOW for large directories. I need a better / faster way to do this if another way exists.
I was thinking I could do something like recursively hashing the directory, saving the hash, and then comparing it to the hash created the next time the program runs. However, while this might make it faster to detect that a change occurred, it wouldn't tell me exactly what change actually occurred.
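A minimal sketch of the snapshot-and-compare idea described above, assuming the snapshot is just a set of paths saved between runs (the snapshot file name and root directory are placeholders):

# Minimal sketch: save the set of paths seen last run and diff it against the current walk.
import json
import os

SNAPSHOT = "last_dir_state.json"

def scan(root):
    paths = set()
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            paths.add(os.path.join(dirpath, name))
    return paths

def detect_new(root):
    current = scan(root)
    previous = set()
    if os.path.exists(SNAPSHOT):
        with open(SNAPSHOT) as f:
            previous = set(json.load(f))
    with open(SNAPSHOT, "w") as f:
        json.dump(sorted(current), f)
    return sorted(current - previous)   # newly added files and directories

for path in detect_new("RootDir"):
    print(path)

Storing (path, mtime, size) tuples instead of bare paths would let the same diff also flag modified and deleted entries.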
Visual
As a visual this is what I want... without having to run Watchdog, and by just executing a function each time I run a program, I want the output of what changes occurred. It is OK if I have to save files locally (or create some sort of database) to do this if needed. I just need to avoid running background processes / daemons if possible.
First function execution:
# Monitored Dir Structure at runtime
RootDir
|--file_1
|--SubDir_1
| |--file_2
# No Output
Second function execution:
# Monitored Dir Structure at runtime
RootDir
|--file_1
|--SubDir_1
| |--file_2
| |--file_3
|
|--SubDir_2
|--file_4
# Outputs newly added files and directories
/RootDir/SubDir_1/file_3
/RootDir/SubDir_2
/RootDir/SubDir_2/file_4
Note that I am not as concerned with whether, say, file_1 was altered between the two executions (that would be nice, but I can probably figure that out if I get to that point). I am mainly concerned with knowing what new directories and new files were created between program executions.
Question
In one sentence, my question is "What is the fastest way to monitor directory changes in python without using Watchdog / background processes?"
Hope that all makes sense. Let me know if I need to clarify anything.
I have two files. File A contains 1 million records. File B contains approximately 2,000 strings, each on a separate line.
I have a Python script that takes each string in File B in turn and searches for a match in File A. The logic is as follows:
For string in File B:
    For record in File A:
        if record contains string: # I use regex for this
            write record to a separate file
This is currently running as a single thread of execution and takes a few hours to complete.
I’d like to implement concurrency to speed up this script. What is the best way to approach it? I have looked into multi-threading but my scenario doesn’t seem to represent the producer-consumer problem as my machine has an SSD and I/O is not an issue. Would multiprocessing help with this?
Running such a problem with multiple threads poses a couple of challenges:
We have to iterate over all of the records in File A in order to complete the algorithm.
We have to synchronize writing to the output file so that threads don't overwrite or interleave each other's records.
I'd suggest:
Assign a single thread just for printing - so your external file won't get messed up.
Open as many threads as you can support (n), and give each of them different 1000000/n records to work on.
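A minimal sketch of that layout, with one dedicated writer thread and the records of File A split across worker threads (the file names are placeholders and File A is assumed to fit in memory):

# Minimal sketch: n worker threads each scan a slice of File A; one writer thread owns the output.
import re
import threading
from queue import Queue

NUM_WORKERS = 8
results = Queue()

def worker(records, patterns):
    # each worker scans its own slice of File A against every pattern
    for record in records:
        if any(p.search(record) for p in patterns):
            results.put(record)

def writer(out_path):
    # a single thread owns the output file, so writes never interleave
    with open(out_path, "w") as out:
        while True:
            record = results.get()
            if record is None:                 # sentinel: all workers are done
                break
            out.write(record)

patterns = [re.compile(line.strip()) for line in open("file_b.txt") if line.strip()]
records = open("file_a.txt").readlines()

writer_thread = threading.Thread(target=writer, args=("matches.txt",))
writer_thread.start()

chunk = len(records) // NUM_WORKERS + 1
workers = [threading.Thread(target=worker,
                            args=(records[i * chunk:(i + 1) * chunk], patterns))
           for i in range(NUM_WORKERS)]
for t in workers:
    t.start()
for t in workers:
    t.join()
results.put(None)                              # tell the writer we're finished
writer_thread.join()

One caveat: with CPython's GIL, threads may not speed up the regex matching itself very much; the same layout carries over to multiprocessing.Process with a multiprocessing.Queue if the matching turns out to be CPU-bound.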
The processing you want to do requires checking whether any of the 2,000 strings is in each of the 1,000,000 records, which amounts to 2,000,000,000 such "checks" in total. There's no way around that. Your current logic with the nested for loops simply iterates over all the possible combinations of items in the two files, one by one, and does the checking (and output-file writing).
You need to determine how (if at all) this could be accomplished concurrently. For example, you could have "N" tasks, each checking for one string in every one of the million records. The outputs from all these tasks together represent the desired output and would likely need to be aggregated into a single file. Since the results will come back in a relatively random order, you may also want to sort them.
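A minimal sketch of that "one task per search string" layout with multiprocessing.Pool, aggregating everything into one output file afterwards (file names are placeholders):

# Minimal sketch: one task per search string; results are aggregated into one file at the end.
import re
from multiprocessing import Pool

def find_matches(pattern_text):
    pattern = re.compile(pattern_text)
    with open("file_a.txt") as records:
        return [record for record in records if pattern.search(record)]

if __name__ == "__main__":
    with open("file_b.txt") as f:
        strings = [line.strip() for line in f if line.strip()]

    with Pool() as pool:
        per_string_matches = pool.map(find_matches, strings)

    # aggregate (and optionally sort) everything into a single output file
    with open("matches.txt", "w") as out:
        for matches in per_string_matches:
            out.writelines(matches)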
My first gut reaction is that Luigi isn't suited for this sort of thing, but I would like the "pipeline" functionality and everything keeps pointing me back to Luigi/Airflow. I can't use Airflow as it is a Windows environment.
My use-case:
So currently, on the "source" side we have 20 or so machines that produce XML data. Over time, some process continually puts these files into a folder on each machine (it's log data). On any given day each machine could have 0 files in the folder or 100k+ files. Eventually someone will delete all of the files.
One part of this process is to watch all of these directories on all of these machines and copy the files down to an archive folder if they are new.
My current process gets a listing of all the files on each machine every 5 minutes and loops over the source, checking whether each file already exists at the destination. It copies the file if it doesn't exist at the destination and skips it if it does.
It seems that Luigi wants to work with only "a" (singular) file as its output and/or target. The issue is that I could have 1 new file or several thousand files show up.
This same exact issue occurs throughout the entire process, as the next step in my pipeline is to add the files and their metadata (size, filename, directory location) as database records. At that point another process reads all of the metadata rows and puts them into a table of content extracted from the XML log data.
Is Luigi even suited for something like this? Luigi seems to want to deal with one thing, do some work on it, and then emit that information out to another single file.
I can tell you that my workflow handles 10K log files every day without any glitches. The key point is that I created one task to work with each file.
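For illustration, a minimal sketch of that "one task per file" pattern, fanned out by a wrapper task (the source and archive paths are placeholders; it assumes Luigi's LocalTarget, WrapperTask, and the local scheduler):

# Minimal sketch: one Luigi task per source file, fanned out by a wrapper task.
# The source/archive paths are placeholders.
import os
import shutil
import luigi

class ArchiveFile(luigi.Task):
    src_path = luigi.Parameter()      # full path of one source XML file
    archive_dir = luigi.Parameter()   # destination archive folder

    def output(self):
        # the task counts as done once the copy exists in the archive
        return luigi.LocalTarget(os.path.join(self.archive_dir, os.path.basename(self.src_path)))

    def run(self):
        with self.output().temporary_path() as tmp:
            shutil.copy2(self.src_path, tmp)

class ArchiveAllNewFiles(luigi.WrapperTask):
    source_dir = luigi.Parameter()
    archive_dir = luigi.Parameter()

    def requires(self):
        # fan out: one ArchiveFile task per file currently in the source folder
        for name in os.listdir(self.source_dir):
            yield ArchiveFile(src_path=os.path.join(self.source_dir, name),
                              archive_dir=self.archive_dir)

if __name__ == "__main__":
    luigi.build([ArchiveAllNewFiles(source_dir=r"\\machine1\logs",
                                    archive_dir=r"D:\archive")],
                local_scheduler=True)

Because each task's output is its own target, files already present in the archive are simply treated as complete and skipped on the next run, which mirrors the copy-if-missing, skip-if-present logic described in the question.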
I have a python script which starts by reading a few large files and then does something else. Since I want to run this script multiple times and change some of the code until I am happy with the result, it would be nice if the script did not have to read the files anew every time, because they will not change. So I mainly want to use this for debugging.
It happens too often that I run scripts with bugs in them, but I only see the error message after several minutes because the reading takes so long.
Are there any tricks to do something like this?
(If it is feasible, I create smaller test files)
I'm not good at Python, but it seems to be able to dynamically reload code from a changed module: How to re import an updated package while in Python Interpreter?
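A minimal sketch of that reload workflow in an interactive session (the module and function names are hypothetical):

# Minimal sketch: keep the slow-to-load data alive in an interactive session and
# reload only the code you are editing. Module/function names are hypothetical.
import importlib
import my_analysis                       # the module you keep editing

data = my_analysis.read_big_files()      # slow step, done once per session

# ... edit my_analysis.py, then:
importlib.reload(my_analysis)            # pick up the changed code
my_analysis.process(data)                # rerun only the fast part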
Some other suggestions not directly related to Python.
Firstly, try to create a smaller test file. Is the whole file required to demonstrate the bug you are observing? Most probably only a small part of your input file is relevant.
Secondly, are these particular files required, or will the problem show up with any large amount of data? If it shows up only with particular files, then once again it is most probably related to some feature of those files and will also show up with a smaller file that has the same feature. If the main reason is just the large amount of data, you might be able to avoid reading it altogether by generating some random data directly in the script.
Thirdly, what is the bottleneck in reading the file? Is it just hard-drive performance, or do you do some heavy processing of the read data in your script before actually reaching the part that causes problems? In the latter case, you might be able to do that processing once, write the results to a new file, and then modify your script to load the processed data instead of redoing the processing every time.
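A minimal sketch of that process-once-and-cache approach using pickle (the cache file name and the slow read_and_process_big_files() function are placeholders):

# Minimal sketch: do the slow read/processing once, cache the result, reuse it on later runs.
import os
import pickle

CACHE = "preprocessed.pkl"

def load_data():
    if os.path.exists(CACHE):
        with open(CACHE, "rb") as f:
            return pickle.load(f)            # fast path: reuse earlier work
    data = read_and_process_big_files()      # placeholder for the slow step
    with open(CACHE, "wb") as f:
        pickle.dump(data, f)
    return data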
If the hard drive performance is the issue, consider a faster filesystem. On Linux, for example, you might be able to use /dev/shm.