I am currently running evaluations with multiple parameter configurations in a medium-sized project.
I set certain parameters, change some parts of the code, and run the main file with python.
Since each execution takes several hours, after starting it I make changes to some files (commenting out some lines and changing parameters) and start the script again in a new tmux session.
While doing this, I observed behaviour where the first execution used configuration options of the second execution, so it seems like python was not done parsing the code files, or maybe loads them lazily.
Therefore I wonder how python loads modules / code files, and whether changing them after I have started the execution will affect that execution.
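To illustrate the behaviour I suspect, here is a small self-contained experiment (the module name late_config is made up); it shows that a module's file is only read from disk when it is first imported, so edits made before that point are picked up by an already running process:

# demo.py -- self-contained sketch; "late_config" is a made-up module name
import pathlib, time

mod = pathlib.Path(__file__).with_name("late_config.py")
mod.write_text("PARAM = 'old'\n")

# Simulate the long-running phase of the real job. If late_config.py is edited
# during this sleep, the change WILL be visible below, because the module has
# not been imported (and therefore not read from disk) yet.
time.sleep(5)
mod.write_text("PARAM = 'new'\n")   # stands in for me editing the file by hand

import late_config
print(late_config.PARAM)            # prints 'new': the file is only read at first import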
Problem
I want to create a program to monitor a directory and all of its sub-directories using python. My research has shown me multiple ways to do this using Watchdog.
If I understand correctly, Watchdog requires that you run a script 24/7 (it could be in the background), and it essentially checks the monitored directory for changes at certain intervals (please correct me if this is wrong).
I do not want to use Watchdog if I don't have to, because I would like to avoid running a python process 24/7 just for file monitoring.
I would instead like to run some sort of 'directory change detector' function manually whenever I execute the program.
Thoughts
I know that, to accomplish my goal of avoiding Watchdog (and background processes / daemons), I could take a snapshot of the directory by recursively walking through it with os.walk() (maybe there are better functions for this) and then save that output to a file (I'll call this file last_dir_state.txt).
I could also pickle this file, but I'll avoid going into that right now for this explanation...
I could then load last_dir_state.txt the next time I run my program, re-run the os.walk() (or comparable function), and then compare the new function output to my old file.
I think this would work, but the problem is it will be VERY SLOW for large directories. I need a better / faster way to do this if another way exists.
I was also thinking I could do something like recursively hashing the directory, saving the hash, and comparing it to the hash created the next time the program runs; however, while this may be faster for detecting that a change occurred, it wouldn't tell me exactly what the change was.
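Just to show what I mean, here is a minimal sketch of the snapshot-and-compare idea (the snapshot file name and root path are placeholders):

import json, os

SNAPSHOT = "last_dir_state.json"   # could just as well be last_dir_state.txt or a pickle
ROOT = "/path/to/RootDir"          # placeholder for the monitored directory

def take_snapshot(root):
    # Collect every file and directory path below root.
    paths = set()
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            paths.add(os.path.join(dirpath, name))
    return paths

current = take_snapshot(ROOT)

if os.path.exists(SNAPSHOT):
    with open(SNAPSHOT) as f:
        previous = set(json.load(f))
    for added in sorted(current - previous):
        print(added)               # newly created files and directories
# else: first run, nothing to compare against yet

with open(SNAPSHOT, "w") as f:
    json.dump(sorted(current), f)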
Visual
As a visual, this is what I want: without having to run Watchdog, and by just executing a function each time I run the program, I want output describing what changes occurred. It is OK if I have to save files locally (or create some sort of database) to do this if needed. I just need to avoid running background processes / daemons if possible.
First function execution:
# Monitored Dir Structure at runtime
RootDir
|--file_1
|--SubDir_1
|--file_2
# No Output
Second function execution:
# Monitored Dir Structure at runtime
RootDir
|--file_1
|--SubDir_1
| |--file_2
| |--file_3
|
|--SubDir_2
|--file_4
# Outputs newly added files and directories
/RootDir/SubDir_1/file_3
/RootDir/SubDir_2
/RootDir/SubDir_2/file_4
Note that I am not as concerned with whether, say, file_1 was altered between the two executions (detecting that would be nice, but I can probably figure it out if I get to that point). I am mainly concerned with knowing which new directories and files were created between program executions.
Question
In one sentence, my question is "What is the fastest way to monitor directory changes in python without using Watchdog / background processes?"
Hope that all makes sense. Let me know if I need to clarify anything.
Context
I'm working on a data science project in which I run a data analysis task on a dataset (let's call it the original dataset) and create a processed dataset (let's call this one the result). The latter can be queried by a user, who creates different plots through a Dash application. The system also makes some predictions on an attribute of this dataset using ML models. Everything will run on an external VM provided by my company.
What is my current "code"
Currently I have these python scripts that create the result dataset (except the Dashboard one):
concat.py (simply concatenates some files)
merger.py (merges different files in the project directory)
processer1.py (processes the first file needed for the analysis)
processer2.py (processes a second file needed for the analysis)
Dashboard.py (the Dash application)
ML.py (runs a classic ML task, creates a report and an updated result dataset with some predictions)
What I should obtain
I'm interested in creating this kind of solution, which will run on the VM:
Dashboard.py runs 24/7; it depends on the existence of the "result" dataset and is useless without it.
Every time there's a change in the project directory (new files are added every month), the system triggers the execution of concat.py, merger.py, processer1.py and processer2.py. Maybe a python script and the watchdog package can help create this trigger mechanism (see the sketch after this list)? I'm not sure.
Once the execution above is done, ML.py is executed on the "result" dataset and the output is uploaded to the dashboard.
Dashboard.py is restarted with the new csv file.
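If the watchdog route makes sense, I imagine the trigger part could look roughly like this (the project path is a placeholder, and debouncing, error handling and the actual restart of Dashboard.py are left out; it is only a sketch of the idea):

import subprocess, time
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

PROJECT_DIR = "/path/to/project"   # placeholder for the directory the users drop files into

class NewDataHandler(FileSystemEventHandler):
    def on_created(self, event):
        if event.is_directory:
            return
        # A new file appeared: rebuild the "result" dataset, then refresh predictions.
        for script in ("concat.py", "merger.py", "processer1.py", "processer2.py", "ML.py"):
            subprocess.run(["python", script], check=True)
        # Restarting Dashboard.py (e.g. via systemd or supervisor) would go here.

if __name__ == "__main__":
    observer = Observer()
    observer.schedule(NewDataHandler(), PROJECT_DIR, recursive=True)
    observer.start()
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.stop()
    observer.join()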
I would like some help understanding which technologies are necessary to get what I want, ideally with an example or a source, so I can fully understand and apply the right approach. I know that I may have to use a python script to orchestrate the whole system, maybe the same script that observes the directory, maybe not.
The most important thing is that the dashboard is always operating; this is what creates the need to run things simultaneously. It only needs to be restarted when the "result" csv dataset has been completed and uploaded, since I think it is best for the users to keep service continuity.
The users will feed the dashboard by placing new files in the observed directory. It's necessary to automate the execution of the code with "triggers", since they are not skilled users and will not be allowed to use the VM's shell (I suppose). Alternatively, I could think about a scheduled execution instead, say every month.
The company won't grant me another VM or anything similar if it's needed, so I have to do everything with a single VM.
Premise
This is the first time I have to put something "in production", and I have no experience at all. Could anyone help me find the best approach? Thanks in advance.
I have two separate scripts, one written in Python and one in Ruby, which run on a schedule to achieve a single goal. Ruby isn't my language of choice, but it is all I can use for this task.
The Python script runs every 30 seconds, talks to a few scientific instruments, gathers certain data, and writes the data to a text file (one per instrument).
The ruby script then reads these files every 20 seconds and displays the information on a dashing dashboard.
The trouble I have is that sometimes a file is being written to by Python at the same time as Ruby is trying to read it. You can see the obvious problems here...
Despite putting several checks in my ruby code such as:
if File.exist?(myFile) && File.readable?(myFile) && !File.zero?(myFile)
I still get these clashes every now and then.
Is there a better way in ruby to avoid reading open files / files being written to?
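For reference, the kind of change I could make on the Python (writer) side instead, if that turns out to be the better fix: write each file under a temporary name and atomically rename it into place, so Ruby never sees a half-written file (the instrument file name below is made up):

import os, tempfile

def write_atomically(path, text):
    # Write text to path so a concurrent reader never sees a partially written file.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)   # temp file on the same filesystem
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(text)
        os.replace(tmp_path, path)                    # atomic rename over the old file
    except BaseException:
        os.remove(tmp_path)
        raise

write_atomically("instrument_1.txt", "temperature=21.4\n")   # made-up example data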
I have a python script which starts by reading a few large files and then does something else. Since I want to run this script multiple times and change some of the code until I am happy with the result, it would be nice if the script did not have to read the files every time anew, because they will not change. So I mainly want to use this for debugging.
It happens too often that I run scripts with bugs in them, but I only see the error message after several minutes, because the reading takes so long.
Are there any tricks to do something like this?
(Where it is feasible, I create smaller test files.)
I'm not good at Python, but it seems to be able to dynamically reload code from a changed module: How to re import an updated package while in Python Interpreter?
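For illustration, from an interactive session this would be something like the following (the module name and entry point are made up):

import importlib
import mymodule            # made-up module holding the slow or buggy code

# ... edit mymodule.py in your editor, then pick up the changes without restarting:
importlib.reload(mymodule)
mymodule.run()             # made-up entry point, now using the edited code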
Some other suggestions not directly related to Python.
Firstly, try to create a smaller test file. Is the whole file required to demonstrate the bug you are observing? Most probably it is only a small part of your input file that is relevant.
Secondly, are these particular files required, or will the problem show up on any big amount of data? If it shows up only on particular files, then most probably it is related to some feature of those files and will also show up on a smaller file with the same feature. If the main reason is just the sheer amount of data, you might be able to avoid reading it by generating some random data directly in the script.
Thirdly, what is the bottleneck when reading the file? Is it just hard drive performance, or do you do some heavy processing of the read data in your script before actually reaching the part that causes problems? In the latter case, you might be able to do that processing once and write the results to a new file, and then modify your script to load this processed data instead of redoing the processing each time.
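A minimal sketch of that caching idea, assuming the slow part can be wrapped in a function (read_and_process_big_files and the cache file name are placeholders):

import os, pickle

CACHE = "preprocessed.pkl"   # placeholder cache file

def load_data():
    if os.path.exists(CACHE):                 # fast path: reuse the earlier result
        with open(CACHE, "rb") as f:
            return pickle.load(f)
    data = read_and_process_big_files()       # placeholder for the slow reading/processing
    with open(CACHE, "wb") as f:
        pickle.dump(data, f)
    return data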
If the hard drive performance is the issue, consider a faster filesystem. On Linux, for example, you might be able to use /dev/shm.
There are some ways to execute python code from within vim. E.g.:
typing ":! python %" on the command line! or
using the marvellous python-mode plugin!
These approaches invoke a new python instance, transfer the code, execute the code, feed back the output and close the instance. Thus, the python workspace will always be empty at start of execution. No possibility to access previously calculated results. When I want to execute just an incremental subset (some lines) of my code, I would have to take care of all the prerequisites as well (cf.: working with MATLAB provides an interactive mode where the workspace is always persistent).
Is there a way to keep a python instance alive (keep the objects within the workspace) and feed incremental code portions from vim into the open python instance and execute them with respect to the retained workspace?
One could think of storing / retrieving the python workspace on each exit / entry of a new python instance, but this might not be the best way to keep the workspace alive.
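To make clearer what I am after, here is a rough, POSIX-only sketch (the pipe path and the vim command are purely illustrative) of one way a workspace might be kept alive: a small listener that execs whatever it reads from a named pipe into one long-lived namespace; from vim, a visual selection could then be sent with :'<,'>w !cat > /tmp/py_workspace_pipe.

# workspace_server.py -- keep one namespace alive and exec snippets sent through a FIFO
import os

PIPE = "/tmp/py_workspace_pipe"   # made-up path
workspace = {}                    # this dict plays the role of the persistent workspace

if not os.path.exists(PIPE):
    os.mkfifo(PIPE)

while True:
    with open(PIPE) as pipe:      # blocks until a writer sends a snippet
        snippet = pipe.read()
    try:
        exec(snippet, workspace)  # objects created here survive until the server exits
    except Exception as exc:
        print("error:", exc)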