Prefect still treats files as cached after I've deleted them - python

The rest of my team uses Prefect for "pipelining stuff", so I'm trying to do this in Prefect, but I need behavior somewhat like GNU make. Specifically, I want to specify a filename at runtime, and:
1. If the file doesn't exist, I want Prefect to run a specific task.
2. If the file exists, I want Prefect to skip that task.
I read through "Prefect caching through a file target" and got that system mostly working: behavior 2 works, and if I run the flow twice, the second run is faster because the task is skipped. But behavior 1 doesn't work: if I run the flow, delete the file, and run the flow again, I want it to run the task, but it doesn't, and I still don't have my file at the end. How do I get it to run the task in this situation? Here's a small example.
import os
os.environ["PREFECT__FLOWS__CHECKPOINTING"] = "true"
from prefect.engine.results import LocalResult
from prefect import task, Flow, Parameter
import subprocess

@task(result=LocalResult(), target="{myfilename}")
def make_my_file(myfilename):
    subprocess.call(["touch", myfilename])
    subprocess.call(["sleep", "1"])
    return True

with Flow("makemyfile") as flow:
    myfilename = Parameter("myfilename", default="foo.txt")
    is_my_file_done = make_my_file(myfilename)

flow.run(myfilename="bar.txt")
To see the behavior:
python demo_flow.py # makes bar.txt
python demo_flow.py # skips the task
rm bar.txt
python demo_flow.py # still skips the task! Rawr!

For Prefect 2 I guess the answer should have been to refresh the cache:
https://docs.prefect.io/concepts/tasks/#refreshing-the-cache
But in the current version (2.7.9) there is no refresh_cache option on the task.
The only way I found to selectively delete the cache for a given flow is to:
in the task_run_state table, set data to null for all task_runs of the given flow_run;
in the task_run_state_cache table, delete the rows for the updated task_run_states.
I have also seen people on the internet write their own cache_key_fn (see the sketch below), but unfortunately I can't find where.
It should become easier to manage once the following fix is implemented:
https://github.com/PrefectHQ/prefect/issues/8239
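For reference, here is a minimal sketch of the custom cache_key_fn idea mentioned above. It assumes Prefect 2 semantics in which returning None from the key function disables caching for that run, and the parameter name myfilename is simply taken from the question; adjust it to your task's signature.
import os

from prefect import task
from prefect.tasks import task_input_hash

def file_target_cache_key(context, parameters):
    # Assumption: returning None disables caching for this run, so the task
    # re-executes whenever the target file has been deleted.
    path = parameters.get("myfilename", "")
    if not os.path.exists(path):
        return None
    # Otherwise fall back to the usual input-based cache key.
    return task_input_hash(context, parameters)

@task(cache_key_fn=file_target_cache_key)
def make_my_file(myfilename):
    with open(myfilename, "w") as f:
        f.write("done\n")
    return True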

Related

Is there a way to automatically run a combination of Python code and pytest tests several times?

I am looking to automate a process where:
I run some Python code,
then run a set of tests using pytest,
then, if all tests pass, start the process again with new data.
I am thinking of writing a script that executes the Python code, then calls pytest using pytest.main(), checks via the exit code that all tests passed, and in case of success starts again.
The issue is that the pytest docs (https://docs.pytest.org/en/stable/usage.html) state that making multiple calls to pytest.main() is not recommended:
Note from pytest docs:
"Calling pytest.main() will result in importing your tests and any modules that they import. Due to the caching mechanism of python’s import system, making subsequent calls to pytest.main() from the same process will not reflect changes to those files between the calls. For this reason, making multiple calls to pytest.main() from the same process (in order to re-run tests, for example) is not recommended."
I was wondering whether it is OK to call pytest.main() the way I intend to, or whether there is a better way to achieve what I am looking for.
I've made a simple example to make the problem clearer:
import pytest

A = [0]

def some_action(x):
    x[0] += 1

if __name__ == '__main__':
    print('Initial value of A: {}'.format(A))
    for i in range(10):
        if i == 5:
            # one test in test_mock2 that fails
            test_dir = "./tests/functional_tests/test_mock2.py"
        else:
            # two tests in test_mock that pass
            test_dir = "./tests/functional_tests/test_mock.py"
        some_action(A)
        check_tests = int(pytest.main(["-q", "--tb=no", test_dir]))
        if check_tests != 0:
            print('Interrupted at i={} because of tests failures'.format(i))
            break
    if i > 5:
        print('All tests validated, final value of A: {}'.format(A))
    else:
        print('final value of A: {}'.format(A))
In this example, some_action is executed until i reaches 5, at which point the tests fail and the execute/test loop is interrupted. It seems to work fine; I'm only concerned because of the note in the pytest docs quoted above.
The warning applies to the following sequence of events:
1. Run pytest.main on some folder which imports a.py, directly or indirectly.
2. Modify a.py (manually or programmatically).
3. Attempt to rerun pytest.main on the same directory, in the same Python process as #1.
The second run in step #3 will not see the changes you made to a.py in step #2. That is because Python does not import a file twice. Instead, it checks whether the file already has an entry in sys.modules and uses that instead. This is what lets you import large libraries multiple times without incurring a huge penalty every time.
Modifying the values in imported modules is fine. Python binds names to references, so if you bind something (like a new integer value) to the right name, everyone will be able to see it. Your some_action function is a good example of this. Future tests will run with the modified value if they import your script as a module.
The reason that the caveat is there is that pytest is usually used to test code after it has been modified. The warning is simply telling you that if you modify your code, you need to start pytest.main in a new python process to see the changes.
Since you do not appear to be modifying the code of the files in your test and expecting the changes to show up, the caveat you cite does not apply to you. Keep doing what you are doing.
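As a side note, if you ever do end up modifying the tested modules between iterations, a simple way to sidestep the import-caching caveat is to launch each pytest run in a fresh interpreter instead of calling pytest.main() in-process. A rough sketch, reusing the test path from the question:
import subprocess
import sys

# Each call starts a new Python process, so module changes are always picked up.
result = subprocess.run(
    [sys.executable, "-m", "pytest", "-q", "--tb=no",
     "./tests/functional_tests/test_mock.py"]
)
if result.returncode != 0:
    print("Interrupted because of test failures")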

How can I check if a loaded Python function changed?

As a data scientist / machine learning developer, I almost always have a load_data function. Executing this function often takes more than 5 minutes, because the operations it performs are expensive. When I store the end result of load_data in a pickle file and read that file again, the time often goes down to a few seconds.
So a solution I use quite often is:
import os
import mpu.io

def load_data(serialize_pickle_path, original_filepath):
    invalid_hash = True
    if os.path.exists(serialize_pickle_path):
        content = mpu.io.read(serialize_pickle_path)
        data = content['data']
        invalid_hash = mpu.io.hash(original_filepath) != content['hash']
    if invalid_hash:
        # load_data_initial() is defined elsewhere and does the expensive work
        data = load_data_initial()
        filehash = mpu.io.hash(original_filepath)
        mpu.io.write(serialize_pickle_path, {'data': data, 'hash': filehash})
    return data
This solution has a major drawback: if load_data_initial changed, the data will not be reloaded.
Is there a way to check for changes in Python functions?
Assuming you're asking whether there's a way to tell whether someone changed the source code of the function between the last time you quit the program and the next time you start it…
There's no way to do this directly, but it's not that hard to do manually, if you don't mind getting a little hacky.
Since you've imported the module and have access to the function, you can use the getsource function to get its source code. So, all you need to do is save that source. For example:
import inspect

def source_match(source_path, object):
    try:
        with open(source_path) as f:
            source = f.read()
        if source == inspect.getsource(object):
            return True
    except Exception as e:
        # Maybe log e or something here, but any of the obvious problems,
        # like the file not existing or the function not being inspectable,
        # mean we have to re-generate the data
        pass
    return False
def load_data(serialize_pickle_path, original_filepath):
    invalid_hash = True
    if os.path.exists(serialize_pickle_path):
        if source_match(serialize_pickle_path + '.sourcepy', load_data_initial):
            content = mpu.io.read(serialize_pickle_path)
            data = content['data']
            invalid_hash = mpu.io.hash(original_filepath) != content['hash']
    # etc., but make sure to save the source when you save the pickle too
Of course even if the body of the function hasn't changed, its effect might change because of, e.g., a change in some module constant, or the implementation of some other function it uses. Depending on how much this matters, you could pull in the entire module it's defined in, or that module plus every other module that it recursively depends on, etc.
And of course you can also save hashes of text instead of the full text, to make things a little smaller. Or embed them in the pickle file instead of saving them alongside.
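A minimal sketch of that hashing variant (the helper name is just for illustration):
import hashlib
import inspect

def source_hash(func):
    # Hash the function's current source text; store this string alongside
    # (or inside) the pickle and compare it on the next run.
    return hashlib.sha256(inspect.getsource(func).encode("utf-8")).hexdigest()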
Also, if the source isn't available because it comes from a module you only distribute in .pyc format, you obviously can't check the source. You could pickle the function, or just access its __code__ attribute. But if the function comes from a C extension module, even that won't work. At that point, the best you can do is check the timestamp or hash of the whole binary file.
And plenty of other variations. But this should be enough to get you started.
A completely different alternative is to do the checking as part of your development workflow, instead of as part of the program.
Assuming you're using some kind of version control (if not, you really should be), most of them come with some kind of commit hook system. For example, git comes with a whole slew of options. For example, if you have a program named .git/hooks/pre-commit, it will get run every time you try to git commit.
Anyway, the simplest pre-commit hook would be something like (untested):
#!/bin/sh
if git diff --cached --name-only | grep -q module_with_load_function.py; then
    python /path/to/pickle/cleanup/script.py
fi
Now, every time you do a git commit, if the diffs include any change to a file named module_with_load_function.py (obviously use the name of the file with load_data_initial in it), it will first run the script /path/to/pickle/cleanup/script.py (which is a script you write that just deletes all the cached pickle files).
If you've edited the file but know you don't need to clean out the pickles, you can just git commit --no-verify. Or you can expand on the script to have an environment variable that you can use to skip the cleaning, or to only clean certain directories, or whatever you want. (It's probably better to default to cleaning overzealously—worst-case scenario, when you forget every few weeks, you waste 5 minutes waiting, which isn't as bad as waiting 3 hours for it to run a bunch of processing on incorrect data, right?)
You can expand on this to, e.g., check the complete diffs and see if they include the function, instead of just checking the filenames. The hooks are just anything executable, so you can write them in Python instead of bash if they get too complicated.
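For example, a rough Python equivalent of the shell hook above (untested, and the cleanup-script path is still hypothetical):
#!/usr/bin/env python
import subprocess
import sys

# List the files staged for this commit.
changed = subprocess.run(
    ["git", "diff", "--cached", "--name-only"],
    capture_output=True, text=True, check=True,
).stdout.splitlines()

if any(name.endswith("module_with_load_function.py") for name in changed):
    # Run the cleanup script that deletes the cached pickle files.
    subprocess.run([sys.executable, "/path/to/pickle/cleanup/script.py"], check=True)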
If you don't know git all that well (or even if you do), you'll probably be happier installing a third-party library like pre-commit that makes it easier to manage hooks, write them in Python (without having to deal with complicated git commands), etc. If you are comfortable with git, just looking at pre-commit.sample and some of the other samples in git's templates directory should be enough to give you ideas.

Is it possible to display file size in a directory served using http.server in python?

I've served a directory using
python -m http.server
It works well, but it only shows file names. Is it possible to show created/modified dates and file size, like you see in ftp servers?
I looked through the documentation for the module but couldn't find anything related to it.
Thanks!
http.server is meant for dead-simple use cases, and to serve as sample code.[1] That's why the docs link right to the source.
That means that, by design, it doesn't have a lot of configuration settings; instead, you configure it by reading the source and choosing what methods you want to override, then building a subclass that does that.
In this case, what you want to override is list_directory. You can see how the base-class version works, and write your own version that does other stuff—either use scandir instead of listdir, or just call stat on each file, and then work out how you want to cram the results into the custom-built HTML.
Since there's little point in doing this except as a learning exercise, I won't give you complete code, but here's a skeleton:
class StattyServer(http.server.SimpleHTTPRequestHandler):
    def list_directory(self, path):
        try:
            dirents = os.scandir(path)
        except OSError:
            # blah blah blah
        # etc. up to the end of the header-creating bit
        for dirent in dirents:
            fullname = dirent.path
            displayname = linkname = dirent.name
            st = dirent.stat()
            # pull stuff out of st
            # build a table row to append to r
[1] Although really, it's sample code for an obsolete and clunky way of building servers, so maybe that should be "to serve as sample code to understand legacy code that you probably won't ever need to look at but just in case…".
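For completeness, a sketch of how a handler subclass like the one above would typically be wired into a server (assuming it extends SimpleHTTPRequestHandler, as in the skeleton):
import http.server

if __name__ == "__main__":
    # Serve the current directory on port 8000 using the custom handler.
    httpd = http.server.HTTPServer(("", 8000), StattyServer)
    httpd.serve_forever()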

How do I stop hg child process when using hglib

I have a Python application stored in Mercurial. In the application I found the need to display which commit it is currently running from. The best solution I have found so far is to use hglib. I have a module which looks like this:
def _get_version():
    import hglib
    repo = hglib.open()
    [p] = repo.parents()
    return p[1]

version = _get_version()
This uses hglib to find the used version and stores the result in a variable, which I can use for the entire time the service remains running.
My problem now is that this leaves a hg child process running, which is useless to me, since as soon as this module is done initializing, I don't need to use hglib anymore.
I would have expected the child process to be shut down during garbage collection once my reference to the repository instance goes out of scope. But apparently that is not how it works.
When reading the hglib documentation I didn't find any documentation on how to get the child process shut down.
What is the preferred method to get the hg child process shut down once I am done with it?
You need to treat repo sort of like a file. Either call repo.close() when you're done or use it inside a with:
def _get_version():
    import hglib
    with hglib.open() as repo:
        [p] = repo.parents()
        return p[1]

version = _get_version()
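Equivalently, a sketch of the explicit repo.close() form mentioned above:
def _get_version():
    import hglib
    repo = hglib.open()
    try:
        [p] = repo.parents()
        return p[1]
    finally:
        # Explicitly shut down the hg command-server child process.
        repo.close()

version = _get_version()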

In Python, how do I make a temp file that persists until the next run?

I need to create a folder that I use only once, but need to have it exist until the next run. It seems like I should be using the tempfile module in the standard library, but I'm not sure how to get the behavior that I want.
Currently, I'm doing the following to create the directory:
randName = "temp" + str(random.randint(1000, 9999))
os.makedirs(randName)
And when I want to delete the directory, I just look for a directory with "temp" in it.
This seems like a dirty hack, but I'm not sure of a better way at the moment.
Incidentally, the reason that I need the folder around is that I start a process that uses the folder with the following:
subprocess.Popen([command], shell=True).pid
and then quit my script to let the other process finish the work.
Creating the folder with a 4-digit random number is insecure, and you also need to worry about collisions with other instances of your program.
A much better way is to create the folder using tempfile.mkdtemp, which does exactly what you want (i.e. the folder is not deleted when your script exits). You would then pass the folder name to the second Popen'ed script as an argument, and it would be responsible for deleting it.
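A minimal sketch of that approach (the worker script name and its argument handling are hypothetical):
import subprocess
import sys
import tempfile

# Create a directory that survives this script's exit; the child process is
# then responsible for deleting it when it is done.
workdir = tempfile.mkdtemp(prefix="myjob-")
subprocess.Popen([sys.executable, "worker_script.py", workdir])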
What you've suggested is dangerous. You may have race conditions if anyone else is trying to create those directories -- including other instances of your application. Also, deleting anything containing "temp" may result in deleting more than you intended. As others have mentioned, tempfile.mkdtemp is probably the safest way to go. Here is an example of what you've described, including launching a subprocess to use the new directory.
import tempfile
import shutil
import subprocess

d = tempfile.mkdtemp(prefix='tmp')
try:
    subprocess.check_call(['/bin/echo', 'Directory:', d])
finally:
    shutil.rmtree(d)
"I need to create a folder that I use only once, but need to have it exist until the next run."
"Incidentally, the reason that I need the folder around is that I start a process ..."
Not incidental, at all. Crucial.
It appears you have the following design pattern.
mkdir someDirectory
proc1 -o someDirectory # Write to the directory
proc2 -i someDirectory # Read from the directory
if [ $? -eq 0 ]
then
    rm -r someDirectory
fi
Is that the kind of thing you'd write at the shell level?
If so, consider breaking your Python application into several parts.
The parts that do the real work ("proc1" and "proc2")
A Shell which manages the resources and processes; essentially a Python replacement for a bash script.
A temporary file is something that lasts for a single program run.
What you need is not, therefore, a temporary file.
Also, beware of multiple users on a single machine - just deleting anything with the 'temp' pattern could be anti-social, doubly so if the directory is not located securely out of the way.
Also, remember that on some machines, the /tmp file system is rebuilt when the machine reboots.
You can also automatically register a function to completely remove the temporary directory on any exit (with or without error) by doing:
import atexit
import shutil
import subprocess
import tempfile

# create your temporary directory
d = tempfile.mkdtemp()

# remove it when the Python process exits
atexit.register(lambda: shutil.rmtree(d))

# do your stuff... (`command` as in the question)
subprocess.Popen([command], shell=True).pid
tempfile is just fine, but to be on the safe side you'd need to save the directory name somewhere until the next run, for example by pickling it (see the sketch below), then read it in the next run and delete the directory. And you are not required to use /tmp as the root: tempfile.mkdtemp has an optional dir parameter for that. By and large, though, it won't be very different from what you're doing at the moment.
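A sketch of that bookkeeping, with a hypothetical state-file name:
import os
import pickle
import shutil
import tempfile

STATE_FILE = "last_tempdir.pkl"  # hypothetical location for remembering the directory

# Delete the directory left over from the previous run, if any.
if os.path.exists(STATE_FILE):
    with open(STATE_FILE, "rb") as f:
        shutil.rmtree(pickle.load(f), ignore_errors=True)

# Create this run's directory and remember it for next time.
d = tempfile.mkdtemp()  # the optional dir= argument picks a root other than /tmp
with open(STATE_FILE, "wb") as f:
    pickle.dump(d, f)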
The best way of creating the temporary file name is to use tempfile.TemporaryFile(mode='w+b', suffix='.tmp', prefix='someRandomNumber', dir=None),
or you can use the mktemp() function.
The mktemp() function will not actually create any file, but will provide a unique filename (the name does not actually contain the PID).
