As a data scientist / machine learning developer, I almost always have a load_data function. Executing it often takes more than 5 minutes, because the operations it performs are expensive. When I store the end result of load_data in a pickle file and read that file back, the time often goes down to a few seconds.
So a solution I use quite often is:
import os
import mpu.io

def load_data(serialize_pickle_path, original_filepath):
    invalid_hash = True
    if os.path.exists(serialize_pickle_path):
        content = mpu.io.read(serialize_pickle_path)
        data = content['data']
        invalid_hash = mpu.io.hash(original_filepath) != content['hash']
    if invalid_hash:
        data = load_data_initial()
        filehash = mpu.io.hash(original_filepath)
        mpu.io.write(serialize_pickle_path, {'data': data, 'hash': filehash})
    return data
This solution has a major drawback: if load_data_initial has changed, the file will not be reloaded.
Is there a way to check for changes in Python functions?
Assuming you're asking whether there's a way to tell whether someone changed the source code of the function between the last time you quit the program and the next time you start it…
There's no way to do this directly, but it's not that hard to do manually, if you don't mind getting a little hacky.
Since you've imported the module and have access to the function, you can use the getsource function to get its source code. So, all you need to do is save that source. For example:
import inspect

def source_match(source_path, object):
    try:
        with open(source_path) as f:
            source = f.read()
        if source == inspect.getsource(object):
            return True
    except Exception as e:
        # Maybe log e or something here, but any of the obvious problems,
        # like the file not existing or the function not being inspectable,
        # mean we have to re-generate the data
        pass
    return False
def load_data(serialize_pickle_path, original_filepath):
    invalid_hash = True
    if os.path.exists(serialize_pickle_path):
        if source_match(serialize_pickle_path + '.sourcepy', load_data_initial):
            content = mpu.io.read(serialize_pickle_path)
            data = content['data']
            invalid_hash = mpu.io.hash(original_filepath) != content['hash']
    # etc., but make sure to save the source when you save the pickle too
Of course even if the body of the function hasn't changed, its effect might change because of, e.g., a change in some module constant, or the implementation of some other function it uses. Depending on how much this matters, you could pull in the entire module it's defined in, or that module plus every other module that it recursively depends on, etc.
And of course you can also save hashes of text instead of the full text, to make things a little smaller. Or embed them in the pickle file instead of saving them alongside.
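For example, a minimal sketch of the hash variant (the helper name and the choice of hashlib.sha256 are mine, not anything the caching code requires):

import hashlib
import inspect

def source_hash(obj):
    # Short, stable hash of an object's source code.
    source = inspect.getsource(obj)
    return hashlib.sha256(source.encode('utf-8')).hexdigest()

# When writing the cache, store it alongside the data hash:
#     mpu.io.write(serialize_pickle_path,
#                  {'data': data, 'hash': filehash,
#                   'func_hash': source_hash(load_data_initial)})
# When reading it back, compare content['func_hash'] against
# source_hash(load_data_initial) before trusting the cached data.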
Also, if the source isn't available because it comes from a module you only distribute in .pyc format, you obviously can't check the source. You could pickle the function, or just access its __code__ attribute. But if the function comes from a C extension module, even that won't work. At that point, the best you can do is check the timestamp or hash of the whole binary file.
And plenty of other variations. But this should be enough to get you started.
A completely different alternative is to do the checking as part of your development workflow, instead of as part of the program.
Assuming you're using some kind of version control (if not, you really should be), most systems come with some kind of commit hook mechanism. For example, git comes with a whole slew of options; in particular, if you have an executable named .git/hooks/pre-commit, it will get run every time you try to git commit.
Anyway, the simplest pre-commit hook would be something like (untested):
#!/bin/sh
# exit 0 so a commit that doesn't touch the module isn't blocked by grep's non-zero status
git diff --cached --name-only | grep -q module_with_load_function.py && python /path/to/pickle/cleanup/script.py
exit 0
Now, every time you do a git commit, if the diffs include any change to a file named module_with_load_function.py (obviously use the name of the file with load_data_initial in it), it will first run the script /path/to/pickle/cleanup/script.py (which is a script you write that just deletes all the cached pickle files).
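The cleanup script itself can be close to trivial. A possible sketch (the cache directory and the *.pickle pattern are placeholders for wherever you actually keep the cached files):

#!/usr/bin/env python
# Delete all cached pickle files so they get regenerated on the next run.
import glob
import os

CACHE_DIR = "/path/to/pickle/cache"  # placeholder: wherever the pickles live

for path in glob.glob(os.path.join(CACHE_DIR, "*.pickle")):
    print("removing", path)
    os.remove(path)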
If you've edited the file but know you don't need to clean out the pickles, you can just git commit --no-verify. Or you can expand on the script to have an environment variable that you can use to skip the cleaning, or to only clean certain directories, or whatever you want. (It's probably better to default to cleaning overzealously—worst-case scenario, when you forget every few weeks, you waste 5 minutes waiting, which isn't as bad as waiting 3 hours for it to run a bunch of processing on incorrect data, right?)
You can expand on this to, e.g., check the complete diffs and see if they include the function, instead of just checking the filenames. The hooks are just anything executable, so you can write them in Python instead of bash if they get too complicated.
If you don't know git all that well (or even if you do), you'll probably be happier installing a third-party library like pre-commit that makes it easier to manage hooks, write them in Python (without having to deal with complicated git commands), etc. If you are comfortable with git, just looking at hooks/pre-commit.sample and some of the other samples in the templates directory should be enough to give you ideas.
The rest of my team uses Prefect for "pipelining stuff", so I'm trying to do a thing in Prefect, but for this thing, I need behavior sort of like GNU make. Specifically, I want to specify a filename at runtime, and
1. If the file doesn't exist, I want Prefect to run a specific task.
2. If the file exists, I want Prefect to skip that task.
I read through
Prefect caching through a file target
and got that system mostly working: behavior 2 works, and if I run it twice, then the second time is faster because the task is skipped. But behavior 1 doesn't work. If I run the flow, delete the file, and run the flow again, I want it to run the task, but it doesn't, and I still don't have my file at the end. How do I get it to run the task in this situation? Here's a little example.
import os
os.environ["PREFECT__FLOWS__CHECKPOINTING"] = "true"
from prefect.engine.results import LocalResult
from prefect import task, Flow, Parameter
import subprocess

@task(result=LocalResult(), target="{myfilename}")
def make_my_file(myfilename):
    subprocess.call(["touch", myfilename])
    subprocess.call(["sleep", "1"])
    return True

with Flow("makemyfile") as flow:
    myfilename = Parameter("myfilename", default="foo.txt")
    is_my_file_done = make_my_file(myfilename)

flow.run(myfilename="bar.txt")
To see the behavior:
python demo_flow.py # makes bar.txt
python demo_flow.py # skips the task
rm bar.txt
python demo_flow.py # still skips the task! Rawr!
For Prefect 2, I guess the answer should be:
https://docs.prefect.io/concepts/tasks/#refreshing-the-cache
But somehow, in the current version (2.7.9), there is no refresh_cache option on the task.
The only way I found to selectively delete the cache for a given flow is to:
- in the table task_run_state, set the data of all task_runs of the given flow_run to null;
- in the table task_run_state_cache, delete all the updated task_run_states.
I have also found somewhere on the internet that some people write their own cache_key_fn. Unfortunately I can't find where.
I have also found that this should be easier to manage once the following fix is implemented:
https://github.com/PrefectHQ/prefect/issues/8239
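For what it's worth, here is a rough, untested sketch of the cache_key_fn idea as I understand it from the Prefect 2 docs: the function receives the task run context and the call parameters and returns a cache key string, or None to skip caching for that run. Returning None when the target file is missing forces the task to run again, which gives the make-like behaviour asked about above:

import os
from prefect import flow, task
from prefect.tasks import task_input_hash

def file_based_cache_key(context, parameters):
    # If the output file is gone, return None so no cache entry is used
    # and the task runs again; otherwise fall back to hashing the inputs.
    if not os.path.exists(parameters["myfilename"]):
        return None
    return task_input_hash(context, parameters)

@task(cache_key_fn=file_based_cache_key)
def make_my_file(myfilename):
    with open(myfilename, "w") as f:
        f.write("done\n")
    return True

@flow
def makemyfile(myfilename: str = "foo.txt"):
    make_my_file(myfilename)

if __name__ == "__main__":
    makemyfile("bar.txt")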
I'm looking for a way to test, in my python script, if said script is running from Ansible so I can also run it through shell (for running unit tests etc). Calling AnsibleModule without calling from an ansible playbook will just endlessly wait for a response that will never come.
I'm expecting that there isn't a simple test and that I have to restructure in some way, but I'm open to any options.
def main():
    # must test if running via ansible before next line
    module = AnsibleModule(
        argument_spec=dict(
            server=dict(required=True, type='str'),
            [...]
        )
    )
    [... do things ...]

if __name__ == "__main__":
    if running_via_ansible:
        main()
    else:
        run_tests()
I believe there are a couple of answers, with various levels of trickery involved
- Since your module is written in Python, Ansible will use the AnsiballZ framework to run it, which means its sys.argv[0] will start with AnsiballZ_; it will also likely be written to $HOME/.ansible/tmp on the target machine, so one could also sniff for .ansible/tmp showing up in argv[0].
- If the file contains the string WANT_JSON, then Ansible will invoke it with the module's JSON payload as the first argument instead of feeding it JSON on sys.stdin (thus far that filename has been colocated with the AnsiballZ_ script, but I don't know that such a thing is guaranteed).
- Similarly, although apparently far more Python specific: if it contains the triple-quoted sentinel """<<INCLUDE_ANSIBLE_MODULE_JSON_ARGS>>""" (the ''' flavor works, too), then that magic string is replaced by the serialized JSON that, again, would otherwise have been provided via stdin.
While this may not apply, or be helpful, I would actually expect any local testing environment to have more "fingerprints" than the Ansible side does, so detecting the test environment is likely easier than detecting Ansible. It also has the pleasing side effect of "failing open": the module assumes it is running in production mode unless it can prove it is in testing mode, which should make for fewer weird false positives. Then again, I guess the reasonable default depends on how problematic it would be for the module to attempt to carry out its payload when not really in use.
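A minimal sketch of the first heuristic above (the function name is mine, and the AnsiballZ_ prefix and .ansible/tmp path are just what Ansible happens to use today, so treat it as a heuristic rather than a guarantee):

import os
import sys

def running_via_ansible():
    # Guess whether we were launched by Ansible's AnsiballZ wrapper.
    argv0 = sys.argv[0]
    return os.path.basename(argv0).startswith("AnsiballZ_") or ".ansible/tmp" in argv0

if __name__ == "__main__":
    if running_via_ansible():
        main()
    else:
        run_tests()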
I've served a directory using
python -m http.server
It works well, but it only shows file names. Is it possible to show created/modified dates and file size, like you see in ftp servers?
I looked through the documentation for the module but couldn't find anything related to it.
Thanks!
http.server is meant for dead-simple use cases, and to serve as sample code.1 That's why the docs link right to the source.
That means that, by design, it doesn't have a lot of configuration settings; instead, you configure it by reading the source and choosing what methods you want to override, then building a subclass that does that.
In this case, what you want to override is list_directory. You can see how the base-class version works, and write your own version that does other stuff—either use scandir instead of listdir, or just call stat on each file, and then work out how you want to cram the results into the custom-built HTML.
Since there's little point in doing this except as a learning exercise, I won't give you complete code, but here's a skeleton:
class StattyServer(http.server.SimpleHTTPRequestHandler):
    def list_directory(self, path):
        try:
            dirents = os.scandir(path)
        except OSError:
            # blah blah blah
        # etc. up to the end of the header-creating bit
        for dirent in dirents:
            fullname = dirent.path
            displayname = linkname = dirent.name
            st = dirent.stat()
            # pull stuff out of st
            # build a table row to append to r
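To actually run it, you hand the handler class to an HTTPServer much like the module itself does internally; a minimal, untested sketch (the port number is arbitrary):

import http.server

if __name__ == '__main__':
    # StattyServer is the handler subclass sketched above.
    with http.server.HTTPServer(('', 8000), StattyServer) as httpd:
        httpd.serve_forever()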
1. Although really, it's sample code for an obsolete and clunky way of building servers, so maybe that should be "to serve as sample code to understand legacy code that you probably won't ever need to look at but just in case…".
I have a script that uses a config file called config.py. Actually, this is more of a configuration module, then. Anyway: the configuration module contains a lot of parameters and dictionaries and lists of dictionaries and so on.
In the script, it is currently used like this:
import config

def main():
    myParameter = config.myParameter
Now I have another application scenario for this script that uses a related config ('config_advanced.py'), but the parameters and dictionaries have different values.
My goal now is to choose the name of the config module to use via a command-line argument:
myScript.py -configuration config_advanced.py
Since the configuration module is in the same folder as the main script, I guess I have to rename the passed configuration file to 'config.py' first. Afterwards I can perform import config. Otherwise, if I used `import config_advanced`, I wouldn't be able to use a call like
config.myParameter
in the main script.
Another possibility could be to put the configuration modules in subfolders and keep the name config.py. The passed command-line argument would then have to contain the subfolder.
Either way I won't be able to perform the import at the top of the main file, since I have to do the argument parsing first. This isn't technically a problem, but someone said it is at least bad practice.
What do you think?
What is a better way to do the trick with not much effort?
Thanks a lot
Edit:
One working solution has been
import sys
fullpath = "d:\\python\\scripts\\projectA\\configurationFiles\\"
sys.path.append(fullpath)
config = __import__('config_advanced')
Without the sys.path entry it does NOT work, so the following attempts won't work:
config = __import__('d:\\python\\scripts\\projectA\\configurationFiles\\config_advanced')
config = __import__('d:\\python\\scripts\\projectA\\configurationFiles\\config_advanced.py')
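For the record, a slightly tidier version of the same idea using importlib instead of __import__ (the directory is just the one from your example, and the module name would really come from your argument parsing):

import importlib
import sys

config_dir = "d:\\python\\scripts\\projectA\\configurationFiles"
sys.path.append(config_dir)

# module name from the command line, without the .py extension
config = importlib.import_module("config_advanced")
print(config.myParameter)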
Another possibility that's similar to what you suggest in the question, but which doesn't need you to hide things in subfolders, is to put config_advanced.py and config_plain.py in the same folder as the main script and then dynamically make config.py a link to the actual config file you want to use.
However, martineau's suggestion is much simpler.
OTOH, georg brings up a very valid point, especially if this script isn't just for your own personal use. While using Python itself for the config data is flexible and powerful, it's perhaps a little too powerful. Config data should just be data, not live executable code. If you make a minor mistake when modifying config data you could cause havoc if it's in an executable file. And if a malicious user gets to it, there's no limit to the damage they could cause.
Bad data in a plain old data file will at worst cause a ValueError if it does something weird that your config parsing code isn't suspecting. But bad data in a live Python file could throw all sorts of nasty errors. Or even worse, it could do something evil in complete silence...
In reply to your comments, here's some code to illustrate the first point:
#! /usr/bin/env python
import os

config_file = "config.py"

def link_config(mode):
    if os.path.exists(config_file):
        os.remove(config_file)
    config_name = "config_%s.py" % mode
    os.symlink(config_name, config_file)

#.... parse command line to determine config_mode string, then do
link_config(config_mode)

#Now import the newly-linked config file
import config
If config_mode == "plain" the above code will cause config_plain.py to be imported as 'config', and if config_mode == "advanced" it will cause config_advanced.py to be imported as 'config'.
But as I said before, martineau's method is much simpler. And IIRC, os.symlink may not work on non-unix systems.
...
As for your second point, check out the docs for the json module
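For instance, a small sketch of the plain-data approach (the file name and keys are made up):

import json

# settings.json might look like: {"myParameter": 42, "servers": ["a", "b"]}
with open("settings.json") as f:
    config = json.load(f)

myParameter = config["myParameter"]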
I need to create a folder that I use only once, but it needs to exist until the next run. It seems like I should be using the tempfile module in the standard library, but I'm not sure how to get the behavior that I want.
Currently, I'm doing the following to create the directory:
import os
import random
randName = "temp" + str(random.randint(1000, 9999))
os.makedirs(randName)
And when I want to delete the directory, I just look for a directory with "temp" in it.
This seems like a dirty hack, but I'm not sure of a better way at the moment.
Incidentally, the reason that I need the folder around is that I start a process that uses the folder with the following:
subprocess.Popen([command], shell=True).pid
and then quit my script to let the other process finish the work.
Creating the folder with a 4-digit random number is insecure, and you also need to worry about collisions with other instances of your program.
A much better way is to create the folder using tempfile.mkdtemp, which does exactly what you want (i.e. the folder is not deleted when your script exits). You would then pass the folder name to the second Popen'ed script as an argument, and it would be responsible for deleting it.
What you've suggested is dangerous. You may have race conditions if anyone else is trying to create those directories -- including other instances of your application. Also, deleting anything containing "temp" may result in deleting more than you intended. As others have mentioned, tempfile.mkdtemp is probably the safest way to go. Here is an example of what you've described, including launching a subprocess to use the new directory.
import tempfile
import shutil
import subprocess

d = tempfile.mkdtemp(prefix='tmp')
try:
    subprocess.check_call(['/bin/echo', 'Directory:', d])
finally:
    shutil.rmtree(d)
"I need to create a folder that I use only once, but need to have it exist until the next run."
"Incidentally, the reason that I need the folder around is that I start a process ..."
Not incidental, at all. Crucial.
It appears you have the following design pattern.
mkdir someDirectory
proc1 -o someDirectory # Write to the directory
proc2 -i someDirectory # Read from the directory
if [ $? == 0 ]
then
    rm -r someDirectory
fi
Is that the kind of thing you'd write at the shell level?
If so, consider breaking your Python application into several parts:
- The parts that do the real work ("proc1" and "proc2").
- A shell which manages the resources and processes; essentially a Python replacement for a bash script.
A temporary file is something that lasts for a single program run.
What you need is not, therefore, a temporary file.
Also, beware of multiple users on a single machine - just deleting anything with the 'temp' pattern could be anti-social, doubly so if the directory is not located securely out of the way.
Also, remember that on some machines, the /tmp file system is rebuilt when the machine reboots.
You can also automatically register a function to completely remove the temporary directory on any exit (with or without error) by doing:
import atexit
import shutil
import subprocess
import tempfile

# create your temporary directory
d = tempfile.mkdtemp()

# remove it automatically when Python exits (with or without error)
atexit.register(lambda: shutil.rmtree(d))

# do your stuff...
subprocess.Popen([command], shell=True).pid
tempfile is just fine, but to be on the safe side you'd need to save the directory name somewhere until the next run, for example pickle it or write it to a small state file. Then read it in the next run and delete the directory. And you are not required to have /tmp as the root; tempfile.mkdtemp has an optional dir parameter for that. By and large, though, it won't be much different from what you're doing at the moment.
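A small sketch of that idea (the state-file location is arbitrary):

import os
import shutil
import tempfile

STATE_FILE = os.path.expanduser("~/.myscript_last_tmpdir")  # arbitrary location

# Clean up the directory left over from the previous run, if any.
if os.path.exists(STATE_FILE):
    with open(STATE_FILE) as f:
        old_dir = f.read().strip()
    if os.path.isdir(old_dir):
        shutil.rmtree(old_dir)

# Create this run's directory and remember it for next time.
work_dir = tempfile.mkdtemp(prefix="myscript-")
with open(STATE_FILE, "w") as f:
    f.write(work_dir)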
The best way of creating the temporary file name is to use either tempfile.TemporaryFile(mode='w+b', suffix='.tmp', prefix='someRandomNumber', dir=None)
or the tempfile.mktemp() function.
The mktemp() function will not actually create any file; it just returns a unique filename (which does not actually contain the PID).