Pythonic way of moving temp files - python

From what I gather, open() is really io.open(), a high-level "wrapper" for os.open().
For other file operations, like renaming and removing files, I have to use os functions such as os.remove and os.rename, or even shutil.move in some cases, like below:
import shutil

with open("/tmp/workfile", "w") as f:
    f.write("some stuff")
shutil.move(f.name, "finalfile")
Why is there no similar wrapper like open for removal/renaming?
Is there a better, perhaps more pythonic way of accomplishing the above task?
It seems strange to have to do imports instead of having rename and remove be methods on f that point it to another file, especially when open() requires no import.
edit: I removed the del f at the end that seemed to anger a lot of people. I know it's not needed. I had it there to highlight that an f-object that no longer points to a removed file has very little use.
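For reference, one arguably more object-oriented spelling of the snippet above uses pathlib. This is only a sketch (not from the original question), and Path.replace behaves like os.replace, so unlike shutil.move it won't move the file across filesystems:
from pathlib import Path

workfile = Path("/tmp/workfile")
workfile.write_text("some stuff")
# rename/move via a method on the path object itself
workfile.replace("finalfile")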

@micke, wanna bet this is due to history? I'm guessing the open function was one of the first added to Python because, well... the creator of the language needed that early on.
I'd argue that having open as a built-in function is weird, not the other way around.
Also note that you're using the variable f outside of the with block. And the with block automatically calls close on f when the block exits, so the del f statement is not necessary. That's the whole point of using with blocks (forgetting to .close() is a very common mistake)

Related

How can I check if a loaded Python function changed?

As a data scientist / machine learning developer, I almost always have a load_data function. Executing it often takes more than 5 minutes because the operations involved are expensive. When I store the end result of load_data in a pickle file and read that file back, the time often goes down to a few seconds.
So a solution I use quite often is:
import os
import mpu.io

def load_data(serialize_pickle_path, original_filepath):
    invalid_hash = True
    if os.path.exists(serialize_pickle_path):
        content = mpu.io.read(serialize_pickle_path)
        data = content['data']
        invalid_hash = mpu.io.hash(original_filepath) != content['hash']
    if invalid_hash:
        data = load_data_initial()
        filehash = mpu.io.hash(original_filepath)
        mpu.io.write(serialize_pickle_path, {'data': data, 'hash': filehash})
    return data
This solution has a major drawback: if load_data_initial changes, the cached file will not be regenerated.
Is there a way to check for changes in Python functions?
Assuming you're asking whether there's a way to tell whether someone changed the source code of the function between the last time you quit the program and the next time you start it…
There's no way to do this directly, but it's not that hard to do manually, if you don't mind getting a little hacky.
Since you've imported the module and have access to the function, you can use inspect.getsource to get its source code. So all you need to do is save that source. For example:
import inspect

def source_match(source_path, object):
    try:
        with open(source_path) as f:
            source = f.read()
        if source == inspect.getsource(object):
            return True
    except Exception as e:
        # Maybe log e or something here, but any of the obvious problems,
        # like the file not existing or the function not being inspectable,
        # mean we have to re-generate the data
        pass
    return False

def load_data(serialize_pickle_path, original_filepath):
    invalid_hash = True
    if os.path.exists(serialize_pickle_path):
        if source_match(serialize_pickle_path + '.sourcepy', load_data_initial):
            content = mpu.io.read(serialize_pickle_path)
            data = content['data']
            invalid_hash = mpu.io.hash(original_filepath) != content['hash']
    # etc., but make sure to save the source when you save the pickle too
Of course even if the body of the function hasn't changed, its effect might change because of, e.g., a change in some module constant, or the implementation of some other function it uses. Depending on how much this matters, you could pull in the entire module it's defined in, or that module plus every other module that it recursively depends on, etc.
And of course you can also save hashes of text instead of the full text, to make things a little smaller. Or embed them in the pickle file instead of saving them alongside.
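For instance, a small sketch of the hash variant (sha256 is an arbitrary choice here, not something from the original answer):
import hashlib
import inspect

def source_hash(obj):
    # store this short digest instead of the full source text
    return hashlib.sha256(inspect.getsource(obj).encode('utf-8')).hexdigest()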
Also, if the source isn't available because it comes from a module you only distribute in .pyc format, you obviously can't check the source. You could pickle the function, or just access its __code__ attribute. But if the function comes from a C extension module, even that won't work. At that point, the best you can do is check the timestamp or hash of the whole binary file.
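A rough sketch of the __code__ idea (this fingerprint helper is my assumption, and it's CPython-specific; marshal output can also change between Python versions):
import marshal

def code_fingerprint(func):
    # serialize the compiled code object; this captures bytecode and constants,
    # but not module-level state the function depends on
    return marshal.dumps(func.__code__)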
And plenty of other variations. But this should be enough to get you started.
A completely different alternative is to do the checking as part of your development workflow, instead of as part of the program.
Assuming you're using some kind of version control (if not, you really should be), most of them come with some kind of commit hook system. git, for example, comes with a whole slew of options: if you have an executable named .git/hooks/pre-commit, it will get run every time you try to git commit.
Anyway, the simplest pre-commit hook would be something like (untested):
#!/bin/sh
if git diff --name-only | grep -q module_with_load_function.py; then
    python /path/to/pickle/cleanup/script.py
fi
Now, every time you do a git commit, if the diffs include any change to a file named module_with_load_function.py (obviously use the name of the file with load_data_initial in it), it will first run the script /path/to/pickle/cleanup/script.py (which is a script you write that just deletes all the cached pickle files).
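That cleanup script could be as small as something like the following (the cache directory and file extensions here are made up; point it at wherever your pickles actually live):
#!/usr/bin/env python
import glob
import os

# remove every cached pickle (and its saved source, if you store one)
for path in glob.glob('cache/*.pickle') + glob.glob('cache/*.sourcepy'):
    os.remove(path)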
If you've edited the file but know you don't need to clean out the pickles, you can just git commit --no-verify. Or you can expand on the script to have an environment variable that you can use to skip the cleaning, or to only clean certain directories, or whatever you want. (It's probably better to default to cleaning overzealously—worst-case scenario, when you forget every few weeks, you waste 5 minutes waiting, which isn't as bad as waiting 3 hours for it to run a bunch of processing on incorrect data, right?)
You can expand on this to, e.g., check the complete diffs and see if they include the function, instead of just checking the filenames. The hooks are just anything executable, so you can write them in Python instead of bash if they get too complicated.
If you don't know git all that well (or even if you do), you'll probably be happier installing a third-party library like pre-commit that makes it easier to manage hooks, write them in Python (without having to deal with complicated git commands), etc. If you are comfortable with git, just looking at pre-commit.sample and some of the other samples in git's templates directory should be enough to give you ideas.

Python doesn't delete temporary file if the program fails

I'm writing a program which, inter alia, works with a temporary file created using the tempfile library.
The temporary file is created and filled in a function:
import tempfile

def func():
    mod_script = tempfile.NamedTemporaryFile(dir='special')
    dest = open(mod_script.name, 'w')
    # filling dest
    return mod_script
(I use open() and not with open() because I execute the temporary file after calling func())
After some operations with mod_script outside func(), I call mod_script.close(). And all works fine.
But I have one problem. If my program fails (or if I interrupt it), the temporary file isn't removed.
How do I fix it?
I really don't want to write try...except...finally clauses because I'd have to write them so many times (there are many points where my program can fail).
First, use a with statement, and pass delete=False to the constructor.
Then you need to put the necessary error handling in your program. Catch exceptions (see try..finally) and clean up during program exit whether it is successful or crashes.
Alternatively, keep the file open while executing it so that the automatic delete-on-close doesn't remove it before you have executed it. This may cause problems on Windows, which tends to have conflicts when using files that are already open.
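A minimal sketch of the first suggestion, with atexit handling the "clean up during program exit" part (my choice, not something from the answer; atexit runs on normal exit and unhandled exceptions, but not if the process is killed outright):
import atexit
import os
import tempfile

def func():
    # delete=False keeps the file around after it is closed,
    # so it can still be executed later
    with tempfile.NamedTemporaryFile(dir='special', mode='w', delete=False) as dest:
        dest.write('...')  # filling dest
    # best-effort removal when the interpreter exits, successful or not
    atexit.register(os.remove, dest.name)
    return dest.name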

Did I TDD this method well or is there a better way?

Note: I'm used to using Dependency Injection with C# code,
but from what I understand, dynamic languages like Ruby and Python are
like play-doh, not LEGOs, and thus don't need to use IoC
containers, though there is some debate on whether IoC patterns are still useful. In the code below I used fudge's .patch feature, which provides the seams needed to mock/stub the code. However, the components of the code are thus coupled. I'm not sure I like this. This SO answer also explains that coupling in dynamic languages is looser than in static ones, but it references another answer in that question which says the tools for IoC are unneeded but the patterns are not. So a side question would be, "Should I have used DI for this?"
I'm using the following python frameworks:
Nose for unit testing
Fudge for fakes (stubs, mocking, etc)
Here is the resulting production code:
import Bio.SeqIO

def to_fasta(seq_records, file_name):
    file_object = open(file_name, "w")
    Bio.SeqIO.write(seq_records, file_object, "fasta")
    file_object.close()
Now I did TDD this code, but I did it with the following test (which wasn't all that thorough):
import fudge
from nose.tools import istest

@istest
@fudge.patch('__builtin__.open', 'Bio.SeqIO.write')
def to_fasta_writes_file(fake_open, fake_SeqIO):
    fake_open.is_a_stub()
    fake_SeqIO.expects_call()
    seq_records = build_expected_output_sequneces()
    file_path = "doesn't matter"
    to_fasta(seq_records, file_path)
Here is the updated test along with explicit comments to ensure I'm following the Four-Phase Test pattern:
@istest
@fudge.patch('__builtin__.open', 'Bio.SeqIO')
def to_fasta_writes_file(fake_open, fake_SeqIO):
    # Setup
    seq_records = build_expected_output_sequneces()
    file_path = "doesn't matter"
    file_type = 'fasta'
    file_object = fudge.Fake('file').expects('close')
    (fake_open
        .expects_call()
        .with_args(file_path, 'w')
        .returns(file_object))
    (fake_SeqIO
        .is_callable()
        .expects("write")
        .with_args(seq_records, file_object, file_type))
    # Exercise
    to_fasta(seq_records, file_path)
    # Verify (not needed due to '.patch')
    # Teardown
While the second example is more thorough, is this test overkill? Are there better ways to TDD python code? Basically, I'm looking for feedback on how I did with TDDing this operation, and would welcome any alternative ways to write either the test code or the production code.
Think about what this function does and think about what you actually have responsibility for. It looks to me like: given some data and a file name, write the records in to the file in a particular format (fasta). You aren't actually responsible for the workings of Python file I/O, or how Bio.SeqIO works.
Your second version tests that:
The file is opened for writing.
That Bio.SeqIO.write is called with the expected parameters.
The file is closed.
That looks pretty good. Most of this is simple, and some people may call it overkill, but the TDD approach can help remind you to do something like close the file (obvious, but we all forget stuff like that all the time). These tests also guard against such things as Bio.SeqIO.write being changed in the future to expect different parameters. You can either upgrade your version of the library and wonder why your program breaks, or upgrade your version of the library, run your tests, and know why and where it breaks.
Naturally you should write other tests for the case when you can't open the file, or any exceptions that Bio.SeqIO.write might throw.
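For instance, the "can't open the file" case might be sketched with the same fudge/nose tooling roughly like this (untested; the test name and the decision to let the error propagate are my assumptions, not part of the original answer):
@istest
@fudge.patch('__builtin__.open', 'Bio.SeqIO')
def to_fasta_propagates_open_errors(fake_open, fake_SeqIO):
    # Setup: opening the file fails
    fake_open.expects_call().raises(IOError("couldn't open"))
    fake_SeqIO.is_a_stub()
    # Exercise / Verify: the error should reach the caller
    try:
        to_fasta(build_expected_output_sequneces(), "doesn't matter")
        assert False, "expected IOError to propagate"
    except IOError:
        pass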

printing to a file in Python: redirect vs print's file argument vs write

I have a bunch of print calls that I need to write to a file instead of stdout. (I don't need stdout at all.)
I am considering three approaches. Are there any advantages (including performance) to any one of them?
Full redirect, which I saw here:
import sys
saveout = sys.stdout
fsock = open('out.log', 'w')
sys.stdout = fsock
print(x)
# and many more print calls
# later if I ever need it:
# sys.stdout = saveout
# fsock.close()
Redirect in each print statement:
fsock = open('out.log', 'w')
print(x, file = fsock)
# and many more print calls
Write function:
fsock = open('out.log', 'w')
fsock.write(str(x))
# and many more write calls
I would not expect any significant performance differences among these approaches.
The advantage of the first approach is that any reasonably well-behaved code which you rely upon (modules you import) will automatically pick up your desired redirection.
The second approach has no advantage. It's only suitable for debugging or throwaway code, and not even a good idea for that. You want your output decisions consolidated in a few well-defined places, not scattered across your code in every call to print(). In Python 3, print() is a function rather than a statement, which means you can re-define it if you like: you can def print(*args) yourself, and call __builtins__.print() inside the definition of your custom print() when you need the real one.
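For example, a minimal sketch of that re-definition idea (using the builtins module, which is the reliable spelling of __builtins__ here):
import builtins

fsock = open('out.log', 'w')

def print(*args, **kwargs):
    # shadow the built-in so every print call in this module goes to the log file
    kwargs.setdefault('file', fsock)
    builtins.print(*args, **kwargs)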
The third approach ... and by extension the principle that all of your output should be generated in specific functions and class methods that you define for that purpose ... is probably best.
You should keep your output and formatting separated from your core functionality as much as possible. By keeping them separate you allow your core to be re-used. (For example, you might start with something that's intended to run from a text/shell console, and later need to provide a Web UI, a full-screen (curses) front end or a GUI for it. You may also build entirely different functionality around it ... in situations where the resulting data needs to be returned in its native form (as objects) rather than pulled in as text (output) and re-parsed into new objects.)
For example, I've had more than one occasion where something I wrote to perform some complex queries and data gathering from various sources and print a report ... say of the discrepancies ... later needed to be adapted to spit out the data in some form (such as YAML/JSON) that could be fed into some other system (say, for reconciling one data source against another).
If, from the outset, you keep the main operations separate from the output and formatting then this sort of adaptation is relatively easy. Otherwise it entails quite a bit of refactoring (sometimes tantamount to a complete re-write).
From the filenames you're using in your question, it sounds like you're wanting to create a log file. Have you considered the Python logging module instead?
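If so, a minimal replacement for the print calls might look something like this (the file name and format are just placeholders):
import logging

logging.basicConfig(filename='out.log', level=logging.INFO, format='%(message)s')
logging.info('some stuff')  # instead of print(x)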
I think that semantics is important:
I would suggest the first approach for situations where you are printing the same stuff you would otherwise print to the console; the semantics stay the same. For more complex situations I would use the standard logging module.
The second and third approaches differ a bit when you are printing text lines: print adds the newline and write does not.
I would use the third approach when writing mainly binary or non-textual formats, and print with the file argument in most other cases.

how to get content of a small ascii file in python?

Let's say we want to implement an equivalent of PHP's file_get_contents.
What is the best practice? (elegant and reliable)
Here are some propositions; are they correct?
Using the with statement:
def file_get_contents(filename):
    with file(filename) as f:
        s = f.read()
    return s
is using standard open() safe?
def file_get_contents(filename):
    return open(filename).read()
What happens to file descriptor in either solution?
In the current implementation of CPython, both will generally immediately close the file. However, Python the language makes no such guarantee for the second one - the file will eventually be closed, but the finaliser may not be called until the next gc cycle. Implementations like Jython and IronPython work like this, so it's good practice to explicitly close your files.
I'd say using the first solution is the best practice, though open is generally preferred to file. Note that you can shorten it a little though if you prefer the brevity of the second example:
def file_get_contents(filename):
    with open(filename) as f:
        return f.read()
The __exit__ part of the context manager will execute when you leave the body for any reason, including exceptions and returning from the function - there's no need to use an intermediate variable.
import os

def file_get_contents(filename):
    if os.path.exists(filename):
        fp = open(filename, "r")
        content = fp.read()
        fp.close()
        return content
In this case it will return None if the file doesn't exist, and the file descriptor will be closed before we exit the function.
Using the with statement is actually the nicest way to be sure that the file is really closed.
Relying on the garbage collector's behavior for this task might work, but here there is a nice way to be sure in all cases, so...
with will make sure that the file is closed when the block is left.
In your second example, the file handle might remain open (Python makes no guarantee that it's closed or when if you don't do it explicitly).
You can also use the approach shown in the Python 3 tutorial:
>>> ''.join(open('htdocs/config.php', 'r').readlines())
"This is the first line of the file.\nSecond line of the file"
Read more here http://docs.python.org/py3k/tutorial/inputoutput.html
