Remember variables after code execution - python

Is it possible to run Python code in Eclipse (PyDev) and reuse variables computed in previously executed code (similar to using the console and interpreting code in real time as it is entered)?
Details: I want to use Python for experimenting with signal processing, where two computationally intensive filters are applied to the signal in a row. Each filter takes some time, and it would be nice to remember the result of the first filter without having to recompute it at each launch.

Or just do it with pickle:
import pickle

Reading a "cache" / database:

with open('database.db', 'rb') as fh:
    db = pickle.load(fh)

Adding to it:

db = {}
db['new_user'] = 'password'
with open('database.db', 'wb') as fh:
    pickle.dump(db, fh)
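Applied to the signal-processing use case from the question, a minimal sketch could look like the following (filtered.pkl, expensive_filter, second_filter and signal are placeholders for your own file name, filters and data):

import os
import pickle

CACHE_PATH = 'filtered.pkl'  # hypothetical cache file

if os.path.exists(CACHE_PATH):
    # Reuse the result of the first filter from a previous run
    with open(CACHE_PATH, 'rb') as fh:
        filtered = pickle.load(fh)
else:
    # Compute it once and cache it for the next launch
    filtered = expensive_filter(signal)  # placeholder for the first filter
    with open(CACHE_PATH, 'wb') as fh:
        pickle.dump(filtered, fh)

result = second_filter(filtered)  # placeholder for the second filter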

Decorate your functions with Simple Cache and it will save the parameters/result hash to disk. I should point out that it only works when the arguments are of an immutable type (no lists, dictionaries, ...). Otherwise, you can handle cached results through the API exposed by Simple Cache, or use pickle to serialize results to disk and load them later (which is what simple_cache does internally, anyway). The sketch below illustrates the idea.
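This is not Simple Cache's actual API, just a minimal pickle-based memoization sketch under the same constraint of hashable arguments; the cache file name is a placeholder:

import functools
import os
import pickle

def disk_cache(path):
    """Cache results on disk, keyed by the function's (hashable) arguments."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            cache = {}
            if os.path.exists(path):
                with open(path, 'rb') as fh:
                    cache = pickle.load(fh)
            key = (args, tuple(sorted(kwargs.items())))  # requires hashable arguments
            if key not in cache:
                cache[key] = func(*args, **kwargs)
                with open(path, 'wb') as fh:
                    pickle.dump(cache, fh)
            return cache[key]
        return wrapper
    return decorator

@disk_cache('filter_cache.pkl')  # hypothetical cache file
def slow_filter(cutoff, order):
    ...  # expensive computation here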

Python read/write vs shutil copy

I need to save files uploaded to my server (Max file size is 10MB) and found this answer, which works perfectly. However, I'm wondering what is the point of using the shutil module, and what is the difference between this:
file_location = f"files/{uploaded_file.filename}"
with open(file_location, "wb+") as file_object:
    file_object.write(uploaded_file.file.read())

and this:

import shutil

file_location = f"files/{uploaded_file.filename}"
with open(file_location, "wb+") as file_object:
    shutil.copyfileobj(uploaded_file.file, file_object)
During my programming experience, I have come across the shutil module multiple times, but I still can't figure out what its benefits are over the read() and write() methods.
Your method requires the whole file to be in memory. shutil copies in chunks, so you can copy files larger than memory. Also, shutil has routines to copy files by name so you don't have to open them at all, and it can preserve the permissions, ownership, and creation/modification/access timestamps.
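For completeness, a minimal sketch of those copy-by-name routines (the file paths are placeholders):

import shutil

# copy() copies the data and the permission bits;
# copy2() additionally tries to preserve modification/access timestamps and other metadata.
shutil.copy('source.bin', 'backup/source.bin')   # hypothetical paths
shutil.copy2('source.bin', 'backup/source.bin')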
I would like to highlight a few points with regard to OP's question and the (currently accepted) answer by @Tim Roberts:
"shutil copies in chunks so you can copy files larger than memory". You can also copy a file in chunks using read()—please
have a look at the short example below, as well as this and this answer for more
details—just like you can load the whole file into memory
using shutil.copyfileobj(), by giving a negative length value.
with open(uploaded_file.filename, 'wb') as f:
    while contents := uploaded_file.file.read(1024 * 1024):  # adjust the chunk size as desired
        f.write(contents)
Under the hood, copyfileobj() uses a very similar approach to the above, utilising the read() and write() methods of file objects; hence, it would make little difference if you used one over the other. The source code of copyfileobj() can be seen below. The default buffer size, i.e., COPY_BUFSIZE below, is set to 1 MB (1024 * 1024 bytes) if running on Windows, or 64 KB (64 * 1024 bytes) on other platforms (see here).
def copyfileobj(fsrc, fdst, length=0):
    """copy data from file-like object fsrc to file-like object fdst"""
    if not length:
        length = COPY_BUFSIZE
    # Localize variable access to minimize overhead.
    fsrc_read = fsrc.read
    fdst_write = fdst.write
    while True:
        buf = fsrc_read(length)
        if not buf:
            break
        fdst_write(buf)
"shutil has routines to copy files by name so you don't have to open them at all..." Since OP seems to be using FastAPI
framework (which is actually
Starlette underneath), UploadFile exposes an actual Python
SpooledTemporaryFile (a file-like object) that you can get using the .file
attribute (source code can be found here). When FastAPI/Starlette creates a new instance of UploadFile, it already creates the SpooledTemporaryFile behind the scenes, which remains open. Hence, since you are dealing with a temporary
file that has no visible name in the file system—that would otherwise allow you to copy the contents without opening the file using shutil—and which is already open, it would make no
difference using either read() or copyfileobj().
"it can preserve the permissions, ownership, and creation/modification/access timestamps." Even though this is about saving a file uploaded through a web framework—and hence, most of these metadata wouldn't be transfered along with the file—as per the documentation, the above statement is not entirely true:
Warning: Even the higher-level file copying functions (shutil.copy(), shutil.copy2()) cannot copy all file metadata.
On POSIX platforms, this means that file owner and group are lost as well as ACLs. On Mac OS, the resource fork and other metadata are not used. This means that resources will be lost and file type and creator codes will not be correct. On Windows, file owners, ACLs and alternate data streams are not copied.
That being said, there is nothing wrong with using copyfileobj(). On the contrary, if you are dealing with large files and would like to avoid loading the entire file into memory (as you may not have enough RAM to accommodate all the data), and you would rather use copyfileobj() instead of a similar solution using the read() method (as described in point 1 above), it is perfectly fine to use shutil.copyfileobj(fsrc, fdst). Besides, since Python 3.8, shutil's file-copy functions use platform-specific fast-copy system calls where available, falling back to copyfileobj() otherwise. You can change the default buffer size by adjusting the length argument of copyfileobj().
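For instance, to copy with a larger chunk size than the default (a sketch; the 8 MB value and the file names are arbitrary):

import shutil

with open('source.bin', 'rb') as fsrc, open('dest.bin', 'wb') as fdst:
    # length sets the chunk size used for each read/write cycle
    shutil.copyfileobj(fsrc, fdst, length=8 * 1024 * 1024)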
Note
Using copyfileobj() inside a FastAPI def (sync) endpoint is perfectly fine, as a normal def endpoint in FastAPI is run in an external threadpool that is then awaited, instead of being called directly (which would otherwise block the server). On the other hand, async def endpoints run on the main (single) thread, and thus calling such a method, i.e., copyfileobj(), which performs blocking I/O operations (as shown in the source code), would result in blocking the entire server (for more information on def vs async def, please have a look at this answer). Hence, if you are about to call copyfileobj() from within an async def endpoint, you should make sure to run this operation, as well as all other file operations such as open() and close(), in a separate thread to ensure that the main thread (where coroutines are run) does not get blocked. You can do that using Starlette's run_in_threadpool(), which is also used by FastAPI internally when you call the async methods of the UploadFile object, as shown here. For instance:
await run_in_threadpool(shutil.copyfileobj, fsrc, fdst)
For more details and code examples, please have a look at this answer.
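Putting the pieces together, a minimal sketch of such an async def upload endpoint could look like the following (the files/ target directory is a placeholder, and running open()/close() through the threadpool as well is a choice made for this sketch):

import shutil
from fastapi import FastAPI, File, UploadFile
from fastapi.concurrency import run_in_threadpool  # re-exported from Starlette

app = FastAPI()

@app.post('/upload')
async def upload(uploaded_file: UploadFile = File(...)):
    file_location = f'files/{uploaded_file.filename}'  # hypothetical target directory
    # Run every blocking file operation in the external threadpool,
    # so the event loop (main thread) is not blocked.
    fdst = await run_in_threadpool(open, file_location, 'wb')
    try:
        await run_in_threadpool(shutil.copyfileobj, uploaded_file.file, fdst)
    finally:
        await run_in_threadpool(fdst.close)
    return {'filename': uploaded_file.filename}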

How can I check if a loaded Python function changed?

As a data scientist / machine learning developer, I have most of the time (always?) a load_data function. Executing this function often takes more than 5 minutes, because the operations it performs are expensive. When I store the end result of load_data in a pickle file and read that file again, the time often goes down to a few seconds.
So a solution I use quite often is:
import os
import mpu.io

def load_data(serialize_pickle_path, original_filepath):
    invalid_hash = True
    if os.path.exists(serialize_pickle_path):
        content = mpu.io.read(serialize_pickle_path)
        data = content['data']
        invalid_hash = mpu.io.hash(original_filepath) != content['hash']
    if invalid_hash:
        data = load_data_initial()
        filehash = mpu.io.hash(original_filepath)
        mpu.io.write(serialize_pickle_path, {'data': data, 'hash': filehash})
    return data
This solution has a major drawback: if load_data_initial changes, the cached file will not be regenerated.
Is there a way to check for changes in Python functions?
Assuming you're asking whether there's a way to tell whether someone changed the source code of the function between the last time you quit the program and the next time you start it…
There's no way to do this directly, but it's not that hard to do manually, if you don't mind getting a little hacky.
Since you've imported the module and have access to the function, you can use the getsource function to get its source code. So, all you need to do is save that source. For example:
import inspect
import os
import mpu.io

def source_match(source_path, object):
    try:
        with open(source_path) as f:
            source = f.read()
        if source == inspect.getsource(object):
            return True
    except Exception as e:
        # Maybe log e or something here, but any of the obvious problems,
        # like the file not existing or the function not being inspectable,
        # mean we have to re-generate the data
        pass
    return False

def load_data(serialize_pickle_path, original_filepath):
    invalid_hash = True
    if os.path.exists(serialize_pickle_path):
        if source_match(serialize_pickle_path + '.sourcepy', load_data_initial):
            content = mpu.io.read(serialize_pickle_path)
            data = content['data']
            invalid_hash = mpu.io.hash(original_filepath) != content['hash']
    # etc., but make sure to save the source when you save the pickle too
Of course even if the body of the function hasn't changed, its effect might change because of, e.g., a change in some module constant, or the implementation of some other function it uses. Depending on how much this matters, you could pull in the entire module it's defined in, or that module plus every other module that it recursively depends on, etc.
And of course you can also save hashes of text instead of the full text, to make things a little smaller. Or embed them in the pickle file instead of saving them alongside.
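A minimal sketch of the hash variant (storing a SHA-256 digest of the function's source next to the pickle; the .sourcehash suffix is an arbitrary choice):

import hashlib
import inspect

def source_digest(func):
    """Hex digest of the function's current source code."""
    return hashlib.sha256(inspect.getsource(func).encode()).hexdigest()

def source_unchanged(hash_path, func):
    try:
        with open(hash_path) as f:
            return f.read().strip() == source_digest(func)
    except OSError:
        return False

# When writing the pickle, also record the digest:
# with open(serialize_pickle_path + '.sourcehash', 'w') as f:
#     f.write(source_digest(load_data_initial))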
Also, if the source isn't available because it comes from a module you only distribute in .pyc format, you obviously can't check the source. You could pickle the function, or just access its __code__ attribute. But if the function comes from a C extension module, even that won't work. At that point, the best you can do is check the timestamp or hash of the whole binary file.
And plenty of other variations. But this should be enough to get you started.
A completely different alternative is to do the checking as part of your development workflow, instead of as part of the program.
Assuming you're using some kind of version control (if not, you really should be), most systems come with some kind of commit hook mechanism. git, for example, comes with a whole slew of options: if you have an executable named .git/hooks/pre-commit, it will be run every time you try to git commit.
Anyway, the simplest pre-commit hook would be something like (untested):
#!/bin/sh
# Check the staged changes for the module that defines load_data_initial;
# "|| true" keeps the hook from aborting the commit when there is no match.
git diff --cached --name-only | grep -q module_with_load_function.py && python /path/to/pickle/cleanup/script.py || true
Now, every time you do a git commit, if the staged changes include a file named module_with_load_function.py (obviously use the name of the file with load_data_initial in it), it will first run the script /path/to/pickle/cleanup/script.py (which is a script you write that just deletes all the cached pickle files), as sketched below.
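The cleanup script itself can be as small as this (a sketch; the cache directory and the *.pkl pattern are placeholders for wherever and however you store the pickles):

#!/usr/bin/env python
# Hypothetical /path/to/pickle/cleanup/script.py: delete all cached pickle files.
import glob
import os

CACHE_DIR = 'cache'  # placeholder for the directory holding the pickles

for path in glob.glob(os.path.join(CACHE_DIR, '*.pkl')):
    print('removing stale cache file:', path)
    os.remove(path)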
If you've edited the file but know you don't need to clean out the pickles, you can just git commit --no-verify. Or you can expand on the script to have an environment variable that you can use to skip the cleaning, or to only clean certain directories, or whatever you want. (It's probably better to default to cleaning overzealously—worst-case scenario, when you forget every few weeks, you waste 5 minutes waiting, which isn't as bad as waiting 3 hours for it to run a bunch of processing on incorrect data, right?)
You can expand on this to, e.g., check the complete diffs and see if they include the function, instead of just checking the filenames. The hooks are just anything executable, so you can write them in Python instead of bash if they get too complicated.
If you don't know git all that well (or even if you do), you'll probably be happier installing a third-party library like pre-commit that makes it easier to manage hooks, write them in Python (without having to deal with complicated git commands), etc. If you are comfortable with git, just looking at hooks--pre-commit.sample and some of the other samples in the templates directory should be enough to give you ideas.

How to avoid re-importing modules and re-defining large object every time a script runs

This must have an answer, but I can't find it. I am using a quite large Python module called quippy. With this module one can define an intermolecular potential to use as a calculator in ASE like so:
from quippy import *
from ase import atoms
pot = Potential("Potential xml_label=gap_h2o_2b_ccsdt_3b_ccsdt", param_filename="gp.xml")
some_structure.set_calculator(pot)
This is the beginning of a script. The problem is that the import takes about 3 seconds and pot=Potential... takes about 30 seconds at 100% CPU load. (I believe this is due to parsing a large ASCII XML file.) If I were typing interactively, I could keep the module imported and the potential defined, but when running the script, this is done again on each run.
Can I save the module and the potential object in memory/disk between runs? Maybe keep a python process idling and keeping those things in memory? Or run these lines in the interpreter and somehow call the rest of the script from there?
Any approach is fine, but some help would be appreciated!
You can either use raw files or modules such as pickle to store data easily.
import cPickle as pickle
from quippy import Potential

try:  # try a previously calculated value
    with open('/tmp/pot_store.pkl', 'rb') as store:
        pot = pickle.load(store)
except (IOError, OSError):  # fall back to calculating it from scratch
    pot = Potential("Potential xml_label=gap_h2o_2b_ccsdt_3b_ccsdt", param_filename="gp.xml")
    with open('/tmp/pot_store.pkl', 'wb') as store:
        pickle.dump(pot, store)
There are various optimizations to this, e.g. checking whether your pickle file is older than the file its value is generated from.
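For instance, that staleness check could be sketched like this, assuming the potential is generated from gp.xml as in the question:

import os

def cache_is_fresh(cache_path='/tmp/pot_store.pkl', source_path='gp.xml'):
    """True if the cached pickle exists and is newer than the file it was built from."""
    return (os.path.exists(cache_path) and os.path.exists(source_path)
            and os.path.getmtime(cache_path) >= os.path.getmtime(source_path))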
I found one solution, but I'm interested in alternatives. You can divide the script into two parts:
start.py:
from quippy import Potential
from ase import atoms
pot=Potential(... etc...
body.py:
for i in range(max_int):
    print "doing things"
    # etc...
Then enter the Python interpreter and run the start script only once, but the body as many times as needed:
me#laptop:~/dir$ python
>>> execfile('start.py')
>>> execfile('body.py')
>>> #(change code of "body.py" in editor)
>>> execfile('body.py') # again without reloading "start.py"
So this means a terminal is occupied and the script has to be split in two, but it works.
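Note that execfile() exists only in Python 2; under Python 3 the same workflow can be approximated with exec() (a rough sketch):

# Python 3 replacement for execfile('start.py')
with open('start.py') as f:
    exec(f.read())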

printing to a file in Python: redirect vs print's file argument vs write

I have a bunch of print calls that I need to write to a file instead of stdout. (I don't need stdout at all.)
I am considering three approaches. Are there any advantages (including performance) to any one of them?
Full redirect, which I saw here:
import sys
saveout = sys.stdout
fsock = open('out.log', 'w')
sys.stdout = fsock
print(x)
# and many more print calls
# later if I ever need it:
# sys.stdout = saveout
# fsock.close()
Redirect in each print statement:
fsock = open('out.log', 'w')
print(x, file=fsock)
# and many more print calls
Write function:
fsock = open('out.log', 'w')
fsock.write(str(x))
# and many more write calls
I would not expect any significant performance differences among these approaches.
The advantage of the first approach is that any reasonably well-behaved code which you rely upon (modules you import) will automatically pick up your desired redirection.
The second approach has no advantage. It's only suitable for debugging or throwaway code ... and not even a good idea for that. You want your output decisions to be consolidated in a few well-defined places, not scattered across your code in every call to print(). In Python 3, print() is a function rather than a statement, which allows you to re-define it if you like. So you can def print(*args) if you want. You can also call __builtins__.print() if you need access to it within the definition of your own custom print(), for example.
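For example, a minimal sketch of such a custom print() that sends everything to a file by default (out.log mirrors the question; the sketch imports the builtins module rather than relying on the __builtins__ name):

import builtins

_logfile = open('out.log', 'w')  # hypothetical log file

def print(*args, **kwargs):
    # Forward to the built-in print, defaulting the output to the log file
    kwargs.setdefault('file', _logfile)
    builtins.print(*args, **kwargs)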
The third approach ... and by extension the principle that all of your output should be generated in specific functions and class methods that you define for that purpose ... is probably best.
You should keep your output and formatting separated from your core functionality as much as possible. By keeping them separate you allow your core to be re-used. For example, you might start with something that's intended to run from a text/shell console, and later need to provide a Web UI, a full-screen (curses) front end, or a GUI for it. You may also build entirely different functionality around it ... in situations where the resulting data needs to be returned in its native form (as objects) rather than pulled in as text (output) and re-parsed into new objects.
For example, I've had more than one occasion where something I wrote to perform some complex queries and data gathering from various sources and print a report ... say, of the discrepancies ... later needed to be adapted to spit out the data in some form (such as YAML/JSON) that could be fed into some other system (say, for reconciling one data source against another).
If, from the outset, you keep the main operations separate from the output and formatting then this sort of adaptation is relatively easy. Otherwise it entails quite a bit of refactoring (sometimes tantamount to a complete re-write).
From the filenames you're using in your question, it sounds like you're wanting to create a log file. Have you considered the Python logging module instead?
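A minimal sketch of the logging approach (writing to out.log as in the question; the format string is an arbitrary choice):

import logging

logging.basicConfig(filename='out.log', level=logging.INFO,
                    format='%(asctime)s %(levelname)s %(message)s')

logging.info('this line goes to out.log instead of stdout')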
I think that semantics is important:
I would suggest the first approach for situations when you are printing the same stuff you would print to the console. The semantics will be the same. For more complex situations, I would use the standard logging module.
The second and third approaches are a bit different if you are printing text lines: print adds a newline and write does not.
I would use the third approach mainly when writing binary or other non-textual formats, and the redirect in the print call (the second approach) in most other cases.

does close() imply flush() in Python?

In Python, and in general - does a close() operation on a file object imply a flush() operation?
Yes. It uses the underlying close() function which does that for you (source).
NB: close() and flush() won't ensure that the data is actually secure on the disk. They just ensure that the OS has the data, i.e., that it isn't buffered inside the process.
You can try sync or fsync to get the data written to the disk.
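For example, a minimal sketch that forces the data down to the disk before closing:

import os

with open('out.log', 'w') as f:   # hypothetical file
    f.write('important data\n')
    f.flush()                     # push Python's internal buffer to the OS
    os.fsync(f.fileno())          # ask the OS to commit the data to the disk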
Yes, in Python 3 this is finally in the official documentation, but it was already the case in Python 2 (see Martin's answer).
As a complement to this question: yes, Python flushes before close; however, if you want to ensure that data is written properly to disk, this is not enough.
This is how I would write a file in a way that it is atomically updated on a UNIX/Linux server, whether the target file exists or not. Note that some filesystems will implicitly commit data to disk on close+rename (ext3 with data=ordered (the default); ext4 initially uncovered many application flaws before adding detection of write-close-rename patterns and syncing data before metadata in those cases [1]).
import json
import os
import tempfile

# destfile is the final path of the file being replaced atomically.
# Write destfile, using a temporary name .<name>_XXXXXXXX
base, name = os.path.split(destfile)
tmpname = os.path.join(base, '.{}_'.format(name))  # This is the tmpfile prefix
with tempfile.NamedTemporaryFile('w', prefix=tmpname, delete=False) as fd:
    # Replace prefix with actual file path/name
    tmpname = str(fd.name)
    try:
        # Write fd here... ex:
        json.dump({}, fd)
        # We want to fdatasync before closing, so we need to flush before close anyway
        fd.flush()
        os.fdatasync(fd)
        # Since we're using a tmpfile, we need to also set the proper permissions
        if os.path.exists(destfile):
            # Copy the destination file's mode
            os.fchmod(fd.fileno(), os.stat(destfile).st_mode)
        else:
            # Set the mode based on the current umask value
            umask = os.umask(0o22)
            os.umask(umask)
            os.fchmod(fd.fileno(), 0o666 & ~umask)  # 0o777 for dirs and executable files
        # Now we can close and rename the file (overwriting any existing one)
        fd.close()
        os.rename(tmpname, destfile)
    except:
        # On error, try to clean up the temporary file
        try:
            os.unlink(tmpname)
        except OSError:
            pass
        raise
IMHO it would have been nice if Python provided simple methods around this... At the same time I guess if you care about data consistency it's probably best to really understand what is going on at a low level, especially since there are many differences across various Operating Systems and Filesystems.
Also note that this does not guarantee the written data can be recovered, only that you will get a consistent copy of the data (old or new). To ensure the new data is safely written and accessible when returning, you need to use os.fsync(...) after the rename, and even then, if you have unsafe caches in the write path, you could still lose data. This is common on consumer-grade hardware, although any system can be configured for unsafe writes, which boosts performance too. At least, even with unsafe caches, the method above should still guarantee that whichever copy of the data you get is valid.
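One way to perform that extra step on POSIX, as I read the os.fsync(...) suggestion above, is to fsync the directory containing the renamed file so the rename itself is made durable (a sketch):

import os

def fsync_directory(path):
    """Flush the directory entry so a completed rename survives a crash (POSIX only)."""
    dirfd = os.open(path, os.O_DIRECTORY)
    try:
        os.fsync(dirfd)
    finally:
        os.close(dirfd)

# After os.rename(tmpname, destfile):
# fsync_directory(os.path.dirname(destfile) or '.')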
filehandle.close does not necessarily flush. Surprisingly, filehandle.flush doesn't help either; the data can still get stuck in the OS buffers while Python is running. Observe this session where I wrote to a file, closed it, pressed Ctrl-Z to get back to the shell command prompt, and examined the file:
$ cat xyz
ghi
$ fg
python
>>> x=open("xyz","a")
>>> x.write("morestuff\n")
>>> x.write("morestuff\n")
>>> x.write("morestuff\n")
>>> x.flush
<built-in method flush of file object at 0x7f58e0044660>
>>> x.close
<built-in method close of file object at 0x7f58e0044660>
>>>
[1]+ Stopped python
$ cat xyz
ghi
Subsequently I can reopen the file, and that necessarily syncs the file (because, in this case, I open it in append mode). As the others have said, the sync syscall (available from the os module) should flush all buffers to disk, but it has possible system-wide performance implications (it syncs all files on the system).
