Python Unit Testing with PyTables and HDF5

What is a proper way to do Unit Testing with file IO, especially if it involves PyTables and HDF5?
My application revolves around storing and retrieving Python data in HDF5 files. So far I simply write the HDF5 files in the unit tests myself and load them for comparison. The problem is that I, of course, cannot be sure that someone else running the tests has the privileges to actually write files to the hard disk. (This probably gets even worse when I want to use automated test frameworks like Jenkins, but I haven't checked that yet.)
What is a proper way to handle these situations? Is it best practice to create a /tmp/ folder at a particular place where write access is very likely to be granted? If so, where is that? Or is there an easy and straightforward way to mock PyTables writing and reading?
Thanks a lot!

How about using the module "tempfile" to create the files?
http://docs.python.org/2/library/tempfile.html
I don't know if it's guaranteed to work on all platforms but I bet it does work on most common ones. It would certainly be better practice than hardcoding "/tmp" as the destination.
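For example, a minimal sketch of creating a throwaway HDF5 file with tempfile (the .h5 suffix and the array contents are just illustrative; open_file/create_array are the PyTables 3.x names):

import os
import tempfile

import tables

# Ask the OS for a temporary file path; delete=False so PyTables can reopen it by name.
tmp = tempfile.NamedTemporaryFile(suffix=".h5", delete=False)
tmp.close()
try:
    with tables.open_file(tmp.name, mode="w") as h5:
        h5.create_array("/", "example", [1, 2, 3])
finally:
    os.remove(tmp.name)  # clean up after the test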
Another way would be to create an HDF5 database in memory so that no file I/O is required.
http://pytables.github.io/cookbook/inmemory_hdf5_files.html
I obtained that link by googling "hdf5 in memory" so I can't say for sure how well it works.
I think the best practice would be writing all test cases to run against both an in-memory database and a tempfile database. This way, even if one of the above techniques fails for the user, the rest of the tests will still run. Also you can separately identify whether bugs are related to file-writing or something internal to the database.
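A sketch of that idea, assuming PyTables 3.x (the in-memory variant follows the cookbook recipe linked above, using the H5FD_CORE driver with the backing store disabled):

import tempfile
import unittest

import tables


def open_test_file(in_memory):
    # Open a throwaway HDF5 file either purely in memory or on disk.
    if in_memory:
        # H5FD_CORE keeps everything in RAM; backing_store=0 means the file never touches disk.
        return tables.open_file("in_memory.h5", mode="w",
                                driver="H5FD_CORE",
                                driver_core_backing_store=0)
    path = tempfile.NamedTemporaryFile(suffix=".h5", delete=False).name
    return tables.open_file(path, mode="w")


class RoundTripTest(unittest.TestCase):
    def check_roundtrip(self, in_memory):
        with open_test_file(in_memory) as h5:
            h5.create_array("/", "data", [1, 2, 3])
            self.assertEqual(list(h5.root.data[:]), [1, 2, 3])

    def test_in_memory(self):
        self.check_roundtrip(in_memory=True)

    def test_on_disk(self):
        # A real suite would also remove the temp file afterwards (e.g. in tearDown).
        self.check_roundtrip(in_memory=False)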

Fundamentally, HDF5 and PyTables are I/O libraries. They provide an API for file system manipulation. Therefore, if you really want to test PyTables/HDF5, you have to hit the file system. There is no way around this. If a user does not have write access on a system, they cannot run the tests; or at least they cannot run realistic tests.
You can use the in-memory file driver to do testing. This is useful for speeding up most tests and for testing higher-level functionality. However, even if you go this route, you should still have a few tests which actually write out real files. If these fail, you know that something is wrong.
Normally, people create the temporary h5 files in the tests directory. But if you are truly worried about the user not having write access to this dir, you should use tempfile.gettempdir() to find their environment's correct /tmp dir. Note that this is cross-platform so should work everywhere. Put the h5 files that you create there and remember to delete them afterwards!
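A minimal sketch of that pattern (the file name and test body are illustrative):

import os
import tempfile
import unittest

import tables


class H5StorageTest(unittest.TestCase):
    def setUp(self):
        # Create the temporary .h5 file in the platform's temp dir.
        fd, self.h5_path = tempfile.mkstemp(suffix=".h5", dir=tempfile.gettempdir())
        os.close(fd)  # PyTables will reopen the file by name

    def tearDown(self):
        # Always clean up, even if the test failed.
        os.remove(self.h5_path)

    def test_write_and_read(self):
        with tables.open_file(self.h5_path, mode="w") as h5:
            h5.create_array("/", "values", [1.0, 2.0, 3.0])
        with tables.open_file(self.h5_path, mode="r") as h5:
            self.assertEqual(len(h5.root.values), 3)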

Related

Is it dangerous to create a non-temp file in a temp dir?

In python I am creating a tempdir (using tempfile.TemporaryDirectory), writing a few text files within it, and then (after processing the files) calling tempdir.cleanup(). The program works, but I'm wondering if there are any dangers to this that aren't immediately visible, especially when working with a large number of files.
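For reference, a minimal sketch of the pattern being described (file names and contents are placeholders; TemporaryDirectory requires Python 3.2+):

import os
import tempfile

tempdir = tempfile.TemporaryDirectory()
try:
    # write a few text files into the temporary directory
    for i in range(3):
        with open(os.path.join(tempdir.name, "part_%d.txt" % i), "w") as fh:
            fh.write("some intermediate data\n")
    # ... process the files here ...
finally:
    tempdir.cleanup()  # removes the directory and everything in it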
Is it a bad practice? It certainly is. Is it dangerous? Depending on the name, it might get overwritten by another application. Or it might not.
The entire temp folder might be wiped between reboots (Linux does this; not sure about Windows). Or it might not. The user might delete files in the temp folder to free up space. Or they might not.
Generally, it's just bad practice, so don't do it. If your file is non-temporary, then don't put it in the temp folder.

Reading encrypted files into pandas

Update: I have asked a new question that gives a full code example: Decrypting a file to a stream and reading the stream into pandas (hdf or stata)
My basic problem is that I need to keep data encrypted and then read into pandas. I'm open to a variety of solutions but the encryption needs to be AES256. As of now, I'm using PyCrypto, but that's not a requirement.
My current solution is:
Decrypt into a temporary file (CSV, HDF, etc.)
Read the temp file into pandas
Delete the temp file
That's far from ideal because there is temporarily an unencrypted file sitting on the hard drive, and with user error it could stay there longer than intended. Equally bad, the I/O is essentially tripled, as an unencrypted file is written out and then read into pandas.
Ideally, encryption would be built into HDF or some other binary format that pandas can read, but as far as I can tell it isn't.
(Note: this is on a linux box, so perhaps there is a shell script solution, although I'd probably prefer to avoid that if it can all be done inside of python.)
Second best, and still a big improvement, would be to decrypt the file into memory and read it directly into pandas without ever creating a new (unencrypted) file. So far I haven't been able to do that, though.
Here's some pseudo code to hopefully illustrate.
# this works, but less safe and IO intensive
decrypt_to_file('encrypted_csv', 'decrypted_csv') # outputs decrypted file to disk
pd.read_csv('decrypted_csv')
# this is what I want, but don't know how to make it work
# no decrypted file is ever created
pd.read_csv(decrypt_to_memory('encrypted_csv'))
So that's what I'm trying to do, but also interested in other alternatives that accomplish the same thing (are efficient and don't create a temp file).
Update: Probably there is not going to be a direct answer to this question -- not too surprising, but I thought I would check. I think the answer will involve something like BytesIO (mentioned by DSM) or mmap (mentioned by Mad Physicist), so I'm exploring those. Thanks to all who made a sincere attempt to help here.
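For anyone exploring the BytesIO route, here is a rough sketch of the idea. It assumes the file was encrypted with PyCrypto using AES-256 in CBC mode, with the 16-byte IV prepended and PKCS7 padding; adjust to however your files were actually encrypted, and supply your own key handling (the key below is a placeholder):

import io

import pandas as pd
from Crypto.Cipher import AES  # PyCrypto


def decrypt_to_memory(path, key):
    # Decrypt the file and return its plaintext as an in-memory buffer.
    with open(path, "rb") as fh:
        iv = fh.read(16)       # assumes the IV was written first
        ciphertext = fh.read()
    plaintext = AES.new(key, AES.MODE_CBC, iv).decrypt(ciphertext)
    plaintext = plaintext[:-ord(plaintext[-1:])]  # strip PKCS7 padding
    return io.BytesIO(plaintext)


key = b"0" * 32  # placeholder; AES-256 needs exactly 32 key bytes
df = pd.read_csv(decrypt_to_memory("encrypted_csv", key))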
If you are already using Linux and are looking for a "simple" alternative that does not involve encrypting/decrypting at the Python level, you could use native file system encryption with ext4.
This approach might make your installation complicated, but it has the following advantages:
Zero risk of leakage via temporary file.
Fast, since the native encryption is in C (although PyCrypto is also written in C, I would guess the kernel-level implementation is faster).
Disadvantages:
You need to learn to work with the specific file system commands.
Your current Linux kernel may be too old.
You don't know how to upgrade, or can't upgrade, your Linux kernel.
As for writing the decrypted file to memory, you can use /dev/shm as your write location, sparing you the need to do complicated streaming or to override pandas methods.
In short, /dev/shm is backed by memory (on some systems /tmp is a tmpfs mount and behaves the same way), and it is much faster than your normal hard drive.
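A sketch of that idea, reusing the hypothetical decrypt_to_file helper from the question:

import os

import pandas as pd

# /dev/shm is a RAM-backed tmpfs on most Linux distributions
shm_path = "/dev/shm/decrypted_csv"
try:
    decrypt_to_file("encrypted_csv", shm_path)  # hypothetical helper from the question
    df = pd.read_csv(shm_path)
finally:
    if os.path.exists(shm_path):
        os.remove(shm_path)  # don't leave the plaintext lying around, even in RAM-backed storage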
I hope this helps you in a way.

Debugging a python script which first needs to read large files. Do I have to load them every time anew?

I have a python script which starts by reading a few large files and then does something else. Since I want to run this script multiple times and change some of the code until I am happy with the result, it would be nice if the script did not have to read the files every time anew, because they will not change. So I mainly want to use this for debugging.
It happens too often that I run scripts with bugs in them, and I only see the error message minutes later because reading the files took so long.
Are there any tricks to do something like this?
(If it is feasible, I create smaller test files)
I'm not good at Python, but it seems to be able to dynamically reload code from a changed module: How to re import an updated package while in Python Interpreter?
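For reference, the reload approach looks roughly like this in an interactive session (the module and function names are made up):

import importlib

import mymodule                    # hypothetical module you keep editing

data = mymodule.read_big_files()   # the slow step, done once per session

# ...edit mymodule.py, then, without restarting the interpreter:
importlib.reload(mymodule)         # on Python 2, use the builtin reload(mymodule)
result = mymodule.process(data)    # re-run the changed code on the already-loaded data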
Some other suggestions not directly related to Python.
Firstly, try to create a smaller test file. Is the whole file required to demonstrate the bug you are observing? Most probably it is only a small part of your input file that is relevant.
Secondly, are these particular files required, or will the problem show up on any large amount of data? If it shows up only on particular files, then once again it is most probably related to some feature of these files and will also show up on a smaller file with the same feature. If the main reason is just the sheer amount of data, you might be able to avoid reading it by generating some random data directly in the script.
Thirdly, what is the bottleneck when reading the files? Is it just hard drive performance, or do you do some heavy processing of the read data in your script before actually coming to the part that generates problems? In the latter case, you might be able to do that processing once, write the results to a new file, and then modify your script to load this processed data instead of redoing the processing each time, as sketched below.
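A simple way to do that is to cache the processed result with pickle, so that reruns skip the slow step (the cache file name and the slow function are illustrative):

import os
import pickle

CACHE = "processed_data.pkl"

def load_processed_data():
    if os.path.exists(CACHE):
        # fast path: reuse the result of the expensive read/processing
        with open(CACHE, "rb") as fh:
            return pickle.load(fh)
    data = read_and_process_big_files()  # hypothetical slow step
    with open(CACHE, "wb") as fh:
        pickle.dump(data, fh)
    return data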
If the hard drive performance is the issue, consider a faster filesystem. On Linux, for example, you might be able to use /dev/shm.

How to save program settings to computer?

I'm looking to store some individual settings to each user's computer. Things like preferences and a license key. From what I know, saving to the registry could be one possibility. However, that won't work on Mac.
One of the easy but not so proper techniques is just saving them to a settings.txt file and reading that on load.
Is there a proper way to save this kind of data? I'm hoping to use my wx app on Windows and Mac.
There is no proper way. Use whatever works best for your particular scenario. Some common ways for storing user data include:
Text files (e.g. Windows INI, cfg files)
binary files (sometimes compressed)
Windows registry
system environment variables
online profiles
There's nothing wrong with using text files. A lot of proper applications use them, precisely because they are easy to implement and also human readable. The only thing you need to worry about is making sure you have some form of error handling in place, in case the user decides to replace your config file's content with rubbish.
Take a look at Data Persistence in the Python docs. One option, as you said, could be to persist them to a simple text file. Or you can save your data using some serialization format such as pickle (see the previous link) or JSON, though that can become inefficient or overly complex depending on how many keys and values you have.
Also, you could save user preferences in an .ini file using Python's ConfigParser module, as shown in this SO answer.
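A minimal sketch of the ConfigParser approach (storing the file under the user's home directory is just one common choice):

import os

try:
    import configparser                    # Python 3
except ImportError:
    import ConfigParser as configparser    # Python 2

SETTINGS_PATH = os.path.join(os.path.expanduser("~"), ".myapp_settings.ini")

def save_settings(license_key, theme):
    config = configparser.ConfigParser()
    config.add_section("preferences")
    config.set("preferences", "license_key", license_key)
    config.set("preferences", "theme", theme)
    with open(SETTINGS_PATH, "w") as fh:
        config.write(fh)

def load_settings():
    config = configparser.ConfigParser()
    config.read(SETTINGS_PATH)             # silently ignores a missing file
    if config.has_section("preferences"):
        return dict(config.items("preferences"))
    return {}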
Finally, you can use a database like sqlite3, which makes it simple to save and retrieve preferences from your code.

Is there a standard way, across operating systems, of adding "tags" to files

I'm writing a script to make backups of various different files. What I'd like to do is store meta information about the backup. Currently I'm using the file name, so for example:
backups/cool_file_bkp_c20120119_104955_d20120102
Where c represents the file creation datetime, and d represents the "data time", which describes what cool_file actually contains. The reason I currently use "data time" is that a later backup may be made of the same file, in which case I know I can safely replace the previous backup that has the same "data time" without losing any information.
It seems like an awful way to do things, but it does seem to have the benefit of being non-os dependent. Is there a better way?
FYI: I am using Python to script my backup creation, and currently need to have this working on Windows XP, 2003, and Redhat Linux.
EDIT: Solution:
From the answers below, I've inferred that metadata on files is not widely supported in a standard way. Given my goal was to tightly couple the metadata with the file, it seems that archiving the file alongside a metadata textfile is the way to go.
I'd take one of two approaches there:
Create a standalone file in the backup dir that would contain the desired metadata - this could be something in human-readable form, just to make life easier, such as a JSON data structure or an "ini"-like file.
The other is to archive the copied files - possibly using "zip" - and bundle along with them a text file with the desired metadata.
The idea of creating zip archives to group files that belong together is used in several places, like Java .jar files, Open Document Format (office files created by several office suites), Office Open XML (Microsoft-specific office files), and even Python's own eggs.
The zipfile module in Python's standard library has all the tools necessary to accomplish this - you can just bundle a file containing a dictionary's representation along with the original one to have as much metadata as you need.
In either of these approaches you will also need a helper script to let you see and filter the metadata on the files, of course.
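A sketch of the zip-with-metadata approach using only the standard library (the function and file names are illustrative):

import json
import zipfile

def backup_with_metadata(src_path, archive_path, metadata):
    # Bundle the file and a JSON description of it into one archive.
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write(src_path)
        zf.writestr("metadata.json", json.dumps(metadata, indent=2))

def read_metadata(archive_path):
    with zipfile.ZipFile(archive_path) as zf:
        return json.loads(zf.read("metadata.json").decode("utf-8"))

# e.g. backup_with_metadata("cool_file", "backups/cool_file.zip",
#                           {"created": "20120119_104955", "data_time": "20120102"})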
Different file systems (not different operating systems) have different capabilities for storing metadata. NTFS has plenty of possibilities, while FAT is very limited, and ext* are somewhere in between. None of the widespread (subjective term, yes) filesystems supports custom tags that you could use. Consequently there exists no standard way to work with such tags.
On Windows there was an attempt to introduce Extended Attributes, but these were implemented in such a tricky way that they were almost unusable.
So putting whatever you can into the filename remains the only working approach. Remember that filesystems have limitations on file name and file path length, and with this approach you can exceed the limit, so be careful.
