Reading encrypted files into pandas - python

Update: I have asked a new question that gives a full code example: Decrypting a file to a stream and reading the stream into pandas (hdf or stata)
My basic problem is that I need to keep data encrypted and then read into pandas. I'm open to a variety of solutions but the encryption needs to be AES256. As of now, I'm using PyCrypto, but that's not a requirement.
My current solution is:
Decrypt into a temporary file (CSV, HDF, etc.)
Read the temp file into pandas
Delete the temp file
That's far from ideal because an unencrypted file temporarily sits on the hard drive, and with user error it could stay there longer than temporarily. Equally bad, the IO is essentially tripled, as an unencrypted file is written out and then read back into pandas.
Ideally, encryption would be built into HDF or some other binary format that pandas can read, but it doesn't seem to be as far as I can tell.
(Note: this is on a linux box, so perhaps there is a shell script solution, although I'd probably prefer to avoid that if it can all be done inside of python.)
Second best, and still a big improvement, would be to decrypt the file into memory and read it directly into pandas without ever creating a new (unencrypted) file. So far I haven't been able to do that though.
Here's some pseudo code to hopefully illustrate.
# this works, but less safe and IO intensive
decrypt_to_file('encrypted_csv', 'decrypted_csv') # outputs decrypted file to disk
pd.read_csv('decrypted_csv')
# this is what I want, but don't know how to make it work
# no decrypted file is ever created
pd.read_csv(decrypt_to_memory('encrypted_csv'))
So that's what I'm trying to do, but also interested in other alternatives that accomplish the same thing (are efficient and don't create a temp file).
Update: Probably there is not going to be a direct answer to this question -- not too surprising, but I thought I would check. I think the answer will involve something like BytesIO (mentioned by DSM) or mmap (mentioned by Mad Physicist), so I'm exploring those. Thanks to all who made a sincere attempt to help here.
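A minimal sketch of the BytesIO route mentioned in the update, under some loud assumptions: the helper names (decrypt_to_memory, xor_decrypt) are made up for illustration, and the XOR routine is only a placeholder for a real AES-256 decrypt call (it is not secure). The point it tries to show is that the plaintext only ever lives in an in-memory buffer, which readers that accept file-like objects, such as pd.read_csv, can consume directly.

```python
import io
import os
import tempfile


def xor_decrypt(data: bytes, key: bytes) -> bytes:
    """Placeholder cipher, NOT secure. Stands in for a real AES-256 decrypt call."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


def decrypt_to_memory(path: str, key: bytes, decrypt=xor_decrypt) -> io.BytesIO:
    """Read the encrypted file, decrypt it in RAM, and return a file-like buffer."""
    with open(path, "rb") as fh:
        ciphertext = fh.read()
    return io.BytesIO(decrypt(ciphertext, key))


def demo() -> bytes:
    """Round trip: write an 'encrypted' CSV, then decrypt it without a plaintext file."""
    key = b"demo-key"
    plaintext = b"a,b\n1,2\n3,4\n"
    with tempfile.NamedTemporaryFile(delete=False, suffix=".enc") as tmp:
        tmp.write(xor_decrypt(plaintext, key))  # XOR is its own inverse
        path = tmp.name
    try:
        buf = decrypt_to_memory(path, key)
        # With pandas installed, this is where pd.read_csv(buf) would go.
        return buf.read()
    finally:
        os.remove(path)  # only the encrypted file ever touched the disk
```

This reads the whole ciphertext and plaintext into RAM at once, so it only suits files that fit comfortably in memory.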

If you are already using Linux and are looking for a "simple" alternative that does not involve encrypting/decrypting at the Python level, you could use native file system encryption with ext4.
This approach might make your installation complicated, but it has the following advantages:
Zero risk of leakage via temporary file.
Fast, since the native encryption is in C (although PyCrypto is also in C, I am guessing it will be faster at the kernel level).
Disadvantages:
You need to learn to work with the specific file system commands.
Your current Linux kernel may be too old.
You may not know how to upgrade, or may not be able to upgrade, your Linux kernel.
As for writing the decrypted file to memory, you can use /dev/shm as your write location, which spares you from doing complicated streaming or overriding pandas methods.
In short, /dev/shm uses the memory (in some cases your tmpfs does that too), and it is much faster than your normal hard drive (info /dev/shm/).
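A rough sketch of the /dev/shm idea, assuming a Linux box where /dev/shm is a writable tmpfs mount. The function names are hypothetical, and the fallback to the default temp directory is only there so the snippet runs elsewhere (that fallback may well be on a real disk).

```python
import os
import tempfile


def memory_backed_dir() -> str:
    """Return /dev/shm when it exists and is writable, else the default temp dir."""
    shm = "/dev/shm"
    if os.path.isdir(shm) and os.access(shm, os.W_OK):
        return shm
    return tempfile.gettempdir()  # fallback: may be backed by a real disk


def roundtrip_via_shm(plaintext: bytes) -> bytes:
    """Write decrypted bytes to a memory-backed file, read them back, then delete it."""
    # mkstemp creates the file readable and writable only by the current user.
    fd, path = tempfile.mkstemp(dir=memory_backed_dir(), suffix=".csv")
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(plaintext)
        # A reader that needs a real path, e.g. pd.read_csv(path), would go here.
        with open(path, "rb") as fh:
            return fh.read()
    finally:
        os.remove(path)  # always remove the plaintext, even on error
```

One caveat worth keeping in mind: tmpfs pages can still be written to swap, so this reduces the exposure rather than eliminating it.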
I hope this helps you in a way.

Related

Instant access to line from a large file without loading the file

In one of my recent projects I need to perform this simple task but I'm not sure what is the most efficient way to do so.
I have several large text files (>5GB) and I need to continuously extract random lines from those files. The requirements are: I can't load the files into memory, I need to perform this very efficiently (>>1000 lines a second), and preferably I need to do as little pre-processing as possible.
The files consist of many short lines (~20 million lines). The "raw" files have varying line lengths, but with a short pre-processing step I can make all lines the same length (though the perfect solution would not require pre-processing).
I already tried the default python solutions mentioned here but they were too slow (and the linecache solution loads the file into memory, therefore is not usable here)
The next solution I thought about is to create some kind of index. I found this solution but it's very outdated so it needs some work to get working, and even then I'm not sure if the overhead created during the processing of the index file won't slow down the process to time-scale of the solution above.
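For what it's worth, a byte-offset index can be sketched in a few lines with only the standard library. This is an illustration and not the outdated solution linked above; the function names are made up. It makes one sequential pass to record where each line starts, then seeks straight to the requested lines. At 8 bytes per line, 20 million lines would need roughly 160 MB for the index.

```python
import os
import tempfile
from array import array


def build_line_index(path: str) -> array:
    """One sequential pass: record the byte offset at which each line starts."""
    offsets = array("Q")  # unsigned 64-bit integers, 8 bytes per line
    pos = 0
    with open(path, "rb") as fh:
        for line in fh:
            offsets.append(pos)
            pos += len(line)
    return offsets


def read_lines(path: str, offsets: array, line_numbers) -> dict:
    """Fetch the requested lines; visiting them in file order keeps seeks short."""
    result = {}
    with open(path, "rb") as fh:
        for n in sorted(set(line_numbers)):
            fh.seek(offsets[n])
            result[n] = fh.readline().rstrip(b"\r\n")
    return result


def demo():
    """Build an index over a tiny file with varying line lengths and read 3 lines."""
    with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
        tmp.write(b"alpha\nbe\ngamma delta\n\nlast")
        path = tmp.name
    try:
        index = build_line_index(path)
        return index, read_lines(path, index, [4, 0, 2])
    finally:
        os.remove(path)
```

The index could be saved with array.tofile and reloaded with array.fromfile so the pass over the big file happens only once.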
Another solution is converting the file into a binary file and then getting instant access to lines this way. For this solution I couldn't find any python package that supports binary-text work, and I feel like creating a robust parser this way could take a very long time and could create many hard-to-diagnose errors down the line because of small miscalculations/mistakes.
The final solution I thought about is using some kind of database (sqlite in my case) which will require transferring the lines into a database and loading them this way.
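A small sketch of what the sqlite route might look like with the standard sqlite3 module. The table layout (a "lines" table keyed by line number) and the function names are assumptions made for this example. Lookups are batched with IN (...) and chunked, since SQLite limits the number of bound parameters per statement.

```python
import sqlite3


def load_lines(conn: sqlite3.Connection, lines) -> None:
    """Store each line under an integer primary key (its zero-based line number)."""
    conn.execute("CREATE TABLE IF NOT EXISTS lines (n INTEGER PRIMARY KEY, text TEXT)")
    conn.executemany("INSERT INTO lines (n, text) VALUES (?, ?)", enumerate(lines))
    conn.commit()


def fetch_lines(conn: sqlite3.Connection, numbers, chunk: int = 500) -> dict:
    """Look up a batch of line numbers, a few hundred per query."""
    numbers = list(numbers)
    found = {}
    for start in range(0, len(numbers), chunk):
        part = numbers[start:start + chunk]
        marks = ",".join("?" * len(part))
        query = f"SELECT n, text FROM lines WHERE n IN ({marks})"
        found.update(conn.execute(query, part).fetchall())
    return found


def demo() -> dict:
    conn = sqlite3.connect(":memory:")  # a file path would be used for real data
    load_lines(conn, ["first", "second", "third", "fourth"])
    return fetch_lines(conn, [3, 1])
```

Whether this reaches the required throughput would have to be measured on the real files.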
Note: I will also load thousands of (random) lines each time, therefore solutions which work better for groups of lines will have an advantage.
Thanks in advance,
Art.
As said in the comments, I believe using hdf5 would be a good option.
This answer shows how to read that kind of file

Debugging a python script which first needs to read large files. Do I have to load them every time anew?

I have a python script which starts by reading a few large files and then does something else. Since I want to run this script multiple times and change some of the code until I am happy with the result, it would be nice if the script did not have to read the files every time anew, because they will not change. So I mainly want to use this for debugging.
It happens too often that I run scripts with bugs in them but only see the error message after minutes, because the reading took so long.
Are there any tricks to do something like this?
(If it is feasible, I create smaller test files)
I'm not good at Python, but it seems to be able to dynamically reload code from a changed module: How to re import an updated package while in Python Interpreter?
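A sketch of how that reload idea could look, assuming the analysis code lives in its own module (here a throwaway module called analysis_demo, a name invented for the example) and the data is kept alive in the running interpreter. The demo writes the module to a temp directory only so that it is self-contained; in practice you would edit your own file and call importlib.reload from an interactive session.

```python
import importlib
import os
import shutil
import sys
import tempfile


def demo():
    """Load data once, then pick up an edited module without re-reading the data."""
    workdir = tempfile.mkdtemp()
    module_path = os.path.join(workdir, "analysis_demo.py")
    with open(module_path, "w") as fh:
        fh.write("def summarize(data):\n    return sum(data)\n")
    sys.path.insert(0, workdir)
    importlib.invalidate_caches()
    try:
        analysis = importlib.import_module("analysis_demo")
        data = list(range(1000))  # stand-in for the slow file reading
        first = analysis.summarize(data)

        # Simulate editing the code while the data stays in memory.
        with open(module_path, "w") as fh:
            fh.write("def summarize(data):\n    return max(data) - min(data)\n")
        importlib.reload(analysis)
        second = analysis.summarize(data)
        return first, second
    finally:
        sys.path.remove(workdir)
        sys.modules.pop("analysis_demo", None)
        shutil.rmtree(workdir, ignore_errors=True)
```

Objects created before the reload keep pointing at the old code, so this works best when the reloaded module only contains functions that are called afresh each time.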
Some other suggestions not directly related to Python.
Firstly, try to create a smaller test file. Is the whole file required to demonstrate the bug you are observing? Most probably it is only a small part of your input file that is relevant.
Secondly, are these particular files required, or will the problem show up on any large amount of data? If it shows only on particular files, then once again it is most probably related to some feature of these files and will also show on a smaller file with the same feature. If the main reason is just the large amount of data, you might be able to avoid reading it by generating some random data directly in the script.
Thirdly, what is the bottleneck when reading the file? Is it just a hard drive performance issue, or do you do some heavy processing of the read data in your script before actually coming to the part that generates problems? In the latter case, you might be able to do that processing once and write the results to a new file, and then modify your script to load this processed data instead of doing the processing each time anew.
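The "process once, then load the result" suggestion could be sketched with pickle like this. The helper name load_cached and the cache path are made up for the example, and it assumes the processed data can be pickled and that the input files really do not change (there is no invalidation).

```python
import os
import pickle
import tempfile


def load_cached(cache_path: str, slow_loader):
    """Return the cached result if present, else run slow_loader and cache it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as fh:
            return pickle.load(fh)
    result = slow_loader()
    with open(cache_path, "wb") as fh:
        pickle.dump(result, fh, protocol=pickle.HIGHEST_PROTOCOL)
    return result


def demo():
    """Call the loader twice; the expensive function should only run once."""
    calls = []

    def expensive():
        calls.append(1)  # count how often the slow path runs
        return {"rows": list(range(5))}

    cache_path = os.path.join(tempfile.mkdtemp(), "parsed.pkl")
    first = load_cached(cache_path, expensive)
    second = load_cached(cache_path, expensive)
    return first, second, len(calls)
```

Deleting the cache file forces a fresh load. Only unpickle files you created yourself, since pickle can execute arbitrary code on load.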
If the hard drive performance is the issue, consider a faster filesystem. On Linux, for example, you might be able to use /dev/shm.

Live-analysis of simulation data using pytables / hdf5

I am working on some cfd-simulations with c/CUDA and python, at the moment the workflow goes like this:
Start a simulation written in pure c / cuda
Write output to a binary file
Reopen files with python i.e. numpy.fromfile and do some analysis.
Since I have a lot of data and also some metadata I thought it would be better
to switch to hdf5 file format. So my idea was something like,
Create some initial conditions data for my simulations using pytables.
Reopen and write to the datasets in c by using the standard hdf5 library.
Reopen files using pytables for analysis.
I really would like to do some live analysis of the data, i.e.
write from the C program to hdf5 and directly read from python using pytables.
This would be pretty useful, but I am really not
sure how much this is supported by pytables.
Since I never worked with pytables or hdf5 it would be good to know
if this is a good approach or if there are maybe some pitfalls.
I think it is a reasonable approach, but there is a pitfall indeed. The HDF5 C-library is not thread-safe (there is a "parallel" version, more on this later). That means, your scenario does not work out of the box: one process writing data to a file while another process is reading (not necessarily the same dataset) will result in a corrupted file. To make it work, you must either:
implement file locking, making sure that no process is reading while the file is being written to, or
serialize access to the file by delegating reads/writes to a distinguished process. You must then communicate with this process through some IPC technique (Unix domain sockets, ...). Of course, this might affect performance because data is being copied back and forth.
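The first option (file locking) might look roughly like this on the Python side, using advisory locks from the standard fcntl module (Unix only). This is a sketch under assumptions: the lock-file convention and the names are invented here, a plain binary file stands in for the HDF5 file, and the C writer would have to take the same lock (for example with flock(2)) for it to protect anything.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def locked(lock_path: str, exclusive: bool):
    """Hold an advisory flock on lock_path for the duration of the with-block."""
    mode = fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH
    with open(lock_path, "a") as lock_file:
        fcntl.flock(lock_file, mode)  # blocks until the lock is available
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def demo() -> bytes:
    """Writer takes an exclusive lock, reader a shared one, on the same lock file."""
    workdir = tempfile.mkdtemp()
    data_path = os.path.join(workdir, "simulation.bin")  # stand-in for the HDF5 file
    lock_path = data_path + ".lock"

    with locked(lock_path, exclusive=True):   # writer side
        with open(data_path, "wb") as fh:
            fh.write(b"timestep-0001")

    with locked(lock_path, exclusive=False):  # reader side
        with open(data_path, "rb") as fh:
            return fh.read()
```

The locks are advisory: a process that never asks for the lock can still read or write the file.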
Recently, the HDF group published an MPI-based parallel version of HDF5, which makes concurrent read/write access possible. Cf. http://www.hdfgroup.org/HDF5/PHDF5/. It was created for use cases like yours.
To my knowledge, pytables does not provide any bindings to parallel HDF5. You should use h5py instead, which provides very user-friendly bindings to parallel HDF5. See the examples on this website: http://docs.h5py.org/en/2.3/mpi.html
Unfortunately, parallel HDF5 has a major drawback: to date, it does not support writing compressed datasets (reading is possible, though). Cf. http://www.hdfgroup.org/hdf5-quest.html#p5comp

Should I use a ramdisk for pictures that are converted and removed?

I have a little program here (python 2.7) that runs on an old machine and it basically keeps getting pictures (for timelapses) by running an external binary and converts them to an efficient format to save up disk space.
I want to minimize the disk operations, because it's already pretty old and I want it to last some more time.
At the moment the program writes the data from the camera to disk, then converts it and removes the original data. However, it does that for every image: (1) it writes a large file to disk, (2) reads it to convert, and (3) then deletes it. That is a bunch of disk operations that aren't necessary and could be done in RAM, because the original file doesn't have to be stored and is only used as a basis to create another one.
I was sure a ramdisk was the solution, so I googled how to set one up, and Google returned a bunch of links that discourage the use of ramdisks. The reasons are many: they are not useful on modern systems (I'm running a pretty new Linux kernel); they should only be used if you want to decrypt data that shouldn't hit the disk; some tests show that a ramdisk can actually be slower than a hard drive; the operating system has a cache...
So I'm confused...
In this situation, should I use a ramdisk?
Thank you.
PS: If you want more info: I have a proprietary high-res camera, and a proprietary binary that I run to capture a single image, I can specify where it will write the file, which is a huge TIFF file, and then the python program runs the convert program from imagemagick to convert it to JPEG and then compress it in tar.bz2, so the quality is almost the same but the filesize is 1/50 of the TIFF.
My experience with ramdisks is congruent with what you've mentioned here. I lost performance when I moved to them because there was less memory available for the kernel to do its caching intelligently and that messed things up.
However, from your question, I understand that you want to optimise for number of disk operations rather than speed in which case a RAM disk might make sense. As with most of these kinds of problems, monitoring is the right way to do it.
Another thing that struck me was that if your original image is not that big, you might want to buy a cheap USB stick and do the I/O on that rather than on your main drive. Is that not an option?
Ah, proprietary binaries that only give certain options. Yay. The simplest solution would be adding a solid state drive. You will still be saving to disk, but disk IO throughput will be much higher for reading and writing.
A better solution would be outputting the tiff to stdout, perhaps in a different format, and piping it to your python program. It would never hit the hard drive at all, but it would be more work. Of course, if the binary doesn't allow you to do this, then it's moot.
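If the binary can write to stdout, the piping idea could be sketched as below. Note the assumptions: the question mentions Python 2.7, while this uses subprocess.run from Python 3.5+, and the "camera" is faked with a child Python process because the real binary and its options are unknown.

```python
import io
import subprocess
import sys


def capture_stdout(command) -> io.BytesIO:
    """Run command and return everything it wrote to stdout as an in-memory buffer."""
    done = subprocess.run(command, stdout=subprocess.PIPE, check=True)
    return io.BytesIO(done.stdout)


def demo() -> bytes:
    # Stand-in for the camera binary: a child process that emits raw bytes.
    fake_camera = [
        sys.executable,
        "-c",
        "import sys; sys.stdout.buffer.write(b'II*\\x00fake-tiff')",
    ]
    image = capture_stdout(fake_camera)
    # The buffer could now be handed to an image library or piped on to convert.
    return image.getvalue()
```

The whole image is held in RAM here, which is the point, but it does mean a huge TIFF needs that much free memory.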
If on Debian (and possibly its derivatives), use "/run/shm" directory.

Utilities or libraries for finding most closely matched binary file

I would like to be able to compare a binary file X to a directory of other binary files and find which other file is most similar to X. The nature of the data is such that identical chunks will exist between files, but possibly shifted in location. The files are all 1MB in size, and there are about 200 of them. I would like to have something quick enough to analyze these in a few minutes or less on a modern desktop computer.
I've googled a bit and found a few different binary diff utilities, but none of them seem appropriate for my application.
For example there is bsdiff, which looks like it creates a patch file that is optimized for size. Or vbindiff, which just displays the differences graphically, but those don't really seem to help me figure out if one file is more similar to X than another file.
If there is not a tool that I can use directly for this purpose, is there a good library someone could recommend for writing my own utility? Python would be preferable, but I'm flexible.
Here's a simple perl script which more or less tries to do exactly that.
Edit: Also have a look at the following stackoverflow thread.
