I am working on some CFD simulations with C/CUDA and Python. At the moment the workflow goes like this:
Start a simulation written in pure C/CUDA
Write output to a binary file
Reopen the files with Python (i.e. numpy.fromfile) and do some analysis.
Since I have a lot of data and also some metadata, I thought it would be better to switch to the HDF5 file format. So my idea was something like:
Create some initial conditions data for my simulations using pytables (a sketch follows below).
Reopen and write to the datasets in C using the standard HDF5 library.
Reopen the files using pytables for analysis.
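From what I can tell from the pytables docs, step 1 would look roughly like this (untested on my side; the file name, group layout, grid size and attribute names are just placeholders):

import numpy as np
import tables

u0 = np.zeros((256, 256), dtype=np.float64)  # initial condition field

with tables.open_file("initial.h5", mode="w", title="CFD run") as h5:
    grp = h5.create_group("/", "fields", "Simulation fields")
    arr = h5.create_array(grp, "u", u0, "x-velocity at t=0")
    arr.attrs.dx = 0.01   # metadata travels with the dataset
    arr.attrs.dt = 1e-4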
I really would like to do some live analysis of the data i.e.
write from the c-programm to hdf5 and directly read from python using pytables.
This would be pretty useful, but I am really not
sure how much this is supported by pytables.
Since I never worked with pytables or hdf5 it would be good to know
if this is a good approach or if there are maybe some pitfalls.
I think it is a reasonable approach, but there is a pitfall indeed. The HDF5 C-library is not thread-safe (there is a "parallel" version, more on this later). That means, your scenario does not work out of the box: one process writing data to a file while another process is reading (not necessarily the same dataset) will result in a corrupted file. To make it work, you must either:
implement file locking, making sure that no process is reading while the file is being written to (a reader-side sketch follows below), or
serialize access to the file by delegating reads/writes to a distinguished process. You must then communicate with this process through some IPC technique (Unix domain sockets, ...). Of course, this might affect performance because data is being copied back and forth.
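For the first option, a reader-side sketch in Python could look roughly like this; the C writer would have to take the same advisory lock via flock(2) before writing, and the lock-file path and node names are only placeholders:

import fcntl
import tables

with open("sim.h5.lock", "w") as lockfile:
    fcntl.flock(lockfile, fcntl.LOCK_EX)      # block until the writer releases the lock
    try:
        with tables.open_file("sim.h5", mode="r") as h5:
            u = h5.root.fields.u[:]           # safe to read while the lock is held
    finally:
        fcntl.flock(lockfile, fcntl.LOCK_UN)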
Recently, the HDF group published an MPI-based parallel version of HDF5, which makes concurrent read/write access possible. Cf. http://www.hdfgroup.org/HDF5/PHDF5/. It was created for use cases like yours.
To my knowledge, pytables does not provide any bindings to parallel HDF5. You should use h5py instead, which provides very user-friendly bindings to parallel HDF5. See the examples on this website: http://docs.h5py.org/en/2.3/mpi.html
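A minimal sketch in the spirit of the examples on that page (run with mpiexec; this assumes h5py was built with MPI support, and the file/dataset names are placeholders):

from mpi4py import MPI
import h5py

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# every rank opens the same file collectively
with h5py.File("parallel_test.hdf5", "w", driver="mpio", comm=comm) as f:
    dset = f.create_dataset("test", (comm.Get_size(),), dtype="i")
    dset[rank] = rank   # each rank writes its own slot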
Unfortunately, parallel HDF5 has a major drawback: to date, it does not support writing compressed datasets (reading is possible, though). Cf. http://www.hdfgroup.org/hdf5-quest.html#p5comp
I have been trying to wrap my head around pyarrow for a while, reading their documentation, but I still feel like I have not been able to grasp it in its entirety. I saw their deprecated method of serialization for arbitrary Python objects, but since it's deprecated I was wondering what the correct way is to save, for example, a list of objects or an arbitrary Python object in general.
When do you want to bother using pyarrow as well?
PyArrow is the Python binding for (Apache) Arrow. Arrow is a cross-language specification that describes how to store columnar data in memory. It serves as the internals of data-processing applications and libraries, allowing them to work efficiently with large tabular datasets.
When do you want to bother using pyarrow as well?
One simple use case for PyArrow is to convert between Pandas/Numpy/dict and the Parquet file format. So for example, if you had columnar data (e.g. DataFrames) that you need to share between programs written in different languages, or even programs using different versions of Python, a nice way to do this is to save your Pandas/Numpy/dict to a Parquet file (serialisation). This is a much more portable format than, for example, pickle. It also allows you to embed custom metadata in a portable fashion.
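A minimal sketch of that round trip (the file name and the metadata key/value are only illustrative):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})

# convert to an Arrow table and attach custom metadata before writing
table = pa.Table.from_pandas(df)
table = table.replace_schema_metadata({**(table.schema.metadata or {}),
                                       b"source": b"run 42"})
pq.write_table(table, "data.parquet")

# any language with an Arrow/Parquet implementation can read this back
restored = pq.read_table("data.parquet").to_pandas()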
I have a ROS/CPP simulator that saves large amounts of data to a rosbag (around 90 MB). I want to read this data frequently from Python, and since reading rosbags is slow and cumbersome, I currently have another Python script that reads the rosbag and saves the relevant contents to an HDF5 file.
It would be nice though to be able to just save the data from the simulator directly (in C++) and then read it from my scripts (in Python). So I was wondering which data format I should use.
It should be:
Fast to load from Python
Be compact (so ideally a binary of some sort)
Be easy to use
You might be wondering why I don't just save to HDF5 from my C++ simulator, but it just doesn't seem to be easy. There is basically nothing on forums such as Stack Overflow, and the HDF Group website is opaque, seems to have some complicated licensing, and has very poor examples. I just want something quick and dirty that I can get running this afternoon.
You may want to have a look at HDFql as it is a high-level language (similar to SQL) to manage HDF5 files. Amongst others, HDFql supports C++ and Python. There are some examples that illustrate how to use HDFql in these languages here.
I see two solutions that can be useful for your problem (a reader-side sketch of the first follows this list):
LV: length-value records that you can store directly in binary in a file.
JSON: this does not add much data beyond what you need, and there are many libraries in Python or C++ that can simplify the work for you.
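For the LV option, the Python reading side could be sketched like this, assuming the C++ simulator writes each record as a 4-byte little-endian length followed by that many bytes of float64 samples (the exact layout is of course up to you):

import struct
import numpy as np

def read_lv_records(path):
    records = []
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if not header:
                break
            (nbytes,) = struct.unpack("<I", header)      # length prefix
            records.append(np.frombuffer(f.read(nbytes), dtype=np.float64))
    return records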
Protocol Buffers is an option with language bindings in C++ and Python, though it might be more of a time investment than something quick and dirty you can get running this afternoon.
I'm new to protobuf. I need to serialize a complex graph-like structure and share it between C++ and Python clients.
I'm trying to apply protobuf because:
It is language agnostic, has generators both for C++ and Python
It is binary. I can't afford text formats because my data structure is quite large
But Protobuf user guide says:
Protocol Buffers are not designed to handle large messages. As a general rule of thumb, if you are dealing in messages larger than a megabyte each, it may be time to consider an alternate strategy.
https://developers.google.com/protocol-buffers/docs/techniques#large-data
I have graph-like structures that are sometimes up to 1 GB in size, way above 1 MB.
Why is protobuf bad for serializing large datasets? What should I use instead?
It is just general guidance, so it doesn't apply to every case. For example, the OpenStreetMap project uses a protocol buffers based file format for its maps, and the files are often 10-100 GB in size. Another example is Google's own TensorFlow, which uses protobuf and the graphs it stores are often up to 1 GB in size.
However, OpenStreetMap does not store the entire file as a single message. Instead it consists of thousands of individual messages, each encoding a part of the map. You can apply a similar approach, so that each message only encodes e.g. one node.
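A rough sketch of that approach in Python, where node_pb2.Node stands in for whatever generated message class your .proto defines, and the 4-byte length prefix is simply one possible framing choice:

import struct
import node_pb2  # hypothetical module generated by protoc

def write_nodes(path, nodes):
    with open(path, "wb") as f:
        for node in nodes:
            payload = node.SerializeToString()
            f.write(struct.pack("<I", len(payload)))  # length prefix
            f.write(payload)

def read_nodes(path):
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if not header:
                break
            (length,) = struct.unpack("<I", header)
            node = node_pb2.Node()
            node.ParseFromString(f.read(length))
            yield node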
The main problem with protobuf for large files is that it doesn't support random access. You'll have to read the whole file, even if you only want to access a specific item. If your application will be reading the whole file to memory anyway, this is not an issue. This is what TensorFlow does, and it appears to store everything in a single message.
If you need a random access format that is compatible across many languages, I would suggest HDF5 or sqlite.
It should be fine to use protocol buffers that are much larger than 1MB. We do it all the time at Google, and I wasn't even aware of the recommendation you're quoting.
The main problem is that you'll need to deserialize the whole protocol buffer into memory at once, so it's worth thinking about whether your data is better off broken up into smaller items so that you only have to have part of the data in memory at once.
If you can't break it up, then no worries. Go ahead and use a massive protocol buffer.
Update: I have asked a new question that gives a full code example: Decrypting a file to a stream and reading the stream into pandas (hdf or stata)
My basic problem is that I need to keep data encrypted and then read into pandas. I'm open to a variety of solutions but the encryption needs to be AES256. As of now, I'm using PyCrypto, but that's not a requirement.
My current solution is:
Decrypt into a temporary file (CSV, HDF, etc.)
Read the temp file into pandas
Delete the temp file
That's far from ideal because there is temporarily an unencrypted file sitting on the hard drive, and with user error it could be there longer than temporarily. Equally bad, the IO is essentially tripled as an unencrypted file is written out and then read into pandas.
Ideally, encryption would be built into HDF or some other binary format that pandas can read, but it doesn't seem to be as far as I can tell.
(Note: this is on a linux box, so perhaps there is a shell script solution, although I'd probably prefer to avoid that if it can all be done inside of python.)
Second best, and still a big improvement, would be to decrypt the file into memory and read directly into pandas without ever creating a new (unencrypted) file. So far I haven't been able to do that though.
Here's some pseudo code to hopefully illustrate.
# this works, but less safe and IO intensive
decrypt_to_file('encrypted_csv', 'decrypted_csv') # outputs decrypted file to disk
pd.read_csv('decrypted_csv')
# this is what I want, but don't know how to make it work
# no decrypted file is ever created
pd.read_csv(decrypt_to_memory('encrypted_csv'))
So that's what I'm trying to do, but also interested in other alternatives that accomplish the same thing (are efficient and don't create a temp file).
Update: Probably there is not going to be a direct answer to this question -- not too surprising, but I thought I would check. I think the answer will involve something like BytesIO (mentioned by DSM) or mmap (mentioned by Mad Physicist), so I'm exploring those. Thanks to all who made a sincere attempt to help here.
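For reference, a rough sketch of the BytesIO direction I'm exploring, assuming AES-256 in CBC mode with PyCrypto, the IV stored in the first 16 bytes of the file, and PKCS#7 padding (key handling is omitted and all names are placeholders):

import io
from Crypto.Cipher import AES
import pandas as pd

def read_encrypted_csv(path, key):
    with open(path, "rb") as f:
        iv = f.read(16)
        ciphertext = f.read()
    cipher = AES.new(key, AES.MODE_CBC, iv)
    plaintext = cipher.decrypt(ciphertext)
    plaintext = plaintext[:-plaintext[-1]]        # strip PKCS#7 padding
    return pd.read_csv(io.BytesIO(plaintext))     # no decrypted file on disk

# df = read_encrypted_csv('encrypted_csv', key=b'0' * 32)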
If you are already using Linux and you are looking for a "simple" alternative that does not involve encrypting/decrypting at the Python level, you could use native filesystem encryption with ext4.
This approach might make your installation complicated, but it has the following advantages:
Zero risk of leakage via temporary file.
Fast, since the native encryption is in C (although PyCrypto is also in C, I am guessing the kernel-level encryption will be faster).
Disadvantages:
You need to learn to work with the specific filesystem commands
Your current Linux kernel may be too old
You don't know how to upgrade, or can't upgrade, your Linux kernel.
As for writing the decrypted file to memory, you can use /dev/shm as your write location, thus sparing the need to do complicated streaming or to override pandas methods.
In short, /dev/shm is backed by memory (in some setups /tmp is a tmpfs too), and it is much faster than a normal hard drive (see info on /dev/shm).
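A tiny sketch of that variant, reusing the hypothetical decrypt_to_file helper from the question (the "temporary" file then lives in RAM-backed tmpfs rather than on disk):

import os
import pandas as pd

decrypt_to_file('encrypted_csv', '/dev/shm/decrypted_csv')  # hypothetical helper from the question
df = pd.read_csv('/dev/shm/decrypted_csv')
os.remove('/dev/shm/decrypted_csv')  # still best to clean up promptly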
I hope this helps you in a way.
I have a Python script that needs to read a huge file into a variable and then search through it and perform other operations.
The problem is that the web server calls this script multiple times, and every time I get a latency of around 8 seconds while the file loads.
Is it possible to make the file persist in memory to have faster access to it at later times?
I know I can run the script as a service using supervisor, but I can't do that in this case.
Any other suggestions, please?
P.S. I am already using var = pickle.load(open(file))
You should take a look at http://docs.h5py.org/en/latest/. It allows you to perform various operations on huge files. It's what NASA uses.
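The point being that you only read the parts you need instead of loading the whole file, roughly like this (file and dataset names are illustrative):

import h5py

with h5py.File("huge_file.h5", "r") as f:
    dset = f["measurements"]              # no data read yet, just metadata
    window = dset[1_000_000:1_000_100]    # only this slice is read from disk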
Not an easy problem. I assume you can do nothing about the fact that your web server calls your application multiple times. In that case I see two solutions:
(1) Write TWO separate applications. The first application, A, loads the large file and then it just sits there, waiting for the other application to access the data. "A" provides access as required, so it's basically a sort of custom server. The second application, B, is the one that gets called multiple times by the web server. On each call, it extracts the necessary data from A using some form of interprocess communication. This ought to be relatively fast. The Python standard library offers some tools for interprocess communication (socket, http server) but they are rather low-level. Alternatives are almost certainly going to be operating-system dependent. (A sketch of this option follows after option (2) below.)
(2) Perhaps you can pre-digest or pre-analyze the large file, writing out a more compact file that can be loaded quickly. A similar idea is suggested by tdelaney in his comment (some sort of database arrangement).
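A minimal sketch of option (1) using multiprocessing.connection for the IPC; the address, authkey, file name and query protocol are all made up, and it assumes the pickled object behaves like a dict:

# --- application A: load once, then serve lookups ---
import pickle
from multiprocessing.connection import Listener

with open("huge_file.pkl", "rb") as f:
    data = pickle.load(f)              # the slow 8-second load happens once

with Listener(("localhost", 6000), authkey=b"secret") as listener:
    while True:
        with listener.accept() as conn:
            key = conn.recv()          # application B sends a query
            conn.send(data.get(key))   # ...and gets back just the piece it needs

# --- application B: called by the web server ---
# from multiprocessing.connection import Client
# with Client(("localhost", 6000), authkey=b"secret") as conn:
#     conn.send("some_key")
#     result = conn.recv()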
You are talking about memory-caching a large array, essentially…?
There are three fairly viable options for large arrays:
use memory-mapped arrays
use h5py or pytables as a back-end
use an array caching-aware package like klepto or joblib.
Memory-mapped arrays keep the array in a file but index it as if it were in memory.
h5py or pytables give you fast access to arrays on disk, and can also avoid loading the entire array into memory. klepto and joblib can store arrays as a collection of "database" entries (typically a directory tree of files on disk), so you can load portions of the array into memory easily. Each has a different use case, so the best choice for you depends on what you want to do. (I'm the klepto author, and it can use SQL database tables as a backend instead of files.)
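A minimal sketch of the memory-mapped option with NumPy; the file name, shape and dtype are illustrative and must match however the array was originally saved:

import numpy as np

# create the array once (e.g. in a one-off preprocessing step)
arr = np.memmap("big_array.dat", dtype="float64", mode="w+", shape=(100_000, 512))
arr[:] = 0.0
arr.flush()

# later runs open it instantly; data is paged in only as it is accessed
view = np.memmap("big_array.dat", dtype="float64", mode="r", shape=(100_000, 512))
chunk = view[5000:5010]   # reads just these rows from disk / page cache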