On Wikipedia one can read the following criticism of HDF5:
Criticism of HDF5 follows from its monolithic design and lengthy
specification. Though a 150-page open standard, there is only a single
C implementation of HDF5, meaning all bindings share its bugs and
performance issues. Compounded with the lack of journaling, documented
bugs in the current stable release are capable of corrupting entire
HDF5 databases. Although 1.10-alpha adds journaling, it is
backwards-incompatible with previous versions. HDF5 also does not
support UTF-8 well, necessitating ASCII in most places. Furthermore
even in the latest draft, array data can never be deleted.
I am wondering: does this apply only to the C implementation of HDF5, or is it a general flaw of HDF5?
I am doing scientific experiments which sometimes generate Gigabytes of data and in all cases at least several hundred Megabytes of data. Obviously data loss and especially corruption would be a huge disadvantage for me.
My scripts always have a Python API, hence I am using h5py (version 2.5.0).
So, is this criticism relevant to me and should I be concerned about corrupted data?
Declaration up front: I help maintain h5py, so I probably have a bias etc.
The Wikipedia page has changed since the question was posted; here's what I see:
Criticism
Criticism of HDF5 follows from its monolithic design and lengthy specification.
Though a 150-page open standard, the only other C implementation of HDF5 is just a HDF5 reader.
HDF5 does not enforce the use of UTF-8, so client applications may be expecting ASCII in most places.
Dataset data cannot be freed in a file without generating a file copy using an external tool (h5repack).
I'd say that pretty much sums up the problems with HDF5: it's complex (but people need this complexity, see the virtual dataset support), it has a long history with backwards compatibility as its focus, and it's not really designed to allow for massive changes in files. It's also not the best on Windows (due to how it deals with filenames).
I picked HDF5 for my research because, of the available options, it had decent metadata support (HDF5 at least allows UTF-8; formats like FITS don't even have that), support for multidimensional arrays (which formats like Protocol Buffers don't really support), and support for more than just 64-bit floats (which is very rare).
I can't comment about known bugs, but I have seen corruption (this happened when I was writing to a file and Linux OOM-killed my script). However, this shouldn't be a concern as long as you have proper data hygiene practices (as mentioned in the Hacker News link), which in your case would mean not continuously writing to the same file, but creating a new file for each run. You should also not modify files in place; instead, any data reduction should produce new files, and you should always back up the originals.
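For illustration, here is a minimal sketch of that new-file-per-run pattern with h5py (the file name, dataset name and dummy data are made up):

import h5py
import numpy as np
import time

# One fresh file per experimental run; never reopen an old run's file for writing.
filename = "run_%d.h5" % int(time.time())   # hypothetical naming scheme
data = np.random.rand(1000, 3)              # stand-in for real measurements

with h5py.File(filename, "w") as f:         # "w" always creates a new file
    dset = f.create_dataset("measurements", data=data, compression="gzip")
    dset.attrs["description"] = u"example run"

Any later data reduction would open this file read-only and write its results to yet another file, leaving the original untouched.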
Finally, it is worth pointing out that there are alternatives to HDF5, depending on what exactly your requirements are: SQL databases may fit your needs better (and sqlite comes with Python by default, so it's easy to experiment with), as could a simple CSV file. I would recommend against custom/non-portable formats (e.g. pickle and similar), as they're no more robust than HDF5, yet more complex than a CSV file.
Related
My application is going to edit a bunch of large files, completely unrelated to each other (belonging to different users), and I need to store checkpoints of the previous state of the files. Delta compression should work extremely well on this file format. I only need a linear history, not branches or merges.
There are low-level libraries that give part of the solution, for example xdelta3 sounds like a good binary diff/patch system.
RCS actually seems like a pretty close match to my problem, but doesn't handle binary files well.
git provides a complete solution to my problem, but is an enormous suite of programs, and its storage format is an entire directory.
Is there anything less complicated than git that would:
work on binary files
perform delta compression
let me commit new "newest" versions
let me recall old versions
Bonus points if it would:
have a single-file storage format
be available as a C, C++, or Python library
I can't even find the right combination of words to google for this category of program, so that would also be helpful.
From RCS manual (1. Overview)
[RCS] can handle text as well as binary files, although functionality is reduced for the latter.
RCS seems like a good option worth trying.
I work for a Foundation which has been using RCS to keep tens of thousands of completely unrelated files under version control (git or hg are not an option). Mostly text, but also some media files, which are binary in nature.
RCS does work quite well with binary files; just make sure to disable the keyword substitution options, to avoid inadvertently substituting binary bits that look like $Id$.
To see if this could work for you, you could for example try with a Photoshop image, put it under version control with RCS. Then change a part, or add a layer, and commit the change. You could then verify how well RCS can manage binary files for you.
RCS has been serving us quite well. It is well maintained, reliable, predictable, and definitely worth a try.
Forgive me for asking, but my experience has taught me to challenge assumptions. I don't know why you need a 'single-file' solution, but my answer depends on that.
Option 1 - If you are simply looking for ease of use, have you considered using a single git repo to track multiple binaries?
By using git's per-file history capabilities, you can see the history of every file in the repo independently, create patches, and roll back without affecting the rest of the repo. For example, with a commit naming convention, you can easily roll back changes to individual files using:
git log -- filename        # show the commits that touched a single file
git revert <commit-id>     # undo the change introduced by that commit
Option 2 - If you have a system constraint that forces you to store a single file, I would recommend considering git-bundle. Basically, it allows you to pack a git repo into a single file for easier storage/relocation (I guess that's pretty much like zipping your repo and storing the zipped file).
Option 3 - Consider Fossil. I haven't used it, so I can't comment on its qualities, but it looks like it might answer your requirements.
I'm generating time series data in OCaml which is basically long lists of floats, from a few kB to hundreds of MB. I would like to read, analyze and plot them using the Python numpy and pandas libraries. Right now, I'm thinking of writing them to CSV files.
A binary format would probably be more efficient? I'd use HDF5 in a heartbeat, but OCaml does not have a binding. Is there a good binary exchange format that is easily usable from both sides? Is writing a file the best option, or is there a better protocol for data exchange? Potentially even something that can be updated on-line?
First of all, I would like to mention that there are actually HDF5 bindings for OCaml. But when I was faced with the same problem, I didn't find one that suited my purposes and was mature enough, so I wouldn't suggest using them; but who knows, maybe today there is something more decent.
In my experience, the best way to store numeric data in OCaml is Bigarrays. They are essentially wrappers around a C pointer, which can be allocated outside the OCaml runtime, and they can also be memory-mapped regions. For me this is the most efficient way to share data between different processes (potentially written in different languages). You can share data via memory mapping between OCaml, Python, Matlab or whatever with very little pain, especially if you're not trying to modify it from different processes simultaneously.
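On the Python side, picking up such a memory-mapped (or plain binary) file is a one-liner with numpy. A minimal sketch, assuming the OCaml side wrote a flat array of native-endian 64-bit floats to data.bin:

import numpy as np

# Map the file produced by the OCaml process; nothing is copied,
# the OS pages the data in on demand.
series = np.memmap("data.bin", dtype=np.float64, mode="r")
print(series.shape, series.mean())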
Other approaches are to use MPI, ZMQ or bare sockets. I would prefer the latter, if only because the former don't support bigarrays directly. I would also suggest you look at Cap'n Proto; it is very efficient, has bindings for OCaml and Python, and could work very well for your particular use case.
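To illustrate the socket route, here is a hedged sketch of the Python end using pyzmq, assuming the OCaml producer PUSHes raw little-endian float64 buffers to port 5555 (the port and framing are assumptions, not a fixed protocol):

import zmq
import numpy as np

context = zmq.Context()
socket = context.socket(zmq.PULL)
socket.bind("tcp://*:5555")   # the OCaml side would connect and PUSH here

while True:
    msg = socket.recv()                       # one raw buffer per message
    chunk = np.frombuffer(msg, dtype="<f8")   # interpret the bytes as float64
    print(len(chunk), chunk[:5])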
Originally, I had to deal with just 1.5[TB] of data. Since I just needed fast writes/reads (without any SQL), I designed my own flat binary file format (implemented using Python) and easily (and happily) saved my data and manipulated it on one machine. Of course, for backup purposes, I added 2 machines to be used as exact mirrors (using rsync).
Presently, my needs are growing, and there's a need to build a solution that would successfully scale up to 20[TB] (and even more) of data. I would be happy to continue using my flat file format for storage. It is fast, reliable and gives me everything I need.
The thing I am concerned about is replication, data consistency, etc. across the network (obviously, not all the data can be stored on one machine, so it will have to be distributed).
Are there any ready-made solutions (Linux / Python based) that would allow me to keep using my file format for storage, yet would handle the other components that NoSQL solutions normally provide (data consistency / availability / easy replication)?
Basically, all I want to make sure is that my binary files are consistent throughout my network. I am using a network of 60 Core Duo machines (each with 1GB RAM and a 1.5TB disk).
Approach: distributed MapReduce in Python with the Disco Project
This seems like a good way of approaching your problem. I have used the Disco Project for similar problems.
You can distribute your files among any number of machines (processes) and implement map and reduce functions that fit your logic.
The Disco Project tutorial describes exactly how to implement a solution to problems like yours. You'll be impressed by how little code you need to write, and you can definitely keep the format of your binary files.
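For orientation, the tutorial's word-count example looks roughly like the sketch below (adapted from memory, so treat it as a starting point rather than tested code); in your case the map function would parse records from your binary format instead of splitting text lines:

from disco.core import Job, result_iterator

def map(line, params):
    for word in line.split():
        yield word, 1

def reduce(iter, params):
    from disco.util import kvgroup
    for word, counts in kvgroup(sorted(iter)):
        yield word, sum(counts)

if __name__ == '__main__':
    job = Job().run(input=["http://discoproject.org/media/text/chekhov.txt"],
                    map=map, reduce=reduce)
    for word, count in result_iterator(job.wait(show=True)):
        print(word, count)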
Another similar option is to use Amazon's Elastic MapReduce
Perhaps some of the commentary on the Kivaloo system developed for Tarsnap will help you decide what's most appropriate: http://www.daemonology.net/blog/2011-03-28-kivaloo-data-store.html
Without knowing more about your application (size/type of records, frequency of reading/writing) or custom format it's hard to say more.
This is what I've done for a project. I have a few data structures that are basically dictionaries with some methods that operate on the data. When I save them to disk, I write them out to .py files as code that, when imported as a module, will load the same data into such a data structure.
Is this reasonable? Are there any big disadvantages? The advantage I see is that when I want to operate on the saved data, I can quickly import the modules I need. Also, the modules can be used separately from the rest of the application, because you don't need a separate parser or loader.
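Concretely, the saving side of the idea is something like this (field names and file names are just for illustration):

data = {"voltage": [1.2, 1.3, 1.5], "label": "run 42"}

def save_as_module(d, path):
    # Write the dictionary out as importable Python source.
    with open(path, "w") as f:
        f.write("data = %r\n" % (d,))

save_as_module(data, "run42_data.py")

# Later, elsewhere in the application:
#   import run42_data
#   print(run42_data.data["label"])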
By operating this way, you may gain some modicum of convenience, but you pay many kinds of price for that. The space it takes to save your data, and the time it takes to both save and reload it, go up substantially; and your security exposure is unbounded -- you must ferociously guard the paths from which you reload modules, as it would provide an easy avenue for any attacker to inject code of their choice to be executed under your userid (pickle itself is not rock-solid, security-wise, but, compared to this arrangement, it shines;-).
All in all, I prefer a simpler and more traditional arrangement: executable code lives in one module (on a typical code-loading path, that does not need to be R/W once the module's compiled) -- it gets loaded just once and from an already-compiled form. Data live in their own files (or portions of DB, etc) in any of the many suitable formats, mostly standard ones (possibly including multi-language ones such as JSON, CSV, XML, ... &c, if I want to keep the option open to easily load those data from other languages in the future).
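For instance, the same dictionary could live in a plain JSON file instead of a module; a minimal sketch (file name made up):

import json

data = {"voltage": [1.2, 1.3, 1.5], "label": "run 42"}

with open("run42_data.json", "w") as f:
    json.dump(data, f, indent=2)

with open("run42_data.json") as f:
    loaded = json.load(f)   # plain data comes back; no code gets executed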
It's reasonable, and I do it all the time. Obviously it's not a format you use to exchange data, so it's not a good format for anything like a save file.
But for example, when I do migrations of websites to Plone, I often get data about the site (such as a list of which pages should be migrated, a list of how old URLs should be mapped to new ones, or lists of tags). These you typically get in Word or Excel format. Also, the data often needs a bit of massaging, and I end up with what for all intents and purposes are dictionaries mapping one URL to some other information.
Sure, I could save that as CSV and parse it into a dictionary. But instead I typically save it as a Python file with a dictionary. Saves code.
So, yes, it's reasonable; no, it's not a format you should use for any sort of save file. It is, however, often used for data that straddles the border with configuration, as above.
The biggest drawback is that it's a potential security problem, since it's hard to guarantee that the files won't contain arbitrary code, which could be really bad. So don't use this approach if anyone other than you has write access to the files.
A reasonable option might be to use the Pickle module, which is specifically designed to save and restore python structures to disk.
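For completeness, a minimal pickle round trip looks like this (keeping in mind that pickle files are Python-only and should never be loaded from untrusted sources):

import pickle

state = {"pages": ["index.html", "about.html"], "migrated": 17}

with open("state.pkl", "wb") as f:
    pickle.dump(state, f)

with open("state.pkl", "rb") as f:
    restored = pickle.load(f)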
Alex Martelli's answer is absolutely insightful and I agree with him. However, I'll go one step further and make a specific recommendation: use JSON.
JSON is simple, and Python's data structures map well onto it; and there are several standard libraries and tools for working with JSON. The standard-library json module (available since Python 2.6) is based on simplejson, so I would use simplejson on older Python 2.x and json everywhere else.
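A common compatibility idiom for that is:

try:
    import json                  # standard library in Python 2.6+ and 3.x
except ImportError:
    import simplejson as json    # fallback for older Python 2.x

text = json.dumps({"a": [1, 2, 3]})
data = json.loads(text)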
Second choice is XML. XML is more complicated, and harder to just look at (or just edit with a text editor) but there is a vast wealth of tools to validate it, filter it, edit it, etc.
Also, if your data storage and retrieval needs become at all nontrivial, consider using an actual database. SQLite is terrific: it's small, and for small databases runs very fast, but it is a real actual SQL database. I would definitely use a Python ORM instead of learning SQL to interact with the database; my favorite ORM for SQLite would be Autumn (small and simple), or the ORM from Django (you don't even need to learn how to create tables in SQL!) Then if you ever outgrow SQLite, you can move up to a real database such as PostgreSQL. If you find yourself writing lots of loops that search through your saved data, and especially if you need to enforce dependencies (such as if foo is deleted, bar must be deleted too) consider going to a database.
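Because sqlite3 ships with Python, trying it costs almost nothing; a minimal sketch with made-up table and column names:

import sqlite3

conn = sqlite3.connect("results.db")
conn.execute("CREATE TABLE IF NOT EXISTS runs (id INTEGER PRIMARY KEY, label TEXT, value REAL)")
conn.execute("INSERT INTO runs (label, value) VALUES (?, ?)", ("run 42", 3.14))
conn.commit()

for row in conn.execute("SELECT label, value FROM runs"):
    print(row)
conn.close()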
My Python application currently uses the python-memcached API to set and get objects in memcached. This API uses Python's native pickle module to serialize and de-serialize Python objects. This API makes it simple and fast to store nested Python lists, dictionaries and tuples in memcached, and reading these objects back into the application is completely transparent -- it just works. But I don't want to be limited to using Python exclusively, and if all the memcached objects are serialized with pickle, then clients written in other languages won't work.
Here are the cross-platform serialization options I've considered:
XML - the main benefit is that it's human-readable, but that's not important in this application. XML also takes a lot of space, and it's expensive to parse.
JSON - seems like a good cross-platform standard, but I'm not sure it preserves object types when read back from memcached. For example, according to this post, tuples are transformed into lists when using simplejson; also, it seems like adding elements to the JSON structure could break code written against the old structure.
Google Protocol Buffers - I'm really interested in this because it seems very fast and compact -- at least 10 times smaller and faster than XML; it's not human-readable, but that's not important for this app; and it seems designed to support growing the structure without breaking old code
Considering the priorities for this app, what's the ideal object serialization method for memcached?
Cross-platform support (Python, Java, C#, C++, Ruby, Perl)
Handling nested data structures
Fast serialization/de-serialization
Minimum memory footprint
Flexibility to change structure without breaking old code
One major consideration is "do you want to have to specify each structure definition"?
If you are OK with that, then you could take a look at:
Protocol Buffers - http://code.google.com/apis/protocolbuffers/docs/overview.html
Thrift - http://developers.facebook.com/thrift/ (more geared toward services)
Both of these solutions require supporting files to define each data structure.
If you would prefer not to incur the developer overhead of pre-defining each structure, then take a look at:
JSON (via python cjson, and native PHP json). Both are really really fast if you don't need to transmit binary content (such as images, etc...).
YAML (Yet Another Markup Language) - http://www.yaml.org/. Also really fast if you get the right library.
However, I believe that both of these have had issues with transporting binary content, which is why they were ruled out for our usage. Note: YAML may have good binary support, you will have to check the client libraries -- see here: http://yaml.org/type/binary.html
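The usual workaround for binary payloads in JSON is to base64-encode them yourself before serializing; a sketch (the keys and the blob are illustrative):

import base64
import json

blob = b"\x89PNG\x00\x01 binary data here"        # stand-in for real binary content
payload = {"name": "image", "data": base64.b64encode(blob).decode("ascii")}
text = json.dumps(payload)                         # safe: only ASCII inside

raw = base64.b64decode(json.loads(text)["data"])   # round-trips to the original bytes

The cost is the usual ~33% size overhead of base64.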
At our company, we rolled our own library (Extruct) for cross-language serialization with binary support. We currently have (decently) fast implementations in Python and PHP, although it isn't very human readable due to using base64 on all the strings (binary support). Eventually we will port them to C and use more standard encoding.
Dynamic languages like PHP and Python get really slow if you have too many iterations in a loop or have to look at each character. C on the other hand shines at such operations.
If you'd like to see the implementation of Extruct, please let me know. (contact info at http://blog.gahooa.com/ under "About Me")
I tried several methods and settled on compressed JSON as the best balance between speed and memory footprint. Python's native pickle module is slightly faster, but the resulting objects can't be used with non-Python clients.
I'm seeing 3:1 compression so all the data fits in memcache and the app gets sub-10ms response times including page rendering.
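In case it is useful, the pattern is essentially the sketch below, using python-memcached with zlib; the server address, key and object are placeholders:

import json
import zlib
import memcache

mc = memcache.Client(["127.0.0.1:11211"])

obj = {"user": 42, "tags": ["a", "b", "c"]}
blob = zlib.compress(json.dumps(obj).encode("utf-8"))   # compressed JSON bytes
mc.set("user:42", blob)                                  # stored as an opaque byte string

cached = mc.get("user:42")
restored = json.loads(zlib.decompress(cached).decode("utf-8"))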
Here's a comparison of JSON, Thrift, Protocol Buffers and YAML, with and without compression:
http://bouncybouncy.net/ramblings/posts/more_on_json_vs_thrift_and_protocol_buffers/
Looks like this test got the same results I did with compressed JSON. Since I don't need to pre-define each structure, this seems like the fastest and smallest cross-platform answer.
"Cross-platform support (Python, Java, C#, C++, Ruby, Perl)"
Too bad this criterion is first. The intent behind most languages is to express fundamental data structures and processing differently. That's what makes multiple languages a "problem": they're all different.
A single representation that's good across many languages is generally impossible. There are compromises in richness of the representation, performance or ambiguity.
JSON meets the remaining criteria nicely. Messages are compact and parse quickly (unlike XML). Nesting is handled nicely. Changing structure without breaking code is always iffy -- if you remove something, old code will break. If you change something that was required, old code will break. If you're adding things, however, JSON handles this also.
I like human-readable. It helps with a lot of debugging and trouble-shooting.
The subtlety of having Python tuples turn into lists isn't an interesting problem. The receiving application already knows the structure being received, and can tweak it up (if it matters.)
Edit on performance.
Parsing the XML and JSON documents from http://developers.de/blogs/damir_dobric/archive/2008/12/27/performance-comparison-soap-vs-json-wcf-implementation.aspx
xmlParse 0.326
jsonParse 0.255
JSON appears to be significantly faster for the same content. I used the Python SimpleJSON and ElementTree modules in Python 2.5.2.
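A comparable measurement can be reproduced along these lines (a rough sketch, not the original benchmark; the documents and repeat count are placeholders):

import json
import timeit
import xml.etree.ElementTree as ET

json_doc = '{"orders": [{"id": 1, "total": 9.5}, {"id": 2, "total": 3.25}]}'
xml_doc = '<orders><order id="1" total="9.5"/><order id="2" total="3.25"/></orders>'

print("jsonParse", timeit.timeit(lambda: json.loads(json_doc), number=10000))
print("xmlParse ", timeit.timeit(lambda: ET.fromstring(xml_doc), number=10000))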
You might be interested in this link:
http://kbyanc.blogspot.com/2007/07/python-serializer-benchmarks.html
An alternative: MessagePack seems to be the fastest serializer out there. Maybe you can give it a try.
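A round trip with the msgpack Python bindings is about as short as it gets (a sketch; check the msgpack-python docs for the options controlling string/bytes handling):

import msgpack

packed = msgpack.packb({"id": 7, "values": [1.5, 2.5]})   # compact binary blob
obj = msgpack.unpackb(packed)                              # back to Python objects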
Hessian meets all of your requirements. There is a Python library here:
https://github.com/bgilmore/mustaine
The official documentation for the protocol can be found here:
http://hessian.caucho.com/
I regularly use it in both Java and Python. It works and doesn't require writing protocol definition files. I couldn't tell you how the Python serializer performs, but the Java version is reasonably efficient:
https://github.com/eishay/jvm-serializers/wiki/