I've seen people using *.cfg (Python Buildout), *.xml (Gnome), *.json (Chrome extension), *.yaml (Google App Engine), *.ini and even *.py for app configuration files (like Django).
My question is: why are there so many different configuration file formats? I can see the advantage of a JSON approach over XML (much less verbose), or of a Python one (sometimes you have a Python app and don't want to pull in a specific module just to parse a config file), but what about the other approaches?
I know there are even more formats than the ones I gave as examples. What are their real advantages compared to each other? Historical reasons? Compatibility with different systems?
If you were starting an application that reads some kind of configuration file (with a plugin ecosystem), which one would you use?
Which of the formats I gave as examples are the oldest? Do you know their history?
It's mostly personal preference, purpose, and available libraries. Personally I think xml is way too verbose for config files, but it is popular and has great libraries.
.cfg and .ini are legacy formats that work well, and many languages have an included library that reads them. I've used them in Java, Python, and C++ without issues. They don't really work as a data interchange format, and if I am passing data I will probably use the same format for config and data interchange.
YAML and JSON sit between XML and cfg/ini. You can define many data structures in both, or keep them as simple key-value pairs like cfg. Both of these formats have great libraries in Python, and I'm assuming many other languages have libraries as well. I believe JSON is a subset of YAML.
I've never used a Python file as config, but it does seem to work well for Django. It does allow you to have some code in the config, which might be useful.
Last time I was choosing a format I chose YAML. It's simple but has some nice features, and the Python library was easy to install and really good. JSON was a close second, and since the YAML library also parses JSON, I chose YAML over it.
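For example, a minimal sketch of what loading a YAML config with PyYAML might look like (the file name and keys below are made up for illustration):

# Minimal sketch: loading a YAML config file with PyYAML (pip install pyyaml).
# The file name and keys are made-up examples.
import yaml

with open("app.yaml") as f:
    config = yaml.safe_load(f)   # safe_load avoids executing arbitrary tags

# YAML gives you nested dicts/lists directly:
db_host = config["database"]["host"]
plugins = config.get("plugins", [])
print(db_host, plugins)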
Note, this is pure opinion and speculation on my part, but I suspect that the single biggest reason for the plethora of formats is the lack of a readily available, omnipresent configuration file parsing library. Lacking that, most programs have to write their own parsers, so it often comes down to a balance between how complex the configuration structure needs to be (hierarchical vs. flat, purely data vs. embedded logic like if statements, etc.), how much effort the developers were willing to spend on writing a configuration file parser, and how much of a pain it should be for the end user. However, probably every reason you've listed, and could think of, has been the motivation for a project or two in choosing their format.
For my own projects I tend to use .ini, simply because there's an excellent parser already built into Python and it's been "good enough" for most of my use cases. In the couple of cases where it's been insufficient, I've used an XML-based configuration file due, again, to the relative simplicity of implementation.
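For example, a minimal sketch with the built-in configparser (section and option names are made up):

# Minimal sketch: reading an .ini-style file with Python's built-in configparser.
# Section and option names are made-up examples.
import configparser

parser = configparser.ConfigParser()
parser.read("settings.ini")

# All values come back as strings; convert explicitly where needed.
host = parser.get("server", "host", fallback="localhost")
port = parser.getint("server", "port", fallback=8080)
print(host, port)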
Most configuration file formats inherit significant complexity because they support too many data types. For example, 2 can be interpreted as an integer, a real number, and a string, and since most configuration languages convert values into these underlying data types, a value like 2 must be written in such a way to convey both the value and the type. Thus, 2 is generally interpreted as an integer, 2.0 is interpreted as a real number, and "2" is interpreted as a string. This makes these languages hard to use for non-programmers who do not know these conventions. It also adds complexity and clutter to the language. This is particularly true for strings. Strings are a sequence of arbitrary characters, and so valid strings may contain the quote character. To distinguish between an embedded quote character and the quote character that terminates the string, the embedded quote character must be escaped. This leads to more complexity and clutter.
JSON and Python are unambiguous and handle hierarchy well, but are cluttered by type cues, quoting and escaping.
YAML makes the quoting optional, which in turn makes the format ambiguous.
INI takes all values as strings, which requires the end application to convert to the expected data type if required. This eliminates the need for type cues, quoting, and escaping. But INI does not naturally support hierarchy, it is not standardized, and different implementations interpret files inconsistently.
A new format, NestedText, is now available that is similar to INI in that all leaf values are strings, but it is standardized (it has an unambiguous spec and a comprehensive test suite) and naturally handles hierarchy. The result is a very simple configuration format that is suitable for both programmers and non-programmers. It should be considered as the configuration file and structured data file format for all new applications, particularly those where the files are to be read, written, or modified by casual users.
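For illustration, here is a rough sketch of reading a NestedText document with the third-party nestedtext package (the keys and values are made up; this assumes the package's loads() function):

# Minimal sketch: loading a NestedText document with the third-party
# 'nestedtext' package (pip install nestedtext). Keys are made-up examples.
import nestedtext as nt

document = """
server:
    host: example.com
    port: 8080
plugins:
    - auth
    - logging
"""

config = nt.loads(document)
# All leaf values arrive as strings; the application converts types itself.
port = int(config["server"]["port"])
print(config["server"]["host"], port)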
It really depends on whether the reader/writer of the configuration file is, or can be, a non-programmer human or if it has strictly programmatic access.
XML, JSON, etc., are totally unsuitable for human consumption, while the INI format is half-way reasonable for humans.
If the configuration file has only programmatic access (with occasional editing by a programmer), then any of the supported formats (XML, JSON, or INI) is suitable. The INI format is not suitable for structured parameters, whereas XML and JSON can support arrays, structs/objects, etc.
According to the official Python docs, 'future enhancements to configuration functionality will be added to dictConfig()'. So any config file format that can be loaded into a dict and passed to dictConfig() should be fine.
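For example, a minimal sketch that feeds dictConfig() a dict; the dict could equally be produced by json.load() or yaml.safe_load() from whatever format you prefer:

# Minimal sketch: logging configuration via logging.config.dictConfig().
# The config here is inline, but it could come from any file format that
# parses into a dict (JSON, YAML, etc.).
import logging.config

config_dict = {
    "version": 1,
    "formatters": {"plain": {"format": "%(levelname)s %(message)s"}},
    "handlers": {"console": {"class": "logging.StreamHandler", "formatter": "plain"}},
    "root": {"handlers": ["console"], "level": "INFO"},
}

logging.config.dictConfig(config_dict)
logging.getLogger(__name__).info("logging configured from a dict")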
Is there a parser for PDB files (Protein Data Bank) that can extract (most of the) information from the header/REMARK section, like refinement statistics, etc.?
It might be worthwhile to note that I am mainly interested in accessing data from files right after they have been produced, not from structures that have already been deposited in the Protein Data Bank. This means that there is quite a variety of different "proprietary" formats to deal with, depending on the refinement software used.
I've had a look at Biopython, but they explicitly state in the FAQ that "If you are interested in data mining the PDB header, you might want to look elsewhere because there is only limited support for this."
I am well aware that it would be a lot easier to extract this information from mmCIF files, but unfortunately these are still not output routinely by many macromolecular crystallography programs.
The best way I have found so far is converting the PDB file into mmCIF format using pdb_extract (http://pdb-extract.wwpdb.org/, either online or as a standalone tool).
The mmCIF file can then be parsed using Biopython's Bio.PDB module.
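For example, a rough sketch using Biopython's MMCIF2Dict (the file name and the particular mmCIF keys are just examples; which keys are present depends on what the refinement software actually wrote):

# Minimal sketch: reading header-level fields from an mmCIF file with Biopython.
# Depending on the Biopython version, values may come back as strings or as
# lists of strings.
from Bio.PDB.MMCIF2Dict import MMCIF2Dict

data = MMCIF2Dict("structure.cif")
resolution = data.get("_refine.ls_d_res_high")
r_work = data.get("_refine.ls_R_factor_R_work")
r_free = data.get("_refine.ls_R_factor_R_free")
print(resolution, r_work, r_free)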
Writing to the mmCIF file is a bit trickier; Python PDBx seems to work reasonably well.
This, and other useful PDB/mmCIF tools, can be found at http://mmcif.wwpdb.org/docs/software-resources.html
Maybe you should try this library:
https://pypi.python.org/pypi/bioservices
I was looking at the source of the json module to try to answer another question when I found something curious. Removing the docstring and a whole bunch of keyword arguments, the source of json.load is as follows:
def load(fp):
    return loads(fp.read())
This was not at all as I expected. If json.load doesn't avoid the overhead of reading the whole file at once, is its only advantage over json.loads(f.read()) the savings of a few characters of source code? Why does it even exist? Why does it get the short name, instead of loads getting the load name and load getting a name like loadf? I can think of reasons (copying the pickle interface, for example), but can anyone provide an authoritative answer rather than speculation?
Though it is natural to expect json.load() to do something better, as mentioned in the comments, it doesn't guarantee to do so. This is purely speculative, but if I were a Python maintainer, I would design the modules for simplicity and the least maintenance overhead.
The Python standard library json module is not optimal, either speed-wise or memory-usage-wise. There are many alternative JSON reading implementations targeting different sweet spots, and some of them have Python bindings, e.g. Jansson:
https://stackoverflow.com/a/3512887/315168
Alternative JSON implementations are born from the need to handle streaming and/or huge amounts of data efficiently.
It is probably safe to say that reading JSON from files, although important, is not the main use case for JSON serialization. Thus, implementing an efficient JSON load from file is not that interesting, except in special cases (there are more efficient ways of serializing huge data structures to disk).
However, generalizing the concept may introduce some useful aspects (JSON deserialization from network streams, for example, or progressive JSON deserialization from pipes).
What we are looking for, then, is a streaming parser (like SAX for XML, for example). YAJL is a common such parser, and there are some Python bindings for it.
Also see the top answer to this question: Is there a streaming API for JSON?
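For example, ijson is one such incremental parser for Python (it offers YAJL-based backends); here is a rough sketch, assuming a document shaped like {"records": [ {...}, {...} ]} and a made-up file name:

# Minimal sketch: incremental JSON parsing with the third-party 'ijson' package.
# Only one record at a time is materialized in memory.
import ijson

with open("big.json", "rb") as f:
    for record in ijson.items(f, "records.item"):
        # each record is a fully parsed Python object; the rest of the file
        # has not been read yet
        print(record)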
I am working on a project which consists of two servers: one is written in Python, the other in C. To maximize the capacity of the servers, we defined a proprietary binary protocol by which the two can talk to each other.
The protocol is defined in a C header file in the form of C structs. Usually, I would use Vim to do some substitutions to convert this file into Python code. But that means I have to do this manually every time the protocol is modified.
So I believe a parser that could parse C header files would be a better choice. However, there are at least a dozen Python parser generators, so I don't know which one is more suitable for my particular task.
Any suggestion? Thanks a lot.
EDIT:
Of course I am not asking anyone to write the code for me....
The code is already finished. I converted the header file into Python code in the form that construct, a Python library which can parse binary data, can recognize.
I am also not looking for an already existing C parser. I am asking this question because a book I am reading talks a little about parser generators, which inspired me to learn how to use a real one.
EDIT Again:
When we made the design of the system, I suggested using Google Protocol Buffers, ZeroC ICE, or some other multi-language network programming middleware to eliminate the task of implementing a protocol.
However, not every programmer can read English documentation or wants to try new things, especially when they have plenty of experience doing it the old, simple, but slightly clumsy way.
An alternative solution that might feel a bit over-ambitious at the beginning, but might also serve you very well in the long term, is to:
Redefine the protocol in some higher-level language, for instance some custom XML
Generate both the C struct definitions and any required Python versions from the same source.
I would personally use PLY:
http://www.dabeaz.com/ply/
And there is already a C parser written with PLY:
http://code.google.com/p/pycparser/
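For example, a rough sketch that uses pycparser to list the fields of each struct in a header (the file name is made up; this assumes a C preprocessor is available and the header's includes can be resolved):

# Minimal sketch: using pycparser to print the fields of each struct definition.
from pycparser import parse_file, c_ast

class StructPrinter(c_ast.NodeVisitor):
    def visit_Struct(self, node):
        if node.decls:  # skip forward declarations
            print("struct", node.name)
            for decl in node.decls:
                print("   ", decl.name)

ast = parse_file("protocol.h", use_cpp=True)
StructPrinter().visit(ast)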
If I were doing this, I would use IDL as the structure definition language. The main problem you will have with C structs is that C has pointers, particularly char* for strings. Using IDL restricts the data types and imposes some semantics.
Then you can do whatever you want. Most parser generators are going to have IDL as a sample grammar.
A C struct is unlikely to be portable enough to be sending between machines. Different endian, different word-sizes, different compilers will all change the way the structure is mapped to bytes.
It would be better to use a properly portable binary format that is designed for communications.
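If you do stay with a hand-rolled binary format, one way to sidestep the issues above is to pin down the byte order and field sizes explicitly; for example, with Python's struct module (the field layout below is a made-up example, not your actual protocol):

# Minimal sketch: packing values with an explicit byte order and fixed-size
# types, so both ends agree regardless of platform.
import struct

# '<' = little-endian, no padding; I = uint32, H = uint16, 8s = 8 raw bytes
FMT = "<IH8s"

packed = struct.pack(FMT, 42, 7, b"hello\x00\x00\x00")
msg_id, flags, name = struct.unpack(FMT, packed)
print(msg_id, flags, name.rstrip(b"\x00"))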
This is what I've done for a project. I have a few data structures that are basically dictionaries, with some methods that operate on the data. When I save them to disk, I write them out to .py files as code that, when imported as a module, will load the same data into such a data structure.
Is this reasonable? Are there any big disadvantages? The advantage I see is that when I want to operate on the saved data, I can quickly import the modules I need. Also, the modules can be used separately from the rest of the application, because you don't need a separate parser or loader.
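For concreteness, here is roughly what I mean (the names and data are made up):

# Minimal sketch of the pattern: dump a dict as importable Python source,
# then load it back with a plain import.
data = {"pages": ["index", "about"], "tags": {"index": ["home"]}}

with open("saved_data.py", "w") as f:
    f.write("data = " + repr(data) + "\n")

# Later (or in another script), as long as saved_data.py is on sys.path:
from saved_data import data as reloaded
print(reloaded["pages"])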
By operating this way, you may gain some modicum of convenience, but you pay many kinds of price for that. The space it takes to save your data, and the time it takes to both save and reload it, go up substantially; and your security exposure is unbounded -- you must ferociously guard the paths from which you reload modules, as it would provide an easy avenue for any attacker to inject code of their choice to be executed under your userid (pickle itself is not rock-solid, security-wise, but, compared to this arrangement, it shines;-).
All in all, I prefer a simpler and more traditional arrangement: executable code lives in one module (on a typical code-loading path, that does not need to be R/W once the module's compiled) -- it gets loaded just once and from an already-compiled form. Data live in their own files (or portions of DB, etc) in any of the many suitable formats, mostly standard ones (possibly including multi-language ones such as JSON, CSV, XML, ... &c, if I want to keep the option open to easily load those data from other languages in the future).
It's reasonable, and I do it all the time. Obviously it's not a format you use to exchange data, so it's not a good format for anything like a save file.
But for example, when I do migrations of websites to Plone, I often get data about the site (such as a list of which pages should be migrated, a list of how old URLs should be mapped to new ones, or lists of tags). These you typically get in Word or Excel format. Also, the data often needs a bit of massaging, and I end up with what for all intents and purposes is a dictionary mapping one URL to some other information.
Sure, I could save that as CSV and parse it into a dictionary. But instead I typically save it as a Python file with a dictionary. Saves code.
So, yes, it's reasonable; no, it's not a format you should use for any sort of save file. It is, however, often used for data that straddles the border with configuration, as above.
The biggest drawback is that it's a potential security problem, since it's hard to guarantee that the files won't contain arbitrary code, which could be really bad. So don't use this approach if anyone other than you has write access to the files.
A reasonable option might be to use the pickle module, which is specifically designed to save and restore Python structures to disk.
Alex Martelli's answer is absolutely insightful and I agree with him. However, I'll go one step further and make a specific recommendation: use JSON.
JSON is simple, and Python's data structures map well into it; and there are several standard libraries and tools for working with JSON. The json module in the standard library (available since Python 2.6) is based on simplejson, so I would use simplejson on older Pythons and the built-in json everywhere else.
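For example, a nested structure round-trips through JSON with very little code (the file name is just an example):

# Minimal sketch: saving and reloading a nested Python structure as JSON.
import json

data = {"pages": ["index", "about"], "redirects": {"/old": "/new"}}

with open("data.json", "w") as f:
    json.dump(data, f, indent=2)

with open("data.json") as f:
    restored = json.load(f)

print(restored == data)  # True: dicts, lists, and strings round-trip cleanly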
Second choice is XML. XML is more complicated, and harder to just look at (or just edit with a text editor) but there is a vast wealth of tools to validate it, filter it, edit it, etc.
Also, if your data storage and retrieval needs become at all nontrivial, consider using an actual database. SQLite is terrific: it's small, and for small databases runs very fast, but it is a real actual SQL database. I would definitely use a Python ORM instead of learning SQL to interact with the database; my favorite ORM for SQLite would be Autumn (small and simple), or the ORM from Django (you don't even need to learn how to create tables in SQL!) Then if you ever outgrow SQLite, you can move up to a real database such as PostgreSQL. If you find yourself writing lots of loops that search through your saved data, and especially if you need to enforce dependencies (such as if foo is deleted, bar must be deleted too) consider going to a database.
My Python application currently uses the python-memcached API to set and get objects in memcached. This API uses Python's native pickle module to serialize and de-serialize Python objects. This API makes it simple and fast to store nested Python lists, dictionaries and tuples in memcached, and reading these objects back into the application is completely transparent -- it just works.
But I don't want to be limited to using Python exclusively, and if all the memcached objects are serialized with pickle, then clients written in other languages won't work.
Here are the cross-platform serialization options I've considered:
XML - the main benefit is that it's human-readable, but that's not important in this application. XML also takes a lot of space, and it's expensive to parse.
JSON - seems like a good cross-platform standard, but I'm not sure it preserves object types when read back from memcached. For example, according to this post, tuples are transformed into lists when using simplejson; also, it seems like adding elements to the JSON structure could break code written to the old structure.
Google Protocol Buffers - I'm really interested in this because it seems very fast and compact -- at least 10 times smaller and faster than XML; it's not human-readable, but that's not important for this app; and it seems designed to support growing the structure without breaking old code
Considering the priorities for this app, what's the ideal object serialization method for memcached?
Cross-platform support (Python, Java, C#, C++, Ruby, Perl)
Handling nested data structures
Fast serialization/de-serialization
Minimum memory footprint
Flexibility to change structure without breaking old code
One major consideration is "do you want to have to specify each structure definition"?
If you are OK with that, then you could take a look at:
Protocol Buffers - http://code.google.com/apis/protocolbuffers/docs/overview.html
Thrift - http://developers.facebook.com/thrift/ (more geared toward services)
Both of these solutions require supporting files to define each data structure.
If you would prefer not to incur the developer overhead of pre-defining each structure, then take a look at:
JSON (via Python's cjson, and PHP's native json). Both are really, really fast if you don't need to transmit binary content (such as images, etc.).
YAML (Yet Another Markup Language) - http://www.yaml.org/. Also really fast if you get the right library.
However, I believe that both of these have had issues with transporting binary content, which is why they were ruled out for our usage. Note: YAML may have good binary support, you will have to check the client libraries -- see here: http://yaml.org/type/binary.html
At our company, we rolled our own library (Extruct) for cross-language serialization with binary support. We currently have (decently) fast implementations in Python and PHP, although it isn't very human readable due to using base64 on all the strings (binary support). Eventually we will port them to C and use more standard encoding.
Dynamic languages like PHP and Python get really slow if you have too many iterations in a loop or have to look at each character. C on the other hand shines at such operations.
If you'd like to see the implementation of Extruct, please let me know. (contact info at http://blog.gahooa.com/ under "About Me")
I tried several methods and settled on compressed JSON as the best balance between speed and memory footprint. Python's native pickle is slightly faster, but the resulting objects can't be used with non-Python clients.
I'm seeing 3:1 compression so all the data fits in memcache and the app gets sub-10ms response times including page rendering.
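Roughly, the approach looks like this (the memcached client setup and key are just illustrative, using the python-memcached API):

# Minimal sketch: compressed JSON stored in memcached via zlib.
import json
import zlib
import memcache

mc = memcache.Client(["127.0.0.1:11211"])

obj = {"user": 123, "items": list(range(100))}
blob = zlib.compress(json.dumps(obj).encode("utf-8"))
mc.set("user:123", blob)

restored = json.loads(zlib.decompress(mc.get("user:123")).decode("utf-8"))
print(restored == obj)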
Here's a comparison of JSON, Thrift, Protocol Buffers and YAML, with and without compression:
http://bouncybouncy.net/ramblings/posts/more_on_json_vs_thrift_and_protocol_buffers/
Looks like this test got the same results I did with compressed JSON. Since I don't need to pre-define each structure, this seems like the fastest and smallest cross-platform answer.
"Cross-platform support (Python, Java, C#, C++, Ruby, Perl)"
Too bad this criterion is first. The intent behind most languages is to express fundamental data structures and processing differently. That's what makes multiple languages a "problem": they're all different.
A single representation that's good across many languages is generally impossible. There are compromises in richness of the representation, performance or ambiguity.
JSON meets the remaining criteria nicely. Messages are compact and parse quickly (unlike XML). Nesting is handled nicely. Changing structure without breaking code is always iffy -- if you remove something, old code will break. If you change something that was required, old code will break. If you're adding things, however, JSON handles this also.
I like human-readable. It helps with a lot of debugging and trouble-shooting.
The subtlety of having Python tuples turn into lists isn't an interesting problem. The receiving application already knows the structure being received, and can tweak it up (if it matters).
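For example, a quick illustration of the round trip:

# JSON has no tuple type, so a tuple comes back as a list; the receiver
# converts it if it cares.
import json

sent = {"point": (3, 4)}
received = json.loads(json.dumps(sent))
print(received["point"])          # [3, 4] -- now a list
point = tuple(received["point"])  # the receiving code tweaks it if it matters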
Edit on performance.
Parsing the XML and JSON documents from http://developers.de/blogs/damir_dobric/archive/2008/12/27/performance-comparison-soap-vs-json-wcf-implementation.aspx
xmlParse 0.326
jsonParse 0.255
JSON appears to be significantly faster for the same content. I used the Python SimpleJSON and ElementTree modules in Python 2.5.2.
You might be interested in this link:
http://kbyanc.blogspot.com/2007/07/python-serializer-benchmarks.html
An alternative: MessagePack seems to be the fastest serializer out there. Maybe you can give it a try.
Hessian meets all of your requirements. There is a Python library here:
https://github.com/bgilmore/mustaine
The official documentation for the protocol can be found here:
http://hessian.caucho.com/
I regularly use it in both Java and Python. It works and doesn't require writing protocol definition files. I couldn't tell you how the Python serializer performs, but the Java version is reasonably efficient:
https://github.com/eishay/jvm-serializers/wiki/