Ideal Data Structure to Deal with XML Data - python

Maybe a silly question, but I usually learn a lot with those. :)
I'm working on software that deals a lot with XML, both as input and as output, and a lot of processing happens in between.
My first thought was to use a dict internally as the data structure and, from there, work my way through the process of reading and writing it.
What do you guys think? Any better approach, Python-wise?

An XML document in general is a tree with lots of bells and whistles (attributes vs. child nodes, mixing of text with child nodes, entities, XML declarations, comments, and many more). Handling that should be left to existing, mature libraries - for Python, it's commonly agreed that lxml is the most convenient choice, followed by the stdlib ElementTree modules (by which one lxml module, lxml.etree, is so heavily inspired that incompatibilities are the exception).
These handle all that complexity and expose it in fairly manageable ways with many convenience methods (lxml's XPath support has saved me a lot of code). After parsing, programs can - of course - go on to convert the trees into simpler data structures that fit the data actually being modeled much better. Exactly which data structures are possible and sensible depends on what you want to represent (if you misuse XML as a flat key-value store, for instance, you could indeed go on to convert the tree into a dictionary).
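As an illustration, here is a minimal sketch (assuming lxml is installed; the document and keys are made up) of parsing with lxml first and only then flattening the tree into a dict:

from lxml import etree

xml = b"""<config>
  <server host="example.org" port="8080"/>
  <timeout>30</timeout>
</config>"""

root = etree.fromstring(xml)

# XPath keeps the extraction code short.
host = root.xpath("/config/server/@host")[0]
timeout = int(root.xpath("/config/timeout/text()")[0])

settings = {"host": host, "timeout": timeout}
print(settings)  # {'host': 'example.org', 'timeout': 30}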

It depends completely on what type of data you have in XML, what kinds of processing you'll need to do with it, what sort of output you'll need to produce from it, etc.

Related

parsing PDB header information with python (protein structure files)?

Is there a parser for PDB-files (Protein Data Bank) that can extract (most) information from the header/REMARK-section, like refinement statistics, etc.?
It might be worthwhile to note that I am mainly interested in accessing data from files right after they have been produced, not from structures that have already been deposited in the Protein Data Bank. This means that there is quite a variety of different "proprietary" formats to deal with, depending on the refinement software used.
I've had a look at Biopython, but they explicitly state in the FAQ that "If you are interested in data mining the PDB header, you might want to look elsewhere because there is only limited support for this."
I am well aware that it would be a lot easier to extract this information from mmCIF-files, but unfortunately these are still not output routinely from many macromolecular crystallography programs.
The best way I have found so far is converting the PDB file into mmCIF format using pdb_extract (http://pdb-extract.wwpdb.org/, either online or as a standalone tool).
The mmCIF file can then be parsed using Biopython's Bio.PDB module.
Writing to the mmCIF file is a bit trickier; Python PDBx seems to work reasonably well.
This and other useful PDB/mmCIF tools can be found at http://mmcif.wwpdb.org/docs/software-resources.html
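As a rough sketch (assuming Biopython is installed; the file name and item keys are only examples), Bio.PDB's MMCIF2Dict flattens a converted file into a dictionary keyed by category.item names:

from Bio.PDB.MMCIF2Dict import MMCIF2Dict

mmcif = MMCIF2Dict("structure.cif")  # e.g. the output of pdb_extract

# Refinement statistics live under the _refine category; which items are
# present depends on what the refinement program actually wrote out.
r_work = mmcif.get("_refine.ls_R_factor_R_work")
resolution = mmcif.get("_refine.ls_d_res_high")
print(r_work, resolution)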
Maybe you should try this library:
https://pypi.python.org/pypi/bioservices

Is json.load inefficient?

I was looking at the source of the json module to try to answer another question when I found something curious. Removing the docstring and a whole bunch of keyword arguments, the source of json.load is as follows:
def load(fp):
    return loads(fp.read())
This was not at all as I expected. If json.load doesn't avoid the overhead of reading the whole file at once, is its only advantage over json.loads(f.read()) the savings of a few characters of source code? Why does it even exist? Why does it get the short name, instead of loads getting the load name and load getting a name like loadf? I can think of reasons (copying the pickle interface, for example), but can anyone provide an authoritative answer rather than speculation?
Though it is natural to expect json.load() to do something smarter, as mentioned in the comments, there is no guarantee that it does. This is purely speculative, but if I were a Python maintainer, I would design the module for simplicity and the least maintenance overhead.
The Python standard library json module is not optimal, either speed-wise or memory-usage-wise. There are many alternative JSON reading implementations targeting different sweet spots, and some of them have Python bindings, e.g. Jansson:
https://stackoverflow.com/a/3512887/315168
Alternative JSON implementations were born from the necessity to handle streaming and/or huge amounts of data in an efficient manner.
It is probably safe to say that reading JSON from files, although important, is not the main use case for JSON serialization. Thus, implementing an efficient JSON load from file is not that interesting, except in special cases (there are more efficient ways of serializing huge data structures to disk).
However, generalizing the concept may introduce some useful aspects (JSON deserialization from network streams, for example, or progressive JSON deserialization from pipes).
What we are looking for, then, is a streaming parser (like SAX for XML, for example). YAJL is a common such parser, and there are several Python bindings for it.
Also see the top answer to this question: Is there a streaming API for JSON?
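As a hedged sketch of that approach (assuming the third-party ijson library, one of the YAJL-style streaming parsers, and a file "big.json" holding a top-level JSON array):

import ijson

def process(record):
    print(record)  # placeholder for real per-record handling

with open("big.json", "rb") as f:
    # items() yields one array element at a time without loading the whole file
    for record in ijson.items(f, "item"):
        process(record)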

How to encode parsing rules in python?

Given XML objects of many classes (say, types of document images), I need to generate some outputs depending on the class of the object, and a complex set of mathematical rules relating the contents of the XML file.
What is the generic name of this task (parsing?), and what is the easiest way to encode separate rules for each class, bearing in mind that the rules may involve mathematical relationships? I think I should create a file for each class to keep it manageable, using a DSL, but I am not sure. Someone suggested incorporating a full-blown Lua or JavaScript interpreter. Is this a good idea? I want to keep it lean and simple.
Parsing refers to reading a series of tokens and matching rules in a grammar. If you can specify your problem in this way you can write the grammar using pyparsing.
If what you are interested in doing is extracting the structure of an XML document, then you can use the standard Python module xml.etree.ElementTree. Also look at BeautifulSoup.
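As a minimal sketch of the ElementTree route (the class names, tags and rules are made up): parse the document, then dispatch to a per-class rule function, which keeps each class's mathematical rules in one place:

import xml.etree.ElementTree as ET

def rule_invoice(root):
    # example "mathematical rule": the total must equal the sum of the line items
    total = float(root.findtext("total"))
    items = sum(float(e.text) for e in root.findall("item/price"))
    return abs(total - items) < 0.01

def rule_receipt(root):
    return root.findtext("paid") == "true"

RULES = {"invoice": rule_invoice, "receipt": rule_receipt}

root = ET.parse("document.xml").getroot()
ok = RULES[root.get("class")](root)
print("valid" if ok else "rule violated")

If the per-class rule files should stay editable by non-programmers, each file could hold only the expressions and be evaluated against the extracted values, which is usually leaner than embedding a full Lua or JavaScript interpreter.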

Provide validated script configuration input (in Python)

I have scripts which need different configuration data. Most of the time it is a table format or a list of parameters. For the moment I read Excel tables for this.
However, not only is reading Excel slightly buggy (Excel is just not made for being a stable data provider), but I'd also like to include some data validation and a little help for the configurators so that the input is at least partially validated. It doesn't have to be pretty though - just functional. Pure text files would be too hard to read and check.
Can you suggest an easy-to-implement way to realize that? Of course one could program complicated web interfaces and forms, but maybe that's too much effort?!
What is an easy-to-edit way to provide data tables and other configuration parameters?
The configuration info is just small tables with a list of parameters or a matrix with mathematical coefficients.
I like to use YAML. It's pretty flexible, and Python can read it as a dict using PyYAML.
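A small sketch with PyYAML (the file name and keys are just examples); since the YAML comes back as plain dicts and lists, basic validation is a few asserts:

import yaml

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

# e.g. a parameter and a matrix of mathematical coefficients
assert cfg.get("timeout", 0) > 0, "timeout must be positive"
assert isinstance(cfg.get("coefficients"), list), "coefficients must be a list of rows"
print(cfg)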
It's a bit difficult to answer without knowing what the data you're working with looks like, but there are several ways you could do it. You could, for example, use something like CSV or SQLite, provided the data can be easily expressed in a tabular format. However, I think XML might be best for your use case. It is very versatile and easy to work with if you find a good editor (e.g. Serna or oXygen XML), though it might still be in your interest to write your own editor for it (which will probably not be as complicated as you think!). XML is easy to work with in Python through the standard xml.etree module, and XML Schemas can be used for validation.
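For the validation part, a hedged sketch using lxml's XML Schema support (the file names are placeholders):

from lxml import etree

schema = etree.XMLSchema(etree.parse("config.xsd"))
doc = etree.parse("config.xml")

if not schema.validate(doc):
    # error_log lists every violation with its line number
    for error in schema.error_log:
        print(error.line, error.message)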

What's the best serialization method for objects in memcached?

My Python application currently uses the python-memcached API to set and get objects in memcached. This API uses Python's native pickle module to serialize and de-serialize Python objects. This API makes it simple and fast to store nested Python lists, dictionaries and tuples in memcached, and reading these objects back into the application is completely transparent -- it just works.
But I don't want to be limited to using Python exclusively, and if all the memcached objects are serialized with pickle, then clients written in other languages won't work.
Here are the cross-platform serialization options I've considered:
XML - the main benefit is that it's human-readable, but that's not important in this application. XML also takes up a lot of space, and it's expensive to parse.
JSON - seems like a good cross-platform standard, but I'm not sure it preserves object types when read back from memcached. For example, according to this post, tuples are transformed into lists when using simplejson; also, it seems like adding elements to the JSON structure could break code written against the old structure.
Google Protocol Buffers - I'm really interested in this because it seems very fast and compact -- at least 10 times smaller and faster than XML; it's not human-readable, but that's not important for this app; and it seems designed to support growing the structure without breaking old code
Considering the priorities for this app, what's the ideal object serialization method for memcached?
Cross-platform support (Python, Java, C#, C++, Ruby, Perl)
Handling nested data structures
Fast serialization/de-serialization
Minimum memory footprint
Flexibility to change structure without breaking old code
One major consideration is: do you want to have to specify each structure definition?
If you are OK with that, then you could take a look at:
Protocol Buffers - http://code.google.com/apis/protocolbuffers/docs/overview.html
Thrift - http://developers.facebook.com/thrift/ (more geared toward services)
Both of these solutions require supporting files to define each data structure.
If you would prefer not to incur the developer overhead of pre-defining each structure, then take a look at:
JSON (via python cjson, and native PHP json). Both are really really fast if you don't need to transmit binary content (such as images, etc...).
YAML (Yet Another Markup Language) - http://www.yaml.org/. Also really fast if you get the right library.
However, I believe that both of these have had issues with transporting binary content, which is why they were ruled out for our usage. Note: YAML may have good binary support, you will have to check the client libraries -- see here: http://yaml.org/type/binary.html
At our company, we rolled our own library (Extruct) for cross-language serialization with binary support. We currently have (decently) fast implementations in Python and PHP, although it isn't very human readable due to using base64 on all the strings (binary support). Eventually we will port them to C and use more standard encoding.
Dynamic languages like PHP and Python get really slow if you have too many iterations in a loop or have to look at each character. C on the other hand shines at such operations.
If you'd like to see the implementation of Extruct, please let me know. (contact info at http://blog.gahooa.com/ under "About Me")
I tried several methods and settled on compressed JSON as the best balance between speed and memory footprint. Python's native pickle module is slightly faster, but the resulting objects can't be used with non-Python clients.
I'm seeing 3:1 compression so all the data fits in memcache and the app gets sub-10ms response times including page rendering.
Here's a comparison of JSON, Thrift, Protocol Buffers and YAML, with and without compression:
http://bouncybouncy.net/ramblings/posts/more_on_json_vs_thrift_and_protocol_buffers/
Looks like this test got the same results I did with compressed JSON. Since I don't need to pre-define each structure, this seems like the fastest and smallest cross-platform answer.
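For reference, a rough sketch of the compressed-JSON approach described above (the keys and values are illustrative; whether you reach the 3:1 ratio mentioned depends on the data):

import json
import zlib

def dumps_z(obj):
    return zlib.compress(json.dumps(obj).encode("utf-8"))

def loads_z(blob):
    return json.loads(zlib.decompress(blob).decode("utf-8"))

payload = dumps_z({"user": 42, "tags": ["a", "b"], "scores": [1.5, 2.5]})
print(len(payload), loads_z(payload))
# mc.set("user:42", payload)  # stored/fetched as raw bytes by the memcached client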
"Cross-platform support (Python, Java, C#, C++, Ruby, Perl)"
Too bad this criterion is listed first. The intent behind most languages is to express fundamental data structures and processing differently. That's what makes multiple languages a "problem": they're all different.
A single representation that's good across many languages is generally impossible. There are compromises in richness of the representation, performance or ambiguity.
JSON meets the remaining criteria nicely. Messages are compact and parse quickly (unlike XML). Nesting is handled nicely. Changing structure without breaking code is always iffy -- if you remove something, old code will break. If you change something that was required, old code will break. If you're adding things, however, JSON handles this also.
I like human-readable. It helps with a lot of debugging and trouble-shooting.
The subtlety of having Python tuples turn into lists isn't an interesting problem. The receiving application already knows the structure being received and can tweak it up (if it matters).
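A quick illustration of that tuple/list point:

import json

data = {"point": (3, 4)}
restored = json.loads(json.dumps(data))
print(restored["point"])          # [3, 4] - it comes back as a list
point = tuple(restored["point"])  # the receiver converts it back if it matters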
Edit on performance.
Parsing the XML and JSON documents from http://developers.de/blogs/damir_dobric/archive/2008/12/27/performance-comparison-soap-vs-json-wcf-implementation.aspx
xmlParse 0.326
jsonParse 0.255
JSON appears to be significantly faster for the same content. I used the Python SimpleJSON and ElementTree modules in Python 2.5.2.
You might be interested in this link:
http://kbyanc.blogspot.com/2007/07/python-serializer-benchmarks.html
An alternative: MessagePack seems to be the fastest serializer out there. Maybe you can give it a try.
Hessian meets all of your requirements. There is a Python library here:
https://github.com/bgilmore/mustaine
The official documentation for the protocol can be found here:
http://hessian.caucho.com/
I regularly use it in both Java and Python. It works and doesn't require writing protocol definition files. I couldn't tell you how the Python serializer performs, but the Java version is reasonably efficient:
https://github.com/eishay/jvm-serializers/wiki/
