My Python application currently uses the python-memcached API to set and get objects in memcached. This API uses Python's native pickle module to serialize and de-serialize Python objects. This API makes it simple and fast to store nested Python lists, dictionaries and tuples in memcached, and reading these objects back into the application is completely transparent -- it just works.
But I don't want to be limited to using Python exclusively, and if all the memcached objects are serialized with pickle, then clients written in other languages won't work.
Here are the cross-platform serialization options I've considered:
XML - the main benefit is that it's human-readable, but that's not important in this application. XML also takes a lot of space, and it's expensive to parse.
JSON - seems like a good cross-platform standard, but I'm not sure it preserves object types when read back from memcached. For example, according to this post, tuples are transformed into lists when using simplejson; also, it seems like adding elements to the JSON structure could break code written against the old structure.
Google Protocol Buffers - I'm really interested in this because it seems very fast and compact -- at least 10 times smaller and faster than XML. It's not human-readable, but that's not important for this app, and it seems designed to support growing the structure without breaking old code.
Considering the priorities for this app, what's the ideal object serialization method for memcached?
Cross-platform support (Python, Java, C#, C++, Ruby, Perl)
Handling nested data structures
Fast serialization/de-serialization
Minimum memory footprint
Flexibility to change structure without breaking old code
One major consideration is "do you want to have to specify each structure definition"?
If you are OK with that, then you could take a look at:
Protocol Buffers - http://code.google.com/apis/protocolbuffers/docs/overview.html
Thrift - http://developers.facebook.com/thrift/ (more geared toward services)
Both of these solutions require supporting files to define each data structure.
If you would prefer not to incur the developer overhead of pre-defining each structure, then take a look at:
JSON (via python cjson, and native PHP json). Both are really really fast if you don't need to transmit binary content (such as images, etc...).
YAML - http://www.yaml.org/. Also really fast if you get the right library.
However, I believe that both of these have had issues with transporting binary content, which is why they were ruled out for our usage. Note: YAML may have good binary support; you will have to check the client libraries -- see here: http://yaml.org/type/binary.html
At our company, we rolled our own library (Extruct) for cross-language serialization with binary support. We currently have (decently) fast implementations in Python and PHP, although it isn't very human readable due to using base64 on all the strings (binary support). Eventually we will port them to C and use more standard encoding.
Dynamic languages like PHP and Python get really slow if you have too many iterations in a loop or have to look at each character. C on the other hand shines at such operations.
If you'd like to see the implementation of Extruct, please let me know. (contact info at http://blog.gahooa.com/ under "About Me")
I tried several methods and settled on compressed JSON as the best balance between speed and memory footprint. Python's native pickle module is slightly faster, but the resulting objects can't be used with non-Python clients.
I'm seeing 3:1 compression so all the data fits in memcache and the app gets sub-10ms response times including page rendering.
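As a rough sketch of the compress-then-cache approach, using only the standard-library json and zlib modules (the sample data is made up, and the actual memcached client calls are left out as they depend on your setup):

import json
import zlib

# A repetitive nested structure, standing in for the kind of data being cached.
rows = [{"id": i, "name": "user%d" % i, "tags": ["a", "b"]} for i in range(1000)]

raw = json.dumps(rows).encode("utf-8")
packed = zlib.compress(raw)
print("json: %d bytes, compressed: %d bytes" % (len(raw), len(packed)))

# Store `packed` in memcached under your key of choice, then reverse the
# steps on the way out:
restored = json.loads(zlib.decompress(packed).decode("utf-8"))
assert restored == rows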
Here's a comparison of JSON, Thrift, Protocol Buffers and YAML, with and without compression:
http://bouncybouncy.net/ramblings/posts/more_on_json_vs_thrift_and_protocol_buffers/
Looks like this test got the same results I did with compressed JSON. Since I don't need to pre-define each structure, this seems like the fastest and smallest cross-platform answer.
"Cross-platform support (Python, Java, C#, C++, Ruby, Perl)"
Too bad this criterion is first. The intent behind most languages is to express fundamental data structures and processing differently. That's what makes multiple languages a "problem": they're all different.
A single representation that's good across many languages is generally impossible. There are compromises in richness of the representation, performance or ambiguity.
JSON meets the remaining criteria nicely. Messages are compact and parse quickly (unlike XML). Nesting is handled nicely. Changing structure without breaking code is always iffy -- if you remove something, old code will break. If you change something that was required, old code will break. If you're adding things, however, JSON handles this also.
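A small illustration of that last point, assuming the usual pattern of consumers ignoring keys they don't know about (the field names here are made up):

import json

# Old consumer code, written against the original structure.
def handle(payload):
    data = json.loads(payload)
    return data["user"], data.get("email")  # .get() tolerates a missing field

old_message = json.dumps({"user": "ada"})
new_message = json.dumps({"user": "ada", "email": "ada@example.com",
                          "avatar": "http://example.com/a.png"})  # field added later

# The same old code handles both versions: unknown keys are simply ignored.
print(handle(old_message))
print(handle(new_message))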
I like human-readable. It helps with a lot of debugging and trouble-shooting.
The subtlety of having Python tuples turn into lists isn't an interesting problem. The receiving application already knows the structure being received, and can tweak it (if it matters).
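For example, a minimal sketch of that tweak (the field names are illustrative):

import json

record = {"point": (3, 4), "label": "offset"}

round_tripped = json.loads(json.dumps(record))
print(round_tripped["point"])    # [3, 4] -- the tuple came back as a list

# The receiver knows "point" should be a tuple, so it can just convert it:
round_tripped["point"] = tuple(round_tripped["point"])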
Edit on performance.
Parsing the XML and JSON documents from http://developers.de/blogs/damir_dobric/archive/2008/12/27/performance-comparison-soap-vs-json-wcf-implementation.aspx
xmlParse 0.326
jsonParse 0.255
JSON appears to be significantly faster for the same content. I used the Python SimpleJSON and ElementTree modules in Python 2.5.2.
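If you want to reproduce that kind of measurement yourself, here is a minimal sketch using only the standard library (the sample documents are placeholders, not the ones from the linked post, and the stdlib json module stands in for SimpleJSON on modern Pythons):

import json
import timeit
import xml.etree.ElementTree as ET

xml_doc = "<orders>" + "<order><id>1</id><qty>2</qty></order>" * 500 + "</orders>"
json_doc = json.dumps({"orders": [{"id": 1, "qty": 2}] * 500})

xml_time = timeit.timeit(lambda: ET.fromstring(xml_doc), number=100)
json_time = timeit.timeit(lambda: json.loads(json_doc), number=100)

print("xmlParse  %.3f" % xml_time)
print("jsonParse %.3f" % json_time)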
You might be interested in this link:
http://kbyanc.blogspot.com/2007/07/python-serializer-benchmarks.html
An alternative: MessagePack seems to be the fastest serializer out there. Maybe you can give it a try.
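A minimal sketch of what that looks like, assuming the msgpack Python package is installed (the record is illustrative, and string-handling defaults vary slightly between msgpack versions):

import json
import msgpack  # pip install msgpack

record = {"id": 42, "name": "sensor-7", "samples": [0.1, 0.2, 0.3]}

packed = msgpack.packb(record)
print(len(packed), len(json.dumps(record)))  # msgpack is usually the smaller one

# Round-trips back to plain dicts/lists (str handling depends on msgpack version).
restored = msgpack.unpackb(packed)
print(restored)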
Hessian meets all of your requirements. There is a python library here:
https://github.com/bgilmore/mustaine
The official documentation for the protocol can be found here:
http://hessian.caucho.com/
I regularly use it in both Java and Python. It works and doesn't require writing protocol definition files. I couldn't tell you how the Python serializer performs, but the Java version is reasonably efficient:
https://github.com/eishay/jvm-serializers/wiki/
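For what it's worth, a minimal sketch of calling a Hessian service through mustaine, based on its documented HessianProxy client (the URL and method name come from Caucho's public test service and may not stay available):

from mustaine.client import HessianProxy

# Any Hessian endpoint works here; this one is Caucho's public test service.
service = HessianProxy("http://hessian.caucho.com/test/test")
print(service.replyDate_1())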
I'm generating time series data in OCaml, which are basically long lists of floats, from a few kB to hundreds of MB. I would like to read, analyze and plot them using the Python NumPy and pandas libraries. Right now, I'm thinking of writing them to CSV files.
A binary format would probably be more efficient. I'd use HDF5 in a heartbeat, but OCaml does not have a binding. Is there a good binary exchange format that is easily usable from both sides? Is writing a file the best option, or is there a better protocol for data exchange? Potentially even something that can be updated online?
First of all, I would like to mention that there actually are HDF5 bindings for OCaml. But when I was faced with the same problem, I didn't find one that suited my purposes and was mature enough, so I wouldn't suggest using it -- though who knows, maybe today there is something more decent.
In my experience, the best way to store numeric data in OCaml is Bigarrays. They are essentially wrappers around a C pointer that can be allocated outside of the OCaml runtime, and they can also be memory-mapped regions. For me this is the most efficient way to share data between different processes (potentially written in different languages). You can share data using memory mapping between OCaml, Python, Matlab or whatever with very little pain, especially if you're not trying to modify it from different processes simultaneously.
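The Python side of that sharing can be a one-liner with numpy.memmap. A minimal sketch, assuming the OCaml process has mapped a one-dimensional float64 Bigarray (C layout, little-endian) to a file whose name is made up here:

import numpy as np

# The OCaml side is assumed to have written /tmp/series.dat through a
# memory-mapped Bigarray of kind float64.
series = np.memmap("/tmp/series.dat", dtype="<f8", mode="r")

print(series[:10])      # zero-copy view over the shared buffer
print(series.mean())    # analyze with NumPy/pandas as usual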
Other approaches are to use MPI, ZMQ or bare sockets. I would prefer the latter, if only because the former don't support Bigarrays. I would also suggest you look at Cap'n Proto; it is also very efficient, has bindings for OCaml and Python, and could work very well for your particular use case.
I was looking at the source of the json module to try to answer another question when I found something curious. Removing the docstring and a whole bunch of keyword arguments, the source of json.load is as follows:
def load(fp):
    return loads(fp.read())
This was not at all as I expected. If json.load doesn't avoid the overhead of reading the whole file at once, is its only advantage over json.loads(f.read()) the savings of a few characters of source code? Why does it even exist? Why does it get the short name, instead of loads getting the load name and load getting a name like loadf? I can think of reasons (copying the pickle interface, for example), but can anyone provide an authoritative answer rather than speculation?
Though it is natural to expect that json.load() does something better, as mentioned in the comments, it doesn't guarantee to do so. This is purely speculative, but if I were a Python maintainer, I would design the modules for simplicity and the least maintenance overhead.
The Python standard library's json module is not optimal, either speed-wise or memory-usage-wise. There are many alternative JSON reading implementations for different sweet spots, and some of them have Python bindings, e.g. Jansson:
https://stackoverflow.com/a/3512887/315168
Alternative JSON implementations were born from the necessity to handle streaming and/or huge amounts of data in an efficient manner.
It is probably safe to say that reading JSON from files, although important, is not the main use case for JSON serialization. Thus, implementing an efficient JSON load from file is not that interesting, except in special cases (there are more efficient ways of serializing huge data structures to disk).
However, generalizing the concept may introduce some useful aspects (JSON deserialization from network streams, for example, or progressive JSON deserialization from pipes).
What we are looking for then is a streaming parser (for example like SAX for XML). YAJL is a common such parser, and there are some Python bindings for it.
Also see the top answer to this question: Is there a streaming API for JSON?
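As a hedged sketch of what that looks like in Python: ijson is one commonly used streaming JSON parser (with YAJL-backed backends available); the file name and field are made up here:

import ijson

# Stream elements out of a large top-level JSON array without ever
# loading the whole document into memory.
with open("huge.json", "rb") as fp:
    for record in ijson.items(fp, "item"):   # "item" = each array element
        print(record["id"])                  # handle records one at a time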
A rather high-profile security vulnerability in Rails recently illuminated the potential dangers of parsing user-supplied YAML in a Ruby application.
A quick Google search reveals that Python's YAML library includes a safe_load method which will only deserialize "simple Python objects like integers or lists" and not objects of any arbitrary type.
Does Ruby have an equivalent? Is there any way to safely accept YAML input in a Ruby application without hand-writing a custom parser?
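For reference, the PyYAML behaviour the question leans on looks roughly like this (a minimal sketch; the malicious document is illustrative):

import yaml

# A document that abuses a Python-specific tag to run code on load.
malicious = "!!python/object/apply:os.system ['echo owned']"

try:
    yaml.safe_load(malicious)     # only plain scalars, lists and dicts allowed
except yaml.constructor.ConstructorError as exc:
    print("rejected:", exc.problem)

print(yaml.safe_load("{a: 1, b: [2, 3]}"))   # ordinary data still loads fine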
Following Jim's advice I went ahead and wrote safe_yaml, a gem which adds the YAML.safe_load method and uses Psych internally to do the heavy lifting.
Using the lower level interfaces to Psych (the actual parser engine), it is possible to gain access to the lower level structures without serializing them back to Ruby objects (see http://rubydoc.info/stdlib/psych/Psych/Parser). This isn't as easy as safe_load, but it does provide a route to do it.
There may be other options that work in Syck and Psych and that are more direct, such as safe_load.
Maybe a silly question, but I usually learn a lot with those. :)
I'm working on software that deals a lot with XML, both as input and as output, and a lot of processing occurs in between.
My first thought was to use a dict as the internal data structure, and from there work my way through a process of reading and writing it.
What do you guys think? Any better approach, Python-wise?
An XML document in general is a tree with lots of bells and whistles (attributes vs. child nodes, mixing of text with child nodes, entities, XML declarations, comments, and many more). Handling that should be left to existing, mature libraries - for Python, it's commonly agreed that lxml is the most convenient choice, followed by the stdlib ElementTree modules (by which one lxml module, lxml.etree, is so heavily inspired that incompatibilities are the exception).
These handle all that complexity and expose it in reasonably manageable ways with many convenience methods (lxml's XPath support has saved me a lot of code). After parsing, programs can - of course - go on to convert the trees into simpler data structures that fit the data actually modeled much better. What data structures exactly are possible and sensible depends on what you want to represent (if you misuse XML as a flat key-value store, for instance, you could indeed go on to convert the tree into a dictionary).
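A minimal sketch of that parse-then-simplify pattern with lxml (the document and field names are made up):

from lxml import etree

xml = b"""<catalog>
  <book id="1"><title>Dune</title><price>9.99</price></book>
  <book id="2"><title>Foundation</title><price>7.50</price></book>
</catalog>"""

root = etree.fromstring(xml)

# Convert just the parts the application cares about into plain Python data.
books = [
    {"id": book.get("id"),
     "title": book.findtext("title"),
     "price": float(book.findtext("price"))}
    for book in root.xpath("//book")
]
print(books)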
It depends completely on what type of data you have in XML, what kinds of processing you'll need to do with it, what sort of output you'll need to produce from it, etc.
I am working on a project which contains two servers, one written in Python, the other in C. To maximize the capacity of the servers, we defined a proprietary binary protocol by which the two talk to each other.
The protocol is defined in a C header file in the form of C struct. Usually, I would use VIM to do some substitutions to convert this file into Python code. But that means I have to manually do this every time the protocol is modified.
So I believe a parser that could parse C header files would be a better choice. However, there are at least a dozen Python parser generators, so I don't know which one is more suitable for my particular task.
Any suggestion? Thanks a lot.
EDIT:
Of course I am not asking anyone to write the code for me...
The code is already finished. I converted the header file into Python code in a form that construct, a Python library that can parse binary data, can recognize.
I am also not looking for an existing C parser. I am asking this question because a book I am reading, which talks a little about parser generators, inspired me to learn how to use a real parser generator.
EDIT Again:
When we designed the system, I suggested using Google Protocol Buffers, ZeroC ICE, or some other multi-language network programming middleware to eliminate the task of implementing a protocol.
However, not every programmer can read English documentation or wants to try new things, especially when they have plenty of experience doing it the old, simple, but slightly clumsy way.
An alternative solution, one that might feel a bit over-ambitious at the beginning but might also serve you very well in the long term, is:
Redefine the protocol in some higher-level language, for instance some custom XML
Generate both the C struct definitions and any required Python versions from the same source (a rough sketch of the idea follows).
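Such a generator can stay very small. A hypothetical sketch that emits a C struct and a stdlib struct format string from one field list (the message layout and helper names are made up; it uses a plain Python list rather than custom XML as the shared definition):

import struct

# One shared, higher-level description of the message (hypothetical example).
MESSAGE = [("msg_type", "uint16"), ("length", "uint32"), ("temperature", "double")]

C_TYPES = {"uint16": "uint16_t", "uint32": "uint32_t", "double": "double"}
PY_FMT = {"uint16": "H", "uint32": "I", "double": "d"}

def to_c_struct(name, fields):
    # Note: the generated struct must be packed (e.g. __attribute__((packed)))
    # on the C side to match the unpadded Python layout produced below.
    lines = ["typedef struct {"]
    lines += ["    %s %s;" % (C_TYPES[t], f) for f, t in fields]
    lines.append("} %s;" % name)
    return "\n".join(lines)

def to_py_format(fields):
    # "<" = little-endian, no padding, so both sides agree on the byte layout.
    return "<" + "".join(PY_FMT[t] for _, t in fields)

print(to_c_struct("message_t", MESSAGE))
fmt = to_py_format(MESSAGE)
print(fmt, struct.calcsize(fmt))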
I would personally use PLY:
http://www.dabeaz.com/ply/
And there is already a C parser written with PLY:
http://code.google.com/p/pycparser/
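A minimal sketch of pulling struct fields out of a header with pycparser (the header text is illustrative; real headers with #include lines need preprocessing first):

from pycparser import c_parser, c_ast

HEADER = """
typedef struct {
    unsigned int msg_type;
    unsigned int length;
    char payload[256];
} message_t;
"""

class StructVisitor(c_ast.NodeVisitor):
    def visit_Struct(self, node):
        # Print each field name and the kind of declaration it carries.
        for decl in node.decls or []:
            print(decl.name, decl.type.__class__.__name__)

ast = c_parser.CParser().parse(HEADER)
StructVisitor().visit(ast)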
If I were doing this, I would use IDL as the structure definition language. The main problem you will have with doing C structs is that C has pointers, particularly char* for strings. Using IDL restricts the data types and imposes some semantics.
Then you can do whatever you want. Most parser generators are going to have IDL as a sample grammar.
A C struct is unlikely to be portable enough to be sent between machines. Different endianness, different word sizes and different compilers will all change the way the structure is mapped to bytes.
It would be better to use a properly portable binary format that is designed for communications.
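To make the difference concrete, a small sketch with Python's struct module (the field layout is made up; the exact native size depends on your platform):

import struct

# Native layout ("@"): alignment padding and byte order are whatever the
# local compiler/CPU happen to use, so the bytes differ across machines.
native = struct.pack("@HId", 1, 1024, 3.14)

# Explicit little-endian layout ("<"): no padding, fixed byte order, so any
# language on any machine can decode it the same way.
portable = struct.pack("<HId", 1, 1024, 3.14)

print(len(native), len(portable))   # e.g. 16 vs 14 on x86-64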