Is json.load inefficient? - python

I was looking at the source of the json module to try to answer another question when I found something curious. Removing the docstring and a whole bunch of keyword arguments, the source of json.load is as follows:
def load(fp):
    return loads(fp.read())
This was not at all as I expected. If json.load doesn't avoid the overhead of reading the whole file at once, is its only advantage over json.loads(f.read()) the savings of a few characters of source code? Why does it even exist? Why does it get the short name, instead of loads getting the load name and load getting a name like loadf? I can think of reasons (copying the pickle interface, for example), but can anyone provide an authoritative answer rather than speculation?

Though it is natural to expect json.load() to do something cleverer, as mentioned in the comments, it isn't guaranteed to. This is purely speculative, but if I were a Python maintainer, I would design the module for simplicity and minimal maintenance overhead.
The Python standard library's json module is not optimal, either speed-wise or memory-wise. There are many alternative JSON implementations targeting different sweet spots, and some of them have Python bindings, e.g. Jansson:
https://stackoverflow.com/a/3512887/315168
Alternative JSON implementations are born from the need to handle streaming and/or huge amounts of data efficiently.

It is probably safe to say that reading JSON from files, although important, is not the main use case for JSON serialization. Thus, implementing an efficient JSON load from file is not that interesting, except in special cases (there are more efficient ways of serializing huge data structures to disk).
However, generalizing the concept may introduce some useful aspects (JSON deserialization from network streams, for example, or progressive JSON deserialization from pipes).
What we are looking for, then, is a streaming parser (like SAX for XML). YAJL is one commonly used such parser, and there are some Python bindings for it.
Also see the top answer to this question: Is there a streaming API for JSON?
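For illustration, here is a hedged sketch of that streaming style using the third-party ijson package (an assumption on my part: it must be installed, and it can optionally use YAJL as a backend). The file name and document shape ({"records": [...]}) are made up:

import ijson

with open("big.json", "rb") as f:
    # Yields one element of the top-level "records" array at a time,
    # without reading the whole document into memory first.
    for record in ijson.items(f, "records.item"):
        print(record)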

Related

Provide validated script configuration input (in Python)

I have scripts which need different configuration data. Most of the time it is a table format or a list of parameters. At the moment I read Excel tables for this.
However, not only is reading Excel slightly buggy (Excel is just not made to be a stable data provider), but I'd also like to include some data validation and a little help for the configurators, so that input is at least partially validated. It doesn't have to be pretty though - just functional. Pure text files would be too hard to read and check.
Can you suggest an easy-to-implement way to realize that? Of course one could program complicated web interfaces and forms, but maybe that's too much effort?!
What is an easy-to-edit way to provide data tables and other configuration parameters?
The configuration info is just small tables with a list of parameters or a matrix with mathematical coefficients.
I like to use YAML. It's pretty flexible and Python can read it as a dict using PyYAML.
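A minimal sketch of that approach, assuming PyYAML is installed and the configuration lives in a hypothetical config.yaml; safe_load avoids executing arbitrary YAML tags:

import yaml  # third-party PyYAML package

with open("config.yaml") as f:
    config = yaml.safe_load(f)  # plain dicts, lists and scalars

print(config["coefficients"])  # hypothetical key from the config file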
It's a bit difficult to answer without knowing what the data you're working with looks like, but there are several ways you could do it. You could, for example, use something like csv or sqlite, provided the data can be easily expressed in a tabular format, but I think you might find XML is best for your use case. It is very versatile and can be easy to work with if you find a good editor (e.g. Serna or oXygen XML); however, it might still be in your interest to write your own editor for it (which will probably not be as complicated as you think!). XML is easy to work with in Python through the standard xml.etree module, and XML schemas can be used for validation.
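As a hedged sketch of that validation idea using the third-party lxml package (the file names, schema, and element names below are made up):

from lxml import etree

# Validate the document against an XML Schema before using it.
schema = etree.XMLSchema(etree.parse("config.xsd"))
doc = etree.parse("config.xml")
schema.assertValid(doc)  # raises DocumentInvalid with an error message on failure

# Then read values out of the validated tree.
for param in doc.findall(".//parameter"):
    print(param.get("name"), param.text)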

Ideal Data Structure to Deal with XML Data

Maybe a silly question, but I usually learn a lot with those. :)
I'm working on a software that deals a lot with XML, both as input and as output, and in between a lot of processing occurs.
My first thought was to use a dict as the internal data structure and, from there, work my way through a process of reading and writing it.
What you guys think? Any better approach, python-wise?
An XML document in general is a tree with lots of bells and whistles (attributes vs. child nodes, mixing of text with child nodes, entities, XML declarations, comments, and many more). Handling that should be left to existing, mature libraries - for Python, it's commonly agreed that lxml is the most convenient choice, followed by the stdlib ElementTree modules (by which one lxml module, lxml.etree, is so heavily inspired that incompatibilities are the exception).
These handle all that complexity and expose it in reasonably manageable ways with many convenience methods (lxml's XPath support has saved me a lot of code). After parsing, programs can - of course - go on to convert the trees into simpler data structures that fit the data actually modeled much better. Exactly which data structures are possible and sensible depends on what you want to represent (if you misuse XML as a flat key-value store, for instance, you could indeed go on to convert the tree into a dictionary).
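As a small illustration of that last point, here is a sketch using the stdlib xml.etree module to flatten a key-value style document into a dict; the document shape is invented, and this only makes sense when there are no attributes, repeated tags, or mixed content:

import xml.etree.ElementTree as ET

xml_text = "<settings><timeout>30</timeout><host>example.com</host></settings>"
root = ET.fromstring(xml_text)

# One dict entry per child element; everything stays a string.
settings = {child.tag: child.text for child in root}
print(settings)  # {'timeout': '30', 'host': 'example.com'}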
It depends completely on what type of data you have in XML, what kinds of processing you'll need to do with it, what sort of output you'll need to produce from it, etc.

Pros and cons for different configuration formats?

I've seen people using *.cfg (Python Buildout), *.xml (Gnome), *.json (Chrome extension), *.yaml (Google App Engine), *.ini and even *.py for app configuration files (like Django).
My question is: why are there so many different configuration file formats? I can see the advantage of JSON over XML (much less verbose), or of a Python file (sometimes you have a Python app and don't want to use a specific module just to parse a config file), but what about the other approaches?
I know there are even more formats than the ones I gave as examples. What really are their advantages in comparison to each other? Historical reasons? Compatibility with different systems?
If you were to start an application that reads some kind of configuration file (with a plugin ecosystem), which one would you use?
Which of the formats I gave as examples are the oldest? Do you know their history?
It's mostly personal preference, purpose, and available libraries. Personally I think xml is way too verbose for config files, but it is popular and has great libraries.
.cfg and .ini are legacy formats that work well, and many languages have an included library that reads them. I've used them in Java, Python and C++ without issues. They don't really work as a data interchange format, and if I am passing data I will probably use the same format for config and data interchange.
YAML and JSON sit between XML and cfg/ini. You can define many data structures in both, or they can be used as simple key-value stores like cfg. Both of these formats have great libraries in Python, and I'm assuming many other languages have libraries as well. I believe JSON is a subset of YAML.
I've never used a Python file as config, but it does seem to work well for Django. It does allow you to have some code in the config, which might be useful.
Last time I was choosing a format I chose YAML. It's simple but has some nice features, and the Python library was easy to install and really good. JSON was a close second, and since the YAML library also parsed JSON, I chose YAML over it.
Note, this is pure opinion and speculation on my part, but I suspect that the single biggest reason for the plethora of formats is the lack of a readily available, omnipresent configuration file parsing library. Lacking that, most programs have to write their own parsers, so it often comes down to a balance between how complex the configuration structure needs to be (hierarchical vs. flat, pure data vs. embedded logic like if statements, etc.), how much effort the developers were willing to spend on writing a configuration file parser, and how much of a pain it should be for the end user. That said, probably every reason you've listed or could think of has been the motivation for a project or two in choosing their format.
For my own projects I tend to use .ini simply because there's an excellent parser already built in to Python and it's been "good enough" for most of my use cases. In the couple of cases it's been insufficient, I've used an XML based configuration file due, again, to the relative simplicity of implementation.
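A minimal sketch of that built-in parser (configparser in Python 3, ConfigParser in Python 2); the file, section, and option names are made up:

import configparser

config = configparser.ConfigParser()
config.read("app.ini")  # read() silently ignores files that don't exist

# All values come back as strings; convert explicitly where needed.
host = config.get("server", "host")
port = config.getint("server", "port")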
Most configuration file formats inherit significant complexity because they support too many data types. For example, 2 can be interpreted as an integer, a real number, and a string, and since most configuration languages convert values into these underlying data types, a value like 2 must be written in such a way to convey both the value and the type. Thus, 2 is generally interpreted as an integer, 2.0 is interpreted as a real number, and "2" is interpreted as a string. This makes these languages hard to use for non-programmers who do not know these conventions. It also adds complexity and clutter to the language. This is particularly true for strings. Strings are a sequence of arbitrary characters, and so valid strings may contain the quote character. To distinguish between an embedded quote character and the quote character that terminates the string, the embedded quote character must be escaped. This leads to more complexity and clutter.
JSON and Python are unambiguous and handle hierarchy well, but are cluttered by type cues, quoting and escaping.
YAML makes the quoting optional, which in turn makes the format ambiguous.
INI takes all values as strings, which requires the end application to convert to the expected data type if required. This eliminates the need for type cues, quoting and escaping. But INI does not naturally support hierarchy, and is not standardized and different implementations are inconsistent with their interpretation of files.
A new format, NestedText, is now available. It is similar to INI in that all leaf values are strings, but it is standardized (it has an unambiguous spec and a comprehensive test suite) and naturally handles hierarchy. The result is a very simple configuration format that is suitable for both programmers and non-programmers. It should be considered as the configuration and structured-data file format for new applications, particularly those where the files are to be read, written, or modified by casual users.
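To make the ambiguity point concrete, a hedged example: a YAML 1.1 loader such as PyYAML reads an unquoted no as a boolean, while the third-party nestedtext package (assuming it is installed) always returns strings:

import yaml        # PyYAML, YAML 1.1 semantics
import nestedtext

print(yaml.safe_load("country: no"))      # {'country': False}
print(yaml.safe_load("country: 'no'"))    # {'country': 'no'}
print(nestedtext.loads("country: no"))    # {'country': 'no'}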
It really depends on whether the reader/writer of the configuration file is, or can be, a non-programmer human or if it has strictly programmatic access.
XML, JSON, etc., are totally unsuitable for human consumption, while the INI format is half-way reasonable for humans.
If the configuration file has only programmatic access (with occasional editing by a programmer), then any of the supported formats (XML, JSON, or INI) is suitable. The INI format is not suitable for structured parameters, whereas XML and JSON can support arrays, structs/objects, etc.
According to the official Python docs, 'future enhancements to configuration functionality will be added to dictConfig()'. So any config file that can be used with dictConfig() should be fine.
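The dictConfig() quoted above belongs to the logging configuration API, so one hedged sketch is to keep the logging setup in a JSON file and pass the parsed dict straight through; the file name and its contents here are made up:

import json
import logging.config

# logging.json might contain e.g. {"version": 1, "root": {"level": "INFO"}}
with open("logging.json") as f:
    logging.config.dictConfig(json.load(f))

logging.getLogger(__name__).info("logging configured")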

Is it reasonable to save data as python modules?

This is what I've done for a project. I have a few data structures that are basically dictionaries with some methods that operate on the data. When I save them to disk, I write them out to .py files as code that, when imported as a module, will load the same data into such a data structure.
Is this reasonable? Are there any big disadvantages? The advantage I see is that when I want to operate on the saved data, I can quickly import the modules I need. Also, the modules can be used separately from the rest of the application because you don't need separate parser or loader functionality.
By operating this way, you may gain some modicum of convenience, but you pay many kinds of price for that. The space it takes to save your data, and the time it takes to both save and reload it, go up substantially; and your security exposure is unbounded -- you must ferociously guard the paths from which you reload modules, as it would provide an easy avenue for any attacker to inject code of their choice to be executed under your userid (pickle itself is not rock-solid, security-wise, but, compared to this arrangement, it shines;-).
All in all, I prefer a simpler and more traditional arrangement: executable code lives in one module (on a typical code-loading path, that does not need to be R/W once the module's compiled) -- it gets loaded just once and from an already-compiled form. Data live in their own files (or portions of DB, etc) in any of the many suitable formats, mostly standard ones (possibly including multi-language ones such as JSON, CSV, XML, ... &c, if I want to keep the option open to easily load those data from other languages in the future).
It's reasonable, and I do it all the time. Obviously it's not a format you use to exchange data, so it's not a good format for anything like a save file.
But for example, when I do migrations of websites to Plone, I often get data about the site (such as a list of which pages should be migrated, a list of how old URLs should be mapped to new ones, or lists of tags). These you typically get in Word or Excel format. The data also often needs a bit of massaging, and I end up with what is, for all intents and purposes, a dictionary mapping one URL to some other information.
Sure, I could save that as CSV and parse it into a dictionary. But instead I typically save it as a Python file with a dictionary. Saves code.
So, yes, it's reasonable; no, it's not a format you should use for any sort of save file. It is, however, often used for data that straddles the border with configuration, like the above.
The biggest drawback is that it's a potential security problem, since it's hard to guarantee that the files won't contain arbitrary code, which could be really bad. So don't use this approach if anyone other than you has write access to the files.
A reasonable option might be to use the Pickle module, which is specifically designed to save and restore python structures to disk.
Alex Martelli's answer is absolutely insightful and I agree with him. However, I'll go one step further and make a specific recommendation: use JSON.
JSON is simple, and Python's data structures map well into it; and there are several standard libraries and tools for working with JSON. The stdlib json module (added in Python 2.6) is based on simplejson, so I would use simplejson on older Pythons and the built-in json module everywhere else.
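A minimal sketch of that recommendation using only the stdlib json module; the file name and the data are made up:

import json

data = {"pages": ["/old/a", "/old/b"], "tags": {"a": ["news"]}}

with open("data.json", "w") as f:
    json.dump(data, f, indent=2)

with open("data.json") as f:
    restored = json.load(f)

assert restored == data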
Second choice is XML. XML is more complicated, and harder to just look at (or just edit with a text editor) but there is a vast wealth of tools to validate it, filter it, edit it, etc.
Also, if your data storage and retrieval needs become at all nontrivial, consider using an actual database. SQLite is terrific: it's small, and for small databases runs very fast, but it is a real actual SQL database. I would definitely use a Python ORM instead of learning SQL to interact with the database; my favorite ORM for SQLite would be Autumn (small and simple), or the ORM from Django (you don't even need to learn how to create tables in SQL!) Then if you ever outgrow SQLite, you can move up to a real database such as PostgreSQL. If you find yourself writing lots of loops that search through your saved data, and especially if you need to enforce dependencies (such as if foo is deleted, bar must be deleted too) consider going to a database.

What's the best serialization method for objects in memcached?

My Python application currently uses the python-memcached API to set and get objects in memcached. This API uses Python's native pickle module to serialize and de-serialize Python objects. This API makes it simple and fast to store nested Python lists, dictionaries and tuples in memcached, and reading these objects back into the application is completely transparent -- it just works. But I don't want to be limited to using Python exclusively, and if all the memcached objects are serialized with pickle, then clients written in other languages won't work. Here are the cross-platform serialization options I've considered:
XML - the main benefit is that it's human-readable, but that's not important in this application. XML also takes a lot space, and it's expensive to parse.
JSON - seems like a good cross-platform standard, but I'm not sure it retains the character of object types when read back from memcached. For example, according to this post tuples are transformed into lists when using simplejson (see the small sketch after the priority list below); also, it seems like adding elements to the JSON structure could break code written against the old structure.
Google Protocol Buffers - I'm really interested in this because it seems very fast and compact -- at least 10 times smaller and faster than XML; it's not human-readable, but that's not important for this app; and it seems designed to support growing the structure without breaking old code
Considering the priorities for this app, what's the ideal object serialization method for memcached?
Cross-platform support (Python, Java, C#, C++, Ruby, Perl)
Handling nested data structures
Fast serialization/de-serialization
Minimum memory footprint
Flexibility to change structure without breaking old code
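To illustrate the tuple-versus-list concern raised in the JSON bullet above, a tiny sketch with the stdlib json module (simplejson behaves the same way):

import json

value = {"point": (1, 2)}
round_tripped = json.loads(json.dumps(value))
print(round_tripped)                  # {'point': [1, 2]}
print(type(round_tripped["point"]))   # <class 'list'>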
One major consideration is "do you want to have to specify each structure definition"?
If you are OK with that, then you could take a look at:
Protocol Buffers - http://code.google.com/apis/protocolbuffers/docs/overview.html
Thrift - http://developers.facebook.com/thrift/ (more geared toward services)
Both of these solutions require supporting files to define each data structure.
If you would prefer not to incur the developer overhead of pre-defining each structure, then take a look at:
JSON (via python cjson, and native PHP json). Both are really really fast if you don't need to transmit binary content (such as images, etc...).
YAML (Yet Another Markup Language) - http://www.yaml.org/. Also really fast if you get the right library.
However, I believe that both of these have had issues with transporting binary content, which is why they were ruled out for our usage. Note: YAML may have good binary support, you will have to check the client libraries -- see here: http://yaml.org/type/binary.html
At our company, we rolled our own library (Extruct) for cross-language serialization with binary support. We currently have (decently) fast implementations in Python and PHP, although it isn't very human readable due to using base64 on all the strings (binary support). Eventually we will port them to C and use more standard encoding.
Dynamic languages like PHP and Python get really slow if you have too many iterations in a loop or have to look at each character. C on the other hand shines at such operations.
If you'd like to see the implementation of Extruct, please let me know. (contact info at http://blog.gahooa.com/ under "About Me")
I tried several methods and settled on compressed JSON as the best balance between speed and memory footprint. Python's native pickle module is slightly faster, but the resulting objects can't be used with non-Python clients.
I'm seeing 3:1 compression so all the data fits in memcache and the app gets sub-10ms response times including page rendering.
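A minimal sketch of that compressed-JSON approach using the stdlib json and zlib modules; the memcached call is only hinted at in a comment because client APIs differ:

import json
import zlib

obj = {"items": list(range(1000))}

blob = zlib.compress(json.dumps(obj).encode("utf-8"))
# e.g. mc.set("key", blob) with your memcached client of choice

restored = json.loads(zlib.decompress(blob).decode("utf-8"))
assert restored == obj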
Here's a comparison of JSON, Thrift, Protocol Buffers and YAML, with and without compression:
http://bouncybouncy.net/ramblings/posts/more_on_json_vs_thrift_and_protocol_buffers/
Looks like this test got the same results I did with compressed JSON. Since I don't need to pre-define each structure, this seems like the fastest and smallest cross-platform answer.
"Cross-platform support (Python, Java, C#, C++, Ruby, Perl)"
Too bad this criterion is first. The intent behind most languages is to express fundamental data structures and processing differently. That's what makes multiple languages a "problem": they're all different.
A single representation that's good across many languages is generally impossible. There are compromises in richness of the representation, performance or ambiguity.
JSON meets the remaining criteria nicely. Messages are compact and parse quickly (unlike XML). Nesting is handled nicely. Changing structure without breaking code is always iffy -- if you remove something, old code will break. If you change something that was required, old code will break. If you're adding things, however, JSON handles this also.
I like human-readable. It helps with a lot of debugging and trouble-shooting.
The subtlety of having Python tuples turn into lists isn't an interesting problem. The receiving application already knows the structure being received, and can tweak it up (if it matters.)
Edit on performance.
Parsing the XML and JSON documents from http://developers.de/blogs/damir_dobric/archive/2008/12/27/performance-comparison-soap-vs-json-wcf-implementation.aspx
xmlParse 0.326
jsonParse 0.255
JSON appears to be significantly faster for the same content. I used the Python SimpleJSON and ElementTree modules in Python 2.5.2.
You might be interested in this link:
http://kbyanc.blogspot.com/2007/07/python-serializer-benchmarks.html
An alternative: MessagePack seems to be the fastest serializer out there. Maybe you can give it a try.
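A hedged sketch with the third-party msgpack package (assuming it is installed); note that, like JSON, it turns tuples into lists on the round trip:

import msgpack

packed = msgpack.packb({"point": (1, 2)})
# With recent msgpack releases keys come back as str; older versions may return bytes.
print(msgpack.unpackb(packed))  # {'point': [1, 2]}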
Hessian meets all of your requirements. There is a python library here:
https://github.com/bgilmore/mustaine
The official documentation for the protocol can be found here:
http://hessian.caucho.com/
I regularly use it in both Java and Python. It works and doesn't require writing protocol definition files. I couldn't tell you how the Python serializer performs, but the Java version is reasonably efficient:
https://github.com/eishay/jvm-serializers/wiki/
