A rather high-profile security vulnerability in Rails recently illuminated the potential dangers of parsing user-supplied YAML in a Ruby application.
A quick Google search reveals that Python's YAML library includes a safe_load method which will only deserialize "simple Python objects like integers or lists" and not objects of any arbitrary type.
Does Ruby have an equivalent? Is there any way to safely accept YAML input in a Ruby application without hand-writing a custom parser?
Following Jim's advice, I went ahead and wrote safe_yaml, a gem that adds the YAML.safe_load method and uses Psych internally to do the heavy lifting.
Using the lower-level interfaces to Psych (the actual parser engine), it is possible to gain access to the parsed structures without deserializing them back into Ruby objects (see http://rubydoc.info/stdlib/psych/Psych/Parser). This isn't as easy as a safe_load would be, but it does provide a route to do it.
There may be other, more direct options, such as a safe_load equivalent, that work in both Syck and Psych.
Is there any good reason not to use XML-RPC for an object-broker server/client architecture? Maybe something like "no, it's already outdated; there is X for that now".
To give you more details: I want to build a framework that allows for standardized interaction and the exchange of results between many little tools (e.g. command-line tools). If someone wants to integrate another tool, she writes a wrapper for that purpose. The wrapper could, for example, convert the STDOUT of a tool into objects usable by the architecture.
Currently I'm thinking of writing the proof-of-concept server in Python; later it could be rewritten in C/C++. To make sure clients can be written in as many languages as possible, I thought of using XML-RPC. CORBA seems too bloated for the purpose, since the server shouldn't be too complex.
Thanks for your advice and opinions,
Rainer
XML-RPC has a lot going for it. It's simple to create and to consume, easy to understand and easy to code for.
I'd say avoid SOAP and CORBA like the plague. They are way too complex, and with SOAP you have endless problems because only implementations from single vendors tend to interact nicely - probably because the complexity of the standard leads to varying interpretations.
You may want to consider a RESTful architecture. REST and XML-RPC cannot be directly compared. XML-RPC is a specific implementation of RPC, and REST is an architectural style. REST does not mandate anything much - it's more a style of approach with a bunch of conventions and suggestions. REST can look a lot like XML-RPC, but it doesn't have to.
Have a look at http://en.wikipedia.org/wiki/Representational_State_Transfer and some of the externally linked articles.
One of the goals of REST is that by creating a stateless interface over HTTP, you allow the use of standard caching mechanisms and load balancing mechanisms without having to invent new ways of doing what has already been well solved by HTTP.
Having read about REST, which hopefully is an interesting read, you may decide that for your project XML-RPC is still the best solution, which would be a perfectly reasonable conclusion depending on what exactly you are trying to achieve.
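To give a sense of how little ceremony XML-RPC needs, here is a minimal sketch using Python's standard library (Python 3 module names; in Python 2 the same pieces lived in SimpleXMLRPCServer and xmlrpclib). The function name and port choice are arbitrary, for illustration only:

```python
import threading
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

def add(a, b):
    return a + b

# Expose a single function over XML-RPC; port 0 lets the OS pick a free port.
server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
server.register_function(add)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Any XML-RPC client, in any language, can now call it over HTTP.
port = server.server_address[1]
client = ServerProxy("http://127.0.0.1:%d" % port)
result = client.add(2, 3)
print(result)  # -> 5
```

The whole server fits in a handful of lines, which is a large part of XML-RPC's appeal for a proof of concept.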
I am working on a project which contains two servers, one is written in python, the other in C. To maximize the capacity of the servers, we defined a binary proprietary protocol by which these two could talk to each other.
The protocol is defined in a C header file in the form of C struct. Usually, I would use VIM to do some substitutions to convert this file into Python code. But that means I have to manually do this every time the protocol is modified.
So, I believe a parser that could parse C header files would be a better choice. However, there are at least a dozen Python parser generators, so I don't know which one is most suitable for my particular task.
Any suggestion? Thanks a lot.
EDIT:
Of course I am not asking anyone to write the code for me....
The code is already finished. I converted the header file into Python code in a form that construct, a Python library for parsing binary data, can recognize.
I am also not looking for an existing C parser. I am asking this question because a book I am reading talks a little about parser generators, and it inspired me to learn how to use a real one.
EDIT Again:
When we designed the system, I suggested using Google Protocol Buffers, ZeroC ICE, or some other multi-language network-programming middleware to eliminate the task of implementing a protocol by hand.
However, not every programmer can read English documentation or wants to try new things, especially when they have plenty of experience doing it the old, simple, but slightly clumsy way.
An alternative solution that might feel a bit over-ambitious at the beginning, but might also serve you very well in the long term:
Redefine the protocol in some higher-level language, for instance some custom XML
Generate both the C struct definitions and any required Python versions from the same source.
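As a rough illustration of that idea, here is a sketch that reads a tiny (entirely made-up) XML protocol definition and emits both a C struct and the matching format string for Python's struct module. The element names, type names, and mappings are all invented for the example:

```python
import xml.etree.ElementTree as ET

# Hypothetical protocol definition; the XML schema here is invented.
SPEC = """
<message name="Login">
  <field name="user_id" type="uint32"/>
  <field name="flags"   type="uint16"/>
</message>
"""

C_TYPES = {"uint32": "uint32_t", "uint16": "uint16_t"}
PY_FMTS = {"uint32": "I", "uint16": "H"}  # struct-module format codes

root = ET.fromstring(SPEC)
fields = [(f.get("name"), f.get("type")) for f in root.findall("field")]

# Emit the C struct definition...
c_struct = "struct %s {\n" % root.get("name")
c_struct += "".join("    %s %s;\n" % (C_TYPES[t], n) for n, t in fields)
c_struct += "};"

# ...and the matching Python struct format string (little-endian, packed).
py_fmt = "<" + "".join(PY_FMTS[t] for _, t in fields)

print(c_struct)
print(py_fmt)  # -> <IH
```

Because both outputs come from the same source file, the C and Python sides can never drift out of sync when the protocol changes.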
I would personally use PLY:
http://www.dabeaz.com/ply/
And there is already a C parser written with PLY:
http://code.google.com/p/pycparser/
If I were doing this, I would use IDL as the structure definition language. The main problem you will have with doing C structs is that C has pointers, particularly char* for strings. Using IDL restricts the data types and imposes some semantics.
Then you can do whatever you want. Most parser generators are going to have IDL as a sample grammar.
A C struct is unlikely to be portable enough to be sending between machines. Different endian, different word-sizes, different compilers will all change the way the structure is mapped to bytes.
It would be better to use a properly portable binary format that is designed for communications.
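Python's standard struct module is one way to pin down the wire format explicitly, so it no longer depends on any compiler's layout. A sketch, with a made-up field layout:

```python
import struct

# "<" forces little-endian with no padding, so the layout is identical
# on every machine regardless of compiler, endianness, or word size.
WIRE_FORMAT = "<IH8s"  # uint32 id, uint16 flags, 8-byte name

packed = struct.pack(WIRE_FORMAT, 42, 7, b"alice\x00\x00\x00")
print(len(packed))  # -> 14: exactly 4 + 2 + 8, no hidden compiler padding

ident, flags, name = struct.unpack(WIRE_FORMAT, packed)
print(ident, flags, name.rstrip(b"\x00"))
```

A native C struct with the same fields could easily occupy 16 bytes on one compiler and 14 on another; the explicit format string removes that ambiguity.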
There seem to be a handful of JSON libraries out there for Python, even though Python has a built-in library. One even claims to be built according to the http://www.json.org spec, which made me think, "hmmm, is Python's built-in library not fully to spec?" So I find myself here asking what others have found when trying out different libraries. Is there any difference?
I will be using it for a Django-based web AJAX API (I know there are Django apps for this, but I want to get at the root of this before I just grab an app).
The built-in library is fine most of the time, although occasionally you can hit issues to do with character encoding.
There is cjson if you have performance issues to deal with.
Personally, I just use simplejson - for no particular reason.
Python < 2.6 did not include a json module. The presence of multiple JSON implementations says nothing about the quality of the built-in module and everything about the history of having no built-in one.
I suggest that your assumption (multiple implementations means low quality in the library) is false.
The built-in json module works perfectly well. If you have to use an earlier Python, use simplejson, a third-party module with exactly the same interface. Both have the serialization interface you expect from Python and are widely used.
(simple)json by default has some very minor extensions of the JSON standard. You can read about these in the documentation for json and disable them if you want to for some reason.
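One such extension worth knowing about: by default the module will serialize NaN and Infinity as bare literals, which strict JSON does not allow, and the allow_nan flag turns that off. A quick demonstration:

```python
import json

# By default the module emits the non-standard literal NaN...
print(json.dumps(float("nan")))  # -> NaN

# ...but strict standards compliance can be enforced with allow_nan=False.
try:
    json.dumps(float("nan"), allow_nan=False)
except ValueError as exc:
    print("rejected:", exc)
```

Most interoperating parsers in other languages will choke on a bare NaN, so disabling the extension is a good idea for cross-platform APIs.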
My Python application currently uses the python-memcached API to set and get objects in memcached. This API uses Python's native pickle module to serialize and de-serialize Python objects. This API makes it simple and fast to store nested Python lists, dictionaries and tuples in memcached, and reading these objects back into the application is completely transparent -- it just works.

But I don't want to be limited to using Python exclusively, and if all the memcached objects are serialized with pickle, then clients written in other languages won't work.

Here are the cross-platform serialization options I've considered:
XML - the main benefit is that it's human-readable, but that's not important in this application. XML also takes a lot space, and it's expensive to parse.
JSON - seems like a good cross-platform standard, but I'm not sure it retains the character of object types when read back from memcached. For example, according to this post, tuples are transformed into lists when using simplejson; also, it seems like adding elements to the JSON structure could break code written against the old structure.
Google Protocol Buffers - I'm really interested in this because it seems very fast and compact -- at least 10 times smaller and faster than XML; it's not human-readable, but that's not important for this app; and it seems designed to support growing the structure without breaking old code
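The tuple-to-list behavior mentioned for JSON above is easy to demonstrate directly:

```python
import json

# Tuples are serialized as JSON arrays, so they come back as lists.
original = {"point": (3, 4)}
restored = json.loads(json.dumps(original))
print(restored)  # -> {'point': [3, 4]}

# If tuple-ness matters, the receiver has to restore it explicitly.
restored["point"] = tuple(restored["point"])
print(restored == original)  # -> True
```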
Considering the priorities for this app, what's the ideal object serialization method for memcached?
Cross-platform support (Python, Java, C#, C++, Ruby, Perl)
Handling nested data structures
Fast serialization/de-serialization
Minimum memory footprint
Flexibility to change structure without breaking old code
One major consideration is "do you want to have to specify each structure definition"?
If you are OK with that, then you could take a look at:
Protocol Buffers - http://code.google.com/apis/protocolbuffers/docs/overview.html
Thrift - http://developers.facebook.com/thrift/ (more geared toward services)
Both of these solutions require supporting files to define each data structure.
If you would prefer not to incur the developer overhead of pre-defining each structure, then take a look at:
JSON (via python cjson, and native PHP json). Both are really really fast if you don't need to transmit binary content (such as images, etc...).
Yet Another Markup Language (YAML) - http://www.yaml.org/. Also really fast if you get the right library.
However, I believe that both of these have had issues with transporting binary content, which is why they were ruled out for our usage. Note: YAML may have good binary support, you will have to check the client libraries -- see here: http://yaml.org/type/binary.html
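For reference, the usual workaround for binary content in JSON is to base64-encode the blob before serializing; a sketch (the field names are made up):

```python
import base64
import json

# JSON strings must be valid text, so raw bytes are base64-encoded first.
payload = {
    "filename": "logo.png",
    "data": base64.b64encode(b"\x89PNG\x00\x01").decode("ascii"),
}
wire = json.dumps(payload)

# The receiving side reverses the encoding to recover the exact bytes.
decoded = json.loads(wire)
blob = base64.b64decode(decoded["data"])
print(blob == b"\x89PNG\x00\x01")  # -> True
```

The cost is roughly a 33% size increase on the binary fields, which is the usual reason text formats get ruled out for binary-heavy traffic.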
At our company, we rolled our own library (Extruct) for cross-language serialization with binary support. We currently have (decently) fast implementations in Python and PHP, although it isn't very human readable due to using base64 on all the strings (binary support). Eventually we will port them to C and use more standard encoding.
Dynamic languages like PHP and Python get really slow if you have too many iterations in a loop or have to look at each character. C on the other hand shines at such operations.
If you'd like to see the implementation of Extruct, please let me know. (contact info at http://blog.gahooa.com/ under "About Me")
I tried several methods and settled on compressed JSON as the best balance between speed and memory footprint. Python's native Pickle function is slightly faster, but the resulting objects can't be used with non-Python clients.
I'm seeing 3:1 compression so all the data fits in memcache and the app gets sub-10ms response times including page rendering.
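The compressed-JSON approach takes only a few lines with the standard library; a sketch with invented sample data:

```python
import json
import zlib

# A repetitive record set, typical of cached query results.
records = [{"id": i, "status": "active", "score": i * 0.5} for i in range(200)]

raw = json.dumps(records).encode("utf-8")
packed = zlib.compress(raw)

# Ratio depends entirely on how repetitive the data is.
print(len(raw), len(packed))

# The round trip is exact.
print(json.loads(zlib.decompress(packed)) == records)  # -> True
```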
Here's a comparison of JSON, Thrift, Protocol Buffers and YAML, with and without compression:
http://bouncybouncy.net/ramblings/posts/more_on_json_vs_thrift_and_protocol_buffers/
Looks like this test got the same results I did with compressed JSON. Since I don't need to pre-define each structure, this seems like the fastest and smallest cross-platform answer.
"Cross-platform support (Python, Java, C#, C++, Ruby, Perl)"
Too bad this criterion is first. The intent behind most languages is to express fundamental data structures and processing differently. That's what makes multiple languages a "problem": they're all different.
A single representation that's good across many languages is generally impossible. There are compromises in richness of the representation, performance or ambiguity.
JSON meets the remaining criteria nicely. Messages are compact and parse quickly (unlike XML). Nesting is handled nicely. Changing structure without breaking code is always iffy -- if you remove something, old code will break. If you change something that was required, old code will break. If you're adding things, however, JSON handles this also.
I like human-readable. It helps with a lot of debugging and trouble-shooting.
The subtlety of having Python tuples turn into lists isn't an interesting problem. The receiving application already knows the structure being received, and can tweak it up (if it matters.)
Edit on performance.
Parsing the XML and JSON documents from http://developers.de/blogs/damir_dobric/archive/2008/12/27/performance-comparison-soap-vs-json-wcf-implementation.aspx
xmlParse 0.326
jsonParse 0.255
JSON appears to be significantly faster for the same content. I used the Python SimpleJSON and ElementTree modules in Python 2.5.2.
You might be interested in this link:
http://kbyanc.blogspot.com/2007/07/python-serializer-benchmarks.html
An alternative: MessagePack seems to be the fastest serializer out there. Maybe you can give it a try.
Hessian meets all of your requirements. There is a python library here:
https://github.com/bgilmore/mustaine
The official documentation for the protocol can be found here:
http://hessian.caucho.com/
I regularly use it in both Java and Python. It works and doesn't require writing protocol definition files. I couldn't tell you how the Python serializer performs, but the Java version is reasonably efficient:
https://github.com/eishay/jvm-serializers/wiki/
A lot of useful features in Python are somewhat "hidden" inside modules. Named tuples (new in Python 2.6), for instance, are found in the collections module.
The Library Documentation page will give you all the modules in the language, but newcomers to Python are likely to find themselves saying "Oh, I didn't know I could have done it this way using Python!" unless the important features in the language are pointed out by the experienced developers.
I'm not looking specifically for modules that are new in Python 2.6, but for any modules that can be found in this latest release.
The most impressive new module is probably the multiprocessing module. First, because it lets you execute functions in new processes just as easily, and with roughly the same API, as the threading module. But more importantly, because it introduces a lot of great classes for communicating between processes, such as a Queue class and a Lock class, each used just as those objects would be in multithreaded code, as well as some other classes for sharing memory between processes.
You can find the documentation at http://docs.python.org/library/multiprocessing.html
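A minimal sketch of the Process-plus-Queue pattern described above (the worker function and data are invented for illustration):

```python
from multiprocessing import Process, Queue

def worker(chunk, queue):
    # Runs in its own process; results go back through the shared Queue.
    queue.put([n * n for n in chunk])

def run_workers():
    queue = Queue()
    chunks = ([1, 2], [3, 4])
    procs = [Process(target=worker, args=(c, queue)) for c in chunks]
    for p in procs:
        p.start()
    results = [queue.get() for _ in procs]  # one result list per worker
    for p in procs:
        p.join()
    return sorted(x for r in results for x in r)

if __name__ == "__main__":
    print(run_workers())  # -> [1, 4, 9, 16]
```

Note how close this is to the equivalent threading code; the main difference is that data crosses a process boundary via the Queue instead of shared memory.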
The new json module is a real boon to web programmers!! (It was known as simplejson before being merged into the standard library.)
It's ridiculously easy to use: json.dumps(obj) encodes a built-in-type Python object to a JSON string, while json.loads(string) decodes a JSON string into a Python object.
Really really handy.
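A quick round trip, for instance:

```python
import json

profile = {"name": "ada", "langs": ["python", "c"], "active": True}

# dumps produces a plain JSON string, ready for an AJAX response...
text = json.dumps(profile)
print(text)

# ...and loads turns it back into the equivalent Python object.
print(json.loads(text) == profile)  # -> True
```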
Maybe PEP 0631 and What's New in 2.6 can provide elements of an answer. The latter article explains the new features in Python 2.6, released on October 1, 2008.
Essential Libraries
The main challenge for an experienced programmer coming from another language to Python is figuring out how one language maps to another. Here are a few essential libraries and how they relate to Java equivalents.
os, os.path
Has functionality like in java.io.File, java.lang.Process, and others. But cleaner and more sophisticated, with a Unix flavor. Use os.path instead of os for higher-level functionality.
sys
Manipulate the sys.path (which is like the classpath), register exit handlers (like in java Runtime object), and access the standard I/O streams, as in java.lang.System.
unittest
Very similar (and based on) jUnit, with test fixtures and runnable harnesses.
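For a jUnit programmer the shape is immediately familiar; a tiny sketch, run programmatically rather than via unittest.main():

```python
import unittest

class SortTest(unittest.TestCase):
    # setUp is a test fixture, just like jUnit's setUp()
    def setUp(self):
        self.values = [3, 1, 2]

    def test_sorted(self):
        self.assertEqual(sorted(self.values), [1, 2, 3])

# Run the suite with a TextTestRunner instead of unittest.main().
suite = unittest.defaultTestLoader.loadTestsFromTestCase(SortTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())  # -> True
```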
logging
Functionality almost identical to log4j, with log levels and loggers. (Similar functionality is also in the standard java.util.logging library.)
datetime
Allows parsing and formatting dates and times, like in java.text.DateFormat, java.util.Date and related.
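For example, parsing, arithmetic, and formatting in a few lines (the date values are arbitrary):

```python
from datetime import datetime, timedelta

# Parsing, like java.text.DateFormat.parse().
stamp = datetime.strptime("2008-10-01 14:30", "%Y-%m-%d %H:%M")
print(stamp.year, stamp.month)  # -> 2008 10

# Date arithmetic and formatting back to a string.
later = stamp + timedelta(days=7)
print(later.strftime("%Y-%m-%d"))  # -> 2008-10-08
```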
ConfigParser
Allows persistent configuration, as in a Java Properties file (but also allows nesting). Use this when you don't want the complexity of XML or a database backend.
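For instance, sections give you the one level of nesting that flat java.util.Properties lacks (section and key names here are made up; the module was renamed configparser in Python 3):

```python
import configparser  # spelled ConfigParser in Python 2

config = configparser.ConfigParser()
config.read_string("""
[database]
host = localhost
port = 5432
""")

print(config.get("database", "host"))         # -> localhost
print(config.getint("database", "port") + 1)  # -> 5433, typed access
```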
socket, urllib
Similar functionality to what is in java.net, for working with either sockets, or retrieving content via URLs/URIs.
Also, keep in mind that a lot of basic functionality, such as reading files, and working with collections, is in the core python language, whereas in Java it lives in packages.