Patterns for Python object transformation (encoding, decoding, deserialization, serialization) - python

I've been working with the parsing module construct and have found myself really enamored with the declarative nature of its data structures. For those of you unfamiliar with it, you can write Python code that essentially looks like what you're trying to parse by nesting objects when you instantiate them.
Example:
ethernet = Struct("ethernet_header",
    Bytes("destination", 6),
    Bytes("source", 6),
    Enum(UBInt16("type"),
        IPv4=0x0800,
        ARP=0x0806,
        RARP=0x8035,
        X25=0x0805,
        IPX=0x8137,
        IPv6=0x86DD,
    ),
)
While construct doesn't really support storing values inside this structure (you can either build an abstract Container into a bytestream or parse a bytestream into an abstract Container), I want to extend the framework so that parsers can also store values as they parse, so that they can be accessed with dot notation: ethernet.type.
But in doing so, I think the best possible solution here would be a generic method of writing encoding/decoding mechanisms, so that you could register an encode/decode mechanism and be able to produce a variety of outputs from both the abstract data structure (the parser itself) and the output of the parser.
To give you an example, by default when you run an ethernet packet through the parser, you end up with something like a dict:
Container(name='ethernet_header',
    destination='\x01\x02\x03\x04\x05\x06',
    source='\x01\x02\x03\x04\x05\x06',
    type=IPX,
)
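For reference, a dot-access result object of that sort can be sketched as a dict subclass (construct's own Container behaves much along these lines; this is a toy version, not its actual implementation):

class Container(dict):
    def __getattr__(self, name):
        # Fall back to dict lookup so pkt.type works like pkt['type'].
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)

    def __setattr__(self, name, value):
        self[name] = value

pkt = Container(name="ethernet_header", type="IPX")
print(pkt.type)  # IPX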
I don't want to have to parse things twice - I would ideally like to have the parser generate the 'target' object/strings/bytes in a configurable manner.
The root of the idea is that you could have various 'plugins' registered for consuming or processing structures, so that you could programmatically generate XML or Graphviz diagrams as well as convert from bytes to Python dicts. The crux of the task is to walk a tree of nodes and, based on the registered encoder/decoder, transform and return the transformed objects.
So the question is essentially - what patterns are best suited for this purpose?
codec style:
I've looked at the codecs module, which is fairly elegant in that you create encoding mechanisms, register them, and then specify the specific encoding you want on the fly:
'blah blah'.encode('utf8')
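For illustration, a toy registry in the same spirit might look like this (all names here are hypothetical; the real codecs module wires this up through codecs.register() and CodecInfo objects instead):

_registry = {}

def register(name, encode, decode):
    _registry[name] = (encode, decode)

def encode(obj, name):
    return _registry[name][0](obj)

def decode(data, name):
    return _registry[name][1](data)

# Usage: register a trivial JSON codec and round-trip a container.
import json
register("json", json.dumps, json.loads)
assert decode(encode({"type": "IPX"}, "json"), "json") == {"type": "IPX"}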
serdes (serializer, deserializer):
There are a couple of examples of existing serdes modules for Python; off the top of my head, JSON springs to mind - but the problem with it is that it's very specific and doesn't support arbitrary formats easily. You can either encode or decode JSON, and that's basically it. There's a variety of serdes constructed like this; some use the loads/dumps methods, some don't. It's a crapshoot.
objects = json.loads('{"a": 1, "b": 2}')
visitor pattern(?):
I'm not terribly familiar with the visitor pattern, but it does seem to have some mechanisms that might be applicable - the idea being that (if I understand this correctly) you'd set up a visitor for nodes, and it would walk the tree and apply some transformation (and return the new objects?). I'm hazy here.
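If I've got the idea right, a rough sketch modeled on ast.NodeVisitor would look something like the following. The Struct/Bytes classes here are simplified stand-ins for construct's parser objects, not the real classes:

class Struct(object):
    def __init__(self, name, *children):
        self.name, self.children = name, children

class Bytes(object):
    def __init__(self, name, length):
        self.name, self.length = name, length

class XmlVisitor(object):
    def visit(self, node):
        # Dispatch on the node's class name, e.g. visit_Bytes.
        return getattr(self, "visit_" + type(node).__name__)(node)

    def visit_Struct(self, node):
        inner = "".join(self.visit(child) for child in node.children)
        return "<struct name='%s'>%s</struct>" % (node.name, inner)

    def visit_Bytes(self, node):
        return "<bytes name='%s' length='%d'/>" % (node.name, node.length)

tree = Struct("ethernet_header", Bytes("destination", 6), Bytes("source", 6))
print(XmlVisitor().visit(tree))
# <struct name='ethernet_header'><bytes name='destination' length='6'/>...

A Graphviz or dict 'plugin' would then just be another visitor class walking the same tree.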
other?:
Are there other mechanisms that might be more pythonic or already be written? I considered perhaps using ElementTree and subclassing Elements - but I wanted to consult stackoverflow before doing something daft.

Related

python: how to define types for conversion to xml

I am implementing a parser in python 2.7 for an xml-based file format. The format defines classes with fields of specific types.
What I want to do is define objects like this:
class DataType1(SomeCoolParent):
    a = Integer()
    b = Float()
    c = String()
    # and so on...
The resulting classes need to be able to be constructed from an XML document, and render back into XML (so that I can save and load files).
This looks a whole lot like the ORM libraries in SQLAlchemy or Django... except that there is no database, just XML. It's not clear to me how I would implement what I'm after with those libraries. I would basically want to rip out their backend database stuff and have my own conversion to XML and back again.
My current plan: set up a metaclass and a parent class (SomeCoolParent), define all the types (Integer, Float, String), and generally build a custom almost-ORM myself. I know how to do this, but it seems that someone else must have done/thought of this before, and maybe come up with a general-purpose solution...
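To make the plan concrete, a bare-bones sketch might look like the following (Meta, from_xml, and the Field base class are illustrative names, not an existing library; Python 2.7 metaclass syntax as per the question):

from xml.etree import ElementTree

class Field(object):
    def convert(self, text):
        raise NotImplementedError

class Integer(Field):
    def convert(self, text):
        return int(text)

class Float(Field):
    def convert(self, text):
        return float(text)

class String(Field):
    def convert(self, text):
        return text

class Meta(type):
    def __new__(mcs, name, bases, attrs):
        # Collect the Field instances declared in the class body.
        attrs["_fields"] = dict((k, v) for k, v in attrs.items()
                                if isinstance(v, Field))
        return type.__new__(mcs, name, bases, attrs)

class SomeCoolParent(object):
    __metaclass__ = Meta  # Python 2.7 syntax

    @classmethod
    def from_xml(cls, element):
        # element is an xml.etree.ElementTree.Element
        obj = cls()
        for name, field in cls._fields.items():
            setattr(obj, name, field.convert(element.findtext(name)))
        return obj

    def to_xml(self):
        element = ElementTree.Element(type(self).__name__)
        for name in self._fields:
            ElementTree.SubElement(element, name).text = str(getattr(self, name))
        return element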
Is there a "better" way to implement a hierarchy of fixed type structures with python objects, which allows or at least facilitates conversion to XML?
Caveats/Notes
As mentioned, I'm aware of SQLAlchemy and Django, but I'm not clear on how/whether I can do what I want with their ORM libraries. Answers that show how to do this with either of them definitely count.
I'm definitely open to other methods, those two are just the ORM libraries I am aware of.
There are several XML Schema specifications out there. My knowledge is somewhat dated, but XML Schema (W3C), a.k.a. XSD, and RELAX NG are the most popular, with RELAX NG having a lower learning curve. With that in mind, a good ORM (or would that be OXM, 'Object-XML Mapper'?) using these data description formats ought to be out there. Some candidates are
pyxb
generateDS
Stronger Google-fu may yield more options.

Packet Declaration Language

I'm writing a program which will communicate with another program and, obviously, they should use the same protocol.
What I need is something like protobuf, but not protobuf, because it won't let me describe the packet format exactly as I want. For example, it inserts field numbers into its packets. Pickle won't do either, for the same reasons.
I wrote my own thing using struct, but it is ugly and I don't fully understand how it works. I need something where I can describe different fields like shorts and integers, their endianness, complex fields which consist of primitive fields or other complex fields, arrays of primitive fields, and arrays of complex fields.
Could you recommend something like this? Or am I doomed to stick with my own solution?
I've had to write Python code for dealing with binary formats before, and struct is no fun to work with. The construct module is much nicer. It allows you to both consume and generate complex binary formats using a simple declarative syntax.
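For example, a round trip with the 2.x-era construct API used in the question at the top of this page might look like this (the field values are made up):

from construct import Struct, Bytes, UBInt16

frame = Struct("frame",
    Bytes("destination", 6),
    Bytes("source", 6),
    UBInt16("type"),
)

# Parse raw bytes into a Container, then rebuild the same bytes.
raw = "\x01\x02\x03\x04\x05\x06" * 2 + "\x08\x00"
pkt = frame.parse(raw)
assert pkt.type == 0x0800
assert frame.build(pkt) == raw

You describe the format once, declaratively, and get both directions (consume and generate) for free.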

Parsing a DTD to reveal hierarchy of elements

My goal is to parse several relatively complex DTDs to reveal the hierarchy of elements. The only distinction between DTDs is the version, but each version made no attempt to remain backwards compatible--that would be too easy! As such, I intend to visualize the structure of elements defined by each DTD so that I can design a database model suitable for uniformly storing the data.
Because most solutions I've investigated in Python will only validate against external DTDs, I've decided to start my efforts from the beginning. Python's xml.parsers.expat only parses XML files and implements very basic DTD callbacks, so I've decided to check out the original version, which was written in C and claims to fully comport with the XML 1.0 specifications. However, I have the following questions about this approach:
Will expat (in C) parse external entity references in a DTD file and follow those references, parse their elements, and add those elements to the hierarchy?
Can expat generalize and handle SGML, or will it fail after encountering an invalid DTD yet valid SGML file?
My requirements may lead to the conclusion that expat is inappropriate. If that's the case, I'm considering writing a lexer/parser for XML 1.0 DTDs. Are there any other options I should consider?
The following illustrates more succinctly my intent:
Input DTD Excerpt
<!--A concise summary of the disclosure.-->
<!ELEMENT abstract (doc-page+ | (abst-problem , abst-solution) | p+)>
Object Created from DTD Excerpt (pseudocode)
class abstract:
    member doc_page_array[]
    member abst_problem
    member abst_solution
    member paragraph_array[]
    member description = "A concise summary of the disclosure."
One challenging aspect is to attribute to the <!ELEMENT> tag the comment appearing above it. Hence, a homegrown parser might be necessary if I cannot use expat to accomplish this.
Another problem is that some parsers have problems processing DTDs that use unicode characters greater than #xFFFF, so that might be another factor that favors creating my own.
If it turns out that the lexer/parser route is better suited for my task, does anyone happen to know of a good way to convert these EBNF expressions to something capable of being parsed? I suppose the "best" approach might be to use regular expressions.
Anyway, these are just the thoughts I've had regarding my issue. Any answers to the above questions or suggestions on alternative approaches would be appreciated.
There are several existing tools which may fit your needs, including DTDParse, OpenSP, Matra, and DTD Parser. There are also articles on creating a custom parser.
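If you do want to try expat first, a minimal sketch of the comment-attribution idea with Python's pyexpat bindings might look like the following. Whether expat reports comments from inside the internal subset, and how it treats external entity references, is exactly what you would need to verify before committing to this route:

import xml.parsers.expat

def dump_dtd(dtd_text):
    parser = xml.parsers.expat.ParserCreate()
    last_comment = [None]  # mutable cell shared by the two closures

    def comment(data):
        last_comment[0] = data.strip()

    def element_decl(name, model):
        # model is a nested tuple describing the content model.
        print(name, model, "--", last_comment[0])
        last_comment[0] = None

    parser.CommentHandler = comment
    parser.ElementDeclHandler = element_decl
    # expat only sees a DTD as part of a document, so wrap the DTD
    # text in an internal subset under a dummy root element.
    parser.Parse("<!DOCTYPE root [\n%s\n]>\n<root/>" % dtd_text, True)

dump_dtd('<!--A concise summary of the disclosure.-->\n'
         '<!ELEMENT abstract (doc-page+ | (abst-problem , abst-solution) | p+)>')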

lists or dicts over zeromq in python

What is the correct/best way to send objects like lists or dicts over zeromq in python?
What if we use a PUB/SUB pattern, where the first part of the string would be used as a filter?
I am aware that there are multipart messages, but they were originally meant for a different purpose. Furthermore, you cannot subscribe to all messages that have a certain string as the first element.
Manual serialization
You turn the data into a string yourself, concatenate the pieces, and so on. It's fast and doesn't take much space, but it requires work and maintenance, and it's not flexible.
If another language wants to read the data, you need to code it again. No DRY.
OK for very small data, but really the amount of work is usually not worth it unless you are looking for speed and memory efficiency and can measure that your implementation is significantly better.
Pickle
Slow, but you can serialize complex objects, even callables. It's powerful, and it's so easy it's a no-brainer.
On the other side, it's possible to end up with something you can't pickle, which breaks your code. Plus, you can't share the data with any lib written in another language.
Eventually, the format is not human readable (hard to debug) and quite verbose.
Very nice to share objects and tasks, not so nice for messages.
json
Reasonably fast, easy to implement for simple to moderately complex data structures. It's flexible, human readable, and the data can be shared across languages easily.
For complex data, you'll have to write a bit of code.
Unless you have a very specific need, this is probably the best balance between features and complexity. Especially since the latest implementation in the Python standard library is in C, so speed is OK.
xml
Verbose, hard to create, and a pain to maintain unless you have some heavy lib that does all the work for you. Slow.
Unless it's a requirement, I would avoid it.
In the end
Now, as usual, speed and space efficiency are relative, and you must first answer these questions:
what efficiency do I need?
what am I ready to pay (money, time, energy) for that?
what solution fits into my current system?
That's all that matters.
That wonderful moment of philosophy passed, use JSON.
JSON:
# Client
socket.send(json.dumps(message))
# Server
message = json.loads(socket.recv())
More info:
JSON encoder and decoder
hwserver.py
hwclient.py
In ZeroMQ, a message is simply a binary blob. You can put anything in it that you want. When you have an object that has multiple parts, you need to first serialize it into something that can be deserialized on the other end. The simplest way to do this is to use repr(obj), which produces a string that you can eval at the other end to recreate the object. But that is not the best way.
First of all, you should try to use a language-independent format, because sooner or later you will need to interact with applications written in other languages. A JSON object is a good choice for this because it is a single string that can be decoded by many languages. However, a JSON object might not be the most efficient representation if you are sending lots of messages across the network. Instead you might want to consider a format like MessagePack or Protocol Buffers.
If you need a topic identifier for PUB/SUB, then simply tack it onto the beginning. Either use a fixed-length topic, or place a delimiter between the topic and the real message.
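A sketch of the delimiter approach with pyzmq (the address and topic name are made up, and a real program would run the two ends in separate processes):

import json
import zmq

ctx = zmq.Context()

# Publisher side: topic, a space delimiter, then the JSON payload.
pub = ctx.socket(zmq.PUB)
pub.bind("tcp://*:5556")

# Subscriber side: prefix-match on the topic, then strip it off.
sub = ctx.socket(zmq.SUB)
sub.connect("tcp://localhost:5556")
sub.setsockopt_string(zmq.SUBSCRIBE, "sensors")

# In real code you would wait for the subscription to propagate
# (the "slow joiner" problem) before publishing.
pub.send_string("sensors " + json.dumps({"temp": 21.5}))

topic, _, body = sub.recv_string().partition(" ")
message = json.loads(body)  # {'temp': 21.5}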
Encode as JSON before sending, and decode as JSON after receiving.
Also check out MessagePack
http://msgpack.org/
"It's like JSON. but fast and small"
In case you are interested in seeing examples, I released a small package called pyRpc that shows you how to do a simple Python RPC setup where you expose services between different apps. It uses pyzmq's built-in methods for sending and receiving Python objects (which I believe are simply cPickle).
http://pypi.python.org/pypi/pyRpc/0.1
https://github.com/justinfx/pyRpc
While my examples use the pyobj version of the send and receive calls, you can see there are other versions available that you can use, like send_json, send_unicode... Unless you need some specific type of serialization, you can easily just use the convenience send/receive functions that handle the serialization/deserialization on both ends for you.
http://zeromq.github.com/pyzmq/api/generated/zmq.core.socket.html
JSON is probably the fastest, and if you need even faster than what is included in zeromq, you could manually use cjson. If your focus is speed, then this is a good option. But if you know you will be communicating only with other Python services, then the benefit of cPickle is a native Python serialization format that gives you a lot of control. You can easily define your classes to serialize the way you want, and end up with native Python objects in the end, as opposed to basic values. I'm sure you could also write your own object hook for json if you wanted.
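For instance, a hedged sketch of such a hook pair using the standard json module (Packet and the __packet__ marker are made-up names):

import json

class Packet(object):
    def __init__(self, src, dst):
        self.src, self.dst = src, dst

def packet_to_dict(obj):
    # 'default' is called for objects json doesn't know how to encode.
    if isinstance(obj, Packet):
        return {"__packet__": True, "src": obj.src, "dst": obj.dst}
    raise TypeError(repr(obj))

def dict_to_packet(d):
    # 'object_hook' is called for every decoded JSON object.
    if d.get("__packet__"):
        return Packet(d["src"], d["dst"])
    return d

wire = json.dumps(Packet("a", "b"), default=packet_to_dict)
pkt = json.loads(wire, object_hook=dict_to_packet)
assert pkt.src == "a" and pkt.dst == "b"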
There are a few questions bundled into that question, but in terms of the best/correct way to send objects like lists or dicts, it obviously depends. For a lot of situations JSON is simple and familiar to most. To get it to work I had to use send_string and recv_string, e.g.
# client.py
socket.send_string(json.dumps({'data': ['a', 'b', 'c']}))
# server.py
result = json.loads(socket.recv_string())
Discussion in docs https://pyzmq.readthedocs.io/en/latest/unicode.html

Creating a custom file like object python suggestions?

Hi, I am looking to implement my own custom file-like object for an internal binary format we use at work (I don't really want to go into too much detail because I don't know if I can). I am trying to go for a more Pythonic way of doing things, since currently we have two functions, read and write (each ~4k lines of code), which do everything. However, we need more control/finesse, hence my rewriting this stuff.
I looked at the Python documentation, and it says what methods I need to implement, but doesn't mention stuff like __iter__(), etc.
Basically what i would love to do is stuff like this:
output_file_objs = [
    open("blah.txt", "w"),
    open("blah142.txt", "wb"),
    my_lib.open("internal_file.something", "wb", ignore_something=True),
]

data_to_write = <data>

for f in output_file_objs:
    f.write(data_to_write)
So I can mix it in with the others and basically have a level of transparency. I will add custom methods to it, but that's not a problem.
Is there any good reference regarding writing your own custom file-like objects? Like any restrictions or special methods (__iter__) I should implement?
Or is there a good example of one from within the Python standard library that I can look at?
What makes up a "file-like" object actually depends on what you intend to use it for; not all methods are required to be implemented (or to have a sane implementation).
Having said that, the file and iterator docs are what you want.
Why not stuff your data in a StringIO? Otherwise, you can look at the documentation and implement all of the file-like methods. Truth be told, there are no real interfaces in Python, and some features (like tell()) may not make sense for your files, so you can leave them unimplemented.
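One way to avoid writing every method by hand is to subclass io.RawIOBase (available since Python 2.6), which supplies workable defaults for the rest of the file protocol (closed, the context-manager methods, writelines, and so on). A minimal sketch, where MyBinaryFile and ignore_something mirror the question's my_lib.open:

import io

class MyBinaryFile(io.RawIOBase):
    def __init__(self, path, mode="wb", ignore_something=False):
        self._fh = open(path, mode)  # delegate actual storage to a real file

    def writable(self):
        return True

    def write(self, data):
        # Real code would apply the internal binary format here.
        self._fh.write(data)
        return len(data)

    def close(self):
        self._fh.close()
        io.RawIOBase.close(self)

An instance of this can then sit in the output_file_objs list from the question right alongside plain open() handles.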
