I'm writing a program which will communicate with another program and, obviously, should use the same protocol for it.
What I need is something like protobuf, but not protobuf, because it won't let me describe the packet format exactly as I want. For example, it inserts field numbers into its packets. Pickle won't do either, for the same reasons.
I wrote my own thing using struct, but it is ugly and I don't fully understand how it works. I need something where I can describe different fields such as shorts and integers, their endianness, complex fields composed of primitive fields or other complex fields, arrays of primitive fields, and arrays of complex fields.
Could you recommend something like this? Or am I doomed to stick with my own solution?
I've had to write Python code for dealing with binary formats before, and struct is no fun to work with. The construct module is much nicer. It allows you to both consume and generate complex binary formats using a simple declarative syntax.
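For instance, a fixed binary header with mixed endianness, a nested array, and raw bytes might be declared roughly like this (a sketch using the older construct 2.x style API that appears elsewhere in this thread; the field names and values are made up):

from construct import Struct, Bytes, UBInt16, ULInt32, Array

packet = Struct("packet",
    UBInt16("msg_type"),            # unsigned short, big-endian
    ULInt32("payload_len"),         # unsigned int, little-endian
    Array(4, UBInt16("counters")),  # array of primitive fields
    Bytes("magic", 2),              # raw bytes
)

parsed = packet.parse(b"\x00\x01\x08\x00\x00\x00" + b"\x00\x01" * 4 + b"OK")
print(parsed.msg_type, parsed.payload_len, parsed.counters)

The same declaration can also build bytes back from a Container, which is what makes it handy for wire protocols.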
Related
I am developing a small app for managing my favourite recipes. I have two classes - Ingredient and Recipe. A Recipe consists of Ingredients and some additional data (preparation steps, etc.). The reason I have an Ingredient class is that I want to save some additional info in it (proper technique, etc.). Ingredients are unique, so there cannot be two with the same name.
Currently I am holding all ingredients in a "big" dictionary, using the name of the ingredient as the key. This is useful, as I can ask my model whether an ingredient is already registered and use it (including all its other data) for a newly created recipe.
But thinking back to when I started programming (Java/C++), I always read that using strings as identifiers is bad practice. "The magic string" was a phrase I often read (though I think that describes another problem). I really like the string approach as it is right now. I don't have problems with encoding either, because all string generation/comparison is done within my program (Python 3 strings are Unicode everywhere if I am not mistaken), but I am not sure if what I am doing is the right way to do it.
Is using strings as object identifiers bad practice? Are there differences between languages? Can strings become a performance issue as the amount of data increases? What are the alternatives?
No -
actually, identifiers in Python are always strings, whether you keep them in a dictionary yourself (you say you are using a "big dictionary") or the object is used programmatically, with a name hard-coded into the source code. In the latter case, Python creates the name in one of its automatically handled internal dictionaries (which can be inspected as the return value of globals() or locals()).
Moreover, Python does not use "utf-8" internally; it uses Unicode - which means it is simply text, and you should not worry about how that text is represented in actual bytes.
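A quick illustration of what that means in practice:

# Module-level names live in an ordinary dict keyed by strings.
recipe_count = 3
print(type(globals()))              # <class 'dict'>
print(globals()["recipe_count"])    # 3 -- the same lookup you do in your own dict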
Python relies on dictionaries for many of its core features. For that reason, the built-in dict already comes with a quite effective, fast implementation out of the box, a decent hash function, and so on.
Considering that, the dictionary's performance itself should not be a concern for what you need (occasional reads and writes), although the way you store and persist it (a Python file, JSON, pickle, gzip, etc.) could impact load/access time.
Maybe if you provide a few lines of code showing us how you deal with the dictionary we could provide specific details.
About the string identifiers, check jsbueno's answer; he gave a much better explanation than I could.
I've been working with the parsing module construct and have found myself really enamored with the declarative nature of its data structures. For those of you unfamiliar with it, you can write Python code that essentially looks like what you're trying to parse, by nesting objects when you instantiate them.
Example:
ethernet = Struct("ethernet_header",
    Bytes("destination", 6),
    Bytes("source", 6),
    Enum(UBInt16("type"),
        IPv4 = 0x0800,
        ARP = 0x0806,
        RARP = 0x8035,
        X25 = 0x0805,
        IPX = 0x8137,
        IPv6 = 0x86DD,
    ),
)
While construct doesn't really support storing values inside this structure (you can either build an abstract Container into a bytestream or parse a bytestream into an abstract Container), I want to extend the framework so that parsers can also store values as they parse, so that they can be accessed in dot notation, e.g. ethernet.type.
But in doing so, I think the best possible solution here would be a generic method of writing encoding/decoding mechanisms, so that you could register an encode/decode mechanism and be able to produce a variety of outputs from the abstract data structure (the parser itself), as well as from the output of the parser.
To give you an example, by default when you run an ethernet packet through the parser, you end up with something like a dict:
Container(name='ethernet_header',
    destination='\x01\x02\x03\x04\x05\x06',
    source='\x01\x02\x03\x04\x05\x06',
    type=IPX
)
I don't want to have to parse things twice - I would ideally like to have the parser generate the 'target' object/strings/bytes in a configurable manner.
The root of the idea is that you could have various 'plugins' registered for consuming or processing structures, so that you could programmatically generate XML or Graphviz diagrams as well as convert from bytes to Python dicts. The crux of the task is: walk a tree of nodes and, based on the encoder/decoder, transform and return the transformed objects.
So the question is essentially - what patterns are best suited for this purpose?
codec style:
I've looked at the codecs module, which is fairly elegant in that you create encoding mechanisms, register that your class can encode things and you can specify the specific encoding you want on the fly.
'blah blah'.encode('utf8')
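For instance, the same .encode() call can be made to work for a custom format by registering a codec search function ("shout" is a made-up codec name, and the transformation is deliberately trivial):

import codecs

def _search(name):
    # Only answer lookups for our hypothetical codec name.
    if name != "shout":
        return None
    def encode(text, errors="strict"):
        return text.upper().encode("utf-8"), len(text)
    def decode(data, errors="strict"):
        return bytes(data).decode("utf-8").lower(), len(data)
    return codecs.CodecInfo(encode, decode, name="shout")

codecs.register(_search)
print("blah blah".encode("shout"))   # b'BLAH BLAH'
print(b"BLAH BLAH".decode("shout"))  # 'blah blah'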
serdes (serializer, deserializer):
There are a couple of examples of existing serdes modules for Python; off the top of my head, JSON springs to mind - but the problem with it is that it's very specific and doesn't support arbitrary formats easily. You can either encode or decode JSON and that's basically it. There's a variety of serdes modules constructed like this; some use the loads/dumps methods, some don't. It's a crapshoot.
objects = json.loads('{"a": 1, "b": 2}')
visitor pattern(?):
I'm not terribly familiar with the visitor pattern, but it does seem to have some mechanisms that might be applicable - the idea being that (if I understand this correctly) you set up a visitor for nodes, and it walks the tree and applies some transformation (and returns the new objects?). I'm hazy here.
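In rough terms it might look like this, where the node classes are hypothetical stand-ins for construct's containers and each visitor encapsulates one output format:

class Field:
    def __init__(self, name, value):
        self.name, self.value = name, value
    def accept(self, visitor):
        return visitor.visit_field(self)

class Record:
    def __init__(self, name, children):
        self.name, self.children = name, children
    def accept(self, visitor):
        return visitor.visit_record(self)

class XmlVisitor:
    # One visit_* method per node type; each returns the transformed form.
    def visit_field(self, node):
        return "<%s>%s</%s>" % (node.name, node.value, node.name)
    def visit_record(self, node):
        inner = "".join(child.accept(self) for child in node.children)
        return "<%s>%s</%s>" % (node.name, inner, node.name)

tree = Record("ethernet_header", [Field("type", "IPX")])
print(tree.accept(XmlVisitor()))   # <ethernet_header><type>IPX</type></ethernet_header>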
other?:
Are there other mechanisms that might be more Pythonic or already written? I considered perhaps using ElementTree and subclassing Element, but I wanted to consult Stack Overflow before doing something daft.
What is the correct/best way to send objects like lists or dicts over zeromq in python?
What if we use a PUB/SUB pattern, where the first part of the string would be used as a filter?
I am aware that there are multipart messages, but they were originally meant for a different purpose. Furthermore, you cannot subscribe to all messages which have a certain string as the first element.
Manual serialization
You turn the data into a string yourself, concatenate the pieces, and do your stuff. It's fast and doesn't take much space, but it requires work and maintenance, and it's not flexible.
If another language wants to read the data, you need to code it again. Not DRY.
OK for very small data, but really the amount of work is usually not worth it unless you are looking for speed and memory efficiency and you can measure that your implementation is significantly better.
Pickle
Slow, but you can serialize complex objects, even callables. It's powerful, and it's so easy it's a no-brainer.
On the other hand, it's possible to end up with something you can't pickle, which will break your code. Plus you can't share the data with any lib written in another language.
Finally, the format is not human readable (hard to debug) and quite verbose.
Very nice for sharing objects and tasks, not so nice for messages.
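For what it's worth, the pickle round trip over a zmq socket is only a couple of lines (socket here stands for an already-connected zmq socket, as in the JSON snippet further down):

import pickle

# Client: pickle accepts nearly any Python object, not just JSON-able types.
socket.send(pickle.dumps(message))

# Server: only safe if you trust the sender, since unpickling can run code.
message = pickle.loads(socket.recv())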
json
Reasonably fast, easy to implement for simple to moderately complex data structures. It's flexible, human readable, and data can be shared across languages easily.
For complex data, you'll have to write a bit of code.
Unless you have a very specific need, this is probably the best balance between features and complexity. Especially since the latest implementation in the Python standard lib is in C, so speed is OK.
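For example, for a type json doesn't know about, you pass a default hook when encoding and an object_hook when decoding (Point is a made-up class):

import json

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

def encode_point(obj):
    # Called by json.dumps for any object it can't serialize natively.
    if isinstance(obj, Point):
        return {"__point__": True, "x": obj.x, "y": obj.y}
    raise TypeError(repr(obj))

def decode_point(d):
    # Called by json.loads for every decoded JSON object.
    if d.get("__point__"):
        return Point(d["x"], d["y"])
    return d

payload = json.dumps({"start": Point(0, 1)}, default=encode_point)
restored = json.loads(payload, object_hook=decode_point)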
xml
Verbose, hard to create, and a pain to maintain unless you have some heavy lib that does all the work for you. Slow.
Unless it's a requirement, I would avoid it.
In the end
Now, as usual, speed and space efficiency are relative, and you must first answer these questions:
what efficiency do I need?
what am I ready to pay (money, time, energy) for it?
what solution fits into my current system?
That's all that matters.
That wonderful moment of philosophy passed, use JSON.
JSON:
# Client
socket.send(json.dumps(message).encode("utf-8"))   # pyzmq sockets expect bytes on Python 3
# Server
message = json.loads(socket.recv().decode("utf-8"))
More info:
JSON encoder and decoder
hwserver.py
hwclient.py
In ZeroMQ, a message is simply a binary blob. You can put anything in it that you want. When you have an object that has multiple parts, you need to first serialize it into something that can be deserialized on the other end. The simplest way to do this is to use repr(obj), which produces a string that you can evaluate at the other end to recreate the object. But that is not the best way.
First of all, you should try to use a language-independent format, because sooner or later you will need to interact with applications written in other languages. A JSON object is a good choice for this because it is a single string that can be decoded by many languages. However, a JSON object might not be the most efficient representation if you are sending lots of messages across the network. Instead, you might want to consider a format like MessagePack or Protobufs.
If you need a topic identifier for PUB/SUB, then simply tack it onto the beginning. Either use a fixed-length topic, or place a delimiter between the topic and the real message.
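A sketch of the delimiter variant with pyzmq (assumes json and zmq are imported and the PUB/SUB sockets are already set up):

# Publisher: topic first, a space as delimiter, then the JSON body.
socket.send_string("weather %s" % json.dumps({"temp": 21}))

# Subscriber: filter on the topic prefix, split it off, then decode the rest.
socket.setsockopt_string(zmq.SUBSCRIBE, "weather")
topic, _, body = socket.recv_string().partition(" ")
message = json.loads(body)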
Encode as JSON before sending, and decode as JSON after receiving.
Also check out MessagePack
http://msgpack.org/
"It's like JSON. but fast and small"
In case you are interested in seeing examples, I released a small package called pyRpc that shows you how to do a simple Python RPC setup where you expose services between different apps. It uses pyzmq's built-in methods for sending and receiving Python objects (which I believe simply use cPickle).
http://pypi.python.org/pypi/pyRpc/0.1
https://github.com/justinfx/pyRpc
While my examples use the pyobj version of the send and receive calls, you can see there are other versions available, like send_json, send_unicode... Unless you need some specific type of serialization, you can easily just use the convenience send/receive functions that handle the serialization/deserialization on both ends for you.
http://zeromq.github.com/pyzmq/api/generated/zmq.core.socket.html
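For example, with an already-connected pyzmq socket the convenience calls look like this:

# Pickle-based: keeps native Python types, Python-only on the other end.
socket.send_pyobj({"a": [1, 2, 3]})
obj = socket.recv_pyobj()

# JSON-based: language-independent, but limited to basic types.
socket.send_json({"a": [1, 2, 3]})
obj = socket.recv_json()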
JSON is probably the fastest, and if you need even faster than what is included in ZeroMQ, you could manually use cjson. If your focus is speed then this is a good option. But if you know you will be communicating only with other Python services, then the benefit of cPickle is a native Python serialization format that gives you a lot of control. You can easily define your classes to serialize the way you want, and you end up with native Python objects in the end, as opposed to basic values. I'm sure you could also write your own object hook for JSON if you wanted.
There are a few questions in that question, but in terms of the best/correct way to send objects/dicts, it obviously depends. For a lot of situations JSON is simple and familiar to most. To get it to work I had to use send_string and recv_string, e.g.
# client.py
socket.send_string(json.dumps({'data': ['a', 'b', 'c']}))
# server.py
result = json.loads(socket.recv_string())
Discussion in docs https://pyzmq.readthedocs.io/en/latest/unicode.html
(Note: I would appreciate help generalizing the title. I am sure this is a class of problems in OO land, and probably has a reasonable pattern; I just don't know a better way to describe it.)
I'm considering the following -- our server script will be called by an outside program and have a bunch of text dumped at it (usually XML).
There are multiple possible types of data we could be getting, and multiple versions of each representation, e.g. "Report Type A, version 1.2" vs. "Report Type A, version 2.0".
We will generally want to do the same thing with all the data -- namely, determine what type and version it is, then parse it with a custom parser, then call a synchronize-to-database function on it.
We will definitely be adding types and versions as time goes on.
So, what's a good design pattern here? I can come up with two, but both seem like they may have some problems.
Option 1
Write a monolithic ID script which determines the type, and then imports and calls the properly named class functions.
Benefits
Probably pretty easy to debug.
Only one file that does the parsing.
Downsides
Seems hackish.
It would be nice not to have to encode knowledge of the data formats in two places: once for identification, once for the actual merging.
Option 2
Write an "ID" function for each class; returns Yes / No / Maybe when given identifying text.
the ID script now imports a bunch of classes, instantiates them on the text and asks if the text and class type match.
Upsides:
Cleaner in that everything lives in one module?
Downsides:
Slower? Depends on logic of running through the classes.
Put abstractly: should Python instantiate a bunch of classes and consume an ID function, or should Python instantiate one (or many) ID classes, each paired with an item class, or is there some other way?
You could use the Strategy pattern, which would allow you to separate the logic for the different formats that need to be parsed into concrete strategies. Your code would typically parse a portion of the file in the common interface and then decide on a concrete strategy.
As far as defining the grammar for your files goes, I would find a fast way to identify the file without implementing the full definition, perhaps a header or other unique feature at the beginning of the document. Then, once you know how to handle the file, you can pick the best concrete strategy for it, and that strategy handles the parsing and the writes to the database.
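A rough sketch of that dispatch, using a small registry keyed by (type, version); the class names and the sniff_header / sync_to_database helpers are made up:

class ReportParser:
    """One concrete strategy per (report type, version) combination."""
    registry = {}

    @classmethod
    def register(cls, report_type, version):
        def wrap(subclass):
            cls.registry[(report_type, version)] = subclass
            return subclass
        return wrap

    def parse(self, text):
        raise NotImplementedError

@ReportParser.register("A", "1.2")
class ReportA12(ReportParser):
    def parse(self, text):
        ...  # custom parsing for Report Type A, version 1.2

def handle(text):
    # Cheap sniff of the header picks the strategy; the strategy does the rest.
    report_type, version = sniff_header(text)
    parser = ReportParser.registry[(report_type, version)]()
    record = parser.parse(text)
    sync_to_database(record)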
Hi, I am looking to implement my own custom file-like object for an internal binary format we use at work (I don't really want to go into too much detail because I don't know if I can). I am trying to go for a more Pythonic way of doing things, since currently we have two functions, read/write (each ~4k lines of code), which do everything. However, we need more control/finesse, hence my rewriting this stuff.
I looked at the Python documentation and it says which methods I need to implement, but it doesn't mention stuff like __iter__(), etc.
Basically what I would love to do is stuff like this:
output_file_objs = [
    open("blah.txt", "w"),
    open("blah142.txt", "wb"),
    my_lib.open("internal_file.something", "wb", ignore_something=True),
]
data_to_write = <data>
for f in output_file_objs:
    f.write(data_to_write)
So I can mix it in with the others and basically have a level of transparency. I will add custom methods to it, but that's not a problem.
Is there any good reference on writing your own custom file-like objects, such as any restrictions or special methods (like __iter__) I should implement?
Or is there a good example of one from within the Python standard library that I can look at?
What makes up a "file-like" actually depends on what you intend to use it for; not all methods are required to be implemented (or to have a sane implementation).
Having said that, the file and iterator docs are what you want.
Why not stuff your data into StringIO? Otherwise, you can look at the documentation and implement all of the file-like methods. Truth be told, there are no real interfaces in Python, and some features (like tell()) may not make sense for your files, so you can leave them unimplemented.
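If it helps, here is a minimal sketch of such a wrapper built on io.RawIOBase (the class name is made up, and the ignore_something flag is just borrowed from the question's example):

import io

class BinaryFormatWriter(io.RawIOBase):
    # Only the methods your callers actually use need real implementations.
    def __init__(self, path, ignore_something=False):
        self._fh = open(path, "wb")
        self._ignore_something = ignore_something

    def writable(self):
        return True

    def write(self, data):
        # Encode into the internal format here before writing the raw bytes.
        return self._fh.write(data)

    def close(self):
        self._fh.close()
        super().close()

An instance of this can then sit in the same list as the built-in file objects from the question and receive the same write() calls.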