Python CJSON encoding custom objects - python

I am updating an old project that used an old version of cjson to speed up its json encoding. It also has a custom class called JSONString (which sets a string to its 'value' property) that is used for communicating with the database.
It used to call cjson.encode((dict containing a JSONString), (custom encoding function for JSONString)), but the newer version of cjson accepts only one argument and does not expose any other functions that would allow customizing the encoding process. Encoding the dict without the custom encoder throws an EncodeError (object is not JSON encodable).
The options I have now are to find out how to use custom encoders in cjson, to modify the cjson source (I am trying to avoid patching libraries), or to convert the JSONString inserted into the dict to a string before the operation. I am trying to avoid placing 'fixes' all over the code, though (compartmentalization and re-usability and all that). Modifying JSONString in some way so that the encoder takes its string value instead of throwing an exception would work too, but I don't know enough of Python's quirks to do this. I can understand why cjson might not allow custom encoders (speed reasons), but if there is no way, I might just have to find something else.
Any suggestions would be greatly appreciated.

Looking through my unanswered posts, I remembered I never marked this as answered. Yavar's post did help; there is an enhanced version of cjson for Python. It works well but has some interesting name collisions at times, so be aware of that.
http://python.cx.hu/python-cjson/
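For readers who cannot switch libraries, the option mentioned in the question of converting JSONString values in one place before encoding can be sketched as below. This is only an illustration under assumptions: the JSONString class here is a hypothetical stand-in with a 'value' attribute as described in the question, and the standard json module stands in for cjson, whose exact API is not shown in this thread.

```python
import json  # stand-in for cjson in this sketch


class JSONString:
    """Hypothetical stand-in for the project's JSONString class."""

    def __init__(self, value):
        self.value = value


def to_plain(obj):
    """Recursively replace JSONString instances with their string value."""
    if isinstance(obj, JSONString):
        return obj.value
    if isinstance(obj, dict):
        return {to_plain(k): to_plain(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_plain(item) for item in obj]
    return obj


def encode(obj):
    """Single wrapper the rest of the code calls instead of the encoder."""
    return json.dumps(to_plain(obj))


encoded = encode({"a": JSONString("x"), "b": [JSONString("y"), 1]})
```

Keeping the conversion inside one wrapper avoids scattering fixes through the code, at the cost of walking the structure in Python before the fast encoder runs.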

Related

How to write test cases for pickle backwards compatibility

You're writing a library, and you know your users pickle your object. Sometimes you add new fields, and this creates a BC problem because old pickles of the objects don't have the needed fields. You'd like to add some tests for this case.
The obvious way to do this is to save some actual pickles from your old version, put them in your test suite somehow, and make sure you can keep unpickling them. But having hard-coded binary test data is pretty uncool. Is there a way to write tests without having to handle binary data manually? I've tried playing around with "faking" the __module__ and __qualname__ fields on a locally declared class (which is intended to simulate the "old" version) but I get errors like "_pickle.PicklingError: Can't pickle : it's not the same object as torch.nn.modules.conv.Conv2d". Is there a good way to do this?

Is using strings as an object identifier bad practice?

I am developing a small app for managing my favourite recipes. I have two classes, Ingredient and Recipe. A Recipe consists of Ingredients and some additional data (preparation, etc.). The reason I have an Ingredient class is that I want to save some additional info in it (proper technique, etc.). Ingredients are unique, so there cannot be two with the same name.
Currently I am holding all ingredients in a "big" dictionary, using the name of the ingredient as the key. This is useful, as I can ask my model whether an ingredient is already registered and use it (including all its other data) for a newly created recipe.
But thinking back to when I started programming (Java/C++), I always read that using strings as an identifier is bad practice. "The Magic String" was a keyword that I often read (but I think that describes another problem). I really like the string approach as it is right now. I don't have problems with encoding either, because all string generation/comparison is done within my program (Python3 uses UTF-8 everywhere if I am not mistaken), but I am not sure if what I am doing is the right way to do it.
Is using strings as an object identifier bad practice? Are there differences between different languages? Can strings prove to be a performance issue if the amount of data increases? What are the alternatives?
No. Actually, identifiers in Python are always strings, whether you keep them in a dictionary yourself (you say you are using a "big dictionary") or the object is used programmatically, with a name hard-coded into the source code. In this latter case, Python creates the name in one of its automatically handled internal dictionaries (which can be inspected as the return value of globals() or locals()).
Moreover, Python does not use "utf-8" internally; it uses "unicode", which means it is simply text, and you should not worry about how that text is represented in actual bytes.
Python relies on dictionaries for many of its core features. For that reason the built-in dict already comes with an efficient, fast implementation out of the box, a decent hash function, etc.
Considering that, the dictionary performance itself should not be a concern for what you need (eventual calls to read and write on it), although the way you handle it / store it (in a python file, json, pickle, gzip, etc.) could impact load/access time, etc.
Maybe if you provide a few lines of code showing us how you deal with the dictionary we could provide specific details.
About the string identifier, check jsbueno's answer; he gave a much better explanation than I could.
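A minimal sketch of the name-keyed registry described in the question might look like this. The class names Ingredient and Pantry, the technique field, and the lower-casing of keys are all assumptions made for illustration; they do not come from the asker's code.

```python
class Ingredient:
    """Hypothetical ingredient holding a name and some extra data."""

    def __init__(self, name, technique=""):
        self.name = name
        self.technique = technique


class Pantry:
    """Registry that keeps exactly one Ingredient per (normalized) name."""

    def __init__(self):
        self._by_name = {}

    @staticmethod
    def _key(name):
        # Assumption: names are compared case-insensitively, ignoring
        # surrounding whitespace.
        return name.strip().lower()

    def get_or_create(self, name, technique=""):
        key = self._key(name)
        if key not in self._by_name:
            self._by_name[key] = Ingredient(name.strip(), technique)
        return self._by_name[key]

    def __contains__(self, name):
        return self._key(name) in self._by_name


pantry = Pantry()
basil = pantry.get_or_create("Basil", technique="tear, do not chop")
same_basil = pantry.get_or_create(" basil ")
```

Wrapping the dictionary this way keeps the string keys in one place, so a normalization rule can change later without touching the recipes that use it.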

Python Pickle, avoiding module dependencies

Is there a particular way to pickle objects so that pickle.load() has no dependencies on any modules? I read that while unpickling objects, pickle tries to load the module containing the class definition of the object. Is there a way to avoid this, so that pickle.load() doesn't try to load any modules?
This may be a bit unrelated, but I would still quote from the documentation:
Warning: The pickle module is not intended to be secure against erroneous or maliciously constructed data. Never unpickle data received from an untrusted or unauthenticated source.
You need to write a custom unpickler that avoids loading extra modules. A general approach will be:
Derive your custom unpickler by subclassing pickle.Unpickler
Override find_class(..)
Inside find_class(..), check the module and the class that need to be loaded. Avoid loading them by raising an error.
Use this custom class to unpickle from the string.
Here is an excellent article about dangers of using pickle. You would also find the code that has the above approach.
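The steps above can be sketched roughly as follows. The allow-list contents and the names RestrictedUnpickler and restricted_loads are illustrative assumptions; only pickle.Unpickler, find_class and pickle.UnpicklingError come from the standard library.

```python
import io
import pickle

# Hypothetical allow-list: only these (module, name) pairs may be resolved.
ALLOWED = {("collections", "OrderedDict")}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called whenever the pickle stream refers to a global (class or
        # function). Anything not explicitly allowed is rejected.
        if (module, name) in ALLOWED:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(
            "global '%s.%s' is forbidden" % (module, name)
        )


def restricted_loads(data):
    """Like pickle.loads(), but using the restricted unpickler."""
    return RestrictedUnpickler(io.BytesIO(data)).load()


# Plain containers of built-in values never reach find_class.
plain = restricted_loads(pickle.dumps({"a": [1, 2, 3]}))
```

Note that this restricts which classes may be loaded; it does not make it possible to rebuild an instance of a class whose definition is unavailable.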
What you are asking does not make much sense, since serializing and deserializing objects is the primary purpose of pickle. If you want something different, serialize or deserialize your objects to XML or JSON (or any other suitable format).
There is e.g. lxml.objectify, or you can google for "Python serialize json" or "Python serialize xml"... but you cannot deserialize an object from a pickle without its class definition, at least not without further coding.
http://docs.python.org/library/pickle.html
documents how to write a custom unpickler... perhaps that's a good way to start, but this seems like the wrong way to do it.

lists or dicts over zeromq in python

What is the correct/best way to send objects like lists or dicts over zeromq in python?
What if we use a PUB/SUB pattern, where the first part of the string would be used as a filter?
I am aware that there are multipart messages, but they were originally meant for a different purpose. Furthermore, you cannot subscribe to all messages that have a certain string as the first element.
Manual serialization
You turn the data into a string, by concatenation or otherwise, and do your stuff. It's fast and doesn't take much space, but it requires work and maintenance, and it's not flexible.
If another language wants to read the data, you need to code it again. Not DRY.
OK for very small data, but the amount of work is usually not worth it unless you need speed and memory efficiency and can measure that your implementation is significantly better.
Pickle
Slow, but you can serialize complex objects, and even callables. It's powerful, and it's so easy it's a no-brainer.
On the other hand, it's possible to end up with something you can't pickle and break your code. Plus you can't share the data with any lib written in another language.
Finally, the format is not human-readable (hard to debug) and quite verbose.
Very nice to share objects and tasks, not so nice for messages.
json
Reasonably fast, easy to implement with simple to moderately complex data structures. It's flexible, human-readable, and data can be shared across languages easily.
For complex data, you'll have to write a bit of code.
Unless you have a very specific need, this is probably the best balance between features and complexity, especially since the latest implementation in the Python standard library is in C and its speed is OK.
xml
Verbose, hard to create, and a pain to maintain unless you have some heavy lib that does all the work for you. Slow.
Unless it's a requirement, I would avoid it.
In the end
Now as usual, speed and space efficiency is relative, and you must first answer the questions:
what efficiency do I need ?
what am I ready to pay (money, time, energy) for that ?
what solution fits in my current system ?
That's all that matters.
That wonderful moment of philosophy passed, use JSON.
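JSON:
import json
# Client: serialize to JSON text, then to bytes (send() expects bytes on Python 3)
socket.send(json.dumps(message).encode("utf-8"))
# Server: decode the received bytes and parse the JSON text
message = json.loads(socket.recv().decode("utf-8"))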
More info:
JSON encoder and decoder
hwserver.py
hwclient.py
In zeroMQ, a message is simply a binary blob. You can put anything in it that you want. When you have an object that has multiple parts, you need to first serialize it into something that can be deserialized on the other end. The simplest way to do this is to use repr(obj), which for simple built-in types produces a string that you can evaluate at the other end to recreate the object. But that is not the best way.
First of all, you should try to use a language independent format because sooner or later you will need to interact with applications written in other languages. A JSON object is a good choice for this because it is a single string that can be decoded by many languages. However, a JSON object might not be the most efficient representation if you are sending lots of messages across the network. Instead you might want to consider a format like MSGPACK or Protobufs.
If you need a topic identifier for PUB/SUB, then simply tack it onto the beginning. Either use a fixed-length topic, or place a delimiter between the topic and the real message.
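The delimiter variant could be sketched like this, using only the standard library so the framing can be tested without a socket. The helper names pack and unpack and the choice of a single space as delimiter are assumptions for illustration; the only requirement is that the topic itself never contains the delimiter.

```python
import json

DELIMITER = b" "  # assumption: topics never contain a space


def pack(topic, obj):
    """Build one message frame: topic, delimiter, JSON payload."""
    payload = json.dumps(obj).encode("utf-8")
    return topic.encode("utf-8") + DELIMITER + payload


def unpack(frame):
    """Split a frame at the first delimiter and decode the JSON payload."""
    topic, _, payload = frame.partition(DELIMITER)
    return topic.decode("utf-8"), json.loads(payload.decode("utf-8"))


frame = pack("weather", {"temps": [21, 19], "city": "Oslo"})
topic, data = unpack(frame)
```

With pyzmq, the publisher would pass such a frame to socket.send(), and a subscriber that set socket.setsockopt(zmq.SUBSCRIBE, b"weather") would receive it, since SUB sockets filter on the leading bytes of the message.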
Encode as JSON before sending, and decode as JSON after receiving.
Also check out MessagePack
http://msgpack.org/
"It's like JSON. but fast and small"
In case you are interested in seeing examples, I released a small package called pyRpc that shows you how to do a simple python RPC setup where you expose services between different apps. It uses the python zeromq built-in method for sending and receiving python objects (which I believe is simply cPickle)
http://pypi.python.org/pypi/pyRpc/0.1
https://github.com/justinfx/pyRpc
While my examples use the pyobj version of the send and receive calls, you can see there are other versions available that you can use, like send_json, send_unicode... Unless you need some specific type of serialization, you can easily just use the convenience send/receive functions that handle the serialization/deserialization on both ends for you.
http://zeromq.github.com/pyzmq/api/generated/zmq.core.socket.html
json is probably the fastest, and if you need something even faster than what is included in zeromq, you could manually use cjson. If your focus is speed then this is a good option. But if you know you will be communicating only with other Python services, then the benefit of cPickle is a native Python serialization format that gives you a lot of control. You can easily define your classes to serialize the way you want, and end up with native Python objects in the end, as opposed to basic values. I'm sure you could also write your own object hook for json if you wanted.
There are a few questions in that question, but in terms of the best / correct way to send objects / dicts, it obviously depends. For a lot of situations JSON is simple and familiar to most. To get it to work I had to use send_string and recv_string, e.g.
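As a rough illustration of the object-hook idea for json, the standard library's default and object_hook parameters can round-trip a custom class. The Point class and the "__point__" marker key are hypothetical names chosen for this sketch.

```python
import json


class Point:
    """Hypothetical class that should survive a JSON round trip."""

    def __init__(self, x, y):
        self.x = x
        self.y = y


def encode_point(obj):
    # json.dumps calls this for objects it cannot serialize itself.
    if isinstance(obj, Point):
        return {"__point__": True, "x": obj.x, "y": obj.y}
    raise TypeError("not JSON serializable: %r" % (obj,))


def decode_point(d):
    # json.loads calls this for every decoded JSON object (dict).
    if d.get("__point__"):
        return Point(d["x"], d["y"])
    return d


text = json.dumps({"origin": Point(0, 0), "n": 3}, default=encode_point)
restored = json.loads(text, object_hook=decode_point)
```

The resulting text is ordinary JSON, so non-Python consumers can still read it; they simply see a plain object with a marker key.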
# client.py
socket.send_string(json.dumps({'data': ['a', 'b', 'c']}))
# server.py
result = json.loads(socket.recv_string())
Discussion in docs https://pyzmq.readthedocs.io/en/latest/unicode.html

Serializing python object to python source code

I have a python dictionary that I'd like to serialize into python source code required to initialize that dictionary's keys and values.
I'm looking for something like json.dumps(), but the output format should be Python not JSON.
My object is very simple: it's a dictionary whose values are one of the following:
built-in type literals (strings, ints, etc.)
lists of literals
I know I can build one myself, but I suspect there are corner cases involving nested objects, keyword escaping, etc. so I'd prefer to use an existing library with those kinks already worked out.
In the most general case, it's not possible to dump an arbitrary Python object into Python source code. E.g. if the object is a socket, recreating the very same socket in a new object cannot work.
As aix explains, for the simple case, repr is explicitly designed to make reproducible source representations. As a slight generalization, the pprint module allows selective customization through the PrettyPrinter class.
If you want it even more general, and if your only requirement is that you get executable Python source code, I recommend pickling the object into a string and then generating the source code
obj = pickle.loads(%s)
where %s gets substituted with repr(pickle.dumps(obj)).
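A small sketch of that approach, assuming the generated source is only ever executed by code you trust; the helper name to_source is made up for this example, and the import line is added so the generated source runs on its own.

```python
import pickle


def to_source(obj, name="obj"):
    """Return Python source that rebuilds obj from its pickle."""
    return "import pickle\n%s = pickle.loads(%s)\n" % (
        name,
        repr(pickle.dumps(obj)),
    )


data = {"a": [1, 2], "b": "text"}
source = to_source(data)

# Executing the generated source recreates an equal object.
namespace = {}
exec(source, namespace)
```

The generated source is valid Python but not human-readable, since the payload is an opaque bytes literal.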
repr(d) where d is your dictionary could be a start (doesn't address all the issues that you mention though).
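For a dictionary of literals and lists of literals, as described in the question, the repr route can be sketched like this. The sample data is invented; pprint.pformat is used only to get line-wrapped output, and ast.literal_eval to read the text back without executing arbitrary code.

```python
import ast
import pprint

data = {
    "name": "example",
    "values": [1, 2.5, None, True],
    "nested": {"k": ["a", "it's quoted"]},
}

# One-line source text, parsed back safely.
restored = ast.literal_eval(repr(data))

# Pretty-printed assignment statement, suitable for writing to a .py file.
source = "data = " + pprint.pformat(data) + "\n"
namespace = {}
exec(source, namespace)
```

This works because the repr of these built-in types is a valid Python literal; objects of other classes generally do not round-trip this way.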
You can use yaml
http://pyyaml.org/
YAML doesn't serialise EXACTLY to python source code. But the yaml module can directly serialise and deserialise python objects. It's a superset of JSON.
What exactly are you trying to do?
