I've been trying to solve "Not JSON serializable" problems for the last couple of hours, so I'm very interested in what the hard part of serializing and deserializing an instance actually is.
Why can't a class instance be serialized with, for example, JSON?
To serialize:
Note the class name (in order to rebuild the object)
Note the variable values at the time of packaging.
Convert it to a string.
Optionally compress it (as msgpack does)
To deserialize:
Create a new instance
Assign known values to appropriate variables
Return the object.
What is difficult about this? And what counts as a complex data type?
The "hard" part is mainly step 3 of your serialization, converting the contained values to strings (and later back during deserialization)
For simple types like numbers, strings, booleans, it's quite straight forward, but for complex types like a socket connected to a remote server or an open file descriptor, it won't work very well.
The solution is usually to either move the complex types from the types you want to serialize and keep the serialized types very clean, or somehow tag or otherwise tell the serializers exactly which properties should be serialized, and which should not.
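A minimal sketch of that idea with the standard json module (the User class and its to_dict/from_dict helpers are made-up examples, not a standard API):

    import json

    class User:
        def __init__(self, name, age):
            self.name = name
            self.age = age
            self.connection = None  # complex, non-serializable state stays out of the export

        def to_dict(self):
            # Only the simple, serializable attributes are exported,
            # plus the class name so the object can be rebuilt later.
            return {"__class__": "User", "name": self.name, "age": self.age}

        @classmethod
        def from_dict(cls, data):
            # Create a new instance and assign the known values.
            return cls(data["name"], data["age"])

    user = User("Alice", 30)
    payload = json.dumps(user.to_dict())             # serialize: class name + values -> string
    restored = User.from_dict(json.loads(payload))   # deserialize: new instance + values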
I am used to building REST APIs with PHP, and I make heavy use of the JMS serializer. It basically lets me write annotations on class properties that define the name and type of each variable, including nested classes and arrays. This lets me completely abstract away the JSON format and just work with classes that transparently serialize to and deserialize from JSON. In combination with the Symfony validator, this approach is very simple to work with, but also very powerful.
Now to my question: I recently started adopting Python for some projects, and I would like to reimplement an API in Python. I have searched the internet for a suitable equivalent of the JMS serializer, but I didn't find one with the same or similar capabilities.
Would anyone be so kind as to point me in the right direction? (Either a good library, or a different approach with equal or better efficiency.)
What I need:
ability to serialize and deserialize objects to and from JSON
define how an object is serialized - the names of the JSON attributes and their data types
define complex object graphs (the ability to use a class as a property type, which is then mapped by its own definition)
ability to map dicts or arrays and the types they contain
Thanks in advance
Marshmallow
I haven't used it yet, so I can't tell whether it answers all your needs. Feedback welcome.
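For reference, a minimal sketch of what nested schemas could look like in marshmallow (assuming the marshmallow 3 Schema/fields API; the Author/Book classes and field names here are made up):

    from marshmallow import Schema, fields

    class AuthorSchema(Schema):
        # data_key controls the JSON attribute name, similar to a JMS annotation
        name = fields.Str(required=True)
        email = fields.Email(data_key="contact_email")

    class BookSchema(Schema):
        title = fields.Str(required=True)
        author = fields.Nested(AuthorSchema)   # nested object graph
        tags = fields.List(fields.Str())       # typed list

    schema = BookSchema()
    data = schema.load({                       # deserialize and validate
        "title": "Example",
        "author": {"name": "A. Writer", "contact_email": "a@example.com"},
        "tags": ["python", "serialization"],
    })
    print(schema.dump(data))                   # serialize back to a JSON-ready dict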
I am dumping a huge data structure (multiple lists and dictionaries) to JSON in Python and sending it over a socket to a client.
I keep getting ValueError: Expecting ',' delimiter: line 1 column 16177 (char 16176) at a different location every time I run the program (it could be column 25000 or column 13000; it keeps changing).
Should I use marshal instead of json (or even pickle)? What is the most reliable format for large amounts of data?
I'd suggest using pickle (or cPickle if you're on Python 2.x), as it can serialize almost anything, including user-defined classes. And, as the docs say:
The marshal serialization format is not guaranteed to be portable across Python versions. Because its primary job in life is to support .pyc files, the Python implementers reserve the right to change the serialization format in non-backwards compatible ways should the need arise. The pickle serialization format is guaranteed to be backwards compatible across Python releases.
(Emphasis mine).
Another advantage of pickle:
The pickle module keeps track of the objects it has already serialized, so that later references to the same object won’t be serialized again. marshal doesn’t do this.
This has implications both for recursive objects and object sharing. Recursive objects are objects that contain references to themselves. These are not handled by marshal, and in fact, attempting to marshal recursive objects will crash your Python interpreter. Object sharing happens when there are multiple references to the same object in different places in the object hierarchy being serialized. pickle stores such objects only once, and ensures that all other references point to the master copy. Shared objects remain shared, which can be very important for mutable objects.
You can also use dill if pickle fails to serialize some data.
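A minimal sketch of sending a pickled structure over a socket, with a simple length prefix so the receiver knows when the whole payload has arrived (the framing format here is an assumption, not part of pickle itself):

    import pickle
    import socket
    import struct

    def send_obj(sock, obj):
        # Pickle the object and prefix it with its length so the
        # receiver knows exactly how many bytes belong to this message.
        payload = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
        sock.sendall(struct.pack("!I", len(payload)) + payload)

    def recv_obj(sock):
        # Read the 4-byte length header, then the full payload.
        (length,) = struct.unpack("!I", _recv_exactly(sock, 4))
        return pickle.loads(_recv_exactly(sock, length))

    def _recv_exactly(sock, n):
        # recv() may return fewer bytes than requested, so loop until
        # the whole message has arrived.
        data = b""
        while len(data) < n:
            chunk = sock.recv(n - len(data))
            if not chunk:
                raise ConnectionError("socket closed before message was complete")
            data += chunk
        return data

Framing like this (or something equivalent) matters regardless of the format, because a single recv() call is not guaranteed to return the entire message.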
I've heard people say that PyMongo automatically uses the BSON format for everything you insert into the database. Is this true? Or do I still need to run BSON.encode manually?
The driver handles marshaling Python built-in objects to their BSON counterparts as part of the intermediate layer between you and the database. Ultimately, the data stored in MongoDB is BSON.
datetime objects will be saved properly, as will numerics, strings and lists. You do not need to serialize them yourself. A document object is a dictionary.
The only reason for manual encoding is when you want to give custom classes the ability to be stored without having to break them down into built-in types. It's very much like any other serialization format (pickle, json, ...): they usually handle the built-ins fine but need extra help for custom types.
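A minimal sketch of letting the driver do the marshaling (the database name, collection and document here are made up; assumes pymongo is installed and a MongoDB instance is running locally):

    import datetime
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    events = client["demo"]["events"]  # hypothetical database/collection

    # Built-in types (str, int, float, list, dict, datetime, ...) are
    # converted to BSON by the driver on insert; no manual encoding needed.
    events.insert_one({
        "name": "signup",
        "count": 3,
        "tags": ["web", "trial"],
        "created_at": datetime.datetime.utcnow(),
    })

    # The document comes back as a plain dict with the datetime restored.
    print(events.find_one({"name": "signup"}))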
The Python shelve module only seems to allow string keys. Is there a way to use arbitrarily typed (e.g. numeric) keys? Maybe an sqlite-backed dictionary?
Thanks!
Why not convert your keys to strings? This should be pretty easy to do with numeric keys.
You can serialize every key on the fly (via pickle or cPickle, like shelve.py already does for values), as well as every value. It's not really worth subclassing shelve.Shelf, since you'd have to override almost every method -- for once, I'd instead recommend copying shelve.py into your own module and editing it to suit. That's basically like coding your new module from scratch, but you get a working example to show you the structure and guidelines ;-).
sqlite has no real advantage in the sufficiently general case (where the keys could be, e.g., arbitrary tuples of different arity and types for every entry) -- you're going to have to serialize the keys anyway to make them homogeneous. Still, nothing stops you from using sqlite, e.g. to keep several "generalized shelves" in a single file (different tables of the same sqlite DB) -- if you care about performance, you should measure it each way, though.
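A minimal sketch of the key-serialization idea, written as a thin wrapper around a regular shelf rather than a Shelf subclass (the class and file names are made up):

    import pickle
    import shelve

    class AnyKeyShelf:
        """Wraps a shelve shelf and pickles keys into hex strings,
        so any picklable object can be used as a key."""

        def __init__(self, filename):
            self._shelf = shelve.open(filename)

        def _encode(self, key):
            # Pickle the key and hex-encode it into the string key shelve expects.
            return pickle.dumps(key, protocol=0).hex()

        def __setitem__(self, key, value):
            self._shelf[self._encode(key)] = value

        def __getitem__(self, key):
            return self._shelf[self._encode(key)]

        def keys(self):
            # Decode the stored keys back to their original objects.
            return [pickle.loads(bytes.fromhex(k)) for k in self._shelf.keys()]

        def close(self):
            self._shelf.close()

    # Usage: numeric and tuple keys work transparently.
    db = AnyKeyShelf("example.db")
    db[42] = "answer"
    db[(1, "a")] = [1, 2, 3]
    print(db[42], db[(1, "a")])
    db.close()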
I think you want to overload the [] operator. You can do that by defining the __getitem__ method.
I ended up subclassing DbfilenameShelf from the shelve module. I made a shelf that automatically converts non-string keys into string keys and returns them in their original form when queried. It works well for Python's standard immutable objects: int, float, string, tuple, boolean.
It can be found in: https://github.com/North-Guard/simple_shelve
Surfing the web and reading about Django dev best practices, the advice is to use pickled model fields with extreme caution.
But in a real-life example, where would you use a PickledObjectField, and what specific problems does it solve?
We have a system of social-network "backends" which do some generic stuff like "post message", "get status", "get friends", etc. The link between each backend class and a user is a Django model, which stores the user, the backend name and the credentials. Now imagine how many auth systems there are: OAuth, plain passwords, Facebook's obscure JS stuff, etc. This is where JSONField shines: we keep all backend-specific auth data in a dictionary on this model, stored in the DB as JSON, and we can put anything into it, no problem.
You would use it to store... almost-arbitrary Python objects. In general there's little reason to use it; JSON is safer and more portable.
You can definitely substitute a PickledObjectField with JSON plus some extra logic to create an object out of the JSON. At the end of the day, your use case, when deciding between a PickledObjectField and JSON+logic, is serializing a Python object into your database. If you can trust the data in the Python object, and know that it will always be serializable, you can reasonably use the PickledObjectField.

In my case (I don't use Django's ORM, but this should still apply), I have a couple of different object types that can go into my PickledObjectField, and their definitions are constantly mutating. Rather than constantly updating my JSON parsing logic to create an object out of JSON values, I simply use a PickledObjectField to store the different objects and later retrieve them in perfectly usable form (calling their functions).

Caveat: if you store an object via PickledObjectField, then change the object's definition, and then retrieve the object, the old object may have trouble fitting into the new object's definition (depending on what you changed).
The problems to be solved are the efficiency and the convenience of defining and handling a complex object consisting of many parts.
You can turn each part type into a Model and connect them via ForeignKeys.
Or you can turn each part type into a class, dictionary, list, tuple, enum or what have you, to your liking, and use a PickledObjectField to store and retrieve the whole beast in one step.
That approach makes sense if you will never manipulate parts individually, only the complex object as a whole.
Real life example
In my application there are RQdef objects that essentially represent a type with a certain basic structure (if you are curious what they mean, look here).
RQdefs consist of several Aspects and some fixed attributes.
Aspects consist of one or more Facets and some fixed attributes.
Facets consist of two or more Levels and some fixed attributes.
Levels consist of a few fixed attributes.
Overall, a typical RQdef will have about 20-40 parts.
An RQdef is always constructed completely in a single step before it is stored in the database, and it is henceforth never modified, only read (but read frequently).
A PickledObjectField is more convenient and much more efficient for this purpose than a set of four models and 20-40 objects per RQdef would be.
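A minimal sketch of what this could look like, assuming the django-picklefield package (the model and field names mirror the example above but are made up):

    from django.db import models
    from picklefield.fields import PickledObjectField  # django-picklefield package

    class RQdefRecord(models.Model):
        # Hypothetical model: the whole RQdef (Aspects, Facets, Levels and all)
        # is pickled into a single column instead of four related models.
        name = models.CharField(max_length=100)
        definition = PickledObjectField()

    # Usage sketch: build the object completely, store it once, read it back whole.
    # record = RQdefRecord(name="example", definition=my_rqdef)
    # record.save()
    # rqdef = RQdefRecord.objects.get(name="example").definition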