I've heard people say that PyMongo automatically uses BSON format for everything you insert in the database. Is this true? Or do I still need to run BSON.encode manually?
The drivers handle marshaling Python built-in objects to their BSON counterparts as part of the intermediate layer between you and the database. Ultimately, the data stored in MongoDB is BSON.
datetime objects will be saved properly, as will numerics, strings, and lists. You do not need to serialize them yourself. A document object is simply a dictionary.
The only reason for manual encoding is when you want to give custom classes the ability to be stored without having to break them down into built-in types. It's very much like any other serialization format (pickle, JSON, ...): they usually handle the built-ins fine but need extra help for custom types.
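A minimal sketch of both cases, assuming a local MongoDB instance; the Event class and the database/collection names are illustrative:

import datetime
from pymongo import MongoClient

coll = MongoClient()["mydb"]["events"]

# Built-in types go straight in; the driver converts them to BSON for you.
coll.insert_one({"name": "login", "count": 3, "when": datetime.datetime.utcnow()})

class Event:
    def __init__(self, name, when):
        self.name, self.when = name, when

# A custom class has to be broken down into built-ins first,
# e.g. via its instance dict (illustrative, not a PyMongo feature).
e = Event("logout", datetime.datetime.utcnow())
coll.insert_one(vars(e))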
Related
I'm currently dealing with a materials science dataset containing various kinds of information.
In particular, I have a column 'Structure' with several pymatgen.core.Structure objects.
I would like to save/store this dataset as a .csv file or something similar, but the problem is that after doing so and reopening the file, the pymatgen structures lose their type, becoming just formatted strings, and I cannot get back to their initial pymatgen.core.Structure data type.
Any hints on how to do that? I've been searching the pymatgen documentation but haven't had any luck so far.
Thanks in advance!
From the docs:
Side-note: as_dict / from_dict
As you explore the code, you may notice that many of the objects have an as_dict method and a from_dict static method implemented. For most of the non-basic objects, we have designed pymatgen such that it is easy to save objects for subsequent use. While Python does provide pickling functionality, pickle tends to be extremely fragile with respect to code changes. Pymatgen's as_dict provides a means to save your work in a more robust manner, which also has the added benefit of being more readable. The dict representation is also particularly useful for entering such objects into certain databases, such as MongoDB. This as_dict specification is provided in the monty library, which is a general Python supplementary library arising from pymatgen.
The output from an as_dict method is always JSON/YAML serializable. So if you want to save a structure, you may do the following:
import json

with open('structure.json', 'w') as f:
    json.dump(structure.as_dict(), f)
Similarly, you can restore the structure (or any object with an as_dict/from_dict pair) from the JSON file as follows:
import json
from pymatgen.core import Structure

with open('structure.json', 'r') as f:
    d = json.load(f)
structure = Structure.from_dict(d)
You may replace any of the above json commands with yaml in the PyYAML package to create a yaml
file instead. There are certain tradeoffs between the two choices.
JSON is much more efficient as a format, with extremely fast
read/write speed, but is much less readable. YAML is an order of
magnitude or more slower in terms of parsing, but is more human
readable.
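For example, the YAML variant would look roughly like this (a sketch assuming PyYAML is installed and structure is an existing Structure object):

import yaml
from pymatgen.core import Structure

# Dump the dict representation as YAML instead of JSON.
with open('structure.yaml', 'w') as f:
    yaml.safe_dump(structure.as_dict(), f)

# Load it back and rebuild the Structure.
with open('structure.yaml') as f:
    structure = Structure.from_dict(yaml.safe_load(f))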
See also https://pymatgen.org/usage.html#montyencoder-decoder and https://pymatgen.org/usage.html#reading-and-writing-structures-molecules
A pymatgen.core.Structure object can only be stored in certain fixed formats, for example CIF, VASP, or XYZ. So perhaps you first need to write your structure information to a CIF or VASP file, then open it and preprocess it into CSV form using Python's string-handling functions.
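A rough sketch of that route, using pymatgen's own file readers/writers (the file name is illustrative, and structure is assumed to be an existing Structure):

from pymatgen.core import Structure

# Write the structure out in one of the fixed text formats, e.g. CIF...
structure.to(filename="structure.cif")

# ...and recover a full Structure object from the file later.
structure = Structure.from_file("structure.cif")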
As I've been trying to solve "Not JSON serializable" problems for the last couple of hours, I'm very interested in what the hard part is when serializing and deserializing an instance.
Why can't a class instance be serialized with, for example, JSON?
To serialize:
Note the class name (in order to rebuild the object)
Note the variable values at the time of packaging.
Convert it to string.
Optionally pack it into a more compact form (as msgpack does)
To deserialize:
Create a new instance
Assign known values to appropriate variables
Return the object.
What is difficult? What counts as a complex data type?
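For reference, a naive implementation of exactly those steps might look like this (a sketch with no error handling; the Point class and the class registry are illustrative):

import json

class Point:
    def __init__(self, x=0, y=0):
        self.x, self.y = x, y

def serialize(obj):
    # Note the class name and the variable values, then convert to a string.
    return json.dumps({"class": type(obj).__name__, "vars": vars(obj)})

def deserialize(s, registry):
    d = json.loads(s)
    obj = registry[d["class"]]()   # create a new instance
    vars(obj).update(d["vars"])    # assign known values to the variables
    return obj                     # return the object

p = deserialize(serialize(Point(1, 2)), {"Point": Point})
assert (p.x, p.y) == (1, 2)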
The "hard" part is mainly step 3 of your serialization, converting the contained values to strings (and later back during deserialization)
For simple types like numbers, strings, booleans, it's quite straight forward, but for complex types like a socket connected to a remote server or an open file descriptor, it won't work very well.
The solution is usually to either move the complex types from the types you want to serialize and keep the serialized types very clean, or somehow tag or otherwise tell the serializers exactly which properties should be serialized, and which should not.
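A minimal sketch of the "keep the serialized fields clean" approach (the Logger class and its to_json/from_json helpers are illustrative, not a standard API):

import json

class Logger:
    def __init__(self, path):
        self.path = path              # simple, serializable state
        self._fh = open(path, "a")    # complex resource: cannot be serialized

    def to_json(self):
        # Serialize only the clean fields, not the file handle.
        return json.dumps({"path": self.path})

    @classmethod
    def from_json(cls, data):
        # Recreate the complex resource from the clean fields.
        return cls(json.loads(data)["path"])

log = Logger("app.log")
restored = Logger.from_json(log.to_json())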
I need to save a dictionary to a file. The dictionary contains strings, integers, and other dictionaries.
I wrote my own solution, but it's neither pretty nor user-friendly.
I know about pickle, but as far as I know it is not safe to use: if someone replaces the file, then when I (or someone else) run the code that loads it, the replaced file could make it do anything. It's just not safe.
Is there another function or module that does this?
Pickle is not safe when transferred by an untrusted 3rd party. Local files are just fine, and if something can replace files on your filesystem, then you have a different problem.
That said, if your dictionary contains nothing but string keys and the values are nothing but Python lists, numbers, strings or other dictionaries, then use JSON, via the json module.
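For instance (the file name and data are illustrative):

import json

data = {"name": "widget", "count": 3, "meta": {"tags": ["a", "b"]}}

# Write the dictionary out as JSON...
with open("data.json", "w") as f:
    json.dump(data, f)

# ...and read it back; no arbitrary code can run during loading.
with open("data.json") as f:
    restored = json.load(f)

assert restored == data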
Presuming your dictionary contains only basic data types, the normal answer is JSON: it's a popular, well-defined format for this kind of thing.
If your dictionary contains more complex data, you will have to manually serialise it at least part of the way.
JSON is not quite the Python way, for several reasons:
It can't round-trip all Python data types: there's no support for sets, and tuples come back as lists (see the snippet below).
It's not fast enough, because it needs to deal with textual data and encodings.
Try to use sPickle instead.
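A quick demonstration of the type-support point:

import json

json.loads(json.dumps({"coords": (1, 2)}))  # {'coords': [1, 2]}: the tuple came back as a list
json.dumps({"tags": {"a", "b"}})            # raises TypeError: Object of type set is not JSON serializable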
I have a Python dictionary that I'd like to serialize into the Python source code required to initialize that dictionary's keys and values.
I'm looking for something like json.dumps(), but the output format should be Python not JSON.
My object is very simple: it's a dictionary whose values are one of the following:
built-in type literals (strings, ints, etc.)
lists of literals
I know I can build one myself, but I suspect there are corner cases involving nested objects, keyword escaping, etc. so I'd prefer to use an existing library with those kinks already worked out.
In the most general case, it's not possible to dump an arbitrary Python object into Python source code. E.g. if the object is a socket, recreating the very same socket in a new object cannot work.
As aix explains, for the simple case, repr is explicitly designed to produce reproducible source representations. As a slight generalization, the pprint module allows selective customization through the PrettyPrinter class.
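For a dictionary of literals, that can be as simple as this (illustrative values):

import ast
import pprint

d = {"name": "widget", "sizes": [1, 2, 3]}
source = "d = " + pprint.pformat(d)
print(source)  # d = {'name': 'widget', 'sizes': [1, 2, 3]}

# The output is valid Python source; ast.literal_eval confirms the round trip.
assert ast.literal_eval(pprint.pformat(d)) == d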
If you want it even more general, and if your only requirement is that you get executable Python source code, I recommend pickling the object into a string and then generating the source code

obj = pickle.loads(%s)

where %s gets substituted with repr(pickle.dumps(obj)).
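Put together, that recipe looks like this (a sketch; exec is used here only to verify the generated source):

import pickle

obj = {"a": [1, 2, 3], "b": "text"}
source = "import pickle\nobj = pickle.loads(%s)\n" % repr(pickle.dumps(obj))
print(source)

# Executing the generated source recreates an equal object.
namespace = {}
exec(source, namespace)
assert namespace["obj"] == obj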
repr(d), where d is your dictionary, could be a start (it doesn't address all the issues you mention, though).
You can use yaml
http://pyyaml.org/
YAML doesn't serialise EXACTLY to Python source code, but the yaml module can directly serialise and deserialise Python objects. It's a superset of JSON.
What exactly are you trying to do?
Surfing the web and reading about Django dev best practices, I keep finding advice to use pickled model fields with extreme caution.
But in a real-life example, where would you use a PickledObjectField, and to solve what specific problems?
We have a system of social-network "backends" which do some generic stuff like "post message", "get status", "get friends", etc. The link between each backend class and a user is a Django model, which keeps the user, the backend name, and the credentials. Now imagine how many auth systems there are: OAuth, plain passwords, Facebook's obscure JS stuff, etc. This is where JSONField shines: we keep all backend-specific auth data in a dictionary on this model, which is stored in the db as JSON, and we can put anything into it, no problem.
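A minimal sketch of that pattern, assuming Django 3.1+'s built-in models.JSONField (the model and field names are illustrative):

from django.conf import settings
from django.db import models

class BackendLink(models.Model):
    user = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)
    backend_name = models.CharField(max_length=50)
    # Backend-specific auth data: OAuth tokens, plain passwords, etc.
    # Anything JSON-serializable goes in, with no schema changes needed.
    credentials = models.JSONField(default=dict)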
You would use it to store... almost-arbitrary Python objects. In general there's little reason to use it; JSON is safer and more portable.
You can definitely substitute a PickledObjectField with JSON and some extra logic to create an object out of the JSON. At the end of the day, your use case, when considering a PickledObjectField versus JSON+logic, is serializing a Python object into your database. If you can trust the data in the Python object, and know that it will always be serializable, you can reasonably use the PickledObjectField.

In my case (I don't use Django's ORM, but this should still apply), I have a couple of different object types that can go into my PickledObjectField, and their definitions are constantly mutating. Rather than constantly updating my JSON parsing logic to create an object out of JSON values, I simply use a PickledObjectField to store the different objects, and then later retrieve them in perfectly usable form (calling their functions).

Caveat: if you store an object via PickledObjectField, then change the object's definition, and then retrieve the object, the old object may have trouble fitting into the new object's definition (depending on what you changed).
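For reference, declaring such a field typically looks like this (a sketch assuming the third-party django-picklefield package, which provides PickledObjectField; the model and field names are illustrative):

from django.db import models
from picklefield.fields import PickledObjectField

class Job(models.Model):
    name = models.CharField(max_length=100)
    # Holds an arbitrary picklable Python object; reasonable only when you
    # trust everything that writes to this column.
    payload = PickledObjectField()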
The problems to be solved are the efficiency and the convenience of defining and handling a complex object consisting of many parts.
You can turn each part type into a Model and connect them via ForeignKeys.
Or you can turn each part type into a class, dictionary, list, tuple, enum, or what have you, to your liking, and use PickledObjectField to store and retrieve the whole beast in one step.
That approach makes sense if you will never manipulate parts individually, only the complex object as a whole.
Real life example
In my application there are RQdef objects that represent essentially a type with a certain basic structure (if you are curious what they mean, look here).
RQdefs consist of several Aspects and some fixed attributes.
Aspects consist of one or more Facets and some fixed attributes.
Facets consist of two or more Levels and some fixed attributes.
Levels consist of a few fixed attributes.
Overall, a typical RQdef will have about 20-40 parts.
An RQdef is always completely constructed in a single step before it is stored in the database and it is henceforth never modified, only read (but read frequently).
PickledObjectField is more convenient and much more efficient for this purpose than would be a set of four models and 20-40 objects for each RQdef.