Saving/Storing pymatgen Structures - python

I'm currently working with a materials science dataset that contains various kinds of information.
In particular, I have a column 'Structure' containing pymatgen.core.Structure objects.
I would like to save/store this dataset as a .csv file or something similar, but the problem is that after doing so and reopening the file, the pymatgen structures lose their type and become plain formatted strings, and I cannot convert them back to their original pymatgen.core.Structure data type.
Any hints on how to do that? I've been searching the pymatgen documentation but haven't had any luck so far.
Thanks in advance!

From the docs:
Side-note : as_dict / from_dict
As you explore the code, you may
notice that many of the objects have an as_dict method and a from_dict
static method implemented. For most of the non-basic objects, we have
designed pymatgen such that it is easy to save objects for subsequent
use. While python does provide pickling functionality, pickle tends to
be extremely fragile with respect to code changes. Pymatgen’s as_dict
provides a means to save your work in a more robust manner, which also
has the added benefit of being more readable. The dict representation
is also particularly useful for entering such objects into certain
databases, such as MongoDb. This as_dict specification is provided in
the monty library, which is a general python supplementary library
arising from pymatgen.
The output from an as_dict method is always json/yaml serializable. So
if you want to save a structure, you may do the following:
with open('structure.json', 'w') as f:
    json.dump(structure.as_dict(), f)
Similarly, to restore the structure (or any object with an as_dict method) from the json, you can do the following:
with open('structure.json', 'r') as f:
    d = json.load(f)
structure = Structure.from_dict(d)
You may replace any of the above json commands with yaml in the PyYAML package to create a yaml
file instead. There are certain tradeoffs between the two choices.
JSON is much more efficient as a format, with extremely fast
read/write speed, but is much less readable. YAML is an order of
magnitude or more slower in terms of parsing, but is more human
readable.
See also https://pymatgen.org/usage.html#montyencoder-decoder and https://pymatgen.org/usage.html#reading-and-writing-structures-molecules
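Tying this back to the original question, here is a minimal sketch of applying as_dict/from_dict to a pandas DataFrame with a 'Structure' column so it survives a CSV round trip. The tiny CsCl example DataFrame, the file name, and the pymatgen.core import path (recent pymatgen versions) are just assumptions standing in for your actual dataset:
import json
import pandas as pd
from pymatgen.core import Lattice, Structure

# Tiny example DataFrame; in practice this is your existing dataset
df = pd.DataFrame({'Structure': [Structure(Lattice.cubic(4.2), ['Cs', 'Cl'],
                                           [[0, 0, 0], [0.5, 0.5, 0.5]])]})

# Serialize each Structure to a JSON string so it survives the CSV round trip
df['Structure'] = df['Structure'].apply(lambda s: json.dumps(s.as_dict()))
df.to_csv('dataset.csv', index=False)

# Later: reload the CSV and rebuild the Structure objects from the JSON strings
df = pd.read_csv('dataset.csv')
df['Structure'] = df['Structure'].apply(lambda s: Structure.from_dict(json.loads(s)))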

A pymatgen.core.Structure object can only be stored in one of a few fixed formats, for example CIF, VASP, XYZ... So you may first need to write your structure information to a CIF or VASP file, then open it and preprocess it into "csv" form with Python (hint: use Python's string-handling functions).
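If you go that route, a minimal sketch using pymatgen's Structure.to and Structure.from_str to round-trip through a CIF-formatted string (the example CsCl structure and variable names are just illustrative):
from pymatgen.core import Lattice, Structure

structure = Structure(Lattice.cubic(4.2), ['Cs', 'Cl'], [[0, 0, 0], [0.5, 0.5, 0.5]])

cif_string = structure.to(fmt="cif")                   # serialize the Structure to a CIF string
# ... store cif_string in a CSV cell, text file, etc. ...
restored = Structure.from_str(cif_string, fmt="cif")   # parse it back into a Structure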

Related

Suggestions for python REST API json serialization

I am used to building REST APIs with PHP, and I make heavy use of the JMS serializer. It basically lets me write annotations on class properties that define the names and types of variables, including nested classes and arrays. This lets me completely abstract away the JSON format and just work with classes, which transparently serialize and deserialize to JSON. In combination with the Symfony validator, this approach is very simple to work with, but also very powerful.
Now to my question: I recently started adopting Python for some projects and I would like to reimplement an API in Python. I have searched the internet for a suitable equivalent to the JMS serializer, but I didn't find one with the same or similar capabilities.
Would anyone be so kind as to point me in the right direction? (Either a good library, or a recommendation of a different approach with equal or better efficiency.)
What I need:
ability to serialize and deserialize object into JSON
define how object is serialized - names of JSON attributes and their data types
define complex object graphs (ability to define a class as a property type, which would then be mapped by its own definition)
ability to map dicts or arrays and types they contain
Thanks in advance
Marshmallow
I haven't used it yet, so I can't tell if it answers all your needs. Feedback welcome.
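For what it's worth, here is a minimal sketch of how nested serialization looks in Marshmallow 3; the Author/Book schemas and field names are made up for illustration:
from marshmallow import Schema, fields

class AuthorSchema(Schema):
    name = fields.Str(required=True)

class BookSchema(Schema):
    title = fields.Str(data_key="bookTitle")   # rename the JSON attribute
    pages = fields.Int()                       # declare the data type
    author = fields.Nested(AuthorSchema)       # nested object graph
    tags = fields.List(fields.Str())           # typed list

schema = BookSchema()
json_dict = schema.dump({"title": "Dune", "pages": 412,
                         "author": {"name": "Frank Herbert"}, "tags": ["sf"]})
loaded = schema.load(json_dict)                # validates and deserializes back to a dict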

Persistence of a large number of objects

I have some code that I am working on that scrapes some data from a website, and then extracts certain key information from that website and stores it in an object. I create a couple hundred of these objects each day, each from a unique URL. This is working quite well; however, I'm inexperienced in what options are available to me in Python for persistence and what would be best suited for my needs.
Currently I am using pickle. To do so, I am keeping all of these webpage objects, appending them to a list as new ones are created, then saving that list to a pickle (and reloading it whenever the list is to be updated). However, as I'm dealing with on the order of some GB of data, I'm finding pickle to be somewhat slow. It's not unworkable, but I'm wondering if there is a better-suited alternative. I don't really want to break apart the structure of my objects and store them in an SQL-type database, as it's important for me to keep the methods and the data together as a single object.
Shelve is one option I've been looking into, as my impression is that I wouldn't have to unpickle and re-pickle all the old entries (just the most recent day that needs to be updated), but I am unsure if this is how shelve works, and how fast it is.
So to avoid rambling on, my question is: what is the preferred persistence method for storing a large number of objects (all of the same type), to keep read/write speed up as the collection grows?
Martijn's suggestion could be one of the alternatives.
You may consider storing the pickled objects directly in an SQLite database, which you can still manage from the Python standard library.
Use a StringIO object to convert between the database column and the Python object.
You didn't mention the size of each object you are pickling now. I guess it should stay well within SQLite's limits.
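A minimal sketch of that idea, keyed by URL so you only touch the rows you need; the Page class, table name, and file name are stand-ins for your own scraped objects:
import pickle
import sqlite3

class Page:                      # stand-in for one of your scraped objects
    def __init__(self, url, title):
        self.url, self.title = url, title

page = Page('https://example.com/a', 'Example')

con = sqlite3.connect('pages.db')
con.execute('CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, data BLOB)')

# Write one object without re-pickling the whole collection
con.execute('INSERT OR REPLACE INTO pages VALUES (?, ?)',
            (page.url, sqlite3.Binary(pickle.dumps(page))))
con.commit()

# Read a single object back by key
row = con.execute('SELECT data FROM pages WHERE url = ?', (page.url,)).fetchone()
restored = pickle.loads(row[0])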

Is there an alternative to pickle - save a dictionary (python)

I need to save a dictionary to a file. The dictionary contains strings, integers, and other dictionaries.
I did it on my own, and it's not pretty or user-friendly.
I know about pickle, but as far as I know it is not safe to use, because if someone replaces the file and I (or someone else) run the program that uses the replaced file, it will execute and might do harmful things. It's just not safe.
Is there another function or module that does this?
Pickle is not safe when transferred by an untrusted 3rd party. Local files are just fine, and if something can replace files on your filesystem, then you have a different problem.
That said, if your dictionary contains nothing but string keys and the values are nothing but Python lists, numbers, strings or other dictionaries, then use JSON, via the json module.
Presuming your dictionary contains only basic data types, the normal answer is JSON; it's a popular, well-defined format for this kind of thing.
If your dictionary contains more complex data, you will have to manually serialise it at least part of the way.
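A minimal sketch of the json-module approach for a dictionary of basic types (the file name and contents are arbitrary):
import json

data = {'name': 'sample', 'count': 3, 'nested': {'a': 1, 'b': [2, 3]}}

# Save the dictionary to a file
with open('data.json', 'w') as f:
    json.dump(data, f, indent=2)

# Load it back; keys come back as strings, and tuples/sets would not survive
with open('data.json') as f:
    restored = json.load(f)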
JSON is not quite the Python way, for several reasons:
It can't wrap/unwrap all Python data types: there's no support for sets or tuples.
Not fast enough because it needs to deal with textual data and encodings.
Try to use sPickle instead.

lists or dicts over zeromq in python

What is the correct/best way to send objects like lists or dicts over zeromq in python?
What if we use a PUB/SUB pattern, where the first part of the string would be used as a filter?
I am aware that there are multipart messages, but they were originally meant for a different purpose. Furthermore, you cannot subscribe to all messages that have a certain string as the first element.
Manual serialization
You turn the data into a string, concatenate or whatever, and do your stuff. It's fast and doesn't take much space, but requires work and maintenance, and it's not flexible.
If another language wants to read the data, you need to code it again. No DRY.
OK for very small data, but really the amount of work is usually not worth it unless you are looking for speed and memory efficiency and you can measure that your implementation is significantly better.
Pickle
Slow, but you can serialize complex objects, even callables. It's powerful, and it's so easy it's a no-brainer.
On the other hand, it's possible to end up with something you can't pickle and break your code. Plus, you can't share the data with any lib written in another language.
Finally, the format is not human readable (hard to debug) and quite verbose.
Very nice to share objects and tasks, not so nice for messages.
json
Reasonably fast, easy to implement with simple to averagely complex data structures. It's flexible, human readable, and data can be shared across languages easily.
For complex data, you'll have to write a bit of code.
Unless you have a very specific need, this is probably the best balance between features and complexity. Especially since the latest implementation in the Python lib is in C and speed is OK.
xml
Verbose, hard to create, and a pain to maintain unless you have some heavy lib that does all the job for you. Slow.
Unless it's a requirement, I would avoid it.
In the end
Now, as usual, speed and space efficiency are relative, and you must first answer these questions:
What efficiency do I need?
What am I ready to pay (money, time, energy) for that?
What solution fits in my current system?
That's all that matters.
That wonderful moment of philosophy passed, use JSON.
JSON:
# Client (json.dumps returns a str in Python 3, so encode it to bytes before sending)
socket.send(json.dumps(message).encode('utf-8'))
# Server
message = json.loads(socket.recv().decode('utf-8'))
More info:
JSON encoder and decoder
hwserver.py
hwclient.py
In ZeroMQ, a message is simply a binary blob. You can put anything in it that you want. When you have an object that has multiple parts, you need to first serialize it into something that can be deserialized on the other end. The simplest way to do this is to use repr(obj), which produces a string that you can evaluate at the other end to recreate the object. But that is not the best way.
First of all, you should try to use a language independent format because sooner or later you will need to interact with applications written in other languages. A JSON object is a good choice for this because it is a single string that can be decoded by many languages. However, a JSON object might not be the most efficient representation if you are sending lots of messages across the network. Instead you might want to consider a format like MSGPACK or Protobufs.
If you need a topic identifier for PUB/SUB, then simply tack it onto the beginning. Either use a fixed-length topic, or place a delimiter between the topic and the real message.
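A minimal sketch of the delimiter approach over pyzmq PUB/SUB, with the 'weather' topic, the port, and the payload chosen arbitrarily for illustration:
# publisher.py: prefix the JSON payload with the topic and a space delimiter
import json
import time
import zmq
ctx = zmq.Context()
pub = ctx.socket(zmq.PUB)
pub.bind('tcp://*:5556')
while True:
    pub.send_string('weather ' + json.dumps({'temp': 21, 'city': 'Oslo'}))
    time.sleep(1)

# subscriber.py: filter on the topic prefix, split it off, then decode the JSON
import json
import zmq
ctx = zmq.Context()
sub = ctx.socket(zmq.SUB)
sub.connect('tcp://localhost:5556')
sub.setsockopt_string(zmq.SUBSCRIBE, 'weather')
topic, payload = sub.recv_string().split(' ', 1)
data = json.loads(payload)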
Encode as JSON before sending, and decode as JSON after receiving.
Also check out MessagePack
http://msgpack.org/
"It's like JSON. but fast and small"
In case you are interested in seeing examples, I released a small package called pyRpc that shows you how to do a simple python RPC setup where you expose services between different apps. It uses the python zeromq built-in method for sending and receiving python objects (which I believe is simply cPickle)
http://pypi.python.org/pypi/pyRpc/0.1
https://github.com/justinfx/pyRpc
While my examples use the pyobj version of the send and receive calls, you can see there are other versions available that you can use, like send_json, send_unicode... Unless you need some specific type of serialization, you can easily just use the convenience send/receive functions that handle the serialization/deserialization on both ends for you.
http://zeromq.github.com/pyzmq/api/generated/zmq.core.socket.html
json is probably the fastest, and if you need even faster than what is included in zeromq, you could manually use cjson. If your focus is speed then this is a good option. But if you know you will be communicating only with other python services, then the benefit of cPickle is a native python serialization format that gives you a lot of control. You can easily define your classes to serialize the way you want, and end up with native python objects in the end, as opposed to basic values. I'm sure you could also write your own object hook for json if you wanted.
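As a rough sketch of those pyzmq convenience calls over a REQ/REP pair (the port and payload are arbitrary): send_pyobj/recv_pyobj pickle native Python objects for you, while send_json/recv_json stick to JSON-compatible data.
# server.py
import zmq
ctx = zmq.Context()
rep = ctx.socket(zmq.REP)
rep.bind('tcp://*:5557')
request = rep.recv_pyobj()              # unpickled back into a native Python object
rep.send_json({'ok': True, 'count': len(request['numbers'])})

# client.py
import zmq
ctx = zmq.Context()
req = ctx.socket(zmq.REQ)
req.connect('tcp://localhost:5557')
req.send_pyobj({'numbers': (1, 2, 3)})  # pickled under the hood, so tuples survive
reply = req.recv_json()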
There are a few questions in that question, but in terms of the best/correct way to send objects/dicts, obviously it depends. For a lot of situations JSON is simple and familiar to most. To get it to work I had to use send_string and recv_string, e.g.
# client.py
socket.send_string(json.dumps({'data': ['a', 'b', 'c']}))
# server.py
result = json.loads(socket.recv_string())
Discussion in docs https://pyzmq.readthedocs.io/en/latest/unicode.html

What is the least resource intense data structure to distribute with a Python Application

I am building an application to distribute to fellow academics. The application will take three parameters that the user submits and output a list of dates and codes related to those events. I have been building this using a dictionary and intended to build the application so that the dictionary loaded from a pickle file when the application called for it. The parameters supplied by the user will be used to lookup the needed output.
I selected this structure because I have gotten pretty comfortable with dictionaries and pickle files and I see this going out the door with the smallest learning curve on my part. There might be as many as two million keys in the dictionary. I have been satisfied with the performance on my machine with a reasonable subset. I have already thought through about how to break the dictionary apart if I have any performance concerns when the whole thing is put together. I am not really that worried about the amount of disk space on their machine as we are working with terabyte storage values.
Having said all of that I have been poking around in the docs and am wondering if I need to invest some time to learn and implement an alternative data storage file. The only reason I can think of is if there is an alternative that could increase the lookup speed by a factor of three to five or more.
The standard shelve module will give you a persistent dictionary that is stored in a dbm-style database. Provided that your keys are strings and your values are picklable (since you're using pickle already, this must be true), this could be a better solution than simply storing the entire dictionary in a single pickle.
Example:
>>> import shelve
>>> d = shelve.open('mydb')
>>> d['key1'] = 12345
>>> d['key2'] = {'nested': ['any', 'picklable', 'value']}
>>> print(d['key1'])
12345
>>> d.close()
I'd also recommend Durus, but that requires some extra learning on your part. It'll let you create a PersistentDictionary. From memory, keys can be any pickleable object.
To get fast lookups, use the standard Python dbm module (see http://docs.python.org/library/dbm.html) to build your database file, and do lookups in it. The dbm file format may not be cross-platform, so you may want to distribute your data in pickle or repr or JSON or YAML or XML format, and build the dbm database when the user runs your program.
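A minimal sketch of that approach (in Python 3 the module is just dbm; the file name, keys, and values here are placeholders for your dates and codes):
import dbm
import json

# Build the database file once, e.g. the first time the user runs the program
with dbm.open('events.db', 'c') as db:
    db['2010-06-15'] = json.dumps(['CODE1', 'CODE2'])   # stored as bytes

# Fast keyed lookups afterwards
with dbm.open('events.db', 'r') as db:
    codes = json.loads(db['2010-06-15'])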
How much memory can your application reasonably use? Is this going to be running on each user's desktop, or will there just be one deployment somewhere?
A python dictionary in memory can certainly cope with two million keys. You say that you've got a subset of the data; do you have the whole lot? Maybe you should throw the full dataset at it and see whether it copes.
I just tested creating a two million record dictionary; the total memory usage for the process came in at about 200MB. If speed is your primary concern and you've got the RAM to spare, you're probably not going to do better than an in-memory python dictionary.
See this solution at SourceForge, esp. the "endnotes" documentation:
y_serial.py module :: warehouse Python objects with SQLite
"Serialization + persistance :: in a few lines of code, compress and annotate Python objects into SQLite; then later retrieve them chronologically by keywords without any SQL. Most useful "standard" module for a database to store schema-less data."
http://yserial.sourceforge.net
Here are three things you can try:
Compress the pickled dictionary with zlib, e.g. zlib.compress(pickle.dumps(d)) (see the sketch below).
Make your own serializing format (shouldn't be too hard).
Load the data into an SQLite database.
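A minimal sketch of the first option, compressing the pickled dictionary with zlib before writing it to disk (the file name and sample contents are just placeholders):
import pickle
import zlib

lookup = {('param1', 'param2', 'param3'): ['2010-06-15', 'CODE1']}  # your big dict

# Save: pickle first, then compress
with open('lookup.pkl.z', 'wb') as f:
    f.write(zlib.compress(pickle.dumps(lookup)))

# Load: decompress first, then unpickle
with open('lookup.pkl.z', 'rb') as f:
    lookup = pickle.loads(zlib.decompress(f.read()))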
