How to save an array of objects to a file in Python

I know how to save an array of simple data (like float type values) to a file using numpy.save and numpy.savez.
I also know how to save a single object to a file using the pickle module, although I have not tested it yet.
The question is: how can I save (and load) an array of objects to a file? Can I combine the two approaches described above to achieve this? Is there a better way?

If you know how to pickle a single object, then to pickle any number of objects you can put them into a single container (a list, set, or dict with the objects as values, for example, or your own class that holds them in some way) and pickle that container; the container itself is a single object available to pickle.
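For instance, a minimal sketch with the standard pickle module (the class and file name are only illustrative):

    import pickle

    class Point:
        def __init__(self, x, y):
            self.x, self.y = x, y

    points = [Point(0, 1), Point(2, 3), Point(4, 5)]   # the "array" of objects

    with open("points.pkl", "wb") as f:
        pickle.dump(points, f)        # the whole list is one picklable object

    with open("points.pkl", "rb") as f:
        restored = pickle.load(f)     # a list of Point instances again

Note that the class definition must be importable when you unpickle, since pickle stores references to classes rather than their code.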

Related

Is converting numpy array to .npy format an example of serialization of data?

I understand that serialization of data means converting a data structure or object state to a form that can be stored in a file or buffer, transmitted, and reconstructed later (https://www.tutorialspoint.com/object_oriented_python/object_oriented_python_serialization.htm). Based on this definition, converting a numpy array to .npy format should be serialization of the numpy array object. However, I could not find this assertion anywhere when I looked it up on the internet; most of the related links discussed how the pickle format does serialization of data in Python. My question is: is converting a numpy array to .npy format an example of serialization of a Python data object? If not, what are the reasons?
Well, according to Wikipedia:
In computing, serialization (or serialisation) is the process of
translating data structures or object state into a format that can be
stored (for example, in a file or memory buffer) or transmitted (for
example, across a network connection link) and reconstructed later
(possibly in a different computer environment).
And according to Numpy Doc:
Binary serialization
NPY format
A simple format for saving numpy arrays to disk with the full
information about them.
The .npy format is the standard binary file format in NumPy for
persisting a single arbitrary NumPy array on disk. The format stores
all of the shape and dtype information necessary to reconstruct the
array correctly even on another machine with a different architecture.
The .npz format is the standard format for persisting multiple NumPy
arrays on disk. A .npz file is a zip file containing multiple .npy
files, one for each array.
So, putting these definitions together, you can answer your own question: yes, converting a NumPy array to the .npy format is a form of serialization. The process of storing and reading it back is also fast.
np.save(filename, arr) writes the array to a file. Since a file is a linear structure, this is a form of serialization. But often 'serialization' refers to creating a string that can be sent to a database or over some 'pipeline'. I think you can save to a string buffer, but it takes a bit of trickery.
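For example, a small sketch of that trickery using an in-memory buffer (variable names are illustrative):

    import io
    import numpy as np

    arr = np.arange(6).reshape(2, 3)

    buf = io.BytesIO()
    np.save(buf, arr)                # serialize into a buffer instead of a file
    payload = buf.getvalue()         # raw bytes that could go to a database or socket

    restored = np.load(io.BytesIO(payload))   # reconstruct the array from the bytes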
But in Python most objects can be pickled, which produces a byte string that can be written to a file. In that sense pickle is a 2 step process - serialize, then write to file. The pickle of a numpy array is actually in a save-compatible form (conversely, np.save of a non-array object uses that object's pickle).
savez writes a zip archive containing one npy file for each array; savez_compressed writes the same archive in compressed form. There are standard OS tools for transferring zip archives to other computers.
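A quick sketch (file and array names are arbitrary):

    import numpy as np

    a = np.arange(10)
    b = np.linspace(0.0, 1.0, 5)

    np.savez("arrays.npz", a=a, b=b)              # zip archive holding a.npy and b.npy
    # np.savez_compressed("arrays.npz", a=a, b=b) # same thing, but compressed

    with np.load("arrays.npz") as archive:
        restored_a = archive["a"]
        restored_b = archive["b"]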

How to load multiple objects (list of lists) in HDF5?

I am currently trying to use Keras to predict ... stuff. I am using an HDF5 file as input. The file contains 195 objects; each one is a list of matrices with one attribute. I would like Keras to learn on the lists of matrices and predict the attribute. But here is the issue: so far I have seen that one object can only be assigned to one variable, which would be meaningless in my case.
I would like to know whether it is possible to load all of these objects at once, say under one variable, into Keras to predict the attribute. For instance, here are some of the objects:
['10gs', '1a30', '1bcu',..., '4tmn']
I know I can assign one object to one variable,
dataset=infile['1a30']
However, I am not sure how to assign several objects to one variable. Do I need to create a list of objects? Here's what I am trying to get,
dataset=infile['all of my objects'].
Ultimately I will be using this in Keras, but I am not sure that matters here, as it seems to me this is really an HDF5 question (or a misunderstanding on my part).
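A rough sketch of one way to do it with h5py, assuming the file name and that each top-level entry is a dataset (if the entries are groups of matrices, you would iterate one level deeper):

    import h5py

    with h5py.File("complexes.hdf5", "r") as infile:
        names = list(infile.keys())                        # ['10gs', '1a30', ..., '4tmn']
        data = {name: infile[name][()] for name in names}  # read everything into memory

    # 'data' now holds all 195 objects under a single variable,
    # e.g. data['1a30'] is the array that infile['1a30'] referred to.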

System/Convention for code-independent serialization

I'm training neural networks. I want to save them in a code-independent way so they can be loaded by someone using different software.
Just pickling my objects is no good, because the pickle breaks if it's loaded in an environment where the code has changed or moved (which it always does).
So I've been converting my objects into dicts of primitive types and pickling those. I maintain a module that can convert these dicts back into objects (the type of object is defined by a "class" key of the dict). My current solution feels messy.
So I was wondering if there is some package or design pattern that is made to handle this kind of "code-independent serialization".
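For reference, a rough sketch of the dict-round-trip pattern described above (all class names, keys, and file names are only illustrative):

    import pickle

    REGISTRY = {}

    def register(cls):
        REGISTRY[cls.__name__] = cls
        return cls

    @register
    class DenseLayer:
        def __init__(self, weights, bias):
            self.weights, self.bias = weights, bias

        def to_dict(self):
            # only primitive types go into the serialized form
            return {"class": "DenseLayer", "weights": self.weights, "bias": self.bias}

        @classmethod
        def from_dict(cls, d):
            return cls(d["weights"], d["bias"])

    def load_object(d):
        return REGISTRY[d["class"]].from_dict(d)   # the "class" key picks the constructor

    layer = DenseLayer([[0.1, 0.2]], [0.0])
    with open("layer.pkl", "wb") as f:
        pickle.dump(layer.to_dict(), f)            # only a dict of primitives is pickled
    with open("layer.pkl", "rb") as f:
        restored = load_object(pickle.load(f))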
If you are using numpy/scipy for your project, you could save your weight matrices in MATLAB format.
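A minimal sketch with scipy.io (array names and shapes are made up):

    import numpy as np
    from scipy.io import savemat, loadmat

    # hypothetical weight matrices for a two-layer network
    weights = {"W1": np.random.randn(784, 128), "W2": np.random.randn(128, 10)}

    savemat("weights.mat", weights)    # readable from MATLAB, Octave, Julia, R, ...
    restored = loadmat("weights.mat")  # 'W1' and 'W2' come back as ndarrays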

What are the basic difference between pickle and yaml in Python?

I am new to Python. But what I have come to know is that both are used for serialization and deserialization. So I just want to know: what are the basic differences between them?
YAML is a language-neutral format that can represent primitive types (int, string, etc.) well, and is highly portable between languages. Kind of analogous to JSON, XML or a plain-text file; just with some useful formatting conventions mixed in -- in fact, YAML is a superset of JSON.
Pickle format is specific to Python and can represent a wide variety of data structures and objects, e.g. Python lists, sets and dictionaries; instances of Python classes; and combinations of these like lists of objects; objects containing dicts containing lists; etc.
So basically:
YAML represents simple data types & structures in a language-portable manner
pickle can represent complex structures, but in a non-language-portable manner
There's more to it than that, but you asked for the "basic" difference.
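A small illustration of that difference, assuming PyYAML is installed (the sample data is made up):

    import pickle
    import yaml   # PyYAML

    data = {"name": "example", "scores": [1, 2, 3]}

    text = yaml.safe_dump(data)   # human-readable text, portable across languages
    blob = pickle.dumps(data)     # compact, Python-only byte stream

    print(text)
    print(yaml.safe_load(text) == pickle.loads(blob))   # True: both round-trip the data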
pickle is a Python-specific serialization format in which a Python object is converted into a byte stream and back:
“Pickling” is the process whereby a Python object hierarchy is
converted into a byte stream, and “unpickling” is the inverse
operation, whereby a byte stream is converted back into an object
hierarchy.
The main point is that it is python specific.
On the other hand, YAML is a language-agnostic, human-readable serialization format.
FYI, if you are choosing between these formats, think about:
serialization/deserialization speed (see the cPickle module)
do you need to store serialized files in a human-readable form?
what are you going to serialize? If it's a python-specific complex data structure, for example, then you should go with pickle.
See also:
Python serialization - Why pickle?
Lightweight pickle for basic types in python?
If the files do not need to be read by a person and you just need to save the data and read it back later, then use pickle. It is much faster and the binary files are smaller.
YAML files are more readable as mentioned above, but also slower and larger in size.
I have tested this for my application: I measured the time to write an object to a file and read it back, as well as the resulting file size.
Serialization/deserialization method | Average time, s | Size of file, kB
PyYAML                               | 1.73            | 1149.358
pickle                               | 0.004           | 690.658
As you can see, YAML is 1.67 times heavier and 432.5 times slower.
P.S. This is for my data; in your case it may be different, but it is enough for a comparison.
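If you want to repeat the comparison on your own data, a rough sketch of such a benchmark (assuming PyYAML; the sample data and file names are made up):

    import pickle
    import time
    import yaml

    data = {"values": list(range(100_000))}   # substitute your own object here

    start = time.perf_counter()
    with open("data.yaml", "w") as f:
        yaml.safe_dump(data, f)
    yaml_time = time.perf_counter() - start

    start = time.perf_counter()
    with open("data.pkl", "wb") as f:
        pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)
    pickle_time = time.perf_counter() - start

    print(f"YAML: {yaml_time:.3f}s, pickle: {pickle_time:.3f}s")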

How can I make a large python data structure more efficient to unpickle?

I have a list of ~1.7 million "token" objects, along with a list of ~130,000 "structure" objects which reference the token objects and group them into, well, structures. It's an ~800MB memory footprint, on a good day.
I'm using __slots__ to keep my memory footprint down, so my __getstate__ returns a tuple of serializable values, which __setstate__ bungs back into place. I'm also not pickling all the instance data, just 5 items for tokens, 7-9 items for structures, all strings or integers.
Of course, I'm using cPickle, and HIGHEST_PROTOCOL, which happens to be 2 (python 2.6). The resulting pickle file is ~120MB.
On my development machine, it takes ~2 minutes to unpickle the pickle. I'd like to make this faster. What methods might be available to me, beyond faster hardware and what I'm already doing?
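For context, a minimal sketch of the __slots__ plus __getstate__/__setstate__ setup described above (field names are invented):

    import pickle

    class Token:
        # __slots__ avoids a per-instance __dict__, keeping the memory footprint down
        __slots__ = ("text", "start", "end", "kind", "value")

        def __init__(self, text, start, end, kind, value):
            self.text, self.start, self.end = text, start, end
            self.kind, self.value = kind, value

        def __getstate__(self):
            # a plain tuple of serializable values
            return (self.text, self.start, self.end, self.kind, self.value)

        def __setstate__(self, state):
            self.text, self.start, self.end, self.kind, self.value = state

    tokens = [Token("word", i, i + 4, "WORD", i) for i in range(3)]
    blob = pickle.dumps(tokens, protocol=pickle.HIGHEST_PROTOCOL)
    restored = pickle.loads(blob)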
Pickle is not the best method for storing large amounts of similar data. It can be slow for large data sets, and more importantly, it is very fragile: changing around your source can easily break all existing datasets. (I would recommend reading what pickle at its heart actually is: a bunch of bytecode expressions. It will frighten you into considering other means of data storage/retrieval.)
You should look into using PyTables, which uses HDF5 (cross-platform and everything) to store arbitrarily large amounts of data. You don't even have to load everything off of a file into memory at once; you can access it piecewise. The structure you're describing sounds like it would fit very well into a "table" object, which has a set field structure (comprised of fixed-length strings, integers, small Numpy arrays, etc.) and can hold large amounts very efficiently. For storing metadata, I'd recommend using the ._v_attrs attribute of your tables.
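A rough sketch of what that could look like with PyTables (field names and sizes are made up):

    import tables

    class TokenRow(tables.IsDescription):
        text  = tables.StringCol(32)   # fixed-length string
        start = tables.Int32Col()
        end   = tables.Int32Col()
        kind  = tables.StringCol(16)
        value = tables.Int64Col()

    with tables.open_file("tokens.h5", mode="w") as h5:
        table = h5.create_table("/", "tokens", TokenRow, "token objects")
        table.attrs.source = "corpus-v1"      # metadata goes into the attrs (._v_attrs)
        row = table.row
        for i in range(1000):
            row["text"], row["start"], row["end"] = f"tok{i}".encode(), i, i + 3
            row["kind"], row["value"] = b"WORD", i
            row.append()
        table.flush()

    # later, read back only what you need instead of unpickling everything at once
    with tables.open_file("tokens.h5", mode="r") as h5:
        first_rows = h5.root.tokens[:10]      # piecewise access, no full load into memory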
