What are the basic differences between pickle and YAML in Python? - python

I am new to Python, but I understand that both are used for serialization and deserialization. So I just want to know: what are the basic differences between them?

YAML is a language-neutral format that can represent primitive types (int, string, etc.) well, and is highly portable between languages. Kind of analogous to JSON, XML or a plain-text file; just with some useful formatting conventions mixed in -- in fact, YAML is a superset of JSON.
Pickle format is specific to Python and can represent a wide variety of data structures and objects, e.g. Python lists, sets and dictionaries; instances of Python classes; and combinations of these like lists of objects; objects containing dicts containing lists; etc.
So basically:
YAML represents simple data types & structures in a language-portable manner
pickle can represent complex structures, but in a non-language-portable manner
There's more to it than that, but you asked for the "basic" difference.
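To make the difference concrete, here is a minimal sketch that serializes the same dictionary both ways. It assumes the third-party PyYAML package is installed and imported as yaml; the dictionary itself is just an example.

import pickle
import yaml  # third-party PyYAML package (pip install pyyaml)

data = {"name": "example", "scores": [1, 2, 3], "nested": {"ok": True}}

# YAML: human-readable text, portable across languages
text = yaml.safe_dump(data)
print(text)                       # readable "key: value" lines
assert yaml.safe_load(text) == data

# pickle: Python-specific byte stream, not meant for human eyes
blob = pickle.dumps(data)
print(blob[:20])                  # opaque bytes
assert pickle.loads(blob) == data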

pickle is a special Python serialization format in which a Python object is converted into a byte stream and back:
“Pickling” is the process whereby a Python object hierarchy is
converted into a byte stream, and “unpickling” is the inverse
operation, whereby a byte stream is converted back into an object
hierarchy.
The main point is that it is Python-specific.
On the other hand, YAML is a language-agnostic, human-readable serialization format.
FYI, if you are choosing between these formats, think about:
serialization/deserialization speed (see the cPickle module)
do you need to store serialized files in a human-readable form?
what are you going to serialize? If it's a Python-specific complex data structure, for example, then you should go with pickle.
See also:
Python serialization - Why pickle?
Lightweight pickle for basic types in python?

If you don't need the files to be human-readable and you just need to save data and read it back later, use pickle. It is much faster and the binary files are smaller.
YAML files are more readable, as mentioned above, but they are also slower to process and larger in size.
I have tested this for my application. I measured the time to write and read an object to/from a file, as well as the file size.
Serialization/deserialization method    Average time, s    Size of file, kB
PyYAML                                  1.73               1149.358
pickle                                  0.004              690.658
As you can see, the YAML file is 1.67 times larger, and YAML serialization is 432.5 times slower.
P.S. This is for my data; in your case it may be different, but it's enough for a comparison.
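If you want to run a similar comparison on your own data, a rough sketch could look like the following; obj is a stand-in for your object, and the file names are arbitrary.

import os
import pickle
import time

import yaml  # third-party PyYAML package

obj = {"row%d" % i: list(range(20)) for i in range(5000)}  # placeholder data

t0 = time.perf_counter()
with open("data.yaml", "w") as f:
    yaml.safe_dump(obj, f)
yaml_time = time.perf_counter() - t0

t0 = time.perf_counter()
with open("data.pkl", "wb") as f:
    pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)
pickle_time = time.perf_counter() - t0

print("YAML:   %.3f s, %.1f kB" % (yaml_time, os.path.getsize("data.yaml") / 1024))
print("pickle: %.3f s, %.1f kB" % (pickle_time, os.path.getsize("data.pkl") / 1024))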

Related

System/Convention for code-independent serialization

I'm training neural networks. I want to save them in a code-independent way so they can be loaded by someone using different software.
Just pickling my objects is no good, because the pickle breaks if it's loaded in an environment where the code has changed or moved (which it always does).
So I've been converting my objects into dicts of primitive types and pickling those. I maintain a module that can convert these dicts back into objects (the type of object is defined by a "class" key of the dict). My current solution feels messy.
So I was wondering if there's some package or design pattern that's made to handle this kind of "code-independent serialization"
If you are using NumPy/SciPy for your project, you could save your weight matrices in MATLAB format with scipy.io.savemat.
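As a minimal sketch of that suggestion (the weight names W1/b1 are made up for illustration):

import numpy as np
from scipy.io import savemat, loadmat

weights = {"W1": np.random.randn(64, 32), "b1": np.random.randn(32)}
savemat("model_weights.mat", weights)    # readable from MATLAB, Octave, SciPy, ...

restored = loadmat("model_weights.mat")  # back into NumPy arrays
print(restored["W1"].shape)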

storing matrices in golang in compressed binary format

I am exploring a comparison between Go and Python, particularly for mathematical computation. I noticed that Go has a matrix package mat64.
1) I wanted to ask someone who uses both Go and Python: are there comparable functions/tools for Go's matrices that are equivalent to NumPy's savez_compressed, which stores data in the npz format (i.e. "compressed" binary, multiple matrices per file)?
2) Also, can Go's matrices handle string types like Numpy does?
1) .npz is a NumPy-specific format. It is unlikely that Go itself would ever support this format in the standard library. I also don't know of any third-party library that exists today, and a (10-second) search didn't turn one up. If you need npz specifically, go with Python + NumPy.
If you just want something similar from Go, you can use any format. Binary options include Go's encoding/binary and encoding/gob packages. Depending on what you're trying to do, you could even use a non-binary format like JSON and compress it yourself.
2) Go doesn't have built-in matrices. That library you found is third party and it only handles float64s.
However, if you just need to store strings in matrix (n-dimensional) format, you would use a n-dimensional slice. For 2-dimensional it looks like this: var myStringMatrix [][]string.
npz files are zip archives. Archiving and (optional) compression are handled by Python's zipfile module. The npz contains one .npy file for each variable that you save. Any OS-level archiving tool can decompress and extract the component .npy files.
So the remaining question is - can you simulate the npy format? It isn't trivial, but it isn't difficult either. It consists of a header block that contains shape, strides, dtype, and order information, followed by a data block, which is, effectively, a byte image of the data buffer of the array.
So the buffer information and data are closely tied to the NumPy array's contents. And if the variable isn't a normal array, np.save falls back to the Python pickle mechanism.
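You can check this from Python itself. The sketch below (file and array names are placeholders) saves two arrays with savez_compressed and then lists the .npy members of the resulting zip archive:

import zipfile
import numpy as np

a = np.arange(12).reshape(3, 4)
b = np.linspace(0.0, 1.0, 5)
np.savez_compressed("arrays.npz", a=a, b=b)

# the .npz really is a zip archive containing one .npy file per array
with zipfile.ZipFile("arrays.npz") as zf:
    print(zf.namelist())          # e.g. ['a.npy', 'b.npy']

loaded = np.load("arrays.npz")
print(loaded["a"])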
For a start I'd suggest using the csv format. It's not binary, and not fast, but everyone and his brother can generate and read it. We constantly get SO questions about reading such files using np.loadtxt or np.genfromtxt. Look at the code for np.savetxt to see how numpy produces such files. It's pretty simple.
Another general-purpose choice would be JSON, using the tolist() method of an array. That comes to mind because Go is Google's home-grown alternative to Python for web applications. JSON is a cross-language format based on simplified JavaScript syntax.
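As a sketch of both suggestions from the Python side (a Go program would then read the CSV or JSON with its standard library; the file names are arbitrary):

import json
import numpy as np

m = np.array([[1.5, 2.0], [3.25, 4.75]])

# plain-text CSV: np.savetxt writes it, almost anything can read it
np.savetxt("matrix.csv", m, delimiter=",")

# JSON via tolist(): cross-language and keeps the nesting
with open("matrix.json", "w") as f:
    json.dump(m.tolist(), f)

with open("matrix.json") as f:
    m2 = np.array(json.load(f))
print(np.array_equal(m, m2))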

how to rapidly load data into memory with python?

I have a large csv file (5 GB) and I can read it with pandas.read_csv(), but this operation takes a lot of time, 10-20 minutes.
How can I speed it up?
Would it be useful to transform the data into an SQLite format? If so, what should I do?
EDIT: More information:
The data contains 1852 columns and 350000 rows. Most of the columns are float64 and contain numbers. Some others contain strings or dates (which I suppose are treated as strings).
I am using a laptop with 16 GB of RAM and an SSD. The data should fit in memory (but I know that Python tends to inflate data in memory).
EDIT 2 :
During the loading I receive this message
/usr/local/lib/python3.4/dist-packages/pandas/io/parsers.py:1164: DtypeWarning: Columns (1841,1842,1844) have mixed types. Specify dtype option on import or set low_memory=False.
data = self._reader.read(nrows)
EDIT: SOLUTION
Read the csv file once and save it with
data.to_hdf('data.h5', 'table')
This format is incredibly efficient
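Spelled out, the caching pattern from that edit looks roughly like this (dtype handling for the mixed-type columns is left out for brevity):

import pandas as pd

# one-time conversion: slow CSV parse, fast HDF5 store
data = pd.read_csv("data.csv")
data.to_hdf("data.h5", "table")

# every later run: load the preprocessed binary form instead
data = pd.read_hdf("data.h5", "table")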
This actually depends on which part of reading it is taking 10 minutes.
If it's actually reading from disk, then obviously any more compact form of the data will be better.
If it's processing the CSV format (you can tell because your CPU is near 100% on one core while reading, and very low on the other cores), then you want a form that's already been preprocessed.
If it's swapping memory, e.g., because you only have 2GB of physical RAM, then nothing is going to help except splitting the data.
It's important to know which one you have. For example, stream-compressing the data (e.g., with gzip) will make the first problem a lot better, but the second one even worse.
It sounds like you probably have the second problem, which is good to know. (However, there are things you can do that will probably be better no matter what the problem.)
Your idea of storing it in a sqlite database is nice because it can at least potentially solve all three at once; you only read the data in from disk as-needed, and it's stored in a reasonably compact and easy-to-process form. But it's not the best possible solution for the first two, just a "pretty good" one.
In particular, if you actually do need to do array-wide work across all 350000 rows, and can't translate that work into SQL queries, you're not going to get much benefit out of sqlite. Ultimately, you're going to be doing a giant SELECT to pull in all the data and then process it all into one big frame.
Another option is to write out the shape and structure information yourself, then write the underlying arrays in NumPy binary form; for reading, you reverse that. NumPy's binary form just stores the raw data as compactly as possible, and it's a format that can be written blindingly quickly (it's basically just dumping the raw in-memory storage to disk). That will improve both the first and second problems.
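A minimal sketch of that idea for a purely numeric DataFrame (it assumes a single float dtype; the file names are placeholders):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 5), columns=list("abcde"))

# write the structure info and the raw array separately
np.save("values.npy", df.values)            # raw binary block, written very fast
with open("columns.txt", "w") as f:
    f.write(",".join(df.columns))

# reading reverses the two steps
with open("columns.txt") as f:
    cols = f.read().split(",")
df2 = pd.DataFrame(np.load("values.npy"), columns=cols)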
Similarly, storing the data in HDF5 (either using Pandas IO or an external library like PyTables or h5py) will improve both the first and second problems. HDF5 is designed to be a reasonably compact and simple format for storing the same kind of data you usually store in Pandas. (And it includes optional compression as a built-in feature, so if you know which of the two you have, you can tune it.) It won't solve the second problem quite as well as the last option, but probably well enough, and it's much simpler (once you get past setting up your HDF5 libraries).
Finally, pickling the data may sometimes be faster. pickle is Python's native serialization format, and it's hookable by third-party modules—and NumPy and Pandas have both hooked it to do a reasonably good job of pickling their data.
(Although this doesn't apply to the question, it may help someone searching later: If you're using Python 2.x, make sure to explicitly use pickle format 2; IIRC, NumPy is very bad at the default pickle format 0. In Python 3.0+, this isn't relevant, because the default format is at least 3.)
Python has two built-in libraries called pickle and cPickle that can store any Python data structure.
cPickle is nearly identical to pickle, except that cPickle has trouble with some Unicode stuff and is up to 1000x faster.
Both are really convenient for saving stuff that's going to be re-loaded into Python in some form, since you don't have to worry about some kind of error popping up in your file I/O.
Having worked with a number of XML files, I've found some performance gains from loading pickles instead of raw XML. I'm not entirely sure how the performance compares with CSVs, but it's worth a shot, especially if you don't have to worry about Unicode stuff and can use cPickle. It's also simple, so if it's not a good enough boost, you can move on to other methods with minimal time lost.
A simple example of usage:
>>> import pickle
>>> stuff = ["Here's", "a", "list", "of", "tokens"]
>>> fstream = open("test.pkl", "wb")
>>> pickle.dump(stuff,fstream)
>>> fstream.close()
>>>
>>> fstream2 = open("test.pkl", "rb")
>>> old_stuff = pickle.load(fstream2)
>>> fstream2.close()
>>> old_stuff
["Here's", 'a', 'list', 'of', 'tokens']
>>>
Notice the "b" in the file stream openers. This is important--it preserves cross-platform compatibility of the pickles. I've failed to do this before and had it come back to haunt me.
For your stuff, I recommend writing a first script that parses the CSV and saves it as a pickle; when you do your analysis, the script associated with that loads the pickle like in the second block of code up there.
I've tried this with XML; I'm curious how much of a boost you will get with CSVs.
If the problem is processing overhead, then you can split the file into smaller pieces and handle them on different CPU cores or in different threads. Also, for some algorithms the Python time increases non-linearly, and splitting the work helps in those cases.
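One hedged way to do that with pandas is to read the CSV in chunks and hand each chunk to a worker process; process_chunk below is a placeholder for your own per-chunk computation.

import pandas as pd
from multiprocessing import Pool

def process_chunk(chunk):
    # placeholder: replace with your actual per-chunk work
    return chunk.select_dtypes("number").sum()

if __name__ == "__main__":
    chunks = pd.read_csv("data.csv", chunksize=50000)  # iterator of DataFrames
    with Pool() as pool:
        partial_results = pool.map(process_chunk, chunks)
    print(sum(partial_results))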

How can I make a large python data structure more efficient to unpickle?

I have a list of ~1.7 million "token" objects, along with a list of ~130,000 "structure" objects which reference the token objects and group them into, well, structures. It's an ~800MB memory footprint, on a good day.
I'm using __slots__ to keep my memory footprint down, so my __getstate__ returns a tuple of serializable values, which __setstate__ bungs back into place. I'm also not pickling all the instance data, just 5 items for tokens, 7-9 items for structures, all strings or integers.
Of course, I'm using cPickle, and HIGHEST_PROTOCOL, which happens to be 2 (python 2.6). The resulting pickle file is ~120MB.
On my development machine, it takes ~2 minutes to unpickle the pickle. I'd like to make this faster. What methods might be available to me, beyond faster hardware and what I'm already doing?
Pickle is not the best method for storing large amounts of similar data. It can be slow for large data sets, and more importantly, it is very fragile: changing around your source can easily break all existing datasets. (I would recommend reading what pickle at its heart actually is: a bunch of bytecode expressions. It will frighten you into considering other means of data storage/retrieval.)
You should look into using PyTables, which uses HDF5 (cross-platform and everything) to store arbitrarily large amounts of data. You don't even have to load everything off of a file into memory at once; you can access it piecewise. The structure you're describing sounds like it would fit very well into a "table" object, which has a set field structure (comprised of fixed-length strings, integers, small Numpy arrays, etc.) and can hold large amounts very efficiently. For storing metadata, I'd recommend using the ._v_attrs attribute of your tables.
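A hedged sketch of what that might look like with the current PyTables API; the field names are invented to mirror the token description in the question.

import tables

class Token(tables.IsDescription):
    # invented fields, standing in for the ~5 items stored per token
    text   = tables.StringCol(32)
    start  = tables.Int32Col()
    end    = tables.Int32Col()
    kind   = tables.StringCol(16)
    doc_id = tables.Int32Col()

h5 = tables.open_file("tokens.h5", mode="w")
table = h5.create_table("/", "tokens", Token, "token table")

row = table.row
for i in range(1000):                      # stand-in for the ~1.7M tokens
    row["text"] = "tok%d" % i
    row["start"] = i
    row["end"] = i + 1
    row["kind"] = "word"
    row["doc_id"] = i // 100
    row.append()
table.flush()

table._v_attrs.source = "example"          # metadata goes in ._v_attrs, as suggested above
h5.close()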

HDF5 : storing NumPy data

When I used NumPy I stored its data in the native *.npy format. It's very fast and gave me some benefits, like this one:
I could read *.npy from C code as simple binary data (I mean *.npy files are binary-compatible with C structures).
Now I'm dealing with HDF5 (PyTables at the moment). As I read in the tutorial, it uses a NumPy serializer to store NumPy data, so can I read these data from C as I would from simple *.npy files?
Is HDF5's NumPy data binary-compatible with C structures too?
UPD :
I have a Matlab client reading from HDF5, but I don't want to read HDF5 from C++ because reading binary data from *.npy is many times faster, so I really need a way to read HDF5 from C++ (binary compatibility).
So I'm currently using two ways of transferring data: *.npy (read from C++ as bytes, from Python natively) and HDF5 (accessed from Matlab).
If possible, I want to use only one way: HDF5. But to do this I have to find a way to make HDF5 binary-compatible with C++ structures. Please help: if there is some way to turn off compression in HDF5, or something else that makes HDF5 binary-compatible with C++ structures, tell me where I can read about it.
The proper way to read hdf5 files from C is to use the hdf5 API - see this tutorial. In principle it is possible to directly read the raw data from the hdf5 file as you would with the .npy file, assuming you have not used advanced storage options such as compression in your hdf5 file. However, this essentially defeats the whole point of using the hdf5 format, and I cannot think of any advantage to doing this instead of using the proper hdf5 API. Also note that the API has a simplified high-level version which should make reading from C relatively painless.
I feel your pain. I've been dealing extensively with massive amounts of data stored in HDF5 formatted files, and I've gleaned a few bits of information you may find useful.
If you are in "control" of the file creation (and writing the data - even if you use an API) you should be able to largely entirely circumvent the HDF5 libraries.
If the output datasets are not chunked, they will be written contiguously. As long as you aren't specifying any byte-order conversion in your datatype definitions (i.e. you are specifying that the data should be written in native float/double/integer format) you should be able to achieve "binary compatibility", as you put it.
To solve my problem I wrote an HDF5 file parser using the file specification http://www.hdfgroup.org/HDF5/doc/H5.format.html
With a fairly simple parser you should be able to identify the offset to (and size of) any dataset. At that point simply fseek and fread (in C, that is, perhaps there is a higher level approach you can take in C++).
If your datasets are chunked, then more parsing is necessary to traverse the b-trees used to organize the chunks.
The only other issue you should be aware of is handling (or eliminating) any system-dependent structure padding.
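For what it's worth, you can see the same idea from Python: h5py's low-level API exposes the byte offset of a contiguous dataset, and reading the raw bytes at that offset gives back the array. This is only a sketch and assumes an uncompressed, unchunked dataset.

import h5py
import numpy as np

# create a small uncompressed, contiguous dataset
data = np.arange(12, dtype=np.float64).reshape(3, 4)
with h5py.File("plain.h5", "w") as f:
    f.create_dataset("x", data=data)       # no chunks, no filters

# find the raw offset via the low-level API
with h5py.File("plain.h5", "r") as f:
    dset = f["x"]
    offset = dset.id.get_offset()          # byte offset of the contiguous data block
    shape, dtype = dset.shape, dset.dtype

# read it back by seeking to that offset, bypassing the HDF5 API (fseek/fread in C)
with open("plain.h5", "rb") as f:
    f.seek(offset)
    raw = np.frombuffer(f.read(int(np.prod(shape)) * dtype.itemsize), dtype=dtype)
print(np.array_equal(raw.reshape(shape), data))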
HDF5 takes care of binary compatibility of structures for you. You simply have to tell it what your structs consist of (dtype) and you'll have no problems saving/reading record arrays - this is because the type system is basically 1:1 between NumPy and HDF5. If you use h5py, I'm confident the IO should be fast enough, provided you use all native types and large batched reads/writes (the entire dataset, if allowable). After that it depends on chunking and what filters you use (shuffle and compression, for example) - it's also worth noting that those can sometimes speed things up by greatly reducing file size, so always look at benchmarks. Note that the type and filter choices are made on the end creating the HDF5 document.
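For instance, a small sketch with h5py and a NumPy record array (the field names are arbitrary):

import h5py
import numpy as np

# a NumPy structured dtype maps onto an HDF5 compound type
dt = np.dtype([("id", np.int32), ("value", np.float64), ("label", "S16")])
records = np.array([(1, 3.14, b"first"), (2, 2.71, b"second")], dtype=dt)

with h5py.File("records.h5", "w") as f:
    f.create_dataset("table", data=records)   # stored as an HDF5 compound dataset

with h5py.File("records.h5", "r") as f:
    back = f["table"][...]
print(back["label"], back["value"])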
If you're trying to parse HDF5 yourself, you're doing it wrong. Use the C++ and C APIs if you're working in C++/C. There are examples of so-called "compound types" on the HDF5 Group website.
