Suppose I save a numpy array to a file, "arr.npy", using numpy.save() and that I do this using a particular python version, numpy version, and OS.
Can I load, using numpy.load(), arr.npy on a different OS using a different version of python or numpy? Are there any restrictions, such as backwards compatibility?
Yes. The .npy format is documented here:
https://numpy.org/doc/stable/reference/generated/numpy.lib.format.html#module-numpy.lib.format
Note this passage from the format documentation:
The .npy format is the standard binary file format in NumPy for
persisting a single arbitrary NumPy array on disk. The format stores
all of the shape and dtype information necessary to reconstruct the
array correctly even on another machine with a different architecture.
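For instance, a minimal round trip; the file written here can be loaded on any OS and any Python/numpy combination that understands the same .npy format version (the filename is just for illustration):

import numpy as np

arr = np.arange(12, dtype=np.float64).reshape(3, 4)
np.save('arr.npy', arr)            # written on one machine

loaded = np.load('arr.npy')        # read back anywhere else
assert (loaded == arr).all()
assert loaded.dtype == arr.dtype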
Related
I understand that serialization of data means converting a data structure or object state to a form that can be stored in a file or buffer, transmitted, and reconstructed later (https://www.tutorialspoint.com/object_oriented_python/object_oriented_python_serialization.htm). By this definition, converting a numpy array to the .npy format should be serialization of the numpy array object. However, I could not find this assertion anywhere when I looked it up on the internet; most of the related links discuss how the pickle format serializes data in Python. My question is: is converting a numpy array to the .npy (or .npz) format an example of serialization of a Python data object? If not, what are the reasons?
Well, according to Wikipedia:
In computing, serialization (or serialisation) is the process of
translating data structures or object state into a format that can be
stored (for example, in a file or memory buffer) or transmitted (for
example, across a network connection link) and reconstructed later
(possibly in a different computer environment).
And according to Numpy Doc:
Binary serialization
NPY format
A simple format for saving numpy arrays to disk with the full
information about them.
The .npy format is the standard binary file format in NumPy for
persisting a single arbitrary NumPy array on disk. The format stores
all of the shape and dtype information necessary to reconstruct the
array correctly even on another machine with a different architecture.
The .npz format is the standard format for persisting multiple NumPy
arrays on disk. A .npz file is a zip file containing multiple .npy
files, one for each array.
So, putting these definitions together, you can come up with an answer to your question: yes, it is a form of serialization. The process of storing and reading is also fast.
np.save(filename, arr) writes the array to a file. Since a file is a linear structure, this is a form of serialization. But often 'serialization' refers to creating a string or byte stream that can be sent to a database or over some pipeline. You can save to an in-memory buffer, though it takes a bit of trickery.
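For example, a sketch using io.BytesIO (less trickery than it sounds, since np.save and np.load accept any seekable file-like object):

import io
import numpy as np

buf = io.BytesIO()
np.save(buf, np.arange(10))        # serialize into an in-memory buffer
raw = buf.getvalue()               # bytes you can send over a pipe or socket
restored = np.load(io.BytesIO(raw))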
But in Python most objects can be pickled, which produces a byte string that can be written to a file. In that sense pickle is a two-step process: serialize, then write to file. The pickle of a numpy array is actually a save-compatible form. (Conversely, np.save of a non-array object falls back on that object's pickle.)
savez writes a zip archive containing one npy file for each array; savez_compressed additionally compresses it. There are OS tools for transferring zip archives to other computers.
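A quick sketch of that multi-array round trip (filename and array names are just for illustration):

import numpy as np

a = np.arange(5)
b = np.eye(3)
np.savez_compressed('arrays.npz', a=a, b=b)   # zip with a.npy and b.npy inside

with np.load('arrays.npz') as npz:            # NpzFile acts like a dict
    a2, b2 = npz['a'], npz['b']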
I have two-dimensional numpy arrays in Python and I want to convert them into a binary format that can be read with C++. As you know, a two-dimensional array in C++ is essentially a one-dimensional array with pointers used to locate its elements. Could you tell me which function in Python can be used to do the job, or suggest any other solution?
This is too long for a comment, but probably not complete enough to run on its own. As Tom mentioned in the comments on your question, using a library that saves and loads a well-defined format (hdf5, .mat) in both Python and C++ is probably the easiest solution. If you don't want to find and set up such a library, read on.
Numpy has the ability to save data using numpy.save (see this).
The format (described here) consists of a header, with information on the datatype and the array shape, followed by the data. So, unless you want to write a fully featured parser (you don't), you should ensure Python consistently saves data as float64 (or whichever type you want), in C order (Fortran ordering is the other option).
Then, the C++ code just needs to check that the array data type is float64, the correct ordering is used, and how large the array is. Allocate the appropriate amount of memory, and you can load that number of bytes from the file directly into the allocated memory. To create 2d indexing you'll need to allocate an array of pointers to each 'row' in the allocated memory.
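On the Python side, a minimal sketch of that normalization step (save_for_cpp is just an illustrative name):

import numpy as np

def save_for_cpp(arr, filename):
    # force a single dtype and memory layout so the C++ reader
    # only ever has to handle float64 in C (row-major) order
    arr = np.ascontiguousarray(arr, dtype=np.float64)
    np.save(filename, arr)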
Or just use a library that will handle all of that for you.
In a Linux 64-bit environment, I have very big float64 arrays (a single one will be 500 GB to 1 TB). I would like to access these arrays in numpy in a uniform way: a[x:y]. In other words, I do not want to access the array as segments, file by file. Are there any tools I can use to create a memmap over many different files? Can hdf5 or pytables store a single CArray across many small files? Maybe something similar to fileinput? Or can I do something with the file system to simulate a single file?
In MATLAB I've been using H5P.set_external to do this. Then I can create a raw dataset and access it as one big raw file. But I do not know whether I can create a numpy.ndarray over such a dataset in Python. Or can I spread a single dataset over many small hdf5 files?
Unfortunately, H5P.set_chunk does not work with H5P.set_external, because set_external only works with contiguous datasets, not chunked ones.
Some related topics:
Chain datasets from multiple HDF5 files/datasets
I would use hdf5. In h5py, you can specify a chunk size which makes retrieving small pieces of the array efficient:
http://docs.h5py.org/en/latest/high/dataset.html?#chunked-storage
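A minimal sketch of chunked storage in h5py (the filename, dataset name, and sizes are made up; tune the chunk shape to your access pattern):

import h5py
import numpy as np

with h5py.File('big.h5', 'w') as f:
    # a read of a small slice only touches the chunks that slice overlaps
    dset = f.create_dataset('data', shape=(10_000_000,), dtype='f8',
                            chunks=(100_000,))
    dset[:100_000] = np.random.rand(100_000)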
You can use dask. dask arrays allow you to create an object that behaves like a single big numpy array but represents the data stored in many small HDF5 files. dask will take care of figuring out how any operations you carry out relate to the underlying on-disk data for you.
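A sketch of what that looks like (the filenames, dataset name, and chunk sizes are hypothetical):

import h5py
import dask.array as da

# stitch many small HDF5 files into one logical array
files = [h5py.File(f'part{i}.h5', 'r') for i in range(100)]
parts = [da.from_array(f['data'], chunks=(1_000_000,)) for f in files]
big = da.concatenate(parts)      # behaves like one huge numpy array

chunk = big[10**9 : 10**9 + 10].compute()   # dask reads only what it needs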
I am exploring a comparison between Go and Python, particularly for mathematical computation. I noticed that Go has a matrix package mat64.
1) I wanted to ask someone who uses both Go and Python whether there are comparable functions or tools, equivalent to Numpy's savez_compressed, which stores data in the npz format (i.e. compressed binary, multiple matrices per file), for Go's matrices?
2) Also, can Go's matrices handle string types like Numpy does?
1) .npz is a numpy-specific format. It is unlikely that Go itself would ever support this format in the standard library. I also don't know of any third-party library that exists today, and a (10-second) search didn't turn one up. If you need npz specifically, go with Python + numpy.
If you just want something similar from Go, you can use any format. Binary options include the standard library's encoding/binary and encoding/gob packages. Depending on what you're trying to do, you could even use a non-binary format like JSON and compress it on your own.
2) Go doesn't have built-in matrices. That library you found is third party and it only handles float64s.
However, if you just need to store strings in a matrix (n-dimensional) format, you would use an n-dimensional slice. For two dimensions it looks like this: var myStringMatrix [][]string.
npz files are zip archives. Archiving and (optional) compression are handled by Python's zipfile module. The npz contains one npy file for each variable that you save. Any OS-level archiving tool can decompress and extract the component .npy files.
So the remaining question is: can you simulate the npy format? It isn't trivial, but it isn't difficult either. It consists of a header block containing shape, dtype, and order information, followed by a data block, which is, effectively, a byte image of the array's data buffer.
So the header information and the data are closely tied to the numpy array's contents. And if the variable isn't a normal array, save falls back on the Python pickle mechanism.
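If you do go down that road, numpy's own header-reading helpers show exactly what a foreign parser would have to reproduce. A sketch in Python, assuming a file 'arr.npy' in format version 1.0:

import numpy as np

with open('arr.npy', 'rb') as f:
    major, minor = np.lib.format.read_magic(f)
    shape, fortran_order, dtype = np.lib.format.read_array_header_1_0(f)
    # everything after the header is a flat byte image of the data buffer
    data = np.fromfile(f, dtype=dtype)
    data = data.reshape(shape, order='F' if fortran_order else 'C')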
For a start I'd suggest using the csv format, as sketched below. It's not binary, and not fast, but everyone and his brother can generate and read it. We constantly get SO questions about reading such files with np.loadtxt or np.genfromtxt. Look at the code for np.savetxt to see how numpy produces such files; it's pretty simple.
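For example (the filename is arbitrary):

import numpy as np

arr = np.random.rand(4, 3)
np.savetxt('arr.csv', arr, delimiter=',')   # one text row per array row
back = np.loadtxt('arr.csv', delimiter=',')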
Another general-purpose choice would be JSON, using an array's tolist method. That comes to mind because Go is Google's home-grown alternative to Python for web applications. JSON is a cross-language format based on simplified JavaScript syntax.
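A sketch of that round trip on the Python side:

import json
import numpy as np

arr = np.arange(6).reshape(2, 3)
s = json.dumps(arr.tolist())       # '[[0, 1, 2], [3, 4, 5]]'
back = np.array(json.loads(s))     # Go's encoding/json can parse s as well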
I have a small python program which creates a hdf5 file using the h5py module. I want to write a python module to work on the data from the hdf5 file. How could I do that?
More specifically, I can pass the numpy arrays as PyArrayObject and read them using PyArg_ParseTuple. This way, I can read elements from the numpy array while writing a Python module. How do I read hdf5 files so that I can access individual elements?
Update: Thanks for the answers below. I need to read the hdf5 file from C, not from Python (I already know how to do it from Python). For example:
import h5py as t
import numpy as np

f = t.File('/tmp/tmp.h5', 'w')
# this file is 2+ GB
ofmat = np.load('offsetmatrix.npy')
f['FileDataset'] = ofmat
f.close()
Now I have an hdf5 file called '/tmp/tmp.h5'. What I need to do is read the individual array elements from the hdf5 file using C (and not Python) so that I can do something with those elements. This shows how to extend numpy arrays. How do I extend hdf5?
h5py gives you a direct interface for reading/writing and manipulating data stored in an hdf5 file. Have you looked at the docs?
http://docs.h5py.org/
I advise starting with these; they have pretty clear examples of simple data access. If there are specific things you are trying to do that aren't covered by the methods in h5py, could you please give a more specific description of your desired usage?
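For instance, reading back the file from your example looks like this in h5py (a sketch; element access goes through normal slicing):

import h5py

with h5py.File('/tmp/tmp.h5', 'r') as f:
    dset = f['FileDataset']        # dataset name from your example
    print(dset.shape, dset.dtype)
    row = dset[0]                  # h5py reads just this slice from disk
    block = dset[100:200]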
If you don't actually need a particular structure of HDF5, but you just need the speed and cross-platform compatibility, I'd recommend taking a look at PyTables. It has the built-in ability to read and write Numpy arrays.