partial load of matlab (.mat) files -v7 in python

I have a large set of MATLAB data files which I need to access in Python.
The files were saved using save with the -v6 or -v7 option, but not -v7.3.
I have to read only a single numerical value from each file; the files are many (100k+) and relatively large (1 MB+).
Therefore, I spend 99% of the time idling in I/O operations which are useless.
I am looking for something like partial load, which is feasible for -v7.3 files using the HDF5 library.
So far, I have been using the scipy.io.loadmat API.
Documentation says:
v4 (Level 1.0), v6 and v7 to 7.2 matfiles are supported.
You will need an HDF5 python library to read matlab 7.3 format mat files.
Because scipy does not supply one, we do not implement the HDF5 / 7.3 interface here.
https://docs.scipy.org/doc/scipy/reference/generated/scipy.io.loadmat.html
But it looks like it does not allow partial loading.
Does anyone have experience with implementing such a feature, or does anyone know how to parse these .mat files at a lower level?
I guess an fseek-like approach could be possible when the structure is known.

Use the variable_names parameter if you want to read a single variable:
d = loadmat(filename, variable_names=['variable_name'])
then access it as follows:
d['variable_name']
UPDATE: if you need just the first element of an array/matrix, you can do this:
val = loadmat(filename, variable_names=['var_name']).get('var_name')[0, 0]
NOTE: it will still read the whole variable into memory, but it will be discarded after the first element is assigned to val.
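For the many-files case in the question, a minimal sketch of this approach might look like the following; the variable name 'result' and the list mat_files are placeholders for your own names and paths:
from scipy.io import loadmat

def read_scalar(path, var='result'):
    # Only the requested variable is built into an array; loadmat still walks
    # the file's variable headers but skips constructing the others.
    d = loadmat(path, variable_names=[var])
    return d[var][0, 0]

values = [read_scalar(p, 'result') for p in mat_files]  # mat_files: your 100k+ paths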

Related

Is there a function similar to ncdisp to view .npy files?

Is there a function similar to ncdisp in MATLAB to view .npy files?
Alternatively, it would be helpful to have some command that would spit out header titles in a .npy file. I can see everything in the file, but it is absolutely enormous. There has to be a way to view what categories of data are in this file.
Looking at the code for np.lib.npyio.load, we see that it calls np.lib.format.read_array. That in turn calls np.lib.format._read_array_header.
This can be studied and perhaps even used, but it isn't in the public API.
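A small sketch using those (semi-private) helpers to print just the header of a .npy file without loading the array data; treat the exact function names as version-dependent numpy internals:
import numpy as np

def npy_header(path):
    with open(path, 'rb') as f:
        version = np.lib.format.read_magic(f)  # e.g. (1, 0)
        # _read_array_header is internal and may change between numpy versions
        shape, fortran_order, dtype = np.lib.format._read_array_header(f, version)
    return shape, fortran_order, dtype

print(npy_header('big_file.npy'))  # e.g. ((1000000, 50), False, dtype('float64'))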
But if you are such a MATLAB fan as you claim, you already know (?) that you can explore .m files to see the MATLAB code. Same with python/numpy. Read the files/functions until you hit compiled 'builtin' functions.
Since an npy file contains only one array, the header isn't that interesting by itself - just the array dtype and shape (and total size). This isn't like the MATLAB save file with lots of variables. scipy.io.loadmat can read those.
But looking up ncdisp I see that it's part of MATLAB's NetCDF reader. That's a whole different kind of file.

When is it better to use npz files instead of csv?

I'm looking at some machine learning/forecasting code using Keras, and the input data sets are stored in npz files instead of the usual csv format.
Why would the authors go with this format instead of csv? What advantages does it have?
It depends on the expected usage. If a file is expected to have broad use cases, including direct access from ordinary client machines, then csv is fine because it can be loaded directly into Excel or LibreOffice Calc, which are widely deployed. But it is just a good old text file with no indexes nor any additional features.
On the other hand, if a file is only expected to be used by data scientists or, generally speaking, numpy-aware users, then npz is a much better choice because of the additional features (compression, lazy loading, etc.).
Long story short, you trade a larger audience for richer features.
From https://kite.com/python/docs/numpy.lib.npyio.NpzFile
A dictionary-like object with lazy-loading of files in the zipped archive provided on construction.
So, it is a zipped archive (smaller size on disk than CSV, and more than one array can be stored) and members can be loaded from disk only when needed (with CSV, even if you only need one column, you still have to read and parse the whole file).
=> advantages are: performance and more features
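A quick illustration of that lazy loading (the array names here are arbitrary): only the member you index into is read and decompressed.
import numpy as np

np.savez_compressed('bundle.npz',
                    features=np.random.rand(1000, 20),
                    labels=np.random.randint(0, 2, size=1000))

archive = np.load('bundle.npz')  # returns an NpzFile; no arrays are loaded yet
print(archive.files)             # ['features', 'labels']
labels = archive['labels']       # only this member of the zip is read now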

Numpy memory mapping issue

I've been recently working with large matrices. My inputs are stored in form of 15GB .npz files, which I am trying to read incrementally, in small batches.
I am familiar with memory mapping, and having seen numpy also supports these kinds of operations seemed like a perfect solution. However, the problem I am facing is as follows:
I first load the matrix with:
foo = np.load('matrix.npz',mmap_mode="r+")
foo has a single key: data.
When I try to, for example do:
foo['data'][1][1]
numpy seems to endlessly consume the available RAM, almost as if there were no memory mapping. Am I doing anything wrong?
My goal would be, for example, to read 30 lines at a time:
for x in np.arange(0, matrix.shape[0], 30):
    batch = matrix[x:(x + 30), :]
    do_something_with(batch)
Thank you!
My guess would be that mmap_mode="r+" is ignored when the file in question is a zipped numpy file. I haven't used numpy in this way, so some of what follows is my best guess. The documentation for load states
If the file is a .npz file, then a dictionary-like object is returned, containing {filename: array} key-value pairs, one for each file in the archive.
No mention is made of what it does with mmap_mode. However, in the code for loading .npz files no use is made of the mmap_mode keyword:
if magic.startswith(_ZIP_PREFIX):
    # zip-file (assume .npz)
    # Transfer file ownership to NpzFile
    tmp = own_fid
    own_fid = False
    return NpzFile(fid, own_fid=tmp, allow_pickle=allow_pickle,
                   pickle_kwargs=pickle_kwargs)
So, your initial guess is indeed correct. Numpy uses all of the RAM because no memmapping is occurring. This is a limitation of the implementation of load; since the npz format is an uncompressed zip archive it should be possible to memmap the variables (unless of course your files were created with savez_compressed).
Implementing a load function that memmaps npz would be quite a bit of work though, so you may want to take a look at structured arrays. They provide similar usage (access of fields by keys) and are already compatible with memmapping.
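One workaround consistent with this answer, assuming the array fits in memory once for a one-time conversion: extract it to a plain .npy file, which np.load can memory-map, and then read it in row batches (file names are illustrative).
import numpy as np

# One-time conversion: .npz members cannot be memmapped, but a plain .npy can.
data = np.load('matrix.npz')['data']
np.save('matrix_data.npy', data)
del data

matrix = np.load('matrix_data.npy', mmap_mode='r')  # a true memory map
for x in range(0, matrix.shape[0], 30):
    batch = np.asarray(matrix[x:x + 30, :])  # copies only these 30 rows into RAM
    # do_something_with(batch)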

storing matrices in golang in compressed binary format

I am exploring a comparison between Go and Python, particularly for mathematical computation. I noticed that Go has a matrix package mat64.
1) I wanted to ask someone who uses both Go and Python if there are comparable functions / tools equivalent to Numpy's savez_compressed, which stores data in the npz format (i.e. "compressed" binary, multiple matrices per file), for Go's matrices?
2) Also, can Go's matrices handle string types like Numpy does?
1) .npz is a numpy-specific format. It is unlikely that Go itself would ever support this format in the standard library. I also don't know of any third-party library that exists today, and a (10-second) search didn't turn one up. If you need npz specifically, go with Python + numpy.
If you just want something similar from Go, you can use any format. Binary options include Go's encoding/binary and encoding/gob packages. Depending on what you're trying to do, you could even use a non-binary format like JSON and just compress it on your own.
2) Go doesn't have built-in matrices. That library you found is third party and it only handles float64s.
However, if you just need to store strings in matrix (n-dimensional) format, you would use an n-dimensional slice. For two dimensions it looks like this: var myStringMatrix [][]string.
npz files are zip archives. Archiving and (optional) compression are handled by the Python zipfile module. The npz contains one npy file for each variable that you save. Any OS-level archiving tool can decompress and extract the component .npy files.
So the remaining question is - can you simulate the npy format? It isn't trivial, but it isn't especially difficult either. It consists of a header block that contains shape, dtype, and order information, followed by a data block, which is, effectively, a byte image of the data buffer of the array.
So the buffer information, and data are closely linked to the numpy array content. And if the variable isn't a normal array, save uses the Python pickle mechanism.
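As a concrete (Python) illustration of that layout, here is a small sketch that writes an array and then inspects the raw bytes; the same byte structure is what a Go writer would have to reproduce (the file name is arbitrary, and the offsets assume the common version 1.0 header).
import numpy as np

a = np.arange(6, dtype=np.float64).reshape(2, 3)
np.save('demo.npy', a)

with open('demo.npy', 'rb') as f:
    raw = f.read()

print(raw[:6])                                    # b'\x93NUMPY' magic string
print(tuple(raw[6:8]))                            # format version, e.g. (1, 0)
header_len = int.from_bytes(raw[8:10], 'little')  # length of the ASCII header
print(raw[10:10 + header_len])                    # {'descr': '<f8', 'fortran_order': False, 'shape': (2, 3), ...}
data = np.frombuffer(raw[10 + header_len:], dtype='<f8').reshape(2, 3)
print(np.array_equal(data, a))                    # True: the data block is a plain byte image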
For a start I'd suggest using the csv format. It's not binary, and not fast, but everyone and his brother can generate and read it. We constantly get SO questions about reading such files using np.loadtxt or np.genfromtxt. Look at the code for np.savetxt to see how numpy produces such files. It's pretty simple.
Another general-purpose choice would be JSON, using the tolist form of an array. That comes to mind because Go is Google's home-grown alternative to Python for web applications. JSON is a cross-language format based on simplified Javascript syntax.

HDF5 : storing NumPy data

When I used NumPy I stored its data in the native *.npy format. It's very fast and gave me some benefits, like this one:
I could read *.npy from C code as simple binary data (I mean *.npy is binary-compatible with C structures).
Now I'm dealing with HDF5 (PyTables at this moment). As I read in the tutorial, it uses the NumPy serializer to store NumPy data, so can I read these data from C as from simple *.npy files?
Is HDF5's NumPy data binary-compatible with C structures too?
UPD:
I have a MATLAB client reading from hdf5, but I don't read hdf5 from C++ because reading binary data from *.npy is many times faster; still, I really need a way to read hdf5 from C++ (with binary compatibility).
So I'm currently using two ways of transferring data - *.npy (read from C++ as bytes, from Python natively) and hdf5 (accessed from MATLAB).
If possible, I want to use only one way - hdf5. But to do this I have to find a way to make hdf5 binary-compatible with C++ structures. Please help: if there is some way to turn off compression in hdf5, or something else that makes hdf5 binary-compatible with C++ structures, tell me where I can read about it.
The proper way to read hdf5 files from C is to use the hdf5 API - see this tutorial. In principle it is possible to directly read the raw data from the hdf5 file as you would with the .npy file, assuming you have not used advanced storage options such as compression in your hdf5 file. However, this essentially defeats the whole point of using the hdf5 format, and I cannot think of any advantage to doing this instead of using the proper hdf5 API. Also note that the API has a simplified high-level version which should make reading from C relatively painless.
I feel your pain. I've been dealing extensively with massive amounts of data stored in HDF5 formatted files, and I've gleaned a few bits of information you may find useful.
If you are in "control" of the file creation (and writing the data - even if you use an API) you should be able to largely, if not entirely, circumvent the HDF5 libraries.
If the output datasets are not chunked, they will be written contiguously. As long as you aren't specifying any byte-order conversion in your datatype definitions (i.e. you are specifying the data should be written in native float/double/integer format) you should be able to achieve "binary compatibility" as you put it.
To solve my problem I wrote an HDF5 file parser using the file specification http://www.hdfgroup.org/HDF5/doc/H5.format.html
With a fairly simple parser you should be able to identify the offset to (and size of) any dataset. At that point simply fseek and fread (in C, that is, perhaps there is a higher level approach you can take in C++).
If your datasets are chunked, then more parsing is necessary to traverse the b-trees used to organize the chunks.
The only other issue you should be aware of is handling (or eliminating) any system-dependent structure padding.
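A rough Python sketch of the offset-based idea above, using h5py's low-level API to locate a contiguous (non-chunked, uncompressed) dataset; the file and dataset names are placeholders, and this only bypasses the HDF5 library for the data block itself.
import numpy as np
import h5py

with h5py.File('data.h5', 'r') as f:
    dset = f['/mydata']
    offset = dset.id.get_offset()  # byte offset of the raw data; None if not stored contiguously
    dtype, shape = dset.dtype, dset.shape

if offset is not None:
    # The fseek/fread equivalent: read the raw bytes straight from the file.
    raw = np.fromfile('data.h5', dtype=dtype, count=int(np.prod(shape)), offset=offset)
    arr = raw.reshape(shape)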
HDF5 takes care of binary compatibility of structures for you. You simply have to tell it what your structs consist of (dtype) and you'll have no problems saving/reading record arrays - this is because the type system is basically 1:1 between numpy and HDF5. If you use h5py, I'm confident the I/O will be fast enough provided you use all native types and large batched reads/writes (the entire dataset, if feasible). After that it depends on chunking and which filters are applied (shuffle and compression, for example) - it's also worth noting that those can sometimes speed things up by greatly reducing file size, so always look at benchmarks. Note that the type and filter choices are made on the end creating the HDF5 document.
If you're trying to parse HDF5 yourself, you're doing it wrong. Use the C++ and C APIs if you're working in C++/C. There are examples of so-called "compound types" on the HDF5 Group's website.
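For completeness, a short h5py sketch of that numpy/HDF5 compound-type mapping; the file and dataset names are made up for illustration.
import numpy as np
import h5py

# A numpy structured dtype maps essentially 1:1 to an HDF5 compound type.
dt = np.dtype([('id', '<i4'), ('x', '<f8'), ('y', '<f8')])
records = np.zeros(100, dtype=dt)

with h5py.File('records.h5', 'w') as f:
    f.create_dataset('table', data=records)  # stored as an HDF5 compound dataset

with h5py.File('records.h5', 'r') as f:
    back = f['table'][:10]                   # reads back as a numpy structured array
    print(back.dtype)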
