Is there a function similar to ncdisp to view .npy files? - python

Is there a function similar to ncdisp in MATLAB to view .npy files?
Alternatively, it would be helpful to have some command that would spit out header titles in a .npy file. I can see everything in the file, but it is absolutely enormous. There has to be a way to view what categories of data are in this file.

Looking at the code for np.lib.npyio.load we see that it calls np.lib.format.read_array. That in turn calls np.lib.format._read_array_header.
This can be studied and perhaps even used, but it isn't part of the public API.
But if you are as much of a MATLAB fan as you claim, you already know (?) that you can explore .m files to see the MATLAB code. The same goes for python/numpy: read the files/functions until you hit compiled 'builtin' functions.
Since an .npy file contains only one array, the header isn't that interesting by itself - just the array dtype and shape (and total size). This isn't like a MATLAB save file with lots of variables; scipy.io.loadmat can read those.
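If you just want to peek at the header without loading the data, here is a minimal sketch using the semi-private helpers in np.lib.format (it assumes a version 1.0 header - a 2.0 header would need read_array_header_2_0 - and the file name is a placeholder):

import numpy as np

# Read only the .npy header: magic string and format version first,
# then the header fields: shape, Fortran-order flag, and dtype.
with open('some_file.npy', 'rb') as f:
    version = np.lib.format.read_magic(f)   # e.g. (1, 0)
    shape, fortran_order, dtype = np.lib.format.read_array_header_1_0(f)

print(version, shape, fortran_order, dtype)

For a public-API route, np.load('some_file.npy', mmap_mode='r') also exposes .shape and .dtype without pulling the data into memory.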
But looking up ncdisp I see that's part of its NetCDF reader. That's a whole different kind of file.

Related

Creating large .npy files with memory efficiency

I'm trying to create very large .npy files and I'm having a bit of difficulty. For instance, I need to create a (500, 1586, 2048, 3) matrix and save it to an .npy file, and preferably get it into an npz_compressed file. I also need this to be memory efficient so it can be run on low-memory systems. I've tried a few methods, but none have seemed to work so far. I've written and re-written things so many times that I don't have code snippets for everything, but I'll describe the methods as best I can, with code snippets where I can. Also, apologies for bad formatting.
Create an ndarray with all my data in it, then use savez_compressed to export it.
This gets all my data into the array, but it's terrible for memory efficiency. I filled all 8gb of RAM, plus 5gb of Swap space. I got it to save my file, but it doesn't scale, as my matrix could get significantly larger.
Use " np.memmap('file_name_to_create', mode='w+', shape=(500,1586,2048,3)) " to create the large, initial npy file, then add my data.
This method worked for getting my data in, and it's pretty memory efficient. However, I can no longer use np.load to open the file (I get errors associated with pickle, regardless of whether allow_pickle is True or False), which means I can't put it into the compressed format. I'd be happy with this format if I could get it compressed, but I just can't figure it out. I'm trying to avoid using gzip if possible.
Create a (1,1,1,1) zeros array and save it with np.save. Then try opening it with np.memmap with the same size as before.
This runs into the same issues as method 2: I can no longer use np.load to read it in afterwards.
Create 5 [100,...] npy files with method 1 and save them with np.save. Then read two in using np.load(mmap_mode='r+') and merge them into one large npy file.
Creating the individual npy files wasn't bad on memory, maybe 1gb to 1.5gb. However, I couldn't figure out how to then merge the npy files without actually loading an entire npy file into RAM. I read in other Stack Overflow answers that npy files aren't really designed for this at all; they mentioned it would be better to use an .h5 file for this kind of 'appending'.
Those are the main methods that I've used. I'm looking for feedback on if any of these methods would work, which one would work 'best' for memory efficiency, and maybe some guidance on getting that method to work. I also wouldn't be opposed to moving to .h5 if that would be the best method, I just haven't tried it yet.
Try it on Google Colab, which uses a GPU to run it.
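On method 2 specifically, here is a minimal sketch, assuming the data can be produced one chunk at a time (produce_frame and the uint8 dtype are made up for illustration): np.lib.format.open_memmap writes a proper .npy header before memory-mapping the data region, so the finished file opens with np.load without pickle errors.

import numpy as np

shape = (500, 1586, 2048, 3)

# Writes a real .npy header, then memory-maps the data block;
# only the chunk currently being written needs to be in RAM.
out = np.lib.format.open_memmap('big.npy', mode='w+', dtype=np.uint8, shape=shape)

for i in range(shape[0]):
    out[i] = produce_frame(i)   # hypothetical producer of one (1586, 2048, 3) chunk

out.flush()
del out

# Later, on a low-memory machine:
arr = np.load('big.npy', mmap_mode='r')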

When is it better to use npz files instead of csv?

I'm looking at some machine learning/forecasting code using Keras, and the input data sets are stored in npz files instead of the usual csv format.
Why would the authors go with this format instead of csv? What advantages does it have?
It depends on the expected usage. If a file is expected to have broad use cases, including direct access from ordinary client machines, then csv is fine because it can be loaded directly in Excel or LibreOffice Calc, which are widely deployed. But it is just a good old text file with no indexes nor any additional features.
On the other hand, if a file is only expected to be used by data scientists or, generally speaking, numpy-aware users, then npz is a much better choice because of the additional features (compression, lazy loading, etc.).
Long story short, you trade a larger audience for richer features.
From https://kite.com/python/docs/numpy.lib.npyio.NpzFile
A dictionary-like object with lazy-loading of files in the zipped archive provided on construction.
So, it is a zipped archive (smaller than CSV on disk, and more than one file can be stored), and files are loaded from disk only when needed (with CSV, when you only need one column, you still have to read and parse the whole file).
=> advantages are: performance and more features
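A quick sketch of the round trip (the array names x_train and y_train are just examples):

import numpy as np

x_train = np.random.rand(1000, 20)
y_train = np.random.randint(0, 2, size=1000)

# Several arrays, one compressed file.
np.savez_compressed('data.npz', x_train=x_train, y_train=y_train)

data = np.load('data.npz')   # NpzFile: nothing is decompressed yet
print(data.files)            # ['x_train', 'y_train']
y = data['y_train']          # only this member is read and decompressed
data.close()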

Can't identify slice ID / series ID using VTK (vtkDICOMImageReader)

I have two flavors of DICOM data: the first works with my existing code (what I built it on), but I cannot import the second.
The first style has a bottom-level folder with all the slices from one scan in that folder (as ordered ".dcm" files). I simply point VTK to the directory using this code:
// Read every .dcm slice in the directory into a single volume.
vtkSmartPointer<vtkDICOMImageReader> reader = vtkSmartPointer<vtkDICOMImageReader>::New();
reader->SetDirectoryName(dicomDirectory.c_str());
reader->Update();
// The assembled volume; sample one voxel's intensity.
vtkSmartPointer<vtkImageData> sliceData = reader->GetOutput();
double tempIntensity = sliceData->GetScalarComponentAsDouble(x, y, z, 0);
This is not the direct source (I check dimensions and set up iterating through it and such), but in short, it works... I have pulled in several different DICOM volumes through this method (and have inspected and manipulated the resulting volume clouds).
This depends on VTK interpreting the directory, though. It is stated here that there are some particulars about what VTK is capable of managing (under Detailed Description, Warning) in terms of DICOM data. (I am not sure whether my current data violates this spec.)
The second style of DICOM has a directory structure where the bottom level of folders is named as A-Z and each one contains 25 files (with no suffix) which are named (in each folder) Z01-Z25.
I can open the files individually using:
reader->SetFileName(tempFile.c_str());
Instead of specifying the directory. If I read all of the 25 in one of the bottom folders, it is a mix of different ordered chunks from different scans. I was prepared to set up a function to skim all folders and files in the directory to find and assemble all slices associated with one scan, but so far I have been unable to find/appropriately implement a function within vtkDICOMImageReader to:
A: detect which unique series set of slices I am in (series label)
nor
B: detect my current slice number in series as well as the total series count (slice count/ series slice total)
I can post more source as necessary, but basically I have tried monitoring all parameters in "reader" and "sliceData" while loading slices from different series, and as of yet have not gotten anything that provides me with the above data. I am assuming that either I am not appropriately updating between slice loads and/or am not looking at the right object parameters.
Any information on what I am doing wrong in terms of code or even my poor understanding of DICOM structure would be greatly appreciated!
ps: I am working in c++, but am fairly certain the usage is similar in Python
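For the note above about Python, here is a minimal sketch of the first (directory-based) approach with the stock VTK Python bindings; the directory path is a placeholder:

import vtk

# Standard VTK reader: reads all slices in one directory into a single volume.
reader = vtk.vtkDICOMImageReader()
reader.SetDirectoryName('/path/to/dicom_dir')
reader.Update()

volume = reader.GetOutput()                      # vtkImageData
print(volume.GetDimensions())                    # (nx, ny, nz)
intensity = volume.GetScalarComponentAsDouble(0, 0, 0, 0)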
Unfortunately DICOM is horribly complex and things get implemented slightly differently depending on which company's scanner the data is from and how old the system is. In your first example it sounds like you have a simply formatted directory with individual slice files and no extra scout images, so VTK is able to read in and render the slices and it looks fine.
Your second example sounds like a more complex structure that may contain multiple series and possibly things like scout images or even non-image DICOM files. For dealing with this type of data you'll need some logic to read the metadata and figure out which files you're interested in and how to assemble them. The metadata for the entire set is contained in a single file named "DICOMDIR", which should be in the top-level directory. This data is redundant with the data in the .dcm file headers, but reading from this one file saves you the trouble of scanning the header of every file individually.
VTK is an image manipulation/display library, not a DICOM system. I'm not sure it has good support for complex directory structures. You could try reader->SetFileName("DICOMDIR") and see if there is logic to handle this automatically, but I'd be a bit surprised if that worked.
If you're going to be working with complex multi-series data like this, you'll probably need to use another library to extract the info you want. I highly recommend DCMTK. It's a great open-source C++ library for working with DICOM, just don't expect it to be super simple.
You should not assume the content of a DICOM file by its name or position in the directory structure.
Instead, the root of the DICOM folder should contain a DICOMDIR file, which contains the list of files and their relationship (e.g., patient objects contain studies objects which contain series and then images).
I don't know if VTK offers a way of reading and interpreting DICOMDIR files; if not then you could try to interpret them with dcmtk or Imebra.
Disclosure: I'm the author of Imebra
As I have posted in the comments, in this case vtk-dicom is more suitable for your needs. Here are some useful links:
A tutorial on how to use vtk-dicom API to scan directories, organize your data according to series and studies: https://dgobbi.github.io/vtk-dicom/doc/api/directory.html
The vtk-dicom documentation, with installation instructions on the Appendix: http://dgobbi.github.io/vtk-dicom/doc/vtk-dicom.pdf

partial load of matlab (.mat) files -v7 in python

I have a large set of MATLAB data files which I need to access in Python.
The files were saved using save with -v6 or -v7 option, but not the -v7.3.
I have to read only a single numerical value from each file; the files are many (100k+) and relatively large (1MB+).
Therefore, I spend 99% of the time on I/O operations that read data I don't need.
I am looking for something like partial load, which is feasible for -v7.3 files using HDF5 library.
So far, I have been using the scipy.io.loadmat API.
Documentation says:
v4 (Level 1.0), v6 and v7 to 7.2 matfiles are supported.
You will need an HDF5 python library to read matlab 7.3 format mat files.
Because scipy does not supply one, we do not implement the HDF5 / 7.3 interface here.
https://docs.scipy.org/doc/scipy/reference/generated/scipy.io.loadmat.html
But it looks like it does not allow partial load.
Does anyone have experience with implementing such a feature, or does anyone know how to parse these .mat files at a lower level?
I guess an fseek-like approach could be possible when the structure is known.
Use the variable_names parameter if you want to read a single variable:
d = loadmat(filename, variable_names=['variable_name'])
then access it as follows:
d['variable_name']
UPDATE: if you need just the first element of an array/matrix you can do this:
val = loadmat(filename, variable_names=['var_name']).get('var_name')[0, 0]
NOTE: it will still read the whole variable into memory, but it will be discarded after the first element is assigned to val.
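Putting the pieces together, a self-contained sketch for reading one value from each of many files (the file pattern and the variable name 'result' are made up for illustration):

import glob
from scipy.io import loadmat

values = []
for path in glob.glob('data/*.mat'):
    # Variables not listed in variable_names are skipped rather than decoded,
    # which is where most of the parsing time usually goes.
    d = loadmat(path, variable_names=['result'])
    values.append(d['result'][0, 0])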

storing matrices in golang in compressed binary format

I am exploring a comparison between Go and Python, particularly for mathematical computation. I noticed that Go has a matrix package mat64.
1) I wanted to ask someone who uses both Go and Python: is there a tool for Go's matrices comparable to Numpy's savez_compressed, which stores data in the npz format (i.e. "compressed" binary, multiple matrices per file)?
2) Also, can Go's matrices handle string types like Numpy does?
1) .npz is a numpy-specific format. It is unlikely that Go itself would ever support this format in the standard library. I also don't know of any third-party library that exists today, and a (10 second) search didn't turn one up. If you need npz specifically, go with python + numpy.
If you just want something similar from Go, you can use any format. Binary options include Go's encoding/binary and encoding/gob packages. Depending on what you're trying to do, you could even use a non-binary format like JSON and just compress it on your own.
2) Go doesn't have built-in matrices. The library you found is third-party and it only handles float64s.
However, if you just need to store strings in matrix (n-dimensional) format, you would use an n-dimensional slice. For two dimensions it looks like this: var myStringMatrix [][]string.
npz files are zip archives. Archiving and (optional) compression are handled by the Python zipfile module. The npz contains one npy file for each variable that you save. Any OS-level archiving tool can decompress and extract the component .npy files.
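You can see this for yourself with a small sketch (the file and variable names are illustrative):

import zipfile
import numpy as np

np.savez_compressed('demo.npz', weights=np.ones((2, 3)), bias=np.zeros(3))

# An .npz is just a zip archive holding one .npy member per saved variable.
with zipfile.ZipFile('demo.npz') as zf:
    print(zf.namelist())        # ['weights.npy', 'bias.npy']
    zf.extract('weights.npy')   # a standalone .npy file that np.load can read

print(np.load('weights.npy'))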
So the remaining question is: can you simulate the npy format? It isn't trivial, but it isn't difficult either. It consists of a header block that contains the shape, dtype, and memory-order information, followed by a data block, which is, effectively, a byte image of the array's data buffer.
So the header and the data block map directly onto the numpy array's contents. And if the variable isn't a normal array, save falls back on the Python pickle mechanism.
For a start I'd suggest using the csv format. It's not binary, and not fast, but everyone and his brother can generate and read it. We constantly get SO questions about reading such files using np.loadtxt or np.genfromtxt. Look at the code for np.savetxt to see how numpy produces such files. It's pretty simple.
Another general-purpose choice would be JSON, using an array's tolist output. That comes to mind because Go is Google's home-grown alternative to Python for web applications, and JSON is a cross-language format based on simplified JavaScript syntax.
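A tiny sketch of that JSON route on the Python side (the Go side would use encoding/json; the file name is a placeholder):

import json
import numpy as np

a = np.arange(6.0).reshape(2, 3)

# Nested lists serialize to plain JSON, readable from any language.
with open('matrix.json', 'w') as f:
    json.dump(a.tolist(), f)

with open('matrix.json') as f:
    b = np.array(json.load(f))   # back to an ndarray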
