I'm using Python 3.4.2 and I was wondering whether there is any way to save images to HDF5 files without having to change the dataset attributes afterwards.
I mean, I want to use HDFView to look at these images later.
What I'm doing...
I'm using the h5py package to save each image as a numpy array in its own dataset, and then I need to change the dataset attributes, roughly as in the sketch below.
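For reference, a minimal sketch of what this looks like with h5py (the attribute names and values come from the HDF5 Image specification, and the file and dataset names are just placeholders):

import h5py
import numpy as np

# placeholder RGB image as a (height, width, channels) uint8 array
img = (np.random.rand(256, 256, 3) * 255).astype(np.uint8)

with h5py.File('images.h5', 'w') as f:
    dset = f.create_dataset('image_0', data=img)
    # attributes from the HDF5 Image specification, so viewers such as
    # HDFView render the dataset as an image; the values are stored as
    # fixed-length ASCII strings
    dset.attrs['CLASS'] = np.string_('IMAGE')
    dset.attrs['IMAGE_VERSION'] = np.string_('1.2')
    dset.attrs['IMAGE_SUBCLASS'] = np.string_('IMAGE_TRUECOLOR')
    dset.attrs['INTERLACE_MODE'] = np.string_('INTERLACE_PIXEL')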
The final result is already what I had in mind, but I'm not really happy with the extra step.
If there is another way, please share here...
I'm trying to create very large .npy files and I'm having a bit of difficulty. For instance, I need to create a (500, 1586, 2048, 3) matrix and save it to an npy file, and preferably put it in an npz_compressed file. It also needs to be memory efficient so it can be run on low-memory systems. I've tried a few methods, but none have seemed to work so far. I've written and re-written things so many times that I don't have code snippets for everything, but I'll describe the methods as best I can, with code snippets where I can. Also, apologies for the bad formatting.
Method 1: Create an ndarray with all my data in it, then use savez_compressed to export it.
This gets all my data into the array, but it's terrible for memory efficiency. I filled all 8 GB of RAM plus 5 GB of swap space. I got it to save my file, but it doesn't scale, as my matrix could get significantly larger.
Method 2: Use np.memmap('file_name_to_create', mode='w+', shape=(500, 1586, 2048, 3)) to create the large, initial npy file, then add my data.
This method worked for getting my data in, and it's pretty memory efficient. However, I can no longer use np.load to open the file (I get errors associated with pickle, regardless of whether allow_pickle is True or False), which means I can't put it into the compressed format. I'd be happy with this format if I could get it compressed, but I just can't figure it out. I'm trying to avoid using gzip if possible.
Method 3: Create a (1, 1, 1, 1) zeros array and save it with np.save. Then try opening it with np.memmap with the same shape as before.
This runs into the same issues as method 2: I can no longer use np.load to read it back in afterwards.
Method 4: Create five (100, ...) npy files with method 1 and save them with np.save. Then read two in using np.load(mmap_mode='r+') and merge them into one large npy file.
Creating the individual npy files wasn't bad on memory, maybe 1 GB to 1.5 GB. However, I couldn't figure out how to merge the npy files without actually loading an entire npy file into RAM. I read in other Stack Overflow answers that npy files aren't really designed for this at all; they mentioned it would be better to use an .h5 file for this kind of 'appending'.
Those are the main methods I've used. I'm looking for feedback on whether any of these methods would work, which one would work best for memory efficiency, and maybe some guidance on getting that method to work. I also wouldn't be opposed to moving to .h5 if that would be the best method; I just haven't tried it yet.
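For reference, a minimal sketch of the .h5 route mentioned above, assuming h5py (the file name, dtype, and the zeros standing in for real data are placeholders). Writing one slice at a time keeps only that slice in RAM, and h5py's built-in lzf compressor avoids gzip:

import h5py
import numpy as np

shape = (500, 1586, 2048, 3)

with h5py.File('big_data.h5', 'w') as f:
    # chunked, compressed dataset; each slice is written straight to disk
    dset = f.create_dataset('data', shape=shape, dtype=np.float32,
                            chunks=(1,) + shape[1:], compression='lzf')
    for i in range(shape[0]):
        frame = np.zeros(shape[1:], dtype=np.float32)  # placeholder for real data
        dset[i] = frame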
Try it on Google Colab, which lets you run it on a GPU.
I have several CSV files placed in a directory. What I want to do is create a flow from this directory where each file is taken, preprocessed (null value fill, outlier treatment, etc.), and then each data point is passed to a Keras model, and this process should repeat itself for every file placed in the directory. Any suggestions on how to create a data flow like the one Keras provides for image data? Also, this should happen in Python :)
Thanks in advance!
I don't think that Keras natively supplies such functionality.
You should make your own converter, using something like glob to go over each file, send it to your preprocessing functions, and finally save it in a format readily usable by Keras, such as a numpy array.
You might want to have a look here for an example of inputting multiple files (although in that case they are already numpy arrays, not CSV files) to use in the training of a model.
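As a rough sketch of such a converter (assuming pandas for the preprocessing, that all files share the same columns, and that the label sits in the last column; adjust the preprocessing to your data):

import glob
import numpy as np
import pandas as pd

def csv_batch_generator(directory, batch_size=32):
    # loop forever, as Keras expects from a generator
    while True:
        for path in glob.glob(directory + '/*.csv'):
            df = pd.read_csv(path)
            df = df.fillna(df.mean())             # example preprocessing: fill nulls
            data = df.values
            for start in range(0, len(data), batch_size):
                batch = data[start:start + batch_size]
                X, y = batch[:, :-1], batch[:, -1]  # assumes label in last column
                yield X, y

# usage with a compiled Keras model:
# model.fit_generator(csv_batch_generator('my_csv_dir'), steps_per_epoch=100, epochs=10)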
I have the following problem: I have many files of 3D volumes that I open to extract a bunch of numpy arrays.
I want to get those arrays randomly, i.e. in the worst case I open as many 3D volumes as the number of numpy arrays I want to get, if all those arrays are in separate files.
The IO here isn't great: I open a big file only to get a small numpy array from it.
Any idea how I can store all these arrays so that the IO is better?
I can't pre-read all the arrays and save them all in one file, because then that file would be too big to fit in RAM.
I looked up LMDB, but everything about it seems to be centered on Caffe.
Any idea how I can achieve this?
I iterated through my dataset, created an hdf5 file and stored the elements in it. It turns out that when the hdf5 file is opened, it doesn't load all the data into RAM; it loads the header instead.
The header is then used to fetch the data on request; that's how I solved my problem.
Reference:
http://www.machinelearninguru.com/deep_learning/data_preparation/hdf5/hdf5.html
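A minimal sketch of that layout with h5py (the shapes, names, and random data are placeholders):

import h5py
import numpy as np

# one-off preprocessing step: store every array as its own dataset
with h5py.File('arrays.h5', 'w') as f:
    for i in range(1000):
        arr = np.random.rand(64, 64, 64)   # placeholder for an extracted array
        f.create_dataset('arr_%d' % i, data=arr)

# later: opening the file only reads the metadata ("header");
# indexing a dataset pulls just that array from disk
with h5py.File('arrays.h5', 'r') as f:
    names = list(f.keys())
    pick = np.random.randint(len(names))
    small_array = f[names[pick]][...]      # only this array ends up in RAM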
One trivial solution is to pre-process your dataset and save multiple smaller crops of the original 3D volumes separately. This way you sacrifice some disk space for more efficient IO.
Note that you can make a trade-off with the crop size here: saving bigger crops than you need for input allows you to still do random crop augmentation on the fly. If you save overlapping crops in the pre-processing step, you can ensure that all possible random crops of the original dataset can still be produced (see the sketch below).
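A sketch of that pre-processing step (the volume, crop size, and stride are placeholders; a stride smaller than the crop size gives the overlapping crops mentioned above):

import numpy as np

volume = np.random.rand(128, 128, 128)    # placeholder 3D volume
crop, stride = 64, 32                     # stride smaller than crop: overlapping crops

for x in range(0, volume.shape[0] - crop + 1, stride):
    for y in range(0, volume.shape[1] - crop + 1, stride):
        for z in range(0, volume.shape[2] - crop + 1, stride):
            np.save('crop_%d_%d_%d.npy' % (x, y, z),
                    volume[x:x + crop, y:y + crop, z:z + crop])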
Alternatively, you may try using a custom data loader that retains the full volumes for a few batches. Be careful: this might create some correlation between batches. Since many machine learning algorithms rely on i.i.d. samples (e.g. stochastic gradient descent), correlated batches can easily cause serious trouble.
I am trying to save a series of images to a CSV file so they can be used as a training dataset for AlexNet in Keras. The shape is (15, 224, 224, 3).
So far I am having issues doing this. I have managed to put all the data into a numpy array, but now I cannot save it to a file.
Please help.
I don't think saving it to a CSV file is the correct way to do this; CSV is meant for 1D or 2D data with a table-like structure, so I'm going to offer another solution. You can use np.save to save any numpy array to a file:
np.save('file_name', your_array)
which can then be loaded using np.load:
loaded_array = np.load('file_name.npy')
Hope that this works for you.
You can also try using pickle to save the data. It is much more flexible and easier to handle compared to np.save.
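For example, something along these lines (the file name and zeros array are placeholders):

import pickle
import numpy as np

images = np.zeros((15, 224, 224, 3))     # stand-in for the image array

with open('images.pkl', 'wb') as f:
    pickle.dump(images, f)

with open('images.pkl', 'rb') as f:
    loaded = pickle.load(f)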
I have a small Python program which creates an hdf5 file using the h5py module. I want to write a Python extension module in C to work on the data from the hdf5 file. How could I do that?
More specifically, I can receive numpy arrays as PyArrayObject pointers and read them using PyArg_ParseTuple. This way, I can read elements from a numpy array while writing a Python extension module. How do I read hdf5 files so that I can access individual elements?
Update: thanks for the answers below. I need to read the hdf5 file from C and not from Python; I already know how to do it from Python. For example:
import h5py as t
import numpy as np
f = t.File('/tmp/tmp.h5', 'w')
#this file is 2+GB
ofmat=np.load('offsetmatrix.npy')
f['FileDataset']=ofmat
f.close()
Now I have an hdf5 file called '/tmp/tmp.h5'. What I need to do is read the individual array elements from the hdf5 file using C (and not Python) so that I can do something with those elements. This shows how to extend numpy arrays. How do I do the same for hdf5?
h5py gives you a direct interface for reading/writing and manipulating data stored in an hdf5 file. Have you looked at the docs?
http://docs.h5py.org/
I advise starting with these; they have pretty clear examples of how to do simple data access. If there are specific things you are trying to do that aren't covered by the methods in h5py, could you please give a more specific description of your desired usage?
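For instance, from the Python side, element access with h5py looks roughly like this (a sketch, assuming the file and a 2-D dataset as in the question's snippet):

import h5py

with h5py.File('/tmp/tmp.h5', 'r') as f:
    dset = f['FileDataset']     # dataset created in the question's snippet
    row = dset[0]               # read one row from disk
    value = dset[0, 0]          # read a single element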
If you don't actually need a particular HDF5 structure, and you just need speed and cross-platform compatibility, I'd recommend taking a look at PyTables. It has the built-in ability to read and write Numpy arrays.
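A minimal sketch with PyTables (the file path is a placeholder; it assumes the same 2-D offset matrix as in the question):

import numpy as np
import tables

arr = np.load('offsetmatrix.npy')

# write the array once
with tables.open_file('/tmp/offsets.h5', mode='w') as f:
    f.create_array(f.root, 'FileDataset', arr)

# read back individual elements without loading the whole array
with tables.open_file('/tmp/offsets.h5', mode='r') as f:
    value = f.root.FileDataset[0, 0]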