How to save a Numpy 4D array to CSV? - python

I am trying to save a series of images to a CSV file so they can be used as a training dataset for an AlexNet Keras model. The shape is (15,224,224,3).
So far I am having no luck doing this. I have managed to put all the data into a numpy array, but now I cannot save it to a file.
Please help.

I don't think saving it to a CSV file is the correct way to do this; CSV is meant for 1D or 2D data with a table-like structure, so I'm going to offer another solution. You can use np.save to save any numpy array to a file:
np.save('file_name', your_array)
which can then be loaded using np.load:
loaded_array = np.load('file_name.npy')
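For the shape in the question, a minimal round trip of the two calls above looks like this (the file name is arbitrary):
import numpy as np

images = np.zeros((15, 224, 224, 3))  # stand-in for the real image stack
np.save('images.npy', images)         # writes images.npy
loaded = np.load('images.npy')
assert loaded.shape == (15, 224, 224, 3)  # shape survives the round trip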
Hope that this works for you.

You can try using pickle to save the data. It is much more flexible and easier to handle compared to np.save.
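A minimal sketch of the pickle approach, assuming the array fits in memory; the file name is arbitrary:
import pickle
import numpy as np

images = np.zeros((15, 224, 224, 3))  # stand-in for the real data
with open('images.pkl', 'wb') as f:
    pickle.dump(images, f)            # serialize the whole array
with open('images.pkl', 'rb') as f:
    loaded = pickle.load(f)           # deserialize it back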

Related

I am trying to find a way to convert a Numpy array to HDF5 format

I am trying to convert Numpy arrays that are 2D grids varying in time into HDF5 format for several cases. For example, the Numpy array has the following aspects: case number (0-100), time (0-200 years), x-grid point location (0-100 m), y-grid point location (0-20 m), plus the actual data point at each location (e.g. saturation ranging from 0-100%). I am finding it a bit difficult to store this efficiently in HDF5 format. It's supposed to be used later to train an RNN model.
I tried just assigning a Numpy array to an HDF5 dataset (I don't know if it worked, as I didn't retrieve it). I was also confused about the different types of storage options for such a case and the best way to store it so that it's easily retrievable to train a NN. I need to use the HDF5 format, as it seems to optimize the storage and retrieval of large data as in the current case. I was also trying to find the best way to learn the HDF5 format. Thank you!
import h5py
import numpy as np

# Create a numpy array
arr = np.random.rand(3, 3)

# Create an HDF5 file and write the numpy array to it
with h5py.File('mydata.h5', 'w') as f:
    f.create_dataset('mydata', data=arr)
You can also use the h5py library to append data to an existing HDF5 file instead of creating a new one.
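A minimal sketch of that, assuming the file from above already exists; opening in append mode ('a') keeps the existing datasets:
import h5py
import numpy as np

more = np.random.rand(4, 4)
with h5py.File('mydata.h5', 'a') as f:
    f.create_dataset('mydata_2', data=more)  # add a second dataset
    print(list(f.keys()))                    # ['mydata', 'mydata_2']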

error while saving large matrix using scipy.io.savemat

I want to save a large matrix of 20 GB in MATLAB (.mat) format using the scipy.io.savemat function. While saving, it gives me the following error:
scipy.io.savemat matrix too large to save with Matlab 5 format
My code is:
scipy.io.savemat('output.mat',mdict={'data':data})
I hope the experts can give some suggestions to overcome this problem. Thanks in advance.
Yes, I agree with @hpaulj. I think there is no way around saving it in chunks smaller than 20 GB... Is that at all possible for you? Maybe you can base your solution on the layout of your data structure, i.e. if it's a 3-dimensional matrix, save slices along the 3rd axis, as sketched below.
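A minimal sketch of that slicing idea, assuming `data` is 3-dimensional and each slice fits under the format limit; the file naming is hypothetical:
import numpy as np
import scipy.io

data = np.zeros((100, 100, 10))  # stand-in for the real 20 GB matrix

# Save one .mat file per slice along the 3rd axis
for i in range(data.shape[2]):
    scipy.io.savemat('output_%d.mat' % i, mdict={'data': data[:, :, i]})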

Pandas DataFrame to multidimensional array conversion

I'm trying out the Kaggle MNIST competition. The data is provided in a CSV, and I want to convert it to a 28x28 Numpy array. The best way I know to import a CSV data set is via pd.read_csv(), but using that, and then pd.DataFrame.values, gives me an array that is (42000,784). That is not a problem in itself, as I don't have to flatten it in TensorFlow, but my test data has a similar shape and the same issue. Is there a way to take the (42000,784) DataFrame and convert it to a (42000,28,28) array?
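For what it's worth, numpy's reshape covers exactly this conversion; a minimal sketch, assuming the usual Kaggle layout with the label in the first column:
import pandas as pd

df = pd.read_csv('train.csv')
pixels = df.drop(columns=['label']).values  # shape (42000, 784)
images = pixels.reshape(-1, 28, 28)         # shape (42000, 28, 28)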

Faster pytorch dataset file

I have the following problem, I have many files of 3D volumes that I open to extract a bunch of numpy arrays.
I want to read those arrays in random order, i.e. in the worst case I open as many 3D volumes as there are numpy arrays I want to get, if all those arrays are in separate files.
The IO here isn't great: I open a big file only to get a small numpy array from it.
Any idea how I can store all these arrays so that the IO is better?
I can't pre-read all the arrays and save them all in one file, because then that file would be too big to fit in RAM.
I looked up LMDB but it all seems to be about Caffe.
Any idea how I can achieve this?
I iterated through my dataset, created an HDF5 file, and stored the elements in it. It turns out that when an HDF5 file is opened, it doesn't load all the data into RAM; it loads the header instead.
The header is then used to fetch the data on request; that's how I solved my problem.
Reference:
http://www.machinelearninguru.com/deep_learning/data_preparation/hdf5/hdf5.html
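A minimal sketch of that pattern as a PyTorch dataset, assuming a single HDF5 file 'volumes.h5' holding a dataset named 'arrays' of shape (N, D, H, W); both names are hypothetical:
import h5py
import torch
from torch.utils.data import Dataset

class H5Dataset(Dataset):
    def __init__(self, path):
        self.path = path
        self.file = None  # open lazily so it plays well with DataLoader workers
        with h5py.File(path, 'r') as f:
            self.length = len(f['arrays'])

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        if self.file is None:
            self.file = h5py.File(self.path, 'r')
        # only the requested slice is read from disk here
        return torch.from_numpy(self.file['arrays'][idx])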
One trivial solution is to pre-process your dataset, saving multiple smaller crops of the original 3D volumes separately. This way you sacrifice some disk space for more efficient IO.
Note that you can make a trade-off with the crop size here: saving bigger crops than you need for input allows you to still do random crop augmentation on the fly. If you save overlapping crops in the pre-processing step, then you can ensure that all possible random crops of the original dataset can still be produced (see the sketch below).
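A minimal sketch of that pre-processing step, assuming cubic crops from a 3D volume; the crop and stride sizes are hypothetical, and stride < crop gives the overlap:
import numpy as np

def save_crops(volume, crop=64, stride=32, prefix='crop'):
    i = 0
    d, h, w = volume.shape
    for z in range(0, d - crop + 1, stride):
        for y in range(0, h - crop + 1, stride):
            for x in range(0, w - crop + 1, stride):
                # each overlapping crop is written as its own small file
                np.save('%s_%04d.npy' % (prefix, i),
                        volume[z:z+crop, y:y+crop, x:x+crop])
                i += 1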
Alternatively, you may try using a custom data loader that retains the full volumes for a few batches. Be careful: this might create some correlation between batches. Since many machine learning algorithms rely on i.i.d. samples (e.g. Stochastic Gradient Descent), correlated batches can easily cause some serious mess.

how can I save super large array into many small files?

In a Linux 64-bit environment, I have a very big float64 array (a single one will be 500 GB to 1 TB). I would like to access these arrays in numpy in a uniform way: a[x:y]. So I do not want to access the array as segments, file by file. Are there any tools with which I can create a memmap over many different files? Can hdf5 or pytables store a single CArray in many small files? Maybe something similar to fileInput? Or can I do something with the file system to simulate a single file?
In MATLAB I've been using H5P.set_external to do this. Then I can create a raw dataset and access it as a big raw file. But I do not know if I can create a numpy.ndarray over such a dataset in Python. Or can I spread a single dataset over many small hdf5 files?
And unfortunately H5P.set_chunk does not work with H5P.set_external, because set_external only works with a contiguous data layout, not a chunked one.
some related topics:
Chain datasets from multiple HDF5 files/datasets
I would use hdf5. In h5py, you can specify a chunk size which makes retrieving small pieces of the array efficient:
http://docs.h5py.org/en/latest/high/dataset.html?#chunked-storage
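A minimal sketch, assuming a large 1D float64 dataset; HDF5 allocates chunks lazily, so only the written chunks take up disk space:
import h5py

with h5py.File('big.h5', 'w') as f:
    dset = f.create_dataset('data', shape=(10**9,), dtype='float64',
                            chunks=(10**6,))  # 8 MB per chunk
    dset[0:10**6] = 1.0  # only the touched chunks are written to disk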
You can use dask. dask arrays allow you to create an object that behaves like a single big numpy array but represents data stored in many small HDF5 files. dask takes care of figuring out how any operations you carry out relate to the underlying on-disk data.
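A minimal sketch of that, assuming many files 'part_*.h5' that each hold a 1D dataset named 'data'; the names are hypothetical:
import glob
import h5py
import dask.array as da

parts = [da.from_array(h5py.File(p, 'r')['data'], chunks=(10**6,))
         for p in sorted(glob.glob('part_*.h5'))]
big = da.concatenate(parts)  # behaves like one big numpy-like array
print(big[5:15].compute())   # dask reads only the pieces it needs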
