I'm trying out the Kaggle MNIST competition. The data is provided as a CSV, and I want to convert it to a 28x28 NumPy array. The best way I know to import a CSV data set is pd.read_csv(), but using that and then pd.DataFrame.values gives me an array of shape (42000, 784). That is not a problem in itself, since I then don't have to flatten it in TensorFlow, but my test data is compromised as well, since it has a similar shape. Is there a way to take the (42000, 784) DataFrame and convert it to a (42000, 28, 28) array?
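One hedged sketch of an answer: reshape each flat row of 784 pixels back into a 28x28 image with numpy's reshape. The file name 'train.csv' and the 'label' column below follow the usual Kaggle layout and are assumptions:
import pandas as pd

# Assumed Kaggle layout: a 'label' column followed by 784 pixel columns.
train = pd.read_csv('train.csv')
labels = train['label'].values
pixels = train.drop(columns=['label']).values   # shape (42000, 784)

# Reshape each flat 784-pixel row into a 28x28 image.
images = pixels.reshape(-1, 28, 28)             # shape (42000, 28, 28)
The same reshape(-1, 28, 28) call can be applied to the test set as well.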
I am trying to convert NumPy arrays that are 2D grids varying in time into HDF5 format for several cases. For example, the NumPy array has the following dimensions: case number (0-100), time (0-200 years), x-grid point location (0-100 m), y-grid point location (0-20 m), plus the actual data value at each location (e.g. saturation, ranging from 0-100%). I am finding it a bit difficult to store this efficiently in HDF5 format. It is supposed to be used later to train an RNN model.
I tried just assigning a NumPy array to an HDF5 file (I don't know if it worked, as I didn't retrieve it). I was also confused about the different storage options for such a case and the best way to store the data so that it is easily retrievable for training a NN. I need to use the HDF5 format because it seems to optimize the use and retrieval of large data sets, as in this case. I was also trying to find the best way to learn the HDF5 format. Thank you!
import h5py
import numpy as np

# Create a numpy array
arr = np.random.rand(3, 3)

# Create an HDF5 file
with h5py.File('mydata.h5', 'w') as f:
    # Write the numpy array to the HDF5 file
    f.create_dataset('mydata', data=arr)
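To verify that the write worked, you can read the array back; a minimal sketch with the same file and dataset name:
import h5py

# Reopen the file and read the array back.
with h5py.File('mydata.h5', 'r') as f:
    loaded = f['mydata'][:]

print(loaded.shape)  # (3, 3)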
You can also use the h5py library to append data to an existing HDF5 file instead of creating a new one, as sketched below.
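A minimal sketch of that, reusing the 'mydata.h5' file from above; the dataset name 'more_data' is made up for illustration:
import h5py
import numpy as np

new_arr = np.random.rand(3, 3)

# Open the existing file in append mode ('a') and add another dataset.
with h5py.File('mydata.h5', 'a') as f:
    f.create_dataset('more_data', data=new_arr)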
I am completely stuck, hence I am looking for kind advice.
My aim is to read many HDF5 files in parallel, extract the multi-dimensional arrays inside, and store each array in one row, or more precisely in one cell, of a dask dataframe. I am not opting for a pandas df because I believe it would be too big.
It is not possible to read HDF5 files created with h5py using dask's read_hdf().
What could I do to import thousands of HDF5 files with dask in parallel and get access to the multi-dim arrays inside?
I would like to create a dask dataframe where each 2D array (extracted from the n-dim arrays inside the HDF5 files) is stored in one cell.
Consequently, the number of rows corresponds to the total number of arrays found in all files, here 9. I store the arrays in one column.
In the future I would like to append more columns with other data to this dask dataframe, operate on the arrays with another Python library, and store the results in further columns. The dataframe should contain all the information I extract and manipulate. I would also like to add data from other HDF5 files, like a mini database. Is this reasonable?
I can work in parallel because the arrays are independent of each other.
How would you realise this, please? xarray was suggested to me as well, but I don't know what the best way is.
Earlier I tried to collect all arrays in a multi-dimensional dask array, but the conversion to a dataframe is only possible for ndim=2.
Thank you for your advice. Have a good day.
import numpy as np
import h5py
import dask.dataframe as dd
import dask.array as da
import dask

print('This is dask version', dask.__version__)

ra = np.ones([10, 3199, 4000])
print(ra.shape)

file_list = []
for i in range(0, 4):
    #print(i)
    fstr = 'data_{0}.h5'.format(str(i))
    #print(fstr)
    hf = h5py.File('./' + fstr, 'w')
    hf.create_dataset('dataset_{0}'.format(str(i)), data=ra)
    hf.close()
    file_list.append(fstr)

!ls
print(file_list)

for i, fn in enumerate(file_list):
    dd.read_hdf(fn, key='dataset_{0}'.format(str(i)))  # breaks here
You can pre-process the data into dataframes using dask.distributed and then convert the futures to a single dask.dataframe using dask.dataframe.from_delayed.
from dask.distributed import Client
import dask.dataframe as dd
import xarray as xr

client = Client()

def preprocess_hdf_file_to_dataframe(filepath):
    # process your data into a dataframe however you want, e.g.
    with xr.open_dataset(filepath) as ds:
        return ds.to_dataframe()

files = ['file1.hdf5', 'file2.hdf5']
futures = client.map(preprocess_hdf_file_to_dataframe, files)
df = dd.from_delayed(futures)
That said, this seems like a perfect use case for xarray, which can read HDF5 files and work with dask natively, e.g.
ds = xr.open_mfdataset(files)
This dataset is similar to a dask.dataframe, in that it contains references to dask.arrays which are read from the file. But xarray is built to handle N-dimensional arrays natively and can work much more naturally with the HDF5 format.
There are certainly areas where dataframes make more sense than a Dataset or DataArray, though, and converting between them can be tricky with larger-than-memory data, so the first approach is always an option if you want a dataframe.
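If you do want a dataframe at the end of the xarray route, one hedged sketch (assuming the files can actually be opened by xarray, e.g. as netCDF4/HDF5) is Dataset.to_dask_dataframe(), which keeps the conversion lazy:
import xarray as xr

# Lazily open many files as a single dataset backed by dask arrays.
ds = xr.open_mfdataset(['file1.hdf5', 'file2.hdf5'], combine='by_coords')

# Convert to a dask dataframe without loading everything into memory;
# dimensions become columns alongside the data variables.
ddf = ds.to_dask_dataframe()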
I am trying to save a series of images to a CSV file so they can be used as a training dataset for AlexNet in Keras. The shape is (15, 224, 224, 3).
So far I am having issues doing this. I have managed to put all the data into a numpy array, but now I cannot save it to a file.
Please help.
I don't think saving it to a CSV file is the correct way to do this; CSV is meant for 1D or 2D data in a table-like structure, so I'm going to offer another solution. You can use np.save to save any numpy array to a file:
np.save('file_name', your_array)
which can then be loaded using np.load:
loaded_array = np.load('file_name.npy')
Hope this works for you.
You can also try using pickle to save the data. It is much more versatile and easier to handle compared to np.save; a quick sketch follows.
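A minimal sketch of the pickle approach, assuming the (15, 224, 224, 3) image array from the question is called images (a hypothetical name):
import pickle
import numpy as np

images = np.zeros((15, 224, 224, 3))  # stand-in for your actual image array

# Serialize the array to disk.
with open('images.pkl', 'wb') as f:
    pickle.dump(images, f)

# Load it back later.
with open('images.pkl', 'rb') as f:
    loaded_images = pickle.load(f)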
I just wanted to ask if it was possible to store a numpy array as a .npy file and then use memmap to look through it at certain rows/columns?
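Yes, that works. A hedged sketch: save the array with np.save, then reopen the .npy file with np.load(mmap_mode='r'), which memory-maps it so only the rows/columns you index are read from disk; the file name and shape below are just illustrative.
import numpy as np

arr = np.random.rand(10000, 300)
np.save('features.npy', arr)

# Memory-map the file; slicing reads only the requested parts lazily.
mm = np.load('features.npy', mmap_mode='r')
row = mm[42]           # a single row
cols = mm[:, 10:20]    # a column slice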
I am currently working with 300 float features coming from preprocessing of item information. The items are identified by a UUID (i.e. a string). The current file size is around 200MB. So far I have stored them as pickled numpy arrays. Sometimes I need to map the UUID of an item to a row of the numpy array. For that I am using a dictionary (stored as JSON) that maps UUIDs to rows in the numpy array.
I was tempted to use pandas and replace that dictionary with a pandas index. I also discovered the HDF5 file format, but I would like to know a bit more about when to use each of them.
I use part of the array to feed a scikit-learn based algorithm and then perform classification on the rest.
Storing pickled numpy arrays is indeed not an optimal approach. Instead, you can:
use numpy.savez to save a dictionary of numpy arrays in a binary format,
store a pandas DataFrame in HDF5 (a sketch follows below), or
directly use PyTables to write your numpy arrays to HDF5.
HDF5 is a preferred format for storing scientific data and offers, among other things:
parallel read/write capabilities,
on-the-fly compression algorithms,
efficient querying, and
the ability to work with large datasets that don't fit in RAM.
That said, the choice of output file format for storing a small dataset of 200MB is not that critical and is more a matter of convenience.
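As a hedged sketch of the pandas-in-HDF5 option above, with hypothetical UUID keys and 300 float features per item (pandas' HDF5 support requires the PyTables package):
import uuid
import numpy as np
import pandas as pd

# Hypothetical data: 1000 items with 300 float features each.
features = np.random.rand(1000, 300)
uuids = [str(uuid.uuid4()) for _ in range(1000)]

# The UUID index replaces the separate JSON dictionary mapping UUID -> row.
df = pd.DataFrame(features, index=uuids)

# Store the DataFrame in HDF5.
df.to_hdf('features.h5', key='features', mode='w')

# Later: load it back and look up rows by UUID.
df2 = pd.read_hdf('features.h5', key='features')
row = df2.loc[uuids[0]]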