I just wanted to ask if it is possible to store a numpy array as a .npy file and then use memmap to look at certain rows/columns of it?
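Yes, that works: np.save writes the .npy file and np.load can open it with mmap_mode, so indexing only reads the requested rows/columns from disk. A minimal sketch (the file name is just an example):

import numpy as np

# Save an array to disk as a .npy file
arr = np.arange(20).reshape(4, 5)
np.save('big_array.npy', arr)

# Re-open it as a read-only memory map; the data stays on disk
mm = np.load('big_array.npy', mmap_mode='r')

# Indexing pulls only the requested rows/columns into memory
print(mm[2])        # one row
print(mm[:, 1:3])   # a column slice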
I am trying to convert Numpy arrays that are 2D grids varying in time into HDF5 format for several cases. For example, the Numpy array has the following aspects: case number (0-100), time (0-200 years), x-grid point location (0-100 m), y-grid point location (0-20 m), plus the actual data point at this location (e.g. saturation ranging from 0-100%). I am finding it a bit difficult to store this efficiently in HDF5 format. It's supposed to be used later to train an RNN model.
I tried just assigning a Numpy array to an HDF5 dataset (I don't know if it worked, as I didn't retrieve it). I was also confused about the different storage options for such a case and the best way to store the data so that it's easily retrievable for training a NN. I need to use the HDF5 format as it seems to optimize the use/retrieval of large data, as in the current case. I was also trying to find the best way to learn the HDF5 format. Thank you!
import h5py
import numpy as np
# Create a numpy array
arr = np.random.rand(3,3)
# Create an HDF5 file
with h5py.File('mydata.h5', 'w') as f:
    # Write the numpy array to the HDF5 file
    f.create_dataset('mydata', data=arr)
You can also use the h5py library to append data to an existing HDF5 file instead of creating a new one.
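A minimal sketch of that append pattern, reusing the file and dataset names from the snippet above (opening in 'a' mode also creates the file if it does not exist yet):

import h5py
import numpy as np

more = np.random.rand(3, 3)

# Open the existing file in append mode and add another dataset
with h5py.File('mydata.h5', 'a') as f:
    f.create_dataset('mydata_case2', data=more)

# Read a dataset back to verify the round trip
with h5py.File('mydata.h5', 'r') as f:
    restored = f['mydata'][:]
print(restored.shape)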
I have about 200 numpy arrays saved as files, and I would like to combine them into one big array. Currently I am doing that with a loop, concatenating each one individually. But I heard this is memory inefficient, because concatenating also makes a copy.
Concatenate Numpy arrays without copying
If you know beforehand how many arrays you need, you can instead start with one big array that you allocate beforehand, and have each of the small arrays be a view to the big array (e.g. obtained by slicing).
So I am wondering if I should instead load each numpy array individually, sum up the row sizes of all the numpy arrays, create a new numpy array with this total row size, then copy each smaller numpy array into it individually and delete it afterwards. Or is there some aspect of this I am not taking into account?
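That is essentially the preallocation approach from the quoted answer. A rough sketch, assuming all saved arrays share the same number of columns and using placeholder file names:

import numpy as np

files = ['part_000.npy', 'part_001.npy']  # placeholder file names

# First pass: memory-map each file just to read its shape
shapes = [np.load(f, mmap_mode='r').shape for f in files]
total_rows = sum(s[0] for s in shapes)
n_cols = shapes[0][1]

# Allocate the big array once, then copy each small array into its slice
big = np.empty((total_rows, n_cols))
row = 0
for f, s in zip(files, shapes):
    big[row:row + s[0]] = np.load(f)
    row += s[0]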
I'm trying out the Kaggle MNIST competition. The data is provided in a CSV, and I want to convert it to a 28x28 Numpy array. The best way I know to import a CSV data set is via pd.read_csv(), but using that, followed by pd.DataFrame.values, gives me an array that is (42000, 784), which is no problem since I don't have to flatten it in TensorFlow, but then my test data, which has a similar shape, runs into the same issue. Is there a way to take the (42000, 784) DataFrame and convert it to a (42000, 28, 28) array?
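A minimal sketch of that conversion using numpy's reshape; the CSV path and label column name are assumptions based on the usual Kaggle MNIST layout:

import pandas as pd

df = pd.read_csv('train.csv')               # assumed Kaggle MNIST file
labels = df['label'].values                 # assumed label column
pixels = df.drop(columns=['label']).values  # shape (42000, 784)

# Reshape the flat rows into 28x28 images; -1 infers the number of samples
images = pixels.reshape(-1, 28, 28)
print(images.shape)  # (42000, 28, 28)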
I am right now working with 300 float features coming from preprocessing of item information. Such items are identified by a UUID (i.e. a string). The current file size is around 200 MB. So far I have stored them as pickled numpy arrays. Sometimes I need to map the UUID of an item to a numpy row. For that I am using a dictionary (stored as JSON) that maps UUID to row in a numpy array.
I was tempted to use Pandas and replace that dictionary with a Pandas index. I also discovered the HDF5 file format, but I would like to know a bit more about when to use each of them.
I use part of the array to feed a scikit-learn based algorithm and then perform classification on the rest.
Storing pickled numpy arrays is indeed not an optimal approach. Instead, you can:
use numpy.savez to save a dictionary of numpy arrays in a binary format
store pandas DataFrame in HDF5
directly use PyTables to write your numpy arrays to HDF5.
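A sketch of the first two options, with file names chosen for illustration: numpy.savez bundles several arrays into one binary .npz file, and a pandas DataFrame indexed by UUID can be written to HDF5 (which needs PyTables installed):

import numpy as np
import pandas as pd

uuids = ['a1b2', 'c3d4']                  # hypothetical UUID strings
features = np.random.rand(len(uuids), 300)

# Option 1: numpy.savez stores several named arrays in one .npz file
np.savez('features.npz', uuids=np.array(uuids), features=features)

# Option 2: DataFrame indexed by UUID, stored in HDF5
df = pd.DataFrame(features, index=uuids)
df.to_hdf('features.h5', key='features', mode='w')
row = pd.read_hdf('features.h5', 'features').loc['a1b2']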
HDF5 is a preferred format to store scientific data; its advantages include, among others:
parallel read/write capabilities
on the fly compression algorithms
efficient querying
ability to work with large datasets that don't fit in RAM.
That said, the choice of output file format for storing a small dataset of 200 MB is not that critical and is more a matter of convenience.
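For illustration, a small sketch of the on-the-fly compression mentioned above, using h5py with arbitrary file and dataset names:

import h5py
import numpy as np

data = np.random.rand(1000, 300)

# Chunked, gzip-compressed dataset; compression happens transparently
with h5py.File('compressed.h5', 'w') as f:
    f.create_dataset('features', data=data, chunks=True, compression='gzip')

# Slicing reads and decompresses only the chunks that cover the slice
with h5py.File('compressed.h5', 'r') as f:
    first_rows = f['features'][:10]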
Is there an easier way to load an excel file directly into a Numpy array?
I have looked at the numpy.genfromtxt loading function in the numpy documentation, but it doesn't load excel files directly.
array = np.genfromtxt("Stats.xlsx")
ValueError: Some errors were detected !
Line #3 (got 2 columns instead of 1)
Line #5 (got 5 columns instead of 1)
......
Right now I am using openpyxl.reader.excel to read the excel file and then appending to numpy 2D arrays. This seems inefficient.
Ideally I would like to have the excel file loaded directly into a numpy 2D array.
Honestly, if you're working with heterogeneous data (as spreadsheets are likely to contain), using a pandas.DataFrame is a better choice than using numpy directly.
While pandas is in some sense just a wrapper around numpy, it handles heterogeneous data very very nicely. (As well as a ton of other things... For "spreadsheet-like" data, it's the gold standard in the python world.)
If you decide to go that route, just use pandas.read_excel.
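A minimal sketch of that route, reusing the Stats.xlsx name from the question (pandas.read_excel needs an engine such as openpyxl installed for .xlsx files):

import pandas as pd

df = pd.read_excel('Stats.xlsx')  # openpyxl handles .xlsx under the hood

# If the sheet is purely numeric this yields a plain 2D numpy array
array = df.to_numpy()
print(array.shape)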