Load Excel file into numpy 2D array - python

Is there an easier way to load an Excel file directly into a Numpy array?
I have looked at the numpy.genfromtxt autoloading function in the numpy documentation, but it doesn't load Excel files directly.
array = np.genfromtxt("Stats.xlsx")
ValueError: Some errors were detected !
Line #3 (got 2 columns instead of 1)
Line #5 (got 5 columns instead of 1)
......
Right now I am using openpyxl.reader.excel to read the Excel file and then appending to numpy 2D arrays. This seems inefficient.
Ideally I would like to have the Excel file loaded directly into a numpy 2D array.

Honestly, if you're working with heterogeneous data (as spreadsheets are likely to contain), using a pandas.DataFrame is a better choice than using numpy directly.
While pandas is in some sense just a wrapper around numpy, it handles heterogeneous data very, very nicely. (As well as a ton of other things... For "spreadsheet-like" data, it's the gold standard in the Python world.)
If you decide to go that route, just use pandas.read_excel.
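A minimal sketch of that route, assuming the data lives on the first sheet with a header row (pandas will use an engine such as openpyxl, which you already have, to read .xlsx files):

import pandas as pd

# Read the first sheet into a DataFrame
# (pass header=None if the sheet has no header row)
df = pd.read_excel("Stats.xlsx")

# Convert to a plain numpy 2D array if you really need one
array = df.to_numpy()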

Related

I am trying to find a way to convert numpy array to hdf5 format

I am trying to convert Numpy arrays that are 2D grids varying in time into HDF5 format, for several cases. For example, the Numpy array has the following aspects: case number (0-100), time (0-200 years), x-grid point location (0-100 m), y-grid point location (0-20 m), plus the actual data point at each location (e.g. saturation, ranging from 0-100%). I am finding it a bit difficult to store this efficiently in HDF5 format. It's supposed to be used later to train an RNN model.
I tried just assigning a Numpy array to an HDF5 file (I don't know if it worked, as I didn't retrieve it). I was also confused about the different storage options for such a case and the best way to store the data so that it's easily retrievable for training a NN. I need to use the HDF5 format, as it seems to optimize the use/retrieval of large data as in the current case. I was also trying to find the best way to learn the HDF5 format. Thank you!
import h5py
import numpy as np
# Create a numpy array
arr = np.random.rand(3, 3)
# Create an HDF5 file
with h5py.File('mydata.h5', 'w') as f:
    # Write the numpy array to the HDF5 file
    f.create_dataset('mydata', data=arr)
You can also use the h5py library to append data to an existing HDF5 file (open it in 'a' mode) instead of creating a new one.
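For the 4D case in your question, one possible sketch (the shapes and dataset name here are assumptions based on the ranges you described):

import h5py
import numpy as np

# Assumed shape: 101 cases x 201 time steps x 101 x-points x 21 y-points,
# one saturation value per grid point
data = np.random.rand(101, 201, 101, 21).astype(np.float32)

with h5py.File('saturation.h5', 'a') as f:  # 'a' mode appends to an existing file
    # Chunking by case makes retrieving a single case cheap later;
    # gzip compression is optional but helps with large files
    f.create_dataset('saturation', data=data,
                     chunks=(1, 201, 101, 21), compression='gzip')

# Read back a single case without loading the whole array into memory
with h5py.File('saturation.h5', 'r') as f:
    case0 = f['saturation'][0]  # shape (201, 101, 21)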

assemble numpy array from different hdf5 datasets in one file

So I have this large hdf5 file that contains multiple large datasets. I am accessing it with h5py and want to read parts of each of those datasets into a common ndarray. Unfortunately, slicing across datasets is not supported, so I was wondering what the most efficient way is to assemble the ndarray from those datasets, given the circumstances?
Currently, I am using something along the following lines:
with concurrent.futures.ThreadPoolExecutor() as executor:
    executor.map(assemble_array, range(NrDatasets))
where the function assemble_array reads the data into a predefined ndarray buffer of appropriate size, but this is not fast enough :/
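For reference, the full pattern looks roughly like this (a sketch; the file name, dataset names, and shapes here are made up):

import concurrent.futures
import h5py
import numpy as np

NrDatasets = 4  # number of datasets (made up here)
rows = 1000     # size of the slice read from each dataset (made up)

f = h5py.File('large_file.h5', 'r')
buffer = np.empty((NrDatasets, rows), dtype=np.float64)

def assemble_array(i):
    # read_direct writes straight into the preallocated buffer,
    # avoiding an intermediate copy
    f['dataset_%d' % i].read_direct(buffer, np.s_[:rows], np.s_[i, :])

with concurrent.futures.ThreadPoolExecutor() as executor:
    executor.map(assemble_array, range(NrDatasets))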
Can anyone help?

How do I use 1D arrays that were generated by code in a different IDLE file in Python?

I'm generating data as 1D arrays using some piece of code in Python v3. Next, I want to plot all these 1D arrays, but not necessarily on the same graph, for comparison reasons. So how do I store these 1D arrays and then use them in a different code file that I use to plot these arrays?
So far, one idea I have tried is to store these 1D arrays, while generating them, in an Excel file as different columns. Then, while executing the code that plots the arrays, I can call whichever array I want to plot from the Excel file. However, this seems like a workaround and not an efficient method.
What I expect is this: in Python, I should be able to generate 1D arrays and store them. Later, I should be able to access these arrays in a different code file and plot or manipulate them.
The problem you want to solve is called serialization. There are many formats, but one of the most common is JSON. Python has a module in the standard library to work with it. Here is a small workflow you can use:
Save the data in the first module:
import json
arrays = [
    [1, 2, 3, 4, 5],
    [5, 4, 3, 2, 1],
]
with open('waka.json', 'w') as f:
    f.write(json.dumps(arrays))
And load it in the second module:
import json
arrays = None
with open('waka.json', 'r') as f:
    arrays = json.loads(f.read())
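If your 1D arrays are numpy arrays rather than plain lists, note that json can't serialize ndarrays directly; numpy's own savez/load is an alternative (file and key names here are just examples):

import numpy as np

# Save several named arrays into one .npz file in the generating module
a = np.array([1, 2, 3, 4, 5])
b = np.array([5, 4, 3, 2, 1])
np.savez('waka.npz', a=a, b=b)

# Load them back in the plotting module
data = np.load('waka.npz')
plot_me = data['a']  # access each array by the name it was saved under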

Store Numpy as pickled Pandas, Pickled Numpy or HDF5

I am currently working with 300 float features coming from a preprocessing of item information. Such items are identified by a UUID (i.e. a string). The current file size is around 200 MB. So far I have stored them as pickled numpy arrays. Sometimes I need to map the UUID for an item to a numpy row. For that I am using a dictionary (stored as json) that maps UUIDs to rows in the numpy array.
I was tempted to use Pandas and replace that dictionary with a Pandas index. I also discovered the HDF5 file format, but I would like to know a bit more about when to use each of them.
I use part of the array to feed a scikit-learn based algorithm and then perform classification on the rest.
Storing pickled numpy arrays is indeed not an optimal approach. Instead, you can:
use numpy.savez to save a dictionary of numpy arrays in a binary format
store a pandas DataFrame in HDF5
directly use PyTables to write your numpy arrays to HDF5.
HDF5 is a preferred format for storing scientific data; its advantages include, among others:
parallel read/write capabilities
on-the-fly compression algorithms
efficient querying
the ability to work with large datasets that don't fit in RAM.
That said, the choice of output file format for a small 200 MB dataset is not that critical and is more a matter of convenience.
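As a sketch of the second option, replacing the json UUID-to-row dictionary with a pandas index (the names here are placeholders; DataFrame.to_hdf requires PyTables to be installed):

import numpy as np
import pandas as pd

# Placeholder UUIDs and features: 300 floats per item, keyed by UUID string
uuids = ['uuid-0001', 'uuid-0002', 'uuid-0003']
features = np.random.rand(len(uuids), 300)

df = pd.DataFrame(features, index=uuids)
df.to_hdf('features.h5', key='features', mode='w')

# Later: look up an item's row by UUID, no separate json mapping needed
df = pd.read_hdf('features.h5', key='features')
row = df.loc['uuid-0002']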

How to extend h5py so that I can access data within a hdf5 file?

I have a small python program which creates an hdf5 file using the h5py module. I want to write a python module to work on the data from the hdf5 file. How could I do that?
More specifically, I can declare numpy arrays as PyArrayObject and read them using PyArg_ParseTuple. This way, I can read elements from a numpy array when writing a python module. How do I read hdf5 files so that I can access individual elements?
Update: Thanks for the answers below. I need to read the hdf5 file from C rather than from Python (which I know how to do). For example:
import h5py as t
import numpy as np
f = t.File('/tmp/tmp.h5', 'w')
# this file is 2+GB
ofmat = np.load('offsetmatrix.npy')
f['FileDataset'] = ofmat
f.close()
Now I have an hdf5 file called '/tmp/tmp.h5'. What I need to do is read the individual array elements from the hdf5 file using C (and not Python) so that I can do something with those elements. This shows how to extend numpy arrays; how do I do the same for hdf5?
h5py gives you a direct interface for reading, writing, and manipulating data stored in an hdf5 file. Have you looked at the docs?
http://docs.h5py.org/
I advise starting there; the docs have pretty clear examples of simple data access. If there are specific things you are trying to do that aren't covered by the methods in h5py, could you please give a more specific description of your desired usage?
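For instance, element-level access with h5py looks roughly like this (reusing the file and dataset names from your question, and assuming a 2D dataset):

import h5py

with h5py.File('/tmp/tmp.h5', 'r') as f:
    dset = f['FileDataset']
    print(dset.shape, dset.dtype)
    element = dset[0, 0]    # read a single element from disk
    block = dset[10:20, :]  # or a slice, without loading the whole dataset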
If you don't actually need a particular structure of HDF5, but you just need the speed and cross-platform compatibility, I'd recommend taking a look at PyTables. It has the built-in ability to read and write Numpy arrays.
