Reading h5py files into tensors

I have a training set and a test set, both stored as HDF5 files that I read with h5py. I also have a load_data function that loads the files and returns NumPy arrays. The main problem is that I don't need NumPy arrays, since I am working with tensors. I expect x and y to be tensors of size (N, D_in) and (N, D_out), where N is the batch size, D_in is the input size of each image, and D_out is the output size.
The problem:
x and y do not get converted to tensors of the dimensions given below. Their types remain numpy.ndarray. Any help is appreciated.
import h5py
import numpy as np
import torch
def load_data(train_file, test_file):
    # Load the training data
    train_dataset = h5py.File(train_file, 'r')
    # Separate features (x) and labels (y) for the training set
    train_set_x_orig = np.array(train_dataset["train_set_x"][:])
    train_set_y_orig = np.array(train_dataset["train_set_y"][:])
    # Load the test data
    test_dataset = h5py.File(test_file, 'r')
    # Separate features (x) and labels (y) for the test set
    test_set_x_orig = np.array(test_dataset["test_set_x"][:])
    test_set_y_orig = np.array(test_dataset["test_set_y"][:])
    classes = np.array(test_dataset["list_classes"][:])  # the list of classes
    train_set_y_orig = torch.from_numpy(train_set_y_orig.reshape((1, train_set_y_orig.shape[0])))
    test_set_y_orig = torch.from_numpy(test_set_y_orig.reshape((1, test_set_y_orig.shape[0])))
    return train_set_x_orig, train_set_y_orig, test_set_x_orig, test_set_y_orig, classes
x = torch.Tensor(N, D_in)
y = torch.Tensor(N, D_out)
train_file = "data/train_catvnoncat.h5"
test_file = "data/test_catvnoncat.h5"
x, y, _, _, _ = load_data(train_file, test_file)

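This happens because you did not convert train_set_x_orig to a torch tensor before returning it.
Either call torch.from_numpy() on train_set_x_orig before returning, as you already do with train_set_y_orig, or convert it to a tensor before assigning it to x.
y, on the other hand, should already be of type torch.Tensor.
Note that assigning the result of load_data to x does not fill the tensor created earlier; it simply rebinds the name x to whatever object the function returns. The demonstration below illustrates this: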
# some sample tensor
In [27]: x = torch.Tensor(3, 2)
# check its type
In [28]: type(x)
Out[28]: torch.Tensor
# some sample ndarray
In [29]: arrx = np.arange(6).reshape(3, -1)
# assign array to tensor
# note that now the object `x` refers to the numpy array object
In [30]: x = arrx
# see that the type() of `x` is now numpy ndarray
In [31]: type(x)
Out[31]: numpy.ndarray
Also, as hpaulj pointed out in the comments, there is no need to wrap the sliced h5py datasets in np.array(), since slicing an h5py dataset already returns a NumPy ndarray. You can drop those calls and the code will look cleaner.
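Putting both suggestions together, a minimal sketch of the conversion might look like the following. The helper name arrays_to_tensors and the fake_file dictionary are made up for illustration; the dictionary of NumPy arrays stands in for an open h5py.File, which supports the same dataset[key][:] access. The final shapes (N, D_in) and (N, 1) are an assumption about what the question wants, not something the original code produces.

```python
import numpy as np
import torch


def arrays_to_tensors(dataset, x_key, y_key):
    """Read features and labels from `dataset` and return them as tensors.

    `dataset` may be an open h5py.File or, as in the demo below, a plain
    dict of NumPy arrays; both support `dataset[key][:]`.
    """
    x = torch.from_numpy(dataset[x_key][:])  # no np.array() wrapper needed
    y = torch.from_numpy(dataset[y_key][:])
    return x, y


# Hypothetical stand-in for the HDF5 file: 4 RGB images of 8x8 pixels.
fake_file = {
    "train_set_x": np.zeros((4, 8, 8, 3), dtype=np.uint8),
    "train_set_y": np.array([0, 1, 1, 0], dtype=np.int64),
}

x, y = arrays_to_tensors(fake_file, "train_set_x", "train_set_y")
x = x.reshape(x.shape[0], -1).float()  # (N, D_in): flatten each image
y = y.reshape(-1, 1)                   # (N, D_out), assuming D_out = 1

print(type(x), tuple(x.shape), tuple(y.shape))
```

torch.from_numpy shares memory with the source array, so modifying one also changes the other; call .clone() on the tensor if an independent copy is needed.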

Related

Value error while converting tensor to numpy array

I'm using the following code to extract features from images.
def ext():
    imgPathList = glob.glob("images/" + "*.JPG")
    features = []
    paths = []
    for i, path in enumerate(tqdm(imgPathList)):
        feature = get_vector(path)
        feature = feature[0] / np.linalg.norm(feature[0])
        features.append(feature)
        paths.append(path)
    features = np.array(features, dtype=np.float32)
    return features, paths
However, the above code throws the following error,
features = np.array(features, dtype=np.float32)
ValueError: only one element tensors can be converted to Python scalars
How can I fix it?

The error says that your features variable is a list of multi-dimensional tensors, and np.array cannot convert such a list directly: NumPy ends up trying to turn each multi-element tensor into a single Python scalar. One workaround is to use torch's concatenation function, torch.cat(), instead of appending to a list. I tried to replicate the solution with a toy example.
I am assuming that features contains 2D tensors.
import numpy as np
import torch
for i in range(1, 11):
    alpha = torch.rand(2, 2)
    if i < 2:
        beta = alpha  # first sample; later samples are concatenated onto it
    else:
        beta = torch.cat((beta, alpha), 0)
features = np.array(beta, dtype=np.float32)

It seems you have a list of tensors, which you cannot convert directly like that.
You need to convert the inner tensors to NumPy arrays first (use torch.Tensor.numpy to convert a tensor to an array) and then convert the list of NumPy arrays to the final array.
features = np.array([item.numpy() for item in features], dtype=np.float32)
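As a small self-contained check of the two approaches discussed here, the sketch below uses random 512-element tensors as a stand-in for the output of get_vector (the length 512 is an arbitrary assumption). It compares the per-item conversion with stacking the tensors first using torch.stack, which adds a new leading axis rather than concatenating along an existing one.

```python
import numpy as np
import torch

# Hypothetical stand-in for the per-image feature vectors.
features = [torch.rand(512) for _ in range(3)]

# Option 1: convert each tensor, then build one array from the list.
as_array = np.array([t.numpy() for t in features], dtype=np.float32)

# Option 2: stack into a single (3, 512) tensor first, then convert once.
# Tensors on a GPU or tracking gradients would need .detach().cpu() first.
stacked = torch.stack(features).numpy()

print(as_array.shape, stacked.shape)
```

Both options give the same (number of images, feature length) array; stacking first avoids creating one intermediate array per image.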

Numpy random functions create inconsistent shapes for given argument

I found an odd behavior of numpy's random number generators.
It seems that they do not generate consistent matrix shapes for a given argument.
It is just super annoying to spend an extra line for conversion afterward which I'd like to circumvent.
How can I tell matlib.randn directly to generate a vector of size (200,)?
import numpy as np
import numpy.matlib
A = np.zeros((200,))
B = np.matlib.randn((200,))
print(A.shape) # prints (200,)
print(B.shape) # prints (1, 200)

Use numpy.random instead of numpy.matlib:
numpy.random.randn(200).shape # prints (200,)
numpy.random.randn can create any shape, whereas numpy.matlib.randn always creates a matrix.

B is a matrix object, not an ndarray. Matrix objects are always 2-dimensional, so they have no 1D equivalent, and their use is no longer recommended. Use np.random.randn instead, which returns a regular ndarray of standard normal samples.
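A short sketch of the alternatives, for comparison. The seed 0 and the variable names are arbitrary choices for illustration; np.random.default_rng is NumPy's newer random API and is shown only as an option, not something the answers above mention.

```python
import numpy as np

# Legacy API: the dimensions are passed as separate integers.
a = np.random.randn(200)

# Newer Generator API: the shape is passed as an int or a tuple.
rng = np.random.default_rng(0)
b = rng.standard_normal(200)

# If a (1, 200) matrix is already in hand, flatten it to a 1-D ndarray.
m = np.asmatrix(np.random.randn(1, 200))
c = np.asarray(m).ravel()

print(a.shape, b.shape, m.shape, c.shape)
```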

pickle, numpy - reshape parameters?

I'm trying to figure out what the arguments to the reshape() call below mean.
I didn't find anything about pickle having a reshape() method, and the file contains import cPickle as pickle and import numpy as np, so I'm assuming (maybe wrongly) that reshape comes from numpy. I found the definition of the reshape method for numpy (also below). However, I can't tell which arguments belong to which parameter.
Because this is supposed to load picture data, I'm guessing 32,32 might be the image size and would correspond to the newshape parameter?
I don't have a clue what 10000,3 are doing: the term "array_like" for the a parameter is confusing, and I don't know why 4 arguments are given if the function has only 3 parameters, or how Python would know that 32,32 is one argument, if it really is (why no []?).
Basically, which parameter does each passed-in argument belong to, and how can Python tell? And how did X, an object that came from the pickle load, end up with numpy methods on it? Is that even possible?
datadict = pickle.load(f)
X = datadict['data']
Y = datadict['labels']
X = X.reshape(10000, 3, 32, 32)

The documentation you've linked is slightly different from what is actually happening, which may explain your confusion. The relevant documentation is that of the array method, numpy.ndarray.reshape, which is effectively the same function but set up as an object method instead of a library function.
In this case, (10000, 3, 32, 32) is the shape of the output array, so your output is a 4-dimensional array with that shape. If this is supposed to be image data, I suspect you have 10,000 images, each 32x32 pixels with 3 RGB channels.
Additionally, pickle stores type information when you store objects, so this is how it knows that the object is a numpy array!

This loads a dictionary from the file:
datadict = pickle.load(f)
Then select two values from the dictionary. Ordinary dictionary key indexing:
X = datadict['data']
Y = datadict['labels']
Evidently X is a numpy array. reshape is a method (a function that 'belongs' to the array).
X = X.reshape(10000, 3, 32, 32)
A numpy array has an attribute called shape. After this reshape, X.shape should return (10000, 3, 32, 32), the shape of a 4-dimensional array. The numbers are the newshape parameter described in the documentation.
The documentation is for the function version of reshape. It would be used as:
X = np.reshape(X, (10000, 3, 32, 32))
Same functionality, just a different way of invoking it.
To go on from here you probably need to study numpy documentation.
The documentation for the method version is:
a.reshape(shape, order='C')
Returns an array containing the same data with a new shape.
Refer to numpy.reshape for full documentation.
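The equivalence of the method and function forms can be checked on a small array. The sketch below uses 10 fake images instead of 10000 to keep it light; the variable names are made up for illustration.

```python
import numpy as np

# 10 fake "images", each stored as a flat row of 3 * 32 * 32 values.
X = np.arange(10 * 3 * 32 * 32).reshape(10, 3072)

via_method = X.reshape(10, 3, 32, 32)           # dimensions as separate arguments
via_tuple = X.reshape((10, 3, 32, 32))          # same call, shape given as one tuple
via_function = np.reshape(X, (10, 3, 32, 32))   # function form takes the tuple
inferred = X.reshape(-1, 3, 32, 32)             # -1 lets NumPy infer the first axis

print(via_method.shape, inferred.shape)
```

The method accepts the dimensions either as separate integers or as a single tuple, which is why the call in the question needs no brackets.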

numpy dtype ValueError: invalid shape in fixed-type tuple - how can I get around it?

I use a custom datatype, e.g. datatype = np.dtype('({:n},{:n})f4'.format(10000,100000)) to read data from a binary file using
np.fromfile(filename, dtype=datatype)
However, defining the datatype using np.dtype gives an error for large datasets, as in the example datatype above:
ValueError: invalid shape in fixed-type tuple: dtype size in bytes must fit into a C int
Initializing an array of that size is no problem: a=np.zeros((10000,100000)).
So my question is: Where does that limitation come from and how can I get around it? I can of course use a loop and read chunks at a time, but maybe there is a more elegant way?

When you specify a dtype of '(M, N)f4' you are effectively specifying the final two dimensions of the output array, e.g.
np.zeros(5, np.dtype('(6, 7)f4')).shape
# (5, 6, 7)
You could achieve the same outcome by simply reading in the data as a 1D array, then reshaping it to your desired shape:
x = np.fromfile(filename, np.float32).reshape(-1, 10000, 100000)
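The read-then-reshape approach can be tried on a small temporary file. In this sketch the dimensions (3 records of 4 x 5 values) are arbitrary stand-ins for the 10000 x 100000 case in the question, and the temporary file replaces the real data file.

```python
import os
import tempfile

import numpy as np

# Small stand-in dimensions; the question uses rows=10000, cols=100000.
records, rows, cols = 3, 4, 5

# Write some float32 data to a temporary binary file.
expected = np.arange(records * rows * cols, dtype=np.float32)
with tempfile.NamedTemporaryFile(suffix=".bin", delete=False) as tmp:
    filename = tmp.name
expected.tofile(filename)

# Read everything as a flat float32 array, then impose the shape.
# -1 lets NumPy work out how many (rows, cols) records the file holds.
data = np.fromfile(filename, dtype=np.float32).reshape(-1, rows, cols)

os.remove(filename)
print(data.shape)
```

Because the element dtype here is plain float32, the per-element size stays at 4 bytes regardless of how large the final two dimensions are.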

Always yield a masked array with netCDF4

Question:
Is there a way of forcing netCDF4 to always output a masked array, regardless of whether the slice contains any fill values?
Background:
I have a netCDF dataset of values on a grid over time, which I read using the netCDF4 package.
nc_data = netCDF4.Dataset('file.nc', 'r')
The initial timesteps yield masked arrays:
var1_t0 = nc_data.variables['var1'][0][:]
var1_t0
masked_array(...)
The later timesteps yield standard ndarrays:
var1_t200 = nc_data.variables['var1'][200][:]
var1_t200
ndarray(...)
Desired result:
I would like masked arrays for the latter with a mask of all False, rather than a standard ndarray.

I don't think this is directly possible, but you can work around it by creating a masked array when the result is not already one:
var1_t0 = nc_data.variables['var1'][0][:]
if type(var1_t0) is not numpy.ma.core.MaskedArray:
    var1_t0 = numpy.ma.core.MaskedArray(var1_t0, numpy.zeros(var1_t0.shape, dtype=bool))
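The same workaround can be wrapped in a small helper and tried without a netCDF file. The function name ensure_masked is hypothetical, and the plain NumPy arrays below stand in for slices read from the dataset. The sketch relies on numpy.ma.getmaskarray, which returns a full boolean mask (all False for an unmasked input).

```python
import numpy as np


def ensure_masked(values):
    """Return `values` as a masked array with a full boolean mask.

    Plain ndarrays get an all-False mask; masked arrays keep their mask.
    """
    return np.ma.masked_array(values, mask=np.ma.getmaskarray(values))


# Stand-in for a later timestep that came back as a plain ndarray.
plain = np.arange(6, dtype=float).reshape(2, 3)
out_plain = ensure_masked(plain)

# Stand-in for an early timestep that already has one masked cell.
already_masked = np.ma.masked_array(plain, mask=plain == 0)
out_masked = ensure_masked(already_masked)

print(type(out_plain).__name__, out_plain.mask.shape, int(out_masked.mask.sum()))
```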
