I am using h5py to build a dataset. Since I want to store arrays with a varying number of rows, I use the h5py special_dtype vlen. However, I experience behavior I can't explain; maybe you can help me understand what is happening:
>>> import h5py
>>> import numpy as np
>>> fp = h5py.File(datasource_fname, mode='w')
>>> dt = h5py.special_dtype(vlen=np.dtype('float32'))
>>> train_targets = fp.create_dataset('target_sequence', shape=(9549, 5,), dtype=dt)
>>> test
Out[130]:
array([[ 0., 1., 1., 1., 0., 1., 1., 0., 1., 0., 0.],
[ 1., 0., 0., 0., 1., 0., 0., 1., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1.]])
>>> train_targets[0] = test
>>> train_targets[0]
Out[138]:
array([ array([ 0., 1., 0., 0., 0., 1., 0., 0., 0., 0., 1.], dtype=float32),
array([ 1., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0.], dtype=float32),
array([ 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0.], dtype=float32),
array([ 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0.], dtype=float32),
array([ 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0.], dtype=float32)], dtype=object)
I do expect train_targets[0] to be of this shape, but I can't recognize the rows of my array in it. They seem to be totally jumbled about, yet consistently so: every time I run the above code, train_targets[0] looks the same.
To clarify: the first element in my train_targets, in this case test, has shape (5,11), however the second element might be of shape (5,38) which is why I use vlen.
Thank you for your help
Mat
I think
train_targets[0] = test
has stored your (5,11) array as an F-ordered array in a row of train_targets. Given the (9549,5) shape, that's a row of 5 elements, and since it is vlen, each element is a 1d array of length 11.
That's what you get back in train_targets[0]: an array of 5 arrays, each of shape (11,), with values taken from test (order F).
So I think there are two issues: what a 2d shape means, and what vlen allows.
My version of h5py is pre v2.3, so I only get string vlen. But I suspect your problem may be that vlen only works with 1d arrays, an extension, so to speak, of byte strings.
Does the 5 in shape=(9549, 5,) have anything to do with 5 in the test.shape? I don't think it does, at least not as numpy and h5py see it.
When I make a file following the string vlen example:
>>> f = h5py.File('foo.hdf5')
>>> dt = h5py.special_dtype(vlen=str)
>>> ds = f.create_dataset('VLDS', (100,100), dtype=dt)
and then do:
ds[0]='this one string'
and look at ds[0], I get an object array with 100 elements, each being this string. That is, I've set a whole row of ds.
ds[0,0]='another'
is the correct way to set just one element.
vlen means 'variable length', not 'variable shape'. While the documentation at https://www.hdfgroup.org/HDF5/doc/TechNotes/VLTypes.html is not entirely clear on this, I think you can store 1d arrays with shape (11,) and (38,) with vlen, but not 2d ones.
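If that is the case, then for a ragged 2d sample you would assign one row per vlen element rather than the whole 2d block at once. A minimal sketch of what I mean (untested on my side, since my h5py predates float vlen support; k and the loop are just illustrative):
k = 0                                   # which sample to write
for j in range(test.shape[0]):          # 5 rows in this sample
    train_targets[k, j] = test[j]       # each row is a 1d float array (length 11, 38, ...)
Reading train_targets[k] back should then give an object array of 5 one-dimensional arrays whose values match the rows of test.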
Actually, the train_targets output is reproduced with:
In [54]: test1 = np.empty((5,), dtype=object)
In [55]: for i in range(5):
    ...:     test1[i] = test.T.flatten()[i:i+11]
It's 11 values taken from the transpose (F order), but shifted by one for each successive sub array.
Related
I have two 2d matrices in a list which I want to convert to a numpy array. Below are 3 examples a, b, c.
>>> import numpy as np
>>> a = [np.zeros((3,5)), np.zeros((2,9))]
>>> np.array(a)
array([array([[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.]]),
array([[0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0.]])], dtype=object)
>>> b = [np.zeros((3,5)), np.zeros((3,9))]
>>> np.array(b)
Traceback (most recent call last):
File "C:\Program Files\JetBrains\PyCharm 2019.2.4\helpers\pydev\_pydevd_bundle\pydevd_exec.py", line 3, in Exec
exec exp in global_vars, local_vars
File "<input>", line 1, in <module>
ValueError: could not broadcast input array from shape (3,5) into shape (3)
>>> c = [np.zeros((3,5)), np.zeros((4,9))]
>>> np.array(c)
array([array([[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.]]),
array([[0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0.]])], dtype=object)
As one can observe, cases a and c work but b does not; it throws an exception. The difference is that in example b the first dimensions of the two matrices match.
I found the following answer, which explains why this behaviour occurs.
If only the first dimension does not match, the arrays are still matched, but as individual objects, no attempt is made to reconcile them into a new (four dimensional) array.
My Question: I don't want numpy to reconcile the matrices. I just want the same behaviour as when the first dimension doesn't match: I want them to be matched as individual objects even if they have the same first dimension. How do I achieve this?
Numpy still complains even if you explicitly pass object as the dtype:
>>> np.array(b, dtype=object)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: could not broadcast input array from shape (3,5) into shape (3)
Essentially, numpy is not really written around using dtype=object; it always assumes you want an array with a primitive numeric or structured dtype.
So I think your only option is something like:
>>> arr = np.empty(len(b), dtype=object)
>>> arr[:] = b
>>> arr
array([array([[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.]]),
array([[0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0.]])], dtype=object)
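As a quick check (using the arr built above), the sub-arrays keep their original shapes even though the first dimensions match:
>>> arr.shape
(2,)
>>> arr[0].shape, arr[1].shape
((3, 5), (3, 9))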
And just for fun, you can use the actual np.ndarray type constructor, although this isn't very easy:
>>> np.ndarray(dtype=object, shape=len(b), buffer=np.array(list(map(id, b)),dtype=np.uint64))
array([array([[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.]]),
array([[0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0.]])], dtype=object)
And note that this relies on a CPython implementation detail: id is simply the address of the Python object. So mostly I'm just showing it for fun.
In the latest version we are starting to see a warning:
In [185]: np.__version__
Out[185]: '1.19.0'
In [187]: np.array([np.zeros((3,5)), np.zeros((2,9))])
/usr/local/bin/ipython3:1: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
#!/usr/bin/python3
Out[187]:
array([array([[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.]]),
array([[0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0.]])], dtype=object)
It still makes the object dtype array. In the matching first dimension case we get the warning and error.
In [188]: np.array([np.zeros((3,5)), np.zeros((3,9))])
/usr/local/bin/ipython3:1: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
#!/usr/bin/python3
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-188-b6a4475774d0> in <module>
----> 1 np.array([np.zeros((3,5)), np.zeros((3,9))])
ValueError: could not broadcast input array from shape (3,5) into shape (3)
Basically np.array tries, as a first step, to make a multidimensional numeric array. Failing that, it takes one of two routes: make an object dtype array, or fail with an error. The details are buried in compiled code.
Preallocating and then assigning is the best way if you want full control over how the object array is created.
In [189]: res=np.empty(2,object)
In [191]: res[:] = [np.zeros((3,5)), np.zeros((3,9))]
I have:
mask = mask_model(input_spectrogram)
mask = torch.round(mask).float()
torch.set_printoptions(precision=4)
print(mask.size(), input_spectrogram.size(), masked_input.size())
print(mask[0][-40:], mask[0].size())
This prints:
tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0.], grad_fn=<SliceBackward>) torch.Size([400])
But I want it to print 1.0000 instead of 1.. Why won't my set_printoptions call do it? Even when I converted to float()?
Unfortunately, this is simply how PyTorch displays tensor values. Your code works fine: if you do print(mask * 1.1), you can see that PyTorch does indeed print 4 decimal places once the tensor values can no longer be represented as integers.
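To illustrate with a standalone sketch (my own toy tensor, not the poster's mask), the decimals reappear as soon as the values stop being integer-valued; if you really need fixed-width output, one workaround is to format the values yourself:
import torch

torch.set_printoptions(precision=4)

t = torch.ones(5)                        # float tensor with integer values
print(t)                                 # tensor([1., 1., 1., 1., 1.]) -- trailing zeros suppressed
print(t * 1.1)                           # tensor([1.1000, 1.1000, ...]) -- precision=4 applies
print([f"{v:.4f}" for v in t.tolist()])  # ['1.0000', ...] -- manual formatting workaround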
I have 7 images of size 29*29, and I want to add one homogeneous coordinate (augment them with a feature x0=1) to all 7 images, but I am not sure how to do it.
my original image dimension is
images.shape
#(7, 29, 29)
What I have tried is zipping with np.ones(), but it ends up making a separate array for the first feature, resulting in dimension 7*2:
np.array([list(a) for a in zip(np.ones([7,1]),images_all[:,:])]).shape
#(7,2)
#
#[[array([1.]),
# array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
....
As you can see, it adds the 1 as a separate array instead of appending it as the first element.
I also tried to loop through the images and insert 1 as the first element, but it makes the dimension 30 and gives an error:
for i in range(len(images)):
    images[i][0] = np.insert(images[i][0], 0, 1., axis=0)
ValueError: could not broadcast input array from shape (30) into shape (29)
First create a larger array of ones, then reshape the original array and use it to fill the larger array:
padded_images = np.ones((7,29*29+1))
padded_images[:,1:] = images.reshape(7,29*29)
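For example, with a dummy images array standing in for the real data (my own placeholder, just to show the shapes):
import numpy as np

images = np.zeros((7, 29, 29))              # stand-in for the real 7 images
padded_images = np.ones((7, 29*29 + 1))     # one extra column for the x0 = 1 feature
padded_images[:, 1:] = images.reshape(7, 29*29)
print(padded_images.shape)                  # (7, 842); column 0 stays all ones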
How can I remove the NaN rows from the array below using indices (since I will need to remove the same rows from a different array)?
array([[[nan, 0., 0., 0.],
[ 0., 0., 0., 0.],
[ 0., 0., 0., 0.]],
[[ 0., 0., 0., 0.],
[ 0., nan, 0., 0.],
[ 0., 0., 0., 0.]]])
I get the indices of the rows to be removed by using the command
a[np.isnan(a).any(axis=2)]
But using what I would normally use on a 2D array does not produce the desired result; it loses the array structure:
a[~np.isnan(a).any(axis=2)]
array([[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.]])
How can I remove the rows I want using the indices obtained from my first command?
You need to reshape:
a[~np.isnan(a).any(axis=2)].reshape(a.shape[0], -1, a.shape[2])
But be aware that the number of NaN rows in each 2D subarray must be the same to get a new 3D array back.
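For instance, rebuilding the example array above (one NaN row in each 2D subarray, so the reshape works out):
import numpy as np

a = np.zeros((2, 3, 4))
a[0, 0, 0] = np.nan
a[1, 1, 1] = np.nan

keep = ~np.isnan(a).any(axis=2)                       # boolean mask, shape (2, 3)
result = a[keep].reshape(a.shape[0], -1, a.shape[2])  # back to 3D: (2, 2, 4)
print(result.shape)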
So let's say I have a (4,10) array initialized to zeros, and I have an input array in the form [2,7,0,3]. The input array will modify the zeros matrix to look like this:
[[0,0,1,0,0,0,0,0,0,0],
[0,0,0,0,0,0,0,1,0,0],
[1,0,0,0,0,0,0,0,0,0],
[0,0,0,1,0,0,0,0,0,0]]
I know I can do that by looping through the input target and indexing the matrix with something like matrix[i][input_target[i]], but I tried to do it without a loop, doing something like
matrix[:, input_target] = 1, but that sets those columns to 1 in every row.
Apparently the way to do it is:
matrix[range(input_target.shape[0]), input_target] = 1, and the question is why this works and not using the colon?
Thanks!
You only wish to update one column for each row. Therefore, with advanced indexing you must explicitly provide those row identifiers:
A = np.zeros((4, 10))
A[np.arange(A.shape[0]), [2, 7, 0, 3]] = 1
Result:
array([[ 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
[ 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.]])
Using a colon for the row indexer will tell NumPy to update all rows for the specified columns:
A[:, [2, 7, 0, 3]] = 1
array([[ 1., 0., 1., 1., 0., 0., 0., 1., 0., 0.],
[ 1., 0., 1., 1., 0., 0., 0., 1., 0., 0.],
[ 1., 0., 1., 1., 0., 0., 0., 1., 0., 0.],
[ 1., 0., 1., 1., 0., 0., 0., 1., 0., 0.]])