h5py returning unexpected results in indexing - python

I'm attempting to fill an h5py dataset with a series of numpy arrays that I generate in sequence so my memory can handle it.
The h5py dataset is initialised so that the first dimension can grow to any size:
f.create_dataset('x-data', (1, maxlen, 50), maxshape=(None, maxlen, 50))
After generating each numpy array X, I am using
f['x-data'][alen:alen + len(data),:,:] = X
where, for example, in the first array alen=0 and len(data)=10056. I then increment alen so the next array starts where the last one ended.
print f['x-data'][alen:alen + len(data),:,:].shape, alen, len(data)
(1L, 60L, 50L) 0 10056
Does anyone know why the 0:10056 indexing is being interpreted as 1L?

I replicated your example, but on a much smaller scale. I had to do a resize each time I added elements, e.g.
f['xdata'].resize(50, axis=0)
The first time I tried to add a block I got an error:
TypeError: Can't broadcast (10, 20, 10) -> (1, 20, 10)
But subsequent times, when I'd outgrown the allocated space, it failed silently. No error, it just didn't end up storing the new values.
This is with h5py version 2.2.1.

I found the answer from a helpful person on the user group.
The maxshape=(None, ...) feature does not mean that the dataset resizes automatically; it must be resized explicitly each time new input is added. So the first dimension must be increased before adding new data:
x.resize((x.shape[0] + X.shape[0], X.shape[1], X.shape[2]))
y.resize((y.shape[0] + Y.shape[0], Y.shape[1], Y.shape[2]))
The dataset then adds the values correctly.
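Putting the two answers together, a minimal sketch of the resize-then-write append pattern (the file name, chunk size, and random data are illustrative stand-ins):

import h5py
import numpy as np

maxlen = 60
with h5py.File('data.h5', 'w') as f:
    # Start with zero rows; the first axis is unlimited.
    dset = f.create_dataset('x-data', (0, maxlen, 50),
                            maxshape=(None, maxlen, 50))
    for _ in range(4):
        X = np.random.rand(10, maxlen, 50)  # stand-in for a generated chunk
        alen = dset.shape[0]
        # Grow the first axis before writing, then fill the new slice.
        dset.resize(alen + X.shape[0], axis=0)
        dset[alen:alen + X.shape[0], :, :] = X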

Related

np.pad() converting my array into a new array with only values equal to 0

I am working on an NLP machine learning project and I want to add some additional data to the original dataset used in training, to compare results. To this end, the shape of my new dataset, a numpy array containing the previous data plus the additional data, needs to match the shape of the test set used in the first round of training.
When I used the code below to pad my numpy array, I noticed that all the float numbers disappeared and I got a new array containing only zeros, especially if I reassigned it to a new variable.
Xfeatures_new_pad = np.pad(Xfeatures_new, (2039, 0), 'constant')
The same happens if I use pad_sequences():
Xfeatures_new_pad = pad_sequences(Xfeatures_new, maxlen=Xfeatures_train.shape[1], padding='pre')
I have also tried the below code:
result = np.zeros(Xfeatures_train.shape)
result[:Xfeatures_new.shape[0], :Xfeatures_new.shape[1]] = Xfeatures_new
new_Xfeatures = result[:Xfeatures_new.shape[0]]
new_Xfeatures_train = np.concatenate((Xfeatures_train, new_Xfeatures), axis=0)
result_y = np.zeros(y_train.shape)
result_y[:y_train_new.shape[0], :y_train_new.shape[1]] = y_train_new
new_y = result_y[:y_train_new.shape[0]]
new_y_train = np.concatenate((y_train, new_y), axis=0)
But I am getting an error:
ValueError                                Traceback (most recent call last)
<ipython-input-33-a696a6ccc21c> in <module>()
      1
      2 result = np.zeros(Xfeatures_train.shape)
----> 3 result[:Xfeatures_new.shape[0],:Xfeatures_new.shape[1]] = Xfeatures_new
      4 new_Xfeatures=result[:Xfeatures_new.shape[0]]
      5 new_Xfeatures_train = np.concatenate((Xfeatures_train, new_Xfeatures), axis=0)

ValueError: could not broadcast input array from shape (600,13072) into shape (400,13072)
Any ideas why this is happening?
The problem with my code was that the zeros array I created had fewer rows than the array that was supposed to be inserted into it. The solution was to create the zeros array with enough rows to accommodate the array being inserted:
result = np.zeros((600, 15111))
result[:Xfeatures_new.shape[0], :Xfeatures_new.shape[1]] = Xfeatures_new
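As an aside, when np.pad is given a single (before, after) pair for a 2D array, that pair is applied to every axis, so the original call prepends 2039 zero rows as well as 2039 zero columns; the float values do not disappear, they are shifted. A per-axis pad width avoids this. A minimal sketch, with a stand-in array whose shape follows the error message:

import numpy as np

X = np.random.rand(400, 13072)  # stand-in for Xfeatures_new

# Each inner tuple is (before, after) for one axis: pad 2039
# zero-columns in front of the data, nothing on the row axis.
X_pad = np.pad(X, ((0, 0), (2039, 0)), 'constant')
print(X_pad.shape)                          # (400, 15111)
assert np.array_equal(X_pad[:, 2039:], X)   # values preserved, just shifted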

error when reshaping array with prime dimension

I want to reshape an array of length 1501 (a waveform) to (3, 500), but it complains as follows. Please help me solve this problem.
x = np.array(x_train[2])
print(x.shape)
y = np.reshape(x, (int(len(x) / 500), 500))
print(y.shape)
Here is the output:
(1501,)
ValueError: cannot reshape array of size 1501 into shape (3,500)
If you want to view your data in chunks of 500, but it does not have a multiple of 500 elements, you need to truncate or pad it first. Let's say you are going with truncation here, since you have a wave-form and that last element is probably a repeat that will throw off your FFT anyway.
In that case, you can view less of the data in the buffer, then reshape however you want:
y = x[:(x.size // 500) * 500].reshape(-1, 500)
The nice thing here is that if your data is arranged somewhat sanely in memory, it will not make a copy, but return a contiguous view into the original buffer.
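If padding is preferable to truncation, a minimal sketch along the same lines (the zero fill is an assumption; any pad value works):

import numpy as np

x = np.arange(1501.0)  # stand-in for the waveform

# Round the length up to the next multiple of 500, zero-pad the tail,
# then reshape into rows of 500.
n = -(-x.size // 500) * 500  # ceiling division: 2000 here
y = np.pad(x, (0, n - x.size), 'constant').reshape(-1, 500)
print(y.shape)  # (4, 500)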

Appending matrices into a single matrix with numpy

I have a function in Python that returns a numpy.mat of shape (100, 1). I am calling this function 4 times in a loop and would like to take the resulting 4 matrices and create a matrix of shape (100, 4). I have looked for some time at numpy.append, numpy.concatenate, and numpy.insert but have not been able to get this working.
Here is a short SSCCE of my issue
zeros = np.zeros(shape=(100, 4))
for i in range(1, 5):
    np.append(zeros, np.empty(shape=(100, 1)))
print(zeros)
where zeros should result in a matrix of shape (100, 4) with "junk" values from each of the calls to numpy.empty, not all 0.
Do something along these lines -
zeros = np.zeros(shape=(100, 4))
for i in range(1, 5):
    data = np.random.rand(100, 1)  # func that returns (100,1) shaped array
    zeros[:, i-1] = data.ravel()
In place of ravel(), we could also use data[:,0] or np.squeeze(data); the basic idea is to feed a 1D array there, because the LHS zeros[:,i-1] expects a 1D array.
As an alternative, inside the loop, we could also do -
zeros[:,[i-1]] = data
Thus, with the list of column indices [i-1] instead of the scalar i-1, we keep the dimensions of the slot into which data is assigned (it stays 2D), and that allows us to feed in data, which is also 2D, without any change.
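Alternatively, if collecting the four results first is acceptable, they can be stacked in a single call; a minimal sketch with a stand-in for the generating function:

import numpy as np

def make_column():
    # Stand-in for the function that returns a (100, 1) matrix.
    return np.random.rand(100, 1)

# hstack joins the (100, 1) blocks along axis 1 into (100, 4).
result = np.hstack([make_column() for _ in range(4)])
print(result.shape)  # (100, 4)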

Getting Around "ValueError: operands could not be broadcast together"

The code below yields the following value error.
ValueError: operands could not be broadcast together with shapes (8,8) (64,)
It first arose when I expanded the "training" data set from 10 images to 100. The interpreter seems to be telling me that I can't perform any coordinate-wise operations on these data points because one of the coordinate pairs is missing a value. I can't argue with that.
Unfortunately, my workarounds haven't exactly worked out. I attempted to insert an if condition followed by a continue statement (i.e., if this specific coordinate comes up, it should continue from the top of the loop). The interpreter didn't like this idea and muttered something about the truth of that statement not being as cut and dried as I thought. It suggested I try a.any() or a.all(). I checked out examples of both, and tried placing the problematic coordinate pair in the parentheses and in place of the "a". Both approaches got me nowhere.
I'm unaware of any Python functions similar to those I would use in C to exclude inputs that don't meet specific criteria. Other answers pertaining to similar problems recommend changing the math, but I was told that this is how I am to proceed, so I'm looking at it as an error-handling problem.
Does anyone have any insight concerning how one might handle this issue? Any thoughts would be greatly appreciated!
Here's the code:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

digits = datasets.load_digits()

# print the 0th image in the image database as an integer matrix
print(digits.images[0])

# plot the 0th image in the database, assigning each pixel an intensity of black
plt.figure()
plt.imshow(digits.images[0], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()

# create training subsets of images and targets (labels)
X_train = digits.images[0:1000]
Y_train = digits.target[0:1000]

# pick a test point from images (345)
X_test = digits.images[345]

# view test data point
plt.figure()
plt.imshow(digits.images[345], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()

# distance function
def dist(x, y):
    return np.sqrt(np.sum((x - y)**2))

# expand set of test data
num = len(X_train)
no_errors = 0
distance = np.zeros(num)
for j in range(1697, 1797):
    X_test = digits.data[j]
    for i in range(num):
        distance[i] = dist(X_train[i], X_test)
    min_index = np.argmin(distance)
    if Y_train[min_index] != digits.target[j]:
        no_errors += 1
print(no_errors)
You need to show us where the error occurs, and some of the error stack.
Then you need to identify which arrays are causing the problem and examine their shapes. Actually, the error tells us that: one operand is an 8x8 2D array; the other has the same number of elements but a 1D shape. You may have to trace some variables back through your own code.
Just to illustrate the problem:
In [381]: x = np.ones((8,8),int)
In [384]: y = np.arange(64)
In [385]: x*y
...
ValueError: operands could not be broadcast together with shapes (8,8) (64,)
In [386]: x[:] = y
...
ValueError: could not broadcast input array from shape (64) into shape (8,8)
Since the 2 arrays have the same number of elements, a fix likely involves reshaping one or the other:
In [387]: x.ravel() + y
Out[387]:
array([ 1, 2, 3, 4, 5, ... 64])
or x - y.reshape(8, 8).
My basic point is, you need to understand what array shapes mean, and how arrays of different shape can be used together. You don't 'get around' the error, you fix the inputs so they are 'broadcasting' compatible.
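Applied to the code in the question, that likely means comparing like with like: X_train holds 8x8 images from digits.images, while digits.data[j] is the flattened 64-element version of the same image. A sketch of two possible fixes, assuming the rest of the loop is unchanged:

# Either flatten the training image at comparison time ...
distance[i] = dist(X_train[i].ravel(), X_test)

# ... or train on the flattened rows from the start:
X_train = digits.data[0:1000]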
I don't think the problem is with the value of a specific element.
The truth-value error occurs when you try to test an array in an if context. if expects a simple True or False, not an array of True/False values.
In [389]: if x>0:print('yes')
....
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
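For completeness, a minimal sketch of the a.any()/a.all() reduction the interpreter suggests, which collapses the boolean array to a single truth value before the if test:

import numpy as np

x = np.ones((8, 8), int)

if (x > 0).all():  # True only if every element is positive
    print('all positive')
if (x > 5).any():  # True if at least one element exceeds 5
    print('some element exceeds 5')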

Concatenate large numpy arrays in RAM

I have some 3D image data and want to build a stack of RGB images out of single-channel stacks, i.e. I am trying to concatenate three arrays of shape (358, 1379, 1042) into one array of shape (358, 1379, 1042, 3). Inspired by skimage.color.gray2rgb I tried
np.concatenate((
stack1[..., np.newaxis],
stack2[..., np.newaxis],
stack3[..., np.newaxis]), axis=-1)
However, even though each of these stacks is only about 1 GiB, this fills my empty ~12 GiB of RAM immediately. So I tried to pre-allocate an array of the final shape and then fill it with the stacks, like
rgb_stack = np.zeros(stack1.shape + (3,))
rgb_stack[:,:,:,0] = stack1
which also exhausted my RAM once I executed the second line. Finally I tried to explicitly copy the data from stack1 into rgb_stack by
rgb_stack = np.zeros(stack1.shape + (3,))
rgb_stack[:,:,:,0] = stack1.copy()
with the same result. What am I doing wrong?
To wrap up what can be learnt from the comments on the question: np.zeros creates an array of float64 by default, which here is almost 12 GiB. This by itself does not fill the RAM, as Linux overcommits memory and only sets the corresponding RAM aside once the array is actually written, which in this case happens when it is filled with the image data.
Thus creating the zeros with an appropriate smaller dtype solves the problem, e.g.
rgb_stack = np.zeros(stack1.shape + (3,), dtype=np.uint16)
rgb_stack[:,:,:,0] = stack1.copy()
works fine with uint16 stacks.
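A quick back-of-the-envelope calculation makes the dtype difference concrete (shapes taken from the question):

import numpy as np

shape = (358, 1379, 1042, 3)
n = np.prod(shape)  # number of elements in the RGB stack

print(n * 8 / 2**30)  # float64: ~11.5 GiB (the np.zeros default)
print(n * 2 / 2**30)  # uint16:  ~2.9 GiB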
