error when reshaping array with prime dimension

I want to reshape an array of 1501 samples (a waveform) to shape (3, 500), but NumPy raises the error shown below. How can I solve this?
x = np.array(x_train[2])
print(x.shape)
y = np.reshape(x, (int(len(x) / 500), 500))
print(y.shape)
Here is the output:
(1501,)
ValueError: cannot reshape array of size 1501 into shape (3,500)

If you want to view your data in chunks of 500 but its length is not a multiple of 500, you need to truncate or pad it first. Let's say you go with truncation here, since you have a waveform and the last element is probably a repeat that will throw off your FFT anyway.
In that case, you can view less of the data in the buffer, then reshape however you want:
y = x[:(x.size // 500) * 500].reshape(-1, 500)
The nice thing here is that if x is contiguous in memory, neither the slice nor the reshape makes a copy: you get a contiguous view into the original buffer.
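The two options mentioned above, truncating and padding, can be sketched as follows. This is a minimal illustration, assuming a made-up 1501-element array in place of x_train[2]; the names chunk, truncated and padded are hypothetical, and zero-padding is only one possible padding choice.

```python
import numpy as np

# Stand-in for the 1501-sample waveform from the question (hypothetical data).
x = np.arange(1501, dtype=np.float64)
chunk = 500

# Option 1: truncate to the largest multiple of `chunk`, then reshape.
n_keep = (x.size // chunk) * chunk
truncated = x[:n_keep].reshape(-1, chunk)
print(truncated.shape)                 # (3, 500)
print(np.shares_memory(truncated, x))  # True: a view, no copy was made

# Option 2: zero-pad up to the next multiple of `chunk`, then reshape.
n_pad = -x.size % chunk                # 499 extra zeros are needed here
padded = np.pad(x, (0, n_pad)).reshape(-1, chunk)
print(padded.shape)                    # (4, 500)
```

Truncation keeps a view of the original buffer, while np.pad always allocates a new array.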

Related

Slice based on numpy.argmin results

Take a float NumPy array of shape (36, 2, 400, 400), where each 400 by 400 slice represents an image. For each pixel, I want the pair of values (second dimension) whose norm is the smallest across the first dimension, so that I end up with an array of shape (2, 400, 400).
With np.argmin(np.linalg.norm(array, axis=1), axis=0) I get, for each of the 400 by 400 pixels, the index along the first dimension, which is almost what I want. Now I want to use these indices to select from the original array along the first dimension, so that I am left with an array of shape (2, 400, 400).
I could loop over all indices and construct the result pixel by pixel, but I am convinced there is a smarter way. Can anyone help me find it?
A minimal reproducible example as requested where distances is the array:
import numpy as np
import matplotlib.pyplot as plt

shape = (400, 400)
centers = np.random.randint(400, size=(36, 2))
distances = np.array([np.indices(shape) - np.array(center)[:, None, None] for center in centers])
nearest_center_index = np.argmin(np.linalg.norm(distances, axis=1), axis=0)
print(distances.shape)
print(nearest_center_index.shape)
plt.imshow(nearest_center_index)
out:
(36, 2, 400, 400)
(400, 400)
With help from the comments, I was able to produce a somewhat ugly answer, which helped me further understand the problem. Let me elaborate. It is possible to flatten the image and the argmin results, and then use advanced indexing with the argmin values and the indices over the image to produce the result.
flatten_indices = nearest_center_index.reshape(400**2)
image_indices = np.arange(400**2)
results = distances.reshape(36, 2, 400**2)[flatten_indices, :, image_indices].reshape(400, 400, 2).transpose(2, 0, 1)
However, I think it is common to have an index array whose shape is a subset of the original dimensions and whose values are indices into another dimension. I would expect a generic method for this kind of selection.
So take an array with n dimensions and shape (x1, x2, ..., xn), together with an array of indices into one dimension, say xi, whose shape is a subset of the original shape that does not contain xi. I would expect a method that selects from the original array using those indices.

The function I was looking for is numpy.take_along_axis().
For this specific example, the only requirement is that nearest_center_index (the output of argmin) has the same number of dimensions as the array being sliced. This can be achieved by passing keepdims=True to both norm and argmin; the result can then be used directly as the second argument of np.take_along_axis. The third argument should be the xi axis (axis 0 in the example).
Without passing the keepdims=True, following the exact example, the stated objective can be achieved by:
result = np.take_along_axis(distances, nearest_center_index[None,None,:,:], 0)[0]
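The keepdims=True variant described above can be sketched like this. It is a small illustration on a reduced (6, 2, 40, 40) array rather than the (36, 2, 400, 400) one from the question; the names rng, norms and nearest are hypothetical, and keepdims on np.argmin assumes NumPy 1.22 or later.

```python
import numpy as np

rng = np.random.default_rng(0)
shape = (40, 40)   # smaller than the 400 x 400 image, to keep the example quick
centers = rng.integers(0, 40, size=(6, 2))
distances = np.array([np.indices(shape) - c[:, None, None] for c in centers])
print(distances.shape)   # (6, 2, 40, 40)

# keepdims=True keeps the reduced axes as length-1 axes, so the index array
# ends up with the same number of dimensions as `distances`.
norms = np.linalg.norm(distances, axis=1, keepdims=True)   # (6, 1, 40, 40)
nearest = np.argmin(norms, axis=0, keepdims=True)          # (1, 1, 40, 40)

# The index array is broadcast over axis 1, and axis 0 is selected per pixel.
result = np.take_along_axis(distances, nearest, axis=0)[0]
print(result.shape)      # (2, 40, 40)
```

For every pixel (i, j), result[:, i, j] is the pair distances[k, :, i, j] for the k that minimizes the norm.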

Why does this Euclidean distance calculation method explode RAM usage?

I'm studying the KNN algorithm to classify images using some material from a 2017 Stanford course. We're given a dataset consisting of many images, later those sets are represented as 2D numpy arrays, and we're supposed to write functions that calculate distances between those images. More specifically, given a 2D array of the test images and a 2D array of the training images, I'm asked to write a L_2 distance function, which takes those two sets as inputs and returns a distance matrix, where every row i represents a test image and every column j represents a training image.
The exercise also asked me to do it without any loops and without using the np.abs function. So I gave it a try:
def compute_distances_no_loops(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using no explicit loops.
    Input / Output: Same as compute_distances_two_loops
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    all_test_subs_sq = (X[:, np.newaxis] - self.X_train)**2
    dists = np.sqrt(np.sum(all_test_subs_sq, axis=2))
    return dists
Apparently that makes Google's Colab environment crash in 6 seconds due to allocating about 60 GB of RAM. I guess I should clarify the training set X_train has a shape of (5000, 3072), and the test set X has shape (500, 3072). I am not sure what happens here that is so RAM intensive, but then again I'm not the smartest guy to figure out space complexity.
I googled a bit and found a solution that works without the need for a NASA computer. It uses the expansion of the squared difference:
dists = np.reshape(np.sum(X**2, axis=1), [num_test,1]) + np.sum(self.X_train**2, axis=1)\
- 2 * np.matmul(X, self.X_train.T)
dists = np.sqrt(dists)
I'm also not really sure why this solution doesn't explode like mine did. I'd really appreciate any insight here, thank you very much for reading.

In the compute_distances_no_loops() function the intermediate array all_test_subs_sq has the shape (500, 5000, 3072), so it consists of 500 * 5000 * 3072 = 7,680,000,000 elements. Assuming that the dtype of X and X_train is float64, each element weighs 8 bytes, so the total size of the array is 61,440,000,000 bytes, i.e. about 61 GB.
The other solution you included avoids this problem since it does not create such a large intermediate array. The shape of np.reshape(np.sum(X**2, axis=1), [num_test,1]) is (500, 1) and the shape of np.sum(self.X_train**2, axis=1) is (5000,). When you add them you obtain an array of the shape (500, 5000). np.matmul(X, self.X_train.T) also produces an array of the same shape.
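To check that both approaches give the same distances, here is a small sketch on made-up data, with sizes (5, 8) and (7, 8) standing in for (500, 3072) and (5000, 3072). The names naive, sq and expanded are hypothetical, and the np.maximum clipping is an extra safeguard that is not part of the code in the question.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))         # hypothetical "test" set
X_train = rng.standard_normal((7, 8))   # hypothetical "training" set

# Broadcasting version: builds a (5, 7, 8) intermediate array.
naive = np.sqrt(((X[:, np.newaxis, :] - X_train) ** 2).sum(axis=2))

# Expanded version: ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
# The largest array created here has shape (5, 7).
sq = (X**2).sum(axis=1)[:, np.newaxis] + (X_train**2).sum(axis=1) - 2 * X @ X_train.T
expanded = np.sqrt(np.maximum(sq, 0))   # clip tiny negatives caused by rounding

print(naive.shape, expanded.shape)
print(np.allclose(naive, expanded))
```

The expanded form never materializes the per-feature differences, so its peak memory scales with num_test * num_train rather than num_test * num_train * num_features.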

The problem is in
all_test_subs_sq = (X[:, np.newaxis] - self.X_train)**2
X[:, np.newaxis] is equivalent to X[:, np.newaxis, :], which has shape (500, 1, 3072). After broadcasting, X[:, np.newaxis] - self.X_train yields a dense (500, 5000, 3072) array, which is humongous: 500 x 5000 x 3072 x 8 bytes ≈ 61.44 GB, since you have np.float64.
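The size of a broadcast result can be estimated before allocating anything. The sketch below uses np.broadcast_shapes (available in NumPy 1.20 and later) with the shapes from the question; the variable names are hypothetical.

```python
import math
import numpy as np

test_shape = (500, 1, 3072)   # shape of X[:, np.newaxis, :]
train_shape = (5000, 3072)    # shape of self.X_train

# Shape of the result of broadcasting the two operands; no data is allocated.
out_shape = np.broadcast_shapes(test_shape, train_shape)
n_bytes = math.prod(out_shape) * np.dtype(np.float64).itemsize

print(out_shape)       # (500, 5000, 3072)
print(n_bytes / 1e9)   # 61.44
```

By comparison, the final (500, 5000) distance matrix needs only 500 * 5000 * 8 = 20,000,000 bytes, about 20 MB.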

How to stack matrices in a single column table

I am trying to store 20 automatically generated matrices in a single-column matrix, so this last matrix would be a 1x20 matrix.
For this I am using numpy and vstack, but it doesn't work; I keep getting the following error:
ValueError: all the input arrays must have same number of dimensions
Even though all the matrices that I'm trying to stack together have the same dimensions (881 x 882).
So I'd like to know what is wrong with this, or whether there is another way to stack all the matrices so that I can easily access any one of them when needed.

Try changing the dimensions with the expand_dims and squeeze functions:
y = np.expand_dims(x, axis=0) # dim 20 become 1x20
y = np.squeeze(x, axis=0) # dim 1x20 become 20
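As an alternative that is not part of the answer above, np.stack joins equally shaped arrays along a new axis, so each matrix stays accessible by index. This is a minimal sketch with made-up (4, 5) matrices standing in for the (881, 882) ones; the names matrices and stacked are hypothetical. The error in the question indicates that at least one input to vstack had a different number of dimensions than the others, so it is worth checking .ndim on each matrix.

```python
import numpy as np

# Twenty hypothetical matrices; (4, 5) stands in for (881, 882).
matrices = [np.full((4, 5), k, dtype=np.float64) for k in range(20)]

# All inputs must share the same shape; a new leading axis indexes the matrices.
stacked = np.stack(matrices, axis=0)
print(stacked.shape)                            # (20, 4, 5)
print(np.array_equal(stacked[7], matrices[7]))  # True
```

A plain Python list of the 20 matrices gives the same easy access if no array-wide operations are needed.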

numpy dtype ValueError: invalid shape in fixed-type tuple - how can I get around it?

I use a custom datatype, e.g. datatype = np.dtype('({:n},{:n})f4'.format(10000,100000)) to read data from a binary file using
np.fromfile(filename, dtype=datatype)
However, defining the datatype using np.dtype gives an error for large datasets, as in the example datatype above:
ValueError: invalid shape in fixed-type tuple: dtype size in bytes must fit into a C int
Initializing an array of that size is no problem: a=np.zeros((10000,100000)).
So my question is: Where does that limitation come from and how can I get around it? I can of course use a loop and read chunks at a time, but maybe there is a more elegant way?

When you specify a dtype of '(M, N)f4' you are effectively specifying the final two dimensions of the output array, e.g.
np.zeros(5, np.dtype('(6, 7)f4')).shape
# (5, 6, 7)
You could achieve the same outcome by simply reading in the data as a 1D array, then reshaping it to your desired shape:
x = np.fromfile(filename, np.float32).reshape(-1, 10000, 100000)
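The read-then-reshape approach can be tried end to end on a small temporary file. In this sketch rows and cols are small stand-ins for 10000 and 100000, and the file name data.bin is hypothetical.

```python
import os
import tempfile
import numpy as np

rows, cols = 3, 4   # small stand-ins for 10000 and 100000
data = np.arange(2 * rows * cols, dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp:
    filename = os.path.join(tmp, "data.bin")   # hypothetical file name
    data.tofile(filename)                      # raw float32 values, no header

    # Read as a flat array, then let -1 infer the number of (rows, cols) records.
    x = np.fromfile(filename, dtype=np.float32).reshape(-1, rows, cols)

print(x.shape)   # (2, 3, 4)
```

The reshape fails with a ValueError if the file does not contain a whole number of (rows, cols) records, which doubles as a sanity check on the file size.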

h5py returning unexpected results in indexing

I'm attempting to fill an h5py dataset with a series of numpy arrays that I generate in sequence so my memory can handle it.
The h5py array is initialised so that the first dimension can have any magnitude,
f.create_dataset('x-data', (1, maxlen, 50), maxshape=(None, maxlen, 50))
After generating each numpy array X, I am using
f['x-data'][alen:alen + len(data),:,:] = X
Where for example, in the first array, alen=0 and len(data)=10056. I then increment alen so the next array will start from where the last one ended.
print f['x-data'][alen:alen + len(data),:,:].shape, alen, len(data)
(1L, 60L, 50L) 0 10056
Does anyone know why the 0:10056 indexing is being interpreted as 1L?

I replicated your example, but on a much smaller scale. I had to do a resize each time I added elements, e.g.
f['xdata'].resize(50,axis=0)
The first time I tried to add a block I got an error:
TypeError: Can't broadcast (10, 20, 10) -> (1, 20, 10)
But subsequent times, when I'd outgrown the allocated space, it failed silently. No error, it just didn't end up storing the new values.
This is for version 2.2.1

I found the answer from a helpful person on the user group.
Passing None in maxshape does not mean that the dataset resizes automatically; it must be resized explicitly each time new input is added. So the first dimension must be increased before adding new data:
x.resize((x.shape[0] + X.shape[0], X.shape[1], X.shape[2]))
y.resize((y.shape[0] + Y.shape[0], Y.shape[1], Y.shape[2]))
The dataset then adds the values correctly.
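The resize-then-write pattern can be sketched as below. It assumes an in-memory file (the core driver with backing_store=False) so that nothing is written to disk; the file name, the batch sizes and the float32 dtype are made up for illustration.

```python
import numpy as np
import h5py

maxlen = 60
# In-memory HDF5 file: the "core" driver without a backing store writes nothing to disk.
with h5py.File("example.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset("x-data", shape=(0, maxlen, 50),
                            maxshape=(None, maxlen, 50), dtype="f4")
    for n_rows in (10, 7):                    # two hypothetical batches
        X = np.ones((n_rows, maxlen, 50), dtype="f4")
        start = dset.shape[0]
        dset.resize(start + n_rows, axis=0)   # grow the first dimension first
        dset[start:start + n_rows] = X        # then write into the new rows
    final_shape = dset.shape

print(final_shape)   # (17, 60, 50)
```

Starting from a first dimension of 0 avoids leaving an unused placeholder row at the beginning of the dataset.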

