K-means clustering with n-dimensional arrays in sklearn (Python)

I have an array in Python composed of several different arrays with different dimensions, for instance:
KB = [[[1,2],[2,4],[2,4,5,3],[5,4,3,2,1]],
      [[1,2],[2,4],[2,4,5,3],[5,4,3,2,1]], ...]
Basically, each entry in that array contains a fixed number of sub-arrays, which may have different lengths (the first sub-array has 2 elements, the third has 4, and so on).
Now, running scikit-learn's KMeans on this data, I get an error like this:
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: setting an array element with a sequence.
This is due to the differing lengths of the sub-arrays within the main array.
How can I compute clusters for a given input containing sub-arrays with different dimensions?

K-means requires computing means.
What would be the mean vector of
[1,2]
and
[3,4,5,6]?
In order to compute centroids, k-means requires a d-dimensional vector space.
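If the sub-array positions are comparable, one possible workaround is to zero-pad each sub-array to a common length and concatenate, so every entry becomes one fixed-length feature vector. A minimal sketch, assuming illustrative data and that zero-padding is meaningful for your features (it may not be):
import numpy as np
from sklearn.cluster import KMeans

# Ragged data in the same layout as KB above (values illustrative)
KB = [[[1, 2], [2, 4], [2, 4, 5, 3], [5, 4, 3, 2, 1]],
      [[1, 3], [2, 5], [3, 4, 5, 3], [5, 4, 3, 2, 2]],
      [[9, 8], [8, 7], [9, 9, 8, 7], [7, 8, 9, 9, 8]]]

max_len = max(len(sub) for entry in KB for sub in entry)

def to_fixed_vector(entry):
    # Zero-pad every sub-array to max_len, then concatenate them,
    # so each entry becomes one fixed-length feature vector.
    parts = [np.pad(np.asarray(sub, dtype=float),
                    (0, max_len - len(sub)), mode='constant')
             for sub in entry]
    return np.concatenate(parts)

X = np.array([to_fixed_vector(entry) for entry in KB])  # shape (3, 20)
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
print(labels)
Whether padding (versus, say, summarizing each sub-array by fixed statistics) is appropriate depends entirely on what the sub-arrays represent.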

Related

Flatten only part of a dataframe shape for Euclidean calculation?

I have a data frame with shape:
(20,30,1024)
I want to find the Euclidean distance between every entry and every other entry in the dataframe, ideally non-redundantly (i.e. don't compute the distance between row 1 and row 5 and then again between row 5 and row 1; I'm not there yet). I have this code:
from scipy.spatial.distance import pdist,squareform
distances = pdist(df_test,metric='euclidean')
dist_matrix = squareform(distances)
print(dist_matrix)
The error says:
A 2-dimensional array must be passed.
So I guess I want to convert my matrix from shape (20,30,1024) to (20,30720), and then calculate the pdist/squareform between the rows (i.e. 20 rows of vectors that are 30720 in length).
I know that I can use df_test[0:20].flatten().tolist()
But that completely flattens my matrix; the output shape was (1,614400).
Can someone show me how to convert a shape from (20,30,1024) to (20,30720), or tell me if I'm not going about this the right way?
The ultimate goal is to calculate the Euclidean distance between all non-redundant pairs in a data set, but the data set is big, so I need to do this as efficiently as possible, without duplicating calculations.
The most straightforward way to reshape that I can think of, according to how you described the problem, is:
df_test.values.reshape(20, -1)
By calling .values, you retrieve the dataframe's data as a NumPy array. From there, .reshape finishes the job. Since you need a 2D array, you provide the size of the first dimension (in your case, 20) and pass -1 for the second: NumPy will calculate its size for you (here, by multiplying the sizes of the remaining dimensions of the original 3D array).
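For completeness, a minimal end-to-end sketch with stand-in random data (note that a pandas DataFrame is strictly 2-D, so a (20,30,1024) object is presumably already a NumPy array, in which case .values is unnecessary):
import numpy as np
from scipy.spatial.distance import pdist, squareform

df_test = np.random.rand(20, 30, 1024)        # stand-in for the real data

flat = df_test.reshape(20, -1)                # shape (20, 30720)
distances = pdist(flat, metric='euclidean')   # 190 non-redundant pairs
dist_matrix = squareform(distances)           # symmetric (20, 20) matrix
print(dist_matrix.shape)
Note that pdist already computes each pair only once, which addresses the non-redundancy concern; squareform merely expands the result into the full symmetric matrix.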

Applying function on multiple dimensions of higher dimensional array

Suppose you have a higher-dimensional array (3 dimensions or greater) which is composed of a series of 2d images. If this array is called x, then a 2d image will be represented as x[0,0,:,:]. Now what I want to do is apply a function that takes a 2d image and outputs a scalar to this higher-dimensional array, so that I convert the original array into one with 2 fewer dimensions. How would I do such a thing?
In other words, what is the faster numpy way of doing this: np.array([[f(x[i,j,:,:]) for j in range(x.shape[1])] for i in range(x.shape[0])]) for a list of axes and some function f that takes in an array.
I've looked at numpy.apply_along_axis, but that only passes 1d slices to the function. numpy.apply_over_axes also doesn't work, since it doesn't reduce the number of dimensions handed to the function (it gives my function a 4d array, not a 2d array I can work with). numpy.vectorize doesn't work because it only ever applies to one element at a time.
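One straightforward approach (a sketch, not necessarily the fastest) is to collapse all leading axes into one, loop over the resulting stack of 2d images, and restore the leading shape afterwards:
import numpy as np

def apply_to_images(f, x):
    # Collapse all leading axes into one, apply f to each 2d image,
    # then restore the leading shape on the scalar results.
    lead_shape = x.shape[:-2]
    flat = x.reshape(-1, *x.shape[-2:])
    return np.array([f(img) for img in flat]).reshape(lead_shape)

x = np.random.rand(3, 4, 5, 6)
out = apply_to_images(np.std, x)
print(out.shape)   # (3, 4)
This still loops in Python; truly vectorized speed requires expressing f itself in terms of array operations, e.g. x.mean(axis=(-2, -1)) instead of applying np.mean image by image.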

Error getting more than two eigenvalues in PCA

I am trying to perform PCA from scratch on a subset of MNIST data (digits 0 and 1) using Python.
(NOTE: x_train_0_scaled has dimensions 5923x784, where 5923 is the number of images and 784 is the 28*28 flattened pixel values.)
Here's my code to find eigenvalues:
# matrix multiplication using numpy
covar_matrix = np.matmul(x_train_0_scaled.T, x_train_0_scaled)
print("The shape of variance matrix = ", covar_matrix.shape)
# the parameter 'eigvals' is defined as (low index, high index)
# the eigh function returns the eigenvalues in ascending order
# this call generates only the top 2 eigenvalues (indices 782 and 783)
values, vectors = eigh(covar_matrix, eigvals=(782, 783))
print("Shape of eigen vectors = ", vectors.shape)
However when I try to get more than two eigenvalues, I get this error:
values, vectors = eigh(covar_matrix, eigvals=(782, 783, 781))
File "/usr/local/lib/python3.8/site-packages/scipy/linalg/decomp.py", line 484, in eigh
lo, hi = [int(x) for x in subset_by_index]
ValueError: too many values to unpack (expected 2)
The reason I want more than two eigenvectors is that, as per the image below, I guess my data is not clearly separable, so I want to find more dimensions to plot my data on. Is my intuition correct?
The issue is resolved. eigvals takes its arguments as an inclusive (lo, hi) index range. So instead of passing (782, 783, 781) I needed to specify lo=781 and hi=783 to get the top 3 eigenvalues.
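A sketch of the corrected call, continuing from the covar_matrix computed above:
from scipy.linalg import eigh

# (lo, hi) is an inclusive index range, so this returns the top 3 eigenpairs
values, vectors = eigh(covar_matrix, eigvals=(781, 783))
print(vectors.shape)   # (784, 3)

# In SciPy >= 1.5 the 'eigvals' keyword is deprecated in favour of:
# values, vectors = eigh(covar_matrix, subset_by_index=[781, 783])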

Normalize function in Sklearn requires 2D array

In linear algebra, vectors are normalized when they are divided by their norm, that is, the square root of the sum of their squared components.
Yet, the sklearn.preprocessing.normalize method does not accept 1D vectors, only 2D arrays:
"ValueError: Expected 2D array, got 1D array instead"
Why?
normalize works on a data set, not on a single vector; you have the wrong definition of "normalize" for this function. It normalizes each sample (row) of the data set. If you give it a 2D array with a single row (shape [1, N]), you get your vector normalized in the "normal" fashion.
According to the documentation for sklearn.preprocessing.normalize, the parameter X is the data to normalize, sample by sample, and has the shape [n_samples, n_features]. The function performs this operation on a single array-like dataset, using either the L1 or L2 norm.
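A minimal sketch of both routes for a single vector:
import numpy as np
from sklearn.preprocessing import normalize

v = np.array([3.0, 4.0])

# sklearn route: wrap the vector as a single sample (one row) first
v_unit = normalize(v.reshape(1, -1))    # array([[0.6, 0.8]])

# plain-numpy route: divide by the L2 norm directly
v_unit2 = v / np.linalg.norm(v)         # array([0.6, 0.8])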

What does (n,) mean in the context of numpy and vectors?

I've tried searching StackOverflow, googling, and even using symbolhound to do character searches, but was unable to find an answer. Specifically, I'm confused about Ch. 1 of Nielsen's Neural Networks and Deep Learning, where he says "It is assumed that the input a is an (n, 1) Numpy ndarray, not a (n,) vector."
At first I thought (n,) referred to the orientation of the array - so it might refer to a one-column vector as opposed to a vector with only one row. But then I don't see why we need (n,) and (n, 1) both - they seem to say the same thing. I know I'm misunderstanding something but am unsure.
For reference a refers to a vector of activations that will be input to a given layer of a neural network, before being transformed by the weights and biases to produce the output vector of activations for the next layer.
EDIT: This question equivocates between a "one-column vector" (there is no such thing) and a "one-column matrix" (which does exist). The same goes for "one-row vector" and "one-row matrix".
A vector is only a list of numbers, or (equivalently) a list of scalar transformations on the basis vectors of a vector space. A vector might look like a matrix when we write it out, if it only has one row (or one column). Confusingly, we will sometimes refer to a "vector of activations" but actually mean "a single-row matrix of activation values transposed so that it is a single-column."
Be aware that in neither case are we discussing a one-dimensional vector, which would be a vector defined by only one number (unless, trivially, n==1, in which case the concept of a "column" or "row" distinction would be meaningless).
In numpy an array can have a number of different dimensions, 0, 1, 2 etc.
The typical 2d array has shape (n,m) (a Python tuple). We tend to describe this as having n rows and m columns. So a (n,1) array has just 1 column, and a (1,m) array has 1 row.
But because an array may have just 1 dimension, it is possible to have a shape of (n,) (Python notation for a one-element tuple).
For many purposes (n,), (1,n), (n,1) arrays are equivalent (also (1,n,1,1) (4d)). They all have n terms, and can be reshaped to each other.
But sometimes that extra 1 dimension matters. A (1,m) array can multiply a (n,1) array to produce a (n,m) array (by broadcasting). A (n,1) array can be indexed like a (n,m) array, with 2 indices, e.g. x[:,0], whereas a (n,) array only accepts x[0].
MATLAB matrices are always 2d (or higher), so people transferring ideas from MATLAB tend to expect 2 dimensions. There is an np.matrix subclass that is supposed to imitate that.
For numpy programmers the distinctions between vector, row vector, column vector, matrix are loose and relatively unimportant. Or the use is derived from the application rather than from numpy itself. I think that's what's happening with this network book - the notation and expectations come from outside of numpy.
See also this answer for how to interpret shapes with respect to the data stored in ndarrays; it also provides insight into how to use .reshape: https://stackoverflow.com/a/22074424/3277902
(n,) is a tuple of length 1, whose only element is n. (The syntax isn't (n) because that's just n instead of making a tuple.)
If an array has shape (n,), that means it's a 1-dimensional array with a length of n along its only dimension. It's not a row vector or a column vector; it doesn't have rows or columns. It's just a vector.
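A short demonstration of the distinction:
import numpy as np

a = np.arange(3)         # shape (3,): 1-D, neither row nor column
b = a.reshape(3, 1)      # shape (3, 1): a one-column 2-D array
c = a.reshape(1, 3)      # shape (1, 3): a one-row 2-D array

print(a.shape, b.shape, c.shape)   # (3,) (3, 1) (1, 3)

# The extra dimension matters for broadcasting and indexing:
print((b * c).shape)     # (3, 3) outer product via broadcasting
print(b[:, 0], a[0])     # b needs two indices; a takes just one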
