Normalize function in Sklearn requires 2D array - python

In linear algebra, a vector is normalized when it is divided by its norm, e.g. the square root of the sum of its squared components (the Euclidean norm).
Yet, the sklearn.preprocessing.normalize method does not accept 1D vectors, only 2D arrays:
"ValueError: Expected 2D array, got 1D array instead"
Why?

normalize works on a data set, not on a single vector, so the "normalize" you have in mind is not quite what this function does: it normalizes each sample (row) of a 2D array independently. If you give it a 2D array with a single row (shape (1, N)), you get your vector normalized in the "normal" fashion.

According to the documentation for sklearn.preprocessing.normalize, the parameter X is the data to normalize, sample by sample, and has shape [n_samples, n_features]. The function normalize performs this operation on a single array-like dataset, using either the L1 or L2 norm.
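For example, to normalize a single vector you can reshape it into a one-row 2D array first (a minimal sketch; the variable names are just for illustration):
import numpy as np
from sklearn.preprocessing import normalize

v = np.array([3.0, 4.0])            # 1D vector, shape (2,)
v2d = v.reshape(1, -1)              # one sample with two features, shape (1, 2)
print(normalize(v2d, norm='l2'))    # [[0.6 0.8]]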

MATLAB/Python array difference?

I got the following:
redIdx: a 2x1 matrix with values (289485, 289486).
image: 366x791x3 uint8 matrix (an image).
image2: a zeros matrix with the same shape as image.
In MATLAB, if I do image2(redIdx) it returns a 2x1 matrix with values (0,0) and if I do image(redIdx) it returns a 2x1 matrix with values (94, 83).
But in Python, if I do image2[redIdx] or image[redIdx], it returns the following error: index 289485 is out of bounds for axis 0 with size 366.
How can I get the same result as MATLAB?
MATLAB, when indexing an array with a single index (as opposed to multiple indices), uses linear indexing. Python, in the same situation, uses the index to index into the first dimension, returning a slice. The fact that redIdx contains multiple values is irrelevant; it's a 1D indexing operation.
To replicate MATLAB's linear indexing in Python, you can flatten the array in column-major (Fortran) order, which is the order MATLAB uses, then index:
image.flatten('F')[redIdx]
This Q&A shows how to compute indices from the single linear index, which would be a more complex alternative to the above.
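As a rough sketch of that alternative (assuming redIdx already holds zero-based indices; order='F' matches MATLAB's column-major layout):
import numpy as np
# convert the MATLAB-style linear indices into per-axis indices
rows, cols, chans = np.unravel_index(redIdx, image.shape, order='F')
values = image[rows, cols, chans]   # same values as image.flatten('F')[redIdx]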

Kmeans Clustering with n-dimensional arrays in Sklearn python

I have an array in Python composed of several different arrays with different dimensions, for instance:
KB = [[[1,2],[2,4],[2,4,5,3],[5,4,3,2,1]],
      [[1,2],[2,4],[2,4,5,3],[5,4,3,2,1]], ........]
Basically, each entry in that array contains a fixed number of sub-arrays, and those sub-arrays can have different lengths (the first sub-array has 2 elements, the third has 4 elements, and so on).
Now, using sklearn in python with kmeans I obtained an error like this:
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: setting an array element with a sequence.
This is due to the differing lengths of the sub-arrays within the main array.
How can I compute clusters for a given input containing sub-arrays with different dimensions?
K-means requires computing means.
What would the mean vector of
[1,2]
[3,4,5,6]
be? In order to compute centroids, k-means requires all points to live in a common d-dimensional vector space.
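One common workaround, which is not part of the answer above, is to map every entry onto a fixed-length feature vector first, for example by concatenating its sub-arrays and zero-padding to a common length (the helper and padding scheme below are an illustrative assumption, not the only option):
import numpy as np
from sklearn.cluster import KMeans

KB = [[[1, 2], [2, 4], [2, 4, 5, 3], [5, 4, 3, 2, 1]],
      [[1, 2], [2, 4], [2, 4, 5, 3], [5, 4, 3, 2, 1]]]

def flatten_entry(entry, length):
    # concatenate the sub-arrays of one entry and zero-pad to a fixed length
    flat = np.concatenate([np.asarray(sub, dtype=float) for sub in entry])
    return np.pad(flat, (0, length - flat.size))

max_len = max(sum(len(sub) for sub in entry) for entry in KB)
X = np.vstack([flatten_entry(entry, max_len) for entry in KB])   # shape (n_entries, max_len)
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)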

One-hot representation of a matrix in numpy

What is the easiest/smartest way of going from a matrix of values to a one-hot representation of the same thing as a 3D tensor? For example, if the matrix is the index after argmax in a tensor like:
indices = numpy.argmax(mytensor, axis=2)
where mytensor is 3D [x,y,z] and indices will naturally be [x,y]. Now you want to go to a 3D [x,y,z] tensor that has 1s in the places of the maxes along axis=2 and 0s everywhere else.
P.S. I know the answer for vector to 1-hot matrix, but this is matrix to 1-hot tensor.
One of the perfect setups to use broadcasting -
indices[...,None] == np.arange(mytensor.shape[-1])
If you need it as an int array of 0s and 1s, append .astype(int).
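A minimal end-to-end sketch with a small random tensor (the shapes are just for illustration):
import numpy as np

mytensor = np.random.rand(2, 3, 4)                  # shape (x, y, z)
indices = np.argmax(mytensor, axis=2)               # shape (x, y)
onehot = (indices[..., None] == np.arange(mytensor.shape[-1])).astype(int)
print(onehot.shape)                                 # (2, 3, 4)
print(np.array_equal(np.argmax(onehot, axis=2), indices))   # True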

What does (n,) mean in the context of numpy and vectors?

I've tried searching StackOverflow, googling, and even using symbolhound to do character searches, but was unable to find an answer. Specifically, I'm confused about Ch. 1 of Nielsen's Neural Networks and Deep Learning, where he says "It is assumed that the input a is an (n, 1) Numpy ndarray, not a (n,) vector."
At first I thought (n,) referred to the orientation of the array - so it might refer to a one-column vector as opposed to a vector with only one row. But then I don't see why we need (n,) and (n, 1) both - they seem to say the same thing. I know I'm misunderstanding something but am unsure.
For reference a refers to a vector of activations that will be input to a given layer of a neural network, before being transformed by the weights and biases to produce the output vector of activations for the next layer.
EDIT: This question equivocates between a "one-column vector" (there's no such thing) and a "one-column matrix" (does actually exist). Same for "one-row vector" and "one-row matrix".
A vector is only a list of numbers, or (equivalently) a list of scalar transformations on the basis vectors of a vector space. A vector might look like a matrix when we write it out, if it only has one row (or one column). Confusingly, we will sometimes refer to a "vector of activations" but actually mean "a single-row matrix of activation values transposed so that it is a single-column."
Be aware that in neither case are we discussing a one-dimensional vector, which would be a vector defined by only one number (unless, trivially, n==1, in which case the concept of a "column" or "row" distinction would be meaningless).
In numpy an array can have a number of different dimensions, 0, 1, 2 etc.
The typical 2d array has shape (n,m) (this is a Python tuple). We tend to describe this as having n rows and m columns. So a (n,1) array has just 1 column, and a (1,m) array has just 1 row.
But because an array may have just 1 dimension, it is possible to have a shape (n,) (Python notation for a 1 element tuple: see here for more).
For many purposes (n,), (1,n), (n,1) arrays are equivalent (also (1,n,1,1) (4d)). They all have n terms, and can be reshaped to each other.
But sometimes that extra 1 dimension matters. A (1,m) array can multiply a (n,1) array to produce a (n,m) array. A (n,1) array can be indexed like a (n,m) array, with 2 indices, e.g. x[:,0], whereas a (n,) array only accepts x[0].
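A quick numpy illustration of those differences (a minimal sketch):
import numpy as np

a = np.arange(3)           # shape (3,)  -- 1d, neither a row nor a column
col = a.reshape(3, 1)      # shape (3, 1) -- single-column 2d array
row = a.reshape(1, 3)      # shape (1, 3) -- single-row 2d array
print((col * row).shape)   # (3, 3), by broadcasting
print(col[:, 0], a[0])     # a 2d array takes two indices, a 1d array takes one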
MATLAB matrices are always 2d (or higher). So people transferring ideas from MATLAB tend to expect 2 dimensions. There is a np.matrix subclass that is supposed to imitate that.
For numpy programmers the distinctions between vector, row vector, column vector, matrix are loose and relatively unimportant. Or the use is derived from the application rather than from numpy itself. I think that's what's happening with this network book - the notation and expectations come from outside of numpy.
See as well this answer for how to interpret the shapes with respect to the data stored in ndarrays. It also provides insight on how to use .reshape: https://stackoverflow.com/a/22074424/3277902
(n,) is a tuple of length 1, whose only element is n. (The syntax isn't (n) because that's just n instead of making a tuple.)
If an array has shape (n,), that means it's a 1-dimensional array with a length of n along its only dimension. It's not a row vector or a column vector; it doesn't have rows or columns. It's just a vector.

Decomposing 3rd Order Tensor in Python

I have a tensor in the shape (n_samples, n_steps, n_features). I want to decompose this into a tensor of shape (n_samples, n_components).
I need a method of decomposition that has a .fit(...) so that I can apply the same decomposition to a new batch of samples. I have been looking at Tucker Decomposition and PARAFAC Decomposition, but neither has that crucial .fit(...) and .transform(...) functionality. (Or at least I think they don't?)
I could use PCA and train it on a representative sample and then call .transform(...) on the remaining samples, but I would rather have some sort of tensor decomposition that can handle all of the samples at once, so as to get a better idea of the differences between each sample.
This is what I mean by "tensor":
In fact tensors are merely a generalisation of scalars and vectors; a scalar is a zero rank tensor, and a vector is a first rank tensor. The rank (or order) of a tensor is defined by the number of directions (and hence the dimensionality of the array) required to describe it.
If you have any questions, please ask, I'll try to clarify my problem if needed.
EDIT: The best solution would be some type of kernel but I have yet to find a kernel that can deal with n-rank Tensors and not just 2D data
You can do this using the development (master) version of TensorLy. Specifically, you can use the new partial_tucker function (it is not yet updated in the documentation...).
Note that the following solution preserves the structure of the tensor, i.e. a tensor of shape (n_samples, n_steps, n_features) is decomposed into a (smaller) tensor of shape (n_samples, n_components_1, n_components_2).
Code
Short answer: this is a very basic class that does what you want (and it would work on tensors of arbitrary order).
import tensorly as tl
from tensorly.decomposition._tucker import partial_tucker

class TensorPCA:
    def __init__(self, ranks, modes):
        self.ranks = ranks
        self.modes = modes

    def fit(self, tensor):
        self.core, self.factors = partial_tucker(tensor, modes=self.modes, ranks=self.ranks)
        return self

    def transform(self, tensor):
        return tl.tenalg.multi_mode_dot(tensor, self.factors, modes=self.modes, transpose=True)
Usage
Given an input tensor, you can use the previous class by first instantiating it with the desired ranks (size of the core tensor) and modes on which to perform the decomposition (in your 3D case, 1 and 2 since indexing starts at zero):
tpca = TensorPCA(ranks=[4, 5], modes=[1, 2])
tpca.fit(tensor)
Given a new tensor originally called new_tensor, you can project it using the transform method:
tpca.transform(new_tensor)
Explanation
Let's go through the code with an example: first let's import the necessary bits:
import numpy as np
import tensorly as tl
from tensorly.decomposition._tucker import partial_tucker
We then generate a random tensor:
tensor = np.random.random((10, 11, 12))
The next step is to decompose it along its second and third dimensions, or modes (as the first dimension corresponds to the samples):
core, factors = partial_tucker(tensor, modes=[1, 2], ranks=[4, 5])
The core corresponds to the transformed input tensor while factors is a list of two projection matrices, one for the second mode and one for the third mode. Given a new tensor, you can project it to the same subspace (the transform method) by projecting each of its last two dimensions:
tl.tenalg.multi_mode_dot(tensor, factors, modes=[1, 2], transpose=True)
The transposition here is equivalent to an inverse since the factors are orthogonal.
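To make the shapes concrete, for the random (10, 11, 12) tensor and ranks [4, 5] used above one would expect:
print(core.shape)                  # (10, 4, 5): mode 0 untouched, modes 1 and 2 compressed
print([f.shape for f in factors])  # [(11, 4), (12, 5)]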
Finally, a note on the terminology: in general, even though it is sometimes done, it is probably best not to use the terms order and rank of a tensor interchangeably. The order of a tensor is simply its number of dimensions, while the rank of a tensor is usually a much more complicated notion, which you could think of as a generalization of the notion of matrix rank.
