I've seen another StackOverflow thread discussing the various implementations for calculating the Euclidean norm, and I'm having trouble seeing why/how a particular implementation works.
The code is found in an implementation of the MMD metric: https://github.com/josipd/torch-two-sample/blob/master/torch_two_sample/statistics_diff.py
Here is some beginning boilerplate:
import torch
sample_1, sample_2 = torch.ones((10,2)), torch.zeros((10,2))
The next part is where we pick up from the code linked above. I'm unsure why the samples are concatenated together:
sample_12 = torch.cat((sample_1, sample_2), 0)
distances = pdist(sample_12, sample_12, norm=2)
and are then passed to the pdist function:
def pdist(sample_1, sample_2, norm=2, eps=1e-5):
    r"""Compute the matrix of all squared pairwise distances.
    Arguments
    ---------
    sample_1 : torch.Tensor or Variable
        The first sample, should be of shape ``(n_1, d)``.
    sample_2 : torch.Tensor or Variable
        The second sample, should be of shape ``(n_2, d)``.
    norm : float
        The l_p norm to be used.
    Returns
    -------
    torch.Tensor or Variable
        Matrix of shape (n_1, n_2). The [i, j]-th entry is equal to
        ``|| sample_1[i, :] - sample_2[j, :] ||_p``."""
here we get to the meat of the calculation
    n_1, n_2 = sample_1.size(0), sample_2.size(0)
    norm = float(norm)
    if norm == 2.:
        norms_1 = torch.sum(sample_1**2, dim=1, keepdim=True)
        norms_2 = torch.sum(sample_2**2, dim=1, keepdim=True)
        norms = (norms_1.expand(n_1, n_2) +
                 norms_2.transpose(0, 1).expand(n_1, n_2))
        distances_squared = norms - 2 * sample_1.mm(sample_2.t())
        return torch.sqrt(eps + torch.abs(distances_squared))
I am at a loss as to why the Euclidean norm would be calculated this way. Any insight would be greatly appreciated.
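Let's walk through this block of code step by step. The Euclidean distance between two vectors x and y, i.e., the L2 norm of their difference, is \|x - y\|_2 = \sqrt{\sum_k (x_k - y_k)^2}.
Let's consider the simplest case. We have two samples, a and b.
Sample a has two vectors, [a00, a01] and [a10, a11]. Likewise, sample b has [b00, b01] and [b10, b11]. Let's first calculate the squared norms: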
n1, n2 = a.size(0), b.size(0) # here both n1 and n2 have the value 2
norm1 = torch.sum(a**2, dim=1)
norm2 = torch.sum(b**2, dim=1)
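Now we get norm1 = [a00^2 + a01^2, a10^2 + a11^2] and norm2 = [b00^2 + b01^2, b10^2 + b11^2], i.e. the squared norm of each row.
Next, we have norms_1.expand(n_1, n_2) and norms_2.transpose(0, 1).expand(n_1, n_2)
Note that the norms of b are transposed, so they run along the columns. The sum of the two gives norms, whose [i, j] entry is ||a_i||^2 + ||b_j||^2.
sample_1.mm(sample_2.t()) is the matrix product of the two samples; its [i, j] entry is the dot product a_i · b_j.
Therefore, after the operation
distances_squared = norms - 2 * sample_1.mm(sample_2.t())
you get a matrix whose [i, j] entry is ||a_i||^2 + ||b_j||^2 - 2 a_i · b_j = ||a_i - b_j||^2, the squared distance between row i of a and row j of b.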
In the end, the last step is taking the square root of every element in the matrix.
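To check the expansion numerically, here is a minimal sketch that uses NumPy instead of torch (a swapped-in library, chosen only to keep the example light). The function names `pairwise_l2` and `pairwise_l2_direct` are illustrative and not part of the linked repository.

```python
import numpy as np

def pairwise_l2(a, b):
    """Pairwise Euclidean distances via ||a_i||^2 + ||b_j||^2 - 2 a_i . b_j."""
    norms_a = np.sum(a**2, axis=1, keepdims=True)      # shape (n_1, 1)
    norms_b = np.sum(b**2, axis=1, keepdims=True).T    # shape (1, n_2)
    squared = norms_a + norms_b - 2 * a @ b.T          # broadcasts to (n_1, n_2)
    # Rounding can leave tiny negative values, so clip before the square root.
    return np.sqrt(np.maximum(squared, 0.0))

def pairwise_l2_direct(a, b):
    """Reference: form every difference a_i - b_j explicitly."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt(np.sum(diff**2, axis=2))

rng = np.random.default_rng(0)
a = rng.normal(size=(4, 3))
b = rng.normal(size=(5, 3))
print(np.allclose(pairwise_l2(a, b), pairwise_l2_direct(a, b)))
```

The expanded form needs only one matrix product and never materializes the (n_1, n_2, d) array of differences, which is probably why the library computes the distance this way.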
Related
In a linear program I am minimizing the distance between weighted input vectors and a target vector. I used SciPy to compute values for the weights I need. Currently they are between zero and one, but I'd like them to be zero if they are smaller than .2, for example, so x_i should be 0 or in [.2; 1]. I was pointed to mixed integer linear programming, but I still can't find any approach for my problem. How can I fix this?
tl;dr: I want to use (0,0) or (.3,1) as bounds for each x. How do I implement this?
Here is my SciPy code:
import numpy as np
from scipy.optimize import minimize

# minimize the distance between weighted input vectors and a target vector
def milp_objective_function(weights):
    scaled_matrix = input_matrix * weights[:, np.newaxis]  # scale input_matrix columns by weights
    sum_vector = sum(scaled_matrix)  # sum weighted_input_matrix columns
    difference_vector = sum_vector - target_vector
    return np.sqrt(difference_vector.dot(difference_vector))  # return the distance between the sum_vector and the target_vector

# sum of weights should equal 100%
def milp_constraint(weights):
    return sum(weights) - 1

def main():
    # bounds should be 0 or [.2; 1] -> mixed integer linear programming?
    weight_bounds = tuple([(0, 1) for i in input_matrix])
    # random guess, will implement later
    initial_guess = milp_guess_weights()
    constraint_obj = {'type': 'eq', 'fun': milp_constraint}
    result = minimize(milp_objective_function, x0=initial_guess, bounds=weight_bounds, constraints=constraint_obj)
Variables that are in {0} ∪ [L,U] are called semi-continuous variables. Advanced MIP solvers have built-in support for these types of variables.
Note that SciPy did not have a MIP solver at all when this was written (scipy.optimize.milp was added later, in SciPy 1.9).
I also want to note that if your MIP solver does not support semi-continuous variables you can simulate them with binary variables:
L ⋅ δ(i) ≤ x(i) ≤ U ⋅ δ(i)
δ(i) ∈ {0,1}
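To see why the two inequalities force x into {0} ∪ [L,U], here is a small sketch in plain Python that checks feasibility by trying both values of the binary variable. The helper names `is_feasible` and `allowed`, and the bounds 0.2 and 1.0, are illustrative assumptions. It is not a solver, only a check of the formulation.

```python
def is_feasible(x, delta, lower, upper, tol=1e-12):
    """Check lower*delta <= x <= upper*delta for a binary delta."""
    return delta in (0, 1) and lower * delta - tol <= x <= upper * delta + tol

def allowed(x, lower, upper):
    """x is allowed if some binary delta makes the pair (x, delta) feasible."""
    return any(is_feasible(x, d, lower, upper) for d in (0, 1))

L, U = 0.2, 1.0
for x in (0.0, 0.1, 0.2, 0.6, 1.0, 1.1):
    print(x, allowed(x, L, U))
```

With δ = 0 both bounds collapse to 0, so x must be 0. With δ = 1 the bounds become L ≤ x ≤ U. Values strictly between 0 and L are infeasible for either choice.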
I'm working on an assignment where I am tasked to implement PCA in Python for an online course. Unfortunately, when I try to run a comparison (provided by the course) between my implementation and SKLearn's, my results appear to differ too greatly.
After many hours of review, I am still unsure where it is going wrong. If someone could take a look and determine what step I have coded or interpreted incorrectly, I would greatly appreciate it.
import numpy as np

def normalize(X):
    """
    Normalize the given dataset X to have zero mean.
    Args:
        X: ndarray, dataset of shape (N,D)
    Returns:
        (Xbar, mean): tuple of ndarray, Xbar is the normalized dataset
        with mean 0; mean is the sample mean of the dataset.
    Note:
        You will encounter dimensions where the standard deviation is zero.
        For those ones, the process of normalization results in normalized data with NaN entries.
        We can handle this by setting the std = 1 for those dimensions when doing normalization.
    """
    # YOUR CODE HERE
    ### Uncomment and modify the code below
    mu = np.mean(X, axis = 0) # Setting axis = 0 will compute means column-wise. Setting it to 1 will compute the mean across rows.
    std = np.std(X, axis = 0) # Computing the std dev column wise using axis = 0.
    std_filled = std.copy()
    std_filled[std == 0] = 1
    # Compute the normalized data as Xbar
    Xbar = (X - mu)/std_filled
    return Xbar, mu, # std_filled
def eig(S):
    """
    Compute the eigenvalues and corresponding unit eigenvectors for the covariance matrix S.
    Args:
        S: ndarray, covariance matrix
    Returns:
        (eigvals, eigvecs): ndarray, the eigenvalues and eigenvectors
    Note:
        the eigenvals and eigenvecs should be sorted in descending
        order of the eigen values
    """
    # YOUR CODE HERE
    # Uncomment and modify the code below
    # Compute the eigenvalues and eigenvectors
    # You can use library routines in `np.linalg.*` https://numpy.org/doc/stable/reference/routines.linalg.html for this
    eigvals, eigvecs = np.linalg.eig(S)
    # The eigenvalues and eigenvectors need to be sorted in descending order according to the eigenvalues
    # We will use `np.argsort` (https://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html) to find a permutation of the indices
    # of eigvals that will sort eigvals in ascending order and then find the descending order via [::-1], which reverse the indices
    sort_indices = np.argsort(eigvals)[::-1]
    # Notice that we are sorting the columns (not rows) of eigvecs since the columns represent the eigenvectors.
    return eigvals[sort_indices], eigvecs[:, sort_indices]
def projection_matrix(B):
    """Compute the projection matrix onto the space spanned by the columns of `B`
    Args:
        B: ndarray of dimension (D, M), the basis for the subspace
    Returns:
        P: the projection matrix
    """
    # YOUR CODE HERE
    P = B @ (np.linalg.inv(B.T @ B)) @ B.T
    return P
def select_components(eig_vals, eig_vecs, num_components):
    """
    Selects the n components desired for projecting the data upon.
    Args:
        eig_vals: The eigenvalues sorted in descending order of magnitude.
        eig_vecs: The eigenvectors sorted in order relative to that of the eigenvalues.
        num_components: the number of principal components to use.
    Returns:
        The number of desired components to keep for projection of the data upon.
    """
    principal_vals, principal_components = eig_vals[:num_components], eig_vecs[:, range(num_components)]
    return principal_vals, principal_components
def PCA(X, num_components):
    """
    Projects normalized data onto the 'n' desired principal components.
    Args:
        X: ndarray of size (N, D), where D is the dimension of the data,
           and N is the number of datapoints
        num_components: the number of principal components to use.
    Returns:
        the reconstructed data, the sample mean of the X, principal values
        and principal components
    """
    # Normalize to have mean 0 and variance 1.
    Z, mean_vec = normalize(X)
    # Calculate the covariance matrix
    S = np.cov(Z, rowvar=False, bias=True) # Set rowvar = False to treat columns as variables. Set bias = True to ensure normalization is done with N and not N-1
    # Calculate the (unit) eigenvectors and eigenvalues of S. Sort them in descending order of importance relative to the magnitude of the eigenvalues.
    eig_vals, eig_vecs = eig(S)
    # Keep only the n largest Principal Components of the sorted unit eigenvectors.
    principal_vals, principal_components = select_components(eig_vals, eig_vecs, num_components)
    # Compute the projection matrix using only the n largest Principal Components of the sorted unit eigenvectors, where n = num_components.
    #P = projection_matrix(eig_vecs[:, :num_components])
    P = projection_matrix(principal_components)
    # Reconstruct the data by using the projection matrix to project the data onto the principal component vectors we've kept
    X_reconst = (P @ X.T).T
    return X_reconst, mean_vec, principal_vals, principal_components
And here is the test case I'm supposed to pass:
random = np.random.RandomState(0)
X = random.randn(10, 5)
from sklearn.decomposition import PCA as SKPCA
for num_component in range(1, 4):
    # We can compute a standard solution given by scikit-learn's implementation of PCA
    pca = SKPCA(n_components=num_component, svd_solver="full")
    sklearn_reconst = pca.inverse_transform(pca.fit_transform(X))
    reconst, _, _, _ = PCA(X, num_component)
    # The difference in the result should be very small (<10^-20)
    print(
        "difference in reconstruction for num_components = {}: {}".format(
            num_component, np.square(reconst - sklearn_reconst).sum()
        )
    )
    np.testing.assert_allclose(reconst, sklearn_reconst)
As far as I can tell, there are a few things wrong with your code.
Your projection matrix is wrong.
If the (unit) eigenvectors of your covariance matrix that you keep form the columns of B, with dimension D x M, where M is the number of components you select and D is the dimension of the original data, then the projection matrix is just B @ B.T.
In a standard implementation of PCA, we typically do not scale the data by the inverse of the standard deviation. You seem to be trying to do an approximation of a whitened PCA (ZCA), but even then it looks wrong.
As a quick test, you can compute the normalized data without dividing by the standard deviation, and when you compute the covariance matrix, set bias=False.
You should also subtract the mean from the data before multiplying it by the projection operator, and add it back afterwards, i.e.,
X_reconst = (P @ (X - mean_vec).T).T + mean_vec
PCA is essentially just a change of basis, followed by discarding the coordinates corresponding to directions with low variance. The eigenvectors of the covariance matrix correspond to the new orthogonal basis, and the eigenvalues tell you the variance of the data along the direction of the corresponding eigenvectors. P = B @ B.T is just the change of basis to the new basis B (discarding some coordinates), followed by a change back to the original basis.
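The points above can be sketched as follows. This is a minimal illustration, not the course's reference solution: it centers the data without dividing by the standard deviation, uses `np.linalg.eigh` (a swap for `np.linalg.eig`, chosen because the covariance matrix is symmetric), and compares against an SVD truncation rather than scikit-learn so that only NumPy is needed. The function names are hypothetical.

```python
import numpy as np

def pca_reconstruct(X, num_components):
    """Project centered data onto the top principal directions and map it back."""
    mean_vec = X.mean(axis=0)
    Xc = X - mean_vec                          # center only, no std scaling
    S = np.cov(Xc, rowvar=False)               # (D, D) covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)       # eigh: S is symmetric
    order = np.argsort(eigvals)[::-1]          # descending eigenvalues
    B = eigvecs[:, order[:num_components]]     # (D, M) orthonormal basis
    P = B @ B.T                                # projection matrix
    return (P @ Xc.T).T + mean_vec             # project, then add the mean back

def svd_reconstruct(X, num_components):
    """Reference: rank-M truncation of the centered data via SVD."""
    mean_vec = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mean_vec, full_matrices=False)
    m = num_components
    return (U[:, :m] * s[:m]) @ Vt[:m] + mean_vec

X = np.random.RandomState(0).randn(10, 5)
for m in range(1, 4):
    print(m, np.allclose(pca_reconstruct(X, m), svd_reconstruct(X, m)))
```

The scaling of the covariance matrix (N versus N-1) changes the eigenvalues but not the eigenvectors, so it does not affect the reconstruction.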
Edit
I'm curious to know which online course teaches people to implement PCA this way.
I have 60000 vectors of 784 dimensions. This data has 10 classes.
I must evaluate a function that takes out one dimension and computes the distance metric again. This function computes the distance of each vector to its class's mean. In code:
def objectiveFunc(self, X, y, indices):
    subX = np.array([X[:,i] for i in indices]).T
    d = np.zeros((10,1))
    for n in range(10):
        C = subX[np.where(y == n)]
        u = np.mean(C, axis = 0)
        Sinv = pinv(covariance(C))
        d[n] = np.mean(np.apply_along_axis(mahalanobis, axis = 1, arr=C, v=u, VI=Sinv))
where indices are fed in with one index removed during each iteration.
As you can imagine, I am computing a lot of individual components during the computation for Mahalanobis distance. Is there a way for me to store all the 784 component distances?
Alternatively, what's the fastest way to compute Mahalanobis distance?
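First of all, to make it easier to understand, this is the Mahalanobis distance formula: D_M(x) = \sqrt{(x - \mu)^T S^{-1} (x - \mu)}, where \mu is the class mean and S is the class covariance matrix.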
So, to compute the mahalanobis distance for each element according to its class, we can do:
X_train=X_train.reshape(-1,784)
def mahalanobis(element,classe):
    part=np.where(y_train==classe)[0]
    ave=np.mean(X_train[part])
    distance_example=np.sqrt(((np.mean(X_train[part[[element]]])-ave)**2)/np.var(X_train[part]))
    return distance_example
mahalanobis(20,2)
# Out[91]: 0.13947337027828757
Then you can create a for statement to calculate all distances. For instance, class 0:
[mahalanobis(i,0) for i in range(0,len(X_train[np.where(y_train==0)[0]]))]
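As a possible alternative for the speed question, here is a hedged sketch that evaluates the full quadratic form (x - mu)^T S^{-1} (x - mu) for every row of a class at once with `np.einsum`, instead of calling a distance function row by row. The function name `mahalanobis_to_class_mean` and the random test data are assumptions made for illustration.

```python
import numpy as np

def mahalanobis_to_class_mean(C):
    """Mahalanobis distance of every row of C to the mean of C."""
    mu = C.mean(axis=0)
    S_inv = np.linalg.pinv(np.cov(C, rowvar=False))  # pseudo-inverse copes with a singular S
    diff = C - mu
    # Row-wise quadratic form diff_i^T S_inv diff_i without a Python loop.
    sq = np.einsum("ij,jk,ik->i", diff, S_inv, diff)
    return np.sqrt(np.maximum(sq, 0.0))

rng = np.random.default_rng(0)
C = rng.normal(size=(200, 5))
d = mahalanobis_to_class_mean(C)

# Reference loop for comparison.
mu = C.mean(axis=0)
S_inv = np.linalg.pinv(np.cov(C, rowvar=False))
d_loop = np.array([np.sqrt((x - mu) @ S_inv @ (x - mu)) for x in C])
print(np.allclose(d, d_loop))
```

In this form the covariance matrix is inverted once per class and feature subset, and the per-row work is a couple of matrix products.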
I am attempting to write a custom loss function in Keras from this paper. Namely, the loss I want to create is this:
This is a type of ranking loss for multi-class multi-label problems. Here are the details:
Y_i = set of positive labels for sample i
Y_i^bar = set of negative labels for sample i (complement of Y_i)
c_j^i = prediction on i^th sample at label j
In what follows, both y_true and y_pred are of dimension 18.
def multilabel_loss(y_true, y_pred):
    """ Multi-label loss function.
    More complete description here...
    """
    zero = K.tf.constant(0, dtype=tf.float32)
    where_one = K.tf.not_equal(y_true, zero)
    where_zero = K.tf.equal(y_true, zero)
    Y_p = K.tf.where(where_one)
    Y_n = K.tf.where(where_zero)
    n = K.tf.shape(y_true)[0]
    loss = 0
    for i in range(n):
        # Here i is the ith sample; for a specific i, I find all locations
        # where Y_p, Y_n belong to the ith sample; axis 0 denotes
        # the sample index space
        Y_p_i = K.tf.equal(Y_p[:,0], K.tf.constant(i, dtype=tf.int64))
        Y_n_i = K.tf.equal(Y_n[:,0], K.tf.constant(i, dtype=tf.int64))
        # Here I plug in those locations to get the values
        Y_p_i = K.tf.where(Y_p_i)
        Y_n_i = K.tf.where(Y_n_i)
        # Here I get the indices of the values above
        Y_p_ind = K.tf.gather(Y_p[:,1], Y_p_i)
        Y_n_ind = K.tf.gather(Y_n[:,1], Y_n_i)
        # Here I compute Y_i and its complement
        yi = K.tf.shape(Y_p_ind)[0]
        yi_not = K.tf.shape(Y_n_ind)[0]
        # The value to normalize the inner summation
        normalizer = K.tf.divide(1, K.tf.multiply(yi, yi_not))
        # This creates a matrix of all combinations of indices k, l from the
        # above equation; then it is reshaped
        prod = K.tf.map_fn(lambda x: K.tf.map_fn(lambda y: K.tf.stack( [ x, y ] ), Y_n_ind ), Y_p_ind )
        prod = K.tf.reshape(prod, [-1, 2, 1])
        prod = K.tf.squeeze(prod)
        # Next, the indices are fed into the corresponding prediction
        # matrix, where the values are then exponentiated and summed
        y_pred_gather = K.tf.gather(y_pred[i,:].T, prod)
        s = K.tf.cast(K.sum(K.tf.exp(K.tf.subtract(y_pred_gather[:,0], y_pred_gather[:,1]))), tf.float64)
        loss = loss + K.tf.multiply(normalizer, s)
    return loss
My questions are the following:
When I go to compile my graph, I get an error revolving around n. Namely, TypeError: 'Tensor' object cannot be interpreted as an integer. I've looked around, but I can't find a way to stop this. My hunch is that I need to avoid a for loop altogether, which brings me to
How can I write this loss without for loops? I'm fairly new to Keras and have spent a solid few hours writing this custom loss myself. I'd love to write it more concisely. What's blocking me from using all matrices is the fact that Y_i and its complement can take on different sizes for each i.
Please let me know if you'd like me to elaborate more on my code. Happy to do so.
UPDATE 3
As per @Parag S. Chandakkar's suggestions, I have the following:
def multi_label_loss(y_true, y_pred):
    # set consistent casting
    y_true = tf.cast(y_true, dtype=tf.float64)
    y_pred = tf.cast(y_pred, dtype=tf.float64)
    # this get all positive predictions and negative predictions
    # it also exponentiates them in their respective Y_i classes
    PT = K.tf.multiply(y_true, tf.exp(-y_pred))
    PT_complement = K.tf.multiply((1-y_true), tf.exp(y_pred))
    # this step gets the weight vector that we'll normalize by
    m = K.shape(y_true)[0]
    W = K.tf.multiply(K.sum(y_true, axis=1), K.sum(1-y_true, axis=1))
    W_inv = 1./W
    W_inv = K.reshape(W_inv, (m,1))
    # this step computes the outer product of two tensors
    def outer_product(inputs):
        """
        inputs: list of two tensors (of equal dimensions,
        for which you need to compute the outer product
        """
        x, y = inputs
        batchSize = K.shape(x)[0]
        outerProduct = x[:,:, np.newaxis] * y[:,np.newaxis,:]
        outerProduct = K.reshape(outerProduct, (batchSize, -1))
        # returns a flattened batch-wise set of tensors
        return outerProduct
    # set up inputs to outer product
    inputs = [PT, PT_complement]
    # compute final loss
    loss = K.sum(K.tf.multiply(W_inv, outer_product(inputs)))
    return loss
This is not an answer but more like my thought process, which should help you write concise code.
Firstly, I don't think you should worry about that error for now because by the time you eliminate for loops, your code may look very different.
Now, I haven't looked at the paper but the predictions c_j^i should be the raw values that come out of the last non-softmax layer (that is what I assume).
So you can add an additional exp layer and compute exp(c_j^i) for each prediction. Now, the for loop comes because of the summation. If you look closely, all it is doing is first forming pairs of all the labels and then subtracting their corresponding predictions. Now, first express the subtraction as exp(c_l^i) * exp(-c_k^i). To see what is happening, take a simple example.
import numpy as np
a = [1, 2, 3]
a = np.reshape(a, (3,1))
Following above explanation, you want the following result.
r1 = sum([1 * 2, 1 * 3, 2 * 3])  # = sum([2, 3, 6]) = 11
You could get the same result from an outer product of the array with itself, which is a way to eliminate for loops.
r2 = a * a.T
# r2 = array([[1, 2, 3],
# [2, 4, 6],
# [3, 6, 9]])
Extract the upper triangular part, i.e. 2, 3, 6 and sum the array to get 11, which is the result you want. Now, there may be some differences, for example, you may need to exhaustively form all the pairs. You should be able to convert it in the form of matrix multiplication.
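The extraction step can be written in one line with `np.triu`. This is a small sketch of the toy example above, with `pair_sum` as an illustrative variable name.

```python
import numpy as np

a = np.reshape([1, 2, 3], (3, 1))
outer = a * a.T                         # broadcasting gives the 3x3 outer product
pair_sum = np.triu(outer, k=1).sum()    # keep entries above the diagonal: 2, 3, 6
print(pair_sum)                         # 11
```

Passing `k=1` excludes the diagonal, so each unordered pair is counted exactly once.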
Once you have taken care of the summation term, the normalization term can be easily computed if you pre-compute the quantities |Y_i| and |\bar{Y_i}| for each sample i. Pass them as input arrays and pass them into the loss as a part of y_pred. The final summation over i will be done by Keras.
Edit 1: Even if |Y_i| and |\bar{Y_i}| take on different values, you should be able to build a generic formula for extracting the upper triangular part irrespective of the matrix size once you have pre-computed |Y_i| and |\bar{Y_i}|.
Edit 2: I don't think you understood me completely. In my opinion, NumPy shouldn't be used at all in the loss function. This is (mostly) doable using only Tensorflow. I will explain once more, while preserving my earlier explanation.
I now know that there is a cartesian product between the positive labels and negative labels (i.e. |Y_i| and \bar{Y_i}, respectively). So first, put a layer of exp after the raw predictions (in TF, not in Numpy).
Now, you need to know which indices out of the 18 dimensions of y_true correspond to positive labels and which ones correspond to negative labels. If you are using one-hot encoding, you can find this out on-the-fly by using tf.where and tf.gather (see here).
By now, you should know the indices j (in c_j^i) that correspond to positive and negative labels. All you need to do is compute \sum_(k, l) {exp(c_k^i) * (1 / exp(c_l^i))} for pairs (k, l). All you need to do is form one tensor consisting of exp(c_k^i) for all k (call it A) and another one consisting of exp(c_l^i) for all l (call it B). Then compute sum(A * B^T). No need to extract the upper triangular part too if you are taking cartesian product. By now, you should have the result of inner-most summation.
Contrary to what I said before, I think you could also compute the normalization factor on-the-fly from y_true.
You only have to figure out how to extend this to three dimensions to handle multiple samples.
Note: The usage of Numpy is probably possible by using tf.py_func but does not seem necessary here. Just use functions of TF.
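To illustrate the idea outside of TensorFlow, here is a NumPy sketch. It assumes the per-sample loss is the average of exp(c_l - c_k) over all pairs with k a positive label and l a negative label, which is what the UPDATE 3 code in the question computes; the paper's exact form is not reproduced here. Because the sum runs over the full Cartesian product, it factorizes into a product of two single sums, so no pair tensor is needed. The function names are hypothetical.

```python
import numpy as np

def pairwise_rank_loss(y_true, y_pred):
    """Sum over samples of the mean of exp(c_l - c_k), k positive, l negative."""
    pos = y_true * np.exp(-y_pred)            # exp(-c_k) on positive labels, else 0
    neg = (1.0 - y_true) * np.exp(y_pred)     # exp(c_l) on negative labels, else 0
    n_pairs = y_true.sum(axis=1) * (1.0 - y_true).sum(axis=1)
    # The double sum over (k, l) factorizes into a product of two single sums.
    return np.sum(pos.sum(axis=1) * neg.sum(axis=1) / n_pairs)

def pairwise_rank_loss_loops(y_true, y_pred):
    """Reference implementation with explicit loops over label pairs."""
    total = 0.0
    for t, c in zip(y_true, y_pred):
        pos_idx = np.flatnonzero(t == 1)
        neg_idx = np.flatnonzero(t == 0)
        s = sum(np.exp(c[l] - c[k]) for k in pos_idx for l in neg_idx)
        total += s / (len(pos_idx) * len(neg_idx))
    return total

rng = np.random.default_rng(0)
y_pred = rng.normal(size=(6, 18))
y_true = (rng.random(size=(6, 18)) < 0.3).astype(float)
y_true[:, 0], y_true[:, 1] = 1.0, 0.0     # ensure each sample has both label kinds
print(np.isclose(pairwise_rank_loss(y_true, y_pred),
                 pairwise_rank_loss_loops(y_true, y_pred)))
```

The same masking and row-wise sums should translate directly to tensor operations, and they sidestep the problem that Y_i has a different size for each sample.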
I was going through the code for the SVM loss and its derivative. I understand the loss, but I cannot understand how the gradient is computed in a vectorized manner.
def svm_loss_vectorized(W, X, y, reg):
    loss = 0.0
    dW = np.zeros(W.shape) # initialize the gradient as zero
    num_train = X.shape[0]
    scores = X.dot(W)
    yi_scores = scores[np.arange(scores.shape[0]),y]
    margins = np.maximum(0, scores - np.matrix(yi_scores).T + 1)
    margins[np.arange(num_train),y] = 0
    loss = np.mean(np.sum(margins, axis=1))
    loss += 0.5 * reg * np.sum(W * W)
I understand everything up to here. After this point, I cannot understand why we sum row-wise over the binary matrix and then subtract that sum:
    binary = margins
    binary[margins > 0] = 1
    row_sum = np.sum(binary, axis=1)
    binary[np.arange(num_train), y] = -row_sum.T
    dW = np.dot(X.T, binary)
    # Average
    dW /= num_train
    # Regularize
    dW += reg*W
    return loss, dW
Let us recap the scenario and the loss function first, so we are on the same page:
Given are P sample points in N-dimensional space in the form of a PxN matrix X, so the points are the rows of this matrix. Each point in X is assigned to one out of M categories. These are given as a vector Y of length P that has integer values between 0 and M-1.
The goal is to predict the classes of all points by M linear classifiers (one for each category) given in the form of a weight matrix W of shape NxM, so the classifiers are the columns of W. To predict the categories of all samples X, the scalar products between all points and all weight vectors are formed. This is the same as matrix multiplying X and W, yielding a score matrix Y0 whose rows are ordered like the elements of Y, so each row corresponds to one sample. The predicted category for each sample is simply the one with the largest score.
There are no bias terms so I presume there is some kind of symmetry or zero mean assumption.
Now, to find a good set of weights we want a loss function that is small for good predictions and large for bad predictions and that lets us do gradient descent. One of the most straightforward ways is to penalize, for each sample i, each score that is larger than the score of the correct category for that sample, and to let the penalty grow linearly with the difference. So if we write A[i] for the set of categories j that score more than the correct category, i.e. Y0[i, j] > Y0[i, Y[i]], the loss for sample i could be written as
sum_{j in A[i]} (Y0[i, j] - Y0[i, Y[i]])
or equivalently if we write #A[i] for the number of elements in A[i]
(sum_{j in A[i]} Y0[i, j]) - #A[i] Y0[i, Y[i]]
The partial derivatives with respect to the score are thus simply
                    | -#A[i]   if j == Y[i]
dloss / dY0[i, j] = {  1       if j in A[i]
                    |  0       else
which is precisely what the first four lines you say you don't understand compute.
The next line applies the chain rule dloss/dW = dloss/dY0 dY0/dW.
It remains to divide by the number of samples to get a per-sample loss and to add the derivative of the regularization term, which is easy because the regularization is just a componentwise quadratic function.
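One way to convince yourself that the binary-matrix trick yields the right gradient is to compare it with finite differences. The sketch below is a plain-ndarray rewrite of the question's function (it avoids `np.matrix`, and the name `svm_loss_and_grad` is illustrative); it keeps the margin of 1 from the question's code, which the simplified loss described above leaves out.

```python
import numpy as np

def svm_loss_and_grad(W, X, y, reg):
    """Multiclass hinge loss (margin 1) and its gradient, fully vectorized."""
    num_train = X.shape[0]
    rows = np.arange(num_train)
    scores = X @ W
    margins = np.maximum(0.0, scores - scores[rows, y][:, None] + 1.0)
    margins[rows, y] = 0.0
    loss = margins.sum() / num_train + 0.5 * reg * np.sum(W * W)

    binary = (margins > 0).astype(float)      # 1 where the margin is violated
    binary[rows, y] = -binary.sum(axis=1)     # minus the violation count at the true class
    dW = X.T @ binary / num_train + reg * W   # chain rule through scores = X @ W
    return loss, dW

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = rng.integers(0, 4, size=20)
W = rng.normal(size=(5, 4))
loss, dW = svm_loss_and_grad(W, X, y, reg=0.1)

# Central finite differences as an independent check of dW.
num_dW = np.zeros_like(W)
h = 1e-6
for idx in np.ndindex(*W.shape):
    W_plus, W_minus = W.copy(), W.copy()
    W_plus[idx] += h
    W_minus[idx] -= h
    num_dW[idx] = (svm_loss_and_grad(W_plus, X, y, 0.1)[0]
                   - svm_loss_and_grad(W_minus, X, y, 0.1)[0]) / (2 * h)
print(np.max(np.abs(dW - num_dW)))
```

The printed maximum difference should be tiny; the hinge loss is only piecewise linear, so a check like this can be off at points that sit exactly on a kink.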
Personally, I found it much easier to understand the whole gradient calculation by looking at the analytic derivation of the loss function in more detail. To extend the given answer, I would like to point to the derivatives of the loss function
L_i = \sum_{j \neq y_i} \max(0, w_j^T x_i - w_{y_i}^T x_i + \Delta)
with respect to the weights as follows (\Delta is the margin, which is 1 in your code):
1) Loss gradient wrt w_yi (correct class)
\nabla_{w_{y_i}} L_i = -\left( \sum_{j \neq y_i} \mathbb{1}(w_j^T x_i - w_{y_i}^T x_i + \Delta > 0) \right) x_i
Hence, we count the cases where w_j does not meet the margin requirement and sum those cases up. The negative of this sum is then used as the weight at the position of the correct class w_yi. (We later need to multiply this value by x_i; this is what you do in line 5 of your code.)
2) Loss gradient wrt w_j (incorrect classes)
\nabla_{w_j} L_i = \mathbb{1}(w_j^T x_i - w_{y_i}^T x_i + \Delta > 0) \, x_i
where \mathbb{1} is the indicator function, 1 if true, else 0.
In other words, "programmatically" we need to apply equation (2) to all cases where the margin requirement is not met, and add the negative sum of all unmet requirements to the true class column (as in (1)).
So what you did in the first 3 lines of your code is to determine the cases where the margin is not met, as well as adding the negative sum of these cases to the correct class column (j). In the fifth line, you do the final step where you multiply the x_i's by the other term, and this completes the gradient calculation as in (1) and (2).
I hope this makes it easier to understand, let me know if anything remains unclear. source
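The two rules (each violated incorrect class gains +x_i, and the correct class loses x_i once per violation) can also be written as an explicit loop and compared with the vectorized form. This is a sketch under the assumption of a margin of 1 and with the regularization term left out; both function names are illustrative.

```python
import numpy as np

def svm_grad_loop(W, X, y):
    """Data-term gradient of the multiclass hinge loss, one sample at a time."""
    dW = np.zeros_like(W)
    for x_i, y_i in zip(X, y):
        scores = x_i @ W
        for j in range(W.shape[1]):
            if j == y_i:
                continue
            if scores[j] - scores[y_i] + 1.0 > 0:   # margin requirement not met
                dW[:, j] += x_i                      # incorrect class j gains +x_i
                dW[:, y_i] -= x_i                    # correct class loses x_i per violation
    return dW / X.shape[0]

def svm_grad_vectorized(W, X, y):
    """Same gradient using the binary-matrix trick."""
    rows = np.arange(X.shape[0])
    scores = X @ W
    binary = (scores - scores[rows, y][:, None] + 1.0 > 0).astype(float)
    binary[rows, y] = 0.0                     # the true class is not a violation
    binary[rows, y] = -binary.sum(axis=1)     # minus the violation count
    return X.T @ binary / X.shape[0]

rng = np.random.default_rng(1)
X = rng.normal(size=(15, 4))
y = rng.integers(0, 3, size=15)
W = rng.normal(size=(4, 3))
print(np.allclose(svm_grad_loop(W, X, y), svm_grad_vectorized(W, X, y)))
```

Multiplying `X.T` by the binary matrix performs all of the per-sample additions and subtractions of x_i in a single matrix product.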