Linear Discriminant Analysis with scikit-learn in Python

I am getting into machine learning and recently studied classification of linearly separable data using Linear Discriminant Analysis. To do so I used the scikit-learn package and the class
discriminant_analysis.LinearDiscriminantAnalysis
on data from the MNIST database of handwritten digits. I used the database to fit the model and make predictions on test data like this:
lda = LDA(n_components=2)
lda.fit(data, labels)
lda.predict(testdata)
This works just fine, and I get a nice accuracy rate of 95%. However, the predict function uses data from all 784 dimensions (corresponding to images of 28x28 pixels). I don't understand why all dimensions are used for the prediction.
I thought the purpose of Linear Discriminant Analysis was to find a projection onto a low-dimensional space that maximizes class separation, such that ideally the data becomes linearly separable and classification is easy.
What's the point of LDA and determining the projection matrix if all 784 dimensions are used for prediction anyway?

From documentation:
discriminant_analysis.LinearDiscriminantAnalysis can be used to perform supervised dimensionality reduction, by projecting the input data to a linear subspace consisting of the directions which maximize the separation between classes (in a precise sense discussed in the mathematics section below). The dimension of the output is necessarily less than the number of classes, so this is, in general, a rather strong dimensionality reduction, and only makes sense in a multiclass setting.
This is implemented in discriminant_analysis.LinearDiscriminantAnalysis.transform. The desired dimensionality can be set using the n_components constructor parameter. This parameter has no influence on discriminant_analysis.LinearDiscriminantAnalysis.fit or discriminant_analysis.LinearDiscriminantAnalysis.predict.
This means n_components is used only by transform (or fit_transform), not by fit or predict. You can use the dimensionality reduction to remove noise from your data or for visualization.
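A quick sketch of the difference, reusing the question's data, labels, and testdata arrays (so purely illustrative): predict consumes the full 784-dimensional input, while transform returns the reduced representation.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis(n_components=2)
lda.fit(data, labels)              # data has shape (n_samples, 784)

preds = lda.predict(testdata)      # prediction uses all 784 features
reduced = lda.transform(testdata)  # shape (n_samples, 2): only here does n_components matter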

The low-dimensional space you mention corresponds to n_classes in terms of classification.
If you use LDA as a dimensionality reduction technique, you can choose n_components dimensions by specifying it (it must be < n_classes). As the documentation says, this has no impact on prediction.
Hence, once you give it input data, it transforms the data into an n_classes-dimensional space and then uses this space for training/prediction. For reference, _decision_function() is used for prediction.
You can use transform(X) to view the new lower-dimensional space learned by the model.

Applying LDA on MNIST data with reduced dimensions:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
import matplotlib.pyplot as plt

# data_1000 / labels_1000: a subset of the MNIST images (flattened) and their labels
lda = LinearDiscriminantAnalysis(n_components=2)
X_r2 = lda.fit(data_1000, labels_1000).transform(data_1000)

# LDA before t-SNE
colors = ['brown', 'black', 'deepskyblue', 'red', 'yellow', 'darkslategrey', 'navy', 'darkorange', 'deeppink', 'lawngreen']
target_names = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
lw = 2
y = labels_1000

plt.figure()
for color, i, target_name in zip(colors, [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], target_names):
    plt.scatter(X_r2[y == i, 0], X_r2[y == i, 1], alpha=.8, color=color,
                label=target_name)
plt.legend(loc='best', shadow=False, scatterpoints=1)
plt.title('LDA of MNIST dataset before TSNE')
plt.show()

Related

Dividing the training/validation sets by spatial parcellation using KMeans: making sure each subset of training/validation includes all labels

I am hoping to get some help here. I want to use KMeans clustering to divide the training set for supervised learning in the next stage, which is called spatial parcellation in the literature.
I have extracted features (X_train) from 3D image data (within a dilated mask), which include the x, y, and z positions of the voxels. The mask contains two labels (y_train = {0: background, 1: obj1}), meaning each voxel within the mask is either background or object.
I used KMeans clustering from scikit-learn and clustered the training dataset's voxel positions ([x, y, z]) into 80 different clusters.
Problem: once I have divided the training set based on the 80 clusters,
from sklearn.cluster import KMeans

pos_ind = [0, 1, 2]  # columns holding the voxel positions (x, y, z)
kmeans_model = KMeans(n_clusters=80, random_state=rng).fit(X_TRAIN[:, pos_ind])
I later load kmeans_model and assign the validation set to the clusters:
Dval_clusters = kmeans_model.predict(X_DVAL[:, pos_ind])
and then I find the row indices of the validation set that fall into the cidx-th cluster:
cluster_idx = np.unique(kmeans_model.labels_)
Dval_rows = np.where(Dval_clusters == cluster_idx[cidx])[0]  # rows of the validation set that belong to the cidx-th cluster
X_dval = X_DVAL[Dval_rows]
y_dval = y_DVAL[Dval_rows]
The trained model for a cluster may see both labels 0 and 1, yet I get an error when evaluating on the validation subset of some clusters. It means that, when the validation samples are assigned to the clusters, some clusters pick up only background voxels or only object voxels (i.e., all validation samples in that cluster share a single class label):
ValueError: could not broadcast input array from shape (30527,1) into shape (30527,2)
Question: I know that clustering does not use labels, but is it possible to enforce that both labels are represented in every cluster? Or is there any trick for doing so? Background is just a class label, and we need to separate the object class (class 1) from the background (class 0).
I would really appreciate your expert opinion here.
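For what it's worth, here is a small diagnostic sketch (reusing the question's kmeans_model, pos_ind, X_DVAL, and y_DVAL, so purely illustrative) that lists which clusters end up with only one label in the validation set:
import numpy as np

Dval_clusters = kmeans_model.predict(X_DVAL[:, pos_ind])
for cid in np.unique(kmeans_model.labels_):
    labels_in_cluster = np.unique(y_DVAL[Dval_clusters == cid])
    if labels_in_cluster.size < 2:
        print(f"cluster {cid} contains only label(s) {labels_in_cluster}")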

PyTorch: Is retain_graph=True necessary in alternating optimization?

I'm trying to optimize two models in an alternating fashion using PyTorch. The first is a neural network that changes the representation of my data (i.e., a map f(x) on my input data x, parameterized by some weights W). The second is a Gaussian mixture model that operates on the f(x) points, i.e., in the neural network space (rather than clustering points in the input space). I am optimizing the GMM using expectation maximization, so the parameter updates are analytically derived rather than obtained via gradient descent.
I have two loss functions here: the first is a function of the distances ||f(x) - f(y)||, and the second is the loss function of the Gaussian mixture model (i.e., how 'clustered' everything looks in the NN representation space). What I want to do is take a step in the NN optimization using both of the above loss functions (since it depends on both), and then do an expectation-maximization step for the GMM. The code looks like this (I have removed a lot since there is a ton of code):
data, labels = load_dataset()
net = NeuralNetwork()
net_optim = torch.optim.Adam(net.parameters(), lr=0.05, weight_decay=1)
# initialize weights, means, and covariances for the Gaussian clusters
concentrations, means, covariances, precisions = initialization(net.forward_one(data))
for i in range(1000):
    net_optim.zero_grad()
    pairs, pair_labels = pairGenerator(data, labels)  # samples some pairs of datapoints
    outputs = net(pairs[:, 0, :], pairs[:, 1, :])  # computes pairwise distances
    net_loss = NeuralNetworkLoss(outputs, pair_labels)  # loss function based on pairwise dist.
    embedding = net.forward_one(data)  # embeds all data in the NN space
    log_prob, log_likelihoods = expectation_step(embedding, means, precisions, concentrations)
    concentrations, means, covariances, precisions = maximization_step(embedding, log_likelihoods)
    gmm_loss = GMMLoss(log_likelihoods, log_prob, precisions, concentrations)
    net_loss.backward(retain_graph=True)
    gmm_loss.backward(retain_graph=True)
    net_optim.step()
Essentially, this is what is happening:
1. Sample some pairs of points from the dataset
2. Push the pairs of points through the NN and compute the network loss based on those outputs
3. Embed all datapoints using the NN and perform a clustering EM step in that embedding space
4. Compute the variational loss (ELBO) based on the clustering parameters
5. Update the neural network parameters using both the variational loss and the network loss
However, to perform (5), I am required to add the flag retain_graph=True, otherwise I get the error:
RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.
It seems like having two loss functions means that I need to retain the computational graph?
I am not sure how to work around this, as with retain_graph=True, around iteration 400, each iteration is taking ~30 minutes to complete. Does anyone know how I might fix this? I apologize in advance – I am still very new to automatic differentiation.
I would recommend doing
total_loss = net_loss + gmm_loss
total_loss.backward()
Note that the gradient of net_loss w.r.t. the GMM parameters is 0, so summing the losses won't change the resulting gradients.
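For concreteness, this is roughly how the loop from the question could look with a single combined backward pass (a sketch reusing the question's functions and variables, not tested code):
for i in range(1000):
    net_optim.zero_grad()
    pairs, pair_labels = pairGenerator(data, labels)
    outputs = net(pairs[:, 0, :], pairs[:, 1, :])
    net_loss = NeuralNetworkLoss(outputs, pair_labels)
    embedding = net.forward_one(data)
    log_prob, log_likelihoods = expectation_step(embedding, means, precisions, concentrations)
    concentrations, means, covariances, precisions = maximization_step(embedding, log_likelihoods)
    gmm_loss = GMMLoss(log_likelihoods, log_prob, precisions, concentrations)
    # one backward over the combined loss builds and frees a single graph,
    # so retain_graph=True is no longer needed
    total_loss = net_loss + gmm_loss
    total_loss.backward()
    net_optim.step()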
Here is a good PyTorch forum thread regarding retain_graph: https://discuss.pytorch.org/t/what-exactly-does-retain-variables-true-in-loss-backward-do/3508/24

100% Accuracy using SVC classification, something must be wrong?

Context to what I'm trying to achieve:
I have a problem regarding image classification using scikit-learn. I have CIFAR-10 data: 10000 training images and 1000 testing images. Each test/train set is stored in an npy file as a 4-D matrix (height, width, rgb, sample). I also have the test/train labels. I have a computeFeatures method that uses the Histogram of Oriented Gradients (HOG) method to represent the image-domain features as a vector. I am trying to iterate this method over both the training and testing data so that I can create an array of features that can be used later to classify the images. I did this with a for loop over i, storing the results in a numpy array. I must then continue to apply PCA/LDA and do image classification with SVC, CNN, etc. (any method of image classification).
import numpy as np
import skimage.feature
from sklearn.decomposition import PCA
trnImages = np.load('trnImage.npy')
tstImages = np.load('tstImage.npy')
trnLabels = np.load('trnLabel.npy')
tstLabels = np.load('tstLabel.npy')
from sklearn.svm import SVC
def computeFeatures(image):
    hog_feature, hog_as_image = skimage.feature.hog(image, visualize=True, block_norm='L2-Hys')
    return hog_feature

trnArray = np.zeros([10000, 324])
tstArray = np.zeros([1000, 324])

for i in range(0, 10000):
    trnFeatures = computeFeatures(trnImages[:, :, :, i])
    trnArray[i, :] = trnFeatures

for i in range(0, 1000):
    tstFeatures = computeFeatures(tstImages[:, :, :, i])
    tstArray[i, :] = tstFeatures
pca = PCA(n_components = 2)
trnModel = pca.fit_transform(trnArray)
pca = PCA(n_components = 2)
tstModel = pca.fit_transform(tstArray)
# Divide the dataset into the two sets.
test_data = tstModel
test_labels = tstLabels
train_data = trnModel
train_labels = trnLabels
C = 1
model = SVC(kernel='linear', C=C)
model.fit(train_data, train_labels.ravel())
y_pred = model.predict(test_data)
accuracy = np.sum(np.equal(test_labels, y_pred)) / test_labels.shape[0]
print('Percentage accuracy on testing set is: {0:.2f}%'.format(accuracy))
The accuracy prints out as 100%. I'm pretty sure this is wrong, but I'm not sure why.
First of all,
pca = PCA(n_components = 2)
tstModel = pca.fit_transform(tstArray)
this is wrong. You have to use:
tstModel = pca.transform(tstArray)
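In other words, a minimal sketch reusing the question's trnArray and tstArray:
pca = PCA(n_components=2)
trnModel = pca.fit_transform(trnArray)  # learn the projection on the training features only
tstModel = pca.transform(tstArray)      # apply that same projection to the test features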
Secondly, how did you select the dimensionality of the PCA? Why 2? Why not 25 or 100? 2 principal components may be too few for images. Also, as far as I can tell, the datasets are not scaled prior to PCA.
Just for interest, check the balance of classes.
Regarding 'shall we use PCA before SVM or not': it highly depends on the data, so try both cases and then decide. SVC may be pretty slow to compute, so PCA (or another dimensionality reduction technique) may speed it up a little, but you need to check both cases.
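If you do keep PCA, one way to combine scaling, PCA, and the SVM consistently is a scikit-learn Pipeline. A sketch reusing the question's arrays (the n_components value here is arbitrary, not a recommendation):
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC

clf = Pipeline([
    ('scale', StandardScaler()),        # scale features before PCA
    ('pca', PCA(n_components=50)),      # arbitrary; tune on a validation set
    ('svc', SVC(kernel='linear', C=1)),
])
clf.fit(trnArray, trnLabels.ravel())    # every step is fit on the training data only
accuracy = clf.score(tstArray, tstLabels.ravel())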
The immediate concern in this sort of situation is that the model is over-fitted. Any professional reviewer would immediately return this to the investigator. In this case, I suspect it is a result of the statistical approach used.
I don't work with images, but I would question why PCA was being stacked onto SVM. In common speak, you are using two successive methods that reduce/collapse hyper-dimensional space. This would very likely lead to a definite outcome. If you collapse high-level dimensionality once, why repeat it?
The PCA is standard for images, but should be followed by something very simple such as K-means.
The other approach instead of PCA is, of course, NMF and I would recommend it if you feel PCA is not providing the resolution sought.
Otherwise the calculation looks fine.
accuracy = np.sum(np.equal(test_labels, y_pred)) / test_labels.shape[0]
On second thought, the accuracy figure might not be a matter of over-fitting IF (that is an emphatic 'IF') test_labels actually contains predictions for the images (of which ~50% are incorrect).
I'm just guessing that this is what the "test_labels" data is, however, and we have no idea how that prediction was derived. So I'm not sure there's enough information to answer the question.
BTW, could someone explain "shape[0]", please? Is it needed?
One obvious problem with your approach is that you apply PCA in a rather peculiar way. You should typically only estimate one transform -- on the training data -- and then use it to transform any evaluation set as well.
This way, you kind of... implement SVM with whitening batch-norm, which sounds cool, but is at least rather unusual. So it would need much care. E.g. this way, you cannot classify a single sample. Still, it may work as an unsupervised adaptation technique.
Apart from that, it's hard to tell without access to your data. Are you sure that the test and train sets are disjoint?

Practical determination of anomaly threshold in (variational) autoencoders

Although this is not strictly a programming question, I haven't found anything about this topic on this site. I am currently dealing with (variational) autoencoders ((V)AE) and plan to deploy them to detect anomalies. For testing purposes, I've implemented a VAE in TensorFlow for detecting handwritten digits.
The training went well and the reconstructed images are very similar to the originals. But to actually use the autoencoder, I need some kind of measure to determine whether a new image fed to the autoencoder is a digit or not, by comparing it to a threshold value.
At this point, I have two major questions:
1.) For training, I used a loss consisting of two components. The first one is the reconstruction error, which is a cross-entropy function:
# x: actual input
# x_hat: reconstructed input
epsilon = 1e-10  # <-- small number for numerical stability inside the log
recons_loss = -tf.reduce_sum(x * tf.log(epsilon + x_hat) + (1 - x) * tf.log(epsilon + 1 - x_hat),
                             axis=1)
The second one is the KL divergence, which measures how similar two probability distributions are; we demand that the latent variable space follow a distribution similar to a Gaussian.
# z_mean: vector representing the means of the latent distribution
# z_log_var: vector representing the log-variances of the latent distribution
KL_div = -0.5 * tf.reduce_sum(1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var),
                              axis=1)
For determining the reconstruction error of a new image, do I have to use both parts of the training loss? Intuitively, I would say no and just go with recons_loss.
2.) How do I determine the threshold value? Is there already a tf functionality implemented that I can use?
If you have some good source for anything related, please share the link!
Thanks!
I had a similar problem recently. VAEs are very good at projecting high-dimensional data into a lower-dimensional latent space. Altering the latent vector and feeding it to the decoder part creates new samples.
I hope I understand your question correctly: you are trying to do anomaly detection with the encoder part, in the lower-dimensional latent space?
I guess you have trained your VAE on MNIST. What you can do is get the latent vectors of all MNIST digits and compare the latent vector of your new digit to them via Euclidean distance. The threshold would be a maximum distance set by you.
The code would be something like this:
from scipy.spatial import distance

x_mnist_encoded = encoder.predict(x_mnist, batch_size=batch_size)  # array of MNIST latent vectors
test_digit_encoded = encoder.predict(x_testdigit, batch_size=1)    # your test digit's latent vector

threshold = 0.3  # maximum Euclidean distance

def is_known_digit(test_vector, reference_vectors, threshold):
    # the sample counts as "known" if it lies within the threshold
    # distance of at least one reference latent vector
    for vector in reference_vectors:
        if distance.euclidean(vector, test_vector) <= threshold:
            return True
    return False

print(is_known_digit(test_digit_encoded[0], x_mnist_encoded, threshold))
VAE code is from https://blog.keras.io/building-autoencoders-in-keras.html
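If x_mnist_encoded is large, the same check can be vectorized; a small sketch using scipy.spatial.distance.cdist (assuming both encoded arrays are 2-D):
from scipy.spatial.distance import cdist

# distance from the test latent vector to every MNIST latent vector
dists = cdist(test_digit_encoded, x_mnist_encoded, metric='euclidean')[0]
is_known = bool(dists.min() <= threshold)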

How does sklearn.svm.SVC's function predict_proba() work internally?

I am using sklearn.svm.SVC from scikit-learn to do binary classification. I am using its predict_proba() function to get probability estimates. Can anyone tell me how predict_proba() calculates the probability internally?
Scikit-learn uses LibSVM internally, and this in turn uses Platt scaling, as detailed in this note by the LibSVM authors, to calibrate the SVM to produce probabilities in addition to class predictions.
Platt scaling requires first training the SVM as usual, then optimizing parameter vectors A and B such that
P(y|X) = 1 / (1 + exp(A * f(X) + B))
where f(X) is the signed distance of a sample from the hyperplane (scikit-learn's decision_function method). You may recognize the logistic sigmoid in this definition, the same function that logistic regression and neural nets use for turning decision functions into probability estimates.
Mind you: the B parameter, the "intercept" or "bias" or whatever you like to call it, can cause predictions based on probability estimates from this model to be inconsistent with the ones you get from the SVM decision function f. E.g. suppose that f(X) = 10, then the prediction for X is positive; but if B = -9.9 and A = 1, then P(y|X) = .475. I'm pulling these numbers out of thin air, but you've noticed that this can occur in practice.
Effectively, Platt scaling trains a probability model on top of the SVM's outputs under a cross-entropy loss function. To prevent this model from overfitting, it uses an internal five-fold cross validation, meaning that training SVMs with probability=True can be quite a lot more expensive than a vanilla, non-probabilistic SVM.
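To see the consequence described above in practice, here is a small sketch on synthetic data (the dataset and parameters are made up purely for illustration) comparing predictions derived from decision_function with those derived from predict_proba:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
clf = SVC(kernel='linear', C=1.0, probability=True, random_state=0).fit(X, y)

pred_from_decision = (clf.decision_function(X) > 0).astype(int)
pred_from_proba = (clf.predict_proba(X)[:, 1] > 0.5).astype(int)

# the two rules usually agree, but because of Platt scaling's intercept (B)
# they are not guaranteed to; count any disagreements
print(np.sum(pred_from_decision != pred_from_proba), "disagreements")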
Actually, I found a slightly different answer: this is the code used to convert a decision value to a probability:
double fApB = decision_value * A + B;
if (fApB >= 0)
    return Math.exp(-fApB) / (1.0 + Math.exp(-fApB));
else
    return 1.0 / (1 + Math.exp(fApB));
Here A and B values can be found in the model file (probA and probB).
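The same conversion as a small Python sketch, translated directly from the snippet above (A and B would come from the fitted model, e.g. libsvm's probA and probB):
import math

def decision_value_to_probability(decision_value, A, B):
    # numerically stable evaluation of 1 / (1 + exp(A * decision_value + B))
    fApB = decision_value * A + B
    if fApB >= 0:
        return math.exp(-fApB) / (1.0 + math.exp(-fApB))
    return 1.0 / (1.0 + math.exp(fApB))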
It also offers a way to convert a probability back into a decision value, and thus into a hinge loss, using the convention that ln(0) = -200.
