Dimension reduction by identifying important Variables - python

I have a np.array with 400 entries, each containing the values of a spectrum with 1000 points.
I want to identify the n most interesting indices of the spectrum and return them, so I can visualize them and use them as an input vector for my classifier.
Is it best to calculate the variance, apply PCA, or are there better-suited algorithms?
And how do I compute the variance accounted for by that selection?
Thanks

Dimensionality reduction can be performed in two different broad ways: feature extraction and feature selection. PCA is more suited for feature extraction, while "identify the n most interesting indices" is a feature selection problem. More details and how to code it up here: sklearn.feature_selection
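A minimal sketch of the feature-selection route, assuming class labels are available and reading "accounted variance" as the share of the total per-feature variance carried by the selected indices; the array names, n = 20 and the choice of f_classif are illustrative, not from the original post:

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
spectra = rng.normal(size=(400, 1000))   # placeholder for the real 400 spectra with 1000 points each
y = rng.integers(0, 2, size=400)         # placeholder class labels for the classifier

n = 20  # number of spectral indices to keep (illustrative)

# Supervised selection: score each index against the class labels
selector = SelectKBest(score_func=f_classif, k=n).fit(spectra, y)
selected_idx = selector.get_support(indices=True)   # the n "most interesting" indices

# Unsupervised alternative: rank indices by variance alone
var_idx = np.argsort(spectra.var(axis=0))[::-1][:n]

# Share of the total per-feature variance carried by the selected indices
accounted = spectra[:, selected_idx].var(axis=0).sum() / spectra.var(axis=0).sum()
print(selected_idx, accounted)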

Related

Standardize same-scale variables?

Thinking about a problem… should you standardize two predictors that are already on the same scale (say, kilograms) but may have different ranges? The model is KNN.
I think you should, because otherwise the model will give the predictor with the larger range more importance when calculating distances.
It is better to standardize the data even though the features are on the same scale. With different ranges, the feature with the larger range dominates the Euclidean distance, so the nearest neighbours are effectively chosen by that feature alone. Standardizing puts both features on a comparable footing, so each contributes meaningfully to the distance. Since KNN is entirely distance-based, scaling the features is always preferred.
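As a hedged sketch of that advice (the data is made up so that the label depends only on the small-range feature), scikit-learn's StandardScaler inside a Pipeline keeps the scaling within the cross-validation loop:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Two predictors in the same unit (kilograms) but with very different ranges
X = np.column_stack([rng.uniform(0, 5, 200), rng.uniform(0, 500, 200)])
y = (X[:, 0] > 2.5).astype(int)   # label depends only on the small-range feature (illustrative)

knn_raw = KNeighborsClassifier(n_neighbors=5)
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

print(cross_val_score(knn_raw, X, y, cv=5).mean())     # distances dominated by the wide-range feature
print(cross_val_score(knn_scaled, X, y, cv=5).mean())  # both features contribute comparably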

Pre Processing spiral dataset to use for Logistic Regression

So I need to classify a spiral dataset. I have been experimenting with a bunch of algorithms like KNN, Kernel SVM, etc. I would like to try to improve the performance of Logistic Regression using feature engineering, preprocessing, etc.
I am also using scikit learn to do all of the classifications.
I fully understand that logistic regression is not the proper algorithm for this sort of problem. This is more of a learning exercise in preprocessing and other feature engineering/extraction methods, to see how much I can improve this specific model.
Here is an example dataset I would use for the classification. Any suggestions of how I can manipulate the dataset to use in the Logistic Regression algorithm would be helpful.
I also have datasets with multiple spirals. Some datasets have 2 classes, and some have up to 5, which means up to 5 spirals.
Logistic regression is generally used as a linear classifier, i.e. the decision boundary separating the samples of one class from the other is a straight line, but it can be used for non-linear decision boundaries as well.
Using the kernel trick in SVC is also a good option, as it maps the data from the lower dimension to a higher dimension, making it linearly separable.
Example: data that is not linearly separable in one dimension becomes linearly separable after applying the transformation ϕ(x) = x² and adding it as a second feature.
You can start by transforming the data, creating new features on which to apply logistic regression (see the sketch after these answers).
Also try SVC (Support Vector Classifier), which uses the kernel trick; for SVC you don't have to transform the data into higher dimensions explicitly.
Two resources that are great for learning are one and two.
Since the data doesn't seem to be linearly separable, you can try using the kernel trick commonly used in Support Vector Classification. The kernel function accepts inputs in the original lower-dimensional space and returns the dot product of the transformed vectors in the higher-dimensional space; that is, the transformed vector ϕ(x) is just some function of the coordinates of the corresponding lower-dimensional vector x.
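A rough sketch of that feature-creation idea on a synthetic two-arm spiral; the dataset, the polar/trigonometric features and all parameter values are illustrative stand-ins, not taken from the original post:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic two-class spiral: the second class is the first one rotated by pi
n = 500
theta = rng.uniform(0.5, 3 * np.pi, n)
label = rng.integers(0, 2, n)
r = theta + 0.3 * rng.normal(size=n)
X = np.column_stack([r * np.cos(theta + label * np.pi),
                     r * np.sin(theta + label * np.pi)])

# Engineered features: radius, angle, and a few trigonometric combinations of the two
rad = np.hypot(X[:, 0], X[:, 1])
ang = np.arctan2(X[:, 1], X[:, 0])
X_feat = np.column_stack([rad, np.sin(ang), np.cos(ang),
                          np.sin(rad - ang), np.cos(rad - ang)])

print(cross_val_score(LogisticRegression(max_iter=1000), X, label, cv=5).mean())       # raw x, y
print(cross_val_score(LogisticRegression(max_iter=1000), X_feat, label, cv=5).mean())  # engineered features
print(cross_val_score(SVC(kernel='rbf'), X, label, cv=5).mean())                       # kernel trick, no manual features

On this toy spiral the cos(rad - ang) feature is what separates the two arms; a real dataset may need a different transformation.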

KNN when using a precomputed affinity matrix in Scikit's spectral clustering?

I have a similarity matrix that I have calculated between a large number of objects, and each object can have a non-zero similarity with any other object. I generated this matrix for another task, and would now like to cluster it for a new analysis.
It seems like scikit's spectral clustering method could be a good fit, because I can pass in a precomputed affinity matrix. I also know that spectral clustering typically uses some number of nearest neighbors when building the affinity matrix, and my similarity matrix does not have that same constraint.
If I pass in a matrix that allows any number of edges between nodes in the affinity matrix, will scikit limit each node to having only a certain number of nearest neighbors? If not, I guess I will have to make that change to my pre-computed affinity matrix.
You don't have to compute the affinity yourself to do spectral clustering; sklearn does that for you.
When you call sc = SpectralClustering(), the affinity parameter allows you to choose the kernel used to compute the affinity matrix. rbf is the default kernel and doesn't use a particular number of nearest neighbours. However, if you decide to choose another kernel, you might want to specify that number with the n_neighbors parameter.
You can then use sc.fit_predict(your_matrix) to compute the clusters.
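Since the question is specifically about a precomputed matrix, here is a minimal sketch (the matrix and the n_clusters value are placeholders): with affinity='precomputed' the supplied matrix is used as the affinity graph directly, so any nearest-neighbour sparsification you want would have to be applied to the matrix beforehand.

import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
# Placeholder for the precomputed similarity matrix: square, symmetric, non-negative
S = rng.random((100, 100))
S = (S + S.T) / 2
np.fill_diagonal(S, 1.0)

sc = SpectralClustering(n_clusters=3, affinity='precomputed', assign_labels='kmeans', random_state=0)
labels = sc.fit_predict(S)   # the matrix itself is treated as the affinity graph
print(labels[:10])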
Spectral clustering does not require a sparsified matrix.
But if I'm not mistaken, it is faster to find the eigenvectors corresponding to the smallest non-zero eigenvalues of a sparse matrix than of a dense one. The worst case may remain O(n^3) though; spectral clustering is one of the slowest methods you can find.

Weighting specific features in TF-IDF feature vectors for k-means clustering and cosine similarity

I have an array of TF-IDF feature vectors. I'd like to find similar vectors in the array using two methods:
Cosine similarity
k-means clustering
Using Scikit Learn, this process is pretty simple.
Now I'd like to weight certain features so that they will influence the results more than the other features. For example, I might like to weight the first 100 elements of the TF-IDF vectors so that those features are more indicative of similarity than the rest of the features.
How can I meaningfully weight certain features in my feature vectors? Is the process for weighting certain features the same for each of the similarity algorithms I listed above?
As I understand it, low values in the TF-IDF matrix mean that the corresponding words are less significant, so one approach is to lower the values in the columns you consider less important (or, equivalently, raise the values in the columns you want to emphasize).
The arrays in scikit-learn are sparse, so for testing and debugging you might want to convert them to a regular (dense) matrix. I also used xlsxwriter to get an overview of what is really happening when applying TF-IDF and KMeans++ (see https://www.dbc-enterprise-it-consulting.com/text-classifier/).
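One way to implement that column scaling on a sparse TF-IDF matrix is to multiply it by a diagonal weight matrix; the toy corpus, the factor of 2.0 and the choice of the first 100 columns are illustrative:

import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans

docs = ["the cat sat on the mat", "the dog sat on the log",
        "cats and dogs", "logs and mats"]            # toy corpus
X = TfidfVectorizer().fit_transform(docs)            # sparse TF-IDF matrix

weights = np.ones(X.shape[1])
weights[:min(100, X.shape[1])] *= 2.0                # boost the features you care about (illustrative)
Xw = X @ sp.diags(weights)                           # scale the columns; the matrix stays sparse

print(cosine_similarity(Xw)[0])                                          # weighted cosine similarities to document 0
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Xw))   # weighted k-means clustering

The same reweighted matrix works for both methods, since cosine similarity and k-means both operate on the scaled feature values.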

k-fold Cross Validation for determining k in k-means?

In a document clustering process, as a data pre-processing step, I first applied singular value decomposition to obtain U, S and Vt. Then, by choosing a suitable number of singular values, I truncated Vt, which now gives me a good document-document correlation from what I read here. I am now performing clustering on the columns of the matrix Vt to cluster similar documents together, and for this I chose k-means. The initial results looked acceptable (with k = 10 clusters), but I wanted to dig a bit deeper into choosing the value of k itself. To determine the number of clusters k in k-means, I was advised to look at cross-validation.
Before implementing it, I wanted to figure out whether there is a built-in way to achieve it using numpy or scipy. Currently, the way I am performing k-means is simply to use the function from scipy:
import numpy
from scipy.linalg import svd
from scipy.cluster.vq import whiten, kmeans2

# Preprocess the data and compute the SVD
U, S, Vt = svd(A)  # A is the TF-IDF representation of the original term-document matrix

# Obtain the document-document correlations from Vt
# The 50 is the threshold obtained after examining a scree plot of S
docvectors = numpy.transpose(Vt[0:50, :])

# Prepare the data and run k-means
whitened = whiten(docvectors)
res, idx = kmeans2(whitened, 10, iter=20)
Assuming my methodology is correct so far (please correct me if I am missing some step), at this stage, what is the standard way of using the output to perform cross-validation? Any reference/implementations/suggestions on how this would be applied to k-means would be greatly appreciated.
To run k-fold cross validation, you'd need some measure of quality to optimize for. This could be either a classification measure such as accuracy or F1, or a specialized one such as the V-measure.
Even the clustering quality measures that I know of need a labeled dataset ("ground truth") to work; the difference with classification is that you only need part of your data to be labeled for the evaluation, while the k-means algorithm can make use of all the data to determine the centroids and thus the clusters.
V-measure and several other scores are implemented in scikit-learn, as well as generic cross validation code and a "grid search" module that optimizes according to a specified measure of evaluation using k-fold CV. Disclaimer: I'm involved in scikit-learn development, though I didn't write any of the code mentioned.
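For reference, a small sketch of the supervised route described above, assuming only part of the data is labelled; make_blobs stands in for the SVD-reduced document vectors and the range of k is arbitrary:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import v_measure_score

X, y_true = make_blobs(n_samples=300, centers=4, random_state=0)  # stand-in for the document vectors
labelled = np.arange(100)     # indices of the documents with ground-truth labels (assumed)

# Fit k-means on ALL the data, evaluate only on the labelled subset
for k in range(2, 8):
    pred = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, v_measure_score(y_true[labelled], pred[labelled]))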
Indeed, to do traditional cross-validation with F1-score or V-measure as the scoring function, you would need some labeled data as ground truth. But in that case you could just count the number of classes in the ground-truth dataset and use it as your optimal value for k, hence no need for cross-validation.
Alternatively, you could use a cluster stability measure as an unsupervised performance evaluation and do some kind of cross-validation procedure for that. However, this is not yet implemented in scikit-learn, even though it's still on my personal todo list.
You can find additional info on this approach in the following answer on metaoptimize.com/qa. In particular you should read Clustering Stability: An Overview by Ulrike von Luxburg.
Here they use withinss (in R) to find an optimal number of clusters; withinss is a component of the object returned by kmeans and can be used to find a minimum "error":
https://www.statmethods.net/advstats/cluster.html
wss <- (nrow(mydata)-1)*sum(apply(mydata, 2, var))
for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares")
This formula isn't exactly it, but I'm working on one myself. The model would still change every time, but it would at least be the best model out of a bunch of iterations.
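The same within-cluster sum of squares idea expressed in Python terms (scikit-learn exposes it as inertia_); the data is synthetic and the range of k arbitrary:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=5, random_state=0)  # stand-in for the whitened document vectors

ks = range(1, 16)
wss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, wss, marker="o")
plt.xlabel("Number of clusters")
plt.ylabel("Within groups sum of squares")
plt.show()   # look for the "elbow" where the curve stops dropping sharply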
