Weighted distance in sklearn KNN - python

I'm writing a genetic algorithm to find feature weights for the Euclidean distance used by sklearn's KNN. The goal is to improve the classification rate and to remove some features from the dataset (a feature is removed by setting its weight to 0).
I'm using Python and sklearn's KNN.
This is how I'm using it:
def w_dist(x, y, **kwargs):
    return sum(kwargs["weights"]*((x-y)*(x-y)))
KNN = KNeighborsClassifier(n_neighbors=1, metric=w_dist, metric_params={"weights": w})
KNN.fit(X_train, Y_train)
neighbors = KNN.kneighbors(n_neighbors=1, return_distance=False)
Y_n = Y_train[neighbors]
tot = 0
for (a, b) in zip(Y_train, Y_n):
    if a == b:
        tot += 1
reduc_rate = X_train.shape[1]-np.count_nonzero(w)/tamaño
class_rate = tot/X_train.shape[0]
It's working really well, but it's very slow. I have been profiling my code and the slowest part is the evaluation of the distance.
I want to ask if there is some different way to tell KNN to use weights in the distance (I must use the euclidean distance, but I remove the square root).
Thanks!

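There is indeed another way, and it's built into scikit-learn (so it should be quicker): the wminkowski metric with weights. Below is an example with random weights for the features in your training set. Note that wminkowski applies the weights before raising to the power p, so with p=2 they are effectively squared; pass the square roots of your weights to reproduce w_dist. Newer scikit-learn releases have dropped wminkowski; there, metric='minkowski' accepts the same 'w' entry in metric_params and applies the weights after the power.
knn = KNeighborsClassifier(metric='wminkowski', p=2,
                           metric_params={'w': np.random.random(X_train.shape[1])})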
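If the weights are non-negative, another option worth trying is to avoid a weighted metric altogether. Multiplying each feature by sqrt(w) turns the weighted squared distance into an ordinary squared Euclidean distance, so the default metric can be used, and dropping the square root does not change which neighbor is nearest. The following is a minimal sketch under those assumptions; loo_1nn_accuracy and the demo arrays are hypothetical names for illustration, not part of the original post.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier


def loo_1nn_accuracy(X, y, w):
    """Leave-one-out 1-NN accuracy under sum(w * (x - y)**2), assuming w >= 0."""
    # Scaling by sqrt(w) makes plain Euclidean distance equal to the
    # square root of the weighted squared distance.
    X_scaled = X * np.sqrt(w)
    knn = KNeighborsClassifier(n_neighbors=1)
    knn.fit(X_scaled, y)
    # Calling kneighbors() without X queries the training points and
    # excludes each point from its own neighbor list.
    neighbors = knn.kneighbors(n_neighbors=1, return_distance=False)
    return float(np.mean(y[neighbors[:, 0]] == y))


# Hypothetical demo data: 60 samples, 5 features, two weights set to 0.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 5))
y_demo = (X_demo[:, 0] > 0).astype(int)
w_demo = np.array([1.0, 0.0, 0.5, 0.0, 2.0])
acc_demo = loo_1nn_accuracy(X_demo, y_demo, w_demo)
print(acc_demo)
```

A weight of 0 zeroes out the corresponding column, which matches the feature-removal behaviour described in the question.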

Related

Do I need to extract feature vectors from MNIST before using Kmeans

I am practicing on MNIST with sklearn.cluster.KMeans.
Intuitively, I just fit the training data with the sklearn function, but I got pretty low accuracy. I am wondering what step I have missed. Should I extract feature vectors with PCA first? Or should I use a larger n_clusters?
from sklearn import cluster
from sklearn.metrics import accuracy_score
clf = cluster.KMeans(init='k-means++', n_clusters=10, random_state=42)
clf.fit(X_train)
y_pred=clf.predict(X_test)
print(accuracy_score(y_test, y_pred))
I got a poor result of 0.137. Any recommendations? Thanks!
How are you passing the images in? Are the pixels flattened or kept in the 2D format? Are the pixel values normalized to between 0 and 1?
As you are running clustering, I would advise against PCA regardless and instead opt for t-SNE, which keeps neighbourhood information, but you should not need to do so before running K-Means.
The best way to debug is to see what your fitted model is predicting as the clusters. You can see an example here:
https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_digits.html
With this info, you can get an idea of where mistakes might be. Good luck!
Adding a note: K-Means is also probably not the best model for your purposes. It is best suited to unsupervised contexts, where the goal is to cluster data, whereas MNIST is a classification use case. KNN would be a better option while still allowing you to experiment with neighbours and such.
Here is an example I created with KNN: https://gist.github.com/andrew-x/0bb997b129647f3a7b7c0907b7e836fc
Unless I'm missing something: you are comparing cluster labels, which are arbitrarily numbered 0-9, with class labels, whose numbers 0-9 are not arbitrary. The 0s in your data might not end up in cluster number 0, yet this is the comparison you make. Clustering results are evaluated differently because of this. Some options to get a correct evaluation:
Generate a contingency matrix and plot it
Calculate the adjusted rand index
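To make the point concrete, here is a small sketch with hand-made labels (hypothetical data, not MNIST) in which the clustering is perfect but the cluster numbers are permuted. Plain accuracy punishes the permutation, while the adjusted Rand index and the contingency matrix do not.

```python
import numpy as np
from sklearn.metrics import accuracy_score, adjusted_rand_score
from sklearn.metrics.cluster import contingency_matrix

# Hypothetical labels: the grouping is identical, only the IDs differ.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_cluster = np.array([1, 1, 2, 2, 0, 0])

# Accuracy compares label numbers directly, so the permutation hurts it.
naive_acc = accuracy_score(y_true, y_cluster)
# The adjusted Rand index only looks at which points are grouped together.
ari = adjusted_rand_score(y_true, y_cluster)
# Rows are true classes, columns are cluster IDs.
table = contingency_matrix(y_true, y_cluster)

print(naive_acc, ari)
print(table)
```

In the contingency matrix, a good clustering shows one dominant entry per row, wherever that entry happens to sit.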

What is the best way to minimize the RMSE?

I am using LinearRegression() from sklearn to predict. I have created different features for X and am trying to understand how I can select the best features automatically. Let's say I have defined 50 different features for X and only one output for y. Is there a way to select the best-performing features automatically instead of doing it manually?
Also, I can get the RMSE using the following command:
scores = np.sqrt(-cross_val_score(lm, X, y, cv=20, scoring='neg_mean_squared_error')).mean()
From here, how can I use this RMSE score? Do I have to make multiple predictions? How am I going to use this RMSE? There must be a way to predict() using some optimisations, but I couldn't find it.
Actually sklearn doesn't seem to have a stepwise algorithm, which helps in understanding the importance of features. However, it does provide recursive feature elimination, which is a greedy feature elimination algorithm similar to sequential backward selection.
See the documentation here:
Recursive Feature Elimination
Note that this will not necessarily reduce your RMSE. You might try other techniques, such as Ridge and Lasso regression, as well.
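As a rough illustration of recursive feature elimination, the sketch below runs sklearn's RFE around LinearRegression on synthetic data. The data, the choice of three features to keep, and the variable names are all assumptions made for the example.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Hypothetical data: 10 features, of which only columns 0, 3 and 7 matter.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
y_demo = (
    3.0 * X_demo[:, 0]
    - 2.0 * X_demo[:, 3]
    + 1.5 * X_demo[:, 7]
    + rng.normal(scale=0.1, size=200)
)

# RFE repeatedly fits the model and drops the feature with the smallest
# coefficient magnitude until n_features_to_select remain.
selector = RFE(LinearRegression(), n_features_to_select=3)
selector.fit(X_demo, y_demo)

selected = np.flatnonzero(selector.support_)
print(selected)
```

Because RFE ranks features by coefficient size, the features should be on comparable scales. If the number of features to keep is unknown, sklearn's RFECV chooses it by cross-validation with a scoring function such as neg_mean_squared_error.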
RMSE measures the average magnitude of the prediction error.
RMSE gives high weight to large errors, so lower values are always better. RMSE can be improved only if you have a decent model. For feature selection, you can use PCA, stepwise regression, or a basic correlation technique. If you see a lot of multicollinearity, go for Lasso or Ridge regression. Also, make sure you have a decent split of test and train data: if you have bad testing data, you will get poor results. Finally, compare the R-squared on the training data with the R-squared on the testing data to make sure the model doesn't overfit.
It would be helpful if you added the number of observations in your test and train data and the R-squared values. Hope this helps.

How to do regression as opposed to classification using logistic regression and scikit learn

The target variable that I need to predict is a probability (as opposed to a label). The corresponding column in my training data is also in this form. I do not want to lose information by thresholding the targets to create a classification problem out of it.
If I train the logistic regression classifier with binary labels, the scikit-learn logistic regression API allows obtaining the probabilities at prediction time. However, I need to train it with probabilities. Is there a way to do this in scikit-learn, or a suitable Python package that scales to 100K data points of 1K dimensions?
I want the regressor to use the structure of the problem. One such
structure is that the targets are probabilities.
You can't have cross-entropy loss with non-indicator probabilities in scikit-learn; this is not implemented and not supported in the API. It is a limitation of scikit-learn.
In general, according to scikit-learn's docs a loss function is of the form Loss(prediction, target), where prediction is the model's output, and target is the ground-truth value.
In the case of logistic regression, prediction is a value on (0,1) (i.e., a "soft label"), while target is 0 or 1 (i.e., a "hard label").
For logistic regression you can approximate probabilities as targets by oversampling instances according to the probabilities of their labels. E.g., if for a given sample class_1 has probability 0.2 and class_2 has probability 0.8, then generate 10 training instances (copies of the sample): 8 with class_2 as the "ground truth target label" and 2 with class_1.
Obviously this is a workaround and is not extremely efficient, but it should work properly.
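A weighted variant of the same idea avoids making many physical copies: each sample appears twice, once with label 1 weighted by p and once with label 0 weighted by 1 - p, and the weights go to the sample_weight argument of LogisticRegression.fit. This is a sketch of that variant, not the exact oversampling procedure above; fit_on_soft_labels, the large C (to make regularization negligible), and the synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def fit_on_soft_labels(X, p, C=1e6):
    """Fit a binary logistic regression to target probabilities p in (0, 1)."""
    n = X.shape[0]
    # Each sample is used twice: as class 1 with weight p, as class 0 with 1 - p.
    X_rep = np.vstack([X, X])
    y_rep = np.concatenate([np.ones(n, dtype=int), np.zeros(n, dtype=int)])
    w_rep = np.concatenate([p, 1.0 - p])
    clf = LogisticRegression(C=C, max_iter=1000)
    clf.fit(X_rep, y_rep, sample_weight=w_rep)
    return clf


# Hypothetical data generated from a known logistic model.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
true_coef = np.array([1.0, -2.0, 0.5])
p_demo = 1.0 / (1.0 + np.exp(-(X_demo @ true_coef + 0.3)))

clf_demo = fit_on_soft_labels(X_demo, p_demo)
p_hat = clf_demo.predict_proba(X_demo)[:, 1]
print(clf_demo.coef_[0], clf_demo.intercept_[0])
```

The weighted log loss of this doubled dataset is the cross-entropy against the soft targets, so the fitted coefficients should land close to the ones used to generate the data.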
If you're OK with the upsampling approach, you can pip install eli5 and use eli5.lime.utils.fit_proba with a logistic regression classifier from scikit-learn.
An alternative solution is to implement (or find an implementation of) LogisticRegression in TensorFlow, where you can define the loss function as you like.
In compiling this solution I worked using answers from scikit-learn - multinomial logistic regression with probabilities as a target variable and scikit-learn classification on soft labels. I advise those for more insight.
This is an excellent question because (contrary to what people might believe) there are many legitimate uses of logistic regression as.... regression!
There are three basic approaches you can use if you insist on true logistic regression, and two additional options that should give similar results. They all assume your target output is between 0 and 1. Most of the time you will have to generate training/test sets "manually," unless you are lucky enough to be using a platform that supports SGD-R with custom kernels and X-validation support out-of-the-box.
Note that given your particular use case, the "not quite true logistic regression" options may be necessary. The downside of these approaches is that it takes more work to see the weight/importance of each feature in case you want to reduce your feature space by removing weak features.
Direct Approach using Optimization
If you don't mind doing a bit of coding, you can just use scipy optimize function. This is dead simple:
Create a function of the following type:
y_o = inverse-logit(a_0 + a_1*x_1 + a_2*x_2 + ...)
where inverse-logit(z) = exp(z) / (1 + exp(z))
Use scipy minimize to minimize the sum of -1 * [y_t*log(y_o) + (1-y_t)*log(1 - y_o)], summed over all datapoints. To do this you have to set up a function that takes (a_0, a_1, ...) as parameters and creates the function and then calculates the loss.
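The direct approach can be sketched as follows, assuming NumPy arrays and targets strictly between 0 and 1. The function names, the choice of L-BFGS-B, and the synthetic data are assumptions for illustration; the loss is written with logaddexp so that log(y_o) and log(1 - y_o) stay numerically stable.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit


def soft_log_loss(params, X, y_t):
    """Return the summed cross-entropy and its gradient for (a_0, a_1, ...)."""
    z = params[0] + X @ params[1:]
    # -log(y_o) = log(1 + exp(-z)) and -log(1 - y_o) = log(1 + exp(z)).
    loss = np.sum(y_t * np.logaddexp(0.0, -z) + (1.0 - y_t) * np.logaddexp(0.0, z))
    residual = expit(z) - y_t
    grad = np.concatenate([[residual.sum()], X.T @ residual])
    return loss, grad


def fit_logistic_regression(X, y_t):
    """Minimize the loss starting from all-zero parameters."""
    x0 = np.zeros(X.shape[1] + 1)
    result = minimize(soft_log_loss, x0, args=(X, y_t), jac=True, method="L-BFGS-B")
    return result.x


# Hypothetical data: probabilities generated from known parameters.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
true_params = np.array([0.3, 1.0, -2.0, 0.5])  # a_0, a_1, a_2, a_3
y_demo = expit(true_params[0] + X_demo @ true_params[1:])

fitted = fit_logistic_regression(X_demo, y_demo)
print(fitted)
```

Supplying the gradient (jac=True) is optional, but it should help considerably at the 1K-feature scale mentioned in the question.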
Stochastic Gradient Descent with Custom Loss
If you happen to be using a platform that has SGD regression with a custom loss, then you can just use that, specifying a loss of -[y_t*log(y_o) + (1-y_t)*log(1 - y_o)], the same expression that is minimized in the direct approach.
One way to do this is just to fork scikit-learn and add log loss to the regression SGD solver.
Convert to Classification Problem
You can convert your problem to a classification problem by oversampling, as described by @jo9k. But note that even in this case you should not use standard cross-validation, because the data are not independent anymore. You will need to break up your data manually into train/test sets and oversample only after you have broken them apart.
Convert to SVM
(Edit: I did some testing and found that on my test sets sigmoid kernels were not behaving well. I think they require some special pre-processing to work as expected. An SVM with a sigmoid kernel is equivalent to a 2-layer tanh Neural Network, which should be amenable to a regression task structured where training data outputs are probabilities. I might come back to this after further review.)
You should get similar results to logistic regression using an SVM with a sigmoid kernel. You can use scikit-learn's SVR class and specify the kernel as sigmoid. You may run into performance difficulties with 100,000s of data points across 1000 features, which leads me to my final suggestion:
Convert to SVM using Approximated Kernels
This method will give results a bit further away from true logistic regression, but it is extremely performant. The process is the following:
Use scikit-learn's RBFSampler to explicitly construct an approximate RBF-kernel feature map for your dataset.
Process your data through that feature map and then use scikit-learn's SGDRegressor with the epsilon_insensitive loss (the regression counterpart of the hinge loss; SGDRegressor has no hinge option) to realize a super-performant SVM on the transformed data.
The above is laid out with code here
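Since the linked code is not reproduced here, the following is only a rough sketch of the approximated-kernel idea. It chains RBFSampler with SGDRegressor using the epsilon_insensitive loss (SGDRegressor offers no hinge loss, so this is the closest SVM-style choice). The values of gamma, n_components and epsilon, the synthetic data, and the final clipping to [0, 1] are assumptions for illustration and would need tuning on real data.

```python
import numpy as np
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline

# Hypothetical data: targets are probabilities from a nonlinear function.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 5))
y_demo = 1.0 / (1.0 + np.exp(-(X_demo[:, 0] * X_demo[:, 1] + X_demo[:, 2])))

model = make_pipeline(
    # Random Fourier features approximating an RBF kernel.
    RBFSampler(gamma=0.1, n_components=300, random_state=0),
    # Linear SVR-style regressor trained by stochastic gradient descent.
    SGDRegressor(
        loss="epsilon_insensitive",
        epsilon=0.01,
        max_iter=1000,
        tol=1e-4,
        random_state=0,
    ),
)
model.fit(X_demo, y_demo)

# The regressor is unbounded, so predictions are clipped to [0, 1]
# (an assumption of this sketch, not part of the original answer).
p_hat = np.clip(model.predict(X_demo), 0.0, 1.0)
print(p_hat[:5])
```

Both steps scale linearly with the number of samples, which is what makes this route attractive for 100K points with 1K features.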
Instead of using predict in the scikit-learn library, use the predict_proba function.
Refer here:
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict_proba

k-fold Cross Validation for determining k in k-means?

In a document clustering process, as a data pre-processing step, I first applied singular value decomposition to obtain U, S and Vt, and then, by choosing a suitable number of singular values, I truncated Vt, which now gives me a good document-document correlation from what I read here. Now I am performing clustering on the columns of the matrix Vt to cluster similar documents together. For this I chose k-means, and the initial results looked acceptable to me (with k = 10 clusters), but I wanted to dig a bit deeper into choosing the k value itself. To determine the number of clusters k in k-means, it was suggested that I look at cross-validation.
Before implementing it I wanted to figure out if there is a built-in way to achieve it using numpy or scipy. Currently, the way I am performing kmeans is to simply use the function from scipy.
import numpy
from scipy.linalg import svd
from scipy.cluster.vq import whiten, kmeans2
# Preprocess the data and compute svd
U, S, Vt = svd(A) # A is the TFIDF representation of the original term-document matrix
# Obtain the document-document correlations from Vt
# This 50 is the threshold obtained after examining a scree plot of S
docvectors = numpy.transpose(Vt[0:50, 0:])
# Prepare the data to run k-means
whitened = whiten(docvectors)
res, idx = kmeans2(whitened, 10, iter=20)
Assuming my methodology is correct so far (please correct me if I am missing some step), at this stage, what is the standard way of using the output to perform cross-validation? Any reference/implementations/suggestions on how this would be applied to k-means would be greatly appreciated.
To run k-fold cross validation, you'd need some measure of quality to optimize for. This could be either a classification measure such as accuracy or F1, or a specialized one such as the V-measure.
Even the clustering quality measures that I know of need a labeled dataset ("ground truth") to work; the difference with classification is that you only need part of your data to be labeled for the evaluation, while the k-means algorithm can make use of all the data to determine the centroids and thus the clusters.
V-measure and several other scores are implemented in scikit-learn, as well as generic cross validation code and a "grid search" module that optimizes according to a specified measure of evaluation using k-fold CV. Disclaimer: I'm involved in scikit-learn development, though I didn't write any of the code mentioned.
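As a small sketch of the partially labeled evaluation described above, the code below fits k-means on all points but scores each k with the V-measure on a labeled subset only. The blob data, the 60-point labeled subset, and the range of k are assumptions for illustration, and the example uses sklearn's KMeans rather than scipy's kmeans2.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import v_measure_score

# Hypothetical data: three well-separated groups.
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X_demo, y_demo = make_blobs(
    n_samples=300, centers=centers, cluster_std=0.5, random_state=0
)

# Pretend that only 60 of the 300 points carry a ground-truth label.
rng = np.random.default_rng(0)
labeled = rng.choice(len(X_demo), size=60, replace=False)

scores = {}
for k in range(2, 7):
    # Clustering uses every point; scoring uses the labeled subset only.
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    scores[k] = v_measure_score(y_demo[labeled], km.labels_[labeled])

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

The V-measure balances homogeneity and completeness, so it penalizes both merging true groups (k too small) and splitting them (k too large).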
Indeed, to do traditional cross validation with the F1-score or V-measure as the scoring function, you would need some labeled data as ground truth. But in this case you could just count the number of classes in the ground truth dataset and use it as your optimal value for K, hence no need for cross-validation.
Alternatively, you could use a cluster stability measure as an unsupervised performance evaluation and do some kind of cross validation procedure for that. However, this is not yet implemented in scikit-learn, even though it's still on my personal todo list.
You can find additional info on this approach in the following answer on metaoptimize.com/qa. In particular you should read Clustering Stability: An Overview by Ulrike von Luxburg.
Here they use withinss to find an optimal number of clusters. "withinss" is an attribute of the kmeans object returned. That could be used to find a minimum "error"
https://www.statmethods.net/advstats/cluster.html
wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
for (i in 2:15) wss[i] <- sum(kmeans(mydata,
                                     centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares")
This formula isn't exactly it. But I'm working on one myself. The model would still change every time, but it would at least be the best model out of a bunch of iterations.
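For readers working in Python, a rough counterpart of the R snippet can be written with sklearn's KMeans, whose inertia_ attribute is the within-cluster sum of squares. The blob data and the range of 1 to 15 clusters are assumptions for illustration; this is the usual "elbow" heuristic, not cross-validation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical data standing in for mydata.
X_demo, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=0)

ks = list(range(1, 16))
# inertia_ is the sum of squared distances of samples to their closest
# center; n_init=10 keeps the best of 10 random restarts for each k.
wss = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo).inertia_
    for k in ks
]

for k, value in zip(ks, wss):
    print(k, round(value, 1))
```

Plotting wss against ks and looking for the point where the curve flattens gives the candidate k. Using several restarts per k addresses the concern that the model changes from run to run.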

Regression confidence using SVMs in python

I'm using regression SVMs in python and I am wondering if there is any way to get a "confidence-measure" value for its predictions.
Previously, when using SVMs for binary classification, I was able to compute a confidence-type value from the 'margin'. Here is some pseudo-code showing how I got a confidence value:
# Begin pseudo-code
import svm as svmlib
prob = svmlib.svm_problem(labels, data)
param = svmlib.svm_parameter(svm_type=svmlib.C_SVC, kernel_type = svmlib.RBF)
model = svmlib.svm_model(prob, param)
# get confidence
confidence = self.model.predict_values_raw(sample_to_classify)
I imagine that the further the new sample is from the training data, the worse the confidence, but I'm looking for a function that might help compute a reasonable estimate for this.
My (high-level) problem is as follows:
I have a function F(x), where x is a high-dimensional vector
F(x) can be computed but it is very slow
I want to train a regression SVM to approximate it
If I can find values of 'x' that have low prediction confidence, I can add these points and retrain (aka. active learning)
Has anyone obtained/used regression-SVM confidence/margin values before?
Have a look at this similar response on Stack back in January. The chosen answer was spot on regarding how hard it is to get confidence measures on non-parametric fitting methods. There's probably some Bayesian type thing you could do, but that's probably not possible with the Python SVM library: Prefer one class in libsvm (python).
