scikit-learn: fitting KNeighborsClassifier without labels - python

I'm trying to fit a simple KNN classifier, and wanted to use the scikit-learn implementation in order to benefit from their efficient implementation (multiprocessing, tree-based algorithms).
However, what I want to get as a result is just the list of distances and nearest neighbours for each data point, rather than the predicted label.
I will then compute the label separately in a non-standard way.
The kneighbors method seems exactly what I need, however I cannot call it without fitting the model with fit first. The issue is, fit() requires the labels (y) as a parameter.
Is there a way to achieve what I'm after? Perhaps I can pass fake labels in the fit() method - is there any issues I'm missing by doing this? E.g. is this going to affect the results (of the computed distances and list of nearest neighbours for each datapoint) in any way? I wouldn't expect so but I'm not familiar with the workings of the scikit-learn implementation.

There is another algorithm for what you desire: NearestNeighbors.
This algorithm is unsupervised (you don't need the y labels); moreover, there is one method (kneighbors) that calculates distances to points and which sample is.
Check the link, it is quite clear.

Related

How to get different clusters using OPTICS in python by varying the parameter xi?

I am trying to fit OPTICS clustering model to my data using python's sklearn
from sklearn.cluster import OPTICS, cluster_optics_dbscan
from sklearn.preprocessing import StandardScaler
x = StandardScaler().fit_transform(data.loc[:, features])
op = OPTICS(max_eps=20, min_samples=10, xi=0.1)
op = op.fit(x)
From this fitted model, I get the reachability distances (op.reachability_) and the ordering (op.ordering_) of the points and also the cluster labels (op.labels_)
Now, I want to check how the clusters would vary by changing the parameter xi (0.01 in this case). Can I do this without fitting the model again and again with different xi's (which takes a lot of time)?
Or, in other words, is there a scikit-learn function that takes the reachability distances (op.reachability_), the ordering (op.ordering_) of the points and xi as input and outputs the cluster labels?
I found a function cluster_optics_dbscan which "performs DBSCAN extraction for an arbitrary epsilon given reachability-distances, core-distances and ordering and epsilon" (Not quite what I want)
A priori, you need to call the fit method, which is doing the actual cluster computation, as stated in the function description.
However, if you look at the optics class, the cluster_optics_xi function "automatically extract clusters according to the Xi-steep method", calling both the _xi_cluster and _extract_xi_labels functions, which both take the xi parameter as input. So, by using them and refactoring a bit, you may be able to achieve what you want.

Why do I need to call fit() before transform() when using PolynomialFeatures?

Hello to all you great minds,
I'm trying to understand more rigorously the way polynomial fitting works with scikit. More specifically, what I'm trying to do is break down the process, and to only show a dataframe with the new polynomial features generated based on a single value.
So I have data which with several entries, each is 1-dimensional. I want to generate a design matrix suitable for polynomial fitting. What I am currently doing is along these lines:
pd.DataFrame(PolynomialFeatures(k).fit_transform(X))
And this works as expected.
However, what I'm struggling with is the role of fit_transform(). As far as I am concerned, and I not trying to fit anything quiet yet, merely produce a dataframe with the newly constructed polynomial features. Naively I tried changing fit_transform() to transform(), but apparently I have to use fit before I am allowed to transform.
I would appreciate it if anyone could point me to my error. I am not yet trying to fit a model on the data, only to create a design matrix with the polynomial features, so why do I have to use fit() (or fit_transform(), to that matter)? In fact, I don't really understand what fit() actually does here, and the documentation didn't help me wrap my head around it.
Thank you!
I think the reason for this is to be consistent with their API. When doing preprocessing you still want to "fit" to some train data and apply the same preprocessing step to the train AND the test data.
An example where it becomes more clear is Standardscaling (which is a different preprocessing step). You calculate the mean and std from the train data and apply the same scaling (X - mean) / std to the train AND test data (with the mean and std taken from the train data.
Therefore the two methods fit and transform are separated.
In your case of polynomial features it probably makes no sense to "fit", because no information is extracted from the train data and the step can directly be applied to the test data without knowing the train data. But including the fit in PolynomialFeatures makes it consistent with their whole API. The consistency becomes necessary when you pipe multiple preprocessing steps.

Custom Criterion for DecisionTreeRegressor in sklearn

I want to use a DecisionTreeRegressor for multi-output regression, but I want to use a different "importance" weight for each output (e.g. predicting y1 accurately is twice as important as predicting y2).
Is there a way of including these weights directly in the DecisionTreeRegressor of sklearn? If not, how can I create a custom MSE criterion with different weights for each output in sklearn?
I am afraid you can only provide one weight-set when you fit
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor.fit
And the more disappointing thing is that since only one weight-set is allowed, the algorithms in sklearn is all about one weight-set.
As for custom criterion:
There is a similar issue in scikit-learn
https://github.com/scikit-learn/scikit-learn/issues/17436
Potential solution is to create a criterion class mimicking the existing one (e.g. MAE) in https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_criterion.pyx#L976
However, if you see the code in detail, you will find that all the variables about weights are "one weight-set", which is unspecific to the tasks.
So to customize, you may need to hack a lot of code, including:
hacking the fit function to accept a 2D array of weights
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_classes.py#L142
Bypassing the checking (otherwise continue to hack...)
Modify tree builder to allow the weights
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_tree.pyx#L111
It is terrible, there are a lot of related variable, you should change double to double*
Modify Criterion class to accept a 2-D array of weights
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_criterion.pyx#L976
In init, reset and update, you have to keep attributions such as self.weighted_n_node_samples specific to outputs (tasks).
TBH, I think it is really difficult to implement. Maybe we need to raise an issue for scikit-learn group.

Does sklearn have group lasso?

I'm interested in using group lasso for a problem I have. Here is a link to the algorithm. I know R has a slick implementation, but am curious to see if python has something similar.
I think sklearn.linear_model.MultiTaskLasso might be kind of similar, but am not sure. Can anyone shed some light on this?
Whether or not to implement the Group Lasso in sklearn is discussed in this issue in the sklearn repo, where the conclusion so far is that it is too much of a niche model to justify the maintenance it would need if included in master.
Therefore, I have implemented a GroupLasso class which passes sklearn's check_estimator() in my python/cython package celer, which acts as a dropin replacement for sklearn's Lasso, MultitaskLasso, sparse Logistic regression with faster solvers.
The solver uses coordinate descent, working set methods and extrapolation, which should allow it to scale to problems with millions of features.
It supports sparse and dense data, along with centering and normalization (centering sparse data is not trivial as it breaks the sparsity of the design matrix), and comes with a GroupLassoCV class to perform cross-validation.
In celer's documentation, there is an example showing how to use it.
I've also looked into this, as far as I know scikit-learn does not provide this implementation.
The MultiTaskLasso does something else. From the documentation:
"The MultiTaskLasso is a linear model that estimates sparse coefficients for multiple regression problems jointly: y is a 2D array, of shape (n_samples, n_tasks). The constraint is that the selected features are the same for all the regression problems, also called tasks."
In other words, the MultiTaskLasso is an implementation of the Lasso which is able to predict multiple targets at the same time (hence y is a 2D array). Another way this problem is known is 'multi-output regression' or 'multi-target regression'. If the tasks are related, such methods can improve methods which try to model every task or target separately.

How to know the factor by which a feature affects a model's prediction

I have trained my model on a data set and i used decision trees to train my model and it has 3 output classes - Yes,Done and No , and I got to know the feature that are most decisive in making a decision by checking feature importance of the classifier. I am using python and sklearn as my ML library. Now that I have found the feature that is most decisive I would like to know how that feature contributes, in the sense that if the relation is positive such that if the feature value increases the it leads to Yes and if it is negative It leads to No and so on and I would also want to know the magnitude for the same.
I would like to know if there a solution to this and also would to know a solution that is independent of the algorithm of choice, Please try to provide solutions that are not specific to decision tree but rather general solution for all the algorithms.
If there is some way that would tell me like:
for feature x1 the relation is 0.8*x1^2
for feature x2 the relation is -0.4*x2
just so that I would be able to analyse the output depends based on input feature x1 ,x2 and so on
Is it possible to find out the whether a high value for particular feature to a certain class, or a low value for the feature.
You can use Partial Dependency Plots (PDPs). scikit has a built-in PDP for their GBM - http://scikit-learn.org/stable/modules/ensemble.html#partial-dependence which was created in Friedman's Greedy Function Approximation Paper http://statweb.stanford.edu/~jhf/ftp/trebst.pdf pp26-28.
If you used scikit-learn GBM, use their PDP function. If you used another estimator, you can create your own PDP which is a few lines of code. PDPs and this method is algorithm agnostic as you asked. It just will not scale.
Logic
Take your training data
For the feature you are examining, get all unique values or some quantiles to reduce the time
Take a unique value
For the feature you are examining, in all observations, replace with the value from (3)
Predict all training observations
Get the mean of all predictions
Plot the point (unique value, mean)
Repeat 3-7 taking the next unique value until no more values
You now have a 1-way PDP. When the feature increases (X-axis), what on average happens to the prediction (y-axis). What is the magnitude of the change.
Taking the analysis further, you can fit a smooth curve or splines to the PDP which may help understand the relationship. As #Maxim said, there is not a perfect rule so you are looking for the trend here, trying to understand a relationship. We tend to run this for the most important features and/or features you are curious about.
The above scikit-learn reference has more examples.
For a Decision Tree, you can use the algorithmic short-cut as described by Friedman and implemented by scikit-learn. You need to walk the tree so the code is tied to the package and algorithm, hence it does not answer your question and I will not describe it. But it is on that scikit-learn page I referenced and in the paper.
def pdp_data(clf, X, col_index):
X_copy = np.copy(X)
results = {}
results['x_values'] = np.sort(np.unique(X_copy[:, col_index]))
results['y_values'] = []
for value in results['x_values']:
X_copy[:, col_index] = value
y_predict = clf.predict_log_proba(X_copy)[:, 1]
results['y_values'].append(np.mean(y_predict))
return results
Edited to answer new part of question:
For the addition to your question, you are looking for a linear model with coefficients. If you must interpret the model with linear coefficients, build a linear model.
Sometimes how you need to interpret the model guides what type of model you build.
In general - no. Decision trees work differently that that. For example it could have a rule under the hood that if feature X > 100 OR X < 10 and Y = 'some value' than answer is Yes, if 50 < X < 70 - answer is No etc. In the instance of decision tree you may want to visualize its results and analyse the rules. With RF model it is not possible, as far as I know, since you have a lot of trees working under the hood, each has independent decision rules.

Categories