I'm trying to use scikit-learn to do some machine learning on natural language data. I've got my corpus transformed into bag-of-words vectors (which take the form of a sparse CSR matrix) and I'm wondering if there's a supervised dimensionality reduction algorithm in sklearn capable of taking high-dimensional, supervised data and projecting it into a lower dimensional space which preserves the variance between these classes.
The high-level problem description is that I have a collection of documents, each of which can have multiple labels on it, and I want to predict which of those labels will get slapped on a new document based on the content of the document.
At its core, this is a supervised, multi-label, multi-class problem using a sparse representation of BoW vectors. Is there a dimensionality reduction technique in sklearn that can handle that sort of data? Are there other sorts of techniques people have used in working with supervised, BoW data in scikit-learn?
Thanks!
I am a bit confused by your question. In my experience, dimensionality reduction is never really supervised... but it seems that what you want is some sort of informed feature selection, which is impossible to do before the classification is done. In other words, you cannot know which features are more informative before your classifier is trained and validated.
However, reducing the size and complexity of your data is always good, and you have various ways to do it with text data. The applicability and performance depend on the type of vectors you have (frequency counts, tf-idf), and you will always have to determine the number of dimensions (components) you want in your output. The implementations in scikit-learn are mostly in the decomposition module.
The most popular method in Natural Language Processing is Singular Value Decomposition (SVD), which is at the core of Latent Semantic Analysis (LSA, also LSI). Staying with scikit-learn, you can simply apply TruncatedSVD() on your data. A similar method is Non-negative matrix factorization, implemented in scikit-learn as NMF().
An increasingly popular approach uses transformation by random projections, Random Indexing. You can do this in scikit-learn with the functions in random_projection.
As someone pointed out in another answer, Latent Dirichlet Allocation is also an alternative, although it is much slower and computationally more demanding than the methods above. Besides, it is at the time of writing unavailable in scikit-learn.
If all you want is to simplify your data in order to feed it to a classifier, I would suggest SVD with n_components between 100 and 500, or random projection with n_components between 500 and 2000 (common values from the literature).
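A minimal sketch of the two reductions mentioned above, applied directly to a sparse bag-of-words matrix. The toy corpus and the tiny n_components values are made up for illustration only; on a real corpus you would use values in the ranges suggested above.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.random_projection import SparseRandomProjection

# Toy corpus (hypothetical); CountVectorizer returns a sparse matrix.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell on the news",
    "the market rallied on good news",
]
X = CountVectorizer().fit_transform(docs)

# LSA-style reduction; TruncatedSVD accepts sparse input without densifying it.
svd = TruncatedSVD(n_components=2, random_state=0)
X_svd = svd.fit_transform(X)

# Random projection; n_components is set by hand because the toy data is tiny.
rp = SparseRandomProjection(n_components=3, random_state=0)
X_rp = rp.fit_transform(X)

print(X.shape, X_svd.shape, X_rp.shape)
```

Both transformers are unsupervised: they never see the labels, so they compress the data without targeting class separation.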
If you are interested in using the reduced dimensions as some sort of classification/clustering already (people call this topic extraction, although you are really not extracting topics, rather latent dimensions), then LDA might be a better option. Beware, it is slow and it only takes pure frequency counts (no tf-idf). And the number of components is a parameter that you have to determine in advance (no estimation possible).
Returning to your problem, I would make a scikit-learn pipeline with a vectorizer, dimensionality reduction options and a classifier, and would carry out a massive parameter search. In this way, you will see what gives you the best results with the label set you have.
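Such a pipeline might look roughly like the sketch below. The corpus, the labels, the one-vs-rest logistic regression and the very small parameter grid are all illustrative assumptions, not something taken from the question; a real search would cover far more values and use more folds.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical multi-label corpus: each document carries one or more labels.
docs = [
    "cheap flights to paris and rome",
    "paris hotel deals and cheap flights",
    "python code for machine learning",
    "machine learning with python and numpy",
    "cheap python course in paris",
    "rome travel guide and hotel deals",
    "numpy code for fast machine learning",
    "learning to code on flights to rome",
]
labels = [
    ["travel"], ["travel"], ["tech"], ["tech"],
    ["tech", "travel"], ["travel"], ["tech"], ["tech", "travel"],
]
# Binary indicator matrix, one column per label.
Y = MultiLabelBinarizer().fit_transform(labels)

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svd", TruncatedSVD(random_state=0)),
    ("clf", OneVsRestClassifier(LogisticRegression())),
])

# Deliberately tiny grid so the example runs quickly.
param_grid = {
    "svd__n_components": [2, 3],
    "clf__estimator__C": [1.0, 10.0],
}
search = GridSearchCV(pipe, param_grid, cv=2)
search.fit(docs, Y)

print(search.best_params_)
print(search.predict(["cheap hotel in rome"]))
```

Because the vectorizer and the reduction step sit inside the pipeline, they are refit on each training fold, which keeps the cross-validation scores honest.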
You can use latent dirichlet allocation (here's the wiki) to discover the topics in your documents. For the assignment of a label to a document, you can use the conditional probability distribution for a document label (given the distribution over the topics in your document). If you have labels for your documents already, then you just need to learn the CPD, which is trivial. Unfortunately, scikit-learn does not have an LDA implementation, but gensim does.
PS: Here's another paper that may help. If you're not very well versed in statistical inference/learning or machine learning, I suggest that you start here (note: it still assumes a high level of mathematical maturity).
Several existing scikit modules do something similar to what you asked for.
Linear Discriminant Analysis is probably closest to what you asked for. It finds a projection of the data that maximizes the distance between the class centroids relative to the projected variances.
Cross decomposition includes methods like Partial Least Squares, which fit linear regression models for multidimensional targets via a projection through a lower-dimensional intermediate space. It is a lot like a single hidden layer neural net without the sigmoids.
These are linear regression methods, but you could apply a 0-1 encoding to your target signal and use these models anyway.
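As a rough illustration, here is Linear Discriminant Analysis used as a supervised projection on synthetic data. Note the assumptions: the data is dense and each sample has exactly one class label. As far as I know the estimator does not take sparse input, so a sparse multi-label BoW matrix would need extra steps first (for example an unsupervised reduction, and one problem per label).

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic dense data with three classes (illustrative only).
X, y = make_classification(
    n_samples=150,
    n_features=20,
    n_informative=5,
    n_classes=3,
    n_clusters_per_class=1,
    random_state=0,
)

# With 3 classes, LDA can produce at most n_classes - 1 = 2 components.
lda = LinearDiscriminantAnalysis(n_components=2)
X_low = lda.fit_transform(X, y)

print(X.shape, "->", X_low.shape)
```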
You could use an L1 regularized classifier like LogisticRegression or SGDClassifier to do feature selection. RandomizedLogisticRegression combines this with bootstrapping to get a more stable feature set.
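For the L1 route, something along these lines should work. It is a sketch on synthetic data; the value C=0.1 and the use of SelectFromModel to pick the non-zero coefficients are my own choices, not part of the answer above.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# 50 features, of which only 5 carry signal (synthetic, illustrative).
X, y = make_classification(
    n_samples=200,
    n_features=50,
    n_informative=5,
    n_redundant=0,
    n_clusters_per_class=1,
    random_state=0,
)

# The L1 penalty drives many coefficients to exactly zero;
# the liblinear solver supports this penalty.
l1_clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)

# Keep only the features whose coefficient is (numerically) non-zero.
selector = SelectFromModel(l1_clf).fit(X, y)
X_sel = selector.transform(X)
n_kept = X_sel.shape[1]

print("kept", n_kept, "of", X.shape[1], "features")
```

A smaller C means stronger regularization and fewer surviving features, so C is the knob to tune here.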
Try ISOMAP. There's a super simple built-in function for it in scikits.learn. Even if it doesn't have some of the preservation properties you're looking for, it's worth a try.
Use a multi-layer neural net for classification. If you want to see what the representation of the input is in the reduced dimension, look at the activations of the hidden layer. The role of the hidden layer is by definition optimised to distinguish between the classes, since that's what's directly optimised when the weights are set.
You should remember to use a softmax activation on the output layer, and something non-linear on the hidden layer (tanh or sigmoid).
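Staying within scikit-learn, a sketch of this could use MLPClassifier and recompute the hidden-layer activations by hand from the fitted weights. The layer size of 2, the tanh activation and the synthetic data are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic three-class data (illustrative only).
X, y = make_classification(
    n_samples=200,
    n_features=20,
    n_informative=5,
    n_classes=3,
    n_clusters_per_class=1,
    random_state=0,
)

# One hidden layer with 2 tanh units: the hidden layer is the reduced space.
mlp = MLPClassifier(
    hidden_layer_sizes=(2,), activation="tanh", max_iter=500, random_state=0
)
mlp.fit(X, y)

# Hidden activations, recomputed from the first layer's weights and biases.
hidden = np.tanh(X @ mlp.coefs_[0] + mlp.intercepts_[0])

print(hidden.shape)
```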
So I need to classify a spiral dataset. I have been experimenting with a bunch of algorithms like KNN, Kernel SVM, etc. I would like to try to improve the performance of Logistic Regression using feature engineering, preprocessing, etc.
I am also using scikit learn to do all of the classifications.
I fully understand Logistic Regression is not the proper algorithm for this sort of problem. This is more of a learning exercise for preprocessing and other feature engineering/extraction methods, to see how much I can improve this specific model.
Here is an example dataset I would use for the classification. Any suggestions of how I can manipulate the dataset to use in the Logistic Regression algorithm would be helpful.
I also have datasets with multiple spirals. Some datasets have 2 classes, and some have up to 5. This means up to 5 spirals.
Logistic Regression is generally used as a linear classifier, i.e. the decision boundary separating one class of samples from the other is linear (a straight line), but it can be used for non-linear decision boundaries as well.
Using the kernel trick in SVC is also a good option, as it maps the data from the lower dimension to a higher dimension where it becomes linearly separable.
example:
In the above example, the data is not linearly separable in lower dimension, but after applying the transformation ϕ(x) = x² and adding the second dimension to the features we have the right side graph that becomes linearly separable.
You can start transforming the data by creating new features for applying logistic regression.
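To make the ϕ(x) = x² idea concrete, here is a small sketch on made-up 1-D data where one class sits in the middle and the other on both sides. The data and the threshold of 1.5 are invented for the example; for spirals you would need different engineered features, but the mechanism is the same.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# 1-D points; class 1 lies on both sides of class 0, so no single
# threshold on x can separate the classes.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=400)
y = (np.abs(x) > 1.5).astype(int)

X_raw = x.reshape(-1, 1)            # original feature only
X_aug = np.column_stack([x, x**2])  # original feature plus x squared

acc_raw = LogisticRegression().fit(X_raw, y).score(X_raw, y)
acc_aug = LogisticRegression().fit(X_aug, y).score(X_aug, y)

print("raw feature accuracy:", acc_raw)
print("with x**2 accuracy:  ", acc_aug)
```

The classifier itself is unchanged; only the input space is different, and in that space a straight line is enough.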
Also try SVC (Support Vector Classifier), which uses the kernel trick. For SVC you don't have to transform the data into higher dimensions explicitly.
There are a few resources which are great for learning: one and two.
Since the data doesn't seem to be linearly separable, you can try using the Kernel Trick method commonly used in Support Vector Classification. The kernel function accepts inputs in the original lower-dimensional space and returns the dot product of the transformed vectors in the higher dimensional space. That means transformed vector ϕ(x) is just some function of the coordinates in the corresponding lower-dimensional vector x.
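A quick numerical check of that statement, using the degree-2 polynomial kernel k(a, b) = (a·b)² as an example. The kernel choice and the explicit feature map below are a standard textbook pair chosen for illustration, not something specific to this dataset.

```python
import numpy as np


def phi(v):
    """Explicit feature map from 2-D to 3-D for the kernel (a . b)**2."""
    x1, x2 = v
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])


def poly_kernel(a, b):
    """Kernel evaluated directly in the original 2-D space."""
    return np.dot(a, b) ** 2


a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

lhs = poly_kernel(a, b)  # computed in the low-dimensional space
rhs = phi(a) @ phi(b)    # dot product in the high-dimensional space

print(lhs, rhs)
```

Both numbers agree, which is why a kernel method never has to build ϕ(x) explicitly.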
Scikit-learn allows sample weights to be provided to linear, logistic, and ridge regressions (among others), but not to elastic net or lasso regressions. By sample weights, I mean each element of the input to fit on (and the corresponding output) is of varying importance, and should have an effect on the estimated coefficients proportional to its weight.
Is there a way I can manipulate my data before passing it to ElasticNet.fit() to incorporate my sample weights?
If not, is there a fundamental reason it is not possible?
Thanks!
You can read some discussion about this in sklearn's issue-tracker.
It basically reads like:
not that hard to do (theory-wise)
pain keeping all the basic sklearn APIs and supporting all possible cases (dense vs. sparse)
As you can see in this thread and the linked one about adaptive lasso, there is not much activity there (probably because not many people care and the related paper is not popular enough; but that's only a guess).
Depending on your exact task (size? sparseness?), you could build your own optimizer quite easily based on scipy.optimize, supporting this kind of sample-weights (which will be a bit slower, but robust and precise)!
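On the data-manipulation part of the question: for a squared-error loss, weighting sample i by w_i is the same as multiplying row i of X and y by sqrt(w_i). The sketch below checks this with plain least squares in NumPy; the data and weights are made up. Applying it to ElasticNet rests on assumptions I have not verified here: you would fit without an intercept (or handle the intercept yourself), and rescale the weights to average 1 so that alpha keeps roughly the same meaning. Newer scikit-learn releases may also accept sample_weight in ElasticNet.fit directly, so check the documentation for your version first.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)
w = rng.uniform(0.5, 2.0, size=50)  # hypothetical per-sample weights

# Reference: solve the weighted normal equations (X^T W X) beta = X^T W y.
beta_direct = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))

# Same problem as ordinary least squares on sqrt(w)-scaled rows.
sw = np.sqrt(w)
X_s = X * sw[:, None]
y_s = y * sw
beta_scaled, *_ = np.linalg.lstsq(X_s, y_s, rcond=None)

print(beta_direct)
print(beta_scaled)
```

The scaled arrays X_s and y_s are what you would pass to an estimator that has no sample-weight argument.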
I'm interested in using group lasso for a problem I have. Here is a link to the algorithm. I know R has a slick implementation, but am curious to see if python has something similar.
I think sklearn.linear_model.MultiTaskLasso might be kind of similar, but am not sure. Can anyone shed some light on this?
Whether or not to implement the Group Lasso in sklearn is discussed in this issue in the sklearn repo, where the conclusion so far is that it is too much of a niche model to justify the maintenance it would need if included in master.
Therefore, I have implemented a GroupLasso class which passes sklearn's check_estimator() in my python/cython package celer, which acts as a drop-in replacement for sklearn's Lasso, MultiTaskLasso and sparse logistic regression, with faster solvers.
The solver uses coordinate descent, working set methods and extrapolation, which should allow it to scale to problems with millions of features.
It supports sparse and dense data, along with centering and normalization (centering sparse data is not trivial as it breaks the sparsity of the design matrix), and comes with a GroupLassoCV class to perform cross-validation.
In celer's documentation, there is an example showing how to use it.
I've also looked into this; as far as I know, scikit-learn does not provide this implementation.
The MultiTaskLasso does something else. From the documentation:
"The MultiTaskLasso is a linear model that estimates sparse coefficients for multiple regression problems jointly: y is a 2D array, of shape (n_samples, n_tasks). The constraint is that the selected features are the same for all the regression problems, also called tasks."
In other words, the MultiTaskLasso is an implementation of the Lasso which is able to predict multiple targets at the same time (hence y is a 2D array). This problem is also known as 'multi-output regression' or 'multi-target regression'. If the tasks are related, such methods can improve on methods which model every task or target separately.
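A short sketch of what that constraint means in practice, on synthetic data where only features 0 and 3 matter for two tasks. The data and alpha=0.1 are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import MultiTaskLasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))

# True coefficients: only features 0 and 3 are used, by both tasks.
W = np.zeros((10, 2))
W[0] = [1.0, -2.0]
W[3] = [0.5, 1.5]
Y = X @ W + 0.01 * rng.normal(size=(100, 2))

model = MultiTaskLasso(alpha=0.1).fit(X, Y)

# coef_ has shape (n_tasks, n_features); a feature is either used by
# every task or dropped for every task.
support = model.coef_ != 0

print(model.coef_.shape)
print(support)
```

So the grouping here is across tasks for the same feature, which is different from a group lasso over user-defined groups of features.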
Use case:
I have a small dataset with about 3-10 samples in each class. I am using sklearn SVC to classify those with rbf kernel.
I need the confidence of the prediction along with the predicted class. I used predict_proba method of SVC.
I was getting weird results with that. I searched a bit and found out that it makes sense only for larger datasets.
I found this question on Stack Overflow: Scikit-learn predict_proba gives wrong answers.
The author of the question verified this by multiplying the dataset, thereby duplicating the dataset.
My questions:
1) If I multiply my dataset by, let's say, 100, having each sample 100 times, it increases the "correctness" of "predict_proba". What side effects will it have? Overfitting?
2) Is there any other way I can calculate the confidence of the classifier? Like distance from the hyperplanes?
3) For this small sample size, is SVM a recommended algorithm or should I choose something else?
First of all: Your data set seems very small for any practical purposes. That being said, let's see what we can do.
SVMs are mainly popular in high-dimensional settings. It is currently unclear whether that applies to your project. They build planes on a handful of (or even single) supporting instances, and are often outperformed by neural nets in situations with large training sets. A priori they might not be your worst choice.
Oversampling your data will do little for an approach using SVM. SVM is based on the notion of support vectors, which are basically the outliers of a class that define what is in the class and what is not. Oversampling will not construct new support vectors (I am assuming you are already using the train set as test set).
Plain oversampling in this scenario will also not give you any new information on confidence, other than artifacts constructed by unbalanced oversampling, since the instances will be exact copies and no distribution changes will occur. You might be able to find some information by using SMOTE (Synthetic Minority Oversampling Technique). You will basically generate synthetic instances based on the ones you have. In theory this will provide you with new instances that won't be exact copies of the ones you have, and might thus fall a little outside the normal classification. Note: By definition all these examples will lie in between the original examples in your sample space. This does not mean that they will lie in between them in the space your SVM projects to, so the model may learn effects that aren't really there.
Lastly, you can estimate confidence with the distance to the hyperplane. Please see: https://stats.stackexchange.com/questions/55072/svm-confidence-according-to-distance-from-hyperline
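In scikit-learn the signed distance-like score is available through decision_function. Here is a sketch on a made-up two-class dataset with five samples per class; the data and the query points are invented for the example.

```python
import numpy as np
from sklearn.svm import SVC

# Tiny two-class dataset, five samples per class (illustrative only).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(-2, 1, size=(5, 2)),
    rng.normal(2, 1, size=(5, 2)),
])
y = np.array([0] * 5 + [1] * 5)

clf = SVC(kernel="rbf", gamma="scale").fit(X, y)

X_new = np.array([[-2.0, -2.0], [0.1, 0.0], [2.5, 2.0]])

# The sign gives the predicted side; a larger magnitude means the point
# lies further from the decision boundary.
margin = clf.decision_function(X_new)
pred = clf.predict(X_new)

print(margin)
print(pred)
```

These scores are not probabilities and are not calibrated, so they are best used to rank predictions by confidence rather than read as percentages.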
I would like to ask if anyone has an idea or example of how to do support vector regression in python with high-dimensional output (more than one dimension) using a python binding of libsvm? I checked the examples and they all assume the output to be one-dimensional.
libsvm might not be the best tool for this task.
The problem you describe is called multivariate regression, and usually for regression problems, SVMs are not necessarily the best choice.
You could try something like group lasso (http://www.di.ens.fr/~fbach/grouplasso/index.htm - matlab) or sparse group lasso (http://spams-devel.gforge.inria.fr/ - seems to have a python interface), which solve the multivariate regression problem with different types of regularization.
Support Vector Machines as a mathematical framework is formulated in terms of a single prediction variable. Hence most libraries implementing them will reflect this as using one single target variable in their API.
What you could do is train a single SVM model for each target dimension in your data.
on the plus side, you can train them in parallel on a cluster, as each model is independent of the others
on the minus side, sub-models will share nothing and won't benefit from what they individually discovered in the structure of the input data and potentially need a lot of memory to store the model as they will have no shared intermediate representation
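The one-model-per-target approach can be sketched as follows. Note the swap: this uses scikit-learn's SVR (which is built on libsvm) wrapped in MultiOutputRegressor, rather than the raw libsvm python binding from the question. The synthetic data is made up for the example.

```python
import numpy as np
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR

# Synthetic regression problem with a 2-dimensional target.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(80, 3))
Y = np.column_stack([X[:, 0] + X[:, 1], X[:, 1] * X[:, 2]])

# Fits one independent SVR per target column; setting n_jobs would train
# the sub-models in parallel.
model = MultiOutputRegressor(SVR(kernel="rbf"), n_jobs=None).fit(X, Y)
Y_pred = model.predict(X[:5])

print(len(model.estimators_), Y_pred.shape)
```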
Variants of SVMs can probably be devised in a multi-task learning setting to learn some common kernel-based intermediate representation suitable for reuse to predict multi-dimensional targets; however, this is not implemented in libsvm AFAIK. Google for multi task learning SVM if you want to learn more.
Alternatively, multi-layer perceptrons (a kind of feed forward neural networks) can naturally deal with multi-dimensional outcomes and hence should be better at sharing intermediate representations of the data reused across targets, especially if they are deep enough with the first layers pre-trained in an unsupervised manner using an autoencoder objective function.
You might want to have a look at http://deeplearning.net/tutorial/ for a nice introduction to various neural network architectures and practical tools and examples to implement them efficiently.