Is there a straightforward way to view the top features of each class, based on tf-idf?
I am using the KNeighbors classifier, SVC with a linear kernel, and MultinomialNB.
Secondly, I have been searching for a way to view documents that were not classified correctly. I can view the confusion matrix, but I would like to see the specific documents to understand which features are causing the misclassification.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

tfidf_vectorizer = TfidfVectorizer()
classifier = SVC(kernel='linear')
counts = tfidf_vectorizer.fit_transform(data['text'].values).toarray()
targets = data['class'].values
classifier.fit(counts, targets)
# transform (not fit_transform) keeps the training vocabulary for the test set
counts = tfidf_vectorizer.transform(test['text'].values).toarray()
predictions = classifier.predict(counts)
EDIT: I have added the code snippet where I am creating a single tf-idf vectorizer and using it to train the classifier.
As the previous comments suggest, a more specific question would result in a better answer, but I use this package all the time, so I will try to help.
I. Determining the top features for classification classes in sklearn really depends on the individual tool you are using. For example, many ensemble methods (like RandomForestClassifier and GradientBoostingClassifier) come with a .feature_importances_ attribute that scores each feature by its importance. In contrast, most linear models (like LogisticRegression or RidgeClassifier) have a regularization penalty that penalizes the size of the coefficients, meaning that coefficient sizes are somewhat a reflection of feature importance (although you need to keep the numeric scales of individual features in mind); these can be accessed using the model's .coef_ attribute.
In summary, almost all sklearn models have some method to extract feature importances, but the methods differ from model to model. Luckily, the sklearn documentation is FANTASTIC, so I would read up on your specific model to determine your best approach. Also, make sure to read the User Guide associated with your problem type in addition to the model-specific API.
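For the tf-idf plus linear-kernel SVC setup in the question, a minimal sketch of pulling the top features out of .coef_ might look like this (the toy corpus stands in for your DataFrame columns; get_feature_names_out assumes scikit-learn >= 1.0, older versions use get_feature_names):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Toy corpus standing in for data['text'] / data['class'].
docs = ["the cat sat on the mat", "dogs bark loudly",
        "stocks fell sharply today", "the market rallied"]
labels = ["pets", "pets", "finance", "finance"]

tfidf_vectorizer = TfidfVectorizer()
counts = tfidf_vectorizer.fit_transform(docs).toarray()
classifier = SVC(kernel='linear')
classifier.fit(counts, labels)

# For a binary problem coef_ has a single row: positive weights push
# towards classes_[1], negative weights towards classes_[0].
feature_names = np.array(tfidf_vectorizer.get_feature_names_out())
weights = classifier.coef_.ravel()
order = np.argsort(weights)
print("top features for", classifier.classes_[1], feature_names[order[-5:]])
print("top features for", classifier.classes_[0], feature_names[order[:5]])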
II. There is no out-of-the-box sklearn method to return the misclassified records, but if you are feeding the model from a pandas DataFrame (which you should be), it can be accomplished in a few lines of code like this.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.DataFrame(data)
x = df[[<list of feature columns>]]
y = df[<target column>]
mod = RandomForestClassifier()
mod.fit(x.values, y.values)
# store the in-sample predictions alongside the original records
df['predict'] = mod.predict(x.values)
# keep only the rows where the prediction disagrees with the label
incorrect = df[df['predict'] != df[<target column>]]
The resultant incorrect DataFrame will contain only records which are misclassified.
Hope this helps!
I have a multiclass classifier, trained using the LinearSVC model provided by the sklearn library.
This model provides a decision_function method, which I use with numpy library functions to interpret the result set correctly.
But I don't understand why this method always tries to distribute the total probability (which in my case is 1) among all of the possible classes.
I expected a different behavior of my classifier.
I mean, for example, suppose that I have a short piece of text like this:
"There are a lot of types of virus and bacterias that cause disease."
But my classifier was trained with three types of texts, say "maths", "history" and "technology".
So, I think it makes a lot of sense that each of the three subjects would get a probability very close to zero (and therefore far from summing to 1) when I try to classify that text.
Is there a more appropriate method or model to obtain the results that I just described?
Am I using decision_function the wrong way?
Sometimes you may have text that has nothing to do with any of the subjects used to train a classifier, or, vice versa, the probability could be close to 1 for more than one subject.
I think I need to find some light on these issues (text classification, non-binary classification, etc.).
Many thanks in advance for any help!
There are multiple parts to your question; I will try to answer as much as I can.
I don't understand why this method always tries to distribute the total probability?
That is the nature of most ML models out there: a given example has to be put into some class, and every model has some mechanism to compute the probability that a given data point belongs to each class; whichever class has the highest probability is the one you will predict.
To address your problem, i.e. the existence of examples that don't belong to any of the classes, you could always create a pseudo-class called other when you train the model. That way, even if your data point doesn't correspond to any of your actual classes, e.g. maths, history and technology as per your example, it will be binned into the other class.
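A sketch of how the training data might be laid out under that scheme (the document lists below are placeholders for your own labelled corpus):

# Placeholder document lists; in practice these come from your own corpus,
# with "other" filled by off-topic text you collect yourself.
maths_docs = ["algebra and calculus problems"]
history_docs = ["the fall of the roman empire"]
tech_docs = ["cpu caches and memory hierarchies"]
other_docs = ["virus and bacteria cause disease", "a recipe for tomato soup"]

X_train = maths_docs + history_docs + tech_docs + other_docs
y_train = (["maths"] * len(maths_docs)
           + ["history"] * len(history_docs)
           + ["technology"] * len(tech_docs)
           + ["other"] * len(other_docs))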
Addressing the problem that your data point could possibly belong to multiple classes:
This is what multi-label classification is typically used for.
What you are looking for is a multi-label classification model. Refer here for an understanding of multi-label classification and a list of the models that support the multi-label classification task.
Simple example to demonstrate multi-label classification:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import LinearSVC

categories = ['sci.electronics', 'sci.space', 'talk.religion.misc']
newsgroups_train = fetch_20newsgroups(subset='all',
                                      remove=('headers', 'footers', 'quotes'),
                                      categories=categories)

# One-hot encode the targets so each class becomes an independent binary label.
X = newsgroups_train.data
y = OneHotEncoder(sparse=False).fit_transform(
    [[newsgroups_train.target_names[i]] for i in newsgroups_train.target])

# One binary LinearSVC per label, on top of shared tf-idf features.
model = make_pipeline(TfidfVectorizer(stop_words='english'),
                      MultiOutputClassifier(LinearSVC()))
model.fit(X, y)

print(newsgroups_train.target_names)
# ['sci.electronics', 'sci.space', 'talk.religion.misc']
print(model.predict(['religion followers of jesus']))
# [[0. 0. 1.]]
print(model.predict(['Upper Atmosphere Satellite Research ']))
# [[0. 1. 0.]]
print(model.predict(['There are a lot of types of virus and bacterias that cause disease.']))
# [[0. 0. 0.]]
A common way of dealing with this is to try and cast your text sample into some kind of vector space and measure the "distance" between that and some archetypical positions within that same vector space that represent classifications.
This model of a classifier is convenient because, if you collapse your text sample into a vector of vocabulary frequencies, it can almost trivially be expressed as a vector, where the dimensions are defined by the number of vocabulary features you choose to track.
By cluster-analysis of a wider text corpus, you can try and determine central points that commonly occur within clusters, and you can describe these in terms of the vector-positions at which they are located.
And finally, with a handful of cluster centres defined, you can simply Pythagoras your way into finding which of these topic clusters your chosen sample lies closest to. You also have at your fingertips the relative distances between your sample and all the other cluster centres, so it's less a probabilistic and more a spatial measure.
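A minimal sketch of that idea, assuming tf-idf vectors and KMeans cluster centres (the toy corpus and the number of clusters are placeholders):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for a wider text corpus.
corpus = ["cells divide and mutate", "bacteria evolve quickly",
          "satellites orbit the earth", "rockets launch into space",
          "faith and scripture", "prayer and worship"]

vec = TfidfVectorizer()
X = vec.fit_transform(corpus)

# Cluster-analysis of the corpus gives the topic centres.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Euclidean distance from a new sample to every cluster centre:
# a spatial measure rather than a probability.
sample = vec.transform(["viruses and bacteria cause disease"]).toarray()
dists = np.linalg.norm(km.cluster_centers_ - sample, axis=1)
print(dists)           # relative distance to each topic cluster
print(dists.argmin())  # index of the closest cluster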
I am trying to extract the number of features from a model after fitting this model to my data.
I have looked through the model's directory and found ways to get the number only for specific models (e.g. looking at the dimensions of the support vectors for an SVM), but I haven't found a general way that works for any type of model.
Say I have my dataset of instances and corresponding classes
X, y # dataset
and use an arbitrary model from the scikit-learn library to fit this data
model.fit(X,y)
Later, I want to use this model to find the dimensionality of the original dataset, something along the lines of
model.n_features_
Is there a quick and general way to do this?
There is no single common attribute for all classifiers in sklearn.
I would recommend the following:
For any sklearn.linear_model model or sklearn.svm.SVC, you can use the following approach:
>>> clf.coef_.shape[-1]
For any tree-based model (DecisionTreeClassifier/RandomForestClassifier/GradientBoostingClassifier), you can use:
>>> clf.n_features_
Update: new in version 1.0, fitted estimators also expose:
n_features_in_ : int
Number of features seen during fit.
feature_names_in_ : ndarray
Names of features seen during fit. Defined only when X has feature names that are all strings.
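A quick sketch of those attributes in action (LogisticRegression is just an arbitrary estimator here; any fitted sklearn estimator exposes n_features_in_):

import pandas as pd
from sklearn.linear_model import LogisticRegression

# A DataFrame gives X string column names, which enables feature_names_in_.
df = pd.DataFrame({"a": [0, 1, 2, 3], "b": [1, 0, 1, 0], "target": [0, 0, 1, 1]})

model = LogisticRegression().fit(df[["a", "b"]], df["target"])
print(model.n_features_in_)     # 2
print(model.feature_names_in_)  # ['a' 'b']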
I'm trying to learn how to use some of the helper features in sklearn, but I am struggling to understand how to use FeatureUnion.
One part of the documentation states this:
(A FeatureUnion has no way of checking whether two transformers might produce identical features. It only produces a union when the feature sets are disjoint, and making sure they are is the caller's responsibility.)
However, an example on the Iris dataset shows this:
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

iris = load_iris()
X, y = iris.data, iris.target
# This dataset is way too high-dimensional. Better do PCA:
pca = PCA(n_components=2)
# Maybe some original features were good, too?
selection = SelectKBest(k=1)
# Build estimator from PCA and univariate selection:
combined_features = FeatureUnion([("pca", pca), ("univ_select", selection)])
# Use combined features to transform dataset:
X_features = combined_features.fit(X, y).transform(X)
How is it ensured that PCA and SelectKBest don't select the same feature? In other words, how can the user ensure that the two selections are disjoint?
http://scikit-learn.org/dev/modules/pipeline.html#feature-union
http://scikit-learn.org/stable/auto_examples/feature_stacker.html#example-feature-stacker-py
I think you pretty much answered your own question with that quote from the docs:
(A FeatureUnion has no way of checking whether two transformers might produce identical features. It only produces a union when the feature sets are disjoint, and making sure they are is the caller’s responsibility.)
The FeatureUnion does not ensure features are different.
In the example of the Iris dataset it is possible (though highly unlikely) that the PCA and the feature selection process will generate identical features. In that case, you just have twice the same feature in the output of the FeatureUnion.
This is usually not a huge deal, though if you can avoid it, it's probably cleaner to do so (for instance, a random forest model would be biased towards a feature that is present several times, as it would have a higher probability of being picked as a candidate to split a node).
To be a bit clearer, I don't think there's a lot you can do about it beyond avoiding combining feature extraction processes that obviously create duplicate features in a FeatureUnion.
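If you want a rough after-the-fact check, one option (a sketch that reuses X_features from the snippet above) is to count exactly-duplicated columns in the combined matrix:

import numpy as np

# Count columns of the combined matrix that are exact duplicates of another.
# (Reuses X_features from the FeatureUnion snippet above.)
n_total = X_features.shape[1]
n_unique = np.unique(X_features, axis=1).shape[1]
print(n_total - n_unique, "duplicated feature column(s)")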
I have two classifiers for a multimedia dataset: one for visual material and one for textual material. I want to combine the predictions of these classifiers to make a final prediction. I have been reading about bagging, boosting and stacking ensembles; all seem useful, and I would like to try them. However, I can only seem to find rather theoretical examples for my specific problem, nothing concrete enough for me to understand how to actually implement it (in Python with scikit-learn). My two classifiers both use 10-fold CV with SVM classification. Both output a list of n_samples = 1000 predictions (either 1s or 0s). Also, I made them both produce a list of the probabilities on which the predictions are based, looking like this:
[[ 0.96761819 0.03238181]
[ 0.96761819 0.03238181]
....
[ 0.96761819 0.03238181]
[ 0.96761819 0.03238181]]
How would I go about combining these in an ensemble? What should I use as input? I've tried concatenating the label predictions horizontally and inputting them as features, but with no luck (same for the probabilities).
If you're looking strictly at combining, I recommend using brew, because it is built on top of sklearn (meaning that you can use your sklearn classifiers), and, last time I checked, sklearn was good at creating ensembles (Bagging, AdaBoost, RandomForest, ...) but provided few combining rules for your own custom ensemble (such as hybrid ensembles).
https://github.com/viisar/brew
from brew.base import Ensemble
from brew.base import EnsembleClassifier
from brew.combination.combiner import Combiner
# create your Ensemble
clfs = your_list_of_classifiers # [clf1, clf2]
ens = Ensemble(classifiers=clfs)
# create your Combiner
# the rules can be 'majority_vote', 'max', 'min', 'mean' or 'median'
comb = Combiner(rule='mean')
# now create your ensemble classifier
ensemble_clf = EnsembleClassifier(ensemble=ens, combiner=comb)
ensemble_clf.predict(X)
It depends entirely on the ensemble method you want to implement. Have you taken a look at the sklearn-ensemble documentation?
http://scikit-learn.org/stable/modules/classes.html#module-sklearn.ensemble
There is a classifier called VotingClassifier in sklearn.ensemble which can be used to combine multiple classifiers, with the predicted labels based on voting among the enlisted classifiers. Here is an example:
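(A minimal sketch: the two base estimators below are stand-ins for your visual and textual classifiers, and voting='soft' averages their predicted probabilities, which matches the probability lists you already produce.)

from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Toy data standing in for your combined multimedia features.
X, y = make_classification(n_samples=1000, random_state=0)

# probability=True lets SVC expose predict_proba for soft voting.
clf1 = LogisticRegression()
clf2 = SVC(probability=True)

eclf = VotingClassifier(estimators=[('lr', clf1), ('svc', clf2)],
                        voting='soft')
eclf.fit(X, y)
print(eclf.predict(X[:5]))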
I have almost 900,000 rows of data that I want to run through scikit-learn's Random Forest Classifier. The problem is that when I try to create the model, my computer freezes completely, so what I want to try is training the model 50,000 rows at a time, but I'm not sure if this is possible.
So the code I have now is
# This code freezes my computer
rfc.fit(X, Y)
# What I want is something like this
model = rfc.fit(X.iloc[0:50000], Y.iloc[0:50000])
model = rfc.fit(X.iloc[0:100000], Y.iloc[0:100000])
model = rfc.fit(X.iloc[0:150000], Y.iloc[0:150000])
# ... and so on
Feel free to correct me if I'm wrong, but I assume you're not using the most current version of scikit-learn (0.16.1 as of this writing), that you're on a Windows machine, and that you're using n_jobs=-1 (or some combination of all three). So my suggestion would be to first upgrade scikit-learn or set n_jobs=1, and try fitting on the whole dataset.
If that fails, take a look at the warm_start parameter. By setting it to True and gradually incrementing n_estimators you can fit additional trees on subsets of your data:
# First build 100 trees on the first chunk
# (.iloc replaces the removed DataFrame.ix indexer)
clf = RandomForestClassifier(n_estimators=100, warm_start=True)
clf.fit(X.iloc[0:50000], Y.iloc[0:50000])
# add another 100 estimators on chunk 2
clf.set_params(n_estimators=200)
clf.fit(X.iloc[0:100000], Y.iloc[0:100000])
# and so forth...
clf.set_params(n_estimators=300)
clf.fit(X.iloc[0:150000], Y.iloc[0:150000])
Another possibility is to fit a new classifier on each chunk and then simply average the predictions from all classifiers, or to merge the trees into one big random forest as described here.
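A sketch of the averaging option (X, Y, X_test and the chunk boundaries are placeholders for your own data; it assumes every chunk contains every class so that the classifiers' classes_ line up):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Fit one forest per chunk of the training data.
forests = []
for start, stop in [(0, 50000), (50000, 100000), (100000, 150000)]:
    clf = RandomForestClassifier(n_estimators=100)
    clf.fit(X.iloc[start:stop], Y.iloc[start:stop])
    forests.append(clf)

# Average the class-probability estimates and take the most likely class.
avg_proba = np.mean([clf.predict_proba(X_test) for clf in forests], axis=0)
predictions = avg_proba.argmax(axis=1)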
Another method similar to the one linked in Andreus' answer is to grow the trees in the forest individually.
I did this a while back: basically, I trained a number of DecisionTreeClassifiers one at a time on different partitions of the training data. I saved each model via pickling and afterwards loaded them into a list, which was assigned to the estimators_ attribute of a RandomForestClassifier object. You also have to take care to set the rest of the RandomForestClassifier attributes appropriately.
I ran into memory issues when I built all the trees in a single Python script. If you use this method and run into that issue, there's a workaround I posted in the linked question.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import shuffle

# iris is ordered by class, so shuffle it first: with warm_start, every
# chunk in a sequence of fits must contain samples from each class.
iris = load_iris()
X, y = shuffle(iris.data, iris.target, random_state=0)

### RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=10, warm_start=True)
rfc.fit(X[:50], y[:50])
print(rfc.score(X, y))

# grow 10 more trees on the next chunk
rfc.n_estimators += 10
rfc.fit(X[50:100], y[50:100])
print(rfc.score(X, y))

rfc.n_estimators += 10
rfc.fit(X[100:150], y[100:150])
print(rfc.score(X, y))
Below is the distinction between warm_start and partial_fit.
When fitting an estimator repeatedly on the same dataset, but for multiple parameter values (such as to find the value maximizing performance, as in grid search), it may be possible to reuse aspects of the model learnt from the previous parameter value, saving time. When warm_start is true, the existing fitted model attributes are used to initialise the new model in a subsequent call to fit.
Note that this is only applicable for some models and some parameters, and even some orders of parameter values. For example, warm_start may be used when building random forests to add more trees to the forest (increasing n_estimators) but not to reduce their number.
partial_fit also retains the model between calls, but differs: with warm_start the parameters change and the data is (more-or-less) constant across calls to fit; with partial_fit, the mini-batch of data changes and model parameters stay fixed.
There are cases where you want to use warm_start to fit on different, but closely related data. For example, one may initially fit to a subset of the data, then fine-tune the parameter search on the full dataset. For classification, all data in a sequence of warm_start calls to fit must include samples from each class.
Some algorithms in scikit-learn implement partial_fit() methods, which is what you are looking for. There are random forest algorithms that do this; however, I believe the scikit-learn implementation is not such an algorithm.
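For reference, a minimal sketch of what partial_fit looks like on a model that does support it (SGDClassifier here; note that the classes argument is required on the first call):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Toy data standing in for a dataset too large to fit in one go.
X, y = make_classification(n_samples=150000, random_state=0)

clf = SGDClassifier()
classes = np.unique(y)  # must be supplied on the first partial_fit call
for start in range(0, len(X), 50000):
    clf.partial_fit(X[start:start + 50000], y[start:start + 50000],
                    classes=classes)
print(clf.score(X, y))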
However, this question and answer may have a workaround that would work for you. You can train forests on different subsets and assemble a really big forest at the end:
Combining random forest models in scikit learn