SkLearn model for text classification


I have a multiclass classifier trained with the LinearSVC model from the sklearn library.
This model provides a decision_function method, whose output I process with numpy functions to interpret the results.
However, I don't understand why this method always distributes the total probability (which in my case is 1) among the possible classes.
I expected my classifier to behave differently.
For example, suppose I have a short piece of text like this:
"There are a lot of types of virus and bacterias that cause disease."
But my classifier was trained on three types of text, say "maths", "history" and "technology".
So I think it would make sense for each of the three subjects to get a probability very close to zero (and therefore far from summing to 1) when I try to classify that text.
Is there a more appropriate method or model to obtain the results I just described?
Am I using decision_function the wrong way?
Sometimes you may have text that has nothing to do with any of the subjects used to train the classifier; conversely, a text could have a probability of about 1 for more than one subject.
I think I need some light shed on these issues (text classification, non-binary classification, etc.).
Many thanks in advance for any help!

There are multiple parts to your question; I will try to answer as many as I can.
I don't understand why this method always tries to distribute the total of probabilities?
That is the nature of most ML models out there: a given example has to be put into some class. Every model has some mechanism to compute the probability that a given data point belongs to each class, and whichever class has the highest probability is the one that gets predicted.
To address your problem, i.e. examples that don't belong to any of the classes, you could create a pseudo-class called "other" when you train the model. That way, even if a data point doesn't correspond to any of your actual classes (e.g. maths, history and technology in your example), it will be binned into the "other" class.
As for the problem that your data point could belong to multiple classes:
this is what multi-label classification is typically used for.
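The pseudo-class idea can be sketched as follows. This is only a minimal illustration under stated assumptions: the tiny corpus, the label name "other" and the test sentence are made up for the example and are not the asker's data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus: three real subjects plus a catch-all "other" class.
texts = [
    "algebra equation theorem proof",
    "integral derivative calculus geometry",
    "empire war revolution king",
    "medieval dynasty treaty ancient",
    "computer software processor network",
    "smartphone internet hardware programming",
    "virus bacteria disease infection",
    "recipe garden football weather",
]
labels = ["maths", "maths", "history", "history",
          "technology", "technology", "other", "other"]

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, labels)

# Off-topic text now has a class of its own to fall into.
prediction = model.predict(["bacteria and virus cause disease"])[0]
print(prediction)
```

In practice the "other" class needs a broad, varied sample of off-topic text; it only catches what resembles the examples it was trained on.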

What you are looking for is a multi-label classification model. Refer here for an introduction to multi-label classification and the list of models that support multi-label classification tasks.
Simple example to demonstrate multi-label classification:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import LinearSVC
categories = ['sci.electronics', 'sci.space', 'talk.religion.misc']
newsgroups_train = fetch_20newsgroups(subset='all',
                                      remove=('headers', 'footers', 'quotes'),
                                      categories=categories)
# One-hot encode the target names so that each class becomes its own binary output.
# Note: scikit-learn >= 1.2 spells this sparse_output=False; older versions use sparse=False.
X = newsgroups_train.data
y = OneHotEncoder(sparse_output=False).fit_transform(
    [[newsgroups_train.target_names[i]] for i in newsgroups_train.target])
# One independent binary LinearSVC per class, so a text can match none or several classes.
model = make_pipeline(TfidfVectorizer(stop_words='english'),
                      MultiOutputClassifier(LinearSVC()))
model.fit(X, y)
print(newsgroups_train.target_names)
# ['sci.electronics', 'sci.space', 'talk.religion.misc']
print(model.predict(['religion followers of jesus']))
# [[0. 0. 1.]]
print(model.predict(['Upper Atmosphere Satellite Research ']))
# [[0. 1. 0.]]
print(model.predict(['There are a lot of types of virus and bacterias that cause disease.']))
# [[0. 0. 0.]]

A common way of dealing with this is to try and cast your text sample into some kind of vector space and measure the "distance" between that and some archetypical positions within that same vector space that represent classifications.
This model of a classifier is convenient because if you collapse your text sample into a vector of vocabulary frequencies, it almost trivially can be expressed as a vector - where the dimensions are defined by the number of vocabulary features you choose to track.
By cluster-analysis of a wider text corpus, you can try and determine central points that commonly occur within clusters, and you can describe these in terms of the vector-positions at which they are located.
And finally, with a handful of cluster-centers defined, you can simply pythagoras your way into finding which of these topic-clusters your chosen sample lies the closest to - but you also have at your fingertips the relative distances between your sample and all the other cluster centres as well - so it's less probabilistic, more a spatial measure.
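The distance-to-cluster-centre idea can be sketched roughly like this. It is a minimal illustration, assuming TF-IDF vectors and KMeans as the clustering step; the toy corpus, the choice of three clusters and the sample sentences are made up for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus with three loose topics (maths, history, technology).
corpus = [
    "algebra equation theorem proof",
    "equation theorem geometry proof",
    "algebra geometry equation calculus",
    "empire war king revolution",
    "war king dynasty treaty",
    "empire dynasty revolution war",
    "computer software network processor",
    "software network internet hardware",
    "computer internet processor software",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# Find three cluster centres in the TF-IDF vector space.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# KMeans.transform returns the Euclidean distance to every cluster centre.
on_topic = vectorizer.transform(["new software for the computer network"])
off_topic = vectorizer.transform(
    ["There are a lot of types of virus and bacterias that cause disease."])
distances_on_topic = kmeans.transform(on_topic)[0]
distances_off_topic = kmeans.transform(off_topic)[0]

print("on-topic distances: ", distances_on_topic)
print("off-topic distances:", distances_off_topic)
print("nearest cluster for the on-topic sample:", int(np.argmin(distances_on_topic)))
```

A sample that is far from every centre, as the off-topic sentence should be here, can then be treated as belonging to none of the topics by applying a distance threshold of your choosing.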

Related

xgboost: Sample Weights for Imbalanced Data?

I have a highly unbalanced dataset of 3 classes. To address this, I applied the sample_weight array in the XGBClassifier, but I'm not noticing any changes in the modelling results? All of the metrics in the classification report (confusion matrix) are the same. Is there an issue with the implementation?
The class ratios:
military: 1171
government: 34852
other: 20869
Example:
pipeline = Pipeline([
('bow', CountVectorizer(analyzer=process_text)), # convert strings to integer counts
('tfidf', TfidfTransformer()), # convert integer counts to weighted TF-IDF scores
('classifier', XGBClassifier(sample_weight=compute_sample_weight(class_weight='balanced', y=y_train))) # train on TF-IDF vectors w/ Naive Bayes classifier
])
Sample of Dataset:
data = pd.DataFrame({'entity_name': ['UNICEF', 'US Military', 'Ryan Miller'],
'class': ['government', 'military', 'other']})
Classification Report

First, most important: use a multiclass eval_metric. eval_metric=merror or mlogloss, then post us the results. You showed us ['precision','recall','f1-score','support'], but that's suboptimal, or outright broken unless you computed them in a multi-class-aware, imbalanced-aware way.
Second, you need weights. Your class ratio is military: government: other 1:30:18, or as percentages 2:61:37%.
You can set the weights manually with xgb.DMatrix(..., weight=weights), where each sample carries the weight of its class.
Look inside your pipeline (use print or verbose settings, dump values), don't just blindly rely on boilerplate like sklearn.utils.class_weight.compute_sample_weight('balanced', ...) to give you optimal weights.
Experiment with manually setting per-class weights, starting with 1 : 1/30 : 1/18 and try more extreme values. Reciprocals so the rarer class gets higher weight.
Also try setting min_child_weight much higher, so it requires a few exemplars (of the minority classes). Start with min_child_weight >= 2 * (weight of the rarest class) and try going higher. Beware of overfitting to the very rare minority class (this is why people use StratifiedKFold cross-validation, for some protection, but your code isn't using CV).
We can't see your other parameters for xgboost classifier (how many estimators? early stopping on or off? what was learning_rate/eta? etc etc.). Seems like you used the defaults - they'll be terrible. Or else you're not showing your code. Distrust xgboost's defaults, esp. for multiclass, don't expect xgboost to give good out-of-the-box results. Read the doc and experiment with values.
Do all that experimentation, post your results, check before concluding "it doesn't work". Don't expect optimal results from out-of-the-box. Distrust or double-check the sklearn util functions, try manual alternatives. (Often, just because sklearn has a function to do something, doesn't mean it's good or best or suitable for all use-cases, like imbalanced multiclass)
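One more thing worth checking in the question's pipeline: in the scikit-learn style API, sample_weight is, as far as I know, an argument of fit rather than of the estimator's constructor, so inside a Pipeline it has to be routed to the final step as classifier__sample_weight. The sketch below only shows that routing. GradientBoostingClassifier is a stand-in so the example runs without xgboost installed, and the toy names and labels are made up.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.utils.class_weight import compute_sample_weight

# Hypothetical toy data in the spirit of the question's dataset.
X_train = ["UNICEF", "US Military", "Ryan Miller",
           "World Bank", "Jane Smith", "UN Office"]
y_train = ["government", "military", "other",
           "government", "other", "government"]

pipeline = Pipeline([
    ("bow", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    # Stand-in for XGBClassifier; note that no sample_weight is passed here.
    ("classifier", GradientBoostingClassifier(n_estimators=10, random_state=0)),
])

# "balanced" gives each sample the weight n_samples / (n_classes * class_count).
weights = compute_sample_weight(class_weight="balanced", y=y_train)

# Route the weights to the fit method of the step named "classifier".
pipeline.fit(X_train, y_train, classifier__sample_weight=weights)
print(dict(zip(y_train, np.round(weights, 3))))
```

Printing the weights, as suggested above, is a quick way to confirm that the rarest class really receives the largest weight before blaming the model.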

Unstable accuracy of Gaussian Mixture Model classifier from sklearn

I have some data (MFCC features for speaker recognition) from two different speakers: 60 vectors of 13 features for each person (120 in total). Each vector has a label (0 or 1). I need to show the results in a confusion matrix. But the GaussianMixture model from sklearn is unstable. On each program run I get different scores (sometimes accuracy is 0.4, sometimes 0.7 ...). I don't know what I am doing wrong, because I created SVM and k-NN models in the same way and they work fine (stable accuracy around 0.9). Do you have any idea what I am doing wrong?
gmmclf = GaussianMixture(n_components=2, covariance_type='diag')
gmmclf.fit(X_train, y_train) #X_train are mfcc vectors, y_train are labels
ygmm_pred_class = gmmclf.predict(X_test)
print(accuracy_score(y_test, ygmm_pred_class))
print(confusion_matrix(y_test, ygmm_pred_class))

Short answer: you should simply not use a GMM for classification.
Long answer...
From the answer to a relevant thread, Multiclass classification using Gaussian Mixture Models with scikit learn (emphasis in the original):
Gaussian Mixture is not a classifier. It is a density estimation
method, and expecting that its components will magically align with
your classes is not a good idea. [...] GMM simply tries to fit mixture of Gaussians
into your data, but there is nothing forcing it to place them
according to the labeling (which is not even provided in the fit
call). From time to time this will work - but only for trivial
problems, where classes are so well separated that even Naive Bayes
would work, in general however it is simply invalid tool for the
problem.
And a comment by the respondent himself (again, emphasis in the original):
As stated in the answer - GMM is not a classifier, so asking if you
are using "GMM classifier" correctly is impossible to answer. Using
GMM as a classifier is incorrect by definition, there is no "valid"
way of using it in such a problem as it is not what this model is
designed to do. What you could do is to build a proper generative
model per class. In other words construct your own classifier where
you fit one GMM per label and then use assigned probability to do
actual classification. Then it is a proper classifier. See
github.com/scikit-learn/scikit-learn/pull/2468
(For what it may be worth, you may want to notice that the respondent is a research scientist in DeepMind, and the very first person to be awarded the machine-learning gold badge here at SO)
To elaborate further (and that's why I didn't simply flag the question as a duplicate):
It is true that in the scikit-learn documentation there is a post titled GMM classification:
Demonstration of Gaussian mixture models for classification.
which I guess did not exist back in 2017, when the above response was written. But, digging into the provided code, you will realize that the GMM models are actually used there in the way proposed by lejlot above; there is no statement in the form of classifier.fit(X_train, y_train) - all usage is in the form of classifier.fit(X_train), i.e. without using the actual labels.
This is exactly what we would expect from a clustering-like algorithm (which is indeed what GMM is), and not from a classifier. It is true again that scikit-learn offers an option for providing also the labels in the GMM fit method:
fit (self, X, y=None)
which you have actually used here (and again, probably did not exist back in 2017, as the above response implies), but, given what we know about GMMs and their usage, it is not exactly clear what this parameter is there for (and, permit me to say, scikit-learn has its share of practices that may look sensible from a purely programming perspective, but which make very little sense from a modeling perspective).
A final word: although fixing the random seed (as suggested in a comment) may appear to "work", trusting a "classifier" that gives a range of accuracies between 0.4 and 0.7 depending on the random seed is arguably not a good idea...
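The "one GMM per label" construction mentioned in the quoted comment can be sketched as below. This is only an illustration under assumptions: synthetic blobs stand in for the MFCC vectors, and the two components per class and the diagonal covariance are arbitrary choices, not values taken from the question.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the data: 120 vectors of 13 features, two speakers.
X, y = make_blobs(n_samples=120, n_features=13, centers=2,
                  cluster_std=2.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Fit one density model per class, using only that class's training vectors.
classes = np.unique(y_train)
models = {}
log_priors = {}
for label in classes:
    X_label = X_train[y_train == label]
    models[label] = GaussianMixture(
        n_components=2, covariance_type="diag", random_state=0).fit(X_label)
    log_priors[label] = np.log(len(X_label) / len(X_train))

# Classify by the largest log p(x | class) + log p(class).
scores = np.column_stack(
    [models[c].score_samples(X_test) + log_priors[c] for c in classes])
y_pred = classes[np.argmax(scores, axis=1)]
accuracy = np.mean(y_pred == y_test)
print(accuracy)
```

Here the labels decide which vectors each mixture sees, so the predicted class no longer depends on how the components happen to be numbered.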

In sklearn, the component labels produced by a GMM carry no inherent meaning, so they may change from one run to the next. This might be one reason the results are not robust.

Scikit learn-Classification

Is there a straightforward way to view the top features of each class? Based on tfidf?
I am using the KNeighbors classifier, SVC-Linear, and MultinomialNB.
Secondly, I have been searching for a way to view documents that have not been classified correctly? I can view the confusion matrix but I would like to see specific documents to see what features are causing the misclassification.
classifier = SVC(kernel='linear')
counts = tfidf_vectorizer.fit_transform(data['text'].values).toarray()
targets = data['class'].values
classifier.fit(counts, targets)
counts = tfidf_vectorizer.fit_transform(test['text'].values).toarray()
predictions = classifier.predict(counts)
EDIT: I have added the code snippet where I am only creating a tfidf vectorizer and using it to train the classifier.

Like the previous comments suggest, a more specific question would result in a better answer, but I use this package all the time so I will try and help.
I. Determining top features for classification classes in sklearn really depends on the individual tool you are using. For example, many ensemble methods (like RandomForestClassifier and GradientBoostingClassifier) come with the .feature_importances_ attribute, which scores each feature based on its importance. In contrast, most linear models (like LogisticRegression or RidgeClassifier) have a regularization penalty which penalizes the size of the coefficients, meaning that the coefficient sizes are somewhat a reflection of feature importance (although you need to keep in mind the numeric scales of the individual features); these can be accessed using the .coef_ attribute of the model class.
In summary, almost all sklearn models have some method to extract the feature importances but the methods are different from model to model. Luckily the sklearn documentation is FANTASTIC so I would read up on your specific model to determine your best approach. Also, make sure to read the User Guide associated with your problem type in addition to the model specific API.
II. There is no out of the box sklearn method to provide the mis-classified records but if you are using a pandas DataFrame (which you should) to feed the model it can be accomplished in a few lines of code like this.
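import pandas as pd
from sklearn.ensemble import RandomForestClassifier
# `data`, `feature_cols` and `target_col` are placeholders for your own dataset
df = pd.DataFrame(data)
x = df[feature_cols]  # feature_cols: list of feature column names
y = df[target_col]  # target_col: name of the target column
mod = RandomForestClassifier()
mod.fit(x.values, y.values)
df['predict'] = mod.predict(x.values)
# keep only the rows where the prediction differs from the true label
incorrect = df[df['predict'] != df[target_col]]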
The resultant incorrect DataFrame will contain only records which are misclassified.
Hope this helps!

Sklearn: How to make an ensemble for two binary classifiers?

I have two classifiers for a multimedia dataset. One for visual material and one for textual material. I want to combine the predictions of these classifiers to make a final prediction. I have been reading about bagging, boosting and stacking ensembles and all seem useful and I would like to try them. However, I can only seem to find rather theoretical examples for my specific problem, nothing concrete enough for me to understand how to actually implement it (in python with scikit-learn). My two classifiers both use 10 KFold CV with SVM classification. Both outputting a list of n_samples = 1000 with predictions (either 1's or 0's). Also, I made them both produce a list of probabilities on which the predictions are based, looking like this:
[[ 0.96761819 0.03238181]
[ 0.96761819 0.03238181]
....
[ 0.96761819 0.03238181]
[ 0.96761819 0.03238181]]
How would I go about combining these in an ensemble? What should I use as input? I've tried concatenating the label predictions horizontally and using them as input features, but with no luck (same for the probabilities).

If you're looking strictly for combining, I recommend using brew because it is built on top of sklearn (meaning that you can use your sklearn classifiers), and, last time I checked, sklearn was good for creating ensembles (Bagging, AdaBoost, RandomForest ...), but not many combining rules were provided for your own custom ensemble (such as hybrid ensembles).
https://github.com/viisar/brew
from brew.base import Ensemble
from brew.base import EnsembleClassifier
from brew.combination.combiner import Combiner
# create your Ensemble
clfs = your_list_of_classifiers # [clf1, clf2]
ens = Ensemble(classifiers = clfs)
# create your Combiner
# the rules can be 'majority_vote', 'max', 'min', 'mean' or 'median'
comb = Combiner(rule='mean')
# now create your ensemble classifier
ensemble_clf = EnsembleClassifier(ensemble=ens, combiner=comb)
ensemble_clf.predict(X)
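For readers who would rather stay within NumPy, the 'mean' rule amounts to averaging the two probability matrices and taking the arg max per row. A small sketch with made-up probabilities (proba_visual and proba_textual are hypothetical names for the outputs of the two classifiers):

```python
import numpy as np

# Hypothetical class probabilities from the two classifiers
# (rows are samples, columns are classes 0 and 1).
proba_visual = np.array([[0.97, 0.03],
                         [0.40, 0.60],
                         [0.55, 0.45]])
proba_textual = np.array([[0.90, 0.10],
                          [0.20, 0.80],
                          [0.30, 0.70]])

# 'mean' rule: average the probabilities, then pick the most likely class.
mean_proba = (proba_visual + proba_textual) / 2
final_prediction = mean_proba.argmax(axis=1)
print(final_prediction)  # prints [0 1 1]
```

Note how the third sample flips to class 1: the visual classifier mildly prefers class 0, but the textual classifier is more confident about class 1.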

It depends entirely on the ensemble method you want to implement. Have you taken a look at the sklearn-ensemble documentation?
http://scikit-learn.org/stable/modules/classes.html#module-sklearn.ensemble

There is a classifier called 'VotingClassifier' in sklearn.ensemble which can be used to combine multiple classifiers; the predicted labels are then based on voting among the listed classifiers.
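The example from the original answer is not reproduced here, so the following is only a minimal sketch of how VotingClassifier is typically used. The synthetic data and the three base estimators are arbitrary choices for illustration. Note that VotingClassifier trains every estimator on the same feature matrix, so it does not directly cover the case of two classifiers fed with different modalities.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Synthetic binary problem standing in for real features.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# voting="hard" takes a majority vote over the predicted labels.
ensemble = VotingClassifier(
    estimators=[
        ("svm", SVC()),
        ("logreg", LogisticRegression(max_iter=1000)),
        ("nb", GaussianNB()),
    ],
    voting="hard",
)
ensemble.fit(X_train, y_train)
accuracy = ensemble.score(X_test, y_test)
print(accuracy)
```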

Break up Random forest classification fit into pieces in python?

I have almost 900,000 rows of information that I want to run through scikit-learn's Random Forest Classifier algorithm. Problem is, when I try to create the model my computer freezes completely, so what I want to try is running the model every 50,000 rows but I'm not sure if this is possible.
So the code I have now is
# This code freezes my computer
rfc.fit(X,Y)
#what I want is
model = rfc.fit(X.ix[0:50000],Y.ix[0:50000])
model = rfc.fit(X.ix[0:100000],Y.ix[0:100000])
model = rfc.fit(X.ix[0:150000],Y.ix[0:150000])
#... and so on

Feel free to correct me if I'm wrong, but I assume you're not using the most current version of scikit-learn (0.16.1 as of writing this), that you're on a Windows machine and using n_jobs=-1 (or a combination of all three). So my suggestion would be to first upgrade scikit-learn or set n_jobs=1 and try fitting on the whole dataset.
If that fails, take a look at the warm_start parameter. By setting it to True and gradually incrementing n_estimators you can fit additional trees on subsets of your data:
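from sklearn.ensemble import RandomForestClassifier
# First build 100 trees on the first chunk
# (.ix has been removed from pandas; .iloc is the positional equivalent)
clf = RandomForestClassifier(n_estimators=100, warm_start=True)
clf.fit(X.iloc[0:50000], Y.iloc[0:50000])
# add another 100 estimators on chunk 2
clf.set_params(n_estimators=200)
clf.fit(X.iloc[50000:100000], Y.iloc[50000:100000])
# and so forth... (every chunk must contain samples from each class)
clf.set_params(n_estimators=300)
clf.fit(X.iloc[100000:150000], Y.iloc[100000:150000])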
Another possibility is to fit a new classifier on each chunk and then simply average the predictions from all classifiers, or to merge the trees into one big random forest as described here.
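The "one classifier per chunk, then average" option can be sketched as follows. This is a minimal illustration on synthetic data; the chunk size, the number of trees and the variable names are assumptions, and it relies on every chunk containing all classes so that the probability columns line up.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the large dataset (rows are already shuffled).
X, y = make_classification(n_samples=3000, n_features=20, random_state=0)
X_train, y_train = X[:2400], y[:2400]
X_test, y_test = X[2400:], y[2400:]

# Fit an independent forest on each chunk of the training data.
chunk_size = 800
forests = []
for start in range(0, len(X_train), chunk_size):
    stop = start + chunk_size
    forest = RandomForestClassifier(n_estimators=50, n_jobs=1, random_state=start)
    forests.append(forest.fit(X_train[start:stop], y_train[start:stop]))

# Average the class probabilities of all forests, then take the arg max.
mean_proba = np.mean([f.predict_proba(X_test) for f in forests], axis=0)
y_pred = forests[0].classes_[mean_proba.argmax(axis=1)]
accuracy = np.mean(y_pred == y_test)
print(accuracy)
```

Only one chunk needs to be in memory while a forest is being trained, which is the point of the exercise; the fitted forests can be pickled between chunks if memory is still tight.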

Another method similar to the one linked in Andreus' answer is to grow the trees in the forest individually.
I did this a while back: basically I trained a number of DecisionTreeClassifier's one at a time on different partitions of the training data. I saved each model via pickling, and afterwards I loaded them into a list which was assigned to the estimators_ attribute of a RandomForestClassifier object. You also have to take care to set the rest of the RandomForestClassifier attributes appropriately.
I ran into memory issues when I built all the trees in a single python script. If you use this method and run into that issue, there's a work-around that I posted in the linked question.

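from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import shuffle
iris = load_iris()
# The iris rows are sorted by class, so shuffle first: every chunk passed to
# a warm_start fit must contain samples from each class.
X, y = shuffle(iris.data, iris.target, random_state=0)
### RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=10, warm_start=True)
rfc.fit(X[:50], y[:50])
print(rfc.score(X, y))
rfc.n_estimators += 10  # grow 10 more trees on the next chunk
rfc.fit(X[50:100], y[50:100])
print(rfc.score(X, y))
rfc.n_estimators += 10
rfc.fit(X[100:150], y[100:150])
print(rfc.score(X, y))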
Below is the difference between warm_start and partial_fit.
When fitting an estimator repeatedly on the same dataset, but for multiple parameter values (such as to find the value maximizing performance as in grid search), it may be possible to reuse aspects of the model learnt from the previous parameter value, saving time. When warm_start is true, the existing fitted model attributes are used to initialise the new model in a subsequent call to fit.
Note that this is only applicable for some models and some parameters, and even some orders of parameter values. For example, warm_start may be used when building random forests to add more trees to the forest (increasing n_estimators) but not to reduce their number.
partial_fit also retains the model between calls, but differs: with warm_start the parameters change and the data is (more-or-less) constant across calls to fit; with partial_fit, the mini-batch of data changes and model parameters stay fixed.
There are cases where you want to use warm_start to fit on different, but closely related data. For example, one may initially fit to a subset of the data, then fine-tune the parameter search on the full dataset. For classification, all data in a sequence of warm_start calls to fit must include samples from each class.
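To make the contrast concrete, here is a minimal partial_fit sketch. It uses SGDClassifier as a stand-in, since RandomForestClassifier does not offer partial_fit; the synthetic data and the chunk size are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic stand-in for a dataset that is too large to fit in one go.
X, y = make_classification(n_samples=3000, n_features=20, random_state=0)

# partial_fit must be told all possible classes on the first call.
classes = np.unique(y)

clf = SGDClassifier(random_state=0)
chunk_size = 500
for start in range(0, len(X), chunk_size):
    stop = start + chunk_size
    # Same model parameters on every call; only the mini-batch of data changes.
    clf.partial_fit(X[start:stop], y[start:stop], classes=classes)

accuracy = clf.score(X, y)
print(accuracy)
```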

Some algorithms in scikit-learn implement a partial_fit() method, which is what you are looking for. There are random forest algorithms that support this; however, I believe the scikit-learn implementation is not one of them.
However, this question and answer may have a workaround that would work for you. You can train forests on different subsets, and assemble a really big forest at the end:
Combining random forest models in scikit learn
