Multilabel text classification with Sklearn

I have tried everything I can think of to solve my multilabel text classification problem in Python, and I would really appreciate any help. I based my approach on an example that uses MultiLabelBinarizer and on a tutorial web page.
I am trying to predict categories in a dataset written in Spanish that has 7 different labels. Each row contains a text message and its labels, and each message has either one or two labels.
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split
# Keep only the label columns as the target
df2 = df.copy()
df2.drop(["mensaje", "pregunta_parseada", "tags_totales"], axis=1, inplace=True)
# Divide into train and test
X_train, X_test, y_train, y_test = train_test_split(df['pregunta_parseada'],
                                                    df2,
                                                    test_size=0.15,
                                                    random_state=42)
# The definition of tfidf was not shown in the post; default settings are assumed here
tfidf = TfidfVectorizer()
features_train = tfidf.fit_transform(X_train).toarray()
labels_train = y_train
features_test = tfidf.transform(X_test).toarray()
labels_test = y_test
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.multiclass import OneVsRestClassifier
lr = LogisticRegression(solver='sag', n_jobs=1)
clf = OneVsRestClassifier(lr)
# fit model on train data
clf.fit(features_train, labels_train)
# make predictions for validation set
y_pred = clf.predict(features_test)
So far, so good, but when I try to validate the model, almost every sample seems to be classified as "None" (no label at all):
y_pred[2]
accuracy_score(y_test,y_pred)
Output
array([0, 0, 0, 0, 0, 0, 0])
0.2574626865671642
I also tried MultiLabelBinarizer and had the same problem. What am I doing wrong? With MultiLabelBinarizer I got the following results:
z=[["Generico"],["Mantenimiento"],["Motor"],["Generico"],["Motor"],
["Generico"],["Motor"],["Generico","Configuracion"],["Generico"],
["Motor"],["Consumo"],...,["Consumo"]]
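from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC
# Turn the list of label lists into a binary indicator matrix
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(z)
message = df["pregunta_parseada"].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(message,
                                                    y,
                                                    test_size=0.15,
                                                    random_state=42)
classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])
classifier.fit(X_train, y_train)
predicted = classifier.predict(X_test)
accuracy_score(y_test, predicted)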
#predicted[150]
all_labels = mlb.inverse_transform(predicted)
all_labels
With the following output
(),
(),
(),
(),
('Generico',),
(),
(),
(),
(),
('Compra',),
('Motor', 'extras'),
Thank you so much for your help

I think the problem is with your data: it could be too sparse.
You're using OneVsRestClassifier, which builds one binary classifier per tag.
I don't think there is a straightforward bug in your code; the choice of model is just not a good fit for the task.
The problem with these binary classifiers is data imbalance. Even if you have exactly the same number of samples (n) for each of the c classes, each binary classifier sees n positive samples versus n x (c-1) negative samples.
So every classifier has more data in the negative class than in the positive class. Each one is biased towards the negative class and, as a result, tends to predict "rest" (the negative side of the one-vs-rest split) in most cases, which is why so many rows end up with no label.
If you don't want to change your setup, here are some things you can do:
Instead of predict, use predict_proba to get the probability per class and set a lower threshold (<0.5) to decide which classes to choose.
Your test accuracy is pretty low; re-adjusting the threshold may improve it.
If possible, use a deep-learning-based approach such as BERT, which is likely to give much better performance.
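The thresholding suggestion can be sketched roughly as follows. This is only an illustration, not the asker's setup: the data comes from make_multilabel_classification (synthetic, used here just so the example runs on its own), and the threshold of 0.3 is an arbitrary value that would need tuning on a validation set. Note that predict_proba is available with LogisticRegression but not with LinearSVC.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Synthetic stand-in for the TF-IDF features and the 7 label columns (assumption)
X, Y = make_multilabel_classification(n_samples=500, n_features=20,
                                      n_classes=7, n_labels=1,
                                      random_state=42)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.15,
                                                    random_state=42)

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X_train, Y_train)

# One probability per sample and per label, shape (n_samples, n_labels)
proba = clf.predict_proba(X_test)

# Cutting at 0.5 gives the usual hard decision
default_pred = (proba >= 0.5).astype(int)

# A lower cut-off (arbitrary value, to be tuned) assigns labels more readily
threshold = 0.3
lowered_pred = (proba >= threshold).astype(int)

print("labels assigned at 0.5:", default_pred.sum())
print("labels assigned at 0.3:", lowered_pred.sum())
```

Whether a lower threshold actually helps depends on the data, so it is worth checking a metric such as the per-label F1 score on held-out data rather than exact-match accuracy alone.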

Related

Combination of CalibratedClassifierCV with RandomOverSampler

When using a classifier like GaussianNB(), the resulting .predict_proba() values are sometimes poorly calibrated; that's why I'd like to wrap this classifier into sklearn's CalibratedClassifierCV.
I now have a binary classification problem with only a very few positive samples, so few that CalibratedClassifierCV fails because there are fewer samples than folds (the resulting error is then Requesting 5-fold cross-validation but provided less than 5 examples for at least one class.). Thus, I'd like to upsample the minority class before applying the classifier. I use imblearn's pipeline for this as it ensures that resampling takes place only during fit and not during inference.
However, I do not find a way to upsample my training data and combine it with CalibratedClassifierCV while ensuring that upsampling only takes place during fit and not during inference.
I tried the following reproducible example, but it seems that CalibratedClassifierCV wants to split the data first, prior to upsampling - and it fails.
Is there a way to correctly upsample data while using CalibratedClassifierCV?
from sklearn.calibration import CalibratedClassifierCV
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline
X, y = make_classification(
    n_samples = 100,
    n_features = 10,
    n_classes = 2,
    weights = (0.95,), # 5% of samples are of class 1
    random_state = 10,
    shuffle = True
)
X_train, X_val, y_train, y_val = train_test_split(
    X,
    y,
    test_size = 0.2,
    random_state = 10,
    shuffle = True,
    stratify = y
)
pipeline = Pipeline([
    ("resampling", RandomOverSampler(
        sampling_strategy=0.2,
        random_state=10
    )),
    ("model", GaussianNB())
])
m = CalibratedClassifierCV(
    base_estimator=pipeline,
    method="isotonic",
    cv=5,
    n_jobs=-1
)
m.fit(X_train, y_train) # results in error

I guess I understand my conceptual error: the cross-validation split has to happen BEFORE upsampling and not after (otherwise there would be information leakage from validation to training). But if it happens before, I cannot have more folds than samples of the positive class... Thus, oversampling does not save me from having too few samples for CalibratedClassifierCV.
So I indeed have to reduce the number of folds, as @NMH1013 suggests.
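Reducing the number of folds could look roughly like this. It is a sketch under stated assumptions: the toy data (80 samples, exactly 4 positives) is made up so the example is deterministic, the oversampling step is left out to keep it self-contained, and method="sigmoid" is swapped in for "isotonic" because isotonic calibration tends to overfit when there are very few calibration samples.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy data: 80 training samples, only 4 of them positive
rng = np.random.default_rng(10)
X_train = rng.normal(size=(80, 10))
y_train = np.zeros(80, dtype=int)
y_train[:4] = 1
X_train[y_train == 1] += 1.5  # shift the positives so they are learnable

# Use at most as many folds as there are samples in the rarest class
n_minority = np.bincount(y_train).min()
n_splits = min(5, n_minority)

# The estimator is passed positionally, which avoids relying on the keyword name
m = CalibratedClassifierCV(GaussianNB(), method="sigmoid", cv=n_splits)
m.fit(X_train, y_train)

proba = m.predict_proba(X_train)
print("folds used:", n_splits, "probability matrix shape:", proba.shape)
```

With so few positives, each calibration fold holds a single positive sample, so the calibrated probabilities should be treated with caution.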

Can I use GridSearchCV with KNeighborsRegressor?

I have a data set with some float column features (X_train) and a continuous target (y_train).
I want to run KNN regression on the data set, and I want to (1) do a grid search for hyperparameter tuning and (2) run cross validation on the training.
I wrote this code:
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.neighbors import KNeighborsRegressor
X_train, X_test, y_train, y_test = train_test_split(scaled_df, target, test_size=0.2)
cv_method = RepeatedStratifiedKFold(n_splits=5,
                                    n_repeats=3,
                                    random_state=999)
# Define our candidate hyperparameters
hp_candidates = [{'n_neighbors': [2,3,4,5,6,7,8,9,10,11,12,13,14,15], 'weights': ['uniform','distance'],'p':[1,2,5]}]
# Search for best hyperparameters
grid = GridSearchCV(estimator=KNeighborsRegressor(),
                    param_grid=hp_candidates,
                    cv=cv_method,
                    verbose=1,
                    scoring='accuracy',
                    return_train_score=True)
grid.fit(X_train,y_train)
The error I get is:
Supported target types are: ('binary', 'multiclass'). Got 'continuous' instead.
I understand the error: what I wrote only works for KNN classification, not regression.
But what I can't find is how to edit this code to make it suitable for KNN regression. Can someone explain how this could be done?
(The ultimate aim: I have a data set, I want to tune the parameters, do cross-validation, output the best model based on the above, and get back some accuracy scores, ideally scores that are comparable across algorithms and not specific to KNN, so I can compare accuracy.)
Also, just to mention, this is my first attempt at KNN in scikit-learn, so all comments and criticism are welcome.

Yes, you can use GridSearchCV with KNeighborsRegressor.
Your problem is the choice of metric;
you can read the metrics documentation here: https://scikit-learn.org/stable/modules/model_evaluation.html
The metrics appropriate for a regression problem are different from those for classification problems. Here is the list of appropriate regression metrics:
‘explained_variance’
‘max_error’
‘neg_mean_absolute_error’
‘neg_mean_squared_error’
‘neg_root_mean_squared_error’
‘neg_mean_squared_log_error’
‘neg_median_absolute_error’
‘r2’
‘neg_mean_poisson_deviance’
‘neg_mean_gamma_deviance’
‘neg_mean_absolute_percentage_error’
So you can choose one of these to replace "accuracy" and test it.
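A minimal sketch of the adjusted grid search is shown below. As far as I can tell, the quoted error message is raised by the stratified splitter (RepeatedStratifiedKFold needs class labels to stratify on), so the sketch also swaps it for RepeatedKFold; that swap is my assumption about the fix, not something stated above. The data from make_regression and the reduced parameter grid are placeholders for the asker's scaled_df and target.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, RepeatedKFold, train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Placeholder data standing in for scaled_df / target (assumption)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

# Plain (non-stratified) repeated K-fold works with a continuous target
cv_method = RepeatedKFold(n_splits=5, n_repeats=3, random_state=999)

hp_candidates = {"n_neighbors": list(range(2, 16)),
                 "weights": ["uniform", "distance"],
                 "p": [1, 2]}

# A regression scorer replaces "accuracy"; scikit-learn negates error metrics
# so that a higher score is always better
grid = GridSearchCV(estimator=KNeighborsRegressor(),
                    param_grid=hp_candidates,
                    cv=cv_method,
                    scoring="neg_root_mean_squared_error",
                    return_train_score=True)
grid.fit(X_train, y_train)

print("best parameters:", grid.best_params_)
print("cross-validated RMSE:", -grid.best_score_)
print("test RMSE:", -grid.score(X_test, y_test))
```

RMSE and r2 are not specific to KNN, so the same scoring argument can be reused to compare against other regressors.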

F-Score difference between cross_val_score and StratifiedKFold

I want to use a Random Forest classifier on imbalanced data where X is a np.array representing the features and y is a np.array representing the labels (90% 0-values and 10% 1-values). As I was not sure how stratification is done within cross-validation and whether it makes a difference, I also cross-validated manually with StratifiedKFold. I would expect similar, though not identical, results. As this is not the case, I guess that I am using one of the methods incorrectly, but I don't understand which one. Here is the code:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.metrics import f1_score
rfc = RandomForestClassifier(n_estimators = 200,
                             criterion = "gini",
                             max_depth = None,
                             min_samples_leaf = 1,
                             max_features = "sqrt",  # was "auto", which meant "sqrt" for classifiers in older scikit-learn
                             random_state = 42,
                             class_weight = "balanced")
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size = 0.20, random_state = 42, stratify=y)
I also tried the Classifier without the class_weight argument. From here I proceed to compare both methods with the f1-score
cv = cross_val_score(estimator=rfc,
                     X=X_train_val,
                     y=y_train_val,
                     cv=10,
                     scoring="f1")
print(cv)
The 10 f1-scores from cross validation are all around 65%.
Now the StratifiedKFold:
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for train_index, test_index in skf.split(X_train_val, y_train_val):
    X_train, X_val = X_train_val[train_index], X_train_val[test_index]
    y_train, y_val = y_train_val[train_index], y_train_val[test_index]
    rfc.fit(X_train, y_train)
    rfc_predictions = rfc.predict(X_val)
    print("F1-Score: ", round(f1_score(y_val, rfc_predictions),3))
The 10 f1-scores from StratifiedKFold are around 90%. This is where I get confused, as I don't understand the large deviation between the two methods. If I just fit the classifier to the training data and apply it to the test data, I get f1-scores of around 90% as well, which leads me to believe that my way of applying cross_val_score is not correct.

One possible reason for the difference is that cross_val_score uses StratifiedKFold with the default shuffle=False parameter, whereas in your manual cross-validation using StratifiedKFold you have passed shuffle=True. Therefore it could just be an artifact of the way your data is ordered that cross-validating without shuffling produces worse F1 scores.
Try passing shuffle=False when creating the skf instance to see if the scores match the cross_val_score, and then if you want to use shuffling when using cross_val_score just manually shuffle the training data before applying cross_val_score.
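Another way to put both approaches on the same footing is to hand the very same StratifiedKFold object to cross_val_score through its cv argument, instead of shuffling the data by hand. The sketch below uses synthetic data from make_classification and a smaller forest (both assumptions, chosen only to keep the example quick) and checks that the two methods then agree.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data (roughly 90% / 10%), standing in for X and y
X, y = make_classification(n_samples=500, n_features=10, weights=[0.9],
                           random_state=42)

rfc = RandomForestClassifier(n_estimators=25, random_state=42,
                             class_weight="balanced")
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# cross_val_score with an explicit splitter instead of cv=<int>
cv_scores = cross_val_score(rfc, X, y, cv=skf, scoring="f1")

# Manual loop over the same splitter; clone() gives a fresh model per fold,
# which is what cross_val_score does internally
manual_scores = []
for train_index, val_index in skf.split(X, y):
    model = clone(rfc)
    model.fit(X[train_index], y[train_index])
    manual_scores.append(f1_score(y[val_index], model.predict(X[val_index])))

print(np.round(cv_scores, 3))
print(np.round(manual_scores, 3))
```

If the scores still differ on the real data after this change, the ordering of the rows is probably not the only cause.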

How can I explain this drop in performance on test data?

I am asking the question here, even though I hesitated to post it on CrossValidated (or DataScience) StackExchange. I have a dataset of 60 labeled objects (to be used for training) and 150 unlabeled objects (for test). The aim of the problem is to predict the labels of the 150 objects (this used to be given as a homework problem). For each object, I computed 258 features. Considering each object as a sample, I have X_train : (60,258), y_train : (60,) (labels of the objects used for training) and X_test : (150,258). Since the solution of the homework problem was given, I also have the true labels of the 150 objects, in y_test : (150,).
In order to predict the labels of the 150 objects, I choose to use a LogisticRegression (the Scikit-learn implementation). The classifier is trained on (X_train, y_train), after the data has been normalized, and used to make predictions for the 150 objects. Those predictions are compared to y_test to assess the performance of the model. For reproducibility, I copy the code I have used.
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score, cross_val_predict
# Fit classifier
LogReg = LogisticRegression(C=1, class_weight='balanced')
scaler = StandardScaler()
clf = make_pipeline(StandardScaler(), LogReg)
LogReg.fit(X_train, y_train)
# Performance on training data
CV_score = cross_val_score(clf, X_train, y_train, cv=10, scoring='roc_auc')
print(CV_score)
# Performance on test data
probas = LogReg.predict_proba(X_test)[:, 1]
AUC = metrics.roc_auc_score(y_test, probas)
print(AUC)
The matrices X_train, y_train, X_test and y_test are saved in a .mat file available at this link. My problem is the following:
Using this approach, I get a good performance on training data (CV_score = 0.8) but the performance on the test data is much worse: AUC = 0.54 for C=1 in LogReg and AUC = 0.40 for C=0.01. How can I get AUC < 0.5 if a naive classifier should score AUC = 0.5? Is this due to the fact that I have a small number of samples for training?
I have noticed that the performance on test data improves if I change the code to:
y_pred = cross_val_predict(clf, X_test, y_test, cv=5)
AUC = metrics.roc_auc_score(y_test, y_pred)
print(AUC)
Indeed, AUC = 0.87 for C=1 and 0.9 for C=0.01. Why is the AUC score so much better using cross-validation predictions? Is it because cross-validation makes predictions on subsets of the test data that do not contain the objects/samples which decrease the AUC?

Looks like you are encountering an overfitting problem, i.e. the classifier trained using the training data is overfitting to the training data. It has poor generalization ability. That is why the performance on the testing dataset isn't good.
cross_val_predict is actually training the classifier using part of your testing data and then predict on the rest. So the performance is much better.
Overall, there seems to be quite some difference between your training and testing datasets. So the classifier with the highest training accuracy doesn't work well on your testing set.
Another point not directly related with your question: since the number of your training samples is much smaller than the feature dimensions, it may be helpful to perform dimension reduction before feeding to classifier.
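The dimension reduction idea could be tried along these lines. This is a rough sketch: PCA is just one possible choice of reduction, n_components=10 is an arbitrary value, and the data is generated with make_classification as a stand-in for the asker's (60, 258) training matrix.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for the 60 x 258 training data (assumption)
X_train, y_train = make_classification(n_samples=60, n_features=258,
                                        n_informative=10, random_state=0)

# Scaling and PCA live inside the pipeline, so they are re-fitted on the
# training part of every fold and never see the held-out part
clf = make_pipeline(StandardScaler(),
                    PCA(n_components=10),
                    LogisticRegression(C=1, class_weight="balanced",
                                       max_iter=1000))

cv_scores = cross_val_score(clf, X_train, y_train, cv=5, scoring="roc_auc")
print(cv_scores)
```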

It looks like your training and test process are inconsistent. Although from your code you intend to standardize your data, you fail to do so during testing. What I mean:
clf = make_pipeline(StandardScaler(), LogReg)
LogReg.fit(X_train, y_train)
Although you define a pipeline, you do not fit the pipeline (clf.fit) but only the Logistic Regression. This matters, because your cross-validated score is calculated with the pipeline (CV_score = cross_val_score(clf, X_train, y_train, cv=10, scoring='roc_auc')) but during test instead of using the pipeline as expected to predict, you use only LogReg, hence the test data are not standardized.
The second point you raise is different. In y_pred = cross_val_predict(clf, X_test, y_test, cv=5) you get predictions by doing cross-validation on the test data, while ignoring the training data. Here, you do standardize the data since you use clf, and thus your score is high; this is evidence that the standardization step is important.
To summarize, I believe standardizing the test data will improve your test score.
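In code, the fix amounts to fitting the pipeline and predicting with it, so the scaler learned on the training data is also applied to the test data. A minimal sketch, with make_classification data standing in for the contents of the .mat file (an assumption):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data with the same shapes as in the question: 60 train, 150 test
X, y = make_classification(n_samples=210, n_features=258, n_informative=10,
                           random_state=0)
X_train, y_train = X[:60], y[:60]
X_test, y_test = X[60:], y[60:]

LogReg = LogisticRegression(C=1, class_weight="balanced", max_iter=1000)
clf = make_pipeline(StandardScaler(), LogReg)

# Fit the whole pipeline, not just LogReg
clf.fit(X_train, y_train)

# Predict with the pipeline so X_test goes through the fitted scaler first
probas = clf.predict_proba(X_test)[:, 1]
AUC = roc_auc_score(y_test, probas)
print(AUC)
```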

Firstly, it makes no sense to have 258 features for 60 training items. Secondly, CV=10 for 60 items means you split the data into 10 train/test sets, each with only 6 items in the test set. So whatever results you obtain will be useless. You need more training data and fewer features.

Restricted Boltzmann Machine in Scikit-learn: Iris Classification

I'm working on an example of applying a Restricted Boltzmann Machine to the Iris dataset. Essentially, I'm trying to compare RBM and LDA. LDA seems to produce reasonably correct output, but the RBM doesn't. Following a suggestion, I binarized the feature inputs using sklearn.preprocessing.Binarizer, and also tried different threshold parameter values. I tried several different ways to apply binarization, but none seemed to work for me.
Below is my modified version of the code, based on the version by user covariance.
Any helpful comments are greatly appreciated.
from sklearn import linear_model, datasets, preprocessing
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.neural_network import BernoulliRBM
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
# import some data to play with
iris = datasets.load_iris()
X = iris.data[:,:2] # we only take the first two features.
Y = iris.target
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=10)
# Models we will use
rbm = BernoulliRBM(random_state=0, verbose=True)
binarizer = preprocessing.Binarizer(threshold=0.01,copy=True)
X_binarized = binarizer.fit_transform(X_train)
hidden_layer = rbm.fit_transform(X_binarized, Y_train)
logistic = linear_model.LogisticRegression()
logistic.coef_ = hidden_layer
classifier = Pipeline(steps=[('rbm', rbm), ('logistic', logistic)])
# was n_components=3; current scikit-learn allows at most min(n_features, n_classes - 1) = 2
lda = LDA(n_components=2)
#########################################################################
# Training RBM-Logistic Pipeline
logistic.fit(X_train, Y_train)
classifier.fit(X_binarized, Y_train)
#########################################################################
# Get predictions
print("The RBM model:")
print("Predict: ", classifier.predict(X_test))
print("Real: ", Y_test)
print()
print("Linear Discriminant Analysis: ")
lda.fit(X_train, Y_train)
print("Predict: ", lda.predict(X_test))
print("Real: ", Y_test)

RBM and LDA are not directly comparable, as RBM doesn't perform classification on its own. Though you are using it as a feature engineering step with logistic regression at the end, LDA is itself a classifier - so the comparison isn't very meaningful.
The BernoulliRBM in scikit-learn expects binary inputs (or values in [0, 1] that are interpreted as probabilities). The iris dataset has no sensible binarization, so you aren't going to get any meaningful outputs.
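If you still want to experiment, one option is to rescale the features to [0, 1] instead of binarizing them, and to keep every step inside a single pipeline so that the test data is transformed the same way as the training data. The sketch below is only an illustration of that wiring; MinMaxScaler and n_components=16 are my own choices, and I make no claim about how well the RBM features work on iris.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=10)

classifier = Pipeline(steps=[
    # Map each feature into [0, 1]; clip keeps unseen test values in range
    ("scale", MinMaxScaler(clip=True)),
    # The RBM is used as an unsupervised feature extractor
    ("rbm", BernoulliRBM(n_components=16, random_state=0)),
    # Logistic regression does the actual classification
    ("logistic", LogisticRegression(max_iter=1000)),
])

# Fit once on the raw training data; the pipeline handles the transformations
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)

print("Predict: ", predictions)
print("Real:    ", y_test)
```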
