sklearn classification with multiple label output - python

Hi, I am studying AI to build a chatbot. I am currently testing classification with sklearn, and I manage to get good results with the following code.
def tuned_nominaldb():
    global Tuned_Pipeline
    pipeline = Pipeline([
        ('tfidf', TfidfVectorizer(analyzer=text_process)),
        ('clf', OneVsRestClassifier(MultinomialNB(
            fit_prior=True, class_prior=None))),
    ])
    parameters = {
        'tfidf__max_df': (0.25, 0.5, 0.75),
        'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3)],
        'clf__estimator__alpha': (1e-2, 1e-3)
    }
    Tuned_Pipeline = GridSearchCV(pipeline, parameters, cv=2, n_jobs=2, verbose=10)
    Tuned_Pipeline.fit(cumle_train, tur_train)
My labels are:
Bad Language
Politics
Religious
General
When I enter any sentence, I get the correct label as output most of the time. But my problem is that I want to get multiple labels: if I combine bad language and politics, it only predicts bad language. How can I get a multi-label result like bad language + politics?
I tried to add the following code, but I got an error that a string was not expected for the fit method.
multiout = MultiOutputClassifier(Tuned_Pipeline, n_jobs=-1)
multiout.fit(cumle_train, tur_train)
print(multiout.predict(cumle_test))
Thanks a lot for your help

As you are using OneVsRestClassifier, it trains one binary classifier for each label. This means a single sentence can be evaluated by all of the estimators and receive multiple labels. I suggest you check these links:
OneVsRestClassifier documentation
Example with multiple classifications
estimators_ attribute
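For OneVsRestClassifier to actually return more than one label per sentence, the targets have to be a binary indicator matrix rather than single strings. A minimal sketch of that idea (the sentences, label lists, and variable names below are illustrative, not taken from the question's data):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Each sample may carry several labels at once.
train_sentences = ["a rude political rant", "a neutral remark about daily life"]
train_labels = [["Bad Language", "Politics"], ["General"]]

# Turn the label lists into a binary indicator matrix (one column per label).
mlb = MultiLabelBinarizer()
y_train = mlb.fit_transform(train_labels)

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', OneVsRestClassifier(MultinomialNB())),
])
pipeline.fit(train_sentences, y_train)

# Predictions come back as indicator rows; map them back to label names.
pred = pipeline.predict(["another rude political rant"])
print(mlb.inverse_transform(pred))

With this setup, each binary classifier inside OneVsRestClassifier can fire independently, so a single sentence may come back with, say, both Bad Language and Politics.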

Related

How to use GridSearchCV with MultiOutputClassifier(MLPClassifier) Pipeline

I am trying out scikit-learn for the first time, for a Multi-Output Multi-Class text classification problem. I am attempting to use GridSearchCV to optimize the parameters of MLPClassifier for this purpose.
I will admit that I am shooting in the dark here, having no prior experience. Please let me know if this makes sense.
Below is what I currently have:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
df = pd.read_csv('data.csv')
df.fillna('', inplace=True) #Replaces NaNs with "" in the DataFrame (which would be considered a viable choice in this multi-classification model)
x_features = df['input_text']
y_labels = df[['output_text_label_1', 'output_text_label_2']]
x_train, x_test, y_train, y_test = train_test_split(x_features, y_labels, test_size=0.3, random_state=7)
pipe = Pipeline(steps=[('cv', CountVectorizer()),
                       ('mlpc', MultiOutputClassifier(MLPClassifier()))])
pipe.fit(x_train, y_train)
pipe.score(x_test, y_test)
pipe.score gives a score of ~0.837, which seems to suggest that the above code is doing something. Running pipe.predict() on some test strings seems to yield relatively adequate output results.
However, even after looking at plenty of examples, I don't understand how to implement GridSearchCV for this Pipeline. (Additionally, I would like advice on which parameters to search.)
I doubt it makes sense to post my attempts with GridSearchCV since they have been varied and all unsuccessful. But a brief example from a Stack Overflow answer could be:
grid = [
    {
        'activation': ['identity', 'logistic', 'tanh', 'relu'],
        'solver': ['lbfgs', 'sgd', 'adam'],
        'hidden_layer_sizes': [(100,), (200,)]
    }
]
grid_search = GridSearchCV(pipe, grid, scoring='accuracy', n_jobs=-1)
grid_search.fit(x_train, y_train)
This gives the error:
ValueError: Invalid parameter activation for estimator
Pipeline(steps=[('cv', CountVectorizer()),
('mlpc', MultiOutputClassifier(estimator=MLPClassifier()))]). Check the list of
available parameters with estimator.get_params().keys().
I'm not sure what causes this, nor exactly how to utilize estimator.get_params().keys() to figure out which parameters are faulty.
Perhaps my uses of ('cv', CountVectorizer()) or ('mlpc', MultiOutputClassifier(estimator=MLPClassifier())) are incorrect in relation to the grid parameters.
I believe I need to use CountVectorizer() here because my inputs (and desired label outputs) are all strings.
I would very much appreciate an example of how GridSearchCV should be used for a Pipeline presumably utilizing CountVectorizer() and MLPClassifier in the correct way, along with advice on which grid parameters may be worth searching.
TL;DR Try something like this:
from sklearn.preprocessing import StandardScaler  # needed for the scaling step below

mlpc = MLPClassifier(solver='adam',
                     learning_rate_init=0.01,
                     max_iter=300,
                     activation='relu',
                     early_stopping=True)
# with_mean=False keeps CountVectorizer's sparse output sparse; StandardScaler
# raises an error on sparse matrices otherwise.
pipe = Pipeline(steps=[('cv', CountVectorizer(ngram_range=(1, 1))),
                       ('scale', StandardScaler(with_mean=False)),
                       ('mlpc', MultiOutputClassifier(mlpc))])
search_space = {
    'cv__max_df': (0.9, 0.95, 0.99),
    'cv__min_df': (0.01, 0.05, 0.1),
    'mlpc__estimator__alpha': 10.0 ** -np.arange(1, 5),
    'mlpc__estimator__hidden_layer_sizes': ((64, 32), (128, 64),
                                            (64, 32, 16), (128, 64, 32)),
    'mlpc__estimator__tol': (1e-3, 5e-3, 1e-4),
}
Discussion:
[Edit] MLPClassifier natively supports multi-output classification when every output is binary, and when the outputs are interrelated I wouldn't recommend MultiOutputClassifier, as it trains separate MLPClassifier instances without taking the relationships between outputs into account. Training only one MLPClassifier is faster, cheaper, and usually more accurate.
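A minimal sketch of that alternative, assuming each output can be expressed as a 0/1 column (the toy texts and label matrix here are purely illustrative):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier

texts = ["first example text", "second example text"]
# One row per sample, one binary column per output label.
y = np.array([[1, 0],
              [0, 1]])

X = CountVectorizer().fit_transform(texts)
clf = MLPClassifier(max_iter=300).fit(X, y)
print(clf.predict(X))  # one 0/1 prediction per output column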
The ValueError is due to improper parameter grid names. See Nested parameters.
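You can list every parameter name the grid will accept directly from the pipeline; the inner MLPClassifier's parameters are reached through the mlpc__estimator__ prefix:

# Prints keys such as 'cv__max_df' and 'mlpc__estimator__activation'
print(sorted(pipe.get_params().keys()))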
With a modest workstation and/or large training data, set solver='adam' to use a cheaper, first-order method as opposed to the second-order 'lbfgs'. Alternatively, try solver='sgd', which is even cheaper to compute, but then also tune momentum. I anticipate that your data will be sparse and on different scales after CountVectorizer, and momentum / solver='adam' is a way to tackle gradients of varying magnitude.
Insert one of the standardization transformers (I guess StandardScaler will work better) after CountVectorizer, as MLPs are sensitive to feature scaling; note that StandardScaler needs with_mean=False to accept the sparse bag-of-words matrix. Although solver='adam' would probably handle an unscaled bag of words reasonably well, I believe it won't hurt to standardize your data.
I think tuning activation is needless. Set activation='relu'.
Use early_stopping=True, specify a large enough max_iter, and tune tol to prevent overfitting.
Definitely tune learning_rate_init with solver='sgd'; for solver='adam', I assume higher learning rates will be OK and adam won't require comprehensive learning-rate tuning.
Prefer deeper nets to wider ones (e.g., hidden_layer_sizes=(128, 64, 32) to hidden_layer_sizes=(256, 192)).
Always tune alpha.
Optimal hidden_layer_sizes may depend on the dimensionality of the document-term matrix.
Try setting a higher batch_size, but take the computational expense into account.
If you wish to optimize CountVectorizer, tune max_df and min_df but not ngram_range; I believe at least a two-layer MLP will handle unigram relationships itself in its hidden layers without the need to process n-grams explicitly.
Optimize the hyperparameters in the code sample above first. But note that the remaining hyperparameters can also affect both computational performance and predictive power.
Disclaimer: Most of the remarks are based on my (insubstantial) assumptions about your data and pertain only to scikit-learn's MLPs. Refer to the docs to learn more about neural networks and experiment with other tips. And remember, There is No Free Lunch.

Hyperparameters tuning using GridSearchCV

I'm new to machine learning and I'm trying to predict the topic of an article given a labeled dataset in which each entry contains all the words in one article. There are 11 different topics in total and each article has only a single topic.
I have built a process pipeline:
classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(XGBClassifier(objective="multi:softmax", num_class=11), n_jobs=-1)),
])
I'm trying to implement a GridsearchCV to find the best hyperparameters:
parameters = {'vectorizer__ngram_range': [(1, 1), (1, 2), (2, 2)],
              'tfidf__use_idf': (True, False)}
gs_clf_svm = GridSearchCV(classifier, parameters, n_jobs=-1, cv=10, scoring='f1_micro')
gs_clf_svm = gs_clf_svm.fit(X, Y)
This works fine, however, how do I tune the hyperparameters of XGBClassifier? I have tried using the notation:
parameters = {'clf__learning_rate': [0.1, 0.01, 0.001]}
It doesn't work because GridSearchCV is looking for the hyperparameters of OneVsRestClassifier. How can I actually tune the hyperparameters of XGBClassifier?
Also, what hyperparameters are you suggesting worth tuning for my problem?
As is, the pipeline looks for a parameter learning_rate in OneVsRestClassifier, can't find one (unsurprisingly, since that estimator does not have such a parameter), and raises an error. Since you actually want the parameter learning_rate of XGBClassifier, you should go a level deeper, i.e.:
parameters = {'clf__estimator__learning_rate': [0.1, 0.01, 0.001]}
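As for which hyperparameters are worth tuning, a hedged starting grid (the value ranges are only illustrative) that combines the vectorizer parameters with a few commonly tuned XGBClassifier parameters could look like:

parameters = {
    'vectorizer__ngram_range': [(1, 1), (1, 2)],
    'tfidf__use_idf': (True, False),
    'clf__estimator__learning_rate': [0.1, 0.01],
    'clf__estimator__max_depth': [3, 6],
    'clf__estimator__n_estimators': [100, 300],
    'clf__estimator__subsample': [0.8, 1.0],
}
gs_clf_svm = GridSearchCV(classifier, parameters, n_jobs=-1, cv=10, scoring='f1_micro')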

Machine Learning - How to extract features from pipeline

I am totally new to the field and currently I am stuck. Here is what I want and what I did:
I have a DataFrame that is split into a train and a test dataset. The training features are Twitter messages, and the labels are the assigned categories. I set up a tokenizer (called clean_text) that keeps only relevant words and strips the messages down to the core information. The model, including a grid search, looks as follows:
def build_model():
    pipeline = Pipeline([
        ('vectorizer', CountVectorizer(tokenizer=clean_text)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(
            RandomForestClassifier()
        ))
    ])
    # parameters to grid search
    parameters = {'vectorizer__max_features': [50],  # , 72, 144, 288, 576, 1152],
                  'clf__estimator__n_estimators': [100]}  # , 100] }
    # initiating GridSearchCV method
    model = GridSearchCV(pipeline, param_grid=parameters, cv=5)
    return model
The fitting works fine, as well as the evaluation.
Now I am not sure if the model is set up correctly and if the features are the most used tokens in the messages (in the above case 50), or if there is an error.
Now the question:
Is there a way to print the 50 features and see if they look right?
Best
Felix
With no sample information, this is the best guess. Please check if the following works. If you have sample data, we can help you better.
print(model.best_estimator_.named_steps['vectorizer'].vocabulary_)
This should work after fitting the model; otherwise, share a sample DataFrame. The selected feature names are available from the same step:
model.best_estimator_.named_steps['vectorizer'].get_feature_names()
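Putting it together, a minimal sketch for inspecting the 50 retained features (X_train / y_train are placeholders for your training messages and labels; on scikit-learn 1.0+ use get_feature_names_out() instead of get_feature_names()):

model = build_model()
model.fit(X_train, y_train)

vect = model.best_estimator_.named_steps['vectorizer']
print(vect.get_feature_names_out())   # the tokens kept by max_features=50
print(len(vect.vocabulary_))          # should print 50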

how to choose parameters in TfidfVectorizer in sklearn during unsupervised clustering

TfidfVectorizer provides an easy way to encode & transform texts into vectors.
My question is how to choose the proper values for parameters such as min_df, max_features, smooth_idf, sublinear_tf?
update:
Maybe I should have put more details on the question:
What if I am doing unsupervised clustering on a bunch of texts, and I don't have any labels for the texts and I don't know how many clusters there might be (which is actually what I am trying to figure out)?
If you are, for instance, using these vectors in a classification task, you can vary these parameters (and of course also the parameters of the classifier) and see which values give you the best performance.
You can do that easily in sklearn with the GridSearchCV and Pipeline objects:
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words=stop_words)),
    ('clf', OneVsRestClassifier(MultinomialNB(
        fit_prior=True, class_prior=None))),
])
parameters = {
    'tfidf__max_df': (0.25, 0.5, 0.75),
    'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'clf__estimator__alpha': (1e-2, 1e-3)
}
grid_search_tune = GridSearchCV(pipeline, parameters, cv=2, n_jobs=2, verbose=3)
grid_search_tune.fit(train_x, train_y)
print("Best parameters set:")
print(grid_search_tune.best_estimator_.steps)
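For the unsupervised-clustering case in the update there are no labels to score against, so one common workaround is to vary the same vectorizer parameters and compare an internal clustering metric such as the silhouette score. A rough sketch under that assumption (KMeans, the toy documents, and the parameter values are only illustrative):

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

texts = [
    "cats purr and sleep", "dogs bark and sleep",
    "stocks rise today", "markets and stocks fall",
    "cats chase dogs", "markets rise and fall",
]

best = None
for min_df in (1, 2):
    for sublinear_tf in (True, False):
        X = TfidfVectorizer(min_df=min_df, sublinear_tf=sublinear_tf).fit_transform(texts)
        labels = KMeans(n_clusters=2, random_state=0).fit_predict(X)
        score = silhouette_score(X, labels)
        if best is None or score > best[0]:
            best = (score, min_df, sublinear_tf)

print(best)  # best (silhouette, min_df, sublinear_tf) combination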

scikit-learn pipeline

Each sample in my (iid) dataset looks like this:
x = [a_1,a_2...a_N,b_1,b_2...b_M]
I also have the label of each sample (This is supervised learning)
The a features are very sparse (namely a bag-of-words representation), while the b features are dense (integers; there are ~45 of those).
I am using scikit-learn, and I want to use GridSearchCV with pipeline.
The question: is it possible to use one CountVectorizer on features type a and another CountVectorizer on features type b?
What I want can be thought of as:
pipeline = Pipeline([
    ('vect1', CountVectorizer()),  # will work only on features [0, (N-1)]
    ('vect2', CountVectorizer()),  # will work only on features [N, (N+M-1)]
    ('clf', SGDClassifier()),      # will use all features to classify
])
parameters = {
    'vect1__max_df': (0.5, 0.75, 1.0),       # type a features only
    'vect1__ngram_range': ((1, 1), (1, 2)),  # type a features only
    'vect2__max_df': (0.5, 0.75, 1.0),       # type b features only
    'vect2__ngram_range': ((1, 1), (1, 2)),  # type b features only
    'clf__alpha': (0.00001, 0.000001),
    'clf__penalty': ('l2', 'elasticnet'),
    'clf__n_iter': (10, 50, 80),
}
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)
grid_search.fit(X, y)
Is that possible?
A nice idea was presented by @Andreas Mueller.
However, I want to keep the original non-chosen features as well... therefore, I cannot tell the column index for each stage of the pipeline upfront (before the pipeline begins).
For example, if I set CountVectorizer(max_df=0.75), it may reduce some terms, and the original column index will change.
Thanks
Unfortunately, this is currently not as nice as it could be. You need to use FeatureUnion to concatenate the two kinds of features, and the transformer in each branch needs to select its features and transform them.
One way to do that is to make a pipeline of a transformer that selects the columns (you need to write that yourself) and the CountVectorizer. There is an example that does something similar here. That example actually separates the features as different values in a dictionary, but you don't need to do that.
Also have a look at the related issue for selecting columns which contains code for the transformer that you need.
It would look something like this with the current code:
make_pipeline(
    make_union(
        make_pipeline(FeatureSelector(some_columns), CountVectorizer()),
        make_pipeline(FeatureSelector(other_columns), CountVectorizer())),
    SGDClassifier())
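The FeatureSelector transformer referenced above is something you write yourself; a minimal sketch of such a column selector (the class name and slicing are illustrative, not part of scikit-learn) might look like:

from sklearn.base import BaseEstimator, TransformerMixin

class FeatureSelector(BaseEstimator, TransformerMixin):
    """Pick a subset of columns before handing them to the next transformer."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        # Works for numpy arrays; for a DataFrame use X[self.columns] instead.
        # If the selected slice holds raw text documents, return a 1D sequence
        # (e.g. X[:, self.columns[0]]) so CountVectorizer can consume it.
        return X[:, self.columns]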
