how to choose parameters in TfidfVectorizer in sklearn during unsupervised clustering - python

TfidfVectorizer provides an easy way to encode & transform texts into vectors.
My question is how to choose the proper values for parameters such as min_df, max_features, smooth_idf, sublinear_tf?
Update:
Maybe I should have given more detail in the question:
What if I am doing unsupervised clustering on a bunch of texts, I don't have any labels for them, and I don't know how many clusters there might be (which is actually what I am trying to figure out)?

If you are, for instance, using these vectors in a classification task, you can vary these parameters (and of course also the parameters of the classifier) and see which values give you the best performance.
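You can do that in sklearn easily with the GridSearchCV and Pipeline objects:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# stop_words, train_x and train_y are assumed to be defined elsewhere
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words=stop_words)),
    ('clf', OneVsRestClassifier(MultinomialNB(
        fit_prior=True, class_prior=None))),
])
parameters = {
    'tfidf__max_df': (0.25, 0.5, 0.75),
    'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'clf__estimator__alpha': (1e-2, 1e-3)
}
grid_search_tune = GridSearchCV(pipeline, parameters, cv=2, n_jobs=2, verbose=3)
grid_search_tune.fit(train_x, train_y)
print("Best parameters set:")
print(grid_search_tune.best_estimator_.steps)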
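For the unsupervised case in the question, where there are no labels to score against, one possible stand-in is an internal clustering metric. The sketch below is only an illustration under assumptions that are not in the answer above: a toy corpus, KMeans as the clusterer, and silhouette_score as the selection criterion. It loops over a few TfidfVectorizer settings and candidate cluster counts and keeps the combination with the highest silhouette.

```python
from itertools import product

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Toy corpus (an assumption for illustration; replace with your own texts)
docs = [
    "the cat sat on the mat",
    "a cat and a kitten play on the mat",
    "dogs and cats are popular pets",
    "the stock market fell sharply today",
    "investors sold stock as the market dropped",
    "the market rallied and stock prices rose",
    "the team won the football match",
    "a late goal decided the football match",
]

results = []
# Try every combination of two vectorizer settings and two cluster counts
for min_df, sublinear_tf, k in product((1, 2), (False, True), (2, 3)):
    X = TfidfVectorizer(min_df=min_df, sublinear_tf=sublinear_tf).fit_transform(docs)
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    if len(set(labels)) < 2:
        continue  # silhouette is undefined with fewer than two clusters
    score = silhouette_score(X, labels)
    results.append((score, {"min_df": min_df, "sublinear_tf": sublinear_tf, "n_clusters": k}))

# Keep the setting with the highest silhouette score
best_score, best_params = max(results, key=lambda item: item[0])
print("Best silhouette: %.3f with %r" % (best_score, best_params))
```

Treat the ranking as a rough guide only: each vectorizer setting produces a different feature space, so silhouette values are not strictly comparable across them, and inspecting the top terms per cluster is still worthwhile.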

Related

Hyperparameters tuning using GridSearchCV

I'm new to machine learning and I'm trying to predict the topic of an article, given a labeled dataset in which each entry contains all the words of one article. There are 11 different topics in total, and each article has a single topic.
I have built a process pipeline:
classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(XGBClassifier(objective="multi:softmax", num_class=11), n_jobs=-1)),
])
I'm trying to implement a GridSearchCV to find the best hyperparameters:
parameters = {'vectorizer__ngram_range': [(1, 1), (1, 2), (2, 2)],
              'tfidf__use_idf': (True, False)}
gs_clf_svm = GridSearchCV(classifier, parameters, n_jobs=-1, cv=10, scoring='f1_micro')
gs_clf_svm = gs_clf_svm.fit(X, Y)
This works fine. However, how do I tune the hyperparameters of XGBClassifier? I have tried the notation:
parameters = {'clf__learning_rate': [0.1, 0.01, 0.001]}
It doesn't work because GridSearchCV looks for the hyperparameters of OneVsRestClassifier. How do I actually tune the hyperparameters of XGBClassifier?
Also, which hyperparameters would you suggest tuning for my problem?
As is, the pipeline looks for a parameter learning_rate in OneVsRestClassifier, can't find one (unsurprisingly, since that class has no such parameter), and raises an error. Since you actually want the learning_rate parameter of XGBClassifier, you should go one level deeper, i.e.:
parameters = {'clf__estimator__learning_rate': [0.1, 0.01, 0.001]}
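If you are unsure which names a grid will accept, you can list them with get_params(). The following is a minimal sketch, assuming LogisticRegression as a stand-in for XGBClassifier so that it needs only scikit-learn; the nesting rule for the step name and the wrapped estimator is the same.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline

# LogisticRegression stands in for XGBClassifier (an assumption for this sketch)
classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LogisticRegression())),
])

# Every key returned here is a valid name for a GridSearchCV parameter grid
param_names = sorted(classifier.get_params().keys())

# Parameters of the wrapped estimator sit one level below the 'clf' step
wrapped = [name for name in param_names if name.startswith('clf__estimator__')]
print(wrapped)
```

Parameters of the wrapper itself (such as clf__n_jobs) sit directly under the step name, while those of the inner estimator need the extra estimator__ segment.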

sklearn classification with multiple label output

Hi, I am studying AI in order to build a chatbot. I am currently testing classification with sklearn, and I managed to get good results with the following code.
def tuned_nominaldb():
    global Tuned_Pipeline
    pipeline = Pipeline([
        ('tfidf', TfidfVectorizer(analyzer=text_process)),
        ('clf', OneVsRestClassifier(MultinomialNB(
            fit_prior=True, class_prior=None))),
    ])
    parameters = {
        'tfidf__max_df': (0.25, 0.5, 0.75),
        'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3)],
        'clf__estimator__alpha': (1e-2, 1e-3)
    }
    Tuned_Pipeline = GridSearchCV(pipeline, parameters, cv=2, n_jobs=2, verbose=10)
    Tuned_Pipeline.fit(cumle_train, tur_train)
My labels are:
Bad Language
Politics
Religious
General
When I enter a sentence, I get the correct label most of the time. My problem is that I want to get multiple labels: if I combine bad language and politics in one sentence, it only predicts Bad Language. How can I get a multi-label output such as Bad Language + Politics?
I tried adding the following code, but I got an error saying that a string was not expected by the fit method.
multiout = MultiOutputClassifier(Tuned_Pipeline, n_jobs=-1)
multiout.fit(cumle_train, tur_train)
print(multiout.predict(cumle_test))
Thanks a lot for your help
Since you are using OneVsRestClassifier, it trains one binary classifier per label. This means each of those classifiers can be applied to the same sentence, so you can get multiple labels from it. I suggest you check these links:
OneVsRestClassifier documentation
Example with multiple classifications
estimators_ attribute
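One way to get several labels per sentence, sketched here under assumptions, is to train OneVsRestClassifier on a binary indicator matrix rather than on one string label per sample. MultiLabelBinarizer builds that matrix from lists of labels. The toy sentences and their tags below are made up for illustration, and the default TfidfVectorizer analyzer replaces the question's text_process function.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Toy training data (illustrative assumption): a sentence may carry several labels
sentences = [
    "this damn idiot is stupid",
    "the election and the parliament vote",
    "the church held a prayer service",
    "the weather is nice today",
    "that damn politician rigged the election",
    "the stupid priest ruined the prayer",
    "parliament debated funding for the church",
    "we had lunch in the park today",
]
tags = [
    ["Bad Language"],
    ["Politics"],
    ["Religious"],
    ["General"],
    ["Bad Language", "Politics"],
    ["Bad Language", "Religious"],
    ["Politics", "Religious"],
    ["General"],
]

# Turn the label lists into a 0/1 matrix with one column per label
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags)

model = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', OneVsRestClassifier(MultinomialNB())),
])
model.fit(sentences, Y)

# predict returns one 0/1 row per sentence; map it back to label names
pred = model.predict(["that damn election"])
print(mlb.inverse_transform(pred))
```

Each column is decided independently, so a sentence can come back with zero, one, or several labels. On a corpus this small the predictions themselves are not meaningful; the point is the shape of the target.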

Working with, preparing bag-of-word data for Regression

I'm trying to create a regression model that predicts an author's age. I'm using (Nguyen et al., 2011) as my basis.
Using a bag-of-words model, I count the occurrences of words per document (the documents are posts from boards) and create a vector for every post.
I limit the size of each vector by using the top-k (k = some number) most frequently used words as features (stopwords are not used).
Vectorexample_with_k_8 = [0,0,0,1,0,3,0,0]
My data is generally sparse, as in the example.
When I test the model on my test data I get a very low r² score (0.00-0.1), sometimes even a negative one. The model always predicts the same age, which happens to be the average age of my dataset, as seen in the distribution of my data (age/amount).
I used different regression models (Linear Regression, Lasso, and SGDRegressor from scikit-learn) with no improvement.
So the questions are:
1. How do I improve the r² score?
2. Do I have to change the data to fit the regression better? If yes, with what method?
3. Which regressor/methods should I use for text classification?
To my knowledge, bag-of-words models usually use Naive Bayes as the classifier to fit the document-by-term sparse matrix.
None of your regressors handles a large sparse matrix well. Lasso may work well if you have groups of highly correlated features.
I think that for your problem, Latent Semantic Analysis may give better results. Essentially, use TfidfVectorizer to normalize the word-count matrix, then use TruncatedSVD to reduce the dimensionality, retaining the first N components, which capture most of the variance. Most regressors should work well with the lower-dimensional matrix. In my experience, SVM works pretty well for this problem.
Here is an example script:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn import svm
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV  # formerly sklearn.grid_search
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('svd', TruncatedSVD()),
    ('clf', svm.SVR())
])
# You can tune hyperparameters using grid search
params = {
    'tfidf__max_df': (0.5, 0.75, 1.0),
    'tfidf__ngram_range': ((1, 1), (1, 2)),
    'svd__n_components': (50, 100, 150, 200),
    'clf__C': (0.1, 1, 10),
}
grid_search = GridSearchCV(pipeline, params, scoring='r2',
                           n_jobs=-1, verbose=10)
# fit your documents (should be a list/array of strings)
grid_search.fit(documents, y)
print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(params.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))

scikit-learn pipeline

Each sample in my (iid) dataset looks like this:
x = [a_1,a_2...a_N,b_1,b_2...b_M]
I also have the label of each sample (this is supervised learning).
The a features are very sparse (namely a bag-of-words representation), while the b features are dense (integers; there are ~45 of those).
I am using scikit-learn, and I want to use GridSearchCV with pipeline.
The question: is it possible to use one CountVectorizer on features type a and another CountVectorizer on features type b?
What I want can be thought of as:
pipeline = Pipeline([
    ('vect1', CountVectorizer()),  # will work only on features [0,(N-1)]
    ('vect2', CountVectorizer()),  # will work only on features [N,(N+M-1)]
    ('clf', SGDClassifier()),  # will use all features to classify
])
parameters = {
    'vect1__max_df': (0.5, 0.75, 1.0),  # type a features only
    'vect1__ngram_range': ((1, 1), (1, 2)),  # type a features only
    'vect2__max_df': (0.5, 0.75, 1.0),  # type b features only
    'vect2__ngram_range': ((1, 1), (1, 2)),  # type b features only
    'clf__alpha': (0.00001, 0.000001),
    'clf__penalty': ('l2', 'elasticnet'),
    'clf__n_iter': (10, 50, 80),
}
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)
grid_search.fit(X, y)
Is that possible?
A nice idea was presented by @Andreas Mueller.
However, I want to keep the original non-chosen features as well... therefore, I cannot tell the column index for each phase at the pipeline upfront (before the pipeline begins).
For example, if I set CountVectorizer(max_df=0.75), it may reduce some terms, and the original column index will change.
Thanks
Unfortunately, this is currently not as nice as it could be. You need to use FeatureUnion to concatenate two kinds of features, and the transformer in each branch needs to select the features and transform them.
One way to do that is to make a pipeline of a transformer that selects the columns (you need to write that yourself) and the CountVectorizer. There is an example that does something similar here. That example actually separates the features as different values in a dictionary, but you don't need to do that.
Also have a look at the related issue for selecting columns, which contains code for the transformer that you need.
It would look something like this with the current code:
make_pipeline(
    make_union(
        make_pipeline(FeatureSelector(some_columns), CountVectorizer()),
        make_pipeline(FeatureSelector(other_columns), CountVectorizer())),
    SGDClassifier())
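Since the selecting transformer has to be written by hand, here is a minimal sketch of what it could look like. ColumnSelector is a hypothetical helper (it is not part of scikit-learn and is not the FeatureSelector from the linked issue), and the two-column text array is a toy assumption chosen so that CountVectorizer has strings to work on.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline, make_union


class ColumnSelector(BaseEstimator, TransformerMixin):
    """Hypothetical helper: pick one column out of a 2-D array."""

    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        # Returns a 1-D array of strings, which CountVectorizer can iterate over
        return np.asarray(X)[:, self.column]


# Toy data (an assumption): each row holds two separate text fields
X = np.array([
    ["red apple", "sweet fruit"],
    ["green apple", "sour fruit"],
    ["fast car", "loud engine"],
    ["slow car", "quiet engine"],
], dtype=object)
y = np.array([0, 0, 1, 1])

# Each branch selects its own column, then vectorizes it independently
features = make_union(
    make_pipeline(ColumnSelector(0), CountVectorizer()),
    make_pipeline(ColumnSelector(1), CountVectorizer()),
)
model = make_pipeline(features, SGDClassifier(random_state=0))
model.fit(X, y)

# The union concatenates both vocabularies side by side
combined = features.fit_transform(X)
print(combined.shape)
```

Because each branch selects from the original input, a setting such as max_df in one branch cannot shift the column indices seen by the other. Newer scikit-learn releases also provide sklearn.compose.ColumnTransformer, which may remove the need for a hand-written selector.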

Add Features to An Sklearn Classifier

I'm building an SGDClassifier and using a tfidf transformer. Aside from the features created by tfidf, I'd also like to add additional features such as document length or other ratings. How can I add these features to the feature set? Here is how the classifier is constructed in a pipeline:
data = fetch_20newsgroups(subset='train', categories=None)
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    'vect__max_features': (None, 5000, 10000, 50000),
    'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
    'tfidf__use_idf': (True, False),
}
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)
grid_search.fit(data.data, data.target)
print(grid_search.best_score_)
You can use a feature union: http://scikit-learn.org/stable/modules/pipeline.html#featureunion-composite-feature-spaces
There is a nice example in the documentation, https://scikit-learn.org/0.18/auto_examples/hetero_feature_union.html, which I think fits your requirements exactly. See the TextStats transformer.
[Update: the example was for scikit-learn <= 0.18]
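As a rough sketch of the same idea on a toy scale, a small custom transformer can sit next to the tf-idf features inside a FeatureUnion. DocLength below is a hypothetical class written for this illustration (it is not the TextStats transformer from the linked example), and the four documents are made up.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import FeatureUnion, Pipeline


class DocLength(BaseEstimator, TransformerMixin):
    """Hypothetical extra-feature transformer: one column with each document's length."""

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        # Shape (n_documents, 1): character count of every document
        return np.array([[len(doc)] for doc in X], dtype=float)


# Toy data (an assumption for illustration)
docs = [
    "short note",
    "tiny memo",
    "this is a considerably longer document about nothing in particular",
    "another fairly long piece of text that goes on for a while",
]
labels = [0, 0, 1, 1]

pipeline = Pipeline([
    ('features', FeatureUnion([
        ('tfidf', TfidfVectorizer()),  # text features
        ('length', DocLength()),       # extra hand-made feature
    ])),
    ('clf', SGDClassifier(random_state=0)),
])
pipeline.fit(docs, labels)

# The union output is the tf-idf matrix with the length column appended
union = pipeline.named_steps['features']
combined = union.transform(docs)
n_terms = len(union.transformer_list[0][1].vocabulary_)
print(combined.shape)
```

The raw length is on a much larger scale than the tf-idf values, so in practice you would probably want to scale it before it reaches SGDClassifier. Grid-search names then follow the nesting, for example features__tfidf__max_df.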
Regards,
