Each sample in my (iid) dataset looks like this:
x = [a_1,a_2...a_N,b_1,b_2...b_M]
I also have the label of each sample (This is supervised learning)
The a features are very sparse (namely a bag-of-words representation), while the b features are dense (integers; there are ~45 of those).
I am using scikit-learn, and I want to use GridSearchCV with pipeline.
The question: is it possible to use one CountVectorizer on features type a and another CountVectorizer on features type b?
What I want can be thought of as:
pipeline = Pipeline([
('vect1', CountVectorizer()), #will work only on features [0,(N-1)]
('vect2', CountVectorizer()), #will work only on features [N,(N+M-1)]
('clf', SGDClassifier()), #will use all features to classify
])
parameters = {
'vect1__max_df': (0.5, 0.75, 1.0), # type a features only
'vect1__ngram_range': ((1, 1), (1, 2)), # type a features only
'vect2__max_df': (0.5, 0.75, 1.0), # type b features only
'vect2__ngram_range': ((1, 1), (1, 2)), # type b features only
'clf__alpha': (0.00001, 0.000001),
'clf__penalty': ('l2', 'elasticnet'),
'clf__max_iter': (10, 50, 80),  # named n_iter in old scikit-learn releases
}
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)
grid_search.fit(X, y)
Is that possible?
A nice idea was presented by @Andreas Mueller.
However, I want to keep the original non-chosen features as well... therefore, I cannot tell the column index for each phase at the pipeline upfront (before the pipeline begins).
For example, if I set CountVectorizer(max_df=0.75), it may reduce some terms, and the original column index will change.
Thanks
Unfortunately, this is currently not as nice as it could be. You need to use FeatureUnion to concatenate two kinds of features, and the transformer in each branch needs to select the features and transform them.
One way to do that is to make a pipeline of a transformer that selects the columns (you need to write that yourself) and the CountVectorizer. There is an example that does something similar here. That example actually separates the features as different values in a dictionary, but you don't need to do that.
Also have a look at the related issue for selecting columns which contains code for the transformer that you need.
It would look something like this with the current code:
make_pipeline(
make_union(
make_pipeline(FeatureSelector(some_columns), CountVectorizer()),
make_pipeline(FeatureSelector(other_columns), CountVectorizer())),
SGDClassifier())
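As a rough sketch of the column-selecting transformer mentioned above: the class name ColumnSelector, the block sizes N and M, and the synthetic data are all assumptions made for illustration. Because the toy input is already numeric, the two branches use TfidfTransformer and StandardScaler in place of the CountVectorizer steps from the question.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import StandardScaler


class ColumnSelector(BaseEstimator, TransformerMixin):
    """Hypothetical helper: keep only the given column indices."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return X[:, self.columns]


# Synthetic stand-in data: N sparse count columns followed by M dense integer columns.
rng = np.random.RandomState(0)
N, M = 6, 3  # assumed sizes of the "a" and "b" blocks
X = np.hstack([
    rng.poisson(0.3, size=(40, N)),
    rng.randint(0, 10, size=(40, M)),
]).astype(float)
y = rng.randint(0, 2, size=40)

a_cols = list(range(N))
b_cols = list(range(N, N + M))

# Each branch selects its own columns, transforms them, and the union concatenates the results.
features = make_union(
    make_pipeline(ColumnSelector(a_cols), TfidfTransformer()),
    make_pipeline(ColumnSelector(b_cols), StandardScaler()),
)
X_combined = features.fit_transform(X)

model = make_pipeline(features, SGDClassifier(random_state=0))
model.fit(X, y)
```

With make_union and make_pipeline the step names are generated automatically, so a grid-search key for the first branch would look like 'featureunion__pipeline-1__tfidftransformer__use_idf'. Since each branch selects from the original input, the column indices do not shift when one branch drops features.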
I'm working on preprocessing the Titanic data set in order to run it through some regressions.
It is the case that the "Age" column in the train and test sets is only populated for around 80% of the rows in each set.
Rather than just eliminate the rows that don't have an "Age" I'd like to use the SimpleImputer (from sklearn.impute import SimpleImputer) to fill in the missing values in those columns.
SimpleImputer has three options for the strategy parameter that work with numeric data. These are mean, median, and most_frequent (mode). (There's also the option to use a custom value, but because I'm trying to avoid "binning" the values I don't want to use this option.)
At its most basic, my approach would involve manually setting up the required datasets. I'd have to run one of each kind of imputer (imputer = SimpleImputer(strategy="xxxxxx") where xxxxxx = 'mean', 'median', or 'most_frequent') on each of the train and test datasets and then end up with six different datasets that I'd then have to feed through my RandomForestRegressor one at a time.
I know that GridSearchCV can be used to exhaustively compare various combinations of parameter values in a regressor, so I'm wondering if anyone knows a way to use it or something similar to run through the various strategy options of the imputer?
I'm thinking something along the lines of the following pseudocode, where 'method' stands for the imputation strategy:
param_grid = [
{'method': ['mean','median', 'most frequent']},
]
forest_reg = RandomForestRegressor()
grid_search = GridSearchCV(forest_reg, param_grid, cv = 5, scoring = 'neg_mean_squared_error')
grid_search.fit(titanic_features[method], titanic_values[method])
Is there a clean way to compare options like this?
Is there a better way to compare the three options than to build all six data sets, run them through the RF regressor and see what comes out?
Sklearn Pipelines are meant for exactly this. You have to create a pipeline with an imputer component preceding the regressor. You can then use the grid search parameter grid with __ to pass the component-specific parameters.
Sample code (documented inline)
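import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Sample/synthetic data shape 1000 X 2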
X = np.random.randn(1000,2)
y = 1.5*X[:,0]+3.2*X[:, 1]+2.4
# Randomly make 200 data points in each axis as nan's
X[np.random.randint(0,1000, 200), 0] = np.nan
X[np.random.randint(0,1000, 200), 1] = np.nan
# Simple pipeline which has an imputer followed by regressor
pipe = Pipeline(steps=[('impute', SimpleImputer(missing_values=np.nan)),
('regressor', RandomForestRegressor())])
# 3 different imputers and 2 different regressors
# a total of 6 different parameter combination will be searched
param_grid = {
'impute__strategy': ["mean", "median", "most_frequent"],
'regressor__max_depth': [2,3]
}
# Run grid search
search = GridSearchCV(pipe, param_grid)
search.fit(X, y)
print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)
Sample output:
Best parameter (CV score=0.730):
{'impute__strategy': 'median', 'regressor__max_depth': 3}
So with GridSearchCV we are able to find that the best impute strategy for our sample data is median, in combination with a max_depth of 3.
You can keep extending the pipeline with other components.
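As one possible illustration of extending the pipeline, the sketch below adds a scaling step and lets the grid search decide whether to keep it. The step name "scale", the smaller synthetic dataset, and the reduced n_estimators are assumptions chosen to keep the example quick, not part of the answer above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic dataset with some missing values in the first column.
rng = np.random.RandomState(0)
X = rng.randn(200, 2)
y = 1.5 * X[:, 0] + 3.2 * X[:, 1] + 2.4
X[rng.randint(0, 200, 40), 0] = np.nan

pipe = Pipeline(steps=[
    ("impute", SimpleImputer()),
    ("scale", StandardScaler()),
    ("regressor", RandomForestRegressor(n_estimators=20, random_state=0)),
])

param_grid = {
    "impute__strategy": ["mean", "median", "most_frequent"],
    # A whole step can be a grid entry too; "passthrough" skips the step.
    "scale": [StandardScaler(), "passthrough"],
    "regressor__max_depth": [2, 3],
}

search = GridSearchCV(pipe, param_grid, cv=3, scoring="neg_mean_squared_error")
search.fit(X, y)
n_candidates = len(search.cv_results_["params"])
print(search.best_params_)
```

This grid has 3 x 2 x 2 = 12 candidates, each evaluated on every fold, so the cost grows quickly as more steps and values are added.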
I'm new to machine learning and I'm trying to predict the topic of an article given a labeled dataset in which each sample contains all the words of one article. There are 11 different topics in total and each article has only a single topic.
I have built a process pipeline:
classifier = Pipeline([
('vectorizer', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', OneVsRestClassifier(XGBClassifier(objective="multi:softmax", num_class=11), n_jobs=-1)),
])
I'm trying to implement a GridSearchCV to find the best hyperparameters:
parameters = {'vectorizer__ngram_range': [(1, 1), (1, 2),(2,2)],
'tfidf__use_idf': (True, False)}
gs_clf_svm = GridSearchCV(classifier, parameters, n_jobs=-1, cv=10, scoring='f1_micro')
gs_clf_svm = gs_clf_svm.fit(X, Y)
This works fine, however, how do I tune the hyperparameters of XGBClassifier? I have tried using the notation:
parameters = {'clf__learning_rate': [0.1, 0.01, 0.001]}
It doesn't work because GridSearchCV is looking for the hyperparameters of OneVsRestClassifier. How do I actually tune the hyperparameters of XGBClassifier?
Also, which hyperparameters would you suggest are worth tuning for my problem?
As is, the pipeline looks for a parameter learning_rate in OneVsRestClassifier, can't find one (unsurprisingly, since the class does not have such a parameter), and raises an error. Since you actually want the parameter learning_rate of XGBClassifier, you should go a level deeper, i.e.:
parameters = {'clf__estimator__learning_rate': [0.1, 0.01, 0.001]}
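One way to check which nested names are valid is to list them with get_params(). The sketch below swaps in LogisticRegression for XGBClassifier purely so that it runs with scikit-learn alone (an assumption for illustration), which makes the nested parameter C instead of learning_rate.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline

classifier = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    # Stand-in for XGBClassifier; any wrapped estimator is reached via "estimator".
    ("clf", OneVsRestClassifier(LogisticRegression())),
])

# Every key returned by get_params() is a valid name for a GridSearchCV parameter grid.
nested = sorted(k for k in classifier.get_params() if k.startswith("clf__estimator__"))
print(nested)
```

The wrapped model's parameters show up under the clf__estimator__ prefix, while a key such as clf__C does not exist.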
Hi, I am studying AI to build a chatbot. I am now testing classification with sklearn, and I managed to get good results with the following code.
def tuned_nominaldb():
    global Tuned_Pipeline
    pipeline = Pipeline([
        ('tfidf', TfidfVectorizer(analyzer=text_process)),
        ('clf', OneVsRestClassifier(MultinomialNB(
            fit_prior=True, class_prior=None))),
    ])
    parameters = {
        'tfidf__max_df': (0.25, 0.5, 0.75),
        'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3)],
        'clf__estimator__alpha': (1e-2, 1e-3)
    }
    Tuned_Pipeline = GridSearchCV(pipeline, parameters, cv=2, n_jobs=2, verbose=10)
    Tuned_Pipeline.fit(cumle_train, tur_train)
my labels are:
Bad Language
Politics
Religious
General
When I enter a sentence, I get the correct label most of the time. My problem is that I want to get multiple labels: if a sentence combines bad language and politics, it only predicts Bad Language. How can I get a multi-label prediction such as Bad Language + Politics?
I tried adding the following code, but I got an error saying that a string was not expected by the fit method.
multiout = MultiOutputClassifier(Tuned_Pipeline, n_jobs=-1)
multiout.fit(cumle_train, tur_train)
print(multiout.predict(cumle_test))
Thanks a lot for your help
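Because you are using OneVsRestClassifier, one binary classifier is trained per label. This means several of those per-label estimators can fire on the same sentence and give you multiple labels, provided the training targets are given as a multi-label indicator matrix rather than a single string per sample. I suggest you check these links: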
OneVsRestClassifier documentation
Example with multiple classifications
estimators_ attribute
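A minimal sketch of that multi-label setup, assuming toy sentences and label sets invented for illustration: MultiLabelBinarizer turns each set of labels into a row of a 0/1 indicator matrix, and OneVsRestClassifier then fits one binary MultinomialNB per column.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Toy training data (made up): each sentence may carry several labels.
texts = [
    "damn this is stupid",
    "the election results are in",
    "prayer and faith matter",
    "nice weather today",
    "damn stupid election politicians",
    "faith and the election",
]
labels = [
    {"Bad Language"},
    {"Politics"},
    {"Religious"},
    {"General"},
    {"Bad Language", "Politics"},
    {"Religious", "Politics"},
]

# One indicator column per label, in sorted label order.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)

model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", OneVsRestClassifier(MultinomialNB())),
])
model.fit(texts, Y)

# predict returns one indicator row per sentence; map it back to label names.
pred = model.predict(["damn stupid election"])
predicted_labels = mlb.inverse_transform(pred)
print(predicted_labels)
```

Each per-label classifier decides independently, so a sentence can come back with zero, one, or several labels. With a dataset this small the predictions themselves are not meaningful.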
TfidfVectorizer provides an easy way to encode & transform texts into vectors.
My question is how to choose the proper values for parameters such as min_df, max_features, smooth_idf, sublinear_tf?
update:
Maybe I should have put more details on the question:
What if I am doing unsupervised clustering with a bunch of texts, I don't have any labels for the texts, and I don't know how many clusters there might be (which is actually what I am trying to figure out)?
If you are, for instance, using these vectors in a classification task, you can vary these parameters (and of course also the parameters of the classifier) and see which values give you the best performance.
You can do that in sklearn easily with the GridSearchCV and Pipeline objects:
pipeline = Pipeline([
('tfidf', TfidfVectorizer(stop_words=stop_words)),
('clf', OneVsRestClassifier(MultinomialNB(
fit_prior=True, class_prior=None))),
])
parameters = {
'tfidf__max_df': (0.25, 0.5, 0.75),
'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3)],
'clf__estimator__alpha': (1e-2, 1e-3)
}
grid_search_tune = GridSearchCV(pipeline, parameters, cv=2, n_jobs=2, verbose=3)
grid_search_tune.fit(train_x, train_y)
print("Best parameters set:")
print(grid_search_tune.best_estimator_.steps)
How would you merge a scikit-learn classifier that operates over a bag-of-words with one that operates on arbitrary numeric fields?
I know that these are basically the same thing behind-the-scenes, but I'm having trouble figuring out how to do this via the existing library methods. For example, my bag-of-words classifier uses the pipeline:
classifier = Pipeline([
('vectorizer', HashingVectorizer(ngram_range=(1,4), alternate_sign=False)),  # non_negative=True in old releases
('tfidf', TfidfTransformer()),
('clf', OneVsRestClassifier(LinearSVC())),
])
classifier.fit(['some random text','some other text', ...], [CLS_A, CLS_B, ...])
Whereas my other usage is like:
classifier = LinearSVC()
classifier.fit([1.23, 4.23, ...], [CLS_A, CLS_B, ...])
How would I construct a LinearSVC classifier that could be trained using both sets of data simultaneously? e.g.
classifier = ?
classifier.fit([('some random text',1.23),('some other text',4.23), ...], [CLS_A, CLS_B, ...])
The easy way:
import scipy.sparse
tfidf = Pipeline([
('vectorizer', HashingVectorizer(ngram_range=(1,4), alternate_sign=False)),  # non_negative=True in old releases
('tfidf', TfidfTransformer()),
])
X_tfidf = tfidf.fit_transform(texts)
X_other = load_your_other_features()
X = scipy.sparse.hstack([X_tfidf, X_other])
clf = LinearSVC().fit(X, y)
The principled solution, which allows you to keep everything in one Pipeline, would be to wrap hashing, tf-idf and your other feature extraction method in a few simple transformer objects and put these in a FeatureUnion, but it's hard to tell what the code would look like from the information you've given.
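One way the FeatureUnion approach might look, as a sketch only: it assumes each sample is a (text, number) tuple as in the question, and the helper functions get_text and get_numeric, the step names, and the toy data are hypothetical. It uses alternate_sign=False, the replacement for non_negative=True in recent scikit-learn releases.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.svm import LinearSVC


def get_text(samples):
    """Hypothetical helper: pull the text field out of (text, number) tuples."""
    return [text for text, _ in samples]


def get_numeric(samples):
    """Hypothetical helper: pull the numeric field out as an (n_samples, 1) array."""
    return np.array([[value] for _, value in samples], dtype=float)


# Text branch: select the text, hash it, then apply tf-idf weighting.
text_features = Pipeline([
    ("select", FunctionTransformer(get_text, validate=False)),
    ("vectorizer", HashingVectorizer(ngram_range=(1, 4), alternate_sign=False)),
    ("tfidf", TfidfTransformer()),
])
# Numeric branch: select the number and pass it through unchanged.
numeric_features = FunctionTransformer(get_numeric, validate=False)

classifier = Pipeline([
    ("features", FeatureUnion([
        ("text", text_features),
        ("numeric", numeric_features),
    ])),
    ("clf", LinearSVC()),
])

# Toy data, made up for illustration.
samples = [
    ("some random text", 1.23),
    ("some other text", 4.23),
    ("more random words", 0.50),
    ("other words again", 3.90),
]
labels = ["A", "B", "A", "B"]

classifier.fit(samples, labels)
predictions = classifier.predict(samples)
```

Keeping both branches inside one Pipeline means the whole thing can be passed to GridSearchCV or cross-validation as a single estimator. In practice the numeric branch would probably also need scaling so it is comparable to the tf-idf values.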
(P.S. As I keep saying on SO, on the mailing list and elsewhere, OneVsRestClassifier(LinearSVC()) is useless. LinearSVC does OvR out of the box, so this is just a slower way of fitting an OvR SVM.)