I am totally new to the field and currently I am stuck. Here is what I want and what I did:
I have a DataFrame that is split into train and test datasets. The training features are Twitter messages, and the labels are assigned categories. I set up a tokenizer (called clean_text) that keeps only relevant words and strips the messages down to the core information. The model, including a grid search, looks as follows:
def build_model():
    pipeline = Pipeline([
        ('vectorizer', CountVectorizer(tokenizer=clean_text)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(RandomForestClassifier()))
    ])
    # parameters to grid search
    parameters = {'vectorizer__max_features': [50],  # , 72, 144, 288, 576, 1152
                  'clf__estimator__n_estimators': [100]}  # , 100
    # initiating GridSearchCV method
    model = GridSearchCV(pipeline, param_grid=parameters, cv=5)
    return model
The fitting works fine, as well as the evaluation.
Now I am not sure whether the model is set up correctly and whether the features are the most frequent tokens in the messages (in the above case 50), or whether there is an error.
Now the question:
Is there a way to print the 50 features and see if they look right?
Best
Felix
With no sample information, this is my best guess. Please check whether the following works; if you share sample data, we can help you better.
print(vectorizer.vocabulary_)
This should work too; otherwise, please share a sample DataFrame:
model.best_estimator_.named_steps['vectorizer'].get_feature_names()
(Note that model.estimator is only the unfitted template pipeline; best_estimator_ is the refitted winner.)
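Putting it together, a minimal sketch, assuming model is the GridSearchCV returned by build_model() and has already been fit:

# best_estimator_ holds the winning pipeline, refitted on the full training data
fitted_vect = model.best_estimator_.named_steps['vectorizer']
print(fitted_vect.get_feature_names())  # the (up to) 50 tokens kept by max_features
print(fitted_vect.vocabulary_)          # mapping token -> column index
# on newer scikit-learn versions, use get_feature_names_out() instead

If those 50 tokens look like stopwords or noise, the clean_text tokenizer is the first place to look.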
Related
I made a GridSearchCV with a pipeline and I want to extract one attribute (n_iter_) of a component of the pipeline (MLPRegressor) for the best model.
I'm using Python 3.0.
Creation of the pipeline
pipeline_steps = [('scaler', StandardScaler()),
                  ('MLPR', MLPRegressor(solver='lbfgs', early_stopping=True,
                                        validation_fraction=0.1, max_iter=10000))]
MLPR_parameters = {'MLPR__hidden_layer_sizes': [(50,), (100,), (50, 50)],
                   'MLPR__alpha': [0.001, 10, 1000]}
MLPR_pipeline = Pipeline(pipeline_steps)
gridCV_MLPR = GridSearchCV(MLPR_pipeline, MLPR_parameters, cv=kfold)
gridCV_MLPR.fit(X_train, y_train)
When I extract the best parameters with gridCV_MLPR.best_params_, I only get the result for the GridSearchCV:
{'MLPR__alpha': 0.001, 'MLPR__hidden_layer_sizes': (50,)}
But I want the number of iterations that the MLPRegressor in the best model of gridCV_MLPR actually used.
How is it possible to access the n_iter_ attribute of MLPRegressor() through the pipeline with GridSearchCV?
Thanks for your help,
I found the solution:
gridCV_MLPR.best_estimator_.named_steps['MLPR'].n_iter_
As the gridCV_MLPR.best_estimator_ is a pipeline, we need to select the MLPRegressor parameters with .named_steps['MLPR'].
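For completeness, a short sketch (assuming gridCV_MLPR has been fit as above):

# best_estimator_ is the refitted winning Pipeline, so any fitted attribute
# of a step can be reached through named_steps
best_mlpr = gridCV_MLPR.best_estimator_.named_steps['MLPR']
print(best_mlpr.n_iter_)  # iterations the lbfgs solver actually ran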
Thanks a lot for your very, very quick answer ...
I have code here that loops through each label (category) and creates a model for it. However, what I want is a single general model that can accept new inputs from a user and predict on them.
I'm aware that the code below only saves the model fitted for the last category in the loop. How can I fix this so that a model for each category is saved, so that when I load those models, I can predict a label for a new text?
vectorizer = TfidfVectorizer(strip_accents='unicode', stop_words=stop_words,
                             analyzer='word', ngram_range=(1, 3), norm='l2')
vectorizer.fit(train_text)
vectorizer.fit(test_text)
x_train = vectorizer.transform(train_text)
y_train = train.drop(labels=['question_body'], axis=1)
x_test = vectorizer.transform(test_text)
y_test = test.drop(labels=['question_body'], axis=1)

# Using pipeline for applying LinearSVC and one-vs-rest classifier
SVC_pipeline = Pipeline([
    ('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
])
for category in categories:
    print('... Processing {}'.format(category))
    # train the SVC model using X_dtm & y
    SVC_pipeline.fit(x_train, train[category])
    # compute the testing accuracy of SVC
    svc_prediction = SVC_pipeline.predict(x_test)
    print("SVC Prediction:")
    print(svc_prediction)
    print('Test accuracy is {}'.format(f1_score(test[category], svc_prediction)))
    print("\n")

# save the model to disk
filename = 'svc_model.sav'
pickle.dump(SVC_pipeline, open(filename, 'wb'))
There are multiple mistakes in your code.
You are fitting your TfidfVectorizer on both train and test:
vectorizer.fit(train_text)
vectorizer.fit(test_text)
This is wrong. Calling fit() is not incremental: the most recent call discards everything learned by earlier calls, so your vectorizer ends up fitted on the test data only. And you should never fit (learn) anything on test data in the first place.
What you need to do is this:
vectorizer.fit(train_text)
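and then reuse that single fitted vocabulary on both sets:

x_train = vectorizer.transform(train_text)
x_test = vectorizer.transform(test_text)  # transform only, never fit, on test data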
The pipeline does not work the way you think:
# Using pipeline for applying linearSVC and one vs rest classifier
SVC_pipeline = Pipeline([
('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
])
Notice that you are already passing LinearSVC inside the OneVsRestClassifier, so it will be used automatically; the Pipeline adds nothing here. A Pipeline is useful when you want to pass your data through multiple steps sequentially, something like this:
pipe = Pipeline([
    ('pca', pca),
    ('logistic', LogisticRegression())
])
The above pipe passes the data to PCA, which transforms it; that transformed data is then passed on to LogisticRegression, and so on.
Correct usage of pipeline in your case can be:
SVC_pipeline = Pipeline([
    ('vectorizer', vectorizer),
    ('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
])
See more examples here:
https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#examples-using-sklearn-pipeline-pipeline
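As a rough sketch of how the corrected pipeline could then be trained and saved once for all categories (assuming y_train from your code is a frame with one binary column per category):

# raw text goes in; the vectorizer step handles the transformation
SVC_pipeline.fit(train_text, y_train)
pickle.dump(SVC_pipeline, open('svc_model.sav', 'wb'))

# later, for a new user input:
loaded = pickle.load(open('svc_model.sav', 'rb'))
print(loaded.predict(["some new question text"]))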
You also need to describe your "categories" more and show some examples of your data. You are not using y_train and y_test anywhere. Are the categories different from "question_body"?
I'm not able to do something and I would like to know whether it's a bug or expected behaviour.
I'm trying to run a nested cross-validation on a dataset in which each sample belongs to a patient. To avoid training and testing on the same patient, I've seen that a "group" mechanism is implemented, and GroupKFold seems to be the right one in my case.
As my classifier takes several hyperparameters, I use GridSearchCV to tune them. In the same way, I assume the inner training/testing splits also have to keep patients separate.
(For those interested in nested cross-validation: http://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html)
I proceed this way:
pipe = Pipeline([('pca', PCA()),
                 ('clf', SVC())])

# Find the best parameters for both the feature extraction and the classifier
grid_search = GridSearchCV(estimator=pipe, param_grid=some_param,
                           cv=GroupKFold(n_splits=5), verbose=1)
grid_search.fit(X=features, y=labels, groups=groups)

# Nested CV with parameter optimization
predictions = cross_val_predict(grid_search, X=features, y=labels,
                                cv=GroupKFold(n_splits=5), groups=groups)
And I get:
File "_split.py", line 489, in _iter_test_indices
    raise ValueError("The 'groups' parameter should not be None.")
ValueError: The 'groups' parameter should not be None.
Looking at the source, it appears that groups is not forwarded by the _fit_and_predict() method to the inner estimator, so the groups needed by the inner GroupKFold cannot be used.
Can I have some clues on this?
Have a nice day,
Best regards
I had the same problem and I couldn't find any way around it other than implementing the nested loop in a more hands-on fashion:
outer_cv = GroupKFold(n_splits=4).split(X_data, y_data, groups=groups)

nested_cv_scores = []
for train_ids, test_ids in outer_cv:
    inner_cv = GroupKFold(n_splits=4).split(X_data[train_ids, :], y_data.iloc[train_ids],
                                            groups=groups[train_ids])
    rf = RandomForestClassifier()
    rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid,
                                   n_iter=100, cv=inner_cv, verbose=2, random_state=42,
                                   n_jobs=-1, scoring=my_squared_score)
    # Fit the random search model on the outer training fold only
    rf_random.fit(X_data[train_ids, :], y_data.iloc[train_ids])
    print(rf_random.best_params_)
    nested_cv_scores.append(rf_random.score(X_data[test_ids, :], y_data.iloc[test_ids]))

print("Nested cv score - meta learning: " + str(np.mean(nested_cv_scores)))
I hope this helps.
Best regards,
Felix
I have the following code which works as expected:
clf = Pipeline([
    ('vectorizer', DictVectorizer(sparse=False)),
    ('classifier', DecisionTreeClassifier(criterion='entropy'))
])
clf.fit(X[:size], y[:size])
score = clf.score(X_test, y_test)
I wanted to do the same logic without using Pipeline:
v = DictVectorizer(sparse=False)
Xdv = v.fit_transform(X[:size])
Xdv_test = v.fit_transform(X_test)
clf = DecisionTreeClassifier(criterion='entropy')
clf.fit(Xdv[:size], y[:size])
clf.score(Xdv_test, y_test)
But I receive the following error:
ValueError: Number of features of the model must match the input. Model n_features is 8251 and input n_features is 14303
It seems that DictVectorizer learns more features on the test set than on the training set. I want to know how Pipeline handles this issue and how I can accomplish the same.
Don't call fit_transform() again.
Do this:
Xdv_test = v.transform(X_test)
When you call fit() or fit_transform(), the DictVectorizer forgets the features learnt during the previous call (on the training data) and refits from scratch, hence the different number of features.
Pipeline will automatically handle the test data appropriately when you do clf.score(X_test, y_test) on the pipeline.
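Putting it together, the non-pipeline version of your code becomes (a sketch using the variable names from your question):

v = DictVectorizer(sparse=False)
Xdv = v.fit_transform(X[:size])  # learn the feature mapping on training data only
Xdv_test = v.transform(X_test)   # reuse that mapping, so the column count matches
clf = DecisionTreeClassifier(criterion='entropy')
clf.fit(Xdv, y[:size])
score = clf.score(Xdv_test, y_test)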
:) Very sorry in advance if my code looks like something a total newbie would write. Below is a portion of my Python code; I am fiddling with sklearn and machine-learning techniques.
I trained several Naive Bayes models on different datasets and stored them in trained_models.
Prior to this step, I created an object chi_squared of the SelectPercentile class, using the chi2 function for feature selection. From my understanding, I should write data_feature_reduced = chi_squared.transform(some_data) and then use data_feature_reduced at training time, i.e. nb.fit(data_feature_reduced, data.target).
This is what I did, and I stored the resulting nb objects (and some other information) in the list trained_models.
I am now attempting to apply these models to a different set of data (actually from the same source, if that matters to the question):
for name, model, intra_result, dev, training_data, chi_squarer in trained_models:
    cross_results = []
    new_vect = StemmedVectorizer(ngram_range=(1, 4), stop_words='english',
                                 max_df=0.90, min_df=2)
    for data in demframes:
        data_name = data[0]
        X_test_data = new_vect.fit_transform(data[1].values.astype('U'))
        Y_test_data = data[2]
        chi_squared_test_data = chi_squarer.transform(X_test_data)
        final_results.append((name, "applied to", data[0],
                              model.score(X_test_data, Y_test_data)))
I have to admit that I am a bit of a stranger to the feature-selection part.
Here is the error that I get:
ValueError: X has a different shape than during fitting.
at the line chi_squared_test_data = chi_squarer.transform(X_test_data).
I am assuming I am doing feature selection incorrectly. Where did I go wrong?
Thanks to everyone for their help!
I will just paste the comment from @Vivek-Kumar that helped me solve my problem:
"This error is due to this line: new_vect.fit_transform(). Like your trained models, you should use the same StemmedVectorizer which was used at training time."
The same StemmedVectorizer object will transform X_test_data to the same shape that it had during training. Currently, you are creating a different object and fitting on it (fit_transform is fit plus transform), hence the different shape, and hence the error.
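In code, that means keeping the vectorizer that was fitted at training time and only calling transform() on new data. A sketch, where train_vect is a hypothetical name for that training-time StemmedVectorizer (your tuples don't currently store it, so you would need to add it to trained_models):

# train_vect: the StemmedVectorizer fitted during training (hypothetical name)
X_test_data = train_vect.transform(data[1].values.astype('U'))  # transform, never fit_transform
chi_squared_test_data = chi_squarer.transform(X_test_data)      # shapes now match the fit
final_results.append((name, "applied to", data[0],
                      model.score(chi_squared_test_data, Y_test_data)))

Note that the score is computed on the chi-squared-reduced data here, matching what the model saw during training.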
Why not use a pipeline to make it simple? That way you don't have to transform twice and take care of the shapes.
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

chi_squarer = SelectKBest(chi2, k=100)  # change k accordingly
lr = LogisticRegression()  # or naive bayes
clf = Pipeline([('chi_sq', chi_squarer), ('model', lr)])

# for training:
clf.fit(training_data, targets)

# for predictions:
clf.predict(test_data)
You can also add the vectorizer (new_vect) as the first step of the pipeline, so raw text goes in and everything is fitted consistently.
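A sketch of that (assuming StemmedVectorizer is your custom vectorizer class from above; raw_train_texts and raw_test_texts are placeholder names):

clf = Pipeline([
    ('vect', StemmedVectorizer(ngram_range=(1, 4), stop_words='english',
                               max_df=0.90, min_df=2)),
    ('chi_sq', SelectKBest(chi2, k=100)),
    ('model', LogisticRegression()),
])
clf.fit(raw_train_texts, targets)  # every step is fitted on training data only
clf.predict(raw_test_texts)        # every step only transforms at predict time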