I have created a pipeline using sklearn so that multiple models can go through it. Since there is a vectorization step before the model is fit, I wonder whether this vectorization is repeated every time a model is fit. If so, maybe I should take this preprocessing out of the pipeline.
log_reg = LogisticRegression()
rand_for = RandomForestClassifier()
lin_svc = LinearSVC()
svc = SVC()
# The pipeline contains both vectorization model and classifier
pipe = Pipeline(
    [
        ('vect', tfidf),
        ('classifier', log_reg)
    ]
)
# params dictionary example
params_log_reg = {
    'classifier__penalty': ['l2'],
    'classifier__C': [0.01, 0.1, 1.0, 10.0, 100.0],
    'classifier__class_weight': ['balanced', class_weights],
    'classifier__solver': ['lbfgs', 'newton-cg'],
    # 'classifier__verbose': [2],
    'classifier': [log_reg]
}
params = [params_log_reg, params_rand_for, params_lin_svc, params_svc] # param dictionaries for each model
# Grid search to combine it all
grid = GridSearchCV(
    pipe,
    params,
    cv=skf,
    scoring='f1_weighted')
grid.fit(features_train, labels_train[:,0])
When you run GridSearchCV, the whole pipeline is refit for every combination of hyperparameters (and for every cross-validation fold). So yes, the vectorization is repeated every time the pipeline is fit.
Have a look at the scikit-learn user guide section "Pipelines and composite estimators".
To quote:
Fitting transformers may be computationally expensive. With its memory
parameter set, Pipeline will cache each transformer after calling fit.
This feature is used to avoid computing the fit transformers within a
pipeline if the parameters and input data are identical. A typical
example is the case of a grid search in which the transformers can be
fitted only once and reused for each configuration.
So you can use the memory parameter to cache the transformers.
from tempfile import mkdtemp
cachedir = mkdtemp()
pipe = Pipeline(estimators, memory=cachedir)
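As a rough, self-contained sketch of how caching fits into a grid search: the toy corpus, the labels and the small C grid below are made up for illustration and are not the asker's data. The temporary cache directory is removed once the search is done.

```python
from shutil import rmtree
from tempfile import mkdtemp

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data, invented for this example only.
texts = ["good movie", "bad movie", "great film", "awful film",
         "good film", "bad plot", "great plot", "awful movie"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

cachedir = mkdtemp()
try:
    # memory=cachedir lets the pipeline reuse a fitted vectorizer whenever
    # its parameters and its input data are identical.
    pipe = Pipeline(
        [("vect", TfidfVectorizer()), ("classifier", LogisticRegression())],
        memory=cachedir,
    )
    grid = GridSearchCV(pipe, {"classifier__C": [0.1, 1.0, 10.0]}, cv=2)
    grid.fit(texts, labels)
    best_C = grid.best_params_["classifier__C"]
finally:
    # Clean up the on-disk cache.
    rmtree(cachedir)
```

Only the classifier's C varies here, so within each fold the vectorizer sees the same parameters and the same training data for all three candidates and can be loaded from the cache instead of being refit.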
I'm using XGBRegressor with a Pipeline. The Pipeline contains the preprocessing steps and the model (XGBRegressor).
Below are the complete preprocessing steps (I have already defined numeric_cols and cat_cols):
numerical_transfer = SimpleImputer()
cat_transfer = Pipeline(steps = [
    ('imputer', SimpleImputer(strategy = 'most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown = 'ignore'))
])
preprocessor = ColumnTransformer(
    transformers = [
        ('num', numerical_transfer, numeric_cols),
        ('cat', cat_transfer, cat_cols)
    ])
And the final pipeline is
my_model = Pipeline(steps = [('preprocessor', preprocessor), ('model', model)])
Fitting without early_stopping_rounds works fine:
my_model.fit(X_train, y_train)
But when I use early_stopping_rounds as shown below, I get an error:
my_model.fit(X_train, y_train, model__early_stopping_rounds=5, model__eval_metric = "mae", model__eval_set=[(X_valid, y_valid)])
The error points at model__eval_set=[(X_valid, y_valid)], and the message is:
ValueError: DataFrame.dtypes for data must be int, float or bool.
Did not expect the data types in fields MSZoning, Street, Alley, LotShape, LandContour, Utilities, LotConfig, LandSlope, Condition1, Condition2, BldgType, HouseStyle, RoofStyle, RoofMatl, MasVnrType, ExterQual, ExterCond, Foundation, BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinType2, Heating, HeatingQC, CentralAir, Electrical, KitchenQual, Functional, FireplaceQu, GarageType, GarageFinish, GarageQual, GarageCond, PavedDrive, PoolQC, Fence, MiscFeature, SaleType, SaleCondition
Does this mean that I should preprocess X_valid before passing it to my_model.fit(), or have I done something wrong?
If the problem is that X_valid needs to be preprocessed before calling fit(), how do I do that with the preprocessor I defined above?
Edit: I tried to preprocess X_valid without the Pipeline, but I got an error saying feature mismatch.
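The problem is that the pipeline does not apply its preprocessing steps to eval_set: fit parameters such as model__eval_set are passed straight through to the final estimator. So, as you said, you need to preprocess X_valid yourself. The easiest way is to use your pipeline without the 'model' step. Run the following code before fitting your pipeline: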
# Make a copy to avoid changing original data
X_valid_eval=X_valid.copy()
# Remove the model from pipeline
eval_set_pipe = Pipeline(steps = [('preprocessor', preprocessor)])
# Fit the preprocessor on the training data, then transform the validation copy
X_valid_eval = eval_set_pipe.fit(X_train, y_train).transform(X_valid_eval)
Then fit your pipeline after changing model__eval_set as follows:
my_model.fit(X_train, y_train, model__early_stopping_rounds=5, model__eval_metric = "mae", model__eval_set=[(X_valid_eval, y_valid)])
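To see why the preprocessor has to be fit on the training data, here is a small sketch with made-up data. The column names "area" and "zone" and their values are assumptions for illustration, not the asker's dataset. The validation set contains a category that never occurs in training, which is one plausible way to end up with a feature mismatch if a separate preprocessor is fit on X_valid.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frames standing in for X_train and X_valid.
X_train = pd.DataFrame({"area": [50.0, None, 80.0, 120.0],
                        "zone": ["RL", "RM", "FV", "RL"]})
X_valid = pd.DataFrame({"area": [65.0, 90.0],
                        "zone": ["RL", "C"]})  # "C" is unseen in training

cat_transfer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer(transformers=[
    ("num", SimpleImputer(), ["area"]),
    ("cat", cat_transfer, ["zone"]),
])

# Fit on the training data only, then reuse the fitted preprocessor
# for the validation data so both share the same output columns.
X_train_t = preprocessor.fit_transform(X_train)
X_valid_t = preprocessor.transform(X_valid)

n_train_cols = X_train_t.shape[1]
n_valid_cols = X_valid_t.shape[1]
```

Both transformed sets end up with the same number of columns, because the one-hot categories are learned from X_train and the unseen category is ignored thanks to handle_unknown="ignore".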
I'm running a bunch of models with scikit-learn to solve a classification problem.
Here is the code that should do all the running:
for model_name, classifier, param_grid, cv, cv_name in tqdm(zip(model_names, classifiers, param_grids, cvs, cv_names)):
    pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                               ('classifier', classifier)])
    train_and_score_model(model_name, pipeline, param_grid, cv=cv)
My question is, how can I retain the output of my train_and_score_model function? It returns a cv object, i.e. a model.
What I tried to do, but I don't think is right, is create a list cv_names = ['dm_cv', 'lr_cv', 'knn_cv', 'svm_cv', 'dt_cv', 'rf_cv', 'nb_cv'] and set each one as the for loop runs. That is the cv_name iterator in the for loop head.
I don't think that's right though, because wouldn't I be setting a string, instead of a variable? As in, what I should really have is cv_names = [dm_cv, lr_cv, knn_cv, svm_cv, dt_cv, rf_cv, nb_cv], but I don't think I can have a list like that.
Another way I thought of is saving each model in a dictionary, where the keys would be the elements of the list I outlined above. I don't know if I can have a model as a dictionary value though.
Here is the clunky, repetitive code I currently run to do what I want in the for-loop:
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
('classifier', classifier_dm)])
dm_cv = train_and_score_model('Dummy Model', pipeline, param_grid_dm)
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
('classifier', classifier_lr)])
lr_cv = train_and_score_model('Logistic Regression', pipeline, param_grid_lr)
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
('classifier', classifier_knn)])
knn_cv = train_and_score_model('K Nearest Neighbors', pipeline, param_grid_knn)
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
('classifier', classifier_svm)])
svm_cv = train_and_score_model('Support Vector Machine', pipeline, param_grid_svm)
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
('classifier', classifier_dt)])
dt_cv = train_and_score_model('Decision Tree', pipeline, param_grid_dt)
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
('classifier', classifier_rf)])
rf_cv = train_and_score_model('Random Forest', pipeline, param_grid_rf)
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
('classifier', classifier_nb)])
nb_cv = train_and_score_model('Naive Bayes', pipeline, param_grid_nb)
You can create a dictionary that maps each classifier name to its information, i.e. the classifier object and parameter grid:
models_list = {'Logistic Regression': (classifier_lr, param_grid_lr),
'K Nearest Neighbours': (classifier_knn, param_grid_knn)}
Iterate through every key-value pair in the dictionary and build your pipelines:
model_cvs = {}
for model_name, model_info in models_list.items():
    pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                               ('classifier', model_info[0])])
    model_cvs[model_name] = train_and_score_model(model_name, pipeline, model_info[1])
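A dictionary value can be any Python object, including a fitted model. The runnable sketch below illustrates this under some assumptions: train_and_score_model here is a hypothetical stand-in that simply returns a fitted GridSearchCV, the data is synthetic, and StandardScaler plays the role of the preprocessor.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only.
X, y = make_classification(n_samples=60, n_features=5, random_state=0)


def train_and_score_model(model_name, pipeline, param_grid, cv=3):
    # Hypothetical stand-in for the asker's function:
    # fits a grid search and returns the fitted search object.
    search = GridSearchCV(pipeline, param_grid, cv=cv)
    search.fit(X, y)
    return search


models = {
    "Dummy Model": (DummyClassifier(),
                    {"classifier__strategy": ["most_frequent"]}),
    "Logistic Regression": (LogisticRegression(),
                            {"classifier__C": [0.1, 1.0]}),
}

model_cvs = {}
for model_name, (classifier, param_grid) in models.items():
    pipeline = Pipeline(steps=[("preprocessor", StandardScaler()),
                               ("classifier", classifier)])
    # Store each fitted search under its model name.
    model_cvs[model_name] = train_and_score_model(model_name, pipeline, param_grid)

# Look up a fitted search by name later on.
lr_cv = model_cvs["Logistic Regression"]
```

Afterwards model_cvs["Logistic Regression"] takes the place of a separate lr_cv variable, and adding another model only requires one more dictionary entry.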
Here is my code, which loops through each label (category) and fits a model for it. However, what I want is a general model that can make predictions for new inputs from a user.
I'm aware that the code below only saves the model fit for the last category in the loop. How can I fix this so that a model for each category is saved and, when I load those models, I can predict a label for a new text?
vectorizer = TfidfVectorizer(strip_accents='unicode',
stop_words=stop_words, analyzer='word', ngram_range=(1,3), norm='l2')
vectorizer.fit(train_text)
vectorizer.fit(test_text)
x_train = vectorizer.transform(train_text)
y_train = train.drop(labels = ['question_body'], axis=1)
x_test = vectorizer.transform(test_text)
y_test = test.drop(labels = ['question_body'], axis=1)
# Using pipeline for applying linearSVC and one vs rest classifier
SVC_pipeline = Pipeline([
('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
])
for category in categories:
    print('... Processing {}'.format(category))
    # train the SVC model using X_dtm & y
    SVC_pipeline.fit(x_train, train[category])
    # compute the testing accuracy of SVC
    svc_prediction = SVC_pipeline.predict(x_test)
    print("SVC Prediction:")
    print(svc_prediction)
    print('Test accuracy is {}'.format(f1_score(test[category], svc_prediction)))
    print("\n")
#save the model to disk
filename = 'svc_model.sav'
pickle.dump(SVC_pipeline, open(filename, 'wb'))
There are multiple mistakes in your code.
You are fitting your TfidfVectorizer on both train and test:
vectorizer.fit(train_text)
vectorizer.fit(test_text)
This is wrong. Calling fit() is not incremental: calling it twice does not learn from both datasets. The most recent call to fit() discards everything learned in earlier calls. You should also never fit (learn) anything on the test data.
What you need to do is this:
vectorizer.fit(train_text)
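A quick way to check this is to compare the learned vocabulary after each call. The two tiny corpora below are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpora, invented for this example.
train_text = ["apples and oranges", "oranges and pears"]
test_text = ["bananas and kiwis"]

vectorizer = TfidfVectorizer()

vectorizer.fit(train_text)
vocab_after_train = set(vectorizer.vocabulary_)

# A second fit() starts from scratch and replaces the vocabulary.
vectorizer.fit(test_text)
vocab_after_test = set(vectorizer.vocabulary_)

print(sorted(vocab_after_train))
print(sorted(vocab_after_test))
```

After the second call, the words that only appear in train_text are gone from the vocabulary, so a vectorizer fit this way would describe the test texts rather than the training texts.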
The pipeline does not work the way you think:
# Using pipeline for applying linearSVC and one vs rest classifier
SVC_pipeline = Pipeline([
('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
])
Notice that you are passing LinearSVC inside the OneVsRestClassifier, so it will be used automatically without any need for a Pipeline. A Pipeline with a single step does not do anything here. A Pipeline is useful when you want to pass your data sequentially through multiple steps (transformers followed by a final estimator). Something like this:
pipe = Pipeline([
('pca', pca),
('logistic', LogisticRegression())
])
The above pipe passes the data to PCA, which transforms it. The transformed data is then passed to LogisticRegression, and so on.
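As a minimal sketch of that data flow, the example below fits such a pipeline on the iris dataset (chosen here only for illustration) and checks that predicting through the pipeline matches applying the two fitted steps by hand. The values n_components=2 and max_iter=1000 are arbitrary choices.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    ("pca", PCA(n_components=2)),
    ("logistic", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)

# Apply the fitted steps manually: transform with PCA, then predict.
X_reduced = pipe.named_steps["pca"].transform(X)
manual_pred = pipe.named_steps["logistic"].predict(X_reduced)

# The pipeline does the same two steps internally.
same = bool((manual_pred == pipe.predict(X)).all())
```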
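Correct usage of a pipeline in your case could be:
SVC_pipeline = Pipeline([
    ('vectorizer', vectorizer),
    ('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
])
With the vectorizer inside the pipeline, fit and predict on the raw text (train_text, test_text) rather than on the already transformed x_train and x_test.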
See more examples here:
https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#examples-using-sklearn-pipeline-pipeline
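For the original question of keeping one model per category, one possible approach (a sketch, not the only way) is to fit a separate pipeline for each category, collect them in a dictionary, and pickle the whole dictionary once. The texts, the category names "python" and "ml", and the file name svc_models.sav are all made up for illustration.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy data: one binary label column per category (hypothetical names).
train_text = ["how do I sort a list", "what is a neural network",
              "sort a dict by value", "train a neural network faster"]
train_labels = {"python": [1, 0, 1, 0], "ml": [0, 1, 0, 1]}

# Fit one pipeline per category and keep them all in a dictionary.
models = {}
for category, y in train_labels.items():
    pipe = Pipeline([
        ("vectorizer", TfidfVectorizer()),
        ("clf", OneVsRestClassifier(LinearSVC(), n_jobs=1)),
    ])
    models[category] = pipe.fit(train_text, y)

with tempfile.TemporaryDirectory() as tmpdir:
    filename = os.path.join(tmpdir, "svc_models.sav")
    # Save every per-category model in a single file.
    with open(filename, "wb") as f:
        pickle.dump(models, f)
    # Load them back, as a prediction script would.
    with open(filename, "rb") as f:
        loaded = pickle.load(f)

# Predict every category for a new piece of raw text.
new_text = ["sort a list of tuples"]
predicted = {category: int(model.predict(new_text)[0])
             for category, model in loaded.items()}
```

Because each pipeline contains its own vectorizer, the loaded models accept raw text directly, and the save happens once after the loop rather than overwriting one file per iteration.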
You need to describe your "categories" in more detail. Show some examples of your data. You are not using y_train and y_test anywhere. Are the categories different from "question_body"?
Currently I have text data and I am trying to predict a class. In my case I have 60 classes to choose from. When I train a random forest model with scikit-learn, I get an F1 score of 78%.
However, when I try to set up the same model in PySpark, I only get 30%. WAY TOO LOW! What is going on? Maybe I am not setting it up right. Also, with PySpark, the random forest is only able to predict up to 12 labels, whereas in my case I have 60.
scikit-learn code:
rf_model = Pipeline([
    ('featextract', FeatureExtractor()),
    ('union', FeatureUnion(
        transformer_list=[
            # pipeline for tfidf
            ('text', Pipeline([
                ('selector', ItemSelector(key='TEXT')),
                ('count_vec', TfidfVectorizer(max_features=5000)),
                ('tfidf', TfidfTransformer())])),
            # pipeline for ata
            ('ata', Pipeline([
                ('selector', ItemSelector(key="ATA_SYS_NO")),
                ('atas', convert2dict()),
                ('vect', DictVectorizer())]))
        ])),
    ('model', OneVsRestClassifier(RandomForestClassifier(n_estimators=200, n_jobs=5))),
])
PySpark code:
Tokenizer1 = Tokenizer(inputCol="TEXT",outputCol="words")
hashingTF = HashingTF(inputCol="words",outputCol="rawFeatures",numFeatures=4000)
idf = IDF(inputCol="rawFeatures",outputCol="tfidffeatures")
rf = RF(labelCol="componentIndex",featuresCol='tfidffeatures',numTrees=500)
pipeline = Pipeline(stages=[Tokenizer1,hashingTF,idf,labelIndexer,rf])
(trainingData,testData) = df.randomSplit([0.8,0.2])