I have code that loops through each label (category) and creates a model for each one. However, what I want is a general setup that can accept new inputs from a user and predict their labels.
I'm aware that the code below only saves the model fit for the last category in the loop. How can I fix this so that a model for each category is saved, and so that when I load those models I can predict a label for a new text?
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

vectorizer = TfidfVectorizer(strip_accents='unicode', stop_words=stop_words,
                             analyzer='word', ngram_range=(1, 3), norm='l2')
vectorizer.fit(train_text)
vectorizer.fit(test_text)

x_train = vectorizer.transform(train_text)
y_train = train.drop(labels=['question_body'], axis=1)
x_test = vectorizer.transform(test_text)
y_test = test.drop(labels=['question_body'], axis=1)

# Using pipeline for applying LinearSVC and one-vs-rest classifier
SVC_pipeline = Pipeline([
    ('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
])
for category in categories:
    print('... Processing {}'.format(category))
    # train the SVC model using X_dtm & y
    SVC_pipeline.fit(x_train, train[category])
    # compute the testing accuracy of SVC
    svc_prediction = SVC_pipeline.predict(x_test)
    print("SVC Prediction:")
    print(svc_prediction)
    print('Test accuracy is {}'.format(f1_score(test[category], svc_prediction)))
    print("\n")

# save the model to disk
filename = 'svc_model.sav'
pickle.dump(SVC_pipeline, open(filename, 'wb'))
There are multiple mistakes in your code.
You are fitting your TfidfVectorizer on both train and test:
vectorizer.fit(train_text)
vectorizer.fit(test_text)
This is wrong. fit() is not incremental: a second call discards everything learnt in the first, so your vectorizer ends up fitted only on test_text. Besides, you should never fit (learn) anything on test data.
What you need to do is this:
vectorizer.fit(train_text)
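Then transform both sets with that single fitted vectorizer, so train and test share the same vocabulary (a short sketch reusing the variables from your code):

x_train = vectorizer.transform(train_text)   # uses the vocabulary learnt from train_text
x_test = vectorizer.transform(test_text)     # same vocabulary, so the feature columns line up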
The pipeline does not work the way you think:
# Using pipeline for applying LinearSVC and one-vs-rest classifier
SVC_pipeline = Pipeline([
    ('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
])
Notice that you are already passing LinearSVC inside the OneVsRestClassifier, so it is used automatically without needing a Pipeline; with a single step, the Pipeline does nothing here. A Pipeline is useful when you want to pass your data through multiple estimators sequentially. Something like this:
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ('pca', PCA()),
    ('logistic', LogisticRegression())
])
The above pipe passes the data to PCA, which transforms it; the transformed data is then passed on to LogisticRegression, and so on.
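For example, fitting and predicting with this pipe runs both steps in sequence (a minimal sketch; X_train, y_train and X_test are placeholder arrays):

pipe.fit(X_train, y_train)   # fits PCA on X_train, then LogisticRegression on the PCA output
pipe.predict(X_test)         # transforms X_test with the fitted PCA before predicting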
Correct usage of pipeline in your case can be:
SVC_pipeline = Pipeline([
    ('vectorizer', vectorizer),
    ('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
])
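With the vectorizer inside the pipeline, you can fit on the raw text directly, and saving one model per category becomes straightforward. A rough sketch of the loop (the per-category filenames are only illustrative):

for category in categories:
    # the pipeline vectorizes train_text internally before fitting the classifier
    SVC_pipeline.fit(train_text, train[category])
    # pickle the fitted pipeline for this category under its own filename
    with open('svc_model_{}.sav'.format(category), 'wb') as f:
        pickle.dump(SVC_pipeline, f)

# later: load one category's model and predict a label for new user input
with open('svc_model_{}.sav'.format(category), 'rb') as f:
    model = pickle.load(f)
print(model.predict(['some new question text']))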
See more examples here:
https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#examples-using-sklearn-pipeline-pipeline
You need to describe your "categories" in more detail and show some examples of your data. You are not using y_train and y_test anywhere. Are the categories different from "question_body"?
I'm using XGBRegressor with a Pipeline. The Pipeline contains preprocessing steps and the model (XGBRegressor).
Below are the complete preprocessing steps. (I have already defined numeric_cols and cat_cols.)
numerical_transfer = SimpleImputer()

cat_transfer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(transformers=[
    ('num', numerical_transfer, numeric_cols),
    ('cat', cat_transfer, cat_cols)
])
And the final pipeline is
my_model = Pipeline(steps = [('preprocessor', preprocessor), ('model', model)])
When I fit without early_stopping_rounds, the code works fine:
my_model.fit(X_train, y_train)
But when I use early_stopping_rounds as shown below, I get an error:
my_model.fit(X_train, y_train, model__early_stopping_rounds=5,
             model__eval_metric="mae", model__eval_set=[(X_valid, y_valid)])
The error occurs at model__eval_set=[(X_valid, y_valid)], and it reads:
ValueError: DataFrame.dtypes for data must be int, float or bool.
Did not expect the data types in fields MSZoning, Street, Alley, LotShape, LandContour, Utilities, LotConfig, LandSlope, Condition1, Condition2, BldgType, HouseStyle, RoofStyle, RoofMatl, MasVnrType, ExterQual, ExterCond, Foundation, BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinType2, Heating, HeatingQC, CentralAir, Electrical, KitchenQual, Functional, FireplaceQu, GarageType, GarageFinish, GarageQual, GarageCond, PavedDrive, PoolQC, Fence, MiscFeature, SaleType, SaleCondition
Does this mean I should preprocess X_valid before passing it to my_model.fit(), or have I done something wrong?
If the problem is that X_valid needs preprocessing before fit(), how do I do that with the preprocessor I have defined above?
Edit: I tried to preprocess X_valid without the Pipeline, but I got an error about a feature mismatch.
The problem is that the pipeline does not apply its preprocessing to eval_set. So, as you said, you need to preprocess X_valid yourself. The easiest way is to use your pipeline without the 'model' step. Use the following code before fitting your pipeline:
# Make a copy to avoid changing the original data
X_valid_eval = X_valid.copy()

# Remove the model step from the pipeline
eval_set_pipe = Pipeline(steps=[('preprocessor', preprocessor)])

# Fit on the training data, then transform the copy of X_valid
X_valid_eval = eval_set_pipe.fit(X_train, y_train).transform(X_valid_eval)
Then fit your pipeline after changing model__eval_set as follows:
my_model.fit(X_train, y_train, model__early_stopping_rounds=5,
             model__eval_metric="mae", model__eval_set=[(X_valid_eval, y_valid)])
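After fitting this way, note that prediction still takes raw (unprocessed) data, because the 'preprocessor' step inside my_model transforms it internally, e.g.:

preds = my_model.predict(X_valid)   # X_valid is raw here; the pipeline preprocesses it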
I have the following code which works as expected:
clf = Pipeline([
    ('vectorizer', DictVectorizer(sparse=False)),
    ('classifier', DecisionTreeClassifier(criterion='entropy'))
])
clf.fit(X[:size], y[:size])
score = clf.score(X_test, y_test)
I wanted to do the same logic without using Pipeline:
v = DictVectorizer(sparse=False)
Xdv = v.fit_transform(X[:size])
Xdv_test = v.fit_transform(X_test)
clf = DecisionTreeClassifier(criterion='entropy')
clf.fit(Xdv[:size], y[:size])
clf.score(Xdv_test, y_test)
But I receive the following error:
ValueError: Number of features of the model must match the input. Model n_features is 8251 and input n_features is 14303
It seems that DictVectorizer learns more features from the test set than from the training set. I want to know how Pipeline handles this issue and how I can accomplish the same.
Don't call fit_transform again.
Do this:
Xdv_test = v.transform(X_test)
When you call fit() or fit_transform(), the DictVectorizer forgets the features learnt during the previous call (on the training data) and re-fits from scratch, hence the different number of features.
Pipeline will automatically handle the test data appropriately when you do clf.score(X_test, y_test) on the pipeline.
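A quick toy illustration of the difference (made-up dicts, just to show the shapes):

v = DictVectorizer(sparse=False)
X_train_toy = [{'fruit': 'apple'}, {'fruit': 'pear'}]
X_test_toy = [{'fruit': 'apple'}, {'fruit': 'kiwi'}]   # 'kiwi' was never seen in training

print(v.fit_transform(X_train_toy).shape)   # (2, 2) -- two learnt features
print(v.transform(X_test_toy).shape)        # (2, 2) -- same width; unseen values become zeros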
I have been working on a machine learning model and I'm currently using a Pipeline with GridSearchCV. My data is scaled with MinMaxScaler and I'm using an SVR with an RBF kernel. My question: now that my model is complete, fitted, and has a decent evaluation score, do I also need to scale new data for predictions with MinMaxScaler, or can I just make predictions with the data as is? I've read three books on scikit-learn, but they all focus on feature engineering and fitting; they don't really cover any steps in the prediction stage other than calling the predict method.
This is the code:
pipe = Pipeline([('scaler', MinMaxScaler()), ('clf', SVR())])
time_split = TimeSeriesSplit(n_splits=5)
param_grid = {'clf__kernel': ['rbf'],
              'clf__C': [0.0001, 0.001],
              'clf__gamma': [0.0001, 0.001]}
grid = GridSearchCV(pipe, param_grid, cv=time_split,
                    scoring='neg_mean_squared_error', n_jobs=-1)
grid.fit(X_train, y_train)
Sure, if you get new (in the sense of unprocessed) data, you need to apply the same preparation steps you applied when training the model. For example, if you use MinMaxScaler with its default settings, the model is used to seeing every feature scaled to the [0, 1] range; if you don't preprocess new data the same way, the model can't produce accurate results.
Keep in mind that you must use exactly the same MinMaxScaler object that was fitted on the training data. So if you save your model to a file, save your preprocessing objects as well.
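If your scaler lives outside the pipeline, a minimal sketch of that advice (the variable names here are placeholders; adapt them to your setup):

import pickle

# persist the fitted scaler together with the fitted model
with open('model_and_scaler.pkl', 'wb') as f:
    pickle.dump({'scaler': scaler, 'model': model}, f)

# later: reload both and apply the same scaling before predicting
with open('model_and_scaler.pkl', 'rb') as f:
    saved = pickle.load(f)
preds = saved['model'].predict(saved['scaler'].transform(new_data))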
I wanted to follow up my question with a solution, thanks to pythonic833's answer. I think the proper procedure to scale new data for prediction, if you used a pipeline, is to redo the whole scaling process from beginning to end with the original training data that was used in the pipeline. Even though the pipeline did the scaling for you during training, scaling the training data manually gives you a fitted MinMaxScaler object that can scale new data for prediction. Below is my code, based on pythonic833's answer and some of the other comments, such as saving the model with pickle.
import pickle
from sklearn.preprocessing import MinMaxScaler

pipe = Pipeline([('scaler', MinMaxScaler()), ('clf', SVR())])
time_split = TimeSeriesSplit(n_splits=5)
param_grid = {'clf__kernel': ['rbf'],
              'clf__C': [0.0001, 0.001],
              'clf__gamma': [0.0001, 0.001]}
grid = GridSearchCV(pipe, param_grid, cv=time_split,
                    scoring='neg_mean_squared_error', n_jobs=-1)
grid.fit(X_train, y_train)

# Pickle the model with a context manager
with open('Pickles/{}.pkl'.format(file_name), 'wb') as file:
    pickle.dump(grid, file)

# Load the pickle with a context manager
with open('Pickles/{}.pkl'.format(file_name), 'rb') as file:
    model = pickle.load(file)

scaler = MinMaxScaler()
scaler.fit(X_train)  # original training data used for the Pipeline
X_train_scaled = scaler.transform(X_train)
new_data_scaled = scaler.transform(new_data)
model.predict(new_data_scaled)
Currently I have text data and I am trying to predict a class; in my case there are 60 classes to choose from. When I train the model with random forest in scikit-learn, I get an F1 score of 78%.
However, when I set the model up in PySpark, I only get 30%. WAY TOO LOW! What is going on? Maybe I am not setting it up right. Also, with PySpark, the random forest is only able to predict up to 12 labels, whereas in my case I have 60.
scikit-learn code:
rf_model = Pipeline([
    ('featextract', FeatureExtractor()),
    ('union', FeatureUnion(transformer_list=[
        # pipeline for tfidf
        ('text', Pipeline([
            ('selector', ItemSelector(key='TEXT')),
            ('count_vec', TfidfVectorizer(max_features=5000)),
            ('tfidf', TfidfTransformer())])),
        # pipeline for ATA
        ('ata', Pipeline([
            ('selector', ItemSelector(key="ATA_SYS_NO")),
            ('atas', convert2dict()),
            ('vect', DictVectorizer())]))
    ])),
    ('model', OneVsRestClassifier(RandomForestClassifier(n_estimators=200, n_jobs=5))),
])
PySpark code:
Tokenizer1 = Tokenizer(inputCol="TEXT", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=4000)
idf = IDF(inputCol="rawFeatures", outputCol="tfidffeatures")
rf = RF(labelCol="componentIndex", featuresCol='tfidffeatures', numTrees=500)
pipeline = Pipeline(stages=[Tokenizer1, hashingTF, idf, labelIndexer, rf])
(trainingData, testData) = df.randomSplit([0.8, 0.2])
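For completeness, a rough sketch of how this PySpark pipeline would be fitted and evaluated (assuming labelIndexer is a StringIndexer defined elsewhere):

model = pipeline.fit(trainingData)        # fits all stages in order
predictions = model.transform(testData)   # adds a 'prediction' column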
I am pretty new to pipelines in sklearn and I am running into a problem: I have a dataset with a mixture of text and numbers, i.e. certain columns contain only text and the rest contain integers (or floating-point numbers).
I was wondering whether it is possible to build a pipeline where I can, for example, call LabelEncoder() on the text features and MinMaxScaler() on the numeric columns. The examples I have seen on the web mostly point towards using LabelEncoder() on the entire dataset, not on select columns. Is this possible? If so, any pointers would be greatly appreciated.
The way I usually do it is with a FeatureUnion, using a FunctionTransformer to pull out the relevant columns.
Important notes:
You have to define your functions with def since, annoyingly, you can't use lambda or partial in FunctionTransformer if you want to pickle your model.
You need to initialize FunctionTransformer with validate=False
Something like this:
from sklearn.pipeline import make_union, make_pipeline
from sklearn.preprocessing import FunctionTransformer, LabelEncoder, MinMaxScaler

def get_text_cols(df):
    return df[['name', 'fruit']]

def get_num_cols(df):
    return df[['height', 'age']]

vec = make_union(*[
    make_pipeline(FunctionTransformer(get_text_cols, validate=False), LabelEncoder()),
    make_pipeline(FunctionTransformer(get_num_cols, validate=False), MinMaxScaler())
])
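One caveat: LabelEncoder expects a 1-d array of labels, so it can fail inside a transformer pipeline on a multi-column DataFrame; OrdinalEncoder (added in scikit-learn v0.20) is the column-wise equivalent. A sketch of the union wired into a full model, with placeholder data (df and y are assumed to be your DataFrame and labels):

from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

vec = make_union(
    make_pipeline(FunctionTransformer(get_text_cols, validate=False), OrdinalEncoder()),
    make_pipeline(FunctionTransformer(get_num_cols, validate=False), MinMaxScaler())
)
model = make_pipeline(vec, DecisionTreeClassifier())
model.fit(df, y)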
Since v0.20, you can use ColumnTransformer to accomplish this.
An example of ColumnTransformer might help you:
# FOREGOING TRANSFORMATIONS ON 'data' ...

# filter data
data = data[data['county'].isin(COUNTIES_OF_INTEREST)]

# define the feature encoding of the data
impute_and_one_hot_encode = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('encode', OneHotEncoder(sparse=False, handle_unknown='ignore'))
])

featurisation = ColumnTransformer(transformers=[
    ("impute_and_one_hot_encode", impute_and_one_hot_encode, ['smoker', 'county', 'race']),
    ('word2vec', MyW2VTransformer(min_count=2), ['last_name']),
    ('numeric', StandardScaler(), ['num_children', 'income'])
])

# define the training pipeline for the model
neural_net = KerasClassifier(build_fn=create_model, epochs=10, batch_size=1, verbose=0, input_dim=109)
pipeline = Pipeline([
    ('features', featurisation),
    ('learner', neural_net)])

# train-test split
train_data, test_data = train_test_split(data, random_state=0)

# model training
model = pipeline.fit(train_data, train_data['label'])
You can find the entire code under: https://github.com/stefan-grafberger/mlinspect/blob/19ca0d6ae8672249891835190c9e2d9d3c14f28f/example_pipelines/healthcare/healthcare.py
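Once fitted, the same pipeline applies the column selection and encoding at prediction time, so you can evaluate it directly on the raw held-out rows (a one-line sketch using the split above):

print(model.score(test_data, test_data['label']))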