Unable to make prediction after loading sklearn model - python

I have created an ML model with scikit-learn and saved it. Now when I load the model, I have trouble with transformation and prediction.
I have 4 features in my DataFrame. The first two features are textual, and the other 2 are numerical. The result column is 1 or 0.
In order to train my model, I used ColumnTransformer and CountVectorizer to transform and vectorize the textual features. I specified the NAMES of the columns that I want to transform/vectorize
(columns text1 and text2). The numerical columns do not need to be vectorized, so remainder='passthrough' takes care of them.
The part of the code that works:
features = df.iloc[:, :-1]
results = df.iloc[:, -1]
transformerVectoriser = ColumnTransformer(
    transformers=[
        ('vector word 1', CountVectorizer(analyzer='word', ngram_range=(1, 1), max_features=12000, stop_words='english'), 'text1'),
        ('vector phrase 3', CountVectorizer(analyzer='word', ngram_range=(3, 3), max_features=2500, stop_words='english'), 'text2')
    ],
    remainder='passthrough')  # default is to drop untransformed columns; 'passthrough' leaves them as they are
x_train, x_test, y_train, y_test = train_test_split(features, results, test_size=0.3, random_state=0)
x_train = transformerVectoriser.fit_transform(x_train)
x_test = transformerVectoriser.transform(x_test)
model = clf.fit(x_train, y_train)  # clf is the classifier instantiated earlier (not shown)
y_pred = model.predict(x_test)
filename = 'ml_model.sav'
pickle.dump(model, open(filename, 'wb'))
filename = 'ml_transformer.sav'
pickle.dump(transformerVectoriser, open(filename, 'wb'))
But when I load the model and try to make a prediction, I get an error:
# LOADING MODEL
model = pickle.load(open('ml_model.sav','rb'))
vectorizer = pickle.load(open('ml_transformer.sav','rb'))
# MAKING PREDICTION
data_for_prediction = vectorizer.transform([data_for_prediction]) #ERROR
print(model.predict_proba(data_for_prediction))
I get the error:
ValueError: Specifying the columns using strings is only supported for pandas DataFrames
When I was training my model I used a pandas DataFrame, but to make a prediction I just put the values in a list. So data_for_prediction is a list that looks like this:
["text that should be vectorized with vectorizer that i created", "More texts that should be vectorized", 4, 7]
I think that is the problem: I used column names with the ColumnTransformer, but now, when I want to make a prediction, the vectorizer does not know what to vectorize.
My final model and vectorizer will be used in an API, and the API should only take JSON, so I do not want to convert the JSON to a DataFrame and pass it to the model.
Is there a way to fix this error without using a pandas DataFrame in my final Flask app?

The training data is a DataFrame; you can get its columns (before transforming) with
x_train.columns
The function vectorizer.transform() wants data in the same format, so assuming that
data_f_p = ["text that should be vectorized", 4, 7, 0]
corresponds to the same four columns as x_train, you can turn it into a DataFrame with
data_f_p = pd.DataFrame([data_f_p], columns=x_train.columns)
data_f_p = vectorizer.transform(data_f_p)
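For completeness, a minimal sketch of the full prediction path with this approach; the numerical column names num1 and num2 are hypothetical (the question only names text1 and text2):
import pickle
import pandas as pd

model = pickle.load(open('ml_model.sav', 'rb'))
vectorizer = pickle.load(open('ml_transformer.sav', 'rb'))

# One row with the same four columns the transformer was fitted on
row = ["text that should be vectorized with vectorizer that i created", "More texts that should be vectorized", 4, 7]
row_df = pd.DataFrame([row], columns=['text1', 'text2', 'num1', 'num2'])

print(model.predict_proba(vectorizer.transform(row_df)))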

If you don't want to use a pandas.DataFrame in your REST API endpoint, just don't train your model on the DataFrame; convert your data to a numpy array first:
>>> df
                    TEXT_1                TEXT_2    NUM_1  NUM_2
0  This is the first text.      The second text.  300.000   23.3
1  Here is the third text.  And the fourth text.    2.334   29.0
>>> df.to_numpy()
array([['This is the first text.', 'The second text.', 300.0, 23.3],
       ['Here is the third text.', 'And the fourth text.', 2.334, 29.0]],
      dtype=object)
Then, change how you define the model. I'd suggest combining the preprocessing and predicting steps into a single model using sklearn.pipeline.Pipeline, like this:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
model = Pipeline(steps=[
    ('transformer', ColumnTransformer(
        transformers=[
            ('TEXT_1', CountVectorizer(analyzer='word', stop_words='english'), 0),
            ('TEXT_2', CountVectorizer(analyzer='word', stop_words='english'), 1),
        ],
        remainder='passthrough',
    )),
    ('predictor', RandomForestClassifier()),
])
Note that here we use indices instead of names to reference the text columns when defining the transformers for the ColumnTransformer instance. Once we've transformed the initial DataFrame to a numpy array, the TEXT_1 feature is located at index 0 and TEXT_2 at index 1 in each data row. Here is how you can use the model:
from joblib import dump, load
X = df.to_numpy()
model.fit(X, y)
dump(model, 'model.joblib')
...
model = load('model.joblib')
results = model.predict(data)  # data is array-like rows, e.g. [["some text", "more text", 4, 7]]
As a result, you don't have to convert your incoming data to the DataFrame in order to make a prediction.
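Since the question mentions serving this from a Flask API that only receives JSON, here is a minimal sketch of how the pipelined model could be served with no DataFrame conversion; the route name and payload shape are assumptions, not part of the original answer:
import numpy as np
from flask import Flask, request, jsonify
from joblib import load

app = Flask(__name__)
model = load('model.joblib')  # the fitted Pipeline from above

@app.route('/predict', methods=['POST'])
def predict():
    # Expects a JSON array of rows, e.g. [["some text", "more text", 4, 7]]
    rows = np.array(request.get_json(), dtype=object)
    return jsonify(model.predict(rows).tolist())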

Related

Given feature/column names do not match the ones for the data given during fit. Error

I wrote the following code and it gives me this error:
"Given feature/column names do not match the ones for the data given
during fit."
The training and prediction data have the same features.
df_train = data_preprocessing(df_train)

# Split X and y
X_train = df_train.drop(target_columns, axis=1)
y_train = df_train[target_columns]

# List the categorical columns
categorical_columns = X_train.columns[X_train.dtypes == 'O'].tolist()

# List the numerical columns
numerical_columns = X_train.columns[X_train.dtypes != 'O'].tolist()

# Scaling & encoding objects
numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])
categorical_transformer = OneHotEncoder(handle_unknown='ignore')
col_transformers = ColumnTransformer(
    # name, transformer itself, columns to apply
    transformers=[("scaler_onestep", numeric_transformer, numerical_columns),
                  ("ohe_onestep", categorical_transformer, categorical_columns)])

# Model
model = MultiOutputClassifier(
    xgb.XGBClassifier(objective="binary:logistic",
                      colsample_bytree=0.5))

# Define a pipeline
pipeline = Pipeline([("preprocessing", col_transformers), ("XGB", model)])
pipeline.fit(X_train, y_train)

# Preprocess the prediction data
predicted = data_preprocessing(predicted)
X_predicted = predicted.drop(target_columns, axis=1)
predictions = pipeline.predict(X_predicted)
I get the error in the prediction step. How can I fix this problem? I couldn't find any solution.
Try reordering the columns in X_predicted so that they exactly match X_train.
I am guessing the feature names in the training dataset are not identical to those in the prediction dataset.
For example, if you have 19 features in the training dataset, the prediction dataset must have the same 19 features. The model cannot be applied to features it has not seen before.
Adding to the above answers: if you're using a ColumnTransformer like the OP and are unsure what the column names were at the time the model was fit, you can use pipeline.named_steps['preprocessing']._feature_names_in to figure it out (in newer scikit-learn versions this is the public attribute feature_names_in_).
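A minimal sketch of that reordering, assuming scikit-learn >= 1.0, where fitted transformers expose the public feature_names_in_ attribute:
# Reorder the prediction columns to match the order seen during fit
expected = pipeline.named_steps['preprocessing'].feature_names_in_
X_predicted = X_predicted[list(expected)]
predictions = pipeline.predict(X_predicted)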

Scikit-Learn: Changing columns of trained model to predict with ColumnTransformer and GridSearchCV

I am using GridSearchCV with a Pipeline and ColumnTransformer to train a classifier model in sklearn. My data is in a pandas DataFrame. Now that I have my best estimator, I want to apply predict() to similar data (text) in a DataFrame whose columns have different names. How can I change the column names in the ColumnTransformer?
I guess I could change the column names in the applicable DataFrame, then change them back? But this seems wrong.
vectpipe = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer())])

# Vectorize column 'Corpus' and one-hot encode column 'Categories'
column_trans = ColumnTransformer([('text', vectpipe, ['Corpus']),
                                  ('category', OneHotEncoder(handle_unknown="ignore"), ['Categories'])
                                  ], remainder='drop')

pipe = Pipeline([('preprocess', column_trans),
                 ('classifier', LogisticRegression())])

model = GridSearchCV(pipe, param, return_train_score=True)

# df contains a Text column, a category, and
model.fit(Training_df, y)
Now I have a trained model. I want to apply predict() to different DataFrames with different column names, but the ColumnTransformer is going to look for ('Corpus', 'Categories').
model.predict(Applicable_df1)
model.predict(Applicable_df2)
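A minimal sketch of the rename idea suggested in the question itself; the incoming column names 'Body' and 'Topic' are hypothetical:
# Map the new DataFrame's columns onto the names the fitted ColumnTransformer expects
renamed_df1 = Applicable_df1.rename(columns={'Body': 'Corpus', 'Topic': 'Categories'})
predictions = model.predict(renamed_df1)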

Transformers for mixed data types

I'm having trouble applying different transformers at once to columns of different types (text vs. numerical), and concatenating those transformers into a single one for later use.
I tried to follow the steps in the documentation for Column Transformer with Mixed Types, which explains how to do that for a mix of categorical and numerical data, but it doesn't seem to work with text data.
TL;DR
How do you create a storable transformer that follows different pipelines for text and numerical data?
Data download and preparation
# imports
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler
np.random.seed(0)
# download Titanic data
X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
# data preparation
numeric_features = ['age', 'fare']
text_features = ['name', 'cabin', 'home.dest']
X.fillna({text_col: '' for text_col in text_features}, inplace=True)
# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
Transforming numerical features: ok
Following the steps in the link above, one can create a transformer for the numerical features as follows:
# handling missing data and normalization
numeric_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
                                      ('scaler', StandardScaler())])
num_preprocessor = ColumnTransformer(transformers=[('num', numeric_transformer, numeric_features)])
# this works
num_preprocessor.fit(X_train)
train_feature_set = num_preprocessor.transform(X_train)
test_feature_set = num_preprocessor.transform(X_test)
# verify shape = (number of data points, number of numerical features (2) )
train_feature_set.shape # (1047, 2)
test_feature_set.shape # (262, 2)
Transforming text features: ok
To process text features, I vectorize each text column with TF-IDF (as opposed to concatenating all text columns, and applying TF-IDF just once):
# Tfidf of max 30 features
text_transformer = TfidfVectorizer(use_idf=True,
                                   max_features=30)
# apply separately to each column
text_transformer_list = [(x + '_vectorizer', text_transformer, x) for x in text_features]
text_preprocessor = ColumnTransformer(transformers=text_transformer_list)
# this works
text_preprocessor.fit(X_train)
train_feature_set = text_preprocessor.transform(X_train)
test_feature_set = text_preprocessor.transform(X_test)
# verify shape = (number of data points, number of text features (3) times max_features(30) )
train_feature_set.shape # (1047, 90)
test_feature_set.shape # (262, 90)
How do you do both at once?
I've tried various strategies to save both above procedures in a single transformer, but they all fail due to different errors.
Attempt 1: Follow documented strategy
Following the documentation (Column Transformer with Mixed Types) doesn't work, once text data replaces categorical data:
# documented strategy
sum_preprocessor = ColumnTransformer(transformers=[('num', numeric_transformer, numeric_features),
                                                   ('text', text_transformer, text_features)])
# fails
sum_preprocessor.fit(X_train)
returns following error message:
ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 1047 and the array at index 1 has size 3
Attempt 2: FeatureUnion on the lists of transformers
# create a list of numerical transformer, like those for text
numerical_transformer_list = [(x + '_scaler', numeric_transformer, x) for x in numeric_features]
# fails
column_trans = FeatureUnion([text_transformer_list, numerical_transformer_list])
returns following error message:
TypeError: All estimators should implement fit and transform. '('cabin_vectorizer', TfidfVectorizer(max_features=30), 'cabin')' (type <class 'tuple'>) doesn't
Attempt 3: ColumnTransformer on the lists of transformers
# create a list of all transformers, text and numerical
sum_transformer_list = text_transformer_list + numerical_transformer_list
# works
sum_preprocessor = ColumnTransformer(transformers=sum_transformer_list)
# fails
sum_preprocessor.fit(X_train)
returns following error message:
ValueError: Expected 2D array, got 1D array instead:
array=[54. nan nan ... 20. nan nan].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
My question
How do I create a single object that can fit and transform data mixing text and numerical types?
Short answer:
all_transformers = text_transformer_list + [('num', numeric_transformer, numeric_features)]
all_preprocessor = ColumnTransformer(transformers=all_transformers)
all_preprocessor.fit(X_train)
train_all = all_preprocessor.transform(X_train)
test_all = all_preprocessor.transform(X_test)
print(train_all.shape, test_all.shape)
# prints (1047, 92) (262, 92)
The difficulty here is that (most?) text transformers expect 1-dimensional input, but (most?) numerical transformers expect 2-dimensional input. ColumnTransformer handles that by allowing you to specify a single column or a list of columns: in the first case, the 1d array is passed on to the transformer, and in the second a 2d array is passed.
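A minimal illustration of that distinction, reusing the Titanic columns and the numeric_transformer defined in the question:
# A string selects one column as a 1d Series; a list selects a 2d sub-frame
ct_text = ColumnTransformer([('tfidf', TfidfVectorizer(max_features=30), 'name')])    # 1d in
ct_num = ColumnTransformer([('impute_scale', numeric_transformer, ['age', 'fare'])])  # 2d in
ct_text.fit_transform(X_train)  # works: TfidfVectorizer expects 1d text input
ct_num.fit_transform(X_train)   # works: the imputer/scaler pipeline expects 2d input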
So, to explain the errors in the three attempts:
Attempt 1: The TF-IDF is receiving a 2d array and treats the columns as the documents, not the individual entries, so it produces just three outputs. When it tries to concatenate those to the 1047-row numerical output, it fails.
Attempt 2: FeatureUnion doesn't take the same input format as ColumnTransformer: you shouldn't have triples (name, transformer, columns) in this case. In any case, FeatureUnion isn't meant for what you're doing here.
Attempt 3: This time you're sending 1d data through to the numerical transformers, but they expect 2d data.
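For what it's worth, attempt 3 can also be made to work by wrapping each numeric column name in a list so the numeric pipeline receives 2d input; a sketch under that assumption, not from the original answer:
# Passing [x] instead of x gives each numeric transformer a 2d (n_samples, 1) input
numerical_transformer_list_2d = [(x + '_scaler', numeric_transformer, [x]) for x in numeric_features]
sum_preprocessor = ColumnTransformer(transformers=text_transformer_list + numerical_transformer_list_2d)
sum_preprocessor.fit(X_train)  # no longer raises the 1d-array ValueError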

Do predictions using training data

I have 2 CSV files: one is a training dataset and the other is a test dataset. The training dataset contains 36 columns; one of them is the outcome, which has A-F as values. The test dataset has 35 columns and does not have the outcome. I want to add an outcome column to the test dataset as well. I searched several tutorials but did not find the method I should follow. Can anyone tell me the process I should follow?
You haven't supplied any sample data or said which technique you want to use, but the code below should help you understand how to make predictions in general:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
Assuming you have read the 2 CSV files into train and test:
X_train = train.loc[:, train.columns != 'Outcome'] # <---- Exclude the Outcome column from train
y_train = train['Outcome'] # <---- This is your outcome

le = LabelEncoder()
y_train = le.fit_transform(y_train) # <---- Convert the A-F values to numeric (0-5)
I am assuming the rest of the X variables are numeric.
rf = RandomForestClassifier() # <---- Instantiate the classifier
rf.fit(X_train, y_train) # <---- Fit the classifier on the train data
rf.score(X_train, y_train) # <---- Check model accuracy on the train data

y_pred = pd.DataFrame(le.inverse_transform(rf.predict(test)), columns=['Outcome']) # <---- Predict on test and map 0-5 back to A-F
test_pred = pd.concat([test, y_pred['Outcome']], axis=1) # <---- Add the predictions column to your test dataset
test_pred.to_excel(r'path\Test.xlsx')
That depends on how you will find/calculate the outcome you need to add.
One way would be to load the test dataset as a pandas DataFrame, calculate the outcome, add the values to a list, and then add that list to the DataFrame:
import pandas as pd
data = pd.DataFrame(columns=['Names', 'Age', 'Outcome'])
names = ['John', 'Nicole', 'Evan']
age = [53, 23, 27]
data['Names'] = names
data['Age'] = age
outcome = [6545, 5252, 85665]
data['Outcome'] = outcome

Training a sklearn classifier with more than a single feature

I'm currently training a LinearSVC classifier with a single feature vectorizer. I'm processing news articles, which are stored in separate files. Those files originally had a title, a textual body, a date, an author and sometimes an image, but I ended up removing everything except the textual body as a feature. I'm doing it this way:
# Loading the files (plain files with just the news content; no date, author or other features)
data_train = load_files(self.TRAIN_FOLDER, encoding=self.ENCODING)
data_test = load_files(self.TEST_FOLDER, encoding=self.ENCODING)
unlabeled = load_files(self.UNLABELED_FOLDER, encoding=self.ENCODING)
categories = data_train.target_names
# Get the sparse matrix of each dataset
y_train = data_train.target
y_test = data_test.target
# Vectorizing
vectorizer = TfidfVectorizer(encoding=self.ENCODING, use_idf=True, norm='l2', binary=False, sublinear_tf=True, min_df=0.001, max_df=1.0, ngram_range=(1, 2), analyzer='word')
X_train = vectorizer.fit_transform(data_train.data)
X_test = vectorizer.transform(data_test.data)
X_unlabeled = vectorizer.transform(self.data_unlabeled.data)
# Instantiating the classifier
clf = LinearSVC(loss='squared_hinge', penalty='l2', dual=False, tol=1e-3)
# Fitting the model according to the training set and predicting
scaler = preprocessing.StandardScaler(with_mean=False)
scaler = scaler.fit(X_train)
normalized_X_train = scaler.transform(X_train)
clf.fit(normalized_X_train, y_train)
normalized_X_test = scaler.transform(X_test)
pred = clf.predict(normalized_X_test)
accuracy_score = metrics.accuracy_score(y_test, pred)
recall_score = metrics.recall_score(y_test, pred)
precision_score = metrics.precision_score(y_test, pred)
But now I would like to include other features, such as the date or the author, and all the simpler examples I found use a single feature, so I'm not really sure how to proceed. Should I have all the information in a single file? How do I differentiate authors from content? Should I use a vectorizer for each feature? If so, should I fit a model with the different vectorized features? Or should I have a different classifier for each feature? Can you suggest something to read (explained for newbies)?
Thanks in advance
The output of TfidfVectorizer is a scipy.sparse.csr.csr_matrix object. You may use hstack to add more features (like here). Alternatively, you may convert the feature space you already have to a numpy array or pandas DataFrame and then add the new features (which you might have created from other vectorizers) as new columns to it. Either way, your final X_train and X_test should include all the features in one place. You may also need to standardize them before training (here); you do not seem to be doing that here.
I do not have your data so here is an example on some dummy data:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(corpus)
X_train = pd.DataFrame(X_train.todense())
X_train['has_image'] = [1, 0, 0, 1] # just adding a dummy feature for demonstration
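And a sketch of the sparse hstack route mentioned above, reusing the same dummy corpus; the has_image values are again just for demonstration:
from scipy.sparse import hstack
import numpy as np

X_sparse = vectorizer.fit_transform(corpus)         # sparse TF-IDF matrix
has_image = np.array([[1], [0], [0], [1]])          # extra feature as a column vector
X_combined = hstack([X_sparse, has_image]).tocsr()  # all features in one sparse matrix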
