So I have been tasked with training a model on phone call transcripts. The following code does this. A little background info:
- x is a list of strings; the i-th element is an entire transcript
- y is a list of booleans, indicating whether the outcome of a call was positive or negative.
The following code works, but here is my issue. I want to include call duration as a feature to train on. I'd assume that after the TF-IDF transformer that vectorizes the transcripts, I would just concatenate the call duration feature to the TF-IDF output, right? Maybe this is easier than I think, but I have the transcripts and the durations all in the pandas DataFrame you see at the beginning of the code. So if I have that DataFrame column (numpy array) of durations, what do I need to do to add that feature into my model?
Additional Questions:
Am I missing a fundamental assumption about the Naive Bayes model that limits me to vectorized strings?
At which step in my pipeline do I add the new feature?
Can this even be done in a pipeline or do I have to break it apart to do something like this?
Code:
import numpy as np
import pandas as pd
import random
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import SelectPercentile
from sklearn.metrics import roc_auc_score
from sklearn.feature_selection import chi2
def main():
    filename = 'QA_training.pkl'
    splitRatio = 0.67
    dataframe = loadData(filename)
    x, y = getTrainingData(dataframe)
    print(len(x), len(y))
    x_train, x_test = splitDataset(x, splitRatio)
    y_train, y_test = splitDataset(y, splitRatio)
    #x_train = np.asarray(x_train)
    percentiles = [10, 15, 20, 25, 30, 35, 40, 45, 50]
    MNNB_pipe = Pipeline([('vec', CountVectorizer()), ('tfidf', TfidfTransformer()), ('select', SelectPercentile(score_func=chi2)), ('clf', MultinomialNB())])
    MNNB_param_grid = {
        #'vec__max_features': (10, 25, 50, 100, 250, 500, 1000, 2500, 5000, 10000),
        'tfidf__use_idf': (True, False),
        'tfidf__sublinear_tf': (True, False),
        'vec__binary': (True, False),
        'tfidf__norm': ('l1', 'l2'),
        'clf__alpha': (1, 0.1, 0.01, 0.001, 0.0001, 0.00001),
        'select__percentile': percentiles
    }
    MNNB_search = GridSearchCV(MNNB_pipe, param_grid=MNNB_param_grid, cv=10, scoring='roc_auc', n_jobs=-1, verbose=1)
    MNNB_search = MNNB_search.fit(x_train, y_train)
    MNNB_search_best_cv = cross_val_score(MNNB_search.best_estimator_, x_train, y_train, cv=10, scoring='roc_auc', n_jobs=-1, verbose=10)
    SGDC_pipe = Pipeline([('vec', CountVectorizer()), ('tfidf', TfidfTransformer()), ('select', SelectPercentile(score_func=chi2)), ('clf', SGDClassifier())])
    SGDC_param_grid = {
        #'vec__max_features': [10, 25, 50, 100, 250, 500, 1000, 2500, 5000, 10000],
        'tfidf__use_idf': [True, False],
        'tfidf__sublinear_tf': [True, False],
        'vec__binary': [True, False],
        'tfidf__norm': ['l1', 'l2'],
        'clf__loss': ['modified_huber', 'log'],
        'clf__penalty': ['l1', 'l2'],
        'clf__alpha': [1e-3],
        'clf__n_iter': [5, 10],
        'clf__random_state': [42],
        'select__percentile': percentiles
    }
    SGDC_search = GridSearchCV(SGDC_pipe, param_grid=SGDC_param_grid, cv=10, scoring='roc_auc', n_jobs=-1, verbose=1)
    SGDC_search = SGDC_search.fit(x_train, y_train)
    SGDC_search_best_cv = cross_val_score(SGDC_search.best_estimator_, x_train, y_train, cv=10, scoring='roc_auc', n_jobs=-1, verbose=10)
    # pre_SGDC = SGDC_clf.predict(x_test)
    # print(np.mean(pre_SGDC == y_test))
    mydata = [{'model': MNNB_search.best_estimator_.named_steps['clf'], 'features': MNNB_search.best_estimator_.named_steps['select'], 'mean_cv_scores': MNNB_search_best_cv.mean()},
              #{'model': GNB_search.best_estimator_.named_steps['classifier'], 'features': GNB_search.best_estimator_.named_steps['select'], 'mean_cv_scores': GNB_search_best_cv.mean()},
              {'model': SGDC_search.best_estimator_.named_steps['clf'], 'features': SGDC_search.best_estimator_.named_steps['select'], 'mean_cv_scores': SGDC_search_best_cv.mean()}]
    model_results_df = pd.DataFrame(mydata)
    model_results_df.to_csv("best_model_results.csv")
As far as I'm aware, sklearn pipelines are API driven -- There's no real magic that happens in the pipeline itself. So, from that perspective, you should be able to create your own wrapper around TfidfVectorizer that does what you want it to do. For example, let's assume that you have a DataFrame that looks like this:
df = pd.DataFrame({'text': ['foo text', 'bar text'], 'duration': [1, 2]})
you could probably implement your transform as follows:
from sklearn.feature_extraction.text import TfidfVectorizer

class MyVectorizer(object):
    def __init__(self, tfidf_kwargs=None):
        self._tfidf = TfidfVectorizer(**(tfidf_kwargs or {}))

    def fit(self, X, y=None):
        self._tfidf.fit(X['text'], y)
        return self

    def fit_transform(self, X, y=None):
        self.fit(X)
        return self.transform(X, copy=False)

    def transform(self, X, copy=True):
        result = self._tfidf.transform(X['text'], copy=copy)
        # result is a sparse matrix. There is no clean way to append a column
        # to a sparse matrix with np.column_stack, so if you have the memory
        # you can densify it first...
        return np.column_stack((result.toarray(), X['duration']))
And then I think you should be all set to use this instead of the original tfidf vectorizer.
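If you would rather keep the TF-IDF output sparse instead of densifying it, here is a rough sketch of the same idea using scipy.sparse.hstack, plus a ColumnTransformer variant (available in scikit-learn 0.20+) that keeps everything inside a Pipeline so GridSearchCV can still tune the text step. The column names text and duration are assumed to match the example DataFrame above:
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

df = pd.DataFrame({'text': ['foo text', 'bar text'], 'duration': [1, 2]})
y = [0, 1]

# Option 1: stack the duration column onto the sparse TF-IDF matrix by hand
tfidf = TfidfVectorizer()
text_features = tfidf.fit_transform(df['text'])            # sparse (n_samples, n_terms)
duration = sparse.csr_matrix(df[['duration']].to_numpy())  # sparse (n_samples, 1)
X = sparse.hstack([text_features, duration]).tocsr()       # still sparse

# Option 2: let ColumnTransformer do the stacking inside a Pipeline
pipe = Pipeline([
    ('features', ColumnTransformer([
        ('tfidf', TfidfVectorizer(), 'text'),        # text column -> TF-IDF terms
        ('duration', 'passthrough', ['duration']),   # numeric column appended as-is
    ])),
    ('clf', MultinomialNB()),
])
pipe.fit(df, y)
With the pipeline variant, grid-search parameters are addressed through the nested names, e.g. 'features__tfidf__use_idf' or 'clf__alpha'. Note that MultinomialNB expects non-negative features, which holds here because durations are non-negative.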
I am using Scikit-Learn's Random Forest Regressor, Pipeline, and RandomizedSearchCV to predict the target variable using some features in my dataset. I need to use my own custom scoring functions that calculate weighted scores using weights (signifying the importance of observations) from the dataset. My code seems to work but I am getting a warning when the grid runs:
DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for examples using ravel(). self.__final_estimator.fit(Xt, y, **fit_params)
This is related to .fit(X_train, y_train). Based on this warning, if I change the code to .fit(X_train, y_train.values.ravel()) then I cannot get my weighted scores to work. I have tried editing the code in different/appropriate ways to get the weighted scores to work but to no avail.
I am including my code below; it runs on test data in test.csv. The file has four columns: two feature columns ('x1', 'x2'), a target column ('y'), and a weight column ('weight'). The custom scoring functions below are simple functions that calculate a weighted rmse_score and mean_abs_error_score. How can I use .fit(X_train, y_train.values.ravel()) and still compute the scores?
import pandas as pd
import numpy as np
import sklearn.model_selection as skms
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
def rmse_score(y_true, y_pred, weight):
    weight = weight.loc[y_true.index.values]
    rmse = np.sqrt(np.mean(weight*(y_true.values-y_pred)**2))
    return rmse

def mean_abs_error_score(y_true, y_pred, weight):
    weight = weight.loc[y_true.index.values]
    mae = np.mean(weight*np.absolute(y_true.values-y_pred))
    return mae
#---- reading data
heart_df = pd.read_csv('data\\test.csv')
#---- splitting into training & testing sets
y = heart_df['y']
X = heart_df[['x1', 'x2']]
X_train, X_test, y_train, y_test = skms.train_test_split(X, y, test_size=0.20)
X_train_weights = heart_df['weight'].loc[X_train.index.values]
params = {"weight": X_train_weights}
my_scorer1 = make_scorer(rmse_score, greater_is_better=False, **params)
my_scorer2 = make_scorer(mean_abs_error_score, greater_is_better=False, **params)
#---- random forest training with hyperparameter tuning
pipe = Pipeline([("scaler", StandardScaler()), ("rfr", RandomForestRegressor())])
random_grid = { "rfr__n_estimators": [10, 100, 500, 1000],
"rfr__max_depth": [10, 20, 30, 40, 50, None],
"rfr__max_features": [0.25, 0.50, 0.75],
"rfr__min_samples_split": [5, 10, 20],
"rfr__min_samples_leaf": [3, 5, 10],
"rfr__bootstrap": [True, False]
}
rfr_cv = skms.RandomizedSearchCV(pipe,
param_distributions=random_grid,
n_iter = 15,
cv = 3,
verbose=3,
scoring={'rmse': my_scorer1, 'mae':my_scorer2},
refit = 'rmse',
random_state=42,
n_jobs = -1)
rfr_cv.fit(X_train, y_train)
best_params = rfr_cv.best_params_
best_score = rfr_cv.best_score_
print(f'best hyperparameters = {best_params}')
print(f'best score = {best_score}')
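One possible workaround here (a sketch under the assumption that X_train stays a pandas DataFrame, not an official recipe) is to skip make_scorer and write the scorers as plain callables with the (estimator, X, y) signature that scikit-learn accepts. The validation folds that RandomizedSearchCV hands to such a scorer keep their original row labels, so the weights can be looked up from X.index, and y_train can then be passed as a raveled 1-d array without breaking the weighting. The names weighted_rmse_scorer and weighted_mae_scorer are introduced here, and the parameter grid is shortened for brevity:
import numpy as np
import pandas as pd
import sklearn.model_selection as skms
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

heart_df = pd.read_csv('data\\test.csv')
y = heart_df['y']
X = heart_df[['x1', 'x2']]
X_train, X_test, y_train, y_test = skms.train_test_split(X, y, test_size=0.20)
weights = heart_df['weight']

def weighted_rmse_scorer(estimator, X_val, y_val):
    # X_val is a slice of X_train and keeps the original row labels
    w = weights.loc[X_val.index].values
    y_pred = estimator.predict(X_val)
    return -np.sqrt(np.mean(w * (np.asarray(y_val) - y_pred) ** 2))  # negated: higher is better

def weighted_mae_scorer(estimator, X_val, y_val):
    w = weights.loc[X_val.index].values
    y_pred = estimator.predict(X_val)
    return -np.mean(w * np.absolute(np.asarray(y_val) - y_pred))

pipe = Pipeline([("scaler", StandardScaler()), ("rfr", RandomForestRegressor())])
rfr_cv = skms.RandomizedSearchCV(pipe,
                                 param_distributions={"rfr__n_estimators": [10, 100],
                                                      "rfr__max_depth": [10, None]},
                                 n_iter=4,
                                 cv=3,
                                 scoring={'rmse': weighted_rmse_scorer, 'mae': weighted_mae_scorer},
                                 refit='rmse',
                                 random_state=42,
                                 n_jobs=-1)
rfr_cv.fit(X_train, y_train.values.ravel())  # raveling y no longer matters to the scorers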
Below is the code that I am trying to execute
# Train a logistic regression model, report the coefficients and model performance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn import metrics
clf = LogisticRegression().fit(X_train, y_train)
params = {'penalty':['l1','l2'],'dual':[True,False],'C':[0.001, 0.01, 0.1, 1, 10, 100, 1000], 'fit_intercept':[True,False],
'solver':['saga']}
gridlog = GridSearchCV(clf, params, cv=5, n_jobs=2, scoring='roc_auc')
cv_scores = cross_val_score(gridlog, X_train, y_train)
#find best parameters
print('Logistic Regression parameters: ',gridlog.best_params_) # throws error
The last code line above is where the error is being thrown from. I have used this exact same code to run other models. Any idea why I may be facing this issue?
You need to fit gridlog first. cross_val_score will not do this for you; it fits clones internally and returns the scores and nothing else.
Hence, as gridlog itself is never trained, accessing best_params_ throws an error.
Below code works perfectly fine:
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
data = datasets.load_breast_cancer()
x = data.data[:150]
y = data.target[:150]
clf = LogisticRegression().fit(x, y)
params = {'C':[0.001, 0.01, 0.1, 1, 10, 100, 1000]}
gridlog = GridSearchCV(clf, params, cv=2, n_jobs=2,
scoring='roc_auc')
gridlog.fit(x,y) # <- missing in your code
cv_scores = cross_val_score(gridlog, x, y)
print(cv_scores)
#find best parameters
print('Logistic Regression parameters: ',gridlog.best_params_)
# result:
Logistic Regression parameters:  {'C': 1}
Your code should be updated so that the LogisticRegression classifier itself is passed to GridSearchCV (not an already-fitted instance), and the grid search is then fitted:
from sklearn.datasets import load_breast_cancer # For example only
X_train, y_train = load_breast_cancer(return_X_y=True)
params = {'penalty':['l1', 'l2'],'dual':[True, False],'C':[0.001, 0.01, 0.1, 1, 10, 100, 1000], 'fit_intercept':[True, False],
'solver':['saga']}
gridlog = GridSearchCV(LogisticRegression(), params, cv=5, n_jobs=2, scoring='roc_auc')
gridlog.fit(X_train, y_train)
#find best parameters
print('Logistic Regression parameters: ', gridlog.best_params_) # Now it displays all the parameters selected by the grid search
Results
Logistic Regression parameters: {'C': 0.1, 'dual': False, 'fit_intercept': True, 'penalty': 'l2', 'solver': 'saga'}
Note that, as @desertnaut pointed out, you don't use cross_val_score on a GridSearchCV; the grid search already cross-validates internally.
See a complete example of how to use GridSearch here.
The example uses an SVC classifier instead of a LogisticRegression, but the approach is the same.
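If the point of the original cross_val_score call was to get a performance estimate for the tuned model (rather than its best parameters), one option, sketched here on the same breast-cancer example and not part of either answer, is nested cross-validation, where the GridSearchCV object itself is what gets cross-validated:
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

X_train, y_train = load_breast_cancer(return_X_y=True)

params = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000], 'solver': ['saga']}
gridlog = GridSearchCV(LogisticRegression(max_iter=5000), params, cv=5, n_jobs=2, scoring='roc_auc')

# the outer CV scores the whole tuning procedure (nested cross-validation)
outer_scores = cross_val_score(gridlog, X_train, y_train, cv=5, scoring='roc_auc')
print(outer_scores.mean())

# fit separately if you also want best_params_ for the full training set
gridlog.fit(X_train, y_train)
print('Logistic Regression parameters: ', gridlog.best_params_)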
I need to perform leave-one-out cross validation of RF model.
I successfully built a model with high predictive ability.
Now I need to perform a LOO test prior to publication.
Here is my code:
import pandas as pd
import numpy as np
import seaborn as sns
%matplotlib inline
import matplotlib.pyplot as plt
FC_data = pd.read_excel('C:\\Users\\Dre\\Desktop\\My Papers\\Furocoumarins_paper_2018\\Furocoumarins_NEW1.xlsx', index_col=0)
FC_data.head()
# Create correlation matrix
corr_matrix = FC_data.corr().abs()
# Select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
# Find index of feature columns with correlation greater than 0.95
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]
# Drop features
FC_data1 = FC_data.drop(to_drop, axis=1)
y = FC_data1.LogFiT
X = FC_data1.drop(['LogFiT', 'LogS'], axis=1)
X_train = X.drop(["3-Acetoisopseudopsoralen", "3-Carbethoxypsoralen", "4,4'-Dimethylangelicin",
"4,7,4'-Trimethylallopsoralen", "Psoralen"], axis=0)
X_train.head(21)
y_train = y.drop(["3-Acetoisopseudopsoralen", "3-Carbethoxypsoralen", "4,4'-Dimethylangelicin",
"4,7,4'-Trimethylallopsoralen", "Psoralen"], axis=0)
y_train.head(21)
X_test = X.loc[["3-Acetoisopseudopsoralen", "3-Carbethoxypsoralen", "4,4'-Dimethylangelicin",
"4,7,4'-Trimethylallopsoralen", "Psoralen"]]
X_test.head(5)
y_test = y.loc[["3-Acetoisopseudopsoralen", "3-Carbethoxypsoralen", "4,4'-Dimethylangelicin",
"4,7,4'-Trimethylallopsoralen", "Psoralen"]]
y_test.head(5)
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
randomforest = RandomForestRegressor(n_jobs=-1)
selector = SelectFromModel(randomforest)
features_important = selector.fit_transform(X_train, y_train)
model = randomforest.fit(features_important, y_train)
from sklearn.model_selection import GridSearchCV
clf_rf = RandomForestRegressor()
parameters = {"n_estimators":[1, 2, 3, 4, 5, 7, 10, 15, 20, 30, 40, 50, 100], "max_depth":[1, 2, 3, 4, 5, 7, 10, 15, 20, 30, 40, 50, 100]}
grid_search_cv_clf = GridSearchCV(clf_rf, parameters, cv=5)
grid_search_cv_clf.fit(features_important, y_train)
from sklearn.metrics import r2_score
y_pred = grid_search_cv_clf.predict(features_important)
r2_score(y_train, y_pred)
grid_search_cv_clf.best_params_
best_clf = grid_search_cv_clf.best_estimator_
X_test_filtered = X_test.iloc[:,selector.get_support()]
best_clf.score(X_test_filtered, y_test)
feature_importances = best_clf.feature_importances_
feature_importances_df = pd.DataFrame({'features': X_test_filtered.columns.values,
'feature_importances':feature_importances})
importances = feature_importances_df.sort_values('feature_importances', ascending=False)
importances.head(25)
Now I need the q2 value. Finally, I wrote this code and got a reasonably high score of 0.9071543776303185.
from sklearn.model_selection import LeaveOneOut
parameters = {"n_estimators":[4], "max_depth":[20]}
loo_clf = GridSearchCV(best_clf, parameters, cv=LeaveOneOut())
loo_clf.fit(features_important, y_train)
loo_clf.score(features_important, y_train)
I'm not sure whether it is q2 or not. What do you think?
I also decided to obtain 5-fold cross-validation score. However, it gives ridiculous values like, for example: -36.58997717, 0.76801832, -1.59900448, 0.1834304 , -2.38256389 and a mean of -7.924019361863889.
from sklearn.model_selection import cross_val_score
cvs=cross_val_score(best_clf, features_important, y_train)
mean_cross_val_score = cvs.mean()
mean_cross_val_score
Is there a way to fix it?
You should not run the hyper-parameter search separately before the model evaluation. Instead, you should nest the 2 cross-validations; otherwise, you are leaking some information. To know more about this, you should look at the following example from the scikit-learn documentation: https://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html#sphx-glr-auto-examples-model-selection-plot-nested-cross-validation-iris-py
Therefore, in your particular use-case, you should combine GridSearchCV, SelectFromModel, and cross_val_score:
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
X, y = make_regression(n_samples=100)
feature_selector = SelectFromModel(
RandomForestRegressor(n_jobs=-1), threshold="mean"
)
pipe = make_pipeline(
feature_selector, RandomForestRegressor(n_jobs=-1)
)
param_grid = {
# define the grid of the random-forest for the feature selection
"selectfrommodel__estimator__n_estimators": [10, 20],
"selectfrommodel__estimator__max_depth": [3, 5],
# define the grid of the random-forest for the prediction
"randomforestregressor__n_estimators": [10, 20],
"randomforestregressor__max_depth": [5, 8],
}
grid_search = GridSearchCV(pipe, param_grid=param_grid, n_jobs=-1, cv=3)
# You can use LOO in this way. Be aware that this is not good practice:
# it leads to a large variance when evaluating your model.
# scores = cross_val_score(grid_search, X, y, cv=LeaveOneOut(), error_score='raise')
scores = cross_val_score(grid_search, X, y, cv=2, error_score='raise')
scores.mean()
You need to specify the scoring and the cv arguments.
Use this:
from sklearn.model_selection import cross_val_score
mycv = LeaveOneOut()
cvs = cross_val_score(best_clf, features_important, y_train, scoring='r2', cv=mycv)
mean_cross_val_score = cvs.mean()
print(mean_cross_val_score)
This will return the mean of the per-fold R2 scores. Note, however, that R2 is not defined on a single held-out sample, so with LeaveOneOut each fold's score comes out as nan; computing q2 from the pooled LOO predictions (see the sketch below) avoids this.
For more scoring options see here: https://scikit-learn.org/stable/modules/model_evaluation.html#common-cases-predefined-values
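If by q2 you mean the leave-one-out cross-validated R2 (q2 = 1 - PRESS/TSS) that is usually reported for QSAR-type models, a minimal sketch (assuming best_clf, features_important and y_train as built above) is to pool the LOO predictions with cross_val_predict and score them once:
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# best_clf is the tuned regressor from the grid search above;
# features_important / y_train are the reduced training data.
y_loo_pred = cross_val_predict(best_clf, features_important, y_train, cv=LeaveOneOut(), n_jobs=-1)

press = np.sum((y_train - y_loo_pred) ** 2)        # predictive residual sum of squares
tss = np.sum((y_train - y_train.mean()) ** 2)      # total sum of squares around the mean
q2 = 1 - press / tss                               # identical to r2_score on the pooled predictions
print(q2, r2_score(y_train, y_loo_pred))
Keep in mind that, as the first answer points out, the feature selection and hyper-parameter tuning were fitted on the same rows that LOO later holds out, so this q2 is still optimistic; the nested cross-validation setup above is the safer way to report it.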
So the basic requirement is that I get a dictionary of models from the user, along with a dictionary of their hyperparameters, and produce a report. The current goal is binary classification, but this can be extended later.
This is what I am currently doing:
import numpy as np
import pandas as pd
# import pandas_profiling as pp
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score, roc_auc_score, recall_score, precision_score, make_scorer
from sklearn import datasets
# import joblib
import warnings
warnings.filterwarnings('ignore')
cancer = datasets.load_breast_cancer()
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df['target'] = cancer.target
target = df['target']
X_train, X_test, y_train, y_test = train_test_split(df.drop(columns='target', axis=1), target, test_size=0.4, random_state=13, stratify=target)
def build_model(model_name, model_class, params=None):
    """
    return model instance
    """
    if 'Ridge' in model_name:
        model = model_class(penalty='l2')
    elif 'Lasso' in model_name:
        model = model_class(penalty='l1')
    elif 'Ensemble' in model_name:
        model = model_class(estimators=[('rf', RandomForestClassifier()), ('gbm', GradientBoostingClassifier())], voting='hard')
    else:
        model = model_class()
    if params is not None:
        print('Custom Model Parameters provided. Implementing Randomized Search for {} model'.format(model_name))
        rscv = RandomizedSearchCV(estimator=model, param_distributions=params[model_name],
                                  random_state=22, n_iter=10, cv=5, verbose=1, n_jobs=-1,
                                  scoring=make_scorer(f1_score), error_score=0.0)
        return rscv
    print('No model parameters provided. Using sklearn default values for {} model'.format(model_name))
    return model

def fit_model(model_name, model_instance, xTrain, yTrain):
    """
    fit model
    """
    if model_name == 'SVM':
        scaler = StandardScaler()
        model = model_instance.fit(scaler.fit_transform(xTrain), yTrain)
    else:
        model = model_instance.fit(xTrain, yTrain)
    return model

def predict_vals(fitted_model, xTest):
    """
    predict and return vals
    """
    if model_name == 'SVM':
        scaler = StandardScaler()
        y_prediction = fitted_model.predict(scaler.fit_transform(xTest))
    else:
        y_prediction = fitted_model.predict(xTest)
    return y_prediction

def get_metrics(yTest, y_prediction):
    """
    get metrics after getting prediction
    """
    return [recall_score(yTest, y_prediction),
            precision_score(yTest, y_prediction),
            f1_score(yTest, y_prediction),
            roc_auc_score(yTest, y_prediction)]

def model_report(list_of_metrics):
    """
    add metrics to df, return df
    """
    df = pd.DataFrame(list_of_metrics, columns=['Model', 'Recall', 'Precision', 'f1', 'roc_auc'])
    df = df.round(3)
    return df
models = {
'Logistic Regression Ridge': LogisticRegression,
'Logistic Regression Lasso': LogisticRegression,
'Random Forest': RandomForestClassifier,
'SVM': SVC,
'GBM': GradientBoostingClassifier,
'EnsembleRFGBM': VotingClassifier
}
model_parameters = {
'SVM': {
'C': np.random.uniform(50, 1, [25]),#[1, 10, 100, 1000],
'class_weight': ['balanced'],
'gamma': [0.0001, 0.001],
'kernel': ['linear']
},
'Random Forest': {
'n_estimators': [5, 10, 50, 100, 200],
'max_depth': [3, 5, 10, 20, 40],
'criterion': ['gini', 'entropy'],
'bootstrap': [True, False],
'min_samples_leaf': [np.random.randint(1,10)]
},
'Logistic Regression Ridge': {
'C': np.random.rand(25),
'class_weight': ['balanced']
},
'Logistic Regression Lasso': {
'C': np.random.rand(25),
'class_weight': ['balanced']
},
'GBM': {
'n_estimators': [10, 50, 100, 200, 500],
'max_depth': [3, 5, 10, None],
'min_samples_leaf': [np.random.randint(1,10)]
},
'EnsembleRFGBM': {
'rf__n_estimators': [5, 10, 50, 100, 200],
'rf__max_depth': [3, 5, 10, 20, 40],
'rf__min_samples_leaf': [np.random.randint(1,10)],
'gbm__n_estimators': [10, 50, 100, 200, 500],
'gbm__max_depth': [3, 5, 10, None],
'gbm__min_samples_leaf': [np.random.randint(1,10)]
}
}
Without parameters I get the following report.
# without parameters
lst = []
for model_name, model_class in models.items():
    model_instance = build_model(model_name, model_class)
    fitted_model = fit_model(model_name, model_instance, X_train, y_train)
    y_predicted = predict_vals(fitted_model, X_test)
    metrics = get_metrics(y_test, y_predicted)
    lst.append([model_name] + metrics)
model_report(lst)
With parameters given as input
# with parameters
lst = []
for model_name, model_class in models.items():
    model_instance = build_model(model_name, model_class, model_parameters)
    fitted_model = fit_model(model_name, model_instance, X_train, y_train)
    y_predicted = predict_vals(fitted_model, X_test)
    metrics = get_metrics(y_test, y_predicted)
    lst.append([model_name] + metrics)
model_report(lst)
The task given to me right now is as follows.
Take from user, a dictionary of models, and their parameters. If parameters are not provided, then use defaults of the models.
Give as output the report (as seen in images)
I was told that I should change the functions to classes, and avoid for loops if possible.
My challenges:
How do I change all the functions into a class and methods? Basically my senior wants something like
report.getReport # gives the dataFrame of the report
But the above sounds to me like it can be done in a function as follows (I don't understand why/how a class would be beneficial)
customReport(whatever inputs I'd like to give) # gives df of report
How do I avoid for loops when working through the user inputs for the various models? My thought was that I could use an sklearn Pipeline since, as I understand it, a pipeline is a series of steps, so I could take the params and models from the user and execute them as a series of steps. That would avoid the for loops.
Something like this
customPipeline = Pipeline([ ('rf', RandomForestClassifier(with relevant params from params dict),
'SVC', SVC(with relevant params from params dict)) ] )
A similar solution I found is here, but I would like to avoid for loops as such.
Another related solution here uses a class which can switch between different models. But here I would require that the user be able to choose whether to do GridSearch/RandomizedSearch/CV/None. My thinking is that I use this class, then inherit it in another class to which the user can give input to choose GridSearch/RandomizedSearch/CV/None, etc. I'm not sure if I'm thinking in the right direction.
NOTE A full working solution is desirable (would love it) but not mandatory. It is ok if your answer has a skeleton which can give me a direction how to proceed. I am ok with exploring and learning from it.
I have implemented a working solution. I should have worded my question better: I initially misunderstood how GridSearchCV and RandomizedSearchCV work internally. cv_results_ exposes all the results of the grid; I thought only the best estimator was available to us.
Using this, for each type of model I took the max rank_test_score and got the parameters making up that model. In this example there are 4 models. I then ran each of those models, i.e. the best combination of parameters for each model, against my test data and computed the required scores. I think this solution can be extended to RandomizedSearchCV and many other options.
NOTE: This is just a trivial solution. A lot of modifications are necessary, like scaling the data for specific models, etc. This solution just serves as a starting point that can be modified according to the user's needs.
Credits to this answer for the ClfSwitcher() class.
Following is the implementation of the class (suggestions to improve are welcomed).
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score, roc_auc_score, recall_score, precision_score
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator
import warnings
warnings.filterwarnings('ignore')
cancer = datasets.load_breast_cancer()
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df['target'] = cancer.target
target = df['target']
X_train, X_test, y_train, y_test = train_test_split(df.drop(columns='target', axis=1), target, test_size=0.4, random_state=13, stratify=target)
class ClfSwitcher(BaseEstimator):
    def __init__(self, model=RandomForestClassifier()):
        """
        A Custom BaseEstimator that can switch between classifiers.
        :param model: sklearn object - the classifier
        """
        self.model = model

    def fit(self, X, y=None, **kwargs):
        self.model.fit(X, y)
        return self

    def predict(self, X, y=None):
        return self.model.predict(X)

    def predict_proba(self, X):
        return self.model.predict_proba(X)

    def score(self, X, y):
        return self.model.score(X, y)


class report(ClfSwitcher):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.grid = None
        self.full_report = None
        self.concise_report = None
        self.scoring_metrics = {
            'precision': precision_score,
            'recall': recall_score,
            'f1': f1_score,
            'roc_auc': roc_auc_score
        }

    def griddy(self, pipeLine, parameters, **kwargs):
        # forward scoring (and any other options) from the caller
        self.grid = GridSearchCV(pipeLine, parameters, n_jobs=-1, **kwargs)

    def fit_grid(self, X_train, y_train=None, **kwargs):
        self.grid.fit(X_train, y_train)

    def make_grid_report(self):
        self.full_report = pd.DataFrame(self.grid.cv_results_)

    @staticmethod
    def get_names(col):
        return col.__class__.__name__

    @staticmethod
    def calc_score(col, metric):
        return round(metric(y_test, col.fit(X_train, y_train).predict(X_test)), 4)

    def make_concise_report(self):
        self.concise_report = pd.DataFrame(self.grid.cv_results_)
        self.concise_report['model_names'] = self.concise_report['param_cst__model'].apply(self.get_names)
        self.concise_report = self.concise_report.sort_values(['model_names', 'rank_test_score'], ascending=[True, False]) \
                                                 .groupby(['model_names']).head(1)[['param_cst__model', 'model_names']] \
                                                 .reset_index(drop=True)
        for metric_name, metric_func in self.scoring_metrics.items():
            self.concise_report[metric_name] = self.concise_report['param_cst__model'].apply(self.calc_score, metric=metric_func)
        self.concise_report = self.concise_report[['model_names', 'precision', 'recall', 'f1', 'roc_auc', 'param_cst__model']]
pipeline = Pipeline([
('cst', ClfSwitcher()),
])
parameters = [
{
'cst__model': [RandomForestClassifier()],
'cst__model__n_estimators': [10, 20],
'cst__model__max_depth': [5, 10],
'cst__model__criterion': ['gini', 'entropy']
},
{
'cst__model': [SVC()],
'cst__model__C': [10, 20],
'cst__model__kernel': ['linear'],
'cst__model__gamma': [0.0001, 0.001]
},
{
'cst__model': [LogisticRegression()],
'cst__model__C': [13, 17],
'cst__model__penalty': ['l1', 'l2']
},
{
'cst__model': [GradientBoostingClassifier()],
'cst__model__n_estimators': [10, 50],
'cst__model__max_depth': [3, 5],
'cst__model__min_samples_leaf': [1, 2]
}
]
my_report = report()
my_report.griddy(pipeline, parameters, scoring='f1')
my_report.fit_grid(X_train, y_train)
my_report.make_concise_report()
my_report.concise_report
Output Report as desired.
You can consider using map(), details here: https://www.geeksforgeeks.org/python-map-function/
Some programmers have the habit of avoiding raw loops: "A raw loop is any loop inside a function where the function serves purpose larger than the algorithm implemented by the loop". More details here: https://sean-parent.stlab.cc/presentations/2013-09-11-cpp-seasoning/cpp-seasoning.pdf
I think that's the reason you were asked to remove the for loops.
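As a rough illustration of the map() suggestion (a sketch of mine, reusing the build_model, fit_model, get_metrics and model_report helpers and the models/model_parameters dictionaries from the question; evaluate is a name introduced here), the two report loops could each be collapsed into a single map call:
def evaluate(item, params=None):
    """Build, fit, and score one (name, class) pair; returns one report row."""
    model_name, model_class = item
    model_instance = build_model(model_name, model_class, params)
    fitted_model = fit_model(model_name, model_instance, X_train, y_train)
    # predict_vals() in the question reads the loop's global model_name,
    # so the prediction step is inlined here instead
    X_eval = StandardScaler().fit_transform(X_test) if model_name == 'SVM' else X_test
    y_predicted = fitted_model.predict(X_eval)
    return [model_name] + get_metrics(y_test, y_predicted)

# one map call replaces the "without parameters" loop ...
report_default = model_report(list(map(evaluate, models.items())))
# ... and another replaces the "with parameters" loop
report_tuned = model_report(list(map(lambda item: evaluate(item, model_parameters), models.items())))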
I'm trying to introduce LightGBM for multiclass text classification.
There are 2 columns in a pandas dataframe, where 'category' and 'contents' are set as follows.
Dataframe:
contents category
1 this is example1... A
2 this is example2... B
3 this is example3... C
*Actual data frame consists of approx 600 rows and 2 columns.
Here I'm trying to classify the text into 3 categories as follows.
Codes:
import pandas as pd
import numpy as np
from nltk.corpus import stopwords
stopwords1 = set(stopwords.words('english'))
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
import lightgbm as lgbm
from lightgbm import LGBMClassifier, LGBMRegressor
#--main code--#
X_train, X_test, Y_train, Y_test = train_test_split(df['contents'], df['category'], random_state = 0, test_size=0.3, shuffle=True)
count_vect = CountVectorizer(ngram_range=(1,2), stop_words=stopwords1)
X_train_counts = count_vect.fit_transform(X_train)
tfidf_transformer = TfidfTransformer(use_idf=True, smooth_idf=True, norm='l2', sublinear_tf=True)
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
lgbm_train = lgbm.Dataset(X_train_tfidf, Y_train)
lgbm_eval = lgbm.Dataset(count_vect.transform(X_test), Y_test, reference=lgbm_train)
params = {
'boosting_type':'gbdt',
'objective':'multiclass',
'learning_rate': 0.02,
'num_class': 3,
'early_stopping': 100,
'num_iteration': 2000,
'num_leaves': 31,
'is_enable_sparse': 'true',
'tree_learner': 'data',
'max_depth': 4,
'n_estimators': 50
}
clf_gbm = lgbm.train(params, lgbm_train, valid_sets=lgbm_eval)
predicted_LGBM = clf_gbm.predict(count_vect.transform(X_test))
print(accuracy_score(Y_test, predicted_LGBM))
Then I got an error as:
ValueError: could not convert string to float: 'b'
I also tried converting the 'category' column ['a', 'b', 'c'] to int as [0, 1, 2], but got an error:
TypeError: Expected np.float32 or np.float64, met type(int64).
What's wrong with my code?
Any advice / suggestions will be greatly appreciated.
Thanks in advance.
I managed to deal with this issue. It's very simple, but noted here for reference.
Since LightGBM expects float32/64 input, the 'category' values should be numbers rather than str, and the input data should be converted to float32/64 using .astype().
Change 1:
Added the following 4 lines after X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts):
X_train_tfidf = X_train_tfidf.astype('float32')
X_test_counts = X_test_counts.astype('float32')
Y_train = Y_train.astype('float32')
Y_test = Y_test.astype('float32')
Change 2:
Convert the 'category' column from [A, B, C, ...] to [0.0, 1.0, 2.0, ...], for example as sketched below.
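One way to do that mapping (my own sketch; the original answer does not show the code) is with pandas categorical codes, assuming df is the (contents, category) dataframe from the question:
# let pandas assign integer codes (A -> 0, B -> 1, C -> 2), then cast to float32
df['category'] = df['category'].astype('category').cat.codes.astype('float32')

# or use an explicit mapping if you need to control which label gets which code
# df['category'] = df['category'].map({'A': 0.0, 'B': 1.0, 'C': 2.0})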
Maybe just setting the attribute TfidfVectorizer(dtype=np.float32) works in this case, and feeding the vectorized data to LGBMClassifier is much simpler.
Update
Using TfidfVectorizer is much simpler:
tfidf_vec = TfidfVectorizer(dtype=np.float32, sublinear_tf=True, use_idf=True, smooth_idf=True)
X_data_tfidf = tfidf_vec.fit_transform(df['contents'])
X_train_tfidf = tfidf_vec.transform(X_train)
X_test_tfidf = tfidf_vec.transform(X_test)
clf_LGBM = lgbm.LGBMClassifier(objective='multiclass', verbose=-1, learning_rate=0.5, max_depth=20, num_leaves=50, n_estimators=120, max_bin=2000,)
clf_LGBM.fit(X_train_tfidf, Y_train, verbose=-1)
predicted_LGBM = clf_LGBM.predict(X_test_tfidf)