I am using Python 3.5 and the Python implementation of XGBoost, version 0.6.
I built a forward feature selection routine in Python, which iteratively builds the optimal set of features (the set leading to the best score; the metric here is the binary classification error).
On my data set, using the xgb.cv routine, I can get the error rate down to around 0.21 by increasing max_depth (of the trees) up to 40.
But if I then run a custom cross-validation with the same XGBoost parameters, the same folds, the same metric and the same data set, the best score I reach is 0.70 with a max_depth of 4. If I use the optimal max_depth obtained by my xgb.cv routine, the score drops to 0.65. I just don't understand what is happening.
My best guess is that xgb.cv uses different folds (i.e. shuffles the data before partitioning), but I think I already pass the folds as an input to xgb.cv (with the option shuffle=False), so it might be something completely different.
Here is the code of the forward_feature_selection (using xgb.cv):
def Forward_Feature_Selection(train, y_train, params, num_round=30, threshold=0, initial_score=0.5, to_exclude=[], nfold=5):
    k_fold = KFold(n_splits=13)
    selected_features = []
    gain = threshold + 1
    previous_best_score = initial_score
    train = train.drop(train.columns[to_exclude], axis=1)  # df.columns is zero-based pd.Index
    features = train.columns.values
    selected = np.zeros(len(features))
    scores = np.zeros(len(features))
    while gain > threshold:  # we start an add-a-feature loop
        for i in range(0, len(features)):
            if selected[i] == 0:  # take only features not yet selected
                selected_features.append(features[i])
                new_train = train.iloc[:][selected_features]
                selected_features.remove(features[i])
                dtrain = xgb.DMatrix(new_train, y_train, missing=None)
                # dtrain = xgb.DMatrix(pd.DataFrame(new_train), y_train, missing=None)
                if i % 10 == 0:
                    print("Launching XGBoost for feature " + str(i))
                xgb_cv = xgb.cv(params, dtrain, num_round, nfold=13, folds=k_fold, shuffle=False)
                if params['objective'] == 'binary:logistic':
                    scores[i] = xgb_cv.tail(1)["test-error-mean"]  # classification
                else:
                    scores[i] = xgb_cv.tail(1)["test-rmse-mean"]  # regression
            else:
                scores[i] = initial_score  # discard already selected variables from candidates
        best = np.argmin(scores)
        gain = previous_best_score - scores[best]
        if gain > 0:
            previous_best_score = scores[best]
            selected_features.append(features[best])
            selected[best] = 1
            print("Adding feature: " + features[best] + " increases score by " + str(gain) +
                  ". Final score is now: " + str(previous_best_score))
    return selected_features, previous_best_score
and here is my "custom" cross validation:
mean_error_rate = 0
for train, test in k_fold.split(ds):
    dtrain = xgb.DMatrix(pd.DataFrame(ds.iloc[train]), dc.iloc[train]["bin_spread"], missing=None)
    gbm = xgb.train(params, dtrain, 30)
    dtest = xgb.DMatrix(pd.DataFrame(ds.iloc[test]), dc.iloc[test]["bin_spread"], missing=None)
    res.ix[test, "pred"] = gbm.predict(dtest)
    cv_reg = reg.fit(pd.DataFrame(ds.iloc[train]), dc.iloc[train]["bin_spread"])
    res.ix[test, "lasso"] = cv_reg.predict(pd.DataFrame(ds.iloc[test]))
    res.ix[test, "y_xgb"] = res.loc[test, "pred"] > 0.5
    res.ix[test, "xgb_right"] = (res.loc[test, "y_xgb"] == res.loc[test, "bin_spread"])
    print(str(100 * np.sum(res.loc[test, "xgb_right"]) / (N / 13)))
    mean_error_rate += 100 * (np.sum(res.loc[test, "xgb_right"]) / (N / 13))
print("mean_error_rate is : " + str(mean_error_rate / 13))
using the following parameters:
params = {"objective": "binary:logistic",
"booster":"gbtree",
"max_depth":4,
"eval_metric" : "error",
"eta" : 0.15}
res = pd.DataFrame(dc["bin_spread"])
k_fold = KFold(n_splits=13)
N = dc.shape[0]
num_trees = 30
And finally the call to my forward feature selection:
selfeat = Forward_Feature_Selection(dc,
                                    dc["bin_spread"],
                                    params,
                                    num_round=num_trees,
                                    threshold=0,
                                    initial_score=999,
                                    to_exclude=[0, 1, 5, 30, 31],
                                    nfold=13)
Any help in understanding what is happening would be greatly appreciated! Thanks in advance for any tips!
This is normal; I have experienced the same. Firstly, KFold can split differently on each run (for example when shuffling without a fixed random_state), so even though you specify folds for xgb.cv, the two code paths are not guaranteed to see identical partitions.
Next, the initial state of the model is different each time.
There are also internal random states within XGBoost which can cause this; try changing the eval metric to see if the variance reduces. If a particular metric suits your needs, try averaging the best parameters across runs and using that as your optimal parameters.
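Not from the original answer, just a minimal sketch of how one might pin down the randomness so the two code paths become directly comparable: seed the KFold, hand exactly the same splits to xgb.cv via folds, and fix XGBoost's own seed. Here X and y are stand-ins for the question's feature matrix (as a NumPy array) and 0/1 labels, and params is the dictionary from the question.
import numpy as np
import xgboost as xgb
from sklearn.model_selection import KFold

k_fold = KFold(n_splits=13, shuffle=True, random_state=42)   # fixed partitioning
splits = list(k_fold.split(X))                               # materialize the folds once
params_fixed = dict(params, seed=42)                         # fix XGBoost's internal RNG

# Built-in cross-validation on the pre-computed folds
dtrain = xgb.DMatrix(X, label=y)
cv_res = xgb.cv(params_fixed, dtrain, num_boost_round=30, folds=splits)

# Manual loop over exactly the same folds: any remaining gap cannot
# come from the data partitioning.
errors = []
for train_idx, test_idx in splits:
    dtr = xgb.DMatrix(X[train_idx], label=y[train_idx])
    dte = xgb.DMatrix(X[test_idx], label=y[test_idx])
    booster = xgb.train(params_fixed, dtr, num_boost_round=30)
    pred = booster.predict(dte) > 0.5                        # probabilities -> class labels
    errors.append(np.mean(pred != y[test_idx]))

print(cv_res.tail(1))    # test-error-mean from the built-in CV
print(np.mean(errors))   # error rate from the manual CV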
I created the following function to use as an evaluation metric for hyperparameter tuning:
# function to calculate the RMSLE
def get_msle(true, predicted):
    return np.sqrt(msle(true, predicted))

# custom evaluation metric function for the LightGBM
def custom_eval(preds, dtrain):
    labels = dtrain.get_label().astype(np.int)
    preds = preds.clip(min=0)
    return [('rmsle', get_msle(labels, preds))]
I created the following function to train and validate each hyperparameter value:
def get_n_estimators(evaluation_set, min_r, max_r):
    results = []
    for n_est in range(min_r, max_r, 20):
        x = {}
        SCORE_TRAIN = []
        SCORE_VALID = []
        for train, valid in evaluation_set:
            # separate the independent and target variables from the train and validation sets
            train_data_x = train.drop(columns=['WEEK_END_DATE', 'UNITS'])
            train_data_y = train['UNITS']
            valid_data_x = valid.drop(columns=['WEEK_END_DATE', 'UNITS'])
            valid_data_y = valid['UNITS']
            # evaluation sets
            # we will evaluate our model on both the train and validation data
            e_set = [(train_data_x, train_data_y), (valid_data_x, valid_data_y)]
            # define the LGBMRegressor model
            model = lgb.LGBMRegressor(n_estimators=n_est,
                                      learning_rate=0.01,
                                      n_jobs=4,
                                      random_state=0,
                                      objective='regression')
            # fit the model
            model.fit(train_data_x, train_data_y, eval_metric=custom_eval, eval_set=e_set, verbose=False)
            # store the RMSLE on the train and validation sets in separate lists
            # so that we can calculate the mean of the results at the end
            SCORE_TRAIN.append(model.evals_result_['validation_0']['rmsle'][-1])
            SCORE_VALID.append(model.evals_result_['validation_1']['rmsle'][-1])
        # calculate the mean RMSLE on train and valid
        mean_score_train = np.mean(SCORE_TRAIN)
        mean_score_valid = np.mean(SCORE_VALID)
        print('With N_ESTIMATORS:\t' + str(n_est) + '\tMEAN RMSLE TRAIN:\t' + str(mean_score_train) + "\tMEAN RMSLE VALID: " + str(mean_score_valid))
        x['n_estimators'] = n_est
        x['mean_rmsle_train'] = mean_score_train
        x['mean_rmsle_valid'] = mean_score_valid
        results.append(x)
    return pd.DataFrame.from_dict(results)
However, when I try to call the function I get an error. This is the calling code:
n_estimators_result = get_n_estimators(evaluation_set,min_r = 20, max_r = 901)
This is the error I am getting
'numpy.ndarray' object has no attribute 'get_label'
Can you please help me resolve this error? I have been stuck on this for two days now.
I am using the hyperclassifiersearch package to run my grid search with a pipeline. One thing I do not understand is that when I use one-hot encoding (when I switch to target encoding I don't get the error), I get this error from running the code below:
86
87 print('Search is done.')
---> 88 return best_model # allows to predict with the best model overall
89
90 def evaluate_model(self, sort_by='mean_test_score', show_timing_info=False):
UnboundLocalError: local variable 'best_model' referenced before assignment
The code I used to generate that error is as follows:
# define pipeline
numeric_transformer = Pipeline(steps=[('imputer', SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=0))])
preprocessor = ColumnTransformer(transformers=[('num', numeric_transformer, numeric_cols),
                                               ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols)])
model = XGBClassifier(objective='binary:logistic', n_jobs=-1, use_label_encoder=False)
pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('clf', model)])

models = {
    'xgb': pipeline}
params = {
    'xgb': {'clf__n_estimators': [200, 300]}
}
cv = StratifiedKFold(n_splits=3, random_state=42, shuffle=True)
search = HyperclassifierSearch(models, params)
gridsearch = search.train_model(X_train, y_train, cv=cv, scoring='recall')
I don't understand this error. Can anybody help? https://github.com/janhenner/HyperclassifierSearch is the repo for the package.
The full code is:
def train_model(self, X_train, y_train, search='grid', **search_kwargs):
    """
    Optimizing over one or multiple classifiers or pipelines.
    Input:
        X : array or dataframe with features; this should be a training dataset
        y : array or dataframe with label(s); this should be a training dataset
    Output:
        returns the optimal model according to the scoring metric
    Parameters:
        search : str, default='grid'
            define the search
            ``grid`` performs GridSearchCV
            ``random`` performs RandomizedSearchCV
        **search_kwargs : kwargs
            additional parameters passed to the search
    """
    grid_results = {}
    best_score = 0
    for key in self.models.keys():
        print('Search for {}'.format(key), '...')
        assert search in ('grid', 'random'), 'search parameter out of range'
        if search == 'grid':
            grid = GridSearchCV(self.models[key], self.params[key], **search_kwargs)
        if search == 'random':
            grid = RandomizedSearchCV(self.models[key], self.params[key], **search_kwargs)
        grid.fit(X_train, y_train)
        self.grid_results[key] = grid
        if grid.best_score_ > best_score:  # return best model
            best_score = grid.best_score_
            best_model = grid
    print('Search is done.')
    return best_model  # allows to predict with the best model overall
It seems that in some situations all models may have a score <= 0, and then the line
best_model = grid
is never executed, so the variable best_model is never created and return best_model fails with the error you see.
You should assign a default value, i.e. best_model = None, so the variable exists from the start.
Or you could start with a lower initial score, i.e. best_score = -1.
You should also report this problem to the author of the module.
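To make the failure mode concrete, here is a minimal illustration of my own (not code from the package) that raises the same UnboundLocalError whenever no candidate beats the starting score:
def pick_best(scores):
    best_score = 0
    for s in scores:
        if s > best_score:   # never True when every score is <= 0
            best_score = s
            best_model = s
    return best_model        # UnboundLocalError: local variable 'best_model' referenced before assignment

pick_best([-0.2, -0.5])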
EDIT:
I added best_model = None, but now you have to remember to check whether you got None when you run it.
"""
Optimizing over one or multiple classifiers or pipelines.
Input:
X : array or dataframe with features; this should be a training dataset
y : array or dataframe with label(s); this should be a training dataset
Output:
returns the optimal model according to the scoring metric
Parameters:
search : str, default='grid'
define the search
``grid`` performs GridSearchCV
``random`` performs RandomizedSearchCV
**search_kwargs : kwargs
additional parameters passed to the search
"""
grid_results = {}
best_score = 0
best_model = None # <--- default value at start
for key in self.models.keys():
print('Search for {}'.format(key), '...')
assert search in ('grid', 'random'), 'search parameter out of range'
if search=='grid':
grid = GridSearchCV(self.models[key], self.params[key], **search_kwargs)
if search=='random':
grid = RandomizedSearchCV(self.models[key], self.params[key], **search_kwargs)
grid.fit(X_train, y_train)
self.grid_results[key] = grid
if grid.best_score_ > best_score: # return best model
best_score = grid.best_score_
best_model = grid
print('Search is done.')
return best_model # allows to predict with the best model overall
And later
gridsearch = search.train_model(X_train, y_train, cv=cv, scoring='recall')

if not gridsearch:  # check if None
    print("Didn't find model")
else:
    # ... code ...
I'm trying to use GridSearchCV to optimize the hyperparameters of my classifier by optimizing a custom scoring function. The problem is that the scoring function depends on a cost that is different for each instance (the cost is also a feature of each instance). As shown in the example below, a third array, test_amt, is needed that holds the cost of each instance, in addition to the 'normal' inputs of a scoring function (y and y_pred).
def calculate_costs(y_test, y_test_pred, test_amt):
    cost = 0
    for i in range(1, len(y_test)):
        y = y_test.iloc[i]
        y_pred = y_test_pred.iloc[i]
        x_amt = test_amt.iloc[i]
        if y == 0 and y_pred == 0:
            cost -= x_amt * 1.1
        elif y == 0 and y_pred == 1:
            cost += x_amt
        elif y == 1 and y_pred == 0:
            cost += x_amt * 1.1
        elif y == 1 and y_pred == 1:
            cost += 0
        else:
            print("ERROR! No cost could be assigned to the instance: " + str(i))
    return cost
When I call this function after training with the three arrays, it correctly calculates the total cost resulting from a model. However, integrating it into GridSearchCV is difficult, because the scoring function only expects two parameters. While it is possible to pass additional kwargs to the scorer, I have no clue how to pass a subset that depends on the split GridSearchCV is currently working on.
What I have thought of / tried so far:
Wrapping the whole pipeline in a class with a globally stored pandas.Series object that holds the cost of each instance by index. Then it would theoretically be possible to reference the cost of an instance by its index. Unfortunately, this does not work because scikit-learn transforms everything into a numpy array.
def calculate_costs_class(y_test, y_test_pred):
    cost = 0
    for index, _ in y_test.iteritems():
        y = y_test.loc[index]
        y_pred = y_test_pred.loc[index]
        x_amt = self.test_amt.loc[index]
        if y == 0 and y_pred == 0:
            cost += (x_amt * (-1)) + 5 + (x_amt * 0.1)  # -revenue, +shipping, +fees
        elif y == 0 and y_pred == 1:
            cost += x_amt  # +revenue
        elif y == 1 and y_pred == 0:
            cost += x_amt + 5 + (x_amt * 0.1) + 5  # +revenue, +shipping, +fees, +charge cost
        elif y == 1 and y_pred == 1:
            cost += 0  # nothing
        else:
            print("ERROR! No cost could be assigned to the instance: " + str(index))
    return cost
Creating a custom PseudoInt class to use as the data type of the labels, which inherits all properties from int but can also store the cost of an instance (while retaining its properties for logical operations). While this works outside of scikit-learn, the check_classification_targets method in scikit-learn raises ValueError: Unknown label type: 'unknown'.
class PseudoInt(int):
    def __new__(cls, x, cost, *args, **kwargs):
        instance = int.__new__(cls, x, *args, **kwargs)
        instance.cost = cost
        return instance
I haven't tried it, but I have thought of this: since the cost is also a feature in the instance set X, it is available in the __call__ method of the _PredictScorer(_BaseScorer) class in scikit-learn's scorer.py. If I reprogrammed the __call__ method to also pass the cost column of X to the score_func, I would have the cost as well.
Or: I could just implement everything myself.
Is there an "easier" solution?
I found a way to solve the problem by going down the path of the second proposed approach: passing a PseudoInt to scikit-learn that has all the same properties as a normal int when compared or used in mathematical operations. However, it also acts as a wrapper for the int, so instance variables (such as the cost of an instance) can be stored on it. As already stated in the question, this causes scikit-learn to recognize that the values inside the passed label array are in fact of type object rather than int. So I just replaced the test in the type_of_target(y) method of scikit-learn's multiclass.py at line 273 to return 'binary' even though it doesn't pass the test, so that scikit-learn treats the whole problem (as it should) as a binary classification problem. Lines 269-273 of the type_of_target(y) method in multiclass.py now look like:
# Invalid inputs
if y.ndim > 2 or (y.dtype == object and len(y) and
                  not isinstance(y.flat[0], string_types)):
    # return 'unknown'  # [[[1, 2]]] or [obj_1] and not ["label_1"]
    return 'binary'  # Sneaky, modified to force binary classification.
My code then looks like this:
import sklearn
import sklearn.model_selection
import sklearn.base
import sklearn.metrics
import numpy as np
import sklearn.tree
import sklearn.feature_selection
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics.scorer import make_scorer


class PseudoInt(int):
    # Behaves like an integer, but is able to store instance variables
    pass


def grid_search(x, y_normal, x_amounts):
    # Change the label set to a np array containing pseudo ints with the costs associated with the instances
    y = np.empty(len(y_normal), dtype=PseudoInt)
    for index, value in y_normal.iteritems():
        new_int = PseudoInt(value)
        new_int.cost = x_amounts.loc[index]  # Here the cost is added to the label
        y[index] = new_int

    # Normal train test split
    x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x, y, test_size=0.2)

    # Classifier
    clf = sklearn.tree.DecisionTreeClassifier()

    # Custom scorer with the cost function below (lower cost is better)
    cost_scorer = make_scorer(cost_function, greater_is_better=False)

    # Define pipeline
    pipe = Pipeline([('clf', clf)])

    # Grid search grid with any hyper parameters or other settings
    param_grid = [
        {'clf__criterion': ['gini', 'entropy']}  # targets the DecisionTreeClassifier step named 'clf'
    ]

    # Grid search and pass the custom scorer function
    gs = GridSearchCV(estimator=pipe,
                      param_grid=param_grid,
                      scoring=cost_scorer,
                      n_jobs=1,
                      cv=5,
                      refit=True)

    # Run grid search and refit with best hyper parameters
    gs = gs.fit(x_train.as_matrix(), y_train)
    print("Best Parameters: " + str(gs.best_params_))
    print('Best Accuracy: ' + str(gs.best_score_))

    # Predict with retrained model (with best parameters)
    y_test_pred = gs.predict(x_test.as_matrix())

    # Get scores (also cost score)
    get_scores(y_test, y_test_pred)


def get_scores(y_test, y_test_pred):
    print("Getting scores")

    print("SCORES")
    precision = sklearn.metrics.precision_score(y_test, y_test_pred)
    recall = sklearn.metrics.recall_score(y_test, y_test_pred)
    f1_score = sklearn.metrics.f1_score(y_test, y_test_pred)
    accuracy = sklearn.metrics.accuracy_score(y_test, y_test_pred)
    print("Precision " + str(precision))
    print("Recall " + str(recall))
    print("Accuracy " + str(accuracy))
    print("F1_Score " + str(f1_score))

    print("COST")
    cost = cost_function(y_test, y_test_pred)
    print("Cost Savings " + str(-cost))

    print("CONFUSION MATRIX")
    cnf_matrix = sklearn.metrics.confusion_matrix(y_test, y_test_pred)
    cnf_matrix = cnf_matrix.astype('float') / cnf_matrix.sum(axis=1)[:, np.newaxis]
    print(cnf_matrix)


def cost_function(y_test, y_test_pred):
    """
    Calculates total cost based on TP, FP, TN, FN and the cost of a certain instance
    :param y_test: Has to be an array of PseudoInts containing the cost of each instance
    :param y_test_pred: Any array of PseudoInts or ints
    :return: Returns total cost
    """
    cost = 0
    for index in range(len(y_test)):
        # print(index)
        y = y_test[index]
        y_pred = y_test_pred[index]
        x_amt = y.cost
        if y == 0 and y_pred == 0:
            cost -= x_amt  # Reducing cost by x_amt
        elif y == 0 and y_pred == 1:
            cost += x_amt  # Wrong classification adds cost
        elif y == 1 and y_pred == 0:
            cost += x_amt + 5  # Wrong classification adds cost and fee
        elif y == 1 and y_pred == 1:
            cost += 0  # No cost
        else:
            raise ValueError("No cost could be assigned to the instance: " + str(index))
    # print("Cost: " + str(cost))
    return cost
UPDATE
Instead of changing the files in the package directly (which is kind of dirty), I now add the following at the top of my project's imports:
import sklearn.utils.multiclass

def return_binary(y):
    return "binary"

sklearn.utils.multiclass.type_of_target = return_binary
This overwrites the type_of_target(y) method in sklearn.utils.multiclass to always return 'binary'. Note that this has to come before all the other sklearn imports.
I'm using scikit-learn for metaheuristics exercises and I have a question: I need to use k-NN, so I have a KNeighborsClassifier object with n_jobs=-1. As the docs say, I have to set the multiprocessing start method to forkserver. But the k-NN is much slower with n_jobs=-1 than with n_jobs=1.
Here is a piece of the code:
### Some initialization here ###
skf = StratifiedKFold(target, n_folds=2, shuffle=True)
for train_index, test_index in skf:
    data_train, data_test = data[train_index], data[test_index]
    target_train, target_test = target[train_index], target[test_index]
    start = time()
    selected_features, score = SFS(data_train, data_test, target_train, target_test, knn)
    end = time()
    logger.info("SFS - Time elapsed: " + str(end - start) + ". Score: " + str(score) +
                ". Selected features: " + str(sum(selected_features)))

if __name__ == "__main__":
    import multiprocessing as mp; mp.set_start_method('forkserver', force=True)
    main()
This is the SFS function
def SFS(data_train, data_test, target_train, target_test, classifier):
    rowsize = len(data_train[0])
    selected_features = np.zeros(rowsize, dtype=np.bool)
    best_score = 0
    best_feature = 0
    while best_feature is not None:
        end = True
        best_feature = None
        for idx in range(rowsize):
            if selected_features[idx]:
                continue
            selected_features[idx] = True
            classifier.fit(data_train[:, selected_features], target_train)
            score = classifier.score(data_test[:, selected_features], target_test)
            selected_features[idx] = False
            if score > best_score:
                best_score = score
                best_feature = idx
        if best_feature is not None:
            selected_features[best_feature] = True
    return selected_features, best_score
I don't understand how n_jobs > 1 can be slower than n_jobs = 1. Can anyone explain that to me? I've tried with 3 datasets.
I found that many people have had the same problem: n_jobs does not seem to help with the nearest-neighbors models in sklearn, and they also complained that only one CPU core was loaded.
In my experiments, the fitting process uses just a single core whether n_jobs > 1 or not. So however large you set n_jobs, if your training sample is large, the training time will be long and will not be reduced.
And the reason n_jobs > 1 is even slower than n_jobs = 1 is the cost of distributing resources for multiprocessing.
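If you want to measure that overhead yourself, here is a small timing sketch of my own (synthetic data, not the poster's datasets; all names are made up for the illustration):
import time
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=5000, n_features=30, random_state=0)

for n_jobs in (1, -1):
    knn = KNeighborsClassifier(n_neighbors=5, n_jobs=n_jobs)
    knn.fit(X, y)            # fitting k-NN mostly just stores the training data
    start = time.time()
    knn.score(X, y)          # the neighbor search is where n_jobs matters
    print("n_jobs=%s: %.3f s" % (n_jobs, time.time() - start))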
I am using decision stumps with a BaggingClassifier to classify some data:
def fit_ensemble(attributes, class_val, n_estimators):
    # max depth is 1
    decisionStump = DecisionTreeClassifier(criterion='entropy', max_depth=1)
    ensemble = BaggingClassifier(base_estimator=decisionStump, n_estimators=n_estimators, verbose=3)
    return ensemble.fit(attributes, class_val)

def predict_all(fitted_classifier, instances):
    for i, instance in enumerate(instances):
        instances[i] = fitted_classifier.predict([instances[i]])
    return list(itertools.chain(*instances))

def main(filename, n_estimators):
    df_ = read_csv(filename)
    col_names = df_.columns.values.tolist()
    attributes = col_names[0:-1]  ## 0..n-1
    class_val = col_names[-1]  ## n
    fitted = fit_ensemble(df_[attributes].values, df_[class_val].values, n_estimators)
    fitted_classifiers = fitted.estimators_  # get the three decision stumps.
    compared_ = DataFrame(index=range(0, len(df_.index)), columns=range(0, n_estimators + 1))
    compared_ = compared_.fillna(0)
    compared_.ix[:, n_estimators] = df_[class_val].values
    for i, fitted_classifier in enumerate(fitted_classifiers):
        compared_.ix[:, i] = predict_all(fitted_classifier, df_[attributes].values)
I would like to inspect the random subset used to train each decision stump. I have looked at the documentation for both the ensemble and decision tree class, but haven't found any attributes or methods that yield the training subset. Is this a futile task? Or is there some way, perhaps while the tree is training, to output the training subset?
I am very new to pandas, but come from an R background. My code is definitely not optimized, though I can assure you that the dataset is very small for my task. Thanks for the help.
It looks like I have answered my own question: the estimators_samples_ attribute of the fitted BaggingClassifier is what I want.
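For completeness, a short sketch of my own (reusing the fit_ensemble function and DataFrame from the question) showing how it can be used. Note that estimators_samples_ lives on the fitted BaggingClassifier, and depending on the scikit-learn version it holds index arrays or boolean masks:
fitted = fit_ensemble(df_[attributes].values, df_[class_val].values, n_estimators)
for i, samples in enumerate(fitted.estimators_samples_):
    # 'samples' selects the rows this particular stump was bagged on
    subset_X = df_[attributes].values[samples]
    subset_y = df_[class_val].values[samples]
    print("Stump %d was trained on %d rows" % (i, len(subset_X)))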