I want to apply grid search to identify the number of features that should be selected:
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
data = load_breast_cancer()
parameters = {'select__k': range(1,11)}
p = Pipeline([('select', SelectKBest(chi2)), ('model', LogisticRegression())])
clf = GridSearchCV(p, parameters, cv=10, refit=False)
clf.fit(data.data, data.target)
So, for each fold, it will calculate a feature ranking. However, the chi2 scores depend only on the fold's training data, not on k, so computing them once per fold would suffice. Instead, sklearn calculates them number_of_folds * number_of_parameters times: in this case, 100 times instead of just 10. Is there a way to give sklearn a hint to avoid this recomputation?
Update:
I found a solution but it is pretty hacky. So, if you have any better idea, let me know:
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
import pandas as pd
import numpy as np
map_fold2ranking = {}

class WrapperSelection(SelectKBest):
    def __init__(self, selection, k=10):
        self.k = k
        self.selection = selection

    def fit(self, X, y=None):
        # Identify the current training fold by the sum of its row indices
        # (hacky: relies on X being a DataFrame and the key can collide).
        hash_for_fold_ids = np.sum(X.index.values)
        if hash_for_fold_ids in map_fold2ranking:
            # Reuse the ranking computed earlier for this fold.
            self.scores_ = map_fold2ranking[hash_for_fold_ids]
            return self
        self.selection.fit(X, y)
        map_fold2ranking[hash_for_fold_ids] = self.selection.scores_
        self.scores_ = self.selection.scores_
        return self
data = load_breast_cancer()
parameters = {'select__k': range(1, 11)}
p = Pipeline([('select', WrapperSelection(SelectKBest(chi2))), ('model', LogisticRegression())])
clf = GridSearchCV(p, parameters, cv=10, refit=False)
clf.fit(pd.DataFrame(data.data), data.target)
Thank you in advance.
Best regards,
Felix
However, instead of calculating this ranking only once, sklearn calculates it number_of_folds * number_of_parameters times.
This is exactly the purpose of grid search. The model is evaluated on every partition of the data and for every parameter setting on each partition, so the whole pipeline, including the feature scoring, is refit each time.
From the docs:
Exhaustive search over specified parameter values for an estimator.
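That said, a less hacky way to avoid the recomputation is to cache the score function itself with joblib.Memory (joblib ships as a scikit-learn dependency). The chi2 scores depend only on the fold's training data, not on k, so each fold is scored once and the remaining values of k hit the cache. A minimal sketch; the cache directory name ./cachedir is an arbitrary choice:

from joblib import Memory
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Cache chi2(X, y) results on disk, keyed on a hash of the input arrays.
memory = Memory(location='./cachedir', verbose=0)
cached_chi2 = memory.cache(chi2)

data = load_breast_cancer()
p = Pipeline([('select', SelectKBest(cached_chi2)), ('model', LogisticRegression())])
clf = GridSearchCV(p, {'select__k': range(1, 11)}, cv=10, refit=False)
clf.fit(data.data, data.target)

Hashing the fold's arrays replaces recomputation, so the gain grows with how expensive the score function is.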
I'm trying to tune or search for parameters for a scoring function in scikit-learn.
For example, in the pipeline below, I first perform feature selection with SelectKBest, which requires a scoring function (e.g., mutual_info_regression), and finally pass the best features to LinearRegression().
I want to tune the hyperparameter n_neighbors of mutual_info_regression, the scoring function provided to SelectKBest, but it isn't clear to me how I can do that.
Appreciate any help anyone can provide. Thanks!
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.model_selection import GridSearchCV
# generate some data
X = np.random.normal(size=(10, 15))
y = np.random.normal(size=10)
# test scoring function
# default hyperparameter is n_neighbors=3
mutual_info_regression(X, y, n_neighbors=3)
# create pipeline
kbest = SelectKBest(mutual_info_regression) # using default n_neighbors value
pipe = make_pipeline(kbest, LinearRegression())
# how to tune search space for mutual_info_regression n_neighbors?
params = {"selectkbest__score_func": []} # how to define n_neighbors?
grid = GridSearchCV(pipe, params)
Below is a solution I've come up with, but I'm not sure if it's the simplest or best way to tune such a hyperparameter.
I've used functools.partial to create partial objects when creating the parameters grid dictionary params, and each object has a different n_neighbors value.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.model_selection import GridSearchCV
from functools import partial # to create partial objects
X = np.random.normal(size=(10, 15))
y = np.random.normal(size=10)
kbest = SelectKBest(mutual_info_regression)
pipe = make_pipeline(kbest, LinearRegression())
# use functools.partial
params = {
    "selectkbest__score_func": [
        partial(mutual_info_regression, n_neighbors=n) for n in range(1, 5)
    ]
}
grid = GridSearchCV(pipe, params)
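After fitting, the winning partial object can be unpacked through its .keywords attribute to recover the chosen value; a small usage sketch relying on the grid and data defined above:

grid.fit(X, y)
best_func = grid.best_params_["selectkbest__score_func"]
print("best n_neighbors:", best_func.keywords["n_neighbors"])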
I am using the iris flower dataset to do classification. I need to build a confusion matrix through cross validation (folds = 10), but I don't know how to do it. So far I have only generated the confusion matrix of a single round.
# I am using TPOT autoML library for python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.pipeline import make_pipeline, make_union
from tpot.builtins import StackingEstimator
from sklearn.preprocessing import LabelEncoder
tpot_data = pd.read_csv('iris.csv')
tpot_data = tpot_data.apply(LabelEncoder().fit_transform)
features = tpot_data.drop('species', axis=1).values
training_features, testing_features, training_target, testing_target = \
    train_test_split(features, tpot_data['species'].values, random_state=10)
exported_pipeline = make_pipeline(
    StackingEstimator(estimator=GaussianNB()),
    MultinomialNB(alpha=0.01, fit_prior=False)
)
exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
from sklearn import metrics
print("Accuracy:", metrics.accuracy_score(testing_target, results))
pd.crosstab(testing_target, results, rownames=['Actual Class'], colnames=['Predicted Class'])
from sklearn.model_selection import cross_val_score
array_cross_val_score = cross_val_score(estimator=exported_pipeline, X=training_features,
                                        y=training_target, cv=10, scoring='accuracy')
# I would like the confusion matrix to be based on the average cross-validation
np.mean(array_cross_val_score)
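One common way to get a single cross-validated confusion matrix is to collect the out-of-fold predictions with cross_val_predict and build one matrix from them; note this aggregates the predictions of all 10 folds rather than averaging ten per-fold matrices. A sketch reusing the pipeline and data above:

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

# Each sample is predicted by a model trained on the other nine folds.
cv_predictions = cross_val_predict(exported_pipeline, training_features,
                                   training_target, cv=10)
print(confusion_matrix(training_target, cv_predictions))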
So I have to build a regression model to predict wine quality based on 11 inputs. Currently I am evaluating the mean squared error, mean absolute error and R2 scores of various algorithms. I want to decide which algorithm to use, but before I do, I want to make sure my data is not being overfitted/underfitted. Below is the link to the dataset I use (it's a bit different, but the data is exactly the same), as well as my entire code.
Any help is greatly appreciated!
Data:
https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/
Also, the Kaggle link where I copied most of my code from:
https://www.kaggle.com/jhansia/regression-models-analysis-on-the-wine-quality
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
wine = pd.read_csv('wineQualityReds.csv', usecols=lambda x: 'Unnamed' not in x,)
wine.head()
y = wine.quality
X = wine.drop('quality',axis = 1)
from sklearn.model_selection import train_test_split
train_x,test_x,train_y,test_y = train_test_split(X,y,random_state = 0, stratify = y)
from sklearn import preprocessing
scaler = preprocessing.StandardScaler().fit(train_x)
train_x_scaled = scaler.transform(train_x)
test_x_scaled = scaler.transform(test_x)
from sklearn import model_selection
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_error
models = []
models.append(('DecisionTree', DecisionTreeRegressor()))
models.append(('RandomForest', RandomForestRegressor()))
models.append(('GradienBoost', GradientBoostingRegressor()))
models.append(('SVR', SVR()))
names = []
for name, model in models:
    # shuffle=True is required for random_state to take effect
    # (recent sklearn versions raise an error otherwise)
    kfold = model_selection.KFold(n_splits=5, shuffle=True, random_state=2)
    cv_results = model_selection.cross_val_score(model, train_x_scaled, train_y,
                                                 cv=kfold, scoring='neg_mean_absolute_error')
    names.append(name)
    msg = "%s: %f" % (name, -1 * cv_results.mean())
    print(msg)
model = RandomForestRegressor()
model.fit(train_x_scaled,train_y)
pred_y = model.predict(test_x_scaled)
from sklearn import metrics
print('Mean Squared Error:', metrics.mean_squared_error(test_y, pred_y))
print('Mean Absolute Error:', metrics.mean_absolute_error(test_y, pred_y))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(test_y, pred_y)))
print('R2:', metrics.r2_score(test_y, pred_y))
You can use cross validation on the data sets to find out whether the model is overfitting or underfitting: compare the training score to the validation score on each fold. A good training score paired with a much worse validation score points to overfitting; poor scores on both point to underfitting.
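For example, a sketch using cross_validate with return_train_score=True on the random forest from your code (it assumes your train_x_scaled and train_y variables):

from sklearn.model_selection import cross_validate
from sklearn.ensemble import RandomForestRegressor

cv_results = cross_validate(RandomForestRegressor(), train_x_scaled, train_y,
                            cv=5, scoring='neg_mean_absolute_error',
                            return_train_score=True)
# Scores are negated MAE, so flip the sign back before comparing.
print('train MAE:     ', -cv_results['train_score'].mean())
print('validation MAE:', -cv_results['test_score'].mean())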
I am using imbalanced-learn to oversample my data. I want to know how many entries in each class there are after using the oversampling method.
This code works nicely:
from imblearn.over_sampling import SMOTE
from collections import Counter
def oversample(x_values, y_values):
    oversampler = SMOTE(random_state=42, n_jobs=-1)
    x_oversampled, y_oversampled = oversampler.fit_resample(x_values, y_values)
    print("Oversampling training set from {0} to {1} using {2}".format(
        dict(Counter(y_values)), dict(Counter(y_oversampled)),
        type(oversampler).__name__))
    return x_oversampled, y_oversampled
But I switched to using a pipeline so I can use GridSearchCV to find the best oversampling method (out of ADASYN, SMOTE and BorderlineSMOTE). Therefore I never actually call fit_resample myself and lose my output, since I use something like this:
from imblearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier
pipe = Pipeline([('scaler', MinMaxScaler()), ('sampler', SMOTE(random_state=42, n_jobs=-1)), ('estimator', RandomForestClassifier())])
pipe.fit(x_values, y_values)
The upsampling works, but I lose my output on how many entries for each class there are in the training set.
Is there a way of getting output similar to the first example when using a pipeline?
In theory, yes. When an over-sampler is fitted, an attribute sampling_strategy_ is created, containing the number of samples to be generated for the minority class(es) when fit_resample is invoked. You can use it to get output similar to your example above. Here is a modified example based on your code:
# Imports
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
# Create toy dataset
X, y = make_classification(weights=[0.20, 0.80], random_state=0)
init_class_distribution = Counter(y)
min_class_label, _ = init_class_distribution.most_common()[-1]
print(f'Initial class distribution: {dict(init_class_distribution)}')
# Create and fit pipeline
pipe = Pipeline([('scaler', MinMaxScaler()), ('sampler', SMOTE(random_state=42, n_jobs=-1)), ('estimator', RandomForestClassifier(random_state=23))])
pipe.fit(X, y)
sampling_strategy = dict(pipe.steps).get('sampler').sampling_strategy_
expected_n_samples = sampling_strategy.get(min_class_label)
print(f'Expected number of generated samples: {expected_n_samples}')
# Fit and resample over-sampler pipeline
sampler_pipe = Pipeline(pipe.steps[:-1])
X_res, y_res = sampler_pipe.fit_resample(X, y)
actual_class_distribution = Counter(y_res)
print(f'Actual class distribution: {actual_class_distribution}')
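If you only need the expected counts, a shorter route (a sketch reusing the objects above) is to read the fitted sampler from named_steps and combine its sampling_strategy_ with the initial counts:

# Expected counts after resampling, for the class(es) the sampler will resample.
sampler = pipe.named_steps['sampler']
expected = {label: init_class_distribution[label] + n_generated
            for label, n_generated in sampler.sampling_strategy_.items()}
print(f'Expected counts for the resampled class(es): {expected}')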
I would like sklearn's cross_val_score function to return the accuracy for each class separately instead of the average accuracy over all the classes.
Function signature, from the scikit-learn reference:
sklearn.model_selection.cross_val_score(estimator, X, y=None, groups=None,
    scoring=None, cv='warn', n_jobs=None, verbose=0, fit_params=None,
    pre_dispatch='2*n_jobs', error_score='raise-deprecating')
How can I do it?
This is not possible with cross_val_score. The approach you suggest would mean cross_val_score had to return an array of arrays. However, if you look at the source code, you will see that the output of cross_val_score has to be:
Returns
-------
scores : array of float, shape=(len(list(cv)),)
Array of scores of the estimator for each run of the cross validation.
As a result, cross_val_score checks whether the scoring method you are using is multimetric. If it is, it throws an error like:
ValueError: scoring must return a number, got ... instead
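A minimal sketch reproducing that error: make_scorer(recall_score, average=None) yields one recall per class, i.e. an array rather than a number:

from sklearn.datasets import load_iris
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
per_class_recall = make_scorer(recall_score, average=None)
# Raises ValueError: scoring must return a number, got [...]
cross_val_score(DecisionTreeClassifier(), X, y, scoring=per_class_recall, cv=5)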
Edit:
As correctly pointed out in a comment above, an alternative is to use cross_validate instead. Here is how it would work on the Iris dataset, for instance:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate
from sklearn.metrics import make_scorer
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import recall_score
iris = load_iris()
X = iris.data
y = iris.target
scoring = {'recall0': make_scorer(recall_score, average=None, labels=[0]),
           'recall1': make_scorer(recall_score, average=None, labels=[1]),
           'recall2': make_scorer(recall_score, average=None, labels=[2])}
cross_validate(DecisionTreeClassifier(), X, y, scoring=scoring, cv=5, return_train_score=False)
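For multimetric scoring, cross_validate returns a dict with one 'test_<name>' entry per scorer; a small usage sketch to pull the per-class recalls out:

scores = cross_validate(DecisionTreeClassifier(), X, y, scoring=scoring, cv=5)
for name in ('recall0', 'recall1', 'recall2'):
    print(name, scores['test_' + name].mean())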
Note that this is also supported by the GridSearchCV methodology.
NB: You cannot return "accuracy by each class"; I guess you meant recall, which is basically the proportion of correct predictions among the data points that actually belong to a class.