I want to apply grid search to identify the number of features that should be selected:
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
data = load_breast_cancer()
parameters = {'select__k': range(1,11)}
p = Pipeline([('select', SelectKBest(chi2)), ('model', LogisticRegression())])
clf = GridSearchCV(p, parameters, cv=10, refit=False)
clf.fit(data.data, data.target)
For each fold, SelectKBest calculates a feature ranking. That ranking depends only on the training fold, not on k, yet sklearn recalculates it number_of_folds * number_of_parameters times: 100 times in this case instead of just 10. Is there a way to give sklearn a hint to avoid the recomputation?
I found a solution but it is pretty hacky. So, if you have any better idea, let me know:
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
import pandas as pd
import numpy as np
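# Cache of feature scores, keyed by a cheap fingerprint of the training fold
map_fold2ranking = {}
class WrapperSelection(SelectKBest):
    def __init__(self, selection, k=10):
        self.k = k
        self.selection = selection
    def fit(self, X, y=None):
        # The sum of the row indices identifies the fold (collisions are possible)
        hash_for_fold_ids = np.sum(X.index.values)
        if hash_for_fold_ids in map_fold2ranking:
            self.scores_ = map_fold2ranking[hash_for_fold_ids]
            return self
        # First time this fold is seen: fit the wrapped selector to get the scores
        self.selection.fit(X, y)
        map_fold2ranking[hash_for_fold_ids] = self.selection.scores_
        self.scores_ = self.selection.scores_
        return self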
data = load_breast_cancer()
parameters = {'select__k': range(1, 11)}
p = Pipeline([('select', WrapperSelection(SelectKBest(chi2))), ('model', LogisticRegression())])
clf = GridSearchCV(p, parameters, cv=10, refit=False)
clf.fit(pd.DataFrame(data.data), data.target)
This is exactly how grid search is meant to work. The whole pipeline is refit and evaluated on every partition of the data, for every parameter setting on each partition.
From the docs:
Exhaustive search over specified parameter values for an estimator.
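If the repeated ranking is a real cost, one option is to leave GridSearchCV aside and write the loop by hand, so that the chi2 scores are computed once per fold and reused for every k. The following is only a sketch of that idea, not something sklearn does for you: the variable names, the use of StratifiedKFold to mimic cv=10, and max_iter=5000 are my own assumptions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

X, y = load_breast_cancer(return_X_y=True)
ks = range(1, 11)
cv = StratifiedKFold(n_splits=10)  # assumption: mirrors cv=10 for a classifier

fold_scores = {k: [] for k in ks}
for train_idx, test_idx in cv.split(X, y):
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]

    # The ranking is computed once per fold, on the training part only
    chi2_scores, _ = chi2(X_train, y_train)
    ranking = np.argsort(chi2_scores)[::-1]

    # Reuse the same ranking for every candidate k
    for k in ks:
        cols = ranking[:k]
        model = LogisticRegression(max_iter=5000)
        model.fit(X_train[:, cols], y_train)
        fold_scores[k].append(model.score(X_test[:, cols], y_test))

mean_scores = {k: float(np.mean(v)) for k, v in fold_scores.items()}
best_k = max(mean_scores, key=mean_scores.get)
print(best_k, mean_scores[best_k])
```

The price is that you lose the conveniences of GridSearchCV (parallelism, cv_results_, refit), so this is only worth it when the scoring function is expensive compared to the model.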
I'm trying to tune or search for parameters for a scoring function in scikit-learn.
For example, in the pipeline below, I first perform feature selection with SelectKBest, which requires a scoring function (e.g., mutual_info_regression), and finally pass the best features to LinearRegression().
I want to tune the hyperparameter n_neighbors of mutual_info_regression, which is the scoring function provided to SelectKBest, but it isn't clear to me how I can do that.
Appreciate any help anyone can provide. Thanks!
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.model_selection import GridSearchCV
# generate some data
X = np.random.normal(size=(10, 15))
y = np.random.normal(size=10)
# test scoring function
# default hyperparameter is n_neighbors=3
mutual_info_regression(X, y, n_neighbors=3)
# create pipeline
kbest = SelectKBest(mutual_info_regression) # using default n_neighbors value
pipe = make_pipeline(kbest, LinearRegression())
# how to tune search space for mutual_info_regression n_neighbors?
params = {"selectkbest__score_func": []} # how to define n_neighbors?
grid = GridSearchCV(pipe, params)
Below is a solution I've come up with but I'm not sure if it's the simplest/best way to tune such a hyperparameter.
I've used functools.partial to create partial objects when creating the parameters grid dictionary params, and each object has a different n_neighbors value.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.model_selection import GridSearchCV
from functools import partial # to create partial objects
X = np.random.normal(size=(10, 15))
y = np.random.normal(size=10)
kbest = SelectKBest(mutual_info_regression)
pipe = make_pipeline(kbest, LinearRegression())
# use functools.partial
params = {
    "selectkbest__score_func": [
        partial(mutual_info_regression, n_neighbors=n) for n in range(1, 5)
    ]
}
grid = GridSearchCV(pipe, params)
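To complete the picture, here is a sketch of fitting such a grid and reading the winning n_neighbors back out. A partial object stores its keyword arguments in its keywords attribute, so the value can be recovered from best_params_. The data (60 samples with a signal in the first column), k=5, cv=3 and random_state=0 are assumptions I made so the example runs on its own.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Toy data (assumption): the target depends on the first feature plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 15))
y = X[:, 0] + 0.1 * rng.normal(size=60)

pipe = make_pipeline(SelectKBest(mutual_info_regression, k=5), LinearRegression())

# One partial object per candidate n_neighbors value
params = {
    "selectkbest__score_func": [
        partial(mutual_info_regression, n_neighbors=n, random_state=0)
        for n in range(1, 5)
    ]
}

grid = GridSearchCV(pipe, params, cv=3)
grid.fit(X, y)

# functools.partial keeps its keyword arguments in .keywords
best_n = grid.best_params_["selectkbest__score_func"].keywords["n_neighbors"]
print("best n_neighbors:", best_n)
```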
I am using the iris flower dataset for classification. I need to build a confusion matrix from 10-fold cross-validation, but I don't know how to do it. So far I have only generated the confusion matrix for a single train/test split.
# I am using TPOT autoML library for python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.pipeline import make_pipeline, make_union
from tpot.builtins import StackingEstimator
from sklearn.preprocessing import LabelEncoder
tpot_data = pd.read_csv('iris.csv')
tpot_data = tpot_data.apply(LabelEncoder().fit_transform)
features = tpot_data.drop('species', axis=1).values
training_features, testing_features, training_target, testing_target = \
    train_test_split(features, tpot_data['species'].values, random_state=10)
exported_pipeline = make_pipeline(
    StackingEstimator(estimator=GaussianNB()),
    MultinomialNB(alpha=0.01, fit_prior=False)
)
exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
from sklearn import metrics
print("Accuracy:", metrics.accuracy_score(testing_target, results))
pd.crosstab(testing_target, results, rownames=['Actual Class'], colnames=['Predicted Class'])
from sklearn.model_selection import cross_val_score
array_cross_val_score = cross_val_score(estimator=exported_pipeline, X=training_features,
y=training_target, cv=10, scoring='accuracy')
# I would like the confusion matrix to be based on the average cross-validation
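One way to get a confusion matrix out of cross-validation is cross_val_predict, which returns, for every sample, the prediction made when that sample was in the held-out fold. Feeding those predictions to confusion_matrix gives a single matrix covering all 10 folds. The sketch below assumes the iris data bundled with sklearn and a plain GaussianNB in place of the exported TPOT pipeline, just to stay self-contained.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# 10 stratified folds; each sample is predicted exactly once, while held out
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=10)
y_pred = cross_val_predict(GaussianNB(), X, y, cv=cv)

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y, y_pred)
print(cm)
```

Note that this is the sum of the per-fold matrices rather than an average. Dividing by the number of folds would give the average counts per fold, if that is what you need.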
I have to build a regression model to predict wine quality based on 11 inputs. Currently I am evaluating the mean squared error, mean absolute error and R2 score of various algorithms. I want to decide which algorithm to use, but before I do, I want to make sure my model is not overfitting or underfitting. Below is the link to the dataset I use (it's a bit different but the data is exactly the same) as well as my entire code.
Any help is greatly appreciated!
Also, the Kaggle link where I copied most of my code from:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
wine = pd.read_csv('wineQualityReds.csv', usecols=lambda x: 'Unnamed' not in x,)
y = wine.quality
X = wine.drop('quality',axis = 1)
from sklearn.model_selection import train_test_split
train_x,test_x,train_y,test_y = train_test_split(X,y,random_state = 0, stratify = y)
from sklearn import preprocessing
scaler = preprocessing.StandardScaler().fit(train_x)
train_x_scaled = scaler.transform(train_x)
test_x_scaled = scaler.transform(test_x)
from sklearn import model_selection
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_error
models = []
models.append(('DecisionTree', DecisionTreeRegressor()))
models.append(('RandomForest', RandomForestRegressor()))
models.append(('GradienBoost', GradientBoostingRegressor()))
models.append(('SVR', SVR()))
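names = []
for name, model in models:
    # shuffle=True is needed for random_state to have any effect in KFold
    kfold = model_selection.KFold(n_splits=5, shuffle=True, random_state=2)
    cv_results = model_selection.cross_val_score(model, train_x_scaled, train_y, cv=kfold, scoring='neg_mean_absolute_error')
    names.append(name)
    msg = "%s: %f" % (name, -1 * cv_results.mean())
    print(msg)
model = RandomForestRegressor()
model.fit(train_x_scaled, train_y)  # the model must be fitted before predicting
pred_y = model.predict(test_x_scaled)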
from sklearn import metrics
print('Mean Squared Error:', metrics.mean_squared_error(test_y, pred_y))
print('Mean Absolute Error:', metrics.mean_absolute_error(test_y, pred_y))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(test_y, pred_y)))
print('R2:', metrics.r2_score(test_y, pred_y))
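You can use cross-validation to check whether the model is overfitting or underfitting: compare the training score with the validation score in each fold. A training score that is far better than the validation score suggests overfitting, while poor scores on both suggest underfitting.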
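As a rough sketch of that check, cross_validate with return_train_score=True reports the score on both the training folds and the held-out fold. The synthetic data from make_regression stands in for the wine dataset here, and the model settings are arbitrary choices of mine.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in for the wine data (assumption): 11 input features
X, y = make_regression(n_samples=300, n_features=11, noise=20.0, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=2)
results = cross_validate(
    RandomForestRegressor(n_estimators=50, random_state=0),
    X, y,
    cv=cv,
    scoring="neg_mean_absolute_error",
    return_train_score=True,  # also score on the folds used for training
)

# Scores are negated MAE, so flip the sign to get errors
train_mae = -results["train_score"].mean()
val_mae = -results["test_score"].mean()
print(f"train MAE: {train_mae:.2f}, validation MAE: {val_mae:.2f}")
```

A large gap between the two numbers points towards overfitting. If both errors are high, the model is more likely underfitting.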
I am using imbalanced-learn to oversample my data. I want to know how many entries in each class there are after using the oversampling method.
This code works nicely:
from imblearn.over_sampling import SMOTE
from collections import Counter
def oversample(x_values, y_values):
    oversampler = SMOTE(random_state=42, n_jobs=-1)
    x_oversampled, y_oversampled = oversampler.fit_resample(x_values, y_values)
    # Report the class counts before and after resampling
    print("Oversampling training set from {0} to {1} using {2}".format(dict(Counter(y_values)), dict(Counter(y_oversampled)), type(oversampler).__name__))
    return x_oversampled, y_oversampled
But I switched to using a pipeline so I can use GridSearchCV to find the best oversampling method (out of ADASYN, SMOTE and BorderlineSMOTE). Therefore I never actually call fit_resample myself and lose my output using something like this:
from imblearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier
pipe = Pipeline([('scaler', MinMaxScaler()), ('sampler', SMOTE(random_state=42, n_jobs=-1)), ('estimator', RandomForestClassifier())])
pipe.fit(x_values, y_values)
The upsampling works, but I lose my output on how many entries for each class there are in the training set.
Is there a way of getting output similar to the first example when using a pipeline?
In theory yes. When an over-sampler is fitted, an attribute sampling_strategy_ is created, containing the number of samples from the minority class(es) to be generated when fit_resample is invoked. You can use it to get a similar output as your example above. Here is a modified example based on your code:
# Imports
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
# Create toy dataset
X, y = make_classification(weights=[0.20, 0.80], random_state=0)
init_class_distribution = Counter(y)
min_class_label, _ = init_class_distribution.most_common()[-1]
print(f'Initial class distribution: {dict(init_class_distribution)}')
# Create and fit pipeline
pipe = Pipeline([('scaler', MinMaxScaler()), ('sampler', SMOTE(random_state=42, n_jobs=-1)), ('estimator', RandomForestClassifier(random_state=23))])
pipe.fit(X, y)
sampling_strategy = dict(pipe.steps).get('sampler').sampling_strategy_
expected_n_samples = sampling_strategy.get(min_class_label)
print(f'Expected number of generated samples: {expected_n_samples}')
# Fit and resample over-sampler pipeline
sampler_pipe = Pipeline(pipe.steps[:-1])
X_res, y_res = sampler_pipe.fit_resample(X, y)
actual_class_distribution = Counter(y_res)
print(f'Actual class distribution: {actual_class_distribution}')
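If you only need the counts and do not want to resample a second time, the arithmetic is simple: the class distribution after over-sampling is the initial distribution plus the number of generated samples per class stored in sampling_strategy_. Here is a small sketch of that calculation in plain Python. The helper name resampled_counts and the example dictionary are hypothetical; in practice you would pass the sampling_strategy_ of the fitted sampler.

```python
from collections import Counter

def resampled_counts(y, sampling_strategy):
    """Class counts after over-sampling: initial counts plus generated samples."""
    counts = Counter(y)
    for label, n_generated in sampling_strategy.items():
        counts[label] += n_generated
    return dict(counts)

# Hypothetical example: 20 samples of class 0, 80 of class 1,
# and a sampler that reports it will generate 60 samples for class 0
y = [0] * 20 + [1] * 80
sampling_strategy = {0: 60}  # stand-in for sampler.sampling_strategy_

print(resampled_counts(y, sampling_strategy))  # {0: 80, 1: 80}
```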
I would like the cross_val_score function from sklearn to return the accuracy for each class instead of the average accuracy over all classes.
sklearn.model_selection.cross_val_score(estimator, X, y=None, groups=None,
    scoring=None, cv='warn', n_jobs=None, verbose=0, fit_params=None,
    pre_dispatch='2*n_jobs', error_score='raise-deprecating')
How can I do it?
This is not possible with cross_val_score. The approach you suggest would require cross_val_score to return an array of arrays. However, if you look at the source code, you will see that the output of cross_val_score has to be:
scores : array of float, shape=(len(list(cv)),)
Array of scores of the estimator for each run of the cross validation.
As a result, cross_val_score checks whether the scoring method you are using is multimetric. If it is, it raises an error like:
ValueError: scoring must return a number, got ... instead
As a comment above correctly points out, an alternative is to use cross_validate instead. Here is how it would work on the Iris dataset, for instance:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate
from sklearn.metrics import make_scorer
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import recall_score
iris = load_iris()
X = iris.data
y = iris.target
scoring = {'recall0': make_scorer(recall_score, average=None, labels=[0]),
           'recall1': make_scorer(recall_score, average=None, labels=[1]),
           'recall2': make_scorer(recall_score, average=None, labels=[2])}
cross_validate(DecisionTreeClassifier(), X, y, scoring=scoring, cv=5, return_train_score=False)
Note that this is also supported by the GridSearchCV methodology.
NB: You cannot return "accuracy by each class". I guess you meant recall, which is the proportion of correct predictions among the data points that actually belong to a class.
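For completeness, here is a sketch of how the per-class values can be pulled out of the dictionary that cross_validate returns: each scorer named name shows up under the key test_name, with one value per fold. The helper recall_for is my own, and I use average="macro" over a single label (which is just that label's recall) so that each scorer returns a plain float.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

def recall_for(label):
    # Macro-average over one label is simply the recall of that label
    return make_scorer(recall_score, labels=[label], average="macro")

scoring = {f"recall{label}": recall_for(label) for label in (0, 1, 2)}

results = cross_validate(DecisionTreeClassifier(random_state=0), X, y,
                         scoring=scoring, cv=5)

# One array of per-fold recalls for each class
per_class = {name: results[f"test_{name}"] for name in scoring}
for name, values in per_class.items():
    print(name, values.round(2), "mean:", round(float(values.mean()), 2))
```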