I'm running feature selection and I've been using RFECV to find the optimal number of features.
However, there are certain features I'd like to keep, so I was wondering whether there's any way to force the algorithm to keep those selected ones and run RFECV only on the remaining features.
So far, I'm running it on all of the features, by using:
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import RepeatedKFold
from sklearn.feature_selection import RFECV
from sklearn.metrics import mean_squared_error

def main():
    df_data = pd.read_csv(csv_file_path, index_col=0)
    X_train, y_train, X_test, y_test = split_data(df_data)
    feats_selection(X_train, y_train, X_test, y_test)

def feats_selection(X_train, y_train, X_test, y_test):
    nr_splits = 10
    nr_repeats = 1
    features_step = 1

    est = DecisionTreeRegressor()
    cv_mode = RepeatedKFold(n_splits=nr_splits, n_repeats=nr_repeats, random_state=1)
    rfecv = RFECV(estimator=est, step=features_step, cv=cv_mode,
                  scoring='neg_mean_squared_error', verbose=0)

    ## >>> here, RFECV automatically selects the optimal features <<<
    X_train_transformed = rfecv.fit_transform(X_train, y_train)
    X_test_transformed = rfecv.transform(X_test)

    ## evaluate on the test subset
    est.fit(X_train_transformed, y_train)
    y_pred = est.predict(X_test_transformed)
    rmse = mean_squared_error(y_test, y_pred, squared=False)
RFECV doesn't have such a parameter, no.
Perhaps the cleanest way to accomplish it uses a ColumnTransformer:
cols_to_always_keep = [...]  # column names if you'll fit on a dataframe, column indices otherwise
col_sel = ColumnTransformer(
    transformers=[("keep", "passthrough", cols_to_always_keep)],
    remainder=rfecv,
)
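A minimal sketch of how that could slot into the feats_selection function from the question, assuming X_train and X_test are DataFrames (so column names can be used) and with the names in cols_to_always_keep as placeholders for your own columns:

from sklearn.compose import ColumnTransformer

cols_to_always_keep = ["feat_a", "feat_b"]  # hypothetical column names, replace with your own

col_sel = ColumnTransformer(
    transformers=[("keep", "passthrough", cols_to_always_keep)],
    remainder=rfecv,  # RFECV is fitted only on the remaining columns
)

# y_train is forwarded to the remainder's fit, so RFECV can run its selection
X_train_transformed = col_sel.fit_transform(X_train, y_train)
X_test_transformed = col_sel.transform(X_test)

# the final estimator then trains on the kept columns plus the RFECV-selected ones
est.fit(X_train_transformed, y_train)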
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# split dataset into features and target variable
feature_cols = ['RIAGENDR_0', 'RIDAGEYR', 'RIDRETH3_2', 'RIDRETH3_3', 'RIDRETH3_4', 'RIDRETH3_6', 'RIDRETH3_7', 'INDFMPIR', 'DMDMARTZ_1.0', 'DMDMARTZ_2.0', 'DMDMARTZ_3.0', 'DMDMARTZ_4.0', 'DMDMARTZ_6.0', 'DMDEDUC2', 'RFXT010', 'BMXWT', 'BMXBMI', 'URXUMA', 'LBDHDD', 'LBXFER', 'LBXGH', 'LBXBPB', 'LBXBCD', 'LBXBSE', 'LBXBMN', 'URXUBA', 'URXUCD', 'URXUCO', 'URXUCS', 'URXUMO', 'URXUMN', 'URXUPB', 'URXUSB', 'URXUSN', 'URXUTL', 'URXUTU']
X = data[feature_cols]  # features
scale = StandardScaler()
X = scale.fit_transform(X)
y = data['depre_score']  # target variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)  # 70% training and 30% test

clf = DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(y_test)
print(y_pred)

confusion = metrics.confusion_matrix(y_test, y_pred)
print(confusion)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

recall_sensitivity = metrics.recall_score(y_test, y_pred, pos_label=1)
recall_specificity = metrics.recall_score(y_test, y_pred, pos_label=0)
print(recall_sensitivity, recall_specificity)
Why do you think you are doing something wrong? Perhaps your data are such that you can achieve a perfect classification... e.g., see this mushroom classification.
Having said that, it is also possible that there is some leakage in your data, as @gtomer pointed out. That would mean a data point present in your training set also appears in your test set. You can run a K-fold cross-validation on your data and see how the accuracy holds up (a minimal sketch is below). Secondly, try different classifiers as well; Random Forests generally perform better than a single Decision Tree.
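For instance, a minimal sketch of such a check, assuming X and y are the scaled features and target from your code (the 5-fold count and the forest settings here are arbitrary choices):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# 5-fold accuracy for the original decision tree
tree_scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=5, scoring="accuracy")
print("Decision tree, per-fold accuracy:", tree_scores)

# same check with a random forest for comparison
forest_scores = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=1), X, y, cv=5, scoring="accuracy")
print("Random forest, per-fold accuracy:", forest_scores)

If the per-fold accuracies stay close to 1.0 across all folds, your data may simply be easy to separate; if they swing wildly, leakage or a too-small test split is worth another look.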
I'm fitting a time series, so I'm trying to cross-validate using TimeSeriesSplit. I believe the easiest way to apply it is through cross_val_score, via the cv argument.
The question is simple: is the way I am passing the cv argument correct? Should I use split(scaled_train), split(X_train), or split(input_data)? Or should I cross-validate some other way?
This is the code I am writing:
def fit_model1(data: pd.DataFrame):
    df = data
    scores_fit_model1 = []
    for sizes in test_sizes:
        # Generate test design
        input_data = df.drop('next_count', axis=1)
        output_data = df[['next_count']]
        X_train, X_test, y_train, y_test = train_test_split(input_data, output_data, test_size=sizes, random_state=0, shuffle=False)

        # Scaling
        scaler = MinMaxScaler()
        scaled_train = scaler.fit_transform(X_train)
        scaled_test = scaler.transform(X_test)

        # Build model
        lr = LinearRegression()
        lr.fit(scaled_train, y_train.values.ravel())
        predictions = lr.predict(scaled_test)

        # Cross-validation definition
        time_split = TimeSeriesSplit(n_splits=10)

        # Performance metrics
        r2 = cross_val_score(lr, scaled_train, y_train.values.ravel(), cv=time_split.split(scaled_train), scoring='r2', n_jobs=1).mean()
        scores_fit_model1.append(r2)
    return scores_fit_model1
TimeSeriesSplit is simply a splitter that yields a growing window of sequential folds, and the indices it produces depend only on the number of samples, not on their values. Therefore, you can pass it as is to cv, or you can pass time_split.split(scaled_train), which amounts to the same thing: making splits over an array of the same length as your training data (which cross_val_score takes as its second positional argument). It doesn't matter whether TimeSeriesSplit sees the scaled or the original data, as long as cross_val_score gets the scaled data.
I made some minor simplifications to your code as well: scaling before the train_test_split, and making the output data a Series (so you don't need values.ravel()):
def fit_model1(data: pd.DataFrame):
    df = data
    scores_fit_model1 = []
    for sizes in test_sizes:
        # Generate test design
        input_data = df.drop('next_count', axis=1)
        output_data = df['next_count']
        scaler = MinMaxScaler()
        scaled_input = scaler.fit_transform(input_data)
        X_train, X_test, y_train, y_test = train_test_split(scaled_input, output_data, test_size=sizes, random_state=0, shuffle=False)

        # Build model
        lr = LinearRegression()
        lr.fit(X_train, y_train)
        predictions = lr.predict(X_test)

        # Cross-validation definition
        time_split = TimeSeriesSplit(n_splits=10)

        # Performance metrics
        r2 = cross_val_score(lr, X_train, y_train, cv=time_split, scoring='r2', n_jobs=1).mean()
        scores_fit_model1.append(r2)
    return scores_fit_model1
When I do something like:
scoring = ["accuracy", "balanced_accuracy", "f1", "precision", "recall", "roc_auc"]
scores = cross_validate(SVC(), my_x, my_y, scoring=scoring, cv=5, verbose=3, return_train_score=True, return_estimator=True)
how can I get a confusion matrix of a single validation run, e.g. the first one or ideally the best one?
I don't need a plot or something beautiful, only the numbers. If I could see at least the split, then I could recalculate it.
If you want to use cross-validation to do something quite specific during each iteration, it may be best to use a CV splitter like StratifiedKFold directly:
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

svm = SVC()
kf = StratifiedKFold(n_splits=5)

scores = []
results = []
for train_index, test_index in kf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    svm.fit(X_train, y_train)
    y_pred = svm.predict(X_test)
    scores.append(accuracy_score(y_test, y_pred))  # use other scoring as preferred
    results.append(confusion_matrix(y_test, y_pred))
This will compute the confusion matrix for each of the five iterations and store them in results. If you want the confusion matrix of the best validation round, you can additionally compute the scoring metric inside the loop (see the scores list) and retrieve the corresponding confusion matrix, for example as sketched below.
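A minimal sketch of that last step, using the scores and results lists filled in above:

import numpy as np

best_fold = int(np.argmax(scores))  # index of the fold with the highest accuracy
print("Best fold:", best_fold)
print(results[best_fold])           # its confusion matrix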
I'm relatively new to Python. Can you help me turn my SMOTE implementation into a proper pipeline? What I want is to apply the over- and under-sampling on the training set of every k-fold iteration, so that the model is trained on a balanced dataset and evaluated on the imbalanced left-out fold. The problem is that when I do that, I cannot use the familiar sklearn interface for evaluation and grid search.
Is it possible to make something similar to model_selection.RandomizedSearchCV? My take on this:
import numpy as np
import pandas as pd
from imblearn.combine import SMOTEENN
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold

df = pd.read_csv("Imbalanced_data.csv")  # load the data set
X = df.iloc[:, 0:64].values
y = df.iloc[:, 64].values

n_splits = 2
n_measures = 2  # recall and AUC
kf = StratifiedKFold(n_splits=n_splits)  # stratified because we need balanced samples
kf.get_n_splits(X)

clf_rf = RandomForestClassifier(n_estimators=25, random_state=1)
scores = np.zeros((n_splits, n_measures))

for fold, (train_index, test_index) in enumerate(kf.split(X, y)):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    sm = SMOTE(sampling_strategy='auto', k_neighbors=5)
    smote_enn = SMOTEENN(smote=sm)
    X_train_res, y_train_res = smote_enn.fit_resample(X_train, y_train)
    clf_rf.fit(X_train_res, y_train_res)
    y_pred = clf_rf.predict(X_test)
    scores[fold, 0] = recall_score(y_test, y_pred)
    scores[fold, 1] = roc_auc_score(y_test, y_pred)
You need to look at the pipeline object. imbalanced-learn has a Pipeline which extends the scikit-learn Pipeline to also accept samplers, i.e. steps implementing fit_resample() (fit_sample() in older releases), in addition to the usual fit_transform() and predict() steps of scikit-learn.
Have a look at this example:
https://imbalanced-learn.org/stable/auto_examples/pipeline/plot_pipeline_classification.html
For your code, you would want to do this:
from imblearn.pipeline import make_pipeline, Pipeline

smote_enn = SMOTEENN(smote=sm)
clf_rf = RandomForestClassifier(n_estimators=25, random_state=1)

pipeline = make_pipeline(smote_enn, clf_rf)
OR
pipeline = Pipeline([('smote_enn', smote_enn),
                     ('clf_rf', clf_rf)])
Then you can pass this pipeline object to GridSearchCV, RandomizedSearchCV or other cross-validation tools in scikit-learn, just like any regular estimator.
kf = StratifiedKFold(n_splits=n_splits)
random_search = RandomizedSearchCV(pipeline, param_distributions=param_dist,
                                   n_iter=1000,
                                   cv=kf)
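One detail worth spelling out: the keys of param_dist must be prefixed with the pipeline step names, joined to the parameter with a double underscore. A hypothetical search space for the Pipeline([...]) variant above (it has to be defined before random_search is constructed):

from scipy.stats import randint

param_dist = {
    "clf_rf__n_estimators": randint(10, 200),    # parameters of the RandomForestClassifier step
    "clf_rf__max_depth": [None, 5, 10, 20],
    "smote_enn__smote__k_neighbors": [3, 5, 7],  # nested objects chain double underscores
}

With make_pipeline, the step names are derived from the lower-cased class names instead, so the prefixes would be smoteenn__ and randomforestclassifier__.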
This looks like it would fit the bill: http://contrib.scikit-learn.org/imbalanced-learn/stable/generated/imblearn.over_sampling.SMOTE.html
You'll want to create your own transformer (http://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin.html) that, upon calling fit, returns a balanced data set (presumably the one gotten from StratifiedKFold), but upon calling predict, which is what is going to happen for the test data, calls into SMOTE.
I want to have metrics per class label and an aggregate confusion matrix from a cross-validation in scikit-learn.
I wrote a method that performs a cross-validation with scikit-learn, sums the confusion matrices, and also stores all the predicted labels. It then calls scikit-learn functions to print out the metrics.
The code below should run with any recent scikit-learn installation; you can test it with any dataset.
Is the code below the correct way to gather an aggregate confusion matrix and a classification_report when doing StratifiedKFold cross-validation?
from sklearn import metrics
from sklearn.model_selection import StratifiedKFold
import numpy as np

def customCrossValidation(self, X, y, classifier, n_folds=10, shuffle=True, random_state=0):
    ''' Perform a cross-validation and print out the metrics '''
    skf = StratifiedKFold(n_splits=n_folds, shuffle=shuffle, random_state=random_state)
    cm = None
    y_predicted_overall = None
    y_test_overall = None
    for train_index, test_index in skf.split(X, y):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        classifier.fit(X_train, y_train)
        y_predicted = classifier.predict(X_test)
        # collect the y_predicted per fold
        if y_predicted_overall is None:
            y_predicted_overall = y_predicted
            y_test_overall = y_test
        else:
            y_predicted_overall = np.concatenate([y_predicted_overall, y_predicted])
            y_test_overall = np.concatenate([y_test_overall, y_test])
        cv_cm = metrics.confusion_matrix(y_test, y_predicted)
        # sum the confusion matrices per fold
        if cm is None:
            cm = cv_cm
        else:
            cm += cv_cm
    print(metrics.classification_report(y_test_overall, y_predicted_overall, digits=3))
    print(cm)
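For what it's worth, cross_val_predict is a shorter way to obtain the same aggregated labels (and, from them, the summed confusion matrix and the classification_report), since it collects the out-of-fold prediction for every sample. A minimal sketch, assuming the same X, y and classifier as above:

from sklearn import metrics
from sklearn.model_selection import StratifiedKFold, cross_val_predict

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
y_pred = cross_val_predict(classifier, X, y, cv=skf)  # one out-of-fold prediction per sample

print(metrics.classification_report(y, y_pred, digits=3))
print(metrics.confusion_matrix(y, y_pred))  # matches the fold-wise confusion matrices summed up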