I'd like to make sure my custom score function behaves as expected by comparing it to hand-computation (so to speak) on a pre-defined split using train_test_split.
However, I can't seem to pass that split in to cross_val_score. By default it uses 3-fold cross-validation, and I can't mimic the splits it used. I think the answer lies in the cv parameter, but I can't figure out how to pass in an iterable in the correct form.
If you have a pre-defined split, you can simply train your model and apply the custom score function to the predictions on the test data to match the calculation. You do not need cross_val_score for that.
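For example, a minimal sketch of that approach (assuming X and y are your data and custom_score is a hypothetical metric of the form custom_score(y_true, y_pred)):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def custom_score(y_true, y_pred):
    # Placeholder metric; substitute your own custom score function.
    return (y_true == y_pred).mean()

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)

clf = RandomForestClassifier(random_state=42, n_estimators=10)
clf.fit(X_train, y_train)

# Hand-computed score on the held-out part of the pre-defined split.
manual_score = custom_score(y_valid, clf.predict(X_valid))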
I'm pretty sure there's a better and easier way, but this is what I came up with, as the cross_val_score documentation wasn't really clear.
You are right, it's about how you use the cv parameter; I used this format: an iterable yielding (train, test) splits.
The idea is to create an object that yields (train, test) split indices; I referred to http://fa.bianp.net/blog/2015/holdout-cross-validation-generator/.
Assume that you already have a train/test split. I used the sklearn built-in split and returned the indices as well:
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split

# Split the data and keep the row indices of each part.
X_train, X_valid, y_train, y_valid, indices_train, indices_test = train_test_split(
    train_X, train_y, np.arange(train_X.shape[0]), test_size=0.2, random_state=42)
Then, I create a class to yield the train, test split indices using the output from train_test_split:
class HoldOut:
    """Yields a single (train, test) split of pre-computed indices."""
    def __init__(self, indices_train, indices_test):
        self.ind_train = indices_train
        self.ind_test = indices_test

    def __iter__(self):
        yield self.ind_train, self.ind_test
Then you can simply pass the HoldOut object to the cv parameter:
from sklearn.ensemble import RandomForestClassifier

cross_val_score(RandomForestClassifier(random_state=42, n_estimators=10), train_X, train_y,
                cv=HoldOut(indices_train, indices_test), verbose=1)
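If the goal is to verify the custom score function, you can then check that the single value returned here matches the hand computation on the same split (a sketch, assuming a hypothetical custom_score function wrapped with make_scorer):

from sklearn.metrics import make_scorer

scores = cross_val_score(RandomForestClassifier(random_state=42, n_estimators=10),
                         train_X, train_y,
                         scoring=make_scorer(custom_score),
                         cv=HoldOut(indices_train, indices_test))
# Exactly one (train, test) pair is yielded, so there is exactly one score,
# which should equal the value computed by hand on X_valid / y_valid.
print(scores[0])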
Related
Someone presented a solution to split a dataset into three sets. I wonder where the label is in this case, or how to set the labels then.
train, validate, test = np.split(df.sample(frac=1), [int(.6*len(df)), int(.8*len(df))])
I will answer the question based on comments:
Using this method for splitting:
train, validate, test = np.split(df.sample(frac=1), [int(.6*len(df)), int(.8*len(df))])
You get 3 different objects: the first 60% of the data from df for train, the data in the 60%-80% interval for validate, and the last 20% (80%-100%) for test. The labels are still inside these dataframes.
In train_test_split you pass two objects, X and Y, which have most likely been split out of an original dataset beforehand, and you get back 4 objects: two for train and two for test. Keep this in mind: you first split your dataset into the independent variables and the explained/target variable, and then split those two objects into train and test.
With np.split you go the other way around: you first split your dataset into 3 objects, train, validate and test, which later need to be split individually into the independent variables (commonly known as X) and the target variable (known as Y), as in the sketch below. You are doing the same splits, just in reverse order.
However, keep in mind that np.split itself cuts at the indexes you pass and is not random (the randomness in your example comes from df.sample(frac=1), which shuffles the rows first), whereas train_test_split gives you random train and test subsets directly. On the other hand, np.split allows for more flexibility, for instance creating more than 2 subsets, as your example shows.
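For example, a short sketch of separating features and labels after the three-way split (assuming df is a pandas dataframe whose label column is hypothetically named 'target'):

import numpy as np

# Shuffle, then cut at the 60% and 80% marks.
train, validate, test = np.split(df.sample(frac=1, random_state=42),
                                 [int(.6 * len(df)), int(.8 * len(df))])

# Each piece still carries the label column; split X and y per piece afterwards.
X_train, y_train = train.drop(columns="target"), train["target"]
X_valid, y_valid = validate.drop(columns="target"), validate["target"]
X_test, y_test = test.drop(columns="target"), test["target"]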
Maybe this will help!
Try this: feed the output of one train_test_split into a second one as input.
import numpy as np
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)
X_test, X_validate, y_test, y_validate = train_test_split(X_test, y_test, test_size=0.5)
The function randomly splits 2 arrays into 4 arrays, and test_size determines the share of the split allocated to the test output versus train. The y input is meant to be the target for building a machine learning model, and X is meant to be the features for the model. If you want them combined, just concat the matching X and y outputs, as in the sketch below.
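If you do want each split back as a single dataframe, a small sketch (assuming the X and y pieces are a pandas dataframe and series):

import pandas as pd

# Recombine the matching feature and target pieces per split.
train_df = pd.concat([X_train, y_train], axis=1)
validate_df = pd.concat([X_validate, y_validate], axis=1)
test_df = pd.concat([X_test, y_test], axis=1)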
I am creating a custom scorer for a GridSearchCV object. For the custom scorer I need probabilities from two different dataframes, but the model should only be trained on one of the dataframes. The other dataframe is needed to get probabilities. These probabilities will be used in the scoring function.
I had considered concatenating the dataframes, but there is no ground truth to one of the dataframes. This would create an issue with passing the y_true.
I had also tried to pass the model to the custom score function, but I got a traceback that the model was not fit. Here is an example of what I am trying to do:
def fit(self, X_train, y_train, X_info):
    grid = self._create_grid_search()
    clf = GradientBoostingClassifier()
    score_func = make_scorer(self.make_custom_score, needs_proba=True,
                             clf=clf, X_info=X_info)
    model = GridSearchCV(estimator=clf,
                         param_grid=grid,
                         scoring=score_func,
                         cv=3)

def make_custom_score(self, y_true, y_score, clf, X_info):
    ...
I found this question: SKLearn cross-validation: How to pass info on fold examples to my scorer function?
which seems like it might be a possibility. That approach appears to be to write a function of the form scorer(estimator, X, y), but I think it will still have the issue that the model is trained on all of the data. Is there any way to pass the estimator to the custom score function used by GridSearchCV?
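For what it's worth, a rough sketch of that scorer(estimator, X, y) form (hypothetical; it assumes X_info is available in the enclosing scope, that clf and grid are the objects from the fit method above, and the way the two probability sets are combined is only a placeholder):

def custom_scorer(estimator, X, y):
    # GridSearchCV hands the scorer an estimator already fitted on the current
    # training fold, so it is not trained on all of the data.
    proba_fold = estimator.predict_proba(X)[:, 1]       # probabilities for the CV fold
    proba_info = estimator.predict_proba(X_info)[:, 1]  # probabilities for the other dataframe
    # Placeholder: combine the two however your real scoring rule requires.
    return float(proba_fold.mean() - abs(proba_fold.mean() - proba_info.mean()))

model = GridSearchCV(estimator=clf, param_grid=grid, scoring=custom_scorer, cv=3)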
My data consists of 99% target variable = 1 and 1% target variable = 0. Does stratify guarantee that the train set and test set have an equal ratio of data in terms of the target variable? As in, contains equal amounts of '1' and '0'?
Please see below code for clarification
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,stratify=y,random_state=42)
Stratification just returns a portion of the data, shuffled or not depending on the arguments you pass. Say your dataset consists of 100 instances of class 1 and 10 instances of class 0, and you do a 70:30 stratified split: you get 70 class-1 and 7 class-0 instances in the training set, and 30 class-1 and 3 class-0 instances in the test set. Clearly, this is in no way balanced. The classifier you train will be highly biased and about as good as a dummy classifier that predicts every input as class 1.
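A quick way to see what stratify preserves is to compare the class counts before and after the split (a sketch, assuming y holds integer class labels):

import numpy as np
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# The class ratio of y (e.g. 99:1) is reproduced in both parts,
# but neither part is made balanced.
print(np.bincount(y), np.bincount(y_train), np.bincount(y_test))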
A better approach would be to either collect more class-0 data, oversample the dataset to artificially generate more class-0 instances, or undersample it to reduce the class-1 instances. imblearn is a Python library that can help with that.
The first difference is that train_test_split(X, y, test_size=0.2, stratify=y) only splits the data once, putting 80% in train and 20% in test.
Whereas StratifiedKFold(n_splits=2) will split the data into 50% train and 50% test.
Second, you can specify n_splits greater than 2 to achieve a cross-validation fold effect, in which the data is split n_splits times, so there are multiple divisions of the data into train and test.
For more information about the K-fold you can look at this question:
difference between StratifiedKFold and StratifiedShuffleSplit in sklearn
The idea there is the same: when stratify is set, train_test_split will internally use StratifiedShuffleSplit.
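For illustration, a minimal sketch of the two behaviours side by side (assuming X and y are numpy arrays):

from sklearn.model_selection import StratifiedKFold, train_test_split

# A single stratified 80/20 split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)

# Several stratified train/test divisions, one per fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_index, test_index in skf.split(X, y):
    X_tr, X_te = X[train_index], X[test_index]
    y_tr, y_te = y[train_index], y[test_index]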
I'm confused about using cross_val_predict in a test data set.
I created a simple Random Forest model and used cross_val_predict to make predictions:
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import cross_val_predict, KFold
lr = RandomForestClassifier(random_state=1, class_weight="balanced", n_estimators=25, max_depth=6)
kf = KFold(train_df.shape[0], random_state=1)
predictions = cross_val_predict(lr,train_df[features_columns], train_df["target"], cv=kf)
predictions = pd.Series(predictions)
I'm confused on the next step here. How do I use what is learnt above to make predictions on the test data set?
I don't think you need to call fit before cross_val_score or cross_val_predict; they fit the model on the fly, on each training fold. If you look at the documentation (section 3.1.1.1), you'll see that fit is never mentioned there.
As #DmitryPolonskiy commented, the model has to be trained (with the fit method) before it can be used to predict.
# Train the model (a.k.a. `fit` training data to it).
lr.fit(train_df[features_columns], train_df["target"])
# Use the model to make predictions based on testing data.
y_pred = lr.predict(test_df[feature_columns])
# Compare the predicted y values to actual y values.
accuracy = (y_pred == test_df["target"]).mean()
cross_val_predict is a method of cross validation, which lets you determine the accuracy of your model. Take a look at sklearn's cross-validation page.
I am not sure the question was answered. I had a similar thought. I want to compare the results (accuracy, for example) with a method that does not apply CV. The CV-validated accuracy is computed on X_train and y_train. The other method fits the model using X_train and y_train and tests on X_test and y_test. So the comparison is not fair, since they are evaluated on different datasets.
What you can do is use the estimators returned by cross_validate; it gives back a dict with one fitted estimator per fold under the 'estimator' key:
from sklearn.model_selection import cross_validate

cv_results = cross_validate(lr, train_df[features_columns], train_df["target"],
                            cv=kf, return_estimator=True)
# 'estimator' holds the model fitted on each fold; pick one (or average their predictions).
y_pred = cv_results["estimator"][0].predict(test_df[feature_columns])
accuracy = (y_pred == test_df["target"]).mean()
I am trying to follow this tutorial to learn machine-learning-based prediction, but I have two questions about it.
Ques1: How do I set n_estimators in the piece of code below? Otherwise it will always assume the default value.
from sklearn.cross_validation import KFold

def run_cv(X, y, clf_class, **kwargs):
    # Construct a kfolds object
    kf = KFold(len(y), n_folds=5, shuffle=True)
    y_pred = y.copy()

    # Iterate through folds
    for train_index, test_index in kf:
        X_train, X_test = X[train_index], X[test_index]
        y_train = y[train_index]
        # Initialize a classifier with key word arguments
        clf = clf_class(**kwargs)
        clf.fit(X_train, y_train)
        y_pred[test_index] = clf.predict(X_test)
    return y_pred
It is being called as:
from sklearn.svm import SVC
print "%.3f" % accuracy(y, run_cv(X,y,SVC))
Ques2: How do I use the already-trained model file (e.g. obtained from SVM) to predict more (test) data which I didn't use for training?
For your first question: the **kwargs in run_cv are passed straight to the classifier initializer in the step clf = clf_class(**kwargs), so you would call, for example, run_cv(X, y, RandomForestClassifier, n_estimators=100) for a classifier that actually accepts n_estimators (SVC does not).
For your second question, the cross validation in the code you've linked is just for model evaluation, i.e. comparing different types of models and hyperparameters, and determining the likely effectiveness of your model in production. Once you've decided on your model, you need to refit the model on the whole dataset:
clf.fit(X,y)
Then you can get predictions with clf.predict or clf.predict_proba.
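If by "already trained model file" you mean persisting the fitted model so it can be reused later, a common approach is joblib (a sketch; the file name and X_new are hypothetical):

import joblib

# Persist the model fitted on the whole dataset ...
joblib.dump(clf, "svc_model.joblib")

# ... and load it later to predict on data that was never used for training.
clf_loaded = joblib.load("svc_model.joblib")
new_predictions = clf_loaded.predict(X_new)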