Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=1/3, random_state=85)
models = [
    RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42),
    LinearSVC(),
    MultinomialNB(),
    LogisticRegression(random_state=0),
]
# 5 Cross-validation
CV = 5
cv_df = pd.DataFrame(index=range(CV * len(models)))
entries = []
for model in models:
    model_name = model.__class__.__name__
    accuracies = cross_val_score(model, features, labels, scoring='accuracy', cv=CV)
    for fold_idx, accuracy in enumerate(accuracies):
        entries.append((model_name, fold_idx, accuracy))
cv_df = pd.DataFrame(entries, columns=['model_name', 'fold_idx', 'accuracy'])
Error:
UserWarning: The least populated class in y has only 1 members, which is less than n_splits=5.
When you use cross-validation on a classifier, scikit-learn splits the training set with stratification, so that each fold has the same class distribution as the whole set. If there are 10 objects with label "A", making up about 20% of all examples, each of the 5 folds gets 2 of them, so label "A" is also about 20% of each fold.
But what happens when label "A" has only 1 object (a single row with that class) and you try to split it across 5 folds? That is the error you get: the splitter does not know how to handle it.
It's hard to say how to solve this without knowing what your data look like and what your needs are; different problems may call for different solutions.
You can:
Remove the problematic label from the dataset, or check for all classes with extremely low frequency and group them together into an "Other" class.
Give up on the stratified split and use KFold, which does not require the folds to have the same class distribution (see the sketch below).
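As a rough sketch of both options, reusing y, model, features, labels, and CV from the snippets above (the frequency threshold is an assumption, and y is assumed to be a pandas Series):
from sklearn.model_selection import KFold, cross_val_score

# Option 1: merge classes rarer than the number of folds into an "Other" bucket.
counts = y.value_counts()
rare = counts[counts < CV].index
y_grouped = y.where(~y.isin(rare), 'Other')

# Option 2: use plain KFold, which does not stratify by class.
accuracies = cross_val_score(model, features, labels, scoring='accuracy',
                             cv=KFold(n_splits=CV, shuffle=True, random_state=0))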
Related
I'm working with the fetch_kddcup99 dataset, and using pandas I've converted the original data into something like this, with all the dummy variables:
(Screenshot of the resulting DataFrame.)
Note that after dropping duplicates, the final dataframe only contains 149 observations.
Then I start the feature-engineering phase by one-hot encoding protocol_type, which is a string categorical variable, and converting y to 0/1.
X = pd_data.drop(target, axis=1)
y = pd_data[target]
y = y.astype('int')
protocol_type = [['tcp', 'udp', 'icmp']]
col_transformer = ColumnTransformer([
    ("encoder_tipo1",
     OneHotEncoder(categories=protocol_type, handle_unknown='ignore'),
     ['protocol_type']),
])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=89)
Finally I proceed to the model evaluation, which gives me the following result:
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('DTC', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('RFC', RandomForestClassifier()))
models.append(('SVM', SVC()))
#selector = SelectFromModel(estimator=model)
scaler = option2
selector = SelectKBest(score_func=f_classif, k=3)
results=[]
for name, model in models:
    pipeline = make_pipeline(col_transformer, scaler, selector)
    #print(pipeline)
    X_train_selected = pipeline.fit_transform(X_train, y_train)
    #print(X_train_selected)
    X_test_selected = pipeline.fit_transform(X_test, y_test)
    modelo = model.fit(X_train_selected, y_train)
    kf = KFold(n_splits=10, shuffle=True, random_state=89)
    cv_results = cross_val_score(modelo, X_train_selected, y_train, cv=kf, scoring='accuracy')
    results.append(cv_results)
    print(name, cv_results)
plt.boxplot(results)
plt.show()
(Boxplots of the cross-validation accuracies.)
My question is: why do all the models perform the same? Could it be due to the small number of rows in the dataframe, or am I doing something wrong?
You have 149 rows, of which 80% go into the training set, so 119. You then do 10-fold cross-validation, so each test fold has about 12 samples. That means each individual test fold has only about 13 possible accuracy values; even if the classifiers predict some samples slightly differently, they can still end up with the same accuracy. (The common scores you see (1, 0.88, 0.71) don't quite line up with the fractions I'd expect, though, so maybe I've missed something.) So yes, it is probably just the small number of rows, compounded by the cross-validation. Selecting down to just 3 features probably also contributes.
One quick thing to check is a continuous score of the models' performance, such as log-loss or the Brier score.
(Also, GaussianNB is probably the wrong Naive Bayes variant for your data, which contains so many binary features.)
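A rough sketch of that check, reusing col_transformer, scaler, selector, X_train, y_train, and the models list from the question (note that neg_log_loss needs predict_proba, so SVC would need probability=True):
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

kf = KFold(n_splits=10, shuffle=True, random_state=89)
for name, model in models:
    # Keep preprocessing inside the pipeline so every CV fold is transformed
    # independently, then score with a continuous metric instead of accuracy.
    pipe = make_pipeline(col_transformer, scaler, selector, model)
    scores = cross_val_score(pipe, X_train, y_train, cv=kf, scoring='neg_log_loss')
    print(name, -scores.mean())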
I have spent 30 hours debugging this single problem and it makes absolutely no sense; hopefully one of you can show me a different perspective.
The problem is that I train a random forest on my training dataframe and get very good accuracy (98%-99%), but when I load in a new sample to predict on, the model ALWAYS guesses the same class.
# Shuffle the data-frames records. The labels are still attached
df = df.sample(frac=1).reset_index(drop=True)
# Extract the labels and then remove them from the data
y = list(df['label'])
X = df.drop(['label'], axis='columns')
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=TEST_SIZE)
# Construct the model
model = RandomForestClassifier(n_estimators=N_ESTIMATORS, max_depth=MAX_DEPTH, random_state=RANDOM_STATE,oob_score=True)
# Calculate the training accuracy
in_sample_accuracy = model.fit(X_train, y_train).score(X_train, y_train)
# Calculate the testing accuracy
test_accuracy = model.score(X_test, y_test)
print()
print('In Sample Accuracy: {:.2f}%'.format(model.oob_score_ * 100))
print('Test Accuracy: {:.2f}%'.format(test_accuracy * 100))
I process the data the same way, but when I predict on X_test or X_train I get my usual ~98%, while when I predict on my new data it always guesses the same class.
# The json file is not in the correct format, this function normalizes it
normalized_json = json_normalizer(json_file, "", training=False)
# Turn the json into a list of dictionaries which contain the features
features_dict = create_dict(normalized_json, label=None)
# Convert the dictionaries into pandas dataframes
df = pd.DataFrame.from_records(features_dict)
print('Total amount of email samples: ', len(df))
print()
df = df.fillna(-1)
# One hot encodes string values
df = one_hot_encode(df, noOverride=True)
if 'label' in df.columns:
    df = df.drop(['label'], axis='columns')
print(list(model.predict(df))[:100])
print(list(model.predict(X_train))[:100])
Above is my testing scenario. You can see in the last two lines that I predict on X_train (the data used to train the model) and on df (the out-of-sample data), and for df it always guesses class 0.
Some useful information:
The datasets are imbalanced; class 0 has about 150,000 samples while class 1 has about 600,000 samples
There are 141 features
Changing n_estimators and max_depth doesn't fix it
Any ideas would be helpful; if you need more information, let me know. My brain is fried right now and that's all I could think of.
Fixed. The issue was the imbalance of the datasets; I also realized that changing the depth gave me different results.
For example, 10 trees with depth 3 -> seemed to work fine
10 trees with depth 6 -> back to guessing only the same class
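The post doesn't show the actual fix, but one standard lever for this kind of imbalance in scikit-learn is class weighting; a minimal sketch reusing the names from the snippet above:
from sklearn.ensemble import RandomForestClassifier

# class_weight='balanced' reweights each class inversely to its frequency,
# so the less common class still influences the trees.
model = RandomForestClassifier(n_estimators=N_ESTIMATORS,
                               max_depth=MAX_DEPTH,
                               random_state=RANDOM_STATE,
                               class_weight='balanced',
                               oob_score=True)
model.fit(X_train, y_train)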
I want to fit an XGBoost model on a dataset, separately for each group defined by a set of variables.
Here is the reproducible code.
train = pd.DataFrame({'A': np.random.randint(1, 3, 1000),
                      'B': np.random.randint(1, 3, 1000),
                      'C': np.random.uniform(1, 10, 1000),
                      'D': np.random.uniform(1, 25, 1000),
                      'E': np.random.uniform(10, 50, 1000),
                      'F': np.random.uniform(5, 250, 1000)})
This is the mock-up data. Here I want to run the model grouped by columns 'A' and 'B'.
seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(train.iloc[:,:-1], train.iloc[:,-1], test_size=test_size, random_state=seed)
Here I am splitting data into test and train datasets.
models = []
def xgb_model_fit(x):
    xg_reg = xgb.XGBRegressor(objective='reg:linear', colsample_bytree=0.3, learning_rate=0.1,
                              max_depth=5, alpha=10, n_estimators=10)
    xg_reg.fit(X_train, y_train)
    models.append([x.name, xg_reg])
The group-by variable list is:
gb = ['A','B']
The model-building step is:
X_train.groupby(gb,as_index=False).apply(xgb_model_fit)
Now I want to use the fitted models on the test data, each on its respective group.
Question1:
In the test data, my first group is (1,1). Now I want to apply the model fitted on group (1,1) to the same group in the test data and get its predicted values. How do I do this?
Question2:
Is there any direct way to fit a model per group instead of running the model for each group separately?
How can I do this?
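Not part of the original post, but a sketch of one way Question 1 could be handled: keep the fitted models in a dict keyed by the group tuple, then look each group's model up when predicting on the test set. It reuses X_train, X_test, y_train, and gb from above; 'reg:squarederror' is used in place of the deprecated 'reg:linear' objective.
import pandas as pd
import xgboost as xgb

group_models = {}

def fit_group(g):
    # g holds only the rows of one (A, B) group; g.name is the group key tuple.
    reg = xgb.XGBRegressor(objective='reg:squarederror', colsample_bytree=0.3,
                           learning_rate=0.1, max_depth=5, alpha=10, n_estimators=10)
    reg.fit(g, y_train.loc[g.index])
    group_models[g.name] = reg

X_train.groupby(gb).apply(fit_group)

# Apply each group's model to the matching rows of the test set.
preds = []
for key, g in X_test.groupby(gb):
    preds.append(pd.Series(group_models[key].predict(g), index=g.index))
preds = pd.concat(preds).sort_index()
If a group appears in the test data but not in the training data, group_models[key] will raise a KeyError, so a fallback model may be needed.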
I'm not able to do something and I would like to know whether it's a bug or the expected behaviour.
I was trying to run a nested cross-validation on a dataset in which each sample belongs to a patient. To avoid training and testing on the same patient, I saw that scikit-learn implements a "group" mechanism, and GroupKFold seems to be the right one in my case.
As my classifier takes different parameters, I use GridSearchCV to tune the hyperparameters of my model. In the same way, I assume the inner training/testing splits also have to keep patients separate.
(For those interested in nested cross-validation: http://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html)
I proceed this way:
pipe = Pipeline([('pca', PCA()),
('clf', SVC()),
])
# Find the best parameters for both the feature extraction and the classifier
grid_search = GridSearchCV(estimator=pipe, param_grid=some_param, cv=GroupKFold(n_splits=5), verbose=1)
grid_search.fit(X=features, y=labels, groups=groups)
# Nested CV with parameter optimization
predictions = cross_val_predict(grid_search, X=features, y=labels, cv=GroupKFold(n_splits=5), groups=groups)
And I get:
File "_split.py", line 489, in _iter_test_indices
    raise ValueError("The 'groups' parameter should not be None.")
ValueError: The 'groups' parameter should not be None.
Looking at the code, it appears that groups is not passed on by the _fit_and_predict() method to the estimator, so the groups needed by the inner GroupKFold cannot be used.
Can I have some clues on it?
Have a nice day,
Best regards
I had the same problem and couldn't find any way other than implementing it in a more hands-on fashion:
outer_cv = GroupKFold(n_splits=4).split(X_data, y_data, groups=groups)
nested_cv_scores = []
for train_ids, test_ids in outer_cv:
    inner_cv = GroupKFold(n_splits=4).split(X_data[train_ids, :], y_data.iloc[train_ids],
                                            groups=groups[train_ids])
    rf = RandomForestClassifier()
    rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid, n_iter=100,
                                   cv=inner_cv, verbose=2, random_state=42,
                                   n_jobs=-1, scoring=my_squared_score)
    # Fit the random search model
    rf_random.fit(X_data[train_ids, :], y_data.iloc[train_ids])
    print(rf_random.best_params_)
    nested_cv_scores.append(rf_random.score(X_data[test_ids, :], y_data.iloc[test_ids]))
print("Nested cv score - meta learning: " + str(np.mean(nested_cv_scores)))
I hope this helps.
Best regards,
Felix
Up to now I have had only one dataset (df.csv). So far I have used a validation size of 20% and train_test_split for a normal regression model.
array = df.values
X = array[:,0:26]
Y = array[:,26]
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = cross_validation.train_test_split(
    X, Y, test_size=validation_size, random_state=seed)
num_folds = 10
num_instances = len(X_train)
seed = 7
scoring = 'mean_squared_error'
When I have three separate datasets (train.csv / test.csv / ground_truth.csv), how can I handle them? Of course, first I use train.csv, then test.csv, and finally ground_truth.csv. But how should I incorporate these different datasets into my model?
When you perform cross-validation, train and test data are essentially the same dataset, which is split in different ways in order to prevent overfitting. The number of folds determines how many such splits are made.
For example, 5-fold cross validation splits the training set in 5 pieces and each time 4 of them are used for training and 1 for testing. So in your case, you have the following options:
Either perform cross-validation just on the training set and then check against the test set and the ground truth (fitting is done only on the training set, so if done correctly the accuracy on the test set and the ground truth ought to be similar), or combine the training and test sets into a larger and possibly more representative dataset and then check against the ground truth. A sketch of the first option is shown below.
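A minimal sketch of the first option; the file handling, the 26-features-plus-target column layout, and the choice of RandomForestRegressor are assumptions for illustration, not from the original post:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import mean_squared_error

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')            # features only
truth = pd.read_csv('ground_truth.csv')   # targets for the test rows

X_train, Y_train = train.iloc[:, 0:26], train.iloc[:, 26]
model = RandomForestRegressor(random_state=7)

# Cross-validate on the training data only.
cv_mse = -cross_val_score(model, X_train, Y_train,
                          cv=KFold(n_splits=10, shuffle=True, random_state=7),
                          scoring='neg_mean_squared_error')
print('CV MSE:', cv_mse.mean())

# Fit on all of the training data, then check against the held-out ground truth.
model.fit(X_train, Y_train)
preds = model.predict(test.iloc[:, 0:26])
print('Ground-truth MSE:', mean_squared_error(truth.iloc[:, 0], preds))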