split test/train data has inconsistent number of samples when using xgboost - python

I am new to machine learning, so be gentle. I have a single CSV file of data that I would like to split into train and test sets. I used the following code to split the data:
raw_data1.drop('Income_Weekly', axis=1, inplace=True)
df = raw_data1
df['split'] = np.random.randn(df.shape[0], 1)
msk = np.random.rand(len(df)) <= 0.5
X_train = df[msk]
y_train = df[~msk]
However, when trying to apply the xgboost algorithm, I receive an error:
ValueError: Found input variables with inconsistent numbers of samples: [4791, 5006]
The error occurs at the line:
random_cv.fit(X_train,y_train)
My complete code is as follows:
import xgboost
from sklearn.model_selection import RandomizedSearchCV

classifier = xgboost.XGBRegressor()
regressor = xgboost.XGBRegressor()

booster = ['gbtree', 'gblinear']
base_score = [0.25, 0.5, 0.75, 1]

## Hyper Parameter Optimization
n_estimators = [100, 500, 900, 1100, 1500]
max_depth = [2, 3, 5, 10, 15]
booster = ['gbtree', 'gblinear']
learning_rate = [0.05, 0.1, 0.15, 0.20]
min_child_weight = [1, 2, 3, 4]

# Define the grid of hyperparameters to search
hyperparameter_grid = {
    'n_estimators': n_estimators,
    'max_depth': max_depth,
    'learning_rate': learning_rate,
    'min_child_weight': min_child_weight,
    'booster': booster,
    'base_score': base_score
}

random_cv = RandomizedSearchCV(estimator=regressor,
                               param_distributions=hyperparameter_grid,
                               cv=5, n_iter=50,
                               scoring='neg_mean_absolute_error', n_jobs=4,
                               verbose=5,
                               return_train_score=True,
                               random_state=42)

random_cv.fit(X_train, y_train)

Your current mask assigns the rows where the mask is True to X_train and the remaining rows to y_train, so you end up with two DataFrames of different sizes (4791 vs. 5006 rows). X and y must have the same number of samples, and y should be the target column rather than a second set of rows.
Consider changing the splitting code to use sklearn's train_test_split:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df[list_x_cols], df['y_col'], test_size=0.2)
It looks like you are using a random number to decide the split; you can pass a random number as test_size if you want, but 0.2, 0.3 or 0.33 are commonly used.
list_x_cols should be a list of all your feature column names, e.g. list_x_cols = ['xvar1', 'xvar2', 'xvar3', ...].
To check whether the shapes of your split are misaligned, you can use the following code. Please share the output of the print statements:
print(X_train.shape)
print(y_train.shape)
if X_train.shape[0] != y_train.shape[0]:
    print("X and y rows imbalanced")

Related

When using HalvingRandomSearchCV I get the error "UserWarning: One or more of the test scores are non-finite"

I'm trying to predict Age (→ Y). Previously I used ElasticNet, but to no avail.
Currently I'm trying to use sklearn's RandomForestRegressor (which works when I use RandomizedSearchCV), but HalvingRandomSearchCV turns out to be substantially faster.
That's why I'm trying to upgrade my program to use HalvingRandomSearchCV, but doing so raises the warning "UserWarning: One or more of the test scores are non-finite" - 4 times in total. (See error message.)
Here it's suggested that this is due to the parameters in random_grid. I manipulated these and couldn't resolve the error that way.
Next I looked into rf_random.cv_results_. I found that mean_test_score, split2_test_score and std_test_score are the test scores with NaN values. In all cases the first 72 entries are NaN, and the values match the printed error messages.
Yet I can't find the first array that raises the error, which should contain only NaNs.
What's also weird is that the same parameters result in different outcomes.
When looking at rf_random.cv_results_["params"] I find that the same parameters result in NaN for the 3rd parameter setting and (e.g.) -9.7 for the 72nd parameter setting.
Final remarks:
X is a (656,91) DataFrame and contains z-score transformed data.
Y is a (656, 1) DataFrame and contains the Age as a float in years.
Right now my code also raises the "UndefinedMetricWarning: R^2 score is not well-defined with less than two samples." warning. To my understanding this is a separate issue and will be the next thing I'll fix.
Thanks
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_halving_search_cv  # noqa
from sklearn.model_selection import HalvingRandomSearchCV
from sklearn.model_selection import KFold

def RandomForestCVRandomSearch(X, Y, n_splits_outer_cv=3, n_splits_inner_cv=3):
    n_estimators = [int(x) for x in np.linspace(start=200, stop=2000, num=10)]
    max_features = ['auto', 'sqrt']
    max_depth = [int(x) for x in np.linspace(10, 100, num=10)]
    max_depth.append(None)
    min_samples_split = [2, 5, 10]
    min_samples_leaf = [1, 2, 4]
    bootstrap = [True, False]

    # Create the random grid
    random_grid = {'n_estimators': n_estimators,
                   'max_features': max_features,
                   'max_depth': max_depth,
                   'min_samples_split': min_samples_split,
                   'min_samples_leaf': min_samples_leaf,
                   'bootstrap': bootstrap}

    cv_outer = KFold(n_splits=n_splits_outer_cv, random_state=1, shuffle=True)
    for i, (train_index, test_index) in enumerate(cv_outer.split(X)):
        x_train, x_test = X.loc[train_index, :], X.loc[test_index, :]
        y_train, y_test = Y.loc[train_index, :], Y.loc[test_index, :]

        rf = RandomForestRegressor(random_state=42)
        rf_random = HalvingRandomSearchCV(estimator=rf, param_distributions=random_grid,
                                          cv=n_splits_inner_cv, verbose=2, random_state=42, n_jobs=-1)
        rf_random.fit(x_train.to_numpy(), y_train.to_numpy().ravel())
        rf_score = rf_random.score(x_test.to_numpy(), y_test.to_numpy())
Update:
While the error itself is still not clear to me, I had overlooked the importance of the "UndefinedMetricWarning: R^2 score is not well-defined with less than two samples." warning.
A fix to the problem is to change the scoring strategy:
rf_random = HalvingRandomSearchCV(estimator=rf, param_distributions=random_grid,
                                  cv=n_splits_inner_cv, verbose=2, random_state=42,
                                  n_jobs=-1, scoring='explained_variance')
Update 2
It might be that a bad interaction between my nested CV and the internal CV of HalvingRandomSearchCV caused the issue. As I wrote before, n_resources is 6 in the first iteration, and we believe those resources are split across the cv=3 folds, so each fold ends up with only 2 samples - which is not enough to calculate an R2 score. That would also explain why the issue doesn't appear in the second iteration, where n_resources increases.
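If that is indeed the cause, one possible workaround (a sketch, not part of the original post) is to start the first halving iteration with a larger budget via the min_resources parameter of HalvingRandomSearchCV, so that every inner fold keeps more than a couple of test samples; the value 30 below is an assumed placeholder to be tuned to your data size:
rf_random = HalvingRandomSearchCV(estimator=rf, param_distributions=random_grid,
                                  cv=n_splits_inner_cv,
                                  min_resources=30,  # assumed starting budget; adjust to your dataset
                                  scoring='explained_variance',
                                  verbose=2, random_state=42, n_jobs=-1)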

ValueError: too many values to unpack (expected 3), Machine learning

I am learning machine learning from the book Artificial-Intelligence-with-Python-Second-Edition and ran into this error:
ValueError: too many values to unpack (expected 3)
Here is the code from the book:
print("\nGrid scores for the parameter grid:")
for params, avg_score, _ in classifier.grid_scores_:  # from sklearn import grid_search
    print(params, '-->', round(avg_score, 3))
(The code for the tutorial was taken from GitHub: Artificial-Intelligence-with-Python-Second-Edition/Chapter06/run_grid_search.py)
The sklearn.grid_search module is no longer available, and the grid_scores_ attribute was replaced by cv_results_, so I need to change the code accordingly.
But when I use the cv_results_ attribute, I get this error:
ValueError: too many values to unpack (expected 3)
I have tried different variants and re-read all the help on this topic, but I cannot find a solution yet.
My full code:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split
from utilities import visualize_classifier

# Load input data
input_file = 'data_random_forests.txt'
data = np.loadtxt(input_file, delimiter=',')
X, y = data[:, :-1], data[:, -1]

# Separate input data into three classes based on labels
class_0 = np.array(X[y==0])
class_1 = np.array(X[y==1])
class_2 = np.array(X[y==2])

# Split the data into training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=5)

# Define the parameter grid
parameter_grid = [ {'n_estimators': [100], 'max_depth': [2, 4, 7, 12, 16]},
                   {'max_depth': [4], 'n_estimators': [25, 50, 100, 250]}
                 ]
metrics = ['precision_weighted', 'recall_weighted']

for metric in metrics:
    print("\n##### Searching optimal parameters for", metric)
    classifier = GridSearchCV(
        ExtraTreesClassifier(random_state=0),
        parameter_grid, cv=5, scoring=metric)
    classifier.fit(X_train, y_train)

    print("\nGrid scores for the parameter grid:")
    for params, avg_score, _ in classifier.cv_results_:
        print(params, '-->', round(avg_score, 3))

    print("\nBest parameters:", classifier.best_params_)

y_pred = classifier.predict(X_test)
print("\nPerformance report:\n")
print(classification_report(y_test, y_pred))
GridSearchCV.cv_results_ is a dictionary of numpy ndarrays (source). Iterating over a dictionary yields its keys, so you are trying to unpack each key into three variables (params, avg_score and _). The book's loop worked in the past because the old grid_search module's grid_scores_ attribute returned a list of (parameters, mean_score, cv_scores) tuples, whereas the current GridSearchCV.cv_results_ returns a single dictionary.
It's very straightforward to convert that dictionary into a pandas DataFrame.
import pandas as pd
df = pd.DataFrame(classifier.cv_results_)
You are interested in printing only the parameters and the scores, so let's do that by selecting the columns which have 'param' or 'score' in their names:
df_columns_to_print = [column for column in df.columns if 'param' in column or 'score' in column]
print(df[df_columns_to_print])
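If you would rather keep the book's original print format instead of a DataFrame, a minimal sketch (assuming classifier has already been fitted) is to zip the 'params' and 'mean_test_score' entries of cv_results_:
print("\nGrid scores for the parameter grid:")
for params, avg_score in zip(classifier.cv_results_['params'],
                             classifier.cv_results_['mean_test_score']):
    print(params, '-->', round(avg_score, 3))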

Sklearn DecisionTreeClassifier F-Score Different Results with Each run

I'm trying to train a decision tree classifier using Python. I'm using MinMaxScaler() to scale the data and f1_score as my evaluation metric. The strange thing is that my model gives me different results on each run, and the variation follows a pattern.
data in my code is a (2000, 7) pandas.DataFrame, with 6 feature columns and the last column being the target value. Columns 1, 3, and 5 are categorical data.
The following code is what I did to preprocess and format my data:
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import f1_score

# Data Preprocessing Step
# =============================================================================
data = pd.read_csv("./data/train.csv")
X = data.iloc[:, :-1]
y = data.iloc[:, 6]

# Choose which columns are categorical data, and convert them to numeric data.
labelenc = LabelEncoder()
categorical_data = list(data.select_dtypes(include='object').columns)
for i in range(len(categorical_data)):
    X[categorical_data[i]] = labelenc.fit_transform(X[categorical_data[i]])

# Convert categorical numeric data to one-of-K data, and change y from Series to ndarray.
onehotenc = OneHotEncoder()
X = onehotenc.fit_transform(X).toarray()
y = y.values

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)

min_max_scaler = MinMaxScaler()
X_train_scaled = min_max_scaler.fit_transform(X_train)
X_val_scaled = min_max_scaler.fit_transform(X_val)
The next code is for the actual decision tree model training:
dectree = DecisionTreeClassifier(class_weight='balanced')
dectree = dectree.fit(X_train_scaled, y_train)
predictions = dectree.predict(X_val_scaled)
score = f1_score(y_val, predictions, average='macro')
print("Score is = {}".format(score))
The output that I get (i.e. the score) varies, but within a pattern. For example, it fluctuates within the range of roughly 0.39 to 0.42.
On some iterations I even get the UndefinedMetricWarning, which says "F-score is ill-defined and being set to 0.0 in labels with no predicted samples."
I'm familiar with what the UndefinedMetricWarning means, after doing some searching on this community and Google. I guess my two questions can be summarized as:
Why does my output vary on each iteration? Is there something happening in the preprocessing stage that I'm not aware of?
I've also tried to use the F-score with other data splits, but I always get the warning. Is this unavoidable?
Thank you.
You are splitting the dataset into train and test sets randomly, so on every run the model is trained on different training data and evaluated on different test data. Because of this you get a range of F-scores depending on how well the model happens to be trained.
To reproduce the same result on each run, use the random_state parameter. It fixes the state of the random number generator so that the same 'random' numbers are generated in the same order every time; the value itself can be any integer.
#train test split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=13)
#Decision tree model
dectree = DecisionTreeClassifier(class_weight='balanced', random_state=2018)

Heavily weighted distance returns the same results as regular distance in knn with iris dataset

I am experimenting with the way the weights on the distance affect the performance of the kNN algorithm and for a reproducible example I am working with the iris dataset.
To my surprise, weighting 2 of the predictors 100 times more than the other 2 generates predictions identical to those of the unweighted model. What explains this rather counterintuitive finding?
My code is the following:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_original = iris['data']
Y = iris['target']

sc = StandardScaler()  # Defines the parameters of the Scaler
X = sc.fit_transform(X_original)  # Transforms the original data to standardized data and returns them

from sklearn.model_selection import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(n_splits=1, train_size=0.8, test_size=0.2)
split = sss.split(X, Y)
s = list(split)
train_index = s[0][0]
test_index = s[0][1]
X_train = X[train_index, ]
X_test = X[test_index, ]
Y_train = Y[train_index]
Y_test = Y[test_index]

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=6)
iris_fit = knn.fit(X_train, Y_train)  # The data can be passed as numpy arrays or pandas dataframes/series.
                                      # All the data should be numeric and there should be no NaNs.
predictions_w1 = knn.predict(X_test)

weights = np.array([1, 1, 100, 100])
weights = weights / np.sum(weights)
knn_w = KNeighborsClassifier(n_neighbors=6, metric='wminkowski', p=2,
                             metric_params={'w': weights})
iris_fit_w = knn_w.fit(X_train, Y_train)  # The data can be passed as numpy arrays or pandas dataframes/series.
                                          # All the data should be numeric and there should be no NaNs.
predictions_w100 = knn_w.predict(X_test)

(predictions_w1 != predictions_w100).sum()
0
They are not always the same; add a random_state to your train/test split and you will see how the results change for different values.
StratifiedShuffleSplit(n_splits = 1, train_size = 0.8, test_size = 0.2, random_state=3)
Additionally, the weighted Minkowski distance with such extreme weights on the 3rd (petal length) and 4th (petal width) features basically gives you the same results as if you ran kNN on those 2 features alone with an unweighted Minkowski distance. And since those two features are quite informative on their own, it is no surprise that you get results very similar to the case of considering all 4 features. The iris scatter plots on Wikipedia illustrate this (the image is not reproduced here).
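As a quick check (not part of the original answer), you can compare the weighted model against a plain kNN fitted only on the standardized petal length and width columns, reusing the variables from the question; the two sets of predictions should agree on almost every test point:
from sklearn.neighbors import KNeighborsClassifier

# kNN on only the two heavily weighted features (columns 2 and 3)
knn_petal = KNeighborsClassifier(n_neighbors=6)
knn_petal.fit(X_train[:, 2:4], Y_train)
predictions_petal = knn_petal.predict(X_test[:, 2:4])

# Count test points where the petal-only model and the heavily weighted model disagree
print((predictions_petal != predictions_w100).sum())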

Using explicit (predefined) validation set for grid search with sklearn

I have a dataset, which has previously been split into 3 sets: train, validation and test. These sets have to be used as given in order to compare the performance across different algorithms.
I would now like to optimize the parameters of my SVM using the validation set. However, I cannot find how to input the validation set explicitly into sklearn.grid_search.GridSearchCV(). Below is some code I've previously used for doing K-fold cross-validation on the training set. However, for this problem I need to use the validation set as given. How can I do that?
from sklearn import svm, cross_validation
from sklearn.grid_search import GridSearchCV

# (some code left out to simplify things)

skf = cross_validation.StratifiedKFold(y_train, n_folds=5, shuffle=True)
clf = GridSearchCV(svm.SVC(tol=0.005, cache_size=6000,
                           class_weight=penalty_weights),
                   param_grid=tuned_parameters,
                   n_jobs=2,
                   pre_dispatch="n_jobs",
                   cv=skf,
                   scoring=scorer)
clf.fit(X_train, y_train)
Use PredefinedSplit
ps = PredefinedSplit(test_fold=your_test_fold)
then set cv=ps in GridSearchCV
test_fold : array-like, shape (n_samples,)
test_fold[i] gives the test set fold of sample i. A value of -1 indicates that the corresponding sample is not part of any test set folds, but will instead always be put into the training fold.
Also see here.
When using a validation set, set test_fold to 0 for all samples that are part of the validation set, and to -1 for all other samples.
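As a minimal sketch of that recipe with the current sklearn.model_selection API (assuming X_train/y_train and X_val/y_val are your predefined train and validation sets, and tuned_parameters is the grid from the question):
import numpy as np
from sklearn import svm
from sklearn.model_selection import GridSearchCV, PredefinedSplit

# -1 = always in the training fold, 0 = in the single validation fold
test_fold = np.concatenate([np.full(len(X_train), -1), np.zeros(len(X_val), dtype=int)])
ps = PredefinedSplit(test_fold=test_fold)

# The search must be fit on train + validation data stacked in the same order
X_trainval = np.concatenate([X_train, X_val])
y_trainval = np.concatenate([y_train, y_val])

clf = GridSearchCV(svm.SVC(), param_grid=tuned_parameters, cv=ps)
clf.fit(X_trainval, y_trainval)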
Consider using the hypopt Python package (pip install hypopt) for which I am an author. It's a professional package created specifically for parameter optimization with a validation set. It works with any scikit-learn model out-of-the-box and can be used with Tensorflow, PyTorch, Caffe2, etc. as well.
# Code from https://github.com/cgnorthcutt/hypopt
# Assuming you already have train, test, val sets and a model.
from hypopt import GridSearch
from sklearn.svm import SVR  # import added for completeness

param_grid = [
    {'C': [1, 10, 100], 'kernel': ['linear']},
    {'C': [1, 10, 100], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]

# Grid-search all parameter combinations using a validation set.
opt = GridSearch(model=SVR(), param_grid=param_grid)
opt.fit(X_train, y_train, X_val, y_val)
print('Test Score for Optimized Parameters:', opt.score(X_test, y_test))
EDIT: I (think I) received -1's on this response because I'm suggesting a package that I authored. This is unfortunate, given that the package was created specifically to solve this type of problem.
# Import Libraries
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.model_selection import PredefinedSplit

# Split Data to Train and Validation
X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.8, stratify=y, random_state=2020)

# Create a list where train data indices are -1 and validation data indices are 0
split_index = [-1 if x in X_train.index else 0 for x in X.index]

# Use the list to create PredefinedSplit
pds = PredefinedSplit(test_fold=split_index)

# Use PredefinedSplit in GridSearchCV
clf = GridSearchCV(estimator=estimator,
                   cv=pds,
                   param_grid=param_grid)

# Fit with all data
clf.fit(X, y)
To add to @Vinubalan's answer: when the train/valid/test split is not done with scikit-learn's train_test_split() function, i.e. the dataframes were already split manually beforehand and scaled/normalized to prevent leakage from the training data, the numpy arrays can be concatenated as follows.
import numpy as np
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)

from sklearn.model_selection import PredefinedSplit, GridSearchCV

split_index = [-1]*len(X_train) + [0]*len(X_val)
X = np.concatenate((X_train, X_val), axis=0)
y = np.concatenate((y_train, y_val), axis=0)
pds = PredefinedSplit(test_fold=split_index)

clf = GridSearchCV(estimator=estimator,
                   cv=pds,
                   param_grid=param_grid)

# Fit with all data
clf.fit(X, y)
I wanted to provide some reproducible code that creates a validation split using the last 20% of observations.
from sklearn import datasets
from sklearn.model_selection import PredefinedSplit, GridSearchCV
from sklearn.ensemble import GradientBoostingRegressor

# load data
df_train = datasets.fetch_california_housing(as_frame=True).data
y = datasets.fetch_california_housing().target

param_grid = {"max_depth": [5, 6],
              'learning_rate': [0.03, 0.06],
              'subsample': [.5, .75]
              }

model = GradientBoostingRegressor()

# Create a single validation split
val_prop = .2
n_val_rows = round(len(df_train) * val_prop)
val_starting_index = len(df_train) - n_val_rows
cv = PredefinedSplit([-1 if i < val_starting_index else 0 for i in df_train.index])

# Use PredefinedSplit in GridSearchCV
results = GridSearchCV(estimator=model,
                       cv=cv,
                       param_grid=param_grid,
                       verbose=True,
                       n_jobs=-1)

# Fit with all data
results.fit(df_train, y)
results.best_params_
The cv argument of the SearchCV classes (GridSearchCV or RandomizedSearchCV) can also simply be an iterable yielding the train and validation indices, i.e. cv=((train_idcs, val_idcs),).
Note that the data on which the search estimator is fit should be the train+val set, and the specified indices are used by sklearn to separate them internally. Additionally, when working with dataframes the specified indices should be usable as positional (iloc) indices, so reset the index first (without dropping it if you will need it later).
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (
    train_test_split,
    RandomizedSearchCV,
)

data = load_iris(as_frame=True)["frame"]

# These indices will serve as the explicit and predefined split
train_idcs, val_idcs = train_test_split(
    data.index,
    random_state=42,
    stratify=data.target,
)

param_grid = dict(
    n_estimators=[50, 100, 150, 200],
    max_samples=[0.85, 0.9, 0.95, 1],
    max_depth=[3, 5, 7, 10],
    max_features=["sqrt", "log2", 0.85, 0.9, 0.95, 1],
)

search_clf = RandomizedSearchCV(
    estimator=RandomForestClassifier(),
    param_distributions=param_grid,
    n_iter=50,
    cv=((train_idcs, val_idcs),),  # explicit predefined split in terms of indices
    random_state=42,
)

# X is the first 4 columns i.e. the sepal and petal widths and lengths
# and y is the 5th column i.e. target column
search_clf.fit(X=data.iloc[:, :4], y=data.target)
Also, be mindful of whether you want to refit on the whole (train+val) data or only on the train data, and retrain the classifier with the best-found parameters accordingly.
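As a small sketch of that last point (reusing search_clf, data and train_idcs from the example above): with the default refit=True the search refits best_estimator_ on everything passed to fit, i.e. train+val here, so if you want a model trained on the training portion only you have to retrain it yourself with the best-found parameters:
# Model refit by the search itself: trained on train+val (all rows passed to fit)
best_on_trainval = search_clf.best_estimator_

# Alternative: retrain on the training rows only, using the best-found parameters
best_on_train_only = RandomForestClassifier(**search_clf.best_params_)
best_on_train_only.fit(data.loc[train_idcs, data.columns[:4]], data.loc[train_idcs, "target"])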
