I am learning machine learning from the book Artificial-Intelligence-with-Python-Second-Edition and ran into this error:
ValueError: too many values to unpack (expected 3)
Here is the code from the book:
print("\nGrid scores for the parameter grid:")
for params, avg_score, _ in classifier.grid_scores_:  # from sklearn import grid_search
    print(params, '-->', round(avg_score, 3))
(The code for the tutorial was taken from the GitHub: Artificial-Intelligence-with-Python-Second-Edition/Chapter06/run_grid_search.py )
The grid_search module (from sklearn import grid_search) is no longer part of scikit-learn, so I need to use GridSearchCV from sklearn.model_selection and its cv_results_ attribute instead of grid_scores_.
But when I use the cv_results_ attribute, I get this error:
ValueError: too many values to unpack (expected 3)
I have tried different variants and also re-read all the help on this topic and I cannot find a solution yet.
My full code:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split
from utilities import visualize_classifier
# Load input data
input_file = 'data_random_forests.txt'
data = np.loadtxt(input_file, delimiter=',')
X, y = data[:, :-1], data[:, -1]
# Separate input data into three classes based on labels
class_0 = np.array(X[y==0])
class_1 = np.array(X[y==1])
class_2 = np.array(X[y==2])
# Split the data into training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=5)
# Define the parameter grid
parameter_grid = [
    {'n_estimators': [100], 'max_depth': [2, 4, 7, 12, 16]},
    {'max_depth': [4], 'n_estimators': [25, 50, 100, 250]},
]
metrics = ['precision_weighted', 'recall_weighted']
for metric in metrics:
    print("\n##### Searching optimal parameters for", metric)
    classifier = GridSearchCV(
        ExtraTreesClassifier(random_state=0),
        parameter_grid, cv=5, scoring=metric)
    classifier.fit(X_train, y_train)
    print("\nGrid scores for the parameter grid:")
    for params, avg_score, _ in classifier.cv_results_:
        print(params, '-->', round(avg_score, 3))
    print("\nBest parameters:", classifier.best_params_)
    y_pred = classifier.predict(X_test)
    print("\nPerformance report:\n")
    print(classification_report(y_test, y_pred))
GridSearchCV.cv_results_ is a dictionary of NumPy arrays (see the scikit-learn documentation). Iterating over a dictionary yields its keys, so your loop tries to unpack each key string (for example 'mean_fit_time') into three variables (params, avg_score and _), which raises the error. The book's code worked because the old grid_scores_ attribute was a list of 3-element tuples, whereas cv_results_ is a single dictionary.
It's very straightforward to convert the dictionary into a pandas DataFrame.
import pandas as pd
df = pd.DataFrame(classifier.cv_results_)
You are interested in printing only the parameters and the scores, so let's do that by selecting the columns which have 'param' or 'score' in their names:
df_columns_to_print = [column for column in df.columns if 'param' in column or 'score' in column]
print(df[df_columns_to_print])
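To reproduce the book's line-by-line output without pandas, the parameter dictionaries and mean scores can also be paired directly from cv_results_. The following is a minimal sketch, assuming a synthetic dataset from make_classification in place of data_random_forests.txt and a reduced parameter grid; 'params' and 'mean_test_score' are standard keys of cv_results_.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the book's data file (an assumption for this sketch)
X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3,
                           n_clusters_per_class=1, random_state=5)

# Reduced grid so the sketch runs quickly
parameter_grid = {'n_estimators': [10, 25], 'max_depth': [2, 4]}
classifier = GridSearchCV(ExtraTreesClassifier(random_state=0),
                          parameter_grid, cv=5, scoring='precision_weighted')
classifier.fit(X, y)

# Pair each parameter combination with its mean cross-validated score
results = classifier.cv_results_
grid_scores = list(zip(results['params'], results['mean_test_score']))

print("\nGrid scores for the parameter grid:")
for params, avg_score in grid_scores:
    print(params, '-->', round(avg_score, 3))
```

The third element of the old tuples (the per-fold scores) has no single counterpart here; the per-fold values live under keys such as 'split0_test_score'.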
Related
I am new to machine learning so be gentle. I have a single csv file for data, that I would like to split into test/train data. I have used the following code to split the data
raw_data1.drop('Income_Weekly', axis=1, inplace=True)
df = raw_data1
df['split'] = np.random.randn(df.shape[0], 1)
msk = np.random.rand(len(df)) <= 0.5
X_train = df[msk]
y_train = df[~msk]
However, when trying to apply the xgboost algorithm, I receive an error:
ValueError: Found input variables with inconsistent numbers of samples: [4791, 5006]
The error occurs at the line:
random_cv.fit(X_train,y_train)
My complete code is as follows:
import xgboost
from sklearn.model_selection import RandomizedSearchCV
classifier=xgboost.XGBRegressor()
regressor=xgboost.XGBRegressor()
booster=['gbtree','gblinear']
base_score=[0.25,0.5,0.75,1]
## Hyper Parameter Optimization
#
n_estimators = [100, 500, 900, 1100, 1500]
max_depth = [2, 3, 5, 10, 15]
booster=['gbtree','gblinear']
learning_rate=[0.05,0.1,0.15,0.20]
min_child_weight=[1,2,3,4]
# Define the grid of hyperparameters to search
hyperparameter_grid = {
    'n_estimators': n_estimators,
    'max_depth': max_depth,
    'learning_rate': learning_rate,
    'min_child_weight': min_child_weight,
    'booster': booster,
    'base_score': base_score
}
random_cv = RandomizedSearchCV(estimator=regressor,
                               param_distributions=hyperparameter_grid,
                               cv=5, n_iter=50,
                               scoring='neg_mean_absolute_error', n_jobs=4,
                               verbose=5,
                               return_train_score=True,
                               random_state=42)
random_cv.fit(X_train,y_train)
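Your mask splits the rows of df into two DataFrames of different sizes (4791 and 5006 rows here), and you then pass one as X and the other as y. fit expects X (the features) and y (the target) to describe the same rows, so they must have the same number of samples.
Consider changing the splitting code to use sklearn's train_test_split, separating the feature columns from the target column: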
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df[list_x_cols], df['y_col'],test_size=0.2)
It looks like you are using a random mask to split the set. train_test_split shuffles the rows for you; test_size sets the fraction held out for testing, and values such as 0.2, 0.3 or 0.33 are used often.
list_x_cols should be a list containing the names of all your feature columns, e.g. ['xvar1', 'xvar2', 'xvar3', ...].
To test whether the shapes of your split are aligned, the following code can be used. Please share the outcome of the print statements:
print(X_train.shape)
print(y_train.shape)
if X_train.shape[0] != y_train.shape[0]:
    print("X and y rows imbalanced")
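As a sketch of the suggested split, the example below builds a small hypothetical DataFrame (the column names xvar1, xvar2 and y_col are placeholders, not the asker's real columns), separates the features from the target, and confirms that the row counts match.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: two feature columns and one target column
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'xvar1': rng.normal(size=100),
    'xvar2': rng.normal(size=100),
    'y_col': rng.normal(size=100),
})

list_x_cols = ['xvar1', 'xvar2']  # names of the feature columns
X_train, X_test, y_train, y_test = train_test_split(
    df[list_x_cols], df['y_col'], test_size=0.2, random_state=42)

# X and y now describe the same rows, so their lengths agree
print(X_train.shape, y_train.shape)
```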
I'm doing some basic hyperparameter optimization for an xgboost model and have run across the following issue. Firstly my code:
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score
import xgboost as xgb
from functools import partial
from skopt import space, gp_minimize
<Some preprocessing...>
x = Oe.fit_transform(x)
y = Ly.fit_transform(y)
def optimize(params, param_names, x, y):
    params = dict(zip(params, param_names))
    X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
    nc = len(set(y_train))
    xgb_model = xgb.XGBClassifier(use_label_encoder=False, num_class=nc+1, objective="multi:softprob", **params)
    xgb_model.fit(X_train, y_train)
    preds = xgb_model.predict(X_test)
    acc = accuracy_score(y_test, preds)
    return -1.0 * acc
param_space = [
    space.Integer(3, 10, name="max_depth"),
    space.Real(0.01, 0.3, prior="uniform", name="learning_rate"),
]
param_names = [
    "max_depth",
    "learning_rate"
]
optimization_function = partial(
    optimize,
    param_names,
    x=x,
    y=y
)
result = gp_minimize(
    optimization_function,
    dimensions=param_space,
    n_calls=30,
    n_random_starts=6,
    verbose=True
)
print(dict(zip(param_names, result.x)))
After doing some searching myself, I realized that if I don't set random_state on my train/test split (which would make the result deterministic), I risk getting a y_train that doesn't contain all labels in the form 0, 1, 2, ..., which produces the following error:
ValueError: The label must consist of integer labels of form 0, 1, 2, ..., [num_class - 1].
On the other hand, if I use a random state then my optimization implementation that I use here loses its purpose since I will always have the same result, considering I'm using a small dataset.
And indeed after running my code with random_state=0 for example, after 3 iterations of gp_minimize, I end up getting the same optimum no matter what combination of hyperparameters it produces.
Update: One could argue that even if I chose different random states, the optimal combination that I would get, would also depend on that set of random states, so in the end I only want to know if this is the right approach to optimize my model.
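One common way to make the objective depend less on a single random split is to score every candidate with stratified K-fold cross-validation, which also keeps every class present in each training fold (as long as each class has at least as many samples as there are folds). The sketch below is an illustration under stated assumptions, not the asker's code: it uses scikit-learn's RandomForestClassifier and the iris dataset as stand-ins for the XGBoost model and the real data, and the function name cv_objective is hypothetical.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stand-in data (assumption): iris instead of the asker's encoded dataset
x, y = load_iris(return_X_y=True)

def cv_objective(params, param_names, x, y):
    """Return the negative mean cross-validated accuracy for one candidate."""
    params = dict(zip(param_names, params))
    # Stand-in model (assumption): a random forest instead of XGBClassifier
    model = RandomForestClassifier(n_estimators=20, random_state=0, **params)
    # Fixed, stratified folds: every candidate is scored on the same splits
    folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_val_score(model, x, y, cv=folds, scoring="accuracy")
    return -1.0 * float(np.mean(scores))

score = cv_objective([3], ["max_depth"], x, y)
print(score)
```

A function like this could be handed to gp_minimize through functools.partial, with param_names, x and y bound by keyword so that the optimizer supplies only the parameter values.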
I'm trying to implement a complement naive Bayes classifier using sklearn. My data have very imbalanced classes (30k samples of class 0 and 6k samples of class 1), and I'm trying to compensate for this using class weights.
Here is the shape of my dataset:
[image: shape of the dataset]
I tried to use the compute_class_weight function to calculate the weights and then pass them to the fit function when training my model:
import numpy as np
import seaborn as sn
import pandas as pd
from pandas import DataFrame
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.utils import class_weight
from sklearn.naive_bayes import ComplementNB
#Import the csv data
data = pd.read_csv('output_pt900.csv')
#Create the header of the csv file
header = []
for x in range(0, 2500):
    header.append('pixel' + str(x))
header.append('status')
#Add the header to the csv data
data.columns = header
#Replace the b's and the f's in the status column by 0 and 1
data['status'] = data['status'].replace('b',0)
data['status'] = data['status'].replace('f',1)
print(data)
#Drop the NaN values
data = data.dropna()
#Separate the features variables and the status
y = data['status']
x = data.drop('status',axis=1)
#Split the original dataset into two other: train and test
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size = 0.2)
all_together = y_train.to_numpy()
unique_classes = np.unique(all_together)
c_w = class_weight.compute_class_weight('balanced', classes=unique_classes, y=all_together)
clf = ComplementNB()
clf.fit(x_train,y_train, c_w)
y_predict = clf.predict(x_test)
cm = confusion_matrix(y_test, y_predict)
svm = sn.heatmap(cm, cmap='Blues', annot=True, fmt='g')
figure=svm.get_figure()
figure.savefig('confusion_matrix_cnb.png', dpi=400)
plt.show()
but I got this error:
ValueError: sample_weight.shape == (2,), expected (29752,)!
Does anyone know how to use class weights in sklearn models?
compute_class_weight returns an array whose length equals the number of unique classes, holding the weight to assign to instances of each class (see the scikit-learn documentation). So if there are 2 unique classes, c_w has length 2 and contains the weights that should be assigned to samples with label 0 and label 1.
When you call fit on your model, the sample_weight argument expects one weight per sample, which explains the error you received. To solve this, use the c_w returned by compute_class_weight to build an array of individual sample weights. Because your labels are the integers 0 and 1, they can index c_w directly: [c_w[i] for i in all_together]. Your fit call would ultimately look something like:
clf.fit(x_train, y_train, sample_weight=[c_w[i] for i in all_together])
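As an alternative sketch, scikit-learn's compute_sample_weight can produce the per-sample array in one call. The example assumes synthetic non-negative count features (ComplementNB requires non-negative inputs) and an artificial 5:1 class imbalance in place of the asker's pixel data.

```python
import numpy as np
from sklearn.naive_bayes import ComplementNB
from sklearn.utils.class_weight import compute_class_weight, compute_sample_weight

rng = np.random.default_rng(0)
# Synthetic, non-negative features and imbalanced 0/1 labels (assumptions)
x_train = rng.integers(0, 10, size=(600, 20))
y_train = np.array([0] * 500 + [1] * 100)

# One weight per class ...
classes = np.unique(y_train)
c_w = compute_class_weight('balanced', classes=classes, y=y_train)

# ... and the same weights expanded to one entry per sample
sample_weight = compute_sample_weight('balanced', y_train)

clf = ComplementNB()
clf.fit(x_train, y_train, sample_weight=sample_weight)
print(c_w, sample_weight.shape)
```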
I built a regression model to predict energy (1 column) from 5 variables (5 columns). I used my experimental data to train and fit the model, and it works with a good score:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('new.csv')
X = data.drop(['E'], axis=1)
y = data['E']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5,
                                                    random_state=2)
from sklearn import ensemble
# The default loss is least squares (formerly loss='ls')
clf1 = ensemble.GradientBoostingRegressor(n_estimators=400, max_depth=5,
                                          min_samples_split=2,
                                          learning_rate=0.1)
clf1.fit(X_train, y_train)
clf1.score(X_test, y_test)
But now I want to add a new CSV file containing new data for the 5 variables mentioned to an OrderedDict and use the model to predict energy.
With the code below I manually insert the data row by row, and it predicts energy correctly:
from collections import OrderedDict
new_data = OrderedDict([('H',48.52512), ('A',169.8379), ('P',55.52512),
('R',3.058758), ('Q',2038.055)])
new_data = pd.Series(new_data)
data = new_data.values.reshape(1, -1)
clf1.predict(data)
But I can't do this with huge datasets and need to import a CSV file. I tried the code below but can't figure it out:
data_2 = pd.read_csv('new2.csv')
X_new = OrderedDict(data_2)
new_data = pd.Series(X_new)
data = new_data.values.reshape(1, -1)
clf1.predict(data)
But it gives me: ValueError: setting an array element with a sequence.
Can anyone help me?
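A likely cause of the error is that wrapping the whole DataFrame in an OrderedDict and then a Series yields one element per column, each holding an entire column, so the reshaped array cannot be converted to numbers. A minimal sketch of the usual approach is to pass the DataFrame's feature columns straight to predict. The sketch assumes the new file has the same five columns (H, A, P, R, Q), and it uses made-up numbers and an in-memory CSV instead of new.csv and new2.csv.

```python
import io
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

feature_cols = ['H', 'A', 'P', 'R', 'Q']

# Made-up training data standing in for new.csv (an assumption)
rng = np.random.default_rng(2)
train = pd.DataFrame(rng.uniform(0, 100, size=(200, 5)), columns=feature_cols)
train['E'] = train['H'] + 0.5 * train['A'] + rng.normal(size=200)

clf1 = GradientBoostingRegressor(n_estimators=50, max_depth=3, random_state=0)
clf1.fit(train[feature_cols], train['E'])

# In-memory text standing in for new2.csv (an assumption)
csv_text = "H,A,P,R,Q\n48.5,69.8,55.5,3.1,20.4\n10.0,20.0,30.0,40.0,50.0\n"
data_2 = pd.read_csv(io.StringIO(csv_text))

# Predict for every row at once, selecting columns in the training order
predictions = clf1.predict(data_2[feature_cols])
print(predictions)
```

With a real file, pd.read_csv('new2.csv') would replace the in-memory text; the result is one prediction per row.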
I have a dataset, which has previously been split into 3 sets: train, validation and test. These sets have to be used as given in order to compare the performance across different algorithms.
I would now like to optimize the parameters of my SVM using the validation set. However, I cannot find how to input the validation set explicitly into sklearn.grid_search.GridSearchCV(). Below is some code I've previously used for doing K-fold cross-validation on the training set. However, for this problem I need to use the validation set as given. How can I do that?
from sklearn import svm, cross_validation
from sklearn.grid_search import GridSearchCV
# (some code left out to simplify things)
skf = cross_validation.StratifiedKFold(y_train, n_folds=5, shuffle = True)
clf = GridSearchCV(svm.SVC(tol=0.005, cache_size=6000,
                           class_weight=penalty_weights),
                   param_grid=tuned_parameters,
                   n_jobs=2,
                   pre_dispatch="n_jobs",
                   cv=skf,
                   scoring=scorer)
clf.fit(X_train, y_train)
Use PredefinedSplit:
ps = PredefinedSplit(test_fold=your_test_fold)
Then set cv=ps in GridSearchCV. From the documentation:
test_fold : array-like, shape (n_samples,)
test_fold[i] gives the test set fold of sample i. A value of -1 indicates that the corresponding sample is not part of any test set folds, but will instead always be put into the training fold.
When using a validation set, set test_fold to 0 for all samples that are part of the validation set, and to -1 for all other samples.
Consider using the hypopt Python package (pip install hypopt) for which I am an author. It's a professional package created specifically for parameter optimization with a validation set. It works with any scikit-learn model out-of-the-box and can be used with Tensorflow, PyTorch, Caffe2, etc. as well.
# Code from https://github.com/cgnorthcutt/hypopt
# Assuming you already have train, test, val sets and a model.
from hypopt import GridSearch
from sklearn.svm import SVR
param_grid = [
    {'C': [1, 10, 100], 'kernel': ['linear']},
    {'C': [1, 10, 100], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]
# Grid-search all parameter combinations using a validation set.
opt = GridSearch(model = SVR(), param_grid = param_grid)
opt.fit(X_train, y_train, X_val, y_val)
print('Test Score for Optimized Parameters:', opt.score(X_test, y_test))
EDIT: I (think I) received -1's on this response because I'm suggesting a package that I authored. This is unfortunate, given that the package was created specifically to solve this type of problem.
# Import Libraries
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.model_selection import PredefinedSplit
# Split Data to Train and Validation
X_train, X_val, y_train, y_val = train_test_split(X, y, train_size = 0.8, stratify = y,random_state = 2020)
# Create a list where train data indices are -1 and validation data indices are 0
split_index = [-1 if x in X_train.index else 0 for x in X.index]
# Use the list to create PredefinedSplit
pds = PredefinedSplit(test_fold = split_index)
# Use PredefinedSplit in GridSearchCV
clf = GridSearchCV(estimator=estimator,
                   cv=pds,
                   param_grid=param_grid)
# Fit with all data
clf.fit(X, y)
To add to @Vinubalan's answer: when the train/validation/test split was not made with scikit-learn's train_test_split() function, i.e. the DataFrames were already split manually beforehand and scaled/normalized using only the training data so as to prevent leakage, the NumPy arrays can be concatenated.
import numpy as np
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)
from sklearn.model_selection import PredefinedSplit, GridSearchCV
split_index = [-1]*len(X_train) + [0]*len(X_val)
X = np.concatenate((X_train, X_val), axis=0)
y = np.concatenate((y_train, y_val), axis=0)
pds = PredefinedSplit(test_fold = split_index)
clf = GridSearchCV(estimator=estimator,
                   cv=pds,
                   param_grid=param_grid)
# Fit with all data
clf.fit(X, y)
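To check what a PredefinedSplit will actually do before running a search, its folds can be listed directly. A small sketch with a made-up test_fold for three training rows followed by two validation rows:

```python
from sklearn.model_selection import PredefinedSplit

# -1: always in the training part; 0: member of the single validation fold
split_index = [-1] * 3 + [0] * 2
pds = PredefinedSplit(test_fold=split_index)

# Exactly one split is produced, with positional row indices for each part
n_splits = pds.get_n_splits()
train_idx, val_idx = next(pds.split())
print(n_splits, train_idx, val_idx)
```

The indices are positions in the concatenated array, which is why the training rows must come first when split_index is built as [-1]*len(X_train) + [0]*len(X_val).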
I wanted to provide some reproducible code that creates a validation split using the last 20% of observations.
from sklearn import datasets
from sklearn.model_selection import PredefinedSplit, GridSearchCV
from sklearn.ensemble import GradientBoostingRegressor
# load data
df_train = datasets.fetch_california_housing(as_frame=True).data
y = datasets.fetch_california_housing().target
param_grid = {"max_depth": [5, 6],
              'learning_rate': [0.03, 0.06],
              'subsample': [.5, .75]
              }
model = GradientBoostingRegressor()
# Create a single validation split
val_prop = .2
n_val_rows = round(len(df_train) * val_prop)
val_starting_index = len(df_train) - n_val_rows
cv = PredefinedSplit([-1 if i < val_starting_index else 0 for i in df_train.index])
# Use PredefinedSplit in GridSearchCV
results = GridSearchCV(estimator=model,
                       cv=cv,
                       param_grid=param_grid,
                       verbose=True,
                       n_jobs=-1)
# Fit with all data
results.fit(df_train, y)
results.best_params_
The cv argument of GridSearchCV or RandomizedSearchCV can also simply be an iterable of (train, validation) index pairs, i.e. cv=((train_idcs, val_idcs),).
Note that the data on which the search is fit should be the train+val set; sklearn uses the specified indices to separate the two internally. Additionally, when working with DataFrames, the indices must be positional (usable with iloc), so reset the index first (keep the old index as a column if it will be required later).
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (
    train_test_split,
    RandomizedSearchCV,
)
data = load_iris(as_frame=True)["frame"]
# These indices will serve as the explicit, predefined split
train_idcs, val_idcs = train_test_split(
    data.index,
    random_state=42,
    stratify=data.target,
)
param_grid = dict(
    n_estimators=[50, 100, 150, 200],
    # 1.0 (float) means all samples/features; the integer 1 would mean a single one
    max_samples=[0.85, 0.9, 0.95, 1.0],
    max_depth=[3, 5, 7, 10],
    max_features=["sqrt", "log2", 0.85, 0.9, 0.95, 1.0],
)
search_clf = RandomizedSearchCV(
    estimator=RandomForestClassifier(),
    param_distributions=param_grid,
    n_iter=50,
    cv=((train_idcs, val_idcs),),  # explicit predefined split in terms of indices
    random_state=42,
)
# X is the first 4 columns i.e. the sepal and petal widths and lengths
# and y is the 5th column i.e. target column
search_clf.fit(X=data.iloc[:,:4], y=data.target)
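Also, be mindful of whether you want the final model fit on the whole data (train+val) or only on the train data. With the default refit=True, the search refits the best parameters on all the data passed to fit, so retrain the classifier yourself with the best parameters if you want a train-only model.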
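As a sketch of the train-only option, the search can be run with refit=False and the best parameters then used to train a fresh model on the training rows alone. The example reuses the iris data with a deliberately small parameter grid; the variable names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

data = load_iris(as_frame=True)["frame"]
X, y = data.iloc[:, :4], data["target"]

# Positional indices for the single predefined train/validation split
train_idcs, val_idcs = train_test_split(
    data.index.to_numpy(), random_state=42, stratify=y)

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [20, 50], "max_depth": [3, 5]},
    cv=[(train_idcs, val_idcs)],  # single predefined split
    refit=False,                  # do not refit on train + validation
)
search.fit(X, y)

# Retrain with the best parameters on the training rows only
final_model = RandomForestClassifier(random_state=0, **search.best_params_)
final_model.fit(X.iloc[train_idcs], y.iloc[train_idcs])
val_accuracy = final_model.score(X.iloc[val_idcs], y.iloc[val_idcs])
print(search.best_params_, val_accuracy)
```

With refit=False the search object has no best_estimator_ and cannot predict by itself, which makes the choice of training data for the final model explicit.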