Target variable error when building a decision tree classifier in Python

X = balance_data.values[:, 1:5]
Y = balance_data.values[:,0]
X_train, X_test, y_train, y_test = train_test_split( X, Y, test_size = 0.3, random_state = 100)
clf_entropy = DecisionTreeClassifier(criterion="entropy", random_state=100,
                                     max_depth=3, min_samples_leaf=5)
clf_entropy.fit(X_train, y_train)
When I try to run the above code to fit the data and train the model, it gives me the following error. I am using Google Colab for Python.
Can anyone please help me with this?
ValueError                                Traceback (most recent call last)
<ipython-input-33-3523056235b2> in <module>()
      1 clf_entropy = DecisionTreeClassifier()
----> 2 clf_entropy.fit(X_train, y_train)

2 frames
/usr/local/lib/python3.7/dist-packages/sklearn/utils/multiclass.py in check_classification_targets(y)
    167     if y_type not in ['binary', 'multiclass', 'multiclass-multioutput',
    168                       'multilabel-indicator', 'multilabel-sequences']:
--> 169         raise ValueError("Unknown label type: %r" % y_type)

ValueError: Unknown label type: 'unknown'

DecisionTreeClassifier checks the type of the target variable you pass in, so if each entry is a tuple or list it will raise that error. For example, this is how it should go:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

balance_data = pd.concat([
    pd.DataFrame(np.random.choice(['A', 'B'], 100)),
    pd.DataFrame(np.random.uniform(0, 1, (100, 5)))
], axis=1)

X = balance_data.values[:, 1:5]
Y = balance_data.values[:, 0]
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=100)
clf_entropy = DecisionTreeClassifier(criterion="entropy", random_state=100,
                                     max_depth=3, min_samples_leaf=5)
clf_entropy.fit(X_train, y_train)
Now, if each entry of your target variable is itself a list or array:
balance_data.iloc[:,0] = [[np.random.choice(['A','B','C'],1)] for i in range(100)]
X = balance_data.values[:, 1:5]
Y = balance_data.values[:,0]
Y[0]
Out[36]: [array(['C'], dtype='<U1')]
Then it will throw the same error you saw:
X_train, X_test, y_train, y_test = train_test_split( X, Y, test_size = 0.3, random_state = 100)
clf_entropy = DecisionTreeClassifier(criterion="entropy", random_state=100,
                                     max_depth=3, min_samples_leaf=5)
clf_entropy.fit(X_train, y_train)
ValueError: Unknown label type: 'unknown'
What you should do is check what is in balance_data.values[:,0], and make sure there's no embedded list or tuple.
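A quick way to do that check and, if needed, flatten the labels back to scalars (a minimal sketch, assuming balance_data is your DataFrame and the labels sit in the first column):
import numpy as np

# Look at the type of the first few label entries; they should be plain scalars.
print([type(v) for v in balance_data.values[:5, 0]])

# If each label turns out to be wrapped in a list/array, unwrap it per row
# (adjust the unwrapping to whatever nesting you actually find):
Y = np.array([np.asarray(v).ravel()[0] for v in balance_data.values[:, 0]])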

Related

K-Folds cross-validator shows KeyError: None of Int64Index

I am trying to use the K-Folds cross-validator with a decision tree. I use a for loop to train and test the data from KFold, as in this code:
import pandas as pd
from sklearn.model_selection import train_test_split, KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

df = pd.read_csv(r'C:\\Users\data.csv')
# split data into X and y
X = df.iloc[:, :200]
Y = df.iloc[:, 200]
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)
clf = DecisionTreeClassifier()
kf = KFold(n_splits=5, shuffle=True, random_state=3)
cnt = 1
# Cross-Validate
for train, test in kf.split(X, Y):
    print(f'Fold:{cnt}, Train set: {len(train)}, Test set: {len(test)}')
    cnt += 1
    X_train = X[train]
    y_train = Y[train]
    X_test = X[test]
    y_test = Y[test]
    clf = clf.fit(X_train, y_train)
    predictions = clf.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)
    print("test")
    print(y_test)
    print("predict")
    print(predictions)
    print("Accuracy: %.2f%%" % (accuracy * 100.0))
When I run it, it shows an error like this:
KeyError: "None of [Int64Index([ 0, 1, 2, 5, 7, 8, 9, 10, 11, 12,\n ...\n 161, 164, 165, 166, 167, 168, 169, 170, 171, 173],\n dtype='int64', length=120)]
How to fix it?
The issue is here:
X_train = X[train]
y_train = Y[train]
X_test = X[test]
y_test = Y[test]
kf.split returns positional (integer) indices, so to slice rows of a DataFrame or Series by position you should use the iloc property. This should solve your problem:
X_train = X.iloc[train]
y_train = Y.iloc[train]
X_test = X.iloc[test]
y_test = Y.iloc[test]
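Alternatively, if you only need the per-fold score, cross_val_score handles the index bookkeeping for you (a minimal sketch reusing the clf and kf defined above):
from sklearn.model_selection import cross_val_score

# cross_val_score refits clf on each training fold and scores it on the held-out fold.
scores = cross_val_score(clf, X, Y, cv=kf, scoring='accuracy')
print(scores, scores.mean())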

How to change my code to use k-fold cross-validation with k = 5

I want to change my code so that instead of this part:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=100, test_size=0.2)
train_data = X_train.copy()
train_data.loc[:, 'target'] = y_train
test_data = X_test.copy()
test_data.loc[:, 'target'] = y_test

data_config = DataConfig(
    target=['target'],  # target should always be a list. Multi-targets are only supported
                        # for regression. Multi-Task Classification is not implemented.
    continuous_cols=train_data.columns.tolist(),
    categorical_cols=[],
    normalize_continuous_features=True
)
trainer_config = TrainerConfig(
    auto_lr_find=True,
    batch_size=64,
    max_epochs=10,
)
optimizer_config = {'optimizer': 'Adam',
                    'optimizer_params': {'weight_decay': 0, 'amsgrad': False},
                    'lr_scheduler': None,
                    'lr_scheduler_params': {},
                    'lr_scheduler_monitor_metric': 'valid_loss'}
model_config = NodeConfig(
    task="classification",
    num_layers=2,
    num_trees=512,
    learning_rate=1,
    embed_categorical=True,
)
tabular_model = TabularModel(
    data_config=data_config,
    model_config=model_config,
    optimizer_config=optimizer_config,
    trainer_config=trainer_config,
)
tabular_model.fit(train=train_data, test=test_data)
pred = tabular_model.predict(test_data)
pred['prediction'] = pred['prediction'].astype(int)
pred.loc[(pred['prediction'] >= 1)] = 1
print_metrics(test_data['target'], pred["prediction"].astype('int'), tag="Holdout")
I want to use the k-fold method with k = 5 or 10 instead.
Thank you for your advice.
The complete code example in which I used train_test_split is above.
Here is an example from the scikit-learn cross-validation guide; it shows a simple train/test split and score (the same page also covers k-fold):
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import svm
X, y = datasets.load_iris(return_X_y=True)
X.shape, y.shape
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)
X_train.shape, y_train.shape
X_test.shape, y_test.shape
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
clf.score(X_test, y_test)
result (in this example):
0.9666666666666667
The example is from here: https://scikit-learn.org/stable/modules/cross_validation.html
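To actually run 5-fold cross-validation with your TabularModel, one option is to loop over KFold splits and rebuild train_data and test_data for each fold. Below is a sketch, not a tested solution: it assumes X is the same feature DataFrame and y the same target Series you passed to train_test_split, and it reuses the config objects and print_metrics helper from your code above.
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=100)
for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    train_data = X.iloc[train_idx].copy()
    train_data.loc[:, 'target'] = y.iloc[train_idx]
    test_data = X.iloc[test_idx].copy()
    test_data.loc[:, 'target'] = y.iloc[test_idx]

    # Rebuild and fit a fresh model per fold, exactly as in the code above.
    tabular_model = TabularModel(
        data_config=data_config,
        model_config=model_config,
        optimizer_config=optimizer_config,
        trainer_config=trainer_config,
    )
    tabular_model.fit(train=train_data, test=test_data)
    pred = tabular_model.predict(test_data)
    print_metrics(test_data['target'], pred['prediction'].astype(int), tag=f"Fold {fold}")
Averaging the per-fold metrics then gives you the k-fold estimate instead of a single holdout score.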

stratified 5-fold cross validation for continuous-value target: The least populated class in y has only 1 member, which is too few

For this code:
#x_train, x_val, y_train, y_val=train_test_split(x,y,test_size=0.3, random_state=42)
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.3, random_state=42, stratify=y)
train = [x_train, y_train]
I get the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/tmp/ipykernel_28063/1294340868.py in <module>
1 #x_train, x_val, y_train, y_val=train_test_split(x,y,test_size=0.3, random_state=42)
----> 2 x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.3, random_state=42, stratify=y)
3 train = [x_train, y_train]
/SeaExp/mona/venv/dpcc/lib/python3.8/site-packages/sklearn/model_selection/_split.py in train_test_split(test_size, train_size, random_state, shuffle, stratify, *arrays)
2441 cv = CVClass(test_size=n_test, train_size=n_train, random_state=random_state)
2442
-> 2443 train, test = next(cv.split(X=arrays[0], y=stratify))
2444
2445 return list(
/SeaExp/mona/venv/dpcc/lib/python3.8/site-packages/sklearn/model_selection/_split.py in split(self, X, y, groups)
1598 """
1599 X, y, groups = indexable(X, y, groups)
-> 1600 for train, test in self._iter_indices(X, y, groups):
1601 yield train, test
1602
/SeaExp/mona/venv/dpcc/lib/python3.8/site-packages/sklearn/model_selection/_split.py in _iter_indices(self, X, y, groups)
1938 class_counts = np.bincount(y_indices)
1939 if np.min(class_counts) < 2:
-> 1940 raise ValueError(
1941 "The least populated class in y has only 1"
1942 " member, which is too few. The minimum"
ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.
I don't get an error if I use the line below instead:
x_train, x_val, y_train, y_val=train_test_split(x,y,test_size=0.3, random_state=42)
However, my intention is to do stratified 5-fold cross-validation. How should I achieve that? I understand that for some target values in my y I only have 1 item, and stratification needs more than 1 item per class. How can I group these bins together?
Here's what my target y's normalized histogram looks like:
Here's also the non-normalized plot of y:
Here's a snippet of y's distribution. As you can see, there are a lot of targets that have only 1 item in their bin.
Update:
Please note that I found this code in the verstack package; however, I do not know how to do 5-fold cross-validation with it.
x_train, x_val, y_train, y_val = scsplit(x, y, stratify = y, test_size=0.3, random_state=42)
train = [x_train, y_train]
You cannot perform a stratified split because some values are present only once, so they cannot be distributed evenly between the train and test sets.
One solution would be to bin this continuous variable into intervals using KBinsDiscretizer and perform the stratified split on the binned values, as follows:
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import KBinsDiscretizer

X, y = make_regression()

y_discretized = KBinsDiscretizer(n_bins=10,
                                 encode='ordinal',
                                 strategy='uniform').fit_transform(y.reshape(-1, 1))

X_train, X_val, y_train, y_val = train_test_split(X, y,
                                                  test_size=0.3,
                                                  random_state=42,
                                                  stratify=y_discretized)
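If the goal is stratified 5-fold cross-validation rather than a single split, the same binned target can be handed to StratifiedKFold. A sketch under the same assumptions (X and y are NumPy arrays here; use .iloc instead of plain indexing for DataFrames):
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in skf.split(X, y_discretized.ravel()):
    x_train, x_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]
    # fit and evaluate your model on this fold here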

lightgbm.basic.LightGBMError: Sum of query counts is not same with #data

I was trying to do hyper-parameter tuning using GroupKFold and RandomizedSearchCV. I have cross-checked the shapes and they match. How do I solve this error?
Code:
import numpy as np
import lightgbm
from sklearn.model_selection import train_test_split, GroupKFold, RandomizedSearchCV

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.20, random_state=42)
qids_train = X_train.groupby(["query_id"])["query_id"].count().to_numpy()
flatted_group_train = np.repeat(range(len(qids_train)), repeats=qids_train)

gbm = lightgbm.LGBMRanker(
    objective="lambdarank",
    metric="ndcg",
    is_unbalance=True,
)
gkf = GroupKFold(n_splits=5)
cv = gkf.split(X_train, y_train, groups=flatted_group_train)
grid = RandomizedSearchCV(gbm, params_grid, n_iter=10, cv=cv, verbose=2, refit=False)
grid.fit(
    X=X_train,
    y=y_train,
    group=qids_train,
)
Error:
lightgbm.basic.LightGBMError: Sum of query counts is not same with #data

ValueError: Found input variables with inconsistent numbers of samples: [2935848, 2935849]

When I run this code:
feature_names = ["date","shop_id", "item_id", "item_price", "item_cnt_day"]
feature_names
X_train = train[feature_names]
print(X_train.shape)
X_train.head()
X_sales = sales[feature_names]
print(X_sales.shape)
X_sales.head()
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
X_train, X_sales, y_train, y_sales = train_test_split(X_train, X_sales, test_size=0.3)
feature_names = ["date","shop_id", "item_id", "item_price", "item_cnt_day"]
feature_names
​
X_train = train[feature_names]
print(X_train.shape)
X_train.head()
​
X_sales = sales[feature_names]
print(X_sales.shape)
X_sales.head()
​
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
​
X_train, X_sales, y_train, y_sales = train_test_split(X_train, X_sales, test_size=0.3)
​
(2935848, 5)
(2935849, 5)
I get this ValueError:
ValueError                                Traceback (most recent call last)
<ipython-input> in <module>
     13 from sklearn.metrics import mean_squared_error
     14
---> 15 X_train, X_sales, y_train, y_sales = train_test_split(X_train, X_sales, test_size=0.3)
     16

~/anaconda3/envs/aiffel/lib/python3.7/site-packages/sklearn/model_selection/_split.py in train_test_split(*arrays, **options)
   2125         raise TypeError("Invalid parameters passed: %s" % str(options))
   2126
-> 2127     arrays = indexable(*arrays)
   2128
   2129     n_samples = _num_samples(arrays[0])

~/anaconda3/envs/aiffel/lib/python3.7/site-packages/sklearn/utils/validation.py in indexable(*iterables)
    291     """
    292     result = [_make_indexable(X) for X in iterables]
--> 293     check_consistent_length(*result)
    294     return result
    295

~/anaconda3/envs/aiffel/lib/python3.7/site-packages/sklearn/utils/validation.py in check_consistent_length(*arrays)
    255     if len(uniques) > 1:
    256         raise ValueError("Found input variables with inconsistent numbers of"
--> 257                          " samples: %r" % [int(l) for l in lengths])
    258
    259

ValueError: Found input variables with inconsistent numbers of samples: [2935848, 2935849]
Your problem occurs because your two dataframes (train and sales) have different lengths: the train dataset has 2935848 rows and the sales dataset has 2935849. Both inputs to train_test_split must have the same length. Check why the lengths don't match, then add or drop a row so they do.
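For instance, if the extra row is just a stray record, one hypothetical way to align the two frames before splitting is to truncate them to the shorter length (whether that is acceptable depends on your data):
# Keep only as many rows as the shorter of the two frames (illustrative only).
n = min(len(X_train), len(X_sales))
X_train, X_sales = X_train.iloc[:n], X_sales.iloc[:n]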
Secondly, but not least, you should understand what you are doing with train_test_split and what your goal is. This function takes X and y as inputs and returns X_train, X_test, y_train, y_test. Reading your code, you are passing in two X matrices (X_train and X_sales) with the same 5 features. I hope you are doing this for a reason; be aware of it.
X is all the samples with their features, and y is the corresponding output value you want to predict. Check that, and evaluate whether train_test_split is really the function you are looking for.
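For reference, the usual pattern looks something like the sketch below, assuming train is your feature DataFrame and item_cnt_day is the value you want to predict (swap in your real target column):
from sklearn.model_selection import train_test_split

feature_names = ["date", "shop_id", "item_id", "item_price"]  # features only
X = train[feature_names]
y = train["item_cnt_day"]                                     # the target to predict

# One X and one y go in; matching train/test pieces come out.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)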
I have this error while I'm trying to compute my confusion matrix:
Found input variables with inconsistent numbers of samples: [1527, 1]
This is my code:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

x = df[['gender', 'age', 'hypertension', 'ever_married', 'work_type', 'Residence_type', 'avg_glucose_level', 'bmi', 'smoking_status', 'work_type_cat', 'gender_cat', 'Residence_type_cat']]
y = df['stroke']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=20)
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)
scaler = StandardScaler()
x_train_scale = scaler.fit_transform(x_train)
x_test_scale = scaler.fit_transform(x_test)
KNN = KNeighborsClassifier()
x = df[['gender', 'age', 'hypertension', 'heart_disease', 'ever_married', 'work_type', 'Residence_type', 'avg_glucose_level', 'bmi', 'smoking_status', 'work_type_cat', 'gender_cat', 'Residence_type_cat']]
y = df['stroke']
print(x.head())
print(y.head())
KNN = KNN.fit(x, y)
test = pd.DataFrame()
test['gender'] = [2]
test['age'] = [3]
test['hypertension'] = [0]
test['heart_disease'] = [0]
test['ever_married'] = [2]
test['work_type'] = [4]
test['Residence_type'] = [2]
test['avg_glucose_level'] = [95.12]
test['bmi'] = [18]
test['smoking_status'] = [2]
test['work_type_cat'] = [4]
test['gender_cat'] = [1]
test['Residence_type_cat'] = [1]
y_predict = KNN.predict(test)
print(y_predict)
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_predict))
