I am trying to use the K-Folds cross-validator with a decision tree. I use a for loop to train and test on the data from KFold, as in this code.
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv(r'C:\\Users\data.csv')
# split data into X and y
X = df.iloc[:,:200]
Y = df.iloc[:,200]
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)
clf = DecisionTreeClassifier()
kf = KFold(n_splits=5, shuffle=True, random_state=3)
cnt = 1
# Cross-Validate
for train, test in kf.split(X, Y):
    print(f'Fold:{cnt}, Train set: {len(train)}, Test set:{len(test)}')
    cnt += 1
    X_train = X[train]
    y_train = Y[train]
    X_test = X[test]
    y_test = Y[test]
    clf = clf.fit(X_train,y_train)
    predictions = clf.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)
    print("test")
    print(y_test)
    print("predict")
    print(predictions)
    print("Accuracy: %.2f%%" % (accuracy * 100.0))
When I run it, it shows an error like this:
KeyError: "None of [Int64Index([ 0, 1, 2, 5, 7, 8, 9, 10, 11, 12,\n ...\n 161, 164, 165, 166, 167, 168, 169, 170, 171, 173],\n dtype='int64', length=120)]
How to fix it?
The issue is here:
X_train = X[train]
y_train = Y[train]
X_test = X[test]
y_test = Y[test]
To select rows of a DataFrame by integer position, you should use the iloc indexer; X[train] looks the indices up as column labels instead, which is why the KeyError appears. This should solve your problem:
X_train = X.iloc[train]
y_train = Y.iloc[train]
X_test = X.iloc[test]
y_test = Y.iloc[test]
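For reference, the fix can be sketched as a complete loop. This is a minimal example under stated assumptions: the synthetic DataFrame, its column names, and the `accuracies` list are made up to stand in for the CSV in the question, and a fresh classifier is created in each fold.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the CSV: 150 rows, 4 feature columns, binary target.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(150, 4)), columns=["a", "b", "c", "d"])
Y = pd.Series((X["a"] + X["b"] > 0).astype(int))

kf = KFold(n_splits=5, shuffle=True, random_state=3)
accuracies = []
for fold, (train, test) in enumerate(kf.split(X, Y), start=1):
    # train and test hold integer positions, so select rows with .iloc
    X_train, X_test = X.iloc[train], X.iloc[test]
    y_train, y_test = Y.iloc[train], Y.iloc[test]

    clf = DecisionTreeClassifier(random_state=0)  # fresh model for each fold
    clf.fit(X_train, y_train)
    accuracies.append(accuracy_score(y_test, clf.predict(X_test)))
    print(f"Fold {fold}: train={len(train)}, test={len(test)}, "
          f"accuracy={accuracies[-1]:.2%}")

print(f"Mean accuracy: {np.mean(accuracies):.2%}")
```

Averaging the per-fold accuracies gives a single cross-validated estimate instead of five separate printouts.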
I want to change my code so that instead of this part:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=100, test_size=0.2)
train_data = X_train.copy()
train_data.loc[:, 'target'] = y_train
test_data = X_test.copy()
test_data.loc[:, 'target'] = y_test
data_config = DataConfig(
target=['target'],  # target should always be a list. Multi-targets are only supported for
# regression. Multi-Task Classification is not implemented
continuous_cols=train_data.columns.tolist(),
categorical_cols=[],
normalize_continuous_features=True
)
trainer_config = TrainerConfig(
auto_lr_find=True,
batch_size=64,
max_epochs=10,
)
optimizer_config = {'optimizer':'Adam', 'optimizer_params':{'weight_decay': 0, 'amsgrad':
False}, 'lr_scheduler':None, 'lr_scheduler_params':{},
'lr_scheduler_monitor_metric':'valid_loss'}
model_config = NodeConfig(
task="classification",
num_layers=2,
num_trees=512,
learning_rate=1,
embed_categorical=True,
)
tabular_model = TabularModel(
data_config=data_config,
model_config=model_config,
optimizer_config=optimizer_config,
trainer_config=trainer_config,
)
tabular_model.fit(train=train_data, test=test_data)
pred = tabular_model.predict(test_data)
pred['prediction'] = pred['prediction'].astype(int)
pred.loc[(pred['prediction'] >= 1 )] = 1
print_metrics(test_data['target'], pred["prediction"].astype('int'), tag="Holdout")
I want to use the k-fold method with k = 5 or 10.
Thank you for your advice.
The complete code example, in which I used the train_test_split method, is above.
Here is the hold-out example from the scikit-learn cross-validation guide; the same guide then moves on to k-fold with cross_val_score:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import svm
X, y = datasets.load_iris(return_X_y=True)
X.shape, y.shape
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.4, random_state=0)
X_train.shape, y_train.shape
X_test.shape, y_test.shape
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
clf.score(X_test, y_test)
result (in this example):
0.9666666666666667
The example is from here: https://scikit-learn.org/stable/modules/cross_validation.html
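The snippet above is a single hold-out split. A k-fold version of the same iris example might look like the sketch below; it is only an illustration with a scikit-learn estimator, not the pytorch-tabular model from the question, and the choice of a shuffled KFold with `random_state=0` is an assumption.

```python
from sklearn import datasets, svm
from sklearn.model_selection import KFold, cross_val_score

X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC(kernel="linear", C=1)

# Shuffle before splitting: the iris rows are ordered by class.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# cross_val_score fits a clone of clf on each training fold
# and scores it on the matching held-out fold.
scores = cross_val_score(clf, X, y, cv=cv)
print(scores)
print(f"mean accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

For a model that does not follow the scikit-learn estimator interface, the same KFold object can be iterated by hand with `cv.split(X)` and the model refit inside the loop.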
I have already tried various models and keep getting low scores. My model needs to predict the G3 score.
There are no missing values in the dataset, and all values are integers.
I have also checked the most important features used by a base model, but the score does not improve. Any tips to improve it? Am I doing something wrong?
student_df = pd.read_csv("student_data.csv")
test_df = pd.read_csv("test_data.csv")
student_df = student_df.rename(columns=str.lower)
test_df = test_df.rename(columns=str.lower)
student_df = student_df.iloc[:, 1:]
test_df = test_df.iloc[:, 1:]
X = student_df.drop(columns=["g3"], axis=1)
y = student_df["g3"]
scaler = StandardScaler()
X = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8, random_state=42)
base_rf = RandomForestRegressor()
base_model = base_rf.fit(X_train, y_train)
params = {
"n_estimators": [200, 300, 400, 500],
"max_features": ["sqrt", None],
"max_depth": [3, 4, 5, 6, 7],
"random_state": [0],
}
mse = make_scorer(mean_squared_error, greater_is_better=False)
clf = GridSearchCV(RandomForestRegressor(), params, scoring=mse, n_jobs=-1, cv=5)
clf.fit(X_train, y_train)
Score, RMSE, R2Score
Base Model score is - 0.33531194310489754, 2.7544894130501762, 0.33531194310489754
Clf score is - 7.449311807649206, 2.7293427427952697, 0.34739287125101215
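One possible reading of the numbers above, offered as an assumption rather than a diagnosis: with `make_scorer(mean_squared_error, greater_is_better=False)` the scorer returns the negated MSE, so the 7.449 reported for the grid search would be on a different scale from the base model's R2 and would correspond to the RMSE of about 2.729. The sketch below uses made-up data and a `DummyRegressor` purely to show the sign convention.

```python
import math

import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import make_scorer, mean_squared_error

# Made-up data: the dummy model always predicts the mean of y (2.5).
X = np.zeros((4, 1))
y = np.array([1.0, 2.0, 3.0, 4.0])
model = DummyRegressor(strategy="mean").fit(X, y)

# Same scorer definition as in the question.
neg_mse = make_scorer(mean_squared_error, greater_is_better=False)
score = neg_mse(model, X, y)   # negated MSE, so it is <= 0
rmse = math.sqrt(-score)       # undo the sign, then take the root
print(score, rmse)

# The figures quoted in the question are consistent with this relation.
reported_rmse = math.sqrt(7.449311807649206)
print(reported_rmse)
```

If that reading is right, the two models are nearly tied (R2 of roughly 0.335 versus 0.347), and the tuning changed little.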
X = balance_data.values[:, 1:5]
Y = balance_data.values[:,0]
X_train, X_test, y_train, y_test = train_test_split( X, Y, test_size = 0.3, random_state = 100)
clf_entropy = DecisionTreeClassifier(criterion = "entropy", random_state = 100,
max_depth=3, min_samples_leaf=5)
clf_entropy.fit(X_train, y_train)
When I run the above code to fit the data and train the model, it gives me the following error. I am using Google Colab for Python.
Can anyone please help me with this?
ValueError Traceback (most recent call last)
<ipython-input-33-3523056235b2> in <module>()
1 clf_entropy= DecisionTreeClassifier()
----> 2 clf_entropy.fit(X_train, y_train)
2 frames
/usr/local/lib/python3.7/dist-packages/sklearn/utils/multiclass.py in check_classification_targets(y)
167 if y_type not in ['binary', 'multiclass', 'multiclass-multioutput',
168 'multilabel-indicator', 'multilabel-sequences']:
--> 169 raise ValueError("Unknown label type: %r" % y_type)
DecisionTreeClassifier checks the type of your target variable, so if each entry is a tuple or list, it raises that error. For example, this is how it should work:
balance_data = pd.concat([
pd.DataFrame(np.random.choice(['A','B'],100)),
pd.DataFrame(np.random.uniform(0,1,(100,5)))
],axis=1)
X = balance_data.values[:, 1:5]
Y = balance_data.values[:,0]
X_train, X_test, y_train, y_test = train_test_split( X, Y, test_size = 0.3, random_state = 100)
clf_entropy = DecisionTreeClassifier(criterion = "entropy", random_state = 100,
max_depth=3, min_samples_leaf=5)
clf_entropy.fit(X_train, y_train)
Now, if each entry of your target variable is itself a list:
balance_data.iloc[:,0] = [[np.random.choice(['A','B','C'],1)] for i in range(100)]
X = balance_data.values[:, 1:5]
Y = balance_data.values[:,0]
Y[0]
Out[36]: [array(['C'], dtype='<U1')]
Then it will throw the same error you saw:
X_train, X_test, y_train, y_test = train_test_split( X, Y, test_size = 0.3, random_state = 100)
clf_entropy = DecisionTreeClassifier(criterion = "entropy", random_state = 100,
max_depth=3, min_samples_leaf=5)
clf_entropy.fit(X_train, y_train)
ValueError: Unknown label type: 'unknown'
What you should do is check what is in balance_data.values[:,0], and make sure there's no embedded list or tuple.
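A quick way to run that check could look like the sketch below. The helper name `has_nested_labels` and the toy label arrays are hypothetical; the flattening step assumes every entry wraps exactly one label.

```python
import numpy as np

def has_nested_labels(y):
    """Return True if any entry of y is itself a list, tuple, or array."""
    return any(isinstance(v, (list, tuple, np.ndarray)) for v in y)

# Hypothetical labels: one flat array, one where each entry is a 1-item list.
y_flat = np.array(["A", "B", "A"], dtype=object)
y_nested = np.empty(3, dtype=object)
for i, label in enumerate(["A", "B", "A"]):
    y_nested[i] = [label]

print(has_nested_labels(y_flat))    # False
print(has_nested_labels(y_nested))  # True

# Unwrap the single label from each entry to get a flat 1-D target.
y_fixed = np.array([v[0] for v in y_nested])
print(y_fixed, has_nested_labels(y_fixed))
```

Running the helper on `balance_data.values[:, 0]` before calling `fit` would show whether the labels need unwrapping first.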
I have 2 datasets and am applying 5 different ML models.
Dataset 1:
def dataset_1():
    ...
    ...
    bike_data_hours = bike_data_hours[:500]
    X = bike_data_hours.iloc[:, :-1].values
    y = bike_data_hours.iloc[:, -1].values
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    return X_train, X_test, y_train.reshape(-1, 1), y_test.reshape(-1, 1)
The shape is (400, 14) (100, 14) (400, 1) (100, 1). The dtypes: object (int64, float64).
Dataset 2:
def dataset_2():
    ...
    ...
    final_movie_df = final_movie_df[:500]
    X = final_movie_df.iloc[:, :-1]
    y = final_movie_df.iloc[:, -1]
    gs = GroupShuffleSplit(n_splits=2, test_size=0.2)
    train_ix, test_ix = next(gs.split(X, y, groups=X.UserID))
    X_train = X.iloc[train_ix]
    y_train = y.iloc[train_ix]
    X_test = X.iloc[test_ix]
    y_test = y.iloc[test_ix]
    return X_train.shape, X_test.shape, y_train.values.reshape(-1,1).shape, y_test.values.reshape(-1,1).shape
The shape is (400, 25) (100, 25) (400, 1) (100, 1). The dtypes: object (int64, float64).
I am using different models. The code is
X_train, X_test, y_train, y_test = dataset
fold_residuals, fold_dfs = [], []
kf = KFold(n_splits=k, shuffle=True)
for train_index, _ in kf.split(X_train):
    if reg_name == "RF" or reg_name == "SVR":
        preds = regressor.fit(X_train[train_index], y_train[train_index].ravel()).predict(X_test)
    elif reg_name == "Knn-5":
        preds = regressor.fit(X_train[train_index], np.ravel(y_train[train_index], order="C")).predict(X_test)
    else:
        preds = regressor.fit(X_train[train_index], y_train[train_index]).predict(X_test)
But I am getting a common error, like the ones in this, this, and this post. I have gone through all of these posts, but I still cannot work out the cause of the error. I have used iloc and values, as suggested in the answers to those posts.
preds = regressor.fit(X_train[train_index], y_train[train_index]).predict(X_test)
File "/home/fgd/.local/lib/python3.8/site-packages/pandas/core/frame.py", line 3030, in __getitem__
indexer = self.loc._get_listlike_indexer(key, axis=1, raise_missing=True)[1]
File "/home/fgd/.local/lib/python3.8/site-packages/pandas/core/indexing.py", line 1266, in _get_listlike_indexer
self._validate_read_indexer(keyarr, indexer, axis, raise_missing=raise_missing)
File "/home/fgd/.local/lib/python3.8/site-packages/pandas/core/indexing.py", line 1308, in _validate_read_indexer
raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Int64Index([ 0, 1, 3, 4, 5, 6, 7, 9, 10, 11,\n ...\n 387, 388, 389, 390, 391, 392, 393, 395, 397, 399],\n dtype='int64', length=320)] are in the [columns]"
Here, if I use train_test_split instead of GroupShuffleSplit, the code works. However, I want to use GroupShuffleSplit based on the UserID so that the same user does not end up in both train and test. Could you tell me how I can solve the problem while still using GroupShuffleSplit?
Could you also tell me why I am getting the error for dataset_2 while dataset_1 works fine, even though the shapes and dtypes are the same for both datasets?
You have to use .values for your dataset_2, so that it works with NumPy arrays as dataset_1 does. Make these changes:
X_train = X.iloc[train_ix].values
y_train = y.iloc[train_ix].values
X_test = X.iloc[test_ix].values
y_test = y.iloc[test_ix].values
return X_train.shape, X_test.shape, y_train.reshape(-1,1).shape, y_test.reshape(-1,1).shape
Hopefully it will work now.
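The reason .values matters can be seen in a small sketch. The toy DataFrame and the `raised` flag are made up for illustration: indexing a DataFrame with `[]` and an integer array is treated as a column lookup, while the same array selects rows on a NumPy array or through .iloc.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the movie data.
df = pd.DataFrame({"UserID": [1, 1, 2, 2], "rating": [3.0, 4.0, 5.0, 2.0]})
rows = np.array([0, 2])  # positional row indices, like those KFold.split yields

# On a DataFrame, [] with a list-like looks the values up as column labels.
raised = False
try:
    df[rows]
except KeyError as err:
    raised = True
    print("KeyError:", err)

# Positional row selection works with .iloc or on the underlying array.
by_iloc = df.iloc[rows]
by_values = df.values[rows]
print(by_iloc.shape, by_values.shape)
```

So an alternative to converting with .values is to keep the DataFrames and write `X_train.iloc[train_index]` and `y_train.iloc[train_index]` inside the KFold loop.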
I have successfully built a logistic regression model using the train dataset below.
X = train.drop('y', axis=1)
y = train['y']
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.5)
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
logreg1 = LogisticRegression()
logreg1.fit(X_train, y_train)
score = logreg1.score(X_test, y_test)
cvs = cross_val_score(logreg1, X_test, y_test, cv=5).mean()
My problem is that I want to bring in the test dataset to predict the unknown y value. In the test data there is no y column. How can I predict the y value using the separate test dataset?
Use predict():
y_pred = logreg1.predict(X_test)
score = logreg1.score(X_test, y_pred)  # always 1.0: this compares the predictions with themselves
print(y_pred)  # see the predictions
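For a separate test file with no y column, a fuller sketch might look like this. The `train` and `test` frames and their column names are synthetic stand-ins. The assumption is that the new data goes through the same fitted scaler (transform only, no refit), with the same columns in the same order.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins: train has a y column, test does not.
rng = np.random.default_rng(0)
train = pd.DataFrame(rng.normal(size=(200, 3)), columns=["f1", "f2", "f3"])
train["y"] = (train["f1"] - train["f2"] > 0).astype(int)
test = pd.DataFrame(rng.normal(size=(50, 3)), columns=["f1", "f2", "f3"])

X = train.drop("y", axis=1)
y = train["y"]

# Fit the scaler and the model on the labelled data only.
scaler = StandardScaler().fit(X)
logreg1 = LogisticRegression().fit(scaler.transform(X), y)

# Reuse the fitted scaler on the unlabelled data, then predict.
X_new = scaler.transform(test[X.columns])
test_pred = logreg1.predict(X_new)               # predicted class labels
test_proba = logreg1.predict_proba(X_new)[:, 1]  # probability of class 1
print(test_pred[:10])
```

Without true labels for the test set there is nothing to score against, so accuracy has to be estimated on held-out labelled data instead.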