RandomForest Low R2Score G3 Predictor - python

I've already tried various models and keep getting low scores. My model needs to predict the G3 score.
There are no missing values in the dataset, and all values are integers.
I've also checked the most important features used by a base model, but the score does not improve. Any tips to improve it? Am I doing something wrong?
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

student_df = pd.read_csv("student_data.csv")
test_df = pd.read_csv("test_data.csv")

student_df = student_df.rename(columns=str.lower)
test_df = test_df.rename(columns=str.lower)

# Drop the first column
student_df = student_df.iloc[:, 1:]
test_df = test_df.iloc[:, 1:]

X = student_df.drop(columns=["g3"])
y = student_df["g3"]

scaler = StandardScaler()
X = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)

# Baseline forest with default hyperparameters
base_rf = RandomForestRegressor()
base_model = base_rf.fit(X_train, y_train)

params = {
    "n_estimators": [200, 300, 400, 500],
    "max_features": ["sqrt", None],
    "max_depth": [3, 4, 5, 6, 7],
    "random_state": [0],
}

mse = make_scorer(mean_squared_error, greater_is_better=False)
clf = GridSearchCV(RandomForestRegressor(), params, scoring=mse, n_jobs=-1, cv=5)
clf.fit(X_train, y_train)
Model       Score                RMSE                 R2Score
Base model  0.33531194310489754  2.7544894130501762   0.33531194310489754
Clf         7.449311807649206    2.7293427427952697   0.34739287125101215
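For reference, a minimal sketch of what the feature-importance check mentioned above might look like, using the fitted base_model and the column names from the code (the top-10 cutoff is arbitrary):

# Rank features by the impurity-based importances of the fitted base model
importances = pd.Series(base_model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))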

Related

K-Folds cross-validator shows KeyError: None of Int64Index

I'm trying to use the K-Folds cross-validator with a decision tree. I use a for loop to train and test on data from KFold, like in this code:
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv(r'C:\\Users\data.csv')

# split data into X and y
X = df.iloc[:, :200]
Y = df.iloc[:, 200]
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)

clf = DecisionTreeClassifier()
kf = KFold(n_splits=5, shuffle=True, random_state=3)
cnt = 1

# Cross-Validate
for train, test in kf.split(X, Y):
    print(f'Fold:{cnt}, Train set: {len(train)}, Test set: {len(test)}')
    cnt += 1
    X_train = X[train]
    y_train = Y[train]
    X_test = X[test]
    y_test = Y[test]
    clf = clf.fit(X_train, y_train)
    predictions = clf.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)
    print("test")
    print(y_test)
    print("predict")
    print(predictions)
    print("Accuracy: %.2f%%" % (accuracy * 100.0))
When I run it, it shows this error:
KeyError: "None of [Int64Index([  0,   1,   2,   5,   7,   8,   9,  10,  11,  12,
            ...
            161, 164, 165, 166, 167, 168, 169, 170, 171, 173],
           dtype='int64', length=120)]
How to fix it?
The issue is here:
X_train = X[train]
y_train = Y[train]
X_test = X[test]
y_test = Y[test]
KFold.split returns positional indices, but X[train] treats those integers as column labels, which raises the KeyError. To select rows by position, use the iloc indexer. This should solve your problem:
X_train = X.iloc[train]
y_train = Y.iloc[train]
X_test = X.iloc[test]
y_test = Y.iloc[test]
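As an aside, if only the per-fold accuracy is needed, sklearn's cross_val_score can replace the manual loop entirely; a sketch using the same estimator and KFold splitter as the question:

from sklearn.model_selection import cross_val_score

# Per-fold accuracies without any manual index handling
scores = cross_val_score(DecisionTreeClassifier(), X, Y, cv=kf, scoring='accuracy')
print(scores)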

Decreasing error in test data with small dataset

I have a problem: the training error is very good, but the test error is bad. I've already used PCA to reduce the dimensionality of the features, and these are the best results I can get so far, but they are still not good enough on the test data:
XGBoost :
R2 Score : 0.559832465443366
MSE : 0.021168084677487115
RMSE : 0.1454925588388874
MAE : 0.12313938140869134
dataset: https://docs.google.com/spreadsheets/d/1xLTv4jLh7j3sTh0UKMHnSUvMXx1qNiXZ/edit?usp=share_link&ouid=116330084208220275542&rtpof=true&sd=true
This is my code:
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from xgboost import XGBRegressor

dataset = pd.read_excel('Data.xlsx')
x = dataset.iloc[:, 1:-1].values
y = dataset.iloc[:, -1].values

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=4)

# Fit the scaler on the train split only, then apply it to the test split
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)

pca = PCA(n_components=4)
x_train = pca.fit_transform(x_train)
x_test = pca.transform(x_test)

rf = RandomForestRegressor()
adb = AdaBoostRegressor()
xgb_reg = XGBRegressor()  # renamed so it does not shadow the xgboost module
gbrt = GradientBoostingRegressor()

rf_parameters = {'n_estimators': [200, 500],
                 'criterion': ['squared_error', 'absolute_error', 'friedman_mse', 'poisson'],
                 'max_features': ['sqrt', 'log2', None]}
adb_parameters = {'n_estimators': [200, 500],
                  'loss': ['linear', 'square', 'exponential']}
xgb_parameters = {'booster': ['gbtree', 'dart'],
                  'sampling_method': ['uniform', 'gradient_based'],
                  'tree_method': ['auto', 'exact', 'approx', 'hist', 'gpu_hist'],
                  'n_estimators': [200, 500]}
gbrt_parameters = {'loss': ['squared_error', 'absolute_error', 'huber', 'quantile'],
                   'n_estimators': [200, 500],
                   'criterion': ['friedman_mse', 'squared_error'],
                   'max_features': ['auto', 'sqrt', 'log2']}

rf_grid = GridSearchCV(rf, rf_parameters, cv=8, n_jobs=-1)
adb_grid = GridSearchCV(adb, adb_parameters, cv=8, n_jobs=-1)
xgb_grid = GridSearchCV(xgb_reg, xgb_parameters, cv=8, n_jobs=-1)
gbrt_grid = GridSearchCV(gbrt, gbrt_parameters, cv=8, n_jobs=-1)

rf_grid.fit(x_train, y_train)
adb_grid.fit(x_train, y_train)
xgb_grid.fit(x_train, y_train)
gbrt_grid.fit(x_train, y_train)

y_pred_rf = rf_grid.predict(x_test)
y_pred_adb = adb_grid.predict(x_test)
y_pred_xgb = xgb_grid.predict(x_test)
y_pred_gbrt = gbrt_grid.predict(x_test)
What should I do to reduce the test error, given that the dataset only consists of 60 samples and I use an 80-20 split? Thank you.
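One thing worth noting: because the scaler and PCA are fit on the full training split before GridSearchCV runs its internal folds, information leaks from each fold's validation portion into the preprocessing. A minimal sketch of one way to avoid that, moving the preprocessing inside the searched estimator (hyperparameter names use sklearn's step-prefix convention; the grid values are illustrative):

from sklearn.pipeline import Pipeline

# The scaler and PCA are refit inside every CV fold, so nothing leaks
# from a fold's validation portion into the preprocessing.
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA(n_components=4)),
    ('rf', RandomForestRegressor()),
])
pipe_params = {'rf__n_estimators': [200, 500],
               'rf__max_features': ['sqrt', 'log2', None]}
pipe_grid = GridSearchCV(pipe, pipe_params, cv=8, n_jobs=-1)
pipe_grid.fit(x_train, y_train)  # note: expects the raw, unscaled train split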

Very low Random Forest Score on Time Series Data (Multi-Classification)

I'm attempting to classify seismic events, which is very challenging. I get a validation accuracy of 68% using a CNN, but when I try to get a baseline with a random forest I receive very poor results, somewhere around 35%. I'm new to using random forests and I'm looking for some help. The shape of the data is (500, 15001): 500 is the number of samples, and 15001 is the number of data points in each NumPy time-series array (i.e. the seismic data). The labels have shape (500,). There are 4 different classes, from rockfall to earthquake.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

# data is the (500, 15001) array and data_labels_np the (500,) labels described above
xTrain, xTest, yTrain, yTest = train_test_split(data, data_labels_np, test_size=0.3, random_state=2)
num_classes = len(np.unique(yTrain))

# Scale using statistics from the training set only
scaler = StandardScaler()
xTrain = scaler.fit_transform(xTrain.reshape(-1, xTrain.shape[-1])).reshape(xTrain.shape)
xTest = scaler.transform(xTest.reshape(-1, xTest.shape[-1])).reshape(xTest.shape)

param_grid = {
    "n_estimators": [25],
    "max_depth": [25],
    "min_samples_leaf": [2],
    "bootstrap": [True, False],
}

rf = RandomForestRegressor(random_state=42)
rf_model = GridSearchCV(estimator=rf, param_grid=param_grid, cv=4, verbose=10, n_jobs=-1, error_score='raise')
rf_model.fit(xTrain, yTrain)
print("Using hyperparameters --> \n", rf_model.best_params_)

rf = RandomForestRegressor(random_state=42)
rf_random = RandomizedSearchCV(estimator=rf, param_distributions=param_grid, n_iter=50, cv=4, verbose=10, random_state=42, n_jobs=10)
rf_random.fit(xTrain, yTrain)

print(rf_model.best_params_)
print(rf_model.best_score_)
print(rf_model.best_estimator_)
print('Best score for training data:', rf_random.best_score_, "\n")
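One detail that may explain part of the gap: this is a 4-class classification task, but the searches above use RandomForestRegressor, whose best_score_ is R², not accuracy. A minimal sketch of the same baseline with the classifier instead (same split and grid as above):

from sklearn.ensemble import RandomForestClassifier

# A classifier's CV score is mean accuracy, comparable to the CNN's 68%
rf_clf = RandomForestClassifier(random_state=42)
clf_model = GridSearchCV(estimator=rf_clf, param_grid=param_grid, cv=4, n_jobs=-1)
clf_model.fit(xTrain, yTrain)
print('CV accuracy:', clf_model.best_score_)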

Using GridSearchCV best_params_ gives poor results

I'm trying to tune hyperparameters for KNN on a quite small dataset (Kaggle Leaf, which has around 990 rows):
def knnTuning(self, x_train, t_train):
    params = {
        'n_neighbors': [1, 2, 3, 4, 5, 7, 9],
        'weights': ['uniform', 'distance'],
        'leaf_size': [5, 10, 15, 20]
    }
    grid = GridSearchCV(KNeighborsClassifier(), params)
    grid.fit(x_train, t_train)
    print(grid.best_params_)
    print(grid.best_score_)
    return knn.KNN(neighbors=grid.best_params_["n_neighbors"],
                   weight=grid.best_params_["weights"],
                   leafSize=grid.best_params_["leaf_size"])
Prints:
{'leaf_size': 5, 'n_neighbors': 1, 'weights': 'uniform'}
0.9119999999999999
And I return this classifier
class KNN:
    def __init__(self, neighbors=1, weight='uniform', leafSize=10):
        self.clf = KNeighborsClassifier(n_neighbors=neighbors,
                                        weights=weight, leaf_size=leafSize)

    def train(self, X, t):
        self.clf.fit(X, t)

    def predict(self, x):
        return self.clf.predict(x)

    def global_accuracy(self, X, t):
        predicted = self.predict(X)
        accuracy = (predicted == t).mean()
        return accuracy
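For context, a short sketch of how this wrapper is presumably used (x_train, t_train, x_test, t_test are hypothetical stand-ins for the 700/200 split described below):

# Hypothetical arrays standing in for the random-permutation split
model = KNN(neighbors=3, weight='distance', leafSize=10)
model.train(x_train, t_train)
print(model.global_accuracy(x_test, t_test))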
I run this several times, using 700 rows for training and 200 for validation, chosen with a random permutation.
I then get a global accuracy ranging from 0.01 (often) to 0.4 (rarely).
I know that I'm not comparing the same two metrics, but I still can't understand the huge difference between the results.
It's not very clear how you trained your model or how the preprocessing was done. The leaf dataset has about 100 labels (species), so you have to take care to split your train and test sets to ensure an even distribution of your samples. One reason for the weird accuracy could be that your samples are split unevenly.
You also need to scale your features:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.neighbors import KNeighborsClassifier

df = pd.read_csv("https://raw.githubusercontent.com/WenjinTao/Leaf-Classification--Kaggle/master/train.csv")

le = LabelEncoder()
scaler = StandardScaler()

X = df.drop(['id', 'species'], axis=1)
X = scaler.fit_transform(X)  # now a NumPy array, so index it positionally below
y = le.fit_transform(df['species'])

# Stratify so every species is represented in both splits
strat = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=0).split(X, y)
train_idx, test_idx = next(strat)
x_train, y_train = X[train_idx], y[train_idx]
x_test, y_test = X[test_idx], y[test_idx]
Then we do the training; I would be careful about including n_neighbors=1:
params = {
    'n_neighbors': [2, 3, 4],
    'weights': ['uniform', 'distance'],
    'leaf_size': [5, 10, 15, 20]
}
sss = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
grid = GridSearchCV(KNeighborsClassifier(), params, cv=sss)
grid.fit(x_train, y_train)
print(grid.best_params_)
print(grid.best_score_)
{'leaf_size': 5, 'n_neighbors': 2, 'weights': 'distance'}
0.9676258992805755
Then you can check on your test set:
pred = grid.predict(x_test)
(y_test == pred).mean()
0.9831649831649831

How to use predefined split for RandomizedSearchCV

I'm trying to regularize my random forest regressor with RandomizedSearchCV. With RandomizedSearchCV the train and test sets are not explicitly specified; I need to be able to specify my train and test sets so I can preprocess them after the split. I found this helpful Q&A and also this one, but I still don't know how to do it, since in my case I'm using cross-validation. I already tried to append my train and test sets from the cross-validation, but it does not work. It says ValueError: could not broadcast input array from shape (1824,9) into shape (1824), which refers to my X_test.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import KFold, PredefinedSplit, RandomizedSearchCV
from sklearn.preprocessing import LabelEncoder

# avo_sales is the avocado sales DataFrame described above
x = np.array(avo_sales.drop(['TotalBags', 'Unnamed:0', 'year', 'region', 'Date'], axis=1))
y = np.array(avo_sales.TotalBags)

kf = KFold(n_splits=10)
for train_index, test_index in kf.split(x):
    X_train, X_test, y_train, y_test = x[train_index], x[test_index], y[train_index], y[test_index]

    # Impute the categorical column with the most frequent value
    impC = SimpleImputer(strategy='most_frequent')
    X_train[:, 8] = impC.fit_transform(X_train[:, 8].reshape(-1, 1)).ravel()
    X_test[:, 8] = impC.transform(X_test[:, 8].reshape(-1, 1)).ravel()

    # Impute the numeric columns with the median
    imp = SimpleImputer(strategy='median')
    X_train[:, 1:8] = imp.fit_transform(X_train[:, 1:8])
    X_test[:, 1:8] = imp.transform(X_test[:, 1:8])

    le = LabelEncoder()
    X_train[:, 8] = le.fit_transform(X_train[:, 8])
    X_test[:, 8] = le.transform(X_test[:, 8])

train_indices = X_train, y_test
test_indices = X_test, y_test
my_test_fold = np.append(train_indices, test_indices)
pds = PredefinedSplit(test_fold=my_test_fold)

n_estimators = [int(x) for x in np.linspace(start=200, stop=2000, num=10)]
max_features = ['auto', 'sqrt']
max_depth = [int(x) for x in np.linspace(10, 110, num=11)]
max_depth.append(None)
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]
bootstrap = [True, False]

random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

rfr = RandomForestRegressor()
rfr_random = RandomizedSearchCV(estimator=rfr,
                                param_distributions=random_grid,
                                n_iter=100,
                                cv=pds,  # <-- I'll fill the cv parameter with the predefined split
                                verbose=2, random_state=42, n_jobs=-1)
rfr_random.fit(X_train, y_train)
I think your best option is to use a Pipeline plus a ColumnTransformer. Pipelines let you specify several steps of computation, including pre- and post-processing, and the column transformer applies different transformations to different columns. In your case, that would be something like:
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# make_pipeline and make_column_transformer take their steps as positional
# arguments (not a list), and OrdinalEncoder replaces LabelEncoder here,
# since LabelEncoder only works on targets, not on feature columns.
pipeline = make_pipeline(
    make_column_transformer(
        (SimpleImputer(strategy='median'), list(range(1, 8))),
        (make_pipeline(
            SimpleImputer(strategy='most_frequent'),
            OrdinalEncoder(),
        ), [8]),
    ),
    RandomForestRegressor(),
)
Then you use this model as a normal estimator, with the usual fit and predict API. In particular, you pass it to the randomized search:
rfr_random = RandomizedSearchCV(estimator=pipeline, ...)
Now the pre-processing steps will be applied to each split, before fitting the random forest.
This will certainly not work without further adaptations, but hopefully you get the idea.
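To tie it back to the question, a sketch of how the search over the pipeline's forest parameters might look (step names follow make_pipeline's lowercase-class-name convention; the grid values are illustrative):

from sklearn.model_selection import RandomizedSearchCV

# Nested parameters are addressed as <step name>__<param>
pipe_params = {
    'randomforestregressor__n_estimators': [200, 600, 1000, 2000],
    'randomforestregressor__max_depth': [10, 50, None],
    'randomforestregressor__min_samples_leaf': [1, 2, 4],
}
pipe_search = RandomizedSearchCV(estimator=pipeline,
                                 param_distributions=pipe_params,
                                 n_iter=10, cv=5, random_state=42, n_jobs=-1)
pipe_search.fit(x, y)  # raw features; imputation and encoding happen inside each fold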
