I have a problem where the training error is very good but the test error is very bad. I've already used PCA to reduce the dimensionality of the features, and these are the best results I can get so far, but they are still not good enough on the test data:
XGBoost:
R² Score: 0.559832465443366
MSE: 0.021168084677487115
RMSE: 0.1454925588388874
MAE: 0.12313938140869134
dataset: https://docs.google.com/spreadsheets/d/1xLTv4jLh7j3sTh0UKMHnSUvMXx1qNiXZ/edit?usp=share_link&ouid=116330084208220275542&rtpof=true&sd=true
This is my code:
import pandas as pd
import xgboost
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor

dataset = pd.read_excel('Data.xlsx')
x = dataset.iloc[:, 1:-1].values
y = dataset.iloc[:, -1].values
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=4)

# fit the scaler on the training set only, then apply the same transform to the test set
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)

pca = PCA(n_components=4)
x_train = pca.fit_transform(x_train)
x_test = pca.transform(x_test)

rf = RandomForestRegressor()
adb = AdaBoostRegressor()
xgb_reg = xgboost.XGBRegressor()  # renamed so it no longer shadows the xgboost module
gbrt = GradientBoostingRegressor()

rf_parameters = {'n_estimators': [200, 500],
                 'criterion': ['squared_error', 'absolute_error', 'friedman_mse', 'poisson'],
                 'max_features': ['sqrt', 'log2', None]}
adb_parameters = {'n_estimators': [200, 500],
                  'loss': ['linear', 'square', 'exponential']}
xgb_parameters = {'booster': ['gbtree', 'dart'],
                  'sampling_method': ['uniform', 'gradient_based'],
                  'tree_method': ['auto', 'exact', 'approx', 'hist', 'gpu_hist'],
                  'n_estimators': [200, 500]}
gbrt_parameters = {'loss': ['squared_error', 'absolute_error', 'huber', 'quantile'],
                   'n_estimators': [200, 500],
                   'criterion': ['friedman_mse', 'squared_error'],
                   'max_features': ['auto', 'sqrt', 'log2']}

rf_grid = GridSearchCV(rf, rf_parameters, cv=8, n_jobs=-1)
adb_grid = GridSearchCV(adb, adb_parameters, cv=8, n_jobs=-1)
xgb_grid = GridSearchCV(xgb_reg, xgb_parameters, cv=8, n_jobs=-1)
gbrt_grid = GridSearchCV(gbrt, gbrt_parameters, cv=8, n_jobs=-1)

rf_grid.fit(x_train, y_train)
adb_grid.fit(x_train, y_train)
xgb_grid.fit(x_train, y_train)
gbrt_grid.fit(x_train, y_train)

y_pred_rf = rf_grid.predict(x_test)
y_pred_adb = adb_grid.predict(x_test)
y_pred_xgb = xgb_grid.predict(x_test)
y_pred_gbrt = gbrt_grid.predict(x_test)
What should I do to reduce the test error, given that the dataset consists of only 60 samples and I use an 80-20 split? Thank you.
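For what it's worth, a sketch of one pattern that often helps at this size (assuming the same Data.xlsx layout as above): with 60 samples an 80-20 split leaves only 12 test points, so the single test score is mostly noise, and fitting the scaler and PCA once on the whole training set means every grid-search fold has already "seen" its validation rows through the PCA. Putting the preprocessing inside a Pipeline and scoring with repeated K-fold gives a steadier estimate:

import pandas as pd
from xgboost import XGBRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import RepeatedKFold, cross_val_score

dataset = pd.read_excel('Data.xlsx')
x = dataset.iloc[:, 1:-1].values
y = dataset.iloc[:, -1].values

# scaler and PCA are refit inside every fold, so no validation row leaks into them
model = make_pipeline(StandardScaler(), PCA(n_components=4), XGBRegressor(n_estimators=200))

# 5-fold CV repeated 10 times averages away the noise of any single 12-point test set
cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=4)
scores = cross_val_score(model, x, y, cv=cv, scoring='r2')
print(scores.mean(), scores.std())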
Related
This is the kNN in R:
library(caret)  # provides knn3Train

data = read.csv("data.csv", header = TRUE)
x = data.matrix(data[, 1:47])
y = as.factor(data[, 48] - 1)

n = nrow(data)  # n was not defined in the original snippet
set.seed(500)   # fix the split so runs are reproducible
training = sample(1:n, 0.7 * n)
testing = setdiff(1:n, training)  # the complement of the training indices

# pass the feature matrix, not the full data frame that still contains the label
knn = knn3Train(x[training, ], x[testing, ], y[training], k = 5)
This is the kNN in Python:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

data = pd.read_csv("data.csv")
x = data.drop(columns=['class'])
y = data['class'].values

# split on x, not on the full frame that still contains the class column
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=500)

# metric='minkowski' with p=2 is plain Euclidean distance
knearest = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
knn = knearest.fit(X_train, y_train)
I want the same result for kNN in both languages, but I couldn't make the parameters exactly the same; I hope my question is clear. kNN is my first ever step in R and Python; I have essentially zero background otherwise.
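A sketch of one way to make the two runs directly comparable: R's sample() and sklearn's train_test_split use different random number generators, so with separate seeds the two models never see the same rows. If you save the training index vector from the R session to a small CSV (a hypothetical train_idx.csv below, one 1-based integer per row), Python can reuse exactly the same split:

import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

data = pd.read_csv("data.csv")
x = data.drop(columns=['class'])
y = data['class'].values

# train_idx.csv is assumed to hold the 1-based training indices exported from R
train_idx = pd.read_csv("train_idx.csv").iloc[:, 0].to_numpy() - 1  # R is 1-based
test_idx = np.setdiff1d(np.arange(len(data)), train_idx)

X_train, X_test = x.iloc[train_idx], x.iloc[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print(knn.score(X_test, y_test))

With identical rows, identical k, and Euclidean distance on both sides, any remaining difference comes down to tie-breaking, not the data.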
I'm attempting to classify seismic events, which is very challenging. I get a validation accuracy of 68% with a CNN, but when I try to establish a random forest baseline I get very poor results, somewhere around 35%. I'm new to random forests and I'm looking for some help. The data has shape (500, 15001): 500 samples, each a time series of 15001 points (the seismic trace). The labels have shape (500,), and there are 4 classes, from rockfall to earthquake.
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor

xTrain, xTest, yTrain, yTest = train_test_split(data, data_labels_np, test_size=0.3, random_state=2)
num_classes = len(np.unique(yTrain))

scaler = StandardScaler()
xTrain = scaler.fit_transform(xTrain.reshape(-1, xTrain.shape[-1])).reshape(xTrain.shape)
xTest = scaler.transform(xTest.reshape(-1, xTest.shape[-1])).reshape(xTest.shape)

param_grid = {
    "n_estimators": [25],
    "max_depth": [25],
    "min_samples_leaf": [2],
    "bootstrap": [True, False],
}

# note: RandomForestRegressor treats the 4 discrete class labels as continuous values
rf = RandomForestRegressor(random_state=42)
rf_model = GridSearchCV(estimator=rf, param_grid=param_grid, cv=4, verbose=10, n_jobs=-1, error_score='raise')
rf_model.fit(xTrain, yTrain)
print("Using hyperparameters --> \n", rf_model.best_params_)

rf = RandomForestRegressor(random_state=42)
# the grid above contains only 2 distinct combinations, so n_iter=50 just tries both
rf_random = RandomizedSearchCV(estimator=rf, param_distributions=param_grid, n_iter=50, cv=4, verbose=10, random_state=42, n_jobs=10)
rf_random.fit(xTrain, yTrain)

rf_model.best_params_
rf_model.best_score_
rf_model.best_estimator_
print('Best score for training data:', rf_random.best_score_, "\n")
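One possibility worth ruling out (a sketch, not necessarily the whole fix): because a regressor is used, best_score_ here is R², not accuracy, so the ~35% is not comparable to the CNN's 68% in the first place. A classification baseline on the same data layout, swapping in RandomForestClassifier, would look like this (data and data_labels_np are the arrays from the question):

import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# same split as above, stratified so all 4 classes appear in both halves
xTrain, xTest, yTrain, yTest = train_test_split(
    data, data_labels_np, test_size=0.3, random_state=2, stratify=data_labels_np)

param_grid = {
    "n_estimators": [100, 250, 500],
    "max_depth": [10, 25, None],
    "min_samples_leaf": [1, 2, 4],
}

# accuracy is the default scoring for classifiers, so best_score_ is comparable to the CNN
clf = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=4, n_jobs=-1)
clf.fit(xTrain, yTrain)
print("CV accuracy:", clf.best_score_)
print("Test accuracy:", clf.score(xTest, yTest))

Tree ensembles are also insensitive to feature scaling, so the StandardScaler step is optional here.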
I've already tried various models and keep getting low scores. My model needs to predict the G3 score.
There are no missing values in the dataset, and all values are integers.
I've already checked the most important features used by a base model, and the score does not improve. Any tips to improve it? Am I doing something wrong?
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer, mean_squared_error

student_df = pd.read_csv("student_data.csv")
test_df = pd.read_csv("test_data.csv")
student_df = student_df.rename(columns=str.lower)
test_df = test_df.rename(columns=str.lower)
student_df = student_df.iloc[:, 1:]  # drop the leading index column
test_df = test_df.iloc[:, 1:]

X = student_df.drop(columns=["g3"])
y = student_df["g3"]

# note: the scaler is fitted on all rows here, before the train/test split
scaler = StandardScaler()
X = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)

base_rf = RandomForestRegressor()
base_model = base_rf.fit(X_train, y_train)

params = {
    "n_estimators": [200, 300, 400, 500],
    "max_features": ["sqrt", None],
    "max_depth": [3, 4, 5, 6, 7],
    "random_state": [0],
}

# greater_is_better=False makes GridSearchCV treat MSE as a loss (scores come back negated)
mse = make_scorer(mean_squared_error, greater_is_better=False)
clf = GridSearchCV(RandomForestRegressor(), params, scoring=mse, n_jobs=-1, cv=5)
clf.fit(X_train, y_train)
Results (Score, RMSE, R² Score):
Base model: 0.33531194310489754, 2.7544894130501762, 0.33531194310489754
Clf:        7.449311807649206, 2.7293427427952697, 0.34739287125101215
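One detail when reading those numbers: base_model.score() returns R², while clf.score() uses the custom MSE scorer (negated because of greater_is_better=False), so the two "Score" values measure different things. A small sketch that puts both models on the same footing, reusing X_test and y_test from above:

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# evaluate both models on the same held-out split with the same metrics
for name, model in [("base", base_model), ("tuned", clf.best_estimator_)]:
    pred = model.predict(X_test)
    print(name,
          "RMSE:", np.sqrt(mean_squared_error(y_test, pred)),
          "R2:", r2_score(y_test, pred))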
I'm trying to regularize my random forest regressor with RandomizedSearchCV. With RandomizedSearchCV the train and test sets are not explicitly specified, but I need to be able to specify my train/test sets so I can preprocess them after the split. I found this helpful QnA and also this one, but I still don't know how to do it, since in my case I'm using cross-validation. I already tried to append my train and test sets from the cross-validation, but it does not work. It says ValueError: could not broadcast input array from shape (1824,9) into shape (1824), which refers to my X_test.
import numpy as np
from sklearn.model_selection import KFold, PredefinedSplit, RandomizedSearchCV
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestRegressor

x = np.array(avo_sales.drop(columns=['TotalBags', 'Unnamed:0', 'year', 'region', 'Date']))
y = np.array(avo_sales.TotalBags)

kf = KFold(n_splits=10)
for train_index, test_index in kf.split(x):
    X_train, X_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # impute the categorical column (8) and the numeric columns (1:8) on the training fold only
    impC = SimpleImputer(strategy='most_frequent')
    X_train[:, 8] = impC.fit_transform(X_train[:, 8].reshape(-1, 1)).ravel()
    X_test[:, 8] = impC.transform(X_test[:, 8].reshape(-1, 1)).ravel()

    imp = SimpleImputer(strategy='median')
    X_train[:, 1:8] = imp.fit_transform(X_train[:, 1:8])
    X_test[:, 1:8] = imp.transform(X_test[:, 1:8])

    le = LabelEncoder()
    X_train[:, 8] = le.fit_transform(X_train[:, 8])
    X_test[:, 8] = le.transform(X_test[:, 8])

# this is the part that raises the ValueError
train_indices = X_train, y_test
test_indices = X_test, y_test
my_test_fold = np.append(train_indices, test_indices)
pds = PredefinedSplit(test_fold=my_test_fold)

n_estimators = [int(x) for x in np.linspace(start=200, stop=2000, num=10)]
max_features = ['auto', 'sqrt']
max_depth = [int(x) for x in np.linspace(10, 110, num=11)]
max_depth.append(None)
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]
bootstrap = [True, False]

random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

rfr = RandomForestRegressor()
rfr_random = RandomizedSearchCV(estimator=rfr,
                                param_distributions=random_grid,
                                n_iter=100,
                                cv=pds,  # I'll be filling the cv parameter with the predefined split
                                verbose=2, random_state=42, n_jobs=-1)
rfr_random.fit(X_train, y_train)
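For reference, PredefinedSplit does not take the data arrays themselves: test_fold must be a single integer array with one entry per sample, where -1 marks rows that always stay in training and a fold number marks the held-out rows. A minimal sketch of the intended one-split setup, using the train_index/test_index from the last KFold iteration above:

import numpy as np
from sklearn.model_selection import PredefinedSplit

# one entry per row of x: -1 = always train, 0 = belongs to test fold 0
test_fold = np.full(len(x), -1)
test_fold[test_index] = 0
pds = PredefinedSplit(test_fold=test_fold)

# the search is then fitted on the full x/y, and pds tells it which rows to hold out
# rfr_random = RandomizedSearchCV(rfr, random_grid, n_iter=100, cv=pds)
# rfr_random.fit(x, y)

That said, the answer below sidesteps the whole problem.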
I think your best option is to use a Pipeline plus a ColumnTransformer. Pipelines let you chain several computation steps, including pre-/post-processing, and the column transformer applies different transformations to different columns. In your case, that would be something like:
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.ensemble import RandomForestRegressor

pipeline = make_pipeline(
    make_column_transformer(
        (SimpleImputer(strategy='median'), list(range(1, 8))),
        # OrdinalEncoder stands in for LabelEncoder, which only works on y, not on feature columns
        (make_pipeline(
            SimpleImputer(strategy='most_frequent'),
            OrdinalEncoder(),
        ), [8]),
    ),
    RandomForestRegressor(),
)
Then you use this model as a normal estimator, with the usual fit and predict API. In particular, you give this to the randomized search:
rfr_random = RandomizedSearchCV(estimator = pipeline, ...)
Now the pre-processing steps will be applied to each split, before fitting the random forest.
This will certainly not work without further adaptations, but hopefully you get the idea.
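One of those adaptations: to tune the forest's hyperparameters through the pipeline, the keys in param_distributions need the step name as a prefix (make_pipeline names each step after its lowercased class, so randomforestregressor__ here). A sketch with a condensed version of the grid from the question:

from sklearn.model_selection import RandomizedSearchCV

random_grid = {
    'randomforestregressor__n_estimators': [200, 600, 1000],
    'randomforestregressor__max_features': ['sqrt', None],
    'randomforestregressor__max_depth': [10, 50, None],
    'randomforestregressor__min_samples_split': [2, 5, 10],
    'randomforestregressor__min_samples_leaf': [1, 2, 4],
    'randomforestregressor__bootstrap': [True, False],
}

rfr_random = RandomizedSearchCV(estimator=pipeline,
                                param_distributions=random_grid,
                                n_iter=100, cv=10, random_state=42, n_jobs=-1)
rfr_random.fit(x, y)  # raw, unpreprocessed x: imputation and encoding now happen inside each fold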
I am using PCA for dimensionality reduction; my training data has 1,200,000 records with 335 dimensions. Here is my code to train the model:
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import recall_score, precision_score, accuracy_score, roc_curve, auc

X, y = load_data(f_file1)
valid_X, valid_y = load_data(f_file2)

# note: PCA is fitted on all of X before the train/test split
pca = PCA(n_components=n_compo, whiten=True)
X = pca.fit_transform(X)
valid_input = pca.transform(valid_X)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

clf = DecisionTreeClassifier(criterion='entropy', max_depth=30,
                             min_samples_leaf=2, class_weight={0: 10, 1: 1})  # imbalanced classes
clf.fit(X_train, y_train)

# train/test metrics, as percentages
print(clf.score(X_train, y_train) * 100,
      clf.score(X_test, y_test) * 100,
      recall_score(y_train, clf.predict(X_train)) * 100,
      recall_score(y_test, clf.predict(X_test)) * 100,
      precision_score(y_train, clf.predict(X_train)) * 100,
      precision_score(y_test, clf.predict(X_test)) * 100,
      auc(*roc_curve(y_train, clf.predict_proba(X_train)[:, 1], pos_label=1)[:-1]) * 100,
      auc(*roc_curve(y_test, clf.predict_proba(X_test)[:, 1], pos_label=1)[:-1]) * 100)

# validation-set metrics
print(precision_score(valid_y, clf.predict(valid_input)) * 100,
      recall_score(valid_y, clf.predict(valid_input)) * 100,
      accuracy_score(valid_y, clf.predict(valid_input)) * 100,
      auc(*roc_curve(valid_y, clf.predict_proba(valid_input)[:, 1], pos_label=1)[:-1]) * 100)
The output is
99.80, 99.32, 99.87, 99.88, 99.74, 98.78, 99.99, 99.46
0.00, 0.00, 97.13, 49.98, 700.69
So recall and precision are 0 on the validation data. Why does PCA seem not to work on the validation data, and has the model overfitted?
Probably it's overfitted because max_depth=30 is too much.
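A quick way to see this (a sketch, reusing X_train/y_train from the question) is to sweep max_depth and watch where the train and validation scores diverge:

import numpy as np
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

depths = [5, 10, 15, 20, 25, 30]
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(criterion='entropy', min_samples_leaf=2),
    X_train, y_train, param_name='max_depth', param_range=depths, cv=3)

# a growing gap between train and validation accuracy marks the onset of overfitting
for d, tr, va in zip(depths, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(d, tr, va)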
Also, how did you select the PCA dimension? You can find the optimal value via the eigenvalue/eigenvector approach:
import numpy as np
import matplotlib.pyplot as plt

data = data.values  # assumes data is a pandas DataFrame
mean = np.mean(data.T, axis=1)
demeaned = data - mean

# eigendecompose the covariance matrix and sort eigenvalues in descending order
evals, evecs = np.linalg.eig(np.cov(demeaned.T))
order = evals.argsort()[::-1]
evals = evals[order]

plt.plot(evals)
plt.grid(True)
plt.savefig('_!pca.png')
The optimal number of components is the x position at which the curve drops to nearly zero.
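Equivalently, sklearn's PCA already exposes this information, so a sketch of the same idea without the manual eigendecomposition (X here is the raw 335-dimensional matrix from the question, before reduction):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pca = PCA().fit(X)  # n_components=None keeps every component
cum = np.cumsum(pca.explained_variance_ratio_)

plt.plot(cum)
plt.grid(True)
plt.savefig('_!pca_cumulative.png')

# e.g. keep the smallest number of components that explains 95% of the variance
n_compo = int(np.searchsorted(cum, 0.95)) + 1
print(n_compo)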