How to use predefined split for RandomizedSearchCV - python

I'm trying to regularize my random forest regressor with RandomizedSearchCV. With RandomizedSearchCV the train and test sets are not explicitly specified, and I need to be able to specify my train and test sets so I can preprocess them after the split. I found this helpful Q&A and also this one, but I still do not know how to do it, since in my case I'm using cross-validation. I already tried to append my train and test sets from the cross-validation, but it does not work. It raises ValueError: could not broadcast input array from shape (1824,9) into shape (1824), which refers to my X_test.
x = np.array(avo_sales.drop(['TotalBags','Unnamed:0','year','region','Date'],1))
y = np.array(avo_sales.TotalBags)
kf = KFold(n_splits=10)
for train_index, test_index in kf.split(x):
    X_train, X_test, y_train, y_test = x[train_index], x[test_index], y[train_index], y[test_index]
    impC = SimpleImputer(strategy='most_frequent')
    X_train[:,8] = impC.fit_transform(X_train[:,8].reshape(-1,1)).ravel()
    X_test[:,8] = impC.transform(X_test[:,8].reshape(-1,1)).ravel()
    imp = SimpleImputer(strategy='median')
    X_train[:,1:8] = imp.fit_transform(X_train[:,1:8])
    X_test[:,1:8] = imp.transform(X_test[:,1:8])
    le = LabelEncoder()
    X_train[:,8] = le.fit_transform(X_train[:,8])
    X_test[:,8] = le.transform(X_test[:,8])
train_indices = X_train, y_test
test_indices = X_test, y_test
my_test_fold = np.append(train_indices, test_indices)
pds = PredefinedSplit(test_fold=my_test_fold)
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
max_features = ['auto', 'sqrt']
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]
bootstrap = [True, False]
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
rfr = RandomForestRegressor()
rfr_random = RandomizedSearchCV(estimator = rfr,
                                param_distributions = random_grid,
                                n_iter = 100,
                                # I'll be filling the cv parameter with the predefined split
                                cv = pds, verbose=2, random_state=42, n_jobs = -1)
rfr_random.fit(X_train, y_train)

I think your best option is to use a Pipeline plus a ColumnTransformer. Pipelines allow you to specify several steps of computations, including pre-/post-processing, and the column transformer applies different transformations to different columns. In your case, that would be something like:
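from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

pipeline = make_pipeline(
    make_column_transformer(
        # columns 1-7: median imputation
        (SimpleImputer(strategy='median'), list(range(1, 8))),
        # column 8: impute, then encode; OrdinalEncoder replaces LabelEncoder, which only accepts y
        (make_pipeline(
            SimpleImputer(strategy='most_frequent'),
            OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1),
        ), [8]),
    ),
    RandomForestRegressor(),
)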
Then you use this model as a normal estimator, with the usual fit and predict API. In particular, you give this to the randomized search:
rfr_random = RandomizedSearchCV(estimator = pipeline, ...)
Now the pre-processing steps will be applied to each split, before fitting the random forest.
This will likely need further adaptation to your data, but hopefully you get the idea.
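As a rough sketch of how a pipeline like this could be handed to RandomizedSearchCV, the example below uses synthetic data and a deliberately tiny search space; the data, the column indices and the grid values are all assumptions made for illustration, not taken from the question. The point it tries to show is that hyperparameters of a step inside a pipeline are addressed with the step name as a prefix, and that the imputer is refit on the training part of every fold.

```python
import numpy as np
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import KFold, RandomizedSearchCV
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data (hypothetical): 60 rows, 4 numeric columns with gaps
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
X[rng.random(X.shape) < 0.1] = np.nan
y = rng.normal(size=60)

pipeline = make_pipeline(
    make_column_transformer((SimpleImputer(strategy="median"), [0, 1, 2, 3])),
    RandomForestRegressor(random_state=0),
)

# make_pipeline names each step after its lowercased class name, so the
# forest's parameters are addressed as "randomforestregressor__<param>"
param_distributions = {
    "randomforestregressor__n_estimators": [10, 20, 30],
    "randomforestregressor__max_depth": [3, 5, None],
    "randomforestregressor__min_samples_leaf": [1, 2, 4],
}

search = RandomizedSearchCV(
    estimator=pipeline,
    param_distributions=param_distributions,
    n_iter=5,
    cv=KFold(n_splits=3),  # preprocessing is refit inside each of these folds
    random_state=42,
)
search.fit(X, y)
print(search.best_params_)
```

With this setup there is no need to build the train/test splits by hand: the search object does the splitting, and the pipeline keeps the preprocessing from seeing the held-out fold.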

Related

Decreasing error in test data with small dataset

I have a problem: the training error is very good, but the test error is very bad. I've already used PCA to reduce the dimension of the features, and these are the best results I can get so far, but they are still not good enough on the test data:
XGBoost :
R2 Score : 0.559832465443366
MSE : 0.021168084677487115
RMSE : 0.1454925588388874
MAE : 0.12313938140869134
dataset: https://docs.google.com/spreadsheets/d/1xLTv4jLh7j3sTh0UKMHnSUvMXx1qNiXZ/edit?usp=share_link&ouid=116330084208220275542&rtpof=true&sd=true
This is my code:
dataset = pd.read_excel('Data.xlsx')
x = dataset.iloc[:, 1:-1].values
y = dataset.iloc[:, -1].values
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 4)
sc = StandardScaler()
x_train[:, :] = sc.fit_transform(x_train[:, :])
x_test[:, :] = sc.transform(x_test[:, :])
pca = PCA(n_components = 4)
x_train = pca.fit_transform(x_train)
x_test = pca.transform(x_test)
rf = RandomForestRegressor()
adb = AdaBoostRegressor()
xgb = xgb.XGBRegressor()
gbrt = GradientBoostingRegressor()
rf_parameters = {'n_estimators':[200,500],'criterion':['squared_error', 'absolute_error', 'friedman_mse', 'poisson'], 'max_features': ['sqrt', 'log2', None]}
adb_parameters = {'n_estimators':[200,500],'loss':['linear', 'square', 'exponential']}
xgb_parameters = {'booster':['gbtree', 'dart'],
                  'sampling_method':['uniform', 'gradient_based'],
                  'tree_method':['auto','exact','approx','hist','gpu_hist'],
                  'n_estimators':[200,500]}
gbrt_parameters = {'loss':['squared_error', 'absolute_error', 'huber', 'quantile'],'n_estimators':[200,500],'criterion':['friedman_mse', 'squared_error'], 'max_features':['auto', 'sqrt', 'log2']}
rf_grid = GridSearchCV(rf, rf_parameters, cv = 8, n_jobs = -1)
adb_grid = GridSearchCV(adb, adb_parameters, cv = 8, n_jobs = -1)
xgb_grid = GridSearchCV(xgb, xgb_parameters, cv = 8, n_jobs = -1)
gbrt_grid = GridSearchCV(gbrt, gbrt_parameters, cv = 8, n_jobs = -1)
rf_grid.fit(x_train, y_train)
adb_grid.fit(x_train, y_train)
xgb_grid.fit(x_train, y_train)
gbrt_grid.fit(x_train, y_train)
y_pred_rf = rf_grid.predict(x_test)
y_pred_adb = adb_grid.predict(x_test)
y_pred_xgb = xgb_grid.predict(x_test)
y_pred_gbrt = gbrt_grid.predict(x_test)
What should I do to reduce the test error? The dataset only consists of 60 samples, and I use an 80-20 split. Thank you.

Very low Random Forest Score on Time Series Data (Multi-Classification)

I'm attempting to classify seismic events, which is very challenging. I have a validation accuracy of 68% using a CNN; however, when attempting to get a baseline with a random forest I get very poor results, somewhere around 35%. I'm new to using random forests and I'm looking for some help. The shape of the data is (500, 15001): 500 is the number of samples and 15001 is the number of data points in each time series (i.e. the seismic data). The labels have shape (500,). There are 4 different classes, from Rockfall to Earthquake.
xTrain, xTest, yTrain, yTest = train_test_split(data, data_labels_np, test_size = 0.3, random_state = 2)
num_classes = len(np.unique(yTrain))
scaler = StandardScaler()
xTrain = scaler.fit_transform(xTrain.reshape(-1, xTrain.shape[-1])).reshape(xTrain.shape)
xTest = scaler.transform(xTest.reshape(-1, xTest.shape[-1])).reshape(xTest.shape)
n_estimators = [25]
max_depth = [25]
min_samples_leaf = [2]
bootstrap = [True, False]
param_grid = {
    "n_estimators": n_estimators,
    "max_depth": max_depth,
    "min_samples_leaf": min_samples_leaf,
    "bootstrap": bootstrap,
}
rf = RandomForestRegressor(random_state=42)
rf_model = GridSearchCV(estimator=rf, param_grid=param_grid, cv=4, verbose=10, n_jobs=-1,error_score='raise')
rf_model.fit(xTrain, yTrain)
print("Using hyperparameters --> \n", rf_model.best_params_)
rf = RandomForestRegressor(random_state = 42)
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = param_grid, n_iter=50, cv =4, verbose = 10, random_state=42, n_jobs = 10)
rf_random.fit(xTrain, yTrain)
rf_model.best_params_
rf_model.best_score_
rf_model.best_estimator_
print('Best score for training data:', rf_random.best_score_,"\n")

RandomForest Low R2Score G3 Predictor

I have already tried various models and keep getting low scores. My model needs to predict the G3 score.
There are no missing values in the dataset, and all values are integers.
I have also tried restricting the model to the most important features found by a base model, but the score does not improve. Any tips to improve it? Am I doing something wrong?
student_df = pd.read_csv("student_data.csv")
test_df = pd.read_csv("test_data.csv")
student_df = student_df.rename(columns=str.lower)
test_df = test_df.rename(columns=str.lower)
student_df = student_df.iloc[:, 1:]
test_df = test_df.iloc[:, 1:]
X = student_df.drop(columns=["g3"], axis=1)
y = student_df["g3"]
scaler = StandardScaler()
X = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8, random_state=42)
base_rf = RandomForestRegressor()
base_model = base_rf.fit(X_train, y_train)
params = {
    "n_estimators": [200, 300, 400, 500],
    "max_features": ["sqrt", None],
    "max_depth": [3, 4, 5, 6, 7],
    "random_state": [0],
}
mse = make_scorer(mean_squared_error, greater_is_better=False)
clf = GridSearchCV(RandomForestRegressor(), params, scoring=mse, n_jobs=-1, cv=5)
clf.fit(X_train, y_train)
Score, RMSE, R2Score
Base Model score is - 0.33531194310489754, 2.7544894130501762, 0.33531194310489754
Clf score is - 7.449311807649206, 2.7293427427952697, 0.34739287125101215

How to partition a dataset into three equal parts?

I am trying to divide my dataset into three equal parts using scikit-learn. But when I use StratifiedKFold to do it, printing the result only shows me the call I made to partition the dataset, rather than the partition itself:
from sklearn.model_selection import StratifiedKFold
partition = StratifiedKFold(n_splits = 3, shuffle = True, random_state = None)
print(partition)
I am still new to Python libraries, so I am not sure how to do it.
The second line of your code creates a StratifiedKFold object; it does not actually partition your data. It is this object that you should use to split your data (see the example below):
partition = StratifiedKFold(n_splits = 3, shuffle = True, random_state = 1)
for train_index, test_index in partition.split(x, y):
    x_train_f, x_test_f = x[train_index], x[test_index]
    y_train_f, y_test_f = y[train_index], y[test_index]
Your question about splitting your data into 3 parts has been answered here (the split points below give 70% / 10% / 20%; adjust them if you want equal parts):
X_train, X_test, X_validate = np.split(X, [int(.7*len(X)), int(.8*len(X))])
y_train, y_test, y_validate = np.split(y, [int(.7*len(y)), int(.8*len(y))])
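If three equal, stratified parts are what is actually needed, one possible approach is to keep only the test indices of each StratifiedKFold fold: with n_splits = 3 they are disjoint and together cover every row. The sketch below assumes a small toy dataset (X, y and the class balance are made up for illustration), and the names parts, X_parts and y_parts are hypothetical.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data (hypothetical): 12 samples, 2 balanced classes
X = np.arange(24).reshape(12, 2)
y = np.array([0, 1] * 6)

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

# Keep only the test indices of each fold: three disjoint index arrays
parts = [test_index for _, test_index in skf.split(X, y)]
X_parts = [X[idx] for idx in parts]
y_parts = [y[idx] for idx in parts]

for k, (X_k, y_k) in enumerate(zip(X_parts, y_parts)):
    # Print the size and the per-class counts of each part
    print(k, X_k.shape, np.bincount(y_k))
```

Each part then has the same size (up to rounding) and roughly the same class proportions as the full dataset.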

See the score of each fold when cross validating a model using a for loop

I want to see the individual score of each fitted model to visualize the strength of cross validation (I am doing this to show my coworkers why cross validation is important).
I have a .csv file with 500 rows, 200 independent variables and 1 binary target. I defined skf to fold the data 5 times using StratifiedKFold.
My code looks like this:
X = data.iloc[0:500, 2:202]
y = data["target"]
skf = StratifiedKFold(n_splits = 5, random_state = 0)
clf = svm.SVC(kernel = "linear")
Scores = [0] * 5
for i, j in skf.split(X, y):
    X_train, y_train = X.iloc[i], y.iloc[i]
    X_test, y_test = X.iloc[j], y.iloc[j]
    clf.fit(X_train, y_train)
    clf.score(X_test, y_test)
As you can see, I assigned a list of 5 zeroes to Scores. I would like to assign the clf.score(X_test, y_test) of each of the 5 predictions to the list. However, the indices i and j are not {1, 2, 3, 4, 5}. Rather, they are row numbers used to fold the X and y data frames.
How can I assign the test scores of each of the k fitted models to Scores within this loop? Do I need a separate index for this?
I know cross_val_score does all of this and returns the k scores, which are then usually averaged. However, I want to show my coworkers what happens behind the cross-validation functions that come in the sklearn library.
Thanks in advance!
If I understood the question, and you don't need any particular indexing for Scores:
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
X = np.random.normal(size = (500, 200))
y = np.random.randint(low = 0, high=2, size=500)
# shuffle=True is required by recent scikit-learn versions when random_state is set
skf = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 0)
clf = SVC(kernel = "linear")
Scores = []
for i, j in skf.split(X, y):
    # i holds the training row indices, j the test row indices
    X_train, y_train = X[i], y[i]
    X_test, y_test = X[j], y[j]
    clf.fit(X_train, y_train)
    Scores.append(clf.score(X_test, y_test))
The result is (the exact values vary, since the data are random):
>>> Scores
[0.5247524752475248, 0.53, 0.5, 0.51, 0.4444444444444444]
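To relate the manual loop to cross_val_score, the following sketch runs both on the same splitter and compares the results. It assumes small synthetic data and a fixed random_state so that both runs see identical folds; the variable names manual_scores and cv_scores are made up for this example. cross_val_score returns one score per fold as an array, and any averaging is left to the caller.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic data (hypothetical): 100 rows, 5 features, balanced binary target
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.array([0, 1] * 50)

# A fixed random_state makes every call to split() yield the same folds
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf = SVC(kernel="linear")

# Manual loop: fit on the training rows, score on the held-out rows
manual_scores = []
for train_idx, test_idx in skf.split(X, y):
    clf.fit(X[train_idx], y[train_idx])
    manual_scores.append(clf.score(X[test_idx], y[test_idx]))

# cross_val_score performs the same fit/score cycle on a clone of the estimator
cv_scores = cross_val_score(SVC(kernel="linear"), X, y, cv=skf)

print(manual_scores)
print(cv_scores, cv_scores.mean())
```

Printing the two lists side by side is one way to show coworkers that the helper function is doing nothing more than the loop.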
