I'm trying to do multiclass classification with several machine learning models, using this function that I have created:
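from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as imbpipeline  # assumed origin of the imbpipeline alias
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.preprocessing import MinMaxScaler

def model_roc(X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=11)
    # SMOTE takes its sampling strategy as a keyword argument
    pipeline1 = imbpipeline(steps=[['pca', PCA()],
                                   ['smote', SMOTE(sampling_strategy='not majority', random_state=11)],
                                   ['scaler', MinMaxScaler()],
                                   ['classifier', LogisticRegression(random_state=11, max_iter=1000)]])
    stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=11)
    # note: the 'l1' penalty needs a solver that supports it, e.g. solver='saga'
    param_grid1 = {'classifier__penalty': ['l1', 'l2'], 'classifier__C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
    grid_search1 = GridSearchCV(estimator=pipeline1,
                                param_grid=param_grid1,
                                scoring=make_scorer(roc_auc_score, average='weighted', multi_class='ovo', needs_proba=True),
                                cv=stratified_kfold,
                                n_jobs=-1)
    print("#"*100)
    print(pipeline1['classifier'])
    grid_search1.fit(X_train, y_train)
    cv_score1 = grid_search1.best_score_
    test_score1 = grid_search1.score(X_test, y_test)
    print('cv_score', cv_score1, 'test_score', test_score1)
    return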
I have 2 questions:
Can I get multiple metrics (ROC AUC, accuracy, precision and F1 score) from the same function, given that the data is imbalanced with multiple classes?
I need to plot the learning curve, but I don't know how to do this outside of my function.
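On the first question, one option is to hand GridSearchCV a dict of scorers and name the one to refit on. This is only a sketch: it uses a plain scikit-learn Pipeline without the PCA and SMOTE steps, and the synthetic data, the short C grid and the names scoring and cv_scores are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic, imbalanced 3-class data standing in for the real X, y (assumption).
X, y = make_classification(n_samples=600, n_features=10, n_informative=6,
                           n_classes=3, weights=[0.6, 0.3, 0.1], random_state=11)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=11)

pipeline = Pipeline([
    ("scaler", MinMaxScaler()),
    ("classifier", LogisticRegression(max_iter=1000, random_state=11)),
])

# Several built-in scorer names at once; the dict keys are arbitrary labels.
scoring = {
    "roc_auc": "roc_auc_ovo_weighted",
    "accuracy": "accuracy",
    "precision": "precision_weighted",
    "f1": "f1_weighted",
}

grid = GridSearchCV(
    pipeline,
    param_grid={"classifier__C": [0.1, 1, 10]},
    scoring=scoring,
    refit="roc_auc",  # metric used to pick and refit the best model
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=11),
)
grid.fit(X_train, y_train)

# Mean cross-validated value of every metric for the selected parameters.
best = grid.best_index_
cv_scores = {name: grid.cv_results_[f"mean_test_{name}"][best] for name in scoring}
test_score = grid.score(X_test, y_test)  # scored with the refit metric
print(cv_scores, test_score)
```

The per-metric columns in cv_results_ follow the pattern mean_test_<key>, so they can also be loaded into a DataFrame for comparison across all parameter settings.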
To plot a learning curve of roc_auc_score:
def plot_learning_curves(model, X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=11)
    train_scores, test_scores = [], []
    # note: a very small m may contain a single class, which roc_auc_score rejects
    for m in range(1, len(X_train)):
        model.fit(X_train[:m], y_train[:m])
        y_train_predict = model.predict(X_train[:m])
        y_test_predict = model.predict(X_test)
        # score the training predictions against the same m labels the model was fitted on
        train_scores.append(roc_auc_score(y_train[:m], y_train_predict))
        test_scores.append(roc_auc_score(y_test, y_test_predict))
    plt.plot(train_scores, "r-+", linewidth=2, label="train")
    plt.plot(test_scores, "b-", linewidth=2, label="test")
We iterate m from 1 to the whole length of the training set, slicing m records from the training data and training the model on this subset.
The model's roc_auc_score is calculated and appended to a list, showing how its roc_auc_score progresses as the model trains on more and more instances.
-modified from ageron-handsonml2
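As an alternative to the manual loop, scikit-learn ships a learning_curve helper that cross-validates at a handful of training sizes. The sketch below assumes synthetic 3-class data and a plain LogisticRegression in place of the tuned pipeline; plot_curve is a hypothetical helper name.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, learning_curve

# Synthetic 3-class data standing in for the real X, y (assumption).
X, y = make_classification(n_samples=600, n_features=10, n_informative=6,
                           n_classes=3, random_state=11)

train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X, y,
    train_sizes=np.linspace(0.2, 1.0, 5),   # five subset sizes, 20% to 100%
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=11),
    scoring="roc_auc_ovo_weighted",         # multiclass ROC AUC from probabilities
    shuffle=True,
    random_state=11,
)

# Average over the CV folds to get one point per training size.
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)


def plot_curve(sizes, train_mean, val_mean):
    """Hypothetical helper: draw the two curves with matplotlib."""
    import matplotlib.pyplot as plt
    plt.plot(sizes, train_mean, "r-+", linewidth=2, label="train")
    plt.plot(sizes, val_mean, "b-", linewidth=2, label="validation")
    plt.xlabel("training set size")
    plt.ylabel("ROC AUC (ovo, weighted)")
    plt.legend()
    plt.show()
```

Calling plot_curve(train_sizes, train_mean, val_mean) draws the figure. Since learning_curve accepts any estimator, the best_estimator_ of a fitted grid search could be passed in place of the LogisticRegression.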
Related
I'm using the train_test_split from sklearn.model_selection. My code looks like the following:
x_train, x_test , y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=1234)
Edit: After this is done, how do I fit these to the linear regression model, and then see how good the model is? That is, which of the four components (x_train, x_test, y_train, or y_test) would I use to calculate MSE or RMSE, and how exactly would I do that?
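A minimal sketch of the usual pattern: fit on x_train and y_train, predict on x_test, and compare those predictions with y_test. The synthetic regression data is an assumption standing in for the real x and y.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the real x, y (assumption).
x, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=1234)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=1234)

# Fit on the training portion only.
model = LinearRegression()
model.fit(x_train, y_train)

# Evaluate on the held-out portion: predictions from x_test versus y_test.
y_pred = model.predict(x_test)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print("MSE:", mse, "RMSE:", rmse)
```

Computing the same two numbers on x_train and y_train and comparing them with the test values gives a rough idea of how much the model overfits.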
I was trying to do hyperparameter tuning using GroupKFold and RandomizedSearchCV. I have cross-checked the shapes and they match. How can I solve this error?
Code:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.20, random_state=42)
qids_train = X_train.groupby(["query_id"])["query_id"].count().to_numpy()
flatted_group_train = np.repeat(range(len(qids_train)), repeats=qids_train)
gbm = lightgbm.LGBMRanker(
    objective="lambdarank",
    metric="ndcg",
    is_unbalance=True,
)
gkf = GroupKFold(n_splits=5)
cv = gkf.split(X_train, y_train, groups=flatted_group_train)
grid = RandomizedSearchCV(gbm, params_grid, n_iter=10, cv=cv, verbose=2, refit=False)
grid.fit(
    X=X_train,
    y=y_train,
    group=qids_train,
)
Error:
lightgbm.basic.LightGBMError: Sum of query counts is not same with #data
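The message suggests that the query counts passed as group must add up to the number of rows being fitted. Inside a cross-validated search every fit sees only a subset of X_train, so counts computed on all of X_train cannot add up to the fold size. The sketch below only illustrates that arithmetic with NumPy and GroupKFold; it does not call LightGBM, and the toy query_id array is an assumption.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data: 6 queries with different numbers of rows each (assumption).
query_id = np.repeat(np.arange(6), [3, 2, 4, 1, 5, 2])
X = np.zeros((len(query_id), 1))

# Query counts over the full training set, as computed in the question.
_, full_sizes = np.unique(query_id, return_counts=True)

fold_checks = []
for train_idx, _ in GroupKFold(n_splits=3).split(X, groups=query_id):
    # Query counts recomputed on the rows that this fold actually trains on.
    _, fold_sizes = np.unique(query_id[train_idx], return_counts=True)
    fold_checks.append((len(train_idx), int(fold_sizes.sum()), int(full_sizes.sum())))

for n_rows, fold_sum, full_sum in fold_checks:
    # fold_sum matches the fold's row count; full_sum is always too large.
    print(n_rows, fold_sum, full_sum)
```

In each fold the recomputed counts match the number of training rows, while the full-set counts overshoot, which is the mismatch the error reports.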
Basically, I want to split my dataset into training, testing and validation sets, so I have used the train_test_split function twice. I have a dataset of around 10 million rows.
In the first split I divided the data 70/30 into training and testing sets. Now, to get the validation set, I am a bit confused about whether to pass the already split testing data or the training data to train_test_split. Please give some advice. TIA
X = features
y = target
# divide X, y into 70% training, 15% testing and 15% validation data
from sklearn.model_selection import train_test_split
# features and labels are split 70-30
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
# the test data is then split further into test and validation sets, 15-15
x_test, x_val, y_test, y_val = train_test_split(X_test, y_test, test_size=0.5)
Don't make the testing set too small. A 20% testing dataset is fine. It would be better to split your training dataset into training and validation sets (80%/20% is a fair split). With that in mind, you should change your code like this:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25)
Splitting a dataset this way is common practice.
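To check the proportions, here is a small sketch that carves the validation set out of the training portion and prints the resulting sizes. The 1000-row toy arrays and the fixed random_state on the second split are assumptions; with test_size=0.2 followed by test_size=0.25 the overall ratio comes out at 60/20/20.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 1000 rows whose single feature is the row index (assumption).
X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000) % 2

# First split: hold out 20% of all rows as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Second split: take 25% of the remaining 80% as validation (20% overall).
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=0)

sizes = (len(X_train), len(X_val), len(X_test))
print(sizes)  # (600, 200, 200)
```

Because the second call only ever sees the training portion, the test set stays untouched until the final evaluation.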
I am using LightGBM to find feature importances, but I am getting the error LightGBMError: b'len of label is not same with #data'.
X.shape
(73147, 12)
y.shape
(73147,)
Code:
from sklearn.model_selection import train_test_split
import lightgbm as lgb
# Initialize an empty array to hold feature importances
feature_importances = np.zeros(X.shape[1])
# Create the model with several hyperparameters
model = lgb.LGBMClassifier(objective='binary', boosting_type = 'goss', n_estimators = 10000, class_weight = 'balanced')
# Fit the model twice to avoid overfitting
for i in range(2):
    # Split into training and validation set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = i)
    # Train using early stopping
    model.fit(X, y_train, early_stopping_rounds=100, eval_set = [(X_test, y_test)],
              eval_metric = 'auc', verbose = 200)
    # Record the feature importances
    feature_importances += model.feature_importances_
See screenshot below:
You seem to have a typo in your code; instead of
model.fit(X, y_train, [...])
it should be
model.fit(X_train, y_train, [...])
As it is now, X and y_train have different lengths, hence your error.
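A quick way to confirm this kind of diagnosis is to compare row counts before calling fit. The zero-filled arrays below only mimic the shapes reported in the question; they are not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the shapes from the question: (73147, 12) and (73147,).
X = np.zeros((73147, 12))
y = np.zeros(73147)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# The full X has more rows than y_train, which is what triggers the error.
mismatch = X.shape[0] != y_train.shape[0]
# X_train and y_train come from the same split, so their row counts agree.
match = X_train.shape[0] == y_train.shape[0]
print(X.shape[0], X_train.shape[0], y_train.shape[0], mismatch, match)
```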
I would like to test the accuracy by epoch in scikit-learn. However, so far, I have been unsuccessful.
This is the part of my code that classifies with MLPClassifier:
NUM_EPOCHS = 1000
LOG_FOR_EVERY = 10
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = MLPClassifier(hidden_layer_sizes=(18, 175, 256), batch_size=528,
                    learning_rate_init=0.0001, beta_1=0.001,
                    beta_2=0.001, max_iter=1, warm_start=True)
for i in range(NUM_EPOCHS):
    clf.fit(X_train, y_train.ravel())
I have also made a graph with this result, but I need to make it continuous and further increase accuracy.
Why is the accuracy not increasing?
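Note that MLPClassifier.fit takes only X and y and has no epochs argument. Please try partial_fit instead, which trains for a single iteration per call:
clf.partial_fit(X_train, y_train.ravel(), classes=np.unique(y_train))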
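A rough sketch of tracking accuracy per epoch with partial_fit, which updates the model with a single pass over the data per call. The synthetic data, the smaller network, the shorter epoch count and the history list are assumptions for illustration, not the asker's setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic binary data standing in for the real X, y (assumption).
X, y = make_classification(n_samples=500, n_features=18, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

NUM_EPOCHS = 50
LOG_FOR_EVERY = 10

# Small network with default Adam settings, chosen to keep the sketch fast.
clf = MLPClassifier(hidden_layer_sizes=(32,), learning_rate_init=0.001,
                    random_state=0)
classes = np.unique(y_train)  # partial_fit needs the full label set up front

history = []  # (epoch, train accuracy, test accuracy)
for epoch in range(1, NUM_EPOCHS + 1):
    clf.partial_fit(X_train, y_train, classes=classes)
    if epoch % LOG_FOR_EVERY == 0:
        history.append((epoch,
                        clf.score(X_train, y_train),
                        clf.score(X_test, y_test)))

for epoch, train_acc, test_acc in history:
    print(epoch, round(train_acc, 3), round(test_acc, 3))
```

Plotting the second and third columns of history against the first gives one continuous accuracy curve per data split.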