I am working on a project where, in one portion, we use a filter method for feature evaluation (correlation analysis). In the other portion, we thought it would be a good idea to use more of a wrapper method: feature elimination with cross-validation. I found the recursive feature elimination classes in scikit-learn.
I am loading my data into a Pandas data frame 'df':
df = pd.read_csv("/Users/rohinmahesh/Documents/Data_Mining_Project/bank-additional-full.csv")
df = df.reset_index()
After creating dummy variables I create the feature and label arrays:
x = df2.iloc[:, :-1]
y = df2.iloc[:, -1]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)
After running various models with accuracy as the metric (project specification), the best model was Logistic Regression:
clf = LogisticRegression()
clf.fit(x_train, y_train)
I am then evaluating the model with 3 fold Cross Validation (project specification):
pre_score = cross_val_score(clf, x, y, cv=3)
score = pre_score * 100
Now when I run the regular Recursive Feature Elimination, it is working:
rfe = RFE(clf, n_features_to_select=3)
rfe = rfe.fit(x_test, y_test)
But when I try and run this using the RFECV, I am getting an error message:
rfe = RFECV(clf, cv=3)
rfe = rfe.fit(x_test, y_test)
I am new to this, so any help would be great!
Edit: I have gotten this to run properly (not sure what I did before that made it wrong), but in the output I am getting, every feature has rank 1, and I don't understand why. Could someone spot my mistake?
A little late to post...
It simply means that the algorithm did not find any performance improvement from dropping features.
All selected features have a rank of 1.
Only dropped features have a rank greater than 1.
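To see how the ranks relate to the selected features, here is a minimal sketch on synthetic data (the dataset and its parameters are made up for illustration, not taken from the question). It prints the support_ mask next to ranking_, so kept features show rank 1 and eliminated ones show a higher rank.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 10 features, only 3 of them informative.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

selector = RFECV(LogisticRegression(max_iter=1000), cv=3)
selector.fit(X, y)

# support_ is a boolean mask of the kept features; ranking_ is 1 for every kept feature.
for index, (kept, rank) in enumerate(zip(selector.support_, selector.ranking_)):
    print(f"feature {index}: kept={kept}, rank={rank}")
print("number of features kept:", selector.n_features_)
```

If n_features_ equals the total number of columns, every rank will be 1, which matches the situation described in the question.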
I am using xgboost for a classification problem with an imbalanced dataset. I plan on using some combination of an f1-score or roc-auc as my primary criteria for judging the model.
Currently the default value returned from the score method is accuracy, but I would really like to have a specific evaluation metric returned instead. My big motivation for doing this is that I presume the feature_importances_ attribute from the model is determined from what's affecting the score method, and the columns that impact predictive accuracy might very well be different from the columns that impact roc-auc. Right now I am passing in values to eval_metric but it does not seem to be making a difference.
Here is some sample code:
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score
data = load_breast_cancer()
X = data['data']
y = data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2, stratify=y)
mod = XGBClassifier()
mod.fit(X_train, y_train)
Now at this point, mod.score(X_test, y_test) will return a value of ~ 0.96, and the roc_auc_score is ~ 0.99.
I was hoping the following snippet:
mod.fit(X_train, y_train, eval_metric='auc')
Would then allow mod.score(X_test, y_test) to return the roc_auc_score value, but it is still returning predictive accuracy, not roc_auc.
The purpose of this exercise is estimating the influence of different columns on the outcome, so if I could get feature_importances_ returned using f1 or roc_auc as the measure of impact this would be a huge boon, but I do not seem to be on the right path as of now.
Thank you.
There are two parts to your question. First, to use eval_metric you need to provide data to evaluate on, via eval_set:
mod = XGBClassifier()
mod.fit(X_train, y_train, eval_set=[(X_test, y_test)], eval_metric="auc")
You can check the auc using evals_result(), and it gives the auc for every iteration:
{'validation_0': OrderedDict([('auc',
Second, the importance score is calculated based on the average gain across all splits the feature is used in (see the help page). From your question, I suppose you need the model to maximize AUC, as in cross-validation, but you cannot use AUC as an objective in xgboost: gradient boosting methods require a differentiable loss function.
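If the goal is an importance measure tied to ROC AUC rather than to gain, one option is permutation importance scored with "roc_auc". The sketch below is only an illustration: it uses scikit-learn's GradientBoostingClassifier as a stand-in model, on the assumption that an XGBClassifier could be swapped in because it follows the same estimator interface. It also shows that ROC AUC is computed from predicted probabilities, separately from score().

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data["data"], data["target"], random_state=42, test_size=0.2, stratify=data["target"]
)

# Stand-in model (assumption: an XGBClassifier could be used here instead).
model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# score() reports accuracy; ROC AUC has to be computed from predicted probabilities.
accuracy = model.score(X_test, y_test)
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"accuracy: {accuracy:.3f}, roc_auc: {auc:.3f}")

# Importance measured as the drop in ROC AUC when one column is shuffled.
result = permutation_importance(
    model, X_test, y_test, scoring="roc_auc", n_repeats=5, random_state=0
)
top = result.importances_mean.argsort()[::-1][:5]
for i in top:
    print(f"{data['feature_names'][i]}: {result.importances_mean[i]:.4f}")
```

Changing scoring to "f1" would give the same kind of ranking with respect to the F1 score.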
With an imbalanced dataset, you can try adjusting the scale_pos_weight parameter, which controls the balance of positive and negative weights. This is discussed on the xgboost website.
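As a rough starting point, the xgboost documentation suggests the ratio of negative to positive instances as a typical value for scale_pos_weight. A small sketch with hypothetical labels (the 90/10 split is invented for illustration):

```python
import numpy as np

# Hypothetical imbalanced labels: 90 negatives, 10 positives.
y_train = np.array([0] * 90 + [1] * 10)

n_negative = int((y_train == 0).sum())
n_positive = int((y_train == 1).sum())
ratio = n_negative / n_positive
print(ratio)

# The ratio would then be passed to the model, e.g. XGBClassifier(scale_pos_weight=ratio).
```

The value is usually treated as something to tune, not as a fixed rule.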
I have a dataframe with 36540 rows. The objective is to predict y, HITS_DAY.
I am trying to train a non-linear regression model, but the model doesn't seem to learn much.
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=42)
data_dmatrix = xgb.DMatrix(data=x,label=y)
xg_reg = xgb.XGBRegressor(learning_rate = 0.1, objectif='reg:linear', max_depth=5,
n_estimators = 1000)
xg_reg.fit(X_train, y_train)
preds = xg_reg.predict(X_test)
df=pd.DataFrame({'ACTUAL':y_test, 'PREDICTED':preds})
What am I doing wrong?
You're not doing anything wrong in particular (except maybe the objectif parameter, which doesn't exist in xgboost). However, you have to consider how xgboost works. It will try to create "trees", and trees have splits based on the values of the features. From the plot you show here, it looks like there are very few samples that go above 0. So a random train/test split will likely produce a test set with virtually no samples above 0 (hence the horizontal line).
Other than that, it seems you want to fit a linear model on non-linear data. Selecting a different objective function is likely to help with this.
Finally, how do you know that your model is not learning anything? I don't see any evaluation metrics to confirm this. Try to think of meaningful evaluation metrics for your model and show them. This will help you determine if your model is "good enough".
To summarize:
Fix the imbalance in your dataset (or at least take it into consideration)
Select an appropriate objective function
Check evaluation metrics that make sense for your model
From this example it looks like your model is indeed learning something, even without parameter tuning (which you should do!).
import pandas
import xgboost
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# Read the data
df = pandas.read_excel("./data.xlsx")
# Split in X and y
X = df.drop(columns=["HITS_DAY"])
y = df["HITS_DAY"]
# Show the values of the full dataset in a plot
# Split in test and train, use stratification to make sure the 2 groups look similar
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=[element > 1 for element in y.values]
)
# Show the plots of the test and train set (make sure they look similar!)
# Create the regressor
estimator = xgboost.XGBRegressor(objective="reg:squaredlogerror")
# Fit the regressor
estimator.fit(X_train, y_train)
# Predict on the test set
predictions = estimator.predict(X_test)
df = pandas.DataFrame({"ACTUAL": y_test, "PREDICTED": predictions})
# Show the actual vs predicted
df.sort_values("ACTUAL").reset_index()[["ACTUAL", "PREDICTED"]].plot()
# Show some evaluation metrics
print(f"Mean squared error: {mean_squared_error(y_test.values, predictions)}")
print(f"R2 score: {r2_score(y_test.values, predictions)}")
Mean squared error: 0.01525351142868279
R2 score: 0.07857787102063485
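One way to check that a stratified split like the one above really produces comparable groups is to compare the share of large target values in each set. The sketch below uses made-up data (about 5% of targets above 1), not the HITS_DAY dataset, so the numbers are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Made-up target: roughly 95% small values, 5% large ones.
y = np.where(rng.random(2000) < 0.05, rng.uniform(1.5, 5.0, 2000), rng.uniform(0.0, 0.5, 2000))
X = rng.normal(size=(2000, 3))

# Stratify on a boolean flag, since the raw target is continuous.
is_large = y > 1
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=is_large
)

share_train = (y_train > 1).mean()
share_test = (y_test > 1).mean()
print(f"share of y > 1 in train: {share_train:.3f}, in test: {share_test:.3f}")
```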
I previously saw a post with code like this:
scalar = StandardScaler()
clf = svm.LinearSVC()
pipeline = Pipeline([('transformer', scalar), ('estimator', clf)])
cv = KFold(n_splits=4)
scores = cross_val_score(pipeline, X, y, cv = cv)
My understanding is that when we apply the scaler, we should use 3 of the 4 folds to calculate the mean and standard deviation, then apply that mean and standard deviation to all 4 folds.
In the above code, how can I know that sklearn follows this strategy? If it does not, meaning sklearn calculates the mean/std from all 4 folds, should I avoid the above code?
I do like the above code because it saves a lot of time.
In the example you gave, I would add an additional step using sklearn.model_selection.train_test_split:
folds = 4
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=(1/folds), random_state=0, stratify=y)
scalar = StandardScaler()
clf = svm.LinearSVC()
pipeline = Pipeline([('transformer', scalar), ('estimator', clf)])
cv = KFold(n_splits=(folds - 1))
scores = cross_val_score(pipeline, X_train, y_train, cv = cv)
I think best practice is to only use the training data set (i.e., X_train, y_train) when tuning the hyperparameters of your model, and the test data set (i.e., X_test, y_test) should be used as a final check, to make sure your model isn't biased towards the validation folds. At that point you would apply the same scaler that you fit on your training data set to your testing data set.
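A minimal sketch of that final check, assuming synthetic data in place of the asker's X and y: fitting the pipeline on the training set learns the scaler statistics there, and scoring on the test set reuses them.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real X and y.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

pipeline = Pipeline(
    [("transformer", StandardScaler()), ("estimator", svm.LinearSVC(dual=False))]
)

# fit() learns the scaler's mean/std from the training data only.
pipeline.fit(X_train, y_train)

# score() reuses those training statistics to transform X_test before predicting.
test_accuracy = pipeline.score(X_test, y_test)
print(f"held-out accuracy: {test_accuracy:.3f}")
```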
Yes, this is done properly; this is one of the reasons for using pipelines: all the preprocessing is fitted only on training folds.
Some references.
Section 6.1.1 of the User Guide:
Pipelines help avoid leaking statistics from your test data into the trained model in cross-validation, by ensuring that the same samples are used to train the transformers and predictors.
The note at the end of section 3.1.1 of the User Guide:
Data transformation with held out data
Just as it is important to test a predictor on data held-out from training, preprocessing (such as standardization, feature selection, etc.) and similar data transformations similarly should be learnt from a training set and applied to held-out data for prediction:
...code sample...
A Pipeline makes it easier to compose estimators, providing this behavior under cross-validation:
Finally, you can look into the source for cross_val_score. It calls cross_validate, which clones and fits the estimator (in this case, the entire pipeline) on each training split. GitHub link.
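This behavior can also be checked empirically. The sketch below (synthetic data, so an illustration and not a proof) asks cross_validate to return the fitted pipelines and compares each scaler's learned mean with the mean of the corresponding training rows.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

pipeline = Pipeline(
    [("transformer", StandardScaler()), ("estimator", svm.LinearSVC(dual=False))]
)
cv = KFold(n_splits=4)

# return_estimator=True keeps the pipeline fitted on each training split.
results = cross_validate(pipeline, X, y, cv=cv, return_estimator=True)

matches = []
for fitted, (train_idx, test_idx) in zip(results["estimator"], cv.split(X)):
    scaler = fitted.named_steps["transformer"]
    # Compare the scaler's mean with the mean of the training rows only.
    matches.append(np.allclose(scaler.mean_, X[train_idx].mean(axis=0)))
print(matches)
```

KFold without shuffling is deterministic, so calling cv.split(X) again reproduces the same splits that cross_validate used.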
I built a simple linear regression model to predict students' final grade using this dataset https://archive.ics.uci.edu/ml/datasets/Student+Performance.
While my accuracy is very good, the errors seem to be big.
I'm not sure if I'm just not understanding the meaning of the errors correctly or if I made some errors in my code. I thought for the accuracy of 92, the errors should be way smaller and closer to 0.
Here's my code:
data = pd.read_csv("/Users/.../student/student-por.csv", sep=";")
X = np.array(data.drop(columns=[predict]))
y = np.array(data[predict])
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size = 0.1, random_state=42)
linear = linear_model.LinearRegression()
linear.fit(x_train, y_train)
linear_accuracy = round(linear.score(x_test, y_test), 5)
linear_prediction = linear.predict(x_test)
linear_mean_abs_error = metrics.mean_absolute_error(y_test, linear_prediction)
linear_mean_sq_error = metrics.mean_squared_error(y_test, linear_prediction)
linear_root_mean_sq_error = np.sqrt(metrics.mean_squared_error(y_test, linear_prediction))
Did I make a mistake in the code, or do these errors make sense in this case?
The score that sklearn reports for linear regression is the R^2 metric, not accuracy. It tells you the proportion of the variation in the dependent variable that is explained by the model's predictors. 0.92 is a very good score, but it does not mean that your errors will be 0. I looked at your work, and it seems that you used all the numeric variables as your predictors and your target was G3. The code seems fine and the results seem accurate too. In regression tasks it is really hard to get 0 errors. Please let me know if you have any questions. Cheers
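To make the relationship concrete, here is a small sketch on made-up data (a noisy target on a 0-20 scale, not the student dataset). It computes R^2 by hand as one minus the ratio of residual to total sum of squares, and prints it next to MAE and RMSE, which stay well above zero even though R^2 is high.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

rng = np.random.default_rng(0)

# Made-up data: a grade-like target on a 0-20 scale with some noise.
x = rng.uniform(0, 20, size=(500, 1))
y = x[:, 0] + rng.normal(scale=1.5, size=500)

model = LinearRegression().fit(x, y)
pred = model.predict(x)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y - pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

mae = mean_absolute_error(y, pred)
rmse = np.sqrt(mean_squared_error(y, pred))
print(f"R^2: {r2_manual:.3f}, MAE: {mae:.3f}, RMSE: {rmse:.3f}")
```

R^2 is relative to the spread of the target, while MAE and RMSE are in the target's own units, so the two kinds of number are not directly comparable.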
I'm running RandomForestRegressor(). I'm using R-squared for scoring. Why do I get dramatically different results with .score versus cross_val_score? Here is the relevant code:
X = df.drop(['y_var'], axis=1)
y = df['y_var']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
# Random Forest Regression
rfr = RandomForestRegressor()
model_rfr = rfr.fit(X_train,y_train)
pred_rfr = rfr.predict(X_test)
result_rfr = model_rfr.score(X_test, y_test)
# cross-validation
rfr_cv_r2 = cross_val_score(rfr, X, y, cv=5, scoring='r2')
I understand that cross-validation is scoring multiple times versus one for .score, but the results are so radically different, that something is clearly wrong. Here are the results:
R2-dot-score: .99072
R2-cross-val: [0.5349302 0.65832268 0.52918704 0.74957719 0.45649582]
What am I doing wrong? Or what might explain this discrepancy?
OK, I may have solved this. It seems as if cross_val_score does not shuffle the data, which may be leading to worse predictions when data is grouped together. The easiest solution I found (via this answer) to this was to simply shuffle the dataframe before running the model:
shuffled_df = df.reindex(np.random.permutation(df.index))
After I did that, I started getting similar results between .score and cross_val_score:
R2-dot-score: 0.9910715555903232
R2-cross-val: [0.99265184 0.9923142 0.9922923 0.99259524 0.99195022]
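An alternative to reindexing the dataframe is to pass a shuffling splitter to cross_val_score. The sketch below uses made-up data sorted by the target to mimic grouped rows; it assumes the rows are otherwise independent, so shuffling is legitimate (for time series or grouped data it may not be).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)

# Made-up data sorted by the target, so neighbouring rows are similar.
x = np.linspace(0, 10, 300)
X = x.reshape(-1, 1)
y = x + rng.normal(scale=0.1, size=300)

rfr = RandomForestRegressor(n_estimators=50, random_state=0)

# cv=5 with a regressor uses contiguous folds, with no shuffling.
unshuffled = cross_val_score(rfr, X, y, cv=5, scoring="r2")

# Same model, but rows are assigned to folds at random.
shuffled = cross_val_score(
    rfr, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0), scoring="r2"
)
print("unshuffled:", unshuffled.round(3))
print("shuffled:  ", shuffled.round(3))
```

With contiguous folds, each test fold covers a range of target values the forest never saw during training, which is one plausible explanation for the low scores in the question.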