How to train xgboost with TF-IDF - python

I'm trying to train the model to classify short texts. I do the following:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
import xgboost as xgb

tfidf = TfidfVectorizer(max_features=1000)
train['vector'] = tfidf.fit_transform(train['item_name'])
train = train.drop('item_name', axis=1)
y = train.category_id
train = train.drop('category_id', axis=1)
X_train, X_test, y_train, y_test = train_test_split(train, y, test_size=0.10, stratify=y, random_state=42)

xgb_model = xgb.XGBClassifier()
xgb_model.fit(X_train, y_train)
But I get an error:
ValueError: DataFrame.dtypes for data must be int, float, bool or categorical. When
categorical type is supplied, DMatrix parameter
enable_categorical must be set to True.vector

You should be training XGBClassifier on the TfidfVectorizer transformation results only. Right now you are also passing the original, un-vectorized text, which is what raises the ValueError above.
The simplest solution is to set up a two-step pipeline:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from xgboost import XGBClassifier

pipeline = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("classifier", XGBClassifier())
])
# X_train here is the raw text column (e.g. train['item_name'])
pipeline.fit(X_train, y_train)
However, be aware that XGBoost estimators interpret sparse data matrices differently from regular Scikit-Learn estimators. To get correct/meaningful results, you should additionally convert the sparse data matrix to a dense data matrix.
See Training Scikit-Learn based TF(-IDF) plus XGBoost pipelines
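A minimal sketch of that sparse-to-dense conversion as an extra pipeline step (the "densifier" step name and the lambda are illustrative, not from the linked article):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from xgboost import XGBClassifier

# FunctionTransformer turns the sparse TF-IDF output into a dense
# array before it reaches the XGBoost estimator
pipeline = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("densifier", FunctionTransformer(lambda X: X.toarray(), accept_sparse=True)),
    ("classifier", XGBClassifier())
])
pipeline.fit(X_train, y_train)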

Related

Apply a cross validated ML model to unseen data

I would like to use scikit-learn to predict a variable y from features X. I would like to train a classifier on a training dataset using cross-validation and then apply this classifier to an unseen test dataset (as in https://www.nature.com/articles/s41586-022-04492-9).
from sklearn import datasets
from sklearn.model_selection import cross_validate
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Import dataset
X, y = datasets.load_iris(return_X_y=True)

# Create binary variable y (merge class 0 into class 1)
y[y == 0] = 1

# Divide into train and test set
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=75, random_state=4, stratify=y)

# Cross-validation on the train data
model = SVC()
cv_model = cross_validate(model, x_train, y_train, cv=5)
Now I would like to take this cross-validated model and apply it to the unseen test set, but I am unable to find out how.
It would be something like
result = cv_model.score(x_test, y_test)
Except this does not work
You cannot do that; you need to fit the model before using it to predict new data. cross_validate is just a convenience function to get the scores; as clearly mentioned in the documentation, it returns just that, i.e. scores, and not a (fitted) model:
Evaluate metric(s) by cross-validation and also record fit/score times.
[...]
Returns: scores : dict of float arrays of shape (n_splits,)
Array of scores of the estimator for each run of the cross validation.
A dict of arrays containing the score/time arrays for each scorer is returned.
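The usual pattern, then, is either to refit on the whole training split after cross-validation, or to ask cross_validate for the fitted per-fold estimators via return_estimator=True. A minimal sketch reusing the names from the question:
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

# Option 1: use cross-validation purely for evaluation,
# then refit the estimator on the whole training split
model = SVC()
cv_results = cross_validate(model, x_train, y_train, cv=5)
model.fit(x_train, y_train)
result = model.score(x_test, y_test)

# Option 2: keep the per-fold fitted estimators returned by cross_validate
cv_results = cross_validate(SVC(), x_train, y_train, cv=5, return_estimator=True)
fold_scores = [est.score(x_test, y_test) for est in cv_results["estimator"]]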

Can I use GridSearchCV with KNeighborsRegressor?

I have a data set with some float column features (X_train) and a continuous target (y_train).
I want to run KNN regression on the data set, and I want to (1) do a grid search for hyperparameter tuning and (2) run cross validation on the training.
I wrote this code:
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.neighbors import KNeighborsRegressor

X_train, X_test, y_train, y_test = train_test_split(scaled_df, target, test_size=0.2)

cv_method = RepeatedStratifiedKFold(n_splits=5,
                                    n_repeats=3,
                                    random_state=999)

# Define our candidate hyperparameters
hp_candidates = [{'n_neighbors': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
                  'weights': ['uniform', 'distance'],
                  'p': [1, 2, 5]}]

# Search for best hyperparameters
grid = GridSearchCV(estimator=KNeighborsRegressor(),
                    param_grid=hp_candidates,
                    cv=cv_method,
                    verbose=1,
                    scoring='accuracy',
                    return_train_score=True)
grid.fit(X_train, y_train)
The error I get is:
Supported target types are: ('binary', 'multiclass'). Got 'continuous' instead.
I understand the error: this method works for KNN classification, not regression.
But what I can't find is how to edit this code to make it suitable for KNN regression. Can someone explain how this could be done?
(The ultimate aim: I have a data set, I want to tune the hyperparameters, run cross-validation, and output the best model, getting back scores that are comparable across algorithms rather than specific to KNN, so I can compare accuracy.)
Also, just to mention, this is my first attempt at KNN in scikit-learn, so all comments/criticism are welcome.
Yes, you can use GridSearchCV with KNeighborsRegressor.
The problem is your metric choice: 'accuracy' is a classification metric. You can read the metrics documentation here: https://scikit-learn.org/stable/modules/model_evaluation.html
The metrics appropriate for a regression problem are different from those for classification problems; the appropriate regression scorers are:
‘explained_variance’
‘max_error’
‘neg_mean_absolute_error’
‘neg_mean_squared_error’
‘neg_root_mean_squared_error’
‘neg_mean_squared_log_error’
‘neg_median_absolute_error’
‘r2’
‘neg_mean_poisson_deviance’
‘neg_mean_gamma_deviance’
‘neg_mean_absolute_percentage_error’
So you can choose one of these to replace 'accuracy' and test it. Note also that the error message itself comes from RepeatedStratifiedKFold: stratified splitting needs discrete class labels, so for a continuous target you should switch to RepeatedKFold as well.
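Putting both fixes together, a minimal sketch (the scorer choice is illustrative; scaled_df, target, and the train/test split are as in the question):
from sklearn.model_selection import GridSearchCV, RepeatedKFold
from sklearn.neighbors import KNeighborsRegressor

# RepeatedKFold does not stratify, so it accepts a continuous target
cv_method = RepeatedKFold(n_splits=5, n_repeats=3, random_state=999)

hp_candidates = [{'n_neighbors': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
                  'weights': ['uniform', 'distance'],
                  'p': [1, 2, 5]}]

grid = GridSearchCV(estimator=KNeighborsRegressor(),
                    param_grid=hp_candidates,
                    cv=cv_method,
                    scoring='neg_root_mean_squared_error',  # any regression scorer from the list above
                    return_train_score=True)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)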

sklearn Stacking Estimator passthrough skips preprocessing and passes original data

This issue has been discussed here, but there have been no comments: https://github.com/scikit-learn/scikit-learn/issues/16473
I have some numerical features and categorical features in X. The categorical features were one-hot encoded. So my pipeline is similar to the sklearn docs example:
from sklearn.compose import make_column_transformer
from sklearn.ensemble import (HistGradientBoostingRegressor,
                              RandomForestRegressor, StackingRegressor)
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

cat_proc_lin = make_pipeline(
    SimpleImputer(missing_values=None,
                  strategy='constant',
                  fill_value='missing'),
    OneHotEncoder(categories=categories)
)

num_proc_lin = make_pipeline(
    SimpleImputer(strategy='mean'),
    StandardScaler()
)

processor_lin = make_column_transformer(
    (cat_proc_lin, cat_cols),
    (num_proc_lin, num_cols),
    remainder='passthrough')

lasso_pipeline = make_pipeline(processor_lin,
                               LassoCV())

# processor_nlin (the preprocessor for the tree-based models) is defined
# analogously in the docs example
rf_pipeline = make_pipeline(processor_nlin,
                            RandomForestRegressor(random_state=42))

gradient_pipeline = make_pipeline(
    processor_nlin,
    HistGradientBoostingRegressor(random_state=0))

estimators = [('Random Forest', rf_pipeline),
              ('Lasso', lasso_pipeline),
              ('Gradient Boosting', gradient_pipeline)]

stacking_regressor = StackingRegressor(estimators=estimators,
                                       final_estimator=RidgeCV())
But if I set passthrough=True, it raises an error, because passthrough hands the original X to the final estimator and skips the preprocessing part of the pipeline:
/usr/local/lib/python3.6/dist-packages/sklearn/model_selection/_validation.py:536: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details:
ValueError: could not convert string to float: 'RL'
Is there any way to make the passthrough include the first preprocessing part of the pipeline?
I also cannot add the preprocessing pipeline in front of the final estimator, because that would concatenate the original X dataframe with the base-layer predictions, which are a numpy array, as mentioned in the GitHub discussion linked at the top of this post. My exact preprocessing pipeline has several custom transformers that operate on pandas DataFrames.
Thank you for any help!
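One possible workaround (a sketch, assuming all base models can share a single preprocessor such as processor_lin) is to move the shared preprocessing into an outer pipeline in front of the StackingRegressor, so that passthrough forwards the already-preprocessed matrix rather than the raw dataframe:
from sklearn.ensemble import (HistGradientBoostingRegressor,
                              RandomForestRegressor, StackingRegressor)
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.pipeline import make_pipeline

# Base estimators without their own preprocessing steps
estimators = [('Random Forest', RandomForestRegressor(random_state=42)),
              ('Lasso', LassoCV()),
              ('Gradient Boosting', HistGradientBoostingRegressor(random_state=0))]

stacking_regressor = StackingRegressor(estimators=estimators,
                                       final_estimator=RidgeCV(),
                                       passthrough=True)

# processor_lin is the shared column transformer from the question; because
# it runs before the stacker, passthrough now forwards preprocessed features
full_pipeline = make_pipeline(processor_lin, stacking_regressor)
full_pipeline.fit(X_train, y_train)
The trade-off is that the linear and tree-based models then see the same preprocessing, which differs from the per-model processors in the original example.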

How does one code data from a different (test) file vs all the data in one file?

All the examples I've ever come across conveniently have the data in one file when showing how train_test_split (or any model, really) works. But quite often the training data and the testing data are two separate files.
So, I made an ultra-basic logistic regression train file and test file, each consisting of two columns, 'age' and 'insurance', and named the DataFrames df_train and df_test.
I realize df_test hasn't been trained on, hence the error, but... isn't that the point?
I know model.predict(X_test) doesn't throw an error, but that is based on the training data, not the test data.
Word of warning, this is what happens when you're old and trying to learn new things. Don't get old.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df_train[['age']], df_train.insurance, test_size=0.1)

from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
model.predict(df_test)
Thanks,
Old fart
As you stated:
train file and test file consisting of two columns, 'age',
'insurance'.
So if the test file contains both the age and insurance columns and is used as-is, predict will not work, because the input does not match what the model was trained on.
Also, model.predict expects only the independent variable (in your case, age), in the format below:
predict(self, X)
Predict class labels for samples in X.
Parameters:
X : array_like or sparse matrix, shape (n_samples, n_features)
Samples.
Now coming to the modification:
model.predict(df_test[["age"]])
Edit: try this:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Keep X two-dimensional: shape (n_samples, 1)
X = df_train[["age"]].values
y = df_train["insurance"].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

model = LogisticRegression()
model.fit(X_train, y_train)

# Predict on the separate test file, again passing only the feature column
model.predict(df_test[["age"]].values)
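For completeness, a sketch of the full two-file workflow the question describes (the CSV file names are hypothetical):
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical file names; each file has 'age' and 'insurance' columns
df_train = pd.read_csv("insurance_train.csv")
df_test = pd.read_csv("insurance_test.csv")

model = LogisticRegression()
model.fit(df_train[["age"]], df_train["insurance"])

# Evaluate on the separate test file: features only for predict,
# features plus labels for score
preds = model.predict(df_test[["age"]])
accuracy = model.score(df_test[["age"]], df_test["insurance"])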

Cross validation and model selection

I am using sklearn for SVM training, and I am using cross-validation to evaluate the estimator and avoid overfitting.
I split the data into two parts: train data and test data. Here is the code:
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn import datasets
from sklearn import svm

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0
)
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, X_train, y_train, cv=5)
print(scores)
Now I need to evaluate the estimator clf on X_test.
clf.score(X_test, y_test)
Here I get an error saying that the model has not been fitted with fit(). But isn't the model fitted inside cross_val_score? What is the problem?
cross_val_score is basically a convenience wrapper for the sklearn cross-validation iterators. You give it a classifier and your whole (training + validation) dataset, and it automatically performs one or more rounds of cross-validation by splitting the data into random training/validation sets, fitting on the training set, and computing the score on the validation set. See the documentation here for an example and more explanation.
The reason why clf.score(X_test, y_test) raises an exception is that cross_val_score performs the fitting on a copy of the estimator rather than the original (see the use of clone(estimator) in the source code here). Because of this, clf remains unchanged outside of the function call and is therefore not fitted when you call clf.score.
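In other words, the fix is to fit clf yourself before scoring. A minimal sketch using the same iris setup:
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score, train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0
)

clf = svm.SVC(kernel='linear', C=1)

# Cross-validation scores on the training split (each fold fits a clone of clf)
scores = cross_val_score(clf, X_train, y_train, cv=5)

# Fit the original estimator itself, then evaluate on the held-out test set
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))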
