Getting Feature Importances for XGBoost multioutput - python

Beginning in xgboost version 1.6, you can now run multioutput models directly. In the past, I had been using the scikit learn wrapper MultiOutputRegressor around an xgbregressor estimator. I could then access the individual models feature importance by using something thing like wrapper.estimators_[i].feature_importances_
Now, however, when I run feature_importances_ on a multioutput model of xgboostregressor, I only get one set of features even through I have more than one target. Any idea what this array of feature importances actually represents? Is it the first, last, or some sort of average across all the targets? Is this a function that perhaps just not ready to handle multioutput?
*Realize questions are always easier to answer when you have some code to test:
import numpy as np
from sklearn import datasets
import xgboost as xgb
from sklearn.multioutput import MultiOutputRegressor
from numpy.testing import assert_almost_equal
n_targets = 3
X, y = datasets.make_regression(n_targets=n_targets)
X_train, y_train = X[:50], y[:50]
X_test, y_test = X[50:], y[50:]
single_run_features = {}
references = np.zeros_like(y_test)
for n in range(0,n_targets):
xgb_indi_model = xgb.XGBRegressor(random_state=0)
xgb_indi_model.fit(X_train, y_train[:, n])
references[:,n] = xgb_indi_model.predict(X_test)
single_run_features[n] = xgb_indi_model.feature_importances_
xgb_multi_model = xgb.XGBRegressor(random_state=0)
xgb_multi_model.fit(X_train, y_train)
y__multi_pred = xgb_multi_model.predict(X_test)
xgb_scikit_model = MultiOutputRegressor(xgb.XGBRegressor(random_state=0))
xgb_scikit_model.fit(X_train, y_train)
y_pred = xgb_scikit_model.predict(X_test)
print(assert_almost_equal(references, y_pred))
print(assert_almost_equal(y__multi_pred, y_pred))
scikit_features = {}
for i in range(0,n_targets):
scikit_features[i] = xgb_scikit_model.estimators_[i].feature_importances_
xgb_multi_model_features = xgb_multi_model.feature_importances_
single_run_features
scikit_features
The features importances match for the loop version of single target model, single_run_features, and the MultiOutputRegressor version, scikit_features. The issues is the results in xgb_multi_model_features. Any suggestions??

Is it the first, last, or some sort of average across all the targets
It's the average of all targets.

Related

ROC curve for multi-class classification without one vs all in python

I have a multi-class classification problem with 9 different classes. I am using the AdaBoostClassifier class from scikit-learn to train my model without using the one vs all technique, as the number of classes is very high and it might be inefficient.
I have tried using the tips from the documentation in scikit learn [1], but there the one vs all technique is used, which is substantially different. In my approach I only get one prediction per event, i.e. if I have n classes, the outcome of the prediction is a single value within the n classes. For the one vs all approach, on the other hand, the outcome of the prediction is an array of size n with a sort of likelihood value per class.
[1]
https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html#sphx-glr-auto-examples-model-selection-plot-roc-py
The code is:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt # Matplotlib plotting library for basic visualisation
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve, auc
from sklearn import preprocessing
# Read data
df = pd.read_pickle('data.pkl')
# Create the dependent variable class
# This will substitute each of the n classes from
# text to number
factor = pd.factorize(df['target_var'])
df.target_var= factor[0]
definitions = factor[1]
X = df.drop('target_var', axis=1)
y = df['target_var]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
bdt_clf = AdaBoostClassifier(
DecisionTreeClassifier(max_depth=2),
n_estimators=250,
learning_rate=0.3)
bdt_clf.fit(X_train, y_train)
y_pred = bdt_clf.predict(X_test)
#Reverse factorize (converting y_pred from 0s,1s, 2s, etc. to their original values
reversefactor = dict(zip(range(9),definitions))
y_test_rev = np.vectorize(reversefactor.get)(y_test)
y_pred_rev = np.vectorize(reversefactor.get)(y_pred)
I tried directly with the roc curve function, and also binarising the labels, but I always get the same error message.
def multiclass_roc_auc(y_test, y_pred):
lb = preprocessing.LabelBinarizer()
lb.fit(y_test)
y_test = lb.transform(y_test)
y_pred = lb.transform(y_pred)
return roc_curve(y_test, y_pred)
multiclass_roc_auc(y_test, y_pred_test)
The error message is:
ValueError: multilabel-indicator format is not supported
How could this be sorted out? Am I missing some important concept?

Why is Multi Class Machine Learning Model Giving Bad Results?

I have the following code so far:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn import preprocessing
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
df_train = pd.read_csv('uc_data_train.csv')
del df_train['Unnamed: 0']
temp = df_train['size_womenswear']
del df_train['size_womenswear']
df_train['size_womenswear'] = temp
df_train['count'] = 1
print(df_train.head())
print(df_train.dtypes)
print(df_train[['size_womenswear', 'count']].groupby('size_womenswear').count()) # Determine number of unique catagories, and number of cases for each catagory
del df_train['count']
df_test = pd.read_csv('uc_data_test.csv')
del df_test['Unnamed: 0']
print(df_test.head())
print(df_test.dtypes)
df_train.drop(['customer_id','socioeconomic_status','brand','socioeconomic_desc','order_method',
'first_order_channel','days_since_first_order','total_number_of_orders', 'return_rate'], axis=1, inplace=True)
LE = preprocessing.LabelEncoder() # Create label encoder
df_train['size_womenswear'] = LE.fit_transform(np.ravel(df_train[['size_womenswear']]))
print(df_train.head())
print(df_train.dtypes)
x = df_train.iloc[:,np.arange(len(df_train.columns)-1)].values # Assign independent values
y = df_train.iloc[:,-1].values # and dependent values
xTrain, xTest, yTrain, yTest = train_test_split(x, y, test_size = 0.25, random_state = 0) # Testing on 75% of the data
model = GaussianNB()
model.fit(xTrain, yTrain)
yPredicted = model.predict(xTest)
#print(yPrediction)
print('Accuracy: ', accuracy_score(yTest, yPredicted))
I am not sure how to include the data that I am using but I am trying to predict the 'size_womenswear'. There are 8 different sizes that I have encoded to predict and I have moved this column to the end of the dataframe. so y is the dependent and x are the independent (all the other columns)
I am using a Gaussian Naive Bayes classifier to try and classify the 8 different sizes and then test on 25% of the data. The results are not very good.
I don't know why I am only getting an accuracy of 61% when I am working with 80,000 rows. I am very new to Machine Learning and would appreciate any assistance. Is there a better method that I could use in this case than Gaussian Naive Bayes?
can't comment, just throwing out some ideas;
Maybe you need to deal with class imbalance, and try other model that will fit the data better? try the xgboost or lightgbm package given good data they usually perform pretty good in general, but it really depends on the data.
Also the way you split train and test, does the resulting train and test data set has similar distribution for your Y? that's very important.
Last thing, for classification models the performance measurement can be a bit tricky, try some other measurement methods. F1 scores or try to draw a confusion matrix and see what your predictions vs Y looks like. perhaps your model is predicting everything to one
or just a few classes.

Can I show feature importance for MultiOutputClassifier?

I'm trying to recover the feature importance of a multioutput Classifier using a RandomForest.
The MultiOutput model does not show any problems:
import numpy as np
import pandas as pd
import sklearn
from sklearn.datasets import make_multilabel_classification
from sklearn.datasets import make_classification
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
## Generate data
x, y = make_multilabel_classification(n_samples=1000,
n_features=15,
n_labels = 5,
n_classes=3,
random_state=12,
allow_unlabeled = True)
x_train = x[:700,:]
x_test = x[701:,:]
y_train = y[:700,:]
y_test = y[701:,:]
## Generate model
forest = RandomForestClassifier(n_estimators = 100, random_state = 1)
multi_forest = MultiOutputClassifier(forest, n_jobs = -1).fit(x_train, y_train)
## Make prediction
dfOutput_multi_forest = multi_forest.predict_proba(x_test)
The prediction dfOutput_multi_forest does not show any problems, but I want to recover the feature importance of the multi_forest for interpretation of the output.
Using multi_forest.feature_importance_ throws the following error message:
AttributeError: 'MultiOutputClassifier' object has no attribute 'feature_importance_'
Does anyone know how to retrieve the feature importance?
I'm using scikit v0.20.2
Indeed, it doesn't appear that Sklearn's MultiOutputClassifier has an attribute that contains some sort of amalgamation of the feature importances of all the estimators (in your case, all the RandomForest classifiers) used in the model.
However, it is possible to access the feature importances of each RandomForest classifier, and then average them all together to give you each feature's average importance, across all RandomForest classifiers.
MultiOutputClassifier objects have an attribute called estimators_. If you run multi_forest.estimators_, you will get a list containing an object for each of your RandomForest classifiers.
For each of these RandomForest classifier objects, you can access its feature importances through the feature_importances_ attribute.
To put it all together, this was my approach:
feat_impts = []
for clf in multi_forest.estimators_:
feat_impts.append(clf.feature_importances_)
np.mean(feat_impts, axis=0)
I ran the example code you pasted into your question, and then ran the above block of code to output a list of the following 15 averages:
array([0.09830467, 0.0912088 , 0.05738045, 0.1211305 , 0.03901933,
0.05429491, 0.06929378, 0.06404416, 0.05676634, 0.04919717,
0.05244265, 0.0509295 , 0.05615341, 0.09202444, 0.04780991])
Which contains the average importance of each of your 15 features, across each of the 3 random forest classifiers used in your MultiOutputClassifier.
This should at least help you to see which features, on the whole, tended to be more important in making predictions for each of your 3 classes.

Logistic regression sklearn - train and apply model

I'm new to machine learning and trying Sklearn for the first time. I have two dataframes, one with data to train a logistic regression model (with 10-fold cross-validation) and another one to predict classes ('0,1') using that model.
Here's my code so far using bits of tutorials I found on Sklearn docs and on the Web:
import pandas as pd
import numpy as np
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.preprocessing import normalize
from sklearn.preprocessing import scale
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn import metrics
# Import dataframe with training data
df = pd.read_csv('summary_44.csv')
cols = df.columns.drop('num_class') # Data to use (num_class is the column with the classes)
# Import dataframe with data to predict
df_pred = pd.read_csv('new_predictions.csv')
# Scores
df_data = df.ix[:,:-1].values
# Target
df_target = df.ix[:,-1].values
# Values to predict
df_test = df_pred.ix[:,:-1].values
# Scores' names
df_data_names = cols.values
# Scaling
X, X_pred, y = scale(df_data), scale(df_test), df_target
# Define number of folds
kf = KFold(n_splits=10)
kf.get_n_splits(X) # returns the number of splitting iterations in the cross-validator
# Logistic regression normalizing variables
LogReg = LogisticRegression()
# 10-fold cross-validation
scores = [LogReg.fit(X[train], y[train]).score(X[test], y[test]) for train, test in kf.split(X)]
print scores
# Predict new
novel = LogReg.predict(X_pred)
Is this the correct way to implement a Logistic Regression?
I know that the fit() method should be used after cross-validation in order to train the model and use it for predictions. However, since I called fit() inside a list comprehension I really don't know if my model was "fitted" and can be used to make predictions.
I general things are okay, but there are some problems.
Scaling
X, X_pred, y = scale(df_data), scale(df_test), df_target
You scale training and test data independently, which isn't correct. Both datasets must be scaled with the same scaler. "Scale" is a simple function, but it is better to use something else, for example StandardScaler.
scaler = StandardScaler()
scaler.fit(df_data)
X = scaler.transform(df_data)
X_pred = scaler.transform(df_test)
Cross-validation and predicting.
How your code works? You split data 10 times into train and hold-out set; 10 times fit model on train set and calculate score on hold-out set. This way you get cross-validation scores, but the model is fitted only on a part of data. So it would be better to fit model on the whole dataset and then make a prediction:
LogReg.fit(X, y)
novel = LogReg.predict(X_pred)
I want to notice that there are advanced technics like stacking and boosting, but if you learn using sklearn, then it is better to stick to the basics.

How can I create a Linear Regression Model from a split dataset?

I've just split my data into a training and testing set and my plan is to train a Linear Regression model and be able to check what the performance is like using my testing split.
My current code is:
import pandas as pd
import numpy as np
from sklearn import datasets, linear_model
import matplotlib.pyplot as plt
df = pd.read_csv('C:/Dataset.csv')
df['split'] = np.random.randn(df.shape[0], 1)
split = np.random.rand(len(df)) <= 0.75
training_set = df[split]
testing_set = df[~split]
Is there a proper method I should be using to plot a Linear Regression model from an external file such as a .csv?
Since you want to use scikit-learn, here's an approach using sklearn.linear_model.LinearRegression:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
X_train, y_train = training_set[x_vars], training_set[y_var]
X_test, y_test = testing_test[x_vars], testing_test[y_var]
model.fit(X_train, y_train)
predictions = model.predict(X_test)
Depending on whether you need more descriptive output, you might also look into use statsmodels for linear regression.

Categories