I'm trying to recover the feature importance of a multioutput Classifier using a RandomForest.
The MultiOutput model does not show any problems:
import numpy as np
import pandas as pd
import sklearn
from sklearn.datasets import make_multilabel_classification
from sklearn.datasets import make_classification
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
## Generate data
x, y = make_multilabel_classification(n_samples=1000,
n_features=15,
n_labels = 5,
n_classes=3,
random_state=12,
allow_unlabeled = True)
x_train = x[:700,:]
x_test = x[700:,:]
y_train = y[:700,:]
y_test = y[700:,:]
## Generate model
forest = RandomForestClassifier(n_estimators = 100, random_state = 1)
multi_forest = MultiOutputClassifier(forest, n_jobs = -1).fit(x_train, y_train)
## Make prediction
dfOutput_multi_forest = multi_forest.predict_proba(x_test)
The prediction dfOutput_multi_forest does not show any problems, but I want to recover the feature importance of the multi_forest for interpretation of the output.
Using multi_forest.feature_importance_ throws the following error message:
AttributeError: 'MultiOutputClassifier' object has no attribute 'feature_importance_'
Does anyone know how to retrieve the feature importance?
I'm using scikit-learn v0.20.2
Indeed, it doesn't appear that Sklearn's MultiOutputClassifier has an attribute that contains some sort of amalgamation of the feature importances of all the estimators (in your case, all the RandomForest classifiers) used in the model.
However, it is possible to access the feature importances of each RandomForest classifier, and then average them all together to give you each feature's average importance, across all RandomForest classifiers.
MultiOutputClassifier objects have an attribute called estimators_. If you run multi_forest.estimators_, you will get a list containing an object for each of your RandomForest classifiers.
For each of these RandomForest classifier objects, you can access its feature importances through the feature_importances_ attribute.
To put it all together, this was my approach:
feat_impts = []
for clf in multi_forest.estimators_:
    feat_impts.append(clf.feature_importances_)

np.mean(feat_impts, axis=0)
I ran the example code you pasted into your question, and then ran the above block of code to output a list of the following 15 averages:
array([0.09830467, 0.0912088 , 0.05738045, 0.1211305 , 0.03901933,
0.05429491, 0.06929378, 0.06404416, 0.05676634, 0.04919717,
0.05244265, 0.0509295 , 0.05615341, 0.09202444, 0.04780991])
This array contains the average importance of each of your 15 features, across the 3 random forest classifiers used in your MultiOutputClassifier.
This should at least help you to see which features, on the whole, tended to be more important in making predictions for each of your 3 classes.
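Averaging hides differences between the outputs, so it can also help to keep one row of importances per label. The following is a minimal sketch on synthetic data, not the asker's exact setup; the names X_demo, Y_demo, per_label and ranking are introduced here purely for illustration.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

# Synthetic multilabel data (an assumption, used only to make the sketch runnable)
X_demo, Y_demo = make_multilabel_classification(
    n_samples=300, n_features=15, n_classes=3, n_labels=2, random_state=12
)

model = MultiOutputClassifier(
    RandomForestClassifier(n_estimators=50, random_state=1)
).fit(X_demo, Y_demo)

# One row per output label, one column per feature
per_label = np.vstack([est.feature_importances_ for est in model.estimators_])

# Average across labels, as in the answer above
mean_importance = per_label.mean(axis=0)

# Feature indices sorted from most to least important on average
ranking = np.argsort(mean_importance)[::-1]

print(per_label.shape)  # (3, 15)
print(ranking[:5])
```

Comparing the rows of per_label shows whether a feature matters for every label or only for one of them, which the plain average cannot tell you.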
Beginning in xgboost version 1.6, you can now run multioutput models directly. In the past, I had been using the scikit-learn wrapper MultiOutputRegressor around an XGBRegressor estimator. I could then access each individual model's feature importances with something like wrapper.estimators_[i].feature_importances_
Now, however, when I run feature_importances_ on a multioutput XGBRegressor model, I only get one set of feature importances even though I have more than one target. Any idea what this array of feature importances actually represents? Is it the first, last, or some sort of average across all the targets? Is this a function that is perhaps just not ready to handle multioutput?
*Realize questions are always easier to answer when you have some code to test:
import numpy as np
from sklearn import datasets
import xgboost as xgb
from sklearn.multioutput import MultiOutputRegressor
from numpy.testing import assert_almost_equal
n_targets = 3
X, y = datasets.make_regression(n_targets=n_targets)
X_train, y_train = X[:50], y[:50]
X_test, y_test = X[50:], y[50:]
single_run_features = {}
references = np.zeros_like(y_test)
for n in range(0,n_targets):
    xgb_indi_model = xgb.XGBRegressor(random_state=0)
    xgb_indi_model.fit(X_train, y_train[:, n])
    references[:,n] = xgb_indi_model.predict(X_test)
    single_run_features[n] = xgb_indi_model.feature_importances_
xgb_multi_model = xgb.XGBRegressor(random_state=0)
xgb_multi_model.fit(X_train, y_train)
y__multi_pred = xgb_multi_model.predict(X_test)
xgb_scikit_model = MultiOutputRegressor(xgb.XGBRegressor(random_state=0))
xgb_scikit_model.fit(X_train, y_train)
y_pred = xgb_scikit_model.predict(X_test)
print(assert_almost_equal(references, y_pred))
print(assert_almost_equal(y__multi_pred, y_pred))
scikit_features = {}
for i in range(0,n_targets):
    scikit_features[i] = xgb_scikit_model.estimators_[i].feature_importances_
xgb_multi_model_features = xgb_multi_model.feature_importances_
single_run_features
scikit_features
The feature importances match for the looped single-target models, single_run_features, and the MultiOutputRegressor version, scikit_features. The issue is the result in xgb_multi_model_features. Any suggestions?
Is it the first, last, or some sort of average across all the targets
It's the average of all targets.
I am trying to run an artificial neural network with scikit-learn.
I want to run the regression, get the model fit results, and generate out-of-sample forecasts.
This is my code below. Any help will be greatly appreciated.
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
from sklearn.neural_network import MLPClassifier
from sklearn.neural_network import MLPRegressor
#import the data
df=pd.read_excel(r"C:\Users\Action\Downloads\Python\Practice_Data\sorted_data v2.xlsx")
#view the data
df.head(5)
#to drop a column of data type
df2=df.drop('Unnamed: 13', axis=1)
#view the data
df2.head(5)
Import necessary modules
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from math import sqrt
from sklearn.metrics import r2_score
describe the data
df.describe().transpose()
target_column = ['public health care services']
predictors = list(set(list(df.columns))-set(target_column))
df[predictors] = df[predictors]/df[predictors].max()
df.describe().transpose()
set the X and Y
X = df[predictors].values
y = df[target_column].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=40)
print(X_train.shape); print(X_test.shape)
import MLP Classifier and fit the network
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(hidden_layer_sizes=(8,8,8), activation='relu', solver='adam', max_iter=500)
mlp.fit(X_train,y_train)
predict_train = mlp.predict(X_train)
set up the MLP Classifier
mlp = MLPClassifier(
hidden_layer_sizes=(50, 8),
max_iter=15,
alpha=1e-4,
solver="sgd",
verbose=True,
random_state=1,
learning_rate_init=0.1)
import the warnings
import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings("ignore", category=ConvergenceWarning, module="sklearn")
predict_test = mlp.predict(X_test)
to train on the data I use the MLPClassifier to call the fit function on the training data.
mlp.fit(X_train, y_train)
after this, the neural network is done training.
after the neural network is trained, the next step is to test it.
print out the model scores
print(f"Training set score: {mlp.score(X_train, y_train)}")
print(f"Test set score: {mlp.score(X_test, y_test)}")
y_predict = mlp.predict(X_train)
I am getting an error from below
x_ann = y_predict[:, 0]
y_ann = y_predict[:, 1]
The error message is
IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed
any help will be greatly appreciated
The predict() function returns the predicted class label for each sample. Since each sample can belong to one and only one class (except in the multilabel case), the result is a 1-dimensional array, which is why indexing it with two indices raises the IndexError.
What is the shape of your true labels? It may be that your labels have just 2 classes, i.e. 0 and 1. The model minimises log loss, as the documentation describes:
This model optimizes the log-loss function using LBFGS or stochastic gradient descent.
Also, the documentation for predict_log_proba() says:
log_y_prob : ndarray of shape (n_samples, n_classes)
The predicted log-probability of the sample for each class in the model, where classes are ordered as they are in self.classes_. Equivalent to log(predict_proba(X))
So, ASSUMING binary classification with a threshold of 0.5, a predicted probability of 0.3 for class 1 means the sample is assigned to class 0, and 0.7 means it is assigned to class 1.
You may be confusing predict() with predict_proba(), which gives you the probabilities for each class.
Please post the shape and type of your X and y data so that we can understand the problem better.
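To make the difference between the two methods concrete, here is a minimal sketch on synthetic binary data. It is an assumption standing in for the asker's spreadsheet, and the names X_demo, y_demo and clf are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic binary data (an assumption; replaces the asker's Excel file)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(8, 8), max_iter=500, random_state=0)
clf.fit(X_demo, y_demo)

labels = clf.predict(X_demo)       # one class label per sample
proba = clf.predict_proba(X_demo)  # one column per class

print(labels.shape)  # (200,)
print(proba.shape)   # (200, 2)

# Two-index slicing works on the probability matrix, not on the labels
p_class0 = proba[:, 0]
p_class1 = proba[:, 1]

# predict() corresponds to picking the most probable class
recovered = clf.classes_[np.argmax(proba, axis=1)]
```

Slicing labels[:, 0] would reproduce the IndexError from the question, because labels has only one dimension.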
I used SHAP to explain my RF
RF_best_parameters = RandomForestRegressor(random_state=24, n_estimators=100)
RF_best_parameters.fit(X_train, y_train.values.ravel())
shap_explainer_model = shap.TreeExplainer(RF_best_parameters)
The TreeExplainer class has an attribute expected_value.
My first guess was that this field is the mean of the predicted y over X_train (I also read this here)
But it is not.
The output of the command:
shap_explainer_model.expected_value
is 0.2381.
The output of the command:
RF_best_parameters.predict(X_train).mean()
is 0.2389.
As we can see, the values are not the same.
So what is the meaning of the expected_value here?
This is due to a peculiarity of the method when used with the Random Forest algorithm; quoting from the response in the relevant Github thread shap explainer expected_value is different from model expected value:
It is because of how sklearn records the training samples in the tree models it builds. Random forests use a random subsample of the data to train each tree, and it is that random subsample that is used in sklearn to record the leaf sample weights in the model. Since TreeExplainer uses the recorded leaf sample weights to represent the training dataset, it will depend on the random sampling used during training. This will cause small variations like the ones you are seeing.
We can actually verify that this behavior is not present with other algorithms, say Gradient Boosting Trees:
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
import numpy as np
import shap
shap.__version__
# 0.37.0
X, y = make_regression(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
gbt = GradientBoostingRegressor(random_state=0)
gbt.fit(X_train, y_train)
mean_pred_gbt = np.mean(gbt.predict(X_train))
mean_pred_gbt
# -11.534353657511172
gbt_explainer = shap.TreeExplainer(gbt)
gbt_explainer.expected_value
# array([-11.53435366])
np.isclose(mean_pred_gbt, gbt_explainer.expected_value)
# array([ True])
But for RF, we get indeed a "small variation" as mentioned by the main SHAP developer in the thread above:
rf = RandomForestRegressor(random_state=0)
rf.fit(X_train, y_train)
rf_explainer = shap.TreeExplainer(rf)
rf_explainer.expected_value
# array([-11.59166808])
mean_pred_rf = np.mean(rf.predict(X_train))
mean_pred_rf
# -11.280125877556388
np.isclose(mean_pred_rf, rf_explainer.expected_value)
# array([False])
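The quoted explanation can also be explored with scikit-learn alone. The sketch below assumes that the path-dependent baseline corresponds to the average, over trees, of each tree's leaf values weighted by the leaf sample weights recorded at training time. That is my reading of the quote, not a verified description of shap internals, and shap itself is not called here. The helper leaf_weighted_mean and the other names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X_demo, y_demo = make_regression(n_samples=800, n_features=10, random_state=0)
rf_demo = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_demo, y_demo)

def leaf_weighted_mean(tree):
    """Mean leaf value, weighted by the sample weights recorded in the leaves."""
    leaves = tree.children_left == -1
    return np.average(
        tree.value[leaves, 0, 0],
        weights=tree.weighted_n_node_samples[leaves],
    )

# Per-tree baseline derived from the recorded (bootstrap) leaf weights
per_tree = np.array([leaf_weighted_mean(est.tree_) for est in rf_demo.estimators_])

# The same quantity read from the root node of each tree
root_values = np.array([est.tree_.value[0, 0, 0] for est in rf_demo.estimators_])

bootstrap_baseline = per_tree.mean()
mean_prediction = rf_demo.predict(X_demo).mean()

print(bootstrap_baseline, mean_prediction)
```

Each tree only records the weights of its own bootstrap sample, so the baseline built from those weights generally differs slightly from the mean prediction over the full training set.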
Just try:
shap_explainer_model = shap.TreeExplainer(RF_best_parameters, data=X_train, feature_perturbation="interventional", model_output="raw")
Then the shap_explainer_model.expected_value should give you the mean prediction of your model on train data.
Otherwise, TreeExplainer uses feature_perturbation="tree_path_dependent"; according to the documentation:
The “tree_path_dependent” approach is to just follow the trees and use the number of training examples that went down each leaf to represent the background distribution. This approach does not require a background dataset and so is used by default when no background dataset is provided.
I have the following code so far:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn import preprocessing
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
df_train = pd.read_csv('uc_data_train.csv')
del df_train['Unnamed: 0']
temp = df_train['size_womenswear']
del df_train['size_womenswear']
df_train['size_womenswear'] = temp
df_train['count'] = 1
print(df_train.head())
print(df_train.dtypes)
print(df_train[['size_womenswear', 'count']].groupby('size_womenswear').count()) # Determine number of unique categories, and number of cases for each category
del df_train['count']
df_test = pd.read_csv('uc_data_test.csv')
del df_test['Unnamed: 0']
print(df_test.head())
print(df_test.dtypes)
df_train.drop(['customer_id','socioeconomic_status','brand','socioeconomic_desc','order_method',
'first_order_channel','days_since_first_order','total_number_of_orders', 'return_rate'], axis=1, inplace=True)
LE = preprocessing.LabelEncoder() # Create label encoder
df_train['size_womenswear'] = LE.fit_transform(np.ravel(df_train[['size_womenswear']]))
print(df_train.head())
print(df_train.dtypes)
x = df_train.iloc[:,np.arange(len(df_train.columns)-1)].values # Assign independent values
y = df_train.iloc[:,-1].values # and dependent values
xTrain, xTest, yTrain, yTest = train_test_split(x, y, test_size = 0.25, random_state = 0) # Train on 75% of the data, test on 25%
model = GaussianNB()
model.fit(xTrain, yTrain)
yPredicted = model.predict(xTest)
#print(yPrediction)
print('Accuracy: ', accuracy_score(yTest, yPredicted))
I am not sure how to include the data that I am using but I am trying to predict the 'size_womenswear'. There are 8 different sizes that I have encoded to predict and I have moved this column to the end of the dataframe. so y is the dependent and x are the independent (all the other columns)
I am using a Gaussian Naive Bayes classifier to try and classify the 8 different sizes and then test on 25% of the data. The results are not very good.
I don't know why I am only getting an accuracy of 61% when I am working with 80,000 rows. I am very new to Machine Learning and would appreciate any assistance. Is there a better method that I could use in this case than Gaussian Naive Bayes?
I can't comment, so I'm just throwing out some ideas.
Maybe you need to deal with class imbalance and try another model that fits the data better. Try the xgboost or lightgbm packages; given good data they usually perform well, but it really depends on the data.
Also, with the way you split train and test, do the resulting train and test sets have a similar distribution of y? That's very important.
Last thing: for classification models, measuring performance can be tricky, so try some other metrics. Compute F1 scores, or draw a confusion matrix and see how your predictions compare with y. Perhaps your model is assigning everything to one or just a few classes.
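The checks suggested above can be sketched as follows. The data is synthetic and deliberately imbalanced (an assumption, since the asker's CSV is not available), and the names X_demo, y_demo, nb, cm and macro_f1 are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Imbalanced 4-class toy data (an assumption standing in for the 8 sizes)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, n_informative=6, n_classes=4,
    weights=[0.6, 0.2, 0.15, 0.05], random_state=0,
)

# stratify=y_demo keeps the class proportions similar in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

nb = GaussianNB().fit(X_tr, y_tr)
y_hat = nb.predict(X_te)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, y_hat, labels=np.arange(4))

# Macro F1 weights every class equally, so rare classes are not drowned out
macro_f1 = f1_score(y_te, y_hat, average="macro")

print(cm)
print(macro_f1)
```

A confusion matrix whose mass sits in one or two columns would indicate that the model is collapsing onto the majority classes, which plain accuracy can hide.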
I'm new to machine learning and trying Sklearn for the first time. I have two dataframes, one with data to train a logistic regression model (with 10-fold cross-validation) and another one to predict classes ('0,1') using that model.
Here's my code so far using bits of tutorials I found on Sklearn docs and on the Web:
import pandas as pd
import numpy as np
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.preprocessing import normalize
from sklearn.preprocessing import scale
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn import metrics
# Import dataframe with training data
df = pd.read_csv('summary_44.csv')
cols = df.columns.drop('num_class') # Data to use (num_class is the column with the classes)
# Import dataframe with data to predict
df_pred = pd.read_csv('new_predictions.csv')
# Scores
df_data = df.iloc[:, :-1].values
# Target
df_target = df.iloc[:, -1].values
# Values to predict
df_test = df_pred.iloc[:, :-1].values
# Scores' names
df_data_names = cols.values
# Scaling
X, X_pred, y = scale(df_data), scale(df_test), df_target
# Define number of folds
kf = KFold(n_splits=10)
kf.get_n_splits(X) # returns the number of splitting iterations in the cross-validator
# Logistic regression normalizing variables
LogReg = LogisticRegression()
# 10-fold cross-validation
scores = [LogReg.fit(X[train], y[train]).score(X[test], y[test]) for train, test in kf.split(X)]
print(scores)
# Predict new
novel = LogReg.predict(X_pred)
Is this the correct way to implement a Logistic Regression?
I know that the fit() method should be used after cross-validation in order to train the model and use it for predictions. However, since I called fit() inside a list comprehension I really don't know if my model was "fitted" and can be used to make predictions.
In general things are okay, but there are some problems.
Scaling
X, X_pred, y = scale(df_data), scale(df_test), df_target
You scale the training data and the data to predict independently, which isn't correct: both datasets must be scaled with the same fitted scaler. scale() is a simple function that does not keep the training statistics, so it is better to use a transformer such as StandardScaler.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(df_data)
X = scaler.transform(df_data)
X_pred = scaler.transform(df_test)
Cross-validation and predicting.
How does your code work? You split the data 10 times into a train set and a hold-out set, and 10 times you fit the model on the train set and calculate the score on the hold-out set. This way you get cross-validation scores, but the model is fitted on only a part of the data. So it would be better to fit the model on the whole dataset and then make a prediction:
LogReg.fit(X, y)
novel = LogReg.predict(X_pred)
I want to note that there are advanced techniques like stacking and boosting, but if you are learning sklearn, it is better to stick to the basics.
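One way to combine both points is a Pipeline, which refits the scaler inside every fold and then once more on the full data. This is a sketch on synthetic data rather than the asker's CSV files; the names X_known, y_known, X_new and pipe are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data: 200 labelled rows plus 20 rows to predict (an assumption)
X_all, y_all = make_classification(n_samples=220, n_features=6, random_state=0)
X_known, y_known = X_all[:200], y_all[:200]
X_new = X_all[200:]

# The scaler is fitted on the training part of each fold only
pipe = make_pipeline(StandardScaler(), LogisticRegression())

cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X_known, y_known, cv=cv)
print(scores.mean())

# Final model: fit on all labelled data, then predict the new rows
pipe.fit(X_known, y_known)
novel = pipe.predict(X_new)
```

cross_val_score works on clones of the pipeline, so the explicit pipe.fit call afterwards is what produces the model used for the final predictions.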