When I plot the feature importance, I get this messy plot. I have more than 7000 variables. I understand that the built-in function only selects the most important features, yet the final graph is unreadable.
This is the complete code:
import numpy as np
import pandas as pd
df = pd.read_csv('ricerice.csv')
array=df.values
X = array[:, 0:7803]  # feature columns
Y = array[:, 7803]    # label column
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
seed=0
test_size=0.30
X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size=test_size, random_state=seed)
model = XGBClassifier()
model.fit(X, Y)
import matplotlib.pyplot as plt
from xgboost import plot_importance
fig1 = plt.figure(figsize=(50, 40))  # figsize is set on the figure, not in savefig
plot_importance(model, ax=plt.gca())
plt.draw()
fig1.savefig('xgboost.png', dpi=1000)
Despite the large figure size, the graph is illegible.
There are a couple of points:
To fit the model, you want to use the training dataset (X_train, y_train), not the entire dataset (X, y).
You may use the max_num_features parameter of the plot_importance() function to display only the top max_num_features features (e.g. the top 10).
With these modifications and some randomly generated data for demonstration, the code looks like this:
import numpy as np
import matplotlib.pyplot as plt
from xgboost import XGBClassifier, plot_importance
from sklearn.model_selection import train_test_split
# generate some random data for demonstration purposes; use your original dataset here
X = np.random.rand(1000, 100)     # 1000 samples x 100 features
y = np.random.rand(1000).round()  # 0/1 labels
seed = 0
test_size = 0.30
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed)
model = XGBClassifier()
model.fit(X_train, y_train)  # fit on the training split only
plot_importance(model, max_num_features=10)  # top 10 most important features
plt.show()
You need to sort your feature importances in descending order first:
sorted_idx = trained_mdl.feature_importances_.argsort()[::-1]
Then just plot them with the column names from your dataframe:
from matplotlib import pyplot as plt
n_top_features = 10
sorted_idx = trained_mdl.feature_importances_.argsort()[::-1]
# X_test must be a DataFrame here so that .columns is available
plt.barh(X_test.columns[sorted_idx][:n_top_features],
         trained_mdl.feature_importances_[sorted_idx][:n_top_features])
You can obtain feature importance from an Xgboost model with the feature_importances_ attribute. In your case, it will be:
model.feature_importances_
This attribute is an array with the gain importance of each feature. Then you can plot it:
from matplotlib import pyplot as plt
plt.barh(feature_names, model.feature_importances_)
(feature_names is a list with the feature names)
You can sort the array and select the number of features you want (for example, the top 10):
import numpy as np
sorted_idx = model.feature_importances_.argsort()[::-1]  # sort importances in descending order
feature_names = np.array(feature_names)  # arrays support fancy indexing; plain lists do not
plt.barh(feature_names[sorted_idx][:10], model.feature_importances_[sorted_idx][:10])
plt.xlabel("Xgboost Feature Importance")
There are two more methods to get feature importance:
you can use permutation_importance from scikit-learn (from version 0.22); see the sketch below
you can use SHAP values
You can read more in this blog post of mine.
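For example, here is a minimal permutation-importance sketch (assuming the fitted model and the X_test/y_test split from the demo above):
from sklearn.inspection import permutation_importance
# shuffle each feature in turn and measure the drop in score on held-out data
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
sorted_idx = result.importances_mean.argsort()[::-1]
for i in sorted_idx[:10]:
    print(f"feature {i}: {result.importances_mean[i]:.4f} +/- {result.importances_std[i]:.4f}")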
I'm getting an error while running:
# Creating a confusion matrix
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test,predictions))
Error: NameError: name 'predictions' is not defined
The entire code is:
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.model_selection import train_test_split
# Importing the dataset
loans = pd.read_csv('loan_borowwer_data.csv')
loans.describe()
loans.isnull().sum()
loans.info()
plt.hist(loans['fico'],color='blue',edgecolor='white',bins=5)
plt.title('Histogram of fico')
plt.xlabel('fico')
plt.ylabel('frequency')
plt.show()
plt.figure(figsize=(10,6))
loans[loans['not.fully.paid']==1]['fico'].hist(alpha=0.5,color='blue', bins=30,label='not.fully.paid=1')
loans[loans['not.fully.paid']==0]['fico'].hist(alpha=0.5,color='red', bins=30,label='not.fully.paid=0')
plt.legend()
plt.xlabel('FICO')
#Create a list of elements containing the string purpose. Name this list cat_feats.
cat_feats = ['purpose']
final_data = pd.get_dummies(loans,columns=cat_feats,drop_first=True)
final_data.info()
final_data.head()
# Train-Test Split
X = final_data.drop('not.fully.paid',axis=1)
y = final_data['not.fully.paid']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=101)
# Training the Decision Tree Model
# Let us create a DecisionTreeClassifier from sklearn.tree.
from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier()
dtree.fit(X_train,y_train)
# Evaluating the Decision Tree
# Creating a confusion matrix
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test,predictions))
NameError: name 'predictions' is not defined
Please help
You are missing a step before creating the confusion matrix. After you declare the model and fit() it on the training data, you need to make a prediction on the test data. It would be something like predictions = dtree.predict(X_test), which creates the predicted y values for the X_test values. Once that is done, you can run the confusion matrix and it should produce the matrix you want. Hope this helps.
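In full, the missing step looks like this (using dtree and the split from the question):
# predict on the held-out test set before building the confusion matrix
predictions = dtree.predict(X_test)
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, predictions))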
I have fitted a model from which I'd like to know the scores (r-squared).
The data is split into a training and testing set. The model is only trained on the training set, so how is it possible that the r-squared for my testing data is higher? The model has never seen the testing set, yet it is more accurate there than on the training set... Am I interpreting something wrong?
My code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
df = pd.read_csv("https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/module_5_auto.csv")
df = df._get_numeric_data()
y_data = df['price']
x_data = df.drop('price', axis=1)
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.15, random_state=1)
lr = LinearRegression()
lr.fit(x_train[['horsepower']], y_train)
h = lr.score(x_train[['horsepower']], y_train)   # R-squared on the training set
h2 = lr.score(x_test[['horsepower']], y_test)    # R-squared on the test set
print(h, h2)
To put it simply, R-squared measures how well a model's predictions match the observed values: it is the proportion of the variance in the target that the model explains.
Formula: R^2 = 1 - (ss_res / ss_tot), where ss_res = sum((y - y_hat)^2) and ss_tot = sum((y - y_bar)^2), with y_bar the mean of y.
Note: squaring Pearson's r (or squaring the output of pandas corr()) gives slightly different results than the R^2 formula above; the two only coincide for a least-squares linear fit with an intercept. Refer to Max Pierini's answer:
SciKit Learn R-squared is very different from square of Pearson's Correlation R
Method 1: function
def r_squared(y, y_hat):
    y_bar = y.mean()
    ss_tot = ((y - y_bar)**2).sum()
    ss_res = ((y - y_hat)**2).sum()
    return 1 - (ss_res / ss_tot)
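A quick usage example with made-up numbers (any numpy arrays of matching shape work):
import numpy as np
y = np.array([3.0, 2.5, 4.0, 3.5])
y_hat = np.array([2.8, 2.7, 3.9, 3.6])
print(r_squared(y, y_hat))  # about 0.92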
Method 2: sklearn library
from sklearn.metrics import r2_score
r2 = r2_score(actual, predicted)
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html
It looks like you're using scikit-learn. If so, you can use the r2_score metric.
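As a minimal sketch tied back to the code above (assuming the fitted lr and the x_test/y_test split from the question):
from sklearn.metrics import r2_score
# score the model's predictions on the test split;
# for regressors this is equivalent to lr.score(...)
predicted = lr.predict(x_test[['horsepower']])
print(r2_score(y_test, predicted))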
I have a multi-class classification problem with 9 different classes. I am using the AdaBoostClassifier class from scikit-learn to train my model without the one-vs-all technique, as the number of classes is very high and it might be inefficient.
I have tried following the tips from the scikit-learn documentation [1], but there the one-vs-all technique is used, which is substantially different. In my approach I only get one prediction per event, i.e. if I have n classes, the outcome of the prediction is a single value among the n classes. For the one-vs-all approach, on the other hand, the outcome is an array of size n with a sort of likelihood value per class.
[1]
https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html#sphx-glr-auto-examples-model-selection-plot-roc-py
The code is:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt # Matplotlib plotting library for basic visualisation
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve, auc
from sklearn import preprocessing
# Read data
df = pd.read_pickle('data.pkl')
# Create the dependent variable class
# This will substitute each of the n classes from
# text to number
factor = pd.factorize(df['target_var'])
df.target_var = factor[0]
definitions = factor[1]
X = df.drop('target_var', axis=1)
y = df['target_var']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
bdt_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=2),
    n_estimators=250,
    learning_rate=0.3)
bdt_clf.fit(X_train, y_train)
y_pred = bdt_clf.predict(X_test)
# Reverse factorize (converting y_pred from 0s, 1s, 2s, etc. back to their original values)
reversefactor = dict(zip(range(9),definitions))
y_test_rev = np.vectorize(reversefactor.get)(y_test)
y_pred_rev = np.vectorize(reversefactor.get)(y_pred)
I tried calling the roc_curve function directly, and also binarising the labels first, but I always get the same error message.
def multiclass_roc_auc(y_test, y_pred):
    lb = preprocessing.LabelBinarizer()
    lb.fit(y_test)
    y_test = lb.transform(y_test)
    y_pred = lb.transform(y_pred)
    return roc_curve(y_test, y_pred)

multiclass_roc_auc(y_test, y_pred)
The error message is:
ValueError: multilabel-indicator format is not supported
How could this be sorted out? Am I missing some important concept?
I want to estimate the model from the data I've used here in scikit-learn. I am using the DecisionTreeClassifier.score function, but when running the code I receive a ValueError:
Can't handle mix of continuous and multiclass.
Here is the code I use:
from sklearn import datasets
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
nba = pd.read_excel(r"C:\Users\user\Desktop\nba.xlsx")
X = nba.drop('平均得分', axis = 1)
y = nba['平均得分']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
nba_tree = DecisionTreeClassifier()
nba_tree.fit(X_train, y_train.astype('int'))
y_pred = nba_tree.predict(X_test)
nba_tree.score(X_test, y_test)
It looks like your target variable 平均得分 (average score) is a continuous variable. Probably you are trying to solve a regression problem. If that is the case, try DecisionTreeRegressor instead of DecisionTreeClassifier.
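A minimal sketch of the regression version, reusing the split from the question:
from sklearn.tree import DecisionTreeRegressor
# a regressor handles a continuous target directly, so no .astype('int') cast is needed
nba_tree = DecisionTreeRegressor()
nba_tree.fit(X_train, y_train)
print(nba_tree.score(X_test, y_test))  # returns R-squared for regressors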
How can I show the important features that contribute to the SVM model along with the feature name?
My code is shown below.
First I imported the modules:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix, classification_report
Then I divided my data into features and target:
y = df_new[['numeric values']]
X = df_new.drop('numeric values', axis=1).values
Then I set up the pipeline:
steps = [('scaler', StandardScaler()),
         ('SVM', SVC(kernel='linear'))]
pipeline = Pipeline(steps)
Then I specified the hyperparameter space:
parameters = {'SVM__C': [1, 10, 100],
              'SVM__gamma': [0.1, 0.01]}
Then I created the train and test sets:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2, random_state=21)
Then I instantiated the GridSearchCV object, cv:
cv = GridSearchCV(pipeline,param_grid = parameters,cv=5)
Fitted it to the training set:
cv.fit(X_train,y_train.values.ravel())
And predicted the labels of the test set, y_pred:
y_pred = cv.predict(X_test)
feature_importances = cv.best_estimator_.feature_importances_
The error message I get is:
'Pipeline' object has no attribute 'feature_importances_'
If I understood correctly, suppose you are building a model with 100 features and you want to know which features are more important and which are less so. Is that the case?
Then try the univariate feature selection method. It is a very basic method, and you can play with it before moving on to more advanced methods for your data. Sample code is provided by scikit-learn itself; you can modify it to fit your requirements.
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets, svm
from sklearn.feature_selection import SelectPercentile, f_classif
###############################################################################
# import some data to play with
# The iris dataset
iris = datasets.load_iris()
# Some noisy data not correlated
E = np.random.uniform(0, 0.1, size=(len(iris.data), 20))
# Add the noisy data to the informative features
X = np.hstack((iris.data, E))
y = iris.target
###############################################################################
plt.figure(1)
plt.clf()
X_indices = np.arange(X.shape[-1])
###############################################################################
# Univariate feature selection with F-test for feature scoring
# We use the default selection function: the 10% most significant features
selector = SelectPercentile(f_classif, percentile=10)
selector.fit(X, y)
scores = -np.log10(selector.pvalues_)
scores /= scores.max()
plt.bar(X_indices - .45, scores, width=.2,
        label=r'Univariate score ($-Log(p_{value})$)', color='g')
###############################################################################
# Compare to the weights of an SVM
clf = svm.SVC(kernel='linear')
clf.fit(X, y)
svm_weights = (clf.coef_ ** 2).sum(axis=0)
svm_weights /= svm_weights.max()
plt.bar(X_indices - .25, svm_weights, width=.2, label='SVM weight', color='r')
clf_selected = svm.SVC(kernel='linear')
clf_selected.fit(selector.transform(X), y)
svm_weights_selected = (clf_selected.coef_ ** 2).sum(axis=0)
svm_weights_selected /= svm_weights_selected.max()
plt.bar(X_indices[selector.get_support()] - .05, svm_weights_selected,
        width=.2, label='SVM weights after selection', color='b')
plt.title("Comparing feature selection")
plt.xlabel('Feature number')
plt.yticks(())
plt.axis('tight')
plt.legend(loc='upper right')
plt.show()
Code ref.
http://scikit-learn.org/0.15/auto_examples/plot_feature_selection.html
Note:
For each feature, this plots the p-value from the univariate feature selection alongside the corresponding SVM weight, so you can see which features show both a significant univariate score and a large SVM weight.
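As a side note for the original pipeline: a linear SVC exposes per-feature weights through coef_, so a sketch along these lines can rank features by weight (assuming the fitted cv and the 'SVM' step name from the question; feature_names is a hypothetical list of your column names, e.g. taken from df_new):
import numpy as np
# pull the fitted linear SVC out of the best pipeline found by the grid search
svm_step = cv.best_estimator_.named_steps['SVM']
weights = (svm_step.coef_ ** 2).sum(axis=0)  # same squared-weight ranking as in the example above
for i in np.argsort(weights)[::-1][:10]:
    print(feature_names[i], weights[i])  # feature_names is assumed, not from the original code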