Find the most important features for an SVM classification


I'm training a binary classifier using Python and the SVM class from the popular scikit-learn module. After training, I use the predict method to make a classification, as laid out in scikit-learn's SVC documentation.
I would like to know more about how significant each of my sample features is to the classification produced by the trained decision_function (support vectors). Any strategies for evaluating feature significance when making predictions with such a model are welcome.
Thanks!
Andre

So, how do we interpret feature significance for a given sample's classification?
I think using a linear kernel is the most straightforward way to approach this first, because of the relative simplicity of the coef_ attribute of a trained model. Check out Bitwise's answer.
Below I train a linear-kernel SVM on one of scikit-learn's example data sets and then look at the coef_ attribute. I include a simple plot showing how the dot product of the classifier's coefficients and the training feature data divides the resulting classes.
from sklearn import svm
from sklearn.datasets import load_breast_cancer
import numpy as np
import matplotlib.pyplot as plt

data = load_breast_cancer()
X = data.data    # training features
y = data.target  # training labels (0 = malignant, 1 = benign)

lin_clf = svm.SVC(kernel='linear')
lin_clf.fit(X, y)

# linear score per sample: coefficients dotted with the features, plus the intercept
scores = (np.dot(X, lin_clf.coef_.T) + lin_clf.intercept_).ravel()

b0 = y == 0  # boolean or "mask" index arrays
b1 = y == 1
malignant_scores = scores[b0]
benign_scores = scores[b1]

fig = plt.figure()
fig.suptitle("score breakdown by classification", fontsize=14, fontweight='bold')
score_box_plt = plt.boxplot(
    [malignant_scores, benign_scores],
    notch=True,
    labels=list(data.target_names),
    vert=False
)
plt.show()
As you can see, we do seem to have accessed the appropriate intercept and coefficient values: the class scores are clearly separated, with the decision boundary hovering around 0.
Now that we have a scoring system based on our linear coefficients, we can easily investigate how each feature contributed to the final classification. Here we display each feature's effect on the final score of one sample.
# sample we're using: X[2] --> classified benign, lin_clf score ~ (-20)
lin_clf.predict(X[2].reshape(1, -1))
contributions = np.multiply(X[2], lin_clf.coef_.ravel())
feature_number = np.arange(len(contributions)) + 1
plt.bar(feature_number, contributions, align='center')
plt.xlabel('feature index')
plt.ylabel('score contribution')
plt.title('contribution to classification outcome by feature index')
plt.show()
We can also sort this same data to get a contribution-ranked list of features for a given classification, which shows which features contributed most to the score.
# indices ordered by absolute contribution, largest first
order = np.argsort(np.absolute(contributions))[::-1]
# sorted by max abs value. each row a tuple: (feature index, contrib)
feat_and_contrib = [(int(i), contributions[i]) for i in order]
feat_and_contrib
From that ranked list we can see that the top five feature indices that contributed to the final score (of around -20, along with a classification of 'benign') were [0, 22, 13, 2, 21], which correspond to these feature names in our data set: ['mean radius', 'worst perimeter', 'area error', 'mean perimeter', 'worst texture'].
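The same ranking can be packaged as a small helper that returns feature names instead of indices. This is only a sketch and not the answerer's code: the function name rank_contributions and the toy numbers are made up for illustration.

```python
import numpy as np

def rank_contributions(coef, x, feature_names):
    """Return (name, contribution) pairs sorted by absolute contribution."""
    coef = np.asarray(coef, dtype=float).ravel()
    x = np.asarray(x, dtype=float).ravel()
    contributions = coef * x  # per-feature share of the linear score
    order = np.argsort(np.abs(contributions))[::-1]  # largest magnitude first
    return [(feature_names[i], float(contributions[i])) for i in order]

# Toy example with made-up numbers (not taken from the breast cancer data)
coef = [[0.5, -2.0, 0.1]]
x = [4.0, 3.0, 10.0]
ranked = rank_contributions(coef, x, ["a", "b", "c"])
print(ranked)
```

With the fitted model from above, the call would presumably look like rank_contributions(lin_clf.coef_, X[2], data.feature_names).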

Suppose you have a bag-of-words featurization and want to know which words are important for classification. For a linear SVM you can use this code:
weights = np.abs(lr_svm.coef_[0])
sorted_index = np.argsort(weights)[::-1]
top_10 = sorted_index[:10]
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
terms = text_vectorizer.get_feature_names_out()
for ind in top_10:
    print(terms[ind])

You can use SelectFromModel in sklearn to get the names of the most relevant features of your model. Here is an example of extracting the features for LassoCV.
You can also check out this example, which uses the coef_ attribute of an SVM to visualize the top features.
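As a rough sketch of the SelectFromModel approach with a linear SVM (instead of LassoCV), assuming the breast cancer data set used earlier on this page and standardized features so the coefficients are comparable:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

data = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(data.data)  # put features on one scale

# For an L2-penalised model the default threshold is the mean of |coef_|
selector = SelectFromModel(LinearSVC(dual=False, random_state=0))
selector.fit(X_scaled, data.target)

mask = selector.get_support()           # boolean mask over the 30 features
selected_names = data.feature_names[mask]
print(selected_names)
```

The threshold parameter of SelectFromModel can be set explicitly (for example "median") if the default keeps too many or too few features.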

Related

Random Forest Regressor Feature Importance all zero

I'm running a random forest regressor using scikit learn, but all the predictions end up being the same.
I realized that when I fit the data, all the feature importance are zero which is probably why all the predictions are the same.
This is the code that I'm using:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
import pandas as pd
merged_df = pd.read_csv("/home/jovyan/efs/vliu/combined_data.csv")
target = merged_df["400kmDensity"]
merged_df.drop("400kmDensity", axis = 1, inplace = True)
features_list = list(merged_df.columns)
#Set training and testing groups
train_features, test_features, train_target, test_target = train_test_split(merged_df, target, random_state = 16)
#Train model
rf = RandomForestRegressor(n_estimators = 150, random_state = 16)
ran = rf.fit(train_features, train_target)
print("Feature importances: ", rf.feature_importances_)
#Make predictions and calculate error
predictions = ran.predict(test_features)
print("Predictions: ", predictions)
Here's a link to the data file:
https://drive.google.com/file/d/1ECgKAH82wxIvt2OCv4W5ir1te_Vr3N_r/view?usp=sharing
If anybody can see what I did wrong before fitting the data that would result in the feature importances all being zero, that would be much appreciated.

Both your variables "400kmDensity" and "410kmDensity" share a correlation coefficient of >0.99:
np.corrcoef(merged_df["400kmDensity"], merged_df["410kmDensity"])
This practically means that you can predict "400kmDensity" almost exclusively with "410kmDensity". On a scatter plot they form an almost perfect line:
In order to actually explore what affects the values of "400kmDensity", you should exclude "410kmDensity" as a regressor (an explanatory variable). The feature importance can help to identify explanatory variables afterward.
Note that the built-in feature importance may not be a perfect measure of how much a feature actually matters. You may want to look into other available methods such as the Boruta algorithm or permutation importance.
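As an illustration of permutation importance, here is a minimal sketch on synthetic data. The arrays below are made up and only stand in for the question's CSV; only the first column influences the target.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(16)
X_demo = rng.normal(size=(400, 3))  # synthetic stand-in features
y_demo = 3.0 * X_demo[:, 0] + 0.1 * rng.normal(size=400)  # only column 0 matters

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=16)
rf_demo = RandomForestRegressor(n_estimators=50, random_state=16).fit(X_tr, y_tr)

# Shuffle one column at a time on held-out data and record the drop in score
result = permutation_importance(rf_demo, X_te, y_te, n_repeats=10, random_state=16)
for i, (mean, std) in enumerate(zip(result.importances_mean, result.importances_std)):
    print(f"feature {i}: {mean:.3f} +/- {std:.3f}")
```

Because it is computed on held-out data, this measure reflects what the model needs for prediction rather than how often a feature was used during training.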
In regard to the initial question: I'm not really sure why, but RandomForestRegressor seems to have a problem with your very low target variable(?). I was able to get feature importances after I scaled train_target and train_features in rf.fit(). However, this should not actually be necessary at all in order to apply Random Forest! You may want to take a look at the respective documentation or extend your search in this direction. Hope this serves as a hint.
from sklearn.preprocessing import scale
fitted_rf = rf.fit(scale(train_features), scale(train_target))
As mentioned before, the feature importances after this change unsurprisingly look like this:
Also, the column "second" holds only the value zero, which does not explain anything! Your first step should always be EDA (Exploratory Data Analysis) to get a feeling for the data, like checking correlations between columns or generating histograms in order to explore data distributions [...].
There is much more to it, but I hope this gives you a leg-up!
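The two checks mentioned above (constant columns and features that nearly duplicate the target) can be sketched with pandas. The data frame below is synthetic and its column names are illustrative; the 0.99 cut-off is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the question's CSV; column names are illustrative only
rng = np.random.default_rng(0)
base = rng.normal(size=200)
df = pd.DataFrame({
    "target": base,
    "near_copy": base * 1.01 + rng.normal(scale=0.01, size=200),
    "noise": rng.normal(size=200),
    "second": np.zeros(200),
})

# Columns with a single distinct value carry no information
constant_cols = [c for c in df.columns if df[c].nunique() <= 1]

# Features that are almost perfectly correlated with the target
corr_with_target = df.drop(columns=constant_cols).corr()["target"].drop("target")
suspicious = corr_with_target[corr_with_target.abs() > 0.99].index.tolist()

print("constant columns:", constant_cols)
print("near-duplicates of target:", suspicious)
```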

Why do I get two different values in heatmap and feature_importances?

I'm running a feature selection using sns.heatmap and one using sklearn feature_importances.
When using the same data I get two different values.
Here is the heatmap
and heatmap code
from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
training_data = pd.read_csv(
"/Users/aus10/NFL/Data/Betting_Data/CBB/Training_Data_Betting_CBB.csv")
df_model = training_data.copy()
df_model = df_model.dropna()
df_model = df_model.drop(['Money_Line', 'Money_Line_Percentage', 'Money_Line_Money', 'Money_Line_Move', 'Money_Line_Direction', "Spread", 'Spread_Percentage', 'Spread_Money', 'Spread_Move', 'Spread_Direction',
"Win", "Money_Line_Percentage", 'Cover'], axis=1)
X = df_model.loc[:, ['Total', 'Total_Move', 'Over_Percentage', 'Over_Money',
'Under_Percentage', 'Under_Money']] # independent columns
y = df_model['Over_Under'] # target column
# get correlations of each features in dataset
corrmat = df_model.corr()
top_corr_features = corrmat.index
plt.figure(figsize=(20, 20))
# plot heat map
g = sns.heatmap(
df_model[top_corr_features].corr(), annot=True, cmap='hot')
plt.xticks(rotation=90)
plt.yticks(rotation=45)
plt.show()
Here is the feature_importances bar graph
and the code
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
from sklearn.model_selection import StratifiedKFold
from sklearn.inspection import permutation_importance
training_data = pd.read_csv(
"/Users/aus10/NFL/Data/Betting_Data/CBB/Training_Data_Betting_CBB.csv", index_col=False)
df_model = training_data.copy()
df_model = df_model.dropna()
X = df_model.loc[:, ['Total', 'Total_Move', 'Over_Percentage', 'Over_Money',
'Under_Percentage', 'Under_Money']] # independent columns
y = df_model['Over_Under'] # target column
model = RandomForestClassifier(
    random_state=1, n_estimators=100, min_samples_split=100, max_depth=5, min_samples_leaf=2)
skf = StratifiedKFold(n_splits=2)
skf.get_n_splits(X, y)
# StratifiedKFold(n_splits=2, random_state=None, shuffle=False)
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
model.fit(X_train, y_train)
# use inbuilt class feature_importances of tree based classifiers
print(model.feature_importances_)
# plot graph of feature importances for better visualization
feat_importances = pd.Series(model.feature_importances_, index=X.columns)
perm_importance = permutation_importance(model, X_test, y_test)
feat_importances.nlargest(5).plot(kind='barh')
print(perm_importance)
plt.show()
I'm not sure which one is more accurate, or whether I'm using them in the correct way. Should I be using the heatmap to eliminate collinearity and the feature importances to actually select my group of features?

You are comparing two different things, why would you expect them to be the same? And what would it even mean in this case?
In scikit-learn, the feature_importances_ of tree-based models are impurity-based: a feature's importance is the total reduction of the split criterion (e.g. Gini impurity) brought by the splits on that feature, normalized so that the importances sum to 1. A feature whose splits reduce impurity more is more important (for a particular model fitted with a particular dataset) than a feature whose splits reduce it less.
Correlation, on the other hand, is a measure of the linear relationship between two features.
I'm not sure which one is more accurate
What do you mean by accuracy? Both of these are accurate in what they are measuring. It is just that neither of them directly tells you which feature(s) to throw away.
Note that just because two features are correlated, it doesn't mean that you can automatically throw one of them away. Collinearity can cause issues with the interpretability of the model. If you have highly correlated features, then you can't say which one is more important based on the weights associated with these features. Collinearity should not affect the prediction power of the model. More often, you will find that by throwing away one of the correlated features, your model's prediction power decreases.
Collinearity in a dataset can therefore make the feature importances of your random forest model less interpretable, in the sense that you can't rely on their strict ordering. But again, it should not affect the predictive power of the model (except that the model is more prone to overfitting due to having more degrees of freedom).
Should I be using the heatmap to eliminate collinearity and the feature importances to actually select my group of features?
Feature engineering/selection is more of an art than science (outside of end-to-end deep learning). There is no correct answer here and you will need to develop your own heuristics and try different things to see which one works better in which scenario.
An example of a simple heuristic based on feature importances and correlation (assuming that you have a large number of features):
1. Fit the random forest model and measure the feature importances.
2. Throw away the features that seem to have no impact on the model (close to 0 importance).
3. Refit the model with the new subset of your original data and see whether the metric of interest (accuracy, MSE, ...) stays approximately the same as in step 1.
4. If you still have a lot of features, repeat steps 1-3, increasing the throw-away threshold until your metric of interest starts worsening.
5. Measure the correlation of the features that you are left with and select the most correlated pairs (based on some threshold, e.g. |c| > 0.8).
6. Pick one pair; drop a feature from this pair; measure model performance; return the dropped feature; repeat for each pair.
7. Drop the feature that seems to have the least negative effect on the model's performance, based on the results from step 6.
8. Repeat steps 6-7 until the model's performance starts dropping.
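Steps 1-3 of this heuristic can be sketched roughly as follows. The data is synthetic (make_classification), the 0.05 importance threshold is an arbitrary assumption, and cv_accuracy is a helper name invented for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X_demo, y_demo = make_classification(n_samples=300, n_features=10, n_informative=3,
                                     n_redundant=0, shuffle=False, random_state=1)

def cv_accuracy(X, y):
    """Mean 3-fold cross-validated accuracy of a small random forest."""
    model = RandomForestClassifier(n_estimators=50, random_state=1)
    return cross_val_score(model, X, y, cv=3).mean()

# Step 1: fit and measure importances
forest = RandomForestClassifier(n_estimators=50, random_state=1).fit(X_demo, y_demo)
importances = forest.feature_importances_

# Step 2: drop features below an (arbitrary) importance threshold
threshold = 0.05
keep = importances >= threshold

# Step 3: refit on the subset and compare the metric
score_all = cv_accuracy(X_demo, y_demo)
score_kept = cv_accuracy(X_demo[:, keep], y_demo)
print(f"kept {keep.sum()} of {keep.size} features")
print(f"accuracy all: {score_all:.3f}, kept: {score_kept:.3f}")
```

If score_kept stays close to score_all, the dropped features were probably not contributing much; step 4 would then raise the threshold and repeat.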

Multiclass classification using Gaussian Mixture Models with scikit learn

I am trying to use sklearn.mixture.GaussianMixture to classify pixels in a hyperspectral image. There are 15 classes (1-15). I tried the method from http://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_covariances.html, in which the means are initialized with means_init. I tried this as well, but my accuracy is poor (about 10%). I also tried changing the covariance type, threshold, maximum iterations and number of initializations, but the results are the same.
Am I doing this correctly? Please provide input.
import numpy as np
from sklearn.mixture import GaussianMixture
import scipy.io as sio
from sklearn.model_selection import train_test_split
uh_data =sio.loadmat('/Net/hico/data/users/nikhil/contest_uh_casi.mat')
data = uh_data['contest_uh_casi']
uh_labels = sio.loadmat('/Net/hico/data/users/nikhil/contest_gt_tr.mat')
labels = uh_labels['contest_gt_tr']
reshaped_data = np.reshape(data,(data.shape[0]*data.shape[1],data.shape[2]))
print('reshaped data :', reshaped_data.shape)
reshaped_label = np.reshape(labels,(labels.shape[0]*labels.shape[1],-1))
print('reshaped label :', reshaped_label.shape)
con_data = np.hstack((reshaped_data,reshaped_label))
pre_data = con_data[con_data[:,144] > 0]
total_data = pre_data[:,0:144]
total_label = pre_data[:,144]
train_data, test_data, train_label, test_label = train_test_split(total_data, total_label, test_size=0.30, random_state=42)
classifier = GaussianMixture(n_components = 15 ,covariance_type='diag',max_iter=100,random_state = 42,tol=0.1,n_init = 1)
classifier.means_init = np.array([train_data[train_label == i].mean(axis=0)
for i in range(1,16)])
classifier.fit(train_data)
pred_lab_train = classifier.predict(train_data)
train_accuracy = np.mean(pred_lab_train.ravel() == train_label.ravel())*100
print('train accuracy:', train_accuracy)
pred_lab_test = classifier.predict(test_data)
test_accuracy = np.mean(pred_lab_test.ravel()==test_label.ravel())*100
print('test accuracy:', test_accuracy)
My data has 66485 pixels with 144 features each. I also tried this after applying feature reduction techniques such as PCA, LDA, KPCA, etc., but the results are still the same.

Gaussian Mixture is not a classifier. It is a density estimation method, and expecting that its components will magically align with your classes is not a good idea. You should try out actual supervised techniques, since you clearly do have access to labels. Scikit-learn offers lots of these, including Random Forest, KNN, SVM, ... pick your favourite. GMM simply tries to fit a mixture of Gaussians to your data, but there is nothing forcing it to place them according to the labeling (which is not even provided in the fit call). From time to time this will work, but only for trivial problems where the classes are so well separated that even Naive Bayes would work. In general, however, it is simply the wrong tool for the problem.

GMM is not a classifier but a generative model. You can apply it to a classification problem by using Bayes' theorem. It's not true that classification based on GMMs works only for trivial problems. However, it is based on a mixture of Gaussian components, so it best fits problems with high-level features.
Your code uses the GMM incorrectly as a classifier. You should instead fit one GMM per class as the class-conditional density p(x | class) and use Bayes' theorem to obtain the posterior over the classes.
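A minimal sketch of the one-GMM-per-class idea, assuming synthetic, well-separated blobs in place of the hyperspectral pixels. The class name GMMBayesClassifier and its defaults are invented for this example and are not part of scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

class GMMBayesClassifier:
    """One GaussianMixture per class, combined with class priors via Bayes' theorem."""

    def __init__(self, n_components=2, covariance_type="diag", random_state=42):
        self.n_components = n_components
        self.covariance_type = covariance_type
        self.random_state = random_state

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.models_ = []
        self.log_priors_ = []
        for c in self.classes_:
            X_c = X[y == c]  # training samples of this class only
            gmm = GaussianMixture(n_components=self.n_components,
                                  covariance_type=self.covariance_type,
                                  random_state=self.random_state)
            self.models_.append(gmm.fit(X_c))
            self.log_priors_.append(np.log(len(X_c) / len(X)))
        return self

    def predict(self, X):
        # log p(x | c) + log p(c) for every class; the argmax is the MAP class
        scores = np.column_stack([m.score_samples(X) + lp
                                  for m, lp in zip(self.models_, self.log_priors_)])
        return self.classes_[np.argmax(scores, axis=1)]

# Synthetic, well-separated data standing in for the hyperspectral pixels
X_demo, y_demo = make_blobs(n_samples=600, centers=[(-5, -5), (0, 0), (5, 5)],
                            cluster_std=1.0, random_state=42)
y_demo = y_demo + 1  # labels 1..3, mimicking the 1-based labels in the question
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

clf = GMMBayesClassifier().fit(X_tr, y_tr)
accuracy = np.mean(clf.predict(X_te) == y_te)
print(f"test accuracy: {accuracy:.3f}")
```

Unlike the code in the question, the labels are used during fitting here: each mixture only ever sees the samples of its own class.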

Scikit Logistic Regression summary output?

Is there a way to get a similar, nice output for scikit-learn logistic regression models as in statsmodels, with all the p-values, standard errors, etc. in one table?

As you and others have pointed out, this is a limitation of scikit-learn. Before discussing a scikit-learn approach below, note that the “best” option is to use statsmodels as follows:
import statsmodels.api as sm
smlog = sm.Logit(y,sm.add_constant(X)).fit()
smlog.summary()
X represents your input features/predictors matrix and y represents the outcome variable. Statsmodels works well if X lacks highly correlated features, lacks low variance features, feature(s) don’t generate “perfect/quasi-perfect separation”, and any categorical features are reduced to “n-1” levels i.e., dummy-coded (and not “n” levels i.e., one-hot encoded as described here: dummy variable trap).
However, if above isn't feasible/practical, one scikit approach is coded below for fairly equivalent results - in terms of feature coefficients/odds with their standard errors and 95%CI estimates. Essentially, the code generates these results from distinct logistic regression scikit models trained against distinct test-train splits of your data. Again, make sure categorical features are dummy coded to n-1 levels (or your scikit coefficients will be incorrect for categorical features).
import random
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Instantiate logistic regression model with regularization turned OFF
# (scikit-learn >= 1.2 expects penalty=None; older versions used penalty="none")
log_nr = LogisticRegression(fit_intercept=True, penalty=None)

# Generate 5 distinct random numbers - as random seeds for 5 test-train splits
randomlist = random.sample(range(1, 10000), 5)

# Create features column
coeff_table = pd.DataFrame(X.columns, columns=["features"])

# Assemble coefficients over logistic regression models on 5 random data splits,
# iterating over random states while keeping track of `i`
for i, state in enumerate(randomlist):
    train_x, test_x, train_y, test_y = train_test_split(
        X, y, stratify=y, test_size=0.3, random_state=state)  # 5 test-train splits
    log_nr.fit(train_x, train_y)  # fit logistic model
    coeff_table[f"coefficients_{i+1}"] = log_nr.coef_.ravel()

# Calculate mean and std error for model coefficients (from 5 models above);
# restrict to the 5 coefficient columns so the text column is not averaged
coeff_table["mean_coeff"] = coeff_table.iloc[:, 1:6].mean(axis=1)
coeff_table["se_coeff"] = coeff_table.iloc[:, 1:6].sem(axis=1)

# Calculate 95% CI intervals for feature coefficients
coeff_table["95ci_se_coeff"] = 1.96 * coeff_table["se_coeff"]
coeff_table["coeff_95ci_LL"] = coeff_table["mean_coeff"] - coeff_table["95ci_se_coeff"]
coeff_table["coeff_95ci_UL"] = coeff_table["mean_coeff"] + coeff_table["95ci_se_coeff"]
Finally (and optionally), convert the coefficients to odds ratios by exponentiating them as follows. Odds ratios are my favorite output from logistic regression, and the code below appends them to your dataframe.
#Calculate odds ratios and 95% CI (LL = lower limit, UL = upper limit) intervals for each feature
coeff_table["odds_mean"] = np.exp(coeff_table["mean_coeff"])
coeff_table["95ci_odds_LL"] = np.exp(coeff_table["coeff_95ci_LL"])
coeff_table["95ci_odds_UL"] = np.exp(coeff_table["coeff_95ci_UL"])
This answer builds on a somewhat related reply by @pciunkiewicz available here: Collate model coefficients across multiple test-train splits from sklearn

How to ensemble SVM and Logistic Regression with python

I am working on a text classification task (7000 texts evenly distributed over 10 labels). I have been exploring SVM and Logistic Regression:
clf1 = svm.LinearSVC()
clf1.fit(X, y)
clf1.predict(X_test)
score1 = clf1.score(X_test,y_true)
clf2 = linear_model.LogisticRegression()
clf2.fit(X, y)
clf2.predict(X_test)
score2 = clf2.score(X_test,y_true)
I got two accuracies, score1 and score2. I wonder whether I could improve my accuracy by developing an ensemble system that combines the outputs of the two classifiers above.
I have read up on ensembles on my own, and I know there are bagging, boosting, and stacking.
However, I do not know how to use the scores predicted from my SVM and Logistic Regression in ensemble. Could anyone give me some ideas or show me some example code?

You can just multiply the probabilities, or use another combination rule.
To do that in a more generic way (and try several rules), you can use brew.
from brew.base import Ensemble
from brew.base import EnsembleClassifier
from brew.combination.combiner import Combiner
# create your Ensemble
clfs = [clf1, clf2]
ens = Ensemble(classifiers=clfs)
# Since you have only 2 classifiers 'majority_vote' is not an option,
# rule = ['mean', 'majority_vote', 'max', 'min', 'median']
comb = Combiner(rule='mean')
# now create your ensemble classifier
ensemble_clf = EnsembleClassifier(ensemble=ens, combiner=comb)
ensemble_clf.predict(X)
Also, keep in mind that the classifiers should be diverse enough to give a good combination result.
If you had fewer features, I'd say you should check out some Dynamic Classifier/Ensemble Selection (also provided in brew), but since you probably have many features, Euclidean distance probably does not make sense for finding the region of competence of each classifier. The best thing is to check by hand which kinds of labels each classifier tends to get right, based on the confusion matrix.
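If you would rather stay within scikit-learn, averaging class probabilities can be sketched with VotingClassifier. This is a different tool from brew and not the answerer's method. It assumes synthetic data in place of the text features, and it wraps LinearSVC in CalibratedClassifierCV because LinearSVC itself has no predict_proba.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic 3-class data standing in for the vectorized texts
X_demo, y_demo = make_classification(n_samples=600, n_features=20, n_informative=8,
                                     n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# LinearSVC has no predict_proba, so wrap it in a calibrator for soft voting
svm_proba = CalibratedClassifierCV(LinearSVC(dual=False, random_state=0), cv=3)
logreg = LogisticRegression(max_iter=1000)

# voting="soft" averages the predicted class probabilities of both models
ensemble = VotingClassifier(estimators=[("svm", svm_proba), ("lr", logreg)], voting="soft")
ensemble.fit(X_tr, y_tr)
ensemble_score = ensemble.score(X_te, y_te)
print(f"ensemble accuracy: {ensemble_score:.3f}")
```

Whether this beats the individual classifiers depends on how different their errors are, which is the diversity point made above.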
