I am new to data science. I am trying to find the feature importance ranking for my dataset. I have already applied a random forest and got the output.
Here is my code:
# importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# importing dataset
X = dataset.iloc[:,3:12].values
Y = dataset.iloc[:,13].values
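# encoding categorical data: one-hot encode column 0 and pass the other columns through
# (the categorical_features argument was removed from OneHotEncoder in scikit-learn 0.22)
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer([('onehot', OneHotEncoder(), [0])], remainder='passthrough', sparse_threshold=0)
X = ct.fit_transform(X)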
# splitting dataset into test set and train set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.20)
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators=20, random_state=0)
regressor.fit(X_train, y_train)
For the importance part, I almost copied the example shown in:
Here is the code:
# feature importance
from sklearn.ensemble import ExtraTreesClassifier
importances = regressor.feature_importances_
# spread of each feature's importance across the individual trees
std = np.std([tree.feature_importances_ for tree in regressor.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]
print("Feature ranking:")
for f in range(X.shape[1]):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))
# Plot the feature importances of the forest
plt.title("Feature importances")
plt.bar(range(X.shape[1]), importances[indices],
        color="r", yerr=std[indices], align="center")
plt.xticks(range(X.shape[1]), indices)
plt.xlim([-1, X.shape[1]])
I am expecting the output shown in the documentation. Can anyone help me, please? Thanks in advance.
My dataset is here:
You have a lot of features, so they cannot all be seen in a single plot.
Just plot some of them.
Here I plot the 20 most important:
# Plot the feature importances of the forest (top n only)
n = 20
plt.title("Feature importances")
_ = plt.bar(range(n), importances[indices][:n], color="r", yerr=std[indices][:n])
plt.xticks(range(n), indices[:n])
plt.xlim([-1, n])
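A self-contained sketch of the same idea may help to check the ranking before plotting. The synthetic data from make_regression, the 50 features, and the choice n = 20 are assumptions for illustration only; they are not the dataset from the question.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the real dataset (assumption: 50 features, 5 informative)
X, y = make_regression(n_samples=300, n_features=50, n_informative=5, random_state=0)
regressor = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)

importances = regressor.feature_importances_
# Spread of each feature's importance across the individual trees
std = np.std([tree.feature_importances_ for tree in regressor.estimators_], axis=0)

n = 20  # hypothetical number of top features to keep
top = np.argsort(importances)[::-1][:n]  # column indices, most important first

for rank, idx in enumerate(top, start=1):
    print(f"{rank}. feature {idx} ({importances[idx]:.4f} +/- {std[idx]:.4f})")
```

The array top can then be used both to slice importances and std for plt.bar and as the tick labels for plt.xticks, so that bars and labels stay aligned.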
My code in case you need it: https://filebin.net/be4h27swglqf3ci3
I found this question, which seems to address my problem (Determining the most contributing features for SVM classifier in sklearn).
However, as my understanding of Python is limited, I need some help.
I have a dependent variable, 'Group', that has two levels: 'Group1' and 'Group2'.
This is the code I found, adapted to my data:
import pandas as pd
df = pd.read_csv('C:/Users/myPC/OneDrive/Desktop/analysis/dataframe6.csv')
X = df.drop('Group', axis=1)
y = df['Group']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)
from sklearn.svm import SVC
svclassifier = SVC(kernel='linear')
svclassifier.fit(X_train, y_train)
y_pred = svclassifier.predict(X_test)
from sklearn.metrics import classification_report, confusion_matrix
from matplotlib import pyplot as plt
from sklearn import svm
def f_importances(coef, names):
    imp = coef
    imp,names = zip(*sorted(zip(imp,names)))
    plt.barh(range(len(names)), imp, align='center')
    plt.yticks(range(len(names)), names)
features_names = ['input1', 'input2']
svclassifier = SVC(kernel='linear')
svclassifier.fit(X_train, y_train)
f_importances(svclassifier.coef_, features_names)
It produces just a blank plot.
I think there is something I should change in features_names = ['input1', 'input2'] but I am not sure what.
The code you used to plot expects a one-dimensional array. According to the documentation, the attribute coef_ is:
coef_ ndarray of shape (n_classes * (n_classes - 1) / 2, n_features)
Weights assigned to the features when kernel="linear".
Using an example :
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
df = pd.DataFrame(np.random.uniform(0,1,(400,3)),columns=['input1','input2','input3'])
df['Group'] = np.random.choice(['Group1','Group2'],400)
X = df.drop('Group', axis=1)
y = df['Group']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)
svclassifier = SVC(kernel='linear')
svclassifier.fit(X_train, y_train)
We check the shape of the array:
svclassifier.coef_.shape
(1, 3)
Because you have only 2 classes, there is only 1 row. We can pass that row to the plotting function:
from matplotlib import pyplot as plt
from sklearn import svm
def f_importances(coef, names):
    imp = coef
    imp,names = zip(*sorted(zip(imp,names)))
    plt.barh(range(len(names)), imp, align='center')
    plt.yticks(range(len(names)), names)
features_names = ['input1', 'input2','input3']
svclassifier = SVC(kernel='linear')
svclassifier.fit(X_train, y_train)
f_importances(svclassifier.coef_[0], features_names)
This is the plot I got :
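As a possible variation, the feature names can be taken from the DataFrame columns instead of being typed by hand, which avoids a mismatch between the number of names and the number of coefficients. This is only a sketch on simulated data; the way y is generated here (it depends mostly on input1) is an assumption made so that the ranking has something to find.

```python
import numpy as np
import pandas as pd
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.uniform(0, 1, (400, 3)), columns=["input1", "input2", "input3"])
# Simulated labels (assumption): the group is driven mainly by input1
y = np.where(X["input1"] + 0.2 * rng.normal(size=400) > 0.5, "Group1", "Group2")

clf = SVC(kernel="linear").fit(X, y)

feature_names = list(X.columns)   # one name per column, in the same order as coef_
coef = clf.coef_[0]               # coef_ has shape (1, n_features) for two classes
order = np.argsort(np.abs(coef))[::-1]  # largest absolute weight first
ranked = [(feature_names[i], float(coef[i])) for i in order]

for name, weight in ranked:
    print(f"{name}: {weight:+.3f}")
```

Ranking by absolute value treats large negative and large positive weights as equally influential; the sign only tells which class the feature pushes towards. The weights are comparable across features only when the features are on similar scales.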
I have to fit 40 time series with a vector autoregressive (VAR) model, and the large number of variables suggests using a selection method. I would love to use the LASSO, but I'm using statsmodels for the fitting, and the only way I can find to apply LASSO with that library is for a linear regression model. Can someone help?
You can try fit_regularized: you set the model up as you would for an OLS fit, and set L1_wt to 1 so that the penalty is a lasso.
As an example, first load the Boston dataset (load_boston was removed in scikit-learn 1.2, so this needs an older version):
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
import numpy as np
import statsmodels.api as sm
scaler = StandardScaler()
data = load_boston()
data_scaled = scaler.fit_transform(data.data)
X_train, X_test, y_train, y_test = train_test_split(data_scaled, data.target, test_size=0.33, random_state=42)
The loop below shows that the two implementations behave similarly; you will need to tune the shrinkage parameter alpha in your model anyway:
alphas = [0.0001,0.001, 0.01, 0.1,0.2, 0.5, 1]
mse_sklearn = []
mse_sm = []
for a in alphas:
    clf = linear_model.Lasso(alpha=a)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    mse_sklearn.append(mean_squared_error(y_test, y_pred))
    mdl = sm.OLS(y_train,sm.add_constant(X_train)).fit_regularized(alpha=a,L1_wt=1)
    y_pred = mdl.predict(sm.add_constant(X_test))
    mse_sm.append(mean_squared_error(y_test, y_pred))
Visualize the results:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
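The example above applies the lasso to an ordinary regression. For the VAR in the question, one possible route is to build the lagged design matrix by hand and fit one lasso per equation. The sketch below uses scikit-learn and simulated data; the helper lagged_design, the lag order p = 2, and alpha = 0.1 are all assumptions for illustration, and none of this is a built-in statsmodels VAR feature.

```python
import numpy as np
from sklearn.linear_model import Lasso

def lagged_design(series, p):
    """Hypothetical helper: stack p lags of a (T, k) array.

    Returns the (T - p, k * p) matrix of lagged values and the
    (T - p, k) matrix of targets they are meant to predict.
    """
    T, k = series.shape
    lags = [series[p - lag:T - lag] for lag in range(1, p + 1)]
    return np.hstack(lags), series[p:]

rng = np.random.default_rng(0)
T, k, p = 300, 5, 2
# Simulated sparse VAR(1): each series depends only on its own first lag
data = np.zeros((T, k))
for t in range(1, T):
    data[t] = 0.6 * data[t - 1] + rng.normal(size=k)

X_lags, Y = lagged_design(data, p)

# One lasso regression per equation of the VAR
coefs = np.zeros((k, k * p))
for j in range(k):
    coefs[j] = Lasso(alpha=0.1).fit(X_lags, Y[:, j]).coef_

print("non-zero coefficients per equation:", (coefs != 0).sum(axis=1))
```

Row j of coefs holds the coefficients of equation j, with the first k columns belonging to lag 1 and the next k to lag 2. In practice alpha would be chosen by a validation scheme that respects the time ordering rather than fixed by hand.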
I'm trying to simply plot a regression line, however I get messy lines. Is it because I fitted the model with 2 features, so the only appropriate visualization would be a 3d plane?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
# prepare data
boston = load_boston()
X = pd.DataFrame(boston.data, columns=boston.feature_names)[['AGE','RM']]
y = boston.target
# split dataset into training and test data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=33)
# apply linear regression on dataset
lm = LinearRegression()
lm.fit(X_train, y_train)
pred_train = lm.predict(X_train)
pred_test = lm.predict(X_test)
#plot relationship between RM and price
plt.plot(X_train['RM'], pred_train, color='r')
plt.title('Relationship between RM and Price')
You are right. You are training on multiple features, i.e. AGE and RM, but you are drawing a 2D plot against only one of them, RM. Try a 3D plot instead. In general, linear regression with two features results in a plane; it is still a linear regression, which is why we use the term "hyperplane". It reduces to a line for a single feature, a plane for two features, and so on.
Here is the output in 3D:
plt3d = plt.figure().add_subplot(projection='3d')  # gca(projection=...) no longer works in recent Matplotlib
plt3d.plot_trisurf(X_train['RM'].values, X_train['AGE'].values, pred_train, alpha=0.7, antialiased=True)
The problem is that the points have to be ordered before you plot them as a line:
plt.plot(np.sort(X_train['RM']), np.sort(pred_train), color='r')
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
# prepare data
boston = load_boston()
X = pd.DataFrame(boston.data, columns=boston.feature_names)[['AGE','RM']]
y = boston.target
# split dataset into training and test data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=33)
# apply linear regression on dataset
lm = LinearRegression()
lm.fit(X_train, y_train)
pred_train = lm.predict(X_train)
pred_test = lm.predict(X_test)
#plot relationship between RM and price
plt.plot(np.sort(X_train['RM']), np.sort(pred_train), color='r')
plt.title('Relationship between RM and Price')
The result:
A 3D plot would probably make the relationship between the covariates RM and AGE easier to see.
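One caveat about sorting RM and the predictions independently: each prediction is no longer paired with its own RM value. An alternative that keeps a 2D plot is to hold AGE fixed (for example at its mean) and predict along a grid of RM values, which gives a straight line whose slope is the fitted RM coefficient. The sketch below uses simulated data because load_boston is not available in recent scikit-learn versions; the coefficients used to generate price are made up.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(33)
# Simulated stand-ins for the two features and the target (assumed values)
rm = rng.uniform(4, 9, size=200)
age = rng.uniform(0, 100, size=200)
price = 9.0 * rm - 0.07 * age - 30 + rng.normal(scale=2.0, size=200)

X = np.column_stack([age, rm])  # same column order as ['AGE', 'RM']
lm = LinearRegression().fit(X, price)

# Predict along RM while holding AGE at its mean
rm_grid = np.linspace(rm.min(), rm.max(), 50)
age_fixed = np.full_like(rm_grid, age.mean())
line = lm.predict(np.column_stack([age_fixed, rm_grid]))

print("slope of the plotted line:", (line[-1] - line[0]) / (rm_grid[-1] - rm_grid[0]))
print("fitted RM coefficient:   ", lm.coef_[1])
```

Plotting rm_grid against line with plt.plot then shows the effect of RM on the prediction at an average AGE, on top of a scatter of the raw data if desired.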
So, basically, I'm using a RF for descriptive modelling as follows:
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import class_weight
# classes and y are passed by keyword, as recent scikit-learn versions require
class_weights = class_weight.compute_class_weight('balanced', classes=np.unique(y), y=y)
class_weights = dict(enumerate(class_weights))
{0: 0.5561096747856852, 1: 4.955559597429368}
clf = RandomForestClassifier(class_weight=class_weights, random_state=0)
cross_val_score(clf, X, y, cv=10, scoring='f1').mean()
And plotting variables importance as:
import matplotlib.pyplot as plt
def plot_importances(clf, features, n):
    importances = clf.feature_importances_
    indices = np.argsort(importances)[::-1]
    if n:
        indices = indices[:n]
    plt.figure(figsize=(10, 5))
    plt.title("Feature importances")
    plt.bar(range(len(indices)), importances[indices], align='center')
    plt.xticks(range(len(indices)), features[indices], rotation=90)
    plt.xlim([-1, len(indices)])
    return features[indices]
imp = plot_importances(clf, X.columns, 30)
I was expecting the variable importances to be the same across multiple runs. However, they change whenever I re-run the notebook.
I don't understand why. Is it somehow related to the cross_val_score method?
I cannot reproduce the problem. For me, the variable importances remain the same across multiple runs when I generate some data using:
X, y = make_classification(n_samples=1000,
X = pd.DataFrame(X)
Changing the data to have uneven class weights, by selecting only the first 750 y/X data points, does not lead to differences in the importances either.
What data do you use?
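A small check along these lines can help to narrow the problem down. It is a sketch on generated data; the make_classification arguments below are my own assumptions and not the ones from the truncated call above. With the same data and the same random_state, the forest should give identical importances; a different seed, or data whose rows or columns change between runs, would give different ones.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Generated data (assumed settings, for illustration only)
X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           random_state=0)

def importances(seed):
    """Fit a fresh forest with the given seed and return its importances."""
    clf = RandomForestClassifier(n_estimators=50, random_state=seed)
    return clf.fit(X, y).feature_importances_

same_seed = np.allclose(importances(0), importances(0))
different_seed = np.allclose(importances(0), importances(1))

print("identical with the same seed:  ", same_seed)
print("identical with different seeds:", different_seed)
```

Note also that cross_val_score fits clones of the estimator, so the clf object passed to it is not fitted by that call; the importances come from whichever separate fit was run before plotting, and that fit needs the same data and seed each time to be reproducible.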
How can I show the important features that contribute to the SVM model along with the feature name?
My code is shown below,
First I imported the modules
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix, classification_report
Then I divided my data into the target and the features
y = df_new[['numeric values']]
X = df_new.drop('numeric values', axis=1).values
Then I set up the pipeline
steps = [('scalar', StandardScaler()),
         ('SVM', SVC(kernel='linear'))]
pipeline = Pipeline(steps)
Then I specified the hyperparameter space
parameters = {'SVM__C':[1, 10, 100],
              'SVM__gamma':[0.1, 0.01]}
I created the train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2, random_state=21)
Instantiate the GridSearchCV object: cv
cv = GridSearchCV(pipeline,param_grid = parameters,cv=5)
Fit to the training set
cv.fit(X_train, y_train)
Predict the labels of the test set: y_pred
y_pred = cv.predict(X_test)
feature_importances = cv.best_estimator_.feature_importances_
The error message I get
'Pipeline' object has no attribute 'feature_importances_'
If I understand correctly, you are building a model with, say, 100 features and you want to know which features are more important and which are less. Is that the case?
If so, try univariate feature selection. It is a very basic method that you can experiment with on your data before moving on to more advanced ones. Sample code is provided by scikit-learn itself, and you can modify it as required.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets, svm
from sklearn.feature_selection import SelectPercentile, f_classif
# import some data to play with
# The iris dataset
iris = datasets.load_iris()
# Some noisy data not correlated
E = np.random.uniform(0, 0.1, size=(len(iris.data), 20))
# Add the noisy data to the informative features
X = np.hstack((iris.data, E))
y = iris.target
X_indices = np.arange(X.shape[-1])
# Univariate feature selection with F-test for feature scoring
# We use the default selection function: the 10% most significant features
selector = SelectPercentile(f_classif, percentile=10)
selector.fit(X, y)
scores = -np.log10(selector.pvalues_)
scores /= scores.max()
plt.bar(X_indices - .45, scores, width=.2,
        label=r'Univariate score ($-Log(p_{value})$)', color='g')
# Compare to the weights of an SVM
clf = svm.SVC(kernel='linear')
clf.fit(X, y)
svm_weights = (clf.coef_ ** 2).sum(axis=0)
svm_weights /= svm_weights.max()
plt.bar(X_indices - .25, svm_weights, width=.2, label='SVM weight', color='r')
clf_selected = svm.SVC(kernel='linear')
clf_selected.fit(selector.transform(X), y)
svm_weights_selected = (clf_selected.coef_ ** 2).sum(axis=0)
svm_weights_selected /= svm_weights_selected.max()
plt.bar(X_indices[selector.get_support()] - .05, svm_weights_selected,
        width=.2, label='SVM weights after selection', color='b')
plt.title("Comparing feature selection")
plt.xlabel('Feature number')
plt.legend(loc='upper right')
Code ref.
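For each feature, this plots the univariate score (based on the p-value of the F-test) next to the corresponding weight of an SVM. The univariate selection keeps the features with the most significant scores; in this example those are the informative features, which also receive the larger SVM weights.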
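As for the error in the question itself: a Pipeline has no feature_importances_, and neither does SVC. With a linear kernel, one option is to pull the fitted SVC step out of the best pipeline and look at its coef_. The sketch below assumes generated data and placeholder feature names (feat_0, feat_1, ...); with a DataFrame the names would come from the columns that remain after dropping the target.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Generated two-class data and placeholder names (assumptions for illustration)
X, y = make_classification(n_samples=200, n_features=6, n_informative=3,
                           random_state=21)
feature_names = [f"feat_{i}" for i in range(X.shape[1])]

pipeline = Pipeline([("scalar", StandardScaler()),
                     ("SVM", SVC(kernel="linear"))])
cv = GridSearchCV(pipeline, param_grid={"SVM__C": [1, 10, 100]}, cv=5)
cv.fit(X, y)

# The fitted SVC lives inside the best pipeline, under its step name
svm = cv.best_estimator_.named_steps["SVM"]
weights = np.abs(svm.coef_[0])  # one weight per feature for a two-class problem
ranking = sorted(zip(feature_names, weights), key=lambda t: t[1], reverse=True)

for name, weight in ranking:
    print(f"{name}: {weight:.3f}")
```

Because the scaler puts all features on the same scale before the SVM sees them, the absolute weights are roughly comparable across features. coef_ exists only for kernel='linear'; for other kernels a model-agnostic approach such as permutation importance would be needed.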