GridSearchCV results heatmap - python

I am trying to generate a heatmap for the GridSearchCV results from sklearn. The thing I like about sklearn-evaluation is that it is really easy to generate the heatmap. However, I have hit one issue. When I give a parameter as None, e.g.
max_depth = [3, 4, 5, 6, None]
while generating the heatmap, it shows an error saying:
TypeError: '<' not supported between instances of 'NoneType' and 'int'
Is there any workaround for this?
I have found other ways to generate heatmaps, such as using matplotlib and seaborn, but nothing gives as beautiful heatmaps as sklearn-evaluation.

I fiddled around with the grid_search.py file at /lib/python3.8/site-packages/sklearn_evaluation/plot/grid_search.py. At lines 192/193, change the lines
From
row_names = sorted(set([t[0] for t in matrix_elements.keys()]),
                   key=itemgetter(1))
col_names = sorted(set([t[1] for t in matrix_elements.keys()]),
                   key=itemgetter(1))
To:
row_names = sorted(set([t[0] for t in matrix_elements.keys()]),
                   key=lambda x: (x[1] is None, x[1]))
col_names = sorted(set([t[1] for t in matrix_elements.keys()]),
                   key=lambda x: (x[1] is None, x[1]))
Moving all None values to the end of the list while sorting is based on a previous answer from Andrew Clarke.
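To see the sort key in isolation, here is a minimal standalone sketch of the None-last trick (not part of the patched file); tuples whose value is None sort last because (True, ...) compares greater than (False, ...):
values = [('max_depth', 3), ('max_depth', None), ('max_depth', 1)]
ordered = sorted(values, key=lambda x: (x[1] is None, x[1]))
print(ordered)  # [('max_depth', 1), ('max_depth', 3), ('max_depth', None)]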
Using this tweak, my demo script is shown below:
import numpy as np
import sklearn.datasets as datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn_evaluation import plot
data = datasets.make_classification(n_samples=200, n_features=10, n_informative=4, class_sep=0.5)
X = data[0]
y = data[1]
hyperparameters = {
"max_depth": [1, 2, 3, None],
"criterion": ["gini", "entropy"],
"max_features": ["sqrt", "log2"],
}
est = RandomForestClassifier(n_estimators=5)
clf = GridSearchCV(est, hyperparameters, cv=3)
clf.fit(X, y)
plot.grid_search(clf.cv_results_, change=("max_depth", "criterion"), subset={"max_features": "sqrt"})
import matplotlib.pyplot as plt
plt.show()
The output is as shown below:


LinearRegression TypeError

The above screenshot is referred to as sample.xlsx. I've been having trouble getting the beta for each stock using the LinearRegression() function.
Input:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
df = pd.read_excel('sample.xlsx')
mean = df['ChangePercent'].mean()
for index, row in df.iterrows():
    symbol = row['stock']
    perc = row['ChangePercent']
    x = np.array(perc).reshape((-1, 1))
    y = np.array(mean)
    model = LinearRegression().fit(x, y)
    print(model.coef_)
Output:
Line 16: model = LinearRegression().fit(x, y)
"Singleton array %r cannot be considered a valid collection." % x
TypeError: Singleton array array(3.34) cannot be considered a valid collection.
How can I make the collection valid so that I can get a beta value (model.coef_) for each stock?
X and y must have the same shape, so you need to reshape both x and y to one row and one column. In this case that comes down to:
np.array(mean).reshape(-1,1) or np.array(mean).reshape(1,1)
Given that you are training five models, each with just one value, it is not surprising that all five will "learn" that the coefficient of the linear regression is 0 and the intercept is 3.37 (y).
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
df = pd.DataFrame({
    "stock": ["ABCD", "XYZ", "JK", "OPQ", "GHI"],
    "ChangePercent": [-1.7, 30, 3.7, -15.3, 0]
})
mean = df['ChangePercent'].mean()
for index, row in df.iterrows():
    symbol = row['stock']
    perc = row['ChangePercent']
    x = np.array(perc).reshape(-1,1)
    y = np.array(mean).reshape(-1,1)
    model = LinearRegression().fit(x, y)
    print(f"{model.intercept_} + {model.coef_}*{x} = {y}")
This is correct from an algorithmic point of view, but it doesn't make any practical sense given that you're only providing one example to train each model.
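If the goal is a per-stock beta, the usual approach is to regress a series of the stock's returns on the market's returns, so each model sees many observations. Below is a minimal sketch of that idea; the column names and return values are hypothetical and only illustrative.
import pandas as pd
from sklearn.linear_model import LinearRegression
# Hypothetical data: paired daily returns for one stock and the market index.
returns = pd.DataFrame({
    "market_return": [0.010, -0.020, 0.015, 0.005, -0.010],
    "stock_return":  [0.012, -0.025, 0.020, 0.004, -0.008],
})
x = returns[["market_return"]].values  # shape (n_samples, 1)
y = returns["stock_return"].values     # shape (n_samples,)
model = LinearRegression().fit(x, y)
print(model.coef_[0])  # beta: slope of stock returns against market returns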

Polynomial Regression: ValueError: Input contains NaN, infinity or a value too large for dtype('float64'). but there are no Infinite or Nan values

I am using sklearn and I'm getting an error...
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
df = pd.read_csv('Noodles Data.csv')
d = {'Pack': 1, 'Bowl': 2,'Cup': 3, 'Tray': 4, 'Can' : 5, 'Box': 6}
df['Style'] = df['Style'].map(d)
X = df.iloc[:, 2].values.reshape(-1,1)
y = df.iloc[:, 3].values.reshape(-1,1)
poly_reg = PolynomialFeatures(degree = 6)
X_poly = poly_reg.fit_transform(X)
poly_reg.fit(X_poly, y)
lin_reg_2 = LinearRegression()
lin_reg_2.fit(X_poly, y)
After this, I'm getting the following error:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
I've tried all possible ways to remove NaN values from my data but still couldn't solve the problem.
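One thing worth checking (a guess, since the CSV isn't shown): pandas Series.map returns NaN for any value that is not a key of the mapping dict, so a Style outside the six listed categories would introduce NaN even if the original file had none. A quick diagnostic sketch:
import pandas as pd
df = pd.read_csv('Noodles Data.csv')
d = {'Pack': 1, 'Bowl': 2, 'Cup': 3, 'Tray': 4, 'Can': 5, 'Box': 6}
# Values not covered by the dict become NaN after .map(); list them beforehand.
unmapped = df.loc[~df['Style'].isin(list(d)), 'Style'].unique()
print("Styles not in the mapping:", unmapped)
df['Style'] = df['Style'].map(d)
# Count missing values per column before fitting.
print(df.isna().sum())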

from_formula() missing 2 required positional arguments: 'formula' and 'data'

I am getting a positional-arguments error for the ols function under statsmodels.formula.api. I have tried statsmodels.regression.linear_model and changing OLS to ols and vice versa.
import statsmodels.regression.linear_model as sm
X = np.append(arr=np.ones((50,1)).astype(int),values=X,axis=1)
X_opt = X[:,[0,1,2,3,4,5]]
regressor_OLS = sm.ols(endog = Y, exog = X_opt).fit()
The expected output is the fitted regression model, but I am getting this error:
from_formula() missing 2 required positional arguments: 'formula' and
'data'
To get this example to work (I am assuming you are running the Udemy machine learning course, which matches this example line for line) I had to change the import statement. The module they are using is no longer where the OLS function resides.
import statsmodels.regression.linear_model as lm
then
regressor_ols = lm.OLS(endog = y, exog = x_optimal).fit()
This should work:
import statsmodels.api as smf
X = np.append(arr=np.ones((50,1),dtype=np.int), values = X,axis = 1)
X_opt = X[:,[0,1,2,3,4,5]]
regressor_ols = smf.OLS(y,X_opt).fit()
Remove
import statsmodels.regression.linear_model as sm
and just import statsmodels.api as follows:
import statsmodels.api as sm
The course is quite old, which is why fragments of the code are obsolete; no idea why they are not updating it.
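For reference, here is a minimal self-contained sketch of the same OLS fit with the current statsmodels API; the data is random and only illustrative, and sm.add_constant replaces the manual np.ones column used in the course:
import numpy as np
import statsmodels.api as sm
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))                     # stand-in for the course's feature matrix
y = X @ np.array([1.0, 0.5, 0.0, -2.0, 3.0]) + rng.normal(size=50)
X_opt = sm.add_constant(X[:, [0, 1, 2, 3, 4]])   # adds the intercept column
regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()
print(regressor_OLS.summary())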
This function is part of the linear_model module, so use the following code to make it work.
import statsmodels.regression.linear_model as lm
X = np.append(arr=np.ones((50,1)).astype(int),values=X,axis=1)
X_opt = X[:,[0,1,2,3,4,5]]
regressor_OLS = lm.OLS(endog = y, exog = X_opt).fit()
Use import statsmodels.regression.linear_model as lm or import statsmodels.api as sm
import statsmodels.regression.linear_model as lm
X=np.append(arr=np.ones((50,1)).astype(int), values=X, axis=1)
X_opt=X[:,[0, 1, 2, 3, 4, 5]]
regressor_x=lm.OLS(endog=y, exog=X_opt).fit()
regressor_x.summary()
This one worked for me:
import statsmodels.api as sm
X=np.insert(X,0,np.ones(X.shape[0]),axis=1)
colList=list()
for i in range(X.shape[1]):
    colList.append(i)
X_opt=np.array(X[:, colList], dtype=float)
regressor_OLS=sm.OLS(endog=y,exog=X_opt).fit()
Solution 1:
import statsmodels.api as sm
x = np.append(arr= np.ones((50, 1)).astype(int), values= x, axis=1)
x_opt = x[:, [0,1,2,3,4,5]]
regressor_OLS = sm.OLS(endog=y, exog=x_opt).fit()
Solution 2:
import statsmodels.regression.linear_model as lm
x = np.append(arr= np.ones((50, 1)).astype(int), values= x, axis=1)
x_opt = x[:, [0,1,2,3,4,5]]
regressor_ols = lm.OLS(endog=y, exog=x_opt).fit()
I recently had the same problem. As auticus said, the OLS function no longer lives in statsmodels.formula.api. But you also must create X_opt as a list:
import statsmodels.regression.linear_model as lm
X = np.append(arr = np.ones((50,1)).astype(int), values = X, axis = 1)
X_opt = X[:, [0, 1, 2, 3, 4, 5]].tolist()
SL = 0.05
regression_OLS = lm.OLS(endog = y, exog = X_opt).fit()

How to graph grid scores from GridSearchCV?

I am looking for a way to graph grid_scores_ from GridSearchCV in sklearn. In this example I am trying to grid search for best gamma and C parameters for an SVR algorithm. My code looks as follows:
C_range = 10.0 ** np.arange(-4, 4)
gamma_range = 10.0 ** np.arange(-4, 4)
param_grid = dict(gamma=gamma_range.tolist(), C=C_range.tolist())
grid = GridSearchCV(SVR(kernel='rbf', gamma=0.1),param_grid, cv=5)
grid.fit(X_train,y_train)
print(grid.grid_scores_)
After I run the code and print the grid scores I get the following outcome:
[mean: -3.28593, std: 1.69134, params: {'gamma': 0.0001, 'C': 0.0001}, mean: -3.29370, std: 1.69346, params: {'gamma': 0.001, 'C': 0.0001}, mean: -3.28933, std: 1.69104, params: {'gamma': 0.01, 'C': 0.0001}, mean: -3.28925, std: 1.69106, params: {'gamma': 0.1, 'C': 0.0001}, mean: -3.28925, std: 1.69106, params: {'gamma': 1.0, 'C': 0.0001}, mean: -3.28925, std: 1.69106, params: {'gamma': 10.0, 'C': 0.0001},etc]
I would like to visualize all the scores (mean values) depending on gamma and C parameters. The graph I am trying to obtain should look as follows:
The x-axis is gamma, the y-axis is the mean score (root mean squared error in this case), and different lines represent different C values.
The code shown by #sascha is correct. However, the grid_scores_ attribute will soon be deprecated. It is better to use the cv_results_ attribute. It can be implemented in a similar fashion to #sascha's method:
def plot_grid_search(cv_results, grid_param_1, grid_param_2, name_param_1, name_param_2):
    # Get Test Scores Mean and std for each grid search
    scores_mean = cv_results['mean_test_score']
    scores_mean = np.array(scores_mean).reshape(len(grid_param_2), len(grid_param_1))
    scores_sd = cv_results['std_test_score']
    scores_sd = np.array(scores_sd).reshape(len(grid_param_2), len(grid_param_1))

    # Plot Grid search scores
    _, ax = plt.subplots(1, 1)

    # Param1 is the X-axis, Param 2 is represented as a different curve (color line)
    for idx, val in enumerate(grid_param_2):
        ax.plot(grid_param_1, scores_mean[idx, :], '-o', label=name_param_2 + ': ' + str(val))

    ax.set_title("Grid Search Scores", fontsize=20, fontweight='bold')
    ax.set_xlabel(name_param_1, fontsize=16)
    ax.set_ylabel('CV Average Score', fontsize=16)
    ax.legend(loc="best", fontsize=15)
    ax.grid('on')

# Calling Method
plot_grid_search(pipe_grid.cv_results_, n_estimators, max_features, 'N Estimators', 'Max Features')
The above results in the following plot:
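The call above assumes pipe_grid, n_estimators and max_features already exist in your session; here is a minimal sketch of what they could look like, using a random-forest grid on synthetic data (names and values are only illustrative):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
n_estimators = [10, 50, 100, 200]
max_features = [2, 4, 6]
pipe_grid = GridSearchCV(RandomForestClassifier(random_state=0),
                         {'n_estimators': n_estimators, 'max_features': max_features},
                         cv=3)
pipe_grid.fit(X, y)
plot_grid_search(pipe_grid.cv_results_, n_estimators, max_features, 'N Estimators', 'Max Features')
plt.show()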
from sklearn.svm import SVC
from sklearn.grid_search import GridSearchCV
from sklearn import datasets
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
digits = datasets.load_digits()
X = digits.data
y = digits.target
clf_ = SVC(kernel='rbf')
Cs = [1, 10, 100, 1000]
Gammas = [1e-3, 1e-4]
clf = GridSearchCV(clf_,
                   dict(C=Cs,
                        gamma=Gammas),
                   cv=2,
                   pre_dispatch='1*n_jobs',
                   n_jobs=1)
clf.fit(X, y)
scores = [x[1] for x in clf.grid_scores_]
scores = np.array(scores).reshape(len(Cs), len(Gammas))
for ind, i in enumerate(Cs):
    plt.plot(Gammas, scores[ind], label='C: ' + str(i))
plt.legend()
plt.xlabel('Gamma')
plt.ylabel('Mean score')
plt.show()
Code is based on this.
The only puzzling part: will sklearn always respect the order of C & gamma? The official example relies on this ordering.
Output:
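If you'd rather not rely on the traversal order at all, the score matrix can be rebuilt explicitly from the params recorded for each candidate. A sketch, assuming clf, Cs and Gammas are the objects from the snippet above and a recent sklearn (cv_results_ instead of grid_scores_):
import numpy as np
import matplotlib.pyplot as plt
# Look up each entry's params explicitly, so the plot does not depend on
# the order in which GridSearchCV traversed the grid.
scores = np.empty((len(Cs), len(Gammas)))
for mean, params in zip(clf.cv_results_['mean_test_score'], clf.cv_results_['params']):
    scores[Cs.index(params['C']), Gammas.index(params['gamma'])] = mean
for ind, c in enumerate(Cs):
    plt.plot(Gammas, scores[ind], label='C: ' + str(c))
plt.legend()
plt.show()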
For plotting the results when tuning several hyperparameters, what I did was fix all parameters to their best value except one and plot the mean score over that parameter's values.
def plot_search_results(grid):
    """
    Params:
        grid: A trained GridSearchCV object.
    """
    ## Results from grid search
    results = grid.cv_results_
    means_test = results['mean_test_score']
    stds_test = results['std_test_score']
    means_train = results['mean_train_score']
    stds_train = results['std_train_score']

    ## Getting indexes of values per hyper-parameter
    masks = []
    masks_names = list(grid.best_params_.keys())
    for p_k, p_v in grid.best_params_.items():
        masks.append(list(results['param_'+p_k].data == p_v))

    params = grid.param_grid

    ## Plotting results
    fig, ax = plt.subplots(1, len(params), sharex='none', sharey='all', figsize=(20, 5))
    fig.suptitle('Score per parameter')
    fig.text(0.04, 0.5, 'MEAN SCORE', va='center', rotation='vertical')
    for i, p in enumerate(masks_names):
        m = np.stack(masks[:i] + masks[i+1:])
        best_parms_mask = m.all(axis=0)
        best_index = np.where(best_parms_mask)[0]
        x = np.array(params[p])
        y_1 = np.array(means_test[best_index])
        e_1 = np.array(stds_test[best_index])
        y_2 = np.array(means_train[best_index])
        e_2 = np.array(stds_train[best_index])
        ax[i].errorbar(x, y_1, e_1, linestyle='--', marker='o', label='test')
        ax[i].errorbar(x, y_2, e_2, linestyle='-', marker='^', label='train')
        ax[i].set_xlabel(p.upper())

    plt.legend()
    plt.show()
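A usage note (an assumption worth flagging): because the function reads mean_train_score and std_train_score, the GridSearchCV object has to be fit with return_train_score=True, which defaults to False in recent sklearn versions. For example:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid={'n_estimators': [10, 50, 100],
                                'max_depth': [2, 4, 8]},
                    cv=3,
                    return_train_score=True)  # needed for the train-score keys
grid.fit(X, y)
plot_search_results(grid)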
I wanted to do something similar (but scalable to a large number of parameters) and here is my solution to generate swarm plots of the output:
score = pd.DataFrame(gs_clf.grid_scores_).sort_values(by='mean_validation_score', ascending = False)
for i in parameters.keys():
    print(i, len(parameters[i]), parameters[i])
    score[i] = score.parameters.apply(lambda x: x[i])
l = ['mean_validation_score'] + list(parameters.keys())
for i in list(parameters.keys()):
    sns.swarmplot(data = score[l], x = i, y = 'mean_validation_score')
    #plt.savefig('170705_sgd_optimisation//'+i+'.jpg', dpi = 100)
    plt.show()
The order in which the parameter grid is traversed is deterministic, so it can be reshaped and plotted straightforwardly. Something like this:
scores = [entry.mean_validation_score for entry in grid.grid_scores_]
# the shape is according to the alphabetical order of the parameters in the grid
scores = np.array(scores).reshape(len(C_range), len(gamma_range))
for c_scores in scores:
    plt.plot(gamma_range, c_scores, '-')
Here's a solution that makes use of seaborn's pointplot. The advantage of this method is that it allows you to plot results when searching across more than two parameters.
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

def plot_cv_results(cv_results, param_x, param_z, metric='mean_test_score'):
    """
    cv_results - cv_results_ attribute of a GridSearchCV instance (or similar)
    param_x - name of grid search parameter to plot on x axis
    param_z - name of grid search parameter to plot by line color
    """
    cv_results = pd.DataFrame(cv_results)
    col_x = 'param_' + param_x
    col_z = 'param_' + param_z
    fig, ax = plt.subplots(1, 1, figsize=(11, 8))
    sns.pointplot(x=col_x, y=metric, hue=col_z, data=cv_results, ci=99, n_boot=64, ax=ax)
    ax.set_title("CV Grid Search Results")
    ax.set_xlabel(param_x)
    ax.set_ylabel(metric)
    ax.legend(title=param_z)
    return fig
Example usage with xgboost:
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV
params = {
    'max_depth': [3, 6, 9, 12],
    'gamma': [0, 1, 10, 20, 100],
    'min_child_weight': [1, 4, 16, 64, 256],
}
model = XGBRegressor()
grid = GridSearchCV(model, params, scoring='neg_mean_squared_error')
grid.fit(...)
fig = plot_cv_results(grid.cv_results_, 'gamma', 'min_child_weight')
This will produce a figure with the gamma regularization parameter on the x-axis and the min_child_weight regularization parameter in the line color; any other grid search parameters (in this case max_depth) show up as the spread of the 99% confidence interval of the seaborn pointplot.
Note: in the example figure I have changed the aesthetics slightly from the code above.
I used grid search on xgboost with different learning rates, max depths and number of estimators.
gs_param_grid = {'max_depth': [3, 4, 5],
                 'n_estimators': [x for x in range(3000, 5000, 250)],
                 'learning_rate': [0.01, 0.03, 0.1]
                 }
gbm = XGBRegressor()
grid_gbm = GridSearchCV(estimator=gbm,
                        param_grid=gs_param_grid,
                        scoring='neg_mean_squared_error',
                        cv=4,
                        verbose=1
                        )
grid_gbm.fit(X_train,y_train)
To create the graph for error vs number of estimators with different learning rates, I used the following approach:
y=[]
cvres = grid_gbm.cv_results_
best_md=grid_gbm.best_params_['max_depth']
la=gs_param_grid['learning_rate']
n_estimators=gs_param_grid['n_estimators']
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    if params["max_depth"]==best_md:
        y.append(np.sqrt(-mean_score))
y=np.array(y).reshape(len(la),len(n_estimators))
%matplotlib inline
plt.figure(figsize=(8,8))
for y_arr, label in zip(y, la):
    plt.plot(n_estimators, y_arr, label=label)
plt.title('Error for different learning rates(keeping max_depth=%d(best_param))'%best_md)
plt.legend()
plt.xlabel('n_estimators')
plt.ylabel('Error')
plt.show()
The plot can be viewed here:
Note that the graph can similarly be created for error vs number of estimators with different max depth (or any other parameters as per the user's case).
Here's fully working code that produces plots so you can visualize the variation of up to 3 parameters using GridSearchCV. This is what you will see when running the code:
Parameter1 (x-axis)
Cross Validation Mean Score (y-axis)
Parameter2 (extra line plotted for each different Parameter2 value, with a legend for reference)
Parameter3 (extra charts will pop up for each different Parameter3 value, allowing you to view differences between these different charts)
For each line plotted, also shown is a standard deviation of what you can expect the Cross Validation Mean Score to do based on the multiple CV's you're running. Enjoy!
from sklearn import tree
from sklearn import model_selection
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.datasets import load_digits
digits = load_digits()
X, y = digits.data, digits.target
Algo = [['DecisionTreeClassifier', tree.DecisionTreeClassifier(),                 # algorithm
         'max_depth', [1, 2, 4, 6, 8, 10, 12, 14, 18, 20, 22, 24, 26, 28, 30],    # Parameter1
         'max_features', ['sqrt', 'log2', None],                                  # Parameter2
         'criterion', ['gini', 'entropy']]]                                       # Parameter3
def plot_grid_search(cv_results, grid_param_1, grid_param_2, name_param_1, name_param_2, title):
    # Get Test Scores Mean and std for each grid search
    grid_param_1 = list(str(e) for e in grid_param_1)
    grid_param_2 = list(str(e) for e in grid_param_2)
    scores_mean = cv_results['mean_test_score']
    scores_std = cv_results['std_test_score']
    params_set = cv_results['params']

    scores_organized = {}
    std_organized = {}
    std_upper = {}
    std_lower = {}
    for p2 in grid_param_2:
        scores_organized[p2] = []
        std_organized[p2] = []
        std_upper[p2] = []
        std_lower[p2] = []
        for p1 in grid_param_1:
            for i in range(len(params_set)):
                if str(params_set[i][name_param_1]) == str(p1) and str(params_set[i][name_param_2]) == str(p2):
                    mean = scores_mean[i]
                    std = scores_std[i]
                    scores_organized[p2].append(mean)
                    std_organized[p2].append(std)
                    std_upper[p2].append(mean + std)
                    std_lower[p2].append(mean - std)

    _, ax = plt.subplots(1, 1)
    # Param1 is the X-axis, Param 2 is represented as a different curve (color line)
    # plot means
    for key in scores_organized.keys():
        ax.plot(grid_param_1, scores_organized[key], '-o', label=name_param_2 + ': ' + str(key))
        ax.fill_between(grid_param_1, std_lower[key], std_upper[key], alpha=0.1)

    ax.set_title(title)
    ax.set_xlabel(name_param_1)
    ax.set_ylabel('CV Average Score')
    ax.legend(loc="best")
    ax.grid('on')
    plt.show()
dataset = 'Titanic'
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
cv_split = model_selection.KFold(n_splits=10, shuffle=True, random_state=2)  # shuffle=True is required when random_state is set in recent sklearn
for i in range(len(Algo)):
    name = Algo[0][0]
    alg = Algo[0][1]
    param_1_name = Algo[0][2]
    param_1_range = Algo[0][3]
    param_2_name = Algo[0][4]
    param_2_range = Algo[0][5]
    param_3_name = Algo[0][6]
    param_3_range = Algo[0][7]

    for p in param_3_range:
        # grid search
        param = {
            param_1_name: param_1_range,
            param_2_name: param_2_range,
            param_3_name: [p]
        }
        grid_test = GridSearchCV(alg, param_grid=param, scoring='accuracy', cv=cv_split)
        grid_test.fit(X_train, y_train)
        plot_grid_search(grid_test.cv_results_, param[param_1_name], param[param_2_name], param_1_name, param_2_name, dataset + ' GridSearch Scores: ' + name + ', ' + param_3_name + '=' + str(p))

    param = {
        param_1_name: param_1_range,
        param_2_name: param_2_range,
        param_3_name: param_3_range
    }
    grid_final = GridSearchCV(alg, param_grid=param, scoring='accuracy', cv=cv_split)
    grid_final.fit(X_train, y_train)
    best_params = grid_final.best_params_
    alg.set_params(**best_params)
#nathandrake Try the following, which is adapted from the code by #david-alvarez:
def plot_grid_search(cv_results, metric, grid_param_1, grid_param_2, name_param_1, name_param_2):
    # Get Test Scores Mean and std for each grid search
    scores_mean = cv_results[('mean_test_' + metric)]
    scores_sd = cv_results[('std_test_' + metric)]
    if grid_param_2 is not None:
        scores_mean = np.array(scores_mean).reshape(len(grid_param_2), len(grid_param_1))
        scores_sd = np.array(scores_sd).reshape(len(grid_param_2), len(grid_param_1))

    # Set plot style
    plt.style.use('seaborn')

    # Plot Grid search scores
    _, ax = plt.subplots(1, 1)

    if grid_param_2 is not None:
        # Param1 is the X-axis, Param 2 is represented as a different curve (color line)
        for idx, val in enumerate(grid_param_2):
            ax.plot(grid_param_1, scores_mean[idx, :], '-o', label=name_param_2 + ': ' + str(val))
    else:
        # If only one Param1 is given
        ax.plot(grid_param_1, scores_mean, '-o')

    ax.set_title("Grid Search", fontsize=20, fontweight='normal')
    ax.set_xlabel(name_param_1, fontsize=16)
    ax.set_ylabel('CV Average ' + str.capitalize(metric), fontsize=16)
    ax.legend(loc="best", fontsize=15)
    ax.grid('on')
As you can see, I added the ability to support grid searches that include multiple metrics. You simply specify the metric you want to plot in the call to the plotting function.
Also, if your grid search only tuned a single parameter you can simply specify None for grid_param_2 and name_param_2.
Call it as follows:
plot_grid_search(grid_search.cv_results_,
                 'Accuracy',
                 list(np.linspace(0.001, 10, 50)),
                 ['linear', 'rbf'],
                 'C',
                 'kernel')
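And a sketch of the single-parameter case mentioned above (assuming the grid only tuned C, with the same 'Accuracy' metric):
plot_grid_search(grid_search.cv_results_,
                 'Accuracy',
                 list(np.linspace(0.001, 10, 50)),
                 None,   # no second parameter was tuned
                 'C',
                 None)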
This worked for me when I was trying to plot mean scores vs. the number of trees in a random forest. The reshape() function helps to compute the averages.
param_n_estimators = cv_results['param_n_estimators']
param_n_estimators = np.array(param_n_estimators)
mean_n_estimators = np.mean(param_n_estimators.reshape(-1,5), axis=0)
mean_test_scores = cv_results['mean_test_score']
mean_test_scores = np.array(mean_test_scores)
mean_test_scores = np.mean(mean_test_scores.reshape(-1,5), axis=0)
mean_train_scores = cv_results['mean_train_score']
mean_train_scores = np.array(mean_train_scores)
mean_train_scores = np.mean(mean_train_scores.reshape(-1,5), axis=0)
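A hedged alternative to the reshape trick above: the same per-value averages can be computed without assuming any particular grid ordering by grouping a cv_results_ DataFrame on the parameter column (here cv_results is assumed to be grid.cv_results_ and the grid to include n_estimators):
import pandas as pd
results = pd.DataFrame(cv_results)
# Average the CV test score over all other hyperparameter settings
# for each value of n_estimators.
mean_by_n_estimators = results.groupby('param_n_estimators')['mean_test_score'].mean()
print(mean_by_n_estimators)
mean_by_n_estimators.plot(marker='o')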

How do I get the components for LDA in scikit-learn?

When using PCA in sklearn, it's easy to get out the components:
from sklearn import decomposition
pca = decomposition.PCA(n_components=n_components)
pca_data = pca.fit(input_data)
pca_components = pca.components_
But I can't for the life of me figure out how to get the components out of LDA, as there is no components_ attribute. Is there a similar attribute in sklearn lda?
In the case of PCA, the documentation is clear. The pca.components_ are the eigenvectors.
In the case of LDA, we need the lda.scalings_ attribute.
Visual example using iris data and sklearn:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
iris = datasets.load_iris()
X = iris.data
y = iris.target
#In general it is a good idea to scale the data
scaler = StandardScaler()
scaler.fit(X)
X=scaler.transform(X)
lda = LinearDiscriminantAnalysis()
lda.fit(X,y)
x_new = lda.transform(X)
Verify that the lda.scalings_ are the eigenvectors:
print(lda.scalings_)
print(lda.transform(np.identity(4)))
[[-0.67614337 0.0271192 ]
[-0.66890811 0.93115101]
[ 3.84228173 -1.63586613]
[ 2.17067434 2.13428251]]
[[-0.67614337 0.0271192 ]
[-0.66890811 0.93115101]
[ 3.84228173 -1.63586613]
[ 2.17067434 2.13428251]]
Additionally here is a useful function to plot the biplot and verify visually:
def myplot(score, coeff, labels=None):
    xs = score[:,0]
    ys = score[:,1]
    n = coeff.shape[0]
    plt.scatter(xs, ys, c=y)  # without scaling
    for i in range(n):
        plt.arrow(0, 0, coeff[i,0], coeff[i,1], color='r', alpha=0.5)
        if labels is None:
            plt.text(coeff[i,0]*1.15, coeff[i,1]*1.15, "Var"+str(i+1), color='g', ha='center', va='center')
        else:
            plt.text(coeff[i,0]*1.15, coeff[i,1]*1.15, labels[i], color='g', ha='center', va='center')
    plt.xlabel("LD{}".format(1))
    plt.ylabel("LD{}".format(2))
    plt.grid()

#Call the function.
myplot(x_new[:,0:2], lda.scalings_)
plt.show()
Results
My reading of the code is that the coef_ attribute is used to weight each of the components when scoring a sample's features against the different classes. scalings_ holds the eigenvectors and xbar_ is the mean. In the spirit of UTSL, here's the source for the decision function:
https://github.com/scikit-learn/scikit-learn/blob/6f32544c51b43d122dfbed8feff5cd2887bcac80/sklearn/discriminant_analysis.py#L166
In PCA, the transform operation uses self.components_.T (see the code):
X_transformed = np.dot(X, self.components_.T)
In LDA, the transform operation uses self.scalings_ (see the code):
X_new = np.dot(X, self.scalings_)
Note the .T which transposes the array in the PCA, and not in LDA:
PCA: components_ : array, shape (n_components, n_features)
LDA: scalings_ : array, shape (n_features, n_classes - 1)
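A small verification sketch (assuming the iris example above and the default solver='svd', which is where xbar_ is set): the LDA transform should match centering on xbar_ followed by projection onto scalings_.
import numpy as np
# For solver='svd' (the default), transform(X) centers on xbar_ and projects onto
# scalings_, keeping the first n_components columns.
n_kept = lda.transform(X).shape[1]
manual = np.dot(X - lda.xbar_, lda.scalings_)[:, :n_kept]
print(np.allclose(lda.transform(X), manual))  # expected: True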
