I've built an XGBoost model and seek to examine the individual estimators. For reference, this was a binary classification task with discrete and continuous input features. The input feature matrix is a scipy.sparse.csr_matrix.
When I went to examine an individual estimator, however, I had difficulty interpreting the binary input features, such as f60150 below. The real-valued f60150 in the bottommost chart is easy to interpret - its criterion is in the expected range of that feature. However, the comparison being made for the binary features, <X> < -9.53674e-07, doesn't make sense. Each of these features is either 1 or 0. -9.53674e-07 is a very small negative number, and I imagine this is just some floating-point idiosyncrasy within XGBoost or its underlying plotting libraries, but it doesn't make sense to use that comparison when the feature is always non-negative. Can someone help me understand which direction (i.e. yes/missing vs. no) corresponds to which true/false side of these binary feature nodes?
Here is a reproducible example:
import numpy as np
import scipy.sparse
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from xgboost import plot_tree, XGBClassifier
import matplotlib.pyplot as plt
def booleanize_csr_matrix(mat):
    ''' Convert sparse matrix with positive integer elements to 1s '''
    nnz_inds = mat.nonzero()
    keep = np.where(mat.data > 0)[0]
    n_keep = len(keep)
    result = scipy.sparse.csr_matrix(
        (np.ones(n_keep), (nnz_inds[0][keep], nnz_inds[1][keep])),
        shape=mat.shape
    )
    return result
### Setup dataset
res = fetch_20newsgroups()
text = res.data
outcome = res.target
### Use default params from CountVectorizer to create initial count matrix
vec = CountVectorizer()
X = vec.fit_transform(text)
# Whether to "booleanize" the input matrix
booleanize = True
# Whether to, after "booleanizing", convert the data type to match what's returned by `vec.fit_transform(text)`
to_int = True
if booleanize and to_int:
    X = booleanize_csr_matrix(X)
    X = X.astype(np.int64)
# Make it a binary classification problem
y = np.where(outcome == 1, 1, 0)
# Random state ensures we will be able to compare trees and their features consistently
model = XGBClassifier(random_state=100)
model.fit(X, y)
plot_tree(model, rankdir='LR'); plt.show()
Running the above with booleanize and to_int set to True yields the following chart:
Running the above with booleanize and to_int set to False yields the following chart:
Heck, even if I do a really simple example, I get the "right" results, regardless of whether X or y are integer or floating types.
X = np.matrix(
[
[1,0],
[1,0],
[0,1],
[0,1],
[1,1],
[1,0],
[0,0],
[0,0],
[1,1],
[0,1]
]
)
y = np.array([1,0,0,0,1,1,1,0,1,1])
model = XGBClassifier(random_state=100)
model.fit(X, y)
plot_tree(model, rankdir='LR'); plt.show()
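For reference, the split conditions and the yes/no/missing branch assignments can also be read as text rather than off the plot, by dumping the fitted booster; a quick sketch using the model fitted above (the trees_to_dataframe alternative assumes a reasonably recent xgboost version):
# Dump each tree as text; every node line shows the split condition plus
# which child node is taken for "yes", "no", and "missing".
booster = model.get_booster()
for i, tree_txt in enumerate(booster.get_dump(with_stats=False)[:2]):
    print(f"--- tree {i} ---")
    print(tree_txt)

# Newer xgboost versions also expose the same information as a DataFrame
# (columns include Feature, Split, Yes, No, Missing):
# print(booster.trees_to_dataframe().head())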
I use an sklearn LinearRegression() estimator with 5 variables
['feat1', 'feat2', 'feat3', 'feat4', 'feat5']
in order to predict a continuous value.
The estimator returns the list of coefficient values and the bias (intercept):
linear = LinearRegression()
print(linear.coef_)
print(linear.intercept_)
[ 0.18799409 -0.05406106 -0.01327966 -0.13348129 -0.00614054]
-0.011064865422734674
Then, given that I have each feature as a variable, I can hardcode the coefficients into a linear formula and estimate my values, like so:
val = ((0.18799409*feat1) - (0.05406106*feat2) - (0.01327966*feat3) - (0.13348129*feat4) - (0.00614054*feat5)) -0.011064865422734674
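For reference, this hardcoded formula is just a dot product, so with the feature values collected into an array it could equivalently be written as the following (a small sketch, assuming the fitted linear object from above):
import numpy as np

# equivalent vectorized form of the hardcoded formula
feats = np.array([feat1, feat2, feat3, feat4, feat5])
val = feats @ linear.coef_ + linear.intercept_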
Now let's say I use a polynomial regression of degree 2, using a pipeline, and print the following:
model = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('polynomial_features', PolynomialFeatures(degree=degree, include_bias=False)),
    ('linear_regression', LinearRegression())])
#fit model
model.fit(X_train, y_train)
print(model['linear_regression'].coef_)
print(model['linear_regression'].intercept_)
I get:
[ 7.06524186e-01 -2.98605001e-02 -4.67175212e-02 -4.86890790e-01
-1.06320101e-02 -2.77958604e-03 -3.38253025e-04 -7.80563090e-03
4.51356888e-03 8.32036733e-03 3.57638244e-02 -2.16446849e-02
-7.92169287e-02 3.36809467e-02 -6.60531497e-03 2.16613331e-02
2.10097993e-02 3.49970303e-02 -3.02970698e-02 -7.81462599e-03]
0.011042927069084668
How do I transform the formula above in order to calculate val from the regression, with values from .coef_ and .intercept_, using array indexing instead of hardcoding the values, for any degree n?
Is there any scipy or numpy method suited for that?
It's important to note that polynomial regression is just an extended case of linear regression, thus all we need to do is transform our input data consistently. For any N we can use PolynomialFeatures from sklearn.preprocessing. Using dummy data, we can see how this would work:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
#set parameters
X = np.stack([np.arange(i,i+10) for i in range(5)]).T
Y = np.random.randn(10)*10+3
N = 2
poly_reg=PolynomialFeatures(degree=N,include_bias=False)
X_poly=poly_reg.fit_transform(X)
#print(X[0], X_poly[0])  # to inspect the expanded features (with include_bias=False there is no intercept column of 1s)
poly = LinearRegression().fit(X_poly, Y)
And thus, we can get the coef_ the way you were doing before, and simply perform a matrix multiplication to get the regressed value.
new_dat = poly_reg.transform(np.arange(2,2+10,2)[None])  # one new data point with 5 features
np.testing.assert_array_equal(poly.predict(new_dat), new_dat @ poly.coef_ + poly.intercept_)
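As a side note, if you want to see which coefficient belongs to which polynomial term, recent scikit-learn versions let you read the generated feature names off the transformer (a small sketch using the objects above):
# Map each coefficient to the polynomial term it multiplies
# (get_feature_names_out is available in recent scikit-learn versions)
for name, coef in zip(poly_reg.get_feature_names_out(), poly.coef_):
    print(f"{name}: {coef:.4f}")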
----EDIT----
In case you cannot use the transform from PolynomialFeatures, it's just an iterated combination loop to generate the data from your list of features.
new_feats = np.array([feat1,feat2,feat3,feat4,feat5])
from itertools import combinations_with_replacement
def gen_poly_feats(x, N):
    # return all unique groupings (with replacement) of the indices into the array x, for use in polynomial regression
    return np.concatenate([
        [np.prod(x[np.array(i)]) for i in combinations_with_replacement(range(len(x)), n)]
        for n in range(1, N + 1)
    ])[None]
new_feats_poly = gen_poly_feats(new_feats,N)
# just to be sure that this matches...
np.testing.assert_array_equal(new_feats_poly,poly_reg.transform(new_feats[None]))
#then we can use the above linear regression model to predict the new data
val = new_feats_poly @ poly.coef_ + poly.intercept_
Based on the guide Implementing PCA in Python by Sebastian Raschka, I am building the PCA algorithm from scratch for my research purposes. The class definition is:
import numpy as np

class PCA(object):
    """Dimension Reduction using Principal Component Analysis (PCA)

    It is the process of computing the principal components which explain the
    maximum variation of the dataset using fewer components.

    :type n_components: int, optional
    :param n_components: Number of components to consider; if not set then
                         `n_components = min(n_samples, n_features)`, where
                         `n_samples` is the number of samples, and
                         `n_features` is the number of features (i.e.,
                         dimension of the dataset).

    Attributes
    ==========
    :type covariance_: np.ndarray
    :param covariance_: Covariance Matrix

    :type eig_vals_: np.ndarray
    :param eig_vals_: Calculated Eigenvalues

    :type eig_vecs_: np.ndarray
    :param eig_vecs_: Calculated Eigenvectors

    :type explained_variance_: np.ndarray
    :param explained_variance_: Explained Variance of Each Principal Component

    :type cum_explained_variance_: np.ndarray
    :param cum_explained_variance_: Cumulative Explained Variance
    """

    def __init__(self, n_components : int = None):
        """Default Constructor for Initialization"""
        self.n_components = n_components

    def fit_transform(self, X : np.ndarray):
        """Fit the PCA algorithm to the Dataset"""
        if not self.n_components:
            self.n_components = min(X.shape)

        self.covariance_ = np.cov(X.T)

        # calculate eigenvalues and eigenvectors
        self.eig_vals_, self.eig_vecs_ = np.linalg.eig(self.covariance_)

        # explained variance
        _tot_eig_vals = sum(self.eig_vals_)
        self.explained_variance_ = np.array([(i / _tot_eig_vals) * 100 for i in sorted(self.eig_vals_, reverse = True)])
        self.cum_explained_variance_ = np.cumsum(self.explained_variance_)

        # define `W` as a `d x k`-dimensional matrix
        self.W_ = self.eig_vecs_[:, :self.n_components]

        print(X.shape, self.W_.shape)
        return X.dot(self.W_)
Considering the iris dataset as a test case, PCA is performed and visualized as follows:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# loading iris data, and normalize
from sklearn.datasets import load_iris
iris = load_iris()
from sklearn.preprocessing import MinMaxScaler
X, y = iris.data, iris.target
X = MinMaxScaler().fit_transform(X)
# using the PCA function (defined above)
# to fit_transform the X value
# naming the PCA object as dPCA (d = defined)
dPCA = PCA()
principalComponents = dPCA.fit_transform(X)
# creating a pandas dataframe for the principal components
# and visualize the data using scatter plot
PCAResult = pd.DataFrame(principalComponents, columns = [f"PCA-{i}" for i in range(1, dPCA.n_components + 1)])
PCAResult["target"] = y # possible as original order does not change
sns.scatterplot(x = "PCA-1", y = "PCA-2", data = PCAResult, hue = "target", s = 50)
plt.show()
The output is as follows:
Now, I wanted to verify the output, for which I used sklearn library, and the output is as follows:
from sklearn.decomposition import PCA # note the same name
sPCA = PCA() # consider all the components
principalComponents_ = sPCA.fit_transform(X)
PCAResult_ = pd.DataFrame(principalComponents_, columns = [f"PCA-{i}" for i in range(1, 5)])
PCAResult_["target"] = y # possible as original order does not change
sns.scatterplot(x = "PCA-1", y = "PCA-2", data = PCAResult_, hue = "target", s = 50)
plt.show()
I don't understand why the output is oriented differently and has minor differences in values. I studied numerous codes [1, 2, 3], all of which have the same issue. My questions:
What is different in sklearn, such that the plot is different? I've tried with a different dataset too - the same problem.
Is there a way to fix this issue?
I was not able to study the sklearn.decomposition.PCA algorithm, as I am new to OOP concepts in Python.
The output in the blog post by Sebastian Raschka also has a minor variation. Figure below:
When calculating an eigenvector you may change its sign and the solution will also be a valid one.
So any PCA axis can be reversed and the solution will be valid.
Nevertheless, you may wish to impose a positive correlation of a PCA axis with one of the original variables in the dataset, inverting the axis if needed.
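A minimal sketch of such a sign convention, assuming the eig_vecs_ columns produced by the PCA class above (flip each eigenvector so that its largest-magnitude entry is positive, similar in spirit to what sklearn's svd_flip does):
import numpy as np

def fix_signs(eig_vecs):
    """Flip each eigenvector (column) so its largest-|value| entry is positive."""
    max_abs_rows = np.argmax(np.abs(eig_vecs), axis=0)
    signs = np.sign(eig_vecs[max_abs_rows, np.arange(eig_vecs.shape[1])])
    signs[signs == 0] = 1  # guard against an exactly-zero pivot
    return eig_vecs * signs

# e.g. inside fit_transform, before projecting:
# self.W_ = fix_signs(self.eig_vecs_)[:, :self.n_components]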
The difference in values comes from sklearn's PCA using SVD decomposition. In sklearn there's a function svd_flip used to flip the PCs, which explains why you see this flip.
More details on the help page:
It uses the LAPACK implementation of the full SVD or a randomized
truncated SVD by the method of Halko et al. 2009, depending on the
shape of the input data and the number of components to extract.
You can read about the relation here
We first run your example dataset:
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
from sklearn.utils.extmath import svd_flip
import pandas as pd
import numpy as np
import scipy
iris = load_iris()
X, y = iris.data, iris.target
X = MinMaxScaler().fit_transform(X)
n_components = 4
sPCA = PCA(n_components,svd_solver="full")
sklearnPCs = pd.DataFrame(sPCA.fit_transform(X))
We now perform SVD on your centered matrix:
U,S,Vt = scipy.linalg.svd(X - X.mean(axis=0))
U = U[:,:n_components]
U, Vt = svd_flip(U, Vt)
svdPCs = pd.DataFrame(U*S)
The results:
sklearnPCs
0 1 2 3
0 -0.630703 0.107578 -0.018719 -0.007307
1 -0.622905 -0.104260 -0.049142 -0.032359
2 -0.669520 -0.051417 0.019644 -0.007434
3 -0.654153 -0.102885 0.023219 0.020114
4 -0.648788 0.133488 0.015116 0.011786
.. ... ... ... ...
145 0.551462 0.059841 0.086283 -0.110092
146 0.407146 -0.171821 -0.004102 -0.065241
147 0.447143 0.037560 0.049546 -0.032743
148 0.488208 0.149678 0.239209 0.002864
149 0.312066 -0.031130 0.118672 0.052505
svdPCs
0 1 2 3
0 -0.630703 0.107578 -0.018719 -0.007307
1 -0.622905 -0.104260 -0.049142 -0.032359
2 -0.669520 -0.051417 0.019644 -0.007434
3 -0.654153 -0.102885 0.023219 0.020114
4 -0.648788 0.133488 0.015116 0.011786
.. ... ... ... ...
145 0.551462 0.059841 0.086283 -0.110092
146 0.407146 -0.171821 -0.004102 -0.065241
147 0.447143 0.037560 0.049546 -0.032743
148 0.488208 0.149678 0.239209 0.002864
149 0.312066 -0.031130 0.118672 0.052505
You can also implement this without the flip. Up to the sign of each component the values will be the same, and your PCA will still be valid, as noted in the other answer.
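A quick sanity check one could add, continuing the snippet above, to confirm that the unflipped SVD scores match sklearn's up to a per-column sign:
# PCs computed without svd_flip differ from sklearn's only by a per-column sign
U2, S2, Vt2 = scipy.linalg.svd(X - X.mean(axis=0))
noflip = U2[:, :n_components] * S2[:n_components]
signs = np.sign((noflip * sklearnPCs.to_numpy()).sum(axis=0))
np.testing.assert_allclose(noflip * signs, sklearnPCs.to_numpy(), atol=1e-10)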
Today I'm working on a dataset from Kaggle: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data. I would like to segment my dataset by beds, baths, and neighborhood, and use DBSCAN to get a clustering by price within each segment. The problem is that because each segment is different, I don't want to use the same epsilon for the whole dataset but rather the best epsilon for each segment. Do you know an efficient way to do it?
from sklearn.cluster import DBSCAN
import sklearn.utils
from sklearn.preprocessing import StandardScaler
sklearn.utils.check_random_state(1000)
Clus_dataSet = pdf[['beds','baths','neighborhood','price']]
Clus_dataSet = np.nan_to_num(Clus_dataSet)
Clus_dataSet = StandardScaler().fit_transform(Clus_dataSet)
# Compute DBSCAN
db = DBSCAN(eps=0.3, min_samples=6).fit(Clus_dataSet)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
pdf["Clus_Db"]=labels
realClusterNum=len(set(labels)) - (1 if -1 in labels else 0)
clusterNum = len(set(labels))
Thank you.
A heuristic for setting the Epsilon and MinPts parameters was proposed in the original DBSCAN paper.
Once the MinPts value is set (e.g. 2 * number of features), the partitioning result strongly depends on Epsilon. The heuristic suggests inferring Epsilon through a visual analysis of the k-dist plot.
A toy example of the procedure with two gaussian distributions is reported in the following.
from sklearn.neighbors import NearestNeighbors
from matplotlib import pyplot as plt
from sklearn.datasets import make_biclusters
data,lab,_ = make_biclusters((200,2), 2, noise=0.1, minval=0, maxval=1)
minpts = 4
nbrs = NearestNeighbors(n_neighbors=minpts, algorithm='ball_tree').fit(data)
distances, indices = nbrs.kneighbors(data)
k_dist = [x[-1] for x in distances]
f,ax = plt.subplots(1,2,figsize = (10,5))
ax[0].set_title('k-dist plot for k = minpts = 4')
ax[0].plot(sorted(k_dist))
ax[0].set_xlabel('object index after sorting by k-distance')
ax[0].set_ylabel('k-distance')
ax[1].set_title('original data')
ax[1].scatter(data[:,0],data[:,1],c = lab[0])
In the resulting k-dist plot, the "elbow" theoretically divides noise objects from cluster objects and indeed gives an indication of a plausible range of values for Epsilon (tailored to the dataset in combination with the selected value of MinPts). In this toy example, I would say between 0.05 and 0.075.
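Coming back to the per-segment question: one rough way to automate this (my own sketch, not from the paper) is to replace the visual elbow reading with something crude like a high percentile of the k-distances, and fit a separate DBSCAN per (beds, baths, neighborhood) group. Here pdf is the DataFrame from the question, and the 90th-percentile choice is just an assumption standing in for the visual inspection:
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

minpts = 6

def estimate_eps(values, k=minpts, pct=90):
    # crude stand-in for reading the elbow off the k-dist plot
    nbrs = NearestNeighbors(n_neighbors=k).fit(values)
    k_dist = nbrs.kneighbors(values)[0][:, -1]
    return np.percentile(k_dist, pct)

# pdf is the housing DataFrame from the question
pdf['Clus_Db'] = -1
for _, seg in pdf.groupby(['beds', 'baths', 'neighborhood']):
    prices = seg[['price']].to_numpy()
    if len(prices) < 2 * minpts:           # too few listings to cluster sensibly
        continue
    eps = estimate_eps(prices)
    if eps == 0:                           # e.g. all prices identical in this segment
        continue
    db = DBSCAN(eps=eps, min_samples=minpts).fit(prices)
    pdf.loc[seg.index, 'Clus_Db'] = db.labels_   # cluster labels are per-segment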
I need to find the features with the maximum correlation with the first 2 principal components.
This is a training task and my result is wrong (all 4 features have a higher correlation with the 1st component):
from sklearn import datasets
iris = datasets.load_iris()
data = iris.data
target = iris.target
target_names = iris.target_names
means = np.mean(data, axis=0)
X = data - means
from sklearn.decomposition import PCA
model = PCA(n_components=2)
model.fit(X)
proect_data = model.transform(X)
proect_data_abs = np.absolute(proect_data)
means = np.mean(proect_data_abs, axis=0)
Y = proect_data_abs - means
corr_array = np.corrcoef(X.T, Y.T)
You do not provide any justification for taking the absolute value of your transformed data, and it is very unclear why you do it.
If that part is removed, which also makes subtracting the mean again unnecessary, you get the expected results, and it's easy to read off which features have the highest correlation with the principal components:
Y = proect_data
corr_array = np.corrcoef(X.T, Y.T)
corr_array[4:,:4]
array([[ 0.89754488, -0.38999338, 0.99785405, 0.96648418],
[ 0.39023141, 0.82831259, -0.04903006, -0.04818017]])
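For instance, to read off the index of the feature with the largest absolute correlation for each of the two components, something like this works on the corr_array above:
# rows of this sub-block are the 2 principal components, columns are the 4 features
pc_feature_corr = corr_array[4:, :4]
best_features = np.argmax(np.abs(pc_feature_corr), axis=1)
print(best_features)   # feature index with max |correlation| per component, e.g. [2 1] for the values shown above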
Once I have normalized my data with an sklearn L2 normalizer and used it as training data:
How do I turn the predicted output back into the "raw" shape?
In my example I used normalized housing prices as y and normalized living space as x. Each was used to fit its own X_ and Y_ normalizer.
The y_predict is therefore also in the normalized shape; how do I turn it back into the original raw currency values?
Thank you.
If you are talking about sklearn.preprocessing.Normalizer, which normalizes matrix rows, unfortunately there is no way to go back to the original norms unless you store them by hand somewhere.
If you are using sklearn.preprocessing.StandardScaler, which normalizes columns, then you can obtain the values you need to go back from the attributes of that scaler (mean_ if with_mean is set to True, and scale_).
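For example, with a hypothetical scaler fitted on the target values, going back would look roughly like this:
import numpy as np
from sklearn.preprocessing import StandardScaler

y = np.array([[200000.0], [350000.0], [500000.0]])   # made-up raw prices
y_scaler = StandardScaler().fit(y)
y_norm = y_scaler.transform(y)

# back to the original scale, either via the stored attributes...
y_raw = y_norm * y_scaler.scale_ + y_scaler.mean_
# ...or directly:
y_raw_2 = y_scaler.inverse_transform(y_norm)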
If you use the normalizer in a pipeline, you wouldn't need to worry about this, because you wouldn't modify your data in place:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
# classifier example
from sklearn.svm import SVC
pipeline = make_pipeline(Normalizer(), SVC())
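If the target itself needs scaling, a related option (not mentioned above, but standard sklearn) is TransformedTargetRegressor, which applies the inverse transform for you at prediction time:
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# scales y internally for fitting and inverse-transforms predictions
# back to the raw scale automatically
reg = TransformedTargetRegressor(regressor=LinearRegression(),
                                 transformer=StandardScaler())
# reg.fit(X_train, y_train); reg.predict(X_new) is already in raw units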
Thank you very much for your answer, I didn't know about the pipeline feature before
For the case of L2 normalization, it turns out you can do it manually.
Here is an example for a small array:
import numpy as np
from sklearn import preprocessing

x = np.array([5, 8, 12, 15])

# Using sklearn (Normalizer expects a 2-D array, so reshape to a single row)
normalizer_x = preprocessing.Normalizer(norm="l2").fit(x.reshape(1, -1))
x_norm = normalizer_x.transform(x.reshape(1, -1))[0]
print(x_norm)
# [0.23363466 0.37381545 0.56072318 0.70090397]
Or do it manually, dividing by the square root of the sum of squares:
# Manually
w = np.sqrt(np.sum(x**2))
x_norm2 = x / w
print(x_norm2)
# [0.23363466 0.37381545 0.56072318 0.70090397]
So turning them "back" to the raw format is as simple as multiplying by w.
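A quick check of that, continuing the snippet above:
# multiplying the normalized values by w recovers the original array
x_back = x_norm2 * w
print(x_back)
# [ 5.  8. 12. 15.]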