Vehicle gear prediction using clustering algorithm (machine learning) [closed] - python

I am trying to predict which gear the vehicle is being driven in.
I have Engine_Speed and vehicle_Speed columns in the data set.
I have tried the k-means clustering algorithm, but it didn't succeed.
Which algorithm should I use, and how do I implement it in Python?

Looking at the vehicle speed in relation to the engine speed, the different slopes should give the different gears.
My initial reaction would be to say that this is a linear regression problem. You don't have enough data for anything else. Looking at the data, though, we can see that it is actually two linear regression problems:
[Figure: Engine speed vs. vehicle speed]
There is an inflection point at about 700 revs, so you should design a cutoff that selects one of two regression lines, depending on whether you are above or below the cutoff.
To determine the regression in Python, you can use any number of packages. In scikit-learn it looks like this:
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
The example given there, using the Python console, is
>>> import numpy as np
>>> from sklearn.linear_model import LinearRegression
>>> X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
>>> # y = 1 * x_0 + 2 * x_1 + 3
>>> y = np.dot(X, np.array([1, 2])) + 3
>>> reg = LinearRegression().fit(X, y)
>>> reg.score(X, y)
1.0
>>> reg.coef_
array([1., 2.])
>>> reg.intercept_
3.0000...
>>> reg.predict(np.array([[3, 5]]))
array([16.])
Obviously you need to put your own data in X and y and in fact you would want two arrays for the two sections of your graph. You would also have two reg = LinearRegression().fit(X, y) expressions, and an if statement deciding which reg to use, depending on the input. The inflection point is at the intersection of your two regression lines.
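As a minimal sketch of that setup (the numbers below are made up and only stand in for the two sections of your data):
import numpy as np
from sklearn.linear_model import LinearRegression

cutoff = 700  # engine speed separating the two sections

# illustrative points below the cutoff
X_low = np.array([[300], [400], [500], [600]])
y_low = np.array([2.0, 3.0, 4.0, 5.0])
# illustrative points above the cutoff
X_high = np.array([[800], [1000], [1200], [1400]])
y_high = np.array([10.0, 16.0, 22.0, 28.0])

reg_low = LinearRegression().fit(X_low, y_low)
reg_high = LinearRegression().fit(X_high, y_high)

def predict_vehicle_speed(engine_speed):
    # pick the regression line according to which side of the cutoff we are on
    reg = reg_low if engine_speed < cutoff else reg_high
    return reg.predict(np.array([[engine_speed]]))[0]

print(predict_vehicle_speed(550), predict_vehicle_speed(1100))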
The two regression lines have the form y = m1 x + c1 and y = m2 x + c2, where m1, m2 are the gradients of the lines and c1, c2 the intercepts. At the point of intersection m1x + c1 = m2x + c2. If you don't want to do the maths, then you can use Shapely:
import shapely
from shapely.geometry import LineString, Point

# A and B are two points on the first regression line,
# C and D are two points on the second; the coordinates here are only placeholders
A, B = (0, -1.0), (1000, 9.0)
C, D = (0, -15.0), (1000, 15.0)

line1 = LineString([A, B])
line2 = LineString([C, D])

int_pt = line1.intersection(line2)
point_of_intersection = int_pt.x, int_pt.y
print(point_of_intersection)
(taken from this answer on Stack Overflow: How do I compute the intersection point of two lines?)
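If you would rather do the maths directly, the intersection follows from equating the two lines. A sketch with made-up gradients and intercepts (in practice, use the coef_ and intercept_ values from your two fitted regressions):
m1, c1 = 0.01, -1.0   # illustrative gradient and intercept of the first line
m2, c2 = 0.03, -15.0  # illustrative gradient and intercept of the second line
# m1*x + c1 = m2*x + c2  =>  x = (c2 - c1) / (m1 - m2)
x_int = (c2 - c1) / (m1 - m2)
y_int = m1 * x_int + c1
print(x_int, y_int)  # 700.0 6.0 for these made-up values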
After discussion with Sanjiv, here is the updated code (adapted from here: https://machinelearningmastery.com/clustering-algorithms-with-python/)
import os
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('TkAgg')
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

df = pd.read_excel("GearPredictionSanjiv.xlsx", sheet_name='FullData')
x = round(df['Engine_speed'])
y = df['Vehicle_speed']

# Add the engine-speed/vehicle-speed ratio if it is not already there
if 'Ratio' not in df.columns or not os.path.exists('dataset.xlsx'):
    df['Ratio'] = round(x / y)

# Cluster on the ratio (one cluster per gear)
X = df[['Ratio']]
model = KMeans(n_clusters=5)
# Fit the model
model.fit(X)
# Assign a cluster to each example
yhat = model.predict(X)
# Plot the ratio of each point, coloured by its cluster
plt.scatter(yhat, X['Ratio'], c=yhat, cmap=plt.cm.coolwarm)
# Show the plot
plt.show()
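A possible follow-up (an assumption, not part of the discussion above): the k-means labels are arbitrary, so if you want the clusters to read as gears 1-5 you can order them by their ratio centre, since lower gears have a higher engine-speed/vehicle-speed ratio:
# assumes X = df[['Ratio']] and the fitted model from the block above
order = np.argsort(model.cluster_centers_[:, 0])[::-1]  # highest ratio first = lowest gear
label_to_gear = {label: gear + 1 for gear, label in enumerate(order)}
df['Gear'] = [label_to_gear[label] for label in yhat]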

The question is somewhat confusing.
I assume you want to infer the vehicle speed from the engine speed. In that case there is only one feature in this dataset (the engine speed) and the class label is the vehicle speed. A simple IF-THEN-ELSE could actually solve this, but for the sake of answering your question with a machine learning approach (e.g., a decision tree), here is how to solve it as a classification problem using scikit-learn in Python.
import numpy as np
from sklearn import tree
from sklearn.metrics import accuracy_score
### np.reshape(array, (-1, 1)) is to convert the array to 2D array
engine_speed = np.reshape([1124, 974, 405, 865, 754, 200], (-1, 1))
vehicle_speed = np.reshape([5, 4, 3, 4, 4, 2], (-1, 1))
test_engine_speed = np.reshape([1000, 900, 800, 700, 600, 500, 400], (-1, 1))
test_vehicle_speed = np.reshape([5, 4, 4, 4, 4, 3, 3], (-1, 1))
clf = tree.DecisionTreeClassifier()
clf = clf.fit(engine_speed, vehicle_speed)
y_pred = clf.predict(test_engine_speed)
print(accuracy_score(test_vehicle_speed, y_pred))
print(test_vehicle_speed.ravel()) # ravel() is to convert 2D array to 1D array
print(y_pred.ravel()) # ravel() is to convert 2D array to 1D array
I hope this is helpful.

Related

Implementation of Principal Component Analysis from Scratch Orients the Data Differently than scikit-learn

Based on the guide Implementing PCA in Python by Sebastian Raschka, I am building the PCA algorithm from scratch for my research. The class definition is:
import numpy as np

class PCA(object):
    """Dimension Reduction using Principal Component Analysis (PCA)

    It is the process of computing the principal components, which explain the
    maximum variation of the dataset using fewer components.

    :type n_components: int, optional
    :param n_components: Number of components to consider; if not set then
                         `n_components = min(n_samples, n_features)`, where
                         `n_samples` is the number of samples, and
                         `n_features` is the number of features (i.e.,
                         dimension of the dataset).

    Attributes
    ==========
    :type covariance_: np.ndarray
    :param covariance_: Covariance Matrix

    :type eig_vals_: np.ndarray
    :param eig_vals_: Calculated Eigen Values

    :type eig_vecs_: np.ndarray
    :param eig_vecs_: Calculated Eigen Vectors

    :type explained_variance_: np.ndarray
    :param explained_variance_: Explained Variance of Each Principal Component

    :type cum_explained_variance_: np.ndarray
    :param cum_explained_variance_: Cumulative Explained Variance
    """

    def __init__(self, n_components: int = None):
        """Default Constructor for Initialization"""
        self.n_components = n_components

    def fit_transform(self, X: np.ndarray):
        """Fit the PCA algorithm to the Dataset"""
        if not self.n_components:
            self.n_components = min(X.shape)

        self.covariance_ = np.cov(X.T)

        # calculate eigenvalues and eigenvectors of the covariance matrix
        self.eig_vals_, self.eig_vecs_ = np.linalg.eig(self.covariance_)

        # explained variance
        _tot_eig_vals = sum(self.eig_vals_)
        self.explained_variance_ = np.array([(i / _tot_eig_vals) * 100 for i in sorted(self.eig_vals_, reverse=True)])
        self.cum_explained_variance_ = np.cumsum(self.explained_variance_)

        # define `W` as the `d x k`-dimensional projection matrix
        self.W_ = self.eig_vecs_[:, :self.n_components]

        print(X.shape, self.W_.shape)
        return X.dot(self.W_)
Taking the iris dataset as a test case, PCA is applied and visualized as follows:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# loading iris data, and normalize
from sklearn.datasets import load_iris
iris = load_iris()
from sklearn.preprocessing import MinMaxScaler
X, y = iris.data, iris.target
X = MinMaxScaler().fit_transform(X)
# using the PCA function (defined above)
# to fit_transform the X value
# naming the PCA object as dPCA (d = defined)
dPCA = PCA()
principalComponents = dPCA.fit_transform(X)
# creating a pandas dataframe for the principal components
# and visualize the data using scatter plot
PCAResult = pd.DataFrame(principalComponents, columns = [f"PCA-{i}" for i in range(1, dPCA.n_components + 1)])
PCAResult["target"] = y # possible as original order does not change
sns.scatterplot(x = "PCA-1", y = "PCA-2", data = PCAResult, hue = "target", s = 50)
plt.show()
The output is:
[Figure: scatter plot of the first two principal components from the custom PCA class]
Now, I wanted to verify the output, for which I used the sklearn library; its output is as follows:
from sklearn.decomposition import PCA # note the same name
sPCA = PCA() # consider all the components
principalComponents_ = sPCA.fit_transform(X)
PCAResult_ = pd.DataFrame(principalComponents_, columns = [f"PCA-{i}" for i in range(1, 5)])
PCAResult_["target"] = y # possible as original order does not change
sns.scatterplot(x = "PCA-1", y = "PCA-2", data = PCAResult_, hue = "target", s = 50)
plt.show()
I don't understand why the output is oriented differently, with minor differences in the values. I studied numerous implementations [1, 2, 3], all of which have the same issue. My questions:
What is sklearn doing differently, so that the plot comes out oriented differently? I've tried with a different dataset too - same problem.
Is there a way to fix this issue?
I was not able to study the sklearn.decomposition.PCA source, as I am new to OOP concepts in Python.
The output in the blog post by Sebastian Raschka also shows a minor variation in its output.
When calculating an eigenvector you may change its sign and the solution will also be a valid one.
So any PCA axis can be reversed and the solution will be valid.
Nevertheless, you may wish to impose a positive correlation of a PCA axis with one of the original variables in the dataset, inverting the axis if needed.
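A small sketch of that idea (not from the original code): flip each component so that its largest-magnitude loading is positive, similar in spirit to what sklearn does internally.
import numpy as np

def fix_signs(eig_vecs):
    # flip each column so that its largest-absolute-value entry is positive
    max_rows = np.argmax(np.abs(eig_vecs), axis=0)
    signs = np.sign(eig_vecs[max_rows, np.arange(eig_vecs.shape[1])])
    return eig_vecs * signs

# hypothetical usage with the PCA class defined above:
# dPCA.W_ = fix_signs(dPCA.W_)
# principalComponents = X.dot(dPCA.W_)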
The difference in values comes from sklearn's PCA using SVD decomposition. sklearn has a function, svd_flip, that is used to flip the PCs, which explains why you see this flip.
More details on the help page:
It uses the LAPACK implementation of the full SVD or a randomized
truncated SVD by the method of Halko et al. 2009, depending on the
shape of the input data and the number of components to extract.
You can read about the relation here
We first run your example dataset:
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
from sklearn.utils.extmath import svd_flip
import pandas as pd
import numpy as np
import scipy
iris = load_iris()
X, y = iris.data, iris.target
X = MinMaxScaler().fit_transform(X)
n_components = 4
sPCA = PCA(n_components,svd_solver="full")
sklearnPCs = pd.DataFrame(sPCA.fit_transform(X))
We now perform SVD on your centered matrix:
U,S,Vt = scipy.linalg.svd(X - X.mean(axis=0))
U = U[:,:n_components]
U, Vt = svd_flip(U, Vt)
svdPCs = pd.DataFrame(U*S)
The results:
sklearnPCs
0 1 2 3
0 -0.630703 0.107578 -0.018719 -0.007307
1 -0.622905 -0.104260 -0.049142 -0.032359
2 -0.669520 -0.051417 0.019644 -0.007434
3 -0.654153 -0.102885 0.023219 0.020114
4 -0.648788 0.133488 0.015116 0.011786
.. ... ... ... ...
145 0.551462 0.059841 0.086283 -0.110092
146 0.407146 -0.171821 -0.004102 -0.065241
147 0.447143 0.037560 0.049546 -0.032743
148 0.488208 0.149678 0.239209 0.002864
149 0.312066 -0.031130 0.118672 0.052505
svdPCs
0 1 2 3
0 -0.630703 0.107578 -0.018719 -0.007307
1 -0.622905 -0.104260 -0.049142 -0.032359
2 -0.669520 -0.051417 0.019644 -0.007434
3 -0.654153 -0.102885 0.023219 0.020114
4 -0.648788 0.133488 0.015116 0.011786
.. ... ... ... ...
145 0.551462 0.059841 0.086283 -0.110092
146 0.407146 -0.171821 -0.004102 -0.065241
147 0.447143 0.037560 0.049546 -0.032743
148 0.488208 0.149678 0.239209 0.002864
149 0.312066 -0.031130 0.118672 0.052505
You can implement it without the flip. The values will be the same up to the sign of each component, and your PCA will still be valid, as noted in the other answer.
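As a quick check of that claim (a sketch continuing from the code above): the PCs computed without svd_flip match sklearn's up to the sign of each column.
U_raw, S_raw, Vt_raw = scipy.linalg.svd(X - X.mean(axis=0))
rawPCs = pd.DataFrame(U_raw[:, :n_components] * S_raw[:n_components])
print(np.allclose(np.abs(rawPCs), np.abs(sklearnPCs)))  # expected: True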

Fisher's linear discriminant in Python

I have Fisher's linear discriminant and I need to use it to reduce my examples A and B, which are high-dimensional matrices, to 2D, exactly like LDA. Each example has classes A and B, so if I had a third example it would also have classes A and B, and a fourth, fifth and nth example would always have classes A and B; therefore I would like to separate them with a simple use of Fisher's linear discriminant. I'm pretty new to machine learning, so I don't know how to separate my classes; I've been following the formula by eye and coding on the go. From what I have been reading, I need to apply a linear transformation to my data so I can find a good threshold for it, but first I need to find the maximization function. For that task I managed to find Sw and Sb, but I don't know how to go from there...
I also need to find the maximization function, J(w) = (w^T Sb w) / (w^T Sw w).
That maximization function gives me an eigenvalue solution, Sw^-1 Sb w = λ w.
What I have for each class are 5x2 matrices, for 2 examples. For instance:
Example 1
Class_A = [
201, 103,
40, 43,
23, 50,
12, 123,
99, 78
]
Class_B = [
201, 129,
114, 195,
180, 90,
69, 62,
76, 90
]
Example 2
Class_A = [
68, 98,
201, 203,
78, 212,
49, 5,
204, 78
]
Class_B = [
52, 19,
220, 219,
159, 195,
99, 23,
46, 50
]
I tried finding Sw for the example above like this:
Example_1_Class_A = np.dot(Example_1_Class_A, np.transpose(Example_1_Class_A))
Example_1_Class_B = np.dot(Example_1_Class_B, np.transpose(Example_1_Class_B))
Example_2_Class_A = np.dot(Example_2_Class_A, np.transpose(Example_2_Class_A))
Example_2_Class_B = np.dot(Example_2_Class_B, np.transpose(Example_2_Class_B))
Sw = np.sum([Example_1_Class_A, Example_1_Class_B, Example_2_Class_A, Example_2_Class_B], axis=0)
As for Sb, I tried this:
Example_1_Class_A_mean = Example_1_Class_A.mean(axis=0)
Example_1_Class_B_mean = Example_1_Class_B.mean(axis=0)
Example_2_Class_A_mean = Example_2_Class_A.mean(axis=0)
Example_2_Class_B_mean = Example_2_Class_B.mean(axis=0)
Example_1_Class_A_Sb = np.dot(Example_1_Class_A_mean, np.transpose(Example_1_Class_A_mean))
Example_1_Class_B_Sb = np.dot(Example_1_Class_B_mean, np.transpose(Example_1_Class_B_mean))
Example_2_Class_A_Sb = np.dot(Example_2_Class_A_mean, np.transpose(Example_2_Class_A_mean))
Example_2_Class_B_Sb = np.dot(Example_2_Class_B_mean, np.transpose(Example_2_Class_B_mean))
Sb = np.sum([Example_1_Class_A_Sb, Example_1_Class_B_Sb, Example_2_Class_A_Sb, Example_2_Class_B_Sb], axis=0)
The problem is, I have no idea what else to do with my Sw and Sb; I am completely lost. Basically, what I need to do is get from here to this:
How, for a given Example A and Example B, do I separate out a cluster only for the class A points and one only for the class B points?
Before answering your question, I will first touch on the basic difference between PCA and (F)LDA. In PCA you don't know anything about the underlying classes, but you assume that the information about class separability lies in the variance of the data. So you rotate your original axes (sometimes this is called projecting all the data onto new axes) in such a way that your first new axis points in the direction of most variance, the second one is perpendicular to the first and points in the direction of most residual variance, and so on. In this way a PCA transformation results in a (sub)space of the same dimensionality as the original one. Then you can take only the first 2 dimensions, rejecting the rest, and hence get a dimensionality reduction from k dimensions to only 2.
LDA works a bit differently. In this case you know in advance how many classes there are in your data, and you can find their mean and covariance matrices. What the Fisher criterion does is find a direction in which the distance between the class means is maximized, while at the same time the total variability is minimized (the total variability being the mean of the within-class covariance matrices). And for each pair of classes there is only one such line. This is why when your data has C classes, LDA can provide you with at most C-1 dimensions, regardless of the original data dimensionality. In your case this means that as you have only 2 classes, A and B, you will get a one-dimensional projection, i.e. a line. And this is exactly what you have in your picture: the original 2D data is projected onto a line. The direction of the line is the solution of the eigenproblem.
Let's generate data that is similar to your picture:
import numpy as np
import matplotlib.pyplot as plt

a = np.random.multivariate_normal((1.5, 3), [[0.5, 0], [0, .05]], 30)
b = np.random.multivariate_normal((4, 1.5), [[0.5, 0], [0, .05]], 30)
plt.plot(a[:,0], a[:,1], 'b.', b[:,0], b[:,1], 'r.')

mu_a, mu_b = a.mean(axis=0).reshape(-1,1), b.mean(axis=0).reshape(-1,1)
Sw = np.cov(a.T) + np.cov(b.T)
inv_S = np.linalg.inv(Sw)
res = inv_S.dot(mu_a-mu_b)  # the trick
####
# more general solution
#
# Sb = (mu_a-mu_b)*((mu_a-mu_b).T)
# eig_vals, eig_vecs = np.linalg.eig(inv_S.dot(Sb))
# res = sorted(zip(eig_vals, eig_vecs), reverse=True)[0][1] # take only the eigenvec corresponding to the largest (and only) eigenvalue
# res = res / np.linalg.norm(res)

plt.plot([-res[0], res[0]], [-res[1], res[1]])  # this is the solution
plt.plot(mu_a[0], mu_a[1], 'cx')
plt.plot(mu_b[0], mu_b[1], 'yx')
plt.gca().axis('square')

# let's project the data points onto it
r = res.reshape(2,)
n2 = np.linalg.norm(r)**2
for pt in a:
    prj = r * r.dot(pt) / n2
    plt.plot([prj[0], pt[0]], [prj[1], pt[1]], 'b.:', alpha=0.2)
for pt in b:
    prj = r * r.dot(pt) / n2
    plt.plot([prj[0], pt[0]], [prj[1], pt[1]], 'r.:', alpha=0.2)

plt.show()
The resulting projection is calculated using a neat trick for two class problem. You can read details on it here in section 1.6.
Regarding the "examples" you mention in your question. I believe you need to repeat the process for each example, as it is a different set of data point probably with different distributions. Also, put attention that estimated mean (mu_a, mu_b) and class covariance matrices would be slightly different from the ones that data was generated with, especially for small sample size.
Mathematics
See https://sebastianraschka.com/Articles/2014_python_lda.html#lda-in-5-steps for more information.
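For reference (the original post rendered these as images), the scatter matrices used in the steps below are the standard ones, written here in LaTeX notation:
S_W = \sum_{i=1}^{c} \sum_{\mathbf{x} \in D_i} (\mathbf{x} - \mathbf{m}_i)(\mathbf{x} - \mathbf{m}_i)^{T}
\qquad
S_B = \sum_{i=1}^{c} N_i \, (\mathbf{m}_i - \mathbf{m})(\mathbf{m}_i - \mathbf{m})^{T}
where \mathbf{m}_i is the mean of class i, \mathbf{m} is the overall mean and N_i is the number of observations in class i; the projection directions are the leading eigenvectors of S_W^{-1} S_B.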
Implementation using Iris
Since you want to use LDA for dimensionality reduction but provide only 2D data, I am showing how to perform this procedure on the iris dataset.
Let's import the libraries:
import pandas as pd
import numpy as np
import sklearn as sk
from collections import Counter
from sklearn import datasets
# load dataset and transform to pandas df
X, y = datasets.load_iris(return_X_y=True)
X = pd.DataFrame(X, columns=[f'feat_{i}' for i in range(4)])
y = pd.DataFrame(y, columns=['labels'])
tot = pd.concat([X,y], axis=1)
# calculate class means
class_means = tot.groupby('labels').mean()
total_mean = X.mean()
The class_means are given by:
class_means
feat_0 feat_1 feat_2 feat_3
labels
0 5.006 3.428 1.462 0.246
1 5.936 2.770 4.260 1.326
2 6.588 2.974 5.552 2.026
To do this, we first subtract the corresponding class mean from each observation (basically we calculate x - m_i from the equations above):
x_mi = tot.transform(lambda x: x - class_means.loc[x['labels']], axis=1).drop('labels', axis=1)

def kronecker_and_sum(df, weights):
    S = np.zeros((df.shape[1], df.shape[1]))
    for idx, row in df.iterrows():
        x_m = row.to_numpy().reshape(df.shape[1], 1)
        S += weights[idx] * np.dot(x_m, x_m.T)
    return S

# Each x_mi is weighted with 1. Now we use the kronecker_and_sum function to calculate the within-class scatter matrix S_w
S_w = kronecker_and_sum(x_mi, 150*[1])
mi_m = class_means.transform(lambda x: x - total_mean, axis=1)
# Each mi_m is weighted with the number of observations per class which is 50 for each class in this example. We use kronecker_and_sum to calculate the between-class scatter matrix.
S_b = kronecker_and_sum(mi_m, 3*[50])
eig_vals, eig_vecs = np.linalg.eig(np.linalg.inv(S_w).dot(S_b))
We only need to consider the eigenvalues that are noticeably different from zero (in this case only the first two):
eig_vals
array([ 3.21919292e+01, 2.85391043e-01, 6.53468167e-15, -2.24877550e-15])
Transform X with the matrix of the two eigenvectors which correspond to the highest eigenvalues
W = eig_vecs[:, :2]
X_trafo = np.dot(X, W)
tot_trafo = pd.concat([pd.DataFrame(X_trafo, index=range(len(X_trafo))), y], axis=1)
# plot the result
tot_trafo.plot.scatter(x=0, y=1, c='labels', colormap='viridis')
We have reduced the dimensionality from 4 to 2 and chosen the space in such a way that the classes can be well separated.
Scikit-learn usage
Scikit-learn has LDA support as well. What we did in dozens of lines can be done with the following few lines of code:
from sklearn import discriminant_analysis
lda = discriminant_analysis.LinearDiscriminantAnalysis(n_components=2)
X_trafo_sk = lda.fit_transform(X,y)
pd.DataFrame(np.hstack((X_trafo_sk, y))).plot.scatter(x=0, y=1, c=2, colormap='viridis')
I'm not showing a plot here, because it is the same as in our derived example (except for a 180 degree rotation).

How could I use a dynamic epsilon in DBSCAN?

Today I'm working on a dataset from Kaggle: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data. I would like to segment my dataset by beds, baths and neighborhood, and use DBSCAN to cluster by price within each segment. The problem is that because each segment is different, I don't want to use the same epsilon for the whole dataset; I want the best epsilon for each segment. Do you know an efficient way to do this?
from sklearn.cluster import DBSCAN
import sklearn.utils
from sklearn.preprocessing import StandardScaler
sklearn.utils.check_random_state(1000)
Clus_dataSet = pdf[['beds','baths','neighborhood','price']]
Clus_dataSet = np.nan_to_num(Clus_dataSet)
Clus_dataSet = StandardScaler().fit_transform(Clus_dataSet)
# Compute DBSCAN
db = DBSCAN(eps=0.3, min_samples=6).fit(Clus_dataSet)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
pdf["Clus_Db"]=labels
realClusterNum=len(set(labels)) - (1 if -1 in labels else 0)
clusterNum = len(set(labels))
Thank you.
A heuristic for setting the Epsilon and MinPts parameters was proposed in the original DBSCAN paper.
Once the MinPts value is set (e.g. 2 * number of features), the partitioning result strongly depends on Epsilon. The heuristic suggests inferring Epsilon through a visual analysis of the k-dist plot.
A toy example of the procedure with two gaussian distributions is reported in the following.
from sklearn.neighbors import NearestNeighbors
from matplotlib import pyplot as plt
from sklearn.datasets import make_biclusters
data,lab,_ = make_biclusters((200,2), 2, noise=0.1, minval=0, maxval=1)
minpts = 4
nbrs = NearestNeighbors(n_neighbors=minpts, algorithm='ball_tree').fit(data)
distances, indices = nbrs.kneighbors(data)
k_dist = [x[-1] for x in distances]
f,ax = plt.subplots(1,2,figsize = (10,5))
ax[0].set_title('k-dist plot for k = minpts = 4')
ax[0].plot(sorted(k_dist))
ax[0].set_xlabel('object index after sorting by k-distance')
ax[0].set_ylabel('k-distance')
ax[1].set_title('original data')
ax[1].scatter(data[:,0],data[:,1],c = lab[0])
In the resulting k-dist plot, the "elbow" theoretically divides noise objects from cluster objects, and indeed gives an indication of a plausible range of values for Epsilon (tailored to the dataset in combination with the selected value of MinPts). In this toy example, I would say between 0.05 and 0.075.
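Applied to the per-segment question above, a rough sketch (with synthetic data standing in for the Kaggle columns, and the elbow approximated as the largest jump in the sorted k-distances) could look like this:
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

# synthetic stand-in for the real dataframe; replace with your own pdf
rng = np.random.default_rng(0)
pdf = pd.DataFrame({
    'beds': rng.integers(1, 5, 600),
    'baths': rng.integers(1, 4, 600),
    'neighborhood': rng.integers(0, 3, 600),
    'price': rng.normal(300_000, 50_000, 600).round(),
})

minpts = 4
labels = pd.Series(-1, index=pdf.index)  # default label: noise

for _, seg in pdf.groupby(['beds', 'baths', 'neighborhood']):
    if len(seg) <= minpts:
        continue  # too few points to cluster; leave them marked as noise
    prices = seg[['price']].to_numpy()
    # k-dist curve for this segment only
    distances, _ = NearestNeighbors(n_neighbors=minpts).fit(prices).kneighbors(prices)
    k_dist = np.sort(distances[:, -1])
    # crude elbow: the k-distance just before the largest jump
    eps = max(k_dist[np.argmax(np.diff(k_dist))], 1e-6)
    db = DBSCAN(eps=eps, min_samples=minpts).fit(prices)
    labels.loc[seg.index] = db.labels_

pdf['Clus_Db'] = labels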

xgboost.plot_tree: binary feature interpretation

I've built an XGBoost model and seek to examine the individual estimators. For reference, this was a binary classification task with discrete and continuous input features. The input feature matrix is a scipy.sparse.csr_matrix.
When I went to examine an individual estimator, however, I found it difficult to interpret the binary input features, such as f60150 below. The real-valued f60150 in the bottommost chart is easy to interpret - its criterion is in the expected range of that feature. However, the comparisons being made for the binary features, <X> < -9.53674e-07, don't make sense. Each of these features is either 1 or 0. -9.53674e-07 is a very small negative number, and I imagine this is just some floating-point idiosyncrasy within XGBoost or its underpinning plotting libraries, but it doesn't make sense to use that comparison when the feature is always non-negative. Can someone help me understand which direction (i.e. yes, missing vs. no) corresponds to which true/false side of these binary feature nodes?
Here is a reproducible example:
import numpy as np
import scipy.sparse
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from xgboost import plot_tree, XGBClassifier
import matplotlib.pyplot as plt
def booleanize_csr_matrix(mat):
    ''' Convert sparse matrix with positive integer elements to 1s '''
    nnz_inds = mat.nonzero()
    keep = np.where(mat.data > 0)[0]
    n_keep = len(keep)
    result = scipy.sparse.csr_matrix(
        (np.ones(n_keep), (nnz_inds[0][keep], nnz_inds[1][keep])),
        shape=mat.shape
    )
    return result
### Setup dataset
res = fetch_20newsgroups()
text = res.data
outcome = res.target
### Use default params from CountVectorizer to create initial count matrix
vec = CountVectorizer()
X = vec.fit_transform(text)
# Whether to "booleanize" the input matrix
booleanize = True
# Whether to, after "booleanizing", convert the data type to match what's returned by `vec.fit_transform(text)`
to_int = True
if booleanize and to_int:
    X = booleanize_csr_matrix(X)
    X = X.astype(np.int64)
# Make it a binary classification problem
y = np.where(outcome == 1, 1, 0)
# Random state ensures we will be able to compare trees and their features consistently
model = XGBClassifier(random_state=100)
model.fit(X, y)
plot_tree(model, rankdir='LR'); plt.show()
Running the above with booleanize and to_int set to True yields the following chart:
Running the above with booleanize and to_int set to False yields the following chart:
Heck, even if I do a really simple example, I get the "right" results, regardless of whether X or y are integer or floating types.
X = np.matrix(
    [
        [1,0],
        [1,0],
        [0,1],
        [0,1],
        [1,1],
        [1,0],
        [0,0],
        [0,0],
        [1,1],
        [0,1]
    ]
)
y = np.array([1,0,0,0,1,1,1,0,1,1])
model = XGBClassifier(random_state=100)
model.fit(X, y)
plot_tree(model, rankdir='LR'); plt.show()
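Not an answer to the yes/no/missing question itself, but one way to make the routing explicit (a sketch, assuming a reasonably recent xgboost version and the model fitted above) is to dump the trees as a table, which lists the child node taken for yes, no and missing at every split:
booster = model.get_booster()
tree_df = booster.trees_to_dataframe()
# 'Split' holds the threshold; 'Yes', 'No' and 'Missing' hold the child node IDs
print(tree_df[['Tree', 'Node', 'Feature', 'Split', 'Yes', 'No', 'Missing']].head(10))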

LinearDiscriminantAnalysis - Single column output from .transform(X)

I have been successfully playing around with replicating one of the sklearn tutorials using the iris dataset in PyCharm using Python 2.7. However, when trying to repeat this with my own data I have been encountering an issue. I have been importing data from a .csv file using 'np.genfromtxt', but for some reason I keep getting a single column output for X_r2 (see below), when I should get a 2 column output. I have therefore replaced my data with some randomly generated variables to post onto SO, and I am still getting the same issue.
I have included the 'problem' code below, and I would be interested to know what I have done wrong. I have extensively used the debugging features in PyCharm to check that the type and shape of my variables are similar to the original sklearn example, but it did not help me with the problem. Any help or suggestions would be appreciated.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
y = np.random.randint(2, size=500)
X = np.random.randint(1, high=1000, size=(500, 6))
target_names = np.array([['XX'], ['YY']])
lda = LinearDiscriminantAnalysis(n_components=2)
X_r2 = lda.fit(X, y).transform(X)
LinearDiscriminantAnalysis can return at most n_classes - 1 components, so with only two classes in y you get a single column. The array y in the example you posted has values of 0, 1 and 2, while yours only has values of 0 and 1. This change achieves what you want:
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
y = np.random.randint(3, size=500)
X = np.random.randint(1, high=1000, size=(500, 6))
target_names = np.array([['XX'], ['YY']])
lda = LinearDiscriminantAnalysis(n_components=2)
X_r2 = lda.fit(X, y).transform(X)
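As a quick sanity check (not part of the original answer), the transform now returns two columns:
print(X_r2.shape)  # expected: (500, 2)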
