How to display feature names in sklearn decision tree? - python

I currently have a decision tree that displays the feature names as X[index], i.e. X[0], X[1], X[2], etc.
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
# plot tree
plt.figure(figsize=(20, 16))  # set plot size (denoted in inches)
tree.plot_tree(dt, fontsize=10)
I'm looking to replace these X[featureNumber] labels with the actual feature names.
So instead of displaying X[0], I want it to display the feature name returned by X.columns.values[0] (I don't know if this code is correct).
I'm also aware there is an easy way of doing this using graphviz, but for some reason I can't get graphviz running in Jupyter, so I'm looking for a way of doing it without.
Photo of current decision tree:

This is explained in the documentation:
sklearn.tree.plot_tree(decision_tree, *, max_depth=None, feature_names=None, class_names=None, label='all', filled=False, impurity=True, node_ids=False, proportion=False, rotate='deprecated', rounded=False, precision=3, ax=None, fontsize=None)
feature_names: list of strings, default=None
Names of each of the features. If None, generic names will be used (“X[0]”, “X[1]”, …).
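A minimal sketch of passing feature_names, assuming the training data is a pandas DataFrame so that its columns carry the names. The iris data set is only a stand-in here for the X_train and y_train from the question.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Stand-in data (an assumption): a DataFrame whose columns are the feature names
iris = load_iris(as_frame=True)
X_train, y_train = iris.data, iris.target

dt = DecisionTreeClassifier(random_state=0)
dt.fit(X_train, y_train)

fig, ax = plt.subplots(figsize=(20, 16))
# feature_names takes a list of strings, one per column, in column order
annotations = plot_tree(dt, feature_names=list(X_train.columns), fontsize=10, ax=ax)
```

In a Jupyter notebook the figure should render inline; the node labels then show the column names instead of X[0], X[1], and so on.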

Related

How do you find feature names for Decision Tree Classification?

I am trying to find the feature information for my decision trees. More specifically, I want to be able to tell what feature 183 is if it appears in my tree visualization. I have tried dtModel.getInputCol() but receive the following error.
AttributeError: 'DecisionTreeClassificationModel' object has no attribute 'getInputCol'
This is my current code:
from pyspark.ml.classification import DecisionTreeClassifier
# Create initial Decision Tree Model
dt = DecisionTreeClassifier(labelCol="label", featuresCol="features", maxDepth=3)
# Train model with Training Data
dtModel = dt.fit(trainingData)
display(dtModel)
If you can help or need more information, please let me know. Thank you.
See this example taken from the Spark docs (I have tried to keep the names consistent with your code, especially featuresCol="features").
I assume you have some code like this (before the code you posted in the question):
from pyspark.ml.feature import VectorIndexer
featureIndexer = VectorIndexer(inputCol="inputFeatures", outputCol="features", maxCategories=4).fit(data)
After this step, "features" holds the indexed features, which you then feed to the DecisionTreeClassifier (as in your posted code):
# Train a DecisionTree model.
dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="features")
What you're looking for is inputFeatures above, which holds the original features before they were indexed. If you want to print it, simply do something like:
sc.parallelize(inputFeatures, 1).saveAsTextFile("absolute_path")
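Another option, sketched here under assumptions: if the features column was built with VectorAssembler, Spark usually records the original column names in the column's metadata under the "ml_attr" key. The helper name feature_names_from_metadata and the example metadata below are hypothetical; the dictionary only imitates the layout I would expect to find in trainingData.schema["features"].metadata.

```python
def feature_names_from_metadata(metadata):
    """Map vector index -> original column name from Spark ML column metadata."""
    attrs = metadata.get("ml_attr", {}).get("attrs", {})
    names = {}
    # Attributes are grouped by type, e.g. "numeric", "binary", "nominal"
    for group in attrs.values():
        for attr in group:
            names[attr["idx"]] = attr.get("name")
    return dict(sorted(names.items()))


# Hand-written metadata for illustration only (assumed layout, made-up names)
example_metadata = {
    "ml_attr": {
        "attrs": {
            "numeric": [{"idx": 0, "name": "age"}, {"idx": 2, "name": "income"}],
            "nominal": [{"idx": 1, "name": "country", "vals": ["x", "y"]}],
        },
        "num_attrs": 3,
    }
}

names = feature_names_from_metadata(example_metadata)
print(names[1])  # country
```

With a real DataFrame you would call feature_names_from_metadata(trainingData.schema["features"].metadata) and look up the index shown in the tree, e.g. names[183]. If the metadata is empty, the names were not recorded and this approach will not work.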

Remove some features from sklearn PolynomialFeatures

I am using the sklearn class PolynomialFeatures to fit my model with polynomials of my data.
To this end I am doing the following:
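from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
P = PolynomialFeatures(3, interaction_only=False, include_bias=False)
model = make_pipeline(P, Ridge(tol=0.001, alpha=1, fit_intercept=False))
model.fit(initial_conditions, times_of_flight)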
It works well, and now I would like to remove some of these features to refine my model. Say I want to remove every feature that contains one of the first two variables, x_1 and x_2, without the other.
I have tried to modify the PolynomialFeatures attributes (powers_, n_input_features_...) before fitting, but scikit-learn raises sklearn.exceptions.NotFittedError.
How should I proceed?
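One possible approach, as a sketch rather than a definitive answer: powers_ only exists after fitting, so fit PolynomialFeatures once, build a boolean mask from powers_, and drop the unwanted columns in an extra pipeline step. The random data, the three-variable input, and the names keep and select are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, PolynomialFeatures

# Made-up data: 50 samples with 3 input variables (x_1, x_2, x_3)
rng = np.random.default_rng(0)
initial_conditions = rng.normal(size=(50, 3))
times_of_flight = rng.normal(size=50)

P = PolynomialFeatures(3, interaction_only=False, include_bias=False)
P.fit(initial_conditions)  # powers_ is only available after fitting

# powers_[i, j] is the exponent of input j in output feature i
uses_x1 = P.powers_[:, 0] > 0
uses_x2 = P.powers_[:, 1] > 0
keep = uses_x1 == uses_x2  # keep features using both x_1 and x_2, or neither

# Extra step that keeps only the selected polynomial columns
select = FunctionTransformer(lambda Z: Z[:, keep])
model = make_pipeline(P, select, Ridge(tol=0.001, alpha=1, fit_intercept=False))
model.fit(initial_conditions, times_of_flight)
```

The mask depends on the number of input variables and the degree, so it has to be rebuilt if either changes.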

How to use k means for a product recommendation dataset

I have a data set with the columns product name, brand, rating (1 to 5), review text, and review-helpfulness. I need to propose a recommendation algorithm based on the reviews, implemented in Python. The data set is in .csv format.
To understand the nature of the data set, I need to run k-means on it. How can I use k-means on this data set?
So far I have done the following:
1. Data pre-processing
2. Review text data cleaning
3. Sentiment analysis
4. Assigning a sentiment score from 1 to 5 according to the sentiment value (given by the sentiment analysis) and tagging reviews as very negative, negative, neutral, positive, or very positive
After these steps, my data set has these columns: product name, brand, rating (1 to 5), review text, review-helpfulness, sentiment-value, sentiment-tag.
This is the link to the data set: https://drive.google.com/file/d/1YhCJNvV2BQk0T7PbPoR746DCL6tYmH7l/view?usp=sharing
I ran k-means using the following code. It runs without error, but I don't know whether the result is useful, or whether there are other ways to use k-means on this data set to get more useful output. How should I use k-means to learn more about this data?
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
df.info()
# Drop the sentiment column and convert the remaining columns to floats
X = np.array(df.drop(['sentiment_value'], axis=1).astype(float))
y = np.array(df['rating'])
kmeans = KMeans(n_clusters=2)
kmeans.fit(X)
# Output of the fit call:
# KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
#        n_clusters=2, n_init=10, n_jobs=1, precompute_distances='auto',
#        random_state=None, tol=0.0001, verbose=0)
plt.show()
You did not plot anything.
So nothing shows up.
Unless you are more specific about what you are trying to achieve, we won't be able to help. Figure out exactly what you want to predict. Do you just want to cluster products according to their sentiment score, which isn't especially promising, or do you want to predict actual product preferences on a new dataset?
If you want to build a recommendation system, the only possibility (considering your dataset) would be to identify similar products according to the rating/sentiment. Is that what you want?
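To make the point about plotting concrete, here is a rough sketch that actually draws the cluster assignments before calling plt.show(). The DataFrame is random stand-in data with only two numeric columns, which is an assumption; the real data set would need its text columns dropped or encoded first.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical stand-in for the review data: two numeric columns only
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "rating": rng.integers(1, 6, size=200),
    "sentiment_value": rng.integers(1, 6, size=200),
})

X = df[["rating", "sentiment_value"]].to_numpy(dtype=float)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

fig, ax = plt.subplots()
# Color each point by the cluster it was assigned to
ax.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, alpha=0.5)
# Mark the two cluster centers
ax.scatter(*kmeans.cluster_centers_.T, marker="x", color="red", s=100)
ax.set_xlabel("rating")
ax.set_ylabel("sentiment_value")
plt.show()
```

Whether the resulting clusters mean anything for recommendations is a separate question, as noted above.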

Is there a way to print one XGBoostRegressor tree in python?

I have constructed an XGBoostRegressor model and now want to plot one of its trees. I know that the regular xgb classifier has the function plot_tree, but unfortunately XGBoostRegressor does not. Is there any other way to plot the tree? I also tried importing plot_tree from xgboost and calling plot_tree(xgb), which raises
ValueError('Unable to parse node: 44['product_family'])
Any ideas on another way of doing this?
I found the error: I had whitespace in some of my feature names. I added the following line
df.columns = df.columns.str.replace(" ", "_")
After that, plot_tree(xgb) worked.
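For illustration, this is what the rename does on a small DataFrame. The column names here are made up.

```python
import pandas as pd

# Hypothetical columns containing spaces
df = pd.DataFrame({"product family": [1, 2], "unit price": [3.0, 4.0]})

# Replace every space in the column names with an underscore
df.columns = df.columns.str.replace(" ", "_", regex=False)

print(list(df.columns))  # ['product_family', 'unit_price']
```

The rename has to happen before the model is trained, so that the trained model stores the cleaned names.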

Up-/downsampling with One vs. rest classifier

I have a data set (tf-idf weighted words) with multiple classes that I am trying to predict. My classes are imbalanced. I would like to use the one-vs-rest classification approach with some classifiers (e.g. Multinomial Naive Bayes) using the OneVsRestClassifier from sklearn.
Additionally, I would like to use the imbalanced-learn package (most likely one of the combinations of up- and downsampling) to enhance my data. The normal approach of using imbalanced-learn is:
from imblearn.combine import SMOTEENN
smote_enn = SMOTEENN(random_state=0)
X_resampled, y_resampled = smote_enn.fit_resample(X, y)
I now have a data set with roughly the same number of cases for every label. I would then use the classifier on the resampled data.
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
ovr = OneVsRestClassifier(MultinomialNB())
ovr.fit(X_resampled, y_resampled)
But now there is a huge imbalance for every label when it is fitted, because I have more than 50 labels in total. Right? I imagine that I need to apply the up-/downsampling method for every label instead of doing it once at the beginning. How can I apply the resampling for every label?
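The concern can be checked on a toy example. This sketch uses LabelBinarizer, which OneVsRestClassifier uses internally to split y into one binary column per label. The four balanced labels are made-up data.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Made-up, perfectly balanced labels: 4 classes with 25 samples each
y = np.repeat(["a", "b", "c", "d"], 25)

# One binary column per label, as in one-vs-rest
Y = LabelBinarizer().fit_transform(y)

positives = Y.sum(axis=0)       # samples of the label itself
negatives = len(y) - positives  # samples of all other labels combined

print(positives.tolist())  # [25, 25, 25, 25]
print(negatives.tolist())  # [75, 75, 75, 75]
```

Even with balanced classes, each binary problem sees 25 positives against 75 negatives, and with more than 50 labels the ratio gets far more lopsided.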
As per the discussion in comments, what you want can be done like this:
from sklearn.naive_bayes import MultinomialNB
from imblearn.combine import SMOTEENN
# Observe how I imported Pipeline from IMBLEARN and not SKLEARN
from imblearn.pipeline import Pipeline
from sklearn.multiclass import OneVsRestClassifier
# This pipeline will resample the data and
# pass the output to MultinomialNB
pipe = Pipeline([('sampl', SMOTEENN()),
                 ('clf', MultinomialNB())])
# OVR will transform the `y` as you know and
# then pass single label data to different copies of pipe
# (one copy per label in the data)
ovr = OneVsRestClassifier(pipe)
ovr.fit(X, y)
Explanation of code:
Step 1: OneVsRestClassifier will create multiple columns of y, one for each label, where that label is positive and all others are negative.
Step 2: For each label, OneVsRestClassifier will clone the supplied pipe estimator and pass that label's data to it.
Step 3:
a. Each copy of pipe gets a different version of y, which is passed to the SMOTEENN inside it, so each copy does a different sampling to balance the classes.
b. The second part of pipe (clf) gets that balanced dataset for each label, as you wanted.
Step 4: At prediction time, the sampling part is turned off, so the data reaches clf as it is. The sklearn pipeline doesn't handle that part, which is why I used imblearn.pipeline.
Hope this helps.
