Is there a way to print one XGBoostRegressor tree in Python?

I have constructed an XGBoostRegressor model and now want to try to plot one of its trees. I know that the regular xgb classifier has the plot_tree function, but unfortunately XGBoostRegressor does not. Is there any other way to plot the tree? I also tried importing plot_tree from xgboost and using plot_tree(xgb), which returns
ValueError('Unable to parse node: 44['product_family'])
Any ideas on another way of doing this?

I found the error: I had some whitespace in some of my feature names. I added the following line
df.columns = df.columns.str.replace(" ", "_")
and now plot_tree(xgb) works.
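For reference, a minimal sketch of this approach (the DataFrame and column names below are made up for illustration); xgboost.plot_tree also needs matplotlib and, depending on the xgboost version, the graphviz package available:
import pandas as pd
import matplotlib.pyplot as plt
from xgboost import XGBRegressor, plot_tree

# Hypothetical data: a column name with whitespace, as in the question
df = pd.DataFrame({"product family": [1, 2, 3, 4, 5, 6],
                   "unit price": [10.0, 12.5, 9.9, 11.2, 10.7, 12.1]})
y = [100, 120, 95, 110, 105, 118]

# Replace whitespace so the tree dump can be parsed by plot_tree
df.columns = df.columns.str.replace(" ", "_")

xgb = XGBRegressor(n_estimators=10, max_depth=3)
xgb.fit(df, y)

plot_tree(xgb, num_trees=0)  # plot the first tree of the ensemble
plt.show()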

Related

How to display feature names in sklearn decision tree?

I've currently got a decision tree displaying the feature names as X[index], i.e. X[0], X[1], X[2], etc.
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)

# plot tree
plt.figure(figsize=(20, 16))  # set plot size (denoted in inches)
tree.plot_tree(dt, fontsize=10)
plt.show()
I'm looking to replace these X[featureNumber] labels with the actual feature names, so instead of displaying X[0] it would display the feature name returned by X.columns.values[0] (I don't know if this code is correct).
I'm also aware there is an easy way of doing this using graphviz, but for some reason I can't get graphviz running in Jupyter, so I'm looking for a way of doing it without.
Photo of current decision tree:
This is explained in the documentation:
sklearn.tree.plot_tree(decision_tree, *, max_depth=None, feature_names=None, class_names=None, label='all', filled=False, impurity=True, node_ids=False, proportion=False, rotate='deprecated', rounded=False, precision=3, ax=None, fontsize=None)
feature_names: list of strings, default=None
Names of each of the features. If None, generic names will be used (“X[0]”, “X[1]”, …).
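So, assuming X_train is a pandas DataFrame, passing its column names is enough (a small sketch building on the question's code):
import matplotlib.pyplot as plt
from sklearn import tree

plt.figure(figsize=(20, 16))
tree.plot_tree(dt, feature_names=list(X_train.columns), fontsize=10)
plt.show()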

How do you find feature names for Decision Tree Classification?

I am trying to find the feature information for my decision trees. More specifically, I want to be able to tell what feature 183 is if it appears in my tree visualization. I have tried dtModel.getInputCol() but receive the following error.
AttributeError: 'DecisionTreeClassificationModel' object has no attribute 'getInputCol'
This is my current code:
from pyspark.ml.classification import DecisionTreeClassifier
# Create initial Decision Tree Model
dt = DecisionTreeClassifier(labelCol="label", featuresCol="features", maxDepth=3)
# Train model with Training Data
dtModel = dt.fit(trainingData)
display(dtModel)
If you can help or need more information, please let me know. Thank you.
See this example taken from the Spark docs (I have tried to keep the names consistent with your code, especially featuresCol="features").
I assume you have some code like this (before the code you posted in the question):
featureIndexer = VectorIndexer(inputCol="inputFeatures", outputCol="features", maxCategories=4).fit(data)
After this step, you have "features" as the indexed features, which you then feed to the DecisionTreeClassifier (as in your posted code):
# Train a DecisionTree model.
dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="features")
What you're looking for is inputFeatures above, which holds the original features before indexing. If you want to write them out, simply do something like:
sc.parallelize(inputFeatures, 1).saveAsTextFile("absolute_path")
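Another option, if the "features" vector was produced by VectorAssembler/VectorIndexer, is to read the ML attribute metadata these transformers attach to the column. This is only a hedged sketch; the exact metadata layout depends on how the column was built:
# Recover an index -> name mapping from the "features" column metadata
attrs = trainingData.schema["features"].metadata["ml_attr"]["attrs"]
feature_names = {}
for attr_group in attrs.values():      # e.g. "numeric", "binary", "nominal"
    for attr in attr_group:
        feature_names[attr["idx"]] = attr["name"]

print(feature_names.get(183))          # name of feature 183, if present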

cannot use workaround for python PMML Pipeline

I'm trying to build a simple preprocessing Pipeline for a clustering model that uses K-Means and export it to PMML format.
I managed to make the Pipeline work, but can't manage to export it to PMML.
I have divided the pipeline in two steps, handle numerical data and handle categorical data.
import sklearn_pandas
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelBinarizer
from sklearn2pmml.pipeline import PMMLPipeline

numeric_features = ['column1', 'column2', 'column3']
categorical_features = ['column4', 'column5']

num_mapper = sklearn_pandas.DataFrameMapper(
    [([numeric_column], SimpleImputer(strategy='median')) for numeric_column in numeric_features],
    df_out=True, default=None)
categorical_mapper = sklearn_pandas.DataFrameMapper(
    [([categorical_column], LabelBinarizer()) for categorical_column in categorical_features],
    df_out=True, default=None)

pipeline = PMMLPipeline(steps=[
    ('num_mapper', num_mapper),
    ('cat_mapper', categorical_mapper)
])
Note that I have set default to None in the first DataFrameMapper, since that allows the output dataframe to preserve the columns that haven't been selected (columns that the second mapper will need).
These workarounds work OK; the problem comes later, when I try to export the pipeline to PMML:
sklearn2pmml.sklearn2pmml(pipeline,'mypath')
This line of code yields the following error
java.lang.IllegalArgumentException: Attribute 'sklearn_pandas.dataframe_mapper.DataFrameMapper.default' has a missing (None/null) value
at org.jpmml.sklearn.PyClassDict.get(PyClassDict.java:46)
at org.jpmml.sklearn.PyClassDict.getObject(PyClassDict.java:97)
I know this error is probably caused by the fact that I'm setting default to None in both DataFrameMappers, but it was the only workaround I found to preserve the columns needed by the second mapper.
Is there any other workaround I could use? I know I could do all the transformations in the first DataFrameMapper, but I don't like that idea, since I want to keep the numerical transformations separate from the categorical ones.
I recently came to understand FeatureUnion a bit better and realized it offers an elegant solution.
Create the same mappers:
import sklearn_pandas
import sklearn2pmml
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import LabelBinarizer
from sklearn2pmml.pipeline import PMMLPipeline

numeric_features = ['column1', 'column2', 'column3']
categorical_features = ['column4', 'column5']
num_mapper = sklearn_pandas.DataFrameMapper(
    [([numeric_column], SimpleImputer(strategy='median')) for numeric_column in numeric_features])
categorical_mapper = sklearn_pandas.DataFrameMapper(
    [([categorical_column], LabelBinarizer()) for categorical_column in categorical_features])
preprocessing = FeatureUnion(transformer_list=[('num_mapper', num_mapper), ('cat_mapper', categorical_mapper)])
pipeline = PMMLPipeline(steps=[
    ('preprocessing', preprocessing)
])
sklearn2pmml.sklearn2pmml(pipeline, 'mypath')
With this workaround I even managed to avoid using the df_out and default flags in the DataFrameMapper calls.
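One detail worth noting, as a hedged usage sketch (df here is just a placeholder for the training DataFrame): the PMMLPipeline still has to be fitted before the conversion call succeeds, since the converter reads the fitted imputer/binarizer state.
pipeline.fit(df)   # fit the mappers (e.g. imputer medians) before exporting
sklearn2pmml.sklearn2pmml(pipeline, 'mypath')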

Why doesn't Statsmodels OLS support reading in columns with multiple words?

I've been experimenting with Seaborn's lmplot() and Statsmodels .ols() functions for simple linear regression plots and their associated p-values, r-squared, etc.
I've noticed that when I specify which columns I want to use for lmplot, I can specify a column even if its name contains multiple words:
import seaborn as sns
import pandas as pd
input_csv = pd.read_csv('./test.csv',index_col = 0,header = 0)
input_csv
sns.lmplot(x='Age',y='Count of Specific Strands',data = input_csv)
<seaborn.axisgrid.FacetGrid at 0x2800985b710>
However, if I try to use ols, I get an error when passing "Count of Specific Strands" as my dependent variable (I've only listed the last couple of lines of the error):
import statsmodels.formula.api as smf
test_results = smf.ols('Count of Specific Strands ~ Age',data = input_csv).fit()
File "<unknown>", line 1
Count of Specific Strands
^
SyntaxError: invalid syntax
Conversely, if I specify the "Count of Specific Strands" column by position, as shown below, the regression works:
test_results = smf.ols('input_csv.iloc[:,1] ~ Age',data = input_csv).fit()
test_results.summary()
Does anyone know why this is? Is it just because of how Statsmodels was written? Is there an alternative way to specify the dependent variable for regression analysis that doesn't involve iloc or loc?
This is due to the way the formula parser patsy is written: see this link for more information
The authors of patsy have, however, thought of this problem: (quoted from here)
This flexibility does create problems in one case, though – because we
interpret whatever you write in-between the + signs as Python code,
you do in fact have to write valid Python code. And this can be tricky
if your variable names have funny characters in them, like whitespace
or punctuation. Fortunately, patsy has a builtin “transformation”
called Q() that lets you “quote” such variables
Therefore, in your case, you should be able to write:
smf.ols('Q("Count of Specific Strands") ~ Age',data = input_csv).fit()
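A small end-to-end sketch using the question's test.csv (assumed to contain the Age and Count of Specific Strands columns):
import pandas as pd
import statsmodels.formula.api as smf

input_csv = pd.read_csv('./test.csv', index_col=0, header=0)
test_results = smf.ols('Q("Count of Specific Strands") ~ Age', data=input_csv).fit()
print(test_results.summary())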

python nltk naive bayes probabilities

Is there a way to get at the individual probabilities using nltk.NaiveBayesClassifier.classify? I want to see the probabilities of classification to try and make a confidence scale. Obviously with a binary classifier the decision is going to be one or the other, but is there some way to see the inner workings of how the decision was made? Or, do I just have to write my own classifier?
Thanks
How about nltk.NaiveBayesClassifier.prob_classify?
http://nltk.org/api/nltk.classify.html#nltk.classify.naivebayes.NaiveBayesClassifier.prob_classify
classify calls this function:
def classify(self, featureset):
    return self.prob_classify(featureset).max()
Edit: something like this should work (not tested):
dist = classifier.prob_classify(features)
for label in dist.samples():
    print("%s: %f" % (label, dist.prob(label)))
I know this is utterly old, but as I struggled for some time to figure this out, I am sharing this code.
It shows the probability associated with each feature in the Naive Bayes classifier. It helped me understand better how show_most_informative_features works. Possibly it is the best option for everyone (and quite possibly that's why they created this function). Anyway, for those like me who MUST SEE the individual probability for each label and word, you can use this code:
for label in classifier.labels():
    print(f'\n\n{label}:')
    for (fname, fval) in classifier.most_informative_features(50):
        print(f"   {fname}({fval}): ", end="")
        print("{0:.2f}%".format(100 * classifier._feature_probdist[label, fname].prob(fval)))
