How do you find feature names for Decision Tree Classification? - python

I am trying to find the feature information for my decision trees. More specifically, I want to be able to tell what feature 183 is if it appears in my tree visualization. I have tried dtModel.getInputCol() but receive the following error:
AttributeError: 'DecisionTreeClassificationModel' object has no attribute 'getInputCol'
This is my current code:
from pyspark.ml.classification import DecisionTreeClassifier
# Create initial Decision Tree Model
dt = DecisionTreeClassifier(labelCol="label", featuresCol="features", maxDepth=3)
# Train model with Training Data
dtModel = dt.fit(trainingData)
display(dtModel)
If you can help or need more information, please let me know. Thank you.

See this example taken from the Spark docs (I have tried to keep the names consistent with your code, especially featuresCol="features").
I assume you have some code like this (before the code you posted in the question):
from pyspark.ml.feature import VectorIndexer

featureIndexer = VectorIndexer(inputCol="inputFeatures", outputCol="features", maxCategories=4).fit(data)
After this step, you have "features" as indexed features, which you then feed to the DecisionTreeClassifier (as in your posted code):
# Train a DecisionTree model.
dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="features")
What you're looking for is inputFeatures above, i.e. the original features before they were indexed. If you keep the list of original feature names around and want to write it out, you can do something like:
sc.parallelize(inputFeatures, 1).saveAsTextFile("absolute_path")
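Alternatively, when the "features" column was assembled by Spark ML (e.g. by a VectorAssembler or VectorIndexer), per-index attribute metadata is stored on the column itself, so an index such as 183 can often be mapped back to a name. This is a minimal sketch, assuming trainingData is the DataFrame passed to dt.fit and that the metadata keys match your Spark version:
meta = trainingData.schema["features"].metadata["ml_attr"]["attrs"]
# attrs is grouped by attribute type ("numeric", "binary", "nominal", ...);
# each entry carries the vector index and the original column name
idx_to_name = {attr["idx"]: attr["name"]
               for attr_group in meta.values()
               for attr in attr_group}
print(idx_to_name[183])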

ARIMA forecasts constrained to an interval in Python

I'm trying to copy the second exercise ("Forecasts constrained to an interval") in the link below:
https://otexts.com/fpp2/limits.html
The link fits an ARIMA with forecasts constrained to an interval, using a certain logarithmic transformation and then a back-transformation at the end. But the example in the link uses R, and I can't find a similar example for Python no matter how much I search.
Can anyone tell me how I can do the exact same thing described in the link with Python? I'm certain it is possible using the statsmodels library, but I'm not sure how to exactly replicate the transformation constraints.
The standard ARIMA in Python:
from statsmodels.tsa.arima_model import ARIMA
import numpy as np
model = ARIMA(series, order=(0,1,1))
model_fit = model.fit(trend='nc',full_output=True, disp=1)
print(model_fit.summary())
I have a feeling that I need to add something like this somewhere (transformation formula):
series = np.log((series-a)/(b-series))
as well as the back-transformation formula. But since they don't produce explicit errors I can't be sure whether I'm coding it right.
Also, I'm stuck at where I should be adding the transformation and back-transformation. I would appreciate it if someone could explain how the exercise in the link could be replicated in Python.
P.S. The 'transformation' here has nothing to do with making the time series stationary; I didn't mention stationarity because it's unrelated to my question. The link above uses the word 'transformation' for the logarithmic formula that constrains the time series to lie between 'a' and 'b'.
What I tried so far:
series = np.log((series-a)/(b-series))
model = ARIMA(series, order=(0,1,1))
model_fit = model.fit(trend='c',full_output=True, disp=1)
print(model_fit.summary())
fore = model_fit.forecast(steps=1)
fore = (b-a)*np.exp(fore)/(1+np.exp(fore)) + a
It's clear from the link you referred to in the question that the transformation takes place just before forecasting. So:
you apply the transformation to your data,
forecast using an ARIMA model on the transformed data,
then reverse the transformation on the predicted data.
a = 50
b = 400
# Transformation of the data
train = np.log((series - a) / (b - series))
# Choose a suitable order
model = ARIMA(train, order=(2, 2, 2))
results = model.fit()
start = len(train)
# One-step-ahead forecast; for an h-step horizon set end to start + h - 1
predictions = results.predict(start=start, end=start, dynamic=False, typ='levels')
# Reverse the transformation
predictions = ((b - a) * np.exp(predictions) / (1 + np.exp(predictions))) + a
Passing dynamic=False means that forecasts at each point are generated using the full history up to that point (all lagged values). Passing typ='levels' predicts the levels of the original endogenous variable; the default typ='linear' would instead return predictions in terms of the differenced endogenous variable.
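For reference, here is a minimal end-to-end sketch of the same three steps using the current statsmodels API (statsmodels.tsa.arima.model.ARIMA, which replaces the deprecated statsmodels.tsa.arima_model.ARIMA); series, a, and b are the same assumed inputs as above:
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

a, b = 50, 400
# Scaled-logit transform: maps the interval (a, b) onto the whole real line
transformed = np.log((series - a) / (b - series))
results = ARIMA(transformed, order=(2, 2, 2)).fit()
# Forecasts come out on the logit scale...
fore = results.forecast(steps=10)
# ...and are back-transformed so they are guaranteed to lie inside (a, b)
fore = (b - a) * np.exp(fore) / (1 + np.exp(fore)) + a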

H2O DistributedRandomForest all tree predictions

I use Python's H2O (version 3.22.1.3), and I was wondering whether it is possible to observe each tree's predictions in a Random Forest, as we can with scikit-learn's RandomForestRegressor.estimators_ attribute. I tried h2o.predict_leaf_node_assignment(), but it returns either the prediction path for each tree or (supposedly) the id of the leaf node the prediction was based on. In its latest version, H2O added the Tree class, but unfortunately it does not have a predict() method. Although I can access any node in any of the random forest's trees, my implementation of a tree predict function on top of the recently added tree API (even if correct) is extremely slow. So, my question is:
(a) Can I obtain tree predictions natively, and if yes, then how?
(b) If no, do the H2O developers plan to implement this feature in future releases?
Any response would be greatly appreciated.
UPDATE: Thank you, Joe, for your response. For now (until the feature is implemented directly), here is the only workaround I could think of that generates tree predictions.
# Suppose we have a random forest model called drf with ntrees=70 and want to
# make predictions on df_valid. After executing the code below, we get a
# dataframe tree_predictions with ntrees (in our case 70) columns, where the
# i-th column corresponds to the predictions of the i-th tree, and the same
# number of rows as df_valid.
import numpy as np
import pandas as pd
from h2o.tree import H2OTree

# Number of trees
ntrees = 70
# Extract all the trees of drf into a list
list_of_trees = [H2OTree(model=drf, tree_number=t, tree_class=None)
                 for t in range(ntrees)]
# leaf_nodes contains the node ids of the tree leaves holding the predictions
leaf_nodes = drf.predict_leaf_node_assignment(df_valid, type='Node_ID').as_data_frame()
# tree_predictions is the dataframe with the predictions of all 70 trees
tree_predictions = pd.DataFrame(columns=['T'+str(t+1) for t in range(ntrees)])
for t in range(ntrees):
    tr = list_of_trees[t]
    node_ids = np.array(tr.node_ids)
    treePred = lambda n: tr.predictions[np.where(node_ids == n)[0][0]]
    tree_predictions['T'+str(t+1)] = leaf_nodes['T'+str(t+1)].apply(treePred)
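As a quick sanity check (a sketch, assuming a regression DRF, whose prediction is the average of its trees), the row-wise mean of tree_predictions should closely match the model's own predictions:
mean_of_trees = tree_predictions.mean(axis=1)
drf_preds = drf.predict(df_valid).as_data_frame()['predict']
print((mean_of_trees - drf_preds).abs().max())  # should be close to zero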
Right now the answer is no. We've created an issue for implementing a new feature in the Tree API. You can track the progress here: https://0xdata.atlassian.net/browse/PUBDEV-6322.

How can we predict using RandomForestClassifier obtained from pyspark.ml

I am doing text classification, and I have built a model using the pipeline method. I have created the RF classifier object and have set the features column and the label column that I obtained in my previous steps (steps not shown).
I am fitting training data that I created as a dataframe with the columns "label" and "sentence". The labels are different question types. The DF looks like:
training = sqlContext.createDataFrame([
    ("DESC:manner", "How did serfdom develop in and then leave Russia ?"),
    ("DESC:def", "What does '' extended definition '' mean and how would one a paper on it ? "),
    ("HUM:ind", " Who was The Pride of the Yankees ?")
], ["label", "sentence"])
The code for the pipeline is,
rf = RandomForestClassifier().setFeaturesCol("features").setLabelCol("idxlabel")
pipeline = Pipeline(stages=[pos, tokenizer, hashingTF, idf, indexer,rf])
model = pipeline.fit(training)
So now I can get the predictions by using the following code,
prediction = model.transform(test)
selected = prediction.select("sentence","prediction")
I can do the select() operation to get the predicted labels.
But for my use case, there is a stream of data coming from Kinesis that will contain only sentences (plain strings), and for each sentence I have to predict the label. However, I cannot find any predict() function when I do dir(model). Why is there no predict() method for the RandomForestClassifier obtained from pyspark.ml? If there isn't one, how can I satisfy my use case? Which ML algorithm should I use, if not RF? Am I doing anything wrong? Can anyone please suggest something? Any help is appreciated. My environment is Spark 1.6 and Python 2.7.
So I figured out that there is no predict() method that can be used; instead, we need to use the transform() method to make predictions. Just leave out the label column and create a new dataframe. For example, in my case, I did:
pred = sqlContext.createDataFrame([("What are liver enzymes ?" ,)], ["sentence"])
prediction = model.transform(pred)
And then we can find the prediction using the select() method. At least for now, this solution worked for me. Please let me know if there is a correction or a better approach than this.
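A possible follow-up (a sketch, not taken from the answer above): to turn the numeric prediction back into the original label string, you can feed the prediction through IndexToString with the labels learned by the fitted StringIndexer stage. The stage index below assumes the pipeline order [pos, tokenizer, hashingTF, idf, indexer, rf] from the question, so the indexer is model.stages[4]; adjust it to your pipeline.
from pyspark.ml.feature import IndexToString

# labels learned by the fitted StringIndexer (hypothetical stage position)
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel",
                               labels=model.stages[4].labels)
prediction = labelConverter.transform(model.transform(pred))
prediction.select("sentence", "predictedLabel").show(truncate=False)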
I am also working on the same problem. Can you tell me what "pos" (part of speech) is in your pipeline stages and how you obtain it? Also, how are you preparing the test data? Below is my code:
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordsData = tokenizer.transform(training)
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
featurizedData = hashingTF.transform(wordsData)
idf = IDF(inputCol="rawFeatures", outputCol="features")
indexer = StringIndexer(inputCol="label", outputCol="idxlabel")
rf = RandomForestClassifier().setFeaturesCol("features").setLabelCol("idxlabel")
pipeline = Pipeline(stages=[tokenizer, hashingTF, idf, indexer, rf])
model = pipeline.fit(training)
Please tell me if I am doing anything wrong.

Classifying text with scikit

I'm learning scikit-learn for a project, and while I'm beginning to grasp the general process, the details are still a bit fuzzy.
Earlier I managed to build a classifier, train it, and test it with a test set. I saved it to disk with cPickle. Now I want to create a class which loads this classifier and lets a user classify single tweets with it.
I thought this would be trivial, but I get ValueError('dimension mismatch') from the line X_new_tfidf = self.tfidf_transformer.fit_transform(fitTweetVec) with the following code:
class TweetClassifier:
    classifier = None
    vect = TfidfVectorizer()
    tfidf_transformer = TfidfTransformer()

    # open the classifier saved to disk to be utilized later
    def openClassifier(self, name):
        with open(name+'.pkl', 'rb') as fid:
            return cPickle.load(fid)

    def __init__(self, classifierName):
        self.classifier = self.openClassifier(classifierName)
        self.classifyTweet(np.array([u"Helvetin vittu miksi aina pitää sataa vettä???"]))

    def classifyTweet(self, tweetText):
        fitTweetVec = self.vect.fit_transform(tweetText)
        print self.vect.get_feature_names()
        X_new_tfidf = self.tfidf_transformer.fit_transform(fitTweetVec)
        print self.classifier.predict(X_new_tfidf)
What am I doing wrong here? I used similar code while I built the classifier and ran the test set for it. Have I forgotten some important step?
Now, I admit that I don't fully understand the fitting and transforming here, since I found scikit's tutorial a bit ambiguous about it. If someone knows as clear an explanation of them as possible, I'm all for links :)
The problem is that your classifier was trained with a fixed number of features (the length of the vocabulary learned from your previous data), and now, when you fit_transform the new tweet, the TfidfVectorizer builds a brand-new vocabulary with a new number of features and represents the new tweet in that space.
The solution is to also save the previously fitted vectorizer/transformer (which hold the old vocabulary), load them together with the classifier, and call .transform (not fit_transform, because they were already fitted to the old data) to put the new tweet into the same representation.
You can also use a Pipeline that contains both the TfidfTransformer and the Classifier and pickle the Pipeline, this is easier and recommended.
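A minimal sketch of that recommended Pipeline approach (the MultinomialNB classifier and the train_texts/train_labels names are placeholders for your own training setup): the vectorizer's vocabulary is pickled together with the classifier, so new tweets only ever go through predict(), never through another fit.
import cPickle
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Train once: the Pipeline bundles the vectorizer with the classifier
pipeline = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', MultinomialNB())])
pipeline.fit(train_texts, train_labels)
with open('tweet_classifier.pkl', 'wb') as fid:
    cPickle.dump(pipeline, fid)

# Later, e.g. in TweetClassifier.__init__: load and predict, no refitting
with open('tweet_classifier.pkl', 'rb') as fid:
    pipeline = cPickle.load(fid)
print pipeline.predict([u"Helvetin vittu miksi aina pitää sataa vettä???"])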

mlpy SVM converges but classifies wrongly

I am trying to do some classification task with python and SVM.
From the collected data I extracted the feature vectors for each class and created a training set. The feature vectors have n dimensions (39 or more). So, say, for 2 classes I have a set of 39-d feature vectors and a single array of class labels corresponding to each entry in the feature vectors. Currently, I am using mlpy and doing something like this:
import numpy as np
import mlpy
svm = mlpy.Svm('gaussian')  # tried a linear kernel too, but it does not converge
instance = np.vstack((featurevector1, featurevector2))
# Assign a label (+1/-1) to each entry in instance: +1 for entries coming from
# featurevector1 and -1 for featurevector2
label = np.hstack((np.ones((1, len(featurevector1)), dtype=int),
                   -1*np.ones((1, len(featurevector2)), dtype=int)))
svm.compute(instance, label)  # it converges and outputs 1
svm.predict(testdata)  # this says all class labels are 1, whereas I have test data from both classes
Am I making a mistake here? Or should I use some other library? Please help.
I don't use mlpy, but np.ones((1, len(featurevector1))) should perhaps be just np.ones(len(featurevector1)) -- print the .shape of each to see the difference.
(If you have a link to public data anything like yours, could you post it, please?)
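To illustrate the shape difference that comment points out, a tiny sketch (featurevector1 here is a hypothetical 10x39 array):
import numpy as np

featurevector1 = np.zeros((10, 39))
print(np.ones((1, len(featurevector1))).shape)  # (1, 10): a 2-d row vector
print(np.ones(len(featurevector1)).shape)       # (10,):   the flat 1-d label array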
