I am doing text classification and have built a model using the pipeline method. I have created the RF classifier object and set the features column and the label column that I obtained in my previous steps (steps not shown).
I am fitting my training data, which I created as a dataframe with the columns "label" and "sentence". The labels are different question types. The DF looks like,
training = sqlContext.createDataFrame([
("DESC:manner", "How did serfdom develop in and then leave Russia ?"),
("DESC:def", "What does '' extended definition '' mean and how would one a paper on it ? "),
("HUM:ind", " Who was The Pride of the Yankees ?")
], ["label", "sentence"])
The code for the pipeline is,
rf = RandomForestClassifier().setFeaturesCol("features").setLabelCol("idxlabel")
pipeline = Pipeline(stages=[pos, tokenizer, hashingTF, idf, indexer,rf])
model = pipeline.fit(training)
So now I can get the predictions by using the following code,
prediction = model.transform(test)
selected = prediction.select("sentence","prediction")
I can do the select() operation to get the predicted labels.
But for my use case, there is a stream of data that is coming from Kinesis and it will be only sentences (plain strings). For each sentence, I have to predict the label. But now I am not finding any predict() function when I do dir(model). How come there is no predict() method for the RandomForestClassifier obtained from pyspark.ml? If not, how can I perform my use case successfully? I need the predict() method to satisfy the requirement. What ML algorithm should I use if not RF? Am I doing anything wrong? Can anyone please suggest something? Any help is appreciated. My environment is Spark 1.6 and Python 2.7.
I figured out that there is no predict() method that can be used. Instead, we need to use the transform() method to make predictions. Just leave out the label column and create a new dataframe. For example, in my case, I did,
pred = sqlContext.createDataFrame([("What are liver enzymes ?" ,)], ["sentence"])
prediction = model.transform(pred)
And then we can find the prediction using the select() method. At least for now, this solution has worked for me. Please let me know if there is any correction or a better approach than this.
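For the streaming use case, the same idea can be wrapped in a small helper that takes a batch of plain strings. This is only a sketch: the name predict_sentences is hypothetical, and it assumes model is the fitted pipeline model and sql_context is the SQLContext used in the question. The "prediction" column holds the numeric label index, not the original label string.

```python
def predict_sentences(model, sql_context, sentences):
    """Return (sentence, prediction) pairs for a batch of plain strings.

    Assumes `model` is the fitted pipeline model from the question and
    `sql_context` is the same SQLContext; the helper name is hypothetical.
    """
    sentences = list(sentences)
    if not sentences:
        return []  # nothing arrived in this batch
    # Each row is a 1-tuple because only the "sentence" column is supplied.
    rows = [(s,) for s in sentences]
    df = sql_context.createDataFrame(rows, ["sentence"])
    predicted = model.transform(df).select("sentence", "prediction")
    return [(row[0], row[1]) for row in predicted.collect()]

# Hypothetical usage with the objects from the question:
# predict_sentences(model, sqlContext, ["What are liver enzymes ?"])
```

Scoring a batch at a time, rather than one sentence per call, should keep the per-call DataFrame overhead down.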
I am working on the same problem. Can you tell me what "pos" (part of speech) is in the pipeline stages and how you are getting it? Also, how are you preparing the test data? Below is my code -
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordsData = tokenizer.transform(training)
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
featurizedData = hashingTF.transform(wordsData)
idf = IDF(inputCol="rawFeatures", outputCol="features")
indexer = StringIndexer(inputCol="label", outputCol="idxlabel")
rf = RandomForestClassifier().setFeaturesCol("features").setLabelCol("idxlabel")
pipeline = Pipeline(stages=[tokenizer, hashingTF, idf, indexer, rf])
model = pipeline.fit(training)
Please tell me if I am doing anything wrong.
I am trying to find the feature information for my decision trees. More specifically, I want to be able to tell what feature 183 is if it appears in my tree visualization. I have tried dtModel.getInputCol() but receive the following error.
AttributeError: 'DecisionTreeClassificationModel' object has no attribute 'getInputCol'
This is my current code:
from pyspark.ml.classification import DecisionTreeClassifier
# Create initial Decision Tree Model
dt = DecisionTreeClassifier(labelCol="label", featuresCol="features", maxDepth=3)
# Train model with Training Data
dtModel = dt.fit(trainingData)
display(dtModel)
If you can help or need more information, please let me know. Thank you.
See this example taken from the Spark docs (I tried to keep the names consistent with your code, especially featuresCol="features").
I assume you have some code like this (before the code you posted in the question):
featureIndexer = VectorIndexer(inputCol="inputFeatures", outputCol="features", maxCategories=4).fit(data)
After this step, "features" holds the indexed features, which you then feed to the DecisionTreeClassifier (like your posted code):
# Train a DecisionTree model.
dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="features")
What you're looking for is inputFeatures above, which is the original features before being indexed. If you want to print it, simply do something like:
sc.parallelize(inputFeatures, 1).saveAsTextFile("absolute_path")
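One way to look up what a feature index such as 183 refers to is to read the attribute metadata Spark attaches to the vector column. This is a sketch under assumptions: it expects the dict found at df.schema["features"].metadata, and Spark only fills in the "ml_attr" entry when an upstream stage (for example VectorAssembler) recorded attribute names. The helper name feature_names_by_index is hypothetical.

```python
def feature_names_by_index(features_metadata):
    """Map vector positions to input column names.

    `features_metadata` is expected to be the dict found at
    df.schema["features"].metadata. If no upstream stage recorded
    attribute names, the result is simply empty.
    """
    attrs = features_metadata.get("ml_attr", {}).get("attrs", {})
    names = {}
    # Attributes are grouped by kind (e.g. "numeric", "binary", "nominal").
    for group in attrs.values():
        for attr in group:
            names[attr["idx"]] = attr.get("name", "<unnamed>")
    return names

# Hypothetical usage with the DataFrame from the question:
# names = feature_names_by_index(trainingData.schema["features"].metadata)
# print(names.get(183))
```

If the features came from a hashing step, there are no names to recover and the mapping will be empty.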
I'm trying to build a simple preprocessing Pipeline for a clustering model that uses K-Means and export it to PMML format.
I manage to make the Pipeline work but can't manage to finally export it to pmml.
I have divided the pipeline in two steps, handle numerical data and handle categorical data.
numeric_features = ['column1','column2','column3']
categorical_features = ['column4','column5']
num_mapper = sklearn_pandas.DataFrameMapper([([numeric_column],SimpleImputer(strategy='median')) for numeric_column in numeric_features]
,df_out=True,default=None)
categorical_mapper = sklearn_pandas.DataFrameMapper([([categorical_column],LabelBinarizer()) for categorical_column in categorical_features]
,df_out=True,default=None)
pipeline = PMMLPipeline(steps=[
('num_mapper',num_mapper),
('cat_mapper',categorical_mapper)
])
Note that I have set default to None in the first DataFrameMapper, since this lets the output dataframe preserve columns that haven't been selected (columns that will be needed by the second mapper).
These workarounds work fine; the problem comes later, when I try to export the pipeline to PMML
sklearn2pmml.sklearn2pmml(pipeline,'mypath')
This line of code yields the following error
java.lang.IllegalArgumentException: Attribute 'sklearn_pandas.dataframe_mapper.DataFrameMapper.default' has a missing (None/null) value
at org.jpmml.sklearn.PyClassDict.get(PyClassDict.java:46)
at org.jpmml.sklearn.PyClassDict.getObject(PyClassDict.java:97)
I know this error is probably caused by setting default to None in both DataFrameMappers, but it was the only workaround I found to preserve the columns needed by the second mapper.
Is there any other workaround I could use? I know I could do all the transformations in the first DataFrameMapper, but I don't like that idea since I want to keep the numerical transformations separate from the categorical ones.
I recently came to understand how FeatureUnion is used and realized it could be an elegant solution.
Create the same mappers
numeric_features = ['column1','column2','column3']
categorical_features = ['column4','column5']
num_mapper = sklearn_pandas.DataFrameMapper([([numeric_column],SimpleImputer(strategy='median')) for numeric_column in numeric_features]
)
categorical_mapper = sklearn_pandas.DataFrameMapper([([categorical_column],LabelBinarizer()) for categorical_column in categorical_features])
preprocessing = FeatureUnion(transformer_list=[('num_mapper',num_mapper),('cat_mapper',categorical_mapper)])
pipeline = PMMLPipeline(steps=[
('preprocessing',preprocessing)
])
sklearn2pmml.sklearn2pmml(pipeline,'mypath')
With this workaround I even managed to avoid the df_out and default flags in the function call.
I am trying to train a model to predict the category of text input data. I am running into what seems to be numerical instability using the pyspark.ml.classification.NaiveBayes classifier on a bag-of-words when the number of classes is above a certain amount.
In my real world project, I have on the order of ~1bn records and ~50 classes. I am able to train my model and make predictions but I get an error when I try to save it using model.save(). Operationally, this is annoying since I have to retrain my model each time from scratch.
In trying to debug, I scaled my data down to around ~10k rows and had the same issue trying to save. However saving works fine if I reduce the number of class labels.
This leads me to believe that there is a limit to the number of labels. I am not able to reproduce my exact issues, but the code below is related. If I set num_labels to anything greater than 31, model.fit() throws an error.
My questions:
Is there a limit to the number of classes in the mllib implementation of NaiveBayes?
What could be some reasons that I am not able to save my model if I can successfully use it to make predictions?
If there is indeed a limit, would it be possible to split my data into groups of smaller classes, train separate models, and combine?
Full Working Example
Create some dummy data.
I'm going to use nltk.corpus.comparative_sentences and nltk.corpus.sentence_polarity. Keep in mind that this is just an illustrative example with nonsense data - I'm not concerned with the performance of the fitted model.
import pandas as pd
from pyspark.sql.types import StringType
# create some dummy data
from nltk.corpus import comparative_sentences as cs, sentence_polarity as sp
df = pd.DataFrame(
{
'sentence': [" ".join(s) for s in cs.sents() + sp.sents()]
}
)
# assign a 'category' to each row
num_labels = 31 # seems to be the upper limit
df['category'] = (df.index%num_labels).astype(str)
# make it into a spark dataframe
spark_df = sqlCtx.createDataFrame(df)
Data Preparation Pipeline
from pyspark.ml.feature import NGram, Tokenizer, StopWordsRemover
from pyspark.ml.feature import HashingTF, IDF, StringIndexer, VectorAssembler
from pyspark.ml import Pipeline
from pyspark.ml.linalg import Vector
indexer = StringIndexer(inputCol='category', outputCol='label')
tokenizer = Tokenizer(inputCol="sentence", outputCol="sentence_tokens")
remove_stop_words = StopWordsRemover(inputCol="sentence_tokens", outputCol="filtered")
unigrammer = NGram(n=1, inputCol="filtered", outputCol="tokens")
hashingTF = HashingTF(inputCol="tokens", outputCol="hashed_tokens")
idf = IDF(inputCol="hashed_tokens", outputCol="tf_idf_tokens")
clean_up = VectorAssembler(inputCols=['tf_idf_tokens'], outputCol='features')
data_prep_pipe = Pipeline(
stages=[indexer, tokenizer, remove_stop_words, unigrammer, hashingTF, idf, clean_up]
)
transformed = data_prep_pipe.fit(spark_df).transform(spark_df)
clean_data = transformed.select(['label','features'])
Train the model
from pyspark.ml.classification import NaiveBayes
nb = NaiveBayes()
(training,testing) = clean_data.randomSplit([0.7,0.3], seed=12345)
model = nb.fit(training)
test_results = model.transform(testing)
Evaluate Model
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
acc_eval = MulticlassClassificationEvaluator()
acc = acc_eval.evaluate(test_results)
print("Accuracy of model at predicting label was: {}".format(acc))
On my machine, this prints:
Accuracy of model at predicting label was: 0.0305764788269
Error Message
If I change num_labels to 32 or higher, this is the error I get when I call model.fit():
Py4JJavaError: An error occurred while calling o1336.fit. :
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 86.0 failed 4 times, most recent failure: Lost task
0.3 in stage 86.0 (TID 1984, someserver.somecompany.net, executor 22): org.apache.spark.SparkException: Kryo serialization failed: Buffer
overflow. Available: 7, required: 8 Serialization trace: values
(org.apache.spark.ml.linalg.DenseVector). To avoid this, increase
spark.kryoserializer.buffer.max value.
...
...
blah blah blah more java stuff that goes on forever
Notes
In this example, if I add a feature for bigrams, the error happens if num_labels > 15. I wonder if it is coincidence that this is also 1 less than a power of 2.
In my real-world project, I also get an error when trying to call model.theta. (I don't think the errors themselves are meaningful - they are just the exceptions passed back from the java/scala methods.)
Hard limitations:
Number of features * Number of classes has to be lower than Integer.MAX_VALUE (2^31 - 1). You are nowhere near this value.
Soft limitations:
Theta matrix (conditional probabilities) is of size Number of features * Number of classes. Theta is stored locally on the driver (as a part of the model) and is also serialized and sent to the workers. This means that all machines require at least enough memory to serialize or deserialize it and store the result.
Since you use the default setting for HashingTF.numFeatures (2^18), each additional class adds 262144 entries - it is not that much, but it quickly adds up. Based on the partial traceback you've posted, it looks like the failing component is the Kryo serializer. The same traceback also suggests the solution, which is increasing spark.kryoserializer.buffer.max.
You can also try using standard Java serialization by setting:
spark.serializer org.apache.spark.serializer.JavaSerializer
Since you use PySpark with pyspark.ml and pyspark.sql it might be acceptable without significant performance loss.
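The size argument above can be checked with a little arithmetic. The sketch below assumes the values are stored as dense 8-byte doubles, that HashingTF runs with 262144 features, and that spark.kryoserializer.buffer.max is at its default of 64m; it ignores any serialization overhead, so treat it as a back-of-the-envelope estimate only. The name theta_bytes is made up for the example.

```python
BYTES_PER_DOUBLE = 8

def theta_bytes(num_features, num_classes):
    """Rough size of a dense num_classes x num_features matrix of doubles."""
    return num_features * num_classes * BYTES_PER_DOUBLE

num_features = 1 << 18                    # 262144 hashed features (assumed)
kryo_default_max = 64 * 1024 * 1024       # 64m default buffer limit (assumed in effect)

for num_classes in (31, 32):
    size = theta_bytes(num_features, num_classes)
    below_limit = size < kryo_default_max
    print("%d classes: %.1f MiB, below Kryo limit: %s"
          % (num_classes, size / (1024.0 * 1024.0), below_limit))
```

Under these assumptions, 31 classes come to 62 MiB and 32 classes to exactly 64 MiB, which is at least consistent with the failure appearing at 32 labels.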
Configuration aside, I would focus on the feature engineering component. Using a binary CountVectorizer (see the note about HashingTF below) with ChiSqSelector might provide one way to both increase interpretability and effectively reduce the number of features. You may also consider more sophisticated approaches (determining feature importances and applying Naive Bayes only on a subset of the data, more advanced text processing like lemmatization / stemming, or using some variant of autoencoder to get a more compact vector representation).
Notes:
Please keep in mind that multinomial Naive Bayes considers only binary features. NaiveBayes will handle this internally, but I would still recommend using setBinary for clarity.
Arguably HashingTF is rather useless here. Hash collisions aside, highly sparse and essentially meaningless features make it a poor choice as a preprocessing step for NaiveBayes.
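To make the hash collision point concrete, here is a small pure-Python sketch of the hashing trick. It uses zlib.crc32 purely as a stand-in, since Spark uses its own hash function, and the helper names are made up for the example. With more distinct terms than buckets, at least two terms must share an index.

```python
import zlib

def hashed_index(term, num_features):
    """Bucket index for a term (crc32 is a stand-in, not Spark's hash)."""
    return zlib.crc32(term.encode("utf-8")) % num_features

def find_collisions(terms, num_features):
    """Return {index: [terms]} for every bucket shared by 2+ distinct terms."""
    buckets = {}
    for term in sorted(set(terms)):
        buckets.setdefault(hashed_index(term, num_features), []).append(term)
    return {i: ts for i, ts in buckets.items() if len(ts) > 1}

# 25 distinct terms into 20 buckets: collisions are unavoidable.
vocabulary = ["word%d" % i for i in range(25)]
collisions = find_collisions(vocabulary, 20)
for index, terms in sorted(collisions.items()):
    print(index, terms)
```

Terms that collide become indistinguishable to the classifier, which is one reason a vocabulary-based vectorizer can be easier to interpret.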
So, I am a rookie to machine learning and Spark and was going through the Spark MLlib documentation on regression, especially LinearRegressionWithSGD, at this page. I am having a bit of difficulty understanding the Python code. Here is what I have understood so far: the code loads the data and then forms LabeledPoint objects. After that the model is built, then it is evaluated on the training data and the MSE is calculated.
Now the part that is confusing me is that in the normal machine learning process we first divide the data into a training set and a test set. Then we build the model using the training set and finally evaluate it using the test set. In the code in the Spark MLlib documentation I do not see any division into training and test sets. On top of that, I see them building the model using the data and then evaluating it using the same data.
Is there something that I am not able to understand in the code? Any help to understand the code will be helpful.
NOTE: This is the code at Spark MLlib's documentation page for LinearRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD, LinearRegressionModel
# Load and parse the data
def parsePoint(line):
    values = [float(x) for x in line.replace(',', ' ').split(' ')]
    return LabeledPoint(values[0], values[1:])
data = sc.textFile("data/mllib/ridge-data/lpsa.data")
parsedData = data.map(parsePoint)
# Build the model
model = LinearRegressionWithSGD.train(parsedData)
# Evaluate the model on training data
valuesAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
# Index the (label, prediction) pair; tuple-unpacking lambdas only work in Python 2
MSE = valuesAndPreds.map(lambda vp: (vp[0] - vp[1])**2).reduce(lambda x, y: x + y) / valuesAndPreds.count()
print("Mean Squared Error = " + str(MSE))
# Save and load model
model.save(sc, "myModelPath")
sameModel = LinearRegressionModel.load(sc, "myModelPath")
The procedure you are talking about is cross-validation. As you observed, the example above didn't do cross-validation. But this doesn't mean it's wrong.
The sole purpose of that example is to illustrate how to train and use a model. You are free to split the data and cross-validate the model; the procedure will be the same. Only the data changes.
In addition, performance on the training set is also valuable. It can tell you whether your model is overfitting or underfitting.
So to summarize, the example is all right; what you need is another example on cross-validation.
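As a sketch of what evaluating on held-out data could look like, the helper below does a single train/test split (the simplest variant of the validation procedure discussed above) and reports the MSE on both parts. It assumes parsed_data is an RDD of LabeledPoint, as in the question, and relies on the RDD methods randomSplit, map and mean. The name holdout_mse and its parameters are hypothetical.

```python
def holdout_mse(parsed_data, train, test_fraction=0.2, seed=42):
    """Train on one part of the data and report MSE on both parts.

    `parsed_data` is assumed to be an RDD of LabeledPoint and `train` a
    callable that returns a model with a predict(features) method.
    """
    train_set, test_set = parsed_data.randomSplit(
        [1.0 - test_fraction, test_fraction], seed=seed)
    model = train(train_set)

    def mse(points):
        # Squared error per point, averaged over the split.
        errors = points.map(
            lambda p: (p.label - model.predict(p.features)) ** 2)
        return errors.mean()

    return model, mse(train_set), mse(test_set)

# Hypothetical usage with the names from the question:
# model, train_mse, test_mse = holdout_mse(parsedData, LinearRegressionWithSGD.train)
```

A training error much lower than the test error would point toward overfitting, which is the comparison the training-set MSE alone cannot give you.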
I'm learning Scikit machine-learning for a project and while I'm beginning to grasp the general process the details are a bit fuzzy still.
Earlier I managed to build a classifier, train it and test it with a test set. I saved it to disk with cPickle. Now I want to create a class which loads this classifier and lets the user classify single tweets with it.
I thought this would be trivial but I seem to get ValueError('dimension mismatch') from X_new_tfidf = self.tfidf_transformer.fit_transform(fitTweetVec) line with following code:
class TweetClassifier:
    classifier = None
    vect = TfidfVectorizer()
    tfidf_transformer = TfidfTransformer()
    #open the classifier saved to disk to be utilized later
    def openClassifier(self, name):
        with open(name+'.pkl', 'rb') as fid:
            return cPickle.load(fid)
    def __init__(self, classifierName):
        self.classifier = self.openClassifier(classifierName)
        self.classifyTweet(np.array([u"Helvetin vittu miksi aina pitää sataa vettä???"]))
    def classifyTweet(self, tweetText):
        fitTweetVec = self.vect.fit_transform(tweetText)
        print self.vect.get_feature_names()
        X_new_tfidf = self.tfidf_transformer.fit_transform(fitTweetVec)
        print self.classifier.predict(X_new_tfidf)
What am I doing wrong here? I used similar code when I made the classifier and ran the test set through it. Have I forgotten some important step here?
I admit that I don't yet fully understand the fitting and transforming here, since I found Scikit's tutorial a bit ambiguous about it. If someone knows of a clear explanation of them, I'm all for links :)
The problem is that your classifier was trained with a fixed number of features (the length of the vocabulary of your previous data). When you now fit_transform the new tweet, the vectorizer builds a new vocabulary with a different number of features and represents the new tweet in that new space.
The solution is to also save the previously fitted vectorizer and TfidfTransformer (the vectorizer holds the old vocabulary), load them with the classifier, and .transform (not fit_transform, because they were already fitted to the old data) the new tweet into the same representation.
You can also use a Pipeline that contains both the feature extraction steps and the classifier, and pickle the Pipeline. This is easier and recommended.
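The difference between fit_transform and transform can be seen with a toy example. ToyCountVectorizer below is a deliberately simplified stand-in written for this illustration, not scikit-learn's implementation: fit learns a vocabulary, and transform maps text onto that fixed vocabulary.

```python
class ToyCountVectorizer:
    """Toy stand-in for a vectorizer; only illustrates fit vs transform."""

    def fit(self, docs):
        # Learn the vocabulary: one column per distinct word.
        words = sorted({w for d in docs for w in d.lower().split()})
        self.vocabulary_ = {w: i for i, w in enumerate(words)}
        return self

    def transform(self, docs):
        # Count words using the already learned vocabulary.
        rows = []
        for d in docs:
            row = [0] * len(self.vocabulary_)
            for w in d.lower().split():
                if w in self.vocabulary_:  # unseen words are ignored
                    row[self.vocabulary_[w]] += 1
            rows.append(row)
        return rows

    def fit_transform(self, docs):
        return self.fit(docs).transform(docs)

training_docs = ["rain again today", "sunny day today", "rain and wind"]
vect = ToyCountVectorizer().fit(training_docs)
n_trained = len(vect.vocabulary_)  # width a classifier trained now would expect

new_tweet = ["why does it always rain"]
same_space = vect.transform(new_tweet)                       # keeps the trained width
new_space = ToyCountVectorizer().fit_transform(new_tweet)    # learns a new vocabulary

print(len(same_space[0]) == n_trained)
print(len(new_space[0]) == n_trained)
```

The second vector has a different width from the one the classifier was trained on, which is the kind of mismatch behind the error. Saving the fitted object alongside the classifier (or both inside one pipeline) avoids re-fitting at prediction time.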