Issue in understanding Spark MLlib's LinearRegressionWithSGD example in Python

So, I am a rookie to machine learning and Spark and was going through Spark MLlib's documentation on regression, especially LinearRegressionWithSGD, at this page. I am having a bit of difficulty understanding the Python code. Here is what I have understood so far - the code loads the data and then forms LabeledPoints. After that the model is built, and then it is evaluated on the training data and the MSE is calculated.
Now the part that is confusing me is that during the normal machine learning process we first divide the data into a training set and a test set. Then we build the model using the training set and finally evaluate it using the test set. In the code in the Spark MLlib documentation I do not see any division into training and test sets, and on top of that I see them building the model using the data and then evaluating on the same data.
Is there something that I am not able to understand in the code? Any help to understand the code will be helpful.
NOTE: This is the code at Spark MLlib's documentation page for LinearRegressionWithSGD:
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD, LinearRegressionModel
# Load and parse the data
def parsePoint(line):
    values = [float(x) for x in line.replace(',', ' ').split(' ')]
    return LabeledPoint(values[0], values[1:])
data = sc.textFile("data/mllib/ridge-data/lpsa.data")
parsedData = data.map(parsePoint)
# Build the model
model = LinearRegressionWithSGD.train(parsedData)
# Evaluate the model on training data
valuesAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
MSE = valuesAndPreds.map(lambda (v, p): (v - p)**2).reduce(lambda x, y: x + y) / valuesAndPreds.count()
print("Mean Squared Error = " + str(MSE))
# Save and load model
model.save(sc, "myModelPath")
sameModel = LinearRegressionModel.load(sc, "myModelPath")

The procedure you are talking about is cross-validation. As you observed, the example above doesn't do cross-validation, but this doesn't mean it's wrong.
The sole purpose of that example is to illustrate how to train and use a model. You are free to split the data and cross-validate the model; the procedure will be the same, only the data changes.
In addition, performance on the training set is also valuable: it can tell you whether your model is overfitting or underfitting.
So to summarize, the example is all right; what you need is another example on cross-validation.
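For example, here is a minimal sketch of a hold-out split on the same parsedData RDD from the documentation example (the 70/30 ratio and the seed are arbitrary illustrative choices, not part of the original example):
# Hold out part of the data for evaluation (ratio and seed are arbitrary)
trainingData, testData = parsedData.randomSplit([0.7, 0.3], seed=42)
# Train only on the training portion
model = LinearRegressionWithSGD.train(trainingData)
# Evaluate on the held-out test portion
valuesAndPreds = testData.map(lambda p: (p.label, model.predict(p.features)))
testMSE = valuesAndPreds.map(lambda vp: (vp[0] - vp[1]) ** 2).mean()
print("Test Mean Squared Error = " + str(testMSE))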

Related

How do you find feature names for Decision Tree Classification?

I am trying to find the feature information for my decision trees. More specifically, I want to be able to tell what feature 183 is if it appears in my tree visualization. I have tried dtModel.getInputCol() but receive the following error.
AttributeError: 'DecisionTreeClassificationModel' object has no attribute 'getInputCol'
This is my current code:
from pyspark.ml.classification import DecisionTreeClassifier
# Create initial Decision Tree Model
dt = DecisionTreeClassifier(labelCol="label", featuresCol="features", maxDepth=3)
# Train model with Training Data
dtModel = dt.fit(trainingData)
display(dtModel)
If you can help or need more information, please let me know. Thank you.
See this example taken from the Spark docs (I have tried to keep the names consistent with your code, especially featuresCol="features").
I assume you have some code like this (before the code you posted in the question):
featureIndexer = VectorIndexer(inputCol="inputFeatures", outputCol="features", maxCategories=4).fit(data)
After this step, you have "features" as the indexed features, which you then feed to the DecisionTreeClassifier (as in your posted code):
# Train a DecisionTree model.
dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="features")
What you're looking for is inputFeatures above, which holds the original features before indexing. If you want to print it, simply do something like:
sc.parallelize(inputFeatures, 1).saveAsTextFile("absolute_path")
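Alternatively, if your "features" vector column carries ML attribute metadata (for example because it was produced by a VectorAssembler or VectorIndexer), the per-index feature names can often be read straight from the DataFrame schema. A rough sketch under that assumption, using the trainingData DataFrame from your code:
# Sketch: assumes the "features" column has ML attribute metadata attached
meta = trainingData.schema["features"].metadata.get("ml_attr", {})
attrs = meta.get("attrs", {})
# attrs maps attribute type -> list of {"idx": ..., "name": ...} entries
idx_to_name = {a["idx"]: a["name"]
               for attr_list in attrs.values()
               for a in attr_list}
print(idx_to_name.get(183))  # name of feature 183, if metadata is present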

Is there a limit on the number of classes in mllib NaiveBayes? Error calling model.save()

I am trying to train a model to predict the category of text input data. I am running into what seems to be numerical instability using the pyspark.ml.classification.NaiveBayes classifier on a bag-of-words when the number of classes is above a certain amount.
In my real world project, I have on the order of ~1bn records and ~50 classes. I am able to train my model and make predictions but I get an error when I try to save it using model.save(). Operationally, this is annoying since I have to retrain my model each time from scratch.
In trying to debug, I scaled my data down to around ~10k rows and had the same issue trying to save. However saving works fine if I reduce the number of class labels.
This leads me to believe that there is a limit to the number of labels. I am not able to reproduce my exact issues, but the code below is related. If I set num_labels to anything greater than 31, model.fit() throws an error.
My questions:
Is there a limit to the number of classes in the mllib implementation of NaiveBayes?
What could be some reasons that I am not able to save my model if I can successfully use it to make predictions?
If there is indeed a limit, would it be possible to split my data into groups of smaller classes, train separate models, and combine?
Full Working Example
Create some dummy data.
I'm going to use nltk.corpus.comparative_sentences and nltk.corpus.sentence_polarity. Keep in mind that this is just an illustrative example with nonsense data - I'm not concerned with the performance of the fitted model.
import pandas as pd
from pyspark.sql.types import StringType
# create some dummy data
from nltk.corpus import comparative_sentences as cs, sentence_polarity as sp
df = pd.DataFrame(
    {
        'sentence': [" ".join(s) for s in cs.sents() + sp.sents()]
    }
)
# assign a 'category' to each row
num_labels = 31 # seems to be the upper limit
df['category'] = (df.index%num_labels).astype(str)
# make it into a spark dataframe
spark_df = sqlCtx.createDataFrame(df)
Data Preparation Pipeline
from pyspark.ml.feature import NGram, Tokenizer, StopWordsRemover
from pyspark.ml.feature import HashingTF, IDF, StringIndexer, VectorAssembler
from pyspark.ml import Pipeline
from pyspark.ml.linalg import Vector
indexer = StringIndexer(inputCol='category', outputCol='label')
tokenizer = Tokenizer(inputCol="sentence", outputCol="sentence_tokens")
remove_stop_words = StopWordsRemover(inputCol="sentence_tokens", outputCol="filtered")
unigrammer = NGram(n=1, inputCol="filtered", outputCol="tokens")
hashingTF = HashingTF(inputCol="tokens", outputCol="hashed_tokens")
idf = IDF(inputCol="hashed_tokens", outputCol="tf_idf_tokens")
clean_up = VectorAssembler(inputCols=['tf_idf_tokens'], outputCol='features')
data_prep_pipe = Pipeline(
    stages=[indexer, tokenizer, remove_stop_words, unigrammer, hashingTF, idf, clean_up]
)
transformed = data_prep_pipe.fit(spark_df).transform(spark_df)
clean_data = transformed.select(['label','features'])
Train the model
from pyspark.ml.classification import NaiveBayes
nb = NaiveBayes()
(training,testing) = clean_data.randomSplit([0.7,0.3], seed=12345)
model = nb.fit(training)
test_results = model.transform(testing)
Evaluate Model
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
acc_eval = MulticlassClassificationEvaluator()
acc = acc_eval.evaluate(test_results)
print("Accuracy of model at predicting label was: {}".format(acc))
On my machine, this prints:
Accuracy of model at predicting label was: 0.0305764788269
Error Message
If I change num_labels to 32 or higher, this is the error I get when I call model.fit():
Py4JJavaError: An error occurred while calling o1336.fit. :
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 86.0 failed 4 times, most recent failure: Lost task
0.3 in stage 86.0 (TID 1984, someserver.somecompany.net, executor 22): org.apache.spark.SparkException: Kryo serialization failed: Buffer
overflow. Available: 7, required: 8 Serialization trace: values
(org.apache.spark.ml.linalg.DenseVector). To avoid this, increase
spark.kryoserializer.buffer.max value.
...
...
blah blah blah more java stuff that goes on forever
Notes
In this example, if I add a feature for bigrams, the error happens if num_labels > 15. I wonder if it is coincidence that this is also 1 less than a power of 2.
In my real-world project, I also get an error when trying to call model.theta. (I don't think the errors themselves are meaningful - they are just the exceptions passed back from the java/scala methods.)
Hard limitations:
Number of features * Number of classes has to be lower than Integer.MAX_VALUE (2^31 - 1). You are nowhere near this value.
Soft limitations:
The theta matrix (conditional probabilities) has size Number of features * Number of classes. Theta is stored both locally on the driver (as part of the model) and serialized and sent to the workers. This means that all machines require at least enough memory to serialize or deserialize and store the result.
Since you use the default setting for HashingTF.numFeatures (2^18), each additional class adds 262,144 values - not that much on its own, but it quickly adds up. Based on the partial traceback you've posted, it looks like the failing component is the Kryo serializer. The same traceback also suggests the solution, which is increasing spark.kryoserializer.buffer.max.
You can also try using standard Java serialization by setting:
spark.serializer org.apache.spark.serializer.JavaSerializer
Since you use PySpark with pyspark.ml and pyspark.sql, it might be acceptable without significant performance loss.
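For example, either option could be set on the SparkConf before the context is created; this is a rough sketch, and the buffer size below is an arbitrary illustrative value:
from pyspark import SparkConf

conf = SparkConf()
# Option 1: keep Kryo but give it a larger buffer (value is an arbitrary example)
conf.set("spark.kryoserializer.buffer.max", "512m")
# Option 2: fall back to plain Java serialization instead
# conf.set("spark.serializer", "org.apache.spark.serializer.JavaSerializer")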
Configuration aside, I would focus on the feature engineering component. Using a binary CountVectorizer (see the note about HashingTF below) with ChiSqSelector might provide one way to both increase interpretability and effectively reduce the number of features. You may also consider more sophisticated approaches (determining feature importances and applying Naive Bayes only on a subset of the data, more advanced text processing like lemmatization / stemming, or using some variant of an autoencoder to get a more compact vector representation).
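A rough sketch of that idea, reusing the tokenized "tokens" and indexed "label" columns from your pipeline; the numTopFeatures value is an arbitrary illustrative choice:
from pyspark.ml.feature import CountVectorizer, ChiSqSelector

# Binary counts keep the features closer to what Naive Bayes assumes
count_vec = CountVectorizer(inputCol="tokens", outputCol="counts", binary=True)
# Keep only the terms most associated with the label (500 is arbitrary)
selector = ChiSqSelector(numTopFeatures=500, featuresCol="counts",
                         labelCol="label", outputCol="features")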
Notes:
Please keep in mind that multinomial Naive Bayes considers only binary features. NaiveBayes will handle this internally, but I would still recommend using setBinary for clarity.
Arguably, HashingTF is rather useless here. Hash collisions aside, its highly sparse and essentially meaningless features make it a poor choice as a preprocessing step for NaiveBayes.
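For instance, a tiny sketch of the setBinary suggestion, applied to the hashingTF stage from your pipeline (setBinary is available on the ml HashingTF in Spark 2.0+):
# Sketch: emit binary term presence instead of raw term frequencies
hashingTF = HashingTF(inputCol="tokens", outputCol="hashed_tokens").setBinary(True)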

formatting design matrix for regression

I am given a test set without the response variable. I have already built the model and need to predict the response variable in the testing set.
I am having trouble formatting the test design matrix so that it would be compatible.
I am using patsy library to construct the matrix.
I want to do something like this, except the code below does not work:
X = dmatrices('Response ~ var1 + var2', test, return_type = 'dataframe')
What is the right approach? Thanks.
If you used patsy to fit the model in the first place, then you should tell it "hey, you know how you built my first design matrix? build me another the same way":
# Set up training data
train_Y, train_X = dmatrices("Response ~ ...", train, return_type="dataframe")
# Save patsy's record of how it built this matrix:
design_info = train_X.design_info
# Re-use it to build the test matrix
test_X = dmatrix(design_info, test, return_type="dataframe")
Alternatively, you could build a new matrix from scratch:
# Use 'dmatrix' and leave out the left-hand-side of the formula
test_X = dmatrix("~ ...", test, return_type="dataframe")
The first approach is better if you can do it. For example, suppose you have a categorical variable that you're letting patsy encode for you, and suppose that there are 10 categories that show up in your training set, but only 5 of them occur in your test set. If you use the first approach, then patsy will remember what the 10 categories were and generate a test matrix with 10 columns (some of them all zeros). If you use the second approach, then patsy will generate a training matrix with 10 columns and a test matrix with 5 columns, and then your model code will probably crash because the matrix isn't the shape it expects.
Another case where this matters is if you use patsy's center function to center a variable: with the first approach, it will automatically remember what value it subtracted from the training data and re-use it for the test data, which is what you want. With the second approach, it will recompute the center using the test data, which can silently give you badly wrong results.
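A tiny runnable sketch of that categorical situation (the column names and data here are made up for illustration):
import pandas as pd
from patsy import dmatrices, dmatrix

# Training data sees three categories; test data only sees one
train = pd.DataFrame({"Response": [1.0, 2.0, 3.0], "var1": ["a", "b", "c"]})
test = pd.DataFrame({"var1": ["a"]})

train_Y, train_X = dmatrices("Response ~ var1", train, return_type="dataframe")
# Reusing design_info keeps all category columns in the test matrix
test_X = dmatrix(train_X.design_info, test, return_type="dataframe")
print(test_X.shape[1] == train_X.shape[1])  # True: same number of columns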

How can we predict using RandomForestClassifier obtained from pyspark.ml

I am doing a text classification and I have built a model using the pipeline method. I have created the RF classifier object and have set the features column and the label column that I obtained in my previous steps (steps not shown).
I am fitting my training data which I have created using a dataframe and it has the columns "labels" and "sentences". The labels are different question types. The DF looks like,
training = sqlContext.createDataFrame([
("DESC:manner", "How did serfdom develop in and then leave Russia ?"),
("DESC:def", "What does '' extended definition '' mean and how would one a paper on it ? "),
("HUM:ind", " Who was The Pride of the Yankees ?")
], ["label", "sentence"])
The code for the pipeline is,
rf = RandomForestClassifier().setFeaturesCol("features").setLabelCol("idxlabel")
pipeline = Pipeline(stages=[pos, tokenizer, hashingTF, idf, indexer,rf])
model = pipeline.fit(training)
So now I can get the predictions by using the following code,
prediction = model.transform(test)
selected = prediction.select("sentence","prediction")
I can do the select() operation to get the predicted labels.
But for my use case, there is a stream of data that is coming from Kinesis and it will be only sentences (plain strings). For each sentence, I have to predict the label. But now I am not finding any predict() function when I do dir(model). How come there is no predict() method for the RandomForestClassifier obtained from pyspark.ml? If not, how can I perform my use case successfully? I need the predict() method to satisfy the requirement. What ML algorithm should I use if not RF? Am I doing anything wrong? Can anyone please suggest something? Any help is appreciated. My environment is Spark 1.6 and Python 2.7.
So I figured out that there is no predict() method that can be used. Instead, we need to use the transform() method to make predictions. Just remove the label column and create a new dataframe. For example, in my case, I did:
pred = sqlContext.createDataFrame([("What are liver enzymes ?" ,)], ["sentence"])
prediction = model.transform(pred)
And then we can find the prediction using the select() method. At least for now, this solution worked successfully for me. Please do let me know if there is any correction or a better approach than this.
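Note that the pipeline's StringIndexer turns the string labels into numeric indices, so the numeric "prediction" column may need to be mapped back to the original label strings. One possible way is IndexToString; this is only a sketch, and the stage index below is an assumption based on the pipeline in the question (stages=[pos, tokenizer, hashingTF, idf, indexer, rf], so the fitted indexer is stage 4):
from pyspark.ml.feature import IndexToString

# Assumes the fitted StringIndexerModel is the 5th stage; adjust to your pipeline
indexer_model = model.stages[4]
to_label = IndexToString(inputCol="prediction", outputCol="predictedLabel",
                         labels=indexer_model.labels)
labeled_prediction = to_label.transform(prediction)
labeled_prediction.select("sentence", "predictedLabel").show()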
I am also working on the same problem. Can you tell me what "pos" (part of speech) is in your pipeline stages and how you are getting it? Also, how are you preparing the test data? Below is my code:
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordsData = tokenizer.transform(training)
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
featurizedData = hashingTF.transform(wordsData)
idf = IDF(inputCol="rawFeatures", outputCol="features")
indexer = StringIndexer(inputCol="label", outputCol="idxlabel")
rf = RandomForestClassifier().setFeaturesCol("features").setLabelCol("idxlabel")
pipeline = Pipeline(stages=[tokenizer, hashingTF, idf, indexer, rf])
model = pipeline.fit(training)
Please tell me if I am doing anything wrong.

Classifying text with scikit

I'm learning scikit-learn for a machine-learning project, and while I'm beginning to grasp the general process, the details are still a bit fuzzy.
Earlier I managed to build a classifier, train it and test it with a test set. I saved it to disk with cPickle. Now I want to create a class which loads this classifier and lets the user classify single tweets with it.
I thought this would be trivial, but I seem to get ValueError('dimension mismatch') from the X_new_tfidf = self.tfidf_transformer.fit_transform(fitTweetVec) line with the following code:
class TweetClassifier:
    classifier = None
    vect = TfidfVectorizer()
    tfidf_transformer = TfidfTransformer()

    # open the classifier saved to disk to be utilized later
    def openClassifier(self, name):
        with open(name+'.pkl', 'rb') as fid:
            return cPickle.load(fid)

    def __init__(self, classifierName):
        self.classifier = self.openClassifier(classifierName)
        self.classifyTweet(np.array([u"Helvetin vittu miksi aina pitää sataa vettä???"]))

    def classifyTweet(self, tweetText):
        fitTweetVec = self.vect.fit_transform(tweetText)
        print self.vect.get_feature_names()
        X_new_tfidf = self.tfidf_transformer.fit_transform(fitTweetVec)
        print self.classifier.predict(X_new_tfidf)
What am I doing wrong here? I used similar code while I made the classifier and ran the test set for it. Have I forgotten some important step here?
Now, I admit that I don't fully understand the fitting and transforming here yet, since I found scikit's tutorial a bit ambiguous about it. If someone knows as clear an explanation of them as possible, I'm all for links :)
The problem is that your classifier was trained with a fixed number of features (the length of the vocabulary of your previous data) and now when you fit_transform the new tweet, the TfidfTransformer will produce a new vocabulary and a new number of features, and will represent the new tweet in this space.
The solution is to also save the previously fitted TfidfTransformer (which contains the old vocabulary), load it together with the classifier, and call .transform (not fit_transform, because it was already fitted to the old data) on the new tweet to get the same representation.
You can also use a Pipeline that contains both the TfidfTransformer and the classifier and pickle the Pipeline; this is easier and recommended.
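A minimal sketch of the Pipeline approach (the classifier type and the training data below are made up for illustration; the original post used cPickle, and the same idea works with pickle or joblib):
import cPickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Train once: the vectorizer's vocabulary is fitted here and frozen inside the pipeline
pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", SGDClassifier())])
pipe.fit([u"it is raining again", u"sunny and warm today"], ["neg", "pos"])

# Persist the whole pipeline (vocabulary + classifier) in one file
with open("tweet_pipeline.pkl", "wb") as fid:
    cPickle.dump(pipe, fid)

# Later: load it and classify a single tweet without any re-fitting
with open("tweet_pipeline.pkl", "rb") as fid:
    loaded = cPickle.load(fid)
print(loaded.predict([u"why does it always rain"]))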
