Classifying text with scikit

I'm learning scikit-learn for a project, and while I'm beginning to grasp the general process, the details are still a bit fuzzy.
Earlier I managed to build a classifier, train it, and evaluate it on a test set. I saved it to disk with cPickle. Now I want to create a class that loads this classifier and lets a user classify single tweets with it.
I thought this would be trivial, but I get ValueError('dimension mismatch') from the line X_new_tfidf = self.tfidf_transformer.fit_transform(fitTweetVec) in the following code:
class TweetClassifier:
    classifier = None
    vect = TfidfVectorizer()
    tfidf_transformer = TfidfTransformer()
    #open the classifier saved to disk to be utilized later
    def openClassifier(self, name):
        with open(name+'.pkl', 'rb') as fid:
            return cPickle.load(fid)
    def __init__(self, classifierName):
        self.classifier = self.openClassifier(classifierName)
        self.classifyTweet(np.array([u"Helvetin vittu miksi aina pitää sataa vettä???"]))
    def classifyTweet(self, tweetText):
        fitTweetVec = self.vect.fit_transform(tweetText)
        print self.vect.get_feature_names()
        X_new_tfidf = self.tfidf_transformer.fit_transform(fitTweetVec)
        print self.classifier.predict(X_new_tfidf)
What am I doing wrong here? I used similar code when I built the classifier and ran the test set through it. Have I forgotten some important step?
I admit that I don't yet fully understand fitting and transforming, since I found the scikit-learn tutorial a bit ambiguous about it. If someone knows a clear explanation of them, I'm all for links :)

The problem is that your classifier was trained with a fixed number of features (the size of the vocabulary of your training data). When you call fit_transform on the new tweet, the vectorizer learns a new vocabulary from that tweet alone, so the tweet is represented in a new space with a different number of features.
The solution is to also save the previously fitted vectorizer (which holds the old vocabulary), load it together with the classifier, and call .transform (not fit_transform, because it was already fitted to the old data) on the new tweet so that it ends up in the same representation. The same applies to the fitted TfidfTransformer, if you used a separate one.
You can also use a Pipeline that contains both the vectorizer and the classifier and pickle the Pipeline; this is easier and recommended.
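A minimal sketch of the Pipeline approach, assuming a TfidfVectorizer feeding a MultinomialNB classifier (the question does not say which classifier was used, and the toy texts and labels below are made up). The point is only that the fitted vocabulary travels with the classifier when the whole pipeline is pickled:

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy training data (hypothetical); the real tweets and labels go here.
train_texts = ["rain again, awful weather", "what a lovely sunny day",
               "terrible storm tonight", "great sunshine this morning"]
train_labels = ["neg", "pos", "neg", "pos"]

# The vectorizer and the classifier are fitted together, exactly once.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])
pipeline.fit(train_texts, train_labels)

# Pickling the pipeline stores the fitted vocabulary with the model.
blob = pickle.dumps(pipeline)

# Later, possibly in another process: load and predict. predict() only
# calls transform() on the vectorizer, so the feature count stays the same.
loaded = pickle.loads(blob)
n_features = len(loaded.named_steps["tfidf"].vocabulary_)
prediction = loaded.predict(["why does it always have to rain"])
print(n_features, prediction)
```

To persist to disk instead of an in-memory byte string, pickle.dump and pickle.load with an open file work the same way.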

Related

How do you find feature names for Decision Tree Classification?

I am trying to find the feature information for my decision trees. More specifically, I want to be able to tell what feature 183 is if it appears in my tree visualization. I have tried dtModel.getInputCol() but receive the following error.
AttributeError: 'DecisionTreeClassificationModel' object has no attribute 'getInputCol'
This is my current code:
from pyspark.ml.classification import DecisionTreeClassifier
# Create initial Decision Tree Model
dt = DecisionTreeClassifier(labelCol="label", featuresCol="features", maxDepth=3)
# Train model with Training Data
dtModel = dt.fit(trainingData)
display(dtModel)
If you can help or need more information, please let me know. Thank you.

See this example taken from the Spark docs (I have tried to keep the names consistent with your code, especially featuresCol="features").
I assume you have some code like this (before the code you posted in the question):
featureIndexer = VectorIndexer(inputCol="inputFeatures", outputCol="features", maxCategories=4).fit(data)
After this step, the "features" column holds the indexed features, which you then feed to the DecisionTreeClassifier (as in your posted code):
# Train a DecisionTree model.
dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="features")
What you're looking for is inputFeatures above, which holds the original features before indexing. To print it, you can do something like:
sc.parallelize(inputFeatures, 1).saveAsTextFile("absolute_path")
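If the features vector was built by a stage that records column metadata (VectorAssembler is one that typically does), the slot names may also be recoverable from the DataFrame schema, e.g. trainingData.schema["features"].metadata. The sketch below is plain Python and does not need Spark: feature_names_from_metadata is a hypothetical helper, and the "ml_attr" layout in example_metadata is an assumption about what that metadata dict looks like, so check it against your own schema. Hashed features (HashingTF) carry no such names.

```python
def feature_names_from_metadata(metadata):
    """Map vector slot index -> name from a features column's metadata dict."""
    attrs = metadata.get("ml_attr", {}).get("attrs", {})
    names = {}
    for group in attrs.values():          # e.g. "numeric", "binary", "nominal"
        for attr in group:
            names[attr["idx"]] = attr.get("name", "<unnamed>")
    return names


# Hand-written stand-in for df.schema["features"].metadata (assumed layout).
example_metadata = {
    "ml_attr": {
        "attrs": {
            "numeric": [{"idx": 0, "name": "age"}, {"idx": 2, "name": "income"}],
            "nominal": [{"idx": 1, "name": "country", "vals": ["FI", "SE"]}],
        },
        "num_attrs": 3,
    }
}

names = feature_names_from_metadata(example_metadata)
print(names[1])
```

With a mapping like this, "feature 183" in a tree visualization would be looked up as names[183].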

How to predict with Word2Vec?

I'm doing Arabic dialect text classification and I've used Word2Vec to train a model. This is what I have so far:
def read_input(input_file):
    with open(input_file, 'rb') as f:
        for i, line in enumerate(f):
            yield gensim.utils.simple_preprocess(line)
documents = list(read_input(data_file))
logging.info("Done reading data file")
model = gensim.models.Word2Vec(documents, size=150, window=10, min_count=2, workers=10)
model.train(documents,total_examples=len(documents),epochs=10)
What do I do now to predict which of my 5 dialects a new text belongs to?
Also, I looked around and found this code:
# load the pre-trained word-embedding vectors
embeddings_index = {}
for i, line in enumerate(open('w2vmodel.vec',encoding='utf-8')):
    values = line.split()
    embeddings_index[values[0]] = numpy.asarray(values[1:], dtype='float32')
# create a tokenizer
token = text.Tokenizer()
token.fit_on_texts(trainDF['text'])
word_index = token.word_index
# convert text to sequence of tokens and pad them to ensure equal length vectors
train_seq_x = sequence.pad_sequences(token.texts_to_sequences(train_x), maxlen=70)
valid_seq_x = sequence.pad_sequences(token.texts_to_sequences(valid_x), maxlen=70)
# create token-embedding mapping
embedding_matrix = numpy.zeros((len(word_index) + 1, 300))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
But it gives me this error when I run it and load my trained word2vec model:
ValueError: could not convert string to float: '\x00\x00\x00callbacksq\x04)X\x04\x00\x00\x00loadq\x05cgensim.utils'
Note:
Actually, there's more code that I didn't post here. I want to use word2vec with a neural network. I have the code for the network, but I don't know how to feed the features from word2vec in as input, with the labels as output. Is it possible to connect word2vec to a deep neural net, and if so, how?

Word2vec alone isn't something that would classify texts into dialects, so you haven't sketched out a plausible full approach here.
What made you think word2vec could or should be used as part of this task? (If there's a motivating theory-of-operation, or some other published precedent for this approach that gave you the idea, that could help guide what else should be done.)
What is your training data like?
If it's a large number of example texts, with accurate labels as to which dialect each text is from, have you tried classifiers working on simple bag-of-words or bag-of-character-n-grams representations of the texts? (By discovering relationships between the dialects, and the exact words, groups-of-words, or word-fragments in the texts, such a classifier might work far better than word2vec. Word2vec ignores word fragments, and drives the vectors of similar-meaning words close together, obscuring small differences in word-spelling or word-choice.)
You might also try:
FastText in classification mode, in which the word-vectors (and optionally, word-fragment-vectors) are trained to specifically be good at classifying amongst a set of known labels (rather than just to be good at predicting nearby words, as in classic word2vec)
the technique of using multiple word2vec-models (one per dialect) as classifiers, as demonstrated in a notebook included with gensim: Deep Inverse Regression with Yelp Reviews
(Separately regarding your shown code:
you don't need to call train(documents,...) if you already supplied documents to the class-instantiation call – that will have already done training, as enabling INFO logging and watching the logs should make clear
you shouldn't need to use such code that tries to open/read your w2vmodel.vec file directly, because gensim includes methods for reading such files directly, such as .load_word2vec_format() or (if a full model was natively .save()d from gensim), just .load().
)
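A rough sketch of the bag-of-character-n-grams baseline mentioned above, using scikit-learn. Everything here is illustrative: the toy English texts and labels stand in for real labeled dialect data, and the n-gram range and the choice of LogisticRegression are assumptions, not tuned recommendations.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: texts with a dialect label for each one.
texts = ["colour of the harbour", "favourite neighbour",
         "color of the harbor", "favorite neighbor"]
labels = ["british", "british", "american", "american"]

# Character n-grams (2 to 4 characters, within word boundaries) capture
# spelling differences that whole-word vectors tend to smooth over.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)

predicted = model.predict(["the honour of labour"])
print(predicted)
```

With real data, a held-out set would be needed to judge whether this simple baseline already beats a word2vec-based approach.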

Linear regression load model doesn't predict as expected

I have trained a linear regression model with sklearn to predict a 5-star rating, and it performs well enough. I used Doc2vec to create my vectors and saved that model. I then saved the linear regression model to another file. What I'm trying to do now is load both the Doc2vec model and the linear regression model and predict the rating of a new review.
There is something very strange about this prediction: whatever the input, it always predicts around 2.1-3.0.
My first guess was that it simply predicts around the middle of the scale (about 2.5), but that does not seem to be the case: when training, I printed the predicted and actual values for the test data, and the predictions ranged normally over 1-5. So I suspect something is wrong in the loading part of the code. This is my load code:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from bs4 import BeautifulSoup
from joblib import dump, load
import pickle
import re
model = Doc2Vec.load('../vectors/750000/doc2vec_model')
def cleanText(text):
    text = BeautifulSoup(text, "lxml").text
    text = re.sub(r'\|\|\|', r' ', text)
    text = re.sub(r'http\S+', r'<URL>', text)
    text = re.sub(r'[^\w\s]','',text)
    text = text.lower()
    text = text.replace('x', '')
    return text
review = cleanText("Horrible movie! I don't recommend it to anyone!").split()
vector = model.infer_vector(review)
pkl_filename = "../vectors/750000/linear_regression_model.joblib"
with open(pkl_filename, 'rb') as file:
    linreg = pickle.load(file)
review_vector = vector.reshape(1,-1)
predict_star = linreg.predict(review_vector)
print(predict_star)

Your example code shows imports of both joblib.dump and joblib.load – even though neither is used in this excerpt. And, the suffix of your file is suggestive that the model may have originally been saved with joblib.dump(), not vanilla pickle.
But, this code shows the file being loaded only via plain pickle.load() – which may be the source of the error.
The joblib.load() docs suggest that its load() may do things like load numpy arrays from multiple separate files created by its own dump(). (Oddly, the dump() docs are less clear on this, but supposedly dump() has a return-value that may be a list of filenames.)
You can check where the file was saved for extra files that appear to be related, and try using joblib.load() rather than plain-pickle, to see if that loads a more-functional/more-complete version of your linreg object.
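For reference, a small round trip with joblib's own dump() and load(), which is the pairing suggested above. The tiny LinearRegression and the temporary file path are stand-ins for the real model and path, not the code from the question:

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for the trained rating model.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])
linreg = LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "linear_regression_model.joblib")
    written = dump(linreg, path)   # returns the list of file names written
    restored = load(path)          # the counterpart of dump()

# The restored model should reproduce the original predictions.
same = np.allclose(linreg.predict(X), restored.predict(X))
print(written, same)
```

Comparing a few predictions before saving and after loading, as in the last lines, is a quick way to rule the persistence step in or out as the culprit.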

(Update: I overlooked the .split() tokenization being done in the question code after .cleanText(), so this isn't the real problem. But keeping answer up for reference & because the real issue was discovered in the comments.)
Very commonly, users get mysteriously-weak results from Doc2Vec when they provide a plain string to infer_vector(). Doc2Vec infer_vector() requires a list-of-words, not a string.
If providing a string, the function will see it as a list-of-one-character words – per Python's modeling of strings as lists-of-characters, and type-conflation of characters and one-character-strings. Most of these one-character words probably aren't known by the model, and those that might be – 'i', 'a', etc – aren't very meaningful. So the inferred doc-vector will be weak & meaningless. (And, it isn't surprising such a vector, fed to your linear regression, always gives a middling predicted value.)
If you break the text into the expected list-of-words, your results should improve.
But more generally, the words provided to infer_vector() should be preprocessed and tokenized exactly however the training documents were.
(A fair sanity test of whether you're doing inference properly is to infer vectors for some of your training documents, then ask the Doc2Vec model for the doc-tags closest to these re-inferred vectors. In general, the same document's training-time tag/ID should be the top result, or at least one of the top few. If it isn't, there may be other problems in the data, model parameters, or inference.)
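The string-versus-list pitfall described above can be seen without gensim at all. This is plain Python with a made-up sample text; with gensim, the second form is what would be passed to infer_vector():

```python
text = "horrible movie i dont recommend it"

# Iterating over a string yields single characters, which is what a
# function expecting a list of words would see if handed the raw string.
as_string = list(text)

# Splitting first yields word tokens, the form a list-of-words API expects.
as_tokens = text.split()

print(as_string[:8])
print(as_tokens)
```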

How can we predict using RandomForestClassifier obtained from pyspark.ml

I am doing text classification and have built a model using the pipeline method. I have created the RF classifier object and set the features column and the label column that I obtained in my previous steps (steps not shown).
I am fitting the model on training data that I created as a DataFrame with the columns "label" and "sentence". The labels are different question types. The DataFrame looks like this:
training = sqlContext.createDataFrame([
    ("DESC:manner", "How did serfdom develop in and then leave Russia ?"),
    ("DESC:def", "What does '' extended definition '' mean and how would one a paper on it ? "),
    ("HUM:ind", " Who was The Pride of the Yankees ?")
], ["label", "sentence"])
The code for the pipeline is,
rf = RandomForestClassifier().setFeaturesCol("features").setLabelCol("idxlabel")
pipeline = Pipeline(stages=[pos, tokenizer, hashingTF, idf, indexer,rf])
model = pipeline.fit(training)
So now I can get the predictions by using the following code,
prediction = model.transform(test)
selected = prediction.select("sentence","prediction")
I can do the select() operation to get the predicted labels.
But for my use case, there is a stream of data that is coming from Kinesis and it will be only sentences (plain strings). For each sentence, I have to predict the label. But now I am not finding any predict() function when I do dir(model). How come there is no predict() method for the RandomForestClassifier obtained from pyspark.ml? If not, how can I perform my use case successfully? I need the predict() method to satisfy the requirement. What ML algorithm should I use if not RF? Am I doing anything wrong? Can anyone please suggest something? Any help is appreciated. My environment is Spark 1.6 and Python 2.7.

I figured out that there is no predict() method to use. Instead, we need to use the transform() method to make predictions. Just create a new DataFrame without the label column. For example, in my case, I did:
pred = sqlContext.createDataFrame([("What are liver enzymes ?" ,)], ["sentence"])
prediction = model.transform(pred)
We can then get the prediction using the select() method. At least for now, this solution has worked for me. Please let me know if there is any correction or a better approach.

I am working on the same problem. Can you tell me what "pos" (part of speech) is in the pipeline stages and how you are getting it? Also, how are you preparing the test data? Below is my code:
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordsData = tokenizer.transform(training)
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
featurizedData = hashingTF.transform(wordsData)
idf = IDF(inputCol="rawFeatures", outputCol="features")
indexer = StringIndexer(inputCol="label", outputCol="idxlabel")
rf = RandomForestClassifier().setFeaturesCol("features").setLabelCol("idxlabel")
pipeline = Pipeline(stages=[tokenizer, hashingTF, idf, indexer, rf])
model = pipeline.fit(training)
Please tell me if I am doing anything wrong.

Issue in understanding the Spark MLlib's LinearRegressionWithSGD example in python?

I am new to machine learning and Spark, and was going through the Spark MLlib documentation on regression, specifically LinearRegressionWithSGD. I am having some difficulty understanding the Python code. Here is what I have understood so far: the code loads the data and turns each line into a LabeledPoint. The model is then built, evaluated on the training data, and the MSE is calculated.
The part that confuses me is that in the normal machine learning process we first divide the data into a training set and a test set. Then we build the model using the training set and finally evaluate it using the test set. In the code in the Spark MLlib documentation I do not see any division into training and test sets. On top of that, I see them building the model using the data and then evaluating it using the same data.
Is there something in the code that I am not understanding? Any help in understanding the code would be appreciated.
NOTE: This is the code on the Spark MLlib documentation page for LinearRegressionWithSGD:
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD, LinearRegressionModel
# Load and parse the data
def parsePoint(line):
    values = [float(x) for x in line.replace(',', ' ').split(' ')]
    return LabeledPoint(values[0], values[1:])
data = sc.textFile("data/mllib/ridge-data/lpsa.data")
parsedData = data.map(parsePoint)
# Build the model
model = LinearRegressionWithSGD.train(parsedData)
# Evaluate the model on training data
valuesAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
MSE = valuesAndPreds.map(lambda (v, p): (v - p)**2).reduce(lambda x, y: x + y) / valuesAndPreds.count()
print("Mean Squared Error = " + str(MSE))
# Save and load model
model.save(sc, "myModelPath")
sameModel = LinearRegressionModel.load(sc, "myModelPath")

The procedure you are talking about is hold-out validation (cross-validation is the variant that repeats it over several different splits). As you observed, the example above doesn't do this. But that doesn't mean it's wrong.
The sole purpose of that example is to illustrate how to train and use a model. You are free to split the data and validate the model on the held-out part; the procedure will be the same. Only the data changes.
In addition, performance on the training set is also valuable. A poor training error suggests the model is underfitting, and a training error far better than the held-out error suggests it is overfitting.
So to summarize, the example is all right; what you need is another example that covers validation.
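A sketch of the train/test procedure from the question, written with NumPy so that it runs without a Spark cluster. This is not the documentation's code: the data is synthetic, and closed-form least squares stands in for LinearRegressionWithSGD. In Spark the split itself would typically be done with the RDD's randomSplit method before training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): y = 3*x0 - 2*x1 + noise.
X = rng.normal(size=(200, 2))
y = X @ np.array([3.0, -2.0]) + rng.normal(scale=0.1, size=200)

# Hold out 20% of the rows; the model never sees them during fitting.
indices = rng.permutation(len(X))
test_idx, train_idx = indices[:40], indices[40:]

# Ordinary least squares on the training rows only.
weights, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)


def mse(features, targets):
    """Mean squared error of the fitted weights on the given rows."""
    return float(np.mean((features @ weights - targets) ** 2))


train_mse = mse(X[train_idx], y[train_idx])
test_mse = mse(X[test_idx], y[test_idx])
print(train_mse, test_mse)
```

Reporting both numbers side by side is what makes the over- or underfitting comparison possible.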
