CountVectorizer not working for test string in sklearn - python

I've been working on sentiment analysis using sklearn. I have a CSV file of 3000-odd reviews and I am training my classifier on 60% of those reviews.
When I try to give the classifier a custom review to predict the label using CountVectorizer.transform(), it throws the following error:
Anaconda\lib\site-packages\sklearn\feature_extraction\text.py", line 864, in transform
raise ValueError("Vocabulary wasn't fitted or is empty!")
ValueError: Vocabulary wasn't fitted or is empty!
Please help me. This is the code for fitting the training set:
def preprocess():
    data, target = load_file()
    count_vectorizer = CountVectorizer(binary=True, min_df=1)
    data = count_vectorizer.fit_transform(data)
    tfidf_data = TfidfTransformer(use_idf=False).fit_transform(data)
    return tfidf_data
And this is the code for predicting the sentiment of a custom review:
def customQuestionScorer(question, clf):
    X_new_tfidf = vectorizer.transform([question]).toarray()
    print(clf.predict(X_new_tfidf))
q = "I really like this movie"
customQuestionScorer(q,classifier)

I don't see a classifier here; you are using only transformers (CountVectorizer, TfidfTransformer). To get predictions, you must train a classifier on the output of the TfidfTransformer.
It's also not clear whether you are using the same CountVectorizer and TfidfTransformer (fitted earlier on the training set) to transform the test-set texts, or new ones. To provide correct input to a previously fitted classifier, you have to feed it the output of the previously fitted transformers, not new ones.
Look here for a good example of text processing: http://scikit-learn.org/stable/auto_examples/model_selection/grid_search_text_feature_extraction.html#example-model-selection-grid-search-text-feature-extraction-py
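For instance, here is a minimal sketch of that flow (assuming load_file() returns the texts and labels as in your code; MultinomialNB is just one plausible choice of classifier). The key point is that the same fitted vectorizer and transformer are reused, via transform() only, when scoring a custom review:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

data, target = load_file()

# Fit the transformers and the classifier once, on the training data
count_vectorizer = CountVectorizer(binary=True, min_df=1)
tfidf = TfidfTransformer(use_idf=False)
train_tfidf = tfidf.fit_transform(count_vectorizer.fit_transform(data))
classifier = MultinomialNB().fit(train_tfidf, target)

def customQuestionScorer(question, clf):
    # Reuse the fitted objects: transform(), never fit_transform(), on new text
    X_new_tfidf = tfidf.transform(count_vectorizer.transform([question]))
    print(clf.predict(X_new_tfidf))

customQuestionScorer("I really like this movie", classifier)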

Related

Semi-supervised sentiment analysis in Python?

I have been following this tutorial
https://stackabuse.com/python-for-nlp-sentiment-analysis-with-scikit-learn/
to create a sentiment analysis in Python. However, here's what I don't understand: it seems to me that the data they use is already labeled? So, how do I use the training I did on the labeled data to then apply it to unlabeled data?
I want to do something like this:
Assuming I have 2 dataframes:
df1 is a small one with labeled data, df2 is a big one with unlabeled data. I just finished training with df1. How do I then go about predicting the values for df2?
I thought it would be as straightforward as text_classifier.predict(df2.iloc[:,1].values), but that doesn't work for me.
Also, forgive me if this question seems stupid, but I don't have a lot of experience with machine learning and nltk...
EDIT:
Here is the code I'm working on:
enc = preprocessing.LabelEncoder()
# chat_data = chat_data[:180]
# chat_labels = chat_labels[:180]
chat_labels = enc.fit_transform(chat_labels)
vectorizer = TfidfVectorizer(max_features=2500, min_df=1, max_df=1, stop_words=stopwords.words('english'))
features = vectorizer.fit_transform(chat_data).toarray()
print(chat_data)
X_train, X_test, y_train, y_test = train_test_split(features, chat_labels, test_size=0.2, random_state=0)
text_classifier = RandomForestClassifier(n_estimators=200, random_state=0)
text_classifier.fit(X_train, y_train)
predictions = text_classifier.predict(X_test)
print(confusion_matrix(y_test,predictions))
print(classification_report(y_test,predictions))
print(accuracy_score(y_test, predictions))
chatData = pd.read_csv(r"C:\Users\jgott\OneDrive\Dokumente\Thesis\chat.csv")
unlabeled = chatData.iloc[:,1].values
unlabeled = vectorizer.fit_transform(unlabeled.astype('U'))
print(unlabeled)
# features = vectorizer.fit_transform(unlabeled).toarray()
predictions = text_classifier.predict(unlabeled)
Most of it is taken exactly from the tutorial, except for the line with astype in it, which I used to convert the unlabeled data because otherwise I got a ValueError telling me it can't convert from string to float.
how do I use the training I did on the labeled data to then apply to unlabeled data?
This is really the problem that supervised ML tries to solve: having known labeled data as inputs of the form (sample, label), a model tries to discover the generic patterns that exist in these data. These patterns hopefully will be useful to predict the labels of unseen unlabeled data.
For example, in a sentiment-analysis (sad, happy) problem, these are patterns a model might discover during training:
The presence of one or more of these words suggests sad:
('misery', 'sad', 'displaced people', 'homeless', ...)
The presence of one or more of these words suggests happy:
('win', 'delightful', 'wedding', ...)
If a new textual document is given, we search for these patterns inside the document and label it accordingly.
As a side note: we usually do not use the whole labeled dataset in the training process; instead we take a small portion of the dataset (separate from the training set) to validate our model and verify that it discovered genuinely generic patterns, not ones tailored specifically to the training data.
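Concretely, for the code in your question, the prediction step should reuse the objects fitted during training. A minimal sketch, assuming vectorizer, text_classifier, and enc are the fitted objects from the snippet above:
# Reuse the fitted vectorizer: transform(), not fit_transform(), on the unlabeled data
chatData = pd.read_csv(r"C:\Users\jgott\OneDrive\Dokumente\Thesis\chat.csv")
unlabeled = chatData.iloc[:, 1].values

# transform() maps the new texts into the same feature space the classifier was trained on
unlabeled_features = vectorizer.transform(unlabeled.astype('U')).toarray()
predictions = text_classifier.predict(unlabeled_features)

# Map the encoded predictions back to the original label strings
print(enc.inverse_transform(predictions))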

KNN query data dimension must match training data dimension

I'm trying a bag-of-words problem with a dataset which has two columns: summary and solution. I'm using KNN for it. The vectorized train dataset has 91 columns and the vectorized test dataset has 15 columns.
To generate the vectors, I'm using the following piece of code.
vectorizer = CountVectorizer()
train_bow_set = vectorizer.fit_transform(dataset[0]).todense()
print( vectorizer.fit_transform(dataset[0]).todense() )
print( vectorizer.vocabulary_ )
I trained it.
classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(train_bow_set, dataset[1])
Now, I'm testing it.
y_pred = classifier.predict(test_bow_set)
Here is the error I get when testing:
sklearn/neighbors/binary_tree.pxi in sklearn.neighbors.kd_tree.BinaryTree.query()
ValueError: query data dimension must match training data dimension
I guess you are fitting the vectorizer on the test data again instead of using the transform function. Make sure you are doing the following:
test_bow_set = vectorizer.transform(test_dataset)
You are fitting the vectorizer again:
train_bow_set = vectorizer.fit_transform(dataset[0]).todense()
You need to keep the vectorizer from training (all the preprocessing elements, actually) and only call transform on the test data; fitting again would profoundly change the results:
test_bow_set = vectorizer.transform(test_dataset).todense()
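Putting it together, a minimal sketch of the whole flow (train_summaries, train_solutions, and test_summaries are placeholder names for the corresponding columns of your train and test data):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier

vectorizer = CountVectorizer()

# Fit the vocabulary on the training summaries only
train_bow_set = vectorizer.fit_transform(train_summaries).toarray()

classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(train_bow_set, train_solutions)

# transform() reuses the training vocabulary, so the test matrix has the same
# number of columns and KNN's dimension check passes
test_bow_set = vectorizer.transform(test_summaries).toarray()
y_pred = classifier.predict(test_bow_set)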

How to predict output of my naive bayes classifier applied on nlp(Restaurant Review) for a single external input text?

I have built my naive Bayes classifier model for NLP using bag of words. Now I want to predict the output for a single external input. How can I do it? Please see this GitHub link for corrections, thanks:
https://github.com/Kundan8296/Machine-Learning/blob/master/NLP.ipynb
You need to apply the same preprocessing steps that you applied to your training data and use the result as input to the classifier. Make sure you don't use fit_transform() on the new data; use transform() only.
#Change this part in your preprocessing, so you can keep the original vectorizer.
vect = CountVectorizer(tokenizer=lambda doc: doc, lowercase=False)
bag_of_words = vect.fit_transform(corpus)
...
...
# Now when predicting, use this
new_data = ... # your new input
new_x = vect.transform(new_data)
y_pred = classifier.predict(new_x)

Add features to Multinomial Naive Bayes classifier - Python

Using MultinomialNB() from scikit-learn in Python, I want to classify documents not only by the word features in the documents but also by sentiment dictionaries (meaning just word lists, not the Python dict type).
Suppose these are the documents to train on:
train_data = ['i hate who you welcome for','i adore him with all my heart','i can not forget his warmest welcome for me','please forget all these things! this house smells really weird','his experience helps a lot to complete all these tedious things', 'just ok', 'nothing+special today']
train_labels = ['Nega','Posi','Posi','Nega','Posi','Other','Other']
psentidict = ['welcome','adore','helps','complete','fantastic']
nsentidict = ['hate','weird','tedious','forget','abhor']
osentidict = ['ok','nothing+special']
I can train on these lists as below:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import naive_bayes
from sklearn.pipeline import Pipeline

text_clf = Pipeline([('vect', CountVectorizer()),
                     ('clf', naive_bayes.MultinomialNB(alpha=1.0))])
text_clf = text_clf.fit(train_data, train_labels)
Even though I trained the data by counting all tokens according to their corresponding labels, I want to use my sentiment dictionaries as additional classifying features.
This is because, with features trained from the dictionaries, it becomes possible to handle OOV (out-of-vocabulary) words. With only clumsy Laplace smoothing (alpha = 1.0), overall accuracy would be severely limited.
test_data = ['it is fantastic']
predicted_labels = text_clf.predict(test_data)
With the dictionary features added, it would be possible to predict the sentence above even though every single token is outside the training documents.
How do I add the features of psentidict, nsentidict, and osentidict to the Multinomial Naive Bayes classifier? (Training them like documents can distort the measurement, so I think it is better to find another way.)
I believe there is no other way to include these features in your Multinomial Naive Bayes model. This is simply because you want to associate some sort of label with the features (say 'positive' for the values in psentidict, and so on), and that can only be achieved by training your model with such pairs of features and labels. What you can do is improve the model by creating sentences with the said features rather than using the words directly: for example, for the word 'hate', you could use 'I hate you with all my heart' and label the sentiment 'negative', instead of only using the pair 'hate':'negative'. So you would create more such examples for your dataset.
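A minimal sketch of that augmentation idea (the template sentence and the dictionary-to-label mapping here are illustrative assumptions, not a fixed recipe):
# Wrap each dictionary word in a short synthetic sentence and add it to the
# training set with the matching label, so the words enter the vocabulary
template = 'this is {}'
aug_data = (train_data
            + [template.format(w) for w in psentidict]
            + [template.format(w) for w in nsentidict]
            + [template.format(w) for w in osentidict])
aug_labels = (train_labels
              + ['Posi'] * len(psentidict)
              + ['Nega'] * len(nsentidict)
              + ['Other'] * len(osentidict))

text_clf = text_clf.fit(aug_data, aug_labels)
print(text_clf.predict(['it is fantastic']))  # 'fantastic' is now in-vocabulary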
Hope this link helps.

Supervised machine learning with scikit-learn

This is the first time I'm doing supervised machine learning. This is a pretty advanced topic (at least for me) and I find it hard to formulate a question, since I'm not sure what is going wrong.
# Create a training list and test list (looks something like this):
train = [('this hostel was nice',2),('i hate this hostel',1)]
test = [('had a wonderful time',2),('terrible experience',1)]
# Loading modules
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import metrics
# Use a BOW representation of the reviews
vectorizer = CountVectorizer(stop_words='english')
train_features = vectorizer.fit_transform([r[0] for r in train])
test_features = vectorizer.fit([r[0] for r in test])
# Fit a naive bayes model to the training data
nb = MultinomialNB()
nb.fit(train_features, [r[1] for r in train])
# Use the classifier to predict classification of test dataset
predictions = nb.predict(test_features)
actual=[r[1] for r in test]
Here I get the error:
float() argument must be a string or a number, not 'CountVectorizer'
This confuses me, since the original ratings that I zipped up with the reviews are:
type(ratings_new[0])
int
You should change the line
test_features = vectorizer.fit([r[0] for r in test])
to:
test_features = vectorizer.transform([r[0] for r in test])
The reason is that you have already used your training data to fit the vectorizer, so you don't need to fit it again on your test data. Instead, you need to transform it.
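For completeness, the corrected flow then runs end to end, using only the names already defined in your snippet:
# Fit on the training reviews, transform the test reviews into the same feature space
train_features = vectorizer.fit_transform([r[0] for r in train])
test_features = vectorizer.transform([r[0] for r in test])

nb = MultinomialNB()
nb.fit(train_features, [r[1] for r in train])

predictions = nb.predict(test_features)
actual = [r[1] for r in test]
print(metrics.accuracy_score(actual, predictions))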
