Add features to Multinomial Naive Bayes classifier - Python

Using MultinomialNB() from scikit-learn in Python, I want to classify documents not only by the word features in the documents but also by sentiment dictionaries (meaning plain word lists, not a Python dict type).
Suppose these are the documents to train on:
train_data = ['i hate who you welcome for','i adore him with all my heart','i can not forget his warmest welcome for me','please forget all these things! this house smells really weird','his experience helps a lot to complete all these tedious things', 'just ok', 'nothing+special today']
train_labels = ['Nega','Posi','Posi','Nega','Posi','Other','Other']
psentidict = ['welcome','adore','helps','complete','fantastic']
nsentidict = ['hate','weird','tedious','forget','abhor']
osentidict = ['ok','nothing+special']
I can train on these lists as shown below:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import naive_bayes
from sklearn.pipeline import Pipeline

text_clf = Pipeline([('vect', CountVectorizer()),
                     ('clf', naive_bayes.MultinomialNB(alpha=1.0))])
text_clf = text_clf.fit(train_data, train_labels)
Even though I have trained on the data by counting every token under its corresponding label, I want to use my sentiment dictionaries as additional classification features.
This is because features derived from the dictionaries make it possible to handle OOV (out-of-vocabulary) tokens; with nothing but crude Laplace smoothing (alpha = 1.0), overall accuracy would be severely limited.
test_data = 'it is fantastic'
predicted_labels = text_clf.predict([test_data])
With the dictionary features added, it should be possible to classify the sentence above even though every single token is absent from the training documents.
How can I add the features of psentidict, nsentidict, and osentidict to the Multinomial Naive Bayes classifier? (Training them as if they were documents could distort the measurement, so I think it is better to find another way.)

I believe there is no other way to include these features in your Multinomial Naive Bayes model. This is simply because you want to associate some sort of label with the features (say 'positive' for the values in psentidict, and so on), and that can only be achieved by training your model on such feature-label pairs. What you can do is improve the model by creating sentences that contain the said features rather than using the words directly. For example, for the word 'hate' you could instead use 'I hate you with all my heart' and label the sentiment as 'negative', instead of only using the pair 'hate': 'negative'. So you have to create more such examples for your dataset.
Hope this link helps.
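As a hedged sketch of that data-augmentation idea (the sentence templates below are invented for illustration, not taken from the question), each dictionary word can be wrapped in a short labelled sentence and appended to the training set before fitting:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import naive_bayes
from sklearn.pipeline import Pipeline

# Wrap each dictionary word in a tiny labelled sentence (templates are made up here).
extra_data = ['i feel %s about this' % w for w in psentidict] \
           + ['this is %s to me' % w for w in nsentidict] \
           + ['it was %s i guess' % w for w in osentidict]
extra_labels = ['Posi'] * len(psentidict) + ['Nega'] * len(nsentidict) + ['Other'] * len(osentidict)

text_clf = Pipeline([('vect', CountVectorizer()),
                     ('clf', naive_bayes.MultinomialNB(alpha=1.0))])
text_clf = text_clf.fit(train_data + extra_data, train_labels + extra_labels)

print(text_clf.predict(['it is fantastic']))  # 'fantastic' now appears in the training vocabulary
This does not make the dictionaries separate features, but it at least gets their vocabulary into the model, which is what the answer above suggests.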

Related

Semi-supervised sentiment analysis in Python?

I have been following this tutorial
https://stackabuse.com/python-for-nlp-sentiment-analysis-with-scikit-learn/
to create a sentiment analysis model in Python. However, here's what I don't understand: it seems to me that the data they use is already labeled. So how do I take the training I did on the labeled data and then apply it to unlabeled data?
I want to do something like this:
Assuming I have 2 dataframes:
df1 is a small one with labeled data, df2 is a big one with unlabeled data. I just finished training with df1. How do I then go about predicting the values for df2?
I thought it would be as straightforward as text_classifier.predict(df2.iloc[:,1].values), but that doesn't work for me.
Also, forgive me if this question seems stupid, but I don't have a lot of experience with machine learning and NLTK...
EDIT:
Here is the code I'm working on:
enc = preprocessing.LabelEncoder()
# chat_data = chat_data[:180]
# chat_labels = chat_labels[:180]
chat_labels = enc.fit_transform(chat_labels)
vectorizer = TfidfVectorizer (max_features=2500, min_df=1, max_df=1, stop_words=stopwords.words('english'))
features = vectorizer.fit_transform(chat_data).toarray()
print(chat_data)
X_train, X_test, y_train, y_test = train_test_split(features, chat_labels, test_size=0.2, random_state=0)
text_classifier = RandomForestClassifier(n_estimators=200, random_state=0)
text_classifier.fit(X_train, y_train)
predictions = text_classifier.predict(X_test)
print(confusion_matrix(y_test,predictions))
print(classification_report(y_test,predictions))
print(accuracy_score(y_test, predictions))
chatData = pd.read_csv(r"C:\Users\jgott\OneDrive\Dokumente\Thesis\chat.csv")
unlabeled = chatData.iloc[:,1].values
unlabeled = vectorizer.fit_transform(unlabeled.astype('U'))
print(unlabeled)
# features = vectorizer.fit_transform(unlabeled).toarray()
predictions = text_classifier.predict(unlabeled)
Most of it is taken exactly from the tutorial, except for the line with astype in it, which I used to convert the unlabeled data because otherwise I got a ValueError telling me it can't convert from string to float.
how do I use the training I did on the labeled data to then apply to unlabeled data?
This is really the problem that supervised ML tries to solve: given known labeled data as inputs of the form (sample, label), a model tries to discover the generic patterns that exist in these data. These patterns will hopefully be useful for predicting the labels of unseen, unlabeled data.
For example, in a sentiment-analysis (sad, happy) problem, the patterns a model may discover during training could be:
The presence of one or more of these words means sad:
("misery", 'sad', 'displaced people', 'homeless'...)
The presence of one or more of these words means happy:
("win", "delightful", "wedding", ...)
If a new text document is given, we search for these patterns inside it and label it accordingly.
As a side note: we usually do not use the whole labeled dataset for training; instead, we set aside a small portion of the dataset (separate from the training set) to validate the model and verify that it discovered genuinely generic patterns, not ones tailored specifically to the training data.
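To make that concrete for the two-dataframe setup in the question, here is a minimal sketch; the column names 'text' and 'label' on df1/df2 are assumptions, not from the question:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

# Fit the vectorizer and the classifier on the labeled data (df1) only.
vectorizer = TfidfVectorizer(max_features=2500)
X_train = vectorizer.fit_transform(df1['text'].values.astype('U')).toarray()
text_classifier = RandomForestClassifier(n_estimators=200, random_state=0)
text_classifier.fit(X_train, df1['label'].values)

# Apply the already-fitted objects to the unlabeled data (df2): transform, not fit_transform.
X_unlabeled = vectorizer.transform(df2['text'].values.astype('U')).toarray()
predictions = text_classifier.predict(X_unlabeled)
The key point is that the vectorizer fitted on df1 is reused as-is for df2, so both end up with exactly the same feature columns.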

CountVectorizer not working for test string in sklearn

I've been working on sentiment analysis using sklearn. I have a CSV file of 3000-odd reviews and I am training my classifier on 60% of those reviews.
When I give the classifier a custom review and ask it to predict the label using CountVectorizer.transform(), it throws the following error:
Anaconda\lib\site-packages\sklearn\feature_extraction\text.py", line 864, in transform
raise ValueError("Vocabulary wasn't fitted or is empty!")
ValueError: Vocabulary wasn't fitted or is empty!
Please help me. This is the code for fitting the training set:
def preprocess():
    data, target = load_file()
    count_vectorizer = CountVectorizer(binary='true', min_df=1)
    data = count_vectorizer.fit_transform(data)
    tfidf_data = TfidfTransformer(use_idf=False).fit_transform(data)
    return tfidf_data
And this is the code for predicting the sentiment of a custom review:
def customQuestionScorer(question, clf):
    X_new_tfidf = vectorizer.transform([question]).toarray()
    print(clf.predict(X_new_tfidf))
q = "I really like this movie"
customQuestionScorer(q,classifier)
I don't see a classifier here; you are only using transformers (CountVectorizer, TfidfTransformer). To get predictions you must train a classifier on the output of TfidfTransformer.
It's also not clear whether you are using the same CountVectorizer and TfidfTransformer that were fitted on the training set to transform the test-set texts, or new ones. To provide correct input for the previously fitted classifier, you have to feed it from the previously fitted transformers, not new ones.
Look here for a good example of text processing: http://scikit-learn.org/stable/auto_examples/model_selection/grid_search_text_feature_extraction.html#example-model-selection-grid-search-text-feature-extraction-py
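A minimal sketch of that advice (load_file() is assumed from the question, and the Naive Bayes classifier here is just a placeholder choice): return the fitted transformers together with the data, train a classifier on their output, and reuse the same objects for new text.
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

def preprocess():
    data, target = load_file()                          # assumed helper from the question
    count_vectorizer = CountVectorizer(binary='true', min_df=1)
    counts = count_vectorizer.fit_transform(data)
    tfidf_transformer = TfidfTransformer(use_idf=False)
    tfidf_data = tfidf_transformer.fit_transform(counts)
    return tfidf_data, target, count_vectorizer, tfidf_transformer

def customQuestionScorer(question, clf, count_vectorizer, tfidf_transformer):
    # Use the transformers that were fitted on the training set, not new ones.
    X_new = tfidf_transformer.transform(count_vectorizer.transform([question]))
    print(clf.predict(X_new))

tfidf_data, target, count_vectorizer, tfidf_transformer = preprocess()
classifier = MultinomialNB().fit(tfidf_data, target)    # placeholder classifier
customQuestionScorer("I really like this movie", classifier, count_vectorizer, tfidf_transformer)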

How can I print accuracy of the model in a different way?

I am carrying out supervised machine learning. At present, using scikit-learn's metrics, it prints out the accuracy over the entire corpus.
I also wish to print out the accuracy for the top 3 topics and then the top 5 topics. How can I do so?
model = LogisticRegression()
model = model.fit(matrix, label)
y_train_pred = model.predict(matrix_test)
print(metrics.accuracy_score(label_test, y_train_pred))
You could use a confusion matrix: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
Example: http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html
This way you get prediction information specific to each category.
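As a small sketch of the per-category information a confusion matrix gives (reusing the variable names from the question's snippet), per-class accuracy can be read off its diagonal, and from there you can report only the classes you care about, e.g. the 3 or 5 most frequent topics:
from sklearn.metrics import confusion_matrix

# Rows are true classes, columns are predicted classes, ordered like model.classes_.
cm = confusion_matrix(label_test, y_train_pred, labels=model.classes_)

# Diagonal = correct predictions per class; row sums = true samples per class.
per_class_accuracy = cm.diagonal() / cm.sum(axis=1)
for topic, acc in zip(model.classes_, per_class_accuracy):
    print(topic, acc)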

stacking 3 variables for kmeans scikit

I have 3 variables that I want to fit into a k-means model. One is the TF-IDF vector, one is the count vector, and the third is the number of words in each document (sentence_list_len).
Here is my code:
vectorizer=TfidfVectorizer(min_df=1, max_df=0.9, stop_words='english', decode_error='ignore')
vectorized=vectorizer.fit_transform(sentence_list)
count_vectorizer=CountVectorizer(min_df=1, max_df=0.9, stop_words='english', decode_error='ignore')
count_vectorized=count_vectorizer.fit_transform(sentence_list)
sentence_list_len # for each document, how many words are there
km=KMeans(n_clusters=num_clusters, init='k-means++',n_init=10, verbose=1)
km.fit(vectorized)
How do I fit the 3 variables into km.fit()? Specifically, how do I stack all three of them and feed them to km.fit()?
Simply concatenate your vectors. See numpy.concatenate or numpy.vstack / numpy.hstack. However, be aware that k-means does not work well with high-dimensional data and that it will probably ignore "small" features. You have three types of features on different scales, and this will heavily affect the clustering results. In general, k-means is not a good approach to NLP clustering tasks.
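As a sketch of that concatenation using the question's variables (scipy.sparse.hstack is used here because the vectorizer outputs are sparse; scaling the length column is left out for brevity):
import numpy as np
from scipy.sparse import csr_matrix, hstack

# One column per document holding its word count, alongside the two vectorizer outputs.
lengths = csr_matrix(np.array(sentence_list_len).reshape(-1, 1))

# Column-wise concatenation: TF-IDF features, count features, document length.
X = hstack([vectorized, count_vectorized, lengths]).tocsr()
km.fit(X)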
The official way is to use a FeatureUnion:
from sklearn.pipeline import FeatureUnion
tfidf =TfidfVectorizer()
cvect = CountVectorizer()
features = FeatureUnion([('cvect', cvect), ('tfidf', tfidf)])
X = features.fit_transform(sentence_list)
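If the document-length feature should go through the same FeatureUnion, one option (a sketch only; the DocLength transformer below is made up for illustration) is a small custom transformer:
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class DocLength(BaseEstimator, TransformerMixin):
    """Emit a single feature per document: its word count."""
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return np.array([[len(doc.split())] for doc in X])

features = FeatureUnion([('cvect', cvect), ('tfidf', tfidf), ('length', DocLength())])
X = features.fit_transform(sentence_list)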

Python vectorization for classification [duplicate]

This question already has an answer here:
Scikit learn - fit_transform on the test set
(1 answer)
Closed 8 years ago.
I am currently trying to build a text classification model (document classification) with roughly 80 classes. When I build and train the model using random forest (after vectorizing the text into a TF-IDF matrix), the model works well. However, when I introduce new data, the words it contains aren't necessarily identical to those in the training set. This is a problem because I end up with a different number of features in my training set than in my test set (so the training set has fewer dimensions than the test set).
####### Convert bag of words to TFIDF matrix
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(data)
print(tfidf_matrix.shape)
## number of features = 421
####### Train Random Forest Model
clf = RandomForestClassifier(max_depth=None,min_samples_split=1, random_state=1,n_jobs=-1)
####### k-fold cross validation
scores = cross_val_score(clf, tfidf_matrix.toarray(),labels,cv=7,n_jobs=-1)
print(scores.mean())
### this is the new data matrix for unseen data
new_tfidf = tfidf_vectorizer.fit_transform(new_X)
### number of features = 619
clf.fit(tfidf_matrix.toarray(),labels)
clf.predict(new_tfidf.toarray())
How can I go about creating a working RF model for classification that will incorporate new features (words) that weren't seen in the training?
Do not call fit_transform on the unseen data, only transform! That will keep the dictionary from the training set.
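Applied to the question's snippet, that is a one-line change, reusing the vectorizer already fitted on the training data:
# Reuse the vocabulary learned from the training data: transform, not fit_transform.
new_tfidf = tfidf_vectorizer.transform(new_X)
clf.predict(new_tfidf.toarray())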
You cannot introduce new features into the test set that were not part of your training set. The model is trained on a specific dictionary of terms, and that same dictionary must be used across training, validation, testing, and production. Furthermore, the indices of the words in your feature vector cannot change either.
You should be creating one large matrix using all of your data and then split the rows into your train and test sets. This will guarantee that you will have the same feature set for train and test.
