I am building a sentiment analysis model using NLTK and scikit-learn. I have decided to test a few different classifiers to see which is most accurate, and eventually to use all of them as a means of producing a confidence score.
The datasets used for this testing were all reviews, labelled as either positive or negative.
I trained each classifier with 5,000 reviews, 5 separate times, with 6 different (but very similar) datasets. Each test was done with a new set of 5,000 reviews.
I averaged the accuracy for each test and dataset, to arrive at an overall mean accuracy. Take a look:
Multinomial Naive Bayes: 91.291%
Logistic Regression: 96.103%
SVC: 95.844%
In some tests the accuracy was as high as 99.912%, while the lowest mean accuracy for one of the datasets was 81.524%.
Here's a relevant code snippet:
import random

import nltk
from nltk.classify import SklearnClassifier
from sklearn.svm import SVC


def get_features(comment, word_features):
    features = {}
    comment_words = set(comment)
    for word in word_features:
        features[word] = (word in comment_words)
    return features


def main(dataset_name, column, limit):
    data = get_data(column, limit)
    data = clean_data(data)  # filter stop words

    all_words = [w.lower() for (comment, category) in data for w in comment]
    word_features = nltk.FreqDist(all_words).keys()

    feature_set = [(get_features(comment, word_features), category)
                   for (comment, category) in data]

    run = 0
    while run < 5:
        random.shuffle(feature_set)
        training_set = feature_set[:int(len(data) / 2.)]
        testing_set = feature_set[int(len(data) / 2.):]

        classifier = SklearnClassifier(SVC())
        classifier.train(training_set)

        acc = nltk.classify.accuracy(classifier, testing_set) * 100.
        save_acc(acc)  # function to save results as .csv (defined elsewhere)
        run += 1
Although I know that these kinds of classifiers can typically return great results, this seems a little too good to be true.
What are some things that I need to check to be sure this is valid?
It's not so good if you get a range from 99.66% down to 81.5%.
To analyze a dataset for text classification, you can check:
Whether the dataset is balanced.
The distribution of words for each label; sometimes the vocabulary used for each label can be really different.
Whether the positive and negative reviews come from the same source. As in the previous point, if the domains differ, the reviews may use different expressions for a positive or negative review, and that alone can produce high accuracy across several sources.
Try with reviews from a different source.
If after all that you still get such high accuracy, congrats! Your get_features is really good. :)
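For example, here is a quick sketch of the first two checks. It assumes data is the list of (comment, category) pairs from the question, where each comment is a list of tokens:

from collections import Counter

import nltk

# Check 1: is the dataset balanced?
label_counts = Counter(category for (comment, category) in data)
print("Label counts:", label_counts)

# Check 2: how different is the vocabulary used for each label?
freq_by_label = {}
for comment, category in data:
    fd = freq_by_label.setdefault(category, nltk.FreqDist())
    fd.update(w.lower() for w in comment)

for category, fd in freq_by_label.items():
    print(category, [w for w, _ in fd.most_common(20)])

# If the top words barely overlap between labels, the classes are almost
# separable by vocabulary alone, which can explain very high accuracy.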
I'm evaluating Doc2Vec for a recommender API. I wasn't able to find a decent pre-trained model, so I trained a model on the corpus, which is about 8,000 small documents.
model = Doc2Vec(vector_size=25,
                alpha=0.025,
                min_alpha=0.00025,
                min_count=1,
                dm=1)
I then looped through the corpus to find similar documents for each document. Results were not very good (in comparison to TF-IDF). Note this is after testing different epochs and vector sizes.
inferred_vector = model.infer_vector(row['cleaned'].split())
sims = model.docvecs.most_similar([inferred_vector], topn=4)
I also tried extracting the trained vectors and using cosine_similarity, but the results were oddly even worse.
cosine_similarities = cosine_similarity(model.docvecs.vectors_docs, model.docvecs.vectors_docs)
Am I doing something wrong or is the smallish corpus the problem?
Edit:
Prep and training code
import html

import texthero as hero
from texthero import preprocessing
from gensim.models.doc2vec import Doc2Vec, TaggedDocument


def unesc(s):
    for idx, row in s.iteritems():
        s[idx] = html.unescape(row)
    return s


custom_pipeline = [
    preprocessing.lowercase,
    unesc,
    preprocessing.remove_urls,
    preprocessing.remove_html_tags,
    preprocessing.remove_diacritics,
    preprocessing.remove_digits,
    preprocessing.remove_punctuation,
    lambda s: hero.remove_stopwords(s, stopwords=custom_stopwords),
    preprocessing.remove_whitespace,
    preprocessing.tokenize,
]

ds['cleaned'] = ds['body'].pipe(hero.clean, pipeline=custom_pipeline)
w2v_total_data = list(ds['cleaned'])

tag_data = [TaggedDocument(words=doc, tags=[str(i)])
            for i, doc in enumerate(w2v_total_data)]

model = Doc2Vec(vector_size=25,
                alpha=0.025,
                min_alpha=0.00025,
                min_count=1,
                epochs=20,
                dm=1)

model.build_vocab(tag_data)
model.train(tag_data, total_examples=model.corpus_count, epochs=model.epochs)
model.save("lmdocs_d2v.model")
Without seeing your training code, there could easily be errors in text prep & training. Many online code examples are bonkers wrong in their Doc2Vec training technique!
Note that min_count=1 is essentially always a bad idea with this sort of algorithm: any example suggesting that was likely from a misguided author.
Is a mere .split() also the only tokenization applied for training? (The inference list-of-tokens should be prepped the same as the training lists-of-tokens.)
How was "not very good" and "oddly even worse" evaluated? For example, did the results seem arbitrary, or in-the-right-direction-but-just-weak?
"8,000 small documents" is a bit on the thin side for a training corpus, but it somewhat depends on "how small" – a few words, a sentence, a few sentences? Moving to smaller vectors, or more training epochs, can sometimes make the best of a smallish training set - but this sort of algorithm works best with lots of data, such that dense 100d-or-more vectors can be trained.
I have been constructing my own Extra Trees (XT) classifier in Rust for binary classification. To verify the correctness of my classifier, I have been comparing it against sklearn's implementation of XT, but I constantly get different results. At first I thought there must be a bug in my code, but now I realize it's not a bug, just a different method of calculating votes amongst the different trees in the ensemble. In my code, each tree votes based on the most frequent classification in a leaf's subset of the data. So, for example, if we are traversing a tree and find ourselves at a leaf node that has 40 classifications of 0 and 60 classifications of 1, the tree classifies the data as 1.
Looking at sklearn's documentation for XT (as seen here), I read the following line regarding the predict method:
The predicted class of an input sample is a vote by the trees in the forest, weighted by their probability estimates. That is, the predicted class is the one with highest mean probability estimate across the trees.
While this gives me some idea about how individual trees vote, I still have more questions. Perhaps an exact mathematical expression of how these weights are calculated would help, but I have yet to find one in the documentation.
I will provide more details in the upcoming paragraphs, but I wish to ask my question concisely here: at a high level, how are these weights calculated, and what is the mathematics behind them? Is there a way to change how individual XT trees calculate their votes?
---------------------------------------- Additional Details -----------------------------------------------
For my current tests, this is how I build my classifier
classifier = ExtraTreesClassifier(n_estimators=5, criterion='gini',
                                  max_depth=1, max_features=5, random_state=0)
To predict unseen transactions X, I use classifier.predict(X). Digging through the source code of predict (seen here, line 630-ish), I see that this is all the code that executes for binary classification
proba = self.predict_proba(X)

if self.n_outputs_ == 1:
    return self.classes_.take(np.argmax(proba, axis=1), axis=0)
What this code is doing is relatively obvious to me. It merely determines the most likely classification of transactions by taking the argmax of proba. What I fail to understand is how this proba value is made in the first place. I believe that the predict_proba method that predict uses is defined here at line 650-ish. Here is what I believe the relevant source code to be:
check_is_fitted(self)
# Check data
X = self._validate_X_predict(X)

# Assign chunk of trees to jobs
n_jobs, _, _ = _partition_estimators(self.n_estimators, self.n_jobs)

# avoid storing the output of every estimator by summing them here
all_proba = [np.zeros((X.shape[0], j), dtype=np.float64)
             for j in np.atleast_1d(self.n_classes_)]
lock = threading.Lock()
Parallel(n_jobs=n_jobs, verbose=self.verbose,
         **_joblib_parallel_args(require="sharedmem"))(
    delayed(_accumulate_prediction)(e.predict_proba, X, all_proba, lock)
    for e in self.estimators_)

for proba in all_proba:
    proba /= len(self.estimators_)

if len(all_proba) == 1:
    return all_proba[0]
else:
    return all_proba
I fail to understand what exactly is being calculated here. This is where my trail goes a bit cold and I get confused, and find myself in need of help.
Trees can predict probability estimates, according to the training sample proportions in each leaf. In your example, the probability of class 0 is 0.4, and 0.6 for class 1.
Random forests and extremely random trees in sklearn perform soft voting: each tree predicts the class probabilities as above, and then the ensemble just averages those across trees. That produces a probability for each class, and then the predicted class is the one with the largest probability.
In the code, the relevant bit is _accumulate_prediction, which just sums the per-tree probability estimates, followed by the division by the number of estimators.
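As a concrete illustration (a sketch of the idea, not sklearn's exact code path), the ensemble prediction can be reproduced by averaging the per-tree probabilities yourself, using the classifier and X from the question:

import numpy as np

# Each tree returns leaf-proportion probabilities; stack them and average (soft voting).
all_probas = np.stack([tree.predict_proba(X) for tree in classifier.estimators_])
mean_proba = all_probas.mean(axis=0)  # shape: (n_samples, n_classes)

# The predicted class is the one with the highest mean probability.
manual_pred = classifier.classes_.take(np.argmax(mean_proba, axis=1), axis=0)

# This should match the ensemble's own predictions.
assert (manual_pred == classifier.predict(X)).all()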
I'm not good at machine learning. Can someone tell me how to do text classification with pseudo-labeling in Python? I couldn't figure out the right implementation; I have searched everywhere on the internet, but I only found implementations for numeric datasets, and none for text classification (vectorized text). So I wrote the code below, but I don't know whether it is correct or not. Am I doing it wrong? Please help me guys, I really need your help.. :'(
This is my dataset if you want to try it. I want to classify 'Label' from 'Content'.
My steps are:
Split the data: 0.75 unlabeled, 0.25 labeled
From the 0.25 labeled portion, I split: 0.75 train labeled and 0.25 test labeled
Build a vectorizer for the train, test, and unlabeled datasets
Build the first model from the labeled training data, then label the unlabeled dataset
Concatenate the labeled training data with the unlabeled predictions that have probability > 0.99 (pseudo-labeled), and build the second model
Remove the pseudo-labeled instances from the unlabeled dataset
Predict the remaining unlabeled data with the second model, then repeat from step 3 until no predicted pseudo-labels have probability > 0.99.
This is my code:
Performing pseudo labelling on text classification
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB

# Initiate iteration counter
iterations = 0

# Containers to hold f1_scores and # of pseudo-labels
train_f1s = []
test_f1s = []
pseudo_labels = []

# Assign value to initiate while loop
high_prob = [1]

# Loop will run until there are no more high-probability pseudo-labels
while len(high_prob) > 0:

    # Set the vector transformer (fitted on the training data)
    columnTransformer = ColumnTransformer([
        ('tfidf', TfidfVectorizer(stop_words=None, max_features=100000), 'Content')
    ], remainder='drop')

    def transforms(series):
        before_vect = pd.DataFrame({'Content': series})
        vector_transformer = columnTransformer.fit(pd.DataFrame({'Content': X_train}))
        return vector_transformer.transform(before_vect)

    X_train_df = transforms(X_train)
    X_test_df = transforms(X_test)
    X_unlabeled_df = transforms(X_unlabeled)

    # Fit classifier and make train/test predictions
    nb = MultinomialNB()
    nb.fit(X_train_df, y_train)
    y_hat_train = nb.predict(X_train_df)
    y_hat_test = nb.predict(X_test_df)

    # Calculate and print iteration # and f1 scores, and store f1 scores
    train_f1 = f1_score(y_train, y_hat_train)
    test_f1 = f1_score(y_test, y_hat_test)
    print(f"Iteration {iterations}")
    print(f"Train f1: {train_f1}")
    print(f"Test f1: {test_f1}")
    train_f1s.append(train_f1)
    test_f1s.append(test_f1)

    # Generate predictions and probabilities for unlabeled data
    print(f"Now predicting labels for unlabeled data...")
    pred_probs = nb.predict_proba(X_unlabeled_df)
    preds = nb.predict(X_unlabeled_df)
    prob_0 = pred_probs[:, 0]
    prob_1 = pred_probs[:, 1]

    # Store predictions and probabilities in a dataframe
    df_pred_prob = pd.DataFrame([])
    df_pred_prob['preds'] = preds
    df_pred_prob['prob_0'] = prob_0
    df_pred_prob['prob_1'] = prob_1
    df_pred_prob.index = X_unlabeled.index

    # Separate predictions with > 99% probability
    high_prob = pd.concat([df_pred_prob.loc[df_pred_prob['prob_0'] > 0.99],
                           df_pred_prob.loc[df_pred_prob['prob_1'] > 0.99]],
                          axis=0)
    print(f"{len(high_prob)} high-probability predictions added to training data.")
    pseudo_labels.append(len(high_prob))

    # Add pseudo-labeled data to training data
    X_train = pd.concat([X_train, X_unlabeled.loc[high_prob.index]], axis=0)
    y_train = pd.concat([y_train, high_prob.preds])

    # Drop pseudo-labeled instances from unlabeled data
    X_unlabeled = X_unlabeled.drop(index=high_prob.index)
    print(f"{len(X_unlabeled)} unlabeled instances remaining.\n")

    # Update iteration counter
    iterations += 1
I think I'm doing something wrong, because the f1 scores keep decreasing. Please help me guys :'( I'm stressed.
f1 scores image
=================EDIT=================
So I've searched the literature, and I think I misunderstood the concept of data splitting in pseudo-labelling.
I initially thought that the process starts by splitting the data into labeled and unlabeled sets, and then the labeled data is split into train and test sets.
But after more searching, I found in this journal that my steps were incorrect. The journal says that pseudo-labeling should start by splitting the data into train and test sets first, and then the train set is split into labeled and unlabeled datasets.
According to that journal, the best result is reached when the data is split into 90% train and 10% test sets, and the 90% train set is then split into 20% labeled and 80% unlabeled data. The journal tried threshold values ranging from 0.7 to 0.9 as the boundary for keeping pseudo-labels, and for that split proportion the best threshold value was 0.74. So I fixed my steps with the new proportions and the 0.74 threshold, and my F1 scores are finally increasing. Here are my steps:
Split the data: 0.9 train, 0.1 test sets (I labeled the test set, so I can measure the f1 scores)
From the 0.9 train portion, I split: 0.2 labeled and 0.8 unlabeled data
Build a vectorizer for the X values of the labeled train, test, and unlabeled training datasets
Build the first model from the labeled training data, then label the unlabeled training data. Then measure the F1 scores against the test set (which is already labeled).
Concatenate the labeled training data with the unlabeled predictions that have probability > 0.74 (threshold based on the journal). We treat this new data as pseudo-labelled, as if it were the actual label, and build the second model from the new training set.
Remove the selected pseudo-labelled instances from the unlabeled dataset
Use the second model to predict the remaining unlabeled data, then repeat from step 3 until no predicted pseudo-labels have probability > 0.74
The last model is the final one.
My code is still the same; I just changed the split proportions, and my f1 scores are finally increasing over 4 iterations: my new f1 scores.
Am I doing it right this time? Thank you for all of your attention guys.. Thank you so much..
I'm not good at machine learning.
Overall I would say that you are quite good at Machine Learning: semi-supervised learning is an advanced type of problem, and I think your solution is quite good. At least the general principle seems correct, but it's difficult to say for sure (I don't have time to analyze the code in detail, sorry). A few comments:
One thing which might be improvable is the 0.74 threshold: this value certainly depends on the data, so you could run your own experiment by trying different threshold values and selecting the one which works best for your data (see the sketch after these comments).
Ideally you would keep a final test set aside and use a separate validation set during the iterations; this avoids the risk of data leakage.
I'm not sure about the stop condition for the loop. It might be ok but it might be worth trying other options:
Simply iterate a fixed number of times (for instance 10 times).
The stop condition could be based on "no more F1-score improvement" (i.e. stabilization of the performance), but it's a bit more advanced.
It's pretty good anyway; my comments are just ideas if you want to improve further. Note that it's been a long time since I've worked with semi-supervised learning, so I'm not sure I remember everything very well ;)
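For the threshold experiment in the first comment, here is a minimal sketch. It assumes a hypothetical run_pseudo_labelling(threshold) helper (not in your code) that wraps your existing loop and returns the F1 score on a held-out validation set:

import numpy as np

# run_pseudo_labelling(threshold) is assumed to run the whole pseudo-labelling
# procedure with the given probability threshold and return the F1 score
# measured on a held-out validation set (hypothetical helper, not from the question).
results = {}
for threshold in np.arange(0.70, 0.96, 0.05):
    results[round(threshold, 2)] = run_pseudo_labelling(threshold)

best_threshold = max(results, key=results.get)
print("Best threshold:", best_threshold, "-> validation F1:", results[best_threshold])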
I have built a binary text classifier and trained it to classify sentences from clients as either 'New' or 'Return'. My issue is that real data may not always have a clear distinction between new and return, even to an actual person reading the sentence.
My model was trained to about 0.99 accuracy with supervised learning using Logistic Regression.
from sklearn import linear_model, metrics

# train model
def train_model(classifier, feature_vector_train, label, feature_vector_valid, valid_y, is_neural_net=False):
    classifier.fit(feature_vector_train, label)
    predictions = classifier.predict(feature_vector_valid)
    if is_neural_net:
        predictions = predictions.argmax(axis=-1)
    return classifier, metrics.accuracy_score(predictions, valid_y)

# Linear Classifier on Count Vectors
model, accuracy = train_model(linear_model.LogisticRegression(), xtrain_count, train_y, xtest_count, test_y)
print('::: Accuracy on Test Set :::')
print('Linear Classifier, BoW Vectors: ', accuracy)
And this would give me an accuracy of 0.998.
I can now pass a whole list of sentences to this model and it will catch whether a sentence contains a 'new' or 'return' word. However, I need some kind of per-sentence score, because some sentences have no real chance of being new or return; real data is messy, as always.
My question is: What evaluation metrics can I use so that each new sentence that gets passed through the model shows a score?
Right now I only use the following code
with open('realdata.txt', 'r') as f:
    samples = f.readlines()

vecs = count_vect.transform(samples)
visit = model.predict(vecs)

num_to_label = {0: 'New', 1: 'Return'}
for s, p in zip(samples, visit):
    # printing each sentence with the predicted label
    print(s + num_to_label[p])
For example I would expect:
Sentence                Visit    (Metric X)
New visit 2nd floor     New      0.95
Return visit Evening    Return   0.98
Afternoon visit North   New      0.43
That way I'd know not to trust predictions whose score falls below a certain percentage, because the tool isn't reliable for them.
You can use predict_proba() instead of predict(). This will give you probability estimates of your predictions for each possible label.
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
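For example, building on the snippet in the question (a sketch that assumes the count_vect, model, vecs, and samples variables from your code; the probability columns follow model.classes_):

probs = model.predict_proba(vecs)  # shape: (n_sentences, 2)
preds = model.predict(vecs)

num_to_label = {0: 'New', 1: 'Return'}
for s, p, prob in zip(samples, preds, probs):
    confidence = prob.max()  # probability of the predicted label
    print(s.strip(), num_to_label[p], round(confidence, 2))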
I am building an NLP chat application in Python using the gensim library and its doc2vec model. I have hard-coded documents, and given a set of training examples, I am testing the model by throwing a user question at it and then finding the most similar documents as a first step. In this case my test question is an exact copy of a document from the training examples.
import gensim
from gensim import models

sentence = models.doc2vec.LabeledSentence(
    words=[u'sampling', u'what', u'is', u'tell', u'me', u'about'], tags=["SENT_0"])
sentence1 = models.doc2vec.LabeledSentence(
    words=[u'eligibility', u'what', u'is', u'my', u'limit', u'how', u'much', u'can', u'I', u'claim'],
    tags=["SENT_1"])
sentence2 = models.doc2vec.LabeledSentence(
    words=[u'eligibility', u'I', u'am', u'retiring', u'how', u'much', u'can', u'claim', u'have', u'resigned'],
    tags=["SENT_2"])
sentence3 = models.doc2vec.LabeledSentence(
    words=[u'what', u'is', u'my', u'eligibility', u'post', u'my', u'promotion'], tags=["SENT_3"])
sentence4 = models.doc2vec.LabeledSentence(
    words=[u'what', u'is', u'my', u'eligibility', u'post', u'my', u'promotion'], tags=["SENT_4"])

sentences = [sentence, sentence1, sentence2, sentence3, sentence4]

class LabeledLineSentence(object):
    def __init__(self, filename):
        self.filename = filename

    def __iter__(self):
        for uid, line in enumerate(open(self.filename)):
            yield LabeledSentence(words=line.split(), labels=['SENT_%s' % uid])

model = models.Doc2Vec(alpha=0.03, min_alpha=.025, min_count=2)
model.build_vocab(sentences)

for epoch in range(30):
    model.train(sentences, total_examples=model.corpus_count, epochs=model.iter)
    model.alpha -= 0.002  # decrease the learning rate
    model.min_alpha = model.alpha  # fix the learning rate, no decay

model.save("my_model.doc2vec")

model_loaded = models.Doc2Vec.load('my_model.doc2vec')
print(model_loaded.docvecs.most_similar(["SENT_4"]))
Result:
[('SENT_1', 0.043695494532585144), ('SENT_2', 0.0017897281795740128), ('SENT_0', -0.018954679369926453), ('SENT_3', -0.08253869414329529)]
The similarity of SENT_4 and SENT_3 is only -0.08253869414329529, when it should be 1 since they are exactly the same. How should I improve this accuracy? Is there a specific way of training documents that I am missing?
Word2Vec/Doc2Vec don't work well on toy-sized examples (such as few texts, short texts, and few total words). Many of the desirable properties are only reliably achieved with training sets of millions of words, or tens-of-thousands of documents.
In particular, with only 5 examples, and only a dozen or two words, but 100-dimensions of modeling vectors, the training isn't forced to do the main thing which makes word-vectors/doc-vectors useful: compress representations into dense embeddings, where similar items need to be incrementally nudged near each other in vector space, because there's no way to retain all the original variation in a sort-of-giant-lookup-table. With more dimensions than corpus variation, your identical-tokens SENT_3 and SENT_4 can adopt wildly different doc-vectors, and the model is still large enough to do great on its training task (essentially, 'overfit'), without the desired end-state of similar-texts having similar-vectors being forced.
You can sometimes squeeze a little more meaning out of small datasets with more training iterations, and a much-smaller model (in terms of vector size), but really: these vectors need large, varied datasets to become meaningful.
That's the main issue. Some other inefficiencies or errors in your example code:
Your code doesn't use the class LabeledLineSentence, so there's no need to include it here – it's irrelevant boilerplate. (Also, TaggedDocument is the preferred name for the words+tags document class in recent gensim versions, rather than LabeledSentence.)
Your custom-management of alpha and min_alpha is unlikely to do anything useful. These are best left at their defaults unless you already have something working, understand the algorithm well, and then want to try subtle optimizations.
train() will do its own iterations, so you don't need to call it many times in an outer loop. (As written, the first pass of the loop runs model.iter (5) iterations at alpha values gradually descending from 0.03 to 0.025, then the next pass runs 5 iterations at a fixed alpha of 0.028, then 5 more at 0.026, and so on through 27 more sets of 5 iterations at decreasing alpha, ending on the 30th loop at a fixed alpha of -0.028. That's a nonsense ending value – the learning rate should never be negative – at the end of a nonsense progression. Even with a big dataset, these 150 iterations, about half of which happen at negative alpha values, would likely yield weird results.)
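For reference, here is a minimal sketch of the simpler setup implied above, assuming a gensim version where Doc2Vec accepts vector_size and epochs; the hyperparameter values are only illustrative for a tiny corpus:

from gensim import models

# Smaller vectors, more epochs, default alpha handling (no manual decay).
model = models.Doc2Vec(vector_size=20, min_count=2, epochs=100, dm=1)

model.build_vocab(sentences)

# A single train() call runs all epochs and manages the learning-rate decay internally.
model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)

model.save("my_model.doc2vec")
print(model.docvecs.most_similar(["SENT_4"]))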