I have built a binary text classifier that labels client sentences as either 'New' or 'Return'. My issue is that real data may not always have a clear distinction between new and return, even to an actual person reading the sentence.
My model was trained to 0.99 accuracy with supervised learning using Logistic Regression.
# train model
def train_model(classifier, feature_vector_train, label, feature_vector_valid, valid_y, is_neural_net=False):
    classifier.fit(feature_vector_train, label)
    predictions = classifier.predict(feature_vector_valid)
    if is_neural_net:
        predictions = predictions.argmax(axis=-1)
    return classifier, metrics.accuracy_score(predictions, valid_y)

# Linear Classifier on Count Vectors
model, accuracy = train_model(linear_model.LogisticRegression(), xtrain_count, train_y, xtest_count, test_y)
print('::: Accuracy on Test Set :::')
print('Linear Classifier, BoW Vectors: ', accuracy)
And this would give me an accuracy of 0.998.
I can now pass a whole list of sentences to the model and it will catch whether each sentence contains a 'new' or 'return' word. However, I need some kind of per-sentence score, because some sentences have no chance of being new or return; real data is messy as always.
My question is: What evaluation metrics can I use so that each new sentence that gets passed through the model shows a score?
Right now I only use the following code
with open('realdata.txt', 'r') as f:
    samples = f.readlines()

vecs = count_vect.transform(samples)
visit = model.predict(vecs)
num_to_label = {0: 'New', 1: 'Return'}
for s, p in zip(samples, visit):
    # printing each sentence with the predicted label
    print(s + num_to_label[p])
For example I would expect
Sentence                 Visit    (Metric X)
New visit 2nd floor      New      0.95
Return visit Evening     Return   0.98
Afternoon visit North    New      0.43
That way I'd know not to trust predictions whose metric falls below a certain threshold, because the tool isn't reliable for those sentences.
You can use predict_proba() instead of predict(). This will give you probability estimates of your predictions for each possible label.
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
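For example, a minimal sketch of how that could plug into the loop from the question, reusing the count_vect, model and samples names above (the 0/1 to 'New'/'Return' mapping assumes model.classes_ is [0, 1], as in the question's labels):

vecs = count_vect.transform(samples)
probs = model.predict_proba(vecs)       # shape (n_samples, 2), columns follow model.classes_
preds = probs.argmax(axis=1)            # index of the most probable class
confidence = probs.max(axis=1)          # probability of that class

num_to_label = {0: 'New', 1: 'Return'}  # assumes model.classes_ == [0, 1]
for s, p, c in zip(samples, preds, confidence):
    print(f"{s.strip()}  {num_to_label[p]}  {c:.2f}")

Sentences where the top probability is low (e.g. around 0.5) are exactly the messy cases you describe, so you can choose a cutoff below which you don't trust the prediction.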
Related
I'm not good at machine learning. Can someone tell me how to do text classification with pseudo-labeling in Python? I don't know the right implementation; I have searched everywhere on the internet but haven't found anything :'( I only found implementations for numeric datasets, not for text classification (vectorized text). So I wrote the code below, but I don't know whether it is correct or not. Am I doing it wrong? Please help me, I really need your help.
This is my dataset if you want to try it. I want to classify 'Label' from 'Content'.
My steps are:
Split the data into 0.75 unlabeled and 0.25 labeled
From the 0.25 labeled split: 0.75 train labeled and 0.25 test labeled
Make a vectorizer for the train, test and unlabeled datasets
Build the first model from the labeled train data, then label the unlabeled dataset
Concatenate the labeled train data with the unlabeled predictions that have probability > 0.99 (pseudo-labeled), and build the second model
Remove the pseudo-labeled rows from the unlabeled dataset
Predict the remaining unlabeled data with the second model, then iterate from step 3 until no predicted pseudo-label has probability > 0.99.
This is my code:
Performing pseudo labelling on text classification
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score

# Initiate iteration counter
iterations = 0

# Containers to hold f1_scores and # of pseudo-labels
train_f1s = []
test_f1s = []
pseudo_labels = []

# Assign value to initiate while loop
high_prob = [1]

# Loop will run until there are no more high-probability pseudo-labels
while len(high_prob) > 0:

    # Set the vector transformer (fitted on the current training data)
    columnTransformer = ColumnTransformer([
        ('tfidf', TfidfVectorizer(stop_words=None, max_features=100000), 'Content')
    ], remainder='drop')

    def transforms(series):
        before_vect = pd.DataFrame({'Content': series})
        vector_transformer = columnTransformer.fit(pd.DataFrame({'Content': X_train}))
        return vector_transformer.transform(before_vect)

    X_train_df = transforms(X_train)
    X_test_df = transforms(X_test)
    X_unlabeled_df = transforms(X_unlabeled)

    # Fit classifier and make train/test predictions
    nb = MultinomialNB()
    nb.fit(X_train_df, y_train)
    y_hat_train = nb.predict(X_train_df)
    y_hat_test = nb.predict(X_test_df)

    # Calculate and print iteration # and f1 scores, and store f1 scores
    train_f1 = f1_score(y_train, y_hat_train)
    test_f1 = f1_score(y_test, y_hat_test)
    print(f"Iteration {iterations}")
    print(f"Train f1: {train_f1}")
    print(f"Test f1: {test_f1}")
    train_f1s.append(train_f1)
    test_f1s.append(test_f1)

    # Generate predictions and probabilities for unlabeled data
    print("Now predicting labels for unlabeled data...")
    pred_probs = nb.predict_proba(X_unlabeled_df)
    preds = nb.predict(X_unlabeled_df)
    prob_0 = pred_probs[:, 0]
    prob_1 = pred_probs[:, 1]

    # Store predictions and probabilities in dataframe
    df_pred_prob = pd.DataFrame([])
    df_pred_prob['preds'] = preds
    df_pred_prob['prob_0'] = prob_0
    df_pred_prob['prob_1'] = prob_1
    df_pred_prob.index = X_unlabeled.index

    # Separate predictions with > 99% probability
    high_prob = pd.concat([df_pred_prob.loc[df_pred_prob['prob_0'] > 0.99],
                           df_pred_prob.loc[df_pred_prob['prob_1'] > 0.99]],
                          axis=0)
    print(f"{len(high_prob)} high-probability predictions added to training data.")
    pseudo_labels.append(len(high_prob))

    # Add pseudo-labeled data to training data
    X_train = pd.concat([X_train, X_unlabeled.loc[high_prob.index]], axis=0)
    y_train = pd.concat([y_train, high_prob.preds])

    # Drop pseudo-labeled instances from unlabeled data
    X_unlabeled = X_unlabeled.drop(index=high_prob.index)
    print(f"{len(X_unlabeled)} unlabeled instances remaining.\n")

    # Update iteration counter
    iterations += 1
I think I'm doing something wrong, because the f1 scores keep decreasing. Please help me guys :'( I'm stressed.
f1 scores image
=================EDIT=================
So I've searched the literature, and I think I had a misunderstanding about the concept of data splitting in pseudo-labelling.
I initially thought that the steps start by splitting the data into labeled and unlabeled data, and that the labeled data is then split into train and test.
But after more searching, I found in this journal that my steps were incorrect. The paper says that pseudo-labeling should start by splitting the data into train and test sets first, and then splitting that train set into labeled and unlabeled datasets.
According to that paper, the best result is reached when splitting the data into 90% train and 10% test sets. Then, the 90% train set is split into 20% labeled and 80% unlabeled data. The paper tries threshold values from 0.7 to 0.9 as the boundary for keeping pseudo-labels, and with that split proportion the best threshold value is 0.74. So I fixed my steps with that new proportion and a 0.74 threshold, and my F1 scores finally increased. Here are my steps:
Split the data into 0.9 train and 0.1 test sets (I keep the labels of the test set, so I can measure the F1 scores)
From the 0.9 train split: 0.2 labeled and 0.8 unlabeled data (a sketch of this splitting is shown after these steps)
Make a vectorizer for the X values of the labeled train, test and unlabeled training datasets
Build the first model from the labeled train data, then label the unlabeled training data. Then measure the F1 score against the (already labeled) test set.
Concatenate the labeled train data with the unlabeled predictions that have probability > 0.74 (threshold based on the paper). We call this new data pseudo-labelled and treat it like the actual labels, then build the second model from the new training set.
Remove the selected pseudo-labelled rows from the unlabeled dataset
Use the second model to predict the remaining unlabeled data, then iterate from step 3 until no predicted pseudo-label has probability > 0.74
The last model is the final one.
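For reference, a minimal sketch of the splitting in steps 1 and 2, assuming the data is in a DataFrame df with 'Content' and 'Label' columns (the variable names are just placeholders):

from sklearn.model_selection import train_test_split

# Step 1: 90% train / 10% test (the test set keeps its labels)
X_train_full, X_test, y_train_full, y_test = train_test_split(
    df['Content'], df['Label'], test_size=0.1, random_state=42)

# Step 2: within the 90% train split, keep labels for 20% and treat 80% as unlabeled
X_labeled, X_unlabeled, y_labeled, _y_hidden = train_test_split(
    X_train_full, y_train_full, train_size=0.2, random_state=42)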
My code is still the same; I just changed the split proportion, and my F1 scores finally increased over 4 iterations: my new f1 scores.
Am I doing it right now? Thank you all for your attention, much appreciated.
I'm not good at machine learning.
Overall I would say that you are quite good at Machine Learning: semi-supervised learning is an advanced type of problem and I think your solution is quite good. At least the general principle seems correct, but it's difficult to say for sure (I don't have time to analyze the code in detail sorry). A few comments:
One thing which might be improvable is the 0.74 threshold: this value certainly depends on the data, so you could run your own experiment by trying different threshold values and selecting the one which works best with your data (a small sketch follows after these comments).
Preferably it would be better to keep a final test set aside and use a separate validation set during the iterations. This would avoid the risk of data leakage.
I'm not sure about the stop condition for the loop. It might be ok but it might be worth trying other options:
Simply iterate a fixed number of times (for instance 10 times).
The stop condition could be based on "no more F1-score improvement" (i.e. stabilization of the performance), but it's a bit more advanced.
It's pretty good anyway; my comments are just ideas if you want to improve further. Note that it's been a long time since I've worked with semi-supervised learning, so I'm not sure I remember everything very well ;)
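To illustrate the first comment about the threshold, here is a minimal sketch of a threshold sweep. It uses scikit-learn's built-in SelfTrainingClassifier instead of your hand-rolled loop purely to keep the sketch short, and it assumes text splits named X_labeled/y_labeled, X_unlabeled and a held-out X_val/y_val with binary 0/1 labels (these names are placeholders, not from your code):

import numpy as np
import pandas as pd
from scipy.sparse import vstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.metrics import f1_score

# Fit one vectorizer on all training text (labeled + unlabeled)
vectorizer = TfidfVectorizer(max_features=100000)
vectorizer.fit(pd.concat([X_labeled, X_unlabeled]))
X_lab_vec = vectorizer.transform(X_labeled)
X_unl_vec = vectorizer.transform(X_unlabeled)
X_val_vec = vectorizer.transform(X_val)

# SelfTrainingClassifier expects unlabeled samples to be marked with -1
X_all = vstack([X_lab_vec, X_unl_vec])
y_all = np.concatenate([np.asarray(y_labeled), -np.ones(X_unl_vec.shape[0], dtype=int)])

results = {}
for t in [0.70, 0.74, 0.80, 0.90, 0.99]:
    clf = SelfTrainingClassifier(MultinomialNB(), threshold=t)
    clf.fit(X_all, y_all)
    results[t] = f1_score(y_val, clf.predict(X_val_vec))
    print(f"threshold={t:.2f}  validation F1={results[t]:.3f}")

print("Best threshold on this data:", max(results, key=results.get))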
Premise: I have been working on this ML dataset and I found my AdaBoost (ADA) and SVM models to be extremely good when it comes to detecting true positives. The confusion matrix for both models is identical and shown below.
Here's the image:
Out of the 10 models I have trained, 2 of them are ADA and SVM. Of the other 8, some have lower accuracy and others higher, by roughly ±2%.
MAIN QUESTION:
How do I chain/pipeline so that all my test cases are handled in the following manner?
Pass all the cases through SVM and ADA. If either SVM or ADA has 80%+ confidence, return the result
Else, if neither SVM nor ADA has high confidence, have only those test cases evaluated by the other 8 models for a final decision
Potential Solution:
My potential attempt involved the use of 2 voting classifiers: one classifier with just ADA and SVM, the second classifier with the other 8 models. But I don't know how to make this work.
Here's the code for my approach:
from sklearn.ensemble import VotingClassifier

ensemble1 = VotingClassifier(estimators=[
    ('SVM', model[5]),
    ('ADA', model[7]),
], voting='hard').fit(X_train, Y_train)

print('The accuracy for ensembled model is:', ensemble1.score(X_test, Y_test))

# I was trying to make ensemble 1 the "first pass": if it was more than 80% confident in its decision, return the result
# ELSE, ensemble 2 jumps in to make a decision

ensemble2 = VotingClassifier(estimators=[
    ('LR', model[0]),
    ('DT', model[1]),
    ('RFC', model[2]),
    ('KNN', model[3]),
    ('GBB', model[4]),
    ('MLP', model[6]),
    ('EXT', model[8]),
    ('XG', model[9])
], voting='hard').fit(X_train, Y_train)

# I don't know how to make these two models work together though.
Extra Questions:
These questions cover some extra concerns I had and are NOT the main question:
Is what I am trying to do worth it?
Is it normal to have a Confusion matrix with just True Positives and False Positives? Or is this indicative of incorrect training? As seen above in the picture for Model 5.
Are the accuracies of my models on an individual level considered to be good? The models are predicting likelihood of developing heart disease. Accuracies below:
Sorry for the long post and thanks for all your input and suggestions. I'm new to ML so I'd appreciate any pointers.
This is a simple implementation that hopefully solves your main problem of chaining multiple estimators:
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin

class ChainEstimator(BaseEstimator, ClassifierMixin):
    def __init__(self, est1, est2):
        self.est1 = est1
        self.est2 = est2

    def fit(self, X, y):
        self.est1.fit(X, y)
        self.est2.fit(X, y)
        return self

    def predict(self, X):
        ans = np.zeros((len(X),)) - 1
        probs = self.est1.predict_proba(X)          # averaged confidence of ADA & SVC (soft voting)
        conf_samples = np.any(probs >= .8, axis=1)  # samples with >=80% confidence
        ans[conf_samples] = np.argmax(probs[conf_samples, :], axis=1)  # predicted classes of confident samples
        if conf_samples.sum() < len(X):             # use est2 for the non-confident samples
            ans[~conf_samples] = self.est2.predict(X[~conf_samples])
        return ans
Which you can call like this:
from sklearn.ensemble import VotingClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

est1 = VotingClassifier(estimators=[('ada', AdaBoostClassifier()), ('svm', SVC(probability=True))], voting='soft')
est2 = VotingClassifier(estimators=[('dt', DecisionTreeClassifier()), ('knn', KNeighborsClassifier())])
clf = ChainEstimator(est1, est2).fit(X_train, Y_train)
ans = clf.predict(X_test)
Now if you want to base your chaining on the performance of est1, you can do something like this to record its performance during training, and add a few more ifs on the predict function:
from sklearn.model_selection import cross_val_score

def fit(self, X, y):
    self.est1.fit(X, y)
    self.est1_perf = cross_val_score(self.est1, X, y, cv=4, scoring='f1_macro')
    self.est2.fit(X, y)
    self.est2_perf = cross_val_score(self.est2, X, y, cv=4, scoring='f1_macro')
    return self
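For illustration only, a sketch of what those extra ifs in predict() could look like; the 0.9 cutoff on est1's cross-validated score is an arbitrary assumption, not something fixed:

def predict(self, X):
    # If est1's cross-validated f1_macro is too low, fall back to est2 entirely;
    # otherwise keep the confidence-based chaining from above.
    if self.est1_perf.mean() < 0.9:              # assumed cutoff, tune for your data
        return self.est2.predict(X)
    ans = np.zeros((len(X),)) - 1
    probs = self.est1.predict_proba(X)
    conf_samples = np.any(probs >= .8, axis=1)
    ans[conf_samples] = np.argmax(probs[conf_samples, :], axis=1)
    if conf_samples.sum() < len(X):
        ans[~conf_samples] = self.est2.predict(X[~conf_samples])
    return ans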
Note that you shouldn't be using simple accuracy for a problem like this.
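For example, per-class metrics give a much better picture; a quick sketch reusing the clf fitted above:

from sklearn.metrics import classification_report, f1_score

y_pred = clf.predict(X_test).astype(int)
print(classification_report(Y_test, y_pred))
print("macro F1:", f1_score(Y_test, y_pred, average='macro'))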
I have trained a classifier model using logistic regression on a set of strings that classifies each string as 0 or 1. Currently I can only test one string at a time. How can I have my model run through more than one sentence at a time, maybe from a .csv file, so I don't have to input each sentence individually?
def train_model(classifier, feature_vector_train, label, feature_vector_valid, valid_y, is_neural_net=False):
    classifier.fit(feature_vector_train, label)
    # predict the labels on validation dataset
    predictions = classifier.predict(feature_vector_valid)
    if is_neural_net:
        predictions = predictions.argmax(axis=-1)
    return classifier, metrics.accuracy_score(predictions, valid_y)
then
model, accuracy = train_model(linear_model.LogisticRegression(), xtrain_count, train_y, xtest_count,test_y)
This is currently how I test my model:
sent = ['here I copy a string']

# converting text to count bag-of-words vectors
# (count_vect must be fitted on the training corpus before transform will work)
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}', ngram_range=(1, 2))
x_feature_vector = count_vect.transform(sent)
pred = model.predict(x_feature_vector)
and I get the sentence and its prediction
I wanted the model to classify all my new sentences at once and give a classification to each sentence.
model.predict(X) takes a list of samples, and so does count_vec.transform(X), so you can read the sentences from a file and predict them all together like this:
with open('file.txt', 'r') as f:
    samples = f.readlines()

vecs = count_vec.transform(samples)
preds = model.predict(vecs)
for s, p in zip(samples, preds):
    # printing each sentence with the predicted label
    print(s + " Label: " + str(p))
A much easier way to go would be:
import pandas as pd

vecs = count_vec.transform(test['column_name_on_which_you_want_to_predict'])
pred = model.predict(vecs)
data = pd.DataFrame({'Text': test['column_name_on_which_you_want_to_predict'], 'SECTION': pred})
You can then export it however you want.
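For example, to write it out as a CSV (the file name is just a placeholder):

data.to_csv('predictions.csv', index=False)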
I am building a sentiment analysis model using NLTK and scikit-learn. I have decided to test a few different classifiers in order to see which is most accurate, and eventually to use all of them as a means of producing a confidence score.
The datasets used for this testing were all reviews, labelled as either positive or negative.
I trained each classifier with 5,000 reviews, 5 separate times, with 6 different (but very similar) datasets. Each test was done with a new set of 5000 reviews.
I averaged the accuracy for each test and dataset, to arrive at an overall mean accuracy. Take a look:
Multinomial Naive Bayes: 91.291%
Logistic Regression: 96.103%
SVC: 95.844%
In some tests, the accuracy was as high as 99.912%. In fact, the lowest mean accuracy for one of the datasets was 81.524%.
Here's a relevant code snippet:
import random
import nltk
from nltk.classify import SklearnClassifier
from sklearn.svm import SVC

def get_features(comment, word_features):
    features = {}
    for word in word_features:
        features[word] = (word in set(comment))
    return features

def main(dataset_name, column, limit):
    data = get_data(column, limit)
    data = clean_data(data)  # filter stop words
    all_words = [w.lower() for (comment, category) in data for w in comment]
    word_features = nltk.FreqDist(all_words).keys()
    feature_set = [(get_features(comment, word_features), category) for
                   (comment, category) in data]
    run = 0
    while run < 5:
        random.shuffle(feature_set)
        training_set = feature_set[:int(len(data) / 2.)]
        testing_set = feature_set[int(len(data) / 2.):]
        classifier = SklearnClassifier(SVC())
        classifier.train(training_set)
        acc = nltk.classify.accuracy(classifier, testing_set) * 100.
        save_acc(acc)  # function to save results as .csv
        run += 1
Although I know that these kinds of classifiers can typically return great results, this seems a little too good to be true.
What are some things that I need to check to be sure this is valid?
It's not so good if you get a range from 99.66% to 81.5%.
To analyze the dataset in a text classification case, you can check:
Is the dataset balanced?
The distribution of words for each label; sometimes the vocabulary used for each label can be really different (see the sketch after this list).
Are the positive/negative reviews from the same source? As in the previous point, if the domain is not the same, the reviews can use different expressions for a positive or negative review. Checking this helps you get high accuracy across several sources.
Try with reviews from a different source.
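As a sketch of the first two checks, assuming data is the list of (comment, category) pairs from your code, where each comment is a list of words:

from collections import Counter
import nltk

# 1. Is the dataset balanced?
label_counts = Counter(category for (comment, category) in data)
print("Label counts:", label_counts)

# 2. Most frequent words per label; very different vocabularies can make the task
#    artificially easy and explain suspiciously high accuracy.
for label in label_counts:
    words = [w.lower() for (comment, category) in data if category == label for w in comment]
    print(label, nltk.FreqDist(words).most_common(20))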
If after all that you still get such high accuracy, congrats! Your get_features is really good. :)
I am using the code below to save a random forest model. I am using cPickle to save the trained model. As I see new data, can I train the model incrementally?
Currently, the training set has about 2 years of data. Is there a way to train on another 2 years and (kind of) append it to the existing saved model?
rf = RandomForestRegressor(n_estimators=100)
print("Trying to fit the Random Forest model --> ")

if os.path.exists('rf.pkl'):
    print("Trained model already pickled -- >")
    with open('rf.pkl', 'rb') as f:
        rf = cPickle.load(f)
else:
    df_x_train = x_train[col_feature]
    rf.fit(df_x_train, y_train)
    print("Training for the model done ")
    with open('rf.pkl', 'wb') as f:
        cPickle.dump(rf, f)

df_x_test = x_test[col_feature]
pred = rf.predict(df_x_test)
EDIT 1: I don't have the compute capacity to train the model on 4 years of data all at once.
What you're talking about, updating a model with additional data incrementally, is discussed in the sklearn User Guide:
Although not all algorithms can learn incrementally (i.e. without seeing all the instances at once), all estimators implementing the partial_fit API are candidates. Actually, the ability to learn incrementally from a mini-batch of instances (sometimes called "online learning") is key to out-of-core learning as it guarantees that at any given time there will be only a small amount of instances in the main memory.
They include a list of classifiers and regressors implementing partial_fit(), but RandomForest is not among them. You can also confirm RFRegressor does not implement partial fit on the documentation page for RandomForestRegressor.
Some possible ways forward:
Use a regressor which does implement partial_fit(), such as SGDRegressor (a sketch follows after this list)
Check your RandomForest model's feature_importances_ attribute, then retrain your model on 3 or 4 years of data after dropping unimportant features
Train your model on only the most recent two years of data, if you can only use two years
Train your model on a random subset drawn from all four years of data.
Change the max_depth parameter to constrain how complicated your model can get. This saves computation time and so may allow you to use all your data. It can also prevent overfitting. Use cross-validation to select the best max_depth hyperparameter for your problem
Set your RF model's param n_jobs=-1 if you haven't already, to use multiple cores/processors on your machine.
Use a faster ensemble-tree-based algorithm, such as xgboost
Run your model-fitting code on a large machine in the cloud, such as AWS or dominodatalab
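As a sketch of the first option, incremental training with SGDRegressor's partial_fit; the yearly chunk names (x_train_year1_2 and so on) are placeholders, col_feature comes from your code, and the StandardScaler matters because SGD is sensitive to feature scaling:

from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
sgd = SGDRegressor()

# First two years of data
X_chunk1 = scaler.fit_transform(x_train_year1_2[col_feature])
sgd.partial_fit(X_chunk1, y_train_year1_2)

# Later, when the next two years arrive, reuse the same scaler and model
X_chunk2 = scaler.transform(x_train_year3_4[col_feature])
sgd.partial_fit(X_chunk2, y_train_year3_4)

pred = sgd.predict(scaler.transform(x_test[col_feature]))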
You can set the 'warm_start' parameter to True in the model. This retains what was learned in previous fit calls. Note that for a random forest you also need to increase n_estimators before each new fit call, otherwise no additional trees are trained.
The same model learning incrementally two times (train_X[:1], then train_X[1:2]) after setting warm_start:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

forest_model = RandomForestRegressor(warm_start=True)
forest_model.fit(train_X[:1], train_y[:1])
pred_y = forest_model.predict(val_X[:1])
mae = mean_absolute_error(pred_y, val_y[:1])
print("mae :", mae)
print('pred_y :', pred_y)

# To grow the forest on the new chunk, increase n_estimators first,
# e.g. forest_model.n_estimators += 100; otherwise no new trees are added.
forest_model.fit(train_X[1:2], train_y[1:2])
pred_y = forest_model.predict(val_X[1:2])
mae = mean_absolute_error(pred_y, val_y[1:2])
print("mae :", mae)
print('pred_y :', pred_y)
mae : 1290000.0
pred_y : [ 1630000.]
mae : 925000.0
pred_y : [ 1630000.]
The model trained only on the last chunk (train_X[1:2]):
forest_model = RandomForestRegressor()
forest_model.fit(train_X[1:2],train_y[1:2])
pred_y = forest_model.predict(val_X[1:2])
mae = mean_absolute_error(pred_y,val_y[1:2])
print("mae :",mae)
print('pred_y :',pred_y)
mae : 515000.0
pred_y : [ 1220000.]
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html