python nltk naive bayes probabilities

Is there a way to get at the individual probabilities using nltk.NaiveBayesClassifier.classify? I want to see the probabilities of classification to try and make a confidence scale. Obviously with a binary classifier the decision is going to be one or the other, but is there some way to see the inner workings of how the decision was made? Or, do I just have to write my own classifier?
Thanks

How about nltk.NaiveBayesClassifier.prob_classify?
http://nltk.org/api/nltk.classify.html#nltk.classify.naivebayes.NaiveBayesClassifier.prob_classify
classify calls this function:
def classify(self, featureset):
    return self.prob_classify(featureset).max()
Edit: something like this should work (not tested):
dist = classifier.prob_classify(features)
for label in dist.samples():
print("%s: %f" % (label, dist.prob(label)))

I know this is utterly old, but as I struggled for some time to figure this out, I'm sharing this code.
It shows the probability associated with each feature in the Naive Bayes classifier. It helped me understand better how show_most_informative_features works. Possibly that is the best option for most people (and quite possibly why they created that function). Anyway, for those like me who must see the individual probability for each label and feature, you can use this code:
for label in classifier.labels():
    print(f'\n\n{label}:')
    for (fname, fval) in classifier.most_informative_features(50):
        print(f" {fname}({fval}): ", end="")
        print("{0:.2f}%".format(100 * classifier._feature_probdist[label, fname].prob(fval)))

Related

How to compute auc score manually without using sklearn?

I want to compute the AUC score without using sklearn.
I have a CSV file with 2 columns (actual, predicted probability), and I want to compute the AUC score using the numpy.trapz() function.
Here is my code:
from tqdm import tqdm

def AUC_SCORE(x):
    t=[]
    f=[]
    x=x.sort_values(by=["proba"],ascending=False)
    for t in tqdm(x["proba"].unique()):
        x['y_pred'] =np.where( x['proba']>=t,1,0)
        tp=(x["y"]==1)&(x["y_pred"]==1).sum()
        fp=(x["y"]==0)&(x["y_pred"]==1).sum()
        tn=(x["y"]==0)&(x["y_pred"]==0).sum()
        fn=(x["y"]==1)&(x["y_pred"]==0).sum()
        tpr= tp/(fp+fn)
        fpr= fp/(tn+fp)
        t.append(tpr)
        f.append(fpr)
    return np.trapz(t,f)

e=AUC_SCORE(a)
I have around 10,100 points and it takes over 1 hour on Google Colab.
I didn't get my result, and I am getting errors while modifying my code.
Is there any better way to compute the AUC score without using sklearn?
The problem with your implementation seems to be here:
x=x.sort_values(by=["proba"],ascending=False)
for t in tqdm(x["proba"].unique()):
You seem to iterate through each unique probability value, but these are in the range 0-1 and are most likely nearly all unique, which leads to a very long run. You need to translate each probability into a label. If you are using binary labels (which your attempt suggests), you can use the following list comprehension:
df["prediction"] = [0 if x<0.5 else 1 for x in df["proba"]]
This way you translate the probability into a label and can then sort by the prediction and use the unique values of the predictions. If you use multilabel predictions, you can extend the above condition according to your needs.
Regarding the performance issue of 1 hour, try removing tqdm from the loop:
for t in tqdm(x["proba"].unique()):
so modify it to:
for t in (x["proba"].unique()):
tqdm is used to show a progress bar in loops over ranges, and it does not have any effect on the calculated results.
But I do not know its effect on performance with x["proba"].unique(); it was tested against direct sequences like ranges.
I am anxious to know the result of your try and will be waiting for your test results in the comments.
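For what it's worth, the per-threshold counting can also be vectorized so that np.trapz can be applied directly; a rough sketch (not tested, assuming the same "y" and "proba" column names as in the question, binary 0/1 labels, and ties between probabilities handled only approximately):
import numpy as np

def auc_score(df):
    # Sort by predicted probability, highest first
    df = df.sort_values(by="proba", ascending=False)
    # Cumulative true/false positives as the threshold sweeps down the ranking
    tps = (df["y"] == 1).cumsum().to_numpy()
    fps = (df["y"] == 0).cumsum().to_numpy()
    # Normalise by the total number of positives and negatives (the last entries)
    tpr = tps / tps[-1]
    fpr = fps / fps[-1]
    # Area under the ROC curve via the trapezoidal rule
    return np.trapz(tpr, fpr)

e = auc_score(a)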

Non-overlapping data in train test validation split python

I'm trying to create a function for some deep learning tasks involving satellite image classification. I have searched through a lot of libraries and haven't found what I need. I tried scikit-learn, but I feel that it is not what I need.
Any hint for a specialised function that I may have missed?
The sklearn train_test_split seems to fit all your needs.
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
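If you want three non-overlapping sets (train/validation/test), one common approach is to chain two calls to train_test_split; a sketch, assuming X and y are your image array and labels and using a 50/20/30 split as an example:
from sklearn.model_selection import train_test_split

# First carve off the training set, then split the remainder into validation and test.
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, train_size=0.5, random_state=0)
X_va, X_te, y_va, y_te = train_test_split(X_rest, y_rest, test_size=0.6, random_state=0)
# 0.6 of the remaining 50% gives 30% test and 20% validation overall.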
This should do the trick. You can use the permutation array on the X and y data separately if you like.
num_tr, num_va = int(len(data)*0.5), int(len(data)*0.2)
perm = np.random.permutation(len(data))
tr_data = data[perm[:num_tr]]
va_data = data[perm[num_tr:num_tr+num_va]]
te_data = data[perm[num_tr+num_va:]]

Using a trained classifier on a new DataFrame

I have built a classifier, trained and tested on labeled data. Now I want to test it further by making predictions on a dataset without the labels. I already know the labels myself, but I want to remove them for the purpose of testing, and have it print out the values with a 0 prediction so I can compare the accuracy myself. I'm using the following code to iterate through my dataset and make a prediction for each row in the DataFrame:
malware = set()
for index, row in dataset.iterrows():
    res = clf.predict([row])
    if res == 0:
        malware.add(index)
print(malware)
f.write(str(malware) + "\n")
It seems to be working; however, it's not a quick process. Is there a better way, or anything I can do to speed it up?
Using a for loop to iterate through elements in a dataset is slow in general. What you want to do is apply your function across the column(s) and generate a series of labels according to the result (assuming you're using pandas for the DataFrame, by the way):
labels=dataset.apply(clf.predict)
You can then just scan through this series with a for loop. That should be nearly instant.
After a bit of work I have turned the comment from Ding into a workable answer that is much quicker. My new code is:
from collections import OrderedDict
malware = []
malware.append(OrderedDict.fromkeys(dataset.index[clf.predict(dataset) == 0]))
print (malware)
Thanks very much Ding!
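For reference, the same idea can be written even more directly with a boolean mask, assuming clf and dataset are as above; a small sketch:
# Predict all rows in one call, then keep the index values where the prediction is 0
malware = dataset.index[clf.predict(dataset) == 0].tolist()
print(malware)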

Very slow function with two for loops using ArcPy in Python

I wrote some code which works perfectly on small data, but when I run it over a dataset with 52,000 features, it seems to get stuck in the function below:
def extract_neighboring_OSM_nodes(ref_nodes,cor_nodes):
    time_start=time.time()
    print "here we start finding neighbors at ", time_start
    for ref_node in ref_nodes:
        buffered_node = ref_node[2].buffer(10)
        for cor_node in cor_nodes:
            if cor_node[2].within(buffered_node):
                ref_node[4].append(cor_node[0])
                cor_node[4].append(ref_node[0])
        # node[4][:] = [cor_nodes.index(x) for x in cor_nodes if x[2].within(buffered_node)]
    time_end=time.time()
    print "neighbor extraction took ", time_end
    return ref_nodes
ref_nodes and cor_nodes are lists of tuples as follows:
[(FID, point, geometry, links, neighbors)]
neighbors is an empty list which is going to be populated in the above function.
As I said, the last message printed out is the first print command in this function. It seems that this function is very slow, but for 52,000 features it should not take 24 hours, should it?
Any idea where the problem might be, or how to make the function faster?
You can try multiprocessing, here is an example - http://pythongisandstuff.wordpress.com/2013/07/31/using-arcpy-with-multiprocessing-%E2%80%93-part-3/.
If you want to get the K nearest neighbors of every (or some, it doesn't matter) sample of a dataset, or the eps-neighborhood of samples, there is no need to implement it yourself. There are libraries out there specifically for this purpose.
Once they have built the data structure (usually some kind of tree), you can query it for the neighborhood of a given sample. For high-dimensional data these structures are usually not as good as they are in low dimensions, but there are solutions for high-dimensional data as well.
One I can recommend here is the KDTree, which has a SciPy implementation.
I hope you find it as useful as I did.
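As an illustration of the KDTree idea applied to this function, here is an untested sketch; it assumes each node's coordinates can be read from its geometry (the centroid.X / centroid.Y accessors are a guess, adjust to whatever your geometries actually expose) and keeps the same (FID, point, geometry, links, neighbors) tuple layout:
import numpy as np
from scipy.spatial import cKDTree

def extract_neighboring_OSM_nodes_kdtree(ref_nodes, cor_nodes, radius=10.0):
    ref_xy = np.array([(n[2].centroid.X, n[2].centroid.Y) for n in ref_nodes])
    cor_xy = np.array([(n[2].centroid.X, n[2].centroid.Y) for n in cor_nodes])
    tree = cKDTree(cor_xy)
    # One batched radius query instead of 52,000 x 52,000 pairwise 'within' tests
    for i, idxs in enumerate(tree.query_ball_point(ref_xy, r=radius)):
        for j in idxs:
            ref_nodes[i][4].append(cor_nodes[j][0])
            cor_nodes[j][4].append(ref_nodes[i][0])
    return ref_nodes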

Support Vector Machine training using scikit-learn

I'm trying to train a Support Vector Machine using scikit-learn. It correctly gives the output for trainingdata1, but it does not give the expected result for trainingdata2 (trainingdata2 is what I actually need). What is wrong?
from sklearn import svm
trainingdata1 = [[11.0, 2, 2, 1.235, 5.687457], [11.3, 2, 2,7.563, 10.107477]]
#trainingdata2 = [[1.70503083,7.531671404747827,1.4804916998015452,3.0767991352604387,6.5742], [11.3, 2, 2,7.563, 10.107477]]
clf = svm.OneClassSVM()
clf.fit(trainingdata1)
def alert(data):
    if clf.predict(data) < 0:
        print('\n\nThere is something wrong')
    else:
        print('\nCorrect')
alert([11.3, 2, 2,7.563, 10.107477])
#alert([1.70503083,7.531671404747827,1.4804916998015452,3.0767991352604387,6.5742])
Well, I have to admit I actually hadn't heard about one-class SVMs before. As far as I understand, their goal is to find out whether test examples are similar to previously provided examples. Now, the difference between the two cases is that the two training vectors are quite similar in the first, working example, and rather different in the other one (if we compare the numeric values of the individual components). Could it be that this is actually behaving as intended? Note that SVM training does not guarantee that every training example is classified with its training label, due to generalization.
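If you want to see more than the hard label, one thing you can look at (a sketch, not tested) is decision_function, which returns a signed distance to the learned boundary; positive values mean the sample looks like the training data, negative values mean it is treated as an outlier:
from sklearn import svm

trainingdata2 = [[1.70503083, 7.531671404747827, 1.4804916998015452,
                  3.0767991352604387, 6.5742],
                 [11.3, 2, 2, 7.563, 10.107477]]

clf = svm.OneClassSVM()
clf.fit(trainingdata2)

# Magnitude gives a rough idea of how confident the model is for each sample.
print(clf.decision_function(trainingdata2))
print(clf.predict(trainingdata2))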
