I am trying to use NLTK to train a Naive Bayes classifier for multi-class text classification, but I do not have access to the original texts. What I am provided with is a file in SVM Light format (one instance per line, with feature:value pairs). I simply have to import this file and then train and test a Naive Bayes classifier on this dataset. I was wondering if there is some way to import this file into NLTK and use it directly for training classifiers.
According to NLTK's own documentation, this is achieved with something like the following.
Excerpt from Documentation:
scikit-learn (http://scikit-learn.org) is a machine learning library
for Python. It supports many classification algorithms, including
SVMs, Naive Bayes, logistic regression (MaxEnt) and decision trees.
This package implements a wrapper around scikit-learn classifiers. To
use this wrapper, construct a scikit-learn estimator object, then use
that to construct a SklearnClassifier. E.g., to wrap a linear SVM with
default settings:
Example:
>>> from sklearn.svm import LinearSVC
>>> from nltk.classify.scikitlearn import SklearnClassifier
>>> classif = SklearnClassifier(LinearSVC())
See: http://www.nltk.org/api/nltk.classify.html#module-nltk.classify.scikitlearn
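Since the data is already in SVM Light format, the most direct route is arguably to skip NLTK's feature dicts and load the file with scikit-learn itself, then train the sklearn Naive Bayes estimator directly. A minimal sketch, where 'train.data' is a placeholder for your actual filename:
from sklearn.datasets import load_svmlight_file
from sklearn.naive_bayes import MultinomialNB

# load_svmlight_file parses the feature:value lines into a sparse
# feature matrix X and a label vector y
X, y = load_svmlight_file('train.data')  # placeholder filename

# note: MultinomialNB expects non-negative feature values (counts, TF-IDF, ...)
clf = MultinomialNB()
clf.fit(X, y)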
Related
I know there is a library in Python:
from sklearn.naive_bayes import MultinomialNB
but I want to know how to create one from scratch, without using libraries like TfidfVectorizer and MultinomialNB.
Here is a step-by-step guide to building a simple MNB classifier with TF-IDF.
First, import TfidfVectorizer to vectorize the terms in the dataset, MultinomialNB as the classifier, and train_test_split for splitting the dataset (all three are available in sklearn).
Split the dataset into train and test sets.
Instantiate TfidfVectorizer, then vectorize the train set with the method fit_transform.
Transform the test set with the method transform, not fit: the vectorizer must be fitted on the train set only.
Initialize the classifier by calling the constructor MultinomialNB().
model = MultinomialNB() # with default hyperparameters
Train the classifier with the train set.
model.fit(X_train, y_train)
Test/Validate the classifier with the test set.
model.predict(X_test)  # predict takes only the features; compare its output with y_test
Those seven steps are the basics. You can of course also add text preprocessing and model evaluation; a complete minimal sketch follows.
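Putting the steps together, here is a minimal end-to-end sketch on a made-up toy corpus (replace the texts and labels with your own data):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# toy data for illustration only
texts = ["good movie", "bad movie", "great film", "terrible film",
         "wonderful story", "awful story", "enjoyable plot", "boring plot"]
labels = ["pos", "neg", "pos", "neg", "pos", "neg", "pos", "neg"]

X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42)

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_raw)  # fit the vocabulary on train only
X_test = vectorizer.transform(X_test_raw)        # reuse it for the test set

model = MultinomialNB()
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))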
I have been able to create a RandomForestClassifier on a dataset.
clf = RandomForestClassifier(n_estimators=100, random_state=101)
I can then use it on the test data like this:
prediction = pd.DataFrame(clf.predict(x)) # x = Matrix of predictor values
So my question is: how can I test clf.predict outside of Python? How can I see the values it is using, and how can I replicate it "manually"? For example, if you get the betas in a regression, you can use those values in Excel and replicate the model. How can I do this with random forests in Python?
Also, is there a metric similar to R-squared to assess the model's explanatory power?
Thanks!
The RandomForestClassifier is an ensemble of trees, which means it is composed of multiple decision trees.
To inspect the trees, I would suggest doing it in Python itself: you can access all the trees through the estimators_ attribute of the classifier and export each one as a graph with export_graphviz from the sklearn.tree module.
If you insist on reproducing the model outside Python, you will need to export all the rules that each tree is composed of. For that, you can follow these instructions from the sklearn docs.
Regarding the metrics, for a classification problem you could use accuracy_score from the sklearn.metrics module.
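For example (assuming clf is the fitted forest from the question, and x_test, y_test are hypothetical held-out data):
from sklearn.tree import export_graphviz
from sklearn.metrics import accuracy_score

# write each tree in the forest to a .dot file; render them with graphviz
for i, tree in enumerate(clf.estimators_):
    export_graphviz(tree, out_file='tree_{}.dot'.format(i))

# accuracy: the fraction of correctly classified test samples
print(accuracy_score(y_test, clf.predict(x_test)))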
I thought they should be the same, but for the method decision_function() I get different results. And SVC with only decision_function_shape='ovr' is really much faster.
Related: Scikit learn multi-class classification for support vector machines
I found some clarification in the documentation of LinearSVC, under the 'See also' heading, where SVC is mentioned.
SVC
Implementation of Support Vector Machine classifier using libsvm:
....
....
Furthermore SVC multi-class mode is implemented using one vs one scheme while LinearSVC uses one vs the rest. It is possible to implement one vs the rest with SVC by using the sklearn.multiclass.OneVsRestClassifier wrapper.
....
Also, SVC delegates all the training to the underlying libsvm library, which handles the multi-class case as 'OvO' (even if decision_function_shape='ovr').
It's mentioned in the issue that @delusionX referenced that decision_function_shape exists just for compatibility with the scikit-learn API. Most probably, all other estimators handle multi-class as OvR, so when SVC is used in combination with other things (for example in a Pipeline, in GridSearchCV, or in wrappers like OneVsRestClassifier), returning an OvO decision function would break them. But I could not find that written explicitly anywhere.
Fun fact: OneVsOneClassifier also returns a decision function which conforms to the shape of OvR.
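You can see the shape difference directly on a synthetic 4-class problem (all names below are made up for illustration):
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# synthetic 4-class dataset
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=8, n_classes=4, random_state=0)

ovo = SVC(decision_function_shape='ovo').fit(X, y)
ovr = SVC(decision_function_shape='ovr').fit(X, y)

print(ovo.decision_function(X).shape)  # (200, 6): one column per class pair, 4*3/2
print(ovr.decision_function(X).shape)  # (200, 4): one column per class, derived from the OvO votes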
I want to do POS tagging using an SVM with a non-English corpus in Python.
It looks like NLTK does not support SVM-based tagging out of the box (http://www.nltk.org/_modules).
scikit-learn has an SVM module, so I installed scikit-learn to use it from Python, but I cannot find any tutorials about POS tagging with an SVM.
I really have no clue what to do, any help would be appreciated.
Does it have to be an SVM? NLTK has built-in tools to do POS tagging: Categorizing and Tagging Words.
If you want to use a custom classifier, look here: http://www.nltk.org/api/nltk.classify.html (Ctrl+F "svm"). NLTK provides a wrapper for scikit-learn algorithms called SklearnClassifier. Then take a look at http://www.nltk.org/api/nltk.tag.html (Ctrl+F "classifier"); there is a class nltk.tag.sequential.ClassifierBasedPOSTagger which apparently can use wrapped-up classifiers from sklearn.
I haven't tried this but it might work.
EDIT:
It should work like this:
import nltk
from nltk.classify import SklearnClassifier
from sklearn.svm import SVC

clf = SklearnClassifier(SVC(), sparse=False)
cpos = nltk.tag.sequential.ClassifierBasedPOSTagger(
    train=train_sents,
    classifier_builder=lambda train_feats: clf.train(train_feats))
One caveat: scikit-learn estimators take numerical features only. SklearnClassifier converts NLTK's feature dicts internally (via a DictVectorizer), so string-valued features are handled for you, but anything more exotic you will need to convert yourself.
How do you train Scikit's LinearSVC on a dataset too big or impractical to fit into memory? I'm trying to use it to classify documents, and I have a few thousand tagged example records, but when I try to load all this text into memory and train LinearSVC, it consumes over 65% of my memory and I'm forced to kill it before my system becomes totally unresponsive.
Is it possible to format my training data as a single file and feed it into LinearSVC with a filename instead of having to call the fit() method?
I found this guide, but it only really covers classification, and assumes training is done incrementally, something LinearSVC doesn't support.
As far as I know, non-incremental implementations like LinearSVC would need the entire data set to train on. Unless you create an incremental version of it, you might be unable to use LinearSVC.
There are classifiers in scikit-learn that can be trained incrementally, just like in the guide you found, which uses SGDClassifier. SGDClassifier has a partial_fit method that allows you to train it in batches. A couple of other classifiers support incremental learning too, such as MultinomialNB and BernoulliNB.
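For example, here is a minimal out-of-core sketch with SGDClassifier (hinge loss makes it a linear SVM, so it is close in spirit to LinearSVC). The filename, column layout, and label set below are made-up assumptions:
import csv
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2**18)  # stateless: no fit, no vocabulary held in memory
clf = SGDClassifier(loss='hinge')                 # hinge loss ~ linear SVM

classes = ['spam', 'ham']  # partial_fit must know all labels up front (made-up labels)

def batches(filename, batch_size=1000):
    """Yield (texts, labels) chunks from a two-column CSV: text,label."""
    with open(filename) as f:
        texts, labels = [], []
        for text, label in csv.reader(f):
            texts.append(text)
            labels.append(label)
            if len(texts) == batch_size:
                yield texts, labels
                texts, labels = [], []
        if texts:
            yield texts, labels

for texts, labels in batches('train.csv'):        # 'train.csv' is a placeholder
    X = vectorizer.transform(texts)               # vectorize one batch at a time
    clf.partial_fit(X, labels, classes=classes)   # classes kwarg required on the first call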
You can use a generator function like this, so the raw text is read from disk one line at a time:
import csv

def lineGenerator():
    with open(INPUT_FILENAMES_TITLE[0], 'r') as f1:
        title_reader = csv.reader(f1)
        for line in title_reader:
            yield line[0]
Then you can feed the generator into a text vectorizer and fit the classifier on the result. Note that LinearSVC.fit cannot take a generator (or a filename) directly: it needs a numeric feature matrix plus the labels, although a sparse matrix is far smaller than the raw text:
clf = LinearSVC()
clf.fit(X, y)  # X = vectorized features, y = your labels
This assumes INPUT_FILENAMES_TITLE[0] is your filename.
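Putting the pieces together, a minimal sketch using the stateless HashingVectorizer, which can consume the generator directly (loading the label vector is assumed to happen separately, e.g. from another column of the CSV):
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.svm import LinearSVC

vectorizer = HashingVectorizer(n_features=2**18)   # stateless: no fit pass over the data needed
X = vectorizer.transform(lineGenerator())          # streams the file line by line
clf = LinearSVC()
clf.fit(X, labels)  # 'labels' must be loaded separately (assumption)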