I am working on a project to classify snippets of text using the Python NLTK module and its NaiveBayesClassifier. I am able to train on corpus data and classify another set of data, but I would like to feed additional training information into the classifier after the initial training.
If I'm not mistaken, there doesn't appear to be a way to do this, because the NaiveBayesClassifier.train method takes a complete set of training data. Is there a way to add to the training data without feeding in the original featureset again?
I'm open to suggestions including other classifiers that can accept new training data over time.
There are 2 options that I know of:
1) Periodically retrain the classifier on the new data. You'd accumulate new training data in a corpus (that already contains the original training data), then every few hours, retrain & reload the classifier. This is probably the simplest solution.
2) Externalize the internal model, then update it manually. The NaiveBayesClassifier can be created directly by giving it a label_probdist and a feature_probdist. You could create these separately, pass them in to a NaiveBayesClassifier, then update them whenever new data comes in. The classifier would use this new data immediately. You'd have to look at the train method for details on how to update the probability distributions.
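The idea behind option 2 can be sketched without NLTK at all. The following is a minimal toy illustration, not NLTK's implementation: the class name CountingNaiveBayes and its methods are hypothetical, and it assumes whitespace tokenization and add-one smoothing. It keeps raw counts instead of finished probabilities, so new examples can be folded in at any time.

```python
import math
from collections import Counter, defaultdict


class CountingNaiveBayes:
    """Toy multinomial Naive Bayes that stores raw counts so it can be updated."""

    def __init__(self):
        self.label_counts = Counter()            # label -> number of documents
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.vocabulary = set()

    def update(self, labeled_texts):
        """Fold new (text, label) pairs into the existing counts."""
        for text, label in labeled_texts:
            words = text.lower().split()
            self.label_counts[label] += 1
            self.word_counts[label].update(words)
            self.vocabulary.update(words)

    def classify(self, text):
        """Return the label with the highest log-probability (add-one smoothing)."""
        total_docs = sum(self.label_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, float("-inf")
        for label, doc_count in self.label_counts.items():
            score = math.log(doc_count / total_docs)
            label_total = sum(self.word_counts[label].values())
            for word in text.lower().split():
                if word not in self.vocabulary:
                    continue  # skip words never seen in training
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (label_total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


clf = CountingNaiveBayes()
clf.update([("cheap pills buy now", "spam"), ("meeting at noon tomorrow", "ham")])
clf.update([("lunch meeting moved to friday", "ham")])  # later batch, no retraining from scratch
print(clf.classify("buy cheap pills"))    # spam
print(clf.classify("meeting on friday"))  # ham
```

Because probabilities are derived from the counts at classification time, each update takes effect immediately, which is the behaviour option 2 describes.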
I'm just learning NLTK, so please correct me if I'm wrong. This is using the Python 3 branch of NLTK, which might be incompatible.
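There is an update() method on the NaiveBayesClassifier instance, which appears to add to the training data. Note that the class imported below comes from TextBlob (which builds on NLTK), not from nltk itself: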
from textblob.classifiers import NaiveBayesClassifier
train = [
('training test totally tubular', 't'),
]
cl = NaiveBayesClassifier(train)
cl.update([('super speeding special sport', 's')])
print('t', cl.classify('tubular test'))
print('s', cl.classify('super special'))
This prints out:
t t
s s
As Jacob said, the second method is the right way, and hopefully someone will write the code for it.
See:
https://baali.wordpress.com/2012/01/25/incrementally-training-nltk-classifier/
I'm new to Python and have to do a natural language processing task.
Using a Kaggle dataset, a sentiment classifier should be implemented in Python.
For this I'm using a dataframe and LogisticRegression, as described in this article, and everything works fine.
Now I want to know if it is possible to classify another string which is not in the dataset, so that I can experiment with the classifier interactively.
Is this possible?
Thank you!
You will have to manually run all the preprocessing on your new data, then predict.
That is:
First run the (Data Cleaning) step and any other functions you called that edit the data,
then run the (Create a bag of words) part, reusing the vectorizer that was fitted on the training data, and only
then use the fitted LR model to predict on this preprocessed data.
Yes, this is possible.
To make this more modular, you can create a function and pass the input string to it for preprocessing. This reduces code redundancy, because you can pass the training data through the same function.
Once that is done, you need to create the Bag of Words representation for the test sentence with the vectorizer that was fitted on the training data.
Then you can use the predict function of the trained LR model to predict the output.
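The steps above can be sketched as follows. This is a minimal illustration, not the code from the article the question refers to: the clean_text and predict_sentiment helpers, the tiny training set, and the cleaning rule are all assumptions made up for the example.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression


def clean_text(text):
    """Hypothetical cleaning step: lowercase and keep letters and spaces only."""
    return re.sub(r"[^a-z\s]", " ", text.lower()).strip()


# Made-up training data; 1 = positive, 0 = negative.
train_texts = ["I loved this movie!", "Great acting, great plot",
               "Terrible film...", "I hated the ending"]
train_labels = [1, 1, 0, 0]

# Fit the bag-of-words vocabulary and the model once, on the training data.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform([clean_text(t) for t in train_texts])
model = LogisticRegression()
model.fit(X_train, train_labels)


def predict_sentiment(text):
    """Apply the same cleaning and the already fitted vectorizer, then predict."""
    features = vectorizer.transform([clean_text(text)])  # transform, not fit_transform
    return int(model.predict(features)[0])


print(predict_sentiment("What a great movie"))
```

The important detail is that the new string goes through transform() on the fitted vectorizer, so it ends up with the same columns the model was trained on.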
Thank You.
Suppose that I have preprocessed some text data, removed stopwords, urls and so on.
How should I structure these cleaned data in order to make them usable for a classifier like a Neural Network? Is there a preferred structure, or a rule of thumb? (Bag of words, tf-idf or anything else?) Also, can you suggest some package which will automatically do all the work in python?
Now I train the model, and things work properly.
The model performs well on test set too.
How do I have to treat unseen data?
When I decide to implement the model in a real life project it will encounter new data: do I have to store the structure (like the tf-idf structure) I used for training and apply it to these new data?
Also, let's suppose that the word "hello" did not occur in the training/validation/test data, so it has no representation. A real-life sentence I have to classify contains the word "hello".
How do I cope with this problem?
Thanks for all the clarifications.
What you can do is make a class and define functions inside it for steps like:
import dataset
data cleaning
data preprocessing(BOW, TfIDf)
model building
predictions
You can follow the code at the link below to get an understanding:
https://github.com/azeem110201/lifecycledatascienceproject
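On the question of unseen data and unseen words, here is a minimal sketch, assuming scikit-learn's TfidfVectorizer is used for the preprocessing step (the question does not say which tool was used, and the example documents are made up). The fitted vectorizer is stored and reused, and a word that was not in the training vocabulary is simply ignored at transform time.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["good morning everyone", "good night", "see you in the morning"]

# Learn the vocabulary and IDF weights once, on the training data.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)

# Store the fitted vectorizer next to the model; a bytes round trip stands in for a file here.
restored = pickle.loads(pickle.dumps(vectorizer))

# "hello" was never seen during training, so it has no column and is dropped.
X_new = restored.transform(["hello good morning"])

print("hello" in restored.vocabulary_)      # False
print(X_new.shape[1] == X_train.shape[1])   # True
print(X_new.nnz)                            # 2 (only "good" and "morning" are kept)
```

If unseen words turn out to matter a lot in practice, the usual remedy is to retrain on a corpus that contains them.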
I have implemented an ML model using the naive Bayes algorithm, and I want to add incremental learning to it. When I train my model, preprocessing generates 1500 features. A month later, I want to use a feedback mechanism to train the model on new data, which might contain new features and might produce fewer or more than the 1500 features of my previous dataset. If I use fit_transform to get the new features, my existing feature set is lost.
I have been using partial_fit, but the issue with partial_fit is that it requires the same number of features as the previous model. How do I make the model learn incrementally?
cv = CountVectorizer()
X = cv.fit_transform(corpus).toarray() #replaces my older feature set
classifier = GaussianNB()
classifier.partial_fit(X,y)
#does not fit because the size of feature set count is not equal to previous feature set count
You could use just transform() for the CountVectorizer() and then partial_fit() for Naive-Bayes like the following for the incremental learning. Remember, transform extracts the same set of features, which you had learned using the training dataset.
X = cv.transform(corpus).toarray() # GaussianNB needs a dense array
classifier.partial_fit(X,y)
But you cannot rebuild the features from scratch and then continue the incremental learning. The number of features needs to stay consistent for any model to learn incrementally.
If you think your new dataset has significantly different features compared to the older one, use cv.fit_transform() and then classifier.fit() on the complete dataset (both old and new), which means creating a new model for all the available data. You could adopt this if your dataset is small enough to keep in memory.
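A minimal sketch of the transform() plus partial_fit() approach, using made-up spam/ham data. The variable names are illustrative. Note that GaussianNB.partial_fit needs the full list of classes on its first call.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB

old_corpus = ["free prize money", "win free money now",
              "project meeting today", "team meeting notes"]
old_y = np.array([1, 1, 0, 0])

cv = CountVectorizer()
X_old = cv.fit_transform(old_corpus).toarray()  # the vocabulary is learned here, once

classifier = GaussianNB()
# The complete set of classes must be given on the first partial_fit call.
classifier.partial_fit(X_old, old_y, classes=np.array([0, 1]))

# A month later: new data, same vectorizer, transform() only.
new_corpus = ["free money offer", "meeting notes attached"]
new_y = np.array([1, 0])
X_new = cv.transform(new_corpus).toarray()  # "offer" and "attached" are not in the vocabulary
classifier.partial_fit(X_new, new_y)

print(X_old.shape[1], X_new.shape[1])  # same number of columns in both batches
print(classifier.predict(cv.transform(["win money"]).toarray()))
```

Words that only occur in the new batch are silently dropped, which is the price of keeping the feature space fixed.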
You cannot do this with CountVectorizer. You need a fixed number of features for partial_fit() in GaussianNB.
However, you can use a different preprocessor (in place of CountVectorizer) which maps the inputs (old and new) to the same feature space. Have a look at HashingVectorizer, which the scikit-learn authors recommend for exactly the scenario you mentioned. When initializing it, you specify the number of features you want. In most cases, the default value is large enough to avoid hash collisions between different words, but you may want to experiment with different numbers. Try it and check the performance. If it is not on par with CountVectorizer, you can do what @AI_Learning suggests and build a new model on the whole data (old + new).
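A minimal sketch of the HashingVectorizer route. Two choices here are assumptions for the example rather than part of the answer: MultinomialNB is swapped in for GaussianNB because it accepts sparse input, and n_features is set to a small value (2**10) with alternate_sign=False so that all feature values are non-negative.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# Stateless vectorizer: there is no vocabulary to fit, so the feature space never changes.
hv = HashingVectorizer(n_features=2**10, alternate_sign=False)
classifier = MultinomialNB()

batch_1 = ["free prize money", "project meeting today"]
classifier.partial_fit(hv.transform(batch_1), [1, 0], classes=np.array([0, 1]))

# Later batch with completely new words: still maps to the same 1024 columns.
batch_2 = ["exclusive lottery jackpot", "quarterly budget review"]
X2 = hv.transform(batch_2)
classifier.partial_fit(X2, [1, 0])

print(X2.shape)  # (2, 1024)
```

The trade-off is that hashed features cannot be mapped back to words, and a small n_features makes collisions more likely.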
I am scraping approximately 200,000 websites, looking for certain types of media posted on the websites of small businesses. I have a pickled LinearSVC, which I've trained to predict whether a link found on a web page contains media of the type that I'm looking for, and it performs rather well (overall accuracy around 95%). However, I would like the scraper to periodically update the classifier with new data as it scrapes.
So my question is, if I have loaded a pickled sklearn LinearSVC, is there a way to add in new training data without re-training the whole model? Or do I have to load all of the previous training data, add the new data, and train an entirely new model?
You cannot add data to an SVM and achieve the same result as if you had added it to the original training set. You can either retrain on the extended training set, starting from the previous solution (which should be faster), or train on the new data only and diverge completely from the previous solution.
There are only a few models that can do what you would like to achieve here, for example Ridge Regression or Linear Discriminant Analysis (and their kernelized counterparts, Kernel Ridge Regression and Kernel Fisher Discriminant, or their "extreme" counterparts, ELM and EEM), which have the property that new training data can be added "on the fly".
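To illustrate why ridge regression has this property, here is a minimal NumPy sketch. The IncrementalRidge class is hypothetical and omits an intercept. The closed-form solution depends on the data only through the sums X^T X and X^T y, so new rows can be added to those sums and the result is identical to training on everything at once.

```python
import numpy as np


class IncrementalRidge:
    """Ridge regression (no intercept) that keeps only X^T X + alpha*I and X^T y."""

    def __init__(self, n_features, alpha=1.0):
        self.A = alpha * np.eye(n_features)  # running X^T X + alpha * I
        self.b = np.zeros(n_features)        # running X^T y

    def update(self, X, y):
        """Add a batch of rows; the old rows do not need to be kept around."""
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        self.A += X.T @ X
        self.b += X.T @ y

    @property
    def weights(self):
        return np.linalg.solve(self.A, self.b)


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=100)

incremental = IncrementalRidge(n_features=5)
incremental.update(X[:60], y[:60])   # original training data
incremental.update(X[60:], y[60:])   # data that arrives later

batch = IncrementalRidge(n_features=5)
batch.update(X, y)                   # everything at once

print(np.allclose(incremental.weights, batch.weights))  # True
```

An SVM has no such additive summary of its training set, which is why the same trick does not carry over to LinearSVC.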
I have an NLP task which is basically supervised text classification. I tagged a corpus with its POS tags, then I use the different vectorizers that scikit-learn provides in order to feed one of the classification algorithms that scikit-learn provides as well. I also have the labels (categories) of the corpus, which I previously obtained in an unsupervised way.
First I POS-tagged the corpus, then I extracted some bigrams, which have the following structure:
bigram = [[('word','word'),...,('word','word')]]
It seems that I have everything I need to classify (I have already classified a few small examples, but not the whole corpus).
I would like to use the bigrams as features in order to present them to a classification algorithm (multinomial naive Bayes, SVM, etc.).
What would be a standard (pythonic) way to arrange all the text data for classification and to show the results for the classified corpus? I was thinking about using ARFF files and numpy arrays, but I guess that could complicate the task unnecessarily. On the other hand, I was thinking about splitting the data into train and test folders, but I don't see how to set up the labels in the train folder.
Your question is very vague. There are books and courses on the subject you can access.
Have a look at this blog for a start 1 and these courses 2 and 3.
The easiest option is load_files, which expects a directory layout
data/
    positive/    # class label
        1.txt    # arbitrary filename
        2.txt
        ...
    negative/
        1.txt
        2.txt
        ...
    ...
(This isn't really a standard, it's just convenient and customary. Some ML datasets on the web are offered in this format.)
The output of load_files is a dict-like object that holds the documents together with their labels and the label names.
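A minimal sketch of load_files on the layout described above. The temporary directory and the sample texts are made up for the example. Passing encoding="utf-8" makes the documents come back as strings rather than bytes, and shuffle=False keeps them in directory order.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files

# Build a throwaway directory with one subfolder per class label.
root = Path(tempfile.mkdtemp())
samples = {
    "positive": ["what a great film", "really enjoyable"],
    "negative": ["dull and far too long", "a waste of time"],
}
for label, texts in samples.items():
    (root / label).mkdir()
    for i, text in enumerate(texts, start=1):
        (root / label / f"{i}.txt").write_text(text, encoding="utf-8")

dataset = load_files(str(root), encoding="utf-8", shuffle=False)

print(dataset.target_names)       # folder names, used as class labels
print(len(dataset.data))          # number of documents
print(list(dataset.target))       # integer label per document, indexing target_names
```

The data and target attributes can be passed straight to a vectorizer and a classifier.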
1) larsmans has already mentioned a convenient way to arrange and store your data.
2) When using scikit-learn, numpy arrays always make life easier, as they have many features for rearranging your data easily.
3) Training data and testing data are labeled in the same way. So you would usually have something like:
bigramFeatureVector = [(featureVector0, label), (featureVector1, label),..., (featureVectorN, label)]
The proportion of training data to testing data depends heavily on the size of your data. You should learn about n-fold cross validation, because it will resolve most of your doubts, and you will most probably need it for more accurate evaluations. To explain it briefly: for 10-fold cross validation, say you have an array in which all your data is held along with the labels (something like my example above). Then, in a loop that runs 10 times, you leave out one tenth of the data for testing and use the rest for training. Once you understand this, there is no confusion about what training or testing data should look like. They should both look exactly the same.
4) How to visualize your classification results depends on which evaluation measures you would like to use. It's unclear from your question, but let me know if you have further questions.
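The 10-fold loop described above can be sketched in plain Python. The k_fold_splits helper and the toy data are hypothetical. In practice you would shuffle the data first, and scikit-learn offers ready-made cross-validation utilities for the same job.

```python
def k_fold_splits(items, k=10):
    """Yield (train, test) lists; each item lands in the test part exactly once."""
    folds = [items[i::k] for i in range(k)]  # k roughly equal slices
    for i in range(k):
        test = folds[i]
        train = [item for j, fold in enumerate(folds) if j != i for item in fold]
        yield train, test


# Hypothetical labeled data in the (featureVector, label) shape shown above.
data = [({"bigram_id": n}, "pos" if n % 2 else "neg") for n in range(50)]

for train, test in k_fold_splits(data, k=10):
    # Train a classifier on `train` here and evaluate it on `test`.
    print(len(train), len(test))  # 45 5 on every iteration
```

Averaging the evaluation score over the 10 iterations gives a more stable estimate than a single train/test split.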