Text Categorization Test NLTK python - python

I am using the nltk package and have trained a model using Naive Bayes. I saved the model to a file using the pickle package. Now I wonder how I can use this model to test a random text that is not in the dataset, so that the model tells me which category the sentence belongs to.
My idea is this: I have a sentence, "Ronaldo have scored 2 goals against Egypt", I pass it to the model file, and it returns the category "sport".

Just saving the model will not help. You should also save your vectorizer (TfidfVectorizer, CountVectorizer, or whatever you used to fit the training data). You can save it the same way, using pickle. Also save every other object you fitted while pre-processing the training data, such as normalization or scaling models. For the test data, load the pickled objects and repeat the same steps, so that the test data is transformed into the same format as the training data used for model building. Then you will be able to classify.
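A minimal sketch of that save-and-reload flow, using only the standard library. The helpers fit_vocabulary, vectorize, fit_model and predict, and the tiny training set, are hypothetical stand-ins for your real vectorizer and classifier; they are not NLTK or scikit-learn APIs. What it tries to show is that the fitted vocabulary and the model are pickled together, and both are restored before new text is classified.

```python
import pickle


def tokenize(text):
    """Lowercase and split on whitespace (stand-in for real pre-processing)."""
    return text.lower().split()


def fit_vocabulary(texts):
    """Map every training token to a column index."""
    tokens = sorted({tok for text in texts for tok in tokenize(text)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def vectorize(texts, vocabulary):
    """Turn texts into count vectors; tokens unseen in training are ignored."""
    rows = []
    for text in texts:
        row = [0] * len(vocabulary)
        for tok in tokenize(text):
            if tok in vocabulary:
                row[vocabulary[tok]] += 1
        rows.append(row)
    return rows


def fit_model(rows, labels):
    """Toy model: sum the count vectors of each label."""
    model = {}
    for row, label in zip(rows, labels):
        totals = model.setdefault(label, [0] * len(row))
        for idx, count in enumerate(row):
            totals[idx] += count
    return model


def predict(rows, model):
    """Pick the label whose summed counts overlap most with each row."""
    def score(label, row):
        return sum(w * c for w, c in zip(model[label], row))
    return [max(model, key=lambda label: score(label, row)) for row in rows]


# Training time: fit the vocabulary and the model, then pickle both together.
train_texts = [
    "Ronaldo scored two goals",
    "the striker scored a late goal",
    "parliament passed the budget",
    "the minister announced new taxes",
]
train_labels = ["sport", "sport", "politics", "politics"]
vocabulary = fit_vocabulary(train_texts)
model = fit_model(vectorize(train_texts, vocabulary), train_labels)
saved = pickle.dumps({"vocabulary": vocabulary, "model": model})

# Test time: restore both objects and repeat the same transformation.
restored = pickle.loads(saved)
new_rows = vectorize(["Ronaldo have scored 2 goals against Egypt"],
                     restored["vocabulary"])
predicted = predict(new_rows, restored["model"])
print(predicted)
```

With a real file you would call pickle.dump and pickle.load on an open file instead of the in-memory dumps and loads used here.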

Related

IsolationForest is always predicting 1

I am working on a project to detect out-of-domain text input with the help of IsolationForest and tf-idf features. Here is a summary of what I do:
TRAINING
On tf-idf:
Fit and transform the in-domain dataset using CountVectorizer().
Fit a TfidfTransformer() on the output of this CountVectorizer() and save the transformer (to use it at test time).
Then transform the training data using the TfidfTransformer().
Save both the CountVectorizer()'s vocabulary_ and the TfidfTransformer() object using pickle for test-time usage.
On IsolationForest:
Collect the transformed in-domain dataset and train an IsolationForest() novelty detector.
Save the model using joblib.
TESTING
Load all of the saved models.
Get the tf-idf features of the current out-of-domain input text by replicating all the training steps (transformations only).
Predict whether it is out-of-domain or not using the saved IsolationForest model.
But I have found that even though the tf-idf features are quite different for each of my test inputs, the IsolationForest always predicts 1.
What is probably going wrong?
NB: I also tried feeding dummy vectors to the IsolationForest model, mimicking the output of the tf-idf transformer, to check whether the tf-idf module is responsible for this. No matter which random vector I provide, I always get 1 as output from IsolationForest. Also note that tf-idf has a lot of features (tokens); in my case the count is 48015.

Is it possible to fine tune FastText models

I'm working on a project for text similarity using FastText. The basic example I have found to train a model is:
from gensim.models import FastText
model = FastText(tokens, size=100, window=3, min_count=1, iter=10, sorted_vocab=1)
As I understand it, since I'm specifying the vector and ngram size, the model is being trained from scratch here, and if the dataset is small I would expect great results.
The other option I have found is to load the original Wikipedia model which is a huge file:
from gensim.models.wrappers import FastText
model = FastText.load_fasttext_format('wiki.simple')
My question is, can I load the Wikipedia or any other model, and fine tune it with my dataset?
If you have a labelled dataset, then you should be able to fine-tune to it. This GitHub issue explains that you want to use the pretrainedVectors option. You would start with the Wikipedia pretrained vectors, then train on your dataset. It seems that gensim can do this, but according to this GH issue, there have been some bugs.

How to evaluate the PMML file using python

I have a PMML file generated by Python that contains a random forest classifier, and I need to test the model again in Python. Please let me know how to import the PMML file back into Python so that I can test the model on a new dataset.
I tried the titanium package, but it failed with an error because of a PMML version issue.
The expected output is the model's predicted value, so that I can verify the accuracy of the model.
You could use PyPMML to load PMML in Python, then make predictions on new dataset, e.g.
from pypmml import Model
model = Model.fromFile('the/pmml/file/path')
result = model.predict(data)
The data could be a dict, a JSON string, or a pandas Series or DataFrame.

How to load a pre-trained Word2vec MODEL File and reuse it?

I want to use a pre-trained word2vec model, but I don't know how to load it in python.
This file is a MODEL file (703 MB).
It can be downloaded here:
http://devmount.github.io/GermanWordEmbeddings/
Just for loading:
import gensim
# Load pre-trained Word2Vec model.
model = gensim.models.Word2Vec.load("modelName.model")
Now you can train the model as usual. Also, if you want to be able to save it and retrain it multiple times, here's what you should do:
# model.train(...)  # insert proper parameters here
"""
If you don't plan to train the model any further, calling
init_sims will make the model much more memory-efficient
If `replace` is set, forget the original vectors and only keep the normalized
ones = saves lots of memory!
replace=True if you want to reuse the model
"""
model.init_sims(replace=True)
# save the model for later use
# for loading, call Word2Vec.load()
model.save("modelName.model")
Use KeyedVectors to load the pre-trained model.
from gensim.models import KeyedVectors
word2vec_path = 'path/GoogleNews-vectors-negative300.bin.gz'
w2v_model = KeyedVectors.load_word2vec_format(word2vec_path, binary=True)
I used the same model in my code and since I couldn't load it, I asked the author about it. His answer was that the model has to be loaded in binary format:
gensim.models.KeyedVectors.load_word2vec_format(w2v_path, binary=True)
This worked for me, and I think it should work for you, too.
I ran into the same issue and downloaded GoogleNews-vectors-negative300 from Kaggle. I saved and extracted the file on my desktop. Then I ran this code in Python and it worked well:
from gensim.models import KeyedVectors
# The .bin file is in binary word2vec format, so binary=True is required.
model = KeyedVectors.load_word2vec_format(r'C:/Users/juana/descktop/archive/GoogleNews-vectors-negative300.bin', binary=True)

How to incrementally train an nltk classifier

I am working on a project to classify snippets of text using the Python nltk module and the NaiveBayesClassifier. I am able to train on corpus data and classify another set of data, but I would like to feed additional training information into the classifier after the initial training.
If I'm not mistaken, there doesn't appear to be a way to do this, in that the NaiveBayesClassifier.train method takes a complete set of training data. Is there a way to add to the training data without feeding in the original featureset?
I'm open to suggestions including other classifiers that can accept new training data over time.
There are 2 options that I know of:
1) Periodically retrain the classifier on the new data. You'd accumulate new training data in a corpus (that already contains the original training data), then every few hours, retrain and reload the classifier. This is probably the simplest solution.
2) Externalize the internal model, then update it manually. The NaiveBayesClassifier can be created directly by giving it a label_probdist and a feature_probdist. You could create these separately, pass them in to a NaiveBayesClassifier, then update them whenever new data comes in. The classifier would use this new data immediately. You'd have to look at the train method for details on how to update the probability distributions.
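The idea behind option 2 can be sketched without NLTK: keep the raw counts, add to them when new data arrives, and compute probabilities only at classification time. The class below is a hypothetical standard-library illustration with add-one smoothing; it does not use NLTK's label_probdist or feature_probdist objects and is not how NaiveBayesClassifier.train is implemented.

```python
import math
from collections import Counter, defaultdict


class IncrementalNaiveBayes:
    """Word-count Naive Bayes whose counts can grow over time."""

    def __init__(self):
        self.label_counts = Counter()            # documents seen per label
        self.word_counts = defaultdict(Counter)  # word counts per label
        self.vocabulary = set()

    def update(self, labeled_texts):
        """Add (text, label) pairs to the counts without retraining."""
        for text, label in labeled_texts:
            self.label_counts[label] += 1
            for word in text.lower().split():
                self.word_counts[label][word] += 1
                self.vocabulary.add(word)

    def classify(self, text):
        """Return the label with the highest log-probability."""
        total_docs = sum(self.label_counts.values())
        vocab_size = len(self.vocabulary)
        words = text.lower().split()
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(doc_count / total_docs)
            for word in words:
                # Add-one smoothing keeps unseen words from zeroing the score.
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


clf = IncrementalNaiveBayes()
clf.update([("training test totally tubular", "t")])
first = clf.classify("tubular test")

# New data arrives later; only the counts change.
clf.update([("super speeding special sport", "s")])
second = clf.classify("super special")
third = clf.classify("tubular test")
print(first, second, third)
```

Because the probabilities are derived from the counts on every call, the classifier reflects new data as soon as update() returns, which is the behaviour option 2 describes.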
I'm just learning NLTK, so please correct me if I'm wrong. This is using the Python 3 branch of NLTK, which might be incompatible.
There is an update() method on the NaiveBayesClassifier instance (here imported from textblob.classifiers), which appears to add to the training data:
from textblob.classifiers import NaiveBayesClassifier
train = [
('training test totally tubular', 't'),
]
cl = NaiveBayesClassifier(train)
cl.update([('super speeding special sport', 's')])
print('t', cl.classify('tubular test'))
print('s', cl.classify('super special'))
This prints out:
t t
s s
As Jacob said, the second method is the right way.
Hopefully someone will write the code for it.
See:
https://baali.wordpress.com/2012/01/25/incrementally-training-nltk-classifier/
