I'm new to Python and have to complete a natural language processing task.
Using a Kaggle dataset, a sentiment classifier should be implemented in Python.
For this I'm using a DataFrame and LogisticRegression, as described in this article, and everything works fine.
Now I want to know whether it is possible to classify another string which is not in the dataset, so that I can experiment with the classifier interactively.
Is this possible?
Thank you!
You will have to manually run all the preprocessing on your new data, then predict.
That is:
first run the data cleaning step and the other functions you've called which edit the data,
then run the "create a bag of words" part, and only
then use the fitted LR model to predict on this (preprocessed) data.
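For illustration, a minimal sketch of those three steps, assuming the notebook already holds a fitted CountVectorizer called vectorizer, a fitted LogisticRegression called model, and a clean_text function that applies the same cleaning as on the training data (all three names are placeholders for your own objects):

new_string = "I really enjoyed this movie"

cleaned = clean_text(new_string)            # same cleaning as on the training data
features = vectorizer.transform([cleaned])  # reuse the *fitted* vectorizer, do not refit
prediction = model.predict(features)        # predict with the already-trained model
print(prediction)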
Yes, this is possible.
To make this more modular, you can create a function and pass the input string to that function for preprocessing. This reduces code redundancy, and you can pass the training data through the same function as well.
Once that is done, you need to create the bag of words for the test sentence.
Then you can use the predict function of the trained LR model to predict the output.
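To illustrate the modular idea, a minimal end-to-end sketch with scikit-learn; the cleaning inside preprocess() is a stand-in for whatever the original notebook does:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def preprocess(texts):
    """Apply the same cleaning to training data and to new strings."""
    return [t.lower().strip() for t in texts]  # placeholder cleaning

train_texts = ["great movie", "terrible plot"]
train_labels = [1, 0]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(preprocess(train_texts))
model = LogisticRegression().fit(X_train, train_labels)

# A new, unseen string goes through the same function and the fitted vectorizer.
X_new = vectorizer.transform(preprocess(["what a great film"]))
print(model.predict(X_new))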
Thank You.
Suppose that I have preprocessed some text data: removed stopwords, URLs, and so on.
How should I structure these cleaned data in order to make them usable for a classifier like a neural network? Is there a preferred structure, or a rule of thumb (bag of words, tf-idf, or anything else)? Also, can you suggest a package which will do all of this automatically in Python?
Now I train the model, and things work properly.
The model performs well on the test set too.
How do I have to treat unseen data?
When I decide to use the model in a real-life project it will encounter new data: do I have to store the structure (like the fitted tf-idf vectorizer) I used for training and apply it to this new data?
Also, let's suppose that the word "hello" did not appear in the training/validation/test data, so it has no representation, while a real-life sentence I have to classify contains it.
How do I cope with this problem?
Thanks for all the clarifications.
What you can do is make a class and inside it define functions like:
import dataset
data cleaning
data preprocessing (bag of words, tf-idf)
model building
predictions
You can follow the code from the link below to get an understanding:
https://github.com/azeem110201/lifecycledatascienceproject
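For illustration, a minimal sketch of such a class, assuming scikit-learn; note that the fitted TfidfVectorizer is stored and reused at prediction time, which answers the "hello" question: words unseen at training time get no column and are simply ignored.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

class SentimentPipeline:
    """Bundles cleaning, vectorization, training, and prediction."""

    def __init__(self):
        self.vectorizer = TfidfVectorizer()
        self.model = LogisticRegression()

    def clean(self, texts):
        # Placeholder for stopword/URL removal and so on.
        return [t.lower() for t in texts]

    def fit(self, texts, labels):
        X = self.vectorizer.fit_transform(self.clean(texts))
        self.model.fit(X, labels)
        return self

    def predict(self, texts):
        # transform() reuses the stored vocabulary; unseen words
        # (e.g. "hello") are silently dropped.
        X = self.vectorizer.transform(self.clean(texts))
        return self.model.predict(X)

pipe = SentimentPipeline().fit(["good film", "bad film"], [1, 0])
print(pipe.predict(["hello, good film"]))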
First of all, thank you for taking the time to read my question. I have built a machine learning model with a dataset (the famous one about cancer) and I want to know how to predict the results for new samples. I think that I have to keep retraining the model (often) so that my predictions stay accurate, but for predicting new data, is it as simple as replacing the test data (the y variable) with the new data?
Thank you so much for taking your time; any help would be appreciated.
You are probably using the SVC class from sklearn.svm.
After fitting the model with the fit method you can predict new data with the predict method. See here.
By the way: For Support Vector Machines you don't have to fit your data multiple times. Maybe you are confusing that with neural networks.
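For illustration, a minimal sketch using scikit-learn's built-in breast cancer dataset (that this is the exact dataset in the question is an assumption):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC().fit(X_train, y_train)   # fit once on the training data

# "New" data just has to have the same 30 features per sample.
print(clf.predict(X_test[:5]))      # predict on unseen samples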
If you mean that you are changing the number of features in your test data, then you cannot do that:
the number of features has to be the same in the training and test sets.
However, if your test data contains some class of a categorical variable which was not in the training data, then it is better to train your model with one extra category such as "NONE" or "Others" for each of your features.
This way, when you encounter a new class of a categorical variable in your test data, you map it to "NONE" or "Others" and run prediction on your trained model, as in the sketch below.
This way it will not break your model.
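A minimal sketch of that mapping step, using pandas (the column and category names are made up):

import pandas as pd

train = pd.DataFrame({"color": ["red", "blue", "red"]})
test = pd.DataFrame({"color": ["blue", "green"]})   # "green" was never seen

known = set(train["color"])
# Map unseen categories to the catch-all "Others" bucket before encoding.
test["color"] = test["color"].where(test["color"].isin(known), "Others")
print(test)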
I hope I understand your question correctly.
Currently I am looking for a way to filter specific classes out of my training dataset (MNIST) to train a neural network on different constellations, e.g. train a network only on classes 4, 5, 6, then train it on all of 0-9, and evaluate the results on the test dataset.
I'd like to choose which classes end up in my training dataset via an argument parser on the console, so that I can then split the data into mini-batches. I think I could do it by sorting on the labels, but I am kind of stuck at the moment and would appreciate any tip!
Greetings,
Alex
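For reference, a minimal sketch of the label-based filtering idea with an argument parser, assuming the images and labels are plain NumPy arrays (the variable names and the stand-in data are made up):

import argparse
import numpy as np

parser = argparse.ArgumentParser()
parser.add_argument("--classes", type=int, nargs="+", default=list(range(10)),
                    help="which MNIST classes to keep, e.g. --classes 4 5 6")
args = parser.parse_args()

# x_train: (N, 28, 28) images, y_train: (N,) integer labels, loaded elsewhere.
x_train = np.random.rand(100, 28, 28)   # stand-in data for the sketch
y_train = np.random.randint(0, 10, 100)

mask = np.isin(y_train, args.classes)   # boolean mask over the labels
x_sub, y_sub = x_train[mask], y_train[mask]
print(x_sub.shape, sorted(set(y_sub)))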
Found the answer, I guess... one_hot=True transformed the scalar labels into one-hot vectors :) Thanks for your time anyway!
I'm trying to apply the SVM from scikit-learn to classify the tweets I collected.
So, there will be two categories; call them A and B.
For now, I have all the tweets categorized in two text files, 'A.txt' and 'B.txt'.
However, I'm not sure what type of data input the scikit-learn SVM is asking for.
I have a dictionary with the labels (A and B) as its keys and, as values, dictionaries of features (unigrams) and their frequencies.
Sorry, I'm really new to machine learning and not sure what I should do to make the SVM work.
I found that the SVM uses numpy.ndarray as the type of its data input. Do I need to create one based on my own data?
Should it be something like this?

Label  Feature   Frequency
A      'book'    54
B      'movies'  32

Any help is appreciated.
Have a look at the documentation on text feature extraction.
Also have a look at the text classification example.
There is also a tutorial here:
http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
In particular, don't focus too much on SVM models (especially not sklearn.svm.SVC, which is more interesting for kernel models and hence not for text classification): a simple Perceptron, LogisticRegression, or Bernoulli naive Bayes model might work just as well while being much faster to train.
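For illustration, a minimal sketch along the lines of the linked tutorial, feeding raw tweet strings into a vectorizer and one of the suggested linear models (the example tweets are made up):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

tweets = ["loved the book", "the movie was dull", "great book", "boring movies"]
labels = ["A", "B", "A", "B"]

# The pipeline turns raw strings into a numpy-compatible sparse matrix
# and fits the classifier on it in one step.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(tweets, labels)
print(clf.predict(["what a book"]))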
I am working on a project to classify snippets of text using the python nltk module and the naivebayes classifier. I am able to train on corpus data and classify another set of data but would like to feed additional training information into the classifier after initial training.
If I'm not mistaken, there doesn't appear to be a way to do this, in that the NaiveBayesClassifier.train method takes a complete set of training data. Is there a way to add to the training data without feeding in the original featureset?
I'm open to suggestions including other classifiers that can accept new training data over time.
There are two options that I know of:
1) Periodically retrain the classifier on the new data. You'd accumulate new training data in a corpus (that already contains the original training data), then every few hours retrain and reload the classifier. This is probably the simplest solution (a sketch is given after these options).
2) Externalize the internal model, then update it manually. The NaiveBayesClassifier can be created directly by giving it a label_probdist and a feature_probdist. You could create these separately, pass them in to a NaiveBayesClassifier, then update them whenever new data comes in. The classifier would use this new data immediately. You'd have to look at the train method for details on how to update the probability distributions.
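A minimal sketch of option 1, assuming simple bag-of-words featuresets (the features() helper is made up for the example):

import nltk

def features(text):
    """Trivial bag-of-words featureset; replace with your own extractor."""
    return {word: True for word in text.lower().split()}

# The corpus accumulates everything seen so far, including the original data.
corpus = [(features("great fun"), "pos"), (features("awful mess"), "neg")]
classifier = nltk.NaiveBayesClassifier.train(corpus)

# Later, when new labeled data arrives, extend the corpus and retrain.
corpus.append((features("really great"), "pos"))
classifier = nltk.NaiveBayesClassifier.train(corpus)
print(classifier.classify(features("great")))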
I'm just learning NLTK, so please correct me if I'm wrong. This uses textblob's wrapper around NLTK, on the Python 3 branch, which might be incompatible.
There is an update() method on the textblob NaiveBayesClassifier instance, which appears to add to the training data:
from textblob.classifiers import NaiveBayesClassifier

# Train on one labeled example, then add a second one via update().
train = [
    ('training test totally tubular', 't'),
]
cl = NaiveBayesClassifier(train)
cl.update([('super speeding special sport', 's')])

# Print the expected label next to the classifier's prediction.
print('t', cl.classify('tubular test'))
print('s', cl.classify('super special'))
This prints out:
t t
s s
As Jacob said, the second method is the right way, and hopefully someone will write the code for it. For a worked example, have a look at:
https://baali.wordpress.com/2012/01/25/incrementally-training-nltk-classifier/