Prepare data for text classification using Scikit Learn SVM - python

I'm trying to apply SVM from Scikit learn to classify the tweets I collected.
So there will be two categories; call them A and B.
For now, I have all the tweets categorized into two text files, 'A.txt' and 'B.txt'.
However, I'm not sure what type of data inputs the Scikit Learn SVM is asking for.
I have a dictionary with labels (A and B) as its keys and a dictionary of features (unigrams) and their frequencies as values.
Sorry, I'm really new to machine learning and not sure what I should do to get the SVM to work.
I also found that the SVM takes a numpy.ndarray as its data input. Do I need to create one from my own data?
Should it be something like this?
Label    feature    frequency
A        'book'     54
B        'movies'   32
Any help is appreciated.

Have a look at the documentation on text feature extraction.
Also have a look at the text classification example.
There is also a tutorial here:
http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
In particular, don't focus too much on SVM models (especially not sklearn.svm.SVC, which is mostly interesting for kernel methods and hence not for text classification): a simple Perceptron, LogisticRegression, or Bernoulli naive Bayes model might work just as well while being much faster to train.
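For a concrete starting point, here is a minimal sketch of that advice. It assumes 'A.txt' and 'B.txt' contain one tweet per line, which is my assumption, not something stated in the question:

# Minimal sketch: read the two files (assumed: one tweet per line), let a
# vectorizer build the numeric feature matrix, and train a linear model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts, labels = [], []
for fname, label in [('A.txt', 'A'), ('B.txt', 'B')]:
    with open(fname, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:
                texts.append(line)
                labels.append(label)

vectorizer = TfidfVectorizer()        # turns raw text into a sparse feature matrix
X = vectorizer.fit_transform(texts)   # shape: (n_tweets, n_unigrams)

clf = LogisticRegression()
clf.fit(X, labels)

# Classify a new tweet using the same fitted vectorizer.
print(clf.predict(vectorizer.transform(["I loved this book"])))

The vectorizer builds the numeric array for you, so there is no need to hand-assemble a label/feature/frequency table.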

Related

scikit-learn LogisticRegression Classify another value

I'm new to Python and have to do a natural language processing task.
Using a Kaggle dataset, a sentiment classifier should be implemented in Python.
For this I'm using a dataframe and LogisticRegression, as described in this article, and everything works fine.
Now I want to know whether it is possible to classify another string which is not in the dataset, so that I can experiment with the classifier interactively.
Is this possible?
Thank you!
You will have to manually run all the preprocessing on your new data, then predict.
That is:
first run the (Data Cleaning) step and the other functions you called which edit the data,
then run the (Create a bag of words) part, and only
then use the fitted LR model to predict on this (preprocessed) data.
Yes, this is possible.
To make this more modular, you can create a function and pass the input string to that function for preprocessing. This reduces code redundancy. You can also pass the training data directly to that function for its preprocessing.
Once that is done, you need to create the bag of words for the test sentence.
Then you can use the predict function of the trained LR model to get the output.
Thank You.
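A minimal sketch of what both answers describe (the clean_text helper and the toy data below are hypothetical placeholders, not code from the linked article):

# Hypothetical sketch: clean_text stands in for whatever cleaning steps
# (lowercasing, removing punctuation, etc.) were applied to the training data.
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def clean_text(text):
    text = re.sub(r'[^a-zA-Z ]', ' ', text)   # same cleaning as the training data
    return text.lower()

# During training (fit once, keep both fitted objects around):
train_texts = ["I loved the movie", "Worst film ever", "Great acting", "Terrible plot"]
train_labels = [1, 0, 1, 0]
vectorizer = CountVectorizer()                # the "Create a bag of words" step
X_train = vectorizer.fit_transform(clean_text(t) for t in train_texts)
model = LogisticRegression().fit(X_train, train_labels)

# For a new, interactive string: preprocess, transform (NOT fit), then predict.
new_string = "An absolutely great movie"
X_new = vectorizer.transform([clean_text(new_string)])
print(model.predict(X_new))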

How do I deal with preprocessing and with unseen data in a NLP problem?

Suppose that I have preprocessed some text data and removed stopwords, URLs, and so on.
How should I structure this cleaned data to make it usable by a classifier such as a neural network? Is there a preferred structure or rule of thumb? (Bag of words, tf-idf, or something else?) Also, can you suggest a package which will automatically do all this work in Python?
Now I train the model, and things work properly.
The model performs well on the test set too.
How do I have to treat unseen data?
When I decide to implement the model in a real-life project, it will encounter new data: do I have to store the structure (like the tf-idf structure) I used for training and apply it to this new data?
Also, suppose that the word "hello" was not in the training/validation/test data, so it has no representation, but a real-life sentence I have to classify contains the word "hello".
How do I cope with this problem?
Thanks for all the clarifications.
What you can do is make a class and inside it define functions like:
import dataset
data cleaning
data preprocessing (BOW, TfIdf)
model building
predictions
You can follow the code at the link below to get an understanding:
https://github.com/azeem110201/lifecycledatascienceproject
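On the concrete questions about storing the tf-idf structure and about unseen words, a minimal sketch (file names below are placeholders): fit the vectorizer on the training data only, persist it alongside the model, and call transform() on new data. Tokens that were never seen during training (like "hello") are simply ignored by transform():

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train_texts = ["good morning world", "spam spam spam"]
train_labels = [0, 1]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
model = LogisticRegression().fit(X_train, train_labels)

# Store the fitted tf-idf "structure" along with the model.
joblib.dump(vectorizer, "vectorizer.joblib")
joblib.dump(model, "model.joblib")

# In production: load both and only call transform(), never fit().
vectorizer = joblib.load("vectorizer.joblib")
model = joblib.load("model.joblib")
# A word like "hello" that was never seen during training is dropped by
# transform(); the rest of the sentence is still used for the prediction.
X_new = vectorizer.transform(["hello good world"])
print(model.predict(X_new))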

How to classify unlabelled data?

I am new to machine learning. I am trying to build a classifier that classifies text as containing a URL or not containing a URL. The data is not labelled; I just have textual data. I don't know how to proceed with it. Any help or examples are appreciated.
Since it's text, you can use the bag of words technique to create vectors.
You can use cosine similarity to cluster texts of a common type.
Then use a classifier, which will depend on the number of clusters.
This way you have a labeled training set.
If you have two clusters, a binary classifier like logistic regression would work.
If you have multiple classes, you need to train a model based on multinomial logistic regression,
or train multiple logistic models using the One vs Rest technique.
Lastly, you can test your model using k-fold cross validation.
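A rough sketch of that pipeline (the concrete calls and toy texts are my own assumptions, not code from the answer). Note that KMeans uses Euclidean distance, which on L2-normalized tf-idf vectors behaves much like cosine-based clustering:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

texts = [
    "check out http://example.com for details",
    "visit www.example.org right now",
    "had a great day at the beach",
    "coffee with friends this morning",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)   # tf-idf rows are L2-normalized

# Cluster to obtain pseudo-labels, then train a classifier on them.
pseudo_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

clf = LogisticRegression().fit(X, pseudo_labels)   # binary case: logistic regression
print(clf.predict(vectorizer.transform(["see http://example.net"])))

# With enough data, evaluate via k-fold cross validation, e.g.
# sklearn.model_selection.cross_val_score(clf, X, pseudo_labels, cv=10)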
You cannot train a classifier with unlabeled data. You need labeled examples. There are services that will label it for you, but it might be simpler for you to do it by hand (I assume you can go through one per minute).
Stack Overflow is for programming; this question would be better suited to, say, Cross Validated. Maybe they'll have better suggestions than me.
After you've labeled the data, there's a lot of info on the web on this subject - for example, this blog is a good place to start if you already have some grip on the issue.
Good luck!

nlp multilabel classification tf vs tfidf

I am trying to solve an NLP multilabel classification problem. I have a huge amount of documents that should be classified into 29 categories.
My approach to the problem, after cleaning up the text (stop word removal, tokenizing, etc.), is the following:
To create the feature matrix I looked at the frequency distribution of the terms of each document, then created a table of these terms (with duplicates removed), and then calculated the term frequency of each word in its corresponding text (tf). So eventually I ended up with around 1000 terms and their respective frequencies in each document.
I then used SelectKBest to narrow them down to around 490, and after scaling them I used OneVsRestClassifier(SVC) to do the classification.
I am getting an F1 score of around 0.58, but it is not improving at all, and I need to get to 0.62.
Am I handling the problem correctly?
Do I need to use tfidf vectorizer instead of tf, and how?
I am very new to NLP and I am not sure at all what to do next and how to improve the score.
Any help in this subject is priceless.
Thanks
The tf method can give more importance to common words than necessary; instead, use the tf-idf method, which gives importance to words that are rare and unique to a particular document in the dataset.
Also, rather than selecting the k best features up front, first train on the whole set of features and then use feature importances to pick the best ones.
You can also try tree classifiers or XGBoost models for a better fit, but SVC is also a very good classifier.
Try using naive Bayes as the minimum standard for the F1 score, and try to improve your results with other classifiers with the help of grid search.
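A minimal sketch of that advice (the toy documents and parameter values below are placeholders for the real 29-category corpus): swap the hand-built tf table for a TfidfVectorizer inside a Pipeline, keep a SelectKBest step, and compare a naive Bayes baseline against OneVsRestClassifier over an SVM using the F1 score:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score

docs = [
    "stock markets fell sharply today",
    "the central bank raised interest rates",
    "the team won the championship game",
    "star striker scores twice in final",
]
labels = ["finance", "finance", "sports", "sports"]   # 29 categories in the real task

svc_model = Pipeline([
    ("tfidf", TfidfVectorizer()),              # tf-idf instead of raw tf
    ("select", SelectKBest(chi2, k=10)),       # ~490 in the question
    ("clf", OneVsRestClassifier(LinearSVC())),
])

nb_baseline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),                  # the "minimum standard" baseline
])

for name, model in [("OvR SVC", svc_model), ("Naive Bayes", nb_baseline)]:
    model.fit(docs, labels)
    preds = model.predict(docs)                # use a held-out set / CV in practice
    print(name, f1_score(labels, preds, average="macro"))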

What is the standard way in scikit-learn to arrange textual data for text classification?

I have an NLP task which is basically supervised text classification. I tagged a corpus with its POS tags, then I use the different vectorizers that scikit-learn provides in order to feed some classification algorithm that scikit-learn provides as well. I also have the labels (categories) of the corpus, which I previously obtained in an unsupervised way.
First I POS-tagged the corpus, then I obtained some different bigrams with the following structure:
bigram = [[('word','word'),...,('word','word')]]
Apparently I have everything I need to classify (I already classified some small examples, but not the whole corpus).
I would like to use the bigrams as features in order to present them to a classification algorithm (multinomial naive Bayes, SVM, etc.).
What would be a standard (pythonic) way to arrange all the text data for classification and to show the results of the classified corpus? I was thinking about using ARFF files and numpy arrays, but I guess that could complicate the task unnecessarily. On the other hand, I was thinking about splitting the data into train and test folders, but I don't see how to set up the labels in the train folder.
Your question is very vague. There are books and courses on the subject you can access.
Have a look at this blog for a start [1] and these courses [2] and [3].
The easiest option is load_files, which expects a directory layout
data/
    positive/       # class label
        1.txt       # arbitrary filename
        2.txt
        ...
    negative/
        1.txt
        2.txt
        ...
    ...
(This isn't really a standard, it's just convenient and customary. Some ML datasets on the web are offered in this format.)
The output of load_files is a Bunch (a dict-like object) holding the documents and their labels.
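A short sketch of reading that layout (assuming the directory is named data/ as above) and turning it into a feature matrix:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer

bunch = load_files("data/", encoding="utf-8")   # Bunch with .data, .target, .target_names
X = CountVectorizer().fit_transform(bunch.data) # one row per text file
y = bunch.target                                # integer label per file
print(bunch.target_names, X.shape)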
1) larsmans has already mentioned a convenient way to arrange and store your data.
2) When using scikit-learn, numpy arrays always make life easier, as they have many features for changing the arrangement of your data easily.
3) Training data and testing data are labeled in the same way. So you would usually have something like:
bigramFeatureVector = [(featureVector0, label), (featureVector1, label),..., (featureVectorN, label)]
The proportion of training data to testing data highly depends on the size of your data. You should indeed learn about n-fold cross validation, because it will resolve all your doubts, and most probably you will have to use it for more accurate evaluations.
To briefly explain it: for 10-fold cross validation, you have an array in which all your data along with the labels are held (something like my example above). Then, in a loop running 10 times, you leave one tenth of the data out for testing and use the rest for training. If you understand this, you will have no confusion about how training or testing data should look: they both look exactly the same.
4) How you visualize your classification results depends on what evaluation measures you would like to use. That is unclear from your question, but let me know if you have further questions.
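A small sketch of the 10-fold procedure described above (the random arrays are stand-ins for your bigram feature vectors and labels):

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import MultinomialNB

X = np.random.rand(100, 20)            # stand-in for the bigram feature vectors
y = np.random.randint(0, 2, size=100)  # stand-in for the labels

scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=10).split(X, y):
    clf = MultinomialNB().fit(X[train_idx], y[train_idx])   # train on 9/10 of the data
    scores.append(clf.score(X[test_idx], y[test_idx]))      # test on the held-out 1/10
print(np.mean(scores))

sklearn.model_selection.cross_val_score does the same loop in a single call.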
