I'm using an SVM to predict a label from the title of a book, but I want to give more weight to some predefined features. For example, if the title contains words like fairy or Alice, I want to label it as a children's book. I'm using a word n-gram SVM. Please suggest how to achieve this using sklearn.
You can add one more feature to your training data: if the name of the book contains one of your predefined words, set it to one, otherwise zero.
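For example, with sklearn you can append such an indicator column to the n-gram matrix; a minimal sketch (the cue words, titles, and labels below are made up for illustration):

import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

cue_words = {'fairy', 'alice'}                      # hypothetical predefined words
titles = ["Alice in Wonderland", "The Art of War"]  # toy data
labels = ["children", "other"]

# word n-gram features
vec = CountVectorizer(ngram_range=(1, 2))
X_ngrams = vec.fit_transform(titles)

# extra binary column: 1 if the title contains any cue word, else 0
cue_col = np.array([[int(any(w in t.lower() for w in cue_words))] for t in titles])

# scale the column (e.g. by 5) if you want the cue feature to carry more weight
X = hstack([X_ngrams, 5 * cue_col]).tocsr()
clf = LinearSVC().fit(X, labels)

At prediction time, build the same extra column for the new titles before stacking.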
Related
I have used TF-IDF to extract features from a sentiment-annotated dataset and used those features to train an ML model with the random forest algorithm. Is it possible for me to now input a sentence into the model and have it return what it believes the sentiment is?
I would need to take that sentence and convert it to TF-IDF values for my model to understand it.
Do I need to recalculate TF-IDF values for the entire dataset in order to get the values for this new sentence?
Does anyone know a way of doing this, preferably in Python?
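For what it's worth, the usual sklearn pattern avoids any recalculation: TfidfVectorizer learns the vocabulary and IDF weights when fitted on the training set, and a new sentence only needs transform(). A minimal sketch with made-up data:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ["great movie", "terrible plot", "loved it", "awful acting"]  # hypothetical
train_labels = [1, 0, 1, 0]

vec = TfidfVectorizer()
X_train = vec.fit_transform(train_texts)   # IDF statistics learned here, once
clf = RandomForestClassifier().fit(X_train, train_labels)

X_new = vec.transform(["the plot was great"])   # same vocabulary/IDF, no refit
print(clf.predict(X_new))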
I have a list of twitter users (screen_names) and I need to categorise them into 7 pre-defined categories - Education, Art, Sports, Business, Politics, Automobiles, Technology - based on their interest area.
I have extracted the last 100 tweets of each user in Python and created a corpus per user after cleaning the tweets.
As mentioned in Tweet classification into multiple categories on (Unsupervised data/tweets):
I am trying to generate dictionaries of common words under each category so that I can use it for classification.
Is there a method to generate these dictionaries for a custom set of words automatically?
Then I can use these for classifying the Twitter data with a tf-idf classifier and get the degree of correspondence of each tweet to each of the categories. The highest value gives the most probable category of the tweet (a sketch of this scoring follows the example I/O below).
But since the categorisation is based on these pre-generated dictionaries, I am looking for a way to generate them automatically for a custom list of categories.
Sample dictionaries :
Education - ['book','teacher','student'....]
Automobiles - ['car','auto','expo',....]
Example I/O:
**Input:**
UserA - "students visited share learning experience eye opening
article important preserve linaugural workshop students teachers
others know coding like know alphabets vision driving codeindia office
initiative get students tagging wrong people apologies apologies real
people work..."
.
.
UserN - <another corpus of cleaned tweets>
**Expected output:**
UserA - Education (61%)
UserN - Automobiles (43%)
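For concreteness, a minimal sketch of the proposed tf-idf correspondence scoring (the dictionaries, user corpus, and categories below are toy placeholders):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

categories = {                       # toy versions of the pre-generated dictionaries
    'Education': 'book teacher student school learning',
    'Automobiles': 'car auto expo engine race',
}
users = {'UserA': 'students visited share learning experience teachers coding'}

cat_names = list(categories)
vec = TfidfVectorizer()
M = vec.fit_transform([categories[c] for c in cat_names] + list(users.values()))

# cosine similarity of each user corpus against each category dictionary
sims = cosine_similarity(M[len(cat_names):], M[:len(cat_names)])
for user, row in zip(users, sims):
    print(user, cat_names[row.argmax()], '(%d%%)' % round(100 * row.max()))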
TL;DR
Labels are necessary for supervised machine learning. And if you don't have training data that contains Xs (input texts) and Y (output labels) then (i) supervised learning might not be what you're looking for or (ii) you have to create a dataset with texts and their corresponding labels.
In Long
Let's break it down and reflect on what you're looking for.
I have a list of twitter users (screen_names) and I need to categorise them into 7 pre-defined categories - Education, Art, Sports, Business, Politics, Automobiles, Technology
So your ultimate task is to label tweets into 7 categories.
I have extracted last 100 tweets of the users in Python and created a corpus for each user after cleaning the tweets.
100 data points is definitely insufficient to do anything if you want to train a supervised machine learning model from scratch.
Another thing is the definition of corpus. A corpus is a body of text, so it's not wrong to call any list of strings a corpus. However, to do any supervised training, each text should come with the corresponding label(s).
But I see some people do unsupervised classification without any labels!
Now, that's an oxymoron =)
Unsupervised Classification
Yes, there is "unsupervised learning", which often means learning a representation of the inputs; generally, the representation of the inputs is used to (i) generate or (ii) sample.
Generation from a representation means creating, from the representation, a data point that is similar to the data the unsupervised model has learnt from. In the case of text processing / NLP, this often means generating new sentences from scratch, e.g. https://transformer.huggingface.co/
Sampling from a representation means giving the unsupervised model a text, and the model is expected to provide some signal it has learnt from the data. E.g. given a language model and a novel sentence, we can estimate the probability of the sentence, then use this probability to compare across different sentences' probabilities.
Algorithmia has a nice summary blogpost https://algorithmia.com/blog/introduction-to-unsupervised-learning and a more modern perspective https://sites.google.com/view/berkeley-cs294-158-sp20/home
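As a toy illustration of the sampling case, here is a hand-rolled add-alpha unigram "language model" used to compare sentence probabilities (the corpus is made up, and real language models are far more sophisticated):

import math
from collections import Counter

corpus = "the cat sat on the mat the dog sat on the log".split()   # toy corpus
counts = Counter(corpus)
total = sum(counts.values())

def log_prob(sentence, alpha=1.0):
    # add-alpha smoothed unigram log-probability of a sentence
    vocab = len(counts)
    return sum(math.log((counts[w] + alpha) / (total + alpha * vocab))
               for w in sentence.split())

# higher log-probability = the model finds the sentence more "typical"
print(log_prob("the cat sat"), log_prob("log dog mat"))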
That's a whole lot of information but you don't tell me how to #$%^&-ing do unsupervised classification!
Yes, the oxymoron explanation isn't finished. If we look at text classification, what are we exactly doing?
We are fitting the input text into some pre-defined categories. In your case, the labels are pre-defined but
Q: Where exactly would the signal come from?
A: From the tweets, of course, stop distracting me! Tell me how to do classification!!!
Q: How do you tell the model that a tweet should be this label and not another label?
A: From the unsupervised learning, right? Isn't that what unsupervised learning is supposed to do? To map the input texts to the output labels?
Precisely, that's the oxymoron:
Supervised learning maps the input texts to output labels, not unsupervised learning.
So what do I do? I need to use unsupervised learning and I want to do classification.
Then the question to ask is:
Do you have labelled data?
If no, then how to get labels?
Use proxies: find signals that tell you a certain tweet has a certain label, e.g. from the hashtags, or make the assumption that some people always tweet on a certain category (see the sketch after this list)
Use existing tweet classifiers to label your data and then train the classification model on the data
Do I have to pay for these classifiers? Most often, yes you do. https://english.api.rakuten.net/search/text%20classification
If yes, then how much?
If it's too little,
then how to create more? Maybe https://machinelearningmastery.com/a-gentle-introduction-to-the-bootstrap-method/
or maybe use some modern post-training algorithm https://towardsdatascience.com/https-medium-com-chaturangarajapakshe-text-classification-with-transformer-models-d370944b50ca
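Here is the hashtag-proxy idea from the list above as a minimal sketch (the hashtag-to-category mapping is an assumption you would have to justify for your own data):

hashtag_to_label = {'#edtech': 'Education', '#f1': 'Automobiles'}   # hypothetical mapping

tweets = [
    "loved the #EdTech workshop with the students",
    "great race weekend #F1",
    "no obvious signal here",
]

labelled = []
for text in tweets:
    for tag, label in hashtag_to_label.items():
        if tag in text.lower():
            labelled.append((text, label))
            break   # first matching hashtag wins

print(labelled)   # only tweets with a proxy signal receive a (text, label) pair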
How about all this AI I keep hearing about, where I can do classification with 3 lines of code?
Don't they use unsupervised language models that sound like Sesame Street characters, e.g. ELMo, BERT, ERNIE?
I guess you mean something like https://github.com/ThilinaRajapakse/simpletransformers#text-classification
from simpletransformers.classification import ClassificationModel
import pandas as pd
# Train and Evaluation data needs to be in a Pandas Dataframe of two columns. The first column is the text with type str, and the second column is the label with type int.
train_data = [['Example sentence belonging to class 1', 1], ['Example sentence belonging to class 0', 0]]
train_df = pd.DataFrame(train_data)
eval_data = [['Example eval sentence belonging to class 1', 1], ['Example eval sentence belonging to class 0', 0]]
eval_df = pd.DataFrame(eval_data)
# Create a ClassificationModel
model = ClassificationModel('bert', 'bert-base-cased') # You can set class weights by using the optional weight argument
# Train the model
model.train_model(train_df)
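Once trained, labelling new tweets is one more call (per the simpletransformers README, predict returns both the predicted labels and the raw model outputs):

predictions, raw_outputs = model.predict(['Some arbitrary tweet text'])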
Take careful notice of the comment in the training snippet:
Train and Evaluation data needs to be in a Pandas Dataframe of two columns. The first column is the text with type str, and the second column is the label with type int.
Yes, that's the more modern approach:
First use a pre-trained language model to convert your texts into input representations
Then feed the input representations and their corresponding labels to a classifier
Note that you still can't avoid the fact that you need labels to train the supervised classifier.
Wait a minute, you mean all this AI I keep hearing about is not "unsupervised classification"?
Exactly. There's really no such thing as "unsupervised classification" (yet); somehow, (i) the labels need to be manually defined, and (ii) the mapping between the inputs and the labels has to exist.
The right term for this paradigm would be transfer learning, where the language model is
learned in a self-supervised manner (it's actually not truly unsupervised), so that the model learns to convert any text into some numerical representation,
and the numerical representation is then used with labelled data to produce the classifier.
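A minimal sketch of that two-step recipe, assuming the sentence-transformers package as the pre-trained encoder (the model name and the toy labelled tweets are assumptions, not recommendations):

from sentence_transformers import SentenceTransformer   # pip install sentence-transformers
from sklearn.linear_model import LogisticRegression

texts = ["students visited the workshop", "new car expo this weekend"]   # hypothetical labelled tweets
labels = ["Education", "Automobiles"]

# step 1: pre-trained language model -> numerical representation
encoder = SentenceTransformer('all-MiniLM-L6-v2')   # model choice is an assumption
X = encoder.encode(texts)

# step 2: representations + labels -> supervised classifier
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(encoder.encode(["teachers and students at the expo"])))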
I am trying to solve an NLP multilabel classification problem. I have a huge number of documents that should be classified into 29 categories.
My approach to the problem, after cleaning up the text (stop word removal, tokenizing, etc.), is the following:
To create the feature matrix, I looked at the frequency distribution of the terms in each document, created a table of these terms (with duplicates removed), and then calculated the term frequency of each word in its corresponding text (tf). Eventually I ended up with around 1000 terms and their respective frequencies in each document.
I then used SelectKBest to narrow them down to around 490, and after scaling them I used OneVsRestClassifier(SVC) to do the classification.
I am getting an F1 score of around 0.58, but it is not improving at all, and I need to get to 0.62.
Am I handling the problem correctly?
Do I need to use tfidf vectorizer instead of tf, and how?
I am very new to NLP and I am not sure at all what to do next and how to improve the score.
Any help in this subject is priceless.
Thanks
The tf method can give more importance to common words than necessary; use the tf-idf method instead, which gives importance to words that are rare and unique to a particular document in the dataset.
Also, rather than going straight to SelectKBest, first train on the whole set of features and then use feature importances to pick the best ones.
You can also try tree classifiers or XGBoost for a better model, though SVC is also a very good classifier.
Try using Naive Bayes as a minimum baseline for the F1 score, and improve your results on other classifiers with the help of grid search.
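A hedged sketch of what that pipeline could look like, combining tf-idf, OneVsRestClassifier, and grid search (docs, y, and the parameter grid are placeholders to adapt):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# docs (list of strings) and y (n_docs x 29 binary indicator matrix) are placeholders
pipe = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('clf', OneVsRestClassifier(LinearSVC())),
])

params = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'clf__estimator__C': [0.1, 1, 10],
}

search = GridSearchCV(pipe, params, scoring='f1_micro', cv=3)
# search.fit(docs, y)   # y from sklearn.preprocessing.MultiLabelBinarizer, for example
# print(search.best_score_, search.best_params_)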
I'm trying to apply SVM from Scikit learn to classify the tweets I collected.
So, there will be two categories, name them A and B.
For now, I have all the tweets categorized in two text file, 'A.txt' and 'B.txt'.
However, I'm not sure what type of data inputs the Scikit Learn SVM is asking for.
I have a dictionary with labels (A and B) as its keys and a dictionary of features (unigrams) and their frequencies as values.
Sorry, I'm really new to machine learning and not sure what I should do to get the SVM work.
Also, I found that the SVM takes a numpy.ndarray as its data input. Do I need to create one based on my own data?
Should it be something like this?
Labels features frequency
A 'book' 54
B 'movies' 32
Any help is appreciated.
Have a look at the documentation on text feature extraction.
Also have a look at the text classification example.
There is also a tutorial here:
http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
In particular, don't focus too much on SVM models (especially not sklearn.svm.SVC, which is more interesting for kernel models and hence not for text classification): a simple Perceptron, LogisticRegression, or Bernoulli naive Bayes model might work just as well while being much faster to train.
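As a concrete starting point, here is a minimal sketch along those lines; it assumes A.txt and B.txt contain one tweet per line, which may not match your files:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

with open('A.txt') as f:
    a_tweets = f.read().splitlines()
with open('B.txt') as f:
    b_tweets = f.read().splitlines()

texts = a_tweets + b_tweets
labels = ['A'] * len(a_tweets) + ['B'] * len(b_tweets)

vec = TfidfVectorizer()          # turns raw strings into a sparse matrix; no manual dicts needed
X = vec.fit_transform(texts)
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(vec.transform(["some new tweet"])))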
I have a dataset with annotations in the form: <Word/Phrase, Ontology Class>, where Ontology Class can be one of the following {Physical Object, Action, Quantity}. I have created this dataset manually for my particular ontology model from a large corpus of text.
Because this process was manual, I am sure I may have missed some words/phrases from my corpus. If that is the case, I am looking for ways to automatically extract other words from the same corpus that have the same "characteristics" as the words in the labelled dataset. So the first task is to define "characteristics" before I even start extracting other words.
Are there any standard techniques that I can use to achieve this?
EDIT: Sorry. I should have mentioned that these are domain-specific words not found in WordNet.
Take a look at chapter 6 of the NLTK book. From what you have described, it sounds like a supervised classification technique based on feature ("characteristic") extraction might be a good choice. From the book:
A classifier is called supervised if it is built based on training
corpora containing the correct label for each input.
You can use some of the data that you have manually encoded to train your classifier. It might look like this:
def word_features(name):
    # character-level features: first/last letter plus per-letter counts and presence flags
    features = {}
    features["firstletter"] = name[0].lower()
    features["lastletter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count(%s)" % letter] = name.lower().count(letter)
        features["has(%s)" % letter] = (letter in name.lower())
    return features
Next you can train your classifier on some of the data you have already tagged:
>>> import nltk
>>> words = [('Rock', 'Physical Object'), ('Play', 'Action'), ... ]
>>> featuresets = [(word_features(n), g) for (n, g) in words]
>>> train_set, test_set = featuresets[500:], featuresets[:500]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
You should probably train on half of the data you already tagged. That way you can test the accuracy of the classifier with the other half. Keep working on the features until the accuracy of the classifier is as you desire.
nltk.classify.accuracy(classifier, test_set)
You can check individual classifications as follows:
classifier.classify(word_features('Gold'))
If you are not familiar with NLTK, then you can read the previous chapters as well.
As jfocht has said, you need a classifier to do this. To train a classifier, you need a set of training data of 'things' with features and their classification. You can then feed in a new 'thing' with features and get out the classification.
The kicker here is that you don't have features, you just have the words. One idea is to use WordNet, which is a fancy dictionary, to generate features from the definitions of the words. One of WordNet's best features is that it has a hierarchy for each word, e.g.,
cat -> animal -> living thing -> thing ....
You might be able to do this simply by following the hierarchy, but if you can't, you could add features from it and train it. This will likely work much better than using the words themselves as features.
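For instance, with NLTK's WordNet interface you could turn the hypernym chain of a word's first sense into features (a rough heuristic, since it ignores word-sense ambiguity and, per your edit, won't cover domain-specific terms):

import nltk
from nltk.corpus import wordnet as wn
# nltk.download('wordnet')   # first run only

def hypernym_features(word):
    # features from the hypernym chain of the word's first WordNet sense
    features = {}
    synsets = wn.synsets(word)
    if synsets:
        for path in synsets[0].hypernym_paths():
            for syn in path:
                features['hypernym(%s)' % syn.name()] = True
    return features

print(hypernym_features('cat'))   # includes e.g. hypernym(animal.n.01), hypernym(entity.n.01)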
Regardless of whether you find WordNet useful, you need a feature set to train your classifier, and you also have to compute those features for all your unclassified data, so unless you have some way to do the feature part computationally, it's going to be less work to do it by hand.