TF-IDF in scikit-learn - python

Is there a complete Python 2.7 example of how to use TfidfTransformer (http://scikit-learn.org/stable/modules/feature_extraction.html) to generate TF-IDF for n-grams of a corpus? I looked around the scikit-learn pages and they only have code snippets (not complete samples).
regards,
Lin

For TF-IDF feature extraction, scikit-learn has two classes: TfidfTransformer and TfidfVectorizer. Both classes essentially serve the same purpose but are meant to be used differently. For textual feature extraction, scikit-learn has the notion of Transformers and Vectorizers. A Vectorizer works directly on the raw text to generate the features, whereas a Transformer works on existing features and transforms them into new features. Going by that analogy, TfidfTransformer works on existing term-frequency features and converts them to TF-IDF features, whereas TfidfVectorizer takes the raw text as input and directly generates the TF-IDF features. You should always use TfidfVectorizer if you do not already have a Document-Term Matrix at the time of feature building. At a black-box level, you can think of TfidfVectorizer as a CountVectorizer followed by a TfidfTransformer.
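To make that last point concrete, here is a small hedged sketch (not part of the original answer; the two toy documents are made up) showing that, with default settings, both routes produce the same matrix:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

docs = ['jack and jill went up the hill',
        'to fetch a pail of water']

# Route 1: raw text -> term counts -> TF-IDF
counts = CountVectorizer().fit_transform(docs)
tfidf_from_counts = TfidfTransformer().fit_transform(counts)

# Route 2: raw text -> TF-IDF directly
tfidf_direct = TfidfVectorizer().fit_transform(docs)

# with default settings both sparse matrices hold the same values
print((tfidf_from_counts != tfidf_direct).nnz == 0)  # expected: True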
Now let us come to a working example of TfidfVectorizer. Note that if this example is clear, you will have no difficulty in understanding the example given for TfidfTransformer.
Now consider you have the following 4 documents in your corpus:
text = [
    'jack and jill went up the hill',
    'to fetch a pail of water',
    'jack fell down and broke his crown',
    'and jill came tumbling after'
]
You can use any iterable as long as it iterates over strings. The TfidfVectorizer also supports reading texts from files, which is discussed in detail in the docs. Now, in the simplest case, we can initialize a TfidfVectorizer object and fit our training data to it. This is done as follows:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
train_features = tfidf.fit_transform(text)
train_features.shape  # (4, 20)
This code simply fits the vectorizer on our input data and generates a sparse matrix of dimensions 4 x 20: each document in the given text is transformed into a vector of 20 features, 20 being the size of the vocabulary.
In the case of TfidfVectorizer, 'fitting the model' means that the TfidfVectorizer learns the IDF weights from the corpus. 'Transforming the data' means using the fitted model (the learnt IDF weights) to convert documents into TF-IDF vectors. This terminology is standard throughout scikit-learn, and it is extremely useful for classification problems. Suppose you want to classify documents as positive or negative based on some labelled training data, using TF-IDF vectors as features. In that case you build your TF-IDF vectorizer on your training data, and when you see new test documents, you simply transform them using the already fitted TfidfVectorizer.
So if we had the following test_text:
test_text = [
    'jack fetch water',
    'jill fell down the hill'
]
we would build the test features by simply doing:
test_data = tfidf.transform(test_text)
This will again give us a sparse matrix, this time of dimensions 2 x 20. The IDF weights used in this case are the ones learnt from the training data.
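If you want to peek at what the fitted vectorizer has actually learnt, here is a small optional sketch (get_feature_names_out is available in recent scikit-learn releases; older versions expose get_feature_names instead):

print(tfidf.get_feature_names_out())  # the 20 vocabulary terms
print(tfidf.idf_)                     # the IDF weight learnt for each term
print(test_data.toarray())            # dense 2 x 20 view of the test matrix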
This is how a simple TfidfVectorizer works. You can make it more intricate by passing more parameters to the constructor; these are very well documented in the scikit-learn docs. Some of the parameters that I use frequently are:
ngram_range - This allows us to build TF-IDF vectors using n-gram tokens. For example, passing (1, 2) will build both unigrams and bigrams.
stop_words - Allows us to supply stopwords separately, to be ignored in the process. It is common practice to filter out words such as 'the', 'of', etc., which are present in almost all documents.
min_df and max_df - These allow us to filter the vocabulary dynamically based on document frequency. For example, by giving a max_df of 0.7, I can let my application automatically remove domain-specific stop words; in a corpus of medical journals, for instance, the word 'disease' could be treated as a stop word. A short sketch combining these options follows below.
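A hedged sketch putting those three parameters together on the toy corpus above; the exact values (ngram range, thresholds) are illustrative, not recommendations:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(
    ngram_range=(1, 2),    # build unigram and bigram features
    stop_words='english',  # drop common English stop words
    min_df=2,              # ignore terms appearing in fewer than 2 documents
    max_df=0.7,            # ignore terms appearing in more than 70% of documents
)
features = tfidf.fit_transform(text)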
Beyond this, you can also refer to some sample code that I had written for a project. Though it is not well documented, the functions are well named.
Hope this helps!

Related

Why does KNN algorithm perform better on Word2Vec than on TF-IDF vector representation?

I am doing a project on multi-class text classification and could do with some advice.
I have a dataset of reviews which are classified into 7 product categories.
Firstly, I create a term-document matrix using TF-IDF (TfidfVectorizer from sklearn). This generates a matrix of n x m where n is the number of reviews in my dataset and m is the number of features.
Then, after splitting the term-document matrix 80:20 into train and test sets, I pass it through the K-Nearest Neighbours (KNN) algorithm and achieve an accuracy of 53%.
In another experiment, I used the Google News Word2Vec pretrained embedding (300-dimensional) and averaged all the word vectors for each review. So each review consists of x words, each word has a 300-dimensional vector, and the vectors are averaged to produce one 300-dimensional vector per review.
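A minimal sketch of that averaging step, assuming gensim is installed, the standard GoogleNews-vectors-negative300.bin file is available locally, and tokenized_reviews is a (hypothetical) list of already-tokenized reviews:

import numpy as np
from gensim.models import KeyedVectors

w2v = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

def average_vector(tokens, model, dim=300):
    # average the vectors of the in-vocabulary tokens; zeros if none are found
    vecs = [model[w] for w in tokens if w in model]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

X_w2v = np.vstack([average_vector(tokens, w2v) for tokens in tokenized_reviews])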
Then I pass this matrix through KNN. I get an accuracy of 72%.
As for the other classifiers that I tested on the same dataset, all of them performed better with the TF-IDF method of vectorization. However, KNN performed better with word2vec.
Can anyone help me understand why there is a jump in accuracy for KNN in using the word2vec method as compared to when using the tfidf method?
By using the external word-vectors, you've introduced extra info about the words into the word2vec-derived features – info that simply may not be deducible at all from the plain word-occurrence (TF-IDF) model.
For example, imagine that just a single review in your train set, and another single review in your test set, use some less-common word for car like 'jalopy' – but then zero other car-associated words.
A TFIDF model will have a weight for that unique term in a particular slot - but may have no other hints in the training dataset that jalopy is related to cars at all. In TFIDF space, that weight will just make those 2 reviews more-distant from all other reviews (which have a 0.0 in that dimension). It doesn't help or hurt much. A review 'nice jalopy' will be no closer to 'nice car' than it is to 'nice movie'.
On the other hand, if the GoogleNews set has a vector for that word, and that vector is fairly close to car, auto, wheels, etc., then reviews with all those words will be shifted a little in the same direction in the word2vec-space, giving an extra hint to some classifiers – especially, perhaps, the KNN one. Now, 'nice jalopy' is quite a bit closer to 'nice car' than to 'nice movie' or most other 'nice [X]' reviews.
Using word-vectors from an outside source may not have great coverage of your dataset's domain words. (Words in GoogleNews, from a circa-2013 training run on news articles, might miss both words, and word-senses in your alternative & more-recent reviews.) And, summarizing a text by averaging all its words is a very crude method: it can learn nothing from word-ordering/grammar (that can often reverse intended sense), and aspects of words may all cancel-out/dilute each other in longer texts.
But still, it's bringing in more language info that otherwise wouldn't be in the data at all, so in some cases it may help.
If your dataset is sufficiently large, training your own word-vectors may help a bit, too. (Though, the gain you've seen so far suggests some useful patterns of word-similarities may not be well-taught from your limited dataset.)
Of course also note that you can use blended techniques. Perhaps each text can be even better-represented by a concatenation of the N TF-IDF dimensions and the M word2vec-average dimensions. (If your texts have many significant 2-word phrases that mean things different from the individual words, adding in word 2-gram features may help. If your texts have many typos or rare word variants that still share word-roots with other words, then adding in character n-grams – word fragments – may help.)
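A rough sketch of that blended idea, assuming tfidf_matrix is your sparse TF-IDF matrix and X_w2v the dense matrix of averaged word2vec vectors, with one row per review (both names are illustrative):

from scipy import sparse

# concatenate the N TF-IDF columns with the 300 word2vec-average columns
X_blended = sparse.hstack([tfidf_matrix, sparse.csr_matrix(X_w2v)]).tocsr()
# X_blended can now be fed to KNN or any other scikit-learn classifier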

How to do supervised learning with Gensim/Word2Vec/Doc2Vec having large corpus of text documents?

I have a set of text documents (2000+) with labels (Liked/Disliked). Each document consists of 200+ words.
I am trying to do a supervised learning with these documents.
My approach would be:
Vectorize each document in the corpus. Say we have 2347 docs.
I can have 2347 rows with labels, viz. Liked as 1 and Disliked as 0.
Using any supervised ML classification model, train on the above dataset of 2347 rows.
How to vectorize and create such dataset?
One of the things you can try is using Doc2Vec. This will allow you to map each document to a vector of dimension N. Then you can use any supervised learning algorithm to train on these N features.
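A minimal Doc2Vec sketch with gensim (recent 4.x API), assuming texts is your list of 2347 raw documents and labels the matching list of 0/1 labels – both names are placeholders:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LogisticRegression

tagged = [TaggedDocument(words=doc.lower().split(), tags=[i])
          for i, doc in enumerate(texts)]

model = Doc2Vec(vector_size=100, min_count=2, epochs=40)
model.build_vocab(tagged)
model.train(tagged, total_examples=model.corpus_count, epochs=model.epochs)

# one 100-dimensional vector per document, then any supervised classifier on top
X = [model.infer_vector(doc.words) for doc in tagged]
clf = LogisticRegression(max_iter=1000).fit(X, labels)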
There are other alternatives to doc2vec mentioned here. Try the Average of Word2Vec vectors with TF-IDF approach as well.
Also, make sure you apply appropriate text cleaning before applying doc2vec or word2vec. Steps like case normalization, stopword removal, punctuation removal, etc. It really depends on your dataset. Find out more here
I would also suggest engineering some features from your data if you are looking to predict like/dislike. This depends on your data and problem, but some examples are (a small sketch computing two of these follows after the list):
The proportion of uppercase words
Slang words present or not
Emoticons present or not
Language of the text
The sentiment of the text - this is a whole new topic altogether though
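A tiny sketch of the first and third ideas; the emoticon list here is deliberately minimal and purely illustrative:

EMOTICONS = {':)', ':(', ':D', ';)', ':-)'}

def extra_features(text):
    tokens = text.split()
    upper_ratio = sum(t.isupper() for t in tokens) / max(len(tokens), 1)
    has_emoticon = int(any(tok in EMOTICONS for tok in tokens))
    return [upper_ratio, has_emoticon]

print(extra_features("GREAT product, totally worth it :)"))  # [0.166..., 1]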
I hope this was helpful...

Can I match words or sentences to a pre-vectorized corpus of sentences in Python for NL processing?

I've been searching for an answer to this specific question for a few hours and while I've learned a lot, I still haven't figured it out.
I have a dataset of ~70,000 sentences with a subset of about 4,000 sentences that have been appropriately categorized; the rest are uncategorized. Currently I'm using a scikit-learn pipeline with CountVectorizer and TfidfTransformer to vectorize the data, however I'm only vectorizing based off the 4,000 sentences and then testing various models via cross-validation.
I'm wondering if there is a way to use Word2Vec or something similar to vectorize the entire corpus of data and then use these vectors with my subset of 4,000 sentences. My intention is to increase the accuracy of my model predictions by using word vectors that incorporate all of the semantic data in the corpus rather than just data from the 4,000 sentences.
The code I'm currently using is:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC

svc = Pipeline([('vect', CountVectorizer(ngram_range=(3, 5))),
                ('tfidf', TfidfTransformer()),
                ('clf', LinearSVC()),
                ])
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)
Where X_train and y_train are my features and labels, respectively. I also have a list z_all which includes all remaining uncategorized features.
Just getting pointed in the right direction (or told whether or not this is possible) would be super helpful.
Thank you!
I would say that the answer is yes: you can use Word2Vec or another similar word-embedding method to get vectors of each sentence in your data, and then use these vectors both as training and testing data in a linear Support Vector Machine (SVC).
And yes, you can first create those vectors for your entire corpus of ~70,000 sentences before actually doing any training on your data.
It is however not as straightforward as the approach you're currently using.
There are many different ways to do this so I'll just go through one of them to help you get the basics of how this can be done.
Before we start and see what possible steps you can follow, let's remember that the goal here is to get one vector for each and every sentence of your corpus.
If you don't know what word-embeddings are, I highly suggest you read about them, but in short this is just a way to link each word of a pre-defined vocabulary to a vector of a given dimension.
For instance, you would have:
# the vector associated with the word "cat" is the following vector of fixed-length
word_embeddings["cat"] = [0.0014, 0.6710, ..., 0.3281]
Now that you know this, here are the steps you could be following:
Tokenization - The first thing that you want to do is tokenize each of your sentences. This can be done using an NLP library (spaCy, for instance) that will help you to:
split each sentence in a list of words
remove any punctuation from these words and convert them to lowercase
remove stopwords - optionally
lemmatize all the words - optionally
Train a word embedding model - Now that you have each sentence as a pre-processed list of words, you need to train a word-embedding model using your corpus. There are many different algorithms to do that; I would suggest using Gensim's Word2Vec or fastText. You can also use pre-trained word embeddings, like GloVe or anything that best fits your corpus in terms of language/context. Either way, this will allow you to:
have one vector of pre-defined size for each and every word in your corpus' vocabulary
get a list of equally-sized vectors for each sentence in your corpus
Adopting a weighting method - Once you have a list of vectors for each sentence in your corpus, and mainly because your sentences vary in length (some have 6 words, others have 13 words, etc.), what you want to do is get a single vector for each and every sentence. To do this, you can simply weight the vectors corresponding to the words in each sentence. You can:
average all vectors
use weights like TF-IDF weights to give some words more importance than others
use other weighting methods...
Training and testing - Finally, all you're left to do is train a model using these vectors, for instance with a linear Support Vector Machine (SVC), and test the accuracy of your model on a test dataset (you can also use a validation dataset). A rough end-to-end sketch of these steps follows below.
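A hedged end-to-end sketch of the steps above, using recent gensim (4.x) and scikit-learn; all_sentences, labeled_sentences and labels are placeholder names for the full ~70,000-sentence corpus and the ~4,000 categorized ones:

import numpy as np
from gensim.models import Word2Vec
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# 1. very simple tokenization (spaCy or NLTK would do a more careful job)
tokenized_all = [s.lower().split() for s in all_sentences]

# 2. train word vectors on the *entire* corpus, labeled or not
w2v = Word2Vec(sentences=tokenized_all, vector_size=100, window=5, min_count=2)

# 3. one vector per sentence by plain averaging
def sentence_vector(tokens, model, dim=100):
    vecs = [model.wv[w] for w in tokens if w in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

X = np.vstack([sentence_vector(s.lower().split(), w2v) for s in labeled_sentences])

# 4. train and evaluate only on the labeled subset
print(cross_val_score(LinearSVC(), X, labels, cv=5).mean())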
My opinion is: if you are going to use a word2vec embedding, use a pre-trained one or generate it from generic text.
Word2vec embeddings are usually used to give meaning and context to your text data; if you train an embedding using only your own data, it might be biased and not represent the language well, which means its vectors wouldn't carry much meaning.
After getting your embedding working, you also have to think about what to do with your words, because a sentence has one or more words (embeddings work at the word level), and you want to feed your models 1 sentence -> 1 vector, not 1 sentence -> N vectors.
People usually average or multiply those vectors so for example, for the sentence "Hello there" and an embedding of 5 dims:
Hello -> [0, 0, .2, .3, .8]
there -> [.1, .2, 0, 0, .5]
AVG Hello there -> [.05, .1, .1, .15, .65]
This is what you want to use for your models!
So instead of using TF-IDF to generate your sentence vectors, use word2vec like this and you shouldn't have any problem. I already worked on a text classification project and we ended up using a self-trained w2v embedding and an ExtraTrees model, with amazing results.

nlp multilabel classification tf vs tfidf

I am trying to solve an NLP multilabel classification problem. I have a huge amount of documents that should be classified into 29 categories.
My approach to the problem, after cleaning up the text (stop word removal, tokenizing, etc.), was the following:
To create the feature matrix I looked at the frequency distribution of the terms of each document, then created a table of these terms (with duplicate terms removed), and then calculated the term frequency for each word in its corresponding text (tf). So eventually I ended up with around 1000 terms and their respective frequencies in each document.
I then used SelectKBest to narrow them down to around 490, and after scaling them I used OneVsRestClassifier(SVC) to do the classification.
I am getting an F1 score around 0.58 but it is not improving at all and I need to get 0.62.
Am I handling the problem correctly?
Do I need to use tfidf vectorizer instead of tf, and how?
I am very new to NLP and I am not sure at all what to do next and how to improve the score.
Any help in this subject is priceless.
Thanks
The tf method can give more importance to common words than necessary; rather, use the tf-idf method, which gives importance to words that are rare and unique to a particular document in the dataset.
Also, before selecting the K best features, first train on the whole set of features and then use feature importances to pick the best ones.
You can also try tree classifiers or XGBoost for a better model, though SVC is also a very good classifier.
Try using Naive Bayes as the minimum-standard (baseline) F1 score and try improving your results with other classifiers with the help of grid search, along the lines of the sketch below.
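A hedged sketch of that suggestion: a TF-IDF + Naive Bayes baseline and a grid-searched linear SVC, both wrapped in OneVsRestClassifier for the 29-label problem; docs and the binary indicator matrix Y are placeholder names:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# baseline: TF-IDF features + one Naive Bayes model per label
baseline = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', OneVsRestClassifier(MultinomialNB()))])

# candidate: linear SVC with its C parameter tuned by grid search on micro-F1
svc = Pipeline([('tfidf', TfidfVectorizer()),
                ('clf', OneVsRestClassifier(LinearSVC()))])
grid = GridSearchCV(svc, {'clf__estimator__C': [0.1, 1, 10]},
                    scoring='f1_micro', cv=3)

# baseline.fit(docs, Y); grid.fit(docs, Y)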

dictionary for sentiment analysis in NLTK

I am new to Python and NLTK. I have a model created for sentiment analysis of surveys in NLTK (NaiveBayesClassifier). To improve the accuracy, I wanted to add a dictionary containing lists of positive and negative statements to the model. Is there any module in NLTK for this, and are there any additional features that can improve my model?
You can have a look at some public sentiment lexicons which would provide you a corpus of positive and negative words.
One of them can be found at https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
Since you haven't specified any details about your current model, I'm assuming you are using a very basic Naive Bayes classifier. If you are using unigrams (words) to vectorize your text right now, then you can consider using bigrams or trigrams for generating the feature vectors. This would basically enable you to use the contextual information of the words to a certain extent.
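A minimal sketch of unigram + bigram features for an NLTK NaiveBayesClassifier, assuming train_data is a (hypothetical) list of (text, label) pairs:

import nltk
# nltk.download('punkt')  # needed once for word_tokenize

def make_features(text):
    tokens = nltk.word_tokenize(text.lower())
    feats = {('uni', t): True for t in tokens}
    feats.update({('bi', b): True for b in nltk.bigrams(tokens)})
    return feats

featuresets = [(make_features(text), label) for text, label in train_data]
classifier = nltk.NaiveBayesClassifier.train(featuresets)
print(classifier.classify(make_features("the survey was very helpful")))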
If you are currently using a bag-of-words model like tf-idf to convert your text to vectors, then you can consider using word embeddings instead. Bag of words doesn't consider the contextual information of the words, whereas word embeddings are able to capitalise on that.
You could use something like gensim, which uses neural-network models such as word2vec to convert words to vectors. Have a look at: https://radimrehurek.com/gensim/models/word2vec.html
Furthermore, you can always try using a LinearSVC classifier or a logistic regression classifier and choose whichever one gives the best accuracy.
You can download one from NLTK, just like this:
import nltk
nltk.download('opinion_lexicon')  # one-time download of the lexicon
from nltk.corpus import opinion_lexicon
pos_list = set(opinion_lexicon.positive())
neg_list = set(opinion_lexicon.negative())
