I would like to extract key terms from documents with the chi-squared test, so I tried the following:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
Texts=["should schools have uniform","schools discipline","legalize marriage","marriage culture"]
vectorizer = TfidfVectorizer()
term_doc=vectorizer.fit_transform(Texts)
ch2 = SelectKBest(chi2, k="all")
X_train = ch2.fit_transform(term_doc)
print (ch2.scores_)
vectorizer.get_feature_names()
However, I do not have labels, and when I run the above code I get:
TypeError: fit() missing 1 required positional argument: 'y'
Is there any way of using the chi-squared test to extract the most important words without having any labels?
The chi-squared statistic tests for dependence between two variables, so it is not the right statistic for feature selection in an unsupervised (no labels) problem.
Depending on what your goal is in removing features, you could instead apply some feature preprocessing in your TfidfVectorizer. You might threshold your vectorizer to discard very common words and very rare words. For example, defining your vectorizer as:
TfidfVectorizer(min_df=0.01, max_df=0.9)
will remove words that occur in fewer than 1% of documents or more than 90% of documents.
If your goal in removing unimportant features is to significantly reduce the dimensionality of the problem for subsequent analysis, you may also find dimensionality reduction techniques like TruncatedSVD to be useful.
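For example, a minimal sketch of that route on the question's Texts, combining the document-frequency thresholds suggested above with TruncatedSVD (the number of components here is an arbitrary choice):
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

Texts = ["should schools have uniform", "schools discipline", "legalize marriage", "marriage culture"]
vectorizer = TfidfVectorizer(min_df=0.01, max_df=0.9)
term_doc = vectorizer.fit_transform(Texts)

# Project the TF-IDF matrix onto a handful of latent components.
svd = TruncatedSVD(n_components=2)
reduced = svd.fit_transform(term_doc)
print(reduced.shape)  # (n_documents, 2)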
Related
I'm following this scikit-learn tutorial on text clustering using K-Means:
http://scikit-learn.org/stable/auto_examples/text/document_clustering.html
In the example, LSA (using SVD) is optionally used to perform dimensionality reduction.
Why is this useful? The number of dimensions (features) can already be controlled in the TF-IDF vectorizer using the "max_features" parameter.
I understand that LSA (and LDA) are also topic-modelling techniques. The difference from clustering is that documents belong to multiple topics, but only to one cluster. I do not understand why LSA would be used in the context of K-Means clustering.
Example code:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
documents = ["some text", "some other text", "more text"]
tfidf_vectorizer = TfidfVectorizer(max_df=0.5, max_features=10000, min_df=2, stop_words='english', use_idf=True)
X = tfidf_vectorizer.fit_transform(documents)
svd = TruncatedSVD(1000)
normalizer = Normalizer(copy=False)
lsa = make_pipeline(svd, normalizer)
Xnew = lsa.fit_transform(X)
model = KMeans(n_clusters=10, init='k-means++', max_iter=100, n_init=1, verbose=False)
model.fit(Xnew)
LSA transforms the bag-of-words feature space into a new feature space (with an orthonormal set of basis vectors) where each dimension represents a latent concept, expressed as a linear combination of words from the original space.
As with PCA, a few top components generally capture most of the variance in the transformed feature space, while the remaining components mainly represent noise in the dataset. The top components of the LSA feature space can therefore be expected to capture most of the concepts defined by the words in the original space.
Hence, reducing the dimension in the transformed LSA feature space is likely to be much more effective than in the original bag-of-words TF-IDF feature space (which simply chops off the less frequent / less important words), leading to better-quality data after the dimensionality reduction and likely better clusters.
Additionally, dimensionality reduction helps to fight the curse of dimensionality (which arises, for example, in the distance computations inside k-means).
There is a paper that shows that PCA eigenvectors are good initializers for K-Means.
Controlling the dimension with the max_features parameter is equivalent to truncating the vocabulary, which has negative effects. For example, if you set max_features to 10, the model will only work with the 10 most frequent words in the corpus and ignore the rest.
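As a rough check of the variance argument above, one can inspect how much variance the retained components explain, reusing the svd object fitted in the question's code (the 0.9 threshold is arbitrary):
import numpy as np

# After lsa.fit_transform(X), svd holds the fitted TruncatedSVD model.
explained = np.cumsum(svd.explained_variance_ratio_)
print(explained[:10])                        # variance captured by the first components
print(np.searchsorted(explained, 0.9) + 1)   # roughly how many components cover 90%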
I created a Naive Bayes model to predict whether the outcome is 'negative' or 'positive'. The problem I am having is running the model on a new data set containing words that are not in the model. The error I receive when predicting on the new data set is:
ValueError: Expected input with 6 features, got 4 instead
I read that I would have to add a Laplace smoother to my model, and BernoulliNB already has a default alpha of 1. What else can I do to fix my error? Thank you.
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import BernoulliNB
from sklearn import cross_validation
from sklearn.metrics import classification_report
import numpy as np
from sklearn.metrics import accuracy_score
import textblob as TextBlob
#scikit
comments = list(['happy','sad','this is negative','this is positive', 'i like this', 'why do i hate this'])
classes = list(['positive','negative','negative','positive','positive','negative'])
# preprocess creates the term frequency matrix for the review data set
stop = stopwords.words('english')
count_vectorizer = CountVectorizer(analyzer=u'word', stop_words=stop, ngram_range=(1, 3))
comments = count_vectorizer.fit_transform(comments)
tfidf_comments = TfidfTransformer(use_idf=True).fit_transform(comments)
# preparing data for split validation: 80% training, 20% test
data_train,data_test,target_train,target_test = cross_validation.train_test_split(tfidf_comments,classes,test_size=0.2,random_state=43)
classifier = BernoulliNB().fit(data_train,target_train)
#new data
comments_new = list(['positive','zebra','george','nothing'])
comments_new = count_vectorizer.fit_transform(comments_new)
tfidf_comments_new = TfidfTransformer(use_idf=True).fit_transform(comments_new)
classifier.predict(tfidf_comments_new)
You should not fit a new estimator on the new data with fit_transform; instead, use the previously built count_vectorizer and call only transform. That will ignore all words that were not in the vocabulary.
I disagree with Maxim: While this doesn't make a difference for CountVectorizer, using TfidfTransformer on the joined dataset will leak information from the test set to the training set, which you need to avoid.
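A minimal sketch of that approach on the question's data, keeping references to the fitted vectorizer and TF-IDF transformer so that only transform is called on the new comments (stop-word filtering is omitted to keep the sketch self-contained):
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import BernoulliNB

comments = ['happy', 'sad', 'this is negative', 'this is positive', 'i like this', 'why do i hate this']
classes = ['positive', 'negative', 'negative', 'positive', 'positive', 'negative']

count_vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3))
counts = count_vectorizer.fit_transform(comments)       # fit on the training data only
tfidf = TfidfTransformer(use_idf=True).fit(counts)      # keep the fitted transformer
classifier = BernoulliNB().fit(tfidf.transform(counts), classes)

# New data: transform only, so the columns line up with the training vocabulary.
comments_new = ['positive', 'zebra', 'george', 'nothing']
counts_new = count_vectorizer.transform(comments_new)   # unseen words are simply dropped
print(classifier.predict(tfidf.transform(counts_new)))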
You are creating a count matrix from the words in 'comments'. When creating the count matrix, you must use all of the words you will encounter in your problem. Imagine the simpler case of a membership matrix: each column stands for a specific word and each row for a specific example from the dataset (for example, an email text). The matrix holds 0 if the word is not in the example and 1 if it is. Obviously, if you have built such a matrix for emails containing, say, 100 different words, the matrix will have 100 columns. But if you then try to use the trained classifier on new data containing a new word that was not in the training set, it will simply fail, since there was no column in the original matrix to hold values for that new word. So, once again, during vectorization of the text you must provide all terms you will ever face in the train and test datasets.
So instead of calling CountVectorizer and TfidfTransformer on 'comments' only, you must join comments and comments_new into one list and call CountVectorizer and TfidfTransformer on the joined list.
I have 3 variables that I want to fit into a k-means model. One is the TF-IDF vector, one is the count vector, and the third is the number of words in each document (sentence_list_len).
Here is my code:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.cluster import KMeans

vectorizer = TfidfVectorizer(min_df=1, max_df=0.9, stop_words='english', decode_error='ignore')
vectorized = vectorizer.fit_transform(sentence_list)
count_vectorizer = CountVectorizer(min_df=1, max_df=0.9, stop_words='english', decode_error='ignore')
count_vectorized = count_vectorizer.fit_transform(sentence_list)
sentence_list_len # for each document, how many words are there
km=KMeans(n_clusters=num_clusters, init='k-means++',n_init=10, verbose=1)
km.fit(vectorized)
How do I fit the 3 variables into km.fit()? Specifically, how do I stack all three of them and feed them to km.fit()?
Simply concatenate your vectors. See numpy.concatenate or numpy.vstack / numpy.hstack. However, be aware that k-means does not work well with high-dimensional data and that it will probably ignore "small" features. You have three types of features on different scales, which will heavily affect the clustering results. In general, k-means is not a good approach to NLP clustering tasks.
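Since the vectorizer outputs are SciPy sparse matrices, the stacking itself is probably easiest with scipy.sparse.hstack rather than the NumPy variants. A sketch, assuming the variables from the question (vectorized, count_vectorized, sentence_list_len, km):
import numpy as np
from scipy.sparse import hstack, csr_matrix

# Turn the per-document word counts into a single sparse column.
length_col = csr_matrix(np.asarray(sentence_list_len, dtype=float).reshape(-1, 1))

# Stack TF-IDF, raw counts and the length column side by side, then cluster.
X = hstack([vectorized, count_vectorized, length_col]).tocsr()
km.fit(X)
Rescaling the length column (for example, dividing by its maximum) would mitigate the scale problem mentioned above.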
The official way is to use a FeatureUnion:
from sklearn.pipeline import FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

tfidf = TfidfVectorizer()
cvect = CountVectorizer()
features = FeatureUnion([('cvect', cvect), ('tfidf', tfidf)])
X = features.fit_transform(sentence_list)
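To fold the third variable into the same union, one option is a small custom transformer that computes the word count per document; the WordCount class below is a hypothetical helper, not part of scikit-learn, and sentence_list is assumed from the question:
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

class WordCount(BaseEstimator, TransformerMixin):
    # Returns one column holding the number of whitespace-separated tokens per document.
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return np.array([[len(doc.split())] for doc in X], dtype=float)

features = FeatureUnion([
    ('cvect', CountVectorizer()),
    ('tfidf', TfidfVectorizer()),
    ('length', WordCount()),
])
X = features.fit_transform(sentence_list)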
I've a set of 4k text documents.
They belong to 10 different classes.
I'm trying to see how random forest method performs classification.
The issue is that my feature extraction class extracts 200k features (a combination of words, bigrams, collocations, etc.).
This is highly sparse data, and the random forest implementation in sklearn does not work with sparse data inputs.
Q. What are my options here? Reduce the number of features? How?
Q. Is there any implementation of random forest out there which works with sparse arrays?
My relevant code is as follows:
import logging
import numpy as np
from optparse import OptionParser
import sys
from time import time
#import pylab as pl
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from special_analyzer import *
data_train = load_files(RAW_DATA_SRC_TR)
data_test = load_files(RAW_DATA_SRC_TS)
# split a training set and a test set
y_train, y_test = data_train.target, data_test.target
vectorizer = CountVectorizer(analyzer=SpecialAnalyzer())  # SpecialAnalyzer is my class extracting features from text
X_train = vectorizer.fit_transform(data_train.data)
rf = RandomForestClassifier(max_depth=10,max_features=10)
rf.fit(X_train,y_train)
Several options: take only the 10000 most frequent features by passing max_features=10000 to CountVectorizer and convert the result to a dense numpy array with the toarray method:
X_train_array = X_train.toarray()
Otherwise, reduce the dimensionality to 100 or 300 dimensions with:
from sklearn.decomposition import TruncatedSVD

pca = TruncatedSVD(n_components=300)
X_reduced_train = pca.fit_transform(X_train)
However, in my experience I could never make an RF work better than a well-tuned linear model (such as logistic regression with a grid-searched regularization parameter) on the original sparse data (possibly with TF-IDF normalization).
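For reference, a rough sketch of such a baseline on the question's X_train and y_train (the TF-IDF step and the C grid are illustrative choices, not the only reasonable ones):
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Keep the data sparse; logistic regression handles it natively.
X_train_tfidf = TfidfTransformer().fit_transform(X_train)
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid={'C': [0.01, 0.1, 1, 10, 100]},
                    cv=5)
grid.fit(X_train_tfidf, y_train)
print(grid.best_params_, grid.best_score_)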
Option 1:
"If the number of variables is very large, forests can be run once with all the variables, then run again using only the most important variables from the first run."
from: http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#giniimp
I'm not sure whether the random forest in sklearn has a feature-importance option. The random forest in R implements mean decrease in Gini impurity as well as mean decrease in accuracy.
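For what it's worth, sklearn's RandomForestClassifier does expose a feature_importances_ attribute (mean decrease in impurity), so the two-pass idea from the quote could be sketched as follows, using the question's X_train and y_train (recent sklearn versions accept sparse input; the cut-off of 10000 features is arbitrary):
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# First pass: fit on all features and rank them by importance.
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train, y_train)
top = np.argsort(rf.feature_importances_)[::-1][:10000]

# Second pass: refit using only the most important features.
rf2 = RandomForestClassifier(n_estimators=100)
rf2.fit(X_train[:, top], y_train)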
Option 2:
Do dimensionality reduction. Use PCA or another dimensionality-reduction technique to turn the matrix of N dimensions into a smaller matrix, and then use this smaller, denser matrix for the classification problem.
Option 3:
Drop correlated features. I believe the random forest is supposed to be more robust to correlated features than multinomial logistic regression. That said, you may well have a number of correlated features. If you have many pairwise-correlated variables, you can drop one of each pair and, in theory, lose no predictive power. Beyond pairwise correlation there are also multiple correlations. Check out: http://en.wikipedia.org/wiki/Variance_inflation_factor
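An illustrative sketch of the pairwise part, practical only once the feature count is small enough to densify (X_dense is a hypothetical dense documents-by-features array, and the 0.95 threshold is arbitrary):
import numpy as np

corr = np.corrcoef(X_dense, rowvar=False)        # feature-feature correlation matrix
upper = np.triu(np.abs(corr), k=1)               # upper triangle, ignore self-correlation
to_drop = np.unique(np.where(upper > 0.95)[1])   # second member of each correlated pair
X_pruned = np.delete(X_dense, to_drop, axis=1)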
I started using sklearn.naive_bayes.GaussianNB for text classification, and have been getting fine initial results. I want to use the probability returned by the classifier as a measure of confidence, but the predict_proba() method always returns "1.0" for the chosen class, and "0.0" for all the rest.
I know (from here) that "...the probability outputs from predict_proba are not to be taken too seriously", but to that extent?!
The classifier can mistake finance-investing or chords-strings, but the predict_proba() output shows no sign of hesitation...
A little about the context:
- I've been using sklearn.feature_extraction.text.TfidfVectorizer for feature extraction, without, for a start, restricting the vocabulary with stop_words or min/max_df --> I have been getting very large vectors.
- I've been training the classifier on a hierarchical category tree (shallow: not more than 3 layers deep) with 7 manually categorized texts per category. For now the training is flat: I am not taking the hierarchy into account.
The resulting GaussianNB object is very big (~300MB), and prediction is rather slow: around 1 second for one text.
Can this be related? Are the huge vectors at the root of all this?
How do I get meaningful predictions? Do I need to use a different classifier?
Here's the code I'm using:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB
import numpy as np
from sklearn.externals import joblib
Vectorizer = TfidfVectorizer(input = 'content')
vecs = Vectorizer.fit_transform(TextsList) # ~2000 strings
joblib.dump(Vectorizer, 'Vectorizer.pkl')
gnb = GaussianNB()
Y = np.array(TargetList) # ~2000 categories
gnb.fit(vecs.toarray(), Y)
joblib.dump(gnb, 'Classifier.pkl')
...
#In a different function:
Vectorizer = joblib.load('Vectorizer.pkl')
Classifier = joblib.load('Classifier.pkl')
InputList = [Text] # One string
Vec = Vectorizer.transform(InputList)
Probs = Classifier.predict_proba([Vec.toarray()[0]])[0]
MaxProb = max(Probs)
MaxProbIndex = np.where(Probs==MaxProb)[0][0]
Category = Classifier.classes_[MaxProbIndex]
result = (Category, MaxProb)
Update:
Following the advice below, I tried MultinomialNB and LogisticRegression. They both return varying probabilities, and are better in every way for my task: much more accurate classification, smaller objects in memory, and much better speed (MultinomialNB is lightning fast!).
I now have a new problem: the returned probabilities are very small, typically in the range 0.004-0.012, even for the predicted/winning category (and the classification is accurate).
"...the probability outputs from predict_proba are not to be taken too seriously"
I'm the guy who wrote that. The point is that naive Bayes tends to predict probabilities that are almost always either very close to zero or very close to one; exactly the behavior you observe. Logistic regression (sklearn.linear_model.LogisticRegression or sklearn.linear_model.SGDClassifier(loss="log")) produces more realistic probabilities.
The resulting GaussianNB object is very big (~300MB), and prediction is rather slow: around 1 second for one text.
That's because GaussianNB is a non-linear model and does not support sparse matrices (which you found out already, since you're using toarray). Use MultinomialNB, BernoulliNB or logistic regression, which are much faster at prediction time and also smaller. Their assumptions with respect to the input are also more realistic for term features. GaussianNB is really not a good estimator for text classification.
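A hedged sketch combining both suggestions, reusing TextsList, TargetList and Text from the question's code: MultinomialNB works on the sparse matrix directly and stays small and fast, while logistic regression gives more realistic predict_proba output.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

vectorizer = TfidfVectorizer(input='content')
X = vectorizer.fit_transform(TextsList)   # stays sparse, no toarray() needed

nb = MultinomialNB().fit(X, TargetList)
lr = LogisticRegression(max_iter=1000).fit(X, TargetList)

vec = vectorizer.transform([Text])
print(nb.predict(vec))             # fast class prediction
print(lr.predict_proba(vec)[0])    # better-calibrated class probabilities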