stacking 3 variables for kmeans scikit - python

I have 3 variables that I want to fit into a k-means model. One is the TF-IDF vector, one is the count vector, and the third is the number of words in each document (sentence_list_len).
Here is my code:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.cluster import KMeans

vectorizer = TfidfVectorizer(min_df=1, max_df=0.9, stop_words='english', decode_error='ignore')
vectorized = vectorizer.fit_transform(sentence_list)
count_vectorizer = CountVectorizer(min_df=1, max_df=0.9, stop_words='english', decode_error='ignore')
count_vectorized = count_vectorizer.fit_transform(sentence_list)
sentence_list_len  # for each document, how many words there are
km = KMeans(n_clusters=num_clusters, init='k-means++', n_init=10, verbose=1)
km.fit(vectorized)
How do I fit all three variables into km.fit()? Specifically, how do I stack them together and feed them to km.fit()?

Simply concatenate your feature matrices; see numpy.concatenate or numpy.vstack / numpy.hstack. However, be aware that k-means does not work well with high-dimensional data and will tend to ignore "small" features. Your three feature types are on very different scales, and this will heavily affect the clustering results. In general, k-means is not a good approach to NLP clustering tasks.
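For completeness, a minimal sketch of stacking the three feature blocks from the question (since the vectorizer outputs are scipy sparse matrices, scipy.sparse.hstack is the practical tool here; the variable names reuse the question's, and no rescaling is applied even though, as noted above, the scale mismatch will dominate the result):
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.cluster import KMeans
# Word counts as one extra column (assumes sentence_list_len is a plain list of ints).
length_col = csr_matrix(np.asarray(sentence_list_len, dtype=float).reshape(-1, 1))
# Same rows (documents), columns concatenated side by side.
X = hstack([vectorized, count_vectorized, length_col]).tocsr()
km = KMeans(n_clusters=num_clusters, init='k-means++', n_init=10, verbose=1)
km.fit(X)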

The official way is to use a FeatureUnion:
from sklearn.pipeline import FeatureUnion
tfidf = TfidfVectorizer()
cvect = CountVectorizer()
features = FeatureUnion([('cvect', cvect), ('tfidf', tfidf)])
X = features.fit_transform(sentence_list)

Related

How to calculate key terms from a document with chi-squared test?

I would like to extract key terms from documents with chi-squared test, thus I tried the following:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
Texts=["should schools have uniform","schools discipline","legalize marriage","marriage culture"]
vectorizer = TfidfVectorizer()
term_doc=vectorizer.fit_transform(Texts)
ch2 = SelectKBest(chi2, k="all")
X_train = ch2.fit_transform(term_doc)
print (ch2.scores_)
vectorizer.get_feature_names()
However, I do not have labels, and when I run the above code I get:
TypeError: fit() missing 1 required positional argument: 'y'
Is there any way of using chi-squared test to extract most important words without having any labels?
The Chi-square statistic tests for dependence between two variables. So it's not the right statistic for feature selection in an unsupervised (no labels) problem.
Depending on what your goal is in removing features, you could instead apply some feature preprocessing in your TfidfVectorizer. You might threshold your vectorizer to discard very common words and very rare words. For example, defining your vectorizer as:
TfidfVectorizer(min_df=0.01, max_df=0.9)
will remove words that occur in fewer than 1% of documents or more than 90% of documents.
If your goal in removing unimportant features is to significantly reduce the dimensionality of the problem for subsequent analysis, you may also find dimensionality reduction techniques like TruncatedSVD to be useful.
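For example, a minimal sketch using the Texts from the question (the component count is arbitrary and must stay below the number of tf-idf features):
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
Texts = ["should schools have uniform", "schools discipline", "legalize marriage", "marriage culture"]
term_doc = TfidfVectorizer().fit_transform(Texts)
svd = TruncatedSVD(n_components=2)     # must be < number of tf-idf features
reduced = svd.fit_transform(term_doc)  # dense (n_documents, 2) array for downstream analysis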

Why use LSA before K-Means when doing text clustering

I'm following this tutorial from Scikit learn on text clustering using K-Means:
http://scikit-learn.org/stable/auto_examples/text/document_clustering.html
In the example, LSA (using SVD) is optionally used to perform dimensionality reduction.
Why is this useful ? The number of dimensions (features) can already be controlled in the TF-IDF vectorizer using the "max_features" parameter.
I understand that LSA (and LDA) are also topic modelling techniques. The difference with clustering is that documents belong to multiple topics, but only to one cluster. I do not understand why LSA would be used in the context of K-Means clustering.
Example code:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
documents = ["some text", "some other text", "more text"]
tfidf_vectorizer = TfidfVectorizer(max_df=0.5, max_features=10000, min_df=2, stop_words='english', use_idf=True)
X = tfidf_vectorizer.fit_transform(documents)
svd = TruncatedSVD(1000)
normalizer = Normalizer(copy=False)
lsa = make_pipeline(svd, normalizer)
Xnew = lsa.fit_transform(X)
model = KMeans(n_clusters=10, init='k-means++', max_iter=100, n_init=1, verbose=False)
model.fit(Xnew)
LSA transforms the bag-of-words feature space into a new feature space (with an orthonormal set of basis vectors) in which each dimension represents a latent concept, expressed as a linear combination of words from the original space.
As with PCA, a few top eigenvectors generally capture most of the variance in the transformed feature space, while the remaining eigenvectors mainly represent noise in the dataset. The top eigenvectors in the LSA feature space can therefore be expected to capture most of the concepts defined by the words in the original space.
Hence, reducing the dimension in the transformed LSA feature space is likely to be much more effective than in the original BOW tf-idf feature space (which simply chops off the less frequent / less important words), leading to better-quality data after the reduction and likely to better clusters.
Additionally, dimensionality reduction helps to fight the curse of dimensionality (which arises, for example, in the distance computations of k-means).
There is also a paper showing that PCA eigenvectors are good initializers for k-means.
Controlling the dimension with the max_features parameter is equivalent to cutting off the vocabulary size, which has negative effects. For example, if you set max_features to 10, the model will work only with the 10 most common words in the corpus and ignore the rest.
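As a rough, practical way to choose the number of LSA components (not part of the tutorial; the corpus below is a placeholder), you can check how much variance the fitted TruncatedSVD retains:
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
documents = ["some text", "some other text", "more text"]  # placeholder corpus
X = TfidfVectorizer().fit_transform(documents)
svd = TruncatedSVD(n_components=2)  # n_components must be < n_features
svd.fit(X)
print(svd.explained_variance_ratio_.sum())  # fraction of total variance retained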

Working with, preparing bag-of-word data for Regression

I'm trying to create a regression model that predicts an author's age. I'm using (Nguyen et al., 2011) as my basis.
Using a bag-of-words model, I count the occurrences of words per document (the documents are posts from message boards) and create a vector for every post.
I limit the size of each vector by using as features the top-k most frequently used words (stop words are not used). For example, with k = 8:
Vectorexample_with_k_8 = [0,0,0,1,0,3,0,0]
My data is generally sparse like in the Example.
When I test the model on my test data I get a very low r² score (0.00-0.1), sometimes even a negative score. The model always predicts the same age, which happens to be the average age of my dataset, as can be seen in the distribution of my data (age/amount).
I used different regression models from scikit-learn: LinearRegression, Lasso, and SGDRegressor, with no improvement.
So my questions are:
1. How do I improve the r² score?
2. Do I have to change the data to fit the regression better? If yes, with what method?
3. Which regressors/methods should I use for this kind of text-based regression?
To my knowledge, bag-of-words models usually use Naive Bayes as the classifier to fit the document-by-term sparse matrix.
None of your regressors handles a large sparse matrix well. Lasso may work well if you have groups of highly correlated features.
I think Latent Semantic Analysis may provide better results for your problem. Essentially, use TfidfVectorizer to normalize the word-count matrix, then use TruncatedSVD to reduce the dimensionality, keeping the first N components that capture most of the variance. Most regressors should work well with the lower-dimensional matrix. In my experience, SVM works pretty well for this problem.
Here I show an example script:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn import svm
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in older scikit-learn versions
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('svd', TruncatedSVD()),
    ('clf', svm.SVR())
])
# You can tune hyperparameters using grid search
params = {
    'tfidf__max_df': (0.5, 0.75, 1.0),
    'tfidf__ngram_range': ((1, 1), (1, 2)),
    'svd__n_components': (50, 100, 150, 200),
    'clf__C': (0.1, 1, 10),
}
grid_search = GridSearchCV(pipeline, params, scoring='r2',
                           n_jobs=-1, verbose=10)
# Fit your documents (should be a list/array of strings)
grid_search.fit(documents, y)
print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(params.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))

Classifying text documents with random forests

I've a set of 4k text documents.
They belong to 10 different classes.
I'm trying to see how random forest method performs classification.
The issue is that my feature-extraction class extracts 200k features (a combination of words, bigrams, collocations, etc.).
This is highly sparse data, and the random forest implementation in sklearn does not work with sparse data inputs.
Q. What are my options here? Reduce the number of features? How?
Q. Is there any implementation of random forest out there that works with sparse arrays?
My relevant code is as follows:
import logging
import numpy as np
from optparse import OptionParser
import sys
from time import time
#import pylab as pl
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from special_analyzer import *
data_train = load_files(RAW_DATA_SRC_TR)
data_test = load_files(RAW_DATA_SRC_TS)
# split a training set and a test set
y_train, y_test = data_train.target, data_test.target
vectorizer = CountVectorizer(analyzer=SpecialAnalyzer())  # SpecialAnalyzer is my class extracting features from text
X_train = vectorizer.fit_transform(data_train.data)
rf = RandomForestClassifier(max_depth=10,max_features=10)
rf.fit(X_train,y_train)
Several options: take only the 10000 most frequent features by passing max_features=10000 to CountVectorizer and convert the result to a dense numpy array with the toarray method:
X_train_array = X_train.toarray()
Otherwise, reduce the dimensionality to 100 or 300 dimensions with:
from sklearn.decomposition import TruncatedSVD
pca = TruncatedSVD(n_components=300)
X_reduced_train = pca.fit_transform(X_train)
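Either way, the test documents need to go through the same fitted transformers before prediction. A short sketch, continuing from the question's snippet and the lines above (vectorizer, pca, X_reduced_train, y_train and y_test are reused):
X_test = vectorizer.transform(data_test.data)  # transform, not fit_transform: reuse the fitted CountVectorizer
X_reduced_test = pca.transform(X_test)         # reuse the fitted TruncatedSVD
rf = RandomForestClassifier(max_depth=10, max_features=10)
rf.fit(X_reduced_train, y_train)
print(rf.score(X_reduced_test, y_test))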
However, in my experience I could never make a random forest work better than a well-tuned linear model (such as logistic regression with a grid-searched regularization parameter) on the original sparse data (possibly with TF-IDF normalization).
Option 1:
"If the number of variables is very large, forests can be run once with all the variables, then run again using only the most important variables from the first run."
from: http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#giniimp
I'm not sure whether the random forest in sklearn has a feature importance option. The random forest in R implements both mean decrease in Gini impurity and mean decrease in accuracy.
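If sklearn's importances are available (RandomForestClassifier does expose a feature_importances_ attribute based on Gini impurity), the two-pass idea could look like this sketch, assuming a dense X_train_array and y_train as in the first answer above:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
rf_full = RandomForestClassifier(n_estimators=100).fit(X_train_array, y_train)
importances = rf_full.feature_importances_      # Gini-based importance per feature
top_idx = np.argsort(importances)[::-1][:1000]  # keep the 1000 most important features
rf_small = RandomForestClassifier(n_estimators=100).fit(X_train_array[:, top_idx], y_train)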
Option 2:
Do dimensionality reduction. Use PCA or another dimensionality-reduction technique to turn the N-dimensional matrix into a smaller one, and then use this smaller, less sparse matrix for the classification problem.
Option 3:
Drop correlated features. I believe random forest is supposed to be more robust to correlated features than multinomial logistic regression. That said, you may well have a number of correlated features. If you have many pairwise-correlated variables, you can drop one of each pair and, in theory, not lose predictive power. Beyond pairwise correlation there is also multicollinearity; check out: http://en.wikipedia.org/wiki/Variance_inflation_factor

Naive Bayes probability always 1

I started using sklearn.naive_bayes.GaussianNB for text classification, and have been getting fine initial results. I want to use the probability returned by the classifier as a measure of confidence, but the predict_proba() method always returns "1.0" for the chosen class, and "0.0" for all the rest.
I know (from here) that "...the probability outputs from predict_proba are not to be taken too seriously", but to that extent?!
The classifier can confuse finance with investing, or chords with strings, but the predict_proba() output shows no sign of hesitation...
A little about the context:
- I've been using sklearn.feature_extraction.text.TfidfVectorizer for feature extraction without, for a start, restricting the vocabulary with stop_words or min/max_df, so I have been getting very large vectors.
- I've been training the classifier on a hierarchical category tree (shallow: no more than 3 levels deep) with 7 manually categorized texts per category. For now the training is flat: I am not taking the hierarchy into account.
The resulting GaussianNB object is very big (~300MB), and prediction is rather slow: around 1 second for one text.
Can this be related? Are the huge vectors at the root of all this?
How do I get meaningful predictions? Do I need to use a different classifier?
Here's the code I'm using:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB
import numpy as np
from sklearn.externals import joblib
Vectorizer = TfidfVectorizer(input = 'content')
vecs = Vectorizer.fit_transform(TextsList) # ~2000 strings
joblib.dump(Vectorizer, 'Vectorizer.pkl')
gnb = GaussianNB()
Y = np.array(TargetList) # ~2000 categories
gnb.fit(vecs.toarray(), Y)
joblib.dump(gnb, 'Classifier.pkl')
...
#In a different function:
Vectorizer = joblib.load('Vectorizer.pkl')
Classifier = joblib.load('Classifier.pkl')
InputList = [Text] # One string
Vec = Vectorizer.transform(InputList)
Probs = Classifier.predict_proba([Vec.toarray()[0]])[0]
MaxProb = max(Probs)
MaxProbIndex = np.where(Probs==MaxProb)[0][0]
Category = Classifier.classes_[MaxProbIndex]
result = (Category, MaxProb)
Update:
Following the advice below, I tried MultinomialNB and LogisticRegression. They both return varying probabilities and are better in every way for my task: much more accurate classification, smaller objects in memory, and much better speed (MultinomialNB is lightning fast!).
I now have a new problem: the returned probabilities are very small, typically in the range 0.004-0.012, even for the predicted/winning category (and the classification is accurate).
"...the probability outputs from predict_proba are not to be taken too seriously"
I'm the guy who wrote that. The point is that naive Bayes tends to predict probabilities that are almost always either very close to zero or very close to one; exactly the behavior you observe. Logistic regression (sklearn.linear_model.LogisticRegression or sklearn.linear_model.SGDClassifier(loss="log")) produces more realistic probabilities.
The resulting GaussianNB object is very big (~300MB), and prediction is rather slow: around 1 second for one text.
That's because GaussianNB is a non-linear model and does not support sparse matrices (which you found out already, since you're using toarray). Use MultinomialNB, BernoulliNB or logistic regression instead; they are much faster at predict time and also smaller in memory. Their assumptions about the input are also more realistic for term features. GaussianNB is really not a good estimator for text classification.
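A minimal sketch of that swap on toy data (the texts and labels here are made up; note that the sparse matrix is passed directly, no toarray needed):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
texts = ["stocks bonds and finance", "guitar chords and strings"]  # toy training texts
labels = ["finance", "music"]
vectorizer = TfidfVectorizer()
vecs = vectorizer.fit_transform(texts)  # stays sparse
clf = MultinomialNB().fit(vecs, labels)
probs = clf.predict_proba(vectorizer.transform(["investing in stocks"]))[0]
print(dict(zip(clf.classes_, probs)))  # graded probabilities rather than hard 0/1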
