Classifying text documents with random forests - python

I have a set of 4k text documents.
They belong to 10 different classes.
I'm trying to see how the random forest method performs classification.
The issue is that my feature extraction class extracts 200k features (a combination of words, bigrams, collocations, etc.).
This is highly sparse data, and the random forest implementation in sklearn does not work with sparse data inputs.
Q. What are my options here? Reduce the number of features? How?
Q. Is there any implementation of random forest out there which works with sparse arrays?
My relevant code is as follows:
import logging
import numpy as np
from optparse import OptionParser
import sys
from time import time
#import pylab as pl
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from special_analyzer import *
data_train = load_files(RAW_DATA_SRC_TR)
data_test = load_files(RAW_DATA_SRC_TS)
# split a training set and a test set
y_train, y_test = data_train.target, data_test.target
vectorizer = CountVectorizer( analyzer=SpecialAnalyzer()) # SpecialAnalyzer is my class extracting features from text
X_train = vectorizer.fit_transform(data_train.data)
rf = RandomForestClassifier(max_depth=10,max_features=10)
rf.fit(X_train,y_train)

Several options: take only the 10000 most popular features by passing max_features=10000 to CountVectorizer and convert the result to a dense numpy array with the toarray method:
X_train_array = X_train.toarray()
Otherwise, reduce the dimensionality to 100 or 300 dimensions with:
from sklearn.decomposition import TruncatedSVD
pca = TruncatedSVD(n_components=300)  # TruncatedSVD, unlike PCA, accepts sparse input directly
X_reduced_train = pca.fit_transform(X_train)
However, in my experience I could never make an RF work better than a well-tuned linear model (such as logistic regression with a grid-searched regularization parameter) on the original sparse data (possibly with TF-IDF normalization).
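A minimal sketch of that baseline, assuming the sparse count matrix from above (the parameter grid and cv=5 are illustrative choices, not the only reasonable ones):
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
# Sparse counts -> TF-IDF -> logistic regression with a grid-searched C.
pipe = Pipeline([('tfidf', TfidfTransformer()),
                 ('clf', LogisticRegression(max_iter=1000))])
grid = GridSearchCV(pipe, {'clf__C': [0.01, 0.1, 1, 10, 100]}, cv=5)
grid.fit(X_train, y_train)  # X_train can stay sparse throughout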

Option 1:
"If the number of variables is very large, forests can be run once with all the variables, then run again using only the most important variables from the first run."
from: http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#giniimp
The random forest in sklearn does expose feature importances (the feature_importances_ attribute, based on mean decrease in impurity). The random forest in R implements both mean decrease in Gini impurity and mean decrease in accuracy.
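A minimal sketch of that two-pass idea in sklearn, assuming a dense X_train_array as above (SelectFromModel and the 'median' threshold are illustrative choices):
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
# First pass: fit on all features, then keep only the most important ones.
rf = RandomForestClassifier(n_estimators=100).fit(X_train_array, y_train)
selector = SelectFromModel(rf, prefit=True, threshold='median')
X_train_small = selector.transform(X_train_array)
# Second pass: refit on the reduced feature set.
rf_small = RandomForestClassifier(n_estimators=100).fit(X_train_small, y_train)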
Option 2:
Do dimensionality reduction. Use PCA or another dimensionality reduction technique to project the dense N-dimensional matrix into a smaller matrix, and then use this smaller, less sparse matrix for the classification problem.
Option 3:
Drop correlated features. I believe random forests are supposed to be more robust to correlated features than multinomial logistic regression. That being said, it could be the case that you have a number of correlated features. If you have a lot of pairwise-correlated variables, you can drop one of the two and, in theory, lose no "predictive power". Beyond pairwise correlation there are also multiple correlations. Check out: http://en.wikipedia.org/wiki/Variance_inflation_factor
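A hedged sketch of the pairwise case (the 0.95 cutoff is arbitrary, and a dense correlation matrix is only feasible after reducing to a few thousand features, e.g. via the max_features option above):
import numpy as np
import pandas as pd
# Drop one feature from each highly correlated pair.
df = pd.DataFrame(X_train_array)
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]
df_reduced = df.drop(columns=to_drop)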

Related

Data scaling before calling SMOTENC for continuous and categorical features

So far, my code to run SMOTENC is the following:
from imblearn.over_sampling import SMOTENC
smt = SMOTENC(random_state=seed, categorical_features=list(range(10, 54)), ratio=1.0, n_jobs=-1)
# n_jobs = The number of threads to open if possible. ``-1`` means using all processors.
# default K=5
X_res, y_res = smt.fit_sample(X_tra, y_tra)
The problem is that, from what I was reading about SMOTE, since it uses the KNN algorithm with Euclidean distance, the data should be scaled before calling SMOTENC().
If the dataset has the first 10 features as integers and the rest as categorical ones, how should I do the scaling in this situation?
I suggest scaling the numerical/continuous features using either MinMaxScaler or StandardScaler, depending on the distribution of your data.
As for the categorical features, isn't that the point of choosing SMOTENC over the SMOTE algorithm?
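A minimal sketch, assuming X_tra is a NumPy array whose first 10 columns are the continuous ones and whose categorical columns are integer-encoded (MinMaxScaler is one of the two suggested options):
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from imblearn.over_sampling import SMOTENC
# Scale only the continuous columns; leave the categorical columns untouched.
X_scaled = X_tra.astype(float)
X_scaled[:, :10] = MinMaxScaler().fit_transform(X_scaled[:, :10])
smt = SMOTENC(random_state=seed, categorical_features=list(range(10, 54)), ratio=1.0, n_jobs=-1)
X_res, y_res = smt.fit_sample(X_scaled, y_tra)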

Isolation Forest Sklearn for 1D array or list and how to tune hyperparameters

Is there a way to implement sklearn's isolation forest for a 1D array or list? All the examples I came across are for data of 2 dimensions or more.
I have developed a model with three features, and the example code snippet is below:
# dataframe of three columns
df_data = datafr[['col_A', 'col_B', 'col_C']]
w_train = df_data[:700]
w_test = df_data[700:-2]
from sklearn.ensemble import IsolationForest
# fit the model
clf = IsolationForest(max_samples='auto')
clf.fit(w_train)
#testing it using test set
y_pred_test = clf.predict(w_test)
The reference I mainly relied upon: IsolationForest example | scikit-learn
df_data is a data frame with three columns. I am actually looking to find outliers in 1-dimensional or list data.
The other question is how to tune an isolation forest model. One way is to increase the contamination value to reduce false positives, but how should the other parameters, like n_estimators, max_samples, max_features, verbose, etc., be used?
It won't make sense to apply an isolation forest to a 1D array or list, because in that case it would simply be a one-to-one mapping from feature to target.
You can read the official documentation to get a better idea of how the different parameters help.
contamination
The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.
Try experimenting with different values in the range [0, 0.5] to see which one gives the best results.
max_features
The number of features to draw from X to train each base estimator.
Try values like 5, 6, 10, etc., any int of your choice, and validate with the final test data.
n_estimators: try multiple values like 10, 20, 50, etc. to see which works best.
You can also use GridSearchCV to automate this process of parameter estimation: just experiment with different values and see which combination gives the best results.
Try this
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score, make_scorer
my_scoring_func = make_scorer(f1_score)
parameters = {'n_estimators': [10, 30, 50, 80],
              'max_features': [0.1, 0.2, 0.3, 0.4],
              'contamination': [0.1, 0.2, 0.3]}
iso_for = IsolationForest(max_samples='auto')
clf = GridSearchCV(iso_for, parameters, scoring=my_scoring_func)
Then use clf to fit the data. Note that GridSearchCV requires both X and y (i.e. train data and labels) for its fit method.
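A short usage sketch (X_train and y_train are hypothetical here; for make_scorer(f1_score) to be meaningful, y_train should use the same 1/-1 inlier/outlier convention that IsolationForest.predict returns):
# Fit the grid search on labelled data and inspect the winning parameters.
clf.fit(X_train, y_train)
print(clf.best_params_)
print(clf.best_score_)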
Note: you can read this blog post for further reference if you wish to use GridSearchCV with isolation forests; otherwise, you can manually try different values and plot graphs to see the results.

How can I set sub-sample size in Random Forest Classifier in Scikit-Learn? Especially for imbalanced data

Currently, I am implementing RandomForestClassifier in sklearn for my imbalanced data. I am not very clear about how RF works in sklearn exactly. Here are my concerns:
According to the docs, there seems to be no way to set the sub-sample size (i.e. smaller than the original data size) for each tree learner. But in the random forest algorithm we need to draw both a subset of samples and a subset of features for each tree. I am not sure whether we can achieve that via sklearn. If yes, how?
Following is the description of RandomForestClassifier in sklearn:
"A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and use averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is always the same as the original input sample size but the samples are drawn with replacement if bootstrap=True (default)."
I found a similar question before, but it did not get many answers:
How can SciKit-Learn Random Forest sub sample size may be equal to original training data size?
For imbalanced data, if we could do sub-sample pick-up via sklearn (i.e. solve question #1 above), could we build a balanced random forest? That is, for each tree learner, pick a subset from the less-populated class and the same number of samples from the more-populated class, making a training set with an equal distribution of the two classes, and then repeat the process for each of the trees.
Thank you!
Cheng
There is no obvious way, but you can hack into the sampling method in sklearn.ensemble.forest.
(Updated on 2021-04-23, as I found sklearn refactored the code.)
By using set_rf_samples(n), you can force the tree to sub-sample n rows, and call reset_rf_samples() to sample the whole dataset.
for version < 0.22.0
from sklearn.ensemble import forest

def set_rf_samples(n):
    """Change scikit-learn's random forests to give each tree a random sample
    of n random rows.
    """
    forest._generate_sample_indices = (lambda rs, n_samples:
        forest.check_random_state(rs).randint(0, n_samples, n))

def reset_rf_samples():
    """Undo the changes produced by set_rf_samples.
    """
    forest._generate_sample_indices = (lambda rs, n_samples:
        forest.check_random_state(rs).randint(0, n_samples, n_samples))
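A short usage sketch (the sample size of 1000 and n_estimators=40 are illustrative):
from sklearn.ensemble import RandomForestClassifier
set_rf_samples(1000)                        # each tree now sees 1000 random rows
rf = RandomForestClassifier(n_estimators=40)
rf.fit(X_train, y_train)
reset_rf_samples()                          # restore the default bootstrap size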
for version >=0.22.0
There is now a parameter available https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
max_samples: int or float, default=None
If bootstrap is True, the number of samples to draw from X to train each base estimator.
If None (default), then draw X.shape[0] samples.
If int, then draw max_samples samples.
If float, then draw max_samples * X.shape[0] samples. Thus, max_samples should be in the interval (0, 1).
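A hedged sketch of how this helps with the imbalance question: class_weight='balanced_subsample' reweights classes within each bootstrap (imbalanced-learn's BalancedRandomForestClassifier, which actually resamples each class per tree, is an alternative):
from sklearn.ensemble import RandomForestClassifier
# sklearn >= 0.22: each tree is fit on a bootstrap of 1000 rows, with class
# weights recomputed per bootstrap to counter the imbalance.
rf = RandomForestClassifier(n_estimators=100, bootstrap=True,
                            max_samples=1000,
                            class_weight='balanced_subsample')
rf.fit(X_train, y_train)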
reference: fast.ai Machine Learning Course

Multiclass classification using Gaussian Mixture Models with scikit learn

I am trying to use sklearn.mixture.GaussianMixture for classification of pixels in a hyperspectral image. There are 15 classes (1-15). I tried using the method from http://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_covariances.html, where the means are initialized with means_init. I tried this as well, but my accuracy is poor (about 10%). I also tried changing the covariance type, threshold, maximum iterations and number of initializations, but the results are the same.
Am I doing this correctly? Please provide input.
import numpy as np
from sklearn.mixture import GaussianMixture
import scipy.io as sio
from sklearn.model_selection import train_test_split
uh_data =sio.loadmat('/Net/hico/data/users/nikhil/contest_uh_casi.mat')
data = uh_data['contest_uh_casi']
uh_labels = sio.loadmat('/Net/hico/data/users/nikhil/contest_gt_tr.mat')
labels = uh_labels['contest_gt_tr']
reshaped_data = np.reshape(data,(data.shape[0]*data.shape[1],data.shape[2]))
print('reshaped data:', reshaped_data.shape)
reshaped_label = np.reshape(labels,(labels.shape[0]*labels.shape[1],-1))
print('reshaped label:', reshaped_label.shape)
con_data = np.hstack((reshaped_data,reshaped_label))
pre_data = con_data[con_data[:,144] > 0]
total_data = pre_data[:,0:144]
total_label = pre_data[:,144]
train_data, test_data, train_label, test_label = train_test_split(total_data, total_label, test_size=0.30, random_state=42)
classifier = GaussianMixture(n_components = 15 ,covariance_type='diag',max_iter=100,random_state = 42,tol=0.1,n_init = 1)
classifier.means_init = np.array([train_data[train_label == i].mean(axis=0)
                                  for i in range(1, 16)])
classifier.fit(train_data)
pred_lab_train = classifier.predict(train_data)
train_accuracy = np.mean(pred_lab_train.ravel() == train_label.ravel())*100
print('train accuracy:', train_accuracy)
pred_lab_test = classifier.predict(test_data)
test_accuracy = np.mean(pred_lab_test.ravel()==test_label.ravel())*100
print('test accuracy:', test_accuracy)
My data has 66485 pixels with 144 features each. I also tried applying feature reduction techniques like PCA, LDA, KPCA, etc., but the results are still the same.
Gaussian Mixture is not a classifier. It is a density estimation method, and expecting that its components will magically align with your classes is not a good idea. You should try actual supervised techniques, since you clearly have access to labels. Scikit-learn offers lots of these, including Random Forest, KNN, SVM, ...; pick your favourite. GMM simply tries to fit a mixture of Gaussians to your data, but there is nothing forcing it to place them according to the labeling (which is not even provided in the fit call). From time to time this will work, but only for trivial problems where the classes are so well separated that even Naive Bayes would work; in general, it is simply an invalid tool for the problem.
GMM is not a classifier, but a generative model. You can use it for a classification problem by applying Bayes' theorem. It is not true that classification based on GMMs works only for trivial problems; however, it is based on a mixture of Gaussian components, so it fits best problems with high-level features.
Your code incorrectly uses a single GMM as a classifier. You should instead use the GMM as a class-conditional density, i.e. fit one GMM per class and classify with Bayes' theorem.
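A minimal sketch of that per-class approach, reusing the variables from the question (n_components=3 per class is an arbitrary illustrative choice):
import numpy as np
from sklearn.mixture import GaussianMixture
# Fit one GMM per class as a class-conditional density estimate.
classes = np.unique(train_label)
gmms, log_priors = {}, {}
for c in classes:
    Xc = train_data[train_label == c]
    gmms[c] = GaussianMixture(n_components=3, covariance_type='diag').fit(Xc)
    log_priors[c] = np.log(len(Xc) / float(len(train_data)))
# Bayes' theorem: score each test pixel by log p(x|c) + log p(c), take the argmax.
scores = np.stack([gmms[c].score_samples(test_data) + log_priors[c]
                   for c in classes], axis=1)
pred_lab_test = classes[np.argmax(scores, axis=1)]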

Naive Bayes probability always 1

I started using sklearn.naive_bayes.GaussianNB for text classification, and have been getting fine initial results. I want to use the probability returned by the classifier as a measure of confidence, but the predict_proba() method always returns 1.0 for the chosen class and 0.0 for all the rest.
I know (from here) that "...the probability outputs from predict_proba are not to be taken too seriously", but to that extent?!
The classifier can mistake finance for investing or chords for strings, but the predict_proba() output shows no sign of hesitation...
A little about the context:
- I've been using sklearn.feature_extraction.text.TfidfVectorizer for feature extraction without, for a start, restricting the vocabulary with stop_words or min/max_df, so I have been getting very large vectors.
- I've been training the classifier on a hierarchical category tree (shallow: not more than 3 layers deep) with 7 texts (manually categorized) per category. The training is, for now, flat: I am not taking the hierarchy into account.
The resulting GaussianNB object is very big (~300MB), and prediction is rather slow: around 1 second for one text.
Can this be related? Are the huge vectors at the root of all this?
How do I get meaningful predictions? Do I need to use a different classifier?
Here's the code I'm using:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB
import numpy as np
from sklearn.externals import joblib
Vectorizer = TfidfVectorizer(input = 'content')
vecs = Vectorizer.fit_transform(TextsList) # ~2000 strings
joblib.dump(Vectorizer, 'Vectorizer.pkl')
gnb = GaussianNB()
Y = np.array(TargetList) # ~2000 categories
gnb.fit(vecs.toarray(), Y)
joblib.dump(gnb, 'Classifier.pkl')
...
#In a different function:
Vectorizer = joblib.load('Vectorizer.pkl')
Classifier = joblib.load('Classifier.pkl')
InputList = [Text] # One string
Vec = Vectorizer.transform(InputList)
Probs = Classifier.predict_proba(Vec.toarray())[0]
MaxProb = max(Probs)
MaxProbIndex = np.where(Probs==MaxProb)[0][0]
Category = Classifier.classes_[MaxProbIndex]
result = (Category, MaxProb)
Update:
Following the advice below, I tried MultinomialNB & LogisticRegression. They both return varying probabilities, and are better in every way for my task: much more accurate classification, smaller objects in memory, and much better speed (MultinomialNB is lightning fast!).
I now have a new problem: the returned probabilities are very small, typically in the range 0.004-0.012. This is for the predicted/winning category (and the classification is accurate).
"...the probability outputs from predict_proba are not to be taken too seriously"
I'm the guy who wrote that. The point is that naive Bayes tends to predict probabilities that are almost always either very close to zero or very close to one; exactly the behavior you observe. Logistic regression (sklearn.linear_model.LogisticRegression or sklearn.linear_model.SGDClassifier(loss="log")) produces more realistic probabilities.
The resulting GaussianNB object is very big (~300MB), and prediction is rather slow: around 1 second for one text.
That's because GaussianNB is a non-linear model and does not support sparse matrices (which you found out already, since you're using toarray). Use MultinomialNB, BernoulliNB or logistic regression, which are much faster at predict time and also smaller. Their assumptions wrt. the input are also more realistic for term features. GaussianNB is really not a good estimator for text classification.
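A minimal sketch of the suggested switch, reusing the names from the question (TextsList, Y, Text); MultinomialNB accepts the sparse TF-IDF matrix directly, so no toarray() calls are needed:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
vectorizer = TfidfVectorizer(input='content')
vecs = vectorizer.fit_transform(TextsList)      # stays sparse
mnb = MultinomialNB().fit(vecs, Y)
# Probabilities now vary instead of saturating at 0/1.
probs = mnb.predict_proba(vectorizer.transform([Text]))[0]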
