Troubleshooting a random forest classifier in scikit-learn - Python

I am trying to run the random forest classifier from scikit-learn and getting suspiciously bad output - less than 1% of predictions are correct, i.e. the model is performing much worse than chance. I am relatively new to Python, ML, and scikit-learn (a triple whammy), and my concern is that I am missing something fundamental rather than needing to fine-tune the parameters. What I'm hoping for is that more veteran eyes will look through the code and see if something is wrong with the setup.
I'm trying to predict classes for rows in a spreadsheet based on word occurrences - so the input for each row is an array representing how many times each word appears, e.g. [1 0 0 2 0 ... 1]. I am using scikit-learn's CountVectorizer to do this processing - I feed it strings containing the words in each row, and it outputs the word occurrence array(s). If this input isn't suitable for some reason, that is probably where things are going awry, but I haven't found anything online or in the documentation suggesting that's the case.
Right now, the forest is answering correctly about 0.5% of the time. Using the exact same inputs with an SGD classifier yields close to 80%, which suggests to me that the preprocessing and vectorizing I'm doing is fine - it's something specific to the RF classifier. My first reaction was to look for overfitting, but even when I run the model on the training data, it still gets almost everything wrong.
I've played around with the number of trees and the amount of training data, but that hasn't seemed to change much for me. I'm trying to show only the relevant code but can post more if that's helpful. First SO post, so all thoughts and feedback are appreciated.
# pull in package to create word occurrence vectors for each line
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=1, charset_error='ignore')  # note: charset_error was later renamed decode_error
X_train = vectorizer.fit_transform(train_file)

# convert to a dense array, the required input type for the random forest classifier
X_train = X_train.todense()

# pull in random forest classifier and train on the data
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100, compute_importances=True)
clf = clf.fit(X_train, train_targets)

# transform the test data into the vector format
testdata = vectorizer.transform(test_file)
testdata = testdata.todense()

# export the predictions to CSV
import csv
with open('output.csv', 'wb') as csvfile:
    spamwriter = csv.writer(csvfile)
    for item in clf.predict(testdata):
        spamwriter.writerow([item])
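For reference, a minimal way to check accuracy on the training data itself (a short sketch reusing clf, X_train and train_targets from above) is:
# mean accuracy of the fitted forest on its own training data;
# for a random forest this is normally very high
print(clf.score(X_train, train_targets))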

If Random Forest (RF) does that badly even on the training set X_train, then something is definitely wrong, because on its own training data you should get a huge percentage correct, above 90%.
Try the following (code snippet first):
print "K-means"
clf = KMeans(n_clusters=len(train_targets), n_init=1000, n_jobs=2)
print "Gaussian Mixtures: full covariance"
covar_type = 'full' # 'spherical', 'diag', 'tied', 'full'
clf = GMM(n_components=len(train_targets), covariance_type=covar_type, init_params='wc', n_iter=10000)
print "VBGMM: full covariance"
covar_type = 'full' # 'spherical', 'diag', 'tied', 'full'
clf = VBGMM(n_components=len(train_targets), covariance_type=covar_type, alpha=1.0, random_state=None, thresh=0.01, verbose=False, min_covar=None, n_iter=1000000, params='wc', init_params='wc')
print "Random Forest"
clf = RandomForestClassifier(n_estimators=400, criterion='entropy', n_jobs=2)
print "MultiNomial Logistic Regression"
clf = LogisticRegression(penalty='l2', dual=False, C=1.0, fit_intercept=True, intercept_scaling=1, tol=0.0001)
print "SVM: Gaussian Kernel, infty iterations"
clf = SVC(C=1.0, kernel='rbf', degree=3, gamma=3.0, coef0=1.0, shrinking=True,
probability=False, tol=0.001, cache_size=200, class_weight=None,
verbose=False, max_iter=-1, random_state=None)
Try different classifiers (the interface in scikit-learn is basically always the same) and see how they behave - maybe RF is not really the best choice for your data. See the code above.
Try creating some randomly generated datasets to give to the RF classifier; I strongly suspect something goes wrong in the mapping process that generates the vectorizer objects. So start by creating your own X_train by hand and see what happens.
Hope that helps.
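For example, such a sanity check could look like this (a sketch with made-up random data; the shapes and the binary labels are arbitrary):
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# random count-like features and random binary labels,
# shaped like the vectorizer output (n_samples x n_features)
rng = np.random.RandomState(0)
X_fake = rng.randint(0, 3, size=(200, 50))
y_fake = rng.randint(0, 2, size=200)

clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_fake, y_fake)
# even on random labels a forest should nearly memorize its training data;
# if this score is low, the problem is in the environment, not in the vectorizer
print(clf.score(X_fake, y_fake))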

Related

Can we test the Naive Bayes algorithm with unlabeled data?

I have trained a model on labeled data with the Naive Bayes algorithm, tested it with another set of labeled data, and calculated accuracy, precision and recall scores using the code below.
My code:
from sklearn.metrics import confusion_matrix
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from io import open
def load_data(filename):
    reviews = list()
    labels = list()
    with open(filename, encoding='utf-8') as file:
        file.readline()  # skip the first line
        for line in file:
            line = line.strip().split(' ', 1)
            labels.append(line[0])
            reviews.append(line[1])
    return reviews, labels
X_train, y_train = load_data('./train_data.txt')
X_test, y_test = load_data('./test_data.txt')
vec = CountVectorizer()
X_train_transformed = vec.fit_transform(X_train)
X_test_transformed = vec.transform(X_test)
clf= MultinomialNB()
clf.fit(X_train_transformed, y_train)
score = clf.score(X_test_transformed, y_test)
print("score of Naive Bayes algo is :" , score)
y_pred = clf.predict(X_test_transformed)
print(confusion_matrix(y_test,y_pred))
print("Precision Score : ",precision_score(y_test, y_pred,average='micro'))
print("Recall Score : ",recall_score(y_test, y_pred,average='micro'))
But now I have another test set which contains unlabeled data. Can I test the model with this unlabeled data using the above code?
This is what I could interpret from your question.
You have trained a Naive Bayes model using train data, tested it using test data, and used the confusion matrix and accuracy as metrics to measure the performance of the model.
Now your question may be:
Using this model, is it possible to predict labels for unseen data that doesn't have any labels?
If that is your question, then yes, it is possible. In fact, that is the reason you trained the model in the first place: to predict labels on unseen data.
Since the unseen data doesn't have labels, how do you know the predicted labels are correct? You can't check that directly; that is exactly why you tested the model with labeled test data and measured its performance first. If the accuracy of the model is 70%, then you can expect it to predict correctly about 70% of the time.
I strongly suggest thinking about why you are doing what you are doing before you start doing it.
If you want to automatically calculate the accuracy of the model on unlabeled data, then the answer is no.
To create a confusion matrix and compute these metrics you need to pass the y label variable. Good practice is to split your labeled data into training and test sets.
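As a concrete illustration, predicting on the unlabeled set could look like this (a sketch reusing vec and clf from the code above; the filename and the assumption that each line contains only the review text are hypothetical):
# load the unlabeled reviews - each line is assumed to contain only the text
with open('./unlabeled_data.txt', encoding='utf-8') as file:
    unlabeled_reviews = [line.strip() for line in file]

# transform with the already-fitted vectorizer, then predict
X_unlabeled = vec.transform(unlabeled_reviews)
predicted_labels = clf.predict(X_unlabeled)
print(predicted_labels[:10])

# note: without true labels there is nothing to pass to confusion_matrix,
# precision_score or recall_score, so no accuracy can be computed here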

Predict training data in sklearn

I use scikit-learn's SVM like so:
clf = svm.SVC()
clf.fit(td_X, td_y)
My question is: when I use the classifier to predict the class of a member of the training set, could the classifier ever be wrong, even in scikit-learn's implementation? (e.g. could clf.predict(td_X[a]) ever differ from td_y[a]?)
Yes definitely, run this code for example:
from sklearn import svm
import numpy as np
clf = svm.SVC()
np.random.seed(seed=42)
x=np.random.normal(loc=0.0, scale=1.0, size=[100,2])
y=np.random.randint(2,size=100)
clf.fit(x,y)
print(clf.score(x,y))
The score is 0.61, so nearly 40% of the training data is misclassified. Part of the reason is that even though the default kernel is 'rbf' (which in theory should be able to classify any training data set perfectly, as long as you don't have two identical training points with different labels), there is also regularization to reduce overfitting. The default regularization strength is C=1.0.
If you run the same code as above but switch clf = svm.SVC() to clf = svm.SVC(C=200000), you'll get an accuracy of 0.94.
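To see the effect of the regularization strength directly, you could sweep C (a small sketch extending the example above; the particular C values are arbitrary):
for C in [1.0, 100.0, 10000.0, 200000.0]:
    clf = svm.SVC(C=C)
    clf.fit(x, y)
    # training accuracy generally rises as the regularization weakens
    print(C, clf.score(x, y))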

How can I explain this drop in performance on test data?

I am asking the question here, even though I hesitated to post it on CrossValidated (or DataScience) StackExchange. I have a dataset of 60 labeled objects (to be used for training) and 150 unlabeled objects (for test). The aim of the problem is to predict the labels of the 150 objects (this used to be given as a homework problem). For each object, I computed 258 features. Considering each object as a sample, I have X_train : (60,258), y_train : (60,) (labels of the objects used for training) and X_test : (150,258). Since the solution of the homework problem was given, I also have the true labels of the 150 objects, in y_test : (150,).
In order to predict the labels of the 150 objects, I choose to use a LogisticRegression (the Scikit-learn implementation). The classifier is trained on (X_train, y_train), after the data has been normalized, and used to make predictions for the 150 objects. Those predictions are compared to y_test to assess the performance of the model. For reproducibility, I copy the code I have used.
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score, cross_val_predict
# Fit classifier
LogReg = LogisticRegression(C=1, class_weight='balanced')
scaler = StandardScaler()
clf = make_pipeline(StandardScaler(), LogReg)
LogReg.fit(X_train, y_train)
# Performance on training data
CV_score = cross_val_score(clf, X_train, y_train, cv=10, scoring='roc_auc')
print(CV_score)
# Performance on test data
probas = LogReg.predict_proba(X_test)[:, 1]
AUC = metrics.roc_auc_score(y_test, probas)
print(AUC)
The matrices X_train, y_train, X_test and y_test are saved in a .mat file available at this link. My problem is the following:
Using this approach, I get a good performance on training data (CV_score = 0.8), but the performance on the test data is much worse: AUC = 0.54 for C=1 in LogReg and AUC = 0.40 for C=0.01. How can I get AUC < 0.5 if a naive classifier should score AUC = 0.5? Is this due to the fact that I have a small number of samples for training?
I have noticed that the performance on test data improves if I change the code to:
y_pred = cross_val_predict(clf, X_test, y_test, cv=5)
AUC = metrics.roc_auc_score(y_test, y_pred)
print(AUC)
Indeed, AUC = 0.87 for C=1 and 0.9 for C=0.01. Why is the AUC score so much better using cross-validation predictions? Is it because cross-validation makes predictions on subsets of the test data that do not contain the objects/samples which decrease the AUC?
Looks like you are encountering an overfitting problem, i.e. the classifier trained on the training data is overfitting it and has poor generalization ability. That is why the performance on the testing dataset isn't good.
cross_val_predict is actually training the classifier using part of your testing data and then predicting on the rest. So the performance is much better.
Overall, there seems to be quite some difference between your training and testing datasets, so the classifier with the highest training accuracy doesn't work well on your testing set.
Another point not directly related to your question: since the number of your training samples is much smaller than the feature dimension, it may help to perform dimensionality reduction before feeding the data to the classifier.
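For instance, a dimensionality-reduction step could be added in front of the classifier like this (a sketch only; the choice of PCA and of 20 components is arbitrary):
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# standardize, compress the 258 features into a few components, then classify
clf_pca = make_pipeline(StandardScaler(), PCA(n_components=20),
                        LogisticRegression(C=1, class_weight='balanced'))
print(cross_val_score(clf_pca, X_train, y_train, cv=10, scoring='roc_auc').mean())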
It looks like your training and test processes are inconsistent. Although, judging from your code, you intend to standardize your data, you fail to do so during testing. What I mean:
clf = make_pipeline(StandardScaler(), LogReg)
LogReg.fit(X_train, y_train)
Although you define a pipeline, you do not fit the pipeline (clf.fit) but only the logistic regression. This matters, because your cross-validated score is calculated with the pipeline (CV_score = cross_val_score(clf, X_train, y_train, cv=10, scoring='roc_auc')), but at test time, instead of using the pipeline to predict as intended, you use only LogReg, hence the test data are not standardized.
The second point you raise is different. In y_pred = cross_val_predict(clf, X_test, y_test, cv=5) you get predictions by doing cross-validation on the test data alone, while ignoring the train data. Here you do standardize the data, since you use clf, and thus your score is high; this is evidence that the standardization step matters.
To summarize, I believe that standardizing the test data will improve your test score.
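Concretely, the corrected training and test steps would look something like this (a sketch based on the variables already defined in the question):
# fit the whole pipeline, so the scaler is learned from the training data
clf.fit(X_train, y_train)

# the pipeline now standardizes X_test with the training statistics before predicting
probas = clf.predict_proba(X_test)[:, 1]
AUC = metrics.roc_auc_score(y_test, probas)
print(AUC)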
Firstly, it makes little sense to have 258 features for 60 training items. Secondly, cv=10 with 60 items means you split the data into 10 train/test folds, each with only 6 items in its test fold, so whatever results you obtain will be close to useless. You need more training data and fewer features.

How to properly use sklearn to predict the error of a fit

I'm using sklearn to fit a linear regression model to some data. In particular, my response variable is stored in an array y and my features in a matrix X.
I train a linear regression model with the following piece of code
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X,y)
and everything seems to be fine.
Then let's say I have some new data X_new and I want to predict the response variable for them. This can easily done by doing
predictions = model.predict(X_new)
My question is: what is the error associated with this prediction?
From my understanding, I should compute the mean squared error of the model:
from sklearn.metrics import mean_squared_error
model_mse = mean_squared_error(model.predict(X),y)
And basically my real predictions for the new data should be random numbers drawn from a Gaussian distribution with mean equal to predictions and sigma^2 = model_mse. Do you agree with this, and do you know if there's a faster way to do this in sklearn?
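To make that concrete, the approach described above would look roughly like this (a sketch reusing model, X, y and X_new; whether modelling the prediction error this way is appropriate is a separate question):
import numpy as np
from sklearn.metrics import mean_squared_error

# residual variance estimated on the training data
model_mse = mean_squared_error(y, model.predict(X))

# one simulated draw per new observation: the prediction plus Gaussian noise
rng = np.random.default_rng(0)
noisy_predictions = model.predict(X_new) + rng.normal(scale=np.sqrt(model_mse), size=len(X_new))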
You probably want to validate your model on a held-out part of your training data set. I would suggest exploring the cross-validation submodule sklearn.cross_validation (replaced by sklearn.model_selection in newer scikit-learn versions).
The most basic usage is:
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
It depends on your training data.
If its distribution is a good representation of the "real world" and it is of sufficient size (see learning theory, e.g. PAC), then I would generally agree.
That said, if you are looking for a practical way to evaluate your model, why not use the test set as Kris has suggested?
I usually use grid search for optimizing parameters:
from sklearn.model_selection import train_test_split, GridSearchCV

# split into training and test sets (X_data, y_data, indices and clf are
# defined elsewhere; clf is assumed to be a pipeline with a 'logistic' step)
X_train, X_test, y_train, y_test = train_test_split(
    X_data[indices], y_data[indices], test_size=0.25)

# cross-validated grid search over the regularization strength
params = dict(logistic__C=[0.1, 0.3, 1, 3, 10, 30, 100])
grid_search = GridSearchCV(clf, param_grid=params, cv=5)
grid_search.fit(X_train, y_train)

# print scores and best estimator
print('best param: ', grid_search.best_params_)
print('best train score: ', grid_search.best_score_)
print('Test score: ', grid_search.best_estimator_.score(X_test, y_test))
The idea is to hide the test set from your learning algorithm (and from yourself): don't train on it and don't optimize parameters using it.
Finally, you should use the test set for performance evaluation (error) only; it should then provide an unbiased estimate of the MSE.
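Putting the suggestions together for the original regression setting, the evaluation could look like this (a sketch reusing X, y and LinearRegression from the question):
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# hold out a test set, fit on the rest, and report the held-out MSE
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print(mean_squared_error(y_test, model.predict(X_test)))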

What is the theoretical foundation for the scikit-learn dummy classifier?

From the documentation I read that a dummy classifier can be used as a baseline to test a classification algorithm against.
This classifier is useful as a simple baseline to compare with other
(real) classifiers. Do not use it for real problems.
What does the dummy classifier do when it uses the stratified approach? I know that the documentation says that it:
generates predictions by respecting the training set’s class
distribution.
Could anybody give me a more theoretical explanation of why this serves as a check on the performance of the classifier?
The dummy classifier gives you a measure of "baseline" performance--i.e. the success rate one should expect to achieve even if simply guessing.
Suppose you wish to determine whether a given object possesses or does not possess a certain property. If you have analyzed a large number of those objects and have found that 90% contain the target property, then guessing that every future instance of the object possesses the target property gives you a 90% likelihood of guessing correctly. Structuring your guesses this way is equivalent to using the most_frequent method in the documentation you cite.
Because many machine learning tasks attempt to increase the success rate of (e.g.) classification tasks, evaluating the baseline success rate can afford a floor value for the minimal value one's classifier should out-perform. In the hypothetical discussed above, you would want your classifier to get more than 90% accuracy, because 90% is the success rate available to even "dummy" classifiers.
If one trains a dummy classifier with the stratified strategy on the data discussed above, that classifier will predict the target property for roughly 90% of the objects it encounters, drawing each prediction at random according to the training set's class distribution. This is different from training a dummy classifier with the most_frequent strategy, as the latter would guess that all future objects possess the target property. Here's some code to illustrate:
from sklearn.dummy import DummyClassifier
import numpy as np

two_dimensional_values = []
class_labels = []

for i in range(90):
    two_dimensional_values.append([1, 1])
    class_labels.append(1)

for i in range(10):
    two_dimensional_values.append([0, 0])
    class_labels.append(0)

# now 90% of the training data contains the target property
X = np.array(two_dimensional_values)
y = np.array(class_labels)

# train a dummy classifier to make predictions based on the most_frequent class value
dummy_classifier = DummyClassifier(strategy="most_frequent")
dummy_classifier.fit(X, y)

# this produces 100 predictions that say "1"
for i in two_dimensional_values:
    print(dummy_classifier.predict([i]))

# train a dummy classifier to make predictions based on the class distribution
new_dummy_classifier = DummyClassifier(strategy="stratified")
new_dummy_classifier.fit(X, y)

# this produces roughly 90 guesses that say "1" and roughly 10 guesses that say "0"
for i in two_dimensional_values:
    print(new_dummy_classifier.predict([i]))
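As a quick follow-up check (a small sketch reusing X, y and the two classifiers above), comparing their accuracy on the same data shows the expected gap:
# most_frequent always predicts "1", so it scores exactly 0.9 on this data;
# stratified guesses randomly with 90/10 odds, so on average it scores about
# 0.9*0.9 + 0.1*0.1 = 0.82 (the exact value varies from run to run)
print(dummy_classifier.score(X, y))
print(new_dummy_classifier.score(X, y))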
A major motivation for the dummy classifier is the F-score when the positive class is in the minority (i.e. imbalanced classes). This classifier is used as a sanity test for an actual classifier. The dummy classifier completely ignores the input data; in the case of the most_frequent strategy, it simply predicts the most frequently occurring label.
Using the docs: to illustrate DummyClassifier, first let's create an imbalanced dataset:
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> y[y != 1] = -1
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
Next, let’s compare the accuracy of SVC and most_frequent:
>>> from sklearn.dummy import DummyClassifier
>>> from sklearn.svm import SVC
>>> clf = SVC(kernel='linear', C=1).fit(X_train, y_train)
>>> clf.score(X_test, y_test)
0.63...
>>> clf = DummyClassifier(strategy='most_frequent',random_state=0)
>>> clf.fit(X_train, y_train)
DummyClassifier(constant=None, random_state=0, strategy='most_frequent')
>>> clf.score(X_test, y_test)
0.57...
We see that SVC doesn’t do much better than a dummy classifier. Now, let’s change the kernel:
>>> clf = SVC(gamma='scale', kernel='rbf', C=1).fit(X_train, y_train)
>>> clf.score(X_test, y_test)
0.97...
We see that the accuracy was boosted to almost 100%, so this model is genuinely better than the dummy baseline.
