I am using the 20 newsgroups dataset from scikit-learn for reproducibility. When I train an SVM model and then perform data cleaning by removing headers, footers and quotes, the accuracy decreases. Isn't it supposed to improve with data cleaning? What is the point of doing all that and then getting worse accuracy?
I have created this example with data cleaning to help you understand what I am referring to:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
categories = ['alt.atheism', 'comp.graphics']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True,
                                      random_state=2017, remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True,
                                     random_state=2017, remove=('headers', 'footers', 'quotes'))
y_train = newsgroups_train.target
y_test = newsgroups_test.target
vectorizer = TfidfVectorizer(sublinear_tf=True, smooth_idf=True, max_df=0.5, ngram_range=(1, 2), stop_words='english')
X_train = vectorizer.fit_transform(newsgroups_train.data)
X_test = vectorizer.transform(newsgroups_test.data)
from sklearn.svm import SVC
from sklearn import metrics
clf = SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma=1, kernel='rbf', max_iter=-1,
probability=False, random_state=None, shrinking=True, tol=0.001,
verbose=False)
clf = clf.fit(X_train, y_train)
y_train_pred = clf.predict(X_train)
y_test_pred = clf.predict(X_test)
print('Train accuracy_score: ', metrics.accuracy_score(y_train, y_train_pred))
print('Test accuracy_score: ', metrics.accuracy_score(newsgroups_test.target, y_test_pred))
print("-"*12)
print("Train Metrics: ", metrics.classification_report(y_train, y_train_pred))
print("-"*12)
print("Test Metrics: ", metrics.classification_report(newsgroups_test.target, y_test_pred))
Results before data cleaning:
Train accuracy_score: 1.0
Test accuracy_score: 0.9731638418079096
Results after data cleaning:
Train accuracy_score: 0.9887218045112782
Test accuracy_score: 0.9209039548022598
It is not necessarily your data cleaning. I assume you ran the script twice?
The problem is this line of code:
clf = SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma=1, kernel='rbf', max_iter=-1,
probability=False, random_state=None, shrinking=True, tol=0.001,
verbose=False)
random_state=None: you should fix the random state, e.g. random_state=42, otherwise you cannot reproduce the same result; if you ran this code again right now, you would again get a different result.
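For example, keeping every other argument from your snippet and only fixing the seed:
clf = SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
          decision_function_shape='ovr', degree=3, gamma=1, kernel='rbf',
          max_iter=-1, probability=False, random_state=42,  # fixed seed
          shrinking=True, tol=0.001, verbose=False)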
Edit:
The explanation is on the dataset site itself:
If you implement:
import numpy as np

def show_top10(classifier, vectorizer, categories):
    feature_names = np.asarray(vectorizer.get_feature_names())
    for i, category in enumerate(categories):
        top10 = np.argsort(classifier.coef_[i])[-10:]
        print("%s: %s" % (category, " ".join(feature_names[top10])))
You can now see many things that these features have overfit to:
Almost every group is distinguished by whether headers such as NNTP-Posting-Host: and Distribution: appear more or less often.
Another significant feature involves whether the sender is affiliated with a university, as indicated either by their headers or their signature.
The word “article” is a significant feature, based on how often people quote previous posts like this: “In article [article ID], [name] <[e-mail address]> wrote:”
Other features match the names and e-mail addresses of particular people who were posting at the time.
With such an abundance of clues that distinguish newsgroups, the classifiers barely have to identify topics from text at all, and they all perform at the same high level.
For this reason, the functions that load 20 Newsgroups data provide a parameter called remove, telling it what kinds of information to strip out of each file. remove should be a tuple containing any subset of ('headers', 'footers', 'quotes').
To summarize:
The remove parameter prevents data leakage: it strips information that is present in your training data but will not be available at prediction time, so you have to remove it. Otherwise you get a better score, but only because of clues that will not be there for new data.
Related
I followed this tutorial to create a simple image classification script:
https://blog.hyperiondev.com/index.php/2019/02/18/machine-learning/
train_data = scipy.io.loadmat('extra_32x32.mat')
# extract the images and labels from the dictionary object
X = train_data['X']
y = train_data['y']
X = X.reshape(X.shape[0]*X.shape[1]*X.shape[2],X.shape[3]).T
y = y.reshape(y.shape[0],)
X, y = shuffle(X, y, random_state=42)
....
clf = RandomForestClassifier()
print(clf)
start_time = time.time()
# output of print(clf):
# RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
#             max_depth=None, max_features='auto', max_leaf_nodes=None,
#             min_impurity_split=1e-07, min_samples_leaf=1,
#             min_samples_split=2, min_weight_fraction_leaf=0.0,
#             n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
#             verbose=0, warm_start=False)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf.fit(X_train, y_train)
preds = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test,preds))
It gave me an accuracy of approximately 0.7.
Is there some way to visualize or show where/when/if the model is overfitting? I believe this can be shown by training the model and observing that the training accuracy keeps increasing while the accuracy on the validation data decreases. But how can I do that in the code?
There are multiple ways you can test for overfitting and underfitting. If you want to look specifically at train and test scores and compare them, you can do this with sklearn's cross_validate (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html#sklearn.model_selection.cross_validate). As the documentation describes, it returns a dictionary with train scores (if return_train_score=True is supplied) and test scores for the metrics that you supply.
Sample code:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

model = RandomForestClassifier(n_estimators=1000, random_state=1, criterion='entropy',
                               bootstrap=True, oob_score=True, verbose=1)
cv_dict = cross_validate(model, X, y, return_train_score=True)
You can also simply create a hold-out test set with train_test_split and compare your training and test scores using that test set.
Another option is to use a library like Optuna, which will test various hyperparameters for you and you could use the methods mentioned above.
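If you specifically want to visualize it, a minimal sketch (assuming X and y are the arrays built above; the estimator settings are just placeholders) can use sklearn's learning_curve, which computes cross-validated training and validation scores for increasing amounts of training data. A widening gap between the two curves is the usual sign of overfitting:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# cross-validated train/validation scores for increasing amounts of training data
train_sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=100, random_state=42), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, n_jobs=-1)

plt.plot(train_sizes, train_scores.mean(axis=1), 'o-', label='training score')
plt.plot(train_sizes, val_scores.mean(axis=1), 'o-', label='validation score')
plt.xlabel('training set size')
plt.ylabel('accuracy')
plt.legend()
plt.show()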
After tuning parameters for a classifier, I built a model using the best ones. I decided to perform StratifiedKFold cross-validation. First I split my dataset into train and test sets so that I can check my model on a separate test set. The thing is that, since I applied K-fold cross-validation, the accuracies I get come from the validation folds. Now that I have those performances, I want to test the model on the test set we kept aside, but I don't know how to proceed properly.
I know that an alternative could be to perform the K-fold cross-validation on the entire dataset (X, y), but I decided to keep a test set because I have to build more classifiers.
Here is my code for a classifier:
import numpy as np
from xgboost.sklearn import XGBClassifier
from sklearn.cross_validation import StratifiedKFold
from sklearn.pipeline import Pipeline

kfold = StratifiedKFold(y_train, n_folds=10, random_state=42)

pipe_xgb = Pipeline([('xgb', XGBClassifier(learning_rate=0.01,
                                           n_estimators=5000,
                                           max_depth=4,
                                           min_child_weight=6,
                                           gamma=0,
                                           subsample=0.8,
                                           colsample_bytree=0.8,
                                           objective='binary:logistic',
                                           nthread=4,
                                           scale_pos_weight=2.7,  # ratio of positive and negative classes
                                           seed=42))])

pipe_xgb.fit(X_train, y_train)

scores = []
for k, (train, val) in enumerate(kfold):
    pipe_xgb.fit(X_train[train], y_train[train])        # fit on the training folds
    score = pipe_xgb.score(X_train[val], y_train[val])  # evaluate on the validation fold
    scores.append(score)
    print('Fold: %s, Class dist.: %s, Acc: %.3f' % (k+1, np.bincount(y_train[train]), score))
print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))
I was thinking of making predictions on the test set and computing some metrics with the model I trained above, but I am not sure whether that is the proper way of evaluating on the test set:
This is what I tried on the test set:
y_pred = pipe_xgb.predict(X_test)

from sklearn.metrics import classification_report
target_names = ['class 0', 'class 1']
print(classification_report(y_test, y_pred, target_names=target_names))

from sklearn.metrics import matthews_corrcoef
print('Matthews coefficient')
print()
print(matthews_corrcoef(y_test, pipe_xgb.predict(X_test)))

from sklearn import metrics
print('Confusion matrix')
print(metrics.confusion_matrix(y_test, pipe_xgb.predict(X_test)))
Usually when people set aside a validation set and then go for cross-validation, it's because they want to find the best parameters on the training set. sklearn gives you two nice classes for doing exactly that: GridSearchCV and RandomizedSearchCV. When you use GridSearchCV, you can set refit to True, which results in training the model with the best found parameters on the whole training data; you can then use that refitted model on the set you kept aside.
Also, you don't have to use a Pipeline if there's only one item in it.
A GridSearchCV example taken from here:
>>> from sklearn import svm, datasets
>>> from sklearn.model_selection import GridSearchCV
>>> iris = datasets.load_iris()
>>> parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
>>> svc = svm.SVC()
>>> clf = GridSearchCV(svc, parameters)
>>> clf.fit(iris.data, iris.target)
...
GridSearchCV(cv=None, error_score=...,
estimator=SVC(C=1.0, cache_size=..., class_weight=..., coef0=...,
decision_function_shape='ovr', degree=..., gamma=...,
kernel='rbf', max_iter=-1, probability=False,
random_state=None, shrinking=True, tol=...,
verbose=False),
fit_params=None, iid=..., n_jobs=1,
param_grid=..., pre_dispatch=..., refit=..., return_train_score=...,
scoring=..., verbose=...)
>>> sorted(clf.cv_results_.keys())
...
['mean_fit_time', 'mean_score_time', 'mean_test_score',...
'mean_train_score', 'param_C', 'param_kernel', 'params',...
'rank_test_score', 'split0_test_score',...
'split0_train_score', 'split1_test_score', 'split1_train_score',...
'split2_test_score', 'split2_train_score',...
'std_fit_time', 'std_score_time', 'std_test_score', 'std_train_score'...]
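Applied to your setup, a rough sketch (using the newer sklearn.model_selection imports, and tuning only a couple of XGBoost parameters for illustration; adjust the grid to whatever you actually want to search) could look like this:
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from xgboost.sklearn import XGBClassifier

param_grid = {'max_depth': [3, 4, 5], 'min_child_weight': [1, 6]}
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

grid = GridSearchCV(XGBClassifier(learning_rate=0.01, n_estimators=5000,
                                  objective='binary:logistic', seed=42),
                    param_grid, cv=cv, refit=True)
grid.fit(X_train, y_train)         # cross-validation happens only on the training data

print(grid.best_params_)
print(grid.score(X_test, y_test))  # one final evaluation on the held-out test set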
Hoping to output a clean DataFrame that shows the model name, the parameters used in the model, and the resulting scoring metrics. It would be even better if there were a smarter way to iterate through the metric functions (given their varying parameters). Example picture of what I'm aiming for.
Here's what I have so far:
def train_predict_score(clf, X_train, y_train, X_test, y_test):
    clf = clf.fit(X_train, y_train)
    y_pred_train = clf.predict(X_train)
    y_pred_test = clf.predict(X_test)
    result = []
    result.append(roc_auc_score(y_train, y_pred_train))
    result.append(roc_auc_score(y_test, y_pred_test))
    result.append(cohen_kappa_score(y_train, y_pred_train))
    result.append(cohen_kappa_score(y_test, y_pred_test))
    result.append(f1_score(y_train, y_pred_train, pos_label=1))
    result.append(f1_score(y_test, y_pred_test, pos_label=1))
    result.append(precision_score(y_train, y_pred_train, pos_label=1))
    result.append(precision_score(y_test, y_pred_test, pos_label=1))
    result.append(recall_score(y_train, y_pred_train, pos_label=1))
    result.append(recall_score(y_test, y_pred_test, pos_label=1))
    return result

# Initialize default models
clf1 = LogisticRegression(random_state=0)
clf2 = DecisionTreeClassifier(random_state=0)
clf3 = RandomForestClassifier(random_state=0)
clf4 = GradientBoostingClassifier(random_state=0)

results = []

# Build initial models
for clf in [clf1, clf2, clf3, clf4]:
    result = []
    result.append(clf)  # name and parameters - how can I show all info? it gets truncated
    result.append(train_predict_score(clf, X_train, y_train, X_test, y_test))  # how to parse this out into individual columns?
    results.append(result)

results = pd.DataFrame(results, columns=['clf', 'auc_train', 'auc_test', 'f1_train', 'f1_test', 'prec_train',
                                         'prec_test', 'recall_train', 'recall_test'])
results
Iterating through functions
Because functions are objects, you can make a list out of them and simply iterate over that. So for example:
def add1(x):
    return x + 1

def sub1(x):
    return x - 1

for func in [add1, sub1]:
    print(func(10))
yields
11
9
Getting model name and parameters
As far as I understand, you want to store the name of a model (e.g. LogisticRegression) and its parameters in different columns.
First off, you can get the parameters like this:
clf.get_params()
This returns all model parameters as a dictionary.
For getting the model name, you can take the string representation of the model and split it once on '('. The first element of the resulting list is the name of the model. So
>>>clf
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)
becomes
>>>str(clf).split('(',1)[0]
LogisticRegression
Example
Here is a small example that should do what you want. It trains 3 different classifiers on sklearn's breast_cancer dataset and returns the roc_auc, f1, precision and recall score on both the train- and test-set as a DataFrame:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
#load and split example dataset
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
#classifiers with default parameters
clf1 = LogisticRegression()
clf2 = RandomForestClassifier()
clf3 = SVC()
clf_list = [clf1, clf2, clf3]
results_list = []
for clf in clf_list:
    clf.fit(X_train, y_train)
    res = {}
    # extract the model name from the object string
    res['Model'] = str(clf).split('(', 1)[0]
    # get parameters via get_params() method
    res['Parameters'] = clf.get_params()
    # for every metric, record performance on train and test set
    for metric_score in [roc_auc_score, f1_score, precision_score, recall_score]:
        metric_name = metric_score.__name__
        res[metric_name + '_train'] = metric_score(y_train, clf.predict(X_train))
        res[metric_name + '_test'] = metric_score(y_test, clf.predict(X_test))
    results_list.append(res)

results_df = pd.DataFrame(results_list)
The resulting DataFrame:
print(results_df.to_string())
Model Parameters f1_test f1_train precision_test precision_train recall_test recall_train roc_au_test roc_au_train
0 LogisticRegression {'fit_intercept': True, 'warm_start': False, '... 0.922384 0.969697 0.922384 0.966038 0.922384 0.973384 0.922384 0.959085
1 RandomForestClassifier {'criterion': 'gini', 'warm_start': False, 'n_... 0.928137 0.998095 0.928137 1.000000 0.928137 0.996198 0.928137 0.998099
2 SVC {'decision_function_shape': None, 'verbose': F... 0.500000 1.000000 0.500000 1.000000 0.500000 1.000000 0.500000 1.000000
Note: because you mentioned DataFrame contents being truncated in your question: that happens only for display purposes, for example when you print the DataFrame in a console, as I did above. When you access the respective cells directly, the full content is still there.
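For example (plain pandas, just to illustrate the note above):
# the full parameter dictionary is still stored; only the printed view is shortened
print(results_df.loc[0, 'Parameters'])

# optionally widen the display so long cells are not cut off when printing
pd.set_option('display.max_colwidth', None)  # use -1 on older pandas versions
print(results_df.to_string())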
I have 9164 points, where 4303 are labeled as the class I want to predict and 4861 are labeled as not that class. There are no duplicate points.
Following How to split into train, test and evaluation sets in sklearn?, and since my dataset is a tuple of 3 items (id, vector, label), I do:
df = pd.DataFrame(dataset)
train, validate, test = np.split(df.sample(frac=1), [int(.6*len(df)), int(.8*len(df))])

train_labels = construct_labels(train)
train_data = construct_data(train)

test_labels = construct_labels(test)
test_data = construct_data(test)

def predict_labels(test_data, classifier):
    labels = []
    for test_d in test_data:
        labels.append(classifier.predict([test_d]))
    return np.array(labels)

def construct_labels(df):
    labels = []
    for index, row in df.iterrows():
        if row[2] == 'Trump':
            labels.append('Trump')
        else:
            labels.append('Not Trump')
    return np.array(labels)

def construct_data(df):
    first_row = df.iloc[0]
    data = np.array([first_row[1]])
    for index, row in df.iterrows():
        if first_row[0] != row[0]:
            data = np.concatenate((data, np.array([row[1]])), axis=0)
    return data
and then:
>>> classifier = SVC(verbose=True)
>>> classifier.fit(train_data, train_labels)
[LibSVM].......*..*
optimization finished, #iter = 9565
obj = -2718.376533, rho = 0.132062
nSV = 5497, nBSV = 2550
Total nSV = 5497
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=True)
>>> predicted_labels = predict_labels(test_data, classifier)
>>> correct = 0
>>> for p, t in zip(predicted_labels, test_labels):
...     if p == t:
...         correct = correct + 1
and I get only 943 correct labels out of 1833 (= len(test_labels)) -> (943*100/1833 = 51.4%)
I suspect I am missing something big here; maybe I should set a parameter on the classifier to do more refined work, or something?
Note: this is my first time using SVMs, so anything you might take for granted, I might not have even imagined...
Attempt:
I went ahead and decreased the number of negative examples to 4303 (the same number as positive examples). This slightly improved accuracy.
Edit after the answer:
>>> print(clf.best_estimator_)
SVC(C=1000.0, cache_size=200, class_weight='balanced', coef0=0.0,
decision_function_shape=None, degree=3, gamma=0.0001, kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
>>> classifier = SVC(C=1000.0, cache_size=200, class_weight='balanced', coef0=0.0,
... decision_function_shape=None, degree=3, gamma=0.0001, kernel='rbf',
... max_iter=-1, probability=False, random_state=None, shrinking=True,
... tol=0.001, verbose=False)
>>> classifier.fit(train_data, train_labels)
SVC(C=1000.0, cache_size=200, class_weight='balanced', coef0=0.0,
decision_function_shape=None, degree=3, gamma=0.0001, kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
Also I tried clf.fit(train_data, train_labels), which performed the same.
Edit with data (the data are not random):
>>> train_data[0]
array([ 20.21062112, 27.924016 , 137.13815308, 130.97432804,
... # there are 256 coordinates in total
67.76352596, 56.67798138, 104.89566517, 10.02616417])
>>> train_labels[0]
'Not Trump'
>>> train_labels[1]
'Trump'
Most estimators in scikit-learn, such as SVC, are initialized with a number of input parameters, also known as hyperparameters. Depending on your data, you will have to figure out what to pass as inputs to the estimator during initialization. If you look at the SVC documentation in scikit-learn, you see that it can be initialized using several different input parameters.
For simplicity, let's consider kernel, which can be 'rbf' or 'linear' (among a few other choices), and C, which is a penalty parameter, and suppose you want to try the values 0.01, 0.1, 1, 10, 100 for C. That leads to 10 different possible models to create and evaluate.
One simple solution is to write two nested for-loops, one for kernel and the other for C, create the 10 possible models, and see which one is best. However, if you have several hyperparameters to tune, you have to write several nested for-loops, which can be tedious.
Luckily, scikit-learn has a better way to create different models based on different combinations of values for your hyperparameters and choose the best one. For that, you use GridSearchCV. GridSearchCV is initialized with two things: an instance of an estimator, and a dictionary of hyperparameters with the desired values to examine. It will then create and evaluate all possible models given the choices of hyperparameters and find the best one, so you do not need to write any nested for-loops. Here is an example:
from sklearn.grid_search import GridSearchCV
print("Fitting the classifier to the training set")
param_grid = {'C': [0.01, 0.1, 1, 10, 100], 'kernel': ['rbf', 'linear']}
clf = GridSearchCV(SVC(class_weight='balanced'), param_grid)
clf = clf.fit(train_data, train_labels)
print("Best estimator found by grid search:")
print(clf.best_estimator_)
You will need to use something similar to this example, and play with different hyperparameters. If you have a good variety of values for your hyperparameters, there is a very good chance you will find a much better model this way.
It is, however, possible for GridSearchCV to take a very long time creating all these models to find the best one. A more practical approach is to use RandomizedSearchCV instead, which creates a random subset of all possible models (sampling the hyperparameter combinations at random). It should run much faster if you have a lot of hyperparameters, and its best model is usually pretty good.
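A sketch of what that could look like for your data (using the newer sklearn.model_selection import; train_data and train_labels as in your code, and the parameter ranges below are only examples):
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

param_distributions = {'C': np.logspace(-2, 3, 20),
                       'gamma': np.logspace(-4, 1, 20),
                       'kernel': ['rbf', 'linear']}

search = RandomizedSearchCV(SVC(class_weight='balanced'), param_distributions,
                            n_iter=25, random_state=42)
search.fit(train_data, train_labels)
print("Best estimator found by randomized search:")
print(search.best_estimator_)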
After the comments of sascha and the answer of shahins, I did this eventually:
df = pd.DataFrame(dataset)
train, validate, test = np.split(df.sample(frac=1), [int(.6*len(df)), int(.8*len(df))])
train_labels = construct_labels(train)
train_data = construct_data(train)
test_labels = construct_labels(test)
test_data = construct_data(test)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
train_data = scaler.fit_transform(train_data)
from sklearn.svm import SVC
# Classifier found with shahins' answer
classifier = SVC(C=10, cache_size=200, class_weight='balanced', coef0=0.0,
decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
classifier = classifier.fit(train_data, train_labels)
test_data = scaler.fit_transform(test_data)
predicted_labels = predict_labels(test_data, classifier)
and got:
>>> correct_labels = count_correct_labels(predicted_labels, test_labels)
>>> print_stats(correct_labels, len(test_labels))
Correct labels = 1624
Accuracy = 88.5979268958
with these methods:
def count_correct_labels(predicted_labels, test_labels):
    correct = 0
    for p, t in zip(predicted_labels, test_labels):
        if p[0] == t:
            correct = correct + 1
    return correct

def print_stats(correct_labels, len_test_labels):
    print "Correct labels = " + str(correct_labels)
    print "Accuracy = " + str((correct_labels * 100 / float(len_test_labels)))
I was able to optimize it further with more hyperparameter tuning!
Helpful link: RBF SVM parameters
Note: If I don't transform the test_data, accuracy is 52.7%.
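As an aside, the usual convention is to fit the scaler on the training data only and reuse the same fitted scaler for the test data (rather than calling fit_transform on both); a minimal sketch based on the code above:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
train_data = scaler.fit_transform(train_data)  # learn mean/std on the training set only
test_data = scaler.transform(test_data)        # apply the same scaling to the test set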
I am trying to run the random forest classifier from scikit-learn and am getting suspiciously bad output: less than 1% of predictions are correct. The model is performing much worse than chance. I am relatively new to Python, ML, and scikit-learn (a triple whammy), and my concern is that I am missing something fundamental, rather than needing to fine-tune the parameters. What I'm hoping for is more veteran eyes to look through the code and see if something is wrong with the setup.
I'm trying to predict classes for rows in a spreadsheet based on word occurrences, so the input for each row is an array representing how many times each word appears, e.g. [1 0 0 2 0 ... 1]. I am using scikit-learn's CountVectorizer to do this processing: I feed it strings containing the words in each row, and it outputs the word occurrence array(s). If this input isn't suitable for some reason, that is probably where things are going awry, but I haven't found anything online or in the documentation suggesting that's the case.
Right now, the forest is answering correctly about 0.5% of the time. Using the exact same inputs with an SGD classifier yields close to 80%, which suggests to me that the preprocessing and vectorizing I'm doing is fine - it's something specific to the RF classifier. My first reaction was to look for overfitting, but even when I run the model on the training data, it still gets almost everything wrong.
I've played around with number of trees and amount of training data but that hasn't seemed to change much for me. I'm trying to only show the relevant code but can post more if that's helpful. First SO post so all thoughts and feedback appreciated.
import csv
# pull in package to create word occurrence vectors for each line
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=1, charset_error='ignore')
X_train = vectorizer.fit_transform(train_file)

# convert to dense array, the required input type for random forest classifier
X_train = X_train.todense()

# pull in random forest classifier and train on data
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100, compute_importances=True)
clf = clf.fit(X_train, train_targets)

# transform the test data into the vector format
testdata = vectorizer.transform(test_file)
testdata = testdata.todense()

# export predictions to a CSV file
with open('output.csv', 'wb') as csvfile:
    spamwriter = csv.writer(csvfile)
    for item in clf.predict(testdata):
        spamwriter.writerow([item])
If Random Forest (RF) does this badly even on the training set X_train, then something is definitely wrong, because you should get a very high percentage there, above 90%.
Try the following (code snippet first):
print "K-means"
clf = KMeans(n_clusters=len(train_targets), n_init=1000, n_jobs=2)
print "Gaussian Mixtures: full covariance"
covar_type = 'full' # 'spherical', 'diag', 'tied', 'full'
clf = GMM(n_components=len(train_targets), covariance_type=covar_type, init_params='wc', n_iter=10000)
print "VBGMM: full covariance"
covar_type = 'full' # 'spherical', 'diag', 'tied', 'full'
clf = VBGMM(n_components=len(train_targets), covariance_type=covar_type, alpha=1.0, random_state=None, thresh=0.01, verbose=False, min_covar=None, n_iter=1000000, params='wc', init_params='wc')
print "Random Forest"
clf = RandomForestClassifier(n_estimators=400, criterion='entropy', n_jobs=2)
print "MultiNomial Logistic Regression"
clf = LogisticRegression(penalty='l2', dual=False, C=1.0, fit_intercept=True, intercept_scaling=1, tol=0.0001)
print "SVM: Gaussian Kernel, infty iterations"
clf = SVC(C=1.0, kernel='rbf', degree=3, gamma=3.0, coef0=1.0, shrinking=True,
probability=False, tol=0.001, cache_size=200, class_weight=None,
verbose=False, max_iter=-1, random_state=None)
Try different classifiers (the interface in scikit-learn is basically always the same) and see how they behave; maybe RF is really not the best for your data. See the code above.
Try to create some randomly generated datasets to feed to the RF classifier; I strongly suspect something goes wrong in the mapping process that generates the vectorizer objects. Therefore, start by creating your own X_train and see.
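A minimal sketch of such a sanity check (the dataset shape is invented; pick something close to your real data):
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# a synthetic problem with known structure; RF should score very high on its own training set
X_rand, y_rand = make_classification(n_samples=1000, n_features=500,
                                     n_informative=50, n_classes=5,
                                     random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_rand, y_rand)
print("train accuracy:", clf.score(X_rand, y_rand))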
Hope that helps