I used a Random Forest Classifier in Python and MATLAB. With 10 trees in the ensemble, I got ~80% accuracy in Python and barely 30% in MATLAB. This difference persisted even when MATLAB's random forests were grown with 100 or 200 trees.
What could be the possible reason for this difference between these two programming languages?
The MATLAB code is below:
load 'path\to\feature vector'; % Observations X Features, loaded as segment_features
load 'path\to\targetValues'; % Observations X Target value, loaded as targets
% Set up Division of Data for Training, Validation, Testing
trainRatio = 70/100;
valRatio = 0/100;
testRatio = 30/100;
[trainInd,valInd,testInd] = dividerand(size(segment_features,1),trainRatio,...
% Train the Forest
B=TreeBagger(10,segment_features(trainInd,:), target(trainInd),...
% Test the Network
outputs_test = predict(B,segment_features(testInd, :));
outputs_test = str2num(cell2mat(outputs_test));
targets_test = target(testInd,:);
oobErrorBaggedEnsemble = oobError(B);
xlabel 'Number of grown trees';
ylabel 'Out-of-bag classification error';

The Problem
There are many reasons why the implementation of a random forest in two different programming languages (e.g., MATLAB and Python) will yield different results.
First of all, note that two random forests trained on the same data will generally not be identical unless the random seed is fixed, and this is by design: random forests choose a random subset of features at each split and build each tree on a bootstrapped sample.
Second, different programming languages may have different default values set for the hyperparameters of a random forest (e.g., scikit-learn's random forest classifier uses gini as its default criterion to measure the quality of a split.)
Third, it will depend on the size of your data (which you do not specify in your question). Smaller datasets will yield more variability in the structure of your random forests and, in turn, their output will differ more from one forest to another.
Finally, a decision tree is susceptible to variability in input data (slight data perturbations can yield very different trees). Random forests try to get more stable and accurate solutions by growing many trees, but 10 (or even 100 or 200) trees are often not enough to get stable output.
Toward a Solution
I can recommend several strategies. First, ensure that the data are loaded equivalently in each program. Is MATLAB reading a critical variable differently from Python, causing the variable to become non-predictive (e.g., misreading a numeric variable as a string variable)?
Second, once you are confident that your data are loaded identically across your two programs, read the documentation of the random forest functions closely and ensure that you are specifying the same hyperparameters (e.g., criterion) in your two programs. You want to ensure that the random forests in each are being created as similarly as possible.
Third, it will likely be necessary to increase the number of trees to get more stable output from your forests. Ensure that the number of trees in both implementations is the same.
Fourth, a potential difference between programs may come from how the data are split into training vs testing sets. It may be necessary to ensure some method that allows you to replicate the same cross-validation sets across your two programming languages (e.g., if you have a unique ID for each record, assign those with even numbers to training and those with odd numbers to testing).
Finally, you may also benefit from creating multiple forests in each programming language and comparing the mean accuracy across iterations. This will give you a better sense of whether differences in accuracy are truly reliable and significant or just a fluke.
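A minimal sketch of the last two suggestions on the Python side, assuming scikit-learn is available. The synthetic data from make_classification, the even/odd split on a hypothetical record ID, and the choice of five seeds are illustrative assumptions, not the asker's actual setup.

```python
import statistics

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real feature matrix and targets (assumption).
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=5, random_state=0
)

# Reproducible split that is easy to replicate in MATLAB:
# even record IDs go to training, odd record IDs go to testing.
record_ids = list(range(len(X)))
train_idx = [i for i in record_ids if i % 2 == 0]
test_idx = [i for i in record_ids if i % 2 == 1]

# Grow several forests that differ only in their random seed.
accuracies = []
for seed in range(5):
    forest = RandomForestClassifier(
        n_estimators=100, criterion="gini", random_state=seed
    )
    forest.fit(X[train_idx], y[train_idx])
    accuracies.append(forest.score(X[test_idx], y[test_idx]))

mean_acc = statistics.mean(accuracies)
std_acc = statistics.stdev(accuracies)
print(f"accuracy: mean={mean_acc:.3f}, std={std_acc:.3f}")
```

If the gap between the two languages is much larger than the seed-to-seed standard deviation, randomness alone is unlikely to explain it, and the data loading or hyperparameters deserve a closer look.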
Good luck!


Can Doc2Vec training results change with the same input data and the same parameters?

I'm using Doc2Vec from the gensim library to find similarities between movies, with a movie's name as input.
model = doc2vec.Doc2Vec(vector_size=100, alpha=0.025, min_alpha=0.025, window=5)
model.train(tagged_corpus_list, total_examples=model.corpus_count, epochs=50)
I set the parameters like this. I didn't change the preprocessing of the input data, and I didn't change the original data.
similar_doc = model.dv.most_similar(input)
I also used this code to find the most similar movie.
When I reran the code to train this model, the most similar movie changed, and so did its score.
Is this possible? Why? If so, how can I make the training result stable?
Yes, this sort of change from run to run is normal. It's well-explained in question 11 of the Gensim FAQ:
Q11: I've trained my Word2Vec / Doc2Vec / etc model repeatedly using the exact same text corpus, but the vectors are different each time. Is there a bug or have I made a mistake? (*2vec training non-determinism)
Answer: The *2vec models (word2vec, fasttext, doc2vec…) begin with random initialization, then most modes use additional randomization during training. (For example, the training windows are randomly truncated as an efficient way of weighting nearer words higher. The negative examples in the default negative-sampling mode are chosen randomly. And the downsampling of highly-frequent words, as controlled by the sample parameter, is driven by random choices. These behaviors were all defined in the original Word2Vec paper's algorithm description.)
Even when all this randomness comes from a pseudorandom-number-generator that's been seeded to give a reproducible stream of random numbers (which gensim does by default), the usual case of multi-threaded training can further change the exact training-order of text examples, and thus the final model state. (Further, in Python 3.x, the hashing of strings is randomized each re-launch of the Python interpreter - changing the iteration ordering of vocabulary dicts from run to run, and thus making even the same string-of-random-number-draws pick different words in different launches.)
So, it is to be expected that models vary from run to run, even trained on the same data. There's no single "right place" for any word-vector or doc-vector to wind up: just positions that are at progressively more-useful distances & directions from other vectors co-trained inside the same model. (In general, only vectors that were trained together in an interleaved session of contrasting uses become comparable in their coordinates.)
Suitable training parameters should yield models that are roughly as useful, from run-to-run, as each other. Testing and evaluation processes should be tolerant of any shifts in vector positions, and of small "jitter" in the overall utility of models, that arises from the inherent algorithm randomness. (If the observed quality from run-to-run varies a lot, there may be other problems: too little data, poorly-tuned parameters, or errors/weaknesses in the evaluation method.)
You can try to force determinism, by using workers=1 to limit training to a single thread – and, if in Python 3.x, using the PYTHONHASHSEED environment variable to disable its usual string hash randomization. But training will be much slower than with more threads. And, you'd be obscuring the inherent randomness/approximateness of the underlying algorithms, in a way that might make results more fragile and dependent on the luck of a particular setup. It's better to tolerate a little jitter, and use excessive jitter as an indicator of problems elsewhere in the data or model setup – rather than impose a superficial determinism.
If the change between runs is small – nearest neighbors mostly the same, with a few in different positions – it's best to tolerate it.
If the change is big, there's likely some other problem, like insufficient training data or poorly-chosen parameters.
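One rough way to judge whether a change is "small" or "big" is to measure how much the top-k neighbor lists of two runs overlap. The sketch below assumes each run's result is a list of (tag, similarity) pairs, as returned by most_similar; the movie names, the scores, and the neighbor_overlap helper are hypothetical.

```python
def neighbor_overlap(run_a, run_b, k=10):
    """Jaccard overlap between the top-k neighbor tags of two runs.

    Assumes both lists are non-empty and sorted by descending similarity.
    """
    top_a = {tag for tag, _ in run_a[:k]}
    top_b = {tag for tag, _ in run_b[:k]}
    return len(top_a & top_b) / len(top_a | top_b)


# Hypothetical results from two separate training runs.
run_1 = [("Alien", 0.91), ("Aliens", 0.88), ("Predator", 0.80)]
run_2 = [("Aliens", 0.90), ("Alien", 0.87), ("Blade Runner", 0.79)]

overlap = neighbor_overlap(run_1, run_2, k=3)
print(f"top-3 overlap: {overlap:.2f}")
```

An overlap near 1.0 with a few swapped positions is the tolerable kind of jitter; an overlap near 0.0 points to the data or parameter problems described above.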
Notably, min_alpha=0.025 isn't a sensible value - the training is supposed to use a gradually-decreasing value, and the usual default (min_alpha=0.0001) usually doesn't need changing. (If you copied this from an online example: that's a bad example! Don't trust that site unless it explains why it's doing an odd thing.)
Increasing the number of training epochs, from the default epochs=5 to something like 10 or 20 may also help make run-to-run results more consistent, especially if you don't have plentiful training data.
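To see why min_alpha=0.025 is a problem, here is a simplified sketch of a linearly decaying learning rate. It is an illustration under assumptions, not gensim's internal code: the alpha_schedule helper is hypothetical and decays once per epoch, whereas the real library adjusts the rate more finely as training progresses.

```python
def alpha_schedule(alpha, min_alpha, epochs):
    """Learning rate per epoch, decaying linearly from alpha to min_alpha."""
    if epochs == 1:
        return [alpha]
    step = (alpha - min_alpha) / (epochs - 1)
    return [alpha - step * i for i in range(epochs)]


# Settings from the question: start and end values are equal, so no decay.
flat = alpha_schedule(alpha=0.025, min_alpha=0.025, epochs=5)

# Usual default end value: the rate shrinks toward 0.0001.
decaying = alpha_schedule(alpha=0.025, min_alpha=0.0001, epochs=5)

print(flat)
print(decaying)
```

With equal start and end values, every epoch uses the full learning rate, so the vectors never get the fine, small-step adjustments at the end of training that help them settle.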

Continue a RandomizedSearchCV fit

To tune the hyperparameters of a machine learning model (or even a pipeline), sklearn offers the exhaustive "GridSearchCV" and the randomized "RandomizedSearchCV". The latter samples the provided distributions and tests the sampled combinations, then selects the best model (and provides the result of each attempt).
But let's say I train 1'000 models using this randomized method. Later, I decide this isn't precise enough, and want to try 1'000 more models. Can I resume the training? That is, can I ask it to sample more and try more models without losing current progress? Calling fit() a second time "restarts" and discards the previous hyperparameter combinations.
My situation looks like the following:
pipeline_cv = RandomizedSearchCV(pipeline, distribution, n_iter=1000, n_jobs=-1)
pipeline_cv = pipeline_cv.fit(trainX, trainy)
predictions = pipeline_cv.predict(targetX)
Then, later, I'd decide that 1000 iterations are not enough to cover my distributions' space, so I would do something like
pipeline_cv = pipeline_cv.resume(trainX, trainy, n_iter=1000) # doesn't exist?
And then I'd have a model trained across 2'000 hyperparameter combinations.
Is my goal achievable?
There is a GitHub issue on that from back in Sep 2017, but it is still open:
In practice it is useful to search over some parameter space and then continue the search over some related space. We could provide a warm_start parameter to make it easy to accumulate results for further candidates into cv_results_ (without reevaluating parameter combinations that have already tested).
And a similar question in Cross Validated is also effectively unanswered.
So, the answer would seem to be no (plus that the scikit-learn community has not felt the need to include such a possibility).
But let's stop for a moment to think if something like that would be really valuable...
RandomizedSearchCV essentially works by randomly sampling parameter values from a given distribution; e.g., using the example from the docs:
from scipy.stats import uniform
distributions = dict(C=uniform(loc=0, scale=4),
                     penalty=['l2', 'l1'])
By the basic principles of random sampling and random number generation (RNG) in general, there is no guarantee that a randomly sampled value will not be sampled more than once, especially if the number of iterations is large. Factor in that RandomizedSearchCV does not do any bookkeeping itself either, so in principle the same parameter combination can be tried more than once in any single run (again, provided that the number of iterations is sufficiently large).
Even with continuous distributions (like the uniform one used above), where the probability of drawing exactly a value already sampled may be very small, there is the routine case of two samples such as 0.678918 and 0.678919, which, however close, are still different and count as different trials.
Given the above, I cannot see how "warm starting" a RandomizedSearchCV would be of any practical use. The real value of RandomizedSearchCV lies in the possibility of sampling a usually large area of parameter values, so large that we consider it useful to unleash the power of simple random sampling. Let me repeat that such sampling does not itself "remember" the samples it has returned, and it may very well return samples that are (exactly or approximately) equal to ones already returned in the past, thus rendering any "warm start" practically irrelevant.
So effectively, simply running two (or more) RandomizedSearchCV processes sequentially (and storing their results) does the job adequately, provided that we do not use the same random seed for different runs (i.e. what is effectively suggested in the Cross Validated thread mentioned above).
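A minimal sketch of that sequential approach, assuming scikit-learn and SciPy are available. The synthetic data, the LogisticRegression estimator, the tiny n_iter, and the run_search helper are illustrative assumptions; the idea is only that each run gets its own seed and that the stored results are pooled afterwards.

```python
from scipy.stats import uniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for trainX / trainy (assumption).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
distributions = {"C": uniform(loc=0.01, scale=4)}


def run_search(seed, n_iter=5):
    """Run one randomized search and return its (params, score) trials."""
    search = RandomizedSearchCV(
        LogisticRegression(max_iter=1000),
        distributions,
        n_iter=n_iter,
        cv=3,
        random_state=seed,  # a different seed per run avoids repeating samples
    )
    search.fit(X, y)
    results = search.cv_results_
    return list(zip(results["params"], results["mean_test_score"]))


# First batch now, second batch "later"; results are simply concatenated.
all_trials = run_search(seed=1) + run_search(seed=2)

# Pick the best combination across both runs and refit on all the data.
best_params, best_score = max(all_trials, key=lambda trial: trial[1])
final_model = LogisticRegression(max_iter=1000, **best_params).fit(X, y)
print(best_params, best_score)
```

In practice the trials from the first run would be saved to disk (for example with pickle or as a CSV of cv_results_) so that the second run can be started at any later time.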

How to approach feature elimination?

I have been working on a couple of datasets to build predictive models based on them. However, I am left a bit bewildered when it comes to eliminating features.
The first one is the Boston Housing dataset and the second is the Bigmart Sales dataset. I will focus my question on these two, but I would also appreciate relatively generalized answers.
Boston Housing : I have constructed a correlation coefficient matrix and eliminated the features which have an absolute correlation coefficient of less than 0.50 with respect to the target variable medv. That leaves me with three features. However, I also understand that a correlation matrix can be highly deceptive and does not capture non-linear relationships. As a matter of fact, features such as crim, indus, etc. do have a non-linear relationship with medv, and intuitively it simply does not feel correct to discard them right away.
Bigmart Sales : There are around 30+ features that are created after OneHotEncoding in Python. I tried the backward elimination method while constructing a linear regression model, but I am not sure how to apply backward elimination to a Decision Tree model for this dataset (or whether it can be applied to a Decision Tree at all).
It would be of great help if I could get some idea of how to approach feature elimination for the above two cases. Let me know if you need more info; I will gladly provide it.
It's an extremely general question. I don't think it is possible to answer your question in the Stack Overflow format.
Every ML / statistical model needs a different Feature Elimination / Feature Engineering approach:
Linear / Logistic / GLM models require removal of correlated features
For Neural Nets / Boosted trees, removing features tends to hurt the performance of the model
Even for one type of model there's no single best way of doing Feature Elimination
If you can add more specific information to your question, it'll be possible to discuss it in detail.
This is a fun one without any definitive answers (No Free Lunch Theorems) that apply across the board. That said, there are many guidelines which typically have success in real-world problems. Those guidelines will work fine in the specific datasets you explicitly mentioned as well.
As with just about anything else, one must always consider the purpose of feature elimination. Without a goal or set of goals, any answer is valid. With an objective, not only can you home in on a good answer, but it can open the door to other ideas you may not have considered. Typically feature elimination is done for one of the following reasons:
Increased Accuracy
Increased Generalization
Decreased Bias
Decreased Variance
Decreased Computational Costs
Ease of Explanation
Of course there are other reasons, but these cover the main use cases. With respect to any of those metrics, the obvious (and awful -- never do this) way to choose which ones to keep is to try all combinations in your model and see what happens. In the Boston Housing dataset, this yields 2^13=8192 possible combinations of features to test. The combinatorial growth is exponential, and not only is this approach likely to lead to survivorship bias, it is too expensive for most people and most data.
Barring any sort of a comprehensive examination of all possible options, one must use a heuristic of some kind to attempt to find the same results. I'll mention several:
Train the model n times, each with precisely one feature removed (a different feature each time). If a model has poor performance it indicates that the removed feature is important.
Train the model once with all features, and randomly perturb each input one feature at a time (this can be done stochastically if you don't want to waste time on every input). The features which cause the most classification error when perturbed are the ones which matter the most.
As you said, perform some sort of correlation testing with the target variable to determine feature importance and a cross-correlation to remove duplicated linear information.
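The second heuristic (perturb one feature at a time and watch the error) can be sketched as follows. This is a minimal illustration assuming NumPy, a synthetic dataset, and a plain least-squares model, and it measures the increase in mean squared error rather than classification error; none of these choices come from the datasets in the question.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: feature 0 matters a lot, feature 1 a little, feature 2 not at all.
n = 500
X = rng.normal(size=(n, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=n)

X_train, X_test = X[:350], X[350:]
y_train, y_test = y[:350], y[350:]


def fit_linear(X, y):
    """Ordinary least squares with an intercept column appended."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict(coef, X):
    return np.column_stack([X, np.ones(len(X))]) @ coef


def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))


# Train once with all features.
coef = fit_linear(X_train, y_train)
baseline = mse(y_test, predict(coef, X_test))

# Shuffle one feature at a time and record how much the test error grows.
importance = []
for j in range(X.shape[1]):
    X_perm = X_test.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])
    importance.append(mse(y_test, predict(coef, X_perm)) - baseline)

ranking = sorted(range(len(importance)), key=lambda j: importance[j], reverse=True)
print("features from most to least important:", ranking)
```

Features whose shuffling barely changes the error are the natural candidates for removal; the first heuristic (retraining with one feature dropped) gives a similar ranking at the cost of n extra training runs.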
These different approaches have different assumptions and goals. Feature removal is important from a computational standpoint (many machine learning algorithms are quadratic or worse in the number of features), and with that perspective the goal is to preserve the behavior of the model as best as possible while removing as much information (i.e., as much complexity) as possible. In the Boston Housing data set, your cross-correlation analysis would probably leave you with Charles River Proximity, Nitrous Oxide Concentration, and Average Room Number as the most relevant variables. Between those three you capture nearly all the accuracy a linear model can obtain on the data.
One thing to point out is that feature removal by definition removes information. This can improve accuracy and generalization for only a few reasons.
By removing redundant information, the model has less bias toward those features and is better able to generalize.
By removing noisy information, the model can focus its efforts on features with high informational content. Note that this affects non-deterministic models like neural networks more than models like linear regressions. Linear regressions always converge to the one unique solution (except in special cases that happen with a true 0% probability where there are multiple solutions).
When you're throwing a lot of features into an algorithm (50k different genes for an organism for example), it makes a lot of sense that some of them won't carry any information. By definition then, any variance they have is noise that the model may inadvertently pick up instead of the signal we want. Feature removal is a common strategy in that domain which improves accuracy dramatically.
Contrast that with the Boston Housing data which has 13 carefully curated features, all of which carry information (based on eyeballing crude scatter plots with respect to the target variable). That particular reasoning isn't likely to affect accuracy much. Moreover, there aren't enough features for there to be very much bias introduced with duplicated information.
On top of that, there are hundreds of data points covering the majority of the input space, so even if we did have bias problems or extraneous features, there is more than enough data that the effects will be negligible. Perhaps enough to make or break the 1st or 2nd place winners in Kaggle, but not enough to make the difference between a good analysis and a great analysis.
Especially if you're using a linear algorithm on top though, having fewer features can greatly aid in the explainability of a model. If you restrict your model to those three variables, it's pretty easy to tell a person that you know houses in the area are expensive because they're all waterfront, they're huge, and they have nice lawns (nitrous oxide indicates fertilizer usage).
Removing features is only a small portion of feature engineering, and another important technique is the addition of features. Adding features usually amounts to low-order polynomial interactions (as an example, the age variable has a fairly weak correlation to the medv variable, but if you square it then the data straightens out a bit and improves the correlation).
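The effect of adding a low-order polynomial feature can be checked with a quick correlation comparison. The sketch below uses synthetic data in which the target depends on the square of the input; it is an assumption-laden illustration with NumPy, not a reproduction of the age/medv relationship.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic feature and a target that depends on its square (assumption).
x = rng.uniform(0.0, 10.0, size=300)
y = x**2 + rng.normal(scale=2.0, size=300)

# Pearson correlation of the target with the raw and the squared feature.
corr_raw = np.corrcoef(x, y)[0, 1]
corr_squared = np.corrcoef(x**2, y)[0, 1]

print(f"raw feature:     r = {corr_raw:.3f}")
print(f"squared feature: r = {corr_squared:.3f}")
```

When the transformed feature correlates noticeably better with the target, a linear model can use it directly, which is the point of adding it.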
Adding features (and removing them) can be aided greatly with a little domain knowledge. I don't know a ton about housing, so I can't add a lot of help here, but in other domains like credit worthiness you can easily imagine combining debt and income features to get a ratio of debt to income as a single feature. Reshaping those features so that they linearly correlate to your output and represent physically meaningful quantities in the domain is a big part of obtaining accuracy and generalizability.
With respect to generalizability and domain knowledge, even with something as simple as a linear model it's important to be able to explain why a feature is important. Just because the data says that nitrous oxide matters in the training set doesn't mean that it will carry any predictive weight in the test set as well. Especially as the number of features grows and the amount of data shrinks, you should expect such correlations to occur purely by accident. Having a physical interpretation (nitrous oxide corresponds to nice lawns) yields confidence that the model isn't learning spurious correlations.

data preparation for random forest and predictive modeling in python

I am working on a predictive modeling exercise using a categorical output (pass/fail: binary 1 or 0) and about 200 features. I have about 350K training examples for this, but I can increase the size of my dataset if needed. Here are a few issues that I am running into:
1- I am dealing with severely imbalanced classes. Out of those 350K examples, only 2K are labelled as “fail” (i.e. categorical output = 1). How do I account for this? I know there are several techniques, such as up-sampling with bootstrap;
2- Most of my features (~ 95%) are categorical (e.g. city, language, etc.) with less than 5-6 levels each. Do I need to transform them into binary data for each level of a feature? For instance if the feature “city” has 3 levels with New York, Paris, and Barcelona, then I can transform it into 3 binary features: city_New_york, city_Paris, and city_Barcelona;
3 - Picking the model itself: I am thinking about a few such as SVM, K-neighbors, Decision tree, Random Forest, Logistic Regression, but my guess is that Random Forest will be appropriate for this because of the large number of categorical features. Any suggestions there?
4 - If I use Random Forest, do I need to (a) do feature scaling for the continuous variables (I am guessing not), (b) change my continuous variables to binary, as explained in question 2 above (I am guessing not), (c) account for my severe imbalanced classes, (d) remove missing values.
Thanks in advance for your answers!
1 - It helps to train with balanced classes (but don't cross validate with them). RF is surprisingly efficient with data, so you likely won't need all 350K negative samples to train. Choose an equal number of positive examples by sampling with replacement from the pool of positive examples. Don't forget to leave some positive examples out for validation though.
2 - If you are in scikit-learn, use pandas' pd.get_dummies(df) to generate the binary encoding. R does the binary encoding for you for variables that are factors. Behind the scenes it makes a bit vector.
3 - I always start with RF because there are so few knobs; it's a good benchmark. After I've straightened out my feature transforms and gotten AUC up, I try the other methods.
4 - a) no b) no c) yes d) Yes, it needs to be fixed somehow. If you can get away with removing data where any predictor has missing values, great. However, if that's not possible, the median is a common choice. Let's say a tree is being built, and variable X4 is chosen to split on. RF needs to choose a point on a line and send all the data to either the left or right. What should it do for data where X4 has no value? Here is the strategy the 'randomForest' package takes in R:
For numeric variables, NAs are replaced with column medians. For factor variables, NAs are replaced with the most frequent levels (breaking ties at random). If object contains no NAs, it is returned unaltered.
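The same strategy is easy to reproduce in Python. The sketch below uses only the standard library and a hypothetical rough_fix helper; it assumes missing values are represented as None, that each column has at least one non-missing value, and it breaks ties for the most frequent level by first occurrence rather than at random.

```python
import statistics
from collections import Counter


def rough_fix(column):
    """Fill None with the median (numeric) or the most frequent level (other)."""
    present = [v for v in column if v is not None]
    if len(present) == len(column):
        return list(column)  # nothing missing: return the column unaltered
    if all(isinstance(v, (int, float)) for v in present):
        fill = statistics.median(present)
    else:
        # Simplification: ties go to the level seen first, not a random one.
        fill = Counter(present).most_common(1)[0][0]
    return [fill if v is None else v for v in column]


ages = [20, None, 40, 30]
cities = ["Paris", None, "Paris", "Barcelona"]

print(rough_fix(ages))
print(rough_fix(cities))
```

To avoid leaking information, the fill values should be computed on the training data only and then reused for validation and test data.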

Feature Selection and Reduction for Text Classification

I am currently working on a project, a simple sentiment analyzer, with 2 and 3 classes in separate cases. I am using a corpus that is pretty rich in terms of unique words (around 200,000). I used the bag-of-words method for feature selection, and to reduce the number of unique features, I eliminate words that fall below a threshold frequency of occurrence. The final set includes around 20,000 features, which is actually a 90% decrease, but not enough for the intended accuracy of test prediction. I am using LibSVM and SVM-light in turn for training and prediction (both linear and RBF kernel), and also Python and Bash in general.
The highest accuracy observed so far is around 75% and I need at least 90%. This is the case for binary classification. For multi-class training, the accuracy falls to ~60%. I need at least 90% in both cases and cannot figure out how to increase it: by optimizing training parameters or by optimizing feature selection?
I have read articles about feature selection in text classification and what I found is that three different methods are used, which have actually a clear correlation among each other. These methods are as follows:
Frequency approach of bag-of-words (BOW)
Information Gain (IG)
X^2 Statistic (CHI)
The first method is already the one I use, but I use it very simply and need guidance for a better use of it in order to obtain high enough accuracy. I am also lacking knowledge about practical implementations of IG and CHI and looking for any help to guide me in that way.
Thanks a lot, and if you need any additional info for help, just let me know.
@larsmans: Frequency Threshold: I am looking at the occurrences of unique words in examples, such that if a word occurs in different examples frequently enough, it is included in the feature set as a unique feature.
@TheManWithNoName: First of all, thanks for your effort in explaining the general concerns of document classification. I examined and experimented with all the methods you brought up, and others. I found the Proportional Difference (PD) method the best for feature selection, where features are uni-grams, and Term Presence (TP) the best for weighting (I didn't understand why you tagged Term-Frequency-Inverse-Document-Frequency (TF-IDF) as an indexing method; I consider it a feature weighting approach). Pre-processing is also an important aspect of this task, as you mentioned. I used certain types of string elimination for refining the data, as well as morphological parsing and stemming. Also note that I am working on Turkish, which has different characteristics compared to English. Finally, I managed to reach ~88% accuracy (f-measure) for binary classification and ~84% for multi-class. These values are solid proof of the success of the model I used. This is what I have done so far. I am now working on clustering and reduction models; I have tried LDA and LSI and am moving on to moVMF and maybe spherical models (LDA + moVMF), which seem to work better on corpora that have an objective nature, like news corpora. If you have any information and guidance on these issues, I would appreciate it. I especially need info on setting up an interface (Python oriented, open-source) between feature space dimension reduction methods (LDA, LSI, moVMF etc.) and clustering methods (k-means, hierarchical etc.).
This is probably a bit late to the table, but...
As Bee points out and you are already aware, the use of SVM as a classifier is wasted if you have already lost the information in the stages prior to classification. However, the process of text classification requires much more than just a couple of stages, and each stage has significant effects on the result. Therefore, before looking into more complicated feature selection measures, there are a number of much simpler possibilities that will typically require much lower resource consumption.
Do you pre-process the documents before performing tokenisation/representation into the bag-of-words format? Simply removing stop words or punctuation may improve accuracy considerably.
Have you considered altering your bag-of-words representation to use, for example, word pairs or n-grams instead? You may find that you have more dimensions to begin with but that they condense down a lot further and contain more useful information.
It's also worth noting that dimension reduction is feature selection/feature extraction. The difference is that feature selection reduces the dimensions in a univariate manner, i.e. it removes terms on an individual basis as they currently appear without altering them, whereas feature extraction (which I think Ben Allison is referring to) is multivariate, combining one or more single terms together to produce higher orthogonal terms that (hopefully) contain more information and reduce the feature space.
Regarding your use of document frequency, are you merely using the probability/percentage of documents that contain a term, or are you using the term densities found within the documents? If category one has only 10 documents and they each contain a term once, then category one is indeed associated with the term. However, if category two has only 10 documents that each contain the same term a hundred times, then obviously category two has a much stronger relation to that term than category one. If term densities are not taken into account, this information is lost, and the fewer categories you have, the more impact this loss will have. On a similar note, it is not always prudent to retain only terms that have high frequencies, as they may not actually be providing any useful information. For example, if a term appears a hundred times in every document, then it is considered a noise term and, while it looks important, there is no practical value in keeping it in your feature set.
Also, how do you index the data: are you using the Vector Space Model with simple boolean indexing or a more complicated measure such as TF-IDF? Considering the low number of categories in your scenario, a more complex measure will be beneficial, as it can account for term importance for each category in relation to its importance throughout the entire dataset.
Personally I would experiment with some of the above possibilities first and then consider tweaking the feature selection/extraction with a (or a combination of) complex equations if you need an additional performance boost.
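As a small illustration of the indexing point, here is a minimal TF-IDF sketch using only the standard library. The tf_idf helper and the three toy documents are assumptions for illustration; it uses the plain tf * log(N / df) form without smoothing, which differs from the variants that libraries usually implement.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    ["the", "film", "was", "great"],
    ["the", "film", "was", "awful"],
    ["the", "plot", "was", "great", "great"],
]
weights = tf_idf(docs)

print(weights[0])
```

Terms that occur in every document (here "the" and "was") get a weight of zero, which matches the remark above that a term appearing everywhere is noise, while the term density within a document still raises the weight of "great" in the third document.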
Based on the new information, it sounds as though you are on the right track and 84%+ accuracy (F1 or BEP - precision and recall based for multi-class problems) is generally considered very good for most datasets. It might be that you have successfully acquired all information rich features from the data already, or that a few are still being pruned.
Having said that, something that can be used as a predictor of how good aggressive dimension reduction may be for a particular dataset is 'Outlier Count' analysis, which uses the decline of Information Gain in outlying features to determine how likely it is that information will be lost during feature selection. You can use it on the raw and/or processed data to give an estimate of how aggressively you should aim to prune features (or unprune them as the case may be). A paper describing it can be found here:
Paper with Outlier Count information
With regards to describing TF-IDF as an indexing method, you are correct in it being a feature weighting measure, but I consider it to be used mostly as part of the indexing process (though it can also be used for dimension reduction). The reasoning for this is that some measures are better aimed toward feature selection/extraction, while others are preferable for feature weighting specifically in your document vectors (i.e. the indexed data). This is generally due to dimension reduction measures being determined on a per category basis, whereas index weighting measures tend to be more document orientated to give superior vector representation.
In respect to LDA, LSI and moVMF, I'm afraid I have too little experience of them to provide any guidance. Unfortunately I've also not worked with Turkish datasets or the python language.
I would recommend dimensionality reduction instead of feature selection. Consider either singular value decomposition, principal component analysis, or even better considering it's tailored for bag-of-words representations, Latent Dirichlet Allocation. This will allow you to notionally retain representations that include all words, but to collapse them to fewer dimensions by exploiting similarity (or even synonymy-type) relations between them.
All these methods have fairly standard implementations that you can get access to and run---if you let us know which language you're using, I or someone else will be able to point you in the right direction.
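For a feel of what such a reduction does, here is a minimal sketch of a truncated singular value decomposition on a tiny document-term matrix, assuming NumPy. The matrix values and the choice of k = 2 are made up for illustration; real corpora would use a sparse implementation from a library.

```python
import numpy as np

# Rows are documents, columns are term counts (toy values, assumption).
doc_term = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 0, 0, 0],
    [0, 0, 1, 2, 1],
    [0, 0, 2, 1, 1],
], dtype=float)

# Thin SVD: doc_term = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(doc_term, full_matrices=False)

# Keep the k strongest directions; each document becomes a k-dimensional vector.
k = 2
reduced = U[:, :k] * S[:k]

print(reduced.shape)
```

New documents can be projected into the same space by multiplying their term-count vector with Vt[:k].T, so all words still contribute even though only k dimensions are kept.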
There's a Python library for feature selection: TextFeatureSelection. This library provides discriminatory power in the form of a score for each word token, bigram, trigram, etc.
For those familiar with feature selection methods in machine learning: it is based on the filter method and gives ML engineers the tools required to improve classification accuracy in their NLP and deep learning models. It has four methods, namely Chi-square, Mutual information, Proportional difference, and Information gain, to help select words as features before they are fed into machine learning classifiers.
from TextFeatureSelection import TextFeatureSelection
#Multiclass classification problem
input_doc_list=['i am very happy','i just had an awesome weekend','this is a very difficult terrain to trek. i wish i stayed back at home.','i just had lunch','Do you want chips?']
#Binary classification
input_doc_list=['i am content with this location','i am having the time of my life','you cannot learn machine learning without linear algebra','i want to go to mars']
It now has a genetic algorithm for feature selection as well.
from TextFeatureSelection import TextFeatureSelectionGA
#Input documents: doc_list
#Input labels: label_list
There is another method now, TextFeatureSelectionEnsemble, which combines feature selection with ensembling. It does feature selection for the base models through document frequency thresholds. At the ensemble layer, it uses a genetic algorithm to identify the best combination of base models and keeps only those.
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from TextFeatureSelection import TextFeatureSelectionEnsemble
imdb_data=pd.read_csv('../input/IMDB Dataset.csv')
le = LabelEncoder()
imdb_data['labels'] = le.fit_transform(imdb_data['sentiment'].values)
#convert raw text and labels to python list
#Initialize parameter for TextFeatureSelectionEnsemble and start training
Check the project for details: https://pypi.org/project/TextFeatureSelection/
Linear SVM is recommended for high dimensional features. Based on my experience, the ultimate limit on SVM accuracy depends on the positive and negative "features". You can do a grid search (or in the case of linear SVM you can just search for the best cost value) to find the optimal parameters for maximum accuracy, but in the end you are limited by the separability of your feature sets. The fact that you are not getting 90% means that you still have some work to do finding better features to describe the members of your classes.
I'm sure this is way too late to be of use to the poster, but perhaps it will be useful to someone else. The chi-squared approach to feature reduction is pretty simple to implement. Assuming BoW binary classification into classes C1 and C2, for each feature f in candidate_features: calculate the frequency of f in C1; calculate the total words in C1; repeat the calculations for C2; calculate a chi-square statistic; then filter candidate_features based on whether the p-value is below a certain threshold (e.g. p < 0.05). A tutorial using Python and nltk can be seen here: http://streamhacker.com/2010/06/16/text-classification-sentiment-analysis-eliminate-low-information-features/ (though if I remember correctly, I believe the author incorrectly applies this technique to his test data, which biases the reported results).
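The steps above can be sketched with the standard library alone. This is a minimal illustration under assumptions: the helper names and the toy token lists are hypothetical, the statistic is the usual 2x2 contingency-table chi-square (feature vs. other words, C1 vs. C2), and the p-value uses the closed form for one degree of freedom.

```python
import math
from collections import Counter


def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator


def p_value_1dof(chi2):
    """Survival function of the chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


def select_features(c1_tokens, c2_tokens, threshold=0.05):
    c1_counts, c2_counts = Counter(c1_tokens), Counter(c2_tokens)
    total_c1, total_c2 = len(c1_tokens), len(c2_tokens)
    selected = set()
    for f in set(c1_counts) | set(c2_counts):
        a = c1_counts[f]   # occurrences of f in C1
        b = total_c1 - a   # all other words in C1
        c = c2_counts[f]   # occurrences of f in C2
        d = total_c2 - c   # all other words in C2
        if p_value_1dof(chi_square_2x2(a, b, c, d)) < threshold:
            selected.add(f)
    return selected


# Toy training tokens per class (made-up counts, 100 words each).
c1_tokens = ["good"] * 30 + ["the"] * 20 + ["movie"] * 25 + ["bad"] * 5 + ["plot"] * 20
c2_tokens = ["good"] * 5 + ["the"] * 20 + ["movie"] * 25 + ["bad"] * 30 + ["plot"] * 20

selected = select_features(c1_tokens, c2_tokens)
print(sorted(selected))
```

As noted above, the counts should come from the training data only; running the selection on test data biases the reported accuracy.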
