data preparation for random forest and predictive modeling in python


I am working on a predictive modeling exercise using a categorical output (pass/fail: binary 1 or 0) and about 200 features. I have about 350K training examples for this, but I can increase the size of my dataset if needed. Here are a few issues that I am running into:
1- I am dealing with severely imbalanced classes. Out of those 350K examples, only 2K are labelled as “fail” (i.e. categorical output = 1). How do I account for this? I know there are several techniques, such as up-sampling with bootstrap;
2- Most of my features (~ 95%) are categorical (e.g. city, language, etc.) with less than 5-6 levels each. Do I need to transform them into binary data for each level of a feature? For instance if the feature “city” has 3 levels with New York, Paris, and Barcelona, then I can transform it into 3 binary features: city_New_york, city_Paris, and city_Barcelona;
3 - Picking the model itself: I am thinking about a few such as SVM, K-neighbors, Decision tree, Random Forest, Logistic Regression, but my guess is that Random Forest will be appropriate for this because of the large number of categorical features. Any suggestions there?
4 - If I use Random Forest, do I need to (a) do feature scaling for the continuous variables (I am guessing not), (b) change my continuous variables to binary, as explained in question 2 above (I am guessing not), (c) account for my severely imbalanced classes, (d) remove missing values?
Thanks in advance for your answers!

It helps to train with balanced classes (but don't cross-validate with them). RF is surprisingly efficient with data, so you likely won't need all 350K negative samples to train. Choose an equal number of positive examples by sampling with replacement from that pool. Don't forget to leave some positive examples out for validation, though.
If you are in scikit-learn, use pandas' pd.get_dummies() to generate the binary encoding (it is a top-level pandas function, called as pd.get_dummies(df), not a DataFrame method). R does the binary encoding for you for variables that are factors. Behind the scenes it makes a bit vector.
I always start with RF because there are so few knobs, it's a good benchmark. After I've straightened out my feature transforms and gotten AUC up, I try the other methods.
a) No. b) No. c) Yes. d) Yes, it needs to be fixed somehow. If you can get away with removing data where any predictor has missing values, great. However, if that's not possible, imputing the median is a common choice. Let's say a tree is being built, and variable X4 is chosen to split on. RF needs to choose a point on a line and send all the data to either the left or the right. What should it do for data where X4 has no value? Here is the strategy the 'randomForest' package takes in R:
For numeric variables, NAs are replaced with column medians. For factor variables, NAs are replaced with the most frequent levels (breaking ties at random). If object contains no NAs, it is returned unaltered.
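A minimal sketch of how points 1, 2 and 4(d) could fit together in Python. Everything about the data here is an assumption: the column names (city, language, score, fail) and the synthetic values are made up for illustration, and balancing is done by keeping all positives and subsampling an equal number of negatives, which is only one way to read the advice above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data (hypothetical column names).
n = 5000
df = pd.DataFrame({
    "city": rng.choice(["New York", "Paris", "Barcelona"], size=n),
    "language": rng.choice(["en", "fr", "es"], size=n),
    "score": rng.normal(size=n),
})
df.loc[rng.random(n) < 0.05, "score"] = np.nan  # a few missing values
risk = 0.02 + 0.10 * (df["city"] == "Paris") + 0.05 * (df["score"].fillna(0) > 1)
df["fail"] = (rng.random(n) < risk).astype(int)  # rare positive class

# Hold out validation data with the ORIGINAL class ratio before resampling.
train, valid = train_test_split(
    df, test_size=0.3, stratify=df["fail"], random_state=0
)

# Median imputation, using the median of the training split only.
median = train["score"].median()
train = train.assign(score=train["score"].fillna(median))
valid = valid.assign(score=valid["score"].fillna(median))

# Balance the training set: all positives plus an equal number of negatives.
pos = train[train["fail"] == 1]
neg = train[train["fail"] == 0].sample(n=len(pos), random_state=0)
balanced = pd.concat([pos, neg])

# One-hot encode the categorical columns; align validation columns to training.
cat_cols = ["city", "language"]
X_train = pd.get_dummies(balanced.drop(columns="fail"), columns=cat_cols, dtype=float)
X_valid = pd.get_dummies(valid.drop(columns="fail"), columns=cat_cols, dtype=float)
X_valid = X_valid.reindex(columns=X_train.columns, fill_value=0.0)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, balanced["fail"])
auc = roc_auc_score(valid["fail"], model.predict_proba(X_valid)[:, 1])
print(f"validation AUC on the unbalanced hold-out: {auc:.3f}")
```

The validation split is taken before any resampling, so the AUC is measured on data with the original class imbalance.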

Related

Question regarding DecisionTreeClassifier

I am making an explainable model with the past data, and not going to use it for future prediction at all.
In the data, there are a hundred X variables and one binary Y class, and I am trying to explain how the Xs affect Y (0 or 1).
I came up with the DecisionTree classifier as it clearly shows how decisions are made, based on a threshold value for each variable.
Here are my questions:
Is it necessary to split X data into X_test, X_train even though I am not going to predict with this model? ( I do not want to waste data for the test since I am interpreting only)
After I split the data and train the model, only a few variables get non-zero feature importance values (like 3 out of 100 X variables) and the rest go to zero. Therefore, there are only a few branches. I do not know why this happens.
If here is not the right place to ask such question, please let me know.
Thanks.

No, it is not necessary, but it is a way to check whether your decision tree is overfitting and just remembering the input values and classes, or actually learning the pattern behind them. I would suggest you look into cross-validation, since it doesn't 'waste' any data: it trains and tests on all of the data. If you need me to explain this further, leave a comment.
Getting only a small number of important features is not an issue, since it depends solely on your data.
Example:
Let's say I want to make a model to tell if a number will be divisible by 69 (my Y class).
I have my X variables as divisibility by 2,3,5,7,9,13,17,19 and 23.
If I train the model correctly, I will get feature importance of only 3 and 23 as very high and everything else should have very low feature importance.
Consequently, my decision tree (trees if using ensemble models like Random Forest / XGBoost) will have fewer splits.
So, having a small number of important features is normal and does not cause any problems.
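The divisibility example can be reproduced with a short sketch. The range of numbers (1 to 20000) is an arbitrary choice made here, not something from the answer.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

divisors = [2, 3, 5, 7, 9, 13, 17, 19, 23]
numbers = np.arange(1, 20001)  # arbitrary range chosen for illustration

# One binary feature per divisor: is the number divisible by it?
X = np.column_stack([(numbers % d == 0).astype(int) for d in divisors])
# Target: is the number divisible by 69 (= 3 * 23)?
y = (numbers % 69 == 0).astype(int)

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

importance = dict(zip(divisors, tree.feature_importances_))
for d, score in importance.items():
    print(f"divisible by {d:>2}: importance {score:.3f}")
print("number of leaves:", tree.get_n_leaves())
```

Only the features for 3 and 23 are ever used for a split, so every other feature ends up with an importance of exactly zero and the tree stays very small.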

No, it isn't. However, I would still split train-test and measure performance separately. While an explainable model is nice, it is significantly less nice if it's a crap model. I'd make sure it had at least reasonable performance before considering interpretation, at which point the splitting is unnecessary.
The number of important features is data-dependent. Random forests do a good job providing this as well. In any case, fewer branches is better. You want a simpler tree, which is easier to explain.
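A rough sketch of that workflow, assuming scikit-learn and a synthetic dataset standing in for the real 100-variable data: check performance with cross-validation first, then refit on everything for interpretation. The depth limit and the AUC metric are choices made here for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 100 features, only a handful of them informative.
X, y = make_classification(
    n_samples=1000, n_features=100, n_informative=5, n_redundant=0, random_state=0
)

# A shallow tree is easier to explain; max_depth=4 is an arbitrary choice.
tree = DecisionTreeClassifier(max_depth=4, random_state=0)

# 5-fold cross-validation: every row is used for testing exactly once.
scores = cross_val_score(tree, X, y, cv=5, scoring="roc_auc")
print(f"cross-validated AUC: {scores.mean():.3f} +/- {scores.std():.3f}")

# Once the performance looks reasonable, refit on all the data to interpret.
tree.fit(X, y)
n_used = int((tree.feature_importances_ > 0).sum())
print(f"features with non-zero importance: {n_used} of {X.shape[1]}")
```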

Using Sklearn random forest for feature selection does not give me expected outcome when having categorical data

I would like to use the sklearn random forest feature selection function to understand the key factors impacting my dependent variable (TN pollutant concentration).
I have one categorical variable, climate type, with five types of climate (temperature-hot, temperature-dry, temperature-warm, tropical and arid). I know that the climate type has a big impact on my dependent variable. However, when I used the one-hot encoding approach (through pandas get_dummies), I found these climate types (which become five true/false variables after one-hot encoding) were the least important, which is not true.
As shown here, the climate variables have the lowest feature importance scores:
My question is whether the feature selection of random forest is still useful when dealing with categorical variables, and whether I did something wrong.
This is part of my code:
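import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor

# x_train (DataFrame of features) and y_train (target) are prepared earlier
model = RandomForestRegressor(n_estimators=100, bootstrap=True, max_features='sqrt')
model.fit(x_train, y_train)
fi = pd.DataFrame({'feature': list(x_train),
                   'importance': model.feature_importances_}).sort_values('importance', ascending=False)
plt.bar(fi['feature'], fi['importance'])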

It all depends on how the feature importance is calculated. If the importance of a feature is a function of how many times it splits a node, then it is difficult to compare numeric and categorical variables, since numeric features can be (and often are) split multiple times in a tree, whereas a one-hot encoded category can only be split once.
I am not completely sure, but I think that the feature importance in sklearn is impurity-based (the total impurity reduction from all splits on a feature), which favours features that can be split many times and spreads a categorical variable's importance across its dummy columns, hence the misleading importance.
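One way to check this is to add up the importances of the dummy columns per original variable, and to compare with permutation importance on held-out data. The sketch below uses made-up data (the rainfall and noise columns and the effect sizes are assumptions), and summing permutation scores over dummy columns is only a rough aggregation, since the columns are permuted one at a time rather than jointly.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
climates = ["temperate-hot", "temperate-dry", "temperate-warm", "tropical", "arid"]

# Synthetic data: climate drives the target strongly, rainfall weakly.
df = pd.DataFrame({
    "climate": rng.choice(climates, size=n),
    "rainfall": rng.normal(size=n),
    "noise": rng.normal(size=n),  # irrelevant feature
})
effect = dict(zip(climates, [0.0, 2.0, 4.0, 6.0, 8.0]))
y = df["climate"].map(effect) + 0.5 * df["rainfall"] + rng.normal(scale=0.5, size=n)

X = pd.get_dummies(df, columns=["climate"], dtype=float)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(n_estimators=100, max_features="sqrt", random_state=0)
model.fit(X_train, y_train)

def by_variable(column):
    """Map a dummy column such as 'climate_arid' back to 'climate'."""
    return column.split("_")[0]

# Impurity-based importances, summed over the dummies of each variable.
impurity = pd.Series(model.feature_importances_, index=X.columns)
grouped = impurity.groupby(by_variable).sum()
print(grouped.sort_values(ascending=False))

# Permutation importance on held-out data, aggregated the same (rough) way.
perm = permutation_importance(model, X_test, y_test, n_repeats=5, random_state=0)
perm_grouped = pd.Series(perm.importances_mean, index=X.columns).groupby(by_variable).sum()
print(perm_grouped.sort_values(ascending=False))
```

In this toy setup, each of the five climate columns looks modest on its own, while the grouped score shows climate as the dominant variable.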

Impute multiple missing values in a feature-vector

Edited post
This is a short and somewhat clarified version of the original post.
We've got a training dataset (some features are significantly correlated). The feature space has 20 dimensions (all continuous).
We need to train a nonparametric (most features form nonlinear subspaces and we can't assume a distribution for any of them) imputer (kNN or tree-based regression) using the training data.
We need to predict multiple missing values in query data (a query feature-vector can have up to 13 missing features, so the imputer should handle any combination of missing features) using the trained imputer. NOTE the imputer should not be in any way retrained/fitted using the query data (like it is done in all mainstream R packages I've found so far: Amelia, impute, mi and mice...). That is the imputation should be based solely on the training data.
The purpose for all this is described below.
A small data sample is down below.
Original post (TL;DR)
Simply put, I've got some sophisticated data imputing to do. We've got a training dataset of ~100k 20D samples and a smaller testing dataset. Each feature/dimension is a continuous variable, but the scales are different. There are two distinct classes. Both datasets are very NA-inflated (NAs are not equally distributed across dimensions). I use sklearn.ensemble.ExtraTreesClassifier for classification and, although tree ensembles can handle missing data cases, there are three reasons to perform imputation:
This way we get votes from all trees in a forest during classification of a query dataset (not just those that don't have a missing feature/features).
We don't lose data during training.
The scikit implementation of tree ensembles (both ExtraTrees and RandomForest) does not handle missing values. But this point is not that important. If it wasn't for the former two I would've just used rpy2 + some nice R implementation.
Things are quite simple with the training dataset because I can apply class-specific median imputation strategy to deal with missing values and this approach has been working fine so far. Obviously this approach can't be applied to a query - we don't have the classes to begin with. Since we know that the classes will likely have significantly different shares in the query we can't apply a class-indifferent approach because that might introduce bias and reduce classification performance, therefore we need to impute missing values from a model.
Linear models are not an option for several reasons:
all features are correlated to some extent;
theoretically we can get all possible combinations of missing features in a sample feature-vector; even though our tool requires at least 7 non-missing features, we end up with ~1E6 possible models, which doesn't look very elegant if you ask me.
Tree-based regression models aren't good for the very same reason. So we ended up picking kNN (k nearest neighbours), ball tree or LSH with radius threshold to be more specific. This approach fits the task quite well, because dimensions (ergo distances) are correlated, hence we get nice performance in extremely NA-rich cases, but there are several drawbacks:
I haven't found a single implementation in Python (including impute, sklearn.preprocessing.Imputer, orange) that handles feature-vectors with different sets of missing values, that is we want to have only one imputer for all possible combinations of missing features.
kNN uses pair-wise point distances for prediction/imputation. As I've already mentioned, our variables have different scales, hence the feature space must be normalised prior to distance estimation. And we need to know the theoretical max/min values for each dimension to scale it properly. This is not so much a problem as a matter of architectural simplicity (a user will have to provide a vector of min/max values).
So here is what I would like to hear from you:
Are there any classic ways to address the kNN-related issues given in the list above? I believe this must be a common case, yet I haven't found anything specific on the web.
Is there a better way to impute data in our case? What would you recommend? Please, provide implementations in Python (R and C/C++ are considered as well).
Data
Here is a small sample of the training data set. I reduced the number of features to make it more readable. The query data has identical structure, except for the obvious absence of category information.
v1 v2 v3 v4 v5 category
0.40524 0.71542 NA 0.81033 0.8209 1
0.78421 0.76378 0.84324 0.58814 0.9348 2
0.30055 NA 0.84324 NA 0.60003 1
0.34754 0.25277 0.18861 0.28937 0.41394 1
NA 0.71542 0.10333 0.41448 0.07377 1
0.40019 0.02634 0.20924 NA 0.85404 2
0.56404 0.5481 0.51284 0.39956 0.95957 2
0.07758 0.40959 0.33802 0.27802 0.35396 1
0.91219 0.89865 0.84324 0.81033 0.99243 1
0.91219 NA NA 0.81033 0.95988 2
0.5463 0.89865 0.84324 0.81033 NA 2
0.00963 0.06737 0.03719 0.08979 0.57746 2
0.59875 0.89865 0.84324 0.50834 0.98906 1
0.72092 NA 0.49118 0.58814 0.77973 2
0.06389 NA 0.22424 0.08979 0.7556 2
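For reference, the kNN route described above can be sketched with scikit-learn's KNNImputer, which exists in newer scikit-learn releases than this post assumes. This is only an illustration under stated assumptions: the data are synthetic, the scaling uses the min/max observed in the training data rather than user-supplied theoretical bounds, and n_neighbors=5 is arbitrary. The imputer is fitted on the training data only and then applied to query rows with any pattern of missing features.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
n_features = 20
mixing = rng.normal(size=(3, n_features))  # makes the features correlated

def make_data(n_rows, max_missing=13):
    """Correlated synthetic rows, each with a random set of NaN features."""
    latent = rng.normal(size=(n_rows, 3))
    data = latent @ mixing + 0.1 * rng.normal(size=(n_rows, n_features))
    for row in data:
        k = rng.integers(0, max_missing + 1)
        row[rng.choice(n_features, size=k, replace=False)] = np.nan
    return data

train = make_data(2000)
query = make_data(50)

# Scale each feature to [0, 1] (NaNs are ignored when fitting the scaler),
# then impute from the k nearest training rows using NaN-aware distances.
imputer = make_pipeline(
    MinMaxScaler(),
    KNNImputer(n_neighbors=5, weights="distance"),
)
imputer.fit(train)  # only the training data is used for fitting

scaled = imputer.transform(query)  # the query data is never fitted on
query_filled = imputer.named_steps["minmaxscaler"].inverse_transform(scaled)

print("missing before:", int(np.isnan(query).sum()))
print("missing after: ", int(np.isnan(query_filled).sum()))
```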

Based on the new update I think I would recommend against kNN or tree-based algorithms here. Since imputation is the goal and not a consequence of the methods you're choosing you need an algorithm that will learn to complete incomplete data.
To me this seems very well suited to use a denoising autoencoder. If you're familiar with Neural Networks it's the same basic principle. Instead of training to predict labels you train the model to predict the input data with a notable twist.
The 'denoising' part refers to an intermediate step where you randomly set some percentage of the input data to 0 before attempting to predict it. This forces the algorithm to learn richer features and how to complete the data when there are missing pieces. In your case I would recommend a low amount of dropout in training (since your data is already missing features) and no dropout in test.
It would be difficult to write a helpful example without looking at your data first, but the basics of what an autoencoder does (as well as a complete code implementation) are covered here: http://deeplearning.net/tutorial/dA.html
This link uses a Python module called Theano, which I would HIGHLY recommend for the job. The flexibility of the module trumps every other module I've looked at for Machine Learning, and I've looked at a lot. It's not the easiest thing to learn, but if you're going to be doing a lot of this kind of stuff I'd say it's worth the effort. If you don't want to go through all that, then you can still implement a denoising autoencoder in Python without it.
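To give a feel for the idea without Theano, here is a very small sketch that uses scikit-learn's MLPRegressor as a stand-in for a denoising autoencoder. This is a swapped-in tool, not the implementation from the linked tutorial, and it rests on several assumptions: synthetic data, complete training rows, a single hidden layer of 32 units and a 20% corruption rate.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Synthetic, complete training rows with correlated features in [-1, 1].
mixing = rng.normal(size=(3, 20))
X_train = np.tanh(rng.normal(size=(3000, 3)) @ mixing)

# Corruption step: randomly zero out about 20% of the inputs.
keep = rng.random(X_train.shape) > 0.2
X_corrupt = X_train * keep

# Train the network to reconstruct the CLEAN rows from the corrupted ones.
autoencoder = MLPRegressor(hidden_layer_sizes=(32,), max_iter=300, random_state=0)
autoencoder.fit(X_corrupt, X_train)

# Query rows with missing entries: feed zeros where values are missing,
# then keep the observed values and fill the gaps from the reconstruction.
query = np.tanh(rng.normal(size=(10, 3)) @ mixing)
missing = rng.random(query.shape) < 0.3
query_in = np.where(missing, 0.0, query)
reconstruction = autoencoder.predict(query_in)
query_filled = np.where(missing, reconstruction, query)

print("filled", int(missing.sum()), "missing entries")
```

No corruption is applied at prediction time; the only zeros in the query input are the genuinely missing entries.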

Random Forest Classifier Matlab v/s Python

I used a Random Forest Classifier in Python and MATLAB. With 10 trees in the ensemble, I got ~80% accuracy in Python and barely 30% in MATLAB. This difference persisted even when MATLAB's random forests were grown with 100 or 200 trees.
What could be the possible reason for this difference between these two programming languages?
The MATLAB code is below:
load 'path\to\feature vector'; % Observations X Features, loaded as segment_features
load 'path\to\targetValues'; % Observations X Target value, loaded as targets
% Set up Division of Data for Training, Validation, Testing
trainRatio = 70/100;
valRatio = 0/100;
testRatio = 30/100;
[trainInd,valInd,testInd] = dividerand(size(segment_features,1),trainRatio,...
valRatio,testRatio);
% Train the Forest (the target variable is loaded as 'targets', see above)
B=TreeBagger(10,segment_features(trainInd,:), targets(trainInd),...
'OOBPred','On');
% Test the Network
outputs_test = predict(B,segment_features(testInd, :));
% predict returns a cell array of label strings; convert them to numbers
outputs_test = str2double(outputs_test);
targets_test = targets(testInd,:);
Accuracy_test=sum(outputs_test==targets_test)/numel(testInd);
oobErrorBaggedEnsemble = oobError(B);
plot(oobErrorBaggedEnsemble)
xlabel 'Number of grown trees';
ylabel 'Out-of-bag classification error';

The Problem
There are many reasons why the implementation of a random forest in two different programming languages (e.g., MATLAB and Python) will yield different results.
First of all, note that results of two random forests trained on the same data will never be identical by design: random forests often choose features at each split randomly and use bootstrapped samples in the construction of each tree.
Second, different programming languages may have different default values set for the hyperparameters of a random forest (e.g., scikit-learn's random forest classifier uses gini as its default criterion to measure the quality of a split.)
Third, it will depend on the size of your data (which you do not specify in your question). Smaller datasets will yield more variability in the structure of your random forests and, in turn, their output will differ more from one forest to another.
Finally, a decision tree is susceptible to variability in input data (slight data perturbations can yield very different trees). Random forests try to get more stable and accurate solutions by growing many trees, but 10 (or even 100 or 200) trees are often not enough to get stable output.
Toward a Solution
I can recommend several strategies. First, ensure that the data are loaded into each program in an equivalent way. Is MATLAB reading a critical variable differently from Python, causing the variable to become non-predictive (e.g., misreading a numeric variable as a string variable)?
Second, once you are confident that your data are loaded identically across your two programs, read the documentation of the random forest functions closely and ensure that you are specifying the same hyperparameters (e.g., criterion) in your two programs. You want to ensure that the random forests in each are being created as similarly as possible.
Third, it will likely be necessary to increase the number of trees to get more stable output from your forests. Ensure that the number of trees in both implementations is the same.
Fourth, a potential difference between programs may come from how the data are split into training vs testing sets. It may be necessary to ensure some method that allows you to replicate the same cross-validation sets across your two programming languages (e.g., if you have a unique ID for each record, assign those with even numbers to training and those with odd numbers to testing).
Finally, you may also benefit from creating multiple forests in each programming language and comparing the mean accuracy across iterations. This will give you a better sense of whether differences in accuracy are truly reliable and significant or just a fluke.
Good luck!
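On the Python side, the last two suggestions might look like the sketch below: a deterministic split by record ID parity that can be replicated in any language, and several forests per setting so that the spread across seeds is visible. The dataset is synthetic, and the tree counts and number of seeds are arbitrary choices made here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the feature matrix and targets.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, random_state=0
)

# Deterministic split: even record IDs for training, odd IDs for testing.
ids = np.arange(len(y))
train = ids % 2 == 0
test = ids % 2 == 1

def accuracies(n_trees, seeds=range(10)):
    """Test accuracy of one forest per seed, with explicit hyperparameters."""
    scores = []
    for seed in seeds:
        forest = RandomForestClassifier(
            n_estimators=n_trees, criterion="gini", random_state=seed
        )
        forest.fit(X[train], y[train])
        scores.append(forest.score(X[test], y[test]))
    return np.array(scores)

results = {n_trees: accuracies(n_trees) for n_trees in (10, 100)}
for n_trees, acc in results.items():
    print(f"{n_trees:>3} trees: mean accuracy {acc.mean():.3f}, std {acc.std():.3f}")
```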

Feature Selection and Reduction for Text Classification

I am currently working on a project, a simple sentiment analyzer such that there will be 2 and 3 classes in separate cases. I am using a corpus that is pretty rich in the means of unique words (around 200.000). I used bag-of-words method for feature selection and to reduce the number of unique features, an elimination is done due to a threshold value of frequency of occurrence. The final set of features includes around 20.000 features, which is actually a 90% decrease, but not enough for intended accuracy of test-prediction. I am using LibSVM and SVM-light in turn for training and prediction (both linear and RBF kernel) and also Python and Bash in general.
The highest accuracy observed so far is around 75% and I need at least 90%. This is the case for binary classification. For multi-class training, the accuracy falls to ~60%. I need at least 90% in both cases and cannot figure out how to increase it: by optimizing training parameters or by optimizing feature selection?
I have read articles about feature selection in text classification and what I found is that three different methods are used, which have actually a clear correlation among each other. These methods are as follows:
Frequency approach of bag-of-words (BOW)
Information Gain (IG)
X^2 Statistic (CHI)
The first method is already the one I use, but I use it very simply and need guidance for a better use of it in order to obtain high enough accuracy. I am also lacking knowledge about practical implementations of IG and CHI and looking for any help to guide me in that way.
Thanks a lot, and if you need any additional info for help, just let me know.
@larsmans: Frequency Threshold: I am looking at the occurrences of unique words in examples, such that if a word occurs in different examples frequently enough, it is included in the feature set as a unique feature.
@TheManWithNoName: First of all, thanks for your effort in explaining the general concerns of document classification. I examined and experimented with all the methods you brought forward, and others. I found the Proportional Difference (PD) method the best for feature selection, where features are uni-grams, with Term Presence (TP) for the weighting (I didn't understand why you tagged Term-Frequency-Inverse-Document-Frequency (TF-IDF) as an indexing method; I rather consider it a feature weighting approach). Pre-processing is also an important aspect of this task, as you mentioned. I used certain types of string elimination for refining the data, as well as morphological parsing and stemming. Also note that I am working on Turkish, which has different characteristics compared to English. Finally, I managed to reach ~88% accuracy (f-measure) for binary classification and ~84% for multi-class. These values are solid proof of the success of the model I used. This is what I have done so far. I am now working on clustering and reduction models; I have tried LDA and LSI and am moving on to moVMF and maybe spherical models (LDA + moVMF), which seem to work better on corpora that have an objective nature, like news corpora. If you have any information and guidance on these issues, I would appreciate it. I especially need info on setting up an interface (Python-oriented, open-source) between feature space dimension reduction methods (LDA, LSI, moVMF etc.) and clustering methods (k-means, hierarchical etc.).

This is probably a bit late to the table, but...
As Bee points out and you are already aware, the use of SVM as a classifier is wasted if you have already lost the information in the stages prior to classification. However, the process of text classification requires much more than just a couple of stages, and each stage has significant effects on the result. Therefore, before looking into more complicated feature selection measures, there are a number of much simpler possibilities that will typically require much lower resource consumption.
Do you pre-process the documents before performing tokenisation/representation into the bag-of-words format? Simply removing stop words or punctuation may improve accuracy considerably.
Have you considered altering your bag-of-words representation to use, for example, word pairs or n-grams instead? You may find that you have more dimensions to begin with but that they condense down a lot further and contain more useful information.
It's also worth noting that dimension reduction is feature selection/feature extraction. The difference is that feature selection reduces the dimensions in a univariate manner, i.e. it removes terms on an individual basis as they currently appear without altering them, whereas feature extraction (which I think Ben Allison is referring to) is multivariate, combining one or more single terms together to produce higher orthogonal terms that (hopefully) contain more information and reduce the feature space.
Regarding your use of document frequency, are you merely using the probability/percentage of documents that contain a term, or are you using the term densities found within the documents? If category one has only 10 documents and they each contain a term once, then category one is indeed associated with the term. However, if category two has only 10 documents that each contain the same term a hundred times, then obviously category two has a much stronger relation to that term than category one. If term densities are not taken into account, this information is lost, and the fewer categories you have, the more impact this loss will have. On a similar note, it is not always prudent to only retain terms that have high frequencies, as they may not actually be providing any useful information. For example, if a term appears a hundred times in every document, then it is considered a noise term and, while it looks important, there is no practical value in keeping it in your feature set.
Also, how do you index the data: are you using the Vector Space Model with simple boolean indexing or a more complicated measure such as TF-IDF? Considering the low number of categories in your scenario, a more complex measure will be beneficial, as it can account for term importance for each category in relation to its importance throughout the entire dataset.
Personally I would experiment with some of the above possibilities first and then consider tweaking the feature selection/extraction with a (or a combination of) complex equations if you need an additional performance boost.
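As a rough illustration of the simpler options above (stop-word removal, n-grams and TF-IDF weighting instead of boolean indexing), here is a sketch using scikit-learn, which is an assumption since the thread uses LibSVM and SVM-light directly. The four documents are made up, and the English stop-word list would of course need replacing for a Turkish corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Tiny made-up corpus, for illustration only.
docs = [
    "the film was great and the acting was great",
    "the plot was dull and the acting was terrible",
    "great soundtrack, great cast",
    "terrible pacing and a dull script",
]

# Baseline: boolean bag-of-words (term presence only).
binary_bow = CountVectorizer(binary=True)
X_binary = binary_bow.fit_transform(docs)

# Alternative: unigrams + bigrams, stop words removed, TF-IDF weighting.
tfidf = TfidfVectorizer(
    ngram_range=(1, 2),    # single words and word pairs
    stop_words="english",  # swap for a list suited to the corpus language
    sublinear_tf=True,     # dampen very high within-document counts
)
X_tfidf = tfidf.fit_transform(docs)

print("boolean features:", X_binary.shape[1])
print("tf-idf features: ", X_tfidf.shape[1])
print(sorted(tfidf.vocabulary_))
```

Either matrix can be passed to a linear SVM, so the representations can be compared on the same cross-validation folds.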
Additional
Based on the new information, it sounds as though you are on the right track and 84%+ accuracy (F1 or BEP - precision and recall based for multi-class problems) is generally considered very good for most datasets. It might be that you have successfully acquired all information rich features from the data already, or that a few are still being pruned.
Having said that, something that can be used as a predictor of how good aggressive dimension reduction may be for a particular dataset is 'Outlier Count' analysis, which uses the decline of Information Gain in outlying features to determine how likely it is that information will be lost during feature selection. You can use it on the raw and/or processed data to give an estimate of how aggressively you should aim to prune features (or unprune them as the case may be). A paper describing it can be found here:
Paper with Outlier Count information
With regards to describing TF-IDF as an indexing method, you are correct in it being a feature weighting measure, but I consider it to be used mostly as part of the indexing process (though it can also be used for dimension reduction). The reasoning for this is that some measures are better aimed toward feature selection/extraction, while others are preferable for feature weighting specifically in your document vectors (i.e. the indexed data). This is generally due to dimension reduction measures being determined on a per category basis, whereas index weighting measures tend to be more document orientated to give superior vector representation.
In respect to LDA, LSI and moVMF, I'm afraid I have too little experience of them to provide any guidance. Unfortunately I've also not worked with Turkish datasets or the python language.

I would recommend dimensionality reduction instead of feature selection. Consider either singular value decomposition, principal component analysis, or even better considering it's tailored for bag-of-words representations, Latent Dirichlet Allocation. This will allow you to notionally retain representations that include all words, but to collapse them to fewer dimensions by exploiting similarity (or even synonymy-type) relations between them.
All these methods have fairly standard implementations that you can get access to and run---if you let us know which language you're using, I or someone else will be able to point you in the right direction.
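For Python, one possible sketch uses scikit-learn's TruncatedSVD on a TF-IDF matrix (often called latent semantic analysis). The documents and the choice of 2 components are made up for illustration; a real corpus would typically use a few hundred components.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus, for illustration only.
docs = [
    "great film with a great cast",
    "wonderful movie and superb acting",
    "terrible film with a dull plot",
    "awful movie and boring script",
    "the cast was superb",
    "the plot was boring",
]

# Bag-of-words with TF-IDF weights, then projection to a few dense dimensions.
lsa = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=0),
)
X_reduced = lsa.fit_transform(docs)

print(X_reduced.shape)  # one row per document, one column per component
```

All words still contribute to the reduced representation; none are discarded outright as they would be with a frequency threshold.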

There's a Python library for feature selection, TextFeatureSelection. This library provides discriminatory power in the form of a score for each word token, bigram, trigram, etc.
For those familiar with feature selection methods in machine learning, it is based on the filter method and gives ML engineers the tools to improve classification accuracy in their NLP and deep learning models. It has 4 methods, namely Chi-square, Mutual information, Proportional difference and Information gain, to help select words as features before they are fed into machine learning classifiers.
from TextFeatureSelection import TextFeatureSelection
#Multiclass classification problem
input_doc_list=['i am very happy','i just had an awesome weekend','this is a very difficult terrain to trek. i wish i stayed back at home.','i just had lunch','Do you want chips?']
target=['Positive','Positive','Negative','Neutral','Neutral']
fsOBJ=TextFeatureSelection(target=target,input_doc_list=input_doc_list)
result_df=fsOBJ.getScore()
print(result_df)
#Binary classification
input_doc_list=['i am content with this location','i am having the time of my life','you cannot learn machine learning without linear algebra','i want to go to mars']
target=[1,1,0,1]
fsOBJ=TextFeatureSelection(target=target,input_doc_list=input_doc_list)
result_df=fsOBJ.getScore()
print(result_df)
Edit:
It now has genetic algorithm for feature selection as well.
from TextFeatureSelection import TextFeatureSelectionGA
#Input documents: doc_list
#Input labels: label_list
getGAobj=TextFeatureSelectionGA(percentage_of_token=60)
best_vocabulary=getGAobj.getGeneticFeatures(doc_list=doc_list,label_list=label_list)
Edit2
There is another method now, TextFeatureSelectionEnsemble, which combines feature selection with ensembling. It does feature selection for base models through document frequency thresholds. At the ensemble layer, it uses a genetic algorithm to identify the best combination of base models and keeps only those.
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from TextFeatureSelection import TextFeatureSelectionEnsemble
imdb_data=pd.read_csv('../input/IMDB Dataset.csv')
le = LabelEncoder()
imdb_data['labels'] = le.fit_transform(imdb_data['sentiment'].values)
#convert raw text and labels to python list
doc_list=imdb_data['review'].tolist()
label_list=imdb_data['labels'].tolist()
#Initialize parameter for TextFeatureSelectionEnsemble and start training
gaObj=TextFeatureSelectionEnsemble(doc_list,label_list,n_crossvalidation=2,pickle_path='/home/user/folder/',average='micro',base_model_list=['LogisticRegression','RandomForestClassifier','ExtraTreesClassifier','KNeighborsClassifier'])
best_columns=gaObj.doTFSE()
Check the project for details: https://pypi.org/project/TextFeatureSelection/

Linear SVM is recommended for high-dimensional features. Based on my experience, the ultimate limit on SVM accuracy depends on the positive and negative "features". You can do a grid search (or, in the case of linear SVM, you can just search for the best cost value) to find the optimal parameters for maximum accuracy, but in the end you are limited by the separability of your feature sets. The fact that you are not getting 90% means that you still have some work to do finding better features to describe the members of your classes.
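A minimal sketch of the cost search, assuming scikit-learn's LinearSVC rather than the LibSVM/SVM-light command-line tools used in the question. The data are synthetic and the grid of C values is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Synthetic high-dimensional stand-in for a bag-of-words matrix.
X, y = make_classification(
    n_samples=500, n_features=200, n_informative=20, random_state=0
)

# For a linear SVM the cost C is the main parameter worth searching.
search = GridSearchCV(
    LinearSVC(max_iter=10000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

print("best C:", search.best_params_["C"])
print(f"cross-validated accuracy: {search.best_score_:.3f}")
```

If the best score stays flat across the grid, that suggests the limit is in the features rather than in the cost parameter.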

I'm sure this is way too late to be of use to the poster, but perhaps it will be useful to someone else. The chi-squared approach to feature reduction is pretty simple to implement. Assuming BoW binary classification into classes C1 and C2, for each feature f in candidate_features: calculate the frequency of f in C1; calculate the total words in C1; repeat the calculations for C2; calculate a chi-square statistic; then filter candidate_features based on whether the p-value is below a certain threshold (e.g. p < 0.05). A tutorial using Python and nltk can be seen here: http://streamhacker.com/2010/06/16/text-classification-sentiment-analysis-eliminate-low-information-features/ (though if I remember correctly, I believe the author incorrectly applies this technique to his test data, which biases the reported results).
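The same filter can be sketched with scikit-learn's chi2 function instead of nltk (a swapped-in tool, not the tutorial's code). The toy documents are made up, and the scores are computed on the training split only, so the test data plays no part in choosing features.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2
from sklearn.model_selection import train_test_split

# Made-up toy corpus: 'great' marks class 1, 'terrible' marks class 0.
pos = ["great movie", "great acting in this movie", "a great cast"] * 10
neg = ["terrible movie", "terrible acting in this movie", "a terrible cast"] * 10
docs = pos + neg
labels = [1] * len(pos) + [0] * len(neg)

docs_train, docs_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.25, stratify=labels, random_state=0
)

# Fit the vocabulary and the chi-squared scores on the training split only.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(docs_train)
scores, p_values = chi2(X_train, y_train)

terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
keep = p_values < 0.05
selected = [term for term, kept in zip(terms, keep) if kept]
print("selected features:", selected)

# Apply the same vocabulary and the same mask to the test split.
X_train_sel = X_train[:, keep]
X_test_sel = vectorizer.transform(docs_test)[:, keep]
```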
