Question regarding DecisionTreeClassifier - python

I am making an explainable model with the past data, and not going to use it for future prediction at all.
In the data, there are a hundred X variables, and one Y binary class and trying to explain how Xs have effects on Y binary (0 or 1).
I came up with DecisionTree classifier as it clearly shows us that how decisions are made by value criterion of each variable
Here are my questions:
Is it necessary to split X data into X_test, X_train even though I am not going to predict with this model? ( I do not want to waste data for the test since I am interpreting only)
After I split the data and train model, only a few values get feature importance values (like 3 out of 100 X variables) and rest of them go to zero. Therefore, there are only a few branches. I do not know reason why it happens.
If here is not the right place to ask such question, please let me know.
Thanks.

No it is not necessary but it is a way to check if your decision tree is overfitting and just remembering the input values and classes or actually learning the pattern behind it. I would suggest you look into cross-validation since it doesn't 'waste' any data and trains and tests on all the data. If you need me to explain this further, leave a comment.
Getting any number of important features is not an issue since it does depend very solely on your data.
Example:
Let's say I want to make a model to tell if a number will be divisible by 69 (my Y class).
I have my X variables as divisibility by 2,3,5,7,9,13,17,19 and 23.
If I train the model correctly, I will get feature importance of only 3 and 23 as very high and everything else should have very low feature importance.
Consequently, my decision tree (trees if using ensemble models like Random Forest / XGBoost) will have less number of splits.
So, having less number of important features is normal and does not cause any problems.

No, it isn't. However, I would still split train-test and measure performance separately. While an explainable model is nice, it is significantly less nicer if it's a crap model. I'd make sure it had at least a reasonable performance before considering interpretation, at which point the splitting is unnecessary.
The number of important features is data-dependent. Random forests do a good job providing this as well. In any case, fewer branches is better. You want a simpler tree, which is easier to explain.

Related

Why performs the NN better with OneHotEncoding?

i have a question just for a general case. So i am working with the poker-hand-dataset, which has 10 possible outputs from 0-9, each number gives a poker-hand, for example royal flush.
So i read in the internet, that it is necessary to use OHE in a multiclass problem because if not there would be like a artificial order, for example if you work with cities. But in my case with the poker hands there is a order from one pair over flush and straight to royal flush, right?
Even though my nn performs better with OHE, but it works also (but bad) without.
So why does it work better with the OHE? I did a Dense Network with 2 hidden layer.
Short answer - depending on the use of the feature in the classification and according to the implementation of the classifier you use, you decide if to use OHE or not. If the feature is a category, such that the rank has no meaning (for example, the suit of the card 1=clubs, 2=hearts...) then you should use OHE (for frameworks that require categorical distinction), because ranking it has no meaning. If the feature has a ranking meaning, with regards to the classification, then keep it as-is (for example, the probability of getting a certain winnig hand).
As you did not specify to what task you are using the NN nor the loss function and a lot of other things - I can only assume that when you say "...my nn performs better with OHE" you want to classify a combination to a class of poker hands and in this scenario the data just presents for the learner the classes to distinguish between them (as a category not as a rank). You can add a feature of the probability and/or strength of the hand etc. which will be a ranking feature - as for the resulted classifier, that's a whole other topic if adding it will improve or not (meaning the number of features to classification performance).
Hope I understood you correctly.
Note - this is a big question and there is a lot of hand waving, but this is the scope.

Classification algorithms that work well with high dimensional dataset?

I have a dataset with low data points but very high dimensions/features. I wanted to know if there's any classification algorithm that work well with such dataset without having to perform dimensionality reduction techniques such as PCA, TSNE?
df.shape
(2124, 466029)
This is the classic curse of dimensionality (or p>>n) problem (with p being the number of predictors and n the number of observations).
Many techniques have been developed to try and address this problem.
You can randomly restrict your variables (you select different random subsets) and then asses their importance using cross-validation.
A preferable approach (imho) would be to use ridge-regression, lasso, or elastic net for regularization, however be aware that their oracle properties are rarely satisfied in practice.
Finally, there are algorithms that are able to deal with a very large number of predictors (and tweaks in their implementation that improve the performance when p>>n).
Examples of such models are support vector machine or random forest.
There are many resources on the topic, which are freely available.
You can have a look at these slides from Duke University for example.
Oracle properties (Lasso)
I will not explain in a sound mathematical way but I'll briefly give you some intuition.
Y= dependent variable, your target
X= regressors, your features
ε= your errors
We define a shrinkage procedure oracle if it is asymptotically able to:
Identify the right subset of regressors (i.e. retain only the features that have a true causal relationship with your dependent variable.
Has an optimal estimation rate (I'll leave the details out)
There are three assumptions that, if satisfied, make the lasso oracle.
Beta-min condition: The coefficients of the "true" regressors is above a certain threshold.
Your regressors are uncorrelated with each other.
X and ε are normally distributed and homoskedatisc
In practice you rarely have these assumptions satisfied.
What happens in that case is that your shrinkage will not necessarily retain the right variables.
This implies that you can't make statistically sound inference on the final model (you can't say X_1 explains Y for this and this other reason).
The intuition is simple. If assumption 1 is not satisfied one of the true variables might be incorrectly removed. If assumption 2 is not satisfied then a variable highly correlated with one of the true variables might be incorrectly retained in stead of the right one.
All in all, you shouldn't worry if your aim is forecasting. Your forecast will still be good! The only difference is that mathematically you can't say anymore that you are selecting the correct variables with probability -> 1.
PS: Lasso is a special case of elastic net, I vaguely remember that the oracle property of the elastic net has been proved as well but I might be wrong.
PPS: Corrections are appreciated as I haven't studied these things in a long while and there might be inaccuracies.
You could try a lasso/ridge/elastic net logistic regression.

How to determine most impactful input variables in a dataset?

I have a neural network program that is designed to take in input variables and output variables, and use forecasted data to predict what the output variables should be based on the forecasted data. After running this program, I will have an output of an output vector. Lets say for example, my input matrix is 100 rows and 10 columns and my output matrix is a vector with 100 values. How do I determine which of my 10 variables (columns) had the most impact on my output?
I've done a correlation analysis between each of my variables (columns) and my output and created a list of the highest correlation between each variable and output, but I'm wondering if there is a better way to go about this.
If what you want to know is model selection, and it's not as simple as studiying the correlation of your features to your target. For an in-depth, well explained look at model selection, I'd recommend you read chapter 7 of The Elements Statistical Learning. If what you're looking for is how to explain your network, then you're in for a treat as well and I'd recommend reading this article for starters, though I won't go into the matter myself.
Naive approaches to model selection:
There a number of ways to do this.
The naïve way is to estimate all possible models, so every combination of features. Since you have 10 features, it's computationally unfeasible.
Another way is to take a variable you think is a good predictor and train to model only on that variable. Compute the error on the training data. Take another variable at random, retrain the model and recompute the error on the training data. If it drops the error, keep the variable. Otherwise discard it. Keep going for all features.
A third approach is the opposite. Start with training the model on all features and sequentially drop variables (a less naïve approach would be to drop variables you intuitively think have little explanatory power), compute the error on training data and compare to know if you keep the feature or not.
There are million ways of going about this. I've exposed three of the simplest, but again, you can go really deeply into this subject and find all kinds of different information (which is why I highly recommend you read that chapter :) ).

Deal with excessive number of zeros

ipdb> np.count_nonzero(test==0) / len(ytrue) * 100
76.44815766923736
I have a datafile counting 24000 prices where I use them for a time series forecasting problem. Instead of trying predicting the price, I tried to predict log-return, i.e. log(P_t/P_P{t-1}). I have applied the log-return over the prices as well as all the features. The prediction are not bad, but the trend tend to predict zero. As you can see above, ~76% of the data are zeros.
Now the idea is probably to "look up for a zero-inflated estimator : first predict whether it's gonna be a zero; if not, predict the value".
In details, what is the perfect way to deal with excessive number of zeros? How zero-inflated estimator can help me with that? Be aware originally I am not probabilist.
P.S. I am working trying to predict the log-return where the units are "seconds" for High-Frequency Trading study. Be aware that it is a regression problem (not a classification problem).
Update
That picture is probably the best prediction I have on the log-return, i.e log(P_t/P_{t-1}). Although it is not bad, the remaining predictions tend to predict zero. As you can see in the above question, there is too many zeros. I have probably the same problem inside the features as I take the log-return on the features as well, i.e. if F is a particular feature, then I apply log(F_t/F_{t-1}).
Here is a one day data, log_return_with_features.pkl, with shape (23369, 30, 161). Sorry, but I cannot tell what are the features. As I apply log(F_t/F_{t-1}) on all the features and on the target (i.e. the price), then be aware I added 1e-8 to all the features before applying the log-return operation to avoid division by 0.
Ok, so judging from your plot: it's the nature of the data, the price doesn't really change that often.
Try subsampling your original data a bit (perhaps by a factor of 5, just look at the data), so that you generally see a price movement with every time-tick. This should make any modeling much MUCH easier.
For the subsampling: I suggest you do simple regular downsampling in time domain. So if you have price data with a second resolution (i.e. one price tag every second), then simply take every fifth datapoint. Then proceed as you usually do, specifically, compute the log-increase in the price from this subsampled data. Remember that whatever you do, it must be reproducible during the test time.
If that is not an option for you for whatever reasons, have a look at something that can handle multiple time scales, e.g. WaveNet or Clockwork RNN.

How to approach feature elimination?

I have been working on a couple of dataset to build predictive models based on them. However I am left a bit bewildered when its coming to elimination of features.
The first one is the Boston Housing dataset and the second is Bigmart Sales dataset. I will focus my question around these two however I would also appreciate relatively generalized answers too.
Boston Housing : I have constructed a correlation coefficient matrix and eliminated the features which has an absolute correlation coefficient of less than 0.50 with respect to the target variable medv. That is leaving me with three features. However, I also do understand that a correlation matrix can be highly deceptive and does not capture non-linear relationships and as a matter of fact features such as crim, indus etc does have non-linear relationship with medv and intuitively it simply does not feel correct to discard them right away.
Bigmart Sales : There are around 30+ features that is created after OneHotEncoding in Python. I have given a go to backward elimination method while I was constructing a linear regression model but I am not exactly sure how to apply backward elimination when I was working on a Decision Tree model for this dataset (not sure if it can actually be applied to Decision Tree at all).
It would be of great help if I can get some idea on how to approach to feature elimination for the above two cases. Let me know if you need more info, I will gladly provide.
It's extremely general question. I don't think that it possible to answer to your question in StackOverFlow format.
For every ML / Statistical model you need different Feature Elimination / Feature Engineering approach:
Linear / Logistic / GLM models require removal of correlated features
For Neural Nets / Boosted trees removal of features will heart performance of the model
Even for one type of models there's no single best way of doing Feature Elimination
If you can add more specific information to your question it'll be possible to discuss it in details.
This is a fun one without any definitive answers (No Free Lunch Theorems) that apply across the board. That said, there are many guidelines which typically have success in real-world problems. Those guidelines will work fine in the specific datasets you explicitly mentioned as well.
As with just about anything else, one must always consider the purpose of feature elimination. Without a goal or set of goals, any answer is valid. With an objective, not only can you hone in on a good answer, but it can open up the door to other ideas you may not have considered. Typically feature elimination is done for one of four reasons:
Increased Accuracy
Increased Generalization
Decreased Bias
Decreased Variance
Decreased Computational Costs
Ease of Explanation
Of course there are other reasons, but these cover the main use cases. With respect to any of those metrics, the obvious (and awful -- never do this) way to choose which ones to keep is to try all combinations in your model and see what happens. In the Boston Housing dataset, this yields 2^13=8192 possible combinations of features to test. The combinatorial growth is exponential, and not only is this approach likely to lead to survivorship bias, it is too expensive for most people and most data.
Barring any sort of a comprehensive examination of all possible options, one must use a heuristic of some kind to attempt to find the same results. I'll mention several:
Train the model n times, each with precisely one feature removed (a different feature each time). If a model has poor performance it indicates that the removed feature is important.
Train the model once with all features, and randomly perturb each input one feature at a time (this can be done stochastically if you don't want to waste time on every input). The features which cause the most classification error when perturbed are the ones which matter the most.
As you said, perform some sort of correlation testing with the target variable to determine feature importance and a cross-correlation to remove duplicated linear information.
These different approaches have different assumptions and goals. Feature removal is important from a computational standpoint (many machine learning algorithms are quadratic or worse in the number of features), and with that perspective the goal is to preserve the behavior of the model as best as possible while removing as much information (i.e., as much complexity) as possible. In the Boston Housing data set, your cross-correlation analysis would probably leave you with Charles River Proximity, Nitrous Oxide Concentration, and Average Room Number as the most relevant variables. Between those three you capture nearly all the accuracy a linear model can obtain on the data.
One thing to point out is that feature removal by definition removes information. This can improve accuracy and generalization for only a few reasons.
By removing redundant information, the model has less bias toward those features and is better able to generalize.
By removing noisy information, the model can focus its efforts on features with high informational content. Note that this affects non-deterministic models like neural networks more than models like linear regressions. Linear regressions always converge to the one unique solution (except in special cases that happen with a true 0% probability where there are multiple solutions).
When you're throwing a lot of features into an algorithm (50k different genes for an organism for example), it makes a lot of sense that some of them won't carry any information. By definition then, any variance they have is noise that the model may inadvertently pick up instead of the signal we want. Feature removal is a common strategy in that domain which improves accuracy dramatically.
Contrast that with the Boston Housing data which has 13 carefully curated features, all of which carry information (based on eyeballing crude scatter plots with respect to the target variable). That particular reasoning isn't likely to affect accuracy much. Moreover, there aren't enough features for there to be very much bias introduced with duplicated information.
On top of that, there are hundreds of data points covering the majority of the input space, so even if we did have bias problems or extraneous features, there is more than enough data that the effects will be negligible. Perhaps enough to make or break the 1st or 2nd place winners in Kaggle, but not enough to make the difference between a good analysis and a great analysis.
Especially if you're using a linear algorithm on top though, having fewer features can greatly aid in the explainability of a model. If you restrict your model to those three variables, it's pretty easy to tell a person that you know houses in the area are expensive because they're all waterfront, they're huge, and they have nice lawns (nitrous oxide indicates fertilizer usage).
Removing features is only a small portion of feature engineering, and another important technique is the addition of features. Adding features usually amounts to low-order polynomial interactions (as an example, the age variable has a fairly weak correlation to the medv variable, but if you square it then the data straightens out a bit and improves the correlation).
Adding features (and removing them) can be aided greatly with a little domain knowledge. I don't know a ton about housing, so I can't add a lot of help here, but in other domains like credit worthiness you can easily imagine combining debt and income features to get a ratio of debt to income as a single feature. Reshaping those features so that they linearly correlate to your output and represent physically meaningful quantities in the domain is a big part of obtaining accuracy and generalizability.
With respect to generalizability and domain knowledge, even with something as simple as a linear model it's important to be able to explain why a feature is important. Just because the data says that nitrous oxide matters in the test set doesn't mean that it will carry any predictive weight in the train set as well. Especially as the number of features grows and the amount of data shrinks, you will expect such correlations to occur purely by accident. Having a physical interpretation (nitrous oxide corresponds to nice lawns) yields confidence that the model isn't learning spurious correlations.

Categories