I have been asked to implement a Random Forest classifier, which to my understanding is just a bunch of decision trees through which the test data is run after training, with the classification then determined by majority voting across all the trees.
This is all well and good, and I even understand that entropy determines which feature to use next. What I am struggling to understand is: for numeric data, how do I determine the conditions?
An example is whether a person will play golf depending on weather conditions. Given 3 features (outlook, humidity, wind) and a classification label (play -> yes or no), we first start with outlook:
Outlook -> Overcast (pure), Sunny, Rain
From Sunny, choose Humidity next: High, Normal (pure)
From Outlook to Rain, choose Wind (last feature): Weak (pure), Strong
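Written out as code, that finished tree is just nested conditions on categorical values, something like this (a minimal sketch, assuming the usual play-golf labels at the pure leaves):

```python
def play_golf(outlook, humidity, wind):
    """Hand-coded version of the tree described above."""
    if outlook == "Overcast":                       # pure leaf
        return "Yes"
    elif outlook == "Sunny":                        # split again on Humidity
        return "No" if humidity == "High" else "Yes"
    else:                                           # Rain: split on Wind
        return "No" if wind == "Strong" else "Yes"

print(play_golf("Sunny", "High", "Weak"))           # -> "No"
```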
Essentially, in this case the values of the features are taken individually. But what happens when I have a dataset with a bunch of decimals?
(Some of) the data:
In this case I would start by first looking at the label (0 or 1), then progress to the feature that gives the best entropy-based split at each step. But how do I know the conditions for going to a leaf node? Or even, how many children does a parent have?
A poor diagram to aid my question:
For a theoretical answer to your question, I would start by recommending this excellent visual tutorial.
http://www.r2d3.us/visual-intro-to-machine-learning-part-1/
In terms of implementation, there are several ways to go about it. You could try the following algorithm (inspired by this answer):
For each column (feature) in your dataset, start by sorting it. At every point where you have a class change, split your dataset. Say, for example, that your data points change from class 0 to 1 when feature A is equal to 5. All data points with A < 5 will belong to class 0, and the ones with A >= 5 will belong to class 1. In case your dataset is not as simple, you can then proceed the way you would with a categorical decision tree, for example by calculating the entropy at each splitting candidate. You then work out which data points arrive at each child node, and proceed recursively (a sketch of this idea follows below).
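A minimal Python sketch of that idea for a single continuous feature: sort it, score the midpoints between consecutive distinct values by information gain, and keep the best one. The helper names (entropy, best_threshold) are mine, not from any particular library:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_threshold(feature, labels):
    """Best split threshold on one continuous feature, by information gain.
    Candidates are midpoints between consecutive distinct sorted values."""
    order = np.argsort(feature)
    x, y = feature[order], labels[order]
    parent = entropy(y)
    best_gain, best_t = 0.0, None
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue
        t = (x[i] + x[i - 1]) / 2
        left, right = y[:i], y[i:]
        child = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        gain = parent - child
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain

# tiny demo: the class flips between 3.8 and 4.9 -> threshold 4.35, gain 1.0
x = np.array([2.7, 1.3, 3.8, 5.1, 6.4, 4.9])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_threshold(x, y))
```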
I am enquiring about assistance with a regression analysis. I have a continuous time series that fluctuates between positive and negative integers. I also have events occurring throughout this time series at seemingly random time points. Essentially, when an event occurs I grab the corresponding integer. I then want to test whether this integer influences the event at all; that is, are there more positive or negative integers at event times?
I originally thought of logistic regression with the positive/negative number, but that would require at least two distinct groups, whereas I only have information on events that have occurred. I can't really include the events that don't occur, as the series is somewhat continuous and random; the number of times an event doesn't occur is impossible to measure.
So my distinct group is all true in a sense, as I don't have any results from something that didn't occur. What I am trying to classify is:
When an outcome occurs, does the positive or negative integer influence this outcome?
It sounds like you are interested in determining the underlying forces that are producing a given stream of data. Such mathematical models are called Markov Models. A classic example is the study of text.
For example, if I run a Hidden Markov Model algorithm on a paragraph of English text, then I will find that there are two driving categories that are determining the probabilities of what letters show up in the paragraph. Those categories can be roughly broken into two groups, "aeiouy " and "bcdfghjklmnpqrstvwxz". Neither the mathematics nor the HMM "knew" what to call those categories, but they are what is statistically converged to upon analysis of a paragraph of text. We might call those categories "vowels" and "consonants". So, yes, vowels and consonants are not just 1st grade categories to learn, they follow from how text is written statistically. Interestingly, a "space" behaves more like a vowel than a consonant. I didn't give the probabilities for the example above, but it is interesting to note that "y" ends up with a probability of roughly 0.6 vowel and 0.4 consonant; meaning that "y" is the most consonant behaving vowel statistically.
A great paper is https://www.cs.sjsu.edu/~stamp/RUA/HMM.pdf which goes over the basic ideas of this kind of time-series analysis and even provides some pseudo-code for reference.
I don't know much about the data that you are dealing with, and I don't know whether the concepts of "positive" and "negative" are playing a determining role in the data that you see. But if you ran an HMM on your data and found the two groups to be the collection of positive numbers and the collection of negative numbers, then your answer would be confirmed: yes, the two most influential categories driving your data are the concepts of positive and negative. If the groups don't split along those lines, then your answer is that those concepts are not an influential factor in driving the data. Either way, the algorithm ends with several probability matrices that show you how much each integer in your data is influenced by each category, so you would have much greater insight into the behavior of your time-series data.
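As a rough illustration of what this looks like in practice, here is a sketch using the hmmlearn package (an assumption on my part; any HMM library works similarly) on a made-up stream of integers. If the two learned state means come out clearly positive and clearly negative, that supports the positive/negative interpretation:

```python
import numpy as np
from hmmlearn import hmm  # assumed available; pip install hmmlearn

# toy stream of integers observed over time (stand-in for your series)
observations = np.array([3, -2, 5, -1, -4, 2, 7, -3, -5, 1], dtype=float).reshape(-1, 1)

# fit a 2-state Gaussian HMM and ask which hidden state produced each point
model = hmm.GaussianHMM(n_components=2, covariance_type="diag",
                        n_iter=100, random_state=0)
model.fit(observations)
states = model.predict(observations)

# if the learned state means are clearly positive vs. negative,
# the positive/negative split is what drives the sequence
print("state means:", model.means_.ravel())
print("state sequence:", states)
```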
The question is quite difficult to understand after the first paragraph, but let me help based on what I could understand.
Assuming you want to understand whether there is a relationship between the events and the integers in the data:
1st approach: plot the data on a 2D scale and check visually whether there is a relationship.
2nd approach: turn the event values into a continuous series, separate them from the rest of the data, smooth the data with a rolling window, and then compare both trends (a rough sketch of the smoothing step follows below).
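A minimal pandas sketch of the rolling-window smoothing step, using a hypothetical frame that holds the series and an event flag:

```python
import pandas as pd

# hypothetical frame: a timestamp index, the measured integer, and an event flag
df = pd.DataFrame(
    {"value": [3, -2, 5, -1, -4, 2, 7, -3], "event": [0, 1, 0, 0, 1, 0, 1, 0]},
    index=pd.date_range("2023-01-01", periods=8, freq="h"),
)

# smooth the series with a rolling mean, then look at the values at event times
smoothed = df["value"].rolling(window=3, min_periods=1).mean()
at_events = smoothed[df["event"] == 1]
print(at_events)
```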
The approach above only works well if I am understanding your problem correctly.
There is also something known as survivorship bias. You might be missing data, so please check that part as well.
Maybe I am misunderstanding your problem, but I don't believe that you can perform any kind of meaningful regression without more information.
Regression is usually used to find a relationship between two or more variables; however, it appears that you only have one variable (whether the numbers are positive or negative) and one constant (the outcome is always true in your data). Maybe you could compute some statistics on the distribution of the numbers (mean, median, standard deviation), but I am unsure how you might do regression.
https://en.wikipedia.org/wiki/Regression_analysis
You might want to consider that there might be some strong survivorship bias if you are missing a large chunk of your data.
https://en.wikipedia.org/wiki/Survivorship_bias
Hope this is at least a bit helpful in getting you steered in the right direction.
In KNN (k-nearest neighbours) classifiers, if an even value of k is chosen, what would the prediction be under the majority voting rule, or under a Euclidean distance rule? For example, suppose there are 3 classes, say
Iris-setosa
Iris-versicolor
Iris-virginica
And now say we have n_neighbors = 6. There is a fair chance of getting a tie result under the majority voting rule. In most visualisations this region is shown in white, meaning no decision could be reached. But what would the actual prediction be for a tie? This situation is hard to emulate and fairly conceptual, so it may not be easy to reproduce.
Also, does an odd value of n_neighbors solve or reduce this issue? Do you think that, instead of simple majority voting, Euclidean or Manhattan distance weighting would handle this better? The sklearn docs do not mention this at all.
After some digging, I have some good answers. First, let me address the claim, mentioned by some users like #anasvaf, that you should only use an odd number of neighbours for binary classification. This is absolutely untrue. Firstly, when we use majority voting for binary classification and there is a tie, the outcome depends entirely on the library used. For example, scikit-learn takes the mode of the neighbour labels, which means that if class 1 has more data points in the training dataset, then 1 will be chosen on a tie. But there is a better solution.
We can use weighted KNN instead of plain KNN. In weighted KNN, if there is a tie, we can compare the total distance from the query point of the k neighbours labelled 1 with that of the neighbours labelled 0. If the total distance for class 1 is larger, the prediction is class 0, and vice versa (a sketch follows below).
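In scikit-learn terms, distance weighting is available out of the box via weights="distance"; a small sketch on the iris data (the choice of n_neighbors=6 is just illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# weights="distance" makes closer neighbours count more, so an exact
# 3-vs-3 split among 6 neighbours is effectively broken by distance
clf = KNeighborsClassifier(n_neighbors=6, weights="distance")
clf.fit(X, y)
print(clf.predict(X[:5]))
```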
There are other good techniques for handling ties in KNN too, but to be very honest, KNN is not a great learning algorithm, especially because of its space and time complexity on large datasets.
Since you are using majority voting, choosing an odd value for the number of nearest neighbours solves the issue where, for instance, two class labels achieve the same score.
I'm doing a Random Forest implementation (for classification), and I have some questions regarding the tree growing algorithm mentioned in literature.
When training a decision tree, there are 2 criteria to stop growing a tree:
a. Stop when there are no more features left to split a node on.
b. Stop when the node has all samples in it belonging to the same class.
Based on that,
1. Consider growing one tree in the forest. When splitting a node of the tree, I randomly select m of the M total features, and then from these m features I find that one feature with maximum information gain. After I've found this one feature, say f, should I remove this feature from the feature list, before proceeding down to the children of the node? If I don't remove this feature, then this feature might get selected again down the tree.
If I implement the algorithm without removing the feature selected at a node, then the only way to stop growing the tree is when its leaves become "pure". When I did this, I got a "maximum recursion depth exceeded" error in Python, because the tree couldn't reach that "pure" condition early enough.
The RF literature, even the papers written by Breiman, says that the tree should be grown to the maximum. What does this mean?
2. At a node split, after selecting the best feature to split on (by information gain), what should the threshold be? One approach is to have no threshold and create one child node for every unique value of the feature; but I have continuous-valued features too, so that would mean creating one child node per sample!
Q1
You shouldn't remove features from M; otherwise the tree will not be able to detect some types of relationships (e.g. linear relationships).
Maybe you can stop earlier: with your condition the tree might grow down to leaves with only 1 sample, which has no statistical significance. So it's better to stop when, say, the number of samples at a leaf is <= 3 (a sketch follows below).
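In scikit-learn that early stop corresponds to the min_samples_leaf parameter; a small sketch on synthetic data (all values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# refuse any split that would leave fewer than 3 samples in a leaf,
# instead of growing until every leaf is pure
tree = DecisionTreeClassifier(criterion="entropy", min_samples_leaf=3, random_state=0)
tree.fit(X, y)
print(tree.get_depth(), tree.get_n_leaves())
```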
Q2
For continuous features, maybe you can bin them into groups and use the bin boundaries as candidate splitting points (see the sketch below).
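A minimal sketch of that binning idea, using quantiles of the feature as candidate thresholds; the helper name split_candidates is mine, not from any library:

```python
import numpy as np

def split_candidates(feature, n_bins=10):
    """Quantile-based candidate thresholds for one continuous feature."""
    qs = np.linspace(0, 1, n_bins + 1)[1:-1]      # interior quantiles only
    return np.unique(np.quantile(feature, qs))     # drop duplicate edges

x = np.random.RandomState(0).normal(size=100)
print(split_candidates(x, n_bins=5))
```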
I have a classification problem with time series data.
Each example has 10 variables which are measured at irregular intervals and in the end the object is classified into 1 of the 2 possible classes (binary classification).
I have only the final class of the example to learn from during training. But when given a new example, I would like to make a prediction at each timestamp (in an online manner). So, if the new example had 25 measurements, I would like to make 25 predictions of its class; one at each timestamp.
The way I am implementing this currently is by using the min, mean and max of the measurements of its 10 variables up to that point as features for classification. Is this optimal? What would be a better way?
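For reference, this is roughly what I am computing, sketched with pandas expanding windows (made-up numbers, and only 2 of the 10 variables shown):

```python
import pandas as pd

# hypothetical example: rows are irregular timestamps, columns are the variables
measurements = pd.DataFrame(
    [[0.1, 2.0], [0.4, 1.5], [0.2, 3.1], [0.9, 2.7]],
    columns=["var_1", "var_2"],
)

# "everything seen so far" summaries: one feature row per timestamp
features = pd.concat(
    {
        "min": measurements.expanding().min(),
        "mean": measurements.expanding().mean(),
        "max": measurements.expanding().max(),
    },
    axis=1,
)
print(features)
```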
If you have to make predictions at each timestamp, then this doesn't really become a time series problem (unless you plan to use the sequence of previous observations to make your next prediction, in which case you will need to train a sequence-based model). Assuming you can only train a model based on the final data you observe, there can be many approaches, but I'd recommend you use a Random Forest with a large number of trees and 3 or 4 variables considered in each tree. That way, even if some variables don't give you the desired signal, other trees can still make predictions with a fair amount of accuracy. Besides this there can be many ensemble approaches (a sketch follows below).
The way you're currently doing it may be a loose but practical approximation; it just doesn't make much statistical sense.
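A minimal scikit-learn sketch of the suggestion above (many trees, each split drawing from only a few variables); the random data here only stands in for your per-timestamp summary features:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X_train = rng.normal(size=(100, 10))      # 10 summary features per example
y_train = rng.randint(0, 2, size=100)     # binary class labels

# many trees, each split considering only 3 of the 10 variables
clf = RandomForestClassifier(n_estimators=500, max_features=3, random_state=0)
clf.fit(X_train, y_train)
print(clf.predict(X_train[:5]))
```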
I am working on a predictive modeling exercise using a categorical output (pass/fail: binary 1 or 0) and about 200 features. I have about 350K training examples for this, but I can increase the size of my dataset if needed. Here are a few issues that I am running into:
1- I am dealing with severely imbalanced classes. Out of those 350K examples, only 2K are labelled as “fail” (i.e. categorical output = 1). How do I account for this? I know there are several techniques, such as up-sampling with bootstrap;
2- Most of my features (~95%) are categorical (e.g. city, language, etc.) with fewer than 5-6 levels each. Do I need to transform them into binary data for each level of a feature? For instance, if the feature "city" has 3 levels (New York, Paris, and Barcelona), then I can transform it into 3 binary features: city_New_York, city_Paris, and city_Barcelona;
3 - Picking the model itself: I am considering a few, such as SVM, k-nearest neighbours, decision tree, Random Forest, and logistic regression, but my guess is that Random Forest will be appropriate here because of the large number of categorical features. Any suggestions there?
4 - If I use Random Forest, do I need to (a) do feature scaling for the continuous variables (I am guessing not), (b) change my continuous variables to binary, as explained in question 2 above (I am guessing not), (c) account for my severe imbalanced classes, (d) remove missing values.
Thanks in advance for your answers!
It helps to train with balanced classes (but don't cross-validate with them). RF is surprisingly efficient with data, so you likely won't need all 350K negative samples to train. From that pool, sample (with replacement) a number of negatives equal to your number of positive examples. Don't forget to leave some positive examples out for validation, though (a sketch follows below).
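One way to read that suggestion, sketched in Python with made-up labels: keep the scarce positives and draw an equal-sized sample of negatives with replacement (the helper name is mine):

```python
import numpy as np

def balanced_training_index(y, seed=0):
    """Indices for a balanced training set: every positive example plus an
    equal-sized sample of negatives drawn with replacement from the big pool."""
    rng = np.random.RandomState(seed)
    pos = np.where(y == 1)[0]
    neg = np.where(y == 0)[0]
    neg_sample = rng.choice(neg, size=len(pos), replace=True)
    return np.concatenate([pos, neg_sample])

# toy demo: 2,000 positives hidden in 350,000 examples
y = np.zeros(350_000, dtype=int)
y[:2_000] = 1
idx = balanced_training_index(y)
print(len(idx), y[idx].mean())   # 4000 examples, half of them positive
```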
If you are in scikit-learn, use pandas' get_dummies() to generate the binary encoding (see the sketch below). R does the binary encoding for you for variables that are factors; behind the scenes it makes a bit vector.
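A quick sketch of that encoding on a hypothetical city column:

```python
import pandas as pd

df = pd.DataFrame({"city": ["New York", "Paris", "Barcelona", "Paris"],
                   "passed": [1, 0, 1, 0]})

# one binary column per level: city_Barcelona, city_New York, city_Paris
encoded = pd.get_dummies(df, columns=["city"])
print(encoded)
```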
I always start with RF because it has so few knobs; it's a good benchmark. After I've straightened out my feature transforms and gotten the AUC up, I try the other methods.
a) no b) no c) yes d) yes, it needs to be fixed somehow. If you can get away with removing rows where any predictor has missing values, great. If that's not possible, the median is a common choice. Say a tree is being built and variable X4 is chosen to split on: RF needs to choose a point on a line and send all the data either left or right. What should it do for data where X4 has no value? Here is the strategy the 'randomForest' package takes in R:
For numeric variables, NAs are replaced with column medians. For factor variables, NAs are replaced with the most frequent levels (breaking ties at random). If object contains no NAs, it is returned unaltered.
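For completeness, a rough pandas equivalent of that strategy (hypothetical column names): fill numeric NAs with the column median and categorical NAs with the most frequent level:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25.0, np.nan, 40.0, 31.0],
                   "city": ["Paris", "Paris", None, "New York"]})

# numeric columns: replace NAs with the column median
df["age"] = df["age"].fillna(df["age"].median())
# categorical columns: replace NAs with the most frequent level
df["city"] = df["city"].fillna(df["city"].mode().iloc[0])
print(df)
```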