Get the data entries that have influenced the predicted decision tree value - python

After a regression prediction with a decision tree, I am interested in finding the dataframe rows from which the predicted value was derived. As far as I understand, this can be one or more rows.
I haven't tried anything yet. The only approach I can imagine is that, if the predicted value was taken from exactly one data entry, I could search for the predicted value in the training dataset.
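One possible approach, sketched here under the assumption that the model is a scikit-learn DecisionTreeRegressor with the default squared-error criterion: the tree's apply() method returns the leaf each sample lands in, so the training rows that share a leaf with the new point are the rows whose targets were averaged into the prediction. The data and the column names x1, x2 and y are synthetic and purely illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Synthetic training data (hypothetical column names).
rng = np.random.default_rng(0)
train = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
train["y"] = 3 * train["x1"] + rng.normal(scale=0.1, size=200)

features = ["x1", "x2"]
tree = DecisionTreeRegressor(max_depth=3, random_state=0)
tree.fit(train[features], train["y"])

# apply() returns the index of the leaf each sample ends up in.
train_leaves = tree.apply(train[features])

new_point = pd.DataFrame({"x1": [0.5], "x2": [-1.0]})
leaf = tree.apply(new_point)[0]

# Training rows in the same leaf are the ones the prediction is averaged from.
contributing_rows = train[train_leaves == leaf]
prediction = tree.predict(new_point)[0]
```

With the squared-error criterion, the mean of contributing_rows["y"] should match prediction, which is a quick way to check that the right rows were selected.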

Related

Filter training dataframe using k-means (or any other technique)

I have a training dataframe with input and output variables and a validation dataframe with just input variables. I need to train a model with the training dataframe and generate results in the validation dataframe, but the training dataframe is very large and may contain data that is unrelated to my validation dataframe. They are not necessarily outliers, but data that was acquired that may not be representative for my validation dataframe at that time, but that may be related to another validation dataframe that I may acquire in the future. I would like to filter the training dataframe before training a model on it.
I thought of doing the following: fit a k-means model on the validation dataframe and filter my training dataframe by the distance of each instance to the nearest k-means centroid fitted on the validation dataframe. I would choose a maximum distance and throw away instances that are farther away, in the hope that this removes points that have nothing to do with my validation dataframe.
I ran some tests and got this result:
Before filtering (training - blue; validation - orange):
After filtering (training - blue; validation - orange):
However, I can't remove all the unrelated points and I'm throwing away points that are close to the centroids. The way I calculated the distance between the centroids and the instances was based on this post:
Distance between nodes and the centroid in a kmeans cluster?
Where, after calculating the distances (sqdist), I remove the points with distances greater than a given value.
My question is: would this be the best way to do this? Is there another way to select these instances? As a final result, I would expect a blue dot cloud that would be the same shape as the orange dot cloud, but a little wider.
Note: I cannot provide my data, examples will need to be made with synthetic data.
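A minimal sketch of the filtering idea on synthetic data, assuming scikit-learn's KMeans. Instead of one global maximum distance, this variant uses one radius per cluster (the largest distance of any validation point to its own centroid, widened by a hypothetical margin factor). It is one possible refinement, not necessarily the best way to select the instances.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic stand-ins: a wide training cloud and a narrower validation cloud.
X_train = rng.uniform(-10, 10, size=(2000, 2))
X_val = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(300, 2))

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_val)

# transform() gives the Euclidean distance from each row to every centroid.
val_all = kmeans.transform(X_val)
val_nearest = val_all.argmin(axis=1)
val_dist = val_all.min(axis=1)

train_all = kmeans.transform(X_train)
train_nearest = train_all.argmin(axis=1)
train_dist = train_all.min(axis=1)

# Per-cluster radius: farthest validation point from its centroid.
radius = np.array([val_dist[val_nearest == k].max() for k in range(kmeans.n_clusters)])

margin = 1.2  # assumed widening factor, "a little wider" than the validation cloud
keep = train_dist <= margin * radius[train_nearest]
X_train_filtered = X_train[keep]
```

Increasing n_clusters makes the kept region follow the shape of the validation cloud more closely, at the cost of being more sensitive to sparse validation regions.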

SHAP for a single data point, instead of average prediction of entire dataset

I am trying to explain a regression model based on LightGBM using SHAP. I'm using the
shap.TreeExplainer(<lightgbm model>).shap_values(X)
method to get the SHAP values, where X is the entire training dataset. These SHAP values compare an individual prediction to the average prediction over the entire dataset.
In the online book by Christoph Molnar, section 5.9.4, he mentions that:
"Instead of comparing a prediction to the average prediction of the entire dataset, you could compare it to a subset or even to a single data point."
I have a couple of questions regarding this:
1. Am I correct to interpret that if, instead of passing the entire training dataset, I pass a subset of, say, 20 observations, then the SHAP values returned will be relative to the average of these 20 observations? This would be the equivalent of the "subset" that Christoph Molnar mentions in his book.
2. Assuming that the answer to question 1 is yes, what if, instead of generating SHAP values relative to the average of 20 observations, I want to generate SHAP values relative to one specific observation? Christoph Molnar seems to imply that this is possible. If it is, how do I do that?
Thank you in advance for the guidance!
1. Yes, but the definition of "average" is important. If you supply a "background" dataset, your explanations will be calculated against this background, not against the whole dataset. As for "relative to the average" of the background, one needs to understand that SHAP values are average marginal contributions over all possible coalitions. So, as far as SHAP values are concerned, you fix the coalition(s), and the rest is, yes, averaged. This allows fitting the model once and then passing different coalitions (with the rest averaged) through that single trained model. This is where SHAP's time savings come from.
If you're interested in more, you may visit the original paper or this blog.
2. Yes. You supply a single data row as the background. For example, in binary classification, supply a data row from the other class as the background and see which features changed the class output, and by how much.
1. Yes. By the mathematical formulation in the original paper, SHAP values are "the contribution of a feature to the difference between the actual prediction and the average prediction". The average prediction, sometimes called the "base value" or "expected model output", is relative to the background dataset you provided.
2. Yes. You can use a background dataset of 1 sample. Common choices for the background dataset are the training data, one single sample as the reference sample, or even a dataset of all zeros. From the author: “I recommend using either a single background data point, a small random subset of the true background, or for the best performance a set of k-medians (weighted by how many training points they each represent) designed to represent the background succinctly.”
Below are more details to support my answers to the two questions and to show how 2 can be done. So, why does the "expected model output" depend on the background dataset? To answer this question, let's walk through how SHAP is done:
Step 1: We create a SHAP explainer by providing two things: a trained prediction model and a background dataset. From the background dataset, SHAP creates an artificial dataset of coalitions. Each coalition is a binary vector representing one combination of features: 1 means a feature is present, and 0 means it is absent. So there are 2^M possible coalitions for M features.
explainer = shap.KernelExplainer(f, background_X)
Step 2: We provide the sample(s) for which we want to compute SHAP values. SHAP fills in values for this artificial dataset such that present features take the original values of that sample, and absent features are filled with values from the background dataset. Then the prediction is generated for this coalition. If the background dataset has n rows, the absent features are filled n times and the average of the n predictions is used as the prediction of this coalition. If the background dataset has a single sample, then the absent features are filled with the values of that sample.
shap_values = explainer.shap_values(test_X)
Therefore, the SHAP values are relative to the average prediction of the background dataset.
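To make the two steps concrete, here is a brute-force sketch in plain NumPy that enumerates every coalition and fills absent features from the background rows. It is an illustration of the definition only, not how the shap library computes values internally (the library uses faster approximations or model-specific algorithms), and the model f and the data are hypothetical.

```python
from itertools import combinations
from math import factorial

import numpy as np


def coalition_value(f, x, background, present):
    """Mean prediction when `present` features come from x, the rest from the background."""
    filled = background.copy()
    idx = np.asarray(present, dtype=int)
    filled[:, idx] = x[idx]
    return f(filled).mean()


def exact_shap(f, x, background):
    """Shapley values of x by enumerating all coalitions (only feasible for few features)."""
    m = x.shape[0]
    phi = np.zeros(m)
    for j in range(m):
        others = [k for k in range(m) if k != j]
        for size in range(m):
            # Shapley weight of a coalition of this size that excludes feature j.
            weight = factorial(size) * factorial(m - size - 1) / factorial(m)
            for subset in combinations(others, size):
                with_j = coalition_value(f, x, background, list(subset) + [j])
                without_j = coalition_value(f, x, background, list(subset))
                phi[j] += weight * (with_j - without_j)
    return phi


def f(X):
    # Hypothetical model: any function mapping rows to predictions works here.
    return X[:, 0] * X[:, 1] + 2.0 * X[:, 2]


rng = np.random.default_rng(0)
background = rng.normal(size=(20, 3))
x = np.array([1.0, 2.0, -1.0])

phi_subset = exact_shap(f, x, background)      # relative to the 20-row average
phi_single = exact_shap(f, x, background[:1])  # relative to one reference row
```

In both cases the values sum to the prediction for x minus the baseline: the mean prediction over the 20 background rows for phi_subset, and the prediction for the single reference row for phi_single.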

Classification of accelerometer data

I am trying to classify accelerometer data into 4 classes- 1,2,3,4. The training dataset looks like the following-
The training labels are contained in another file and contain labels for only the 10th observation. This is what it looks like-
Now I am not sure how to interpret this. Should I only use the training_labels dataset to train a model? In that case, I don't know why the first dataset is given. Also, using only the second set would lead to a loss of information. I thought of doing a left outer join of the first dataset with the second and using 'bfill' in df.fillna() to get rid of the NaN values, then using that data to train, but I am unsure whether this is the right approach. I am still a beginner at machine learning, so any help is appreciated.
EDIT: The data comes from an online course I am doing. It says: "Because the accelerometers are sampled at high frequency, the labels in train_labels are only provided for every 10th observation."
If you can afford to discard 90% of your data, you can use only the observations with labels. You can also take the mean or median x, y, z coordinates of each group of 10 observations together with the provided label, or use the same label for the last 10 observations. Those approaches seem legitimate to me.
The sampling frequency was probably unnecessarily high, so you can assume that labels do not change that quickly. But this can also depend on the problem at hand.
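The last two options can be sketched with pandas as follows. The column names (timestamp, x, y, z, label) and the layout, with a label on every 10th row, are assumptions made for illustration, since the actual files are not shown.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 50
# Hypothetical layout: one row per sample, a label for every 10th timestamp.
samples = pd.DataFrame({
    "timestamp": np.arange(n),
    "x": rng.normal(size=n),
    "y": rng.normal(size=n),
    "z": rng.normal(size=n),
})
labels = pd.DataFrame({
    "timestamp": np.arange(9, n, 10),
    "label": [1, 1, 2, 3, 4],
})

# Option A: give each label to the 10 samples ending at the labelled one.
merged = samples.merge(labels, on="timestamp", how="left")
merged["label"] = merged["label"].bfill().astype(int)

# Option B: one row per label, with the mean of each 10-sample window.
merged["window"] = merged["timestamp"] // 10
windows = (
    merged.groupby("window")
    .agg(x_mean=("x", "mean"), y_mean=("y", "mean"), z_mean=("z", "mean"),
         label=("label", "last"))
    .reset_index()
)
```

Option A keeps every sample, while option B reduces the data to one aggregated row per label; other window statistics (median, standard deviation) can be added to the agg call in the same way.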

Correlations of feature columns in TensorFlow

I've recently started exploring TensorFlow's feature columns.
If I understand the documentation correctly, feature columns are just a 'frame' for further transformations applied right before the data is fed to the model. So, if I want to use them, I define some feature columns and create a DenseFeatures layer from them; when I fit the model, all features go through that DenseFeatures layer, are transformed, and are then passed to the first Dense layer of my NN.
My question is: is it possible to check the correlation of the transformed features with my target variable?
For example, I have a categorical feature that corresponds to the day of the week (Mon/Tue/.../Sun), which I could encode as 1/2/.../7. Its correlation with my target will not be the same as that of a categorical feature column (e.g. an indicator column): with the integer encoding the model doesn't understand that 7 is the maximum of the possible sequence, whereas with categories it is a one-hot encoded feature with precise borders.
Let me know if all is clear.
Will be grateful for the help!
TensorFlow does not provide a built-in feature importance the way the scikit-learn API does for models such as XGBoost.
However, you can test the importance of a feature, or its relationship with the target, in TensorFlow as follows.
1) Shuffle the values of the particular feature whose correlation with the target feature you want to test. As in, if your feature is say fea1,the value at df['fea1'][0] becomes the value df['fea1'][4], the value at df['fea1'][2] becomes the value df['fea1'][3] and so on.
2) Now fit the model to your modified training data and check the accuracy with validation data.
3) Now if your accuracy goes down drastically, it means your feature had a high correlation with the target feature, else if the accuracy didn't vary much, it means the feature isn't of great importance (high error = high importance).
You can do the same with other features you introduced to your training data.
This could take some time and effort.
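The procedure above can be sketched as follows. To keep the example small and self-contained, a scikit-learn LogisticRegression stands in for the TensorFlow model (an assumption for illustration only); the same loop applies to any model that can be refit and scored. The data is synthetic, with only feature 0 related to the target.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 3))
# Synthetic target: only feature 0 carries information.
y = (X[:, 0] + 0.1 * rng.normal(size=n) > 0).astype(int)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)


def fit_and_score(X_tr):
    """Refit the stand-in model on X_tr and return validation accuracy."""
    model = LogisticRegression().fit(X_tr, y_train)
    return model.score(X_val, y_val)


baseline = fit_and_score(X_train)

drops = {}
for j in range(X.shape[1]):
    X_shuffled = X_train.copy()
    # Step 1: shuffle one feature column, breaking its link to the target.
    X_shuffled[:, j] = rng.permutation(X_shuffled[:, j])
    # Steps 2 and 3: refit and record how much validation accuracy drops.
    drops[j] = baseline - fit_and_score(X_shuffled)
```

A large value in drops marks a feature the model depends on; values near zero mark features that contribute little.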

Formatting and combining word frequency with other data machine learning python

I'm new to machine learning algorithms. I read the scikit-learn website and other SO posts extensively, which led me to build my first machine learning models using RandomForestClassifier and LinearSVC.
I'm working on medical notes. Each stay of a patient is associated (or not) to a code corresponding to a complication (bleeding, infection, heart attack...)
Using the notes, fitted and transformed with CountVectorizer and TfidfTransformer, I can accurately predict most of the codes. However, I'd like to add more data to my training dataset: length of stay, number of operations, title of operations, ICU stay duration, etc.
After searching the web and SO, I ended up adding all the continuous/binary/scaled values to my word frequency array.
e.g.: [0,0,0.34,0,0.45,0, 2, 45] (the last 2 numbers are the added data, whereas the previous ones come from CountVectorizer and tfdif.fit_transform(train_set))
However, this seems to me a crude way to combine data, and a huge number of words could mask the other data.
I tried to set up my data like [[0,0,0.34,0,0.45,0],[2],[45]], but it doesn't work.
I searched the web but found no real clue, even though I'm probably not the first one facing this issue... :p
Thanks for your help
Edit:
Thanks for your detailed and valuable answer, I really appreciate it. However, what exactly is the 0-1 range: is it the predict_proba value (http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.predict)? I understood that the score is the accuracy of the prediction model. Then, once you have all your predictions for each variable, do you average them? Finally, I'm working with multiple outputs; I guess that's not a problem since I can get a prediction for each output (by the way, predict_proba(X) gives me an array like [array([[0.,1.]]), array([[0.2,0.8]]).....] with a random forest classifier. I guess one of the numbers is the probability of the output, but I haven't explored this yet!)
Your first solution of just appending to the list is the correct solution. However, you should think about what this implies. If you have 100 words and add two additional features, each specific word will get the same "weight" as the added features, i.e. your added features won't be treated very strongly in the model. Additionally, you're saying that the last feature, with a value of 45, is 100x the value of the feature 4th from the end (0.45).
One common way to get around that is to use an ensemble model. Instead of adding those features to your list of words and predicting, first build a prediction model just using the words. That prediction will be in the range 0-1 and will capture the "sentiment" of the article. Then, scale your other variables (minmax scaler, normal distribution, etc.). Finally, combine the score from the words with the last two scaled variables and run another prediction on a list like this [.86,.2,.65]. In this way, you have transformed all of the words to a sentiment score, which you can use as a feature.
Hope that helps.
EDIT PER YOUR UPDATE ABOVE
Yes, in this instance you could use the predict_proba, but really if everything is scaled correctly, and you are using 1/0 as your targets for a class you don't need the predict_proba. The idea is to take the prediction from the words and combine it with the other variables. You do not average the predictions, you make a prediction from the predictions! This is called ensemble learning. Train another model with the output of your predictions as the features. Here is a flow of what you need to do.
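A minimal sketch of that two-stage flow with scikit-learn, on made-up notes and made-up numeric features (length of stay, number of operations). The choice of LogisticRegression for the text stage and the use of out-of-fold predictions via cross_val_predict (so the second model is not trained on scores the first model produced for rows it had already seen) are assumptions of this sketch, not something prescribed above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Made-up training data: 30 short notes, 1 = complication, 0 = none.
pos = ["bleeding after surgery", "heavy bleeding transfusion needed",
       "postoperative bleeding observed"]
neg = ["stable recovery no issues", "discharged in good condition",
       "routine stay uneventful"]
notes = np.array((pos + neg) * 5)
y = np.array(([1] * 3 + [0] * 3) * 5)

rng = np.random.default_rng(0)
extra = np.column_stack([
    rng.integers(1, 30, size=len(notes)),  # hypothetical length of stay
    rng.integers(0, 5, size=len(notes)),   # hypothetical number of operations
]).astype(float)

# Stage 1: a text-only model; its predicted probability is the "sentiment" score.
text_model = make_pipeline(TfidfVectorizer(), LogisticRegression())
text_score = cross_val_predict(text_model, notes, y, cv=5, method="predict_proba")[:, 1]

# Stage 2: combine the text score with the scaled numeric features.
scaler = MinMaxScaler().fit(extra)
X_meta = np.column_stack([text_score, scaler.transform(extra)])
meta_model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_meta, y)

# For new data: fit the text model on all training notes, then reuse both stages.
text_model.fit(notes, y)
new_notes = np.array(["bleeding observed after operation"])
new_extra = np.array([[12.0, 2.0]])
new_meta = np.column_stack([text_model.predict_proba(new_notes)[:, 1],
                            scaler.transform(new_extra)])
new_prediction = meta_model.predict(new_meta)
```

Each row of X_meta has the form [text score, scaled feature 1, scaled feature 2], which matches the [.86,.2,.65] style of list described above.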
Thanks for your time and your detailed answer. I think i get it. In short:
Prediction based on words, and for each bag of words of the training set (t1), you pull out a "sentiment"
Create a new array for each training set row with the sentiment and others values->new training set(t2)
Make a prediction based on t2.
Apply previous steps to the test.
One more question though !
What is the "sentiment" value? For each bag of words, I have a sparse matrix (CountVectorizer + TF-IDF). So how do you calculate the sentiment? Do you run each row of the test set against the rest of the test set? And is your sentiment the clf.predict(X) value?
