I've recently started exploring TensorFlow's feature columns for myself.
If I understood the documentation right, feature columns are just a 'frame' for further transformations applied just before fitting data to the model. So, if I want to use them, I define some feature columns, create a DenseFeatures layer from them, and when I fit data into the model, all features go through that DenseFeatures layer, get transformed, and are then fed into the first Dense layer of my NN.
My question is: is it possible to somehow check the correlations of the transformed features with my target variable?
For example, I have a categorical feature that corresponds to a day of the week (Mon/Tue/.../Sun), which I could encode as 1/2/.../7. Its correlation with my target will not be the same as the correlation of a categorical feature column (e.g. an indicator column), since as a plain number the model doesn't understand that 7 is the maximum of the possible sequence, whereas as categories it becomes a one-hot encoded feature with clear boundaries.
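To make that concrete, here is a minimal sketch of the two encodings I have in mind (the column names day_of_week and day_of_week_num are just illustrative):

import tensorflow as tf

days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

# Encoding 1: the day as a plain number 1..7 (the model sees an ordinal scalar).
numeric_day = tf.feature_column.numeric_column('day_of_week_num')

# Encoding 2: the day as a category, one-hot encoded via an indicator column.
categorical_day = tf.feature_column.categorical_column_with_vocabulary_list(
    'day_of_week', vocabulary_list=days)
one_hot_day = tf.feature_column.indicator_column(categorical_day)

# Either set of columns is then consumed by a DenseFeatures layer in front of the first Dense layer.
feature_layer = tf.keras.layers.DenseFeatures([one_hot_day])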
Let me know if anything is unclear.
I'd be grateful for any help!
TensorFlow does not provide a feature-importance attribute the way scikit-learn and XGBoost do.
However, you can test the importance (or relationship) of a feature with respect to the target feature in TensorFlow as follows.
1) Shuffle the values of the particular feature whose relationship with the target you want to test. That is, if your feature is, say, fea1, the value at df['fea1'][0] becomes the value at df['fea1'][4], the value at df['fea1'][2] becomes the value at df['fea1'][3], and so on.
2) Now fit the model to your modified training data and check the accuracy on the validation data.
3) If your accuracy drops drastically, the feature had a strong relationship with the target; if the accuracy barely changes, the feature isn't of great importance (high error = high importance).
You can do the same with other features you introduced to your training data.
This could take some time and effort.
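A rough sketch of steps 1-3 (assuming pandas DataFrames train_X/val_X, targets train_y/val_y, an illustrative feature name 'fea1', and a build_model() function that returns a fresh compiled Keras model with accuracy as a metric):

import numpy as np

def shuffled_feature_accuracy(build_model, train_X, train_y, val_X, val_y, feature):
    # Step 1: shuffle the values of one feature in the training data.
    shuffled_X = train_X.copy()
    shuffled_X[feature] = np.random.permutation(shuffled_X[feature].values)
    # Step 2: refit the model on the modified training data.
    model = build_model()
    model.fit(shuffled_X, train_y, epochs=10, verbose=0)
    # Step 3: check accuracy on the validation data.
    _, acc = model.evaluate(val_X, val_y, verbose=0)
    return acc

# Compare against the baseline accuracy of a model trained on the unshuffled data;
# a large drop suggests 'fea1' matters for the target.
# acc_fea1 = shuffled_feature_accuracy(build_model, train_X, train_y, val_X, val_y, 'fea1')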
I am trying to explain a regression model based on LightGBM using SHAP. I'm using the
shap.TreeExplainer(<lightgbm model>).shap_values(X)
method to get the SHAP values, where X is the entire training dataset. These SHAP values compare an individual prediction to the average prediction over the entire dataset.
In the online book by Christoph Molnar, section 5.9.4, he mentions that:
"Instead of comparing a prediction to the average prediction of the entire dataset, you could compare it to a subset or even to a single data point."
I have a couple of questions regarding this:
Am I correct to interpret that if, instead of passing the entire training dataset, I pass a subset of, say, 20 observations, then the SHAP values returned will be relative to the average of those 20 observations? This would be the equivalent of the "subset" that Christoph Molnar mentions in his book.
Assuming the answer to question 1 is yes: what if, instead of generating SHAP values relative to the average of 20 observations, I want to generate SHAP values relative to one specific observation? Christoph Molnar seems to imply that this is possible. If it is, how do I do that?
Thank you in advance for the guidance!
Yes, but the definition of "average" is important. If you supply a "background" dataset, your explanations will be calculated against this background, not against the whole dataset. As for "relative to the average" of the background, one needs to understand that SHAP values are average marginal contributions over all possible coalitions. So as far as SHAP values are concerned, you fix the coalition(s), and the rest is, yes, averaged. This allows fitting the model once and then passing different coalitions (with the rest averaged) through a model that was trained only once. This is where SHAP's time savings come from.
If you're interested in more detail, you may visit the original paper or this blog post.
Yes. You supply a single data row as the background; for binary classification, for example, supply a row from the other class for the explanation, and see which features, and by how much, changed the class output.
Yes. By the mathematical formulation in the original paper, SHAP values are "the contribution of a feature to the difference between the actual prediction and the average prediction". The average prediction, sometimes called the "base value" or "expected model output", is relative to the background dataset you provided.
Yes. You can use a background dataset of 1 sample. Common choices for the background dataset are the training data, a single reference sample, or even a dataset of all zeros. From the author: "I recommend using either a single background data point, a small random subset of the true background, or for the best performance a set of k-medians (weighted by how many training points they each represent) designed to represent the background succinctly."
Below are more details to support my answers to the two questions and to show how question 2 can be done. So, why does the "expected model output" depend on the background dataset? To answer this question, let's walk through how SHAP works:
Step 1: We create a SHAP explainer, providing two things: a trained prediction model and a background dataset. From the background dataset, SHAP creates an artificial dataset of coalitions. Each coalition is a binary vector representing a combination of features: 1 represents a feature being present, and 0 absent. So there are 2^M possible coalitions for M features.
explainer = shap.KernelExplainer(f, background_X)
Step 2: We provide the sample(s) for which we want to compute SHAP values. SHAP fills in values for this artificial dataset such that present features take the original values of that sample, and absent features are filled with values from the background dataset. Then a prediction is generated for each coalition. If the background dataset has n rows, the absent features are filled n times and the average of the n predictions is used as the prediction for that coalition. If the background dataset is a single sample, the absent features are filled with the values of that sample.
shap_values = explainer.shap_values(test_X)
Therefore, the SHAP values are relative to the average prediction of the background dataset.
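To make question 2 concrete, here is a rough sketch with KernelExplainer that uses a single reference row as the background (model.predict, reference_row and x_to_explain are illustrative names; a similar single-row background should also work with a tree explainer):

import shap

# Background of exactly one sample: the explanation is then relative to the
# prediction for reference_row rather than the dataset average.
background = reference_row.to_frame().T           # a single-row DataFrame
explainer = shap.KernelExplainer(model.predict, background)

shap_values = explainer.shap_values(x_to_explain)
# explainer.expected_value is now the model output for reference_row, and the
# SHAP values sum to model.predict(x_to_explain) - explainer.expected_value.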
I am working with features extracted from pre-trained VGG16 and VGG19 models. The features have been extracted from the second fully connected layer (FC2) of these networks.
The resulting feature matrix (of dimensions (8000, 4096)) has values in the range [0, 45]. As a result, when I use this feature matrix in gradient-based optimization algorithms, the loss function, gradients, norms, etc. take very high values.
To avoid such high values, I applied MinMax normalization to this feature matrix, and since then the values are manageable and the optimization algorithm is behaving properly. Is my strategy OK, i.e. is it reasonable to normalize features that have been extracted from pre-trained models for further processing?
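For reference, the normalization step is roughly this (a minimal sketch with scikit-learn's MinMaxScaler; the file path is just illustrative):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

features = np.load('vgg_fc2_features.npy')         # shape (8000, 4096), values roughly in [0, 45]

scaler = MinMaxScaler()                             # default feature_range=(0, 1)
features_scaled = scaler.fit_transform(features)    # each column rescaled to [0, 1]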
From experience, as long as you are aware that your results come from normalized values, it is okay. If normalization keeps the gradients, norms, etc. in a manageable range, then I am for it.
What I would be cautious about, though, is any further analysis on those feature matrices, since they are normalized and not the true values. Say, if you were to study the distributions and such, you should be fine, but I am not sure what your next step is, and whether this can/will be harmful.
Can you share more details around "further analysis"?
I am new to the concept of scaling features in machine learning. I read that scaling is useful when one feature's range is very large compared to the other features. But if I choose to scale the training data:
Can I scale just that one feature with the large range?
If I scale the entire X of the training data, do I also need to scale the y of the training data and the entire test data?
Yes, you can scale just that one feature with the large range, but make sure there is no other feature with a large range; if such a feature exists and has not been scaled, it will make the algorithm overlook the contributions of the scaled features and affect the result (output value) even with a slight change in it. It is recommended (but not compulsory) to scale all the features in the training set.
You do not need to scale the y of the training data, as the algorithm or model will set the parameter values to minimize the cost (error), i.e. k{Y(output) - Y(original)}, anyway. But if X_train was scaled, then the test-set feature values X_test need to be scaled (using the training mean and variance) before feeding them to the model (scale y_test only if y_train was scaled), because the model hasn't seen this data before and has been trained on data with a scaled range. If a test feature value diverges from the corresponding feature range in the training data by a considerably large amount, the model will output a wrong prediction for that test example.
Yes, you can scale a single feature. You can interpret scaling as a means of giving the same importance to each feature. For instance, imagine you have data about people and you describe your examples via two features: height and weight. If you measure height in meters and weight in kilograms, a k-Nearest Neighbours classifier when computing the distance between two examples is likely to make its decisions solely based on the weight. In that case, you can scale one of the features to the same range of the other. Commonly, we scale all the features to the same range (e.g. 0 - 1). In addition, remember that all the values you use to scale your training data must be used to scale the test data.
As for the dependent variable y, you do not need to scale it.
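A minimal sketch of this workflow with synthetic data (fit the scaler on the training features only, reuse it on the test features, and leave y unscaled):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Illustrative data: the second feature has a much larger range than the first.
rng = np.random.RandomState(0)
X = np.column_stack([rng.uniform(0, 1, 500), rng.uniform(0, 10_000, 500)])
y = X @ np.array([2.0, 0.001]) + rng.normal(0, 0.1, 500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler()                          # or MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)     # statistics learned from training data only
X_test_scaled = scaler.transform(X_test)           # same training statistics applied to the test set

model = LinearRegression().fit(X_train_scaled, y_train)   # y is left unscaled
print(model.score(X_test_scaled, y_test))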
I am making a machine learning program with a simple linear regression. In the dataset, I have two types of features:
Low dimension features: e.g. word count, length, etc.
High dimension features: TFIDF features
I just want to manually assign weights to these different sets of features.
I have been trying with Pipeline and FeatureUnion (transformer_weights), but I find it difficult to do so when I want the TF-IDF features built from a vocabulary created from both the train and test data.
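Roughly, the kind of setup I have been trying looks like this (a sketch; the helper function, column handling and weights are illustrative):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

def low_dim_features(docs):
    # e.g. document length and word count
    return np.array([[len(d), len(d.split())] for d in docs])

features = FeatureUnion(
    transformer_list=[
        ('low_dim', FunctionTransformer(low_dim_features, validate=False)),
        ('tfidf', TfidfVectorizer()),
    ],
    transformer_weights={'low_dim': 1.0, 'tfidf': 0.5},   # manual weights per feature block
)

pipe = Pipeline([('features', features), ('reg', LinearRegression())])
# pipe.fit(train_docs, train_y)   # the TF-IDF vocabulary is then built from the training data only,
#                                 # which is exactly the limitation I am running into.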
Could you please advise if there are any solutions other than using a pipeline?
Thanks a lot!
I have a question about ensemble feature selection.
My data set consists of 1000 samples with about 30000 features, and they are classified into label A or label B.
What I want to do is pick some features that can classify the labels efficiently.
I used three types of methods: a univariate method (Pearson's coefficient), lasso regression, and SVM-RFE (recursive feature elimination), so I got three feature sets from them. I used Python scikit-learn for the feature selection.
Now I am thinking of an ensemble feature selection approach, because the number of features is so large. In this case, what is a good way to make an integrated subset from the 3 feature sets?
What I can think of is taking the union of the sets and applying lasso regression or SVM-RFE again, or just taking the intersection of the sets.
Can anyone give an idea?
I guess what you do depends on how you want to use these features afterwards. If your goal is to "classify the label efficiently", one thing you can do is use your classification algorithm (e.g. SVC, Lasso, etc.) as a wrapper and do Recursive Feature Elimination (RFE) with cross-validation.
You can start from the union of the features from the previous three methods you used, or from scratch for the given type of model you want to fit, since the number of examples is small. In any case, I believe the best way to select features in your case is to select the ones that optimize your goal, which seems to be classification accuracy, hence the CV proposal.
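A rough sketch of the RFE-with-cross-validation idea in scikit-learn, starting from the union of the three previously selected feature sets (pearson_feats, lasso_feats, rfe_feats, X and y are illustrative names):

from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

# Union of the features picked by the three methods.
candidate_features = sorted(set(pearson_feats) | set(lasso_feats) | set(rfe_feats))
X_candidate = X[candidate_features]                # X: (1000, 30000) DataFrame, y: labels A/B

selector = RFECV(
    estimator=SVC(kernel='linear'),                # linear kernel exposes coef_ for feature ranking
    step=0.1,                                      # drop 10% of the remaining features per iteration
    cv=StratifiedKFold(5),
    scoring='accuracy',
)
selector.fit(X_candidate, y)

selected = [f for f, keep in zip(candidate_features, selector.support_) if keep]
print(len(selected), 'features selected')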