Regression confidence using SVMs in python - python

I'm using regression SVMs in python and I am wondering if there is any way to get a "confidence-measure" value for its predictions.
Previously, when using SVMs for binary classification, I was able to compute a confidence-type value from the 'margin'. Here is some pseudo-code showing how I got a confidence value:
# Begin pseudo-code
import svm as svmlib
prob = svmlib.svm_problem(labels, data)
param = svmlib.svm_parameter(svm_type=svmlib.C_SVC, kernel_type = svmlib.RBF)
model = svmlib.svm_model(prob, param)
# get confidence
confidence = self.model.predict_values_raw(sample_to_classify)
I imagine that the further the new sample is from the training data, the worse the confidence, but I'm looking for a function that might help compute a reasonable estimate for this.
My (high-level) problem is as follows:
I have a function F(x), where x is a high-dimensional vector
F(x) can be computed but it is very slow
I want to train a regression SVM to approximate it
If I can find values of 'x' that have low prediction confidence, I can add these points and retrain (aka. active learning)
Has anyone obtained/used regression-SVM confidence/margin values before?

Have a look at this similar response on Stack back in January. The chosen answer was spot on regarding how hard it is to get confidence measures on non-parametric fitting methods. There's probably some Bayesian type thing you could do, but that's probably not possible with the Python SVM library: Prefer one class in libsvm (python).

Related

how to interpret learning curve in machine learning?

those're the learning curves for each algorithm I used. i'm working on my report and i'm confused how to interpret the curve.
I used multi label classification algorithms.
this is the learning curve of binary relevance the classifier is KNeighborsClassifier.
the second one is the curve of classifier chain using DecisionTreeClassifier
and the last one is the curve of LabelPowerset using GaussianNB
which one is the best? because the accuracy and the F1 score are good results
Learning curves are a tool to finding which models benefit from increasing training data. In other words, they indicate whether a model, with an increased dataset, will give better results.
The best curve in my opinion is the one that gives the best normalized score in a minimum training example. It also has to converge fast enough to a good score.

SHAP's base value and predicted value using isolation forest are big

I used isolation forest model to do outlier detection and I also tried to build shap_force plot to see features.
The isolation forest model I build is:
model = IsolationForest(n_estimators=50, max_samples='auto', contamination=float(0.2), max_features=1.0,random_state= 0)
model.fit(df)
pred = model.predict(df)
df['anomaly_label']=pred
And i tried to get the shap values:
def shap_plot(j):
explainerModel = shap.TreeExplainer(model)
shap_values_Model = explainerModel.shap_values(df,check_additivity=False)
p = shap.force_plot(explainerModel.expected_value, shap_values_Model[j], S.iloc[[j]])
return(p)
And some examples I got are like:
shap value
shap value
The base value and predicted values are big which exceed the range. I wonder why does this happen? Are there any ways to solve this problem?
shap is explaining the anomaly score of the samples rather than the hard predictions. (This is quite common; e.g. in probabilistic classifiers, often the explanations take place in the log-odds space by default, rather than in probability space or hard classifications.) The score_samples and decision_function methods of IsolationForest are more relevant here than predict. That said, it looks like the explanation is on another transformation of those scores; see this comment in the PR that added isolation forests to the TreeExplainer.

What is the difference between Linear regression classifier and linear regression to extract the confidential interval?

I am a beginner with machine learning. I want to use time series linear regression to extract confidential interval of my dataset. I don't need to use the linear regression as a classifier. Firstly what is the difference between the two cases? Secondly in python, Is there different way to implement them ?
The main difference is the classifier will compute a probabilty about a label. The regression will compute a quantitative output.
Generally, classifier is used to compute a probability of label, and a regression is often use to compute a quantity. For instance if you want to compute the price of a flat considering some criterias you will use a regression, if you want to compute a label (luxurious, modest, ...) about the same flat considering some criterias you will use classifier.
But to use regressions in order to compute a threshold to seperate labels observed is a technic often used too. That is the case of linear SVM, which compute a boundary between labels. It is called decision boundary. Warning, the main drawback with linear is that is linear: it means the boundary will necessary be a straight line to separate labels. Sometimes it is good enough, sometimes it is not.
Logistic regression is an exception because it compute a probability actually. Its name is misleading.
For regression, when you want to compute a quantitative output, you can use a confidence interval to have an idea about the error. In a classification there is not confidence interval, even if you use linear SVM, it is non sensical. You can use the decision function but it is difficult to interpret in reality, or use the predicted probabilities and to check the number of time the label is wrong and compute a ratio of error. There are plethora ratios available considering your problematic, and it is buntly the subject of a whole book actually.
Anyway, if you're computing a time series, as far as I know your goal is to obtain a quantitative output, then you do not need a classifier as you said. And about extracting it depends totally of the object you used to compute it in python: meaning it depends of the available attributes of the object used. Then depends of the library too. So it would be very better, to answer to you, if you would indicate which libraries and objects you are using.

What is the best way to minimize the RMSE?

I am using LinearRegression() from sklearn to predict. I have created different features for X and trying to understand how can i select the best features automatically? Let's say i have defined 50 different features for X and only one output for y. Is there a way to select the best performing features automatically instead of doing it manually?
Also I can get rmse using following command:
scores = np.sqrt(-cross_val_score(lm, X, y, cv=20, scoring='neg_mean_squared_error')).mean()
From now on, how can i use this RMSE scores? I mean do i have to make multiple predictions? How am i going to use this rmse? There must be a way to predict() using some optimisations but couldn't findout.
Actually sklearn doesn't seem to have a stepwise algorithm, which helps in understanding the importance of features. However, it does provide recursive feature elimination, which is a greedy feature elimination algorithm similar to sequential backward selection.
See the documentation here:
Recursive Feature Elimination
Note that it is not necessary that it will reduce your RMSE. You might try different techniques like Ridge and Lasso Regression as well.
RMSE measures the average magnitude of the prediction error.
RMSE gives high weight to high errors, lower the values it's always better. RMSE can be improved only if you have a decent model. For feature selection, you can use PCA or stepwise regression or basic correlation technique. If you see a lot of multi-collinearity then go for Lasso or Ridge regression. Also, make sure you have a decent split of test and train data. If you have bad testing data you will get poor results. Also, check training data R-sq and testing data R-sq to make sure the model doesn't over-fit.
It would be helpful if you add information on no. of observations in your test and train data and r-sq value. Hope this helps

How to know the factor by which a feature affects a model's prediction

I have trained my model on a data set and i used decision trees to train my model and it has 3 output classes - Yes,Done and No , and I got to know the feature that are most decisive in making a decision by checking feature importance of the classifier. I am using python and sklearn as my ML library. Now that I have found the feature that is most decisive I would like to know how that feature contributes, in the sense that if the relation is positive such that if the feature value increases the it leads to Yes and if it is negative It leads to No and so on and I would also want to know the magnitude for the same.
I would like to know if there a solution to this and also would to know a solution that is independent of the algorithm of choice, Please try to provide solutions that are not specific to decision tree but rather general solution for all the algorithms.
If there is some way that would tell me like:
for feature x1 the relation is 0.8*x1^2
for feature x2 the relation is -0.4*x2
just so that I would be able to analyse the output depends based on input feature x1 ,x2 and so on
Is it possible to find out the whether a high value for particular feature to a certain class, or a low value for the feature.
You can use Partial Dependency Plots (PDPs). scikit has a built-in PDP for their GBM - http://scikit-learn.org/stable/modules/ensemble.html#partial-dependence which was created in Friedman's Greedy Function Approximation Paper http://statweb.stanford.edu/~jhf/ftp/trebst.pdf pp26-28.
If you used scikit-learn GBM, use their PDP function. If you used another estimator, you can create your own PDP which is a few lines of code. PDPs and this method is algorithm agnostic as you asked. It just will not scale.
Logic
Take your training data
For the feature you are examining, get all unique values or some quantiles to reduce the time
Take a unique value
For the feature you are examining, in all observations, replace with the value from (3)
Predict all training observations
Get the mean of all predictions
Plot the point (unique value, mean)
Repeat 3-7 taking the next unique value until no more values
You now have a 1-way PDP. When the feature increases (X-axis), what on average happens to the prediction (y-axis). What is the magnitude of the change.
Taking the analysis further, you can fit a smooth curve or splines to the PDP which may help understand the relationship. As #Maxim said, there is not a perfect rule so you are looking for the trend here, trying to understand a relationship. We tend to run this for the most important features and/or features you are curious about.
The above scikit-learn reference has more examples.
For a Decision Tree, you can use the algorithmic short-cut as described by Friedman and implemented by scikit-learn. You need to walk the tree so the code is tied to the package and algorithm, hence it does not answer your question and I will not describe it. But it is on that scikit-learn page I referenced and in the paper.
def pdp_data(clf, X, col_index):
X_copy = np.copy(X)
results = {}
results['x_values'] = np.sort(np.unique(X_copy[:, col_index]))
results['y_values'] = []
for value in results['x_values']:
X_copy[:, col_index] = value
y_predict = clf.predict_log_proba(X_copy)[:, 1]
results['y_values'].append(np.mean(y_predict))
return results
Edited to answer new part of question:
For the addition to your question, you are looking for a linear model with coefficients. If you must interpret the model with linear coefficients, build a linear model.
Sometimes how you need to interpret the model guides what type of model you build.
In general - no. Decision trees work differently that that. For example it could have a rule under the hood that if feature X > 100 OR X < 10 and Y = 'some value' than answer is Yes, if 50 < X < 70 - answer is No etc. In the instance of decision tree you may want to visualize its results and analyse the rules. With RF model it is not possible, as far as I know, since you have a lot of trees working under the hood, each has independent decision rules.

Categories