I am using sklearn's permutation_importance to estimate the importance of my independent variables. The model I am fitting is a linear regression.
The permutation importance the model returns looks like this: [0.7939618 3.6692722 0.02936469].
The permutation importance is defined as the difference between the baseline metric and the metric obtained after permuting the feature column.
In my case, I think that the baseline metric is the R^2 value if I do not permute any variables (baseline R^2 ~ 0.86). How is it possible that I obtain a value of 3.66 for one of my features in this case? If I manually permute this feature and recalculate R^2, I get a value of ~0.18, so the feature importance would be ~0.68 if I am not mistaken.
If anyone could explain to me why I am observing these high feature importance values, I'd be very grateful!
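One possible explanation, offered as a sketch rather than a diagnosis of the data above: R^2 has no lower bound, so the score after permuting can be strongly negative, and "baseline minus permuted" can then exceed 1. The pure-Python example below uses a hypothetical, already-fitted linear model (the coefficients 3.0 and 0.1 and the helper names are assumptions for illustration) and computes the importance by hand with the same definition.

```python
import random

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (can be negative)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def predict(rows):
    # Hypothetical fitted linear model: y = 3.0 * x0 + 0.1 * x1
    return [3.0 * x0 + 0.1 * x1 for x0, x1 in rows]

rng = random.Random(0)
X = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
y = predict(X)  # noise-free target, so the baseline R^2 is exactly 1

def permutation_importance_r2(X, y, col, n_repeats=5):
    """Mean drop in R^2 when one column is shuffled."""
    baseline = r2_score(y, predict(X))
    drops = []
    for _ in range(n_repeats):
        shuffled = [row[col] for row in X]
        rng.shuffle(shuffled)
        X_perm = [
            tuple(s if j == col else v for j, v in enumerate(row))
            for row, s in zip(X, shuffled)
        ]
        drops.append(baseline - r2_score(y, predict(X_perm)))
    return sum(drops) / n_repeats

imp_x0 = permutation_importance_r2(X, y, col=0)  # dominant feature
imp_x1 = permutation_importance_r2(X, y, col=1)  # nearly irrelevant feature
print(imp_x0, imp_x1)
```

Here the dominant feature gets an importance above 1 because the permuted R^2 drops below zero, which is consistent with the definition quoted in the question.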
There's no specific code for this.
Right now, I have a logistic regressor whose target column is is_promoted (boolean), with 0's and 1's. When I compute the train and test accuracy and MSE, they are between 0 and 1.
I have a different model, a linear regressor. The target column is 'resale_price', with values of 10,000 and up. When I compute the train and test accuracy, they are negative, and sometimes go below -1. And their MSEs are at least 5 digits long.
What I am wondering is,
In my logistic regressor, the values are 1 digit long, whereas my linear regressor has values 5-6 digits long. Do bigger numbers produce bigger MSE?
My linear regressor train and test MSE are like 100,000. Could something be wrong with my data preparation?
MSE is not a suitable metric for logistic regression. In a machine learning context, logistic regression predicts membership of a binary class based on the input variables. As you state in the question, the predicted class can only be 1 or 0. The formula for MSE is
MSE = (1/n) * sum_{i=1}^{n} (Y_i - Yhat_i)^2
where Y_i is the actual value and Yhat_i the predicted value for sample i.
Clearly, when both the predicted and actual Y values are only either 0 or 1, this formula doesn't make sense.
Metrics that make more sense for logistic regression as a classifier algorithm are classifier-specific metrics, such as accuracy, sensitivity, and specificity (see confusion matrix).
Linear regression is a regression algorithm, and so predicts a continuous outcome. In this situation, MSE is a suitable metric (along with R-squared, RMSE, MAE, and others).
And in answer to your second question, MSE depends on the scale of the target, so without further context this question cannot be answered. A scale-free metric for linear regression is R-squared, which measures the proportion of the variance in the actual values that the predictions explain, with R-squared = 1 being a perfect fit. It becomes negative when the model does worse than always predicting the mean of the target.
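As a small illustration of the scale dependence, the sketch below computes MSE and R-squared on a made-up target, then multiplies both the actual and predicted values by 1000. The numbers and helper names are assumptions chosen only for the example.

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """R-squared: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up values for illustration only
y_true = [10.0, 12.0, 15.0, 11.0]
y_pred = [11.0, 11.5, 14.0, 12.0]

scale = 1000.0
y_true_big = [scale * v for v in y_true]
y_pred_big = [scale * v for v in y_pred]

mse_small, mse_big = mse(y_true, y_pred), mse(y_true_big, y_pred_big)
r2_small, r2_big = r2(y_true, y_pred), r2(y_true_big, y_pred_big)

# MSE grows with the square of the scale; R-squared does not change
print(mse_small, mse_big)
print(r2_small, r2_big)
```

So an MSE around 100,000 on prices of 10,000 and up corresponds to a typical error of a few hundred, which by itself does not indicate a data preparation problem.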
I am using Python to produce forecasts with various models and want to measure how good the predictions are. I have used MAE, MSE, RMSE, Chi2, and p-values to compare models, which can show goodness of fit.
However, I am unable to find a method to measure a model's degree of over-estimation (e.g. the forecast has more values over the actual values, by X) or under-estimation (e.g. the forecast has more values under the actual values, by X).
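One simple option, sketched here under the assumption that "error" means forecast minus actual, is to keep the sign of the errors instead of squaring or taking absolute values. The function name and sample numbers below are hypothetical.

```python
def forecast_bias(actual, forecast):
    """Signed-error summary: positive mean_error means over-estimation."""
    errors = [f - a for a, f in zip(actual, forecast)]
    n = len(errors)
    return {
        "mean_error": sum(errors) / n,               # average signed error
        "share_over": sum(e > 0 for e in errors) / n,   # fraction of over-forecasts
        "share_under": sum(e < 0 for e in errors) / n,  # fraction of under-forecasts
    }

# Made-up values for illustration only
actual = [100, 120, 90, 110]
forecast = [110, 125, 85, 120]

bias = forecast_bias(actual, forecast)
print(bias)
```

A mean error near zero with a large MAE would suggest errors in both directions that cancel out, while a mean error close to the MAE would suggest a systematic bias.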
I have a classification problem where I need to predict a class (0 or 1) from the data. Basically, I have a dataset with more than 300 features (including a target value for prediction) and more than 2000 rows (samples). I applied different classifiers as follows:
1. DecisionTreeClassifier()
2. RandomForestClassifier()
3. GradientBoostingClassifier()
4. KNeighborsClassifier()
Almost all the classifiers gave me similar results, around 0.50 AUC, except Random Forest, which gave around 0.28. I would like to know whether it is correct to invert the RandomForest result like this:
1-0.28= 0.72
And report it as the AUC? Is it correct?
Your intuition is not wrong: if a binary classifier indeed performs worse than random (i.e. AUC < 0.5), a valid strategy is to simply invert its predictions (i.e. report a 0 whenever the classifier predicts a 1, and vice versa); from the relevant Wikipedia entry (emphasis added):
The diagonal divides the ROC space. Points above the diagonal represent good classification results (better than random); points below the line represent bad results (worse than random). Note that the output of a consistently bad predictor could simply be inverted to obtain a good predictor.
Nevertheless, the formally correct way to obtain the AUC for this inverted classifier would be to first invert the individual probabilistic predictions prob of your model:
prob_invert = 1 - prob
and then calculate the AUC using these predictions prob_invert (arguably the process should give similar results to the naive approach you describe of simply subtracting the AUC from 1, but I'm not quite sure of the exact result - see also this Quora answer).
Needless to say, all this is based on the assumption that your whole process is correct, i.e. you don't have any modeling or coding errors (constructing a worse-than-random classifier is not exactly trivial).
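To check how the two approaches compare, here is a minimal sketch that computes AUC by its pairwise definition (the fraction of positive/negative pairs ranked correctly, with ties counted as one half) on made-up labels and scores. The `auc` helper is written for this example and is not a library function.

```python
def auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Made-up, deliberately worse-than-random scores
y_true = [0, 0, 1, 1, 0, 1]
prob = [0.9, 0.6, 0.2, 0.6, 0.7, 0.3]

prob_invert = [1 - p for p in prob]

auc_original = auc(y_true, prob)
auc_inverted = auc(y_true, prob_invert)
print(auc_original, auc_inverted, 1 - auc_original)
```

Inverting the scores reverses every pairwise comparison and leaves ties as ties, so under this definition the AUC of prob_invert equals 1 minus the original AUC.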
I am trying to conduct Regression analysis on 25-D data.
My data is in a data frame.
My end objective is to predict a score value, which is a percentage (0, 99, 70, 22, etc.).
1. Do I need to normalize/scale the data, or does Linear/Polynomial Regression handle this?
2. I applied Polynomial Regression. Although it gives me a good R-squared value, I see that it returns negative values such as -342.54 or out-of-range values such as 252 (not at all in the range of the scores I used for training). How do I rectify this?
Is there any other technique I could use to predict the values?
So here's the link to the data:
https://docs.google.com/spreadsheets/d/1swkRwLXklrWEDV3bKic5uxl_uHLjzU0QDHJ2JLSP8zQ/edit?usp=sharing
Also here's the code:
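from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# X = columns D:AC of the spreadsheet
# y = column 'Score' (or column 'Match' in the case of logistic regression)
poly = PolynomialFeatures(degree=2)
X_ = poly.fit_transform(X)
# Reuse the transformer fitted on the training data instead of refitting it
X_test_ = poly.transform(X_test)
# Instantiate
lg = LinearRegression()
# Fit
lg.fit(X_, y)
# Obtain coefficients
lg.coef_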
1. Do I need to normalize/scale the data, or does Linear/Polynomial Regression handle this?
It is "usually" good practice, and iterative solvers tend to converge faster on scaled data. If you are using sklearn, older versions of LinearRegression had a parameter called normalize which, when set to True, normalized all variables before fitting the model. That parameter was deprecated in scikit-learn 1.0 and removed in 1.2, so in current versions scale the features explicitly, for example with StandardScaler in a Pipeline.
2. I applied Polynomial Regression. Although it gives me a good R-squared value, I see that it returns negative values such as -342.54 or out-of-range values such as 252 (not at all in the range of the scores I used for training). How do I rectify this? Is there any other technique I could use to predict the values?
Polynomial Regression is designed to give values between -inf and +inf. If you want percentage values, pass the predictions through a function like the sigmoid. You can also use Logistic Regression, whose predict_proba() function outputs probabilities between 0 and 1 (although this model works on a different objective entirely).
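Two ways of keeping predictions inside a percentage range can be sketched as follows. This is only an illustration: the function names and the eps value are assumptions. The first simply clips the raw predictions; the second trains the regression on a logit-transformed target and maps predictions back through the sigmoid.

```python
import math

def clip_to_range(value, low=0.0, high=100.0):
    """Hard-limit a raw prediction to [low, high]."""
    return max(low, min(high, value))

def to_logit(percent, eps=1e-3):
    """Map a percentage to an unbounded target; eps keeps 0 and 100 finite."""
    p = max(eps, min(1.0 - eps, percent / 100.0))
    return math.log(p / (1.0 - p))

def from_logit(z):
    """Inverse of to_logit: a numerically stable sigmoid scaled to (0, 100)."""
    if z >= 0:
        return 100.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return 100.0 * e / (1.0 + e)

# Option 1: clip the out-of-range predictions mentioned in the question
print(clip_to_range(-342.54), clip_to_range(252.0))

# Option 2: fit the regression on to_logit(score), then convert back
z = to_logit(70.0)
print(from_logit(z))
```

Clipping is the least intrusive fix, while the logit approach changes what the model is fitted to, so the two can give noticeably different predictions.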
As @VivekKumar rightly said, we can hardly help you unless we have specific information.
I am training on my dataset using LinearSVC in scikit-learn. Can I calculate/get the probability with which a sample is classified under a given label?
For example, using SGDClassifier(loss="log") to fit the data, enables the predict_proba method, which gives a vector of probability estimates P(y|x) per sample x:
>>> clf = SGDClassifier(loss="log").fit(X, y)
>>> clf.predict_proba([[1., 1.]])
Output:
array([[ 0.0000005, 0.9999995]])
Is there any similar function which I can use to calculate the prediction probability while using svm.LinearSVC (multi-class classification)? I know there is a method decision_function to predict the confidence scores for samples in this case. But is there any way I can calculate probability estimates for the samples using this decision function?
No, LinearSVC will not compute probabilities because it's not trained to do so. Use sklearn.linear_model.LogisticRegression, which uses the same algorithm as LinearSVC but with the log loss. It uses the standard logistic function for probability estimates:
1. / (1 + exp(-decision_function(X)))
(For the same reason, SGDClassifier will only output probabilities when loss="log" (renamed to "log_loss" in scikit-learn 1.1 and later), not when using its default loss function, which causes it to learn a linear SVM.)
Multi-class classification is done as one-vs-all classification. For an SGDClassifier with loss="modified_huber", the distance to the hyperplane corresponding to a particular class is returned, and the probability is calculated as
(clip(decision_function(X), -1, 1) + 1) / 2
Refer to code for details.
You can implement a similar function; it seems reasonable to me for LinearSVC, although that probably needs some justification. Refer to the paper mentioned in the docs:
Zadrozny and Elkan, “Transforming classifier scores into multiclass probability estimates”, SIGKDD‘02, http://www.research.ibm.com/people/z/zadrozny/kdd2002-Transf.pdf
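The two score-to-probability mappings mentioned in this thread can be sketched in plain Python as below. The decision scores are made up, the helper names are hypothetical, and applying either mapping to LinearSVC output is a heuristic rather than a calibrated estimate.

```python
import math

def logistic(scores):
    """Standard logistic function applied to each decision score."""
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]

def clipped_linear(scores):
    """(clip(score, -1, 1) + 1) / 2 applied to each decision score."""
    return [(max(-1.0, min(1.0, s)) + 1.0) / 2.0 for s in scores]

def normalize(row):
    """Rescale per-class values so they sum to 1 (uniform if all are zero)."""
    total = sum(row)
    if total == 0:
        return [1.0 / len(row)] * len(row)
    return [v / total for v in row]

# Hypothetical one-vs-rest decision scores for one sample and three classes
decision_scores = [-0.5, 2.0, 0.25]

clipped = clipped_linear(decision_scores)
proba_clipped = normalize(clipped)
proba_logistic = normalize(logistic(decision_scores))
print(proba_clipped)
print(proba_logistic)
```

Both mappings preserve the ranking of the classes for a sample, so the predicted label does not change; only the confidence values differ.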
P.s. A comment from "Is there 'predict_proba' for LinearSVC?":
If you want probabilities, you should either use logistic regression or SVC. Both can predict probabilities, but in very different ways.