I have a pipeline like this:
    attribute_est = Pipeline([
        ('jsdf', DictVectorizer()),
        ('clf', Ridge())
    ])
Into it I pass data like:
    {
        'Master_card': 1,
        'Credit_Cards': 1,
        'casual_ambiance': 0,
        'Classy_People': 0
    }
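For context, here is a runnable version of the pipeline above; the training rows and targets are hypothetical, just to show the input format:

    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import Ridge

    attribute_est = Pipeline([
        ('jsdf', DictVectorizer()),  # maps feature-name dicts to numeric vectors
        ('clf', Ridge())
    ])

    # Hypothetical training data in the format shown above.
    X = [{'Master_card': 1, 'Credit_Cards': 1, 'casual_ambiance': 0, 'Classy_People': 0},
         {'Master_card': 0, 'Credit_Cards': 1, 'casual_ambiance': 1, 'Classy_People': 0}]
    y = [4.0, 2.5]  # hypothetical targets
    attribute_est.fit(X, y)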
My model does not predict very well. It was suggested to me that:
You may find it difficult to find a single regressor that does well
enough. A common solution is to use a linear model to fit the linear
part of some data, and use a non-linear model to fit the residual that
the linear model can't fit. Build a residual estimator that takes as
an argument two other estimators. It should use the first to fit the
raw data and the second to fit the residuals of the first.
What is meant by a residual estimator? Can you provide me with an example, please?
A residual is the error between the true data values, and those predicted by some estimator. The simplest example is in the case of linear regression, where the residuals are the distance between the best linear fit to some data and the actual data points. Least-squares fitting of a line minimizes the sum of these squared residuals.
The recommendation you were given suggests using two estimators. The first will fit the data itself. In the linear regression case, this is a least-squares linear fit, probably using something like scikit-learn's linear regression model.
The second estimator will then try to fit the residuals, i.e., the difference between the linear fit to the data and the actual data points. In the least-squares case, this is effectively detrending the data and then fitting what is left over. You might pick this to be a Gaussian in the case where you expect that the data really is a line with additive Gaussian noise; but if you know something about the underlying noise distribution, use that as your second estimator.
There's no specific code for this.
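That said, here is a minimal sketch of such an estimator, following scikit-learn's estimator conventions (the class name and structure are mine, not from the suggestion you quoted):

    import numpy as np
    from sklearn.base import BaseEstimator, RegressorMixin

    class ResidualEstimator(BaseEstimator, RegressorMixin):
        """Fit `first` on the raw data, then fit `second` on its residuals."""
        def __init__(self, first, second):
            self.first = first
            self.second = second

        def fit(self, X, y):
            self.first.fit(X, y)
            residuals = np.asarray(y) - self.first.predict(X)  # what `first` missed
            self.second.fit(X, residuals)
            return self

        def predict(self, X):
            # Final prediction = first model's fit + second model's residual fit.
            return self.first.predict(X) + self.second.predict(X)

For example, ResidualEstimator(Ridge(), RandomForestRegressor()) would fit Ridge to the raw data and a random forest to whatever the linear model could not explain.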
Right now, I have a logistic regressor whose target column is is_promoted (boolean, with 0s and 1s). When I compute the train and test accuracy and MSE, they are between 0 and 1.
I also have a different model, a linear regressor. Its target column is 'resale_price', with values of 10,000 and up. When I compute the train and test accuracy, the values are negative, and sometimes go below -1. Their MSEs are also at least 5 digits long.
What I am wondering is:
In my logistic regressor, the target values are 1 digit long, whereas my linear regressor's are 5-6 digits long. Do bigger numbers produce a bigger MSE?
My linear regressor's train and test MSE are around 100,000. Could something be wrong with my data preparation?
MSE is not a suitable metric for logistic regression. In a machine learning context, logistic regression predicts membership of a binary class based on the data input variables. As you state in the question, the predicted class can only be 1 or 0. The formula for MSE is

    MSE = (1/n) * sum((y_i - y_hat_i)^2 for i = 1..n)

Clearly, when both the predicted and actual Y values can only be either 0 or 1, this formula doesn't make sense.
Metrics that make more sense for logistic regression as a classifier algorithm are classifier-specific metrics, such as accuracy, sensitivity, and specificity (see confusion matrix).
Linear regression is a regression algorithm, and so predicts a continuous outcome. In this situation, MSE is a suitable metric (along with R-squared, RMSE, MAE, and others).
And in answer to your second question: MSE is dependent on scale, so without further context this question cannot be answered. A scale-free metric for linear regression is R-squared, which assesses the correlation of the predicted values with the actual values, with R-squared = 1 being a perfect fit.
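To make the distinction concrete, here is a minimal sketch; the y_* names are hypothetical placeholders for your targets and predictions:

    from sklearn.metrics import accuracy_score, mean_squared_error, r2_score

    # Classifier (is_promoted): use classification metrics.
    acc = accuracy_score(y_true_promoted, y_pred_promoted)

    # Regressor (resale_price): use regression metrics.
    mse = mean_squared_error(y_true_price, y_pred_price)  # scale-dependent
    r2 = r2_score(y_true_price, y_pred_price)             # scale-free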
I am using sklearn's permutation_importance to estimate the importance of my independent variables. The model I am fitting is a linear regression.
The permutation importance the model returns looks like this: [0.7939618 3.6692722 0.02936469].
The permutation importance is defined to be the difference between the baseline metric and metric from permutating the feature column.
In my case, I think the baseline metric is the R^2 value when I do not permute any variables (baseline R^2 ~ 0.86). How is it possible that I obtain a value of 3.66 for one of my features in this case? If I manually permute this feature and recalculate R^2, I get a value of ~0.18, so the feature importance should be ~0.68 if I am not mistaken.
If anyone could explain to my why I am observing these high feature importance values, I'd be very grateful!
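For reference, the definition above can be reproduced by hand as follows; model, X, y, and the column index j are placeholders for your setup. Note that R^2 on a permuted feature can drop well below zero, so an importance larger than the baseline R^2 is possible:

    import numpy as np
    from sklearn.metrics import r2_score

    rng = np.random.default_rng(0)
    baseline = r2_score(y, model.predict(X))        # e.g. ~0.86

    X_perm = X.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])    # shuffle one feature column
    permuted = r2_score(y, model.predict(X_perm))

    importance = baseline - permuted  # exceeds the baseline when permuted R^2 < 0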
I am trying to conduct regression analysis on 25-D data.
My data is in a data frame.
My end objective is to predict a score value, which is a percentage (0, 99, 70, 22, etc.).
1. Do I need to normalize/scale the data, or does Linear/Polynomial Regression handle this?
2. I applied Polynomial Regression; though it gives me a good R-squared value, I see that it returns results with negative values like -342.54 or high values like 252, not at all in the range of the scores I trained on. How do I rectify this?
Is there any other technique I should use to predict these values?
Here's a link to the data:
https://docs.google.com/spreadsheets/d/1swkRwLXklrWEDV3bKic5uxl_uHLjzU0QDHJ2JLSP8zQ/edit?usp=sharing
Also here's the code:
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LinearRegression

    # X = columns D:AC of the spreadsheet
    # Y = column 'Score' (or column 'Match' in the logistic regression case)

    poly = PolynomialFeatures(degree=2)
    X_ = poly.fit_transform(X)
    X_test_ = poly.transform(X_test)  # use transform here: fit on training data only

    # Instantiate
    lg = LinearRegression()
    # Fit
    lg.fit(X_, Y)
    # Obtain coefficients
    lg.coef_
1. Do I need to normalize/scale the data, or does Linear/Polynomial Regression handle this?
It is "usually" good practice: the model converges faster. If you are using sklearn, the Linear Regression module has a parameter called normalize which, when set to True, normalizes all variables before fitting the model.
2. I applied Polynomial Regression; though it gives me a good R-squared value, it returns results with negative values like -342.54 or high values like 252, not at all in the range of the scores I trained on. How do I rectify this? Is there any other technique I should use to predict values?
Polynomial Regression is designed to give values between -inf and +inf. If you want percentage values, squash these outputs through a function like the sigmoid. You can also use Logistic Regression, whose predict_proba() function outputs probabilities between 0 and 1 (although this model works on an entirely different objective).
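A sketch of both options, reusing the names from your code above (Y_match is my placeholder for the binary 'Match' column):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Option 1: squash the unbounded regression outputs into (0, 1) with a
    # sigmoid, then rescale to percentages.
    preds_pct = 100.0 / (1.0 + np.exp(-lg.predict(X_test_)))

    # Option 2: a classifier's predict_proba already lies in [0, 1].
    clf = LogisticRegression().fit(X_, Y_match)
    probs_pct = 100.0 * clf.predict_proba(X_test_)[:, 1]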
As @VivekKumar rightly said, we can hardly help you unless we have more specific information.
I'm using cross-validation to evaluate the performance of a classifier with scikit-learn, and I want to plot the precision-recall curve. I found an example on scikit-learn's website for plotting the PR curve, but it doesn't use cross-validation for the evaluation.
How can I plot the Precision-Recall curve in scikit learn when using cross-validation?
I did the following, but I'm not sure if it's the correct way to do it (pseudocode):
    for each k-fold:
        precision, recall, _ = precision_recall_curve(y_test, probs)
        mean_precision += precision
        mean_recall += recall

    mean_precision /= num_folds
    mean_recall /= num_folds

    plt.plot(mean_recall, mean_precision)
What do you think?
Edit:
It doesn't work, because the sizes of the precision and recall arrays are different after each fold.
anyone?
Instead of recording the precision and recall values after each fold, store the predictions on the test samples after each fold. Next, collect all the test (i.e. out-of-bag) predictions and compute precision and recall.
    ## let test_samples[k] = test samples for the kth fold (list of lists)
    ## let train_samples[k] = train samples for the kth fold (list of lists)
    for k in range(num_folds):
        model = train(parameters, train_samples[k])
        predictions_fold[k] = predict(model, test_samples[k])

    # collect the out-of-fold predictions from every fold
    predictions_combined = [p for preds in predictions_fold for p in preds]

    ## let predictions = predictions_combined rearranged into the original sample order
    ## use predictions and labels to compute TP, FP, FN
    ## use TP, FP, FN to compute the precisions and recalls for one run of
    ## k-fold cross-validation
Under a single, complete run of k-fold cross-validation, the predictor makes one and only one prediction for each sample. Given n samples, you should have n test predictions.
(Note: these predictions are different from training predictions, because the predictor makes each prediction without having previously seen the sample.)
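In scikit-learn this pooling can be done directly with cross_val_predict; clf, X, and y here are placeholders for your classifier and data:

    from sklearn.model_selection import cross_val_predict
    from sklearn.metrics import precision_recall_curve

    # One out-of-fold probability per sample, already in the original order.
    probs = cross_val_predict(clf, X, y, cv=10, method="predict_proba")[:, 1]
    precision, recall, _ = precision_recall_curve(y, probs)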
Unless you are using leave-one-out cross-validation, k-fold cross-validation generally requires a random partitioning of the data. Ideally, you would do repeated (and stratified) k-fold cross-validation. Combining precision-recall curves from different rounds, however, is not straightforward, since you cannot use simple linear interpolation between precision-recall points, unlike with ROC curves (see Davis and Goadrich 2006).
I personally calculated AUC-PR using the Davis-Goadrich method for interpolation in PR space (followed by numerical integration) and compared the classifiers using the AUC-PR estimates from repeated stratified 10-fold cross validation.
For a nice plot, I showed a representative PR curve from one of the cross-validation rounds.
There are, of course, many other ways of assessing classifier performance, depending on the nature of your dataset.
For instance, if the proportion of (binary) labels in your dataset is not skewed (i.e. it is roughly 50-50), you could use the simpler ROC analysis with cross-validation:
Collect predictions from each fold and construct ROC curves (as before), collect all the TPR-FPR points (i.e. take the union of all TPR-FPR tuples), then plot the combined set of points with possible smoothing. Optionally, compute AUC-ROC using simple linear interpolation and the composite trapezoid method for numerical integration.
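A minimal sketch of that ROC variant, where probs are the pooled out-of-fold probabilities collected as above:

    import matplotlib.pyplot as plt
    from sklearn.metrics import roc_curve, auc

    fpr, tpr, _ = roc_curve(y, probs)
    roc_auc = auc(fpr, tpr)  # linear interpolation + trapezoidal integration
    plt.plot(fpr, tpr, label="AUC-ROC = %.3f" % roc_auc)
    plt.legend()
    plt.show()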
This is currently the best way to plot a precision-recall curve for an sklearn classifier using cross-validation. The best part is that it plots the PR curves for ALL classes, so you get multiple neat-looking curves as well:
    from scikitplot.classifiers import plot_precision_recall_curve
    from sklearn.linear_model import LogisticRegression
    import matplotlib.pyplot as plt

    clf = LogisticRegression()
    plot_precision_recall_curve(clf, X, y)  # X, y: your dataset
    plt.show()
The function automatically takes care of cross-validating the given dataset, concatenating all out of fold predictions, and calculating the PR Curves for each class + averaged PR Curve. It's a one-line function that takes care of it all for you.
[Image: precision-recall curves for each class plus the averaged curve]
Disclaimer: Note that this uses the scikit-plot library, which I built.
I am training on my dataset using LinearSVC in scikit-learn. Can I calculate/get the probability with which a sample is classified under a given label?
For example, using SGDClassifier(loss="log") to fit the data, enables the predict_proba method, which gives a vector of probability estimates P(y|x) per sample x:
    >>> clf = SGDClassifier(loss="log").fit(X, y)
    >>> clf.predict_proba([[1., 1.]])
    array([[ 0.0000005,  0.9999995]])
Is there any similar function which I can use to calculate the prediction probability while using svm.LinearSVC (for multi-class classification)? I know there is the decision_function method to predict the confidence scores for samples in this case, but is there any way I can calculate probability estimates for the samples using this decision function?
No, LinearSVC will not compute probabilities because it's not trained to do so. Use sklearn.linear_model.LogisticRegression, which uses the same algorithm as LinearSVC but with the log loss. It uses the standard logistic function for probability estimates:
1. / (1 + exp(-decision_function(X)))
(For the same reason, SGDClassifier will only output probabilities when loss="log", not using its default loss function which causes it to learn a linear SVM.)
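To make that concrete: for a binary problem, the two are numerically identical (X, y stand in for any binary classification dataset):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    clf = LogisticRegression().fit(X, y)
    manual = 1.0 / (1.0 + np.exp(-clf.decision_function(X)))
    print(np.allclose(manual, clf.predict_proba(X)[:, 1]))  # True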
Multi-class classification is one-vs-all classification. For an SGDClassifier, since a distance to the hyperplane corresponding to each particular class is returned, the probability is calculated as

    (clip(decision_function(X), -1, 1) + 1) / 2
Refer to code for details.
You can implement a similar function; it seems reasonable to me for LinearSVC, although that probably needs some justification. Refer to the paper mentioned in the docs:
Zadrozny and Elkan, “Transforming classifier scores into multiclass probability estimates”, SIGKDD‘02, http://www.research.ibm.com/people/z/zadrozny/kdd2002-Transf.pdf
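A sketch of that clipping scheme applied to LinearSVC; note these are crude, uncalibrated estimates, not true probabilities (X, y stand in for your data):

    import numpy as np
    from sklearn.svm import LinearSVC

    svm = LinearSVC().fit(X, y)
    scores = svm.decision_function(X)         # signed distances to the hyperplane
    pseudo_probs = (np.clip(scores, -1, 1) + 1) / 2

For properly calibrated probabilities, scikit-learn's CalibratedClassifierCV can also wrap a LinearSVC.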
P.S. A comment from "Is there 'predict_proba' for LinearSVC?":
If you want probabilities, you should use either logistic regression or SVC; both can predict probabilities, but in very different ways.