Residual plot diagnostic and how to improve the regression model - python

When creating regression models for this housing dataset, we can plot the residuals as a function of the actual values.
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
X = housing[['lotsize']]
y = housing[['price']]
model = LinearRegression()
model.fit(X, y)
plt.scatter(y,model.predict(X)-y)
We can clearly see that the difference (prediction - real value) is mainly positive for lower prices and negative for higher prices.
This is expected for linear regression, because the model is optimized for RMSE (so the sign of the residual is not taken into account).
But when using KNN:
from sklearn.neighbors import KNeighborsRegressor
model = KNeighborsRegressor(n_neighbors = 3)
We find a similar plot.
In this case, what interpretation can we give, and how can we improve the model?
EDIT: we can use all the other predictors; the results are similar.
housing = housing.replace(to_replace='yes', value=1, regex=True)
housing = housing.replace(to_replace='no', value=0, regex=True)
X = housing[['lotsize','bedrooms','stories','bathrms','driveway','recroom',
             'fullbase','gashw','airco','garagepl','prefarea']]
The following graph is for KNN with 3 neighbors. With 3 neighbors, one would expect overfitting, so I can't figure out why there is this trend.

If you look at the fit:
plt.scatter(X,y)
plt.plot(X,model.predict(X), '--k')
You get negative values for higher values of y because there is a cluster of points around x = 8000 whose y values are much higher than the fitted line predicts.
Now, if you use KNN, bear in mind that your independent variable is only one-dimensional: neighbours are defined by lotsize alone, and the prediction is the mean of each group of neighbours. The high outliers around x = 8000 are grouped with points that have lower y values, so their predictions fall below the actual values and the difference is negative.
If you plot this out:
plt.scatter(X,y)
plt.scatter(X,model.predict(X))
How to improve the model? With only one predictor, there's not much you can do. You could categorize lotsize, but I doubt it would change much. Most likely you need other variables to explain what is causing that bump around lotsize = 8000; then you can model the dependent variable better.
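One further point may help when reading plots like the ones in the question: the difference plotted against the actual values tends to show a trend for almost any model, because the actual value contains the residual itself. For ordinary least squares with an intercept, the correlation between (prediction - actual) and the actual values equals -sqrt(1 - R^2), while the correlation with the predicted values is zero. The following is a minimal sketch on synthetic data; the data and variable names are made up for illustration and are not the housing dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for (lotsize, price); the numbers are illustrative only.
rng = np.random.default_rng(0)
x = rng.uniform(2000, 12000, size=500)
y = 30000 + 6 * x + rng.normal(scale=20000, size=500)

X = x.reshape(-1, 1)
model = LinearRegression().fit(X, y)
pred = model.predict(X)
diff = pred - y                     # same sign convention as in the question
r2 = model.score(X, y)              # R^2 on the training data

corr_with_actual = np.corrcoef(diff, y)[0, 1]
corr_with_pred = np.corrcoef(diff, pred)[0, 1]

print(corr_with_actual, -np.sqrt(1 - r2))  # the two numbers agree
print(corr_with_pred)                      # approximately 0
```

For this reason, plotting the difference against the predicted values rather than the actual values is usually the more informative diagnostic.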

Related

After normalising my data using DataPreparer while using random forest and SVM, why do my data values become negative?

I am working on predictive modeling where I need to predict whether an online customer ends up purchasing a product on a website or not, and I am using Random Forest Classifier and SVM since it's a classification problem.
After creating the fitting splits for training, testing, and validation sets, I dummify, standardize and normalize my data. However, after I normalize the sets, their values become all negative. Is there a way to change that and why does it happen?
The code that I am using to normalize my fitting sets is as below:
data_preparer = DataPreparer(one_hot_encoder, standard_scaler)
data_preparer.prepare_data(fitting_splits.train_set).head()
data_preparer.prepare_data(fitting_splits.validation_set).head()
I think the documentation from sklearn.preprocessing.StandardScaler can help here:
The standard score of a sample x is calculated as:
z = (x - u) / s
where u is the mean of the training samples or zero if
with_mean=False, and s is the standard deviation of the training
samples or one if with_std=False.
Based on this equation, if x (the individual value currently being scaled) is less than the mean of the variable, then your scaled value will be negative.
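A small sketch of this behaviour, using made-up numbers rather than the asker's data (the array values is hypothetical):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One hypothetical feature column with mean 30.
values = np.array([[10.0], [20.0], [30.0], [40.0], [50.0]])

scaler = StandardScaler().fit(values)
scaled = scaler.transform(values)

print(scaler.mean_)     # mean learned from the data: [30.]
print(scaled.ravel())   # negative below the mean, 0 at the mean, positive above
```

Negative values after standardization are therefore expected and are not a problem for Random Forest or SVM. If the first rows shown by head() are all negative, those rows simply lie below the training mean for those columns.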

Prediction model to output percentage 'likelihood'?

Suppose I want to predict the percentage likelihood (1-100%) that a 3rd year student graduates college.
I have a training data set with 100 observations, all of which contain examples of students classified to be "Highly likely to Graduate".
I have another data set consisting of say 500 observations (where we don't know if any have graduated).
My question is: How would I go about getting a probability value for all 500 students that describes how likely they are to graduate based on a number of features (anywhere between 1-5 features such as grade scores, living on campus or off campus, etc.) on a model that was trained from the first dataset? What approaches would you suggest?
I would recommend using OneClassSVM, which is an unsupervised outlier detector. Since your training data contains only samples from one class, i.e. "Highly likely to Graduate", training a logistic regression or a neural network may not work here. It's better to treat the data you have as inliers and the other category, students who are not likely to graduate, as outliers. Once you fit a OneClassSVM model, you can use decision_function to get the signed distance to the separating hyperplane, which is positive for an inlier and negative for an outlier. On top of that, you can apply a sigmoid function to turn the distances into probability-like scores. I have shown an example below:
import numpy as np
from sklearn.svm import OneClassSVM
X = [[0], [0.44], [0.45], [0.46], [1]]
clf = OneClassSVM(gamma='auto').fit(X)
def sigmoid(x):
    return 1/(1+np.exp(-x))
prob = clf.decision_function([[0.455]]) # Not an outlier
sigmoid(prob)
#array([0.50027839])
prob = clf.decision_function([[5]]) # An outlier
sigmoid(prob)
#array([0.11356841])

Underfitting, Overfitting, Good_Generalization

So as a part of my assignment I'm applying linear and lasso regressions, and here's Question 7.
Based on the scores from question 6, what gamma value corresponds to a
model that is underfitting (and has the worst test set accuracy)? What
gamma value corresponds to a model that is overfitting (and has the
worst test set accuracy)? What choice of gamma would be the best
choice for a model with good generalization performance on this
dataset (high accuracy on both training and test set)?
Hint: Try plotting the scores from question 6 to visualize the
relationship between gamma and accuracy. Remember to comment out the
import matplotlib line before submission.
This function should return one tuple with the degree values in this order: (Underfitting, Overfitting, Good_Generalization) Please note there is only one correct solution.
I really need help, I can't really think of any way to solve this last question. What code should I use to determine (Underfitting, Overfitting, Good_Generalization) and why???
Thanks,
Data set: http://archive.ics.uci.edu/ml/datasets/Mushroom?ref=datanews.io
Here's my code from question 6:
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import validation_curve
def answer_six():
    # SVC requires kernel='rbf', C=1, random_state=0 as instructed
    # C: Penalty parameter C of the error term
    # random_state: The seed of the pseudo random number generator
    # used when shuffling the data for probability estimates
    # The radial basis function kernel, or RBF kernel, is a popular
    # kernel function used in various kernelized learning algorithms.
    # In particular, it is commonly used in support vector machine
    # classification
    model = SVC(kernel='rbf', C=1, random_state=0)
    # Return numpy array numbers spaced evenly on a log scale (start,
    # stop, num=50, endpoint=True, base=10.0, dtype=None, axis=0)
    gamma = np.logspace(-4,1,6)
    # Create a Validation Curve for model and subsets.
    # Create parameter name and range regarding gamma. Test Scoring
    # requires accuracy.
    # Validation curve requires X and y.
    train_scores, test_scores = validation_curve(model, X_subset, y_subset, param_name='gamma', param_range=gamma, scoring='accuracy')
    # Average the train and test scores over the CV folds (axis=1),
    # giving one mean score per gamma value
    sc = (train_scores.mean(axis=1), test_scores.mean(axis=1))
    return sc
answer_six()
Well, make yourself familiar with overfitting. You are supposed to produce something like this: Article on this topic
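On the left you have underfitting, on the right overfitting. Where both errors are low, you have good generalisation.
These regimes are a function of gamma, the RBF kernel coefficient: a small gamma gives a smoother, simpler decision boundary, and a large gamma gives a more complex one.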
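As a rough sketch of how the mean scores could be turned into the three requested values, the following uses simple heuristics on a validation curve. The synthetic dataset, the variable names, and the selection rules (lowest training score for underfitting, largest train/test gap for overfitting, highest test score for good generalization) are assumptions for illustration, not the assignment's official solution. Plotting the two curves and reading the values off by eye remains the safer check.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

# Synthetic stand-in for X_subset / y_subset (illustrative only).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

gammas = np.logspace(-4, 1, 6)
train_scores, test_scores = validation_curve(
    SVC(kernel="rbf", C=1, random_state=0), X_demo, y_demo,
    param_name="gamma", param_range=gammas, scoring="accuracy", cv=3,
)
train_mean = train_scores.mean(axis=1)   # one value per gamma
test_mean = test_scores.mean(axis=1)

# Heuristic choices (assumptions, see the text above):
underfit = int(np.argmin(train_mean))             # poor fit even on training data
overfit = int(np.argmax(train_mean - test_mean))  # largest train/test gap
best = int(np.argmax(test_mean))                  # highest test accuracy

answer = (gammas[underfit], gammas[overfit], gammas[best])
print(answer)
```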
Overfitting means your model is wrong.
If the model is wrong, make a scatter plot of the data and change the model, e.g. from linear to polynomial, or to a support vector machine with a suitable kernel.
Underfitting means your dataset is lacking: add new data, ideally features that are correlated with the target.
Check by the numbers: compare the score/accuracy on the test and training sets. If both are high and there is no big difference between them, you are doing well.
If the test or training score is low, you are facing overfitting or underfitting.
Hope this helps.

Regression analysis results not coming in Range of numbers expected

I am trying to conduct regression analysis on 25-dimensional data.
My data is in a data frame.
My end objective is to predict a score value, which is a percentage (0, 99, 70, 22, etc.).
1. Do I need to normalize/scale the data, or does linear/polynomial regression handle this?
2. I applied polynomial regression. Although it gives me a good R-squared value, it returns negative values such as -342.54 or high values such as 252 (not at all in the range of the scores I trained on). How do I rectify this?
Is there any other technique I could use to predict the values?
Here is a link to the data:
https://docs.google.com/spreadsheets/d/1swkRwLXklrWEDV3bKic5uxl_uHLjzU0QDHJ2JLSP8zQ/edit?usp=sharing
Also here's the code:
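from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
# X = columns D:AC of the spreadsheet
# y = column 'Score' (or column 'Match' in the case of logistic regression)
poly = PolynomialFeatures(degree=2)
X_ = poly.fit_transform(X)
# Reuse the transformer fitted on the training data
X_test_ = poly.transform(X_test)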
# Instantiate
lg = LinearRegression()
# Fit
lg.fit(X_, y)
# Obtain coefficients
lg.coef_
1. Do I need to normalize/scale the data, or does linear/polynomial regression handle this?
It is usually good practice, and the model converges faster. If you are using sklearn, note that older releases of LinearRegression had a parameter called normalize for this, but it was deprecated in version 1.0 and removed in version 1.2. In current versions, put a StandardScaler in front of the model in a Pipeline instead.
2. I applied polynomial regression. Although it gives me a good R-squared value, it returns negative values such as -342.54 or high values such as 252 (not at all in the range of the scores I trained on). How do I rectify this? Is there any other technique I could use to predict the values?
Polynomial regression can output any value between -inf and +inf. If you want percentage values, pass the predictions through a function like the sigmoid. You can also use logistic regression, whose predict_proba() function outputs probabilities between 0 and 1 (although this model optimizes a different objective entirely).
As @VivekKumar rightly said, we can hardly help you unless we have specific information.
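One way to apply the sigmoid idea is to fit the regression on the logit of the score and map predictions back with the sigmoid, so every prediction lands between 0 and 100. The sketch below assumes synthetic data, a hypothetical clipping constant eps, and a pipeline of StandardScaler, PolynomialFeatures, and LinearRegression. It is an illustration of the idea, not the asker's actual setup.

```python
import numpy as np
from scipy.special import expit, logit
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic features and a percentage score in (0, 100); illustrative only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
latent = X_demo[:, 0] - 0.5 * X_demo[:, 1] + rng.normal(scale=0.3, size=200)
score = 100 * expit(latent)

# Map the score to an unbounded scale; eps keeps 0 and 100 away from +/-inf.
eps = 1e-3
z = logit(np.clip(score / 100, eps, 1 - eps))

model = make_pipeline(StandardScaler(), PolynomialFeatures(degree=2), LinearRegression())
model.fit(X_demo, z)

# Even for inputs far outside the training range, predictions stay in [0, 100].
X_new = 3 * rng.normal(size=(10, 5))
pred = 100 * expit(model.predict(X_new))
print(pred.min(), pred.max())
```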

How to find the importance of the features for a logistic regression model?

I have a binary prediction model trained with the logistic regression algorithm. I want to know which features (predictors) are more important for the decision between the positive and negative class. I know there is the coef_ attribute in scikit-learn, but I don't know whether it is enough to determine importance. Another question is how I can interpret the coef_ values in terms of importance for the negative and positive classes. I also read about standardized regression coefficients, and I don't know what they are.
Let's say there are features like the size of the tumor, the weight of the tumor, etc., used to decide whether a test case is malignant or not malignant. I want to know which of the features are more important for the malignant and not malignant prediction. Does that make sense?
One of the simplest options to get a feeling for the "influence" of a given parameter in a linear classification model (logistic being one of those), is to consider the magnitude of its coefficient times the standard deviation of the corresponding parameter in the data.
Consider this example:
import numpy as np
from sklearn.linear_model import LogisticRegression
x1 = np.random.randn(100)
x2 = 4*np.random.randn(100)
x3 = 0.5*np.random.randn(100)
y = (3 + x1 + x2 + x3 + 0.2*np.random.randn(100)) > 0
X = np.column_stack([x1, x2, x3])
m = LogisticRegression()
m.fit(X, y)
# The estimated coefficients will all be around 1:
print(m.coef_)
# Those values, however, will show that the second parameter
# is more influential
print(np.std(X, 0)*m.coef_)
An alternative way to get a similar result is to examine the coefficients of the model fit on standardized parameters:
m.fit(X / np.std(X, 0), y)
print(m.coef_)
Note that this is the most basic approach and a number of other techniques for finding feature importance or parameter influence exist (using p-values, bootstrap scores, various "discriminative indices", etc).
I am pretty sure you would get more interesting answers at https://stats.stackexchange.com/.
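As one example of the other techniques mentioned above, permutation importance measures how much the score drops when a single feature's values are shuffled. The sketch below uses scikit-learn's permutation_importance on synthetic data similar to the example in the answer. The sample size, seed, and n_repeats are arbitrary choices, and in practice the importance is better computed on held-out data.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

# Three synthetic features with standard deviations of about 1, 4 and 0.5.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = 4 * rng.normal(size=500)
x3 = 0.5 * rng.normal(size=500)
X = np.column_stack([x1, x2, x3])
y = (x1 + x2 + x3 + 0.2 * rng.normal(size=500)) > 0

m = LogisticRegression().fit(X, y)

# Mean drop in accuracy when each feature is shuffled, over 20 repeats.
result = permutation_importance(m, X, y, n_repeats=20, random_state=0)
print(result.importances_mean)   # the second feature shows the largest drop
```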
