Suppose I want to predict the percentage likelihood (1-100%) that a 3rd year student graduates college.
I have a training data set with 100 observations, all of which contain examples of students classified to be "Highly likely to Graduate".
I have another data set consisting of say 500 observations (where we don't know if any have graduated).
My question is: How would I go about getting a probability value for all 500 students that describes how likely they are to graduate based on a number of features (anywhere between 1-5 features such as grade scores, living on campus or off campus, etc.) on a model that was trained from the first dataset? What approaches would you suggest?
I would recommend OneClassSVM, an unsupervised outlier detector. Since your training data contains samples from only one class ("Highly likely to Graduate"), training a logistic regression or a neural network may not work here. It is better to treat the data you have as inliers and the other category, students not likely to graduate, as outliers. Once you fit a OneClassSVM model, you can use decision_function to get the signed distance to the separating hyperplane, which is positive for an inlier and negative for an outlier. On top of that you can apply a sigmoid function to map the distance to a value between 0 and 1. Note that this is a score squashed into a probability-like range rather than a calibrated probability. I have shown an example below:
import numpy as np
from sklearn.svm import OneClassSVM

X = [[0], [0.44], [0.45], [0.46], [1]]
clf = OneClassSVM(gamma='auto').fit(X)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

distance = clf.decision_function([[0.455]])  # Not an outlier
sigmoid(distance)
# array([0.50027839])
distance = clf.decision_function([[5]])  # An outlier
sigmoid(distance)
# array([0.11356841])
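To connect this back to the setup in the question (100 training students, 500 students to score, a handful of features), here is a minimal sketch. The data is randomly generated, and the feature count, the scaler, and the gamma/nu settings are assumptions for illustration, not values from the question. The sigmoid output is a squashed decision score, so it is probably safer to read it as a ranking than as a calibrated probability.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Hypothetical data: 3 numeric features per student (made up for illustration)
X_train = rng.normal(loc=0.0, scale=1.0, size=(100, 3))  # all "Highly likely to Graduate"
X_new = rng.normal(loc=0.5, scale=1.5, size=(500, 3))    # students to score

# Scale the features, then fit the one-class model on the training students only
model = make_pipeline(StandardScaler(), OneClassSVM(gamma="scale", nu=0.1))
model.fit(X_train)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Signed distance to the boundary: positive = inlier, negative = outlier
distances = model.decision_function(X_new)
scores = sigmoid(distances)  # one value in (0, 1) per student

print(scores[:5])
```

Fitting the scaler inside the pipeline keeps the 500 unlabeled students out of the training statistics.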
I'm dealing with a multi-class classification problem, and for some of the labels the F1 score, precision, and recall are all 1.
Is that normal?
I thought it was odd and searched for it, but the answers I found varied, and some said it was fine.
As you can see in the picture, the accuracy is 88%. I balanced the data, used PCA, scaled with a min-max scaler, and used GridSearchCV for cross-validation. The data set is real-world data with only 62 rows; the problem is predicting depression in the 3rd trimester of pregnancy using features such as depression and anxiety in the 1st and 2nd trimesters.
This means that your model fits the training data perfectly. Is it likely that your data can be predicted to this degree of accuracy?
Is your dataset balanced so that there is enough variance, and will your model do well in the real world? Your model may be overfitting.
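One way to check for overfitting is to compare the score on the data the model was fit on with a cross-validated score. The sketch below is only an illustration: the 62 rows are random numbers with labels unrelated to the features, and the 1-nearest-neighbour classifier and the number of PCA components are assumptions, not the setup from the question.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Hypothetical stand-in for a 62-row dataset: random features, labels unrelated to them
X = rng.normal(size=(62, 6))
y = np.arange(62) % 3  # three balanced classes

# Scaling and PCA live inside the pipeline, so they are refit on each training fold
model = make_pipeline(
    MinMaxScaler(),
    PCA(n_components=3),
    KNeighborsClassifier(n_neighbors=1),
)

train_acc = model.fit(X, y).score(X, y)             # scored on the data it was fit on
cv_acc = cross_val_score(model, X, y, cv=5).mean()  # scored on held-out folds

print(f"training accuracy: {train_acc:.2f}")
print(f"cross-validated accuracy: {cv_acc:.2f}")
```

A large gap between the two numbers is a sign that perfect per-label scores may not carry over to new data.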
I am working on predictive modeling where I need to predict whether an online customer ends up purchasing a product on a website or not, and I am using Random Forest Classifier and SVM since it's a classification problem.
After creating the fitting splits for training, testing, and validation sets, I dummify, standardize and normalize my data. However, after I normalize the sets, their values become all negative. Is there a way to change that and why does it happen?
The code that I am using to normalize my fitting sets is as below:
data_preparer = DataPreparer(one_hot_encoder, standard_scaler)
data_preparer.prepare_data(fitting_splits.train_set).head()
data_preparer.prepare_data(fitting_splits.validation_set).head()
I think the documentation from sklearn.preprocessing.StandardScaler can help here:
The standard score of a sample x is calculated as:
z = (x - u) / s
where u is the mean of the training samples or zero if
with_mean=False, and s is the standard deviation of the training
samples or one if with_std=False.
Based on this equation, if x (the individual value currently being scaled) is less than the mean of the variable, then your scaled value will be negative.
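A small sketch of this behaviour, using a made-up single feature. It also shows MinMaxScaler as one possible alternative if non-negative values are needed; whether that suits the custom DataPreparer class in the question is an assumption.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single feature whose mean is 3
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

standardized = StandardScaler().fit_transform(X)
print(standardized.ravel())  # the values below the mean (1 and 2) come out negative

# Alternative if non-negative values are required: rescale to the [0, 1] range
rescaled = MinMaxScaler().fit_transform(X)
print(rescaled.ravel())
```

Negative standardized values are harmless for models such as SVM, and tree-based models like Random Forest are insensitive to this kind of rescaling.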
When creating regression models for this housing dataset, we can plot the residuals as a function of the actual values.
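import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

X = housing[['lotsize']]
y = housing[['price']]
model = LinearRegression()
model.fit(X, y)
# Difference (prediction - actual) plotted against the actual price
plt.scatter(y, model.predict(X) - y)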
We can clearly see that the difference (prediction - actual value) is mainly positive for lower prices and negative for higher prices.
This holds for linear regression, because the model is optimized for RMSE (so the sign of the residual is not taken into account).
But when doing KNN
from sklearn.neighbors import KNeighborsRegressor
model = KNeighborsRegressor(n_neighbors = 3)
we find a similar plot.
In this case, what interpretation can we give, and how can we improve the model?
EDIT: using all the other predictors gives similar results.
housing = housing.replace(to_replace='yes', value=1, regex=True)
housing = housing.replace(to_replace='no', value=0, regex=True)
X = housing[['lotsize', 'bedrooms', 'stories', 'bathrms', 'driveway', 'recroom',
             'fullbase', 'gashw', 'airco', 'garagepl', 'prefarea']]
The following graph is for KNN with 3 neighbors. With 3 neighbors one would expect overfitting, so I can't figure out why there is this trend.
If you look at the fit:
plt.scatter(X,y)
plt.plot(X,model.predict(X), '--k')
You get negative values for higher values of y because there is a cluster of data around x=8000 with high y values that deviate a lot from what you expect.
Now if you do a KNN, bear in mind that your independent variable is only one-dimensional: you are defining neighbours based on lotsize alone and using the mean of each group of neighbours as the predicted value. The high outlier values around x=8000 get grouped with values lower than them, which makes the difference negative.
If you plot this out:
plt.scatter(X,y)
plt.scatter(X,model.predict(X))
How to improve the model? With only one predictor there is not much you can do; you could categorize lotsize, but I doubt it would change much. Most likely you need other variables to see what is causing that bump around lotsize = 8000; then you can model the dependent variable better.
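A sketch that may help interpret the trend, using synthetic data rather than the housing dataset (the slope, noise level, and sample size are made up). It measures how the difference prediction - actual correlates with the actual values and with the predictions. When a fit is imperfect, predictions vary less than the actual values, so the difference tends to be positive for low actual values and negative for high ones. For linear regression, plotting against the predicted values instead avoids that built-in trend.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# Hypothetical data: one noisy predictor, loosely mimicking lotsize -> price
lotsize = rng.uniform(2000, 12000, size=300).reshape(-1, 1)
price = 30000 + 6 * lotsize.ravel() + rng.normal(0, 20000, size=300)

results = {}
for model in (LinearRegression(), KNeighborsRegressor(n_neighbors=3)):
    pred = model.fit(lotsize, price).predict(lotsize)
    diff = pred - price  # same sign convention as in the question
    corr_with_actual = np.corrcoef(price, diff)[0, 1]
    corr_with_pred = np.corrcoef(pred, diff)[0, 1]
    results[type(model).__name__] = (corr_with_actual, corr_with_pred)

for name, (c_actual, c_pred) in results.items():
    print(f"{name}: corr with actual = {c_actual:.2f}, corr with predicted = {c_pred:.2f}")
```

A negative correlation with the actual values for both models would match the plots described in the question, even though the data here has no unusual cluster.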
I am building a churn prediction model with logistic regression in Python. My model's accuracy is 0.47 and it only predicts 0s, while the actual y variable in the test set contains 81 zeros and 92 ones.
The data set I have contains only a few features and 220 users (records). If I set a reference time, it is even smaller (about 123 records for the training set and 173 for the testing set). So I think the sample size is too small for logistic regression. I still tried because this is just a sample test, which is why I only have this small data set (in theory there is more data).
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(x_train, y_train)
y_pred = logreg.predict(x_test)
print('Accuracy: {:.2f}'.format(logreg.score(x_test, y_test)))
Even if I don't hold out a test set, meaning I use the whole data set to build the model, it still returns only 0s when I predict future churn.
Is it that my sample size is too small, or does the model return a single value (0 here) because the accuracy is below 0.5? Or did I do something wrong in the code?
Thanks very much!
There are several potential causes of heavily biased predictions from a logistic regression model. For the benefit of a general audience, I will list the most common ones even though some of them don't apply to your case.
(Skewed output distribution) Your training data has a biased, imbalanced label distribution. If your training set contains, for example, 1 positive and 100000 negatives, the bias/intercept term in the regression will be strongly negative. After applying the sigmoid (inverse link) function, the predictions can be practically zero.
(Sparsity) The feature space is large and your dataset is small, leading to sparse training data. As a result, most new incoming data points have not been seen before. In the worst case, in which all features are factors (categorical), unseen factor values result in zeros because the correct one-hot column cannot be identified.
(Skewed input distribution) The feature space is small and your dataset is dense around a small region. If that region happens to contain more zeros, the predictions will always be zero, even for future inputs. For example, suppose my data X has two columns, gender and age, and in a 101-data-point dataset 100 points are 30-year-old males, 80 of whom like ice cream. The model will predict that 30-year-old males like ice cream, and since future inputs are also mostly 30-year-old males (assuming a similar input distribution), nearly every prediction will be the same.
You should check the distribution of scores using the predict_proba function, and check the distribution of the input features using something like seaborn's pairplot.
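A minimal sketch of the predict_proba check, illustrating the skewed-output case. The data is synthetic (20 positives out of 1000, features that carry no signal), so the numbers are assumptions chosen to make the effect visible, not a reconstruction of the churn data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical imbalanced data: 20 positives out of 1000, features unrelated to the label
X = rng.normal(size=(1000, 2))
y = np.zeros(1000, dtype=int)
y[:20] = 1

logreg = LogisticRegression().fit(X, y)
proba = logreg.predict_proba(X)[:, 1]  # predicted probability of class 1
preds = logreg.predict(X)              # thresholded at 0.5

print("intercept:", logreg.intercept_[0])
print("predicted class counts:", np.bincount(preds, minlength=2))
print("probability quantiles (min, median, max):", np.quantile(proba, [0.0, 0.5, 1.0]))
```

If all probabilities sit on one side of 0.5, predict returns a single class even though the model still produces a spread of scores.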
I have a classification problem where I need to predict a class (0 or 1) given some data. I have a dataset with more than 300 features (including the target value for prediction) and more than 2000 rows (samples). I applied different classifiers as follows:
1. DecisionTreeClassifier()
2. RandomForestClassifier()
3. GradientBoostingClassifier()
4. KNeighborsClassifier()
Almost all the classifiers gave me similar results, with an AUC around 0.50, except Random Forest, which scored around 0.28. I would like to know whether it is correct to invert the Random Forest result like this:
1-0.28= 0.72
And report it as the AUC? Is it correct?
Your intuition is not wrong: if a binary classifier indeed performs worse than random (i.e. AUC < 0.5), a valid strategy is to simply invert its predictions, i.e. report a 0 whenever the classifier predicts a 1 and vice versa; from the relevant Wikipedia entry (emphasis added):
The diagonal divides the ROC space. Points above the diagonal represent good classification results (better than random); points below the line represent bad results (worse than random). Note that the output of a consistently bad predictor could simply be inverted to obtain a good predictor.
Nevertheless, the formally correct way to get the AUC for this inverted classifier is to first invert the individual probabilistic predictions prob of your model:
prob_invert = 1 - prob
and then calculate the AUC using these predictions prob_invert (arguably the process should give similar results to the naive approach you describe of simply subtracting the AUC from 1, but I'm not quite sure of the exact result - see also this Quora answer).
Needless to say, all this is based on the assumption that your whole process is correct, i.e. you don't have any modeling or coding errors (constructing a worse-than-random classifier is not exactly trivial).
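The comparison between inverting the predictions and subtracting the AUC from 1 can be checked numerically. The sketch below uses synthetic scores built to be worse than random; the sample size and the way labels are generated are assumptions for illustration. Since AUC depends only on how the scores rank positives relative to negatives, and 1 - prob reverses that ranking, the two approaches would be expected to agree.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Hypothetical worse-than-random scores: the higher the score, the less likely label 1
prob = rng.random(500)
y_true = (rng.random(500) > prob).astype(int)

auc = roc_auc_score(y_true, prob)

# Invert the individual predictions, then recompute the AUC
prob_invert = 1 - prob
auc_invert = roc_auc_score(y_true, prob_invert)

print(f"AUC: {auc:.3f}")
print(f"AUC after inverting the scores: {auc_invert:.3f}")
print(f"1 - AUC: {1 - auc:.3f}")
```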