The docs for XGBoost imply that the output of a model trained with the Cox PH loss is the exponentiation of each person's predicted multiplier (against the baseline hazard). Is there no way to extract the baseline hazard from this model in order to predict the entire survival curve per person?
survival:cox: Cox regression for right censored survival time data
(negative values are considered right censored). Note that predictions
are returned on the hazard ratio scale (i.e., as HR =
exp(marginal_prediction) in the proportional hazard function h(t) =
h0(t) * HR)
No, I think not. A workaround would be to fit the baseline hazard in another package, e.g. with from sksurv.linear_model import CoxPHSurvivalAnalysis in Python or require(survival) in R. Then you can use the predicted output from XGBoost as multipliers on the fitted baseline. Just remember that if the baseline is on the log scale, then use output_margin=True and add the predictions.
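A minimal sketch of that workaround, assuming a fitted XGBoost booster bst trained with objective='survival:cox', training data X_train with the structured survival target y_structured that scikit-survival expects, and test data X_test (all hypothetical names); it also assumes your sksurv version exposes the cumulative baseline hazard as a step function with x/y attributes:

import numpy as np
import xgboost as xgb
from sksurv.linear_model import CoxPHSurvivalAnalysis

# Fit a plain Cox model on the same data just to recover a baseline hazard.
baseline_model = CoxPHSurvivalAnalysis().fit(X_train, y_structured)

# Per-person log hazard ratios from the XGBoost Cox model (margin scale).
log_hr = bst.predict(xgb.DMatrix(X_test), output_margin=True)

# Combine: H_i(t) = H0(t) * exp(log_hr_i), and S_i(t) = exp(-H_i(t)).
times = baseline_model.cum_baseline_hazard_.x   # event times
H0 = baseline_model.cum_baseline_hazard_.y      # cumulative baseline hazard at those times
survival_curves = np.exp(-np.outer(np.exp(log_hr), H0))  # one row per test person

Note this is only an approximation: the two models are fitted separately, so their linear predictors are not centred identically.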
I hope the authors of XGBoost soon will provide some examples of how to use this function.
Related
I'm struggling to find a learning algorithm that works for my dataset.
I am working with a typical regression problem. There are 6 features in the dataset that I am concerned with, and about 800 data points. The features and the predicted values have a highly non-linear correlation, so the features are not useless (as far as I understand). The predicted values have a bimodal distribution, so I disregarded linear models pretty quickly.
So I have tried 5 different models: random forest, extra trees, AdaBoost, gradient boosting and the xgb regressor. The training data gives an accuracy of about 64% and the test data returns 11%-14%. Both numbers scare me haha. I tried tuning the parameters for the random forest, but it seems like nothing makes a drastic difference.
Function to tune the parameters
from sklearn.model_selection import GridSearchCV

def hyperparatuning(model, train_features, train_labels, param_grid={}):
    grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2)
    grid_search.fit(train_features, train_labels)
    print(grid_search.best_params_)
    return grid_search.best_estimator_
Function to evaluate the model
import numpy as np

def evaluate(model, test_features, test_labels):
    predictions = model.predict(test_features)
    errors = abs(predictions - test_labels)
    mape = 100 * np.mean(errors / test_labels)
    accuracy = 100 - mape
    print('Model Performance')
    print('Average Error: {:0.4f} degrees.'.format(np.mean(errors)))
    print('Accuracy = {:0.2f}%.'.format(accuracy))
I expect the output to be at least, ya know, acceptable, but instead I got 64% on the training data and 12-14% on the test data. It is a real horror to look at these numbers!
There are several issues with your question.
For starters, you are trying to use accuracy in what seems to be a regression problem, which is meaningless.
Although you don't provide the exact models (it would arguably be a good idea), this line in your evaluation function
errors = abs(predictions - test_labels)
is actually the basis of the mean absolute error (MAE; although, as the name implies, you should actually take its mean). MAE, like MAPE, is indeed a performance metric for regression problems; but the formula you use next
accuracy = 100 - mape
does not actually hold, nor is it used in practice.
It is true that, intuitively, one might want to get the 1-MAPE quantity; but this is not a good idea, as MAPE itself has a lot of drawbacks which seriously limit its use; here is a partial list from Wikipedia:
It cannot be used if there are zero values (which sometimes happens for example in demand data) because there would be a division by zero.
For forecasts which are too low the percentage error cannot exceed 100%, but for forecasts which are too high there is no upper limit to the percentage error.
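As a rough illustration of metrics that do make sense here (a sketch with hypothetical NumPy arrays y_test and predictions; mean_absolute_error and r2_score are standard scikit-learn metrics):

import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

mae = mean_absolute_error(y_test, predictions)        # average absolute error, in the target's units
r2 = r2_score(y_test, predictions)                    # 1.0 is perfect, 0.0 is "predict the mean"
rmse = np.sqrt(np.mean((y_test - predictions) ** 2))  # root mean squared error
print(mae, r2, rmse)

Report one or two of these instead of an "accuracy" percentage.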
This is an overfitting problem: you are fitting the hypothesis very well on your training data, but it does not generalize to the test data.
Possible solutions to your problem:
You can try getting more training data (not more features).
Try a less complex model, like a decision tree, since highly complex models (random forests, neural networks, etc.) can fit the training data too closely.
Cross-validation: it allows you to tune hyperparameters using only your original training set, so you can keep your test set as a truly unseen dataset for selecting your final model (see the sketch after this list).
Regularization: the method will depend on the type of learner you're using. For example, you could prune a decision tree, use dropout on a neural network, or add a penalty parameter to the cost function in regression.
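A minimal sketch of the cross-validation point, assuming hypothetical arrays train_features and train_labels; it scores a depth-limited decision tree (a deliberately less complex model) with 5-fold cross-validation:

from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

# Negative MAE is used because scikit-learn scorers follow a "higher is better" convention.
tree = DecisionTreeRegressor(max_depth=4, random_state=0)
scores = cross_val_score(tree, train_features, train_labels,
                         cv=5, scoring='neg_mean_absolute_error')
print('CV MAE: %.3f +/- %.3f' % (-scores.mean(), scores.std()))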
I would suggest you use the Pipeline class, since it allows you to chain preprocessing steps and a model together and tune them all in a single grid search.
An example of that:
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

pca = PCA()
logistic = LogisticRegression(max_iter=1000)
pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])
# Parameters of pipelines can be set using '__' separated parameter names:
param_grid = {
    'pca__n_components': [5, 20, 30, 40, 50, 64],
    'logistic__C': np.logspace(-4, 4, 5),
}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)
I would also suggest preprocessing the data better. Try to manually remove outliers, and look at Cook's distance to find points that have a large negative influence on your model. You could also scale the data differently than standard scaling: use log scaling if values in your data are too big or too small, or use feature transformations such as a DCT or SVD transform.
Or, more simply, you could create your own features from the existing data. For example, if you have yesterday's closing price and today's opening price as two features in a stock price prediction task, you can create a new feature for the percentage difference between them, which could help your accuracy a lot.
Do some linear regression analysis to look at the beta values and get a better understanding of which features contribute most to the target value. You can use feature_importances_ in random forests for the same purpose, and then try to improve those features as much as possible so the model can learn from them better.
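For example, a quick look at the random forest importances (a sketch, assuming a fitted RandomForestRegressor rf and a list feature_names for the 6 features):

import numpy as np

# Rank the features by how much each one reduces impurity across the forest.
order = np.argsort(rf.feature_importances_)[::-1]
for i in order:
    print(feature_names[i], rf.feature_importances_[i])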
This is just the tip of the iceberg of what could be done. I hope this helps.
Currently, you are overfitting so what you are looking for is regularization. For example, to reduce the capacity of models that are ensembles of trees, you can limit the maximum depth of the trees (max_depth), increase the minimum required samples at a node to split (min_samples_split), reduce the number of learners (n_estimators), etc.
When performing cross-validation, you should fit on the training set and evaluate on your validation set and the best configuration should be the one that performs the best on the validation set. You should also keep a test set in order to evaluate your model on completely new observations.
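A sketch of that kind of regularization search, assuming hypothetical arrays X and y; the parameter names are the standard scikit-learn RandomForestRegressor ones mentioned above:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Keep a held-out test set for a final check on completely new observations.
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

param_grid = {
    'max_depth': [3, 5, 10],
    'min_samples_split': [2, 10, 20],
    'n_estimators': [50, 100, 200],
}
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid,
                      cv=3, scoring='neg_mean_absolute_error')
search.fit(X_trainval, y_trainval)
print(search.best_params_)
print('test MAE:', -search.score(X_test, y_test))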
I am a beginner with machine learning. I want to use time series linear regression to extract confidence intervals from my dataset. I don't need to use the linear regression as a classifier. Firstly, what is the difference between the two cases? Secondly, in Python, is there a different way to implement them?
The main difference is that a classifier computes a probability for a label, while a regression computes a quantitative output.
Generally, a classifier is used to compute the probability of a label, and a regression is used to compute a quantity. For instance, if you want to compute the price of a flat given some criteria, you will use a regression; if you want to compute a label (luxurious, modest, ...) for the same flat given some criteria, you will use a classifier.
That said, using a regression to compute a threshold that separates the observed labels is a common technique too. That is the case for the linear SVM, which computes a boundary between labels, called the decision boundary. Warning: the main drawback of a linear model is that it is linear, which means the boundary will necessarily be a straight line separating the labels. Sometimes that is good enough, sometimes it is not.
Logistic regression is an exception because it actually computes a probability; its name is misleading.
For regression, when you want to compute a quantitative output, you can use a confidence interval to get an idea of the error. In classification there is no confidence interval; even with a linear SVM it is nonsensical. You can use the decision function, but it is difficult to interpret in practice, or you can use the predicted probabilities, check how often the label is wrong, and compute an error rate. There is a plethora of such ratios available depending on your problem; it is bluntly the subject of whole books.
Anyway, if you are working with a time series, as far as I know your goal is to obtain a quantitative output, so you do not need a classifier, as you said. As for extracting the confidence interval, it depends entirely on the object you used to compute the regression in Python, meaning it depends on the available attributes of that object, and therefore on the library. So it would be much easier to answer if you indicated which libraries and objects you are using.
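For example, if you were using statsmodels (an assumption, since the question does not say which library), a minimal sketch of a linear trend fit with confidence intervals would look like:

import numpy as np
import statsmodels.api as sm

# Hypothetical toy series: a linear trend plus noise.
t = np.arange(100)
y = 2.0 * t + np.random.normal(scale=5.0, size=100)

X = sm.add_constant(t)            # design matrix with an intercept column
results = sm.OLS(y, X).fit()

print(results.conf_int())         # confidence intervals for the coefficients
pred = results.get_prediction(sm.add_constant(np.arange(100, 110)))
print(pred.conf_int())            # confidence intervals for the predictions at new time points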
I am trying to understand this from a non-Bayesian background.
In linear regression or black-box machine learning tools, the workflow is something like the following.
1. Get data
2. Prepare data
3. Model data (learn from it [or part of it, the training set])
4. Test model (usually on the test set)
5. If model is good according to some metric, go to 6; else investigate and revise work.
6. Model is good enough; use it to predict/classify, etc.
So let's say I use pymc3 to try to understand the relationship between advertising expenditure and revenue from goods sold. If all stages from 1 to 5 go well, then in frequentist statistics, as used in R and in machine learning packages such as scikit-learn, I only need to pass new unseen data to the learned model and invoke the predict method. This will usually print out a predicted value of Y (revenue from goods sold), given some unseen value(s) of X (advertising expenditure), with some confidence interval or other margin of error still being taken into account.
How would one go about doing that in pymc3? If I end up with many slopes and many betas, then which should I use for predicting? And wouldn't taking the mean of all slopes and all betas be like throwing away a lot of otherwise useful learned knowledge?
I find it difficult to understand how sampling from the posterior can help with this. One can imagine bosses who need to be told an expected revenue from goods sold (Y) figure given some advertising expenditure amount (X), with some confidence and error margins. Aside from plotting, I don't know how sampling from the posterior can be incorporated into a management report and made useful for cash flow planning by interested parties.
I know some of us are spoiled coming from R and maybe scikit-learn, but wouldn't it be nice if there was a predict method that dealt with this matter in a more uniform and standardized way?
Thanks
One way of taking into account the uncertainty in parameters when making predictions with a model is to use the posterior predictive distribution. This distribution tells you the probability of a new observation, conditioned on the data that you used to constrain the model parameters. If the revenue is Y, the advertising expenditure is X, the model parameters are theta and the data used to constrain the model are X', then you can write
p(Y | X, X') = integral of p(Y | X, theta) p(theta | X') dtheta.
The left hand side is the probability of attaining revenue Y given an expenditure X, and the data used to constrain the model X'. This is the posterior predictive distribution of your model, and should be used when making predictions. p(Y | X, theta) is the probability of revenue Y given some set of model parameters theta and the expenditure X. p(theta | X') is the posterior distribution on the model parameters given the data that you used to constrain the model.
When using software like pymc3, you obtain samples from p(theta | X'). You can use these to do the integral above in a Monte Carlo fashion. If you have N samples from the posterior in your MCMC chain, then you can compute the sum
p(Y | X, X') ~= (1/N) * sum over n = 1..N of p(Y | X, theta_n);
in other words, you compute p(Y | X, theta_n) for every set of parameters in your MCMC chain and then take the average (note that this isn't the same as "taking the mean of all slopes and all betas" as you mentioned in your question, because you are averaging a pdf rather than the parameters themselves). In practice this should be easy to code: you just need to implement the function p(Y | X, theta), plug in your posterior parameter samples, and take the mean at the end. This gives you the fairest representation of your model prediction given your MCMC sampling.
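As a concrete sketch in pymc3 (hypothetical arrays x_train, y_train and x_new; it assumes a pymc3 version that provides pm.Data, pm.set_data and pm.sample_posterior_predictive):

import numpy as np
import pymc3 as pm

with pm.Model() as model:
    spend = pm.Data('spend', x_train)              # advertising expenditure
    revenue_obs = pm.Data('revenue_obs', y_train)  # observed revenue
    alpha = pm.Normal('alpha', 0, 10)
    beta = pm.Normal('beta', 0, 10)
    sigma = pm.HalfNormal('sigma', 10)
    revenue = pm.Normal('revenue', mu=alpha + beta * spend, sigma=sigma,
                        observed=revenue_obs)
    trace = pm.sample()

# Swap in unseen expenditure values and draw from the posterior predictive.
with model:
    pm.set_data({'spend': x_new, 'revenue_obs': np.zeros_like(x_new)})  # zeros are only a shape placeholder
    ppc = pm.sample_posterior_predictive(trace)

point_prediction = ppc['revenue'].mean(axis=0)                 # one figure per new X, for the report
interval = np.percentile(ppc['revenue'], [2.5, 97.5], axis=0)  # a 95% predictive interval around it

The mean and the interval summarize the whole posterior predictive, so nothing learned about the parameters is thrown away.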
I have texts that are rated on a continuous scale from -100 to +100. I am trying to classify them as positive or negative.
How can you perform binomial log regression to get the probability that test data is -100 or +100?
The closest I have got is SGDClassifier(penalty='l2', alpha=1e-05, n_iter=10), but this doesn't provide the same results as SPSS when I use binomial log regression to predict the probability of -100 and +100. So I'm guessing this is not the right function?
SGDClassifier provides access to several linear classifiers, all trained with stochastic gradient descent. It defaults to a linear support vector machine unless you call it with a different loss function; loss='log' gives a probabilistic logistic regression.
See the documentation at:
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier
Alternatively, you could use sklearn.linear_model.LogisticRegression to classify your texts with a logistic regression.
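A brief sketch of both options (hypothetical X, y arrays; note that n_iter was renamed max_iter in newer scikit-learn versions, which is what is assumed here, and the newest releases spell the loss 'log_loss'):

from sklearn.linear_model import SGDClassifier, LogisticRegression

# Logistic regression trained with SGD: loss='log' is what makes it probabilistic.
sgd_logreg = SGDClassifier(loss='log', penalty='l2', alpha=1e-05, max_iter=10)
sgd_logreg.fit(X, y)
print(sgd_logreg.predict_proba(X[:5]))   # class probabilities

# Plain logistic regression, no SGD.
logreg = LogisticRegression().fit(X, y)
print(logreg.predict_proba(X[:5]))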
It's not clear to me that you will get exactly the same results as you do with SPSS due to differences in implementation. However, I would not expect to see statistically significant differences.
Edited to add:
My suspicion is that the 99% accuracy you're getting with the SPSS logistic regression is training set accuracy, while the 87% that you're seeing with the scikit-learn logistic regression is test set accuracy. I found this question on the Data Science Stack Exchange where a different person is attempting an extremely similar problem and getting ~99% accuracy on the training set and 90% test set accuracy.
https://datascience.stackexchange.com/questions/987/text-categorization-combining-different-kind-of-features
My recommended path forward is as follows: try several different basic classifiers in scikit-learn, including the standard logistic regression and a linear SVM, and also rerun the SPSS logistic regression several times with different train/test subsets of your data and compare the results. If you continue to see a large divergence across classifiers that can't be accounted for by ensuring similar train/test data splits, then post the results that you're seeing into your question, and we can move forward from there.
Good luck!
If pos/neg, or the probability of pos, is really the only thing you need as output, then you can derive binary labels y as
y = score > 0
assuming you have the scores in a NumPy array score.
You can then feed this to a LogisticRegression instance, using the continuous score to derive relative weights for the samples:
import numpy as np
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
sample_weight = np.abs(score)
sample_weight /= sample_weight.sum()
clf.fit(X, y, sample_weight=sample_weight)
This gives maximum weight to tweets with scores ±100, and a weight of zero to tweets that are labeled neutral, varying linearly between the two.
If the dataset is very large, then as #brentlance showed, you can use SGDClassifier, but you have to give it loss="log" if you want a logistic regression model; otherwise, you'll get a linear SVM.
I'm using regression SVMs in python and I am wondering if there is any way to get a "confidence-measure" value for its predictions.
Previously, when using SVMs for binary classification, I was able to compute a confidence-type value from the 'margin'. Here is some pseudo-code showing how I got a confidence value:
# Begin pseudo-code
import svm as svmlib
prob = svmlib.svm_problem(labels, data)
param = svmlib.svm_parameter(svm_type=svmlib.C_SVC, kernel_type = svmlib.RBF)
model = svmlib.svm_model(prob, param)
# get confidence
confidence = model.predict_values_raw(sample_to_classify)
I imagine that the further the new sample is from the training data, the worse the confidence, but I'm looking for a function that might help compute a reasonable estimate for this.
My (high-level) problem is as follows:
I have a function F(x), where x is a high-dimensional vector
F(x) can be computed but it is very slow
I want to train a regression SVM to approximate it
If I can find values of 'x' that have low prediction confidence, I can add these points and retrain (aka. active learning)
Has anyone obtained/used regression-SVM confidence/margin values before?
Have a look at this similar response on Stack Overflow from back in January. The chosen answer was spot on about how hard it is to get confidence measures out of non-parametric fitting methods. There is probably some Bayesian-type thing you could do, but that is probably not possible with the Python SVM library: Prefer one class in libsvm (python).
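One rough, non-Bayesian workaround (not from the linked answer, just a common heuristic) is to train an ensemble of SVRs on bootstrap resamples and use the spread of their predictions as a confidence proxy; points where the spread is large would then be candidates for evaluating the slow F(x) and retraining. A sketch with hypothetical arrays X, y and X_query:

import numpy as np
from sklearn.svm import SVR
from sklearn.ensemble import BaggingRegressor

# 20 SVRs, each fitted on a bootstrap resample of the training data.
ensemble = BaggingRegressor(SVR(kernel='rbf', C=1.0), n_estimators=20, random_state=0)
ensemble.fit(X, y)

# Per-point disagreement between the ensemble members: large std = low confidence.
all_preds = np.stack([est.predict(X_query) for est in ensemble.estimators_])
prediction = all_preds.mean(axis=0)
uncertainty = all_preds.std(axis=0)
low_confidence_idx = np.argsort(uncertainty)[::-1][:10]   # the 10 least confident query points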