Python's XGBRegressor vs R's XGBoost

I'm using Python's XGBRegressor and R's xgb.train with the same parameters on the same dataset, and I'm getting different predictions.
I know that XGBRegressor uses 'gbtree' and I've made the appropriate comparison in R; however, I'm still getting different results.
Can anyone point me in the right direction on how to reconcile the two and/or find R's equivalent of Python's XGBRegressor?
Sorry if this is a stupid question, thank you.

Since XGBoost uses decision trees under the hood, it can give slightly different results between fits unless you fix the random seed so that the fitting procedure becomes deterministic.
You can do this via set.seed() in R and numpy.random.seed() in Python.
As Gregor notes in the comments, you might also want to set the nthread parameter to 1 to achieve full determinism.
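For concreteness, here is a minimal sketch of the Python side, assuming the xgboost package is installed and X, y are existing NumPy arrays; the parameter values are purely illustrative:
# Pin down the sources of randomness so repeated fits are reproducible.
import numpy as np
from xgboost import XGBRegressor

np.random.seed(42)  # seeds NumPy, e.g. for any train/test splitting you do

model = XGBRegressor(
    n_estimators=100,
    max_depth=3,
    learning_rate=0.1,
    random_state=42,  # seed passed to the booster itself
    n_jobs=1,         # single thread, per the nthread advice above
)
model.fit(X, y)
preds = model.predict(X)
On the R side the counterpart would be set.seed() before the call and nthread = 1 in the parameter list passed to xgb.train.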

Complete separation of logistic regression data

I've been running some large logistic regression models in SAS, which take 4+ hours to converge. Recently however I acquired access to a Hadoop cluster and can use Python to fit the same models much faster (something more like 10-15 minutes).
Problematically, I have some complete/quasi-complete separation in my data, which results in a failure to converge; I was using the FIRTH option in SAS to produce robust parameter estimates despite that, but there seems to be no equivalent option for Python, either in sklearn or statsmodels (I'm mostly using the latter).
Is there another way to get around this problem in Python?
AFAIK, there is no Firth penalization available in Python. Statsmodels has an open issue for it (https://github.com/statsmodels/statsmodels/issues/3561), but nobody is working on it at the moment.
As an alternative, it would be possible to use a different kind of penalization, e.g. the penalized estimators available in sklearn or statsmodels.
The other option is to change the observed response variable: Firth regression can be implemented by augmenting the dataset. However, I don't know of any recipe or prototype for this in Python.
Statsmodels has ongoing work on penalization, but currently the emphasis is on feature/variable selection (elastic net, SCAD) and on quadratic penalization for generalized additive models (GAM), especially for splines.
Firth uses data-dependent penalization, which does not fit the generic penalization framework where the penalization structure is a data-independent "prior".
Conditional likelihood is another way to work around perfect separation. This is in a Statsmodels PR that is basically ready to use:
https://github.com/statsmodels/statsmodels/pull/5304
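To illustrate the penalization route mentioned above, here is a minimal sketch of an L2-penalized (ridge) logistic regression in sklearn; X and y are assumed to be existing arrays, and the penalty strength C is an illustrative value you would tune. This is not a Firth fit, just a pragmatic workaround for separation:
# Ridge-penalized logistic regression: the penalty keeps the coefficients
# finite even under complete/quasi-complete separation.
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(penalty="l2", C=1.0, solver="lbfgs", max_iter=1000)
clf.fit(X, y)
print(clf.intercept_, clf.coef_)
Note that, unlike the Firth approach in SAS, this gives shrunken point estimates without standard errors; for inference you would need statsmodels or a bootstrap.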

R's caret for sklearn user

I've used sklearn for machine learning modelling over the last couple of years and grew accustomed to what seems like a very logical and cohesive framework:
from sklearn.ensemble import RandomForestClassifier
# define a model
clf = RandomForestClassifier()
# fit the model to the data
clf.fit(X, y)
# make predictions on a test set
preds = clf.predict_proba(X_test)[:, 1]
I'm now trying to learn some R, and I want to start doing some of the same things I was doing in sklearn. The first thing you notice coming from the sklearn world is the diverse syntax across packages, which is understandable but kind of inconvenient.
caret seems like a nice solution to that problem, creating cohesion across all the different R packages (i.e. randomForest, gbm,...).
Though I'm still puzzled by some of the default choices (e.g. the train() method seems to default to some sort of grid search). Also, caret seems to be using plyr behind the scenes, which masks some of dplyr's functions, like summarise. Since I do lots of data manipulation with dplyr, that's kind of a problem.
Can you help me figure out what caret's equivalent of sklearn's model/fit/predict_proba is? Also, is there a way to deal with the plyr/dplyr issue?
The caret equivalent of predict_proba is to change the type argument of predict.train (see ?predict.train):
predict(model, data, type="prob")
If you want to mix dplyr and plyr, the easiest way is to call the function you need explicitly:
dplyr::summarise
or
plyr::summarise
If you have already tried predict(..., type="prob"), come up with a weird error you didn't understand, and given up, I would recommend reading this thread: Predicting Probabilities for GBM with caret library

How to use statsmodels to fit data

I have a dataset which I need to fit to a GEV distribution. The data is one-dimensional and is stored in a numpy array. Currently, I am using scipy.stats.genextreme.fit(data), which works OK but gives totally inaccurate results (obvious by plotting the pdf). After some investigation it turns out that my data does not fit well in log space, which scipy uses in its MLE fitting algorithm, so I need to try something like GMM instead, which is only available in statsmodels. The problem is that I can't find anything which looks like scipy's fit function. All the examples I've found seem to deal with far more complicated data than I have. Also, statsmodels requires endog and exog parameters for everything, and I have no idea what these are.
This should be really simple, so I'm sure I'm missing something obvious. Has anyone used statsmodels in this way, and if so, any pointers as to how to do it?
I'm guessing you want a Gaussian Mixture Model (GMM) and not the Generalized Method of Moments (GMM). The former is available in scikit-learn; the latter has code in statsmodels, but it's a work in progress.
EDIT Actually it's not clear to me that you want GMM. Maybe you just want a kernel density estimator (KDE). This is available in statsmodels, with an example in the documentation.
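If the KDE route is what you want, a minimal sketch could look like this, assuming data is your one-dimensional numpy array:
# Kernel density estimate of a 1-D sample with statsmodels.
import statsmodels.api as sm

kde = sm.nonparametric.KDEUnivariate(data)
kde.fit()  # Gaussian kernel and automatic bandwidth by default
# kde.support and kde.density hold the evaluation grid and the estimated
# density, e.g. for plotting against a histogram of the data.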
Hmm, if you do want to use the (Generalized) Method of Moments to fit some kind of probability-weighted GEV, then you need to specify the moment conditions, but I don't have a ready statsmodels example showing how to do that. You might be better off asking on the mailing list.

Method of moments in scipy?

Following from this question, is there a way to use any method other than MLE (maximum-likelihood estimation) for fitting a continuous distribution in scipy? I think that my data may be resulting in the MLE method diverging, so I want to try using the method of moments instead, but I can't find out how to do it in scipy. Specifically, I'm expecting to find something like
scipy.stats.genextreme.fit(data, method=method_of_moments)
Does anyone know if this is possible, and if so how to do it?
A few things to mention:
1) scipy does not have support for GMM. There is some support for GMM in statsmodels (http://statsmodels.sourceforge.net/stable/gmm.html), and you can also access many R routines via Rpy2 (and R is bound to have every flavour of GMM ever invented): http://rpy.sourceforge.net/rpy2/doc-2.1/html/index.html
2) Regarding stability of convergence: if this is the issue, then probably your problem is not with the objective being maximised (e.g. the likelihood, as opposed to a generalised moment) but with the optimiser. Gradient optimisers can be really fussy (or rather, the problems we give them are not really suited for gradient optimisation, leading to poor convergence).
If statsmodels and Rpy do not give you the routine you need, it is perhaps a good idea to write your moment computation out explicitly and see how you can optimise it yourself; perhaps a custom-made little tool would work well for you?
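To give a flavour of that last suggestion, here is a rough hand-rolled moment-matching sketch for scipy's genextreme, assuming data is a one-dimensional numpy array; the choice of moments, bounds, and starting values are assumptions rather than a vetted recipe:
# Match the sample mean, variance and skewness to those of the GEV by
# least squares; not a substitute for a proper (G)MM routine.
import numpy as np
from scipy import stats, optimize

sample_moments = np.array([data.mean(), data.var(), stats.skew(data)])

def moment_gap(params):
    c, loc, scale = params
    mean, var, skew = stats.genextreme.stats(c, loc=loc, scale=scale, moments="mvs")
    return np.array([mean, var, skew]) - sample_moments

x0 = [0.0, data.mean(), data.std()]  # crude start: Gumbel shape, sample loc/scale
res = optimize.least_squares(moment_gap, x0,
                             bounds=([-np.inf, -np.inf, 1e-8], np.inf))
c_hat, loc_hat, scale_hat = res.x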

Multiple Logistic Regression in Python

I have a dataset like this:
0,1,0,1,1,0,0,1,5
1,1,0,1,1,0,0,1,3
1,1,0,0,1,0,0,1,1
0,1,1,0,1,1,0,0,4
I'm looking for a way to run logistic regression in python which uses several discrete values (0 or 1) to predict a numerical value (between 1-5). This seems useful but it assumes the predictor variable is also discrete: http://www.mblondel.org/tlml/logreg.py.html#
Any suggestions?
If getting the job done in R (through one of rpy2, pyRserve, or pyper) is an option, you could use that to get the job done. If you have questions about which statistical method to use, Cross Validated is a better place to ask.
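To illustrate the R route, here is a minimal rpy2 sketch; the file name, the column name y, and the choice of an ordered logit via MASS::polr are all assumptions, and which model is actually appropriate for a 1-5 outcome is exactly the kind of question for Cross Validated:
# Drive R from Python via rpy2; assumes rpy2 plus an R installation with MASS.
import rpy2.robjects as ro

ro.r("""
d <- read.csv("ratings.csv")   # the 0/1 predictors plus a 1-5 response named y
d$y <- ordered(d$y)            # treat the 1-5 outcome as ordinal
fit <- MASS::polr(y ~ ., data = d, Hess = TRUE)
print(summary(fit))
""")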
