I have a problem with XGBoost, which I use at work. My task is to port a piece of code that currently runs in R to Python.
What the code does:
My aim is to use XGBoost to determine the features with the most gain. I made sure the inputs into XGBoost are identical in R and Python. XGBoost is run roughly 100 times (on different data), and each time I extract the 30 best features by gain.
My problem is this:
The inputs in R and Python are identical, yet Python and R output vastly different features (both in terms of the total number of features per round and in which features are chosen). They share only about 50% of the features. My parameters are the same, and I don't use any sampling, so there should be no randomness.
Another thing I noticed is that XGBoost is slower in Python than in R with the same parameters. Is that a known issue?
R parameters
Python parameters
I've looked around but didn't find anyone with a similar problem. I can't share the data or code because it's confidential. Does anyone have an idea why the features differ so much?
R version: 3.4.3
XGBoost R version: 0.6.4.1
Python version: 3.6.5
XGBoost Python version: 0.71
Running on Windows.
You set the internal seed in the R code but not the Python code.
More of an issue is that Python and R may also use different random number generators, so even with internal and external seeds set everywhere you could get different sequences. This thread may help in that respect.
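For what it's worth, here is a minimal sketch of pinning everything down on the Python side; the data and parameter values are placeholders standing in for the confidential setup, not the asker's actual settings.

import numpy as np
import xgboost as xgb

# Placeholder data standing in for the confidential dataset.
X, y = np.random.rand(500, 20), np.random.randint(0, 2, 500)
dtrain = xgb.DMatrix(X, label=y)

# 'seed' mirrors set.seed() on the R side; nthread=1 rules out any
# thread-order nondeterminism.
params = {
    "objective": "binary:logistic",
    "eta": 0.1,
    "max_depth": 6,
    "seed": 42,
    "nthread": 1,
}
bst = xgb.train(params, dtrain, num_boost_round=100)

# Extract the 30 best features by gain, as described in the question.
gain = bst.get_score(importance_type="gain")
top30 = sorted(gain, key=gain.get, reverse=True)[:30]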
I would also hazard a guess that the variables not selected in one model provide information similar to those selected in the other, so swapping variables one way or the other shouldn't impact model performance significantly. That said, I don't know whether the R model and the Python one perform the same.
My issue is that I'm trying to simulate modifications of the GARCH model such as IGARCH, FIGARCH, or HYGARCH.
I have already found that some of them can be simulated in R (the rugarch package, or the no-longer-existing fSeries package) or in Python (the arch library). I will organize my questions into the following points:
1. How can I simulate an IGARCH model in Python?
I tried these two ways:
1) I used GARCH.simulate with fixed parameters where the alphas and betas sum to 1. But there was an error message about non-stationarity, and it took an intercept in order to initialize the model. I'm not sure whether that is OK and the simulated series is still IGARCH (in my opinion it is only shifted by a constant, so it should have no crucial effect).
I have also programmed my own function for GARCH simulation, and it also works for coefficients that sum to 1; a minimal sketch of what I mean is below, after point 2). Hopefully the implementation is good... The only restriction for IGARCH that differentiates it from GARCH is that the sum of the coefficients equals 1, right?
2) I used FIGARCH.simulate with fixed parameters where d=1, which is a special case of FIGARCH models that then become IGARCH(p,q) models. But the error there was *invalid value encountered in sqrt: data[t] = errors[t] * np.sqrt(sigma2[t])*.
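For reference, this is roughly what my hand-rolled simulator looks like (a sketch only: standard normal innovations are assumed and the parameter values are just examples). With alpha + beta = 1 the recursion is the IGARCH(1,1) case; note that the unconditional variance does not exist then, so the starting value is an arbitrary choice:

import numpy as np

def simulate_garch(n, omega, alpha, beta, burn=500, seed=None):
    # GARCH(1,1) recursion: sigma2[t] = omega + alpha*x[t-1]**2 + beta*sigma2[t-1]
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n + burn)
    x = np.empty(n + burn)
    sigma2 = np.empty(n + burn)
    persistence = alpha + beta
    # Start at the unconditional variance when it exists; for IGARCH
    # (persistence == 1) it does not, so fall back to omega.
    sigma2[0] = omega / (1 - persistence) if persistence < 1 else omega
    x[0] = np.sqrt(sigma2[0]) * z[0]
    for t in range(1, n + burn):
        sigma2[t] = omega + alpha * x[t - 1] ** 2 + beta * sigma2[t - 1]
        x[t] = np.sqrt(sigma2[t]) * z[t]
    return x[burn:], sigma2[burn:]

# IGARCH(1,1): the coefficients sum to exactly 1.
returns, variance = simulate_garch(2000, omega=1e-5, alpha=0.1, beta=0.9, seed=1)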
2. Is there any free software in which HYGARCH is implemented? If not, could somebody advise me how to implement that model in R or Python?
The model is already implemented in OxMetrics 8, but that is paid software. I found that an interface to these OxMetrics functions was implemented in the R package fSeries, which, however, no longer exists, and I'm not able to install an older version on my R version 3.5.1. In Python there is no such model implemented.
3. Does the model specification "fGARCH" in the ugarchspec(...) function of the rugarch package in R correspond to the FIGARCH model? If not, is FIGARCH implemented in R at all?
My R code is:
spec <- ugarchspec(
    variance.model = list(model = "fGARCH", garchOrder = c(1, 1), submodel = "GARCH"),
    mean.model = list(armaOrder = c(0, 0), include.mean = TRUE),
    distribution.model = "std",
    fixed.pars = list(mu = 0.001, omega = 0.00001, alpha1 = 0.05, beta1 = 0.90,
                      delta = 0.00001, shape = 4)
)
I found somewhere on the Internet that it is also possible to specify "figarch" in the above function instead of "fGARCH", but in that case I'm not able to simulate the path using ugarchpath(...).
Thank you in advance! I need this for my master's thesis, so I appreciate any recommendations, advice, etc.
Domca
I've been running some large logistic regression models in SAS, which take 4+ hours to converge. Recently, however, I acquired access to a Hadoop cluster and can use Python to fit the same models much faster (more like 10-15 minutes).
Problematically, I have some complete/quasi-complete separation of data points in my data, which results in failure to converge; I was using the FIRTH option in SAS to produce robust parameter estimates despite that, but there seems to be no equivalent option in Python, either in sklearn or statsmodels (I'm mostly using the latter).
Is there another way to get around this problem in Python?
AFAIK, there is no Firth penalization available in Python. Statsmodels has an open issue but nobody is working on it at the moment.
As an alternative, it would be possible to use a different kind of penalization, e.g. the ones available in sklearn or maybe statsmodels.
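For illustration, a tiny sketch of the sklearn route; this is ridge (L2) penalization, not Firth, but it also keeps the coefficient estimates finite under complete separation. The data here is a deliberately separated toy example.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Perfectly separated toy data: unpenalized MLE would diverge here.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

# C is the inverse penalty strength; smaller C means stronger shrinkage.
clf = LogisticRegression(penalty="l2", C=1.0, solver="lbfgs").fit(X, y)
print(clf.intercept_, clf.coef_)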
The other option is to change the observed response variable. Firth can be implemented by augmenting the dataset. However, I don't know of any recipe or prototype for this in Python.
https://github.com/statsmodels/statsmodels/issues/3561
Statsmodels has ongoing work on penalization, but currently the emphasis is on feature/variable selection (elastic net, SCAD) and on quadratic penalization for generalized additive models (GAM), especially for splines.
Firth uses data-dependent penalization, which does not fit the generic penalization framework where the penalization structure is a data-independent "prior".
Conditional likelihood is another way to work around perfect separation. This is in a Statsmodels PR that is basically ready to use:
https://github.com/statsmodels/statsmodels/pull/5304
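A rough sketch of how that API looks, assuming you install from that branch or a release that includes it (the import path and signature below may shift): conditioning on the within-group outcome totals removes the group intercepts from the likelihood, which is what sidesteps separation in those intercepts.

import numpy as np
from statsmodels.discrete.conditional_models import ConditionalLogit

rng = np.random.default_rng(0)
groups = np.repeat(np.arange(50), 4)           # 50 strata, 4 observations each
x = rng.standard_normal((200, 1))
y = (x[:, 0] + rng.standard_normal(200) > 0).astype(int)

# Group-specific intercepts are conditioned out of the likelihood.
res = ConditionalLogit(y, x, groups=groups).fit()
print(res.summary())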
I am trying to find a new method to cluster sequence data. I implemented my method and got an accuracy rate for it. Now I should compare it with available methods to see whether it works as I expected or not.
Could you tell me what the most widely used methods in the bioinformatics domain are, and which Python packages correspond to those methods? I am an engineer and have no idea which methods in this field are the most accurate ones to compare my method against.
Two commonly used methods are:
CD-HIT, http://weizhongli-lab.org/cd-hit/
UCLUST (part of USEARCH; the 32-bit version is free), https://drive5.com/usearch/
Both are command-line tools and written in C++ (I think).
It also depends on the task you need the tool for (data reduction, OTU clustering, building a tree, etc.). These days you see a shift towards clustering tools that use a more dynamic approach instead of a fixed similarity cutoff.
Examples:
DADA2
UNOISE
SeekDeep
Fixed clustering:
CD-HIT
uclust
vsearch
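If it helps to see what running one of these looks like, here is a hypothetical invocation of CD-HIT driven from Python; it assumes the cd-hit binary is on your PATH and that input.fasta exists, and -c 0.9 is just an example similarity cutoff.

import subprocess

# Cluster sequences at 90% identity; writes the representative-sequence
# FASTA to "clusters" and the cluster membership to "clusters.clstr".
subprocess.run(
    ["cd-hit", "-i", "input.fasta", "-o", "clusters", "-c", "0.9"],
    check=True,
)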
I'm using Python's XGBRegressor and R's xgb.train with the same parameters on the same dataset, and I'm getting different predictions.
I know that XGBRegressor uses 'gbtree' and I've made the appropriate comparison in R; however, I'm still getting different results.
Can anyone point me in the right direction on how to diagnose the differences between the two and/or find the R equivalent of Python's XGBRegressor?
Sorry if this is a stupid question, thank you.
Since XGBoost uses decision trees under the hood, it can give you slightly different results between fits if you do not fix the random seed so that the fitting procedure becomes deterministic.
You can do this via set.seed in R and numpy.random.seed in Python.
Noting Gregor's comment, you might want to set the nthread parameter to 1 to achieve full determinism.
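To make the comparison concrete, here is a hedged sketch of lining up the two Python APIs on toy data. XGBRegressor is a thin wrapper around xgb.train, so spelling out the shared parameters explicitly (plus a fixed seed and nthread=1) should make the predictions comparable; the same values can then be mirrored in R's xgb.train. Note that seed and nthread are the wrapper argument names in the 0.x versions; newer releases prefer random_state and n_jobs.

import numpy as np
import xgboost as xgb

# Toy data standing in for the real dataset.
X = np.random.rand(200, 5)
y = np.random.rand(200)

rounds, depth, lr, seed = 100, 3, 0.1, 42

# sklearn-style wrapper.
sk = xgb.XGBRegressor(n_estimators=rounds, max_depth=depth,
                      learning_rate=lr, seed=seed, nthread=1).fit(X, y)

# Native API with the equivalent parameter names (learning_rate -> eta).
dtrain = xgb.DMatrix(X, label=y)
native = xgb.train({"max_depth": depth, "eta": lr, "seed": seed, "nthread": 1},
                   dtrain, num_boost_round=rounds)

print(np.allclose(sk.predict(X), native.predict(xgb.DMatrix(X))))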
I tried to use an SVM classifier to train on a dataset with about 100k samples, but I found it to be extremely slow; even after two hours there was no response. When the dataset has around 1k samples, I get the result immediately. I also tried SGDClassifier and naive Bayes, which are quite fast, and I got results within a couple of minutes. Could you explain this phenomenon?
General remarks about SVM learning
SVM training with nonlinear kernels, which is the default in sklearn's SVC, is approximately O(n_samples^2 * n_features) in complexity (see this question, where the approximation is given by one of sklearn's devs). This applies to the SMO algorithm used within libsvm, which is the core solver in sklearn for this type of problem.
This changes considerably when no kernel is used and one uses sklearn.svm.LinearSVC (based on liblinear) or sklearn.linear_model.SGDClassifier (see the sketch just after the math below).
So we can do some math to approximate the time difference between 1k and 100k samples:
1k samples: 1,000^2 = 1,000,000 steps = Time X
100k samples: 100,000^2 = 10,000,000,000 steps = Time X * 10,000 !!!
This is only an approximation, and it can be even worse or less bad (e.g. by setting cache_size, trading memory for speed gains)!
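As a rough illustration of the linear alternatives mentioned above (toy data; timings will vary by machine), both of these train on 100k samples in seconds, where the kernelized SVC can take hours:

from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=100_000, n_features=20, random_state=0)

LinearSVC().fit(X, y)                  # liblinear: roughly linear in n_samples
SGDClassifier(loss="hinge").fit(X, y)  # SGD on the same hinge loss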
Scikit-learn specific remarks
The situation can also be much more complex because of all the nice stuff scikit-learn does for us behind the scenes. The above is valid for the classic 2-class SVM. If you are by any chance trying to learn from multi-class data, scikit-learn will automatically use one-vs-one or one-vs-rest approaches to do this (as the core SVM algorithm does not support multi-class directly). Read scikit-learn's docs to understand this part.
The same warning applies to producing probabilities: SVMs do not naturally produce probabilities for their final predictions. So to use these (activated by the probability parameter), scikit-learn wraps the fit in an expensive cross-validated calibration procedure (Platt scaling), which also takes a lot of time!
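Purely illustrative, putting the probability switch above and the cache_size knob from the earlier remarks side by side:

from sklearn.svm import SVC

# Larger kernel cache (in MB) can speed up training noticeably.
fast = SVC(kernel="rbf", cache_size=1000, probability=False)

# Platt scaling via internal cross-validation: much slower to fit.
slow = SVC(kernel="rbf", probability=True)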
Scikit-learn documentation
Because sklearn has some of the best docs around, there is often a good section within them explaining exactly this kind of behaviour (link).
If you are using an Intel CPU, then Intel has provided a solution for this.
Intel Extension for Scikit-learn offers you a way to accelerate existing scikit-learn code. The acceleration is achieved through patching: replacing the stock scikit-learn algorithms with their optimized versions provided by the extension.
You should follow these steps:
First, install the intelex package for sklearn:
pip install scikit-learn-intelex
Now just add the following lines at the top of the program:
from sklearnex import patch_sklearn
patch_sklearn()
Now run the program; it will be much faster than before.
You can read more about it at the following link:
https://intel.github.io/scikit-learn-intelex/