Complete separation of logistic regression data - python

I've been running some large logistic regression models in SAS, which take 4+ hours to converge. Recently however I acquired access to a Hadoop cluster and can use Python to fit the same models much faster (something more like 10-15 minutes).
Problematically, I have some complete/quasi-complete separation of data points in my data which results in failure to converge; I was using the FIRTH command in SAS to produce robust parameter estimates despite that, but there seems to be no equivalent option for Python, either in sklearn or statsmodels (I'm mostly using the latter).
Is there another way to get around this problem in Python?

AFAIK, there is no Firth penalization available in Python. Statsmodels has an open issue but nobody is working on it at the moment.
As an alternative, it would be possible to use a different kind of penalization, e.g. as available in sklearn or maybe statsmodels.
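A minimal sketch of the "different penalization" route, assuming a toy, perfectly separated dataset (the data below are made up for illustration). sklearn's LogisticRegression applies an L2 penalty by default, which keeps the coefficients finite. This is ridge-type shrinkage, not Firth's method, so the estimates will differ from what SAS FIRTH reports.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy, perfectly separated data (illustrative only):
# every x < 0 is class 0 and every x > 0 is class 1.
X = np.array([[-3.0], [-2.0], [-1.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# The default penalty is L2. C is the inverse penalty strength:
# a smaller C means stronger shrinkage towards zero.
model = LogisticRegression(C=1.0)
model.fit(X, y)

coef = model.coef_.ravel()
intercept = model.intercept_
print("slope:", coef, "intercept:", intercept)
```

The value of C matters a lot here: as C grows the penalty vanishes and the slope drifts towards infinity again, so it should be chosen deliberately (e.g. by cross-validation) rather than left at an arbitrary value.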
The other option is to change the observed response variable. Firth can be implemented by augmenting the dataset. However, I don't know of any recipe or prototype for this in Python.
https://github.com/statsmodels/statsmodels/issues/3561
Statsmodels has ongoing work on penalization, but currently the emphasis is on feature/variable selection (elastic net, SCAD) and quadratic penalization for generalized additive models (GAM), especially for splines.
Firth uses data-dependent penalization, which does not fit the generic penalization framework, where the penalization structure is a data-independent "prior".
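For what it's worth, here is a rough numpy sketch of Firth-type logistic regression. It uses the modified-score formulation (Fisher scoring on the score adjusted by the hat-matrix diagonal) rather than the data-augmentation route mentioned above. The function name firth_logit and the toy data are hypothetical, this is not a statsmodels or sklearn API, and it has no step-halving or standard errors, so treat it as a prototype only.

```python
import numpy as np


def firth_logit(X, y, max_iter=100, tol=1e-8):
    """Hypothetical helper: Firth-penalized logistic regression.

    X must already contain an intercept column if one is wanted.
    Returns the coefficient vector.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))   # fitted probabilities
        w = p * (1.0 - p)                        # IRLS weights
        info_inv = np.linalg.inv(X.T @ (X * w[:, None]))
        # Diagonal of the hat matrix W^(1/2) X (X'WX)^(-1) X' W^(1/2)
        h = w * np.einsum("ij,jk,ik->i", X, info_inv, X)
        # Firth's modified score: X' (y - p + h * (1/2 - p))
        score = X.T @ (y - p + h * (0.5 - p))
        step = info_inv @ score
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta


# Toy, perfectly separated data (illustrative only).
x = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])
y = np.array([0, 0, 0, 1, 1, 1])
X = np.column_stack([np.ones_like(x), x])

beta = firth_logit(X, y)
print("intercept, slope:", beta)
```

On data like this, ordinary maximum likelihood has no finite solution, while the modified score has a finite root. For large models the explicit matrix inverse would need to be replaced with something more careful.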

Conditional likelihood is another way to work around perfect separation. This is in a Statsmodels PR that is basically ready to use:
https://github.com/statsmodels/statsmodels/pull/5304

Related

How does Mutual Information from the Sklearn mutual_info_regression work?

Can someone explain the code and the logic behind sklearn's mutual_info_regression (and, if possible, tell me where I can access the code)? For classification, I think it works based on KNN, but for regression problems I am having trouble understanding how it would work.
mutual_info_regression is used for feature selection. It works by estimating the mutual information (https://en.wikipedia.org/wiki/Mutual_information) between each feature (a column of X) and the target (y). Mutual information is a way of measuring one variable's dependence on another by measuring statistical dependency (https://en.wikipedia.org/wiki/Independence_(probability_theory)). It is equal to zero if the two variables are completely independent of each other, and it gets higher as they become more dependent. This can be useful for feature selection. As for the source code, sklearn is an open-source library and all of its code can be found at https://github.com/scikit-learn/scikit-learn
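A small sketch of how this looks in practice, assuming synthetic data made up for illustration: one feature drives the target through a nonlinear function and the other is pure noise. According to the sklearn documentation, the estimate for continuous targets is also based on nearest-neighbor distances (entropy estimation from k-nearest neighbors), so no linear relationship is assumed.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n = 1000

# Synthetic data (illustrative only): y depends on the first feature
# through a nonlinear function; the second feature is unrelated noise.
x_informative = rng.uniform(-3, 3, size=n)
x_noise = rng.uniform(-3, 3, size=n)
X = np.column_stack([x_informative, x_noise])
y = np.sin(x_informative) + 0.1 * rng.normal(size=n)

# One non-negative mutual information estimate per column of X.
mi = mutual_info_regression(X, y, n_neighbors=3, random_state=0)
print(mi)
```

The informative feature should get a clearly larger score than the noise feature, even though its linear correlation with y is far from perfect.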
Hope this helped!

Algorithms to model non-linear relationship between two vectors

I want to build a model that describes a curve that fits the data shown in the scatterplot. I thought it would be straightforward using sklearn, but the choice and application of the different methods gets rather confusing.
Which algorithms would you use to tackle this problem?
This is really a question for CrossValidated rather than a Python question.
Your data seems to strongly indicate a simple underlying model which is linear until the very end, when it perhaps becomes polynomial.
As a first step, if possible, I would investigate this phenomenon. It's unusual. Perhaps there's something wrong with the data source. But maybe not. For example, a physical phenomenon with two distinct phases might produce data like these.
As to models, I would suggest natural cubic splines for this data. They are simple and involve cutting the data up into windows which you fit with cubic polynomials (a special case of which is a line).
You might also consider smoothing splines, and local regression.
For information on these, see the free online textbook, An Introduction to Statistical Learning.

How to store R model coefficient for use in python-based production system?

I have a situation where I need to train a model in R and then use the model coefficients obtained from that (betas) to perform a regression classification on semi-live data. This production system is implemented in pure python (data processing) and django (web interface).
The model coefficients will be calculated manually every week; right now this produces a csv that is read by the python code. I just wanted to know if there are any better ways of doing this?
This is mostly a question on what are the established best practices for cases like this, even though the current approach works.

Why is scikit-learn SVM.SVC() extremely slow?

I tried to use an SVM classifier to train on a dataset with about 100k samples, but I found it to be extremely slow; even after two hours there was no response. When the dataset has around 1k samples, I get the result immediately. I also tried SGDClassifier and naïve Bayes, which are quite fast; I got results within a couple of minutes. Could you explain this phenomenon?
General remarks about SVM-learning
SVM training with nonlinear kernels, which is the default in sklearn's SVC, has a complexity of approximately O(n_samples^2 * n_features), according to an approximation given by one of sklearn's devs. This applies to the SMO algorithm used within libsvm, which is the core solver in sklearn for this type of problem.
This changes considerably when no kernels are used and one uses sklearn.svm.LinearSVC (based on liblinear) or sklearn.linear_model.SGDClassifier.
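A sketch of the linear alternatives, assuming a synthetic dataset from make_classification (chosen here so that a linear decision boundary is adequate, which may not hold for real data). Both estimators are wrapped in a pipeline with StandardScaler, since they are sensitive to feature scale.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic, roughly linearly separable data (illustrative only).
X, y = make_classification(
    n_samples=5000,
    n_features=20,
    n_informative=5,
    n_clusters_per_class=1,
    class_sep=2.0,
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Linear SVM solved by liblinear.
linear_svc = make_pipeline(StandardScaler(), LinearSVC(max_iter=5000))
linear_svc.fit(X_train, y_train)

# Linear SVM (hinge loss) trained by stochastic gradient descent.
sgd = make_pipeline(StandardScaler(), SGDClassifier(loss="hinge", random_state=0))
sgd.fit(X_train, y_train)

acc_svc = linear_svc.score(X_test, y_test)
acc_sgd = sgd.score(X_test, y_test)
print("LinearSVC:", acc_svc, "SGDClassifier:", acc_sgd)
```

If a linear boundary turns out to be too restrictive, these models will train quickly but score poorly, which is the trade-off against the kernelized SVC.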
So we can do some math to approximate the time-difference between 1k and 100k samples:
1k samples: 1,000^2 = 1,000,000 steps = Time X
100k samples: 100,000^2 = 10,000,000,000 steps = Time X * 10,000 !!!
This is only an approximation, and the real factor can be worse or better (e.g. depending on the cache size setting, which trades memory for speed)!
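The same back-of-the-envelope estimate as a couple of lines of Python. The helper name quadratic_slowdown is made up for illustration, and it only encodes the rough O(n_samples^2) assumption from above with the number of features held fixed.

```python
def quadratic_slowdown(n_small, n_large):
    """Expected slowdown factor if training time grows with n_samples^2."""
    return (n_large / n_small) ** 2


factor = quadratic_slowdown(1_000, 100_000)
print(factor)  # 10000.0

# If 1k samples took about one second, 100k samples would take
# roughly this many hours under the same assumption.
hours = 1.0 * factor / 3600
print(round(hours, 2))  # 2.78
```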
Scikit-learn specific remarks
The situation could also be much more complex because of all the nice things scikit-learn does for us behind the scenes. The above is valid for the classic 2-class SVM. If you are by any chance trying to learn some multi-class data, scikit-learn will automatically use a one-vs-one (SVC) or one-vs-rest (LinearSVC) approach to do this (as the core SVM algorithm does not support multiple classes). Read scikit-learn's docs to understand this part.
The same warning applies to generating probabilities: SVMs do not naturally produce probabilities for final predictions. So to provide these (activated by the probability parameter), scikit-learn uses Platt scaling, fitted with an expensive internal cross-validation procedure, which will take a lot of time too!
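To make the probability point concrete, here is a small sketch on made-up data. If only a ranking or a confidence score is needed, decision_function is available without the extra cost; predict_proba requires probability=True at construction time, which triggers the additional internal fitting.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Small synthetic 2-class problem (illustrative only).
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Cheap option: signed distances to the decision boundary.
clf_scores = SVC(kernel="rbf").fit(X, y)
scores = clf_scores.decision_function(X)

# Expensive option: calibrated probabilities via Platt scaling.
clf_proba = SVC(kernel="rbf", probability=True, random_state=0).fit(X, y)
proba = clf_proba.predict_proba(X)

print(scores.shape, proba.shape)
```

On 300 samples the difference in fit time is negligible, but on a dataset with 100k samples the extra fits add up quickly.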
Scikit-learn documentation
Because sklearn has one of the best docs, there is often a good part within these docs to explain something like that (link):
If you are using an Intel CPU, then Intel has provided a solution for this.
Intel Extension for Scikit-learn offers you a way to accelerate existing scikit-learn code. The acceleration is achieved through patching: replacing the stock scikit-learn algorithms with their optimized versions provided by the extension.
Follow these steps:
First, install the intelex package for sklearn:
pip install scikit-learn-intelex
Then add the following lines at the top of the program, before importing anything from sklearn:
from sklearnex import patch_sklearn
patch_sklearn()
Now run the program; it should be much faster than before.
You can read more about it from the following link:
https://intel.github.io/scikit-learn-intelex/

How to use statsmodels to fit data

I have a dataset which I need to fit to a GEV distribution. The data is one-dimensional and is stored in a numpy array. Currently, I am using scipy.stats.genextreme.fit(data), which runs without errors but gives totally inaccurate results (obvious from plotting the pdf). After some investigation it turns out that my data does not fit well in log space, which scipy uses in its MLE fitting algorithm, so I need to try something like GMM instead, which is only available in statsmodels. The problem is that I can't find anything which looks like scipy's fit function. All the examples I've found seem to deal with far more complicated data than I have. Also, statsmodels requires endog and exog parameters for everything, and I have no idea what these are.
This should be really simple, so I'm sure I'm missing something obvious. Has anyone used statsmodels in this way, and if so, any pointers as to how to do it?
I'm guessing you want a Gaussian Mixture Model (GMM) and not the Generalized Method of Moments (GMM). The former GMM is available in scikit-learn. The latter has code in statsmodels, but it's a work in progress.
EDIT Actually it's not clear to me that you want GMM. Maybe you just want a kernel density estimator (KDE). This is available in statsmodels, and its documentation includes an example.
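In case the KDE route is what you want, a minimal sketch with statsmodels follows. The data here are simulated Gumbel draws standing in for your one-dimensional array, and the default Gaussian kernel and bandwidth rule are used, which may or may not suit heavy-tailed extreme-value data. Note that for a univariate density like this there is no exog at all; the data array is the only input.

```python
import numpy as np
from statsmodels.nonparametric.kde import KDEUnivariate

rng = np.random.default_rng(0)

# Stand-in for the real 1-D data array (illustrative only).
data = rng.gumbel(loc=0.0, scale=1.0, size=2000)

kde = KDEUnivariate(data)
kde.fit()  # Gaussian kernel with the default bandwidth rule

grid = kde.support      # points at which the density was evaluated
density = kde.density   # estimated density at those points

# Sanity check: trapezoid-rule area under the estimated density.
area = float(np.sum(0.5 * (density[1:] + density[:-1]) * np.diff(grid)))
print("area under the KDE:", area)
```

Plotting density against grid on top of a histogram of the data is a quick way to compare this with the pdf returned by the scipy fit.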
Hmm, if you do want to use (Generalized) Method of Moments to fit some kind of probability weighted GEV, then you need to specify the moment conditions, but I don't have a ready example for (G)MM in statsmodels for how you specify the moment conditions. You might be better off asking on the mailing list.
