I'm running a Cox PH model using the lifelines package in Python.
What I find strange is that the model fits without any problem on the full dataset, but when I run a cross-validation (using the package's own validation function) a convergence error appears.
Any idea how I can solve this? The documentation suggests using a penalizer, but I haven't found a value that lets the cross-validation run.
Here's my code if you're wondering:
from lifelines import CoxPHFitter
from lifelines.utils import k_fold_cross_validation

# Works: fit on the full dataset
cph = CoxPHFitter()
cph.fit(daten, "length_of_arrears2", event_col='cured2')

# Fails: 5-fold cross-validation with a penalizer
cph = CoxPHFitter(penalizer=10)
scores = k_fold_cross_validation(cph, daten, 'length_of_arrears2', event_col='cured2', k=5)
This is the error it outputs:
ConvergenceError: Convergence halted due to matrix inversion problems. Suspicion is high collinearity. Please see the following tips in the lifelines documentation: https://lifelines.readthedocs.io/en/latest/Examples.html#problems-with-convergence-in-the-cox-proportional-hazard-model
Matrix is singular.
I checked the correlation table and some variables are quite correlated, but it still seems odd that the model converges on the full dataset and not during cross-validation.
Is there a good way to get rid of high correlation without removing a variable completely?
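For reference, a sketch of how one can list the covariate pairs driving the collinearity mentioned above (the column names are the ones from my data; the threshold is just an example):

import numpy as np

# List covariate pairs whose absolute correlation exceeds a threshold,
# dropping the duration and event columns so only covariates are compared.
corr = daten.drop(columns=['length_of_arrears2', 'cured2']).corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
print(upper.stack().loc[lambda s: s > 0.74].sort_values(ascending=False))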
Edit:
I ran a few more tests. First I removed all variables with a correlation above 0.74; that did not help with the k-fold approach.
Then I manually split the data 90/10 and the fit worked, so I kept shrinking the training set: 70/30 still worked, but 60/40 already failed.
Any idea?
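A minimal sketch of one way to test this systematically, assuming the same daten DataFrame and that lifelines exposes ConvergenceError in lifelines.exceptions (it does in recent versions): sweep the penalizer across several orders of magnitude and record which values let every fold converge.

from lifelines import CoxPHFitter
from lifelines.exceptions import ConvergenceError
from lifelines.utils import k_fold_cross_validation

# Try penalizer values across several orders of magnitude and report which
# ones allow all five folds to converge.
for pen in [0.01, 0.1, 1.0, 10.0, 100.0]:
    cph = CoxPHFitter(penalizer=pen)
    try:
        scores = k_fold_cross_validation(cph, daten, 'length_of_arrears2', event_col='cured2', k=5)
        print(pen, scores)
    except ConvergenceError:
        print(pen, "did not converge")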
I am performing confirmatory factor analysis in Python using the factor_analyzer module.
I have searched high and low for a way to generate model diagnostics such as the Root Mean Square Error of Approximation (RMSEA), the chi-square, the CFI and the Tucker-Lewis index. I'm not particularly mathematically inclined and relatively new to Python, but I have been able to muddle through for the most part.
I understand that the factor_analyzer module produces a number of objects that would, in theory, allow me to carry out additional calculations, and I have found this document, which provides most of the formulas I need. However, I do not know what to take (or calculate) from the module's output to get the model diagnostics I need.
The CFA code is
model_dict = {"F1": factor_1,
"F2": factor_2}# I have made these lists previously
model_spec = ModelSpecificationParser.parse_model_specification_from_dict(df[influence_scale],
model_dict)
cfa = ConfirmatoryFactorAnalyzer(model_spec, disp=False)
cfa.fit(df[influence_scale].values)
cfa_loadings = pd.DataFrame(cfa.loadings_)
I get no errors and the code works fine, giving me clean loadings on each factor as I expected; however, I'm really stuck on getting the additional statistics I need.
If anyone can help me out I'd really really appreciate it.
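A minimal sketch of one way to approximate these statistics from the fitted object, assuming cfa.loadings_, cfa.factor_varcovs_ and cfa.error_vars_ are available (they are in recent factor_analyzer versions) and that factor variances are fixed to 1; the free-parameter count below assumes each item loads on exactly one factor, so adjust it if your model differs.

import numpy as np

X = df[influence_scale].values          # the data passed to cfa.fit
n, p = X.shape
m = cfa.loadings_.shape[1]              # number of factors

# Model-implied covariance: Sigma = L * Phi * L' + diag(error variances)
L = cfa.loadings_
Phi = cfa.factor_varcovs_
Sigma = L @ Phi @ L.T + np.diag(np.ravel(cfa.error_vars_))
S = np.cov(X, rowvar=False)             # sample covariance

def ml_discrepancy(S, Sigma):
    # Maximum-likelihood fit function for covariance structure models
    _, logdet_Sigma = np.linalg.slogdet(Sigma)
    _, logdet_S = np.linalg.slogdet(S)
    return logdet_Sigma - logdet_S + np.trace(S @ np.linalg.inv(Sigma)) - S.shape[0]

# Degrees of freedom: unique covariance elements minus free parameters.
# Free parameters here: one loading per item + one error variance per item
# + the factor covariances (factor variances fixed to 1).
n_free = p + p + m * (m - 1) // 2
df_model = p * (p + 1) // 2 - n_free

chi2 = (n - 1) * ml_discrepancy(S, Sigma)

# Baseline ("independence") model: only the variances are estimated.
chi2_base = (n - 1) * ml_discrepancy(S, np.diag(np.diag(S)))
df_base = p * (p - 1) // 2

rmsea = np.sqrt(max(chi2 - df_model, 0) / (df_model * (n - 1)))
cfi = 1 - max(chi2 - df_model, 0) / max(chi2_base - df_base, chi2 - df_model, 0)
tli = (chi2_base / df_base - chi2 / df_model) / (chi2_base / df_base - 1)

print(chi2, df_model, rmsea, cfi, tli)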
I found an implementation for BEGAN using CNTK.
(https://github.com/2wins/BEGAN-cntk)
This uses MNIST dataset instead of Celeb A which was used in the original paper.
However, I don't understand the result images, which look quite deterministic:
[Image: output images of the trained generator (iteration 30000)]
For different noise samples I expect different outputs, but the generator produces nearly identical images regardless of the hyper-parameters. Which part of the code causes this problem?
Please explain it.
Use a higher gamma (for example gamma=1 or 1.3; in fact, values above 1). That will certainly improve things, though it will not make the results perfect. Also train for enough iterations, e.g. around 200k.
Please look at the paper carefully. It says the parameter gamma controls diversity.
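For context, a sketch of the balance update from the BEGAN paper showing where gamma enters (plain Python pseudocode with example values, not the exact CNTK code from the repository):

# BEGAN balance update (Berthelot et al., 2017). L_real / L_fake stand for the
# autoencoder (discriminator) losses on a real and a generated batch; gamma is
# the target ratio L_fake / L_real, so a larger gamma asks for more diversity.
gamma = 1.3        # diversity ratio
lambda_k = 0.001   # learning rate of the balance term
k_t = 0.0          # initial balance coefficient

L_real, L_fake = 0.25, 0.10          # example loss values for one step
k_t += lambda_k * (gamma * L_real - L_fake)
k_t = min(max(k_t, 0.0), 1.0)        # keep k_t in [0, 1]

d_loss = L_real - k_t * L_fake       # discriminator objective
g_loss = L_fake                      # generator objective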
[Image: one of the results that I obtained]
I'm also looking for the best parameters and the best results, but haven't found them yet.
Looks like your model might be getting stuck in a particular mode. One idea would be to add an additional condition on the class labels. Conditional GANs have been proposed to overcome such limitations.
http://www.foldl.me/uploads/2015/conditional-gans-face-generation/paper.pdf
This is an idea that would be worth exploring.
I need a way to run a linear regression during a simulation in Python. As new X and y values come in, the model should be refitted and new coefficient estimates produced; however, older values should get a lower weight.
Is there a package that can do this?
Short answer here, perhaps more an idea than a solution.
Have you tried scipy.optimize.curve_fit?
It would do the fitting, but you would still have to implement the down-weighting of the old values yourself, for example by passing larger uncertainties for older points through the sigma parameter (with absolute_sigma controlling how they are interpreted).
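A minimal sketch of that idea with exponentially decaying weights (the decay factor, the linear model and the stand-in data are assumptions for illustration only):

import numpy as np
from scipy.optimize import curve_fit

def linear(x, a, b):
    return a * x + b

# Stand-in for the data accumulated so far in the simulation.
x = np.arange(50, dtype=float)
y = 2.0 * x + 1.0 + np.random.normal(scale=3.0, size=x.size)

# Exponentially decaying weights: the newest point gets weight 1, older points
# progressively less. curve_fit takes per-point uncertainties via sigma, so
# pass the inverse square root of the weights.
decay = 0.9
weights = decay ** np.arange(len(y))[::-1]
sigma = 1.0 / np.sqrt(weights)

(a_hat, b_hat), cov = curve_fit(linear, x, y, sigma=sigma, absolute_sigma=False)
print(a_hat, b_hat)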
I've used sklearn for machine learning modelling over the last couple of years and grew accustomed to what seems like a very logical and cohesive framework:
from sklearn.ensemble import RandomForestClassifier

# define a model
clf = RandomForestClassifier()
# fit the model to data
clf.fit(X, y)
# make predictions on a test set
preds = clf.predict_proba(X_test)[:, 1]
I'm now trying to learn some R, and want to start doing some of the same things I was doing in sklearn. The first thing you notice coming from the sklearn world is the diverse syntax across packages, which is understandable but kind of inconvenient.
caret seems like a nice solution to that problem, creating cohesion across all the different R packages (e.g. randomForest, gbm, ...).
Though I'm still puzzled by some of the default choices (e.g. the train() method seems to default to some sort of grid search). Also, caret seems to use plyr behind the scenes, which masks some of the dplyr functions like summarise. Since I do lots of data manipulation with dplyr, that's kind of a problem.
Can you help me figure out what caret's equivalent of sklearn's model/fit/predict_proba workflow is? Also, is there a way to deal with the plyr/dplyr issue?
The caret equivalent of predicting probabilities is to change the type argument of predict (documented in ?predict.train). The call should look like this:
predict(model, data, type="prob")
If you want to mix dplyr and plyr, the easiest way is to call the function you need explicitly by namespace:
dplyr::summarise
or
plyr::summarise
If you have already tried predict(..., type="prob"), run into a strange error you didn't understand, and given up, I recommend reading this thread: Predicting Probabilities for GBM with caret library.
Where can I check whether the regression fit has converged? After setting .fit(maxiter=7) I would expect it not to converge, but it doesn't trigger any warnings. So I am wondering, in general, how can I check whether a model fit converged or not?
This is the source code: http://statsmodels.sourceforge.net/devel/_modules/statsmodels/genmod/generalized_linear_model.html#GLM.fit
https://github.com/statsmodels/statsmodels/blob/master/statsmodels/genmod/generalized_linear_model.py
Does that mean it never produces a warning message even if the fit doesn't converge?
I raised an issue on GitHub:
https://github.com/statsmodels/statsmodels/issues/1844
If this is indeed how the source code behaves, I will close the question.
All of the maximum likelihood models have a converged flag in the mle_retvals. GLM doesn't have this yet.
The easiest way to check is to do exactly what is done in the source there: import _check_convergence; the convergence criterion and the iteration count are already attached to the results, and you know the tol. If you file an issue on GitHub (there might be one already), it'll get added. Of course, this would be trivial to add, so PRs are welcome.
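A minimal sketch of that check, assuming GLMResults.fit_history records the deviance at each IRLS iteration (the stand-in data and the tol value are just for illustration):

import numpy as np
import statsmodels.api as sm

# Stand-in data; replace with your own design matrix and response.
X = sm.add_constant(np.random.normal(size=(100, 2)))
y = np.random.poisson(lam=2.0, size=100)

res = sm.GLM(y, X, family=sm.families.Poisson()).fit(maxiter=7, tol=1e-8)

# GLM has no converged flag, so redo the check from the source: IRLS stops
# when the change in deviance between iterations falls below tol.
dev = res.fit_history['deviance']
converged = len(dev) > 1 and abs(dev[-1] - dev[-2]) < 1e-8
print(converged, len(dev))

# For likelihood-based models (e.g. sm.Logit), the flag mentioned above is
# available directly as res.mle_retvals['converged'].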