Multiple Logistic Regression in Python - python

I have a data set as such:
0,1,0,1,1,0,0,1,5
1,1,0,1,1,0,0,1,3
1,1,0,0,1,0,0,1,1
0,1,1,0,1,1,0,0,4
I'm looking for a way to run logistic regression in Python that uses several discrete values (0 or 1) to predict a numerical value (between 1 and 5). This seems useful, but it assumes the predictor variable is also discrete: http://www.mblondel.org/tlml/logreg.py.html#
Any suggestions?

If getting the job done in R (through one of rpy2, pyRserve, or pyper) is an option, you could use this. If your question is about which statistical method to use, Cross Validated is a better place to ask.

Related

<lifelines> Solving Cox Proportional Hazard after creating interaction variable with time

I am using lifelines package to do Cox Regression. After trying to fit the model, I checked the CPH assumptions for any possible violations and it returned some problematic variables, along with the suggested solutions.
One of the solutions that I would like to try is the one suggested here:
https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html#Introduce-time-varying-covariates
However, the example there uses CoxTimeVaryingFitter, which, unlike CoxPHFitter, does not provide a concordance score that would help me gauge model performance. Additionally, CoxTimeVaryingFitter does not have a check_assumptions feature. Does this mean that once the data is put into episodic format, all the assumptions are automatically satisfied?
Alternatively, after reading a SAS textbook on survival analysis, it seemed like their solution is to create the interaction term directly (multiplying the problematic variable with the survival time) without changing the format to episodic format (as shown in the link). This way, I was hoping to just keep using CoxPHFitter due to its model scoring capability.
However, after doing this alternative, when I call check_assumptions again on the model with the time-interaction variable, the CPH assumption on the time-interaction variable is violated.
Now I am torn between:
Using CoxTimeVaryingFitter without knowing what the model performance is (seems like a bad idea)
Using CoxPHFitter, but the assumption is violated on the time-interaction variable (which inherently does not seem to fix the problem)
Any help in resolving this confusion is greatly appreciated.
Here is one suggestion:
If you choose the CoxTimeVaryingFitter, then you need to somehow evaluate the quality of your model. Here is one way. Use the regression coefficients B and write down your model. I'll write it as S(t;x;B), where S is an estimator of the survival, t is the time, and x is a vector of covariates (age, wage, education, etc.). Now, for every individual i, you have a vector of covariates x_i. Thus, you have the survival function for each individual. Consequently, you can predict which individual will 'fail' first, which 'second', and so on. This produces a (predicted) ranking of survival. However, you know the real ranking of survival since you know the failure times or times-to-event. Now, quantify how many pairs (predicted survival, true survival) share the same ranking. In essence, you would be estimating the concordance.
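The pair-counting idea described above can be sketched in plain Python. This is only an illustration under stated assumptions: the function name concordance and the toy numbers are hypothetical, and predicted_risk stands for whatever single risk score per individual you derive from the fitted model (higher meaning earlier expected failure). With time-varying covariates, reducing each individual to one score is itself a simplification.

```python
from itertools import combinations


def concordance(event_times, predicted_risk, event_observed):
    """Fraction of comparable pairs whose predicted order matches the observed order."""
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(event_times)), 2):
        # Order the pair so that `a` is the subject with the shorter observed time.
        a, b = (i, j) if event_times[i] < event_times[j] else (j, i)
        # Skip tied times, and pairs where the earlier subject was censored
        # (we then cannot tell who actually failed first).
        if event_times[a] == event_times[b] or not event_observed[a]:
            continue
        comparable += 1
        if predicted_risk[a] > predicted_risk[b]:
            concordant += 1.0  # earlier failure had the higher predicted risk
        elif predicted_risk[a] == predicted_risk[b]:
            concordant += 0.5  # tied predictions count as half
    return concordant / comparable if comparable else float("nan")


# Toy data (made up for illustration only).
times = [5, 3, 9, 7]
risk = [0.8, 1.5, 0.2, 0.9]
observed = [True, True, False, True]
print(concordance(times, risk, observed))
```

A value near 1 means the predicted ranking agrees with the observed failure order, and a value near 0.5 is what random ranking would give.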
If you opt to use CoxPHFitter, I don't think it was meant to be used with time-varying covariates. Instead, you could use two other approaches. One is to stratify on the problematic variable, i.e., cph.fit(dataframe, time_column, event_column, strata=['your variable to stratify']). The downside is that you no longer obtain a hazard ratio for that variable. The other approach is to use splines. Both of these methods are explained here.

Complete separation of logistic regression data

I've been running some large logistic regression models in SAS, which take 4+ hours to converge. Recently however I acquired access to a Hadoop cluster and can use Python to fit the same models much faster (something more like 10-15 minutes).
Problematically, I have some complete/quasi-complete separation of data points in my data which results in failure to converge; I was using the FIRTH command in SAS to produce robust parameter estimates despite that, but there seems to be no equivalent option for Python, either in sklearn or statsmodels (I'm mostly using the latter).
Is there another way to get around this problem in Python?
AFAIK, there is no Firth penalization available in Python. Statsmodels has an open issue but nobody is working on it at the moment.
As an alternative, it would be possible to use a different kind of penalization, e.g. as available in sklearn or maybe statsmodels.
The other option is to change the observed response variable. Firth can be implemented by augmenting the dataset. However, I don't know of any recipe or prototype for this in Python.
https://github.com/statsmodels/statsmodels/issues/3561
Statsmodels has ongoing work on penalization, but currently the emphasis is on feature/variable selection (elastic net, SCAD) and on quadratic penalization for generalized additive models (GAM), especially for splines.
Firth uses data-dependent penalization, which does not fit the generic penalization framework where the penalization structure is a data-independent "prior".
Conditional likelihood is another way to work around perfect separation. This is in a Statsmodels PR that is basically ready to use:
https://github.com/statsmodels/statsmodels/pull/5304
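To make the data-dependent penalization mentioned above concrete, here is a rough NumPy sketch of Firth-type logistic regression using the commonly cited modified score, X'(y - p + h(0.5 - p)), where h holds the leverages of the weighted hat matrix. It is not the statsmodels or SAS implementation; the function name firth_logistic and the toy data are hypothetical, and the sketch has no step-halving, standard errors, or other safeguards.

```python
import numpy as np


def firth_logistic(X, y, max_iter=100, tol=1e-8):
    """Firth-type penalized logistic regression via Newton steps on the modified score."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1.0 - p)
        info = X.T @ (X * w[:, None])  # Fisher information X'WX
        info_inv = np.linalg.inv(info)
        # Leverages: diagonal of W^(1/2) X (X'WX)^(-1) X' W^(1/2)
        h = w * np.einsum("ij,jk,ik->i", X, info_inv, X)
        # Modified score: the h * (0.5 - p) term is the data-dependent penalty.
        score = X.T @ (y - p + h * (0.5 - p))
        step = info_inv @ score
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta


# Completely separated toy data: ordinary maximum likelihood would diverge here.
x = np.array([-2.0, -1.0, 1.0, 2.0])
X = np.column_stack([np.ones_like(x), x])  # intercept column plus one predictor
y = np.array([0.0, 0.0, 1.0, 1.0])
beta = firth_logistic(X, y)
print(beta)
```

On this separated example the slope estimate stays finite, which is the behaviour the FIRTH option is used for. For real work the result should be checked against an established implementation.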

Best modeling technique for multiple independent variables

I have time series data with 4 independent variables and 1 dependent variable. I'm trying to predict the value of the dependent variable using the independent variables. The data is quite complex. I've already tried linear regression, which, as expected, did not work.
I proceeded to multivariate polynomial regression, but have been unsuccessful so far because I haven't been able to get the code working. I also read somewhere that multivariate polynomial regression might not be the best approach.
Is there any other model that I could use to predict the value of the dependent variable? My entire data is numerical, with new data coming in every day. I'm using Python for this exercise.
Any suggestions are helpful and highly appreciated.
Thank you!

Python's XGBRegressor vs R's XGBoost

I'm using python's XGBRegressor and R's xgb.train with the same parameters on the same dataset and I'm getting different predictions.
I know that XGBRegressor uses 'gbtree' and I've made the appropriate comparison in R, however, I'm still getting different results.
Can anyone lead me in the right direction on how to differentiate the 2 and/or find R's equivalence to python's XGBRegressor?
Sorry if this is a stupid question, thank you.
XGBoost's tree fitting can involve randomness (for example, row or column subsampling), so it can give you slightly different results between fits unless you fix the random seed to make the fitting procedure deterministic.
You can do this via set.seed in R and, in Python, via numpy.random.seed together with the random_state parameter of XGBRegressor.
As Gregor's comment notes, you might also want to set the nthread parameter to 1 to achieve full determinism.

online linear regression with forgetting

I need a way to run a linear regression during a simulation in python. New X and y values come in, should be fitted and new coefficient estimates should be made. However, older values should get a lower weight.
Is there a package that can do this?
Short answer here, perhaps more an idea than a solution.
Have you tried scipy.optimize.curve_fit?
It would do the fitting, but you would still have to code the down-weighting of the old values yourself and pass it in through the sigma parameter (a larger sigma means a lower weight for that point).
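As a minimal sketch of that down-weighting idea, the example below uses exponentially decaying weights with plain NumPy weighted least squares instead of curve_fit. The function name fit_with_forgetting, the forgetting value, and the toy data are all assumptions for illustration. If you prefer curve_fit, the same weights should correspond to sigma = 1 / sqrt(weights).

```python
import numpy as np


def fit_with_forgetting(X, y, forgetting=0.9):
    """Weighted least squares where older rows count less (newest row is last)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Newest observation gets weight 1; each step back in time multiplies by `forgetting`.
    weights = forgetting ** np.arange(n - 1, -1, -1)
    A = np.column_stack([np.ones(n), X])  # add an intercept column
    sw = np.sqrt(weights)
    # Scaling rows by sqrt(weight) turns ordinary lstsq into weighted least squares.
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return coef  # [intercept, slope(s)]


# Toy stream whose slope changes from 1 to 3 halfway through.
x = np.arange(40, dtype=float)
y = np.where(x < 20, x, 3.0 * x - 40.0)
print(fit_with_forgetting(x, y, forgetting=1.0))  # no forgetting: blends both regimes
print(fit_with_forgetting(x, y, forgetting=0.5))  # strong forgetting: tracks recent slope
```

In a simulation you would call this again each time new X and y values arrive. A recursive least squares update would avoid refitting from scratch, at the cost of more bookkeeping.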
