online linear regression with forgetting - python

I need a way to run a linear regression during a simulation in Python. New X and y values come in and should be fitted, producing updated coefficient estimates, but older values should get a lower weight.
Is there a package that can do this?

Short answer here, perhaps more an idea than a solution.
Have you tried scipy.optimize.curve_fit?
It would do the fitting, but you would still have to code the down-weighting of the older values yourself, passing the weights in through the sigma parameter.
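To complement that: if you would rather not refit from scratch on every step, one textbook approach is recursive least squares with an exponential forgetting factor. Here is a minimal numpy sketch, not tied to any package; the forgetting factor lam and the initialization constant delta are knobs you would tune, and lam = 1.0 recovers ordinary recursive least squares:

import numpy as np

class RLSForgetting:
    def __init__(self, n_features, lam=0.99, delta=1000.0):
        self.lam = lam                        # forgetting factor in (0, 1]
        self.beta = np.zeros(n_features)      # current coefficient estimates
        self.P = delta * np.eye(n_features)   # inverse-covariance estimate

    def update(self, x, y):
        # Standard RLS update: old observations decay geometrically by lam.
        x = np.asarray(x, dtype=float)
        Px = self.P @ x
        k = Px / (self.lam + x @ Px)          # gain vector
        self.beta = self.beta + k * (y - x @ self.beta)
        self.P = (self.P - np.outer(k, Px)) / self.lam
        return self.beta

Feeding each new (x, y) pair to update() then gives refreshed coefficients in O(n_features^2) per step.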


Regression method with high precision

I would like to ask for suggestions for my data set. As I am not familiar with machine learning or data science, I would like to get help from you guys.
I have four features and about a million rows, each with one output. The final goal is to obtain a regression with high precision. I have tried one regression method, and it seems that, due to the large number of samples, there is no single regression equation that fits all million rows.
Are there any methods I can try? One idea I had is to run multiple regressions on truncated subsets of the rows, but then what should I do with all the resulting equations to somehow combine them into "one universal" equation, or at least minimize the number of equations as much as possible to make it quasi-universal?
Thank you in advance.
Scikit-learn is a great package for this sort of thing (https://scikit-learn.org/stable/).
There are tons of methods for this kind of regression task. The first ones I would try are LinearRegression, RandomForestRegressor, and AdaBoostRegressor.
Now, you must choose a relevant metric to measure the success of the regression (the L2 distance is the most common, but it may be ill-suited for your problem).
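A minimal sketch of that workflow follows; the synthetic X and y are stand-ins for your four-feature data, and mean squared error stands in for whatever metric you settle on:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for your four features and one output.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=10_000)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

for model in (LinearRegression(),
              RandomForestRegressor(n_estimators=100, n_jobs=-1),
              AdaBoostRegressor()):
    model.fit(X_train, y_train)
    print(type(model).__name__, mean_squared_error(y_test, model.predict(X_test)))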

<lifelines> Solving Cox Proportional Hazard after creating interaction variable with time

I am using the lifelines package to do Cox regression. After fitting the model, I checked the CPH assumptions for possible violations, and the check returned some problematic variables along with suggested solutions.
One of the solutions I would like to try is the one suggested here:
https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html#Introduce-time-varying-covariates
However, the example there uses CoxTimeVaryingFitter, which, unlike CoxPHFitter, does not provide a concordance score to help me gauge model performance. Additionally, CoxTimeVaryingFitter does not have the check_assumptions feature. Does this mean that by putting the data into episodic format, all the assumptions are automatically satisfied?
Alternatively, after reading a SAS textbook on survival analysis, it seemed that their solution was to create the interaction term directly (multiplying the problematic variable by the survival time) without changing to the episodic format (as shown in the link). This way, I was hoping to keep using CoxPHFitter because of its model-scoring capability.
However, after trying this alternative, when I call check_assumptions again on the model with the time-interaction variable, the CPH assumption is violated for that variable.
Now I am torn between:
Using CoxTimeVaryingFitter without knowing what the model performance is (seems like a bad idea)
Using CoxPHFitter, but the assumption is violated on the time-interaction variable (which inherently does not seem to fix the problem)
Any help in resolving this confusion is greatly appreciated.
Here is one suggestion:
If you choose the CoxTimeVaryingFitter, then you need to somehow evaluate the quality of your model. Here is one way. Use the regression coefficients B and write down your model. I'll write it as S(t;x;B), where S is an estimator of the survival, t is the time, and x is a vector of covariates (age, wage, education, etc.). Now, for every individual i, you have a vector of covariates x_i. Thus, you have the survival function for each individual. Consequently, you can predict which individual will 'fail' first, which 'second', and so on. This produces a (predicted) ranking of survival. However, you know the real ranking of survival since you know the failure times or times-to-event. Now, quantify how many pairs (predicted survival, true survival) share the same ranking. In essence, you would be estimating the concordance.
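As a rough sketch of that idea: lifelines ships a concordance_index helper that compares predicted scores against observed durations. Assuming ctv is a fitted CoxTimeVaryingFitter and df holds one covariate row per subject (say, the last observed one) together with duration and event columns (the column names here are assumptions), something like this could work:

from lifelines.utils import concordance_index

# Higher partial hazard = earlier predicted failure, so negate it to get a
# score where larger means longer predicted survival.
scores = -ctv.predict_partial_hazard(df)
print(concordance_index(df["duration"], scores, df["event"]))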
If you opt to use CoxPHFitter, I don't think it was meant to be used with time-varying covariates. Instead, you could use two other approaches. One is to stratify your variable, i.e., cph.fit(dataframe, time_column, event_column, strata=['your variable to stratify']). The downside is that you no longer obtain a hazard ratio for that variable. The other approach is to use splines. Both of these methods are explained here.

How to use statsmodels to fit data

I have a dataset which I need to fit to a GEV distribution. The data is one-dimensional and stored in a numpy array. Currently I am using scipy.stats.genextreme.fit(data), which runs, but gives totally inaccurate results (obvious when plotting the pdf). After some investigation it turns out that my data does not fit well in log space, which scipy uses in its MLE fitting algorithm, so I need to try something like GMM instead, which is only available in statsmodels. The problem is that I can't find anything that looks like scipy's fit function. All the examples I've found seem to deal with far more complicated data than I have. Also, statsmodels requires endog and exog parameters for everything, and I have no idea what these are.
This should be really simple, so I'm sure I'm missing something obvious. Has anyone used statsmodels in this way, and if so, any pointers as to how to do it?
I'm guessing you want Gaussian Mixture Model (GMM) and not Generalized Method of Moments (GMM). The former GMM is available in scikit-learn here. The latter has code in statsmodels, but it's a work in progress.
EDIT Actually it's not clear to me that you want GMM. Maybe you just want a kernel density estimator (KDE). This is available in statsmodels here, with an example.
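For the KDE route, here is a minimal sketch using statsmodels' KDEUnivariate; the Gumbel sample is only a stand-in for your one-dimensional array:

import numpy as np
import statsmodels.api as sm

data = np.random.default_rng(0).gumbel(loc=10.0, scale=2.0, size=1000)  # stand-in sample

kde = sm.nonparametric.KDEUnivariate(data)
kde.fit()                                  # Gaussian kernel, automatic bandwidth
print(kde.support[:5], kde.density[:5])    # estimated pdf evaluated on a grid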
Hmm, if you do want to use (Generalized) Method of Moments to fit some kind of probability-weighted GEV, then you need to specify the moment conditions, but I don't have a ready example of how to specify moment conditions for (G)MM in statsmodels. You might be better off asking on the mailing list.

Using robust linear methods from python module "statsmodels" with weights?

I have some data, y, with errors, yerr, measured at x. I need to fit a straight line to this, mimicking some MATLAB code, specifically the fit method with robust "on" and the weights given as 1/yerr. The MATLAB documentation says it uses the bisquare method (also known as Tukey's biweight). My code so far is:
import statsmodels.api as sm

rlm_model = sm.RLM(y, x, M=sm.robust.norms.TukeyBiweight())
rlm_results = rlm_model.fit()
print(rlm_results.params)
however I need to find a way of including weights derived from yerr.
I hope people can help; this is the first time I have tried to use the statsmodels module.
In response to the first answer:
I tried:
y = y * yerr
x = x * yerr
x = sm.add_constant(x, prepend=False)
rlm_model = sm.RLM(y, x, M=sm.robust.norms.TukeyBiweight())
results = rlm_model.fit()
but sadly this doesn't match the MATLAB function.
Weights reflecting heteroscedasticity, that is, unequal variance across observations, are not yet supported by statsmodels' RLM.
As a workaround, you can divide your y and x by yerr in the call to RLM.
I think, in analogy to weighted least squares, the parameter estimates, their standard errors and other statistics are still correct in this case. But I haven't checked yet.
As a reference:
Carroll, Raymond J., and David Ruppert. "Robust estimation in heteroscedastic linear models." The annals of statistics (1982): 429-441.
They also estimate the variance function, but for fixed weights 1/sigma_i the optimization just uses
(y_i - x_i beta) / sigma_i
The weights 1/sigma_i will only be relative weights and will still be multiplied with a robust estimate of the scale of the errors.
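A minimal sketch of that workaround, assuming x, y, and yerr are 1-D numpy arrays as in the question. Note that the division has to happen after add_constant, so the intercept column gets scaled too, and that dividing by yerr is multiplying by the weight 1/yerr (the question's attempt multiplied the wrong way):

import statsmodels.api as sm

X = sm.add_constant(x, prepend=False)   # design matrix with intercept column

w = 1.0 / yerr                          # relative weights
Xw = X * w[:, None]                     # scale every column, intercept included
yw = y * w

rlm_results = sm.RLM(yw, Xw, M=sm.robust.norms.TukeyBiweight()).fit()
print(rlm_results.params)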

Multiple Logistic Regression in Python

I have a data set as such:
0,1,0,1,1,0,0,1,5
1,1,0,1,1,0,0,1,3
1,1,0,0,1,0,0,1,1
0,1,1,0,1,1,0,0,4
I'm looking for a way to run logistic regression in Python which uses several discrete values (0 or 1) to predict a numerical value (between 1 and 5). This seems useful, but it assumes the predicted variable is binary: http://www.mblondel.org/tlml/logreg.py.html#
Any suggestions?
If getting the job done in R (through one of rpy2, pyRserve, or pyper) is an option, you could use that. For questions about which statistical method to use, Cross Validated is the better place to ask.
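If you would rather stay in pure Python, one option not covered above is to treat the 1-5 output as five classes and fit a multiclass logistic regression with scikit-learn; a minimal sketch on the rows from the question:

import numpy as np
from sklearn.linear_model import LogisticRegression

# The rows from the question: eight binary features, target in 1..5.
data = np.array([
    [0, 1, 0, 1, 1, 0, 0, 1, 5],
    [1, 1, 0, 1, 1, 0, 0, 1, 3],
    [1, 1, 0, 0, 1, 0, 0, 1, 1],
    [0, 1, 1, 0, 1, 1, 0, 0, 4],
])
X, y = data[:, :-1], data[:, -1]

# scikit-learn treats a multi-valued target as a multiclass problem
# (multinomial with the default lbfgs solver in recent versions).
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict(X))

Whether the ordering of 1-5 should be modelled explicitly (e.g., with an ordinal model) is exactly the kind of question Cross Validated is better suited for.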
