I am using lifelines package to do Cox Regression. After trying to fit the model, I checked the CPH assumptions for any possible violations and it returned some problematic variables, along with the suggested solutions.
One of the solution that I would like to try is the one suggested here:
https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html#Introduce-time-varying-covariates
However, the example written here is using CoxTimeVaryingFitter which, unlike CoxPHFitter, does not have concordance score, which will help me gauge the model performance. Additionally, CoxTimeVaryingFitter does not have check assumption feature. Does this mean that by putting it into episodic format, all the assumptions are automatically satisfied?
Alternatively, after reading a SAS textbook on survival analysis, it seemed like their solution is to create the interaction term directly (multiplying the problematic variable with the survival time) without changing the format to episodic format (as shown in the link). This way, I was hoping to just keep using CoxPHFitter due to its model scoring capability.
However, after doing this alternative, when I call check_assumptions again on the model with the time-interaction variable, the CPH assumption on the time-interaction variable is violated.
Now I am torn between:
Using CoxTimeVaryingFitter without knowing what the model performance is (seems like a bad idea)
Using CoxPHFitter, but the assumption is violated on the time-interaction variable (which inherently does not seem to fix the problem)
Any help regarding to solve this confusion is greatly appreciated
Here is one suggestion:
If you choose the CoxTimeVaryingFitter, then you need to somehow evaluate the quality of your model. Here is one way. Use the regression coefficients B and write down your model. I'll write it as S(t;x;B), where S is an estimator of the survival, t is the time, and x is a vector of covariates (age, wage, education, etc.). Now, for every individual i, you have a vector of covariates x_i. Thus, you have the survival function for each individual. Consequently, you can predict which individual will 'fail' first, which 'second', and so on. This produces a (predicted) ranking of survival. However, you know the real ranking of survival since you know the failure times or times-to-event. Now, quantify how many pairs (predicted survival, true survival) share the same ranking. In essence, you would be estimating the concordance.
If you opt to use CoxPHFitter, I don't think it was meant to be used with time-varying covariates. Instead, you could use two other approaches. One is to stratify your variable, i.e., cph.fit(dataframe, time_column, event_column, strata=['your variable to stratify']). The downside is that you no longer obtain a hazard ratio for that variable. The other approach is to use splines. Both of these methods are explained in here.
Related
I have a dataset with low data points but very high dimensions/features. I wanted to know if there's any classification algorithm that work well with such dataset without having to perform dimensionality reduction techniques such as PCA, TSNE?
df.shape
(2124, 466029)
This is the classic curse of dimensionality (or p>>n) problem (with p being the number of predictors and n the number of observations).
Many techniques have been developed to try and address this problem.
You can randomly restrict your variables (you select different random subsets) and then asses their importance using cross-validation.
A preferable approach (imho) would be to use ridge-regression, lasso, or elastic net for regularization, however be aware that their oracle properties are rarely satisfied in practice.
Finally, there are algorithms that are able to deal with a very large number of predictors (and tweaks in their implementation that improve the performance when p>>n).
Examples of such models are support vector machine or random forest.
There are many resources on the topic, which are freely available.
You can have a look at these slides from Duke University for example.
Oracle properties (Lasso)
I will not explain in a sound mathematical way but I'll briefly give you some intuition.
Y= dependent variable, your target
X= regressors, your features
ε= your errors
We define a shrinkage procedure oracle if it is asymptotically able to:
Identify the right subset of regressors (i.e. retain only the features that have a true causal relationship with your dependent variable.
Has an optimal estimation rate (I'll leave the details out)
There are three assumptions that, if satisfied, make the lasso oracle.
Beta-min condition: The coefficients of the "true" regressors is above a certain threshold.
Your regressors are uncorrelated with each other.
X and ε are normally distributed and homoskedatisc
In practice you rarely have these assumptions satisfied.
What happens in that case is that your shrinkage will not necessarily retain the right variables.
This implies that you can't make statistically sound inference on the final model (you can't say X_1 explains Y for this and this other reason).
The intuition is simple. If assumption 1 is not satisfied one of the true variables might be incorrectly removed. If assumption 2 is not satisfied then a variable highly correlated with one of the true variables might be incorrectly retained in stead of the right one.
All in all, you shouldn't worry if your aim is forecasting. Your forecast will still be good! The only difference is that mathematically you can't say anymore that you are selecting the correct variables with probability -> 1.
PS: Lasso is a special case of elastic net, I vaguely remember that the oracle property of the elastic net has been proved as well but I might be wrong.
PPS: Corrections are appreciated as I haven't studied these things in a long while and there might be inaccuracies.
You could try a lasso/ridge/elastic net logistic regression.
I have a neural network program that is designed to take in input variables and output variables, and use forecasted data to predict what the output variables should be based on the forecasted data. After running this program, I will have an output of an output vector. Lets say for example, my input matrix is 100 rows and 10 columns and my output matrix is a vector with 100 values. How do I determine which of my 10 variables (columns) had the most impact on my output?
I've done a correlation analysis between each of my variables (columns) and my output and created a list of the highest correlation between each variable and output, but I'm wondering if there is a better way to go about this.
If what you want to know is model selection, and it's not as simple as studiying the correlation of your features to your target. For an in-depth, well explained look at model selection, I'd recommend you read chapter 7 of The Elements Statistical Learning. If what you're looking for is how to explain your network, then you're in for a treat as well and I'd recommend reading this article for starters, though I won't go into the matter myself.
Naive approaches to model selection:
There a number of ways to do this.
The naïve way is to estimate all possible models, so every combination of features. Since you have 10 features, it's computationally unfeasible.
Another way is to take a variable you think is a good predictor and train to model only on that variable. Compute the error on the training data. Take another variable at random, retrain the model and recompute the error on the training data. If it drops the error, keep the variable. Otherwise discard it. Keep going for all features.
A third approach is the opposite. Start with training the model on all features and sequentially drop variables (a less naïve approach would be to drop variables you intuitively think have little explanatory power), compute the error on training data and compare to know if you keep the feature or not.
There are million ways of going about this. I've exposed three of the simplest, but again, you can go really deeply into this subject and find all kinds of different information (which is why I highly recommend you read that chapter :) ).
I want to build a model that describes a curve that fits the data shown in the scatterplot. I thought it would be straight forward using sklearn. But the choice and application of the different methods gets rather confusing.
Which algorithms would you use to tackle this problem?
This is really a question for CrossValidated rather than a Python question.
Your data seems to strongly indicate a simple underlying model which is linear until the very end, when it perhaps becomes polynomial.
As a first step, if possible, I would investigate this phenomenon. It's unusual. Perhaps there's something wrong with the data source. But maybe not. For example, a physical phenomenon with two distinct phases might produce data like these.
As to models, I would suggest natural cubic splines for this data. They are simple and involve cutting the data up into windows which you fit with cubic polynomials (a special case of which is a line).
You might also consider smoothing splines, and local regression.
For information on these, see the free online textbook, An Introduction to Statistical Learning.
Hello old faithful community,
This might be a though one as I can barely find any material on this.
The Problem
I have a data set of crimes committed in NSW Australia by council, and have merged this with average house prices by council. I'm now looking to produce a linear regression to try and predict said house price by the crime in the neighbourhood. The issue is, I have 49 crimes, and only want the best ones (statistically speaking) to be used in my model.
I've run the regression score over all and some variables (using correlation), and had results from .23 - .38 but I want to perfect this to the best possible - if there is a way to do this of course.
I've thought about looping over every possible combination, but this would end up by couple of million according to google.
So, my friends - how can I python this dataframe to get the best columns?
If I might add, you may want to take a look at the Python package mlxtend, http://rasbt.github.io/mlxtend.
It is a package that features several forward/backward stepwise regression algorithms, while still using the regressors/selectors of sklearn.
There is no gold standard to solving this problem and you are right, selecting every combination is computational not feasible most of the time -- especially with 49 variables. One method would be to implement a forward or backward selection by adding/removing variables based on a user specified p-value criteria (this is the statistically relevant criteria you mention). For python implementations using statsmodels, check out these links:
https://datascience.stackexchange.com/questions/24405/how-to-do-stepwise-regression-using-sklearn/24447#24447
http://planspace.org/20150423-forward_selection_with_statsmodels/
Other approaches that are less 'statistically valid' would be to define a model evaluation metric (e.g., r squared, mean squared error, etc) and use a variable selection approach such as LASSO, random forest, genetic algorithm, etc to identify the set of variables that optimize the metric of choice. I find that in practice, ensembling these techniques in a voting-type scheme works the best as different techniques work better for certain types of data. Check out the links below from sklearn to see some options that you can code up pretty quickly with your data:
Overview of techniques: http://scikit-learn.org/stable/modules/feature_selection.html
A stepwise procedure: http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html
Select best features based on model: http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html
If you are up for it, I would try a few techniques and see if the answers converge to the same set of features -- This will give you some insight into the relationships between your variables.
I have a bunch of contact data listing what members were contacted by what offer, which summarizes something like this:
To make sense of it (and to make it more scalable) I was considering creating dummy variables for each offer and then using a logistic model to see how different offers impact performance:
Before I embark too far on this journey I wanted to get some input if this is a sensible way to approach this (I have started playing around but and got a model output, but haven't dug into it yet). Someone suggested I use linear regression instead, but I'm not really sure about the approach for that in this case.
What I'm hoping to get are coefficients that are interpretable - so I can see that Mailing the 50% off offer in the 3d mailing is not as impactful as the $25 giftcard etc, and then do this at scale (lots of mailings with lots of different offers) to draw some conclusions about the impact of timing of different offers.
My concern is that I will end up with a fairly sparse matrix where only some combinations of the many possible are respresented, and what problems may arise from this. I've taken some online courses in ML but am new to it, and this is one of my first chances to work directly with it so I'm hoping I could create something useful out of this. I have access to lots and lots of data, it's just a matter of getting something basic out that can show some value. Maybe there's already some work on this or even some kind of library I can use?
Thanks for any help!
If your target variable is binary (1 or 0) as in the second chart, then a classification model is appropriate. Logistic Regression is a good first option, you could also a tree-based model like a decision tree classifier or a random forest.
Creating dummy variables is a good move; you could also convert the discounts to numerical values if you want to keep them in a single column, however this may not work so well for a linear model like logistic regression as the correlation will probably not be linear.
If you wanted to model the first chart directly you could use a linear regressions for predicting the conversion rate, I'm not sure about the difference is in doing this, it's actually something I've been wondering about for a while, you've motivated me to post a question on stats.stackexchange.com