I want to build a model that describes a curve that fits the data shown in the scatterplot. I thought it would be straight forward using sklearn. But the choice and application of the different methods gets rather confusing.
Which algorithms would you use to tackle this problem?
This is really a question for CrossValidated rather than a Python question.
Your data seems to strongly indicate a simple underlying model which is linear until the very end, when it perhaps becomes polynomial.
As a first step, if possible, I would investigate this phenomenon. It's unusual. Perhaps there's something wrong with the data source. But maybe not. For example, a physical phenomenon with two distinct phases might produce data like these.
As to models, I would suggest natural cubic splines for this data. They are simple and involve cutting the data up into windows which you fit with cubic polynomials (a special case of which is a line).
You might also consider smoothing splines, and local regression.
For information on these, see the free online textbook, An Introduction to Statistical Learning.
Related
I am using lifelines package to do Cox Regression. After trying to fit the model, I checked the CPH assumptions for any possible violations and it returned some problematic variables, along with the suggested solutions.
One of the solution that I would like to try is the one suggested here:
https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html#Introduce-time-varying-covariates
However, the example written here is using CoxTimeVaryingFitter which, unlike CoxPHFitter, does not have concordance score, which will help me gauge the model performance. Additionally, CoxTimeVaryingFitter does not have check assumption feature. Does this mean that by putting it into episodic format, all the assumptions are automatically satisfied?
Alternatively, after reading a SAS textbook on survival analysis, it seemed like their solution is to create the interaction term directly (multiplying the problematic variable with the survival time) without changing the format to episodic format (as shown in the link). This way, I was hoping to just keep using CoxPHFitter due to its model scoring capability.
However, after doing this alternative, when I call check_assumptions again on the model with the time-interaction variable, the CPH assumption on the time-interaction variable is violated.
Now I am torn between:
Using CoxTimeVaryingFitter without knowing what the model performance is (seems like a bad idea)
Using CoxPHFitter, but the assumption is violated on the time-interaction variable (which inherently does not seem to fix the problem)
Any help regarding to solve this confusion is greatly appreciated
Here is one suggestion:
If you choose the CoxTimeVaryingFitter, then you need to somehow evaluate the quality of your model. Here is one way. Use the regression coefficients B and write down your model. I'll write it as S(t;x;B), where S is an estimator of the survival, t is the time, and x is a vector of covariates (age, wage, education, etc.). Now, for every individual i, you have a vector of covariates x_i. Thus, you have the survival function for each individual. Consequently, you can predict which individual will 'fail' first, which 'second', and so on. This produces a (predicted) ranking of survival. However, you know the real ranking of survival since you know the failure times or times-to-event. Now, quantify how many pairs (predicted survival, true survival) share the same ranking. In essence, you would be estimating the concordance.
If you opt to use CoxPHFitter, I don't think it was meant to be used with time-varying covariates. Instead, you could use two other approaches. One is to stratify your variable, i.e., cph.fit(dataframe, time_column, event_column, strata=['your variable to stratify']). The downside is that you no longer obtain a hazard ratio for that variable. The other approach is to use splines. Both of these methods are explained in here.
I have a dataset with low data points but very high dimensions/features. I wanted to know if there's any classification algorithm that work well with such dataset without having to perform dimensionality reduction techniques such as PCA, TSNE?
df.shape
(2124, 466029)
This is the classic curse of dimensionality (or p>>n) problem (with p being the number of predictors and n the number of observations).
Many techniques have been developed to try and address this problem.
You can randomly restrict your variables (you select different random subsets) and then asses their importance using cross-validation.
A preferable approach (imho) would be to use ridge-regression, lasso, or elastic net for regularization, however be aware that their oracle properties are rarely satisfied in practice.
Finally, there are algorithms that are able to deal with a very large number of predictors (and tweaks in their implementation that improve the performance when p>>n).
Examples of such models are support vector machine or random forest.
There are many resources on the topic, which are freely available.
You can have a look at these slides from Duke University for example.
Oracle properties (Lasso)
I will not explain in a sound mathematical way but I'll briefly give you some intuition.
Y= dependent variable, your target
X= regressors, your features
ε= your errors
We define a shrinkage procedure oracle if it is asymptotically able to:
Identify the right subset of regressors (i.e. retain only the features that have a true causal relationship with your dependent variable.
Has an optimal estimation rate (I'll leave the details out)
There are three assumptions that, if satisfied, make the lasso oracle.
Beta-min condition: The coefficients of the "true" regressors is above a certain threshold.
Your regressors are uncorrelated with each other.
X and ε are normally distributed and homoskedatisc
In practice you rarely have these assumptions satisfied.
What happens in that case is that your shrinkage will not necessarily retain the right variables.
This implies that you can't make statistically sound inference on the final model (you can't say X_1 explains Y for this and this other reason).
The intuition is simple. If assumption 1 is not satisfied one of the true variables might be incorrectly removed. If assumption 2 is not satisfied then a variable highly correlated with one of the true variables might be incorrectly retained in stead of the right one.
All in all, you shouldn't worry if your aim is forecasting. Your forecast will still be good! The only difference is that mathematically you can't say anymore that you are selecting the correct variables with probability -> 1.
PS: Lasso is a special case of elastic net, I vaguely remember that the oracle property of the elastic net has been proved as well but I might be wrong.
PPS: Corrections are appreciated as I haven't studied these things in a long while and there might be inaccuracies.
You could try a lasso/ridge/elastic net logistic regression.
For Topic Modelling ,
Why random_state parameter is used in NMF and LDA algorithm ? What are the benefits of using random topics generated every time ?
The algorithms for both are stochastic - meaning they use randomness as a part of estimating a good answer. It's done that way to make it tractable, and in the case of LDA, the whole model is stochastic, providing you ideally with a probabilistic distribution (called "the posterior distribution") of answers, but instead providing a single, likely answer as an estimate.
So the answer is that using randomness in the algorithms makes a tremendously difficult problem much simpler and feasible to calculate in less than a hundred years.
If you're going to use them, I think it would do you well to study them, learn something of how they work, why they work. Using a tool that you don't understand is risky, as you don't really know what the result the tool provides actually means. One example is the numerors words in all "topics" with very low probability. The differences in these probabilities are actually meaningless - given a different sample from the posterior, you'd get different probabilities, ranked differently between words.
I have a bunch of contact data listing what members were contacted by what offer, which summarizes something like this:
To make sense of it (and to make it more scalable) I was considering creating dummy variables for each offer and then using a logistic model to see how different offers impact performance:
Before I embark too far on this journey I wanted to get some input if this is a sensible way to approach this (I have started playing around but and got a model output, but haven't dug into it yet). Someone suggested I use linear regression instead, but I'm not really sure about the approach for that in this case.
What I'm hoping to get are coefficients that are interpretable - so I can see that Mailing the 50% off offer in the 3d mailing is not as impactful as the $25 giftcard etc, and then do this at scale (lots of mailings with lots of different offers) to draw some conclusions about the impact of timing of different offers.
My concern is that I will end up with a fairly sparse matrix where only some combinations of the many possible are respresented, and what problems may arise from this. I've taken some online courses in ML but am new to it, and this is one of my first chances to work directly with it so I'm hoping I could create something useful out of this. I have access to lots and lots of data, it's just a matter of getting something basic out that can show some value. Maybe there's already some work on this or even some kind of library I can use?
Thanks for any help!
If your target variable is binary (1 or 0) as in the second chart, then a classification model is appropriate. Logistic Regression is a good first option, you could also a tree-based model like a decision tree classifier or a random forest.
Creating dummy variables is a good move; you could also convert the discounts to numerical values if you want to keep them in a single column, however this may not work so well for a linear model like logistic regression as the correlation will probably not be linear.
If you wanted to model the first chart directly you could use a linear regressions for predicting the conversion rate, I'm not sure about the difference is in doing this, it's actually something I've been wondering about for a while, you've motivated me to post a question on stats.stackexchange.com
I have some points that I need to classify. Given the collection of these points, I need to say which other (known) distribution they match best. For example, given the points in the top left distribution, my algorithm would have to say whether they are a better match to the 2nd, 3rd, or 4th distribution. (Here the bottom-left would be correct due to the similar orientations)
I have some background in Machine Learning, but I am no expert. I was thinking of using Gaussian Mixture Models, or perhaps Hidden Markov Models (as I have previously classified signatures with these- similar problem).
I would appreciate any help as to which approach to use for this problem. As background information, I am working with OpenCV and Python, so I would most likely not have to implement the chosen algorithm from scratch, I just want a pointer to know which algorithms would be applicable to this problem.
Disclaimer: I originally wanted to post this on the Mathematics section of StackExchange, but I lacked the necessary reputation to post images. I felt that my point could not be made clear without showing some images, so I posted it here instead. I believe that it is still relevant to Computer Vision and Machine Learning, as it will eventually be used for object identification.
EDIT:
I read and considered some of the answers given below, and would now like to add some new information. My main reason for not wanting to model these distributions as a single Gaussian is that eventually I will also have to be able to discriminate between distributions. That is, there might be two different and separate distributions representing two different objects, and then my algorithm should be aware that only one of the two distributions represents the object that we are interested in.
I think this depends on where exactly the data comes from and what sort of assumptions you would like to make as to its distribution. The points above can easily be drawn even from a single Gaussian distribution, in which case the estimation of parameters for each one and then the selection of the closest match are pretty simple.
Alternatively you could go for the discriminative option, i.e. calculate whatever statistics you think may be helpful in determining the class a set of points belongs to and perform classification using SVM or something similar. This can be viewed as embedding these samples (sets of 2d points) in a higher-dimensional space to get a single vector.
Also, if the data is actually as simple as in this example, you could just do the principle component analysis and match by the first eigenvector.
You should just fit the distributions to the data, determine the chi^2 deviation for each one, look at F-Test. See for instance these notes on model fitting etc
You might want to consider also non-parametric techniques (e.g. multivariate kernel density estimation on each of your new data set) in order to compare the statistics or distances of the estimated distributions. In Python stats.kde is an implementation in SciPy.Stats.