I'm analyzing a set of data and I need to find the regression for it. The number of data points in the dataset is low (~15), so I decided to use robust linear regression for the job. The problem is that the procedure selects some points as outliers that do not seem to be that influential. Here is a scatter plot of the data, with each point's influence used as its marker size:
Points B and C (shown with red circles in the figure) are selected as outliers, while point A, which has far higher influence, is not. Although point A does not change the general trend of the regression, it essentially defines the slope together with the point with the highest X, whereas points B and C only affect the significance of the slope. So my question has two parts:
1) What method does the RLM package use to select outliers, given that the most influential point is not selected, and do you know of other packages with the kind of outlier selection I have in mind?
2) Do you think that point A is an outlier?
RLM in statsmodels is limited to M-estimators. The default Huber norm is only robust to outliers in y, but not in x, i.e. not robust to bad influential points.
See for example http://www.statsmodels.org/devel/examples/notebooks/generated/robust_models_1.html, line In [51] and after.
Redescending norms like bisquare are able to remove bad influential points, but the solution is a local optimum and needs appropriate starting values. Methods that have a high breakdown point and are robust to x outliers, like LTS (least trimmed squares), are currently not available in statsmodels nor, AFAIK, anywhere else in Python. R has a more extensive suite of robust estimators that can handle these cases. Some extensions to add more methods and models to statsmodels.robust are in currently stalled pull requests.
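As a rough sketch (on made-up data, not the questioner's) of how the choice of norm changes which points get down-weighted, one could compare the default Huber norm with the redescending bisquare (Tukey biweight) norm in statsmodels:

import numpy as np
import statsmodels.api as sm

# toy data with one influential point at the largest x (hypothetical, for illustration only)
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 15)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=x.size)
y[-1] += 5.0
X = sm.add_constant(x)

huber = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()
bisquare = sm.RLM(y, X, M=sm.robust.norms.TukeyBiweight()).fit()

print(huber.params, bisquare.params)
print(bisquare.weights)  # weights near 0 flag points the bisquare fit effectively discards

Because the bisquare solution is a local optimum, it is worth checking that the fit is stable, e.g. against the Huber result above.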
In general and to answer the second part of the question:
In specific cases it is often difficult to declare or identify an observation as an outlier. Very often researchers use robust methods to flag outlier candidates that need further investigation. One possible reason, for example, is that the "outliers" were sampled from a different population. A purely mechanical, statistical identification may not be appropriate in many cases.
In this example: If we fit a steep slope and drop point A as an outlier, then points B and C might fit reasonably well and are not identified as outliers. On the other hand, if A is a reasonable point based on extra information, then maybe the relationship is nonlinear.
My guess is that LTS will declare A as the only outlier and fit a steep regression line.
Related
I have some data that contains some outliers. My data, however, has a direction to it and trends that I need to consider when looking for outliers. What an outlier is, however, is not simply a yes-or-no answer. The only thing I can say is that the farther a data point is from the trend, the more likely it is an outlier that I would like to exclude from my data.
Given that things like standard deviation, linear regressions, and the chunk of data I am looking at all depend on context, there is no static function I know of to determine whether something is an outlier or not.
I can select good outliers using various techniques, but the problem is that any time you get rid of outliers, you are using the context of the data you are picking the outliers from.
I know that when you prepare your data for a NN, the data always has to be prepared the exact same way; that is, it goes through a set of static processes/functions. The techniques used to select outliers require context, and context changes, so the function changes. I am not sure whether the differences in how an outlier is selected are enough to throw off the integrity of the model.
If this is true, are there any good static methods for selecting outliers?
A model-independent way of selecting outliers is based upon the distribution of errors. This boils down to:
Fit the model with all data points
Calculate the residual error for each data point
Eliminate outliers based on some threshold
Re-fit the model from scratch with outliers removed
(Optionally repeat until a termination condition is met, e.g. no outliers are removed)
The threshold of elimination is problem- and metric-dependent. One approach to outlier elimination is to compute a z-score on the residual errors (subtract the mean and divide by the standard deviation of the residual errors) and then remove any points with an absolute value greater than a defined threshold (which equates to the number of standard deviations from the mean beyond which points are identified as outliers).
https://en.wikipedia.org/wiki/Standard_score
This is a general, model-independent approach that assumes residuals are normally-distributed (or at least that outliers can be reasonably identified based on relative error).
If you have other assumptions regarding the distribution of the residual, you can apply other probabilistic criteria (e.g. fit a distribution on the residual errors, then apply a probabilistic threshold for each point). This is more involved though, and if you don't have any belief a priori about the characteristics of the residual error distribution (other than "large errors are likely outliers") then z-score is the way to go.
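As a minimal sketch of the procedure above, assuming an ordinary least-squares fit with scikit-learn and a z-score threshold of 3 (all names here are illustrative, not from the original question):

import numpy as np
from sklearn.linear_model import LinearRegression

def remove_residual_outliers(X, y, threshold=3.0):
    # one pass of: fit -> residual z-scores -> drop outliers -> refit
    model = LinearRegression().fit(X, y)
    residuals = y - model.predict(X)
    z = (residuals - residuals.mean()) / residuals.std()
    keep = np.abs(z) <= threshold  # keep points within `threshold` standard deviations
    refit = LinearRegression().fit(X[keep], y[keep])
    return refit, keep

# hypothetical usage: model, mask = remove_residual_outliers(X_train, y_train)

The optional loop over the steps listed above would simply call this repeatedly until no further points are removed.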
The foregoing discusses how to identify outliers, but doesn't address whether you should. This is an application-dependent question. If outliers are not informative of behavior you want to model, then they can be removed from training. However, if you want your model to predict average (or other metric-optimizing) behavior inclusive of outliers, then they should be retained.
I am performing PCA on a dataset of shape (300, 1500) using scikit-learn in Python 3.
I have the following questions in the context of the PCA implementation in scikit-learn and the generally accepted approach.
1) Before doing PCA, do I remove highly correlated columns? I have 67 columns which have correlation > 0.9. Does PCA automatically handle this correlation, i.e. ignore them?
2) Do I need to remove outliers before performing PCA?
3) If I have to remove outliers, how best to approach this? When I tried to remove outliers using a z-score for each column (z-score > 3), I was left with only 15 observations. That seems like the wrong approach.
4) Finally, is there an ideal amount of cumulative explained variance which I should be using to choose the number of components? In this case, around 150 components give me 90% cumulative explained variance.
With regard to using PCA: PCA will discover the axes of greatest variance in your data. Consequently:
No, you do not need to remove correlated features.
You shouldn't need to remove outliers for any a priori reason related to PCA. That said, if you think they are potentially manipulating your results either for analysis or prediction you could consider removing them, although I don't think they are a problem for PCA per se.
That is probably not the right approach. First things first, visualize your data and look at your outliers. Also, I wouldn't assume a distribution for your data and apply a basic z-score to it. Some googling on criteria for removing outliers would be useful here.
There are various cutoffs people use with PCA; 99% can be quite common, although I don't know of a hard and fast rule. If your goal is prediction, there will probably be a trade-off between speed and the accuracy of your predictions. You will need to find the cutoff that suits your needs.
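As an illustrative sketch of choosing the number of components by a cumulative-explained-variance cutoff with scikit-learn (the 0.90 threshold is just an example, and X stands in for your (300, 1500) data matrix):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)  # X is assumed to be the (300, 1500) matrix

pca = PCA().fit(X_scaled)
cumvar = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.searchsorted(cumvar, 0.90) + 1)  # smallest k reaching 90%

X_reduced = PCA(n_components=n_components).fit_transform(X_scaled)

scikit-learn also accepts a float directly, e.g. PCA(n_components=0.90), which keeps the smallest number of components explaining at least that fraction of the variance.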
I have a linear model that I'm trying to fit to data with a good number of outliers in the endogenous variable, but not in the exogenous space. I've read that RLMs based on M-estimators are good in this situation.
When I fit an RLM to my data in the following way:
import numpy as np
import statsmodels.formula.api as smf
import statsmodels.api as sm  # statsmodels.api exposes sm.robust; a bare "import statsmodels" does not

modelspec = 'cost ~ np.log(units) + np.log(units):item + item'  # item is a categorical variable
results = smf.rlm(modelspec, data=dataset, M=sm.robust.norms.TukeyBiweight()).fit()
print(results.summary())
the summary results show a z statistic, and the coefficient significance tests are seemingly based on this rather than a t statistic. However, the following R manual (http://www.dst.unive.it/rsr/BelVenTutorial.pdf) shows the use of t statistics on pp. 19-21.
Two questions:
Can someone explain to me conceptually why statsmodels uses a z-test rather than a t-test?
All terms and interactions are highly significant in the results (|z| > 4). In most cases each item has 40 or more observations, though some items have only 21-25. Is there reason to believe that RLM is not effective in a small-sample setting? The line it produces must be the best-fit line after reweighting outliers, but is the z-test effective for samples of this size? That is, is there reason to believe the confidence intervals produced by smf.rlm() do not give 95% coverage? I know this can potentially be an issue for t-tests.
Thanks!
I have mostly only a general answer; I have never read any small-sample Monte Carlo studies for M-estimators.
To 1.
In many models, like M-estimators (RLM) or generalized linear models (GLM), we have only asymptotic results, except for maybe a few special cases. The asymptotic results give conditions under which the estimator is asymptotically normally distributed. Given this, statsmodels defaults to using the normal distribution for all models outside the linear regression model (OLS) and similar, and the chi-square distribution instead of the F distribution for Wald tests with joint hypotheses.
There is some evidence that in many cases using the t or F distribution with appropriate choice of degrees of freedom provides a better small sample approximation to the distribution of the test statistic. This relies on Monte Carlo results and is not directly justified by the theory, as far as I know.
In the next release of statsmodels, and in the current development version, users can choose to use the t and F distributions for the results, instead of the normal and chi-square distributions. The defaults stay the same as they are now.
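If your installed version already exposes this, one way to get t-based inference from an existing results object is the use_t argument of t_test; I believe the call looks roughly like the following, but check the documentation for your version:

import numpy as np

# `results` is the fitted RLM results object from the question
k = len(results.params)
t_based = results.t_test(np.eye(k), use_t=True)  # t distribution instead of normal, if supported
print(t_based)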
There are other cases where it is not clear whether the t distribution should be used, or which small-sample degrees of freedom are appropriate. In many cases, statsmodels tries to follow the lead of Stata, for example for cluster-robust standard errors after OLS.
Another consequence is that sometimes equivalent models that are special cases of different models use different default assumptions on the distribution, both in Stata and in statsmodels.
I recently read the SAS documentation for M-estimators, and SAS is using the chisquare distribution, i.e. also the normal assumption, for the significance of the parameter estimates and for the confidence intervals.
To 2.
(see first sentence)
I think the same as for linear models also applies here. If the data is highly non-normal, then the test statistics could have incorrect coverage in small samples. This can also be the case with some robust sandwich covariance estimators. On the other hand, if we don't use heteroscedasticity or correlation robust covariance estimators, then the tests can also be strongly biased.
For robust estimation methods like M-estimators, RLM, the effective sample size also depends on the number of inliers, or the weights assigned to the observations, not just the total number of observations.
For your case, I think the z-values and the sample size are large enough that, for example, using the t-distribution would not make them much less significant.
Comparing M-estimators with different norms and scale estimates would provide an additional check on the robustness of the assumptions about the outliers and on the choice of robust estimator. Another cross-check: does OLS with the outliers dropped (observations with small weights in the RLM estimate) give a similar answer?
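A minimal sketch of that cross-check, reusing the formula from the question (the 0.5 weight cutoff is an arbitrary illustrative choice, and this assumes no rows are dropped for missing values, so that the weights align with the rows of dataset):

import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

modelspec = 'cost ~ np.log(units) + np.log(units):item + item'

# the same model under two different norms
res_huber = smf.rlm(modelspec, data=dataset, M=sm.robust.norms.HuberT()).fit()
res_bisq = smf.rlm(modelspec, data=dataset, M=sm.robust.norms.TukeyBiweight()).fit()

# OLS on the observations that the bisquare fit did not down-weight heavily
inliers = dataset[res_bisq.weights > 0.5]  # cutoff is an arbitrary choice
res_ols = smf.ols(modelspec, data=inliers).fit()

print(res_huber.params)
print(res_bisq.params)
print(res_ols.params)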
Finally, as a general caution:
The references on robust methods often warn that we should not use (outlier-)robust methods blindly. Using robust methods estimates a relationship based on the "inliers". But is our discarding or down-weighting of outliers justified? Or do we have missing non-linearities, missing variables, a mixture distribution, or different regimes?
I have some points that I need to classify. Given the collection of these points, I need to say which other (known) distribution they match best. For example, given the points in the top left distribution, my algorithm would have to say whether they are a better match to the 2nd, 3rd, or 4th distribution. (Here the bottom-left would be correct due to the similar orientations)
I have some background in Machine Learning, but I am no expert. I was thinking of using Gaussian Mixture Models, or perhaps Hidden Markov Models (as I have previously classified signatures with these- similar problem).
I would appreciate any help as to which approach to use for this problem. As background information, I am working with OpenCV and Python, so I would most likely not have to implement the chosen algorithm from scratch; I just want a pointer to which algorithms would be applicable to this problem.
Disclaimer: I originally wanted to post this on the Mathematics section of StackExchange, but I lacked the necessary reputation to post images. I felt that my point could not be made clear without showing some images, so I posted it here instead. I believe that it is still relevant to Computer Vision and Machine Learning, as it will eventually be used for object identification.
EDIT:
I read and considered some of the answers given below, and would now like to add some new information. My main reason for not wanting to model these distributions as a single Gaussian is that eventually I will also have to be able to discriminate between distributions. That is, there might be two different and separate distributions representing two different objects, and then my algorithm should be aware that only one of the two distributions represents the object that we are interested in.
I think this depends on where exactly the data comes from and what sort of assumptions you would like to make as to its distribution. The points above can easily be drawn even from a single Gaussian distribution, in which case the estimation of parameters for each one and then the selection of the closest match are pretty simple.
Alternatively you could go for the discriminative option, i.e. calculate whatever statistics you think may be helpful in determining the class a set of points belongs to and perform classification using SVM or something similar. This can be viewed as embedding these samples (sets of 2d points) in a higher-dimensional space to get a single vector.
Also, if the data is actually as simple as in this example, you could just do principal component analysis and match by the first eigenvector.
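A rough sketch of that last idea, assuming each point set is an (N, 2) NumPy array and matching is done by the angle between the leading eigenvectors of the covariance matrices (the function names here are made up):

import numpy as np

def principal_axis(points):
    # leading eigenvector of the covariance matrix of an (N, 2) point set
    cov = np.cov(points, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    return eigvecs[:, np.argmax(eigvals)]

def best_orientation_match(query, candidates):
    # index of the candidate whose principal axis is most aligned with the query's
    q = principal_axis(query)
    # use |cos(angle)| so opposite directions of the same axis count as aligned
    scores = [abs(np.dot(q, principal_axis(c))) for c in candidates]
    return int(np.argmax(scores))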
You could just fit the distributions to the data, determine the chi^2 deviation for each one, and look at an F-test. See for instance these notes on model fitting.
You might also want to consider non-parametric techniques (e.g. multivariate kernel density estimation on each of your new data sets) in order to compare the statistics or distances of the estimated distributions. In Python, gaussian_kde in scipy.stats is an implementation.
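For example, a sketch that scores each candidate point set by the average log-density its KDE assigns to the query points (names and shapes are hypothetical; gaussian_kde expects a (dims, N) array, hence the transposes):

import numpy as np
from scipy import stats

def kde_match(query, candidates):
    # return the index of the candidate point set whose KDE best explains `query`
    scores = []
    for cand in candidates:
        kde = stats.gaussian_kde(cand.T)  # cand is an (N, 2) array of points
        scores.append(np.mean(kde.logpdf(query.T)))
    return int(np.argmax(scores))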
I am aware of the existence of this, and this on this topic. However, I would like to finalize on an actual implementation in Python this time.
My only problem is that the elbow point seems to change across different runs of my code. Observe the two plots shown in this post: while they appear visually similar, the value of the elbow point changed significantly. Both curves were generated from an average of 20 different runs. Even then, there is a significant shift in the value of the elbow point. What precautions can I take to make sure that the value falls within a certain bound?
My attempt is shown below:
import collections

def elbowPoint(points):
    # discrete second derivative at each interior index
    secondDerivative = collections.defaultdict(lambda: 0)
    for i in range(1, len(points) - 1):
        secondDerivative[i] = points[i + 1] + points[i - 1] - 2 * points[i]
    # position of the largest second derivative (dict values keep insertion order in Python 3.7+)
    values = list(secondDerivative.values())
    max_index = values.index(max(values))
    elbow_point = max_index + 1  # shift back to an index into `points`
    return elbow_point
points = [0.80881476685027154, 0.79457906121371058, 0.78071124401504677, 0.77110686192601441, 0.76062373158581287, 0.75174963969985187, 0.74356408965979193, 0.73577573557299236, 0.72782434749305047, 0.71952590556748364, 0.71417942487824781, 0.7076502559300516, 0.70089375208028415, 0.69393584640497064, 0.68550490458450741, 0.68494440529025913, 0.67920157634796108, 0.67280267176628761]
max_point = elbowPoint(points)
It sounds like your actual concern is how to smooth your data, since it contains noise. In that case, perhaps you should fit a curve to the data first and then find the elbow of the fitted curve?
Whether this will work depends on the source of the noise, and on whether the noise is important for your application. By the way, you may want to see how sensitive your fit is to your data by seeing how it changes (or, hopefully, doesn't) when a point is omitted from the fit (obviously, with a high enough polynomial degree you will always get a good fit to a specific set of data, but you are presumably interested in the general case).
I have no idea if this approach is acceptable; intuitively, though, I'd think that sensitivity to small errors is bad. Ultimately, by fitting a curve you are saying that the underlying process is, in the ideal case, modelled by the curve, and any deviation from the curve is error/noise.
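A rough sketch of that suggestion, reusing the points list from the question (the polynomial degree of 4 is an arbitrary choice, not a recommendation):

import numpy as np

def elbow_of_fitted_curve(points, degree=4):
    # fit a low-degree polynomial, then take the index with the largest
    # discrete second derivative of the smoothed values
    x = np.arange(len(points))
    coeffs = np.polyfit(x, points, degree)
    smoothed = np.polyval(coeffs, x)
    second_derivative = smoothed[2:] + smoothed[:-2] - 2 * smoothed[1:-1]
    return int(np.argmax(second_derivative)) + 1  # +1: interior indices start at 1

# usage with the `points` list from the question:
# elbow = elbow_of_fitted_curve(points, degree=4)

Checking how much the returned index moves when a single point is dropped (as suggested above) would give a feel for how sensitive the result still is.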