Removing outliers in data for a NN, good or bad idea? - python

I have some data that contains outliers. My data, however, has a direction to it and trends that I need to consider when looking for outliers. What counts as an outlier is not simply a yes-or-no answer. The only thing I can say is that the farther a data point is from the trend, the more likely it is an outlier I would like to exclude from my data.
Given that things like standard deviation, linear regressions, and the chunk of data I am looking at all depend on context, there is no static function I know of to determine whether something is an outlier.
I can select reasonable outliers using various techniques, but the problem is that any time you remove outliers, you are using the context of the data you are picking them from.
I know that when you prepare your data for a NN, the data always has to be prepared the exact same way; that is, it goes through a set of static processes/functions. The techniques used to select outliers require context, and context changes, so the function changes. I am not sure whether the differences in how an outlier is selected are enough to throw off the integrity of the model.
If this is true, are there any good static methods to select an outlier?

A model-independent way of selecting outliers is based upon the distribution of errors. This boils down to:
Fit the model with all data points
Calculate the residual error for each data point
Eliminate outliers based on some threshold
Re-fit the model from scratch with outliers removed
(Optionally repeat until a termination condition is met, e.g. no outliers are removed)
The threshold of elimination is problem- and metric-dependent. One approach to outlier elimination is computing a z-score on the residual errors (subtract the mean and divide by the standard deviation of the residual errors) and then removing any points whose absolute z-score is greater than a defined threshold (which equates to the number of standard deviations from the mean at which points are identified as outliers).
https://en.wikipedia.org/wiki/Standard_score
This is a general, model-independent approach that assumes residuals are normally-distributed (or at least that outliers can be reasonably identified based on relative error).
If you have other assumptions regarding the distribution of the residual, you can apply other probabilistic criteria (e.g. fit a distribution on the residual errors, then apply a probabilistic threshold for each point). This is more involved though, and if you don't have any belief a priori about the characteristics of the residual error distribution (other than "large errors are likely outliers") then z-score is the way to go.
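A minimal sketch of this procedure (assuming NumPy arrays and any scikit-learn-style regressor; the choice of LinearRegression and the threshold of 3 standard deviations are placeholders, not part of the original answer):

    import numpy as np
    from sklearn.linear_model import LinearRegression  # any scikit-learn regressor would do

    def drop_residual_outliers(X, y, model, z_threshold=3.0):
        """Fit, z-score the residuals, drop points beyond the threshold, then re-fit."""
        model.fit(X, y)
        residuals = y - model.predict(X)
        z = (residuals - residuals.mean()) / residuals.std()
        keep = np.abs(z) <= z_threshold
        model.fit(X[keep], y[keep])  # re-fit from scratch with the outliers removed
        return model, keep

    # Hypothetical usage:
    # model, keep_mask = drop_residual_outliers(X_train, y_train, LinearRegression())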
The foregoing discusses how to identify outliers, but doesn't address whether you should. This is an application-dependent question. If outliers are not informative of behavior you want to model, then they can be removed from training. However, if you want your model to predict average (or other metric-optimizing) behavior inclusive of outliers, then they should be retained.

Related

How to perform statistical tests in time series applications

I received feedback on my paper about stock market forecasting with machine learning, and the reviewer asked the following:
I would like you to statistically test the out-of-sample performance
of your methods. Hence 'differ significantly' in the original wording.
I agree that some of the figures look awesome visually, but visually,
random noise seems to contain patterns. I believe Sortino Ratio is the
appropriate statistic to test, and it can be tested by using
bootstrap. I.e., a distribution is obtained for both BH and your
strategy, and the overlap of these distributions is calculated.
My problem is that I have never done that for time series data. My validation procedure uses a strategy called walk-forward, where I shift the data in time 11 times, generating 11 different combinations of training and test sets with no overlap. So here are my questions:
1- What would be the best (or most appropriate) statistical test to use, given what the reviewer is asking?
2- If I remember correctly, statistical tests require vectors as input, is that correct? Can I generate a vector containing 11 Sortino ratio values (one for each walk) and then compare them with baselines? Or should I run my code more than once? I am afraid the latter would be infeasible given the short time to review.
So, what would be the correct way to compare machine learning approaches statistically in this time series scenario?
When the reviewer points out that random noise seems to contain patterns, they mean that your plots show nice patterns, but those patterns might just be random noise following some distribution (e.g. uniform random noise), which makes the visual evidence less convincing. It might be a good idea to split the data into k groups randomly, then apply a Z-test or t-test and compare the k groups pairwise.
The reviewer points to the Sortino ratio, which seems somewhat ambiguous given that you are building a machine learning model for a forecasting task: what you actually care about is forecasting accuracy and reliability, which cross-validation can help establish (in convex optimization, the analogous tool is sensitivity analysis).
Update
The problem of serial dependency in time series data arises when the data is non-stationary (weak patterns), which does not seem to be the case for your data. Even if it were, it could be addressed by removing the trends, i.e. converting the non-stationary time series into a stationary one (checking with the ADF test, for example), and you might also consider using ARIMA models.
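For reference, a minimal sketch of such a stationarity check (assuming a pandas Series named prices; the significance level of 0.05 and the single first difference are arbitrary choices, not prescribed by the answer):

    import pandas as pd
    from statsmodels.tsa.stattools import adfuller

    def make_stationary(series: pd.Series, alpha: float = 0.05) -> pd.Series:
        """Return the series unchanged if the ADF test suggests stationarity, else first-difference it."""
        stat, pvalue, *_ = adfuller(series.dropna())
        if pvalue < alpha:  # unit-root null rejected -> treat the series as stationary
            return series
        return series.diff().dropna()  # first differencing removes a trend

    # Hypothetical usage:
    # stationary_prices = make_stationary(prices)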
Time shifting can sometimes be useful, but it is not considered a good measure of noise; it might, however, help improve model accuracy by shifting the data and extracting some features (e.g. mean and variance over a window size).
There is nothing preventing you from trying the time-shifting approach, but you cannot rely on it as an accurate measurement, and you still need to back up your claims with statistical analysis using more robust techniques.
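Since the reviewer explicitly suggests a bootstrap comparison, here is a minimal sketch (assuming two 1-D NumPy arrays of returns, returns_bh and returns_strategy, and a simple Sortino ratio with a zero target; note that an i.i.d. bootstrap ignores serial dependence, so a block bootstrap would be more defensible for time series):

    import numpy as np

    def sortino(returns, target=0.0):
        """Simple Sortino ratio: mean excess return over downside deviation."""
        downside = returns[returns < target]
        if len(downside) == 0:
            return np.nan
        downside_dev = np.sqrt(np.mean((downside - target) ** 2))
        return (returns.mean() - target) / downside_dev

    def bootstrap_sortino(returns, n_boot=5000, seed=0):
        """Bootstrap distribution of the Sortino ratio by resampling returns with replacement."""
        rng = np.random.default_rng(seed)
        n = len(returns)
        return np.array([sortino(rng.choice(returns, size=n, replace=True))
                         for _ in range(n_boot)])

    # Hypothetical usage:
    # dist_bh = bootstrap_sortino(returns_bh)
    # dist_strategy = bootstrap_sortino(returns_strategy)
    # p_better = np.mean(dist_strategy[:, None] > dist_bh[None, :])  # P(strategy beats BH)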

Evaluating ARIMA models with the AIC

Having come across ARIMA/seasonal ARIMA recently, I am wondering why the AIC is chosen as a criterion for how suitable a model is. According to Wikipedia, it evaluates the goodness of the fit while penalizing non-parsimonious models in order to prevent overfitting. Many grid-search functions, such as auto_arima in Python or R, use it as an evaluation metric and suggest the model with the lowest AIC as the best fit.
However, in my case, choosing a simple model (with the lowest AIC -> a small number of parameters) just results in a model that strongly follows previous in-sample observations and performs very badly on the test data. I don't see how overfitting is prevented just by choosing a small number of parameters...
ARIMA(1,0,1)(0,0,0,53); AIC=-16.7
Am I misunderstanding something? What could be a workaround to prevent this?
In the case of an ARIMA model, whatever its parameters are, it will follow past observations, in the sense that you predict the next values given previous values from your data. Now, auto.arima just tries some models and gives you the one with the lowest AIC by default (or some other information criterion, e.g. BIC). This does not mean anything more than what the definition of that criterion says: the model with the lowest AIC is simply the one that minimizes the AIC function. In time series analysis, after you make sure the series is stationary, I would recommend that you examine the ACF and PACF plots of your time series and read this
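A minimal sketch of inspecting those plots (assuming a reasonably long pandas Series y; statsmodels and matplotlib are assumed to be installed):

    import matplotlib.pyplot as plt
    from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

    fig, axes = plt.subplots(2, 1, figsize=(8, 6))
    plot_acf(y, lags=40, ax=axes[0])   # spikes here hint at MA terms
    plot_pacf(y, lags=40, ax=axes[1])  # spikes here hint at AR terms
    plt.tight_layout()
    plt.show()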
P.S. I don't get this straight orange line in your plot after the dashed vertical line.
We usually use some form of cross-validation to protect against overfitting. It is well known that leave-one-out cross-validation is asymptotically equivalent to AIC under some assumptions about normality etc. Indeed, back when we had less computing power, AIC and other information criteria were handy exactly because they accomplish something very similar to cross-validation analytically.
Also, note that by their nature ARMA(1,1) models -- and other stationary ARMA models for that matter -- tend to converge to a constant rather quickly. The easiest way to see this is to write down the expressions for y_{t+1}, y_{t+2} as a function of y_t. You will see that the expression contains powers of numbers less than 1 in absolute value (your AR and MA parameters), which quickly converge to zero as the forecast horizon grows. For example, for an AR(1) process with mean mu and coefficient phi, the h-step-ahead forecast is mu + phi^h (y_t - mu), which approaches mu as h increases. Also see this discussion.
The reason why your 'observed' data (to the left of the dashed line) does not exhibit this behaviour is that for each period you get a new realisation of the random error term epsilon_t. On the right-hand side, you do not get these realisations of random shocks; instead they are replaced by their expected value, 0.
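To see the convergence concretely, here is a minimal sketch on synthetic data (the AR coefficient 0.6, the series length, and the forecast horizon are arbitrary choices for illustration):

    import numpy as np
    from statsmodels.tsa.arima.model import ARIMA

    rng = np.random.default_rng(0)
    y = rng.normal(size=300)
    for t in range(1, 300):          # simulate an AR(1) with coefficient 0.6
        y[t] += 0.6 * y[t - 1]

    fit = ARIMA(y, order=(1, 0, 1)).fit()
    forecast = fit.forecast(steps=50)
    print(forecast[:5])              # early steps still depend on the last observations
    print(forecast[-5:])             # later steps have flattened towards the series mean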

Data Standardization vs Normalization vs Robust Scaler

I am working on data preprocessing and want to compare the benefits of Data Standardization vs Normalization vs Robust Scaler practically.
In theory, the guidelines are:
Advantages:
Standardization: scales features such that the distribution is centered around 0, with a standard deviation of 1.
Normalization: shrinks the range such that the range is now between 0 and 1 (or -1 to 1 if there are negative values).
Robust Scaler: similar to normalization but it instead uses the interquartile range, so that it is robust to outliers.
Disadvantages:
Standardization: not good if the data is not normally distributed (i.e. not Gaussian-distributed).
Normalization: gets influenced heavily by outliers (i.e. extreme values).
Robust Scaler: does not bound the data to a fixed range and only focuses on the part where the bulk of the data is.
I created 20 random numerical inputs and tried the above-mentioned methods (numbers in red color represent the outliers):
I noticed that, indeed, normalization was affected negatively by the outliers and the spread between the new values became tiny (all values almost identical up to six digits after the decimal point, 0.000000x), even though there are noticeable differences between the original inputs!
My questions are:
Am I right to say that standardization also gets affected negatively by the extreme values? If not, why, given the results shown?
I really can't see how the Robust Scaler improved the data, because I still have extreme values in the resulting data set. Any simple, complete interpretation?
Am I right to say that standardization also gets affected negatively by the extreme values?
Indeed you are; the scikit-learn docs themselves clearly warn about such a case:
However, when data contains outliers, StandardScaler can often be mislead. In such cases, it is better to use a scaler that is robust against outliers.
More or less, the same holds true for the MinMaxScaler as well.
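A quick way to see this effect is a minimal sketch with made-up numbers, where 1000 plays the role of the outlier:

    import numpy as np
    from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

    x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [1000.0]])  # 1000 is the outlier

    for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
        print(type(scaler).__name__, scaler.fit_transform(x).ravel().round(3))
    # StandardScaler and MinMaxScaler squash the inliers together because the
    # mean/std and min/max are dominated by the outlier; RobustScaler keeps the
    # inliers spread out, although the outlier itself is still present.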
I really can't see how the Robust Scaler improved the data, because I still have extreme values in the resulting data set. Any simple, complete interpretation?
Robust does not mean immune, or invulnerable, and the purpose of scaling is not to "remove" outliers and extreme values - this is a separate task with its own methodologies; this is again clearly mentioned in the relevant scikit-learn docs:
RobustScaler
[...] Note that the outliers themselves are still present in the transformed data. If a separate outlier clipping is desirable, a non-linear transformation is required (see below).
where the "see below" refers to the QuantileTransformer and quantile_transform.
None of them are robust in the sense that the scaling alone will take care of outliers and confine them to a bounded scale, that is, in the sense that no extreme values will appear afterwards.
You can consider options like:
Clipping the series/array (say, between the 5th and 95th percentiles) before scaling, as in the sketch after this list
Applying transformations like square root or logarithm, if clipping is not ideal
Obviously, adding another column such as 'is clipped' or 'logarithmic clipped amount' will reduce information loss.
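A minimal sketch of the clipping option (assuming a 1-D NumPy array x; the 5th/95th percentiles and the use of StandardScaler afterwards are arbitrary choices):

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    lo, hi = np.percentile(x, [5, 95])
    clipped = np.clip(x, lo, hi)
    is_clipped = ((x < lo) | (x > hi)).astype(float)  # extra 'is clipped' indicator column

    features = np.column_stack([clipped, is_clipped])
    scaled = StandardScaler().fit_transform(features)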

Principal Component Analysis with correlated features and outliers

I am performing PCA on a dataset of shape (300, 1500) using scikit-learn in Python 3.
I have the following questions in the context of the PCA implementation in scikit-learn and the generally accepted approach.
1) Before doing PCA, do I remove highly correlated columns? I have 67 columns which have correlation > 0.9. Does PCA automatically handle this correlation, i.e. ignore it?
2) Do I need to remove outliers before performing PCA?
3) If I have to remove outliers, how best to approach this? When I tried to remove outliers using a z-score for each column (z-score > 3), I was left with only 15 observations. It seems like the wrong approach.
4) Finally, is there an ideal amount of cumulative explained variance which I should use to choose the number of components? In this case, around 150 components give me 90% cumulative explained variance.
With regard to using PCA: PCA will discover the axes of greatest variance in your data. Consequently:
No, you do not need to remove correlated features.
You shouldn't need to remove outliers for any a priori reason related to PCA. That said, if you think they are potentially manipulating your results either for analysis or prediction you could consider removing them, although I don't think they are a problem for PCA per se.
That is probably not the right approach. First things first: visualize your data and look for your outliers. Also, I wouldn't assume the distribution of your data and apply a basic z-score to it. Some googling on criteria for removing outliers would be useful here.
There are various cutoffs people use with PCA. 99% can be quite common, although I don't know if there is a hard and fast rule. If your goal is prediction, then there will probably be a trade-off between speed and the accuracy of your predictions. You will need to find the cutoff that suits your needs.
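A minimal sketch of choosing the number of components from the cumulative explained variance (assuming the (300, 1500) array is called X; the 90% cutoff is just the figure mentioned in the question):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    X_std = StandardScaler().fit_transform(X)  # PCA is scale-sensitive, so standardize first
    pca = PCA().fit(X_std)
    cumvar = np.cumsum(pca.explained_variance_ratio_)
    n_components = int(np.searchsorted(cumvar, 0.90) + 1)  # smallest n reaching 90%
    print(n_components)
    # Equivalently, PCA(n_components=0.90) selects the number of components automatically.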

Python statsmodel robust linear regression (RLM) outlier selection

I'm analyzing a set of data and I need to find the regression for it. The number of data points in the dataset is low (~15), so I decided to use robust linear regression for the job. The problem is that the procedure selects some points as outliers that do not seem to be that influential. Here is a scatter plot of the data, with influence used as the point size:
Points B and C (shown with red circles in the figure) are selected as outliers, while point A, which has much higher influence, is not. Although point A does not change the general trend of the regression, it basically defines the slope along with the point with the highest X, whereas points B and C only affect the significance of the slope. So my question has two parts:
1) What is the RLM package's method for selecting outliers if the most influential point is not selected, and do you know of other packages that have the kind of outlier selection I have in mind?
2) Do you think that point A is an outlier?
RLM in statsmodels is limited to M-estimators. The default Huber norm is only robust to outliers in y, but not in x, i.e. not robust to bad influential points.
See for example http://www.statsmodels.org/devel/examples/notebooks/generated/robust_models_1.html
line In [51] and after.
Redescending norms like bisquare are able to remove bad influential points, but the solution is a local optimum and needs appropriate starting values. Methods that have a high breakdown point and are robust to x outliers, like LTS, are currently not available in statsmodels, nor, AFAIK, anywhere else in Python. R has a more extensive suite of robust estimators that can handle these cases. Some extensions to add more methods and models to statsmodels.robust are in currently stalled pull requests.
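For illustration, a minimal sketch (made-up data; the coefficients, noise level, and injected outlier are placeholders) comparing the default Huber norm with the redescending bisquare (TukeyBiweight) norm in statsmodels RLM:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, 15)
    y = 2.0 * x + rng.normal(scale=1.0, size=15)
    y[3] += 15  # inject an outlier in y

    X = sm.add_constant(x)
    huber_fit = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()
    bisquare_fit = sm.RLM(y, X, M=sm.robust.norms.TukeyBiweight()).fit()

    print(huber_fit.params, bisquare_fit.params)
    print(bisquare_fit.weights.round(2))  # near-zero weights flag heavily downweighted points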
In general and to answer the second part of the question:
In specific cases it is often difficult to declare or identify an observation as an outlier. Very often researchers use robust methods to flag outlier candidates that need further investigation. One reason, for example, could be that the "outliers" were sampled from a different population. A purely mechanical, statistical identification might not be appropriate in many cases.
In this example: If we fit a steep slope and drop point A as an outlier, then points B and C might fit reasonably well and are not identified as outliers. On the other hand, if A is a reasonable point based on extra information, then maybe the relationship is nonlinear.
My guess is that LTS will declare A as the only outlier and fit a steep regression line.
