Data Standardization vs Normalization vs Robust Scaler


I am working on data preprocessing and want to compare the benefits of Data Standardization vs Normalization vs Robust Scaler practically.
In theory, the guidelines are:
Advantages:
Standardization: scales features such that the distribution is centered around 0, with a standard deviation of 1.
Normalization: shrinks the range such that the range is now between 0 and 1 (or -1 to 1 if there are negative values).
Robust Scaler: similar to normalization but it instead uses the interquartile range, so that it is robust to outliers.
Disadvantages:
Standardization: not good if the data is not normally distributed (i.e. no Gaussian Distribution).
Normalization: get influenced heavily by outliers (i.e. extreme values).
Robust Scaler: doesn't take the median into account and only focuses on the parts where the bulk data is.
I created 20 random numerical inputs and tried the above-mentioned methods (numbers in red color represent the outliers):
I noticed that Normalization was indeed affected negatively by the outliers: the differences between the new values became tiny (all values almost identical up to 6 digits after the decimal point, 0.000000x), even though there are noticeable differences between the original inputs!
My questions are:
Am I right to say that Standardization also gets affected negatively by the extreme values? If not, why, according to the result provided?
I really can't see how the Robust Scaler improved the data, because I still have extreme values in the resulting data set. Any simple, complete interpretation?

Am I right to say that Standardization also gets affected negatively by the extreme values?
Indeed you are; the scikit-learn docs themselves clearly warn about such a case:
However, when data contains outliers, StandardScaler can often be mislead. In such cases, it is better to use a scaler that is robust against outliers.
More or less, the same holds true for the MinMaxScaler as well.
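To make the effect concrete, here is a minimal sketch on made-up data (19 ordinary values plus one extreme value; these numbers are an assumption, not the asker's inputs). It measures how much of the transformed scale the ordinary values still occupy after StandardScaler and MinMaxScaler:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical data: 19 ordinary values plus one extreme value at the end.
rng = np.random.default_rng(0)
inliers = rng.uniform(10, 100, size=19)
x = np.append(inliers, 1_000_000.0).reshape(-1, 1)

standardized = StandardScaler().fit_transform(x).ravel()
normalized = MinMaxScaler().fit_transform(x).ravel()

# Spread (max - min) of the 19 ordinary values after each transformation.
std_inlier_range = standardized[:-1].max() - standardized[:-1].min()
minmax_inlier_range = normalized[:-1].max() - normalized[:-1].min()

print(f"StandardScaler inlier spread: {std_inlier_range:.6f}")
print(f"MinMaxScaler inlier spread:   {minmax_inlier_range:.6f}")
```

In both cases the single extreme value inflates the statistic used for scaling (the standard deviation or the range), so the ordinary values end up squeezed into a very narrow band.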
I really can't see how the Robust Scaler improved the data, because I still have extreme values in the resulting data set. Any simple, complete interpretation?
Robust does not mean immune, or invulnerable, and the purpose of scaling is not to "remove" outliers and extreme values - this is a separate task with its own methodologies; this is again clearly mentioned in the relevant scikit-learn docs:
RobustScaler
[...] Note that the outliers themselves are still present in the transformed data. If a separate outlier clipping is desirable, a non-linear transformation is required (see below).
where the "see below" refers to the QuantileTransformer and quantile_transform.
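As a rough illustration of that note, the sketch below uses the same kind of made-up data (19 ordinary values plus one extreme value, an assumption for demonstration only). RobustScaler keeps the ordinary values on a usable scale but leaves the outlier far away, while QuantileTransformer, being non-linear, maps everything into [0, 1]:

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer, RobustScaler

# Hypothetical data: 19 ordinary values plus one extreme value at the end.
rng = np.random.default_rng(0)
inliers = rng.uniform(10, 100, size=19)
x = np.append(inliers, 1_000_000.0).reshape(-1, 1)

# RobustScaler subtracts the median and divides by the interquartile range.
robust = RobustScaler().fit_transform(x).ravel()

# n_quantiles must not exceed the number of samples.
quantile = QuantileTransformer(
    n_quantiles=len(x), output_distribution="uniform", random_state=0
).fit_transform(x).ravel()

robust_inlier_range = robust[:-1].max() - robust[:-1].min()
print(f"RobustScaler inlier spread:   {robust_inlier_range:.3f}")
print(f"RobustScaler outlier value:   {robust[-1]:.1f}")
print(f"QuantileTransformer outlier:  {quantile[-1]:.3f}")
```

The benefit of RobustScaler is therefore visible in the ordinary values (their spread is preserved), not in the outlier itself, which is still present and still extreme.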

None of them is robust in the sense that the scaling will take care of outliers and put them on a confined scale, that is, none guarantees that no extreme values will appear.
You can consider options like:
Clipping (say, between the 5th and 95th percentiles) the series/array before scaling
Taking transformations like square-root or logarithms, if clipping is not ideal
Obviously, adding another column 'is clipped'/'logarithmic clipped amount' will reduce information loss.
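The options above could look roughly like the following NumPy sketch. The data, the 5/95 percentile bounds, and the array names (clipped, is_clipped, log_scaled) are illustrative assumptions:

```python
import numpy as np

# Hypothetical data: 98 ordinary values plus two extreme values at the end.
rng = np.random.default_rng(0)
values = np.concatenate([rng.uniform(10, 100, size=98), [50_000.0, 1_000_000.0]])

# Option 1: clip to the 5th and 95th percentiles before scaling.
low, high = np.percentile(values, [5, 95])
clipped = np.clip(values, low, high)

# Extra indicator "column" recording which entries were changed by clipping.
is_clipped = clipped != values

# Option 2: a logarithmic transform (log1p assumes non-negative values).
log_scaled = np.log1p(values)

print(f"clipped range: [{clipped.min():.2f}, {clipped.max():.2f}]")
print(f"entries clipped: {int(is_clipped.sum())}")
```

Note that with very few samples a high percentile can itself be pulled towards the outlier by interpolation, so the bounds are worth checking before using them.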

Related

Removing outlier in data for NN, good or bad idea?

I have some data that has some outliers. My data, however, has a direction to it and has trends that I need to consider when looking for outliers. Whether something is an outlier, however, is not a simple yes-or-no answer. The only thing I can say is that the farther a data point is from the trend, the more likely it is an outlier that I would not like to include in my data.
Given that things like standard deviation, linear regressions, and the chunk of data I am looking at all depend on context, there is no static function I know of to determine whether something is an outlier.
I can select good outliers using various techniques, but the problem is that any time you get rid of outliers, you are using the context of the data you are picking the outliers from.
I know that when you prepare your data for a NN, the data always has to be prepared in exactly the same way. That is, it goes through a set of static processes/functions. The techniques used to select outliers require context, and context changes, so the function changes. I am not sure whether the differences in how an outlier is selected are enough to throw off the integrity of the model.
If this is true, are there any good static methods to select an outlier?

A model-independent way of selecting outliers is based upon the distribution of errors. This boils down to:
Fit the model with all data points
Calculate the residual error for each data point
Eliminate outliers based on some threshold
Re-fit the model from scratch with outliers removed
(Optionally repeat until a termination condition is met, e.g. no outliers are removed)
The threshold of elimination is problem- and metric-dependent. One approach to outlier elimination is computing a z-score on the residual errors (subtract the mean and divide by the standard deviation of the residual errors) and then removing any points whose absolute z-score is greater than a defined threshold (which equates to the number of standard deviations from the mean at which points are identified as outliers).
https://en.wikipedia.org/wiki/Standard_score
This is a general, model-independent approach that assumes residuals are normally-distributed (or at least that outliers can be reasonably identified based on relative error).
If you have other assumptions regarding the distribution of the residual, you can apply other probabilistic criteria (e.g. fit a distribution on the residual errors, then apply a probabilistic threshold for each point). This is more involved though, and if you don't have any belief a priori about the characteristics of the residual error distribution (other than "large errors are likely outliers") then z-score is the way to go.
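The fit / residual / z-score / re-fit loop described above can be sketched as follows. This is only an illustration: the straight-line model fitted with np.polyfit stands in for whatever model you actually use, and the function name remove_residual_outliers, the threshold of 3, and the synthetic data are all assumptions.

```python
import numpy as np

def remove_residual_outliers(x, y, z_threshold=3.0, max_rounds=10):
    """Iteratively drop points whose residual z-score exceeds z_threshold."""
    keep = np.ones(len(x), dtype=bool)
    for _ in range(max_rounds):
        # Fit the (stand-in) model on the points currently kept.
        slope, intercept = np.polyfit(x[keep], y[keep], deg=1)
        residuals = y - (slope * x + intercept)
        sigma = residuals[keep].std()
        if sigma == 0:
            break
        # z-score of every residual relative to the kept residuals.
        z = np.abs((residuals - residuals[keep].mean()) / sigma)
        new_keep = keep & (z <= z_threshold)
        if new_keep.sum() == keep.sum():
            break  # termination condition: no outliers were removed
        keep = new_keep
    # Final fit from scratch on the retained points.
    slope, intercept = np.polyfit(x[keep], y[keep], deg=1)
    return keep, slope, intercept

# Hypothetical data: a noisy line with three injected outliers.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)
y[[10, 50, 80]] += 30.0

keep, slope, intercept = remove_residual_outliers(x, y)
print(f"kept {keep.sum()} of {keep.size} points, slope={slope:.3f}")
```

Because the threshold and the model are fixed in advance, the same procedure can be applied unchanged every time the data is prepared.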
The foregoing discusses how to identify outliers, but doesn't address whether you should. This is an application-dependent question. If outliers are not informative of behavior you want to model, then they can be removed from training. However, if you want your model to predict average (or other metric-optimizing) behavior inclusive of outliers, then they should be retained.

what's so special about 80% threshold for PCA variance ratio?

Why does 80% of the PCA.explained_variance_ratio_ seem like a reasonable threshold? What can one say about the number of components required to explain 80% of the variance?
According to the PCA documentation,
auto:
the solver is selected by a default policy based on X.shape and n_components: if the input data is larger than 500x500 and the number of components to extract is lower than 80% of the smallest dimension of the data, then the more efficient ‘randomized’ method is enabled. Otherwise the exact full SVD is computed and optionally truncated afterwards.
Ok, I'm not sure if I'm even making sense, but it seems like 80% is a good threshold, but why? I tried looking this up, but it didn't amount to much.

KMeans clustering unbalanced data

I have a set of data with 50 features (c1, c2, c3 ...), with over 80k rows.
Each row contains normalised numerical values (ranging 0-1). It is actually a normalised dummy variable, whereby some rows have only few features, 3-4 (i.e. 0 is assigned if there is no value). Most rows have about 10-20 features.
I used KMeans to cluster the data, always resulting in one cluster with a large number of members. Upon analysis, I noticed that rows with fewer than 4 features tend to get clustered together, which is not what I want.
Is there any way to balance out the clusters?

It is not part of the k-means objective to produce balanced clusters. In fact, solutions with balanced clusters can be arbitrarily bad (just consider a dataset with duplicates). K-means minimizes the sum-of-squares, and putting these objects into one cluster seems to be beneficial.
What you see is the typical effect of using k-means on sparse, non-continuous data. Encoded categorical variables, binary variables, and sparse data just are not well suited to k-means' use of means. Furthermore, you'd probably need to carefully weight variables, too.
Now a hotfix that will likely improve your results (at least the perceived quality, because I do not think it makes them statistically any better) is to normalize each vector to unit length (Euclidean norm 1). This will emphasize the nonzero values in rows with few nonzero entries. You'll probably like the results more, but they are even harder to interpret.
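A minimal sketch of that hotfix, using scikit-learn's normalize on a made-up sparse matrix (the data, the number of clusters, and the random seeds are assumptions chosen only for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

# Hypothetical stand-in for the data: 1000 rows, 50 features in [0, 1],
# with roughly 80% of the entries set to zero.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(1000, 50)) * (rng.uniform(size=(1000, 50)) < 0.2)

# All-zero rows cannot be scaled to unit length, so drop them first.
X = X[X.any(axis=1)]

# Scale every row to Euclidean norm 1, then cluster the scaled rows.
X_unit = normalize(X, norm="l2")
labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(X_unit)

print(np.bincount(labels))  # cluster sizes
```

On unit-length vectors, squared Euclidean distance is a monotone function of cosine similarity, so rows are compared by which features are present rather than by how many.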

Principal Component Analysis with correlated features and outliers

I am performing PCA on a dataset of shape (300, 1500) using scikit-learn in Python 3.
I have the following questions in the context of the PCA implementation in scikit-learn and the generally accepted approach.
1) Before doing PCA, do I remove highly correlated columns? I have 67 columns which have correlation > 0.9. Does PCA automatically handle this correlation, i.e. ignore them?
2) Do I need to remove outliers before performing PCA?
3) If I have to remove outliers, how best to approach this? When I tried to remove outliers using a z-score for each column (z-score > 3), I was left with only 15 observations. It seems like the wrong approach.
4) Finally, is there an ideal amount of cumulative explained variance which I should be using to choose P components? In this case around 150 components give me 90% cumulative explained variance.

With regards to using PCA, PCA will discover the axes of greatest variance in your data. Consequently:
1) No, you do not need to remove correlated features.
2) You shouldn't need to remove outliers for any a priori reason related to PCA. That said, if you think they are potentially manipulating your results, either for analysis or prediction, you could consider removing them, although I don't think they are a problem for PCA per se.
3) That is probably not the right approach. First, visualize your data and look for your outliers. Also, I wouldn't assume a distribution for your data and apply a basic z-score to it. Some googling on criteria for removing outliers would be useful here.
4) There are various cutoffs people use with PCA. 99% can be quite common, although I don't know if there is a hard and fast rule. If your goal is prediction, then there will probably be a trade-off between speed and the accuracy of your predictions. You will need to find the cutoff that suits your needs.
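To explore different cutoffs, the cumulative explained variance can be inspected directly. The sketch below uses made-up data with strongly correlated columns (60 columns driven by 5 latent factors, an assumption standing in for the real 300 x 1500 dataset) and a 90% cutoff chosen only as an example:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 300 rows, 60 highly correlated columns built from
# 5 latent factors plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 5))
X = latent @ rng.normal(size=(5, 60)) + 0.1 * rng.normal(size=(300, 60))

# Fit all components and accumulate their explained variance ratios.
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio exceeds the cutoff.
n_for_90 = int(np.searchsorted(cumulative, 0.90, side="right") + 1)

# A float n_components in (0, 1) asks scikit-learn to choose the count itself.
pca_90 = PCA(n_components=0.90, svd_solver="full").fit(X)

print(n_for_90, pca_90.n_components_)
```

The correlated columns need no special handling here: they simply collapse onto a few leading components, which is why only a handful are needed to reach the cutoff.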

scikit-learn PCA remove common signals

Traditionally PCA is used to reduce dimensionality (I believe) but I want to use it to remove trends.
My use case is that I have lots of time series (star brightnesses) and want to remove spurious signals which are present in lots of these time series. I believe PCA can be used for determining these basis functions, but how can I then remove them from the full dataset?
I have tried e.g.
pca = PCA(n_components=4)
pca.fit(lightcurves)
detrended = lightcurves / (pca.components_ * pca.explained_variance_ratio_)
Is this possible with PCA?
For example, I have a lot of time series which have similar features, such as the one below. Every night the star gets brighter towards the highest point of its positioning on the sky, and then gets fainter. Each object undergoes this (it's an artefact of looking through the atmosphere) so surely it's something PCA can pick out, such that it can be removed?
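One common reading of "removing" the leading components is to subtract the part of each light curve that the first few components reconstruct, rather than dividing by them. The following is only a sketch under stated assumptions: lightcurves is taken to have shape (n_stars, n_times), the synthetic data is invented, and the choice of 4 components is carried over from the attempt above.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 200 light curves sharing one nightly trend plus noise.
rng = np.random.default_rng(0)
n_stars, n_times = 200, 300
t = np.linspace(0, 1, n_times)
common_trend = np.sin(np.pi * t)  # brightening towards mid-night
amplitudes = rng.uniform(0.5, 1.5, size=(n_stars, 1))
lightcurves = amplitudes * common_trend + 0.05 * rng.normal(size=(n_stars, n_times))

# Each component is a basis function over time shared by many stars.
pca = PCA(n_components=4)
scores = pca.fit_transform(lightcurves)

# Reconstruction from the leading components (this includes pca.mean_).
common_part = pca.inverse_transform(scores)

# Subtract the shared part; what remains is specific to each star.
detrended = lightcurves - common_part

print(f"std before: {lightcurves.std():.3f}, after: {detrended.std():.3f}")
```

Any genuine signal that happens to be shared across many stars is removed as well, so the number of components is a trade-off that has to be judged on the real data.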
