what's so special about 80% threshold for PCA variance ratio? - python

Why does 80% of the PCA.explained_variance_ratio_ seem like a reasonable threshold? What can one say about the number of components required to explain 80% of the variance?
According to the PCA documentation,
auto:
the solver is selected by a default policy based on X.shape and n_components: if the input data is larger than 500x500 and the number of components to extract is lower than 80% of the smallest dimension of the data, then the more efficient ‘randomized’ method is enabled. Otherwise the exact full SVD is computed and optionally truncated afterwards.
Ok, I'm not sure if I'm even making sense, but it seems like 80% is a good threshold, but why? I tried looking this up, but it didn't amount to much.
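For illustration, here is a minimal sketch (the random data matrix and variable names are placeholders, not anything from the question) of how one can check how many components are needed to reach a cumulative explained-variance threshold such as 80% in scikit-learn:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(500, 50)  # placeholder data matrix

pca = PCA().fit(X)  # keep all components
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components_80 = np.searchsorted(cumulative, 0.80) + 1
print(n_components_80, "components explain >= 80% of the variance")

# PCA also accepts a float n_components and picks the count itself:
pca_80 = PCA(n_components=0.80).fit(X)
print(pca_80.n_components_)

Note that 80% of the cumulative variance ratio is just a conventional rule of thumb; it is unrelated to the 80% in the svd_solver='auto' policy quoted above, which only governs how the solver is chosen, not how many components to keep.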

Related

Removing outlier in data for NN, good or bad idea?

I have some data that has some outliers. My data, however, has a direction to it and has trends that I need to consider when looking for outliers. What an outlier is, however, is not simply a yes or no answer. The only thing I can say is that the farther away a data point is from the trend, the more likely it is an outlier that I would not like to include in my data.
Given that things like standard deviation, linear regressions, and the chunk of data I am looking at all depend on context, there is no static function I know of to determine whether something is an outlier or not.
I can select outliers well using various techniques, but the problem is that any time you get rid of outliers, you are using the context of the data you are picking the outlier from.
I know that when you prepare your data for a NN, the data always has to be prepared the exact same way. That is, it goes through a set of static processes/functions. The techniques used to select outliers require context, and context changes, so the function changes. I am not sure whether the differences in how an outlier is selected are enough to throw off the integrity of the model.
If this is true, are there any good static methods to select an outlier?
A model-independent way of selecting outliers is based upon the distribution of errors. This boils down to:
Fit the model with all data points
Calculate the residual error for each data point
Eliminate outliers based on some threshold
Re-fit the model from scratch with outliers removed
(Optionally repeat until a termination condition is met, e.g. no outliers are removed)
The threshold of elimination is problem- and metric-dependent. One approach to outlier elimination is computing a z-score on the residual errors (subtract the mean and divide by the standard deviation of the residual errors) and then removing any points with an absolute value greater than a defined threshold (which equates to the number of standard deviations from the mean at which points are identified as outliers).
https://en.wikipedia.org/wiki/Standard_score
This is a general, model-independent approach that assumes residuals are normally-distributed (or at least that outliers can be reasonably identified based on relative error).
If you have other assumptions regarding the distribution of the residual, you can apply other probabilistic criteria (e.g. fit a distribution on the residual errors, then apply a probabilistic threshold for each point). This is more involved though, and if you don't have any belief a priori about the characteristics of the residual error distribution (other than "large errors are likely outliers") then z-score is the way to go.
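As a concrete sketch of the recipe above, using an ordinary linear regression as a stand-in model; the model choice, the synthetic data, and the threshold of 3 standard deviations are all assumptions for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression

def drop_outliers_by_residual(X, y, threshold=3.0):
    # 1. Fit the model with all data points
    model = LinearRegression().fit(X, y)
    # 2. Residual error for each data point
    residuals = y - model.predict(X)
    # 3. z-score the residuals and keep points within the threshold
    z = (residuals - residuals.mean()) / residuals.std()
    keep = np.abs(z) <= threshold
    # 4. Re-fit the model from scratch with outliers removed
    refit = LinearRegression().fit(X[keep], y[keep])
    return refit, keep

# usage on synthetic data with a few injected outliers
rng = np.random.RandomState(0)
X = rng.rand(200, 1)
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=200)
y[:5] += 5
model, keep_mask = drop_outliers_by_residual(X, y)
print((~keep_mask).sum(), "points flagged as outliers")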
The foregoing discusses how to identify outliers, but doesn't address whether you should. This is an application-dependent question. If outliers are not informative of behavior you want to model, then they can be removed from training. However, if you want your model to predict average (or other metric-optimizing) behavior inclusive of outliers, then they should be retained.

Data Standardization vs Normalization vs Robust Scaler

I am working on data preprocessing and want to compare the benefits of Data Standardization vs Normalization vs Robust Scaler practically.
In theory, the guidelines are:
Advantages:
Standardization: scales features such that the distribution is centered around 0, with a standard deviation of 1.
Normalization: shrinks the range such that the range is now between 0 and 1 (or -1 to 1 if there are negative values).
Robust Scaler: similar to normalization but it instead uses the interquartile range, so that it is robust to outliers.
Disadvantages:
Standardization: not good if the data is not normally distributed (i.e. not a Gaussian distribution).
Normalization: gets influenced heavily by outliers (i.e. extreme values).
Robust Scaler: uses only the median and interquartile range, so it ignores the tails and focuses only on where the bulk of the data is.
I created 20 random numerical inputs and tried the above-mentioned methods (numbers in red represent the outliers).
I noticed that, indeed, the Normalization got affected negatively by the outliers: the scale of the new values became tiny (all values almost identical, with six digits after the decimal point, 0.000000x) even though there are noticeable differences between the original inputs!
My questions are:
Am I right to say that Standardization also gets affected negatively by the extreme values? If not, why, according to the result provided?
I really can't see how the Robust Scaler improved the data, because I still have extreme values in the resulting data set. Any simple, complete interpretation?
Am I right to say that Standardization also gets affected negatively by the extreme values?
Indeed you are; the scikit-learn docs themselves clearly warn about such a case:
However, when data contains outliers, StandardScaler can often be misled. In such cases, it is better to use a scaler that is robust against outliers.
More or less, the same holds true for the MinMaxScaler as well.
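As a small illustration (the toy column below is made up, not the asker's 20 inputs), here is how the three scalers behave on data with one extreme value:

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

x = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])  # 1000 is the outlier

for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    print(type(scaler).__name__, scaler.fit_transform(x).ravel().round(3))

# MinMaxScaler squeezes the four ordinary points into a tiny range near 0,
# StandardScaler's mean and standard deviation are pulled by the outlier,
# while RobustScaler (median/IQR) keeps the bulk of the data on a sensible
# scale -- but the outlier is still present, just mapped to a large value.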
I really can't see how the Robust Scaler improved the data, because I still have extreme values in the resulting data set. Any simple, complete interpretation?
Robust does not mean immune, or invulnerable, and the purpose of scaling is not to "remove" outliers and extreme values - this is a separate task with its own methodologies; this is again clearly mentioned in the relevant scikit-learn docs:
RobustScaler
[...] Note that the outliers themselves are still present in the transformed data. If a separate outlier clipping is desirable, a non-linear transformation is required (see below).
where the "see below" refers to the QuantileTransformer and quantile_transform.
None of them is robust in the sense that the scaling will take care of outliers and put them on a confined scale, that is, guarantee that no extreme values will appear.
You can consider options like:
Clipping the series/array (say, between the 5th and 95th percentiles) before scaling
Taking transformations like square root or logarithm, if clipping is not ideal
Obviously, adding another column such as 'is clipped' or 'logarithmic clipped amount' will reduce the information loss.
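A hedged sketch of these two options (the 5th/95th percentiles, the indicator column, and the synthetic data are illustrative choices, not a prescription):

import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
x = np.concatenate([rng.normal(10, 2, 95), rng.normal(100, 5, 5)])  # ~5% extreme values

# Option 1: clip to the 5th-95th percentile range and keep an 'is clipped' flag
lo, hi = np.percentile(x, [5, 95])
clipped = np.clip(x, lo, hi)
is_clipped = ((x < lo) | (x > hi)).astype(int)

# Option 2: log-transform (the values here are positive) to compress the tail
logged = np.log1p(x)

scaled = StandardScaler().fit_transform(clipped.reshape(-1, 1))
print(scaled.min().round(2), scaled.max().round(2))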

Find the time that 90% of tickets are processed in?

My boss wants metrics on our ticket processing system, and one of the metrics he wants is "the 90% time", which he defines as the time it takes for 90% of the tickets to be processed. I guess he's considering that the other 10% are anomalous and can be ignored. I would like this to at least approach some statistical validity. So I've got a list of the times that I throw into a numpy array. This is the code I've come up with.
import numpy as np

# data is a 1-D numpy array of ticket processing times
inliers = data[data < np.percentile(data, 90)]
ninety_time = inliers.max()
Is this valid? Is there a better way?
Percentiles are a statistically perfectly valid approach. They are used to provide robust descriptions of the data. For example, the 50th percentile is the median, and box plots typically show the 25th, 50th, and 75th percentiles to give an idea of the range covered by the data.
The 90th percentile can be seen as a rather naive and rough estimate of the maximum value that is less vulnerable to outliers than the actual max value. (Obviously, it is somewhat biased: it will always be less than the true maximum.) Use this interpretation with care. It's safest to see the 90th percentile as what it is: a value with 90% of the data below it and 10% above.
Your code is somewhat redundant, as percentile(data, 90) already returns the value that 90% of the elements in data are lower than or equal to. So I would say this is exactly the 90% time, and there is no need to filter to the values below it first. For a large number of samples and continuous values, the difference between <=90% and <90% will vanish anyway.
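In other words (a minimal sketch; the array below is just a placeholder for the question's data):

import numpy as np

# placeholder: in the question, data holds the ticket processing times
data = np.array([1.5, 2.0, 3.2, 4.8, 5.0, 7.1, 9.9, 12.0, 30.0, 120.0])

# np.percentile already returns the value below which 90% of the data falls,
# so there is no need to filter and take a max afterwards
ninety_time = np.percentile(data, 90)
print(ninety_time)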

Efficient k-means evaluation with silhouette score in sklearn

I am running k-means clustering on ~1 million items (each represented as a ~100-feature vector). I have run the clustering for various k, and now want to evaluate the different results with the silhouette score implemented in sklearn. Attempting to run it with no sampling seems unfeasible and takes a prohibitively long time, so I assume I need to use sampling, i.e.:
metrics.silhouette_score(feature_matrix, cluster_labels, metric='euclidean', sample_size=???)
I don't have a good sense of what an appropriate sampling approach is, however. Is there a rule of thumb for what size sample to use given the size of my matrix? Is it better to take the largest sample my analysis machine can handle, or to take the average of more smaller samples?
I ask in large part because my preliminary test (with sample_size=10000) has produced some really really unintuitive results.
I'm also open to alternative, more scalable evaluation metrics.
Editing to visualize the issue: the plot shows, for varying sample sizes, the silhouette score as a function of the number of clusters.
What's not weird is that increasing the sample size seems to reduce noise. What is weird, given that I have 1 million very heterogeneous vectors, is that 2 or 3 is the "best" number of clusters. In other words, what's unintuitive is that I find a more-or-less monotonic decrease in silhouette score as I increase the number of clusters.
Other metrics
Elbow method: Compute the % variance explained for each K, and choose the K where the plot starts to level off (a good description is here: https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set). Obviously, if you have K equal to the number of data points, you can explain 100% of the variance. The question is where the improvements in variance explained start to level off; a sketch is given below, after these alternatives.
Information theory: If you can calculate a likelihood for a given K, then you can use the AIC, AICc, or BIC (or any other information-theoretic approach). E.g. for the AICc, it just balances the increase in likelihood as you increase K with the increase in the number of parameters you need. In practice all you do is choose the K that minimises the AICc.
You may be able to get a feel for a roughly appropriate K by running alternative methods that give you back an estimate of the number of clusters, like DBSCAN. I haven't seen this approach used to estimate K, though, and it is likely inadvisable to rely on it like this. However, if DBSCAN also gave you a small number of clusters here, then there's likely something about your data that you might not be appreciating (i.e. there are not as many clusters as you're expecting).
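A rough sketch of the elbow method mentioned above, using k-means inertia (within-cluster sum of squares) as the quantity that levels off; the synthetic blobs and the range of K are placeholders:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=2000, centers=5, n_features=10, random_state=0)

inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

# look for the K after which the drop in inertia levels off
for k, val in zip(range(1, 11), inertias):
    print(k, round(val, 1))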
How much to sample
It looks like you've answered this from your plot: no matter what your sampling, you get the same pattern in the silhouette score. So that pattern seems very robust to the sampling assumptions.
k-means converges to local minima, and the starting positions play a crucial role in finding the optimal number of clusters. It is often a good idea to reduce the noise and dimensionality using PCA or another dimensionality reduction technique before proceeding with k-means.
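A hedged sketch of that suggestion (the 90% variance target, the cluster count, and the stand-in data are illustrative, not tuned values):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(0)
X = rng.randn(10000, 100)  # stand-in for the ~1M x ~100-feature matrix

pipeline = make_pipeline(
    PCA(n_components=0.9, random_state=0),   # keep ~90% of the variance
    KMeans(n_clusters=8, n_init=10, random_state=0),
)
labels = pipeline.fit_predict(X)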
Just to add for the sake of completeness: it might be a good idea to get the optimal number of clusters by "partitioning around medoids". It is equivalent to using the silhouette method.
The reason for the weird observations could be the different starting points for the different-sized samples.
Having said all the above, it is important to evaluate the clusterability of the dataset at hand. A tractable means is the worst-pair ratio, as discussed here: Clusterability.
Since there is no widely-accepted best approach to determine the optimal number of clusters, all evaluation techniques, including Silhouette Score, Gap Statistic, etc. fundamentally rely on some form of heuristic/trial&error argument. So to me, the best approach is to try out multiple techniques and to NOT develop over-confidence in any.
In your case, the ideal and most accurate score should be calculated on the entire data set. However, if you need to use partial samples to speed up the computation, you should use the largest sample size your machine can handle. The rationale is the same as getting as many data points as possible out of the population of interest.
One more thing: the sklearn implementation of the silhouette score uses random (non-stratified) sampling. You can repeat the calculation multiple times using the same sample size (say, sample_size=50000) to get a sense of whether the sample size is large enough to produce consistent results.
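For example, a sketch of that consistency check (the synthetic data, cluster count, and sample_size are placeholders standing in for the asker's feature_matrix and cluster_labels):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

feature_matrix, _ = make_blobs(n_samples=50000, n_features=100,
                               centers=10, random_state=0)
cluster_labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(feature_matrix)

scores = [
    silhouette_score(feature_matrix, cluster_labels, metric='euclidean',
                     sample_size=10000, random_state=seed)
    for seed in range(5)
]
print(np.mean(scores), np.std(scores))  # a small spread suggests the sample size is adequate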

Deciding input values to DBSCAN algorithm

I have written code in Python to implement the DBSCAN clustering algorithm.
My dataset consists of 14k users with each user represented by 10 features.
I am unable to decide what exactly to use as the values of Min_samples and epsilon.
How should I decide that?
The similarity measure is Euclidean distance. (Hence it becomes even tougher to decide.) Any pointers?
DBSCAN's parameters are quite often hard to estimate.
Did you think about the OPTICS algorithm? In that case you only need Min_samples, which would correspond to the minimal cluster size.
Otherwise, for DBSCAN I've done it in the past by trial and error: try some values and see what happens. A general rule to follow is that if your dataset is noisy, you should use a larger value, and it is also correlated with the number of dimensions (10 in this case).
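A hedged sketch of that trial-and-error loop (the grid of eps values, min_samples=10, and the synthetic data are placeholders, not recommendations):

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=14000, n_features=10, centers=5, random_state=0)
X = StandardScaler().fit_transform(X)  # scaling matters for Euclidean distance

for eps in (0.3, 0.5, 1.0, 2.0):
    labels = DBSCAN(eps=eps, min_samples=10).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    print("eps=%.1f: %d clusters, %d noise points" % (eps, n_clusters, n_noise))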
