Principal Component Analysis with correlated features and outliers - python

I am performing PCA on a dataset of shape (300, 1500) using scikit-learn in Python 3.
I have the following questions about the PCA implementation in scikit-learn and the generally accepted approach.
1) Before doing PCA, should I remove highly correlated columns? I have 67 columns with correlation > 0.9. Does PCA automatically handle this correlation, i.e. does it ignore them?
2) Do I need to remove outliers before performing PCA?
3) If I have to remove outliers, how best to approach this? When I tried to remove outliers using a per-column z-score (z-score > 3), I was left with only 15 observations. That seems like the wrong approach.
4) Finally, is there an ideal amount of cumulative explained variance I should use to choose the number of principal components? In this case, around 150 components give me 90% cumulative explained variance.

With regard to PCA: PCA will discover the axes of greatest variance in your data. Consequently:
1) No, you do not need to remove correlated features.
2) You shouldn't need to remove outliers for any a priori reason related to PCA. That said, if you think they are distorting your results, either for analysis or prediction, you could consider removing them, although I don't think they are a problem for PCA per se.
3) That is probably not the right approach. First things first: visualize your data and look at the outliers. Also, I wouldn't assume a distribution for your data and apply a basic z-score to it. Some searching on criteria for removing outliers would be useful here.
4) There are various cutoffs people use with PCA. 99% can be quite common, although I don't know of a hard and fast rule. If your goal is prediction, there will probably be a trade-off between speed and the accuracy of your predictions. You will need to find the cutoff that suits your needs.
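A minimal sketch of point 4, assuming scikit-learn's PCA on standardized data; the (300, 1500) shape and the 90% threshold come from the question, but X here is random stand-in data, not the asker's dataset:

```python
# Sketch: choosing the number of components by cumulative explained variance.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 1500))               # placeholder for the real data

X_scaled = StandardScaler().fit_transform(X)   # PCA is scale-sensitive

pca = PCA().fit(X_scaled)
cumvar = np.cumsum(pca.explained_variance_ratio_)
# Smallest number of components reaching 90% cumulative explained variance
n_components = int(np.argmax(cumvar >= 0.90) + 1)
print(n_components)

# Or let scikit-learn pick the count directly from a variance fraction:
pca_90 = PCA(n_components=0.90).fit(X_scaled)
print(pca_90.n_components_)
```

Passing a float in (0, 1) as n_components tells scikit-learn to keep just enough components to explain that fraction of the variance.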


How to perform statistical tests in time series applications

I received feedback on my paper about stock market forecasting with machine learning, and the reviewer asked the following:
I would like you to statistically test the out-of-sample performance
of your methods. Hence 'differ significantly' in the original wording.
I agree that some of the figures look awesome visually, but visually,
random noise seems to contain patterns. I believe Sortino Ratio is the
appropriate statistic to test, and it can be tested by using
bootstrap. I.e., a distribution is obtained for both BH and your
strategy, and the overlap of these distributions is calculated.
My problem is that I have never done that for time series data. My validation procedure uses a strategy called walk-forward, where I shift the data in time 11 times, generating 11 different combinations of training and test sets with no overlap. So, here are my questions:
1- What would be the best (or most appropriate) statistical test to use, given what the reviewer is asking?
2- If I remember correctly, statistical tests require vectors as input, is that correct? Can I generate a vector containing 11 Sortino ratios (one for each walk) and then compare them with the baselines? Or should I run my code more than once? I am afraid the last option would be unfeasible given the short time to revise.
So, what would be the correct actions to compare machine learning approaches statistically in this time series scenario?
By pointing out that random noise seems to contain patterns, the reviewer means that your plots show nice patterns, but those patterns might just be random noise following some distribution (e.g. random uniform noise), which makes the results less trustworthy. It might be a good idea to split the data into k groups randomly, then apply a Z-test or t-test to compare the k groups pairwise.
The reviewer's focus on the Sortino ratio seems somewhat beside the point: since you are building a machine learning model for a forecasting task, what you actually care about is forecasting accuracy and reliability, which can be assessed with cross-validation; in convex optimization the equivalent is sensitivity analysis.
Update
The problem of serial dependence in time series arises when the data are non-stationary, which does not seem to be the case for your data. Even if it were, it can be addressed by removing trends, i.e. converting the non-stationary series into a stationary one (checking with the ADF test, for example), and you might also consider using ARIMA models.
Time shifting can sometimes be useful, but it is not considered a good measure of noise; it may, however, help improve model accuracy by shifting the data and extracting features (e.g. mean and variance over a window size, etc.).
Nothing prevents you from trying the time-shifting approach, but you can't rely on it as an accurate measurement, and you will still need to support your statistical analysis with more robust techniques.
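For the reviewer's bootstrap suggestion, a rough sketch follows. The Sortino helper and the two return series are made up for illustration, and for serially dependent returns a block bootstrap would be more appropriate than the iid resampling shown here:

```python
# Sketch of the bootstrap the reviewer describes: resample returns,
# compute a Sortino ratio per resample for both strategies, and
# compare the resulting distributions.
import numpy as np

def sortino(returns, target=0.0):
    """Sortino ratio: mean excess return over downside deviation."""
    excess = returns - target
    downside = excess[excess < 0]
    return excess.mean() / np.sqrt(np.mean(downside**2))

rng = np.random.default_rng(0)
strategy = rng.normal(0.001, 0.01, size=1000)   # assumed daily returns
buy_hold = rng.normal(0.0005, 0.01, size=1000)

n_boot = 2000
boot_s = np.empty(n_boot)
boot_b = np.empty(n_boot)
for i in range(n_boot):
    idx = rng.integers(0, len(strategy), size=len(strategy))
    boot_s[i] = sortino(strategy[idx])
    idx = rng.integers(0, len(buy_hold), size=len(buy_hold))
    boot_b[i] = sortino(buy_hold[idx])

# Fraction of bootstrap draws where buy-and-hold matches or beats the
# strategy; a small value suggests a significant difference.
p_overlap = np.mean(boot_b >= boot_s)
print(p_overlap)
```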

K Means Clustering: What does it mean about my input features if the Elbow Method gives me a straight line?

I am trying to cluster retail data in order to extract groupings of customers based on 6 input features. The data has a shape of (1712594, 6) in the following format:
I've split the 'Department' categorical variable into a binary n-dimensional array using Pandas get_dummies(). I'm aware this is not optimal, but I just wanted to test it out before trying Gower distances.
The Elbow method gives the following output:
USING:
I'm using Python and scikit-learn's KMeans because the dataset is so large and more complex models are too computationally demanding for Google Colab.
OBSERVATIONS:
I'm aware that columns 1-5 are extremely correlated, but the data is limited sales data and little to no data is captured about customers. KMeans is very sensitive to its inputs, and this may affect the WCSS in the elbow method and cause the straight line, but this is just a hunch and I don't have any quantitative backing to support the argument. I'm a junior data scientist, so my knowledge of the technical foundations of clustering models and algorithms is still developing; forgive me if I'm missing something.
WHAT I'VE DONE:
There were massive outliers skewing the data (this is a building-goods company, so most of their sale prices and quantities fall within a certain range, but ~5% of the data contained massive quantity entries (e.g. a company buying 300000 bricks at R3/brick) or massive price entries (e.g. a company buying an expensive piece of equipment)).
I removed them and kept ~94% of the data. I also removed the returns made by customers (i.e. negative quantities and prices), with the idea that I may create a binary variable 'Returned' to capture this feature. Here are some metrics:
These are some metrics before removing the outliers:
and these are the metrics after Outlier removal:
KMeans uses Euclidean distances. I've used both scikit-learn's StandardScaler and RobustScaler when scaling, without any significant change in either. Here are some distribution plots and scatter plots for the 3 numeric variables:
Does anybody have any practical/intuitive reasoning as to why this may be happening? I'm open to alternative methods as well, and any help would be much appreciated! Thanks
I am not an expert, but in my experience with scikit-learn cluster analysis, when the features are really similar in magnitude, K-means clustering usually does not do the job well. I would first try a StandardScaler to see whether normalizing the data makes the clustering more effective. The elbow plot shows that WCSS keeps dropping steadily as n_clusters grows, and from the looks of that plot and the plots you provide, I would think the data is too similar, making it hard to separate into groups (clusters). Adding an additional feature derived from your data can do the trick.
I would try normalizing the data first, with a StandardScaler.
If the groups are still not very clear from a simple plot of the data, I would create another column made up of a combination of the other columns.
I would not suggest using DBSCAN, since the eps (distance) parameter would have to be tuned very finely and, as you mention, it is more computationally expensive.
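A minimal sketch of the workflow suggested above (scale first, then inspect inertia across n_clusters); the blob data are stand-ins for the retail dataset:

```python
# Sketch: elbow method with scaling. With clear blobs, inertia drops
# sharply up to the true k; a near-straight line instead suggests no
# value of k reduces it disproportionately, i.e. weak cluster structure.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Three synthetic blobs so the elbow is visible
X = np.vstack([rng.normal(c, 0.5, size=(200, 2)) for c in (0, 5, 10)])
X_scaled = StandardScaler().fit_transform(X)

inertias = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    inertias.append(km.inertia_)

print(inertias)   # elbow expected at k=3 for this toy data
```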

Is there a way to get intracluster distances for k-means in Python

Hi, I am new to Python and trying to figure out the questions below. I'd really appreciate any help. Thank you.
How to get intracluster and intercluster distances in kmeans using python?
How to verify the quality of clusters? Any measures to check the goodness of clusters formed?
Is there a way to find out which factors/variables are the most significant features affecting the clustering (feature extraction/selection)?
I tried this for question 1 above; is this the correct approach?
from sklearn.metrics.pairwise import euclidean_distances
import numpy as np

# Note: these are distances between cluster centers, i.e. intercluster
# distances; np.triu_indices(4, 1) assumes k=4 clusters.
dists = euclidean_distances(km.cluster_centers_)
tri_dists = dists[np.triu_indices(4, 1)]
max_dist, avg_dist, min_dist = tri_dists.max(), tri_dists.mean(), tri_dists.min()
print(max_dist, avg_dist, min_dist)
Avoid putting multiple questions into one.
K-means does not compute all these distances. Otherwise it would need O(n²) time and memory, which would be much slower! It relies on a special property of the variance (another reason it does not optimize distances other than the sum of squares), known as the Koenig-Huygens theorem.
Yes, there have been over 20, probably even 100, such quality measures proposed in the literature. But that does not make it much easier to pick the "best" clustering: in the end, clusters are subjective to the user.
Yes, you can apply various techniques ranging from variance analysis to factor analysis to random forests.
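A sketch of questions 1 and 2, assuming a fitted KMeans with 3 clusters on toy data; silhouette and Davies-Bouldin are two of the many quality measures mentioned above:

```python
# Sketch: mean intracluster distance per cluster, plus two standard
# cluster-quality scores from scikit-learn.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, size=(100, 2)) for c in (0, 4, 8)])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = km.labels_

# Mean intracluster distance: average distance of points to their centroid
for k in range(3):
    pts = X[labels == k]
    d = np.linalg.norm(pts - km.cluster_centers_[k], axis=1)
    print(k, d.mean())

sil = silhouette_score(X, labels)       # close to 1 = well separated
db = davies_bouldin_score(X, labels)    # lower is better
print(sil, db)
```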

Data Standardization vs Normalization vs Robust Scaler

I am working on data preprocessing and want to compare the benefits of Data Standardization vs Normalization vs Robust Scaler practically.
In theory, the guidelines are:
Advantages:
Standardization: scales features such that the distribution is centered around 0, with a standard deviation of 1.
Normalization: shrinks the range such that the range is now between 0 and 1 (or -1 to 1 if there are negative values).
Robust Scaler: similar to normalization but it instead uses the interquartile range, so that it is robust to outliers.
Disadvantages:
Standardization: not good if the data is not normally distributed (i.e. no Gaussian Distribution).
Normalization: get influenced heavily by outliers (i.e. extreme values).
Robust Scaler: doesn't take the median into account and only focuses on the parts where the bulk data is.
I created 20 random numerical inputs and tried the above-mentioned methods (numbers in red color represent the outliers):
I noticed that, indeed, normalization was affected negatively by the outliers: the scale between the new values became tiny (all values almost identical, 6 digits after the decimal point, 0.000000x) even though there are noticeable differences between the original inputs!
My questions are:
Am I right to say that Standardization is also affected negatively by extreme values? If not, why, according to the results provided?
I really can't see how the Robust Scaler improved the data, because I still have extreme values in the resulting data set. Is there a simple, complete interpretation?
Am I right to say that Standardization is also affected negatively by extreme values?
Indeed you are; the scikit-learn docs themselves clearly warn for such a case:
However, when data contains outliers, StandardScaler can often be misled. In such cases, it is better to use a scaler that is robust against outliers.
More or less, the same holds true for the MinMaxScaler as well.
I really can't see how the Robust Scaler improved the data, because I still have extreme values in the resulting data set. Is there a simple, complete interpretation?
Robust does not mean immune or invulnerable, and the purpose of scaling is not to "remove" outliers and extreme values; that is a separate task with its own methodologies. This is again clearly stated in the relevant scikit-learn docs:
RobustScaler
[...] Note that the outliers themselves are still present in the transformed data. If a separate outlier clipping is desirable, a non-linear transformation is required (see below).
where the "see below" refers to the QuantileTransformer and quantile_transform.
None of them is robust in the sense that the scaling will take care of outliers and put them on a confined scale, i.e. guarantee that no extreme values appear.
You can consider options like:
Clipping the series/array (say, between the 5th and 95th percentiles) before scaling
Taking transformations like square roots or logarithms, if clipping is not ideal
Obviously, adding another column such as 'is clipped'/'logarithmic clipped amount' will reduce the information loss.
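A small sketch comparing how much room the non-outlier points keep under each scaler; the numbers are made up, with the final value a deliberate outlier:

```python
# Sketch: spread of the inlier points after each scaler. MinMaxScaler
# squeezes the inliers toward 0 (the question's 0.000000x effect);
# RobustScaler, based on the IQR, preserves their relative spread.
# The outlier itself remains extreme under all three.
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

x = np.array([1.0, 2, 2, 3, 3, 3, 4, 4, 5, 1000]).reshape(-1, 1)  # 1000 = outlier

spreads = {}
for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    xs = scaler.fit_transform(x).ravel()
    # Range of the scaled values, excluding the outlier in the last slot
    spreads[type(scaler).__name__] = xs[:-1].max() - xs[:-1].min()
    print(type(scaler).__name__, round(spreads[type(scaler).__name__], 4))
```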

Python statsmodel robust linear regression (RLM) outlier selection

I'm analyzing a set of data and I need to find the regression for it. The number of data points in the dataset is low (~15), so I decided to use robust linear regression for the job. The problem is that the procedure selects some points as outliers that do not seem to be that influential. Here is a scatter plot of the data, with influence used as marker size:
Points B and C (shown with red circles in the figure) are selected as outliers, while point A, which has much higher influence, is not. Although point A does not change the general trend of the regression, it basically defines the slope along with the point with the highest X, whereas points B and C only affect the significance of the slope. So my question has two parts:
1) What is the RLM package's method for selecting outliers, if the most influential point is not selected? And do you know of other packages that have the kind of outlier selection I have in mind?
2) Do you think that point A is an outlier?
RLM in statsmodels is limited to M-estimators. The default Huber norm is only robust to outliers in y, but not in x, i.e. not robust to bad influential points.
See for example http://www.statsmodels.org/devel/examples/notebooks/generated/robust_models_1.html
line In [51] and after.
Redescending norms like bisquare are able to remove bad influential points, but the solution is a local optimum and needs appropriate starting values. Methods that have a high breakdown point and are robust to x outliers, like LTS, are currently not available in statsmodels nor, AFAIK, anywhere else in Python. R has a more extensive suite of robust estimators that can handle these cases. Some extensions adding more methods and models to statsmodels.robust are in currently stalled pull requests.
In general and to answer the second part of the question:
In specific cases it is often difficult to declare or identify an observation as an outlier. Very often researchers use robust methods to flag outlier candidates that need further investigation. One reason, for example, could be that the "outliers" were sampled from a different population. A purely mechanical, statistical identification might not be appropriate in many cases.
In this example: If we fit a steep slope and drop point A as an outlier, then points B and C might fit reasonably well and are not identified as outliers. On the other hand, if A is a reasonable point based on extra information, then maybe the relationship is nonlinear.
My guess is that LTS will declare A as the only outlier and fit a steep regression line.
