I am working with financial data and cannot assume a Gaussian distribution, so I normalize my data by subtracting the median and dividing by the interquartile range. This puts about 95% of the data into the range [-2, 2]. The rest are extreme outliers that can be as far out as -8, 28, 47, etc.
But I still don't want to throw the outliers away. So I apply tanh(x) to the entire normalized time series: the majority of the data, which lies in [-2, 2], is now mapped to roughly [-0.95, 0.95], the crazy outliers are saturated close to -1 and 1, and the really crazy ones are mapped to values numerically indistinguishable from -1 and 1. Order is preserved throughout, because tanh(x) is monotonic, and the machine learning algorithm doesn't have to waste time and energy on numbers with much larger absolute values than the others. The extreme outliers now all fall into two groups, near -1 and near 1.
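Concretely, the preprocessing looks roughly like this (a minimal NumPy sketch; the function name is just for illustration):

import numpy as np

def robust_tanh_scale(x):
    med = np.median(x)
    iqr = np.percentile(x, 75) - np.percentile(x, 25)
    z = (x - med) / iqr    # robust scaling: the bulk of the data lands roughly in [-2, 2]
    return np.tanh(z)      # squashes everything into (-1, 1); outliers saturate near -1 and 1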
By the way, the tanh compression doesn't destroy too many unique values; that is, close values are not collapsed to the same value by tanh. I get almost exactly the same number of unique values in my time series before the tanh as after.
The data will be fed into a neural network, a random forest, and gradient-boosted decision trees. (Even though decision trees don't care much about outliers, I still want to force all indicators into the same range [-1, 1].)
What are the bad consequences to my approach, compared to just throwing the outliers away? What am I missing?
Has anyone tried to predict a specific pattern in time series data?
Example: at a specific time, there is a huge upward spike in certain variables of a time series...
How would I build a model to predict that spike the next time it occurs?
Please do respond if you are working in this area.
I tried converting that particular series of data into a NumPy array and feeding it into the model, but it isn't working.
Here is what the data looks like:
This data was generated in a controlled manner so that the spikes occur close together. In the actual case they could be random, and our main objective is to catch this pattern and count the occurrences.
Das, you could try implementing LSTM-based neural network models.
See:
https://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/
It still helps if the data contains a trend. If the upward spike happens around the same point in a recurring time interval, you are more likely to get a good prediction result.
In the image you shared, there seems to be a trend in the data, so an LSTM model can extract the pattern fairly efficiently and output a prediction.
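For example, a minimal Keras sketch along the lines of that tutorial (window size, layer sizes, and training settings are just placeholders, and series is assumed to be a 1-D NumPy array of your values):

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

def make_windows(series, window=10):
    # turn the series into (samples, timesteps, 1) windows and next-step targets
    X = np.array([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    return X.reshape(-1, window, 1), y

X, y = make_windows(series)
model = Sequential([LSTM(32, input_shape=(X.shape[1], 1)), Dense(1)])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=20, batch_size=32, verbose=0)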
Statistical modelling of the data can also give good results.
See: https://orangematter.solarwinds.com/2019/12/15/holt-winters-forecasting-simplified/
Das, if outputting the total number of peaks is the sole requirement, then I think heavy neural network models are a bit of an overkill. Neural network models can also do the job well, but they require a lot of input data for training and for fine-tuning the weights and biases to give a really good result.
How about trying a threshold-based technique, where you increment a counter every time the data value crosses a preset threshold? In such an approach you should make sure to group very nearby peaks together so that they are counted only once; for that you can set a threshold on the x axis too.
For instance, with respect to the given plot, let the y-threshold be 4. Considering the y-axis threshold alone, you would get a count of 5, because around the x value 15:48.2 there are two peaks that cross y = 4. If you also set a threshold on the x axis, these nearby peaks get grouped together within the preset limit and the final count becomes 4, which is the requirement.
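A rough sketch of that idea in Python (y_thresh and x_gap are illustrative values you would tune to your data):

import numpy as np

def count_spikes(t, y, y_thresh=4.0, x_gap=1.0):
    above = np.where(np.asarray(y) > y_thresh)[0]   # samples that cross the y threshold
    if above.size == 0:
        return 0
    count = 1
    for prev, cur in zip(above[:-1], above[1:]):
        if t[cur] - t[prev] > x_gap:                # far enough apart on the x axis: a new spike
            count += 1
    return count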
This question is more theoretical, and not specifically trying to solve a problem.
I was recently introduced to the k-means clustering algorithm, an unsupervised machine learning algorithm, and I was intrigued by the thought that on some sets of data, even completely random ones, the centroids could keep changing with every iteration.
Example:
What I am trying to show here is: imagine the program flipping back and forth between, say, iteration 6 and iteration 9, and doing this forever.
I have had my code randomly hang before when using k-means, so I don't believe this is impossible, but please let me know whether this is a known occurrence or whether it is impossible due to the nature of the algorithm.
If you need more information, just ask me in a comment. I am using Python 3.7.
tl;dr No, a K-means algorithm always has an end point if the algorithm is coded correctly.
Explanation:
The ideal way to think about this is not in terms of which data points could cause issues, but in terms of how k-means works in the broader sense. The k-means algorithm always works in a finite space: for N data points and k clusters, there are only k^N distinct assignments of points to clusters. (This number can be very large, but it is still finite.)
Secondly, k-means always optimizes a loss function, the sum of squared distances between each data point and its assigned cluster center. This implies two very important things: the k^N distinct assignments can be ordered from minimum loss to maximum loss, and the k-means algorithm never moves from a state of lower net loss to one of higher net loss.
These two conditions guarantee that the algorithm always moves towards a minimum-loss arrangement in a finite space, thus ensuring that it terminates.
The last edge case: what if more than one minimum state has equal loss? This is a highly unlikely scenario, but it can cause issues if and only if the tie-breaking is coded poorly. Essentially, the only way this can cause a cycle is if a data point is at equal distance from two cluster centers and is allowed to switch away from its current cluster even on a tie. Suffice it to say, implementations are generally coded so that data points never swap on a tie, or ties are broken in some other deterministic manner, thus avoiding this scenario entirely.
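A small sketch illustrating the argument on random data (plain NumPy, Lloyd's algorithm; the sizes and seed are arbitrary): the loss never increases, so the assertion never fires and the loop always reaches a fixed point.

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((500, 2))                 # completely random data
k = 5
centers = X[rng.choice(len(X), k, replace=False)]
prev_loss = np.inf

for _ in range(1000):
    labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(-1), axis=1)   # assignment step
    loss = ((X - centers[labels]) ** 2).sum()                              # current total loss
    assert loss <= prev_loss + 1e-9                                        # loss is non-increasing
    new_centers = np.array([X[labels == j].mean(0) if np.any(labels == j) else centers[j]
                            for j in range(k)])                            # update step
    if np.allclose(new_centers, centers):                                  # fixed point reached
        break
    centers, prev_loss = new_centers, loss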
I have some data that has some outliers. My data, however, has a direction to it and trends that I need to consider when looking for outliers. Whether something is an outlier is not simply a yes or no answer; the only thing I can say is that the farther a data point is from the trend, the more likely it is an outlier I would like to exclude from my data.
Given that things like standard deviation, linear regressions, and the chunk of data I am looking at all depend on context, there is no static function I know of to determine whether something is an outlier.
I can select good outliers using various techniques, but the problem is that any time you get rid of outliers, you are using the context of the data you are picking them from.
I know that when you prepare your data for a neural network, the data always has to be prepared in exactly the same way, i.e. it goes through a fixed set of processes/functions. The techniques used to select outliers require context, and the context changes, so the function changes. I am not sure whether the differences in how an outlier is selected are enough to throw off the integrity of the model.
If this is true, are there any good static methods to select an outlier?
A model-independent way of selecting outliers is based upon the distribution of errors. This boils down to:
1. Fit the model with all data points
2. Calculate the residual error for each data point
3. Eliminate outliers based on some threshold
4. Re-fit the model from scratch with the outliers removed
5. Optionally, repeat until a termination condition is met (e.g. no more outliers are removed)
The threshold for elimination is problem- and metric-dependent. One approach is to compute a z-score on the residual errors (subtract the mean and divide by the standard deviation of the residual errors) and then remove any points whose absolute z-score is greater than a chosen threshold (which equates to the number of standard deviations from the mean beyond which points are treated as outliers).
https://en.wikipedia.org/wiki/Standard_score
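A minimal sketch of steps 1-4 with scikit-learn (the linear model and the threshold of 3 standard deviations are just illustrative choices):

import numpy as np
from sklearn.linear_model import LinearRegression

def fit_without_outliers(X, y, z_thresh=3.0):
    model = LinearRegression().fit(X, y)              # 1. fit on all data points
    resid = y - model.predict(X)                      # 2. residual error per point
    z = (resid - resid.mean()) / resid.std()          #    z-score of the residuals
    keep = np.abs(z) <= z_thresh                      # 3. drop points beyond the threshold
    return LinearRegression().fit(X[keep], y[keep]), keep   # 4. re-fit from scratch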
This is a general, model-independent approach that assumes residuals are normally distributed (or at least that outliers can be reasonably identified based on relative error).
If you have other assumptions regarding the distribution of the residual, you can apply other probabilistic criteria (e.g. fit a distribution on the residual errors, then apply a probabilistic threshold for each point). This is more involved though, and if you don't have any belief a priori about the characteristics of the residual error distribution (other than "large errors are likely outliers") then z-score is the way to go.
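For instance, you could fit a heavy-tailed distribution to the residuals with SciPy (the Student-t choice and the 1% cut-off are assumptions, and resid stands for the residual errors from your fitted model):

import numpy as np
from scipy import stats

df, loc, scale = stats.t.fit(resid)        # resid: residual errors from the fitted model
p = stats.t.cdf(resid, df, loc, scale)
tail_p = 2 * np.minimum(p, 1 - p)          # two-sided tail probability of each residual
keep = tail_p > 0.01                       # drop points that are very unlikely under the fit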
The foregoing discusses how to identify outliers, but doesn't address whether you should. This is an application-dependent question. If outliers are not informative of behavior you want to model, then they can be removed from training. However, if you want your model to predict average (or other metric-optimizing) behavior inclusive of outliers, then they should be retained.
Why does 80% of PCA.explained_variance_ratio_ seem like a reasonable threshold? What can one say about the number of components required to explain 80% of the variance?
According to the PCA documentation,
auto:
the solver is selected by a default policy based on X.shape and n_components: if the input data is larger than 500x500 and the number of components to extract is lower than 80% of the smallest dimension of the data, then the more efficient ‘randomized’ method is enabled. Otherwise the exact full SVD is computed and optionally truncated afterwards.
Ok, I'm not sure if I'm even making sense, but it seems like 80% is a good threshold, but why? I tried looking this up, but it didn't amount to much.
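For concreteness, the kind of check I have in mind is something like this (X stands for my feature matrix; the 0.80 is the threshold I am asking about):

import numpy as np
from sklearn.decomposition import PCA

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_for_80 = np.searchsorted(cumulative, 0.80) + 1   # components needed to reach 80% explained variance
# equivalently, PCA(n_components=0.80) keeps just enough components to explain 80%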
My boss wants metrics on our ticket processing system, and one of the metrics he wants is "the 90% time", which he defines as the time it takes for 90% of the tickets to be processed. I guess he's assuming that the other 10% are anomalous and can be ignored. I would like this to at least approach some statistical validity. So I've got a list of the times that I put into a NumPy array, and this is the code I've come up with:
import numpy as np

# times: the list of per-ticket processing times mentioned above
data = np.array(times)
inliers = data[data < np.percentile(data, 90)]   # keep everything below the 90th percentile
ninety_time = inliers.max()
Is this valid? Is there a better way?
Percentiles are a statistically perfectly valid approach; they are used to provide robust descriptions of the data. For example, the 50th percentile is the median, and box plots typically show the 25th, 50th, and 75th percentiles to give an idea of the range covered by the data.
The 90th percentile can be seen as a rather naive and rough estimate of the maximum value that is less vulnerable to outliers than the actual maximum. (Obviously, it is somewhat biased: it will always be less than the true maximum.) Use this interpretation with care. It is safest to see the 90th percentile as what it is: a value with 90% of the data below it and 10% above.
Your code is somewhat redundant, as percentile(data, 90) already returns the value that 90% of the elements in data are less than or equal to. I would say this is exactly the 90% time, and there is no need to filter for values below it and take their maximum. For a large number of samples and continuous values, the difference between <=90% and <90% vanishes anyway.
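In other words, the one-line sketch below would be enough:

ninety_time = np.percentile(data, 90)   # the value that 90% of the tickets come in at or under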