My boss wants metrics on our ticket processing system, and one of the metrics he wants is "the 90% time", which he defines as the time it takes 90% of the tickets to be processed. I guess he's assuming that the remaining 10% are anomalous and can be ignored. I would like this to at least approach some statistical validity. So I've got a list of the times that I throw into a numpy array. This is the code I've come up with.
import numpy as np
inliers = data[data<np.percentile(data, 90)]
ninety_time = inliers.max()
Is this valid? Is there a better way?
Percentiles are a perfectly valid statistical approach. They are used to provide robust descriptions of data. For example, the 50th percentile is the median, and box plots typically show the 25th, 50th, and 75th percentiles to give an idea of the range covered by the data.
The 90th percentile can be seen as a rather naive and rough estimate of the maximum value that is less vulnerable to outliers than the actual max value. (Obviously, it is somewhat biased: it will always be less than the true maximum.) Use this interpretation with care. It is safest to see the 90th percentile as what it is: a value with 90% of the data below it and 10% above.
Your code is somewhat redundant, as np.percentile(data, 90) already returns the value below which 90% of the elements in data fall. I would say this is exactly the 90% time, so there is no need to take the maximum of the values below it. For a large number of samples and continuous values, the difference between <=90% and <90% vanishes anyway.
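For example, a minimal version of this (the lognormal data here is just a stand-in for your actual times):
import numpy as np

# Dummy processing times; replace with your actual array.
data = np.random.lognormal(mean=1.0, sigma=0.5, size=1000)

# The 90th percentile *is* the "90% time": 90% of tickets are processed in at most this long.
ninety_time = np.percentile(data, 90)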
Related
I am working with financial data and cannot assume a Gaussian distribution, so I normalize my data by subtracting the median and dividing by the interquartile range. This puts 95% of the data into the range [-2, 2]. The rest are a bunch of crazy outliers that can be as extreme as -8, 28, 47, etc.
But I still don't want to throw the outliers away, so I apply tanh(x) to my entire normalized time series. The majority of the data, which lie in the range [-2, 2], are now mapped to [-0.95, 0.95]; the crazy outliers are saturated close to -1 and 1, and the really crazy ones are mapped to essentially exactly -1 and 1. Order is preserved throughout the process, because tanh(x) is a monotonic function. And the machine learning algorithm doesn't have to waste time and energy on numbers with much larger absolute values than the others. The extreme outliers all end up in two groups, near -1 and 1.
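Concretely, the transformation I apply looks roughly like this (the heavy-tailed dummy series is only a stand-in for my real, unnamed indicators):
import numpy as np

# Dummy heavy-tailed series standing in for one of my real indicators.
series = np.random.standard_t(df=3, size=10000)

median = np.median(series)
iqr = np.percentile(series, 75) - np.percentile(series, 25)

normalized = (series - median) / iqr   # robust centering and scaling
squashed = np.tanh(normalized)         # monotonic squashing into (-1, 1)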
By the way, the tanh compression doesn't destroy too many unique values; that is, close values are not collapsed to the same value by tanh. I get almost exactly the same number of unique values in my time series before the tanh as after.
The data will be fed into a neural network, a random forest, and gradient boosted decision trees. (Even though decision trees don't care much about outliers, I still want to force all indicators into the same range [-1, 1].)
What are the bad consequences to my approach, compared to just throwing the outliers away? What am I missing?
Why does 80% of PCA.explained_variance_ratio_ seem like a reasonable threshold? What can one say about the number of components required to explain 80% of the variance?
According to the PCA documentation,
auto:
the solver is selected by a default policy based on X.shape and n_components: if the input data is larger than 500x500 and the number of components to extract is lower than 80% of the smallest dimension of the data, then the more efficient ‘randomized’ method is enabled. Otherwise the exact full SVD is computed and optionally truncated afterwards.
OK, I'm not sure if I'm even making sense, but it seems like 80% is a good threshold, and I don't see why. I tried looking this up, but it didn't amount to much.
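For reference, this is the kind of calculation I have in mind when I talk about an 80% threshold (toy data standing in for my real matrix):
import numpy as np
from sklearn.decomposition import PCA

# Toy data standing in for my real matrix.
X = np.random.rand(500, 50)

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative explained variance reaches 80%.
n_components_80 = int(np.argmax(cumulative >= 0.80)) + 1

# Equivalently, PCA(n_components=0.80) selects this number of components automatically.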
ipdb> np.count_nonzero(test==0) / len(ytrue) * 100
76.44815766923736
I have a data file containing 24,000 prices that I use for a time series forecasting problem. Instead of trying to predict the price, I tried to predict the log-return, i.e. log(P_t/P_{t-1}). I applied the log-return to the prices as well as to all the features. The predictions are not bad, but they tend towards predicting zero. As you can see above, ~76% of the data are zeros.
Now the idea is probably to look into a zero-inflated estimator: first predict whether it's going to be a zero; if not, predict the value.
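If I understand the idea correctly, a sketch of it would look something like this (X, y, and the gradient-boosting models are placeholders, not my actual features or pipeline):
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

# Placeholder data: ~76% exact zeros, the rest small log-returns.
X = np.random.rand(1000, 10)
y = np.where(np.random.rand(1000) < 0.76, 0.0, np.random.randn(1000) * 1e-3)

is_nonzero = y != 0

# Stage 1: predict whether the log-return will be non-zero at all.
clf = GradientBoostingClassifier().fit(X, is_nonzero)

# Stage 2: predict the value, trained only on the non-zero cases.
reg = GradientBoostingRegressor().fit(X[is_nonzero], y[is_nonzero])

# Combined prediction: zero unless stage 1 says otherwise.
y_pred = np.where(clf.predict(X), reg.predict(X), 0.0)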
In detail, what is the best way to deal with an excessive number of zeros? How can a zero-inflated estimator help me with that? Be aware that I am not originally a probabilist.
P.S. I am trying to predict the log-return, where the time unit is seconds, for a high-frequency trading study. Be aware that it is a regression problem (not a classification problem).
Update
That picture is probably the best prediction I have of the log-return, i.e. log(P_t/P_{t-1}). Although it is not bad, the remaining predictions tend towards zero. As you can see in the question above, there are too many zeros. I probably have the same problem inside the features, since I take the log-return of the features as well, i.e. if F is a particular feature, then I apply log(F_t/F_{t-1}).
Here is one day of data, log_return_with_features.pkl, with shape (23369, 30, 161). Sorry, but I cannot say what the features are. Since I apply log(F_t/F_{t-1}) to all the features and to the target (i.e. the price), be aware that I added 1e-8 to all the features before applying the log-return operation to avoid division by zero.
OK, so judging from your plot: it's the nature of the data; the price simply doesn't change that often.
Try subsampling your original data a bit (perhaps by a factor of 5, just look at the data), so that you generally see a price movement with every time-tick. This should make any modeling much MUCH easier.
For the subsampling: I suggest simple regular downsampling in the time domain. If you have price data with one-second resolution (i.e. one price tag every second), simply take every fifth data point, then proceed as you usually would; in particular, compute the log-increase in price from this subsampled data. Remember that whatever you do must be reproducible at test time.
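As a rough sketch of what I mean (the dummy prices array is only a placeholder for your per-second data):
import numpy as np

# Dummy per-second price series standing in for your data.
prices = 100.0 + np.cumsum(np.random.randn(24000) * 0.01)

subsampled = prices[::5]                                # keep every fifth data point
log_returns = np.log(subsampled[1:] / subsampled[:-1])  # log-increase on the subsampled series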
If that is not an option for you for whatever reasons, have a look at something that can handle multiple time scales, e.g. WaveNet or Clockwork RNN.
I have some observational data for which I would like to estimate parameters, and I thought it would be a good opportunity to try out PyMC3.
My data is structured as a series of records. Each record contains a pair of observations that relate to a fixed one hour period. One observation is the total number of events that occur during the given hour. The other observation is the number of successes within that time period. So, for example, a data point might specify that in a given 1 hour period, there were 1000 events in total, and that of those 1000, 100 were successes. In another time period, there may be 1000000 events in total, of which 120000 are successes. The variance of the observations is not constant and depends on the total number of events, and it is partly this effect that I would like to control and model.
My first step is to estimate the underlying success rate. I've prepared the code below, which is intended to mimic the situation by providing two sets of 'observed' data generated with scipy. However, it does not work properly.
What I expect it to find is:
loss_lambda_factor is roughly 0.1
total_lambda (and total_lambda_mu) is roughly 120.
Instead, the model converges very quickly, but to an unexpected answer.
total_lambda and total_lambda_mu both show sharp peaks around 5e5.
loss_lambda_factor is roughly 0.
The traceplot (which I cannot post due to reputation below 10) is fairly uninteresting - quick convergence, and sharp peaks at numbers that do not correspond to the input data. I am curious as to whether there is something fundamentally wrong with the approach I am taking. How should the following code be modified to give the correct / expected result?
from pymc import Model, Uniform, Normal, Poisson, Metropolis, traceplot
from pymc import sample
import scipy.stats
import matplotlib.pyplot as plt

# Simulated 'observed' data: per-hour totals and successes at a 10% success rate.
totalRates = scipy.stats.norm(loc=120, scale=20).rvs(size=10000)
totalCounts = scipy.stats.poisson.rvs(mu=totalRates)
successRate = 0.1 * totalRates
successCounts = scipy.stats.poisson.rvs(mu=successRate)

with Model() as success_model:
    total_lambda_tau = Uniform('total_lambda_tau', lower=0, upper=100000)
    total_lambda_mu = Uniform('total_lambda_mu', lower=0, upper=1000000)
    total_lambda = Normal('total_lambda', mu=total_lambda_mu, tau=total_lambda_tau)
    total = Poisson('total', mu=total_lambda, observed=totalCounts)
    loss_lambda_factor = Uniform('loss_lambda_factor', lower=0, upper=1)
    success_rate = Poisson('success_rate', mu=total_lambda * loss_lambda_factor,
                           observed=successCounts)

with success_model:
    step = Metropolis()
    success_samples = sample(20000, step)  #, start)

plt.figure(figsize=(10, 10))
_ = traceplot(success_samples)
There is nothing fundamentally wrong with your approach, except for the pitfalls of any Bayesian MCMC analysis: (1) non-convergence, (2) the priors, (3) the model.
Non-convergence: I find a traceplot that looks like this:
This is not a good thing, and to see more clearly why, I would change the traceplot code to show only the second half of the trace, traceplot(success_samples[10000:]):
The prior: One major challenge for convergence is your prior on total_lambda_tau, which is a classic pitfall in Bayesian modeling. Although the prior Uniform('total_lambda_tau', lower=0, upper=100000) might appear quite uninformative, its effect is to say that you are quite certain that total_lambda_tau is large. For example, the probability that it is less than 10 is 0.0001. Changing the prior to
total_lambda_tau = Uniform('total_lambda_tau', lower=0, upper=100)
total_lambda_mu = Uniform('total_lambda_mu', lower=0, upper=1000)
results in a traceplot that is more promising:
This is still not what I look for in a traceplot, however, and to get something more satisfactory, I suggest using a "sequential scan Metropolis" step (which is what PyMC2 would default to for an analogous model). You can specify this as follows:
step = pm.CompoundStep([pm.Metropolis([total_lambda_mu]),
                        pm.Metropolis([total_lambda_tau]),
                        pm.Metropolis([total_lambda]),
                        pm.Metropolis([loss_lambda_factor]),
                        ])
This produces a traceplot that seems acceptable:
The model: As @KaiLondenberg responded, the approach you have taken with priors on total_lambda_tau and total_lambda_mu is not a standard one. You describe widely varying event totals (1,000 in one hour and 1,000,000 in the next), but your model posits that they are normally distributed. In spatial epidemiology, the approach I have seen for analogous data is a model more like this:
import pymc as pm, theano.tensor as T

with pm.Model() as success_model:
    loss_lambda_rate = pm.Flat('loss_lambda_rate')
    error = pm.Poisson('error', mu=totalCounts * T.exp(loss_lambda_rate),
                       observed=successCounts)
I'm sure that there are other ways that will seem more familiar in other research communities as well.
Here is a notebook collecting up these comments.
I see several potential problems with the model.
1.) I would think that the success counts (called error?) should follow a Binomial(n=total, p=loss_lambda_factor) distribution, not a Poisson; a sketch combining these points follows after point 4.).
2.) Where does the chain start? Starting at a MAP or MLE configuration would make sense unless you use pure Gibbs sampling. Otherwise the chain might take a long time to burn in, which might be what's happening here.
3.) Your choice of a hierarchical prior for total_lambda (i.e. a normal with uniform priors on both of its parameters) ensures that the chain will take a long time to converge unless you pick your starting point wisely (as in point 2.). You essentially introduce a lot of unnecessary degrees of freedom for the MCMC chain to get lost in. Given that total_lambda has to be nonnegative, I would choose a Uniform prior for total_lambda over a suitable range (from 0 to the observed maximum, for example).
4.) You use the Metropolis sampler. 20000 samples might not be enough for that one. Try 60000 and discard the first 20000 as burn-in. The Metropolis sampler can take a while to tune the step size, so it might well be that it spent the first 20000 samples mainly rejecting proposals and tuning. Also try other samplers, such as NUTS.
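To make the points above concrete, here is a rough, untested sketch that combines them, using the same simulated totalCounts / successCounts and the same pymc import convention as in the earlier answer:
import pymc as pm   # PyMC3, imported as elsewhere in this thread

with pm.Model() as binomial_model:
    # Point 3.): a simple bounded Uniform prior on total_lambda instead of the hierarchy.
    total_lambda = pm.Uniform('total_lambda', lower=0, upper=totalCounts.max())
    total = pm.Poisson('total', mu=total_lambda, observed=totalCounts)

    # Point 1.): successes out of the observed totals follow a Binomial, not a Poisson.
    loss_lambda_factor = pm.Uniform('loss_lambda_factor', lower=0, upper=1)
    successes = pm.Binomial('successes', n=totalCounts, p=loss_lambda_factor,
                            observed=successCounts)

    # Point 2.): start near the MAP; point 4.): longer chain, discard burn-in afterwards.
    start = pm.find_MAP()
    trace = pm.sample(60000, step=pm.Metropolis(), start=start)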
I am running k-means clustering on ~1 million items (each represented as a ~100-feature vector). I have run the clustering for various k, and now I want to evaluate the different results with the silhouette score implemented in sklearn. Running it with no sampling appears infeasible and takes prohibitively long, so I assume I need to use sampling, i.e.:
metrics.silhouette_score(feature_matrix, cluster_labels, metric='euclidean', sample_size=???)
I don't have a good sense of what an appropriate sampling approach is, however. Is there a rule of thumb for what size sample to use given the size of my matrix? Is it better to take the largest sample my analysis machine can handle, or to take the average of more smaller samples?
I ask in large part because my preliminary test (with sample_size=10000) has produced some really really unintuitive results.
I'm also open to alternative, more scalable evaluation metrics.
Editing to visualize the issue: the plot shows, for varying sample sizes, the silhouette score as a function of the number of clusters.
What's not weird is that increasing the sample size seems to reduce noise. What is weird, given that I have 1 million very heterogeneous vectors, is that 2 or 3 is the "best" number of clusters. In other words, what's unintuitive is that I find a more-or-less monotonic decrease in silhouette score as I increase the number of clusters.
Other metrics
Elbow method: Compute the % variance explained for each K, and choose the K where the plot starts to level off (a good description is here: https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set). Obviously, if you have K == number of data points, you can explain 100% of the variance. The question is where the improvements in variance explained start to level off.
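A rough sketch of that calculation with sklearn, where X is a placeholder for your (sampled) feature matrix and inertia_ is the within-cluster sum of squares that the "% variance explained" curve is built from:
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(10000, 100)   # placeholder for your sampled feature matrix

ks = range(2, 15)
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)
# Plot inertias against ks and look for the point where the curve levels off.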
Information theory: If you can calculate a likelihood for a given K, then you can use the AIC, AICc, or BIC (or any other information-theoretic approach). For example, the AICc just balances the increase in likelihood as you increase K against the increase in the number of parameters you need. In practice, all you do is choose the K that minimises the AICc.
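K-means itself does not give you a likelihood, but if you are willing to approximate it with a Gaussian mixture, sklearn exposes AIC/BIC directly. A sketch, again with X as a placeholder:
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.rand(10000, 100)   # placeholder for your sampled feature matrix

ks = range(2, 15)
bics = [GaussianMixture(n_components=k, random_state=0).fit(X).bic(X) for k in ks]
best_k = list(ks)[int(np.argmin(bics))]   # choose the K that minimises the BIC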
You may be able to get a feel for a roughly appropriate K by running alternative methods that give you back an estimate of the number of clusters, such as DBSCAN. I haven't seen this approach used to estimate K, and it is likely inadvisable to rely on it like this. However, if DBSCAN also gave you a small number of clusters here, then there is likely something about your data that you are not appreciating (i.e. there aren't as many clusters as you're expecting).
How much to sample
It looks like you've answered this from your plot: no matter what your sample size, you get the same pattern in silhouette score, so that pattern seems very robust to sampling assumptions.
k-means converges to local minima, so the starting positions play a crucial role in finding the optimal number of clusters. It is often a good idea to reduce noise and dimensionality using PCA or another dimension-reduction technique before running k-means.
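A sketch of that suggestion; X, the number of components, and the number of clusters are all placeholders, and n_init restarts k-means from several starting positions and keeps the best run:
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

X = np.random.rand(10000, 100)   # placeholder feature matrix

pipeline = make_pipeline(PCA(n_components=20),
                         KMeans(n_clusters=10, n_init=10, random_state=0))
labels = pipeline.fit_predict(X)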
Just to add for the sake of completeness: it might be a good idea to get the optimal number of clusters via "partitioning around medoids", which is equivalent to using the silhouette method.
The reason for the weird observations could be the different starting points for the different-sized samples.
Having said all the above, it is important to evaluate the clusterability of the dataset at hand. A tractable way to do this is the worst-pair ratio, as discussed in Clusterability.
Since there is no widely accepted best approach to determine the optimal number of clusters, all evaluation techniques, including the silhouette score, the gap statistic, etc., fundamentally rely on some form of heuristic / trial-and-error argument. So to me, the best approach is to try out multiple techniques and NOT to develop over-confidence in any one of them.
In your case, the ideal and most accurate score would be calculated on the entire data set. However, if you need to use partial samples to speed up the computation, you should use the largest sample size your machine can handle. The rationale is the same as using as many data points as possible out of the population of interest.
One more thing: the sklearn implementation of the silhouette score uses random (non-stratified) sampling. You can repeat the calculation multiple times using the same sample size (say sample_size=50000) to get a sense of whether the sample size is large enough to produce consistent results.
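For example, something along these lines (assuming feature_matrix and cluster_labels are the arrays from the question; only random_state varies between repeats):
import numpy as np
from sklearn import metrics

scores = [metrics.silhouette_score(feature_matrix, cluster_labels,
                                   metric='euclidean', sample_size=50000,
                                   random_state=seed)
          for seed in range(5)]
print(np.mean(scores), np.std(scores))   # a small spread suggests the sample size is adequate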