For my thesis, I am trying to identify outliers in my data set. The data set consists of 160,000 measurements of one variable from a real process environment. In this environment, however, there can be measurements that are not actual data from the process itself but simply junk data. I would like to filter them out with a little help from the literature instead of relying only on "expert opinion".
Now, I've read about the IQR method for flagging possible outliers when dealing with a symmetric distribution like the normal distribution. However, my data set is right-skewed, and by distribution fitting, the inverse gamma and lognormal distributions were the best fit.
So, during my search for methods for non-symmetric distributions, I found this topic on Cross Validated, where user603's answer is particularly interesting: Is there a boxplot variant for Poisson distributed data?
In user603's answer, he states that an adjusted boxplot helps to identify possible outliers in your dataset and that R and Matlab have functions for this:
There is an R implementation of this (robustbase::adjbox()) as well as a Matlab one (in a library called LIBRA).
I was wondering if there is such a function in Python, or if there is a way to calculate the medcouple (see the paper in user603's answer) with Python.
I really would like to see what comes out of the adjusted boxplot for my data.
In the module statsmodels.stats.stattools there is a function medcouple(), which computes the measure of skewness used in the adjusted boxplot.
With this value you can calculate the interval beyond which outliers are defined.
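For reference, here is a minimal sketch of how the adjusted-boxplot fences can be computed from the medcouple; the exponential factors follow the Hubert and Vandervieren (2008) adjusted boxplot paper, and the lognormal test data is just a stand-in for your measurements:

```python
import numpy as np
from statsmodels.stats.stattools import medcouple

def adjusted_boxplot_fences(x):
    """Outlier fences of the adjusted boxplot (Hubert & Vandervieren, 2008)."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    mc = medcouple(x)  # robust skewness measure in [-1, 1]
    if mc >= 0:
        return q1 - 1.5 * np.exp(-4.0 * mc) * iqr, q3 + 1.5 * np.exp(3.0 * mc) * iqr
    return q1 - 1.5 * np.exp(-3.0 * mc) * iqr, q3 + 1.5 * np.exp(4.0 * mc) * iqr

data = np.random.lognormal(mean=0.0, sigma=1.0, size=10_000)  # stand-in for real data
lo, hi = adjusted_boxplot_fences(data)
outliers = data[(data < lo) | (data > hi)]
```

Note that, as far as I know, the statsmodels medcouple is the naive O(n²) implementation, so for 160,000 points you may have to compute it on a random subsample.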
I have a dataset with 7 parameters for each point:
counterOfPackets
counterOfSyn
counterOfPa
counterOfR
counterOfRA
counterOfFin
packetsTotalSize
I would like to find a way to get all the outliers into a Python list (not as a plt.show() GUI).
What algorithm should I use and how can I view the results as a python list?
Thanks for your help :D
This page on Medium from Will Badr is a good resource - https://towardsdatascience.com/5-ways-to-detect-outliers-that-every-data-scientist-should-know-python-code-70a54335a623. Which outlier detection algorithm to use depends on the distribution of your data. I have had success using standard deviations and distance from the inter-quartile range to identify outliers. These approaches work best on normal distributions; in my case, I was able to transform my data towards a normal distribution without affecting the outcome.
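As an illustration of the IQR route, here is a minimal sketch that flags, per column, every value outside the 1.5 × IQR fences and returns the offending row indices as a plain Python list. The CSV file name is hypothetical, the column names are taken from the question, and 1.5 is just the conventional multiplier:

```python
import pandas as pd

COLUMNS = ["counterOfPackets", "counterOfSyn", "counterOfPa", "counterOfR",
           "counterOfRA", "counterOfFin", "packetsTotalSize"]

df = pd.read_csv("packets.csv")  # hypothetical file with one row per point

def iqr_outlier_rows(frame, k=1.5):
    """Return the indices of rows that are outliers in at least one column."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    mask = (frame < q1 - k * iqr) | (frame > q3 + k * iqr)
    return frame.index[mask.any(axis=1)].tolist()

outlier_indices = iqr_outlier_rows(df[COLUMNS])
print(outlier_indices)  # a plain Python list, no plotting involved
```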
I would like to know if, in Python, and more precisely in the lmfit library, there is an option for fitting data by parts (piecewise). I would like to fit data defined over different ranges and then obtain a single overall fit.
Thank you
Without a more concrete example, it is hard to give a concrete answer. But, if I understand your question correctly, you are looking to do a fit to one specific region of your data, then a fit (probably with a different functional form) to another region of your data, and then perhaps combine the multiple regions to get a final fit.
If that is correct, then yes, this can be done with lmfit (and probably with other libraries as well). Let's say you want to fit data that is sort of peak-like on an exponentially decaying background. First, isolate a region around that peak (it doesn't have to be perfect) and fit a peak (say, a Gaussian) to it. Then fit an exponential decay to all the data except the peak area. (Aside: numpy.where can be very useful for identifying the regions.) Finally, combine the two and fit the whole curve to peak + background.
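If it helps, here is a rough sketch of that workflow with lmfit's built-in models. The data arrays x and y, the Gaussian-peak-on-exponential-background choice, and the region boundaries are all assumptions for illustration:

```python
import numpy as np
from lmfit.models import GaussianModel, ExponentialModel

# x, y are assumed to be 1-D numpy arrays holding the measured curve
in_peak = (x > 4.0) & (x < 6.0)  # rough (made-up) window around the peak

peak = GaussianModel(prefix="g_")
background = ExponentialModel(prefix="e_")

# fit each piece on its own region to get reasonable starting values
peak_result = peak.fit(y[in_peak], peak.guess(y[in_peak], x=x[in_peak]),
                       x=x[in_peak])
bkg_result = background.fit(y[~in_peak],
                            background.guess(y[~in_peak], x=x[~in_peak]),
                            x=x[~in_peak])

# combine the two models and refit the whole curve, seeded by the piecewise results
model = peak + background
params = peak_result.params.copy()
params.update(bkg_result.params)
result = model.fit(y, params, x=x)
print(result.fit_report())
```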
If that is too vague and doesn't point you in the right direction, please make the question more specific.
I have a few large sets of data which I have used to create non-standard probability distributions (using numpy.histogram to bin the data, and scipy.interpolate's interp1d function to interpolate the resulting curves). I have also created a function which can sample from these custom PDFs using the scipy.stats package.
My goal is to see how varying the size of my samples changes the goodness of fit to both the distributions they came from, and the other PDFs as well, and determine how large a sample is necessary to completely determine whether it came from one or other of my custom PDFs.
To do this, I've gathered that I need to use some sort of nonparametric statistical analysis, i.e. testing whether a set of data has been drawn from a provided probability distribution. Doing a bit of research, it seems like the Anderson-Darling test is ideal for this; however, its implementation in Python (scipy.stats.anderson) seems to be usable only for preset probability distributions such as the normal, exponential, etc.
So my question is: given my many nonstandard PDFs (or CDFs if necessary, or the data I used to create them) what is the best way to work out how well a set of sample data fits each model in Python? If it is the Anderson-Darling test, is there some way of defining a custom PDF to test against?
Thanks. Any help is much appreciated.
(1) "Is it from distribution X" is generally a question which can be answered a priori, if at all; a statistical test for it will only tell you "I have a large sample / not a large sample", which may be true but not too useful. If you are trying to classify new data into one distribution or another, my advice is to look at it as a classification problem and use your constructed pdf's to compute p(class | data) = p(data | class) p(class) / p(data) where the key part p(data | class) is your histogram. Maybe you can say more about your problem domain.
(2) You could apply the Kolmogorov-Smirnov test, but it's really pointless, as mentioned above.
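To make point (1) concrete, here is a minimal sketch of the classification view, treating the histogram-based PDFs from the question as the likelihood terms. The callables pdf_a and pdf_b, the clipping constant, and the equal priors are my own assumptions for illustration:

```python
import numpy as np

# pdf_a, pdf_b: callables returning the estimated density at each point, e.g.
# built from numpy.histogram + scipy.interpolate.interp1d as in the question.

def log_likelihood(sample, pdf, eps=1e-12):
    """Sum of log densities of the sample under one candidate PDF."""
    # clip so that points falling outside the histogram support don't give log(0)
    return np.sum(np.log(np.clip(pdf(sample), eps, None)))

def classify(sample, pdfs, priors=None):
    """Index of the PDF with the highest (unnormalised) posterior probability."""
    if priors is None:
        priors = np.full(len(pdfs), 1.0 / len(pdfs))  # equal priors assumed
    scores = [log_likelihood(sample, pdf) + np.log(p) for pdf, p in zip(pdfs, priors)]
    return int(np.argmax(scores))

# best = classify(sample, [pdf_a, pdf_b])
```

Plotting how often this picks the correct PDF as a function of sample size would give you the "how large a sample do I need" curve you describe.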
I have some points that I need to classify. Given the collection of these points, I need to say which other (known) distribution they match best. For example, given the points in the top left distribution, my algorithm would have to say whether they are a better match to the 2nd, 3rd, or 4th distribution. (Here the bottom-left would be correct due to the similar orientations)
I have some background in Machine Learning, but I am no expert. I was thinking of using Gaussian Mixture Models, or perhaps Hidden Markov Models (as I have previously classified signatures with these; a similar problem).
I would appreciate any help as to which approach to use for this problem. As background information, I am working with OpenCV and Python, so I would most likely not have to implement the chosen algorithm from scratch, I just want a pointer to know which algorithms would be applicable to this problem.
Disclaimer: I originally wanted to post this on the Mathematics section of StackExchange, but I lacked the necessary reputation to post images. I felt that my point could not be made clear without showing some images, so I posted it here instead. I believe that it is still relevant to Computer Vision and Machine Learning, as it will eventually be used for object identification.
EDIT:
I read and considered some of the answers given below, and would now like to add some new information. My main reason for not wanting to model these distributions as a single Gaussian is that eventually I will also have to be able to discriminate between distributions. That is, there might be two different and separate distributions representing two different objects, and then my algorithm should be aware that only one of the two distributions represents the object that we are interested in.
I think this depends on where exactly the data comes from and what sort of assumptions you would like to make as to its distribution. The points above can easily be drawn even from a single Gaussian distribution, in which case the estimation of parameters for each one and then the selection of the closest match are pretty simple.
Alternatively you could go for the discriminative option, i.e. calculate whatever statistics you think may be helpful in determining the class a set of points belongs to and perform classification using SVM or something similar. This can be viewed as embedding these samples (sets of 2d points) in a higher-dimensional space to get a single vector.
Also, if the data is actually as simple as in this example, you could just do a principal component analysis and match by the first eigenvector.
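A rough sketch of that last idea, assuming each distribution is given as an (N, 2) array of points; matching by the smallest angle between leading eigenvectors is my own reading of the suggestion rather than a standard recipe:

```python
import numpy as np

def principal_axis(points):
    """Unit eigenvector of the covariance matrix with the largest eigenvalue."""
    centered = points - points.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(centered.T))
    return eigvecs[:, np.argmax(eigvals)]

def best_match(query_points, candidate_sets):
    """Index of the candidate whose principal axis is closest in orientation."""
    q = principal_axis(query_points)
    # |cos(angle)| so that opposite directions along the same axis count as identical
    similarities = [abs(np.dot(q, principal_axis(c))) for c in candidate_sets]
    return int(np.argmax(similarities))
```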
You should just fit the distributions to the data, determine the chi^2 deviation for each one, and look at an F-test. See, for instance, these notes on model fitting.
You might also want to consider non-parametric techniques (e.g. multivariate kernel density estimation on each of your new data sets) in order to compare the statistics or distances of the estimated distributions. In Python, scipy.stats.gaussian_kde is an implementation in SciPy's stats package.
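A minimal sketch of that idea with scipy.stats.gaussian_kde, fitting a density to one 2-D point set and scoring another set under it; using the mean log density as the comparison score is an assumption on my part:

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_score(reference_points, query_points):
    """Mean log density of query_points under a KDE fitted to reference_points.

    Both arguments are assumed to be (N, 2) arrays; gaussian_kde expects
    shape (d, N), hence the transposes.
    """
    kde = gaussian_kde(reference_points.T)
    return float(np.mean(np.log(kde(query_points.T) + 1e-12)))

# the candidate set with the highest score is the closest match:
# scores = [kde_score(candidate, new_points) for candidate in candidate_sets]
```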
Given a 1D array of values, what is the simplest way to figure out the best-fit bimodal distribution to it, where each 'mode' is a normal distribution? Or, in other words, how can you find the combination of two normal distributions that best reproduces the 1D array of values?
Specifically, I'm interested in implementing this in python, but answers don't have to be language specific.
Thanks!
What you are trying to do is fit a Gaussian mixture model. The standard approach to solving this is Expectation Maximization; the SciPy SVN repository includes a machine-learning/EM section called scikits. I use it a fair bit.
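The scikits code has since grown into scikit-learn, where sklearn.mixture.GaussianMixture does the EM fit for you. A minimal sketch for a 1-D array (the toy data and the two-component restriction are the only assumptions):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# stand-in for the 1-D array of values from the question
values = np.concatenate([np.random.normal(0.0, 1.0, 500),
                         np.random.normal(5.0, 2.0, 500)])

gmm = GaussianMixture(n_components=2)   # two normal "modes"
gmm.fit(values.reshape(-1, 1))          # sklearn expects a 2-D array

means = gmm.means_.ravel()                   # fitted means of the two modes
sigmas = np.sqrt(gmm.covariances_).ravel()   # fitted standard deviations
weights = gmm.weights_                       # mixing proportions
print(means, sigmas, weights)
```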
I suggest using the awesome scipy package.
It provides a few methods for optimisation.
There's a big fat caveat with simply applying a pre-defined least-squares fit or something along those lines.
Here are a few problems you will run into:
Noise larger than the second peak, or than both peaks.
Partial peak - your data is cut off at one of the borders.
Sampling - the width of the peaks is smaller than your sampling resolution.
It isn't normal - you'll still get some result ...
Overlap - if the peaks overlap, you'll often find that one peak is fitted correctly while the second approaches zero ...
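With those caveats in mind, here is what the basic least-squares route looks like with scipy.optimize.curve_fit. The toy data, bin count and starting guesses p0 are assumptions, and p0 in particular is the part you will have to tune for your own data:

```python
import numpy as np
from scipy.optimize import curve_fit

def two_gaussians(x, a1, mu1, s1, a2, mu2, s2):
    """Sum of two Gaussian peaks."""
    return (a1 * np.exp(-0.5 * ((x - mu1) / s1) ** 2)
            + a2 * np.exp(-0.5 * ((x - mu2) / s2) ** 2))

# stand-in for the 1-D array of values; fit its normalised histogram
values = np.concatenate([np.random.normal(-2.0, 0.5, 1000),
                         np.random.normal(1.0, 1.0, 1000)])
counts, edges = np.histogram(values, bins=60, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])

p0 = [0.5, -2.0, 0.5, 0.2, 1.0, 1.0]  # rough (amplitude, mean, width) guesses per peak
popt, pcov = curve_fit(two_gaussians, centers, counts, p0=p0)
print(popt)
```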