I have a dataset with 7 parameters for each point:
counterOfPackets
counterOfSyn
counterOfPa
counterOfR
counterOfRA
counterOfFin
packetsTotalSize
I would like to find a way to get all the outliers into a Python list (not displayed in a plt.show() window).
What algorithm should I use and how can I view the results as a python list?
Thanks for your help :D
This page on Medium from Will Badr is a good resource - https://towardsdatascience.com/5-ways-to-detect-outliers-that-every-data-scientist-should-know-python-code-70a54335a623. In terms of which outlier detection algorithm to use, the answer depends on the distribution of your data. I have had success using standard deviations and distance from the inter-quartile range to identify outliers. However, these approaches work better on normal distributions, and in my scenario I found ways to transform my data into an approximately normal distribution without affecting the outcome.
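As a concrete starting point, here is a minimal sketch of the IQR approach applied column-wise, collecting the offending rows into a plain Python list. It assumes your data sits in a pandas DataFrame with the seven columns named as in the question; the random Poisson data and the usual 1.5 multiplier are just placeholders, not anything derived from your actual measurements.

```python
import numpy as np
import pandas as pd

# Column names taken from the question.
columns = ["counterOfPackets", "counterOfSyn", "counterOfPa", "counterOfR",
           "counterOfRA", "counterOfFin", "packetsTotalSize"]

def iqr_outliers(df, k=1.5):
    """Return the indices of rows falling outside Q1 - k*IQR or Q3 + k*IQR
    in at least one column."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    mask = ((df < q1 - k * iqr) | (df > q3 + k * iqr)).any(axis=1)
    return df.index[mask].tolist()

# Example usage with random data standing in for the real measurements.
df = pd.DataFrame(np.random.poisson(5, size=(1000, 7)), columns=columns)
outliers = iqr_outliers(df)        # a plain Python list of row indices
print(outliers[:10])
```

You can then use the returned index list to slice the original DataFrame (df.loc[outliers]) instead of plotting anything.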
For my thesis, I am trying to identify outliers in my data set. The data set consists of 160,000 measurements of one variable from a real process environment. In this environment, however, there can be measurements that are not actual data from the process itself but simply junk data. I would like to filter them out with a little help from the literature instead of relying only on "expert opinion".
Now, I've read about the IQR method for flagging possible outliers when dealing with a symmetric distribution like the normal distribution. However, my data set is right-skewed, and by distribution fitting, the inverse gamma and lognormal were the best fits.
So, during my search for methods for non-symmetric distributions, I found this topic on CrossValidated, where user603's answer is particularly interesting: Is there a boxplot variant for Poisson distributed data?
In user603's answer, he states that an adjusted boxplot helps to identify possible outliers in your dataset, and that R and MATLAB have functions for this (there is an R implementation, robustbase::adjbox(), as well as a MATLAB one, in a library called libra).
I was wondering if there is such a function in Python, or whether there is a way to calculate the medcouple (see the paper in user603's answer) with Python?
I really would like to see what comes out of the adjusted boxplot for my data.
In the module statsmodels.stats.stattools there is a function medcouple(), which is the measure of the skewness used in the Adjusted Boxplot.
From the medcouple you can calculate the interval beyond which points are flagged as outliers.
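A rough sketch of how that could look, using the skewness-adjusted fences from Hubert & Vandervieren (2008); please double-check the exponential constants against the paper, as they are reproduced here from memory rather than taken from the answer above.

```python
import numpy as np
from statsmodels.stats.stattools import medcouple

def adjusted_boxplot_outliers(x):
    """Flag outliers with the adjusted boxplot (Hubert & Vandervieren, 2008).
    Returns the values falling outside the skewness-adjusted fences."""
    x = np.asarray(x)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    mc = medcouple(x)                      # robust measure of skewness
    if mc >= 0:
        lower = q1 - 1.5 * np.exp(-4 * mc) * iqr
        upper = q3 + 1.5 * np.exp(3 * mc) * iqr
    else:
        lower = q1 - 1.5 * np.exp(-3 * mc) * iqr
        upper = q3 + 1.5 * np.exp(4 * mc) * iqr
    return x[(x < lower) | (x > upper)].tolist()

# Example on a right-skewed sample standing in for the process data.
sample = np.random.lognormal(mean=0.0, sigma=0.8, size=160000)
print(adjusted_boxplot_outliers(sample)[:10])
```

For MC = 0 this reduces to the ordinary 1.5*IQR boxplot rule, which is a useful sanity check.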
I have a few large sets of data which I have used to create non-standard probability distributions (using numpy.histogram to bin the data, and scipy.interpolate's interp1d function to interpolate the resulting curves). I have also created a function which can sample from these custom PDFs using the scipy.stats package.
My goal is to see how varying the size of my samples changes the goodness of fit to both the distributions they came from, and the other PDFs as well, and determine how large a sample is necessary to completely determine whether it came from one or other of my custom PDFs.
To do this, I've gathered that I need to use some sort of nonparametric statistical analysis, i.e. testing whether a set of data has been drawn from a provided probability distribution. Doing a bit of research, it seems like the Anderson-Darling test is ideal for this; however, its implementation in Python (scipy.stats.anderson) seems to be usable only for preset probability distributions such as normal, exponential, etc.
So my question is: given my many nonstandard PDFs (or CDFs if necessary, or the data I used to create them) what is the best way to work out how well a set of sample data fits each model in Python? If it is the Anderson-Darling test, is there some way of defining a custom PDF to test against?
Thanks. Any help is much appreciated.
(1) "Is it from distribution X" is generally a question which can be answered a priori, if at all; a statistical test for it will only tell you "I have a large sample / not a large sample", which may be true but not too useful. If you are trying to classify new data into one distribution or another, my advice is to look at it as a classification problem and use your constructed pdf's to compute p(class | data) = p(data | class) p(class) / p(data) where the key part p(data | class) is your histogram. Maybe you can say more about your problem domain.
(2) You could apply the Kolmogorov-Smirnov test, but it's really pointless, as mentioned above.
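To make point (1) concrete, here is a rough sketch of that Bayes rule using histograms as estimates of p(data | class). The class samples, bin grid, and equal priors are placeholders, not anything derived from the custom PDFs in the question.

```python
import numpy as np

# Hypothetical training data for two classes, binned on a shared grid.
bins = np.linspace(0, 10, 51)
class_a = np.random.gamma(2.0, 1.5, 5000)
class_b = np.random.normal(6.0, 1.0, 5000)

# Histogram-based estimates of p(data | class); density=True makes each
# histogram integrate to one over the grid.
pdf_a, _ = np.histogram(class_a, bins=bins, density=True)
pdf_b, _ = np.histogram(class_b, bins=bins, density=True)
prior_a = prior_b = 0.5                      # assumed equal priors

def log_posterior_odds(sample):
    """Sum of log p(x | class) over the sample plus log prior,
    for class A minus class B. Positive favours class A."""
    idx = np.clip(np.digitize(sample, bins) - 1, 0, len(bins) - 2)
    eps = 1e-12                              # avoid log(0) in empty bins
    ll_a = np.sum(np.log(pdf_a[idx] + eps)) + np.log(prior_a)
    ll_b = np.sum(np.log(pdf_b[idx] + eps)) + np.log(prior_b)
    return ll_a - ll_b

test = np.random.gamma(2.0, 1.5, 50)         # new sample of unknown class
print("class A" if log_posterior_odds(test) > 0 else "class B")
```

Watching how the posterior odds grow with sample size is one pragmatic way to answer the original "how large a sample do I need" question.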
I have some points that I need to classify. Given the collection of these points, I need to say which other (known) distribution they match best. For example, given the points in the top left distribution, my algorithm would have to say whether they are a better match to the 2nd, 3rd, or 4th distribution. (Here the bottom-left would be correct due to the similar orientations)
I have some background in Machine Learning, but I am no expert. I was thinking of using Gaussian Mixture Models, or perhaps Hidden Markov Models (as I have previously classified signatures with these, a similar problem).
I would appreciate any help as to which approach to use for this problem. As background information, I am working with OpenCV and Python, so I would most likely not have to implement the chosen algorithm from scratch, I just want a pointer to know which algorithms would be applicable to this problem.
Disclaimer: I originally wanted to post this on the Mathematics section of StackExchange, but I lacked the necessary reputation to post images. I felt that my point could not be made clear without showing some images, so I posted it here instead. I believe that it is still relevant to Computer Vision and Machine Learning, as it will eventually be used for object identification.
EDIT:
I read and considered some of the answers given below, and would now like to add some new information. My main reason for not wanting to model these distributions as a single Gaussian is that eventually I will also have to be able to discriminate between distributions. That is, there might be two different and separate distributions representing two different objects, and then my algorithm should be aware that only one of the two distributions represents the object that we are interested in.
I think this depends on where exactly the data comes from and what sort of assumptions you would like to make as to its distribution. The points above can easily be drawn even from a single Gaussian distribution, in which case the estimation of parameters for each one and then the selection of the closest match are pretty simple.
Alternatively you could go for the discriminative option, i.e. calculate whatever statistics you think may be helpful in determining the class a set of points belongs to and perform classification using SVM or something similar. This can be viewed as embedding these samples (sets of 2d points) in a higher-dimensional space to get a single vector.
Also, if the data is actually as simple as in this example, you could just do principal component analysis and match by the first eigenvector.
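For the PCA option, a minimal sketch in NumPy; the randomly generated elongated clouds are only stand-ins for the point sets in the figures.

```python
import numpy as np

def principal_axis(points):
    """Unit-length first eigenvector of the covariance of an (N, 2) array."""
    centered = points - points.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    return eigvecs[:, np.argmax(eigvals)]    # column with largest eigenvalue

def best_orientation_match(query, candidates):
    """Index of the candidate whose principal axis is most parallel
    to that of the query point set (the sign of the axis is irrelevant)."""
    q = principal_axis(query)
    scores = [abs(np.dot(q, principal_axis(c))) for c in candidates]
    return int(np.argmax(scores))

# Example: three elongated clouds with different orientations.
rng = np.random.default_rng(0)
def cloud(angle, n=300):
    raw = rng.normal(size=(n, 2)) * [3.0, 0.3]        # elongated along x
    rot = np.array([[np.cos(angle), -np.sin(angle)],
                    [np.sin(angle),  np.cos(angle)]])
    return raw @ rot.T

candidates = [cloud(0.0), cloud(np.pi / 3), cloud(np.pi / 2)]
query = cloud(np.pi / 3 + 0.05)
print(best_orientation_match(query, candidates))      # expected: 1
```

This obviously only captures orientation; if the distributions differ in shape rather than direction, the mixture-model or discriminative approaches above are more appropriate.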
You could just fit candidate distributions to the data, determine the chi^2 deviation for each one, and compare the fits with an F-test. See, for instance, standard lecture notes on model fitting and model comparison.
You might also want to consider non-parametric techniques (e.g. multivariate kernel density estimation on each of your new data sets) in order to compare statistics or distances of the estimated distributions. In Python, scipy.stats.gaussian_kde (in the scipy.stats.kde module) provides an implementation.
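For example, a rough sketch with scipy.stats.gaussian_kde, comparing an estimated density against each candidate reference sample on a common grid. The 1D data and the integrated-absolute-difference score are just illustrative choices; gaussian_kde also accepts multivariate input.

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_distance(sample, reference_sample, num_points=512):
    """Integrated absolute difference between two kernel density estimates.
    Smaller values mean the two samples look more alike."""
    kde_s = gaussian_kde(sample)
    kde_r = gaussian_kde(reference_sample)
    lo = min(sample.min(), reference_sample.min())
    hi = max(sample.max(), reference_sample.max())
    grid = np.linspace(lo, hi, num_points)
    diff = np.abs(kde_s(grid) - kde_r(grid))
    return float(np.sum(diff) * (grid[1] - grid[0]))   # simple Riemann sum

# Example: the sample should score closer to the gamma reference.
rng = np.random.default_rng(1)
sample = rng.gamma(2.0, 2.0, 400)
ref_gamma = rng.gamma(2.0, 2.0, 5000)
ref_norm = rng.normal(4.0, 2.8, 5000)
print(kde_distance(sample, ref_gamma), kde_distance(sample, ref_norm))
```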
I am interested in performing k-means clustering on a list of words, with the distance measure being Levenshtein.
1) I know there are a lot of frameworks out there, including scipy and Orange, that have a k-means implementation. However, they all require some sort of vector as the data, which doesn't really fit my case.
2) I need a good clustering implementation. I looked at python-clustering and realized that it a) doesn't return the sum of all distances to each centroid, and b) doesn't have any sort of iteration limit or cut-off to ensure the quality of the clustering. python-clustering and the clustering algorithm on DaniWeb don't really work for me.
Can someone find me a good lib? Google hasn't been my friend
Yeah, I think there isn't a good implementation for what I need.
I have some crazy requirements, like distance caching etc.
So I think I will just write my own lib and release it as GPLv3 soon.
Not really an answer to your specific question, but I recommend glancing at "Programming Collective Intelligence". At the end of each chapter, e.g., clustering, it wanders off into describing all the best reading on the subject.
Maybe have a look at Weka. It is a Java library with some unsupervised learning implementations and nice visualization tools. It has been a while since I used it, so I'm not sure if it is great for a real production environment, but it's definitely a good starting point.
What about this very nice answer on CrossValidated?
It uses Affinity Propagation instead of k-means, and in that case you can give a distance metric as input. I do not think any k-means-based approach could work in your case, since it is based on building a centroid, and to do that you have to be in a vector space.
Affinity Propagation has the bonus that it selects the number of clusters automatically, which you can tweak (to get more or fewer clusters) by altering the preference (which by default is the median of all pairwise similarities, but you can choose other percentiles).
If you need to specify the exact number of clusters, besides tweaking Affinity Propagation by trial and error, you could look for an implementation of k-medoids (apparently there is no implementation of it in sklearn, but people have asked for it here and there). K-medoids does not build centroids, so it does not need the concept of a vector space, and an implementation might accept a precomputed distance matrix as input (I haven't checked the references I give, though).
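A rough sketch of the Affinity Propagation route with a precomputed Levenshtein matrix; the word list and the pure-Python distance function are just placeholders (a C-backed package such as python-Levenshtein would be faster on real data).

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

def levenshtein(a, b):
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

words = ["apple", "apples", "applet", "banana", "bananas", "bandana"]

# Affinity Propagation expects similarities, so negate the distances.
similarity = -np.array([[levenshtein(a, b) for b in words] for a in words])

ap = AffinityPropagation(affinity="precomputed", random_state=0).fit(similarity)
for word, label in zip(words, ap.labels_):
    print(label, word)
```

Lowering the preference argument of AffinityPropagation merges clusters; raising it splits them, which is the knob described above.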
Given a 1D array of values, what is the simplest way to figure out the best-fit bimodal distribution, where each 'mode' is a normal distribution? Or in other words, how can you find the combination of two normal distributions that best reproduces the 1D array of values?
Specifically, I'm interested in implementing this in python, but answers don't have to be language specific.
Thanks!
What you are trying to fit is called a Gaussian mixture model. The standard approach to estimating it is Expectation-Maximization; the scipy SVN tree includes a machine learning and EM section called scikits. I use it a fair bit.
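The scikits packages mentioned there have since evolved into scikit-learn, so as a present-day substitute (not the exact code the answer refers to), a minimal sketch with sklearn.mixture.GaussianMixture looks like this; the synthetic data stands in for the real 1D array.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic bimodal data standing in for the real 1D array of values.
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(-2.0, 0.5, 700),
                         rng.normal(3.0, 1.2, 300)])

# EM fit of a two-component Gaussian mixture; reshape because sklearn
# expects a 2D array of shape (n_samples, n_features).
gmm = GaussianMixture(n_components=2, random_state=0).fit(values.reshape(-1, 1))

print("weights:", gmm.weights_)
print("means:  ", gmm.means_.ravel())
print("stds:   ", np.sqrt(gmm.covariances_.ravel()))
```

The fitted weights, means, and standard deviations are exactly the "combination of two normal distributions" asked for in the question.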
I suggest using the awesome scipy package.
It provides a few methods for optimisation.
There's a big fat caveat with simply applying a pre-defined least-squares fit or something along those lines.
Here are a few problems you will run into (a sketch of such a fit follows the list):
Noise larger than the second peak, or both peaks.
Partial peak - your data is cut off at one of the borders.
Sampling - the peaks are narrower than your sampling resolution.
It isn't actually normal - you'll still get some result...
Overlap - if the peaks overlap, you'll often find that one peak is fitted correctly but the second approaches zero...
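With those caveats in mind, here is a rough sketch of a least-squares fit of a two-Gaussian sum to a histogram using scipy.optimize.curve_fit; the synthetic data and the initial guesses in p0 are placeholders, and in practice p0 matters a lot, especially in the overlap case listed above.

```python
import numpy as np
from scipy.optimize import curve_fit

def two_gaussians(x, a1, mu1, s1, a2, mu2, s2):
    """Sum of two Gaussian peaks (amplitude, mean, width for each)."""
    return (a1 * np.exp(-0.5 * ((x - mu1) / s1) ** 2) +
            a2 * np.exp(-0.5 * ((x - mu2) / s2) ** 2))

# Synthetic bimodal sample, binned into a histogram to fit against.
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(-2.0, 0.5, 700), rng.normal(3.0, 1.2, 300)])
counts, edges = np.histogram(values, bins=60, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])

# Initial guesses (amplitude, mean, width) per peak; adjust to your data.
p0 = [0.5, -2.0, 1.0, 0.2, 3.0, 1.0]
params, cov = curve_fit(two_gaussians, centers, counts, p0=p0)
print(params.reshape(2, 3))   # one row of (amplitude, mean, width) per peak
```

If you hit the overlap or noise problems above, the EM / GaussianMixture approach from the other answer is usually the more robust choice, since it works on the raw samples rather than on binned counts.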