Comparing datasets to nonstandard probability distributions in Python

Comparing datasets to nonstandard probability distributions in Python - python

I have a few large sets of data which I have used to create non-standard probability distributions (using numpy.histogram to bin the data, and scipy.interpolate's interp1d function to interpolate the resulting curves). I have also created a function which can sample from these custom PDFs using the scipy.stats package.
My goal is to see how varying the size of my samples changes the goodness of fit to both the distributions they came from, and the other PDFs as well, and determine how large a sample is necessary to completely determine whether it came from one or other of my custom PDFs.
To do this I've gathered that I need to use some sort of nonparametric statistical analysis, i.e. seeing whether a set of data has been drawn from a provided probability distribution. Doing a bit of research, it seems like the Anderson-Darling test is ideal for this, however its implementation in python (scipy.stats.anderson) seems to only be usable for preset probability distributions such as normal, exponential, etc.
So my question is: given my many nonstandard PDFs (or CDFs if necessary, or the data I used to create them) what is the best way to work out how well a set of sample data fits each model in Python? If it is the Anderson-Darling test, is there some way of defining a custom PDF to test against?
Thanks. Any help is much appreciated.

(1) "Is it from distribution X" is generally a question which can be answered a priori, if at all; a statistical test for it will only tell you "I have a large sample / not a large sample", which may be true but not too useful. If you are trying to classify new data into one distribution or another, my advice is to look at it as a classification problem and use your constructed pdf's to compute p(class | data) = p(data | class) p(class) / p(data) where the key part p(data | class) is your histogram. Maybe you can say more about your problem domain.
(2) You could apply the Kolmogorov-Smirnov test, but it's really pointless, as mentioned above.

Related

How to evaluate HDBSCAN text clusters?

I'm currently trying to use HDBSCAN to cluster movie data. The goal is to cluster similar movies together (based on movie info like keywords, genres, actor names, etc) and then apply LDA to each cluster and get the representative topics. However, I'm having a hard time evaluating the results (apart from visual analysis, which is not great as the data grows). With LDA, although it's hard to evaluate it, i've been using the coherence measure. However, does anyone have any idea on how to evaluate the clusters made by HDBSCAN? I haven't been able to find much info on it, so if anyone has any idea, I'd very much appreciate!

HDBSCAN implements Density-Based Clustering Validation in the method called relative_validity. It will allow you to compare one clustering, obtained with a given set of hyperparameters, to another one.
In general, read about cluster analysis and cluster validation.
Here's a good discussion about this with the author of the HDBSCAN library.

Its the same problem everywhere in unsupervised learning.
It is unsupervised, you are trying to discover something new and interesting. There is no way for the computer to decide whether something is actually interesting or new. It can decide and trivial cases when the prior knowledge is coded in machine processable form already, and you can compute some heuristics values as a proxy for interestingness. But such measures (including density-based measures such as DBCV are actually in no way better to judge this than the clustering algorithm itself is choosing the "best" solution).
But in the end, there is no way around manually looking at the data, and doing the next steps - try to put into use what you learned of the data. Supposedly you are not invory tower academic just doing this because of trying to make up yet another useless method... So use it, don't fake using it.

You can try the clusteval library. This library helps your to find the optimal number of clusters in your dataset, also for hdbscan. When you have the cluster labels, you can start enrichment analysis using hnet.
pip install clusteval
pip install hnet
Example:
# Import library
from clusteval import clusteval
# Set the method
ce = clusteval(method='hdbscan')
# Evaluate
results = ce.fit(X)
# Make plot of the evaluation
ce.plot()
# Make scatter plot using the first two coordinates.
ce.scatter(X)
So at this point you have the optimal detected cluster labels and now you may want to know whether there is association between any of the clusters with a (group of) feature(s) in your meta-data. The idea is to compute for each cluster label how often it is seen for a particular class in your meta-data. This can be defined with a P-value. The lower the P-value (below alpha=0.05), the less likely it happened by random chance.
results is a dict and contains the optimal cluster labels in the key labx. With hnet we can compute the enrichment very easily. More information can be found here: https://erdogant.github.io/hnet
# Import library
import hnet
# Get labels
clusterlabels = results['labx']
# Compute the enrichment of the cluster labels with the dataframe df
enrich_results = hnet.enrichment(df, clusterlabels)
When we look at the enrich_results, there is a column with category_label. These are the metadata variables of the dataframe df that we gave as an input. The second columns: P stands for P-value, which is the computed significance of the catagory_label with the target variable y. In this case, target variable y are are the cluster labels clusterlabels.
The target labels in y can be significantly enriched more then once. This means that certain y are enriched for multiple variables in the dataframe. This can occur because we may need to better estimate the cluster labels or its a mixed group or something else.
More information about cluster enrichment can be found here:
https://erdogant.github.io/hnet/pages/html/Use%20Cases.html#cluster-enrichment

Measuring "mixtureness" of labeled data (python)

I have some 2D data:
The data is labeled and shown in different colors. Definitely a non supervised process will not yield any correct prediction because the data is pretty mixed (although the colors seem to have regions of preference). I want to see if it is possible to measure how mixed are points from different sets.
For this I need to define a measurement of how mixed they are (I think that this should exist). Also it would be nice to have these algorithms implemented. I am also looking for a simple predictive model that can be trained used the data shown. Thanks for your help. If possible I'm looking for these implementations in python.

Adjusted Boxplot in Python

For my thesis, I am trying to identify outliers in my data set. The data set is constructed of 160000 times of one variable from a real process environment. In this environment however, there can be measurements that are not actual data from the process itself but simply junk data. I would like to filter them out with I little help of literature instead of only "expert opinion".
Now I've read about the IQR method of seeing whether possible outliers lie when dealing with a symmetric distribution like the normal distribution. However, my data set is right skewed and by distribution fitting, inverse gamma and lognormal where the best fit.
So, during my search for methods for non-symmetric distributions, I found this topic on crossvalidated where user603's answer is interesting in particular: Is there a boxplot variant for Poisson distributed data?
In user603's answer, he states that an adjusted boxplot helps to identify possible outliers in your dataset and that R and Matlab have functions for this
(There is an 𝚁R implementation of this
(𝚛𝚘𝚋𝚞𝚜𝚝𝚋𝚊𝚜𝚎::𝚊𝚍𝚓𝚋𝚘𝚡()robustbase::adjbox()) as well as
a matlab one (in a library called 𝚕𝚒𝚋𝚛𝚊libra)
I was wondering if there is such a function in Python. Or is there a way to calculate the medcouple (see paper in user603's answer) with python?
I really would like to see what comes out the adjusted boxplot for my data..

In the module statsmodels.stats.stattools there is a function medcouple(), which is the measure of the skewness used in the Adjusted Boxplot.
enter link description here
With this variable you can calculate the interval beyond which outliers are defined.

Identifying a distribution from a pdf in python

I have a probability density function of an unknown distribution which is given as a set of tuples (x, f(x)), where x=numpy.arange(0,1,size) and f(x) is the corresponding probability.
What is the best way to identify the corresponding distribution? So far my idea is to draw a large amount of samples based on the pdf (by writing the code myself from scratch) and then use the obtained data to fit all of the distributions implemented in scipy.stats, then take the best fit.
Is there a better way to solve this problem? For example, is there some kind of utility in scipy.stats that I'm missing that would help me solve this problem?

In a fundamental sense, it's not really possible to summarize a distribution based on empirical samples - see here a discussion.
It's possible to do something more limited, which is to reject/accept the hypothesis that it comes out of one of a finite set of (parametric) distributions, based on a somewhat arbitrary criterion.
Given the finite set of distributions, for each distribution, you could perhaps realistically do the following:
Fit the distribution's parameters to the data. E.g., scipy.stats.beta.fit will fit the best parameters of the Beta distribution (all scipy distributions have this method).
Reject/accept the hypothesis that the data was generated by this distribution. There is more than a single way of doing this. A particularly simple way is to use the rvs() method of the distribution to generate another sample, then use ks_2samp to generate a Kolmogorov-Smirnov test.
Note that some specific distributions might have better, ad-hoc algorithms for testing whether a member of the distribution's family generated the data. As usual, the Normal distribution has many in particular (see Normalcy test).

Classifying a Distribution of Points for Object Identification

I have some points that I need to classify. Given the collection of these points, I need to say which other (known) distribution they match best. For example, given the points in the top left distribution, my algorithm would have to say whether they are a better match to the 2nd, 3rd, or 4th distribution. (Here the bottom-left would be correct due to the similar orientations)
I have some background in Machine Learning, but I am no expert. I was thinking of using Gaussian Mixture Models, or perhaps Hidden Markov Models (as I have previously classified signatures with these- similar problem).
I would appreciate any help as to which approach to use for this problem. As background information, I am working with OpenCV and Python, so I would most likely not have to implement the chosen algorithm from scratch, I just want a pointer to know which algorithms would be applicable to this problem.
Disclaimer: I originally wanted to post this on the Mathematics section of StackExchange, but I lacked the necessary reputation to post images. I felt that my point could not be made clear without showing some images, so I posted it here instead. I believe that it is still relevant to Computer Vision and Machine Learning, as it will eventually be used for object identification.
EDIT:
I read and considered some of the answers given below, and would now like to add some new information. My main reason for not wanting to model these distributions as a single Gaussian is that eventually I will also have to be able to discriminate between distributions. That is, there might be two different and separate distributions representing two different objects, and then my algorithm should be aware that only one of the two distributions represents the object that we are interested in.

I think this depends on where exactly the data comes from and what sort of assumptions you would like to make as to its distribution. The points above can easily be drawn even from a single Gaussian distribution, in which case the estimation of parameters for each one and then the selection of the closest match are pretty simple.
Alternatively you could go for the discriminative option, i.e. calculate whatever statistics you think may be helpful in determining the class a set of points belongs to and perform classification using SVM or something similar. This can be viewed as embedding these samples (sets of 2d points) in a higher-dimensional space to get a single vector.
Also, if the data is actually as simple as in this example, you could just do the principle component analysis and match by the first eigenvector.

You should just fit the distributions to the data, determine the chi^2 deviation for each one, look at F-Test. See for instance these notes on model fitting etc

You might want to consider also non-parametric techniques (e.g. multivariate kernel density estimation on each of your new data set) in order to compare the statistics or distances of the estimated distributions. In Python stats.kde is an implementation in SciPy.Stats.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.