I would like to investigate whether there are siginifcant differences between three different groups. There are about 20 numerical attributes for these groups. For each attribute there are about a thousand observations.
My first thought was to calculate a manova. Unfortunately, the data are not normally distributed (tested with Anderson Darling test). From just looking at the data, the distribution is too narrow around the mean and has no tail at all.
When I calculate the Manova anyway, highly significant results come out that are completely against my expectations.
Therefore, I would like to calculate a multivariate Kurskall Wallis test next. So far I have found scipy.stats.kruskal. Unfortunately, it only compares individual data series with each other. Is there already a similar implementation in Python to a MANOVA, where you read in all attributes and all three groups and then give a result?
If you need more information, please let me know.
Thanks a lot! :)
Related
I have a dataset of words in a non-semantic context, basically names. I want to perform an act of grouping all the similar ones (say samantha, samanta, sammanta, samaynta.. ) in the same groups.
Since it is a non-semantic context, I cannot vectorize the data using TF-IDF or something else, so I am using the data as it is.
Notice that, I tried using clustering, I used DBSCAN with a custom distance metric (levenshtein), and Polyfuzz. Both gave some decent results, but they were not enough, the former gave a lot of misclusterings, and the later missed a lot of data. I tried searching on the internet for ways to approach this, but weirdly couldn't find any. All were in semantic contexts using TF-IDF and NLP technologies.
note : the dataset is relatively big (around 400.000 or more names)
I have been stuck in this, and would appreciate help, insight, or propositions in this regard.
For clustering you need a distance metric, such as the Levenshtein distance. If that does not give you the desired result, you need to use another one.
I would start by defining what you mean by similar: obviously it is not similarity in spelling, as otherwise the Levenshtein distance should work. What else is there? From your example it seems like the initail characters are important, so maybe use a string comparison that is weighted towards the beginning of a word.
Another approach you could use is using an algorithm tailored to names, such as Soundex.
I would like to calculate the total variation distance(TVD) between two continuous probability distributions. I would like to point out that while there are two relevant questions(see here and here), they are both working with discrete distributions.
For those not familiar with TVD,
Informally, this is the largest possible difference between the
probabilities that the two probability distributions can assign to the
same event.
as it is described in the respective Wikipedia page. In the case of continuous distributions, TVD is equal with half the integral of the absolute difference between the two (since I cannot add math notation see this for a proof and for the notation).
So far, I wasn't able to find a tool for my job in Python. I would be interested in one if exists. Also, while I have no experience in R, I understand that is commonly used for such tasks so I would be interested in one as well (TVD calculation is the final step of my algorithm so I guess it won't be hard to read some data from a file, do the calculation and print a number even if I am completely new to R).
I would like to add that I am mainly interesting in normal distributions so a tool strictly for those is more than welcomed.
If no such tools exist, then any help adapting answers from this question to use the builtin probability functions will be of great help as well.
Thank you in advance.
I have a dataframe of values from various samples from two groups. I performed a scipy.stats.ttest on these, which works perfectly, but I am a bit concerned here with the fact that so much testing may yield multiple testing error.
And I wonder how to implement MTC (multiple testing correction) with this. I mean, is there some function in scipy or statsmodels which would perform directly the tests and apply MTC on the output series of p-value, or can I apply an MTC function on a list of p-value without problem?
I know that statsmodels may comprise such functions, but what it has in power, it lacks greatly in manageability and documentation, unhappily (indeed, that's not the fault of the developers, they are three for such huge project). Anyway, I am a little stuck here, so I'll gladly take any suggesting. I didn't ask this in CrossValidated, because it is more related to the implementation part than the statistical part.
Edit 9th Oct 2019:
this link works as of today
https://www.statsmodels.org/stable/generated/statsmodels.stats.multitest.multipletests.html
original answer (returns 404 now)
statsmodels.sandbox.stats.multicomp.multipletests takes an array of p-values and returns the adjusted p-values. The documentation is pretty clear.
I am dealing with a problem where I would like to automatically divide a set into two subsets, knowing that ALMOST ALL of the objects in the set A will have greater values in all of the dimensions than objects in the set B.
I know I could use machine learning but I need it to be fully automated, as in various instances of a problem objects of set A and set B will have different values (so values in set B of the problem instance 2 might be greater than values in set A of the problem instance 1!).
I imagine the solution could be something like finding objects which are the best representatives of those two sets (the density of the objects around them is the highest).
Finding N best representatives of both sets would be sufficient for me.
Does anyone know the name of the problem and/or could propose the implementation for that? (Python is preferable).
Cheers!
You could try some of the clustering methods, which belong to unsupervised machine learning. The result depends on your data and how distributed they are. According to your picture I think K-means algorithm could work. There is a python library for machine learning scikit-learn, which already contains k-means implementation: http://scikit-learn.org/stable/modules/clustering.html#k-means
If your data is as easy as you explained, then there are some rather obvious approaches.
Center and count:
Center your data set, and count for each object how many values are positive. If more values are positive than negative, it will likely be in the red class.
Length histogram:
Compute the sum of each vector. Make a histogram of values. Split at the largest gap, vectors longer than the threshold are in one group, the others in the lower group.
I have made an ipython notebook to demonstrate this approach available.
I have some points that I need to classify. Given the collection of these points, I need to say which other (known) distribution they match best. For example, given the points in the top left distribution, my algorithm would have to say whether they are a better match to the 2nd, 3rd, or 4th distribution. (Here the bottom-left would be correct due to the similar orientations)
I have some background in Machine Learning, but I am no expert. I was thinking of using Gaussian Mixture Models, or perhaps Hidden Markov Models (as I have previously classified signatures with these- similar problem).
I would appreciate any help as to which approach to use for this problem. As background information, I am working with OpenCV and Python, so I would most likely not have to implement the chosen algorithm from scratch, I just want a pointer to know which algorithms would be applicable to this problem.
Disclaimer: I originally wanted to post this on the Mathematics section of StackExchange, but I lacked the necessary reputation to post images. I felt that my point could not be made clear without showing some images, so I posted it here instead. I believe that it is still relevant to Computer Vision and Machine Learning, as it will eventually be used for object identification.
EDIT:
I read and considered some of the answers given below, and would now like to add some new information. My main reason for not wanting to model these distributions as a single Gaussian is that eventually I will also have to be able to discriminate between distributions. That is, there might be two different and separate distributions representing two different objects, and then my algorithm should be aware that only one of the two distributions represents the object that we are interested in.
I think this depends on where exactly the data comes from and what sort of assumptions you would like to make as to its distribution. The points above can easily be drawn even from a single Gaussian distribution, in which case the estimation of parameters for each one and then the selection of the closest match are pretty simple.
Alternatively you could go for the discriminative option, i.e. calculate whatever statistics you think may be helpful in determining the class a set of points belongs to and perform classification using SVM or something similar. This can be viewed as embedding these samples (sets of 2d points) in a higher-dimensional space to get a single vector.
Also, if the data is actually as simple as in this example, you could just do the principle component analysis and match by the first eigenvector.
You should just fit the distributions to the data, determine the chi^2 deviation for each one, look at F-Test. See for instance these notes on model fitting etc
You might want to consider also non-parametric techniques (e.g. multivariate kernel density estimation on each of your new data set) in order to compare the statistics or distances of the estimated distributions. In Python stats.kde is an implementation in SciPy.Stats.