I'm trying to do permutation testing on a within-subject design. Stats is not my strong point, and after lots of desperate googling I'm still confused.
I have data from 36 subjects, and each subject's data is processed by 6 different methods. I have a metric (say SNR) for how well each method performs, so essentially a 36x6 matrix.
The data violate the assumptions of parametric testing (not normal, and variances are not homogeneous between groups), and rather than using non-parametric testing, we want to use permutation testing.
I want to see if my approach makes sense...
Initially:
Perform an rmANOVA on the data, save the F-value as F-actual.
Shuffle the data between the columns (methods) randomly, but with the constraint that each value must stay in the row associated with its original subject (any tips on how to perform this are appreciated; I've put a sketch of what I have in mind below).
After each shuffle (permutation), recompute the F-value and save to an array of possible F-values.
Check how often F-actual is more extreme than the values in the array of possible F-values.
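For the shuffle, I imagine something like this (a rough sketch with numpy and pingouin; the random data and the method names m0-m5 are just placeholders for my real 36x6 matrix):

```python
import numpy as np
import pandas as pd
import pingouin as pg

rng = np.random.default_rng(0)

# placeholder for the real 36 subjects x 6 methods matrix of SNR values
wide = pd.DataFrame(rng.normal(size=(36, 6)),
                    columns=[f"m{i}" for i in range(6)]).rename_axis("subject")

def rm_anova_f(wide_df):
    """F-value of a one-way repeated-measures ANOVA on a subjects x methods DataFrame."""
    long = wide_df.reset_index().melt(id_vars="subject", var_name="method", value_name="snr")
    aov = pg.rm_anova(data=long, dv="snr", within="method", subject="subject")
    return aov["F"].iloc[0]

f_actual = rm_anova_f(wide)

n_perm = 1000
f_perm = np.empty(n_perm)
vals = wide.to_numpy()
for i in range(n_perm):
    # shuffle each row independently, so every value stays with its original subject
    # (Generator.permuted needs NumPy >= 1.20)
    shuffled = rng.permuted(vals, axis=1)
    f_perm[i] = rm_anova_f(pd.DataFrame(shuffled, index=wide.index, columns=wide.columns))

# proportion of permuted F-values at least as extreme as the observed one
p_perm = (np.sum(f_perm >= f_actual) + 1) / (n_perm + 1)
```

Does that look like the right way to count "more extreme" for an F-statistic?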
Post-Hoc Testing:
Perform pairwise t-tests on the data, save the associated T-statistic as T-actual for each pairing.
Shuffle the data between the columns (methods) randomly, but with the constraint that each value must stay in the row associated with its original subject (the same shuffling as above).
After each shuffle (permutation), recompute the T-stat and save to an array of possible T-values for each pairing.
After n-permutations, check how often the actual T-stat for each pairing is more extreme than those possible T-values for each pairing.
I've currently been working in Python with pingouin, but I appreciate this may be easier to do in R, so I am open to migrating if that is the case. Any advice on whether this approach even makes sense, and how to perform it if so, is greatly appreciated!
Also, just to note: the method needs to be capable of dealing with NaN/None values for certain methods and subjects (so, for example, subject 1 may be blank for method 1, but there are relevant values for all other methods).
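For the post-hoc tests I imagine reusing the same row-wise shuffle and simply dropping, for each pair, the subjects that have a NaN for either method, something like this (rough sketch, continuing from the code above and reusing wide, rng and n_perm):

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import stats

def paired_t(a, b):
    """Paired t-statistic, ignoring subjects with a NaN in either method."""
    keep = ~(np.isnan(a) | np.isnan(b))
    return stats.ttest_rel(a[keep], b[keep]).statistic

pairs = list(combinations(wide.columns, 2))
t_actual = {(m1, m2): paired_t(wide[m1].to_numpy(), wide[m2].to_numpy())
            for m1, m2 in pairs}

t_perm = {pair: np.empty(n_perm) for pair in pairs}
for i in range(n_perm):
    shuffled = pd.DataFrame(rng.permuted(wide.to_numpy(), axis=1),
                            index=wide.index, columns=wide.columns)
    for m1, m2 in pairs:
        t_perm[(m1, m2)][i] = paired_t(shuffled[m1].to_numpy(), shuffled[m2].to_numpy())

# two-sided permutation p-value for each pairing
p_pairwise = {pair: (np.sum(np.abs(t_perm[pair]) >= abs(t_actual[pair])) + 1) / (n_perm + 1)
              for pair in pairs}
```

I'm not sure whether dropping the NaN subjects per pair like this is statistically OK, so comments on that are also welcome.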
Thank you.
Related
I'm dealing with a dataframe of dimension 4 million x 70. Most columns are numeric, some are categorical, and there are occasional missing values. It is essential that the clustering is run on all data points, and we are looking to produce around 400,000 clusters (so subsampling the dataset is not an option).
I have looked at using Gower's distance metric for mixed type data, but this produces a dissimilarity matrix of dimension 4 million x 4 million, which is just not feasible to work with since it has 10^13 elements. So, the method needs to avoid dissimilarity matrices entirely.
Ideally, we would use an agglomerative clustering method, since we want a large number of clusters.
What would be a suitable method for this problem? I am struggling to find a method which meets all of these requirements, and I realise it's a big ask.
Plan B is to use a simple rules-based grouping method based on categorical variables alone, handpicking only a few variables to cluster on since we will suffer from the curse of dimensionality otherwise.
The first step is going to be turning those categorical values into numbers somehow, and the second step is going to be putting the now all-numeric attributes onto the same scale.
Clustering is computationally expensive, so you might add a third step: represent the data by the top 10 components of a PCA (or however many components have an eigenvalue > 1) to reduce the number of columns.
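A rough sketch of those first steps with scikit-learn (the column names here are made up; swap in your own):

```python
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# df is the 4 million x 70 DataFrame; these column lists are placeholders
numeric_cols = ["num_a", "num_b"]
categorical_cols = ["cat_a", "cat_b"]

numeric_pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
categorical_pipe = make_pipeline(SimpleImputer(strategy="most_frequent"),
                                 OneHotEncoder(handle_unknown="ignore"))

preprocess = ColumnTransformer(
    [("num", numeric_pipe, numeric_cols),
     ("cat", categorical_pipe, categorical_cols)],
    sparse_threshold=0.0,  # force a dense output so PCA can consume it
)

reduce_dims = Pipeline([
    ("prep", preprocess),
    ("pca", PCA(n_components=10)),  # or keep however many components have eigenvalue > 1
])

# X_reduced = reduce_dims.fit_transform(df)  # n_rows x 10 array, ready for clustering
```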
For the clustering step, you'll have your choice of algorithms. I would think something hierarchical would be helpful for you: even though you expect a high number of clusters, it makes intuitive sense that those clusters fall under larger clusters, all the way up to a small number of "parent" clusters. A popular choice might be HDBSCAN, but I tend to prefer trying OPTICS. The implementation in the free ELKI toolkit seems to be the fastest (it takes some messing around to figure out), partly because it runs in Java. ELKI's output is a little strange: it writes a file for every cluster, so you then have to use Python to loop through the files and create your final mapping, unfortunately. But it's all doable (including executing the ELKI command) from Python if you're building an automated pipeline.
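If you want to prototype in Python before setting up ELKI, scikit-learn also ships an OPTICS implementation; expect it to be slow at this scale, so treat this as a sketch to run on the PCA-reduced data rather than a drop-in solution:

```python
from sklearn.cluster import OPTICS

# X_reduced: the n_rows x 10 array from the preprocessing sketch above
optics = OPTICS(min_samples=10, xi=0.05, min_cluster_size=5)
labels = optics.fit_predict(X_reduced)  # -1 marks noise, other values are cluster ids
```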
I have a task to find similar parts based on numeric dimensions (diameters, thickness) and categorical dimensions (material, heat treatment, etc.). I have a list of 1 million parts. My approach as a programmer is to put all parts on a list, pop off the first part and use it as a new "cluster", and compare the rest of the parts on the list against it. When a part on the list matches the categorical dimensions exactly and the numerical dimensions within 5 percent, I add that part to the cluster and remove it from the list. Once all parts on the list have been compared with the initial cluster part's dimensions, I pop the next part off the list and start again, populating clusters until no parts remain on the original list. This is a programmatic approach, and I am not sure if it is the most efficient way of categorizing parts into "clusters", or if k-means clustering would be a better approach.
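In code, my approach looks roughly like this (a simplified sketch; representing a part as a dict with a tuple of categorical values and a list of numeric dimensions is just my own shorthand here):

```python
def greedy_cluster(parts, tol=0.05):
    """parts: list of dicts like {"cat": ("steel", "annealed"), "num": [12.5, 3.2]}."""
    remaining = list(parts)
    clusters = []
    while remaining:
        seed = remaining.pop(0)          # first remaining part starts a new cluster
        cluster = [seed]
        keep = []
        for part in remaining:
            same_cat = part["cat"] == seed["cat"]
            within_tol = all(abs(p - s) <= tol * abs(s)
                             for p, s in zip(part["num"], seed["num"]))
            if same_cat and within_tol:
                cluster.append(part)     # matches the seed, so it joins this cluster
            else:
                keep.append(part)        # no match, stays around for a later seed
        remaining = keep
        clusters.append(cluster)
    return clusters
```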
Define "better".
What you are doing seems related to "leader" clustering, but that is a very primitive form of clustering that will usually not yield competitive results. With 1 million points, however, your choices are limited, and k-means does not handle categorical data well.
But until you decide what is 'better', there probably is nothing 'wrong' with your greedy approach.
An obvious optimization would be to first split all the data based on the categorical attributes (as you expect them to match exactly). That requires just one pass over the data set and a hash table. If the resulting partitions are small enough, you could try k-means (but how would you choose k?) or DBSCAN (probably using the same threshold you already have) on each partition.
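A sketch of that split-then-cluster idea with pandas and scikit-learn (the tiny table and the eps value are made up; note that your 5 percent relative rule does not translate one-to-one into DBSCAN's eps, so it needs tuning):

```python
import pandas as pd
from sklearn.cluster import DBSCAN

# tiny stand-in for the real parts table; column names are invented
df = pd.DataFrame({
    "material":       ["steel", "steel", "alu"],
    "heat_treatment": ["none",  "none",  "t6"],
    "diameter":       [10.0, 10.2, 5.0],
    "thickness":      [2.0,  2.05, 1.0],
})

cat_cols = ["material", "heat_treatment"]
num_cols = ["diameter", "thickness"]

df["cluster"] = -1                        # -1 = noise / not assigned
next_id = 0
for _, group in df.groupby(cat_cols):     # one pass: categoricals must match exactly
    labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(group[num_cols])
    labels = [next_id + l if l != -1 else -1 for l in labels]
    df.loc[group.index, "cluster"] = labels
    next_id = max([l + 1 for l in labels if l != -1], default=next_id)
```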
I have 2 dataframes.
Each dataframe contains 64 columns with each column containing 256 values.
I need to compare these 2 dataframes for statistical significance.
I know only basics of statistics.
What I have done is calculate a p-value for all columns of each dataframe.
Then I compare the p-value of each column of the 1st dataframe to the p-value of the corresponding column of the 2nd dataframe.
For example: the p-value of the 1st column of the 1st dataframe to the p-value of the 1st column of the 2nd dataframe.
Then I tell which columns are significantly different between the 2 dataframes.
Is there any better way to do this?
I use python.
To be honest, the way you are doing it is not the way it's meant to be done. Let me highlight some points that you should always keep in mind when conducting such analyses:
1.) Hypothesis first
I strongly suggest avoiding testing everything against everything. This kind of exploratory data analysis will likely produce some significant results, but it is also likely that you end up with a multiple comparisons problem.
In simple terms: you run so many tests that the chance of seeing something "significant" that in fact is not is greatly increased (see also Type I and Type II errors).
2.) The p-value isn't all the magic
Saying that you calculated the p-value for all columns doesn't tell us which test you used. The p-value is just a "tool" from mathematical statistics that is used by a lot of tests (e.g. correlation, t-test, ANOVA, regression, etc.). A significant p-value indicates that the difference/relationship you observed is statistically relevant (i.e. a systematic rather than a random effect).
3.) Consider sample and effect size
Depending on which test you are using, the p-value is sensitive to the sample size you have. The greater your sample size, the more likely you are to find a significant effect. For instance, if you compare two groups with 1 million observations each, the smallest differences (which might also be random artifacts) can be significant. It is therefore important to also look at the effect size, which tells you how large the observed effect really is (e.g. r for correlations, Cohen's d for t-tests, partial eta squared for ANOVAs, etc.).
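To make this concrete for your setup (64 columns, 256 values each), a column-wise comparison could look roughly like this; the random data are stand-ins and an independent-samples Welch t-test is assumed, so swap in whatever test actually fits your question:

```python
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
# stand-ins for your two 256 x 64 dataframes
df1 = pd.DataFrame(rng.normal(0.0, 1, size=(256, 64)))
df2 = pd.DataFrame(rng.normal(0.1, 1, size=(256, 64)))

results = []
for col in df1.columns:
    a, b = df1[col].dropna(), df2[col].dropna()
    t, p = stats.ttest_ind(a, b, equal_var=False)      # Welch's t-test per column
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    d = (a.mean() - b.mean()) / pooled_sd               # Cohen's d (effect size)
    results.append((col, t, p, d))

res = pd.DataFrame(results, columns=["column", "t", "p_raw", "cohens_d"])
# correct for running 64 tests at once (Benjamini-Hochberg FDR)
res["significant"], res["p_adj"], _, _ = multipletests(res["p_raw"], alpha=0.05, method="fdr_bh")
print(res.sort_values("p_adj").head())
```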
SUMMARY
So, if you want to get some real help here, I suggest posting some code and specifying more concretely (1) what your research question is, (2) which tests you used, and (3) what your code and output look like.
I am dealing with a problem where I would like to automatically divide a set into two subsets, knowing that ALMOST ALL of the objects in the set A will have greater values in all of the dimensions than objects in the set B.
I know I could use machine learning, but I need it to be fully automated, since in different instances of the problem the objects of set A and set B will have different values (so values in set B of problem instance 2 might be greater than values in set A of problem instance 1!).
I imagine the solution could be something like finding objects which are the best representatives of those two sets (the density of the objects around them is the highest).
Finding N best representatives of both sets would be sufficient for me.
Does anyone know the name of the problem and/or could propose the implementation for that? (Python is preferable).
Cheers!
You could try some of the clustering methods, which belong to unsupervised machine learning. The result depends on your data and how they are distributed. Judging by your picture, I think the k-means algorithm could work. There is a Python machine-learning library, scikit-learn, which already contains a k-means implementation: http://scikit-learn.org/stable/modules/clustering.html#k-means
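For two groups it boils down to a few lines (the data here are random stand-ins with the structure you describe):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# stand-in data: set B around the origin, set A shifted up in every dimension
X = np.vstack([rng.normal(0, 1, size=(100, 5)),
               rng.normal(4, 1, size=(100, 5))])

X_scaled = StandardScaler().fit_transform(X)   # keep the dimensions comparable
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)

# call the cluster with the larger overall mean "A" so the labelling is deterministic
a_label = int(X[labels == 1].mean() > X[labels == 0].mean())
is_set_a = labels == a_label
```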
If your data is as easy as you explained, then there are some rather obvious approaches.
Center and count:
Center your data set, and count for each object how many values are positive. If more values are positive than negative, it will likely be in the red class.
Length histogram:
Compute the sum of each vector. Make a histogram of values. Split at the largest gap, vectors longer than the threshold are in one group, the others in the lower group.
I have made an IPython notebook available that demonstrates this approach.
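The two ideas in a few lines of numpy (the data are random stand-ins; this assumes the gap between the two groups shows up in the row sums):

```python
import numpy as np

rng = np.random.default_rng(0)
# stand-in data with the structure described in the question
X = np.vstack([rng.normal(0, 1, size=(100, 5)),    # set B
               rng.normal(4, 1, size=(100, 5))])   # set A: larger in every dimension

# center and count: how many of a row's values lie above the column means?
positive_counts = (X - X.mean(axis=0) > 0).sum(axis=1)
in_set_a_v1 = positive_counts > X.shape[1] / 2

# length histogram: split the row sums at the largest gap
sums = np.sort(X.sum(axis=1))
gap = np.argmax(np.diff(sums))                      # index of the widest gap
threshold = (sums[gap] + sums[gap + 1]) / 2
in_set_a_v2 = X.sum(axis=1) > threshold
```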
I am trying to test how a periodic data set behaves with respect to the same data set folded with the period (that is, the average profile). More specifically, I want to test if the single profiles are consistent with the average one.
I am reading about a number of tests available in Python, especially the Kolmogorov-Smirnov statistic on 2 samples and the chi-square test.
However, my data are real data, and binned of course.
Therefore, as is often the case, my data have gaps. This means that very often a single profile has fewer bins than the "model" (the folded/average profile).
This means that I can't use those tests directly (because the two arrays have different numbers of elements), but I probably need to:
1) do some transformation, or any other operation, that allows me to compare the distributions;
2) Also, converting the average profile into a continuous model would be a nice solution;
3) Proceed with different statistical instruments which I am not aware of.
But I don't know how to move on in either case, so I would need help in finding a way for (1) or (2) (perhaps both!), or a hint about the third case.
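For example, I imagine option (2) could look something like this (a rough sketch with made-up arrays standing in for my binned counts; I'm not sure the chi-square is sound here, which is part of my question):

```python
import numpy as np
from scipy.interpolate import interp1d
from scipy.stats import chi2

# phase bin centres and counts of the folded (average) profile
model_phase = np.linspace(0.025, 0.975, 20)
model_counts = 100 + 40 * np.sin(2 * np.pi * model_phase)

# one single profile, with gaps (fewer bins than the model)
profile_phase = np.array([0.05, 0.15, 0.25, 0.55, 0.65, 0.85])
profile_counts = np.array([112.0, 131.0, 139.0, 96.0, 78.0, 92.0])

# turn the average profile into a continuous model and evaluate it at the observed bins
model = interp1d(model_phase, model_counts, kind="linear", fill_value="extrapolate")
expected = model(profile_phase)

# for Poisson counts, use the expected counts as the variance in each bin
chi2_stat = np.sum((profile_counts - expected) ** 2 / expected)
dof = len(profile_counts) - 1
p_value = chi2.sf(chi2_stat, dof)   # small p-value: the profile deviates from the average
```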
EDIT: the data are a light curve, that is photon counts versus time.
The data are from a periodic astronomical source, that is, they repeat their pattern (profile) every given period. I can fold the data with the period to obtain an average profile, and I want to use this averaged profile as the model against which I test each single profile.
Thanks!