I have 2 dataframes.
Each dataframe contains 64 columns with each column containing 256 values.
I need to compare these 2 dataframes for statistical significance.
I know only the basics of statistics.
What I have done is calculate a p-value for each column of each dataframe.
Then I compare the p-value of each column of the 1st dataframe to the p-value of the corresponding column of the 2nd dataframe.
E.g.: the p-value of the 1st column of the 1st dataframe to the p-value of the 1st column of the 2nd dataframe.
Then I tell which columns are significantly different between the 2 dataframes.
Is there any better way to do this?
I use python.
To be honest, the way you do it is not the way it's meant to be done. Let me highlight some points that you should always keep in mind when conducting such analyses:
1.) Hypothesis first
I strongly suggest avoiding testing everything against everything. This kind of exploratory data analysis will likely produce some significant results, but it is also likely that you end up with a multiple comparisons problem.
In simple terms: you run so many tests that the chance of seeing something significant which in fact is not is greatly increased (see also Type I and Type II errors).
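If you do end up running one test per column anyway, you at least need a multiple-comparisons correction. Here is a minimal sketch, assuming two DataFrames with matching columns and assuming, purely for illustration, that an independent-samples t-test is even appropriate for your data:

```python
# Illustrative only: per-column t-tests with Benjamini-Hochberg FDR correction.
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
df1 = pd.DataFrame(rng.normal(0.0, 1, size=(256, 64)))   # stand-in for your 1st dataframe
df2 = pd.DataFrame(rng.normal(0.2, 1, size=(256, 64)))   # stand-in for your 2nd dataframe

# one test per column (whether a t-test fits your data is an assumption here)
pvals = [stats.ttest_ind(df1[c], df2[c]).pvalue for c in df1.columns]

# correct for the 64 simultaneous tests
reject, pvals_corrected, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print("columns still flagged after FDR correction:", np.flatnonzero(reject))
```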
2.) The p-value isn't all the magic
Saying that you calculated the p-value for all columns doesn't tell us which test you used. The p-value is just a "tool" from mathematical statistics that is used by a lot of tests (e.g. correlation, t-test, ANOVA, regression, etc.). A significant p-value indicates that the difference/relationship you observed is statistically relevant (i.e. a systematic and not a random effect).
3.) Consider sample and effect size
Depending on which test you are using, the p-value is sensitive to the sample size you have. The greater your sample size, the more likely it is to find a significant effect. For instance, if you compare two groups with 1 million observations each, the smallest differences (which might also be random artifacts) can be significant. It is therefore important to also take a look at the effect size, which tells you how large the observed effect really is (e.g. r for correlations, Cohen's d for t-tests, partial eta squared for ANOVAs, etc.).
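As a rough illustration (synthetic data; again, whether a t-test fits your columns is an assumption), reporting an effect size next to the p-value could look like this:

```python
# Minimal sketch: report Cohen's d alongside the t-test p-value.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 256)   # e.g. one column from the 1st dataframe
y = rng.normal(0.2, 1.0, 256)   # the corresponding column from the 2nd dataframe

t, p = stats.ttest_ind(x, y)

# pooled-SD Cohen's d for two independent samples
nx, ny = len(x), len(y)
pooled_sd = np.sqrt(((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2))
d = (x.mean() - y.mean()) / pooled_sd

print(f"t = {t:.2f}, p = {p:.4f}, Cohen's d = {d:.2f}")
```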
SUMMARY
So, if you want to get some real help here, I suggest posting some code and specifying more concretely (1) what your research question is, (2) which tests you used, and (3) what your code and your output look like.
I'm trying to do permutation testing on a within-subject design, and with stats not being my strong point, after lots of desperate googling I'm still confused.
I have data from 36 subjects, and each subject has their data processed by 6 different methods. And I have a metric (say SNR) for how well each method performs (essentially a 36x6 matrix).
The data violates the conditions for parametric testing (not normal, and not homogeneous variance between groups), and rather than using non-parametric testing, we want to use permutation testing.
I want to see if my approach makes sense...
Initially:
Perform an rmANOVA on the data, save the F-value as F-actual.
Shuffle the data between the columns (methods) randomly, but with the constraint that each value must stay in the row associated with its original subject (any tips on how to perform this are appreciated).
After each shuffle (permutation), recompute the F-value and save to an array of possible F-values.
Check how often F-actual is more extreme than the values in the array of possible F-values.
Post-Hoc Testing:
Perform pairwise t-tests on the data, save the associated T-statistic as T-actual for each pairing.
Shuffle the data between the columns (methods) randomly, but with the constraint that each value must stay in the row associated with its original subject (the same shuffling as above; see the sketch below).
After each shuffle (permutation), recompute the T-stat and save to an array of possible T-values for each pairing.
After n-permutations, check how often the actual T-stat for each pairing is more extreme than those possible T-values for each pairing.
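A hedged sketch of the omnibus part of this scheme (synthetic 36 x 6 data in a DataFrame called `data`; the helper `rm_f`, the number of permutations, and the use of pingouin's `rm_anova` for the F-value are illustrative choices, and you would still need to decide how the NaN cells are handled, e.g. dropped per subject):

```python
# Within-subject permutation test for the repeated-measures ANOVA F-value.
import numpy as np
import pandas as pd
import pingouin as pg

rng = np.random.default_rng(42)
data = pd.DataFrame(rng.normal(size=(36, 6)),
                    columns=[f"method{i}" for i in range(1, 7)])

def rm_f(wide):
    """rmANOVA F-value on a wide subjects-by-methods frame."""
    long = wide.rename_axis("subject").reset_index().melt(
        id_vars="subject", var_name="method", value_name="snr")
    return pg.rm_anova(data=long, dv="snr", within="method",
                       subject="subject")["F"].iloc[0]

f_actual = rm_f(data)

n_perm = 1000
f_perm = np.empty(n_perm)
for i in range(n_perm):
    # permute each subject's values across methods, i.e. shuffle within rows only
    shuffled = np.apply_along_axis(rng.permutation, 1, data.to_numpy())
    f_perm[i] = rm_f(pd.DataFrame(shuffled, columns=data.columns))

# how extreme is the observed F within the permutation distribution?
p_perm = (np.sum(f_perm >= f_actual) + 1) / (n_perm + 1)
print(f"F = {f_actual:.2f}, permutation p = {p_perm:.4f}")
```

The same loop structure, with pairwise t-statistics instead of the F-value, would cover the post-hoc step.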
I've currently been working in Python with pingouin, but I appreciate this may be easier to do in R, so I am open to migrating if that is the case - but any advice on whether this approach even makes sense, and how to perform it if so, is greatly appreciated!
Also just to note - the method needs to be capable of dealing with NaN/None values for certain methods and subjects (so for example subject 1 for method 1 may be blank, but there are relevant values for all other methods).
Thank you.
I'm dealing with a dataframe of dimension 4 million x 70. Most columns are numeric, and some are categorical, in addition to the occasional missing values. It is essential that the clustering is run on all data points, and we are looking to produce around 400,000 clusters (so subsampling the dataset is not an option).
I have looked at using Gower's distance metric for mixed type data, but this produces a dissimilarity matrix of dimension 4 million x 4 million, which is just not feasible to work with since it has 10^13 elements. So, the method needs to avoid dissimilarity matrices entirely.
Ideally, we would use an agglomerative clustering method, since we want a large number of clusters.
What would be a suitable method for this problem? I am struggling to find a method which meets all of these requirements, and I realise it's a big ask.
Plan B is to use a simple rules-based grouping method based on categorical variables alone, handpicking only a few variables to cluster on since we will suffer from the curse of dimensionality otherwise.
The first step is going to be turning those categorical values into numbers somehow, and the second step is going to be putting the now all-numeric attributes onto the same scale.
Clustering is computationally expensive, so you might try a third step of representing this data by the top 10 components of a PCA (or however many components have an eigenvalue > 1) to reduce the columns.
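As a rough sketch of those first steps (the column names, imputation strategies, and the number of PCA components are placeholders, not recommendations):

```python
# Sketch: impute, one-hot encode the categoricals, scale the numerics, then PCA.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# tiny stand-in for the real 4 million x 70 DataFrame
df = pd.DataFrame({"num_a": [1.0, 2.0, np.nan, 4.0],
                   "num_b": [0.5, 0.1, 0.3, 0.2],
                   "cat_a": ["x", "y", "x", np.nan],
                   "cat_b": ["u", "u", "v", "v"]})
num_cols, cat_cols = ["num_a", "num_b"], ["cat_a", "cat_b"]

numeric = Pipeline([("impute", SimpleImputer(strategy="median")),
                    ("scale", StandardScaler())])
categorical = Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                        ("onehot", OneHotEncoder(handle_unknown="ignore"))])

preprocess = ColumnTransformer([("num", numeric, num_cols),
                                ("cat", categorical, cat_cols)],
                               sparse_threshold=0.0)   # dense output so PCA can consume it

# on the real data use ~10 components, or however many have an eigenvalue > 1
pipeline = Pipeline([("prep", preprocess), ("pca", PCA(n_components=2))])
X_reduced = pipeline.fit_transform(df)
print(X_reduced.shape)
```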
For the clustering step, you'll have your choice of algorithms. I would think something hierarchical would be helpful for you, since even though you expect a high number of clusters, it makes intuitive sense that those clusters would fall under larger clusters that continue to make sense all the way down to a small number of "parent" clusters. A popular choice might be HDBSCAN, but I tend to prefer trying OPTICS. The implementation in the free ELKI toolkit seems to be the fastest (it takes some messing around with to figure out) because it runs in Java. The output of ELKI is a little strange: it writes a file for every cluster, so you then have to use Python to loop through the files and create your final mapping, unfortunately. But it's all doable (including executing the ELKI command) from Python if you're building an automated pipeline.
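For illustration only, here is what the clustering step could look like with scikit-learn's OPTICS on the reduced data; this in-Python implementation will not scale to 4 million rows, which is exactly why ELKI (or HDBSCAN) is suggested above:

```python
# Toy illustration of density-based clustering on PCA-reduced data.
import numpy as np
from sklearn.cluster import OPTICS

rng = np.random.default_rng(0)
X_reduced = np.vstack([rng.normal(0, 0.3, size=(200, 10)),   # two synthetic blobs
                       rng.normal(3, 0.3, size=(200, 10))])

clusterer = OPTICS(min_samples=10)          # min_samples controls cluster granularity
labels = clusterer.fit_predict(X_reduced)   # -1 marks noise points
print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
```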
I am enquiring about assistance regarding regression testing. I have a continuous time series that fluctuates between positive and negative integers. I also have events occurring throughout this time series at seemingly random time points. Essentially, when an event occurs I grab the respective integer. I then want to test whether this integer influences the event at all - as in, are events associated with more positive or more negative integers?
I originally thought of logistic regression on the positive/negative number, but that would require at least two distinct groups, whereas I only have info on events that have occurred. I can't really include the events that don't occur, as they are somewhat continuous and random; the number of times an event doesn't occur is impossible to measure.
So my distinct group is all "true" in a sense, as I don't have any results from something that didn't occur. What I am trying to classify is:
When an outcome occurs, does the positive or negative integer influence this outcome?
It sounds like you are interested in determining the underlying forces that are producing a given stream of data. Such mathematical models are called Markov Models. A classic example is the study of text.
For example, if I run a Hidden Markov Model algorithm on a paragraph of English text, then I will find that there are two driving categories that are determining the probabilities of what letters show up in the paragraph. Those categories can be roughly broken into two groups, "aeiouy " and "bcdfghjklmnpqrstvwxz". Neither the mathematics nor the HMM "knew" what to call those categories, but they are what is statistically converged to upon analysis of a paragraph of text. We might call those categories "vowels" and "consonants". So, yes, vowels and consonants are not just 1st grade categories to learn, they follow from how text is written statistically. Interestingly, a "space" behaves more like a vowel than a consonant. I didn't give the probabilities for the example above, but it is interesting to note that "y" ends up with a probability of roughly 0.6 vowel and 0.4 consonant; meaning that "y" is the most consonant behaving vowel statistically.
A great paper is https://www.cs.sjsu.edu/~stamp/RUA/HMM.pdf which goes over the basic ideas of this kind of time-series analysis and even provides some pseudo-code for reference.
I don't know much about the data that you are dealing with, and I don't know whether the concepts of "positive" and "negative" are playing a determining factor in the data that you see, but if you ran an HMM on your data and found the two groups to be the collection of positive numbers and the collection of negative numbers, then your answer would be confirmed: yes, the two most influential categories driving your data are the concepts of positive and negative. If they don't split that way, then your answer is that those concepts are not an influential factor in driving the data. Regardless, in the end, the algorithm would leave you with several probability matrices that show you how much each integer in your data is being influenced by each category, so you would have much greater insight into the behavior of your time-series data.
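To make that concrete, here is a hedged sketch using the hmmlearn package (the integer stream is synthetic; whether two Gaussian hidden states are an appropriate model for your data is exactly the question being probed):

```python
# Fit a 2-state Gaussian HMM and check whether the hidden states
# line up with "positive" vs "negative" observations.
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(0)
values = np.concatenate([rng.integers(1, 10, 200),        # a positive regime
                         rng.integers(-10, -1, 200)]).astype(float)  # a negative regime

X = values.reshape(-1, 1)                 # hmmlearn expects (n_samples, n_features)
model = hmm.GaussianHMM(n_components=2, covariance_type="diag", n_iter=200)
model.fit(X)

states = model.predict(X)                 # most likely hidden state per observation
for s in range(2):
    print(f"state {s}: mean value = {model.means_[s, 0]:+.2f}, "
          f"share of positives = {(values[states == s] > 0).mean():.2f}")
```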
The question is quite difficult to understand after the first paragraph, but let me help based on what I could understand from it.
I am assuming you want to understand whether there is a relationship between the events happening and the integers in the data.
1st approach: plot the data on a 2D scale and check visually whether there is a relationship.
2nd approach: make the event data continuous, separate it from the rest of the data, smooth both series out using a rolling window, and then compare the trends.
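A tiny sketch of the 2nd approach (the series, the event times, and the 24-hour window are all invented for illustration; the point is only the rolling-window comparison):

```python
# Put the event values on the same time index as the series,
# smooth with a rolling window, and compare the trends visually.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
idx = pd.date_range("2023-01-01", periods=500, freq="h")
series = pd.Series(rng.normal(0, 3, 500).cumsum(), index=idx)

event_times = idx[rng.choice(500, size=40, replace=False)]
events = series.loc[event_times]                  # the integer grabbed at each event

series.rolling(24).mean().plot(label="series (24h rolling mean)")
events.plot(style="o", label="event values")
plt.legend()
plt.show()
```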
The approaches above only work well if I am understanding your problem correctly.
There is also one more thing to watch out for, known as survivorship bias. You might be missing data, so please check that as well.
Maybe I am misunderstanding your problem, but I don't believe that you can perform any kind of meaningful regression without more information.
Regression is usually used to find a relationship between two or more variables; however, it appears that you only have one variable (whether the values are positive or negative) and one constant (the outcome is always true in your data). Maybe you could do some statistics on the distribution of the numbers (mean, median, standard deviation), but I am unsure how you might do regression.
https://en.wikipedia.org/wiki/Regression_analysis
You might want to consider that there might be some strong survivorship bias if you are missing a large chunk of your data.
https://en.wikipedia.org/wiki/Survivorship_bias
Hope this is at least a bit helpful to get you steered in the right direction.
I have a set of data with 50 features (c1, c2, c3 ...), with over 80k rows.
Each row contains normalised numerical values (ranging 0-1). It is actually a set of normalised dummy variables, whereby some rows have only a few features, 3-4 (i.e. 0 is assigned if there is no value). Most rows have about 10-20 features.
I used KMeans to cluster the data, always resulting in a cluster with a large number of members. Upon analysis, I noticed that rows with fewer than 4 features tend to get clustered together, which is not what I want.
Is there any way to balance out the clusters?
It is not part of the k-means objective to produce balanced clusters. In fact, solutions with balanced clusters can be arbitrarily bad (just consider a dataset with duplicates). K-means minimizes the sum-of-squares, and putting these objects into one cluster seems to be beneficial.
What you see is the typical effect of using k-means on sparse, non-continuous data. Encoded categorical variables, binary variables, and sparse data are just not well suited to k-means' use of means. Furthermore, you'd probably need to carefully weight the variables, too.
Now, a hotfix that will likely improve your results (at least the perceived quality, because I do not think it makes them statistically any better) is to normalize each vector to unit length (Euclidean norm 1). This will emphasize the ones in rows with few nonzero entries. You'll probably like the results more, but they will be even harder to interpret.
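A minimal sketch of that hotfix (the dummy sparse-ish data and `n_clusters=8` are arbitrary):

```python
# L2-normalize each row to unit length before running k-means.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
X = rng.random((1000, 50)) * (rng.random((1000, 50)) < 0.2)   # sparse-ish dummy data

X_unit = normalize(X, norm="l2")          # each row scaled to Euclidean length 1
labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(X_unit)
print(np.bincount(labels))                # cluster sizes
```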
I am trying to come up with a method to regression test number sequences.
My system under test produces a large number of measurements for each system version (e.g. height, width, depth, etc.). These numbers vary from version to version in an unknown fashion. Given a sequence of "good" versions and one "new" version, I'd like to find the sequences which are most abnormal.
Example:
"Good" version:
version  width  height  depth
      1    123      43    302
      2    122      44    304
      3    120      46    300
      4    124      45    301
"New" version:
      5    121      60    305
In this case I obviously would like to find the height sequence because the value 60 stands out more than the width or the depth.
My current approach computes the mean and the standard deviation of each sequence of the good cases, and for a new version's number it computes the probability that this number is part of this sequence (based on the known mean and standard deviation). This works … kind of.
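In code, that current approach presumably looks roughly like this (a sketch using the width sequence from the example above and a normality assumption, which is exactly what breaks down below):

```python
# How surprising is the new value under a normal distribution
# fitted to the "good" history of one sequence?
import numpy as np
from scipy import stats

good = np.array([123, 122, 120, 124], dtype=float)   # e.g. the width sequence
new_value = 121

z = (new_value - good.mean()) / good.std(ddof=1)
# two-sided tail probability of seeing a value at least this far from the mean
p = 2 * stats.norm.sf(abs(z))
print(f"z = {z:+.2f}, tail probability = {p:.3f}")
```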
The numbers in my sequences are not necessarily Gaussian distributed around a mean value but are often rather constant, only sometimes producing an outlier value which itself also seems to be rather constant, e.g. 10, 10, 10, 10, 10, 5, 10, 10, 10, 5, 10, 10, 10. In this case, based only on mean and standard deviation, the value 10 would not be 100% likely to be in the sequence, and the value 5 would be rather unlikely.
I considered using a histogram approach and hesitated there to ask here first. The problem with a histogram would be that I would need to store quite a lot of information for each sequence (in contrast to just a mean and standard deviation).
The next aspect I thought about was that I am pretty sure that this kind of task is not new and that there probably are already solutions which would fit my situation nicely; but I did not find much in my research.
I found a library like PyBrain which at first glance seems to process number sequences and then apparently tries to analyse these with a simulated neural network. I'm not sure if this would be an approach for me (and again it seems like I would have to store a large amount of data for each number sequence, like a complete neural network).
So my question is this:
Is there a technique, an algorithm, or a science discipline out there which would help me analyse number sequences to find abnormalities (in a last value)? Preferably while storing only small amounts of data per sequence ;-)
For concrete implementations I'd prefer Python, but hints on other languages would be welcome as well.
You could use a regression technique called a Gaussian process (GP) to learn the curve and then apply the Gaussian process to the next example in your sequence.
Since a GP gives you not only an estimate for the target but also a confidence, you could threshold on the confidence to determine what is an outlier.
To realize this, various toolboxes exist (scikits.learn, shogun, ...), but what is likely easiest is GPy. An example of 1D regression that you can tune to get your task going is nicely described in the following notebook:
http://nbviewer.jupyter.org/github/SheffieldML/notebook/blob/master/GPy/basic_gp.ipynb
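A hedged sketch of that idea with GPy, using the height example from the question (the 3-sigma threshold is an arbitrary illustrative choice):

```python
# Fit a GP to the "good" versions, predict the new version,
# and flag it if it falls far outside the predictive confidence band.
import numpy as np
import GPy

versions = np.array([1, 2, 3, 4], dtype=float).reshape(-1, 1)
heights = np.array([43, 44, 46, 45], dtype=float).reshape(-1, 1)

kernel = GPy.kern.RBF(input_dim=1)
model = GPy.models.GPRegression(versions, heights, kernel)
model.optimize(messages=False)

mean, var = model.predict(np.array([[5.0]]))      # predict version 5
new_value = 60.0
if abs(new_value - mean[0, 0]) > 3 * np.sqrt(var[0, 0]):
    print("height looks abnormal for version 5")
```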
"Is there a technique, an algorithm, or a science discipline out there which would help me analyse number sequences to find abnormalities (in a last value)?"
The scientific discipline you are looking for is called outlier detection / anomaly detection. There are a lot of techniques and algorithms you can use. As a starting point, maybe have a look at the Wikipedia articles on outlier detection and anomaly detection. There is also a similar question on stats.stackexchange.com and one on datascience.stackexchange.com that is focused on Python.
You also should think about what is worse in your case, false positives (type 1 error) or false negatives (type 2 error), as decreasing the percentage of one of these error types increases the percentage of the other.
EDIT: given your requirement to handle multiple peaks in some cases and flat distributions in others, an algorithm like this could work (a Python sketch follows the notes below):
1.) count the number of occurrences of each single number in your sequence, and place the count in a bin that corresponds to that number (initial bin width = 1)
2.) iterate through the bins: if a single bin counts more than e.g. 10% (parameter a) of the total number of values in your sequence, mark the numbers of that bin as "good values"
3.) increase the bin width by 1 and repeat step 1 and 2
4.) repeat step 1-3 until e.g. 90% (parameter b) of the numbers in your sequence are marked as "good values"
5.) let the test cases for the bad values fail
This algorithm should work for cases such as:
a single large peak with some outliers
multiple large peaks and some outliers in between
a flat distribution with a concentration in a certain region (or in multiple regions)
a number sequence where all numbers are equal
Parameters a and b have to be adjusted to your needs, but I think that shouldn't be hard.
Note: to check which bin a value belongs to, you can use the modulo operator (%). E.g. if the bin size is 3 and you have the values 475, 476, 477, 478, 479, name each bin after the value whose modulo with the bin size is zero: 477 % 3 == 0, so put 477, 478, and 479 into bin 477.
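A rough Python sketch of these steps (parameters a and b as above; for simplicity it bins by floor division rather than the modulo naming trick from the note, and the example sequence gets one extra, clearly bad value so that something is actually flagged):

```python
# Sketch of the binning procedure: widen bins until b% of the values are "good".
from collections import Counter

def find_good_values(sequence, a=0.10, b=0.90):
    """Mark values as 'good' following steps 1-4; everything else fails step 5."""
    n = len(sequence)
    good = set()
    width = 1
    max_width = max(sequence) - min(sequence) + 1
    while sum(x in good for x in sequence) < b * n and width <= max_width:
        # step 1: bin every value (bin id = value // width, i.e. bins of size `width`)
        bins = Counter(x // width for x in sequence)
        # step 2: bins holding more than a * n of the values are "good"
        for bin_id, count in bins.items():
            if count > a * n:
                good.update(x for x in sequence if x // width == bin_id)
        # step 3: widen the bins and repeat (step 4: loop until b * n values are good)
        width += 1
    return good

seq = [10, 10, 10, 10, 10, 5, 10, 10, 10, 5, 10, 10, 10, 42]   # 42 added as an outlier
good = find_good_values(seq)
print("good values:", sorted(good))                            # -> [5, 10]
print("flag as failures:", [x for x in seq if x not in good])  # -> [42]
```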
I wonder if different columns in your data can be treated in different ways? Is it appropriate to, for example, treat the width with a "close to the mean" check, another column with a "value seen in the set of good examples" check, and a third column with an "in an existing cluster from k-means clustering of good examples" check?
You could score each column and flag any new version that has one or more columns not deemed to fit, and state why.
Hmm, it's not restricted to individual columns - if, for example, there is some relation between column values then that could be checked for too - maybe width times height is limited; or the volume has limits.
Time: it may be that successive values can only deviate in some given manner by some amount. If, for example, the sides were continuously varied by some robot and the time between measurements was short enough, then that would limit the delta between successive readings to what the robotic mechanism could produce when it is working correctly.
I guess a large part of this answer is to use any knowledge you have about the data source to help.
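A hedged sketch of what such per-column scoring might look like (the columns, thresholds, and check rules are invented for illustration, using the example data from the question):

```python
# A different check per column; flag the new version wherever a check fails.
import numpy as np

good = {"width":  np.array([123, 122, 120, 124]),
        "height": np.array([43, 44, 46, 45]),
        "depth":  np.array([302, 304, 300, 301])}
new = {"width": 121, "height": 60, "depth": 305}

checks = {
    "width":  lambda v, g: abs(v - g.mean()) <= 3 * g.std(ddof=1),   # close to the mean
    "height": lambda v, g: v in set(g),                              # value seen before
    "depth":  lambda v, g: g.min() - 5 <= v <= g.max() + 5,          # inside a tolerance band
}

for col, check in checks.items():
    if not check(new[col], good[col]):
        print(f"{col}: new value {new[col]} does not fit the good examples")
```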
I am not sure if I understand you correctly, but I think you want to predict whether a sample presented to you (after experiencing a sequence of previous examples) is anomalous or not? You are therefore implying some sort of temporal dependency for the new sample?
If you have lots of training data, i.e. (hundreds or thousands of) examples of (labelled) good and bad sequences, then you might be able to train a neural architecture to classify whether the 'next element in the sequence' is anomalous or not. You could train an LSTM (long short-term memory) architecture that would generalise over input sequences to accurately classify the new sample presented to the architecture.
LSTMs will be available in any good neural network library and basically you will be running a general Supervised Learning routine. Tutorials about this are all over the Internet and in any good machine learning (ML) book.
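For illustration, a minimal Keras sketch of such a classifier (the window length, layer size, and the synthetic data and labels are placeholders; this only makes sense if you really do have labelled sequences):

```python
# An LSTM that reads a window of previous values and classifies
# whether the next sample is anomalous (binary label).
import numpy as np
from tensorflow import keras

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20, 1))    # 1000 sequences, 20 time steps, 1 feature
y = rng.integers(0, 2, size=1000)     # dummy labels: 1 = anomalous next sample

model = keras.Sequential([
    keras.Input(shape=(20, 1)),
    keras.layers.LSTM(32),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=3, batch_size=32, validation_split=0.2, verbose=0)
```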
As always in ML, take care of not over-fitting!