Test differently binned data sets - python

I am trying to test how a periodic data set behaves with respect to the same data set folded with the period (that is, the average profile). More specifically, I want to test if the single profiles are consistent with the average one.
I am reading about a number of tests available in Python, especially the two-sample Kolmogorov-Smirnov statistic and the chi-square test.
However, my data are real data, and of course binned.
Therefore, as is often the case, my data have gaps. This means that very often the number of bins in the single profiles is smaller than the number of bins of the "model" (the folded/average profile).
This means that I can't use those tests directly (because the two arrays have different numbers of elements), but I probably need to:
1) apply some transformation, or other operation, that allows me to compare the distributions;
2) convert the average profile into a continuous model, which would also be a nice solution;
3) proceed with different statistical tools which I am not aware of.
But I don't know how to move forward in any of these cases, so I would need help finding a way for (1) or (2) (perhaps both!), or a hint about the third option.
EDIT: the data are a light curve, that is photon counts versus time.
The data are from a periodic astronomical source, i.e. they repeat their pattern (profile) once every period. I can fold the data at the period and obtain an average profile, and I want to use this average profile as a model against which to test each single profile.
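To make the setup concrete, this is roughly what I mean by folding (a sketch; `times`, `counts`, `period` and the `single_*` arrays are hypothetical stand-ins for my actual data):

```python
import numpy as np

# Phase-fold the light curve and build the average profile in n_bins bins.
phases = (times % period) / period            # phase of each sample in [0, 1)
n_bins = 32
bin_idx = (phases * n_bins).astype(int)

summed = np.bincount(bin_idx, weights=counts, minlength=n_bins)
n_samples = np.bincount(bin_idx, minlength=n_bins)
avg_profile = summed / n_samples              # mean counts per phase bin
                                              # (assumes all bins populated overall)

# For a single cycle with gaps, one option is to compare only the
# populated bins against the average profile, e.g. with a chi-square:
mask = single_n_samples > 0
chi2 = np.sum((single_profile[mask] - avg_profile[mask]) ** 2
              / avg_profile[mask])
```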
Thanks!

Related

Within-Subject Permutation Testing Python

I'm trying to do permutation testing on a within-subject design, and with stats not being my strong point, after lots of desperate googling I'm still confused.
I have data from 36 subjects, and each subject's data is processed by 6 different methods. I have a metric (say SNR) for how well each method performs (essentially a 36x6 matrix).
The data violates the conditions for parametric testing (not normal, and not homogeneous variance between groups), and rather than using non-parametric testing, we want to use permutation testing.
I want to see if my approach makes sense...
Initially:
Perform an rmANOVA on the data, save the F-value as F-actual.
Shuffle the data between the columns (methods) randomly, but with the constraint that each value must stay in the row associated with its original subject (any tips on how to perform this are appreciated; see the sketch after these steps).
After each shuffle (permutation), recompute the F-value and save to an array of possible F-values.
Check how often F-actual is more extreme than the values in the array of possible F-values.
Post-Hoc Testing:
Perform pairwise t-tests on the data, save the associated T-statistic as T-actual for each pairing.
Shuffle the data between the columns (methods) randomly, but with the constraint that each value must stay in the row associated with its original subject (the same shuffling as above).
After each shuffle (permutation), recompute the T-stat and save to an array of possible T-values for each pairing.
After n-permutations, check how often the actual T-stat for each pairing is more extreme than those possible T-values for each pairing.
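For the row-constrained shuffle used in both procedures, a minimal NumPy sketch might look like this (`data`, `rm_anova_f`, and `f_actual` are hypothetical placeholders for the real matrix and statistics):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# data: hypothetical (36, 6) array, rows = subjects, columns = methods.
# rng.permuted with axis=1 shuffles the values within each row
# independently, so every value (including NaNs) stays with its subject.
n_perm = 5000
f_perm = np.empty(n_perm)
for i in range(n_perm):
    shuffled = rng.permuted(data, axis=1)
    f_perm[i] = rm_anova_f(shuffled)  # hypothetical helper returning the rmANOVA F

# Permutation p-value: how often a shuffled F is at least as extreme.
p = (np.sum(f_perm >= f_actual) + 1) / (n_perm + 1)
```

The same loop works for the post-hoc pairwise tests by swapping in a t-statistic helper.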
I've currently been working in Python with pingouin, but I appreciate this may be easier to do in R, so I am open to migrating if that is the case. Any advice on whether this approach even makes sense, and how to perform it if so, is greatly appreciated!
Also just to note - the method needs to be capable of dealing with NaN/None values for certain methods and subjects (so for example subject 1 for method 1 may be blank, but there are relevant values for all other methods).
Thank you.

How do I find the 100 most different points within a pool of 10,000 points?

I have a set of 10,000 points, each made up of 70 boolean dimensions. From this set of 10,000, I would like to select 100 points which are representative of the whole set of 10,000. In other words, I would like to pick the 100 points which are most different from one another.
Is there some established way of doing this? The first thing that comes to my mind is a greedy algorithm, which begins by selecting one point at random, then the next point is selected as the most distant one from the first point, and then the second point is selected as having the longest average distance from the first two, etc. This solution doesn't need to be perfect, just roughly correct. Preferably, this solution of 100 points can also be found within ~10 minutes but finishing within 24 hours is also fine.
I don't care about distance, in particular, that's just something that comes to mind as a way to capture "differentness."
If it matters, every point has 10 values of TRUE and 60 values of FALSE.
Some already-built Python package to do this would be ideal, but I am also happy to just write the code myself if somebody could point me to a Wikipedia article.
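To illustrate the greedy idea described above, a rough sketch (using the distance to the nearest already-chosen point rather than the average, which is the common "farthest-point" variant; `X` is a hypothetical (10000, 70) boolean array):

```python
import numpy as np

Xf = X.astype(float)                      # X: hypothetical boolean (10000, 70)
n_pick = 100

rng = np.random.default_rng(0)
chosen = [int(rng.integers(len(Xf)))]     # start from a random point
# min_d[i] = distance from point i to its nearest already-chosen point
min_d = np.linalg.norm(Xf - Xf[chosen[0]], axis=1)
for _ in range(n_pick - 1):
    nxt = int(np.argmax(min_d))           # point farthest from the chosen set
    chosen.append(nxt)
    min_d = np.minimum(min_d, np.linalg.norm(Xf - Xf[nxt], axis=1))
```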
Thanks
Your use of "representative" is not standard terminology, but I read your question as you wish to find 100 items that cover a wide gamut of different examples from your dataset. So if 5000 of your 10000 items were near identical, you would prefer to see only one or two items from that large sub-group. Under the usual definition, a representative sample of 100 would have ~50 items from that group.
One approach that might match your stated goal is to identify diverse subsets or groups within your data, and then pick an example from each group.
You can establish group identities for a fixed number of groups - with different membership size allowed for each group - within a dataset using a clustering algorithm. A good option for you might be k-means clustering with k=100. This will find 100 groups within your data and assign all 10,000 items to one of those 100 groups, based on a simple distance metric. You can then either take the central point from each group or a random sample from each group to find your set of 100.
The k-means algorithm is based around minimising a cost function which is the average distance of each group member from the centre of its group. Both the group centres and the membership are allowed to change, updated in an alternating fashion, until the cost cannot be reduced any further.
Typically you start by assigning each item randomly to a group. Then calculate the centre of each group. Then re-assign items to groups based on the closest centre. Then recalculate the centres, etc. Eventually this should converge. Multiple runs might be required to find a good set of centres (the algorithm can get stuck in a local optimum).
There are several implementations of this algorithm in Python. You could start with the scikit learn library implementation.
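A minimal sketch with scikit-learn (assuming `points` is your hypothetical (10000, 70) boolean array), taking the member closest to each centroid as the representative:

```python
import numpy as np
from sklearn.cluster import KMeans

X = points.astype(float)                  # cast booleans to floats for k-means
km = KMeans(n_clusters=100, n_init=10, random_state=0).fit(X)

# One representative per cluster: the member nearest its cluster centre.
representatives = []
for k in range(100):
    members = np.where(km.labels_ == k)[0]
    dists = np.linalg.norm(X[members] - km.cluster_centers_[k], axis=1)
    representatives.append(members[np.argmin(dists)])
```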
According to an IBM support page (from a comment by sascha), k-means may not work well with binary data. Other clustering algorithms may work better. You could also try to convert your records to a space where Euclidean distance is more useful and continue to use k-means clustering. An algorithm that may do that for you is principal component analysis (PCA), which is also implemented in scikit-learn.
The graph partitioning tool METIS claims to be able to partition graphs with millions of vertices in 256 parts within seconds.
You could treat your 10,000 points as vertices of an undirected graph. A fully connected graph with 50 million edges would probably be too big. Therefore, you could restrict the edges to "similarity links" between points whose Hamming distance is below a certain threshold.
In general, Hamming distances for 70-bit words range between 0 and 70. In your case, the upper limit is 20, as there are 10 true coordinates and 60 false coordinates per point. The maximum distance occurs when the true coordinates of the two points are all at different positions.
Creating the graph is a costly O(n^2) operation, but it might be possible to get it done within your envisaged time frame.
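A sketch of the edge-building step (again with a hypothetical `X`; the threshold must be tuned for sparsity, and at n = 10,000 the full distance vector needs a few hundred MB, so chunking may be preferable):

```python
import numpy as np
from scipy.spatial.distance import pdist

# X: hypothetical (10000, 70) boolean array.
# 'hamming' returns the *fraction* of differing coordinates, so scale by 70.
d = pdist(X, metric='hamming') * 70       # condensed upper-triangle distances

threshold = 10                            # tune: smaller -> sparser graph
i, j = np.triu_indices(len(X), k=1)       # pair indices in pdist order
edges = np.column_stack((i, j))[d < threshold]
```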

Pattern prediction in time series

Has anyone tried to predict a specific pattern in time series data?
Example: at a specific time, there is a huge upward spike in a certain variable of a time series...
How would I build a model to predict that spike the next time it occurs?
Please do respond if anyone is working in this area.
I tried converting that particular series of data into a NumPy array and feeding it into the model, but it's not working.
Here is what the data looks like:
This data is generated in a controlled manner so that the spikes occur close together. In the actual case they could be random, and our main objective is to catch this pattern and count the occurrences.
Das, you could try implementing LSTM-based neural network models.
See:
https://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/
It is still preferred that the data contains a trend. If the upward spike happens around the same time of the recurring time interval, it is more likely that you get a better prediction result.
In the image you shared, there seems to be a trend in the data, so LSTM models can extract the pattern quite efficiently and output a prediction.
Statistical modelling of the data can also provide better results.
See: https://orangematter.solarwinds.com/2019/12/15/holt-winters-forecasting-simplified/
Das, if outputting the total number of peaks is the sole requirement, then I think heavy neural network models are a bit of an overkill. Neural networks can also do the job pretty well, but they require a lot of input data for training and fine-tuning the weights and biases to give a really good result.
How about trying a thresholding-based technique, where you increment a counter every time the data value crosses a preset threshold? In such an approach you should make sure to group very nearby peaks together so that they are counted only once. Here you could set a threshold on the x axis too.
For instance, with respect to the given plot, let the y-threshold be 4. Then you will get a count of 5 if you consider the y-axis threshold (y value 4) alone, because at the x value 15:48.2 there are two peaks that cross y value 4. If you also set a threshold on the x axis, these nearby peaks will be grouped together within the preset limit and the final count will be 4 (which is the requirement).
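scipy's find_peaks can express both thresholds directly; a minimal sketch (`y` is a hypothetical array of the series values, and both parameters need tuning to your data):

```python
from scipy.signal import find_peaks

# height is the y-threshold; distance is the x-threshold, i.e. the minimum
# number of samples between two peaks that are counted separately.
peaks, _ = find_peaks(y, height=4, distance=5)
count = len(peaks)
```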

How to Make statistical tests in time series applications

I received a feedback from my paper about stock market forecasting with Machine Learning, and the reviewer asked the following:
I would like you to statistically test the out-of-sample performance of your methods. Hence 'differ significantly' in the original wording. I agree that some of the figures look awesome visually, but visually, random noise seems to contain patterns. I believe Sortino Ratio is the appropriate statistic to test, and it can be tested by using bootstrap. I.e., a distribution is obtained for both BH and your strategy, and the overlap of these distributions is calculated.
My problem is that I have never done that for time series data. My validation procedure uses a strategy called walk-forward, where I shift the data in time 11 times, generating 11 different combinations of training and test sets with no overlap. So, here are my questions:
1- what would be the best (or more appropriate) statistical test to use given what the reviewer is asking?
2- If I remember correctly, statistical tests require vectors as input, is that correct? Can I generate a vector containing 11 values of the Sortino ratio (1 for each walk) and then compare them with baselines, or should I run my code more than once? I am afraid the latter would be unfeasible given the short time for review.
So, what would be the correct actions to compare machine learning approaches statistically in this time series scenario?
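For reference, the bootstrap the reviewer describes could be sketched like this (`strategy_returns` and `bh_returns` are hypothetical per-period return arrays; note that a plain i.i.d. resample ignores serial dependence, so a block bootstrap may be more defensible for time series):

```python
import numpy as np

rng = np.random.default_rng(0)

def sortino(returns, target=0.0):
    # Mean excess return over downside deviation (returns below target).
    downside = returns[returns < target] - target
    return (returns.mean() - target) / np.sqrt(np.mean(downside ** 2))

def bootstrap_sortino(returns, n_boot=10_000):
    # Resample the return series with replacement and recompute the ratio.
    n = len(returns)
    return np.array([sortino(returns[rng.integers(0, n, n)])
                     for _ in range(n_boot)])

dist_strategy = bootstrap_sortino(strategy_returns)
dist_bh = bootstrap_sortino(bh_returns)

# Overlap of the two distributions: how often a BH draw beats a strategy draw.
p = np.mean(dist_bh >= dist_strategy)
```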
By pointing out that random noise seems to contain patterns, the reviewer means that your plots may show nice patterns which could nevertheless be random noise following some distribution (e.g. uniform noise), which makes the visual evidence less convincing. It might be a good idea to split the data into k groups randomly, then apply a Z-test or t-test to compare the k groups pairwise.
The reviewer points to the Sortino ratio, which seems ambiguous given that you are building a machine learning model for a forecasting task: what you actually care about is forecasting accuracy and reliability, which can be assessed with cross-validation (in convex optimization, the equivalent is sensitivity analysis).
Update
The problem of serial dependence in time series data arises when the series is non-stationary, which does not seem to be the case for your data. Even if it were, it could be addressed by removing the trends, i.e. converting the non-stationary series into a stationary one (testing with the ADF test, for example), and you might also consider ARIMA models.
Time shifting can sometimes be useful, but it is not considered a good measurement of noise; it might, however, help to improve model accuracy by shifting the data and extracting some features (e.g. mean, variance over a window size, etc.).
There is nothing preventing you from trying the time-shifting approach, but you can't rely on it as an accurate measurement, and you still need to support your statistical analysis with more robust techniques.

Regression Tests on Arbitrary Number Sequences

I am trying to come up with a method to regression test number sequences.
My system under tests produces a large amount of numbers for each system version (e. g. height, width, depth, etc.). These numbers vary from version to version in an unknown fashion. Given a sequence of "good" versions and one "new" version I'd like to find the sequences which are most abnormal.
Example:
"Good" version:
version width height depth
1 123 43 302
2 122 44 304
3 120 46 300
4 124 45 301
"New" version:
5 121 60 305
In this case I obviously would like to find the height sequence because the value 60 stands out more than the width or the depth.
My current approach computes the mean and the standard deviation of each sequence of the good cases and for a new version's number it computes the probability that this number is part of this sequence (based on the known mean and standard deviation). This works … kind of.
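In code, the gist of this approach is something like the following (using the height numbers from above; my real code differs in the details):

```python
import numpy as np
from scipy.stats import norm

good = np.array([43, 44, 46, 45], dtype=float)  # heights of the "good" versions
z = (60 - good.mean()) / good.std(ddof=1)       # the new version's height
p = 2 * norm.sf(abs(z))  # two-sided tail probability; small p -> abnormal
```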
The numbers in my sequences are not necessarily Gaussian distributed around a mean value but often are rather constant and only sometimes produce an outlier value which also seems to be rather constant, e. g. 10, 10, 10, 10, 10, 5, 10, 10, 10, 5, 10, 10, 10. In this case, only based on mean and standard deviation the value 10 would not be 100% likely to be in the sequence, and the value 5 would be rather unlikely.
I considered using a histogram approach, but hesitated and decided to ask here first. The problem with a histogram is that I would need to store quite a lot of information for each sequence (in contrast to just a mean and standard deviation).
The next aspect I thought about was that this kind of task is surely not new, and that there are probably existing solutions which would fit my situation nicely; but I did not find much in my research.
I found a library like PyBrain which at first glance seems to process number sequences and then apparently tries to analyse these with a simulated neural network. I'm not sure if this would be an approach for me (and again it seems like I would have to store a large amount of data for each number sequence, like a complete neural network).
So my question is this:
Is there a technique, an algorithm, or a science discipline out there which would help me analyse number sequences to find abnormalities (in a last value)? Preferably while storing only small amounts of data per sequence ;-)
For concrete implementations I'd prefer Python, but hints on other languages would be welcome as well.
You could use a regression technique called a Gaussian process (GP) to learn the curve and then apply it to the next example in your sequence.
Since a GP gives you not only an estimate for the target but also a confidence, you could threshold on the confidence to determine what is an outlier.
Various toolboxes exist to realize this (scikits.learn, shogun, ...), but likely the easiest is GPy. An example of 1-D regression that you can tune to get your task going is nicely described in the following notebook:
http://nbviewer.jupyter.org/github/SheffieldML/notebook/blob/master/GPy/basic_gp.ipynb
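A minimal GPy sketch on the toy numbers from the question (version as input, height as target; the +/- 3 sigma rule is an arbitrary choice you would tune):

```python
import numpy as np
import GPy

X = np.array([[1.], [2.], [3.], [4.]])      # versions
y = np.array([[43.], [44.], [46.], [45.]])  # heights from the "good" runs

model = GPy.models.GPRegression(X, y, GPy.kern.RBF(input_dim=1))
model.optimize()

# Predictive mean and variance for the new version; flag the observed
# value if it falls outside e.g. +/- 3 predictive standard deviations.
mean, var = model.predict(np.array([[5.]]))
z = (60 - mean[0, 0]) / np.sqrt(var[0, 0])
is_outlier = abs(z) > 3
```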
Is there a technique, an algorithm, or a science discipline out there which would help me analyse number sequences to find abnormalities (in a last value)?
The scientific discipline you are looking for is called outlier detection / anomaly detection. There are a lot of techniques and algorithms you can use. As a starting point, maybe have a look at Wikipedia here (outlier detection) and here (anomaly detection). There is also a similar question on stats.stackexchange.com and one on datascience.stackexchange.com that is focused on Python.
You also should think about what is worse in your case, false positives (type 1 error) or false negatives (type 2 error), as decreasing the percentage of one of these error types increases the percentage of the other.
EDIT: given your requirement with multiple peaks in some cases, flat distributions in other cases, an algorithm like this could work:
1.) count the number of occurrences of each single number in your sequence, and place the count in a bin that corresponds to that number (initial bin width = 1)
2.) iterate through the bins: if a single bin counts more than e.g. 10% (parameter a) of the total number of values in your sequence, mark the numbers of that bin as "good values"
3.) increase the bin width by 1 and repeat step 1 and 2
4.) repeat step 1-3 until e.g. 90% (parameter b) of the numbers in your sequence are marked as "good values"
5.) let the test cases for the bad values fail
This algorithm should work for cases such as:
a single large peak with some outliers
multiple large peaks and some outliers in between
a flat distribution with a concentration in a certain region (or in multiple regions)
a number sequence where all numbers are equal
Parameters a and b have to be adjusted to your needs, but I think that shouldn't be hard.
Note: to check to which bin a value belongs, you can use the modulo operator (%), e.g. if bin size is 3, and you have the values 475,476,477,478,479 name the bin according to the value where its modulo with the bin size is zero -> 477%3=0 -> put 477, 478, and 479 into bin 477.
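A sketch of the algorithm above (a minimal, unoptimised version; in the usage example the parameters are tuned so that the 10/5 sequence from the question behaves as desired):

```python
from collections import Counter

def good_values(seq, a=0.10, b=0.90):
    # a: minimum fraction of values a bin must hold to be marked "good".
    # b: stop once this fraction of the sequence is marked good.
    n = len(seq)
    good, width = set(), 1
    while sum(v in good for v in seq) < b * n:
        # Bin values by flooring to a multiple of the current width
        # (equivalent to naming the bin by the value whose modulo is zero).
        counts = Counter(v // width * width for v in seq)
        for start, count in counts.items():
            if count > a * n:
                good.update(v for v in seq if start <= v < start + width)
        width += 1
    return good

seq = [10, 10, 10, 10, 10, 5, 10, 10, 10, 5, 10, 10, 10]
# With a=0.2, b=0.8 the frequent 10s are accepted and the rare 5s fail:
bad = [v for v in seq if v not in good_values(seq, a=0.2, b=0.8)]  # [5, 5]
```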
I wonder if different columns in your data can be treated in different ways. Is it appropriate, for example, to treat the width with a "close to the mean" check; another column with a "value seen in the set of good examples" check; and a third column with an "in an existing cluster from k-means clustering of good examples" check?
You could score each column and flag any new value where one or more columns are not deemed to fit, and state why.
It's not restricted to individual columns: if, for example, there is some relation between column values, that could be checked for too. Maybe width times height is limited, or the volume has limits.
Time: it may be that successive values can only deviate from each other in some bounded manner. If, for example, the sides were continuously varied by some robot and the time between measurements was short enough, that would limit the deltas between successive readings to what the robotic mechanism can produce when working correctly.
I guess a large part of this answer is to use any knowledge you have about the data source to help.
I am not sure if I understand you correctly, but I think you want to predict if a sample presented to you (after experiencing a sequence of previous examples) is anomalous or not? You are therefore implying some sort of temporal dependency of the new sample?
If you have lots of training data i. e. (hundreds or thousands of) examples of (labelled) good and bad sequences, then you might be able to train a neural architecture to classify if the 'next element in the sequence' is anomalous or not. You could train an LSTM (long short-term memory) architecture that would generalise over input sequences to accurately classify the new sample presented to the architecture.
LSTMs will be available in any good neural network library and basically you will be running a general Supervised Learning routine. Tutorials about this are all over the Internet and in any good machine learning (ML) book.
As always in ML, take care of not over-fitting!
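A minimal Keras sketch of such a classifier, with random stand-in data (real labelled sequences would replace `X` and `y`):

```python
import numpy as np
from tensorflow import keras

# Stand-in data: 1000 sequences of 20 steps; label 1 = last value anomalous.
X = np.random.rand(1000, 20, 1)   # (samples, timesteps, features)
y = np.random.randint(0, 2, size=1000)

model = keras.Sequential([
    keras.layers.LSTM(32, input_shape=(20, 1)),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
# The validation split helps spot over-fitting, as cautioned above.
model.fit(X, y, epochs=10, validation_split=0.2)
```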
