I am trying to come up with a method to regression test number sequences.
My system under test produces a large set of numbers for each system version (e.g. height, width, depth, etc.). These numbers vary from version to version in an unknown fashion. Given a sequence of "good" versions and one "new" version, I'd like to find the sequences which are most abnormal.
Example:
"Good" version:
version  width  height  depth
1        123    43      302
2        122    44      304
3        120    46      300
4        124    45      301
"New" version:
5        121    60      305
In this case I obviously would like to find the height sequence because the value 60 stands out more than the width or the depth.
My current approach computes the mean and the standard deviation of each sequence of the good cases and for a new version's number it computes the probability that this number is part of this sequence (based on the known mean and standard deviation). This works … kind of.
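A minimal reconstruction of that mean/standard-deviation check (the 3-sigma cutoff is an assumption, not something stated in the question):

```python
import numpy as np

def zscore_outlier(good_values, new_value, threshold=3.0):
    """Flag new_value if it lies more than `threshold` standard deviations
    from the mean of the known-good values."""
    mean = np.mean(good_values)
    std = np.std(good_values)
    if std == 0:
        return new_value != mean
    return abs(new_value - mean) / std > threshold

print(zscore_outlier([43, 44, 46, 45], 60))        # True: the height stands out
print(zscore_outlier([123, 122, 120, 124], 121))   # False: the width looks normal
```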
The numbers in my sequences are not necessarily Gaussian-distributed around a mean value; often they are rather constant and only occasionally produce an outlier value which itself also seems to be rather constant, e.g. 10, 10, 10, 10, 10, 5, 10, 10, 10, 5, 10, 10, 10. In this case, based only on mean and standard deviation, the value 10 would not be 100% likely to be in the sequence, and the value 5 would be considered rather unlikely.
I considered a histogram approach, but hesitated and decided to ask here first. The problem with a histogram is that I would need to store quite a lot of information for each sequence (in contrast to just a mean and a standard deviation).
I am also pretty sure that this kind of task is not new and that there probably already are solutions which would fit my situation nicely, but my research did not turn up much.
I found a library like PyBrain which at first glance seems to process number sequences and then analyse them with a simulated neural network. I'm not sure whether this is the right approach for me (and again it seems I would have to store a large amount of data for each number sequence, namely a complete neural network).
So my question is this:
Is there a technique, an algorithm, or a science discipline out there which would help me analyse number sequences to find abnormalities (in the last value)? Preferably while storing only small amounts of data per sequence ;-)
For concrete implementations I'd prefer Python, but hints on other languages would be welcome as well.
You could use a regression technique called a Gaussian process (GP) to learn the curve and then apply the Gaussian process to the next example in your sequence.
Since a GP gives you not only an estimate for the target but also a confidence, you could threshold on that confidence to determine what is an outlier.
Various toolboxes exist to realize this (scikit-learn, shogun, ...), but probably the easiest to get going with is GPy. An example of 1-D regression that you can adapt to your task is nicely described in the following notebook:
http://nbviewer.jupyter.org/github/SheffieldML/notebook/blob/master/GPy/basic_gp.ipynb
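A minimal sketch of that idea with GPy, using the height sequence from the question (the 3-sigma cutoff on the predictive standard deviation is an assumption):

```python
import numpy as np
import GPy

X = np.array([[1.0], [2.0], [3.0], [4.0]])      # "good" version numbers
Y = np.array([[43.0], [44.0], [46.0], [45.0]])  # e.g. the height sequence

kernel = GPy.kern.RBF(input_dim=1)
model = GPy.models.GPRegression(X, Y, kernel)
model.optimize()                                 # fit the GP hyperparameters

mean, var = model.predict(np.array([[5.0]]))     # prediction for the new version
observed = 60.0
# flag as abnormal if the observed value falls far outside the confidence band
is_outlier = abs(observed - mean[0, 0]) > 3.0 * np.sqrt(var[0, 0])
print(is_outlier)
```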
"Is there a technique, an algorithm, or a science discipline out there which would help me analyse number sequences to find abnormalities (in the last value)?"
The scientific discipline you are looking for is called outlier detection / anomaly detection. There are a lot of techniques and algorithms you can use. As a starting point, maybe have a look at the Wikipedia articles on outlier detection and anomaly detection. There is also a similar question on stats.stackexchange.com and one on datascience.stackexchange.com that is focused on Python.
You should also think about which is worse in your case, false positives (type I errors) or false negatives (type II errors), as decreasing the rate of one of these error types increases the rate of the other.
EDIT: given your requirement of handling multiple peaks in some cases and flat distributions in others, an algorithm like the following could work (a Python sketch appears after the notes below):
1.) count the number of occurrences of each single number in your sequence, and place the count in a bin that corresponds to that number (initial bin width = 1)
2.) iterate through the bins: if a single bin counts more than e.g. 10% (parameter a) of the total number of values in your sequence, mark the numbers of that bin as "good values"
3.) increase the bin width by 1 and repeat step 1 and 2
4.) repeat step 1-3 until e.g. 90% (parameter b) of the numbers in your sequence are marked as "good values"
5.) let the test cases for the bad values fail
This algorithm should work for cases such as:
a single large peak with some outliers
multiple large peaks and some outliers in between
a flat distribution with a concentration in a certain region (or in multiple regions)
a number sequence where all numbers are equal
Parameters a and b have to be adjusted to your needs, but I think that shouldn't be hard.
Note: to check which bin a value belongs to, you can use the modulo operator (%): e.g. if the bin size is 3 and you have the values 475, 476, 477, 478, 479, name each bin after the value whose modulo with the bin size is zero -> 477 % 3 = 0 -> put 477, 478, and 479 into bin 477.
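A rough Python sketch of the binning scheme above (the parameter values, the floor-division binning, and the max_width stopping guard are my assumptions about details left open):

```python
from collections import Counter

def good_values(seq, a=0.2, b=0.9, max_width=None):
    """Iteratively widen bins and mark values in well-populated bins as 'good'.
    max_width is a stopping guard in case b% coverage is never reached."""
    n = len(seq)
    good = set()
    width = 1
    if max_width is None:
        max_width = max(seq) - min(seq) + 1
    while sum(1 for v in seq if v in good) < b * n and width <= max_width:
        # bin a value by rounding it down to the nearest multiple of the bin width
        bins = Counter((v // width) * width for v in seq)
        for bin_start, count in bins.items():
            if count > a * n:                    # bin holds more than a% of all values
                good.update(v for v in seq if bin_start <= v < bin_start + width)
        width += 1
    return good

seq = [10, 10, 10, 10, 10, 5, 10, 10, 10, 5, 10, 10, 10]
print(good_values(seq))   # {10}: the two 5s never reach the a-threshold, so they fail
```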
I wonder if different columns in your data can be treated in different ways? Would it be appropriate to, for example, treat the width with a "close to the mean" check; another column with a "value seen in the set of good examples" check; and a third column with an "in an existing cluster from k-means clustering of good examples" check?
You could compute a score for each column and flag any new version that has one or more columns not deemed to fit, and state why.
Hmm, it's not restricted to individual columns - if, for example, there is some relation between column values then that could be checked for too - maybe width times height is limited; or the volume has limits.
Time: it may be that successive values can only deviate in some given manner by some amount. If, for example, the sides were continuously varied by some robot and the time between measurements was short enough, then that would limit the delta between successive readings to what the robotic mechanism can produce when it is working correctly.
I guess a large part of this answer is to use any knowledge you have about the data source to help.
I am not sure if I understand you correctly, but I think you want to predict if a sample presented to you (after experiencing a sequence of previous examples) is anomalous or not? You are therefore implying some sort of temporal dependency of the new sample?
If you have lots of training data, i.e. hundreds or thousands of examples of labelled good and bad sequences, then you might be able to train a neural architecture to classify whether the "next element in the sequence" is anomalous or not. You could train an LSTM (long short-term memory) architecture that generalises over input sequences to accurately classify the new sample presented to it.
LSTMs will be available in any good neural network library and basically you will be running a general Supervised Learning routine. Tutorials about this are all over the Internet and in any good machine learning (ML) book.
As always in ML, take care of not over-fitting!
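A minimal sketch of such a classifier, assuming TensorFlow/Keras and labelled fixed-length windows; the shapes and random data are placeholders:

```python
import numpy as np
import tensorflow as tf

# Placeholder data: windows of the last `timesteps` measurements with 3 values each,
# labelled 1 if the element following the window was anomalous, else 0.
timesteps, features = 10, 3
X_train = np.random.rand(200, timesteps, features).astype("float32")
y_train = np.random.randint(0, 2, size=(200, 1))

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(timesteps, features)),
    tf.keras.layers.Dense(1, activation="sigmoid"),   # probability of "anomalous"
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X_train, y_train, epochs=10, batch_size=16, validation_split=0.2)
```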
Related
I have a set of 10,000 points, each made up of 70 boolean dimensions. From this set of 10,000, I would like to select 100 points which are representative of the whole set of 10,000. In other words, I would like to pick the 100 points which are most different from one another.
Is there some established way of doing this? The first thing that comes to my mind is a greedy algorithm, which begins by selecting one point at random, then the next point is selected as the most distant one from the first point, and then the second point is selected as having the longest average distance from the first two, etc. This solution doesn't need to be perfect, just roughly correct. Preferably, this solution of 100 points can also be found within ~10 minutes but finishing within 24 hours is also fine.
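A minimal numpy sketch of a max-min ("farthest point") variant of that greedy idea; the function name and the random stand-in data are assumptions:

```python
import numpy as np

def greedy_diverse_subset(points, k=100, seed=0):
    """Greedy max-min selection: start from a random point, then repeatedly add
    the point whose minimum Hamming distance to the chosen set is largest."""
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(points)))]
    min_dist = (points != points[chosen[0]]).sum(axis=1)
    while len(chosen) < k:
        nxt = int(np.argmax(min_dist))
        chosen.append(nxt)
        min_dist = np.minimum(min_dist, (points != points[nxt]).sum(axis=1))
    return chosen

points = np.random.rand(10000, 70) < 0.14    # random stand-in for the boolean data
subset = greedy_diverse_subset(points, k=100)
```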
I don't care about distance in particular; that's just something that comes to mind as a way to capture "differentness."
If it matters, every point has 10 values of TRUE and 60 values of FALSE.
An already-built Python package to do this would be ideal, but I am also happy to just write the code myself if somebody could point me to a relevant Wikipedia article.
Thanks
Your use of "representative" is not standard terminology, but I read your question as wishing to find 100 items that cover a wide gamut of different examples from your dataset. So if 5,000 of your 10,000 items were near identical, you would prefer to see only one or two items from that large sub-group. Under the usual definition, a representative sample of 100 would contain ~50 items from that group.
One approach that might match your stated goal is to identify diverse subsets or groups within your data, and then pick an example from each group.
You can establish group identities for a fixed number of groups - with different membership size allowed for each group - within a dataset using a clustering algorithm. A good option for you might be k-means clustering with k=100. This will find 100 groups within your data and assign all 10,000 items to one of those 100 groups, based on a simple distance metric. You can then either take the central point from each group or a random sample from each group to find your set of 100.
The k-means algorithm is based around minimising a cost function which is the average distance of each group member from the centre of its group. Both the group centres and the membership are allowed to change, updated in an alternating fashion, until the cost cannot be reduced any further.
Typically you start by assigning each item randomly to a group. Then calculate the centre of each group. Then re-assign items to groups based on the closest centre. Then recalculate the centres, etc. Eventually this should converge. Multiple runs might be required to find a good set of centres (the algorithm can get stuck in a local optimum).
There are several implementations of this algorithm in Python. You could start with the scikit-learn implementation.
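A minimal scikit-learn sketch of this; the random boolean stand-in data and the choice of picking the member closest to each centre are assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

X = (np.random.rand(10000, 70) < 0.14).astype(float)   # random stand-in data

km = KMeans(n_clusters=100, n_init=10, random_state=0).fit(X)

# one representative per cluster: the member closest to its cluster centre
representatives = []
for c in range(100):
    members = np.where(km.labels_ == c)[0]
    centre = km.cluster_centers_[c]
    closest = members[np.argmin(((X[members] - centre) ** 2).sum(axis=1))]
    representatives.append(int(closest))
```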
According to an IBM support page (from a comment by sascha), k-means may not work well with binary data. Other clustering algorithms may work better. You could also try to convert your records to a space where Euclidean distance is more useful and continue to use k-means clustering. An algorithm that may do that for you is principal component analysis (PCA), which is also implemented in scikit-learn.
The graph partitioning tool METIS claims to be able to partition graphs with millions of vertices in 256 parts within seconds.
You could treat your 10,000 points as vertices of an undirected graph. A fully connected graph with 50 million edges would probably be too big. Therefore, you could restrict the edges to "similarity links" between points which have a Hamming distance below a certain threshold.
In general, Hamming distances for 70-bit words can range from 0 to 70. In your case the upper limit is 20, since each point has 10 true and 60 false coordinates; the maximum distance occurs when none of the true coordinates of the two points coincide.
Creating the graph is a costly O(n^2) operation, but it might still be possible to get it done within your envisaged time frame.
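A rough numpy sketch of building those similarity links block-wise (the threshold value, block size, and random stand-in data are assumptions):

```python
import numpy as np

X = np.random.rand(10000, 70) < 0.14     # random stand-in for the boolean points
threshold = 12                           # maximum Hamming distance for an edge
block = 200                              # process in blocks to limit memory use

edges = []
for start in range(0, len(X), block):
    chunk = X[start:start + block]
    # Hamming distance = number of coordinates where the two points differ
    d = (chunk[:, None, :] != X[None, :, :]).sum(axis=2)
    rows, cols = np.where(d <= threshold)
    # keep each undirected edge once (i < j), dropping self-pairs
    edges.extend((start + i, j) for i, j in zip(rows.tolist(), cols.tolist()) if start + i < j)
```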
I'm dealing with a dataframe of dimension 4 million x 70. Most columns are numeric and some are categorical, in addition to the occasional missing values. It is essential that the clustering is run on all data points, and we look to produce around 400,000 clusters (so subsampling the dataset is not an option).
I have looked at using Gower's distance metric for mixed type data, but this produces a dissimilarity matrix of dimension 4 million x 4 million, which is just not feasible to work with since it has 10^13 elements. So, the method needs to avoid dissimilarity matrices entirely.
Ideally, we would use an agglomerative clustering method, since we want a large number of clusters.
What would be a suitable method for this problem? I am struggling to find a method which meets all of these requirements, and I realise it's a big ask.
Plan B is to use a simple rules-based grouping method based on categorical variables alone, handpicking only a few variables to cluster on since we will suffer from the curse of dimensionality otherwise.
The first step is going to be turning those categorical values into numbers somehow, and the second step is going to be putting the now all numeric attributes into the same scale.
Clustering is computationally expensive, so you might try a third step of representing this data by the top 10 components of a PCA (or however many components have an eigenvalue > 1) to reduce the number of columns.
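A minimal scikit-learn sketch of these first steps on a toy dataframe (the column names, imputation strategies, and number of PCA components here are placeholders, not recommendations):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the real 4 million x 70 dataframe
df = pd.DataFrame({
    "width":  np.random.rand(1000) * 100,
    "height": np.random.rand(1000) * 50,
    "colour": np.random.choice(["red", "green", "blue"], size=1000),
})
df.loc[df.sample(frac=0.05).index, "width"] = np.nan   # occasional missing values

numeric_cols = ["width", "height"]
categorical_cols = ["colour"]

preprocess = ColumnTransformer([
    ("num", make_pipeline(SimpleImputer(strategy="median"), StandardScaler()), numeric_cols),
    ("cat", make_pipeline(SimpleImputer(strategy="most_frequent"),
                          OneHotEncoder(handle_unknown="ignore")), categorical_cols),
], sparse_threshold=0.0)   # force a dense matrix so PCA can consume it

# n_components=3 only because the toy data has few columns; ~10 is suggested above
pipeline = make_pipeline(preprocess, PCA(n_components=3))
features = pipeline.fit_transform(df)
```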
For the clustering step, you'll have your choice of algorithms. I would think something hierarchical would be helpful for you: even though you expect a high number of clusters, it makes intuitive sense that those clusters would fall under larger clusters that continue to make sense right up to a small number of "parent" clusters. A popular choice might be HDBSCAN, but I tend to prefer OPTICS. The implementation in the free ELKI toolkit seems to be the fastest (it takes some messing around with to figure out) because it runs in Java. ELKI's output is a little strange: it writes one file per cluster, so you then have to loop through the files in Python and build your final mapping. But it's all doable (including executing the ELKI command) from Python if you're building an automated pipeline.
I am enquiring about assistance regarding regression testing. I have a continuous time series that fluctuates between positive and negative integers. I also have events occurring throughout this time series at seemingly random time points. Essentially, when an event occurs I grab the respective integer. I then want to test whether this integer influences the event at all; as in, are there more positive or negative integers at event times?
I originally thought of logistic regression on the positive/negative number, but that would require at least two distinct groups, whereas I only have information on events that have occurred. I can't really include the events that don't occur, as the series is somewhat continuous and random; the number of times an event doesn't occur is impossible to measure.
So my distinct group is all "true" in a sense, as I don't have any results from something that didn't occur. What I am trying to determine is:
When an outcome occurs, does the positive or negative integer influence this outcome?
It sounds like you are interested in determining the underlying forces that are producing a given stream of data. Such mathematical models are called Markov Models. A classic example is the study of text.
For example, if I run a Hidden Markov Model algorithm on a paragraph of English text, then I will find that there are two driving categories that are determining the probabilities of what letters show up in the paragraph. Those categories can be roughly broken into two groups, "aeiouy " and "bcdfghjklmnpqrstvwxz". Neither the mathematics nor the HMM "knew" what to call those categories, but they are what is statistically converged to upon analysis of a paragraph of text. We might call those categories "vowels" and "consonants". So, yes, vowels and consonants are not just 1st grade categories to learn, they follow from how text is written statistically. Interestingly, a "space" behaves more like a vowel than a consonant. I didn't give the probabilities for the example above, but it is interesting to note that "y" ends up with a probability of roughly 0.6 vowel and 0.4 consonant; meaning that "y" is the most consonant behaving vowel statistically.
A great paper is https://www.cs.sjsu.edu/~stamp/RUA/HMM.pdf, which goes over the basic ideas of this kind of time-series analysis and even provides some pseudo-code for reference.
I don't know much about the data that you are dealing with, and I don't know if the concepts of "positive" and "negative" are playing a determining factor in the data that you see. But if you ran an HMM on your data and found the two groups to be the collection of positive numbers and the collection of negative numbers, then your answer would be confirmed: yes, the most influential two categories driving your data are the concepts of positive and negative. If they don't split that way, then your answer is that those concepts are not an influential factor in driving the data. Regardless, in the end the algorithm would produce several probability matrices that show you how much each integer in your data is being influenced by each category, so you would have much greater insight into the behaviour of your time-series data.
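If you want to try this, here is a minimal sketch with the hmmlearn package (just one possible HMM implementation, not something required by the approach); the random integer stream is a stand-in for your data:

```python
import numpy as np
from hmmlearn import hmm

# Random stand-in for the integer observed each time an event occurs
observations = np.random.randint(-10, 11, size=500).reshape(-1, 1).astype(float)

# Fit a 2-state HMM and inspect which observations each hidden state tends to emit
model = hmm.GaussianHMM(n_components=2, covariance_type="diag", n_iter=100)
model.fit(observations)
states = model.predict(observations)

for s in range(2):
    vals = observations[states == s, 0]
    print(f"state {s}: mean={vals.mean():.2f}, share positive={np.mean(vals > 0):.2f}")
# If one state captures mostly positive integers and the other mostly negative ones,
# that supports sign being a driving category; otherwise it probably is not.
```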
The question is quite difficult to understand after the first paragraph, but let me help based on what I could understand from it.
Assuming you want to understand whether there is a relationship between the events happening and the integers in the data:
1st approach: plot the data in 2D and check visually whether there is a relationship.
2nd approach: make the event data continuous, separate it from the rest of the data, smooth both series with a rolling window, and then compare the two trends.
The approaches above only work well if I am understanding your problem correctly.
There is also one more thing to be aware of, known as survivorship bias: you might be missing data, so please check that as well.
Maybe I am misunderstanding your problem, but I don't believe that you can perform any kind of meaningful regression without more information.
Regression is usually used to find a relationship between two or more variables; however, it appears that you only have one variable (whether the integers are positive or negative) and one constant (the outcome is always true in your data). Maybe you could compute some statistics on the distribution of the numbers (mean, median, standard deviation), but I am unsure how you would do regression.
https://en.wikipedia.org/wiki/Regression_analysis
You might want to consider that there might be some strong survivorship bias if you are missing a large chunk of your data.
https://en.wikipedia.org/wiki/Survivorship_bias
Hope this is at least a bit helpful to get you steered in the right direction
In a data science task I have some physical data from an instrument and need to predict a continuous time value. The data is divided into signal samples, with some peaks occurring before that target time. In order to create new features I will have to use some statistical information about the signal, but not necessarily for the whole signal sample.
I was thinking about dividing the sample into chunks and use statistical data derived from these chunks as separate features.
I could divide the sample into, say, 1000 chunks, but it may be that such a division doesn't make much sense. Maybe it would be better to get statistical info from the first 10% of the sample, then, say, the last 20%, and so on; or at least to base the division on the specific sample. Maybe for some samples dividing into 1000 chunks is good, but for others it should be 500 or 2000, etc.
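For the plain chunk-statistics idea, a minimal numpy sketch (the number of chunks and the choice of statistics are arbitrary placeholders):

```python
import numpy as np

def chunk_features(signal, n_chunks=10):
    """Split a 1-D signal into n_chunks pieces and return simple per-chunk
    statistics (mean, std, max) concatenated into one feature vector."""
    feats = []
    for chunk in np.array_split(np.asarray(signal, dtype=float), n_chunks):
        feats.extend([chunk.mean(), chunk.std(), chunk.max()])
    return np.array(feats)

signal = np.random.rand(5000)                     # stand-in for one signal sample
features = chunk_features(signal, n_chunks=10)    # 30 features for this sample
```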
My idea was to use a neural network to derive that division value (or maybe a few values, like the number of chunks and their sizes).
Does this make sense at all, and if so, any ideas how to do it? It sounds like something like parameter optimisation using a neural network, but googling that didn't give me the result I was looking for.
Maybe someone has stumbled upon a similar problem?
I am a PhD student trying to use the NEAT algorithm as a controller for a robot, and I am having some accuracy issues with it. I am working with Python 2.7 and am using two NEAT Python implementations:
The NEAT which is in this GitHub repository: https://github.com/CodeReclaimers/neat-python
Searching on Google, it looks like it has been used successfully in several projects.
The multiNEAT library developed by Peter Chervenski and Shane Ryan: http://www.multineat.com/index.html.
This one appears in the "official" NEAT software catalog web page.
While testing the first one, I found that my program converges quickly to a solution, but the solution is not precise enough. By "lack of precision" I mean a deviation of at least 3-5% in the median and average relative to the "perfect" solution at the end of the evolution (depending on the complexity of the problem, an error of around 10% is normal for my solutions; furthermore, I have "never" seen an error below 1% between the solution given by NEAT and the correct one). I must say that I've tried a lot of different parameter combinations and configurations (this is an old problem for me).
Because of that, I tested the second library. The MultiNEAT library converges more quickly and easily than the previous one (I assume that is due to the C++ implementation instead of pure Python). I get similar results, but I still have the same problem: lack of accuracy. This second library has different configuration parameters too, and I haven't found a combination of them that improves the performance on the problem.
My question is:
Is it normal to have this lack of accuracy in the NEAT results? It achieves good solutions, but not good enough for controlling a robot arm, which is what I want to use it for.
I'll write what I am doing in case someone sees some conceptual or technical mistake in the way I set out my problem:
To simplify the problem, I'll show a very simple example: I want a NN that can calculate the following function: y = x^2 (similar results are found with y = x^3 or y = x^2 + x^3 or similar functions).
The steps that I follow to develop the program are:
"Y" are the inputs to the network and "X" the outputs. The
activation functions of the neural net are sigmoid functions.
I create a data set of "n" samples, giving values to "X" between xmin = 0.0 and xmax = 10.0.
As I am using sigmoid functions, I normalise the "Y" and "X" values:
"Y" is normalized linearly between (Ymin, Ymax) and (-2.0, 2.0) (input range of sigmoid).
"X" is normalized linearly between (Xmin, Xmax) and (0.0, 1.0) (the output range of sigmoid).
After creating the data set, I subdivide it into a training sample (70% of the total), a validation sample, and a test sample (15% each).
At this point, I create a population of individuals for the evolution. Each individual of the population is evaluated on all the training samples. Each position is evaluated as:
eval_pos = xmax - abs(xtarget - xobtained)
And the fitness of the individual is the average value over all the training positions (I've tried the minimum too, but it gives me worse performance); see the sketch after these steps.
After the whole evaluation, I test the best obtained individual against the test sample, and this is where I obtain those imprecise values. Moreover, during the evaluation process the maximum value, where "abs(xtarget - xobtained) = 0", is never reached.
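For reference, a sketch of how the fitness evaluation described above might look with the CodeReclaimers neat-python package (the data preparation is a rough reconstruction of the normalisation steps; names are assumptions):

```python
import numpy as np
import neat   # the CodeReclaimers neat-python package

# Rough reconstruction of the normalised y = x^2 training data described above
x_raw = np.linspace(0.0, 10.0, 70)
y_raw = x_raw ** 2
y_train = -2.0 + 4.0 * (y_raw - y_raw.min()) / (y_raw.max() - y_raw.min())  # network inputs
x_train = (x_raw - x_raw.min()) / (x_raw.max() - x_raw.min())               # network targets
xmax = 1.0  # "X" is normalised to (0, 1)

def eval_genomes(genomes, config):
    """Fitness = average of eval_pos = xmax - abs(xtarget - xobtained) over the training set.
    This is the callback passed to neat.Population(config).run(eval_genomes, n_generations)."""
    for genome_id, genome in genomes:
        net = neat.nn.FeedForwardNetwork.create(genome, config)
        scores = []
        for y, x_target in zip(y_train, x_train):
            x_obtained = net.activate([y])[0]
            scores.append(xmax - abs(x_target - x_obtained))
        genome.fitness = sum(scores) / len(scores)
```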
Furthermore, I assume that the way I handle the data is right, because I use the same data set for training a neural network in Keras and I get much better results than with NEAT (an error of less than 1% is achievable after 1000 epochs with a single layer of 5 neurons).
At this point, I would like to know if what is happening is normal, because ultimately I shouldn't use a fixed data set for developing the controller; it must be learned "online", and NEAT looks like a suitable solution for my problem.
Thanks in advance.
EDITED POST:
Firstly, thanks for your comment, nick. I'll answer your questions below:
I am using the NEAT algorithm.
Yes, I've carried out experiments increasing the number of individuals in the population and the number of generations. A typical graph that I get looks like this:
Although the population size in this example is not that big, I've obtained similar results in experiments increasing the number of individuals or the number of generations, e.g. populations of 500 individuals and 500 generations. In these experiments the algorithm converges quickly to a solution, but once there, the best solution gets stuck and does not improve any more.
As I mentioned in my previous post, I've tried several experiments with many different parameter configurations, and the graphs are more or less similar to the one shown above.
Furthermore, I've tried two other experiments: once the evolution reaches the point where the maximum value and the median converge, I generate another population based on that genome, with new configuration parameters where:
The mutation parameters are changed to a high probability of mutation (weight and neuron probability), in order to find new solutions with the aim of "jumping" from the current genome to a better one.
The neuron mutation probability is reduced to 0, while the weight mutation probability is increased and the "mutate weight" range is lowered, in order to make slight modifications and get a finer adjustment of the weights (trying to get functionality "similar" to backprop by making slight changes to the weights).
These two experiments didn't work as I expected, and the best genome of the new population was the same as that of the previous population.
I am sorry, but I do not understand very well what you mean by "applying your own weighted penalties and rewards in your fitness function". What do you mean by including weighted penalties in the fitness function?
Regards!
Disclaimer: I have contributed to these libraries.
Have you tried increasing the population size to speed up the search and increasing the number of generations? I use it for a trading task, and by increasing the population size my champions were found much sooner.
Another thing to think about is applying your own weighted penalties and rewards in your fitness function, so that anything that doesn't get very close right away is "killed off" sooner and the correct genome is found faster. It should be noted that NEAT learns via a fitness function, as opposed to gradient descent, so it won't converge in the same way, and it's possible you may have to train a bit longer.
Last question: are you using the NEAT or the HyperNEAT algorithm from MultiNEAT?