Create high nr of random sequences with min Edit Distance time efficient

Create high nr of random sequences with min Edit Distance time efficient - python

I need to create a program/script for the creation of a high numbers of random sequences (20 letter long sequence based on 4 different letters) with a minimum edit distance between all sequences. "High" would here be a minimum of 100k sequences, but if possible up to 1 million.
I started with a naive approach of just generating random 20 letter sequences, and for each sequence, calculate the edit distance between the sequence and all other sequences already created and stored. If the new sequence pass my threshold value, store it, otherwise discard.
As you understand, this scales very badly for higher number of sequences. Up to 10k is reasonably fine, but trying to get 100k this starts to get troublesome.
I really only need to create the sequences once and store the output, so I'm really not that fussy about speed, but making 1 million at this rate today is just no possible.
Been trying to think of alternatives to speed up the process, like building the sequences is "blocks" of minimal ED and then combining, but haven't come up with any solution.
Wondering, do anyone have any smart idea/method that could be implemented to create such high number of sequences with minimal ED more time efficient?
Cheers,
JB

It seems, from wikipedia, that edit distance is one of three operations insertion, deletion, substitution; performed on a starting string. Why not systematically generate all strings up to N edits from a starting string then stop when you reach your limit?
There would be no need to check for the actual edit distance as they would be correct by generation. For randomness could you generate a number then shuffle them.

Related

Regression Tests on Arbitrary Number Sequences

I am trying to come up with a method to regression test number sequences.
My system under tests produces a large amount of numbers for each system version (e. g. height, width, depth, etc.). These numbers vary from version to version in an unknown fashion. Given a sequence of "good" versions and one "new" version I'd like to find the sequences which are most abnormal.
Example:
"Good" version:
version width height depth
1 123 43 302
2 122 44 304
3 120 46 300
4 124 45 301
"New" version:
5 121 60 305
In this case I obviously would like to find the height sequence because the value 60 stands out more than the width or the depth.
My current approach computes the mean and the standard deviation of each sequence of the good cases and for a new version's number it computes the probability that this number is part of this sequence (based on the known mean and standard deviation). This works … kind of.
The numbers in my sequences are not necessarily Gaussian distributed around a mean value but often are rather constant and only sometimes produce an outlier value which also seems to be rather constant, e. g. 10, 10, 10, 10, 10, 5, 10, 10, 10, 5, 10, 10, 10. In this case, only based on mean and standard deviation the value 10 would not be 100% likely to be in the sequence, and the value 5 would be rather unlikely.
I considered using a histogram approach and hesitated there to ask here first. The problem with a histogram would be that I would need to store quite a lot of information for each sequence (in contrast to just a mean and standard deviation).
The next aspect I thought about was that I am pretty sure that this kind of task is not new and that there probably are already solutions which would fit nicely to my situation; but I found not much in my research.
I found a library like PyBrain which at first glance seems to process number sequences and then apparently tries to analyse these with a simulated neural network. I'm not sure if this would be an approach for me (and again it seems like I would have to store a large amount of data for each number sequence, like a complete neural network).
So my question is this:
Is there a technique, an algorithm, or a science discipline out there which would help me analyse number sequences to find abnormalities (in a last value)? Preferably while storing only small amounts of data per sequence ;-)
For concrete implementations I'd prefer Python, but hints on other languages would be welcome as well.

You could use a a regression technique called Gaussian process (GP) to learn the curve and then apply the gaussian process to the next example in your sequence.
Since a GP does not only give you an estimate for the target but also a confidence you could threshold based on the confidence to determine what is an outlier.
To realize this various toolboxes exist (scikits.learn, shogun,...) but what is likely easiest is GPy. An example for 1d regression that you can tune to get your task going is nicely described in the following notebook:
http://nbviewer.jupyter.org/github/SheffieldML/notebook/blob/master/GPy/basic_gp.ipynb

Is there a technique, an algorithm, or a science discipline out there
which would help me analyse number sequences to find abnormalities (in
a last value)?
The scientific displine you are looking for is called outlier detection / anomaly detection. There are a lot of techniques and algorithms you can use. As a starting point, maybe have a look at wikipedia here (outlier detection) and here (Anomaly detection). There is also a similar question on stats.stackexchange.com and one on datascience.stackexchange.com that is focused on python.
You also should think about what is worse in your case, false positives (type 1 error) or false negatives (type 2 error), as decreasing the percentage of one of these error types increases the percentage of the other.
EDIT: given your requirement with multiple peaks in some cases, flat distributions in other cases, an algorithm like this could work:
1.) count the number of occurrences of each single number in your sequence, and place the count in a bin that corresponds to that number (initial bin width = 1)
2.) iterate through the bins: if a single bin counts more than e.g. 10% (parameter a) of the total number of values in your sequence, mark the numbers of that bin as "good values"
3.) increase the bin width by 1 and repeat step 1 and 2
4.) repeat step 1-3 until e.g. 90% (parameter b) of the numbers in your sequence are marked as "good values"
5.) let the test cases for the bad values fail
This algorithm should work for cases such as:
a single large peak with some outliers
multiple large peaks and some outliers in between
a flat distribution with a concentration in a certain region (or in multiple regions)
a number sequences where all numbers are equal
Parameters a and b have to be adjusted to your needs, but I think that shouldn't be hard.
Note: to check to which bin a value belongs, you can use the modulo operator (%), e.g. if bin size is 3, and you have the values 475,476,477,478,479 name the bin according to the value where its modulo with the bin size is zero -> 477%3=0 -> put 477, 478, and 479 into bin 477.

I wonder if different columns in your data can be treated in different ways? Is it appropriate to, for example treat the width with a "close to the mean" check; another column with "value seen in set of good examples"; a third column maybe treated by "In existing cluster from K-means clustering of good examples".
You could score for each column and flag any new value that has any one or more columns not deemed to fit and state why.
Hmm, it's not restricted to individual columns - if, for example, there is some relation between column values then that could be checked for too - maybe width times height is limited; or the volume has limits.
Time: It may be that successive values can only deviate in some given manner by some value - If, for example the sides were continuously varied by some robot and the time between measurements was short enough, then that would limit the delta values between successive readings to that which the robotic mechanism could produce when it is working correctly.
I guess a large part of this answer is to use any knowledge you have about the data source to help.

I am not sure if I understand you correctly, but I think you want to predict if a sample presented to you (after experiencing a sequence of previous examples) is anomalous or not? You are therefore implying some sort of temporal dependency of the new sample?
If you have lots of training data i. e. (hundreds or thousands of) examples of (labelled) good and bad sequences, then you might be able to train a neural architecture to classify if the 'next element in the sequence' is anomalous or not. You could train an LSTM (long short-term memory) architecture that would generalise over input sequences to accurately classify the new sample presented to the architecture.
LSTMs will be available in any good neural network library and basically you will be running a general Supervised Learning routine. Tutorials about this are all over the Internet and in any good machine learning (ML) book.
As always in ML, take care of not over-fitting!

Find the 'pits' in a list of integers. Fast

I'm attempting to write a program which finds the 'pits' in a list of
integers.
A pit is any integer x where x is less than or equal to the integers
immediately preceding and following it. If the integer is at the start
or end of the list it is only compared on the inward side.
For example in:
[2,1,3] 1 is a pit.
[1,1,1] all elements are pits.
[4,3,4,3,4] the elements at 1 and 3 are pits.
I know how to work this out by taking a linear approach and walking along
the list however i am curious about how to apply divide and conquer
techniques to do this comparatively quickly. I am quite inexperienced and
am not really sure where to start, i feel like something similar to a binary
tree could be applied?
If its pertinent i'm working in Python 3.
Thanks for your time :).

Without any additional information on the distribution of the values in the list, it is not possible to achieve any algorithmic complexity of less than O(x), where x is the number of elements in the list.
Logically, if the dataset is random, such as a brownian noise, a pit can happen anywhere, requiring a full 1:1 sampling frequency in order to correctly find every pit.
Even if one just wants to find the absolute lowest pit in the sequence, that would not be possible to achieve in sub-linear time without repercussions on the correctness of the results.
Optimizations can be considered, such as mere parallelization or skipping values neighbor to a pit, but the overall complexity would stay the same.

Looking for a better evaluation method for a genetic algorithm

I'm currently trying to solve the hard Challenge #151 on reddit with a unuasual method, a genetic algorithm.
In short, after seperating a string to consonants and vowels and removing spaces I need to put it together without knowing what character comes first.
hello world is seperated to hllwrld and eoo and needs to be put together again. One solution for example would be hlelworlod, but that doesn't make much sense. The exhaustive approach that takes all possible solutions works, but isn't feasible for longer problem sets.
What I already have
A database with the frequenzy of english words
An algorithm that constructs a relative cost database using Zipf's law and can consistently seperate words from sentences without spaces (borrowed from this question/answer
A method that puts consonants and vowels into a stack and randomly takes a character from either one and encodes this in a string that consists of 1 and 2, effectively encoding the construction in a gene. The correct gene for the example would be 1211212111
A method that mutates such a string, randomly swapping characters around
What I tried
Generating 500 random sequences, using the infer_spaces() method and evaluating fitness with the cost of all the words, taking the best 25% and mutate 4 new from those, works for small strings, but falls into local minima very often, especially for longer sequences. Hello World is found already in the first generation, thisisnotworkingverygood (which is correctly seperated and has a cost of 41.223) converges to th iss n ti wo or king v rye good (270 cost) already in the second generation.
What I need
Clearly, using the calculated cost as a evaluation method does only work for the separation of sentences that are grammatically correct, not for for this genetic algorithm. Do you have better ideas I could try? Or is another part of solution, for example the representation of the gene, the problem?

I would simplify the problem into two parts,
Finding candidate words to split the string into (so hllwrld => hll wrld)
How to then expand those words by adding vowels.
I would first take your dictionary of word frequencies, and process it to create a second list of words without vowels, along with a list of the possible vowel list for each collapsed word (and the associated frequency). You technically don't need a GA to solve this (and I think it would be easier to solve without one), but as you asked, I will provide 2 answers:
Without GA: you should be able to solve the first problem using a depth first search, matching substrings of the word against that dictionary, and doing so with the remaining word parts, only accepting partitions of the word into words (without vowels) where all words are in the second dictionary. Then you have to substitute in the vowels. Given that second dictionary, and the partition you already have, this should be easy. You can also use the list of vowels to further constrain the partitioning, as valid words in the partitions can only be made whole using vowels from the vowel list that is input into the algorithm. Starting at the left hand side of the string and iterating over all valid partitions in a depth first manner should solve this problem relatively quickly.
With GA: To solve this with a GA, I would create the dictionary of words without vowels. Then using the GA, create binary strings (as your chromosomes) of the same length as the input string of consonants, where a 1 = split a word at that position, and 0 = leave unchanged. These strings will all be the same length. Then create a fitness function that returns the proportion of words obtained after performing a split using the chromosome that are valid words without vowels, according to that dictionary. Create a second fitness function that takes the valid no-vowel words, and computes the proportion of overlap between the vowels missing in all these valid no-vowel words, and the original vowel list. Combine both fitness functions into one by multiplying the value from the first one by ten (assuming the second one returns a value between 0 and 1). That will force the algorithm to focus on the segmentation problem first and the vowel insertion problem second, and will also favor segmentations that are of the same quality, but preferring those that have a closer set of missing vowels to the original list. I would also include cross over in the solution. As all your chromosomes are the same length, this should be trivial. Once you have a solution that scores perfectly on the fitness function, then it should be trivial to recreate the original sentence given that dictionary of words without vowels (provided you maintain a second dictionary that list the possible missing vowel set for each non-vowel word - there could be multiple for each, as some vowel-less words will be the same with the vowels removed.

Let's say you have several generations and you plot the cost for the best specimen in each generation (we consider long sentence). Does this graph go down or converges after 2-3 generations to a specific value (let the algorithm run for example for 10 generations)? Can you run your algorithm several times with various initial conditions (random sequences) and see whether you get good results sometimes or not?
Depending of the results, you may try the following (this graph is a really good tool to improve the performance):
1) If you have a graph that goes up and down too much all the time - you have too much mutation (average number of swaps per gene for example), try to decrease it.
2) If you stuck up in a local minimum (cost of the best specimen doesn't change much after some time) try to increase mutation or run several isolated populations (3-4) of let's say 100 species at the beginning of your algorithm for a few generations. Then select the best population (that's closer to global minimum) and try to improve it as much as possible through mutation
PS: By the way interesting problem, I tried to figure out on how to use crossover to improve the algorithm but haven't figured it out

The fitness function is the key to the success of GA algorithm ( Which I kind of agree is suitable here ).
I agree with #Simon that the vowel non-vowel separation is not that important. just trip your text corpus to remove the vowels.
what is important in the fitness:
matched word frequency ( frequent words better )
grammar - structure of the sentence ( which you might need to use NLTK to get related infomation )
and don't forget to update the end result ^^

In python, how can I efficiently find a consecutive sequence that is a subset of a larger consecutive sequence?

I need to find all the days of the month where a certain activity occurs. The days when the activity occurs will be sequential. The sequence of days can range from one to the entire month, and the sequence will occur exactly one time per month.
To test whether or not the activity occurs on any given day is not an expensive calculation, but I thought I would use this problem learn something new. Which algorithm minimizes the number of days I have to test?

You can't really do much better than iterating through the sequence to find the first match, then iterating until the first non match. You can use itertools to make it nice and readable:
itertools.takewhile(mytest,
itertools.dropwhile(lambda x: not mytest(x), mysequence))

I think the linear probe suggested by #isbadawi is the best way to find the beginning of the subsequence. This is because the subsequence could be very short and could be anywhere within the larger sequence.
However, once the beginning of the subsequence is found, we can use a binary search to find the end of it. That will require fewer tests than doing a second linear probe, so it's a better solution for you.
As others have pointed out, there is not much practical reason for doing this. This is true for two reasons: your large sequence is quite short (only about 31 elements), and you still need to do at least one linear probe anyway, so the big-O runtime will be still be linear in the length of the large sequence, even though we have reduced part of the algorithm from linear to logarithmic.

The best method depends a bit on your input data structure. If your input data structure is a list of booleans for each day of the month then you can use the following code.
start = activity.find(True)
end = activity.rfind(True)

how to generate all possible combinations of a 14x10 matrix containing only 1's and 0's

I'm working on a problem and one solution would require an input of every 14x10 matrix that is possible to be made up of 1's and 0's... how can I generate these so that I can input every possible 14x10 matrix into another function? Thank you!
Added March 21: It looks like I didn't word my post appropriately. Sorry. What I'm trying to do is optimize the output of 10 different production units (given different speeds and amounts of downtime) for several scenarios. My goal is to place blocks of downtime to minimized the differences in production on a day-to-day basis. The amount of downtime and frequency each unit is allowed is given. I am currently trying to evaluate a three week cycle, meaning every three weeks each production unit is taken down for a given amount of hours. I was asking the computer to determine the order the units would be taken down based on the constraint that the lines come down only once every 3 weeks and the difference in daily production is the smallest possible. My first approach was to use Excel (as I tried to describe above) and it didn't work (no suprise there)... where 1- running, 0- off and when these are summed to calculate production. The calculated production is subtracted from a set max daily production. Then, these differences were compared going from Mon-Tues, Tues-Wed, etc for a three week time frame and minimized using solver. My next approach was to write a Matlab code where the input was a tolerance (set allowed variation day-to-day). Is there a program that already does this or an approach to do this easiest? It seems simple enough, but I'm still thinking through the different ways to go about this. Any insight would be much appreciated.

The actual implementation depends heavily on how you want to represent matrices… But assuming the matrix can be represented by a 14 * 10 = 140 element list:
from itertools import product
for matrix in product([0, 1], repeat=140):
# ... do stuff with the matrix ...
Of course, as other posters have noted, this probably isn't what you want to do… But if it really is what you want to do, that's the best code (given your requirements) to do it.

Generating Every possible matrix of 1's and 0's for 14*10 would generate 2**140 matrixes. I don't believe you would have enough lifetime for this. I don't know, if the sun would still shine before you finish that. This is why it is impossible to generate all those matrices. You must look for some other solution, this looks like a brute force.

This is absolutely impossible! The number of possible matrices is 2140, which is around 1.4e42. However, consider the following...
If you were to generate two 14-by-10 matrices at random, the odds that they would be the same are 1 in 1.4e42.
If you were to generate 1 billion unique 14-by-10 matrices, then the odds that the next one you generate would be the same as one of those would still be exceedingly slim: 1 in 1.4e33.
The default random number stream in MATLAB uses a Mersenne twister algorithm that has a period of 219936-1. Therefore, the random number generator shouldn't start repeating itself any time this eon.
Your approach should be thus:
Find a computer no one ever wants to use again.
Give it as much storage space as possible to save your results.
Install MATLAB on it and fire it up.
Start computing matrices at random like so:
while true
newMatrix = randi([0 1],14,10);
%# Process the matrix and output your results to disk
end
Walk away
Since there are so many combinations, you don't have to compare newMatrix with any of the previous matrices since the length of time before a repeat is likely to occur is astronomically large. Your processing is more likely to stop due to other reasons first, such as (in order of likely occurrence):
You run out of disk space to store your results.
There's a power outage.
Your computer suffers a fatal hardware failure.
You pass away.
The Earth passes away.
The Universe dies a slow heat death.
NOTE: Although I injected some humor into the above answer, I think I have illustrated one useful alternative. If you simply want to sample a small subset of the possible combinations (where even 1 billion could be considered "small" due to the sheer number of combinations) then you don't have to go through the extra time- and memory-consuming steps of saving all of the matrices you've already processed and comparing new ones to it to make sure you aren't repeating matrices. Since the odds of repeating a combination are so low, you could safely do this:
for iLoop = 1:whateverBigNumberYouWant
newMatrix = randi([0 1],14,10); %# Generate a new matrix
%# Process the matrix and save your results
end

Are you sure you want every possible 14x10 matrix? There are 140 elements in each matrix, and each element can be on or off. Therefore there are 2^140 possible matrices. I suggest you reconsider what you really want.
Edit: I noticed you mentioned in a comment that you are trying to minimize something. There is an entire mathematical field called optimization devoted to doing this type of thing. The reason this field exists is because quite often it is not possible to exhaustively examine every solution in anything resembling a reasonable amount of time.

Trying this:
import numpy
for i in xrange(int(1e9)): a = numpy.random.random_integers(0,1,(14,10))
(which is much, much, much smaller than what you require) should be enough to convince you that this is not feasible. It also shows you how to calculate one, or few, such random matrices even up to a million is pretty fast).
EDIT: changed to xrange to "improve speed and memory requirements" :)

You don't have to iterate over this:
def everyPossibleMatrix(x,y):
N=x*y
for i in range(2**N):
b="{:0{}b}".format(i,N)
yield '\n'.join(b[j*x:(j+1)*x] for j in range(y))

Depending on what you want to accomplish with the generated matrices, you might be better off generating a random sample and running a number of simulations. Something like:
matrix_samples = []
# generate 10 matrices
for i in range(10):
sample = numpy.random.binomial(1, .5, 14*10)
sample.shape = (14, 10)
matrix_samples.append(sample)
You could do this a number of times to see how results vary across simulations. Of course, you could also modify the code to ensure that there are no repeats in a sample set, again depending on what you're trying to accomplish.

Are you saying that you have a table with 140 cells and each value can be 1 or 0 and you'd like to generate every possible output? If so, you would have 2^140 possible combinations...which is quite a large number.

Instead of just suggesting the this is unfeasible, I would suggest considering a scheme that samples the important subset of all possible combinations instead of applying a brute force approach. As one of your replies suggested, you are doing minimization. There are numerical techniques to do this such as simulated annealing, monte carlo sampling as well as traditional minimization algorithms. You might want to look into whether one is appropriate in your case.

I was actually much more pessimistic to begin with, but consider:
from math import log, e
def timeInYears(totalOpsNeeded=2**140, currentOpsPerSecond=10**9, doublingPeriodInYears=1.5):
secondsPerYear = 365.25 * 24 * 60 * 60
doublingPeriodInSeconds = doublingPeriodInYears * secondsPerYear
k = log(2,e) / doublingPeriodInSeconds # time-proportionality constant
timeInSeconds = log(1 + k*totalOpsNeeded/currentOpsPerSecond, e) / k
return timeInSeconds / secondsPerYear
if we assume that computer processing power continues to double every 18 months, and you can currently do a billion combinations per second (optimistic, but for sake of argument) and you start today, your calculation will be complete on or about April 29th 2137.

Here is an efficient way to do get started Matlab:
First generate all 1024 possible rows of length 10 containing only zeros and ones:
dec2bin(0:2^10-1)
Now you have all possible rows, and you can sample from them as you wish. For example by calling the following line a few times:
randperm(1024,14)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.