I am trying to solve this problem and have been unable to come up with a robust solution. Any idea, pseudo-code, or Python implementation would be greatly appreciated. For the sake of simplicity, consider a small matrix like the one in Figure 1. The rows in the matrix represent days and the columns represent minutes. Assume a bus travels between two points over 10 minutes, and the letter in each cell marks where it stops at that minute. Given the historical pattern (days 1 through 5), we want to find the best sequence of letters. To do that we need to follow certain rules:
We want to select the most frequently observed letter per minute interval. If there is more than one letter with the same frequency, we can select any of them.
We want to maintain the continuity.
We want to preserve the original sequence the best we can.
We are not looking for the shortest distance (most straight line, etc.)
Here are a couple of examples:
The sequence in Figure 1 satisfies all these rules. The highlighted sequence is just for visualization purposes. There are other ways of visualizing this sequence in Figure 1.
The sequence in Figure 2 is discontinuous, so the most frequent letters can't be stitched together. For that reason, we select the second most frequent letter in minute 3, one of C, A, or D instead of B. With that we can satisfy the rules. However, keep in mind that when 365 days are used along with 100+ minutes, it gets complex. For instance, using the second most frequent letter may require rewiring the rest of the sequence.
Any guidance is highly appreciated.
This sounds like a relatively straightforward dynamic programming task.
Start at the end: each cell in the last column gets 0 if it is the most frequent letter or 1 otherwise.
Move on to the second-to-last column. Each cell gets 0 if it is the most frequent letter or 1 otherwise, plus min(cell_above, cell_directly_right, cell_below), i.e. the cheapest of the three neighbouring cells it can reach in the column to its right. Note which cell you selected.
Repeat until you reach the first column.
You will now have in the first column one or more cells with minimal value. Follow the cells you noted in step 2.
You now have a path from the beginning to the end which is continuous and minimizes sum([0 if cell.most_frequent else 1 for cell in cells])
You might have to tweak the target function: e.g. at the moment the least frequent and the second most frequent letter are treated the same. Maybe you want to give a score based on how frequent they are.
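For illustration, here is a minimal Python sketch of this DP. It assumes the history is a list of equal-length strings (one per day) and that the three neighbours are the cells reachable in the next column (one row up, same row, one row down); the function and variable names are illustrative, not from the question.

from collections import Counter

def best_sequence(history):
    # history: list of equal-length strings, one per day (rows = days, columns = minutes)
    n_days, n_min = len(history), len(history[0])
    columns = [[day[m] for day in history] for m in range(n_min)]
    freq = [Counter(col) for col in columns]

    def cost(m, d):
        # 0 if this cell holds (one of) the most frequent letters of minute m, else 1
        return 0 if freq[m][columns[m][d]] == max(freq[m].values()) else 1

    # best[m][d]: minimal cost of a continuous path from minute m to the end, starting in row d
    best = [[0] * n_days for _ in range(n_min)]
    choice = [[None] * n_days for _ in range(n_min)]
    for d in range(n_days):
        best[n_min - 1][d] = cost(n_min - 1, d)
    for m in range(n_min - 2, -1, -1):
        for d in range(n_days):
            neighbours = [d2 for d2 in (d - 1, d, d + 1) if 0 <= d2 < n_days]
            nxt = min(neighbours, key=lambda d2: best[m + 1][d2])
            best[m][d] = cost(m, d) + best[m + 1][nxt]
            choice[m][d] = nxt    # note which cell we selected

    # follow the noted choices from the cheapest starting row in the first column
    d = min(range(n_days), key=lambda d0: best[0][d0])
    path = []
    for m in range(n_min):
        path.append(columns[m][d])
        if m < n_min - 1:
            d = choice[m][d]
    return "".join(path)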
I am starting a project which involves an allocation problem and, having explored it a bit by myself, I am finding it pretty challenging to solve efficiently.
What I call here allocation problem is the following:
There is a set of available slots in 3D space, randomly sampled ([xspot_0, yspot_0, zspot_0], [xspot_1, yspot_1, zspot_1], etc.). These slots are identified with an ID and a position, and are fixed, so they will not change with time.
There are then mobile elements (the same number as the number of available slots, on the order of 250,000) which can go from spot to spot. They are identified with an ID and, at a given time step, the spot they occupy.
Each spot must have one and only one element at a given step.
At first, elements are ordered in the same way as spots: the first element (element_id=0) is in the first spot (spot_id=0), etc.
But then, these elements need to move, based on a motion vector that is defined for each spot, which is also fixed. For example, ideally at the first step, the first element should move from [xspot_0, yspot_0, zspot_0] to [xspot_0 + dxspot_0, yspot_0 + dyspot_0, zspot_0 + dzspot_0], etc.
Since spots were randomly sampled, the new target position might not exist among the spots. The goal is therefore to find a candidate slot for the next step that is as close as possible to the "ideal" position the element should be in.
On top of that first challenge, since this will probably be done through a loop, it is possible that the best candidate was already assigned to another element.
Once all new slots are defined for each element (or each element is assigned to a new slot, depending on how you see it), we do it again, applying the same motion with the new order. This is repeated as many times as I need.
Now that I have defined the problem, the first thing I tried was a simple allocation based on this information. However, if I pick the best candidate every time based on the distance to the target position, then, as I said, some elements have their best candidate already taken, so they pick the 2nd, 3rd, ... 20th, ... 100th candidate slot, which ends up far from the ideal position.
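For reference, here is a minimal sketch of that greedy allocation using a k-d tree; the k=50 query depth and the names are assumptions, and it reproduces exactly the failure mode described above (elements processed late get poor candidates).

import numpy as np
from scipy.spatial import cKDTree

def greedy_step(spots, motion, order, k=50):
    # spots: (N, 3) fixed slot positions; motion: (N, 3) per-slot motion vectors
    # order[i]: index of the slot currently occupied by element i
    k = min(k, len(spots))
    tree = cKDTree(spots)
    targets = spots[order] + motion[order]      # ideal next position of each element
    taken = np.zeros(len(spots), dtype=bool)
    new_order = np.empty_like(order)
    for i, target in enumerate(targets):
        # query the k nearest slots and take the first one that is still free
        _, candidates = tree.query(target, k=k)
        for c in np.atleast_1d(candidates):
            if not taken[c]:
                break
        else:
            # all k candidates taken: fall back to the nearest free slot
            free = np.flatnonzero(~taken)
            c = free[np.argmin(np.linalg.norm(spots[free] - target, axis=1))]
        new_order[i] = c
        taken[c] = True
    return new_order

# order = np.arange(len(spots))       # initial assignment: element i in slot i
# order = greedy_step(spots, motion, order)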
Another technique I tried, without being entirely sure about what I was doing, was to give each slot a weight computed as the inverse exponential of its distance to the target position, then normalize these weights to obtain probabilities (which seem arbitrary). I still do not get very good results for even a single step.
Therefore, I was wondering if someone knows how to solve this type of problem in a more accurate/more efficient way. For your information, I mainly use Python 3 for development.
Thank you!
A little background:
I'm constructing lists of words for a psychology experiment. We're trying to create a chain of words such that words adjacent in the list are related, but all other words in the list are not related. So for example:
SCHOOL, CAFETERIA, PIZZA, CRUST, EARTH, OCEAN, WHALE, ...
So here we see the first word is related to the second, and the second is related to the third, but the third isn't related to the first. (And the first isn't related to the fourth, fifth, sixth, ... either)
What I have so far...
I have a list of 1600 words such that each number from 0 to 1600 corresponds to a word. I also have a very large matrix (1600 x 1600) that tells me (on a scale of 0 to 1) how related each word is to every other word. (These are from a latent semantic analysis; http://lsa.colorado.edu/)
I can make the lists, but it's not very efficient at all, and my adjacent words aren't super strongly related to each other.
Here's my basic algorithm:
Set thresholds for minimum value for how related the adjacent words must be and for how unrelated the non-adjacent words must be.
Create a list from 0 to 1600. Shuffle that list. The first item of the list will be our first word.
Loop through our words, checking one by one whether the word meets our thresholds (i.e., check that this new word is related strongly enough to the last word added to the list, then loop through our list and check that it's unrelated to all the other words and that it isn't already in our list). If it meets the criteria, add it to the list. If we loop through all words with no success, dump the list and start all over.
Continue this until the list has as many words as I want (ideally, 16).
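For concreteness, here is a rough sketch of that procedure, assuming relatedness is the 1600 x 1600 matrix (nested list or NumPy array) and that the threshold values are placeholders to be tuned:

import random

def build_list(relatedness, n_words=16, min_adjacent=0.5, max_nonadjacent=0.2, max_restarts=1000):
    n = len(relatedness)
    for _ in range(max_restarts):
        candidates = list(range(n))
        random.shuffle(candidates)
        chain = [candidates[0]]        # first item of the shuffled list is our first word
        for w in candidates[1:]:
            if len(chain) == n_words:
                return chain
            if relatedness[chain[-1]][w] < min_adjacent:
                continue               # not related enough to the last word added
            if any(relatedness[prev][w] > max_nonadjacent for prev in chain[:-1]):
                continue               # too related to a non-adjacent word already in the list
            chain.append(w)
        if len(chain) == n_words:
            return chain               # otherwise dump the list and start over
    return None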
Does anyone have a better approach? The problem with my approach is that I'll sometimes settle for an okay match that meets my criteria when a better match is potentially still out there. Also, it would be nice if I didn't have to dump the whole list but could backtrack a few steps to where the list potentially went wrong.
This might be a good candidate for a genetic algorithm. You can create a large number of completely random possibilities, score each one with an objective function, and then iterate the population by crossing over mates based on fitness (possibly throwing some mutations in as well).
If done properly, this should give you a large-ish population of good solutions. If the population is large enough, the fitness function defined well enough and mutation is sufficient to get you out of any valleys you might otherwise get stuck in, you might even converge overwhelmingly on the optimal answer.
Loop through our words...If it meets the criteria, add it to the list.
This seems to be the point of issue. You are stopping at the first match, not the best match. Using your 1600 x 1600 matrix of relatedness values, you can simply get the index of the maximum relatedness value among the remaining words, then go to the word list and add the corresponding word to your list.
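A minimal sketch of that idea, assuming relatedness is a NumPy array and max_nonadjacent is a placeholder threshold for the "unrelated to everything else" constraint:

import numpy as np

def next_word(relatedness, chain, max_nonadjacent=0.2):
    # pick the remaining word most related to the last word in the chain,
    # among words still unrelated to all earlier (non-adjacent) words
    scores = relatedness[chain[-1]].copy()
    scores[chain] = -np.inf                               # never reuse a word
    if len(chain) > 1:
        too_related = (relatedness[chain[:-1]] > max_nonadjacent).any(axis=0)
        scores[too_related] = -np.inf
    return int(np.argmax(scores))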
The simplest solution seems to be probabilistic. You're not looking for the absolute best lists, just a set of "good enough" lists.
1 - Pick a random starting word, add it to your list.
2 - Find the set of all highly related words (pick a sensible relatedness value based on your data). Pick one word randomly from the set of related words and make sure it doesn't relate too closely to any other words in the list. Loop this until you find one that works (then append it to your list and go back to 2 until you reach the desired list size) or exhaust all related words (discard your list and go back to 1). See the sketch after these steps.
3 - Go back to 1 until you've constructed enough lists.
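A small sketch of step 2, with placeholder thresholds:

import random

def pick_next(relatedness, chain, min_adjacent=0.5, max_nonadjacent=0.2):
    # candidate pool: words highly related to the last word and not already used
    related = [w for w in range(len(relatedness))
               if w not in chain and relatedness[chain[-1]][w] > min_adjacent]
    random.shuffle(related)
    for w in related:
        if all(relatedness[prev][w] <= max_nonadjacent for prev in chain[:-1]):
            return w                   # works: append it and repeat step 2
    return None                        # exhausted: discard the list and go back to 1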
Preprocess into a different data structure: a dictionary of lists keyed by word.
Each word gets a list of every other word, sorted by relatedness (using your relatedness matrix), with the closest matches at one end and the unrelated words at the other.
Pick a random word as the 1st. The 2nd word is the entry at the "closest match" end of the 1st word's list.
The 3rd word is picked from at or near the opposite, unrelated end of the 1st word's list. The 4th word comes from the "closest match" end of the 3rd word's list. Repeat.
On rereading the requirements: as you pick each word (a close match from one end, a non-match from the other), you also need to revisit the lists of the words picked so far and make sure the candidate sits far enough towards the unrelated end of each of them (i.e. is a low match for every earlier word). If not, step one position along the list (to the next closest, or the next furthest) and try again.
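A tiny sketch of that preprocessing step; here the closest matches come first and the unrelated words last, which is one of the two possible orientations:

def build_sorted_lists(relatedness):
    # for every word, all other words sorted from closest match to least related
    n = len(relatedness)
    return {w: sorted((o for o in range(n) if o != w),
                      key=lambda o: relatedness[w][o], reverse=True)
            for w in range(n)}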
I'm not sure of the rules to create a matrix for a word search puzzle game. I am able to create a matrix with initial values of 0.
Is it correct that I should randomly select a starting point (coordinates) and a random direction (horizontal, vertical, or diagonal) for a word, and then check whether it would overlap with another word already in the matrix? If it does overlap, I check whether the overlapping characters are the same (although there is only a small chance of that); if it doesn't, I place the word there. The problem is that this way I reduce the chance of words overlapping.
I have also read that I need to first check which words share characters. But in that case, it seems like the words I am going to put in the matrix will always overlap.
I would rather look at words that are already there and then randomly select a word from the set of words that fit there.
Of course you might not fill the whole matrix like this. If you have put one word somewhere where it blocks all other words (no other word fits), you might have to backtrack, but that will kill the running time.
If you really want to fill the whole matrix, I would iterate over all possible starting positions, see how many words fit there, and then recurse over the possibilities at the starting position with the least number of candidates. That will cause your program to recognize and abandon "dead ends" early, which improves the running time drastically. That is a powerful technique from fixed-parameter algorithms, which I like to call branching-vector minimization.
Start with the longest word.
First of all, you must find all positions and directions where this word may fit. For example, the word 'WORD' fits when the first cell is NULL or W, the second NULL or O, the third NULL or R, and the fourth NULL or D.
Then you should group these positions into those with no NULLs, with one NULL, with two NULLs, and so on.
Then randomly select a position from the group with the smallest number of NULLs. If there are no possible positions, skip the word.
This approach will allow you to place more words and prevents situations where a random search can't find a proper place (when there are only a few of them).
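A sketch of that fit test and of the NULL count used for grouping, assuming the grid stores None in empty cells (the names are illustrative):

def fits(grid, word, row, col, drow, dcol):
    # True if `word` can be placed starting at (row, col) along direction (drow, dcol)
    for i, ch in enumerate(word):
        r, c = row + i * drow, col + i * dcol
        if not (0 <= r < len(grid) and 0 <= c < len(grid[0])):
            return False
        if grid[r][c] is not None and grid[r][c] != ch:
            return False
    return True

def blanks(grid, word, row, col, drow, dcol):
    # number of empty (NULL) cells the placement would use; fewer blanks = more overlap
    return sum(grid[row + i * drow][col + i * dcol] is None for i in range(len(word)))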
I'm currently trying to solve the hard Challenge #151 on reddit with an unusual method, a genetic algorithm.
In short, after separating a string into consonants and vowels and removing spaces, I need to put it back together without knowing which character comes first.
hello world is separated into hllwrld and eoo and needs to be put together again. One solution, for example, would be hlelworlod, but that doesn't make much sense. The exhaustive approach that tries all possible solutions works, but isn't feasible for longer problem sets.
What I already have
A database with the frequency of English words
An algorithm that constructs a relative cost database using Zipf's law and can consistently separate words from sentences without spaces (borrowed from this question/answer)
A method that puts the consonants and vowels into stacks and randomly takes a character from either one, encoding the construction in a gene as a string consisting of 1s and 2s. The correct gene for the example would be 1211212111 (see the sketch after this list)
A method that mutates such a string, randomly swapping characters around
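For reference, a small sketch of decoding such a gene back into a string (names assumed):

def decode(gene, consonants, vowels):
    # gene: string of '1'/'2'; take the next consonant for '1', the next vowel for '2'
    c, v = iter(consonants), iter(vowels)
    return "".join(next(c) if g == "1" else next(v) for g in gene)

# decode("1211212111", "hllwrld", "eoo") == "helloworld"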
What I tried
Generating 500 random sequences, using the infer_spaces() method and evaluating fitness with the cost of all the words, then taking the best 25% and mutating 4 new ones from those. This works for small strings but falls into local minima very often, especially for longer sequences. Hello World is found already in the first generation; thisisnotworkingverygood (which, correctly separated, has a cost of 41.223) converges to th iss n ti wo or king v rye good (cost 270) already in the second generation.
What I need
Clearly, using the calculated cost as an evaluation method only works for the separation of sentences that are grammatically correct, not for this genetic algorithm. Do you have better ideas I could try? Or is another part of the solution, for example the representation of the gene, the problem?
I would simplify the problem into two parts:
Finding candidate words to split the string into (so hllwrld => hll wrld)
How to then expand those words by adding vowels.
I would first take your dictionary of word frequencies and process it to create a second list of words without vowels, along with the possible vowel sequences for each collapsed word (and the associated frequencies). You technically don't need a GA to solve this (and I think it would be easier to solve without one), but as you asked, I will provide 2 answers:
Without GA: you should be able to solve the first problem using a depth first search, matching substrings of the word against that dictionary, and doing so with the remaining word parts, only accepting partitions of the word into words (without vowels) where all words are in the second dictionary. Then you have to substitute in the vowels. Given that second dictionary, and the partition you already have, this should be easy. You can also use the list of vowels to further constrain the partitioning, as valid words in the partitions can only be made whole using vowels from the vowel list that is input into the algorithm. Starting at the left hand side of the string and iterating over all valid partitions in a depth first manner should solve this problem relatively quickly.
With GA: To solve this with a GA, I would create the dictionary of words without vowels. Then, using the GA, create binary strings (as your chromosomes) of the same length as the input string of consonants, where 1 = split a word at that position and 0 = leave unchanged. These strings will all be the same length. Then create a fitness function that returns the proportion of words obtained after performing a split using the chromosome that are valid words without vowels, according to that dictionary. Create a second fitness function that takes the valid no-vowel words and computes the proportion of overlap between the vowels missing from all these valid no-vowel words and the original vowel list. Combine both fitness functions into one by multiplying the value from the first one by ten (assuming the second one returns a value between 0 and 1). That will force the algorithm to focus on the segmentation problem first and the vowel insertion problem second, and will also favour, among segmentations of the same quality, those whose set of missing vowels is closer to the original list. I would also include crossover in the solution; as all your chromosomes are the same length, this should be trivial. Once you have a solution that scores perfectly on the fitness function, it should be trivial to recreate the original sentence given that dictionary of words without vowels (provided you maintain a second dictionary that lists the possible missing vowel sets for each no-vowel word - there could be multiple for each, as some words will become identical once their vowels are removed).
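A rough sketch of that chromosome decoding and combined fitness. The names novowel_words (set of valid no-vowel words) and novowel_vowels (mapping each no-vowel word to its missing vowels, assuming a single variant per word for simplicity) are assumptions built from your word list, and the chromosome is taken to be a sequence of 0/1 integers.

from collections import Counter

def split_by_chromosome(consonant_str, chromosome):
    # chromosome[i] == 1 means "end a word after position i"
    words, start = [], 0
    for i, bit in enumerate(chromosome):
        if bit == 1:
            words.append(consonant_str[start:i + 1])
            start = i + 1
    words.append(consonant_str[start:])
    return [w for w in words if w]

def fitness(consonant_str, vowel_str, chromosome, novowel_words, novowel_vowels):
    words = split_by_chromosome(consonant_str, chromosome)
    valid = [w for w in words if w in novowel_words]
    segment_score = len(valid) / len(words)                # proportion of valid no-vowel words
    needed = Counter()
    for w in valid:
        needed += Counter(novowel_vowels[w])               # vowels these words are missing
    available = Counter(vowel_str)
    overlap = sum((needed & available).values()) / max(1, sum(needed.values()))
    return 10 * segment_score + overlap                    # segmentation dominates, vowels break ties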
Let's say you have several generations and you plot the cost of the best specimen in each generation (consider a long sentence). Does this graph keep going down, or does it converge to a specific value after 2-3 generations (let the algorithm run for, say, 10 generations)? Can you run your algorithm several times with various initial conditions (random sequences) and see whether you sometimes get good results or not?
Depending on the results, you may try the following (this graph is a really good tool for improving performance):
1) If the graph goes up and down too much all the time, you have too much mutation (average number of swaps per gene, for example); try to decrease it.
2) If you are stuck in a local minimum (the cost of the best specimen doesn't change much after some time), try to increase mutation, or run several isolated populations (3-4) of, say, 100 specimens at the beginning of your algorithm for a few generations. Then select the best population (the one closest to the global minimum) and try to improve it as much as possible through mutation.
PS: Interesting problem, by the way. I tried to figure out how to use crossover to improve the algorithm but haven't managed it.
The fitness function is the key to the success of a GA (which I kind of agree is suitable here).
I agree with #Simon that the vowel/non-vowel separation is not that important; just strip the vowels from your text corpus.
What is important in the fitness:
matched word frequency (frequent words are better)
grammar - the structure of the sentence (you might need to use NLTK to get the related information)
and don't forget to update the end result ^^
I need to find all the days of the month on which a certain activity occurs. The days when the activity occurs will be consecutive. The run of days can range from a single day to the entire month, and it will occur exactly once per month.
Testing whether or not the activity occurs on any given day is not an expensive calculation, but I thought I would use this problem to learn something new. Which algorithm minimizes the number of days I have to test?
You can't really do much better than iterating through the sequence to find the first match, then iterating until the first non match. You can use itertools to make it nice and readable:
import itertools

itertools.takewhile(mytest,
                    itertools.dropwhile(lambda x: not mytest(x), mysequence))
I think the linear probe suggested by #isbadawi is the best way to find the beginning of the subsequence. This is because the subsequence could be very short and could be anywhere within the larger sequence.
However, once the beginning of the subsequence is found, we can use a binary search to find the end of it. That will require fewer tests than doing a second linear probe, so it's a better solution for you.
As others have pointed out, there is not much practical reason for doing this. This is true for two reasons: your large sequence is quite short (only about 31 elements), and you still need to do at least one linear probe anyway, so the big-O runtime will be still be linear in the length of the large sequence, even though we have reduced part of the algorithm from linear to logarithmic.
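A sketch of that combination, assuming mytest(day) is the (cheap) activity test and days are numbered 0 to n_days - 1 (names are assumptions):

def activity_range(mytest, n_days):
    # linear probe for the first day of the (single, contiguous) run
    start = 0
    while start < n_days and not mytest(start):
        start += 1
    if start == n_days:
        return None                     # no activity this month
    # binary search for the last day of the run
    lo, hi = start, n_days - 1
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mytest(mid):
            lo = mid                    # run extends at least to mid
        else:
            hi = mid - 1                # run ends before mid
    return start, lo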
The best method depends a bit on your input data structure. If your input data structure is a list of booleans for each day of the month then you can use the following code.
start = activity.index(True)                           # first day with activity
end = len(activity) - 1 - activity[::-1].index(True)   # last day with activity