Pebbling a Checkerboard with Dynamic Programming - python

I am trying to teach myself Dynamic Programming, and ran into this problem from MIT.
We are given a checkerboard which has 4 rows and n columns, and
has an integer written in each square. We are also given a set of 2n pebbles, and we want to
place some or all of these on the checkerboard (each pebble can be placed on exactly one square)
so as to maximize the sum of the integers in the squares that are covered by pebbles. There is
one constraint: for a placement of pebbles to be legal, no two of them can be on horizontally or
vertically adjacent squares (diagonal adjacency is ok).
(a) Determine the number of legal patterns that can occur in any column (in isolation, ignoring
the pebbles in adjacent columns) and describe these patterns.
Call two patterns compatible if they can be placed on adjacent columns to form a legal placement.
Let us consider subproblems consisting of the first k columns, 1 <= k <= n. Each subproblem can
be assigned a type, which is the pattern occurring in the last column.
(b) Using the notions of compatibility and type, give an O(n)-time dynamic programming algorithm for computing an optimal placement.
Ok, so for part (a): there are 8 possible patterns.
For part b, I'm unsure, but this is where I'm headed:
Split into sub-problems. Assume 1 <= i <= n.
1. Define Cj[i] to be the optimal value by pebbling columns 0,...,i, such that column i has pattern type j.
2. Create 8 separate arrays of n elements for each pattern type.
I am not sure where to go from here. I realize there are solutions to this problem online, but the solutions don't seem very clear to me.

You're on the right track. As you examine each new column, you will end up computing all possible best-scores up to that point.
Let's say you built your compatibility list (a 2D array) and called it Li[y] such that for each pattern i there are one or more compatible patterns Li[y].
Now, you examine column j. First, you compute that column's isolated scores for each pattern i. Call it Sj[i]. For each pattern i and compatible
pattern x = Li[y], you need to maximize the total score Cj by setting Cj[x] = max(Cj[x], Cj-1[i] + Sj[x]). This is a simple array test and update (if bigger).
In addition, you store the pattern that led to each score. When you update Cj[x] (i.e. you increase its score from its present value), remember the preceding pattern that caused the update as Pj[x] = i. That says "pattern x gave the best result, given the preceding pattern i".
When you are all done, just find the pattern i with the best score Cn[i]. You can then backtrack using Pj to recover the pebbling pattern from each column that led to this result.
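A minimal Python sketch of that scheme, assuming the board is given as board[r][c] with 4 rows and n columns (the pattern representation as row-index tuples and the helper column_score are my own choices here, not prescribed by the problem):

from itertools import combinations

def best_pebbling(board):
    n = len(board[0])
    # The 8 legal column patterns: sets of rows 0..3 with no two vertically adjacent rows.
    patterns = [p for size in range(3) for p in combinations(range(4), size)
                if all(b - a > 1 for a, b in zip(p, p[1:]))]
    # Two patterns are compatible if they share no row (so they can sit in adjacent columns).
    compatible = {i: [x for x, q in enumerate(patterns) if not set(p) & set(q)]
                  for i, p in enumerate(patterns)}
    def column_score(c, p):
        return sum(board[r][c] for r in p)
    # C[j][x] = best score for columns 0..j with pattern x in column j; P[j][x] backtracks.
    C = [[column_score(0, p) for p in patterns]]
    P = [[None] * len(patterns)]
    for j in range(1, n):
        C.append([float("-inf")] * len(patterns))
        P.append([None] * len(patterns))
        for i in range(len(patterns)):
            for x in compatible[i]:
                cand = C[j - 1][i] + column_score(j, patterns[x])
                if cand > C[j][x]:
                    C[j][x], P[j][x] = cand, i
    # Pick the best final pattern, then walk back through P to recover the placement.
    best = max(range(len(patterns)), key=lambda i: C[n - 1][i])
    best_score, placement = C[n - 1][best], []
    for j in range(n - 1, -1, -1):
        placement.append(patterns[best])
        best = P[j][best]
    return best_score, placement[::-1]

Each column costs only a constant 8 x 8 comparison of patterns, so the whole thing is O(n).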

How to merge two dataframes by similar (but not matching) values?

I have two dataframes.
player_stats:
player minutes total_points assists
1 Erling Haaland 77 13 0
2 Kevin De Bruyne 90 6 1
and, season_gw1:
player position gw team
10449 Erling Håland 4 1 Manchester City
10453 Kevin De Bruyne 3 1 Manchester City
I want to merge these two dataframes by player, but as you can see, for the first player (Haaland), the word is not spelled exactly the same on both dfs.
This is the code I'm using:
season_gw1_stats = season_gw1.merge(player_stats, on = 'player')
And the resulting df (season_gw1_stats) is this one:
player position gw team minutes total_points assists
10453 Kevin de Bruyne 3 1 Manchester City 90 6 1
How do I merge dfs by similar values? (This is not the full dataframe - I also have some other names that are spelled differently in both dfs but are very similar).
In order to use standard pandas to "merge dataframes", you will pretty much need to eliminate "similar" from the problem statement. So we're faced with mapping to "matching" values that are identical across dataframes. Here's a pair of plausible approaches:
1. normalize each name in isolation
2. examine quadratic pairwise distances
1. normalize
Map variant spellings down to a smaller universe of spellings where collisions (matches) are more likely. There are many approaches:
case smash to lower
map accented vowels to [aeiou]
discard all vowels
use simplifying regexes like s/sch/sh/ and s/de /de/
use Soundex or later competitors like Metaphone
manually curate a restricted vocabulary of correct spellings
Cost is O(N), linear in the total length of the dataframes.
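For example, a minimal normalizer using only the standard library (case-smashing plus accent folding; the player_norm column name is arbitrary, the exact rules depend on your data, and note that this alone still won't make "Haaland" and "Håland" collide, so a simplifying regex or the distance pass below may also be needed):

import unicodedata

def normalize_name(name: str) -> str:
    # Fold accents (e.g. "Håland" -> "Haland"), lowercase, and collapse whitespace.
    folded = (unicodedata.normalize("NFKD", name)
              .encode("ascii", "ignore")
              .decode("ascii"))
    return " ".join(folded.lower().split())

player_stats["player_norm"] = player_stats["player"].map(normalize_name)
season_gw1["player_norm"] = season_gw1["player"].map(normalize_name)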
2. pairwise distances
We wish to canonicalize, that is, to boil down multiple variant spellings to a distinguished canonical spelling. Begin by optionally normalizing, then sort, at a cost of O(N log N), and finally make a linear pass that outputs only unique names. This trivial pre-processing step reduces N, which helps a lot when dealing with the O(N^2) quadratic cost.
Define a distance metric which accepts two names. When given a pair of identical names it must report a distance of zero; otherwise it deterministically reports a positive real number. You might use Levenshtein, or MRA.
Use nested loops to compare all names against all names. If the distance between two names is less than the threshold, arbitrarily declare name1 the winner, overwriting the 2nd name with that 1st value. The effect is to cluster multiple variant spellings down to a single winning spelling. Cost is O(N^2), quadratic.
Perhaps you're willing to tinker with the distance function a bit. You might give the initial letter(s) a heavier weight, such that a mismatched prefix guarantees the distance will exceed the threshold. In that case sorted names will help out, and the nested loop can be confined to just a small window of similarly prefixed names, with early termination once it sees the prefix has changed. Noting the distance between adjacent sorted names can help with manually choosing a sensible threshold parameter.
Finally, with adjusted names in hand, you're in a position to .merge() using exact equality tests.
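As a concrete sketch of the pairwise approach using nothing beyond the standard library, with difflib's similarity ratio standing in for a true edit distance (the 0.85 threshold and the player_key column name are arbitrary choices to tune for your data):

import difflib
import pandas as pd

def canonicalize(names, threshold=0.85):
    # Map each name to a canonical spelling: the first already-seen name within the threshold.
    canonical, seen = {}, []
    for name in sorted(set(names)):
        match = next((s for s in seen
                      if difflib.SequenceMatcher(None, name, s).ratio() >= threshold), None)
        if match is None:
            seen.append(name)
            match = name
        canonical[name] = match
    return canonical

mapping = canonicalize(pd.concat([player_stats["player"], season_gw1["player"]]))
player_stats["player_key"] = player_stats["player"].map(mapping)
season_gw1["player_key"] = season_gw1["player"].map(mapping)
season_gw1_stats = season_gw1.merge(player_stats, on="player_key")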

Debug Exact Cover Pentominoes, Wikipedia example incomplete? OR... I'm misunderstanding something (includes code)

The Problem:
I've implemented Knuth's DLX "dancing links" algorithm for Pentominoes in two completely different ways and am still getting incorrect solutions. The trivial Wikipedia example works OK (https://en.wikipedia.org/wiki/Knuth%27s_Algorithm_X#Example), but more complex examples fail.
Debugging the full Pentominoes game requires a table with almost 2,000 entries, so I came up with a greatly reduced puzzle (pictured below) that is still complex enough to show the errant behavior.
Below is my trivial 3x5 Pentominoes example, using only 3 pieces to place. I can work through the algorithm with pen and paper, and sure enough my code is doing exactly what I told it to, but on the very first step, it nukes all of my rows! When I look at the connectedness, the columns certainly do seem to be OK. So clearly I'm misunderstanding something.
The Data Model:
This is the trivial solution I'm trying to get DLX to solve:
Below is the "moves" table, which encodes all the valid moves that the 3 pieces can make. (I filter out moves where a piece would create a hole size not divisible by 5)
The left column is the encoded move, for example the first row is
piece "L", placed at 0,0, then rotated ONE 90-degree turn
counter-clockwise.
vertical bar (|) delimiter
The first 3 columns are the selector bits for which piece I'm referring to.
Since "l" is the first piece (of only 3), it has a 1 in the leftmost column.
The next 15 columns are 1 bit for every spot on a 3x5 pentominoes board.
l_0,0_rr10|100111100001000000
l_0,1_rr10|100011110000100000
l_1,1_rr10|100000000111100001
l_0,0_rr01|100111101000000000
l_0,1_rr01|100011110100000000
l_1,0_rr01|100000001111010000
l_0,0_rr30|100100001111000000
l_1,0_rr30|100000001000011110
l_1,1_rr30|100000000100001111
l_0,1_rr01|100000010111100000
l_1,0_rr01|100000000001011110
l_1,1_rr01|100000000000101111
t_0,1_rr00|010011100010000100
t_0,0_rr10|010100001110010000
t_0,1_rr20|010001000010001110
t_0,2_rr30|010000010011100001
y_1,0_rr00|001000000100011110
y_1,1_rr00|001000000010001111
y_1,0_rr01|001000000100011110
y_1,1_rr01|001000000010001111
y_0,0_rr20|001111100010000000
y_0,1_rr20|001011110001000000
y_0,0_rr01|001111100100000000
y_0,1_rr01|001011110010000000
An Example Failure:
The First Move kills all the rows of my array (disregarding the numeric header row and column)
Following the wikipedia article cited earlier, I do:
Look for minimum number of bits set in a column
4 is the min count, and column 2 is the leftmost column with that count
I choose the first row intersecting with column 2, which is row 13.
Column 2 and row 13 will be added to the columns and rows to be "covered" (aka deleted)
Now I look at row 13 and find all intersecting columns: 2, 5, 6, 7, 11 & 16
Now I look at all the rows that intersect with a 1 in any of those columns - THIS seems to be the problematic step - that criterion selects ALL 24 data rows for removal.
Since the matrix is then empty, the system thinks it has found a valid solution.
Here's a picture of my pen-and-paper version of the algorithm:
Given the requests for code, I'm now attaching it. The comments at the top explain where to look.
Here's the code:
https://gist.github.com/ttennebkram/8bd27adece6fb3a5cd1bdb4ab9b51166
Second Test
There's a second 3x5 puzzle I thought of, but it hits the same problem the first example has. For the record, the second 3x5 is:
# Tiny Set 2: 3x5
# u u v v v
# u p p p v
# u u p p v
The issue you're seeing with your hand-run of the algorithm is that a matrix with no rows is not a solution. You need to eliminate all the columns; just getting rid of the rows is a failure. Your example run still has 12 columns left that need to be covered, so it's not a success.
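To make the success condition concrete, here is a rough set-based sketch of Algorithm X (plain sets, not DLX, and not your code; rows is assumed to be a dict mapping a row name to the set of column ids it covers). Success is declared only when no columns remain uncovered; an empty row set with columns left over is just a dead end:

def algorithm_x(columns, rows, partial=None):
    # columns: set of column ids still to cover; rows: dict row_id -> set of covered column ids.
    if partial is None:
        partial = []
    if not columns:                # success: every column has been covered
        yield list(partial)
        return
    # Choose the column with the fewest candidate rows; if it has none, this branch dies here.
    col = min(columns, key=lambda c: sum(1 for s in rows.values() if c in s))
    for rid, rcols in list(rows.items()):
        if col not in rcols:
            continue
        partial.append(rid)
        # "Cover": drop every column of this row and every row that clashes with it.
        remaining_rows = {i: s for i, s in rows.items() if not (s & rcols)}
        yield from algorithm_x(columns - rcols, remaining_rows, partial)
        partial.pop()

Your example run corresponds to the dead-end case: the chosen column still exists but no rows can cover it, so the branch fails rather than succeeding.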
Your exact cover implementation seems OK for the reduced instance, but the plotter was broken. I fixed it by changing
boardBitmap = fullBitmap[12:]
to
boardBitmap = fullBitmap[3:]
in plotMoveToBoard_np, since there are only three pieces in the reduced instance.
EDIT: there's also a problem with how you generate move names. There are distinct moves with the same name. There are also duplicate moves (which don't affect correctness but do affect performance). I changed
- g_rowNames.append(rowName)
+ g_rowNames.append(str(hash(str(finalBitmask))))
and 3x20 starts working as it should. (That's not a great way to generate the names, because in theory the hashes could collide, but it's one line.)

Collapsing set of strings based on a given hamming distance

Given a set of strings (first column) along with counts (second column), e.g.:
aaaa 10
aaab 5
abbb 3
cbbb 2
dbbb 1
cccc 8
Are there any algorithms or even implementations (ideally as a Unix executable, R or Python) that collapse this set into a new set based on a given hamming distance?
Collapsing implies adding the count
Strings with a lower count are collapsed into strings with higher counts.
For example say for hamming distance 1, the above set would collapse the second string aaab into aaaa since they are 1 hamming distance apart and aaaa has a higher count.
The collapsed entry would have the combined count, here aaaa 15
For this set, we'd, therefore, get the following collapsed set:
aaaa 15
abbb 6
cccc 8
Ideally, an implementation should be efficient, so even heuristics which do not guarantee an optimal solution would be appreciated.
Further background and motivation
Calculating the hamming distance between 2 strings (a pair) has been implemented in most programming languages. A brute force solution would compute the distance between all pairs. Maybe there is no way around it. However, I'd imagine an efficient solution would avoid calculating the distance for all pairs. There are maybe clever ways to save some calculations based on metric theory (since hamming distance is a metric), e.g. if the hamming distance between x and z is 3, and between x and y is 3, I can avoid calculating it between y and z. Maybe there is a clever k-mer approach, or maybe some efficient solution for a constant distance (say d=1).
Even if there were only a brute force solution, I'd be curious whether this has been implemented before and how to use it (ideally without me having to implement it myself).
I thought up the following:
This reports the item with the highest score, together with the sum of its score and the scores of its nearby neighbors. Once a neighbor has been used, it is not reported separately.
I suggest using a Vantage-point tree as the metric index.
The algorithm would look like this:
construct the metric index from the strings and their scores
construct the max heap from the strings and their scores
for the string with the highest score in the max heap:
use the metric index to find the nearby strings
print the string, and the sum of its score and its nearby strings' scores
remove from the metric index the string and each of the nearby strings
remove from the max heap the string and each of the nearby strings
repeat 3-7 until the max heap is empty
Perhaps this could be simplified by using a used table rather than removing anything. The metric space index would not need to have efficient deletion nor would the max heap need to support deletion by value. But this would be slower if the neighborhoods are large and overlap frequently. So efficient deletion might be a necessary difficulty.
construct the metric index from the strings and their scores
construct the max heap from the strings and their scores
construct the used table from an empty set
for the string with the highest score in the max heap:
if this string is in the used table: start over with the next string
use the metric index to find the nearby strings
remove any nearby strings that are in the used table
print the string, and the sum of its score and its nearby strings' scores
add the nearby strings to the used table
repeat 4-9 until the max heap is empty
I can not provide a complexity analysis.
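For what it's worth, here is a brute-force Python sketch of the same idea, with a linear scan standing in for the metric index (so it stays quadratic, but it is enough to check the logic against the example above):

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def collapse(counts, d):
    # counts: dict of string -> count; d: hamming distance threshold.
    used, result = set(), {}
    for s in sorted(counts, key=counts.get, reverse=True):   # highest count first
        if s in used:
            continue
        neighbors = [t for t in counts
                     if t != s and t not in used and hamming(s, t) <= d]
        result[s] = counts[s] + sum(counts[t] for t in neighbors)
        used.add(s)
        used.update(neighbors)
    return result

data = {"aaaa": 10, "aaab": 5, "abbb": 3, "cbbb": 2, "dbbb": 1, "cccc": 8}
print(collapse(data, 1))   # {'aaaa': 15, 'cccc': 8, 'abbb': 6}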
I was thinking about the second algorithm. The part that I thought was slow was checking the neighborhood against the used table. This is not needed, as deletion from a Vantage-point tree can be done in linear time: when searching for the neighbors, remember where they were found and then remove them later using those locations. If a neighbor is used as a vantage point, mark it as removed so that a search will not return it, but otherwise leave it alone. This, I think, restores the running time to below quadratic; otherwise it would be something like the number of items times the size of a neighborhood.
In response to the comment: the problem statement was "Strings with a lower count are collapsed into strings with higher counts," and this does compute that. It is not a greedy approximation that could produce a non-optimal result, because there was nothing to maximize or minimize; it is an exact algorithm. It returns the item with the highest score combined with the scores of its neighborhood.
This can be viewed as assigning a leader to each neighborhood such that each item has at most one leader and that leader has the largest overall score so far. This can be viewed as a directed graph.
The specification wasn't for a dynamic programming or optimization problem. For that you would ask for the item with the highest score in the highest total-scoring neighborhood. That can also be solved in a similar way, by changing the ranking function for a string from its score alone to the pair (sum of its score and its neighborhood's scores, its score).
It does mean that it can't be solved with a max heap over the scores, since removing items affects the neighbors of the neighborhood, and one would have to recalculate their neighborhood scores before again finding the item with the highest total-scoring neighborhood.

Converting an overlapping DNA region into a variable and then changing the length of overlapping region

I am relatively new to Python, so I would very much appreciate any constructive feedback on my code, and I would really appreciate it if you could guide me in the right direction if I am wrong.
I have designed a program in Python that takes in a DNA sequence, basically a string comprising only four letters (A, T, C and G), and finds the complement of the sequence. Thereafter, it takes the two sequences and divides them into fragments of a certain length such that each fragment overlaps with the neighboring fragment by the same number of letters.
For instance, it would take in a DNA sequence, say s1, and produce the following output for its complement and fragments.
s1 = "AGCCCTCCAGGACAGGCTGCATCAGAAGAGGCCATCAAGCAGGTCTGTTCCAAGGGCCTTTGCGTCAGGT"
print(dna_complement(s1))
>>>> complement = TCGGGAGGTCCTGTCCGACGTAGTCTTCTCCGGTAGTTCGTCCAGACAAGGTTCCCGGAAACGCAGTCCA
print(dna_fragment(s1, oligo_size=8, oligo_overlap=3)):
>>>> AGCCCTCC..GACAGGCT..ATCAGAAG..GCCATCAA..AGGTCTGT..CAAGGGCC..TGCGTCAGGT
.....AGGTCCTG..CGACGTAG..TTCTCCGG..GTTCGTCC..ACAAGGTT..CGGAAACG.......
As you can see in the above example, the output of dna_fragment is two strings that share an oligo overlap of 3 base pairs (i.e. three letters), and the size of the oligos is at minimum 8 base pairs.
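For reference, the complement step on its own can be a one-liner; this is only my guess at what dna_complement might look like (the staggered two-strand fragmentation shown above is more involved and not reproduced here):

COMPLEMENT = str.maketrans("ATCG", "TAGC")

def dna_complement(seq: str) -> str:
    # Complement each base (A<->T, C<->G) without reversing the strand.
    return seq.translate(COMPLEMENT)

print(dna_complement("AGCCC"))   # TCGGG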
Hereafter, the program tries to determine the value of a specific parameter (Tm) for the overlapping region (Tm is the temperature at which half of the overlapping region is coiled and the other half is in the form of a DNA duplex). The program must then change the length of each overlapping region such that the Tm is approximately the same for all overlapping regions. I have successfully completed the former tasks by essentially storing the oligos (fragments) in a list and using a list comprehension to find the Tm for each oligo; however, I have failed at accomplishing the latter task.
So my question is how can I take the overlapping region, store it in a variable and then change its length so that it matches a certain Tm?
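One rough way to structure that last step, as a sketch only: keep the overlap as a slice of the full sequence, estimate its Tm with whatever model you already use (the simple Wallace 2+4 rule below is just a stand-in), and grow the slice until the Tm reaches a target value. The adjust_overlap helper and its parameters here are illustrative, not part of your program:

def wallace_tm(seq: str) -> float:
    # Rough Wallace-rule estimate: 2 degrees per A/T, 4 degrees per G/C.
    return 2 * (seq.count("A") + seq.count("T")) + 4 * (seq.count("G") + seq.count("C"))

def adjust_overlap(seq: str, start: int, length: int, target_tm: float, max_length: int = 20) -> str:
    # Extend the overlap that starts at `start` until its Tm reaches target_tm (or max_length).
    overlap = seq[start:start + length]
    while wallace_tm(overlap) < target_tm and len(overlap) < max_length:
        length += 1
        overlap = seq[start:start + length]
    return overlap

s1 = "AGCCCTCCAGGACAGGCTGCATCAGAAGAGGCCATCAAGCAGGTCTGTTCCAAGGGCCTTTGCGTCAGGT"
print(adjust_overlap(s1, start=5, length=3, target_tm=20))   # TCCAGG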

Challenging dynamic programming problem

This is a toned-down version of a computer vision problem I need to solve. Suppose you are given parameters n, q and have to count the number of ways of assigning integers 0..(q-1) to the cells of an n-by-n grid so that for each assignment the following are all true:
No two neighbors (horizontally or vertically) get the same value.
The value at position (i,j) is 0.
The value at position (k,l) is 0.
Since (i,j,k,l) are not given, the output should be an array of the counts above, one for every valid setting of (i,j,k,l).
A brute force approach is below. The goal is to get an efficient algorithm that works for q<=100 and for n<=18.
def tuples(n, q):
    return [[a] + b for a in range(q) for b in tuples(n - 1, q)] if n > 1 else [[a] for a in range(q)]

def isvalid(t, n):
    grid = [t[n * i:n * (i + 1)] for i in range(n)]
    for r in range(n):
        for c in range(n):
            v = grid[r][c]
            left = grid[r][c - 1] if c > 0 else -1
            right = grid[r][c + 1] if c < n - 1 else -1
            top = grid[r - 1][c] if r > 0 else -1
            bottom = grid[r + 1][c] if r < n - 1 else -1
            if v == left or v == right or v == top or v == bottom:
                return False
    return True

def count(n, q):
    result = []
    for pos1 in range(n ** 2):
        for pos2 in range(n ** 2):
            total = 0
            for t in tuples(n ** 2, q):
                if t[pos1] == 0 and t[pos2] == 0 and isvalid(t, n):
                    total += 1
            result.append(total)
    return result

assert count(2, 2) == [1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
Update 11/11
I've also asked this on TopCoder forums, and their solution is the most efficient one I've seen so far (about 3 hours for n=10, any q, from author's estimate)
Maybe this sounds too simple, but it works. Randomly distribute values to all the cells until only two are empty. Test all values for adjacency violations. Compute the running average of the fraction of successful trials over all trials, until the variance drops to within an acceptable margin.
The risk of a wrong estimate goes to zero, and all that is at risk is a little runtime.
This isn't an answer, just a contribution to the discussion which is too long for a comment.
tl;dr: Any algorithm which boils down to "compute the possibilities and count them," such as Eric Lippert's or a brute force approach, won't work for @Yaroslav's goal of q <= 100 and n <= 18.
Let's first think about a single n x 1 column. How many valid numberings of this one column exist? For the first cell we can pick between q numbers. Since we can't repeat vertically, we can pick between q - 1 numbers for the second cell, and therefore q - 1 numbers for the third cell, and so on. For q == 100 and n == 18 that means there are q * (q - 1) ^ (n - 1) = 100 * 99 ^ 17 valid colorings which is very roughly 10 ^ 36.
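(A quick brute-force sanity check of that q * (q - 1) ^ (n - 1) single-column count, for small values; column_count is my own helper here:)

from itertools import product

def column_count(n, q):
    # Colourings of a single n x 1 column with no two vertically adjacent cells equal.
    return sum(all(col[i] != col[i + 1] for i in range(n - 1))
               for col in product(range(q), repeat=n))

assert column_count(4, 3) == 3 * 2 ** 3   # q * (q - 1) ** (n - 1)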
Now consider any two valid columns (call them the bread columns) separated by a buffer column (call it the mustard column). Here is a trivial algorithm to find a valid set of values for the mustard column when q >= 4. Start at the top cell of the mustard column. We only have to worry about the adjacent cells of the bread columns which have at most 2 unique values. Pick any third number for the mustard column. Consider the second cell of the mustard column. We must consider the previous mustard cell and the 2 adjacent bread cells with a total of at most 3 unique values. Pick the 4th value. Continue to fill out the mustard column.
We have at most 2 columns containing a hard coded cell of 0. Using mustard columns, we can therefore make at least 6 bread columns, each with about 10 ^ 36 solutions for a total of at least 10 ^ 216 valid solutions, give or take an order of magnitude for rounding errors.
There are, according to Wikipedia, about 10 ^ 80 atoms in the universe.
Therefore, be cleverer.
Update 11/11 I've also asked this on TopCoder forums, and their solution is the most efficient one I've seen so far (about 41 hours for n=10, any q, from author's estimate)
I'm the author. Not 41, just 3 embarrassingly parallelizable CPU hours. I've counted symmetries. For n=10 there are only 675 really distinct pairs of (i,j) and (k,l). My program needs ~ 16 seconds per each.
I'm building on Dave Aaron Smith's contribution to the discussion.
Let's not consider for now the last two constraints ((i,j) and (k,l)).
With only one column (nx1) the solution is q * (q - 1) ^ (n - 1).
How many choices are there for a second column? (q-1) for the top cell (1,2), but then q-1 or q-2 for the cell (2,2), depending on whether (1,2) and (2,1) have the same color or not.
Same thing for (3,2): q-1 or q-2 solutions.
We can see we have a binary tree of possibilities and we need to sum over that tree. Let's assume the left child is always "same color on top and at left" and the right child is "different colors".
By computing, over the tree, the number of possibilities for the left column to create such a configuration, and the number of possibilities for the new cells we are coloring, we can count the number of possibilities for coloring two columns.
But let's now consider the probability distribution for the coloring of the second column: if we want to iterate the process, we need a uniform distribution on the second column. It should be as if the first column never existed, so that among all colorings of the first two columns we could say things like "1/q of them have color 0 in the top cell of the second column".
Without a uniform distribution it would be impossible.
The problem: is the distribution uniform?
Answer:
We would have obtained the same number of solutions by building the second column first, then the first one, and then the third one. The distribution of the second column is uniform in that case, so it also is in the first case.
We can now apply the same "tree idea" to count the number of possibilities for the third column.
I will try to develop that and build a general formula (since the tree is of size 2^n we don't want to explore it explicitly).
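A small brute-force check of the uniformity claim for tiny n and q (my own verification, not part of the argument above):

from itertools import product
from collections import Counter

def valid_columns(n, q):
    return [c for c in product(range(q), repeat=n)
            if all(c[i] != c[i + 1] for i in range(n - 1))]

n, q = 3, 3
cols = valid_columns(n, q)
top_cell = Counter()
for c1 in cols:
    for c2 in cols:
        if all(a != b for a, b in zip(c1, c2)):   # no horizontal repeats between the columns
            top_cell[c2[0]] += 1
print(top_cell)   # each color appears equally often in the top cell of the second column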
A few observations which might help other answerers as well:
The values 1..q are interchangeable - they could be letters and the result would be the same.
The constraint that no neighbours match is a very mild one, so a brute force approach will be excessively expensive. Even if you knew the values in all but one cell, there would still be at least q-8 possibilities for q>8.
The output of this will be pretty long - every set of i,j,k,l will need a line. The number of combinations is something like n^2(n^2-3), since the two fixed zeroes can be anywhere except adjacent to each other, unless they need not obey the first rule. For n=100 and q=18, the maximally hard case, this is ~100^4 = 100 million. So that's your minimum complexity, and it is unavoidable as the problem is currently stated.
There are simple cases - when q=2, there are only the two possible checkerboards, so for any given pair of zeroes the answer is 1 if both zeroes fall on squares of the same checkerboard colour, and 0 otherwise.
Point 3 makes the whole program O(n^2(n^2-3)) as a minimum, and also suggests that you will need something reasonably efficient for each pair of zeroes, as simply writing 100 million lines without any computation will take a while. For reference, at a second per line, that is 1x10^8 s ~ 3 years, or 3 months on a 12-core box.
I suspect that there is an elegant answer given a pair of zeroes, but I'm not sure that there is an analytic solution to it. Given that you can do it with 2 or 3 colours depending on the positions of the zeroes, you could split the map into a series of regions, each of which uses only 2 or 3 colours, and then it's just the number of different combinations of 2 or 3 in q (qC2 or qC3) for each region times the number of regions, times the number of ways of splitting the map.
I'm not a mathematician, but it occurs to me that there ought to be an analytical solution to this problem, namely:
First, compute how many different colourings are possible for an NxN board with Q colours (with the constraint that neighbours, defined as cells sharing a common edge, don't get the same colour). This ought to be a pretty simple formula.
Then figure out how many of these solutions have 0 at (i,j); this should be a 1/Q fraction.
Then figure out how many of the remaining solutions have 0 at (k,l), depending on the Manhattan distance |i-k|+|j-l|, and possibly the distance to the board edge and the "parity" of these distances, as in distance divisible by 2, divisible by 3, divisible by Q.
The last part is the hardest, though I think it might still be doable if you are really good at math.
