Issues implementing the "Wave Function Collapse" algorithm in Python

In a nutshell:
My implementation of the Wave Function Collapse algorithm in Python 2.7 is flawed, but I'm unable to identify where the problem is located. I need help finding out what I'm missing or doing wrong.
What is the Wave Function Collapse algorithm?
It is an algorithm written in 2016 by Maxim Gumin that can generate procedural patterns from a sample image. You can see it in action here (2D overlapping model) and here (3D tile model).
Goal of this implementation:
To boil down the algorithm (2D overlapping model) to its essence and avoid the redundancies and clumsiness of the original C# script (surprisingly long and difficult to read). This is an attempt to make a shorter, clearer and more Pythonic version of this algorithm.
Characteristics of this implementation:
I'm using Processing (Python mode), a software for visual design that makes image manipulation easier (no PIL, no Matplotlib, ...). The main drawbacks are that I'm limited to Python 2.7 and can NOT import numpy.
Unlike the original version, this implementation:
is not object oriented (in its current state), making it easier to understand / closer to pseudo-code
is using 1D arrays instead of 2D arrays
is using array slicing for matrix manipulation
The Algorithm (as I understand it)
1/ Read the input bitmap, store every NxN pattern and count its occurrences.
(optional: Augment pattern data with rotations and reflections.)
For example, when N = 3:
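As a rough illustration of step 1, here is a minimal sketch in plain Python 3 (not Processing). It assumes the image is a flat, row-major list of color values and skips the optional rotations and reflections. The function name `extract_patterns` and the striped test image are hypothetical, chosen only for this example.

```python
from collections import Counter

def extract_patterns(pixels, iw, ih, N=3):
    """Count every NxN window of a flat, row-major pixel list, wrapping at the edges."""
    counts = Counter()
    for y in range(ih):
        for x in range(iw):
            # Flattened NxN window whose top-left corner is (x, y)
            window = tuple(pixels[((y + dy) % ih) * iw + (x + dx) % iw]
                           for dy in range(N) for dx in range(N))
            counts[window] += 1
    return counts

# Hypothetical 4x4 input made of vertical stripes of two colors (0 and 1)
pixels = [0, 1, 0, 1] * 4
counts = extract_patterns(pixels, 4, 4)
print(len(counts))  # number of unique patterns
```

Each key of `counts` is one flattened pattern and each value is the number of times it occurs, which is what the weighted choice in step 5 needs.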
2/ Precompute and store every possible adjacency relation between patterns.
In the example below, patterns 207, 242, 182 and 125 can overlap the right side of pattern 246
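One way to picture the adjacency test of step 2 is as an overlap check: two patterns are compatible in a direction if, once the second is shifted by one cell, they agree wherever they overlap. The sketch below (plain Python 3) assumes flattened NxN patterns. The names `compatible` and `build_adjacency` and the two toy patterns are illustrative assumptions, not taken from the original C# code.

```python
directions = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # left, right, up, down

def compatible(p1, p2, dx, dy, N=3):
    """True if flat NxN pattern p2, shifted by (dx, dy), agrees with p1 on the overlap."""
    for y in range(N):
        for x in range(N):
            x2, y2 = x - dx, y - dy  # the same pixel expressed in p2's coordinates
            if 0 <= x2 < N and 0 <= y2 < N and p1[x + y * N] != p2[x2 + y2 * N]:
                return False
    return True

def build_adjacency(patterns, N=3):
    """A[i][d] = set of pattern indices allowed next to pattern i in direction d."""
    return [[set(j for j, q in enumerate(patterns) if compatible(p, q, dx, dy, N))
             for dx, dy in directions] for p in patterns]

# Two toy patterns: the columns of p2 are the columns of p1 shifted by one
p1 = (0, 1, 2) * 3
p2 = (1, 2, 3) * 3
A = build_adjacency([p1, p2])
print(A[0][1])  # patterns allowed to the right of p1
```

Here only `p2` can sit to the right of `p1`, because the last two columns of `p1` match the first two columns of `p2`.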
3/ Create an array with the dimensions of the output (called W for wave). Each element of this array is an array holding the state (True or False) of each pattern.
For example, let's say we count 326 unique patterns in input and we want our output to be of dimensions 20 by 20 (400 cells). Then the "Wave" array will contain 400 (20x20) arrays, each of them containing 326 boolean values.
At start, all booleans are set to True because every pattern is allowed at any position of the Wave.
W = [[True for pattern in xrange(len(patterns))] for cell in xrange(20*20)]
4/ Create another array with the dimensions of the output (called H). Each element of this array is a float holding the "entropy" value of its corresponding cell in output.
Entropy here refers to Shannon Entropy and is computed based on the number of valid patterns at a specific location in the Wave. The more valid patterns a cell has (set to True in the Wave), the higher its entropy.
For example, to compute the entropy of cell 22 we look at its corresponding index in the wave (W[22]) and count the number of booleans set to True. With that count we can now compute the entropy with the Shannon formula. The result of this calculation is then stored in H at the same index, H[22].
At start, all cells have the same entropy value (same float at every position in H) since all patterns are set to True for each cell.
H = [entropyValue for cell in xrange(20*20)]
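For reference, the entropy expression used later in the script (`log(total) - sum(f * log(f)) / total`) can be isolated in a small helper. This is a minimal plain Python 3 sketch. The name `shannon_entropy` is an assumption, and the weights stand for the input frequencies of the patterns still valid in a cell.

```python
import math

def shannon_entropy(weights):
    """Shannon entropy (in nats) of the distribution proportional to `weights`."""
    total = float(sum(weights))
    return math.log(total) - sum(w * math.log(w) for w in weights) / total

print(shannon_entropy([1, 1, 1, 1]))  # four equally likely patterns
print(shannon_entropy([5]))           # a single remaining pattern
```

With four equally frequent patterns the result is log(4), and with a single remaining pattern it is 0, which is why a collapsed cell has zero entropy.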
These 4 steps are introductory steps; they are necessary to initialize the algorithm. Now starts the core of the algorithm:
5/ Observation:
Find the index of the cell with the minimum nonzero entropy (Note that at the very first iteration all entropies are equal so we need to pick the index of a cell randomly.)
Then, look at the still valid patterns at the corresponding index in the Wave and select one of them randomly, weighted by the frequency with which that pattern appears in the input image (weighted choice).
For example if the lowest value in H is at index 22 (H[22]), we look at all the patterns set to True at W[22] and pick one randomly based on the number of times it appears in the input. (Remember that at step 1 we counted the number of occurrences of each pattern.) This ensures that patterns appear in the output with a distribution similar to the one found in the input.
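The weighted choice described above can be sketched as follows in plain Python 3. The helper name `weighted_choice` and its `rng` parameter are assumptions made for this example. `valid` is taken to be a list of the pattern indices still set to True in the selected cell.

```python
import random

def weighted_choice(valid, freqs, rng=random):
    """Pick one index from `valid` with probability proportional to freqs[index]."""
    r = rng.random() * sum(freqs[i] for i in valid)
    for i in valid:
        r -= freqs[i]
        if r < 0:
            return i
    return valid[-1]  # guard against floating-point round-off

# Pattern 1 appears 9 times in the (hypothetical) input, pattern 0 only once
random.seed(0)
picks = [weighted_choice([0, 1], [1, 9]) for _ in range(1000)]
print(picks.count(0), picks.count(1))
```

Over many draws the more frequent pattern is selected roughly nine times as often, which is the behavior the observation step relies on.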
6/ Collapse:
We now assign the index of the selected pattern to the cell with the minimum entropy. This means that every pattern at the corresponding location in the Wave is set to False except for the one that has been chosen.
For example if pattern 246 in W[22] was set to True and has been selected, then all other patterns are set to False. Cell 22 is assigned pattern 246.
In output cell 22 will be filled with the first color (top left corner) of pattern 246. (blue in this example)
7/ Propagation:
Because of adjacency constraints, that pattern selection has consequences for the neighboring cells in the Wave. The arrays of booleans corresponding to the cells to the left and right of, and above and below, the recently collapsed cell need to be updated accordingly.
For example if cell 22 has been collapsed and assigned pattern 246, then W[21] (left), W[23] (right), W[2] (up) and W[42] (down) have to be modified so that only the patterns that are adjacent to pattern 246 remain set to True.
For example, looking back at the picture of step 2, we can see that only patterns 207, 242, 182 and 125 can be placed on the right of pattern 246. That means that W[23] (right of cell 22) needs to keep patterns 207, 242, 182 and 125 as True and set all other patterns in the array as False. If these patterns are not valid anymore (already set to False because of a previous constraint) then the algorithm is facing a contradiction.
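The neighbor indices quoted in this example (21, 23, 2 and 42 for cell 22 in a 20 by 20 grid) follow from simple index arithmetic on the 1D array. A minimal plain Python 3 sketch, with the hypothetical helper name `neighbours`:

```python
def neighbours(idx, d):
    """Indices of the left, right, up and down neighbours of `idx` in a d x d grid (wrapping)."""
    x, y = idx % d, idx // d
    return [((x + dx) % d) + ((y + dy) % d) * d
            for dx, dy in ((-1, 0), (1, 0), (0, -1), (0, 1))]

print(neighbours(22, 20))  # prints [21, 23, 2, 42]
print(neighbours(0, 20))   # a corner cell wraps around: [19, 1, 380, 20]
```

The modulo operations make the grid wrap around, so cells on the border still have four neighbors.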
8/ Updating entropies
Because a cell has been collapsed (one pattern selected, set to True) and its surrounding cells updated accordingly (non-adjacent patterns set to False), the entropies of all these cells have changed and need to be computed again. (Remember that the entropy of a cell is correlated to the number of valid patterns it holds in the Wave.)
In the example, the entropy of cell 22 is now 0 (H[22] = 0, because only pattern 246 is set to True at W[22]) and the entropies of its neighboring cells have decreased (patterns that were not adjacent to pattern 246 have been set to False).
By now the algorithm has reached the end of the first iteration and will loop over steps 5 (find the cell with minimum nonzero entropy) to 8 (update entropies) until all cells are collapsed.
My script
You'll need Processing with Python mode installed to run this script.
It contains around 80 lines of code (short compared to the ~1000 lines of the original script) that are fully annotated so it can be rapidly understood. You'll also need to download the input image and change the path on line 16 accordingly.
from collections import Counter
from itertools import chain, izip
import math
d = 20 # dimensions of output (array of dxd cells)
N = 3 # dimensions of a pattern (NxN matrix)
Output = [120 for i in xrange(d*d)] # array holding the color value for each cell in the output (at start each cell is grey = 120)
def setup():
    size(800, 800, P2D)
    textSize(11)
    global W, H, A, freqs, patterns, directions, xs, ys, npat
    img = loadImage('Flowers.png') # path to the input image
    iw, ih = img.width, img.height # dimensions of input image
    xs, ys = width//d, height//d # dimensions of cells (squares) in output
    kernel = [[i + n*iw for i in xrange(N)] for n in xrange(N)] # NxN matrix to read every pattern contained in input image
    directions = [(-1, 0), (1, 0), (0, -1), (0, 1)] # (x, y) tuples to access the 4 neighboring cells of a collapsed cell
    all = [] # array list to store all the patterns found in input
    # Stores the different patterns found in input
    for y in xrange(ih):
        for x in xrange(iw):
            ''' The one-liner below (cmat) creates a NxN matrix with (x, y) being its top left corner.
            This matrix will wrap around the edges of the input image.
            The whole snippet reads every NxN part of the input image and stores the associated colors.
            Each NxN part is called a 'pattern' (of colors). Each pattern can be rotated or flipped (not mandatory). '''
            cmat = [[img.pixels[((x+n)%iw)+(((a[0]+iw*y)/iw)%ih)*iw] for n in a] for a in kernel]
            # Storing rotated patterns (90°, 180°, 270°, 360°)
            for r in xrange(4):
                cmat = zip(*cmat[::-1]) # +90° rotation
                all.append(cmat)
            # Storing reflected patterns (vertical/horizontal flip)
            all.append(cmat[::-1])
            all.append([a[::-1] for a in cmat])
    # Flatten pattern matrices + count occurrences
    ''' Once every pattern has been stored,
    - we flatten them (convert to 1D) for convenience
    - count the number of occurrences for each one of them (one pattern can be found multiple times in input)
    - select unique patterns only
    - store them from less common to most common (needed for weighted choice)'''
    all = [tuple(chain.from_iterable(p)) for p in all] # flatten pattern matrices (NxN --> [])
    c = Counter(all)
    freqs = sorted(c.values()) # number of occurrences for each unique pattern, in sorted order
    npat = len(freqs) # number of unique patterns
    total = sum(freqs) # sum of frequencies of unique patterns
    patterns = [p[0] for p in c.most_common()[:-npat-1:-1]] # list of unique patterns sorted from less common to most common
    # Computes entropy
    ''' The entropy of a cell is correlated to the number of possible patterns that cell holds.
    The more a cell has valid patterns (set to 'True'), the higher its entropy is.
    At start, every pattern is set to 'True' for each cell. So each cell holds the same high entropy value'''
    ent = math.log(total) - sum(map(lambda x: x * math.log(x), freqs)) / total
    # Initializes the 'wave' (W), entropy (H) and adjacencies (A) array lists
    W = [[True for _ in xrange(npat)] for i in xrange(d*d)] # every pattern is set to 'True' at start, for each cell
    H = [ent for i in xrange(d*d)] # same entropy for each cell at start (every pattern is valid)
    A = [[set() for dir in xrange(len(directions))] for i in xrange(npat)] # see below for explanation
    # Compute patterns compatibilities (check if some patterns are adjacent, if so -> store them based on their location)
    ''' EXAMPLE:
    If pattern index 42 can be placed to the right of pattern index 120,
    we will store this adjacency rule as follows:
    A[120][1].add(42)
    Here '1' stands for 'right' or 'East'/'E'
    0 = left or West/W
    1 = right or East/E
    2 = up or North/N
    3 = down or South/S '''
    # Comparing patterns to each other
    for i1 in xrange(npat):
        for i2 in xrange(npat):
            for dir in (0, 2):
                if compatible(patterns[i1], patterns[i2], dir):
                    A[i1][dir].add(i2)
                    A[i2][dir+1].add(i1)
def compatible(p1, p2, dir):
    '''NOTE:
    what is referred to as 'columns' and 'rows' here below is not really columns and rows
    since we are dealing with 1D patterns. Remember here N = 3'''
    # If the first two columns of pattern 1 == the last two columns of pattern 2
    # --> pattern 2 can be placed to the left (0) of pattern 1
    if dir == 0:
        return [n for i, n in enumerate(p1) if i%N!=2] == [n for i, n in enumerate(p2) if i%N!=0]
    # If the first two rows of pattern 1 == the last two rows of pattern 2
    # --> pattern 2 can be placed on top (2) of pattern 1
    if dir == 2:
        return p1[:6] == p2[-6:]
def draw(): # Equivalent of a 'while' loop in Processing (all the code below will be looped over and over until all cells are collapsed)
    global H, W, grid
    ### OBSERVATION
    # Find cell with minimum non-zero entropy (not collapsed yet)
    '''Randomly select 1 cell at the first iteration (when all entropies are equal),
    otherwise select cell with minimum non-zero entropy'''
    emin = int(random(d*d)) if frameCount <= 1 else H.index(min(H))
    # Stopping mechanism
    ''' When 'H' array is full of 'collapsed' cells --> stop iteration '''
    if H[emin] == 'CONT' or H[emin] == 'collapsed':
        print 'stopped'
        noLoop()
        return
    ### COLLAPSE
    # Weighted choice of a pattern
    ''' Among the patterns available in the selected cell (the one with min entropy),
    select one pattern randomly, weighted by the frequency that pattern appears in the input image.
    With Python 2.7 no possibility to use random.choice(x, weight) so we have to hard code the weighted choice '''
    lfreqs = [b * freqs[i] for i, b in enumerate(W[emin])] # frequencies of the patterns available in the selected cell
    weights = [float(f) / sum(lfreqs) for f in lfreqs] # normalizing these frequencies
    cumsum = [sum(weights[:i]) for i in xrange(1, len(weights)+1)] # cumulative sums of normalized frequencies
    r = random(1)
    idP = sum([cs < r for cs in cumsum]) # index of selected pattern
    # Set all patterns to False except for the one that has been chosen
    W[emin] = [0 if i != idP else 1 for i, b in enumerate(W[emin])]
    # Marking selected cell as 'collapsed' in H (array of entropies)
    H[emin] = 'collapsed'
    # Storing first color (top left corner) of the selected pattern at the location of the collapsed cell
    Output[emin] = patterns[idP][0]
    ### PROPAGATION
    # For each neighbor (left, right, up, down) of the recently collapsed cell
    for dir, t in enumerate(directions):
        x = (emin%d + t[0])%d
        y = (emin/d + t[1])%d
        idN = x + y * d # index of neighbor
        # If that neighbor hasn't been collapsed yet
        if H[idN] != 'collapsed':
            # Check indices of all available patterns in that neighboring cell
            available = [i for i, b in enumerate(W[idN]) if b]
            # Among these indices, select indices of patterns that can be adjacent to the collapsed cell at this location
            intersection = A[idP][dir] & set(available)
            # If the neighboring cell contains indices of patterns that can be adjacent to the collapsed cell
            if intersection:
                # Remove indices of all other patterns that cannot be adjacent to the collapsed cell
                W[idN] = [True if i in list(intersection) else False for i in xrange(npat)]
                ### Update entropy of that neighboring cell accordingly (less patterns = lower entropy)
                # If only 1 pattern available left, no need to compute entropy because entropy is necessarily 0
                if len(intersection) == 1:
                    H[idN] = '0' # Putting a str at this location in 'H' (array of entropies) so that it doesn't return 0 (float) when looking for minimum entropy (min(H)) at next iteration
                # If more than 1 pattern available left --> compute/update entropy + add noise (to prevent cells to share the same minimum entropy value)
                else:
                    lfreqs = [b * f for b, f in izip(W[idN], freqs) if b]
                    ent = math.log(sum(lfreqs)) - sum(map(lambda x: x * math.log(x), lfreqs)) / sum(lfreqs)
                    H[idN] = ent + random(.001)
            # If no index of adjacent pattern in the list of pattern indices of the neighboring cell
            # --> mark cell as a 'contradiction'
            else:
                H[idN] = 'CONT'
    # Draw output
    ''' dxd grid of cells (squares) filled with their corresponding color.
    That color is the first (top-left) color of the pattern assigned to that cell '''
    for i, c in enumerate(Output):
        x, y = i%d, i/d
        fill(c)
        rect(x * xs, y * ys, xs, ys)
        # Displaying corresponding entropy value
        fill(0)
        text(H[i], x * xs + xs/2 - 12, y * ys + ys/2)
Problem
Despite all my efforts to carefully put into code all the steps described above, this implementation returns very odd and disappointing results:
Example of a 20x20 output
Both the pattern distribution and the adjacency constraints seem to be respected (same amount of blue, green, yellow and brown colors as in input and same kind of patterns: horizontal ground, green stems).
However these patterns:
are often disconnected
are often incomplete (lack of "heads" composed of 4-yellow petals)
run into way too many contradictory states (grey cells marked as "CONT")
On that last point, I should clarify that contradictory states are normal but should happen very rarely (as stated in the middle of page 6 of this paper and in this article)
Hours of debugging convinced me that the introductory steps (1 to 5) are correct (counting and storing patterns, adjacency and entropy computations, array initialization). This has led me to think that something must be off with the core part of the algorithm (steps 6 to 8). Either I am implementing one of these steps incorrectly or I am missing a key element of the logic.
Any help regarding that matter would thus be immensely appreciated!
Also, any answer that is based on the script provided (using Processing or not) is welcome.
Useful additional resources:
This detailed article from Stephen Sherratt and this explanatory paper from Karth & Smith.
Also, for comparison I would suggest checking this other Python implementation (contains a backtracking mechanism that isn't mandatory).
Note: I did my best to make this question as clear as possible (comprehensive explanation with GIFs and illustrations, fully annotated code with useful links and resources) but if for some reason you decide to vote it down, please leave a brief comment to explain why you're doing so.

The hypothesis suggested by @mbrig and @Leon that the propagation step iterates over a whole stack of cells (instead of being limited to a set of 4 direct neighbors) was correct. The following is an attempt to provide further details while answering my own question.
The problem occurred at step 7, while propagating. The original algorithm does update the 4 direct neighbors of a specific cell BUT:
the index of that specific cell is in turn replaced by the indices of the previously updated neighbors.
this cascading process is triggered every time a cell is collapsed
and lasts as long as the adjacent patterns of a specific cell are available in one of its neighboring cells
In other words, and as mentioned in the comments, this is a recursive type of propagation that updates not only the neighbors of the collapsed cell, but also the neighbors of the neighbors... and so on as long as adjacencies are possible.
Detailed Algorithm
Once a cell is collapsed, its index is put in a stack. That stack is later used to temporarily store the indices of neighboring cells.
stack = set([emin]) #emin = index of cell with minimum entropy that has been collapsed
The propagation will last as long as that stack is filled with indices:
while stack:
The first thing we do is pop() an index from the stack (the only one for now) and get the indices of its 4 neighboring cells (W, E, N, S). We have to keep them within bounds and make sure they wrap around.
while stack:
    idC = stack.pop() # index of current cell
    for dir, t in enumerate(mat):
        x = (idC%w + t[0])%w
        y = (idC/w + t[1])%h
        idN = x + y * w # index of neighboring cell
Before going any further, we make sure the neighboring cell is not collapsed yet (we don't want to update a cell that has only 1 pattern available):
if H[idN] != 'c':
Then we check all the patterns that could be placed at that location. For example, if the neighboring cell is on the left of the current cell (west side), we look at all the patterns that can be placed on the left of each pattern contained in the current cell.
possible = set([n for idP in W[idC] for n in A[idP][dir]])
We also look at the patterns that are available in the neighboring cell:
available = W[idN]
Now we make sure that the neighboring cell really has to be updated. If all its available patterns are already in the list of all the possible patterns --> there's no need to update it (the algorithm skips this neighbor and goes on to the next):
if not available.issubset(possible):
However, if it is not a subset of the possible list --> we look at the intersection of the two sets (all the patterns that can be placed at that location and that, "luckily", are available at that same location):
intersection = possible & available
If they don't intersect (patterns that could have been placed there but are not available), it means we ran into a "contradiction". We have to stop the whole WFC algorithm.
if not intersection:
    print 'contradiction'
    noLoop()
If, on the contrary, they do intersect --> we update the neighboring cell with that refined list of pattern indices:
W[idN] = intersection
Because that neighboring cell has been updated, its entropy must be updated as well:
lfreqs = [freqs[i] for i in W[idN]]
H[idN] = (log(sum(lfreqs)) - sum(map(lambda x: x * log(x), lfreqs)) / sum(lfreqs)) - random(.001)
Finally, and most importantly, we add the index of that neighboring cell to the stack so that it in turn becomes the next current cell (the one whose neighbors will be updated during the next pass of the while loop):
stack.add(idN)
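To see the cascade in isolation, here is a self-contained sketch of the same idea in plain Python 3, without Processing or entropies. It is an illustration under assumptions, not the script above: the function name `propagate`, its `False` return value on contradiction and the toy "alternating" rule set are all invented for this example. Unlike the snippets above, it does not skip collapsed cells, so a collapsed neighbor that conflicts with the current cell is also reported as a contradiction.

```python
def propagate(W, A, start, w, h):
    """Cascade constraints from cell `start`; return False on contradiction.

    W: list of sets of pattern indices, one per cell (row-major, w x h).
    A: A[p][d] is the set of patterns allowed next to pattern p in direction d.
    """
    directions = ((-1, 0), (1, 0), (0, -1), (0, 1))  # left, right, up, down
    stack = [start]
    while stack:
        cur = stack.pop()
        for d, (dx, dy) in enumerate(directions):
            nb = (cur % w + dx) % w + ((cur // w + dy) % h) * w
            possible = set(n for p in W[cur] for n in A[p][d])
            if W[nb] <= possible:
                continue              # nothing to remove from this neighbor
            W[nb] = W[nb] & possible
            if not W[nb]:
                return False          # contradiction: no pattern left
            stack.append(nb)          # its own neighbors must be re-checked
    return True

# Toy rule set: horizontally, patterns 0 and 1 must alternate.
# The grid is one row high, so "up" and "down" wrap onto the cell itself.
A = [[{1}, {1}, {0}, {0}],
     [{0}, {0}, {1}, {1}]]
W = [{0}, {0, 1}, {0, 1}, {0, 1}]     # cell 0 has just been collapsed to pattern 0
ok = propagate(W, A, 0, 4, 1)
print(ok, W)
```

Collapsing a single cell is enough to fix the whole row here, which is exactly what a propagation limited to the 4 direct neighbors cannot do. On a row of 3 cells the same rules cannot be satisfied, and the function reports a contradiction.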
Full updated script
from collections import Counter
from itertools import chain
from random import choice
w, h = 40, 25
N = 3
def setup():
    size(w*20, h*20, P2D)
    background('#FFFFFF')
    frameRate(1000)
    noStroke()
    global W, A, H, patterns, freqs, npat, mat, xs, ys
    img = loadImage('Flowers.png')
    iw, ih = img.width, img.height
    xs, ys = width//w, height//h
    kernel = [[i + n*iw for i in xrange(N)] for n in xrange(N)]
    mat = ((-1, 0), (1, 0), (0, -1), (0, 1))
    all = []
    for y in xrange(ih):
        for x in xrange(iw):
            cmat = [[img.pixels[((x+n)%iw)+(((a[0]+iw*y)/iw)%ih)*iw] for n in a] for a in kernel]
            for r in xrange(4):
                cmat = zip(*cmat[::-1])
                all.append(cmat)
            all.append(cmat[::-1])
            all.append([a[::-1] for a in cmat])
    all = [tuple(chain.from_iterable(p)) for p in all]
    c = Counter(all)
    patterns = c.keys()
    freqs = c.values()
    npat = len(freqs)
    W = [set(range(npat)) for i in xrange(w*h)]
    A = [[set() for dir in xrange(len(mat))] for i in xrange(npat)]
    H = [100 for i in xrange(w*h)]
    for i1 in xrange(npat):
        for i2 in xrange(npat):
            if [n for i, n in enumerate(patterns[i1]) if i%N!=(N-1)] == [n for i, n in enumerate(patterns[i2]) if i%N!=0]:
                A[i1][0].add(i2)
                A[i2][1].add(i1)
            if patterns[i1][:(N*N)-N] == patterns[i2][N:]:
                A[i1][2].add(i2)
                A[i2][3].add(i1)
def draw():
    global H, W
    emin = int(random(w*h)) if frameCount <= 1 else H.index(min(H))
    if H[emin] == 'c':
        print 'finished'
        noLoop()
        return # nothing left to collapse
    id = choice([idP for idP in W[emin] for i in xrange(freqs[idP])])
    W[emin] = [id]
    H[emin] = 'c'
    stack = set([emin])
    while stack:
        idC = stack.pop()
        for dir, t in enumerate(mat):
            x = (idC%w + t[0])%w
            y = (idC/w + t[1])%h
            idN = x + y * w
            if H[idN] != 'c':
                possible = set([n for idP in W[idC] for n in A[idP][dir]])
                if not W[idN].issubset(possible):
                    intersection = possible & W[idN]
                    if not intersection:
                        print 'contradiction'
                        noLoop()
                        return
                    W[idN] = intersection
                    lfreqs = [freqs[i] for i in W[idN]]
                    H[idN] = (log(sum(lfreqs)) - sum(map(lambda x: x * log(x), lfreqs)) / sum(lfreqs)) - random(.001)
                    stack.add(idN)
    fill(patterns[id][0])
    rect((emin%w) * xs, (emin/w) * ys, xs, ys)
Overall improvements
In addition to these fixes I also did some minor code optimization to speed up both the observation and propagation steps, and to shorten the weighted choice computation.
The "Wave" is now composed of Python sets of indices whose size decreases as cells are "collapsed" (replacing large fixed-size lists of booleans).
Entropies are stored in a defaultdict whose keys are progressively deleted.
The starting entropy value is replaced by a random integer (first entropy calculation not needed since equiprobable high level of uncertainty at start)
Cells are displayed once (avoiding storing them in an array and redrawing at each frame)
The weighted choice is now a one-liner (avoiding several dispensable lines of list comprehension)

While looking at the live demo linked in one of your examples, and based on a quick review of the original algorithm code, I believe your error lies in the "Propagation" step.
The propagation is not just updating the 4 cells neighbouring the collapsed cell. You must also update all of those cells' neighbours, and then the neighbours of those cells, etc., recursively. Well, to be specific, as soon as you update a single neighbouring cell, you then update its neighbour (before getting to the other neighbours of the first cell), i.e. depth-first, not breadth-first updates. At least, that's what I gather from the live demo.
The actual C# code implementation of the original algorithm is quite complicated and I don't fully understand it, but the key points appear to be creation of the "propagator" object here, as well as the Propagate function itself, here.

Related

Hysteresis thresholding in Python

I am currently working on a Canny edge detector implementation in Python.
The problem I have is with efficient array manipulation. To make it easier for Python people, I wrote the same thing I need in Matlab just for the sake of the algorithm. At the end there is commented Matlab code to explain it. First I would like to show the Python code that I have written so far and explain what I need. My code is working but the problem is speed:
I have a picture as an np.array. I need to implement this:
There are two thresholds, high and low. I need to find all elements that are > high, plus all elements that are connected to high ones (vertically, horizontally and diagonally) and are between those two thresholds. Elements below low are discarded. Elements that are between the two thresholds but not connected in any way to elements above high should also be discarded. By connected I mean connected even through other elements that form edges (chains).
In Python, as far as I can see, I should take the opposite route from the one in Matlab. That means first finding all elements above low. The reason for this is that if I first find elements > high, I just don't see a way to keep connections. (For this to work I would need to be able to call cv2.connectedComponents on specific rows and columns of an np.array, but then I don't get a matrix of the same size.) So, with that being said: find all connected elements among those above low. In those connected groups, check if there is an element above high. If not, set the whole group to zero. My problem here is speed. I would like to know how my for loop could be replaced with one or two lines of code, like I wrote it in Matlab.
import cv2
import numpy as np

def hyst_thresh(edge_img: np.ndarray, high_thresh: float, low_thresh: float):
    matrix = np.zeros(edge_img.shape) # create an empty matrix
    r, c = np.where(edge_img > low_thresh) # find positions of all elements that are above low threshold
    matrix[r, c] = 1
    r, c = np.where(edge_img > high_thresh) # find positions of all elements that are above high threshold
    label, neighbors = cv2.connectedComponents(matrix.astype(np.uint8)) ## this gives me label which is number of groups(labels of groups) and a matrix neighbors
    #Example:
    #
    #label = 4(group 0, group 1 group 2 group 3)
    #
    #neighbors
    #
    #0 0 0 1 0
    #0 1 1 1 0
    #0 0 0 0 0
    #2 2 2 0 3
    for i in range(1, label):
        y, z = np.where(neighbors == i)
        k1 = np.isin(r, y)
        k2 = np.isin(c, z)
        if not any(k1*k2):
            matrix[y, z] = 0
    return matrix
Matlab equivalent, which does the same thing but in the opposite direction:
function edges = hyst_thresh(edges_in, low, high)
    edges = zeros(size(edges_in)); % creating zero matrix
    % This returns two vectors r and c which contain row and column of elements that satisfy
    % the condition
    [r, c] = find(edges_in > high);
    % Next line finds all matrix elements that are connected to elements given by r and c. Return is a
    % binary matrix
    bw = bwselect(edges_in, c, r, 8);
    % Next line sets all elements of my empty matrix on positions where bw is non zero to 1
    edges(find(bw)) = 1;
    % At this point I have extracted values that are above high threshold and all other values that are
    % connected to them. Those positions are now ones in my empty zero matrix from the beginning.
    % This last line finally should check if between those values are some values that are below the lower
    % threshold. In order to do that, values on all positions from my input matrix that are below are set to
    % zero. This way if there are some elements that were connected but have a low value will be discarded
    % in my binary matrix
    edges(find(edges_in < low)) = 0;
end
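For readers who want to check the thresholding rule itself, independently of NumPy or OpenCV, the description above can be sketched as a flood fill that starts from the pixels above the high threshold and spreads through 8-connected pixels above the low threshold. This is a library-free plain Python illustration with a hypothetical function name (`hysteresis`) and a made-up test image. It is not a vectorized replacement for the loop in the question.

```python
from collections import deque

def hysteresis(img, low, high):
    """img: list of rows of numbers. Returns a same-shaped 0/1 mask."""
    rows, cols = len(img), len(img[0])
    out = [[0] * cols for _ in range(rows)]
    # Seed the search with every pixel strictly above the high threshold
    queue = deque((r, c) for r in range(rows) for c in range(cols)
                  if img[r][c] > high)
    for r, c in queue:
        out[r][c] = 1
    # Grow each seed through 8-connected pixels that are above the low threshold
    while queue:
        r, c = queue.popleft()
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                rr, cc = r + dr, c + dc
                if (0 <= rr < rows and 0 <= cc < cols
                        and not out[rr][cc] and img[rr][cc] > low):
                    out[rr][cc] = 1
                    queue.append((rr, cc))
    return out

# Same layout as the label example above: only group 1 contains a strong pixel (9)
img = [[0, 0, 0, 5, 0],
       [0, 5, 5, 9, 0],
       [0, 0, 0, 0, 0],
       [5, 5, 5, 0, 5]]
mask = hysteresis(img, low=3, high=7)
for row in mask:
    print(row)
```

Each pixel is visited at most once, so the cost is linear in the number of pixels. The weak groups in the bottom row are dropped because no chain links them to a strong pixel.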

Fastest way to sort string to match second string - only adjacent swaps allowed

I want to get the minimum number of letter-swaps needed to convert one string to match a second string. Only adjacent swaps are allowed.
Inputs are: length of strings, string_1, string_2
Some examples:
Length | String 1 | String 2 | Output
-------+----------+----------+-------
3 | ABC | BCA | 2
7 | AABCDDD | DDDBCAA | 16
7 | ZZZAAAA | ZAAZAAZ | 6
Here's my code:
def letters(number, word_1, word_2):
    result = 0
    while word_1 != word_2:
        index_of_letter = word_1.find(word_2[0])
        result += index_of_letter
        word_1 = word_1.replace(word_2[0], '', 1)
        word_2 = word_2[1:]
    return result
It gives the correct results, but the calculation should stay under 20 seconds.
Here are two sets of input data (1 000 000 characters long strings): https://ufile.io/8hp46 and https://ufile.io/athxu.
On my setup the first one is executed in around 40 seconds and the second in 4 minutes.
How can I calculate the result in less than 20 seconds?

@KennyOstrom's answer is 90% there. The inversion count is indeed the right angle to look at this problem.
The only bit that is missing is that we need a "relative" inversion count, meaning the number of inversions not to get to normal sort order but to the other word's order. We therefore need to compute the permutation that stably maps word1 to word2 (or the other way round), and then compute the inversion count of that. Stability is important here, because obviously there will be lots of nonunique letters.
Here is a numpy implementation that takes only a second or two for the two large examples you posted. I did not test it extensively, but it does agree with @trincot's solution on all test cases. For the two large pairs it finds 1819136406 and 480769230766.
import numpy as np
_, word1, word2 = open("lit10b.in").read().split()
word1 = np.frombuffer(word1.encode('utf8')
+ (((1<<len(word1).bit_length()) - len(word1))*b'Z'),
dtype=np.uint8)
word2 = np.frombuffer(word2.encode('utf8')
+ (((1<<len(word2).bit_length()) - len(word2))*b'Z'),
dtype=np.uint8)
n = len(word1)
o1 = np.argsort(word1, kind='mergesort')
o2 = np.argsort(word2, kind='mergesort')
o1inv = np.empty_like(o1)
o1inv[o1] = np.arange(n)
order = o2[o1inv]
sum_ = 0
for i in range(1, len(word1).bit_length()):
    order = np.reshape(order, (-1, 1<<i))
    oo = np.argsort(order, axis = -1, kind='mergesort')
    ioo = np.empty_like(oo)
    ioo[np.arange(order.shape[0])[:, None], oo] = np.arange(1<<i)
    order[...] = order[np.arange(order.shape[0])[:, None], oo]
    hw = 1<<(i-1)
    sum_ += ioo[:, :hw].sum() - order.shape[0] * (hw-1)*hw // 2
print(sum_)
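The "stable mapping, then inversion count" idea can also be written without numpy. The following plain Python 3 sketch is only an illustration of that idea, not the implementation above. The names `count_inversions` and `min_adjacent_swaps` are assumptions, and the two words are assumed to be anagrams of each other.

```python
from collections import defaultdict, deque

def count_inversions(seq):
    """Return (sorted copy, number of pairs i < j with seq[i] > seq[j])."""
    if len(seq) <= 1:
        return list(seq), 0
    mid = len(seq) // 2
    left, a = count_inversions(seq[:mid])
    right, b = count_inversions(seq[mid:])
    merged, inv, i, j = [], a + b, 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
            inv += len(left) - i  # right[j] jumps over the rest of left
    merged += left[i:] + right[j:]
    return merged, inv

def min_adjacent_swaps(word_1, word_2):
    # Stable mapping: the k-th 'A' of word_1 goes to the k-th 'A' of word_2
    targets = defaultdict(deque)
    for pos, ch in enumerate(word_2):
        targets[ch].append(pos)
    perm = [targets[ch].popleft() for ch in word_1]
    return count_inversions(perm)[1]

print(min_adjacent_swaps("ABC", "BCA"))          # prints 2
print(min_adjacent_swaps("AABCDDD", "DDDBCAA"))  # prints 16
print(min_adjacent_swaps("ZZZAAAA", "ZAAZAAZ"))  # prints 6
```

For "ZZZAAAA" and "ZAAZAAZ" the stable mapping gives the permutation [0, 3, 6, 1, 2, 4, 5], which has 6 inversions, matching the table in the question. The merge sort keeps the whole computation in O(n log n).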

Your algorithm runs in O(n²) time:
The find() call will take O(n) time
The replace() call will create a complete new string which takes O(n) time
The outer loop executes O(n) times
As others have stated, this can be solved by counting inversions using merge sort, but in this answer I try to stay close to your algorithm, keeping the outer loop and result += index_of_letter, but changing the way index_of_letter is calculated.
The improvement can be done as follows:
preprocess the word_1 string and note the first position of each distinct letter in word_1 in a dict keyed by these letters. Link each letter with its next occurrence. I think it is most efficient to create one list for this, having the size of word_1, where at each index you store the index of the next occurrence of the same letter. This way you have a linked list for each distinct letter. This preprocessing can be done in O(n) time, and with it you can replace the find call with a O(1) lookup. Every time you do this, you remove the matched letter from the linked list, i.e. the index in the dict moves to the index of the next occurrence.
The previous change will give the absolute index, not taking into account the removals of letters that you have in your algorithm, so this will give wrong results. To solve that, you can build a binary tree (also preprocessing), where each node represents an index in word_1, and which gives the actual number of non-deleted letters preceding a given index (including itself as well if not deleted yet). The nodes in the binary tree never get deleted (that might be an idea for a variant solution), but the counts get adjusted to reflect a deletion of a character. At most O(logn) nodes need to get a decremented value upon such a deletion. But apart from that no string would be rebuilt like with replace. This binary tree could be represented as a list, corresponding to nodes in in-order sequence. The values in the list would be the numbers of non-deleted letters preceding that node (including itself).
The initial binary tree could be depicted as follows:
The numbers in the nodes reflect the number of nodes at their left side, including themselves. They are stored in the numLeft list. Another list parent precalculates at which indexes the parents are located.
The actual code could look like this:
def letters(word_1, word_2):
    size = len(word_1) # No need to pass size as argument
    # Create a binary tree for word_1, organised as a list
    # in in-order sequence, and with the values equal to the number of
    # non-matched letters in the range up to and including the current index:
    treesize = (1<<size.bit_length()) - 1
    numLeft = [(i >> 1 ^ ((i + 1) >> 1)) + 1 for i in range(0, treesize)]
    # Keep track of parents in this tree (could probably be simpler, I welcome comments).
    parent = [(i & ~((i^(i+1)) + 1)) | (((i ^ (i+1))+1) >> 1) for i in range(0, treesize)]
    # Create a linked list for each distinct character
    next = [-1] * size
    head = {}
    for i in range(len(word_1)-1, -1, -1): # go backwards
        c = word_1[i]
        # Add index at front of the linked list for this character
        if c in head:
            next[i] = head[c]
        head[c] = i
    # Main loop counting number of swaps needed for each letter
    result = 0
    for i, c in enumerate(word_2):
        # Extract next occurrence of this letter from linked list
        j = head[c]
        head[c] = next[j]
        # Get number of preceding characters with a binary tree lookup
        p = j
        index_of_letter = 0
        while p < treesize:
            if p >= j: # On or at right?
                numLeft[p] -= 1 # Register that a letter has been removed at left side
            if p <= j: # On or at left?
                index_of_letter += numLeft[p] # Add the number of left-side letters
            p = parent[p] # Walk up the tree
        result += index_of_letter
    return result
This runs in O(nlogn) where the logn factor is provided by the upwards walk in the binary tree.
I tested on thousands of random inputs, and the above code produces the same results as your code in all cases. But... it runs a lot faster on the larger inputs.
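For cross-checking, a quadratic baseline in the spirit of the approach described above (find the letter, add its index, remove it from the string) could be sketched as follows. The name letters_naive is hypothetical and this is a reconstruction from the description, not the original code from the question.

```python
def letters_naive(word_1, word_2):
    """Quadratic baseline: locate each letter of word_2 in what is left of word_1."""
    result = 0
    for c in word_2:
        index_of_letter = word_1.find(c)   # O(n) scan
        result += index_of_letter
        # Remove the matched letter; rebuilding the string is O(n) as well
        word_1 = word_1[:index_of_letter] + word_1[index_of_letter + 1:]
    return result

print(letters_naive('ABC', 'BCA'))          # 2
print(letters_naive('AABCDDD', 'DDDBCAA'))  # 16
print(letters_naive('ZZZAAAA', 'ZAAZAAZ'))  # 6
```

Comparing its output with letters(word_1, word_2) on random anagram pairs is one way to repeat the test mentioned above.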

I am going by the assumption that you just want to find the number of swaps, quickly, without needing to know what exactly to swap.
Google how to count inversions. It is often taught with merge sort. Several of the results are on Stack Overflow, like Merge sort to count split inversions in Python
The number of inversions is the number of adjacent swaps needed to get to a sorted string.
Count the inversions in string 1.
Count the inversions in string 2.
Error edited out here, see correction in correct answer. I would normally just delete a wrong answer but this answer is referenced in correct answer.
It makes sense, and it happens to work for all three of your small test cases, so I'm going to just assume this is the answer you want.
Using some code that I happen to have lying around from retaking some algorithms classes on free online classes (for fun):
print (week1.count_inversions('ABC'), week1.count_inversions('BCA'))
print (week1.count_inversions('AABCDDD'), week1.count_inversions('DDDBCAA'))
print (week1.count_inversions('ZZZAAAA'), week1.count_inversions('ZAAZAAZ'))
0 2
4 20
21 15
That lines up with the values you gave above: 2, 16, and 6.
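The week1 module used above is not shown. A minimal merge-sort sketch of inversion counting, with a hypothetical count_inversions of its own, might look like the following. It counts only strict inversions (pairs i < j with s[i] > s[j]), so for strings with repeated letters its numbers need not match the ones printed above.

```python
def count_inversions(seq):
    """Return the number of pairs i < j with seq[i] > seq[j] (merge-sort based)."""
    def sort_and_count(items):
        if len(items) <= 1:
            return items, 0
        mid = len(items) // 2
        left, inv_left = sort_and_count(items[:mid])
        right, inv_right = sort_and_count(items[mid:])
        merged, split = [], 0
        i = j = 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i])
                i += 1
            else:
                # right[j] jumps ahead of every element still waiting in left
                merged.append(right[j])
                j += 1
                split += len(left) - i
        merged.extend(left[i:])
        merged.extend(right[j:])
        return merged, inv_left + inv_right + split
    return sort_and_count(list(seq))[1]

print(count_inversions('ABC'), count_inversions('BCA'))  # 0 2
```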

How to create random-dot stereogram (RDS)?

I am trying to understand and code a Python script that creates a random-dot stereogram (RDS) from a depth map and a randomly generated dot pattern. From what I've understood, to create the illusion of depth, pixels are shifted so that when we make them merge by changing focus, the difference in shift creates the illusion.
I put this into practice with this depth map:
Here is the result:
But I don't understand why I can see 2 objects in the result: 1 star "close" to me and another star "far" from me. And there are different possible results depending on how I focus my eyes.
I have read many things on the subject but I don't get it. Maybe the problem is my poor English or my understanding of what I've read, but I would appreciate some detailed explanations, since there are not many technical explanations on the web about how to code this from scratch.
Note: I have tried different sizes for the shift and the pattern and it doesn't seem to change anything
Code: (Tell me if you need other parts of the code or some comments about how it works. I didn't clean it yet)
import os, sys
import pygame
def get_linked_point(depthmap, d_width, d_height, sep):
    """
    In this function we link each pixel in white in the depth map with the
    coordinate of the shifted pixel we will need to create the illusion
    ex: [[x,y],[x_shifted,y]]
    :param sep: is the shift value in pixels
    """
    deptharray = pygame.PixelArray(depthmap)
    list_linked_point = []
    for x in range(d_width):
        for y in range(d_height):
            if deptharray[x][y] != 0x000000:
                list_linked_point.append([[x, y], [x+sep, y]])
    del deptharray
    return list_linked_point
def display_stereogram(screen, s_width, pattern, p_width, linked_points):
    """
    Here we fill the window with the pattern. Then for each linked couple of
    point we make the shifted pixel [x_shifted,y] equal to the other one
    [x,y]
    """
    x = 0
    while x < s_width:
        screen.blit(pattern, [x, 0])
        x += p_width
    pixAr = pygame.PixelArray(screen)
    for pair in linked_points:
        pixAr[pair[0][0], pair[0][1]] = pixAr[pair[1][0], pair[1][1]]
    del pixAr

The problem ("I can see on the result 2 objects, 1 star "close" to me and an other star "far" from me") came from taking the wrong approach when I tried to generalize my understanding of stereograms made with 2 images to stereograms that use a repeated pattern.
To create a 2-image stereogram you need to shift pixels of one image to make the depth illusion.
What was wrong in my approach is that I only shifted the pixels that should create the star. What I didn't get is that, because an RDS is made of repeated patterns, shifting these pixels also creates an opposite shift relative to the next pattern, which produces another star of the opposite depth.
To correct this I paired every point of the depth map (not only the white ones) in order to come back to the base shifting amount after the end of the star.
Here is the result:
Code: (This code is the previous one quickly modified after the help of Neil Slater, so it's not clean yet. I will try to improve it.)
def get_linked_point(depthmap, d_width, d_height, p_width, sep):
    """
    In this function we link every pixel of the depth map (not only the
    white ones) with the coordinate of the pixel that must repeat it
    ex: [[x,y],[x_shifted,y]]
    :param p_width: is the width of the pattern in pixels
    :param sep: is the shift value in pixels
    """
    deptharray = pygame.PixelArray(depthmap)
    list_linked_point = []
    for x in range(d_width):
        for y in range(d_height):
            if deptharray[x][y] == 0x000000:
                # background: repeat exactly one pattern width further
                list_linked_point.append([[x, y], [x+p_width, y]])
            else:
                # raised pixel: repeat sep pixels sooner
                list_linked_point.append([[x, y], [x-sep+p_width, y]])
    del deptharray
    return list_linked_point
def display_stereogram(screen, s_width, pattern, p_width, linked_points):
    """
    Here we fill the window with the pattern. Then for each linked couple of
    point we make the shifted pixel [x_shifted,y] equal to the other one
    [x,y]
    """
    x = 0
    while x < s_width:
        screen.blit(pattern, [x, 0])
        x += p_width
    pixAr = pygame.PixelArray(screen)
    for pair in linked_points:
        pixAr[pair[1][0], pair[1][1]] = pixAr[pair[0][0], pair[0][1]]
    del pixAr
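To see the pairing rule without pygame, here is a minimal sketch of a single scan line using plain lists. The function name stereogram_row and the 0/1 dot values are assumptions made for illustration; the rule itself (background pixels repeat after p_width, raised pixels after p_width - sep) is the one used in the code above.

```python
import random

def stereogram_row(depth_row, pattern, sep):
    """Build one scan line of a random-dot stereogram.

    depth_row: 0 for background, non-zero for raised pixels
    pattern:   one period of random dots (its length is the pattern width)
    sep:       shift in pixels for raised pixels, smaller than len(pattern)
    """
    p_width = len(pattern)
    # Start from the tiled pattern, one period wider than the depth map
    row = [pattern[x % p_width] for x in range(len(depth_row) + p_width)]
    for x, depth in enumerate(depth_row):
        # Background repeats after p_width pixels, raised pixels after p_width - sep
        target = x + p_width - (sep if depth else 0)
        row[target] = row[x]
    return row

rng = random.Random(0)
pattern = [rng.randint(0, 1) for _ in range(8)]
depth_row = [0] * 10 + [1] * 10 + [0] * 10   # a raised bar in the middle
row = stereogram_row(depth_row, pattern, sep=2)
print(''.join('#' if dot else '.' for dot in row))
```

Because every pixel is paired, the repeat distance goes back to p_width as soon as the raised bar ends, which is what removes the second, opposite-depth object.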

Protein Mutual Information

I'm trying to compute the mutual information (MI) between columns of a multiple sequence alignment (MSA).
The math behind it is OK for me, but I don't know how to implement it in Python, at least not in a fast way.
How should I compute the overall frequencies P(i;x), P(j;y) and P(ij;xy)? The Px and Py frequencies are easy to calculate (a hash could deal with that), but what about P(ij;xy)?
So my real question is: how should I calculate the probability Pxy for a given pair of columns i and j?
Please note that MI can be defined as:
MI(i,j) = Sum(x->n)Sum(y->m) P(ij,xy) * log(P(ij,xy)/(P(i,x)*P(j,y)))
Here i and j are column positions in the alignment, and x and y are the different amino acids found in column i and column j respectively.
Thanks,
EDIT
My input data looks like a df:
A = [
['M','T','S','K','L','G','-','-','S','L','K','P'],
['M','A','A','S','L','A','-','A','S','L','P','E'],
...,
['M','T','S','K','L','G','A','A','S','L','P','E'],
]
So indeed it is truly easy to compute the frequency of any amino acid at a given position,
for example:
P(M) at position 1: 1
P(T) at position 2: 2/3
P(A) at position 2: 1/3
P(S) at position 3: 2/3
P(A) at position 3: 1/3
How should I proceed to get, for example, the probability of a T at position 2 and an S at position 3 at the same time?
In this example it is 2/3.
So P(ij, xy) means the probability (or frequency) of an amino acid x in column i occurring at the same time as an amino acid y in column j.
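As a small illustration of that definition, the joint frequency can be obtained by counting (x, y) pairs row by row. This is only a sketch: joint_frequency is a hypothetical helper name, and the data are the three rows written out above (the elided rows are left out).

```python
from collections import Counter

def joint_frequency(sequences, i, j):
    """Return {(x, y): P(ij, xy)} for columns i and j (0-based indices)."""
    n = float(len(sequences))
    pairs = Counter((seq[i], seq[j]) for seq in sequences)
    return {pair: count / n for pair, count in pairs.items()}

A = [
    ['M','T','S','K','L','G','-','-','S','L','K','P'],
    ['M','A','A','S','L','A','-','A','S','L','P','E'],
    ['M','T','S','K','L','G','A','A','S','L','P','E'],
]
P = joint_frequency(A, 1, 2)   # positions 2 and 3 in the 1-based numbering used above
print(P[('T', 'S')])           # 2 of the 3 rows have T at position 2 and S at position 3
```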
Ps: for a more simple explanation of MI please refer to this link mistic.leloir.org.ar/docs/help.html 'Thanks to Aaron'

I am not 100% sure if this is correct (e.g., how is '-' supposed to be handled?). I assume that the sum is over all pairs for which the frequencies in the numerator and denominator inside the log are all nonzero, and furthermore I assume that it should be the natural log:
from math import log
from collections import Counter
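def MI(sequences,i,j):
    n = float(len(sequences)) # float so the divisions are not truncated in Python 2
    Pi = Counter(sequence[i] for sequence in sequences)
    Pj = Counter(sequence[j] for sequence in sequences)
    Pij = Counter((sequence[i],sequence[j]) for sequence in sequences)
    # The Counters hold raw counts: Pij/n is P(ij,xy), and P(ij,xy)/(P(i,x)*P(j,y)) equals Pij*n/(Pi*Pj)
    return sum(Pij[(x,y)]/n * log(Pij[(x,y)]*n/(Pi[x]*Pj[y])) for x,y in Pij)
The code works by using 3 Counter objects to get the relevant counts, dividing by the number of sequences to turn them into frequencies, and then returning a sum which is a straightforward translation of the formula.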
If this isn't correct, it would be helpful if you edit your question so that it has some expected output to test against.
On Edit. Here is a version which doesn't treat '-' as just another amino acid but instead filters away sequences in which it appears in either of the two columns, interpreting those sequences as sequences for which the requisite information is not available:
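def MI(sequences,i,j):
    sequences = [s for s in sequences if not '-' in [s[i],s[j]]]
    n = float(len(sequences)) # number of sequences left after filtering
    Pi = Counter(s[i] for s in sequences)
    Pj = Counter(s[j] for s in sequences)
    Pij = Counter((s[i],s[j]) for s in sequences)
    # The Counters hold raw counts: Pij/n is P(ij,xy), and P(ij,xy)/(P(i,x)*P(j,y)) equals Pij*n/(Pi*Pj)
    return sum(Pij[(x,y)]/n * log(Pij[(x,y)]*n/(Pi[x]*Pj[y])) for x,y in Pij)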

Here's a place to get started... read the comments
import numpy as np
A = [ # you'll need to pad the end of your strings so that they're all the
# same length for this to play nice with numpy
"MTSKLG--SLKP",
"MAASLA-ASLPE",
"MTSKLGAASLPE"]
#create an array of bytes
B = np.array([np.fromstring(a, dtype=np.uint8) for a in A],)
#create search string to do bytewise xoring
#same length as B.shape[1]
search_string = "-TS---------" # P of T at pos 1 and S at pos 2
#"M-----------" # P of M at pos 0
#take ord of each char in string
search_ord = np.fromstring(search_string, dtype=np.uint8)
#locate positions not compared
search_mask = search_ord != ord('-')
#xor with search_ord. 0 indicates letter in that position matches
#multiply with search_mask to force uninteresting positions to 0
#any remaining arrays that are all 0 are a match. ("any()" taken along axis 1)
#this prints [False, True, False]. take the sum to get the number of non-matches
print(((B^search_ord) * search_mask).any(1))

Two point series intersect only after extended?

I have two point series
A = [(18.405316791178798, -22.039859853332942),
(18.372696520198463, -21.1145),
(18.746540658574137, -20.1145),
(18.698714698430614, -19.1145),
(18.80081378263931, -18.1145),
(18.838536172339943, -17.1145),
(18.876258562040572, -16.1145),
(18.967679510389303, -15.1145),
(19.004907703822514, -14.1145),
(19.042135897255729, -13.1145),
(19.345372798084995, -12.1145),
(19.391824245372803, -11.598937753853679),
(19.435471418833544, -11.1145),
(19.420235820376909, -10.1145),
(19.423148861774159, -9.1145),
(19.426061903171416, -8.1145),
(19.452752569112423, -7.1145),
(19.489649834463115, -6.1145),
(19.444635952332344, -5.1145),
(19.443635102001071, -5.0430597023976906),
(19.430626347601358, -4.1145),
(19.421676068414001, -3.1144999999999996),
(19.362954522948439, -2.1144999999999996),
(19.346848825989134, -1.1144999999999996),
(19.359781116687397, -0.1144999999999996),
(19.396797325132418, 0.69011368336827994)]
B=[(21.7744, -17.859620414326386),
(22.7744, -17.858000854574556),
(23.7744, -18.065164294951039),
(24.7744, -18.051109497755608),
(25.7744, -18.037054700560173),
(26.7744, -18.022999903364742),
(27.7744, -18.008945106169307),
(28.7744, -18.014846881456318),
(29.7744, -18.02764295838865),
(30.7744, -18.098340990366935)]
I know for sure that they will intersect, if one of them is to be extended from one head.
Now, I wish to find their "potential" intersection. I have written a function that can find the intersection point for "already-intersected" point series:
import numpy as np
# find the intersection between two line segments
# if none, return None
# else, return sequence numbers in both rep1 and rep2 and the intersection
def _findIntersection(rep1, rep2):
    x_down = [elem[0] for elem in rep1]
    y_down = [elem[1] for elem in rep1]
    x_up = [elem[0] for elem in rep2]
    y_up = [elem[1] for elem in rep2]
    for m in range(len(x_down)-1):
        p0 = np.array([x_down[m], y_down[m]])
        p1 = np.array([x_down[m+1], y_down[m+1]])
        for n in range(len(x_up)-1):
            q0 = np.array([x_up[n], y_up[n]])
            q1 = np.array([x_up[n+1], y_up[n+1]])
            try: # to ignore the parallel cases
                params = np.linalg.solve(np.column_stack((p1-p0, q0-q1)), q0-p0)
                if np.all((params >= 0) & (params <= 1)):
                    return m, n, ((p0+params[0]*(p1-p0))[0], (p0+params[0]*(p1-p0))[1])
            except np.linalg.LinAlgError: # singular matrix: the two segments are parallel
                pass
    return None
So, I think what I need is to find out which end of which point series needs to be extended. As long as I know this, I can simply extend it and find the intersection with existing _findIntersection().
We can safely assume in this problem that the two point series are roughly both straight lines, which implies only one intersection exists.
I am using Python, but any generic solution is also very much welcomed!

I think one way of doing this is to find the functions of both lines and then using these functions, find the intersection. Here is how I would do that using numpy (making the assumption the lines are linear):
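import numpy as np
def find_func(x,y):
    # least-squares straight line, returned as [slope, intercept]
    return np.polyfit(x, y, 1)
def find_intersect(funcA, funcB):
    a = funcA[0]-funcB[0]
    b = funcB[1]-funcA[1]
    x = b / a
    assert np.around(find_y(funcA,x),3) == np.around(find_y(funcB,x),3)
    return x, find_y(funcA,x)
def find_y(func, x):
    return func[0] * x + func[1]
#A and B are the lists of tuples from the question; convert them so that [:,k] slicing works
A = np.array(A)
B = np.array(B)
#find fits (column 1 is passed as the independent variable, column 0 as the dependent one)
func_A = find_func(A[:,1],A[:,0])
func_B = find_func(B[:,1],B[:,0])
#find intersection
#because of the column order above, x_intersect is in the units of column 1 and y_intersect in those of column 0
x_intersect, y_intersect = find_intersect(func_A, func_B)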
Here is the plotted output of the approximated linear point of intersection:

First off, get the regression line of each of your point series. Convert the lines into the line segments s1 and s2 by projecting the endpoints of the lines' respective point series onto the lines themselves.
Looking at the problem in terms of linear algebra, the two line segments are vectors, each attached to an origin (s1[0] and s2[0]). Unless they are parallel or collinear, multiplying each vector with a given coefficient will cause them to be extended up to the intersection point. Thus, you need to find the coefficients alpha and beta such that s1[0] + alpha * s1 = s2[0] + beta * s2. In other words, solve the linear system alpha * s1 + beta * (-s2) = s2[0] - s1[0], as you have done already with the individual line segments.
There are three cases that you need to be aware of.
If the absolute values of both alpha and beta are smaller than or equal to 1, the intersection point is inside both line segments.
If one absolute value is <=1 but the other is >1, the intersection point i is inside only one of the two line segments (say, s2). Multiply that line segment's vector with the coefficient you have just obtained, then add the origin of the vector, to obtain the intersection point. You can then determine which endpoint in the other line segment (s1 in this case) is closer to the intersection point; the closer one is the one to be extended from.
If both absolute values are >1, simply find the intersection point by multiplying s1 with (alpha / beta), then adding s1[0] to that. Once the intersection point is found, simply find the closest endpoint to it on each line segment. These are the two endpoints from which the point series must be extended.
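The steps above can be sketched in plain Python as follows. This is a variation rather than a literal transcription: each segment is given by its two endpoints, the 2x2 system is solved with Cramer's rule, and the end to extend is read from the sign and size of the coefficient (below 0 means extend from the start, above 1 means extend from the end) instead of a nearest-endpoint search. The names extended_intersection and end_to_extend are hypothetical.

```python
def extended_intersection(p0, p1, q0, q1):
    """Intersect the infinite lines through p0-p1 and q0-q1.

    Returns (alpha, beta, point) with
    point = p0 + alpha * (p1 - p0) = q0 + beta * (q1 - q0),
    or None when the lines are parallel.
    """
    d1 = (float(p1[0] - p0[0]), float(p1[1] - p0[1]))
    d2 = (float(q1[0] - q0[0]), float(q1[1] - q0[1]))
    rx, ry = q0[0] - p0[0], q0[1] - p0[1]
    # Solve alpha * d1 + beta * (-d2) = q0 - p0 with Cramer's rule
    det = d1[0] * (-d2[1]) - (-d2[0]) * d1[1]
    if abs(det) < 1e-12:
        return None
    alpha = (rx * (-d2[1]) - (-d2[0]) * ry) / det
    beta = (d1[0] * ry - d1[1] * rx) / det
    return alpha, beta, (p0[0] + alpha * d1[0], p0[1] + alpha * d1[1])

def end_to_extend(t):
    """Say which end of a segment must be extended to reach parameter t."""
    if t < 0:
        return 'start'
    if t > 1:
        return 'end'
    return None   # the intersection already lies on the segment

# Toy example: a horizontal segment and a vertical one that do not touch
alpha, beta, point = extended_intersection((0, 0), (1, 0), (3, -1), (3, -2))
print(point, end_to_extend(alpha), end_to_extend(beta))
```

In the setting of the question, p0, p1 and q0, q1 would be the projected endpoints of the two regression lines.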
