Given two strings, find the characters common to both that appear in the same order from left to right.
Example 1
string_1 = 'hcarry'
string_2 = 'sallyc'
Output - 'ay'
Example 2
string_1 = 'jenny'
string_2 = 'ydjeu'
Output - 'je'
Explanation for Example 1 -
Common characters between string_1 and string_2 are c,a,y. But since c comes before ay in string_1 and after ay in string_2, we won't consider character c in output. The order of common characters between the two strings must be maintained and must be same.
Explanation for Example 2 -
Common characters between string_1 and string_2 are j,e,y. But since y comes before je in string_2 and after je in string_1, we won't consider character y in output. The order of common characters between the two strings must be maintained and must be same.
My approach -
Find the common characters between the strings and then store it in another variable for each individual string.
Example -
string_1 = 'hcarry'
string_2 = 'sallyc'
Common_characters = c,a,y
string_1_com = cay
string_2_com = ayc
I used sorted, counter, enumerate functions to get string_1_com and string_2_com in Python.
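A minimal sketch of that filtering step, not the exact code I used: it assumes it is enough to keep every character that also occurs somewhere in the other string, and the helper name `common_only` is made up for illustration.

```python
def common_only(s, other):
    """Keep the characters of s that also occur in other, preserving their order."""
    allowed = set(other)  # assumption: membership only, repeat counts are ignored
    return "".join(ch for ch in s if ch in allowed)


string_1_com = common_only("hcarry", "sallyc")
string_2_com = common_only("sallyc", "hcarry")
print(string_1_com, string_2_com)
```

For the first example this produces the two filtered strings listed above.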
Now find the longest common subsequence of string_1_com and string_2_com. That subsequence is the output.
This is the brute force solution.
What is the optimal solution for this?
My book just calls the algorithm for this string matching. It runs in O(mn), where m and n are the word lengths. It could just as well run on the full words; which is more efficient depends on the expected number of common letters and on how the sorting and filtering are performed. I will explain it for the common-letter strings, as that is easier.
The idea is that you look at a directed acyclic graph of (m+1)*(n+1) nodes. Each path (from upper left to lower right) through this graph represents a unique way of matching the words. We want to match the strings, and additionally put in blanks (-) in the words so that they align with the highest number of common letters. For example the end state of cay and ayc would be
cay-
-ayc
Each node stores the highest number of matches for the partial matching which it represents, and at the end of the algorithm the end node will give us the highest number of matches.
We start at the upper left corner where nothing is matched with nothing and so we have 0 matching letters here (score 0).
    c a y
  0 . . .
a . . . .
y . . . .
c . . . .
We are to walk through this graph and for each node calculate the highest number of matching letters, by using the data from previous nodes.
The nodes are connected left->right, up->down and diagonally left-up->right-down.
Moving right represents consuming one letter from cay and matching the letter we arrive at with a - inserted in ayc.
Moving down represents the opposite (consuming from ayc and inserting - to cay).
Moving diagonally represents consuming one letter from each word and matching those.
Looking at the first node to the right of our starting node it represents the matching
c
-
and this node can (obviously) only be reached from the starting node.
All nodes in first row and first column will be 0 since they all represent matching one or more letters with an equal number of -.
We get the graph
    c a y
  0 0 0 0
a 0 . . .
y 0 . . .
c 0 . . .
That was the setup, now the interesting part begins.
Looking at the first unevaluated node, which represents matching the substrings c with a, we want to decide how we can get there with the most number of matching letters.
Alternative 1: We can get there from the node to the left. The node to the left represents the matching
-
a
so by choosing this path to get to our current node we arrive at
-c
a-
matching c with - gives us no correct match and thus the score for this path is 0 (taken from the last node) plus 0 (score for the match c/- just made). So 0 + 0 = 0 for this path.
Alternative 2: We can get to this node from above, this path represents moving from
c  ->  c-
-      -a
which also gives us 0 extra points. Score for this is 0.
Alternative 3: We can get to this node from upper-left. This is moving from the starting node (nothing at all) to consuming one character from each word. That is matching
c
a
Since c and a are different letters we get 0 + 0 = 0 for this path as well.
    c a y
  0 0 0 0
a 0 0 . .
y 0 . . .
c 0 . . .
But for the next node it looks better. We still have the three alternatives to look at.
Alternatives 1 and 2 always give us 0 extra points, as they always represent matching a letter with -, so those paths will give us score 0. Let's move on to alternative 3.
For our current node moving diagonally means going from
c  ->  ca
-      -a
IT'S A MATCH!
That means there is a path to this node that gives us 1 in score. We throw away the 0s and save the 1.
    c a y
  0 0 0 0
a 0 0 1 .
y 0 . . .
c 0 . . .
For the last node on this row we look at our three alternatives and realize we won't get any new points (new matches), but we can get to the node by using our previous 1 point path:
ca  ->  cay
-a      -a-
So this node is also 1 in score.
Doing this for all nodes we get the following complete graph
    c a y
  0 0 0 0
a 0 0 1 1
y 0 0 1 2
c 0 1 1 2
where the only increases in score come from
c  -> ca  |  ca -> cay  |  -  -> -c
-     -a  |  -a    -ay  |  y     yc
And so the end node tells us the maximal match is 2 letters.
Since in your case you wish to know the longest path with score 2, you need to track, for each node, the path taken as well.
This graph is easily implemented as a matrix (or an array of arrays).
I would suggest that you as elements use a tuple with one score element and one path element and in the path element you just store the aligning letters, then the elements of the final matrix will be
     c       a         y
  0  0       0         0
a 0  0       (1, a)    (1, a)
y 0  0       (1, a)    (2, ay)
c 0  (1, c)  (1, a/c)  (2, ay)
At one place I noted a/c, this is because string ca and ayc have two different sub-sequences of maximum length. You need to decide what to do in those cases, either just go with one or save both.
EDIT:
Here's an implementation for this solution.
def longest_common(string_1, string_2):
    len_1 = len(string_1)
    len_2 = len(string_2)
    m = [[(0,"") for _ in range(len_1 + 1)] for _ in range(len_2 + 1)] # initiate matrix
    for row in range(1, len_2+1):
        for col in range(1, len_1+1):
            diag = 0
            match = ""
            if string_1[col-1] == string_2[row-1]: # score increases by one if letters match in diagonal move
                diag = 1
                match = string_1[col - 1]
            # find best alternative
            if m[row][col-1][0] >= m[row-1][col][0] and m[row][col-1][0] >= m[row-1][col-1][0]+diag:
                m[row][col] = m[row][col-1] # path from left is best
            elif m[row-1][col][0] >= m[row-1][col-1][0]+diag:
                m[row][col] = m[row-1][col] # path from above is best
            else:
                m[row][col] = (m[row-1][col-1][0]+diag, m[row-1][col-1][1]+match) # path diagonally is best
    return m[len_2][len_1][1]
>>> print(longest_common("hcarry", "sallyc"))
ay
>>> print(longest_common("cay", "ayc"))
ay
>>> m
[[(0, ''), (0, ''), (0, ''), (0, '')],
[(0, ''), (0, ''), (1, 'a'), (1, 'a')],
[(0, ''), (0, ''), (1, 'a'), (2, 'ay')],
[(0, ''), (1, 'c'), (1, 'c'), (2, 'ay')]]
Here is a simple, dynamic programming based implementation for the problem:
def lcs(X, Y):
    m, n = len(X), len(Y)
    L = [[0 for _ in range(n+1)] for _ in range(m+1)]
    # using a 2D Matrix for dynamic programming
    # L[i][j] stores length of longest common string for X[0:i] and Y[0:j]
    for i in range(m+1):
        for j in range(n+1):
            if i == 0 or j == 0:
                L[i][j] = 0
            elif X[i-1] == Y[j-1]:
                L[i][j] = L[i-1][j-1] + 1
            else:
                L[i][j] = max(L[i-1][j], L[i][j-1])
    # Following code is used to find the common string
    index = L[m][n]
    # Create a character array to store the lcs string
    lcs = [""] * (index+1)
    lcs[index] = ""
    # Start from the right-most-bottom-most corner and
    # one by one store characters in lcs[]
    i = m
    j = n
    while i > 0 and j > 0:
        # If current character in X[] and Y are same, then
        # current character is part of LCS
        if X[i-1] == Y[j-1]:
            lcs[index-1] = X[i-1]
            i-=1
            j-=1
            index-=1
        # If not same, then find the larger of two and
        # go in the direction of larger value
        elif L[i-1][j] > L[i][j-1]:
            i-=1
        else:
            j-=1
    print("".join(lcs))
But you already know the term "longest common subsequence", so you can find numerous descriptions of the dynamic programming algorithm.
Wiki link
pseudocode
function LCSLength(X[1..m], Y[1..n])
    C = array(0..m, 0..n)
    for i := 0..m
        C[i,0] = 0
    for j := 0..n
        C[0,j] = 0
    for i := 1..m
        for j := 1..n
            if X[i] = Y[j] //i-1 and j-1 if reading X & Y from zero
                C[i,j] := C[i-1,j-1] + 1
            else
                C[i,j] := max(C[i,j-1], C[i-1,j])
    return C[m,n]
function backtrack(C[0..m,0..n], X[1..m], Y[1..n], i, j)
    if i = 0 or j = 0
        return ""
    if X[i] = Y[j]
        return backtrack(C, X, Y, i-1, j-1) + X[i]
    if C[i,j-1] > C[i-1,j]
        return backtrack(C, X, Y, i, j-1)
    return backtrack(C, X, Y, i-1, j)
Much easier solution ----- Thank you!
def f(s, s1):
    cc = list(set(s) & set(s1))
    ns = ''.join([S for S in s if S in cc])
    ns1 = ''.join([S for S in s1 if S in cc])
    found = []
    b = ns[0]
    for e in ns[1:]:
        cs = b+e
        if cs in ns1:
            found.append(cs)
        b = e
    return found
I want to calculate the largest covering of a string from many sets of substrings.
All strings in this problem are lowercased, and contain no whitespace or unicode strangeness.
So, given a string: abcdef, and two groups of strings: ['abc', 'bc'], ['abc', 'd'], the second group (['abc', 'd']) covers more of the original string. Order matters for exact matches, so the term group ['fe', 'cba'] would not match the original string.
I have a large collection of strings, and a large collection of terms-groups. So I would like a bit faster implementation if possible.
I've tried the following in Python for an example. I've used Pandas and Numpy because I thought it may speed it up a bit. I'm also running into an over-counting problem as you'll see below.
import re
import pandas as pd
import numpy as np

my_strings = pd.Series(['foobar', 'foofoobar0', 'apple'])
term_sets = pd.Series([['foo', 'ba'], ['foo', 'of'], ['app', 'ppl'], ['apple'], ['zzz', 'zzapp']])

# For each string, calculate best proportion of coverage:
# Try 1: Create a function for each string.
def calc_coverage(mystr, term_sets):
    # Total length of string
    total_chars = len(mystr)
    # For each term set, sum up length of any match. Problem: this over counts when matches overlap.
    total_coverage = term_sets.apply(lambda x: np.sum([len(term) if re.search(term, mystr) else 0 for term in x]))
    # Fraction of String covered. Note the above over-counting can result in fractions > 1.0.
    coverage_proportion = total_coverage/total_chars
    return coverage_proportion.argmax(), coverage_proportion.max()

my_strings.apply(lambda x: calc_coverage(x, term_sets))
This results in:
0 (0, 0.8333333333333334)
1 (0, 0.5)
2 (2, 1.2)
Which presents some problems. The biggest problem I see is that over-lapping terms are being counted up separately, which results in the 1.2 or 120% coverage.
I think the ideal output would be:
0 (0, 0.8333333333333334)
1 (0, 0.8)
2 (3, 1.0)
I think I can write a double for loop and brute force it. But this problem feels like there's a more optimal solution. Or a small change on what I've done so far to get it to work.
Note: If there is a tie- returning the first is fine. I'm not too interested in returning all best matches.
Ok, this is not optimized but let's start fixing the results. I believe you have two issues: one is the over-counting in apple; the other is the under-counting in foofoobar0.
Solving the second issue when the term set is composed of two non-overlapping terms (or just one term), is easy:
sum([s.count(t)*len(t) for t in ts])
will do the job.
Similarly, when we have two overlapping terms, we will just take the "best" one:
max([s.count(t)*len(t) for t in ts])
So we are left with the problem of recognizing when the two terms overlap. I don't even consider term sets with more than two terms, because the solution will already be painfully slow with two :(
Let's define a function to test for overlapping:
def terms_overlap(s, ts):
    if ts[0] not in s or ts[1] not in s:
        return False
    # does an occurrence of ts[1] start inside an occurrence of ts[0]?
    start = 0
    while (pos_0 := s.find(ts[0], start)) > -1:
        if (pos_1 := s.find(ts[1], pos_0)) > -1:
            if pos_0 <= pos_1 < pos_0 + len(ts[0]):
                return True
        start = pos_0 + 1  # continue the search just past this occurrence
    # same test with the roles of the two terms swapped
    start = 0
    while (pos_1 := s.find(ts[1], start)) > -1:
        if (pos_0 := s.find(ts[0], pos_1)) > -1:
            if pos_1 <= pos_0 < pos_1 + len(ts[1]):
                return True
        start = pos_1 + 1
    return False
With that function ready we can finally do:
def calc_coverage(strings, tsets):
    for xs, s in enumerate(strings):
        best_cover = 0
        best_ts = 0
        for xts, ts in enumerate(tsets):
            if len(ts) == 1:
                cover = s.count(ts[0])*len(ts[0])
            elif len(ts) == 2:
                if terms_overlap(s, ts):
                    cover = max([s.count(t)*len(t) for t in ts])
                else:
                    cover = sum([s.count(t)*len(t) for t in ts])
            else:
                raise ValueError('Cannot handle term sets of more than two terms')
            if cover > best_cover:
                best_cover = cover
                best_ts = xts
        print(f'{xs}: {s:15} {best_cover:2d} / {len(s):2d} = {best_cover/len(s):8.3f} ({best_ts}: {tsets[best_ts]})')
>>> calc_coverage(my_strings, term_sets)
0: foobar 5 / 6 = 0.833 (0: ['foo', 'ba'])
1: foofoobar0 8 / 10 = 0.800 (0: ['foo', 'ba'])
2: apple 5 / 5 = 1.000 (3: ['apple'])
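If term sets with more than two terms matter, one alternative is to mark which character positions of the string are covered and count the marks, so overlapping matches are never counted twice. This is only a sketch under that assumption, not a tuned solution; the function names `covered_fraction` and `best_term_set` are invented for the example, and the term-set list is assumed to be non-empty.

```python
def covered_fraction(s, terms):
    """Fraction of positions in s covered by at least one occurrence of any term."""
    covered = [False] * len(s)
    for term in terms:
        if not term:
            continue  # an empty term covers nothing
        pos = s.find(term)
        while pos != -1:
            for k in range(pos, pos + len(term)):
                covered[k] = True
            pos = s.find(term, pos + 1)  # step by one so overlapping hits are found too
    return sum(covered) / len(s) if s else 0.0


def best_term_set(s, term_sets):
    """Return (index, fraction) of the first term set with the highest coverage."""
    fractions = [covered_fraction(s, ts) for ts in term_sets]
    best = max(range(len(fractions)), key=fractions.__getitem__)  # first maximum wins ties
    return best, fractions[best]


term_sets = [['foo', 'ba'], ['foo', 'of'], ['app', 'ppl'], ['apple'], ['zzz', 'zzapp']]
for s in ['foobar', 'foofoobar0', 'apple']:
    print(s, best_term_set(s, term_sets))
```

The cost is roughly one scan of the string per term, so it stays a brute-force approach, but it handles any number of terms per set.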
Problem Statement:
Problem
Apollo is playing a game involving polyominos. A polyomino is a shape made by joining together one or more squares edge to edge to form a single connected shape. The game involves combining N polyominos into a single rectangular shape without any holes. Each polyomino is labeled with a unique character from A to Z.
Apollo has finished the game and created a rectangular wall containing R rows and C columns. He took a picture and sent it to his friend Selene. Selene likes pictures of walls, but she likes them even more if they are stable walls. A wall is stable if it can be created by adding polyominos one at a time to the wall so that each polyomino is always supported. A polyomino is supported if each of its squares is either on the ground, or has another square below it.
Apollo would like to check if his wall is stable and if it is, prove that fact to Selene by telling her the order in which he added the polyominos.
Input
The first line of the input gives the number of test cases, T. T test cases follow. Each test case begins with a line containing the two integers R and C. Then, R lines follow, describing the wall from top to bottom. Each line contains a string of C uppercase characters from A to Z, describing that row of the wall.
Output
For each test case, output one line containing Case #x: y, where x is the test case number (starting from 1) and y is a string of N uppercase characters, describing the order in which he built them. If there is more than one such order, output any of them. If the wall is not stable, output -1 instead.
Limits
Time limit: 20 seconds per test set.
Memory limit: 1GB.
1 ≤ T ≤ 100.
1 ≤ R ≤ 30.
1 ≤ C ≤ 30.
No two polyominos will be labeled with the same letter.
The input is guaranteed to be valid according to the rules described in the statement.
Test set 1
1 ≤ N ≤ 5.
Test set 2
1 ≤ N ≤ 26.
Sample
Input
4
4 6
ZOAAMM
ZOAOMM
ZOOOOM
ZZZZOM
4 4
XXOO
XFFO
XFXO
XXXO
5 3
XXX
XPX
XXX
XJX
XXX
3 10
AAABBCCDDE
AABBCCDDEE
AABBCCDDEE
Output
Case #1: ZOAM
Case #2: -1
Case #3: -1
Case #4: EDCBA
In sample case #1, note that ZOMA is another possible answer.
In sample case #2 and sample case #3, the wall is not stable, so the answer is -1.
In sample case #4, the only possible answer is EDCBA.
My Code:
class Case:
    def __init__(self, arr):
        self.arr = arr

    def solve(self):
        n = len(self.arr)
        if n == 1:
            return ''.join(self.arr[0])
        m = len(self.arr[0])
        dep = {}
        used = set() # to save letters already used
        res = []
        for i in range(n-1):
            for j in range(m):
                # each letter depends on the letter below it
                if self.arr[i][j] not in dep:
                    dep[self.arr[i][j]] = set()
                # only add dependency besides itself
                if self.arr[i+1][j] != self.arr[i][j]:
                    dep[self.arr[i][j]].add(self.arr[i+1][j])
        for j in range(m):
            if self.arr[n-1][j] not in dep:
                dep[self.arr[n-1][j]] = set()
        # always find and desert the letters with all dependencies met
        while len(dep) > 0:
            # count how many letters are used in this round, if none is used, return -1
            count = 0
            next_dep = {}
            for letter in dep:
                if len(dep[letter]) == 0:
                    used.add(letter)
                    count += 1
                    res.append(letter)
                else:
                    all_used = True
                    for neigh in dep[letter]:
                        if neigh not in used:
                            all_used = False
                            break
                    if all_used:
                        used.add(letter)
                        count += 1
                        res.append(letter)
                    else:
                        next_dep[letter] = dep[letter]
            dep = next_dep
            if count == 0:
                return -1
        if count == 0:
            return -1
        return ''.join(res)

t = int(input())
for i in range(1, t + 1):
    R, C = [int(j) for j in input().split()]
    arr = []
    for j in range(R):
        arr.append([c for c in input()])
    case = Case(arr)
    print("Case #{}: {}".format(i,case.solve()))
My code successfully passes all sample cases I can think of, but still keeps getting WA when submitted. Can anyone spot what is wrong with my solution? Thanks
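One case worth checking in the code above is the single-row shortcut: `''.join(self.arr[0])` returns every cell of the row, so a wall such as AAB would be reported as AAB rather than one letter per polyomino. For comparison, here is a sketch of the same dependency idea written as a plain topological sort. It is an illustration under the problem statement's rules, not a verified submission, and the function name `build_order` is made up.

```python
from collections import defaultdict


def build_order(wall):
    """wall: list of equal-length strings, top row first. Returns an order or "-1"."""
    below = defaultdict(set)  # letter -> letters it rests on
    above = defaultdict(set)  # letter -> letters resting on it
    letters, seen = [], set()  # each letter once, in first-seen order
    for r, row in enumerate(wall):
        for c, ch in enumerate(row):
            if ch not in seen:
                seen.add(ch)
                letters.append(ch)
            if r + 1 < len(wall) and wall[r + 1][c] != ch:
                below[ch].add(wall[r + 1][c])
                above[wall[r + 1][c]].add(ch)
    pending = {ch: len(below[ch]) for ch in letters}
    ready = [ch for ch in letters if pending[ch] == 0]
    order = []
    while ready:
        ch = ready.pop()
        order.append(ch)
        for up in above[ch]:  # placing ch may unblock the letters resting on it
            pending[up] -= 1
            if pending[up] == 0:
                ready.append(up)
    # a cycle of dependencies leaves some letters unplaced
    return "".join(order) if len(order) == len(letters) else "-1"


print(build_order(["ZOAAMM", "ZOAOMM", "ZOOOOM", "ZZZZOM"]))
```

Because every letter is recorded once regardless of how many cells it occupies, the single-row wall needs no special case here.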
I am trying to solve the USACO problem Combination Lock, where you are given two lock combinations. The locks have a margin of error of ±2, so if you had a combination lock of 1-3-5, the combination 3-1-7 would still open it.
You are also given a dial. For example, the dial starts at 1 and ends at the given number. So if the dial was 50, it would start at 1 and end at 50. Since the beginning of the dial is adjacent to the end of the dial, the combination 49-1-3 would also solve the combination lock of 1-3-5.
In this program, you have to output the number of distinct solutions to the two lock combinations. For the record, the combination 3-2-1 and 1-2-3 are considered distinct, but the combination 2-2-2 and 2-2-2 is not.
I have tried creating two functions, one to check whether three numbers match the constraints of the first combination lock and another to check whether three numbers match the constraints of the second combination lock.
a,b,c = 1,2,3
d,e,f = 5,6,7
dial = 50

def check(i,j,k):
    i = (i+dial) % dial
    j = (j+dial) % dial
    k = (k+dial) % dial
    if abs(a-i) <= 2 and abs(b-j) <= 2 and abs(c-k) <= 2:
        return True
    return False

def check1(i,j,k):
    i = (i+dial) % dial
    j = (j+dial) % dial
    k = (k+dial) % dial
    if abs(d-i) <= 2 and abs(e-j) <= 2 and abs(f-k) <= 2:
        return True
    return False

res = []
count = 0
for i in range(1,dial+1):
    for j in range(1,dial+1):
        for k in range(1,dial+1):
            if check(i,j,k):
                count += 1
                res.append([i,j,k])
            if check1(i,j,k):
                count += 1
                res.append([i,j,k])
print(sorted(res))
print(count)
The dial is 50 and the first combination is 1-2-3 and the second combination is 5-6-7.
The program should output 249 as the count, but it instead outputs 225. I am not really sure why this is happening. I have added the array for display purposes only. Any help would be greatly appreciated!
You're going to a lot of trouble to solve this by brute force.
First of all, your two check routines have identical functionality: just call the same routine for both combinations, giving the correct combination as a second set of parameters.
The critical logic problem is handling the dial wrap-around: you miss picking up the adjacent numbers. Run 49 through your check against a correct value of 1:
# using a=1, i=49
i = (49+50) % 50       # i is still 49
...
if abs(1-49) <= 2 ...  # abs(1-49) is 48. You need it to show up as 2.
Instead, you can check each end of the dial:
a_diff = abs(i-a)
if a_diff <=2 or a_diff >= (dial-2) ...
Another way is to start by making a list of acceptable values:
a_vals = [(a + oops) % dial for oops in range(-2, 3)]
... but note that you have to change the 0 value to dial. For instance, for a value of 1, you want a list of [49, 50, 1, 2, 3]
With this done, you can check like this:
if i in a_vals and j in b_vals and k in c_vals:
...
If you want to upgrade to the itertools package, you can simply generate all desired combinations:
combo = set(itertools.product(a_vals, b_vals, c_vals))
Do that for both given combinations and take the union of the two sets. The length of the union is the desired answer.
I see the follow-up isn't obvious -- at least, it's not appearing in the comments.
You have 5*5*5 solutions for each combination; start with 250 as your total.
Compute the sizes of the overlap sets: for each of the three positions, the numbers that work for both combinations. For your given problem, those are [3],[4],[5]
The product of those set sizes is the quantity of overlap: 1*1*1 in this case.
The overlapping solutions got double-counted, so simply subtract the extra from 250, giving the answer of 249.
For example, given 1-2-3 and 49-6-6, you would get sets
{49, 50, 1}
{4}
{4, 5}
The sizes are 3, 1, 2; the product of those numbers is 6, so your answer is 250-6 = 244
Final note: If you're careful with your modular arithmetic, you can directly compute the set sizes without building the sets, making the program very short.
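The counting argument above can be sketched as follows. This is one possible way to write it, with the helper names `close` and `count_solutions` invented for the example; it scans the dial once per position instead of deriving the overlap sizes with pure modular arithmetic.

```python
def close(x, y, dial, tol=2):
    """True if x and y are within tol steps of each other on a circular dial."""
    diff = abs(x - y) % dial
    return min(diff, dial - diff) <= tol


def count_solutions(combo1, combo2, dial, tol=2):
    """Number of distinct settings that open at least one of the two locks."""
    per_lock = min(dial, 2 * tol + 1) ** 3  # 5*5*5 unless the dial is tiny
    overlap = 1
    for a, d in zip(combo1, combo2):
        # how many dial values are acceptable for both locks at this position
        overlap *= sum(
            1 for v in range(1, dial + 1)
            if close(v, a, dial, tol) and close(v, d, dial, tol)
        )
    return 2 * per_lock - overlap  # subtract the settings counted twice


print(count_solutions((1, 2, 3), (5, 6, 7), 50))
print(count_solutions((1, 2, 3), (49, 6, 6), 50))
```

The two calls correspond to the two worked examples above (overlap products 1 and 6).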
Here is one approach to a semi-brute-force solution:
import itertools
#The following code assumes 0-based combinations,
#represented as tuples of numbers in the range 0 to dial - 1.
#A simple wrapper function can be used to make the
#code apply to 1-based combos.
#The following function finds all combos which open lock with a given combo:
def combos(combo,tol,dial):
    valids = []
    for p in itertools.product(range(-tol,1+tol),repeat = 3):
        valids.append(tuple((x+i)%dial for x,i in zip(combo,p)))
    return valids
#The following finds all combos for a given iterable of target combos:
def all_combos(targets,tol,dial):
    return set(combo for target in targets for combo in combos(target,tol,dial))
For example, len(all_combos([(0,1,2),(4,5,6)],2,50)) evaluates to 249.
The correct code for what you are trying to do is the following:
dial = 50
a = 1
b = 2
c = 3
d = 5
e = 6
f = 7

def check(i,j,k):
    if (abs(a-i) <= 2 or (dial-abs(a-i)) <= 2) and \
       (abs(b-j) <= 2 or (dial-abs(b-j)) <= 2) and \
       (abs(c-k) <= 2 or (dial-abs(c-k)) <= 2):
        return True
    return False

def check1(i,j,k):
    if (abs(d-i) <= 2 or (dial-abs(d-i)) <= 2) and \
       (abs(e-j) <= 2 or (dial-abs(e-j)) <= 2) and \
       (abs(f-k) <= 2 or (dial-abs(f-k)) <= 2):
        return True
    return False

res = []
count = 0
for i in range(1,dial+1):
    for j in range(1,dial+1):
        for k in range(1,dial+1):
            if check(i,j,k):
                count += 1
                res.append([i,j,k])
            elif check1(i,j,k):
                count += 1
                res.append([i,j,k])
print(sorted(res))
print(count)
The result is 249. The total number of combinations is 2*(5**3) = 250, but one of them, [3, 4, 5], opens both locks and must be counted only once.
I need to use a certain program to validate some of my results. I am relatively new to Python. The output differs a lot from entry to entry; see a snippet below:
SEQENCE ID TM SP PREDICTION
YOL154W_Q12512_Saccharomyces_cerevisiae 0 Y n8-15c20/21o
YDR481C_P11491_Saccharomyces_cerevisiae 1 0 i34-53o
YAL007C_P39704_Saccharomyces_cerevisiae 1 Y n5-20c25/26o181-207i
YAR028W_P39548_Saccharomyces_cerevisiae 2 0 i51-69o75-97i
YBL040C_P18414_Saccharomyces_cerevisiae 7 0 o6-26i38-56o62-80i101-119o125-143i155-174o186-206i
YBR106W_P38264_Saccharomyces_cerevisiae 1 0 o28-47i
YBR287W_P38355_Saccharomyces_cerevisiae 8 0 o12-32i44-63o69-90i258-275o295-315i327-351o363-385i397-421o
So, I need the last transmembrane region; in this case it is always the last pair of numbers between o and i or vice versa. If TM = 0 there is no transmembrane region, so I only want the numbers if TM > 0.
output I need:
34-53
181-207
75-97
186-206
28-47
397-421
preferably in separate values, like:
first_number = 34
second_number = 53
Because I will be using a loop, the values will be overwritten anyway. To summarize: I need the last region between the o and i or vice versa, with very variable strings (both in length and composition).
Trouble: If I just search (for example with regular expression) for the last region between o and i, I will sometimes pick the wrong region.
If the Phobius output is stored in a file, change 'Phobius_output' to the path, then the following code should give the expected result:
with open('Phobius_output') as file:
    for line in file.readlines()[1:]:
        if int(line.split()[1]) > 0:
            prediction = line.split()[3]
            i_idx, o_idx = prediction.rfind('i'), prediction.rfind('o')
            last_region = prediction[i_idx + 1:o_idx] if i_idx < o_idx else prediction[o_idx + 1:i_idx]
            first_number, second_number = map(int, last_region.split('-'))
            print(last_region)
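A regular expression can also work if it requires an i or o on both sides of the number pair, which skips the signal-peptide parts such as n5-20c25/26. The sketch below assumes the PREDICTION column always has the shape shown in the question; the name `last_tm_region` is made up for the example.

```python
import re

# A region is "start-end" with an 'i' or 'o' directly before and after it
REGION = re.compile(r"(?<=[io])(\d+)-(\d+)(?=[io])")


def last_tm_region(prediction):
    """Return (start, end) of the last transmembrane region, or None if there is none."""
    regions = REGION.findall(prediction)
    if not regions:
        return None
    first_number, second_number = map(int, regions[-1])
    return first_number, second_number


print(last_tm_region("n5-20c25/26o181-207i"))
```

The lookbehind and lookahead do not consume the i or o, so consecutive regions that share a delimiter are all found.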
I have a min-heap code for Huffman coding which you can see here: http://rosettacode.org/wiki/Huffman_coding#Python
I'm trying to make a max-heap Shannon-Fano code which is similar to min-heap.
Here is the code:
from collections import defaultdict, Counter
import heapq, math

def _heappop_max(heap):
    """Maxheap version of a heappop."""
    lastelt = heap.pop() # raises appropriate IndexError if heap is empty
    if heap:
        returnitem = heap[0]
        heap[0] = lastelt
        heapq._siftup_max(heap, 0)
        return returnitem
    return lastelt

def _heappush_max(heap, item):
    """Push item onto heap, maintaining the heap invariant."""
    heap.append(item)
    heapq._siftdown_max(heap, 0, len(heap)-1)

def sf_encode(symb2freq):
    heap = [[wt, [sym, ""]] for sym, wt in symb2freq.items()]
    heapq._heapify_max(heap)
    while len(heap) > 1:
        lo = _heappop_max(heap)
        hi = _heappop_max(heap)
        for pair in lo[1:]:
            pair[1] = '0' + pair[1]
        for pair in hi[1:]:
            pair[1] = '1' + pair[1]
        _heappush_max(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
        print(heap)
    return sorted(_heappop_max(heap)[1:], key=lambda p: (len(p[1]), p))
But I get output like this:
Symbol Weight Shannon-Fano Code
! 1 1
3 1 01
: 1 001
J 1 0001
V 1 00001
z 1 000001
E 3 0000001
L 3 00000001
P 3 000000001
N 4 0000000001
O 4 00000000001
Am I right to use heapq to implement Shannon-Fano coding? The problem is in this line:
_heappush_max(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
and I don't understand how to fix it.
I expect output similar to Huffman encoding:
Symbol Weight Huffman Code
2875 01
a 744 1001
e 1129 1110
h 606 0000
i 610 0001
n 617 0010
o 668 1000
t 842 1100
d 358 10100
l 326 00110
Added:
Well, I've tried to do this without heapq, but I get unstoppable recursion:
def sf_encode(iA, iB, maxP):
    global tupleList, total_sf
    global mid
    maxP = maxP/float(2)
    sumP = 0
    for i in range(iA, iB):
        tup = tupleList[i]
        if sumP < maxP or i == iA: # top group
            sumP += tup[1]/float(total_sf)
            tupleList[i] = (tup[0], tup[1], tup[2] + '0')
            mid = i
        else: # bottom group
            tupleList[i] = (tup[0], tup[1], tup[2] + '1')
    print(tupleList)
    if mid - 1 > iA:
        sf_encode(iA, mid - 1, maxP)
    if iB - mid > 0:
        sf_encode(mid, iB, maxP)
    return tupleList
In Shannon-Fano coding you need the following steps:
A Shannon–Fano tree is built according to a specification designed to define an effective code table. The actual algorithm is simple:
1. For a given list of symbols, develop a corresponding list of probabilities or frequency counts so that each symbol’s relative frequency of occurrence is known.
2. Sort the lists of symbols according to frequency, with the most frequently occurring symbols at the left and the least common at the right.
3. Divide the list into two parts, with the total frequency counts of the left part being as close to the total of the right as possible.
4. The left part of the list is assigned the binary digit 0, and the right part is assigned the digit 1. This means that the codes for the symbols in the first part will all start with 0, and the codes in the second part will all start with 1.
5. Recursively apply the steps 3 and 4 to each of the two halves, subdividing groups and adding bits to the codes until each symbol has become a corresponding code leaf on the tree.
So you will need code to sort (your input appears already sorted so you may be able to skip this), plus a recursive function that chooses the best partition and then recurses on the first and second halves of the list.
Once the list is sorted, the order of the elements never changes so there is no need to use heapq to do this style of encoding.
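A minimal sketch of such a recursive function, assuming the input is a dict mapping symbols to frequency counts. The name `shannon_fano` and the tie-breaking choices (alphabetical order for equal weights, the earliest cut when two splits are equally balanced) are assumptions of this example, not part of the quoted algorithm.

```python
def shannon_fano(freqs):
    """Return {symbol: code} built by recursive balanced splitting."""
    # Step 2: most frequent first; ties broken by symbol for a repeatable result
    symbols = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))
    codes = {sym: "" for sym, _ in symbols}

    def split(group):
        if len(group) < 2:
            return  # a single symbol is a leaf
        total = sum(w for _, w in group)
        running, best_diff, cut = 0, None, 1
        # Step 3: find the cut where left and right totals are closest
        for i in range(1, len(group)):
            running += group[i - 1][1]
            diff = abs(total - 2 * running)
            if best_diff is None or diff < best_diff:
                best_diff, cut = diff, i
        left, right = group[:cut], group[cut:]
        # Step 4: append 0 to the left part, 1 to the right part
        for sym, _ in left:
            codes[sym] += "0"
        for sym, _ in right:
            codes[sym] += "1"
        # Step 5: recurse on both halves
        split(left)
        split(right)

    split(symbols)
    return codes


print(shannon_fano({"A": 15, "B": 7, "C": 6, "D": 6, "E": 5}))
```

With only one symbol in the input, this sketch leaves its code as the empty string, so a real encoder would need to special-case that.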