Shannon-Fano code as max-heap in python

Shannon-Fano code as max-heap in python - python

I have a min-heap code for Huffman coding which you can see here: http://rosettacode.org/wiki/Huffman_coding#Python
I'm trying to make a max-heap Shannon-Fano code which is similar to min-heap.
Here is a code:
from collections import defaultdict, Counter
import heapq, math
def _heappop_max(heap):
"""Maxheap version of a heappop."""
lastelt = heap.pop() # raises appropriate IndexError if heap is empty
if heap:
returnitem = heap[0]
heap[0] = lastelt
heapq._siftup_max(heap, 0)
return returnitem
return lastelt
def _heappush_max(heap, item):
"""Push item onto heap, maintaining the heap invariant."""
heap.append(item)
heapq._siftdown_max(heap, 0, len(heap)-1)
def sf_encode(symb2freq):
heap = [[wt, [sym, ""]] for sym, wt in symb2freq.items()]
heapq._heapify_max(heap)
while len(heap) > 1:
lo = _heappop_max(heap)
hi = _heappop_max(heap)
for pair in lo[1:]:
pair[1] = '0' + pair[1]
for pair in hi[1:]:
pair[1] = '1' + pair[1]
_heappush_max(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
print heap
return sorted(_heappop_max(heap)[1:], key=lambda p: (len(p[1]), p))
But i've got output like this:
Symbol Weight Shannon-Fano Code
! 1 1
3 1 01
: 1 001
J 1 0001
V 1 00001
z 1 000001
E 3 0000001
L 3 00000001
P 3 000000001
N 4 0000000001
O 4 00000000001
Am I right using heapq to implement Shannon-Fano coding? The problem in this string:
_heappush_max(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
and I don't understand how to fix it.
Expect output similar to Huffman encoding
Symbol Weight Huffman Code
2875 01
a 744 1001
e 1129 1110
h 606 0000
i 610 0001
n 617 0010
o 668 1000
t 842 1100
d 358 10100
l 326 00110
Added:
Well, I've tried to do this without heapq, but have unstopable recursion:
def sf_encode(iA, iB, maxP):
global tupleList, total_sf
global mid
maxP = maxP/float(2)
sumP = 0
for i in range(iA, iB):
tup = tupleList[i]
if sumP < maxP or i == iA: # top group
sumP += tup[1]/float(total_sf)
tupleList[i] = (tup[0], tup[1], tup[2] + '0')
mid = i
else: # bottom group
tupleList[i] = (tup[0], tup[1], tup[2] + '1')
print tupleList
if mid - 1 > iA:
sf_encode(iA, mid - 1, maxP)
if iB - mid > 0:
sf_encode(mid, iB, maxP)
return tupleList

In Shannon-Fano coding you need the following steps:
A Shannon–Fano tree is built according to a specification designed to
define an effective code table. The actual algorithm is simple:
For a given list of symbols, develop a corresponding list of
probabilities or frequency counts so that each symbol’s relative
frequency of occurrence is known.
Sort the lists of symbols according
to frequency, with the most frequently occurring symbols at the left
and the least common at the right.
Divide the list into two parts,
with the total frequency counts of the left part being as close to the
total of the right as possible.
The left part of the list is assigned
the binary digit 0, and the right part is assigned the digit 1. This
means that the codes for the symbols in the first part will all start
with 0, and the codes in the second part will all start with 1.
Recursively apply the steps 3 and 4 to each of the two halves,
subdividing groups and adding bits to the codes until each symbol has
become a corresponding code leaf on the tree.
So you will need code to sort (your input appears already sorted so you may be able to skip this), plus a recursive function that chooses the best partition and then recurses on the first and second halves of the list.
Once the list is sorted, the order of the elements never changes so there is no need to use heapq to do this style of encoding.

Related

Find the common ordered characters between two strings

Given two strings, find the common characters between the two strings which are in same order from left to right.
Example 1
string_1 = 'hcarry'
string_2 = 'sallyc'
Output - 'ay'
Example 2
string_1 = 'jenny'
string_2 = 'ydjeu'
Output - 'je'
Explanation for Example 1 -
Common characters between string_1 and string_2 are c,a,y. But since c comes before ay in string_1 and after ay in string_2, we won't consider character c in output. The order of common characters between the two strings must be maintained and must be same.
Explanation for Example 2 -
Common characters between string_1 and string_2 are j,e,y. But since y comes before je in string_2 and after je in string_1, we won't consider character y in output. The order of common characters between the two strings must be maintained and must be same.
My approach -
Find the common characters between the strings and then store it in another variable for each individual string.
Example -
string_1 = 'hcarry'
string_2 = 'sallyc'
Common_characters = c,a,y
string_1_com = cay
string_2_com = ayc
I used sorted, counter, enumerate functions to get string_1_com and string_2_com in Python.
Now find the longest common sub-sequence in between string_1_com and string_2_com . You get the output as the result.
This is the brute force solution.
What is the optimal solution for this?

The algorithm for this is just called string matching in my book. It runs in O(mn) where m and n are the word lengths. I guess it might as well run on the full words, what's most efficient would depend on the expected number of common letters and how the sorting and filtering is performed. I will explain it for common letters strings as that's easier.
The idea is that you look at a directed acyclic graph of (m+1)*(n+1) nodes. Each path (from upper left to lower right) through this graph represents a unique way of matching the words. We want to match the strings, and additionally put in blanks (-) in the words so that they align with the highest number of common letters. For example the end state of cay and ayc would be
cay-
-ayc
Each node stores the highest number of matches for the partial matching which it represents, and at the end of the algorithm the end node will give us the highest number of matches.
We start at the upper left corner where nothing is matched with nothing and so we have 0 matching letters here (score 0).
c a y
0 . . .
a . . . .
y . . . .
c . . . .
We are to walk through this graph and for each node calculate the highest number of matching letters, by using the data from previous nodes.
The nodes are connected left->right, up->down and diagonally left-up->right-down.
Moving right represents consuming one letter from cay and matching the letter we arrive at with a - inserted in ayc.
Moving down represents the opposite (consuming from ayc and inserting - to cay).
Moving diagonally represents consuming one letter from each word and matching those.
Looking at the first node to the right of our starting node it represents the matching
c
-
and this node can (obviously) only be reached from the starting node.
All nodes in first row and first column will be 0 since they all represent matching one or more letters with an equal number of -.
We get the graph
c a y
0 0 0 0
a 0 . . .
y 0 . . .
c 0 . . .
That was the setup, now the interesting part begins.
Looking at the first unevaluated node, which represents matching the substrings c with a, we want to decide how we can get there with the most number of matching letters.
Alternative 1: We can get there from the node to the left. The node to the left represents the matching
-
a
so by choosing this path to get to our current node we arrive at
-c
a-
matching c with - gives us no correct match and thus the score for this path is 0 (taken from the last node) plus 0 (score for the match c/- just made). So 0 + 0 = 0 for this path.
Alternative 2: We can get to this node from above, this path represents moving from
c -> c-
- -a
which also gives us 0 extra points. Score for this is 0.
Alternative 3: We can get to this node from upper-left. This is moving from starting node (nothing at all) to consuming one character from each letter. That is matching
c
a
Since c and a is different letters we get 0 + 0 = 0 for this path as well.
c a y
0 0 0 0
a 0 0 . .
y 0 . . .
c 0 . . .
But for the next node it looks better. We still have the three alternatives to look at.
Alternative 1 & 2 always gives us 0 extra points as they always represent matching a letter with -, so those paths will give us score 0. Let's move on to alternative 3.
For our current node moving diagonally means going from
c -> ca
- -a
IT'S A MATCH!
That means there is a path to this node that gives us 1 in score. We throw away the 0s and save the 1.
c a y
0 0 0 0
a 0 0 1 .
y 0 . . .
c 0 . . .
For the last node on this row we look at our three alternatives and realize we won't get any new points (new matches), but we can get to the node by using our previous 1 point path:
ca -> cay
-a -a-
So this node is also 1 in score.
Doing this for all nodes we get the following complete graph
c a y
0 0 0 0
a 0 0 1 1
y 0 0 1 2
c 0 1 1 2
where the only increases in score come from
c -> ca | ca -> cay | - -> -c
- -a | -a -ay | y yc
An so the end node tells us the maximal match is 2 letters.
Since in your case you wish to know that longest path with score 2, you need to track, for each node, the path taken as well.
This graph is easily implemented as a matrix (or an array of arrays).
I would suggest that you as elements use a tuple with one score element and one path element and in the path element you just store the aligning letters, then the elements of the final matrix will be
c a y
0 0 0 0
a 0 0 (1, a) (1, a)
y 0 0 (1, a) (2, ay)
c 0 (1, c) (1, a/c) (2, ay)
At one place I noted a/c, this is because string ca and ayc have two different sub-sequences of maximum length. You need to decide what to do in those cases, either just go with one or save both.
EDIT:
Here's an implementation for this solution.
def longest_common(string_1, string_2):
len_1 = len(string_1)
len_2 = len(string_2)
m = [[(0,"") for _ in range(len_1 + 1)] for _ in range(len_2 + 1)] # intitate matrix
for row in range(1, len_2+1):
for col in range(1, len_1+1):
diag = 0
match = ""
if string_1[col-1] == string_2[row-1]: # score increase with one if letters match in diagonal move
diag = 1
match = string_1[col - 1]
# find best alternative
if m[row][col-1][0] >= m[row-1][col][0] and m[row][col-1][0] >= m[row-1][col-1][0]+diag:
m[row][col] = m[row][col-1] # path from left is best
elif m[row-1][col][0] >= m[row-1][col-1][0]+diag:
m[row][col] = m[row-1][col] # path from above is best
else:
m[row][col] = (m[row-1][col-1][0]+diag, m[row-1][col-1][1]+match) # path diagonally is best
return m[len_2][len_1][1]
>>> print(longest_common("hcarry", "sallyc"))
ay
>>> print(longest_common("cay", "ayc"))
ay
>>> m
[[(0, ''), (0, ''), (0, ''), (0, '')],
[(0, ''), (0, ''), (1, 'a'), (1, 'a')],
[(0, ''), (0, ''), (1, 'a'), (2, 'ay')],
[(0, ''), (1, 'c'), (1, 'c'), (2, 'ay')]]

Here is a simple, dynamic programming based implementation for the problem:
def lcs(X, Y):
m, n = len(X), len(Y)
L = [[0 for x in xrange(n+1)] for x in xrange(m+1)]
# using a 2D Matrix for dynamic programming
# L[i][j] stores length of longest common string for X[0:i] and Y[0:j]
for i in range(m+1):
for j in range(n+1):
if i == 0 or j == 0:
L[i][j] = 0
elif X[i-1] == Y[j-1]:
L[i][j] = L[i-1][j-1] + 1
else:
L[i][j] = max(L[i-1][j], L[i][j-1])
# Following code is used to find the common string
index = L[m][n]
# Create a character array to store the lcs string
lcs = [""] * (index+1)
lcs[index] = ""
# Start from the right-most-bottom-most corner and
# one by one store characters in lcs[]
i = m
j = n
while i > 0 and j > 0:
# If current character in X[] and Y are same, then
# current character is part of LCS
if X[i-1] == Y[j-1]:
lcs[index-1] = X[i-1]
i-=1
j-=1
index-=1
# If not same, then find the larger of two and
# go in the direction of larger value
elif L[i-1][j] > L[i][j-1]:
i-=1
else:
j-=1
print ("".join(lcs))

But.. you have already known term "longest common subsequence" and can find numerous descriptions of dynamic programming algorithm.
Wiki link
pseudocode
function LCSLength(X[1..m], Y[1..n])
C = array(0..m, 0..n)
for i := 0..m
C[i,0] = 0
for j := 0..n
C[0,j] = 0
for i := 1..m
for j := 1..n
if X[i] = Y[j] //i-1 and j-1 if reading X & Y from zero
C[i,j] := C[i-1,j-1] + 1
else
C[i,j] := max(C[i,j-1], C[i-1,j])
return C[m,n]
function backtrack(C[0..m,0..n], X[1..m], Y[1..n], i, j)
if i = 0 or j = 0
return ""
if X[i] = Y[j]
return backtrack(C, X, Y, i-1, j-1) + X[i]
if C[i,j-1] > C[i-1,j]
return backtrack(C, X, Y, i, j-1)
return backtrack(C, X, Y, i-1, j)

Much easier solution ----- Thank you!
def f(s, s1):
cc = list(set(s) & set(s1))
ns = ''.join([S for S in s if S in cc])
ns1 = ''.join([S for S in s1 if S in cc])
found = []
b = ns[0]
for e in ns[1:]:
cs = b+e
if cs in ns1:
found.append(cs)
b = e
return found

How to get the correct number of distinct combination locks with a margin or error of +-2?

I am trying to solve the usaco problem combination lock where you are given a two lock combinations. The locks have a margin of error of +- 2 so if you had a combination lock of 1-3-5, the combination 3-1-7 would still solve it.
You are also given a dial. For example, the dial starts at 1 and ends at the given number. So if the dial was 50, it would start at 1 and end at 50. Since the beginning of the dial is adjacent to the end of the dial, the combination 49-1-3 would also solve the combination lock of 1-3-5.
In this program, you have to output the number of distinct solutions to the two lock combinations. For the record, the combination 3-2-1 and 1-2-3 are considered distinct, but the combination 2-2-2 and 2-2-2 is not.
I have tried creating two functions, one to check whether three numbers match the constraints of the first combination lock and another to check whether three numbers match the constraints of the second combination lock.
a,b,c = 1,2,3
d,e,f = 5,6,7
dial = 50
def check(i,j,k):
i = (i+dial) % dial
j = (j+dial) % dial
k = (k+dial) % dial
if abs(a-i) <= 2 and abs(b-j) <= 2 and abs(c-k) <= 2:
return True
return False
def check1(i,j,k):
i = (i+dial) % dial
j = (j+dial) % dial
k = (k+dial) % dial
if abs(d-i) <= 2 and abs(e-j) <= 2 and abs(f-k) <= 2:
return True
return False
res = []
count = 0
for i in range(1,dial+1):
for j in range(1,dial+1):
for k in range(1,dial+1):
if check(i,j,k):
count += 1
res.append([i,j,k])
if check1(i,j,k):
count += 1
res.append([i,j,k])
print(sorted(res))
print(count)
The dial is 50 and the first combination is 1-2-3 and the second combination is 5-6-7.
The program should output 249 as the count, but it instead outputs 225. I am not really sure why this is happening. I have added the array for display purposes only. Any help would be greatly appreciated!

You're going to a lot of trouble to solve this by brute force.
First of all, your two check routines have identical functionality: just call the same routine for both combinations, giving the correct combination as a second set of parameters.
The critical logic problem is handling the dial wrap-around: you miss picking up the adjacent numbers. Run 49 through your check against a correct value of 1:
# using a=1, i=49
i = (1+50)%50 # i = 1
...
if abs(1-49) <= 2 ... # abs(1-49) is 48. You need it to show up as 2.
Instead, you can check each end of the dial:
a_diff = abs(i-a)
if a_diff <=2 or a_diff >= (dial-2) ...
Another way is to start by making a list of acceptable values:
a_vals = [(a-oops) % dial] for oops in range(-2, 3)]
... but note that you have to change the 0 value to dial. For instance, for a value of 1, you want a list of [49, 50, 1, 2, 3]
With this done, you can check like this:
if i in a_vals and j in b_vals and k in c_vals:
...
If you want to upgrade to the itertools package, you can simply generate all desired combinations:
combo = set(itertools.product(a_list, b_list_c_list) )
Do that for both given combinations and take the union of the two sets. The length of the union is the desired answer.
I see the follow-up isn't obvious -- at least, it's not appearing in the comments.
You have 5*5*5 solutions for each combination; start with 250 as your total.
Compute the sizes of the overlap sets: the numbers in each triple that can serve for each combination. For your given problem, those are [3],[4],[5]
The product of those set sizes is the quantity of overlap: 1*1*1 in this case.
The overlapping solutions got double-counted, so simply subtract the extra from 250, giving the answer of 249.
For example, given 1-2-3 and 49-6-6, you would get sets
{49, 50, 1}
{4}
{4, 5}
The sizes are 3, 1, 2; the product of those numbers is 6, so your answer is 250-6 = 244
Final note: If you're careful with your modular arithmetic, you can directly compute the set sizes without building the sets, making the program very short.

Here is one approach to a semi-brute-force solution:
import itertools
#The following code assumes 0-based combinations,
#represented as tuples of numbers in the range 0 to dial - 1.
#A simple wrapper function can be used to make the
#code apply to 1-based combos.
#The following function finds all combos which open lock with a given combo:
def combos(combo,tol,dial):
valids = []
for p in itertools.product(range(-tol,1+tol),repeat = 3):
valids.append(tuple((x+i)%dial for x,i in zip(combo,p)))
return valids
#The following finds all combos for a given iterable of target combos:
def all_combos(targets,tol,dial):
return set(combo for target in targets for combo in combos(target,tol,dial))
For example, len(all_combos([(0,1,2),(4,5,6)],2,50)) evaluate to 249.

The correct code for what you are trying to do is the following:
dial = 50
a = 1
b = 2
c = 3
d = 5
e = 6
f = 7
def check(i,j,k):
if (abs(a-i) <= 2 or (dial-abs(a-i)) <= 2) and \
(abs(b-j) <= 2 or (dial-abs(b-j)) <= 2) and \
(abs(c-k) <= 2 or (dial-abs(c-k)) <= 2):
return True
return False
def check1(i,j,k):
if (abs(d-i) <= 2 or (dial-abs(d-i)) <= 2) and \
(abs(e-j) <= 2 or (dial-abs(e-j)) <= 2) and \
(abs(f-k) <= 2 or (dial-abs(f-k)) <= 2):
return True
return False
res = []
count = 0
for i in range(1,dial+1):
for j in range(1,dial+1):
for k in range(1,dial+1):
if check(i,j,k):
count += 1
res.append([i,j,k])
elif check1(i,j,k):
count += 1
res.append([i,j,k])
print(sorted(res))
print(count)
And the result is 249, the total combinations are 2*(5**3) = 250, but we have the duplicates: [3, 4, 5]

Changing this Python program to have function def()

The following Python program flips a coin several times, then reports the longest series of heads and tails. I am trying to convert this program into a program that uses functions so it uses basically less code. I am very new to programming and my teacher requested this of us, but I have no idea how to do it. I know I'm supposed to have the function accept 2 parameters: a string or list, and a character to search for. The function should return, as the value of the function, an integer which is the longest sequence of that character in that string. The function shouldn't accept input or output from the user.
import random
print("This program flips a coin several times, \nthen reports the longest
series of heads and tails")
cointoss = int(input("Number of times to flip the coin: "))
varlist = []
i = 0
varstring = ' '
while i < cointoss:
r = random.choice('HT')
varlist.append(r)
varstring = varstring + r
i += 1
print(varstring)
print(varlist)
print("There's this many heads: ",varstring.count("H"))
print("There's this many tails: ",varstring.count("T"))
print("Processing input...")
i = 0
longest_h = 0
longest_t = 0
inarow = 0
prevIn = 0
while i < cointoss:
print(varlist[i])
if varlist[i] == 'H':
prevIn += 1
if prevIn > longest_h:
longest_h = prevIn
print("",longest_h,"")
inarow = 0
if varlist[i] == 'T':
inarow += 1
if inarow > longest_t:
longest_t = inarow
print("",longest_t,"")
prevIn = 0
i += 1
print ("The longest series of heads is: ",longest_h)
print ("The longest series of tails is: ",longest_t)
If this is asking too much, any explanatory help would be really nice instead. All I've got so far is:
def flip (a, b):
flipValue = random.randint
but it's barely anything.

import random
def Main():
numOfFlips=getFlips()
outcome=flipping(numOfFlips)
print(outcome)
def getFlips():
Flips=int(input("Enter number if flips:\n"))
return Flips
def flipping(numOfFlips):
longHeads=[]
longTails=[]
Tails=0
Heads=0
for flips in range(0,numOfFlips):
flipValue=random.randint(1,2)
print(flipValue)
if flipValue==1:
Tails+=1
longHeads.append(Heads) #recording value of Heads before resetting it
Heads=0
else:
Heads+=1
longTails.append(Tails)
Tails=0
longestHeads=max(longHeads) #chooses the greatest length from both lists
longestTails=max(longTails)
return "Longest heads:\t"+str(longestHeads)+"\nLongest tails:\t"+str(longestTails)
Main()
I did not quite understand how your code worked, so I made the code in functions that works just as well, there will probably be ways of improving my code alone but I have moved the code over to functions

First, you need a function that flips a coin x times. This would be one possible implementation, favoring random.choice over random.randint:
def flip(x):
result = []
for _ in range(x):
result.append(random.choice(("h", "t")))
return result
Of course, you could also pass from what exactly we are supposed to take a choice as a parameter.
Next, you need a function that finds the longest sequence of some value in some list:
def longest_series(some_value, some_list):
current, longest = 0, 0
for r in some_list:
if r == some_value:
current += 1
longest = max(current, longest)
else:
current = 0
return longest
And now you can call these in the right order:
# initialize the random number generator, so we get the same result
random.seed(5)
# toss a coin a hundred times
series = flip(100)
# count heads/tails
headflips = longest_series('h', series)
tailflips = longest_series('t', series)
# print the results
print("The longest series of heads is: " + str(headflips))
print("The longest series of tails is: " + str(tailflips))
Output:
>> The longest series of heads is: 8
>> The longest series of heads is: 5
edit: removed the flip implementation with yield, it made the code weird.

Counting the longest run
Let see what you have asked for
I'm supposed to have the function accept 2 parameters: a string or list,
or, generalizing just a bit, a sequence
and a character
again, we'd speak, generically, of an item
to search for. The function should return, as the value of the
function, an integer which is the longest sequence of that character
in that string.
My implementation of the function you are asking for, complete of doc
string, is
def longest_run(i, s):
'Counts the longest run of item "i" in sequence "s".'
c, m = 0, 0
for el in s:
if el==i:
c += 1
elif c:
m = m if m >= c else c
c = 0
return m
We initialize c (current run) and m (maximum run so far) to zero,
then we loop, looking at every element el of the argument sequence s.
The logic is straightforward but for elif c: whose block is executed at the end of a run (because c is greater than zero and logically True) but not when the previous item (not the current one) was not equal to i. The savings are small but are savings...
Flipping coins (and more...)
How can we simulate flipping n coins? We abstract the problem and recognize that flipping n coins corresponds to choosing from a collection of possible outcomes (for a coin, either head or tail) for n times.
As it happens, the random module of the standard library has the exact answer to this problem
In [52]: random.choices?
Signature: choices(population, weights=None, *, cum_weights=None, k=1)
Docstring:
Return a k sized list of population elements chosen with replacement.
If the relative weights or cumulative weights are not specified,
the selections are made with equal probability.
File: ~/lib/miniconda3/lib/python3.6/random.py
Type: method
Our implementation, aimed at hiding details, could be
def roll(n, l):
'''Rolls "n" times a dice/coin whose face values are listed in "l".
E.g., roll(2, range(1,21)) -> [12, 4] simulates rolling 2 icosahedron dices.
'''
from random import choices
return choices(l, k=n)
Putting this together
def longest_run(i, s):
'Counts the longest run of item "i" in sequence "s".'
c, m = 0, 0
for el in s:
if el==i:
c += 1
elif c:
m = m if m >= c else c
c = 0
return m
def roll(n, l):
'''Rolls "n" times a dice/coin whose face values are listed in "l".
E.g., roll(2, range(1,21)) -> [12, 4] simulates rolling 2 icosahedron dices.
'''
from random import choices
return choices(l, k=n)
N = 100 # n. of flipped coins
h_or_t = ['h', 't']
random_seq_of_h_or_t = flip(N, h_or_t)
max_h = longest_run('h', random_seq_of_h_or_t)
max_t = longest_run('t', random_seq_of_h_or_t)

Effcient way to find longest duplicate string for Python (From Programming Pearls)

From Section 15.2 of Programming Pearls
The C codes can be viewed here: http://www.cs.bell-labs.com/cm/cs/pearls/longdup.c
When I implement it in Python using suffix-array:
example = open("iliad10.txt").read()
def comlen(p, q):
i = 0
for x in zip(p, q):
if x[0] == x[1]:
i += 1
else:
break
return i
suffix_list = []
example_len = len(example)
idx = list(range(example_len))
idx.sort(cmp = lambda a, b: cmp(example[a:], example[b:])) #VERY VERY SLOW
max_len = -1
for i in range(example_len - 1):
this_len = comlen(example[idx[i]:], example[idx[i+1]:])
print this_len
if this_len > max_len:
max_len = this_len
maxi = i
I found it very slow for the idx.sort step. I think it's slow because Python need to pass the substring by value instead of by pointer (as the C codes above).
The tested file can be downloaded from here
The C codes need only 0.3 seconds to finish.
time cat iliad10.txt |./longdup
On this the rest of the Achaeans with one voice were for
respecting the priest and taking the ransom that he offered; but
not so Agamemnon, who spoke fiercely to him and sent him roughly
away.
real 0m0.328s
user 0m0.291s
sys 0m0.006s
But for Python codes, it never ends on my computer (I waited for 10 minutes and killed it)
Does anyone have ideas how to make the codes efficient? (For example, less than 10 seconds)

My solution is based on Suffix arrays. It is constructed by Prefix doubling the Longest common prefix. The worst-case complexity is O(n (log n)^2). The file "iliad.mb.txt" takes 4 seconds on my laptop. The longest_common_substring function is short and can be easily modified, e.g. for searching the 10 longest non-overlapping substrings. This Python code is faster than the original C code from the question, if duplicate strings are longer than 10000 characters.
from itertools import groupby
from operator import itemgetter
def longest_common_substring(text):
"""Get the longest common substrings and their positions.
>>> longest_common_substring('banana')
{'ana': [1, 3]}
>>> text = "not so Agamemnon, who spoke fiercely to "
>>> sorted(longest_common_substring(text).items())
[(' s', [3, 21]), ('no', [0, 13]), ('o ', [5, 20, 38])]
This function can be easy modified for any criteria, e.g. for searching ten
longest non overlapping repeated substrings.
"""
sa, rsa, lcp = suffix_array(text)
maxlen = max(lcp)
result = {}
for i in range(1, len(text)):
if lcp[i] == maxlen:
j1, j2, h = sa[i - 1], sa[i], lcp[i]
assert text[j1:j1 + h] == text[j2:j2 + h]
substring = text[j1:j1 + h]
if not substring in result:
result[substring] = [j1]
result[substring].append(j2)
return dict((k, sorted(v)) for k, v in result.items())
def suffix_array(text, _step=16):
"""Analyze all common strings in the text.
Short substrings of the length _step a are first pre-sorted. The are the
results repeatedly merged so that the garanteed number of compared
characters bytes is doubled in every iteration until all substrings are
sorted exactly.
Arguments:
text: The text to be analyzed.
_step: Is only for optimization and testing. It is the optimal length
of substrings used for initial pre-sorting. The bigger value is
faster if there is enough memory. Memory requirements are
approximately (estimate for 32 bit Python 3.3):
len(text) * (29 + (_size + 20 if _size > 2 else 0)) + 1MB
Return value: (tuple)
(sa, rsa, lcp)
sa: Suffix array for i in range(1, size):
assert text[sa[i-1]:] < text[sa[i]:]
rsa: Reverse suffix array for i in range(size):
assert rsa[sa[i]] == i
lcp: Longest common prefix for i in range(1, size):
assert text[sa[i-1]:sa[i-1]+lcp[i]] == text[sa[i]:sa[i]+lcp[i]]
if sa[i-1] + lcp[i] < len(text):
assert text[sa[i-1] + lcp[i]] < text[sa[i] + lcp[i]]
>>> suffix_array(text='banana')
([5, 3, 1, 0, 4, 2], [3, 2, 5, 1, 4, 0], [0, 1, 3, 0, 0, 2])
Explanation: 'a' < 'ana' < 'anana' < 'banana' < 'na' < 'nana'
The Longest Common String is 'ana': lcp[2] == 3 == len('ana')
It is between tx[sa[1]:] == 'ana' < 'anana' == tx[sa[2]:]
"""
tx = text
size = len(tx)
step = min(max(_step, 1), len(tx))
sa = list(range(len(tx)))
sa.sort(key=lambda i: tx[i:i + step])
grpstart = size * [False] + [True] # a boolean map for iteration speedup.
# It helps to skip yet resolved values. The last value True is a sentinel.
rsa = size * [None]
stgrp, igrp = '', 0
for i, pos in enumerate(sa):
st = tx[pos:pos + step]
if st != stgrp:
grpstart[igrp] = (igrp < i - 1)
stgrp = st
igrp = i
rsa[pos] = igrp
sa[i] = pos
grpstart[igrp] = (igrp < size - 1 or size == 0)
while grpstart.index(True) < size:
# assert step <= size
nextgr = grpstart.index(True)
while nextgr < size:
igrp = nextgr
nextgr = grpstart.index(True, igrp + 1)
glist = []
for ig in range(igrp, nextgr):
pos = sa[ig]
if rsa[pos] != igrp:
break
newgr = rsa[pos + step] if pos + step < size else -1
glist.append((newgr, pos))
glist.sort()
for ig, g in groupby(glist, key=itemgetter(0)):
g = [x[1] for x in g]
sa[igrp:igrp + len(g)] = g
grpstart[igrp] = (len(g) > 1)
for pos in g:
rsa[pos] = igrp
igrp += len(g)
step *= 2
del grpstart
# create LCP array
lcp = size * [None]
h = 0
for i in range(size):
if rsa[i] > 0:
j = sa[rsa[i] - 1]
while i != size - h and j != size - h and tx[i + h] == tx[j + h]:
h += 1
lcp[rsa[i]] = h
if h > 0:
h -= 1
if size > 0:
lcp[0] = 0
return sa, rsa, lcp
I prefer this solution over more complicated O(n log n) because Python has a very fast list sorting algorithm (Timsort). Python's sort is probably faster than necessary linear time operations in the method from that article, that should be O(n) under very special presumptions of random strings together with a small alphabet (typical for DNA genome analysis). I read in Gog 2011 that worst-case O(n log n) of my algorithm can be in practice faster than many O(n) algorithms that cannot use the CPU memory cache.
The code in another answer based on grow_chains is 19 times slower than the original example from the question, if the text contains a repeated string 8 kB long. Long repeated texts are not typical for classical literature, but they are frequent e.g. in "independent" school homework collections. The program should not freeze on it.
I wrote an example and tests with the same code for Python 2.7, 3.3 - 3.6.

The translation of the algorithm into Python:
from itertools import imap, izip, starmap, tee
from os.path import commonprefix
def pairwise(iterable): # itertools recipe
a, b = tee(iterable)
next(b, None)
return izip(a, b)
def longest_duplicate_small(data):
suffixes = sorted(data[i:] for i in xrange(len(data))) # O(n*n) in memory
return max(imap(commonprefix, pairwise(suffixes)), key=len)
buffer() allows to get a substring without copying:
def longest_duplicate_buffer(data):
n = len(data)
sa = sorted(xrange(n), key=lambda i: buffer(data, i)) # suffix array
def lcp_item(i, j): # find longest common prefix array item
start = i
while i < n and data[i] == data[i + j - start]:
i += 1
return i - start, start
size, start = max(starmap(lcp_item, pairwise(sa)), key=lambda x: x[0])
return data[start:start + size]
It takes 5 seconds on my machine for the iliad.mb.txt.
In principle it is possible to find the duplicate in O(n) time and O(n) memory using a suffix array augmented with a lcp array.
Note: *_memoryview() is deprecated by *_buffer() version
More memory efficient version (compared to longest_duplicate_small()):
def cmp_memoryview(a, b):
for x, y in izip(a, b):
if x < y:
return -1
elif x > y:
return 1
return cmp(len(a), len(b))
def common_prefix_memoryview((a, b)):
for i, (x, y) in enumerate(izip(a, b)):
if x != y:
return a[:i]
return a if len(a) < len(b) else b
def longest_duplicate(data):
mv = memoryview(data)
suffixes = sorted((mv[i:] for i in xrange(len(mv))), cmp=cmp_memoryview)
result = max(imap(common_prefix_memoryview, pairwise(suffixes)), key=len)
return result.tobytes()
It takes 17 seconds on my machine for the iliad.mb.txt. The result is:
On this the rest of the Achaeans with one voice were for respecting
the priest and taking the ransom that he offered; but not so Agamemnon,
who spoke fiercely to him and sent him roughly away.
I had to define custom functions to compare memoryview objects because memoryview comparison either raises an exception in Python 3 or produces wrong result in Python 2:
>>> s = b"abc"
>>> memoryview(s[0:]) > memoryview(s[1:])
True
>>> memoryview(s[0:]) < memoryview(s[1:])
True
Related questions:
Find the longest repeating string and the number of times it repeats in a given string
finding long repeated substrings in a massive string

The main problem seems to be that python does slicing by copy: https://stackoverflow.com/a/5722068/538551
You'll have to use a memoryview instead to get a reference instead of a copy. When I did this, the program hung after the idx.sort function (which was very fast).
I'm sure with a little work, you can get the rest working.
Edit:
The above change will not work as a drop-in replacement because cmp does not work the same way as strcmp. For example, try the following C code:
#include <stdio.h>
#include <string.h>
int main() {
char* test1 = "ovided by The Internet Classics Archive";
char* test2 = "rovided by The Internet Classics Archive.";
printf("%d\n", strcmp(test1, test2));
}
And compare the result to this python:
test1 = "ovided by The Internet Classics Archive";
test2 = "rovided by The Internet Classics Archive."
print(cmp(test1, test2))
The C code prints -3 on my machine while the python version prints -1. It looks like the example C code is abusing the return value of strcmp (it IS used in qsort after all). I couldn't find any documentation on when strcmp will return something other than [-1, 0, 1], but adding a printf to pstrcmp in the original code showed a lot of values outside of that range (3, -31, 5 were the first 3 values).
To make sure that -3 wasn't some error code, if we reverse test1 and test2, we'll get 3.
Edit:
The above is interesting trivia, but not actually correct in terms of affecting either chunks of code. I realized this just as I shut my laptop and left a wifi zone... Really should double check everything before I hit Save.
FWIW, cmp most certainly works on memoryview objects (prints -1 as expected):
print(cmp(memoryview(test1), memoryview(test2)))
I'm not sure why the code isn't working as expected. Printing out the list on my machine does not look as expected. I'll look into this and try to find a better solution instead of grasping at straws.

This version takes about 17 secs on my circa-2007 desktop using totally different algorithm:
#!/usr/bin/env python
ex = open("iliad.mb.txt").read()
chains = dict()
# populate initial chains dictionary
for (a,b) in enumerate(zip(ex,ex[1:])) :
s = ''.join(b)
if s not in chains :
chains[s] = list()
chains[s].append(a)
def grow_chains(chains) :
new_chains = dict()
for (string,pos) in chains :
offset = len(string)
for p in pos :
if p + offset >= len(ex) : break
# add one more character
s = string + ex[p + offset]
if s not in new_chains :
new_chains[s] = list()
new_chains[s].append(p)
return new_chains
# grow and filter, grow and filter
while len(chains) > 1 :
print 'length of chains', len(chains)
# remove chains that appear only once
chains = [(i,chains[i]) for i in chains if len(chains[i]) > 1]
print 'non-unique chains', len(chains)
print [i[0] for i in chains[:3]]
chains = grow_chains(chains)
The basic idea is to create a list of substrings and positions where they occure, thus eliminating the need to compare same strings again and again. The resulting list look like [('ind him, but', [466548, 739011]), (' bulwark bot', [428251, 428924]), (' his armour,', [121559, 124919, 193285, 393566, 413634, 718953, 760088])]. Unique strings are removed. Then every list member grows by 1 character and new list is created. Unique strings are removed again. And so on and so forth...

Recursive Permutation Generator, swapping list items not working

I want to systematically generate permutations of the alphabet.
I cannot don't want to use python itertools.permutation, because pregenerating a list of every permutation causes my computer to crash (first time i actually got it to force a shutdown, it was pretty great).
Therefore, my new approach is to generate and test each key on the fly. Currently, I am trying to handle this with recursion.
My idea is to start with the largest list (i'll use a 3 element list as an example), recurse in to smaller list until the list is two elements long. Then, it will print the list, swap the last two, print the list again, and return up one level and repeat.
For example, for 123
123 (swap position 0 with position 0)
23 --> 123 (swap position 1 with position 1)
32 --> 132 (swap position 1 with position 2)
213 (swap position 0 with position 1)
13 --> 213 (swap position 1 with position 1)
31 --> 231 (swap position 1 with position 2)
321 (swap position 0 with position 2)
21 --> 321 (swap position 1 with position 1)
12 --> 312 (swap position 1 with position 2)
for a four letter number (1234)
1234 (swap position 0 with position 0)
234 (swap position 1 with position 1)
34 --> 1234
43 --> 1243
324 (swap position 1 with position 2)
24 --> 1324
42 --> 1342
432 (swap position 1 with position 3)
32 --> 1432
23 --> 1423
2134 (swap position 0 for position 1)
134 (swap position 1 with position 1)
34 --> 2134
43 --> 2143
314 (swap position 1 with position 2)
14--> 2314
41--> 2341
431 (swap position 1 with position 3)
31--> 2431
13 -->2413
This is the code i currently have for the recursion, but its causing me a lot of grief, recursion not being my strong suit. Here's what i have.
def perm(x, y, key):
print "Perm called: X=",x,", Y=",y,", key=",key
while (x<y):
print "\tLooping Inward"
print "\t", x," ",y," ", key
x=x+1
key=perm(x, y, key)
swap(x,y,key)
print "\tAfter 'swap':",x," ",y," ", key, "\n"
print "\nFull Depth Reached"
#print key, " SWAPPED:? ",swap(x,y,key)
print swap(x, y, key)
print " X=",x,", Y=",y,", key=",key
return key
def swap(x, y, key):
v=key[x]
key[x]=key[y]
key[y]=v
return key
Any help would be greatly appreciated, this is a really cool project and I don't want to abandon it.
Thanks to all! Comments on my method or anything are welcome.

Happened upon my old question later in my career
To efficiently do this, you want to write a generator.
Instead of returning a list of all of the permutations, which requires that you store them (all of them) in memory, a generator returns one permutation (one element of this list), then pauses, and then computes the next one when you ask for it.
The advantages to generators are:
Take up much less space.
Generators take up between 40 and 80 bytes of space. One generators can have generate millions of items.
A list with one item takes up 40 bytes. A list with 1000 items takes up 4560 bytes
More efficient
Only computes as many values as you need. In permuting the alphabet, if the correct permutation was found before the end of the list, the time spend generating all of the other permutations was wasted.
(Itertools.permutation is an example of a generator)
How do I Write a Generator?
Writing a generator in python is actually very easy.
Basically, write code that would work for generating a list of permutations. Now, instead of writing resultList+=[ resultItem ], write yield(resultItem).
Now you've made a generator. If I wanted to loop over my generator, I could write
for i in myGenerator:
It's that easy.
Below is a generator for the code that I tried to write long ago:
def permutations(iterable, r=None):
# permutations('ABCD', 2) --> AB AC AD BA BC BD CA CB CD DA DB DC
# permutations(range(3)) --> 012 021 102 120 201 210
pool = tuple(iterable)
n = len(pool)
r = n if r is None else r
if r > n:
return
indices = range(n)
cycles = range(n, n-r, -1)
yield tuple(pool[i] for i in indices[:r])
while n:
for i in reversed(range(r)):
cycles[i] -= 1
if cycles[i] == 0:
indices[i:] = indices[i+1:] + indices[i:i+1]
cycles[i] = n - i
else:
j = cycles[i]
indices[i], indices[-j] = indices[-j], indices[i]
yield tuple(pool[i] for i in indices[:r])
break
else:
return

I think you have a really good idea, but keeping track of the positions might get a bit difficult to deal with. The general way I've seen for generating permutations recursively is a function which takes two string arguments: one to strip characters from (str) and one to add characters to (soFar).
When generating a permutation then we can think of taking characters from str and adding them to soFar. Assume we have a function perm that takes these two arguments and finds all permutations of str. We can then consider the current string str. We'll have permutations beginning with each character in str so we just need to loop over str, using each of these characters as the initial character and calling perm on the characters remaining in the string:
// half python half pseudocode
def perm(str, soFar):
if(str == ""):
print soFar // here we have a valid permutation
return
for i = 0 to str.length:
next = soFar + str[i]
remaining = str.substr(0, i) + str.substr(i+1)
perm(remaining, next)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Shannon-Fano code as max-heap in python - python

Related

Find the common ordered characters between two strings

How to get the correct number of distinct combination locks with a margin or error of +-2?

Changing this Python program to have function def()

Effcient way to find longest duplicate string for Python (From Programming Pearls)

Recursive Permutation Generator, swapping list items not working

Categories

Resources