Finding longest alphabetical substring - understanding the concepts in Python

I am completing the Introduction to Computer Science and Programming Using Python Course and am stuck on Week 1: Python Basics - Problem Set 1 - Problem 3.
The problem asks:
Assume s is a string of lower case characters.
Write a program that prints the longest substring of s in which the
letters occur in alphabetical order. For example, if s = 'azcbobobegghakl', then your program should print
Longest substring in alphabetical order is: beggh
In the case of ties, print the first substring. For example, if s = 'abcbcd', then your program should print
Longest substring in alphabetical order is: abc
There are many posts on Stack Overflow where people are just chasing or giving the code as the answer. I am looking to understand the concept behind the code, as I am new to programming and want to gain a better understanding of the basics.
I found the following code that seems to answer the question. I understand the basic concept of the for loop, but I am having trouble understanding how to use for loops to find alphabetical sequences in a string.
Can someone please help me understand the concept of using for loops in this way?
s = 'cyqfjhcclkbxpbojgkar'
lstring = s[0]
slen = 1
for i in range(len(s)):
    for j in range(i,len(s)-1):
        if s[j+1] >= s[j]:
            if (j+1)-i+1 > slen:
                lstring = s[i:(j+1)+1]
                slen = (j+1)-i+1
        else:
            break
print("Longest substring in alphabetical order is: " + lstring)

Let's go through your code step by step.
First we assume that the first character forms the longest sequence. What we will do is try improving this guess.
s = 'cyqfjhcclkbxpbojgkar'
lstring = s[0]
slen = 1
The first loop then picks an index i, which will be the start of a sequence. From there, we check every sequence starting at i by looping over the possible ends of that sequence with the nested loop.
for i in range(len(s)): # This loops over every index of the string
    for j in range(i,len(s)-1): # This loops over the indices from i onward
These nested loops allow us to check every substring by picking every combination of i and j.
The first if statement intends to check if that sequence is still an increasing one. If it is not we break the inner loop as we are not interested in that sequence.
if s[j+1] >= s[j]:
    ...
else:
    break
We finally need to check if the current sequence we are looking at is better than our current guess by comparing its length to slen, which is our best guess.
if (j+1)-i+1 > slen:
    lstring = s[i:(j+1)+1]
    slen = (j+1)-i+1
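To check your understanding, here is a minimal sketch that wraps the same two loops in a function so the examples from the problem statement can be tried directly. The name longest_alphabetical is my own and not part of the problem set, and the sketch assumes Python 3.

```python
def longest_alphabetical(s):
    """Brute-force search: try every start i and extend while letters are in order."""
    best = s[:1]  # first guess: the first character (empty if s is empty)
    for i in range(len(s)):
        for j in range(i, len(s) - 1):
            if s[j + 1] >= s[j]:
                candidate = s[i:j + 2]  # the substring from i up to and including j+1
                if len(candidate) > len(best):
                    best = candidate    # strictly longer, so the first one wins on ties
            else:
                break  # order broken: no longer substring can start at i
    return best


print("Longest substring in alphabetical order is: " + longest_alphabetical('azcbobobegghakl'))
```

Because the comparison is strictly greater than, a later substring of equal length never replaces an earlier one, which is how the tie rule is satisfied.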
Improvements
Note that this code is not optimal, as it needlessly traverses your string multiple times. You could implement a more efficient approach that traverses the string only once to recover all increasing substrings and then uses max to pick the longest one.
s = 'cyqfjhcclkbxpbojgkar'
substrings = []
start = 0
end = 1
while end < len(s):
    if s[end - 1] > s[end]:
        # The run s[start:end] just ended; the next run starts at end.
        substrings.append(s[start:end])
        start = end
    end += 1
substrings.append(s[start:end])  # the final run has no break after it
lstring = max(substrings, key=len)
print("Longest substring in alphabetical order is: " + lstring)
The list substrings looks like this after the final append: ['cy', 'q', 'fj', 'h', 'ccl', 'k', 'bx', 'p', 'bo', 'j', 'gk', 'ar']
From these, max(..., key=len) picks the longest one. On ties it returns the first, which is what the problem asks for.
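If you do not need the intermediate list, a single pass can also track only the best run seen so far. This is a minimal sketch under that assumption; longest_run is a name I made up for illustration, and the code targets Python 3.

```python
def longest_run(s):
    """Single pass: remember only the start and length of the best run so far."""
    best_start, best_len = 0, min(len(s), 1)
    start = 0
    for end in range(1, len(s) + 1):
        # A run ends at the end of the string or where the order breaks.
        if end == len(s) or s[end - 1] > s[end]:
            if end - start > best_len:  # strictly longer keeps the first on ties
                best_start, best_len = start, end - start
            start = end                 # the next run starts right here
    return s[best_start:best_start + best_len]


print("Longest substring in alphabetical order is: " + longest_run('cyqfjhcclkbxpbojgkar'))
```

Treating the end of the string as a break is what makes sure the last run is considered too.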

Related

for j in anagram(word[:i] + word[i+1:]): <- how it works?

I built an anagram generator. It works, but I don't understand how the for loop over the function call works at line 8. Why does it work only with
for j in anagram(word[:i] + word[i+1:]):
and not with
for j in anagram(word):
Also, I want to know what
for j in anagram(...)
means and does.
What is j doing in this for loop?
This is my full code:
def anagram(word):
    n = len(word)
    anagrams = []
    if n <= 1:
        return word
    else:
        for i in range(n):
            for j in anagram(word[:i] + word[i+1:]):
                anagrams.append(word[i:i+1] + j)
        return anagrams
if __name__ == "__main__":
    print(anagram("abc"))
The reason you can't write for j in anagram(word) is that it creates an infinite loop.
So for example if I write the recursive factorial function,
def fact(n):
    if n <= 1:
        return 1
    return n * fact(n - 1)
This works and is not a circular definition because I am giving the computer two separate equations to compute the factorial:
n! = 1
n! = n (n-1)!
and I am telling it when to use each of these: the first one when n is 0 or 1, the second when n is larger than that. The key to its working is that eventually we stop using the second definition, and we instead use the first definition, which is called the “base case.” If I were to instead say another true definition like that n! = n! the computer would follow those instructions but we would never reduce down to the base case and so we would enter an infinite recursive loop. This loop would probably exhaust a resource called the “stack” rapidly, leading to errors about “excessive recursion” or too many “stack frames” or just “stack overflow” (for which this site is named!). And then if you gave it a mathematically invalid expression like n! = n n! it would infinitely loop and also it would be wrong even if it did not infinitely loop.
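To see the failure mode concretely, here is a small sketch of what happens when the recursive call receives the same word again. broken_anagram is a hypothetical name for illustration, and the sketch assumes Python 3.5 or later, where the interpreter raises RecursionError once its recursion limit is hit.

```python
def broken_anagram(word):
    """Deliberately wrong: recurses on the same word, so it never reaches the base case."""
    if len(word) <= 1:
        return [word]
    result = []
    for j in broken_anagram(word):  # the input never shrinks
        result.append(j)
    return result


try:
    broken_anagram("abc")
    outcome = "finished"
except RecursionError as exc:
    outcome = type(exc).__name__

print(outcome)
```

Python stops the runaway recursion at its recursion limit instead of crashing the process, which is the "excessive recursion" error described above.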
Factorials and anagrams are closely related, in fact we can say mathematically that
len(anagrams(f)) == fact(len(f))
so solving one means solving the other. In this case we are saying that the anagrams of a word which is empty or of length 1 are just [word], the list containing just that word. (Your code returns the bare string word here rather than [word], which is a small bug.)
The anagram of any other word must have something to do with anagrams of words of length len(word) - 1. So what we do is we pull each character out of the word and put it at the front of the anagram. So word[:i] + word[i+1:] is the word except it is missing the letter at index i, and word[i:i+1] is the space between these -- in other words it is the letter at index i.
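Putting those pieces together, here is a minimal sketch of the same function with the base case returning a list. It keeps the names from the question, and the only behavioural change I am assuming you want is [word] instead of word in the base case.

```python
def anagram(word):
    """Return every ordering of the letters of word (duplicates included)."""
    if len(word) <= 1:
        return [word]  # base case: a list holding the word itself
    anagrams = []
    for i in range(len(word)):
        rest = word[:i] + word[i + 1:]   # the word without the letter at index i
        for j in anagram(rest):          # j is one anagram of the shorter word
            anagrams.append(word[i] + j) # put the removed letter in front
    return anagrams


print(anagram("abc"))
```

Each recursive call works on a word that is one letter shorter, so the recursion always reaches the base case.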
This is NOT an answer but a guide for you to understand the logic by yourself.
Firstly, you should understand one thing: anagram(word[:i] + word[i+1:]) is not the same as anagram(word).
>>> a = 'abcd'
>>> a[:2] + a[(2+1):]
'abd'
You can clearly see the difference.
For a clearer understanding, I would recommend printing the word at every level of the recursion: put a print(word) statement before the loop starts.

Anagrams code resulting in infinite results

I need to generate anagrams for an application. I am using the following code for generating anagrams
def anagrams(s):
    if len(s) < 2:
        return s
    else:
        tmp = []
        for i, letter in enumerate(s):
            for j in anagrams(s[:i]+s[i+1:]):
                tmp.append(j+letter)
                print (j+letter)
        return tmp
The code above works in general. However, it prints infinite results when the following string is passed
str = "zzzzzzziizzzz"
print anagrams(str)
Can someone tell me where I am going wrong? I need unique anagrams of a string
This is not an infinity of results; this is 13!(*) words (a bit over 6 billion). You are facing a combinatorial explosion.
(*) 13 factorial.
Others have pointed out that your code produces 13! anagrams, many of them duplicates. Your string of 11 z's and 2 i's has only 78 unique anagrams, however. (That's 13! / (11!·2!) or 13·12 / 2.)
If you want only these strings, make sure that you don't recurse down for the same letter more than once:
def anagrams(s):
    if len(s) < 2:
        return s
    else:
        tmp = []
        for i, letter in enumerate(s):
            if letter not in s[:i]:
                for j in anagrams(s[:i] + s[i+1:]):
                    tmp.append(letter + j)
        return tmp
The additional test is probably not the most effective way to tell whether a letter has already been used, but in your case with many duplicate letters it will save a lot of recursions.
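As a quick sanity check of the count, the sketch below runs the same idea on your string and compares the result with 13! / (11!·2!). It assumes Python 3; unique_anagrams is an illustrative name, and the base case returns [s] rather than s so that the function always returns a list.

```python
from math import factorial


def unique_anagrams(s):
    """Anagrams of s without duplicates: recurse only once per distinct letter."""
    if len(s) < 2:
        return [s]
    result = []
    for i, letter in enumerate(s):
        if letter not in s[:i]:  # skip letters already used at this position
            for tail in unique_anagrams(s[:i] + s[i + 1:]):
                result.append(letter + tail)
    return result


word = "zzzzzzziizzzz"
found = unique_anagrams(word)
expected = factorial(13) // (factorial(11) * factorial(2))
print(len(found), expected)
```

Both numbers should agree, and the call returns almost immediately because the recursion never branches on a repeated letter.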
There aren't infinite results, just 13!, or 6,227,020,800.
You're just not waiting long enough for the 6 billion results.
Note that much of the output is duplicates. If you don't mean to print the duplicates, then the number of results is much smaller.

Python 2.7 "list index out of range"

I keep getting "IndexError: list index out of range". The code does fine with things like "s = 'miruxsexxzlbveznyaidekl'", but this particular length makes it throw an error. Can anyone help me understand what I did wrong here, not just give me the answer? (I'd like to not have to come back and ask more questions, haha.)
__author__ = 'Krowzer'
s = 'abcdefghijklmnopqrstuvwxyz'
def alpha(x):
    current_substring = []
    all_substring = []
    for l in range(len(x) - 1):
        current_substring.append(x[l])
        if x[l + 1] < x[l]:
            all_substring.append(current_substring)
            #print("current: ", current_substring)
            current_substring = []
    #print(all_substring)
    largest = all_substring[0]
    for i in range(len(all_substring)):
        if len(all_substring[i]) > len(largest):
            largest = all_substring[i]
    answer = ''.join(largest)
    print('Longest substring in alphabetical order is: ' + answer )
alpha(s)
I can try to explain what is going on.
You are trying to find the longest substring in alphabetical order by looking for the end of the substring. Your definition of the end is a character that is less than the one before it, that is, a step in descending alphabetical order.
Your example string contains no such character, so the initial loop never finds an end. As a result, all_substring is empty, and trying to get any element out of it (such as all_substring[0]) generates an error.
You can fix the code yourself. The easiest is probably just to check if it is empty. If so, then the entire original string is the match.
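The failure can be reproduced in isolation. The sketch below shows the error that indexing an empty list raises, followed by one possible guard. The fallback to the whole input is my assumption about the simplest fix, not the only option, and the code assumes Python 3.

```python
x = 'abcdefghijklmnopqrstuvwxyz'
all_substring = []  # what alpha() has collected for a fully sorted input

try:
    largest = all_substring[0]
except IndexError as exc:
    error_message = str(exc)  # the same message as in the traceback

# Hypothetical guard: if no break was ever found, the whole string is the match.
largest = all_substring[0] if all_substring else list(x)
answer = ''.join(largest)
print('Longest substring in alphabetical order is: ' + answer)
```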
EDIT:
On second thought, there are two errors in the code. One is that the last character is not being considered. The second is that the final substring is not being considered.
def alpha(x):
    current_substring = []
    all_substring = []
    for l in range(len(x)):
        current_substring.append(x[l])
        if l < len(x) - 1 and x[l + 1] < x[l]:
            all_substring.append(current_substring)
            #print("current: ", current_substring)
            current_substring = []
    print(all_substring)
    all_substring.append(current_substring)
    largest = all_substring[0]
    for i in range(len(all_substring)):
        if len(all_substring[i]) > len(largest):
            largest = all_substring[i]
    answer = ''.join(largest)
    print('Longest substring in alphabetical order is: ' + answer )

python recursion with bubble sort

So, I have this problem where I receive two strings of the letters ACGT: one contains only letters, the other contains letters and dashes ("-"). Both are the same length. The string with the dashes is compared to the string without them, cell by cell, and each pairing has a score.
For example:
dna1: -ACA
dna2: TACG
The score is -1: a dash compared to a letter (T) gives -2, a letter compared to the same letter gives +1 (A to A and C to C), and non-matching letters give -1 (A to G), so the sum is -1.
I wrote this code for the scoring system:
def get_score(dna1, dna2, match=1, mismatch=-1, gap=-2):
    """Return the score of two equal-length sequences, compared cell by cell."""
    score = 0
    for index in range(len(dna1)):
        # Compare values with ==, not identity with "is".
        if dna1[index] == dna2[index]:
            score += match
        elif "-" not in (dna1[index], dna2[index]):
            score += mismatch
        else:
            score += gap
    return score
This is working fine.
Now I have to use recursion to find the best possible score for two strings.
I receive two strings, and they can be of different sizes this time (I can't change the order of the letters).
So I wrote this code that adds "-" as many times as needed to the shorter string to create two strings of the same length, and puts the dashes at the start of the list. Now I want to start moving the dashes and record the score for every dash position, and finally get the highest possible score. For moving the dashes around I wrote a little bubble sort, but it doesn't seem to do what I want. I realize it's a long question, but I'd love some help. Let me know if anything I wrote is unclear.
def best_score(dna1, dna2, match=1, mismatch=-1, gap=-2,\
               score=[], count=0):
    """"""
    diff = abs(len(dna1) - len(dna2))
    if len(dna1) is len(dna2):
        short = []
    elif len(dna1) < len(dna2):
        short = [base for base in iter(dna1)]
    else:
        short = [base for base in iter(dna2)]
    for i in range(diff):
        short.insert(count, "-")
    for i in range(diff+count, len(short)-1):
        if len(dna1) < len(dna2):
            score.append((get_score(short, dna2),\
                          ''.join(short), dna2))
        else:
            score.append((get_score(dna1, short),\
                          dna1, ''.join(short)))
        short[i+1], short[i] = short[i], short[i+1]
    if count is min(len(dna1), len(dna2)):
        return score[score.index(max(score))]
    return best_score(dna1, dna2, 1, -1, -2, score, count+1)
First, if I correctly deciphered your cost function, your best score value does not depend on gap, as the number of dashes is fixed.
Second, it depends linearly on the number of mismatches, so it doesn't depend on the exact values of match and mismatch, as long as they are positive and negative respectively.
So your task reduces to finding the longest subsequence of the longer string's letters that exactly matches a subsequence of the shorter string's letters.
Third, let M(string, substr) be the function returning the length of the best match from above. If the first letter of your shorter string is S, that is substr == 'S<letters>', then
M(string, 'S<letters>') = \
    max(1 + M(string[string.index(S)+1:], '<letters>'),  # found S
        M(string[1:], '<letters>'))  # letter S not found, placed at 1st place
The latter is an easy-to-implement recursive expression.
For a pair string, substr with m = M(string, substr), the best score is equal to
m * match + (len(substr) - m) * mismatch + (len(string)-len(substr)) * gap
By recording which option gave the maximum in the recursive expression, it is straightforward to recover the best match itself.
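For comparison, here is a minimal recursive sketch that computes the best score directly. It is not the exact recurrence above: it tries both choices at every cell (a dash or the next letter of the shorter string) and memoizes the results. The name best_alignment_score and the use of functools.lru_cache are my own choices, and it assumes Python 3 and that dashes may only be inserted into the shorter string.

```python
from functools import lru_cache


def best_alignment_score(long_seq, short_seq, match=1, mismatch=-1, gap=-2):
    """Best score when dashes may only be inserted into the shorter sequence."""
    if len(short_seq) > len(long_seq):
        long_seq, short_seq = short_seq, long_seq

    @lru_cache(maxsize=None)
    def best(i, j):
        # Best score for aligning long_seq[i:] against short_seq[j:].
        cells_left = len(long_seq) - i
        letters_left = len(short_seq) - j
        if cells_left == 0:
            return 0  # both sequences are used up
        options = []
        if cells_left > letters_left:
            # There is room to spare: put a dash opposite long_seq[i].
            options.append(gap + best(i + 1, j))
        if letters_left > 0:
            # Pair long_seq[i] with the next letter of the shorter sequence.
            pair = match if long_seq[i] == short_seq[j] else mismatch
            options.append(pair + best(i + 1, j + 1))
        return max(options)

    return best(0, 0)


print(best_alignment_score("TACG", "ACA"))
```

For the example from the question this agrees with the closed form above: two matches, one mismatch and one gap give 2 * 1 + 1 * (-1) + 1 * (-2).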

Is this an acceptable algorithm?

I've designed an algorithm to find the longest common subsequence. These are the steps:
Pick the first letter in the first string.
Look for it in the second string. If it's found, add that letter to common_subsequence and store its position in index. Otherwise, compare the length of common_subsequence with the length of lcs, and if it's greater, assign its value to lcs.
Return to the first string, pick the next letter, and repeat the previous step, but this time start searching from the index-th letter.
Repeat this process until there is no letter left in the first string to pick. At the end, the value of lcs is the longest common subsequence.
This is an example:
X = A, B, C, B, D, A, B
Y = B, D, C, A, B, A
Pick A in the first string.
Look for A in Y.
Now that there is an A in the second string, append it to common_subsequence.
Return to the first string and pick the next letter that is B.
Look for B in the second string this time starting from the position of A.
There is a B after A so append B to common_subsequence.
Now pick the next letter in the first string that is C. There isn't a C next to B in the second string. So assign the value of common_subsequence to lcs because its length is greater than the length of lcs.
Repeat the previous steps until reaching the end of the first string. In the end, the value of lcs is the longest common subsequence.
The complexity of this algorithm is theta(n*m).
Here are my implementations:
First algorithm:
import time
def lcs(xstr, ystr):
    if not (xstr and ystr): return # if string is empty
    lcs = [''] # longest common subsequence
    lcslen = 0 # length of longest common subsequence so far
    for i in xrange(len(xstr)):
        cs = '' # common subsequence
        start = 0 # start position in ystr
        for item in xstr[i:]:
            index = ystr.find(item, start) # position at the common letter
            if index != -1: # if common letter has found
                cs += item # add common letter to the cs
                start = index + 1
            if index == len(ystr) - 1: break # if reached end of the ystr
        # update lcs and lcslen if found better cs
        if len(cs) > lcslen: lcs, lcslen = [cs], len(cs)
        elif len(cs) == lcslen: lcs.append(cs)
    return lcs
file1 = open('/home/saji/file1')
file2 = open('/home/saji/file2')
xstr = file1.read()
ystr = file2.read()
start = time.time()
lcss = lcs(xstr, ystr)
elapsed = (time.time() - start)
print elapsed
The same algorithm using hash table:
import time
from collections import defaultdict
def lcs(xstr, ystr):
    if not (xstr and ystr): return # if strings are empty
    lcs = [''] # longest common subsequence
    lcslen = 0 # length of longest common subsequence so far
    location = defaultdict(list) # keeps track of items in the ystr
    i = 0
    for k in ystr:
        location[k].append(i)
        i += 1
    for i in xrange(len(xstr)):
        cs = '' # common subsequence
        index = -1
        reached_index = defaultdict(int)
        for item in xstr[i:]:
            for new_index in location[item][reached_index[item]:]:
                reached_index[item] += 1
                if index < new_index:
                    cs += item # add item to the cs
                    index = new_index
                    break
            if index == len(ystr) - 1: break # if reached end of the ystr
        # update lcs and lcslen if found better cs
        if len(cs) > lcslen: lcs, lcslen = [cs], len(cs)
        elif len(cs) == lcslen: lcs.append(cs)
    return lcs
file1 = open('/home/saji/file1')
file2 = open('/home/saji/file2')
xstr = file1.read()
ystr = file2.read()
start = time.time()
lcss = lcs(xstr, ystr)
elapsed = (time.time() - start)
print elapsed
If your professor wants you to invent your own LCS algorithm, you're done. Your algorithm is not the most optimal one ever created, but it's in the right complexity class, you clearly understand it, and you clearly didn't copy your implementation from the internet. You might want to be prepared to defend your algorithm, or discuss alternatives. If I were your prof, I'd give you an A if:
You turned in that program.
You were able to explain why there's no possible O(N) or O(N log M) alternative.
You were able to participate in a reasonable discussion about other algorithms that might have a better lower bound (or significantly lower constants, etc.), and the time/space tradeoffs, etc., even if you didn't know the outcome of that discussion in advance.
On the other hand, if your professor wants you to pick one of the well-known algorithms and write your own implementation, you probably want to use the standard DP (dynamic programming) algorithm. It's a standard algorithm for a reason, which you probably want to read up on until you understand it. (Even if it isn't going to be on the test, you're taking this class to learn, not just to impress the prof, right?)
Wikipedia has pseudocode for a basic implementation, then English-language descriptions of common optimizations. I'm pretty sure that writing your own Python code based on what's on that page wouldn't count as plagiarism, or even as a trivial port, especially if you can demonstrate that you understand what your code is doing, and why, and why it's a good algorithm. Plus, you're writing it in Python, which has much better ways to memoize than what's demonstrated in that article, so if you understand how it works, your code should actually be substantially better than what Wikipedia gives you.
Either way, as I suggested in the comments, I'd read A survey of longest common subsequence algorithms by Bergroth, Hakonen, and Raita, and search for similar papers online.
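To give an idea of what the memoized version can look like in Python, here is a minimal sketch of the textbook recurrence for the length of the LCS. It assumes Python 3 and uses functools.lru_cache for memoization. The name lcs_length is mine, and for very long inputs you would want an iterative table instead, because this version recurses once per character.

```python
from functools import lru_cache


def lcs_length(x, y):
    """Length of the longest common subsequence of x and y."""

    @lru_cache(maxsize=None)
    def go(i, j):
        # LCS length of the suffixes x[i:] and y[j:].
        if i == len(x) or j == len(y):
            return 0
        if x[i] == y[j]:
            return 1 + go(i + 1, j + 1)          # matching letters extend the LCS
        return max(go(i + 1, j), go(i, j + 1))   # otherwise drop one letter from either side

    return go(0, 0)


print(lcs_length("ABCBDAB", "BDCABA"))
```

Each pair (i, j) is computed once, so the running time is proportional to len(x) * len(y), the same complexity class as the algorithm in the question.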
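# Assumes str1 and str2 hold the two input strings.
# Note: this brute-force search finds the longest common substring (contiguous).
maxLength = 0
foundString = ""
for start in xrange(len(str1)):
    for end in xrange(start+1, len(str1)+1):
        str1Temp = str1[start:end]
        maxLengthTemp = len(str1Temp)
        # str2.find() returns -1 (truthy) on failure, so test membership instead.
        if str1Temp in str2:
            if maxLengthTemp > maxLength:
                maxLength = maxLengthTemp
                foundString = str1Temp
print maxLength
print foundString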
