I am trying to solve the Longest Common Subsequence challenge in Python. My code runs and seems to work fine, but when I submit it, the grader reports it as only 50% (partially) complete. I'm not sure what I'm missing here; any help is appreciated.
CHALLENGE DESCRIPTION:
You are given two sequences. Write a program to determine the longest common subsequence between the two strings (each string can have a maximum length of 50 characters). NOTE: This subsequence need not be contiguous. The input file may contain empty lines, these need to be ignored.
INPUT SAMPLE:
The first argument will be a path to a filename that contains two strings per line, semicolon delimited. You can assume that there is only one unique subsequence per test case. E.g.:
XMJYAUZ;MZJAWXU
OUTPUT SAMPLE:
The longest common subsequence. Ensure that there are no trailing empty spaces on each line you print. E.g.:
MJAU
My code is
# LONGEST COMMON SUBSEQUENCE
import argparse
def get_longest_common_subsequence(strings):
    # here we will store the subsequence list
    subsequences_list = list()
    # split the strings in 2 different variables and limit them to 50 characters
    first = strings[0]
    second = strings[1]
    startpos = 0
    # we need to start from each index in the first string so we can find the longest subsequence
    # therefore we do a loop with the length of the first string, incrementing the start every time
    for start in range(len(first)):
        # here we will store the current subsequence
        subsequence = ''
        # store the index of the found character
        idx = -1
        # loop through all the characters in the first string, starting at the 'start' position
        for i in first[start:50]:
            # search for the current character in the second string
            pos = second[0:50].find(i)
            # if the character was found and is in the correct sequence add it to the subsequence and update the index
            if pos > idx:
                subsequence += i
                idx = pos
        # if we have a subsequence, add it to the subsequences list
        if len(subsequence) > 0:
            subsequences_list.append(subsequence)
        # increment the start
        startpos += 1
    # sort the list of subsequences with the longest at the top
    subsequences_list.sort(key=len, reverse=True)
    # return the longest subsequence
    return subsequences_list[0]
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('filename')
    args = parser.parse_args()
    # read file as the first argument
    with open(args.filename) as f:
        # loop through each line
        for line in f:
            # if the line is empty it means it's not valid. otherwise print the common subsequence
            if line.strip() not in ['\n', '\r\n', '']:
                strings = line.replace('\n', '').split(';')
                if len(strings[0]) > 50 or len(strings[1]) > 50:
                    break
                print get_longest_common_subsequence(strings)
    return 0
if __name__ == '__main__':
    main()
The following solution prints the characters common to both strings of each semicolon-separated pair, in no particular order (it uses a set intersection, so the result is not an ordered subsequence). If a string from the pair is longer than 50 characters, then the pair is skipped (it's not difficult to trim it to length 50 if that is desired).
Note: if ordering is desired, it can be implemented (alphabetical order, the order of the first string, or the order of the second string).
with open('filename.txt') as f:
    for line in f:
        line = line.strip()
        if line and ';' in line and len(line) <= 101:
            a, b = line.split(';')
            a = set(a.strip())
            b = set(b.strip())
            common = a & b  # intersection
            if common:
                print ''.join(common)
Also note: if the strings share internal whitespace (e.g. ABC DE; ZM YCA), it will be part of the output because strip() only removes leading and trailing whitespace. If that is not desired, you can replace the line a = set(a.strip()) with a = {char for char in a if char.strip()}, and likewise for b.
def lcs_recursive(xlist,ylist):
    if not xlist or not ylist:
        return []
    x,xs,y,ys, = xlist[0],xlist[1:],ylist[0],ylist[1:]
    if x == y:
        return [x] + lcs_recursive(xs,ys)
    else:
        return max(lcs_recursive(xlist,ys),lcs_recursive(xs,ylist),key=len)
s1 = 'XMJYAUZ'
s2 = 'MZJAWXU'
print (lcs_recursive(s1,s2))
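This prints ['M', 'J', 'A', 'U'], i.e. the correct answer MJAU. X and Z are not part of the answer because a subsequence must keep its characters in the same relative order in both strings, and including either of them would break that order.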
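The plain recursion above recomputes the same sub-problems many times, which can become slow for strings near the 50-character limit. As a rough sketch of the usual dynamic-programming alternative (the name lcs_dp is made up for this illustration and is not part of the original answer), one can fill a table of LCS lengths for suffixes and then walk it to rebuild the subsequence:

```python
def lcs_dp(a, b):
    """Return one longest common subsequence of strings a and b."""
    m, n = len(a), len(b)
    # table[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table from the top-left corner to rebuild the subsequence.
    out = []
    i = j = 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return "".join(out)


print(lcs_dp("XMJYAUZ", "MZJAWXU"))  # MJAU
```

This returns a string rather than a list, so it can be printed directly for each input line. When several subsequences share the maximum length, it returns only one of them; the challenge states that each test case has a unique answer, so that should not matter here.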
Related
I have a CSV file with the following data:
bel.lez.za;bellézza
e.la.bo.ra.re;elaboràre
a.li.an.te;alïante
u.mi.do;ùmido
the first value is the word divided in syllables and the second is for the stress.
I'd like to merge the two pieces of information and obtain the following output:
bel.léz.za
e.la.bo.rà.re
a.lï.an.te
ù.mi.do
I computed the position of the stressed vowel and tried to substitute the same unstressed vowel in the first value, but full stops make indexing difficult. Is there a way to tell python to ignore full stops while counting? or is there an easier way to perform it? Thx
After splitting the two values for each line I computed the position of the stressed vowels:
char_list=['ò','à','ù','ì','è','é','ï']
for character in char_list:
    if character in value[1]:
        position_of_stressed_vowel=value[1].index(character)
I'd suggest merging/aligning the two forms in parallel instead of trying to substitute things via indexing. The idea is to iterate through the plain form and take out one character from the accented form for every character from the plain form, keeping dots as they are.
(Or perhaps, the idea is to add the dots to the accented form instead of adding the accented characters to the syllabified form.)
def merge_accents(plain, accented):
    output = ""
    acc_chars = iter(accented)
    for char in plain:
        if char == ".":
            output += char
        else:
            output += next(acc_chars)
    return output
Test:
data = [['bel.lez.za', 'bellézza'],
['e.la.bo.ra.re', 'elaboràre'],
['a.li.an.te', 'alïante'],
['u.mi.do', 'ùmido']]
# Returns
# bel.léz.za
# e.la.bo.rà.re
# a.lï.an.te
# ù.mi.do
for plain, accented in data:
    print(merge_accents(plain, accented))
Is there a way to tell python to ignore full stops while counting?
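Yes, by implementing it yourself using an index lookup that tells you which index in the dot-delimited string corresponds to a given index in the plain word:
i = 0
corrected_index = []
for char in value[0]:
    if char != ".":
        corrected_index.append(i)
    i += 1
Now you can correct the index and replace the character. Python strings are immutable, so build a new string by slicing instead of assigning to an index:
pos = corrected_index[position_of_stressed_vowel]
value[0] = value[0][:pos] + character + value[0][pos + 1:]
Make sure each "stressed vowel" occupies a single index. In Python 3, str is indexed by code point, so if the accents might be stored as separate combining characters, normalize first with unicodedata.normalize('NFC', ...); in Python 2, work with unicode strings rather than byte strings.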
You can loop over the first half of the string while keeping track of the current index into the second half. Dots are copied as they are; for every other character, the character at the tracked index in the second half is appended to a buffer (the modified string) and the index is advanced. Like the code below:
data = ['bel.lez.za;bellézza',
        'e.la.bo.ra.re;elaboràre',
        'a.li.an.te;alïante',
        'u.mi.do;ùmido']
converted_data = []
# Loop over the data.
for pair in data:
    # Split the pair on ";".
    first_half, second_half = pair.split(';')
    # Create variables to keep track of the current letter and the modified string.
    current_letter = 0
    modified_second_half = ''
    # Loop over the letters of the first half of the string.
    for current_char in first_half:
        # If the current_char is a dot add it to the modified string.
        if current_char == '.':
            modified_second_half += '.'
        # If the current_char is not a dot add the current letter from the second half to the modified string,
        # and update the current letter value.
        else:
            modified_second_half += second_half[current_letter]
            current_letter += 1
    converted_data.append(modified_second_half)
print(converted_data)
data = ['bel.lez.za;bellézza',
'e.la.bo.ra.re;elaboràre',
'a.li.an.te;alïante',
'u.mi.do;ùmido']
def slice_same(input, lens):
    # slices the given string into the given lengths.
    res = []
    strt = 0
    for size in lens:
        res.append(input[strt : strt + size])
        strt += size
    return res
# split into two.
data = [x.split(';') for x in data]
# Add third column that's the length of each piece.
data = [[x, y, [len(z) for z in x.split('.')]] for x, y in data]
# Put text and lens through function.
data = ['.'.join(slice_same(y, z)) for x, y, z in data]
print(data)
Output:
['bel.léz.za',
'e.la.bo.rà.re',
'a.lï.an.te',
'ù.mi.do']
I want to repeat a word n times in my function, but without duplicating the characters that overlap where one copy ends and the next one begins.
For example, repete("amanha", 2) should return "amanhamanha".
My function:
def repete(palavra,n):
    a = []
    b=""
    for n in range (0,n):
        a.append(palavra)
    b = b.join(a)
    return b
The first step is to determine the longest overlap between the start and end of the word. The next() function can be used to get the number of characters to skip by getting the first match starting from the longest substring down to the shortest and defaulting to zero if there is no overlap. Then the repetition can be performed on the remaining part of the word (i.e. skipping the length of the common part)
def repeat(w,n):
    skip = next((i for i in range(len(w)-1,0,-1) if w[:i]==w[-i:]),0)
    return w + (n-1)*w[skip:]
print(repeat("amanha",2)) # amanhamanha
print(repeat("abc",2)) # abcabc
print(repeat("abcdab",2)) # abcdabcdab
You could also use the max() function to get the length to skip (not as efficient as next() but shorter to write):
def repeat(w,n):
    skip = max(range(len(w)),key=lambda i:i*(w[:i]==w[-i:]))
    return w + (n-1)*w[skip:]
In Python I am trying to extract all the longest common leading substrings that contain at least 4 characters from a list. For example, in the list called "data" below, the 2 longest common substrings that fit my criteria are "johnjack" and "detc". I know how to find the single longest common prefix with the code below, which returns nothing (as expected) because no prefix is shared by every string. But I am struggling to build a script that detects multiple common substrings within a list, where each common substring must have a length of 4 or more.
data = ['johnjack1', 'johnjack2', 'detc22', 'detc32', 'chunganh']
def ls(data):
    if len(data)==0:
        prefix = ''
    else:
        prefix = data[0]
    for i in data:
        while not i.startswith(prefix) and len(prefix) > 0:
            prefix = prefix[:-1]
    print(prefix)
ls(data)
Here's one, but I think it's probably not the fastest or most efficient. Let's start with just the data and a container for our answer:
data = ['johnjack1', 'johnjack2', 'detc22', 'detc32', 'chunganh', 'chunganh']
substrings = []
Note I added a dupe for chunganh -- that's a common edge case we should be handling.
See How do I find the duplicates in a list and create another list with them?
So to capture the duplicates in the data
seen = {}
dupes = []
for x in data:
    if x not in seen:
        seen[x] = 1
    else:
        if seen[x] == 1:
            dupes.append(x)
        seen[x] += 1
for dupe in dupes:
    substrings.append(dupe)
Now let's record the unique values in the data as-is
# Capture the unique values in the data
last = set(data)
From here, we can loop through our set, popping characters off the end of each unique value. If the size of our set changes, at least two values have collapsed into the same shortened string, so we've found a common substring.
# Handle strings up to 10000 characters long
for k in [0-b for b in range(1, 10000)]:
    # Use negative indexing to start from the longest
    last, middle = set([i[:k] for i in data]), last
    # Unique substring found
    if len(last) != len(middle):
        for k in last:
            count = 0
            for word in middle:
                if k in word:
                    count += 1
            if count > 1:
                substrings.append(k)
    # Early stopping
    if len(last) == 1:
        break
Finally, you mentioned needing only substrings of length 4 or more.
list(filter(lambda x: len(x) >= 4, substrings))
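For comparison, here is a different technique from the loop above, offered only as a sketch: sort the list so that strings with a shared start end up next to each other, then take the common prefix of each adjacent pair with os.path.commonprefix (which compares character by character). The function name common_prefixes and the min_len parameter are invented for this illustration.

```python
import os.path


def common_prefixes(items, min_len=4):
    """Return the shared prefixes (at least min_len long) of adjacent sorted items."""
    ordered = sorted(items)  # duplicates are kept on purpose, so they pair up
    found = set()
    for left, right in zip(ordered, ordered[1:]):
        prefix = os.path.commonprefix([left, right])
        if len(prefix) >= min_len:
            found.add(prefix)
    return sorted(found)


data = ['johnjack1', 'johnjack2', 'detc22', 'detc32', 'chunganh']
print(common_prefixes(data))  # ['detc', 'johnjack']
```

Note that this reports the longest prefix shared by each neighbouring pair, so nested groups (for example a short prefix shared by many strings and a longer one shared by a few) yield both prefixes. Whether that matches the intended meaning of "longest common leading substrings" is an assumption.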
I need to find a given pattern in a text file and print the matching patterns. The text file is a string of digits and the pattern can be any string of digits or placeholders represented by 'X'.
I figured the way to approach this problem would be by loading the sequence into a variable, then creating a list of testable subsequences, and then testing each subsequence. This is my first function in python so I'm confused as to how to create the list of test sequences easily and then test it.
def find(pattern): #finds a pattern in the given input file
    with open('sequence.txt', 'r') as myfile:
        string = myfile.read()
    print('Test data is:', string)
    testableStrings = []
    #how to create a list of testable sequences?
    for x in testableStrings:
        if x == pattern:
            print(x)
    return
For example, searching for "X10X" in "11012102" should print "1101" and "2102".
For the illustration that follows, let pattern = "X10X", string = "11012102", and n = len(pattern).
Without using regular expressions, your algorithm may be as follows:
Construct a list of all substrings of string with length n:
In[2]: parts = [string[i:i+n] for i in range(len(string) - n + 1)]
In[3]: parts
Out[3]: ['1101', '1012', '0121', '1210', '2102']
Compare pattern with each element in parts:
for part in parts:
The comparison of pattern with part (both now have equal lengths) is done symbol by symbol in corresponding positions:
    for ch1, ch2 in zip(pattern, part):
If ch1 is the X symbol or ch1 == ch2, the comparison of corresponding symbols continues; otherwise we break out of it:
        if ch1 == "X" or ch1 == ch2:
            continue
        else:
            break
Finally, if all symbol-by-symbol comparisons were successful, i.e. all pairs of corresponding symbols were exhausted, the else branch of the for statement is executed (yes, for statements may have an else branch for that case).
Now you may perform any action with the matched part, e.g. print it or append it to some list:
    else:
        print(part)
So all in one place:
pattern = "X10X"
string = "11012102"
n = len(pattern)
parts = [string[i:i+n] for i in range(len(string) - n + 1)]
for part in parts:
    for ch1, ch2 in zip(pattern, part):
        if ch1 == "X" or ch1 == ch2:
            continue
        else:
            break
    else:
        print(part)
The output:
1101
2102
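The answer above deliberately avoids regular expressions. For readers who do want them, a possible sketch is to translate the pattern into a regex, turning each X into "." (any character) and escaping everything else, then wrap it in a lookahead so that overlapping matches are reported as well. The function name find_matches and the wildcard parameter are made up for this example.

```python
import re


def find_matches(pattern, text, wildcard="X"):
    """Return every (possibly overlapping) substring of text matching pattern."""
    # The wildcard becomes '.', every other character is matched literally.
    regex = "".join("." if ch == wildcard else re.escape(ch) for ch in pattern)
    # A lookahead containing a capture group is zero-width, so the search
    # advances one position at a time and overlapping matches are not lost.
    return [m.group(1) for m in re.finditer("(?=(" + regex + "))", text)]


print(find_matches("X10X", "11012102"))  # ['1101', '2102']
```

If the placeholder should match digits only, "." could be swapped for r"\d"; the version here mirrors the loop above, where X matches any symbol.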
You probably wanted to create the list of testable sequences from the individual rows of the input file. So instead of
with open('sequence.txt', 'r') as myfile:
    string = myfile.read()
use
with open('sequence.txt') as myfile: # 'r' is default
    testableStrings = [row.strip() for row in myfile]
The strip() method removes whitespace characters from the start and end of rows, including \n symbols at the end of lines.
Example of the sequence.txt file:
123456789
87654321
111122223333
The output of the print(testableStrings) command:
['123456789', '87654321', '111122223333']
Using loops, how can I write a function in Python that finds the longest chain of proteins in a sequence, regardless of the order of its characters? The function should return the longest substring that consists only of the characters 'A', 'C', 'G', and 'T' when these are mixed with other characters. For example, for the sequence 'ACCGXXCXXGTTACTGGGCXTTGT', it returns 'GTTACTGGGC'.
If the data is provided as a string you could simply split it by the character 'X' and thereby get a list.
startstring = 'ACCGXXCXXGTTACTGGGCXTTGT'
array = startstring.split('X')
Then looping over the list while checking for the length of the element would give you the right result:
# Initialize placeholders for comparison
temp_max_string = ''
temp_max_length = 0
#Loop over each string in the list
for i in array:
    # Check if the current substring is longer than the longest found so far
    if len(i) > temp_max_length:
        # Replace the placeholders if it is longer
        temp_max_length = len(i)
        temp_max_string = i
print(temp_max_string) # or 'print temp_max_string' if you are using python2.
You could also use the Python built-ins to get your result more concisely:
Sorting by descending length (list.sort())
startstring = 'ACCGXXCXXGTTACTGGGCXTTGT'
array = startstring.split('X')
array.sort(key=len, reverse=True)
print(array[0]) #print the longest since we sorted for descending lengths
print(len(array[0])) # Would give you the length of the longest substring
Only get the longest substring (max()):
startstring = 'ACCGXXCXXGTTACTGGGCXTTGT'
array = startstring.split('X')
longest = max(array, key=len)
print(longest) # gives the longest substring
print(len(longest)) # gives you the length of the longest substring
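All of the variants above assume that 'X' is the only character that can interrupt a chain. If other characters may appear too, a loop-only sketch along the following lines might be closer to what the question asks for. The function name longest_dna_run and the alphabet parameter are assumptions made for this example.

```python
def longest_dna_run(sequence, alphabet="ACGT"):
    """Return the longest run of characters from alphabet found in sequence."""
    best = ""
    current = ""
    for ch in sequence:
        if ch in alphabet:
            # Extend the current run and remember it if it is the longest so far.
            current += ch
            if len(current) > len(best):
                best = current
        else:
            # Any other character ends the current run.
            current = ""
    return best


print(longest_dna_run('ACCGXXCXXGTTACTGGGCXTTGT'))  # GTTACTGGGC
```

When two runs have the same length, this keeps the first one, which matches the behaviour of the loop and max() versions above.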