Find multiple longest common leading substrings with length >= 4 - python

In Python, I am trying to extract all of the longest common leading substrings that contain at least 4 characters from a list. For example, in the list called "data" below, the 2 longest common leading substrings that fit my criteria are "johnjack" and "detc". I know how to find the single longest common leading substring with the code below, which returns nothing (as expected) because no prefix is shared by every string. But I am struggling to build a script that can detect multiple common substrings within a list, where each common substring must have a length of 4 or more.
data = ['johnjack1', 'johnjack2', 'detc22', 'detc32', 'chunganh']
def ls(data):
    if len(data)==0:
        prefix = ''
    else:
        prefix = data[0]
        for i in data:
            while not i.startswith(prefix) and len(prefix) > 0:
                prefix = prefix[:-1]
    print(prefix)
ls(data)

Here's one, but I think it's probably not the fastest or most efficient. Let's start with just the data and a container for our answer:
data = ['johnjack1', 'johnjack2', 'detc22', 'detc32', 'chunganh', 'chunganh']
substrings = []
Note I added a dupe for chunganh -- that's a common edge case we should be handling.
See "How do I find the duplicates in a list and create another list with them?"
So, to capture the duplicates in the data:
seen = {}
dupes = []
for x in data:
    if x not in seen:
        seen[x] = 1
    else:
        if seen[x] == 1:
            dupes.append(x)
        seen[x] += 1
for dupe in dupes:
    substrings.append(dupe)
Now let's record the unique values in the data as-is
# Capture the unique values in the data
last = set(data)
From here, we can loop through our set, popping characters off the end of each value. If the size of our set changes, at least two values have collapsed into the same prefix, so we've found a common leading substring.
# Handle strings up to 10000 characters long
for k in [0-b for b in range(1, 10000)]:
    # Use negative indexing to start from the longest
    last, middle = set([i[:k] for i in data]), last
    # Common substring found
    if len(last) != len(middle):
        for prefix in last:
            count = 0
            for word in middle:
                # Only count words that start with this prefix
                if word.startswith(prefix):
                    count += 1
            if count > 1:
                substrings.append(prefix)
    # Early stopping
    if len(last) == 1:
        break
Finally, you mentioned needing only substrings of length 4 or more.
list(filter(lambda x: len(x) >= 4, substrings))
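As an alternative to trimming every string in lockstep, here is a minimal sketch that sorts the list and compares neighbours. It assumes that a "common leading substring" means a prefix shared by at least two entries, and it reports one prefix per adjacent pair in sorted order, so nested groups (for example "john" and "johnjack") would both be listed. The function name common_leading_substrings and the min_length parameter are illustrative and are not part of the answer above.

```python
from os.path import commonprefix


def common_leading_substrings(strings, min_length=4):
    """Return prefixes of at least min_length shared by two or more strings."""
    # After sorting, the longest prefix a string shares with any other
    # string is always shared with one of its immediate neighbours.
    ordered = sorted(strings)
    found = []
    for left, right in zip(ordered, ordered[1:]):
        # commonprefix compares character by character, not path by path.
        prefix = commonprefix([left, right])
        if len(prefix) >= min_length and prefix not in found:
            found.append(prefix)
    return found


data = ['johnjack1', 'johnjack2', 'detc22', 'detc32', 'chunganh']
print(common_leading_substrings(data))
```

Duplicated entries need no special handling here, because two equal strings are adjacent after sorting and share their whole length as a prefix.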

Related

how to create all possible orders in a specific length from a list of strings

I have a list of strings that need to fit in 6 characters. The strings can be split but the characters in the string can't be randomized. The strings have different lengths (4 and 3 characters)
I tried a few things with itertools and know how to get all possibilities but not how to get only the possibilities with a specific length requirement.
It's ok to omit the first zero from the list entries.
An example of a list:
wordlist = ["0254", "0294", "0284", "0289", "027", "024", "026", "088"]
It would be ok to get combinations like 025427, 254027, 270254, 027254 (0 and 4 of the list) and the obvious 027088, 088027 (4 and 7 of the list) and even 272488 (4, 5 and 7 of the list)
I think the solution lies in itertools in combination with something else.
Just use a standard double loop and check if you need/can remove zeros to get the desired length
combinations = []
for i in wordlist:
    for j in wordlist:
        if len(i) == 4 and len(j) == 4: # remove both zeros
            combinations.append(i[1:] + j[1:])
        elif len(i) == 3 and len(j) == 3:
            combinations.append(i + j) # don't remove any zero
        else:
            combinations.append(i[1:] + j) # remove first element zero
            combinations.append(i + j[1:]) # remove second element zero
This will give you all possible combinations (including matching elements with themselves)
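Since the question also mentions itertools and three-part results such as 272488, here is a hedged sketch that generalises the double loop. It assumes that each entry may be used at most once per result (unlike the loop above, which also pairs an entry with itself) and that only a leading "0" may be dropped. The function name fixed_length_orders and its parameters are illustrative.

```python
from itertools import permutations, product


def fixed_length_orders(words, length=6, max_parts=3):
    """Concatenate 2..max_parts distinct entries, optionally dropping each
    entry's leading zero, and keep the results with exactly `length` chars."""
    results = set()
    for r in range(2, max_parts + 1):
        for parts in permutations(words, r):
            # For each part, offer it as-is and without its leading zero.
            options = [(p, p[1:]) if p.startswith("0") else (p,) for p in parts]
            for choice in product(*options):
                candidate = "".join(choice)
                if len(candidate) == length:
                    results.add(candidate)
    return sorted(results)


wordlist = ["0254", "0294", "0284", "0289", "027", "024", "026", "088"]
orders = fixed_length_orders(wordlist)
print(len(orders), orders[:5])
```

Collecting the results in a set removes duplicates that arise when different choices produce the same string.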

Compare adjacent characters in string for differing case

I am working through a coding challenge in Python. The rule is to take a string and delete any two adjacent letters that are the same character but differ in case. The process is repeated until no matching letters of differing case are side by side. Finally, the length of the string should be printed. I have made a solution below that iterates left to right, although I have been told there are better, more efficient ways.
list_of_elves=list(line)
n2=len(list_of_elves)
i=0
while i < len(list_of_elves):
    if list_of_elves[i-1].lower()==list_of_elves[i].lower() and list_of_elves[i-1] != list_of_elves[i]:
        del list_of_elves[i]
        del list_of_elves[i-1]
        if i<2:
            i-=1
        else:
            i-=2
        if len(list_of_elves)<2:
            break
    else:
        i+=1
        if len(list_of_elves)<2:
            break
print(len(list_of_elves))
I have made some pseudo code as well
PROBLEM STATEMENT
Take a given string of alphabetical characters
Build a process to count the initial string length and store to variable
Build a process to iterate through the list and identify the following rule:
    Two adjacent matching letters && Of differing case
    Delete the pair
Repeat process
Count final length of string
For example, if we had a string with 'aAa' then 'aA' would be deleted, leaving 'a' behind.
In Python, if you want to do it with a regex, use
re.sub(r"([a-zA-Z])(?=(?!\1)(?i:\1))", "", s) # For ASCII only letters
re.sub(r"([^\W\d_])(?=(?!\1)(?i:\1))", "", s) # For any Unicode letters
See the Python demo
Details
([^\W\d_]) - Capturing group 1: any Unicode letter (or any ASCII letter if [a-zA-Z] is used)
(?=(?!\1)(?i:\1)) - a positive lookahead that requires the same char as matched in the first capturing group (case insensitive) (see (?i:\1)) that is not the same char as matched in Group 1 (see (?!\1))
This is a very similar problem to matching parentheses, but instead of a match being opposite pairs, the match is upper/lower case. You can use a similar technique of maintaining a stack. Iterate through the string and compare the current letter with the top of the stack. If they match, pop the element off the stack; if they don't, append the letter to the stack. In the end, the length of the stack will be your answer:
line = "cABbaC"
stack = []
match = lambda m, n: m != n and m.upper() == n.upper()
for c in line:
    if len(stack) == 0 or not match(c, stack[-1]):
        stack.append(c)
    else:
        stack.pop()
stack
# stack is empty because `Bb` `Aa` and `Cc` get deleted.
Similarly line = "cGBbgaCF" would result in a stack of ['c', 'a', 'C', 'F'] because Bb, then Gg are deleted.
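The stack technique above can be wrapped in a small function that returns the final length the challenge asks for. This is a sketch under the same rule (same letter, differing case); the name reduced_length is illustrative.

```python
def reduced_length(line):
    """Length of `line` after repeatedly deleting adjacent letters that
    are the same character in differing case."""
    stack = []
    for c in line:
        # A pair reacts when the characters differ but match ignoring case.
        if stack and stack[-1] != c and stack[-1].lower() == c.lower():
            stack.pop()
        else:
            stack.append(c)
    return len(stack)


print(reduced_length("cABbaC"))
print(reduced_length("cGBbgaCF"))
```

Because each character is pushed and popped at most once, the whole reduction takes a single pass over the input.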
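A method that should be very fast:
result = 1  # counts the final character; assumes a non-empty string
pairs = zip(string, string[1:])
for a, b in pairs:
    if a.lower() == b.lower() and a != b:
        # Skip the pair that starts with b. If no pair is left,
        # b was the final character, so undo its initial count.
        if next(pairs, None) is None:
            result -= 1
    else:
        result += 1
print(result)
First we create a zip of the input with the input sliced by 1 position. This gives us an iterator that returns all the adjacent pairs in the string in order.
Then for every pair that doesn't match we increment the result; for every pair that does match we advance the iterator by one so that we skip the matching pair. Using next(pairs, None) avoids a StopIteration error when the matching pair is the last one.
Result is then the length of the reduced string. We don't actually need to store the reduced string, since its length is the only thing that needs to be returned and we can calculate it as we go along. Note that this is a single left-to-right pass, so pairs that only become adjacent after an earlier deletion are not removed.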
You really only need a single assertion in the regex to match the pair and delete it.
re.sub(r"(?-i:([a-zA-Z])(?!\1)(?i:\1))", "", target)
Code sample :
>>> import re
>>> strs = ["aAa","aaa","aAaAA"]
>>> for target in strs:
...     modtarg = re.sub(r"(?-i:([a-zA-Z])(?!\1)(?i:\1))", "", target)
...     print( target, "\t--> (", len(modtarg), ") ", modtarg )
...
aAa --> ( 1 ) a
aaa --> ( 3 ) aaa
aAaAA --> ( 1 ) A
Info :
(?-i: # Disable Case insensitive if on
( [a-zA-Z] ) # (1), upper or lower case
(?! \1 ) # Not the same cased letter
(?i: \1 ) # Enable Case insensitive, must be the opposite cased letter
)
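A single re.sub call makes one pass over the string. Assuming that pairs exposed by an earlier deletion should also be removed, as the question states, one way to sketch this is to repeat the substitution until the string stops changing. The names PAIR and reduce_fully are illustrative, and the pattern is limited to ASCII letters.

```python
import re

# A letter followed by the same letter in the opposite case (ASCII only).
PAIR = re.compile(r"([a-zA-Z])(?!\1)(?i:\1)")


def reduce_fully(target):
    """Apply the pair-deleting substitution until nothing changes."""
    while True:
        reduced = PAIR.sub("", target)
        if reduced == target:
            return reduced
        target = reduced


for sample in ["aAa", "aaa", "abBA", "cGBbgaCF"]:
    print(sample, "->", repr(reduce_fully(sample)))
```

Each pass rescans the whole string, so for long inputs with deeply nested pairs a stack-based loop is likely to do less work.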

Time limit exceeded error. Word Ladder leetcode

I am trying to solve this LeetCode problem (https://leetcode.com/problems/word-ladder/description/):
Given two words (beginWord and endWord), and a dictionary's word list, find the length of shortest transformation sequence from beginWord to endWord, such that:
Only one letter can be changed at a time.
Each transformed word must exist in the word list. Note that beginWord is not a transformed word.
Note:
Return 0 if there is no such transformation sequence.
All words have the same length.
All words contain only lowercase alphabetic characters.
You may assume no duplicates in the word list.
You may assume beginWord and endWord are non-empty and are not the same.
Input:
beginWord = "hit",
endWord = "cog",
wordList = ["hot","dot","dog","lot","log","cog"]
Output:
5
Explanation:
As one shortest transformation is "hit" -> "hot" -> "dot" -> "dog" -> "cog", return its length 5.
import queue
class Solution:
    def isadjacent(self,a, b):
        count = 0
        n = len(a)
        for i in range(n):
            if a[i] != b[i]:
                count += 1
                if count > 1:
                    return False
        if count == 1:
            return True
    def ladderLength(self,beginWord, endWord, wordList):
        word_queue = queue.Queue(maxsize=0)
        word_queue.put((beginWord,1))
        while word_queue.qsize() > 0:
            queue_last = word_queue.get()
            index = 0
            while index != len(wordList):
                if self.isadjacent(queue_last[0],wordList[index]):
                    new_len = queue_last[1]+1
                    if wordList[index] == endWord:
                        return new_len
                    word_queue.put((wordList[index],new_len))
                    wordList.pop(index)
                    index-=1
                index+=1
        return 0
Can someone suggest how to optimise it and avoid the time limit exceeded error?
The basic idea is to find the adjacent words faster. Instead of considering every word in the list (even one that has already been filtered by word length), construct each possible neighbor string and check whether it is in the dictionary. To make those lookups fast, make sure the word list is stored in something like a set that supports fast membership tests.
To go even faster, you could store two sorted word lists, one sorted by the reverse of each word. Then look for possibilities involving changing a letter in the first half in the reversed list and for the latter half in the normal list. All the existing neighbors can then be found without making any non-word strings. This can even be extended to n lists, each sorted by omitting one letter from all the words.
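The first idea (build every possible neighbour string and test it against a set) might look like the sketch below. It assumes lowercase ASCII words as stated in the problem; the function name ladder_length is illustrative and does not follow the class layout LeetCode expects.

```python
from collections import deque
from string import ascii_lowercase


def ladder_length(begin_word, end_word, word_list):
    """Breadth-first search; returns the number of words in the shortest
    ladder from begin_word to end_word, or 0 if there is none."""
    remaining = set(word_list)  # fast membership tests
    if end_word not in remaining:
        return 0
    remaining.discard(begin_word)
    queue = deque([(begin_word, 1)])
    while queue:
        word, length = queue.popleft()
        # Build every string that differs from `word` in one position.
        for i in range(len(word)):
            for letter in ascii_lowercase:
                candidate = word[:i] + letter + word[i + 1:]
                if candidate not in remaining:
                    continue
                if candidate == end_word:
                    return length + 1
                remaining.remove(candidate)  # mark as visited
                queue.append((candidate, length + 1))
    return 0


print(ladder_length("hit", "cog", ["hot", "dot", "dog", "lot", "log", "cog"]))
```

Each dequeued word costs 26 times the word length in lookups, regardless of how many words are in the list, which is where the saving over comparing against every list entry comes from.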

Backward search implementation python

I am working on some string search tasks to learn more efficient ways of searching.
I am trying to implement a way of counting how many times a substring occurs in a given string by using backward search.
For example given the following strings:
original = 'panamabananas$'
s = 'smnpbnnaaaaa$a'
s1 = '$aaaaaabmnnnps' # sorted version of s
I am trying to find how many times the substring 'ban' occurs. To do so, I was thinking of iterating through both strings with the zip function. In the backward search, I should first look for the last character of ban (n) in s1 and see where it lines up with the next character to match, a, in s. It matches at indexes 9, 10 and 11, which are the third, fourth and fifth a in s. The next character to look for is b, but only for the matches that occurred before (that is, where n in s1 matched with a in s). So we take those a's (third, fourth and fifth) from s and see if any of the third, fourth or fifth a in s1 lines up with a b in s. This way we would have found an occurrence of 'ban'.
It seems complex to me to iterate and save quasi-occurrences, so what I was trying is something like this:
n = 0 # counter of occurrences
for i, j in zip(s1, s):
    if i == 'n' and j == 'a': # this should save the match
        if i[3:6] == 'a' and any(j[3:6] == 'b'):
            n += 1
I think nested if statements may be needed, but I am still a beginner. I am getting 0 occurrences when there is one occurrence of ban in the original.
You can run a loop with find to count the number of occurrences of the substring.
s = 'panamabananasbananasba'
ss = 'ban'
count = 0
idx = s.find(ss, 0)
while idx != -1:
    count += 1
    idx += len(ss)
    idx = s.find(ss, idx)
print(count)
If you really want backward search, then reverse the string and substring and do the same mechanism.
s = 'panamabananasbananasban'
s = s[::-1]
ss = 'ban'
ss = ss[::-1]
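The backward search described in the question, which narrows a range of rows in the sorted column one pattern character at a time, can be sketched as follows. This is a naive illustration that assumes the text ends with a unique '$' and recounts characters with str.count instead of using precomputed rank tables. The function names bwt and count_occurrences are illustrative.

```python
def bwt(text):
    """Last column of the sorted rotations of `text` (must end with '$')."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(last_column, pattern):
    """Count occurrences of `pattern` using backward search."""
    first_column = sorted(last_column)
    # first_index[c]: row of the first rotation that starts with c.
    first_index = {}
    for row, c in enumerate(first_column):
        first_index.setdefault(c, row)
    top, bottom = 0, len(last_column)  # half-open range of matching rows
    for c in reversed(pattern):
        if c not in first_index:
            return 0
        # Map the range through the last-to-first correspondence for c.
        top = first_index[c] + last_column[:top].count(c)
        bottom = first_index[c] + last_column[:bottom].count(c)
        if top >= bottom:
            return 0
    return bottom - top


s = bwt('panamabananas$')
print(s, count_occurrences(s, 'ban'), count_occurrences(s, 'ana'))
```

Unlike the find loop above, this counts overlapping occurrences as well, since every row in the final range corresponds to one starting position in the original string.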

CodeEval Hard Challenge 6 - LONGEST COMMON SUBSEQUENCE - python

I am trying to solve the Longest Common Subsequence in Python. I've completed it and it's working fine although I've submitted it and it says it's 50% partially completed. I'm not sure what I'm missing here, any help is appreciated.
CHALLENGE DESCRIPTION:
You are given two sequences. Write a program to determine the longest common subsequence between the two strings (each string can have a maximum length of 50 characters). NOTE: This subsequence need not be contiguous. The input file may contain empty lines, these need to be ignored.
INPUT SAMPLE:
The first argument will be a path to a filename that contains two strings per line, semicolon delimited. You can assume that there is only one unique subsequence per test case. E.g.:
XMJYAUZ;MZJAWXU
OUTPUT SAMPLE:
The longest common subsequence. Ensure that there are no trailing empty spaces on each line you print. E.g.:
MJAU
My code is
# LONGEST COMMON SUBSEQUENCE
import argparse
def get_longest_common_subsequence(strings):
    # here we will store the subsequence list
    subsequences_list = list()
    # split the strings in 2 different variables and limit them to 50 characters
    first = strings[0]
    second = strings[1]
    startpos = 0
    # we need to start from each index in the first string so we can find the longest subsequence
    # therefore we do a loop with the length of the first string, incrementing the start every time
    for start in range(len(first)):
        # here we will store the current subsequence
        subsequence = ''
        # store the index of the found character
        idx = -1
        # loop through all the characters in the first string, starting at the 'start' position
        for i in first[start:50]:
            # search for the current character in the second string
            pos = second[0:50].find(i)
            # if the character was found and is in the correct sequence add it to the subsequence and update the index
            if pos > idx:
                subsequence += i
                idx = pos
        # if we have a subsequence, add it to the subsequences list
        if len(subsequence) > 0:
            subsequences_list.append(subsequence)
        # increment the start
        startpos += 1
    # sort the list of subsequences with the longest at the top
    subsequences_list.sort(key=len, reverse=True)
    # return the longest subsequence
    return subsequences_list[0]
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('filename')
    args = parser.parse_args()
    # read file as the first argument
    with open(args.filename) as f:
        # loop through each line
        for line in f:
            # if the line is empty it means it's not valid. otherwise print the common subsequence
            if line.strip() not in ['\n', '\r\n', '']:
                strings = line.replace('\n', '').split(';')
                if len(strings[0]) > 50 or len(strings[1]) > 50:
                    break
                print(get_longest_common_subsequence(strings))
    return 0
if __name__ == '__main__':
    main()
The following solution prints the characters common to both strings of each semicolon-separated pair, in no particular order. If a string from the pair is longer than 50 characters, the pair is skipped (it's not difficult to trim it to length 50 if that is desired).
Note: because the output is a set intersection, the characters are unordered and do not necessarily form a subsequence of either string. If ordering is desired, it can be implemented (alphabetic order, the order of the first string, or the order of the second string).
with open('filename.txt') as f:
    for line in f:
        line = line.strip()
        if line and ';' in line and len(line) <= 101:
            a, b = line.split(';')
            a = set(a.strip())
            b = set(b.strip())
            common = a & b # intersection
            if common:
                print(''.join(common))
Also note: if the strings have internal common whitespace (i.e. ABC DE; ZM YCA), it will be part of the output because it will not be stripped. If that is not desired, you can replace the line a = set(a.strip()) with a = {char for char in a if char.strip()}, and likewise for b.
def lcs_recursive(xlist, ylist):
    if not xlist or not ylist:
        return []
    x, xs, y, ys = xlist[0], xlist[1:], ylist[0], ylist[1:]
    if x == y:
        return [x] + lcs_recursive(xs, ys)
    else:
        return max(lcs_recursive(xlist, ys), lcs_recursive(xs, ylist), key=len)
s1 = 'XMJYAUZ'
s2 = 'MZJAWXU'
print(lcs_recursive(s1, s2))
This gives the correct answer, MJAU (returned as a list of characters). X and Z are not part of the answer because a subsequence must keep its characters in the same relative order in both strings.
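The recursion above solves the same subproblems many times, which can become slow for strings near the 50-character limit. A tabulated variant of the same recurrence, offered here as a sketch rather than as the answer's own method, fills a table of subproblem lengths and then walks it to recover the subsequence. The function name longest_common_subsequence is illustrative.

```python
def longest_common_subsequence(a, b):
    """Return one longest common subsequence of strings a and b."""
    # lengths[i][j] holds the LCS length of a[i:] and b[j:].
    lengths = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) - 1, -1, -1):
        for j in range(len(b) - 1, -1, -1):
            if a[i] == b[j]:
                lengths[i][j] = 1 + lengths[i + 1][j + 1]
            else:
                lengths[i][j] = max(lengths[i + 1][j], lengths[i][j + 1])
    # Walk forward through the table to rebuild the subsequence.
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif lengths[i + 1][j] >= lengths[i][j + 1]:
            i += 1
        else:
            j += 1
    return "".join(result)


print(longest_common_subsequence('XMJYAUZ', 'MZJAWXU'))
```

The table has (len(a) + 1) by (len(b) + 1) entries, so the running time grows with the product of the two lengths instead of exponentially.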
