Common substring in list of strings

I ran into a problem while solving an exercise: given some strings and their lengths, find their common substring. This is my code for the part that loops through the list and then through each word in it:
num_of_cases = int(input())
for i in range(1, num_of_cases+1):
    if __name__ == '__main__':
        len_of_str = list(map(int, input().split()))
        len_of_virus = int(input())
        strings = []
        def string(strings, len_of_str):
            len_of_list = len(len_of_str)
            for i in range(1, len_of_list+1):
                strings.append(input())
        lst_of_subs = []
        virus_index = []
        def substr(strings, len_of_virus):
            for word in strings:
                for i in range(len(len_of_str)):
                    leng = word[i:len_of_virus]
                    lst_of_subs.append(leng)
                    virus_index.append(i)
        print(string(strings, len_of_str))
        print(substr(strings, len_of_virus))
And it prints the following given the strings: ananasso, associazione, tassonomia, massone
['anan', 'nan', 'an', 'n', 'asso', 'sso', 'so', 'o', 'tass', 'ass', 'ss', 's', 'mass', 'ass', 'ss', 's']
It seems that the end index doesn't increase, although I tried adding len_of_virus += 1 at the end of the loop.
sample input:
1
8 12 10 7
4
ananasso
associazione
tassonomia
massone
where the first line is the number of cases, the second line holds the lengths of the strings, the third is the length of the virus (the common substring), and the remaining lines are the strings that I should loop through.
expected output:
Case #1: 4 0 1 1
where the four numbers are the starting indexes of the common substring in each string. (I don't think the printing code matters for this particular problem.)
What should I do?

The problem, besides defining functions in odd places and calling them for their side effects in a way that isn't really encouraged, is here:
    for i in range(len(len_of_str)):
        leng = word[i:len_of_virus]
i increases on each iteration, but len_of_virus stays the same, so you are effectively doing
word[0:4] #when len_of_virus=4
word[1:4]
word[2:4]
word[3:4]
...
That is where 'anan', 'nan', 'an', 'n' come from for the first word, "ananasso", and likewise for the others:
>>> word = "ananasso"
>>> len_of_virus = 4
>>> for i in range(len(word)):
...     word[i:len_of_virus]
...
'anan'
'nan'
'an'
'n'
''
''
''
''
>>>
You can fix it by moving the upper end along with i, but that leaves the same problem at the other end:
>>> for i in range(len(word)):
...     word[i:len_of_virus+i]
...
'anan'
'nana'
'anas'
'nass'
'asso'
'sso'
'so'
'o'
>>>
So a simple adjustment to the range solves the problem:
>>> for i in range(len(word)-len_of_virus+1):
...     word[i:len_of_virus+i]
...
'anan'
'nana'
'anas'
'nass'
'asso'
>>>
Now that the substring part is done, the rest is also easy:
>>> def substring(text, size):
...     return [text[i:i+size] for i in range(len(text)-size+1)]
...
>>> def find_common(lst_text, size):
...     subs = [set(substring(x, size)) for x in lst_text]
...     return set.intersection(*subs)
...
>>> test = """ananasso
... associazione
... tassonomia
... massone""".split()
>>> find_common(test, 4)
{'asso'}
>>>
To find the part common to all the strings in the list we can use sets: first we put all the substrings of each word into a set, and then we intersect them all.
The rest is just printing it to your liking:
>>> virus = find_common(test,4).pop()
>>> print("case 1:",*[x.index(virus) for x in test])
case 1: 4 0 1 1
>>>
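Putting the pieces together, here is a minimal sketch of a complete solution for the input format in the question. The function names (substrings, solve_case, solve) are made up for this illustration, and it assumes every case has at least one common substring of the requested length. It takes the input as one string instead of calling input(), so it is easy to try out:

```python
def substrings(text, size):
    """All substrings of `text` that are exactly `size` characters long."""
    return {text[i:i + size] for i in range(len(text) - size + 1)}


def solve_case(strings, size):
    """Start index of a common substring of length `size` in each string."""
    common = set.intersection(*(substrings(s, size) for s in strings))
    # Assumption: at least one common substring exists.
    # min() only makes the pick deterministic if there are several.
    virus = min(common)
    return [s.index(virus) for s in strings]


def solve(text):
    """Parse the whole input and return one 'Case #n: ...' line per case."""
    lines = iter(text.strip().splitlines())
    results = []
    for case in range(1, int(next(lines)) + 1):
        lengths = next(lines).split()   # one length per string
        size = int(next(lines))         # length of the common substring
        strings = [next(lines).strip() for _ in lengths]
        indexes = " ".join(str(i) for i in solve_case(strings, size))
        results.append(f"Case #{case}: {indexes}")
    return results


sample = """1
8 12 10 7
4
ananasso
associazione
tassonomia
massone"""
print("\n".join(solve(sample)))  # Case #1: 4 0 1 1
```

The lengths line is only used to know how many strings to read; the lengths themselves are not needed because len() gives them directly.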

First extract all the substrings of the given size from the shortest string. Then select the first of these substrings that is present in all of the strings. Finally, output the position of this common substring in each of the strings:
def commonSubs(strings, size):
    base = min(strings, key=len)  # shortest string
    subs = [base[i:i+size] for i in range(len(base)-size+1)]  # all substrings
    cs = next(ss for ss in subs if all(ss in s for s in strings))  # first common
    return [s.index(cs) for s in strings]  # indexes of common substring
output:
S = ["ananasso", "associazione", "tassonomia", "massone"]
print(commonSubs(S,4))
[4, 0, 1, 1]
You could also use a recursive approach:
def commonSubs(strings, size, i=0):
    if i + size > len(strings[0]):  # no full-size substring left to try
        return None
    sub = strings[0][i:i+size]
    if all(sub in s for s in strings):
        return [s.index(sub) for s in strings]
    return commonSubs(strings, size, i+1)

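The simplest way is to use STree from the suffix-trees package. Note that lcs() returns the longest common substring, which may be longer than the length the exercise asks for:
from suffix_trees import STree
STree.STree(["come have some apple pies",
             'apple pie available',
             'i love apple pie haha']).lcs()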

Related

How can I check a string for two letters or more?

I am pulling data from a table that changes often using Python, and the method I am using is not ideal. What I would like is a way to pull all strings that contain at most one letter and leave out anything with two or more.
An example of data I might get:
115
19A6
HYS8
568
In this example, I would like to pull 115, 19A6, and 568.
Currently I am using the isdigit() method to check whether a string is all digits, but this also filters out the numbers that contain one letter. That works for some purposes, but is less than ideal.

Try this:
string_list = ["115", "19A6", "HYS8", "568"]
output_list = []
for item in string_list:  # goes through the string list
    letter_counter = 0
    for letter in item:  # goes through the characters of one string
        if not letter.isdigit():  # counts every character that is not a digit
            letter_counter += 1
    if letter_counter < 2:  # a string with more than 1 letter won't be in the output list
        output_list.append(item)
print(output_list)
Output:
['115', '19A6', '568']
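One thing to be aware of: the loop above counts every non-digit character, so punctuation or spaces would be counted as letters too. If the data can contain such characters, a small variation using str.isalpha() counts only real letters. This is just a sketch, and the helper name has_at_most_one_letter is made up for the example:

```python
def has_at_most_one_letter(text):
    """Return True when text contains zero or one alphabetic characters."""
    # isalpha() is True only for letters, so digits and punctuation are ignored
    return sum(ch.isalpha() for ch in text) <= 1


data = ["115", "19A6", "HYS8", "568"]
print([item for item in data if has_at_most_one_letter(item)])  # ['115', '19A6', '568']
```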

Here is a one-liner with a regular expression:
import re
data = ["115", "19A6", "HYS8", "568"]
out = [string for string in data if len(re.sub(r"\d", "", string)) < 2]
print(out)
Output:
['115', '19A6', '568']

This is an excellent case for regular expressions (regex), which are available through the built-in re module.
The code below follows this logic:
Define the dataset. Two examples have been added to show that a string containing two alphabetic characters is rejected.
Compile a character pattern to be matched: in this case, zero or more digits, followed by zero or one uppercase letter, followed by zero or more digits.
Use the filter function to detect matches in the data list and output the result as a list.
For example:
import re
data = ['115', '19A6', 'HYS8', '568', 'H', 'HI']
rexp = re.compile(r'^\d*[A-Z]{0,1}\d*$')
result = list(filter(rexp.match, data))
print(result)
Output:
['115', '19A6', '568', 'H']

Another solution, without re using str.maketrans/str.translate:
lst = ["115", "19A6", "HYS8", "568"]
d = str.maketrans(dict.fromkeys(map(str, range(10)), ""))
out = [i for i in lst if len(i.translate(d)) < 2]
print(out)
Prints:
['115', '19A6', '568']

a = "19A6"  # example value (assumed; the original snippet did not define a)
z = False
a = str(a)
for I in range(len(a)):
    if a[I].isdigit():
        z = True
        break
    else:
        z = "no digit"
print(z)

Compare list with an exact or scrambled match of a long text

I have the following list. I want to iterate over it, check whether each item has a scrambled match in the long string aapxjdnrbtvldptfzbbdbbzxtndrvjblnzjfpvhdhhpxjdnrbt, and return the number of matches. The example below should return 4.
A scrambled string starts and ends with the same letters as the original; the letters in between are rearranged.
long_string = 'aapxjdnrbtvldptfzbbdbbzxtndrvjblnzjfpvhdhhpxjdnrbt'
my_list = [
'axpaj', # this is scrambled version of aapxj
'apxaj', # this is scrambled version of aapxj
'dnrbt', # this is exact match of dnrbt
'pjxdn', # this is scrambled version of pxjdn
'abd',
]
matches = 0
for l in my_list:
    # check for exact match
    if l in long_string:
        matches += 1
    # check for a scramble match
    # ...
# matches = 1. Wrong should be 4.
def is_anagram(str1, str2):
    str1_list = list(str1)
    str1_list.sort()
    str2_list = list(str2)
    str2_list.sort()
    return (str1_list == str2_list)
is_anagram('axpaj' , 'aapxjdnrbtvldptfzbbdbbzxtndrvjblnzjfpvhdhhpxjdnrbt')
['a', 'a', 'j', 'p', 'x']
['a', 'a', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'd', 'd', 'd', 'd', 'd', ...]

This creates sorted match strings for each different word length required. It builds them on the fly to avoid excess processing.
(Edit: Oops, the previous version assumed one long string in doing the caching. Thanks for the catch, @BeRT2me!)
long_string = 'aapxjdnrbtvldptfzbbdbbzxtndrvjblnzjfpvhdhhpxjdnrbt'
my_list = [
'axpaj', # this is scrambled version of aapxj
'apxaj', # this is scrambled version of aapxj
'dnrbt', # this is exact match of dnrbt
'pjxdn', # this is scrambled version of pxjdn
'abd',
]
anagrams = {}  # anagrams contains sorted slices for each word length
def is_anagram(str1, str2):
    lettercount = len(str1)
    cachekey = (str2, lettercount)
    if cachekey not in anagrams:
        # build the list for that letter length
        anagrams[cachekey] = [sorted(str2[x:x+lettercount]) for x in range(len(str2)-lettercount+1)]
    return (sorted(str1) in anagrams[cachekey])
matches = 0
for l in my_list:
    if is_anagram(l, long_string):
        matches += 1
print(f"There are {matches} in my list.")

I think step 1 of finding a solution is to write code which cuts off the first and last letter of a string.
someFunction("abcde") should return "bcd"
Next you'd need some way to check if the letters in the string are all the same. I would do this by alphabetising and comparing corresponding elements.
alphabetize("khfj") gives "fhjk"
isSame("abc","abc") gives True
You'd then need a way of cutting the string into every substring of a specified length, e.g.:
thisFunction("abcdef", 2) gives ["ab", "bc", "cd", "de", "ef"]
Once you have every possible substring of the length you can check for scramble matches by checking each item in the list against every substring in long_string with the same length
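The steps above could be sketched roughly like this. The placeholder names from the description are replaced with made-up ones (inner, alphabetize, windows, is_scramble, count_matches), and the sketch assumes every word in the list is non-empty:

```python
def inner(word):
    """Cut off the first and last letter: inner("abcde") gives "bcd"."""
    return word[1:-1]


def alphabetize(word):
    """Sort the letters: alphabetize("khfj") gives "fhjk"."""
    return "".join(sorted(word))


def windows(text, size):
    """Every substring of the given length, left to right."""
    return [text[i:i + size] for i in range(len(text) - size + 1)]


def is_scramble(word, candidate):
    """Same first letter, same last letter, same letters in between."""
    return (len(word) == len(candidate)
            and word[0] == candidate[0]
            and word[-1] == candidate[-1]
            and alphabetize(inner(word)) == alphabetize(inner(candidate)))


def count_matches(words, long_string):
    """Count the words that match at least one window of long_string."""
    return sum(
        any(is_scramble(w, c) for c in windows(long_string, len(w)))
        for w in words
    )


long_string = 'aapxjdnrbtvldptfzbbdbbzxtndrvjblnzjfpvhdhhpxjdnrbt'
my_list = ['axpaj', 'apxaj', 'dnrbt', 'pjxdn', 'abd']
print(count_matches(my_list, long_string))  # 4
```

An exact match counts as a scramble here, since its inner letters trivially sort to the same string.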

Here's a basic approach – iterate over all substrings of long_string with the matching length and check if they are equal to the search item after sorting.
matches = 0
for x in my_list:
    # simple case 1: exact match
    if x in long_string:
        matches += 1
        continue
    # sort the string - returns a list
    x = sorted(x)
    # simple case 2: exact match after sorting
    if ''.join(x) in long_string:
        matches += 1
        continue
    window_size = len(x)
    # + 1 so that the last window of the string is checked too
    for i in range(len(long_string) - window_size + 1):
        # sort the substring
        y = sorted(long_string[i:i+window_size])
        # compare - note that both x and y are lists,
        # but it doesn't make any difference for this purpose
        if x == y:
            matches += 1
            break
Note that this won't be very efficient because the substrings are re-sorted for every loop. An easy way to optimize would be to store the sorted substrings in a dictionary that maps substring lengths to sets of sorted substrings ({4: {'aapx', 'ajpx', 'djpx', ...}}).
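A minimal sketch of that dictionary idea, with made-up function names (build_sorted_windows, count_scrambled). Like the loop above, it compares whole sorted strings, so it treats any anagram as a match and does not separately check the first and last letters:

```python
def build_sorted_windows(text, lengths):
    """Map each length to the set of sorted substrings of that length."""
    return {
        n: {"".join(sorted(text[i:i + n])) for i in range(len(text) - n + 1)}
        for n in set(lengths)
    }


def count_scrambled(words, text):
    """Count the words whose sorted letters match some sorted window of text."""
    # each window is sorted once per length, not once per word
    cache = build_sorted_windows(text, (len(w) for w in words))
    return sum("".join(sorted(w)) in cache[len(w)] for w in words)


long_string = 'aapxjdnrbtvldptfzbbdbbzxtndrvjblnzjfpvhdhhpxjdnrbt'
my_list = ['axpaj', 'apxaj', 'dnrbt', 'pjxdn', 'abd']
print(count_scrambled(my_list, long_string))  # 4
```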

Throw a little regex at it for an interesting solution.
It finds matches in the long string that:
Start and end with the same letters as the list item.
Are the same length.
Contain only the inner letters of the list item.
It then checks that the sorted item equals the sorted substring, and that the two differ when unsorted.
import re
def is_anagram(str1, str2):
    return sorted(str1) == sorted(str2) and str1 != str2
def get_matches(str1, str2):
    start, middle, end = str1[0], str1[1:-1], str1[-1]
    pattern = fr'{start}[{middle}]{{{len(middle)}}}{end}'
    return re.findall(pattern, str2)
matches = 0
for l in my_list:
    for item in get_matches(l, long_string):
        if is_anagram(item, l):
            matches += 1
print(matches)
# Output:
4

How to understand the result of list comprehension of nested lists when the order is reversed?

I'm trying to extract numbers that are mixed into sentences. I do this by splitting the sentence into a list of words and then iterating through each character of each word to find the numbers. For example:
String = "is2 Thi1s T4est 3a"
LP = String.split()
result = []
for e in LP:
    for i in e:
        if i in ('123456789'):
            result += i
This gives me the result I want, which is ['2', '1', '4', '3']. Now I want to write this as a list comprehension. After reading the post "List comprehension on a nested list?" I understood that the right code should be:
[i for e in LP for i in e if i in ('123456789') ]
My original code for the list comprehension approach was wrong, but I'm trying to wrap my head around the result I get from it.
My original incorrect code, which reversed the order:
[i for i in e for e in LP if i in ('123456789') ]
The result I get from that is:
['3', '3', '3', '3']
Could anyone explain the process that leads to this result please?

Just reverse the process you found in the other post: nest the loops in the same order as they appear in the comprehension:
for i in e:
    for e in LP:
        if i in ('123456789'):
            print(i)
The code requires both e and LP to be set beforehand, so the outcome you see depends entirely on other code run before your list comprehension.
If we presume that e was set to '3a' (the last element in LP, left over from your code that ran the full loops), then for i in e will run twice, first with i set to '3'. We then get a nested loop, for e in LP, and given your output, LP is 4 elements long. So that iterates 4 times, and on each iteration i == '3', so the if test passes and '3' is added to the output. The next iteration of for i in e: sets i = 'a'; the inner loop runs 4 times again, but now the if test fails.
However, we can't know for certain, because we don't know what code was run last in your environment that set e and LP to begin with.
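To see this for yourself, here is a small sketch that reproduces the result under the assumption described above, namely that e still holds the last word from an earlier loop over LP:

```python
String = "is2 Thi1s T4est 3a"
LP = String.split()

# Assumption: e was left over from an earlier loop, so it is the last word.
e = LP[-1]  # '3a'

# The outermost iterable (e) is evaluated once, up front,
# before the inner loop starts rebinding the name e.
reversed_order = [i for i in e for e in LP if i in '123456789']
print(reversed_order)  # ['3', '3', '3', '3']

# With the loops in the right order, every word is visited once.
correct_order = [i for e in LP for i in e if i in '123456789']
print(correct_order)  # ['2', '1', '4', '3']
```

If e had not been defined beforehand, the first comprehension would raise a NameError instead.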
I'm not sure why your original code uses str.split(), then iterates over all the characters of each word. Whitespace would never pass your if filter anyway, so you could just loop directly over the full String value. The if test can be replaced with a str.isdigit() test:
digits = [char for char in String if char.isdigit()]
or even a regular expression:
digits = re.findall(r'\d', String)
and finally, if this is a reordering puzzle, you'd want to split out your strings into a number (for ordering) and the remainder (for joining); sort the words on the extracted number, and extract the remainder after sorting:
# to sort on numbers, extract the digits and turn to an integer
sortkey = lambda w: int(re.search(r'\d+', w).group())
# 'is2' -> 2, 'Thi1s' -> 1, etc.
# sort the words by sort key
reordered = sorted(String.split(), key=sortkey)
# -> ['Thi1s', 'is2', '3a', 'T4est']
# replace digits in the words and join again
rejoined = ' '.join(re.sub(r'\d+', '', w) for w in reordered)
# -> 'This is a Test'

From the question you asked in a comment ("how would you proceed to reorder the words using the list that we got as index?"):
We can use custom sorting to accomplish this. (Note that regex is not required, but makes it slightly simpler. Use any method to extract the number out of the string.)
import re
test_string = 'is2 Thi1s T4est 3a'
words = test_string.split()
words.sort(key=lambda s: int(re.search(r'\d+', s).group()))
print(words) # ['Thi1s', 'is2', '3a', 'T4est']
To remove the numbers:
words = [re.sub(r'\d', '', w) for w in words]
Final output is:
['This', 'is', 'a', 'Test']

How to split a list based on whether the elements were next to each other in the list they came from?

I'm going through Problem 3 of the MIT lead python course, and I have an admittedly long drawn out script that feels like it's getting close. I need to print the longest substring of s in which the letters occur in alphabetical order. I'm able to pull out any characters that are in alphabetical order with regards to the character next to it. What I need to see is:
Input : 'aezcbobobegghakl'
needed output: 'beggh'
my output: ['a', 'e', 'b', 'b', 'b', 'e', 'g', 'g', 'a', 'k']
My code:
s = 'aezcbobobegghakl'
a = 'abcdefghijklmnopqrstuvwxyz'
len_a = len(a)
len_s = len(s)
number_list = []
letter_list = []
for i in range(len(s)):
    n = 0
    letter = s[i+n]
    if letter in a:
        number_list.append(a.index(letter))
    n += 1
print(number_list)
for i in number_list:
    letter_list.append(a[i])
print(letter_list)
index_list = []
for i in range(len(letter_list)):
    index_list.append(i)
print(index_list)
first_check = []
for i in range(len(letter_list)-1):
    while number_list[i] <= number_list[i+1]:
        print(letter_list[i])
        first_check.append(letter_list[i])
        break
print(first_check)
I know after looking that there are much shorter and completely different ways to solve the problem, but for the sake of my understanding, is it even possible to finish this code to get the output I'm looking for? Or is this just a lost cause rabbit hole I've dug?

I would build a generator to output all the runs of characters such that l[i] >= l[i-1]. Then find the longest of those runs. Something like
def runs(l):
    it = iter(l)
    try:
        run = [next(it)]
    except StopIteration:
        return
    for i in it:
        if i >= run[-1]:
            run.append(i)
        else:
            yield run
            run = [i]
    yield run
def longest_increasing(l):
    return ''.join(max(runs(l), key=len))
Edit: Notes on your code
for i in range(len(s)):
    n = 0
    letter = s[i+n]
    if letter in a:
        number_list.append(a.index(letter))
    n += 1
is getting the "number value" for each letter. You can use the ord function to simplify this:
number_list = [ord(c) - 97 for c in s if c.islower()]
You never use index_list, and you never should. Look into the enumerate function.
first_check = []
for i in range(len(letter_list)-1):
    while number_list[i] <= number_list[i+1]:
        print(letter_list[i])
        first_check.append(letter_list[i])
        break
this part doesn't make a ton of sense. You break out of the while loop every time, so it's basically an if. You have no way of keeping track of more than one run. You have no mechanism here for comparing runs of characters against one another. I think you might be trying to do something like
max_run = []
for i in range(len(letter_list)-1):
    run = []
    for j in range(i, len(letter_list)):
        run.append(letter_list[j])
        if letter_list[j] > letter_list[j+1]:
            break
    if len(run) > len(max_run):
        max_run = run
(Disclaimer: I'm pretty sure the above is off by one but it should be illustrative). The above can be improved in a lot of ways. Note that it loops over the last character as many as len(s) times, making it a n**2 solution. Also, I'm not sure why you need number_list, as strings can be compared directly.
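For comparison, here is a sketch of a single-pass version that avoids both the off-by-one and the repeated scanning. It compares the characters of the string directly, so number_list is not needed. The function name longest_sorted_run is made up for this example:

```python
def longest_sorted_run(s):
    """Return the longest substring of s whose characters never decrease."""
    best_start, best_len = 0, 0
    start = 0  # index where the current run began
    for i in range(1, len(s) + 1):
        # A run ends at the end of the string or where the order breaks.
        if i == len(s) or s[i] < s[i - 1]:
            if i - start > best_len:  # strict > keeps the first of equal runs
                best_start, best_len = start, i - start
            start = i
    return s[best_start:best_start + best_len]


print(longest_sorted_run('aezcbobobegghakl'))  # beggh
```

Each character is looked at once, so the work grows linearly with the length of the string.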

What about a simple recursive approach:
data = 'ezcbobobegghakl'
words = list(data)
string_s = list(map(chr, range(97, 123)))
final_ = []
def ok(list_1, list_2):
    if not list_1:
        return 0
    else:
        first = list_1[0]
        chunks = list_2[list_2.index(first):]
        track = []
        for j, i in enumerate(list_1):
            if i in chunks:
                track.append(i)
                chunks = list_2[list_2.index(i):]
            else:
                final_.append(track)
                return ok(list_1[j:], list_2)
        final_.append(track)
print(ok(words, string_s))
print(max(final_, key=lambda x: len(x)))
output:
['b', 'e', 'g', 'g', 'h']

You can build the set of all substrings of the input string and then keep those that are sorted alphabetically. To determine whether a substring is sorted alphabetically, sort its letters by position in the alphabet and check whether the result equals the original substring:
from string import ascii_lowercase as l
s = 'aezcbobobegghakl'
substrings = set(s[i:b] for i in range(len(s)) for b in range(i+1, len(s)+1))
final_substring = max([i for i in substrings if i == ''.join(sorted(i, key=l.index))], key=len)
Output:
'beggh'

This is one way of getting the job done:
s = 'aezcbobobegghakl'
l = list(s)
run = []
allrun = []
element = 'a'
for e in l:
    if e >= element:
        run.append(e)
        element = e
    else:
        allrun.append(run)
        run = [e]
        element = e
allrun.append(run)  # keep the final run as well
lengths = [len(e) for e in allrun]
result = ''.join(allrun[lengths.index(max(lengths))])
"run" is basically an uninterrupted run; it keeps growing as long as each new element is at least as big as the previous one ("b" is bigger than "a", just string comparison), and resets otherwise.
"allrun" contains all "run"s, which looks like this:
[['a', 'e', 'z'], ['c'], ['b', 'o'], ['b', 'o'], ['b', 'e', 'g', 'g', 'h'], ['a', 'k', 'l']]
"result" finally picks the longest "run" in "allrun", and merges it into one string.
Regarding your code:
It is very inefficient and I would not proceed with it; I would adopt one of the posted solutions.
Your number_list can be written as a one-liner: [a.index(_) for _ in s].
Your letter_list is actually just list(s), and you are using a loop for that!
Your index_list is equivalent to range(len(letter_list)), so what is the append in the loop meant to achieve?
Finally, the way you write loops reminds me of MATLAB. You can iterate directly over the elements of a list; there is no need to iterate over indexes and fetch the corresponding element from the list.

How to create a function that finds the index of a word in a list?

I'm trying to extend my hangman game so that you're able to select the number of letters you want your word to have. I'll do this with selection (e.g. if/else). However, after writing the function I came across an error, so once I found where the error came from I threw away the whole function and started working on solving the error. Basically, I'm trying to use the index() function to locate where the element is, but for some reason it doesn't work: it always outputs 0.
def wordLength(word):
    r = word
    num = word.index(r)
    print(num)

I think what you are trying to achieve is this (since you mentioned in the comments on your question that you used .split()):
sentence = "this is a sentence"
sentence_split = sentence.split()
def wordLength(word, sentence):
    try:
        index = sentence.index(word)
        print(index)
    except ValueError:
        print("word not in sentence")
wordLength("a", sentence_split)
This prints 2, the zero-based index of 'a' in the split sentence (it is the third word).
EDIT
Or, if you want the index number of each letter within each word:
sentence = "this is a sentence"
sentence_split = sentence.split()
letter_index = []
def index_letters():
    for i in sentence_split:
        # i is "this" (1st loop), "is" (2nd loop), etc.
        for x in range(len(i)):
            # loops over every word, and then counts every letter. So within the i='this' loop this will result in four loops (since "this" has 4 letters) in which x = 0 (first loop), 1 (second loop), etc.
            letter = i[x]
            index = i.index(letter)
            letter_index.append([index, letter])
    return letter_index
print(index_letters())
Results in: [[0, 't'], [1, 'h'], [2, 'i'], [3, 's'], [0, 'i'], [1, 's'], [0, 'a'], [0, 's'], [1, 'e'], [2, 'n'], [3, 't'], [1, 'e'], [2, 'n'], [6, 'c'], [1, 'e']]
Note that index() returns the first occurrence, which is why every 'e' in "sentence" is reported at index 1.

If I understand what you are asking, you should be able to take the word chosen for the hangman game, create a list of every letter in the word, and then find the index() of the letter in that list.
Without the question being clearer, it seems like you are asking for the index of a letter in a word. What I have for that is below.
Something like this should help:
hm_word = []  # the list to be created for the given word
def wordlist(hangman_word):  # creates a list of each letter of the word
    for letter in hangman_word:
        hm_word.append(letter)
wordlist("bacon")  # calls the function and puts in the word chosen
print(hm_word)  # just to show the list has been created
print(hm_word.index("o") + 1)  # prints the index of the letter 'o'; you can use a user input here instead
# NOTE: I used +1 in the print so that the index number starts from 1, not 0
You can use this as a starting point to get the index of the letter being guessed, by taking the user input and passing it to index().
I only used "o" as an example of how it works.

def find_word(word, word_list):
    try:
        return word_list.index(word)
    except ValueError:
        return None  # or whatever you need
>>> word_list = ['bubble', 'lizzard', 'fish']
>>> find_word('lizzard', word_list)
1
>>> print(find_word('blizzard', word_list))
None

It's returning 0 because you are running index on the input variable with itself as the argument: any string starts with itself, so the match is always at position 0. You need to search for a part of the input instead. A slight modification of your example that can return something other than 0:
def wordLength(word):
    r = 'r'
    num = word.index(r)
    print(num)
For example, wordLength('word') prints 2.

"wordLength" is a bit of a misnomer here for your function. If you want the length of a word, you can just do:
len(word)
To find the position of a character in a word:
def findChar(word, char):
    # don't go any further if they aren't strings
    assert isinstance(word, str)
    assert isinstance(char, str)
    # return the position of the character you are looking for,
    # in human-friendly form (1 more than the Python index)
    return word.lower().index(char.lower()) + 1
To use:
>>> print(findChar('heat', 'a'))
3  # a is in the third position.
This does not address a single character appearing in a word many times, obviously.
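For a hangman game you will probably want every position of the guessed letter, not just the first. A minimal sketch using enumerate, with a made-up function name (find_all_positions) and assuming char is a single character:

```python
def find_all_positions(word, char):
    """Return the zero-based indexes of every occurrence of char in word."""
    # compare case-insensitively, as in the example above
    return [i for i, letter in enumerate(word.lower()) if letter == char.lower()]


print(find_all_positions("sentence", "e"))  # [1, 4, 7]
print(find_all_positions("bacon", "z"))     # []
```

An empty list means the letter is not in the word, so no exception handling is needed.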

We Keep Coding
