Efficiently searching for prefixes, with wildcards and mismatches - python

Given a string str and a list of variable-length prefixes p, I want to find every prefix from p that occurs at the start of str, allowing up to k mismatches, with wildcards (the dot character) in str.
I only want to search at the beginning of the string, and I need to do this efficiently for len(p) <= 1000, k <= 5 and millions of strs.
So for example:
str = 'abc.efghijklmnop'
p = ['abc', 'xxx', 'xbc', 'abcxx', 'abcxxx']
k = 1
result = ['abc', 'xbc', 'abcxx'] #but not 'xxx', 'abcxxx'
Is there an efficient algorithm for this, ideally with a python implementation already available?
My current idea would be to walk through str character by character and keep a running tally of each prefix's mismatch count.
At each step, I would calculate a new list of candidates which is the list of prefixes that do not have too many mismatches.
If I reach the end of a prefix it gets added to the returned list.
So something like this:
def find_prefixes_with_mismatches(str, p, k):
    p_with_end = [prefix+'$' for prefix in p]
    candidates = list(range(len(p)))
    mismatches = [0 for _ in candidates]
    result = []
    for char_ix in range(len(str)):
        #at each iteration we build a new set of candidates
        new_candidates = []
        for prefix_ix in candidates:
            #have we reached the end?
            if p_with_end[prefix_ix][char_ix] == '$':
                #then this is a match
                result.append(p[prefix_ix])
                #do not add to new_candidates
            else:
                #do we have a mismatch
                if str[char_ix] != p_with_end[prefix_ix][char_ix] and str[char_ix] != '.' and p_with_end[prefix_ix][char_ix] != '.':
                    mismatches[prefix_ix] += 1
                    #only add to new_candidates if the number is still not >k
                    if mismatches[prefix_ix] <= k:
                        new_candidates.append(prefix_ix)
                else:
                    #if not, this remains a candidate
                    new_candidates.append(prefix_ix)
        #update candidates
        candidates = new_candidates
    return result
But I'm not sure if this will be any more efficient than simply searching one prefix after the other, since it requires rebuilding this list of candidates at every step.

I do not know of something that does exactly this.
But if I were to write it, I'd try constructing a trie of all possible decision points, with an attached vector of all states you wound up in. You would then take each string, walk the trie until you hit a final matched node, then return the precompiled vector of results.
If you've got a lot of prefixes and have set k large, that trie may be very big. But if you're amortizing creating it against running it on millions of strings, it may be worthwhile.
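For what it's worth, here is a rough sketch of one way to read that idea (my own code and naming, not a tested implementation): a node is identified by the set of still-viable (prefix index, mismatch count) pairs, each node stores the prefixes that finish exactly at its depth, and nodes are memoised so shared states are built only once. It assumes every character that can occur in the strings is listed in alphabet (including the '.' wildcard).

def build_trie(p, k, alphabet='abcdefghijklmnopqrstuvwxyz.'):
    cache = {}  # (depth, viable) -> node, so shared states are built once

    def make_node(depth, viable):
        key = (depth, viable)
        if key in cache:
            return cache[key]
        # prefixes that end exactly at this depth are reported here
        node = {'matched': [p[i] for i, m in viable if len(p[i]) == depth],
                'children': {}}
        cache[key] = node
        live = [(i, m) for i, m in viable if len(p[i]) > depth]
        if live:
            for c in alphabet:
                new_viable = []
                for i, m in live:
                    pc = p[i][depth]
                    if c != pc and c != '.' and pc != '.':
                        m += 1          # mismatch unless either side is a wildcard
                    if m <= k:
                        new_viable.append((i, m))
                node['children'][c] = make_node(depth + 1, tuple(new_viable))
        return node

    return make_node(0, tuple((i, 0) for i in range(len(p))))

def find_prefixes(trie, s):
    results, node = [], trie
    for ch in s:
        results.extend(node['matched'])
        node = node['children'].get(ch)
        if node is None:
            # no prefix can still match, stop early
            return results
    results.extend(node['matched'])
    return results

# trie = build_trie(['abc', 'xxx', 'xbc', 'abcxx', 'abcxxx'], k=1)
# find_prefixes(trie, 'abc.efghijklmnop')  # -> ['abc', 'xbc', 'abcxx']

Building the trie is the expensive part; each of the millions of strs is then matched by a plain dictionary walk of at most max(len(prefix)) steps.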

Related

How to find the longest common substring in a list of strings (>2 strings)? Trying FuzzyWuzzy and Sequence matcher

So I am trying to find a common identifier for journals using dois. For example, I have a list of dois for a journal:
['10.1001/jamacardio.2016.5501',
'10.1001/jamacardio.2017.3145',
'10.1001/jamacardio.2018.3029',
'10.1001/jamacardio.2020.5573',
'10.1001/jamacardio.2020.0647']
(The list is much longer than this)
I want to find the longest common substring in my list. I have tried SequenceMatcher, but it can only compare two strings.
from difflib import SequenceMatcher

journal_list = ['10.1001/jamacardio.2016.5501',
                '10.1001/jamacardio.2017.3145',
                '10.1001/jamacardio.2018.3029',
                '10.1001/jamacardio.2020.5573',
                '10.1001/jamacardio.2020.0647']

def longestSubstring(str1, str2):
    #initialize SequenceMatcher object with input strings
    seqMatch = SequenceMatcher(None, str1, str2)
    #find match of longest sub-string
    #output will be like Match(a=0, b=0, size=5)
    match = seqMatch.find_longest_match(0, len(str1), 0, len(str2))
    if match.size != 0:
        print(str1[match.a: match.a + match.size])
    else:
        print('No longest common sub-string found')

for journal in journal_list:
    str1 = journal_list[1]
    print(longestSubstring(str1, journal))
Expected output:
'10.1001/jamacardio.20'
I think it's overkill to use any fancy matching library for this and would start with a function that works with two strings:
def common_2(s1, s2):
    longest = ""
    for i in range(min(len(s1), len(s2))):
        if s1[i] == s2[i]:
            longest += s1[i]
        else:
            break
    return longest
Then just apply this repeatedly to all the strings:
def common(ss):
    if len(ss) < 1:
        return ""
    if len(ss) == 1:
        return ss[0]
    part = common_2(ss[0], ss[1])
    for i in range(2, len(ss)):
        part = common_2(part, ss[i])
    return part
>>> journals = ['10.1001/jamacardio.2016.5501', '10.1001/jamacardio.2017.3145', '10.1001/jamacardio.2018.3029', '10.1001/jamacardio.2020.5573', '10.1001/jamacardio.2020.0647']
>>> common(journals)
'10.1001/jamacardio.20'
This only finds the common prefix; if you want general substrings, just modify common_2.
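As a side note (my addition, not part of the answer above): for the common-prefix case specifically, the standard library already provides this fold via os.path.commonprefix, which compares any list of strings character by character.

import os.path

journals = ['10.1001/jamacardio.2016.5501', '10.1001/jamacardio.2017.3145',
            '10.1001/jamacardio.2018.3029', '10.1001/jamacardio.2020.5573',
            '10.1001/jamacardio.2020.0647']
print(os.path.commonprefix(journals))  # '10.1001/jamacardio.20'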
A simple brute-force one; with your list multiplied by 1000 (so it's 5000 strings) it still takes less than a second:
ss = ['10.1001/jamacardio.2016.5501', '10.1001/jamacardio.2017.3145', '10.1001/jamacardio.2018.3029', '10.1001/jamacardio.2020.5573', '10.1001/jamacardio.2020.0647']

def substrings(s):
    return {s[i:j]
            for j in range(len(s)+1)
            for i in range(j+1)}

common = set.intersection(*map(substrings, ss))
print(max(common, key=len))
According to https://en.wikipedia.org/wiki/Longest_common_substring#Suffix_tree, you can use a suffix tree to solve the problem.
The idea is as follows:
1. Choose one suffix from each string.
2. Find the longest common prefix of these suffixes.
3. If that prefix is longer than the current record, replace the record; if it has the same length, add it to the records.
If we could try every combination of suffixes, we would find the longest substring(s), but implementing this with plain loops performs poorly. That's why a suffix tree (a trie storing suffixes) is important. If you are unfamiliar with tries, here is the wiki: https://en.wikipedia.org/wiki/Trie. In short, a trie is good at handling prefixes, and by inserting suffixes, the prefixes become general substrings.
Suppose the original string list is called list. When we insert all suffixes of list[i], we attach the index i to every visited node. When a node has collected every i from 0 to len(list) - 1, we say the node is full. Of course, this information can be implemented with sets. Now the task becomes finding the longest full sequence(s) in the suffix tree.
Back to your problem: finding a common identifier for journals using dois. A doi is not that long, so generating substrings can be done in reasonable time (although it's relatively slow). The following example doesn't consider Unicode characters, but I doubt they'll appear in dois.
Python code (I'm not an expert in Python, but you can regroup these functions and statements into a class for your own use):
from dataclasses import dataclass, field

@dataclass
class Node:
    char: str
    layer: int
    parent: "Node"
    # sources: set = field(default_factory=set)
    # By using an integer counter we can save ~40% time
    sources: int = 0
    children: dict = field(default_factory=dict)

dois = [
    '10.1001/jamacardio.2016.5501', '10.1001/jamacardio.2017.3145',
    '10.1001/jamacardio.2018.3029', '10.1001/jamacardio.2020.5573',
    '10.1001/jamacardio.2020.0647'
]
# Sort the input so the first doi has fewer substrings
dois.sort(key=len)

def full(node):
    global dois
    # return len(node.sources) == len(dois)
    return node.sources == len(dois)

def find_nodes(root):
    """
    Find ending nodes of full-node sequences.
    Since nodes have the property `parent`, it would be easy to trace back,
    and saving nodes is more efficient than building strings.
    """
    results = []
    maxh = 0

    def dfs(node):
        nonlocal results, maxh
        '''
        We can expect the full sequences to start from the top of the suffix
        tree. If not, since an abnormal sequence is a suffix of other suffixes,
        it conflicts with the fact that all possible suffixes have been inserted.
        '''
        if not full(node) and node.layer > 0:
            return
        if node.layer > maxh:
            maxh = node.layer
            results = []
        if node.layer == maxh and maxh > 0:
            results.append(node)
        for next in node.children.values():
            dfs(next)

    dfs(root)
    return results

def build_string(node):
    '''
    Get expected strings from ending nodes.
    '''
    s = ''
    cur = node
    while cur != None and full(cur):
        s += cur.char
        cur = cur.parent
    # Reverse s. Weird that `str(reversed(s))` doesn't work
    return ''.join(reversed(s))

def insert(root, s, source):
    cur = root
    for i in range(len(s)):
        ch = s[i]
        if ch not in cur.children:
            cur.children[ch] = Node(ch, i + 1, cur)
        cur = cur.children[ch]
        # cur.sources.add(source)
        if cur.sources == source:
            cur.sources += 1
        # All following nodes won't be full.
        # This early return saves another 55% time.
        elif cur.sources < source:
            return

root = Node(0, 0, None)
# Insert all suffixes of dois into the tree.
for i in range(len(dois)):
    doi = dois[i]
    for j in range(len(doi)):
        insert(root, doi[j:], i)

results = find_nodes(root)
# Transform nodes to strings.
for i in range(len(results)):
    results[i] = build_string(results[i])
print(results)
There is a pruning step in insert(), which essentially converts inserting all suffixes of dois into inserting only the suffixes of the first doi. So the size of the suffix tree doesn't grow in proportion to len(dois).
To visualize the suffix tree, you can use Graphviz:
def toGraphviz(root):
    s = 'digraph G {\n overlap=scale\n node [style=filled shape=circle]\n'
    stack = []
    stack.append(root)
    while len(stack) > 0:
        node: Node = stack.pop(len(stack) - 1)
        s += f' {id(node)} [label=""]'
        for k in node.children:
            v = node.children[k]
            s += f' "{id(node)}" -> "{id(v)}" [label="{k}" {"color=red penwidth=2.0" if full(v) else ""}]\n'
            stack.append(v)
    s += '}'
    return s

# Graphviz should be installed first: https://graphviz.org/download/
# Use `dot tree.dot -Tsvg -o tree.svg` to render a svg file.
with open('tree.dot', 'w') as f:
    f.write(toGraphviz(root))
Note that visualization only works when list (dois in the code) is small and the strings are not too long. dot can freeze if the input is too large, and even if it completes, you cannot really analyze such a large graph easily. When I changed dois to ["ABAB", "BABA", "ABBA"] (and removed the pruning step), the visualized suffix tree showed all full sequences marked in red, and ["AB", "BA"] is the result. Your original example can be visualized as well, but the image is way too large to display here.
BTW, there is another question similar to this one, and the answers there mostly involve dynamic programming: Longest common substring from more than two strings. I'd admit DP is more precise.
Edit: The DP solution in the link runs much faster than this suffix tree solution. I think that's the real answer.

Time limit exceeded error. Word Ladder leetcode

I am trying to solve this LeetCode problem (https://leetcode.com/problems/word-ladder/description/):
Given two words (beginWord and endWord), and a dictionary's word list, find the length of shortest transformation sequence from beginWord to endWord, such that:
Only one letter can be changed at a time.
Each transformed word must exist in the word list. Note that beginWord is not a transformed word.
Note:
Return 0 if there is no such transformation sequence.
All words have the same length.
All words contain only lowercase alphabetic characters.
You may assume no duplicates in the word list.
You may assume beginWord and endWord are non-empty and are not the same.
Input:
beginWord = "hit",
endWord = "cog",
wordList = ["hot","dot","dog","lot","log","cog"]
Output:
5
Explanation:
As one shortest transformation is "hit" -> "hot" -> "dot" -> "dog" ->
"cog", return its length 5.
import queue

class Solution:
    def isadjacent(self, a, b):
        count = 0
        n = len(a)
        for i in range(n):
            if a[i] != b[i]:
                count += 1
            if count > 1:
                return False
        if count == 1:
            return True

    def ladderLength(self, beginWord, endWord, wordList):
        word_queue = queue.Queue(maxsize=0)
        word_queue.put((beginWord, 1))
        while word_queue.qsize() > 0:
            queue_last = word_queue.get()
            index = 0
            while index != len(wordList):
                if self.isadjacent(queue_last[0], wordList[index]):
                    new_len = queue_last[1] + 1
                    if wordList[index] == endWord:
                        return new_len
                    word_queue.put((wordList[index], new_len))
                    wordList.pop(index)
                    index -= 1
                index += 1
        return 0
Can someone suggest how to optimise it and prevent the error!
The basic idea is to find the adjacent words faster. Instead of considering every word in the list (even one that has already been filtered by word length), construct each possible neighbor string and check whether it is in the dictionary. To make those lookups fast, make sure the word list is stored in something like a set that supports fast membership tests.
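A minimal sketch of that idea (my own rewrite, not the asker's class; the name ladder_length is mine), assuming all words are lowercase and of equal length, as the problem states:

from collections import deque
from string import ascii_lowercase

def ladder_length(beginWord, endWord, wordList):
    words = set(wordList)                 # O(1) membership tests
    if endWord not in words:
        return 0
    queue = deque([(beginWord, 1)])
    while queue:
        word, length = queue.popleft()
        for i in range(len(word)):
            for c in ascii_lowercase:     # construct every one-letter change
                candidate = word[:i] + c + word[i+1:]
                if candidate == endWord:
                    return length + 1
                if candidate in words:
                    words.remove(candidate)   # visit each word at most once
                    queue.append((candidate, length + 1))
    return 0

# ladder_length("hit", "cog", ["hot", "dot", "dog", "lot", "log", "cog"])  # -> 5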
To go even faster, you could store two sorted word lists, one sorted by the reverse of each word. Then look for possibilities involving changing a letter in the first half in the reversed list and for the latter half in the normal list. All the existing neighbors can then be found without making any non-word strings. This can even be extended to n lists, each sorted by omitting one letter from all the words.
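One way to read that "n lists" idea in dictionary form (my interpretation, not the answerer's exact sorted-list scheme): bucket every word under each pattern obtained by wildcarding one position, so all existing neighbors of a word come from len(word) dictionary lookups.

from collections import defaultdict

def build_buckets(wordList):
    # e.g. "hot" lands in the buckets "*ot", "h*t" and "ho*"
    buckets = defaultdict(list)
    for w in wordList:
        for i in range(len(w)):
            buckets[w[:i] + '*' + w[i+1:]].append(w)
    return buckets

def neighbors(word, buckets):
    # every word that differs from `word` in exactly one position
    for i in range(len(word)):
        for other in buckets[word[:i] + '*' + word[i+1:]]:
            if other != word:
                yield other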

Find substrings in a set of strings

I have a large (50k-100k) set of strings mystrings. Some of the strings in mystrings may be exact substrings of others, and I would like to collapse these (discard the substring and only keep the longest). Right now I'm using a naive method, which has O(N^2) complexity.
unique_strings = set()
for s in sorted(mystrings, key=len, reverse=True):
    keep = True
    for us in unique_strings:
        if s in us:
            keep = False
            break
    if keep:
        unique_strings.add(s)
Which data structures or algorithms would make this task easier and not require O(N^2) operations? Libraries are OK, but I need to stay pure Python.
Finding a substring in a set():
name = set()
name.add('Victoria Stuart') ## add single element
name.update(('Carmine Wilson', 'Jazz', 'Georgio')) ## add multiple elements
name
{'Jazz', 'Georgio', 'Carmine Wilson', 'Victoria Stuart'}
me = 'Victoria'
if str(name).find(me):
    print('{} in {}'.format(me, name))
# Victoria in {'Jazz', 'Georgio', 'Carmine Wilson', 'Victoria Stuart'}
That's pretty easy -- but somewhat problematic, if you want to return the matching string:
for item in name:
    if item.find(me):
        print(item)
'''
Jazz
Georgio
Carmine Wilson
'''
print(str(name).find(me))
# 39 ## character offset for match (i.e., not a string)
As you can see, the loop above only executes until the condition is True, terminating before printing the item we want (the matching string).
It's probably better, easier to use regex (regular expressions):
import re
for item in name:
    if re.match(me, item):
        full_name = item
        print(item)
# Victoria Stuart
print(full_name)
# Victoria Stuart
for item in name:
    if re.search(me, item):
        print(item)
# Victoria Stuart
From the Python docs:
search() vs. match()
Python offers two different primitive operations based on regular
expressions: re.match() checks for a match only at the beginning of
the string, while re.search() checks for a match anywhere in the
string ...
A naive approach (a sketch of this idea in pure Python follows below):
1. sort strings by length, longest first        # O(N*log_N)
2. foreach string:                              # O(N)
3.     insert each suffix into a tree structure: first letter -> root, and so on.
       # O(L) or O(L^2) depending on string slice implementation, L: string length
4.     if inserting the entire string (the longest suffix) creates a new leaf node, keep it!
Total: O[N*(log_N + L)] or O[N*(log_N + L^2)]
This is probably far from optimal, but should be significantly better than O(N^2) for large N (number of strings) and small L (average string length).
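A rough pure-Python sketch of that tree idea (mine, untested on large data; the name collapse_substrings is hypothetical): insert every suffix of each kept string into a dict-of-dicts trie. A new string is a substring of an already-kept one exactly when it can be walked entirely along existing nodes.

def collapse_substrings(mystrings):
    trie = {}            # nested dicts: char -> child dict
    kept = set()
    for s in sorted(mystrings, key=len, reverse=True):
        node, is_known = trie, True
        for ch in s:
            if ch not in node:
                is_known = False
                break
            node = node[ch]
        if is_known:
            # s (or a duplicate of it) occurs inside an already-kept string
            continue
        kept.add(s)
        for i in range(len(s)):  # insert every suffix of the kept string
            node = trie
            for ch in s[i:]:
                node = node.setdefault(ch, {})
    return kept

Memory grows with the total number of trie nodes (worst case quadratic in string length), which is the usual trade-off of this approach.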
You could also iterate through the strings in descending order of length, add all substrings of each kept string to a set, and only keep those strings that are not already in the set. The algorithmic big O should be the same as for the worst case above (O[N*(log_N + L^2)]), but the implementation is much simpler:
seen_strings, keep_strings = set(), set()
for s in sorted(mystrings, key=len, reverse=True):
    if s not in seen_strings:
        keep_strings.add(s)
        l = len(s)
        # record every substring of s so shorter strings contained in it are skipped
        for start in range(0, l):
            for end in range(start+1, l+1):
                seen_strings.add(s[start:end])
In the mean time I came up with this approach.
from Bio.trie import trie

unique_strings = set()
suffix_tree = trie()
for s in sorted(mystrings, key=len, reverse=True):
    if suffix_tree.with_prefix(s) == []:
        unique_strings.add(s)
        for i in range(len(s)):
            suffix_tree[s[i:]] = 1
The good: ≈15 minutes --> ≈20 seconds for the data set I was working with. The bad: introduces biopython as a dependency, which is neither lightweight nor pure python (as I originally asked).
You can presort the strings and create a dictionary that maps strings to positions in the sorted list. Then you can loop over the list of strings (O(N)) and suffixes (O(L)) and set those entries to None that exist in the position-dict (O(1) dict lookup and O(1) list update). So in total this has O(N*L) complexity where L is the average string length.
strings = sorted(mystrings, key=len, reverse=True)
index_map = {s: i for i, s in enumerate(strings)}
unique = set()
for i, s in enumerate(strings):
    if s is None:
        continue
    unique.add(s)
    for k in range(1, len(s)):
        try:
            index = index_map[s[k:]]
        except KeyError:
            pass
        else:
            if strings[index] is None:
                break
            strings[index] = None
Testing on the following sample data gives a speedup factor of about 21:
import random
from string import ascii_lowercase
mystrings = [''.join(random.choices(ascii_lowercase, k=random.randint(1, 10)))
for __ in range(1000)]
mystrings = set(mystrings)

Algorithm to compute edit set for transforming one string into another?

I'd like to compute the edits required to transform one string, A, into another string B using only inserts and deletions, with the minimum number of operations required.
So something like "kitten" -> "sitting" would yield a list of operations something like ("delete at 0", "insert 's' at 0", "delete at 4", "insert 'i' at 3", "insert 'g' at 6")
Is there an algorithm to do this? Note that I don't want just the edit distance, I want the actual edits.
I had an assignment similar to this at one point. Try using an A* variant. Construct a graph of possible 'neighbors' of a given word and search outward using A*, with the distance heuristic being the number of letters that need to change in the current word to reach the target. It should be clear why this is a good heuristic: it never overestimates. You can think of a neighbor as a word that can be reached from the current word using only one operation. It should be clear that, with slight modification, this algorithm will solve your problem optimally.
I tried to make something that works, at least for your precise case.
word_before = "kitten"
word_after = "sitting"
# If the strings aren't the same length, we stuff the smallest one with spaces
if len(word_before) > len(word_after):
word_after += " "*(len(word_before)-len(word_after))
elif len(word_before) < len(word_after):
word_before += " "*(len(word_after)-len(word_before))
operations = []
for idx, char in enumerate(word_before):
if char != word_after[idx]:
if char != " ":
operations += ["delete at "+str(idx)]
operations += ["insert '"+word_after[idx]+"' at "+str(idx)]
print(operations)
This should be what you're looking for. It uses itertools.zip_longest to zip the strings together and iterate over them in pairs, compares each pair and applies the appropriate operation, appending the operation to a list; after each operation it compares the strings and breaks out if they match, or continues if they don't.
from itertools import zip_longest

a = "kitten"
b = "sitting"

def transform(a, b):
    ops = []
    for i, j in zip_longest(a, b, fillvalue=''):
        if i == j:
            pass
        else:
            index = a.index(i)
            print(a, b)
            ops.append('delete {} '.format(i)) if i != '' else ''
            a = a.replace(i, '')
            if a == b:
                break
            ops[-1] += 'insert {} at {},'.format(j, index if i not in b else b.index(j))
    return ops

result = transform(a, b)
print(result, ' {} operation(s) was carried out'.format(len(result)))
Since you only have delete and insert operations, this is an instance of the Longest Common Subsequence problem: https://en.wikipedia.org/wiki/Longest_common_subsequence_problem
Indeed, there is a common subsequence of length k in two strings S and T, S of length n and T of length m, if and only if you can transform S into T with m+n-2k insert and delete operations. Think of it as intuition: the order of the letters is preserved both when adding and deleting letters, as well as when taking a subsequence.
EDIT: since you asked for the list of edits, a possible way to do the edits is to first remove all the characters of S not in the common subsequence, and then insert all the characters of T that are not in the common subsequence.
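If you don't need guaranteed minimality, a compact way to get an actual edit list in this spirit is difflib, whose matching blocks approximate a longest common subsequence. This sketch is mine (the name edit_ops is hypothetical), and positions refer to the string as it exists at each step of the edit sequence:

from difflib import SequenceMatcher

def edit_ops(a, b):
    """Delete/insert operations that turn a into b.

    Positions refer to the string as it exists at that point in the edit
    sequence; difflib does not guarantee a minimal script, but it is close.
    """
    ops = []
    work = list(a)
    shift = 0  # net length change applied so far, to the left of the current block
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
        if tag == 'equal':
            continue
        pos = i1 + shift
        for _ in range(i2 - i1):          # drop the old characters first
            ops.append("delete at %d" % pos)
            del work[pos]
            shift -= 1
        for j in range(j1, j2):           # then insert the new ones in place
            ops.append("insert '%s' at %d" % (b[j], pos))
            work.insert(pos, b[j])
            shift += 1
            pos += 1
    assert ''.join(work) == b
    return ops

# edit_ops("kitten", "sitting")
# -> ["delete at 0", "insert 's' at 0", "delete at 4",
#     "insert 'i' at 4", "insert 'g' at 6"]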

Looking for the shared motif between several sequences

I need to write a script which loops over a list of sequences, finds shared motifs between them (multiple solutions may exist for different motifs) and prints a motif that is shared between all sequences.
In the below example
chains = ['GATTACA', 'TAGACCA', 'ATACA']
the AT is one of the shared motifs. I'll be thankful for any solution to this task, including ones that use BioPython functions.
Recently I made a script which loops over the same set, takes the shortest sequence as the reference, and then tries to find this reference sequence at each position of the other chains. But I really don't know how to find shared motifs without defining a reference:
# reference
xz = " ".join(chains)
ref = min(xz.split(), key=len)

# LOOKING FOR THE MOTIFS
for chain in chains:
    for i in range(len(chain)):
        if chain == ref:
            pass
        elif ref not in chain:
            print("%s has not been found in the %s" % (ref, chain))
            break
        elif chain[i:].startswith(ref):
            print("%s has been detected in %s in the %d position" % (ref, chain, i+1))
It is only a quick idea. You will have to improve it, because it searches almost the entire space. I hope it helps.
def cut_into_parts(chain, n):
    # all substrings of length n
    return [chain[x:x+n] for x in range(0, len(chain) - n + 1)]

def cut_chains(chains, n):
    rlist = []
    for k, v in enumerate(chains):
        rlist.extend(cut_into_parts(v, n))
    return rlist

def is_str_common(s, chains):
    for k, v in enumerate(chains):
        if s not in v:
            return False
    return True

def find_best_common(chains):
    # try the longest candidate lengths first ("inverse" = reversed)
    shortest = min(len(c) for c in chains)
    for n in reversed(range(1, shortest + 1)):
        for candidate in cut_chains(chains, n):
            if is_str_common(candidate, chains):
                return candidate
    return ''
The simplest approach starts with the realization that the longest common substring can not be longer than the shortest string we're looking at. It should also be obvious that if we start with the longest possible candidate and only examine shorter candidates after eliminating longer ones, then we can stop as soon as we find a common substring.
So, we begin by sorting the DNA strings by length. We'll refer to the length of the shortest one as l. Then the procedure is to test its substrings, beginning with the single substring of length l, and then the two substrings of length l-1, and so forth, until a match is found and we return it.
from Bio import SeqIO

def get_all_substrings(iterable):
    s = tuple(iterable)
    seen = set()
    for size in range(len(s)+1, 1, -1):
        for index in range(len(s)+1-size):
            substring = iterable[index:index+size]
            if substring not in seen:
                seen.add(substring)
                yield substring

def main(input_file, return_all=False):
    substrings = []
    records = list(SeqIO.parse(open(input_file), 'fasta'))
    records = sorted(records, key=lambda record: len(str(record.seq)))
    first, rest = records[0], records[1:]
    rest_sequences = [str(record.seq) for record in rest]
    for substring in get_all_substrings(str(first.seq)):
        if all(substring in seq for seq in rest_sequences):
            if return_all:
                substrings.append(substring)
            else:
                return substring
    return substrings
