Looking for the shared motif between several sequences - python

I need to write a script that loops over a list of sequences, finds the motifs shared between them (multiple solutions for different motifs may exist), and prints a motif that is shared by all of the sequences.
In the example below
chains = ['GATTACA', 'TAGACCA', 'ATACA']
the AT is one of the shared motifs. I'd be grateful for any solution to this task, including one that uses Biopython functions.
Recently I made a script that loops over the same set, takes the shortest sequence as the reference, and then tries to find that reference sequence at each position of the other chains. But I really don't know how to find shared motifs without defining a reference:
# reference
xz = " ".join(chains)
ref = min(xz.split(), key=len)

# LOOKING FOR THE MOTIFS
for chain in chains:
    for i in range(len(chain)):
        if chain == ref:
            pass
        elif ref not in chain:
            print "%s has not been found in the %s" % (ref, chain)
            break
        elif chain[i:].startswith(ref):
            print "%s has been detected in %s in the %d position" % (ref, chain, i+1)

This is only a quick idea; you will have to improve it, because it searches almost the entire space. I hope it helps.
def cut_into_parts(chain, n):
    # all substrings of `chain` of length n
    return [chain[x:x + n] for x in range(0, len(chain) - n + 1)]

def cut_chains(chains, n):
    rlist = []
    for chain in chains:
        rlist.extend(cut_into_parts(chain, n))
    return rlist

def is_str_common(s, chains):
    # True if s occurs in every chain
    for chain in chains:
        if s not in chain:
            return False
    return True

def find_best_common(chains):
    # try candidate lengths from longest to shortest and return the first common one
    shortest = min(len(chain) for chain in chains)
    for n in reversed(range(1, shortest + 1)):
        for candidate in cut_chains(chains, n):
            if is_str_common(candidate, chains):
                return candidate
    return ""

The simplest approach starts with the realization that the longest common substring cannot be longer than the shortest string we're looking at. It should also be obvious that if we start with the longest possible candidate and only examine shorter candidates after eliminating longer ones, then we can stop as soon as we find a common substring.
So, we begin by sorting the DNA strings by length. We'll refer to the length of the shortest one as l. Then the procedure is to test its substrings, beginning with the single substring of length l, and then the two substrings of length l-1, and so forth, until a match is found and we return it.
from Bio import SeqIO

def get_all_substrings(iterable):
    s = tuple(iterable)
    seen = set()
    for size in range(len(s)+1, 1, -1):
        for index in range(len(s)+1-size):
            substring = iterable[index:index+size]
            if substring not in seen:
                seen.add(substring)
                yield substring

def main(input_file, return_all=False):
    substrings = []
    records = list(SeqIO.parse(open(input_file), 'fasta'))
    records = sorted(records, key=lambda record: len(str(record.seq)))
    first, rest = records[0], records[1:]
    rest_sequences = [str(record.seq) for record in rest]
    for substring in get_all_substrings(str(first.seq)):
        if all(substring in seq for seq in rest_sequences):
            if return_all:
                substrings.append(substring)
            else:
                return substring
    return substrings
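For a quick check without a FASTA file, the same generator can be applied directly to the question's list (a small sketch of my own, not part of the original answer):

chains = ['GATTACA', 'TAGACCA', 'ATACA']
shortest, *rest = sorted(chains, key=len)
for motif in get_all_substrings(shortest):
    if all(motif in chain for chain in rest):
        print(motif)  # longest shared motif found first, here 'TA'
        break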

Related

How to find the longest common substring in a list of strings (>2 strings)? Trying FuzzyWuzzy and Sequence matcher

So I am trying to find a common identifier for journals using dois. For example, I have a list of dois for a journal:
['10.1001/jamacardio.2016.5501',
'10.1001/jamacardio.2017.3145',
'10.1001/jamacardio.2018.3029',
'10.1001/jamacardio.2020.5573',
'10.1001/jamacardio.2020.0647']
(The list is much longer than this)
I want to find the longest common substring in my list. I have tried SequenceMatcher, but it can only compare two strings.
from difflib import SequenceMatcher  # needed for SequenceMatcher

journal_list

def longestSubstring(str1, str2):
    # initialize SequenceMatcher object with the input strings
    seqMatch = SequenceMatcher(None, str1, str2)
    # find the longest matching sub-string
    # output will be like Match(a=0, b=0, size=5)
    match = seqMatch.find_longest_match(0, len(str1), 0, len(str2))
    if match.size != 0:
        print(str1[match.a: match.a + match.size])
    else:
        print('No longest common sub-string found')

for journal in journal_list:
    str1 = journal_list[1]
    print(longestSubstring(str1, journal))
Expected output:
'10.1001/jamacardio.20'
I think it's overkill to use any fancy matching library for this and would start with a function that works with two strings:
def common_2(s1, s2):
    longest = ""
    for i in range(min(len(s1), len(s2))):
        if s1[i] == s2[i]:
            longest += s1[i]
        else:
            break
    return longest
Then just apply this repeatedly to all the strings:
def common(ss):
    if len(ss) < 1:
        return ""
    if len(ss) == 1:
        return ss[0]
    part = common_2(ss[0], ss[1])
    for i in range(2, len(ss)):
        part = common_2(part, ss[i])
    return part
>>> journals = ['10.1001/jamacardio.2016.5501', '10.1001/jamacardio.2017.3145', '10.1001/jamacardio.2018.3029', '10.1001/jamacardio.2020.5573', '10.1001/jamacardio.2020.0647']
>>> common(journals)
'10.1001/jamacardio.20'
This only finds the common prefix; if you want general substrings, just modify common_2.
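For example, a pairwise longest-common-substring variant of common_2 could be built on difflib (a sketch of my own; note that folding a pairwise result over the whole list is a heuristic, since the substring that survives the first pair is not guaranteed to be the overall longest one):

from difflib import SequenceMatcher

def common_2_substring(s1, s2):
    # longest common substring (not just prefix) of two strings
    match = SequenceMatcher(None, s1, s2).find_longest_match(0, len(s1), 0, len(s2))
    return s1[match.a: match.a + match.size]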
Simple brute force one, with your list multiplied by 1000 (so it's 5000 strings) it still takes less than a second:
ss = ['10.1001/jamacardio.2016.5501', '10.1001/jamacardio.2017.3145', '10.1001/jamacardio.2018.3029', '10.1001/jamacardio.2020.5573', '10.1001/jamacardio.2020.0647']

def substrings(s):
    return {s[i:j]
            for j in range(len(s)+1)
            for i in range(j+1)}

common = set.intersection(*map(substrings, ss))
print(max(common, key=len))
According to https://en.wikipedia.org/wiki/Longest_common_substring#Suffix_tree, you can use a suffix tree to solve the problem.
The idea is as follows:
Choose one suffix from each string.
Find the longest common prefix of these suffixes.
If the prefix is longer than the recorded best, replace the records with it. Or:
If the prefix has the same length as the recorded best, append it to the records.
If we could try every combination of suffixes, the resulting substring(s) would be the longest. However, implementing this algorithm with plain loops performs poorly. That's why a suffix tree (a trie storing suffixes) is important. If you are unfamiliar with tries, here is the wiki: https://en.wikipedia.org/wiki/Trie. In short, a trie is good at handling prefixes, and by inserting suffixes, those prefixes become general substrings.
Suppose the original string list is called list. When we insert all suffixes of list[i], we attach the information i to every visited node. When a node has every i from 0 to len(list) - 1, we say the node is full. Of course, this information can be implemented with sets. Now the task becomes finding the longest full sequence(s) in the suffix tree.
Back to your problem: finding a common identifier for journals using DOIs. A DOI is not that long, so generating substrings can be accomplished in a reasonable time (although it's relatively slow). The following example doesn't consider Unicode characters, but I doubt they'll appear in DOIs.
Python code (I'm not an expert in Python, but you can regroup these functions and statements into a class for your use case):
from dataclasses import dataclass, field

@dataclass
class Node:
    char: str
    layer: int
    parent: "Node"
    # sources: set = field(default_factory=set)
    # By using an integer counter we can save ~40% time
    sources: int = 0
    children: dict = field(default_factory=dict)

dois = [
    '10.1001/jamacardio.2016.5501', '10.1001/jamacardio.2017.3145',
    '10.1001/jamacardio.2018.3029', '10.1001/jamacardio.2020.5573',
    '10.1001/jamacardio.2020.0647'
]
# Sort the input so the first doi has fewer substrings
dois.sort(key=len)

def full(node):
    global dois
    # return len(node.sources) == len(dois)
    return node.sources == len(dois)

def find_nodes(root):
    """
    Find ending nodes of full-node sequences.
    Since nodes have the property `parent`, it would be easy to trace back,
    and saving nodes is more efficient than building strings.
    """
    results = []
    maxh = 0

    def dfs(node):
        nonlocal results, maxh
        '''
        We can expect the full sequences to start from the top of the suffix
        tree. If not, since an abnormal sequence is a suffix of other suffixes,
        it conflicts with the fact that all possible suffixes have been inserted.
        '''
        if not full(node) and node.layer > 0:
            return
        if node.layer > maxh:
            maxh = node.layer
            results = []
        if node.layer == maxh and maxh > 0:
            results.append(node)
        for next in node.children.values():
            dfs(next)

    dfs(root)
    return results

def build_string(node):
    '''
    Get expected strings from ending nodes.
    '''
    s = ''
    cur = node
    while cur != None and full(cur):
        s += cur.char
        cur = cur.parent
    # Reverse s. Weird that `str(reversed(s))` doesn't work
    return ''.join(reversed(s))

def insert(root, s, source):
    cur = root
    for i in range(len(s)):
        ch = s[i]
        if ch not in cur.children:
            cur.children[ch] = Node(ch, i + 1, cur)
        cur = cur.children[ch]
        # cur.sources.add(source)
        if cur.sources == source:
            cur.sources += 1
        # All following nodes won't be full.
        # This early return saves another 55% time.
        elif cur.sources < source:
            return

root = Node(0, 0, None)
# Insert all suffixes of dois into the tree.
for i in range(len(dois)):
    doi = dois[i]
    for j in range(len(doi)):
        insert(root, doi[j:], i)

results = find_nodes(root)
# Transform nodes to strings.
for i in range(len(results)):
    results[i] = build_string(results[i])
print(results)
There is a pruning step in insert(), which essentially converts inserting all suffixes of dois into inserting only the suffixes of the first doi. So the size of the suffix tree doesn't grow in proportion to len(dois).
To visualize the suffix tree, you can use Graphviz:
def toGraphviz(root):
    s = 'digraph G {\n overlap=scale\n node [style=filled shape=circle]\n'
    stack = []
    stack.append(root)
    while len(stack) > 0:
        node: Node = stack.pop(len(stack) - 1)
        s += f' {id(node)} [label=""]'
        for k in node.children:
            v = node.children[k]
            s += f' "{id(node)}" -> "{id(v)}" [label="{k}" {"color=red penwidth=2.0" if full(v) else ""}]\n'
            stack.append(v)
    s += '}'
    return s

# Graphviz should be installed first: https://graphviz.org/download/
# Use `dot tree.dot -Tsvg -o tree.svg` to render a svg file.
with open('tree.dot', 'w') as f:
    f.write(toGraphviz(root))
Note that visualization only works when the list (dois in the code) is small and the strings are not too long. dot can freeze if the input is too large, and even if it completes, you cannot really analyze such a large graph easily. When I changed dois to ["ABAB", "BABA", "ABBA"] (and removed the pruning step), the visualized suffix tree looked like this:
All full sequences are marked in red, and ["AB", "BA"] is the result. Your original example can be visualized as well, but the image is way too large to display here.
BTW, there is another question similar to this one, and the answers there mostly involve dynamic programming: Longest common substring from more than two strings. I'd admit DP is more precise.
Edit: The DP solution in the link runs much faster than this suffix tree solution. I think that's the real answer.
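For reference, a minimal sketch of that kind of DP (my own sketch, not the linked answer's code): keep a table of common-substring lengths indexed by one position per string, and remember the best entry seen. The table size is the product of the string lengths, so this is only practical for a handful of short strings.

from itertools import product

def longest_common_substring(strings):
    # dp[(i1, ..., in)] = length of the common substring ending at position i_k
    # in string k; only matching positions are stored.
    dp = {}
    best = ""
    for idx in product(*(range(len(s)) for s in strings)):
        ch = strings[0][idx[0]]
        if all(s[i] == ch for s, i in zip(strings, idx)):
            prev = tuple(i - 1 for i in idx)
            length = dp.get(prev, 0) + 1 if all(i > 0 for i in idx) else 1
            dp[idx] = length
            if length > len(best):
                end = idx[0] + 1
                best = strings[0][end - length:end]
    return best

print(longest_common_substring([
    '10.1001/jamacardio.2016.5501',
    '10.1001/jamacardio.2017.3145',
]))  # '10.1001/jamacardio.201'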

Finding regular expression with at least one repetition of each letter

From any *.fasta DNA sequence (only 'ACTG' characters) I must find all substrings which contain at least one occurrence of each letter.
For example, from the sequence 'AAGTCCTAG' I should be able to find: 'AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG' and 'CTAG' (iterating on each letter).
I have no clue how to do that in Python 2.7. I was trying regular expressions, but they did not find every variant.
How can I achieve that?
You could find all substrings of length 4+, and then down select from those to find only the shortest possible combinations that contain one of each letter:
s = 'AAGTCCTAG'

def get_shortest(s):
    l, b = len(s), set('ATCG')
    options = [s[i:j+1] for i in range(l) for j in range(i, l) if (j+1)-i > 3]
    return [i for i in options if len(set(i) & b) == 4 and (set(i) != set(i[:-1]))]

print(get_shortest(s))
Output:
['AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG', 'CTAG']
This is another way you can do it. Maybe not as fast and nice as chrisz's answer, but perhaps a little simpler for beginners to read and understand.
DNA = 'AAGTCCTAG'
toSave = []
for i in range(len(DNA)):
    letters = ['A', 'G', 'T', 'C']
    j = i
    seq = []
    while len(letters) > 0 and j < len(DNA):
        seq.append(DNA[j])
        try:
            letters.remove(DNA[j])
        except ValueError:
            pass
        j += 1
    if len(letters) == 0:
        toSave.append(seq)
print(toSave)
Since the substring you are looking for may be of almost any length, a sliding window (a FIFO queue) seems to work: append one letter at a time and check whether the window contains at least one of each letter; if so, yield it, then drop letters from the front and keep checking until the window is no longer valid.
def find_agtc_seq(seq_in):
    chars = 'AGTC'
    cur_str = []
    for ch in seq_in:
        cur_str.append(ch)
        while all(map(cur_str.count, chars)):
            yield "".join(cur_str)
            cur_str.pop(0)

seq = 'AAGTCCTAG'
for substr in find_agtc_seq(seq):
    print(substr)
That seems to result in the substrings you are looking for:
AAGTC
AGTC
GTCCTA
TCCTAG
CCTAG
CTAG
I really wanted to create a short answer for this, so this is what I came up with!
s = 'AAGTCCTAG'
d = 'ACGT'
c = len(d)
while c <= len(s):
    x, c = s[:c], c+1
    if all(l in x for l in d):
        print(x)
        s, c = s[1:], len(d)
It works as follows:
c is set to the length of the string of characters we want to ensure exist in the string (d = ACGT).
The while loop iterates over each possible substring of s, as long as c is not larger than the length of s; this works by increasing c by 1 on each iteration.
If every character in our string d (ACGT) exists in the substring, we print the result, reset c to its default value, and slice one character off the start of s.
The loop continues until the string s is shorter than d.
Result:
AAGTC
AGTC
GTCCTA
TCCTAG
CCTAG
CTAG
To get the output in a list instead:
s = 'AAGTCCTAG'
d = 'ACGT'
c, r = len(d), []
while c <= len(s):
    x, c = s[:c], c+1
    if all(l in x for l in d):
        r.append(x)
        s, c = s[1:], len(d)
print(r)
Result:
['AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG', 'CTAG']
If you can break the sequence into a list, e.g. of 5-letter sequences, you could then use this function to find repeated sequences.
from itertools import groupby
import numpy as np

def find_repeats(input_list, n_repeats):
    flagged_items = []
    for item in input_list:
        # Create itertools.groupby object
        groups = groupby(str(item))
        # Create list of tuples: (character, number of repeats)
        result = [(label, sum(1 for _ in group)) for label, group in groups]
        # Extract just the number of repeats
        char_lens = np.array([x[1] for x in result])
        # Append to flagged items
        if any(char_lens >= n_repeats):
            flagged_items.append(item)
    # Return flagged items
    return flagged_items

#--------------------------------------
test_list = ['aatcg', 'ctagg', 'catcg']
find_repeats(test_list, n_repeats=2)  # Returns ['aatcg', 'ctagg']

Efficiently searching for prefixes, with wildcards and mismatches

Given a string str and a list of variable-length prefixes p, I want to find all of the prefixes that occur at the start of str, allowing for up to k mismatches; str may contain wildcards (dot characters).
I only want to search at the beginning of the string, and I need to do this efficiently for len(p) <= 1000, k <= 5, and millions of strs.
So for example:
str = 'abc.efghijklmnop'
p = ['abc', 'xxx', 'xbc', 'abcxx', 'abcxxx']
k = 1
result = ['abc', 'xbc', 'abcxx'] #but not 'xxx', 'abcxxx'
Is there an efficient algorithm for this, ideally with a python implementation already available?
My current idea would be to walk through str character by character and keep a running tally of each prefix's mismatch count.
At each step, I would calculate a new list of candidates which is the list of prefixes that do not have too many mismatches.
If I reach the end of a prefix it gets added to the returned list.
So something like this:
def find_prefixes_with_mismatches(str, p, k):
    p_with_end = [prefix + '$' for prefix in p]
    candidates = list(range(len(p)))
    mismatches = [0 for _ in candidates]
    result = []
    for char_ix in range(len(str)):
        # at each iteration we build a new set of candidates
        new_candidates = []
        for prefix_ix in candidates:
            # have we reached the end?
            if p_with_end[prefix_ix][char_ix] == '$':
                # then this is a match
                result.append(p[prefix_ix])
                # do not add to new_candidates
            else:
                # do we have a mismatch?
                if str[char_ix] != p_with_end[prefix_ix][char_ix] and str[char_ix] != '.' and p_with_end[prefix_ix][char_ix] != '.':
                    mismatches[prefix_ix] += 1
                    # only add to new_candidates if the count is still not > k
                    if mismatches[prefix_ix] <= k:
                        new_candidates.append(prefix_ix)
                else:
                    # if not, this remains a candidate
                    new_candidates.append(prefix_ix)
        # update candidates
        candidates = new_candidates
    return result
But I'm not sure if this will be any more efficient than simply searching one prefix after the other, since it requires rebuilding this list of candidates at every step.
I do not know of something that does exactly this.
But if I were to write it, I'd try constructing a trie of all possible decision points, with an attached vector of all states you wound up in. You would then take each string, walk the trie until you hit a final matched node, then return the precompiled vector of results.
If you've got a lot of prefixes and have set k large, that trie may be very big. But if you're amortizing creating it against running it on millions of strings, it may be worthwhile.
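A minimal sketch of a related idea (my own, and simpler than the precompiled decision trie described above): store the prefixes in a trie and walk it per string with a depth-first search that carries the remaining mismatch budget, so characters shared between prefixes are compared only once. Wildcards in the prefixes themselves are not handled here.

def build_trie(prefixes):
    # each node is a dict: {char: child_node}; '$' marks a complete prefix
    root = {}
    for prefix in prefixes:
        node = root
        for ch in prefix:
            node = node.setdefault(ch, {})
        node['$'] = prefix
    return root

def match_prefixes(trie, s, k):
    # DFS over the trie, consuming s from the front; '.' in s matches anything
    results = []
    def walk(node, pos, budget):
        if '$' in node:
            results.append(node['$'])
        if pos >= len(s):
            return
        for ch, child in node.items():
            if ch == '$':
                continue
            cost = 0 if (s[pos] == ch or s[pos] == '.') else 1
            if budget >= cost:
                walk(child, pos + 1, budget - cost)
    walk(trie, 0, k)
    return results

trie = build_trie(['abc', 'xxx', 'xbc', 'abcxx', 'abcxxx'])
print(match_prefixes(trie, 'abc.efghijklmnop', 1))
# finds 'abc', 'xbc' and 'abcxx', but not 'xxx' or 'abcxxx'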

Find substrings in a set of strings

I have a large (50k-100k) set of strings mystrings. Some of the strings in mystrings may be exact substrings of others, and I would like to collapse these (discard the substring and only keep the longest). Right now I'm using a naive method, which has O(N^2) complexity.
unique_strings = set()
for s in sorted(mystrings, key=len, reverse=True):
    keep = True
    for us in unique_strings:
        if s in us:
            keep = False
            break
    if keep:
        unique_strings.add(s)
Which data structures or algorithms would make this task easier and not require O(N^2) operations? Libraries are OK, but I need to stay pure Python.
Finding a substring in a set():
name = set()
name.add('Victoria Stuart')                          ## add single element
name.update(('Carmine Wilson', 'Jazz', 'Georgio'))   ## add multiple elements
name
# {'Jazz', 'Georgio', 'Carmine Wilson', 'Victoria Stuart'}

me = 'Victoria'
if str(name).find(me):
    print('{} in {}'.format(me, name))
# Victoria in {'Jazz', 'Georgio', 'Carmine Wilson', 'Victoria Stuart'}
That's pretty easy -- but somewhat problematic, if you want to return the matching string:
for item in name:
    if item.find(me):
        print(item)
'''
Jazz
Georgio
Carmine Wilson
'''

print(str(name).find(me))
# 39 ## character offset for match (i.e., not a string)
As you can see, the loop above only executes until the condition is True, terminating before printing the item we want (the matching string).
It's probably better, easier to use regex (regular expressions):
import re

for item in name:
    if re.match(me, item):
        full_name = item
        print(item)
# Victoria Stuart

print(full_name)
# Victoria Stuart

for item in name:
    if re.search(me, item):
        print(item)
# Victoria Stuart
From the Python docs:
search() vs. match()
Python offers two different primitive operations based on regular
expressions: re.match() checks for a match only at the beginning of
the string, while re.search() checks for a match anywhere in the
string ...
A naive approach:
1. sort strings by length, longest first # `O(N*log_N)`
2. foreach string: # O(N)
3. insert each suffix into tree structure: first letter -> root, and so on.
# O(L) or O(L^2) depending on string slice implementation, L: string length
4. if inserting the entire string (the longest suffix) creates a new
leaf node, keep it!
O[N*(log_N + L)] or O[N*(log_N + L^2)]
This is probably far from optimal, but should be significantly better than O(N^2) for large N (number of strings) and small L (average string length).
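A pure-Python sketch of that naive approach, using nested dicts as the tree (the function name and structure are mine, not from the answer): insert every suffix of each string, and keep a string only if inserting the string itself created at least one new node.

def collapse_substrings(mystrings):
    root = {}   # trie over suffixes, as nested dicts
    kept = []
    for s in sorted(mystrings, key=len, reverse=True):
        created_new_node = False
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                if ch not in node:
                    node[ch] = {}
                    if start == 0:        # inserting the full string itself
                        created_new_node = True
                node = node[ch]
        if created_new_node:
            kept.append(s)
    return kept

print(collapse_substrings(["abcde", "cde", "xyz", "bcd"]))  # ['abcde', 'xyz']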
You could also iterate through the strings in descending order of length and add all substrings of each string to a set, and only keep those strings that are not already in the set. The algorithmic big O should be the same as for the worst case above (O[N*(log_N + L^2)]), but the implementation is much simpler:
seen_strings, keep_strings = set(), set()
for s in sorted(mystrings, key=len, reverse=True):
    if s not in seen_strings:
        keep_strings.add(s)
        l = len(s)
        for start in range(0, l):
            for end in range(start + 1, l + 1):
                seen_strings.add(s[start:end])
In the meantime I came up with this approach.
from Bio.trie import trie

unique_strings = set()
suffix_tree = trie()
for s in sorted(mystrings, key=len, reverse=True):
    if suffix_tree.with_prefix(s) == []:
        unique_strings.add(s)
        for i in range(len(s)):
            suffix_tree[s[i:]] = 1
The good: ≈15 minutes --> ≈20 seconds for the data set I was working with. The bad: introduces biopython as a dependency, which is neither lightweight nor pure python (as I originally asked).
You can presort the strings and create a dictionary that maps strings to positions in the sorted list. Then you can loop over the list of strings (O(N)) and suffixes (O(L)) and set those entries to None that exist in the position-dict (O(1) dict lookup and O(1) list update). So in total this has O(N*L) complexity where L is the average string length.
strings = sorted(mystrings, key=len, reverse=True)
index_map = {s: i for i, s in enumerate(strings)}
unique = set()
for i, s in enumerate(strings):
    if s is None:
        continue
    unique.add(s)
    for k in range(1, len(s)):
        try:
            index = index_map[s[k:]]
        except KeyError:
            pass
        else:
            if strings[index] is None:
                break
            strings[index] = None
Testing on the following sample data gives a speedup factor of about 21:
import random
from string import ascii_lowercase

mystrings = [''.join(random.choices(ascii_lowercase, k=random.randint(1, 10)))
             for __ in range(1000)]
mystrings = set(mystrings)

Word segmentation using dynamic programming

So, first off, I'm very new to Python, so I'm prefacing this post with an apology in case I'm doing something awful. I've been assigned this problem:
We want to devise a dynamic programming solution to the following problem: there is a string of characters which might have been a sequence of words with all the spaces removed, and we want to find a way, if any, in which to insert spaces that separate valid English words. For example, theyouthevent could be from “the you the vent”, “the youth event” or “they out he vent”. If the input is theeaglehaslande, then there’s no such way. Your task is to implement a dynamic programming solution in two separate ways:
iterative bottom-up version
recursive memoized version
Assume that the original sequence of words had no other punctuation (such as periods), no capital letters, and no proper names - all the words will be available in a dictionary file that will be provided to you.
So I'm having two main issues:
I know that this can and should be done in O(N^2), and I don't think mine is.
The lookup table doesn't seem to be adding all the words, so it isn't reducing the time complexity.
What I'd like:
Any kind of input (better way to do it, something you see wrong in the code, how I can get the lookup table working, how to use the table of booleans to build a sequence of valid words)
Some idea of how to tackle the recursive version, although I feel that once I can solve the iterative version I will be able to engineer the recursive one from it.
As always thanks for any time and or effort anyone gives this, it is always appreciated.
Here's my attempt:
# dictionary function: returns True if word is found in dictionary, False otherwise
def dictW(s):
    diction = open("diction10k.txt", 'r')
    for x in diction:
        x = x.strip("\n \r")
        if s == x:
            return True
    return False

def iterativeSplit(s):
    n = len(s)
    i = j = k = 0
    A = [-1] * n
    word = [""] * n
    booly = False
    for i in range(0, n):
        for j in range(0, i+1):
            prefix = s[j:i+1]
            for k in range(0, n):
                if word[k] == prefix:
                    #booly = True
                    A[k] = 1
                    #print "Array below at index k %d and word = %s"%(k,word[k])
                    #print A
            # print prefix, A[i]
            if (A[i] == -1) or (A[i] == 0):
                if dictW(prefix):
                    A[i] = 1
                    word[i] = prefix
                    #print word[i], i
                else:
                    A[i] = 0
    for i in range(0, n):
        print A[i]
For another real-world example of how to do English word segmentation, look at the source of the Python wordsegment module. It's a little more sophisticated because it uses word and phrase frequency tables, but it illustrates the same ideas.
In particular, segment shows the memoization approach:
def segment(text):
    "Return a list of words that is the best segmenation of `text`."

    memo = dict()

    def search(text, prev='<s>'):
        if text == '':
            return 0.0, []

        def candidates():
            for prefix, suffix in divide(text):
                prefix_score = log10(score(prefix, prev))
                pair = (suffix, prefix)
                if pair not in memo:
                    memo[pair] = search(suffix, prefix)
                suffix_score, suffix_words = memo[pair]
                yield (prefix_score + suffix_score, [prefix] + suffix_words)

        return max(candidates())

    result_score, result_words = search(clean(text))
    return result_words
If you replaced the score function so that it returned "1" for a word in your dictionary and "0" if not then you would simply enumerate all positively scored candidates for your answer.
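For the iterative bottom-up version, here is a minimal sketch of the classic word-break DP (my own sketch; the words set below is a hypothetical stand-in for the provided diction10k.txt):

def word_break(s, words):
    # ok[i] is True when s[:i] can be split into dictionary words;
    # back[i] remembers where the last word of one such split starts.
    n = len(s)
    ok = [False] * (n + 1)
    back = [None] * (n + 1)
    ok[0] = True
    for i in range(1, n + 1):
        for j in range(i):
            if ok[j] and s[j:i] in words:
                ok[i] = True
                back[i] = j
                break
    if not ok[n]:
        return None
    # walk the back-pointers to recover one valid segmentation
    pieces, i = [], n
    while i > 0:
        pieces.append(s[back[i]:i])
        i = back[i]
    return list(reversed(pieces))

words = {"the", "they", "you", "youth", "out", "he", "event", "vent"}
print(word_break("theyouthevent", words))     # ['the', 'youth', 'event']
print(word_break("theeaglehaslande", words))  # None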
Here is the solution in C++. Read and understand the concept, and then implement it.
This video is very helpful for understanding the DP approach.
One more approach which I feel can help is the trie data structure. It is a better way to solve the above problem.
