I have a 15-mer nucleotide motif that uses degenerate (ambiguity) codes, for example ATNTTRTCNGGHGCN.
I would like to search a set of sequences for occurrences of this motif. The other sequences are exact, i.e. they contain no ambiguity codes.
I have tried looping over the sequences to search for the motif, but I have not been able to do non-exact searches. The code I use is modeled after the code in the Biopython cookbook:
for pos, seq in m.instances.search(test_seq):
    print pos, seq
I would like to search for all possible exact instances of the non-exact 15-mer. Is there a function available for this, or would I have to define my own? (I'm okay doing the latter; I just wanted to triple-check with the world that I'm not duplicating someone else's efforts before I go ahead. I have already browsed what I thought were the relevant parts of the docs.)
Use Biopython's SeqUtils.nt_search. It looks for a subsequence in a DNA sequence, expanding ambiguity codes to the possible nucleotides at each position. Example:
>>> from Bio import SeqUtils
>>> pat = "ATNTTRTCNGGHGCN"
>>> SeqUtils.nt_search("CCCCCCCATCTTGTCAGGCGCTCCCCCC", pat)
['AT[GATC]TT[AG]TC[GATC]GG[ACT]GC[GATC]', 7]
It returns a list where the first item is the search pattern, followed by the positions of the matches.
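If you need to run this over a whole set of sequences, a minimal sketch could look like the following (the target sequences here are made up; Seq objects work too if you convert them with str()):

from Bio import SeqUtils

pat = "ATNTTRTCNGGHGCN"
# hypothetical target sequences; replace with your own data
targets = {
    "seq1": "CCCCCCCATCTTGTCAGGCGCTCCCCCC",
    "seq2": "AAAATTTTGGGGCCCC",
}

for name, seq in targets.items():
    hits = SeqUtils.nt_search(str(seq), pat)
    # the first element is the expanded pattern, the rest are match positions
    print(name, hits[1:])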
I have a list of words like ['bike', 'motorbike', 'copyright'].
Now I want to check if the word consists of subwords which are also standalone words. That means the output of my algorithm should be something like: ['bike', 'motor', 'motorbike', 'copy', 'right', 'copyright'].
I already know how to check if a word is an English word:
import enchant

english_words = []
arr = ['bike', 'motorbike', 'copyright', 'apfel']
d_brit = enchant.Dict("en_GB")
for word in arr:
    if d_brit.check(word):
        english_words.append(word)
I also found an algorithm which decomposes the word in all possible ways: Splitting a word into all possible 'subwords' - All possible combinations
Unfortunately, splitting the word like this and then checking whether each part is an English word simply takes too long, because my dataset is far too large.
Can anyone help?
The nested for loops used in the code are extremely slow in Python. As performance seems to be the main issue, I would recommend looking for existing Python packages that do parts of the job, building your own extension module (e.g. with Cython), or not using Python at all.
Some alternatives to splitting the word in every possible way (a sketch of the first idea follows below):
search for dictionary words that start with the first characters of the string; if a found word is a prefix of the string, check whether the rest is also a word in the dataset
split the string into two parts at positions that make sense given the length distribution of the dataset, i.e. the most common word lengths, then search for matches with a basic comparison (just a wild idea)
These are a few quick ideas for faster algorithms I can think of. But if these are not quick enough, then BernieD is right.
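A minimal sketch of the first idea, assuming the dictionary fits in memory as a set (set membership tests are O(1), unlike scanning a list):

# split each word at every position and keep the splits where both halves
# are themselves words in the dataset
def find_subwords(word, known_words):
    found = set()
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        if left in known_words and right in known_words:
            found.update((left, right))
    return found

known_words = {'bike', 'motor', 'motorbike', 'copy', 'right', 'copyright'}
for w in ['motorbike', 'copyright']:
    print(w, find_subwords(w, known_words))
# motorbike {'motor', 'bike'}   (set order may vary)
# copyright {'copy', 'right'}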
I have the following problem:
There are n=20 characters in the sequence. For each position there is a predefined list of between 1 and m possible characters (where m is usually a single digit).
How can I enumerate all possible permutations efficiently?
Or, in essence, is there a pre-existing library (numpy?) that could do that before I try it myself?
itertools.product seems to offer what I need. I just need to pass it a list of lists:
itertools.product(*positions)
where positions is a list of lists (e.g. which characters are allowed at which position).
In my case the available options for each position are few, often just one, which keeps the number of possibilities in check, but generating too many combinations might crash your application.
I then build the final strings:
results = []
for s in itertools.product(*positions):
    result = ''.join(s)
    results.append(result)
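For completeness, a minimal self-contained sketch; the positions list here is made up for illustration:

import itertools

# hypothetical per-position options, e.g. derived from ambiguity codes
positions = [['A'], ['T'], ['G', 'A', 'T', 'C'], ['T'], ['T'], ['A', 'G']]

results = [''.join(s) for s in itertools.product(*positions)]
print(len(results))  # 1 * 1 * 4 * 1 * 1 * 2 = 8 combinations
print(results[:3])   # ['ATGTTA', 'ATGTTG', 'ATATTA']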
I have a list of over 500 very important, but arbitrary, strings. They look like:
list_important_codes = ['xido9','uaid3','frps09','ggix21']
What I know
* Casing is not important, but all other characters must match exactly.
* Every string starts with 4 alphabetical characters and ends with either one or two numerical characters.
* I have a list of about 100,000 strings, list_recorded_codes, that were hand-typed and should match list_important_codes exactly, but about 10,000 of them don't. Because these strings were typed manually, the incorrect strings are usually only about one character off (errors such as an added space, two letters switched around, "01" instead of "1", etc.).
What I need to do
I need to iterate through list_recorded_codes and find all of their perfect matches within list_important_codes.
What I tried
I spent about 10 hours trying to manually program a way to fix each word, but it seems to be impractical and incredibly tedious. Not to mention, when my list doubles in size at a later date, I would have to go through that manual process again.
The solution I think I need, and the expected output
I'm hoping that Python's NLTK can efficiently 'score' these arbitrary terms to find a best match. For example, if the word in question is inputword = "gdix88" and it gets compared as score(inputword, "gdox89") = .84 and score(inputword, "sudh88") = .21, my expected output would be highscore = .84, highscoreword = 'gdox89'.
for manually_entered_text in ['xido9', 'uaid3', 'frp09', 'ggix21']:
    get_highest_score_from_important_words()  # returns word_with_highest_score
    manually_entered_text = word_with_highest_score
I am also willing to use a different set of tools to fix this issue if needed, but the simpler the better! Thank you!
The 'score' you are looking for is called an edit distance. There is a lot of literature and there are many algorithms available - easy to find, but only after you know the proper term :)
See the corresponding wikipedia article.
The nltk package provides an implementation of the so-called Levenshtein edit-distance:
from nltk.metrics.distance import edit_distance

if __name__ == '__main__':
    print(edit_distance("xido9", "xido9 "))
    print(edit_distance("xido9", "xido8"))
    print(edit_distance("xido9", "xido9xxx"))
    print(edit_distance("xido9", "xido9"))
The results are 1, 1, 3 and 0 in this case.
Here is the documentation of the corresponding nltk module
There are more specialized versions of this score that take into account how frequent various typing errors are (for example, 'e' instead of 'r' might occur quite often because the keys are next to each other on a QWERTY keyboard).
But classic Levenshtein distance is where I would start.
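Following the pseudocode in the question, a minimal sketch of the matching loop could look like this (list_important_codes is taken from the question; the example typos, the whitespace stripping, and the lower-casing are assumptions based on the stated rules):

from nltk.metrics.distance import edit_distance

list_important_codes = ['xido9', 'uaid3', 'frps09', 'ggix21']
list_recorded_codes = ['xido9 ', 'uadi3', 'frp09', 'ggix21']  # hypothetical hand-typed codes

corrected = []
for recorded in list_recorded_codes:
    # casing is not important, so compare lower-cased; smaller distance = better match
    best = min(list_important_codes,
               key=lambda code: edit_distance(recorded.strip().lower(), code.lower()))
    corrected.append(best)

print(corrected)  # ['xido9', 'uaid3', 'frps09', 'ggix21']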
You could apply a dynamic programming approach to this problem. Once you have your scoring matrix, your alignment matrix, and your local and global alignment functions set up, you can iterate through list_important_codes and find the highest-scoring alignment in list_recorded_codes. Here is a project I did for DNA sequence alignment: DNA alignment. You can easily adapt it to your problem.
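To illustrate the idea, here is a minimal Needleman-Wunsch global-alignment score with made-up scoring values (match = 1, mismatch = -1, gap = -1); the linked project defines its own scoring and alignment matrices:

# dynamic-programming table: dp[i][j] is the best score for a[:i] vs b[:j]
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    rows, cols = len(a) + 1, len(b) + 1
    dp = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dp[i][0] = dp[i - 1][0] + gap
    for j in range(1, cols):
        dp[0][j] = dp[0][j - 1] + gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            dp[i][j] = max(diag, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[-1][-1]

# pick the important code with the highest alignment score for a recorded code
recorded = 'frp09'
best = max(['xido9', 'uaid3', 'frps09', 'ggix21'],
           key=lambda code: global_alignment_score(recorded, code))
print(best)  # frps09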
I've been trying to come up with an efficient solution for the following problem. I have a sorted list of words that contain diacritics and I want to be able to do a search without using diacritics. So for example I want to match 'kříž' just using 'kriz'. After a bit of brainstorming I came up with the following and I want to ask you, more experienced (or clever) ones, whether it's optimal or there's a better solution. I'm using Python but the problem is language independent.
First I provide a mapping of those characters that have some diacritical siblings. So in case of Czech:
cz_map = {'a' : ('á',), ... 'e' : ('é', 'ě') ... }
Now I can easily create all variants of a word on the input. So for 'lama' I get: ['lama', 'láma', 'lamá', 'lámá']. I could already use this to search for words that match any of those permutations, but when it comes to words like 'nepredvidatelny' (unpredictable) one gets 13824 permutations. Even though my laptop has a shining Intel i5 logo on it, this is, to my taste, too naive a solution.
Here's an improvement I came up with. The dictionary of words I'm using has a variant of binary search for prefix matching (it returns the word at the lowest index with a matching prefix) that is very useful in this case. I start with the first character, check whether it exists as a prefix in the dictionary, and if it does, I keep it for the next round, in which the next character (and its diacritical siblings) is appended to all of the kept prefixes. This way I propagate only those strings that can lead to a match. Here's the code:
def dia_search(word, cmap, dictionary):
    prefixes = ['']
    for c in word:
        # each character maps to itself
        subchars = [c]
        # and to some diacritical siblings, if they exist
        if c in cmap:
            subchars += cmap[c]
        # build a list of matching prefixes for the next round
        prefixes = [p + s for s in subchars
                    for p in prefixes
                    if dictionary.psearch(p + s) > 0]
    return prefixes
This technique gives very good results, but could it be even better? Or is there a technique that doesn't need the character mapping used here? I'm not sure this is relevant, but the dictionary I'm using isn't sorted by any collation rules, so the order is 'a', 'z', 'á', not 'a', 'á', 'z' as one could expect.
Thanks for all comments.
EDIT: I cannot create any auxiliary precomputed database that would be a copy of the original one but without diacritics. Let's say the original database is too big to be replicated.
Using only the standard library (str.maketrans and str.translate), you could do this:
intab = "řížéě"   # ...add all the other characters
outtab = "rizee"  # and the characters you want them translated to
transtab = str.maketrans(intab, outtab)

strg = "abc kříž def"
print(strg.translate(transtab))  # abc kriz def
This is for Python 3.
For Python 2 you'd need to:
from string import maketrans
transtab = maketrans(intab, outtab)
# the rest remains the same
Have a look at Unidecode, which you can use to convert diacritics into the closest ASCII equivalents, e.g. unidecode(u'kříž').
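For example (the unidecode package has to be installed separately; this is just a quick illustration):

from unidecode import unidecode

print(unidecode('kříž'))  # kriz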
As has been suggested, what you want to do is translate your unicode words (containing diacritics) to the closest unaccented ASCII version.
One way of implementing this would be to create a second list of words (of the same size as the original) with the corresponding translations. Then you do the query in the translated list, and once you have a match, look up the corresponding location in the original list.
Or in case you can alter the original list, you can translate everything in-place and strip duplicates.
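A minimal sketch of the second-list idea; the word list here is made up, bisect stands in for the dictionary's own prefix search, and unicodedata is used instead of a hand-written mapping:

import unicodedata
from bisect import bisect_left

def strip_diacritics(word):
    # decompose accented characters and drop the combining marks
    return ''.join(c for c in unicodedata.normalize('NFD', word)
                   if unicodedata.category(c) != 'Mn')

original = ['kříž', 'láma', 'lama']           # hypothetical dictionary
translated = sorted((strip_diacritics(w), i)  # keep the index into `original`
                    for i, w in enumerate(original))
keys = [t[0] for t in translated]

def lookup(query):
    pos = bisect_left(keys, query)
    return [original[i] for k, i in translated[pos:] if k == query]

print(lookup('kriz'))  # ['kříž']
print(lookup('lama'))  # ['láma', 'lama']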
I want to be able to search a Seq object for a subsequence Seq object, accounting for ambiguity codes. For example, the following should be true:
from Bio.Seq import Seq
from Bio.Alphabet.IUPAC import IUPACAmbiguousDNA
amb = IUPACAmbiguousDNA()
s1 = Seq("GGAAAAGG", amb)
s2 = Seq("ARAA", amb) # R = A or G
print s1.find(s2)
If ambiguity codes were taken into account, the answer should be
>>> 2
But the answer I get is that no match is found, or
>>> -1
Looking at the Biopython source code, it doesn't appear that ambiguity codes are taken into account: the subsequence is converted to a string using the private _get_seq_str_and_check_alphabet method, and then the built-in string method find() is used. Of course, if this is the case, the "R" ambiguity code will be taken as a literal "R", not an A or G.
I could figure out how to do this with a home-made method, but it seems like something that should be taken care of in the Biopython packages using its Seq objects. Is there something I am missing here?
Is there a way to search for subsequence membership accounting for ambiguity codes?
From what I can read from the documentation for Seq.find here:
http://biopython.org/DIST/docs/api/Bio.Seq.Seq-class.html#find
It appears that this method works similarly to the str.find method in that it looks for an exact match. So, while the DNA sequence can contain ambiguity codes, the Seq.find() method will only return a match when the exact subsequence matches.
To do what you want, maybe the nt_search function will work:
Search for motifs with degenerate positions
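For the example from the question, a quick sketch with nt_search (passing the sequences as plain strings; str(seq) works for Seq objects too):

from Bio import SeqUtils

s1 = "GGAAAAGG"
s2 = "ARAA"  # R = A or G

print(SeqUtils.nt_search(s1, s2))  # ['A[AG]AA', 2]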