Can Biopython perform Seq.find() accounting for ambiguity codes - python

I want to be able to search a Seq object for a subsequence Seq object, accounting for ambiguity codes. For example, the following should be true:
from Bio.Seq import Seq
from Bio.Alphabet.IUPAC import IUPACAmbiguousDNA
amb = IUPACAmbiguousDNA()
s1 = Seq("GGAAAAGG", amb)
s2 = Seq("ARAA", amb) # R = A or G
print s1.find(s2)
If ambiguity codes were taken into account, the answer should be
>>> 2
But the answer I get is that no match is found, or
>>> -1
Looking at the Biopython source code, it doesn't appear that ambiguity codes are taken into account: the subsequence is converted to a string using the private _get_seq_str_and_check_alphabet method, then the built-in string method find() is used. Of course, if this is the case, the "R" ambiguity code will be taken as a literal "R", not an A or G.
I could figure out how to do this with a home-made method, but it seems like something that should be taken care of in the Biopython packages using its Seq objects. Is there something I am missing here?
Is there a way to search for sub sequence membership accounting for ambiguity codes?

From what I can read from the documentation for Seq.find here:
http://biopython.org/DIST/docs/api/Bio.Seq.Seq-class.html#find
It appears that this method works similarly to the str.find method in that it looks for an exact match. So, while the DNA sequence can contain ambiguity codes, the Seq.find() method will only return a match when the exact subsequence occurs.
To do what you want, maybe the nt_search function will work:
Search for motifs with degenerate positions
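The idea behind nt_search can also be sketched with the standard library alone: expand each IUPAC ambiguity code in the query into a regular-expression character class and search with re. The IUPAC table and the ambiguous_find helper below are illustrative names, not Biopython API:

```python
import re

# IUPAC nucleotide ambiguity codes mapped to regex character classes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[GC]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def ambiguous_find(seq, subseq):
    """Return the first index where subseq matches seq, honouring
    ambiguity codes in subseq, or -1 if there is no match."""
    pattern = "".join(IUPAC[base] for base in str(subseq).upper())
    match = re.search(pattern, str(seq).upper())
    return match.start() if match else -1

print(ambiguous_find("GGAAAAGG", "ARAA"))  # 2
```

This gives the same answer as the example in the question: "ARAA" becomes the pattern A[AG]AA, which matches at index 2.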

Related

How to append character to specific part of item in list - Python

I have the following list Novar:
["'population_3000|road_class_3_3000'", "'population_3000|road_class_3_3000|trafBuf25'", "'population_3000|road_class_3_3000|trafBuf25|population_1000'"]
How can I append a character (e.g. $) to a specific part of the items in the list, for instance to road_class_3_3000, so that the upgraded list becomes:
["'population_3000|road_class_3_3000$'", "'population_3000|road_class_3_3000$|trafBuf25'", "'population_3000|road_class_3_3000$|trafBuf25|population_1000'"]
Most similar questions on Stack Overflow seem to focus on manipulating the item itself, rather than a part of the item, e.g. here and here.
Therefore, applying the following code:
if (item == "road_class_3_3000"):
item.append("$")
Would be of no use, since road_class_3_3000 is only part of the items "'population_3000|road_class_3_3000'", "'population_3000|road_class_3_3000|trafBuf25'" and "'population_3000|road_class_3_3000|trafBuf25|population_1000'".
You might harness the re module (part of the standard library) for this task in the following way:
import re
novar = ["'population_3000|road_class_3_3000'", "'population_3000|road_class_3_3000|trafBuf25'", "'population_3000|road_class_3_3000|trafBuf25|population_1000'"]
novar2 = [re.sub(r'(?<=road_class_3_3000)', '$', i) for i in novar]
print(novar2)
output
["'population_3000|road_class_3_3000$'", "'population_3000|road_class_3_3000$|trafBuf25'", "'population_3000|road_class_3_3000$|trafBuf25|population_1000'"]
The feature I used is called a positive lookbehind; it is a kind of zero-length assertion. I look for the zero-length position after road_class_3_3000, which I then replace with the $ character.

Python match case shows error with list element as case. Expected ":"

Hi I was using the newly added match case function but I encountered this problem.
Here's my code:
from typing import List

class Wee():
    def __init__(self) -> None:
        self.lololol: List[str] = ["car", "goes", "brrrr"]

    def func1(self):
        self.weee = input()
        try:
            match self.weee:
                case self.lololol[0]:
                    print(self.lololol[0])
                case self.lololol[1]:
                    print(self.lololol[1])
                case _:
                    print(self.lololol[2])
        except SyntaxError as e:
            print(e)

waa = Wee()
waa.func1()
At lines 11 and 13, errors show up saying SyntaxError: expected ':'. However, when I change case self.lololol[0]: to case "car":, the errors disappear. What is happening?
You can’t use arbitrary expressions as patterns (since that would sometimes be ambiguous), only a certain subset.
If you want to match against the elements of a list, you should probably, depending on the situation, either make them separate variables, or use list.index.
This is because the match case statement is intended for structural matching. Specifically PEP 635 says:
Although patterns might superficially look like expressions, it is important to keep in mind that there is a clear distinction. In fact, no pattern is or contains an expression. It is more productive to think of patterns as declarative elements similar to the formal parameters in a function definition.
Specifically, plain variables are used as capture patterns, not as match patterns (definitely not what a C/C++ programmer might expect), and function calls or subscripts are just not allowed.
From a practical point of view, only literal or dotted expressions can be used as matching patterns.

How do I get a count of matching literals when matching regex's in Python?

For context, my use-case is the determine a "precision" score of a regex as it relates to an input string, for the purposes of ranking. While I realize that regex precision can be a nebulous concept, I am defining it for my use-case as the number of matching literals.
Here is an example of inputs & expected results from this mythical function:
>>> get_literal_count(regex=r'foo(bizz|b[a-z]+)*', string='foobarbar')
>>> 5
The way that example breaks down to get a result of 5:
foo [3 literals]
(bizz| [Not matching = 0 literals]
b[a-z]+)* [1 literal (the "b") matched 2 times = 2 literals]
My investigation and playing around with the re and sre_parse modules have led me to the conclusion that the only way to achieve what I am after is to essentially rewrite a home-brew re.match function using sre_parse that tracks the number of matching literals as it goes.
Going down that path, I would need to re-implement all (or at least a large subset) of the sre symbols used in matching, just to end up with what is essentially a really slow re.match function (the standard library re.match is written in C, while my implementation would be in Python). So I pose my question here.
Does anyone have any insight into this problem? Is there a better way?
EDIT: To clarify, the regex being used in my example is arbitrary, the method must operate on any regex.

Searching words without diacritics in a sorted list of words

I've been trying to come up with an efficient solution for the following problem. I have a sorted list of words that contain diacritics and I want to be able to do a search without using diacritics. So for example I want to match 'kříž' just using 'kriz'. After a bit of brainstorming I came up with the following and I want to ask you, more experienced (or clever) ones, whether it's optimal or there's a better solution. I'm using Python but the problem is language independent.
First I provide a mapping of those characters that have some diacritical siblings. So in case of Czech:
cz_map = {'a' : ('á',), ... 'e' : ('é', 'ě') ... }
Now I can easily create all variants of a word on the input. So for 'lama' I get: ['lama', 'láma', 'lamá', 'lámá']. I could already use this to search for words that match any of those permutations, but when it comes to words like 'nepredvidatelny' (unpredictable) one gets 13824 permutations. Even though my laptop has a shining Intel i5 logo on it, this is, to my taste, too naive a solution.
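The naive variant generation described above can be sketched with itertools.product; the variants helper is an illustrative name:

```python
from itertools import product

# partial diacritic map for Czech, as in cz_map above
cz_map = {'a': ('á',), 'e': ('é', 'ě')}

def variants(word, cmap):
    """Generate every diacritical variant of a word (the naive approach)."""
    # each position offers the plain character plus its diacritical siblings
    choices = [(c,) + cmap.get(c, ()) for c in word]
    return ["".join(combo) for combo in product(*choices)]

print(variants("lama", cz_map))  # ['lama', 'lamá', 'láma', 'lámá']
```

The combinatorial blow-up is visible directly: a word with n mapped characters, each with k alternatives, yields (k+1)^n variants.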
Here's an improvement I came up with. The dictionary of words I'm using has a variant of binary search for prefix matching (it returns the word at the lowest index with a matching prefix) that is very useful in this case. I start with the first character, search for its existence as a prefix in the dictionary, and if it's there, I stack it up for the next character, which will be tested appended to all of these stacked-up sequences. This way I'm propagating only those strings that lead to a match. Here's the code:
def dia_search(word, cmap, dictionary):
    prefixes = ['']
    for c in word:
        # each character maps to itself
        subchars = [c]
        # and some diacritical siblings if they exist
        if cmap.has_key(c):
            subchars += cmap[c]
        # build a list of matching prefixes for the next round
        prefixes = [p+s for s in subchars
                        for p in prefixes
                        if dictionary.psearch(p+s) > 0]
    return prefixes
This technique gives very good results but could it be even better? Or is there a technique that doesn't need the character mapping as in this case? I'm not sure this is relevant but the dictionary I'm using isn't sorted by any collate rules so the sequence is 'a', 'z', 'á' not 'a', 'á', 'z' as one could expect.
Thanks for all comments.
EDIT: I cannot create any auxiliary precomputed database that would be a copy of the original one but without diacritics. Let's say the original database is too big to be replicated.
Using the standard library only (str.maketrans and str.translate), you could do this:
intab = "řížéě" # ...add all the other characters
outtab = "rizee" # and the characters you want them translated to
transtab = str.maketrans(intab, outtab)
strg = "abc kříž def "
print(strg.translate(transtab)) # abc kriz def
This is for Python 3.
For Python 2 you'd need to:
from string import maketrans
transtab = maketrans(intab, outtab)
# the rest remains the same
Have a look into Unidecode, with which you can convert diacritics into the closest ASCII equivalents, e.g. unidecode(u'kříž').
As has been suggested, what you want to do is translate your Unicode words (containing diacritics) to the closest standard 26-letter alphabet version.
One way of implementing this would be to create a second list of words (of the same size of the original) with the corresponding translations. Then you do the query in the translated list, and once you have a match look up the corresponding location in the original list.
Or in case you can alter the original list, you can translate everything in-place and strip duplicates.
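A minimal sketch of that second-list idea, using only the standard library (unicodedata for the translation, bisect for the sorted lookup; strip_diacritics and search are hypothetical names):

```python
import bisect
import unicodedata

def strip_diacritics(word):
    """Closest-ASCII form: drop combining marks after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# original sorted list of words (with diacritics)
words = sorted(["kráva", "kříž", "lama", "láma"])

# second, parallel list: (translated word, index into the original list)
translated = sorted((strip_diacritics(w), i) for i, w in enumerate(words))
keys = [t for t, _ in translated]

def search(query):
    """All original words whose diacritic-free form equals the query."""
    hits = []
    pos = bisect.bisect_left(keys, query)
    while pos < len(keys) and keys[pos] == query:
        hits.append(words[translated[pos][1]])
        pos += 1
    return hits

print(search("kriz"))  # ['kříž']
print(search("lama"))  # ['lama', 'láma']
```

Storing the original index alongside each translated word is what lets you map a hit in the translated list back to its location in the original one.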

Search for motifs with degenerate positions

I have a 15-mer nucleotide motif that uses degenerate nucleotide sequences. Example: ATNTTRTCNGGHGCN.
I would like to search a set of sequences for occurrences of this motif. However, my other sequences are exact sequences, i.e. they have no ambiguity.
I have tried doing a for loop within the sequences to search for this, but I have not been able to do non-exact searches. The code I use is modeled after the code on the Biopython cookbook.
for pos, seq in m.instances.search(test_seq):
    print pos, seq
I would like to search for all possible exact instances of the non-exact 15-mer. Is there a function available, or would I have to resort to defining my own function for that? (I'm okay doing the latter, just wanted to triple-check with the world that I'm not duplicating someone else's efforts before I go ahead - I have already browsed through what I thought was the relevant parts of the docs.)
Use Biopython's nt_search. It looks for a subsequence in a DNA sequence, expanding ambiguity codes to the possible nucleotides in that position. Example:
>>> from Bio import SeqUtils
>>> pat = "ATNTTRTCNGGHGCN"
>>> SeqUtils.nt_search("CCCCCCCATCTTGTCAGGCGCTCCCCCC", pat)
['AT[GATC]TT[AG]TC[GATC]GG[ACT]GC[GATC]', 7]
It returns a list where the first item is the search pattern, followed by the positions of the matches.
