Biopython BLAST parameters for short nucleotide sequences - python

I am trying to run blastn through Biopython with NCBIWWW.
I am using the qblast function on a given sample file.
I have a few methods defined and everything works like a charm when my FASTA contains sequences that are long enough. The only case where it fails is when I need to blast reads from Illumina sequencing that are too short. I suspect this is because the BLAST parameters are not automatically adjusted for short queries when the job is submitted.
I tried everything I could to approximate the blastn-short conditions (see table C2 from here), without any success.
It looks like I am not able to pass in the correct parameters.
The closest I think I came to a working call is the following:
result_handle = NCBIWWW.qblast("blastn", "nr",
                               fastaSequence,
                               word_size=7,
                               gapcosts='5 2',
                               nucl_reward=1,
                               nucl_penalty='-3',
                               expect=1000)
Thank you for any tips or advice on making it work.
My sample FASTA read is the following:
>TEST 1-211670
AGACTGCGATCCGAACTGAGAAC
The error that I get is the following:
>ValueError: Error message from NCBI: Message ID#24 Error: Failed to read the Blast query: Protein FASTA provided for nucleotide sequence
When I look at this page, it seems that my problem is about adjusting the threshold, but I have not managed to make it work so far.
Thank you for any help.

I once had problems blasting peptides, and it turned out to be a matter of choosing the right parameters. It took me a very long time to find out what they should be (the information on various websites is inconsistent and scarce, and the NCBI documentation is quite convoluted on this point). I know you are interested in blasting nucleotide sequences, but you may find your solution by looking at the code below. Pay particular attention to parameters such as filter, composition_based_statistics, word_size and matrix_name. In my case they turned out to be crucial.
blast_handle = NCBIWWW.qblast("blastp", "nr",
                              peptide_seq,
                              expect=200000.0,
                              filter=False,
                              word_size=2,
                              composition_based_statistics=False,
                              matrix_name="PAM30",
                              gapcosts="9 1",
                              hitlist_size=1000)

This code works for me (Biopython 1.64):
from Bio.Blast import NCBIWWW
# the query is a plain FASTA string: header line, newline, sequence
fastaSequence = ">TEST 1-211670\nAGACTGCGATCCGAACTGAGAAC"
result_handle = NCBIWWW.qblast("blastn", "nr",
                               fastaSequence,
                               word_size=7,
                               gapcosts='5 2',
                               nucl_reward=1,
                               nucl_penalty='-3',
                               expect=1000)
print(result_handle.getvalue())
Maybe you are passing the wrong fastaSequence. Biopython does not convert SeqRecords (or anything else) to plain FASTA for you; you have to provide the query as a string, as shown above.
BLAST decides whether a sequence is nucleotide or protein by reading the first few characters. If the fraction of them that are in "ACGT" is above a threshold, it is treated as nucleotide, otherwise as protein. Your sequence is 100% "ACGT", so it should not be interpreted as a protein.
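To make the two points above concrete, here is a minimal sketch that builds a plain FASTA string by hand and measures what fraction of its residues are A, C, G or T. The helper names to_fasta and acgt_fraction are hypothetical (they are not part of Biopython), and the actual threshold NCBI applies is not stated here, so the function only reports the fraction rather than making a decision.

```python
def to_fasta(name, sequence):
    """Return a plain FASTA string: a '>' header line followed by the sequence."""
    return ">{0}\n{1}".format(name, sequence)


def acgt_fraction(fasta_text):
    """Fraction of sequence characters (header lines skipped) that are A, C, G or T."""
    residues = "".join(
        line.strip() for line in fasta_text.splitlines() if not line.startswith(">")
    ).upper()
    if not residues:
        return 0.0
    return sum(base in "ACGT" for base in residues) / len(residues)


# Hypothetical usage with the read from the question
query = to_fasta("TEST 1-211670", "AGACTGCGATCCGAACTGAGAAC")
fraction = acgt_fraction(query)
```

If a check like this reports a low fraction for something you believe is a nucleotide read, the string being submitted is probably not the FASTA text you expect.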

Related

How to get the list of matched featured names along with the predict_prob in CalibratedClassifierCV?

I am trying to find the profanity score of a given text received in chats.
For this I went through a couple of Python libraries and found these relevant ones:
profanity-check
alt-profanity-check -- (currently using)
profanity-filter
detoxify
The one I am using (profanity-check) gives me proper results when using
predict and predict_prob against the calibrated_classifier used under the hood after training.
The problem is that I am unable to identify the words that were used to make the prediction or calculate the probability. In short, I want the list of feature names (profane words) found in the test data passed as input.
I know there is no method that returns this, but I would like to fork the library and use it.
I wanted to understand whether we can add something at this place (edit) to create a method for this.
e.g
text = ["this is crap"]
predict([text]) - array([1])
predict_prob([text]) - array([0.99868968])
> predict_words([text]) - array(["crap"]) ---- (NEED THIS)

How to get partial match with soup.find()?

For some reason I wasn't able to find the answer to this anywhere.
So, I'm using this
soup.find(text="Dimensions").find_next().text
to grab the text after "Dimensions". My problem is that on the website I'm scraping it is sometimes displayed as "Dimensions:" (with a colon) and sometimes with a trailing space, "Dimensions ", and then my code throws an error. So I'm looking for something like the following (obviously, this is an invalid piece of code) to get a partial match:
soup.find(if "Dimensions" in text).find_next().text
How can I do that?
OK, I've just found out that it's much simpler than I thought (with import re at the top):
soup.find(text=re.compile(r"Dimensions")).find_next().text
does what I need
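The reason a compiled pattern gives a partial match can be sketched with the re module alone. This assumes BeautifulSoup tests each text node with the pattern's search() method, which looks for the pattern anywhere in the string; the sample labels below are invented for illustration.

```python
import re

pattern = re.compile(r"Dimensions")

# Invented text nodes of the kind described in the question
labels = ["Dimensions", "Dimensions:", "Dimensions ", "Weight"]

# search() succeeds wherever the pattern occurs inside the string,
# so the colon and trailing-space variants are both accepted
matches = [label for label in labels if pattern.search(label)]
```

An exact string such as text="Dimensions" would only accept the first label, which is why the original call failed on the other two variants.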

IDA python Find issues

My goal here is to search through the entire memory range of a process for the following pattern:
pop *
pop *
retn
I've tried using FindText, but it seems that it only returns results for areas whose instructions have already been parsed in IDA. So to use FindText I'd need to figure out how to parse the entire memory range for instructions (which seems like it would be intensive).
So I switched to FindBinary, but I ran into an issue there as well. The pattern I'm searching for only needs to match the first 5 bits of each pop byte; the rest is wildcard. So my goal would be to search for:
01011***
01011***
11000011
I've found posts claiming IDA has a ? wildcard for bytes, but I haven't been able to get it to work, and even if it did, it only seems to work for a full 8 bits. So for this approach I would need to find a way to search for bit patterns and then parse the bits around the result. This seems like the most doable route, but so far I haven't been able to find anything in the docs that can search bits like this.
Does anyone know a way to accomplish what I want?
In classic Stack Overflow style, I spent hours trying to figure it out, then 20 minutes after asking for help I found the exact function I needed, get_byte():
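def find_test():
    base = idaapi.get_imagebase()
    while True:
        # find the next C3 (retn) byte, searching downwards from base
        res = FindBinary(base, SEARCH_NEXT|SEARCH_DOWN, "C3")
        if res == BADADDR: break
        # the two preceding bytes must both start with the bits 01011 (pop)
        if 0b01011 == get_byte(res-1) >> 3 and 0b01011 == get_byte(res-2) >> 3:
            print("{0:X}".format(res))
        base = res+1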
Now, if only I could figure out how to do this with a wildcard in every instruction, because for this solution I need to know at least one full byte of the pattern.
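One way to avoid depending on a fully known byte is to test every position with a bit mask. The sketch below is plain Python rather than IDAPython: it assumes the memory range has already been copied into a bytes object (how to read it out of IDA is not shown), and the function name find_pop_pop_ret and the sample bytes are hypothetical.

```python
POP_MASK = 0b11111000  # keep only the top five bits of a byte
POP_BITS = 0b01011000  # 01011*** (bytes 0x58 to 0x5F)
RETN = 0xC3            # 11000011


def find_pop_pop_ret(data, base=0):
    """Return the addresses where two 01011*** bytes are followed by C3."""
    hits = []
    for i in range(len(data) - 2):
        if ((data[i] & POP_MASK) == POP_BITS
                and (data[i + 1] & POP_MASK) == POP_BITS
                and data[i + 2] == RETN):
            hits.append(base + i)
    return hits


# Invented sample buffer containing two pop/pop/retn sequences
sample = bytes([0x90, 0x5E, 0x5F, 0xC3, 0x58, 0xC3, 0x59, 0x5A, 0xC3])
hits = find_pop_pop_ret(sample, base=0x1000)
```

Unlike the snippet above, which prints the address of the C3 byte, this returns the address of the first pop. The same masking works for any position in the pattern, so none of the bytes has to be known in full.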

How to find the root of a word from its present participle or other variations in Python?

I'm working on an NLP project, and right now I'm stuck on detecting antonyms for certain words that aren't in their "standard" forms (as verbs, adjectives, nouns) but are instead present participles, past tense, or something to that effect. For instance, if I have the word "arriving" or "arrived", I need to convert it to "arrive". Similarly, "came" should be "come". Lastly, “dissatisfied” should be “dissatisfy”. Can anyone help me out with this? I have tried several stemmers and lemmatizers in NLTK with Python, to no avail; most of them don’t produce the correct root. I’ve also thought about the ConceptNet semantic network and other dictionary APIs, but they seem far too complicated for what I need. Any advice is helpful. Thanks!
If you know you'll be working with a limited set, you could create a dictionary.
Example:
look_up = {'arriving': 'arrive',
           'arrived': 'arrive',
           'came': 'come',
           'dissatisfied': 'dissatisfy'}
test = 'arrived'
print(look_up[test])
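A plain dictionary lookup raises KeyError for any word outside the table. A small sketch of a more forgiving variant, where the helper name to_root is hypothetical and unknown words are simply returned unchanged:

```python
look_up = {
    'arriving': 'arrive',
    'arrived': 'arrive',
    'came': 'come',
    'dissatisfied': 'dissatisfy',
}


def to_root(word):
    """Return the mapped root form, or the word itself if it is not in the table."""
    return look_up.get(word.lower(), word)


roots = [to_root(w) for w in ['Arrived', 'came', 'running']]
```

Returning the input unchanged is a design choice for this sketch; returning None instead would make it easier to spot words that still need an entry.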

BioPython Pubmed Eutils url?

I'm trying to run some queries against Pubmed's Eutils service. If I run them on the website I get a certain number of records returned, in this case 13126 (link to pubmed).
A while ago I bodged together a python script to build a query to do much the same thing, and the resultant url returns the same number of hits (link to Eutils result).
Of course, not having any formal programming background, it was all a bit kludgy, so I'm trying to do the same thing using Biopython. I think the following code should do the same thing, but it returns a greater number of hits, 23303.
from Bio import Entrez
Entrez.email = "A.N.Other@example.com"
handle = Entrez.esearch(db="pubmed", term="stem+cell[All Fields]",datetype="pdat", mindate="2012", maxdate="2012")
record = Entrez.read(handle)
print(record["Count"])
I'm fairly sure it's just down to some subtlety in how the url is being generated, but I can't work out how to see what url is being generated by Biopython. Can anyone give me some pointers?
Thanks!
EDIT:
It is something to do with how the url is being generated, as I can get back the original number of hits by modifying the code to put double quotes around the search term:
handle = Entrez.esearch(db='pubmed', term='"stem+cell"[ALL]', datetype='pdat', mindate='2012', maxdate='2012')
I'm still interested in knowing what url is being generated by Biopython, as it will help me work out how I have to structure the search term when I want to do more complicated searches.
handle = Entrez.esearch(db="pubmed", term="stem+cell[All Fields]",datetype="pdat", mindate="2012", maxdate="2012")
print(handle.url)
You've solved this already (Entrez likes explicit double quoting round combined search terms), but currently the URL generated is not exposed via the API. The simplest trick would be to edit the Bio/Entrez/__init__.py file to add a print statement inside the _open function.
Update: Recent versions of Biopython now save the URL as an attribute of the returned handle, i.e. in this example try doing print(handle.url)
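For a rough idea of what such a URL looks like without running a query, here is a sketch using only the standard library. It assumes the parameters are URL-encoded in the usual way with urlencode; this is not Biopython's actual code, Biopython may add further parameters (such as the email address), and build_esearch_url is a hypothetical helper. The value of handle.url remains the authoritative answer.

```python
from urllib.parse import urlencode

# Public E-utilities esearch endpoint
BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(**params):
    """Append URL-encoded query parameters to the esearch endpoint."""
    return BASE_URL + "?" + urlencode(params)


plain = build_esearch_url(db="pubmed", term="stem+cell[All Fields]",
                          datetype="pdat", mindate="2012", maxdate="2012")
quoted = build_esearch_url(db="pubmed", term='"stem+cell"[ALL]',
                           datetype="pdat", mindate="2012", maxdate="2012")
```

Under this encoding a literal + in the term is sent as %2B and a space as +, so a term typed with + in Python is not necessarily the same query as one pasted into a browser address bar.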
