Fixed number of results biopython - python

I am trying to retrieve the search results for a PubMed query via Biopython, using the following code:
from Bio import Entrez
from Bio import Medline
Entrez.email = "A.N.iztb@bobxx.com"
LIM = 3
def search(Term):
    handle = Entrez.esearch(db='pmc', term=Term, retmax=100000000)
    record = Entrez.read(handle)
    idlist = record["IdList"]
    handle = Entrez.efetch(db='pmc', id=idlist, rettype="medline", retmode="text")
    records = Medline.parse(handle)
    return list(records)
mydic=search('(pathological conditions, signs and symptoms[MeSH Terms]) AND (pmc cc license[filter]) ')
print(len(mydic))
No matter how many times I try, I get 10000 in the output. I tried different queries, but I still get 10000. When I check the number of results manually in the browser, I get a different number for each query.
What exactly is going wrong, and how can I make sure that I get all of the results?

You only seem to be changing the esearch limit while leaving efetch alone (and NCBI seems to default to a limit of 10000 for efetch). You need to pass the retstart and retmax arguments to efetch and fetch the records in batches.
See the "Searching for and downloading abstracts using the history" example in the Biopython Tutorial, http://biopython.org/DIST/docs/tutorial/Tutorial.html or http://biopython.org/DIST/docs/tutorial/Tutorial.pdf
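For illustration, the history-based approach from that tutorial example might look roughly like the sketch below. The function names (batch_windows, search_all) and the batch size of 1000 are my own placeholders, not part of Biopython. The Bio import sits inside the function only so that the batching helper can be tried without Biopython or network access.

```python
def batch_windows(total, batch_size):
    """Yield (retstart, retmax) pairs that together cover `total` records."""
    for start in range(0, total, batch_size):
        yield start, min(batch_size, total - start)


def search_all(term, email, db="pmc", batch_size=1000):
    """Fetch every record matching `term` in batches (hypothetical helper)."""
    # Imported here so batch_windows() works without Biopython installed.
    from Bio import Entrez, Medline

    Entrez.email = email
    # usehistory="y" keeps the result set on the NCBI server, so the
    # (possibly huge) ID list never has to be sent back with efetch.
    handle = Entrez.esearch(db=db, term=term, usehistory="y")
    result = Entrez.read(handle)
    handle.close()

    total = int(result["Count"])
    records = []
    for retstart, retmax in batch_windows(total, batch_size):
        handle = Entrez.efetch(
            db=db, rettype="medline", retmode="text",
            retstart=retstart, retmax=retmax,
            webenv=result["WebEnv"], query_key=result["QueryKey"],
        )
        records.extend(Medline.parse(handle))
        handle.close()
    return records
```

Calling search_all(term, "you@example.com") should then return more than 10000 records if the query matches that many, although for very large result sets it is probably kinder to NCBI to process each batch as it arrives rather than hold everything in memory.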

Related

biopython's efetch only returns the first features from any database

Any idea why this code:
handle = Entrez.efetch(db="nuccore",
                       id=['556503834'], rettype="gb",
                       retmode="txt")
print(handle.read())
doesn't return all of the features shown on the NCBI record page? Only the first feature is returned (I was aiming to get the CDS features).
I tried other databases, with the same result.
Change rettype to "gbwithparts":
from Bio import Entrez
Entrez.email = "your@mail.com"  # put your real email address here
handle = Entrez.efetch(db="nuccore", id=['556503834'],
                       rettype="gbwithparts", retmode="text")
print(handle.read())
Note: the request may take a few seconds.

Using Biopython.Entrez to return pubmed records associated with a list of gene symbols

I want to use a list of gene symbols (named t below) in a search in a pubmed database in order to (ultimately) retrieve the DNA sequence of the associated gene. I want to restrict my search to humans only but my current code gives me organisms other than human.
from Bio import Entrez
Entrez.email = '...' #my email: always tell Entrez who you are
t = ['FOXO3']
for i in range(len(t)):
    search = 'human[orgn]'+t[i]
    handle = Entrez.esearch(db='gene',term=search)
    record = Entrez.read(handle)
    t = record[u'IdList']
    handle = Entrez.efetch('nucleotide',id=t[0],rettype='gb',retmode='text')
    print(handle.read())
Can anybody see where I'm going wrong?
You're mixing up the databases. In the esearch you use db=gene, but in the efetch you change it to db=nucleotide. They are different databases, so the same ID points to different records:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=7157
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene&id=7157
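One possible way to get from a gene symbol to nucleotide records without mixing IDs is to search the gene database and then follow its links with elink. This is only a sketch: the helper names are made up, and the [sym] field tag and the choice of nuccore as the link target are my assumptions about what the asker needs, so check them against the Entrez documentation.

```python
def gene_term(symbol, organism="human"):
    """Build a gene-database query restricted to one organism."""
    # An explicit AND avoids gluing the two parts together, which is what
    # 'human[orgn]' + symbol does in the question.
    return "{}[sym] AND {}[orgn]".format(symbol, organism)


def linked_ids(linksets):
    """Flatten parsed elink output into a plain list of linked IDs."""
    return [
        link["Id"]
        for linkset in linksets
        for linkdb in linkset.get("LinkSetDb", [])
        for link in linkdb["Link"]
    ]


def gene_to_nucleotide_ids(symbol, email):
    """Return nuccore IDs linked to the first gene hit (hypothetical helper)."""
    # Imported here so the two helpers above work without Biopython installed.
    from Bio import Entrez

    Entrez.email = email
    handle = Entrez.esearch(db="gene", term=gene_term(symbol))
    gene_ids = Entrez.read(handle)["IdList"]
    handle.close()
    if not gene_ids:
        return []
    # Translate the gene ID into IDs that are valid in the nuccore database.
    handle = Entrez.elink(dbfrom="gene", db="nuccore", id=gene_ids[0])
    linksets = Entrez.read(handle)
    handle.close()
    return linked_ids(linksets)
```

The IDs returned by gene_to_nucleotide_ids("FOXO3", "you@example.com") belong to nuccore, so they can be passed to Entrez.efetch(db="nuccore", ...) with rettype="gb".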

How do I navigate results of a Biopython Entrez efetch?

When I run the following:
from Bio.Blast import NCBIWWW
from Bio import Entrez, SeqIO
Entrez.email = "A.N.Other@example.com"
handle = Entrez.efetch(db="Protein", id= "75192198", rettype = "xml")
record = Entrez.read(handle)
I get back a "Bio.Entrez.Parser.DictionaryElement" that is really difficult to search through. If I want to get, say, the amino acid sequence, I have to type something like this:
record["Bioseq-set_seq-set"][0]["Seq-entry_seq"]["Bioseq"]["Bioseq_inst"]["Seq-inst"]["Seq-inst_seq-data"]["Seq-data"]["Seq-data_iupacaa"]["IUPACaa"]
I know that there has to be an easier way to index the elements in these results. If anyone out there can lend me a hand with this I'd appreciate it very much.
If what you want is the sequence, then instead of querying it in "xml" format, query it in (for example) FASTA format, by changing the rettype argument. Then it's as simple as parsing it using SeqIO.
handle = Entrez.efetch(db="Protein", id="75192198", rettype="fasta", retmode="text")
for r in SeqIO.parse(handle, "fasta"):
    print(r.id, r.seq)
This works because the contents of handle look like:
print(handle.read())
# >gi|75192198|sp|Q9MAH8.1|TCP3_ARATH RecName: Full=Transcription factor TCP3
# MAPDNDHFLDSPSPPLLEMRHHQSATENGGGCGEIVEVQGGHIVRSTGRKDRHSKVCTAKGPRDRRVRLS
# APTAIQFYDVQDRLGFDRPSKAVDWLITKAKSAIDDLAQLPPWNPADTLRQHAAAAANAKPRKTKTLISP
# PPPQPEETEHHRIGEEEDNESSFLPASMDSDSIADTIKSFFPVASTQQSYHHQPPSRGNTQNQDLLRLSL
# QSFQNGPPFPNQTEPALFSGQSNNQLAFDSSTASWEQSHQSPEFGKIQRLVSWNNVGAAESAGSTGGFVF
# ASPSSLHPVYSQSQLLSQRGPLQSINTPMIRAWFDPHHHHHHHQQSMTTDDLHHHHPYHIPPGIHQSAIP
# GIAFASSGEFSGFRIPARFQGEQEEHGGDNKPSSASSDSRH
If you still want some of the other metadata (such as transcription factor binding sites within the gene, or the taxonomy of the organism), you can also download the record in GenBank format by passing rettype="gb" and parsing it with the "gb" format in SeqIO. You can learn more about that in the example here.
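As a rough sketch of that GenBank route: the helper names below are hypothetical, and the annotation keys ("organism", "taxonomy") are the ones Biopython's GenBank parser normally fills in, though not every record is guaranteed to have them.

```python
def summarize_record(record):
    """Pull a few metadata fields out of a SeqRecord-like object."""
    return {
        "id": record.id,
        "length": len(record.seq),
        "organism": record.annotations.get("organism"),
        "taxonomy": record.annotations.get("taxonomy", []),
        # Distinct feature types present in the record, e.g. "source".
        "feature_types": sorted({feature.type for feature in record.features}),
    }


def fetch_genbank(protein_id, email):
    """Download one protein record in GenBank format (hypothetical helper)."""
    # Imported here so summarize_record() works without Biopython installed.
    from Bio import Entrez, SeqIO

    Entrez.email = email
    handle = Entrez.efetch(db="protein", id=protein_id,
                           rettype="gb", retmode="text")
    try:
        # SeqIO.read expects exactly one record in the handle.
        return SeqIO.read(handle, "gb")
    finally:
        handle.close()
```

With network access, summarize_record(fetch_genbank("75192198", "you@example.com")) should give the taxonomy and feature types without any of the deep dictionary indexing from the question.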

How to increase the number of tweets read using python for sentiment analyser

To build a sentiment summariser I need to read a large number of tweets. I use the following code to fetch tweets from Twitter, but only 10 to 20 tweets are returned. What changes can I make to this code to increase the number of tweets to 100 or more?
t.statuses.home_timeline()
query = input("Enter a search query: ")
data = t.search.tweets(q=query)
for i in range(len(data['statuses'])):
    test = data['statuses'][i]['text']
    print(test)
By default, it returns only 20 tweets. Use the count parameter in your request; see the statuses/home_timeline documentation page.
The code below gets 100 tweets. Note that count must be less than or equal to 200.
t.statuses.home_timeline(count=100)
Update, after testing and getting output:
I tried counts of 50 and 100 and got a large number of tweets back. Here's the code:
Save the code below as test.py. Create a new directory, put test.py and the Twitter 1.14.1 library in it, open a terminal, cd to that directory, and run python test.py.
from twitter import *
t = Twitter(
    auth=OAuth('OAUTH_TOKEN', 'OAUTH_SECRET',
               'CONSUMER_KEY', 'CONSUMER_SECRET')
)
query = int(input("Type how many tweets do you need:\n"))
x = t.statuses.home_timeline(count=query)
# Iterate over what was actually returned; it may be fewer than requested.
for tweet in x:
    print(tweet['text'])
There is a limit to the number of tweets an application can fetch in a single request. You need to iterate through the results to get more than what you are returned in a single request. Take a look at this article on the twitter developer site that explains how to work with iterating through the results.
Note that the number of results also depends on the query you are searching for.
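To make the "iterate through the results" advice concrete, here is a minimal sketch of max_id paging, which is the usual technique for the v1.1-style timeline endpoints used above. fetch_timeline is a made-up helper, and it assumes each tweet is a dict with a numeric 'id' key and that the endpoint accepts count and max_id keyword arguments.

```python
def fetch_timeline(fetch_page, total, page_size=200):
    """Collect up to `total` tweets by paging backwards with max_id.

    `fetch_page` is any callable that accepts `count` and (optionally)
    `max_id`, for example t.statuses.home_timeline.
    """
    tweets = []
    max_id = None
    while len(tweets) < total:
        kwargs = {"count": min(page_size, total - len(tweets))}
        if max_id is not None:
            kwargs["max_id"] = max_id
        page = fetch_page(**kwargs)
        if not page:
            break  # nothing older is available
        tweets.extend(page)
        # Ask next time only for tweets strictly older than the oldest seen.
        max_id = min(tweet["id"] for tweet in page) - 1
    return tweets
```

With the Twitter object from the answer above, fetch_timeline(t.statuses.home_timeline, 500) would issue several requests of at most 200 tweets each. Rate limits still apply, so a real script would probably want to pause between requests.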

BioPython Pubmed Eutils url?

I'm trying to run some queries against Pubmed's Eutils service. If I run them on the website I get a certain number of records returned, in this case 13126 (link to pubmed).
A while ago I bodged together a python script to build a query to do much the same thing, and the resultant url returns the same number of hits (link to Eutils result).
Of course, not having any formal programming background, it was all a bit kludgy, so I'm trying to do the same thing using Biopython. I think the following code should do the same thing, but it returns a greater number of hits, 23303.
from Bio import Entrez
Entrez.email = "A.N.Other@example.com"
handle = Entrez.esearch(db="pubmed", term="stem+cell[All Fields]",datetype="pdat", mindate="2012", maxdate="2012")
record = Entrez.read(handle)
print(record["Count"])
I'm fairly sure it's just down to some subtlety in how the url is being generated, but I can't work out how to see what url is being generated by Biopython. Can anyone give me some pointers?
Thanks!
EDIT:
It's something to do with how the url is being generated, as I can get back the original number of hits by modifying the code to include double quotes around the search term, thus:
handle = Entrez.esearch(db='pubmed', term='"stem+cell"[ALL]', datetype='pdat', mindate='2012', maxdate='2012')
I'm still interested in knowing what URL is being generated by Biopython, as it'll help me work out how I have to structure the search term when I want to do more complicated searches.
handle = Entrez.esearch(db="pubmed", term="stem+cell[All Fields]",datetype="pdat", mindate="2012", maxdate="2012")
print(handle.url)
You've solved this already (Entrez likes explicit double quoting round combined search terms), but currently the URL generated is not exposed via the API. The simplest trick would be to edit the Bio/Entrez/__init__.py file to add a print statement inside the _open function.
Update: Recent versions of Biopython now save the URL as an attribute of the returned handle, i.e. in this example try doing print(handle.url)
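If you just want a feel for how a search term ends up in a URL, the standard library shows it without touching the network. This is not Biopython's internal code, only ordinary URL encoding applied to the same parameters (esearch_url is a made-up helper), so treat handle.url as the authoritative answer for what Biopython actually sent.

```python
from urllib.parse import urlencode

BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(**params):
    """Build an esearch URL with standard query-string encoding."""
    return BASE + "?" + urlencode(params)


# A literal "+" in the term is escaped as %2B, so the server sees a plus
# sign rather than a space.
print(esearch_url(db="pubmed", term="stem+cell[All Fields]"))
# → https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=stem%2Bcell%5BAll+Fields%5D

# A space in the term is what becomes "+" in the URL.
print(esearch_url(db="pubmed", term="stem cell[All Fields]"))
# → https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=stem+cell%5BAll+Fields%5D
```

If Biopython encodes its parameters in a similar way, a "+" typed into the term would not be read as a space, which may be part of why the hit counts differed from the hand-built URL. Comparing print(handle.url) for both terms should confirm or refute that.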
