How do I navigate results of a Biopython Entrez efetch?

When I run the following:
from Bio.Blast import NCBIWWW
from Bio import Entrez, SeqIO
Entrez.email = "A.N.Other@example.com"
handle = Entrez.efetch(db="Protein", id="75192198", rettype="xml")
record = Entrez.read(handle)
I get back a "Bio.Entrez.Parser.DictionaryElement" that is really difficult to search through. If I want to, say, get the amino acid sequence, I have to type something like this:
record["Bioseq-set_seq-set"][0]["Seq-entry_seq"]["Bioseq"]["Bioseq_inst"]["Seq-inst"]["Seq-inst_seq-data"]["Seq-data"]["Seq-data_iupacaa"]["IUPACaa"]
I know that there has to be an easier way to index the elements in these results. If anyone out there can lend me a hand with this I'd appreciate it very much.

If what you want is the sequence, then instead of querying it in "xml" format, query it in (for example) FASTA format by changing the rettype argument. Then it's as simple as parsing it with SeqIO.
handle = Entrez.efetch(db="Protein", id="75192198", rettype="fasta")
for r in SeqIO.parse(handle, "fasta"):
    print r.id, r.seq
This works because the contents of handle look like:
print handle.read()
# >gi|75192198|sp|Q9MAH8.1|TCP3_ARATH RecName: Full=Transcription factor TCP3
# MAPDNDHFLDSPSPPLLEMRHHQSATENGGGCGEIVEVQGGHIVRSTGRKDRHSKVCTAKGPRDRRVRLS
# APTAIQFYDVQDRLGFDRPSKAVDWLITKAKSAIDDLAQLPPWNPADTLRQHAAAAANAKPRKTKTLISP
# PPPQPEETEHHRIGEEEDNESSFLPASMDSDSIADTIKSFFPVASTQQSYHHQPPSRGNTQNQDLLRLSL
# QSFQNGPPFPNQTEPALFSGQSNNQLAFDSSTASWEQSHQSPEFGKIQRLVSWNNVGAAESAGSTGGFVF
# ASPSSLHPVYSQSQLLSQRGPLQSINTPMIRAWFDPHHHHHHHQQSMTTDDLHHHHPYHIPPGIHQSAIP
# GIAFASSGEFSGFRIPARFQGEQEEHGGDNKPSSASSDSRH
If you still want some of the other meta-information (such as transcription factor binding sites within the gene, or the taxonomy of the organism), you can also download it in GenBank format by giving the argument rettype="gb" and parsing with "gb". You can learn more about that in the examples in the Biopython Tutorial.
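For instance, a minimal sketch of that GenBank route, which gives you both the sequence and the annotations without the deep dictionary indexing (note: for protein records NCBI's flat-file flavour is GenPept, so if rettype="gb" is rejected try rettype="gp"; SeqIO parses both as "gb"):
from Bio import Entrez, SeqIO
Entrez.email = "A.N.Other@example.com"
handle = Entrez.efetch(db="Protein", id="75192198", rettype="gb", retmode="text")
record = SeqIO.read(handle, "gb")  # a single ID, so read() rather than parse()
handle.close()
print record.id, record.description
print record.annotations.get("organism")  # taxonomy and other metadata live in record.annotations
print record.seq  # the amino acid sequence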

Related

Fixed number of results biopython

I am trying to retrieve the search results for a query from PubMed via Biopython, using the following code:
from Bio import Entrez
from Bio import Medline
Entrez.email = "A.N.iztb@bobxx.com"
LIM = 3
def search(Term):
    handle = Entrez.esearch(db='pmc', term=Term, retmax=100000000)
    record = Entrez.read(handle)
    idlist = record["IdList"]
    handle = Entrez.efetch(db='pmc', id=idlist, rettype="medline", retmode="text")
    records = Medline.parse(handle)
    return list(records)
mydic = search('(pathological conditions, signs and symptoms[MeSH Terms]) AND (pmc cc license[filter]) ')
print(len(mydic))
No matter how many times I try, I get 10000 in the output. I have tried different queries but I still get 10000. When I manually check how many results there are via the browser, I get different (and much larger) numbers.
What exactly is going wrong, and how can I ensure that I get all of the results?
You only seem to be changing the esearch limit, while leaving efetch alone (and NCBI appears to default to a limit of 10000 records per efetch). You need to use the retstart and retmax arguments to page through the results.
See the "Searching for and downloading abstracts using the history" example in the Biopython Tutorial, http://biopython.org/DIST/docs/tutorial/Tutorial.html or http://biopython.org/DIST/docs/tutorial/Tutorial.pdf

biopython's efetch only returns the first features from any database

Any idea why this code:
handle = Entrez.efetch(db="nuccore",
id=['556503834'], rettype="gb",
retmode="txt")
print(handle.read())
doesn't return the full set of features shown on the NCBI record page? Only the first feature is returned (I was aiming to get the CDS features).
I tried other databases, with the same result.
Change rettype to "gbwithparts":
from Bio import Entrez
Entrez.email = "your@mail.com"  # put a real email address here
handle = Entrez.efetch(db="nuccore", id=['556503834'],
                       rettype="gbwithparts", retmode="txt")
print(handle.read())
Note: it may take a few seconds.
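If the goal is the CDS features themselves rather than the raw flat file, a small sketch is to hand the handle to SeqIO instead of printing it (the feature type and qualifier keys are standard GenBank names, though not every CDS carries every qualifier):
from Bio import Entrez, SeqIO
Entrez.email = "your@mail.com"  # put a real email address here
handle = Entrez.efetch(db="nuccore", id="556503834",
                       rettype="gbwithparts", retmode="text")
record = SeqIO.read(handle, "genbank")
handle.close()
for feature in record.features:
    if feature.type == "CDS":
        # qualifiers is a dict of lists; keys like "gene" may be absent
        gene = feature.qualifiers.get("gene", ["?"])[0]
        print(gene, feature.location)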

Using Biopython.Entrez to return pubmed records associated with a list of gene symbols

I want to use a list of gene symbols (named t below) in an Entrez database search in order to (ultimately) retrieve the DNA sequence of the associated gene. I want to restrict my search to humans only, but my current code gives me organisms other than human.
from Bio import Entrez
Entrez.email = '...'  # my email: always tell Entrez who you are
t = ['FOXO3']
for i in range(len(t)):
    search = 'human[orgn]'+t[i]
    handle = Entrez.esearch(db='gene', term=search)
    record = Entrez.read(handle)
    t = record[u'IdList']
    handle = Entrez.efetch('nucleotide', id=t[0], rettype='gb', retmode='text')
    print handle.read()
Can anybody see where I'm going wrong?
You're mixing up the databases. In the esearch you use db='gene', but in the efetch you change it to 'nucleotide'. They are different things:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=7157
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene&id=7157
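A hedged sketch of one consistent route: stay in the gene database for the search, then use Entrez.elink to hop to the linked nucleotide records before fetching. The [sym] field tag and the choice of the first LinkSetDb are assumptions; print the elink result and pick the LinkName you actually want:
from Bio import Entrez
Entrez.email = '...'  # always tell Entrez who you are
handle = Entrez.esearch(db='gene', term='human[orgn] AND FOXO3[sym]')
gene_id = Entrez.read(handle)['IdList'][0]
# Hop from the gene record to the nucleotide records linked to it
handle = Entrez.elink(dbfrom='gene', db='nuccore', id=gene_id)
linksets = Entrez.read(handle)
# First link set for illustration; a real script should select by LinkName
nuc_ids = [link['Id'] for link in linksets[0]['LinkSetDb'][0]['Link']]
handle = Entrez.efetch(db='nuccore', id=nuc_ids[0], rettype='gb', retmode='text')
print handle.read()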

removing blank lines python

I am pulling information from a web site (in this case IP/location etc.) using Python 3:
import urllib.request
data = urllib.request.urlopen('http://www.maxmind.com/app/locate_my_ip')
for search in data:
    if b'align="center">' in search:
        print(next(data).decode().rstrip())
data.close()
How can I remove blank lines, put the information into tuples, save it as variables, etc.? I want to be able to start using the data gathered.
If you're doing HTML scraping/parsing etc., use a library like BeautifulSoup.
It sure beats handling the scraping manually.
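A rough sketch of what that could look like here (the 'output' class is borrowed from the answer below; the page's real markup may differ, so treat the selector as an assumption):
import urllib.request
from bs4 import BeautifulSoup  # pip install beautifulsoup4
html = urllib.request.urlopen('http://www.maxmind.com/app/locate_my_ip').read()
soup = BeautifulSoup(html, 'html.parser')
# Grab the text of every element whose class is "output"
fields = [el.get_text(strip=True) for el in soup.find_all(class_='output')]
print(fields)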
As mentioned by @jordanm, the best option is to use the GeoIP Python API for this.
But to answer your question - your code should probably look more like this:
import urllib.request, pprint
data = urllib.request.urlopen('http://www.maxmind.com/app/locate_my_ip')
fields = []
for line in data:
    if b'class=output' in line:
        fields.append(next(data).decode('iso-8859-1').strip())
data.close()
Note that I have changed the test string, and blank lines have been included. This is to ensure that the fields can be easily identified by index.
To access the field values, you can do:
address = fields[0]
isp = fields[8]
domain = fields[-1]
If you want to remove specific fields:
del fields[3], fields[4], fields[6]
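If you'd rather have named variables than magic indexes, one small sketch is to zip the values with labels (the label list here is hypothetical; use whatever the page actually shows next to each value, in order):
# Hypothetical labels in the order the fields appear on the page
labels = ['address', 'country', 'region', 'city', 'isp', 'domain']
info = dict(zip(labels, fields))
print(info.get('isp'))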

How do I turn an RSS feed back into RSS?

According to the feedparser documentation, I can turn an RSS feed into a parsed object like this:
import feedparser
d = feedparser.parse('http://feedparser.org/docs/examples/atom10.xml')
but I can't find anything showing how to go the other way; I'd like to be able to manipulate 'd' and then output the result as XML:
print d.toXML()
but there doesn't seem to be anything in feedparser for going in that direction. Am I going to have to loop through d's various elements, or is there a quicker way?
Appended is a not hugely elegant, but working, solution - it uses feedparser to parse the feed, you can then modify the entries, and it passes the data to PyRSS2Gen. It preserves most of the feed info (the important bits anyway; there are some things that will need extra conversion, the parsed_feed['feed']['image'] element for example).
I put this together as part of a little feed-processing framework I'm fiddling about with. It may be of some use (it's pretty short - should be less than 100 lines of code in total when done).
#!/usr/bin/env python
import datetime

# http://www.feedparser.org/
import feedparser
# http://www.dalkescientific.com/Python/PyRSS2Gen.html
import PyRSS2Gen

# Get the data
parsed_feed = feedparser.parse('http://reddit.com/.rss')

# Modify the parsed_feed data here

items = [
    PyRSS2Gen.RSSItem(
        title = x.title,
        link = x.link,
        description = x.summary,
        guid = x.link,
        pubDate = datetime.datetime(
            x.modified_parsed[0],
            x.modified_parsed[1],
            x.modified_parsed[2],
            x.modified_parsed[3],
            x.modified_parsed[4],
            x.modified_parsed[5]))
    for x in parsed_feed.entries
]

# Make the RSS2 object
# Try to grab the title, link, language etc. from the original feed
rss = PyRSS2Gen.RSS2(
    title = parsed_feed['feed'].get("title"),
    link = parsed_feed['feed'].get("link"),
    description = parsed_feed['feed'].get("description"),
    language = parsed_feed['feed'].get("language"),
    copyright = parsed_feed['feed'].get("copyright"),
    managingEditor = parsed_feed['feed'].get("managingEditor"),
    webMaster = parsed_feed['feed'].get("webMaster"),
    pubDate = parsed_feed['feed'].get("pubDate"),
    lastBuildDate = parsed_feed['feed'].get("lastBuildDate"),
    categories = parsed_feed['feed'].get("categories"),
    generator = parsed_feed['feed'].get("generator"),
    docs = parsed_feed['feed'].get("docs"),
    items = items
)

print rss.to_xml()
If you're looking to read in an XML feed, modify it and then output it again, there's a page on the main Python wiki indicating that the RSS.py library might support what you're after (it reads most RSS and is able to output RSS 1.0). I've not looked at it in much detail, though.
from xml.dom import minidom
doc = minidom.parse('./your/file.xml')
print doc.toxml()
The only problem is that it doesn't download feeds from the internet.
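That part is easy to bolt on, though; a minimal sketch (Python 2, matching the snippet above, with the example feed URL from the question) fetches the bytes first and feeds them to parseString:
import urllib2
from xml.dom import minidom
xml_data = urllib2.urlopen('http://feedparser.org/docs/examples/atom10.xml').read()
doc = minidom.parseString(xml_data)
print doc.toxml()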
As a method of making a feed, how about PyRSS2Gen? :)
I've not played with FeedParser, but have you tried just doing str(yourFeedParserObject)? I've often been surprised by various modules that have str() methods to just output the object as text.
[Edit] Just tried the str() method and it doesn't work on this one. Worth a shot though ;-)
