According to the feedparser documentation, I can turn an RSS feed into a parsed object like this:
import feedparser
d = feedparser.parse('http://feedparser.org/docs/examples/atom10.xml')
but I can't find anything showing how to go the other way; I'd like to be able to manipulate 'd' and then output the result as XML:
print d.toXML()
but there doesn't seem to be anything in feedparser for going in that direction. Am I going to have to loop through d's various elements, or is there a quicker way?
Appended is a not hugely elegant, but working, solution. It uses feedparser to parse the feed; you can then modify the entries, and it passes the data to PyRSS2Gen. It preserves most of the feed info (the important bits anyway; there are some things that will need extra conversion, the parsed_feed['feed']['image'] element for example).
I put this together as part of a little feed-processing framework I'm fiddling about with. It may be of some use (it's pretty short; it should be less than 100 lines of code in total when done).
#!/usr/bin/env python
import datetime
# http://www.feedparser.org/
import feedparser
# http://www.dalkescientific.com/Python/PyRSS2Gen.html
import PyRSS2Gen
# Get the data
parsed_feed = feedparser.parse('http://reddit.com/.rss')
# Modify the parsed_feed data here
items = [
    PyRSS2Gen.RSSItem(
        title = x.title,
        link = x.link,
        description = x.summary,
        guid = x.link,
        pubDate = datetime.datetime(
            x.modified_parsed[0],
            x.modified_parsed[1],
            x.modified_parsed[2],
            x.modified_parsed[3],
            x.modified_parsed[4],
            x.modified_parsed[5])
        )
    for x in parsed_feed.entries
]
# make the RSS2 object
# Try to grab the title, link, language etc from the orig feed
rss = PyRSS2Gen.RSS2(
    title = parsed_feed['feed'].get("title"),
    link = parsed_feed['feed'].get("link"),
    description = parsed_feed['feed'].get("description"),
    language = parsed_feed['feed'].get("language"),
    copyright = parsed_feed['feed'].get("copyright"),
    managingEditor = parsed_feed['feed'].get("managingEditor"),
    webMaster = parsed_feed['feed'].get("webMaster"),
    pubDate = parsed_feed['feed'].get("pubDate"),
    lastBuildDate = parsed_feed['feed'].get("lastBuildDate"),
    categories = parsed_feed['feed'].get("categories"),
    generator = parsed_feed['feed'].get("generator"),
    docs = parsed_feed['feed'].get("docs"),
    items = items
)
print rss.to_xml()
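The six-argument datetime.datetime(...) call in the item list can be written more compactly by unpacking the first six fields of the parsed time tuple. A minimal sketch (Python 3 syntax), assuming the parsed date is a standard time.struct_time; struct_to_datetime is a hypothetical helper name, not part of feedparser or PyRSS2Gen, and it accepts None so that entries without a date can be passed through:

```python
import datetime
import time

def struct_to_datetime(parsed_time):
    """Convert a time.struct_time (or 9-tuple) to a naive datetime.

    Returns None when parsed_time is None, so entries without a date
    can be passed through unchanged.
    """
    if parsed_time is None:
        return None
    # The first six fields are year, month, day, hour, minute, second.
    return datetime.datetime(*parsed_time[:6])

# Example with the Unix epoch expressed as a UTC struct_time.
print(struct_to_datetime(time.gmtime(0)))
```

With something like this in place, the pubDate argument could probably be reduced to a single call per entry.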
If you're looking to read in an XML feed, modify it and then output it again, there's a page on the main python wiki indicating that the RSS.py library might support what you're after (it reads most RSS and is able to output RSS 1.0). I've not looked at it in much detail though..
from xml.dom import minidom
doc= minidom.parse('./your/file.xml')
print doc.toxml()
The only problem is that it does not download feeds from the internet.
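minidom can also parse a string, so the download can be handled separately (for example with urllib) and the text passed to minidom.parseString. A minimal sketch of a parse, modify and re-serialize round trip (Python 3 syntax); the sample feed and the helper name retitle are made up for illustration:

```python
from xml.dom import minidom

# Made-up stand-in for a downloaded feed.
SAMPLE_FEED = (
    '<rss version="2.0"><channel>'
    '<title>Old title</title>'
    '<item><title>First post</title></item>'
    '</channel></rss>'
)

def retitle(xml_text, new_title):
    """Replace the text of the first <title> element and return the XML."""
    doc = minidom.parseString(xml_text)
    title = doc.getElementsByTagName('title')[0]
    # Remove whatever the element currently holds, then add the new text.
    while title.firstChild is not None:
        title.removeChild(title.firstChild)
    title.appendChild(doc.createTextNode(new_title))
    return doc.toxml()

print(retitle(SAMPLE_FEED, 'New title'))
```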
As a method of making a feed, how about PyRSS2Gen? :)
I've not played with FeedParser, but have you tried just doing str(yourFeedParserObject)? I've often been surprised by various modules that have __str__ methods to just output the object as text.
[Edit] Just tried the str() method and it doesn't work on this one. Worth a shot though ;-)
Related
I'm new to Python and having some trouble with an API scraping I'm attempting. What I want to do is pull a list of book titles using this code:
r = requests.get('https://api.dp.la/v2/items?q=magic+AND+wizard&api_key=09a0efa145eaa3c80f6acf7c3b14b588')
data = json.loads(r.text)
for doc in data["docs"]:
    for title in doc["sourceResource"]["title"]:
        print (title)
Which works to pull the titles, but most (not all) titles are outputting as one character per line. I've tried adding .splitlines() but this doesn't fix the problem. Any advice would be appreciated!
The problem is that there are two types of title in the response: some are plain strings, such as "Germain the wizard", and others are lists of strings, such as ['Joe Strong, the boy wizard : or, The mysteries of magic exposed /']. In this particular case all the lists seem to have length one, but I guess that will not always be the case. To illustrate what you might need to do, I added a join here instead of just taking title[0].
import requests
import json
r = requests.get('https://api.dp.la/v2/items?q=magic+AND+wizard&api_key=09a0efa145eaa3c80f6acf7c3b14b588')
data = json.loads(r.text)
for doc in data["docs"]:
    title = doc["sourceResource"]["title"]
    if isinstance(title, list):
        print(" ".join(title))
    else:
        print(title)
In my opinion that should never happen, an API should return predictable types, otherwise it looks messy on the users' side.
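Iterating over a plain string yields its characters, which is why the inner for loop in the question printed one character per line. The type check can be pulled into a small function so it is easy to reuse; this is only a sketch, title_text is a hypothetical name, and the sample docs are made up (only the first title comes from the response described above):

```python
def title_text(title, separator=" "):
    """Return the title as one string, whether it is a str or a list of str."""
    if isinstance(title, list):
        return separator.join(title)
    return title

# Made-up sample shaped like the "docs" list in the response.
docs = [
    {"sourceResource": {"title": "Germain the wizard"}},
    {"sourceResource": {"title": ["Joe Strong, the boy wizard", "a second part"]}},
]
for doc in docs:
    print(title_text(doc["sourceResource"]["title"]))
```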
I'm trying to get the URL from the response data, but it isn't working. The program stops on line 4. The code is below.
My code:
import requests
gifs = str(requests.get("https://api.giphy.com/v1/gifs/random?api_key=APIKEY"))
dump = json.dumps(gifs)
json.loads(dump['data']['url'])
Your description is not clear enough. Do you expect to read a JSON response and select a field from it?
I recommend you check the requests quickstart guide; I suspect you want to parse the response as JSON and extract some fields from it.
Maybe something like this might help:
r = requests.get('http://whatever.com')
url = r.json()['url']
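In the question's code, str() turns the Response object into its repr rather than the body, so there is no parsed data to index into. A minimal sketch of the parsing step on its own, using a hard-coded sample body in place of a real response; the "data" and "url" keys are taken from the question, while the sample text and the function name extract_url are assumptions:

```python
import json

# Made-up body with the same nesting the question indexes into.
SAMPLE_BODY = '{"data": {"url": "https://example.com/some.gif"}}'

def extract_url(body_text):
    """Parse a JSON body and return the nested data -> url field."""
    payload = json.loads(body_text)
    return payload["data"]["url"]

print(extract_url(SAMPLE_BODY))
```

With requests, r.json() performs the json.loads step on the response body for you.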
When I run the following;
from Bio.Blast import NCBIWWW
from Bio import Entrez, SeqIO
Entrez.email = "A.N.Other@example.com"
handle = Entrez.efetch(db="Protein", id= "75192198", rettype = "xml")
record = Entrez.read(handle)
I get back a "Bio.Entrez.Parser.DictionaryElement" that is really difficult to search through. If I want to, say, get the amino acid sequence, I have to type something like this:
record["Bioseq-set_seq-set"][0]["Seq-entry_seq"]["Bioseq"]["Bioseq_inst"]["Seq-inst"]["Seq-inst_seq-data"]["Seq-data"]["Seq-data_iupacaa"]["IUPACaa"]
I know that there has to be an easier way to index the elements in these results. If anyone out there can lend me a hand with this I'd appreciate it very much.
If what you want is the sequence, then instead of querying it in "xml" format, query it in (for example) FASTA format, by changing the rettype argument. Then it's as simple as parsing it using SeqIO.
handle = Entrez.efetch(db="Protein", id= "75192198", rettype = "fasta")
for r in SeqIO.parse(handle, "fasta"):
    print r.id, r.seq
This works because the contents of handle look like:
print handle.read()
# >gi|75192198|sp|Q9MAH8.1|TCP3_ARATH RecName: Full=Transcription factor TCP3
# MAPDNDHFLDSPSPPLLEMRHHQSATENGGGCGEIVEVQGGHIVRSTGRKDRHSKVCTAKGPRDRRVRLS
# APTAIQFYDVQDRLGFDRPSKAVDWLITKAKSAIDDLAQLPPWNPADTLRQHAAAAANAKPRKTKTLISP
# PPPQPEETEHHRIGEEEDNESSFLPASMDSDSIADTIKSFFPVASTQQSYHHQPPSRGNTQNQDLLRLSL
# QSFQNGPPFPNQTEPALFSGQSNNQLAFDSSTASWEQSHQSPEFGKIQRLVSWNNVGAAESAGSTGGFVF
# ASPSSLHPVYSQSQLLSQRGPLQSINTPMIRAWFDPHHHHHHHQQSMTTDDLHHHHPYHIPPGIHQSAIP
# GIAFASSGEFSGFRIPARFQGEQEEHGGDNKPSSASSDSRH
If you still want some of the other meta information (such as transcription factor binding sites within the gene, or the taxonomy of the organism), you can also download it in genbank format by giving the argument rettype="gb" and parsing with "gb". You can learn more about that in the example here.
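If you do need to stay with the XML result, a small recursive search can save typing the whole key chain. A minimal sketch, assuming the parsed record behaves like nested dicts and lists (an assumption about the Bio.Entrez parser elements that is worth checking against your Biopython version); find_key is a hypothetical helper and the sample record is a shortened, made-up stand-in:

```python
def find_key(obj, key):
    """Yield every value stored under `key` anywhere in nested dicts/lists."""
    if isinstance(obj, dict):
        for k, v in obj.items():
            if k == key:
                yield v
            # Keep searching below this value as well.
            for found in find_key(v, key):
                yield found
    elif isinstance(obj, (list, tuple)):
        for item in obj:
            for found in find_key(item, key):
                yield found

# Shortened stand-in for the nested structure returned by Entrez.read().
sample = {"Bioseq-set_seq-set": [{"Seq-entry_seq": {"Bioseq": {
    "Seq-data_iupacaa": {"IUPACaa": "MAPDNDHF"}}}}]}
print(next(find_key(sample, "IUPACaa")))
```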
I am pulling information from a web site (in this case ip/location etc) using python 3
import urllib.request
data = urllib.request.urlopen('http://www.maxmind.com/app/locate_my_ip')
for search in data:
    if b'align="center">' in search:
        print(next(data).decode().rstrip())
data.close()
How can I remove blank lines / put information into tuples / save as variables etc.? I want to be able to start using the data gathered.
If you're doing HTML scraping / parsing etc., use a library like BeautifulSoup.
It sure beats manually handling scraping.
As mentioned by @jordanm, the best option is to use the GeoIP Python API for this.
But to answer your question - your code should probably look more like this:
import urllib.request, pprint
data = urllib.request.urlopen('http://www.maxmind.com/app/locate_my_ip')
fields = []
for line in data:
    if b'class=output' in line:
        fields.append(next(data).decode('iso-8859-1').strip())
data.close()
Note that I have changed the test string, and blank lines have been included. This is to ensure that the fields can be easily identified by index.
To access the field values, you can do:
address = fields[0]
isp = fields[8]
domain = fields[-1]
If you want to remove specific fields:
del fields[3], fields[4], fields[6]
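One thing to keep in mind with a multi-target del: the targets are deleted left to right, so each index is applied to the already-shortened list. A small sketch with a stand-in list of ten numbers (not the real fields), plus one way to delete by original position instead; drop_indices is a hypothetical helper:

```python
fields = list(range(10))
del fields[3], fields[4], fields[6]
# Indices shift after each deletion, so the original items 3, 5 and 8 are gone.
print(fields)

def drop_indices(items, indices):
    """Return a new list without the items at the given original positions."""
    unwanted = set(indices)
    return [item for i, item in enumerate(items) if i not in unwanted]

# Here the positions refer to the untouched list, so items 3, 4 and 6 are dropped.
print(drop_indices(list(range(10)), [3, 4, 6]))
```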
I want to use the urllib module to send HTTP requests and grab data. I can get the data by using the urlopen() function, but I'm not really sure how to incorporate it into classes. I really need help with the query class to move forward. From the query I need to pull
• Top Rated
• Top Favorites
• Most Viewed
• Most Recent
• Most Discussed
My issue is, I can't parse the XML document to retrieve this data. I also don't know how to use classes to do it.
Here is what I have so far:
import urllib  # this allows the program to send HTTP requests and to read the responses.
class Query:
    '''performs the actual HTTP requests and initial parsing to build the Video-
    objects from the response. It will also calculate the following information
    based on the video and user results. '''
    def __init__(self, feed_id, max_results):
        '''Takes as input the type of query (feed_id) and the maximum number of
        results (max_results) that the query should obtain. The correct HTTP
        request must be constructed and submitted. The results are converted
        into Video objects, which are stored within this class.
        '''
        self.feed = feed_id
        self.max = max_results
        top_rated = urllib.urlopen("http://gdata.youtube.com/feeds/api/standardfeeds/top_rated")
        results_str = top_rated.read()
        splittedlist = results_str.split('<entry')
        top_rated.close()
    def __str__(self):
        ''' prints out the information on each video and Youtube user. '''
        pass
class Video:
    pass
class User:
    pass
#main function: This handles all the user inputs and stuff.
def main():
    useinput = raw_input('''Welcome to the YouTube text-based query application.
You can select a popular feed to perform a query on and view statistical
information about the related videos and users.
1) today
2) this week
3) this month
4) since youtube started
Please select a time(or 'Q' to quit):''')
    secondinput = raw_input("\n1) Top Rated\n2) Top Favorited\n3) Most Viewed\n4) Most Recent\n5) Most Discussed\n\nPlease select a feed (or 'Q' to quit):")
    thirdinput = raw_input("Enter the maximum number of results to obtain:")
main()
toplist = []
top_rated = urllib.urlopen("http://gdata.youtube.com/feeds/api/standardfeeds/top_rated")
result_str = top_rated.read()
top_rated.close()
splittedlist = result_str.split('<entry')
results_str = top_rated.read()
x=splittedlist[1].find('title')#find the title index
splittedlist[1][x: x+75]#string around the title (/ marks the end of the title)
w=splittedlist[1][x: x+75].find(">")#gives you the start index
z=splittedlist[1][x: x+75].find("<")#gives you the end index
titles = splittedlist[1][x: x+75][w+1:z]#gives you the title!!!!
toplist.append(titles)
print toplist
I assume that your challenge is parsing XML.
results_str = top_rated.read()
splittedlist = results_str.split('<entry')
And I see you are using string functions to parse XML. Such functions based on finite automata (regular languages) are NOT suited for parsing context-free languages such as XML. Expect it to break very easily.
For more reasons, please refer to the question "RegEx match open tags except XHTML self-contained tags".
Solution: consider using an XML parser like ElementTree. It comes with Python (as xml.etree.ElementTree in the standard library) and allows you to browse the XML tree pythonically. http://effbot.org/zone/element-index.htm
You may come up with code like:
import xml.etree.ElementTree as ET
..
results_str = top_rated.read()
root = ET.fromstring(results_str)
for node in root:
    print node
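A slightly fuller sketch using the ElementTree module from the standard library (Python 3 syntax). The inline feed is a made-up stand-in for the real response, and the code assumes the feed is Atom, whose elements live in the http://www.w3.org/2005/Atom namespace; entry_titles is a hypothetical helper name:

```python
import xml.etree.ElementTree as ET

# Made-up Atom document standing in for the downloaded feed.
SAMPLE_ATOM = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Top Rated</title>
  <entry><title>First video</title></entry>
  <entry><title>Second video</title></entry>
</feed>"""

# Prefix-to-namespace map used in the search paths below.
ATOM = {"atom": "http://www.w3.org/2005/Atom"}

def entry_titles(xml_text):
    """Return the <title> text of every <entry> in an Atom document."""
    root = ET.fromstring(xml_text)
    return [entry.findtext("atom:title", default="", namespaces=ATOM)
            for entry in root.findall("atom:entry", ATOM)]

print(entry_titles(SAMPLE_ATOM))
```

Unlike splitting on '<entry', this does not depend on how the XML happens to be laid out as text.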
"I also don't know how to use classes to do it."
Don't be in a rush to create classes :-)
In the above example, you are importing a module, not importing a class and instantiating/initializing it, like you do for Java. Python has powerful primitive types (dictionaries, lists) and considers modules as objects: so (IMO) you can go easy on classes.
You use classes to organize stuff, not because your teacher has indoctrinated you with "classes are good. Let's have lots of them".
Basically you want to use the Query class to communicate with the API.
def __init__(self, feed_id, max_results, time):
    qs = "http://gdata.youtube.com/feeds/api/standardfeeds/"+feed_id+"?max-results="+str(max_results)+"&time=" + time
    self.feed_id = feed_id
    self.max_results = max_results
    wo = urllib.urlopen(qs)
    result_str = wo.read()
    wo.close()
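Building the query string by hand works, but letting the library encode the parameters avoids quoting mistakes. A minimal sketch in Python 3 syntax (where urlencode lives in urllib.parse); the base URL and parameter names are the ones used in the constructor, while build_feed_url is a hypothetical helper name:

```python
from urllib.parse import urlencode

BASE_URL = "http://gdata.youtube.com/feeds/api/standardfeeds/"

def build_feed_url(feed_id, max_results, time):
    """Build the request URL for a standard feed with an encoded query string."""
    # A list of pairs keeps the parameters in a fixed order.
    query = urlencode([("max-results", max_results), ("time", time)])
    return BASE_URL + feed_id + "?" + query

print(build_feed_url("top_rated", 5, "today"))
```

The constructor could then call a helper like this and pass the result to urlopen.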