I built a simple RSS reader in Python and it is not working.
In addition, I want to get the featured image source link of every post, and I haven't found a way to do so.
It shows me this error:
Traceback (most recent call last):
  File "RSS_reader.py", line 7, in <module>
    feed_title = feed['feed']['title']
Some other RSS feeds work fine, so I don't understand why some feeds work and others don't.
I would like to understand why the code doesn't work, and also how to get the featured image source link of a post.
I attached the code below; it is written in Python 3.7.
import feedparser
import webbrowser

feed = feedparser.parse("https://finance.yahoo.com/rss/")

feed_title = feed['feed']['title']
feed_entries = feed.entries

for entry in feed.entries:
    article_title = entry.title
    article_link = entry.link
    article_published_at = entry.published  # Unicode string
    article_published_at_parsed = entry.published_parsed  # Time object
    article_author = entry.author
    content = entry.summary
    article_tags = entry.tags

    print("{}[{}]".format(article_title, article_link))
    print("Published at {}".format(article_published_at))
    print("Published by {}".format(article_author))
    print("Content {}".format(content))
    print("catagory{}".format(article_tags))
A few things:
1) First, feed['feed']['title'] does not exist for this feed.
2) At least for this site, entry.author and entry.tags do not exist.
3) It seems feedparser is not fully compatible with Python 3.7 (it gives me a KeyError: "object doesn't have key 'category'").
So as a starting point, try running the following code on Python 3.6 and go from there.
import feedparser
import webbrowser

feed = feedparser.parse("https://finance.yahoo.com/rss/")

# feed_title = feed['feed']['title']  # NOT VALID
feed_entries = feed.entries

for entry in feed.entries:
    article_title = entry.title
    article_link = entry.link
    article_published_at = entry.published  # Unicode string
    article_published_at_parsed = entry.published_parsed  # Time object
    # article_author = entry.author  # DOES NOT EXIST
    content = entry.summary
    # article_tags = entry.tags  # DOES NOT EXIST

    print("{}[{}]".format(article_title, article_link))
    print("Published at {}".format(article_published_at))
    # print("Published by {}".format(article_author))
    print("Content {}".format(content))
    # print("catagory{}".format(article_tags))
Good luck.
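As for the featured image: not every feed exposes one, and feedparser only surfaces what the feed actually contains. Below is a minimal sketch of one way to probe for it; the media_thumbnail / media_content attributes and enclosure links are only present when the feed provides them, so treat this as a best-effort assumption rather than something guaranteed to work for every feed.

import feedparser

feed = feedparser.parse("https://finance.yahoo.com/rss/")

for entry in feed.entries:
    # use .get() so missing fields don't raise AttributeError/KeyError
    title = entry.get("title", "")
    link = entry.get("link", "")

    image_url = None
    # media:thumbnail / media:content elements, if the feed has them
    if entry.get("media_thumbnail"):
        image_url = entry["media_thumbnail"][0].get("url")
    elif entry.get("media_content"):
        image_url = entry["media_content"][0].get("url")
    else:
        # fall back to an enclosure link with an image MIME type
        for l in entry.get("links", []):
            if l.get("rel") == "enclosure" and l.get("type", "").startswith("image/"):
                image_url = l.get("href")
                break

    print("{} [{}] image: {}".format(title, link, image_url))

If image_url comes back as None for every entry, the feed simply does not carry image metadata and you would have to fetch the article page itself to find one.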
You can also use XML parser libraries like BeautifulSoup (https://www.crummy.com/software/BeautifulSoup/bs4/doc/) and create custom parsers. A sample custom parser can be found here (https://github.com/vintageplayer/RSS-Parser), and a walkthrough of the same can be read here (https://towardsdatascience.com/rss-feed-parser-in-python-553b1857055c).
Though other libraries can be useful, BeautifulSoup is an extremely handy one to try out.
I have used BeautifulSoup for a beginner RSS feed reader project (you need to install lxml for it to work, since we are dealing with XML):
from bs4 import BeautifulSoup
import requests

url = requests.get('https://realpython.com/atom.xml')
soup = BeautifulSoup(url.content, 'xml')
entries = soup.find_all('entry')

for i in entries:
    title = i.title.text
    link = i.link['href']
    summary = i.summary.text
    print(f'Title: {title}\n\nSummary: {summary}\n\nLink: {link}\n\n------------------------\n')
You can find the YouTube video here:
https://www.youtube.com/watch?v=8HbqO-TfjlI
I'm working on a system that scrapes news articles from RSS files and passes them to a sentiment analysis API.
It is my first time working on a project of that scale. I'm at a stage where I can get raw text out of links that are in an RSS file. I now need to put in place a system that can automatically fetch RSS files when they are updated.
Any high-level ideas of how this could be achieved?
Thanks
feedparser does a good job of sourcing RSS feeds. It also has features not used in this first example, such as ETags, for efficiently fetching only new items.
Google gave me the site https://blog.feedspot.com/world_news_rss_feeds/ as a source of multiple RSS news feeds. I just scraped this to get a dictionary of feeds. Then it's a simple case of looping over the RSS sources.
import feedparser
import requests
from bs4 import BeautifulSoup
import urllib.parse, xml.sax
import pandas as pd

# get some RSS feeds....
resp = requests.get("https://blog.feedspot.com/world_news_rss_feeds/")
soup = BeautifulSoup(resp.content.decode(), "html.parser")
rawfeeds = soup.find_all("h2")
feeds = {}
for rf in rawfeeds:
    a = rf.find("a")
    if a is not None:
        feeds[a.string.replace("RSS Feed", "").strip()] = urllib.parse.parse_qs(a['href'])["q"][0].replace("site:", "")

# now source them all into a dataframe
df = pd.DataFrame()
for k, url in feeds.items():
    try:
        df = pd.concat([df, pd.json_normalize(feedparser.parse(url)["entries"]).assign(Source=k)])
    except (Exception, xml.sax.SAXParseException):
        print(f"invalid xml: {url}")
To make it re-entrant: use the etag and modified capabilities of feedparser, and persist the dataframes so that when the script is run again it picks up from where it left off.
I would use threading so that it is not purely sequential. Obviously with threading you need to think about synchronising your save points. Then you can just run it in a scheduler to periodically source new items from the RSS feeds and fetch the associated articles (a minimal threading sketch follows after the code below).
import feedparser, requests, newspaper
from bs4 import BeautifulSoup
import urllib.parse, xml.sax
from pathlib import Path
import pandas as pd

if not Path.cwd().joinpath("news").is_dir(): Path.cwd().joinpath("news").mkdir()
p = Path.cwd().joinpath("news")

# get some RSS feeds....
if p.joinpath("rss.pickle").is_file():
    dfrss = pd.read_pickle(p.joinpath("rss.pickle"))
else:
    resp = requests.get("https://blog.feedspot.com/world_news_rss_feeds/")
    soup = BeautifulSoup(resp.content.decode(), "html.parser")
    rawfeeds = soup.find_all("h2")
    feeds = []
    for rf in rawfeeds:
        a = rf.find("a")
        if a is not None:
            feeds.append({"name": a.string.replace("RSS Feed", "").strip(),
                          "url": urllib.parse.parse_qs(a['href'])["q"][0].replace("site:", ""),
                          "etag": "", "status": 0, "debug_msg": "", "modified": ""})
    dfrss = pd.DataFrame(feeds).set_index("url")

if p.joinpath("rssdata.pickle").is_file():
    df = pd.read_pickle(p.joinpath("rssdata.pickle"))
else:
    df = pd.DataFrame({"id": [], "link": []})

# now source them all into a dataframe. head() is there for testing purposes
for r in dfrss.head(5).itertuples():
    # print(r.Index)
    try:
        fp = feedparser.parse(r.Index, etag=r.etag, modified=r.modified)
        if fp.bozo == 1: raise Exception(fp.bozo_exception)
    except Exception as e:
        fp = feedparser.FeedParserDict(**{"etag": r.etag, "entries": [], "status": 500, "debug_message": str(e)})

    # keep meta information of what has already been sourced from an RSS feed
    if "etag" in fp.keys(): dfrss.loc[r.Index, "etag"] = fp.etag
    dfrss.loc[r.Index, "status"] = fp.status
    if "debug_message" in fp.keys(): dfrss.loc[r.Index, "debug_msg"] = fp.debug_message

    # 304 means up to date... getting 301 and entries hence test len...
    if len(fp["entries"]) > 0:
        dft = pd.json_normalize(fp["entries"]).assign(Source=r.Index)
        # don't capture items that have already been captured...
        df = pd.concat([df, dft[~dft["link"].isin(df["link"])]])

    # save to make re-entrant...
    dfrss.to_pickle(p.joinpath("rss.pickle"))
    df.to_pickle(p.joinpath("rssdata.pickle"))

# finally get the text...
if p.joinpath("text.pickle").is_file():
    dftext = pd.read_pickle(p.joinpath("text.pickle"))
else:
    dftext = pd.DataFrame({"link": [], "text": []})

# head() is there for testing purposes
for r in df[~df["link"].isin(dftext["link"])].head(5).itertuples():
    a = newspaper.Article(r.link)
    a.download()
    a.parse()
    dftext = pd.concat([dftext, pd.DataFrame([{"link": r.link, "text": a.text}])], ignore_index=True)
    dftext.to_pickle(p.joinpath("text.pickle"))
From there, you can run whatever analysis you need on the data that has been retrieved.
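As a minimal sketch of the threading idea mentioned above (this is an assumption about how you might wire it up, not part of the code above; the fetch_feed helper, the worker count and the 15-minute interval are hypothetical), you could fan the feed URLs out over a thread pool and re-run the whole pass on a timer:

import time
from concurrent.futures import ThreadPoolExecutor

import feedparser

def fetch_feed(url, etag="", modified=""):
    # each worker only parses; collect results in the main thread so
    # that dataframe updates and pickling stay single-threaded
    return url, feedparser.parse(url, etag=etag, modified=modified)

def run_pass(feed_urls):
    with ThreadPoolExecutor(max_workers=8) as pool:
        for url, fp in pool.map(fetch_feed, feed_urls):
            if fp.get("entries"):
                # merge fp["entries"] into your persisted dataframe here,
                # then save the pickle (your synchronised save point)
                print(f"{url}: {len(fp['entries'])} new-ish entries")

if __name__ == "__main__":
    feeds = ["https://finance.yahoo.com/rss/"]  # hypothetical feed list
    while True:
        run_pass(feeds)
        time.sleep(15 * 60)  # or hand the scheduling off to cron / APScheduler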
I have developed a web scraper with Beautiful Soup that scrapes news from a website and then sends them to a Telegram bot. Every time the program runs it picks up all the news currently on the news web page, but I want it to pick up only the new entries and send just those.
How can I do this? Should I use a sorting algorithm of some sort?
Here is the code:
#Lib requests
import requests
import bs4

fonte = requests.get('https://www.noticiasaominuto.com/')
soup = bs4.BeautifulSoup(fonte.text, 'lxml')

body = soup.body
for paragrafo in body.find_all('p', class_='article-thumb-text'):
    print(paragrafo.text)
    conteudo = paragrafo.text
    id = requests.get('https://api.telegram.org/bot<TOKEN>/getUpdates')
    chat_id = id.json()['result'][0]['message']['from']['id']
    print(chat_id)
    msg = requests.post('https://api.telegram.org/bot<TOKEN>/sendMessage', data={'chat_id': chat_id, 'text': conteudo})
You need to keep track of the articles that you have seen before, either by using a full database solution or by simply saving the information in a file. The file is read in before starting, the website is scraped and compared against the existing list, and any articles not in the list are added to it. At the end, the updated list is saved back to the file.
Rather than storing the whole text in the file, a hash of the text can be saved instead, i.e. the text is converted into a unique number; here a hex digest is used to make it easier to save to a text file. As each hash will be unique, they can be stored in a Python set to speed up the checking:
import hashlib
import requests
import bs4
import os

# Read in hashes of past articles
db = 'past.txt'

if os.path.exists(db):
    with open(db) as f_past:
        past_articles = set(f_past.read().splitlines())
else:
    past_articles = set()

fonte = requests.get('https://www.noticiasaominuto.com/')
soup = bs4.BeautifulSoup(fonte.text, 'lxml')

for paragrafo in soup.body.find_all('p', class_='article-thumb-text'):
    m = hashlib.md5(paragrafo.text.encode('utf-8'))

    if m.hexdigest() not in past_articles:
        print('New {} - {}'.format(m.hexdigest(), paragrafo.text))
        past_articles.add(m.hexdigest())
        # ...Update telegram here...

# Write updated hashes back to the file
with open(db, 'w') as f_past:
    f_past.write('\n'.join(past_articles))
The first time this is run, all articles will be displayed. The next time, no articles will be displayed until the website is updated.
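If it helps to see the two pieces together, here is a rough sketch of wiring the hash check into the Telegram send from your own snippet (assumptions: the <TOKEN> placeholder and the chat_id lookup are copied verbatim from your code, and the set would be loaded from and saved back to 'past.txt' exactly as in the answer above):

import hashlib
import requests
import bs4

past_articles = set()  # load from 'past.txt' as shown above

fonte = requests.get('https://www.noticiasaominuto.com/')
soup = bs4.BeautifulSoup(fonte.text, 'lxml')

updates = requests.get('https://api.telegram.org/bot<TOKEN>/getUpdates')
chat_id = updates.json()['result'][0]['message']['from']['id']

for paragrafo in soup.body.find_all('p', class_='article-thumb-text'):
    digest = hashlib.md5(paragrafo.text.encode('utf-8')).hexdigest()
    if digest not in past_articles:
        past_articles.add(digest)
        # only articles not seen before reach the bot
        requests.post('https://api.telegram.org/bot<TOKEN>/sendMessage',
                      data={'chat_id': chat_id, 'text': paragrafo.text})

# then write past_articles back to 'past.txt' as in the answer above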
Okay guys, I'm new to parsing XML and Python, and I am trying to get this to work. If someone could help me with this it would be greatly appreciated. If you can help me (educate me) on how to figure it out for myself, that would be even better!
I am having trouble trying to figure out the range being referenced for the XML document, as I can't find any documentation on it. Here is my code; I'll include the entire traceback after it.
#import library to do http requests:
import urllib.request
#import easy to use xml parser called minidom:
from xml.dom.minidom import parseString
#all these imports are standard on most modern python implementations
#download the file:
file = urllib.request.urlopen('http://www.wizards.com/dndinsider/compendium/CompendiumSearch.asmx/KeywordSearch?Keywords=healing%20%word&nameOnly=True&tab=')
#convert to string:
data = file.read()
#close file because we dont need it anymore:
file.close()
#parse the xml you downloaded
dom = parseString(data)
#retrieve the first xml tag (<tag>data</tag>) that the parser finds with name tagName:
xmlTag = dom.getElementsByTagName('Data.Results.Power.ID')[0].toxml()
#strip off the tag (<tag>data</tag> ---> data):
xmlData=xmlTag.replace('<id>','').replace('</id>','')
#print out the xml tag and data in this format: <tag>data</tag>
print(xmlTag)
#just print the data
print(xmlData)
Traceback
/usr/bin/python3.4 /home/mint/PycharmProjects/DnD_Project/Power_Name.py
Traceback (most recent call last):
File "/home/mint/PycharmProjects/DnD_Project/Power_Name.py", line 14, in <module>
xmlTag = dom.getElementsByTagName('id')[0].toxml()
IndexError: list index out of range
Process finished with exit code 1
First check how many matching elements the parser actually finds:
print(len(dom.getElementsByTagName('id')))
EDIT:
ids = dom.getElementsByTagName('id')

if len(ids) > 0:
    xmlTag = ids[0].toxml()
    # rest of code
EDIT: I am adding an example because I saw in another comment that you don't know how to use it.
BTW: I also added some comments in the code about the file/connection.
import urllib.request
from xml.dom.minidom import parseString

# create connection to data/file on server
connection = urllib.request.urlopen('http://www.wizards.com/dndinsider/compendium/CompendiumSearch.asmx/KeywordSearch?Keywords=healing%20%word&nameOnly=True&tab=')

# read from server as string (not "convert" to string):
data = connection.read()

# close connection because we dont need it anymore:
connection.close()

dom = parseString(data)

# get tags from dom
ids = dom.getElementsByTagName('Data.Results.Power.ID')

# check if there is any data
if len(ids) > 0:
    xmlTag = ids[0].toxml()
    xmlData = xmlTag.replace('<id>', '').replace('</id>', '')
    print(xmlTag)
    print(xmlData)
else:
    print("Sorry, there was no data")
Or you can use a for loop if there are more tags:
dom = parseString(data)

# get tags from dom
ids = dom.getElementsByTagName('Data.Results.Power.ID')

# get all tags - one by one
for one_tag in ids:
    xmlTag = one_tag.toxml()
    xmlData = xmlTag.replace('<id>', '').replace('</id>', '')
    print(xmlTag)
    print(xmlData)
BTW:
getElementsByTagName() expects a tag name like ID - not a path like Data.Results.Power.ID
the tag name is ID, so you have to replace <ID>, not <id>
for this tag you can even use one_tag.firstChild.nodeValue in place of xmlTag.replace
dom = parseString(data)

# get tags from dom
ids = dom.getElementsByTagName('ID')  # tag name

# get all tags - one by one
for one_tag in ids:
    xmlTag = one_tag.toxml()
    #xmlData = xmlTag.replace('<ID>','').replace('</ID>','')
    xmlData = one_tag.firstChild.nodeValue
    print(xmlTag)
    print(xmlData)
I haven't used the built-in xml library in a while, but it's covered in Mark Pilgrim's great Dive Into Python book.
-- I see as I'm typing this that your question has already been answered, but since you mention being new to Python, I think you will find the text useful for XML parsing and as an excellent introduction to the language.
If you would like to try another approach to parsing xml and html, I highly recommend lxml.
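For what it's worth, here is a rough sketch of what that could look like with lxml against the same compendium URL (a hedged example: the URL is copied verbatim from the question, and the 'ID' tag name is the one discussed above, so both are assumptions about what the endpoint actually returns):

import urllib.request
from lxml import etree

url = 'http://www.wizards.com/dndinsider/compendium/CompendiumSearch.asmx/KeywordSearch?Keywords=healing%20%word&nameOnly=True&tab='

data = urllib.request.urlopen(url).read()
root = etree.fromstring(data)

# iter() walks the whole tree, so no dotted path like Data.Results.Power.ID is needed
for id_node in root.iter('ID'):
    print(id_node.text)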
I am working on a Python script to parse RSS links.
I use the Universal Feed Parser and I am encountering issues with some links, for example while trying to parse the FreeBSD Security Advisories.
Here is the sample code:
feed = feedparser.parse(url)
items = feed["items"]
Basically, feed["items"] should return all the entries in the feed (the fields that start with item), but it always comes back empty.
I can also confirm that the following links are parsed as expected:
Ubuntu
Redhat
Is this an issue with the feeds, in that the ones from FreeBSD do not respect the standard?
EDIT:
I am using Python 2.7.
I ended up using feedparser in combination with BeautifulSoup, like Hai Vu proposed.
Here is the sample code I ended up with, slightly changed:
def rss_get_items_feedparser(self, webData):
    feed = feedparser.parse(webData)
    items = feed["items"]
    return items

def rss_get_items_beautifulSoup(self, webData):
    soup = BeautifulSoup(webData)
    for item_node in soup.find_all('item'):
        item = {}
        for subitem_node in item_node.findChildren():
            if subitem_node.name is not None:
                item[str(subitem_node.name)] = str(subitem_node.contents[0])
        yield item

def rss_get_items(self, webData):
    items = self.rss_get_items_feedparser(webData)
    if len(items) > 0:
        return items
    return self.rss_get_items_beautifulSoup(webData)

def parse(self, url):
    request = urllib2.Request(url)
    response = urllib2.urlopen(request)
    webData = response.read()

    for item in self.rss_get_items(webData):
        # parse items
        pass
I also tried passing the response directly to rss_get_items, without reading it, but it throws an exception when BeautifulSoup tries to read it:
File "bs4/__init__.py", line 161, in __init__
markup = markup.read()
TypeError: 'NoneType' object is not callable
I found out the problem was with the use of namespaces.
For FreeBSD's RSS feed:
<rss xmlns:atom="http://www.w3.org/2005/Atom"
xmlns="http://www.w3.org/1999/xhtml"
version="2.0">
For Ubuntu's feed:
<rss xmlns:atom="http://www.w3.org/2005/Atom"
version="2.0">
When I remove the extra namespace declaration from FreeBSD's feed, everything works as expected.
So what does this mean for you? I can think of a couple of different approaches (a sketch of the second one follows this list):
Use something else, such as BeautifulSoup. I tried it and it seems to work.
Download the whole RSS feed, apply some search/replace to fix up the namespaces, then use feedparser.parse() afterwards. This approach is a big hack; I would not use it myself.
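Purely to illustrate that second option, here is a hedged sketch; the xmlns string it strips is taken from the FreeBSD feed snippet shown above, so it will break if the feed ever changes its declaration:

import urllib2
import feedparser

url = 'http://www.freebsd.org/security/rss.xml'
raw = urllib2.urlopen(url).read()

# drop the default XHTML namespace declaration that trips feedparser up here
cleaned = raw.replace('xmlns="http://www.w3.org/1999/xhtml"', '')

feed = feedparser.parse(cleaned)
print len(feed['items'])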
Update
Here is sample code for rss_get_items() which returns a list of items from an RSS feed. Each item is a dictionary with some standard keys such as title, pubdate, link, and guid.
from bs4 import BeautifulSoup
import urllib2

def rss_get_items(url):
    request = urllib2.Request(url)
    response = urllib2.urlopen(request)
    soup = BeautifulSoup(response)

    for item_node in soup.find_all('item'):
        item = {}
        for subitem_node in item_node.findChildren():
            key = subitem_node.name
            value = subitem_node.text
            item[key] = value
        yield item

if __name__ == '__main__':
    url = 'http://www.freebsd.org/security/rss.xml'
    for item in rss_get_items(url):
        print item['title']
        print item['pubdate']
        print item['link']
        print item['guid']
        print '---'
Output:
FreeBSD-SA-14:04.bind
Tue, 14 Jan 2014 00:00:00 PST
http://security.FreeBSD.org/advisories/FreeBSD-SA-14:04.bind.asc
http://security.FreeBSD.org/advisories/FreeBSD-SA-14:04.bind.asc
---
FreeBSD-SA-14:03.openssl
Tue, 14 Jan 2014 00:00:00 PST
http://security.FreeBSD.org/advisories/FreeBSD-SA-14:03.openssl.asc
http://security.FreeBSD.org/advisories/FreeBSD-SA-14:03.openssl.asc
---
...
Notes:
I omit error checking for the sake of brevity.
I recommend only falling back to the BeautifulSoup API when feedparser fails, because feedparser is the right tool for the job. Hopefully they will update it to be more forgiving in the future.
I am trying to parse data from a page using Python, which should be pretty straightforward, but all the data is hidden under jQuery elements and such, which makes it harder to grab. Please forgive me, as I am a newbie to Python and programming as a whole, so I am still getting familiar with it. The website I am getting it from is http://www.asusparts.eu/partfinder/Asus/All In One/E Series, so I just need all the data from the E Series. This is the code I have so far:
import string, urllib2, csv, urlparse, sys
from bs4 import BeautifulSoup

changable_url = 'http://www.asusparts.eu/partfinder/Asus/All%20In%20One/E%20Series'
page = urllib2.urlopen(changable_url)
base_url = 'http://www.asusparts.eu'
soup = BeautifulSoup(page)

redirects = []
model_info = []

select = soup.find(id='myselectListModel')
print select.get_text()

options = select.findAll('option')

for option in options:
    if option.has_attr('redirectvalue'):
        redirects.append(option['redirectvalue'])

for r in redirects:
    rpage = urllib2.urlopen(base_url + r.replace(' ', '%20'))
    s = BeautifulSoup(rpage)
    print s
    sys.exit()
However, the only problem is that it just prints out the data for the first model, which is
Asus -> All In One -> E Series -> ET10B -> AC Adapter. The actual HTML page prints out like the following... (output was too long - just pasted the main output needed)
I am unsure how I would grab the data for all the E Series parts, as I assumed this would grab everything. Also, I would appreciate it if any answers relate to the current method I am using, as this is the way the person in charge would like it done. Thanks.
[EDIT]
This is how I am trying to parse the HTML:
for r in redirects:
    rpage = urllib2.urlopen(urljoin(base_url, quote(r)))
    s = BeautifulSoup(rpage)
    print s

    data = soup.find(id='accordion')
    selection = data.findAll('td')

    for s in selections:
        if selection.has_attr('class', 'ProduktLista'):
            redirects.append(td['class', 'ProduktLista'])
This is the error I come up with:
Traceback (most recent call last):
File "C:\asus.py", line 31, in <module>
selection = data.findAll('td')
AttributeError: 'NoneType' object has no attribute 'findAll'
You need to remove the sys.exit() call you have in your loop:
for r in redirects:
    rpage = urllib2.urlopen(base_url + r.replace(' ', '%20'))
    s = BeautifulSoup(rpage)
    print s
    # sys.exit()  # remove this line, no need to exit your program
You also may want to use urllib.quote to properly quote the URLs you get from the option dropdown; this removes the need to manually replace spaces with '%20'. Use urlparse.urljoin() to construct the final URL:
from urllib import quote
from urlparse import urljoin

for r in redirects:
    rpage = urllib2.urlopen(urljoin(base_url, quote(r)))
    s = BeautifulSoup(rpage)
    print s
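If the goal is then to collect the part data from every model page rather than print each whole document, a rough sketch along these lines might help. Hedged assumptions: the id='accordion' and class 'ProduktLista' selectors are taken from your own edit and may not match the live markup, base_url and redirects come from the earlier snippet, and note that it searches s (the per-model soup), not the original soup:

import urllib2
from urllib import quote
from urlparse import urljoin
from bs4 import BeautifulSoup

# base_url and redirects are assumed to be built as in the snippet above
parts = []

for r in redirects:
    rpage = urllib2.urlopen(urljoin(base_url, quote(r)))
    s = BeautifulSoup(rpage)

    accordion = s.find(id='accordion')
    if accordion is None:
        continue  # this model page has no parts section

    # collect the text of every 'ProduktLista' cell on this model's page
    for td in accordion.findAll('td', class_='ProduktLista'):
        parts.append(td.get_text(strip=True))

print parts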