I am trying to obtain the first title.text from this RSS feed: https://www.mmafighting.com/rss/current. The feed is up to date and operational. However, when I use the following code, the script does not find any tags; running another code sample, it also finds no tags.
I tried the following, expecting it to return the text of the first <title> element that falls within the first <entry> tag.
import requests
from xml.etree import ElementTree

rss_url = 'https://www.mmafighting.com/rss/current'
response = requests.get(rss_url)
if response.status_code == 200:
    rss_feed = response.text
    # parse the RSS feed using xml.etree.ElementTree
    root = ElementTree.fromstring(rss_feed)
    entries = root.findall(".//entry")
    if len(entries) > 0:
        title = entries[0].find("title")
        if title is not None:
            print(title.text)
        else:
            print("No title found in the first entry")
    else:
        print("No entry found in the RSS feed")
else:
    print("Failed to get RSS feed. Status code:", response.status_code)
The code returns "No entry found in the RSS feed".
Maybe you need the feedparser module. Install it with:
pip install feedparser
Then you can write code like this:
import feedparser

rss_url = 'https://www.mmafighting.com/rss/current'
feed = feedparser.parse(rss_url)
if feed.status == 200:
    for entry in feed.entries:
        print(entry.title)
        print(entry.link)
else:
    print("Failed to get RSS feed. Status code:", feed.status)
The issue is likely caused by the XML namespace, which is declared at the top of the feed element.
In your script you used the findall method with the argument ".//entry" to find all the entry elements. However, since the feed uses the XML namespace "http://www.w3.org/2005/Atom", that path does not match any elements in the feed.
One way to handle this is by specifying the namespace when calling the findall method. You can do this by adding the namespace as a key-value pair in a dictionary and passing it as the second argument to the findall method.
import requests
from xml.etree import ElementTree

rss_url = 'https://www.mmafighting.com/rss/current'
response = requests.get(rss_url)
if response.status_code == 200:
    rss_feed = response.text
    # parse the RSS feed using xml.etree.ElementTree
    root = ElementTree.fromstring(rss_feed)
    # map a prefix to the Atom namespace and use it in the search path
    ns = {'atom': 'http://www.w3.org/2005/Atom'}
    entries = root.findall('.//atom:entry', ns)
    if len(entries) > 0:
        title = entries[0].find("atom:title", ns)
        if title is not None:
            print(title.text)
        else:
            print("No title found in this entry")
    else:
        print("No entry found in the RSS feed")
else:
    print("Failed to get RSS feed. Status code:", response.status_code)
You can easily achieve this by using Beautiful Soup. Here is the updated code:
import requests
from bs4 import BeautifulSoup

rss_url = 'https://www.mmafighting.com/rss/current'
response = requests.get(rss_url)
if response.status_code == 200:
    rss_feed = response.text
    soup = BeautifulSoup(rss_feed, features="xml")
    # find the <entry> elements; note that soup.find_all('title') would also
    # match the feed-level <title>, not just the entry titles
    entries = soup.find_all('entry')
    if len(entries) > 0:
        title = entries[0].find('title')
        if title is not None:
            print(title.text)
        else:
            print("No title found in the first entry")
    else:
        print("No entry found in the RSS feed")
else:
    print("Failed to get RSS feed. Status code:", response.status_code)
The variable entries is a list containing all the entry elements; you can iterate over it to get all the titles.
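For example, continuing from the block above where soup holds the parsed feed:

for entry in soup.find_all('entry'):
    entry_title = entry.find('title')
    if entry_title is not None:
        print(entry_title.text)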
How to install Beautiful Soup? Just run this command:
pip install beautifulsoup4
(The bs4 name on PyPI is just a wrapper that pulls in beautifulsoup4.)
When I use XPath to crawl and parse the content of Tencent commonweal, all the returned lists are empty.
Below is my code (the headers information is hidden). The target url is https://gongyi.qq.com/succor/project_list.htm#s_tid=75. I would appreciate it if someone could help me solve this problem.
import requests
import os
from lxml import etree

if __name__ == '__main__':
    url = 'https://gongyi.qq.com/succor/project_list.htm#s_tid=75'
    headers = {
        'User-Agent': XXX
    }
    response = requests.get(url=url, headers=headers)
    page_text = response.text
    tree = etree.HTML(page_text)
    li_list = tree.xpath('//div[@class="pro_main"]//li')
    for li in li_list:
        title = li.xpath('./div[2]/div/a/text()')[0]
        print(title)
What is actually happening here is that you can only access the first ul inside the pro_main div: all those li items and their parent are populated by JavaScript, so your list won't be there by the time you fetch the html with requests.get(); it will be empty.
The good news is that the JS script in question populates the data through an API, so just as the website does, you can retrieve those titles from the actual API and print them.
import requests, json

if __name__ == '__main__':
    url = 'https://ssl.gongyi.qq.com/cgi-bin/WXSearchCGI?ptype=stat&s_status=1&s_tid=75'
    resp = requests.get(url).text
    resp = resp[1:-1]  # the result is wrapped in (), so we get rid of those
    jj = json.loads(resp)
    for i in jj["plist"]:
        title = i["title"]
        print(title)
You can explore the API by printing jj to see if there's more info that you may need later!
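For instance, a quick way to peek at the structure, continuing from the block above (plist and title come from the response shown here; any other field names are whatever the API actually returns):

print(jj.keys())  # top-level fields of the response
print(json.dumps(jj["plist"][0], ensure_ascii=False, indent=2))  # one full record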
Let me know if it works for you!
Can you please help me with my Python code? I want to parse several homepages with Beautiful Soup; the pages are provided in the list html and processed by the function stars.
html = ["https://www.onvista.de/aktien/fundamental/Adidas-Aktie-DE000A1EWWW0",
        "https://www.onvista.de/aktien/fundamental/Allianz-Aktie-DE0008404005",
        "https://www.onvista.de/aktien/fundamental/BASF-Aktie-DE000BASF111"]

def stars(html):
    bsObj = BeautifulSoup(html.read())
    starbewertung = bsObj.findAll("section")[8].findAll("div")[1].findAll("span")[16]
    str_cells = str(starbewertung)
    cleantext = BeautifulSoup(str_cells, "lxml").get_text()
    print(cleantext)
    lst = []
    lst.append(cleantext)

stars(html)
Instead I am getting an error "AttributeError: 'list' object has no attribute 'read'"
As some of the comments mentioned, you need to use the requests library to actually grab the content of each link in your list.
import requests
from bs4 import BeautifulSoup

html = ["https://www.onvista.de/aktien/fundamental/Adidas-Aktie-DE000A1EWWW0",
        "https://www.onvista.de/aktien/fundamental/Allianz-Aktie-DE0008404005",
        "https://www.onvista.de/aktien/fundamental/BASF-Aktie-DE000BASF111"]

def stars(html):
    for url in html:
        resp = requests.get(url)
        bsObj = BeautifulSoup(resp.content, 'html.parser')
        print(bsObj)  # should print the entire html document
        # do other stuff with bsObj here

stars(html)
The IndexError from bsObj.findAll("section")[8].findAll("div")[1].findAll("span")[16] is something you'll need to figure out yourself.
You have a couple of errors here:
1. You are trying to load the whole list of pages into BeautifulSoup at once; you should process the pages one by one.
2. You should get the source code of each page before processing it.
3. There is no "section" element on the page you are loading, so you will get an exception when trying to take the 8th element; you need to check whether you found anything before indexing (see the guard in the code below).
import requests
from bs4 import BeautifulSoup

def stars(html):
    request = requests.get(html)
    if request.status_code != 200:
        return
    page_content = request.content
    bsObj = BeautifulSoup(page_content, "lxml")
    sections = bsObj.findAll("section")
    if len(sections) < 9:  # evaluate whether we found anything before indexing
        print("No sections found on", html)
        return
    starbewertung = sections[8].findAll("div")[1].findAll("span")[16]
    str_cells = str(starbewertung)
    cleantext = BeautifulSoup(str_cells, "lxml").get_text()
    print(cleantext)

for page in html:
    stars(page)
This question was edited, see update below
I'm trying to get the title (HTML tag content) of certain urls when they respond 404. I want to get the content only when it returns 404, because I'm testing whether an Instagram username is available, using urls with that username like this:
https://instagram.com/jsd23sdoleo
which returns a 404 html page with the title Page Not Found.
Now there are urls like this:
https://www.instagram.com/leop
Content Unavailable screenshot
Page not found screenshot
where the browser shows the title Content Unavailable. It turns out that when it returns "Content Unavailable" the username is not taken but also not available. The only case where a username is not taken and available is when it returns "Page Not Found".
But my Python code brings back only Page Not Found as the title, even for urls that show Content Unavailable in the browser.
Is there a reason for not getting the right HTML content for those urls?
UPDATE: My Python code brings only "Page Not Found" because when I test in my browser I'm logged in, so the behavior is different.
Is there a way of getting the HTML page after login with the requests library?
Here's my code so you can test it:
import urllib.request
from bs4 import BeautifulSoup
import requests

chars_clean = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','x','z','0','1','2','3','4','5','6','7','8','9']
chars = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','x','z','0','1','2','3','4','5','6','7','8','9','_','.']

base_url = 'https://www.instagram.com/'
leo = 'leo'
urls = []
valid_urls = []
user = []

for char_clean in chars_clean:
    for char in chars:
        urls.append(base_url + char_clean + char + leo)

for url in urls:
    try:
        # fetch html
        source = urllib.request.urlopen(url)
        # parse with BeautifulSoup
        BS = BeautifulSoup(source, "html5lib")
        # search for title element
        title = BS.title.string
    except urllib.error.HTTPError as err:  # HTTPError carries the .code attribute
        if err.code == 404:
            response = requests.get(url)
            soup = BeautifulSoup(response.text, "html5lib")
            title = soup.title.string
            print("XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX")
            print(url)
            print(title)
            print('XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX')
            continue
    else:
        # no exception: the profile page loaded, so the username is taken
        valid_urls.append(url)
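Regarding the update: one common approach is to reuse the session cookie from a browser where you are already logged in. A minimal sketch, assuming you copy the value of Instagram's sessionid cookie from your browser's developer tools (the cookie value below is a placeholder):

import requests
from bs4 import BeautifulSoup

session = requests.Session()
# paste the cookie value from a logged-in browser session here
session.cookies.set('sessionid', 'PASTE_YOUR_COOKIE_VALUE', domain='.instagram.com')
response = session.get('https://www.instagram.com/leop')
soup = BeautifulSoup(response.text, 'html5lib')
print(soup.title.string)

Note that every request then acts on behalf of your account, so use this carefully.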
Goodreads claims I can get XML that begins with a root called <GoodreadsResponse>, whose 1st child is <book>, the 8th child of which is image_url. Trouble is, I can't even get it to recognize the proper root: it prints root, not GoodreadsResponse, and fails to recognize that the root has any children at all, though the response code is 200. I prefer to work with JSON and, allegedly, you can convert it to JSON, but I had zero luck with that.
Here's the function I have at the moment. Where am I going wrong?
def main(url, payload):
    """Retrieves image from Goodreads API endpoint returning XML response"""
    res = requests.get(url, payload)
    status = res.status_code
    print(status)
    parser = etree.XMLParser(recover=True)
    tree = etree.fromstring(res.content, parser=parser)
    root = etree.Element("root")
    print(root.text)

if __name__ == '__main__':
    main("https://www.goodreads.com/book/isbn/", '{"isbns": "0441172717", "key": "my_key"}')
The goodreads info is here:
**Get the reviews for a book given an ISBN**
Get an xml or json response that contains embed code for the iframe reviews widget that shows excerpts (first 300 characters) of the most popular reviews of a book for a given ISBN. The reviews are from all known editions of the book.
URL: https://www.goodreads.com/book/isbn/ISBN?format=FORMAT (sample url)
HTTP method: GET
At the moment you are receiving HTML, not XML, with your request.
You need to set the format of the response you want: https://www.goodreads.com/book/isbn/ISBN?format=FORMAT
And you need to pass your query parameters via params, not as a payload string; see:
Constructing requests with URL Query String in Python
P.S. For the request you are doing you can use JSON.
https://www.goodreads.com/api/index#book.show_by_isbn
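A minimal sketch of what that looks like (my_key is a placeholder for your API key):

import requests
from lxml import etree

params = {'format': 'xml', 'key': 'my_key'}  # placeholder key
res = requests.get('https://www.goodreads.com/book/isbn/0441172717', params=params)
tree = etree.fromstring(res.content)
print(tree.tag)  # should now print GoodreadsResponse instead of html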
Here's the solution that worked best for me:
import requests
from bs4 import BeautifulSoup

def main():
    key = 'myKey'
    isbn = '0441172717'
    url = 'https://www.goodreads.com/book/isbn/{}?key={}'.format(isbn, key)
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "lxml-xml")
    print(soup.find('image_url').text)
The issue was that the XML contents were wrapped in CDATA sections. Using the Beautiful Soup 'lxml-xml' parser rather than 'lxml' retained the content inside the CDATA sections and allowed it to be parsed correctly.
from lxml import html
import requests
url = 'https://www.data.gov/'
r = requests.get(url)
doc = html.fromstring(r.content)
link = doc.xpath('/html/body/header/div[4]/div/div/h4/label/small/a')
print(link)
This keeps giving me:
[<Element a at 0x1c64c963f48>]
as the response instead of the actual number I am seeking on the page. Any idea why?
Also, why can't I get a type(link) value to see the type?
Try the code below to get "192,322" as output:
from lxml import html
import requests

url = 'https://www.data.gov/'
r = requests.get(url)
doc = html.fromstring(r.content)
try:
    # match the link by its href instead of a brittle absolute path
    link = doc.xpath('//a[@href="/metrics"]/text()')[0]
    print(link.split()[0])
except IndexError:
    print("No link found")
Your XPath gives you <a> elements. You want their text. So... print their text.
link = doc.xpath("//label[@for='search-header']//a")
for a in link:
    print(a.text)
Notes
/html/body/header/div[4]/div/div/h4/label/small/a is way too specific. It will break very easily when they make even the slightest change to their HTML layout. Don't use auto-generated XPath expressions. Write all your XPath expressions yourself.
xpath() always returns a list of matches, even if there is only one hit. Use a loop or pick a specific list item (like link[0]).
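That also answers the type(link) question: the result is a plain Python list, and indexing into it gives you the element objects. Continuing from the code above:

link = doc.xpath("//label[@for='search-header']//a")
print(type(link))     # <class 'list'>
print(type(link[0]))  # <class 'lxml.html.HtmlElement'>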
You can extract the text by appending text() to your XPath expression. See below:
from lxml import html
import requests
url = 'https://www.data.gov/'
r = requests.get(url)
doc = html.fromstring(r.content)
link = doc.xpath('/html/body/header/div[4]/div/div/h4/label/small/a/text()')
print(link)
Example in Chrome Developer Tools:
> $x("/html/body/header/div[4]/div/div/h4/label/small/a/text()")[0]
> 192,322 DATASETS