Python Beautiful Soup - cannot find the feed URL

I am using Python's Beautiful Soup module to find the feed URL of any website, but the code does not work for all sites. For example, it works for http://www.extremetech.com/ but not for http://cnn.com/. In fact, http://cnn.com/ redirects to https://edition.cnn.com/, so I tried the latter, but with no luck. By googling, though, I found that the CNN feed is here.
My code follows:
import urllib.parse
import requests
import feedparser
from bs4 import BeautifulSoup as bs4
# from bs4 import BeautifulSoup
def findfeed(site):
    user_agent = {
        'User-agent':
        'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.52 Safari/537.17'}
    raw = requests.get(site, headers = user_agent).text
    result = []
    possible_feeds = []
    #html = bs4(raw,"html5lib")
    html = bs4(raw,"html.parser")
    feed_urls = html.findAll("link", rel="alternate")
    for f in feed_urls:
        t = f.get("type",None)
        if t:
            if "rss" in t or "xml" in t:
                href = f.get("href",None)
                if href:
                    possible_feeds.append(href)
    parsed_url = urllib.parse.urlparse(site)
    base = parsed_url.scheme+"://"+parsed_url.hostname
    atags = html.findAll("a")
    for a in atags:
        href = a.get("href",None)
        if href:
            if "xml" in href or "rss" in href or "feed" in href:
                possible_feeds.append(base+href)
    for url in list(set(possible_feeds)):
        f = feedparser.parse(url)
        if len(f.entries) > 0:
            if url not in result:
                result.append(url)
    for result_indiv in result:
        print( result_indiv,end='\n ')
    #return(result)
# findfeed("http://www.extremetech.com/")
# findfeed("http://www.cnn.com/")
findfeed("https://edition.cnn.com/")
How can I make the code work for all sites, for example https://edition.cnn.com/? I am using Python 3.
EDIT 1: If I need to use a module other than Beautiful Soup, I am ready to do that.

How can I make the code work for all sites
You can't. Not every site follows the best practices.
It is recommended that the site homepage includes a <link rel="alternate" type="application/rss+xml" ...> or <link rel="alternate" type="application/atom+xml" ...> element, but CNN doesn't follow the recommendation. There is no way around this.
But I found by googling that the feed of CNN is here.
That is not the homepage, and CNN has not provided any means to discover it. There is currently no automated method to discover what sites have made this error.
Actually http://cnn.com/ redirects to https://edition.cnn.com/
Requests handles redirection for you automatically:
>>> response = requests.get('http://cnn.com')
>>> response.url
'https://edition.cnn.com/'
>>> response.history
[<Response [301]>, <Response [301]>, <Response [302]>]
If I need to use any module other than BeautifulSoup, I am ready to do that
This is not a problem a module can solve. Some sites don't implement autodiscovery or do not implement it correctly.
For example, established RSS feed software that implements autodiscovery (such as the online reader at https://inoreader.com) can't find the CNN feeds either, unless you give it the specific /services/rss URL you found by Googling.
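As a rough illustration of what autodiscovery looks for, here is a minimal sketch that runs on an inline HTML string rather than a live site. The function name discover_feeds, the FEED_TYPES set, and the sample markup are made up for this example. It can only report feeds that a page actually advertises, so it would still come back empty for a homepage that omits the link element.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# MIME types commonly used to advertise feeds (an assumption for this sketch).
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


def discover_feeds(html, base_url):
    """Return absolute feed URLs advertised via <link rel="alternate">."""
    soup = BeautifulSoup(html, "html.parser")
    feeds = []
    for link in soup.find_all("link", href=True):
        rel = link.get("rel") or []  # rel is parsed as a list of tokens
        link_type = (link.get("type") or "").strip().lower()
        if "alternate" in rel and link_type in FEED_TYPES:
            url = urljoin(base_url, link["href"])  # resolve relative hrefs
            if url not in feeds:
                feeds.append(url)
    return feeds


# Hypothetical sample page, not taken from any real site
sample = """
<html><head>
<link rel="alternate" type="application/rss+xml" href="/feed/">
<link rel="stylesheet" href="/style.css">
</head><body></body></html>
"""
print(discover_feeds(sample, "https://example.com/"))
```

Resolving each href with urljoin keeps absolute URLs intact and expands relative ones, which plain string concatenation of a base and an href does not do.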

Looking at this answer, the following should work:
feeds = html.findAll(type='application/rss+xml') + html.findAll(type='application/atom+xml')
Trying that on the CNN RSS services page works. Your main problem is that edition.cnn.com itself contains no trace of an RSS feed.

Related

Using Beautiful soup to get the stock prices

I am trying to build a simple price tracker for bitcoin and other cryptocurrencies or stocks. I intend to scrape prices from Google Finance using the BeautifulSoup and requests libraries.
The code is this:
from bs4 import BeautifulSoup
import requests
import time
def getprice():
    url = 'https://www.google.com/search?q=bitcoin+price'
    HTML = requests.get(url)
    soup = BeautifulSoup(HTML.text, 'html.parser')
    text = soup.find('div', attrs={'class':'BNeawe iBp4i AP7Wnd'}).find("div", attrs={'class':'BNeawe iBp4i AP7Wnd'}).text
    return text
if __name__ == "__main__":
    bitcoin = getprice()
    print(bitcoin)
I get this error:
  File "c:\Users\gabri\Visual Studio\crypto\bitcoinprice.py", line 19, in <module>
    bitcoin = getprice()
  File "c:\Users\gabri\Visual Studio\crypto\bitcoinprice.py", line 15, in getprice
    text = soup.find('div', attrs={'class':'BNeawe iBp4i AP7Wnd'}).find("div", attrs={'class':'BNeawe iBp4i AP7Wnd'}).text
AttributeError: 'NoneType' object has no attribute 'find'
How can I solve it?
This doesn't work because you will run into the following problems:
1. You will hit Google's bot detection, which means requests.get won't return the Google results; instead you'll get a page from the bot detection asking you to tick a box to confirm you are human.
2. The class you are searching for doesn't exist.
3. You are using the default html.parser. I'd switch to the lxml parser, which is faster and more tolerant of messy markup, although no parser can recover a price that is missing from the HTML Google returns.
Based on what you are trying to do, you could try to get past Google's bot detection by making your request look more legitimate, for example by adding the user agent that a Chrome browser would normally send. Additionally, to get the price it seems you want the span element with the class pclqee.
Try this instead:
First install the lxml parser:
pip3 install lxml
Then use the below snippet instead:
from bs4 import BeautifulSoup
import requests
import time
def getprice():
    url = "https://www.google.com/search?q=bitcoin+price"
    HTML = requests.get(
        url,
        headers={
            "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.74 Safari/537.36"
        },
    )
    soup = BeautifulSoup(HTML.text, "lxml")
    text = soup.find("span", attrs={"class": "pclqee"}).text
    return text
if __name__ == "__main__":
    bitcoin = getprice()
    print(bitcoin)
Although the modified snippet above will work, I wouldn't advise using it. Google will still occasionally detect your request as a bot, so this code would be unreliable.
If you want stock data, I suggest you query some APIs directly or use APIs that already do that for you, e.g. have a look at https://www.alphavantage.co/
The first soup.find() call returned None, so you can't call .find() on its result. It works on my machine, so try printing soup and troubleshoot from there.
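Whatever the underlying cause, the AttributeError appears because find() returns None when nothing matches, and the next call is then made on that None. A minimal sketch of guarding against this, using a made-up HTML string and a hypothetical helper named find_text, might look like this:

```python
from bs4 import BeautifulSoup


def find_text(html, tag, class_name):
    """Return the text of the first matching element, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    element = soup.find(tag, class_=class_name)
    if element is None:
        # Nothing matched, so report that instead of raising AttributeError.
        return None
    return element.get_text(strip=True)


page = '<div><span class="price">42,000.00</span></div>'  # made-up markup
print(find_text(page, "span", "price"))   # the element exists
print(find_text(page, "span", "pclqee"))  # no such element, so None
```

Returning None explicitly lets the caller decide whether a missing element means a changed page layout, a bot-detection page, or something else.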

Is there a way to print this unshowed tag text? [duplicate]

This question already has answers here:
Web-scraping JavaScript page with Python
(18 answers)
Closed 2 years ago.
I'm trying to scrape a webpage's inventory counts, but the problem is that they don't show up in the output of my Python script.
Here's the original tag as it appears in the browser, with the text I want to scrape:
<span class="currentInv">251</span>
" in stock"
and this is the tag after parsing it with BeautifulSoup and the lxml parser (I even tried other parsers such as html.parser and html5lib):
<span class="currentInv"></span>
Here's my full Python script:
import requests
from bs4 import BeautifulSoup as bs
url = f'https://www.hancocks.co.uk/buy-wholesale-sweets?warehouse=1983&p=1'
parser = 'lxml'
headers = {'User-Agent' : 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'}
response = requests.get(url, headers=headers)
data = response.text
soup = bs(data, parser)
print(soup.find('span', class_ = 'currentInv').text)
The output is empty.
I tried over and over, but nothing seems to work for me.
Any help would be much appreciated.
If you view the source of the page (i.e. view-source:https://www.hancocks.co.uk/buy-wholesale-sweets?warehouse=1983&p=1), you'll see that the server-side rendered HTML sent to the browser also contains no value in that span tag.
The value 251 is likely being added client-side by JavaScript after the DOM is loaded.
I'd go through the answers to "Web-scraping JavaScript page with Python" for more ways to try to extract that JavaScript value.
Most likely the page you see in your browser contains dynamic content. This means that when you inspect the page, you see the final result after some JavaScript code ran and manipulated the DOM that is rendered in the browser. When you load the same page in Python code using Beautiful Soup, you get the raw HTML that comes from the request. The JavaScript code for the dynamic content isn't executed, so you will not see the same results.
One solution is to use Selenium instead of Beautiful Soup. Selenium will load a page in a browser and provides an API to interact with that page.
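The difference between the raw HTML and the rendered page can be reproduced offline. The sketch below uses a made-up page in which a script is meant to fill in the span; since Beautiful Soup only parses markup and never runs scripts, the span stays empty.

```python
from bs4 import BeautifulSoup

# Made-up page: the server sends an empty span and a script fills it in later.
raw_html = """
<span class="currentInv"></span> in stock
<script>
  document.querySelector('.currentInv').textContent = '251';
</script>
"""

soup = BeautifulSoup(raw_html, "html.parser")
span = soup.find("span", class_="currentInv")
print(repr(span.text))  # the script never ran, so the span is still empty
```

A tool that drives a real browser would execute the script before the page is read, which is why it sees the value and this code does not.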

Is there a way to extract CSS from a webpage using BeautifulSoup?

I am working on a project which requires me to view a webpage, but to use the HTML further, I have to see it fully and not as a bunch of lines mixed in with pictures. Is there a way to parse the CSS along with the HTML using BeautifulSoup?
Here is my code:
import requests
from bs4 import BeautifulSoup
def get_html(url, name):
    r = requests.get(url)
    r.encoding = 'utf8'
    return r.text
link = 'https://www.labirint.ru/books/255282/'
with open('labirint.html', 'w', encoding='utf-8') as file:
    file.write(get_html(link, '255282'))
WARNING: The page: https://www.labirint.ru/books/255282/ has a redirect to https://www.labirint.ru/books/733371/.
If your goal is truly to parse the CSS:
There are various methods here: Prev Question w/ Answers
I have also used a nice example from this site: Python Code Article
Beautiful Soup will parse the entire page, and that does include the header, styles, scripts, linked CSS and JS, etc. I have used the method from the Python Code article before and retested it with the link you provided.
import requests
from bs4 import BeautifulSoup as bs
from urllib.parse import urljoin
# URL of the web page you want to extract
url = "ENTER YOUR LINK HERE"
# initialize a session & set User-Agent as a regular browser
session = requests.Session()
session.headers["User-Agent"] = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"
# get the HTML content
html = session.get(url).content
# parse HTML using beautiful soup
soup = bs(html, "html.parser")
print(soup)
Looking at the soup output (it is very long, so I will not paste it here), you can see it is a complete page. Just make sure to paste in your specific link.
Now, if you want to parse the result to pick up all the CSS URLs, you can add this (I am still using parts of the code from the well-described Python Code article linked above):
# get the CSS files
css_files = []
for css in soup.find_all("link"):
    # keep only <link> tags that point at a stylesheet and have an href
    if css.attrs.get("href") and "stylesheet" in (css.attrs.get("rel") or []):
        css_url = urljoin(url, css.attrs.get("href"))
        css_files.append(css_url)
print(css_files)
The output css_files will be a list of all the linked CSS files. You can now visit those separately and see the styles that are being imported.
NOTE: this particular site mixes inline styles into the HTML (i.e. it does not always use CSS files to set style properties; sometimes the styles are inside the HTML content).
This should get you started.
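Since some styles live inside the HTML itself, it can help to collect those as well. The following is a minimal sketch on a made-up HTML string; the variable names are illustrative. It gathers the contents of style elements and the values of style attributes.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration
html = """
<html><head><style>p { color: red; }</style></head>
<body><p style="margin: 0">Hello</p><div>No style here</div></body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# CSS written inside <style> elements
style_blocks = [tag.string.strip() for tag in soup.find_all("style") if tag.string]

# CSS written in style="..." attributes, paired with the tag name
inline_styles = [(tag.name, tag["style"]) for tag in soup.find_all(style=True)]

print(style_blocks)
print(inline_styles)
```

Beautiful Soup only hands back the CSS text here; interpreting the rules themselves would need a separate CSS parser.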

Using requests with bs4 and or json

Here is the source of the page I am looking at. Page Source.
If the page source link is not working, here is the link for the source only: "view-source:https://sports.bovada.lv/baseball/mlb"
Here is the link: Link to page
I am not too familiar with bs4, but here is my script, which runs but does not return anything I need.
import requests
from bs4 import BeautifulSoup
r = requests.get('https://sports.bovada.lv/baseball/mlb/game-lines-market-group')
soup = BeautifulSoup(r.content, 'lxml')
print(soup.prettify())
I can return the soup just fine, but what I see when inspecting the site and the returned soup are not the same.
Here is a sample of what I can see from inspect.
The goal is to extract the team, pitcher, odds and total runs, which I can clearly see in the inspect version. When I print soup, that information is not included.
Then I dug a little further: at the bottom of the page source I can see an iframe, and below that what looks like a JSON dictionary with everything I am looking to extract. But running a similar script to retrieve the JSON data does not work as I had hoped:
import requests
req = requests.get('view-source:https://sports.bovada.lv//baseball/mlb/game-lines-market-group')
data = req.json()['itemList']
print(data)
I believe I should be using bs4, but I am confused about why the same HTML is not being returned.
The data is dynamic: the page ships it as JSON and inserts it into the HTML in the browser.
To access it with BS you need to get at the var in the source that contains the JSON data, then load it with json, and you can access it from there.
In the link you gave, this is var swc_market_lists =
So in the source it will look like
<script type="text/javascript">var swc_market_lists = {"items":[{"description":"Game Lines","id":"136","link":"/baseball/mlb/game-lines-market-group","baseLink":"/baseball/mlb/game-lines-market-........
Now you can use swc_market_lists in a regular expression pattern to match only that script.
Use soup.find to return just that section.
Because .text will include the var part, I return the data from the start of the JSON string, in this case from index 23 (the 24th character), which is the first {.
This means you now have a string of JSON data, which you can load with json and manipulate as required.
Hopefully you can work with this to find what you want.
from bs4 import BeautifulSoup as bs4
import requests
import json
from lxml import html
from pprint import pprint
import re
def get_data():
    url = 'https://sports.bovada.lv//baseball/mlb/game-lines-market-group'
    r = requests.get(url, headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.103 Safari/537.36"})
    html_bytes = r.text
    soup = bs4(html_bytes, 'lxml')
    # res = soup.findAll('script') # find all scripts..
    pattern = re.compile(r"swc_market_lists\s+=\s+(\{.*?\})")
    script = soup.find("script", text=pattern)
    return script.text[23:]
test1 = get_data()
json_data = json.loads(test1)
pprint(json_data['items'])
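The fixed offset of 23 characters works only while the script text starts exactly with "var swc_market_lists = " and nothing follows the object. As a hedged alternative, the sketch below locates the assignment with a regular expression and lets the json module stop at the end of the object. The helper name extract_js_object and the inline page are made up for this illustration, and it assumes the value is valid JSON.

```python
import json
import re

from bs4 import BeautifulSoup

# Made-up page; the variable name mirrors the one discussed above.
html = """
<html><body>
<script type="text/javascript">var swc_market_lists = {"items": [{"description": "Game Lines", "id": "136"}]};</script>
</body></html>
"""


def extract_js_object(html, var_name):
    """Find `var <var_name> = {...}` in a <script> and decode the object."""
    soup = BeautifulSoup(html, "html.parser")
    pattern = re.compile(r"\b" + re.escape(var_name) + r"\s*=\s*")
    for script in soup.find_all("script"):
        source = str(script.string or "")
        match = pattern.search(source)
        if match:
            # raw_decode parses one JSON value starting at the given index
            # and ignores whatever follows it, such as a trailing semicolon.
            data, _ = json.JSONDecoder().raw_decode(source, match.end())
            return data
    return None


data = extract_js_object(html, "swc_market_lists")
print(data["items"][0]["description"])
```

Using raw_decode avoids both the hard-coded offset and the need to trim trailing JavaScript before calling json.loads.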

why can't I access full html of this page using urllib, beautifulsoup

I'm new to web scraping, and for learning purposes I want to find all the href links on the https://retty.me/ website.
But my code finds only one link on that website. When I viewed the page source, it had many links that were not printed. I also printed the full page, and it contains only one link.
What did I do wrong?
Please correct me.
Here is my Python code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
data=[]
html = urlopen('https://retty.me')
soup = BeautifulSoup(html,'lxml')
print(soup)
for link in soup.findAll('a', attrs={'href': re.compile("^https://")}):
    data.append(link.attrs['href'])
file=open('scraped_data.txt','w')
for item in data:
    file.write("%s\n"%item)
file.close()
If you paste the message shown in the HTML you get into Google Translate, it says "We apologize for your trouble".
They don't want people scraping their site, so they filter requests based on the user agent. You just need to add a browser-like user agent to the request headers.
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
import re
data=[]
url = 'https://retty.me'
req = Request(
    url,
    data=None,
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
    }
)
html = urlopen(req)
soup = BeautifulSoup(html,'lxml')
print(soup)
for link in soup.findAll('a', attrs={'href': re.compile("^https://")}):
    data.append(link.attrs['href'])
for item in data:
    print(item)
In fact, this particular site only requires the presence of the user agent header and will accept any user agent, even an empty string. The requests library, as mentioned by Rishav, provides a user agent by default, which is why it works there without adding a custom header.
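To check what urllib identifies itself as when no header is set, and to confirm that a custom header is attached to a request, here is a minimal standard-library sketch. It builds the objects without sending anything over the network, and the shortened User-Agent string is only a placeholder.

```python
from urllib.request import Request, build_opener

# The User-Agent that urllib sends when you do not set one.
default_headers = dict(build_opener().addheaders)
print(default_headers["User-agent"])  # starts with "Python-urllib/"

# A request object carrying a browser-like User-Agent (placeholder value).
req = Request(
    "https://retty.me",
    headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"},
)
print(req.get_header("User-agent"))  # urllib stores the key as "User-agent"
```

Printing the headers before making the real request is a quick way to see what the server will receive.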
I don't know why the website returns different HTML when used with urllib, but you can use the excellent requests library, which is much easier to use than urllib anyway.
from bs4 import BeautifulSoup
import re
import requests
data = []
html = requests.get('https://retty.me').text
soup = BeautifulSoup(html, 'lxml')
for link in soup.findAll('a', attrs={'href': re.compile("^https://")}):
    data.append(link.attrs['href'])
print(data)
You can find the official documentation for requests here and for Beautiful Soup here.
import requests
from bs4 import BeautifulSoup
# your Response object called response
response = requests.get('https://retty.me')
# your html as string
html = response.text
#verify that you get the correct html code
print(html)
#make the html, a soup object
soup = BeautifulSoup(html, 'html.parser')
# initialization of your list
data = []
# append to your list all the URLs found within a page’s <a> tags
for link in soup.find_all('a'):
    data.append(link.get('href'))
#print your list items
print(data)
