Beautiful Soup Page Source Error - python

I am trying to fetch the html source from this usl:
http://books.google.com/books?id=NZlV0M5Ije4C&dq=isbn:0470284889
I used the following code:
#!/usr/bin/env python
import urllib, urllib2, urlparse, argparse, re
from bs4 import BeautifulSoup
def getPageSoup(address):
request = urllib2.Request(address, None, {'User-Agent':'Mozilla/5.0 (compatible; MSIE 6.0; Windows NT 5.1)'} )
urlfile = urllib2.urlopen(request)
page = urlfile.read()
urlfile.close()
print 'soup has been obtained!'
return BeautifulSoup(page)
soup2 = getPageSoup(address)
metadata = soup2.findAll("metadata_row")#this content is present when viewing from the web browser
However, the html source from soup2 looks hardly like the source from the Google Books page:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN""http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html><head><title>Quantitative Trading: How to Build Your Own Algorithmic Trading Business - Ernie Chan - Google Books</title><script>(function(){function a(c){this.t={};this.tick=function(c,e,b){b=void 0!=b?b:(new Date).getTime();this.t[c]=[b,e]};this.tick("start",null,c)}var d=new a;window.jstiming={Timer:a,load:d};try{var f=null;window.chrome&&window.chrome.csi&&(f=Math.floor(window.chrome.csi().pageT));null==f&&window.gtbExternal&&(f=window.gtbExternal.pageT());null==f&&window.external&&(f=window.external.pageT);f&&(window.jstiming.pt=f)}catch(g){};})();
</script><link href="/books/css/_9937a87cb2905e754d8d5e36995f224d/kl_about_this_book_kennedy_full_bundle.css" rel="stylesheet" type="text/css"/></head></html>
HTML source from urllib2 and my web browser are very different. How can I get the correct page source?
Thanks!

It is correct page source. All visible content of page is generated by JavaScript. So, it's impossible to fetch actual content using urllib. You should use browser extension, webkit bindings or something like that.

Related

Is there a way to extract CSS from a webpage using BeautifulSoup?

I am working on a project which requires me to view a webpage, but to use the HTML further, I have to see it fully and not as a bunch of lines mixed in with pictures. Is there a way to parse the CSS along with the HTML using BeautifulSoup?
Here is my code:
from bs4 import BeautifulSoup
def get_html(url, name):
r = requests.get(url)
r.encoding = 'utf8'
return r.text
link = 'https://www.labirint.ru/books/255282/'
with open('labirint.html', 'w', encoding='utf-8') as file:
file.write(get_html(link, '255282'))
WARNING: The page: https://www.labirint.ru/books/255282/ has a redirect to https://www.labirint.ru/books/733371/.
If your goal is to truly parse the css:
There are some various methods here: Prev Question w/ Answers
I also have used a nice example from this site: Python Code Article
Beautiful soup will pull the entire page - and it does include the header, styles, scripts, linked in css and js, etc. I have used the method in the pythonCodeArticle before and retested it for the link you provided.
import requests
from bs4 import BeautifulSoup as bs
from urllib.parse import urljoin
# URL of the web page you want to extract
url = "ENTER YOUR LINK HERE"
# initialize a session & set User-Agent as a regular browser
session = requests.Session()
session.headers["User-Agent"] = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"
# get the HTML content
html = session.get(url).content
# parse HTML using beautiful soup
soup = bs(html, "html.parser")
print(soup)
By looking at the soup output (It is very long, I will not paste here).. you can see it is a complete page. Just make sure to paste in your specific link
NOW If you wanted to parse the result to pick up all css urls.... you can add this: (I am still using parts of the code from the very well described python Code article link above)
# get the CSS files
css_files = []
for css in soup.find_all("link"):
if css.attrs.get("href"):
# if the link tag has the 'href' attribute
css_url = urljoin(url, css.attrs.get("href"))
css_files.append(css_url)
print(css_files)
The output css_files will be a list of all css files. You can now go visit those separately and see the styles that are being imported.
NOTE:this particular site has a mix of styles inline with the html (i.e. they did not always use css to set the style properties... sometimes the styles are inside the html content.)
This should get you started.

Python // Requests // ASP.net // No permission to access

im still learing at this. But first time I see, when I used requests module in Python, website give me feedback that I have no permission to access.
My code should only get data from site, and that's all.
import requests
from bs4 import BeautifulSoup
url_siemens_part = "https://mall.industry.siemens.com/mall/en/WW/Catalog/Product/5SY6310-7"
with requests.session() as sr:
partUrl = sr.get(url_siemens_part)
soup = BeautifulSoup(partUrl.content,'html.parser')
print(soup)
Answer I get from this:
<html><head>
<title>Access Denied</title>
</head><body>
<h1>Access Denied</h1>
You don't have permission to access "http://mall.industry.siemens.com/mall/en/WW/Catalog/Product/5SY6310-7" on this server.<p>
Reference #18.36d61202.1596089808.1cc0ef55
</p></body>
</html>
Website is using ASP.net. Site from chromebrowser is visible, but from requests is not.
Can you maybe give me show a way? It's problem with authentication? Maybe .ASPXAUTH or ASP.NET_SessionId I had to use?
Thanks in advance for your time, and any anwsers.
Use custom User-Agent HTTP header to obtain correct response:
import requests
from bs4 import BeautifulSoup
url_siemens_part = "https://mall.industry.siemens.com/mall/en/WW/Catalog/Product/5SY6310-7"
with requests.session() as sr:
sr.headers.update({'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'})
partUrl = sr.get(url_siemens_part)
soup = BeautifulSoup(partUrl.content,'html.parser')
print(soup)
Prints:
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8"/>
<meta content="IE=10" http-equiv="X-UA-Compatible"/>
... and so on.
You can use it . If you don't have the lib, You can install first. pip install requests-html
import requests
from bs4 import BeautifulSoup
from requests_html import HTMLSession
url_siemens_part = "https://mall.industry.siemens.com/mall/en/WW/Catalog/Product/5SY6310-7"
sr = HTMLSession()
partUrl = sr.get(url_siemens_part)
soup = BeautifulSoup(partUrl.content,'html.parser')
print(soup)
With login is all good :) can download all data, but its a problem when I have something like below.
price_catalog = soup.find_all("td",class_="priceDetailsListPrice")
After soup need to find some values, writing as find_all "td"
I get output:
[<td class="priceDetailsListPrice">244,86 EUR
</td>]
its some other way than write "for" function like:
for price_catalog in price_catalog:
output = price_catalog.text
I think its too much to use "for" for single value :(

Python Beautiful Soup Returns Nonetype

I am trying to develop a program that can grab runes for a specific champion in League of Legends.
And here is my code:
import requests
import re
from bs4 import BeautifulSoup
url = 'https://www.leagueofgraphs.com/zh/champions/builds/darius'
response = requests.get(url).text
soup = BeautifulSoup(response,'lxml')
tables = soup.find('div',class_ = 'img-align-block')
print(tables)
And here is the original HTML File:
<img src="//cdn2.leagueofgraphs.com/img/perks/10.8/64/8010.png" alt="征服者" tooltip="<itemname><img src="//cdn2.leagueofgraphs.com/img/perks/10.8/64/8010.png" width="24" height="24" alt="征服者" /> 征服者</itemname><br/><br/>基礎攻擊或技能在命中敵方英雄時獲得 2 層征服者效果,持續 6 秒,每層效果提供 2-5 適性之力。 最多可以疊加 10 次。遠程英雄每次普攻只會提供 1 層效果。<br><br>在疊滿層數後,你對英雄造成的 15% 傷害會轉化為對自身的回復效果(遠程英雄則為 8%)。" height="36" width="36" class="requireTooltip">
I am not able to by any chance access this part and parse it nor find the IMG src. However, I can browse through this on their website.
How could I fix this issue?
The part you are interested in is not in the HTML. You can double check by searching:
soup.prettify()
Probably parts of the website are loaded with JavaScript, so you could use code that opens a browser and visit that page. For example, you could use selenium
from selenium import webdriver
import time
driver = webdriver.Firefox()
driver.get(url)
time.sleep(6) # give the website some time to load
page = driver.page_source
soup = BeautifulSoup(page,'lxml')
tables = soup.find('div', class_='img-align-block')
print(tables)
The website uses JavaScript processing, so you need to use Selenium or another scraping tool that supports JS loading.
Try setting a User-Agent on the headers of your request, without it, the website sends a different content, i.e.:
import requests
from bs4 import BeautifulSoup
url = 'https://www.leagueofgraphs.com/zh/champions/builds/darius'
h = {"User-Agent" : "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0"}
response = requests.get(url, headers=h).text
soup = BeautifulSoup(response,'html.parser')
images = soup.find_all('img', {"class" : 'mainPicture'})
for img in images:
print(img['src'])
//cdn2.leagueofgraphs.com/img/perks/10.8/64/8010.png
//cdn2.leagueofgraphs.com/img/perks/10.8/64/8010.png
//cdn2.leagueofgraphs.com/img/perks/10.8/64/8230.png
//cdn2.leagueofgraphs.com/img/perks/10.8/64/8230.png
//cdn2.leagueofgraphs.com/img/perks/10.8/64/8230.png
Notes:
Demo
If my answer helped you, please consider accepting it as the correct answer, thanks!

python Beautiful Soup - cannot find the feed URL

I am using the Beautiful Soup module of python to get the feed URL of any website. But the code does not work for all sites. For example it works for http://www.extremetech.com/ but not for http://cnn.com/. Actually http://cnn.com/ redirects to https://edition.cnn.com/. So I used the later one but of no luck. But I found by googling that the feed of CNN is here .
My code follows:
import urllib.parse
import requests
import feedparser
from bs4 import BeautifulSoup as bs4
# from bs4 import BeautifulSoup
def findfeed(site):
user_agent = {
'User-agent':
'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.52 Safari/537.17'}
raw = requests.get(site, headers = user_agent).text
result = []
possible_feeds = []
#html = bs4(raw,"html5lib")
html = bs4(raw,"html.parser")
feed_urls = html.findAll("link", rel="alternate")
for f in feed_urls:
t = f.get("type",None)
if t:
if "rss" in t or "xml" in t:
href = f.get("href",None)
if href:
possible_feeds.append(href)
parsed_url = urllib.parse.urlparse(site)
base = parsed_url.scheme+"://"+parsed_url.hostname
atags = html.findAll("a")
for a in atags:
href = a.get("href",None)
if href:
if "xml" in href or "rss" in href or "feed" in href:
possible_feeds.append(base+href)
for url in list(set(possible_feeds)):
f = feedparser.parse(url)
if len(f.entries) > 0:
if url not in result:
result.append(url)
for result_indiv in result:
print( result_indiv,end='\n ')
#return(result)
# findfeed("http://www.extremetech.com/")
# findfeed("http://www.cnn.com/")
findfeed("https://edition.cnn.com/")
How can I make the code work for all sites for example https://edition.cnn.com/ ? I am using python 3.
EDIT 1: If I need to use any module other than Beautiful Soup, I am ready to do that
How can I make the code work for all sites
You can't. Not every site follows the best practices.
It is recommended that the site homepage includes a <link rel="alternate" type="application/rss+xml" ...> or <link rel="alternate" type="application/atom+xml" ...> element, but CNN doesn't follow the recommendation. There is no way around this.
But I found by googling that the feed of CNN is here.
That is not the homepage, and CNN has not provided any means to discover it. There is currently no automated method to discover what sites have made this error.
Actually http://cnn.com/ redirects to https://edition.cnn.com/
Requests handles redirection for you automatically:
>>> response = requests.get('http://cnn.com')
>>> response.url
'https://edition.cnn.com/'
>>> response.history
[<Response [301]>, <Response [301]>, <Response [302]>]
If I need to use any module other than BeautifulSoup, I am ready to do that
This is not a problem a module can solve. Some sites don't implement autodiscovery or do not implement it correctly.
For example, established RSS feed software that implement autodiscovery support (like the online https://inoreader.com), can't find the CNN feeds either, unless you use the specific /services/rss URL you found with Googling.
Looking at this answer. This should work perfectly:
feeds = html.findAll(type='application/rss+xml') + html.findAll(type='application/atom+xml')
Trying that on the CNN RSS service works perfectly. Your main problem is that the edition.cnn.com does not have any traces of RSS in any way or fashion.

Extracting links from HTML table using BeautifulSoup with unclean source code

I am trying to scrape articles from a Chinese newspaper database. Here is some of the source code (pasting excerpt b/c keyed site):
<base href="http://huylpd.twinbridge.com.ezp-prod1.hul.harvard.edu/web\" /><html>
<! -- <%# page contentType="text/html;charset=GBK" %>
<head>
<meta http-equiv="Content-Language" content="zh-cn">
<meta http-equiv="Content-Type" content="text/html; charset=gb2312">
<meta name="GENERATOR" content="Microsoft FrontPage 4.0">
<meta name="ProgId" content="FrontPage.Editor.Document">
<title>概览页面</title>
...
</head>
...
</html>
</html>
When I try to do some straightforward scraping of the links in the table like so:
import urllib, urllib2, re, mechanize
from BeautifulSoup import BeautifulSoup
br = mechanize.Browser(factory=mechanize.RobustFactory())
br.addheaders = [('User-agent', 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.6')]
br.set_handle_robots(False)
url = 'http://huylpd.twinbridge.com.ezp-prod1.hul.harvard.edu/search?%C8%D5%C6%DA=&%B1%EA%CC%E2=&%B0%E6%B4%CE=&%B0%E6%C3%FB=&%D7%F7%D5%DF=&%D7%A8%C0%B8=&%D5%FD%CE%C4=%B9%FA%BC%CA%B9%D8%CF%B5&Relation=AND&sortfield=RELEVANCE&image1.x=27&image1.y=16&searchword=%D5%FD%CE%C4%3D%28%B9%FA%BC%CA%B9%D8%CF%B5%29&presearchword=%B9%FA%BC%CA%B9%D8%CF%B5&channelid=16380'
page = br.open(url)
soup = BeautifulSoup(page)
links = soup.findAll('a') # links is empty =(
Python does not even find anything in the html, aka returns an empty list. I think this is because the source code starts with the base href tag, and Python only recognizes two tags in the document: base href and html.
Any idea how to scrape the links in this case? Thank you so much!!
Removing the second line made BS find all the tags. I didn't find a better way to parse this.
page = br.open(url)
page = page.read().replace('<! -- <%# page contentType="text/html;charset=GBK" %>', '')
soup = BeautifulSoup(page)
BS isnt really developed any longer - and would suggest you have a look at lxml
Dont have access to that specific url, but I was able to get this to work, using the html fragment (to which I added an a tag)
>>> soup = lxml.html.document_fromstring(u)
>>> soup.cssselect('a')
>>> soup.cssselect('a')[0].text_content() #for example
When your html is very messed up, it's better to clean it up a little first, for instance, in this case, remove everything before , remove everything after (the first) . Download one page, mold it manually to see what is acceptable to beautifulsoup, and then write some regexes to preprocess.

Categories