Extracting links from HTML table using BeautifulSoup with unclean source code - python

I am trying to scrape articles from a Chinese newspaper database. Here is some of the source code (pasting only an excerpt because the site is key-protected):
<base href="http://huylpd.twinbridge.com.ezp-prod1.hul.harvard.edu/web\" /><html>
<! -- <%# page contentType="text/html;charset=GBK" %>
<head>
<meta http-equiv="Content-Language" content="zh-cn">
<meta http-equiv="Content-Type" content="text/html; charset=gb2312">
<meta name="GENERATOR" content="Microsoft FrontPage 4.0">
<meta name="ProgId" content="FrontPage.Editor.Document">
<title>概览页面</title>
...
</head>
...
</html>
</html>
When I try to do some straightforward scraping of the links in the table like so:
import urllib, urllib2, re, mechanize
from BeautifulSoup import BeautifulSoup
br = mechanize.Browser(factory=mechanize.RobustFactory())
br.addheaders = [('User-agent', 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.6)')]
br.set_handle_robots(False)
url = 'http://huylpd.twinbridge.com.ezp-prod1.hul.harvard.edu/search?%C8%D5%C6%DA=&%B1%EA%CC%E2=&%B0%E6%B4%CE=&%B0%E6%C3%FB=&%D7%F7%D5%DF=&%D7%A8%C0%B8=&%D5%FD%CE%C4=%B9%FA%BC%CA%B9%D8%CF%B5&Relation=AND&sortfield=RELEVANCE&image1.x=27&image1.y=16&searchword=%D5%FD%CE%C4%3D%28%B9%FA%BC%CA%B9%D8%CF%B5%29&presearchword=%B9%FA%BC%CA%B9%D8%CF%B5&channelid=16380'
page = br.open(url)
soup = BeautifulSoup(page)
links = soup.findAll('a') # links is empty =(
BeautifulSoup does not find anything in the HTML at all and returns an empty list. I think this is because the source code starts with the base href tag, so the parser only recognizes two tags in the document: base href and html.
Any idea how to scrape the links in this case? Thank you so much!!

Removing the second line (the malformed comment) made BeautifulSoup find all the tags. I didn't find a better way to parse this.
page = br.open(url)
page = page.read().replace('<! -- <%# page contentType="text/html;charset=GBK" %>', '')
soup = BeautifulSoup(page)

BeautifulSoup isn't really developed any longer, so I would suggest you have a look at lxml.
I don't have access to that specific URL, but I was able to get this to work using the HTML fragment above (to which I added an a tag):
>>> import lxml.html
>>> # u holds the HTML fragment from the question, with an <a> tag added
>>> soup = lxml.html.document_fromstring(u)
>>> soup.cssselect('a')
>>> soup.cssselect('a')[0].text_content()  # for example

When your HTML is very messed up, it's better to clean it up a little first: for instance, in this case, remove everything before the first <html> tag and everything after the (first) </html>. Download one page, mold it manually to see what is acceptable to BeautifulSoup, and then write some regexes to preprocess, as in the sketch below.
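A minimal preprocessing sketch along those lines, assuming the stray <base href> prefix, the malformed comment, and the duplicate </html> shown in the question (the regexes here are illustrative, not from the original answer):
import re

def clean_html(raw):
    # keep only the span from the first <html> to the first </html>,
    # discarding the stray <base href> prefix and the duplicate closing tag
    match = re.search(r'<html.*?</html>', raw, re.DOTALL | re.IGNORECASE)
    body = match.group(0) if match else raw
    # strip the malformed server-side comment that confuses the parser
    return re.sub(r'<! --.*?%>', '', body, flags=re.DOTALL)

page = br.open(url).read()
soup = BeautifulSoup(clean_html(page))
links = soup.findAll('a')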

Related

Is there a way to extract CSS from a webpage using BeautifulSoup?

I am working on a project which requires me to view a webpage, but to use the HTML further, I have to see it fully and not as a bunch of lines mixed in with pictures. Is there a way to parse the CSS along with the HTML using BeautifulSoup?
Here is my code:
import requests
from bs4 import BeautifulSoup

def get_html(url, name):
    r = requests.get(url)
    r.encoding = 'utf8'
    return r.text

link = 'https://www.labirint.ru/books/255282/'
with open('labirint.html', 'w', encoding='utf-8') as file:
    file.write(get_html(link, '255282'))
WARNING: The page: https://www.labirint.ru/books/255282/ has a redirect to https://www.labirint.ru/books/733371/.
If your goal is to truly parse the CSS:
There are various methods here: Prev Question w/ Answers
I have also used a nice example from this site: Python Code Article
BeautifulSoup will pull the entire page, and it does include the header, styles, scripts, linked CSS and JS, etc. I have used the method in the Python Code Article before and retested it for the link you provided.
import requests
from bs4 import BeautifulSoup as bs
from urllib.parse import urljoin
# URL of the web page you want to extract
url = "ENTER YOUR LINK HERE"
# initialize a session & set User-Agent as a regular browser
session = requests.Session()
session.headers["User-Agent"] = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"
# get the HTML content
html = session.get(url).content
# parse HTML using beautiful soup
soup = bs(html, "html.parser")
print(soup)
By looking at the soup output (it is very long, so I will not paste it here), you can see it is a complete page. Just make sure to paste in your specific link.
Now, if you want to parse the result to pick up all CSS URLs, you can add this (I am still using parts of the code from the well-described Python Code Article linked above):
# get the CSS files
css_files = []
for css in soup.find_all("link"):
    # if the link tag has the 'href' attribute
    if css.attrs.get("href"):
        css_url = urljoin(url, css.attrs.get("href"))
        css_files.append(css_url)
print(css_files)
The output css_files will be a list of all css files. You can now go visit those separately and see the styles that are being imported.
NOTE: this particular site has a mix of styles inline with the HTML (i.e., they did not always use CSS to set the style properties; sometimes the styles are inside the HTML content).
This should get you started.
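If you then want to read the styles themselves, a short follow-up sketch (reusing the session and the css_files list built above) could fetch each stylesheet's text:
# fetch each stylesheet found above and preview its rules
for css_url in css_files:
    css_text = session.get(css_url).text
    print(css_url, '->', css_text[:200])  # first 200 characters as a preview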

Python // Requests // ASP.net // No permission to access

I'm still learning this, but it's the first time I've seen a website respond, when I use the requests module in Python, that I have no permission to access it.
My code should only get data from the site, and that's all.
import requests
from bs4 import BeautifulSoup
url_siemens_part = "https://mall.industry.siemens.com/mall/en/WW/Catalog/Product/5SY6310-7"
with requests.session() as sr:
    partUrl = sr.get(url_siemens_part)
    soup = BeautifulSoup(partUrl.content, 'html.parser')
    print(soup)
The response I get from this:
<html><head>
<title>Access Denied</title>
</head><body>
<h1>Access Denied</h1>
You don't have permission to access "http://mall.industry.siemens.com/mall/en/WW/Catalog/Product/5SY6310-7" on this server.<p>
Reference #18.36d61202.1596089808.1cc0ef55
</p></body>
</html>
The website is using ASP.NET. The site is visible from the Chrome browser, but not from requests.
Can you maybe show me a way? Is it a problem with authentication? Maybe I have to use .ASPXAUTH or ASP.NET_SessionId?
Thanks in advance for your time, and any answers.
Use a custom User-Agent HTTP header to obtain the correct response:
import requests
from bs4 import BeautifulSoup
url_siemens_part = "https://mall.industry.siemens.com/mall/en/WW/Catalog/Product/5SY6310-7"
with requests.session() as sr:
    sr.headers.update({'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'})
    partUrl = sr.get(url_siemens_part)
    soup = BeautifulSoup(partUrl.content, 'html.parser')
    print(soup)
Prints:
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8"/>
<meta content="IE=10" http-equiv="X-UA-Compatible"/>
... and so on.
You can use requests-html. If you don't have the library, you can install it first: pip install requests-html
from bs4 import BeautifulSoup
from requests_html import HTMLSession
url_siemens_part = "https://mall.industry.siemens.com/mall/en/WW/Catalog/Product/5SY6310-7"
sr = HTMLSession()
partUrl = sr.get(url_siemens_part)
soup = BeautifulSoup(partUrl.content,'html.parser')
print(soup)
With login everything is good :) I can download all the data, but there's a problem when I have something like below.
price_catalog = soup.find_all("td", class_="priceDetailsListPrice")
After making the soup I need to find some values, written as find_all on "td". I get this output:
[<td class="priceDetailsListPrice">244,86 EUR
</td>]
Is there some other way than writing a "for" loop like:
for price_catalog in price_catalog:
    output = price_catalog.text
I think a "for" loop is too much for a single value :(
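If you only need the first match, you can skip the loop: find() returns a single element (or None) instead of a list, so a sketch like this avoids find_all() entirely:
cell = soup.find("td", class_="priceDetailsListPrice")
if cell is not None:
    output = cell.get_text(strip=True)  # e.g. '244,86 EUR'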

python requests only returning empty sets when scraping .htm page

I am attempting to scrape a .htm link and cannot get my script to return anything besides '[]'.
link = https://www.forwardhealth.wi.gov/WIPortal/StaticContent/Member/caseloads/481-caseload.htm
import requests
from bs4 import BeautifulSoup as bs
link = 'https://www.forwardhealth.wi.gov/WIPortal/StaticContent/Member/caseloads/481-caseload.htm'
headers = {'User-Agent':'Mozilla/5.0'}
r = requests.get(link, headers=headers)
soup = bs(r.text, 'lxml') #I've tried other html parsers in here as well as r.content
I believe the issue lies in my attempt to interact with the page (possibly incorrect encoding?). The above format is how I've always set up any web-scraping performed in the past and haven't had any issues that I couldn't address. What stands out the most is when I call r.content or r.text I get a response that seems foreign:
'<HTML>\r\n<HEAD>\r\n<TITLE>481-caseload</TITLE>\r\n<META NAME="GENERATOR" CONTENT="Microsoft FrontPage 5.0">\r\n<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">\r\n</HEAD>\r\n\r\n<FRAMESET ROWS="*,48" FRAMESPACING="0" FRAMEBORDER="no" BORDER="0">\r\n<FRAME NAME="ReportArea" SRC="481-caseload/by_county_tribe/by_county_tribe.htm"\r\n MARGINWIDTH="0" MARGINHEIGHT="0" SCROLLING="no" FRAMEBORDER="0" NORESIZE>\r\n<FRAMESET COLS="*" FRAMESPACING="0" FRAMEBORDER="0" BORDER="0">\r\n<FRAME NAME="ReportLinks" SRC="481-caseload/DocLinks.htm" FRAMEBORDER="0" MARGINWIDTH="2" MARGINHEIGHT="3" scrolling="auto">\r\n</FRAMESET></FRAMESET></HTML>'
This makes me think that my script isn't properly written to handle whatever this is above. I've never seen "Microsoft FrontPage 5.0" before and don't know if that might be what's throwing off my code. I've tried forcing an encoding by changing r.encoding = #encoding here. Any guidance would be helpful.
This is because the page consists of multiple nested frames (a <frameset>): basically, separate pages with their own URLs that the browser loads when the main "container" page is loaded. Use the browser developer tools to inspect the page and see which frame your desired content is located in.
The main content of this page comes from this URL:
In [1]: import requests
In [2]: from bs4 import BeautifulSoup
In [3]: url = "https://www.forwardhealth.wi.gov/WIPortal/StaticContent/Member/caseloads/481-caseload/by_county_tribe/0.htm"
In [4]: response = requests.get(url)
In [5]: soup = BeautifulSoup(response.content, "lxml")
In [6]: soup.select_one("td.s2").get_text()
Out[6]: 'Wisconsin Medicaid'
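Rather than hard-coding the inner URL, you can discover it programmatically. Here is a sketch (my own, not from the original answer) that parses the outer frameset page, pulls each frame's src, and resolves it against the page URL; nested framesets may need another round of the same:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

outer = 'https://www.forwardhealth.wi.gov/WIPortal/StaticContent/Member/caseloads/481-caseload.htm'
r = requests.get(outer, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(r.text, 'lxml')
# each <frame> tag carries the relative URL of a nested page
frame_urls = [urljoin(outer, f['src']) for f in soup.find_all('frame') if f.get('src')]
print(frame_urls)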

Programmatically opening a web page gives unexpected results

I'm trying to get information from this site:
http://www.gocrimson.com/sports/mbkb/2011-12/roster
If you look at that page in a browser, you see a nice <table> that contains all the player info, with the coach's info below it.
When I pull that page into a Python program (using urllib2) or a Ruby program (using Nokogiri), the table is represented as a bunch of div elements. I thought there might be some JavaScript running, so I disabled JavaScript in my browser and revisited the page. It still loads up with the tables in place.
If I use Selenium to pull in the page source, I do get the table format.
Any idea on why the page comes in with the divs?
Python:
page = urllib2.urlopen(url)
html = page.read()
Output of print html (I put one of the divs on the last line to draw attention to it; that is a tr in the browser page. Shortened to stay under the character limit):
'\t\t\t\r\n\t\t\r\n\t\t\r\n\t\t\r\n\r\n\r\n\r\n\r\n\r\n\t\t\t\t\r\n\r\n\r\n<?xml version="1.0" encoding="iso-8859-1"?>\r\n<!DOCTYPE html PUBLIC "-//WAPFORUM//DTD XHTML Mobile 1.0//EN" "http://www.wapforum.org/DTD/xhtml-mobile10.dtd">\r\n<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="application/xhtml+xml; charset=iso-8859-1"/> <meta name="viewport" content="width=device-width,minimum-scale=1.0,maximum-scale=1.0"/>\r\n<meta forua="true" http-equiv="Cache-Control" content="must-revalidate" />\r\n<meta http-equiv="Pragma" content="no-cache, must-revalidate" />\r\n
<title>The Official Website of Harvard University Athletics: Harvard Athletics - GoCrimson.com : Men\'s Basketball - 2011-12 Roster </title>\r\n<link rel="stylesheet" href="/info/mobile/mobile.css" type="text/css"></link>\r\n<link rel="stylesheet" href="/mobile-overwrite.css" type="text/css"></link>\r\n</head>\r\n
<body class="classic">\r\n\r\n\r\n\t<strong>News</strong>\r\n | \r\n\tScores\r\n<br /><br />\r\n\r\n<p class="goBack-link"><<< Back</p>\r\n\r\n\r\n<div class="roster ">\r\n\t\t\t<div class="title">Men\'s Basketball - 2011-12 Roster</div>\r\n\t\t<div class="table">\r\n\t\t<div class="titles">\r\n\t\t\t
<div class="number">No.</div>\r\n\t\t\t<div class="name">Name</div>\r\n\t\t\t<div class="positions">Position</div>\r\n\t\t</div>\r\n\t\t\r\n\t\t\t\t\t<div class="item even clearfix">\r\n\t\t\t\t<div class="data">\r\n\t\t\t\t\t<div class="number">\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t3\r\n\t\t\t\t\t\t\t\t\t\t\t</div>\r\n\t\t\t\t\t<div class="name">
ruby:
require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(open("http://www.google.com/search?q=doughnuts"))
doc.css('tr').each do |node|
  puts node.text
end
finds no trs, but
doc.css('div').each do |node|
  puts node.text
end
finds the divs
I was able to get a <table> instead of divs by adding User-Agent headers. Specifically I pretended to be a known popular browser.
opener = urllib2.build_opener()
opener.addheaders = [('User-agent',
                      ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_7) '
                       'AppleWebKit/535.1 (KHTML, like Gecko) '
                       'Chrome/13.0.782.13 Safari/535.1'))]
response = opener.open('http://www.gocrimson.com/sports/mbkb/2011-12/roster')
print response.readlines() # divs are now a table

Beautiful Soup Page Source Error

I am trying to fetch the HTML source from this URL:
http://books.google.com/books?id=NZlV0M5Ije4C&dq=isbn:0470284889
I used the following code:
#!/usr/bin/env python
import urllib2
from bs4 import BeautifulSoup

def getPageSoup(address):
    request = urllib2.Request(address, None, {'User-Agent': 'Mozilla/5.0 (compatible; MSIE 6.0; Windows NT 5.1)'})
    urlfile = urllib2.urlopen(request)
    page = urlfile.read()
    urlfile.close()
    print 'soup has been obtained!'
    return BeautifulSoup(page)

address = 'http://books.google.com/books?id=NZlV0M5Ije4C&dq=isbn:0470284889'
soup2 = getPageSoup(address)
metadata = soup2.findAll("metadata_row")  # this content is present when viewing from the web browser
However, the HTML source from soup2 hardly looks like the source of the Google Books page:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN""http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html><head><title>Quantitative Trading: How to Build Your Own Algorithmic Trading Business - Ernie Chan - Google Books</title><script>(function(){function a(c){this.t={};this.tick=function(c,e,b){b=void 0!=b?b:(new Date).getTime();this.t[c]=[b,e]};this.tick("start",null,c)}var d=new a;window.jstiming={Timer:a,load:d};try{var f=null;window.chrome&&window.chrome.csi&&(f=Math.floor(window.chrome.csi().pageT));null==f&&window.gtbExternal&&(f=window.gtbExternal.pageT());null==f&&window.external&&(f=window.external.pageT);f&&(window.jstiming.pt=f)}catch(g){};})();
</script><link href="/books/css/_9937a87cb2905e754d8d5e36995f224d/kl_about_this_book_kennedy_full_bundle.css" rel="stylesheet" type="text/css"/></head></html>
HTML source from urllib2 and my web browser are very different. How can I get the correct page source?
Thanks!
It is the correct page source. All the visible content of the page is generated by JavaScript, so it's impossible to fetch the actual content using urllib. You should use a browser extension, WebKit bindings, or something like that.
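If you need the rendered DOM, one common approach (a sketch assuming a real browser plus Selenium is acceptable; it is not from the original answer) is to let the browser execute the JavaScript and hand the rendered source to BeautifulSoup:
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Firefox()  # or webdriver.Chrome(); requires the matching driver binary
driver.get('http://books.google.com/books?id=NZlV0M5Ije4C&dq=isbn:0470284889')
soup = BeautifulSoup(driver.page_source, 'html.parser')  # DOM after JavaScript has run
driver.quit()
print(soup.title)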
