Programmatically opening a web page gives unexpected results - Python

I'm trying to get information from this site:
http://www.gocrimson.com/sports/mbkb/2011-12/roster
If you look at that page in a browser, you see a nice <table> that contains all the player info, with the coach's info below it.
When I pull that page into a Python program (using urllib2) or a Ruby program (using Nokogiri), the table is represented as a bunch of div elements instead. I thought there might be some JavaScript running, so I disabled JavaScript in my browser and revisited the page. It still loads up with the tables in place.
If I use Selenium to pull in the page source, I do get the table format.
Any idea why the page comes back with the divs?
Python:
import urllib2

url = 'http://www.gocrimson.com/sports/mbkb/2011-12/roster'
page = urllib2.urlopen(url)  # uses urllib2's default User-Agent
html = page.read()
The html output (I put one of the divs on the last line to draw attention to it; that element is a tr in the browser page. Shortened to stay under the character limit):
'\t\t\t\r\n\t\t\r\n\t\t\r\n\t\t\r\n\r\n\r\n\r\n\r\n\r\n\t\t\t\t\r\n\r\n\r\n<?xml version="1.0" encoding="iso-8859-1"?>\r\n<!DOCTYPE html PUBLIC "-//WAPFORUM//DTD XHTML Mobile 1.0//EN" "http://www.wapforum.org/DTD/xhtml-mobile10.dtd">\r\n<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="application/xhtml+xml; charset=iso-8859-1"/> <meta name="viewport" content="width=device-width,minimum-scale=1.0,maximum-scale=1.0"/>\r\n<meta forua="true" http-equiv="Cache-Control" content="must-revalidate" />\r\n<meta http-equiv="Pragma" content="no-cache, must-revalidate" />\r\n
<title>The Official Website of Harvard University Athletics: Harvard Athletics - GoCrimson.com : Men\'s Basketball - 2011-12 Roster </title>\r\n<link rel="stylesheet" href="/info/mobile/mobile.css" type="text/css"></link>\r\n<link rel="stylesheet" href="/mobile-overwrite.css" type="text/css"></link>\r\n</head>\r\n
<body class="classic">\r\n\r\n\r\n\t<strong>News</strong>\r\n | \r\n\tScores\r\n<br /><br />\r\n\r\n<p class="goBack-link"><<< Back</p>\r\n\r\n\r\n<div class="roster ">\r\n\t\t\t<div class="title">Men\'s Basketball - 2011-12 Roster</div>\r\n\t\t<div class="table">\r\n\t\t<div class="titles">\r\n\t\t\t
<div class="number">No.</div>\r\n\t\t\t<div class="name">Name</div>\r\n\t\t\t<div class="positions">Position</div>\r\n\t\t</div>\r\n\t\t\r\n\t\t\t\t\t<div class="item even clearfix">\r\n\t\t\t\t<div class="data">\r\n\t\t\t\t\t<div class="number">\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t3\r\n\t\t\t\t\t\t\t\t\t\t\t</div>\r\n\t\t\t\t\t<div class="name">
Ruby:
require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(open("http://www.gocrimson.com/sports/mbkb/2011-12/roster"))
doc.css('tr').each do |node|
  puts node.text
end
finds no trs, but
doc.css('div').each do |node|
  puts node.text
end
finds the divs

I was able to get a <table> instead of divs by adding a User-Agent header; specifically, I pretended to be a well-known browser. The mobile DOCTYPE and mobile.css references in the output above suggest the server sniffs the User-Agent and serves a mobile, div-based version of the page to clients it doesn't recognize.
import urllib2

opener = urllib2.build_opener()
opener.addheaders = [('User-agent',
                      ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_7) '
                       'AppleWebKit/535.1 (KHTML, like Gecko) '
                       'Chrome/13.0.782.13 Safari/535.1'))]
response = opener.open('http://www.gocrimson.com/sports/mbkb/2011-12/roster')
print response.readlines()  # the divs are now a table
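The same idea works with the requests library, if it's available. A minimal sketch, assuming the site still serves the desktop markup when it sees a browser-like User-Agent:

import requests

url = 'http://www.gocrimson.com/sports/mbkb/2011-12/roster'
headers = {
    'User-Agent': ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_7) '
                   'AppleWebKit/535.1 (KHTML, like Gecko) '
                   'Chrome/13.0.782.13 Safari/535.1')
}
response = requests.get(url, headers=headers)
print(response.text[:500])  # should contain <table> markup rather than the mobile divs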

Related

Selenium WebDriver (Chrome): elements differ from those of driver.page_source

I tried to scrape an article on Medium,
but it failed because driver.page_source doesn't contain the target div.
For example: Demystifying Python Decorators in Less Than 10 Minutes, https://medium.com/#adrianmarkperea/demystifying-python-decorators-in-10-minutes-ffe092723c6c
On this page the content-holder div's class is "x y z ab ac ez af ag", but that element doesn't show up in driver.page_source.
Short code below.
It is NOT a timeout problem.
It seems like driver.page_source has not been processed by JavaScript, but I'm not sure.
from selenium import webdriver
from bs4 import BeautifulSoup

ARTICLE = "https://medium.com/#adrianmarkperea/demystifying-python-decorators-in-10-minutes-ffe092723c6c"
driver = webdriver.Chrome()
driver.get(ARTICLE)
text_soup = BeautifulSoup(driver.page_source, "html5lib")
text = text_soup.select(".x.y.z.ab.ac.ez.af.ag")
print(text)  # => []
I expected the output of driver.page_source to match what the Chrome developer tools show in the Elements panel.
Update: I did an experiment.
I suspected WebDriver wasn't getting the HTML source as processed by JavaScript, so I ran the HTML file below through Selenium.
But I got back the HTML with the element removed.
Result:
WebDriver and the ordinary Chrome console are the same -> processed:
<html lang="en">
<body>
<script type="text/javascript">
document.querySelector("#id").remove();
</script>
</body></html>
wget / requests -> not processed:
<html lang="en">
<body>
<div id="id">
test element
</div>
<script type="text/javascript">
document.querySelector("#id").remove();
</script>
</body></html>
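Separately, Medium's obfuscated class names ("x y z ab ...") are generated per build, so selecting on them is fragile even when the content is present. Below is a hedged sketch that waits for the article body to render before reading page_source; the article tag and the 10-second timeout are assumptions, not taken from the question:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

ARTICLE = "https://medium.com/#adrianmarkperea/demystifying-python-decorators-in-10-minutes-ffe092723c6c"

driver = webdriver.Chrome()
driver.get(ARTICLE)

# Wait until the article body is present in the DOM before reading page_source.
# The "article" tag is an assumption; pick whatever stable element the page has.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, "article"))
)

soup = BeautifulSoup(driver.page_source, "html5lib")
article = soup.find("article")
print(article.get_text()[:200] if article else "article tag not found")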

Why does Python 3 urllib redirect to Yahoo?

I am using urlopen from urllib.request in Python 3.5.1 (64-bit version on Windows 10) to load content from www.wordreference.com for a French project. Somehow, whenever I request anything beyond the domain root itself, the page content is instead loaded from yahoo.com.
Here, I print the first 350 characters from http://www.wordreference.com:
>>> from urllib import request
>>> page = request.urlopen("http://www.wordreference.com")
>>> content = page.read()
>>> print(content.decode()[:350])
<!DOCTYPE html>
<html lang="en">
<head>
<title>English to French, Italian, German & Spanish Dictionary -
WordReference.com</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="description" content="Free online dictionaries - Spanish, French,
Italian, German and more. Conjugations, audio pronunciations and
Next, I requested a specific document on the domain:
>>> page = request.urlopen("http://www.wordreference.com/enfr/test")
>>> content = page.read()
>>> print(content.decode()[:350])
<!DOCTYPE html>
<html id="atomic" lang="en-US" class="atomic my3columns l-out Pos-r https fp
fp-v2 rc1 fp-default mini-uh-on viewer-right ltr desktop Desktop bkt201">
<head>
<title>Yahoo</title><meta http-equiv="x-dns-prefetch-control" content="on"
<link rel="dns-prefetch" href="//s.yimg.com"><link rel="preconnect"
href="//s.yimg.com"><li
The last request takes about six seconds longer to read (which could be my slow internet) and the content comes straight from http://www.yahoo.com/. I can access the above URLs fine in a web browser.
Why is this happening? Is this something related to Windows 10? I have tried this on other domains and the problem does not occur.
I tried the following code and it works:
import requests

page = requests.get("http://www.wordreference.com/enfr/test")
content = page.text
print(content[:350])  # prints the WordReference page rather than Yahoo
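One likely difference between the two is the User-Agent header: requests identifies itself as python-requests, while urllib sends Python-urllib/3.5, which some sites treat specially. A hedged way to test this with the standard library alone is to set the header explicitly (the browser string below is just an example):

from urllib import request

req = request.Request(
    "http://www.wordreference.com/enfr/test",
    headers={"User-Agent": "Mozilla/5.0"}  # example browser-like User-Agent
)
page = request.urlopen(req)
content = page.read().decode("utf-8", errors="replace")
print(content[:350])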

Dynamic browsing using Python (Mechanize, BeautifulSoup...)

I am currently writing a Python parser to automatically extract some information from a website. I am using mechanize to browse the site, and I get the following HTML:
<html>
<head>
<title>
XXXXX
</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8; no-cache;" />
<link rel="shortcut icon" type="image/x-icon" href="/favicon.ico" />
<link rel="stylesheet" href="/rr/style_other.css" type="text/css" />
</head>
<frameset cols="*,370" border="1">
<frame src="affiche_cp.php?uid=yyyyyyy&type=entree" name="cdrg" />
<frame src="affiche_bp.php?uid=yyyyyyy&type=entree" name="cdrd" />
</frameset>
</html>
I want to access both frames:
in cdrd I must fill some forms and submit
in cdrg I will obtain the result of the submission
How can I do this?
Personally, I don't use BeautifulSoup for parsing HTML; I use PyQuery instead, which is similar, but I prefer its CSS selector syntax over XPath. I also use Requests to make HTTP requests.
That alone is enough to scrape data and submit forms, and it can do what you want. I understand this probably isn't the answer you're looking for, but it may well be useful to you.
Scraping Frames with PyQuery
import requests
import pyquery

response = requests.get('http://example.com')
dom = pyquery.PyQuery(response.text)
frames = dom('frame')    # both <frame> elements from the frameset
frame_one = frames[0]
frame_two = frames[1]
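The frame elements only give you the src URLs; to read what is inside a frame, you request that URL separately. A small follow-up sketch (example.com is a placeholder, as above):

from requests.compat import urljoin

# Each <frame> points to its own document; fetch it like any other page.
frame_one_url = urljoin('http://example.com', frame_one.get('src'))
frame_one_dom = pyquery.PyQuery(requests.get(frame_one_url).text)
print(frame_one_dom('form'))  # the forms to fill live in this frame's document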
Making HTTP Requests
import requests
response = requests.post('http://example.com/signup', data={
    'username': 'someuser',
    'password': 'secret'
})
response_text = response.text
data is a dictionary of the POST data to submit with the form. You should use Chrome's Network panel, Fiddler, or Burp Suite to monitor requests. While monitoring, manually submit both forms, then inspect the HTTP requests and recreate them using Requests.
Hope that helps a little. I work in this field, so if you require any more information feel free to hit me up.
The solution to my issue was to load the first frame directly and fill in the form on that page. Then I load the second frame, read it, and obtain the results associated with the form submitted in the first frame.
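In mechanize terms that means opening the frame src URLs directly instead of the frameset page, roughly like the sketch below; the host and the form field name are placeholders, not taken from the question:

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)

# Open the form frame (cdrd) directly, using its src from the frameset.
br.open('http://example.com/affiche_bp.php?uid=yyyyyyy&type=entree')
br.select_form(nr=0)             # first form on the page
br['some_field'] = 'some value'  # hypothetical field name
br.submit()

# Then open the result frame (cdrg) and read the outcome.
result = br.open('http://example.com/affiche_cp.php?uid=yyyyyyy&type=entree')
print(result.read())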

Extracting links from HTML table using BeautifulSoup with unclean source code

I am trying to scrape articles from a Chinese newspaper database. Here is some of the source code (pasting an excerpt because the site requires a key):
<base href="http://huylpd.twinbridge.com.ezp-prod1.hul.harvard.edu/web\" /><html>
<! -- <%# page contentType="text/html;charset=GBK" %>
<head>
<meta http-equiv="Content-Language" content="zh-cn">
<meta http-equiv="Content-Type" content="text/html; charset=gb2312">
<meta name="GENERATOR" content="Microsoft FrontPage 4.0">
<meta name="ProgId" content="FrontPage.Editor.Document">
<title>概览页面</title>
...
</head>
...
</html>
</html>
When I try to do some straightforward scraping of the links in the table like so:
import urllib, urllib2, re, mechanize
from BeautifulSoup import BeautifulSoup
br = mechanize.Browser(factory=mechanize.RobustFactory())
br.addheaders = [('User-agent', 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.6')]
br.set_handle_robots(False)
url = 'http://huylpd.twinbridge.com.ezp-prod1.hul.harvard.edu/search?%C8%D5%C6%DA=&%B1%EA%CC%E2=&%B0%E6%B4%CE=&%B0%E6%C3%FB=&%D7%F7%D5%DF=&%D7%A8%C0%B8=&%D5%FD%CE%C4=%B9%FA%BC%CA%B9%D8%CF%B5&Relation=AND&sortfield=RELEVANCE&image1.x=27&image1.y=16&searchword=%D5%FD%CE%C4%3D%28%B9%FA%BC%CA%B9%D8%CF%B5%29&presearchword=%B9%FA%BC%CA%B9%D8%CF%B5&channelid=16380'
page = br.open(url)
soup = BeautifulSoup(page)
links = soup.findAll('a') # links is empty =(
BeautifulSoup does not find anything in the HTML, i.e. it returns an empty list. I think this is because the source code starts with the base href tag, and the parser only recognizes two tags in the document: base href and html.
Any idea how to scrape the links in this case? Thank you so much!!
Removing the second line (the malformed JSP comment) made BeautifulSoup find all the tags. I didn't find a better way to parse this.
page = br.open(url)
page = page.read().replace('<! -- <%# page contentType="text/html;charset=GBK" %>', '')
soup = BeautifulSoup(page)
BeautifulSoup isn't really developed any longer; I would suggest you have a look at lxml.
I don't have access to that specific URL, but I was able to get this to work using the HTML fragment above (to which I added an a tag):
>>> import lxml.html
>>> soup = lxml.html.document_fromstring(u)  # u is the HTML fragment as a string
>>> soup.cssselect('a')
>>> soup.cssselect('a')[0].text_content()  # for example
When your HTML is very messed up, it's better to clean it up a little first; for instance, in this case, remove everything before <html> and everything after the (first) </html>. Download one page, mold it manually to see what is acceptable to BeautifulSoup, and then write some regexes to preprocess.
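For instance, a rough preprocessing pass along those lines (the exact regexes will depend on how the pages vary):

import re

def clean_html(raw):
    # Keep only the first <html> ... </html> span, which drops the stray
    # <base> prefix and the duplicated closing tag, then strip the
    # malformed JSP comment that confuses the parser.
    match = re.search(r'<html.*?</html>', raw, re.DOTALL | re.IGNORECASE)
    html = match.group(0) if match else raw
    return re.sub(r'<! -- .*?%>', '', html)

page = br.open(url)
soup = BeautifulSoup(clean_html(page.read()))
links = soup.findAll('a')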

Only Firefox displays HTML Code and not the page

I have this complicated problem that I can't find an answer to.
I have a Python HTTPServer running that serves web pages. These pages are created at runtime with the help of Beautiful Soup. The problem is that Firefox shows the HTML code for the page rather than the rendered page. I really don't know which of these is causing the problem:
- Python HTTPServer
- Beautiful Soup
- HTML Code
In any case, I have copied parts of the page's HTML:
<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title>
My title
</title>
<link href="style.css" rel="stylesheet" type="text/css" />
<script src="./123_ui.js">
</script>
</head>
<body>
<div>
Hellos
</div>
</body>
</html>
To help, here are the things that I have already tried:
- I have made sure that Python HTTPServer is sending the MIME header as text/html
- Just copying and pasting the HTML code into a file and opening it shows the correct page, since it's static. From this I can tell that the problem is on the HTTPServer side
- Firebug shows that the body is empty, and "This element has no style rules. You can create a rule for it." is displayed
I just want to know whether the error is in Beautiful Soup, the HTTPServer, or the HTML.
Thanks,
Amit
Why are you adding this at the top of the document?
<?xml version="1.0" encoding="iso-8859-1"?>
That will make the browser think the entire document is XML and not XHTML. Removing that line should make it render correctly. I assume Firefox is displaying a page with a bunch of elements which you can expand/collapse to see the content like it normally would for an XML document, even though the HTTP headers might say it's text/html.
So, I have finally solved this problem. The reason was that I wasn't actually sending the MIME header with content type "text/html" (even though I thought I was).
In the Python HTTPServer handler, before writing anything to wfile you should always send the status line and headers first:
self.send_response(200)
self.send_header("Content-Type", "text/html")
self.end_headers()
# Once the status line and headers have been sent, you can write the HTML to the client
self.wfile.write('ANY HTML CODE YOU WANT TO WRITE')
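For reference, here is a minimal, self-contained sketch of the same idea using Python 3's http.server; the page body and port are placeholders for illustration:

from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Send the status line and the Content-Type header before the body,
        # otherwise the browser may not render the HTML as a page.
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=iso-8859-1")
        self.end_headers()
        self.wfile.write(b"<html><body><div>Hellos</div></body></html>")

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), Handler).serve_forever()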
