How can I use Python to get rendered ASP pages?

I'm trying to use Python to navigate a website that has auth forms on its landing page, rendered by ASP scripts.
But when I use Python (with mechanize, requests, or urllib) to get the HTML of that site, I always end up with a semi-blank HTML file, apparently because of those ASP scripts.
Would anyone know a method I can use to get the final (as displayed in a browser) version of an ASP site?

Your target page is a frameset. There is nothing fancy going on from the server side that I can tell. When I use requests or urllib to download it, even sending no headers at all, I get exactly the same HTML that I see in Chrome or Firefox. There is some embedded JS, but it doesn't do anything. Basically, all there is here is a frameset with a single frame in it.
The frame target is also a perfectly normal page with nothing fancy going on from the server side that I can tell. Again, if I fetch it with no headers, I get the exact same contents as in Chrome or Firefox. There is plenty of embedded JS here, but it's not building the DOM from scratch or anything; the static contents that I get from the server have the whole page contents in them. I can strip out all the JS and render it, and it looks exactly the same.
There is a minor problem that neither the server nor the HTML specifies a charset anywhere, and yet the contents aren't ASCII, which means you need to guess what charset to decode if you want to process it as Unicode. But if you're in Python 2.x, and just planning to grab things out of the DOM by ID or something, that won't matter.
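If you do want Unicode, here's a quick sketch of guessing the encoding with the chardet library (the URL is a placeholder, and chardet is just one detector you could use):

    import chardet  # third-party: pip install chardet
    import requests

    resp = requests.get("http://example.com/frame.asp")  # hypothetical URL
    raw = resp.content                                   # undecoded bytes

    # The server sends no charset, so guess one from the bytes themselves.
    guess = chardet.detect(raw)
    html = raw.decode(guess["encoding"] or "latin-1")    # fall back to latin-1
    print(guess)  # e.g. {'encoding': 'windows-1252', 'confidence': 0.73, ...}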
I suspect your real problem is just that you don't know how HTML framesets work. You're downloading the frameset, not downloading the referenced frame, and wondering why the resulting page looks like an empty frameset.
Frames are an obsolete feature that nobody uses anymore for anything but a common trick for letting the user pop up a new window even in ancient browsers, and some obscure tricks for fooling popup blockers. In HTML 5 they're finally gone. But as long as ancient websites are out there and need to be scraped, you need to know how they work.
This isn't a substitute for the full documentation, but here's the short version of what a web browser does with a frameset: For each frame tag, it follows the src attribute, then it replaces the contents of the frame tag with a #document tag with no attributes, with the results of reading the src URL as its contents. Beyond that, of course, frames affect layout, but that probably doesn't affect you.
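In other words, to scrape such a page you can do by hand what the browser does: fetch the frameset, pull out each frame's src, and fetch that. A minimal sketch with requests and BeautifulSoup, using a placeholder URL:

    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    frameset_url = "http://example.com/login.asp"  # hypothetical target
    frameset = requests.get(frameset_url)

    soup = BeautifulSoup(frameset.text, "html.parser")
    for frame in soup.find_all("frame"):
        # A frame's src may be relative, so resolve it against the frameset URL.
        src = frame.get("src")
        if not src:
            continue
        frame_url = urljoin(frameset_url, src)
        inner = requests.get(frame_url)
        print(frame_url, len(inner.text))  # the real page contents live here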
Meanwhile, if you're trying to learn web scraping, you really want to install your browser's "Web Developer Tools" (different browsers have different names), or a full-on debugger like Firebug. That way, you can inspect the live tree that your browser is rendering, and compare it to what you get from your script (or, more simply, from wget). So, next time you can say "In Chrome's Inspect Page, I see a #document under the frame, with a whole bunch of stuff underneath that, but when I try to read the same page myself, the frame has no children".

Related

How to read an HTML page that takes some time to load? [duplicate]

I am trying to scrape a web site using Python and Beautiful Soup. I have found that on some sites the image links, although visible in the browser, cannot be seen in the source code. However, using Chrome Inspect or Fiddler, we can see the corresponding markup.
What I see in the source code is:
<div id="cntnt"></div>
But on Chrome Inspect, I can see a whole bunch of HTML/CSS code generated within this div. Is there a way to load the generated content within Python as well? I am using the regular urllib in Python and am able to get the source, but without the generated part.
I am not a web developer, so I may not be expressing the behaviour in the best terms. Please feel free to ask for clarification if my question seems vague!
You need a JavaScript engine to parse and run the JavaScript code inside the page.
There are a bunch of headless browsers that can help you:
http://code.google.com/p/spynner/
http://phantomjs.org/
http://zombie.labnotes.org/
http://github.com/ryanpetrello/python-zombie
http://jeanphix.me/Ghost.py/
http://webscraping.com/blog/Scraping-JavaScript-webpages-with-webkit/
The content of the website may be generated after load via JavaScript. To obtain the generated content via Python, refer to this answer.
A regular scraper gets just the HTML document. To get any content generated by JavaScript logic, you instead need a headless browser that will also build the DOM and load and run the scripts, like a regular browser would. The Wikipedia article and some other pages on the net have lists of those and their capabilities.
Keep in mind when choosing that some previously major products in this space are now abandoned.
TRY THIS FIRST!
Perhaps the data technically could be in the JavaScript itself, and all this JavaScript-engine business is needed. (Some GREAT links here!)
But from experience, my first guess is that the JS is pulling the data in via an AJAX request. If you can get your program to simulate that request, you'll probably get everything you need handed right to you without any tedious parsing/executing/scraping involved!
It will take a little detective work, though. I suggest turning on your network traffic logger (such as the "Web Developer Toolbar" in Firefox) and then visiting the site. Focus your attention on any and all XmlHTTPRequests. The data you need should be found somewhere in one of these responses, probably in the middle of some JSON text.
Now, see if you can re-create that request and get the data directly. (NOTE: You may have to set the User-Agent of your request so the server thinks you're a "real" web browser.)
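As a sketch of that approach (the endpoint URL and the JSON shape are invented; substitute whatever you actually find in your network log):

    import requests

    # Pretend to be a regular browser; some servers reject unknown clients.
    headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"}

    # Hypothetical endpoint copied from an XmlHTTPRequest seen in the network log.
    url = "http://example.com/api/content"

    resp = requests.get(url, params={"id": "cntnt"}, headers=headers)
    resp.raise_for_status()

    data = resp.json()  # most such endpoints return JSON
    print(data)         # the fields you'd otherwise scrape out of the rendered DOM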

How to scrape a value from a page that loads dynamically?

The homepage of the website I'm trying to scrape displays four tabs, one of which reads "[Number] Available Jobs". I'm interested in scraping the [Number] value. When I inspect the page in Chrome, I can see the value enclosed within a <span> tag.
However, there is nothing enclosed in that <span> tag when I view the page source directly. I was planning on using the Python requests module to make an HTTP GET request and then use regex to capture the value from the returned content. This is obviously not possible if the content doesn't contain the number I need.
My questions are:
1. What is happening here? How can a value be dynamically loaded into a page, displayed, and then not appear within the HTML source?
2. If the value doesn't appear in the page source, what can I do to reach it?
If the content doesn't appear in the page source, then it is probably generated using JavaScript. For example, the site might have a REST API that lists jobs, and the JavaScript code could request the jobs from that API and use the result to create the node in the DOM and attach it to the page. That's just one possibility.
One way to scrape this information is to figure out how that JavaScript works and make your Python scraper do the same thing (for example, if it is using a simple REST API, you just need to make a request to that same URL). Often that is not so easy, so the alternative is to do your scraping with a JavaScript-capable browser driven by Selenium, as sketched below.
One final thing I want to mention: regular expressions are a fragile way to parse HTML; you should generally prefer a library like BeautifulSoup.
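A minimal sketch of the Selenium route (the URL and the span's class name are placeholders; point them at your actual page and element):

    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # or Firefox(); needs the matching driver binary
    try:
        driver.get("http://example.com/jobs")  # hypothetical page
        # Wait until the dynamically loaded span actually exists in the DOM.
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "span.available-jobs"))
        )
        # page_source reflects the DOM *after* the scripts have run.
        soup = BeautifulSoup(driver.page_source, "html.parser")
        span = soup.find("span", class_="available-jobs")  # hypothetical selector
        print(span.get_text(strip=True))
    finally:
        driver.quit()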
1. A value can be loaded dynamically with AJAX. AJAX loads asynchronously, which means the rest of the site does not wait for the AJAX content to render; that's why, when you fetch the page, the elements loaded with AJAX do not appear in the DOM you get back.
2. For scraping dynamic content you should use Selenium; here's a tutorial.
For data that loads dynamically, you should look for an XHR request in the network tab, and if you can make that data work for you, then voilà!
You can use PhantomJS; it's a headless browser, and it captures the HTML of the page with the dynamically loaded content included.

Handling responses with Python bindings for Selenium

So I have a webpage with some JavaScript that gets executed when a link is clicked. This JavaScript opens a new window and calls some other JavaScript which requests an XML document, which it then parses for a URL to pass to a video player. How can I get that XML response using Selenium?
Short answer: unless the XML is posted to the page, you can't. Long answer: you can use Selenium to do JS injection on the page so that the XML document is replicated to some hidden page element you can inspect, or stored to a local file that you can open. This is, of course, assuming that the XML document is actually retrieved client side; if this is all server side, you'll need to integrate with the backend or emulate the call yourself. One last option to explore would be to proxy the browser Selenium is driving, then inspect the traffic for the response containing the XML. Though more complicated, that could be argued to be the best solution, since you aren't modifying the system under test in order to test it.
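A rough sketch of the injection idea (the page URL, the link selector, and the hidden-element id are all made up, and a page that fetches the XML from a newly opened window would escape this simple monkey-patch):

    import time

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()
    driver.get("http://example.com/player")  # hypothetical page

    # Monkey-patch XMLHttpRequest before clicking, so every response body is
    # copied into a hidden <div> that Python can read back later.
    driver.execute_script("""
        var hidden = document.createElement('div');
        hidden.id = '__xhr_capture';
        hidden.style.display = 'none';
        document.body.appendChild(hidden);

        var origOpen = XMLHttpRequest.prototype.open;
        XMLHttpRequest.prototype.open = function() {
            this.addEventListener('load', function() {
                hidden.textContent = this.responseText;
            });
            return origOpen.apply(this, arguments);
        };
    """)

    driver.find_element(By.CSS_SELECTOR, "a.play-link").click()  # hypothetical link
    time.sleep(2)  # crude; a WebDriverWait on the div's contents would be more robust
    xml_text = driver.find_element(By.ID, "__xhr_capture").get_attribute("textContent")
    print(xml_text)
    driver.quit()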

Parsing from a website -- source code does not contain the info I need

I'm a little new to web crawlers and such, though I've been programming for a year already. So please bear with me as I try to explain my problem here.
I'm parsing info from Yahoo! News, and I've managed to get most of what I want, but there's a little portion that has stumped me.
For example: http://news.yahoo.com/record-nm-blaze-test-forest-management-225730172.html
I want to get the numbers beside the thumbs-up and thumbs-down icons in the comments. When I use "Inspect Element" in my Chrome browser, I can clearly see what I have to look for: an em tag under the div class 'ugccmt-rate'. However, I'm not able to find this in my Python program. In trying to track down the root of the problem, I clicked to view the source of the page, and it seems that this tag is not there. Do you guys know how I should approach this problem? Does this have something to do with the JavaScript on the page that displays the info only after it runs? I'd appreciate some pointers in the right direction.
Thanks.
The page is being generated via JavaScript.
Check if there is a mobile version of the website first. If not, check for any APIs or RSS/Atom feeds. If there's nothing else, you'll either have to manually figure out what the JavaScript is loading and from where, or use Selenium to automate a browser that renders the JavaScript for you to parse.
Using the Web Console in Firefox you can pretty easily see what requests the page is actually making as it runs its scripts, and figure out what URI returns the data you want. Then you can request that URI directly in your Python script and tease the data out of it. It is probably in a format that Python already has a library to parse, such as JSON.
Yahoo! may have some stuff on their server side to try to prevent you from accessing these data files in a script, such as checking the browser (user-agent header), cookies, or referrer. These can all be faked with enough perseverance, but you should take their existence as a sign that you should tread lightly. (They may also limit the number of requests you can make in a given time period, which is impossible to get around.)
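A hedged sketch of fetching such a data URI while presenting browser-like headers (the endpoint, header values, and cookie are all placeholders standing in for values you'd copy from a real browser session; the throttling interval is arbitrary):

    import time

    import requests

    # Placeholder values; copy the real ones from your browser's network log.
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Referer": "http://news.yahoo.com/record-nm-blaze-test-forest-management-225730172.html",
    }
    cookies = {"session": "value-copied-from-your-browser"}  # hypothetical

    comment_api = "http://news.yahoo.com/_api/comments"      # hypothetical URI

    for page in range(1, 4):
        resp = requests.get(comment_api, params={"page": page},
                            headers=headers, cookies=cookies)
        print(resp.json())   # the ratings data, if the endpoint returns JSON
        time.sleep(5)        # be polite: stay under any rate limit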

How do I screen capture a URL with PIL?

I'm trying to write a server process that will allow you to enter a URL, then every 30 min ping that URL and capture it as an image. Is this possible with a combination of something like CURL, urllib2 and PIL?
Curl, urllib2, etc., grab the HTML code for a web page. But a page doesn't look like anything on its own. Instead, a browser uses that code and renders a web page according to its own internal rules of how that code should be used. And, of course, each browser renders the page slightly differently.
In other words, you can't take a snapshot of a page without having a web browser to generate the page to take the snapshot of.
If you're feeling very ambitious, you can create your own custom, scriptable page renderer by using the rendering engine from the browser of your choice -- they all make the rendering engine a separate component that you can work with separately. IE's is called "Trident", Firefox's is called "Gecko", Chrome's is "WebKit", etc.
Otherwise you'll want to just do some sort of UI scripting, like you might do with iOpus or Selenium. Selenium is scriptable with Python, so that's one for you right there.
EDIT
Here you go. That looks pretty simple.
PIL's ImageGrab module can be used to take a screenshot on Windows. However, you can't do this purely with curl, urllib2 and PIL, because you have to render the web site. The easiest approach would probably be to open the website in a browser and grab a screenshot.
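For instance, a minimal sketch that uses Selenium to render and capture the page, with PIL to post-process the capture (the URL is a placeholder; the 30-minute interval comes from the question):

    import time

    from PIL import Image
    from selenium import webdriver

    URL = "http://example.com"   # hypothetical target
    INTERVAL = 30 * 60           # every 30 minutes, per the question

    driver = webdriver.Firefox()
    try:
        while True:
            driver.get(URL)
            driver.save_screenshot("capture.png")  # the page as a browser renders it
            img = Image.open("capture.png")        # now PIL can crop, resize, etc.
            img.thumbnail((800, 600))
            img.save("capture_thumb.png")
            time.sleep(INTERVAL)
    finally:
        driver.quit()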
