I have an HTML page where some of the content is generated by JavaScript. I need to parse that content from a Python script. I have a saved copy of the file on my computer. Is there any way to work with the 'already generated' HTML, i.e. what I can see in the browser after opening the page file? As I understand it, I have to work with the DOM (maybe the xml2dom lib).
Have you saved "the file" (the web page, I imagine) before or after JavaScript altered it?
If "after", then it doesn't matter any more that some of the HTML was done via Javascript -- you can just use popular parsers like lxml or BeautifulSoup to handle the HTML you have.
If "before", then first you need to let Javascript do its work by automating a real browser; for that task, I would recommend SeleniumRC -- which brings you back to the "after" case;-).
I think you may have a fundamental misunderstanding with regard to what runs where: by the time JavaScript generates the content (on the client side), the server-side processing of the document has already taken place. There is no direct way for a server-side Python script to access HTML created by JavaScript. Basically, that HTML lives only "virtually", in the browser's DOM.
You would have to find a way to transmit that HTML to your Python script, most likely using Ajax. You would take the HTML and add it as a parameter to your Ajax call (remember to use POST as the request method so you don't run into size limitations).
An example using jQuery's AJAX functions:
$.ajax({
    url: "myscript.py",
    type: "POST",
    data: { html: your_html_content_here },
    success: function() {
        alert("sent HTML to python script!");
    }
});
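On the Python side, the receiving script just reads the POSTed html field. A minimal sketch (the answer names no framework, so Flask here is purely illustrative):

from flask import Flask, request  # pip install flask

app = Flask(__name__)

@app.route("/myscript.py", methods=["POST"])
def receive_html():
    # The jQuery call above POSTs the page's HTML in the "html" field.
    html = request.form.get("html", "")
    # ... parse it here with lxml or BeautifulSoup ...
    return "received %d bytes of HTML" % len(html)

if __name__ == "__main__":
    app.run()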
Please advise how to scrape AJAX pages.
Overview:
All screen scraping first requires a manual review of the page you want to extract resources from. When dealing with AJAX, you usually need to analyze a bit more than just the HTML.
When dealing with AJAX, this simply means that the value you want is not in the initial HTML document you requested; instead, JavaScript will be executed that asks the server for the extra information you want.
You can therefore usually just analyze the JavaScript, see which request it makes, and call that URL yourself from the start.
Example:
Take this as an example; assume the page you want to scrape from has the following script:
<script type="text/javascript">
function ajaxFunction()
{
var xmlHttp;
try
{
// Firefox, Opera 8.0+, Safari
xmlHttp=new XMLHttpRequest();
}
catch (e)
{
// Internet Explorer
try
{
xmlHttp=new ActiveXObject("Msxml2.XMLHTTP");
}
catch (e)
{
try
{
xmlHttp=new ActiveXObject("Microsoft.XMLHTTP");
}
catch (e)
{
alert("Your browser does not support AJAX!");
return false;
}
}
}
xmlHttp.onreadystatechange=function()
{
if(xmlHttp.readyState==4)
{
document.myForm.time.value=xmlHttp.responseText;
}
}
xmlHttp.open("GET","time.asp",true);
xmlHttp.send(null);
}
</script>
Then all you need to do is make an HTTP request to time.asp on the same server yourself. Example from w3schools.
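In Python, that direct request is a couple of lines; a minimal sketch (assuming the requests library is installed, with example.com standing in for the real host):

import requests  # pip install requests

# Call the endpoint the page's JavaScript would have called via XHR.
# "example.com" is a placeholder for the actual server hosting time.asp.
response = requests.get("http://example.com/time.asp")
print(response.text)  # the same payload the browser's AJAX call receives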
Advanced scraping with C++:
For complex usage, and if you're using C++, you could also consider using SpiderMonkey, Mozilla's JavaScript engine, to execute the JavaScript on a page.
Advanced scraping with Java:
For complex usage, and if you're using Java, you could also consider using Rhino, Mozilla's JavaScript engine for Java.
Advanced scraping with .NET:
For complex usage, and if you're using .NET, you could also consider using the Microsoft.Vsa assembly, recently replaced by ICodeCompiler/CodeDOM.
In my opinion the simplest solution is to use CasperJS, a framework based on the headless WebKit browser PhantomJS.
The whole page is loaded, and it's very easy to scrape any AJAX-related data.
You can check this basic tutorial to learn Automating & Scraping with PhantomJS and CasperJS
You can also take a look at this example code on how to scrape Google Suggest keywords:
/*global casper:true*/
var casper = require('casper').create();
var suggestions = [];
var word = casper.cli.get(0);

if (!word) {
    casper.echo('please provide a word').exit(1);
}

casper.start('http://www.google.com/', function() {
    this.sendKeys('input[name=q]', word);
});

casper.waitFor(function() {
    return this.fetchText('.gsq_a table span').indexOf(word) === 0;
}, function() {
    suggestions = this.evaluate(function() {
        var nodes = document.querySelectorAll('.gsq_a table span');
        return [].map.call(nodes, function(node) {
            return node.textContent;
        });
    });
});

casper.run(function() {
    this.echo(suggestions.join('\n')).exit();
});
If you can get at it, try examining the DOM tree. Selenium does this as a part of testing a page. It also has functions to click buttons and follow links, which may be useful.
The best way to scrape web pages that use AJAX, or pages that use JavaScript in general, is with a browser itself or a headless browser (a browser without a GUI). Currently PhantomJS is a well-promoted headless browser using WebKit. An alternative that I have used with success is HtmlUnit (in Java, or in .NET via IKVM), which is a simulated browser. Another known alternative is using a web automation tool like Selenium.
I wrote many articles about this subject, like web scraping Ajax and JavaScript sites and automated browserless OAuth authentication for Twitter. At the end of the first article there are a lot of extra resources that I have been compiling since 2011.
I like PhearJS, but that might be partially because I built it.
That said, it's a service you run in the background that speaks HTTP(S) and renders pages as JSON for you, including any metadata you might need.
It depends on the AJAX page. The first part of screen scraping is determining how the page works. Is there some sort of variable you can iterate through to request all the data from the page? Personally I've used Web Scraper Plus for a lot of screen-scraping tasks because it is cheap, it is not difficult to get started with, and non-programmers can get it working relatively quickly.
Side note: the Terms of Use is probably something you'll want to check before doing this. Depending on the site, iterating through everything may raise some flags.
I think Brian R. Bondy's answer is useful when the source code is easy to read. I prefer an easier way using tools like Wireshark or HttpAnalyzer to capture the packet and reconstruct the URL from the "Host" field and the "GET" field.
For example, I captured a packet like the following:
GET /hqzx/quote.aspx?type=3&market=1&sorttype=3&updown=up&page=1&count=8&time=164330 HTTP/1.1
Accept: */*
Referer: http://quote.hexun.com/stock/default.aspx
Accept-Language: zh-cn
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
Host: quote.tool.hexun.com
Connection: Keep-Alive
Then the URL is:
http://quote.tool.hexun.com/hqzx/quote.aspx?type=3&market=1&sorttype=3&updown=up&page=1&count=8&time=164330
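Once the URL is reconstructed from those two fields, fetching it from Python is straightforward; a minimal sketch (assuming the requests library is installed):

import requests  # pip install requests

# URL rebuilt from the captured "Host" and "GET" fields above.
url = ("http://quote.tool.hexun.com/hqzx/quote.aspx"
       "?type=3&market=1&sorttype=3&updown=up&page=1&count=8&time=164330")
response = requests.get(url)
print(response.text)  # the raw data the page's AJAX call would receive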
As a low cost solution you can also try SWExplorerAutomation (SWEA). The program creates an automation API for any Web application developed with HTML, DHTML or AJAX.
Selenium WebDriver is a good solution: you program a browser and you automate what needs to be done in it. Browsers (Chrome, Firefox, etc.) provide their own drivers that work with Selenium. Since it works as an automated real browser, the pages (including JavaScript and AJAX) get loaded just as they do for a human using that browser.
The downside is that it is slow, since you would most probably want to wait for all images and scripts to load before you do your scraping on that single page.
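For example, a minimal sketch in Python (assuming the selenium package and a matching browser driver are installed; the URL and element id are placeholders, not from this answer):

from selenium import webdriver  # pip install selenium
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
try:
    driver.get("http://example.com/ajax-page")  # placeholder URL
    # Block until the JavaScript-rendered element actually appears.
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "generated-content"))
    )
    print(element.text)
finally:
    driver.quit()

The explicit wait also softens the slowness mentioned above: rather than sleeping a fixed time, you wait only as long as the AJAX content takes to appear.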
I have previously linked to MIT's Solvent and EnvJS as my answers for scraping AJAX pages. These projects seem to be no longer accessible.
Out of sheer necessity, I have invented another way to actually scrape AJAX pages, and it has worked for tough sites like findthecompany, which have ways of detecting headless JavaScript engines and showing them no data.
The technique is to use Chrome extensions for scraping. Chrome extensions are the best place to scrape AJAX pages because they actually give us access to the JavaScript-modified DOM. The technique is as follows; I will certainly open-source the code at some point. Create a Chrome extension (assuming you know how to create one, and its architecture and capabilities; this is easy to learn and practice, as there are lots of samples). Then:
Use content scripts to access the DOM via XPath. Pretty much pull the entire list, table, or dynamically rendered content into a variable as string HTML nodes. (Only content scripts can access the DOM, but they can't contact a URL using XMLHttpRequest.)
From the content script, using message passing, send the entire stripped DOM as a string to a background script. (Background scripts can talk to URLs but can't touch the DOM.) We use message passing to get the two to talk.
You can use various events to loop through web pages and pass each stripped HTML node's content to the background script.
Now use the background script to talk to an external server (on localhost), a simple one created using Node.js or Python; a sketch of such a server follows below. Just send the entire HTML nodes as a string to the server, which persists the posted content into files, with appropriate variables to identify page numbers or URLs.
Now you have scraped the AJAX content (HTML nodes as strings), but these are partial HTML nodes. You can now load them into memory with your favorite XPath library and use XPath to scrape the information into tables or text.
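A minimal sketch of the localhost endpoint described above, using only Python's standard library (the port, file naming, and payload format are all illustrative assumptions):

# Minimal localhost server that persists HTML posted by the extension's
# background script. Port and file naming are illustrative only.
from http.server import BaseHTTPRequestHandler, HTTPServer

class SaveHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        html = self.rfile.read(length)
        # One file per request; a real version would use a page number
        # or URL sent by the extension instead of a plain counter.
        fname = "page_%d.html" % self.server.counter
        self.server.counter += 1
        with open(fname, "wb") as f:
            f.write(html)
        self.send_response(200)
        self.end_headers()

server = HTTPServer(("localhost", 8000), SaveHandler)
server.counter = 0
server.serve_forever()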
Please comment if you can't understand this, and I can write it better (first attempt). Also, I am trying to release sample code as soon as possible.
I have a JSON document with information that I intend for my add-on. I found some code on this forum and tried to modify it, without success. What I want is for the function I will leave below to call this link (https://tugarepo.000webhostapp.com/lib/lib.json) so that I can see its content.
CODE:
return json.loads(openfile('lib.json',path.join('https://tugarepo.000webhostapp.com/lib/lib.json')))
Python Answer
You can use
import urllib2
urllib2.urlopen('https://tugarepo.000webhostapp.com/lib/lib.json').read()
in Python 2.7 to perform a simple GET request on your file. I think you're confusing openfile, which is for local files only, with an HTTP GET request, which is for hosted content. The result of read() can be fed into any JSON library available for your project.
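Putting the two pieces together, a minimal sketch (Python 2.7, standard library only):

import json
import urllib2

# Fetch the hosted JSON over HTTP and parse it into Python objects.
data = json.loads(
    urllib2.urlopen('https://tugarepo.000webhostapp.com/lib/lib.json').read()
)
print data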
Original Answer for Javascript tag
In plain JavaScript, you can use a function like the one explained in the following: HTTP GET request in JavaScript?
If you're using Bootstrap or jQuery, you can use the following: http://api.jquery.com/jquery.getjson/
If you want to see the content on the HTML page (associated with your JavaScript), you'll simply have to grab an element from the page (document.getElementById or document.getElementsByClassName and such). Once you have a DOM element, you can add HTML into it yourself that contains your JSON data.
Example code: https://codepen.io/MrKickkiller/pen/prgVLe
The above code is based on having jQuery linked in your HTML. There is, however, an error, since your link doesn't send Access-Control headers; therefore, currently only requests coming from the tugarepo.000webhostapp.com domain have access to the JSON file. Consider adding CORS headers. https://enable-cors.org/
Simply do:
fetch('https://tugarepo.000webhostapp.com/lib/lib.json')
    .then(function (response) { return response.json(); })
    .then(function (body) { console.log(body); });
But this throws an error as your JSON is invalid.
I am trying to scrape some data from the following website
http://www.pro-football-reference.com/teams/crd/2000_roster.htm
In particular, I want to scrape the data in the roster table. There is a red link named "CSV" at the heading of the table, and if you click on it, the page loads the table information in CSV format. The HTML code of this link is
<span tip="Get a widget to embed this table on your site" class="tooltip" onclick="sr_display_embed(this,'games_played_team'); try { pageTracker._trackEvent('Tool','Action','Embed'); } catch (err) {}">Embed</span>
I assume the function table2csv() is what is being executed. I don't have any experience with web development, so I'm not even sure what this function is; I'm assuming it's Java. I'm looking for some guidance on how I can use BeautifulSoup to automate executing this function and then scrape the text in the HTML parse tree that appears after the function executes. Thank you.
The code that the page executes is JavaScript, more specifically AJAX. I recommend you use Selenium for this work, mainly because it brings up a browser, and with it you can write a program that clicks that link, lets the AJAX call load, and then scrapes the content. This is the more accurate solution. Selenium is available for a lot of languages like Java, C#, Python, etc.
If you don't want to use Selenium, you can instead watch the XHR requests the browser makes and, I think, obtain the CSV directly. You can see these in Chrome by pressing F12 to open the developer tools, or by installing Firebug for Firefox, all in the Network tab.
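A minimal sketch of the Selenium route in Python (the XPath for the CSV toggle and the container holding the generated CSV are assumptions; inspect the live page to find the real ones):

from selenium import webdriver  # pip install selenium

driver = webdriver.Firefox()
try:
    driver.get("http://www.pro-football-reference.com/teams/crd/2000_roster.htm")
    # Hypothetical locator: find the "CSV" toggle by its visible text.
    csv_link = driver.find_element_by_xpath("//span[text()='CSV']")
    csv_link.click()
    # After the click, the CSV text is rendered somewhere in the DOM;
    # "pre" is an assumed container, verify it in the developer tools.
    print(driver.find_element_by_tag_name("pre").text)
finally:
    driver.quit()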
I am not familiar with BeautifulSoup and know very little Python, but I have dabbled in trying to scrape Pro Football Reference in Java with JSoup, and later HtmlUnit...
JSoup, and likely BeautifulSoup (as they are similar, according to my recent Google search), are not designed to invoke JavaScript functions.
Additionally, the page does not make a network request when the CSV link is clicked. Therefore, there is no known URL that can be called to obtain the data in CSV format. The table2csv function in JavaScript creates the CSV data from the HTML table data.
Your best option is to do what the JavaScript table2csv function does: take the table data, obtainable via BeautifulSoup, and parse it directly.
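A minimal sketch of that approach (the table id "games_played_team" is borrowed from the onclick handler quoted in the question; verify the roster table's real id against the page source):

import csv
import sys
import requests  # pip install requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4

url = "http://www.pro-football-reference.com/teams/crd/2000_roster.htm"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# Assumed id, taken from the onclick handler quoted in the question.
table = soup.find("table", id="games_played_team")

writer = csv.writer(sys.stdout)
for row in table.find_all("tr"):
    # Emit each row's header/data cells as one CSV line, mirroring
    # what the page's table2csv JavaScript does client-side.
    writer.writerow([c.get_text(strip=True) for c in row.find_all(["th", "td"])])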
It is an HTML page containing two forms. One of them is generated dynamically by JavaScript when the page is loaded.
So if I try to fetch them, only one form is returned, and the dynamically generated form is not found.
The question is:
how to fetch all the forms, even the ones generated by JS.
As far as I know, Mechanize does not handle JavaScript.
That means you should either generate the form yourself, by reading the JS that creates the form and then "translating" it to Python and inserting it into your script,
or:
Automate an actual browser that does understand JavaScript, using something like Ruby's Watir.
Launch Firefox, use Live HTTP Headers to inspect what the JavaScript does, then imitate that using Mechanize / the relevant HTTP requests.
Use a browser that understands JavaScript, as per the WWW::Mechanize::FAQ: a browser like WWW::Mechanize::Firefox or WWW::Scripter.
I am trying to retrieve query results from AJAX-based sites like www.snapbird.org using Python. Since the results don't show up in the page source, I am not sure how to proceed.
I am a Python newbie, and hence it would be great if I could get a pointer in the right direction.
I am also open to some other approach to the task if that is easier.
This is going to be complex, but as a start: open Firebug and find the URL that gets called when the AJAX request is handled. You can call that directly in your Python program and parse the output.
You could use Selenium's Python client driver to parse the page source. I usually use this in conjunction with PyQuery to make web scraping easier.
Here's the basic tutorial for Selenium's Python driver. Be sure to follow the instructions for Selenium version 2 instead of version 1 (unless you're using version 1 for some reason).
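A minimal sketch of that combination (assuming the selenium and pyquery packages are installed; the URL and the ".result" selector are placeholders):

from selenium import webdriver  # pip install selenium
from pyquery import PyQuery as pq  # pip install pyquery

driver = webdriver.Firefox()
try:
    driver.get("http://www.snapbird.org/")  # run your search here first
    # Hand the fully rendered page source to PyQuery for jQuery-style queries.
    doc = pq(driver.page_source)
    for item in doc(".result"):  # placeholder selector
        print(pq(item).text())
finally:
    driver.quit()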
You could also configure Chrome/Firefox to use an HTTP proxy and then log/extract the necessary content with the proxy. I've tinkered with Python proxies to save/log the requests/content based on content-type or URI globs.
For other projects I've written site-specific JavaScript bookmarklets which poll for new data and then POST it to my server (by dynamically creating both a form and an iframe, and setting myform.target=myiframe).
Other javascript scripts/bookmarklets simulate a user interacting with sites, so instead of polling every few seconds the javascript automates clicking buttons and form submissions, etc. These scripts are always very site-specific of course but they've been hugely useful for me, especially when iterating over all the paginated results for a given search.
Here is a stripped-down version of walking over a list of "paginated" results and preparing to send the data off to my server (which then further parses it with BeautifulSoup). In particular, this was designed for YouTube's Sent/Inbox messages.
var tables = [];
function process_and_repeat() {
    if (!(inbox && inbox.message_pane_ && inbox.message_pane_.innerHTML)) {
        alert("We've got no data!");
        return;
    }
    if (inbox.message_pane_.innerHTML.indexOf('<table') === 0) {
        tables.push(inbox.message_pane_.innerHTML);
        inbox.next_page();
        setTimeout(process_and_repeat, 3000);
    }
    else {
        alert("Finished, [" + tables.length + " processed]");
        document.write('<form action=http://curl.sente.cc method=POST><textarea name=sent.html>' + escape(tables.join('\n')) + '</textarea><input type=submit></form>');
    }
}
process_and_repeat(); // now we wait and watch as all the paginated pages are viewed :)
This is a stripped down example without any fancy iframes/non-essentials which just add complexity.
Adding to what Liam said, Selenium is a great tool, too, which has aided in my various scraping needs. I'd be more than happy to help you out with this if you'd like.
One easy solution might be using a library like Mechanize that acts as a browser: you can browse a site, follow links, and run searches, nearly everything you can do in a browser with a user interface.
But for a very specific job, you may not even need such a library; you can use the urllib and urllib2 Python libraries to make a connection and read the response... You can use Firebug to see the data structure of a search and of the response body. Then use urllib to make a request with the relevant parameters...
With an example...
I made a search with joyvalencia and checked the request URL with Firebug to see:
http://api.twitter.com/1/statuses/user_timeline.json?screen_name=joyvalencia&count=100&page=2&include_rts=true&callback=twitterlib1321017083330
So calling this URL with urllib2.urlopen() is the same as making the query on Snapbird. The response body is:
twitterlib1321017083330([{"id_str":"131548107799396357","place":null,"geo":null,"in_reply_to_user_id_str":null,"coordinates":.......
When you use urlopen() and read the response, the string above is what you get... Then you can use Python's json library to read the data and parse it into a Pythonic data structure...
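Note that the response is JSONP: the JSON is wrapped in a call to the function named by the callback= parameter, and that wrapper has to be stripped before json can parse it. A minimal sketch (Python 2, matching the urllib2 usage above; the URL is the one captured with Firebug):

import json
import urllib2

# The URL captured with Firebug above.
url = ("http://api.twitter.com/1/statuses/user_timeline.json"
       "?screen_name=joyvalencia&count=100&page=2&include_rts=true"
       "&callback=twitterlib1321017083330")
body = urllib2.urlopen(url).read()

# body looks like: twitterlib1321017083330([...]) -- strip the JSONP
# wrapper so only the JSON array between the parentheses remains.
json_text = body[body.index("(") + 1 : body.rindex(")")]
tweets = json.loads(json_text)
print tweets[0]["id_str"]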