So I have a webpage with some JavaScript that executes when a link is clicked. This JavaScript opens a new window and calls some other JavaScript, which requests an XML document that it then parses for a URL to pass to a video player. How can I get that XML response using Selenium?
Short answer: unless the XML is posted to the page, you can't. Long answer: you can use Selenium to do JS injection on the page so that the XML document is replicated to some hidden page element you can inspect, or stored to a local file you can open. This assumes, of course, that the XML document is actually retrieved client side; if this is all server side, you'll need to integrate with the backend or emulate the call yourself. One last option to explore would be to proxy the browser Selenium is driving, then inspect the traffic for the response containing the XML. Though more complicated, that could be argued to be the best solution, since you aren't modifying the system under test in order to test it.
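A minimal sketch of the JS-injection approach, assuming the XML is fetched client side via XMLHttpRequest; the URL, link text, and element id below are placeholders, not from the original page:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Firefox()
driver.get("http://example.com/video-page")  # placeholder URL

# Patch XMLHttpRequest so each response body is mirrored into a hidden div.
# If the request happens in the new window, switch to that window and
# inject the same hook there before triggering the click.
driver.execute_script("""
    var sink = document.createElement('div');
    sink.id = 'xhr-sink';
    sink.style.display = 'none';
    document.body.appendChild(sink);
    var origOpen = XMLHttpRequest.prototype.open;
    XMLHttpRequest.prototype.open = function() {
        this.addEventListener('load', function() {
            sink.textContent = this.responseText;
        });
        return origOpen.apply(this, arguments);
    };
""")

driver.find_element(By.LINK_TEXT, "Play video").click()  # placeholder link text

# Wait for the hook to capture a response, then read it back out.
xml_text = WebDriverWait(driver, 10).until(
    lambda d: d.execute_script(
        "return document.getElementById('xhr-sink').textContent;"))
print(xml_text)
```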
I am trying to scrape a web site using Python and Beautiful Soup. I have encountered sites where the image links, although visible in the browser, cannot be seen in the source code. However, using Chrome Inspect or Fiddler, we can see the corresponding code.
What I see in the source code is:
<div id="cntnt"></div>
But in Chrome Inspect, I can see a whole bunch of HTML/CSS code generated within this div. Is there a way to load the generated content within Python as well? I am using the regular urllib in Python and I am able to get the source, but without the generated part.
I am not a web developer, hence I am not able to express the behaviour in better terms. Please feel free to ask for clarification if my question seems vague!
You need a JavaScript engine to parse and run the JavaScript code inside the page.
There are a bunch of headless browsers that can help you (a minimal example follows the list):
http://code.google.com/p/spynner/
http://phantomjs.org/
http://zombie.labnotes.org/
http://github.com/ryanpetrello/python-zombie
http://jeanphix.me/Ghost.py/
http://webscraping.com/blog/Scraping-JavaScript-webpages-with-webkit/
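For a rough illustration of the headless-browser route, here is a sketch using Selenium driving headless Firefox, then handing the generated DOM to BeautifulSoup; the URL is a placeholder, and the div id comes from the question:

```python
# Render the page with a real JS engine, then parse the resulting DOM.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument("--headless")
driver = webdriver.Firefox(options=options)
driver.get("http://example.com")  # placeholder URL

# page_source now contains the DOM *after* the scripts have run.
soup = BeautifulSoup(driver.page_source, "html.parser")
div = soup.find("div", id="cntnt")
print(div.decode_contents() if div else "div not found")
driver.quit()
```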
The content of the website may be generated after load via JavaScript. In order to obtain the generated content with Python, refer to this answer.
A regular scraper gets just the HTML document. To get any content generated by JavaScript logic, you instead need a headless browser that would also generate the DOM and load and run the scripts, like a regular browser would. The Wikipedia article and some other pages on the Net have lists of those and their capabilities.
Keep in mind when choosing that some formerly major products in this space are now abandoned.
TRY THIS FIRST!
Perhaps the data technically could be embedded in the JavaScript itself, and all this JavaScript-engine business really is needed. (Some GREAT links here!)
But from experience, my first guess is that the JS is pulling the data in via an AJAX request. If you can get your program to simulate that, you'll probably get everything you need handed right to you without any tedious parsing/executing/scraping involved!
It will take a little detective work though. I suggest turning on your network traffic logger (such as "Web Developer Toolbar" in Firefox) and then visiting the site. Focus your attention on any and all XMLHttpRequests. The data you need should be found somewhere in one of these responses, probably in the middle of some JSON text.
Now, see if you can re-create that request and get the data directly. (NOTE: You may have to set the User-Agent of your request so the server thinks you're a "real" web browser.)
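A hedged sketch of replaying the AJAX call directly; the endpoint, parameters, and response shape are placeholders, so find the real ones in your network log:

```python
import requests

headers = {
    # Some servers only answer clients that look like a real browser.
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0",
    "X-Requested-With": "XMLHttpRequest",  # many sites tag AJAX calls this way
}
resp = requests.get("http://example.com/api/data",  # placeholder endpoint
                    params={"page": 1}, headers=headers)
resp.raise_for_status()
print(resp.json())  # usually JSON; inspect the actual response to be sure
```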
Can we use Scrapy for getting content from a web page which is loaded by Javascript?
I'm trying to scrape usage examples from this page, but since they are loaded using JavaScript as a JSON object, I'm not able to get them with Scrapy.
Could you suggest what is the best way to deal with such issues?
Open your browser's developer tools and look at the Network tab. If you hit the "next" button on that page enough times, it'll send out a new request each time.
After removing the JSONP parameter, the URL is pretty straightforward:
https://corpus.vocabulary.com/api/1.0/examples.json?query=unalienable&maxResults=24&startOffset=24&filter=0
By making the minimal number of requests, your spider will be fast.
If you want to just emulate a full browser and execute the JavaScript, you can use something like Selenium or Scrapinghub's Splash (and its corresponding Scrapy plugin).
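A minimal Scrapy sketch of the direct-API approach, using the URL above; the JSON layout and paging are assumptions to verify against a real response:

```python
import json
import scrapy

class ExamplesSpider(scrapy.Spider):
    name = "vocabulary_examples"
    url = ("https://corpus.vocabulary.com/api/1.0/examples.json"
           "?query=unalienable&maxResults=24&startOffset={offset}&filter=0")

    def start_requests(self):
        yield scrapy.Request(self.url.format(offset=0))

    def parse(self, response):
        data = json.loads(response.text)
        # The exact JSON structure is an assumption; inspect a real response.
        for sentence in data.get("result", {}).get("sentences", []):
            yield {"example": sentence.get("sentence")}
```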
I have a script (inside <script></script> tags) that is executed every time I load a page. Is it possible to remove a WebElement before the page is loaded in the WebDriver, to prevent that script from executing?
I am thinking of something in the lines of:
Somehow get the raw HTML code (perhaps via get source or something), remove the offending part (with Selenium or a parser), "inject" the edited code back into Selenium (Firefox WebDriver or maybe PhantomJS), and finally execute it for all pages on that website.
Is it possible to do that or is this perhaps impossible by design?
If you install selenium-requests, you can make a GET request for the page, process the HTML/etc. that is loaded, and then place the result in the tab.
It might be tricky to insert the processed result, since you will likely also need to set the current browser URL to match (simply inserting it will cause issues with cross-domain loading of scripts, relative paths, etc.). Perhaps there is a way of overriding (or allowing overriding of) the 'get' response that Selenium receives with the pre-processed information.
Selenium-Requests makes a request using the requests library that uses the running webdriver's cookies for that domain and emulates the default HTTP headers sent by that webdriver. The result is a low-level HTTP request and response created with the webdriver's state. This is needed because the Selenium interface is very high-level, and doing much more than opening pages and exploring the DOM is not really natively possible in Python.
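A sketch of that workflow with selenium-requests; the URL and the script being stripped are placeholders, and loading the edited HTML through a data: URL is an illustrative workaround with the cross-domain/relative-path caveats noted above:

```python
from urllib.parse import quote
from seleniumrequests import Firefox  # pip install selenium-requests

driver = Firefox()
# Fetch the raw HTML with the webdriver's cookies and default headers.
response = driver.request("GET", "http://example.com/page")  # placeholder URL

# Strip the offending script before the browser ever executes it.
html = response.text.replace("<script>badCode();</script>", "")  # placeholder

# Load the edited document; note this changes the page's effective origin.
driver.get("data:text/html;charset=utf-8," + quote(html))
```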
The homepage of the website I'm trying to scrape displays four tabs, one of which reads "[Number] Available Jobs". I'm interested in scraping the [Number] value. When I inspect the page in Chrome, I can see the value enclosed within a <span> tag.
However, there is nothing enclosed in that <span> tag when I view the page source directly. I was planning on using the Python requests module to make an HTTP GET request and then use regex to capture the value from the returned content. This is obviously not possible if the content doesn't contain the number I need.
My questions are:
1. What is happening here? How can a value be dynamically loaded into a page, displayed, and then not appear within the HTML source?
2. If the value doesn't appear in the page source, what can I do to reach it?
If the content doesn't appear in the page source, then it is probably generated using JavaScript. For example, the site might have a REST API that lists jobs, and the JavaScript code could request the jobs from the API and use the result to create the DOM node showing the available jobs. That's just one possibility.
One way to scrape this information is to figure out how that JavaScript works and make your Python scraper do the same thing (for example, if there is a simple REST API it is using, you just need to make a request to that same URL). Often that is not so easy, so another alternative is to do your scraping using a JavaScript-capable browser like Selenium.
One final thing I want to mention: regular expressions are a fragile way to parse HTML; you should generally prefer to use a library like BeautifulSoup.
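To make that last point concrete, here is a tiny contrast between a regex and BeautifulSoup on invented markup resembling the tab described above:

```python
import re
from bs4 import BeautifulSoup

html = '<div class="tab"><span id="jobCount">128</span> Available Jobs</div>'

# Fragile: breaks if attribute order, whitespace, or nesting changes.
match = re.search(r'<span id="jobCount">(\d+)</span>', html)

# Robust: works regardless of formatting details.
soup = BeautifulSoup(html, "html.parser")
count = soup.find("span", id="jobCount").get_text()

print(match.group(1), count)  # -> 128 128
```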
1. A value can be loaded dynamically with AJAX. AJAX loads asynchronously, which means the rest of the site does not wait for the AJAX content to be rendered; that's why, when you get the DOM, the elements loaded with AJAX do not appear in it.
2. For scraping dynamic content you should use Selenium; here is a tutorial.
For data that loads dynamically, you should look for an XHR request in the network tab, and if you can make that data productive for you, then voila!
You can use PhantomJS; it's a headless browser, and it captures the HTML of the page with the dynamically loaded content.
I am trying to scrape some data from the following website
http://www.pro-football-reference.com/teams/crd/2000_roster.htm
In particular, I want to scrape the data in the roster table. There is a red link at the heading of the table named "CSV", and if you click on it, the page loads the table information in CSV format. The HTML code of this link is:
<span tip="Get a widget to embed this table on your site" class="tooltip" onclick="sr_display_embed(this,'games_played_team'); try { pageTracker._trackEvent('Tool','Action','Embed'); } catch (err) {}">Embed</span>
I assume the function table2csv() is what is being executed. I don't have any experience with web development, so I'm not even sure what this function is; I'm assuming it's Java. I'm looking for some guidance on how I can use BeautifulSoup to automate executing this function and then scrape the text in the HTML parse tree that appears after the function executes. Thank you.
The code that the page executes is JavaScript, more specifically AJAX. I recommend you use Selenium to make this work, mainly because it brings up a browser, and with it you can write a program that clicks the link, triggers the call, and then scrapes the content. This is the more accurate solution. Selenium is available for a lot of languages like Java, C#, Python, etc.
If you don't want to use Selenium, you can instead look at the XHR requests the browser makes and obtain the CSV directly, I think. You can see these in Chrome by pressing F12 to open the developer tools, or by installing Firebug for Firefox; it's all in the Network tab.
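A sketch of the Selenium route; the selectors are assumptions to check against the live page (as a later answer notes, the CSV is generated in-page rather than fetched over the network):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("http://www.pro-football-reference.com/teams/crd/2000_roster.htm")

# Click the "CSV" control; table2csv() rewrites the table as CSV text in-page.
driver.find_element(By.XPATH, "//span[text()='CSV']").click()

# Read back whatever element now holds the CSV (assumed here to be a <pre>).
print(driver.find_element(By.TAG_NAME, "pre").text)
driver.quit()
```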
I am not familiar with BeautifulSoup and know very little Python, but I have dabbled in trying to scrape pro-football-reference in Java with JSoup, and then later HtmlUnit...
JSoup, and likely BeautifulSoup (as they are similar, according to my recent Google search), are not designed to invoke JavaScript functions.
Additionally, the page does not invoke a network request when the CSV link is clicked. Therefore, there is no known URL that can be invoked to obtain the data in CSV format. The table2csv function in JavaScript creates the CSV data from the HTML table data.
Your best option is to do as the JavaScript table2csv function does: take the table data, obtainable via BeautifulSoup, and parse it directly.
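A sketch of that approach, replicating what table2csv() does: fetch the page, walk the table rows with BeautifulSoup, and emit CSV. The table id comes from the onclick handler quoted in the question; verify it against the actual page:

```python
import csv
import sys

import requests
from bs4 import BeautifulSoup

url = "http://www.pro-football-reference.com/teams/crd/2000_roster.htm"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# Table id taken from the onclick snippet above; confirm on the real page.
table = soup.find("table", id="games_played_team")
writer = csv.writer(sys.stdout)
for row in table.find_all("tr"):
    cells = [c.get_text(strip=True) for c in row.find_all(["th", "td"])]
    if cells:
        writer.writerow(cells)
```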