Using BeautifulSoup to call a JavaScript function - python

I am trying to scrape some data from the following website
http://www.pro-football-reference.com/teams/crd/2000_roster.htm
In particular, I want to scrape the data in the roster table. There is a red link at the heading of the table named "CSV", and if you click it, the page loads the table information in CSV format. The HTML code of this link is:
<span tip="Get a widget to embed this table on your site" class="tooltip" onclick="sr_display_embed(this,'games_played_team'); try { pageTracker._trackEvent('Tool','Action','Embed'); } catch (err) {}">Embed</span>
I assume the function table2csv() is what is being executed. I don't have any experience with web development, so I'm not even sure what this function is; I'm assuming it's Java. I'm looking for some guidance on how I can use BeautifulSoup to automate executing this function and then scrape the text in the HTML parse tree that appears after the function executes. Thank you.

The code that the page executes is JavaScript, more specifically an AJAX call. I recommend you use Selenium for this, mainly because it drives a real browser: your program can click the link, let the AJAX call complete, and then scrape the resulting content. This is the most reliable solution. Selenium is available for many languages, such as Java, C#, and Python.
If you don't want to use Selenium, you can instead watch the XHR (XMLHttpRequest) the browser makes and fetch the CSV directly. You can see these requests in Chrome's developer tools (press F12) or with Firebug for Firefox, under the Network tab.
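For example, here is a minimal Selenium sketch of that approach in Python. The XPath for the CSV toggle is a guess, since the exact markup isn't shown in the question; adjust it against the live page.

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("http://www.pro-football-reference.com/teams/crd/2000_roster.htm")

# Click the control that runs the page's table2csv() JavaScript.
# The XPath is an assumption; inspect the live page to confirm it.
driver.find_element(By.XPATH, "//span[text()='CSV']").click()

# The click rewrites part of the DOM; grab the rendered source and
# hand it to BeautifulSoup, or read the CSV text element directly.
html = driver.page_source
driver.quit()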

I am not familiar with BeautifulSoup and know very little Python, but I have dabbled in scraping Pro-Football-Reference in Java, first with JSoup and later with HtmlUnit...
JSoup, and likely BeautifulSoup (they are similar, according to my recent Google search), are not designed to invoke JavaScript functions.
Additionally, the page does not make a network request when the CSV link is clicked. Therefore, there is no URL you can call to obtain the data in CSV format. The table2csv function in JavaScript builds the CSV from the HTML table data already on the page.
Your best option is to do what the JavaScript table2csv function does: take the table data, obtainable via BeautifulSoup, and parse it directly, as in the sketch below.
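A minimal sketch of that approach, assuming the roster table carries the id seen in the onclick handler quoted above ('games_played_team'); verify the real id against the live page. Note also that some sports-reference pages wrap tables in HTML comments, in which case find() returns None and you would need to extract the comment text first.

import csv
import io

import requests
from bs4 import BeautifulSoup

url = "http://www.pro-football-reference.com/teams/crd/2000_roster.htm"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# The table id is an assumption taken from the onclick handler above.
table = soup.find("table", id="games_played_team")

# Re-create what table2csv() does: one CSV row per table row.
out = io.StringIO()
writer = csv.writer(out)
for row in table.find_all("tr"):
    cells = row.find_all(["th", "td"])
    writer.writerow(cell.get_text(strip=True) for cell in cells)
print(out.getvalue())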

Related

While webscraping a website in python I don't get the expected response but just a script tag containing a few lines of code

I'm trying to scrape data from a site in python. The payload is right and everything works, but the response from the site, which would normally be the source code of the HTML page, is instead just a script tag with some code in it. Here is the response I get:
b'<script language="JavaScript">\nerr = "";\nlargeur = 1024;\nif (screen.width>largeur) { document.location.href="accueil.php?" +err;\t}\nelse { document.location.href="m.accueil.php?largeur=" +screen.width +\'&\' +err;\t}\n</script>'
Information:
After looking at the site, it seems to use Google Analytics. I don't really know much about it, but maybe because of some preview mechanism it won't load the page, since I'm not accessing it with a browser.
What tool are you using to webscrape? Tools like Beautiful Soup parse pre-loaded HTML content; if a website uses client-side rendering and JavaScript to load content, HTML parsers often will not work.
You can instead use an automated browser that interacts with a website just as a regular user would. These automated browsers can operate with or without a GUI; run without one (also known as a headless browser), they take less time and fewer resources than when run with a GUI. Here's a fairly exhaustive list of headless browsers you can use. Note that not all are compatible with Python.
As Buran mentioned in the comments, Selenium is an option. Selenium is very well documented and has a large community following, so it's easy to find helpful articles and tutorials. It supports multiple drivers, so it can run different browsers (Firefox, Chrome, etc.), both headless and with a GUI; see the sketch below.
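For instance, here is a minimal sketch of running Selenium headless; the URL is a placeholder.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")  # run Chrome without a GUI

driver = webdriver.Chrome(options=options)
driver.get("https://example.com")  # placeholder URL

# page_source holds the DOM *after* client-side rendering, so it can
# now be handed to an HTML parser such as Beautiful Soup.
html = driver.page_source
driver.quit()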

How to read an HTML page that takes some time to load? [duplicate]

I am trying to scrape a web site using python and beautiful soup. I encountered sites where image links, although visible in the browser, cannot be seen in the source code; using Chrome Inspect or Fiddler, however, we can see the corresponding markup.
What I see in the source code is:
<div id="cntnt"></div>
But in Chrome Inspect, I can see a whole bunch of HTML/CSS code generated within this div. Is there a way to load the generated content in python as well? I am using the regular urllib in python and I am able to get the source, but without the generated part.
I am not a web developer, hence I am not able to express the behaviour in better terms. Please feel free to ask if my question seems vague!
You need a JavaScript engine to parse and run the JavaScript code inside the page.
There are a bunch of headless browsers that can help you:
http://code.google.com/p/spynner/
http://phantomjs.org/
http://zombie.labnotes.org/
http://github.com/ryanpetrello/python-zombie
http://jeanphix.me/Ghost.py/
http://webscraping.com/blog/Scraping-JavaScript-webpages-with-webkit/
The content of the website may be generated after load via JavaScript. In order to obtain the generated content via python, refer to this answer.
A regular scraper gets just the HTML document. To get any content generated by JavaScript logic, you instead need a headless browser that also builds the DOM and loads and runs the scripts, like a regular browser would. The Wikipedia article and some other pages on the net have lists of those and their capabilities.
Keep in mind when choosing that some previously major products in this space are now abandoned.
TRY THIS FIRST!
Perhaps the data technically could be in the JavaScript itself, and all this JavaScript-engine business is needed. (Some GREAT links here!)
But from experience, my first guess is that the JS is pulling the data in via an AJAX request. If you can get your program to simulate that, you'll probably get everything you need handed right to you, without any tedious parsing/executing/scraping involved!
It will take a little detective work, though. I suggest turning on your network traffic logger (such as the "Web Developer Toolbar" in Firefox) and then visiting the site. Focus your attention on any and all XmlHTTPRequests. The data you need should be found somewhere in one of these responses, probably in the middle of some JSON text.
Now, see if you can re-create that request and get the data directly, as in the sketch below. (NOTE: You may have to set the User-Agent of your request so the server thinks you're a "real" web browser.)
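A sketch of replaying such a request with the requests library; the endpoint and query parameters here are hypothetical, so copy the real ones from the request shown in your network log.

import requests

headers = {
    # Some servers reject requests without a browser-like User-Agent.
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
}
resp = requests.get(
    "https://example.com/api/data",  # hypothetical endpoint
    params={"page": 1},              # hypothetical query parameters
    headers=headers,
)
resp.raise_for_status()
data = resp.json()  # most XHR responses carry JSON
print(data)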

How to scrape a value from a page that loads dynamically?

The homepage of the website I'm trying to scrape displays four tabs, one of which reads "[Number] Available Jobs". I'm interested in scraping the [Number] value. When I inspect the page in Chrome, I can see the value enclosed within a <span> tag.
However, there is nothing enclosed in that <span> tag when I view the page source directly. I was planning on using the Python requests module to make an HTTP GET request and then use regex to capture the value from the returned content. This is obviously not possible if the content doesn't contain the number I need.
My questions are:
1. What is happening here? How can a value be dynamically loaded into a page, displayed, and then not appear within the HTML source?
2. If the value doesn't appear in the page source, what can I do to reach it?
If the content doesn't appear in the page source, then it is probably generated using JavaScript. For example, the site might have a REST API that lists jobs, and its JavaScript code could request the jobs from that API, create the node in the DOM, and attach it to the available-jobs tab. That's just one possibility.
One way to scrape this information is to figure out how that JavaScript works and make your python scraper do the same thing (for example, if there is a simple REST API it is using, you just need to make a request to that same URL). Often that is not so easy, so another alternative is to do your scraping using a JavaScript-capable browser like Selenium.
One final thing worth mentioning: regular expressions are a fragile way to parse HTML; you should generally prefer a library like BeautifulSoup, as in the sketch below.
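As a small illustration, once you have HTML that actually contains the value (for example, fetched via Selenium), a parser beats a regex; the markup and class name here are invented for the example.

from bs4 import BeautifulSoup

html = '<div><span class="available-jobs">42 Available Jobs</span></div>'
soup = BeautifulSoup(html, "html.parser")

# Select the <span> by its (invented) class and pull the number out.
span = soup.find("span", class_="available-jobs")
number = int(span.get_text(strip=True).split()[0])
print(number)  # 42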
1. A value can be loaded dynamically with AJAX. AJAX loads asynchronously, which means the rest of the site does not wait for it to render; that's why the elements loaded with AJAX do not appear when you fetch the initial DOM.
2. For scraping dynamic content you should use Selenium; here is a tutorial.
For data that loads dynamically, you should look for an XHR request in the network tab, and if you can make that data work for you, then voilà!
You can also use PhantomJS; it's a headless browser, and it captures the HTML of the page with the dynamically loaded content included.

Webscraping Financial Data from Morningstar

I am trying to scrape data from the morningstar website below:
http://financials.morningstar.com/ratios/r.html?t=IBM&region=USA&culture=en_US
I am currently trying just IBM, but I hope eventually to be able to type in the ticker of another company and do the same with that one. My code so far is below:
import requests, bs4

url = 'http://financials.morningstar.com/ratios/r.html?t=IBM&region=USA&culture=en_US'

page = requests.get(url)
soup = bs4.BeautifulSoup(page.content, "html.parser")

# Look for the ratios tables inside the page body
summary = soup.find("div", {"class": "r_bodywrap"})
tables = summary.find_all('table')
print(tables[0])
The problem I am experiencing at the moment is that, unlike with a simpler webpage I have scraped before, the program can't seem to locate any tables, even though I can see them in the HTML for the page in the browser.
In researching this problem, the closest Stack Overflow question I found is below:
Python webscraping - NoneObeject Failure - broken HTML?
In that question, they explained that Morningstar's tables are dynamically loaded, and they used some JSON code I am unfamiliar with that somehow generated a different web link which managed to scrape the data, but I don't understand where it came from.
Scraping some modern web pages is a real problem, particularly pages generated by single-page applications, where the content is built by AJAX calls and DOM modification rather than delivered as ready-to-go HTML in a single server response.
The best way I have found to access such content is to use the Selenium web testing environment to have a browser load the page under the control of my program, then extract the page contents from Selenium for scraping. There are other environments that will execute the scripts and modify the DOM appropriately, but I haven't used any of them.
It's not as difficult as it sounds, but it will take a little jiggering around to get there. In outline, it looks like the sketch below.
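A rough sketch, reusing the class name from the question's own code and adding an explicit wait for the dynamically built table:

import bs4
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("http://financials.morningstar.com/ratios/r.html?t=IBM&region=USA&culture=en_US")

# Wait until the page's scripts have actually built a table.
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "div.r_bodywrap table"))
)

# Hand the rendered DOM to BeautifulSoup, as in the original code.
soup = bs4.BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

tables = soup.find("div", {"class": "r_bodywrap"}).find_all("table")
print(tables[0])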
Web scraping can be greatly simplified when the site offers an API, be it officially supported or just an unofficial hack. Even the hack is better than trying to fiddle with the HTML, which can change every day.
So a search for morningstar api might be fruitful. And, in fact, some friendly Gister has already worked this out for you.
Should the search come up empty, a usually fruitful approach is to investigate what AJAX calls the page makes to retrieve data and then issue them directly, as sketched below. This can be done via the browser's debugger, in the "Network" tab or similar, where each request can be inspected in detail in a very friendly UI.
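A sketch of that direct approach; the endpoint path below is a placeholder to be replaced with whatever the Network tab shows while the ratios page loads.

import requests

ticker = "IBM"
# Placeholder endpoint: copy the real path from the browser's Network tab.
url = "http://financials.morningstar.com/ajax/SOME_ENDPOINT"
resp = requests.get(url, params={"t": ticker, "region": "USA"})
resp.raise_for_status()

# Such endpoints usually return JSON or CSV rather than HTML.
print(resp.text[:500])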
I've found scraping dynamic sites to be a lot easier with JavaScript than with Python + Selenium. There is a great module for Node.js/PhantomJS, ScraperJS. It is very easy to use: it injects jQuery into the scraped page, and you can extract data with jQuery selectors.

Handling responses with Python bindings for Selenium

So I have a webpage with some JavaScript that gets executed when a link is clicked. This JavaScript opens a new window and calls some other JavaScript, which requests an XML document that it then parses for a URL to pass to a video player. How can I get that XML response using Selenium?
Short answer: unless the XML is posted to the page, you can't. Long answer: you can use Selenium to do JS injection on the page so that the XML document is copied into some hidden page element you can inspect, or stored to a local file you can open; a sketch of the injection idea follows. This assumes, of course, that the XML document is actually retrieved client-side; if this all happens server-side, you'll need to integrate with the backend or emulate the call yourself. One last option to explore would be to proxy the browser Selenium is driving and then inspect the traffic for the response containing the XML. Though more complicated, that could actually be argued to be the best solution, since you aren't modifying the system under test in order to test it.
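Here is a rough sketch of the injection idea. All selectors and the URL are invented, and since the question's JavaScript opens a new window, you would first switch to it with driver.switch_to.window(driver.window_handles[-1]) and inject there before the request fires.

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com/player-page")  # placeholder URL

# Patch XMLHttpRequest so every response body is copied into a hidden
# element that Selenium can read afterwards.
driver.execute_script("""
    var stash = document.createElement('div');
    stash.id = 'xhr-stash';
    stash.style.display = 'none';
    document.body.appendChild(stash);
    var origOpen = XMLHttpRequest.prototype.open;
    XMLHttpRequest.prototype.open = function() {
        this.addEventListener('load', function() {
            stash.textContent += this.responseText;
        });
        return origOpen.apply(this, arguments);
    };
""")

# Trigger the link that spawns the XML request (placeholder selector),
# then read back whatever was stashed.
# (In practice, add an explicit wait for the response to arrive.)
driver.find_element(By.LINK_TEXT, "Play video").click()
xml_text = driver.find_element(By.ID, "xhr-stash").get_attribute("textContent")
print(xml_text)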
