Scrape data and interact with webpage rendered in HTML - python

I am trying to scrape some data off of a FanGraphs webpage, as well as interact with the page itself. Since there are many buttons and dropdowns on the page to narrow down my search results, I need to be able to find the corresponding elements in the HTML. However, when I tried a 'classic' approach with modules like requests and urllib.request, the portions of the HTML containing the data I need did not appear.
HTML Snippet
Here is the part of the HTML that contains the elements I need.
<div id="root-season-grid">
<div class="season-grid-wrapper">
<div class="season-grid-title">Season Stat Grid</div>
<div class="season-grid-controls">
<div class="season-grid-controls-button-row">
<div class="fgButton button-green active isActive">Batting</div>
<div class="fgButton button-green">Pitching</div>
<div class="spacer-v-20"></div>
<div class="fgButton button-green active isActive">Normal</div>
<div class="fgButton button-green">Normal & Changes</div>
<div class="fgButton button-green">Year-to-Year Changes</div>
</div>
</div>
</div>
</div>
</div>
The full CSS path:
html > body > div#wrapper > div#content > div#root-season-grid > div.season-grid-wrapper > div.season-grid-controls > div.season-grid-controls-button-row
Attempts
requests and bs4
>>> res = requests.get("https://fangraphs.com/leaders/season-stat-grid")
>>> soup = bs4.BeautifulSoup(res.text, features="lxml")
>>> soup.select("#root-season-grid")
[<div id="root-season-grid"></div>]
>>> soup.select(".season-grid-wrapper")
[]
So bs4 was able to find the <div id="root-season-grid"></div> element, but could not find any descendants of that element.
urllib and lxml
>>> res = urllib.request.urlopen("https://fangraphs.com/leaders/season-stat-grid")
>>> parser = lxml.etree.HTMLParser()
>>> tree = lxml.etree.parse(res, parser)
>>> tree.xpath("//div[#id='root-season-grid']")
[<Element div at 0x131e1b3f8c0>]
>>> tree.xpath("//div[#class='season-grid-wrapper']")
[]
Again, no descendants of the div element could be found, this time with lxml.
I started to wonder if I should be passing a different URL to requests.get() and urlopen(), so I created a Selenium remote browser, browser, and passed browser.current_url to both functions. Unfortunately, the results were identical.
selenium
I did notice, however, that the selenium find_element_by_* and find_elements_by_* methods were able to find the elements, so I started using those. Unfortunately, doing so used a lot of memory and was extremely slow.
selenium and bs4
Since find_element_by_* worked properly, I came up with a very hacky 'solution': I selected the full HTML using the "*" CSS selector, then passed that to bs4.BeautifulSoup().
>>> browser = selenium.webdriver.Firefox()
>>> html_elem = browser.find_element_by_css_selector("*")
>>> html = html_elem.get_attribute("innerHTML")
>>> soup = bs4.BeautifulSoup(html, features="lxml")
>>> soup.select("#root-season-grid")
[<div id="root-season-grid"><div class="season-grid-wrapper">...</div></div>]
>>> soup.select(".season-grid-wrapper")
[<div class="season-grid-wrapper">...</div>]
So this last attempt was somewhat of a success, as I was able to get the elements I needed. However, after running a bunch of unit tests and a few integration tests for the module, I realized how inconsistent this approach is.
Problem
After doing a bunch of research, I concluded that the reason Attempts (1) and (2) didn't work, and the reason Attempt (3) is inconsistent, is that the table in the page is rendered by JavaScript, along with the buttons and dropdowns. This also explains why the HTML above is not present when you click View Page Source. It seems that, when requests.get() and urlopen() are called, the JavaScript has not yet rendered, and whether bs4+selenium works depends on how quickly it renders. Are there any Python libraries which can render the JavaScript before returning the HTML content?
Hopefully this isn't too long of a question. I tried to condense it as much as possible without sacrificing clarity.

Just get the page_source from Selenium and pass it to bs4.
browser.get("https://fangraphs.com/leaders/season-stat-grid")
soup = bs4.BeautifulSoup(browser.page_source, features="lxml")
print(soup.select("#root-season-grid"))
I'd recommend using their API instead, however: https://www.fangraphs.com/api/leaders/season-grid/data?position=B&seasonStart=2011&seasonEnd=2019&stat=WAR&pastMinPt=400&curMinPt=0&mode=normal
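For example, a minimal sketch of querying that endpoint directly with requests (the shape of the JSON payload is undocumented, so inspect it before relying on any field names):
import requests

url = ("https://www.fangraphs.com/api/leaders/season-grid/data"
       "?position=B&seasonStart=2011&seasonEnd=2019"
       "&stat=WAR&pastMinPt=400&curMinPt=0&mode=normal")
res = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
res.raise_for_status()
data = res.json()  # the endpoint returns JSON, so no HTML parsing is needed
print(type(data))  # inspect the top-level structure before digging further
No browser automation, and no waiting on JavaScript to render.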

Related

Scraping webpage with _ngcontent value within different html tags

I am new to scraping and coding as well. So far I am able to scrape data with Beautiful Soup using the code below:
sub_soup = BeautifulSoup(sub_page, 'html.parser')
content = sub_soup.find('div',class_='detail-view-content')
print(content)
This works correctly when the tag and class are in this format:
<div class="masthead-card masthead-hover">
But it fails when the format includes _ngcontent:
<span _ngcontent-ixr-c5="" class="btn-trailer-text">
or
<div _ngcontent-wak-c4="" class="col-md-6">
An example screenshot of a _ngcontent webpage I am trying to scrape is below (screenshot omitted).
Everything I tried results in blank output or None. What am I missing?
BeautifulSoup only sees the HTML returned by the initial request; the _ngcontent-* attributes are added by Angular when the page renders in the browser, so they never appear in that response. You should use the Selenium library with ChromeDriver to let the JavaScript render before you parse.
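Here is a minimal sketch, assuming ChromeDriver is installed and reusing the find() call from the question (the URL is a placeholder):
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()  # requires ChromeDriver on your PATH
driver.get("https://example.com/your-page")  # placeholder: the page you are scraping
driver.implicitly_wait(10)  # give Angular time to render the _ngcontent elements
sub_soup = BeautifulSoup(driver.page_source, 'html.parser')
content = sub_soup.find('div', class_='detail-view-content')
print(content)
driver.quit()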

parse page with beautifulsoup

I'm trying to parse this webpage and extract some of the information:
http://www.tsetmc.com/Loader.aspx?ParTree=151311&i=778253364357513
import requests
page = requests.get("http://www.tsetmc.com/Loader.aspx?ParTree=151311&i=778253364357513")
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')
All_Information = soup.find(id="MainContent")
print(All_Information)
It seems all the information inside the tag is hidden. When I run the code, this is the data that is returned:
<div class="tabcontent content" id="MainContent">
<div id="TopBox"></div>
<div id="ThemePlace" style="text-align:center">
<div class="box1 olive tbl z2_4 h250" id="Section_relco" style="display:none"></div>
<div class="box1 silver tbl z2_4 h250" id="Section_history" style="display:none"></div>
<div class="box1 silver tbl z2_4 h250" id="Section_tcsconfirmedorders" style="display:none"></div>
</div>
</div>
Why is the information not there, and how can I find and/or access it?
The information that I assume you are looking for is not loaded in your request. The webpage makes additional requests after it has initially loaded. There are a few ways you can get that information.
You can try selenium. It is a python package that simulates a web browser. This allows the page to load all the information before you try to scrape.
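For example, a minimal sketch of that approach, assuming ChromeDriver is installed (the find(id="MainContent") lookup is the one from the question):
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://www.tsetmc.com/Loader.aspx?ParTree=151311&i=778253364357513")
driver.implicitly_wait(10)  # let the page's JavaScript populate MainContent
soup = BeautifulSoup(driver.page_source, 'html.parser')
print(soup.find(id="MainContent"))
driver.quit()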
Another way is to reverse engineer the website and find out where it is getting the information you need.
Have a look at this link.
http://www.tsetmc.com/tsev2/data/instinfofast.aspx?i=778253364357513&c=57+
It is called by your page every few seconds, and it appears to contain all the pricing information you are looking for. It may be easier to call that webpage to get your information.
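A minimal sketch of calling it with requests (the payload format is undocumented, so print the raw text first and work out the fields yourself):
import requests

# the endpoint the page polls for live data, taken from the link above
url = "http://www.tsetmc.com/tsev2/data/instinfofast.aspx?i=778253364357513&c=57+"
res = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
print(res.text)  # inspect the raw payload to work out its delimiter format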

BeautifulSoup picking up div objects in some download requests but not others

I'm using python's bs4 module to parse HTML. However, I've run into a peculiar bug.
When downloading pages' HTML for parsing, I've noticed BS4 will recognize div objects in some pages but not others, even though the specific object I'm referencing is present in both and the paths are the same.
e.g.
<div class = "item" data-year = "19-20">
<div class = "irrelevant">...</div>
<div class = "irrelevant">...</div>
<div class = "stats-grids">...</div>
<div class = "irrelevant">...</div>
</div>
I've done some digging and have frequently seen it proposed that something like this can be caused by JavaScript use in the webpage, with the content not showing up in the downloaded HTML. However, I believe this is not true in this case, because BS4 correctly identifies the path in other instances where the code remains unchanged.
When using...
import requests, bs4

res = requests.get('https://examplesite.com')  # placeholder URL
soup = bs4.BeautifulSoup(res.text, 'html.parser')
element = soup.select('div[data-year="19-20"] > div[class="stats-grids"]')
For some pages from the same website, element is correct. Other times, it can find div[data-year="19-20"] and div[class="stats-grids"] independently of one another, but not when I specify that one is the child of the other. In other words, the element is there, but it doesn't show up when I specify that stats-grids is within the data-year div.
This may occur due to the site having incorrect HTML (for example a tag is not closed).
Try using html5lib. It will attempt to create a well-formed HTML document by adding additional tags.
Install it using
pip install html5lib
and specify it in the BeautifulSoup constructor
soup = bs4.BeautifulSoup(res.text, 'html5lib')
Ref:
Differences between parsers
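For a quick illustration, here is a deliberately unclosed snippet, loosely modeled on the question's markup (the strings are made up):
import bs4

# deliberately unclosed divs, loosely modeled on the question's markup
broken = '<div data-year="19-20"><div class="stats-grids">stats'

soup = bs4.BeautifulSoup(broken, 'html5lib')  # html5lib closes the open tags
print(soup.select('div[data-year="19-20"] > div[class="stats-grids"]'))
# [<div class="stats-grids">stats</div>]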

HTML source while webscraping seems inconsistent for website

For example, I checked out:
https://www.calix.com/search-results.html?searchKeyword=C7
And if I inspect element on the first link I get this:
<a class="title viewDoc"
href="https://www.calix.com/content/dam/calix/mycalix-
misc/ed-svcs/learning_paths/C7_lp.pdf" data-
preview="/session/4e14b237-f19b-47dd-9bb5-d34cc4c4ce01/"
data-preview-count="1" target="_blank"><i class="fa fa-file-
pdf-o grn"></i><b>C7</b> Learning Path</a>
I coded:
import requests, bs4

res = requests.get('https://www.calix.com/search-results.html?searchKeyword=C7',
                   headers={'User-Agent': 'test'})
print(res)
#res.raise_for_status()
bs_obj = bs4.BeautifulSoup(res.text, "html.parser")
elems = bs_obj.findAll('a', attrs={"class": "title viewDoc"})
print(elems)
And the output was [] (an empty list).
So, I thought about actually looking through the "view-source" for the page.
view-source:https://www.calix.com/search-results.html?searchKeyword=C7
If you search through the view-source, you will not find the markup from the "inspect element" snippet I mentioned earlier.
There is no a class="title viewDoc" element in the view-source of the page.
That is probably why my code isn't returning anything.
Then I went to www.nba.com and inspected a link:
<a class="content_list--item clearfix" href="/article/2018/07/07/demarcus-cousins-discusses-stacked-golden-state-warriors-roster"><h5 class="content_list--title">Cousins on Warriors' potential: 'Scary'</h5><time class="content_list--time">in 5 hours</time></a>
The content of "inspect" for this link was in the "view-source" of the page, and, obviously, my code was working for this page.
I have seen a few other examples of the first issue. I'm just curious why the HTML differs between the two sites, or am I missing something?

Beautiful Soup not recognizing Button Tag

I'm currently experimenting with Beautiful Soup 4 in Python 2.7.6
Right now, I have a simple script to scrape Soundcloud.com. I'm trying to print out the number of button tags on the page, but I'm not getting the answer I expect.
from bs4 import BeautifulSoup
import requests
page = requests.get('http://soundcloud.com/sondersc/waterfalls-sonder')
data = page.text
soup = BeautifulSoup(data)
buttons = soup.findAll('button')
print 'num buttons =', len(buttons)
When I run this, I get the output
num buttons = 0
This confuses me. I know for a fact that the button tags exist on this page so it shouldn't be returning 0. Upon inspecting the button elements directly underneath the waveform, I find these...
<button class="sc-button sc-button-like sc-button-medium sc-button-responsive" tabindex="0" title="Like">Like</button>
<button class="sc-button sc-button-medium sc-button-responsive sc-button-addtoset" tabindex="0" title="Add to playlist">Add to playlist</button>
<button class="sc-button sc-button-medium sc-button-responsive sc-button-addtogroup" tabindex="0" title="Add to group">Add to group</button>
<button class="sc-button sc-button-share sc-button-medium sc-button-responsive" title="Share" tabindex="0">Share</button>
At first I thought that the way I was trying to find the button elements was incorrect. However, if I modify my code to scrape an arbitrary youtube page...
page = requests.get('http://www.youtube.com/watch?v=UiyDmqO59QE')
then I get the output
num buttons = 37
So that means soup.findAll('button') is doing what it's supposed to, just not on Soundcloud.
I've also tried specifying the exact button I want, expecting a result of 1:
buttons = soup.findAll('button', class_='sc-button sc-button-like sc-button-medium sc-button-responsive')
print 'num buttons =', len(buttons)
but it still returns 0.
I'm kind of stumped on this one. Can anyone explain why this is?
The reason you cannot get the buttons is that there are no button tags inside the html you are getting:
>>> import requests
>>> page = requests.get('http://soundcloud.com/sondersc/waterfalls-sonder')
>>> data = page.text
>>> '<button' in data
False
This means that there is much more involved in forming the page: AJAX requests, JavaScript function calls, etc.
Also, note that soundcloud provides an API - there is no need to crawl HTML pages of the site. There is also a python wrapper around the Soundcloud API available.
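A minimal sketch with the wrapper, assuming you have registered an app for a client_id (the /resolve endpoint maps a track URL to its API resource):
import soundcloud  # pip install soundcloud

# the client_id is a placeholder; register a SoundCloud app to get one
client = soundcloud.Client(client_id='YOUR_CLIENT_ID')
track = client.get('/resolve', url='http://soundcloud.com/sondersc/waterfalls-sonder')
print(track.title)  # resource attributes mirror the API's JSON fields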
Also, be careful about web scraping; study the Terms of Use:
You must not employ scraping or similar techniques to aggregate,
repurpose, republish or otherwise make use of any Content.
