Python - BeautifulSoup unable to iterate repetitive blocks

Unsure how to properly word the issue.
I am trying to parse through an HTML document with a tree similar to that of
div(unique-class)
|-a
|-h4
|-div(class-a)
|-div(class-b)
|-div(class-c)
|-p
And so on. I have only listed the items I need; there is a lot of sibling hierarchy, all within one div.
I've been working with BeautifulSoup for the past few hours, and I finally have a working (beta) version of what I'm trying to parse, shown in this example.
from bs4 import BeautifulSoup
import csv

file = "C:\\Python27\\demo.html"
soup = BeautifulSoup(open(file), 'html.parser')

# Pull names, prices, and discounts into parallel lists.
# Class names here match the sample HTML below (underscores, not hyphens).
names = []
pricing = []
discounts = []
for name in soup.find_all('div', attrs={'class': 'unique_class'}):
    names.append(name.h4.text)
for price in soup.find_all('div', attrs={'class': 'class_b'}):
    pricing.append(price.text)
for discount in soup.find_all('div', attrs={'class': 'class_a'}):
    discounts.append(discount.text)

ofile = open('output2.csv', 'wb')  # Python 2: csv writers want binary mode
fieldnames = ['name', 'discountPrice', 'originalPrice']
writer = csv.DictWriter(ofile, fieldnames=fieldnames)
writer.writeheader()
for i in range(len(names)):
    print(names[i], pricing[i], discounts[i])
    writer.writerow({'name': names[i], 'discountPrice': pricing[i], 'originalPrice': discounts[i]})
ofile.close()
As you can tell, this iterates from top to bottom and appends to a distinct array for each field. The issue is that if I'm iterating over, say, 30,000 items and the website can modify itself (say, a scoreboard app on a JS framework), by the time I get to the second iteration the order may have changed. (As I type this I realize this scenario would actually need more variables, since BS 'catches' the website at time of load, but I think the point still stands.)
I believe I need to leverage the next_sibling attribute in BS4, but when I did that I started capturing items I wasn't specifying, because I couldn't apply a 'class' to the sibling.
Update
An additional issue I encountered, when trying a loop within a loop to find the 3 children I need under the unique-class, was that I ended up with the first price listed for all names.
Update - Adding sample HTML
<div class="unique_class">
<h4>World</h4>
<div class="class_b">$1.99</div>
<div class="class_a">$1.99</div>
</div>
<div class="unique_class">
<h4>World2</h4>
<div class="class_b">$2.99</div>
<div class="class_a">$2.99</div>
</div>
<div class="unique_class">
<h4>World3</h4>
<div class="class_b">$3.99</div>
<div class="class_a">$3.99</div>
</div>
<div class="unique_class">
<h4>World4</h4>
<div class="class_b">$4.99</div>
<div class="class_a">$3.99</div>
</div>
I have also found a fix and submitted the answer to be optimized, located at Code Review.
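For reference, the grouping problem described in the update usually comes from calling find_all on the whole soup inside the outer loop instead of searching within each unique_class div. A minimal sketch of the per-block approach, assuming the sample HTML above is saved as demo.html:
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("demo.html"), 'html.parser')
rows = []
for block in soup.find_all('div', attrs={'class': 'unique_class'}):
    # Searching within `block` (not `soup`) keeps each name paired
    # with its own prices instead of the first price on the page.
    rows.append({
        'name': block.h4.text,
        'discountPrice': block.find('div', attrs={'class': 'class_b'}).text,
        'originalPrice': block.find('div', attrs={'class': 'class_a'}).text,
    })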

If the site you are scraping uses JS, you may want to use Selenium and its page_source attribute to extract a snapshot of the page with the JS loaded, which you can then feed into BS.
from selenium import webdriver

driver = webdriver.PhantomJS()  # a headless browser driver
driver.get(url)                 # url: the page you want to scrape
page = driver.page_source
Then you can use BS to parse the JS-loaded page.
If you need to wait for other JS events before grabbing the source, Selenium lets you specify elements or conditions to wait for.
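For example, a minimal sketch of an explicit wait (the element ID here is hypothetical; substitute one from the real page):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for a known element to appear before reading page_source.
# "scoreboard" is a hypothetical ID, for illustration only.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "scoreboard"))
)
page = driver.page_source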

Related

Scrape data and interact with webpage rendered in HTML

I am trying to scrape some data off of a FanGraphs webpage, as well as interact with the page itself. Since there are many buttons and dropdowns on the page to narrow down my search results, I need to be able to find the corresponding elements in the HTML. However, when I tried a 'classic' approach with modules like requests and urllib.request, the portions of the HTML containing the data I need did not appear.
HTML Snippet
Here is a part of the HTML which contains the elements which I need.
<div id="root-season-grid">
<div class="season-grid-wrapper">
<div class="season-grid-title">Season Stat Grid</div>
<div class="season-grid-controls">
<div class="season-grid-controls-button-row">
<div class="fgButton button-green active isActive">Batting</div>
<div class="fgButton button-green">Pitching</div>
<div class="spacer-v-20"></div>
<div class="fgButton button-green active isActive">Normal</div>
<div class="fgButton button-green">Normal & Changes</div>
<div class="fgButton button-green">Year-to-Year Changes</div>
</div>
</div>
</div>
</div>
The full CSS path:
html > body > div#wrapper > div#content > div#root-season-grid > div.season-grid-wrapper > div.season-grid-controls > div.season-grid-controls-button-row
Attempts
requests and bs4
>>> res = requests.get("https://fangraphs.com/leaders/season-stat-grid")
>>> soup = bs4.BeautifulSoup(res.text, features="lxml")
>>> soup.select("#root-season-grid")
[<div id="root-season-grid"></div>]
>>> soup.select(".season-grid-wrapper")
[]
So bs4 was able to find the <div id="root-season-grid"></div> element, but could not find any descendants of that element.
urllib and lxml
>>> res = urllib.request.urlopen("https://fangraphs.com/leaders/season-stat-grid")
>>> parser = lxml.etree.HTMLParser()
>>> tree = lxml.etree.parse(res, parser)
>>> tree.xpath("//div[#id='root-season-grid']")
[<Element div at 0x131e1b3f8c0>]
>>> tree.xpath("//div[#class='season-grid-wrapper']")
[]
Again, no descendants of the div element could be found, this time with lxml.
I started to wonder if I should be passing a different URL to requests.get() and urlopen(), so I created a Selenium remote browser, browser, and passed browser.current_url to both functions. Unfortunately, the results were identical.
selenium
I did notice, however, that selenium.find_element_by_* and selenium.find_elements_by_* were able to find the elements, so I started using those. However, doing so took a lot of memory and was extremely slow.
selenium and bs4
Since selenium.find_element_by_* worked properly, I came up with a very hacky 'solution'. I selected the full HTML by using the "*" CSS selector then passed that to bs4.BeautifulSoup()
>>> browser = selenium.webdriver.Firefox()
>>> html_elem = browser.find_element_by_css_selector("*")
>>> html = html_elem.get_attribute("innerHTML")
>>> soup = bs4.BeautifulSoup(html, features="lxml")
>>> soup.select("#root-season-grid")
[<div id="root-season-grid"><div class="season-grid-wrapper">...</div></div>]
>>> soup.select(".season-grid-wrapper")
[<div class="season-grid-wrapper">...</div>]
So this last attempt was somewhat of a success, as I was able to get the elements I needed. However, after running a bunch of unit tests and a few integration tests for the module, I realized how inconsistent this is.
Problem
After doing a bunch of research, I concluded that Attempts (1) and (2) didn't work, and Attempt (3) is inconsistent, because the table on the page is rendered by JavaScript, along with the buttons and dropdowns. This also explains why the HTML above is not present when you click View Page Source. It seems that when requests.get() and urlopen() are called, the JavaScript is not fully rendered, and whether bs4+selenium works depends on how fast the JavaScript renders. Are there any Python libraries which can render the JavaScript before returning the HTML content?
Hopefully this isn't too long of a question. I tried to condense as far as possible without sacrificing clarity.
Just get the page_source from Selenium and pass it to bs4.
browser.get("https://fangraphs.com/leaders/season-stat-grid")
soup = bs4.BeautifulSoup(browser.page_source, features="lxml")
print(soup.select("#root-season-grid"))
I'd recommend using their API instead, however: https://www.fangraphs.com/api/leaders/season-grid/data?position=B&seasonStart=2011&seasonEnd=2019&stat=WAR&pastMinPt=400&curMinPt=0&mode=normal
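A minimal sketch of that API route, assuming the endpoint returns JSON (the response schema is not documented here, so inspect it before relying on any field names):
import requests

url = ("https://www.fangraphs.com/api/leaders/season-grid/data"
       "?position=B&seasonStart=2011&seasonEnd=2019"
       "&stat=WAR&pastMinPt=400&curMinPt=0&mode=normal")
resp = requests.get(url)
resp.raise_for_status()
data = resp.json()  # assumption: the endpoint returns a JSON payload
print(type(data))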

Scraping webpage with _ngcontent value within different html tags

I am new to scraping and to coding as well. So far I am able to scrape data with Beautiful Soup using the code below:
sub_soup = BeautifulSoup(sub_page, 'html.parser')
content = sub_soup.find('div',class_='detail-view-content')
print(content)
This works correctly when the tag and class are in this format:
<div class="masthead-card masthead-hover">
But it fails when the format includes an _ngcontent attribute:
<span _ngcontent-ixr-c5="" class="btn-trailer-text">
or
<div _ngcontent-wak-c4="" class="col-md-6">
An example screenshot of the _ngcontent webpage I am trying to scrape was attached (not reproduced here).
Everything I tried results in blank output or None. What am I missing?
BeautifulSoup only sees the HTML that arrives before the page finishes loading; Angular renders the _ngcontent elements afterwards. So you should use the Selenium library with ChromeDriver.
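A minimal sketch of that approach (assuming chromedriver is installed and on your PATH; the class name is taken from the question):
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()  # assumption: chromedriver is on your PATH
driver.get(url)              # url: the Angular page you want to scrape

sub_soup = BeautifulSoup(driver.page_source, 'html.parser')
content = sub_soup.find('div', class_='detail-view-content')
print(content)

driver.quit()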

parse page with beautifulsoup

I'm trying to parse this webpage and extract some information:
http://www.tsetmc.com/Loader.aspx?ParTree=151311&i=778253364357513
import requests
from bs4 import BeautifulSoup

page = requests.get("http://www.tsetmc.com/Loader.aspx?ParTree=151311&i=778253364357513")
soup = BeautifulSoup(page.content, 'html.parser')
All_Information = soup.find(id="MainContent")
print(All_Information)
It seems all the information between the tags is hidden. When I run the code, this data is returned:
<div class="tabcontent content" id="MainContent">
<div id="TopBox"></div>
<div id="ThemePlace" style="text-align:center">
<div class="box1 olive tbl z2_4 h250" id="Section_relco" style="display:none"></div>
<div class="box1 silver tbl z2_4 h250" id="Section_history" style="display:none"></div>
<div class="box1 silver tbl z2_4 h250" id="Section_tcsconfirmedorders" style="display:none"></div>
</div>
</div>
Why is the information not there, and how can I find and/or access it?
The information that I assume you are looking for is not loaded in your request. The webpage makes additional requests after it has initially loaded. There are a few ways you can get that information.
You can try Selenium. It is a Python package that simulates a web browser, which allows the page to load all of its information before you try to scrape it.
Another way is to reverse engineer the website and find out where it is getting the information you need.
Have a look at this link.
http://www.tsetmc.com/tsev2/data/instinfofast.aspx?i=778253364357513&c=57+
It is called by your page every few seconds, and it appears to contain all the pricing information you are looking for. It may be easier to call that webpage to get your information.
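A hedged sketch of calling that endpoint directly (the response format is undocumented, so this just fetches the raw payload for inspection):
import requests

url = "http://www.tsetmc.com/tsev2/data/instinfofast.aspx?i=778253364357513&c=57+"
resp = requests.get(url)
resp.raise_for_status()
print(resp.text)  # raw, undocumented payload; inspect it before parsing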

Unable to Scrape Content that comes after a Comment Python BeautifulSoup

I am trying to scrape the tables from the following page:
https://www.baseball-reference.com/boxes/CHA/CHA193805220.shtml
When I reach the html for the batting tables I encounter a very long comment which contains the html for the table
<div id="all_WashingtonSenatorsbatting" class="table_wrapper table_controls">
<div class="section_heading">
<div class="section_heading_text">
<div class="placeholder"></div>
<!--
<div class="table_outer_container">
.....
-->
<div class="table_outer_container mobile_table">
<div class="footer no_hide_long">
The last two divs are what I am interested in scraping, and everything between the <!-- and the --> is a comment which happens to contain a copy of the table in the table_outer_container class below.
The problem is that when I read the page source into Beautiful Soup, it will not read anything after the comment within the table_wrapper class div which contains everything. The following code illustrates the problem:
batting = soup.find('div', {'id': 'all_WashingtonSenatorsbatting'})
divs = batting.find_all('div')
len(divs)
gives me
Out[1]: 3
when there are obviously 5 div children under the div id="all_WashingtonSenatorsbatting" element.
Even when I extract the comment using
from bs4 import Comment
for comments in soup.findAll(text=lambda text: isinstance(text, Comment)):
    comments.extract()
The resulting soup still doesn't contain the last two div elements I want to scrape. I am trying to play with the code using regular expressions but so far no luck, any suggestions?
I found a workable solution. Using the following code, I extract the comment (which brings with it the last two div elements I wanted to scrape), process it again in BeautifulSoup, and scrape the table:
import requests
from bs4 import BeautifulSoup, Comment

s = requests.get(url).content  # url: the box score page above
soup = BeautifulSoup(s, "html.parser")
wrapper = soup.find_all('div', {'class': 'table_wrapper'})[0]
# Calling a Tag is shorthand for find_all(); grab the first comment node
comment = wrapper(text=lambda x: isinstance(x, Comment))[0]
newsoup = BeautifulSoup(comment, 'html.parser')
table = newsoup.find('table')
It took me a while to get to this, and I would be interested to see if anyone comes up with other solutions or can offer an explanation of how this problem came to be.
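For comparison, one alternative sketch is to re-parse every comment on the page and keep any tables found inside them, rather than hard-coding which wrapper comes first (same idea, stated generically):
import requests
from bs4 import BeautifulSoup, Comment

s = requests.get("https://www.baseball-reference.com/boxes/CHA/CHA193805220.shtml").content
soup = BeautifulSoup(s, "html.parser")

tables = []
for comment in soup.find_all(text=lambda x: isinstance(x, Comment)):
    inner = BeautifulSoup(comment, "html.parser")  # re-parse the comment body
    tables.extend(inner.find_all("table"))
print(len(tables))  # all tables that were hidden inside comments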

BS4 Scraping Hidden Content

I have been stuck on this for a while... I am trying to scrape the player name and projection from this site: https://www.fantasysportsco.com/Projections/Sport/MLB/Site/DraftKings/PID/793
The script is going to loop through the past by just going through all the PIDs in a range, but that isn't the problem. The main problem is that when I inspect the element, I find the value is stored within this class:
<div class="salarybox expanded"...
which is located in the 5th position of my projectionsView list.
The scraper finds the projectionsView class fine but can't find anything within it.
When I go to view the actual HTML of the site, it seems this content just doesn't exist within it:
<div id="salData" class="projectionsView">
<!-- Fill in with Salary Data -->
</div>
I'm super new to scraping and have successfully scraped everything else I need for my project, just not this damn site... I think it may be because I have to sign up for the site? But either way, the information is viewable without signing in, so I figured I didn't need to use Selenium, and even if I did, I don't think that would find it.
Anyway here's the code I have so far that is obviously returning a blank list.
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq

url = "https://www.fantasysportsco.com/Projections/Sport/MLB/Site/DraftKings/PID/793"
uClient = uReq(url)
page_read = uClient.read()
uClient.close()

page_soup = soup(page_read, "html.parser")
salarybox = page_soup.findAll("div", {"class": "projectionsView"})
print(salarybox[4].findAll("div", {"class": "salarybox expanded"}))
Any ideas would be greatly appreciated!
The whole idea of the script is to just find the ppText of each "salarybox expanded" class on each page. I just want to know how to find these elements. Perhaps a different parser?
On your URL's page, the <div id="salData" class="projectionsView"> is rewritten by JavaScript, but urllib.request only returns the initial HTML response, before any scripts run, which means the JavaScript-generated content will not be in it. Hence the div is empty:
<div id="salData" class="projectionsView">
<!-- Fill in with Salary Data -->
</div>
You had better try Selenium; Splash will also work for this kind of dynamic website.
By the way, after you get the right response, select the div by id; it will be more specific:
salarybox = page_soup.find("div",{"id":"salData"})
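A hedged sketch combining both suggestions, Selenium to render the page and an id lookup afterwards (the driver setup is an assumption):
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()  # assumption: chromedriver is on your PATH
driver.get("https://www.fantasysportsco.com/Projections/Sport/MLB/Site/DraftKings/PID/793")

page_soup = BeautifulSoup(driver.page_source, "html.parser")
salData = page_soup.find("div", {"id": "salData"})
for box in salData.find_all("div", {"class": "salarybox expanded"}):
    print(box.text)  # the question's "ppText" value presumably lives in here

driver.quit()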
