python beautifulsoup iframe document html extract

I am trying to learn a bit of Beautiful Soup and to get some HTML data out of some iframes, but I have not been very successful so far.
Parsing the iframe tag itself does not seem to be a problem with BS4, but whatever I do, I cannot get at the embedded content.
For example, consider the below iFrame (this is what I see on chrome developer tools):
<iframe frameborder="0" marginwidth="0" marginheight="0" scrolling="NO"
src="http://www.engineeringmaterials.com/boron/728x90.html" width="728" height="90">
#document <html>....</html></iframe>
where, <html>...</html> is the content I am interested in extracting.
However, when I use the following BS4 code:
iFrames = []  # quick bs4 example
for iframe in soup("iframe"):
    iFrames.append(iframe.extract())  # extract the current iframe, not soup.iframe (which is always the first one)
I get:
<iframe frameborder="0" marginwidth="0" marginheight="0" scrolling="NO" src="http://www.engineeringmaterials.com/boron/728x90.html" width="728" height="90">
In other words, I get the iFrames without the document <html>...</html> within them.
I tried something along the lines of:
iFrames = []  # quick bs4 example
iframexx = soup.find_all('iframe')
for iframe in iframexx:
    print iframe.find_all('html')
... but this does not seem to work.
So, I guess my question is, how do I reliably extract these document objects <html>...</html> from the iFrame elements.

Browsers load the iframe content in a separate request. You'll have to do the same:
for iframe in iframexx:
    response = urllib2.urlopen(iframe.attrs['src'])
    iframe_soup = BeautifulSoup(response)
Remember: BeautifulSoup is not a browser; it won't fetch images, CSS and JavaScript resources for you either.
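For reference, a Python 3 sketch of the same idea (urllib2 became urllib.request; the sample page HTML below is made up to mirror the question):

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

def iframe_sources(soup):
    """Return the src URL of every iframe in the parsed page."""
    return [tag["src"] for tag in soup.find_all("iframe") if tag.get("src")]

def fetch_iframe_soup(url):
    """Fetch one iframe document in its own request, as a browser would."""
    with urlopen(url) as response:
        return BeautifulSoup(response.read(), "html.parser")

page_html = '''<html><body>
<iframe src="http://www.engineeringmaterials.com/boron/728x90.html"
        width="728" height="90"></iframe>
</body></html>'''

soup = BeautifulSoup(page_html, "html.parser")
urls = iframe_sources(soup)
# Calling fetch_iframe_soup(url) for each entry in urls yields the full
# <html>...</html> document that the browser shows under "#document".
```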

Related

Cannot find href on a page

I am trying to find the url for the trailer video from this page. https://www.binged.com/streaming-premiere-dates/black-monday/.
I checked the various properties of the div class="wordkeeper-video", I cannot find it. Can someone help?
Go ahead and play it. Then an element like the following appears; the link is in the src attribute:
<iframe frameborder="0" allowfullscreen="" allow="autoplay" src="https://www.youtube.com/embed/pzxGR6Q-7Mc?rel=0&showinfo=0&autoplay=1"></iframe>
PS: It is in div class="wordkeeper-video"
The video href is not initially present there.
You first need to click the play button (actually the image); after that, the href appears inside the iframe.
The iframe can be located with the selector .wordkeeper-video iframe.
So you have to switch to the iframe and then extract its src attribute.
The full URL isn't there but all you need to build it is.
<div class="wordkeeper-video " data-type="youtube" data-embed="pzxGR6Q-7Mc" ...>
The data-embed attribute has what you need.
The URL is https://www.youtube.com/watch?v=pzxGR6Q-7Mc, where pzxGR6Q-7Mc is the data-embed value.
You can get this by using
data_embed = driver.find_element_by_css_selector(".wordkeeper-video").get_attribute("data-embed")
video_url = "https://www.youtube.com/watch?v=" + data_embed

Scrape data and interact with webpage rendered in HTML

I am trying to scrape some data off of a FanGraphs webpage as well as interact with the page itself. Since there are many buttons and dropdowns on the page to narrow down my search results, I need to be able to find the corresponding elements in the HTML. However, when I tried a 'classic' approach with modules like requests and urllib.request, the portions of the HTML containing the data I need did not appear.
HTML Snippet
Here is a part of the HTML which contains the elements which I need.
<div id="root-season-grid">
<div class="season-grid-wrapper">
<div class="season-grid-title">Season Stat Grid</div>
<div class="season-grid-controls">
<div class="season-grid-controls-button-row">
<div class="fgButton button-green active isActive">Batting</div>
<div class="fgButton button-green">Pitching</div>
<div class="spacer-v-20"></div>
<div class="fgButton button-green active isActive">Normal</div>
<div class="fgButton button-green">Normal & Changes</div>
<div class="fgButton button-green">Year-to-Year Changes</div>
</div>
</div>
</div>
</div>
The full CSS path:
html > body > div#wrapper > div#content > div#root-season-grid > div.season-grid-wrapper > div.season-grid-controls > div.season-grid-controls-button-row
Attempts
requests and bs4
>>> res = requests.get("https://fangraphs.com/leaders/season-stat-grid")
>>> soup = bs4.BeautifulSoup(res.text, features="lxml")
>>> soup.select("#root-season-grid")
[<div id="root-season-grid"></div>]
>>> soup.select(".season-grid-wrapper")
[]
So bs4 was able to find the <div id="root-season-grid"></div> element, but could not find any descendants of that element.
urllib and lxml
>>> res = urllib.request.urlopen("https://fangraphs.com/leaders/season-stat-grid")
>>> parser = lxml.etree.HTMLParser()
>>> tree = lxml.etree.parse(res, parser)
>>> tree.xpath("//div[@id='root-season-grid']")
[<Element div at 0x131e1b3f8c0>]
>>> tree.xpath("//div[@class='season-grid-wrapper']")
[]
Again, no descendants of the div element could be found, this time with lxml.
I started to wonder if I should be using a different URL to pass to requests.get() and urlopen(), so I created a selenium remote browser, browser, and passed browser.current_url to both functions. Unfortunately, the results were identical.
selenium
I did notice, however, that selenium's find_element_by_* and find_elements_by_* methods were able to find the elements, so I started using those. However, doing so took a lot of memory and was extremely slow.
selenium and bs4
Since selenium.find_element_by_* worked properly, I came up with a very hacky 'solution'. I selected the full HTML by using the "*" CSS selector then passed that to bs4.BeautifulSoup()
>>> browser = selenium.webdriver.Firefox()
>>> html_elem = browser.find_element_by_css_selector("*")
>>> html = html_elem.get_attribute("innerHTML")
>>> soup = bs4.BeautifulSoup(html, features="lxml")
>>> soup.select("#root-season-grid")
[<div id="root-season-grid"><div class="season-grid-wrapper">...</div></div>]
>>> soup.select(".season-grid-wrapper")
[<div class="season-grid-wrapper">...</div>]
So this last attempt was somewhat of a success, as I was able to get the elements I needed. However, after running a bunch of unit tests and a few integration tests for the module, I realized how inconsistent this approach is.
Problem
After doing a bunch of research, I concluded the reason why Attempts (1) and (2) didn't work and why Attempt (3) is inconsistent is because the table in the page is rendered by JavaScript, along with the buttons and dropdowns. This also explains why the HTML above is not present when you click View Page Source. It seems that, when requests.get() and urlopen() are called, the JavaScript is not fully rendered, and whether bs4+selenium works depends on how fast the JavaScript renders. Are there any Python libraries which can render the JavaScript before returning the HTML content?
Hopefully this isn't too long of a question. I tried to condense as far as possible without sacrificing clarity.
Just get the page_source from Selenium and pass it to bs4.
browser.get("https://fangraphs.com/leaders/season-stat-grid")
soup = bs4.BeautifulSoup(browser.page_source, features="lxml")
print(soup.select("#root-season-grid"))
I'd recommend using their API instead, however: https://www.fangraphs.com/api/leaders/season-grid/data?position=B&seasonStart=2011&seasonEnd=2019&stat=WAR&pastMinPt=400&curMinPt=0&mode=normal
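The query parameters in that API URL can be built programmatically, which makes it easy to vary seasons or stats. A sketch (the parameter names come straight from the URL above; whether the endpoint returns JSON is an assumption, so the actual fetch is left commented out):

```python
from urllib.parse import urlencode

# Parameters taken from the API URL quoted above; adjust as needed.
BASE = "https://www.fangraphs.com/api/leaders/season-grid/data"
params = {
    "position": "B",
    "seasonStart": 2011,
    "seasonEnd": 2019,
    "stat": "WAR",
    "pastMinPt": 400,
    "curMinPt": 0,
    "mode": "normal",
}
url = BASE + "?" + urlencode(params)

# Fetching requires network access (and the third-party `requests` package):
# import requests
# data = requests.get(url).json()  # assuming the endpoint returns JSON
```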

merging two html strings into one, using python

I'm trying to understand if there's a relatively simple way to take an HTML string and "insert" it inside a different HTML string. I tried converting the HTML into a simple div and putting it in the first HTML, but that didn't work and caused weird failures.
Some more info: I'm creating a report using bokeh and have some figures. My code creates figures and appends them to a list, which is eventually parsed into HTML and saved on my PC. What I want to do is read a different HTML string and append it entirely into my report.
You can do that with BeautifulSoup. See this example:
from bs4 import BeautifulSoup, NavigableString
soup = BeautifulSoup("<html><body><p>my paragraph</p></body></html>", "html.parser")
body = soup.find("body")
new_tag = soup.new_tag("a", href="http://www.example.com")
body.append(new_tag)
another_new_tag = soup.new_tag("p")
another_new_tag.insert(0, NavigableString("bla bla, and more bla"))
body.append(another_new_tag)
print(soup.prettify())
The result is:
<html>
<body>
<p>
my paragraph
</p>
<a href="http://www.example.com">
</a>
<p>
bla bla, and more bla
</p>
</body>
</html>
So what I was looking for, and what solves my problem, is just using an iframe with the srcdoc attribute:
iframe = '<iframe srcdoc="%s"></iframe>' % raw_html
and then I can push this iframe into the original HTML wherever I want.
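One caveat worth adding to that answer: if raw_html itself contains double quotes or ampersands, interpolating it straight into the double-quoted srcdoc attribute breaks the markup. Escaping it first is safer (the sample fragment below is made up):

```python
from html import escape

raw_html = '<p class="note">Figures &amp; tables</p>'

# escape() converts &, <, > and (by default) double quotes, so the fragment
# can sit safely inside a double-quoted srcdoc="..." attribute.
iframe = '<iframe srcdoc="%s"></iframe>' % escape(raw_html)
```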

Extracting particular text

I am trying to extract all links to videos on a particular WordPress website. Each page has only one video.
Inside each page crawled, there is the following code:
<p><script src="https://www.vooplayer.com/v3/watch/video.js"></script>
<iframe id="" voo-auto-adj="true" name="vooplayerframe" style="max-width:100%" allowtransparency="true" allowfullscreen="true" src="//www.vooplayer.com/v3/watch/watch.php?v=123456;clearVars=1" frameborder="0" scrolling="no" width="660" height="410" >
</iframe></p>
I would like to extract the URL from the iframe's src attribute.
Google Chrome Inspector tells me that this can be addressed as:
XPath: //*[@id="post-255"]/div/p/iframe
Selector: #post-255 > div > p > iframe
But each webpage I am crawling has a different "post" number. They are quite random, hence I cannot easily use the aforementioned selectors.
If there is a dynamic part inside the id attribute, you can address it by partial-matching:
[id^=post] > div > p > iframe
where ^= means "starts with".
XPath alternative:
//*[starts-with(@id, "post")]/div/p/iframe
See also if you can avoid checking for div and p intermediate elements altogether and do:
[id^=post] iframe
//*[starts-with(@id, "post")]//iframe
You may additionally check for the iframe name as well:
[id^=post] iframe[name=vooplayerframe]
//*[starts-with(@id, "post")]//iframe[@name = "vooplayerframe"]
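As a quick illustration, the partial-match selector works directly with BeautifulSoup's .select_one() as well (the HTML below is a made-up fragment mimicking the page):

```python
from bs4 import BeautifulSoup

html = '''<div id="post-255"><div><p>
<iframe name="vooplayerframe"
        src="//www.vooplayer.com/v3/watch/watch.php?v=123456"></iframe>
</p></div></div>'''

soup = BeautifulSoup(html, "html.parser")

# "starts with" attribute match, independent of the random post number
iframe = soup.select_one('[id^=post] iframe[name=vooplayerframe]')
src = iframe["src"]
```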

Get Iframe Src content using Selenium Python

I have added to an html file: maintenance.html an iframe:
<iframe name="iframe_name" src="maintenance_state.txt" frameborder="0" height="40" allowtransparency="allowtransparency" width="800" align="middle" ></iframe>
And I want to get the content of the src file maintenance_state.txt using Python and Selenium.
I'm locating the iframe element using:
maintain = driver.find_element_by_name("iframe_name")
However maintain.text is returning an empty value.
How can I get the text written in maintenance_state.txt file.
Thanks for your help.
As some sites' scripts stop the iframe from working properly if it's loaded as the main document, it's also worth knowing how to read the iframe's source without needing to issue a separate driver.get for its URL:
driver.switch_to.frame(driver.find_element_by_name("iframe_name"))
print(driver.page_source)
driver.switch_to.default_content()
The last line is needed only if you want to be able to do something else with the page afterwards.
You can get the src element, navigate to it and get the page_source:
from urllib.parse import urljoin  # on Python 2 this was urlparse.urljoin

src = driver.find_element_by_name("iframe_name").get_attribute("src")
url = urljoin(base_url, src)  # base_url is the URL of the page containing the iframe
driver.get(url)
print(driver.page_source)
